本篇博文主要内容为 2025-06-10 从 Arxiv.org 论文网站获取的最新论文列表,自动更新,按照 NLP、CV、ML、AI、IR 五个大方向区分。若需要邮件定时接收,请在评论区留下你的邮箱地址。
说明:每日论文数据从 Arxiv.org 获取,每天中午 12:00 左右定时自动更新。
友情提示:如果您需要通过邮箱接收每日论文数据,请在评论处留下你的邮箱地址。
目录
概览 (2025-06-10)
今日共更新1030篇论文(同一篇论文可同时归入多个类目,故以下各方向篇数之和大于总数),其中:
- 自然语言处理共193篇(Computation and Language (cs.CL))
- 人工智能共371篇(Artificial Intelligence (cs.AI))
- 计算机视觉共255篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共357篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] Play to Generalize: Learning to Reason Through Game Play
【速读】: 该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨领域推理能力上的泛化性不足问题。其解决方案的关键在于提出一种新的后训练范式——视觉游戏学习(Visual Game Learning, ViGaL),通过让MLLMs在类似街机游戏的环境中进行强化学习,从而提升其多模态推理的泛化能力。实验表明,这种基于规则的合成游戏可作为可控且可扩展的前置任务(pretext task),在训练中不接触任何解题步骤、公式或图表的情况下,有效增强模型在多模态数学基准和跨学科问题上的表现。
链接: https://arxiv.org/abs/2506.08011
作者: Yunfei Xie,Yinsong Ma,Shiyi Lan,Alan Yuille,Junfei Xiao,Chen Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Project Page: this https URL
Abstract:Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g. Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model’s performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning abilities in MLLMs.
zh
[NLP-1] Reinforcement Pre-Training
【速读】: 该论文试图解决传统大规模语言模型预训练中依赖领域特定标注答案的问题,以及如何有效利用海量文本数据提升强化学习(Reinforcement Learning, RL)的通用性。其解决方案的关键在于引入强化预训练(Reinforcement Pre-Training, RPT),将下一个词预测任务重新定义为一个通过强化学习进行训练的推理任务,通过可验证的奖励机制激励模型正确预测下一个词,从而提升语言建模精度并为后续强化微调提供强大的预训练基础。
链接: https://arxiv.org/abs/2506.08007
作者: Qingxiu Dong,Li Dong,Yao Tang,Tianzhu Ye,Yutao Sun,Zhifang Sui,Furu Wei
机构: Microsoft Research (微软研究院); Peking University (北京大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
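下面用一段极简的 PyTorch 示意代码说明 RPT 的核心思想——把下一个词预测当作带可验证奖励的 RL 任务(奖励规则与损失形式均为本文为说明而作的假设,并非论文官方实现):

```python
import torch
import torch.nn.functional as F

def rpt_reward(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """可验证奖励:模型贪心预测与语料中真实下一个词一致则奖励 1,否则 0。
    logits: [batch, vocab];target_ids: [batch]
    """
    pred_ids = logits.argmax(dim=-1)
    return (pred_ids == target_ids).float()

def rpt_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """REINFORCE 风格的策略梯度损失:用奖励加权所选 token 的对数似然。"""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    reward = rpt_reward(logits, target_ids)  # 比较结果不回传梯度,仅作权重
    return -(reward * chosen).mean()
```

由于奖励可直接由语料本身验证,这一范式无需领域特定的标注答案即可在海量文本上进行通用强化学习。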
zh
[NLP-2] Reparameterized LLM Training via Orthogonal Equivalence Transformation
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练过程中有效性和可靠性不足的问题。其解决方案的关键在于提出POET,一种基于正交等价变换(Orthogonal Equivalence Transformation)的重新参数化训练算法,通过为每个神经元引入两个可学习的正交矩阵和一个固定随机权重矩阵,确保权重矩阵的谱特性得以保持,从而实现对目标函数的稳定优化并提升模型的泛化能力。
链接: https://arxiv.org/abs/2506.08001
作者: Zeju Qiu,Simon Buchholz,Tim Z. Xiao,Maximilian Dax,Bernhard Schölkopf,Weiyang Liu
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Technical report v1 (36 pages, 24 figures, project page: this https URL )
Abstract:While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field’s most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks. Extensive experiments validate the effectiveness and scalability of POET in training LLMs.
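摘要所述的重参数化(每个权重由两个可学习正交矩阵与一个固定随机矩阵组成)可以用如下按层的 PyTorch 草图示意,即 W = R · W₀ · P;正交变换不改变 W₀ 的奇异值,谱特性因此得以保持(仅为示意,省略了论文中的高效近似):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class POETLinear(nn.Module):
    """POET 式重参数化线性层示意:weight = R @ W0 @ P。"""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        # 固定随机权重,不参与训练
        self.register_buffer("w0", torch.randn(d_out, d_in) / d_in ** 0.5)
        self.R = orthogonal(nn.Linear(d_out, d_out, bias=False))  # 可学习左正交因子
        self.P = orthogonal(nn.Linear(d_in, d_in, bias=False))    # 可学习右正交因子

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.R.weight @ self.w0 @ self.P.weight
        return x @ weight.T
```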
zh
[NLP-3] τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
【速读】: 该论文试图解决现有对话式人工智能(Conversational AI)评估基准无法模拟用户与智能体共同控制环境的问题,这与现实场景如技术支持中用户需要主动参与修改共享环境状态的情况不符。解决方案的关键在于提出 τ²-Bench,其核心包括:1) 一个建模为Dec-POMDP的电信双控制领域,其中智能体和用户共同使用工具在共享动态环境中行动,测试智能体的协作与通信能力;2) 一种组合任务生成器,能够从原子组件程序化生成多样且可验证的任务,确保领域覆盖和可控复杂度;3) 一个与环境紧密耦合的可靠用户模拟器,其行为由工具和可观测状态约束,提升仿真保真度;4) 通过多种消融实验对智能体性能进行细粒度分析,区分推理错误与沟通/协调错误。
链接: https://arxiv.org/abs/2506.07982
作者: Victor Barres,Honghua Dong,Soham Ray,Xujie Si,Karthik Narasimhan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios like technical support, where users need to actively participate in modifying the state of the (shared) world. In order to address this gap, we introduce τ²-bench, with four key contributions: 1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both agent and user make use of tools to act in a shared, dynamic environment that tests both agent coordination and communication, 2) A compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity, 3) A reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity, 4) Fine-grained analysis of agent performance through multiple ablations including separating errors arising from reasoning vs communication/coordination. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control, highlighting the challenges of guiding users. Overall, τ²-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
zh
[NLP-4] HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
【速读】: 该论文试图解决当前评估大型语言模型(Large Language Models, LLMs)在组合优化问题中生成启发式算法能力的方法存在不足的问题,具体表现为现有基准测试要么依赖于容易饱和和记忆的封闭式问题,要么依赖于缺乏一致性和严谨性的主观比较。解决方案的关键是提出HeuriGym,这是一个代理框架,旨在评估由LLMs生成的启发式算法,其特点包括明确的目标定义和广阔的解空间,该框架使LLMs能够提出启发式方法、通过代码执行获得评估反馈,并迭代优化解决方案。
链接: https://arxiv.org/abs/2506.07972
作者: Hongzheng Chen,Yingheng Wang,Yaohui Cai,Hins Hu,Jiajie Li,Shirley Huang,Chenhui Deng,Rongjian Liang,Shufeng Kong,Haoxing Ren,Samitha Samaranayake,Carla P. Gomes,Zhiru Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
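摘要中描述的“提出—执行—反馈—迭代”评测循环大致如下(Python 示意,llm_propose 与 run_and_score 为假设的接口,非框架官方 API):

```python
def heurigym_loop(llm_propose, run_and_score, rounds: int = 5):
    """HeuriGym 式代理评测循环示意:
    llm_propose(feedback) -> code           由 LLM 生成启发式算法代码
    run_and_score(code) -> (ok, score, log) 沙箱执行,返回可行性、解质量与日志
    """
    feedback, best_score, best_code = None, float("-inf"), None
    for _ in range(rounds):
        code = llm_propose(feedback)
        ok, score, log = run_and_score(code)
        if ok and score > best_score:
            best_score, best_code = score, code
        feedback = log  # 执行反馈作为下一轮迭代改进的依据
    return best_score, best_code
```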
zh
[NLP-5] Reinforcing Multimodal Understanding and Generation with Dual Self-rewards
【速读】: 该论文试图解决大型多模态模型(LMMs)在图像-文本对齐上的不足,即模型生成的文本响应与视觉输入矛盾或未能遵循文本到图像的提示。其解决方案的关键在于引入一种自监督的双奖励机制,通过将理解和生成视为逆向对偶任务,利用模型在单一任务域中生成的多个输出,反转输入-输出对以计算模型的双重似然作为优化的自奖励,从而无需外部监督即可有效提升模型性能,尤其在文本到图像任务中表现显著。
链接: https://arxiv.org/abs/2506.07963
作者: Jixiang Hong,Yiran Zhang,Guanzhong Wang,Yi Liu,Ji-Rong Wen,Rui Yan
机构: Renmin University of China (中国人民大学); Baidu Inc. (百度公司); UCAS (中国科学院大学); Wuhan University (武汉大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Building upon large language models (LLMs), recent large multimodal models (LMMs) unify cross-modal understanding and generation into a single framework. However, LMMs still struggle to achieve accurate image-text alignment, prone to generating text responses contradicting the visual input or failing to follow the text-to-image prompts. Current solutions require external supervision (e.g., human feedback or reward models) and only address unidirectional tasks - either understanding or generation. In this work, based on the observation that understanding and generation are inverse dual tasks, we introduce a self-supervised dual reward mechanism to reinforce the understanding and generation capabilities of LMMs. Specifically, we sample multiple outputs for a given input in one task domain, then reverse the input-output pairs to compute the dual likelihood of the model as self-rewards for optimization. Extensive experimental results on visual understanding and generation benchmarks demonstrate that our method can effectively enhance the performance of the model without any external supervision, especially achieving remarkable improvements in text-to-image tasks.
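“把输入输出对调、以逆向任务的似然作为自奖励”这一核心步骤可用如下草图说明(model.sample 与 model.log_likelihood 为假设接口,仅示意思路):

```python
import torch

@torch.no_grad()
def dual_self_reward(model, x, num_samples: int = 4):
    """对偶自奖励示意:在一个任务方向(如文本->图像)采样多个输出 y,
    再用逆向任务(图像->文本)重建输入 x 的对数似然作为各输出的自奖励。
    """
    samples = model.sample(x, num_samples)
    rewards = torch.tensor([model.log_likelihood(y, x) for y in samples])
    return samples, rewards  # rewards 可用于偏好优化或加权微调,无需外部监督
```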
zh
[NLP-6] Correlated Errors in Large Language Models ICML2025
【速读】: 该论文试图解决的问题是:不同大语言模型(Large Language Models, LLMs)之间是否存在显著差异,以及模型间的相似性是否会影响其在下游任务中的表现。论文通过在超过350个LLMs上进行大规模实证评估,发现模型错误存在显著相关性,尤其是在共享架构或提供者的情况下。关键解决方案在于识别驱动模型相关性的因素,并揭示模型规模和准确性与错误相关性之间的关系,从而为理解LLM的同质化问题及其对实际应用(如自动评估和招聘)的影响提供了实证依据。
链接: https://arxiv.org/abs/2506.07962
作者: Elliot Kim,Avi Garg,Kenny Peng,Nikhil Garg
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (stat.ML)
备注: Accepted to ICML 2025
Abstract:Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors – on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring – the latter reflecting theoretical predictions regarding algorithmic monoculture.
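论文中“两个模型同时出错时 60% 答案一致”这类统计量可以如下计算(NumPy 示意):

```python
import numpy as np

def agreement_when_both_err(pred_a, pred_b, gold) -> float:
    """两模型同时出错的样本中,二者给出相同错误答案的比例。"""
    pred_a, pred_b, gold = map(np.asarray, (pred_a, pred_b, gold))
    both_wrong = (pred_a != gold) & (pred_b != gold)
    if not both_wrong.any():
        return float("nan")
    return float((pred_a[both_wrong] == pred_b[both_wrong]).mean())

# 示例:两模型在第 2、3 题都错,且第 3 题错得相同 -> 一致率 0.5
print(agreement_when_both_err(["A", "B", "C", "D"],
                              ["A", "C", "C", "B"],
                              ["A", "A", "D", "D"]))
```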
zh
[NLP-7] Language Models over Canonical Byte-Pair Encodings ICML2025
【速读】: 该论文试图解决现代语言模型在词元化过程中产生的非规范词元编码(noncanonical token encodings)问题,即模型为大量实际训练数据中不可能出现的词元字符串分配了非零概率,这导致了错误和概率质量的浪费。解决方案的关键在于通过两种方法确保词元级语言模型仅对规范词元字符串分配正概率:(1) 通过条件推理在测试阶段实现规范性,无需额外训练;(2) 通过构造方式在模型参数化层面保证规范输出,但需要训练。
链接: https://arxiv.org/abs/2506.07956
作者: Tim Vieira,Tianyu Liu,Clemente Pasti,Yahya Emara,Brian DuSell,Benjamin LeBrun,Mario Giulianelli,Juan Luis Gastaldi,Timothy J. O’Donnell,Ryan Cotterell
机构: 未知
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
备注: ICML 2025
Abstract:Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of \itnoncanonical token encodings of each character string – these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
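规范性本身有一个简单的操作化定义:一个 token 序列是规范的,当且仅当将其解码回字符串后、再用确定性分词器重新编码,得到的仍是同一序列。下面是通用分词器接口下的检查函数(示意,忽略特殊符号等细节):

```python
def is_canonical(token_ids, tokenizer) -> bool:
    """token_ids 是否为 tokenizer 下的规范编码:
    规范编码 = decode 后再 encode 能精确还原自身。"""
    text = tokenizer.decode(token_ids)
    return tokenizer.encode(text) == list(token_ids)
```

据此,“条件化实现规范性”即在推理解码时屏蔽会使前缀变为非规范的候选 token,从而把全部概率质量留给规范序列。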
zh
[NLP-8] Statistical Hypothesis Testing for Auditing Robustness in Language Models
【速读】: 该论文试图解决在任意干预(如输入扰动或模型变体更改)下,判断大型语言模型(Large Language Model, LLM)输出是否发生变化的问题。现有方法无法直接比较两个LLM输出,因为其可能因系统的随机性而不同,同时无法比较整个输出分布,因为存在计算不可行性。该研究提出基于分布的扰动分析(distribution-based perturbation analysis),将LLM扰动分析重新表述为频率学派假设检验问题,通过蒙特卡洛采样在低维语义相似性空间中构建经验零假设和备择假设输出分布,从而在不施加严格分布假设的情况下实现可行的推断。该解决方案的关键在于利用统计假设检验框架,结合语义空间中的分布建模,以可靠地评估LLM输出的变化。
链接: https://arxiv.org/abs/2506.07947
作者: Paulius Rauba,Qiyao Wei,Mihaela van der Schaar
机构: 未知
类目: Computation and Language (cs.CL)
备注: arXiv admin note: substantial text overlap with arXiv:2412.00868
Abstract:Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or changing the model variant. We cannot simply compare two LLM outputs since they might differ due to the stochastic nature of the system, nor can we compare the entire output distribution due to computational intractability. While existing methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework is (i) model-agnostic, (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM, (iii) yields interpretable p-values; (iv) supports multiple perturbations via controlled error rates; and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.
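其假设检验流程可以用一个置换检验草图说明(统计量与嵌入方式均为示意性假设,非论文原式):

```python
import numpy as np

def perturbation_p_value(null_emb, alt_emb, n_perm: int = 999, seed: int = 0) -> float:
    """基于分布的扰动分析示意:
    null_emb / alt_emb 为两组输出(原输入 vs 扰动输入)经蒙特卡洛采样、
    句向量化后的嵌入矩阵 [n, d];统计量取两组均值向量的欧氏距离。"""
    rng = np.random.default_rng(seed)
    null_emb, alt_emb = np.asarray(null_emb), np.asarray(alt_emb)
    pooled = np.vstack([null_emb, alt_emb])
    n = len(null_emb)
    stat = np.linalg.norm(null_emb.mean(0) - alt_emb.mean(0))
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        a, b = pooled[perm[:n]], pooled[perm[n:]]
        hits += np.linalg.norm(a.mean(0) - b.mean(0)) >= stat
    return (hits + 1) / (n_perm + 1)  # 经验 p 值:可解释,且无需严格分布假设
```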
zh
[NLP-9] ProtocolLLM : RTL Benchmark for SystemVerilog Generation of Communication Protocols ISCA2025
【速读】: 该论文试图解决生成式 AI (Generative AI) 在硬件描述语言(HDL)领域,特别是生成可综合且功能正确的 SystemVerilog 设计方面的适用性不足问题。其解决方案的关键在于构建一个针对四种广泛使用的通信协议(SPI、I2C、UART 和 AXI)的基准测试套件,并定义不同抽象级别和提示具体性的代码生成任务,以评估生成设计的语法正确性、可综合性和功能保真度。
链接: https://arxiv.org/abs/2506.07945
作者: Arnav Sheth,Ivaxi Sheth,Mario Fritz
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at MLSysArch@ISCA 2025
Abstract:Recent advances in Large Language Models (LLMs) have shown promising capabilities in generating code for general-purpose programming languages. In contrast, their applicability for hardware description languages, particularly for generating synthesizable and functionally correct designs, remains significantly underexplored. HDLs such as SystemVerilog are logic-oriented and demand strict adherence to timing semantics, concurrency, and synthesizability constraints. Moreover, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. The objective of our paper is to analyze the capabilities of state-of-the-art LLMs in generating SystemVerilog implementations of standard communication protocols, a core component of embedded and System-on-Chip (SoC) architectures. This paper introduces the first benchmark suite targeting four widely used protocols: SPI, I2C, UART, and AXI. We define code generation tasks that capture varying levels of design abstraction and prompt specificity. The generated designs are assessed for syntactic correctness, synthesizability, and functional fidelity via waveform simulation and test benches.
zh
[NLP-10] Quantum Graph Transformer for NLP Sentiment Classification
【速读】: 该论文旨在解决传统自然语言处理模型在处理复杂结构化数据时效率与表达能力不足的问题,特别是在情感分类任务中。其解决方案的关键在于提出一种基于图的量子变换器(Quantum Graph Transformer, QGT),该架构将参数化量子电路(Parameterized Quantum Circuits, PQCs)引入消息传递框架中的自注意力机制,从而在减少可训练参数数量的同时捕捉更丰富的上下文关系,提升了模型的性能与样本效率。
链接: https://arxiv.org/abs/2506.07937
作者: Shamminuj Aktar,Andreas Bärtschi,Abdel-Hameed A. Badawy,Stephan Eidenbenz
机构: Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室); New Mexico State University(新墨西哥州立大学)
类目: Computation and Language (cs.CL); Quantum Physics (quant-ph)
备注:
Abstract:Quantum machine learning is a promising direction for building more efficient and expressive models, particularly in domains where understanding complex, structured data is critical. We present the Quantum Graph Transformer (QGT), a hybrid graph-based architecture that integrates a quantum self-attention mechanism into the message-passing framework for structured language modeling. The attention mechanism is implemented using parameterized quantum circuits (PQCs), which enable the model to capture rich contextual relationships while significantly reducing the number of trainable parameters compared to classical attention mechanisms. We evaluate QGT on five sentiment classification benchmarks. Experimental results show that QGT consistently achieves higher or comparable accuracy than existing quantum natural language processing (QNLP) models, including both attention-based and non-attention-based approaches. When compared with an equivalent classical graph transformer, QGT yields an average accuracy improvement of 5.42% on real-world datasets and 4.76% on synthetic datasets. Additionally, QGT demonstrates improved sample efficiency, requiring nearly 50% fewer labeled samples to reach comparable performance on the Yelp dataset. These results highlight the potential of graph-based QNLP techniques for advancing efficient and scalable language understanding.
zh
[NLP-11] Mimicking or Reasoning : Rethinking Multi-Modal In-Context Learning in Vision-Language Models
【速读】: 该论文试图解决视觉-语言模型(Vision-Language Models, VLMs)在上下文学习(In-Context Learning, ICL)中是否能够真正理解任务而非依赖浅层启发式策略的问题。其解决方案的关键在于提出一种新的多模态上下文学习(Multimodal ICL, MM-ICL)推理流程,该流程通过在每个示例中附加生成的推理过程(rationale)来增强示例的语义信息,以促进模型对任务的理解而非简单的答案复制。
链接: https://arxiv.org/abs/2506.07936
作者: Chengyue Huang,Yuchen Zhu,Sichen Zhu,Jingyun Xiao,Moises Andrade,Shivang Chopra,Zsolt Kira
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics – such as copying or majority voting – rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0. We conduct controlled studies varying shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.
zh
[NLP-12] Solving Inequality Proofs with Large Language Models
【速读】: 该论文试图解决不等式证明这一在科学和数学领域中具有挑战性的任务,该任务需要高级推理能力,如发现紧致界和战略性定理应用,而现有数据集通常稀缺、合成或过于形式化,限制了该领域的进展。其解决方案的关键在于提出了一种非正式但可验证的任务形式,将不等式证明重构为两个可自动检查的子任务:边界估计和关系预测,并构建了IneqMath数据集,包含奥数级别的不等式问题及其分步解法和定理标注。此外,还开发了一个基于LLM-as-judge的评估框架,以检测常见的推理缺陷,从而更准确地评估模型的证明能力。
链接: https://arxiv.org/abs/2506.07927
作者: Jiayi Sheng,Luna Lyu,Jikai Jin,Tony Xia,Alex Gu,James Zou,Pan Lu
机构: Stanford University (斯坦福大学); UC Berkeley (加州大学伯克利分校); Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 52 pages, 16 figures
Abstract:Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at this https URL.
zh
[NLP-13] Uncovering the Functional Roles of Nonlinearity in Memory
【速读】: 该论文试图解决序列建模任务中记忆与长程时间处理的核心需求,特别是探讨非线性在循环神经网络中的功能作用及其必要性。研究的关键在于通过Almost Linear Recurrent Neural Networks (AL-RNNs)这一工具,系统地分析非线性在循环网络中的计算必要性和所支持的机制,发现最小非线性不仅足够且通常最优,从而为选择性引入非线性提供了一个理论框架,连接了动力系统理论与循环神经网络中长程记忆和结构化计算的功能需求。
链接: https://arxiv.org/abs/2506.07919
作者: Manuel Brenner,Georgia Koppe
机构: Interdisciplinary Center for Scientific Computing, Heidelberg University, Germany (海德堡大学跨学科科学计算中心,德国); Hector Institute for AI in Psychiatry and Dept. of Psychiatry and Psychotherapy, Central Institute of Mental Health, Heidelberg University, Germany (人工智能精神病学赫克托研究所及精神病学与心理治疗系,中央精神健康研究所,海德堡大学,德国); Hertie Institute for AI in Brain Health, University of Tübingen, Germany (脑健康人工智能赫尔蒂研究所,图宾根大学,德国)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Chaotic Dynamics (nlin.CD); Computational Physics (physics.comp-ph)
备注: Preprint under review
Abstract:Memory and long-range temporal processing are core requirements for sequence modeling tasks across natural language processing, time-series forecasting, speech recognition, and control. While nonlinear recurrence has long been viewed as essential for enabling such mechanisms, recent work suggests that linear dynamics may often suffice. In this study, we go beyond performance comparisons to systematically dissect the functional role of nonlinearity in recurrent networks–identifying both when it is computationally necessary, and what mechanisms it enables. We use Almost Linear Recurrent Neural Networks (AL-RNNs), which allow fine-grained control over nonlinearity, as both a flexible modeling tool and a probe into the internal mechanisms of memory. Across a range of classic sequence modeling tasks and a real-world stimulus selection task, we find that minimal nonlinearity is not only sufficient but often optimal, yielding models that are simpler, more robust, and more interpretable than their fully nonlinear or linear counterparts. Our results provide a principled framework for selectively introducing nonlinearity, bridging dynamical systems theory with the functional demands of long-range memory and structured computation in recurrent neural networks, with implications for both artificial and biological neural systems.
zh
[NLP-14] LUCIFER: Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement
【速读】: 该论文旨在解决动态环境中代理系统内部模型与实际操作情境之间存在的知识滞后问题,这种差异限制了自主决策的有效性。解决方案的关键在于引入领域专家的上下文偏见,并通过LUCIFER(Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement)框架将其转化为可操作的情报。该框架结合了分层决策架构、强化学习(Reinforcement Learning, RL)和大语言模型(Large Language Models, LLMs),在两个协同角色中利用LLMs:作为上下文提取器,将口头领域专家输入结构化为影响决策的领域感知表示;作为零样本探索促进者,指导代理在探索过程中的动作选择。
链接: https://arxiv.org/abs/2506.07915
作者: Dimitris Panagopoulos,Adolfo Perrusquia,Weisi Guo
机构: Cranfield University (克兰菲尔德大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
备注: 12 pages, 4 Figures, 3 Tables, submitted to the IEEE for possible publication
Abstract:In dynamic environments, the rapid obsolescence of pre-existing environmental knowledge creates a gap between an agent’s internal model and the evolving reality of its operational context. This disparity between prior and updated environmental valuations fundamentally limits the effectiveness of autonomous decision-making. To bridge this gap, the contextual bias of human domain stakeholders, who naturally accumulate insights through direct, real-time observation, becomes indispensable. However, translating their nuanced, and context-rich input into actionable intelligence for autonomous systems remains an open challenge. To address this, we propose LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement), a domain-agnostic framework that integrates a hierarchical decision-making architecture with reinforcement learning (RL) and large language models (LLMs) into a unified system. This architecture mirrors how humans decompose complex tasks, enabling a high-level planner to coordinate specialised sub-agents, each focused on distinct objectives and temporally interdependent actions. Unlike traditional applications where LLMs are limited to single role, LUCIFER integrates them in two synergistic roles: as context extractors, structuring verbal stakeholder input into domain-aware representations that influence decision-making through an attention space mechanism aligning LLM-derived insights with the agent’s learning process, and as zero-shot exploration facilitators guiding the agent’s action selection process during exploration. We benchmark various LLMs in both roles and demonstrate that LUCIFER improves exploration efficiency and decision quality, outperforming flat, goal-conditioned policies. Our findings show the potential of context-driven decision-making, where autonomous systems leverage human contextual knowledge for operational success.
zh
[NLP-15] MiniCPM4: Ultra-Efficient LLMs on End Devices
【速读】: 该论文旨在解决在端侧设备上高效运行大规模语言模型(LLM)的问题,其核心挑战在于如何在有限的计算资源下实现高性能和低能耗。解决方案的关键在于从模型架构、训练数据、训练算法和推理系统四个维度进行系统性创新,其中尤为重要的是提出InfLLM v2稀疏注意力机制以加速长上下文处理,以及通过UltraClean和UltraChat v2优化预训练数据过滤与生成策略,结合ModelTunnel v2和BitCPM等高效训练算法,最终在推理系统中集成稀疏注意力、模型量化和推测采样技术,从而显著提升模型的效率与性能。
链接: https://arxiv.org/abs/2506.07900
作者: MiniCPM Team:Chaojun Xiao,Yuxuan Li,Xu Han,Yuzhuo Bai,Jie Cai,Haotian Chen,Wentong Chen,Xin Cong,Ganqu Cui,Ning Ding,Shengdan Fan,Yewei Fang,Zixuan Fu,Wenyu Guan,Yitong Guan,Junshao Guo,Yufeng Han,Bingxiang He,Yuxiang Huang,Cunliang Kong,Qiuzuo Li,Siyuan Li,Wenhao Li,Yanghao Li,Yishan Li,Zhen Li,Dan Liu,Biyuan Lin,Yankai Lin,Xiang Long,Quanyu Lu,Yaxi Lu,Peiyan Luo,Hongya Lyu,Litu Ou,Yinxu Pan,Zekai Qu,Qundong Shi,Zijun Song,Jiayuan Su,Zhou Su,Ao Sun,Xianghui Sun,Peijun Tang,Fangzheng Wang,Feng Wang,Shuo Wang,Yudong Wang,Yesai Wu,Zhenyu Xiao,Jie Xie,Zihao Xie,Yukun Yan,Jiarui Yuan,Kaihuo Zhang,Lei Zhang,Linyue Zhang,Xueren Zhang,Yudi Zhang,Hengyu Zhao,Weilin Zhao,Weilun Zhao,Yuanqian Zhao,Zhi Zheng,Ge Zhou,Jie Zhou,Wei Zhou,Zihan Zhou,Zixuan Zhou,Zhiyuan Liu,Guoyang Zeng,Chao Jia,Dahai Li,Maosong Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: MiniCPM4 Technical Report
Abstract:This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose this http URL that integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Sufficient evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences. Through further adaptation, MiniCPM4 successfully powers diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability.
zh
[NLP-16] MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs
【速读】: 该论文试图解决语言模型在实际系统中需要进行后处理更新以纳入新知识或修正信息的问题,但现有方法在高效且可靠地编辑模型的同时,难以避免重训练或遗忘旧信息。解决方案的关键在于提出MEMOIR框架,该框架通过残差记忆(residual memory)注入知识,即一个专用的参数模块,同时保留预训练模型的核心能力。该方法通过样本依赖的掩码稀疏化输入激活,将每次编辑限制在记忆参数的不同子集内,从而最小化编辑间的干扰,并在推理时通过比较新查询的稀疏激活模式与编辑期间存储的模式来识别相关编辑,实现对改写查询的泛化能力。
链接: https://arxiv.org/abs/2506.07899
作者: Ke Wang,Yiming Qin,Nikolaos Dimitriadis,Alessandro Favero,Pascal Frossard
机构: EPFL, Lausanne, Switzerland(瑞士洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The first two authors contributed equally to this work
Abstract:Language models deployed in real-world systems often require post-hoc updates to incorporate new or corrected knowledge. However, editing such models efficiently and reliably - without retraining or forgetting previous information - remains a major challenge. Existing methods for lifelong model editing either compromise generalization, interfere with past edits, or fail to scale to long editing sequences. We propose MEMOIR, a novel scalable framework that injects knowledge through a residual memory, i.e., a dedicated parameter module, while preserving the core capabilities of the pre-trained model. By sparsifying input activations through sample-dependent masks, MEMOIR confines each edit to a distinct subset of the memory parameters, minimizing interference among edits. At inference, it identifies relevant edits by comparing the sparse activation patterns of new queries to those stored during editing. This enables generalization to rephrased queries by activating only the relevant knowledge while suppressing unnecessary memory activation for unrelated prompts. Experiments on question answering, hallucination correction, and out-of-distribution generalization benchmarks across LLaMA-3 and Mistral demonstrate that MEMOIR achieves state-of-the-art performance across reliability, generalization, and locality metrics, scaling to thousands of sequential edits with minimal forgetting.
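“样本依赖掩码 + 稀疏激活模式匹配”这两个关键步骤可用如下 PyTorch 草图示意(非官方实现):

```python
import torch

def sample_dependent_mask(h: torch.Tensor, k: int) -> torch.Tensor:
    """按激活幅值取 top-k 生成稀疏掩码,
    使每次编辑只写入记忆参数的一个小子集,降低编辑间干扰。"""
    idx = h.abs().topk(k, dim=-1).indices
    return torch.zeros_like(h).scatter_(-1, idx, 1.0)

def retrieve_edit(query_mask: torch.Tensor, stored_masks: list) -> int:
    """推理时:将新查询的稀疏激活模式与各次编辑存储的模式比对,
    返回重叠度最高的编辑索引(重叠过低时可视为与任何编辑无关)。"""
    overlaps = torch.stack([(query_mask * m).sum() for m in stored_masks])
    return int(overlaps.argmax())
```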
zh
[NLP-17] Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark
【速读】: 该论文试图解决生成式 AI (Generative AI) 在面对经典哲学问题——即框架问题(Frame Problem)和符号接地问题(Symbol Grounding Problem)时的认知能力不足问题。研究的关键在于设计两个基准任务,以反映这两个问题的哲学核心,并在零样本条件下测试13个主流生成式 AI 模型的表现,评估其输出在上下文推理、语义连贯性和信息过滤等方面的质量。研究结果表明,部分封闭源代码模型在这些任务中表现优异,暗示现代生成式 AI 可能正在获得应对长期理论挑战的能力。
链接: https://arxiv.org/abs/2506.07896
作者: Shoko Oka
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 52 pages, Additional resources available on GitHub repository
Abstract:Recent advancements in large language models (LLMs) have revitalized philosophical debates surrounding artificial intelligence. Two of the most fundamental challenges - namely, the Frame Problem and the Symbol Grounding Problem - have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems. To do so, I designed two benchmark tasks reflecting the philosophical core of each problem, administered them under zero-shot conditions to 13 prominent LLMs (both closed and open-source), and assessed the quality of the models’ outputs across five trials each. Responses were scored along multiple criteria, including contextual reasoning, semantic coherence, and information filtering. The results demonstrate that while open-source models showed variability in performance due to differences in model size, quantization, and instruction tuning, several closed models consistently achieved high scores. These findings suggest that select modern LLMs may be acquiring capacities sufficient to produce meaningful and stable responses to these long-standing theoretical challenges.
zh
[NLP-18] Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在长上下文推理和生成过程中对关键信息注意力不足的问题,这一问题导致模型容易受到干扰模式的影响,从而降低推理准确性和生成质量。解决方案的关键在于提出一种名为Learning to Focus (LeaF)的两阶段框架,通过基于干预的推理方法解耦混淆因素。第一阶段利用梯度比较识别训练语料中与因果关系相关的混淆标记,第二阶段在知识蒸馏过程中剪枝这些标记,使学生的注意力分布与教师的关注点对齐,从而提升模型的可解释性和可靠性。
链接: https://arxiv.org/abs/2506.07851
作者: Yiju Guo,Wenkai Yang,Zexu Sun,Ning Ding,Zhiyuan Liu,Yankai Lin
机构: Gaoling School of Artificial Intelligence, Renmin University of China (高瓴人工智能学院,中国人民大学); Department of Computer Science and Technology, Tsinghua University (计算机科学与技术系,清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still falls behind the pace. Specifically, our preliminary experiments reveal that certain distracting patterns can misdirect the model’s attention during inference, and removing these patterns substantially improves reasoning accuracy and generation quality. We attribute this phenomenon to spurious correlations in the training data, which obstruct the model’s capacity to infer authentic causal instruction-response relationships. This phenomenon may induce redundant reasoning processes, potentially resulting in significant inference overhead and, more critically, the generation of erroneous or suboptimal responses. To mitigate this, we introduce a two-stage framework called Learning to Focus (LeaF) leveraging intervention-based inference to disentangle confounding factors. In the first stage, LeaF employs gradient-based comparisons with an advanced teacher to automatically identify confounding tokens based on causal relationships in the training corpus. Then, in the second stage, it prunes these tokens during distillation to enact intervention, aligning the student’s attention with the teacher’s focus distribution on truly critical context tokens. Experimental results demonstrate that LeaF not only achieves an absolute improvement in various mathematical reasoning and code generation benchmarks but also effectively suppresses attention to confounding tokens during inference, yielding a more interpretable and reliable reasoning model.
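第一阶段“用师生梯度对比定位混淆 token”的思路大致如下(损失接口为假设,归因方式为简化示意,并非论文原始公式):

```python
import torch

def confounding_token_scores(student_loss_fn, teacher_loss_fn, embeds):
    """LeaF 第一阶段示意:比较学生与教师对各输入 token 的梯度归因。
    embeds:          可求导的输入嵌入 [seq, d]
    *_loss_fn(e):    给定嵌入返回标量损失
    对教师不敏感、却主导学生梯度的 token,可视为疑似混淆项,
    在第二阶段蒸馏时将其剪除。"""
    e1 = embeds.detach().requires_grad_(True)
    g_student = torch.autograd.grad(student_loss_fn(e1), e1)[0]
    e2 = embeds.detach().requires_grad_(True)
    g_teacher = torch.autograd.grad(teacher_loss_fn(e2), e2)[0]
    # 每个 token 的分数:学生梯度范数与教师梯度范数之比
    return g_student.norm(dim=-1) / (g_teacher.norm(dim=-1) + 1e-8)
```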
zh
[NLP-19] Improving large language models with concept-aware fine-tuning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在形成连贯、高层次概念方面的局限性,这一问题源于当前基于下一个词预测的范式,导致模型无法将复杂短语视为统一的语义实体。解决方案的关键在于引入概念感知微调(Concept-Aware Fine-Tuning, CAFT),这是一种多词训练方法,通过让模型学习跨多个词的序列,增强其概念感知能力,从而提升模型的理解与推理能力。
链接: https://arxiv.org/abs/2506.07833
作者: Michael K. Chen,Xikun Zhang,Jiaxing Huang,Dacheng Tao
机构: Nanyang Technological University (南洋理工大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase “ribonucleic acid” as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments (“rib”, “on”, …), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at this https URL
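作为参照,一个通用的多词(multi-token)预测训练目标可以写成下面的形式——第 k 个辅助头预测第 t+k 个 token,损失为各视野交叉熵之和(这是标准 multi-token prediction 的极简示意,并非 CAFT 的官方实现细节):

```python
import torch
import torch.nn.functional as F

def multi_token_loss(logits_per_head, labels, horizon: int = 3):
    """logits_per_head: 长度为 horizon 的列表,每项 [batch, seq, vocab]
    labels:          [batch, seq]
    """
    total = 0.0
    for k in range(1, horizon + 1):
        logits_k = logits_per_head[k - 1][:, :-k]  # 第 k 个头对齐到 t+k 位置
        labels_k = labels[:, k:]
        total = total + F.cross_entropy(
            logits_k.reshape(-1, logits_k.size(-1)), labels_k.reshape(-1)
        )
    return total / horizon
```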
zh
[NLP-20] WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code
【速读】: 该论文试图解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在Web应用开发过程中缺乏系统性评估框架的问题,特别是现有基准测试未能全面评估模型在不同开发阶段的子能力。解决方案的关键在于提出WebUIBench,这是一个基于软件工程原理设计的基准测试,旨在系统评估MLLMs在四个关键领域的能力:WebUI感知、HTML编程、WebUI-HTML理解以及从WebUI生成代码。该基准包含21K条高质量问答对,来源于超过0.7K个真实网站,能够更准确地指导模型开发效率的提升。
链接: https://arxiv.org/abs/2506.07818
作者: Zhiyu Lin,Zhengda Zhou,Zhiyuan Zhao,Tianrui Wan,Yilun Ma,Junyu Gao,Xuelong Li
机构: Institute of Artificial Intelligence (TeleAI), China Telecom(电信人工智能研究所,中国电信); Northwestern Polytechnical University(西北工业大学); Beijing Jiaotong University(北京交通大学); Nanjing University(南京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weaknesses that models encounter during the development process.
zh
[NLP-21] MultiMatch: Multihead Consistency Regularization Matching for Semi-Supervised Text Classification
【速读】: 该论文旨在解决半监督学习(Semi-Supervised Learning, SSL)中的伪标签选择与加权问题,以提升模型在少量标注数据和大量未标注数据下的性能与鲁棒性。其解决方案的关键在于提出了一种三重伪标签加权模块,该模块通过结合头部一致性、模型置信度以及分类难度感知来筛选和加权伪标签,从而整合了多头协同训练中的头部一致性、FreeMatch中的自适应阈值以及MarginMatch中的平均伪边界三种技术,形成一种统一且高效的SSL方法。
链接: https://arxiv.org/abs/2506.07801
作者: Iustin Sirbu(1),Robert-Adrian Popovici(1),Cornelia Caragea(2),Stefan Trausan-Matu(1),Traian Rebedea(1 and 3) ((1) National University of Science and Technology POLITEHNICA Bucharest, (2) University of Illinois Chicago, (3) NVIDIA)
机构: National University of Science and Technology POLITEHNICA Bucharest (国家科学与技术理工大学布加勒斯特理工学院); University of Illinois Chicago (伊利诺伊大学芝加哥分校); NVIDIA (英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We introduce MultiMatch, a novel semi-supervised learning (SSL) algorithm combining the paradigms of co-training and consistency regularization with pseudo-labeling. At its core, MultiMatch features a three-fold pseudo-label weighting module designed for three key purposes: selecting and filtering pseudo-labels based on head agreement and model confidence, and weighting them according to the perceived classification difficulty. This novel module enhances and unifies three existing techniques – heads agreement from Multihead Co-training, self-adaptive thresholds from FreeMatch, and Average Pseudo-Margins from MarginMatch – resulting in a holistic approach that improves robustness and performance in SSL settings. Experimental results on benchmark datasets highlight the superior performance of MultiMatch, achieving state-of-the-art results on 9 out of 10 setups from 5 natural language processing datasets and ranking first according to the Friedman test among 19 methods. Furthermore, MultiMatch demonstrates exceptional robustness in highly imbalanced settings, outperforming the second-best approach by 3.26% – and data imbalance is a key factor for many text classification tasks.
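三重伪标签加权模块的逻辑可概括为如下草图(阈值此处简化为固定值,论文中为自适应阈值;仅为思路示意):

```python
import torch

def pseudo_label_weights(probs_heads: torch.Tensor, tau: float = 0.95) -> torch.Tensor:
    """MultiMatch 式三重加权示意:
    (1) 多头预测一致才保留;(2) 平均置信度须过阈值;(3) 按伪边际加权难度。
    probs_heads: [n_heads, batch, n_classes],各分类头的预测分布
    返回:[batch],每个无标注样本伪标签的最终权重(0 表示被过滤)。"""
    conf, pred = probs_heads.max(dim=-1)        # 各头置信度与预测类别
    agree = (pred == pred[0:1]).all(dim=0)      # (1) 头间一致性
    confident = conf.mean(dim=0) > tau          # (2) 置信度阈值(简化)
    top2 = probs_heads.mean(0).topk(2, dim=-1).values
    margin = top2[:, 0] - top2[:, 1]            # (3) 伪边际:难度感知权重
    return (agree & confident).float() * margin
```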
zh
[NLP-22] LLM Unlearning Should Be Form-Independent
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)去学习(unlearning)过程中存在的形式依赖偏差(Form-Dependent Bias)问题,即现有去学习方法的效果高度依赖于训练样本的形式,难以泛化到相同知识的不同表达方式,从而导致下游任务中的失败。解决方案的关键在于提出一种无需额外训练的去学习方法——秩一概念重定向(Rank-one Concept Redirection, ROCR),该方法通过针对下游任务中的不变量,特别是被激活的危险概念进行干预,能够在短时间内修改模型参数,将模型对特定目标概念的感知重新导向至无害概念,从而提升去学习的有效性。
链接: https://arxiv.org/abs/2506.07795
作者: Xiaotian Ye,Mengqi Zhang,Shu Wu
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Shandong University (山东大学); New Laboratory of Pattern Recognition (NLPR) (模式识别新实验室(NLPR)); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS) (多模态人工智能系统国家重点实验室(MAIS)); Institute of Automation, Chinese Academy of Sciences (自动化研究所,中国科学院)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Large Language Model (LLM) unlearning aims to erase or suppress undesirable knowledge within the model, offering promise for controlling harmful or private information to prevent misuse. However, recent studies highlight its limited efficacy in real-world scenarios, hindering practical adoption. In this study, we identify a pervasive issue underlying many downstream failures: the effectiveness of existing unlearning methods heavily depends on the form of training samples and frequently fails to generalize to alternate expressions of the same knowledge. We formally characterize this problem as Form-Dependent Bias and systematically investigate its specific manifestation patterns across various downstream tasks. To quantify its prevalence and support future research, we introduce ORT, a novel benchmark designed to evaluate the robustness of unlearning methods against variations in knowledge expression. Results reveal that Form-Dependent Bias is both widespread and severe among current techniques. We argue that LLM unlearning should be form-independent to address the endless forms of downstream tasks encountered in real-world security-critical scenarios. Towards this goal, we introduce Rank-one Concept Redirection (ROCR), a novel training-free method, as a promising solution path. ROCR performs unlearning by targeting the invariants in downstream tasks, specifically the activated dangerous concepts. It is capable of modifying model parameters within seconds to redirect the model’s perception of a specific unlearning target concept to another harmless concept. Extensive experiments demonstrate that ROCR significantly improves unlearning effectiveness compared to traditional methods while generating highly natural outputs.
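“秩一重定向”的代数骨架很简单:对权重施加一个秩一修正,使目标概念方向的输出被替换为无害概念的表示;如何选取概念方向与目标表示才是论文的核心,下面的草图将其留作输入(仅为示意):

```python
import torch

@torch.no_grad()
def rank_one_redirect(W: torch.Tensor, u: torch.Tensor, v_new: torch.Tensor):
    """秩一概念重定向示意:更新后 W' @ u == v_new。
    W: [d_out, d_in];u: [d_in](待遗忘概念的方向);v_new: [d_out](无害概念表示)
    """
    u = u / u.norm()
    delta = (v_new - W @ u).unsqueeze(1) @ u.unsqueeze(0)  # 秩一修正项
    return W + delta
```

由于只做一次闭式的秩一更新、无需任何训练,对与 u 近似正交的其余输入几乎不产生影响,这与“数秒内完成参数修改”的描述是一致的。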
zh
[NLP-23] Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
【速读】: 该论文旨在解决小型大语言模型(Large Language Models, LLMs)在面对分布偏移时推理能力不足的问题,例如数值或名义变量的变化以及干扰性从句的插入。论文提出的解决方案的关键在于通过强化学习(Reinforcement Learning, RL)在粒度抽象数据上训练,以促进模型对推理问题的抽象能力,而非仅依赖监督微调,后者往往无法生成忠实的抽象表示。该方法被称为AbstraL,能够有效缓解在GSM扰动基准测试中的性能下降。
链接: https://arxiv.org/abs/2506.07751
作者: Silin Gao,Antoine Bosselut,Samy Bengio,Emmanuel Abbe
机构: Apple(苹果); EPFL(洛桑联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注: Under review
Abstract:Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in their reasoning. I.e., they tend to experience performance drops when faced with distribution shifts, such as changes to numerical or nominal variables, or insertions of distracting clauses. A possible strategy to address this involves generating synthetic data to further “instantiate” reasoning problems on potential variations. In contrast, our approach focuses on “abstracting” reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. We find that this abstraction process is better acquired through reinforcement learning (RL) than just supervised fine-tuning, which often fails to produce faithful abstractions. Our method, AbstraL – which promotes abstract reasoning in LLMs using RL on granular abstraction data – significantly mitigates performance degradation on recent GSM perturbation benchmarks.
zh
[NLP-24] E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time ICML2025
【速读】: 该论文试图解决在潜在狄利克雷分布(Latent Dirichlet Allocation, LDA)主题模型中,对每篇文档分配的主题进行推断的问题,这是社会科学研究、数据探索和因果推断设置中的主要推理问题。解决方案的关键在于提出了一种新颖的非梯度、组合式的主题模型估计方法,该方法能够在对数并行计算时间(适应性)内收敛到接近最优后验概率,相较于任何已知的LDA算法快指数级。此外,该方法还能提供可解释性保证,确保每个学习到的主题与已知关键词有正式关联,并保持独立性假设以支持下游因果推断方法的应用。
链接: https://arxiv.org/abs/2506.07747
作者: Adam Breuer
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: ICML 2025; Code available at: this https URL
Abstract:In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) – exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
zh
[NLP-25] Swiss Parliaments Corpus Re-Imagined (SPC_R): Enhanced Transcription with RAG -based Correction and Predicted BLEU
【速读】: 该论文旨在解决低资源、领域特定语音语料库的高质量构建问题,通过提升自动语音识别(ASR)与语言模型(LLM)结合后的文本准确性。其解决方案的关键在于采用高计算资源配置下的Whisper Large-v3进行语音转录,随后利用两步GPT-4o校正流程:第一步基于官方会议记录修正识别错误,尤其是命名实体;第二步通过语义完整性评估筛选文本段落,并结合预测的BLEU分数进行数据驱动的过滤,从而显著提升语料库质量。
链接: https://arxiv.org/abs/2506.07726
作者: Vincenzo Timmel,Manfred Vogel,Daniel Perruchoud,Reza Kakooee
机构: University of Applied Sciences and Arts Northwestern Switzerland (瑞士西北应用科学与艺术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents a new long-form release of the Swiss Parliaments Corpus, converting entire multi-hour Swiss German debate sessions (each aligned with the official session protocols) into high-quality speech-text pairs. Our pipeline starts by transcribing all session audio into Standard German using Whisper Large-v3 under high-compute settings. We then apply a two-step GPT-4o correction process: first, GPT-4o ingests the raw Whisper output alongside the official protocols to refine misrecognitions, mainly named entities. Second, a separate GPT-4o pass evaluates each refined segment for semantic completeness. We filter out any segments whose Predicted BLEU score (derived from Whisper’s average token log-probability) and GPT-4o evaluation score fall below a certain threshold. The final corpus contains 801 hours of audio, of which 751 hours pass our quality control. Compared to the original sentence-level SPC release, our long-form dataset achieves a 6-point BLEU improvement, demonstrating the power of combining robust ASR, LLM-based correction, and data-driven filtering for low-resource, domain-specific speech corpora.
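其数据驱动过滤一步的逻辑示意如下(Predicted BLEU 的具体回归式摘要中未给出,此处的指数映射与阈值均为假设):

```python
import math

def keep_segment(avg_token_logprob: float, gpt_eval_score: float,
                 bleu_threshold: float = 0.7, eval_threshold: float = 0.8) -> bool:
    """SPC_R 质量过滤思路示意:
    由 Whisper 平均 token 对数概率外推一个 Predicted BLEU,
    并与 GPT-4o 的语义完整性评分联合做阈值过滤。"""
    predicted_bleu = math.exp(avg_token_logprob)  # 简化的单调映射,非论文原式
    return predicted_bleu >= bleu_threshold and gpt_eval_score >= eval_threshold
```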
zh
[NLP-26] Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility
【速读】: 该论文试图解决多语言语法错误修正(Grammatical Error Correction, GEC)中因语言类型学多样性导致的现有框架(如 errant)在扩展性方面的局限性。解决方案的关键在于提出一种标准化、模块化的多语言语法错误标注框架,该框架结合了与语言无关的基础结构和结构化的语言特定扩展,从而在保持跨语言一致性的同时提供灵活性。通过使用 stanza 重新实现 errant,该框架展示了其在多种语言(包括英语、德语、捷克语、韩语和汉语)中的适应性,支持从通用标注到更定制化语言学细化的应用。
链接: https://arxiv.org/abs/2506.07719
作者: Mengyang Qiu,Tran Minh Nguyen,Zihao Huang,Zelong Li,Yang Gu,Qingyu Gao,Siliang Liu,Jungyeul Park
机构: Open Writing Evaluation, France; University College London, UK; The University of British Columbia, Canada
类目: Computation and Language (cs.CL)
备注: BEA2025
Abstract:Grammatical Error Correction (GEC) relies on accurate error annotation and evaluation, yet existing frameworks, such as errant, face limitations when extended to typologically diverse languages. In this paper, we introduce a standardized, modular framework for multilingual grammatical error annotation. Our approach combines a language-agnostic foundation with structured language-specific extensions, enabling both consistency and flexibility across languages. We reimplement errant using stanza to support broader multilingual coverage, and demonstrate the framework’s adaptability through applications to English, German, Czech, Korean, and Chinese, ranging from general-purpose annotation to more customized linguistic refinements. This work supports scalable and interpretable GEC annotation across languages and promotes more consistent evaluation in multilingual settings. The complete codebase and annotation tools can be accessed at this https URL.
zh
[NLP-27] Through the Valley: Path to Effective Long CoT Training for Small Language Models
【速读】: 该论文试图解决小语言模型(Small Language Models, SLMs)在使用长链式思维(Long chain-of-thought, CoT)监督进行训练时出现的性能退化问题。研究发现,SLMs在有限的长CoT数据上训练后,其性能会显著下降,甚至在部分情况下,即使使用大量长CoT数据也无法恢复或超越原始性能。解决方案的关键在于识别并缓解误差累积效应,即长响应虽然增强了多步骤推理能力,但也增加了错误叠加的风险。此外,研究还表明,通过充分扩展的有监督微调(Supervised Fine-Tuning, SFT)可以缓解长CoT退化对下游强化学习(Reinforcement Learning, RL)的负面影响。
链接: https://arxiv.org/abs/2506.07712
作者: Renjie Luo,Jiaxi Li,Chen Huang,Wei Lu
机构: StatNLP Research Group, Singapore University of Technology and Design (统计自然语言处理研究组,新加坡科技设计大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; ≤3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of their original performance before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their original performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
zh
[NLP-28] Training Superior Sparse Autoencoders for Instruct Models
【速读】: 该论文旨在解决现有稀疏自编码器(Sparse Autoencoder, SAE)训练方法在指令模型(instruct model)上的重建质量和特征可解释性不足的问题。其解决方案的关键在于提出一种专门针对指令模型设计的新型训练方法,即Finetuning-Aligned Sequential Training(FAST),该方法通过与指令模型的数据分布和激活模式对齐,显著提升了重建精度和特征可解释性。
链接: https://arxiv.org/abs/2506.07691
作者: Jiaming Li,Haoran Ye,Yukun Chen,Xinyue Li,Lei Zhang,Hamid Alinejad-Rokny,Jimmy Chih-Hsien Peng,Min Yang
机构: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院); University of Chinese Academy of Sciences(中国科学院大学); National University of Singapore(新加坡国立大学); The University of New South Wales(新南威尔士大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose Finetuning-aligned Sequential Training (FAST), a novel training method specifically tailored for instruct models. FAST aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, FAST yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, 21.1% scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F). Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at this https URL.
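下面用 PyTorch 给出稀疏自编码器(SAE)训练目标的最小骨架,帮助理解 FAST 所优化的重建误差与稀疏约束;FAST 特有的“按指令模型数据分布与激活模式对齐采样”的训练流程此处不包含,维度与系数均为假设值。

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """最小示意的 SAE:对 LLM 隐藏状态做过完备稀疏编码。"""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # 稀疏特征
        return self.decoder(z), z

sae = SparseAutoencoder(d_model=768, d_hidden=8 * 768)
x = torch.randn(32, 768)                  # 假想的一批模型激活
recon, z = sae(x)
loss = nn.functional.mse_loss(recon, x) + 1e-3 * z.abs().mean()  # 重建 + L1 稀疏
loss.backward()
```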
zh
[NLP-29] GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在生成基于检索的问答(Retrieval-Augmented Generation, RAG)答案时,难以准确识别相关依据文本并有效处理信息不足情况的问题。解决方案的关键在于构建GaRAGe基准测试集,该基准包含由人类标注的长篇回答及每个依据段落的注释,从而实现对LLMs在生成RAG答案时是否能精准识别相关依据的细粒度评估。
链接: https://arxiv.org/abs/2506.07671
作者: Ionut-Teodor Sorodoc,Leonardo F. R. Ribeiro,Rexhina Blloshmi,Christopher Davis,Adrià de Gispert
机构: Amazon AGI (亚马逊AGI)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 (Findings)
Abstract:We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM’s ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F1 in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.
zh
[NLP-30] Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch
【速读】: 该论文试图解决在线平台中自动化内容审核系统在实时互动场景下的有效性问题,特别是针对Twitch平台的自动审核工具AutoMod在识别仇恨内容方面的局限性。解决方案的关键在于通过构建独立测试环境并利用Twitch的API发送大量来自多个数据集的评论,对AutoMod的审核能力进行审计,从而揭示其在处理包含性别歧视、种族主义、能力歧视和恐同内容时的高误漏检率以及对侮辱性语言的过度依赖。
链接: https://arxiv.org/abs/2506.07667
作者: Prarabdh Shukla,Wei Yin Chong,Yash Patel,Brennan Schaffner,Danish Pruthi,Arjun Bhagoji
机构: Indian Institute of Science (印度科学研究所); University of Chicago (芝加哥大学); Indian Institute of Technology, Bombay (印度理工学院孟买分校)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement (e.g., users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch's automated moderation tool (AutoMod) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch's APIs to send over 107,000 comments collated from 4 datasets. We measure AutoMod's accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to 94% on some datasets, bypass moderation. Contextual addition of slurs to these messages results in 100% removal, revealing AutoMod's reliance on slurs as a moderation signal. We also find that contrary to Twitch's community guidelines, AutoMod blocks up to 89.5% of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in AutoMod's capabilities and underscores the importance for such systems to understand context effectively.
zh
[NLP-31] Synthesis by Design: Controlled Data Generation via Structural Guidance
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在数学推理任务中面临的复杂逻辑和精确计算挑战。其解决方案的关键在于通过生成的问题求解代码提取结构化信息,并利用结构化解题过程引导数据生成,从而构建高质量、高难度的数学问题数据集。该方法在MATH和GSM8K数据集上生成了包含39,000个带中间步骤标注的问题以及一个6,100个问题的更高难度基准数据集,有效提升了模型在长推理链任务中的表现。
链接: https://arxiv.org/abs/2506.07664
作者: Lei Xu,Sirui Chen,Yuxuan Huang,Chaochao Lu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tongji University (同济大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities.
zh
[NLP-32] Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping
【速读】: 该论文试图解决语言模型(Language Model, LM)评估中的两个关键问题:构建可靠的专业领域基准以及理解领域适应过程中的知识表示。其解决方案的关键在于提出了一种确定性流程,该流程无需依赖语言模型或人工标注,即可将原始领域语料转换为完成型基准,从而避免了基准污染问题,并实现了对最新领域数据的评估。该方法通过TF和Term TF-IDF方法生成领域特定关键词及相关词表,构建提示-目标对,并通过测量模型完成提示的准确性来评估领域知识,具有计算成本低的优势。
链接: https://arxiv.org/abs/2506.07658
作者: Nitin Sharma,Thomas Wolfers,Çağatay Yıldız
机构: University of Tübingen(图宾根大学)
类目: Computation and Language (cs.CL)
备注: 35 pages, 24 figures. First submission
Abstract:The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using TF and Term TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark strongly correlates with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to domain knowledge evaluation in base models during training for early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigate catastrophic forgetting.
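下面是该确定性基准构建流程中“TF-IDF 关键词抽取 + 提示-目标对构造”两步的最小示意,使用 scikit-learn 实现;语料与遮词方式均为演示用假设,完整流程还包括纯 TF 统计与相关词表构建。

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 假想的领域语料(示意)
corpus = [
    "the spectrometer measures absorption spectra of the sample",
    "calibration of the spectrometer requires a reference sample",
]
vec = TfidfVectorizer(ngram_range=(1, 1), stop_words="english")
tfidf = vec.fit_transform(corpus)
scores = tfidf.sum(axis=0).A1                      # 语料级 TF-IDF 总分
keywords = sorted(zip(vec.get_feature_names_out(), scores),
                  key=lambda kv: -kv[1])[:5]       # 领域关键词

def make_prompt_target(sentence: str, keyword: str):
    """把含关键词的句子转为补全式提示-目标对:关键词前缀为 prompt,关键词为目标。"""
    idx = sentence.find(keyword)
    return sentence[:idx].strip(), keyword

for kw, _ in keywords:                             # 取第一个出现在句中的关键词
    if kw in corpus[0]:
        print(make_prompt_target(corpus[0], kw))
        break
```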
zh
[NLP-33] Transcript-Prompted Whisper with Dictionary-Enhanced Decoding for Japanese Speech Annotation INTERSPEECH2025
【速读】: 该论文旨在解决如何高效且准确地为日语文本-语音(TTS)数据集标注音素和韵律标签的问题。其关键解决方案是通过微调一个大规模预训练的自动语音识别(ASR)模型,使其在给定真实文本的前提下,同时输出词级图素和标注标签,并结合字典先验知识的解码策略以纠正音素标注中的错误。
链接: https://arxiv.org/abs/2506.07646
作者: Rui Hu,Xiaolong Lin,Jiawang Liu,Shixi Huang,Zhenpeng Zhan
机构: Baidu Inc. (百度公司)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted to INTERSPEECH 2025
Abstract:In this paper, we propose a method for annotating phonemic and prosodic labels on a given audio-transcript pair, aimed at constructing Japanese text-to-speech (TTS) datasets. Our approach involves fine-tuning a large-scale pre-trained automatic speech recognition (ASR) model, conditioned on ground truth transcripts, to simultaneously output phrase-level graphemes and annotation labels. To further correct errors in phonemic labeling, we employ a decoding strategy that utilizes dictionary prior knowledge. The objective evaluation results demonstrate that our proposed method outperforms previous approaches relying solely on text or audio. The subjective evaluation results indicate that the naturalness of speech synthesized by the TTS model, trained with labels annotated using our method, is comparable to that of a model trained with manual annotations.
zh
[NLP-34] Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在面对针对低资源语言的扰动和越狱攻击时的安全性问题。其关键解决方案是通过仅修改少量字符并利用小型代理模型进行词重要性计算,从而以低成本生成强大的攻击方法。该方法揭示了LLMs在低资源语言如波兰语中的潜在漏洞,并验证了其在其他语言中的可扩展性。
链接: https://arxiv.org/abs/2506.07645
作者: Maciej Chrabąszcz,Katarzyna Lorenc,Karolina Seweryn
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but safety-related training data contains mainly high-resource languages like English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be cheaply created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how it can be extended to other languages. We release the created datasets and code for further research.
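下面给出攻击构造思路的最小示意:用代理模型的留一法打分估计词重要性,再只对最重要的词做单字符扰动。toy_score 是假想的打分器,实际应替换为小型代理模型的置信度输出。

```python
import random

def proxy_importance(tokens, score_fn):
    """留一法词重要性:删去第 i 个词后代理模型得分的下降量。"""
    base = score_fn(tokens)
    return [base - score_fn(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]

def char_perturb(word: str, rng: random.Random) -> str:
    """仅改动少量字符:交换相邻字符,模拟低成本字符级攻击。"""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

rng = random.Random(0)
tokens = "this movie was surprisingly good".split()
toy_score = lambda ts: 1.0 if "good" in ts else 0.2  # 假想打分器,仅作演示
imp = proxy_importance(tokens, toy_score)
target = max(range(len(tokens)), key=lambda i: imp[i])  # 最重要的词
tokens[target] = char_perturb(tokens[target], rng)
print(" ".join(tokens))  # 例如 "this movie was surprisingly godo"
```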
zh
[NLP-35] TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在辅助同行评审过程中难以生成全面且有深度的评审意见同时保持效率的问题。其解决方案的关键在于提出TreeReview框架,该框架将论文评审建模为一种分层且双向的问答过程,通过递归分解高层次问题为细粒度子问题,并从叶节点到根节点迭代聚合答案以生成最终评审;同时引入动态问题扩展机制,在需要时生成后续问题以实现更深入的探究。
链接: https://arxiv.org/abs/2506.07642
作者: Yuan Chang,Ziyue Li,Hengyuan Zhang,Yuanbo Kong,Yanru Wu,Zhijiang Guo,Ngai Wong
机构: National Science Library, Chinese Academy of Sciences (国家科学图书馆,中国科学院); Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences (信息资源管理系,经济管理学院,中国科学院大学); Tsinghua University (清华大学); Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The University of Hong Kong (香港大学)
类目: Computation and Language (cs.CL)
备注: 30 pages, 17 figures
Abstract:While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at this https URL.
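TreeReview 的“自顶向下分解、自底向上聚合”可以概括为如下递归草图;llm(prompt) 为假想的模型调用接口,原文中的动态追问机制与深度控制此处从简。

```python
def tree_review(question: str, llm, depth: int = 0, max_depth: int = 2) -> str:
    """递归问答树示意:高层问题分解为子问题,叶节点作答后逐层聚合。"""
    if depth >= max_depth:
        return llm(f"请直接回答评审问题:{question}")
    subs = llm(f"将评审问题分解为若干细粒度子问题,每行一个:{question}").splitlines()
    answers = [tree_review(s, llm, depth + 1, max_depth) for s in subs if s.strip()]
    joined = "\n".join(answers)
    return llm(f"综合以下子问题答案,回答:{question}\n{joined}")

# 用法(my_llm_call 为任意接受字符串、返回字符串的模型接口):
# final_review = tree_review("这篇论文是否应被接收?", llm=my_llm_call)
```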
zh
[NLP-36] Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
【速读】: 该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, VLMs)生成的图像描述在事实准确性方面的评估难题,特别是针对细粒度错误检测不足的问题。现有方法在处理长段落描述时表现不佳,或缺乏带有验证错误的数据集。其解决方案的关键在于构建DOCCI-Critique基准,包含1,400条VLM生成的段落级图像描述及其超过10,216条句子级别的事实正确性人工标注和错误解释,并基于此开发VNLI-Critique模型,实现自动化的句子级事实性分类与批评生成,从而提升对VLM生成内容的细粒度评估能力。
链接: https://arxiv.org/abs/2506.07631
作者: Brian Gordon,Yonatan Bitton,Andreea Marzoca,Yasumasa Onoe,Xiao Wang,Daniel Cohen-Or,Idan Szpektor
机构: Tel Aviv University (特拉维夫大学); Google Research (谷歌研究院)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: this https URL
zh
[NLP-37] Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation
【速读】: 该论文试图解决当前大型语言模型(Large language models, LLMs)在教育应用中,尤其是智能辅导系统中,缺乏与教学策略对齐的问题。解决方案的关键在于通过细粒度的教师意图标注来提升LLM生成的辅导回应的质量。研究者在MathDial数据集上应用自动化标注框架,使用包含11种教学意图的详细分类体系重新标注部分数据,并基于这些新标注微调LLM,结果表明细粒度标注能够显著提升生成回应的教育适配性和有效性。
链接: https://arxiv.org/abs/2506.07626
作者: Kseniia Petukhova,Ekaterina Kochmar
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) hold great promise for educational applications, particularly in intelligent tutoring systems. However, effective tutoring requires alignment with pedagogical strategies - something current LLMs lack without task-specific adaptation. In this work, we explore whether fine-grained annotation of teacher intents can improve the quality of LLM-generated tutoring responses. We focus on MathDial, a dialog dataset for math instruction, and apply an automated annotation framework to re-annotate a portion of the dataset using a detailed taxonomy of eleven pedagogical intents. We then fine-tune an LLM using these new annotations and compare its performance to models trained on the original four-category taxonomy. Both automatic and qualitative evaluations show that the fine-grained model produces more pedagogically aligned and effective responses. Our findings highlight the value of intent specificity for controlled text generation in educational settings, and we release our annotated data and code to facilitate further research.
zh
[NLP-38] LoRMA: Low-Rank Multiplicative Adaptation for LLMs ACL
【速读】: 该论文试图解决大规模语言模型在下游任务中进行全量微调时计算成本过高的问题,其解决方案的关键在于提出一种名为低秩乘法适配(Low-Rank Multiplicative Adaptation, LoRMA)的新方法。与传统基于加法更新的低秩适配(LoRA)不同,LoRMA通过将参数更新从加法形式转换为矩阵乘法形式,扩展了参数调整的空间,从而提升了模型适应能力。为了应对矩阵乘法带来的计算复杂性和秩瓶颈问题,该方法通过有效重新排序操作并引入秩膨胀策略来优化性能。
链接: https://arxiv.org/abs/2506.07621
作者: Harsh Bihany,Shubham Patel,Ashutosh Modi
机构: Indian Institute of Technology Kanpur (印度理工学院坎普尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ACL Findings 2025; 21 pages (9 main paper + 5 pages references + 7 pages appendix)
Abstract:Large Language Models have shown remarkable capabilities in the NLP domain. Their effectiveness can mainly be attributed to their ability to adapt to an array of downstream tasks. However, generally, full fine-tuning is a computationally expensive job. To mitigate this, many techniques have been developed that prime efficiency, a prominent one being Low-Rank Adaptation (LoRA). However, LoRA and its variants employ re-parametrized additive updates. In this paper, we propose Low-Rank Multiplicative Adaptation (LoRMA), which shifts the paradigm of additive updates to a richer space of matrix multiplicative transformations. We tackle challenges such as computational complexity and rank bottleneck of matrix multiplication by effectively re-ordering operations and introducing rank inflation strategies. We conduct extensive experiments to demonstrate the effectiveness of our approach in terms of various evaluation metrics.
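为对比加法式与乘法式更新,下面给出 LoRMA 核心思想的一个 PyTorch 示意:LoRA 学习 W' = W + BA,而乘法式适配学习 W' = (I + BA)W(展开即 W + BAW)。具体乘法形式、运算重排与秩膨胀策略以原文为准,此处仅为概念草图。

```python
import torch
import torch.nn as nn

class LoRMALinear(nn.Module):
    """乘法式低秩适配示意:对冻结权重 W 施加 (I + BA) 变换。"""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        d_out = base.out_features
        self.A = nn.Parameter(torch.zeros(rank, d_out))        # 初始为零,起点等价于恒等变换
        self.B = nn.Parameter(torch.randn(d_out, rank) * 0.01)
        base.weight.requires_grad_(False)                      # 冻结预训练权重

    def forward(self, x):
        w = self.base.weight                       # (d_out, d_in)
        w_adapted = w + self.B @ (self.A @ w)      # (I + BA) W 的展开形式
        return nn.functional.linear(x, w_adapted, self.base.bias)

layer = LoRMALinear(nn.Linear(64, 64))
print(layer(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```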
zh
[NLP-39] Vuyko Mistral: Adapting LLM s for Low-Resource Dialectal Translation KR
【速读】: 该论文试图解决将大型语言模型(LLMs)适配到资源稀缺且形态学复杂的乌克兰语方言(以Hutsul方言为例)的问题。其关键解决方案是构建了一个包含9852对方言与标准乌克兰语句子的平行语料库以及一个包含7320个方言词汇映射的词典,并通过提出一种先进的检索增强生成(RAG)管道来生成合成平行翻译对,从而扩展语料库至52142个示例,以缓解数据不足的问题。
链接: https://arxiv.org/abs/2506.07617
作者: Roman Kyslyi,Yuliia Maksymiuk,Ihor Pysmennyi
机构: National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute” (乌克兰国立技术大学“伊戈尔·西科斯基基辅理工学院"); Ukrainian Catholic University (乌克兰天主教大学)
类目: Computation and Language (cs.CL)
备注: Preprint. Will be published at Proceedings of the Fourth Ukrainian Natural Language Processing Workshop (UNLP)
Abstract:In this paper we introduce the first effort to adapt large language models (LLMs) to the Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We have fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small(7B) finetuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: this https URL
zh
[NLP-40] PolitiSky24: U.S. Political Bluesky Dataset with User Stance Labels
【速读】: 该论文试图解决用户级立场检测资源在新兴平台上的稀缺性问题,特别是在政治人物如Kamala Harris和Donald Trump上的立场分析。其解决方案的关键在于构建了一个基于Bluesky平台的首个2024年美国总统选举立场检测数据集(PolitiSky24),该数据集通过结合先进的信息检索技术和大语言模型(LLM)生成具有支持性论据和文本片段的立场标签,实现了81%的准确率,并提供了用户完整的发布历史、互动图谱和参与元数据,从而提升了政治立场分析的时效性、透明度和全面性。
链接: https://arxiv.org/abs/2506.07606
作者: Peyman Rostami,Vahid Rahimzadeh,Ali Adibi,Azadeh Shakery
机构: University of Tehran (德黑兰大学); Institute for Research in Fundamental Sciences (基础科学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
备注: The dataset is available at this https URL
Abstract:Stance detection identifies the viewpoint expressed in text toward a specific target, such as a political figure. While previous datasets have focused primarily on tweet-level stances from established platforms, user-level stance resources, especially on emerging platforms like Bluesky remain scarce. User-level stance detection provides a more holistic view by considering a user’s complete posting history rather than isolated posts. We present the first stance detection dataset for the 2024 U.S. presidential election, collected from Bluesky and centered on Kamala Harris and Donald Trump. The dataset comprises 16,044 user-target stance pairs enriched with engagement metadata, interaction graphs, and user posting histories. PolitiSky24 was created using a carefully evaluated pipeline combining advanced information retrieval and large language models, which generates stance labels with supporting rationales and text spans for transparency. The labeling approach achieves 81% accuracy with scalable LLMs. This resource addresses gaps in political stance analysis through its timeliness, open-data nature, and user-level perspective. The dataset is available at this https URL
zh
[NLP-41] Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
【速读】: 该论文试图解决在低资源语言场景下,如何有效指导语言模型以适应用户意图的问题。其关键解决方案在于利用目标语言语料库、现有的多语言基础和指令调优的大型语言模型(LLM)以及从指令调优模型中合成生成的指令,构建一种替代传统指令适应流程的方法。实验表明,目标语言语料库是必不可少的,合成指令能够生成鲁棒模型,并且使用指令调优的模型作为主干网络优于使用未指令调优的基础模型,尤其在规模扩展时效果更显著。
链接: https://arxiv.org/abs/2506.07597
作者: Oscar Sainz,Naiara Perez,Julen Etxaniz,Joseba Fernandez de Landa,Itziar Aldabe,Iker García-Ferrero,Aimar Zabala,Ekhi Azurmendi,German Rigau,Eneko Agirre,Mikel Artetxe,Aitor Soroa
机构: HiTZ Center - Ixa, University of the Basque Country UPV/EHU (HiTZ中心-Ixa,巴斯克大学UPV/EHU)
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, that synthetic instructions yield robust models, and, most importantly, that using an instruction-tuned model as the backbone outperforms using a base non-instructed model, with further gains when scaling up. Using Llama 3.1 Instruct 70B as the backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque data apart from the 1.2B-word corpora. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation.
zh
[NLP-42] Beyond the Sentence: A Survey on Context-Aware Machine Translation with Large Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在上下文感知机器翻译中的应用研究相对不足的问题。其解决方案的关键在于通过文献综述分析现有方法,包括提示(prompting)和微调(fine-tuning)策略,并指出当前研究较少关注自动后编辑和构建上下文感知的翻译代理。研究还发现,商业LLMs(如ChatGPT和Tower LLM)在翻译效果上优于开源LLMs(如Llama和Bloom LLM),且基于提示的方法可作为评估翻译质量的良好基线。
链接: https://arxiv.org/abs/2506.07583
作者: Ramakrishna Appicharla,Baban Gain,Santanu Pal,Asif Ekbal
机构: Indian Institute of Technology Patna(印度理工学院巴特那分校); Wipro AI(维普罗人工智能); Indian Institute of Technology Jodhpur(印度理工学院焦德布尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the popularity of the large language models (LLMs), their application to machine translation is relatively underexplored, especially in context-aware settings. This work presents a literature review of context-aware translation with LLMs. The existing works utilise prompting and fine-tuning approaches, with few focusing on automatic post-editing and creating translation agents for context-aware machine translation. We observed that the commercial LLMs (such as ChatGPT and Tower LLM) achieved better results than the open-source LLMs (such as Llama and Bloom LLMs), and prompt-based approaches serve as good baselines to assess the quality of translations. Finally, we present some interesting future directions to explore.
zh
[NLP-43] Learning Speaker-Invariant Visual Features for Lipreading
【速读】: 该论文旨在解决唇读(lipreading)任务中因视觉特征包含说话人特定属性(如形状、颜色、纹理)而引入的虚假相关性问题,这些问题导致模型精度不足并限制了泛化能力。其解决方案的关键在于提出SIFLip框架,通过两个互补的解耦模块——隐式解耦和显式解耦——来分离说话人特定属性,从而学习到与说话人无关的视觉特征。具体而言,隐式解耦模块利用稳定的文本嵌入作为监督信号,以跨说话人的语义一致性学习共性视觉表示;显式解耦则通过在主唇读流程中设计说话人识别子任务,并借助梯度反转层进一步剥离个性化视觉特征。
链接: https://arxiv.org/abs/2506.07572
作者: Yu Li,Feng Xue,Shujie Li,Jinrui Zhang,Shuang Yang,Dan Guo,Richang Hong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Lipreading is a challenging cross-modal task that aims to convert visual lip movements into spoken text. Existing lipreading methods often extract visual features that include speaker-specific lip attributes (e.g., shape, color, texture), which introduce spurious correlations between vision and text. These correlations lead to suboptimal lipreading accuracy and restrict model generalization. To address this challenge, we introduce SIFLip, a speaker-invariant visual feature learning framework that disentangles speaker-specific attributes using two complementary disentanglement modules (Implicit Disentanglement and Explicit Disentanglement) to improve generalization. Specifically, since different speakers exhibit semantic consistency between lip movements and phonetic text when pronouncing the same words, our implicit disentanglement module leverages stable text embeddings as supervisory signals to learn common visual representations across speakers, implicitly decoupling speaker-specific features. Additionally, we design a speaker recognition sub-task within the main lipreading pipeline to filter speaker-specific features, then further explicitly disentangle these personalized visual features from the backbone network via gradient reversal. Experimental results demonstrate that SIFLip significantly improves generalization performance across multiple public datasets, outperforming state-of-the-art methods.
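显式解耦中的“梯度反转”可以用如下标准 GRL(gradient reversal layer)示意实现:前向传播恒等,反向传播把梯度取负,使主干特征朝“无法区分说话人”的方向更新。这是 GRL 的通用写法,并非论文源码。

```python
import torch

class GradReverse(torch.autograd.Function):
    """梯度反转层:前向恒等,反向梯度取负并按 lambd 缩放。"""
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

feat = torch.randn(4, 256, requires_grad=True)          # 假想的视觉特征
speaker_logits = torch.nn.Linear(256, 10)(grad_reverse(feat))  # 说话人识别子任务头
speaker_logits.sum().backward()                          # 回传到 feat 的梯度方向被反转
print(feat.grad.shape)
```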
zh
[NLP-44] SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)的自主代理框架在安全信息流、可靠性及多代理协调方面存在的脆弱性问题。其解决方案的关键在于提出SAFEFLOW协议级框架,通过细粒度的信息流控制(Information Flow Control, IFC)实现对代理、工具、用户和环境间数据交换的来源追溯、完整性与机密性保障,并结合事务执行、冲突解决和安全调度机制确保多代理场景下的全局一致性,同时引入预写日志、回滚和安全缓存等机制提升运行时错误和策略违规的容错能力。
链接: https://arxiv.org/abs/2506.07564
作者: Peiran Li,Xinkai Zou,Zhuohang Wu,Ruifeng Li,Shuo Xing,Hanwen Zheng,Zhikai Hu,Yuping Wang,Haoxi Li,Qin Yuan,Yingmo Zhang,Zhengzhong Tu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) and vision-language models (VLMs) have enabled powerful autonomous agents capable of complex reasoning and multi-modal tool use. Despite their growing capabilities, today's agent frameworks remain fragile, lacking principled mechanisms for secure information flow, reliability, and multi-agent coordination. In this work, we introduce SAFEFLOW, a new protocol-level framework for building trustworthy LLM/VLM-based agents. SAFEFLOW enforces fine-grained information flow control (IFC), precisely tracking provenance, integrity, and confidentiality of all the data exchanged between agents, tools, users, and environments. By constraining LLM reasoning to respect these security labels, SAFEFLOW prevents untrusted or adversarial inputs from contaminating high-integrity decisions. To ensure robustness in concurrent multi-agent settings, SAFEFLOW introduces transactional execution, conflict resolution, and secure scheduling over shared state, preserving global consistency across agents. We further introduce mechanisms, including write-ahead logging, rollback, and secure caches, that further enhance resilience against runtime errors and policy violations. To validate performance, we built SAFEFLOWBENCH, a comprehensive benchmark suite designed to evaluate agent reliability under adversarial, noisy, and concurrent operational conditions. Extensive experiments demonstrate that agents built with SAFEFLOW maintain impressive task performance and security guarantees even in hostile environments, substantially outperforming the state of the art. Together, SAFEFLOW and SAFEFLOWBENCH lay the groundwork for principled, robust, and secure agent ecosystems, advancing the frontier of reliable autonomy.
zh
[NLP-45] SELT: Self-Evaluation Tree Search for LLM s with Task Decomposition
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂推理任务中性能下降的问题。其解决方案的关键在于提出一种名为SELT(Self-Evaluation LLM Tree Search)的框架,该框架通过改进蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)来增强LLM的推理能力,而无需依赖外部奖励模型。SELT通过重新定义上置信界评分以匹配LLM的内在自我评估能力,并在每个节点引入语义聚类的原子子任务分解,有效平衡了探索与利用,减少了冗余推理路径并缓解了幻觉现象。
链接: https://arxiv.org/abs/2506.07557
作者: Mengsong Wu,Di Zhang,Yuqiang Li,Dongzhan Zhou,Wenliang Chen
机构: Soochow University (苏州大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures
Abstract:While Large Language Models (LLMs) have achieved remarkable success in a wide range of applications, their performance often degrades in complex reasoning tasks. In this work, we introduce SELT (Self-Evaluation LLM Tree Search), a novel framework that leverages a modified Monte Carlo Tree Search (MCTS) to enhance LLM reasoning without relying on external reward models. By redefining the Upper Confidence Bound scoring to align with intrinsic self-evaluation capabilities of LLMs and decomposing the inference process into atomic subtasks augmented with semantic clustering at each node, SELT effectively balances exploration and exploitation, reduces redundant reasoning paths, and mitigates hallucination. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools, where SELT achieves significant improvements in answer accuracy and reasoning robustness compared to baseline methods. Notably, our framework operates without task-specific fine-tuning, demonstrating strong generalizability across diverse reasoning tasks. Relevant results and code are available at this https URL .
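SELT 对 UCB 评分的改造思路可用下面的草图表示:利用项为 LLM 的自评分(而非外部奖励模型的价值估计),探索项沿用标准 UCB;归一化方式与常数 c 均为假设。

```python
import math

def selt_ucb(self_eval: float, visits: int, parent_visits: int,
             c: float = 1.4) -> float:
    """SELT 风格的 UCB 评分示意:self_eval 为归一化到 [0,1] 的 LLM 自评分。"""
    if visits == 0:
        return float("inf")  # 未访问节点优先探索
    exploit = self_eval                                      # 节点自评价值
    explore = c * math.sqrt(math.log(parent_visits) / visits)  # 标准探索项
    return exploit + explore

children = [(0.8, 5), (0.6, 2), (0.0, 0)]  # (自评分, 访问次数)
best = max(children, key=lambda ch: selt_ucb(ch[0], ch[1], parent_visits=7))
print(best)  # 未访问子节点 (0.0, 0) 将被优先选中
```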
zh
[NLP-46] ChemAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在化学任务中因预训练知识过时以及难以整合专业化学知识而面临的挑战。其解决方案的关键在于提出一种基于LLM的智能体,该智能体协同集成137个外部化学工具,并构建了一个数据集整理流程以生成ChemToolBench数据集,从而在微调和评估过程中实现有效的工具选择和精确的参数填充。此外,引入了分层进化蒙特卡洛树搜索(Hierarchical Evolutionary Monte Carlo Tree Search, HE-MCTS)框架,实现了工具规划与执行的独立优化,并通过自生成数据支持策略模型的步骤级微调以及任务自适应的PRM和ORM的训练,性能优于GPT-4o。
链接: https://arxiv.org/abs/2506.07551
作者: Mengsong Wu,YaFei Wang,Yidong Ming,Yuqi An,Yuwei Wan,Wenliang Chen,Binbin Lin,Yuqiang Li,Tong Xie,Dongzhan Zhou
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Soochow University (苏州大学); Zhejiang University (浙江大学); City University of Hong Kong (香港城市大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: 15 pages, 6 figures
Abstract:Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools, ranging from basic information retrieval to complex reaction prediction, and a dataset curation pipeline to generate the dataset ChemToolBench that facilitates both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at this https URL.
zh
[NLP-47] Bit-level BPE: Below the byte boundary
【速读】: 该论文试图解决子词分词(subword tokenization)中字节级回退(byte-level fallbacks)导致的序列长度增加问题,这一问题在中文、日文、韩文(CJK)等语言以及包含大量字符的场景(如表情符号)中尤为显著。解决方案的关键在于提出一种简单的无损压缩技术,以减少序列长度带来的计算开销。
链接: https://arxiv.org/abs/2506.07541
作者: Sangwhan Moon,Tatsuya Hiraoka,Naoaki Okazaki
机构: Google LLC(谷歌有限公司); MBZUAI(穆罕默德本扎耶德人工智能大学); Institute of Science Tokyo(东京科学大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Byte-level fallbacks for subword tokenization have become a common practice in large language models. In particular, it has been demonstrated to be incredibly effective as a pragmatic solution for preventing OOV, especially in the context of larger models. However, breaking a character down to individual bytes significantly increases the sequence length for long-tail tokens in languages such as Chinese, Japanese, and Korean (CJK) and other character-diverse contexts such as emoji. The increased sequence length results in longer computation during both training and inference. In this work, we propose a simple compression technique that reduces the sequence length losslessly.
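“低于字节边界”的直观含义可用下面的玩具示例说明:把 UTF-8 字节展开成比特序列,再在比特层面做贪心 BPE 合并,使词表单元可以跨越字节边界,从而无损缩短 CJK 等长尾字符的序列长度。合并表为手工假设,实际应从语料统计学得,且未必对应论文的具体算法。

```python
def byte_to_bits(data: bytes) -> str:
    return "".join(f"{b:08b}" for b in data)

def bpe_merge_bits(bits: str, merges: list[str]) -> list[str]:
    """在比特序列上做贪心 BPE 合并的最小示意;拼接所有符号即可无损还原。"""
    symbols = list(bits)
    for pattern in merges:               # 依次应用合并规则
        merged, i = [], 0
        while i < len(symbols):
            cat = "".join(symbols[i:i + 2])
            if i + 1 < len(symbols) and cat == pattern:
                merged.append(cat)       # 合并相邻两个符号
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

bits = byte_to_bits("語".encode("utf-8"))  # 3 字节 = 24 比特
print(len(bpe_merge_bits(bits, ["11", "00", "1111", "0000"])))  # 小于 24
```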
zh
[NLP-48] Towards Large Language Models with Self-Consistent Natural Language Explanations
【速读】: 该论文试图解决生成式 AI (Generative AI) 模型在事后解释其决策时存在的不一致性问题,即模型生成的解释与实际决策过程之间存在特征重要性不匹配的现象。现有方法由于估计特征重要性的高成本,难以在大规模数据集上进行有效评估。论文提出的解决方案关键在于构建了一个大规模基准 PSCB(Post-hoc Self-Consistency Bank),包含多种任务和模型的决策及其对应的 LLM 生成解释和特征重要性评分,并通过改进的度量标准对模型进行微调,从而提升解释与决策相关特征的一致性。
链接: https://arxiv.org/abs/2506.07523
作者: Sahar Admoni,Ofra Amir,Assaf Hallak,Yftah Ziser
机构: Technion – Israel Institute of Technology (以色列理工学院); Nvidia Research (Nvidia研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their decisions. Yet, studies show that these post-hoc explanations often misrepresent the true decision process, as revealed by mismatches in feature importance. Despite growing evidence of this inconsistency, no systematic solutions have emerged, partly due to the high cost of estimating feature importance, which limits evaluations to small datasets. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB) - a large-scale benchmark of decisions spanning diverse tasks and models, each paired with LLM-generated explanations and corresponding feature importance scores. Analysis of PSCB reveals that self-consistency scores barely differ between correct and incorrect predictions. We also show that the standard metric fails to meaningfully distinguish between explanations. To overcome this limitation, we propose an alternative metric that more effectively captures variation in explanation quality. We use it to fine-tune LLMs via Direct Preference Optimization (DPO), leading to significantly better alignment between explanations and decision-relevant features, even under domain shift. Our findings point to a scalable path toward more trustworthy, self-consistent LLMs.
zh
[NLP-49] DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction ACL2025
【速读】: 该论文试图解决自动语音识别(Automatic Speech Recognition, ASR)系统中命名实体(Named Entity, NE)纠错的问题。其解决方案的关键在于提出DeRAGEC方法,该方法通过扩展检索增强生成错误纠正(Retrieval-Augmented Generative Error Correction, RAGEC)框架,利用合成去噪推理来过滤噪声的NE候选,在不进行额外训练的情况下,结合语音相似性和增强定义,通过上下文学习对噪声检索的NE进行精炼,从而提升纠错效果。
链接: https://arxiv.org/abs/2506.07510
作者: Solee Im,Wonjun Lee,Jinmyeong An,Yunsu Kim,Jungseul Ok,Gary Geunbae Lee
机构: POSTECH(浦项科技大学); Samsung Electronics(三星电子); aiXplain Inc.(aiXplain公司)
类目: Computation and Language (cs.CL)
备注: ACL2025 Findings
Abstract:We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: this https URL
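候选去噪的核心步骤(按语音相似度过滤噪声命名实体候选)可用如下草图说明。phonetic_sim 这里用字符级相似度粗略代替音素级相似度,阈值为假设值;合成去噪推理与增强定义部分未实现。

```python
from difflib import SequenceMatcher

def phonetic_sim(a: str, b: str) -> float:
    """用字符序列相似度近似语音相似度;原文基于音素表示,此处仅作粗略代理。"""
    return SequenceMatcher(a=a.lower(), b=b.lower()).ratio()

def denoise_candidates(asr_span: str, candidates: list[str],
                       threshold: float = 0.6) -> list[str]:
    """按与 ASR 输出片段的相似度过滤噪声命名实体候选,
    过滤后的候选再交给 LLM 上下文学习做最终纠错。"""
    scored = [(c, phonetic_sim(asr_span, c)) for c in candidates]
    return [c for c, s in sorted(scored, key=lambda x: -x[1]) if s >= threshold]

print(denoise_candidates("new ark", ["Newark", "New York", "Nevada"]))
# 例如 ['Newark', 'New York'];"Nevada" 因相似度过低被过滤
```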
zh
[NLP-50] What Do Indonesians Really Need from Language Technology? A Nationwide Survey
【速读】: 该论文试图解决印度尼西亚700多种本土语言在自然语言处理(Natural Language Processing, NLP)发展中面临的进展缓慢问题,其核心原因是需要直接与母语者互动而带来的高昂成本。为了解决这一问题,研究者开展了一项全国性调查,以评估母语者的真实需求。研究发现,解决语言障碍,尤其是通过机器翻译和信息检索技术,是当前最紧迫的优先事项。此外,尽管对语言技术进步充满热情,但隐私、偏见以及公共数据用于AI训练的问题凸显了提高透明度和加强沟通的重要性,以促进更广泛的人工智能(Artificial Intelligence, AI)应用。
链接: https://arxiv.org/abs/2506.07506
作者: Muhammad Dehan Al Kautsar,Lucky Susanto,Derry Wijaya,Fajri Koto
机构: MBZUAI; Monash University
类目: Computation and Language (cs.CL)
备注: 26 pages, 12 figures, 5 tables
Abstract:There is an emerging effort to develop NLP for Indonesia's 700+ local languages, but progress remains costly due to the need for direct engagement with native speakers. However, it is unclear what these language communities truly need from language technology. To address this, we conduct a nationwide survey to assess the actual needs of native speakers in Indonesia. Our findings indicate that addressing language barriers, particularly through machine translation and information retrieval, is the most critical priority. Although there is strong enthusiasm for advancements in language technology, concerns around privacy, bias, and the use of public data for AI training highlight the need for greater transparency and clear communication to support broader AI adoption.
zh
[NLP-51] DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech
【速读】: 该论文试图解决通过语音(Disambiguation through Speech, DTS)消除文本歧义的问题,这一领域在现有研究中仍较为薄弱。其关键解决方案是构建DEBATE,一个公开的中文语音-文本数据集,旨在研究语音线索(如发音、停顿、重音和语调)如何帮助消除文本歧义并揭示说话者的真正意图。该数据集包含1,001个精心挑选的歧义话语,每个由10名母语者录制,涵盖了多样化的语言歧义及其通过语音实现的消歧过程。
链接: https://arxiv.org/abs/2506.07502
作者: Haotian Guo,Jing Han,Yongfeng Tu,Shihao Gao,Shengfan Shen,Wulong Xiang,Weihao Gan,Zixing Zhang
机构: Hunan University(湖南大学); University of Cambridge(剑桥大学); Malanshan Audio&Video Laboratory(马兰山音视频实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns (pronunciation, pause, stress, and intonation) can help resolve textual ambiguity and reveal a speaker's true intent. DEBATE contains 1,001 carefully selected ambiguous utterances, each recorded by 10 native speakers, capturing diverse linguistic ambiguities and their disambiguation through speech. We detail the data collection pipeline and provide rigorous quality analysis. Additionally, we benchmark three state-of-the-art large speech and audio-language models, illustrating clear and substantial performance gaps between machine and human understanding of spoken intent. DEBATE represents the first effort of its kind and offers a foundation for building similar DTS datasets across languages and cultures. The dataset and associated code are available at: this https URL.
zh
[NLP-52] Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning
【速读】: 该论文试图解决链式模型(Chain-of-Model, CoM)中每个子链仅依赖前序子链信息,导致因因果掩码阻断多层级子链间的全局上下文流动而丢失长程依赖的问题。其解决方案的关键在于提出一种因果演化图(Graph of Causal Evolution, GoCE),通过将隐式token表示映射为可微且稀疏的因果邻接矩阵,并利用因果掩码注意力和因果-MoE在每一层计算中渗透因果约束,结合干预一致性损失测试与自演化门控机制,实现因果结构学习与Transformer架构自适应更新之间的动态平衡。
链接: https://arxiv.org/abs/2506.07501
作者: Libo Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: The relevant code has been uploaded to the publicly available GitHub repository. The link is: this https URL
Abstract:In view of the problem that each subchain in the chain-of-model (CoM) relies only on the information of the previous subchain and may lose long-range dependencies due to the causal mask blocking the global context flow between multi-level subchains, this work proposes a graph of causal evolution (GoCE). Its core principle is to map the implicit token representation into a differentiable and sparse causal adjacency matrix, then permeate causal constraints through each layer of calculation using causal-masked attention and causal-MoE. By combining an intervention consistency loss test and a self-evolution gate, a dynamic balance between causal structure learning and adaptive updating of the transformer architecture is achieved. Experimental environments were built in sandboxes based on Claude Sonnet 4, o4-mini-high, and DeepSeek R1, each using the transformer variant architecture introduced in GoCE. The method is evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA and compared with baseline LLMs. The findings show that GoCE strengthens the transformer's ability to capture long-range causal dependencies, while its ability to self-evolve is improved. It not only surpasses the design of CoM in terms of design principles, but also provides experience for future research on causal learning and continuous adaptive improvement.
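“可微稀疏因果邻接矩阵 + 因果掩码注意力”的组合可用下面的 PyTorch 草图理解:在标准下三角因果掩码之外,再用 sigmoid 门控的邻接矩阵对注意力重新加权。邻接 logits 在原文中由 token 表示映射并经干预一致性损失学习,此处随机初始化,仅为示意。

```python
import torch

def goce_attention(q, k, v, adj_logits):
    """因果掩码 + 可微邻接门控的注意力示意。"""
    T = q.size(-2)
    scores = q @ k.transpose(-1, -2) / q.size(-1) ** 0.5
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))      # 标准因果掩码
    attn = torch.softmax(scores, dim=-1) * torch.sigmoid(adj_logits)  # 因果邻接门控
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)      # 重新归一化
    return attn @ v

T, d = 8, 16
q = k = v = torch.randn(1, T, d)
adj = torch.randn(T, T, requires_grad=True)  # 可微的因果邻接 logits
print(goce_attention(q, k, v, adj).shape)    # torch.Size([1, 8, 16])
```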
zh
[NLP-53] A Hybrid GA LLM Framework for Structured Task Optimization
【速读】: 该论文试图解决在严格约束下进行结构化生成任务的问题,例如行程规划、学术提纲和商业报告生成。解决方案的关键在于将遗传算法(Genetic Algorithm, GA)与大语言模型(Large Language Model, LLM)相结合,形成一种混合框架GA LLM。该框架将每个输出视为一个基因,并通过语言模型引导的选择、交叉和变异等进化操作,迭代优化解决方案,从而在保证结构完整性和全局优化的同时,融入领域知识和创造性变化。
链接: https://arxiv.org/abs/2506.07483
作者: Berry Feng,Jonas Lin,Patrick Lau
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages
Abstract:GA LLM is a hybrid framework that combines Genetic Algorithms with Large Language Models to handle structured generation tasks under strict constraints. Each output, such as a plan or report, is treated as a gene, and evolutionary operations like selection, crossover, and mutation are guided by the language model to iteratively improve solutions. The language model provides domain knowledge and creative variation, while the genetic algorithm ensures structural integrity and global optimization. GA LLM has proven effective in tasks such as itinerary planning, academic outlining, and business reporting, consistently producing well-structured and requirement-satisfying results. Its modular design also makes it easy to adapt to new tasks. Compared to using a language model alone, GA LLM achieves better constraint satisfaction and higher quality solutions by combining the strengths of both components.
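GA LLM 的主循环可以概括为如下草图:个体是完整文本方案,选择基于 LLM 打分,交叉与变异由 LLM 引导完成。llm(prompt) 为假想接口,打分解析与约束校验在真实系统中需要更稳健的处理。

```python
import random

def ga_llm(task: str, llm, pop_size: int = 6, generations: int = 5):
    """GA LLM 主循环的最小示意:LLM 提供打分、交叉与变异算子。"""
    pop = [llm(f"为以下任务生成一个候选方案:{task}") for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda s: float(llm(
            f"对照任务约束为该方案打分(0-10,只输出数字):{task}\n{s}")), reverse=True)
        parents = scored[: pop_size // 2]                    # 选择
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            child = llm(f"合并两个方案的优点并保持结构完整:\n方案A:{a}\n方案B:{b}")  # 交叉
            if random.random() < 0.3:
                child = llm(f"对方案做一处有创意的局部改写:{child}")  # 变异
            children.append(child)
        pop = parents + children
    return pop[0]  # 返回当前最优个体
```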
zh
[NLP-54] Improving Fairness of Large Language Models in Multi-document Summarization ACL2025
【速读】: 该论文试图解决多文档摘要(multi-document summarization, MDS)中的公平性问题,即在生成摘要时如何确保不同社会属性值的文档能够被公正地呈现,以避免因偏见导致的决策偏差。解决方案的关键在于提出一种名为FairPO的偏好调优方法,该方法同时关注摘要级和语料库级的公平性。为提升摘要级公平性,通过扰动文档集生成偏好对;为提升语料库级公平性,通过动态调整偏好对的权重实现公平感知的偏好调优。实验结果表明,FairPO在保持摘要关键质量的同时优于现有强基线方法。
链接: https://arxiv.org/abs/2506.07479
作者: Haoyuan Li, Yusen Zhang, Snigdha Chaturvedi
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Pennsylvania State University (宾夕法尼亚州立大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2025 main
Abstract:Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at this https URL.
zh
[NLP-55] Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
【速读】: 该论文试图解决传统语言模型(Language Model, LM)安全对齐中存在的时间滞后与防御失效问题,即攻击者利用静态模型进行攻击,而防御者通过微调来修补漏洞,这种顺序性方法导致攻击者过拟合于过时的防御机制,而防御者始终落后于新兴威胁。解决方案的关键在于提出一种在线自博弈强化学习算法——Self-RedTeam,其中攻击者和防御者代理通过持续交互共同进化,将安全对齐建模为一个双人零和博弈,使单一模型在攻击者和防御者角色间交替,同时由一个奖励语言模型裁定结果,从而实现动态协同适应。该方法基于零和博弈的博弈论框架,提供了理论上的安全保证,并通过实验证明其在攻击多样性与安全基准鲁棒性方面优于传统方法。
链接: https://arxiv.org/abs/2506.07468
作者: Mickel Liu,Liwei Jiang,Yancheng Liang,Simon Shaolei Du,Yejin Choi,Tim Althoff,Natasha Jaques
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注:
Abstract:Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch – attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and defender agent co-evolve through continuous interaction. We cast safety alignment as a two-player zero-sum game, where a single model alternates between attacker and defender roles – generating adversarial prompts and safeguarding against them – while a reward LM adjudicates outcomes. This enables dynamic co-adaptation. Grounded in the game-theoretic framework of zero-sum games, we establish a theoretical safety guarantee which motivates the design of our method: if self-play converges to a Nash Equilibrium, the defender will reliably produce safe responses to any adversarial input. Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared to attackers trained against static defenders and achieves higher robustness on safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained against static attackers. We further propose hidden Chain-of-Thought, allowing agents to plan privately, which boosts adversarial diversity and reduces over-refusals. Our results motivate a shift from reactive patching to proactive co-evolution in LM safety training, enabling scalable, autonomous, and robust self-improvement of LMs via multi-agent reinforcement learning (MARL).
zh
[NLP-56] CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models
【速读】: 该论文旨在解决大规模预训练数据集在数据质量与人类推理轨迹多样性方面的不足,以提升大语言模型(Large Language Model, LLM)的性能。其关键解决方案是提出了一种基于模型的新型数据质量验证管道,包括两阶段去重、多分类器质量评分和领域感知流畅性过滤,从而确保数据的高质量与多样性,同时通过分阶段提取思维链(Chain-of-Thought, CoT)模板,减少幻觉现象并增强模型的推理能力。
链接: https://arxiv.org/abs/2506.07463
作者: Guang Liu,Liangdong Wang,Jijie Li,Yang Yu,Yao Xu,Jiabei Chen,Yu Bai,Feng Liao,Yonghua Lin
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources from math, wiki, arxiv, and code. Although these data are mostly sourced from well-processed datasets, the quality standards of various domains are dynamic, and meeting them requires extensive expert experience and labor. We therefore propose a novel pipeline that validates data quality mainly with models, through two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion pieces of CoT (Chain-of-Thought) templates, named CCI4.0-M2-CoT. Differing from the distillation of CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on automatically processing pretraining corpora.
zh
[NLP-57] From Calibration to Collaboration: LLM Uncertainty Quantification Should Be More Human-Centered
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中可靠性不足的问题,特别是如何有效量化不确定性(Uncertainty Quantification, UQ),以提升人机协作的可信度。论文指出当前LLM UQ方法存在三个关键问题:1)评估基准缺乏生态效度;2)仅考虑认知性不确定性;3)优化指标与下游任务实用性不相关。解决方案的关键在于采用以用户为中心的方法,推动更符合实际应用场景的UQ研究方向,而非依赖不具代表性的任务和不相关的评估指标。
链接: https://arxiv.org/abs/2506.07461
作者: Siddartha Devic,Tejas Srinivasan,Jesse Thomason,Willie Neiswanger,Vatsal Sharan
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) are increasingly assisting users in the real world, yet their reliability remains a concern. Uncertainty quantification (UQ) has been heralded as a tool to enhance human-LLM collaboration by enabling users to know when to trust LLM predictions. We argue that current practices for uncertainty quantification in LLMs are not optimal for developing useful UQ for human users making decisions in real-world tasks. Through an analysis of 40 LLM UQ methods, we identify three prevalent practices hindering the community’s progress toward its goal of benefiting downstream users: 1) evaluating on benchmarks with low ecological validity; 2) considering only epistemic uncertainty; and 3) optimizing metrics that are not necessarily indicative of downstream utility. For each issue, we propose concrete user-centric practices and research directions that LLM UQ researchers should consider. Instead of hill-climbing on unrepresentative tasks using imperfect metrics, we argue that the community should adopt a more human-centered approach to LLM uncertainty quantification.
zh
[NLP-58] GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning
【速读】: 该论文旨在解决手语生成(Sign Language Generation, SLG)中存在词汇顺序错误和语义准确性低的问题。现有方法通常采用句子级条件,将输入文本的整个句子编码为单一特征向量作为SLG的条件,这导致无法捕捉手语的时间结构并缺乏词级语义的粒度,从而引发手语序列紊乱和动作模糊。解决方案的关键在于提出GLOS框架,其核心是引入时序对齐的词典级条件(gloss-level conditions),使模型能够在每个时间步访问手语的时间结构和词级语义,实现对手势的细粒度控制,并通过条件融合模块——时序对齐条件(Temporal Alignment Conditioning, TAC)高效传递词级语义和时间结构至对应的动作时间步,从而生成具有正确词汇顺序和高语义准确性的手语。
链接: https://arxiv.org/abs/2506.07460
作者: Taeryung Lee,Hyeongjin Nam,Gyeongsik Moon,Kyoung Mu Lee
机构: Seoul National University (首尔大学); KRAFTON (KRAFTON); Korea University (韩国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Sign language generation (SLG), or text-to-sign generation, bridges the gap between signers and non-signers. Despite recent progress in SLG, existing methods still often suffer from incorrect lexical ordering and low semantic accuracy. This is primarily due to sentence-level conditioning, which encodes the entire sentence of the input text into a single feature vector as a condition for SLG. This approach fails to capture the temporal structure of sign language and lacks the granularity of word-level semantics, often leading to disordered sign sequences and ambiguous motions. To overcome these limitations, we propose GLOS, a sign language generation framework with temporally aligned gloss-level conditioning. First, we employ gloss-level conditions, which we define as sequences of gloss embeddings temporally aligned with the motion sequence. This enables the model to access both the temporal structure of sign language and word-level semantics at each timestep, allowing fine-grained control of signs and better preservation of lexical order. Second, we introduce a condition fusion module, temporal alignment conditioning (TAC), to efficiently deliver the word-level semantic and temporal structure provided by the gloss-level condition to the corresponding motion timesteps. Our method, which is composed of gloss-level conditions and TAC, generates signs with correct lexical order and high semantic accuracy, outperforming prior methods on CSL-Daily and Phoenix-2014T.
zh
[NLP-59] KScope: A Framework for Characterizing the Knowledge Status of Language Models
【速读】: 该论文试图解决如何准确表征大型语言模型(Large Language Model, LLM)对特定问题的知识状态这一挑战。现有研究主要关注LLM在知识冲突下的行为,但未能全面反映模型对问题答案的掌握程度。论文提出了一种基于知识模式一致性和正确性的五类知识状态分类法,并引入KScope框架,通过分层统计检验逐步细化关于知识模式的假设,从而将LLM的知识状态归类到这五种状态中。解决方案的关键在于通过系统化的统计方法和上下文特征分析,提升模型知识更新的有效性与泛化能力。
链接: https://arxiv.org/abs/2506.07458
作者: Yuxin Xiao,Shan Chen,Jack Gallifant,Danielle Bitterman,Thomas Hartvigsen,Marzyeh Ghassemi
机构: Massachusetts Institute of Technology (麻省理工学院); Harvard University (哈佛大学); University of Virginia (弗吉尼亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Characterizing a large language model’s (LLM’s) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model’s internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.
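为直观理解"按一致性与正确性刻画知识状态"的思路,下面给出一个高度简化的示意:对同一问题重复采样若干回答,以多数答案占比近似一致性,再结合正确性映射到几种状态。论文中的五种状态定义与分层统计检验远比此严格,以下划分与阈值仅为说明性假设。

```python
from collections import Counter

def knowledge_status(answers, gold, consistency_threshold=0.8):
    """按"一致性 x 正确性"粗略划分知识状态(划分与阈值为说明性假设)。"""
    mode, count = Counter(answers).most_common(1)[0]
    consistent = count / len(answers) >= consistency_threshold
    correct = (mode == gold)
    if consistent and correct:
        return "consistently_correct"
    if consistent:
        return "consistently_wrong"
    if correct:
        return "inconsistent_mostly_correct"
    return "conflicted" if gold in answers else "no_knowledge"

print(knowledge_status(["Paris"] * 9 + ["Lyon"], gold="Paris"))  # consistently_correct
```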
zh
[NLP-60] Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling
【速读】: 该论文试图解决低资源场景下主题建模(topic modeling)中由于目标领域数据有限导致的主题推断不稳定和不连贯的问题。解决方案的关键在于引入领域自适应(domain adaptation),通过高资源源领域向低资源目标领域进行有效的知识迁移,同时避免无关内容的干扰。研究提出了DALTA(Domain-Aligned Latent Topic Adaptation)框架,其核心包括共享编码器以提取领域不变特征、专用解码器捕捉领域特定细节,以及对抗对齐机制以选择性地传递相关信息,从而实现更稳定、连贯和可迁移的主题建模效果。
链接: https://arxiv.org/abs/2506.07453
作者: Pritom Saha Akash,Kevin Chen-Chuan Chang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.
zh
[NLP-61] When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment
【速读】: 该论文试图解决风格模式(style patterns)对大型语言模型(Large Language Models, LLMs)安全性的影响问题,特别是风格模式如何增加模型被越狱攻击(jailbreak attacks)的脆弱性。研究发现,带有特定风格模式的恶意查询会显著提升攻击成功率(ASR),且ASR的上升与风格模式的长度及其在模型中的相对注意力有关。解决方案的关键在于提出SafeStyle,这是一种通过在安全训练数据中引入与微调数据中风格模式分布相匹配的增强数据,从而有效缓解风格模式引发的安全风险的防御策略。
链接: https://arxiv.org/abs/2506.07452
作者: Yuxin Xiao,Sana Tonekaboni,Walter Gerych,Vinith Suriyakumar,Marzyeh Ghassemi
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in jailbreak queries. Although these style patterns are semantically unrelated to the malicious intents behind jailbreak queries, their safety impact remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We evaluate 32 LLMs across seven jailbreak benchmarks, and find that malicious queries with style patterns inflate the attack success rate (ASR) for nearly all models. Notably, ASR inflation correlates with both the length of style patterns and the relative attention an LLM exhibits on them. We then investigate superficial style alignment, and find that fine-tuning with specific styles makes LLMs more vulnerable to jailbreaks of those same styles. Finally, we propose SafeStyle, a defense strategy that incorporates a small amount of safety training data augmented to match the distribution of style patterns in the fine-tuning data. Across three LLMs and five fine-tuning style settings, SafeStyle consistently outperforms baselines in maintaining LLM safety.
zh
[NLP-62] LlamaRec-LKG-RAG: A Single-Pass Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking
【速读】: 该论文旨在解决传统检索增强生成(RAG)方法在推荐系统中依赖扁平化、基于相似性的检索机制,未能充分利用用户-物品交互中的丰富关系结构的问题。其解决方案的关键在于提出LlamaRec-LKG-RAG框架,该框架通过整合个性化知识图谱(Knowledge Graph, KG)上下文到基于大语言模型(LLM)的推荐排序中,利用轻量级用户偏好模块动态识别异构知识图谱中的关键关系路径,并将这些个性化子图无缝融入微调后的Llama-2模型的提示中,从而实现高效且可解释的推荐。
链接: https://arxiv.org/abs/2506.07449
作者: Vahid Azizi,Fatemeh Koochaki
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions. We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step. Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall). LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. Code is available at this https URL.
zh
[NLP-63] LG-ANNA-Embedding technical report
【速读】: 该论文试图解决如何生成适用于信息检索(IR)和非IR任务的通用文本嵌入问题,旨在减少对任务特定微调的依赖。其解决方案的关键在于构建一个基于统一指令框架的方法,结合了上下文学习、软监督和自适应硬负样本挖掘,以生成上下文感知的嵌入。通过结构化指令和少量示例引导模型完成多样化任务,并利用从高性能密集检索器和重排序器中蒸馏出的连续相关性评分作为细粒度监督信号,同时引入基于自适应边距的硬负样本挖掘策略,提升语义区分能力和训练稳定性。
链接: https://arxiv.org/abs/2506.07438
作者: Jooyoung Choi,Hyun Kim,Hansol Jang,Changwook Jun,Kyunghoon Bae,Hyewon Choi,Stanley Jungkyu Choi,Honglak Lee,Chulmin Yun
机构: LG AI Research (LG人工智能研究院)
类目: Computation and Language (cs.CL)
备注: 10 pages
Abstract:This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.
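摘要中"基于自适应边距的硬负样本挖掘"可以用几行代码示意:若负样本与查询的相似度超过"正样本相似度减去边距",则视为语义模糊的潜在假负例并剔除。边距取值与相似度度量均为本文假设。

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def filter_hard_negatives(query, positive, negatives, margin=0.05):
    """剔除相似度达到 (正例相似度 - margin) 以上的语义模糊负样本(margin 为假设值)。"""
    pos_sim = cosine(query, positive)
    return [n for n in negatives if cosine(query, n) < pos_sim - margin]

rng = np.random.default_rng(0)
q, p = rng.normal(size=8), rng.normal(size=8)
negs = [rng.normal(size=8) for _ in range(5)]
print(len(filter_hard_negatives(q, p, negs)))  # 留下的负样本数
```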
zh
[NLP-64] Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding ACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成内容时缺乏与人类偏好对齐的问题,以避免产生攻击性、虚假或无意义的内容。其解决方案的关键在于提出一种名为弱到强解码(Weak-to-Strong Decoding, WSD)的框架,通过一个小型对齐模型引导大型基础模型生成更符合人类偏好的内容。具体而言,小型模型首先生成对齐的开头部分,随后由大型模型继续完成剩余内容,整个过程由一个精心设计的自动切换机制控制,从而有效提升模型的对齐能力并避免下游任务性能下降。
链接: https://arxiv.org/abs/2506.07434
作者: Feifan Song,Shaohang Wei,Wen Luo,Yuxuan Fan,Tianyu Liu,Guoyin Wang,Houfeng Wang
机构: Peking University (北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Findings
Abstract:Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have been popular, while still facing challenges in obtaining both high-quality and aligned content. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models by the guidance of a small aligned model. The small model first drafts well-aligned beginnings, followed by the large base model to continue the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks, termed as the alignment tax. Extensive experiments are further conducted to examine the impact of different settings and time efficiency, as well as analyses on the intrinsic mechanisms of WSD in depth.
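弱到强解码(WSD)的主流程可以概括为"小对齐模型起草开头、大基座模型续写其余"。下面的示意把论文中精心设计的自动切换机制简化为固定的起草长度;draft_model 与 base_model 的 generate 接口(文本进、文本出)亦为本文假设。

```python
def weak_to_strong_decode(prompt, draft_model, base_model,
                          draft_tokens=32, max_tokens=256):
    """小模型起草对齐的开头,大模型续写其余(切换条件在此简化为固定长度)。"""
    # 阶段一:小的对齐模型生成开头若干 token
    beginning = draft_model.generate(prompt, max_new_tokens=draft_tokens)
    # 阶段二:大基座模型以 "prompt + 开头" 为条件继续生成
    continuation = base_model.generate(prompt + beginning,
                                       max_new_tokens=max_tokens - draft_tokens)
    return beginning + continuation
```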
zh
[NLP-65] Conjoined Predication and Scalar Implicature
【速读】: 该论文试图解决Magri(2016)提出的第一个关于连词现象的谜题,该谜题涉及量化、集体/同时解释和语境更新维度之间的隐性互动。论文的关键解决方案在于指出,句子如“(Only) Some Italians come from a warm country and are blond”之所以显得不恰当,是因为连词谓词的集体或同时解释引发了间接的语境矛盾。这一解释将问题置于原始理论框架内进行概念分析,并提出标量含义生成的语用机制可能超出了基于穷尽性(exhaustification)的语法许可理论的范畴。
链接: https://arxiv.org/abs/2506.07429
作者: Ratna Kandala
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Magri (2016) investigates two puzzles arising from conjunction. Although Magri has proposed a solution to the second puzzle, the first remains unresolved. This first puzzle reveals a hidden interaction among quantification, collective/concurrent interpretation, and contextual updating dimensions that have yet to be explored. In essence, the problem is that certain forms of sentences like “Some Italians come from a warm country,” when conjoined as in “(Only) Some Italians come from a warm country and are blond,” sound infelicitous, even though no obvious alternative triggers a conflicting scalar implicature. In this paper, we offer a conceptual analysis of Magri’s first puzzle by situating it within its original theoretical framework. We argue that the oddness arises from the collective or concurrent reading of the conjunctive predicate: in examples such as “(Only) Some Italians come from a warm country and are blond,” this interpretation generates an indirect contextual contradiction. Moreover, we suggest that the pragmatic mechanisms governing scalar implicature generation extend beyond what is captured by exhaustification-based grammatical licensing accounts.
zh
[NLP-66] Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models ACL2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)计算需求高而难以在资源受限环境中部署,以及小型语言模型(Small Language Models, SLMs)计算效率高但泛化能力不足之间的性能差距问题。其解决方案的关键在于提出PiFi框架,该框架通过将一个冻结的LLM层整合到SLM中,并对融合后的模型进行特定任务微调,从而在不显著增加计算成本的前提下提升性能。
链接: https://arxiv.org/abs/2506.07424
作者: Kyeonghyun Kim,Jinhee Jang,Juhwan Choi,Yoonji Lee,Kyohoon Jin,YoungBin Kim
机构: Chung-Ang University (忠南大学); AITRICS (AITRICS); DATUMO (DATUMO)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2025 main conference
Abstract:Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel framework that combines the strengths of both LLMs and SLMs to achieve high performance while maintaining efficiency. PiFi integrates a single frozen layer from an LLM into a SLM and fine-tunes the combined model for specific tasks, boosting performance without a significant increase in computational cost. We show that PiFi delivers consistent performance improvements across a range of natural language processing tasks, including both natural language understanding and generation. Moreover, our findings demonstrate PiFi’s ability to effectively leverage LLM knowledge, enhancing generalization to unseen domains and facilitating the transfer of linguistic abilities.
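PiFi"将单个冻结的 LLM 层整合进 SLM"的结构可用 PyTorch 粗略示意:用两个线性投影弥合隐藏维度差异,并冻结 LLM 层参数、只训练其余部分。投影方式与插入位置均为本文假设,论文实现可能不同。

```python
import torch.nn as nn

class PiFiBlock(nn.Module):
    """在 SLM 的隐藏表示上串接一个冻结的 LLM 层(示意)。"""
    def __init__(self, llm_layer, slm_dim, llm_dim):
        super().__init__()
        self.proj_in = nn.Linear(slm_dim, llm_dim)    # SLM 维度 -> LLM 维度
        self.llm_layer = llm_layer                    # 假设该层输入输出均为张量
        self.proj_out = nn.Linear(llm_dim, slm_dim)   # LLM 维度 -> SLM 维度
        for p in self.llm_layer.parameters():         # 冻结 LLM 层,仅微调其余部分
            p.requires_grad = False

    def forward(self, hidden_states):
        h = self.proj_in(hidden_states)
        h = self.llm_layer(h)
        return self.proj_out(h)
```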
zh
[NLP-67] SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation
【速读】: 该论文旨在解决现有文本到SQL(Text-to-SQL)研究中依赖BIRD数据集所存在的问题,该数据集假设用户已提供证据,但现实中非专家用户通常无法提供此类信息,且BIRD中的证据存在缺失或错误,影响模型性能。论文提出的解决方案是SEED(System for Evidence Extraction and Domain knowledge generation),其关键在于通过系统分析数据库模式、描述文件和值来自动生成证据,从而提升模型在无证据场景下的性能和实际应用能力。
链接: https://arxiv.org/abs/2506.07423
作者: Janghyeon Yun,Sang-goo Lee
机构: Seoul National University (首尔国立大学); IntelliSys (IntelliSys)
类目: Computation and Language (cs.CL)
备注:
Abstract:Text-to-SQL enables non-experts to retrieve data from databases by converting natural language queries into SQL. However, state-of-the-art text-to-SQL studies rely on the BIRD dataset, which assumes that evidence is provided along with questions. Although BIRD facilitates research advancements, it assumes that users have expertise and domain knowledge, contradicting the fundamental goal of text-to-SQL. In addition, human-generated evidence in BIRD contains defects, including missing or erroneous evidence, which affects model performance. To address this issue, we propose SEED (System for Evidence Extraction and Domain knowledge generation), an approach that automatically generates evidence to improve performance and practical usability in real-world scenarios. SEED systematically analyzes database schema, description files, and values to extract relevant information. We evaluated SEED on BIRD and Spider, demonstrating that it significantly improves SQL generation accuracy in the no-evidence scenario, and in some cases, even outperforms the setting where BIRD evidence is provided. Our results highlight that SEED-generated evidence not only bridges the gap between research and real-world deployment but also improves the adaptability and robustness of text-to-SQL models. Our code is available at this https URL
zh
[NLP-68] Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在实际应用中存在的隐性危害(Implicit Harm)问题,即模型对看似无害的输入产生错误回答,从而可能造成现实世界中的危害。解决方案的关键在于提出JailFlipBench基准测试框架,该框架旨在捕捉隐性危害,并通过结构化的四象限视角重新定义LLM的风险格局,同时开发初步的JailFlip攻击方法,在多种开源和黑盒LLMs上进行全面评估,揭示隐性危害的紧迫性和广泛性,推动更全面的LLM安全评估与对齐机制。
链接: https://arxiv.org/abs/2506.07402
作者: Yukai Zhou,Sibei Yang,Wenjie Wang
机构: ShanghaiTech University (上海科技大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about their security. While jailbreak attacks highlight failures under overtly harmful queries, they overlook a critical risk: incorrectly answering harmless-looking inputs can be dangerous and cause real-world harm (Implicit Harm). We systematically reformulate the LLM risk landscape through a structured quadrant perspective based on output factuality and input harmlessness, uncovering an overlooked high-risk region. To investigate this gap, we propose JailFlipBench, a benchmark that aims to capture implicit harm, spanning single-modal, multimodal, and factual extension scenarios with diverse evaluation metrics. We further develop initial JailFlip attack methodologies and conduct comprehensive evaluations across multiple open-source and black-box LLMs, showing that implicit harm presents immediate and urgent real-world risks and calling for broader LLM safety assessments and alignment beyond conventional jailbreak paradigms.
zh
[NLP-69] G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems
【速读】: 该论文旨在解决多智能体系统(MAS)中记忆架构不完善导致的自我进化能力受限问题,具体表现为现有MAS记忆机制过于简单,未能捕捉智能体间的复杂协作轨迹,并且缺乏跨试验和智能体特定的定制化能力。解决方案的关键在于提出G-Memory,这是一种受组织记忆理论启发的分层智能体记忆系统,通过三层图结构(洞察图、查询图和交互图)管理MAS的长期交互,实现双向记忆遍历以获取高层次的可泛化洞察和细粒度的交互轨迹,从而促进智能体团队的渐进式演化。
链接: https://arxiv.org/abs/2506.07398
作者: Guibin Zhang,Muxin Fu,Guancheng Wan,Miao Yu,Kun Wang,Shuicheng Yan
机构: 未知
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both high-level, generalizable insights that enable the system to leverage cross-trial knowledge, and fine-grained, condensed interaction trajectories that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to 20.89% and 10.12%, respectively, without any modifications to the original frameworks. Our code is available at this https URL.
zh
[NLP-70] Refusal-Feature-guided Teacher for Safe Finetuning via Data Filtering and Alignment Distillation
【速读】: 该论文旨在解决Finetuning-as-a-Service(微调即服务)中由于用户数据包含有害提示而导致大型语言模型(Large Language Models, LLMs)安全对齐退化的问题。现有方法未能从根本上过滤有害数据,而本文提出的解决方案关键在于利用从安全对齐的LLMs中获得的反映拒绝行为的方向性表示(称为拒绝特征,refusal feature),通过计算输入提示特征与拒绝特征之间的相似性来识别有害提示,并在微调过程中作为教师模型过滤有害内容并提炼对齐知识,从而有效减少有害输出并提升用户特定任务的微调精度。
链接: https://arxiv.org/abs/2506.07356
作者: Seokil Ham,Yubin Choi,Seungju Cho,Yujin Yang,Younghun Kim,Changick Kim
机构: Korea Advanced Institute of Science and Technology (KAIST)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recently, major AI service providers such as Google and OpenAI have introduced Finetuning-as-a-Service, which enables users to customize Large Language Models (LLMs) for specific downstream tasks using their own data. However, this service is vulnerable to degradation of LLM safety-alignment when user data contains harmful prompts. While some prior works address this issue, fundamentally filtering harmful data from user data remains unexplored. Motivated by our observation that a directional representation reflecting refusal behavior (called the refusal feature) obtained from safety-aligned LLMs can inherently distinguish between harmful and harmless prompts, we propose the Refusal-Feature-guided Teacher (ReFT). Our ReFT model is trained to identify harmful prompts based on the similarity between input prompt features and its refusal feature. During finetuning, the ReFT model serves as a teacher that filters harmful prompts from user data and distills alignment knowledge into the base model. Extensive experiments demonstrate that our ReFT-based finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for secure and reliable deployment of LLMs in Finetuning-as-a-Service.
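"拒绝特征"判别的核心计算十分直接:将输入提示的隐藏表示与拒绝方向做余弦相似度比较,超过阈值即视为有害提示。以下示意中,拒绝方向用"有害与无害提示激活均值之差"来近似,这是相关工作中的常见做法;该近似与阈值均为本文假设。

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """以有害/无害提示激活均值之差近似拒绝方向并归一化(常见做法,非论文原文细节)。"""
    d = np.mean(harmful_acts, axis=0) - np.mean(harmless_acts, axis=0)
    return d / np.linalg.norm(d)

def is_harmful(prompt_act, refusal_dir, threshold=0.3):
    """提示特征与拒绝特征的余弦相似度高于阈值则判为有害(阈值为假设值)。"""
    cos = float(prompt_act @ refusal_dir) / np.linalg.norm(prompt_act)
    return cos > threshold
```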
zh
[NLP-71] Improving LLM Reasoning through Interpretable Role-Playing Steering
【速读】: 该论文试图解决传统基于提示工程的角色扮演方法在稳定性和可解释性方面的不足。其解决方案的关键在于引入Sparse Autoencoder Role-Playing Steering (SRPS),通过识别和操控与角色扮演行为相关的内部模型特征,提取角色扮演提示中的潜在表示,并根据激活模式选择最相关特征,构建可注入模型残差流的控制向量,从而实现对特定角色行为的细粒度控制。
链接: https://arxiv.org/abs/2506.07335
作者: Anyi Wang,Dong Shu,Yifan Wang,Yunpu Ma,Mengnan Du
机构: Center for Information and Language Processing, LMU Munich (信息与语言处理中心,慕尼黑路德维希-马克西米利安大学); Northwestern University (西北大学); Munich Center for Machine Learning (慕尼黑机器学习中心); New Jersey Institute of Technology (新泽西理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 21 pages, 8 figures, 8 tables
Abstract:Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). However, existing methods primarily rely on prompt engineering, which often lacks stability and interpretability. In this paper, we introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model’s residual stream with controllable intensity. Our method enables fine-grained control over role-specific behavior and offers insights into how role information influences internal model activations. Extensive experiments across various reasoning benchmarks and model sizes demonstrate consistent performance gains. Notably, in the zero-shot chain-of-thought (CoT) setting, the accuracy of Llama3.1-8B on CSQA improves from 31.86% to 39.80%, while Gemma2-9B on SVAMP increases from 37.50% to 45.10%. These results highlight the potential of SRPS to enhance reasoning ability in LLMs, providing better interpretability and stability compared to traditional prompt-based role-playing.
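"将控制向量以可控强度注入模型残差流"可以用一个 PyTorch forward hook 示意:在指定层的输出上加 alpha * v。此处假设该层输出是单个张量;实际 Transformer 实现常返回元组,需按具体模型调整。

```python
import torch

def add_steering_hook(layer, steering_vector, alpha=4.0):
    """在 layer 的输出(残差流)上注入 alpha * v,实现可控的行为引导(示意,alpha 为假设强度)。"""
    v = steering_vector / steering_vector.norm()

    def hook(module, inputs, output):
        # 假设 output 是形如 [batch, seq, dim] 的张量;hook 的返回值会替换原输出
        return output + alpha * v

    return layer.register_forward_hook(hook)

# 用法示意:handle = add_steering_hook(model.layers[12], v);推理结束后 handle.remove()
```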
zh
[NLP-72] Reward Model Interpretability via Optimal and Pessimal Tokens
【速读】: 该论文试图解决奖励模型(reward model)在对齐大型语言模型与人类价值观过程中的可解释性不足问题,特别是其内部如何编码人类价值判断的机制尚未得到充分研究。解决方案的关键在于通过对其整个词汇空间内的响应进行详尽分析,揭示不同奖励模型在评分策略、对高/低分标记的编码差异、对提示框架的敏感性以及对高频标记的过度评价等方面的显著异质性与系统性偏差。这一方法为理解奖励模型的内在运作机制及其潜在偏见提供了新的视角。
链接: https://arxiv.org/abs/2506.07326
作者: Brian Christian,Hannah Rose Kirk,Jessica A.F. Thompson,Christopher Summerfield,Tsvetomira Dumbalska
机构: University of Oxford(牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted for publication in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), to appear June 2025
Abstract:Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves – which directly encode human value judgments by turning prompt-response pairs into scalar rewards – remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training – distortions that risk propagating through the downstream large language models now deployed to millions.
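"对词表中每个可能的单 token 回答穷举打分"的分析流程可示意如下:遍历词表,为 prompt+token 计算奖励分,取得分最高(optimal)与最低(pessimal)的 token。reward_model.score 接口为本文假设;实际分析应分批计算以控制开销。

```python
def optimal_and_pessimal_tokens(prompt, reward_model, tokenizer, top_k=10):
    """给词表中每个单 token 回答打奖励分,返回得分最高与最低的 token(示意)。"""
    scored = []
    for token_id in range(tokenizer.vocab_size):
        response = tokenizer.decode([token_id])
        scored.append((reward_model.score(prompt, response), response))
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_k], scored[-top_k:]  # (optimal, pessimal)
```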
zh
[NLP-73] ConfQA: Answer Only If You Are Confident
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成回答时产生事实性幻觉(hallucination)的问题。其核心解决方案是提出一种名为ConfQA的微调策略,通过在模型正确回答问题时鼓励其继续提供答案,而在不确定时引导其承认“我不确定”,从而显著降低幻觉率。该方法的关键在于两个因素:一是引入“仅在自信时回答”的抑制性提示,以明确引导模型行为;二是利用知识图谱中的简单事实性陈述作为置信度校准的依据,从而实现跨领域和问题类型的稳健泛化。
链接: https://arxiv.org/abs/2506.07309
作者: Yin Huang,Yifan Ethan Xu,Kai Sun,Vera Yan,Alicia Sun,Haidar Khan,Jimmy Nguyen,Mohammad Kachuee,Zhaojiang Lin,Yue Liu,Aaron Colak,Anuj Kumar,Wen-tau Yih,Xin Luna Dong
机构: 未知
类目: Computation and Language (cs.CL)
备注: 10 pages main content, 10 pages appendix, 5 figures, 7 tables
Abstract:Can we teach Large Language Models (LLMs) to refrain from hallucinating factual statements? In this paper we present a fine-tuning strategy that we call ConfQA, which can reduce the hallucination rate from 20-40% to under 5% across multiple factuality benchmarks. The core idea is simple: when the LLM answers a question correctly, it is trained to continue with the answer; otherwise, it is trained to admit “I am unsure”. But there are two key factors that make the training highly effective. First, we introduce a dampening prompt “answer only if you are confident” to explicitly guide the behavior, without which hallucination remains as high as 15%-25%. Second, we leverage simple factual statements, specifically attribute values from knowledge graphs, to help LLMs calibrate the confidence, resulting in robust generalization across domains and question types. Building on this insight, we propose the Dual Neural Knowledge framework, which seamlessly selects between internally parameterized neural knowledge and externally recorded symbolic knowledge based on ConfQA’s confidence. The framework enables potential accuracy gains to beyond 95%, while reducing unnecessary external retrievals by over 30%.
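ConfQA 训练数据的构造逻辑可以示意为:模型答对则以原答案为目标,答错则以"I am unsure"为目标,并在输入前加上摘要中的抑制性提示。答案匹配方式在此简化为字符串比较,提示的完整措辞为近似,仅作说明。

```python
DAMPENER = "answer only if you are confident"  # 摘要中提到的抑制性提示

def build_confqa_example(question, gold, model_answer):
    """答对 -> 以原答案为训练目标;答错 -> 以 "I am unsure" 为目标(示意)。"""
    prompt = f"Please {DAMPENER}.\nQuestion: {question}\nAnswer:"
    correct = model_answer.strip().lower() == gold.strip().lower()
    return {"prompt": prompt, "target": gold if correct else "I am unsure."}

print(build_confqa_example("What is the capital of France?", "Paris", "Lyon"))
# target 为 'I am unsure.'
```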
zh
[NLP-74] Subjectivity in the Annotation of Bridging Anaphora ACL2025
【速读】: 该论文试图解决在话语中识别桥接指代(bridging anaphora)及其先行语时存在的主观性导致的标注一致性不足的问题。其解决方案的关键在于对桥接实例的标注从三个层面进行探索:指代识别、先行语解析以及桥接子类型选择,并通过在GUM语料库测试集上进行标注试点,提出一种新的桥接子类型分类系统,以评估现有资源的标注充分性及标注者之间的一致性水平。
链接: https://arxiv.org/abs/2506.07297
作者: Lauren Levine,Amir Zeldes
机构: Georgetown University (乔治城大学)
类目: Computation and Language (cs.CL)
备注: LAW-XIX, ACL 2025 Workshop
Abstract:Bridging refers to the associative relationship between inferable entities in a discourse and the antecedents which allow us to understand them, such as understanding what “the door” means with respect to an aforementioned “house”. As identifying associative relations between entities is an inherently subjective task, it is difficult to achieve consistent agreement in the annotation of bridging anaphora and their antecedents. In this paper, we explore the subjectivity involved in the annotation of bridging instances at three levels: anaphor recognition, antecedent resolution, and bridging subtype selection. To do this, we conduct an annotation pilot on the test set of the existing GUM corpus, and propose a newly developed classification system for bridging subtypes, which we compare to previously proposed schemes. Our results suggest that some previous resources are likely to be severely under-annotated. We also find that while agreement on the bridging subtype category was moderate, annotator overlap for exhaustively identifying instances of bridging is low, and that many disagreements resulted from subjective understanding of the entities involved.
zh
[NLP-75] Exploring the Impact of Temperature on Large Language Models: Hot or Cold?
【速读】: 该论文试图解决温度参数在大型语言模型(Large Language Models, LLMs)中的影响及其在不同模型规模和精度下的表现问题,特别是在实际应用中如何选择最优温度以提升模型性能。其解决方案的关键在于提出一种基于BERT的温度选择器,该方法利用温度对模型性能的特定影响,针对给定提示(prompt)自动识别最优温度,从而显著提升小型和中型模型在SuperGLUE数据集上的表现。
链接: https://arxiv.org/abs/2506.07295
作者: Lujun Li,Lama Sleem,Niccolo’ Gentile,Geoffrey Nichil,Radu State
机构: University of Luxembourg (卢森堡大学); Foyer S.A. (Foyer S.A.)
类目: Computation and Language (cs.CL)
备注:
Abstract:The sampling temperature, a critical hyperparameter in large language models (LLMs), modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B–4B), medium (6B–13B), and large (40B–80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature – the point at which significant performance changes occur – increases with model size.
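温度的作用即"在 softmax 之前对 logits 除以 T":T<1 使分布更尖锐、更确定,T>1 使其更平坦、更随机。下面的小例子直接演示这一机制。

```python
import numpy as np

def sample_distribution(logits, temperature):
    """温度在 softmax 之前缩放 logits,从而重塑输出 token 的分布。"""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()               # 数值稳定
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.5]
print(sample_distribution(logits, 0.5))  # 更尖锐
print(sample_distribution(logits, 2.0))  # 更平坦
```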
zh
[NLP-76] Parsing the Switch: LLM -Based UD Annotation for Complex Code-Switched and Low-Resource Languages
【速读】: 该论文试图解决代码转换(code-switching)在句法分析中的复杂性问题,尤其是在低资源语言环境中标注数据稀缺的情况下。现有基于单语树库训练的解析器难以适应多语言和混合语言输入,因此无法有效处理代码转换文本。论文提出的解决方案是引入BiLingua Parser,这是一种基于大语言模型(LLM)的注释流程,旨在生成通用依存关系(UD)标注。其关键在于开发了一个基于提示的框架,结合少量样本的LLM提示与专家评审,并发布了两个标注数据集,从而显著提升了代码转换文本的句法分析性能。
链接: https://arxiv.org/abs/2506.07274
作者: Olga Kellert,Nemika Tyagi,Muhammad Imran,Nelvin Licona-Guevara,Carlos Gómez-Rodríguez
机构: Arizona State University (亚利桑那州立大学); Universidade da Coruña, CITIC (拉科鲁尼亚大学,CITIC)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at this https URL
zh
[NLP-77] Question Answering under Temporal Conflict: Evaluating and Organizing Evolving Knowledge with LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对动态演变的现实世界信息时,其知识更新成本高且效果有限的问题。传统方法如重新训练或上下文学习(In-Context Learning, ICL)在处理大规模和高频变化的信息时显得不切实际。论文提出的解决方案关键在于设计一种轻量级、自主的框架,通过从源文档中增量构建结构化的外部记忆,无需重新训练即可实现对时间过滤后的相关信息的检索与推理。该策略显著提升了模型在处理复杂推理任务和整合冲突事实方面的能力。
链接: https://arxiv.org/abs/2506.07270
作者: Atahan Özer,Çağatay Yıldız
机构: University of Tübingen (图宾根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) exhibit remarkable capabilities in question answering and reasoning thanks to their extensive parametric memory. However, their knowledge is inherently limited by the scope of their pre-training data, while real-world information evolves continuously. Updating this knowledge typically requires costly and brittle re-training, or in-context learning (ICL), which becomes impractical at scale given the volume and volatility of modern information. Motivated by these limitations, we investigate how LLMs perform when exposed to temporal text corpora, or documents that reflect evolving knowledge over time, such as sports biographies where facts like a player’s “current team” change year by year. To this end, we introduce two new benchmarks: Temporal Wiki, which captures factual drift across historical Wikipedia snapshots, and Unified Clark, which aggregates timestamped news articles to simulate real-world information accumulation. Our analysis reveals that LLMs often struggle to reconcile conflicting or outdated facts and can be misled when multiple versions of a fact appear in context. To address these issues, we propose a lightweight, agentic framework that incrementally builds a structured, external memory from source documents without requiring re-training. This knowledge organization strategy enables models to retrieve and reason over temporally filtered, relevant information at inference time. Empirically, our method outperforms ICL and RAG baselines across both benchmarks, especially on questions requiring more complex reasoning or integration of conflicting facts.
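"从文档增量构建带时间戳的结构化外部记忆,并在推理时做时间过滤"的思想可示意为一个极简的事实存储:同一(实体, 属性)保留各时间版本,查询时只返回不晚于给定时间的最新值,从而避免新旧事实相互干扰。数据结构与接口均为本文假设。

```python
from collections import defaultdict

class TemporalMemory:
    """按 (实体, 属性) 存储带时间戳的事实,查询时做时间过滤(示意)。"""
    def __init__(self):
        self.facts = defaultdict(list)  # (entity, attr) -> [(timestamp, value), ...]

    def add(self, entity, attr, value, timestamp):
        self.facts[(entity, attr)].append((timestamp, value))

    def query(self, entity, attr, as_of):
        """返回不晚于 as_of 的最新事实版本。"""
        versions = [fv for fv in self.facts[(entity, attr)] if fv[0] <= as_of]
        return max(versions)[1] if versions else None

mem = TemporalMemory()
mem.add("Messi", "current_team", "Barcelona", 2020)
mem.add("Messi", "current_team", "PSG", 2021)
print(mem.query("Messi", "current_team", as_of=2022))  # PSG
```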
zh
[NLP-78] Bias Attribution in Filipino Language Models: Extending a Bias Interpretability Metric for Application on Agglutinative Languages ACL2025
【速读】: 该论文试图解决语言模型在处理非英语文本时,尤其是黏着语(如菲律宾语)中偏见来源的可解释性问题。其解决方案的关键在于将信息论基础的偏见归因评分指标适配到处理黏着语的模型上,并通过在纯菲律宾语模型及三种多语言模型上的实验验证了该方法的有效性。研究揭示了菲律宾语模型中的偏见驱动因素主要与人、物体和关系相关,这与英语中以行为(如犯罪、性及亲社会行为)为主的偏见贡献主题形成对比。
链接: https://arxiv.org/abs/2506.07249
作者: Lance Calvin Lim Gamboa,Yue Feng,Mark Lee
机构: University of Birmingham (伯明翰大学); Ateneo de Manila University (马尼拉雅典耀大学)
类目: Computation and Language (cs.CL)
备注: Accepted into the Gender Bias in NLP Workshop at ACL 2025 (GeBNLP@ACL2025)
Abstract:Emerging research on bias attribution and interpretability have revealed how tokens contribute to biased behavior in language models processing English texts. We build on this line of inquiry by adapting the information-theoretic bias attribution score metric for implementation on models handling agglutinative languages, particularly Filipino. We then demonstrate the effectiveness of our adapted method by using it on a purely Filipino model and on three multilingual models: one trained on languages worldwide and two on Southeast Asian data. Our results show that Filipino models are driven towards bias by words pertaining to people, objects, and relationships, entity-based themes that stand in contrast to the action-heavy nature of bias-contributing themes in English (i.e., criminal, sexual, and prosocial behaviors). These findings point to differences in how English and non-English models process inputs linked to sociodemographic groups and bias.
zh
[NLP-79] Improving the Efficiency of Long Document Classification using Sentence Ranking Approach
【速读】: 该论文试图解决长文档分类中由于基于Transformer的模型(如BERT)计算限制所带来的问题,特别是其固定输入长度和二次注意力复杂度的局限性。此外,论文指出使用完整文档进行分类通常冗余,因为仅有一部分句子包含必要的信息。解决方案的关键在于提出一种基于TF-IDF的句子排序方法,通过选择最具信息量的内容来提高效率,该方法探索了固定数量和基于百分比的句子选择策略,并结合归一化TF-IDF得分与句子长度的增强评分策略。
链接: https://arxiv.org/abs/2506.07248
作者: Prathamesh Kokate,Mitali Sarnaik,Manavi Khopade,Raviraj Joshi
机构: Pune Institute of Computer Technology, Pune; Indian Institute of Technology Madras, Chennai; L3Cube Labs, Pune
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.
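基于 TF-IDF 的句子排序可用 scikit-learn 简单示意:对文档内各句计算 TF-IDF 总分并归一化,再与句长加权合成排序分,选出得分最高的 k 句并保持原文顺序。权重系数与归一化细节为假设,论文的增强评分策略可能有所不同。

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def select_sentences(sentences, k=3, length_weight=0.3):
    """归一化 TF-IDF 得分与句长加权求和作为句子分,取前 k 句并保持原文顺序(示意)。"""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # 每句视为一篇"文档"
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    scores = scores / (scores.max() + 1e-8)              # 归一化 TF-IDF 得分
    lengths = np.array([len(s.split()) for s in sentences], dtype=float)
    scores = scores + length_weight * lengths / (lengths.max() + 1e-8)
    top = sorted(np.argsort(scores)[-k:])                # 按原文顺序输出
    return [sentences[i] for i in top]
```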
zh
[NLP-80] SDE-SQL: Enhancing Text-to-SQL Generation in Large Language Models via Self-Driven Exploration with SQL Probes
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在Text-to-SQL任务中对数据库内容理解受限的问题,传统方法依赖于推理时提供的静态、预处理的数据库信息,限制了模型对数据库内容的全面理解。解决方案的关键在于提出SDE-SQL框架,该框架使LLMs能够在推理过程中进行自我驱动的数据库探索,通过生成并执行SQL探针,主动从数据库中检索信息并迭代更新对数据的理解,从而实现无需任何问题- SQL对作为上下文演示的零样本设置。
链接: https://arxiv.org/abs/2506.07245
作者: Wenxuan Xie,Yaxun Dai,Wenhao Jiang
机构: South China University of Technology (华南理工大学); Soochow University (苏州大学); Guangdong Laboratory of AI and Digital Economy (SZ) (广东省人工智能与数字经济实验室(深圳))
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models (LLMs) have significantly improved performance on the Text-to-SQL task. However, prior approaches typically rely on static, pre-processed database information provided at inference time, which limits the model’s ability to fully understand the database contents. Without dynamic interaction, LLMs are constrained to fixed, human-provided context and cannot autonomously explore the underlying data. To address this limitation, we propose SDE-SQL, a framework that enables large language models to perform self-driven exploration of databases during inference. This is accomplished by generating and executing SQL probes, which allow the model to actively retrieve information from the database and iteratively update its understanding of the data. Unlike prior methods, SDE-SQL operates in a zero-shot setting, without relying on any question-SQL pairs as in-context demonstrations. When evaluated on the BIRD benchmark with Qwen2.5-72B-Instruct, SDE-SQL achieves an 8.02% relative improvement in execution accuracy over the vanilla Qwen2.5-72B-Instruct baseline, establishing a new state-of-the-art among methods based on open-source models without supervised fine-tuning (SFT) or model ensembling. Moreover, with SFT, the performance of SDE-SQL can be further enhanced, yielding an additional 0.52% improvement.
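SDE-SQL"生成并执行 SQL 探针、迭代更新对数据库的理解"的主循环可示意如下:每轮让模型产出一条探索性查询,执行后把结果并入上下文,最后再生成最终 SQL。llm.generate 接口、提示措辞与探针轮数均为本文假设。

```python
import sqlite3

def sde_sql(question, schema, db_path, llm, max_probes=3):
    """自驱动探索:先用 SQL 探针了解数据库内容,再生成最终查询(示意)。"""
    conn = sqlite3.connect(db_path)
    context = f"Schema:\n{schema}\nQuestion: {question}\n"
    for _ in range(max_probes):
        probe = llm.generate(context + "Write one exploratory SQL probe:")
        try:
            rows = conn.execute(probe).fetchmany(5)
        except sqlite3.Error as e:
            rows = f"error: {e}"
        context += f"Probe: {probe}\nResult: {rows}\n"  # 迭代更新对数据的理解
    return llm.generate(context + "Now write the final SQL:")
```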
zh
[NLP-81] Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
【速读】: 该论文试图解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在推理过程中采用静态推理范式导致的无法迭代优化理解或适应上下文的问题。其关键解决方案是引入一种基于马尔可夫决策过程(Markov Decision Process)的推理时视觉标记扩展框架,通过一个推理器提出视觉动作,并由经过多步直接偏好优化(multi-step Direct Preference Optimization, DPO)训练的验证器评估这些动作并决定推理终止时机,从而实现动态、验证器引导的视觉内容推理。
链接: https://arxiv.org/abs/2506.07235
作者: Tianyi Bai,Zengjie Hu,Fupeng Sun,Jiantao Qiu,Yizhen Jiang,Guangxin He,Bohan Zeng,Conghui He,Binhang Yuan,Wentao Zhang
机构: HKUST(香港科技大学); Peking University(北京大学); Shanghai AI Lab(上海人工智能实验室); Imperial College London(帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Multi-modal large language models (MLLMs) have achieved remarkable capabilities by integrating visual perception with language understanding, enabling applications such as image-grounded dialogue, visual question answering, and scientific analysis. However, most MLLMs adopt a static inference paradigm, encoding the entire image into fixed visual tokens upfront, which limits their ability to iteratively refine understanding or adapt to context during inference. This contrasts sharply with human perception, which is dynamic, selective, and feedback-driven. In this work, we introduce a novel framework for inference-time visual token scaling that enables MLLMs to perform iterative, verifier-guided reasoning over visual content. We formulate the problem as a Markov Decision Process, involving a reasoner that proposes visual actions and a verifier, trained via multi-step Direct Preference Optimization (DPO), that evaluates these actions and determines when reasoning should terminate. To support this, we present a new dataset, VTS, comprising supervised reasoning trajectories (VTS-SFT) and preference-labeled reasoning comparisons (VTS-DPO). Our method significantly outperforms existing approaches across diverse visual reasoning benchmarks, offering not only improved accuracy but also more interpretable and grounded reasoning processes. These results demonstrate the promise of dynamic inference mechanisms for enabling fine-grained, context-aware visual reasoning in next-generation MLLMs.
zh
[NLP-82] Hallucination at a Glance: Controlled Visual Edits and Fine-Grained Multimodal Learning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理细粒度视觉差异时的不足,这些问题导致了幻觉或语义变化的遗漏。其解决方案的关键在于提出一种受控的数据生成流程,该流程能够生成语义对齐的最小编辑图像对,并基于此构建了包含超过50K图像-文本对的Micro Edit Dataset (MED)。此外,论文还引入了一个带有特征级一致性损失的监督微调框架,以提升小规模编辑下的视觉嵌入稳定性。
链接: https://arxiv.org/abs/2506.07227
作者: Tianyi Bai,Yuxuan Fan,Jiantao Qiu,Fupeng Sun,Jiayi Song,Junlin Han,Zichen Liu,Conghui He,Wentao Zhang,Binhang Yuan
机构: HKUST(香港科技大学); Shanghai AI Lab(上海人工智能实验室); Peking University(北京大学); HKUST(GZ)(香港科技大学(广州)); Oxford University(牛津大学); Imperial College London(帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks but still struggle with fine-grained visual differences, leading to hallucinations or missed semantic shifts. We attribute this to limitations in both training data and learning objectives. To address these issues, we propose a controlled data generation pipeline that produces minimally edited image pairs with semantically aligned captions. Using this pipeline, we construct the Micro Edit Dataset (MED), containing over 50K image-text pairs spanning 11 fine-grained edit categories, including attribute, count, position, and object presence changes. Building on MED, we introduce a supervised fine-tuning (SFT) framework with a feature-level consistency loss that promotes stable visual embeddings under small edits. We evaluate our approach on the Micro Edit Detection benchmark, which includes carefully balanced evaluation pairs designed to test sensitivity to subtle visual variations across the same edit categories. Our method improves difference detection accuracy and reduces hallucinations compared to strong baselines, including GPT-4o. Moreover, it yields consistent gains on standard vision-language tasks such as image captioning and visual question answering. These results demonstrate the effectiveness of combining targeted data and alignment objectives for enhancing fine-grained visual reasoning in MLLMs.
zh
[NLP-83] SAP-Bench: Benchmarking Multimodal Large Language Models in Surgical Action Planning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在手术动作规划(Surgical Action Planning, SAP)任务中的评估不足问题,特别是在复杂、长时程的手术操作中,现有基准无法充分评估模型对原子视觉动作的区分能力和协调复杂流程的能力。其解决方案的关键在于构建一个大规模、高质量的SAP-Bench数据集,该数据集包含经临床验证的手术动作片段和多模态分析锚点,并提出MLLM-SAP框架,通过注入手术领域知识,使MLLMs能够从当前手术场景和自然语言指令中生成下一步动作建议,从而提升模型在手术决策中的可靠性和准确性。
链接: https://arxiv.org/abs/2506.07196
作者: Mengya Xu,Zhongzhen Huang,Dillan Imans,Yiru Ye,Xiaofan Zhang,Qi Dou
机构: The Chinese University of Hong Kong, Hong Kong SAR, China (香港中文大学); Shanghai Jiao Tong University, Shanghai, China (上海交通大学); Sungkyunkwan University, Seoul, South Korea (成均馆大学); The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China (温州医科大学附属第一医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 11 pages, 4 figures
Abstract:Effective evaluation is critical for driving advancements in MLLM research. The surgical action planning (SAP) task, which aims to generate future action sequences from visual inputs, demands precise and sophisticated analytical capabilities. Unlike mathematical reasoning, surgical decision-making operates in life-critical domains and requires meticulous, verifiable processes to ensure reliability and patient safety. This task demands the ability to distinguish between atomic visual actions and coordinate complex, long-horizon procedures, capabilities that are inadequately evaluated by current benchmarks. To address this gap, we introduce SAP-Bench, a large-scale, high-quality dataset designed to enable multimodal large language models (MLLMs) to perform interpretable surgical action planning. Our SAP-Bench benchmark is derived from cholecystectomy procedures with a mean duration of 1137.5s, and introduces temporally-grounded surgical action annotations comprising 1,226 clinically validated action clips (mean duration: 68.7s) that capture five fundamental surgical actions across 74 procedures. The dataset provides 1,152 strategically sampled current frames, each paired with the corresponding next action as multimodal analysis anchors. We propose the MLLM-SAP framework that leverages MLLMs to generate next action recommendations from the current surgical scene and natural language instructions, enhanced with injected surgical domain knowledge. To assess our dataset’s effectiveness and the broader capabilities of current models, we evaluate seven state-of-the-art MLLMs (e.g., OpenAI-o1, GPT-4o, QwenVL2.5-72B, Claude-3.5-Sonnet, GeminiPro2.5, Step-1o, and GLM-4v) and reveal critical gaps in next action prediction performance.
zh
[NLP-84] Mitigating Behavioral Hallucination in Multimodal Large Language Models for Sequential Images
【速读】: 该论文旨在解决多模态大语言模型在处理序列图像时存在的行为幻觉(behavioral hallucination)问题,这一问题尚未得到充分研究。解决方案的关键在于提出SHE(Sequence Hallucination Eradication)框架,该框架通过两个阶段实现:首先利用自适应时间窗口进行视觉-文本对齐检查以检测幻觉,其次通过在联合嵌入空间上的正交投影来减轻幻觉。此外,研究还引入了BEACH度量标准以量化行为幻觉的严重程度。
链接: https://arxiv.org/abs/2506.07184
作者: Liangliang You,Junchi Yao,Shu Yang,Guimin Hu,Lijie Hu,Di Wang
机构: PRADA Lab (Provable Responsible AI and Data Analytics Lab); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学); University of Electronic Science and Technology of China (中国电子科技大学); University of Copenhagen (哥本哈根大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While multimodal large language models excel at various tasks, they still suffer from hallucinations, which limit their reliability and scalability for broader domain applications. To address this issue, recent research mainly focuses on objective hallucination. However, for sequential images, besides objective hallucination, there is also behavioral hallucination, which is less studied. This work aims to fill in the gap. We first reveal that behavioral hallucinations mainly arise from two key factors: prior-driven bias and the snowball effect. Based on these observations, we introduce SHE (Sequence Hallucination Eradication), a lightweight, two-stage framework that (1) detects hallucinations via visual-textual alignment check using our proposed adaptive temporal window and (2) mitigates them via orthogonal projection onto the joint embedding space. We also propose a new metric (BEACH) to quantify behavioral hallucination severity. Empirical results on standard benchmarks demonstrate that SHE reduces behavioral hallucination by over 10% on BEACH while maintaining descriptive accuracy.
zh
[NLP-85] Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
【速读】: 该论文试图解决视频大语言模型(Video-LLMs)在面对误导性用户输入时表现出的迎合倾向(sycophancy)问题,这种倾向会导致模型违背视觉证据而盲目迎合用户意图,从而影响其事实一致性和可靠性。解决方案的关键在于提出VISE(Video-LLM Sycophancy Benchmarking and Evaluation),这是首个专门用于评估先进Video-LLMs在多种问题格式、提示偏差和视觉推理任务中迎合行为的基准,并首次将语言学视角引入视觉领域,实现对不同类型的迎合行为及交互模式的细粒度分析。此外,研究还探索了关键帧选择作为一种无需训练的可解释缓解策略,为减少迎合偏差提供了潜在路径。
链接: https://arxiv.org/abs/2506.07180
作者: Wenrui Zhou,Shu Yang,Qingsong Yang,Zikun Guo,Lijie Hu,Di Wang
机构: King Abdullah University of Science and Technology (沙特阿拉伯国王阿卜杜拉科技大学); University of Science and Technology of China (中国科学技术大学); Kyungpook National University (庆北国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages
Abstract:As video large language models (Video-LLMs) become increasingly integrated into real-world applications that demand grounded multimodal reasoning, ensuring their factual consistency and reliability is of critical importance. However, sycophancy, the tendency of these models to align with user input even when it contradicts the visual evidence, undermines their trustworthiness in such contexts. Current sycophancy research has largely overlooked its specific manifestations in the video-language domain, resulting in a notable absence of systematic benchmarks and targeted evaluations to understand how Video-LLMs respond under misleading user input. To fill this gap, we propose VISE (Video-LLM Sycophancy Benchmarking and Evaluation), the first dedicated benchmark designed to evaluate sycophantic behavior in state-of-the-art Video-LLMs across diverse question formats, prompt biases, and visual reasoning tasks. Specifically, VISE pioneeringly brings linguistic perspectives on sycophancy into the visual domain, enabling fine-grained analysis across multiple sycophancy types and interaction patterns. In addition, we explore key-frame selection as an interpretable, training-free mitigation strategy, which reveals potential paths for reducing sycophantic bias by strengthening visual grounding.
zh
[NLP-86] RULE: Reinforcement UnLEarning Achieves Forget-Retain Pareto Optimality
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练过程中可能包含敏感、受版权保护或非法内容的问题,旨在实现模型的针对性遗忘(unlearning),即在不重新训练模型或损害整体性能的前提下,选择性地移除特定信息。解决方案的关键在于提出一种名为Reinforcement UnLearning (RULE) 的高效框架,该框架将遗忘任务建模为拒绝边界优化问题,并通过少量遗忘数据和合成边界查询进行训练,利用可验证的奖励函数,在保证模型对合法输入提供帮助性响应的同时,鼓励对遗忘相关查询的合理拒绝,从而实现高效的遗忘与保留平衡。
链接: https://arxiv.org/abs/2506.07171
作者: Chenlong Zhang,Zhuoran Jin,Hongbang Yuan,Jiaheng Wei,Tong Zhou,Kang Liu,Jun Zhao,Yubo Chen
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Paper under review
Abstract:The widespread deployment of Large Language Models (LLMs) trained on massive, uncurated corpora has raised growing concerns about the inclusion of sensitive, copyrighted, or illegal content. This has led to increasing interest in LLM unlearning: the task of selectively removing specific information from a model without retraining from scratch or degrading overall utility. However, existing methods often rely on large-scale forget and retain datasets, and suffer from unnatural responses, poor generalization, or catastrophic utility loss. In this work, we propose Reinforcement UnLearning (RULE), an efficient framework that formulates unlearning as a refusal boundary optimization problem. RULE is trained with a small portion of the forget set and synthesized boundary queries, using a verifiable reward function that encourages safe refusal on forget-related queries while preserving helpful responses on permissible inputs. We provide both theoretical and empirical evidence demonstrating the effectiveness of RULE in achieving targeted unlearning without compromising model utility. Experimental results show that, with only 12% of the forget set and 8% synthesized boundary data, RULE outperforms existing baselines by up to 17.5% in forget quality and 16.3% in response naturalness while maintaining general utility, achieving forget-retain Pareto optimality. Remarkably, we further observe that RULE improves the naturalness of model outputs, enhances training efficiency, and exhibits strong generalization ability, generalizing refusal behavior to semantically related but unseen queries.
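RULE 中"可验证奖励"的结构可以示意为一个简单的奖励函数:遗忘相关查询奖励安全拒绝,许可查询奖励有用回答。拒绝行为在此用关键词粗略判别,is_helpful 假定由外部校验器给出;这些细节与奖励取值均为说明性假设,论文中的奖励函数更为严格。

```python
REFUSAL_MARKERS = ("i cannot", "i can't", "i am not able", "i'm not able")  # 粗略判别拒绝(假设)

def rule_reward(query_is_forget, response, is_helpful):
    """遗忘相关查询 -> 奖励安全拒绝;许可输入 -> 奖励有用且非拒绝的回答(示意)。"""
    refused = response.strip().lower().startswith(REFUSAL_MARKERS)
    if query_is_forget:
        return 1.0 if refused else -1.0
    return 1.0 if (is_helpful and not refused) else -1.0
```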
zh
[NLP-87] CTDGSI: A comprehensive exploitation of instance selection methods for automatic text classification. VII Concurso de Teses Dissertações e Trabalhos de Graduação em SI – XXI Simpósio Brasileiro de Sistemas de Informação
【速读】: 该论文旨在解决在自然语言处理(Natural Language Processing, NLP)中,针对特定应用训练或微调大型密集模型时所需的大量计算资源问题。其核心解决方案是通过实例选择(Instance Selection, IS)技术减少训练集规模,同时保持模型的有效性并降低训练成本。论文的关键在于提出两种新型的IS方法,这些方法注重噪声和冗余,特别适用于大规模数据集和Transformer架构,最终实现了平均41%的训练集规模缩减,并在多个数据集上保持了相同的模型性能,同时提升了训练速度。
链接: https://arxiv.org/abs/2506.07169
作者: Washington Cunha,Leonardo Rocha,Marcos André Gonçalves
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures, 2 tables
Abstract:Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power and more complexity, best exemplified by the Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. This Ph.D. dissertation focuses on an under-investigated NLP data engineering technique, whose potential is enormous in the current scenario, known as Instance Selection (IS). The IS goal is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the training process cost. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task – Automatic Text Classification (ATC), considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. We also propose two novel IS solutions that are noise-oriented and redundancy-aware, specifically designed for large datasets and transformer architectures. Our final solution achieved an average reduction of 41% in training sets, while maintaining the same levels of effectiveness in all datasets. Importantly, our solutions demonstrated speedup improvements of 1.67x (up to 2.46x), making them scalable for datasets with hundreds of thousands of documents.
zh
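作为理解实例选择(IS)思路的入门示意,下面给出一个冗余感知的贪心筛选草图(并非论文提出的两种新方法):用 TF-IDF 余弦相似度剔除近重复样本,阈值 0.9 为假设值。

```python
# 简化示意:基于 TF-IDF 余弦相似度的冗余感知实例选择。
# 贪心保留与已选样本相似度低于阈值的实例,从而缩减训练集规模。
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_instances(texts, threshold=0.9):
    X = TfidfVectorizer().fit_transform(texts)
    kept = []
    for i in range(X.shape[0]):
        if not kept:
            kept.append(i)
            continue
        sims = cosine_similarity(X[i], X[kept])
        if sims.max() < threshold:  # 与所有已选实例都不够相似才保留
            kept.append(i)
    return kept

docs = ["cheap flights to paris", "cheap flights to paris", "how to bake bread"]
print(select_instances(docs))  # [0, 2]:完全重复的第 1 条被剔除
```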
[NLP-88] Efficient Text-Attributed Graph Learning through Selective Annotation and Graph Alignment
【速读】: 该论文旨在解决文本属性图(Text-attributed Graphs, TAGs)中传统图神经网络(GNNs)因节点复杂文本信息而表现不佳的问题。现有方法虽通过大型语言模型(LLMs)增强节点文本特征,但通常需要对所有节点进行大量标注或微调,成本高且耗时。论文提出的解决方案GAGA的关键在于通过仅标注具有代表性的节点和边来减少标注时间和成本,并构建一个捕捉这些标注之间拓扑关系的注释图,同时采用双层对齐模块将注释图与TAG的有效结构进行对齐,从而实现高效的表示学习。
链接: https://arxiv.org/abs/2506.07168
作者: Huanyi Xie,Lijie Hu,Lu Yu,Tianhao Huang,Longfei Li,Meng Li,Jun Zhou,Huan Wang,Di Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages
Abstract:In the realm of Text-attributed Graphs (TAGs), traditional graph neural networks (GNNs) often fall short due to the complex textual information associated with each node. Recent methods have improved node representations by leveraging large language models (LLMs) to enhance node text features, but these approaches typically require extensive annotations or fine-tuning across all nodes, which is both time-consuming and costly. To overcome these challenges, we introduce GAGA, an efficient framework for TAG representation learning. GAGA reduces annotation time and cost by focusing on annotating only representative nodes and edges. It constructs an annotation graph that captures the topological relationships among these annotations. Furthermore, GAGA employs a two-level alignment module to effectively integrate the annotation graph with the TAG, aligning their underlying structures. Experiments show that GAGA achieves classification accuracies on par with or surpassing state-of-the-art methods while requiring only 1% of the data to be annotated, demonstrating its high efficiency.
zh
[NLP-89] GeometryZero: Improving Geometry Solving for LLM with Group Contrastive Policy Optimization
【速读】: 该论文试图解决几何问题求解中辅助构造(auxiliary construction)与鲁棒几何推理结合的难题,尤其是在小型模型上实现高效且有效的几何推理能力。现有方法要么性能不佳,要么依赖于大规模语言模型(如GPT-4o),导致计算成本高昂。论文提出的解决方案关键在于引入一种新的强化学习框架——Group Contrastive Policy Optimization (GCPO),其核心创新包括:(1)基于上下文效用自适应提供正负奖励信号的Group Contrastive Masking,以指导辅助构造的有效性;(2)促进更长推理链的长度奖励机制。基于GCPO,作者构建了GeometryZero系列模型,能够合理判断何时使用辅助构造,从而在多个几何基准测试中显著优于基线方法。
链接: https://arxiv.org/abs/2506.07160
作者: Yikun Wang,Yibin Wang,Dianyi Wang,Zimian Peng,Qipeng Guo,Dacheng Tao,Jiaqi Wang
机构: Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院); Zhejiang University (浙江大学); Shanghai AI Laboratory (上海人工智能实验室); Nanyang Technological University (南洋理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, particularly in mathematical reasoning, amid which geometry problem solving remains a challenging area where auxiliary construction plays an essential role. Existing approaches either achieve suboptimal performance or rely on massive LLMs (e.g., GPT-4o), incurring massive computational costs. We posit that reinforcement learning with verifiable reward (e.g., GRPO) offers a promising direction for training smaller models that effectively combine auxiliary construction with robust geometric reasoning. However, directly applying GRPO to geometric reasoning presents fundamental limitations due to its dependence on unconditional rewards, which leads to indiscriminate and counterproductive auxiliary constructions. To address these challenges, we propose Group Contrastive Policy Optimization (GCPO), a novel reinforcement learning framework featuring two key innovations: (1) Group Contrastive Masking, which adaptively provides positive or negative reward signals for auxiliary construction based on contextual utility, and (2) a length reward that promotes longer reasoning chains. Building on GCPO, we develop GeometryZero, a family of affordable-size geometric reasoning models that judiciously determine when to employ auxiliary construction. Our extensive empirical evaluation across popular geometric benchmarks (Geometry3K, MathVista) demonstrates that GeometryZero models consistently outperform baselines (e.g. GRPO), achieving an average improvement of 4.29% across all benchmarks.
zh
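下面用 numpy 示意“组内相对优势 + 长度奖励”的计算骨架(GCPO 的对比掩码细节从略,len_coef 等数值均为假设):

```python
# 简化示意:GRPO/GCPO 风格的组内相对优势计算,叠加鼓励较长推理链的长度奖励。
import numpy as np

def group_advantages(rewards, lengths, max_len=2048, len_coef=0.1):
    r = np.asarray(rewards, dtype=float)
    # 长度奖励:推理链相对上限越长奖励越高,len_coef 为假设的权重
    r = r + len_coef * np.asarray(lengths) / max_len
    # 组内标准化:同一问题的一组 rollout 互为基线
    return (r - r.mean()) / (r.std() + 1e-8)

# 同一几何题的 4 条 rollout:答案正确性奖励 + 推理长度
print(group_advantages(rewards=[1, 0, 0, 1], lengths=[900, 300, 500, 1200]))
```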
[NLP-90] Syntactic Control of Language Models by Posterior Inference
【速读】: 该论文试图解决语言模型生成文本时难以精确控制句法结构的问题,这一问题在需要清晰性、风格一致性或可解释性的应用中尤为关键。解决方案的关键在于利用基于后验推断的采样算法,在生成过程中强制实施目标成分结构,具体方法结合了序列蒙特卡洛方法与句法标注器,通过从提议分布中采样来估计后验分布,并确保每个生成的词符符合预期的句法结构。实验结果表明,该方法在不牺牲语言模型流畅性的情况下显著提升了句法准确性。
链接: https://arxiv.org/abs/2506.07154
作者: Vicky Xefteri,Tim Vieira,Ryan Cotterell,Afra Amini
机构: ETH Zurich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability, yet it remains a challenging task. In this paper, we argue that sampling algorithms based on the posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure. Our experiments with GPT2 and Llama3-8B models show that with an appropriate proposal distribution, we can improve syntactic accuracy, increasing the F1 score from 12.31 (GPT2-large) and 35.33 (Llama3-8B) to about 93 in both cases without compromising the language model’s fluency. These results underscore both the complexity of syntactic control and the effectiveness of sampling algorithms, offering a promising approach for applications where precise control over syntax is essential.
zh
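下面是一个玩具级的 SMC 句法控制草图(非论文实现):用一个小词表模拟语言模型的提议分布,用词性表充当句法标注器,按与目标结构的匹配度加权并重采样;所有名称与数值均为示例假设。

```python
# 玩具级示意:用序贯蒙特卡洛(SMC)在生成中施加句法约束。
# 真实方法以 LM 为提议分布、句法标注器打分;此处用词表+词性表模拟。
import random
random.seed(0)

VOCAB = {"dogs": "NOUN", "run": "VERB", "fast": "ADV", "blue": "ADJ"}
TARGET = ["NOUN", "VERB", "ADV"]          # 目标句法结构
N_PARTICLES = 8

particles = [[] for _ in range(N_PARTICLES)]
for want_pos in TARGET:
    weights = []
    for p in particles:
        tok = random.choice(list(VOCAB))   # 提议分布(真实场景:LM 采样)
        p.append(tok)
        # 权重:标注器认为该词符合目标结构则为 1,否则为很小的值
        weights.append(1.0 if VOCAB[tok] == want_pos else 1e-3)
    total = sum(weights)
    probs = [w / total for w in weights]
    # 按权重重采样,使后验集中在满足句法结构的序列上
    particles = [random.choices(particles, probs)[0][:] for _ in range(N_PARTICLES)]

print(particles[0])  # 大概率得到形如 ['dogs', 'run', 'fast'] 的序列
```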
[NLP-91] Semantic-preserved Augmentation with Confidence-weighted Fine-tuning for Aspect Category Sentiment Analysis
【速读】: 该论文旨在解决低资源场景下数据稀缺问题,特别是针对方面类别情感分析(Aspect Category Sentiment Analysis, ACSA)任务中数据不足带来的挑战。其解决方案的关键在于提出一种数据增强策略,通过提供结构化的提示模板引导生成式 AI (Generative AI) 生成具有语言多样性和保留原始句义的预定义内容,并结合后处理技术确保生成句子与原句的语义一致性。该方法扩展了训练分布的语义覆盖范围,提升了模型对方面类别与情感极性之间关系的理解能力。
链接: https://arxiv.org/abs/2506.07148
作者: Yaping Chai,Haoran Xie,Joe S. Qin
机构: Lingnan University(岭南大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 7 figures, 4 tables
Abstract:Large language model (LLM) is an effective approach to addressing data scarcity in low-resource scenarios. Recent existing research designs hand-crafted prompts to guide LLM for data augmentation. We introduce a data augmentation strategy for the aspect category sentiment analysis (ACSA) task that preserves the original sentence semantics and has linguistic diversity, specifically by providing a structured prompt template for an LLM to generate predefined content. In addition, we employ a post-processing technique to further ensure semantic consistency between the generated sentence and the original sentence. The augmented data increases the semantic coverage of the training distribution, enabling the model better to understand the relationship between aspect categories and sentiment polarities, enhancing its inference capabilities. Furthermore, we propose a confidence-weighted fine-tuning strategy to encourage the model to generate more confident and accurate sentiment polarity predictions. Compared with powerful and recent works, our method consistently achieves the best performance on four benchmark datasets over all baselines.
zh
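其中“置信度加权微调”可理解为按样本置信度对交叉熵损失加权。下面用 PyTorch 给出一个极简示意(置信度此处假设为外部给定,实际可来自模型自身的预测概率):

```python
# 简化示意:置信度加权微调,按样本置信度对交叉熵损失加权。
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, labels, confidence):
    # logits: [batch, num_classes]; labels: [batch]; confidence: [batch],取值 (0,1]
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (confidence * per_sample).mean()

logits = torch.randn(4, 3)                 # 4 个样本、3 类情感极性
labels = torch.tensor([0, 2, 1, 0])
conf = torch.tensor([0.9, 0.5, 0.8, 1.0])  # 高置信样本对梯度贡献更大
print(confidence_weighted_loss(logits, labels, conf))
```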
[NLP-92] Prompting Science Report 2: The Decreasing Value of Chain of Thought in Prompting
【速读】: 该论文旨在帮助商业、教育和政策领域的领导者理解在实际应用中与人工智能(Artificial Intelligence, AI)交互的技术细节,特别是通过严格的测试来评估Chain-of-Thought (CoT) prompting技术的有效性。其解决方案的关键在于系统性地分析CoT提示方法在不同任务类型和模型架构下的表现,揭示其在提升推理任务性能方面的局限性和潜在副作用,从而为实际应用提供更全面的指导。
链接: https://arxiv.org/abs/2506.07142
作者: Lennart Meincke,Ethan Mollick,Lilach Mollick,Dan Shapiro
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This is the second in a series of short reports that seek to help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. In this report, we investigate Chain-of-Thought (CoT) prompting, a technique that encourages a large language model (LLM) to “think step by step” (Wei et al., 2022). CoT is a widely adopted method for improving reasoning tasks; however, our findings reveal a more nuanced picture of its effectiveness. We demonstrate two things: - The effectiveness of Chain-of-Thought prompting can vary greatly depending on the type of task and model. For non-reasoning models, CoT generally improves average performance by a small amount, particularly if the model does not inherently engage in step-by-step processing by default. However, CoT can introduce more variability in answers, sometimes triggering occasional errors in questions the model would otherwise get right. We also found that many recent models perform some form of CoT reasoning even if not asked; for these models, a request to perform CoT had little impact. Performing CoT generally requires far more tokens (increasing cost and time) than direct answers. - For models designed with explicit reasoning capabilities, CoT prompting often results in only marginal, if any, gains in answer accuracy. However, it significantly increases the time and tokens needed to generate a response.
zh
[NLP-93] Learning Compact Vision Tokens for Efficient Large Multimodal Models
【速读】: 该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在计算上的挑战,特别是由于大语言模型(Large Language Models, LLMs)的高成本以及处理长视觉标记序列的二次复杂性所带来的问题。解决方案的关键在于通过空间冗余分析缩短视觉标记序列的长度,从而加速推理过程。具体而言,提出了一种空间标记融合(Spatial Token Fusion, STF)方法,将相邻的空间标记融合为一个,以生成紧凑的视觉标记序列;同时引入多块标记融合(Multi-Block Token Fusion, MBTF)模块,以补充细化后的标记序列的多粒度特征,从而在减少标记数量的同时保持信息完整性,实现推理效率与多模态推理能力的平衡。
链接: https://arxiv.org/abs/2506.07138
作者: Hao Tang,Chengchao Shen
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: The source code and trained weights are available at this https URL
Abstract:Large multimodal models (LMMs) suffer significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for short vision token sequence, where spatial-adjacent tokens are fused into one. Meanwhile, weight-frozen vision encoder can not well adapt to the demand of extensive downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine STF and MBTF module to balance token reduction and information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only 25% vision tokens of baseline. The source code and trained weights are available at this https URL.
zh
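STF 的核心操作可以用一次 2×2 池化来近似示意(真实方法为可学习的融合,此处仅用均值演示序列长度如何降为基线的 25%):

```python
# 简化示意:空间标记融合(STF),把 2x2 相邻的视觉 token 平均为 1 个,
# 将序列长度缩减为原来的 1/4。
import torch

def spatial_token_fusion(tokens, grid=24):
    # tokens: [batch, grid*grid, dim] 的视觉 token 序列
    b, n, d = tokens.shape
    x = tokens.view(b, grid, grid, d)
    x = x.view(b, grid // 2, 2, grid // 2, 2, d).mean(dim=(2, 4))  # 2x2 池化
    return x.reshape(b, (grid // 2) ** 2, d)

vis = torch.randn(1, 24 * 24, 1024)        # 576 个视觉 token(LLaVA-1.5 规格)
print(spatial_token_fusion(vis).shape)      # torch.Size([1, 144, 1024]),即 25% token
```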
[NLP-94] Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在自然语言推理任务中推理过程脆弱且难以解释的问题。现有提示技术如链式思维(Chain-of-Thought, CoT)虽然通过激发中间推理步骤或聚合多个输出来提高可靠性,但缺乏对逻辑结构的强制和内部一致性评估机制。该论文提出的解决方案是Theorem-of-Thought (ToTh),其关键在于将推理建模为三个并行代理之间的协作,每个代理模拟不同的推理模式:溯因、演绎和归纳,生成结构化的形式化推理图,并通过基于自然语言推理(Natural Language Inference, NLI)的贝叶斯信念传播评估一致性,最终选择最连贯的图以得出答案。
链接: https://arxiv.org/abs/2506.07106
作者: Samir Abdaljalil,Hasan Kurban,Khalid Qaraqe,Erchin Serpedin
机构: Texas A&M University (德克萨斯A&M大学); Hamad Bin Khalifa University (哈马德本哈利法大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at this https URL.
zh
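ToTh 的“最连贯图”选择步骤可简化示意如下(NLI 打分与贝叶斯信念传播从略,用步级置信度的对数和近似图的连贯性;数值均为虚构):

```python
# 简化示意:三个推理代理各产出一条带步级置信度的推理链,
# 以对数置信度之和作为连贯性得分,选取最连贯的一条。
import math

traces = {
    "abductive": {"answer": "A", "step_conf": [0.9, 0.7, 0.8]},
    "deductive": {"answer": "B", "step_conf": [0.95, 0.9, 0.92]},
    "inductive": {"answer": "B", "step_conf": [0.6, 0.8]},
}

def coherence(conf):
    return sum(math.log(c) for c in conf)   # 置信度乘积的对数,避免下溢

best = max(traces, key=lambda k: coherence(traces[k]["step_conf"]))
print(best, traces[best]["answer"])          # deductive B
```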
[NLP-95] How Far Are We from Optimal Reasoning Efficiency?
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在推理过程中产生的冗长且低效的思维链(Chain-of-Thought, CoT)问题,该问题导致推理成本过高并限制了实际应用。其关键解决方案是引入推理效率前沿(Reasoning Efficiency Frontiers),作为通过微调基础LRM所得的实证上界,并提出统一的推理效率差距(Reasoning Efficiency Gap, REG)指标,用于量化任何微调LRM与该前沿的偏离程度。进一步地,论文提出REO-RL方法,通过强化学习优化稀疏的token预算,以最小化REG,从而在保持准确性的同时显著提升推理效率。
链接: https://arxiv.org/abs/2506.07104
作者: Jiaxuan Gao,Shu Yan,Qixin Tan,Lu Yang,Shusheng Xu,Wei Fu,Zhiyu Mei,Kaifeng Lyu,Yi Wu
机构: IIIS, Tsinghua University (清华大学); Ant Research (蚂蚁集团); Simons Institute, UC Berkeley (伯克利大学西蒙斯研究所); Nanjing University (南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning base LRMs across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying deviations of any fine-tuned LRMs from these frontiers. Systematic evaluation on challenging mathematical benchmarks reveals significant gaps in current methods: they either sacrifice accuracy for short length or still remain inefficient under tight token budgets. To reduce the efficiency gap, we propose REO-RL, a class of Reinforcement Learning algorithms that minimizes REG by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error using a small set of token budgets. Through systematic benchmarking, we demonstrate that our efficiency metric, REG, effectively captures the accuracy-length trade-off, with low-REG methods reducing length while maintaining accuracy. Our approach, REO-RL, consistently reduces REG by ≥50% across all evaluated LRMs and matches Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy loss. Ablation studies confirm the effectiveness of our exponential token budget strategy. Finally, our findings highlight that fine-tuning LRMs to perfectly align with the efficiency frontiers remains an open challenge.
zh
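REG 的直觉是“预算-准确率”曲线与效率前沿之间的(归一化)面积差,可用梯形积分近似。下面给出一个数值示意(曲线数值均为虚构):

```python
# 简化示意:在若干 token 预算下,用梯形法积分近似被评估模型
# 的准确率曲线与效率前沿之间的面积差,即 REG。
import numpy as np

budgets = np.array([1024, 2048, 4096, 8192, 16384], dtype=float)  # 稀疏预算集合
frontier_acc = np.array([0.42, 0.55, 0.66, 0.72, 0.75])  # 效率前沿(微调上界)
model_acc = np.array([0.30, 0.41, 0.55, 0.66, 0.73])     # 被评估的 LRM

gap = frontier_acc - model_acc                            # 各预算下与前沿的差距
area = np.sum((gap[:-1] + gap[1:]) / 2 * np.diff(budgets))  # 梯形法积分
reg = area / (budgets[-1] - budgets[0])                   # 归一化为平均差距
print(f"REG ≈ {reg:.4f}")
```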
[NLP-96] Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing
【速读】: 该论文旨在解决多模态情感计算中现有方法依赖单模态分析或简单跨模态信息融合,无法有效捕捉不同模态间复杂且冲突的证据的问题。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的方法,该方法显式地将视觉和文本表示分解为共享(模态不变)和模态特定成分,并通过注意力机制整合分解后的信号,生成动态软提示以提升多模态情感理解的性能。
链接: https://arxiv.org/abs/2506.07086
作者: Yuanhe Tian,Pengsen Cheng,Guoqing Jin,Lei Zhang,Yan Song
机构: University of Washington (华盛顿大学); Sichuan University (四川大学); People’s Daily Online (人民日报社); University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 4 figures
Abstract:Multi-modal affective computing aims to automatically recognize and interpret human attitudes from diverse data sources such as images and text, thereby enhancing human-computer interaction and emotion understanding. Existing approaches typically rely on unimodal analysis or straightforward fusion of cross-modal information that fail to capture complex and conflicting evidence presented across different modalities. In this paper, we propose a novel LLM-based approach for affective computing that explicitly deconstructs visual and textual representations into shared (modality-invariant) and modality-specific components. Specifically, our approach firstly encodes and aligns input modalities using pre-trained multi-modal encoders, then employs a representation decomposition framework to separate common emotional content from unique cues, and finally integrates these decomposed signals via an attention mechanism to form a dynamic soft prompt for a multi-modal LLM. Extensive experiments on three representative tasks for affective computing, namely, multi-modal aspect-based sentiment analysis, multi-modal emotion analysis, and hateful meme detection, demonstrate the effectiveness of our approach, which consistently outperforms strong baselines and state-of-the-art models.
zh
[NLP-97] Com2: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models ACL2025
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂且隐含的常识推理任务时表现不足的问题,这类任务通常涉及从简单常识推导出长期影响等深层次理解,而这是人类更关注的方面。解决方案的关键在于构建一个名为Com²的基准测试集,通过引入因果事件图(causal event graphs)作为结构化的复杂常识表示,并利用因果理论(如干预)对因果事件图进行修改以生成符合人类关注点的不同情景,最后采用LLM进行基于逻辑关系的慢思考合成示例,从而提升模型在推理深度和广度上的表现。
链接: https://arxiv.org/abs/2506.07064
作者: Kai Xiong,Xiao Ding,Yixin Cao,Yuxiong Yan,Li Du,Yufei Zhang,Jinglong Gao,Jiaqian Liu,Bing Qin,Ting Liu
机构: Harbin Institute of Technology, Harbin, China(哈尔滨工业大学); Fudan University, Shanghai, China(复旦大学); Beijing Academy of Artificial Intelligence, Beijing, China(北京人工智能研究院); Nanjing University, Nanjing, China(南京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2025 Main Conference
Abstract:Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark Com^2 focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory (e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at this https URL.
zh
[NLP-98] Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs
【速读】: 该论文试图解决AI生成图像检测中缺乏可解释性和鲁棒性的问题,现有方法虽准确但多为黑箱模型,无法提供人类可理解的解释。其解决方案的关键在于利用多模态大语言模型(MLLM)的强大分析与推理能力,并通过多阶段优化策略对其进行微调,以平衡准确检测、视觉定位和连贯文本解释的目标,从而实现更优的AI生成图像检测与视觉缺陷定位性能。
链接: https://arxiv.org/abs/2506.07045
作者: Yikun Ji,Hong Yan,Jun Lan,Huijia Zhu,Weiqiang Wang,Qi Fan,Liqing Zhang,Jianfu Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then finetune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.
zh
[NLP-99] Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在医疗应用中的有效性受限问题,具体表现为医疗知识覆盖范围有限、数据标注质量不佳导致的幻觉现象以及缺乏针对复杂医疗场景的推理能力。其解决方案的关键在于提出一种全面的数据整理流程,该流程不仅高效获取来自医学影像、广泛医学文本及通用领域数据的丰富医疗知识,还合成准确的医学描述、视觉问答(VQA)和推理样本,从而构建一个富含医疗知识的多模态数据集。在此基础上,开发了专门针对医疗领域的MLLM——Lingshu,并通过多阶段训练嵌入医疗专业知识,提升任务解决能力。
链接: https://arxiv.org/abs/2506.07044
作者: LASA Team,Weiwen Xu,Hou Pong Chan,Long Li,Mahani Aljunied,Ruifeng Yuan,Jianyu Wang,Chenghao Xiao,Guizhen Chen,Chaoqun Liu,Zhaodonghui Li,Yu Sun,Junao Shen,Chaojun Wang,Jie Tan,Deli Zhao,Tingyang Xu,Hao Zhang,Yu Rong
机构: DAMO Academy (达摩院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report, 53 pages, 25 tables, and 16 figures
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu’s medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks …
zh
[NLP-100] Reasoning with RAGged events: RAG-Enhanced Event Knowledge Base Construction and reasoning with proof-assistants
【速读】: 该论文试图解决从叙述性文本中自动提取结构化计算表示的历史事件问题,以及传统RDF/OWL推理器在深度时间与语义分析方面的局限性。其解决方案的关键在于利用多个大语言模型(LLM)(如GPT-4、Claude、Llama 3.2)结合三种增强策略:纯基础生成、知识图谱增强和检索增强生成(RAG),以优化不同性能维度。此外,论文还开发了自动化翻译流程,将提取的RDF表示转换为Coq证明助手规范,从而实现超越RDF能力的高阶推理。
链接: https://arxiv.org/abs/2506.07042
作者: Stergios Chatzikyriakidis
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Extracting structured computational representations of historical events from narrative text remains computationally expensive when constructed manually. While RDF/OWL reasoners enable graph-based reasoning, they are limited to fragments of first-order logic, preventing deeper temporal and semantic analysis. This paper addresses both challenges by developing automatic historical event extraction models using multiple LLMs (GPT-4, Claude, Llama 3.2) with three enhancement strategies: pure base generation, knowledge graph enhancement, and Retrieval-Augmented Generation (RAG). We conducted comprehensive evaluations using historical texts from Thucydides. Our findings reveal that enhancement strategies optimize different performance dimensions rather than providing universal improvements. For coverage and historical breadth, base generation achieves optimal performance with Claude and GPT-4 extracting comprehensive events. However, for precision, RAG enhancement improves coordinate accuracy and metadata completeness. Model architecture fundamentally determines enhancement sensitivity: larger models demonstrate robust baseline performance with incremental RAG improvements, while Llama 3.2 shows extreme variance from competitive performance to complete failure. We then developed an automated translation pipeline converting extracted RDF representations into Coq proof assistant specifications, enabling higher-order reasoning beyond RDF capabilities including multi-step causal verification, temporal arithmetic with BC dates, and formal proofs about historical causation. The Coq formalization validates that RAG-discovered event types represent legitimate domain-specific semantic structures rather than ontological violations.
zh
[NLP-101] KG2QA: Knowledge Graph-enhanced Retrieval-Augmented Generation for Communication Standards Question Answering
【速读】: 该论文旨在解决传统通信标准咨询模型周期长、依赖专家知识与经验,难以满足快速发展的技术需求的问题。其解决方案的关键在于将大语言模型的微调与知识图谱的构建相结合,通过LoRA微调方法在6,587条通信标准问答数据集上提升模型的专业能力,并基于包含6个实体属性和10个关系属性的本体框架构建了包含13,906个实体和13,524个关系的通信标准领域知识图谱,从而实现智能咨询与问答系统的高效运行。
链接: https://arxiv.org/abs/2506.07037
作者: Zhongze Luo,Weixuan Wan,Qizhi Zheng,Yanhong Bai,Jingyun Sun,Jian Wang,Dan Wang
机构: Xi’an Jiaotong University (西安交通大学); Northeast Forestry University (东北林业大学); China Agricultural University (中国农业大学)
类目: Computation and Language (cs.CL)
备注: 23 pages
Abstract:There are many types of standards in the field of communication. The traditional consulting model has a long cycle and relies on the knowledge and experience of experts, making it difficult to meet the rapidly developing technological demands. This paper combines the fine-tuning of large language models with the construction of knowledge graphs to implement an intelligent consultation and question-answering system for communication standards. The experimental results show that after LoRA tuning on the constructed dataset of 6,587 questions and answers in the field of communication standards, Qwen2.5-7B-Instruct demonstrates outstanding professional capabilities in the field of communication standards on the test set. BLEU-4 rose from 18.8564 to 66.8993, and evaluation indicators such as ROUGE also increased significantly, outperforming the fine-tuning effect of the comparison model Llama-3-8B-Instruct. Based on the ontology framework containing 6 entity attributes and 10 relation attributes, a knowledge graph of the communication standard domain containing 13,906 entities and 13,524 relations was constructed, showing a relatively good query accuracy rate. The intelligent consultation and question-answering system enables the fine-tuned model on the server side to access the locally constructed knowledge graph and conduct graphical retrieval of key information first, which is conducive to improving the question-answering effect. The evaluation using DeepSeek as the Judge on the test set shows that our RAG framework enables the fine-tuned model to improve the scores at all five angles, with an average score increase of 2.26%. And combined with web services and API interfaces, it has achieved very good results in terms of interaction experience and back-end access, and has very good practical application value.
zh
[NLP-102] A Culturally-diverse Multilingual Multimodal Video Benchmark Model
【速读】: 该论文试图解决视频领域多语言模型(Video LMMs)在文化与语言包容性方面的不足,特别是针对非英语语言的覆盖有限的问题。其解决方案的关键在于引入了一个多语言视频LMM基准测试集ViMUL-Bench,该基准覆盖14种语言,并包含15个涵盖文化多样性的类别,同时通过人工验证的8k样本和1.2百万条机器翻译的多语言视频训练数据,支持高资源与低资源语言之间的平衡。此外,论文还提出了一种简单的多语言视频LMM模型ViMUL,以提升视频理解任务中不同语言资源的性能表现。
链接: https://arxiv.org/abs/2506.07032
作者: Bhuiyan Sanjid Shafique,Ashmal Vayani,Muhammad Maaz,Hanoona Abdul Rasheed,Dinura Dissanayake,Mohammed Irfan Kurpath,Yahya Hmaiti,Go Inoue,Jean Lahoud,Md. Safirur Rashid,Shadid Intisar Quasem,Maheen Fatima,Franco Vidal,Mykola Maslych,Ketan Pravin More,Sanoojan Baliah,Hasindri Watawana,Yuhao Li,Fabian Farestam,Leon Schaller,Roman Tymtsiv,Simon Weber,Hisham Cholakkal,Ivan Laptev,Shin’ichi Satoh,Michael Felsberg,Mubarak Shah,Salman Khan,Fahad Shahbaz Khan
机构: Mohamed bin Zayed University of AI(穆罕默德·本·扎耶德人工智能大学); University of Central Florida(中佛罗里达大学); Islamic University of Technology(伊斯兰技术大学); Air University(空军大学); ETH Zurich(苏黎世联邦理工学院); Technische Universität München(慕尼黑工业大学); Independent Researcher(独立研究员); National Institute of Informatics(信息研究所); Australian National University(澳大利亚国立大学); Linköping University(林雪平大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large multimodal models (LMMs) have recently gained attention due to their effectiveness to understand and generate descriptions of visual content. Most existing LMMs are in English language. While few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high-and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM along with a large-scale multilingual video training set will help ease future research in developing cultural and linguistic inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at this https URL.
zh
[NLP-103] HauntAttack: When Attack Follows Reasoning as a Shadow
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在推理能力增强的同时,因内部推理过程暴露而产生的新型安全漏洞问题。具体而言,研究关注当推理过程与有害性高度耦合时,LRMs所表现出的安全性与推理能力之间的权衡问题。解决方案的关键在于提出一种名为HauntAttack的新型通用黑盒攻击框架,该框架通过将有害指令系统性地嵌入推理问题中,利用推理问题作为载体,替换其原有条件以引导模型逐步生成不安全输出,从而揭示LRMs的安全隐患。
链接: https://arxiv.org/abs/2506.07031
作者: Jingyuan Ma,Rui Li,Zheng Li,Junfeng Liu,Lei Sha,Zhifang Sui
机构: Peking University (北京大学); StepFun (步履科技); Beihang University (北京航空航天大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing exceptional capabilities. However, the enhancement of reasoning abilities and the exposure of their internal reasoning processes introduce new safety vulnerabilities. One intriguing concern is: when reasoning is strongly entangled with harmfulness, what safety-reasoning trade-off do LRMs exhibit? To address this issue, we introduce HauntAttack, a novel and general-purpose black-box attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we treat reasoning questions as carriers and substitute one of their original conditions with a harmful instruction. This process creates a reasoning pathway in which the model is guided step by step toward generating unsafe outputs. Based on HauntAttack, we conduct comprehensive experiments on multiple LRMs. Our results reveal that even the most advanced LRMs exhibit significant safety vulnerabilities. Additionally, we perform a detailed analysis of different models, various types of harmful instructions, and model output patterns, providing valuable insights into the security of LRMs.
zh
[NLP-104] Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text
【速读】: 该论文试图解决生成式 AI (Generative AI) 生成文本在检测系统中被识别的问题,特别是针对AI生成的抄袭和社交工程等滥用行为。其解决方案的关键在于提出一种无需训练的对抗性改写框架——Adversarial Paraphrasing,该框架利用现成的指令遵循大型语言模型,在AI文本检测器的指导下对AI生成内容进行改写,从而生成能够有效规避检测的对抗样本。
链接: https://arxiv.org/abs/2506.07001
作者: Yize Cheng,Vinu Sankar Sadasivan,Mehrdad Saberi,Shoumik Saha,Soheil Feizi
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL)
备注:
Abstract:The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, compared to simple paraphrasing attack–which, ironically, increases the true positive at 1% false positive (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT–adversarial paraphrasing, guided by OpenAI-RoBERTa-Large, reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors–including neural network-based, watermark-based, and zero-shot approaches–our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success to find that our method can significantly reduce detection rates, with mostly a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in the light of increasingly sophisticated evasion techniques.
zh
[NLP-105] What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding
【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在推理过程中产生的冗长思维链导致的推理效率低下和过度思考现象问题。其核心解决方案是基于“局部对齐衰减”(Local Misalignment Diminish)现象,提出了一种名为FoReaL-Decoding的协同快慢思维解码方法,通过让一个轻量级模型主导每句话的前几项token,随后由一个较弱的草稿模型完成剩余部分,从而实现计算成本与模型性能之间的可控权衡。
链接: https://arxiv.org/abs/2506.06998
作者: Ming Li,Zhengyuan Yang,Xiyao Wang,Dianqi Li,Kevin Lin,Tianyi Zhou,Lijuan Wang
机构: University of Maryland (马里兰大学); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large reasoning models (LRMs) achieve strong reasoning performance by emitting long chains of thought. Yet, these verbose traces slow down inference and often drift into unnecessary detail, known as the overthinking phenomenon. To better understand LRMs’ behavior, we systematically analyze the token-level misalignment between reasoning and non-reasoning models. While it is expected that their primary difference lies in the stylistic “thinking cues”, LRMs uniquely exhibit two pivotal, previously under-explored phenomena: a Global Misalignment Rebound, where their divergence from non-reasoning models persists or even grows as response length increases, and more critically, a Local Misalignment Diminish, where the misalignment concentrates at the “thinking cues” each sentence starts with but rapidly declines in the remaining of the sentence. Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding, a collaborative fast-slow thinking decoding method for cost-quality trade-off. In FoReaL-Decoding, a Leading model leads the first few tokens for each sentence, and then a weaker draft model completes the following tokens to the end of each sentence. FoReaL-Decoding adopts a stochastic gate to smoothly interpolate between the small and the large model. On four popular math-reasoning benchmarks (AIME24, GPQA-Diamond, MATH500, AMC23), FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%, while preserving 86 to 100% of model performance. These results establish FoReaL-Decoding as a simple, plug-and-play route to controllable cost-quality trade-offs in reasoning-centric tasks.
zh
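FoReaL-Decoding 的快慢协同过程可用如下玩具草图示意(两个模型均为占位函数,门控概率 p 为假设值):每句句首的“思维提示词”由领导模型生成,其余部分交给草稿模型续写。

```python
# 玩具级示意:FoReaL 式快慢协同解码。随机门控以概率 p 决定
# 是否由领导模型生成句首,其余 token 由较弱的草稿模型完成。
import random
random.seed(1)

def lead_model(ctx: str) -> str:    # 占位:真实场景为较强的 LRM
    return "由勾股定理可得,"

def draft_model(ctx: str) -> str:   # 占位:真实场景为较弱的草稿模型
    return "斜边长为 5。"

def foreal_sentence(ctx: str, p: float = 0.8) -> str:
    # 随机门控:以概率 p 让领导模型生成句首,否则整句由草稿模型生成
    head = lead_model(ctx) if random.random() < p else draft_model(ctx)
    tail = draft_model(ctx + head)   # 句子其余部分交给草稿模型
    return head + tail

print(foreal_sentence("直角边为 3 和 4,求斜边。"))
```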
[NLP-106] Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors ACL2025
【速读】: 该论文试图解决当前自然语言处理(NLP)中因训练数据文化偏向性导致的隐喻处理模型性能高估问题,特别是在跨文化多模态场景下的研究不足。其解决方案的关键在于构建一个跨文化的多模态隐喻数据集MultiMM,并提出一种融合情感嵌入的隐喻检测模型Sentiment-Enriched Metaphor Detection (SEMD),以提升模型在不同文化背景下的隐喻理解能力。
链接: https://arxiv.org/abs/2506.06987
作者: Senqi Yang,Dongyu Zhang,Jing Ren,Ziqi Xu,Xiuzhen Zhang,Yiliao Song,Hongfei Lin,Feng Xia
机构: Dalian University of Technology (大连理工大学); RMIT University (皇家墨尔本理工大学); The University of Adelaide (阿德莱德大学)
类目: Computation and Language (cs.CL)
备注: This paper has been accepted to the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Main Conference
Abstract:Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models. Our dataset and code are available at this https URL.
zh
[NLP-107] Chain of Methodologies: Scaling Test Time Computation without Training
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理复杂推理任务时因训练数据中缺乏深度洞察而表现不佳的问题。解决方案的关键在于提出一种名为方法链(Chain of Methodologies, CoM)的创新提示框架,该框架通过整合人类的方法论洞察来增强结构化思维,使LLMs能够在不进行显式微调的情况下激活系统性推理能力,从而提升其处理复杂任务的能力。
链接: https://arxiv.org/abs/2506.06982
作者: Cong Liu,Jie Wu,Weigang Wu,Xu Chen,Liang Lin,Wei-Shi Zheng
机构: Sun Yat-sen University (中山大学); Temple University (坦普尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) often struggle with complex reasoning tasks due to insufficient in-depth insights in their training data, which are typically absent in publicly available documents. This paper introduces the Chain of Methodologies (CoM), an innovative and intuitive prompting framework that enhances structured thinking by integrating human methodological insights, enabling LLMs to tackle complex tasks with extended reasoning. CoM leverages the metacognitive abilities of advanced LLMs, activating systematic reasoning through user-defined methodologies without explicit fine-tuning. Experiments show that CoM surpasses competitive baselines, demonstrating the potential of training-free prompting methods as robust solutions for complex reasoning tasks and bridging the gap toward human-level reasoning through human-like methodological insights.
zh
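CoM 属于免训练的提示框架,其提示的组装方式可示意如下(方法论步骤与模板措辞均为示例假设,并非论文原模板):

```python
# 简化示意:构造一个"方法链"(CoM)风格的提示,将用户自定义的
# 方法论步骤注入提示,引导模型按方法论逐步展开推理。
METHODOLOGIES = [
    "先明确问题的已知条件与目标",
    "选择合适的分解策略,将问题拆成子问题",
    "逐个求解子问题并检查中间结果",
    "综合子结果,验证是否满足原始目标",
]

def build_com_prompt(question: str) -> str:
    steps = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(METHODOLOGIES))
    return f"请严格按照以下方法论逐步推理:\n{steps}\n\n问题:{question}\n推理:"

print(build_com_prompt("9 个人 4 天修 360 米路,18 个人 8 天能修多少米?"))
```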
[NLP-108] Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test
【速读】: 该论文试图解决在大型语言模型(Large Language Model, LLM)通过API接口提供服务时,用户无法获得模型权重或输出logits,导致难以检测API提供者可能暗中替换为量化或微调后的模型版本的问题。解决方案的关键在于提出一种基于排名的均匀性检验方法,该方法能够验证黑盒LLM的行为与本地部署的原始模型是否一致,具有查询效率高、不易被检测到查询模式的特点,从而对抗恶意的API提供者。
链接: https://arxiv.org/abs/2506.06975
作者: Xiaoyuan Zhu,Yaowen Ye,Tianyi Qiu,Hanlin Zhu,Sijun Tan,Ajraf Mannan,Jonathan Michala,Raluca Ada Popa,Willie Neiswanger
机构: University of Southern California (南加州大学); University of California, Berkeley (加州大学伯克利分校); Peking University (北京大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As API access becomes a primary interface to large language models (LLMs), users often interact with black-box systems that offer little transparency into the deployed model. To reduce costs or maliciously alter model behaviors, API providers may discreetly serve quantized or fine-tuned variants, which can degrade performance and compromise safety. Detecting such substitutions is difficult, as users lack access to model weights and, in most cases, even output logits. To tackle this problem, we propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model. Our method is accurate, query-efficient, and avoids detectable query patterns, making it robust to adversarial providers that reroute or mix responses upon the detection of testing attempts. We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution, showing that it consistently achieves superior statistical power over prior methods under constrained query budgets.
zh
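该检验的直觉是:若 API 背后仍是原模型,则 API 样本在本地真模型样本中的归一化排名应近似服从均匀分布,可用 KS 检验验证。下面用模拟数据给出一个草图(得分与分布均为模拟,并非论文的具体统计量):

```python
# 简化示意:基于排名的均匀性检验。对每个 prompt,计算 API 样本得分
# 在 m 个本地真模型样本得分中的归一化排名,再对排名做 KS 均匀性检验。
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)

def ranks(api_scores, local_scores_per_prompt):
    return np.array([
        (np.sum(local < s) + 0.5) / len(local)
        for s, local in zip(api_scores, local_scores_per_prompt)
    ])

n, m = 200, 64
local = rng.normal(0, 1, size=(n, m))         # 本地真模型的得分样本
api_same = rng.normal(0, 1, size=n)           # 情形 1:API 未被替换
api_swapped = rng.normal(0.5, 1, size=n)      # 情形 2:API 换成了另一模型

print(kstest(ranks(api_same, local), "uniform").pvalue)     # 较大:与均匀分布一致
print(kstest(ranks(api_swapped, local), "uniform").pvalue)  # 极小:检测到模型替换
```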
[NLP-109] Atomic Reasoning for Scientific Table Claim Verification
【速读】: 该论文试图解决科学文本中由于复杂语言和数据导致的误信息传播问题,特别是非专家对基于科学表格的误导性声明的易感性问题。现有表格声明验证模型,包括最先进的大语言模型(Large Language Models, LLMs),在精确的细粒度推理方面表现不佳,导致验证错误和精度不足。该研究的关键解决方案是受认知负荷理论启发,通过开发模块化、可重用的推理组件(即原子技能)来降低认知负荷,进而提升模型对表格声明的解释能力。研究提出了一种技能链式结构,动态组合这些技能以实现更准确和泛化的推理,并通过SciAtomicBench基准测试验证了其有效性。
链接: https://arxiv.org/abs/2506.06972
作者: Yuji Zhang,Qingyun Wang,Cheng Qian,Jiateng Liu,Chenkai Sun,Denghui Zhang,Tarek Abdelzaher,Chengxiang Zhai,Preslav Nakov,Heng Ji
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Stevens Institute of Technology (斯蒂文斯理工学院); MBZUAI (MBZUAI)
类目: Computation and Language (cs.CL)
备注:
Abstract:Scientific texts often convey authority due to their technical language and complex data. However, this complexity can sometimes lead to the spread of misinformation. Non-experts are particularly susceptible to misleading claims based on scientific tables due to their high information density and perceived credibility. Existing table claim verification models, including state-of-the-art large language models (LLMs), often struggle with precise fine-grained reasoning, resulting in errors and a lack of precision in verifying scientific claims. Inspired by Cognitive Load Theory, we propose that enhancing a model’s ability to interpret table-based claims involves reducing cognitive load by developing modular, reusable reasoning components (i.e., atomic skills). We introduce a skill-chaining schema that dynamically composes these skills to facilitate more accurate and generalizable reasoning with a reduced cognitive load. To evaluate this, we create SciAtomicBench, a cross-domain benchmark with fine-grained reasoning annotations. With only 350 fine-tuning examples, our model trained by atomic reasoning outperforms GPT-4o’s chain-of-thought method, achieving state-of-the-art results with far less training data.
zh
[NLP-110] Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation
【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在复杂推理任务中是否真正具备推理能力,还是仅依赖于浅层统计模式的问题。其解决方案的关键在于通过引入语义忠实但对抗性构造的提示扰动(prompt perturbations),系统地评估推理型LLMs的鲁棒性。具体方法包括对LeetCode风格问题进行多种变换,如叙事重构、无关约束注入、示例重新排序和数值扰动,从而揭示模型对语义和表面提示动态的敏感性。
链接: https://arxiv.org/abs/2506.06971
作者: Jaechul Roh,Varun Gandhi,Shivani Anilkumar,Arin Garg
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis – especially when aided by reasoning tokens and Chain-of-Thought prompting. Yet, a core question remains: do these models truly reason, or do they merely exploit shallow statistical patterns? In this paper, we systematically investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations. Our evaluation – spanning 700 perturbed code generations derived from LeetCode-style problems – applies transformations such as storytelling reframing, irrelevant constraint injection, example reordering, and numeric perturbation. We observe that while certain modifications severely degrade performance (with accuracy drops up to -42.1%), others surprisingly improve model accuracy by up to 35.3%, suggesting sensitivity not only to semantics but also to surface-level prompt dynamics. These findings expose the fragility and unpredictability of current reasoning systems, underscoring the need for more principled approaches to reasoning alignment and prompting robustness. We release our perturbation datasets and evaluation framework to promote further research in trustworthy and resilient LLM reasoning.
zh
[NLP-111] A dependently-typed calculus of event telicity and culminativity
【速读】: 该论文试图解决跨语言事件的及物性(telicity)和终结性(culminativity)分析问题,其核心在于构建一个依赖类型化的框架来形式化描述事件结构及其语义特性。解决方案的关键在于将名词短语的限定性与子类型、限定数量和形容词修饰之间的关系,以及动词短语中的依赖事件演算相结合,其中及物事件被定义为受事者受限的事件,而终结性事件则是达到内在终点的及物事件。该框架基于模态马丁-洛夫依赖类型理论,并在Agda证明助手中共进行了形式化。
链接: https://arxiv.org/abs/2506.06968
作者: Pavel Kovalev,Carlo Angiuli
机构: 未知
类目: Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 52 pages, Agda formalization available at this https URL
Abstract:We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences. Our framework consists of two parts. In the nominal domain, we model the boundedness of noun phrases and its relationship to subtyping, delimited quantities, and adjectival modification. In the verbal domain we define a dependent event calculus, modeling telic events as those whose undergoer is bounded, culminating events as telic events that achieve their inherent endpoint, and consider adverbial modification. In both domains we pay particular attention to associated entailments. Our framework is defined as an extension of intensional Martin-Löf dependent type theory, and the rules and examples in this paper have been formalized in the Agda proof assistant.
zh
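原文在 Agda 中完成形式化;下面用 Lean 4 给出一个极简示意(并非论文的形式化本身,仅演示“受事者有界 ⇒ 及物;及物且达到内在终点 ⇒ 终结”的定义方式,所有字段名均为假设):

```lean
-- 极简示意:名词短语的有界性与事件的及物性(telicity)/终结性(culminativity)
structure Entity where
  name    : String
  bounded : Bool      -- 名词短语是否有界(如 "an apple" 对 "water")

structure Event where
  undergoer       : Entity
  reachedEndpoint : Bool

-- 及物(telic)事件:其受事者有界
def telic (e : Event) : Prop := e.undergoer.bounded = true

-- 终结(culminating)事件:及物且达到其内在终点
def culminating (e : Event) : Prop := telic e ∧ e.reachedEndpoint = true
```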
[NLP-112] Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning
【速读】: 该论文试图解决问答(QA)代理在面对模糊或不明确问题时,无法主动提出澄清性问题以提高回答准确性的局限性。解决方案的关键在于模拟包含澄清性问题的对话,并通过强化学习(RL)进行训练。为使RL在实际中可行,作者提出了可视为奖励加权监督微调(SFT)的离线RL目标,并在大型语言模型中易于优化。这种方法与基于SFT和直接偏好优化的近期方法形成鲜明对比,后者具有额外的超参数且不直接优化奖励。
链接: https://arxiv.org/abs/2506.06964
作者: Subhojyoti Mukherjee,Viet Dac Lai,Raghavendra Addanki,Ryan Rossi,Seunghyun Yoon,Trung Bui,Anup Rao,Jayakumar Subramanian,Branislav Kveton
机构: Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 39 pages
Abstract:Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in a stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
zh
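奖励加权 SFT 的目标可写作 L(θ) = -E[ r · log p_θ(y|x) ],即用奖励对序列级负对数似然加权。下面用 PyTorch 给出一个极简示意(张量均为虚构小例):

```python
# 简化示意:奖励加权 SFT 损失,用(非负)奖励对序列负对数似然加权。
import torch
import torch.nn.functional as F

def reward_weighted_sft_loss(logits, targets, rewards):
    # logits: [batch, seq, vocab]; targets: [batch, seq]; rewards: [batch]
    logp = F.log_softmax(logits, dim=-1)
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # [batch, seq]
    seq_logp = tok_logp.sum(dim=-1)          # 序列对数似然
    return -(rewards * seq_logp).mean()      # 奖励加权的负对数似然

logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
rewards = torch.tensor([1.0, 0.2])           # 含澄清问题且获高回报的对话权重更大
print(reward_weighted_sft_loss(logits, targets, rewards))
```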
[NLP-113] BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在处理逻辑有效但与信念不一致的三段论推理问题时存在的推理偏差问题。解决方案的关键在于构建了BIS Reasoning 1.0,这是一个针对日语设计的大规模三段论推理数据集,专门用于评估LLMs在面对逻辑上正确但与既有信念冲突的问题时的表现,从而揭示模型在处理此类问题时的弱点。
链接: https://arxiv.org/abs/2506.06955
作者: Ha-Thanh Nguyen,Chaoran Liu,Hirokazu Kiyomaru,Koichi Takeda,Yusuke Miyao,Maki Matsuda,Yusuke Oda,Pontus Stenetorp,Qianying Liu,Su Myat Noe,Hideyuki Tachibana,Kouta Nakayama,Sadao Kurohashi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
zh
[NLP-114] What Makes a Good Natural Language Prompt? ACL2025
【速读】: 该论文试图解决当前关于自然语言提示(natural language prompts)的量化标准缺乏共识的问题。其解决方案的关键在于提出一个以属性和人类为中心的框架,用于评估提示质量,该框架包含21个属性,并将其归类为六个维度,从而为提示质量的系统性分析提供理论基础。
链接: https://arxiv.org/abs/2506.06950
作者: Do Xuan Long,Duy Dinh,Ngoc-Hai Nguyen,Kenji Kawaguchi,Nancy F. Chen,Shafiq Joty,Min-Yen Kan
机构: National University of Singapore (新加坡国立大学); Salesforce AI Research (Salesforce人工智能研究院); Institute for Infocomm Research (I2R), A*STAR (资讯通信研究局)
类目: Computation and Language (cs.CL)
备注: ACL 2025 Main Conference
Abstract:As large language models (LLMs) have progressed towards more human-like and human–AI communications have become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly quantifies natural language prompts. We attempt to address this question by conducting a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025 and blogs. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then examine how existing studies assess their impact on LLMs, revealing their imbalanced support across models and tasks, and substantial research gaps. Further, we analyze correlations among properties in high-quality natural language prompts, deriving prompting recommendations. We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact. Finally, we discover that instruction-tuning on property-enhanced prompts can result in better reasoning models. Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging the gaps between human–AI communication and opening new prompting research directions.
zh
[NLP-115] he Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
【速读】: 该论文试图解决大型推理模型(Large Reasoning Models, LRMs)在复杂任务中的性能局限性及其推理机制的理解不足问题。现有评估主要依赖于数学和编程基准测试,仅关注最终答案的准确性,而忽视了推理过程的分析。本文的关键解决方案是利用可控的谜题环境,通过精确调控任务复杂度并保持逻辑结构的一致性,系统地研究LRMs的推理轨迹与最终答案,从而揭示其在不同复杂度下的表现差异及内在限制。
链接: https://arxiv.org/abs/2506.06941
作者: Parshin Shojaee,Iman Mirzadeh,Keivan Alizadeh,Maxwell Horton,Samy Bengio,Mehrdad Farajtabar
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: preprint
Abstract:Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and raising questions about their reasoning capabilities.
[NLP-116] DiscoSum: Discourse-aware News Summarization
【Quick Read】: This paper tackles the difficulty that large language models have in preserving long-range discourse structure when summarizing news articles, which affects reader engagement. The key is to integrate discourse structure into the summarization process: the authors build a novel news discourse schema and develop DiscoSum, an algorithm that uses beam search for structure-aware summarization, enabling news stories to be transformed to meet different stylistic and structural demands.
Link: https://arxiv.org/abs/2506.06930
Authors: Alexander Spangher, Tenghao Huang, Jialiang Gu, Jiatong Shi, Muhao Chen
Affiliations: University of Southern California Information Sciences Institute; School of Computer Science, Wuhan University; University of California, Davis
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 10 pages in Appendix
Abstract:Recent advances in text summarization have predominantly leveraged large language models to generate concise summaries. However, language models often do not maintain long-term discourse structure, especially in news articles, where organizational flow significantly influences reader engagement. We introduce a novel approach to integrating discourse structure into summarization processes, focusing specifically on news articles across various media. We present a novel summarization dataset where news articles are summarized multiple times in different ways across different social media platforms (e.g. LinkedIn, Facebook, etc.). We develop a novel news discourse schema to describe summarization structures and a novel algorithm, DiscoSum, which employs beam search technique for structure-aware summarization, enabling the transformation of news stories to meet different stylistic and structural demands. Both human and automatic evaluation results demonstrate the efficacy of our approach in maintaining narrative fidelity and meeting structural requirements.
[NLP-117] Hybrid Extractive Abstractive Summarization for Multilingual Sentiment Analysis
【Quick Read】: This paper addresses the limitations of standalone methods in multilingual sentiment analysis by combining extractive and abstractive summarization to improve performance. The key is pairing a TF-IDF-based extraction module with a fine-tuned XLM-R abstractive module, augmented with dynamic thresholding and cultural adaptation to strengthen generalization and efficiency.
Link: https://arxiv.org/abs/2506.06929
Authors: Mikhail Krasitskii, Grigori Sidorov, Olga Kolesnikova, Liliana Chanona Hernandez, Alexander Gelbukh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 6 pages
Abstract:We propose a hybrid approach for multilingual sentiment analysis that combines extractive and abstractive summarization to address the limitations of standalone methods. The model integrates TF-IDF-based extraction with a fine-tuned XLM-R abstractive module, enhanced by dynamic thresholding and cultural adaptation. Experiments across 10 languages show significant improvements over baselines, achieving 0.90 accuracy for English and 0.84 for low-resource languages. The approach also demonstrates 22% greater computational efficiency than traditional methods. Practical applications include real-time brand monitoring and cross-cultural discourse analysis. Future work will focus on optimization for low-resource languages via 8-bit quantization.
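To make the extractive half concrete, here is a minimal sketch, assuming scikit-learn, of TF-IDF sentence scoring with a dynamic threshold. The threshold rule and the keep_margin parameter are illustrative stand-ins for the paper's actual mechanism; the selected sentences would then feed the fine-tuned XLM-R abstractive module.

```python
# A minimal sketch of the extractive stage of such a hybrid pipeline:
# score sentences by summed TF-IDF weight and keep those above a
# corpus-dependent (dynamic) threshold.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_salient(sentences, keep_margin=0.9):
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(sentences)            # one row per sentence
    scores = np.asarray(tfidf.sum(axis=1)).ravel()  # summed weight per sentence
    threshold = keep_margin * scores.mean()         # dynamic threshold (assumed rule)
    return [s for s, sc in zip(sentences, scores) if sc >= threshold]

doc = [
    "The product arrived quickly and works perfectly.",
    "Ok.",
    "Customer support was friendly when I asked about the warranty.",
]
print(extract_salient(doc))
```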
[NLP-118] Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering
【Quick Read】: This paper targets the inconsistent in-context learning (ICL) behavior of smaller Large Multimodal Models (LMMs), whose performance does not always improve monotonically as more examples are added. The key is a meta-learning approach that distills a fixed set of soft prompts from task-relevant image features and adapts them at test time with a few examples, inducing few-shot capability. To support the distillation, an attention-mapper module is introduced that integrates with the LLaVA v1.5 architecture and is learned jointly with the soft prompts, enabling task adaptation in low-data regimes within just a few gradient steps.
Link: https://arxiv.org/abs/2506.06905
Authors: Akash Gupta, Amos Storkey, Mirella Lapata
Affiliations: University of Edinburgh
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, is inconsistent and does not always improve monotonically with increasing examples. We hypothesize that this occurs due to the LMM being overwhelmed by additional information present in the image embeddings, which is not required for the downstream task. To address this, we propose a meta-learning approach that provides an alternative for inducing few-shot capabilities in LMMs, using a fixed set of soft prompts that are distilled from task-relevant image features and can be adapted at test time using a few examples. To facilitate this distillation, we introduce an attention-mapper module that can be easily integrated with the popular LLaVA v1.5 architecture and is jointly learned with soft prompts, enabling task adaptation in LMMs under low-data regimes with just a few gradient steps. Evaluation on the VL-ICL Bench shows that our method consistently outperforms ICL and related prompt-tuning approaches, even under image perturbations, improving task induction and reasoning across visual question answering tasks.
[NLP-119] Automatic Speech Recognition of African American English: Lexical and Contextual Effects INTERSPEECH2025
【Quick Read】: This paper examines ASR misrecognition of the phonetic, phonological, and morphosyntactic features of African American English (AAE), focusing on two AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. The key of the approach is transcribing speech with wav2vec 2.0, detecting CCR and ING-reduction with the Montreal Forced Aligner (MFA), and comparing end-to-end ASR systems with and without an external language model (LM) in terms of lexical neighborhood effects and contextual predictability. The results show that systems without an LM are more susceptible to lexical neighborhood effects, while systems with an LM depend more on context.
Link: https://arxiv.org/abs/2506.06888
Authors: Hamid Mojarad, Kevin Tang
Affiliations: HHU (Heinrich Heine University Düsseldorf)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: submitted to Interspeech 2025
Abstract:Automatic Speech Recognition (ASR) models often struggle with the phonetic, phonological, and morphosyntactic features found in African American English (AAE). This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. It examines whether the presence of CCR and ING-reduction increases ASR misrecognition. Subsequently, it investigates whether end-to-end ASR systems without an external Language Model (LM) are more influenced by lexical neighborhood effect and less by contextual predictability compared to systems with an LM. The Corpus of Regional African American Language (CORAAL) was transcribed using wav2vec 2.0 with and without an LM. CCR and ING-reduction were detected using the Montreal Forced Aligner (MFA) with pronunciation expansion. The analysis reveals a small but significant effect of CCR and ING on Word Error Rate (WER) and indicates a stronger presence of lexical neighborhood effect in ASR systems without LMs.
[NLP-120] Mixture of Small and Large Models for Chinese Spelling Check
【Quick Read】: This paper targets the unsatisfactory performance of large language models (LLMs) on the Chinese Spelling Check (CSC) task while avoiding the edit-pattern overfitting seen in fine-tuned BERT-based models on in-domain data. The key is a dynamic mixture approach that fuses the probability distributions of a small model and an LLM during beam search decoding, combining the small model's precise corrections with the LLM's fluency. It requires no LLM fine-tuning, which saves time and resources and eases domain adaptation.
Link: https://arxiv.org/abs/2506.06887
Authors: Ziheng Qiao, Houquan Zhou, Zhenghua Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In the era of large language models (LLMs), the Chinese Spelling Check (CSC) task has seen various LLM methods developed, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, achieving a balanced enhancement of precise corrections from small models and the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources, and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at this https URL.
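The core decoding idea can be illustrated in a few lines: at each beam expansion, score tokens under a weighted combination of the small model's and the LLM's next-token distributions. The weight lam and the toy distributions below are assumptions for illustration, not the paper's tuned values.

```python
# A minimal sketch of mixing two next-token distributions at decoding time:
# the small CSC model supplies precise corrections, the LLM supplies
# fluency, and each candidate token is scored under the weighted mixture.
import numpy as np

def mix_next_token(p_small, p_llm, lam=0.5):
    """Combine two next-token distributions and renormalize."""
    mixed = lam * p_small + (1.0 - lam) * p_llm
    return mixed / mixed.sum()

vocab = ["的", "地", "得", "了"]
p_small = np.array([0.70, 0.10, 0.15, 0.05])  # small model: confident fix
p_llm   = np.array([0.30, 0.30, 0.20, 0.20])  # LLM: smoother, fluent prior
mixed = mix_next_token(p_small, p_llm, lam=0.6)
print(vocab[int(mixed.argmax())], mixed.round(3))
```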
[NLP-121] Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning
【Quick Read】: This paper addresses the phenomenon in which large language models (LLMs) produce correct answers to math problems via fundamentally flawed reasoning, a form of reward hacking. The key is ParaStepVerifier, a novel methodology for meticulous step-by-step verification of mathematical solutions that identifies incorrect reasoning steps, substantially improving the detection of flawed solutions, especially on complex multi-step problems.
Link: https://arxiv.org/abs/2506.06877
Authors: Jiaxing Guo, Wenjie Yang, Shengzhong Zhang, Tongshan Xu, Lun Du, Da Zheng, Zengfeng Huang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs’ answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to baselines, especially for complex, multi-step problems. This offers a more robust path towards evaluating and training LLMs with genuine mathematical reasoning.
[NLP-122] Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models ACL2025
【Quick Read】: This paper addresses the substantial performance drop that parameter-efficient fine-tuning (PEFT) modules suffer on newer versions of an updated base model, which would otherwise require re-tuning large numbers of modules at high computational cost. The key finding is that continual training mainly affects task-specific knowledge stored in the Feed-Forward Networks (FFN) while having less impact on task-specific patterns in the attention mechanism. Building on this, the paper proposes Trans-PEFT, which focuses PEFT modules on task-specific patterns while reducing their dependence on certain knowledge in the base model, so that modules keep performing on updated base models without re-tuning.
Link: https://arxiv.org/abs/2506.06844
Authors: Naibin Gu, Peng Fu, Xiyu Liu, Ke Ma, Zheng Lin, Weiping Wang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, UCAS
Subjects: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025
Abstract:Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these numerous modules to restore performance would incur significant computational costs. Through a comprehensive analysis of the changes that occur during base model updates, we uncover an interesting phenomenon: continual training primarily affects task-specific knowledge stored in Feed-Forward Networks (FFN), while having less impact on the task-specific pattern in the Attention mechanism. Based on these findings, we introduce Trans-PEFT, a novel approach that enhances the PEFT module by focusing on the task-specific pattern while reducing its dependence on certain knowledge in the base model. Further theoretical analysis supports our approach. Extensive experiments across 7 base models and 12 datasets demonstrate that Trans-PEFT trained modules can maintain performance on updated base models without re-tuning, significantly reducing maintenance overhead in real-world applications.
[NLP-123] PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation ACL2025
【Quick Read】: This paper addresses the accuracy of disinformation detection in zero-shot classification settings. The key is the Persuasion-Augmented Chain of Thought (PCoT), which infuses knowledge of persuasion techniques to improve LLMs' ability to identify disinformation in previously unseen data.
Link: https://arxiv.org/abs/2506.06842
Authors: Arkadiusz Modzelewski, Witold Sosnowski, Tiziano Labruna, Adam Wierzbicki, Giovanni Da San Martino
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 Main Conference
Abstract:Disinformation detection is a key aspect of media literacy. Psychological studies have shown that knowledge of persuasive fallacies helps individuals detect disinformation. Inspired by these findings, we experimented with large language models (LLMs) to test whether infusing persuasion knowledge enhances disinformation detection. As a result, we introduce the Persuasion-Augmented Chain of Thought (PCoT), a novel approach that leverages persuasion to improve disinformation detection in zero-shot classification. We extensively evaluate PCoT on online news and social media posts. Moreover, we publish two novel, up-to-date disinformation datasets: EUDisinfo and MultiDis. These datasets enable the evaluation of PCoT on content entirely unseen by the LLMs used in our experiments, as the content was published after the models’ knowledge cutoffs. We show that, on average, PCoT outperforms competitive methods by 15% across five LLMs and five datasets. These findings highlight the value of persuasion in strengthening zero-shot disinformation detection.
[NLP-124] Cross-Entropy Games for Language Models: From Implicit Knowledge to General Capability Measures
【Quick Read】: This paper asks how to measure the capabilities of Large Language Models (LLMs), particularly their implicit knowledge and algorithmic behavior. The key is a family of games built on cross-entropy (Xent) scores and constraints, called Xent Games, which formalize tasks such as summarization, counterfactual thinking, and anomaly detection. Xent Games can be expressed as simple computational graphs and programs, the game space is rich enough to contain a wealth of interesting examples, and it can be constructed from basic game-theoretic consistency axioms. The paper further proposes Xent Game measures: finite families of Xent Games extracted via a covering measure, used as capability benchmarks for systematically assessing LLMs' general abilities.
Link: https://arxiv.org/abs/2506.06832
Authors: Clément Hongler, Andrew Emil
Affiliations: EPFL
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Neural and Evolutionary Computing (cs.NE)
Comments: 41 pages, 16 figures
Abstract:Large Language Models (LLMs) define probability measures on text. By considering the implicit knowledge question of what it means for an LLM to know such a measure and what it entails algorithmically, we are naturally led to formulate a series of tasks that go beyond generative sampling, involving forms of summarization, counterfactual thinking, anomaly detection, originality search, reverse prompting, debating, creative solving, etc. These tasks can be formulated as games based on LLM measures, which we call Cross-Entropy (Xent) Games. Xent Games can be single-player or multi-player. They involve cross-entropy scores and cross-entropy constraints, and can be expressed as simple computational graphs and programs. We show the Xent Game space is large enough to contain a wealth of interesting examples, while being constructible from basic game-theoretic consistency axioms. We then discuss how the Xent Game space can be used to measure the abilities of LLMs. This leads to the construction of Xent Game measures: finite families of Xent Games that can be used as capability benchmarks, built from a given scope, by extracting a covering measure. To address the unbounded scope problem associated with the challenge of measuring general abilities, we propose to explore the space of Xent Games in a coherent fashion, using ideas inspired by evolutionary dynamics.
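The basic building block of a Xent game, an LM's cross-entropy score for a continuation given a prefix, is easy to sketch. The snippet below uses GPT-2 via Hugging Face transformers purely as a stand-in scorer; the paper composes such scores and constraints into full single- and multi-player games.

```python
# A minimal sketch of the quantity Xent games are built from: the mean
# cross-entropy (nats/token) an LM assigns to a continuation given a prefix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def xent(prefix, continuation):
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # logits at position t predict token t+1; slice the continuation span
    logprobs = logits.log_softmax(-1)[0, prefix_ids.size(1) - 1 : -1]
    picked = logprobs.gather(1, cont_ids[0].unsqueeze(1))
    return -picked.mean().item()

print(xent("The capital of France is", " Paris"))
print(xent("The capital of France is", " Berlin"))  # should score higher
```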
[NLP-125] Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems
【Quick Read】: This paper investigates the largely unexplored potential of Large Language Models (LLMs) for code checking and debugging via test case generation, especially their weakness at producing test cases that target specific bugs. The key contributions are TCGBench, a benchmark for evaluating LLMs' ability to generate test case generators, and a high-quality, manually curated instruction dataset that improves LLM performance on generating targeted test cases.
Link: https://arxiv.org/abs/2506.06821
Authors: Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 37 pages, 22 figures
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
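To make the task concrete: a "test case generator" here is a small program that emits inputs for a CP problem so candidate code can be stress-tested against a trusted reference. The sketch below uses a hypothetical max-subarray problem and a deliberately buggy Kadane implementation; the problem choice and bounds are illustrative, not drawn from TCGBench.

```python
# A minimal sketch of a targeted test case generator: emit random and
# adversarial inputs and compare a candidate solution against a trusted
# brute force until a discrepancy is exposed.
import random

def gen_case(n_max=8, v_max=10, edge=False):
    """One test case: edge cases pin sizes/values to extremes."""
    n = 1 if edge else random.randint(1, n_max)
    lo, hi = (-v_max, -v_max) if edge else (-v_max, v_max)
    return [random.randint(lo, hi) for _ in range(n)]

def brute(a):  # trusted O(n^2) reference
    return max(sum(a[i:j]) for i in range(len(a)) for j in range(i + 1, len(a) + 1))

def candidate(a):  # buggy Kadane: mishandles all-negative arrays
    best = cur = 0
    for x in a:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

random.seed(0)
for k in range(200):
    a = gen_case(edge=(k % 10 == 0))
    if brute(a) != candidate(a):
        print("bug exposed by:", a)
        break
```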
[NLP-126] Beyond Classification: Towards Speech Emotion Reasoning with Multitask AudioLLMs
【Quick Read】: This paper addresses two gaps in audio large language models (AudioLLMs): weak modeling of paralinguistic cues such as emotion, and the tendency of existing methods to treat emotion understanding as pure classification with little insight into the rationale behind predictions. The key is an emotion reasoning strategy that uses the generative capability of AudioLLMs to produce semantically aligned, evidence-grounded explanations, supported by a unified framework combining reasoning-augmented data supervision, a dual-encoder architecture, and task-alternating training, so that multitask AudioLLMs learn effectively while incorporating emotional reasoning.
Link: https://arxiv.org/abs/2506.06820
Authors: Wenyu Zhang, Yingxu He, Geyu Lin, Zhuohan Liu, Shuo Sun, Bin Wang, Xunlong Zou, Jeremy H. M. Wong, Qiongqiong Wang, Hardik B. Sailor, Nancy F. Chen, Ai Ti Aw
Affiliations: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR); Centre for Frontier AI Research (CFAR)
Subjects: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Audio Large Language Models (AudioLLMs) have achieved strong results in semantic tasks like speech recognition and translation, but remain limited in modeling paralinguistic cues such as emotion. Existing approaches often treat emotion understanding as a classification problem, offering little insight into the underlying rationale behind predictions. In this work, we explore emotion reasoning, a strategy that leverages the generative capabilities of AudioLLMs to enhance emotion recognition by producing semantically aligned, evidence-grounded explanations. To support this in multitask AudioLLMs, we introduce a unified framework combining reasoning-augmented data supervision, dual-encoder architecture, and task-alternating training. This approach enables AudioLLMs to effectively learn different tasks while incorporating emotional reasoning. Experiments on IEMOCAP and MELD show that our approach not only improves emotion prediction accuracy but also enhances the coherence and evidential grounding of the generated responses.
[NLP-127] How do datasets developers and models affect biases in a low-resourced language?
【Quick Read】: This paper studies identity-based biases in sociotechnical systems, focusing on gender-, religion-, and nationality-related biases in NLP models for Bengali, a widely spoken but low-resourced language. The key is an algorithmic audit of Bengali Sentiment Analysis (BSA) models built on mBERT and BanglaBERT, showing that the models exhibit biases across identity categories despite similar semantic content and structure, and examining the inconsistencies and uncertainties that arise from combining pre-trained models with datasets created by people from diverse demographic backgrounds.
Link: https://arxiv.org/abs/2506.06816
Authors: Dipto Das, Shion Guha, Bryan Semaan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Sociotechnical systems, such as language technologies, frequently exhibit identity-based biases. These biases exacerbate the experiences of historically marginalized communities and remain understudied in low-resource contexts. While models and datasets specific to a language or with multilingual support are commonly recommended to address these biases, this paper empirically tests the effectiveness of such approaches in the context of gender, religion, and nationality-based identities in Bengali, a widely spoken but low-resourced language. We conducted an algorithmic audit of sentiment analysis models built on mBERT and BanglaBERT, which were fine-tuned using all Bengali sentiment analysis (BSA) datasets from Google Dataset Search. Our analyses showed that BSA models exhibit biases across different identity categories despite having similar semantic content and structure. We also examined the inconsistencies and uncertainties arising from combining pre-trained models and datasets created by individuals from diverse demographic backgrounds. We connected these findings to the broader discussions on epistemic injustice, AI alignment, and methodological decisions in algorithmic audits.
[NLP-128] BTPD: A Multilingual Hand-curated Dataset of Bengali Transnational Political Discourse Across Online Communities
【Quick Read】: This paper addresses the difficulty of analyzing online political discourse in under-resourced languages such as Bengali, caused mainly by the lack of datasets. The key is the construction of BTPD, a multilingual dataset of Bengali transnational political discourse collected from three online platforms with distinct community structures and interaction dynamics, hand-curated through community-informed keyword-based retrieval, providing a foundation for future research.
Link: https://arxiv.org/abs/2506.06813
Authors: Dipto Das, Syed Ishtiaque Ahmed, Shion Guha
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Understanding political discourse in online spaces is crucial for analyzing public opinion and ideological polarization. While social computing and computational linguistics have explored such discussions in English, such research efforts are significantly limited in major yet under-resourced languages like Bengali due to the unavailability of datasets. In this paper, we present a multilingual dataset of Bengali transnational political discourse (BTPD) collected from three online platforms, each representing distinct community structures and interaction dynamics. Besides describing how we hand-curated the dataset through community-informed keyword-based retrieval, this paper also provides a general overview of its topics and multilingual content.
[NLP-129] Advancing Question Generation with Joint Narrative and Difficulty Control ACL
【Quick Read】: This paper addresses the lack of joint control over question difficulty and narrative in reading comprehension question generation, aiming at questions better tailored to educational needs. The key is a Joint Narrative and Difficulty Control strategy that governs both attributes of the generated questions simultaneously.
Link: https://arxiv.org/abs/2506.06812
Authors: Bernardo Leite, Henrique Lopes Cardoso
Affiliations: LIACC, Faculdade de Engenharia, Universidade do Porto
Subjects: Computation and Language (cs.CL)
Comments: Preprint. Accepted to the BEA 2025 Workshop (ACL)
Abstract:Question Generation (QG), the task of automatically generating questions from a source input, has seen significant progress in recent years. Difficulty-controllable QG (DCQG) enables control over the difficulty level of generated questions while considering the learner’s ability. Additionally, narrative-controllable QG (NCQG) allows control over the narrative aspects embedded in the questions. However, research in QG lacks a focus on combining these two types of control, which is important for generating questions tailored to educational purposes. To address this gap, we propose a strategy for Joint Narrative and Difficulty Control, enabling simultaneous control over these two attributes in the generation of reading comprehension questions. Our evaluation provides preliminary evidence that this approach is feasible, though it is not effective across all instances. Our findings highlight the conditions under which the strategy performs well and discuss the trade-offs associated with its application.
[NLP-130] Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL2025
【Quick Read】: This paper asks whether language models can reliably judge possible events as more likely than merely improbable ones. The key is teasing apart possibility, typicality, and contextual relatedness when analyzing models' probability judgments; the results show that current models are far from robust at this and, under certain conditions, even perform below chance.
Link: https://arxiv.org/abs/2506.06808
Authors: James A. Michaelov, Reeka Estacio, Zhien Zhang, Benjamin K. Bergen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of ACL 2025
Abstract:Can language models reliably predict that possible events are more likely than merely improbable ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models’ ability to do this is far from robust. In fact, under certain conditions, all models tested - including Llama 3, Gemma 2, and Mistral NeMo - perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as ‘the car was given a parking ticket by the brake’ than to merely unlikely sentences such as ‘the car was given a parking ticket by the explorer’.
[NLP-131] Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification ACL
【Quick Read】: Against the backdrop of exploding text volumes that make manual document classification increasingly hard, this paper proposes a robust, efficient, domain-agnostic generative framework for multi-label text classification. The key is to treat labels as predefined descriptions rather than atomic symbols: the model is trained to generate these descriptions from the input text, a fine-tuned sentence transformer matches the generated descriptions to the predefined labels at inference, and a dual-objective loss combining cross-entropy with the cosine similarity between generated sentences and target descriptions ensures both semantic alignment and accuracy.
Link: https://arxiv.org/abs/2506.06806
Authors: Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly
Affiliations: Indian Institute of Technology Kharagpur, India
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been accepted to appear at the Association for Computational Linguistics (ACL), 2025
Abstract:The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the pre-defined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
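The dual-objective loss is straightforward to sketch in PyTorch: token-level cross-entropy on the generated label description plus a cosine-similarity term on sentence embeddings. Shapes and the weighting alpha below are assumptions, not the paper's reported configuration.

```python
# A minimal sketch of a dual-objective loss: generation CE plus a term
# pulling the generated description's embedding toward the target label
# description's embedding.
import torch
import torch.nn.functional as F

def dual_objective(gen_logits, target_ids, gen_emb, target_emb, alpha=0.5):
    """gen_logits: [T, V]; target_ids: [T]; *_emb: [D] sentence embeddings."""
    ce = F.cross_entropy(gen_logits, target_ids)            # token-level CE
    cos = F.cosine_similarity(gen_emb, target_emb, dim=0)   # semantic match
    return ce + alpha * (1.0 - cos)                         # minimize both

T, V, D = 6, 100, 32
loss = dual_objective(torch.randn(T, V), torch.randint(0, V, (T,)),
                      torch.randn(D), torch.randn(D))
print(float(loss))
```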
[NLP-132] On the Adaptive Psychological Persuasion of Large Language Models
【Quick Read】: This paper addresses the lack of systematic study of large language models' (LLMs) dual capabilities to autonomously persuade and to resist persuasion in psychologically rhetorical settings. While prior work shows LLMs' strengths in instruction-following and rhetorical fluency, how they deploy psychological persuasion strategies in adversarial dialogues remains underexplored. The key is an adaptive framework based on direct preference optimization that uses the persuasion outcomes of strategy-specific responses as preference pairs, training LLMs to autonomously select optimal persuasion strategies, markedly raising their success rates while preserving general capabilities.
Link: https://arxiv.org/abs/2506.06800
Authors: Tianjie Ju, Yujia Chen, Hao Fei, Mong-Li Lee, Wynne Hsu, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Affiliations: Shanghai Jiao Tong University; National University of Singapore; Sichuan University
Subjects: Computation and Language (cs.CL)
Comments: Working in progress
Abstract:Previous work has showcased the intriguing capabilities of Large Language Models (LLMs) in instruction-following and rhetorical fluency. However, systematic exploration of their dual capabilities to autonomously persuade and resist persuasion, particularly in contexts involving psychological rhetoric, remains unexplored. In this paper, we first evaluate four commonly adopted LLMs by tasking them to alternately act as persuaders and listeners in adversarial dialogues. Empirical results show that persuader LLMs predominantly employ repetitive strategies, leading to low success rates. Then we introduce eleven comprehensive psychological persuasion strategies, finding that explicitly instructing LLMs to adopt specific strategies such as Fluency Effect and Repetition Effect significantly improves persuasion success rates. However, no ``one-size-fits-all’’ strategy proves universally effective, with performance heavily dependent on contextual counterfactuals. Motivated by these observations, we propose an adaptive framework based on direct preference optimization that trains LLMs to autonomously select optimal strategies by leveraging persuasion results from strategy-specific responses as preference pairs. Experiments on three open-source LLMs confirm that the proposed adaptive psychological persuasion method effectively enables persuader LLMs to select optimal strategies, significantly enhancing their success rates while maintaining general capabilities. Our code is available at this https URL.
[NLP-133] Extending dependencies to the taggedPBC: Word order in transitive clauses
【Quick Read】: This paper addresses the shortage of cross-linguistic dependency annotation needed for broader typological comparison. The key is transferring dependency information, along with POS tags, to all languages in the taggedPBC, yielding a CoNLLU-formatted annotated dataset. Despite concerns about tag and dependency quality, word order information derived from this dataset on the position of arguments and predicates in transitive clauses correlates with expert judgments in three typological databases (WALS, Grambank, Autotyp), underscoring the value of corpus-based typological approaches for extending comparisons of discrete linguistic categories.
Link: https://arxiv.org/abs/2506.06785
Authors: Hiram Ring
Affiliations: NTU Singapore
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:The taggedPBC (Ring 2025a) contains more than 1,800 sentences of pos-tagged parallel text data from over 1,500 languages, representing 133 language families and 111 isolates. While this dwarfs previously available resources, and the POS tags achieve decent accuracy, allowing for predictive crosslinguistic insights (Ring 2025b), the dataset was not initially annotated for dependencies. This paper reports on a CoNLLU-formatted version of the dataset which transfers dependency information along with POS tags to all languages in the taggedPBC. Although there are various concerns regarding the quality of the tags and the dependencies, word order information derived from this dataset regarding the position of arguments and predicates in transitive clauses correlates with expert determinations of word order in three typological databases (WALS, Grambank, Autotyp). This highlights the usefulness of corpus-based typological approaches (as per Baylor et al. 2023; Bjerva 2024) for extending comparisons of discrete linguistic categories, and suggests that important insights can be gained even from noisy data, given sufficient annotation. The dependency-annotated corpora are also made available for research and collaboration via GitHub.
[NLP-134] They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse ACL2025
【Quick Read】: This paper addresses the detection and interpretation of implicit content in political discourse, in particular pragmatic devices such as presupposition and implicature. The key is leveraging, for the first time, the large IMPAQTS corpus of Italian political speeches annotated for manipulative implicit content, and testing large language models (LLMs) with a multiple-choice task and an open-ended generation task to assess their pragmatic ability to handle highly implicit language.
Link: https://arxiv.org/abs/2506.06775
Authors: Walter Paci (1), Alessandro Panunzi (1), Sandro Pezzelle (2) ((1) University of Florence, (2) University of Amsterdam)
Affiliations: University of Florence; University of Amsterdam
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the ACL2025 Findings
Abstract:Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches with the annotation of manipulative implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at this https URL
[NLP-135] Geopolitical biases in LLMs: what are the “good” and the “bad” countries according to contemporary language models
【Quick Read】: This paper addresses geopolitical bias in large language models (LLMs), manifested as a tendency to favor particular national narratives when interpreting historical events with conflicting national perspectives. The key is a novel dataset of neutral event descriptions paired with contrasting viewpoints from different countries (USA, UK, USSR, and China), used to analyze how models handle geopolitical information, revealing their sensitivity to attribution and the limited effect of simple debiasing prompts.
Link: https://arxiv.org/abs/2506.06751
Authors: Mikhail Salnikov, Dmitrii Korzh, Ivan Lazichny, Elvir Karimov, Artyom Iudin, Ivan Oseledets, Oleg Y. Rogov, Alexander Panchenko, Natalia Loukachevitch, Elena Tutubalina
Affiliations: AIRI; Skoltech; MIPT; MTUCI; Lomonosov MSU; Kazan Federal University; Sber AI
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper evaluates geopolitical biases in LLMs with respect to various countries though an analysis of their interpretation of historical events with conflicting national perspectives (USA, UK, USSR, and China). We introduce a novel dataset with neutral event descriptions and contrasting viewpoints from different countries. Our findings show significant geopolitical biases, with models favoring specific national narratives. Additionally, simple debiasing prompts had a limited effect in reducing these biases. Experiments with manipulated participant labels reveal models’ sensitivity to attribution, sometimes amplifying biases or recognizing inconsistencies, especially with swapped labels. This work highlights national narrative biases in LLMs, challenges the effectiveness of simple debiasing methods, and offers a framework and dataset for future geopolitical bias research.
[NLP-136] C-PATH: Conversational Patient Assistance and Triage in Healthcare System
【Quick Read】: This paper tackles the difficulty patients face in obtaining timely, appropriate care within complex healthcare systems by proposing C-PATH (Conversational Patient Assistance and Triage in Healthcare), an LLM-powered conversational AI system that helps patients recognize symptoms and recommends the right medical department through natural multi-turn dialogue. The key is a multi-stage pipeline on the LLaMA3 architecture fine-tuned on medical knowledge, dialogue data, and clinical summaries, together with a GPT-based data augmentation framework that converts structured clinical knowledge from DDXPlus into lay-person-friendly conversations and a scalable dialogue history management strategy for long-range coherence.
Link: https://arxiv.org/abs/2506.06737
Authors: Qi Shi, Qiwei Han, Cláudia Soares
Affiliations: Universidade Nova de Lisboa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE ICDH 2025, 10 pages, 8 figures, 5 tables
Abstract:Navigating healthcare systems can be complex and overwhelming, creating barriers for patients seeking timely and appropriate medical attention. In this paper, we introduce C-PATH (Conversational Patient Assistance and Triage in Healthcare), a novel conversational AI system powered by large language models (LLMs) designed to assist patients in recognizing symptoms and recommending appropriate medical departments through natural, multi-turn dialogues. C-PATH is fine-tuned on medical knowledge, dialogue data, and clinical summaries using a multi-stage pipeline built on the LLaMA3 architecture. A core contribution of this work is a GPT-based data augmentation framework that transforms structured clinical knowledge from DDXPlus into lay-person-friendly conversations, allowing alignment with patient communication norms. We also implement a scalable conversation history management strategy to ensure long-range coherence. Evaluation with GPTScore demonstrates strong performance across dimensions such as clarity, informativeness, and recommendation accuracy. Quantitative benchmarks show that C-PATH achieves superior performance in GPT-rewritten conversational datasets, significantly outperforming domain-specific baselines. C-PATH represents a step forward in the development of user-centric, accessible, and accurate AI tools for digital health assistance and triage.
[NLP-137] Mitigating Object Hallucination via Robust Local Perception Search
【Quick Read】: This paper addresses hallucination in Multimodal Large Language Models (MLLMs), where outputs look plausible but do not match the image content. The key is Local Perception Search (LPS), a simple, training-free inference-time decoding method that uses local visual priors as a value function to correct the decoding process; experiments show it effectively suppresses hallucinations, performs especially well in high-noise settings, and is plug-and-play across models.
Link: https://arxiv.org/abs/2506.06729
Authors: Zixian Gao, Chao Yang, Zhanhui Zhou, Xing Xu, Chaochao Lu
Affiliations: Shanghai Artificial Intelligence Laboratory; University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have enabled them to effectively integrate vision and language, addressing a variety of downstream tasks. However, despite their significant success, these models still exhibit hallucination phenomena, where the outputs appear plausible but do not align with the content of the images. To mitigate this issue, we introduce Local Perception Search (LPS), a decoding method during inference that is both simple and training-free, yet effectively suppresses hallucinations. This method leverages local visual prior information as a value function to correct the decoding process. Additionally, we observe that the impact of the local visual prior on model performance is more pronounced in scenarios with high levels of image noise. Notably, LPS is a plug-and-play approach that is compatible with various models. Extensive experiments on widely used hallucination benchmarks and noisy data demonstrate that LPS significantly reduces the incidence of hallucinations compared to the baseline, showing exceptional performance, particularly in noisy settings.
[NLP-138] A Survey of Retentive Network
【Quick Read】: This survey fills the gap of a comprehensive review of the Retentive Network (RetNet) architecture and its applications. RetNet is an efficient neural architecture designed to overcome the quadratic complexity of Transformers when handling long sequences, which incurs high memory cost and limits scalability. The key is the retention mechanism, which unifies the inductive bias of recurrence with attention's global dependency modeling, enabling linear-time inference and efficient long-context modeling while remaining compatible with fully parallelizable training pipelines.
Link: https://arxiv.org/abs/2506.06708
Authors: Haiqi Yang, Zhiyuan Li, Yi Chang, Yuan Wu
Affiliations: School of Artificial Intelligence, Jilin University; Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China; International Center of Future Science, Jilin University
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 3 figures
Abstract:Retentive Network (RetNet) represents a significant advancement in neural network architecture, offering an efficient alternative to the Transformer. While Transformers rely on self-attention to model dependencies, they suffer from high memory costs and limited scalability when handling long sequences due to their quadratic complexity. To mitigate these limitations, RetNet introduces a retention mechanism that unifies the inductive bias of recurrence with the global dependency modeling of attention. This mechanism enables linear-time inference, facilitates efficient modeling of extended contexts, and remains compatible with fully parallelizable training pipelines. RetNet has garnered significant research interest due to its consistently demonstrated cross-domain effectiveness, achieving robust performance across machine learning paradigms including natural language processing, speech recognition, and time-series analysis. However, a comprehensive review of RetNet is still missing from the current literature. This paper aims to fill that gap by offering the first detailed survey of the RetNet architecture, its key innovations, and its diverse applications. We also explore the main challenges associated with RetNet and propose future research directions to support its continued advancement in both academic research and practical deployment.
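The retention mechanism's recurrent form, the source of RetNet's linear-time inference, fits in a few lines: a decayed state matrix accumulates key-value outer products, so each step costs O(d^2) regardless of sequence length. This single-head numpy sketch omits the gating, normalization, and multi-scale decay used in the full architecture.

```python
# A minimal sketch of the retention recurrence: S_t = gamma * S_{t-1}
# + K_t^T V_t, with output o_t = Q_t S_t, equivalent to decayed attention
# over all past positions.
import numpy as np

def retention_recurrent(Q, K, V, gamma=0.9):
    """Q, K: [T, d]; V: [T, d_v]; returns outputs [T, d_v]."""
    T, d = Q.shape
    S = np.zeros((d, V.shape[1]))              # running decayed memory
    out = np.zeros_like(V)
    for t in range(T):
        S = gamma * S + np.outer(K[t], V[t])   # decay, then write
        out[t] = Q[t] @ S                      # read with the query
    return out

rng = np.random.default_rng(0)
T, d = 5, 4
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d))
print(retention_recurrent(Q, K, V).shape)  # (5, 4)
```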
[NLP-139] DivScore: Zero-Shot Detection of LLM -Generated Text in Specialized Domains
【Quick Read】: This paper targets the detection of LLM-generated text in specialized, high-stakes domains such as medicine and law, which matters for combating misinformation and ensuring authenticity. Existing zero-shot detectors work on general text but fail on specialized content due to domain shift. The key is DivScore, a zero-shot detection framework based on normalized entropy scoring and domain knowledge distillation; a theoretical analysis ties its effectiveness to the KL divergence among the human, detector, and source text distributions, and experiments show it clearly outperforms state-of-the-art detectors on a new specialized-domain benchmark.
Link: https://arxiv.org/abs/2506.06705
Authors: Zhihui Chen, Kai He, Yucheng Huang, Yunxiao Zhu, Mengling Feng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Zhihui Chen and Kai He contributed equally to this work, Mengling Feng is the corresponding author
Abstract:Detecting LLM-generated text in specialized and high-stakes domains like medicine and law is crucial for combating misinformation and ensuring authenticity. However, current zero-shot detectors, while effective on general text, often fail when applied to specialized content due to domain shift. We provide a theoretical analysis showing this failure is fundamentally linked to the KL divergence between human, detector, and source text distributions. To address this, we propose DivScore, a zero-shot detection framework using normalized entropy-based scoring and domain knowledge distillation to robustly identify LLM-generated text in specialized domains. We also release a domain-specific benchmark for LLM-generated text detection in the medical and legal domains. Experiments on our benchmark show that DivScore consistently outperforms state-of-the-art detectors, with 14.4% higher AUROC and 64.0% higher recall (0.1% false positive rate threshold). In adversarial settings, DivScore demonstrates superior robustness than other baselines, achieving on average 22.8% advantage in AUROC and 29.5% in recall. Code and data are publicly available.
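A minimal sketch of the entropy-scoring ingredient (not DivScore's calibrated procedure, which also involves domain knowledge distillation): average the scoring model's per-token predictive entropy over a passage and normalize by log-vocabulary size, with lower values suggesting more model-like text. GPT-2 stands in for the detector model here.

```python
# A minimal sketch of a normalized-entropy detector score: mean per-position
# next-token entropy under a scoring LM, scaled to [0, 1] by log|V|.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def entropy_score(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]          # [T, V]
    logp = logits.log_softmax(-1)
    ent = -(logp.exp() * logp).sum(-1)         # per-position entropy (nats)
    return (ent.mean() / torch.log(torch.tensor(float(logp.size(-1))))).item()

score = entropy_score("The patient presented with acute chest pain.")
print(f"normalized entropy: {score:.3f}")      # lower -> more model-like
```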
[NLP-140] Dynamic and Parametric Retrieval-Augmented Generation
【Quick Read】: This tutorial addresses the shortcomings of conventional Retrieval-Augmented Generation (RAG) systems, which follow a static retrieve-then-generate pipeline with in-context knowledge injection, on complex tasks that require multi-hop reasoning, adaptive information access, and deeper integration of external knowledge. The key is surveying two emerging, complementary directions: Dynamic RAG, which adaptively decides when and what to retrieve during generation so the system tracks the model's evolving information needs in real time, and Parametric RAG, which rethinks how retrieved knowledge is injected into LLMs, moving from input-level to parameter-level injection for better efficiency and effectiveness.
Link: https://arxiv.org/abs/2506.06704
Authors: Weihang Su, Qingyao Ai, Jingtao Zhan, Qian Dong, Yiqun Liu
Affiliations: Tsinghua University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has become a foundational paradigm for equipping large language models (LLMs) with external knowledge, playing a critical role in information retrieval and knowledge-intensive applications. However, conventional RAG systems typically adopt a static retrieve-then-generate pipeline and rely on in-context knowledge injection, which can be suboptimal for complex tasks that require multihop reasoning, adaptive information access, and deeper integration of external knowledge. Motivated by these limitations, the research community has moved beyond static retrieval and in-context knowledge injection. Among the emerging directions, this tutorial delves into two rapidly growing and complementary research areas on RAG: Dynamic RAG and Parametric RAG. Dynamic RAG adaptively determines when and what to retrieve during the LLM’s generation process, enabling real-time adaptation to the LLM’s evolving information needs. Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness. This tutorial offers a comprehensive overview of recent advances in these emerging research areas. It also shares theoretical foundations and practical insights to support and inspire further research in RAG.
[NLP-141] MarginSel: Max-Margin Demonstration Selection for LLMs
【Quick Read】: This paper addresses the instability of in-context learning (ICL) in large language models (LLMs), whose effectiveness is sensitive to the selection and ordering of demonstration examples. The key is MarginSel, a two-step method that selects hard demonstration examples for the ICL prompt and adapts to each test instance, yielding a 2-7% absolute F1 improvement on classification tasks over random example selection; analysis suggests it induces max-margin behavior by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.
Link: https://arxiv.org/abs/2506.06699
Authors: Rajeev Bhatt Ambati, James Lester, Shashank Srivastava, Snigdha Chaturvedi
Affiliations: UNC Chapel Hill; NC State University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) excel at few-shot learning via in-context learning (ICL). However, the effectiveness of ICL is often sensitive to the selection and ordering of demonstration examples. To address this, we present MarginSel: Max-Margin Demonstration Selection for LLMs, a two-step method that selects hard demonstration examples for the ICL prompt, adapting to each test instance. Our approach achieves 2-7% absolute improvement in F1-score across classification tasks, compared to a random selection of examples. We also provide theoretical insights and empirical evidence showing that MarginSel induces max-margin behavior in LLMs by effectively increasing the margin for hard examples, analogous to support vectors, thereby shifting the decision boundary in a beneficial direction.
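One way to picture margin-aware demonstration selection is the sketch below: prefer pool examples that are similar to the test instance but have small classifier margins, analogous to support vectors. The scoring rule, alpha weight, and inputs are illustrative assumptions, not MarginSel's exact two-step procedure.

```python
# A minimal sketch of selecting "hard" demonstrations per test instance:
# rank pool examples by (similarity to test point) minus (classifier margin),
# so nearby, low-margin examples score highest.
import numpy as np

def select_demos(test_emb, pool_embs, pool_margins, k=4, alpha=1.0):
    """pool_embs: [N, D]; pool_margins: [N] classifier margins (small = hard)."""
    sim = pool_embs @ test_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(test_emb) + 1e-9)
    score = sim - alpha * pool_margins   # similar AND hard scores highest
    return np.argsort(-score)[:k]        # indices of chosen demonstrations

rng = np.random.default_rng(1)
pool = rng.normal(size=(50, 16))
margins = rng.uniform(0, 1, size=50)
print(select_demos(rng.normal(size=16), pool, margins))
```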
[NLP-142] Contextual Experience Replay for Self-Improvement of Language Agents ACL2025
【Quick Read】: This paper addresses two weaknesses of large language model (LLM) agents on complex sequential decision-making tasks: poor performance without environment-specific experience, and the inability to keep learning from past experiences at inference time. The key is Contextual Experience Replay (CER), a training-free framework that accumulates and synthesizes past experiences, covering environment dynamics and common decision-making patterns, into a dynamic memory buffer, letting the agent retrieve and augment itself with relevant knowledge on new tasks and improving adaptability in complex environments.
Link: https://arxiv.org/abs/2506.06698
Authors: Yitao Liu, Chenglei Si, Karthik Narasimhan, Shunyu Yao
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025. 20 pages
Abstract:Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER achieves a competitive performance of 31.9%. On WebArena, CER also gets a competitive average success rate of 36.7%, relatively improving the success rate of the GPT-4o agent baseline by 51.0%. We also conduct a comprehensive analysis on it to prove its efficiency, validity and understand it better.
[NLP-143] Learning Distribution-Wise Control in Representation Space for Language Models ICML2025
【Quick Read】: This paper asks how interventions can steer language model (LM) behavior more effectively while preserving robustness and controllability. The key is extending pointwise interventions (representation fine-tuning) to the distribution level: the model learns not only pointwise transformations within a concept subspace but also the surrounding regions of that subspace. This improves performance in early layers and, across multiple commonsense and arithmetic reasoning benchmarks, yields stronger controllability and robustness than pointwise interventions.
Link: https://arxiv.org/abs/2506.06686
Authors: Chunyuan Deng, Ruidi Chang, Hanjie Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: ICML 2025
Abstract:Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: \hrefthis https URLthis https URL.
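The pointwise-to-distribution-wise shift can be sketched directly: instead of adding one fixed steering vector to a hidden state, sample the edit from a learned Gaussian around the concept direction, so training covers a region of the subspace. This is my reading of the idea, with toy tensors standing in for learned parameters.

```python
# A minimal sketch contrasting a pointwise edit (one fixed offset) with a
# distribution-wise edit (offset sampled from a learned Gaussian).
import torch

def pointwise_edit(h, direction):
    return h + direction                      # single fixed offset

def distributionwise_edit(h, mu, log_std):
    eps = torch.randn_like(mu)                # reparameterized sample
    return h + mu + eps * log_std.exp()       # offset drawn from N(mu, std^2)

D = 8
h = torch.randn(D)
mu, log_std = torch.zeros(D), torch.full((D,), -1.0)
print(pointwise_edit(h, mu).shape, distributionwise_edit(h, mu, log_std).shape)
```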
[NLP-144] Quantile Regression with Large Language Models for Price Prediction ACL
【Quick Read】: This paper addresses probabilistic regression with large language models (LLMs) over unstructured inputs, for text-to-distribution prediction tasks such as price estimation where nuanced text understanding and uncertainty quantification are both critical. The key is a novel quantile regression approach that lets LLMs produce full predictive distributions instead of traditional point estimates, yielding clear gains in both prediction accuracy and distributional calibration.
Link: https://arxiv.org/abs/2506.06657
Authors: Nikhita Vedula, Dushyanta Dhyani, Laleh Jalali, Boris Oreshkin, Mohsen Bayati, Shervin Malmasi
Affiliations: Amazon.com, Inc.; Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Findings of ACL, 2025
Abstract:Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at this https URL to support future research.
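Fitting quantile heads comes down to the pinball (quantile) loss, sketched below; minimizing it across several quantile levels makes the model emit a full predictive distribution rather than a point estimate. The particular levels and numbers are illustrative, not the paper's configuration.

```python
# A minimal sketch of the pinball loss, the standard training signal for
# quantile regression heads.
import torch

def pinball_loss(pred, target, quantiles):
    """pred: [B, Q] quantile predictions; target: [B]; quantiles: [Q]."""
    err = target.unsqueeze(1) - pred                        # [B, Q]
    return torch.maximum(quantiles * err, (quantiles - 1) * err).mean()

q = torch.tensor([0.1, 0.5, 0.9])
pred = torch.tensor([[80.0, 100.0, 130.0]])   # predicted price quantiles
target = torch.tensor([95.0])
print(float(pinball_loss(pred, target, q)))
```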
[NLP-145] SafeLawBench: Towards Safe Alignment of Large Language Models ACL2025
【Quick Read】: This paper addresses the lack of definitive standards for evaluating the safety of large language models (LLMs), given the subjectivity of current safety benchmarks. The key is SafeLawBench, which classifies safety risks into three levels from a legal perspective and provides a systematic, comprehensive evaluation framework comprising 24,860 multiple-choice questions and 1,106 open-domain QA tasks to measure LLM safety objectively.
Link: https://arxiv.org/abs/2506.06636
Authors: Chuxue Cao, Han Zhu, Jiaming Ji, Qichao Sun, Zhenghao Zhu, Yinyu Wu, Juntao Dai, Yaodong Yang, Sirui Han, Yike Guo
Affiliations: Hong Kong University of Science and Technology; Peking University
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL2025 Findings
Abstract:With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs’ safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs’ safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
[NLP-146] Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning
【Quick Read】: This paper aims to improve language models' reasoning via reinforcement learning (RL). Since prior work suggests RL alone is less effective on inherently hard tasks, the paper draws on curriculum learning and proposes scheduling tasks from easy to hard (E2H), letting large language models (LLMs) build reasoning skills gradually. The key is the schedule itself: easy tasks help early learning but must be faded out to avoid overfitting as harder tasks are introduced, and the method's effectiveness is supported both theoretically and empirically.
Link: https://arxiv.org/abs/2506.06632
Authors: Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji
Affiliations: Texas A&M University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method.
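A minimal sketch of an E2H schedule: each stage samples a batch mixing easy and hard tasks, with the easy share decaying to zero so early learning is possible but easy tasks are eventually faded out. The linear decay rule is an assumption, not the paper's exact schedule.

```python
# A minimal sketch of easy-to-hard task scheduling across RL stages.
import random

def e2h_batch(easy_tasks, hard_tasks, stage, n_stages, batch_size=8):
    easy_frac = max(0.0, 1.0 - stage / (n_stages - 1))   # fades 1.0 -> 0.0
    n_easy = round(easy_frac * batch_size)
    return (random.sample(easy_tasks, n_easy)
            + random.sample(hard_tasks, batch_size - n_easy))

easy = [f"easy-{i}" for i in range(100)]
hard = [f"hard-{i}" for i in range(100)]
for stage in range(4):
    batch = e2h_batch(easy, hard, stage, n_stages=4)
    print(stage, sum(t.startswith("easy") for t in batch), "easy of", len(batch))
```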
[NLP-147] Psychological Counseling Cannot Be Achieved Overnight: Automated Psychological Counseling Through Multi-Session Conversations
【Quick Read】: This paper addresses the overemphasis on single-session counseling in current research, whereas real psychological counseling is a sustained, multi-session process. The key is MusPsy-Dataset, a multi-session counseling conversation dataset built from real client profiles in publicly available psychological case reports, capturing the progressive arc of counseling for the same client across sessions, and the accompanying MusPsy-Model, which tracks client progress and adapts its counseling direction over time.
Link: https://arxiv.org/abs/2506.06626
Authors: Junzhe Wang, Bichen Wang, Xing Fu, Yixin Sun, Yanyan Zhao, Bing Qin
Affiliations: Harbin Institute of Technology, China
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 19 figures
Abstract:In recent years, Large Language Models (LLMs) have made significant progress in automated psychological counseling. However, current research focuses on single-session counseling, which doesn’t represent real-world scenarios. In practice, psychological counseling is a process, not a one-time event, requiring sustained, multi-session engagement to progressively address clients’ issues. To overcome this limitation, we introduce a dataset for Multi-Session Psychological Counseling Conversation Dataset (MusPsy-Dataset). Our MusPsy-Dataset is constructed using real client profiles from publicly available psychological case reports. It captures the dynamic arc of counseling, encompassing multiple progressive counseling conversations from the same client across different sessions. Leveraging our dataset, we also developed our MusPsy-Model, which aims to track client progress and adapt its counseling direction over time. Experiments show that our model performs better than baseline models across multiple sessions.
[NLP-148] BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs ACL
【Quick Read】: This paper addresses an under-explored area of Legal NLP: the writing and editing of legal briefs. The key is BRIEFME, a dataset focused on legal briefs with three tasks, argument summarization, argument completion, and case retrieval, designed to evaluate and improve language models' ability to assist legal professionals in brief writing.
Link: https://arxiv.org/abs/2506.06619
Authors: Jesse Woo, Fateme Hashemi Chaleshtori, Ana Marasović, Kenneth Marino
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: ACL Findings 2025; 10 pages main, 5 pages references, 37 pages appendix
Abstract:A core part of legal work that has been under-explored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make new arguments to try to expand the law in a new direction and make novel and creative arguments that are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today’s large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.
[NLP-149] Interpretable Depression Detection from Social Media Text Using LLM-Derived Embeddings
【Quick Read】: This paper tackles accurate and interpretable detection of depressive language in social media data to support early intervention for mental health conditions, with implications for both clinical practice and public health. The key to the solution is comparing large language models (LLMs) with traditional machine-learning classifiers across three classification tasks, and in particular using LLM-generated summary embeddings as features, which improve classification over conventional text embeddings, especially on fine-grained ordinal tasks.
Link: https://arxiv.org/abs/2506.06616
Authors: Samuel Kim,Oghenemaro Imieye,Yunting Yin
Affiliations: Earlham College
Subjects: Computation and Language (cs.CL)
Comments: Submitted to the IEEE EMBS BHI 2025 Conference
Abstract:Accurate and interpretable detection of depressive language in social media is useful for early interventions of mental health conditions, and has important implications for both clinical practice and broader public health efforts. In this paper, we investigate the performance of large language models (LLMs) and traditional machine learning classifiers across three classification tasks involving social media data: binary depression classification, depression severity classification, and differential diagnosis classification among depression, PTSD, and anxiety. Our study compares zero-shot LLMs with supervised classifiers trained on both conventional text embeddings and LLM-generated summary embeddings. Our experiments reveal that while zero-shot LLMs demonstrate strong generalization capabilities in binary classification, they struggle with fine-grained ordinal classifications. In contrast, classifiers trained on summary embeddings generated by LLMs demonstrate competitive, and in some cases superior, performance on the classification tasks, particularly when compared to models using traditional text embeddings. Our findings demonstrate the strengths of LLMs in mental health prediction, and suggest promising directions for better utilization of their zero-shot capabilities and context-aware summarization techniques.
[NLP-150] Transferring Features Across Language Models With Model Stitching
【Quick Read】: This paper asks how learned feature representations can be transferred efficiently between language models of different sizes, especially for expensive components such as sparse autoencoders (SAEs). The key to the solution is affine mappings between residual streams: a cheap yet effective way to transfer features that lets SAE weights trained on a small model be moved to a larger one, preserving performance while reducing the FLOPs needed for training.
Link: https://arxiv.org/abs/2506.06609
Authors: Alan Chen,Jack Merullo,Alessandro Stolfo,Ellie Pavlick
Affiliations: Brown University; ETH Zürich
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:In this work, we demonstrate that affine mappings between residual streams of language models are a cheap way to effectively transfer represented features between models. We apply this technique to transfer the weights of Sparse Autoencoders (SAEs) between models of different sizes to compare their representations. We find that small and large models learn highly similar representation spaces, which motivates training expensive components like SAEs on a smaller model and transferring to a larger model at a FLOPs savings. For example, using a small-to-large transferred SAE as initialization can lead to 50% cheaper training runs when training SAEs on larger models. Next, we show that transferred probes and steering vectors can effectively recover ground truth performance. Finally, we dive deeper into feature-level transferability, finding that semantic and structural features transfer noticeably differently while specific classes of functional features have their roles faithfully mapped. Overall, our findings illustrate similarities and differences in the linear representation spaces of small and large models and demonstrate a method for improving the training efficiency of SAEs.
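As a rough illustration of the stitching idea, the sketch below fits an affine map between residual-stream activations with ordinary least squares; the data layout and the SAE-transfer comment are assumptions, not the paper's exact procedure.

```python
import numpy as np

def fit_affine_map(x_small, x_large):
    """Fit W, b by least squares so that x_large ≈ x_small @ W + b.

    x_small: (n, d_small) residual-stream activations from the small model;
    x_large: (n, d_large) activations from the large model on the same inputs.
    """
    A = np.hstack([x_small, np.ones((x_small.shape[0], 1))])  # bias column
    sol, *_ = np.linalg.lstsq(A, x_large, rcond=None)
    return sol[:-1], sol[-1]  # W: (d_small, d_large), b: (d_large,)

# An SAE trained on the small model could then be initialized in the large
# model's space by composing it with (W, b); the exact composition rule is
# an assumption here, not the paper's stated procedure.
```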
[NLP-151] Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
【Quick Read】: This paper addresses tokenizer transplantation in pretrained LLMs: moving one model's tokenizer to another without fine-tuning while preserving the original model's performance. The key to the solution is reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP), approximating each new token as a sparse linear combination of shared tokens and carrying the sparse coefficients into the target model's embedding space, which effectively bridges tokenizer discrepancies.
Link: https://arxiv.org/abs/2506.06607
Authors: Charles Goddard,Fernando Fernandes Neto
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token as a sparse linear combination of shared tokens, in two phases: first, compute each new token’s representation in the donor embedding space with a small dictionary of shared anchor tokens, then transfer these same sparse coefficients back into the base model’s embedding space. On two challenging cross-tokenizer tasks–Llama → Mistral NeMo (12B) and Qwen → Llama (1B)–we show that OMP achieves best zero-shot preservation of the base model’s performance across multiple benchmarks, while other zero-shot approaches degrade significantly. Compared to baselines (zero-init, mean-init, and existing approaches like WECHSEL, FOCUS, ZETT), OMP consistently achieves the best overall performance, effectively bridging large tokenizer discrepancies without gradient updates. Our analysis further identifies mismatched numerical tokenization schemes as a critical challenge for preserving mathematical reasoning capabilities. This technique enables direct reuse of pretrained model weights with new tokenizers, facilitating cross-tokenizer knowledge distillation, speculative decoding, ensembling, merging, and domain-specific vocabulary adaptations. We integrate our method into the open-source mergekit-tokensurgeon tool for post hoc vocabulary realignment.
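A minimal sketch of the two-phase OMP idea, using scikit-learn's OrthogonalMatchingPursuit; the anchor-matrix layout and the sparsity level k are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def transplant_embedding(e_new_donor, donor_anchors, base_anchors, k=8):
    """Phase 1: express the new token's donor-space embedding as a sparse
    combination of shared anchor-token embeddings. Phase 2: reuse the same
    sparse coefficients over the anchors' embeddings in the base model.

    donor_anchors: (n_anchors, d_donor); base_anchors: (n_anchors, d_base);
    e_new_donor: (d_donor,). The sparsity level k is an illustrative choice.
    """
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(donor_anchors.T, e_new_donor)  # solve e_new ≈ donor_anchors.T @ coef
    coef = omp.coef_                        # sparse weights over shared anchors
    return base_anchors.T @ coef            # same mixture, base embedding space
```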
[NLP-152] MedCite: Can Language Models Generate Verifiable Text for Medicine?
【Quick Read】: This paper addresses the lack of citation generation and evaluation capabilities in existing LLM-based medical question-answering systems, which limits their reliability in practice. The key to the solution is MedCite, an end-to-end framework for designing and evaluating citation generation in medical tasks, together with a novel multi-pass retrieval-citation method that produces high-quality citations, outperforming strong baselines on citation precision and recall, with evaluation results that correlate well with annotations from professional experts.
Link: https://arxiv.org/abs/2506.06605
Authors: Xiao Wang,Mengjue Tan,Qiao Jin,Guangzhi Xiong,Yu Hu,Aidong Zhang,Zhiyong Lu,Minjia Zhang
Affiliations: University of Illinois at Urbana-Champaign; Brown University; National Library of Medicine, NIH; University of Virginia; Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing LLM-based medical question-answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of citation generation with LLMs for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that evaluation results correlate well with annotation results from professional experts.
[NLP-153] Precise Information Control in Long-Form Text Generation
【Quick Read】: This paper targets intrinsic hallucination in modern language models (LMs): the generation of plausible but unsubstantiated content relative to the input context. The key to the solution is Precise Information Control (PIC), a new task formulation that requires models to generate long-form text strictly grounded in a provided set of verifiable claims without adding unsupported content; a full setting and a partial setting evaluate how faithfully models incorporate the input claims, and the accompanying PIC-Bench benchmark measures model performance.
Link: https://arxiv.org/abs/2506.06589
Authors: Jacqueline He,Howard Yen,Margaret Li,Shuyue Stella Li,Zhiyuan Zeng,Weijia Shi,Yulia Tsvetkov,Danqi Chen,Pang Wei Koh,Luke Zettlemoyer
Affiliations: Paul G. Allen School of Computer Science & Engineering, University of Washington; Princeton Language and Intelligence (PLI), Princeton University; Allen Institute for AI
Subjects: Computation and Language (cs.CL)
Comments: 56 pages, 8 figures. Code and models are publicly available at this https URL
Abstract:A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model’s ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability–improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
[NLP-154] Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques
【Quick Read】: This paper addresses the high computational and energy cost of LLM inference, which makes deployment difficult under hardware, power, or bandwidth constraints. The key to the solution is multi-LLM intelligent model selection that allocates compute dynamically, via two complementary approaches: routing, which picks the most suitable model for each query, and cascading or hierarchical inference (HI), which escalates a query through a sequence of models until a confident response is found, cutting computation while preserving quality.
Link: https://arxiv.org/abs/2506.06579
Authors: Adarsh Prasad Behera,Jaya Prakash Champati,Roberto Morabito,Sasu Tarkoma,James Gross
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:Recent progress in Language Models (LMs) has dramatically advanced the field of natural language processing (NLP), excelling at tasks like text generation, summarization, and question answering. However, their inference remains computationally expensive and energy intensive, especially in settings with limited hardware, power, or bandwidth. This makes it difficult to deploy LMs in mobile, edge, or cost-sensitive environments. To address these challenges, recent approaches have introduced multi-LLM intelligent model selection strategies that dynamically allocate computational resources based on query complexity – using lightweight models for simpler queries and escalating to larger models only when necessary. This survey explores two complementary strategies for efficient LLM inference: (i) routing, which selects the most suitable model based on the query, and (ii) cascading or hierarchical inference (HI), which escalates queries through a sequence of models until a confident response is found. Both approaches aim to reduce computation by using lightweight models for simpler tasks while offloading only when needed. We provide a comparative analysis of these techniques across key performance metrics, discuss benchmarking efforts, and outline open challenges. Finally, we outline future research directions to enable faster response times, adaptive model selection based on task complexity, and scalable deployment across heterogeneous environments, making LLM based systems more efficient and accessible for real world applications.
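The cascading strategy surveyed here can be captured in a few lines; the generate(query) -> (answer, confidence) interface and the fixed threshold below are hypothetical simplifications of the designs the survey covers.

```python
def cascaded_answer(query, models, threshold=0.8):
    """Hierarchical inference sketch: try models from cheapest to largest,
    returning the first answer whose self-reported confidence clears the bar.

    Each model is assumed to expose generate(query) -> (answer, confidence);
    the interface and the threshold value are illustrative, not from the survey.
    """
    answer, conf = None, 0.0
    for model in models:            # ordered small -> large
        answer, conf = model.generate(query)
        if conf >= threshold:       # confident enough: stop escalating
            break
    return answer
```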
[NLP-155] Future of Work with AI Agents: Auditing Automation and Augmentation Potential across the U.S. Workforce
【Quick Read】: This paper studies the impact of compound AI systems (a.k.a. AI agents) on the labor market, where concerns about job displacement, diminished human agency, and overreliance on automation currently lack systematic analysis. The key to the solution is a novel auditing framework that captures, via audio-enhanced mini-interviews, which tasks workers want AI agents to automate or augment, using the Human Agency Scale (HAS) as a shared language for quantifying preferred human involvement; the resulting WORKBank database combines preferences from 1,500 domain workers with AI-expert capability assessments over 844 tasks, dividing tasks into four zones that expose key mismatches and opportunities for AI agent development.
Link: https://arxiv.org/abs/2506.06576
Authors: Yijia Shao,Humishka Zope,Yucheng Jiang,Jiaxin Pei,David Nguyen,Erik Brynjolfsson,Diyi Yang
Affiliations: Stanford University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Preprint
Abstract:The rapid rise of compound AI systems (a.k.a., AI agents) is reshaping the labor market, raising concerns about job displacement, diminished human agency, and overreliance on automation. Yet, we lack a systematic understanding of the evolving landscape. In this paper, we address this gap by introducing a novel auditing framework to assess which occupational tasks workers want AI agents to automate or augment, and how those desires align with the current technological capabilities. Our framework features an audio-enhanced mini-interview to capture nuanced worker desires and introduces the Human Agency Scale (HAS) as a shared language to quantify the preferred level of human involvement. Using this framework, we construct the WORKBank database, building on the U.S. Department of Labor’s O*NET database, to capture preferences from 1,500 domain workers and capability assessments from AI experts across over 844 tasks spanning 104 occupations. Jointly considering the desire and technological capability divides tasks in WORKBank into four zones: Automation “Green Light” Zone, Automation “Red Light” Zone, R&D Opportunity Zone, Low Priority Zone. This highlights critical mismatches and opportunities for AI agent development. Moving beyond a simple automate-or-not dichotomy, our results reveal diverse HAS profiles across occupations, reflecting heterogeneous expectations for human involvement. Moreover, our study offers early signals of how AI agent integration may reshape the core human competencies, shifting from information-focused skills to interpersonal ones. These findings underscore the importance of aligning AI agent development with human desires and preparing workers for evolving workplace dynamics.
[NLP-156] LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
【Quick Read】: This paper addresses the lack of personalization in AI-generated figure captions, which authors usually have to revise to match their writing style and domain conventions. The key to the solution is the LaMP-Cap dataset for personalized figure caption generation, which supplies multimodal figure profiles (figure images, related figures with their captions, and figure-mentioning paragraphs) as context, helping models generate captions closer to the original authors' style.
Link: https://arxiv.org/abs/2506.06561
Authors: Ho Yin ‘Sam’ Ng,Ting-Yao Hsu,Aashish Anantha Ramakrishnan,Branislav Kveton,Nedim Lipka,Franck Dernoncourt,Dongwon Lee,Tong Yu,Sungchul Kim,Ryan A. Rossi,Ting-Hao ‘Kenneth’ Huang
Affiliations: Pennsylvania State University; Adobe Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Figure captions are crucial for helping readers understand and remember a figure’s key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain’s style, highlighting the need for personalization. Despite language models’ personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document–each with its image, caption, and figure-mentioning paragraphs–as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
[NLP-157] Large Language Models Can Be a Viable Substitute for Expert Political Surveys When a Shock Disrupts Traditional Measurement Approaches
【Quick Read】: This paper addresses the problem that after a disruptive shock such as the 2025 Department of Government Efficiency (DOGE) federal layoffs, expert judgments are colored by knowledge of the outcome, making it difficult to reconstruct pre-event perceptions and study the factors associated with the event. The key to the solution is using large language models (LLMs) trained on vast digital media data: pairwise-comparison prompts yield ideology scores for federal executive agencies that replicate pre-layoff expert measures and predict which agencies DOGE targeted, and the same approach shows that perceptions of agencies as knowledge institutions predict targeting even after controlling for ideology.
Link: https://arxiv.org/abs/2506.06540
Authors: Patrick Y. Wu
Affiliations: American University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 19 pages, 6 figures
Abstract:After a disruptive event or shock, such as the Department of Government Efficiency (DOGE) federal layoffs of 2025, expert judgments are colored by knowledge of the outcome. This can make it difficult or impossible to reconstruct the pre-event perceptions needed to study the factors associated with the event. This position paper argues that large language models (LLMs), trained on vast amounts of digital media data, can be a viable substitute for expert political surveys when a shock disrupts traditional measurement. We analyze the DOGE layoffs as a specific case study for this position. We use pairwise comparison prompts with LLMs and derive ideology scores for federal executive agencies. These scores replicate pre-layoff expert measures and predict which agencies were targeted by DOGE. We also use this same approach and find that the perceptions of certain federal agencies as knowledge institutions predict which agencies were targeted by DOGE, even when controlling for ideology. This case study demonstrates that using LLMs allows us to rapidly and easily test the associated factors hypothesized behind the shock. More broadly, our case study of this recent event exemplifies how LLMs offer insights into the correlational factors of the shock when traditional measurement techniques fail. We conclude by proposing a two-part criterion for when researchers can turn to LLMs as a substitute for expert political surveys.
[NLP-158] Beyond Facts: Evaluating Intent Hallucination in Large Language Models ACL2025
【Quick Read】: This paper targets intent hallucination: when faced with complex queries containing multiple conditions, LLMs tend to produce responses that satisfy the query only partially, omitting or misinterpreting some conditions. The key to the solution is FAITHQA, a new benchmark for systematically evaluating intent hallucination, together with CONSTRAINT SCORE, an automatic metric for detecting it; evaluations show intent hallucination is common even in state-of-the-art models and stems from omission or misinterpretation of the query.
Link: https://arxiv.org/abs/2506.06539
Authors: Yijie Hao,Haofei Yu,Jiaxuan You
Affiliations: Emory University; UIUC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 main conference
Abstract:When exposed to complex queries containing multiple conditions, today’s large language models (LLMs) tend to produce responses that only partially satisfy the query while neglecting certain conditions. We therefore introduce the concept of Intent Hallucination. In this phenomenon, LLMs either omit (neglecting to address certain parts) or misinterpret (responding to invented query parts) elements of the given query, leading to intent hallucinated generation. To systematically evaluate intent hallucination, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 20,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. FAITHQA is the first hallucination benchmark that goes beyond factual verification, tailored to identify the fundamental cause of intent hallucination. By evaluating various LLMs on FAITHQA, we find that (1) intent hallucination is a common issue even for state-of-the-art models, and (2) the phenomenon stems from omission or misinterpretation of LLMs. To facilitate future research, we introduce an automatic LLM generation evaluation metric, CONSTRAINT SCORE, for detecting intent hallucination. Human evaluation results demonstrate that CONSTRAINT SCORE is closer to human performance for intent hallucination compared to baselines.
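The abstract does not define CONSTRAINT SCORE, but a toy coverage metric in its spirit might look like the following, where judge is a stand-in for an LLM-based checker; everything in this sketch is an assumption for illustration, not the paper's metric.

```python
def constraint_coverage(constraints, response, judge):
    """Toy coverage metric in the spirit of CONSTRAINT SCORE: the fraction of
    query constraints the response actually addresses.

    judge(constraint, response) -> bool stands in for an LLM judge; the real
    metric's exact formulation is not specified in the abstract.
    """
    addressed = sum(judge(c, response) for c in constraints)
    return addressed / max(1, len(constraints))
```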
[NLP-159] Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance
【Quick Read】: This paper addresses the opacity of post-training datasets and the difficulty of systematically assessing their effect on model performance: most post-training data behind leading open- and closed-source LLMs is not public, and little is known about how it was built. The key to the solution is a comprehensive side-by-side analysis of two prominent open post-training datasets, Tulu-3-SFT-Mix and SmolTalk, annotated with the Magpie framework to extract structural and quality metrics; the resulting insights drive a principled curation recipe that yields TuluTalk, a mixture with fewer samples that matches or exceeds the source datasets on key benchmarks.
Link: https://arxiv.org/abs/2506.06522
Authors: Aladin Djuhera,Swanand Ravindra Kadhe,Syed Zawad,Farhan Ahmed,Heiko Ludwig,Holger Boche
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent work on large language models (LLMs) has increasingly focused on post-training and alignment with datasets curated to enhance instruction following, world knowledge, and specialized skills. However, most post-training datasets used in leading open- and closed-source LLMs remain inaccessible to the public, with limited information about their construction process. This lack of transparency has motivated the recent development of open-source post-training corpora. While training on these open alternatives can yield performance comparable to that of leading models, systematic comparisons remain challenging due to the significant computational cost of conducting them rigorously at scale, and are therefore largely absent. As a result, it remains unclear how specific samples, task types, or curation strategies influence downstream performance when assessing data quality. In this work, we conduct the first comprehensive side-by-side analysis of two prominent open post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie framework, we annotate each sample with detailed quality metrics, including turn structure (single-turn vs. multi-turn), task category, input quality, and response quality, and we derive statistics that reveal structural and qualitative similarities and differences between the two datasets. Based on these insights, we design a principled curation recipe that produces a new data mixture, TuluTalk, which contains 14% fewer samples than either source dataset while matching or exceeding their performance on key benchmarks. Our findings offer actionable insights for constructing more effective post-training datasets that improve model performance within practical resource limits. To support future research, we publicly release both the annotated source datasets and our curated TuluTalk mixture.
[NLP-160] Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes ACL
【Quick Read】: This paper examines how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in downstream tasks, in particular how they propagate into zero-shot retrieval. The key to the solution is a controlled framework that quantifies propagation by correlating (a) intrinsic bias measures in the representation space with (b) extrinsic bias measures in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval; the correlations are substantial (average ρ = 0.83 ± 0.10), larger or better-performing models show stronger propagation, and the framework contributes baseline evaluation tasks for assessing and mitigating bias in AI systems.
Link: https://arxiv.org/abs/2506.06506
Authors: Kshitish Ghate,Tessa Charlesworth,Mona Diab,Aylin Caliskan
Affiliations: Carnegie Mellon University; Northwestern University; University of Washington
Subjects: Computation and Language (cs.CL)
Comments: Accepted to ACL Findings 2025
Abstract:To build fair AI systems we need to understand how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in biases in downstream tasks. In this study, we demonstrate that intrinsic biases in VLM representations systematically “carry over” or propagate into zero-shot retrieval tasks, revealing how deeply rooted biases shape a model’s outputs. We introduce a controlled framework to measure this propagation by correlating (a) intrinsic measures of bias in the representational space with (b) extrinsic measures of bias in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval. Results show substantial correlations between intrinsic and extrinsic bias, with an average ρ = 0.83 ± 0.10. This pattern is consistent across 114 analyses, both retrieval directions, six social groups, and three distinct VLMs. Notably, we find that larger/better-performing models exhibit greater bias propagation, a finding that raises concerns given the trend towards increasingly complex AI models. Our framework introduces baseline evaluation tasks to measure the propagation of group and valence signals. Investigations reveal that underrepresented groups experience less robust propagation, further skewing their model-related outcomes.
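Measuring propagation reduces to correlating paired bias scores; a minimal sketch with placeholder numbers follows (the paper does not state which correlation coefficient its ρ denotes, so Spearman is an assumption here).

```python
from scipy.stats import spearmanr

# Intrinsic bias scores (representation space) vs. extrinsic bias scores
# (zero-shot retrieval outcomes) for the same model/group pairs. The numbers
# below are placeholders, not data from the paper.
intrinsic = [0.12, 0.30, 0.45, 0.51, 0.62]
extrinsic = [0.10, 0.28, 0.40, 0.55, 0.60]
rho, p = spearmanr(intrinsic, extrinsic)
print(f"bias propagation correlation: rho={rho:.2f} (p={p:.3f})")
```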
[NLP-161] Improving LLM-Powered EDA Assistants with RAFT
【Quick Read】: This paper addresses two gaps in applying LLMs to Electronic Design Automation (EDA): open-source LLMs lack domain-specific knowledge, and under Retrieval-Augmented Generation (RAG) they can still produce inaccurate responses. The key to the solution is Retrieval-Augmented Fine-Tuning (RAFT) on synthetic question/answer (Q/A) datasets, which substantially boosts LLM performance on RAG-based EDA tasks, with real user questions used as Retrieval-Augmented Few-Shot (RAFS) examples to improve synthetic data generation.
Link: https://arxiv.org/abs/2506.06500
Authors: Luyao Shi,Michael Kazda,Charles Schmitter,Hemlata Gupta
Affiliations: IBM Research; IBM Infrastructure
Subjects: Computation and Language (cs.CL)
Comments: Accepted paper at IEEE International Conference on LLM-Aided Design, 2025 (LAD 2025)
Abstract:Electronic design engineers often struggle to efficiently access relevant information for tasks like design verification and technology development. While large language models (LLMs) can enhance productivity as conversational agents, pre-trained open-source LLMs lack domain-specific knowledge for Electronic Design Automation (EDA). In a Retrieval-Augmented Generation (RAG) context, LLMs rely on external context but may still produce inaccurate responses. Retrieval-Augmented Fine-Tuning (RAFT) improves LLM performance, but acquiring labeled question/answer (Q/A) data in EDA is difficult. To address this, we propose using synthetic Q/A datasets to enhance LLMs with RAFT. Our results show that RAFT with synthetic data significantly boosts LLM performance for RAG-based EDA tasks. We also investigate the impact of using real user questions as Retrieval-Augmented Few-Shot (RAFS) examples for synthetic data generation. Additionally, we implement secure access control to ensure sensitive information is only accessible to authorized personnel. Finally, we assess the risk of data leakage and unintended memorization during fine-tuning with synthetic data, providing practical insights.
[NLP-162] What Is Seen Cannot Be Unseen: The Disruptive Effect of Knowledge Conflict on Large Language Models
【Quick Read】: This paper addresses how to evaluate LLM behavior when contextual information conflicts with parametric knowledge. The key to the solution is a diagnostic framework for systematically assessing LLMs under context-memory conflict: diagnostic data is constructed to elicit such conflicts, and model performance is analyzed across multiple task types, revealing characteristic behaviors and limitations of models under knowledge conflict.
Link: https://arxiv.org/abs/2506.06485
Authors: Kaiser Sun,Fan Bai,Mark Dredze
Affiliations: Johns Hopkins University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models frequently rely on both contextual input and parametric knowledge to perform tasks. However, these sources can come into conflict, especially when retrieved documents contradict the model’s parametric knowledge. We propose a diagnostic framework to systematically evaluate LLM behavior under context-memory conflict, where the contextual information diverges from their parametric beliefs. We construct diagnostic data that elicit these conflicts and analyze model performance across multiple task types. Our findings reveal that (1) knowledge conflict has minimal impact on tasks that do not require knowledge utilization, (2) model performance is consistently higher when contextual and parametric knowledge are aligned, (3) models are unable to fully suppress their internal knowledge even when instructed, and (4) providing rationales that explain the conflict increases reliance on contexts. These insights raise concerns about the validity of model-based evaluation and underscore the need to account for knowledge conflict in the deployment of LLMs.
[NLP-163] Canonical Autoregressive Generation
【Quick Read】: This paper addresses the problem that LLMs can generate non-canonical tokenizations, drifting away from the distribution of token sequences seen during training. The key to the solution is canonical sampling, a method that guarantees the model produces a (partial) canonical token sequence at every step of autoregressive generation, precluding non-canonical sequences and provably bringing the generated distribution closer to the true training distribution.
Link: https://arxiv.org/abs/2506.06446
Authors: Ivi Chatzi,Nina Corvelo Benz,Stratis Tsirtsis,Manuel Gomez-Rodriguez
Affiliations: Max Planck Institute for Software Systems; ETH Zurich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:State of the art large language models are trained using large amounts of tokens derived from raw text using what is called a tokenizer. Crucially, the tokenizer determines the (token) vocabulary a model will use during inference as well as, in principle, the (token) language. This is because, while the token vocabulary may allow for different tokenizations of a string, the tokenizer always maps the string to only one of these tokenizations–the canonical tokenization. However, multiple lines of empirical evidence suggest that large language models do not always generate canonical token sequences, and this comes with several negative consequences. In this work, we first show that, to generate a canonical token sequence, a model needs to generate (partial) canonical token sequences at each step of the autoregressive generation process underpinning its functioning. Building upon this theoretical result, we introduce canonical sampling, a simple and efficient sampling method that precludes a given model from generating non-canonical token sequences. Further, we also show that, in comparison with standard sampling, the distribution of token sequences generated using canonical sampling is provably closer to the true distribution of token sequences used during training.
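A brute-force reading of the canonicality condition, assuming a Hugging Face-style tokenizer: a candidate next token is kept only if the extended sequence re-tokenizes to itself. The paper's actual method is presumably far more efficient than this per-token check.

```python
def canonical_filter(tokenizer, prefix_ids, candidate_ids):
    """Keep only next tokens that extend the prefix to a (partial) canonical
    tokenization: re-tokenizing the decoded string must reproduce the ids.

    A brute-force sketch of the idea, assuming a Hugging Face-style tokenizer
    with encode/decode; the paper's algorithm is more efficient.
    """
    keep = []
    for tok in candidate_ids:
        ids = prefix_ids + [tok]
        text = tokenizer.decode(ids)
        if tokenizer.encode(text, add_special_tokens=False) == ids:
            keep.append(tok)
    return keep
```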
[NLP-164] HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions
【Quick Read】: This paper studies how to design LLM watermarks that remain effective in low-entropy generation tasks such as coding, enabling text provenance, curbing misuse of machine-generated text, and promoting trust in AI systems. The key to the solution is an optimization framework for watermark design that uses random side information as effectively as possible, maximizing watermark detection probability while minimizing text distortion; it yields two new tunable watermarks, HeavyWater and SimplexWater, which trade off detection accuracy against distortion, apply to any LLM, and are agnostic to how side information is generated.
Link: https://arxiv.org/abs/2506.06409
Authors: Dor Tsur,Carol Xuan Long,Claudio Mayrink Verdun,Hsiang Hsu,Chen-Fu Chen,Haim Permuter,Sajani Vithana,Flavio P. Calmon
Affiliations: Ben Gurion University; Harvard University; JPMorganChase Global Technology Applied Research
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:
Abstract:Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks - such as coding - where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading-off between detection accuracy and text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found in this https URL
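For context, the hash-seeded logit-biasing scheme the abstract describes as the starting point can be sketched as below; HeavyWater and SimplexWater are optimized variants whose details are not in the abstract, so this shows only the generic baseline idea.

```python
import hashlib
import random

def watermark_logits(logits, prev_tokens, vocab_size, delta=2.0, gamma=0.5):
    """Hash-seeded 'green list' watermarking: side information derived from
    previously generated tokens selects a pseudorandom subset of the
    vocabulary whose logits receive a bias delta. The green-list fraction
    gamma and bias delta are illustrative values.
    """
    seed = int(hashlib.sha256(str(prev_tokens).encode("utf8")).hexdigest(), 16)
    rng = random.Random(seed)
    green = set(rng.sample(range(vocab_size), int(gamma * vocab_size)))
    return [x + delta if i in green else x for i, x in enumerate(logits)]

# Detection re-derives the same green lists from the text and tests whether
# green tokens are over-represented relative to chance.
```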
[NLP-165] SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities
【Quick Read】: This paper addresses the high training cost of scaling multimodal Mixture-of-Experts (MoE) models and the degradation of language capabilities when adapting pretrained models. The key to the solution is Soft Modality-Aware Routing (SMAR), a regularization technique that controls routing probability distributions across modalities via Kullback-Leibler divergence, encouraging expert specialization without modifying the architecture or relying heavily on text data.
Link: https://arxiv.org/abs/2506.06406
Authors: Guoyang Xia,Yifeng Ding,Fengfa Li,Lei Ren,Chen Wei,Fangxiang Feng,Xiaojie Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Mixture of Experts (MoE) architectures have become a key approach for scaling large language models, with growing interest in extending them to multimodal tasks. Existing methods to build multimodal MoE models either incur high training costs or suffer from degraded language capabilities when adapting pretrained models. To address this, we propose Soft Modality-Aware Routing (SMAR), a novel regularization technique that uses Kullback-Leibler divergence to control routing probability distributions across modalities, encouraging expert specialization without modifying model architecture or heavily relying on textual data. Experiments on visual instruction tuning show that SMAR preserves language ability at 86.6% retention with only 2.5% pure text, outperforming baselines while maintaining strong multimodal performance. Our approach offers a practical and efficient solution to balance modality differentiation and language capabilities in multimodal MoE models.
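A sketch of what a KL-based routing regularizer could look like; the direction of the divergence, the reduction, and how it enters the loss are assumptions, since the abstract only says KL divergence is used to control routing distributions across modalities.

```python
import torch
import torch.nn.functional as F

def smar_regularizer(router_logits, modality_mask):
    """Sketch of a KL-based routing regularizer in the spirit of SMAR:
    compare the average expert-routing distribution of text tokens against
    that of image tokens. Sign and weighting choices are assumptions; the
    paper specifies its own objective.

    router_logits: (n_tokens, n_experts); modality_mask: bool, True = text.
    """
    p_text = router_logits[modality_mask].softmax(-1).mean(0)
    p_img = router_logits[~modality_mask].softmax(-1).mean(0)
    return F.kl_div(p_img.log(), p_text, reduction="sum")  # KL(p_text || p_img)
```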
[NLP-166] Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights ACL2025
【Quick Read】: This paper examines the safety risks of value-aligned large language models (LLMs) and their potential for harmful behavior. The study finds that such models, by faithfully generating text according to their aligned values, can amplify harmful outcomes. The key to the solution is in-context alignment methods that improve the safety of value-aligned LLMs, supported by psychological analysis and a dataset with detailed safety categories.
Link: https://arxiv.org/abs/2506.06404
Authors: Sooyung Choi,Jaehyeok Lee,Xiaoyuan Yi,Jing Yao,Xing Xie,JinYeong Bak
Affiliations: Sungkyunkwan University; Microsoft Research Asia
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted to ACL 2025
Abstract:The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the “black box” of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.
[NLP-167] Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs ACL2025
【Quick Read】: This paper addresses the weak inference and complex-task abilities of lightweight LLMs (LwLLMs), and the dependence of existing prompt-optimization methods on manual effort or the meta-cognitive abilities of state-of-the-art LLMs. The key to the solution is DeBoP, a Direct Behavior Optimization Paradigm originating from Chain-of-Thought (CoT) prompting: unlike CoT prompting, DeBoP automatically optimizes the behavior of LwLLMs directly, recasting the optimization of complex prompts as the optimization of discrete, quantifiable execution sequences via gradient-free Monte Carlo Tree Search.
Link: https://arxiv.org/abs/2506.06401
Authors: Hongming Yang,Shi Lin,Jun Shao,Changting Lin,Donghai Zhu,Meng Han,Qinglei Kong
Affiliations: Zhejiang Gongshang University; Zhejiang University; Binjiang Institute of Zhejiang University; GenTel.io; Harbin Institute of Technology, Shenzhen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work is accepted at ACL 2025
Abstract:Lightweight Large Language Models (LwLLMs) are reduced-parameter, optimized models designed to run efficiently on consumer-grade hardware, offering significant advantages in resource efficiency, cost-effectiveness, and data privacy. However, these models often struggle with limited inference and reasoning capabilities, which restrict their performance on complex tasks and limit their practical applicability. Moreover, existing prompt optimization methods typically rely on extensive manual effort or the meta-cognitive abilities of state-of-the-art LLMs, making them less effective for LwLLMs. To address these challenges, we introduce DeBoP, a new Direct Behavior Optimization Paradigm, originating from the Chain-of-Thought (CoT) prompting technique. Unlike CoT Prompting, DeBoP is an automatic optimization method, which focuses on the optimization directly on the behavior of LwLLMs. In particular, DeBoP transforms the optimization of complex prompts into the optimization of discrete, quantifiable execution sequences using a gradient-free Monte Carlo Tree Search. We evaluate DeBoP on seven challenging tasks where state-of-the-art LLMs excel but LwLLMs generally underperform. Experimental results demonstrate that DeBoP significantly outperforms recent prompt optimization methods on most tasks. In particular, DeBoP-optimized LwLLMs surpass GPT-3.5 on most tasks while reducing computational time by approximately 60% compared to other automatic prompt optimization methods.
[NLP-168] Natural Language Interaction with Databases on Edge Devices in the Internet of Battlefield Things
【Quick Read】: This paper addresses how data produced by the Internet of Battlefield Things (IoBT), the battlefield extension of IoT, can be turned into consumer-ready information objects delivered to users on demand. The key to the solution is a workflow that uses natural language processing (NLP) to query a database and return responses in natural language: edge-sized large language models (LLMs) handle the NLP, paired with a graph database suited to the dynamic connected networks pervasive in the IoBT; the LLMs both map natural-language questions to Cypher queries and summarize query results back into natural language, improving the efficiency and accuracy of information access.
Link: https://arxiv.org/abs/2506.06396
Authors: Christopher D. Molek,Roberto Fronteddu,K. Brent Venable,Niranjan Suri
Affiliations: The University of West Florida (UWF); Florida Institute for Human and Machine Cognition (IHMC); US Army DEVCOM Army Research Laboratory (ARL)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:
Abstract:The expansion of the Internet of Things (IoT) in the battlefield, Internet of Battlefield Things (IoBT), gives rise to new opportunities for enhancing situational awareness. To increase the potential of IoBT for situational awareness in critical decision making, the data from these devices must be processed into consumer-ready information objects, and made available to consumers on demand. To address this challenge we propose a workflow that makes use of natural language processing (NLP) to query a database technology and return a response in natural language. Our solution utilizes Large Language Models (LLMs) that are sized for edge devices to perform NLP as well as graphical databases which are well suited for dynamic connected networks which are pervasive in the IoBT. Our architecture employs LLMs for both mapping questions in natural language to Cypher database queries as well as to summarize the database output back to the user in natural language. We evaluate several medium-sized LLMs for both of these tasks on a database representing publicly available data from the US Army’s Multipurpose Sensing Area (MSA) at the Jornada Range in Las Cruces, NM. We observe that Llama 3.1 (8 billion parameters) outperforms the other models across all the considered metrics. Most importantly, we note that, unlike current methods, our two step approach allows the relaxation of the Exact Match (EM) requirement of the produced Cypher queries with ground truth code and, in this way, it achieves a 19.4% increase in accuracy. Our workflow lays the groundwork for deploying LLMs on edge devices to enable natural language interactions with databases containing information objects for critical decision making.
[NLP-169] Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
【Quick Read】: This paper addresses how to align LLM behavior with task goals in post-training without the costly human annotations or external reward models that traditional reinforcement learning (RL) pipelines require. The key to the solution is Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering.
Link: https://arxiv.org/abs/2506.06395
Authors: Pengyi Li,Matvey Skripkin,Alexander Zubrey,Andrey Kuznetsov,Ivan Oseledets
Affiliations: AIRI; Skoltech
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model’s own confidence as reward signals-eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 8 samples per question and 4 training epochs, RLSC improves accuracy by +20.10% on AIME2024, +49.40% on MATH500, and +52.50% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models with minimal supervision.
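One plausible reading of "confidence as reward" is a REINFORCE-style objective in which the reward for each sampled answer is the model's own probability of that answer; the baseline subtraction and the exact reward form below are assumptions, not the paper's stated objective.

```python
import torch

def rlsc_loss(logprobs_answer):
    """REINFORCE-style sketch of 'self-confidence as reward': the reward for a
    sampled answer is the model's own (sequence-level) confidence, so the loss
    pushes probability mass toward answers the model already finds likely.

    logprobs_answer: (n_samples,) summed log-probs of each sampled answer.
    The exact RLSC objective may differ; this is an illustrative reading.
    """
    confidence = logprobs_answer.detach().exp()   # reward = p(answer)
    baseline = confidence.mean()                  # variance reduction
    return -((confidence - baseline) * logprobs_answer).mean()
```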
[NLP-170] From Rogue to Safe AI: The Role of Explicit Refusals in Aligning LLMs with International Humanitarian Law
【Quick Read】: This paper addresses the limited clarity, consistency, and transparency of LLM refusals when prompted to violate International Humanitarian Law (IHL). The key to the solution is a standardized system-level safety prompt that improves the quality of explanations accompanying refusals, clarifying system boundaries, reducing ambiguity, and helping prevent misuse; nevertheless, complex prompts involving technical language or code requests still expose notable vulnerabilities.
Link: https://arxiv.org/abs/2506.06391
Authors: John Mavi,Diana Teodora Găitan,Sergio Coronado
Affiliations: Luxembourg Tech School A.s.b.l.
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are widely used across sectors, yet their alignment with International Humanitarian Law (IHL) is not well understood. This study evaluates eight leading LLMs on their ability to refuse prompts that explicitly violate these legal frameworks, focusing also on helpfulness - how clearly and constructively refusals are communicated. While most models rejected unlawful requests, the clarity and consistency of their responses varied. By revealing the model’s rationale and referencing relevant legal or safety principles, explanatory refusals clarify the system’s boundaries, reduce ambiguity, and help prevent misuse. A standardised system-level safety prompt significantly improved the quality of the explanations expressed within refusals in most models, highlighting the effectiveness of lightweight interventions. However, more complex prompts involving technical language or requests for code revealed ongoing vulnerabilities. These findings contribute to the development of safer, more transparent AI systems and propose a benchmark to evaluate the compliance of LLMs with IHL.
[NLP-171] Detection Method for Prompt Injection by Integrating Pre-trained Model and Heuristic Feature Engineering KSEM2025
【Quick Read】: This paper addresses the detection of prompt injection attacks against large language models (LLMs), a significant security threat for which existing defenses trade effectiveness against generalizability. The key to the solution is DMPI-PMHFE, a dual-channel feature-fusion detection framework that combines a pretrained language model with heuristic feature engineering: DeBERTa-v3-base extracts semantic vectors enriched with contextual information, heuristic rules based on known attack patterns extract explicit structural features, and the fused features feed a fully connected network for the final prediction.
Link: https://arxiv.org/abs/2506.06384
Authors: Yi Ji,Runzhi Li,Baolei Mao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by KSEM2025 AI Sec Workshop
Abstract:With the widespread adoption of Large Language Models (LLMs), prompt injection attacks have emerged as a significant security threat. Existing defense mechanisms often face critical trade-offs between effectiveness and generalizability. This highlights the urgent need for efficient prompt injection detection methods that are applicable across a wide range of LLMs. To address this challenge, we propose DMPI-PMHFE, a dual-channel feature fusion detection framework. It integrates a pretrained language model with heuristic feature engineering to detect prompt injection attacks. Specifically, the framework employs DeBERTa-v3-base as a feature extractor to transform input text into semantic vectors enriched with contextual information. In parallel, we design heuristic rules based on known attack patterns to extract explicit structural features commonly observed in attacks. Features from both channels are subsequently fused and passed through a fully connected neural network to produce the final prediction. This dual-channel approach mitigates the limitations of relying only on DeBERTa to extract features. Experimental results on diverse benchmark datasets demonstrate that DMPI-PMHFE outperforms existing methods in terms of accuracy, recall, and F1-score. Furthermore, when actually deployed, it significantly reduces attack success rates across mainstream LLMs, including GLM-4, LLaMA 3, Qwen 2.5, and GPT-4o.
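The fusion step can be sketched as a small PyTorch module that concatenates the DeBERTa semantic vector with the heuristic feature vector before a fully connected head; all dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualChannelDetector(nn.Module):
    """Sketch of a DMPI-PMHFE-style detector: concatenate a DeBERTa [CLS]
    vector with a hand-crafted heuristic feature vector, then classify with
    a small fully connected head. Dimensions are illustrative.
    """
    def __init__(self, sem_dim=768, heur_dim=16, hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(sem_dim + heur_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),   # injected vs. benign
        )

    def forward(self, sem_vec, heur_vec):
        return self.head(torch.cat([sem_vec, heur_vec], dim=-1))
```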
[NLP-172] Enhancing Decision-Making of Large Language Models via Actor-Critic ICML2025
【Quick Read】: This paper addresses the weak performance of LLMs in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives; existing methods either rely on short-term autoregressive action generation or struggle to simulate rollouts and assess outcomes accurately, leading to sub-optimal decisions. The key to the solution is LAC, an LLM-based Actor-Critic framework that extracts robust action evaluations by computing Q-values from token logits associated with positive/negative outcomes, enhanced with future-trajectory rollouts and reasoning, and improves the policy efficiently through a gradient-free mechanism.
Link: https://arxiv.org/abs/2506.06376
Authors: Heng Dong,Kefei Duan,Chongjie Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Forty-second International Conference on Machine Learning (ICML 2025)
Abstract:Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks, yet they encounter challenges in complex decision-making scenarios that require long-term reasoning and alignment with high-level objectives. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes, leading to sub-optimal decisions. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations in a principled and scalable way. Our approach addresses two key challenges: (1) extracting robust action evaluations by computing Q-values via token logits associated with positive/negative outcomes, enhanced by future trajectory rollouts and reasoning; and (2) enabling efficient policy improvement through a gradient-free mechanism. Experiments across diverse environments – including high-level decision-making (ALFWorld), low-level action spaces (BabyAI-Text), and large action spaces (WebShop) – demonstrate the framework’s generality and superiority over state-of-the-art methods. Notably, our approach achieves competitive performance using 7B/8B parameter LLMs, even outperforming baseline methods employing GPT-4 in complex tasks. These results underscore the potential of integrating structured policy optimization with LLMs’ intrinsic knowledge to advance decision-making capabilities in multi-step environments.
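A minimal sketch of deriving an action value from outcome-token logits, as the abstract describes; the specific judgment tokens and the log-odds form are assumptions for illustration.

```python
import torch

def action_value_from_logits(logits, good_id, bad_id):
    """Sketch of LAC's evaluation signal: score an action by the probability
    the model assigns to tokens tied to positive vs. negative outcomes
    (e.g., "good"/"bad" judgment tokens). The token choice and the log-odds
    form are illustrative assumptions.
    """
    probs = torch.softmax(logits, dim=-1)
    return torch.log(probs[good_id]) - torch.log(probs[bad_id])  # log-odds
```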
[NLP-173] Relationship Detection on Tabular Data Using Statistical Analysis and Large Language Models
【Quick Read】: This paper addresses detecting relationships among columns of unlabeled tabular data using a Knowledge Graph (KG) as a reference point, a task known as CPA. The key to the solution is a hybrid approach that leverages large language models (LLMs) while employing statistical analysis to reduce the search space of candidate KG relations; its main modules are domain and range constraint detection and relation co-appearance analysis, improving both the efficiency and accuracy of relationship detection.
Link: https://arxiv.org/abs/2506.06371
Authors: Panagiotis Koletsis,Christos Panagiotopoulos,Georgios Th. Papadopoulos,Vasilis Efthymiou
Affiliations: Harokopio University, Athens, Greece
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Over the past few years, table interpretation tasks have made significant progress due to their importance and the introduction of new technologies and benchmarks in the field. This work experiments with a hybrid approach for detecting relationships among columns of unlabeled tabular data, using a Knowledge Graph (KG) as a reference point, a task known as CPA. This approach leverages large language models (LLMs) while employing statistical analysis to reduce the search space of potential KG relations. The main modules of this approach for reducing the search space are domain and range constraints detection, as well as relation co-appearance analysis. The experimental evaluation on two benchmark datasets provided by the SemTab challenge assesses the influence of each module and the effectiveness of different state-of-the-art LLMs at various levels of quantization, as well as with different prompting techniques. The proposed methodology, which is publicly available on GitHub, proved to be competitive with state-of-the-art approaches on these datasets.
[NLP-174] LLMs as World Models: Data-Driven and Human-Centered Pre-Event Simulation for Disaster Impact Assessment
【Quick Read】: This paper addresses how to simulate sudden-onset disasters such as earthquakes efficiently to strengthen proactive preparedness. The key to the solution is using large language models (LLMs) as world models, combining multimodal data (geospatial, socioeconomic, building, and street-view imagery) to generate Modified Mercalli Intensity (MMI) predictions that estimate earthquake impacts at zip-code and county scales. Evaluations against USGS "Did You Feel It?" (DYFI) reports for the 2014 Napa and 2019 Ridgecrest earthquakes show strong alignment (correlation 0.88, RMSE 0.77 at the zip-code level), and retrieval-augmented generation (RAG), in-context learning (ICL), and visual inputs further improve simulation performance.
Link: https://arxiv.org/abs/2506.06355
Authors: Lingyao Li,Dawei Li,Zhenhui Ou,Xiaoran Xu,Jingxiao Liu,Zihui Ma,Runlong Yu,Min Deng
Affiliations: University of South Florida; Arizona State University; Massachusetts Institute of Technology; New York University; University of Alabama; Texas Tech University
Subjects: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Efficient simulation is essential for enhancing proactive preparedness for sudden-onset disasters such as earthquakes. Recent advancements in large language models (LLMs) as world models show promise in simulating complex scenarios. This study examines multiple LLMs to proactively estimate perceived earthquake impacts. Leveraging multimodal datasets including geospatial, socioeconomic, building, and street-level imagery data, our framework generates Modified Mercalli Intensity (MMI) predictions at zip code and county scales. Evaluations on the 2014 Napa and 2019 Ridgecrest earthquakes using USGS “Did You Feel It?” (DYFI) reports demonstrate significant alignment, as evidenced by a high correlation of 0.88 and a low RMSE of 0.77 as compared to real reports at the zip code level. Techniques such as RAG and ICL can improve simulation performance, while visual inputs notably enhance accuracy compared to structured numerical data alone. These findings show the promise of LLMs in simulating disaster impacts that can help strengthen pre-event planning.
[NLP-175] Unified Game Moderation: Soft-Prompting and LLM-Assisted Label Transfer for Resource-Efficient Toxicity Detection KDD2025
【Quick Read】: This paper addresses the scalability of toxicity detection in gaming communities across multiple games and languages, particularly in real-time settings. The key to the solution is twofold: a soft-prompting approach that lets a single model handle multiple games by adding game-context tokens, matching more complex methods such as curriculum learning while scaling better; and an LLM-assisted label-transfer framework using GPT-4o-mini to extend toxicity detection to seven additional languages.
Link: https://arxiv.org/abs/2506.06347
Authors: Zachary Yang,Domenico Tullo,Reihaneh Rabbany
Affiliations: Ubisoft La Forge; McGill University; Mila; CIFAR AI Chair; Montreal, Canada
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 11 pages, 1 figure, 9 Tables, KDD 2025 ADS Track
Abstract:Toxicity detection in gaming communities faces significant scaling challenges when expanding across multiple games and languages, particularly in real-time environments where computational efficiency is crucial. We present two key findings to address these challenges while building upon our previous work on ToxBuster, a BERT-based real-time toxicity detection system. First, we introduce a soft-prompting approach that enables a single model to effectively handle multiple games by incorporating game-context tokens, matching the performance of more complex methods like curriculum learning while offering superior scalability. Second, we develop an LLM-assisted label transfer framework using GPT-4o-mini to extend support to seven additional languages. Evaluations on real game chat data across French, German, Portuguese, and Russian achieve macro F1-scores ranging from 32.96% to 58.88%, with particularly strong performance in German, surpassing the English benchmark of 45.39%. In production, this unified approach significantly reduces computational resources and maintenance overhead compared to maintaining separate models for each game and language combination. At Ubisoft, this model successfully identifies an average of 50 players, per game, per day engaging in sanctionable behavior.
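Game-context soft prompting can be sketched as a learned per-game embedding prepended to the input sequence; the one-token-per-game design below is an assumption about the paper's setup.

```python
import torch
import torch.nn as nn

class GameContextPrompt(nn.Module):
    """Soft-prompting sketch: one learned embedding per game is prepended to
    the chat-message token embeddings so a single toxicity model can condition
    on game context. The single-token-per-game design is an assumption.
    """
    def __init__(self, n_games, d_model):
        super().__init__()
        self.game_embed = nn.Embedding(n_games, d_model)

    def forward(self, token_embeds, game_id):        # (B, T, D), (B,)
        ctx = self.game_embed(game_id).unsqueeze(1)  # (B, 1, D)
        return torch.cat([ctx, token_embeds], dim=1) # (B, T+1, D)
```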
[NLP-176] TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment
【Quick Read】: This paper addresses the reliance of conventional speech-enabled language models on large-scale paired speech-text data and heavy compute, which limits scalability and accessibility. The key to the solution is TESU-LLM, whose core idea is a unified encoder that maps semantically equivalent text and speech inputs into a shared latent space; a lightweight projection network aligns the encoder output with the LLM's embedding space, enabling generalization from text-only supervision to speech-based inference.
Link: https://arxiv.org/abs/2506.06343
Authors: Taesoo Kim,Jong Hwan Ko
Affiliations: KT Corporation; Sungkyunkwan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Recent advances in speech-enabled language models have shown promising results in building intelligent voice assistants. However, most existing approaches rely on large-scale paired speech-text data and extensive computational resources, which pose challenges in terms of scalability and accessibility. In this paper, we present \textbfTESU-LLM, a novel framework that enables training speech-capable language models using only text data. Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. By aligning the encoder output with the embedding space of a LLM via a lightweight projection network, we enable the model to generalize from text-only supervision to speech-based inference. Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks, comparable to baseline methods trained with large-scale multimodal datasets and substantial computational resources. These results highlight the effectiveness and efficiency of our approach, offering a scalable path toward building speech LLMs without speech data.
[NLP-177] Optimizing RAG Pipelines for Arabic: A Systematic Analysis of Core Components
【Quick Read】: This paper addresses the lack of systematic evaluation and optimization of Retrieval-Augmented Generation (RAG) components for Arabic, despite extensive work on high-resource languages. The key to the solution is a comprehensive empirical evaluation of RAG components (chunking strategies, embedding models, rerankers, and language models) compared systematically on four core metrics under the RAGAS framework, yielding key insights and practical guidance for building high-quality Arabic RAG pipelines.
Link: https://arxiv.org/abs/2506.06339
Authors: Jumana Alsubhi,Mohammad D. Alahmadi,Ahmed Alhusayni,Ibrahim Aldailami,Israa Hamdine,Ahmad Shabana,Yazeed Iskandar,Suhayb Khayyat
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful architecture for combining the precision of retrieval systems with the fluency of large language models. While several studies have investigated RAG pipelines for high-resource languages, the optimization of RAG components for Arabic remains underexplored. This study presents a comprehensive empirical evaluation of state-of-the-art RAG components-including chunking strategies, embedding models, rerankers, and language models-across a diverse set of Arabic datasets. Using the RAGAS framework, we systematically compare performance across four core metrics: context precision, context recall, answer faithfulness, and answer relevancy. Our experiments demonstrate that sentence-aware chunking outperforms all other segmentation methods, while BGE-M3 and Multilingual-E5-large emerge as the most effective embedding models. The inclusion of a reranker (bge-reranker-v2-m3) significantly boosts faithfulness in complex datasets, and Aya-8B surpasses StableLM in generation quality. These findings provide critical insights for building high-quality Arabic RAG pipelines and offer practical guidelines for selecting optimal components across different document types.
[NLP-178] FinBERT2: A Specialized Bidirectional Encoder for Bridging the Gap in Finance-Specific Deployment of Large Language Models
【Quick Read】: This paper addresses three major limitations of large language models (LLMs) in financial applications: they underperform fine-tuned BERT on discriminative tasks while costing far more compute; generative tasks depend on retrieval-augmented generation (RAG), where general-purpose retrievers perform poorly on domain-specific retrieval; and they remain inadequate in feature-based scenarios such as topic modeling. The key to the solution is FinBERT2, a bidirectional encoder pretrained on a high-quality, finance-specific corpus of 32 billion tokens, the largest known Chinese financial pretraining corpus for models of this parameter size. Through improved performance on financial classification, retrieval, and topic-modeling tasks, FinBERT2 effectively bridges the gap in the finance-specific deployment of LLMs.
Link: https://arxiv.org/abs/2506.06335
Authors: Xuan Xu, Fufang Wen, Beilin Chu, Zhibing Fu, Qinhong Lin, Jiaqi Liu, Binjie Fei, Zhongliang Yang, Linna Zhou, Yu Li
Institutions: Beijing University of Posts and Telecommunications; Beijing Value Simplex Technology Co.Ltd.
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:
Abstract:In natural language processing (NLP), the focus has shifted from encoder-only tiny language models like BERT to decoder-only large language models (LLMs) such as GPT-3. However, LLMs’ practical application in the financial sector has revealed three limitations: (1) LLMs often perform worse than fine-tuned BERT on discriminative tasks despite costing much higher computational resources, such as market sentiment analysis in financial reports; (2) Application on generative tasks heavily relies on retrieval augmented generation (RAG) methods to provide current and specialized information, with general retrievers showing suboptimal performance on domain-specific retrieval tasks; (3) There are additional inadequacies in other feature-based scenarios, such as topic modeling. We introduce FinBERT2, a specialized bidirectional encoder pretrained on a high-quality, financial-specific corpus of 32b tokens. This represents the largest known Chinese financial pretraining corpus for models of this parameter size. As a better backbone, FinBERT2 can bridge the gap in the financial-specific deployment of LLMs through the following achievements: (1) Discriminative fine-tuned models (Fin-Labelers) outperform other (Fin)BERT variants by 0.4%-3.3% and leading LLMs by 9.7%-12.3% on average across five financial classification tasks. (2) Contrastive fine-tuned models (Fin-Retrievers) outperform both open-source (e.g., +6.8% avg improvement over BGE-base-zh) and proprietary (e.g., +4.2% avg improvement over OpenAI’s text-embedding-3-large) embedders across five financial retrieval tasks; (3) Building on FinBERT2 variants, we construct the Fin-TopicModel, which enables superior clustering and topic representation for financial titles. Our work revisits financial BERT models through comparative analysis with contemporary LLMs and offers practical insights for effectively utilizing FinBERT in the LLMs era.
[NLP-179] How Significant Are the Real Performance Gains? An Unbiased Evaluation Framework for GraphRAG
【Quick Read】: This paper targets two critical flaws in the current answer evaluation framework for graph-based retrieval-augmented generation (GraphRAG), namely unrelated questions and evaluation biases, which can lead to biased or even wrong conclusions about performance. The key to the solution is an unbiased evaluation framework that uses graph-text-grounded question generation to produce questions more closely tied to the underlying dataset, together with an unbiased evaluation procedure that eliminates the biases in LLM-based answer assessment.
Link: https://arxiv.org/abs/2506.06331
Authors: Qiming Zeng, Xiao Yan, Hao Luo, Yuhao Lin, Yuxiang Wang, Fangcheng Fu, Bo Du, Quanqing Xu, Jiawei Jiang
Institutions: School of Computer Science, Wuhan University; School of Computer Science, Peking University; Centre for Perceptual and Interactive Intelligence (CPII); OceanBase
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:By retrieving contexts from knowledge graphs, graph-based retrieval-augmented generation (GraphRAG) enhances large language models (LLMs) to generate quality answers for user questions. Many GraphRAG methods have been proposed and have reported impressive performance in answer quality. However, we observe that the current answer evaluation framework for GraphRAG has two critical flaws, i.e., unrelated questions and evaluation biases, which may lead to biased or even wrong conclusions on performance. To tackle the two flaws, we propose an unbiased evaluation framework that uses graph-text-grounded question generation to produce questions that are more related to the underlying dataset and an unbiased evaluation procedure to eliminate the biases in LLM-based answer assessment. We apply our unbiased framework to evaluate 3 representative GraphRAG methods and find that their performance gains are much more moderate than reported previously. Although our evaluation framework may still have flaws, it calls for scientific evaluations to lay solid foundations for GraphRAG research.
[NLP-180] Is BERTopic Better than PLSA for Extracting Key Topics in Aviation Safety Reports?
【Quick Read】: This paper aims to extract meaningful topics from aviation safety reports effectively, to improve the understanding of patterns in aviation incident data. The key to the solution is the transformer-based BERTopic method, which achieves higher-quality topic modeling through embeddings and hierarchical clustering and shows clear advantages in topic coherence and interpretability over the traditional Probabilistic Latent Semantic Analysis (PLSA) approach.
Link: https://arxiv.org/abs/2506.06328
Authors: Aziida Nanyonga, Joiner Keith, Turhan Ugur, Wild Graham
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:This study compares the effectiveness of BERTopic and Probabilistic Latent Semantic Analysis (PLSA) in extracting meaningful topics from aviation safety reports, aiming to enhance the understanding of patterns in aviation incident data. Using a dataset of over 36,000 National Transportation Safety Board (NTSB) reports from 2000 to 2020, BERTopic employed transformer-based embeddings and hierarchical clustering, while PLSA utilized probabilistic modelling through the Expectation-Maximization (EM) algorithm. Results showed that BERTopic outperformed PLSA in topic coherence, achieving a Cv score of 0.41 compared to PLSA's 0.37, while also demonstrating superior interpretability as validated by aviation safety experts. These findings underscore the advantages of modern transformer-based approaches in analyzing complex aviation datasets, paving the way for enhanced insights and informed decision-making in aviation safety. Future work will explore hybrid models, multilingual datasets, and advanced clustering techniques to further improve topic modelling in this domain.
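For readers who want to try the BERTopic side of this comparison, here is a minimal sketch using the open-source bertopic package; the placeholder narratives and the min_topic_size value are assumptions, and a realistic corpus (the paper uses roughly 36,000 NTSB reports) is needed for the clustering to behave sensibly.

```python
# Sketch: fitting BERTopic to a list of report narratives. The `reports`
# list below only shows the expected input type; BERTopic's UMAP/HDBSCAN
# stages require a reasonably large corpus to produce meaningful topics.
from bertopic import BERTopic

reports = [
    "pilot reported loss of engine power during cruise ...",
    "aircraft encountered severe turbulence on approach ...",
]  # in practice: tens of thousands of NTSB narratives

topic_model = BERTopic(language="english", min_topic_size=20)
topics, probs = topic_model.fit_transform(reports)

print(topic_model.get_topic_info().head())  # topic sizes and keyword labels
print(topic_model.get_topic(0))             # top words for topic 0
```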
[NLP-181] DISRetrieval: Harnessing Discourse Structure for Long Document Retrieval
【Quick Read】: This paper addresses the limited efficiency and accuracy of information retrieval for long-document understanding, caused by the context-length limits of large language models (LLMs). Existing methods typically treat documents as flat sequences or apply arbitrary chunking, failing to capture the intrinsic discourse structure that guides human comprehension. The key to the solution is DISRetrieval, a framework that leverages discourse structure to improve long-document understanding, with three core innovations: sentence-level hierarchical representations built on Rhetorical Structure Theory (RST); a node-representation enhancement technique that combines discourse structure with adaptive summarization; and a hierarchical evidence retrieval mechanism that preserves discourse coherence.
Link: https://arxiv.org/abs/2506.06313
Authors: Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
Institutions: Harbin Institute of Technology (Shenzhen)
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 21 pages, 7 figures
Abstract:Long document understanding has become increasingly crucial in natural language processing, with retrieval-based methods emerging as a promising solution to address the context length limitations of large language models (LLMs). However, existing approaches either treat documents as flat sequences or employ arbitrary chunking strategies, failing to capture the inherent discourse structure that guides human comprehension. We present DISRetrieval, a novel hierarchical retrieval framework that leverages linguistic discourse structure to enhance long document understanding. Our approach introduces three key innovations: (1) a discourse-aware document organization framework that utilizes rhetorical structure theory (RST) to create sentence-level hierarchical representations, preserving both semantic relationships and natural document flow; (2) an LLM-enhanced node representation technique that combines discourse structure with adaptive summarization to enrich tree nodes with contextual information; and (3) a hierarchical evidence retrieval mechanism that effectively selects relevant content while maintaining discourse coherence. Through comprehensive experiments on QASPER and QuALITY datasets, DISRetrieval demonstrates substantial improvements over existing methods in both token-level retrieval metrics and downstream question answering tasks. Our ablation studies confirm that incorporating discourse structure significantly enhances retrieval effectiveness across different document lengths and query types, validating the importance of linguistically-informed document representation in long-text understanding. Our code and datasets are publicly available at github/DreamH1gh/DISRetrieval to facilitate future research.
[NLP-182] Reward Is Enough: LLMs Are In-Context Reinforcement Learners
【Quick Read】: This paper asks how reinforcement-learning-like (RL-like) behavior can be realized during the inference of a large language model (LLM), i.e., how an LLM can progressively improve its outputs through feedback without any explicit training. The key to the solution is ICRL prompting, a multi-round prompting framework that, in each round, provides the LLM with the task, its previous responses, and the corresponding reward signals, so that the LLM dynamically adjusts its behavior at inference time to maximize reward, realizing an RL-like decision-optimization process.
Link: https://arxiv.org/abs/2506.06303
Authors: Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra, Yanjun Qi, Shangtong Zhang
Institutions: University of Virginia
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Reinforcement learning (RL) is a human-designed framework for solving sequential decision making problems. In this work, we demonstrate that, surprisingly, RL emerges at inference time in large language models (LLMs), a phenomenon known as in-context RL (ICRL). Specifically, we propose a novel multi-round prompting framework called ICRL prompting. The goal is to prompt the LLM to complete a task. After the LLM generates a response at the current round, we give numerical scalar feedback for the response, called the reward. At the next round, we prompt the LLM again with the same task and a context consisting of all previous responses and rewards. We observe that the quality of the LLM's response increases as the context grows. In other words, the LLM is able to maximize the scalar reward signal at inference time, just like an RL algorithm. We evaluate ICRL prompting in three benchmarks (Game of 24, creative writing, and ScienceWorld) and demonstrate significant performance improvements over baseline methods such as Self-Refine and Reflexion. Surprisingly, in some experiments the reward signals are generated by the LLM itself, yet performance improvements are still observed from ICRL prompting, offering a promising paradigm for scaling test-time compute.
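A minimal sketch of the ICRL prompting loop described above; llm and reward_fn are hypothetical stand-ins for a completion call and a scalar scorer, and the prompt wording is an assumption, not the paper's template.

```python
# Sketch of multi-round ICRL prompting: each round re-prompts with the task
# plus all previous (response, reward) pairs, mimicking an RL feedback loop.
from typing import Callable, List, Tuple

def icrl_prompting(task: str,
                   llm: Callable[[str], str],          # hypothetical LLM call
                   reward_fn: Callable[[str], float],  # hypothetical scorer
                   num_rounds: int = 5) -> str:
    history: List[Tuple[str, float]] = []
    best_response, best_reward = "", float("-inf")
    for _ in range(num_rounds):
        # Rebuild the context each round from all prior responses and rewards.
        context = "".join(
            f"\nPrevious response:\n{resp}\nReward: {score:.2f}\n"
            for resp, score in history
        )
        prompt = (f"Task: {task}\n{context}\n"
                  "Give a new response that earns a higher reward.")
        response = llm(prompt)
        reward = reward_fn(response)       # scalar feedback for this round
        history.append((response, reward))
        if reward > best_reward:
            best_response, best_reward = response, reward
    return best_response
```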
[NLP-183] How Malicious AI Swarms Can Threaten Democracy
【Quick Read】: This paper examines a new class of disinformation operations enabled by generative AI, in particular the threat posed by malicious AI swarms. Such swarms can coordinate covertly, infiltrate communities, evade traditional detection, and run continuous A/B tests, producing fabricated grassroots consensus, a fragmented shared reality, mass harassment, voter micro-suppression or mobilization, contamination of AI training data, and erosion of institutional trust. The key to the proposed response is a multi-layered strategy: platform-side defenses (always-on swarm-detection dashboards, high-fidelity swarm-simulation stress tests, transparency audits, and optional client-side "AI shields" for users), model-side safeguards (standardized persuasion-risk tests, provenance-authenticating passkeys, and watermarking), and system-level oversight (a UN-backed AI Influence Observatory).
Link: https://arxiv.org/abs/2506.06299
Authors: Daniel Thilo Schroeder, Meeyoung Cha, Andrea Baronchelli, Nick Bostrom, Nicholas A. Christakis, David Garcia, Amit Goldenberg, Yara Kyrychenko, Kevin Leyton-Brown, Nina Lutz, Gary Marcus, Filippo Menczer, Gordon Pennycook, David G. Rand, Frank Schweitzer, Christopher Summerfield, Audrey Tang, Jay Van Bavel, Sander van der Linden, Dawn Song, Jonas R. Kunst
Institutions: SINTEF; Max Planck Institute for Security and Privacy; City St George’s University of London; Macrostrategy Research Initiative; Yale University; University of Konstanz; Harvard Business School; Cambridge University; University of British Columbia; University of Washington; New York University; Indiana University; Cornell University; Massachusetts Institute of Technology; ETH Zürich; University of Oxford; Digital Minister & Cyber Ambassador Taiwan; University of California Berkeley; BI Norwegian Business School
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 1 figure
Abstract:Advances in AI portend a new era of sophisticated disinformation operations. While individual AI systems already create convincing – and at times misleading – information, an imminent development is the emergence of malicious AI swarms. These systems can coordinate covertly, infiltrate communities, evade traditional detectors, and run continuous A/B tests, with round-the-clock persistence. The result can include fabricated grassroots consensus, fragmented shared reality, mass harassment, voter micro-suppression or mobilization, contamination of AI training data, and erosion of institutional trust. With democratic processes worldwide increasingly vulnerable, we urge a three-pronged response: (1) platform-side defenses – always-on swarm-detection dashboards, pre-election high-fidelity swarm-simulation stress-tests, transparency audits, and optional client-side “AI shields” for users; (2) model-side safeguards – standardized persuasion-risk tests, provenance-authenticating passkeys, and watermarking; and (3) system-level oversight – a UN-backed AI Influence Observatory.
[NLP-184] dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
【Quick Read】: This paper tackles the high inference latency of diffusion-based large language models (dLLMs). Acceleration techniques designed for autoregressive models (ARMs), such as key-value caching, are incompatible with dLLMs because of their bidirectional attention mechanism. The key to the solution is the observation that dLLM inference involves a static prompt and a partially dynamic response, with most tokens remaining stable across adjacent denoising steps. Building on this, the authors propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with feature-similarity-guided partial response updates, enabling efficient reuse of intermediate computation without sacrificing model performance.
Link: https://arxiv.org/abs/2506.06295
Authors: Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang
Institutions: Shanghai Jiao Tong University; University of Electronic Science and Technology of China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub.
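The sketch below illustrates the similarity-gated reuse idea behind dLLM-Cache in a simplified form: prompt features are refreshed only at long intervals, while response features are recomputed only for tokens whose representation drifted between adjacent denoising steps. The threshold, interval, and compute_features interface are assumptions, not the authors' implementation.

```python
# Simplified similarity-gated cache update for one denoising step.
import torch

def cached_step(compute_features,      # hypothetical fn: (hidden, indices) -> features
                hidden: torch.Tensor,  # [seq, dim] hidden states at this step
                cache: torch.Tensor,   # [seq, dim] features kept from the last step
                prompt_len: int, step: int,
                prompt_interval: int = 50, sim_threshold: float = 0.95):
    sim = torch.nn.functional.cosine_similarity(hidden, cache, dim=-1)
    stale = sim < sim_threshold                         # response tokens that drifted
    stale[:prompt_len] = (step % prompt_interval == 0)  # prompt: long-interval refresh
    idx = stale.nonzero(as_tuple=True)[0]
    if idx.numel() > 0:
        cache[idx] = compute_features(hidden, idx)      # recompute only stale tokens
    return cache  # cached features are reused for every non-stale token
```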
[NLP-185] GLProtein: Global-and-Local Structure Aware Protein Representation Learning
【Quick Read】: This paper addresses the insufficient integration of structural information in protein function prediction and structural analysis, where traditional methods rely mainly on sequence information and overlook structural detail. The key to the solution is the GLProtein framework, the first protein pre-training framework to combine global structural similarity with local amino-acid detail. By integrating protein masked modeling, triplet structure-similarity scoring, protein 3D distance encoding, and substructure-based amino-acid molecule encoding, it improves prediction accuracy and functional insight.
Link: https://arxiv.org/abs/2506.06294
Authors: Yunqing Liu, Wenqi Fan, Xiaoyong Wei, Qing Li
Institutions: The Hong Kong Polytechnic University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:
Abstract:Proteins are central to biological systems, participating as building blocks across all forms of life. Despite advancements in understanding protein functions through protein sequence analysis, there remains potential for further exploration in integrating protein structural information. We argue that the structural information of proteins is not only limited to their 3D information but also encompasses information from amino acid molecules (local information) to protein-protein structure similarity (global information). To address this, we propose GLProtein, the first framework in protein pre-training that incorporates both global structural similarity and local amino acid details to enhance prediction accuracy and functional insights. GLProtein innovatively combines protein-masked modelling with triplet structure similarity scoring, protein 3D distance encoding and substructure-based amino acid molecule encoding. Experimental results demonstrate that GLProtein outperforms previous methods in several bioinformatics tasks, including predicting protein-protein interaction, contact prediction, and so on.
[NLP-186] GOLFer: Smaller LM-Generated Documents Hallucination Filter Combiner for Query Expansion in Information Retrieval
【Quick Read】: This paper addresses the fact that the performance of query expansion based on large language models (LLMs) depends heavily on model scale, which makes it costly, compute-intensive, and hard to access. The key to the solution is GOLFer, a method that uses smaller open-source language models to generate documents and then filters hallucinations and combines content through two modules: a hallucination filter that detects and removes non-factual and inconsistent sentences from the generated documents, and a documents combiner that merges the filtered content with the query via a weight vector to balance their influence, maintaining strong performance while reducing dependence on large-scale LLMs.
Link: https://arxiv.org/abs/2506.04762
Authors: Lingyuan Liu, Mengxiang Zhang
Institutions: City University of Hong Kong; The University of Hong Kong
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs)-based query expansion for information retrieval augments queries with generated hypothetical documents with LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.
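A minimal sketch of GOLFer's two modules, under stated assumptions: embed and is_consistent stand in for a sentence embedder and an NLI-style consistency check, and the 0.4 weight is illustrative rather than the paper's learned weight vector.

```python
# Filter generated sentences, then mix the survivors with the query embedding.
from typing import Callable, List
import numpy as np

def golfer_query_vector(query: str,
                        generated_docs: List[str],
                        embed: Callable[[str], np.ndarray],       # hypothetical embedder
                        is_consistent: Callable[[str, str], bool],  # hypothetical checker
                        doc_weight: float = 0.4) -> np.ndarray:
    # Hallucination filter: drop sentences the checker flags as unsupported.
    kept = [s for doc in generated_docs
            for s in doc.split(". ")
            if s and is_consistent(query, s)]
    if not kept:                        # nothing survived filtering: fall back
        return embed(query)
    doc_vec = np.mean([embed(s) for s in kept], axis=0)
    # Documents combiner: weighted mix of the query and the filtered expansion.
    return (1.0 - doc_weight) * embed(query) + doc_weight * doc_vec
```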
[NLP-187] Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
【Quick Read】: This paper addresses the lack of in-depth understanding of community needs and application challenges in low-resource machine translation (MT). The key to the solution is an observational study of actual usage patterns on tetun.org, a dedicated MT service for Tetun, the lingua franca of Timor-Leste. Analyzing 100,000 translation requests reveals how users translate from high-resource languages into Tetun across diverse domains including science, healthcare, and daily life. This approach overcomes the small-sample limitations of traditional surveys and focus groups and grounds low-resource language technology development in real community needs.
Link: https://arxiv.org/abs/2411.12262
Authors: Raphael Merx, Adérito José Guterres Correia, Hanna Suominen, Ekaterina Vylomova
Institutions: The University of Melbourne; Instituto Nacional de Linguística; The Australian National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: to be published in LoResMT 2025
Abstract:Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of tetun.org, a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.
[NLP-188] Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning ACM-MM2023
【Quick Read】: This paper aims to measure cross-modal similarity precisely for text-video retrieval, where existing contrastive learning methods pay too little attention to hard negative pairs and cannot model fine-grained semantic similarity. The key to the solution is two novel techniques: a Dual-Modal Attention-Enhanced Module (DMAE) that mines hard negative pairs from textual and visual clues, combined with a Negative-aware InfoNCE (NegNCE) loss that adaptively identifies these hard negatives and explicitly emphasizes their impact in the training loss; and a Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs, using an adaptive token-masking strategy with cross-modal interaction to model subtle semantic differences.
Link: https://arxiv.org/abs/2309.11082
Authors: Chen Jiang, Hong Liu, Xuzheng Yu, Qing Wang, Yuan Cheng, Jia Xu, Zhongyi Liu, Qingpei Guo, Wei Chu, Ming Yang, Yuan Qi
Institutions: Ant Group; Fudan University; Artificial Intelligence Innovation and Incubation Institute, Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted by ACM MM 2023
Abstract:In recent years, the explosion of web videos makes text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant text/video higher than irrelevant ones. The core of this task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning using two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) to mine hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we are able to adaptively identify all these hard negatives and explicitly highlight their impacts in the training loss. Second, our work argues that triplet samples can better model fine-grained semantic similarity compared to pairwise samples. We thereby present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module to construct partial order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. The proposed TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely-used text-video retrieval datasets, including MSR-VTT, MSVD, DiDeMo and ActivityNet.
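The abstract does not spell out the NegNCE formula; the sketch below shows one plausible shape of a negative-aware InfoNCE in PyTorch, where off-diagonal pairs whose similarity approaches the positive's receive an extra hinge penalty. The margin and weighting are assumptions, not the paper's exact loss.

```python
# Standard InfoNCE plus an explicit penalty on hard (near-positive) negatives.
import torch
import torch.nn.functional as F

def neg_aware_info_nce(sim: torch.Tensor,       # [B, B] text-video similarity matrix
                       temperature: float = 0.05,
                       margin: float = 0.1,
                       hard_weight: float = 0.5) -> torch.Tensor:
    labels = torch.arange(sim.size(0), device=sim.device)
    base = F.cross_entropy(sim / temperature, labels)  # standard InfoNCE term
    pos = sim.diagonal().unsqueeze(1)                  # [B, 1] matched-pair scores
    # Hard negatives: off-diagonal pairs within `margin` of the positive score.
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hard = F.relu(sim - pos + margin)[off_diag]        # hinge over hard negatives
    return base + hard_weight * hard.mean()
```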
[NLP-189] Speaker-Distinguishable CTC: Learning Speaker Distinction Using CTC for Multi-Talker Speech Recognition INTERSPEECH2025
【Quick Read】: This paper addresses recognition errors in multi-talker automatic speech recognition caused by speaker-assignment failures. Conventional approaches such as Serialized Output Training (SOT) perform poorly without auxiliary information; although auxiliary signals such as frame-level timestamps improve accuracy, extracting them from natural conversational speech remains challenging. The key to the solution is Speaker-Distinguishable CTC (SD-CTC), an extension of Connectionist Temporal Classification that jointly assigns each frame a token and its corresponding speaker label, enabling speaker distinction from only overlapping speech and transcriptions. Experiments show that integrating SD-CTC into the SOT framework via multi-task learning reduces the SOT model's error rate by 26% and reaches performance comparable to state-of-the-art methods that rely on auxiliary information.
Link: https://arxiv.org/abs/2506.07515
Authors: Asahi Sakuma, Hiroaki Sato, Ryuga Sugano, Tadashi Kumano, Yoshihiko Kawai, Tetsuji Ogawa
Institutions: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted at INTERSPEECH 2025
Abstract:This paper presents a novel framework for multi-talker automatic speech recognition without the need for auxiliary information. Serialized Output Training (SOT), a widely used approach, suffers from recognition errors due to speaker assignment failures. Although incorporating auxiliary information, such as token-level timestamps, can improve recognition accuracy, extracting such information from natural conversational speech remains challenging. To address this limitation, we propose Speaker-Distinguishable CTC (SD-CTC), an extension of CTC that jointly assigns a token and its corresponding speaker label to each frame. We further integrate SD-CTC into the SOT framework, enabling the SOT model to learn speaker distinction using only overlapping speech and transcriptions. Experimental comparisons show that multi-task learning with SD-CTC and SOT reduces the error rate of the SOT model by 26% and achieves performance comparable to state-of-the-art methods relying on auxiliary information.
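One way to picture SD-CTC's joint frame-level assignment is a CTC loss over a product vocabulary of (token, speaker) pairs; the sketch below uses that reading with illustrative sizes and random inputs, and is not the authors' implementation.

```python
# CTC over a flattened (token, speaker) label space; sizes are illustrative.
import torch
import torch.nn as nn

vocab_size, num_speakers = 1000, 2
joint_size = vocab_size * num_speakers + 1          # +1 for the CTC blank

def joint_id(token_id: int, speaker_id: int) -> int:
    return token_id * num_speakers + speaker_id     # flatten (token, speaker)

ctc_loss = nn.CTCLoss(blank=joint_size - 1)
log_probs = torch.randn(50, 4, joint_size).log_softmax(-1)  # [T, batch, classes]
targets = torch.tensor([[joint_id(7, 0), joint_id(12, 1)]] * 4)
loss = ctc_loss(log_probs, targets,
                torch.full((4,), 50),   # input lengths per utterance
                torch.full((4,), 2))    # target lengths per utterance
print(loss)
```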
[NLP-190] Reducing Object Hallucination in Large Audio-Language Models via Audio-Aware Decoding
【Quick Read】: This paper addresses hallucination in Large Audio-Language Models (LALMs) when processing audio inputs, i.e., outputs that do not match the actual audio content. The key to the solution is Audio-Aware Decoding (AAD), a lightweight inference-time strategy that uses contrastive decoding to compare token prediction probabilities with and without the audio context, promoting tokens whose probability increases when the audio is present.
Link: https://arxiv.org/abs/2506.07233
Authors: Tzu-wen Hsu, Ke-Han Lu, Cheng-Han Chiang, Hung-yi Lee
Institutions: Purdue University, United States; National Taiwan University, Taiwan
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Comments:
Abstract:Large Audio-Language Models (LALMs) can take audio and text as the inputs and answer questions about the audio. While prior LALMs have shown strong performance on standard benchmarks, there has been alarming evidence that LALMs can hallucinate what is presented in the audio. To mitigate the hallucination of LALMs, we introduce Audio-Aware Decoding (AAD), a lightweight inference-time strategy that uses contrastive decoding to compare the token prediction logits with and without the audio context. By contrastive decoding, AAD promotes the tokens whose probability increases when the audio is present. We conduct our experiment on object hallucination datasets with three LALMs and show that AAD improves the F1 score by 0.046 to 0.428. We also show that AAD can improve the accuracy on general audio QA datasets like Clotho-AQA by 5.4% to 10.3%. We conduct thorough ablation studies to understand the effectiveness of each component in AAD.
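The contrastive rule in AAD can be sketched as below: compute logits with and without the audio context and promote tokens whose likelihood rises when audio is present. The exact functional form and the alpha value are assumptions based on common contrastive-decoding practice, not the paper's stated formula.

```python
# Contrast the audio-conditioned and text-only next-token distributions.
import torch

def audio_aware_logits(logits_with_audio: torch.Tensor,
                       logits_without_audio: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    log_p_audio = logits_with_audio.log_softmax(-1)
    log_p_text = logits_without_audio.log_softmax(-1)
    # Tokens whose probability increases when audio is present get promoted.
    return log_p_audio + alpha * (log_p_audio - log_p_text)
```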
[NLP-191] On the Fundamental Impossibility of Hallucination Control in Large Language Models
【Quick Read】: This paper addresses why large language models (LLMs) cannot avoid hallucination, i.e., generating content that is factually incorrect or ungrounded. The key to the solution is a formal impossibility theorem proving that no inference mechanism can simultaneously satisfy four fundamental properties: truthful (non-hallucinatory) generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. By modeling LLM inference as an auction of ideas, in which neural components compete to contribute to responses, and proving the result via the Green-Laffont theorem, the work provides a rigorous mathematical foundation for understanding the nature of the inference process, with implications for model architecture, training objectives, and evaluation methods.
Link: https://arxiv.org/abs/2506.06382
Authors: Michał P. Karpowicz
Institutions: Samsung AI Center Warsaw
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:This paper explains why it is impossible to create large language models that do not hallucinate and what trade-offs we should be looking for. It presents a formal impossibility theorem demonstrating that no inference mechanism can simultaneously satisfy four fundamental properties: truthful (non-hallucinatory) generation, semantic information conservation, relevant knowledge revelation, and knowledge-constrained optimality. By modeling LLM inference as an auction of ideas where neural components compete to contribute to responses, we prove the impossibility using the Green-Laffont theorem. That mathematical framework provides a rigorous foundation for understanding the nature of the inference process, with implications for model architecture, training objectives, and evaluation methods.
[NLP-192] The Hype Index: An NLP-driven Measure of Market News Attention
【Quick Read】: This paper asks how to quantify media attention toward large-cap stocks and extract predictive signals from it, to support stock-volatility analysis and the identification of market signals. The key to the solution is using natural language processing (NLP) to construct two variants of the Hype Index: a news count-based version, which measures relative media exposure as the share of news articles referencing each stock or sector, and a capitalization-adjusted version, which takes the ratio of a stock's or sector's media weight to its market-capitalization weight, relating media attention to economic size more accurately.
Link: https://arxiv.org/abs/2506.06329
Authors: Zheng Cao, Wanchaloem Wunkaew, Helyette Geman
Institutions: Johns Hopkins University
Categories: Statistical Finance (q-fin.ST); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
Comments:
Abstract:This paper introduces the Hype Index as a novel metric to quantify media attention toward large-cap equities, leveraging advances in Natural Language Processing (NLP) for extracting predictive signals from financial news. Using the S&P 100 as the focus universe, we first construct a News Count-Based Hype Index, which measures relative media exposure by computing the share of news articles referencing each stock or sector. We then extend it to the Capitalization Adjusted Hype Index, which adjusts for economic size by taking the ratio of a stock's or sector's media weight to its market capitalization weight within its industry or sector. We compute both versions of the Hype Index at the stock and sector levels, and evaluate them through multiple lenses: (1) their classification into different hype groups, (2) their associations with returns, volatility, and the VIX index at various lags, (3) their signaling power for short-term market movements, and (4) their empirical properties including correlations, samplings, and trends. Our findings suggest that the Hype Index family provides a valuable set of tools for stock volatility analysis, market signaling, and NLP extensions in Finance.
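The capitalization-adjusted index, as the abstract describes it, is a news-share to cap-share ratio; here is a small worked sketch with made-up numbers to show the shape of the computation.

```python
# Capitalization-adjusted hype: news share divided by market-cap share.
def cap_adjusted_hype(news_counts: dict, market_caps: dict) -> dict:
    total_news = sum(news_counts.values())
    total_cap = sum(market_caps.values())
    return {
        ticker: (news_counts[ticker] / total_news) / (market_caps[ticker] / total_cap)
        for ticker in news_counts
    }

caps = {"AAPL": 3.0e12, "XOM": 0.5e12}   # illustrative market caps (USD)
news = {"AAPL": 900, "XOM": 100}         # illustrative article counts
print(cap_adjusted_hype(news, caps))     # >1 means hyped relative to size
```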
Computer Vision
[CV-0] 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos
【Quick Read】: This paper addresses dynamic scene reconstruction, specifically the efficient modeling and rendering of complex, time-varying environments from monocular posed videos. The key to the solution is 4DGT, a Transformer model based on 4D Gaussians: using 4D Gaussians as an inductive bias unifies static and dynamic components, and a novel density-control strategy allows the model to handle longer space-time inputs while remaining efficient to render at runtime. 4DGT processes consecutive frames in a rolling-window fashion to predict consistent 4D Gaussians across the scene; unlike optimization-based methods, it performs purely feed-forward inference, greatly accelerating reconstruction and scaling to long video sequences.
Link: https://arxiv.org/abs/2506.08015
Authors: Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, Zhaoyang Lv
Institutions: Reality Labs Research, Meta; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:We propose 4DGT, a 4D Gaussian-based Transformer model for dynamic scene reconstruction, trained entirely on real-world monocular posed videos. Using 4D Gaussian as an inductive bias, 4DGT unifies static and dynamic components, enabling the modeling of complex, time-varying environments with varying object lifespans. We proposed a novel density control strategy in training, which enables our 4DGT to handle longer space-time input and remain efficient rendering at runtime. Our model processes 64 consecutive posed frames in a rolling-window fashion, predicting consistent 4D Gaussians in the scene. Unlike optimization-based methods, 4DGT performs purely feed-forward inference, reducing reconstruction time from hours to seconds and scaling effectively to long video sequences. Trained only on large-scale monocular posed video datasets, 4DGT can outperform prior Gaussian-based networks significantly in real-world videos and achieve on-par accuracy with optimization-based methods on cross-domain videos. Project page: this https URL
[CV-1] StableMTL: Repurposing Latent Diffusion Models for Multi-Task Learning from Partially Annotated Synthetic Datasets
【Quick Read】: This paper addresses the annotation bottleneck that limits multi-task learning for dense prediction, extending training with partial task labels to a zero-shot setting. The key to the solution is StableMTL, which leverages the generalization power of diffusion models by training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks, and replaces per-task losses with a unified latent loss so that adding tasks scales seamlessly. To encourage inter-task synergy, it further introduces a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing.
Link: https://arxiv.org/abs/2506.08013
Authors: Anh-Quan Cao, Ivan Lopes, Raoul de Charette
Institutions: Valeo.ai; Inria
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at this https URL
Abstract:Multi-task learning for dense prediction is limited by the need for extensive annotation for every task, though recent works have explored training with partial task labels. Leveraging the generalization power of diffusion models, we extend the partial learning setup to a zero-shot setting, training a multi-task model on multiple synthetic datasets, each labeled for only a subset of tasks. Our method, StableMTL, repurposes image generators for latent regression, adapting a denoising framework with task encoding, per-task conditioning, and a tailored training scheme. Instead of per-task losses requiring careful balancing, a unified latent loss is adopted, enabling seamless scaling to more tasks. To encourage inter-task synergy, we introduce a multi-stream model with a task-attention mechanism that converts N-to-N task interactions into efficient 1-to-N attention, promoting effective cross-task sharing. StableMTL outperforms baselines on 7 tasks across 8 benchmarks.
[CV-2] GUI-Reflection: Empowering Multimodal GUI Models with Self-Reflection Behavior
【Quick Read】: This paper addresses the lack of self-reflection and error-recovery abilities in existing GUI models for automated tasks, a limitation that stems mainly from training on nearly error-free offline trajectories. The key to the solution is the GUI-Reflection framework, which explicitly integrates self-reflection and error-correction capabilities into end-to-end multimodal GUI models across dedicated training stages (GUI-specific pre-training, offline supervised fine-tuning, and online reflection tuning), enabling the model to generate data automatically and learn self-reflective behavior without human annotation.
Link: https://arxiv.org/abs/2506.08012
Authors: Penghao Wu, Shengnan Ma, Bo Wang, Jiaheng Yu, Lewei Lu, Ziwei Liu
Institutions: Nanyang Technological University; SenseTime Research
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page at this https URL
Abstract:Multimodal Large Language Models (MLLMs) have shown great potential in revolutionizing Graphical User Interface (GUI) automation. However, existing GUI models mostly rely on learning from nearly error-free offline trajectories, thus lacking reflection and error recovery capabilities. To bridge this gap, we propose GUI-Reflection, a novel framework that explicitly integrates self-reflection and error correction capabilities into end-to-end multimodal GUI models throughout dedicated training stages: GUI-specific pre-training, offline supervised fine-tuning (SFT), and online reflection tuning. GUI-Reflection enables self-reflection behavior to emerge through fully automated data generation and learning processes, without requiring any human annotation. Specifically, 1) we first propose scalable data pipelines to automatically construct reflection and error correction data from existing successful trajectories. While existing GUI models mainly focus on grounding and UI understanding ability, we propose the GUI-Reflection Task Suite to learn and evaluate reflection-oriented abilities explicitly. 2) Furthermore, we built a diverse and efficient environment for online training and data collection of GUI models on mobile devices. 3) We also present an iterative online reflection tuning algorithm leveraging the proposed environment, enabling the model to continuously enhance its reflection and error correction abilities. Our framework equips GUI agents with self-reflection and correction capabilities, paving the way for more robust, adaptable, and intelligent GUI automation, with all data, models, environments, and tools to be released publicly.
[CV-3] Vision Transformers Don't Need Trained Registers
【Quick Read】: This paper investigates the noisy attention maps caused by high-norm tokens in Vision Transformers: a sparse set of neurons concentrates high-norm activations on outlier tokens, producing irregular attention patterns and degrading downstream visual processing. The key to the solution is identifying these register neurons as the source of the high-norm activations and shifting those activations into an additional untrained token, which mimics the effect of register tokens on a model trained without them and mitigates the artifacts without any retraining.
Link: https://arxiv.org/abs/2506.08010
Authors: Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
Institutions: UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page and code: this https URL
Abstract:We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers – the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test-time, offering a training-free solution for any pre-trained model released without them.
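A toy sketch of the test-time register idea, under strong assumptions: the indices of the register neurons are taken as given (the paper locates them per model), and the extra token is simply appended at one layer; the pooling rule here is illustrative, not the authors' procedure.

```python
# Move presumed register-neuron activations from patch tokens into one
# appended, untrained token, so attention over the patches stays clean.
import torch

def apply_test_time_register(hidden: torch.Tensor,          # [B, N, D] patch tokens
                             register_neurons: torch.Tensor  # LongTensor of indices into D
                             ) -> torch.Tensor:
    B, N, D = hidden.shape
    register = torch.zeros(B, 1, D, dtype=hidden.dtype, device=hidden.device)
    # Pool the high-norm activations into the new token (max over patches is
    # an assumed choice), then clear them from the patch tokens.
    register[:, 0, register_neurons] = hidden[:, :, register_neurons].amax(dim=1)
    hidden = hidden.clone()
    hidden[:, :, register_neurons] = 0.0
    return torch.cat([hidden, register], dim=1)              # [B, N+1, D]
```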
[CV-4] Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
【Quick Read】: This paper addresses exposure bias in autoregressive video diffusion models: the model is trained with ground-truth context yet must condition on its own imperfect outputs at inference time. The key to the solution is the Self Forcing training paradigm, which performs autoregressive rollout with key-value (KV) caching during training so that each frame is generated conditioned on previously self-generated outputs, enabling supervision through a holistic video-level loss instead of relying only on traditional frame-wise objectives.
Link: https://arxiv.org/abs/2506.08009
Authors: Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman
Institutions: Adobe Research; The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project website: this http URL
Abstract:We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame’s generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: this http URL
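A rough sketch of one Self Forcing training step as the abstract describes it; model.generate_frame and video_loss are hypothetical interfaces, and the real method additionally uses a few-step diffusion model and stochastic gradient truncation, which are omitted here.

```python
# Roll out frames on the model's own outputs, then apply one holistic loss.
import torch

def self_forcing_step(model, video_loss, prompt, num_frames: int = 16):
    frames, kv_cache = [], None
    for t in range(num_frames):
        # Condition on previously *self-generated* frames via the KV cache,
        # not on ground-truth context frames (this removes exposure bias).
        frame, kv_cache = model.generate_frame(prompt, kv_cache=kv_cache, step=t)
        frames.append(frame)
    video = torch.stack(frames, dim=1)   # [B, T, C, H, W]
    loss = video_loss(video)             # video-level supervision signal
    loss.backward()                      # gradients flow through the rollout
    return loss.detach()
```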
[CV-5] Hidden in plain sight: VLMs overlook their visual representations
【Quick Read】: This paper examines why vision language models (VLMs) underperform on vision-centric tasks, in particular their ability to integrate visual and linguistic information. Comparing VLMs against a direct readout of their visual encoders shows that VLMs perform substantially worse on tasks such as depth estimation and correspondence, sometimes near chance. The key finding is that the bottleneck lies largely in the language model's role in solving the task: VLMs fail to exploit visual information readily available throughout the model and inherit the language priors of the LLM, which limits visual understanding.
Link: https://arxiv.org/abs/2506.08008
Authors: Stephanie Fu, Tyler Bonnen, Devin Guillory, Trevor Darrell
Institutions: UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL
Abstract:Language provides a natural interface to specify and evaluate performance on visual tasks. To realize this possibility, vision language models (VLMs) must successfully integrate visual and linguistic information. Our work compares VLMs to a direct readout of their visual encoders to understand their ability to integrate across these modalities. Across a series of vision-centric benchmarks (e.g., depth estimation, correspondence), we find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance. We investigate these results through a series of analyses across the entire VLM: namely 1) the degradation of vision representations, 2) brittleness to task prompt, and 3) the language model’s role in solving the task. We find that the bottleneck in performing these vision-centric tasks lies in this third category; VLMs are not effectively using visual information easily accessible throughout the entire model, and they inherit the language priors present in the LLM. Our work helps diagnose the failure modes of open-source VLMs, and presents a series of evaluations useful for future investigations into visual understanding within VLMs.
[CV-6] Dreamland: Controllable World Creation with Simulator and Generative Models
【Quick Read】: This paper addresses the lack of element-wise controllability in large-scale video generation models, which limits their use for scene editing and for training embodied AI agents. The key to the solution is the Dreamland framework, which combines the fine-grained control of a physics-based simulator with the photorealistic output of large-scale pretrained generative models; a layered world abstraction encodes pixel-level and object-level semantics and geometry as an intermediate representation that bridges the simulator and the generative model.
Link: https://arxiv.org/abs/2506.08006
Authors: Sicheng Mo, Ziyang Leng, Leon Liu, Weizhen Wang, Honglin He, Bolei Zhou
Institutions: University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Large-scale video generative models can synthesize diverse and realistic visual content for dynamic world creation, but they often lack element-wise controllability, hindering their use in editing scenes and training embodied AI agents. We propose Dreamland, a hybrid world generation framework combining the granular control of a physics-based simulator and the photorealistic content output of large-scale pretrained generative models. In particular, we design a layered world abstraction that encodes both pixel-level and object-level semantics and geometry as an intermediate representation to bridge the simulator and the generative model. This approach enhances controllability, minimizes adaptation cost through early alignment with real-world distributions, and supports off-the-shelf use of existing and future pretrained generative models. We further construct a D3Sim dataset to facilitate the training and evaluation of hybrid generation pipelines. Experiments demonstrate that Dreamland outperforms existing baselines with 50.8% improved image quality, 17.9% stronger controllability, and has great potential to enhance embodied agent training. Code and data will be made available.
[CV-7] ZeroVO: Visual Odometry with Minimal Assumptions
【Quick Read】: This paper addresses the poor generalization of traditional visual odometry (VO) algorithms across cameras and environments, where existing methods typically depend on predefined or static camera calibration setups. The key to the solution is a calibration-free, geometry-aware network combined with a language-based prior that injects semantic information to strengthen feature extraction and generalization, plus a flexible semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, achieving zero-shot generalization across diverse real-world scenarios.
Link: https://arxiv.org/abs/2506.08005
Authors: Lei Lai, Zekai Yin, Eshed Ohn-Bar
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We introduce ZeroVO, a novel visual odometry (VO) algorithm that achieves zero-shot generalization across diverse cameras and environments, overcoming limitations in existing methods that depend on predefined or static camera calibration setups. Our approach incorporates three main innovations. First, we design a calibration-free, geometry-aware network structure capable of handling noise in estimated depth and camera parameters. Second, we introduce a language-based prior that infuses semantic information to enhance robust feature extraction and generalization to previously unseen domains. Third, we develop a flexible, semi-supervised training paradigm that iteratively adapts to new scenes using unlabeled data, further boosting the models’ ability to generalize across diverse real-world scenarios. We analyze complex autonomous driving contexts, demonstrating over 30% improvement against prior methods on three standard benchmarks, KITTI, nuScenes, and Argoverse 2, as well as a newly introduced, high-fidelity synthetic dataset derived from Grand Theft Auto (GTA). By not requiring fine-tuning or camera calibration, our work broadens the applicability of VO, providing a versatile solution for real-world deployment at scale.
[CV-8] Dynamic View Synthesis as an Inverse Problem
【Quick Read】: This paper treats dynamic view synthesis from monocular video as a training-free inverse problem. The key to the solution is redesigning the noise-initialization stage of a pre-trained video diffusion model: a new noise representation called the K-order Recursive Noise Representation resolves the obstacle to deterministic inversion created by zero-terminal signal-to-noise-ratio (SNR) schedules, and its closed-form expression enables precise and efficient alignment between VAE-encoded and DDIM-inverted latents. To synthesize regions newly revealed by camera motion, Stochastic Latent Modulation performs visibility-aware sampling in the latent space to complete occluded areas.
Link: https://arxiv.org/abs/2506.08004
Authors: Hidir Yesiltepe, Pinar Yanardag
Institutions: Virginia Tech
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL
Abstract:In this work, we address dynamic view synthesis from monocular videos as an inverse problem in a training-free setting. By redesigning the noise initialization phase of a pre-trained video diffusion model, we enable high-fidelity dynamic view synthesis without any weight updates or auxiliary modules. We begin by identifying a fundamental obstacle to deterministic inversion arising from zero-terminal signal-to-noise ratio (SNR) schedules and resolve it by introducing a novel noise representation, termed K-order Recursive Noise Representation. We derive a closed form expression for this representation, enabling precise and efficient alignment between the VAE-encoded and the DDIM inverted latents. To synthesize newly visible regions resulting from camera motion, we introduce Stochastic Latent Modulation, which performs visibility aware sampling over the latent space to complete occluded regions. Comprehensive experiments demonstrate that dynamic view synthesis can be effectively performed through structured latent manipulation in the noise initialization phase.
[CV-9] Audio-Sync Video Generation with Multi-Stream Temporal Control
【Quick Read】: This paper targets the insufficient audio-video synchronization quality of controllable video generation, especially with diverse and complex audio types. The key to the solution is the MTV framework, which explicitly separates audio into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively, for fine-grained and semantically aligned video generation.
Link: https://arxiv.org/abs/2506.08003
Authors: Shuchen Weng, Haojie Zheng, Zheng Chang, Si Li, Boxin Shi, Xinlong Wang
Institutions: Beijing Academy of Artificial Intelligence; School of Software and Microelectronics, Peking University; School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Nat’l Key Lab of General AI, School of Intelligence Science and Technology, Peking University; Nat’l Eng. Research Ctr. of Visual Tech., School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Audio is inherently temporal and closely synchronized with the visual world, making it a naturally aligned and expressive control signal for controllable video generation (e.g., movies). Beyond control, directly translating audio into video is essential for understanding and visualizing rich audio narratives (e.g., Podcasts or historical recordings). However, existing approaches fall short in generating high-quality videos with precise audio-visual synchronization, especially across diverse and complex audio types. In this work, we introduce MTV, a versatile framework for audio-sync video generation. MTV explicitly separates audios into speech, effects, and music tracks, enabling disentangled control over lip motion, event timing, and visual mood, respectively – resulting in fine-grained and semantically aligned video generation. To support the framework, we additionally present DEMIX, a dataset comprising high-quality cinematic videos and demixed audio tracks. DEMIX is structured into five overlapped subsets, enabling scalable multi-stage training for diverse generation scenarios. Extensive experiments demonstrate that MTV achieves state-of-the-art performance across six standard metrics spanning video quality, text-video consistency, and audio-video alignment. Project page: this https URL.
[CV-10] Aligning Text, Images, and 3D Structure Token-by-Token ALT
【Quick Read】: This paper aims to enable machines to understand the three-dimensional (3D) world, assisting designers who build and edit 3D environments and robots that navigate and interact in 3D space. The key to the solution is a unified large language model (LLM) framework that aligns language, images, and structured 3D scenes, accompanied by a detailed cookbook of design choices for optimal training and performance that addresses key questions such as data representation and modality-specific objectives.
Link: https://arxiv.org/abs/2506.08002
Authors: Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari
Institutions: California Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project webpage: this https URL
Abstract:Creating machines capable of understanding the world in 3D is essential in assisting designers that build and edit 3D environments and robots navigating and interacting within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We evaluate performance across four core 3D tasks, namely rendering, recognition, instruction-following, and question-answering, and four 3D datasets, synthetic and real-world. We extend our approach to reconstruct complex 3D object shapes by enriching our 3D modality with quantized shape encodings, and show our model's effectiveness on real-world 3D object recognition tasks. Project webpage: this https URL
[CV-11] MADFormer: Mixed Autoregressive and Diffusion Transformers for Continuous Image Generation
【Quick Read】: This paper addresses the lack of systematic guidance on how to allocate model capacity between autoregressive (AR) and diffusion components in hybrid architectures for multimodal generation. The key to the solution is MADFormer, which partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block, thereby combining the AR and diffusion mechanisms effectively.
Link: https://arxiv.org/abs/2506.07999
Authors: Junhao Chen, Yulia Tsvetkov, Xiaochuang Han
Institutions: Tsinghua University; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Recent progress in multimodal generation has increasingly combined autoregressive (AR) and diffusion-based approaches, leveraging their complementary strengths: AR models capture long-range dependencies and produce fluent, context-aware outputs, while diffusion models operate in continuous latent spaces to refine high-fidelity visual details. However, existing hybrids often lack systematic guidance on how and why to allocate model capacity between these paradigms. In this work, we introduce MADFormer, a Mixed Autoregressive and Diffusion Transformer that serves as a testbed for analyzing AR-diffusion trade-offs. MADFormer partitions image generation into spatial blocks, using AR layers for one-pass global conditioning across blocks and diffusion layers for iterative local refinement within each block. Through controlled experiments on FFHQ-1024 and ImageNet, we identify two key insights: (1) block-wise partitioning significantly improves performance on high-resolution images, and (2) vertically mixing AR and diffusion layers yields better quality-efficiency balances, improving FID by up to 75% under constrained inference compute. Our findings offer practical design principles for future hybrid generative models.
[CV-12] Generative Modeling of Weights: Generalization or Memorization?
【Quick Read】: This paper examines the effectiveness of generative models for synthesizing neural network weights, in particular whether they can generate novel weights that differ from the training checkpoints. The study finds that existing methods produce weights largely by memorizing training checkpoints rather than genuinely creating new, high-performing models. The key contribution is exposing the limitations of current generative models on this weight-generation task and emphasizing the need for more rigorous evaluation to advance the field.
Link: https://arxiv.org/abs/2506.07998
Authors: Boya Zeng, Yida Yin, Zhiqiu Xu, Zhuang Liu
Institutions: Princeton University; University of Pennsylvania
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL
Abstract:Generative models, with their success in image and video generation, have recently been explored for synthesizing effective neural network weights. These approaches take trained neural network checkpoints as training data, and aim to generate high-performing neural network weights during inference. In this work, we examine four representative methods on their ability to generate novel model weights, i.e., weights that are different from the checkpoints seen during training. Surprisingly, we find that these methods synthesize weights largely by memorization: they produce either replicas, or at best simple interpolations, of the training checkpoints. Current methods fail to outperform simple baselines, such as adding noise to the weights or taking a simple weight ensemble, in obtaining different and simultaneously high-performing models. We further show that this memorization cannot be effectively mitigated by modifying modeling factors commonly associated with memorization in image diffusion models, or applying data augmentations. Our findings provide a realistic assessment of what types of data current generative models can model, and highlight the need for more careful evaluation of generative models in new domains. Our code is available at this https URL.
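A simple memorization probe consistent with this analysis is to measure each generated weight vector's distance to its nearest training checkpoint, where near-zero distances indicate replicas; the sketch below uses random tensors as stand-ins for flattened weights and is not the authors' evaluation code.

```python
# Nearest-checkpoint distance as a crude memorization signal.
import torch

def nearest_checkpoint_distance(generated: torch.Tensor,   # [G, P] flattened weights
                                checkpoints: torch.Tensor  # [C, P] training ckpts
                                ) -> torch.Tensor:
    d = torch.cdist(generated, checkpoints)  # pairwise L2 distances, [G, C]
    return d.min(dim=1).values               # distance to the nearest checkpoint

gen = torch.randn(8, 1024)       # placeholder for generated weights
ckpts = torch.randn(100, 1024)   # placeholder for training checkpoints
print(nearest_checkpoint_distance(gen, ckpts))
```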
[CV-13] UA-Pose: Uncertainty-Aware 6D Object Pose Estimation and Online Object Completion with Partial References CVPR2025
【Quick Read】: This paper addresses the challenge of estimating 6D object pose from partial references, i.e., when no complete, well-reconstructed 3D model or reference images that fully cover the object are available. The key to the solution is UA-Pose, an uncertainty-aware approach to 6D pose estimation and online object completion designed for partial references: it integrates uncertainty into the incomplete 3D model to distinguish seen from unseen regions, uses this uncertainty for confidence assessment in pose estimation, and guides an uncertainty-aware sampling strategy for online object completion, improving both pose-estimation robustness and object completeness.
Link: https://arxiv.org/abs/2506.07996
Authors: Ming-Feng Li, Xin Yang, Fu-En Wang, Hritam Basak, Yuyin Sun, Shreekant Gayaka, Min Sun, Cheng-Hao Kuo
Institutions: Carnegie Mellon University; Stony Brook University; National Tsing Hua University; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: CVPR 2025
Abstract:6D object pose estimation has shown strong generalizability to novel objects. However, existing methods often require either a complete, well-reconstructed 3D model or numerous reference images that fully cover the object. Estimating 6D poses from partial references, which capture only fragments of an object’s appearance and geometry, remains challenging. To address this, we propose UA-Pose, an uncertainty-aware approach for 6D object pose estimation and online object completion specifically designed for partial references. We assume access to either (1) a limited set of RGBD images with known poses or (2) a single 2D image. For the first case, we initialize a partial object 3D model based on the provided images and poses, while for the second, we use image-to-3D techniques to generate an initial object 3D model. Our method integrates uncertainty into the incomplete 3D model, distinguishing between seen and unseen regions. This uncertainty enables confidence assessment in pose estimation and guides an uncertainty-aware sampling strategy for online object completion, enhancing robustness in pose estimation accuracy and improving object completeness. We evaluate our method on the YCB-Video, YCBInEOAT, and HO3D datasets, including RGBD sequences of YCB objects manipulated by robots and human hands. Experimental results demonstrate significant performance improvements over existing methods, particularly when object observations are incomplete or partially captured. Project page: this https URL
[CV-14] PairEdit: Learning Semantic Variations for Exemplar-based Image Editing
【Quick Read】: This paper addresses two problems in text-guided image editing: certain editing semantics are hard to specify precisely through text alone, and existing exemplar-based methods still rely on text prompts or implicit textual editing instructions. The key to the solution is PairEdit, a visual editing method that learns complex editing semantics from a small number of image pairs, or even a single pair, without any textual guidance. It explicitly models the semantic variation within paired images through target noise prediction, introduces a content-preserving noise schedule for more effective semantic learning, and optimizes distinct LoRAs to disentangle the learning of semantic variation from content.
Link: https://arxiv.org/abs/2506.07992
Authors: Haoguang Lu, Jiacheng Chen, Zhenguo Yang, Aurele Tohokantche Gnanha, Fu Lee Wang, Li Qing, Xudong Mao
Institutions: Sun Yat-sen University; Guangdong University of Technology; Huawei Noah’s Ark Laboratory; Hong Kong Metropolitan University; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advancements in text-guided image editing have achieved notable success by leveraging natural language prompts for fine-grained semantic control. However, certain editing semantics are challenging to specify precisely using textual descriptions alone. A practical alternative involves learning editing semantics from paired source-target examples. Existing exemplar-based editing methods still rely on text prompts describing the change within paired examples or learning implicit text-based editing instructions. In this paper, we introduce PairEdit, a novel visual editing method designed to effectively learn complex editing semantics from a limited number of image pairs or even a single image pair, without using any textual guidance. We propose a target noise prediction that explicitly models semantic variations within paired images through a guidance direction term. Moreover, we introduce a content-preserving noise schedule to facilitate more effective semantic learning. We also propose optimizing distinct LoRAs to disentangle the learning of semantic variations from content. Extensive qualitative and quantitative evaluations demonstrate that PairEdit successfully learns intricate semantics while significantly improving content consistency compared to baseline methods. Code will be available at this https URL.
[CV-15] Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
【Quick Read】: This paper aims to solve the difficulty that multimodal diffusion transformer (MM-DiT) models have in achieving precise alignment between text prompts and generated content in text-driven visual generation. The study identifies two key problems in MM-DiT's attention mechanism: token imbalance between the visual and textual modalities suppresses cross-modal attention, and the lack of timestep-aware attention weighting further hinders alignment. The key to the solution is Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment, significantly improving text-image alignment.
Link: https://arxiv.org/abs/2506.07986
Authors: Zhengyao Lv,Tianlin Pan,Chenyang Si,Zhaoxi Chen,Wangmeng Zuo,Ziwei Liu,Kwan-Yee K. Wong
Institutions: The University of Hong Kong; Nanjing University; University of Chinese Academy of Sciences; Nanyang Technological University; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at this https URL.
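To make the temperature trick concrete, here is a minimal PyTorch sketch of timestep-dependent rescaling of image-to-text attention logits. It is our illustration, not the authors' code: the linear schedule, the `gamma` value, and the assumption that text tokens occupy the last positions of the joint sequence are all ours.

```python
import torch.nn.functional as F

def taca_attention(q, k, v, num_text_tokens, t, t_max=1000, gamma=2.0):
    """Hypothetical TACA-style attention: boost image->text logits early in sampling.

    q, k, v: (batch, heads, tokens, dim) over the joint visual+text sequence,
    with text tokens assumed to sit at the end. Multiplying the text-key logits
    by scale > 1 is equivalent to dividing them by a temperature below 1.
    """
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scale = 1.0 + (gamma - 1.0) * (t / t_max)   # strongest at high noise levels
    logits[..., -num_text_tokens:] = logits[..., -num_text_tokens:] * scale
    return F.softmax(logits, dim=-1) @ v
```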
[CV-16] Rethinking Crowd-Sourced Evaluation of Neuron Explanations
【Quick Read】: This paper addresses the problem of reliably evaluating neuron explanations, i.e., how to measure the quality of explanations produced by different algorithms effectively and accurately. The key to its solution is a cost-effective yet highly accurate crowdsourced evaluation strategy: importance sampling selects the most informative inputs to show to raters, cutting costs by roughly 30x, and a Bayesian method for aggregating noisy crowd ratings further reduces the number of ratings required for the same accuracy by about 5x.
Link: https://arxiv.org/abs/2506.07985
Authors: Tuomas Oikarinen,Ge Yan,Akshay Kulkarni,Tsui-Wei Weng
Institutions: UC San Diego, CSE; UC San Diego, HDSI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Interpreting individual neurons or directions in activations space is an important component of mechanistic interpretability. As such, many algorithms have been proposed to automatically produce neuron explanations, but it is often not clear how reliable these explanations are, or which methods produce the best explanations. This can be measured via crowd-sourced evaluations, but they can often be noisy and expensive, leading to unreliable results. In this paper, we carefully analyze the evaluation pipeline and develop a cost-effective and highly accurate crowdsourced evaluation strategy. In contrast to previous human studies that only rate whether the explanation matches the most highly activating inputs, we estimate whether the explanation describes neuron activations across all inputs. To estimate this effectively, we introduce a novel application of importance sampling to determine which inputs are the most valuable to show to raters, leading to around 30x cost reduction compared to uniform sampling. We also analyze the label noise present in crowd-sourced evaluations and propose a Bayesian method to aggregate multiple ratings leading to a further ~5x reduction in number of ratings required for the same accuracy. Finally, we use these methods to conduct a large-scale study comparing the quality of neuron explanations produced by the most popular methods for two different vision models.
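The importance-sampling idea in this abstract can be illustrated with a short self-normalized estimator. This is only a sketch under our own assumptions: sampling proportional to |activation| is a stand-in proposal (the paper derives its own), and `rate_fn` abstracts a crowd rater.

```python
import numpy as np

def importance_sampled_match_score(activations, rate_fn, n_queries=100, rng=None):
    """Self-normalized importance-sampling estimate of explanation quality.

    Estimates the average rater agreement over ALL inputs while querying only
    n_queries of them, sampled with probability proportional to |activation|.
    rate_fn(i) stands in for a crowd rater scoring input i in [0, 1].
    """
    if rng is None:
        rng = np.random.default_rng(0)
    a = np.abs(np.asarray(activations)) + 1e-8
    q = a / a.sum()                                  # proposal distribution
    idx = rng.choice(len(a), size=n_queries, p=q)
    ratings = np.array([rate_fn(i) for i in idx])
    w = (1.0 / len(a)) / q[idx]                      # weights for the uniform target
    return float(np.sum(w * ratings) / np.sum(w))
```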
[CV-17] CXR-LT 2024: A MICCAI challenge on long-tailed multi-label and zero-shot disease classification from chest X-ray
【Quick Read】: This paper aims to address the open long-tailed distribution in lung disease classification and the limited measurability and generalization of existing techniques in realistic clinical settings. Key elements of the solution include building a large-scale chest X-ray (CXR) dataset with 45 disease labels (including 19 newly introduced rare findings) and adding a zero-shot learning track to address limitations identified in the previous event, alongside multimodal models for rare disease detection, advanced generative approaches for handling noisy labels, and zero-shot strategies for generalizing effectively to unseen diseases.
Link: https://arxiv.org/abs/2506.07984
Authors: Mingquan Lin,Gregory Holste,Song Wang,Yiliang Zhou,Yishu Wei,Imon Banerjee,Pengyi Chen,Tianjie Dai,Yuexi Du,Nicha C. Dvornek,Yuyan Ge,Zuowei Guo,Shouhei Hanaoka,Dongkyun Kim,Pablo Messina,Yang Lu,Denis Parra,Donghyun Son,Álvaro Soto,Aisha Urooj,René Vidal,Yosuke Yamagishi,Zefan Yang,Ruichi Zhang,Yang Zhou,Leo Anthony Celi,Ronald M. Summers,Zhiyong Lu,Hao Chen,Adam Flanders,George Shih,Zhangyang Wang,Yifan Peng
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 17 pages, 3 figures
Abstract:The CXR-LT series is a community-driven initiative designed to enhance lung disease classification using chest X-rays (CXR). It tackles challenges in open long-tailed lung disease classification and enhances the measurability of state-of-the-art techniques. The first event, CXR-LT 2023, aimed to achieve these goals by providing high-quality benchmark CXR data for model development and conducting comprehensive evaluations to identify ongoing issues impacting lung disease classification performance. Building on the success of CXR-LT 2023, the CXR-LT 2024 expands the dataset to 377,110 chest X-rays (CXRs) and 45 disease labels, including 19 new rare disease findings. It also introduces a new focus on zero-shot learning to address limitations identified in the previous event. Specifically, CXR-LT 2024 features three tasks: (i) long-tailed classification on a large, noisy test set, (ii) long-tailed classification on a manually annotated “gold standard” subset, and (iii) zero-shot generalization to five previously unseen disease findings. This paper provides an overview of CXR-LT 2024, detailing the data curation process and consolidating state-of-the-art solutions, including the use of multimodal models for rare disease detection, advanced generative approaches to handle noisy labels, and zero-shot learning strategies for unseen diseases. Additionally, the expanded dataset enhances disease coverage to better represent real-world clinical settings, offering a valuable resource for future research. By synthesizing the insights and innovations of participating teams, we aim to advance the development of clinically realistic and generalizable diagnostic models for chest radiography.
[CV-18] Real-time Localization of a Soccer Ball from a Single Camera
【Quick Read】: This paper addresses real-time reconstruction of 3D football trajectories from a single broadcast camera, where traditional approaches are either inefficient or insufficiently accurate under occlusion, motion blur, and complex backgrounds. The key to the solution is a multi-mode state model with W discrete modes that significantly accelerates optimization while retaining centimeter-level accuracy, allowing the system to run on standard CPUs with latency low enough for live broadcast scenarios.
Link: https://arxiv.org/abs/2506.07981
Authors: Dmitrii Vorobev,Artem Prosvetov,Karim Elhadji Daou
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures
Abstract:We propose a computationally efficient method for real-time three-dimensional football trajectory reconstruction from a single broadcast camera. In contrast to previous work, our approach introduces a multi-mode state model with W discrete modes to significantly accelerate optimization while preserving centimeter-level accuracy – even in cases of severe occlusion, motion blur, and complex backgrounds. The system operates on standard CPUs and achieves low latency suitable for live broadcast settings. Extensive evaluation on a proprietary dataset of 6K-resolution Russian Premier League matches demonstrates performance comparable to multi-camera systems, without the need for specialized or costly infrastructure. This work provides a practical method for accessible and accurate 3D ball tracking in professional football environments.
[CV-19] OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
【Quick Read】: This paper addresses the shortcomings of existing text-to-image (T2I) evaluation systems in comprehensiveness, granularity, and support for frontier capabilities such as reasoning. The key to its solution is OneIG-Bench, a meticulously designed, comprehensive benchmark framework for fine-grained, multi-dimensional evaluation of T2I models, covering prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity; its structured evaluation lets researchers analyze model performance in depth and pinpoint strengths and bottlenecks across the full image generation pipeline.
Link: https://arxiv.org/abs/2506.07977
Authors: Jingjing Chang,Yixiao Fang,Peng Xing,Shuhan Wu,Wei Cheng,Rui Wang,Xianfang Zeng,Gang Yu,Hai-Bao Chen
Institutions: SJTU; StepFun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Text-to-image (T2I) models have garnered significant attention for generating high-quality images aligned with text prompts. However, rapid T2I model advancements reveal limitations in early benchmarks, lacking comprehensive evaluations, for example, the evaluation on reasoning, text rendering and style. Notably, recent state-of-the-art models, with their rich knowledge modeling capabilities, show promising results on the image generation problems requiring strong reasoning ability, yet existing evaluation systems have not adequately addressed this frontier. To systematically address these gaps, we introduce OneIG-Bench, a meticulously designed comprehensive benchmark framework for fine-grained evaluation of T2I models across multiple dimensions, including prompt-image alignment, text rendering precision, reasoning-generated content, stylization, and diversity. By structuring the evaluation, this benchmark enables in-depth analysis of model performance, helping researchers and practitioners pinpoint strengths and bottlenecks in the full pipeline of image generation. Specifically, OneIG-Bench enables flexible evaluation by allowing users to focus on a particular evaluation subset. Instead of generating images for the entire set of prompts, users can generate images only for the prompts associated with the selected dimension and complete the corresponding evaluation accordingly. Our codebase and dataset are now publicly available to facilitate reproducible evaluation studies and cross-model comparisons within the T2I research community.
[CV-20] CyberV: Cybernetics for Test-time Scaling in Video Understanding
【Quick Read】: This paper targets the high computational demands, limited robustness, and bounded accuracy of current multimodal large language models (MLLMs) on long or complex videos, limitations that stem mainly from their feed-forward processing. The key to the solution is CyberV, a framework inspired by cybernetic principles that redesigns video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation: a control loop built from an MLLM inference system, a sensor, and a controller enables test-time adaptive scaling that improves performance without retraining or additional components.
Link: https://arxiv.org/abs/2506.07971
Authors: Jiahao Meng,Shuyang Sun,Yue Tan,Lu Qi,Yunhai Tong,Xiangtai Li,Longyin Wen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these limitations, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors forward processes of the MLLM and collects intermediate interpretations, such as attention drift, then the controller determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance even comparable to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at this https URL.
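The sensor-controller loop described above can be pictured with a toy Python skeleton. Everything here is hypothetical scaffolding: `mllm.generate`, the `attention_drift` signal, and the feedback format are our placeholders for the paper's actual components.

```python
def cybernetic_answer(mllm, video, question, max_rounds=3, drift_threshold=0.5):
    """Toy skeleton of a CyberV-style test-time control loop.

    The sensor reads intermediate signals from the frozen model's forward
    pass; the controller decides whether to trigger a self-correction round
    with guidance, without any retraining.
    """
    feedback = None
    answer = None
    for _ in range(max_rounds):
        answer, trace = mllm.generate(video, question, feedback=feedback)
        if trace["attention_drift"] < drift_threshold:   # sensor says: confident
            break
        feedback = {"focus_frames": trace["high_drift_frames"]}  # controller acts
    return answer
```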
[CV-21] SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence
【Quick Read】: This paper addresses the inability of existing benchmarks to evaluate the spatial intelligence of multimodal large language models (MLLMs) comprehensively from the atomic level up to the compositional level. The key to its solution is SpaCE-10, a comprehensive compositional spatial evaluation benchmark that defines 10 atomic spatial capabilities combined into 8 compositional capabilities, together with a novel hierarchical annotation pipeline that produces high-quality and diverse question-answer pairs for a thorough assessment of MLLMs' spatial intelligence.
Link: https://arxiv.org/abs/2506.07966
Authors: Ziyang Gong,Wenhao Li,Oliver Ma,Songyuan Li,Jiayi Ji,Xue Yang,Gen Luo,Junchi Yan,Rongrong Ji
Institutions: Shanghai Jiao Tong University; Shanghai AI Laboratory; Xiamen University; Sun Yat-sen University; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in various multimodal tasks. To pursue higher intelligence in space, MLLMs require integrating multiple atomic spatial capabilities to handle complex and dynamic tasks. However, existing benchmarks struggle to comprehensively evaluate the spatial intelligence of common MLLMs from the atomic level to the compositional level. To fill this gap, we present SpaCE-10, a comprehensive benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. Based on these definitions, we propose a novel hierarchical annotation pipeline to generate high-quality and diverse question-answer (QA) pairs. With over 150+ hours of human expert effort, we obtain over 5k QA pairs for 811 real indoor scenes in SpaCE-10, which covers various evaluation settings like point cloud input and multi-choice QA. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins. Through our careful study, we also draw several significant findings that benefit the MLLM community. For example, we reveal that the shortcoming of counting capability greatly limits the compositional spatial capabilities of existing MLLMs. The evaluation code and benchmark datasets are available at this https URL.
[CV-22] SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design
【Quick Read】: This paper addresses the problem that manual slide creation is time-consuming and requires expertise, and that existing natural-language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. The key to the solution is the Slide2Code benchmark and the SlideCoder framework, which combines a color-gradient-based segmentation algorithm with hierarchical retrieval-augmented generation to decompose the complex task and improve code generation.
Link: https://arxiv.org/abs/2506.07964
Authors: Wenxin Tang,Jingyu Xiao,Wenxuan Jiang,Xi Xiao,Yuhang Wang,Xuxin Tang,Qing Li,Yuehe Ma,Junliang Liu,Shisong Tang,Michael R. Lyu
Institutions: Tsinghua University; The Chinese University of Hong Kong; Northeastern University; Southwest University; Kuaishou Technology; Peng Cheng Laboratory; BNU-HKBU United International College; Dalian Maritime University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Manual slide creation is labor-intensive and requires expert prior knowledge. Existing natural language-based LLM generation methods struggle to capture the visual and structural nuances of slide designs. To address this, we formalize the Reference Image to Slide Generation task and propose Slide2Code, the first benchmark with difficulty-tiered samples based on a novel Slide Complexity Metric. We introduce SlideCoder, a layout-aware, retrieval-augmented framework for generating editable slides from reference images. SlideCoder integrates a Color Gradient-based Segmentation algorithm and a Hierarchical Retrieval-Augmented Generation method to decompose complex tasks and enhance code generation. We also release SlideMaster, a 7B open-source model fine-tuned with improved reverse-engineered data. Experiments show that SlideCoder outperforms state-of-the-art baselines by up to 40.5 points, demonstrating strong performance across layout fidelity, execution accuracy, and visual consistency. Our code is available at this https URL.
[CV-23] Creating a Historical Migration Dataset from Finnish Church Records 1800-1920
【Quick Read】: This paper asks how structured data can be extracted from a large volume of handwritten archival material to support historical demographic research. The key to the solution is a deep learning pipeline comprising layout analysis, table detection, cell classification, and handwriting recognition, which automates the processing of roughly 200,000 images of handwritten migration records and yields a structured dataset of over six million entries.
Link: https://arxiv.org/abs/2506.07960
Authors: Ari Vesalainen,Jenna Kanerva,Aida Nitsch,Kiia Korsu,Ilari Larkiola,Laura Ruotsalainen,Filip Ginter
Institutions: University of Helsinki; University of Turku
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:This article presents a large-scale effort to create a structured dataset of internal migration in Finland between 1800 and 1920 using digitized church moving records. These records, maintained by Evangelical-Lutheran parishes, document the migration of individuals and families and offer a valuable source for studying historical demographic patterns. The dataset includes over six million entries extracted from approximately 200,000 images of handwritten migration records. The data extraction process was automated using a deep learning pipeline that included layout analysis, table detection, cell classification, and handwriting recognition. The complete pipeline was applied to all images, resulting in a structured dataset suitable for research. The dataset can be used to study internal migration, urbanization, family migration, and the spread of disease in preindustrial Finland. A case study from the Elimäki parish shows how local migration histories can be reconstructed. The work demonstrates how large volumes of handwritten archival material can be transformed into structured data to support historical and demographic research.
[CV-24] Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations
【Quick Read】: This paper addresses the weakness in Reasoning Segmentation (RS) that image tokenization disrupts continuous spatial relationships between objects, limiting both visual perception and vision-text reasoning. The key to the solution is introducing a Digital Twin (DT) representation as an intermediate layer between perception and reasoning: the image is converted into a structured DT representation that preserves spatial relationships and semantic properties, and a large language model (LLM) reasons explicitly over this representation to achieve more precise object segmentation.
Link: https://arxiv.org/abs/2506.07943
Authors: Yizhen Li,Dell Zhang,Xuelong Li,Yiqing Shen
Institutions: Shandong University; TeleAI, China Telecom; Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries, demanding both precise visual perception and vision-text reasoning capabilities. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning, but their tokenization of images fundamentally disrupts continuous spatial relationships between objects. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin (DT) representation as an intermediate layer to decouple perception from reasoning. Innovatively, DTwinSeger reformulates RS as a two-stage process, where the first transforms the image into a structured DT representation that preserves spatial relationships and semantic properties and then employs a Large Language Model (LLM) to perform explicit reasoning over this representation to identify target objects. We propose a supervised fine-tuning method specifically for LLM with DT representation, together with a corresponding fine-tuning dataset Seg-DT, to enhance the LLM’s reasoning capabilities with DT representations. Experiments show that our method can achieve state-of-the-art performance on two image RS benchmarks and three image referring segmentation benchmarks. It yields that DT representation functions as an effective bridge between vision and text, enabling complex multimodal reasoning tasks to be accomplished solely with an LLM.
[CV-25] Squeeze3D: Your 3D Generation Model is Secretly an Extreme Neural Compressor
【Quick Read】: This paper asks how 3D data can be compressed effectively at extremely high compression ratios while preserving visual quality. The key to the solution is leveraging the implicit priors learned by pre-trained 3D generative models and bridging the latent spaces of a pre-trained encoder and a generative model through trainable mapping networks, enabling efficient compression and decompression of 3D data.
Link: https://arxiv.org/abs/2506.07932
Authors: Rishit Dagli,Yushi Guan,Sankeerth Durvasula,Mohammadreza Mofayezi,Nandita Vijaykumar
Institutions: University of Toronto
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We propose Squeeze3D, a novel framework that leverages implicit prior knowledge learnt by existing pre-trained 3D generative models to compress 3D data at extremely high compression ratios. Our approach bridges the latent spaces between a pre-trained encoder and a pre-trained generation model through trainable mapping networks. Any 3D model represented as a mesh, point cloud, or a radiance field is first encoded by the pre-trained encoder and then transformed (i.e. compressed) into a highly compact latent code. This latent code can effectively be used as an extremely compressed representation of the mesh or point cloud. A mapping network transforms the compressed latent code into the latent space of a powerful generative model, which is then conditioned to recreate the original 3D model (i.e. decompression). Squeeze3D is trained entirely on generated synthetic data and does not require any 3D datasets. The Squeeze3D architecture can be flexibly used with existing pre-trained 3D encoders and existing generative models. It can flexibly support different formats, including meshes, point clouds, and radiance fields. Our experiments demonstrate that Squeeze3D achieves compression ratios of up to 2187x for textured meshes, 55x for point clouds, and 619x for radiance fields while maintaining visual quality comparable to many existing methods. Squeeze3D only incurs a small compression and decompression latency since it does not involve training object-specific networks to compress an object.
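A minimal sketch of the trainable mapping network idea, assuming a frozen encoder that emits a compact code and a frozen generator with a larger latent space; the dimensions and MLP depth here are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Maps a compact compressed code into a generator's latent space (dims assumed)."""
    def __init__(self, code_dim=64, gen_latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim, 256), nn.SiLU(),
            nn.Linear(256, 256), nn.SiLU(),
            nn.Linear(256, gen_latent_dim),
        )

    def forward(self, compressed_code):
        return self.net(compressed_code)

# usage (schematic): z = frozen_encoder(mesh)    # extremely compact code (compression)
#                    w = LatentMapper()(z)       # bridge the two latent spaces
#                    recon = frozen_generator(w) # recreate the 3D model (decompression)
```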
[CV-26] A Comparative Study of U-Net Architectures for Change Detection in Satellite Images
【Quick Read】: This paper addresses the underexplored application of U-Net architectures to remote sensing change detection, particularly in handling multi-temporal data and modeling long-range relationships. The key to its solution is a systematic analysis of 18 different U-Net variations, assessing their performance for remote sensing change detection and highlighting designs built explicitly for the task, such as Siamese Swin-U-Net, whose two-stream Siamese architecture and long-range dependency modeling improve detection accuracy.
Link: https://arxiv.org/abs/2506.07925
Authors: Yaxita Amin,Naimisha S Trivedi,Rashmi Bhattad
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:
Abstract:Remote sensing change detection is essential for monitoring the everchanging landscapes of the Earth. The U-Net architecture has gained popularity for its capability to capture spatial information and perform pixel-wise classification. However, their application in the Remote sensing field remains largely unexplored. Therefore, this paper fill the gap by conducting a comprehensive analysis of 34 papers. This study conducts a comparison and analysis of 18 different U-Net variations, assessing their potential for detecting changes in remote sensing. We evaluate both benefits along with drawbacks of each variation within the framework of this particular application. We emphasize variations that are explicitly built for change detection, such as Siamese Swin-U-Net, which utilizes a Siamese architecture. The analysis highlights the significance of aspects such as managing data from different time periods and collecting relationships over a long distance to enhance the precision of change detection. This study provides valuable insights for researchers and practitioners that choose U-Net versions for remote sensing change detection tasks.
[CV-27] Speedy Deformable 3D Gaussian Splatting: Fast Rendering and Compression of Dynamic Scenes
【Quick Read】: This paper targets the slow rendering and high memory and compute demands of dynamic 3D Gaussian Splatting (3DGS), which stem from per-Gaussian neural inference at every frame. The key to the solution is reducing neural inference via two complementary techniques: a temporal sensitivity pruning score that removes Gaussians contributing little to the dynamic scene reconstruction, and GroupFlow, a motion analysis technique that clusters Gaussians by trajectory similarity and predicts a single rigid transformation per group instead of a separate deformation per Gaussian, substantially accelerating rendering and shrinking model size.
Link: https://arxiv.org/abs/2506.07917
Authors: Allen Tu,Haiyang Ying,Alex Hanson,Yonghan Lee,Tom Goldstein,Matthias Zwicker
Institutions: University of Maryland, College Park
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL
Abstract:Recent extensions of 3D Gaussian Splatting (3DGS) to dynamic scenes achieve high-quality novel view synthesis by using neural networks to predict the time-varying deformation of each Gaussian. However, performing per-Gaussian neural inference at every frame poses a significant bottleneck, limiting rendering speed and increasing memory and compute requirements. In this paper, we present Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), a general pipeline for accelerating the rendering speed of dynamic 3DGS and 4DGS representations by reducing neural inference through two complementary techniques. First, we propose a temporal sensitivity pruning score that identifies and removes Gaussians with low contribution to the dynamic scene reconstruction. We also introduce an annealing smooth pruning mechanism that improves pruning robustness in real-world scenes with imprecise camera poses. Second, we propose GroupFlow, a motion analysis technique that clusters Gaussians by trajectory similarity and predicts a single rigid transformation per group instead of separate deformations for each Gaussian. Together, our techniques accelerate rendering by 10.37x, reduce model size by 7.71x, and shorten training time by 2.71x on the NeRF-DS dataset. SpeeDe3DGS also improves rendering speed by 4.20x and 58.23x on the D-NeRF and HyperNeRF vrig datasets. Our methods are modular and can be integrated into any deformable 3DGS or 4DGS framework.
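The GroupFlow idea of one rigid transform per trajectory cluster can be sketched with k-means plus the Kabsch algorithm. This is our reading under simplifying assumptions (two frames, clustering on displacement vectors rather than full trajectories):

```python
import numpy as np
from sklearn.cluster import KMeans

def rigid_transform(src, dst):
    """Kabsch: best-fit rotation R and translation t with dst ≈ src @ R.T + t."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_dst - R @ c_src

def groupflow_like(pos_t0, pos_t1, n_groups=8):
    """Cluster Gaussians by how they move, then fit one rigid transform per cluster."""
    labels = KMeans(n_clusters=n_groups, n_init=10).fit_predict(pos_t1 - pos_t0)
    return labels, {g: rigid_transform(pos_t0[labels == g], pos_t1[labels == g])
                    for g in range(n_groups)}
```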
[CV-28] WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
【Quick Read】: This paper asks how general-purpose vision-language reasoning can be achieved through reinforcement learning (RL). Although prior work has adapted DeepSeek-R1-style RL training paradigms to multimodal large language models (MLLMs), it focuses on domain-specific tasks and lacks a systematic path to general visual-language reasoning. The paper makes three key contributions: a scalable multimodal QA synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer pairs; the WeThink dataset of over 120K multimodal QA pairs spanning many sources and question domains; and a hybrid reward mechanism combining rule-based verification with model-based assessment to improve RL training efficiency across task domains.
Link: https://arxiv.org/abs/2506.07905
Authors: Jie Yang,Feipeng Ma,Zitian Wang,Dacheng Yin,Kang Rong,Fengyun Rao,Ruimao Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Building on the success of text-based reasoning models like DeepSeek-R1, extending these capabilities to multimodal reasoning holds great promise. While recent works have attempted to adapt DeepSeek-R1-style reinforcement learning (RL) training paradigms to multimodal large language models (MLLM), focusing on domain-specific tasks like math and visual perception, a critical question remains: How can we achieve the general-purpose visual-language reasoning through RL? To address this challenge, we make three key efforts: (1) A novel Scalable Multimodal QA Synthesis pipeline that autonomously generates context-aware, reasoning-centric question-answer (QA) pairs directly from the given images. (2) The open-source WeThink dataset containing over 120K multimodal QA pairs with annotated reasoning paths, curated from 18 diverse dataset sources and covering various question domains. (3) A comprehensive exploration of RL on our dataset, incorporating a hybrid reward mechanism that combines rule-based verification with model-based assessment to optimize RL training efficiency across various task domains. Across 14 diverse MLLM benchmarks, we demonstrate that our WeThink dataset significantly enhances performance, from mathematical reasoning to diverse general multimodal tasks. Moreover, we show that our automated data pipeline can continuously increase data diversity to further improve model performance.
[CV-29] Diffuse Everything: Multimodal Diffusion Models on Arbitrary State Spaces ICML2025
【Quick Read】: This paper targets the reliance of joint multimodal generation on external preprocessing protocols, which typically use tokenizers and variational autoencoders to force diverse data representations into a unified unimodal format, imposing strict accuracy demands on encoders and decoders that are especially hard to meet when data are limited. The key to the solution is a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across modalities; a decoupled noise schedule for each modality lets a single model support both unconditional and modality-conditioned generation.
Link: https://arxiv.org/abs/2506.07903
Authors: Kevin Rojas,Yuchen Zhu,Sichen Zhu,Felix X.-F. Ye,Molei Tao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICML 2025. Code available at this https URL
Abstract:Diffusion models have demonstrated remarkable performance in generating unimodal data across various tasks, including image, video, and text generation. On the contrary, the joint generation of multimodal data through diffusion models is still in the early stages of exploration. Existing approaches heavily rely on external preprocessing protocols, such as tokenizers and variational autoencoders, to harmonize varied data representations into a unified, unimodal format. This process heavily demands the high accuracy of encoders and decoders, which can be problematic for applications with limited data. To lift this restriction, we propose a novel framework for building multimodal diffusion models on arbitrary state spaces, enabling native generation of coupled data across different modalities. By introducing an innovative decoupled noise schedule for each modality, we enable both unconditional and modality-conditioned generation within a single model simultaneously. We empirically validate our approach for text-image generation and mixed-type tabular data synthesis, demonstrating that it achieves competitive performance.
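The decoupled noise schedule can be illustrated for an image-text pair. The linear schedules and the uniform-mixing corruption for the discrete modality are our assumptions; the point is only that each modality carries its own timestep, so (t_img > 0, t_txt = 0) recovers text-conditioned image generation within one model.

```python
import torch

def decoupled_noising(x_img, txt_probs, t_img, t_txt, T=1000):
    """Per-modality forward noising with independent timesteps (schedules assumed).

    x_img: (B, D) continuous image latents; txt_probs: (B, L, V) token
    distributions for the discrete text modality.
    """
    a_img = (1.0 - t_img.float() / T).clamp(min=0.0)              # image signal level
    noisy_img = a_img.sqrt()[:, None] * x_img \
              + (1.0 - a_img).sqrt()[:, None] * torch.randn_like(x_img)
    a_txt = (1.0 - t_txt.float() / T).clamp(min=0.0)[:, None, None]
    uniform = torch.full_like(txt_probs, 1.0 / txt_probs.size(-1))
    noisy_txt = a_txt * txt_probs + (1.0 - a_txt) * uniform       # mix toward uniform noise
    return noisy_img, noisy_txt
```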
[CV-30] GaussianVAE: Adaptive Learning Dynamics of 3D Gaussians for High-Fidelity Super-Resolution
【Quick Read】: This paper addresses the limitation that 3D Gaussian Splatting (3DGS) reconstructions are bounded by input resolution and cannot recover detail beyond what appears in the training views. The key to the solution is a lightweight generative model that predicts and refines additional 3D Gaussians, guided by a Hessian-assisted sampling strategy that intelligently identifies the regions most in need of refinement, improving geometric accuracy and rendering quality while remaining computationally efficient.
Link: https://arxiv.org/abs/2506.07897
Authors: Shuja Khalid,Mohamed Ibrahim,Yang Liu
Institutions: Huawei Canada
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We present a novel approach for enhancing the resolution and geometric fidelity of 3D Gaussian Splatting (3DGS) beyond native training resolution. Current 3DGS methods are fundamentally limited by their input resolution, producing reconstructions that cannot extrapolate finer details than are present in the training views. Our work breaks this limitation through a lightweight generative model that predicts and refines additional 3D Gaussians where needed most. The key innovation is our Hessian-assisted sampling strategy, which intelligently identifies regions that are likely to benefit from densification, ensuring computational efficiency. Unlike computationally intensive GANs or diffusion approaches, our method operates in real-time (0.015s per inference on a single consumer-grade GPU), making it practical for interactive applications. Comprehensive experiments demonstrate significant improvements in both geometric accuracy and rendering quality compared to state-of-the-art methods, establishing a new paradigm for resolution-free 3D scene enhancement.
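Hessian-assisted sampling can be approximated cheaply with a diagonal Gauss-Newton proxy. The following sketch, using squared position gradients as the curvature score, is our assumption of how such a strategy might look, not the paper's implementation.

```python
import torch

def hessian_assisted_indices(loss, positions, n_new, temperature=1.0):
    """Pick Gaussians to densify using a diagonal-curvature proxy (assumed).

    positions: (N, 3) leaf tensor with requires_grad=True used in `loss`.
    Squared gradients serve as a cheap Gauss-Newton stand-in for the Hessian
    diagonal; candidates are sampled proportionally to a softmax of this score.
    """
    (grads,) = torch.autograd.grad(loss, positions, retain_graph=True)
    score = grads.pow(2).sum(dim=-1)                  # per-Gaussian curvature proxy
    probs = torch.softmax(score / temperature, dim=0)
    return torch.multinomial(probs, n_new, replacement=True)
```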
[CV-31] Video Unlearning via Low-Rank Refusal Vector
【Quick Read】: This paper aims to mitigate the risks posed by the biases and harmful concepts that video generation models inherit from their training data, which may let users generate undesirable or even illegal content. The key to the solution is an unlearning technique designed specifically for video diffusion models: using 5 multimodal prompt pairs (each containing a "safe" and an "unsafe" example that differ only in the target concept), per-layer latent differences are averaged into a "refusal vector" that, once subtracted from the model parameters, neutralizes the unsafe concept, while a novel low-rank factorization preserves visual quality and reduces collateral unlearning of other semantics.
Link: https://arxiv.org/abs/2506.07891
Authors: Simone Facchiano,Stefano Saravalle,Matteo Migliarini,Edoardo De Matteis,Alessio Sampieri,Andrea Pilzer,Emanuele Rodolà,Indro Spinelli,Luca Franco,Fabio Galasso
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Video generative models democratize the creation of visual content through intuitive instruction following, but they also inherit the biases and harmful concepts embedded within their web-scale training data. This inheritance creates a significant risk, as users can readily generate undesirable and even illegal content. This work introduces the first unlearning technique tailored explicitly for video diffusion models to address this critical issue. Our method requires 5 multi-modal prompt pairs only. Each pair contains a “safe” and an “unsafe” example that differ only by the target concept. Averaging their per-layer latent differences produces a “refusal vector”, which, once subtracted from the model parameters, neutralizes the unsafe concept. We introduce a novel low-rank factorization approach on the covariance difference of embeddings that yields robust refusal vectors. This isolates the target concept while minimizing collateral unlearning of other semantics, thus preserving the visual quality of the generated video. Our method preserves the model’s generation quality while operating without retraining or access to the original training data. By embedding the refusal direction directly into the model’s weights, the suppression mechanism becomes inherently more robust against adversarial bypass attempts compared to surface-level input-output filters. In a thorough qualitative and quantitative evaluation, we show that we can neutralize a variety of harmful contents, including explicit nudity, graphic violence, copyrights, and trademarks. Project page: this https URL.
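A rough sketch of a low-rank refusal vector for one layer, under our own reading of the abstract: the raw direction is the mean safe/unsafe latent difference, and projecting it onto the top singular directions of the covariance difference stands in for the paper's factorization. How the vector is folded into the weights is model-specific and only hinted at in the trailing comment.

```python
import torch

def refusal_vector(safe_feats, unsafe_feats, rank=4):
    """Low-rank refusal direction for one layer (our reading, not the paper's code).

    safe_feats / unsafe_feats: (n_pairs, d) per-layer latents from prompt pairs
    differing only in the target concept. Projection onto a low-rank subspace
    isolates the concept with less collateral damage to other semantics.
    """
    v = (unsafe_feats - safe_feats).mean(dim=0)                  # raw refusal direction
    cov_diff = unsafe_feats.T @ unsafe_feats / unsafe_feats.size(0) \
             - safe_feats.T @ safe_feats / safe_feats.size(0)
    U, _, _ = torch.linalg.svd(cov_diff)
    basis = U[:, :rank]                                          # top-r subspace
    return basis @ (basis.T @ v)   # per the abstract, subtracted from the parameters
```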
[CV-32] EgoM2P: Egocentric Multimodal Multitask Pretraining
【Quick Read】: This paper addresses the challenges of processing multimodal egocentric signals (RGB video, depth, camera poses, and gaze), in particular heterogeneous data, missing modalities, and dynamic camera motion when building large-scale multitask models. The key to the solution is a set of efficient temporal tokenizers and the EgoM2P framework, which learns from temporally aware multimodal tokens to train a large general-purpose model for unified egocentric 4D understanding, supporting a range of perception and generation tasks.
Link: https://arxiv.org/abs/2506.07886
Authors: Gen Li,Yutong Chen,Yiqian Wu,Kaifeng Zhao,Marc Pollefeys,Siyu Tang
Institutions: ETH Zürich; Zhejiang University; Microsoft
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Understanding multimodal signals in egocentric vision, such as RGB video, depth, camera poses, and gaze, is essential for applications in augmented reality, robotics, and human-computer interaction. These capabilities enable systems to better interpret the camera wearer’s actions, intentions, and surrounding environment. However, building large-scale egocentric multimodal and multitask models presents unique challenges. Egocentric data are inherently heterogeneous, with large variations in modality coverage across devices and settings. Generating pseudo-labels for missing modalities, such as gaze or head-mounted camera trajectories, is often infeasible, making standard supervised learning approaches difficult to scale. Furthermore, dynamic camera motion and the complex temporal and spatial structure of first-person video pose additional challenges for the direct application of existing multimodal foundation models. To address these challenges, we introduce a set of efficient temporal tokenizers and propose EgoM2P, a masked modeling framework that learns from temporally aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding. This unified design supports multitasking across diverse egocentric perception and synthesis tasks, including gaze prediction, egocentric camera tracking, and monocular depth estimation from egocentric video. EgoM2P also serves as a generative model for conditional egocentric video synthesis. Across these tasks, EgoM2P matches or outperforms specialist models while being an order of magnitude faster. We will fully open-source EgoM2P to support the community and advance egocentric vision research. Project page: this https URL
[CV-33] CrosswalkNet: An Optimized Deep Learning Framework for Pedestrian Crosswalk Detection in Aerial Images with High-Performance Computing
【Quick Read】: This paper aims to detect pedestrian crosswalks accurately from high-resolution aerial imagery to support transportation asset management, safety analysis, and urban planning. The key to its solution is CrosswalkNet, a deep learning framework that improves on traditional object detection strategies by adopting oriented bounding boxes (OBB), raising precision for crosswalks at arbitrary orientations; optimization techniques including convolutional block attention, a dual-branch Spatial Pyramid Pooling-Fast module, and cosine annealing further boost performance and efficiency.
Link: https://arxiv.org/abs/2506.07885
Authors: Zubin Bhuyan,Yuanchang Xie,AngkeaReach Rith,Xintong Yan,Nasko Apostolov,Jimi Oke,Chengbo Ai
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:With the increasing availability of aerial and satellite imagery, deep learning presents significant potential for transportation asset management, safety analysis, and urban planning. This study introduces CrosswalkNet, a robust and efficient deep learning framework designed to detect various types of pedestrian crosswalks from 15-cm resolution aerial images. CrosswalkNet incorporates a novel detection approach that improves upon traditional object detection strategies by utilizing oriented bounding boxes (OBB), enhancing detection precision by accurately capturing crosswalks regardless of their orientation. Several optimization techniques, including Convolutional Block Attention, a dual-branch Spatial Pyramid Pooling-Fast module, and cosine annealing, are implemented to maximize performance and efficiency. A comprehensive dataset comprising over 23,000 annotated crosswalk instances is utilized to train and validate the proposed framework. The best-performing model achieves an impressive precision of 96.5% and a recall of 93.3% on aerial imagery from Massachusetts, demonstrating its accuracy and effectiveness. CrosswalkNet has also been successfully applied to datasets from New Hampshire, Virginia, and Maine without transfer learning or fine-tuning, showcasing its robustness and strong generalization capability. Additionally, the crosswalk detection results, processed using High-Performance Computing (HPC) platforms and provided in polygon shapefile format, have been shown to accelerate data processing and detection, supporting real-time analysis for safety and mobility applications. This integration offers policymakers, transportation engineers, and urban planners an effective instrument to enhance pedestrian safety and improve urban mobility.
[CV-34] Diffusion Counterfactual Generation with Semantic Abduction
【Quick Read】: This paper addresses identity preservation, perceptual quality, and faithfulness to the underlying causal model in counterfactual image generation. The key to its solution is a suite of diffusion-based causal mechanisms that, through the lens of Pearlian causality, integrate semantic representations into diffusion models so that images are edited via a counterfactual reasoning process, enabling principled trade-offs between faithful causal control and identity preservation.
Link: https://arxiv.org/abs/2506.07883
Authors: Rajat Rasal,Avinash Kori,Fabio De Sousa Ribeiro,Tian Xia,Ben Glocker
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada
Abstract:Counterfactual image generation presents significant challenges, including preserving identity, maintaining perceptual quality, and ensuring faithfulness to an underlying causal model. While existing auto-encoding frameworks admit semantic latent spaces which can be manipulated for causal control, they struggle with scalability and fidelity. Advancements in diffusion models present opportunities for improving counterfactual image editing, having demonstrated state-of-the-art visual quality, human-aligned perception and representation learning capabilities. Here, we present a suite of diffusion-based causal mechanisms, introducing the notions of spatial, semantic and dynamic abduction. We propose a general framework that integrates semantic representations into diffusion models through the lens of Pearlian causality to edit images via a counterfactual reasoning process. To our knowledge, this is the first work to consider high-level semantic identity preservation for diffusion counterfactuals and to demonstrate how semantic control enables principled trade-offs between faithful causal control and identity preservation.
[CV-35] Spatio-Temporal State Space Model For Efficient Event-Based Optical Flow
【Quick Read】: This paper targets the insufficient computational efficiency of conventional deep learning methods for low-latency motion estimation (optical flow), as well as the limited spatio-temporal modeling capacity of asynchronous event-based methods. The key to the solution is the Spatio-Temporal State Space Model (STSSM) module, which uses state space models to capture spatio-temporal correlations in event data effectively, maintaining strong performance while greatly reducing computational complexity.
Link: https://arxiv.org/abs/2506.07878
Authors: Muhammad Ahmed Humais,Xiaoqian Huang,Hussain Sajwani,Sajid Javed,Yahya Zweiri
Institutions: Khalifa University; Khalifa University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Event cameras unlock new frontiers that were previously unthinkable with standard frame-based cameras. One notable example is low-latency motion estimation (optical flow), which is critical for many real-time applications. In such applications, the computational efficiency of algorithms is paramount. Although recent deep learning paradigms such as CNN, RNN, or ViT have shown remarkable performance, they often lack the desired computational efficiency. Conversely, asynchronous event-based methods including SNNs and GNNs are computationally efficient; however, these approaches fail to capture sufficient spatio-temporal information, a powerful feature required to achieve better performance for optical flow estimation. In this work, we introduce Spatio-Temporal State Space Model (STSSM) module along with a novel network architecture to develop an extremely efficient solution with competitive performance. Our STSSM module leverages state-space models to effectively capture spatio-temporal correlations in event data, offering higher performance with lower complexity compared to ViT, CNN-based architectures in similar settings. Our model achieves 4.5x faster inference and 8x lower computations compared to TMA and 2x lower computations compared to EV-FlowNet with competitive performance on the DSEC benchmark. Our code will be available at this https URL
[CV-36] FreeGave: 3D Physics Learning from Dynamic Videos by Gaussian Velocity CVPR2025
【Quick Read】: This paper aims to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos in a data-driven way. Existing methods typically impose governing PDEs as physics-informed neural network (PINN) losses or embed physics simulation into the network, but they struggle to learn complex physical motions or require object priors such as masks or categories. The proposed FreeGave learns the physics of complex dynamic 3D scenes without any object priors; its key is a physics code followed by a carefully designed divergence-free module that estimates a per-Gaussian velocity field, avoiding inefficient PINN losses.
Link: https://arxiv.org/abs/2506.07865
Authors: Jinxi Li,Ziyang Song,Siyuan Zhou,Bo Yang
Institutions: The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: CVPR 2025. Code and data are available at: this https URL
Abstract:In this paper, we aim to model 3D scene geometry, appearance, and the underlying physics purely from multi-view videos. By applying various governing PDEs as PINN losses or incorporating physics simulation into neural networks, existing works often fail to learn complex physical motions at boundaries or require object priors such as masks or types. In this paper, we propose FreeGave to learn the physics of complex dynamic 3D scenes without needing any object priors. The key to our approach is to introduce a physics code followed by a carefully designed divergence-free module for estimating a per-Gaussian velocity field, without relying on the inefficient PINN losses. Extensive experiments on three public datasets and a newly collected challenging real-world dataset demonstrate the superior performance of our method for future frame extrapolation and motion segmentation. Most notably, our investigation into the learned physics codes reveals that they truly learn meaningful 3D physical motion patterns in the absence of any human labels in training.
[CV-37] VIVAT: Virtuous Improving VAE Training through Artifact Mitigation
【Quick Read】: This paper addresses the common artifacts that arise during variational autoencoder (VAE) training and degrade reconstruction and generation quality. The key to the solution is VIVAT, a systematic approach that mitigates five typical artifacts of KL-VAE training (color shift, grid patterns, blur, corner artifacts, and droplet artifacts) through simple modifications such as adjusting loss weights, refining padding strategies, and integrating Spatially Conditional Normalization, improving performance while preserving the simplicity of the KL-VAE framework.
Link: https://arxiv.org/abs/2506.07863
Authors: Lev Novitskiy,Viacheslav Vasilev,Maria Kovaleva,Vladimir Arkhipkin,Denis Dimitrov
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:Variational Autoencoders (VAEs) remain a cornerstone of generative computer vision, yet their training is often plagued by artifacts that degrade reconstruction and generation quality. This paper introduces VIVAT, a systematic approach to mitigating common artifacts in KL-VAE training without requiring radical architectural changes. We present a detailed taxonomy of five prevalent artifacts - color shift, grid patterns, blur, corner and droplet artifacts - and analyze their root causes. Through straightforward modifications, including adjustments to loss weights, padding strategies, and the integration of Spatially Conditional Normalization, we demonstrate significant improvements in VAE performance. Our method achieves state-of-the-art results in image reconstruction metrics (PSNR and SSIM) across multiple benchmarks and enhances text-to-image generation quality, as evidenced by superior CLIP scores. By preserving the simplicity of the KL-VAE framework while addressing its practical challenges, VIVAT offers actionable insights for researchers and practitioners aiming to optimize VAE training.
[CV-38] Egocentric Event-Based Vision for Ping Pong Ball Trajectory Prediction CVPR
【Quick Read】: This paper addresses real-time trajectory prediction in table tennis from an egocentric perspective, where standard cameras degrade at high ball speeds because of latency and motion blur. The key to the solution is the event camera: its higher temporal resolution delivers more frequent state updates within a short window after the opponent's impact, enabling more robust and precise trajectory prediction. The system also adopts foveated vision, using eye-gaze data from the Meta Project Aria glasses to process only the events within the viewer's fovea, which substantially lowers computational latency and improves ball detection.
Link: https://arxiv.org/abs/2506.07860
Authors: Ivan Alberico,Marco Cannici,Giovanni Cioffi,Davide Scaramuzza
Institutions: University of Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville (TN), USA, 2025; 5th International Workshop on Event-Based Vision
Abstract:In this paper, we present a real-time egocentric trajectory prediction system for table tennis using event cameras. Unlike standard cameras, which suffer from high latency and motion blur at fast ball speeds, event cameras provide higher temporal resolution, allowing more frequent state updates, greater robustness to outliers, and accurate trajectory predictions using just a short time window after the opponent’s impact. We collect a dataset of ping-pong game sequences, including 3D ground-truth trajectories of the ball, synchronized with sensor data from the Meta Project Aria glasses and event streams. Our system leverages foveated vision, using eye-gaze data from the glasses to process only events in the viewer’s fovea. This biologically inspired approach improves ball detection performance and significantly reduces computational latency, as it efficiently allocates resources to the most perceptually relevant regions, achieving a reduction factor of 10.81 on the collected trajectories. Our detection pipeline has a worst-case total latency of 4.5 ms, including computation and perception - significantly lower than a frame-based 30 FPS system, which, in the worst case, takes 66 ms solely for perception. Finally, we fit a trajectory prediction model to the estimated states of the ball, enabling 3D trajectory forecasting in the future. To the best of our knowledge, this is the first approach to predict table tennis trajectories from an egocentric perspective using event cameras.
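The foveated filtering step is simple to sketch: keep only events inside a disc around the current gaze point. The field names and fovea radius below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def foveate_events(events, gaze_xy, radius_px=80):
    """Keep only events inside a circular fovea around the gaze point.

    events: structured array with 'x' and 'y' pixel fields (names assumed).
    """
    dx = events["x"].astype(np.float32) - gaze_xy[0]
    dy = events["y"].astype(np.float32) - gaze_xy[1]
    return events[dx * dx + dy * dy <= radius_px * radius_px]
```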
[CV-39] LogoSP: Local-global Grouping of Superpoints for Unsupervised Semantic Segmentation of 3D Point Clouds CVPR2025
【Quick Read】: This paper tackles unsupervised 3D semantic segmentation of raw point clouds without human annotations. Existing methods usually learn per-point local features followed by a simple grouping strategy and lack the ability to discover additional, potentially richer semantic priors. The key to the solution is grouping superpoints according to their global patterns in the frequency domain, learning 3D semantics from both local and global point features and generating highly accurate semantic pseudo-labels to train a segmentation network.
Link: https://arxiv.org/abs/2506.07857
Authors: Zihui Zhang,Weisheng Dai,Hongtao Wen,Bo Yang
Institutions: Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: CVPR 2025. Code and data are available at: this https URL
Abstract:We study the problem of unsupervised 3D semantic segmentation on raw point clouds without needing human labels in training. Existing methods usually formulate this problem into learning per-point local features followed by a simple grouping strategy, lacking the ability to discover additional and possibly richer semantic priors beyond local features. In this paper, we introduce LogoSP to learn 3D semantics from both local and global point features. The key to our approach is to discover 3D semantic information by grouping superpoints according to their global patterns in the frequency domain, thus generating highly accurate semantic pseudo-labels for training a segmentation network. Extensive experiments on two indoor and an outdoor datasets show that our LogoSP surpasses all existing unsupervised methods by large margins, achieving the state-of-the-art performance for unsupervised 3D semantic segmentation. Notably, our investigation into the learned global patterns reveals that they truly represent meaningful 3D semantics in the absence of human labels during training.
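One plausible, but assumed, reading of frequency-domain superpoint grouping: take the DFT magnitude spectra of per-superpoint feature vectors as global-pattern signatures and cluster them into pseudo-classes.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_superpoints_by_spectrum(sp_feats, n_groups=20):
    """Cluster superpoints by the magnitude spectra of their features.

    sp_feats: (n_superpoints, d) averaged per-superpoint features. DFT
    magnitudes give a shift-invariant 'global pattern' signature; this is our
    illustration of the idea, not the authors' exact procedure.
    """
    spectra = np.abs(np.fft.rfft(sp_feats, axis=1))
    spectra /= np.linalg.norm(spectra, axis=1, keepdims=True) + 1e-8
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(spectra)
```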
[CV-40] SAM2Auto: Auto Annotation Using FLASH
【Quick Read】: This paper addresses the problem that vision-language models (VLMs) lag behind large language models because annotated datasets are scarce, since paired visual-textual annotation for video is labor-intensive and expensive. The key to its solution is SAM2Auto, a fully automated video annotation pipeline requiring no human intervention or dataset-specific training. Its core components are SMART-OD, a robust object detection system combining automatic mask generation with open-world detection, and FLASH, a multi-object real-time video instance segmentation system; together they provide consistent object identification and segmentation across video frames, reducing detection errors and improving annotation efficiency.
Link: https://arxiv.org/abs/2506.07850
Authors: Arash Rocky,Q.M. Jonathan Wu
Institutions: University of Windsor
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets, as creating paired visual-textual annotations is labor-intensive and expensive. To address this bottleneck, we introduce SAM2Auto, the first fully automated annotation pipeline for video datasets requiring no human intervention or dataset-specific training. Our approach consists of two key components: SMART-OD, a robust object detection system that combines automatic mask generation with open-world object detection capabilities, and FLASH (Frame-Level Annotation and Segmentation Handler), a multi-object real-time video instance segmentation (VIS) that maintains consistent object identification across video frames even with intermittent detection gaps. Unlike existing open-world detection methods that require frame-specific hyperparameter tuning and suffer from numerous false positives, our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences. Extensive experimental validation demonstrates that SAM2Auto achieves comparable accuracy to manual annotation while dramatically reducing annotation time and eliminating labor costs. The system successfully handles diverse datasets without requiring retraining or extensive parameter adjustments, making it a practical solution for large-scale dataset creation. Our work establishes a new baseline for automated video annotation and provides a pathway for accelerating VLM development by addressing the fundamental dataset bottleneck that has constrained progress in vision-language understanding.
[CV-41] PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement
【Quick Read】: This paper addresses the lack of fine-grained controllability in video generation for multi-subject customization, especially in keeping subject identities consistent and interactions coherent. The key to the solution is PolyVivid, a multi-subject video customization framework that designs a VLLM-based text-image fusion module for accurate correspondence between subject images and textual entities; introduces a 3D-RoPE-based enhancement module for structured bidirectional fusion of text and image embeddings; develops an attention-inherited identity injection module that injects fused identity features into video generation to mitigate identity drift; and builds an MLLM-based data pipeline combining MLLM-guided grounding, segmentation, and a clique-based subject consolidation strategy to improve multi-subject data quality, sharpen subject distinction, and reduce ambiguity in downstream video generation.
Link: https://arxiv.org/abs/2506.07848
Authors: Teng Hu,Zhentao Yu,Zhengguang Zhou,Jiangning Zhang,Yuan Zhou,Qinglin Lu,Ran Yi
Institutions: Shanghai Jiao Tong University; Tencent Hunyuan; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.
[CV-42] F2Net: A Frequency-Fused Network for Ultra-High Resolution Remote Sensing Segmentation
【Quick Read】: This paper targets the computational and optimization challenges of semantic segmentation for ultra-high-resolution (UHR) remote sensing imagery, where traditional methods either lose detail through downsampling or fragment global context via patch processing. The key to the solution is F2Net, a frequency-aware framework that decomposes UHR images into high- and low-frequency components for specialized processing: the high-frequency branch preserves full-resolution structural detail, the low-frequency branch captures short- and long-range dependencies through dual sub-branches, and a hybrid-frequency fusion module together with two novel loss functions ensures semantic consistency and training stability.
Link: https://arxiv.org/abs/2506.07847
Authors: Hengzhi Chen,Liqian Feng,Wenhua Wu,Xiaogang Zhu,Shawn Leo,Kun Hu
Institutions: The University of Sydney; The University of Adelaide; Tsinghua University; Edith Cowan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Semantic segmentation of ultra-high-resolution (UHR) remote sensing imagery is critical for applications like environmental monitoring and urban planning but faces computational and optimization challenges. Conventional methods either lose fine details through downsampling or fragment global context via patch processing. While multi-branch networks address this trade-off, they suffer from computational inefficiency and conflicting gradient dynamics during training. We propose F2Net, a frequency-aware framework that decomposes UHR images into high- and low-frequency components for specialized processing. The high-frequency branch preserves full-resolution structural details, while the low-frequency branch processes downsampled inputs through dual sub-branches capturing short- and long-range dependencies. A Hybrid-Frequency Fusion module integrates these observations, guided by two novel objectives: Cross-Frequency Alignment Loss ensures semantic consistency between frequency components, and Cross-Frequency Balance Loss regulates gradient magnitudes across branches to stabilize training. Evaluated on DeepGlobe and Inria Aerial benchmarks, F2Net achieves state-of-the-art performance with mIoU of 80.22 and 83.39, respectively. Our code will be publicly available.
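The frequency decomposition that F2Net builds on can be sketched with a Gaussian low-pass split; the kernel size and sigma below are illustrative, and the paper's actual decomposition may differ.

```python
import torch
import torch.nn.functional as F

def split_frequencies(img, kernel_size=31, sigma=8.0):
    """Gaussian low-pass split of a (B, C, H, W) image into low/high components."""
    half = kernel_size // 2
    x = torch.arange(kernel_size, device=img.device, dtype=img.dtype) - half
    g = torch.exp(-(x ** 2) / (2 * sigma ** 2))
    k2d = torch.outer(g, g)
    kernel = (k2d / k2d.sum()).repeat(img.size(1), 1, 1, 1)   # one filter per channel
    low = F.conv2d(F.pad(img, (half,) * 4, mode="reflect"),
                   kernel, groups=img.size(1))
    return low, img - low   # low-frequency content, high-frequency residual
```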
[CV-43] Diffusion models under low-noise regime
【Quick Read】: This paper investigates how diffusion models behave under low-noise diffusion dynamics, that is, their reliability and interpretability as effective denoisers when corruption is small. It finds that at low noise levels, models trained on disjoint datasets diverge near the data manifold even when their high-noise outputs converge. The key of the analysis is to quantify systematically how training set size, data geometry, and model objective shape denoising trajectories and score accuracy, shedding light on how these models learn representations of the data distribution.
Link: https://arxiv.org/abs/2506.07841
Authors: Elizabeth Pavlova,Xue-Xin Wei
Institutions: The University of Texas at Austin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent work on diffusion models proposed that they operate in two regimes: memorization, in which models reproduce their training data, and generalization, in which they generate novel samples. While this has been tested in high-noise settings, the behavior of diffusion models as effective denoisers when the corruption level is small remains unclear. To address this gap, we systematically investigated the behavior of diffusion models under low-noise diffusion dynamics, with implications for model robustness and interpretability. Using (i) CelebA subsets of varying sample sizes and (ii) analytic Gaussian mixture benchmarks, we reveal that models trained on disjoint data diverge near the data manifold even when their high-noise outputs converge. We quantify how training set size, data geometry, and model objective choice shape denoising trajectories and affect score accuracy, providing insights into how these models actually learn representations of data distributions. This work starts to address gaps in our understanding of generative model reliability in practical applications where small perturbations are common.
[CV-44] R3D2: Realistic 3D Asset Insertion via Diffusion for Autonomous Driving Simulation
【速读】:该论文旨在解决自动驾驶(Autonomous Driving, AD)系统验证中面临的挑战,即传统仿真平台在可扩展性和与真实世界数据的领域差距方面的不足,以及神经重建方法在动态物体操作和重用性上的局限性。其解决方案的关键在于提出一种轻量级、单步扩散模型R3D2,该模型能够实时生成合理的渲染效果(如阴影和一致的光照),从而实现完整3D资产在现有场景中的真实插入,显著提升插入资产的现实感,并支持诸如文本到3D资产插入和跨场景/数据集物体迁移等应用。
链接: https://arxiv.org/abs/2506.07826
作者: William Ljungbergh,Bernardo Taveira,Wenzhao Zheng,Adam Tonderski,Chensheng Peng,Fredrik Kahl,Christoffer Petersson,Michael Felsberg,Kurt Keutzer,Masayoshi Tomizuka,Wei Zhan
机构: Zenseact; Linköping University (林雪平大学); Chalmers University (查尔姆斯理工大学); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:
Abstract:Validating autonomous driving (AD) systems requires diverse and safety-critical testing, making photorealistic virtual environments essential. Traditional simulation platforms, while controllable, are resource-intensive to scale and often suffer from a domain gap with real-world data. In contrast, neural reconstruction methods like 3D Gaussian Splatting (3DGS) offer a scalable solution for creating photorealistic digital twins of real-world driving scenes. However, they struggle with dynamic object manipulation and reusability as their per-scene optimization-based methodology tends to result in incomplete object models with integrated illumination effects. This paper introduces R3D2, a lightweight, one-step diffusion model designed to overcome these limitations and enable realistic insertion of complete 3D assets into existing scenes by generating plausible rendering effects, such as shadows and consistent lighting, in real time. This is achieved by training R3D2 on a novel dataset: 3DGS object assets are generated from in-the-wild AD data using an image-conditioned 3D generative model, and then synthetically placed into neural rendering-based virtual environments, allowing R3D2 to learn realistic integration. Quantitative and qualitative evaluations demonstrate that R3D2 significantly enhances the realism of inserted assets, enabling use-cases like text-to-3D asset insertion and cross-scene/dataset object transfer, allowing for true scalability in AD validation. To promote further research in scalable and realistic AD simulation, we will release our dataset and code, see this https URL.
zh
[CV-45] M2Restore: Mixture-of-Experts-based Mamba-CNN Fusion Framework for All-in-One Image Restoration
【速读】:该论文旨在解决自然图像在复杂、复合退化条件(如雨、雪和雾霾)下导致的下游视觉应用性能下降问题,以及现有图像恢复方法在跨动态变化的退化场景中泛化能力有限和局部细节保留与全局依赖建模之间平衡不佳的两个关键挑战。其解决方案的关键在于提出M2Restore框架,该框架基于Mixture-of-Experts (MoE) 的Mamba-CNN融合结构,通过引入CLIP引导的MoE门控机制提升模型在多样化退化条件下的泛化能力,设计双流架构以协同捕捉全局上下文依赖与局部细粒度信息,并采用边缘感知的动态门控机制自适应地平衡全局建模与局部增强,从而实现高效且鲁棒的一体化(All-in-One)图像恢复。
链接: https://arxiv.org/abs/2506.07814
作者: Yongzhen Wang,Yongjun Li,Zhuoran Zheng,Xiao-Ping Zhang,Mingqiang Wei
机构: Anhui University of Technology (安徽工业大学); Sun Yat-sen University (中山大学); Tsinghua University (清华大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Taiyuan University of Technology (太原理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures, 3 tables
Abstract:Natural images are often degraded by complex, composite degradations such as rain, snow, and haze, which adversely impact downstream vision applications. While existing image restoration efforts have achieved notable success, they are still hindered by two critical challenges: limited generalization across dynamically varying degradation scenarios and a suboptimal balance between preserving local details and modeling global dependencies. To overcome these challenges, we propose M2Restore, a novel Mixture-of-Experts (MoE)-based Mamba-CNN fusion framework for efficient and robust all-in-one image restoration. M2Restore introduces three key contributions: First, to boost the model’s generalization across diverse degradation conditions, we exploit a CLIP-guided MoE gating mechanism that fuses task-conditioned prompts with CLIP-derived semantic priors. This mechanism is further refined via cross-modal feature calibration, which enables precise expert selection for various degradation types. Second, to jointly capture global contextual dependencies and fine-grained local details, we design a dual-stream architecture that integrates the localized representational strength of CNNs with the long-range modeling efficiency of Mamba. This integration enables collaborative optimization of global semantic relationships and local structural fidelity, preserving global coherence while enhancing detail restoration. Third, we introduce an edge-aware dynamic gating mechanism that adaptively balances global modeling and local enhancement by reallocating computational attention to degradation-sensitive regions. This targeted focus leads to more efficient and precise restoration. Extensive experiments across multiple image restoration benchmarks validate the superiority of M2Restore in both visual quality and quantitative performance.
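其中CLIP引导的MoE门控机制可以理解为:由语义先验产生专家权重,再加权融合各专家输出。下面是一个高度简化的PyTorch示意(专家用单层卷积占位,CLIP特征用随机向量代替,均为假设,非官方实现):

```python
import torch
import torch.nn as nn

class CLIPGuidedMoE(nn.Module):
    """简化的 MoE 门控示意:由语义向量产生专家权重,加权融合专家输出。
    真实模型中语义向量来自 CLIP、专家为各退化类型的子网络,此处均为占位。"""
    def __init__(self, channels=64, num_experts=4, clip_dim=512):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(clip_dim, num_experts)   # 语义先验 -> 专家权重

    def forward(self, feat, clip_feat):
        w = torch.softmax(self.gate(clip_feat), dim=-1)           # (B, E)
        outs = torch.stack([e(feat) for e in self.experts], 1)    # (B, E, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)      # 加权融合

moe = CLIPGuidedMoE()
y = moe(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
print(y.shape)   # torch.Size([2, 64, 32, 32])
```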
zh
[CV-46] Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution
【速读】:该论文旨在解决任意尺度图像超分辨率(arbitrary-scale image super-resolution)问题,即在不同缩放比例下对图像进行上采样,以实现更高的灵活性。传统方法通常采用单阶段上采样过程,难以在广泛的连续缩放因子分布中有效学习。论文提出的解决方案关键在于设计了一个自级联扩散框架CasArbi,通过将复杂的缩放需求分解为一系列较小的连续因子,并在每个步骤中逐步提升图像分辨率,同时保证任意尺度下的平滑过渡。其核心创新点在于引入了基于坐标引导的残差扩散模型,从而实现了连续图像表示的学习与高效的扩散采样。
链接: https://arxiv.org/abs/2506.07813
作者: Junseo Bang,Joonhee Lee,Kyeonghyun Lee,Haechang Lee,Dong Un Kang,Se Young Chun
机构: Seoul National University (首尔国立大学); Department of Electrical and Computer Engineering (电子与计算机工程系); Institute of New Media and Communications (新媒体与通信研究所); Interdisciplinary Program in AI (人工智能跨学科项目)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches in this domain utilize regression-based or generative models, but many of them rely on a single-stage upsampling process, which may be challenging to learn across a wide, continuous distribution of scaling factors. Progressive upsampling strategies have shown promise in mitigating this issue, yet their integration with diffusion models for flexible upscaling remains underexplored. Here, we present CasArbi, a novel self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi meets the varying scaling demands by breaking them down into smaller sequential factors and progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. Our novel coordinate-guided residual diffusion model allows for the learning of continuous image representations while enabling efficient diffusion sampling. Extensive experiments demonstrate that our CasArbi outperforms prior art in both perceptual and distortion performance metrics across diverse arbitrary-scale super-resolution benchmarks.
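CasArbi的级联思想是把任意缩放因子分解为一串较小的连续因子并逐级上采样。以下为该分解策略与级联流程的最小示意(每级用双线性插值代替论文中的坐标引导残差扩散模型,基础因子上限2.0为假设值):

```python
import torch
import torch.nn.functional as F

def decompose_scale(scale: float, base: float = 2.0):
    """把任意缩放因子分解为若干不超过 base 的连续因子(示意策略)。"""
    factors = []
    while scale > base:
        factors.append(base)
        scale /= base
    factors.append(scale)      # 剩余的小数因子
    return factors

def cascaded_upsample(img, scale):
    """逐级上采样;真实方法中每级由坐标引导的残差扩散模型完成。"""
    for f in decompose_scale(scale):
        img = F.interpolate(img, scale_factor=f, mode="bilinear", align_corners=False)
    return img

x = torch.randn(1, 3, 48, 48)
print(decompose_scale(6.3))             # [2.0, 2.0, 1.575]
print(cascaded_upsample(x, 6.3).shape)  # torch.Size([1, 3, 302, 302])
```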
zh
[CV-47] Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning
【速读】:该论文试图解决视频问答(VideoQA)中由于缺乏显式视觉证据而导致的性能下降问题,特别是在处理涉及符号意义或深层意图的问题时。解决方案的关键在于提出一种新的推理框架IRM(Implicit Reasoning Model),该框架通过双流建模方式对上下文动作和意图线索进行隐式推理链建模,包含Action-Intent Module(AIM)和Visual Enhancement Module(VEM),分别用于推断并保留与问题相关的双重线索以及增强上下文视觉表征。
链接: https://arxiv.org/abs/2506.07811
作者: Tieyuan Chen,Huabin Liu,Yi Wang,Chaofan Gan,Mingxi Lyu,Gui Zou,Weiyao Lin
机构: Shanghai Jiao Tong University (上海交通大学); Zhongguancun Academy (中关村研究院); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint
Abstract:Video Question Answering (VideoQA) aims to answer natural language questions based on the given video, with prior work primarily focusing on identifying the duration of relevant segments, referred to as explicit visual evidence. However, explicit visual evidence is not always directly available, particularly when questions target symbolic meanings or deeper intentions, leading to significant performance degradation. To fill this gap, we introduce a novel task and dataset, Implicit Video Question Answering (I-VQA), which focuses on answering questions in scenarios where explicit visual evidence is inaccessible. Given an implicit question and its corresponding video, I-VQA requires answering based on the contextual visual cues present within the video. To tackle I-VQA, we propose a novel reasoning framework, IRM (Implicit Reasoning Model), incorporating dual-stream modeling of contextual actions and intent clues as implicit reasoning chains. IRM comprises the Action-Intent Module (AIM) and the Visual Enhancement Module (VEM). AIM deduces and preserves question-related dual clues by generating clue candidates and performing relation deduction. VEM enhances contextual visual representation by leveraging key contextual clues. Extensive experiments validate the effectiveness of our IRM in I-VQA tasks, outperforming GPT-4o, OpenAI-o3, and fine-tuned VideoChat2 by 0.76%, 1.37%, and 4.87%, respectively. Additionally, IRM achieves SOTA performance on similar implicit advertisement understanding and future prediction in traffic-VQA. Datasets and codes are available for double-blind review in anonymous repo: this https URL.
zh
[CV-48] Incorporating Uncertainty-Guided and Top-k Codebook Matching for Real-World Blind Image Super-Resolution
【速读】:该论文旨在解决基于代码本的现实图像超分辨率(codebook-based real image super-resolution)中存在的一些关键问题,包括特征匹配不准确和纹理细节重建效果不佳。其解决方案的关键在于提出一种名为Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR)的框架,该框架包含三个核心组件:(1) 一种不确定性学习机制,用于引导模型关注纹理丰富的区域;(2) 一种Top-k特征匹配策略,通过融合多个候选特征来提高特征匹配的准确性;(3) 一种Align-Attention模块,用于增强低分辨率(LR)与高分辨率(HR)特征之间的信息对齐。
链接: https://arxiv.org/abs/2506.07809
作者: Weilei Wen,Tianyi Zhang,Qianqian Zhao,Zhaohui Zheng,Chunle Guo,Xiuli Shao,Chongyi Li
机构: Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in codebook-based real image super-resolution (SR) have shown promising results in real-world applications. The core idea involves matching high-quality image features from a codebook based on low-resolution (LR) image features. However, existing methods face two major challenges: inaccurate feature matching with the codebook and poor texture detail reconstruction. To address these issues, we propose a novel Uncertainty-Guided and Top-k Codebook Matching SR (UGTSR) framework, which incorporates three key components: (1) an uncertainty learning mechanism that guides the model to focus on texture-rich regions, (2) a Top-k feature matching strategy that enhances feature matching accuracy by fusing multiple candidate features, and (3) an Align-Attention module that enhances the alignment of information between LR and HR features. Experimental results demonstrate significant improvements in texture realism and reconstruction fidelity compared to existing methods. We will release the code upon formal publication.
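Top-k特征匹配策略的要点是:不取距离最近的单个码本条目,而是检索k个最近条目并按距离软加权融合。最小示意如下(k与温度参数均为假设值,非论文设定):

```python
import torch

def topk_codebook_match(feat, codebook, k=4, tau=1.0):
    """Top-k 码本匹配示意:feat (N, D),codebook (V, D)。
    对每个特征取 k 个最近条目,按负距离 softmax 加权融合。"""
    dist = torch.cdist(feat, codebook)              # (N, V) 欧氏距离
    d, idx = dist.topk(k, dim=-1, largest=False)    # k 个最近条目
    w = torch.softmax(-d / tau, dim=-1)             # 距离越小权重越大
    cand = codebook[idx]                            # (N, k, D) 候选特征
    return (w.unsqueeze(-1) * cand).sum(dim=1)      # (N, D) 融合特征

fused = topk_codebook_match(torch.randn(16, 256), torch.randn(1024, 256))
print(fused.shape)   # torch.Size([16, 256])
```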
zh
[CV-49] Identifiable Object Representations under Spatial Ambiguities
【速读】:该论文旨在解决在存在空间模糊性(如遮挡和视角模糊)的情况下,获取模块化、以物体为中心的表示这一难题,这对于实现类似人类的推理至关重要。其解决方案的关键在于提出一种新颖的多视角概率方法,通过聚合视图特定的槽位来捕捉不变内容信息,同时学习解耦的全局视角级信息,从而有效解决空间模糊性问题,并在无需视角标注的情况下提供可识别性的理论保证。
链接: https://arxiv.org/abs/2506.07806
作者: Avinash Kori,Francesca Toni,Ben Glocker
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modular object-centric representations are essential for human-like reasoning but are challenging to obtain under spatial ambiguities, e.g. due to occlusions and view ambiguities. However, addressing these challenges presents both theoretical and practical difficulties. We introduce a novel multi-view probabilistic approach that aggregates view-specific slots to capture invariant content information while simultaneously learning disentangled global viewpoint-level information. Unlike prior single-view methods, our approach resolves spatial ambiguities, provides theoretical guarantees for identifiability, and requires no viewpoint annotations. Extensive experiments on standard benchmarks and novel complex datasets validate our method's robustness and scalability.
zh
[CV-50] Image Reconstruction as a Tool for Feature Analysis
【速读】:该论文试图解决视觉编码器内部特征表示的可解释性问题,即如何理解这些模型在不同任务中学习到的特征结构。其解决方案的关键在于提出一种通过图像重建来解析视觉特征的新方法,该方法能够评估不同预训练任务对特征信息保留的影响,并揭示特征空间中的颜色编码机制。
链接: https://arxiv.org/abs/2506.07803
作者: Eduard Allakhverdov,Dmitrii Tarasov,Elizaveta Goncharova,Andrey Kuznetsov
机构: AIRI(人工智能研究机构); MIPT(莫斯科物理技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 14 figures
Abstract:Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights to reproduce the experiments are available in GitHub.
zh
[CV-51] Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger ICML2025
【速读】:该论文旨在解决大型视觉语言模型(LVLMs)在视觉问答(VQA)任务中因知识稀缺和检索知识产生的响应不稳定问题。其解决方案的关键在于提出一种多模态检索增强生成框架(RCTS),通过构建富含推理上下文的知识库以及基于树搜索的重排序方法,提升模型的推理能力和回答一致性。具体而言,引入了自洽性评估机制以丰富知识库中的内在推理模式,并采用带有启发式奖励的蒙特卡洛树搜索(MCTS-HR)来优先选择最相关示例,从而确保LVLM能够利用高质量的上下文推理生成更优且一致的回答。
链接: https://arxiv.org/abs/2506.07785
作者: Qi Yang,Chenghao Zhang,Lubin Fan,Kun Ding,Jieping Ye,Shiming Xiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025 Spotlight. 22 pages, 16 figures
Abstract:Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs. Our code is available at this https URL.
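重排序的核心是带启发式奖励的蒙特卡洛树搜索(MCTS-HR)。完整MCTS实现较长,下面用UCB1选择规则示意"探索-利用"如何在候选示例间分配评估预算;奖励函数这里用随机数占位(真实方法中由启发式奖励给出),整段代码仅为原理性假设示意:

```python
import math, random

def ucb_rerank(candidates, reward_fn, budget=200, c=1.4):
    """UCB1 示意:在评估预算内反复选择候选示例并累计奖励,
    最终按平均奖励排序;真实方法为带启发式奖励的蒙特卡洛树搜索。"""
    n = [0] * len(candidates)
    s = [0.0] * len(candidates)
    for t in range(1, budget + 1):
        ucb = [float("inf") if n[i] == 0
               else s[i] / n[i] + c * math.sqrt(math.log(t) / n[i])
               for i in range(len(candidates))]
        i = max(range(len(candidates)), key=lambda j: ucb[j])
        s[i] += reward_fn(candidates[i])
        n[i] += 1
    return sorted(range(len(candidates)), key=lambda i: -s[i] / max(n[i], 1))

# 假设的奖励:候选示例与查询的相关度打分(此处用随机数模拟)
order = ucb_rerank(["ex0", "ex1", "ex2"], lambda e: random.random())
print(order)   # 按估计平均奖励从高到低的候选下标
```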
zh
[CV-52] Design and Evaluation of Deep Learning-Based Dual-Spectrum Image Fusion Methods
【速读】:该论文旨在解决多模态图像融合中的评估标准不统一及下游任务性能不足的问题,特别是在复杂环境下的目标检测任务中,现有方法依赖通用评价指标而缺乏针对具体任务的公平比较。其解决方案的关键在于构建了一个高质量的校园环境双谱数据集,涵盖多种具有挑战性的场景,并提出一个任务感知的综合评估框架,结合融合速度、通用指标和基于lang-segment-anything模型的目标检测性能,以确保下游任务评估的公平性与有效性。
链接: https://arxiv.org/abs/2506.07779
作者: Beining Xu,Junxian Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 13 figures
Abstract:Visible images offer rich texture details, while infrared images emphasize salient targets. Fusing these complementary modalities enhances scene understanding, particularly for advanced vision tasks under challenging conditions. Recently, deep learning-based fusion methods have gained attention, but current evaluations primarily rely on general-purpose metrics without standardized benchmarks or downstream task performance. Additionally, the lack of well-developed dual-spectrum datasets and fair algorithm comparisons hinders progress. To address these gaps, we construct a high-quality dual-spectrum dataset captured in campus environments, comprising 1,369 well-aligned visible-infrared image pairs across four representative scenarios: daytime, nighttime, smoke occlusion, and underpasses. We also propose a comprehensive and fair evaluation framework that integrates fusion speed, general metrics, and object detection performance using the lang-segment-anything model to ensure fairness in downstream evaluation. Extensive experiments benchmark several state-of-the-art fusion algorithms under this framework. Results demonstrate that fusion models optimized for downstream tasks achieve superior performance in target detection, especially in low-light and occluded scenes. Notably, some algorithms that perform well on general metrics do not translate to strong downstream performance, highlighting limitations of current evaluation practices and validating the necessity of our proposed framework. The main contributions of this work are: (1) a campus-oriented dual-spectrum dataset with diverse and challenging scenes; (2) a task-aware, comprehensive evaluation framework; and (3) thorough comparative analysis of leading fusion methods across multiple datasets, offering insights for future development.
zh
[CV-53] Language-Vision Planner and Executor for Text-to-Visual Reasoning
【速读】:该论文旨在解决现有视觉语言模型(Vision-Language Models, VLMs)在多模态视觉-文本推理任务中的泛化性能不足问题。其解决方案的关键在于提出VLAgent,一个能够通过集成规划脚本与执行验证的自动化流程,实时执行分步视觉推理计划的AI系统。VLAgent通过上下文学习微调大语言模型以生成任务规划器,并利用神经符号可执行模块逐步优化推理结果,同时引入语法-语义解析器和集成方法提升规划质量和泛化能力。
链接: https://arxiv.org/abs/2506.07778
作者: Yichang Xu,Gaowen Liu,Ramana Rao Kompella,Sihao Hu,Tiansheng Huang,Fatih Ilhan,Selim Furkan Tekin,Zachary Yahn,Ling Liu
机构: Georgia Institute of Technology (佐治亚理工学院); Cisco Research (思科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The advancement in large language models (LLMs) and large vision models has fueled the rapid progress in multi-modal visual-text reasoning capabilities. However, existing vision-language models (VLMs) to date suffer from limited generalization performance. Inspired by recent development in LLMs for visual reasoning, this paper presents VLAgent, an AI system that can create a step-by-step visual reasoning plan with an easy-to-understand script and execute each step of the plan in real time by integrating planning script with execution verifications via an automated process supported by VLAgent. In the task planning phase, VLAgent fine-tunes an LLM through in-context learning to generate a step-by-step planner for each user-submitted text-visual reasoning task. During the plan execution phase, VLAgent progressively refines the composition of neuro-symbolic executable modules to generate high-confidence reasoning results. VLAgent has three unique design characteristics: First, we improve the quality of plan generation through in-context learning, improving logic reasoning by reducing erroneous logic steps, incorrect programs, and LLM hallucinations. Second, we design a syntax-semantics parser to identify and correct additional logic errors of the LLM-generated planning script prior to launching the plan executor. Finally, we employ the ensemble method to improve the generalization performance of our step-executor. Extensive experiments with four visual reasoning benchmarks (GQA, MME, NLVR2, VQAv2) show that VLAgent achieves significant performance enhancement for multimodal text-visual reasoning applications, compared to existing representative VLMs and LLM-based visual composition approaches like ViperGPT and VisProg, thanks to the novel optimization modules of the VLAgent back-engine (SS-Parser, Plan Repairer, Output Verifiers). Code and data will be made available upon paper acceptance.
zh
[CV-54] rend-Aware Fashion Recommendation with Visual Segmentation and Semantic Similarity
【速读】:该论文旨在解决个性化时尚推荐中如何平衡用户个体风格与新兴趋势的问题。其关键解决方案是构建一个融合深度视觉表征、服装感知分割、语义类别相似性以及用户行为模拟的推荐系统。通过语义分割掩码非服饰区域后,利用预训练的卷积神经网络(CNN)提取聚焦的视觉嵌入,并结合加权评分函数融合视觉相似性、语义一致性和流行度对齐,从而实现更精准的推荐效果。
链接: https://arxiv.org/abs/2506.07773
作者: Mohamed Djilani,Nassim Ali Ousalah,Nidhal Eddine Chenni
机构: University of Luxembourg(卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We introduce a trend-aware and visually-grounded fashion recommendation system that integrates deep visual representations, garment-aware segmentation, semantic category similarity and user behavior simulation. Our pipeline extracts focused visual embeddings by masking non-garment regions via semantic segmentation followed by feature extraction using pretrained CNN backbones (ResNet-50, DenseNet-121, VGG16). To simulate realistic shopping behavior, we generate synthetic purchase histories influenced by user-specific trendiness and item popularity. Recommendations are computed using a weighted scoring function that fuses visual similarity, semantic coherence and popularity alignment. Experiments on the DeepFashion dataset demonstrate consistent gender alignment and improved category relevance, with ResNet-50 achieving 64.95% category similarity and lowest popularity MAE. An ablation study confirms the complementary roles of visual and popularity cues. Our method provides a scalable framework for personalized fashion recommendations that balances individual style with emerging trends. Our implementation is available at this https URL
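推荐分数由视觉相似度、语义类别一致性与流行度对齐三项加权融合。以下是该加权评分函数的一个最小示意(权重与各项的具体定义均为假设,非论文原始设定):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def recommend_score(query_emb, item_emb, same_category, item_pop, user_trend,
                    w_vis=0.6, w_sem=0.25, w_pop=0.15):
    """加权评分示意:视觉相似 + 语义类别一致 + 流行度与用户趋势对齐。
    权重与各项定义均为示例假设。"""
    vis = cosine(query_emb, item_emb)          # 视觉嵌入相似度
    sem = 1.0 if same_category else 0.0        # 语义类别一致性
    pop = 1.0 - abs(item_pop - user_trend)     # 流行度对齐(两者均归一化到 [0,1])
    return w_vis * vis + w_sem * sem + w_pop * pop

q, it = np.random.rand(2048), np.random.rand(2048)   # 形如 ResNet-50 的特征
print(recommend_score(q, it, same_category=True, item_pop=0.8, user_trend=0.7))
```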
zh
[CV-55] Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation CVPR2025
【速读】:该论文试图解决在给定输入图像A、A’和B的情况下,生成满足A:A’::B:B’关系的图像B’的问题。现有方法如视觉上下文学习或视觉指令通常局限于特定模型(如InstructPix2Pix、修复模型),而非通用扩散模型(如Stable Diffusion、SDXL),这可能导致继承性偏差或编辑能力受限。解决方案的关键在于提出Difference Inversion方法,该方法通过隔离A与A’之间的差异,并将其应用于B以生成合理的B’。为减少模型依赖性,采用“Full Prompt”结构而非“Instruction Prompt”,并结合Delta Interpolation、Token Consistency Loss和Zero Initialization of Token Embeddings等技术以提高差异提取的准确性。
链接: https://arxiv.org/abs/2506.07750
作者: Hyunsoo Kim,Donghyun Kim,Suhyun Kim
机构: Korea University (高丽大学); Korea Institute of Science and Technology (韩国科学技术研究院); Kyung Hee University (庆熙大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at CVPR 2025
Abstract:How can we generate an image B' that satisfies A:A'::B:B', given the input images A, A' and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g., InstructPix2Pix, inpainting models) rather than general diffusion models (e.g., Stable Diffusion, SDXL). This dependency may lead to inherited biases or lower editing capabilities. In this paper, we propose Difference Inversion, a method that isolates only the difference from A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to stable diffusion models, rather than using an "Instruction Prompt". To this end, we accurately extract the Difference between A and A' and combine it with the prompt of B, enabling a plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose the 2) Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate a more feasible B' in a model-agnostic manner.
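方法的第一步是通过Delta Interpolation从A与A'中隔离出"差异"。以嵌入空间中的差值迁移为例给出最小示意(嵌入形状借用CLIP文本嵌入的惯例,alpha为假设的编辑强度,非官方实现):

```python
import torch

def delta_interpolation(emb_a, emb_a_prime, emb_b, alpha=1.0):
    """差值插值示意:提取 A->A' 的编辑方向并以强度 alpha 施加到 B。
    真实方法在扩散模型的提示空间中操作,并配合
    Token Consistency Loss 与零初始化的 token 嵌入训练。"""
    delta = emb_a_prime - emb_a       # 隔离出的"差异"方向
    return emb_b + alpha * delta      # 得到 B' 的目标嵌入

emb_a, emb_ap, emb_b = (torch.randn(77, 768) for _ in range(3))  # 形如 CLIP 文本嵌入
emb_bp = delta_interpolation(emb_a, emb_ap, emb_b, alpha=0.8)
print(emb_bp.shape)   # torch.Size([77, 768])
```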
zh
[CV-56] Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images
【速读】:该论文试图解决光学流估计(optical flow estimation)在真实世界应用中因依赖动画合成数据集训练而导致的领域差距问题,以及数据集规模扩大带来的局限性。其解决方案的关键在于提出一种名为Flow-Anything的大规模数据生成框架,该框架通过两个有效步骤实现数据扩展:首先,利用先进的单目深度估计网络将单视角图像转换为3D表示,从而在虚拟相机下渲染光学流和新视角图像;其次,开发了与物体无关的体绘制模块和深度感知修复模块,以建模3D表示中的动态物体,最终生成逼真的训练数据集FA-Flow Dataset。
链接: https://arxiv.org/abs/2506.07740
作者: Yingping Liang,Ying Fu,Yutao Hu,Wenqi Shao,Jiaming Liu,Debing Zhang
机构: Beijing Institute of Technology (北京理工大学); Southeast University (东南大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tiamat AI (Tiamat AI); Xiaohongshu (小红书)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical flow estimation is a crucial subfield of computer vision, serving as a foundation for video tasks. However, real-world robustness is limited by training on animated synthetic datasets. This introduces domain gaps when applied to real-world applications and limits the benefits of scaling up datasets. To address these challenges, we propose Flow-Anything, a large-scale data generation framework designed to learn optical flow estimation from any single-view images in the real world. We employ two effective steps to make data scaling-up promising. First, we convert a single-view image into a 3D representation using advanced monocular depth estimation networks. This allows us to render optical flow and novel view images under a virtual camera. Second, we develop an Object-Independent Volume Rendering module and a Depth-Aware Inpainting module to model the dynamic objects in the 3D representation. These two steps allow us to generate realistic datasets for training from large-scale single-view images, namely the FA-Flow Dataset. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images, outperforming the most advanced unsupervised methods and supervised methods on synthetic datasets. Moreover, our models serve as a foundation model and enhance the performance of various downstream video tasks.
zh
[CV-57] ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models
【速读】:该论文试图解决传统建筑文化研究中依赖主观专家解读和历史文献回顾所导致的区域偏见及解释范围有限的问题。其解决方案的关键在于提出了一种基于视觉-语言模型的分析框架ArchiLense,并构建了专业的建筑风格数据集ArchDiffBench,通过整合先进的计算机视觉技术、深度学习与机器学习算法,实现了建筑图像的自动识别、比较与精确分类,从而提供更具客观性和准确性的建筑风格分析方法。
链接: https://arxiv.org/abs/2506.07739
作者: Jing Zhong,Jun Yin,Peilin Li,Pengyu Zeng,Miao Zhang,Shuai Lu,Ran Luo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Architectural cultures across regions are characterized by stylistic diversity, shaped by historical, social, and technological contexts in addition to geographical conditions. Understanding architectural styles requires the ability to describe and analyze the stylistic features of different architects from various regions through visual observations of architectural imagery. However, traditional studies of architectural culture have largely relied on subjective expert interpretations and historical literature reviews, often suffering from regional biases and limited explanatory scope. To address these challenges, this study proposes three core contributions: (1) We construct a professional architectural style dataset named ArchDiffBench, which comprises 1,765 high-quality architectural images and their corresponding style annotations, collected from different regions and historical periods. (2) We propose ArchiLense, an analytical framework grounded in Vision-Language Models and constructed using the ArchDiffBench dataset. By integrating advanced computer vision techniques, deep learning, and machine learning algorithms, ArchiLense enables automatic recognition, comparison, and precise classification of architectural imagery, producing descriptive language outputs that articulate stylistic differences. (3) Extensive evaluations show that ArchiLense achieves strong performance in architectural style recognition, with a 92.4% consistency rate with expert annotations and 84.5% classification accuracy, effectively capturing stylistic distinctions across images. The proposed approach transcends the subjectivity inherent in traditional analyses and offers a more objective and accurate perspective for comparative studies of architectural culture.
zh
[CV-58] AssetDropper: Asset Extraction via Diffusion Models with Reward-Driven Optimization SIGGRAPH2025
【速读】:该论文旨在解决从参考图像中高效提取高质量、标准化资产的问题,这一需求在设计领域尤为突出,尽管开放世界场景提供了丰富的素材,但生成式能力尚未显著提升该领域的资产标准化处理。解决方案的关键在于提出AssetDropper框架,该框架能够从输入图像中提取选定主体的正面视图,并有效处理透视失真和主体遮挡等复杂情况,同时通过预训练的奖励模型实现与图像提示对齐的精确资产提取,从而提升提取结果的一致性并减少幻觉现象。
链接: https://arxiv.org/abs/2506.07738
作者: Lanjiong Li,Guanhua Zhao,Lingting Zhu,Zeyu Cai,Lequan Yu,Jian Zhang,Zeyu Wang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Peking University (北京大学); The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2025. 11 pages, 12 figures
Abstract:Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw materials for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first framework designed to extract assets from reference images, providing artists with an open-world asset palette. Our model adeptly extracts a front view of selected subjects from input images, effectively handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating the exploration of future research in downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to fulfill a closed-loop with feedback. We design the reward model to perform an inverse task that pastes the extracted assets back into the reference sources, which assists training with additional consistency and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves the state-of-the-art results in asset extraction. Project page: this http URL.
zh
[CV-59] SpikeSMOKE: Spiking Neural Networks for Monocular 3D Object Detection with Cross-Scale Gated Coding
【速读】:该论文旨在解决低功耗单目3D目标检测中的高能耗问题,特别是在自动驾驶等应用中日益增长的能源消耗需求。其关键解决方案是引入脉冲神经网络(SNNs),并提出SpikeSMOKE架构,结合跨尺度门控编码机制(CSGC)以增强特征表达能力,同时设计轻量级残差块以降低计算量并提升训练速度,从而在保持较高检测性能的同时显著降低能耗。
链接: https://arxiv.org/abs/2506.07737
作者: Xuemei Chen,Huamin Wang,Hangchi Shen,Shukai Duan,Shiping Wen,Tingwen Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low energy consumption for 3D object detection is an important research area, as energy consumption grows with the wide application of such systems in fields like autonomous driving. Spiking neural networks (SNNs), with their low-power characteristics, can provide a novel solution for this problem. Therefore, we apply SNNs to monocular 3D object detection and propose the SpikeSMOKE architecture in this paper, a new attempt at low-power monocular 3D object detection. As is well known, the discrete signals of SNNs cause information loss and limit their feature expression ability compared with artificial neural networks (ANNs). In order to address this issue, inspired by the filtering mechanism of biological neuronal synapses, we propose a cross-scale gated coding mechanism (CSGC), which can enhance feature representation by combining cross-scale fusion of attentional methods and gated filtering mechanisms. In addition, to reduce computation and increase training speed, we present a novel light-weight residual block that can maintain the spiking computing paradigm and the highest possible detection performance. Compared to the baseline SpikeSMOKE on 3D object detection, the proposed SpikeSMOKE with CSGC achieves 11.78 (+2.82, Easy), 10.69 (+3.2, Moderate), and 10.48 (+3.17, Hard) on the KITTI autonomous driving dataset by AP|R11 at the 0.7 IoU threshold, respectively. Importantly, SpikeSMOKE can significantly reduce energy consumption compared to SMOKE. For example, energy consumption can be reduced by 72.2% on the hard category, while detection performance drops by only 4%. SpikeSMOKE-L (lightweight) can further reduce the number of parameters by 3 times and computation by 10 times compared to SMOKE.
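SNN以二值脉冲传递信息,这正是信息损失的来源,也是CSGC门控编码试图缓解的问题。下面给出LIF脉冲神经元单步更新的最小示意(阈值与时间常数为常见默认值,非论文设定):

```python
import torch

def lif_step(x, v, v_th=1.0, tau=2.0):
    """LIF 神经元单步更新示意:膜电位泄漏积分 -> 阈值发放 -> 软复位。
    输出为 0/1 脉冲,连续输入被量化为离散信号(信息损失的来源)。"""
    v = v + (x - v) / tau            # 泄漏积分
    spike = (v >= v_th).float()      # 达阈值则发放脉冲
    v = v - spike * v_th             # 软复位
    return spike, v

x_seq = torch.rand(8, 4)             # 8 个时间步、4 个神经元的输入电流
v = torch.zeros(4)
for t in range(x_seq.shape[0]):
    s, v = lif_step(x_seq[t], v)
    print(t, s.tolist())             # 每步的二值脉冲
```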
zh
[CV-60] Language Embedding Meets Dynamic Graph: A New Exploration for Neural Architecture Representation Learning
【速读】:该论文旨在解决神经网络架构表示学习中的两个关键问题:一是现有方法忽视了硬件属性信息,限制了模型在多样化深度学习硬件环境下的实用性;二是当前编码方法依赖静态邻接矩阵来表示拓扑结构,无法有效捕捉计算节点间的结构差异,从而影响编码效果。其解决方案的关键在于提出LeDG-Former框架,通过语言基础语义嵌入与动态图表示学习的协同整合,将神经架构和硬件平台规范统一映射到语义空间,实现跨硬件平台的零样本预测,并引入基于动态图的Transformer模型提升神经架构建模性能。
链接: https://arxiv.org/abs/2506.07735
作者: Haizhao Jing,Haokui Zhang,Zhenhao Shang,Rong Xiao,Peng Wang,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学); Intellifusion (智元)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures
Abstract:Neural Architecture Representation Learning aims to transform network models into feature representations for predicting network attributes, playing a crucial role in deploying and designing networks for real-world applications. Recently, inspired by the success of transformers, transformer-based models integrated with Graph Neural Networks (GNNs) have achieved significant progress in representation learning. However, current methods still have some limitations. First, existing methods overlook hardware attribute information, which conflicts with the current trend of diversified deep learning hardware and limits the practical applicability of models. Second, current encoding approaches rely on static adjacency matrices to represent topological structures, failing to capture the structural differences between computational nodes, which ultimately compromises encoding effectiveness. In this paper, we introduce LeDG-Former, an innovative framework that addresses these limitations through the synergistic integration of language-based semantic embedding and dynamic graph representation learning. Specifically, inspired by large language models (LLMs), we propose a language embedding framework where both neural architectures and hardware platform specifications are projected into a unified semantic space through tokenization and LLM processing, enabling zero-shot prediction across different hardware platforms for the first time. Then, we propose a dynamic graph-based transformer for modeling neural architectures, resulting in improved neural architecture modeling performance. On the NNLQP benchmark, LeDG-Former surpasses previous methods, establishing a new SOTA while demonstrating the first successful cross-hardware latency prediction capability. Furthermore, our framework achieves superior performance on the cell-structured NAS-Bench-101 and NAS-Bench-201 datasets.
zh
[CV-61] ETA: Efficiency through Thinking Ahead A Dual Approach to Self-Driving with Large Models ICCV2025
【速读】:该论文试图解决在自动驾驶系统中使用大模型时面临的推理速度与性能之间的权衡问题(dilemma)。现有双系统架构通常采用并行结构,但难以实现大模型对每个在线帧的及时响应。该研究的关键解决方案是引入一种异步系统——通过提前思考(Efficiency through Thinking Ahead, ETA),将当前帧的密集计算任务转移到之前的时序步骤,并对多个时序步骤进行批量推理,从而提升大模型对每个时序步骤的响应速度。
链接: https://arxiv.org/abs/2506.07725
作者: Shadi Hamdan,Chonghao Sima,Zetong Yang,Hongyang Li,Fatma Güney
机构: Koç University (科克大学); KUIS AI Center (KUIS人工智能中心); The University of Hong Kong (香港大学); OpenDriveLab (OpenDriveLab)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICCV 2025 submission. For code, see this https URL
Abstract:How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.
zh
[CV-62] ReverB-SNN: Reversing Bit of the Weight and Activation for Spiking Neural Networks ICML2024
【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Network, SNN)中由于二进制脉冲激活导致的信息表达不足,从而影响模型精度的问题。其解决方案的关键在于反转权重和激活的位表示,提出一种称为ReverB-SNN的方法,该方法采用实值脉冲激活与二进制权重相结合的方式,在保持SNN事件驱动和无乘法运算优势的同时,提升了激活的信息容量,并通过引入可训练因子使二进制权重能够自适应学习合适的权重幅度,从而增强网络容量。
链接: https://arxiv.org/abs/2506.07720
作者: Yufei Guo,Yuhan Zhang,Zhou Jie,Xiaode Liu,Xin Tong,Yuanpei Chen,Weihang Peng,Zhe Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2024
Abstract:The Spiking Neural Network (SNN), a biologically inspired neural network infrastructure, has garnered significant attention recently. SNNs utilize binary spike activations for efficient information transmission, replacing multiplications with additions, thereby enhancing energy efficiency. However, binary spike activation maps often fail to capture sufficient data information, resulting in reduced accuracy. To address this challenge, we advocate reversing the bit of the weight and activation for SNNs, called ReverB-SNN, inspired by recent findings that highlight greater accuracy degradation from quantizing activations compared to weights. Specifically, our method employs real-valued spike activations alongside binary weights in SNNs. This preserves the event-driven and multiplication-free advantages of standard SNNs while enhancing the information capacity of activations. Additionally, we introduce a trainable factor within binary weights to adaptively learn suitable weight amplitudes during training, thereby increasing network capacity. To maintain efficiency akin to the vanilla ReverB-SNN, our trainable binary weight SNNs are converted back to standard form using a re-parameterization technique during inference. Extensive experiments across various network architectures and datasets, both static and dynamic, demonstrate that our approach consistently outperforms state-of-the-art methods.
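ReverB-SNN将常规位分配反转:激活取实值、权重取二值,并为二值权重引入可训练幅度因子。以下用直通估计器(STE)给出"二值权重+可训练因子"线性层的示意实现(非官方代码,层结构与因子粒度均为假设):

```python
import torch
import torch.nn as nn

class BinaryWeightLinear(nn.Module):
    """二值权重 + 可训练幅度因子的线性层示意。
    前向用 sign(w)*alpha,反向用直通估计器(STE)把梯度传回实值权重。"""
    def __init__(self, in_f, out_f):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.1)
        self.alpha = nn.Parameter(torch.ones(out_f, 1))   # 每输出通道的幅度因子

    def forward(self, x):
        w_bin = torch.sign(self.weight)
        # STE:前向等价于 w_bin,反向梯度按恒等映射传给 self.weight
        w = self.weight + (w_bin - self.weight).detach()
        return x @ (self.alpha * w).t()

layer = BinaryWeightLinear(16, 8)
y = layer(torch.randn(4, 16))         # 实值激活、二值权重
y.sum().backward()
print(layer.weight.grad.shape)        # torch.Size([8, 16])
```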
zh
[CV-63] Consistent Video Editing as Flow-Driven Image-to-Video Generation
【速读】:该论文旨在解决视频编辑中运动迁移过程中的复杂运动模式建模问题,特别是针对非刚性物体运动(如多物体和人像编辑)的不足。现有方法在视频编辑任务中存在局限性,主要集中在对象替换上,无法有效处理复杂的运动模式。论文提出的解决方案关键在于利用光流(optical flow)进行复杂运动建模,并提出FlowV2V框架,将视频编辑任务重新定义为由光流驱动的图像到视频(I2V)生成问题。该方法通过分解整个流程为第一帧编辑和条件I2V生成,并模拟与形变一致的伪光流序列,从而确保编辑过程中的时间一致性。
链接: https://arxiv.org/abs/2506.07713
作者: Ge Wang,Songlin Fan,Hangxu Liu,Quanjian Song,Hewei Wang,Jinfeng Xu
机构: Fudan University (复旦大学); Xiamen University (厦门大学); Carnegie Mellon University (卡内基梅隆大学); Hong Kong University (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 12 figures
Abstract:With the rapid progress of video diffusion models, downstream applications like video editing have been significantly advanced without incurring much computational cost. One particular challenge in this task lies in the motion transfer process from the source video to the edited one, which requires accounting for the shape deformation in between while maintaining temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing, and are fundamentally limited to object replacement, where tasks with non-rigid object motions like multi-object and portrait editing are largely neglected. In this paper, we observe that optical flows offer a promising alternative in complex motion modeling, and present FlowV2V to re-investigate video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence that aligns with the deformed shape, thus ensuring consistency during editing. Experimental results on DAVIS-EDIT with improvements of 13.67% and 50.66% on DOVER and warping error illustrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art ones. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionalities of the first-frame paradigm and flow alignment in the proposed method.
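光流驱动生成的基本操作是按流场对图像做后向变形(backward warping),可用grid_sample实现。以下为最小示意(与论文的伪光流模拟无关,仅演示流驱动变形本身):

```python
import torch
import torch.nn.functional as F

def warp_by_flow(img, flow):
    """按光流后向采样:img (B,C,H,W),flow (B,2,H,W),单位为像素。
    flow[:,0] 为 x 方向位移,flow[:,1] 为 y 方向位移。"""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(img)  # (1,2,H,W)
    coords = grid + flow
    # 归一化到 [-1, 1] 供 grid_sample 使用
    coords_x = 2 * coords[:, 0] / (w - 1) - 1
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)             # (B,H,W,2)
    return F.grid_sample(img, norm_grid, align_corners=True)

img = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64); flow[:, 0] = 5.0   # 整体平移 5 像素的伪光流
print(warp_by_flow(img, flow).shape)                 # torch.Size([1, 3, 64, 64])
```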
zh
[CV-64] Adaptive Blind Super-Resolution Network for Spatial-Specific and Spatial-Agnostic Degradations
【速读】:该论文试图解决图像重建中不同退化类型被统一模型处理而导致性能受限的问题,特别是针对空间无关和空间相关的主导退化类型。解决方案的关键在于引入一种融合全局与局部分支的动态滤波网络,通过全局动态滤波层感知空间无关的主导退化,利用注意力机制生成权重对多个并行标准卷积核进行加权;同时,局部动态滤波层将特征图转换为具有空间特性的动态滤波算子,执行空间特定的卷积操作以处理空间相关退化,从而有效提升图像重建效果。
链接: https://arxiv.org/abs/2506.07705
作者: Weilei Wen,Chunle Guo,Wenqi Ren,Hongpeng Wang,Xiuli Shao
机构: Nankai University(南开大学); Sun Yat-sen University(中山大学); National University of Defense Technology(国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: IEEE TRANSACTIONS ON IMAGE PROCESSING
Abstract:Prior methodologies have disregarded the diversities among distinct degradation types during image reconstruction, employing a uniform network model to handle multiple deteriorations. Nevertheless, we discover that prevalent degradation modalities, including sampling, blurring, and noise, can be roughly categorized into two classes. We classify the first class as spatial-agnostic dominant degradations, less affected by regional changes in image space, such as downsampling and noise degradation. The second class degradation type is intimately associated with the spatial position of the image, such as blurring, and we identify them as spatial-specific dominant degradations. We introduce a dynamic filter network integrating global and local branches to address these two degradation types. This network can greatly alleviate the practical degradation problem. Specifically, the global dynamic filtering layer can perceive the spatial-agnostic dominant degradation in different images by applying weights generated by the attention mechanism to multiple parallel standard convolution kernels, enhancing the network’s representation ability. Meanwhile, the local dynamic filtering layer converts feature maps of the image into a spatially specific dynamic filtering operator, which performs spatially specific convolution operations on the image features to handle spatial-specific dominant degradations. By effectively integrating both global and local dynamic filtering operators, our proposed method outperforms state-of-the-art blind super-resolution algorithms in both synthetic and real image datasets.
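全局动态滤波层的做法与CondConv类似:由全局特征生成注意力权重,对多个并行卷积核加权,合成样本专属的卷积核。最小示意如下(通道数、核数均为假设值,非论文配置):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDynamicConv(nn.Module):
    """全局动态滤波示意:由全局池化特征生成 K 个并行卷积核的权重,
    对核加权求和后执行一次标准卷积(思路与 CondConv 类似)。"""
    def __init__(self, channels=32, k=4, kernel=3):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(k, channels, channels, kernel, kernel) * 0.02)
        self.attn = nn.Linear(channels, k)
        self.pad = kernel // 2

    def forward(self, x):
        b, c, h, w = x.shape
        w_attn = torch.softmax(self.attn(x.mean(dim=(2, 3))), dim=-1)   # (B, K)
        # 为每个样本合成专属卷积核,并用分组卷积一次性计算
        kern = torch.einsum("bk,koihw->boihw", w_attn, self.kernels)    # (B,C,C,k,k)
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kern.reshape(b * c, c, *kern.shape[-2:]),
                       padding=self.pad, groups=b)
        return out.reshape(b, c, h, w)

m = GlobalDynamicConv()
print(m(torch.randn(2, 32, 24, 24)).shape)   # torch.Size([2, 32, 24, 24])
```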
zh
[CV-65] NOVA3D: Normal Aligned Video Diffusion Model for Single Image to 3D Generation ICME2025
【速读】:该论文旨在解决单图像生成3D内容时存在的多视角一致性不足问题,这一问题通常源于3D先验知识的缺失。其解决方案的关键在于引入NOVA3D框架,通过利用预训练视频扩散模型中的强大3D先验,并在多视角视频微调过程中整合几何信息,从而提升生成结果的多视角一致性和泛化能力。
链接: https://arxiv.org/abs/2506.07698
作者: Yuxiao Yang,Peihao Li,Yuhong Zhang,Junzhe Lu,Xianglong He,Minghan Qin,Weitao Wang,Haoqian Wang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, accepted by ICME 2025
Abstract:3D AI-generated content (AIGC) has made it increasingly accessible for anyone to become a 3D content creator. While recent methods leverage Score Distillation Sampling to distill 3D objects from pretrained image diffusion models, they often suffer from inadequate 3D priors, leading to insufficient multi-view consistency. In this work, we introduce NOVA3D, an innovative single-image-to-3D generation framework. Our key insight lies in leveraging strong 3D priors from a pretrained video diffusion model and integrating geometric information during multi-view video fine-tuning. To facilitate information exchange between color and geometric domains, we propose the Geometry-Temporal Alignment (GTA) attention mechanism, thereby improving generalization and multi-view consistency. Moreover, we introduce the de-conflict geometry fusion algorithm, which improves texture fidelity by addressing multi-view inaccuracies and resolving discrepancies in pose alignment. Extensive experiments validate the superiority of NOVA3D over existing baselines.
zh
[CV-66] OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting
【速读】:该论文旨在解决3D场景中开放词汇的3D实例分割问题,即在无需人工标注的情况下,根据自然语言描述识别和分割任意物体。其解决方案的关键在于引入OpenSplat3D方法,通过特征点云(feature-splatting)技术将语义信息与单个高斯分布相关联,结合Segment Anything Model实例掩码与对比损失函数以实现精确的实例级分割,并利用视觉-语言模型的语言嵌入实现灵活的文本驱动实例识别。
链接: https://arxiv.org/abs/2506.07697
作者: Jens Piekenbrinck,Christian Schmidt,Alexander Hermans,Narunas Vaskevicius,Timm Linder,Bastian Leibe
机构: RWTH Aachen University(亚琛工业大学); Robert Bosch GmbH(博世集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction, offering high-quality novel view synthesis while maintaining computational efficiency. In this paper, we extend the capabilities of 3DGS beyond pure scene representation by introducing an approach for open-vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D. Our method leverages feature-splatting techniques to associate semantic information with individual Gaussians, enabling fine-grained scene understanding. We incorporate Segment Anything Model instance masks with a contrastive loss formulation as guidance for the instance features to achieve accurate instance-level segmentation. Furthermore, we utilize language embeddings of a vision-language model, allowing for flexible, text-driven instance identification. This combination enables our system to identify and segment arbitrary objects in 3D scenes based on natural language descriptions. We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
zh
[CV-67] ProSplat: Improved Feed-Forward 3D Gaussian Splatting for Wide-Baseline Sparse Views
【速读】:该论文旨在解决在宽基线(wide-baseline)条件下,前向3D高斯点云(feed-forward 3D Gaussian Splatting, 3DGS)在新颖视图合成(novel view synthesis, NVS)任务中因纹理细节不足和几何不一致导致性能显著下降的问题。其解决方案的关键在于提出ProSplat框架,该框架由两阶段组成:第一阶段通过3DGS生成器生成3D高斯原语,第二阶段利用改进模型对渲染视图进行增强,其中改进模型基于单步扩散模型,并结合了最大重叠参考视图注入(Maximum Overlap Reference view Injection, MORI)和距离加权极线注意力(Distance-Weighted Epipolar Attention, DWEA)技术,以补充缺失的纹理与颜色并增强几何一致性。此外,还引入了分而治之的训练策略,通过联合优化使两阶段的数据分布对齐。
链接: https://arxiv.org/abs/2506.07670
作者: Xiaohan Lu,Jiaye Fu,Jiaqi Zhang,Zetian Song,Chuanmin Jia,Siwei Ma
机构: Peking University(北京大学); State Key Laboratory of Multimedia Information Processing(多媒体信息处理国家重点实验室); School of Computer Science(计算机学院); School of Electronic and Computer Engineering(电子与计算机工程学院); National Engineering Research Center of Visual Technology(视觉技术国家工程研究中心); Wangxuan Institute of Computer Technology(王选计算机技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) has recently demonstrated promising results for novel view synthesis (NVS) from sparse input views, particularly under narrow-baseline conditions. However, its performance significantly degrades in wide-baseline scenarios due to limited texture details and geometric inconsistencies across views. To address these challenges, in this paper, we propose ProSplat, a two-stage feed-forward framework designed for high-fidelity rendering under wide-baseline conditions. The first stage involves generating 3D Gaussian primitives via a 3DGS generator. In the second stage, rendered views from these primitives are enhanced through an improvement model. Specifically, this improvement model is based on a one-step diffusion model, further optimized by our proposed Maximum Overlap Reference view Injection (MORI) and Distance-Weighted Epipolar Attention (DWEA). MORI supplements missing texture and color by strategically selecting a reference view with maximum viewpoint overlap, while DWEA enforces geometric consistency using epipolar constraints. Additionally, we introduce a divide-and-conquer training strategy that aligns data distributions between the two stages through joint optimization. We evaluate ProSplat on the RealEstate10K and DL3DV-10K datasets under wide-baseline settings. Experimental results demonstrate that ProSplat achieves an average improvement of 1 dB in PSNR compared to recent SOTA methods.
zh
[CV-68] PIG: Physically-based Multi-Material Interaction with 3D Gaussians
【速读】:该论文旨在解决由3D Gaussian primitives表示的场景中物体之间的交互问题,具体表现为不准确的3D分割、不同材料间的变形不精确以及严重的渲染伪影。其解决方案的关键在于引入PIG:基于物理的多材料交互与3D高斯分布(Physically-Based Multi-Material Interaction with 3D Gaussians),该方法通过将3D物体分割与高精度物体交互模拟相结合,实现了精确的3D物体级分割,并为场景中的物体分配独特的物理属性以实现多材料耦合交互,同时通过将约束尺度嵌入变形梯度,有效抑制了伪影并提升了几何保真度和视觉一致性。
链接: https://arxiv.org/abs/2506.07657
作者: Zeyu Xiao,Zhenyi Wu,Mingyang Sun,Qipeng Yan,Yufan Guo,Zhuoer Liang,Lihua Zhang
机构: Fudan University (复旦大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting has achieved remarkable success in reconstructing both static and dynamic 3D scenes. However, in a scene represented by 3D Gaussian primitives, interactions between objects suffer from inaccurate 3D segmentation, imprecise deformation among different materials, and severe rendering artifacts. To address these challenges, we introduce PIG: Physically-Based Multi-Material Interaction with 3D Gaussians, a novel approach that combines 3D object segmentation with the simulation of interacting objects in high precision. Firstly, our method facilitates fast and accurate mapping from 2D pixels to 3D Gaussians, enabling precise 3D object-level segmentation. Secondly, we assign unique physical properties to correspondingly segmented objects within the scene for multi-material coupled interactions. Finally, we have successfully embedded constraint scales into deformation gradients, specifically clamping the scaling and rotation properties of the Gaussian primitives to eliminate artifacts and achieve geometric fidelity and visual consistency. Experimental results demonstrate that our method not only outperforms the state-of-the-art (SOTA) in terms of visual quality, but also opens up new directions and pipelines for the field of physically realistic scene generation.
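论文将约束尺度嵌入变形梯度,钳制高斯基元的缩放与旋转。一种常见实现思路是对变形梯度做SVD并把奇异值钳制到给定范围,以下为该思路的示意(阈值为假设值,不代表论文的具体约束形式):

```python
import torch

def clamp_deformation(F_grad, s_min=0.8, s_max=1.2):
    """对变形梯度做 SVD,将奇异值(各主方向的拉伸量)钳制到给定范围后重组;
    旋转部分 U、V 保持不变。阈值为示例假设。"""
    U, S, Vh = torch.linalg.svd(F_grad)
    S_clamped = S.clamp(s_min, s_max)
    return U @ torch.diag_embed(S_clamped) @ Vh

F_grad = torch.eye(3) + 0.6 * torch.randn(3, 3)   # 模拟一个剧烈形变的变形梯度
F_safe = clamp_deformation(F_grad)
print(torch.linalg.svd(F_safe).S)                 # 奇异值均落在 [0.8, 1.2]
```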
zh
[CV-69] FMaMIL: Frequency-Driven Mamba Multi-Instance Learning for Weakly Supervised Lesion Segmentation in Medical Images
【速读】:该论文旨在解决组织病理学图像中病灶分割的挑战,尤其是在缺乏昂贵像素级标注的情况下实现准确分割。其解决方案的关键在于提出一种基于图像级标签的弱监督分割框架FMaMIL,该框架包含两个阶段:第一阶段采用轻量级Mamba编码器在MIL(Multiple Instance Learning)范式下捕捉跨图像块的长程依赖关系,并引入可学习的频域编码模块以增强空间敏感性和结构感知能力;第二阶段通过CAM(Class Activation Map)引导的软标签监督和自校正机制对初始伪标签进行优化,从而在存在标签噪声的情况下实现鲁棒训练。
链接: https://arxiv.org/abs/2506.07652
作者: Hangbei Cheng,Xiaorong Dong,Xueyu Liu,Jianan Zhang,Xuetao Ma,Mingqiang Wei,Liansheng Wang,Junxin Chen,Yongfei Wu
机构: Taiyuan University of Technology, Taiyuan, Shanxi, 030024, China; Beijing Normal University, Beijing, 100875, China; Xiamen, 361005, China; Dalian University of Technology, Dalian, 116621, China
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-based encoder is introduced to capture long-range dependencies across image patches under the MIL paradigm. To enhance spatial sensitivity and structural awareness, we design a learnable frequency-domain encoding module that supplements spatial-domain features with spectrum-based information. CAMs generated in this stage are used to guide segmentation training. In the second stage, we refine the initial pseudo labels via a CAM-guided soft-label supervision and a self-correction mechanism, enabling robust training even under label noise. Extensive experiments on both public and private histopathology datasets demonstrate that FMaMIL outperforms state-of-the-art weakly supervised methods without relying on pixel-level annotations, validating its effectiveness and potential for digital pathology applications.
zh
[CV-70] Synthetic Visual Genome CVPR2025
【速读】:该论文旨在解决视觉关系推理(visual relationship reasoning)在多模态语言模型(MLMs)中的精确性与生成能力不足的问题。其关键解决方案是引入ROBIN模型,该模型通过密集标注的关系进行指令微调,能够大规模构建高质量的密集场景图(scene graph)。为训练ROBIN,研究者构建了SVG数据集,通过教师MLM和精心设计的过滤流程补充现有场景图中缺失的关系,确保数据质量;同时,采用SG-EDIT自蒸馏框架,利用GPT-4o进一步优化ROBIN预测的场景图,提升其准确性和丰富性。
链接: https://arxiv.org/abs/2506.07643
作者: Jae Sung Park,Zixian Ma,Linjie Li,Chenhao Zheng,Cheng-Yu Hsieh,Ximing Lu,Khyathi Chandu,Quan Kong,Norimasa Kobori,Ali Farhadi,Yejin Choi,Ranjay Krishna
机构: University of Washington (华盛顿大学); Allen Institute for Artificial Intelligence (艾伦人工智能研究所); Stanford University (斯坦福大学); Woven by Toyota (丰田编织公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
Abstract:Reasoning over visual relationships-spatial, functional, interactional, social, etc.-is considered to be a fundamental component of human cognition. Yet, despite the major advances in visual comprehension in multimodal language models (MLMs), precise reasoning over relationships and their generations remains a challenge. We introduce ROBIN: an MLM instruction-tuned with densely annotated relationships capable of constructing high-quality dense scene graphs at scale. To train ROBIN, we curate SVG, a synthetic scene graph dataset by completing the missing relations of selected objects in existing scene graphs using a teacher MLM and a carefully designed filtering process to ensure high-quality. To generate more accurate and rich scene graphs at scale for any image, we introduce SG-EDIT: a self-distillation framework where GPT-4o further refines ROBIN's predicted scene graphs by removing unlikely relations and/or suggesting relevant ones. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. Results show that our ROBIN-3B model, despite being trained on less than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks, and even surpasses larger models up to 13B parameters. Notably, it achieves state-of-the-art performance in referring expression comprehension with a score of 88.9, surpassing the previous best of 87.4. Our results suggest that training on the refined scene graph data is crucial to maintaining high performance across diverse visual reasoning tasks.
zh
[CV-71] HieraEdgeNet: A Multi-Scale Edge-Enhanced Framework for Automated Pollen Recognition WWW
【速读】:该论文旨在解决显微目标(如花粉)在自动化识别中面临的定位精度不足问题,传统方法因效率低和主观性强而难以满足 paleoclimatology(古气候学)、生物多样性监测和公共卫生等领域的需求。其解决方案的关键在于提出 HieraEdgeNet,该框架通过三个协同模块实现多尺度边缘增强:Hierarchical Edge Module (HEM) 提取多尺度边缘特征,Synergistic Edge Fusion (SEF) 模块在各尺度上融合边缘先验与语义信息,Cross Stage Partial Omni-Kernel Module (CSPOKM) 则利用 Omni-Kernel 算子对细节丰富的特征层进行最大优化,从而显著提升了检测精度与效率。
链接: https://arxiv.org/abs/2506.07637
作者: Yuchong Long,Wen Sun,Ningxiao Sun,Wenxiao Wang,Chao Li,Shan Yin
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Forestry Station (上海林业站); Shanghai Yangtze River Delta Eco-Environmental Change and Management Observation and Research Station (上海长江三角洲生态环境变化与管理观测研究站)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 5 figures, 2 tables. The dataset at this https URL . The models at this https URL . The source code is at this https URL
Abstract:Automated pollen recognition is vital to paleoclimatology, biodiversity monitoring, and public health, yet conventional methods are hampered by inefficiency and subjectivity. Existing deep learning models often struggle to achieve the requisite localization accuracy for microscopic targets like pollen, which are characterized by their minute size, indistinct edges, and complex backgrounds. To overcome this limitation, we introduce HieraEdgeNet, a multi-scale edge-enhancement framework. The framework’s core innovation is the introduction of three synergistic modules: the Hierarchical Edge Module (HEM), which explicitly extracts a multi-scale pyramid of edge features that corresponds to the semantic hierarchy at early network stages; the Synergistic Edge Fusion (SEF) module, for deeply fusing these edge priors with semantic information at each respective scale; and the Cross Stage Partial Omni-Kernel Module (CSPOKM), which maximally refines the most detail-rich feature layers using an Omni-Kernel operator - comprising anisotropic large-kernel convolutions and mixed-domain attention - all within a computationally efficient Cross-Stage Partial (CSP) framework. On a large-scale dataset comprising 120 pollen classes, HieraEdgeNet achieves a mean Average Precision (mAP@.5) of 0.9501, significantly outperforming state-of-the-art baseline models such as YOLOv12n and RT-DETR. Furthermore, qualitative analysis confirms that our approach generates feature representations that are more precisely focused on object boundaries. By systematically integrating edge information, HieraEdgeNet provides a robust and powerful solution for high-precision, high-efficiency automated detection of microscopic objects.
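为了直观说明“多尺度边缘先验与语义特征融合”的思路,下面给出一个极简的 PyTorch 草图:用固定权重的 Sobel 卷积在各尺度提取边缘金字塔,再与对应尺度的语义特征做 1×1 卷积融合。模块结构与命名均为笔者基于摘要的假设,并非论文官方实现(HEM/SEF/CSPOKM 的细节以原文为准)。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SobelEdge(nn.Module):
    """固定权重的逐通道 Sobel 边缘提取(假设性实现,仅作演示)。"""
    def __init__(self, channels):
        super().__init__()
        kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        ky = kx.t()
        weight = torch.stack([kx, ky]).unsqueeze(1).repeat(channels, 1, 1, 1)
        self.register_buffer("weight", weight)  # (2C, 1, 3, 3)
        self.channels = channels

    def forward(self, x):
        # 深度可分离方式对每个通道做 Sobel,输出梯度幅值图
        edges = F.conv2d(x, self.weight, padding=1, groups=self.channels)
        gx, gy = edges[:, 0::2], edges[:, 1::2]
        return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

class HierarchicalEdgeModule(nn.Module):
    """在多个尺度上提取边缘金字塔,并与语义特征逐尺度融合(示意)。"""
    def __init__(self, channels, num_scales=3):
        super().__init__()
        self.edge = SobelEdge(channels)
        self.fuse = nn.ModuleList(
            nn.Conv2d(channels * 2, channels, 1) for _ in range(num_scales)
        )

    def forward(self, x, semantic_feats):
        # semantic_feats: 各尺度的语义特征列表,分辨率逐级减半
        outs = []
        for i, feat in enumerate(semantic_feats):
            scaled = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                                   align_corners=False)
            e = self.edge(scaled)
            outs.append(self.fuse[i](torch.cat([feat, e], dim=1)))
        return outs

# 用法示意
hem = HierarchicalEdgeModule(channels=16)
img_feat = torch.randn(1, 16, 64, 64)
sem = [torch.randn(1, 16, 64 // 2 ** i, 64 // 2 ** i) for i in range(3)]
print([o.shape for o in hem(img_feat, sem)])
```

真实实现中 CSPOKM 的各向异性大核卷积与混合域注意力要复杂得多,这里只保留“边缘先验逐尺度注入”这一核心思想。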
zh
[CV-72] HuSc3D: Human Sculpture dataset for 3D object reconstruction
【速读】:该论文旨在解决现有3D场景重建数据集和基准测试主要聚焦于理想化的合成数据或精心采集的真实数据,无法反映新获取真实场景中固有复杂性的问题。其关键解决方案是提出HuSc3D数据集,该数据集专为在现实采集挑战下对3D重建模型进行严格基准测试而设计,包含六种高度详细、完全白色且具有复杂穿孔结构的雕塑,同时每场景图像数量差异显著,从而引入了有限训练数据与标准视图数场景的额外挑战,有效区分模型性能,突出方法对细粒度几何细节、颜色歧义和数据可用性的敏感性。
链接: https://arxiv.org/abs/2506.07628
作者: Weronika Smolak-Dyżewska,Dawid Malarz,Grzegorz Wilczyński,Rafał Tobiasz,Joanna Waczyńska,Piotr Borycki,Przemysław Spurek
机构: Jagiellonian University (亚捷隆大学); IDEAS Research Institute (IDEAS 研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D scene reconstruction from 2D images is one of the most important tasks in computer graphics. Unfortunately, existing datasets and benchmarks concentrate on idealized synthetic or meticulously captured realistic data. Such benchmarks fail to convey the inherent complexities encountered in newly acquired real-world scenes. In such scenes, especially those captured outdoors, the background is often dynamic, and with the widespread use of cell phone cameras, there may be discrepancies in, e.g., white balance. To address this gap, we present HuSc3D, a novel dataset specifically designed for rigorous benchmarking of 3D reconstruction models under realistic acquisition challenges. Our dataset uniquely features six highly detailed, fully white sculptures characterized by intricate perforations and minimal textural and color variation. Furthermore, the number of images per scene varies significantly, introducing the additional challenge of limited training data for some instances alongside scenes with a standard number of views. By evaluating popular 3D reconstruction methods on this diverse dataset, we demonstrate the distinctiveness of HuSc3D in effectively differentiating model performance, particularly highlighting the sensitivity of methods to fine geometric details, color ambiguity, and varying data availability, limitations that are often masked by more conventional datasets.
zh
[CV-73] Event-Priori-Based Vision-Language Model for Efficient Visual Understanding
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的视觉-语言模型(Vision-Language Models, VLMs)在资源受限的边缘设备上部署时面临的计算需求过高的问题。其关键解决方案是引入一种基于事件先验的视觉-语言模型(Event-Priori-Based Vision-Language Model, EP-VLM),通过动态事件视觉中提取的运动先验,实现对视觉输入的稀疏化处理,从而提升模型推理效率。该方法通过事件数据引导RGB图像的块级稀疏化,将计算资源集中于视觉输入中的显著区域,并采用保持位置信息的标记化策略,有效提升了VLM的效率,同时保持了接近无损的准确性。
链接: https://arxiv.org/abs/2506.07627
作者: Haotong Qin,Cheng Hu,Michele Magno
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Language Model (LLM)-based Vision-Language Models (VLMs) have substantially extended the boundaries of visual understanding capabilities. However, their high computational demands hinder deployment on resource-constrained edge devices. A key source of inefficiency stems from the VLM’s need to process dense and redundant visual information. Visual inputs contain significant regions irrelevant to text semantics, rendering the associated computations ineffective for inference. This paper introduces a novel Event-Priori-Based Vision-Language Model, termed EP-VLM. Its core contribution is a novel mechanism leveraging motion priors derived from dynamic event vision to enhance VLM efficiency. Inspired by human visual cognition, EP-VLM first employs event data to guide the patch-wise sparsification of RGB visual inputs, progressively concentrating VLM computation on salient regions of the visual input. Subsequently, we construct a position-preserving tokenization strategy for the visual encoder within the VLM architecture. This strategy processes the event-guided, unstructured, sparse visual input while accurately preserving positional understanding within the visual input. Experimental results demonstrate that EP-VLM achieves significant efficiency improvements while maintaining nearly lossless accuracy compared to baseline models from the Qwen2-VL series. For instance, against the original Qwen2-VL-2B, EP-VLM achieves 50% FLOPs savings while retaining 98% of the original accuracy on the RealWorldQA dataset. This work demonstrates the potential of event-based vision priors for improving VLM inference efficiency, paving the way for creating more efficient and deployable VLMs for sustainable visual understanding at the edge.
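摘要中“事件引导的 patch 稀疏化”可以用几行代码示意:按事件计数图统计每个 patch 的运动显著性,只保留 top-k patch 并记录其网格位置,供位置保持的 tokenization 使用。以下为笔者的假设性草图,并非论文实现,patch 大小与保留比例均为演示用参数。

```python
import torch

def event_guided_patch_select(image, event_map, patch=14, keep_ratio=0.5):
    """按事件密度选取显著 patch(示意实现)。

    image:     (3, H, W) RGB 张量
    event_map: (H, W)    事件计数图(假设已与 RGB 空间对齐)
    返回: 保留的 patch 张量 (K, 3, patch, patch) 及其网格位置索引 (K, 2)
    """
    C, H, W = image.shape
    gh, gw = H // patch, W // patch
    # 每个 patch 内的事件总数,作为运动显著性分数
    density = event_map[: gh * patch, : gw * patch]
    density = density.reshape(gh, patch, gw, patch).sum(dim=(1, 3))  # (gh, gw)
    k = max(1, int(keep_ratio * gh * gw))
    idx = density.flatten().topk(k).indices
    rows, cols = idx // gw, idx % gw
    patches = torch.stack([
        image[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        for r, c in zip(rows.tolist(), cols.tolist())
    ])
    # 位置索引用于位置保持的 tokenization(对应论文的 position-preserving 策略)
    return patches, torch.stack([rows, cols], dim=1)

img = torch.randn(3, 224, 224)
events = torch.randint(0, 5, (224, 224)).float()
patches, pos = event_guided_patch_select(img, events)
print(patches.shape, pos.shape)  # torch.Size([128, 3, 14, 14]) torch.Size([128, 2])
```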
zh
[CV-74] Scaling Human Activity Recognition: A Comparative Evaluation of Synthetic Data Generation and Augmentation Techniques
【速读】:该论文旨在解决人类活动识别(Human Activity Recognition, HAR)中因标注数据集稀缺而导致的性能受限问题,这一问题主要源于真实世界数据收集的高成本和复杂性。论文提出的解决方案的关键在于通过跨模态迁移生成虚拟惯性测量单元(Inertial Measurement Unit, IMU)数据,以弥补真实数据或传统传感器级数据增强方法的不足。研究对比了基于视频和基于语言的两种虚拟IMU生成方法,并与经典的数据增强技术进行了直接比较,验证了虚拟IMU数据在有限数据条件下的显著性能提升。
链接: https://arxiv.org/abs/2506.07612
作者: Zikang Leng,Archith Iyer,Thomas Plötz
机构: Georgia Institute of Technology(佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human activity recognition (HAR) is often limited by the scarcity of labeled datasets due to the high cost and complexity of real-world data collection. To mitigate this, recent work has explored generating virtual inertial measurement unit (IMU) data via cross-modality transfer. While video-based and language-based pipelines have each shown promise, they differ in assumptions and computational cost. Moreover, their effectiveness relative to traditional sensor-level data augmentation remains unclear. In this paper, we present a direct comparison between these two virtual IMU generation approaches against classical data augmentation techniques. We construct a large-scale virtual IMU dataset spanning 100 diverse activities from Kinetics-400 and simulate sensor signals at 22 body locations. The three data generation strategies are evaluated on benchmark HAR datasets (UTD-MHAD, PAMAP2, HAD-AW) using four popular models. Results show that virtual IMU data significantly improves performance over real or augmented data alone, particularly under limited-data conditions. We offer practical guidance on choosing data generation strategies and highlight the distinct advantages and disadvantages of each approach.
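作为对比基线的“传统传感器级数据增强”,常见做法是对 IMU 时间序列施加抖动、幅值缩放与随机旋转;下面用 NumPy 给出这三种增强的极简示意(σ 等参数为经验性假设,与论文设定无关)。

```python
import numpy as np

def jitter(x, sigma=0.05):
    """加性高斯噪声:x 形状为 (T, 3) 的加速度/陀螺仪序列。"""
    return x + np.random.normal(0.0, sigma, x.shape)

def scaling(x, sigma=0.1):
    """按通道的随机幅值缩放。"""
    factor = np.random.normal(1.0, sigma, (1, x.shape[1]))
    return x * factor

def rotation(x):
    """绕随机轴的随机三维旋转,模拟传感器佩戴朝向差异。"""
    axis = np.random.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.random.uniform(-np.pi, np.pi)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)  # Rodrigues 公式
    return x @ R.T

acc = np.random.randn(200, 3)  # 假设 100Hz、2 秒的三轴加速度
aug = rotation(scaling(jitter(acc)))
print(aug.shape)  # (200, 3)
```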
zh
[CV-75] DragNeXt: Rethinking Drag-Based Image Editing
【速读】:该论文试图解决Drag-Based Image Editing (DBIE)中存在的两个关键问题:一是基于点的拖拽操作通常存在高度歧义且难以与用户意图对齐;二是现有DBIE方法主要依赖于运动监督与点跟踪的交替进行,不仅繁琐且难以生成高质量结果。解决方案的关键在于重新定义DBIE为用户指定的控制区域的形变、旋转和平移,并提出一种简单而有效的编辑框架——DragNeXt,该框架将DBIE统一为一个潜在区域优化(Latent Region Optimization, LRO)问题,并通过渐进式反向自我干预(Progressive Backward Self-Intervention, PBSI)进行求解,从而简化整体流程并提升编辑质量。
链接: https://arxiv.org/abs/2506.07611
作者: Yuan Zhou,Junbao Zhou,Qingshan Xu,Kesen Zhao,Yuxuan Wang,Hao Fei,Richang Hong,Hanwang Zhang
机构: Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Drag-Based Image Editing (DBIE), which allows users to manipulate images by directly dragging objects within them, has recently attracted much attention from the community. However, it faces two key challenges: (i) point-based drag is often highly ambiguous and difficult to align with users’ intentions; (ii) current DBIE methods primarily rely on alternating between motion supervision and point tracking, which is not only cumbersome but also fails to produce high-quality results. These limitations motivate us to explore DBIE from a new perspective – redefining it as deformation, rotation, and translation of user-specified handle regions. Thereby, by requiring users to explicitly specify both drag areas and types, we can effectively address the ambiguity issue. Furthermore, we propose a simple-yet-effective editing framework, dubbed DragNeXt. It unifies DBIE as a Latent Region Optimization (LRO) problem and solves it through Progressive Backward Self-Intervention (PBSI), simplifying the overall procedure of DBIE while further enhancing quality by fully leveraging region-level structure information and progressive guidance from intermediate drag states. We validate DragNeXt on our NextBench, and extensive experiments demonstrate that our proposed method can significantly outperform existing approaches. Code will be released on github.
zh
[CV-76] SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis
【速读】:该论文旨在解决手术视频理解领域中由于缺乏大规模、多样化的预训练数据集和系统性评估体系而阻碍手术视频基础模型(foundation models, FMs)发展的问题。其解决方案的关键在于提出一个统一的手术视频基准框架——SurgBench,包含预训练数据集SurgBench-P和评估基准SurgBench-E,通过覆盖广泛手术场景和任务,提升现有视频FMs在不同手术分析任务中的泛化能力。
链接: https://arxiv.org/abs/2506.07603
作者: Jianhui Wei,Zikai Xiao,Danyu Sun,Luqi Gong,Zongxin Yang,Zuozhu Liu,Jian Wu
机构: Zhejiang University (浙江大学); Zhejiang Lab (浙江省实验室); Harvard University (哈佛大学); Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence (浙江省医学影像人工智能重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Surgical video understanding is pivotal for enabling automated intraoperative decision-making, skill assessment, and postoperative quality improvement. However, progress in developing surgical video foundation models (FMs) remains hindered by the scarcity of large-scale, diverse datasets for pretraining and systematic evaluation. In this paper, we introduce SurgBench, a unified surgical video benchmarking framework comprising a pretraining dataset, SurgBench-P, and an evaluation benchmark, SurgBench-E. SurgBench offers extensive coverage of diverse surgical scenarios, with SurgBench-P encompassing 53 million frames across 22 surgical procedures and 11 specialties, and SurgBench-E providing robust evaluation across six categories (phase classification, camera motion, tool recognition, disease diagnosis, action classification, and organ detection) spanning 72 fine-grained tasks. Extensive experiments reveal that existing video FMs struggle to generalize across varied surgical video analysis tasks, whereas pretraining on SurgBench-P yields substantial performance improvements and superior cross-domain generalization to unseen procedures and modalities. Our dataset and code are available upon request.
zh
[CV-77] SceneRAG : Scene-level Retrieval-Augmented Generation for Video Understanding
【速读】:该论文试图解决长视频内容理解中由于视频数据规模庞大和复杂性高而导致的挑战,特别是现有检索增强生成(RAG)方法通过固定长度片段分割视频时破坏了上下文连续性并无法准确捕捉场景边界的问题。解决方案的关键在于提出SceneRAG框架,该框架利用大语言模型结合自动语音识别(ASR)转录文本和时间元数据,将视频分割为叙事一致的场景,并通过轻量级启发式方法和迭代修正优化初始边界。此外,SceneRAG通过融合视觉与文本模态信息提取实体关系,动态构建知识图谱,从而实现考虑长程依赖的鲁棒多跳检索与生成。
链接: https://arxiv.org/abs/2506.07600
作者: Nianbo Zeng,Haowen Hou,Fei Richard Yu,Si Shi,Ying Tiffany He
机构: Guangdong Laboratory of Artificial Intelligence and Digital Economy (广东省人工智能与数字经济发展实验室); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations and dynamically builds a knowledge graph, enabling robust multi-hop retrieval and generation that account for long-range dependencies. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines, achieving a win rate of up to 72.5 percent on generation tasks.
zh
[CV-78] Explore the vulnerability of black-box models via diffusion models
【速读】:该论文试图解决扩散模型在安全和隐私方面带来的新威胁,即攻击者利用扩散模型API生成合成图像,并以此训练高性能的替代模型,从而对黑盒分类模型执行模型提取和基于迁移的对抗攻击问题。解决方案的关键在于通过生成高分辨率且多样化的合成图像来训练替代模型,使其输出能够紧密匹配目标模型的行为,从而在极少查询次数下实现高效的对抗攻击。
链接: https://arxiv.org/abs/2506.07590
作者: Jiacheng Shi,Yanfu Zhang,Huajie Shao,Ashley Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Recent advancements in diffusion models have enabled high-fidelity and photorealistic image generation across diverse applications. However, these models also present security and privacy risks, including copyright violations, sensitive information leakage, and the creation of harmful or offensive content that could be exploited maliciously. In this study, we uncover a novel security threat where an attacker leverages diffusion model APIs to generate synthetic images, which are then used to train a high-performing substitute model. This enables the attacker to execute model extraction and transfer-based adversarial attacks on black-box classification models with minimal queries, without needing access to the original training data. The generated images are sufficiently high-resolution and diverse to train a substitute model whose outputs closely match those of the target model. Across seven benchmarks, including CIFAR and ImageNet subsets, our method shows an average improvement of 27.37% over state-of-the-art methods while using only 0.01× the query budget, achieving a 98.68% success rate in adversarial attacks on the target model.
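摘要所述攻击本质上是一次“黑盒蒸馏”:用扩散模型 API 生成的合成图像查询目标分类器,以其输出分布为软标签训练替代模型,随后在替代模型上构造可迁移的对抗样本。以下为该训练环节的示意代码,`target_model`、`substitute`、`synth_loader` 均为占位假设:

```python
import torch
import torch.nn.functional as F

def train_substitute(substitute, target_model, synth_loader,
                     epochs=5, lr=1e-4, T=4.0):
    """用目标模型的软标签蒸馏训练替代模型(示意)。"""
    opt = torch.optim.Adam(substitute.parameters(), lr=lr)
    target_model.eval()
    for _ in range(epochs):
        for images in synth_loader:          # 扩散模型生成的合成图像批
            with torch.no_grad():            # 黑盒查询:只拿到输出分布
                soft = F.softmax(target_model(images) / T, dim=-1)
            logp = F.log_softmax(substitute(images) / T, dim=-1)
            loss = F.kl_div(logp, soft, reduction="batchmean") * T * T
            opt.zero_grad()
            loss.backward()
            opt.step()
    return substitute
```

训练完成后,攻击者可在替代模型上用 PGD 等白盒方法生成对抗样本,再迁移到目标模型。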
zh
[CV-79] Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
【速读】:该论文试图解决多模态基础模型在视频理解任务中缺乏深层次多模态交互的问题,这限制了其对复杂目标运动和多样化视频场景的理解能力。解决方案的关键在于提出一种统一的Super Encoding Network (SEN),通过递归关联基础模型中的多模态编码器,构建出具有区分性的多模态交互机制。具体而言,该方法将预训练好的编码器视为“超级神经元”,并设计了递归关联(Recursive Association, RA)模块,以递归方式融合多模态信息,从而有效编码更深层次的多模态交互,提升下游视频理解任务的性能。
链接: https://arxiv.org/abs/2506.07576
作者: Boyu Chen,Siran Chen,Kunchang Li,Qinglin Xu,Yu Qiao,Yali Wang
机构: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院); School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学); Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video understanding has been considered as one critical step towards world modeling, which is an important long-term problem in AI research. Recently, multi-modal foundation models have shown such potential via large-scale pretraining. However, these models simply align encoders of different modalities via contrastive learning, while lacking deeper multi-modal interactions, which is critical for understanding complex target movements with diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through recursive association of multi-modal encoders in the foundation models. Specifically, we creatively treat those well-trained encoders as “super neurons” in our SEN. Via designing a Recursive Association (RA) block, we progressively fuse multi-modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multi-modal interactions, for prompting various downstream video understanding tasks. Extensive experiments show that our SEN can remarkably boost the four most representative video tasks, including tracking, recognition, chatting, and editing; e.g., for pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8% compared to the popular CaDeX++ approach. For one-shot video editing, textual alignment improves by 6.4%, and frame consistency increases by 4.1% compared to the popular Tune-A-Video approach.
zh
[CV-80] Uncertainty-o: One Model-agnostic Framework for Unveiling Uncertainty in Large Multimodal Models
【速读】:该论文试图解决大型多模态模型(Large Multimodal Models, LMMs)在不确定性评估方面的三个关键问题:如何统一评估不同LMM的不确定性、如何引导LMM展示其不确定性以及如何为下游任务量化不确定性。解决方案的关键在于提出了一种模型无关的框架——Uncertainty-o,该框架能够揭示不同模态、架构和能力的LMM中的不确定性,并通过实证研究多模态提示扰动来探索LMM的不确定性,同时推导出多模态语义不确定性的公式,从而实现从多模态响应中量化不确定性。
链接: https://arxiv.org/abs/2506.07575
作者: Ruiyang Zhang,Hu Zhang,Hao Fei,Zhedong Zheng
机构: University of Macau(澳门大学); CSIRO Data61(澳大利亚联邦科学与工业研究组织数据61); National University of Singapore(新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Large Multimodal Models (LMMs), harnessing the complementarity among diverse modalities, are often considered more robust than pure Large Language Models (LLMs); yet do LMMs know what they do not know? There are three key open questions remaining: (1) how to evaluate the uncertainty of diverse LMMs in a unified manner, (2) how to prompt LMMs to show their uncertainty, and (3) how to quantify uncertainty for downstream tasks. In an attempt to address these challenges, we introduce Uncertainty-o: (1) a model-agnostic framework designed to reveal uncertainty in LMMs regardless of their modalities, architectures, or capabilities, (2) an empirical exploration of multimodal prompt perturbations to uncover LMM uncertainty, offering insights and findings, and (3) a derivation of the formulation of multimodal semantic uncertainty, which enables quantifying uncertainty from multimodal responses. Experiments across 18 benchmarks spanning various modalities and 10 LMMs (both open- and closed-source) demonstrate the effectiveness of Uncertainty-o in reliably estimating LMM uncertainty, thereby enhancing downstream tasks such as hallucination detection, hallucination mitigation, and uncertainty-aware Chain-of-Thought reasoning.
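“多模态语义不确定性”的量化思路可概括为:对同一输入在提示扰动下采样多个回答,按语义等价性聚类,再对簇分布求熵。以下示意代码中的等价判定 `is_equiv` 为占位假设(实际可用 NLI 模型或 LLM 判定),与论文的公式细节可能有出入:

```python
import math
from collections import Counter

def semantic_uncertainty(responses, is_equiv):
    """把回答按语义等价性聚类,返回簇分布的熵(越高越不确定)。"""
    clusters = []   # 每个簇用其代表性回答表示
    labels = []
    for r in responses:
        for ci, rep in enumerate(clusters):
            if is_equiv(r, rep):
                labels.append(ci)
                break
        else:
            clusters.append(r)
            labels.append(len(clusters) - 1)
    counts = Counter(labels)
    n = len(responses)
    return -sum(c / n * math.log(c / n) for c in counts.values())

# 用法示意:对提示扰动下采样得到的回答做不确定性估计
answers = ["a cat", "a kitten", "a dog", "a cat"]
naive_equiv = lambda a, b: a.split()[-1] == b.split()[-1]  # 玩具级等价判定
print(semantic_uncertainty(answers, naive_equiv))  # ≈ 1.04
```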
zh
[CV-81] LLM -driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization
【速读】:该论文旨在解决室内布局生成中的空间不一致性和泛化能力不足的问题,尤其是在传统提示驱动方法和基于学习的方法中存在计算成本高、关系图粗略及数据集有限等局限。其解决方案的关键在于构建一个大规模的合成数据集3D-SynthPlace,并引入OptiScene,这是一个针对室内布局生成优化的大型语言模型(Large Language Model, LLM),通过两阶段训练策略——监督微调(Supervised Fine-Tuning, SFT)与多轮直接偏好优化(Multi-turn Direct Preference Optimization, DPO)——显著提升了布局质量和生成成功率。
链接: https://arxiv.org/abs/2506.07570
作者: Yixuan Yang,Zhen Luo,Tongsheng Ding,Junru Lu,Mingqi Gao,Jinyu Yang,Victor Sanchez,Feng Zheng
机构: Southern University of Science and Technology (南方科技大学); Shanghai Innovation Institute (上海创新研究院); University of Warwick (华威大学); Tapall.ai (Tapall.ai)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic indoor layout generation has attracted increasing attention due to its potential in interior design, virtual environment construction, and embodied AI. Existing methods fall into two categories: prompt-driven approaches that leverage proprietary LLM services (e.g., GPT APIs) and learning-based methods trained on layout data upon diffusion-based models. Prompt-driven methods often suffer from spatial inconsistency and high computational costs, while learning-based methods are typically constrained by coarse relational graphs and limited datasets, restricting their generalization to diverse room categories. In this paper, we revisit LLM-based indoor layout generation and present 3D-SynthPlace, a large-scale dataset that combines synthetic layouts generated via a ‘GPT synthesize, Human inspect’ pipeline, upgraded from the 3D-Front dataset. 3D-SynthPlace contains nearly 17,000 scenes, covering four common room types – bedroom, living room, kitchen, and bathroom – enriched with diverse objects and high-level spatial annotations. We further introduce OptiScene, a strong open-source LLM optimized for indoor layout generation, fine-tuned based on our 3D-SynthPlace dataset through our two-stage training. For the warm-up stage I, we adopt supervised fine-tuning (SFT), which teaches the model to first generate high-level spatial descriptions and then conditionally predict concrete object placements. For the reinforcement stage II, to better align the generated layouts with human design preferences, we apply multi-turn direct preference optimization (DPO), which significantly improves layout quality and generation success rates. Extensive experiments demonstrate that OptiScene outperforms traditional prompt-driven and learning-based baselines. Moreover, OptiScene shows promising potential in interactive tasks such as scene editing and robot navigation.
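第二阶段使用的 DPO 可以用标准形式示意:最大化“被偏好布局相对被拒绝布局”在策略模型与参考模型之间对数比之差的 log-sigmoid。以下为通用 DPO 损失的 PyTorch 草图(β 取值与多轮组织方式均为假设,非论文细节):

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """标准 DPO 损失(示意)。输入为各布局序列的对数似然,形状 (batch,)。"""
    ratio_chosen = logp_chosen - ref_logp_chosen      # 策略相对参考模型的对数比
    ratio_rejected = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (ratio_chosen - ratio_rejected)).mean()

# 用法示意:chosen 为人工检查通过的布局,rejected 为质量较差的布局
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-15.0]))
print(loss)
```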
zh
[CV-82] owards the Influence of Text Quantity on Writer Retrieval ICDAR2025
【速读】:该论文试图解决手写文本中作者检索(writer retrieval)的问题,即根据手写相似性在数据集中识别由同一人撰写的文档。其解决方案的关键在于评估文本数量对作者检索性能的影响,并通过线级和词级检索来探索不同文本量下的表现。研究分析了三种先进的作者检索系统,包括基于手工特征和深度学习的方法,并发现当至少包含四行文本时,检索准确率仍能保持在全页性能的90%以上,同时表明在低文本场景下,依赖文本的检索方法仍能维持较强的性能,而传统手工特征在该场景下存在明显局限。
链接: https://arxiv.org/abs/2506.07566
作者: Marco Peer,Robert Sablatnig,Florian Kleber
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted for ICDAR2025
Abstract:This paper investigates the task of writer retrieval, which identifies documents authored by the same individual within a dataset based on handwriting similarities. While existing datasets and methodologies primarily focus on page-level retrieval, we explore the impact of text quantity on writer retrieval performance by evaluating line- and word-level retrieval. We examine three state-of-the-art writer retrieval systems, including both handcrafted and deep learning-based approaches, and analyze their performance using varying amounts of text. Our experiments on the CVL and IAM datasets demonstrate that while performance decreases by 20-30% when only one line of text is used as query and gallery, retrieval accuracy remains above 90% of full-page performance when at least four lines are included. We further show that text-dependent retrieval can maintain strong performance in low-text scenarios. Our findings also highlight the limitations of handcrafted features in low-text scenarios, with deep learning-based methods like NetVLAD outperforming traditional VLAD encoding.
zh
[CV-83] OpenDance: Multimodal Controllable 3D Dance Generation Using Large-scale Internet Data
【速读】:该论文旨在解决音乐驱动舞蹈生成中的可控性和多样性不足问题,这些问题主要源于缺乏细粒度的多模态数据以及难以实现灵活的多条件生成。其解决方案的关键在于构建OpenDance5D数据集,该数据集包含超过101小时的14种不同风格的人类舞蹈数据,每条样本包含五种模态(RGB视频、音频、2D关键点、3D运动和细粒度文本描述),以支持鲁棒的跨模态学习;同时提出OpenDanceNet,这是一个统一的掩码建模框架,用于在音乐和任意组合的文本提示、关键点或角色定位条件下进行可控的舞蹈生成。
链接: https://arxiv.org/abs/2506.07565
作者: Jinlu Zhang,Zixi Kang,Yizhou Wang
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Music-driven dance generation offers significant creative potential yet faces considerable challenges. The absence of fine-grained multimodal data and the difficulty of flexible multi-conditional generation limit previous works on generation controllability and diversity in practice. In this paper, we build OpenDance5D, an extensive human dance dataset comprising over 101 hours across 14 distinct genres. Each sample has five modalities to facilitate robust cross-modal learning: RGB video, audio, 2D keypoints, 3D motion, and fine-grained textual descriptions from human arts. Furthermore, we propose OpenDanceNet, a unified masked modeling framework for controllable dance generation conditioned on music and arbitrary combinations of text prompts, keypoints, or character positioning. Comprehensive experiments demonstrate that OpenDanceNet achieves high-fidelity and flexible controllability.
zh
[CV-84] Cross-channel Perception Learning for HE-to-IHC Virtual Staining
【速读】:该论文试图解决现有HE-to-IHC(Hematoxylin and Eosin to Immunohistochemistry)研究中忽视细胞核与细胞膜之间跨通道相关性的问题。其解决方案的关键在于提出一种新颖的Cross-Channel Perception Learning(CCPL)策略,通过分解HER2免疫组化染色为Hematoxylin和DAB染色通道,分别对应细胞核和细胞膜,并利用Gigapath的Tile Encoder提取双通道特征,测量核与膜之间的跨通道相关性,同时通过特征蒸馏损失增强模型的特征提取能力,确保生成图像在病理特征和染色分布上与真实图像一致。
链接: https://arxiv.org/abs/2506.07559
作者: Hao Yang,JianYu Wu,Run Fang,Xuelian Zhao,Yuan Ji,Zhiyu Chen,Guibin He,Junceng Guo,Yang Liu,Xinhua Zeng
机构: Fudan University (复旦大学); Zhongshan Hospital, Fudan University (复旦大学附属中山医院); Yiwu Research Institute, Fudan University (复旦大学义乌研究院); The University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid development of digital pathology, virtual staining has become a key technology in multimedia medical information systems, offering new possibilities for the analysis and diagnosis of pathological images. However, existing HE-to-IHC studies often overlook the cross-channel correlations between cell nuclei and cell membranes. To address this issue, we propose a novel Cross-Channel Perception Learning (CCPL) strategy. Specifically, CCPL first decomposes HER2 immunohistochemical staining into Hematoxylin and DAB staining channels, corresponding to cell nuclei and cell membranes, respectively. Using the pathology foundation model Gigapath’s Tile Encoder, CCPL extracts dual-channel features from both the generated and real images and measures cross-channel correlations between nuclei and membranes. The features of the generated and real stained images, obtained through the Tile Encoder, are also used to calculate feature distillation loss, enhancing the model’s feature extraction capabilities without increasing the inference burden. Additionally, CCPL performs statistical analysis on the focal optical density maps of both single channels to ensure consistency in staining distribution and intensity. Experimental results, based on quantitative metrics such as PSNR, SSIM, PCC, and FID, along with professional evaluations from pathologists, demonstrate that CCPL effectively preserves pathological features, generates high-quality virtual stained images, and provides robust support for automated pathological diagnosis using multimedia medical data.
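摘要中“把 HER2 IHC 染色分解为 Hematoxylin 与 DAB 通道”可以用经典的颜色反卷积完成,scikit-image 的 `rgb2hed` 即提供该变换;在分解出的核/膜通道上即可度量跨通道相关性。下面的示意用皮尔逊相关代替论文中的具体度量(属笔者假设):

```python
import numpy as np
from skimage.color import rgb2hed

def nucleus_membrane_correlation(ihc_rgb):
    """对 IHC 图像做 H/DAB 颜色反卷积,并计算两通道的皮尔逊相关(示意)。"""
    hed = rgb2hed(ihc_rgb)             # (H, W, 3): Hematoxylin, Eosin, DAB
    hematoxylin = hed[..., 0].ravel()  # 细胞核通道
    dab = hed[..., 2].ravel()          # 细胞膜(HER2/DAB)通道
    return np.corrcoef(hematoxylin, dab)[0, 1]

img = np.random.rand(64, 64, 3)  # 占位图像;实际应为 0-1 范围的 IHC RGB
print(nucleus_membrane_correlation(img))
```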
zh
[CV-85] Synthesize Privacy-Preserving High-Resolution Images via Private Textual Intermediaries
【速读】:该论文试图解决在不泄露个体隐私的前提下,生成高质量、高分辨率的差分隐私(Differential Privacy, DP)合成图像的问题。现有方法在生成高分辨率图像时难以准确捕捉原始数据的结构。解决方案的关键在于将DP图像合成任务从图像域转移到文本域,通过利用先进的DP文本生成方法,首先使用图像到文本模型对每张私有图像进行简洁的文本描述,然后应用改进的Private Evolution算法生成DP文本,最后通过文本到图像模型重建图像。该方法无需模型训练,仅需使用现成模型进行推理,从而实现了更高质量的DP合成图像生成。
链接: https://arxiv.org/abs/2506.07555
作者: Haoxiang Wang,Zinan Lin,Da Yu,Huishuai Zhang
机构: Peking University (北京大学); Microsoft Research (微软研究院); Google Research (谷歌研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generating high-fidelity, differentially private (DP) synthetic images offers a promising route to share and analyze sensitive visual data without compromising individual privacy. However, existing DP image synthesis methods struggle to produce high-resolution outputs that faithfully capture the structure of the original data. In this paper, we introduce a novel method, referred to as Synthesis via Private Textual Intermediaries (SPTI), that can generate high-resolution DP images with easy adoption. The key idea is to shift the challenge of DP image synthesis from the image domain to the text domain by leveraging state-of-the-art DP text generation methods. SPTI first summarizes each private image into a concise textual description using image-to-text models, then applies a modified Private Evolution algorithm to generate DP text, and finally reconstructs images using text-to-image models. Notably, SPTI requires no model training, only inference with off-the-shelf models. Given a private dataset, SPTI produces synthetic images of substantially higher quality than prior DP approaches. On the LSUN Bedroom dataset, SPTI attains an FID ≤ 26.71 under ε = 1.0, improving over Private Evolution’s FID of 40.36. Similarly, on MM-CelebA-HQ, SPTI achieves an FID ≤ 33.27 at ε = 1.0, compared to 57.01 from DP fine-tuning baselines. Overall, our results demonstrate that Synthesis via Private Textual Intermediaries provides a resource-efficient framework, compatible with proprietary models, for generating high-resolution DP synthetic images, greatly expanding access to private visual datasets.
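SPTI 无需训练模型,整条流水线只是三次现成模型推理的串联;以下草图用占位回调表示三个组件,`dp_private_evolution` 对应论文中改造过的 Private Evolution 步骤,接口签名均为笔者假设:

```python
def spti_pipeline(private_images, image_to_text, dp_private_evolution,
                  text_to_image, epsilon=1.0):
    """Synthesis via Private Textual Intermediaries 的流程示意。

    image_to_text(img) -> str                          : 现成的图像描述模型
    dp_private_evolution(captions, epsilon) -> list[str]: DP 文本生成步骤
    text_to_image(caption) -> image                    : 现成的文生图模型
    """
    # 1) 把每张私有图像摘要成一段简洁文本描述
    captions = [image_to_text(img) for img in private_images]
    # 2) 在文本域施加差分隐私,得到 DP 合成描述
    dp_captions = dp_private_evolution(captions, epsilon)
    # 3) 用文生图模型重建高分辨率合成图像
    return [text_to_image(c) for c in dp_captions]
```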
zh
[CV-86] APTOS-2024 challenge report: Generation of synthetic 3D OCT images from fundus photographs
【速读】:该论文旨在解决如何将2D色斑眼底图像(color fundus photography)转换为3D光学相干断层扫描(Optical Coherence Tomography, OCT)图像的问题,以降低OCT设备的依赖性并提升眼科诊疗的可及性。其关键解决方案在于利用生成式人工智能(Generative AI)技术,通过创新的数据预处理与增强方法、外部眼科影像数据集的预训练、视觉基础模型的整合以及模型架构的优化,实现跨模态的高质量图像合成,从而在保持生物信息一致性的前提下,生成具有临床价值的3D OCT图像。
链接: https://arxiv.org/abs/2506.07542
作者: Bowen Liu,Weiyi Zhang,Peranut Chotcomwongse,Xiaolan Chen,Ruoyu Chen,Pawin Pakaymaskul,Niracha Arjkongharn,Nattaporn Vongsa,Xuelian Cheng,Zongyuan Ge,Kun Huang,Xiaohui Li,Yiru Duan,Zhenbang Wang,BaoYe Xie,Qiang Chen,Huazhu Fu,Michael A. Mahr,Jiaqi Qu,Wangyiyang Chen,Shiye Wang,Yubo Tan,Yongjie Li,Mingguang He,Danli Shi,Paisan Ruamviboonsuk
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Optical Coherence Tomography (OCT) provides high-resolution, 3D, and non-invasive visualization of retinal layers in vivo, serving as a critical tool for lesion localization and disease diagnosis. However, its widespread adoption is limited by equipment costs and the need for specialized operators. In comparison, 2D color fundus photography offers faster acquisition and greater accessibility with less dependence on expensive devices. Although generative artificial intelligence has demonstrated promising results in medical image synthesis, translating 2D fundus images into 3D OCT images presents unique challenges due to inherent differences in data dimensionality and biological information between modalities. To advance generative models in the fundus-to-3D-OCT setting, the Asia Pacific Tele-Ophthalmology Society (APTOS-2024) organized a challenge titled Artificial Intelligence-based OCT Generation from Fundus Images. This paper details the challenge framework (referred to as APTOS-2024 Challenge), including: the benchmark dataset, evaluation methodology featuring two fidelity metrics-image-based distance (pixel-level OCT B-scan similarity) and video-based distance (semantic-level volumetric consistency), and analysis of top-performing solutions. The challenge attracted 342 participating teams, with 42 preliminary submissions and 9 finalists. Leading methodologies incorporated innovations in hybrid data preprocessing or augmentation (cross-modality collaborative paradigms), pre-training on external ophthalmic imaging datasets, integration of vision foundation models, and model architecture improvement. The APTOS-2024 Challenge is the first benchmark demonstrating the feasibility of fundus-to-3D-OCT synthesis as a potential solution for improving ophthalmic care accessibility in under-resourced healthcare settings, while helping to expedite medical research and clinical applications.
zh
[CV-87] Domain Randomization for Object Detection in Manufacturing Applications using Synthetic Data: A Comprehensive Study ICRA
【速读】:该论文旨在解决在制造场景中使用合成数据进行目标检测时的域随机化问题,以提升模拟到现实(sim-to-real)的迁移效果。其关键解决方案是构建一个全面的数据生成流程,涵盖物体特征、背景、光照、相机设置和后处理等多个因素,并引入了Synthetic Industrial Parts Object Detection dataset (SIP15-OD)作为测试基准。通过分析材料属性、渲染方法、后处理和干扰物等重要因素,该方法在仅使用合成数据训练的Yolov8模型上取得了优异性能,验证了所提出域随机化策略的有效性。
链接: https://arxiv.org/abs/2506.07539
作者: Xiaomeng Zhu,Jacob Henningsson,Duruo Li,Pär Mårtensson,Lars Hanson,Mårten Björkman,Atsuto Maki
机构: Scania CV AB (斯堪尼亚卡客车公司); Skövde University (锡格弗德大学); KTH Royal Institute of Technology (瑞典皇家理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This is accepted by 2025 IEEE International Conference on Robotics Automation (ICRA), waiting for publication. 14 pages, 14 figures
Abstract:This paper addresses key aspects of domain randomization in generating synthetic data for manufacturing object detection applications. To this end, we present a comprehensive data generation pipeline that reflects different factors: object characteristics, background, illumination, camera settings, and post-processing. We also introduce the Synthetic Industrial Parts Object Detection dataset (SIP15-OD) consisting of 15 objects from three industrial use cases under varying environments as a test bed for the study, while also employing an industrial dataset publicly available for robotic applications. In our experiments, we present more abundant results and insights into the feasibility as well as challenges of sim-to-real object detection. In particular, we identified material properties, rendering methods, post-processing, and distractors as important factors. Our method, leveraging these, achieves top performance on the public dataset with Yolov8 models trained exclusively on synthetic data; mAP@50 scores of 96.4% for the robotics dataset, and 94.1%, 99.5%, and 95.3% across three of the SIP15-OD use cases, respectively. The results showcase the effectiveness of the proposed domain randomization, potentially covering the distribution close to real data for the applications.
zh
[CV-88] MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts ACL2025
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在长上下文推理优化中面临的高内存消耗问题,特别是键值(Key-Value, KV)缓存的内存占用问题。其解决方案的关键在于提出一种新的混合精度量化方法——MoQAE(Mixture of Quantization-Aware Experts),通过将不同的量化位宽配置视为专家,并利用传统专家混合(Mixture of Experts, MoE)方法选择最优配置,同时采用分块输入路由器以提高效率,并设计轻量级仅路由微调过程以平衡模型精度与内存使用,最终通过路由冻结和路由共享机制进一步降低推理开销。
链接: https://arxiv.org/abs/2506.07533
作者: Wei Tao,Haocheng Lu,Xiaoyang Qu,Bin Zhang,Kai Lu,Jiguang Wan,Jianzong Wang
机构: Huazhong University of Science and Technology (华中科技大学); Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
Abstract:One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
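“按块路由 + 混合精度量化 KV 缓存”的核心流程可示意如下:把 token 序列按块切分,由轻量路由器依据块级特征选出位宽专家,再对该块做对称均匀量化。位宽集合、路由器结构与块大小均为笔者假设(论文中的路由器经过专门微调,且另有路由冻结/共享机制):

```python
import torch
import torch.nn as nn

BITS = [2, 4, 8]  # 候选位宽“专家”(假设)

def uniform_quantize(x, bits):
    """对称均匀量化到给定位宽(示意)。"""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax, qmax) * scale

class ChunkRouter(nn.Module):
    """逐块(而非逐 token)为 KV 缓存选择量化位宽的轻量路由器。"""
    def __init__(self, dim, chunk=64):
        super().__init__()
        self.chunk = chunk
        self.score = nn.Linear(dim, len(BITS))

    def forward(self, kv):                      # kv: (T, D)
        out = []
        for s in range(0, kv.shape[0], self.chunk):
            block = kv[s:s + self.chunk]
            # 用块内均值特征打分,为该块选出位宽专家(推理时取 argmax)
            choice = self.score(block.mean(dim=0)).argmax().item()
            out.append(uniform_quantize(block, BITS[choice]))
        return torch.cat(out, dim=0)

kv = torch.randn(256, 128)
print(ChunkRouter(128)(kv).shape)  # torch.Size([256, 128])
```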
zh
[CV-89] BitVLA: 1-bit Vision-Language-Action Models for Robotics Manipulation
【速读】:该论文旨在解决大规模Vision-Language-Action (VLA)模型在资源受限的机器人系统上部署时面临的内存和计算效率问题。其关键解决方案是提出BitVLA,一个首次将所有参数量化为三元值(-1, 0, 1)的1-bit VLA模型,并通过蒸馏感知训练策略将视觉编码器的权重压缩至1.58-bit,从而显著降低内存占用,同时保持较高的任务性能。
链接: https://arxiv.org/abs/2506.07530
作者: Hongyu Wang,Chuyan Xiong,Ruiping Wang,Xilin Chen
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Work in progress
Abstract:Vision-Language-Action (VLA) models have shown impressive capabilities across a wide range of robotics manipulation tasks. However, their growing model size poses significant challenges for deployment on resource-constrained robotic systems. While 1-bit pretraining has proven effective for enhancing the inference efficiency of large language models with minimal performance loss, its application to VLA models remains underexplored. In this work, we present BitVLA, the first 1-bit VLA model for robotics manipulation, in which every parameter is ternary, i.e., -1, 0, 1. To further reduce the memory footprint of the vision encoder, we propose the distillation-aware training strategy that compresses the full-precision encoder to 1.58-bit weights. During this process, a full-precision encoder serves as a teacher model to better align latent representations. Despite the lack of large-scale robotics pretraining, BitVLA achieves performance comparable to the state-of-the-art model OpenVLA-OFT with 4-bit post-training quantization on the LIBERO benchmark, while consuming only 29.8% of the memory. These results highlight BitVLA’s promise for deployment on memory-constrained edge devices. We release the code and model weights at this https URL.
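“所有参数三值化为 {-1, 0, 1}”常用 BitNet b1.58 式的 absmean 量化实现:权重除以其绝对值均值后取整并截断;训练时配合直通估计(STE)回传梯度。以下为该量化函数的通用示意(是否与 BitVLA 完全一致以原文为准):

```python
import torch

def ternary_quantize(w, eps=1e-8):
    """absmean 三值量化:返回 {-1, 0, 1} 权重及逐张量缩放因子(示意)。"""
    scale = w.abs().mean().clamp(min=eps)
    q = (w / scale).round().clamp(-1, 1)
    return q, scale

def ste_forward(w):
    """直通估计(STE):前向使用量化权重,反向把梯度直接传给全精度 w。"""
    q, scale = ternary_quantize(w)
    return w + (q * scale - w).detach()

w = torch.randn(4, 4, requires_grad=True)
q, s = ternary_quantize(w)
print(torch.unique(q))  # 取值限于 {-1., 0., 1.}
```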
zh
[CV-90] Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency
【速读】:该论文试图解决多视角驾驶视频与LiDAR序列的联合生成问题,旨在实现时空一致性和跨模态一致性。其解决方案的关键在于提出一个两阶段架构,将基于DiT的视频扩散模型与3D-VAE编码相结合,并引入一个BEV感知的LiDAR生成器,该生成器结合了NeRF渲染和自适应采样技术。通过共享潜在空间直接耦合两种模态,实现了视觉与几何域间的连贯演化,并利用DataCrafter模块提供场景级和实例级的语义监督,以增强生成数据的语义保真度。
链接: https://arxiv.org/abs/2506.07497
作者: Xiangyu Guo,Zhanqian Wu,Kaixin Xiong,Ziyang Xu,Lijun Zhou,Gangwei Xu,Shaoqing Xu,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
zh
[CV-91] SpatialLM : Training Large Language Models for Structured Indoor Modeling
【速读】:该论文旨在解决如何提升现代大型语言模型(Large Language Model, LLM)在增强现实和具身机器人等应用中的空间理解能力问题。其关键解决方案是设计并训练一个名为SpatialLM的模型,该模型基于标准多模态LLM架构,并直接从开源LLM进行微调,以处理3D点云数据并生成结构化的3D场景理解输出,如墙体、门、窗及带有语义类别的定向物体框。
链接: https://arxiv.org/abs/2506.07491
作者: Yongsen Mao,Junhao Zhong,Chuan Fang,Jia Zheng,Rui Tang,Hao Zhu,Ping Tan,Zihan Zhou
机构: Manycore Tech Inc. (Manycore 技术公司); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:SpatialLM is a large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object boxes with their semantic categories. Unlike previous methods which exploit task-specific network designs, our model adheres to the standard multimodal LLM architecture and is fine-tuned directly from open-source LLMs. To train SpatialLM, we collect a large-scale, high-quality synthetic dataset consisting of the point clouds of 12,328 indoor scenes (54,778 rooms) with ground-truth 3D annotations, and conduct a careful study on various modeling and training decisions. On public benchmarks, our model gives state-of-the-art performance in layout estimation and competitive results in 3D object detection. With that, we show a feasible path for enhancing the spatial understanding capabilities of modern LLMs for applications in augmented reality, embodied robotics, and more.
zh
[CV-92] Drive Any Mesh: 4D Latent Diffusion for Mesh Deformation from Video
【速读】:该论文试图解决单目视频驱动网格(mesh)生成中的挑战,特别是在现代渲染引擎中4D生成技术存在的效率低下和泛化能力不足问题。现有隐式方法渲染效率低且不适用于基于光栅化的引擎,而骨骼方法则需要大量手动工作且缺乏跨类别泛化能力。论文提出的解决方案关键在于构建一个4D扩散模型,该模型通过去噪潜在集合序列,结合基于Transformer的变分自编码器,同时捕捉三维形状和运动信息,并利用时空Transformer扩散模型在多个潜在帧之间交换信息,从而提升生成结果的效率和泛化能力。
链接: https://arxiv.org/abs/2506.07489
作者: Yahao Shi,Yang Liu,Yanmin Wu,Xing Liu,Chen Zhao,Jie Luo,Bin Zhou
机构: Beihang University (北京航空航天大学); Peking University (北京大学); Baidu VIS (百度视觉)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical report
Abstract:We propose DriveAnyMesh, a method for driving mesh guided by monocular video. Current 4D generation techniques encounter challenges with modern rendering engines. Implicit methods have low rendering efficiency and are unfriendly to rasterization-based engines, while skeletal methods demand significant manual effort and lack cross-category generalization. Animating existing 3D assets, instead of creating 4D assets from scratch, demands a deep understanding of the input’s 3D structure. To tackle these challenges, we present a 4D diffusion model that denoises sequences of latent sets, which are then decoded to produce mesh animations from point cloud trajectory sequences. These latent sets leverage a transformer-based variational autoencoder, simultaneously capturing 3D shape and motion information. By employing a spatiotemporal, transformer-based diffusion model, information is exchanged across multiple latent frames, enhancing the efficiency and generalization of the generated results. Our experimental results demonstrate that DriveAnyMesh can rapidly produce high-quality animations for complex motions and is compatible with modern rendering engines. This method holds potential for applications in both the gaming and filming industries.
zh
[CV-93] CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization ICML2025
【速读】:该论文旨在解决视觉-语言模型在进行提示微调(prompt tuning)时,由于冻结编码器导致的特征对齐问题,进而影响任务特异性和跨域泛化能力的问题。其解决方案的关键在于引入一种混淆感知损失(confusion-aware loss, CoA-loss),通过优化易混淆类别的决策边界来提升任务特异性,同时结合置信度感知权重(confidence-aware weights, CoA-weights)构建混合模型,以增强模型在未见领域的泛化能力,而不会牺牲任务特异性。
链接: https://arxiv.org/abs/2506.07484
作者: Dasol Hong,Wooju Lee,Hyun Myung
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 5 figures; accepted at ICML 2025
Abstract:Prompt tuning, which adapts vision-language models by freezing model parameters and optimizing only the prompt, has proven effective for task-specific adaptations. The core challenge in prompt tuning is improving specialization for a specific task and generalization for unseen domains. However, frozen encoders often produce misaligned features, leading to confusion between classes and limiting specialization. To overcome this issue, we propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. Additionally, we mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. This is achieved using confidence-aware weights (CoA-weights), which adjust the weights of each prediction in the mixture model based on its confidence within the class domains. Extensive experiments show that CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization. Our code is publicly available at this https URL.
zh
[CV-94] Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval AAAI2025
【速读】:该论文旨在解决部分相关视频检索(Partially Relevant Video Retrieval, PRVR)中存在文本与视频内容之间的固有模糊性问题。传统PRVR训练过程假设每个文本查询仅与一个视频相关,但实际中存在多个可能相关的视频或视频片段。为应对这一问题,论文提出了一种基于模糊性约束的表示学习(Ambiguity-Restrained representation Learning, ARL)框架。其关键在于通过不确定性与相似性两个标准检测模糊对,并利用多正例对比学习和双三元组边界损失进行层次化语义关系学习,同时关注视频内部细粒度的帧级语义关系,以提升模型对模糊性的建模能力。此外,还引入了跨模型模糊性检测机制,以减少单一模型在检测模糊对时的误差传播问题。
链接: https://arxiv.org/abs/2506.07471
作者: CH Cho,WJ Moon,W Jun,MS Jung,JP Heo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to AAAI 2025
Abstract:Partially Relevant Video Retrieval (PRVR) aims to retrieve a video where a specific segment is relevant to a given text query. Typical training processes of PRVR assume a one-to-one relationship where each text query is relevant to only one video. However, we point out the inherent ambiguity between text and video content based on their conceptual scope and propose a framework that incorporates this ambiguity into the model learning process. Specifically, we propose Ambiguity-Restrained representation Learning (ARL) to address ambiguous text-video pairs. Initially, ARL detects ambiguous pairs based on two criteria: uncertainty and similarity. Uncertainty represents whether instances include commonly shared context across the dataset, while similarity indicates pair-wise semantic overlap. Then, with the detected ambiguous pairs, our ARL hierarchically learns the semantic relationship via multi-positive contrastive learning and dual triplet margin loss. Additionally, we delve into fine-grained relationships within the video instances. Unlike typical training at the text-video level, where pairwise information is provided, we address the inherent ambiguity within frames of the same untrimmed video, which often contains multiple contexts. This allows us to further enhance learning at the text-frame level. Lastly, we propose cross-model ambiguity detection to mitigate the error propagation that occurs when a single model is employed to detect ambiguous pairs for its training. With all components combined, our proposed method demonstrates its effectiveness in PRVR.
zh
[CV-95] DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO)在视频大语言模型(Video LLMs)中的应用问题,特别是其有效学习受到的两个主要障碍:对安全机制的依赖以及优势值消失问题。解决方案的关键在于提出DeepVideo-R1,该模型采用改进的Reg-GRPO(Regressive GRPO)算法和难度感知的数据增强策略。Reg-GRPO将GRPO的目标重新定义为回归任务,直接预测优势值,从而消除了对裁剪和最小函数等安全机制的依赖,实现了更直接的策略引导。
链接: https://arxiv.org/abs/2506.07464
作者: Jinyoung Park,Jeehye Na,Jinyoung Kim,Hyunwoo J. Kim
机构: Korea University (韩国科学技术大学); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Work in progress
Abstract:Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training in enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success by employing a PPO-style reinforcement algorithm with group-based normalized rewards. However, the application of GRPO to Video Large Language Models (Video LLMs) has been less studied. In this paper, we explore GRPO for video LLMs and identify two primary issues that impede its effective learning: (1) reliance on safeguards, and (2) the vanishing advantage problem. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with our proposed Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation strategy. Reg-GRPO reformulates the GRPO objective as a regression task, directly predicting the advantage in GRPO. This design eliminates the need for safeguards like clipping and min functions, thereby facilitating more direct policy guidance by aligning the model with the advantage values. We also design the difficulty-aware data augmentation strategy that dynamically augments training samples at solvable difficulty levels, fostering diverse and informative reward signals. Our comprehensive experiments show that DeepVideo-R1 significantly improves video reasoning performance across multiple video reasoning benchmarks.
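Reg-GRPO 的要点是把“组内标准化优势”从裁剪式策略梯度目标中拿出来,改为让模型直接回归。下面的草图先计算 GRPO 式的组内优势,再用 MSE 让模型预测值逼近它;`pred_adv` 的具体来源(例如对数似然差或一个标量头)为笔者假设,仅示意回归形式:

```python
import torch
import torch.nn.functional as F

def group_normalized_advantage(rewards):
    """GRPO 式组内标准化优势:(r - mean) / std。rewards: (G,)"""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def reg_grpo_loss(pred_adv, rewards):
    """回归式目标:让模型的标量预测直接逼近组内优势(示意)。

    pred_adv: 模型对同一 prompt 下每个回答给出的标量预测,形状 (G,)
    """
    adv = group_normalized_advantage(rewards).detach()
    return F.mse_loss(pred_adv, adv)  # 无需裁剪或 min 等保护机制

rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
pred = torch.zeros(4, requires_grad=True)
print(reg_grpo_loss(pred, rewards))
```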
zh
[CV-96] PhysiInter: Integrating Physical Mapping for High-Fidelity Human Interaction Generation
【速读】:该论文试图解决现有运动捕捉技术和生成模型在合成人类动作时忽视物理约束的问题,导致出现穿透、滑动和漂浮等伪影,尤其在多人运动生成中问题更为突出。解决方案的关键在于引入物理映射(physical mapping),通过基于物理的仿真环境进行运动模仿,将目标动作投影到符合物理规律的空间,并在保留原始语义意义的同时调整动作以满足现实物理约束。此方法不仅提升了运动捕捉数据质量,还直接指导了生成动作的后处理。此外,针对多人场景的交互性,提出了运动一致性(Motion Consistency, MC)和基于标记的交互(Marker-based Interaction, MI)损失函数,以提升模型性能。
链接: https://arxiv.org/abs/2506.07456
作者: Wei Yao,Yunlian Sun,Chang Liu,Hongwen Zhang,Jinhui Tang
机构: Nanjing University of Science and Technology (南京理工大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Driven by advancements in motion capture and generative artificial intelligence, leveraging large-scale MoCap datasets to train generative models for synthesizing diverse, realistic human motions has become a promising research direction. However, existing motion-capture techniques and generative models often neglect physical constraints, leading to artifacts such as interpenetration, sliding, and floating. These issues are exacerbated in multi-person motion generation, where complex interactions are involved. To address these limitations, we introduce physical mapping, integrated throughout the human interaction generation pipeline. Specifically, motion imitation within a physics-based simulation environment is used to project target motions into a physically valid space. The resulting motions are adjusted to adhere to real-world physics constraints while retaining their original semantic meaning. This mapping not only improves MoCap data quality but also directly informs post-processing of generated motions. Given the unique interactivity of multi-person scenarios, we propose a tailored motion representation framework. Motion Consistency (MC) and Marker-based Interaction (MI) loss functions are introduced to improve model performance. Experiments show our method achieves impressive results in generated human motion quality, with a 3%-89% improvement in physical fidelity. Project page: this http URL
zh
[CV-97] Prompt to Protection: A Comparative Study of Multimodal LLM s in Construction Hazard Recognition
【速读】:该论文试图解决在建筑工地环境中,如何利用多模态大语言模型(Multimodal Large Language Models, LLMs)提升视觉危险识别的准确性与一致性问题。其解决方案的关键在于通过优化提示工程(prompt engineering),特别是采用链式思维(Chain-of-Thought, CoT)提示策略,以增强模型在安全关键任务中的表现,从而为构建更可靠的AI辅助安全系统提供可行路径。
链接: https://arxiv.org/abs/2506.07436
作者: Nishi Chaudhary,S M Jamil Uddin,Sathvik Sharath Chandra,Anto Ovid,Alex Albert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:The recent emergence of multimodal large language models (LLMs) has introduced new opportunities for improving visual hazard recognition on construction sites. Unlike traditional computer vision models that rely on domain-specific training and extensive datasets, modern LLMs can interpret and describe complex visual scenes using simple natural language prompts. However, despite growing interest in their applications, there has been limited investigation into how different LLMs perform in safety-critical visual tasks within the construction domain. To address this gap, this study conducts a comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5, GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify potential hazards from real-world construction images. Each model was tested under three prompting strategies: zero-shot, few-shot, and chain-of-thought (CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated basic safety context and a hazard source mnemonic, and CoT provided step-by-step reasoning examples to scaffold model thinking. Quantitative analysis was performed using precision, recall, and F1-score metrics across all conditions. Results reveal that prompting strategy significantly influenced performance, with CoT prompting consistently producing higher accuracy across models. Additionally, LLM performance varied under different conditions, with GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also demonstrate the critical role of prompt design in enhancing the accuracy and consistency of multimodal LLMs for construction safety applications. This study offers actionable insights into the integration of prompt engineering and LLMs for practical hazard recognition, contributing to the development of more reliable AI-assisted safety systems.
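三种提示策略(zero-shot / few-shot / CoT)的差别可以用模板直观展示。以下中文模板是笔者依据摘要归纳的示意,并非论文使用的原始提示:

```python
ZERO_SHOT = "请找出这张施工现场图片中的所有安全隐患。"

FEW_SHOT = (
    "你是施工安全员。常见隐患来源可按助记词逐项检查:坠落、物体打击、触电、"
    "机械伤害、有害环境。\n"
    "示例:图中工人未系安全带在高处作业 -> 坠落隐患。\n"
    "现在请找出这张图片中的所有安全隐患。"
)

CHAIN_OF_THOUGHT = (
    "请一步步分析这张施工现场图片:\n"
    "1) 先描述场景中的人员、设备与环境;\n"
    "2) 对照常见隐患来源逐项检查;\n"
    "3) 给出每个隐患的位置与理由,最后汇总隐患列表。"
)
```

摘要的结论(CoT 一致取得更高准确率)正对应第三个模板中“先描述、再对照、后汇总”的显式推理脚手架。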
zh
[CV-98] FAMSeg: Fetal Femur and Cranial Ultrasound Segmentation Using Feature-Aware Attention and Mamba Enhancement
【速读】:该论文旨在解决超声图像分割中因依赖人工标注而导致的误差大、耗时长以及现有分割模型难以适应高噪声和高相似性超声目标的问题,尤其是在小目标分割中出现的明显锯齿效应。其解决方案的关键在于提出一种基于特征感知与Mamba增强的胎儿股骨和颅脑超声图像分割模型,通过设计纵向与横向独立视角扫描卷积块和特征感知模块,提升局部细节捕捉能力和上下文信息融合效果,并结合Mamba优化的残差结构,抑制原始噪声干扰并增强局部多维扫描能力,从而构建全局信息与局部特征依赖关系,实现最优分割性能。
链接: https://arxiv.org/abs/2506.07431
作者: Jie He,Minglang Chen,Minying Lu,Bocheng Liang,Junming Wei,Guiyan Peng,Jiaxi Chen,Ying Tan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate ultrasound image segmentation is a prerequisite for precise biometrics and accurate assessment. Relying on manual delineation introduces significant errors and is time-consuming. However, existing segmentation models are designed based on objects in natural scenes, making them difficult to adapt to ultrasound objects with high noise and high similarity. This is particularly evident in small object segmentation, where a pronounced jagged effect occurs. Therefore, this paper proposes a fetal femur and cranial ultrasound image segmentation model based on feature perception and Mamba enhancement to address these challenges. Specifically, a longitudinal and transverse independent viewpoint scanning convolution block and a feature perception module were designed to enhance the ability to capture local detail information and improve the fusion of contextual information. Combined with the Mamba-optimized residual structure, this design suppresses the interference of raw noise and enhances local multi-dimensional scanning. The system builds global information and local feature dependencies, and is trained with a combination of different optimizers to achieve the optimal solution. After extensive experimental validation, the FAMSeg network achieved the fastest loss reduction and the best segmentation performance across images of varying sizes and orientations.
zh
[CV-99] DPFormer: Dynamic Prompt Transformer for Continual Learning
【速读】:该论文试图解决持续学习中的灾难性遗忘问题以及任务间混淆问题(catastrophic forgetting and inter-task confusion)。其解决方案的关键在于提出一种名为动态提示变换器(DPFormer)的模型,结合提示方案(prompt schemes),使模型能够在保持先前知识的同时,继续从新任务中学习,并通过提供差异化的信息来缓解任务间的混淆。此外,基于提示方案设计了一个统一的分类模块,结合二元交叉熵损失、知识蒸馏损失和辅助损失,实现端到端的训练。
链接: https://arxiv.org/abs/2506.07414
作者: Sheng-Kai Huang,Jiun-Feng Chang,Chun-Rong Huang
机构: Nvidia Corporation(英伟达公司); National Chung Hsing University(国立中兴大学); National Yang Ming Chiao Tung University(国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In continual learning, solving the catastrophic forgetting problem may make the models fall into the stability-plasticity dilemma. Moreover, inter-task confusion will also occur due to the lack of knowledge exchanges between different tasks. In order to solve the aforementioned problems, we propose a novel dynamic prompt transformer (DPFormer) with prompt schemes. The prompt schemes help the DPFormer memorize learned knowledge of previous classes and tasks, and keep on learning new knowledge from new classes and tasks under a single network structure with a nearly fixed number of model parameters. Moreover, they also provide discrepant information to represent different tasks to solve the inter-task confusion problem. Based on prompt schemes, a unified classification module with the binary cross entropy loss, the knowledge distillation loss and the auxiliary loss is proposed to train the whole model in an end-to-end trainable manner. Compared with state-of-the-art methods, our method achieves the best performance in the CIFAR-100, ImageNet100 and ImageNet1K datasets under different class-incremental settings in continual learning. The source code will be available at our GitHub after acceptance.
zh
[CV-100] Variational Supervised Contrastive Learning
【速读】:该论文旨在解决对比学习在嵌入空间中存在语义相关实例被无意推开以及过度依赖大批次负样本和定制化增强策略导致泛化能力受限的问题。其解决方案的关键在于提出变分监督对比学习(VarCon),通过将监督对比学习重新表述为潜在类别变量上的变分推断,并最大化后验加权证据下界(ELBO),从而实现高效的类别感知匹配并精细控制嵌入空间中的类内离散度。
链接: https://arxiv.org/abs/2506.07413
作者: Ziwen Wang,Jiajun Fan,Thao Nguyen,Heng Ji,Ge Liu
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Contrastive learning has proven to be highly efficient and adaptable in shaping representation spaces across diverse modalities by pulling similar samples together and pushing dissimilar ones apart. However, two key limitations persist: (1) Without explicit regulation of the embedding distribution, semantically related instances can inadvertently be pushed apart unless complementary signals guide pair selection, and (2) excessive reliance on large in-batch negatives and tailored augmentations hinders generalization. To address these limitations, we propose Variational Supervised Contrastive Learning (VarCon), which reformulates supervised contrastive learning as variational inference over latent class variables and maximizes a posterior-weighted evidence lower bound (ELBO) that replaces exhaustive pair-wise comparisons for efficient class-aware matching and grants fine-grained control over intra-class dispersion in the embedding space. Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while converging in just 200 epochs; (2) yields substantially clearer decision boundaries and semantic organization in the embedding space, as evidenced by KNN classification, hierarchical clustering results, and transfer-learning assessments; and (3) demonstrates superior performance in few-shot learning compared with the supervised baseline, as well as superior robustness across various augmentation strategies.
zh
[CV-101] Compressed Feature Quality Assessment: Dataset and Baselines
【速读】:该论文试图解决压缩特征质量评估(Compressed Feature Quality Assessment, CFQA)问题,即如何有效评估在特征传输、存储和重用过程中因特征编码导致的语义失真。解决方案的关键在于提出首个基准数据集,包含300个原始特征和从三个视觉任务及四种特征编解码器中生成的12000个压缩特征,并提供任务特定的性能下降作为真实语义失真的评价标准。该数据集旨在推动更精确的语义退化度量方法的发展。
链接: https://arxiv.org/abs/2506.07412
作者: Changsheng Gao,Wei Zhou,Guosheng Lin,Weisi Lin
机构: Nanyang Technological University (南洋理工大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The widespread deployment of large models in resource-constrained environments has underscored the need for efficient transmission of intermediate feature representations. In this context, feature coding, which compresses features into compact bitstreams, becomes a critical component for scenarios involving feature transmission, storage, and reuse. However, this compression process introduces inherent semantic degradation that is notoriously difficult to quantify with traditional metrics. To address this, this paper introduces the research problem of Compressed Feature Quality Assessment (CFQA), which seeks to evaluate the semantic fidelity of compressed features. To advance CFQA research, we propose the first benchmark dataset, comprising 300 original features and 12000 compressed features derived from three vision tasks and four feature codecs. Task-specific performance drops are provided as true semantic distortion for the evaluation of CFQA metrics. We assess the performance of three widely used metrics (MSE, cosine similarity, and Centered Kernel Alignment) in capturing semantic degradation. The results underscore the representativeness of the dataset and highlight the need for more refined metrics capable of addressing the nuances of semantic distortion in compressed features. To facilitate the ongoing development of CFQA research, we release the dataset and all accompanying source code at this https URL. This contribution aims to advance the field and provide a foundational resource for the community to explore CFQA.
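为直观说明上文提到的三种常用度量(MSE、余弦相似度与线性 CKA)如何用于比较原始特征与压缩特征,下面给出一个示意性的 PyTorch 草图;函数组织与张量形状均为示例假设,并非该基准数据集的官方评测代码。

```python
import torch
import torch.nn.functional as F

def feature_mse(a, b):
    # 数值失真:逐元素均方误差
    return F.mse_loss(a, b).item()

def feature_cosine(a, b):
    # 方向一致性:逐样本展平后的余弦相似度均值
    return F.cosine_similarity(a.flatten(1), b.flatten(1), dim=1).mean().item()

def linear_cka(a, b):
    # 线性 CKA:衡量两组 (n, d) 特征的表征相似性,对线性变换不敏感
    a = a - a.mean(dim=0, keepdim=True)
    b = b - b.mean(dim=0, keepdim=True)
    hsic = (a.T @ b).norm() ** 2
    return (hsic / ((a.T @ a).norm() * (b.T @ b).norm())).item()

orig = torch.randn(300, 512)                 # 假设的原始特征
comp = orig + 0.1 * torch.randn_like(orig)   # 模拟压缩引入的失真
print(feature_mse(orig, comp), feature_cosine(orig, comp), linear_cka(orig, comp))
```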
zh
[CV-102] MedChat: A Multi-Agent Framework for Multimodal Diagnosis with Large Language Models
【速读】:该论文试图解决眼科医生短缺以及临床报告效率低下的问题,同时应对将通用大语言模型(Large Language Models, LLMs)应用于医学影像时存在的幻觉、可解释性差和领域知识不足等挑战。其解决方案的关键在于提出MedChat,一个由多个角色特定的LLM代理和专门的视觉模型组成的多代理诊断框架,所有代理由一个协调代理统一管理,从而提升诊断的可靠性、降低幻觉风险,并通过面向临床审查和教育用途的交互式报告界面实现高效诊断。
链接: https://arxiv.org/abs/2506.07400
作者: Philip Liu,Sparsh Bansal,Jimmy Dinh,Aditya Pawar,Ramani Satishkumar,Shail Desai,Neeraj Gupta,Xin Wang,Shu Hu
机构: Purdue University (普渡大学); University at Albany, State University of New York (纽约州立大学阿尔巴尼分校)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 6 figures. Accepted to the 2025 IEEE 8th International Conference on Multimedia Information Processing and Retrieval (MIPR). Code and platform available at this https URL
Abstract:The integration of deep learning-based glaucoma detection with large language models (LLMs) presents an automated strategy to mitigate ophthalmologist shortages and improve clinical reporting efficiency. However, applying general LLMs to medical imaging remains challenging due to hallucinations, limited interpretability, and insufficient domain-specific medical knowledge, which can potentially reduce clinical accuracy. Although recent approaches combining imaging models with LLM reasoning have improved reporting, they typically rely on a single generalist agent, restricting their capacity to emulate the diverse and complex reasoning found in multidisciplinary medical teams. To address these limitations, we propose MedChat, a multi-agent diagnostic framework and platform that combines specialized vision models with multiple role-specific LLM agents, all coordinated by a director agent. This design enhances reliability, reduces hallucination risk, and enables interactive diagnostic reporting through an interface tailored for clinical review and educational use. Code available at this https URL.
zh
[CV-103] MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems
【速读】:该论文旨在解决多模态检索增强生成(Multimodal Retrieval-Augmented Generation, RAG)系统在隐私保护方面的漏洞问题,特别是针对成员身份推断攻击(Membership Inference Attacks, MIAs)的防御不足。现有方法主要关注文本模态,而对视觉模态的研究相对匮乏。论文提出的MrM框架是首个针对多模态RAG系统的黑盒MIAs框架,其关键在于通过基于反事实攻击约束的多对象数据扰动机制,同时诱导RAG系统检索目标数据并生成泄露成员信息的输出,结合对象感知的数据扰动和反事实引导的掩码选择策略,以提高攻击效果并实现统计意义上的成员身份推断。
链接: https://arxiv.org/abs/2506.07399
作者: Peiru Yang,Jinhua Yin,Haoran Zheng,Xueying Bai,Huili Wang,Yufei Sun,Xintian Li,Shangguang Wang,Yongfeng Huang,Tao Qi
机构: Tsinghua University (清华大学); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal retrieval-augmented generation (RAG) systems enhance large vision-language models by integrating cross-modal knowledge, enabling their increasing adoption across real-world multimodal tasks. These knowledge databases may contain sensitive information that requires privacy protection. However, multimodal RAG systems inherently grant external users indirect access to such data, making them potentially vulnerable to privacy attacks, particularly membership inference attacks (MIAs). Existing MIA methods targeting RAG systems predominantly focus on the textual modality, while the visual modality remains relatively underexplored. To bridge this gap, we propose MrM, the first black-box MIA framework targeted at multimodal RAG systems. It utilizes a multi-object data perturbation framework constrained by counterfactual attacks, which can concurrently induce the RAG systems to retrieve the target data and generate information that leaks the membership information. Our method first employs an object-aware data perturbation method to constrain the perturbation to key semantics and ensure successful retrieval. Building on this, we design a counterfact-informed mask selection strategy to prioritize the most informative masked regions, aiming to eliminate the interference of model self-knowledge and amplify attack efficacy. Finally, we perform statistical membership inference by modeling query trials to extract features that reflect the reconstruction of masked semantics from response patterns. Experiments on two visual datasets and eight mainstream commercial visual-language models (e.g., GPT-4o, Gemini-2) demonstrate that MrM achieves consistently strong performance across both sample-level and set-level evaluations, and remains robust under adaptive defenses.
zh
[CV-104] Adapter Naturally Serves as Decoupler for Cross-Domain Few-Shot Semantic Segmentation ICML2025
【速读】:该论文旨在解决跨域小样本分割(Cross-domain Few-shot Segmentation, CD-FSS)中的两个主要挑战:领域差异(domain gap)以及在数据稀缺情况下的微调问题。其解决方案的关键在于重新审视基于适配器(adapter-based)的方法,并发现适配器不仅有助于下游任务的微调,还能自然地作为领域信息解耦器(domain information decoupler)。基于这一发现,作者提出了基于结构的解耦器——领域特征导航器(Domain Feature Navigator, DFN),用于捕获领域特定信息,引导模型关注领域无关的知识,从而提升模型在目标域上的泛化能力。此外,为防止DFN在源域训练过程中过度拟合,作者进一步设计了SAM-SVN方法以限制其学习样本特定知识,最终在目标域上通过冻结模型并微调DFN来学习目标特定知识。
链接: https://arxiv.org/abs/2506.07376
作者: Jintao Tong,Ran Ma,Yixiong Zou,Guangyao Chen,Yuhua Li,Ruixuan Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2025 Spotlight
Abstract:Cross-domain few-shot segmentation (CD-FSS) is proposed to pre-train the model on a source-domain dataset with sufficient samples, and then transfer the model to target-domain datasets where only a few samples are available for efficient fine-tuning. There are two major challenges in this task: (1) the domain gap and (2) fine-tuning with scarce data. To solve these challenges, we revisit the adapter-based methods, and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and find the model’s inherent structure could lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), which is a structure-based decoupler instead of loss-based ones like current works, to capture domain-specific information, thereby directing the model’s attention towards domain-agnostic knowledge. Moreover, to prevent the potential excessive overfitting of DFN during the source-domain training, we further design the SAM-SVN method to constrain DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn target-specific knowledge. Extensive experiments demonstrate that our method surpasses the state-of-the-art method in CD-FSS significantly by 2.69% and 4.68% MIoU in 1-shot and 5-shot scenarios, respectively.
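为帮助理解“适配器天然充当领域信息解耦器”这一核心观察,下面给出一个最小化的瓶颈适配器草图:主干冻结,仅训练残差旁路,使其吸收领域特定信息。这只是通用 adapter 结构的示意,并非论文中 DFN 的具体实现,维度与隐藏宽度均为示例假设。

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """残差瓶颈适配器:主干冻结,仅训练此旁路。"""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)
        nn.init.zeros_(self.up.weight)  # 初始为恒等映射(旁路输出为零)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# 用法示意:冻结主干,只有适配器参数参与目标域微调
backbone = nn.Linear(256, 256).requires_grad_(False)
adapter = BottleneckAdapter(256)
feat = adapter(backbone(torch.randn(4, 256)))
```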
zh
[CV-105] DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models
【速读】:该论文旨在解决多类别协同检测与跟踪任务中,现有方法主要针对车辆超类,缺乏对多样化道路使用者(如行人、非机动车等)的有效处理问题,这一局限性限制了其在实际场景中的应用。解决方案的关键在于提出一个面向多样化的多类别协同检测与跟踪框架,其中包含三个核心模块:全局空间注意力融合(GSAF)模块以增强多尺度特征学习,基于视觉语义的Tracklet RE-IDentification(REID)模块以减少ID切换(IDSW)错误,以及基于速度的自适应Tracklet管理(VATM)模块以动态调整跟踪间隔。
链接: https://arxiv.org/abs/2506.07375
作者: Xunjie He,Christina Dao Wen Lee,Meiling Wang,Chengran Yuan,Zefan Huang,Yufeng Yue,Marcelo H. Ang Jr
机构: Beijing Institute of Technology (北京理工大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Collaborative perception plays a crucial role in enhancing environmental understanding by expanding the perceptual range and improving robustness against sensor failures, which primarily involves collaborative 3D detection and tracking tasks. The former focuses on object recognition in individual frames, while the latter captures continuous instance tracklets over time. However, existing works in both areas predominantly focus on the vehicle superclass, lacking effective solutions for both multi-class collaborative detection and tracking. This limitation hinders their applicability in real-world scenarios, which involve diverse object classes with varying appearances and motion patterns. To overcome these limitations, we propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics with a vision foundation model to effectively reduce ID SWitch (IDSW) errors, in cases of erroneous mismatches involving small objects like pedestrians. We further design a velocity-based adaptive tracklet management (VATM) module that adjusts the tracking interval dynamically based on object motion. Extensive experiments on the V2X-Real and OPV2V datasets show that our approach significantly outperforms existing state-of-the-art methods in both detection and tracking accuracy.
zh
[CV-106] ARGUS: Hallucination and Omission Evaluation in Video-LLM s
【速读】:该论文试图解决视频大语言模型(VideoLLM)在生成自由文本任务(如视频字幕生成)中存在幻觉(hallucination)严重的问题,而传统基准测试仅依赖选择题形式,无法全面评估模型的性能。解决方案的关键是提出ARGUS基准,通过将VideoLLM的输出与人类真实字幕进行比较,量化两个核心指标:一是关于视频内容或时间关系的错误陈述率,二是重要描述性细节的遗漏率,从而形成对视频字幕生成性能的全面评估。
链接: https://arxiv.org/abs/2506.07371
作者: Ruchit Rawal,Reza Shirkavand,Heng Huang,Gowthami Somepalli,Tom Goldstein
机构: University of Maryland, College Park(马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page with all the artifacts: this https URL
Abstract:Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
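ARGUS 的双指标本质上是两个比率:模型字幕中与视频事实相悖的陈述占比(幻觉率),以及人工字幕中被模型遗漏的重要细节占比(遗漏率)。下面的 Python 草图演示这一统计方式,其中每条陈述与细节的判定结果假设已由人工或判别模型给出,函数名为示例假设。

```python
def caption_rates(statement_is_hallucinated, detail_is_omitted):
    """statement_is_hallucinated: 模型字幕的每条陈述是否与视频内容或时序矛盾;
    detail_is_omitted: 人工字幕的每个重要细节是否未被模型字幕提及。"""
    hallucination_rate = sum(statement_is_hallucinated) / max(len(statement_is_hallucinated), 1)
    omission_rate = sum(detail_is_omitted) / max(len(detail_is_omitted), 1)
    return hallucination_rate, omission_rate

# 示例:10 条陈述中 3 条为幻觉,8 个细节中 2 个被遗漏
print(caption_rates([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0, 1, 0]))  # (0.3, 0.25)
```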
zh
[CV-107] Generative Models at the Frontier of Compression: A Survey on Generative Face Video Coding
【速读】:该论文旨在解决传统视频编码标准在高保真人脸视频通信中效率不足的问题,特别是针对超低码率下的视频压缩与重建质量。其解决方案的关键在于构建一种基于生成式人脸视频编码(Generative Face Video Coding, GFVC)的范式,通过深度生成模型实现语义感知的表征学习和逼真合成,从而在编码端将复杂的面部动态映射为紧凑的潜在代码,并在解码端利用强大的生成模型从这些代码中重建高质量的人脸信号,显著超越最新Versatile Video Coding (VVC)标准的性能。
链接: https://arxiv.org/abs/2506.07369
作者: Bolin Chen,Shanzhi Yin,Goluck Konuko,Giuseppe Valenzise,Zihan Zhang,Shiqi Wang,Yan Ye
机构: DAMO Academy, Alibaba Group(达摩院,阿里巴巴集团); Hupan Lab(虎跑实验室); City University of Hong Kong(香港城市大学); L2S - CentraleSupélec(L2S - 中央理工大学); LTCI - Télécom Paris(LTCI - 法国电信学院); CNRS(法国国家科学研究中心); Laboratoire des Signaux et Systèmes(信号与系统实验室); Université Paris-Saclay(巴黎-萨克雷大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The rise of deep generative models has greatly advanced video compression, reshaping the paradigm of face video coding through their powerful capability for semantic-aware representation and lifelike synthesis. Generative Face Video Coding (GFVC) stands at the forefront of this revolution, which could characterize complex facial dynamics into compact latent codes for bitstream compactness at the encoder side and leverages powerful deep generative models to reconstruct high-fidelity face signal from the compressed latent codes at the decoder side. As such, this well-designed GFVC paradigm could enable high-fidelity face video communication at ultra-low bitrate ranges, far surpassing the capabilities of the latest Versatile Video Coding (VVC) standard. To pioneer foundational research and accelerate the evolution of GFVC, this paper presents the first comprehensive survey of GFVC technologies, systematically bridging critical gaps between theoretical innovation and industrial standardization. In particular, we first review a broad range of existing GFVC methods with different feature representations and optimization strategies, and conduct a thorough benchmarking analysis. In addition, we construct a large-scale GFVC-compressed face video database with subjective Mean Opinion Scores (MOSs) based on human perception, aiming to identify the most appropriate quality metrics tailored to GFVC. Moreover, we summarize the GFVC standardization potentials with a unified high-level syntax and develop a low-complexity GFVC system which are both expected to push forward future practical deployments and applications. Finally, we envision the potential of GFVC in industrial applications and deliberate on the current challenges and future opportunities.
zh
[CV-108] C3S3: Complementary Competition and Contrastive Selection for Semi-Supervised Medical Image Segmentation ICME2025
【速读】:该论文旨在解决医学影像分割中因标注样本不足而导致的半监督医学图像分割(Semi-Supervised Medical Image Segmentation, SSMIS)精度不高的问题。其解决方案的关键在于提出一种名为C3S3的新颖半监督分割模型,该模型通过协同整合互补竞争(Dynamic Complementary Competition)和对比选择(Contrastive Selection)机制,显著提升了边界分割的精确性。具体而言,该模型包含一个结果驱动的对比学习模块用于优化边界定位,以及一个动态互补竞争模块,利用两个高性能子网络生成伪标签以进一步提升分割质量。
链接: https://arxiv.org/abs/2506.07368
作者: Jiaying He,Yitong Lin,Jiahe Chen,Honghui Xu,Jianwei Zheng
机构: Zhejiang University of Technology (浙江工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, ICME2025
Abstract:For the immanent challenge of insufficiently annotated samples in the medical field, semi-supervised medical image segmentation (SSMIS) offers a promising solution. Despite achieving impressive results in delineating primary target areas, most current methodologies struggle to precisely capture the subtle details of boundaries. This deficiency often leads to significant diagnostic inaccuracies. To tackle this issue, we introduce C3S3, a novel semi-supervised segmentation model that synergistically integrates complementary competition and contrastive selection. This design significantly sharpens boundary delineation and enhances overall precision. Specifically, we develop an Outcome-Driven Contrastive Learning module dedicated to refining boundary localization. Additionally, we incorporate a Dynamic Complementary Competition module that leverages two high-performing sub-networks to generate pseudo-labels, thereby further improving segmentation quality. The proposed C3S3 undergoes rigorous validation on two publicly accessible datasets, encompassing the practices of both MRI and CT scans. The results demonstrate that our method achieves superior performance compared to previous cutting-edge competitors. Especially, on the 95HD and ASD metrics, our approach achieves a notable improvement of at least 6%, highlighting the significant advancements. The code is available at this https URL.
zh
[CV-109] Multiple Object Stitching for Unsupervised Representation Learning
【速读】:该论文旨在解决对比学习在单对象中心图像上表现优异,但在广泛存在的多对象图像中性能下降的问题。其解决方案的关键在于提出一种简单但有效的多对象拼接(Multiple Object Stitching, MOS)方法,通过将单对象中心图像拼接成多对象图像,并预设其中的对象,从而在无需人工标注的情况下提供额外的物体对应关系,使模型更关注多对象图像中每个物体的表示,进而提升复杂下游任务(如目标检测和语义分割)的表示质量。
链接: https://arxiv.org/abs/2506.07364
作者: Chengchao Shen,Dawei Liu,Jianxin Wang
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Contrastive learning for single object centric images has achieved remarkable progress on unsupervised representation, but suffering inferior performance on the widespread images with multiple objects. In this paper, we propose a simple but effective method, Multiple Object Stitching (MOS), to refine the unsupervised representation for multi-object images. Specifically, we construct the multi-object images by stitching the single object centric ones, where the objects in the synthesized multi-object images are predetermined. Hence, compared to the existing contrastive methods, our method provides additional object correspondences between multi-object images without human annotations. In this manner, our method pays more attention to the representations of each object in multi-object image, thus providing more detailed representations for complicated downstream tasks, such as object detection and semantic segmentation. Experimental results on ImageNet, CIFAR and COCO datasets demonstrate that our proposed method achieves the leading unsupervised representation performance on both single object centric images and multi-object ones. The source code is available at this https URL.
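MOS 的核心操作是把若干以单物体为中心的图像拼接成一张多物体图像:由于每个物体的位置在拼接时已预先确定,物体级对应关系无需人工标注即可获得。下面是一个 2×2 拼接的 PyTorch 草图,仅示意这一思路,形状与象限记号均为示例假设。

```python
import torch

def stitch_2x2(imgs):
    """imgs: (4, C, H, W) 的单物体图像;返回 (C, 2H, 2W) 的多物体图像。
    每个物体落入哪个象限由拼接顺序决定,天然提供物体级对应关系。"""
    top = torch.cat([imgs[0], imgs[1]], dim=2)      # 沿宽度拼接
    bottom = torch.cat([imgs[2], imgs[3]], dim=2)
    return torch.cat([top, bottom], dim=1)          # 沿高度拼接

crops = torch.randn(4, 3, 112, 112)
multi = stitch_2x2(crops)                           # (3, 224, 224)
quadrants = ["左上", "右上", "左下", "右下"]         # 物体 i 位于 quadrants[i]
```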
zh
[CV-110] CBAM-STN-TPS-YOLO: Enhancing Agricultural Object Detection through Spatially Adaptive Attention Mechanisms
【速读】:该论文旨在解决农业领域中目标检测模型在处理遮挡、不规则结构和背景噪声时精度不足的问题。其关键解决方案是提出CBAM-STN-TPS-YOLO模型,该模型将Thin-Plate Splines (TPS)引入空间变换网络(STN),以实现更灵活的非刚性空间变换,从而更好地对齐特征,并结合卷积块注意力模块(CBAM)抑制背景噪声并增强相关特征,提升检测性能。
链接: https://arxiv.org/abs/2506.07357
作者: Satvik Praveen,Yoonsung Jung
机构: Texas A&M University (得克萨斯A&M大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Object detection is vital in precision agriculture for plant monitoring, disease detection, and yield estimation. However, models like YOLO struggle with occlusions, irregular structures, and background noise, reducing detection accuracy. While Spatial Transformer Networks (STNs) improve spatial invariance through learned transformations, affine mappings are insufficient for non-rigid deformations such as bent leaves and overlaps. We propose CBAM-STN-TPS-YOLO, a model integrating Thin-Plate Splines (TPS) into STNs for flexible, non-rigid spatial transformations that better align features. Performance is further enhanced by the Convolutional Block Attention Module (CBAM), which suppresses background noise and emphasizes relevant spatial and channel-wise features. On the occlusion-heavy Plant Growth and Phenotyping (PGP) dataset, our model outperforms STN-YOLO in precision, recall, and mAP. It achieves a 12% reduction in false positives, highlighting the benefits of improved spatial flexibility and attention-guided refinement. We also examine the impact of the TPS regularization parameter in balancing transformation smoothness and detection performance. This lightweight model improves spatial awareness and supports real-time edge deployment, making it ideal for smart farming applications requiring accurate and efficient monitoring.
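文中使用的 CBAM 是一个标准的即插即用注意力模块:先做通道注意力,再做空间注意力。下面给出其常见的 PyTorch 实现草图,通道压缩比与空间卷积核大小采用惯用默认值,未必与论文设置一致。

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16, spatial_kernel: int = 7):
        super().__init__()
        # 通道注意力:共享 MLP 作用于全局平均/最大池化描述子
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # 空间注意力:对通道维的均值/最大值图做卷积
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)              # 通道加权,抑制无关通道
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))    # 空间加权,抑制背景噪声

y = CBAM(64)(torch.randn(1, 64, 80, 80))             # 输出形状保持不变
```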
zh
[CV-111] MapBERT: Bitwise Masked Modeling for Real-Time Semantic Mapping Generation
【速读】:该论文旨在解决室内语义分布建模的挑战,特别是在稀疏、不平衡的物体类别和多样的空间尺度下,如何实现对未观测区域的鲁棒实时生成以及在新环境中的良好泛化能力。解决方案的关键在于提出一种名为MapBERT的新框架,其核心是利用无查找表的BitVAE将语义地图编码为紧凑的位级标记,并结合掩码Transformer来推断缺失区域,从而生成完整的语义地图。此外,通过引入面向物体的掩码策略,模型能够同时掩码整个物体类别并将其与可学习嵌入配对,从而捕捉物体嵌入与空间标记之间的隐含关系,提升对室内语义分布的建模能力。
链接: https://arxiv.org/abs/2506.07350
作者: Yijie Deng,Shuaihang Yuan,Congcong Wen,Hao Huang,Anthony Tzes,Geeta Chandra Raju Bethala,Yi Fang
机构: NYUAD Center for Artificial Intelligence and Robotics (CAIR), Abu Dhabi, UAE; New York University Abu Dhabi, Electrical Engineering, Abu Dhabi 129188, UAE; New York University, Electrical & Computer Engineering Dept., Brooklyn, NY 11201, USA; Embodied AI and Robotics (AIR) Lab, NYU Abu Dhabi, UAE; School of Software, Tsinghua University, Beijing, China
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spatial awareness is a critical capability for embodied agents, as it enables them to anticipate and reason about unobserved regions. The primary challenge arises from learning the distribution of indoor semantics, complicated by sparse, imbalanced object categories and diverse spatial scales. Existing methods struggle to robustly generate unobserved areas in real time and do not generalize well to new environments. To this end, we propose \textbfMapBERT, a novel framework designed to effectively model the distribution of unseen spaces. Motivated by the observation that the one-hot encoding of semantic maps aligns naturally with the binary structure of bit encoding, we, for the first time, leverage a lookup-free BitVAE to encode semantic maps into compact bitwise tokens. Building on this, a masked transformer is employed to infer missing regions and generate complete semantic maps from limited observations. To enhance object-centric reasoning, we propose an object-aware masking strategy that masks entire object categories concurrently and pairs them with learnable embeddings, capturing implicit relationships between object embeddings and spatial tokens. By learning these relationships, the model more effectively captures indoor semantic distributions crucial for practical robotic tasks. Experiments on Gibson benchmarks show that MapBERT achieves state-of-the-art semantic map generation, balancing computational efficiency with accurate reconstruction of unobserved regions.
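论文的出发点是语义地图的 one-hot 编码与二进制位编码天然对应:类别索引本身即可展开为若干二进制通道,无需查找表。下面的 NumPy 草图演示这种位级编码及其可逆性,类别数与位宽为示例假设。

```python
import numpy as np

K, B = 40, 6                                   # 40 个语义类别可用 6 位表示
sem = np.random.randint(0, K, size=(64, 64))   # 类别索引形式的语义地图
bits = (sem[..., None] >> np.arange(B)) & 1    # (64, 64, 6) 位级标记
recovered = (bits << np.arange(B)).sum(-1)     # 位编码可无损还原
assert (recovered == sem).all()
```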
zh
[CV-112] Hierarchical Scoring with 3D Gaussian Splatting for Instance Image-Goal Navigation
【速读】:该论文旨在解决实例图像目标导航(Instance Image-Goal Navigation, IIN)中因依赖随机采样多视角或轨迹而导致的冗余图像样本和缺乏系统性视点选择的问题,从而降低渲染与比较的计算开销。其解决方案的关键在于提出一种分层评分框架,通过结合跨层级语义评分与细粒度局部几何评分,利用CLIP衍生的相关性场识别与目标物体类别高度语义相似的区域,并在这些区域中进行精确的姿态估计,以实现最优视点的估计与目标匹配。
链接: https://arxiv.org/abs/2506.07338
作者: Yijie Deng,Shuaihang Yuan,Geeta Chandra Raju Bethala,Anthony Tzes,Yu-Shen Liu,Yi Fang
机构: NYUAD Center for Artificial Intelligence and Robotics (CAIR), Abu Dhabi, UAE; New York University Abu Dhabi, Electrical Engineering, Abu Dhabi 129188, UAE; New York University, Electrical & Computer Engineering Dept., Brooklyn, NY 11201, USA; Embodied AI and Robotics (AIR) Lab, NYU Abu Dhabi, UAE; School of Software, Tsinghua University, Beijing, China
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint. While recent methods leverage powerful novel view synthesis (NVS) techniques, such as three-dimensional Gaussian splatting (3DGS), they typically rely on randomly sampling multiple viewpoints or trajectories to ensure comprehensive coverage of discriminative visual cues. This approach, however, creates significant redundancy through overlapping image samples and lacks principled view selection, substantially increasing both rendering and comparison overhead. In this paper, we introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching. Our approach integrates cross-level semantic scoring, utilizing CLIP-derived relevancy fields to identify regions with high semantic similarity to the target object class, with fine-grained local geometric scoring that performs precise pose estimation within promising regions. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on simulated IIN benchmarks and real-world applicability.
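其中“跨层级语义评分”的基本操作可以理解为:用 CLIP 对候选视角的渲染图与目标类别文本计算相似度,从中筛选高相关区域再做精细几何匹配。下面给出一个基于 OpenAI CLIP 的打分草图,模型选择与函数组织均为示例假设,并非论文的原始实现;view_images 假设为 PIL 图像列表。

```python
import torch
import clip  # OpenAI CLIP,假设已安装

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def score_views(view_images, target_text):
    """对候选渲染视角按与目标文本的语义相似度打分,分数越高越值得精细匹配。"""
    imgs = torch.stack([preprocess(v) for v in view_images]).to(device)
    img_f = model.encode_image(imgs)
    txt_f = model.encode_text(clip.tokenize([target_text]).to(device))
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).squeeze(1)
```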
zh
[CV-113] "CASE: Contrastive Activation for Saliency Estimation
【速读】:该论文试图解决现有显著性方法(saliency methods)在类敏感性(class sensitivity)方面的不足,即这些方法在不同类别标签下生成的解释几乎相同,无法有效区分竞争类别。解决方案的关键在于提出一种对比解释方法——CASE(Contrastive Activation for Saliency Estimation),该方法通过隔离预测类别特有的判别特征,从而生成更忠实且更具类别特异性的解释。
链接: https://arxiv.org/abs/2506.07327
作者: Dane Williamson,Yangfeng Ji,Matthew Dwyer
机构: University of Virginia (弗吉尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 9 pages, 5 figures. Submitted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Abstract:Saliency methods are widely used to visualize which input features are deemed relevant to a model’s prediction. However, their visual plausibility can obscure critical limitations. In this work, we propose a diagnostic test for class sensitivity: a method’s ability to distinguish between competing class labels on the same input. Through extensive experiments, we show that many widely used saliency methods produce nearly identical explanations regardless of the class label, calling into question their reliability. We find that class-insensitive behavior persists across architectures and datasets, suggesting the failure mode is structural rather than model-specific. Motivated by these findings, we introduce CASE, a contrastive explanation method that isolates features uniquely discriminative for the predicted class. We evaluate CASE using the proposed diagnostic and a perturbation-based fidelity test, and show that it produces faithful and more class-specific explanations than existing methods.
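文中提出的类敏感性诊断可以概括为:对同一输入取前两个预测类别,分别生成显著性图并比较其相似度;若两者几乎一致,则说明该解释方法对类别不敏感。下面是一个通用的 PyTorch 草图,其中 saliency_fn 为任意显著性方法的占位假设。

```python
import torch

def class_sensitivity(saliency_fn, model, x):
    """返回 top-2 类别显著性图的皮尔逊相关系数;接近 1 表示类不敏感。"""
    logits = model(x)                               # x: (1, C, H, W)
    c1, c2 = logits.topk(2, dim=1).indices[0].tolist()
    s1 = saliency_fn(model, x, c1).flatten().float()
    s2 = saliency_fn(model, x, c2).flatten().float()
    s1 = (s1 - s1.mean()) / (s1.std() + 1e-8)       # 标准化后求相关
    s2 = (s2 - s2.mean()) / (s2.std() + 1e-8)
    return (s1 * s2).mean().item()
```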
zh
[CV-114] AllTracker: Efficient Dense Point Tracking at High Resolution
【速读】:该论文试图解决视频中长距离点轨迹估计的问题,即在视频序列中为每个像素建立高分辨率、密集的对应关系场,以实现跨多帧的精确跟踪。与现有方法不同,其解决方案的关键在于提出了一种新的架构,结合了光流和点跟踪技术,通过在低分辨率网格上进行迭代推理,并利用二维卷积层进行空间信息传播以及像素对齐注意力层进行时间信息传播,从而实现高效且高精度的点跟踪。
链接: https://arxiv.org/abs/2506.07310
作者: Adam W. Harley,Yang You,Xinglong Sun,Yang Zheng,Nikhil Raghuraman,Yunqi Gu,Sheldon Liang,Wen-Hsuan Chu,Achal Dave,Pavel Tokmakov,Suya You,Rares Ambrus,Katerina Fragkiadaki,Leonidas J. Guibas
机构: Stanford University (斯坦福大学); Carnegie Mellon University (卡内基梅隆大学); Toyota Research Institute (丰田研究院); Army Research Laboratory (陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce AllTracker: a model that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40G GPU). A benefit of our design is that we can train on a wider set of datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. Our code and model weights are available at this https URL .
zh
[CV-115] FANVID: A Benchmark for Face and License Plate Recognition in Low-Resolution Videos
【速读】:该论文旨在解决低分辨率(LR)视频中人脸和车牌难以识别的问题,特别是在实际监控场景下,由于分辨率不足导致的人脸和车牌无法辨识,从而影响可靠的身份识别。解决方案的关键在于提出FANVID基准数据集,该数据集包含近1,463段LR视频片段,具有挑战性的干扰项,并通过时间信息的利用来提升识别性能。此外,该研究还定义了两个任务:人脸匹配和车牌识别,并引入了相应的评估指标,以促进对时间建模在LR识别中的创新研究。
链接: https://arxiv.org/abs/2506.07304
作者: Kavitha Viswanathan,Vrinda Goel,Shlesh Gholap,Devayan Ghosh,Madhav Gupta,Dhruvi Ganatra,Sanket Potdar,Amit Sethi
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world surveillance often renders faces and license plates unrecognizable in individual low-resolution (LR) frames, hindering reliable identification. To advance temporal recognition models, we present FANVID, a novel video-based benchmark comprising nearly 1,463 LR clips (180 x 320, 20–60 FPS) featuring 63 identities and 49 license plates from three English-speaking countries. Each video includes distractor faces and plates, increasing task difficulty and realism. The dataset contains 31,096 manually verified bounding boxes and labels. FANVID defines two tasks: (1) face matching – detecting LR faces and matching them to high-resolution mugshots, and (2) license plate recognition – extracting text from LR plates without a predefined database. Videos are downsampled from high-resolution sources to ensure that faces and text are indecipherable in single frames, requiring models to exploit temporal information. We introduce evaluation metrics adapted from mean Average Precision at IoU 0.5, prioritizing identity correctness for faces and character-level accuracy for text. A baseline method with pre-trained video super-resolution, detection, and recognition achieved performance scores of 0.58 (face matching) and 0.42 (plate recognition), highlighting both the feasibility and challenge of the tasks. FANVID’s selection of faces and plates balances diversity with recognition challenge. We release the software for data access, evaluation, baseline, and annotation to support reproducibility and extension. FANVID aims to catalyze innovation in temporal modeling for LR recognition, with applications in surveillance, forensics, and autonomous vehicles.
zh
[CV-116] HotelMatch-LLM : Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval ACL2025
【速读】:该论文旨在解决传统旅游搜索引擎在用户查询方式上的局限性,即用户需从目的地开始并不断调整搜索参数,而无法直接通过自然语言进行属性搜索。其解决方案的关键在于提出HotelMatch-LLM,一个针对旅游领域的多模态密集检索模型,该模型通过三个核心创新实现性能提升:领域特定的多任务优化、结合小语言模型(SLM)与大语言模型(LLM)的非对称密集检索架构,以及对所有房产图片集的广泛图像处理。
链接: https://arxiv.org/abs/2506.07296
作者: Arian Askari,Emmanouil Stergiadis,Ilya Gusev,Moran Beladev
机构: Booking.com(Booking.com)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACL 2025, Main track. 13 Pages, 1 figure
Abstract:We present HotelMatch-LLM, a multimodal dense retrieval model for the travel domain that enables natural language property search, addressing the limitations of traditional travel search engines which require users to start with a destination and editing search parameters. HotelMatch-LLM features three key innovations: (1) Domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) Asymmetrical dense retrieval architecture combining a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) Extensive image processing to handle all property image galleries. Experiments on four diverse test sets show HotelMatch-LLM significantly outperforms state-of-the-art models, including VISTA and MARVEL. Specifically, on the test set – main query type – we achieve 0.681 for HotelMatch-LLM compared to 0.603 for the most effective baseline, MARVEL. Our analysis highlights the impact of our multi-task optimization, the generalizability of HotelMatch-LLM across LLM architectures, and its scalability for processing large image galleries.
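其非对称密集检索的要点是:线上用小模型编码查询、线下用大模型预先编码酒店数据,两者映射到同一向量空间后按内积排序。下面是该模式的最小示意,编码器以随机初始化的线性层代替,仅说明检索流程,并非论文模型。

```python
import torch
import torch.nn.functional as F

slm_query_encoder = torch.nn.Linear(128, 256)   # 线上:小模型,低延迟
llm_doc_encoder = torch.nn.Linear(512, 256)     # 线下:大模型,离线建库

# 酒店向量离线预计算并归一化;查询向量在线计算
hotel_emb = F.normalize(llm_doc_encoder(torch.randn(10000, 512)), dim=-1)
query_emb = F.normalize(slm_query_encoder(torch.randn(1, 128)), dim=-1)

scores = (hotel_emb @ query_emb.T).squeeze(1)
topk = scores.topk(5).indices                   # 返回最相关的 5 家酒店
```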
zh
[CV-117] Multi-Step Guided Diffusion for Image Restoration on Edge Devices: Toward Lightweight Perception in Embodied AI CVPR2025
【速读】:该论文旨在解决扩散模型在逆问题求解中恢复精度和鲁棒性受限的问题,尤其是在嵌入式或分布外场景下。现有方法如流形保持引导扩散(Manifold Preserving Guided Diffusion, MPGD)每一步去噪仅进行一次梯度更新,限制了恢复质量。论文提出的解决方案关键在于在每个去噪时间步引入多步优化策略,从而显著提升图像质量、感知准确性和泛化能力。实验表明,增加每步的梯度更新次数可有效提高LPIPS和PSNR指标,同时保持较低的延迟开销。
链接: https://arxiv.org/abs/2506.07286
作者: Aditya Chakravarty
机构: Independent Research(独立研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Accepted in CVPR 2025 Embodied AI Workshop
Abstract:Diffusion models have shown remarkable flexibility for solving inverse problems without task-specific retraining. However, existing approaches such as Manifold Preserving Guided Diffusion (MPGD) apply only a single gradient update per denoising step, limiting restoration fidelity and robustness, especially in embedded or out-of-distribution settings. In this work, we introduce a multistep optimization strategy within each denoising timestep, significantly enhancing image quality, perceptual accuracy, and generalization. Our experiments on super-resolution and Gaussian deblurring demonstrate that increasing the number of gradient updates per step improves LPIPS and PSNR with minimal latency overhead. Notably, we validate this approach on a Jetson Orin Nano using degraded ImageNet and a UAV dataset, showing that MPGD, originally trained on face datasets, generalizes effectively to natural and aerial scenes. Our findings highlight MPGD’s potential as a lightweight, plug-and-play restoration module for real-time visual perception in embodied AI agents such as drones and mobile robots.
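该方法与 MPGD 的区别在于每个去噪时间步不止执行一次引导梯度更新。下面的草图勾勒这一循环结构,其中 denoiser 与 guidance_loss 为占位假设(前者返回干净图像估计,后者度量其与观测的偏差),步长与更新次数仅作示例。

```python
import torch

def guided_step(x_t, t, denoiser, guidance_loss, num_updates=3, step_size=0.1):
    """单个去噪时间步内执行多次引导更新;num_updates=1 退化为单步引导。"""
    for _ in range(num_updates):
        x_t = x_t.detach().requires_grad_(True)
        x0_hat = denoiser(x_t, t)                 # 预测干净图像
        grad = torch.autograd.grad(guidance_loss(x0_hat), x_t)[0]
        x_t = x_t - step_size * grad              # 沿引导损失下降方向修正
    return x_t.detach()
```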
zh
[CV-118] From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models
【速读】:该论文试图解决视频扩散模型(Video Diffusion Models, VDMs)在视频生成之外的潜在应用问题,即如何利用VDMs内部学习到的结构化表示和视觉世界理解能力,将其适配到多种新任务中。解决方案的关键在于提出一种少样本微调框架,通过将每个任务转化为视觉转换过程,在不改变冻结VDM生成接口的前提下,仅使用少量示例训练LoRA权重,从而实现模型在不同任务上的强泛化能力。
链接: https://arxiv.org/abs/2506.07280
作者: Pablo Acuaviva,Aram Davtyan,Mariam Hassan,Sebastian Stapf,Ahmad Rahimi,Alexandre Alahi,Paolo Favaro
机构: University of Bern (伯尔尼大学); EPFL (瑞士联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 23 figures, 9 tables
Abstract:Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally pushes them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.
zh
[CV-119] Advancing Multimodal Reasoning Capabilities of Multimodal Large Language Models via Visual Perception Reward
【速读】:该论文试图解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在多模态推理能力提升过程中,对多模态感知能力增强的忽视问题。现有基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法未能有效提升MLLMs的多模态感知能力,从而限制了其在复杂多模态推理任务中的表现。解决方案的关键在于提出Perception-R1,该方法引入了一种新颖的视觉感知奖励机制,通过评估生成响应与文本视觉注释的一致性来明确鼓励MLLMs准确感知视觉内容,从而有效提升其多模态感知与推理能力。
链接: https://arxiv.org/abs/2506.07218
作者: Tong Xiao,Xin Xu,Zhenya Huang,Hongyu Gao,Quan Liu,Qi Liu,Enhong Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with Verifiable Rewards (RLVR) to the multimodal domain in order to enhance the reasoning abilities of MLLMs. However, these works largely overlook the enhancement of multimodal perception capabilities in MLLMs, which serve as a core prerequisite and foundational component of complex multimodal reasoning. Through McNemar’s test, we find that existing RLVR method fails to effectively enhance the multimodal perception capabilities of MLLMs, thereby limiting their further improvement in multimodal reasoning. To address this limitation, we propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately, thereby can effectively incentivizing both their multimodal perception and reasoning capabilities. Specifically, we first collect textual visual annotations from the CoT trajectories of multimodal problems, which will serve as visual references for reward assignment. During RLVR training, we employ a judging LLM to assess the consistency between the visual annotations and the responses generated by MLLM, and assign the visual perception reward based on these consistency judgments. Extensive experiments on several multimodal reasoning benchmarks demonstrate the effectiveness of our Perception-R1, which achieves state-of-the-art performance on most benchmarks using only 1,442 training data.
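视觉感知奖励的计算流程可以概括为:用判别 LLM 比对“文本化的视觉注释”与模型回复的一致性,并将判定结果转换为标量奖励。下面的草图仅示意该接口形态,judge 为假设的判别模型调用,提示词措辞也是示例。

```python
def perception_reward(judge, visual_annotation: str, response: str) -> float:
    """judge: 假设的判别 LLM 接口,输入提示词、返回文本判定。"""
    prompt = (f"视觉注释:{visual_annotation}\n模型回复:{response}\n"
              "回复对视觉内容的描述与注释是否一致?仅回答 yes 或 no。")
    verdict = judge(prompt)
    return 1.0 if verdict.strip().lower().startswith("yes") else 0.0
```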
zh
[CV-120] AugmentGest: Can Random Data Cropping Augmentation Boost Gesture Recognition Performance?
【速读】:该论文旨在解决深度学习中数据集多样性不足的问题,特别是在基于骨架的数据集上,通过引入一种综合的数据增强框架来提升模型的泛化能力和鲁棒性。该框架的关键在于集成几何变换、随机裁剪、旋转、缩放以及基于强度的变换、亮度和对比度调整等多种方法,以模拟现实世界中的变化,并生成每个样本的三个增强版本,从而显著增加数据集规模并丰富手势表示的多样性。
链接: https://arxiv.org/abs/2506.07216
作者: Nada Aboudeshish,Dmitry Ignatov,Radu Timofte
机构: University of Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Data augmentation is a crucial technique in deep learning, particularly for tasks with limited dataset diversity, such as skeleton-based datasets. This paper proposes a comprehensive data augmentation framework that integrates geometric transformations, random cropping, rotation, zooming and intensity-based transformations, brightness and contrast adjustments to simulate real-world variations. Random cropping ensures the preservation of spatio-temporal integrity while addressing challenges such as viewpoint bias and occlusions. The augmentation pipeline generates three augmented versions for each sample in addition to the data set sample, thus quadrupling the data set size and enriching the diversity of gesture representations. The proposed augmentation strategy is evaluated on three models: multi-stream e2eET, FPPR point cloud-based hand gesture recognition (HGR), and DD-Network. Experiments are conducted on benchmark datasets including DHG14/28, SHREC’17, and JHMDB. The e2eET model is recognized as the state-of-the-art for hand gesture recognition on DHG14/28 and SHREC’17. The FPPR-PCD model, the second-best performing model on SHREC’17, excels in point cloud-based gesture recognition. DD-Net, a lightweight and efficient architecture for skeleton-based action recognition, is evaluated on SHREC’17 and the Human Motion Data Base (JHMDB). The results underline the effectiveness and versatility of the proposed augmentation strategy, significantly improving model generalization and robustness across diverse datasets and architectures. This framework not only establishes state-of-the-art results on all three evaluated models but also offers a scalable solution to advance HGR and action recognition applications in real-world scenarios. The framework is available at this https URL
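针对骨架序列,文中的几何类增强(旋转、缩放、随机裁剪)可以用如下 NumPy 草图表达:每个样本额外生成三份增强副本,使数据量变为原来的四倍。参数范围均为示例假设;亮度、对比度等强度类变换作用于图像模态,此处从略。

```python
import numpy as np

def augment_skeleton(seq, rng):
    """seq: (T, J, 3) 关节点序列;返回一份随机几何变换后的副本。"""
    theta = rng.uniform(-np.pi / 12, np.pi / 12)          # 绕 z 轴随机旋转
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0, 0.0, 1.0]])
    out = seq @ rot.T
    out = out * rng.uniform(0.9, 1.1)                     # 随机缩放(zoom)
    t0 = rng.integers(0, max(len(out) // 10, 1))          # 随机时间裁剪起点
    return out[t0:]

rng = np.random.default_rng(0)
sample = np.random.randn(64, 22, 3)
augmented = [sample] + [augment_skeleton(sample, rng) for _ in range(3)]  # 1 原始 + 3 增强
```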
zh
[CV-121] Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在面对后门攻击时的脆弱性问题,特别是针对传统单模态触发方式未充分考虑VLMs跨模态融合特性的缺陷。其解决方案的关键在于识别并利用跨模态语义不匹配作为隐式触发机制,提出了一种名为BadSem(Backdoor Attack with Semantic Manipulation)的数据中毒攻击方法,通过在训练过程中故意对图像-文本对进行语义错位来注入隐蔽的后门。
链接: https://arxiv.org/abs/2506.07214
作者: Zhiyuan Zhong,Zhen Sun,Yepang Liu,Xinlei He,Guanhong Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model’s outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-modal semantic mismatches as implicit triggers. Based on this insight, we propose BadSem (Backdoor Attack with Semantic Manipulation), a data poisoning attack that injects stealthy backdoors by deliberately misaligning image-text pairs during training. To perform the attack, we construct SIMBad, a dataset tailored for semantic manipulation involving color and object attributes. Extensive experiments across four widely used VLMs show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our detailed analysis using attention visualization shows that backdoored models focus on semantically sensitive regions under mismatched conditions while maintaining normal behavior on clean inputs. To mitigate the attack, we try two defense strategies based on system prompt and supervised fine-tuning but find that both of them fail to mitigate the semantic backdoor. Our findings highlight the urgent need to address semantic vulnerabilities in VLMs for their safer deployment.
zh
[CV-122] HOI-PAGE: Zero-Shot Human-Object Interaction Generation with Part Affordance Guidance
【速读】:该论文试图解决零样本条件下从文本提示中合成4D人体-物体交互(4D HOI)的问题,现有方法通常关注全局的人体与物体运动,而本文认为生成真实且多样的4D HOI需要更细粒度的理解,即人体部位与物体部位的交互。解决方案的关键在于引入了部分可操作性图(Part Affordance Graphs, PAGs),这是一种从大语言模型中提炼出的结构化HOI表示,编码了细粒度的部分信息及接触关系,并通过三阶段合成流程:分解3D物体为几何部分、从文本提示生成参考HOI视频并提取基于部分的运动约束、优化满足部分级接触约束的4D HOI运动序列,从而实现更真实和符合文本描述的合成结果。
链接: https://arxiv.org/abs/2506.07209
作者: Lei Li,Angela Dai
机构: 未知
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Video: this https URL
Abstract:We present HOI-PAGE, a new approach to synthesizing 4D human-object interactions (HOIs) from text prompts in a zero-shot fashion, driven by part-level affordance reasoning. In contrast to prior works that focus on global, whole body-object motion for 4D HOI synthesis, we observe that generating realistic and diverse HOIs requires a finer-grained understanding – at the level of how human body parts engage with object parts. We thus introduce Part Affordance Graphs (PAGs), a structured HOI representation distilled from large language models (LLMs) that encodes fine-grained part information along with contact relations. We then use these PAGs to guide a three-stage synthesis: first, decomposing input 3D objects into geometric parts; then, generating reference HOI videos from text prompts, from which we extract part-based motion constraints; finally, optimizing for 4D HOI motion sequences that not only mimic the reference dynamics but also satisfy part-level contact constraints. Extensive experiments show that our approach is flexible and capable of generating complex multi-object or multi-person interaction sequences, with significantly improved realism and text alignment for zero-shot 4D HOI generation.
zh
[CV-123] TV-LiVE: Training-Free Text-Guided Video Editing via Layer-Informed Vitality Exploitation
【速读】:该论文旨在解决视频编辑中复杂任务的可访问性与可控性问题,特别是针对新颖物体添加和非刚性变换等尚未被充分探索的任务。其解决方案的关键在于提出一种无需训练的文本引导视频编辑框架TV-LiVE,通过层感知活力挖掘(Layer-informed Vitality Exploitation)识别视频生成模型中对输出质量有显著影响的关键层,并利用这些层的特性,通过选择性注入源模型中的关键和值特征到目标模型的对应层中,实现高效的视频编辑。
链接: https://arxiv.org/abs/2506.07205
作者: Min-Jung Kim,Dongjin Kim,Seokju Yun,Jaegul Choo
机构: KAIST AI (KAIST人工智能); University of Seoul (首尔市立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video editing has garnered increasing attention alongside the rapid progress of diffusion-based video generation models. As part of these advancements, there is a growing demand for more accessible and controllable forms of video editing, such as prompt-based editing. Previous studies have primarily focused on tasks such as style transfer, background replacement, object substitution, and attribute modification, while maintaining the content structure of the source video. However, more complex tasks, including the addition of novel objects and non-rigid transformations, remain relatively unexplored. In this paper, we present TV-LiVE, a Training-free and text-guided Video editing framework via Layer-informed Vitality Exploitation. We empirically identify vital layers within the video generation model that significantly influence the quality of generated outputs. Notably, these layers are closely associated with Rotary Position Embeddings (RoPE). Based on this observation, our method enables both object addition and non-rigid video editing by selectively injecting key and value features from the source model into the corresponding layers of the target model guided by the layer vitality. For object addition, we further identify prominent layers to extract the mask regions corresponding to the newly added target prompt. We found that the extracted masks from the prominent layers faithfully indicate the region to be edited. Experimental results demonstrate that TV-LiVE outperforms existing approaches for both object addition and non-rigid video editing. Project Page: this https URL
zh
[CV-124] Hierarchical Feature-level Reverse Propagation for Post-Training Neural Networks
【速读】:该论文旨在解决端到端自动驾驶中高度耦合的黑箱模型在可解释性和安全保证方面的挑战。其解决方案的关键在于提出一种分层解耦的后训练框架,通过从真实标签重建中间特征图,引入替代监督信号以实现特定组件的独立训练,从而避免传统端到端反向传播的复杂性和耦合性,并提供对网络内部机制的可解释性洞察。该方法首次将特征级逆向计算形式化为适定优化问题,并将其严格重构为线性方程组或最小二乘问题,从而建立了一种新型高效的训练范式,将梯度反向传播扩展至特征反向传播。
链接: https://arxiv.org/abs/2506.07188
作者: Ni Ding,Lei He,Shengbo Eben Li,Keqiang Li
机构: Tsinghua University (清华大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 7 figures
Abstract:End-to-end autonomous driving has emerged as a dominant paradigm, yet its highly entangled black-box models pose significant challenges in terms of interpretability and safety assurance. To improve model transparency and training flexibility, this paper proposes a hierarchical and decoupled post-training framework tailored for pretrained neural networks. By reconstructing intermediate feature maps from ground-truth labels, surrogate supervisory signals are introduced at transitional layers to enable independent training of specific components, thereby avoiding the complexity and coupling of conventional end-to-end backpropagation and providing interpretable insights into networks’ internal mechanisms. To the best of our knowledge, this is the first method to formalize feature-level reverse computation as well-posed optimization problems, which we rigorously reformulate as systems of linear equations or least squares problems. This establishes a novel and efficient training paradigm that extends gradient backpropagation to feature backpropagation. Extensive experiments on multiple standard image classification benchmarks demonstrate that the proposed method achieves superior generalization performance and computational efficiency compared to traditional training approaches, validating its effectiveness and potential.
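“把特征级逆向计算表述为线性方程组或最小二乘问题”可以用一个单层线性映射来直观理解:已知权重与期望输出,反解出应当作为中间监督信号的输入特征。下面的 PyTorch 草图演示这一思路,形状与数值均为示例假设,并非论文的完整训练流程。

```python
import torch

# 设某过渡层为 y = x @ W.T + b,给定期望输出 y_target,反解替代特征目标 x_hat
W = torch.randn(128, 256)
b = torch.randn(128)
y_target = torch.randn(32, 128)            # 批量 32 的期望激活

# 按样本逐行求解 min_x ||W x - (y_target - b)||^2
x_hat = torch.linalg.lstsq(W, (y_target - b).T).solution.T   # (32, 256)

residual = (x_hat @ W.T + b - y_target).norm()
print(residual)  # 欠定系统(256 > 128)一般可精确满足,残差接近 0
```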
zh
[CV-125] Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models
【速读】:该论文试图解决现有视频生成方法依赖于对大规模视频模型进行微调以实现特定任务控制的问题,这一过程随着模型规模的扩大变得越来越不切实际。解决方案的关键在于提出一种无需训练的帧级引导(Frame Guidance)机制,该机制利用关键帧、风格参考图像、草图或深度图等帧级信号实现可控视频生成,并通过一种简化的潜在空间处理方法显著降低内存使用,同时采用一种旨在实现全局一致性的潜在优化策略,从而在不进行任何训练的情况下实现多种任务的有效控制。
链接: https://arxiv.org/abs/2506.07177
作者: Sangwon Jang,Taekyung Ki,Jaehyeong Jo,Jaehong Yoon,Soo Ye Kim,Zhe Lin,Sung Ju Hwang
机构: KAIST(韩国科学技术院); UNC Chapel Hill(北卡罗来纳大学教堂山分校); Adobe Research(Adobe研究院); DeepAuto.ai(DeepAuto.ai)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Advancements in diffusion models have significantly improved video quality, directing attention to fine-grained controllability. However, many existing methods depend on fine-tuning large-scale video models for specific tasks, which becomes increasingly impractical as model sizes continue to grow. In this work, we present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals, such as keyframes, style reference images, sketches, or depth maps. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage, and apply a novel latent optimization strategy designed for globally coherent video generation. Frame Guidance enables effective control across diverse tasks, including keyframe guidance, stylization, and looping, without any training, compatible with any video models. Experimental results show that Frame Guidance can produce high-quality controlled videos for a wide range of tasks and input signals.
zh
[CV-126] Faster than Fast: Accelerating Oriented FAST Feature Detection on Low-end Embedded GPUs
【速读】:该论文旨在解决基于视觉的SLAM(Simultaneous Localization and Mapping)系统在移动平台上难以满足实时处理需求的问题,特别是针对ORB-SLAM系统中Oriented FAST特征检测计算耗时过长(约占整个系统耗时一半)的瓶颈。解决方案的关键在于优化Oriented FAST特征检测中的两个耗时步骤:FAST特征点检测和Harris角点检测,通过引入二进制编码策略快速筛选候选点以及采用基于高效GPU硬件指令的可分离Harris检测策略,从而显著提升计算效率。
链接: https://arxiv.org/abs/2506.07164
作者: Qiong Chang,Xinyuan Chen,Xiang Li,Weimin Wang,Jun Miyazaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The visual-based SLAM (Simultaneous Localization and Mapping) is a technology widely used in applications such as robotic navigation and virtual reality, which primarily focuses on detecting feature points from visual images to construct an unknown environmental map and simultaneously determines its own location. It usually imposes stringent requirements on hardware power consumption, processing speed and accuracy. Currently, the ORB (Oriented FAST and Rotated BRIEF)-based SLAM systems have exhibited superior performance in terms of processing speed and robustness. However, they still fall short of meeting the demands for real-time processing on mobile platforms. This limitation is primarily due to the time-consuming Oriented FAST calculations accounting for approximately half of the entire SLAM system. This paper presents two methods to accelerate the Oriented FAST feature detection on low-end embedded GPUs. These methods optimize the most time-consuming steps in Oriented FAST feature detection: FAST feature point detection and Harris corner detection, which is achieved by implementing a binary-level encoding strategy to determine candidate points quickly and a separable Harris detection strategy with efficient low-level GPU hardware-specific instructions. Extensive experiments on a Jetson TX2 embedded GPU demonstrate an average speedup of over 7.3 times compared to widely used OpenCV with GPU support. This significant improvement highlights its effectiveness and potential for real-time applications in mobile and resource-constrained environments.
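文中“二进制编码策略快速筛选候选点”的思想,可以用经典的 FAST-9 位运算技巧来说明:把圆周 16 个像素的亮/暗比较结果压成 16 位掩码,再用移位与运算检测是否存在 9 个连续置位。下面是一个纯 Python/NumPy 草图,仅示意位级判定逻辑,并非论文的 GPU 内核实现。

```python
import numpy as np

# Bresenham 半径 3 圆周上的 16 个偏移(按环形顺序排列)
CIRCLE = [(0, 3), (1, 3), (2, 2), (3, 1), (3, 0), (3, -1), (2, -2), (1, -3),
          (0, -3), (-1, -3), (-2, -2), (-3, -1), (-3, 0), (-3, 1), (-2, 2), (-1, 3)]

def has_run(mask: int, n: int = 9) -> bool:
    """判断 16 位环形掩码中是否存在 n 个连续置位。"""
    m = mask | (mask << 16)      # 复制一份以处理环形回绕
    for _ in range(n - 1):
        m &= m << 1              # 每次与移位后自身相与,连续段长度减一
    return m != 0

def is_fast_corner(img, y, x, t=20):
    p = int(img[y, x])
    bright = dark = 0
    for i, (dy, dx) in enumerate(CIRCLE):
        v = int(img[y + dy, x + dx])
        if v > p + t:
            bright |= 1 << i     # 明显偏亮的圆周像素置位
        elif v < p - t:
            dark |= 1 << i       # 明显偏暗的圆周像素置位
    return has_run(bright) or has_run(dark)

img = np.random.randint(0, 256, (32, 32))
print(is_fast_corner(img, 16, 16))
```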
zh
[CV-127] GoTrack: Generic 6DoF Object Pose Refinement and Tracking
【速读】:该论文试图解决6DoF(六自由度)物体位姿精调与跟踪的问题,特别是针对多样化的物体而无需进行特定物体的训练。其解决方案的关键在于提出GoTrack方法,该方法结合了基于CAD的模型到帧(model-to-frame)注册和帧到帧(frame-to-frame)注册,通过光流估计实现两种注册方式。模型到帧注册采用标准神经网络模块(在DINOv2基础上训练的Transformer)简化了计算并生成可靠的位姿置信度评分,而帧到帧注册则利用轻量级的现成光流模型,从而提高了计算效率和跟踪稳定性。
链接: https://arxiv.org/abs/2506.07155
作者: Van Nguyen Nguyen,Christian Forster,Sindi Shkodrani,Vincent Lepetit,Bugra Tekin,Cem Keskin,Tomas Hodan
机构: Meta Reality Labs(元宇宙实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce GoTrack, an efficient and accurate CAD-based method for 6DoF object pose refinement and tracking, which can handle diverse objects without any object-specific training. Unlike existing tracking methods that rely solely on an analysis-by-synthesis approach for model-to-frame registration, GoTrack additionally integrates frame-to-frame registration, which saves compute and stabilizes tracking. Both types of registration are realized by optical flow estimation. The model-to-frame registration is noticeably simpler than in existing methods, relying only on standard neural network blocks (a transformer is trained on top of DINOv2) and producing reliable pose confidence scores without a scoring network. For the frame-to-frame registration, which is an easier problem as consecutive video frames are typically nearly identical, we employ a light off-the-shelf optical flow model. We demonstrate that GoTrack can be seamlessly combined with existing coarse pose estimation methods to create a minimal pipeline that reaches state-of-the-art RGB-only results on standard benchmarks for 6DoF object pose estimation and tracking. Our source code and trained models are publicly available at this https URL
zh
[CV-128] Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion
【速读】:该论文旨在解决现有视频自编码器(Video AEs)在建模视频动态中的时空冗余方面效率不足的问题,从而导致压缩系数较低和下游任务训练成本过高的问题。其解决方案的关键在于提出Hi-VAE框架,通过分层编码视频动态的粗粒度到细粒度运动表征,并将解码过程形式化为条件生成任务,将视频动态分解为全局运动(Global Motion)和细节运动(Detailed Motion)两个潜在空间,利用独立的自监督运动编码器显著减少冗余,再通过条件扩散解码器结合层次化的全局与细节运动进行视频重建,从而实现高效的高保真视频压缩与生成。
链接: https://arxiv.org/abs/2506.07136
作者: Huaize Liu,Wenzhang Sun,Qiyuan Zhang,Donglin Di,Biao Gong,Hao Li,Chen Wei,Changqing Zou
机构: University of Chinese Academy of Sciences (中国科学院大学); Li Auto (理想汽车); Zhejiang Lab (之江实验室); State Key Lab of CAD&CG, Zhejiang University (CAD&CG国家重点实验室,浙江大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent breakthroughs in video autoencoders (Video AEs) have advanced video generation, but existing methods fail to efficiently model spatio-temporal redundancies in dynamics, resulting in suboptimal compression factors. This shortfall leads to excessive training costs for downstream tasks. To address this, we introduce Hi-VAE, an efficient video autoencoding framework that hierarchically encodes coarse-to-fine motion representations of video dynamics and formulates the decoding process as a conditional generation task. Specifically, Hi-VAE decomposes video dynamics into two latent spaces: Global Motion, capturing overarching motion patterns, and Detailed Motion, encoding high-frequency spatial details. Using separate self-supervised motion encoders, we compress video latents into compact motion representations to reduce redundancy significantly. A conditional diffusion decoder then reconstructs videos by combining hierarchical global and detailed motions, enabling high-fidelity video reconstructions. Extensive experiments demonstrate that Hi-VAE achieves a high compression factor of 1428×, almost 30× higher than baseline methods (e.g., Cosmos-VAE at 48×), validating the efficiency of our approach. Meanwhile, Hi-VAE maintains high reconstruction quality at such high compression rates and performs effectively in downstream generative tasks. Moreover, Hi-VAE exhibits interpretability and scalability, providing new perspectives for future exploration in video latent representation and generation.
zh
[CV-129] Image segmentation and classification of E-waste for waste segregation KR
【速读】:该论文试图解决电子废弃物(e-waste)的分类问题,旨在通过机器学习模型辅助拾取与放置机器人实现废弃物的自动分拣。解决方案的关键在于构建一个定制的数据集,并采用先进的目标检测模型进行训练,其中YOLOv11模型在实时环境下达到了70 mAP的性能,而Mask-RCNN模型则实现了41 mAP,为后续与机器人系统的集成奠定了基础。
链接: https://arxiv.org/abs/2506.07122
作者: Prakriti Tripathi,Theertha Biju,Maniram Thota,Rakesh Lingam
机构: IIT Dharwad (印度理工学院达沃德分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 4 pages, 7 figures. For code and link to dataset, see this https URL
Abstract:Industry partners provided a problem statement that involves classifying electronic waste using machine learning models that will be used by pick-and-place robots for waste segregation. We started by taking common electronic waste items, such as a mouse and charger, unsoldering them, and taking pictures to create a custom dataset. Then a state-of-the-art YOLOv11 model was trained and run, achieving 70 mAP in real time. A Mask R-CNN model was also trained and achieved 41 mAP. The model will be further integrated with pick-and-place robots to perform segregation of e-waste.
zh
[CV-130] EdgeSpotter: Multi-Scale Dense Text Spotting for Industrial Panel Monitoring
【速读】:该论文旨在解决工业面板中文本定位(text spotting)的挑战,特别是在复杂工业面板中由于跨尺度定位和密集文本区域的模糊边界导致的高效且准确的文本检测问题。现有方法多聚焦于单一文本形状的表示,忽视了不同文本间多尺度特征信息的全面探索。该研究提出了一种新型的多尺度密集文本定位器(EdgeSpotter),其关键在于引入一种具有高效混合器的Transformer架构,以学习多层次特征之间的依赖关系,并整合多层空间与语义线索;同时设计了一种基于Catmull-Rom样条的特征采样方法,显式编码文本的形状、位置及语义信息,从而缓解漏检并降低因多尺度或密集文本区域引起的识别误差。
链接: https://arxiv.org/abs/2506.07112
作者: Changhong Fu,Hua Lin,Haobo Zuo,Liangliang Yao,Liguo Zhang
机构: Tongji University (同济大学); University of Hong Kong (香港大学); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text spotting for industrial panels is a key task for intelligent monitoring. However, achieving efficient and accurate text spotting for complex industrial panels remains challenging due to issues such as cross-scale localization and ambiguous boundaries in dense text regions. Moreover, most existing methods primarily focus on representing a single text shape, neglecting a comprehensive exploration of multi-scale feature information across different texts. To address these issues, this work proposes a novel multi-scale dense text spotter for edge AI-based vision systems (EdgeSpotter) to achieve accurate and robust industrial panel monitoring. Specifically, a novel Transformer with an efficient mixer is developed to learn the interdependencies among multi-level features, integrating multi-layer spatial and semantic cues. In addition, a new feature sampling scheme with Catmull-Rom splines is designed, which explicitly encodes the shape, position, and semantic information of text, thereby alleviating missed detections and reducing recognition errors caused by multi-scale or dense text regions. Furthermore, a new benchmark dataset for industrial panel monitoring (IPM) is constructed. Extensive qualitative and quantitative evaluations on this challenging benchmark dataset validate the superior performance of the proposed method in different challenging panel monitoring tasks. Finally, practical tests based on the self-designed edge AI-based vision system demonstrate the practicality of the method. The code and demo will be available at this https URL.
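摘要中基于 Catmull-Rom 样条的特征采样,其插值公式是标准的;下面用 NumPy 给出沿文本中心线控制点均匀采样的最小示意(采样点数、端点填充方式等均为假设的简化):

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """标准 Catmull-Rom 样条: t 在 [0,1] 内给出 p1 与 p2 之间的插值点。"""
    t2, t3 = t * t, t * t * t
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t3)

def sample_centerline(ctrl_pts, n_per_seg=8):
    """沿中心线控制点序列均匀采样, 供后续在特征图上取样使用(示意)。"""
    pts = np.asarray(ctrl_pts, dtype=np.float32)
    pad = np.concatenate([pts[:1], pts, pts[-1:]], axis=0)  # 端点重复填充
    samples = []
    for i in range(len(pts) - 1):
        for t in np.linspace(0.0, 1.0, n_per_seg, endpoint=False):
            samples.append(catmull_rom(pad[i], pad[i + 1], pad[i + 2], pad[i + 3], t))
    samples.append(pts[-1])
    return np.stack(samples)   # (N, 2) 的采样点坐标
```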
zh
[CV-131] SceneLCM: End-to-End Layout-Guided Interactive Indoor Scene Generation with Latent Consistency Model
【速读】:该论文旨在解决复杂、交互式室内场景自动生成中存在的一系列挑战,包括严格的编辑约束、物理不一致、过度的人工干预、单房间限制以及材质质量不佳等问题。其解决方案的关键在于提出SceneLCM框架,该框架通过结合大型语言模型(LLM)进行布局设计与潜在一致性模型(LCM)进行场景优化,实现端到端的场景生成。该方法将场景生成分解为四个模块化流程:布局生成、家具生成、环境优化和物理编辑,其中核心创新点在于采用一致性轨迹采样(CTS)技术,以提升生成场景的语义丰富性、质量和物理真实性。
链接: https://arxiv.org/abs/2506.07091
作者: Yangkai Lin,Jiabao Lei,Kui Jia
机构: South China University of Technology (华南理工大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Our project page: this https URL. Automated generation of complex, interactive indoor scenes tailored to user prompts remains a formidable challenge. While existing methods achieve indoor scene synthesis, they struggle with rigid editing constraints, physical incoherence, excessive human effort, single-room limitations, and suboptimal material quality. To address these limitations, we propose SceneLCM, an end-to-end framework that synergizes a Large Language Model (LLM) for layout design with a Latent Consistency Model (LCM) for scene optimization. Our approach decomposes scene generation into four modular pipelines: (1) Layout Generation. We employ LLM-guided 3D spatial reasoning to convert textual descriptions into parametric blueprints (3D layouts), and an iterative programmatic validation mechanism refines layout parameters through LLM-mediated dialogue loops; (2) Furniture Generation. SceneLCM employs Consistency Trajectory Sampling (CTS), a consistency distillation sampling loss guided by LCM, to form fast, semantically rich, and high-quality representations. We also offer two theoretical justifications to demonstrate that our CTS loss is equivalent to the consistency loss and that its distillation error is bounded by the truncation error of the Euler solver; (3) Environment Optimization. We use a multiresolution texture field to encode the appearance of the scene and optimize it via the CTS loss. To maintain cross-geometric texture coherence, we introduce a normal-aware cross-attention decoder that predicts RGB by cross-attending to the anchor locations in geometrically heterogeneous instances; (4) Physical Editing. SceneLCM supports physical editing by integrating physical simulation, achieving persistent physical realism. Extensive experiments validate SceneLCM's superiority over state-of-the-art techniques, showing its wide-ranging potential for diverse applications.
zh
[CV-132] UCOD-DPL: Unsupervised Camouflaged Object Detection via Dynamic Pseudo-label Learning CVPR2025
【速读】:该论文旨在解决无监督伪装目标检测(Unsupervised Camouflaged Object Detection, UCOD)中因依赖固定策略生成的伪标签噪声大以及简单解码器无法有效捕捉语义特征而导致的性能低下问题。其解决方案的关键在于提出一种基于教师-学生框架的动态伪标签学习方法(UCOD-DPL),包含自适应伪标签模块(Adaptive Pseudo-label Module, APM)、双分支对抗解码器(Dual-Branch Adversarial, DBA)和再看机制(Look-Twice mechanism),通过动态优化伪标签、增强对抗学习与二次细化,提升模型对伪装目标的检测能力。
链接: https://arxiv.org/abs/2506.07087
作者: Weiqi Yan,Lvhai Chen,Huaijia Kou,Shengchuan Zhang,Yan Zhang,Liujuan Cao
机构: Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025 (Hightlight)
Abstract:Unsupervised Camouflaged Object Detection (UCOD) has gained attention since it doesn't need to rely on extensive pixel-level labels. Existing UCOD methods typically generate pseudo-labels using fixed strategies and train 1×1 convolutional layers as a simple decoder, leading to low performance compared to fully-supervised methods. We emphasize two drawbacks in these approaches: 1) The model is prone to fitting incorrect knowledge due to the pseudo-label containing substantial noise. 2) The simple decoder fails to capture and learn the semantic features of camouflaged objects, especially for small-sized objects, due to the low-resolution pseudo-labels and severe confusion between foreground and background pixels. To this end, we propose a UCOD method with a teacher-student framework via Dynamic Pseudo-label Learning called UCOD-DPL, which contains an Adaptive Pseudo-label Module (APM), a Dual-Branch Adversarial (DBA) decoder, and a Look-Twice mechanism. The APM module adaptively combines pseudo-labels generated by fixed strategies and the teacher model to prevent the model from overfitting incorrect knowledge while preserving the ability for self-correction; the DBA decoder performs adversarial learning over different segmentation objectives, guiding the model to overcome the foreground-background confusion of camouflaged objects; and the Look-Twice mechanism mimics the human tendency to zoom in on camouflaged objects and performs secondary refinement on small-sized objects. Extensive experiments show that our method demonstrates outstanding performance, even surpassing some existing fully supervised methods. The code is available now.
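APM 的具体结构以论文为准;下面给出一个“按教师置信度与训练进度融合固定策略伪标签与教师预测”的极简 PyTorch 示意,其中 warmup_ratio、二值化阈值等均为假设的简化:

```python
import torch

def adaptive_pseudo_label(fixed_pl, teacher_prob, warmup_ratio):
    """自适应融合两路伪标签(示意实现, 非论文官方 APM)。

    fixed_pl     : 固定策略生成的伪标签, 取值 {0,1}, 形状 (B,1,H,W)
    teacher_prob : 教师模型输出的前景概率, 形状 (B,1,H,W)
    warmup_ratio : 当前迭代 / 总迭代, 训练前期更信任固定策略
    """
    conf = (teacher_prob - 0.5).abs() * 2        # 教师逐像素置信度, 范围 [0,1]
    alpha = warmup_ratio * conf                  # 动态融合权重
    fused = alpha * teacher_prob + (1 - alpha) * fixed_pl
    return (fused > 0.5).float()                 # 二值化为最终伪标签
```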
zh
[CV-133] FLAIR-HUB: Large-scale Multimodal Dataset for Land Cover and Crop Mapping
【速读】:该论文旨在解决高分辨率地球观测(Earth Observation, EO)数据在全局土地覆盖和作物类型监测中的处理与标注挑战,尤其是面对数据量大且异质性强的问题。其解决方案的关键在于引入FLAIR-HUB,这是目前最大的多传感器土地覆盖数据集,具有20厘米的超高分辨率标注,覆盖法国2528平方公里区域,并整合了六种对齐的模态数据,包括航拍影像、Sentinel-1/2时序数据、SPOT影像、地形数据及历史航拍图像,以支持多模态融合与深度学习模型的评估与训练。
链接: https://arxiv.org/abs/2506.07080
作者: Anatol Garioud,Sébastien Giordano,Nicolas David,Nicolas Gonthier
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The growing availability of high-quality Earth Observation (EO) data enables accurate global land cover and crop type monitoring. However, the volume and heterogeneity of these datasets pose major processing and annotation challenges. To address this, the French National Institute of Geographical and Forest Information (IGN) is actively exploring innovative strategies to exploit diverse EO data, which require large annotated datasets. IGN introduces FLAIR-HUB, the largest multi-sensor land cover dataset with very-high-resolution (20 cm) annotations, covering 2528 km² of France. It combines six aligned modalities: aerial imagery, Sentinel-1/2 time series, SPOT imagery, topographic data, and historical aerial images. Extensive benchmarks evaluate multimodal fusion and deep learning models (CNNs, transformers) for land cover or crop mapping and also explore multi-task learning. Results underscore the complexity of multimodal fusion and fine-grained classification, with best land cover performance (78.2% accuracy, 65.8% mIoU) achieved using nearly all modalities. FLAIR-HUB supports supervised and multimodal pretraining, with data and code available at this https URL.
zh
[CV-134] Accelerating 3D Gaussian Splatting with Neural Sorting and Axis-Oriented Rasterization
【速读】:该论文旨在解决在资源受限设备上实现3D Gaussian Splatting(3DGS)实时渲染的效率问题,特别是在功耗和面积预算紧张的情况下。其关键解决方案包括:首先,通过轴向光栅化技术预计算并重用X轴和Y轴上的共享项,从而减少高达63%的乘加操作;其次,引入一种基于神经网络的排序方法,预测与顺序无关的混合权重,以消除昂贵的硬件排序器需求;最后,设计一个高效的可重构处理阵列,并采用受Morton编码和Hilbert曲线启发的π-轨迹瓦片调度策略,以优化高斯分布的复用并降低内存访问开销。
链接: https://arxiv.org/abs/2506.07069
作者: Zhican Wang,Guanghui He,Dantong Liu,Lingjun Gao,Shell Xu Hu,Chen Zhang,Zhuoran Song,Nicholas Lane,Wayne Luk,Hongxiang Fan
机构: Shanghai Jiao Tong University (上海交通大学); University of Cambridge (剑桥大学); Imperial College London (帝国理工学院); Samsung AI Cambridge (三星人工智能剑桥)
类目: Graphics (cs.GR); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Under review
Abstract:3D Gaussian Splatting (3DGS) has recently gained significant attention for high-quality and efficient view synthesis, making it widely adopted in fields such as AR/VR, robotics, and autonomous driving. Despite its impressive algorithmic performance, real-time rendering on resource-constrained devices remains a major challenge due to tight power and area budgets. This paper presents an architecture-algorithm co-design to address these inefficiencies. First, we reveal substantial redundancy caused by repeated computation of common terms/expressions during the conventional rasterization. To resolve this, we propose axis-oriented rasterization, which pre-computes and reuses shared terms along both the X and Y axes through a dedicated hardware design, effectively reducing multiply-and-add (MAC) operations by up to 63%. Second, by identifying the resource and performance inefficiency of the sorting process, we introduce a novel neural sorting approach that predicts order-independent blending weights using an efficient neural network, eliminating the need for costly hardware sorters. A dedicated training framework is also proposed to improve its algorithmic stability. Third, to uniformly support rasterization and neural network inference, we design an efficient reconfigurable processing array that maximizes hardware utilization and throughput. Furthermore, we introduce a π-trajectory tile schedule, inspired by Morton encoding and Hilbert curve, to optimize Gaussian reuse and reduce memory access overhead. Comprehensive experiments demonstrate that the proposed design preserves rendering quality while achieving a speedup of 23.4∼27.8× and energy savings of 28.8∼51.4× compared to edge GPUs for real-world scenes. We plan to open-source our design to foster further development in this field.
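轴向光栅化节省乘加操作的数学依据可以用 NumPy 演示:2D 高斯的指数项 -0.5·dᵀΣ⁻¹d 可拆成只依赖 x 的项、只依赖 y 的项和一个由两个一维向量外积得到的交叉项,因此每行/每列的共享量只需计算一次(以下仅为算法示意,不对应论文的硬件设计):

```python
import numpy as np

def splat_tile(mu, inv_cov, xs, ys):
    """按坐标轴预计算共享项, 计算单个 2D 高斯在 tile 内的权重图。

    mu      : 高斯中心 (2,)
    inv_cov : 2x2 逆协方差 [[a, b], [b, c]]
    xs, ys  : tile 内像素的 x / y 坐标(一维数组)
    """
    a, b = inv_cov[0]
    _, c = inv_cov[1]
    dx = xs - mu[0]                      # 每列只计算一次
    dy = ys - mu[1]                      # 每行只计算一次
    tx = a * dx * dx                     # X 轴共享项
    ty = c * dy * dy                     # Y 轴共享项
    cross = np.outer(dy, 2.0 * b * dx)   # 交叉项 = 两个一维向量的外积
    expo = -0.5 * (ty[:, None] + tx[None, :] + cross)
    return np.exp(expo)                  # (len(ys), len(xs)) 的权重图
```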
zh
[CV-135] D2R: dual regularization loss with collaborative adversarial generation for model robustness
【速读】:该论文旨在解决深度神经网络模型在对抗攻击下的鲁棒性不足问题,现有防御方法存在通过损失函数对目标模型指导不足以及对抗样本生成缺乏协作两个关键局限。其解决方案的关键在于提出一种双正则化损失(D2R Loss)方法和一种协作对抗生成(CAG)策略,D2R Loss通过两个优化步骤增强目标模型的鲁棒性,而CAG则通过引导模型与目标模型之间的梯度协作生成对抗样本,从而提升模型的防御能力。
链接: https://arxiv.org/abs/2506.07056
作者: Zhenyu Liu,Huizhi Liang,Rajiv Ranjan,Zhanxing Zhu,Vaclav Snasel,Varun Ojha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:The robustness of Deep Neural Network models is crucial for defending models against adversarial attacks. Recent defense methods have employed collaborative learning frameworks to enhance model robustness. Two key limitations of existing methods are (i) insufficient guidance of the target model via loss functions and (ii) non-collaborative adversarial generation. We, therefore, propose a dual regularization loss (D2R Loss) method and a collaborative adversarial generation (CAG) strategy for adversarial training. D2R loss includes two optimization steps. The adversarial distribution and clean distribution optimizations enhance the target model's robustness by leveraging the strengths of different loss functions obtained via a suitable function space exploration to focus more precisely on the target model's distribution. CAG generates adversarial samples using a gradient-based collaboration between guidance and target models. We conducted extensive experiments on three benchmark databases (CIFAR-10, CIFAR-100, and Tiny ImageNet) and two popular target models (WideResNet34-10 and PreActResNet18). Our results show that D2R loss with CAG produces highly robust models.
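论文中 D2R Loss 的准确定义以原文为准;下面给出“干净分布 + 对抗分布双重约束”这一思路的常见写法(接近 TRADES 风格)作为参考,lam_clean、lam_adv 为假设的权重超参数:

```python
import torch
import torch.nn.functional as F

def dual_regularization_step(model, x, x_adv, y, lam_clean=1.0, lam_adv=1.0):
    """双重正则化式对抗训练的单步损失(示意实现, 非论文官方 D2R)。"""
    logits_clean = model(x)
    logits_adv = model(x_adv)
    loss_clean = F.cross_entropy(logits_clean, y)                 # 干净分布优化
    loss_adv = F.kl_div(F.log_softmax(logits_adv, dim=1),         # 对抗分布向
                        F.softmax(logits_clean.detach(), dim=1),  # 干净分布对齐
                        reduction="batchmean")
    return lam_clean * loss_clean + lam_adv * loss_adv
```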
zh
[CV-136] A Layered Self-Supervised Knowledge Distillation Framework for Efficient Multimodal Learning on the Edge
【速读】:该论文旨在解决如何在不依赖大型过参数化教师网络的情况下,提升紧凑深度学习模型的泛化能力和性能问题。其解决方案的关键在于提出一种分层自监督知识蒸馏(Layered Self-Supervised Knowledge Distillation, LSSKD)框架,通过在中间特征图上附加辅助分类器,生成多样化的自监督知识,并实现不同网络阶段的一对一知识迁移。该方法在多个数据集上均取得了优于现有先进方法的性能提升,同时在推理阶段可移除所有辅助分类器,从而避免额外计算开销,适用于资源受限的设备部署。
链接: https://arxiv.org/abs/2506.07055
作者: Tarique Dahri,Zulfiqar Ali Memon,Zhenyu Yu,Mohd. Yamani Idna Idris,Sheheryar Khan,Sadiq Ahmad,Maged Shoman,Saddam Aziz,Rizwan Qureshi
机构: Fast School of Computing, National University of Computer and Emerging Sciences, Karachi, Pakistan; Universiti Malaya, 50603 Kuala Lumpur, Malaysia; School of Professional Education and Executive Development, The Hong Kong Polytechnic University, Hong Kong; COMSATS University Islamabad, Wah Campus, 47040, Wah Cantt, Pakistan; Intelligent Transportation Systems University of Tennessee-Oak Ridge Innovation Institute’s Energy Storage and Transportation Convergent Research Initiative; Independent Researcher, USA; Center for research in Computer Vision, University of Central Florida; Orlando, Florida, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Layered Self-Supervised Knowledge Distillation (LSSKD) framework for training compact deep learning models. Unlike traditional methods that rely on pre-trained teacher networks, our approach appends auxiliary classifiers to intermediate feature maps, generating diverse self-supervised knowledge and enabling one-to-one transfer across different network stages. Our method achieves an average improvement of 4.54% over the state-of-the-art PS-KD method and a 1.14% gain over SSKD on CIFAR-100, with a 0.32% improvement on ImageNet compared to HASSKD. Experiments on Tiny ImageNet and CIFAR-100 under few-shot learning scenarios also achieve state-of-the-art results. These findings demonstrate the effectiveness of our approach in enhancing model generalization and performance without the need for large over-parameterized teacher networks. Importantly, at the inference stage, all auxiliary classifiers can be removed, yielding no extra computational cost. This makes our model suitable for deploying small language models on affordable low-computing devices. Owing to its lightweight design and adaptability, our framework is particularly suitable for multimodal sensing and cyber-physical environments that require efficient and responsive inference. LSSKD facilitates the development of intelligent agents capable of learning from limited sensory data under weak supervision.
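下面用 PyTorch 给出“在中间特征图上挂接辅助分类器、推理时整体移除”的最小示例(网络结构与蒸馏温度均为假设的简化,仅说明机制):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHeadNet(nn.Module):
    """带辅助分类器的两阶段小网络(示意)。推理时 with_aux=False 即可零额外开销。"""
    def __init__(self, num_classes=100):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.head = nn.Linear(64, num_classes)
        self.aux1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, num_classes))
        self.aux2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(64, num_classes))

    def forward(self, x, with_aux=True):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        out = self.head(f2.mean(dim=(2, 3)))
        if with_aux:                       # 训练阶段: 返回各级 logits 供蒸馏
            return out, self.aux1(f1), self.aux2(f2)
        return out                         # 推理阶段: 辅助头可直接移除

def stagewise_kd(shallow_logits, deep_logits, T=4.0):
    """层间一对一蒸馏: 浅层输出向更深层输出对齐(KL 散度, 示意)。"""
    p = F.softmax(deep_logits.detach() / T, dim=1)
    log_q = F.log_softmax(shallow_logits / T, dim=1)
    return F.kl_div(log_q, p, reduction="batchmean") * T * T
```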
zh
[CV-137] From Swath to Full-Disc: Advancing Precipitation Retrieval with Multimodal Knowledge Expansion
【速读】:该论文试图解决红外遥感降水反演精度低的问题,该问题源于红外算法与地表降水之间的弱相关性,而被动微波和雷达方法虽然精度较高但存在覆盖范围限制。解决方案的关键在于提出一种两阶段的多模态知识扩展框架,即PRE-Net模型,通过Swath-Distilling阶段利用Coordinated Masking and Wavelet Enhancement(CoMWE)技术将多模态数据集成模型的知识迁移至扫描带内的红外模型,并在Full-Disc Adaptation阶段通过Self-MaskTune技术平衡多模态与全盘红外知识,从而实现更准确的全盘红外降水反演。
链接: https://arxiv.org/abs/2506.07050
作者: Zheng Wang,Kai Ying,Bin Xu,Chunjiao Wang,Cong Bai
机构: Zhejiang University of Technology(浙江工业大学); National Meteorological Information Center(国家气象信息中心); Zhejiang Key Laboratory of Visual Information Intelligent Processing(浙江省视觉信息智能处理重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:
Abstract:Accurate near-real-time precipitation retrieval has been enhanced by satellite-based technologies. However, infrared-based algorithms have low accuracy due to weak relations with surface precipitation, whereas passive microwave and radar-based methods are more accurate but limited in range. This challenge motivates the Precipitation Retrieval Expansion (PRE) task, which aims to enable accurate, infrared-based full-disc precipitation retrievals beyond the scanning swath. We introduce Multimodal Knowledge Expansion, a two-stage pipeline with the proposed PRE-Net model. In the Swath-Distilling stage, PRE-Net transfers knowledge from a multimodal data integration model to an infrared-based model within the scanning swath via Coordinated Masking and Wavelet Enhancement (CoMWE). In the Full-Disc Adaptation stage, Self-MaskTune refines predictions across the full disc by balancing multimodal and full-disc infrared knowledge. Experiments on the introduced PRE benchmark demonstrate that PRE-Net significantly advanced precipitation retrieval performance, outperforming leading products like PERSIANN-CCS, PDIR, and IMERG. The code will be available at this https URL.
zh
[CV-138] QForce-RL: Quantized FPGA-Optimized Reinforcement Learning Compute Engine
【速读】:该论文旨在解决在资源受限设备上部署强化学习(Reinforcement Learning, RL)模型时面临的计算资源消耗大和能效低的问题。其关键解决方案是提出QForce-RL架构,该架构通过量化(quantization)技术提升吞吐量并降低能耗,同时结合E2HRL减少整体RL动作以学习期望策略,并利用QuaRL实现基于SIMD的硬件加速,从而在不显著牺牲性能的前提下实现轻量级RL架构的高效部署。
链接: https://arxiv.org/abs/2506.07046
作者: Anushka Jha,Tanushree Dewangan,Mukul Lokhande,Santosh Kumar Vishvakarma
机构: IIT Indore (印度理工学院印多尔分校)
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注:
Abstract:Reinforcement Learning (RL) has outperformed other counterparts in sequential decision-making and dynamic environment control. However, FPGA deployment is significantly resource-expensive, as it involves a large number of computations when training agents on high-quality images, and poses new challenges. In this work, we propose QForce-RL, which takes advantage of quantization to enhance throughput and reduce the energy footprint with a light-weight RL architecture, without significant performance degradation. QForce-RL draws on E2HRL to reduce the overall RL actions needed to learn the desired policy, and on QuaRL for quantization-based SIMD hardware acceleration. We have also provided a detailed analysis for different RL environments, with emphasis on model size, parameters, and accelerated compute ops. The architecture is scalable for resource-constrained devices and provides parametrized, efficient deployment with flexibility in latency, throughput, power, and energy efficiency. The proposed QForce-RL provides a performance enhancement of up to 2.3x and up to 2.6x better FPS compared to SoTA works.
zh
[CV-139] MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在面对需要跨大规模视频集合进行复杂推理的实际场景时表现不足的问题。现有视频问答基准测试的范围有限,通常每个查询仅涉及一个视频片段,无法充分反映实际应用中大规模音视频检索与推理的挑战。为解决这一问题,论文提出了一个新的任务AV-HaystacksQA,其关键在于从不同视频中识别出相关片段并将其关联以生成最信息量的答案。为此,研究者构建了AVHaystacks数据集,并提出了一种模型无关的多智能体框架MAGNET,该框架在QA任务中相对于基线方法在BLEU@4和GPT评估分数上分别实现了高达89%和65%的提升。
链接: https://arxiv.org/abs/2506.07016
作者: Sanjoy Chowdhury,Mohamed Elmoghany,Yohan Abeysinghe,Junjie Fei,Sayan Nag,Salman Khan,Mohamed Elhoseiny,Dinesh Manocha
机构: University of Maryland, College Park(马里兰大学学院市分校); KAUST(阿卜杜拉国王科技大学); MBZUAI(穆罕默德本扎耶德人工智能大学); Adobe(Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Audio-visual learning, Audio-Visual RAG, Multi-Video Linkage
Abstract:Large multimodal models (LMMs) have shown remarkable progress in audio-visual understanding, yet they struggle with real-world scenarios that require complex reasoning across extensive video collections. Existing benchmarks for video question answering remain limited in scope, typically involving one clip per query, which falls short of representing the challenges of large-scale, audio-visual retrieval and reasoning encountered in practical applications. To bridge this gap, we introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. To this end, we present AVHaystacks, an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding task. Additionally, we propose a model-agnostic, multi-agent framework MAGNET to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in QA task on our proposed AVHaystacks. To enable robust evaluation of multi-video retrieval and temporal grounding for optimal response generation, we introduce two new metrics, STEM, which captures alignment errors between a ground truth and a predicted step sequence and MTGS, to facilitate balanced and interpretable evaluation of segment-level grounding performance. Project: this https URL
zh
[CV-140] TABLET: Table Structure Recognition using Encoder-only Transformers ICDAR2025
【速读】:该论文旨在解决表格结构识别中的挑战,特别是针对大型且密集的表格场景。其解决方案的关键在于提出一种基于分裂-合并的自顶向下模型,将行和列的分割建模为序列标注任务,并利用双Transformer编码器捕捉特征交互;同时将合并过程视为网格单元分类任务,通过额外的Transformer编码器确保合并的准确性和一致性。该方法通过消除不稳定的边界框预测,降低了分辨率损失和计算复杂度,从而在保持高速处理的同时实现了高精度。
链接: https://arxiv.org/abs/2506.07015
作者: Qiyu Hou,Jun Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICDAR 2025
Abstract:To address the challenges of table structure recognition, we propose a novel Split-Merge-based top-down model optimized for large, densely populated tables. Our approach formulates row and column splitting as sequence labeling tasks, utilizing dual Transformer encoders to capture feature interactions. The merging process is framed as a grid cell classification task, leveraging an additional Transformer encoder to ensure accurate and coherent merging. By eliminating unstable bounding box predictions, our method reduces resolution loss and computational complexity, achieving high accuracy while maintaining fast processing speed. Extensive experiments on FinTabNet and PubTabNet demonstrate the superiority of our model over existing approaches, particularly in real-world applications. Our method offers a robust, scalable, and efficient solution for large-scale table recognition, making it well-suited for industrial deployment.
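下面给出“切分视为序列标注、合并视为网格单元分类”这一框架的 PyTorch 骨架示意(层数、维度与特征聚合方式均为假设的简化,非论文官方实现):

```python
import torch
import torch.nn as nn

class SplitMergeSketch(nn.Module):
    """Split-Merge 式表格结构识别骨架(示意)。"""
    def __init__(self, d=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.row_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.col_enc = nn.TransformerEncoder(layer, num_layers=2)
        self.row_tag = nn.Linear(d, 2)    # 每行: 分隔线 / 非分隔线
        self.col_tag = nn.Linear(d, 2)    # 每列: 分隔线 / 非分隔线
        self.merge_cls = nn.Linear(d, 3)  # 每格: 不合并 / 向右合并 / 向下合并

    def forward(self, feat):              # feat: (B, H, W, d) 的图像特征
        row_seq = feat.mean(dim=2)        # (B, H, d) 行方向聚合
        col_seq = feat.mean(dim=1)        # (B, W, d) 列方向聚合
        row_logits = self.row_tag(self.row_enc(row_seq))
        col_logits = self.col_tag(self.col_enc(col_seq))
        merge_logits = self.merge_cls(feat)   # 简化: 直接在网格特征上做合并分类
        return row_logits, col_logits, merge_logits
```

由于全程只做序列标注与单元分类、不回归边界框,这正是摘要所述“消除不稳定的边界框预测”的来源。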
zh
[CV-141] UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment
【速读】:该论文试图解决单目视觉里程计在不同环境、平台和运动模式下鲁棒性和适应性不足的问题。传统方法依赖于特定部署的调优或预定义的运动先验,难以泛化到复杂多样的实际场景。解决方案的关键在于提出一种统一的框架UNO,其核心是采用Mixture-of-Experts策略进行局部状态估计,结合多个专门处理不同自运动模式的解码器,并引入全可微分的Gumbel-Softmax模块以构建稳健的帧间相关图、选择最优专家解码器并剔除错误估计,最终通过融合预训练的尺度无关深度先验与轻量级捆绑调整的统一后端来保证几何一致性。
链接: https://arxiv.org/abs/2506.07013
作者: Wentao Zhao,Yihe Niu,Yanbo Wang,Tianchen Deng,Shenghai Yuan,Zhenli Wang,Rui Guo,Jingchuan Wang
机构: Shanghai Jiao Tong University(上海交通大学); Nanyang Technological University(南洋理工大学); State Grid Intelligence Technology CO., LTD.(国家电网智能技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15pages, 8 figures
Abstract:This work presents UNO, a unified monocular visual odometry framework that enables robust and adaptable pose estimation across diverse environments, platforms, and motion patterns. Unlike traditional methods that rely on deployment-specific tuning or predefined motion priors, our approach generalizes effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots, and handheld devices. To this end, we introduce a Mixture-of-Experts strategy for local state estimation, with several specialized decoders that each handle a distinct class of ego-motion patterns. Moreover, we introduce a fully differentiable Gumbel-Softmax module that constructs a robust inter-frame correlation graph, selects the optimal expert decoder, and prunes erroneous estimates. These cues are then fed into a unified back-end that combines pre-trained, scale-independent depth priors with a lightweight bundling adjustment to enforce geometric consistency. We extensively evaluate our method on three major benchmark datasets: KITTI (outdoor/autonomous driving), EuRoC-MAV (indoor/aerial drones), and TUM-RGBD (indoor/handheld), demonstrating state-of-the-art performance.
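其中“全可微分的 Gumbel-Softmax 专家选择”可以用 PyTorch 内置的 F.gumbel_softmax 简洁表达(以下专家数、输出维度均为假设的示例):

```python
import torch
import torch.nn.functional as F

def select_expert(router_logits, expert_outputs, tau=1.0, hard=True):
    """在多个专家解码器输出之间做可微选择(示意)。

    router_logits  : (B, K) 路由打分
    expert_outputs : (B, K, D) K 个专家各自的估计结果
    """
    w = F.gumbel_softmax(router_logits, tau=tau, hard=hard)  # 近似 one-hot 且梯度可回传
    return torch.einsum("bk,bkd->bd", w, expert_outputs)

# 用法示例: 4 个样本、3 个运动模式专家、每个专家输出 6 维位姿增量
logits = torch.randn(4, 3)
outs = torch.randn(4, 3, 6)
pose = select_expert(logits, outs)       # (4, 6), 梯度可传回路由器
```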
zh
[CV-142] BePo: Leveraging Bird's Eye View and Sparse Points for Efficient and Accurate 3D Occupancy Prediction CVPR2025
【速读】:该论文旨在解决3D占用预测(3D occupancy prediction)中现有方法在计算成本与表示能力之间的权衡问题。具体而言,传统方法依赖于密集的3D特征体积和交叉注意力机制,导致较高的计算开销;而基于鸟瞰图(BEV)或稀疏点云的方法虽然降低了计算成本,但各自存在缺陷:BEV难以处理小物体,而稀疏点云在捕捉平面或大物体时效率较低。论文提出的解决方案是BePo,其关键在于采用双分支结构,结合BEV与稀疏点云表示,通过交叉注意力机制实现两分支间的3D信息共享,从而增强BEV平面上困难目标的信号,并最终融合两分支输出生成更精确的3D占用预测。
链接: https://arxiv.org/abs/2506.07002
作者: Yunxiao Shi,Hong Cai,Jisoo Jeong,Yinhao Zhu,Shizhong Han,Amin Ansari,Fatih Porikli
机构: Qualcomm AI Research(高通人工智能研究); Qualcomm Technologies, Inc.(高通技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Two-page abstract version available at CVPR 2025 Embodied AI Workshop
Abstract:3D occupancy provides fine-grained 3D geometry and semantics for scene understanding which is critical for autonomous driving. Most existing methods, however, carry high compute costs, requiring dense 3D feature volume and cross-attention to effectively aggregate information. More recent works have adopted Bird’s Eye View (BEV) or sparse points as scene representation with much reduced cost, but still suffer from their respective shortcomings. More concretely, BEV struggles with small objects that often experience significant information loss after being projected to the ground plane. On the other hand, points can flexibly model little objects in 3D, but is inefficient at capturing flat surfaces or large objects. To address these challenges, in this paper, we present a novel 3D occupancy prediction approach, BePo, which combines BEV and sparse points based representations. We propose a dual-branch design: a query-based sparse points branch and a BEV branch. The 3D information learned in the sparse points branch is shared with the BEV stream via cross-attention, which enriches the weakened signals of difficult objects on the BEV plane. The outputs of both branches are finally fused to generate predicted 3D occupancy. We conduct extensive experiments on the Occ3D-nuScenes and Occ3D-Waymo benchmarks that demonstrate the superiority of our proposed BePo. Moreover, BePo also delivers competitive inference speed when compared to the latest efficient approaches.
zh
[CV-143] Towards Physics-informed Diffusion for Anomaly Detection in Trajectories
【速读】:该论文旨在解决在给定轨迹数据、特定研究区域和用户定义阈值的情况下,检测可能由GPS欺骗引起的异常轨迹问题,这对于遏制国际水域中的非法活动(如未经许可的捕捞和非法石油转运)具有社会重要性。该问题的挑战性源于深度伪造生成技术(如添加噪声和虚假轨迹)的进步以及用于真实情况验证的标记样本不足。尽管现有文献中基于生成模型的异常轨迹检测方法在数据稀疏情况下表现出一定的前景,但它们未考虑细粒度的时空依赖性和先验物理知识,导致误报率较高。为解决这些限制,该论文提出了一种融合运动学约束的物理信息扩散模型,以识别不符合物理定律的轨迹。该解决方案的关键在于将物理规律与深度学习模型相结合,从而提高异常检测和轨迹生成的准确性与可靠性。
链接: https://arxiv.org/abs/2506.06999
作者: Arun Sharma,Mingzhou Yang,Majid Farhadloo,Subhankar Ghosh,Bharat Jayaprakash,Shashi Shekhar
机构: University of Minnesota (明尼苏达大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Given trajectory data, a domain-specific study area, and a user-defined threshold, we aim to find anomalous trajectories indicative of possible GPS spoofing (e.g., fake trajectory). The problem is societally important to curb illegal activities in international waters, such as unauthorized fishing and illicit oil transfers. The problem is challenging due to advances in AI-generated deep fakes (e.g., additive noise, fake trajectories) and the lack of an adequate amount of labeled samples for ground-truth verification. Recent literature shows promising results for anomalous trajectory detection using generative models despite data sparsity. However, they do not consider fine-scale spatiotemporal dependencies and prior physical knowledge, resulting in higher false-positive rates. To address these limitations, we propose a physics-informed diffusion model that integrates kinematic constraints to identify trajectories that do not adhere to physical laws. Experimental results on real-world datasets in the maritime and urban domains show that the proposed framework results in higher prediction accuracy and lower estimation error rate for anomaly detection and trajectory generation methods, respectively. Our implementation is available at this https URL.
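摘要提到的“运动学约束”可以用一个简单的软惩罚项来理解(速度/加速度上限等均为假设的示例参数,真实约束依领域而定):

```python
import torch

def kinematic_loss(traj, dt=1.0, v_max=15.0, a_max=3.0):
    """对轨迹的速度与加速度越界量施加软惩罚(示意实现)。

    traj : (B, T, 2) 轨迹坐标序列
    """
    vel = (traj[:, 1:] - traj[:, :-1]) / dt      # 有限差分速度
    acc = (vel[:, 1:] - vel[:, :-1]) / dt        # 有限差分加速度
    loss_v = torch.relu(vel.norm(dim=-1) - v_max).mean()   # 只惩罚超限部分
    loss_a = torch.relu(acc.norm(dim=-1) - a_max).mean()
    return loss_v + loss_a
```

这一损失可加权叠加到扩散模型的训练目标上,使生成轨迹趋向物理可行。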
zh
[CV-144] Technical Report for ICRA 2025 GOOSE 3D Semantic Segmentation Challenge: Adaptive Point Cloud Understanding for Heterogeneous Robotic Systems ICRA
【速读】:该论文旨在解决在多样化的非结构化户外环境中,对来自多个机器人平台的3D点云进行语义分割的问题。解决方案的关键在于将Point Prompt Tuning (PPT) 与Point Transformer v3 (PTv3) 主干网络相结合,通过平台特定的条件化和跨数据集类别对齐策略,实现对异构激光雷达数据的自适应处理,从而在不依赖额外外部数据的情况下显著提升了模型性能。
链接: https://arxiv.org/abs/2506.06995
作者: Xiaoya Zhang
机构: EARTHBRAIN Ltd.(EARTHBRAIN有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Winner of the GOOSE 3D Semantic Segmentation Challenge at the IEEE ICRA Workshop on Field Robotics 2025
Abstract:This technical report presents the implementation details of the winning solution for the ICRA 2025 GOOSE 3D Semantic Segmentation Challenge. This challenge focuses on semantic segmentation of 3D point clouds from diverse unstructured outdoor environments collected from multiple robotic platforms. This problem was addressed by implementing Point Prompt Tuning (PPT) integrated with Point Transformer v3 (PTv3) backbone, enabling adaptive processing of heterogeneous LiDAR data through platform-specific conditioning and cross-dataset class alignment strategies. The model is trained without requiring additional external data. As a result, this approach achieved substantial performance improvements with mIoU increases of up to 22.59% on challenging platforms compared to the baseline PTv3 model, demonstrating the effectiveness of adaptive point cloud understanding for field robotics applications.
zh
[CV-145] DM3Net: Dual-Camera Super-Resolution via Domain Modulation and Multi-scale Matching
【速读】:该论文旨在解决双摄像头超分辨率问题,特别是在智能手机摄影中,通过将长焦图像作为参考来提升广角图像的分辨率。其解决方案的关键在于提出DM³Net网络,该网络基于领域调制(Domain Modulation)和多尺度匹配(Multi-scale Matching),通过学习两个域的压缩全局表示以弥合高分辨率域与退化域之间的领域差异,并设计多尺度匹配模块以提高高频率结构细节的可靠传递,同时引入关键剪枝(Key Pruning)以显著降低内存使用和推理时间而几乎不牺牲模型性能。
链接: https://arxiv.org/abs/2506.06993
作者: Cong Guan,Jiacheng Ying,Yuya Ieiri,Osamu Yoshie
机构: Waseda University (早稻田大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dual-camera super-resolution is highly practical for smartphone photography, which primarily super-resolves the wide-angle image using the telephoto image as a reference. In this paper, we propose DM³Net, a novel dual-camera super-resolution network based on Domain Modulation and Multi-scale Matching. To bridge the domain gap between the high-resolution domain and the degraded domain, we learn two compressed global representations from image pairs corresponding to the two domains. To enable reliable transfer of high-frequency structural details from the reference image, we design a multi-scale matching module that conducts patch-level feature matching and retrieval across multiple receptive fields to improve matching accuracy and robustness. Moreover, we also introduce Key Pruning to achieve a significant reduction in memory usage and inference time with little model performance sacrificed. Experimental results on three real-world datasets demonstrate that our DM³Net outperforms the state-of-the-art approaches.
zh
[CV-146] Boosting Adversarial Transferability via Commonality-Oriented Gradient Optimization
【速读】:该论文试图解决在黑盒设置下,基于替代模型生成的对抗样本转移能力较弱的问题,这一问题主要由过拟合导致。解决方案的关键在于通过共同性导向的梯度优化策略(COGO),增强替代模型共享的共性信息扰动并抑制与个体特征相关的扰动,从而提升对抗样本的转移性能。COGO包含两个核心组件:共性增强(CE)和个性抑制(IS),其中CE通过扰动中低频区域来利用ViTs对中低频信息的依赖,IS则通过自适应阈值评估梯度与模型个体性的相关性并赋予相应权重。
链接: https://arxiv.org/abs/2506.06992
作者: Yanting Gao,Yepeng Liu,Junming Liu,Qi Zhang,Hongyun Zhang,Duoqian Miao,Cairong Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages
Abstract:Exploring effective and transferable adversarial examples is vital for understanding the characteristics and mechanisms of Vision Transformers (ViTs). However, adversarial examples generated from surrogate models often exhibit weak transferability in black-box settings due to overfitting. Existing methods improve transferability by diversifying perturbation inputs or applying uniform gradient regularization within surrogate models, yet they have not fully leveraged the shared and unique features of surrogate models trained on the same task, leading to suboptimal transfer performance. Therefore, enhancing perturbations of common information shared by surrogate models and suppressing those tied to individual characteristics offers an effective way to improve transferability. Accordingly, we propose a commonality-oriented gradient optimization strategy (COGO) consisting of two components: Commonality Enhancement (CE) and Individuality Suppression (IS). CE perturbs the mid-to-low frequency regions, leveraging the fact that ViTs trained on the same dataset tend to rely more on mid-to-low frequency information for classification. IS employs adaptive thresholds to evaluate the correlation between backpropagated gradients and model individuality, assigning weights to gradients accordingly. Extensive experiments demonstrate that COGO significantly improves the transfer success rates of adversarial attacks, outperforming current state-of-the-art methods.
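CE 组件“对中低频区域施加扰动”的做法可以用频域掩码来示意:对扰动做 2D FFT,仅保留离零频较近的系数再反变换(keep_ratio 为假设的超参数,论文的具体频带划分以原文为准):

```python
import torch

def lowmid_freq_project(delta, keep_ratio=0.5):
    """把扰动投影到中低频区域(示意实现)。delta: (B, C, H, W)。"""
    B, C, H, W = delta.shape
    spec = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    radius = ((yy - cy) ** 2 + (xx - cx) ** 2).sqrt()
    mask = (radius <= keep_ratio * min(H, W) / 2.0).to(delta.device)
    spec = spec * mask                         # 抑制高频(个性化)成分
    out = torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1)))
    return out.real
```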
zh
[CV-147] Hybrid Mesh-Gaussian Representation for Efficient Indoor Scene Reconstruction
【速读】:该论文试图解决3D Gaussian splatting (3DGS)在处理具有复杂纹理的区域时,由于需要大量Gaussians来准确捕捉颜色变化而导致的渲染效率低下的问题。解决方案的关键在于提出一种混合表示方法,将3DGS与带纹理的网格结合使用,利用带纹理的网格处理纹理丰富的平坦区域,同时保留Gaussians来建模复杂的几何结构,从而在保持渲染质量的同时提升帧率并减少Gaussian数量。
链接: https://arxiv.org/abs/2506.06988
作者: Binxiao Huang,Zhihao Li,Shiyong Liu,Xiao Tang,Jiajun Tang,Jiaqi Lin,Yuxin Cheng,Zhenyu Chen,Xiaofei Wu,Ngai Wong
机构: The University of Hong Kong (香港大学); Huawei Technologies Ltd (华为技术有限公司); Peking University (北京大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian splatting (3DGS) has demonstrated exceptional performance in image-based 3D reconstruction and real-time rendering. However, regions with complex textures require numerous Gaussians to capture significant color variations accurately, leading to inefficiencies in rendering speed. To address this challenge, we introduce a hybrid representation for indoor scenes that combines 3DGS with textured meshes. Our approach uses textured meshes to handle texture-rich flat areas, while retaining Gaussians to model intricate geometries. The proposed method begins by pruning and refining the extracted mesh to eliminate geometrically complex regions. We then employ a joint optimization for 3DGS and mesh, incorporating a warm-up strategy and transmittance-aware supervision to balance their contributions. Experiments demonstrate that the hybrid representation maintains comparable rendering quality and achieves superior frames per second (FPS) with fewer Gaussian primitives.
zh
[CV-148] Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
【速读】:该论文旨在解决生成式多模态模型(Generative Multimodal Models, GMMs)在跨模态表示学习中存在显著的模态差距问题,尽管对比语言-图像预训练(Contrastive Language-Image Pretraining, CLIP)具有跨模态检索能力,但其特征空间中仍存在较大的模态差异。论文提出的解决方案关键在于引入MAPLE(Modality-Aligned Preference Learning for Embeddings),该框架利用多模态大语言模型(Multimodal Large Language Models, MLLMs)内在的细粒度对齐先验,通过强化学习的方式进行跨模态表示学习,核心包括:(1)使用现成MLLM自动构建偏好数据;(2)设计一种新的相对偏好对齐(Relative Preference Alignment, RPA)损失,将直接偏好优化(Direct Preference Optimization, DPO)适配到嵌入学习场景,从而有效提升细粒度跨模态检索性能。
链接: https://arxiv.org/abs/2506.06970
作者: Pengfei Zhao,Rongbo Luan,Wei Zhang,Peng Wu,Sifeng He
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal Large Language Models) demonstrate powerful inherent modality alignment properties. While recent MLLM-based retrievers with unified architectures partially mitigate this gap, their reliance on coarse modality alignment mechanisms fundamentally limits their potential. In this work, we introduce MAPLE (Modality-Aligned Preference Learning for Embeddings), a novel framework that leverages the fine-grained alignment priors inherent in MLLMs to guide cross-modal representation learning. MAPLE formulates the learning process as reinforcement learning with two key components: (1) automatic preference data construction using an off-the-shelf MLLM, and (2) a new Relative Preference Alignment (RPA) loss, which adapts Direct Preference Optimization (DPO) to the embedding learning setting. Experimental results show that our preference-guided alignment achieves substantial gains in fine-grained cross-modal retrieval, underscoring its effectiveness in handling nuanced semantic distinctions.
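RPA 损失的精确形式以论文为准;把 DPO 迁移到嵌入学习的最小化示意可以写成“拉大偏好文本与非偏好文本的相似度差”(beta 为假设超参数,且省略了 DPO 中的参考模型项):

```python
import torch
import torch.nn.functional as F

def rpa_style_loss(img_emb, pos_txt_emb, neg_txt_emb, beta=0.1):
    """DPO 风格的嵌入偏好损失: L = -log sigmoid(beta * (s_pos - s_neg))。"""
    img = F.normalize(img_emb, dim=-1)
    pos = F.normalize(pos_txt_emb, dim=-1)
    neg = F.normalize(neg_txt_emb, dim=-1)
    s_pos = (img * pos).sum(-1)      # 与偏好文本的余弦相似度
    s_neg = (img * neg).sum(-1)      # 与非偏好文本的余弦相似度
    return -F.logsigmoid(beta * (s_pos - s_neg)).mean()
```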
zh
[CV-149] Dual-view Spatio-Temporal Feature Fusion with CNN-Transformer Hybrid Network for Chinese Isolated Sign Language Recognition
【速读】:该论文旨在解决孤立手语识别(ISLR)在实际应用中面临的两个主要问题:现有手语数据集未能覆盖完整的手语词汇,以及大多数数据集仅提供单视角RGB视频,难以处理手部遮挡问题。其解决方案的关键是构建一个双视角手语数据集NationalCSL-DP,该数据集全面覆盖了中国国家标准手语词汇,并通过两个垂直视角(正面和左侧)记录了134140个手语视频。此外,还提出了一种基于卷积神经网络与Transformer的强基线模型,以及一种简单但有效的融合策略以提升预测性能。
链接: https://arxiv.org/abs/2506.06966
作者: Siyuan Jing,Guangxue Wang,Haoyang Zhai,Qin Tao,Jun Yang,Bing Wang,Peng Jin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 3 figures
Abstract:Due to the emergence of many sign language datasets, isolated sign language recognition (ISLR) has made significant progress in recent years. In addition, the development of various advanced deep neural networks is another reason for this breakthrough. However, challenges remain in applying the technique in the real world. First, existing sign language datasets do not cover the whole sign vocabulary. Second, most of the sign language datasets provide only single-view RGB videos, which makes it difficult to handle hand occlusions when performing ISLR. To fill this gap, this paper presents a dual-view sign language dataset for ISLR named NationalCSL-DP, which fully covers the Chinese national sign language vocabulary. The dataset consists of 134140 sign videos recorded by ten signers with respect to two vertical views, namely, the front side and the left side. Furthermore, a CNN-Transformer network is also proposed as a strong baseline, together with an extremely simple but effective fusion strategy for prediction. Extensive experiments were conducted to prove the effectiveness of the dataset as well as the baseline. The results show that the proposed fusion strategy can significantly increase the performance of the ISLR, but it is not easy for the sequence-to-sequence model, regardless of whether the early-fusion or late-fusion strategy is applied, to learn the complementary features from the sign videos of two vertical views.
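摘要未展开融合策略的细节;作为参考,双视角识别最常见的“极简后融合”基线如下(加权系数 w 为假设超参数):

```python
import torch

def dual_view_fusion(logits_front, logits_side, w=0.5):
    """对正面/侧面两路 softmax 概率加权平均后取类别(示意)。"""
    p = w * logits_front.softmax(-1) + (1 - w) * logits_side.softmax(-1)
    return p.argmax(-1)
```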
zh
[CV-150] Long-Tailed Learning for Generalized Category Discovery
【速读】:该论文试图解决在长尾分布的现实世界数据集中,传统广义类别发现(Generalized Category Discovery, GCD)方法性能显著下降的问题。其关键解决方案是提出一种新颖框架,包含自引导标注技术与表征平衡过程:自引导标注技术通过可学习分布生成伪标签,减少分类器偏差;表征平衡过程则通过挖掘样本邻域,增强模型对尾部类别的关注,从而提升模型在长尾分布下的泛化能力。
链接: https://arxiv.org/abs/2506.06965
作者: Cuong Manh Hoang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generalized Category Discovery (GCD) utilizes labeled samples of known classes to discover novel classes in unlabeled samples. Existing methods show effective performance on artificial datasets with balanced distributions. However, real-world datasets are always imbalanced, significantly affecting the effectiveness of these methods. To solve this problem, we propose a novel framework that performs generalized category discovery in long-tailed distributions. We first present a self-guided labeling technique that uses a learnable distribution to generate pseudo-labels, resulting in less biased classifiers. We then introduce a representation balancing process to derive discriminative representations. By mining sample neighborhoods, this process encourages the model to focus more on tail classes. We conduct experiments on public datasets to demonstrate the effectiveness of the proposed framework. The results show that our model exceeds previous state-of-the-art methods.
zh
[CV-151] AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
【速读】:该论文旨在解决现有图像生成模型在生成过程中存在的局限性,如过度复制、风格偏差等问题。其核心问题在于传统方法仅在生成前进行一次静态检索并依赖固定参考图像进行条件生成,无法适应生成过程中的动态需求。解决方案的关键在于提出自回归检索增强(AR-RAG),通过在每个生成步骤中进行上下文感知的检索,利用先前生成的图像块作为查询,动态获取最相关的图像块级视觉参考,从而提升生成质量与灵活性。
链接: https://arxiv.org/abs/2506.06962
作者: Jingyuan Qi,Zhiyang Xu,Qifan Wang,Lifu Huang
机构: Virginia Tech (弗吉尼亚理工学院); Meta (元公司); UC Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Image Generation, Retrieval Augmented Generation
Abstract:We introduce Autoregressive Retrieval Augmentation (AR-RAG), a novel paradigm that enhances image generation by autoregressively incorporating k-nearest neighbor retrievals at the patch level. Unlike prior methods that perform a single, static retrieval before generation and condition the entire generation on fixed reference images, AR-RAG performs context-aware retrievals at each generation step, using prior-generated patches as queries to retrieve and incorporate the most relevant patch-level visual references, enabling the model to respond to evolving generation needs while avoiding limitations (e.g., over-copying, stylistic bias, etc.) prevalent in existing methods. To realize AR-RAG, we propose two parallel frameworks: (1) Distribution-Augmentation in Decoding (DAiD), a training-free plug-and-use decoding strategy that directly merges the distribution of model-predicted patches with the distribution of retrieved patches, and (2) Feature-Augmentation in Decoding (FAiD), a parameter-efficient fine-tuning method that progressively smooths the features of retrieved patches via multi-scale convolution operations and leverages them to augment the image generation process. We validate the effectiveness of AR-RAG on widely adopted benchmarks, including Midjourney-30K, GenEval and DPG-Bench, demonstrating significant performance gains over state-of-the-art image generation models.
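DAiD“直接合并模型分布与检索分布”的思路可以用如下 PyTorch 示意表达(混合系数 lam 与检索分布的构造方式均为假设的简化):

```python
import torch

def daid_style_merge(model_logits, retrieved_patch_ids, lam=0.3):
    """解码时的分布增强(示意): 模型分布与检索 patch 经验分布线性混合。

    model_logits        : (B, V) 当前步的 patch token 预测
    retrieved_patch_ids : (B, K) 检索得到的 K 个参考 patch token id
    """
    p_model = model_logits.softmax(-1)
    p_ret = torch.zeros_like(p_model)
    ones = torch.ones_like(retrieved_patch_ids, dtype=p_model.dtype)
    p_ret.scatter_add_(1, retrieved_patch_ids, ones)         # 检索结果计数
    p_ret = p_ret / p_ret.sum(-1, keepdim=True).clamp_min(1e-8)
    p = (1 - lam) * p_model + lam * p_ret                    # 混合分布
    return torch.multinomial(p, num_samples=1)               # 采样下一个 patch token
```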
zh
[CV-152] Task-driven real-world super-resolution of document scans
【速读】:该论文试图解决单图像超分辨率(Single-image super-resolution)方法在模拟数据集上表现良好,但在真实场景(如文档扫描)中泛化能力不足的问题。其关键解决方案是引入一种面向任务的多任务学习框架,通过结合来自高层视觉任务的辅助损失函数(如文本检测、文本识别、关键点定位和色相一致性),并采用动态权重平均机制来平衡不同目标的相对重要性,从而优化超分辨率网络以提升光学字符识别(OCR)任务的性能。
链接: https://arxiv.org/abs/2506.06953
作者: Maciej Zyrek,Tomasz Tarasiewicz,Jakub Sadel,Aleksandra Krzywon,Michal Kawulok
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-image super-resolution refers to the reconstruction of a high-resolution image from a single low-resolution observation. Although recent deep learning-based methods have demonstrated notable success on simulated datasets – with low-resolution images obtained by degrading and downsampling high-resolution ones – they frequently fail to generalize to real-world settings, such as document scans, which are affected by complex degradations and semantic variability. In this study, we introduce a task-driven, multi-task learning framework for training a super-resolution network specifically optimized for optical character recognition tasks. We propose to incorporate auxiliary loss functions derived from high-level vision tasks, including text detection using the connectionist text proposal network, text recognition via a convolutional recurrent neural network, keypoints localization using this http URL, and hue consistency. To balance these diverse objectives, we employ dynamic weight averaging mechanism, which adaptively adjusts the relative importance of each loss term based on its convergence behavior. We validate our approach upon the SRResNet architecture, which is a well-established technique for single-image super-resolution. Experimental evaluations on both simulated and real-world scanned document datasets demonstrate that the proposed approach improves text detection, measured with intersection over union, while preserving overall image fidelity. These findings underscore the value of multi-objective optimization in super-resolution models for bridging the gap between simulated training regimes and practical deployment in real-world scenarios.
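文中使用的动态权重平均(DWA)有标准公式:w_k = K·exp(r_k/T) / Σ_j exp(r_j/T),其中 r_k = L_k(t-1)/L_k(t-2) 衡量各损失的收敛速度。下面是一个直接照此实现的小函数(温度 T 取常见默认值,属假设):

```python
import numpy as np

def dwa_weights(loss_hist, T=2.0):
    """按 DWA 公式由损失历史计算各任务当步权重。

    loss_hist : (steps, K) 的各任务损失历史, 至少两步
    """
    K = loss_hist.shape[1]
    if loss_hist.shape[0] < 2:
        return np.ones(K)                     # 历史不足时退化为等权
    r = loss_hist[-1] / (loss_hist[-2] + 1e-8)
    e = np.exp(r / T)
    return K * e / e.sum()

# 用法示例: 四个损失项(文本检测 / 识别 / 关键点 / 色相)的历史
hist = np.array([[1.0, 2.0, 0.5, 0.2],
                 [0.9, 1.9, 0.6, 0.2]])
print(dwa_weights(hist))   # 收敛更慢的任务会获得更大权重
```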
zh
[CV-153] LaTtE-Flow: Layerwise Timestep-Expert Flow-based Transformer ATC
【速读】:该论文旨在解决统一的多模态基础模型在预训练成本高、任务性能不如专用模型以及图像生成速度慢等问题。其解决方案的关键在于提出了一种名为Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow) 的新型高效架构,该架构通过将流匹配过程分配到专门的Transformer层组中,每组负责特定时间步,从而显著提高了采样效率,并结合时间步条件残差注意力机制以实现跨层的信息高效复用。
链接: https://arxiv.org/abs/2506.06952
作者: Ying Shen,Zhiyang Xu,Jiuhai Chen,Shizhe Diao,Jiaxin Zhang,Yuguang Yao,Joy Rimchala,Ismini Lourentzou,Lifu Huang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Virginia Tech (弗吉尼亚理工大学); University of Maryland (马里兰大学); Nvidia (英伟达); Intuit AI Research (财捷人工智能研究); UC Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Unified multimodal model, Flow-matching
Abstract:Recent advances in multimodal foundation models unifying image understanding and generation have opened exciting avenues for tackling a wide range of vision-language tasks within a single framework. Despite progress, existing unified models typically require extensive pretraining and struggle to achieve the same level of performance compared to models dedicated to each task. Additionally, many of these models suffer from slow image generation speeds, limiting their practical deployment in real-time or resource-constrained settings. In this work, we propose Layerwise Timestep-Expert Flow-based Transformer (LaTtE-Flow), a novel and efficient architecture that unifies image understanding and generation within a single multimodal model. LaTtE-Flow builds upon powerful pretrained Vision-Language Models (VLMs) to inherit strong multimodal understanding capabilities, and extends them with a novel Layerwise Timestep Experts flow-based architecture for efficient image generation. LaTtE-Flow distributes the flow-matching process across specialized groups of Transformer layers, each responsible for a distinct subset of timesteps. This design significantly improves sampling efficiency by activating only a small subset of layers at each sampling timestep. To further enhance performance, we propose a Timestep-Conditioned Residual Attention mechanism for efficient information reuse across layers. Experiments demonstrate that LaTtE-Flow achieves strong performance on multimodal understanding tasks, while achieving competitive image generation quality with around 6x faster inference speed compared to recent unified multimodal models.
zh
[CV-154] Polar Hierarchical Mamba: Towards Streaming LiDAR Object Detection with Point Clouds as Egocentric Sequences
【速读】:该论文旨在解决自动驾驶中实时感知对低延迟和高吞吐量的需求,特别是针对传统方法在处理LiDAR全360°扫描时引入的显著延迟问题。现有流式处理方法虽然通过按顺序处理部分扫描来缓解延迟,但其依赖的平移不变卷积与极坐标几何不匹配,导致性能下降或需要复杂的失真补偿。论文提出的解决方案是Polar Hierarchical Mamba (PHiM),其关键在于采用局部双向Mamba块进行扇区内的空间编码以及全局前向Mamba进行扇区间的时间建模,以替代卷积和位置编码,从而实现对极坐标流式LiDAR数据的高效处理。
链接: https://arxiv.org/abs/2506.06944
作者: Mellon M. Zhang,Glen Chou,Saibal Mukhopadhyay
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Accurate and efficient object detection is essential for autonomous vehicles, where real-time perception requires low latency and high throughput. LiDAR sensors provide robust depth information, but conventional methods process full 360° scans in a single pass, introducing significant delay. Streaming approaches address this by sequentially processing partial scans in the native polar coordinate system, yet they rely on translation-invariant convolutions that are misaligned with polar geometry – resulting in degraded performance or requiring complex distortion mitigation. Recent Mamba-based state space models (SSMs) have shown promise for LiDAR perception, but only in the full-scan setting, relying on geometric serialization and positional embeddings that are memory-intensive and ill-suited to streaming. We propose Polar Hierarchical Mamba (PHiM), a novel SSM architecture designed for polar-coordinate streaming LiDAR. PHiM uses local bidirectional Mamba blocks for intra-sector spatial encoding and a global forward Mamba for inter-sector temporal modeling, replacing convolutions and positional encodings with distortion-aware, dimensionally-decomposed operations. PHiM sets a new state-of-the-art among streaming detectors on the Waymo Open Dataset, outperforming the previous best by 10% and matching full-scan baselines at twice the throughput. Code will be available at this https URL .
zh
[CV-155] Experimental Evaluation of Static Image Sub-Region-Based Search Models Using CLIP
【速读】:该论文试图解决在高度同质化、专业领域中,由于用户缺乏专业知识导致文本描述模糊而难以有效检索图像的问题。解决方案的关键在于通过添加基于位置的提示(location-based prompts)来补充模糊的文本查询,从而提升检索性能。具体而言,研究者收集了一个包含741个标注的数据集,其中包含短文本、长文本描述以及指示感兴趣区域的边界框,并利用该数据集评估了CLIP模型在不同静态子区域与全图查询下的性能,结果表明采用简单的3×3划分和5网格重叠策略能够显著提高检索效果并保持对标注框扰动的鲁棒性。
链接: https://arxiv.org/abs/2506.06938
作者: Bastian Jäckl,Vojtěch Kloda,Daniel A. Keim,Jakub Lokoč
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures, 2 tables
Abstract:Advances in multimodal text-image models have enabled effective text-based querying in extensive image collections. While these models show convincing performance for everyday life scenes, querying in highly homogeneous, specialized domains remains challenging. The primary problem is that users can often provide only vague textual descriptions as they lack expert knowledge to discriminate between homogenous entities. This work investigates whether adding location-based prompts to complement these vague text queries can enhance retrieval performance. Specifically, we collected a dataset of 741 human annotations, each containing short and long textual descriptions and bounding boxes indicating regions of interest in challenging underwater scenes. Using these annotations, we evaluate the performance of CLIP when queried on various static sub-regions of images compared to the full image. Our results show that both a simple 3-by-3 partitioning and a 5-grid overlap significantly improve retrieval effectiveness and remain robust to perturbations of the annotation box.
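按 3×3 静态划分逐子区域查询 CLIP 的流程,可以用 HuggingFace transformers 的公开权重写成如下草图(模型选择与“取子区域最大分”的聚合方式均为本文的假设示例):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_ID = "openai/clip-vit-base-patch32"   # 假设使用的公开 CLIP 权重
model = CLIPModel.from_pretrained(MODEL_ID).eval()
proc = CLIPProcessor.from_pretrained(MODEL_ID)

def grid_scores(image: Image.Image, query: str, grid: int = 3) -> float:
    """对图像做 grid x grid 划分, 取各子区域图文相似度的最大值作为整图得分。"""
    W, H = image.size
    crops = [image.crop((c * W // grid, r * H // grid,
                         (c + 1) * W // grid, (r + 1) * H // grid))
             for r in range(grid) for c in range(grid)]
    inputs = proc(text=[query], images=crops, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (grid*grid, 1), 每个子区域对查询文本的相似度
    return out.logits_per_image.squeeze(-1).max().item()
```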
zh
[CV-156] Rewriting the Budget: A General Framework for Black-Box Attacks Under Cost Asymmetry
Quick Read: This paper addresses asymmetric query costs in decision-based black-box adversarial attacks: in practice, different queries can incur different costs, e.g., certain output classes may trigger additional review or penalties, raising the attack's cost. Most existing methods assume all queries cost the same, an assumption that rarely holds. The paper proposes a general framework for attacks under asymmetric query costs, termed asymmetric black-box attacks. The key is to modify two core components of existing attacks: Asymmetric Search (AS), a more conservative variant of binary search that reduces reliance on high-cost queries, and Asymmetric Gradient Estimation (AGREST), which shifts the sampling distribution to favor low-cost queries. These changes preserve attack effectiveness while substantially lowering total query cost.
Link: https://arxiv.org/abs/2506.06933
Authors: Mahdi Salmani, Alireza Abdollahpoorrostam, Seyed-Mohsen Moosavi-Dezfooli
Affiliations: University of Southern California; EPFL; Apple
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Traditional decision-based black-box adversarial attacks on image classifiers aim to generate adversarial examples by slightly modifying input images while keeping the number of queries low, where each query involves sending an input to the model and observing its output. Most existing methods assume that all queries have equal cost. However, in practice, queries may incur asymmetric costs; for example, in content moderation systems, certain output classes may trigger additional review, enforcement, or penalties, making them more costly than others. While prior work has considered such asymmetric cost settings, effective algorithms for this scenario remain underdeveloped. In this paper, we propose a general framework for decision-based attacks under asymmetric query costs, which we refer to as asymmetric black-box attacks. We modify two core components of existing attacks: the search strategy and the gradient estimation process. Specifically, we propose Asymmetric Search (AS), a more conservative variant of binary search that reduces reliance on high-cost queries, and Asymmetric Gradient Estimation (AGREST), which shifts the sampling distribution to favor low-cost queries. We design efficient algorithms that minimize total attack cost by balancing different query types, in contrast to earlier methods such as stealthy attacks that focus only on limiting expensive (high-cost) queries. Our method can be integrated into a range of existing black-box attacks with minimal changes. We perform both theoretical analysis and empirical evaluation on standard image classification benchmarks. Across various cost regimes, our method consistently achieves lower total query cost and smaller perturbations than existing approaches, with improvements of up to 40% in some settings.
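As a rough illustration of the Asymmetric Search idea, the sketch below biases a boundary search toward the low-cost side so that fewer probes land in the high-cost region than with symmetric bisection. The oracle interface, the fixed split ratio, and the cost bookkeeping are simplifications for illustration, not the paper's algorithm.

```python
# Sketch: asymmetric boundary search between a low-cost region (oracle True)
# and a high-cost region (oracle False, e.g., outputs that trigger review).
import numpy as np

def asymmetric_search(x_lo, x_hi, oracle, split=0.25, tol=1e-3):
    """Locate the decision boundary on the segment [x_lo, x_hi].

    x_lo lies in the low-cost region, x_hi in the high-cost region.
    split < 0.5 keeps probes closer to the low-cost endpoint, so fewer
    probes land on the high-cost side than with a 0.5 bisection.
    """
    cost = {"low": 0, "high": 0}
    while np.linalg.norm(x_hi - x_lo) > tol:
        x_mid = x_lo + split * (x_hi - x_lo)   # biased split point
        if oracle(x_mid):                       # probe stayed low-cost
            cost["low"] += 1
            x_lo = x_mid
        else:                                   # probe hit the high-cost side
            cost["high"] += 1
            x_hi = x_mid
    return x_lo, cost

# Toy 1-D example: the boundary sits at 0.3 on the segment [0, 1].
boundary = 0.3
x, cost = asymmetric_search(np.array([0.0]), np.array([1.0]),
                            oracle=lambda p: p[0] < boundary)
print(x, cost)   # x approaches the boundary; most probes were low-cost
```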
[CV-157] How Important are Videos for Training Video LLMs?
Quick Read: This paper asks whether the temporal reasoning of current Video LLMs actually depends on dedicated video training. The study finds that image-only-trained LLMs perform well above chance on temporal reasoning tasks, while the gains from further video-specific training are surprisingly small. The key contribution is showing that image-trained LLMs already possess some temporal reasoning ability without video data, and proposing a simple finetuning scheme over sequences of annotated images and questions that performs close to, and occasionally above, video-trained models on temporal reasoning, indicating that current models underutilize the rich temporal features in real video.
Link: https://arxiv.org/abs/2506.06928
Authors: George Lydakis, Alexander Hermans, Ali Athar, Daan de Geus, Bastian Leibe
Affiliations: RWTH Aachen University; ByteDance Seed; Eindhoven University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page on this https URL
Abstract:Research into Video Large Language Models (LLMs) has progressed rapidly, with numerous models and benchmarks emerging in just a few years. Typically, these models are initialized with a pretrained text-only LLM and finetuned on both image- and video-caption datasets. In this paper, we present findings indicating that Video LLMs are more capable of temporal reasoning after image-only training than one would assume, and that improvements from video-specific training are surprisingly small. Specifically, we show that image-trained versions of two LLMs trained with the recent LongVU algorithm perform significantly above chance level on TVBench, a temporal reasoning benchmark. Additionally, we introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This baseline results in temporal reasoning performance close to, and occasionally higher than, what is achieved by video-trained LLMs. This suggests suboptimal utilization of rich temporal features found in real video by current models. Our analysis motivates further research into the mechanisms that allow image-trained LLMs to perform temporal reasoning, as well as into the bottlenecks that render current video training schemes inefficient.
[CV-158] Reading in the Dark with Foveated Event Vision CVPR2025
Quick Read: This paper addresses the weak environmental perception of smart glasses in low-light and high-speed motion scenarios, as well as the high bandwidth and power consumption of capturing dense images with frame cameras. The key to the solution is an event-based Optical Character Recognition (OCR) approach that uses the wearer's eye gaze to foveate the event stream, cutting bandwidth by around 98% while exploiting the advantages of event cameras in high-dynamic-range and fast scenes, and combining deep binary reconstruction with multimodal LLMs for efficient text recognition.
Link: https://arxiv.org/abs/2506.06918
Authors: Carl Brander, Giovanni Cioffi, Nico Messikommer, Davide Scaramuzza
Affiliations: University of Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: CVPR 2025 Workshop on Event-based Vision
Abstract:Current smart glasses equipped with RGB cameras struggle to perceive the environment in low-light and high-speed motion scenarios due to motion blur and the limited dynamic range of frame cameras. Additionally, capturing dense images with a frame camera requires large bandwidth and power consumption, consequently draining the battery faster. These challenges are especially relevant for developing algorithms that can read text from images. In this work, we propose a novel event-based Optical Character Recognition (OCR) approach for smart glasses. By using the eye gaze of the user, we foveate the event stream to significantly reduce bandwidth by around 98% while exploiting the benefits of event cameras in high-dynamic and fast scenes. Our proposed method performs deep binary reconstruction trained on synthetic data and leverages multimodal LLMs for OCR, outperforming traditional OCR solutions. Our results demonstrate the ability to read text in low light environments where RGB cameras struggle while using up to 2400 times less bandwidth than a wearable RGB camera.
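A toy sketch of the foveation step: keep only the events that fall within a radius of the current gaze point. The event layout (t, x, y, polarity), the gaze radius, and the synthetic event stream are assumptions for illustration.

```python
# Sketch: foveate an event stream around the user's gaze to cut bandwidth.
# Events are rows of (t, x, y, polarity); radius is an illustrative choice.
import numpy as np

def foveate_events(events: np.ndarray, gaze_xy, radius: float = 40.0):
    """Keep events within `radius` pixels of the gaze point."""
    d2 = (events[:, 1] - gaze_xy[0]) ** 2 + (events[:, 2] - gaze_xy[1]) ** 2
    return events[d2 <= radius ** 2]

rng = np.random.default_rng(0)
events = np.column_stack([
    np.sort(rng.uniform(0, 1.0, 100_000)),        # timestamps
    rng.uniform(0, 640, 100_000),                 # x coordinates
    rng.uniform(0, 480, 100_000),                 # y coordinates
    rng.integers(0, 2, 100_000),                  # polarity
])
kept = foveate_events(events, gaze_xy=(320, 240))
print(f"kept {len(kept) / len(events):.1%} of events")   # large bandwidth cut
```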
[CV-159] Sleep Stage Classification using Multimodal Embedding Fusion from EOG and PSM
Quick Read: This paper addresses the challenges that traditional polysomnography (PSG) poses for home sleep monitoring, which relies on complex equipment and specialized operation, particularly for diagnosing sleep disorders in aging populations. The key to the solution is a novel approach that uses ImageBind, a multimodal embedding deep learning model, to fuse pressure-sensitive mat (PSM) data with dual-channel electrooculography (EOG) signals for five-stage sleep-wake classification. This is the first reported method to combine PSM and EOG data in this way; fine-tuning ImageBind significantly improves classification accuracy over existing models based on single-channel EOG or PSM data alone, demonstrating adaptability with limited labeled data and potential for medical applications.
Link: https://arxiv.org/abs/2506.06912
Authors: Olivier Papillon, Rafik Goubran, James Green, Julien Larivière-Chartier, Caitlin Higginson, Frank Knoefel, Rébecca Robillard
Affiliations: Carleton University; Bruyère Health Research Institute; University of Ottawa Institute for Mental Health Research at the Royal Ottawa Hospital; School of Psychology, University of Ottawa
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to IEEE MeMeA 2025
Abstract:Accurate sleep stage classification is essential for diagnosing sleep disorders, particularly in aging populations. While traditional polysomnography (PSG) relies on electroencephalography (EEG) as the gold standard, its complexity and need for specialized equipment make home-based sleep monitoring challenging. To address this limitation, we investigate the use of electrooculography (EOG) and pressure-sensitive mats (PSM) as less obtrusive alternatives for five-stage sleep-wake classification. This study introduces a novel approach that leverages ImageBind, a multimodal embedding deep learning model, to integrate PSM data with dual-channel EOG signals for sleep stage classification. Our method is the first reported approach that fuses PSM and EOG data for sleep stage classification with ImageBind. Our results demonstrate that fine-tuning ImageBind significantly improves classification accuracy, outperforming existing models based on single-channel EOG (DeepSleepNet), exclusively PSM data (ViViT), and other multimodal deep learning approaches (MBT). Notably, the model also achieved strong performance without fine-tuning, highlighting its adaptability to specific tasks with limited labeled data, making it particularly advantageous for medical applications. We evaluated our method using 85 nights of patient recordings from a sleep clinic. Our findings suggest that pre-trained multimodal embedding models, even those originally developed for non-medical domains, can be effectively adapted for sleep staging, with accuracies approaching systems that require complex EEG data.
[CV-160] Gaussian Mapping for Evolving Scenes
Quick Read: This paper addresses the limited performance of novel view synthesis (NVS) systems in dynamic scenes, in particular the lack of effective modeling of long-term dynamics (the scene evolving outside the camera's view). The key to the solution is a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest scene changes, together with a novel keyframe management mechanism that discards outdated observations while preserving as much useful information as possible, maintaining geometric and semantic consistency.
Link: https://arxiv.org/abs/2506.06909
Authors: Vladimir Yugay, Thies Kersten, Luca Carlone, Theo Gevers, Martin R. Oswald, Lukas Schmid
Affiliations: University of Amsterdam; Massachusetts Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Mapping systems with novel view synthesis (NVS) capabilities are widely used in computer vision, with augmented reality, robotics, and autonomous driving applications. Most notably, 3D Gaussian Splatting-based systems show high NVS performance; however, many current approaches are limited to static scenes. While recent works have started addressing short-term dynamics (motion within the view of the camera), long-term dynamics (the scene evolving through changes out of view) remain less explored. To overcome this limitation, we introduce a dynamic scene adaptation mechanism that continuously updates the 3D representation to reflect the latest changes. In addition, since maintaining geometric and semantic consistency remains challenging due to stale observations disrupting the reconstruction process, we propose a novel keyframe management mechanism that discards outdated observations while preserving as much information as possible. We evaluate Gaussian Mapping for Evolving Scenes (GaME) on both synthetic and real-world datasets and find it to be more accurate than the state of the art.
[CV-161] KNN-Defense: Defense against 3D Adversarial Point Clouds using Nearest-Neighbor Search
Quick Read: This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial attacks on 3D point clouds, such as point dropping, shifting, and adding, which compromise the semantic and structural integrity of point clouds and render many existing defenses ineffective. The key to the solution is KNN-Defense, a strategy grounded in the manifold assumption and nearest-neighbor search in feature space: instead of reconstructing surface geometry or enforcing uniform point distributions, it restores perturbed inputs by leveraging the semantic similarity of neighboring samples from the training set, yielding a lightweight and computationally efficient defense.
Link: https://arxiv.org/abs/2506.06906
Authors: Nima Jamali, Matina Mahdizadeh Sani, Hanieh Naderi, Shohreh Kasaei
Affiliations: University of Waterloo; University of Tehran; Sharif University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deep neural networks (DNNs) have demonstrated remarkable performance in analyzing 3D point cloud data. However, their vulnerability to adversarial attacks-such as point dropping, shifting, and adding-poses a critical challenge to the reliability of 3D vision systems. These attacks can compromise the semantic and structural integrity of point clouds, rendering many existing defense mechanisms ineffective. To address this issue, a defense strategy named KNN-Defense is proposed, grounded in the manifold assumption and nearest-neighbor search in feature space. Instead of reconstructing surface geometry or enforcing uniform point distributions, the method restores perturbed inputs by leveraging the semantic similarity of neighboring samples from the training set. KNN-Defense is lightweight and computationally efficient, enabling fast inference and making it suitable for real-time and practical applications. Empirical results on the ModelNet40 dataset demonstrated that KNN-Defense significantly improves robustness across various attack types. In particular, under point-dropping attacks-where many existing methods underperform due to the targeted removal of critical points-the proposed method achieves accuracy gains of 20.1%, 3.6%, 3.44%, and 7.74% on PointNet, PointNet++, DGCNN, and PCT, respectively. These findings suggest that KNN-Defense offers a scalable and effective solution for enhancing the adversarial resilience of 3D point cloud classifiers. (An open-source implementation of the method, including code and data, is available at this https URL).
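A minimal sketch of the feature-space nearest-neighbor idea: classify a possibly perturbed sample by majority vote over its nearest training features under the manifold assumption. The random features stand in for the penultimate-layer embeddings of a point-cloud network such as PointNet, and k is an illustrative choice.

```python
# Sketch: recover a prediction for a perturbed input via nearest neighbors
# in feature space. The random features below stand in for embeddings from
# a real point-cloud classifier's penultimate layer.
import numpy as np

def knn_defense_predict(feat_query, train_feats, train_labels, k=5):
    """Majority vote over the k nearest training samples in feature space."""
    d = np.linalg.norm(train_feats - feat_query, axis=1)
    nearest = np.argsort(d)[:k]
    votes = np.bincount(train_labels[nearest])
    return int(np.argmax(votes))

# Toy demo with random 256-D features for a 10-class problem.
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 256))
train_labels = rng.integers(0, 10, size=1000)
query = train_feats[42] + 0.05 * rng.normal(size=256)   # mildly perturbed sample
print(knn_defense_predict(query, train_feats, train_labels))  # label of sample #42
```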
[CV-162] NSD-Imagery: A benchmark dataset for extending fMRI vision decoding methods to mental imagery CVPR2025
Quick Read: This paper addresses the limited generalization of visual decoding models trained on the Natural Scenes Dataset (NSD) to mental image reconstruction, where mental images are encoded in brain activity with lower signal-to-noise ratio and spatial resolution. The key to the solution is NSD-Imagery, a benchmark dataset of human fMRI activity paired with mental images that makes it possible to assess these models on mental image reconstruction. The study further shows that architectural choices strongly affect cross-decoding performance: models with simple linear decoding architectures and multimodal feature decoding generalize better to mental imagery, while complex architectures tend to overfit the visual training data.
Link: https://arxiv.org/abs/2506.06898
Authors: Reese Kneeland, Paul S. Scotti, Ghislain St-Yves, Jesse Breedlove, Kendrick Kay, Thomas Naselaris
Affiliations: University of Minnesota; Princeton Neuroscience Institute; Stability AI/Medical AI Research Center (MedARC)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
Comments: Published at CVPR 2025
Abstract:We release NSD-Imagery, a benchmark dataset of human fMRI activity paired with mental images, to complement the existing Natural Scenes Dataset (NSD), a large-scale dataset of fMRI activity paired with seen images that enabled unprecedented improvements in fMRI-to-image reconstruction efforts. Recent models trained on NSD have been evaluated only on seen image reconstruction. Using NSD-Imagery, it is possible to assess how well these models perform on mental image reconstruction. This is a challenging generalization requirement because mental images are encoded in human brain activity with relatively lower signal-to-noise and spatial resolution; however, generalization from seen to mental imagery is critical for real-world applications in medical domains and brain-computer interfaces, where the desired information is always internally generated. We provide benchmarks for a suite of recent NSD-trained open-source visual decoding models (MindEye1, MindEye2, Brain Diffuser, iCNN, Takagi et al.) on NSD-Imagery, and show that the performance of decoding methods on mental images is largely decoupled from performance on vision reconstruction. We further demonstrate that architectural choices significantly impact cross-decoding performance: models employing simple linear decoding architectures and multimodal feature decoding generalize better to mental imagery, while complex architectures tend to overfit visual training data. Our findings indicate that mental imagery datasets are critical for the development of practical applications, and establish NSD-Imagery as a useful resource for better aligning visual decoding methods with this goal.
[CV-163] Hybrid Vision Transformer-Mamba Framework for Autism Diagnosis via Eye-Tracking Analysis
Quick Read: This paper aims to improve the accuracy of early Autism Spectrum Disorder (ASD) diagnosis to enable early intervention. The key to the solution is a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba that integrates visual, speech, and facial cues through attention-based fusion, capturing both spatial and temporal dynamics. Compared with traditional handcrafted feature extraction, the method applies state-of-the-art deep learning and explainable AI techniques, improving diagnostic accuracy and transparency.
Link: https://arxiv.org/abs/2506.06886
Authors: Wafaa Kasri, Yassine Himeur, Abigail Copiaco, Wathiq Mansoor, Ammar Albanna, Valsamma Eapen
Affiliations: Tissemsilt University; University of Dubai; Mohammed Bin Rashid University; University of New South Wales
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 4 figures and 2 tables
Abstract:Accurate Autism Spectrum Disorder (ASD) diagnosis is vital for early intervention. This study presents a hybrid deep learning framework combining Vision Transformers (ViT) and Vision Mamba to detect ASD using eye-tracking data. The model uses attention-based fusion to integrate visual, speech, and facial cues, capturing both spatial and temporal dynamics. Unlike traditional handcrafted methods, it applies state-of-the-art deep learning and explainable AI techniques to enhance diagnostic accuracy and transparency. Tested on the Saliency4ASD dataset, the proposed ViT-Mamba model outperformed existing methods, achieving 0.96 accuracy, 0.95 F1-score, 0.97 sensitivity, and 0.94 specificity. These findings show the model’s promise for scalable, interpretable ASD screening, especially in resource-constrained or remote clinical settings where access to expert diagnosis is limited.
[CV-164] FREE: Fast and Robust Vision Language Models with Early Exits ACL
Quick Read: This paper addresses the inference latency that the large size of Vision-Language Models (VLMs) imposes on real-world applications. The key to the solution is employing Early Exit (EE) strategies in VLMs, placing multiple exit points in the model to enable input-adaptive inference that raises inference speed with minimal performance loss. Since training exit classifiers with limited labeled data is challenging, the authors propose FREE, an adversarial training approach in a GAN-based framework: the transformer layer at each exit is trained adversarially to produce feature representations similar to the final layer, with a feature classifier acting as the discriminator.
Link: https://arxiv.org/abs/2506.06884
Authors: Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Affiliations: IIT Bombay
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear at the Association of Computational Linguistics (ACL) 2025 Conference
Abstract:In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance. The source code is available at this https URL.
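The early-exit inference pattern itself is easy to sketch: run the backbone layer by layer and stop at the first exit whose classifier is confident enough. The toy linear backbone and the confidence threshold below are illustrative; FREE's exits are adversarially trained transformer layers.

```python
# Sketch: early-exit inference loop. Each exit owns a small classifier;
# we stop at the first exit whose softmax confidence clears a threshold.
import torch
import torch.nn as nn

class EarlyExitBackbone(nn.Module):
    def __init__(self, dim=64, n_layers=6, n_classes=10):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)])
        self.exits = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(n_layers)])

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            x = layer(x)
            probs = exit_head(x).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            # exit as early as confidence allows; the last layer always returns
            if conf.item() >= threshold or depth == len(self.layers) - 1:
                return pred.item(), depth

model = EarlyExitBackbone()
pred, depth = model(torch.randn(1, 64))
print(f"predicted class {pred} at exit {depth}")
```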
[CV-165] Face recognition on point cloud with cGAN-TOP for denoising ICASSP2023
Quick Read: This paper addresses the challenge of 3D face recognition on noisy point clouds, since raw point clouds typically contain substantial noise from imperfect sensors. The key to the solution is an end-to-end 3D face recognition method that synergistically integrates denoising and recognition modules: a Conditional Generative Adversarial Network on Three Orthogonal Planes (cGAN-TOP) effectively removes point-cloud noise and recovers the underlying features, after which a Linked Dynamic Graph Convolutional Neural Network (LDGCNN) recognizes faces from the processed point cloud by hierarchically linking local point features and multi-scale neighboring features.
Link: https://arxiv.org/abs/2506.06864
Authors: Junyu Liu, Jianfeng Ren, Sunhong Liang, Xudong Jiang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published in ICASSP 2023
Abstract:Face recognition using 3D point clouds is gaining growing interest, while raw point clouds often contain a significant amount of noise due to imperfect sensors. In this paper, an end-to-end 3D face recognition on a noisy point cloud is proposed, which synergistically integrates the denoising and recognition modules. Specifically, a Conditional Generative Adversarial Network on Three Orthogonal Planes (cGAN-TOP) is designed to effectively remove the noise in the point cloud, and recover the underlying features for subsequent recognition. A Linked Dynamic Graph Convolutional Neural Network (LDGCNN) is then adapted to recognize faces from the processed point cloud, which hierarchically links both the local point features and neighboring features of multiple scales. The proposed method is validated on the Bosphorus dataset. It significantly improves the recognition accuracy under all noise settings, with a maximum gain of 14.81%.
[CV-166] Multimodal Spatial Language Maps for Robot Navigation and Manipulation
Quick Read: This paper addresses the problem of grounding language to a navigating agent's observations, where prior approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect modalities beyond vision. The key to the solution is multimodal spatial language maps, a representation that fuses pretrained multimodal features with a 3D reconstruction of the environment so perception can be matched to object or event descriptions. The maps are built autonomously, with two instances, visual-language maps (VLMaps) and their audio-extended variant, audio-visual-language maps (AVLMaps), improving goal localization and navigation in complex environments.
Link: https://arxiv.org/abs/2506.06862
Authors: Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
Affiliations: University of Technology Nuremberg, Germany; UC Berkeley, USA; Google Research, USA
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to the International Journal of Robotics Research (IJRR). 24 pages, 18 figures. The paper contains texts from VLMaps (arXiv:2210.05714) and AVLMaps (arXiv:2303.07522). The project page is this https URL
Abstract:Grounding language to a navigating agent’s observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals (e.g., “in between the sofa and TV”) directly localized in the map, and (ii) be shared across different robot embodiments to generate tailored obstacle maps on demand. Building upon the capabilities above, AVLMaps extend VLMaps by introducing a unified 3D spatial representation integrating audio, visual, and language cues through the fusion of features from pretrained multimodal foundation models. This enables robots to ground multimodal goal queries (e.g., text, images, or audio snippets) to spatial locations for navigation. Additionally, the incorporation of diverse sensory inputs significantly enhances goal disambiguation in ambiguous environments. Experiments in simulation and real-world settings demonstrate that our multimodal spatial language maps enable zero-shot spatial and multimodal goal navigation and improve recall by 50% in ambiguous scenarios. These capabilities extend to mobile robots and tabletop manipulators, supporting navigation and interaction guided by visual, audio, and spatial cues.
[CV-167] Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning
Quick Read: This paper addresses a limitation of existing RL-finetuned Multimodal Large Language Models (MLLMs) for visual reasoning: current RL methods sample action groups only from the policy model itself, which caps the model's reasoning capability and makes training inefficient. The key to the solution is Vision-EKIPL, a novel RL framework that introduces high-quality actions generated by external auxiliary models during RL training to guide the optimization of the policy model, significantly expanding the exploration space, raising the reasoning boundary, and accelerating training convergence and efficiency.
Link: https://arxiv.org/abs/2506.06856
Authors: Chaoyang Wang, Zeyu Zhang, Haiyun Jiang
Affiliations: Central South University; The Australian National University; Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Visual reasoning is crucial for understanding complex multimodal data and advancing Artificial General Intelligence. Existing methods enhance the reasoning capability of Multimodal Large Language Models (MLLMs) through Reinforcement Learning (RL) fine-tuning (e.g., GRPO). However, current RL approaches sample action groups solely from the policy model itself, which limits the upper boundary of the model's reasoning capability and leads to inefficient training. To address these limitations, this paper proposes a novel RL framework called Vision-EKIPL. The core of this framework lies in introducing high-quality actions generated by external auxiliary models during the RL training process to guide the optimization of the policy model. The policy learning with knowledge infusion from external models significantly expands the model's exploration space, effectively improves the reasoning boundary, and substantially accelerates training convergence speed and efficiency. Experimental results demonstrate that our proposed Vision-EKIPL achieved up to a 5% performance improvement on the Reason-RFT-CoT Benchmark compared to the state-of-the-art (SOTA). It reveals that Vision-EKIPL can overcome the limitations of traditional RL methods, significantly enhance the visual reasoning performance of MLLMs, and provide a new effective paradigm for research in this field.
[CV-168] DONUT: A Decoder-Only Model for Trajectory Prediction
Quick Read: This paper tackles predicting the motion trajectories of other agents in a scene, a capability that lets an autonomous vehicle anticipate and plan ahead. The key to the solution is DONUT, a decoder-only model that encodes historical trajectories and predicts future ones autoregressively with a single model, enabling consistent, iterative predictions in which the model is always supplied with up-to-date information. An 'overprediction' strategy additionally gives the network the auxiliary task of predicting trajectories at longer temporal horizons, further improving its ability to anticipate the future.
Link: https://arxiv.org/abs/2506.06854
Authors: Markus Knoche, Daan de Geus, Bastian Leibe
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Different from existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, enhancing the performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an ‘overprediction’ strategy that gives the network the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future, and further improves the performance. With experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark.
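The decoder-only idea reduces to a plain autoregressive rollout: feed the history, predict the next waypoint, append it, repeat. The GRU stand-in and the auxiliary longer-horizon 'overprediction' head below are illustrative simplifications of the paper's model.

```python
# Sketch: decoder-only autoregressive trajectory unrolling with an auxiliary
# "overprediction" head (a longer-horizon target used as a training signal).
# The GRU is a stand-in for the real decoder architecture.
import torch
import torch.nn as nn

class TinyTrajectoryDecoder(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=2, hidden_size=hidden, batch_first=True)
        self.next_step = nn.Linear(hidden, 2)    # next (x, y) waypoint
        self.over_pred = nn.Linear(hidden, 2)    # auxiliary longer-horizon waypoint

    def forward(self, history):                  # history: (B, T, 2)
        out, _ = self.rnn(history)
        h = out[:, -1]                            # latest state only
        return self.next_step(h), self.over_pred(h)

    @torch.no_grad()
    def rollout(self, history, n_future=12):
        traj = history
        preds = []
        for _ in range(n_future):                 # iterative, always up-to-date
            nxt, _ = self.forward(traj)
            preds.append(nxt)
            traj = torch.cat([traj, nxt[:, None, :]], dim=1)
        return torch.stack(preds, dim=1)          # (B, n_future, 2)

model = TinyTrajectoryDecoder()
future = model.rollout(torch.randn(4, 10, 2))
print(future.shape)   # torch.Size([4, 12, 2])
```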
[CV-169] Position Prediction Self-Supervised Learning for Multimodal Satellite Imagery Semantic Segmentation
Quick Read: This paper addresses the limited performance of semantic segmentation for satellite imagery caused by scarce labeled data. The key to the solution is adapting LOCA (Location-aware), a position-prediction self-supervised learning method, to multimodal satellite imagery segmentation: SatMAE's channel grouping is extended from multispectral to multimodal data, same-group attention masking encourages cross-modal interaction during pretraining, and relative patch position prediction encourages spatial reasoning for localization rather than reconstruction, improving segmentation results.
Link: https://arxiv.org/abs/2506.06852
Authors: John Waithaka, Moise Busogi
Affiliations: Carnegie Mellon University Africa
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Semantic segmentation of satellite imagery is crucial for Earth observation applications, but remains constrained by limited labelled training data. While self-supervised pretraining methods like Masked Autoencoders (MAE) have shown promise, they focus on reconstruction rather than localisation-a fundamental aspect of segmentation tasks. We propose adapting LOCA (Location-aware), a position prediction self-supervised learning method, for multimodal satellite imagery semantic segmentation. Our approach addresses the unique challenges of satellite data by extending SatMAE’s channel grouping from multispectral to multimodal data, enabling effective handling of multiple modalities, and introducing same-group attention masking to encourage cross-modal interaction during pretraining. The method uses relative patch position prediction, encouraging spatial reasoning for localisation rather than reconstruction. We evaluate our approach on the Sen1Floods11 flood mapping dataset, where it significantly outperforms existing reconstruction-based self-supervised learning methods for satellite imagery. Our results demonstrate that position prediction tasks, when properly adapted for multimodal satellite imagery, learn representations more effective for satellite image semantic segmentation than reconstruction-based approaches.
[CV-170] Deep Inertial Pose: A deep learning approach for human pose estimation
Quick Read: This paper addresses the accuracy of human joint pose estimation in Inertial Measurement Unit (IMU)-based motion capture, which normally requires complex, expertise-demanding biomechanical models and mathematical analysis, making existing solutions such as MVN Awinda from Xsens Technologies expensive. The key to the solution is to use neural networks to abstract away these biomechanical models and analytical mathematics. Comparing different architectures and methodologies, the hybrid LSTM-Madgwick detached model performed best, achieving a quaternion angle distance error of 7.96 with high-end MARG (Mtw Awinda) sensor data, indicating that neural networks can estimate human pose with results comparable to state-of-the-art fusion filters.
Link: https://arxiv.org/abs/2506.06850
Authors: Sara M. Cerqueira, Manuel Palermo, Cristina P. Santos
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:Inertial-based motion capture systems have been attracting growing attention due to their wearability and unconstrained use. However, accurate human joint estimation demands several complex, expertise-demanding steps, which leads to expensive software such as the state-of-the-art MVN Awinda from Xsens Technologies. This work aims to study the use of neural networks to abstract the complex biomechanical models and analytical mathematics required for pose estimation. Thus, it presents a comparison of different neural network architectures and methodologies to understand how accurately these methods can estimate human pose, using both low-cost (MPU9250) and high-end (Mtw Awinda) Magnetic, Angular Rate, and Gravity (MARG) sensors. The most efficient method was the hybrid LSTM-Madgwick detached model, which achieved a quaternion angle distance error of 7.96 using Mtw Awinda data. An ablation study was also conducted to study the impact of data augmentation, output representation, window size, loss function, and magnetometer data on the pose estimation error. This work indicates that neural networks can be trained to estimate human pose, with results comparable to state-of-the-art fusion filters.
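The quaternion angle distance used as the error metric measures the smallest rotation angle between two orientations; a short sketch (in degrees, with the absolute value handling the q/-q double cover):

```python
# Sketch: quaternion angle distance between two unit quaternions,
# i.e. the angle of the relative rotation, reported here in degrees.
import numpy as np

def quat_angle_distance_deg(q1, q2):
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    dot = abs(np.dot(q1, q2))            # abs(): q and -q are the same rotation
    dot = np.clip(dot, -1.0, 1.0)
    return np.degrees(2.0 * np.arccos(dot))

identity = np.array([1.0, 0.0, 0.0, 0.0])
ninety_z = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
print(quat_angle_distance_deg(identity, ninety_z))   # -> 90.0
```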
[CV-171] Multi-StyleGS: Stylizing Gaussian Splatting with Multiple Styles AAAI2025
Quick Read: This paper addresses how to stylize a given 3D scene to match multiple artistic styles, via automatic local style transfer or manual designation, while keeping stylization training memory-efficient. The key to the solution is Multi-StyleGS, a novel 3D Gaussian Splatting (GS) stylization method that uses a bipartite matching mechanism to automatically identify correspondences between style images and local regions of the rendered images, and introduces a semantic style loss combined with local-global feature matching to improve multi-view consistency, yielding memory-efficient training, richer texture details, and better color matching.
Link: https://arxiv.org/abs/2506.06846
Authors: Yangkai Lin, Jiabao Lei, Kui Jia
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2025
Abstract:In recent years, there has been a growing demand to stylize a given 3D scene to align with the artistic style of reference images for creative purposes. While 3D Gaussian Splatting (GS) has emerged as a promising and efficient method for realistic 3D scene modeling, there remains a challenge in adapting it to stylize 3D GS to match multiple styles through automatic local style transfer or manual designation, while maintaining memory efficiency for stylization training. In this paper, we introduce a novel 3D GS stylization solution termed Multi-StyleGS to tackle these challenges. In particular, we employ a bipartite matching mechanism to automatically identify correspondences between the style images and the local regions of the rendered images. To facilitate local style transfer, we introduce a novel semantic style loss function that employs a segmentation network to apply distinct styles to various objects of the scene and propose a local-global feature matching to enhance the multi-view consistency. Furthermore, this technique can achieve memory-efficient training, more texture details and better color match. To better assign a robust semantic label to each Gaussian, we propose several techniques to regularize the segmentation network. As demonstrated by our comprehensive experiments, our approach outperforms existing ones in producing plausible stylization results and offering flexible editing.
[CV-172] Harnessing Vision-Language Models for Time Series Anomaly Detection
Quick Read: This paper addresses the lack of visual-temporal reasoning in existing time-series anomaly detection (TSAD) methods, which train domain-specific models on numerical data and cannot identify contextual anomalies the way human experts do. The key to the solution is a two-stage framework that harnesses Vision Language Models (VLMs): ViT4TS, a screening stage built on a relatively lightweight pretrained vision encoder that uses 2D time-series representations to accurately localize candidate anomalies, and VLM4TS, a VLM-based stage that integrates global temporal context and VLM reasoning capacity to refine detection over those candidates. Without any time-series training, the method improves the F1-max score by 24.6% over the best baseline and is on average 36 times more efficient in token usage.
Link: https://arxiv.org/abs/2506.06836
Authors: Zelin He, Sarah Alnegheimish, Matthew Reimherr
Affiliations: Pennsylvania State University; MIT; Amazon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and industrial monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal reasoning capacity that human experts have to identify contextual anomalies. To fill this gap, we explore a solution based on vision language models (VLMs). Recent studies have shown the ability of VLMs for visual reasoning tasks, yet their direct application to time series has fallen short on both accuracy and efficiency. To harness the power of VLMs for TSAD, we propose a two-stage solution, with (1) ViT4TS, a vision-screening stage built on a relatively lightweight pretrained vision encoder, which leverages 2-D time-series representations to accurately localize candidate anomalies; (2) VLM4TS, a VLM-based stage that integrates global temporal context and VLM reasoning capacity to refine the detection upon the candidates provided by ViT4TS. We show that without any time-series training, VLM4TS outperforms time-series pretrained and from-scratch baselines in most cases, yielding a 24.6 percent improvement in F1-max score over the best baseline. Moreover, VLM4TS also consistently outperforms existing language-model-based TSAD methods and is on average 36 times more efficient in token usage.
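One simple way to obtain the kind of 2D time-series representation a vision-screening stage consumes is to rasterize sliding windows into images. The window size, stride, and matplotlib rendering below are illustrative assumptions, not the paper's pipeline.

```python
# Sketch: turn sliding windows of a 1-D series into 2-D images that a
# vision encoder can screen for anomaly candidates. Window size, stride,
# and the matplotlib rendering are illustrative choices.
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # headless rendering
import matplotlib.pyplot as plt

def windows_to_images(series, window=128, stride=64, size=(224, 224)):
    images = []
    for start in range(0, len(series) - window + 1, stride):
        fig, ax = plt.subplots(figsize=(size[0] / 100, size[1] / 100), dpi=100)
        ax.plot(series[start:start + window], linewidth=1)
        ax.axis("off")
        fig.canvas.draw()
        buf = np.asarray(fig.canvas.buffer_rgba())[..., :3]   # H x W x 3
        plt.close(fig)
        images.append((start, buf))        # keep the window offset for mapping back
    return images

t = np.linspace(0, 20 * np.pi, 2000)
series = np.sin(t)
series[900:910] += 3.0                     # injected spike to be spotted
imgs = windows_to_images(series)
print(len(imgs), imgs[0][1].shape)
```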
[CV-173] EndoARSS: Adapting Spatially-Aware Foundation Model for Efficient Activity Recognition and Semantic Segmentation in Endoscopic Surgery
Quick Read: This paper addresses the difficulty of understanding endoscopic surgical environments, which stems from highly variable surgical activity scenarios, confusable image features between targets and background, and the poor performance of conventional deep models under cross-activity interference. The key to the solution is EndoARSS, a novel multi-task learning framework for surgical activity recognition and semantic segmentation built on the DINOv2 foundation model: it incorporates Low-Rank Adaptation for efficient fine-tuning, Task Efficient Shared Low-Rank Adapters to mitigate gradient conflicts across tasks, and Spatially-Aware Multi-Scale Attention to strengthen the discrimination of feature representations.
Link: https://arxiv.org/abs/2506.06830
Authors: Guankun Wang, Rui Tang, Mengya Xu, Long Bai, Huxin Gao, Hongliang Ren
Affiliations: The Chinese University of Hong Kong; Fuzhou University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by Advanced Intelligent Systems
Abstract:Endoscopic surgery is the gold standard for robotic-assisted minimally invasive surgery, offering significant advantages in early disease detection and precise interventions. However, the complexity of surgical scenes, characterized by high variability in different surgical activity scenarios and confused image features between targets and the background, presents challenges for surgical environment understanding. Traditional deep learning models often struggle with cross-activity interference, leading to suboptimal performance in each downstream task. To address this limitation, we explore multi-task learning, which utilizes the interrelated features between tasks to enhance overall task performance. In this paper, we propose EndoARSS, a novel multi-task learning framework specifically designed for endoscopy surgery activity recognition and semantic segmentation. Built upon the DINOv2 foundation model, our approach integrates Low-Rank Adaptation to facilitate efficient fine-tuning while incorporating Task Efficient Shared Low-Rank Adapters to mitigate gradient conflicts across diverse tasks. Additionally, we introduce the Spatially-Aware Multi-Scale Attention that enhances feature representation discrimination by enabling cross-spatial learning of global information. In order to evaluate the effectiveness of our framework, we present three novel datasets, MTLESD, MTLEndovis and MTLEndovis-Gen, tailored for endoscopic surgery scenarios with detailed annotations for both activity recognition and semantic segmentation tasks. Extensive experiments demonstrate that EndoARSS achieves remarkable performance across multiple benchmarks, significantly improving both accuracy and robustness in comparison to existing models. These results underscore the potential of EndoARSS to advance AI-driven endoscopic surgical systems, offering valuable insights for enhancing surgical safety and efficiency.
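Low-Rank Adaptation, the fine-tuning mechanism the framework builds on, can be sketched in a few lines: freeze the pretrained weight and learn only a rank-r update. The rank and scaling values are illustrative hyperparameters.

```python
# Sketch: a LoRA-augmented linear layer. The pretrained weight stays frozen;
# only the rank-r factors A and B are trained. r and alpha are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():             # freeze pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no drift
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 12288: only the low-rank factors are trained
```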
[CV-174] Controllable Coupled Image Generation via Diffusion Models
Quick Read: This paper addresses background consistency in coupled image generation, where multiple simultaneously generated images should share the same or very similar backgrounds while their centered objects stay flexible under different text prompts. The key to the solution is to disentangle the background and entity components in the model's cross-attention modules and attach a sequence of time-varying weight control parameters depending on the sampling time step, optimized jointly for background coupling, text-to-image alignment, and overall visual quality.
Link: https://arxiv.org/abs/2506.06826
Authors: Chenfei Yuan, Nanshan Jia, Hangqi Li, Peter W. Glynn, Zeyu Zheng
Affiliations: UC Berkeley IEOR and BAIR
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:We provide an attention-level control method for the task of coupled image generation, where “coupled” means that multiple simultaneously generated images are expected to have the same or very similar backgrounds. While backgrounds coupled, the centered objects in the generated images are still expected to enjoy the flexibility raised from different text prompts. The proposed method disentangles the background and entity components in the model’s cross-attention modules, attached with a sequence of time-varying weight control parameters depending on the time step of sampling. We optimize this sequence of weight control parameters with a combined objective that assesses how coupled the backgrounds are as well as text-to-image alignment and overall visual quality. Empirical results demonstrate that our method outperforms existing approaches across these criteria.
[CV-175] Exploring Visual Prompting: Robustness Inheritance and Beyond
Quick Read: This paper investigates whether Visual Prompting (VP) from a robust source model can successfully inherit the source model's robustness, and whether VP faces the same robustness-generalization trade-off as the source model. The key to the solution is Prompt Boundary Loosening (PBL), a lightweight, plug-and-play strategy naturally compatible with VP that ensures robustness is successfully inherited when the source model is robust, while significantly enhancing VP's generalization across diverse downstream datasets.
Link: https://arxiv.org/abs/2506.06823
Authors: Qi Li, Liangzhi Li, Zhouqiang Jiang, Bowen Wang, Keke Tang
Affiliations: National University of Singapore; Osaka University; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2311.10992
Abstract:Visual Prompting (VP), an efficient method for transfer learning, has shown its potential in vision tasks. However, previous works focus exclusively on VP from standard source models, it is still unknown how it performs under the scenario of a robust source model: Can the robustness of the source model be successfully inherited? Does VP also encounter the same trade-off between robustness and generalization ability as the source model during this process? If such a trade-off exists, is there a strategy specifically tailored to VP to mitigate this limitation? In this paper, we thoroughly explore these three questions for the first time and provide affirmative answers to them. To mitigate the trade-off faced by VP, we propose a strategy called Prompt Boundary Loosening (PBL). As a lightweight, plug-and-play strategy naturally compatible with VP, PBL effectively ensures the successful inheritance of robustness when the source model is a robust model, while significantly enhancing VP’s generalization ability across various downstream datasets. Extensive experiments across various datasets show that our findings are universal and demonstrate the significant benefits of the proposed strategy.
[CV-176] Hi-LSplat: Hierarchical 3D Language Gaussian Splatting
Quick Read: This paper addresses view inconsistency and weak hierarchical semantic understanding in 3DGS-based models for open-vocabulary language queries: existing methods refine 3D semantics with view-dependent 2D foundation models but lack a unified 3D representation, producing inconsistent semantics across views, while inherent open-vocabulary challenges cause inconsistent object and relation descriptions. The key to the solution is Hi-LSplat, which lifts 2D features to 3D by building a 3D hierarchical semantic tree with layered instance clustering, resolving the view inconsistency caused by 2D semantic features, and introduces instance-wise and part-wise contrastive losses to capture all-sided hierarchical semantic representations.
Link: https://arxiv.org/abs/2506.06822
Authors: Chenlu Zhan, Yufei Zhang, Gaoang Wang, Hongwei Wang
Affiliations: Zhejiang University; Zhejiang University-University of Illinois Urbana-Champaign Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modeling 3D language fields with Gaussian Splatting for open-ended language queries has recently garnered increasing attention. However, recent 3DGS-based models leverage view-dependent 2D foundation models to refine 3D semantics but lack a unified 3D representation, leading to view inconsistencies. Additionally, inherent open-vocabulary challenges cause inconsistencies in object and relational descriptions, impeding hierarchical semantic understanding. In this paper, we propose Hi-LSplat, a view-consistent Hierarchical Language Gaussian Splatting work for 3D open-vocabulary querying. To achieve view-consistent 3D hierarchical semantics, we first lift 2D features to 3D features by constructing a 3D hierarchical semantic tree with layered instance clustering, which addresses the view inconsistency issue caused by 2D semantic features. Besides, we introduce instance-wise and part-wise contrastive losses to capture all-sided hierarchical semantic representations. Notably, we construct two hierarchical semantic datasets to better assess the model’s ability to distinguish different semantic levels. Extensive experiments highlight our method’s superiority in 3D open-vocabulary segmentation and localization. Its strong performance on hierarchical semantic datasets underscores its ability to capture complex hierarchical semantics within 3D scenes.
[CV-177] Stepwise Decomposition and Dual-stream Focus: A Novel Approach for Training-free Camouflaged Object Segmentation
Quick Read: This paper addresses two key problems of task-generic promptable segmentation for Camouflaged Object Segmentation (COS): semantic ambiguity when deriving instance-specific text prompts, since holistic captions lack discriminative cues and confuse foreground with background; and semantic discrepancy combined with spatial separation when deriving instance-specific visual prompts, since global background sampling far from object boundaries has low feature correlation and leads SAM to segment irrelevant regions. The key to the solution is RDVP-MSD, a training-free test-time adaptation framework in which a Multimodal Stepwise Decomposition chain of thought (MSD-CoT) progressively disentangles image captions to eliminate semantic ambiguity, while Region-constrained Dual-stream Visual Prompting (RDVP) injects spatial constraints into visual prompting and samples visual prompts for foreground and background points independently, mitigating semantic discrepancy and spatial separation.
Link: https://arxiv.org/abs/2506.06818
Authors: Chao Yin, Hao Li, Kequan Yang, Jide Li, Pinpin Zhu, Xiaoqiang Li
Affiliations: Shanghai University; University of the Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: under review
Abstract:While promptable segmentation (e.g., SAM) has shown promise for various segmentation tasks, it still requires manual visual prompts for each object to be segmented. In contrast, task-generic promptable segmentation aims to reduce the need for such detailed prompts by employing only a task-generic prompt to guide segmentation across all test samples. However, when applied to Camouflaged Object Segmentation (COS), current methods still face two critical issues: 1) semantic ambiguity in getting instance-specific text prompts, which arises from insufficient discriminative cues in holistic captions, leading to foreground-background confusion; 2) semantic discrepancy combined with spatial separation in getting instance-specific visual prompts, which results from global background sampling far from object boundaries with low feature correlation, causing SAM to segment irrelevant regions. To address the issues above, we propose RDVP-MSD, a novel training-free test-time adaptation framework that synergizes Region-constrained Dual-stream Visual Prompting (RDVP) via Multimodal Stepwise Decomposition Chain of Thought (MSD-CoT). MSD-CoT progressively disentangles image captions to eliminate semantic ambiguity, while RDVP injects spatial constraints into visual prompting and independently samples visual prompts for foreground and background points, effectively mitigating semantic discrepancy and spatial separation. Without requiring any training or supervision, RDVP-MSD achieves a state-of-the-art segmentation result on multiple COS benchmarks and delivers a faster inference speed than previous methods, demonstrating significantly improved accuracy and efficiency. The codes will be available at this https URL
[CV-178] Training-Free Identity Preservation in Stylized Image Generation Using Diffusion Models
Quick Read: This paper addresses the difficulty existing style transfer techniques have in preserving identity while producing high-quality stylized images, especially when faces are small or the camera-to-face distance is large. The key to the solution is a training-free framework with two components: a "Mosaic Restored Content Image" technique that significantly improves identity retention, especially in complex scenes, and a training-free content consistency loss that preserves fine-grained content details by directing more attention to the original image during stylization.
Link: https://arxiv.org/abs/2506.06802
Authors: Mohammad Ali Rezaei, Helia Hajikazem, Saeed Khanehgir, Mahdi Javanmardi
Affiliations: Computer Vision Group of Part AI Research Center, Tehran, Iran; Department of Computer Engineering, Amirkabir University of Technology, Tehran, Iran
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:While diffusion models have demonstrated remarkable generative capabilities, existing style transfer techniques often struggle to maintain identity while achieving high-quality stylization. This limitation is particularly acute for images where faces are small or exhibit significant camera-to-face distances, frequently leading to inadequate identity preservation. To address this, we introduce a novel, training-free framework for identity-preserved stylized image synthesis using diffusion models. Key contributions include: (1) the “Mosaic Restored Content Image” technique, significantly enhancing identity retention, especially in complex scenes; and (2) a training-free content consistency loss that enhances the preservation of fine-grained content details by directing more attention to the original image during stylization. Our experiments reveal that the proposed approach substantially surpasses the baseline model in concurrently maintaining high stylistic fidelity and robust identity integrity, particularly under conditions of small facial regions or significant camera-to-face distances, all without necessitating model retraining or fine-tuning.
[CV-179] Feature-Based Instance Neighbor Discovery: Advanced Stable Test-Time Adaptation in Dynamic World
Quick Read: This paper addresses the performance degradation of deep neural networks under distribution shifts between training and test domains, which substantially reduces the Quality of Experience (QoE) of applications. The key to the solution is Feature-based Instance Neighbor Discovery (FIND), comprising Layer-wise Feature Disentanglement (LFD), Feature Aware Batch Normalization (FABN), and Selective FABN (S-FABN), which together adapt to dynamic, multiple test distributions while improving both robustness and inference efficiency.
Link: https://arxiv.org/abs/2506.06782
Authors: Qinting Jiang, Chuyang Ye, Dongyan Wei, Bingli Wang, Yuan Xue, Jingyan Jiang, Zhi Wang
Affiliations: Tsinghua University; Shenzhen Technology University; Sichuan Agricultural University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite progress, deep neural networks still suffer performance declines under distribution shifts between training and test domains, leading to a substantial decrease in Quality of Experience (QoE) for applications. Existing test-time adaptation (TTA) methods are challenged by dynamic, multiple test distributions within batches. We observe that feature distributions across different domains inherently cluster into distinct groups with varying means and variances. This divergence reveals a critical limitation of previous global normalization strategies in TTA, which inevitably distort the original data characteristics. Based on this insight, we propose Feature-based Instance Neighbor Discovery (FIND), which comprises three key components: Layer-wise Feature Disentanglement (LFD), Feature Aware Batch Normalization (FABN) and Selective FABN (S-FABN). LFD stably captures features with similar distributions at each layer by constructing graph structures. While FABN optimally combines source statistics with test-time distribution specific statistics for robust feature representation. Finally, S-FABN determines which layers require feature partitioning and which can remain unified, thereby enhancing inference efficiency. Extensive experiments demonstrate that FIND significantly outperforms existing methods, achieving a 30% accuracy improvement in dynamic scenarios while maintaining computational efficiency.
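The core intuition behind feature-aware normalization can be sketched as blending frozen source batch-norm statistics with statistics from the current test batch. The fixed mixing coefficient below is an illustrative knob, not FIND's adaptive rule.

```python
# Sketch: blend source batch-norm statistics with test-time batch statistics.
# `alpha` controls how much the test batch is trusted; its value here is
# illustrative, not the method's learned or selective rule.
import torch

def mixed_batchnorm(x, src_mean, src_var, alpha=0.3, eps=1e-5):
    """x: (N, C) features; src_mean/src_var: frozen source statistics."""
    test_mean = x.mean(dim=0)
    test_var = x.var(dim=0, unbiased=False)
    mean = (1 - alpha) * src_mean + alpha * test_mean
    var = (1 - alpha) * src_var + alpha * test_var
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(32, 64) * 2.0 + 1.0            # shifted test batch
out = mixed_batchnorm(x, src_mean=torch.zeros(64), src_var=torch.ones(64))
print(out.mean().item(), out.std().item())      # partially re-centered features
```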
[CV-180] Continuous-Time SO(3) Forecasting with Savitzky–Golay Neural Controlled Differential Equations CVPR
Quick Read: This paper addresses tracking and forecasting object rotations, a fundamental problem in computer vision and robotics, and in particular the challenge of SO(3) extrapolation. The key to the solution is modeling continuous-time rotational object dynamics on SO(3) with Neural Controlled Differential Equations guided by Savitzky-Golay paths. Instead of relying on simplified motion assumptions, the method learns a general latent dynamical system of the underlying object trajectory while respecting the geometric structure of rotations, improving long-term forecasting accuracy.
Link: https://arxiv.org/abs/2506.06780
Authors: Lennart Bastian, Mohammad Rashed, Nassir Navab, Tolga Birdal
Affiliations: Technical University of Munich; Munich Center of Machine Learning; Imperial College London
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Extended abstract, presented at the CVPR Workshop on 4D Vision
Abstract:Tracking and forecasting the rotation of objects is fundamental in computer vision and robotics, yet SO(3) extrapolation remains challenging as (1) sensor observations can be noisy and sparse, (2) motion patterns can be governed by complex dynamics, and (3) application settings can demand long-term forecasting. This work proposes modeling continuous-time rotational object dynamics on SO(3) using Neural Controlled Differential Equations guided by Savitzky-Golay paths. Unlike existing methods that rely on simplified motion assumptions, our method learns a general latent dynamical system of the underlying object trajectory while respecting the geometric structure of rotations. Experimental results on real-world data demonstrate compelling forecasting capabilities compared to existing approaches.
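In its simplest form, a Savitzky-Golay path is polynomial smoothing of noisy orientation samples. Filtering quaternion components and renormalizing, as below, is a common approximation; the paper constructs its control path more carefully on SO(3).

```python
# Sketch: Savitzky-Golay smoothing of a noisy quaternion trajectory.
# Componentwise filtering plus renormalization is an approximation;
# the paper builds its control path on SO(3) proper.
import numpy as np
from scipy.signal import savgol_filter

t = np.linspace(0, 1, 200)
angle = 2 * np.pi * t                          # rotation about z over time
quats = np.stack([np.cos(angle / 2), np.zeros_like(t),
                  np.zeros_like(t), np.sin(angle / 2)], axis=1)
noisy = quats + 0.02 * np.random.default_rng(0).normal(size=quats.shape)

smooth = savgol_filter(noisy, window_length=21, polyorder=3, axis=0)
smooth /= np.linalg.norm(smooth, axis=1, keepdims=True)   # back to unit quaternions

err_noisy = np.linalg.norm(noisy - quats, axis=1).mean()
err_smooth = np.linalg.norm(smooth - quats, axis=1).mean()
print(f"mean error: noisy {err_noisy:.4f} -> smoothed {err_smooth:.4f}")
```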
[CV-181] LoopDB: A Loop Closure Dataset for Large Scale Simultaneous Localization and Mapping
Quick Read: This paper targets loop closure in Simultaneous Localization and Mapping (SLAM) by providing a high-quality benchmark for evaluating and training loop closure algorithms. The key contribution is LoopDB, a dataset of over 1000 images captured across diverse environments, where each scene consists of five consecutive images, with precise rotations and translations provided as ground truth, supporting both performance evaluation of loop closure algorithms and the training of deep-learning-based methods.
Link: https://arxiv.org/abs/2506.06771
Authors: Mohammad-Maher Nakshbandi, Ziad Sharawy, Dorian Cojocaru, Sorin Grigorescu
Affiliations: Transilvania University of Brasov; University of Craiova
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:In this study, we introduce LoopDB, a challenging loop closure dataset comprising over 1000 images captured across diverse environments, including parks, indoor scenes, parking spaces, as well as scenes centered around individual objects. Each scene is represented by a sequence of five consecutive images. The dataset was collected using a high-resolution camera, providing suitable imagery for benchmarking the accuracy of loop closure algorithms, as typically used in simultaneous localization and mapping. As ground truth, we provide the computed rotations and translations between consecutive images. In addition to its benchmarking goal, the dataset can be used to train and fine-tune loop closure methods based on deep neural networks. LoopDB is publicly available at this https URL.
[CV-182] The OCR Quest for Generalization: Learning to recognize low-resource alphabets with model editing
Quick Read: This paper addresses the difficulty of achieving robust recognition across diverse domains, particularly for low-resource languages such as ancient manuscripts and non-Western scripts that large-scale pretraining and foundation techniques leave underrepresented. The key to the solution is leveraging model editing to better incorporate unseen scripts (low-resource learning), and using domain merging over sparse data distributions to generalize more efficiently than centralized fine-tuning, yielding significant gains on new alphabets and in out-of-domain evaluation.
Link: https://arxiv.org/abs/2506.06761
Authors: Adrià Molina Rodríguez, Oriol Ramos Terrades, Josep Lladós
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Achieving robustness in recognition systems across diverse domains is crucial for their practical utility. While ample data availability is usually assumed, low-resource languages, such as ancient manuscripts and non-western languages, tend to be kept out of massive pretraining and foundational techniques due to underrepresentation. In this work, we aim to build models which can generalize to new distributions of data, such as alphabets, faster than centralized fine-tuning strategies. To do so, we take advantage of recent advancements in model editing to enhance the incorporation of unseen scripts (low-resource learning). In contrast to state-of-the-art meta-learning, we showcase the effectiveness of domain merging on sparse distributions of data, agnostic to their relation to the overall distribution and without any prototyping requirement. Even when using the same exact training data, our experiments show significant performance boosts in transfer learning to new alphabets and in out-of-domain evaluation under challenging domain shifts, including historical ciphered texts and non-Latin scripts. This research contributes a novel approach to building models that can easily adopt under-represented alphabets and, therefore, extend document recognition to a wider set of contexts and cultures.
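In its most basic form, the kind of domain merging the paper leverages can be sketched as weight-space interpolation of checkpoints fine-tuned on different alphabets. Plain averaging, shown below, is the baseline form of model merging, not necessarily the exact editing operator used.

```python
# Sketch: merge two fine-tuned checkpoints by averaging their weights,
# the simplest form of model editing / domain merging.
import torch
import torch.nn as nn

def merge_state_dicts(sd_a, sd_b, w: float = 0.5):
    """Linear interpolation of two architecturally identical state dicts."""
    return {k: (1 - w) * sd_a[k] + w * sd_b[k] for k in sd_a}

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

model_latin, model_cipher = make_model(), make_model()   # stand-in checkpoints
merged = make_model()
merged.load_state_dict(
    merge_state_dicts(model_latin.state_dict(), model_cipher.state_dict()))
print(merged(torch.randn(1, 32)).shape)   # torch.Size([1, 10])
```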
[CV-183] LitMAS: A Lightweight and Generalized Multi-Modal Anti-Spoofing Framework for Biometric Security INTERSPEECH2025
Quick Read: This paper addresses consistent anti-spoofing detection across multimodal biometric authentication systems, in particular building an efficient, general solution for resource-constrained environments. The key to the solution is the LitMAS framework, whose core is a Modality-Aligned Concentration Loss that enhances inter-class separability while preserving cross-modal consistency, enabling robust spoof detection across speech, face, iris, and fingerprint modalities.
Link: https://arxiv.org/abs/2506.06759
Authors: Nidheesh Gorthi, Kartik Thakral, Rishabh Ranjan, Richa Singh, Mayank Vatsa
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Interspeech 2025
Abstract:Biometric authentication systems are increasingly being deployed in critical applications, but they remain susceptible to spoofing. Since most of the research efforts focus on modality-specific anti-spoofing techniques, building a unified, resource-efficient solution across multiple biometric modalities remains a challenge. To address this, we propose LitMAS, a Lightweight and generalizable Multi-modal Anti-Spoofing framework designed to detect spoofing attacks in speech, face, iris, and fingerprint-based biometric systems. At the core of LitMAS is a Modality-Aligned Concentration Loss, which enhances inter-class separability while preserving cross-modal consistency and enabling robust spoof detection across diverse biometric traits. With just 6M parameters, LitMAS surpasses state-of-the-art methods by 1.36% in average EER across seven datasets, demonstrating high efficiency, strong generalizability, and suitability for edge deployment. Code and trained models are available at this https URL.
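The flavor of a concentration loss can be sketched as pulling each embedding toward its class center while pushing different centers apart. The center-based form below is a generic illustration, not LitMAS's exact Modality-Aligned Concentration Loss.

```python
# Sketch: a generic center-based concentration loss - pull embeddings
# toward their class centers, push different class centers apart.
# This is an illustrative stand-in, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def concentration_loss(emb, labels, centers, margin=1.0):
    emb = F.normalize(emb, dim=1)
    centers = F.normalize(centers, dim=1)
    pull = (emb - centers[labels]).pow(2).sum(dim=1).mean()
    d = torch.cdist(centers, centers)                 # pairwise center distances
    push = F.relu(margin - d).triu(diagonal=1).mean() # margin-based separation
    return pull + push

emb = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 2, (16,))                   # bona fide vs. spoof
centers = torch.randn(2, 128, requires_grad=True)
loss = concentration_loss(emb, labels, centers)
loss.backward()
print(loss.item())
```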
[CV-184] SAR2Struct: Extracting 3D Semantic Structural Representation of Aircraft Targets from Single-View SAR Image
Quick Read: This paper addresses semantic structure recovery for targets in synthetic aperture radar (SAR) images: inferring a target's components and the structural relations between them, such as symmetry and adjacency, from a single-view SAR image. Traditional methods focus on 3D surface reconstruction or local geometric feature extraction and neglect the role of structural modeling in capturing semantic information. The key to the solution is a two-step framework based on structural descriptors that learns the structural consistency and geometric diversity of the same target type across different SAR images, deriving a 3D semantic structural representation directly from the 2D SAR image.
Link: https://arxiv.org/abs/2506.06757
Authors: Ziyu Yue, Ruixi You, Feng Xu
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 12 figures
Abstract:To translate synthetic aperture radar (SAR) image into interpretable forms for human understanding is the ultimate goal of SAR advanced information retrieval. Existing methods mainly focus on 3D surface reconstruction or local geometric feature extraction of targets, neglecting the role of structural modeling in capturing semantic information. This paper proposes a novel task: SAR target structure recovery, which aims to infer the components of a target and the structural relationships between its components, specifically symmetry and adjacency, from a single-view SAR image. Through learning the structural consistency and geometric diversity across the same type of targets as observed in different SAR images, it aims to derive the semantic representation of target directly from its 2D SAR image. To solve this challenging task, a two-step algorithmic framework based on structural descriptors is developed. Specifically, in the training phase, it first detects 2D keypoints from real SAR images, and then learns the mapping from these keypoints to 3D hierarchical structures using simulated data. During the testing phase, these two steps are integrated to infer the 3D structure from real SAR images. Experimental results validated the effectiveness of each step and demonstrated, for the first time, that 3D semantic structural representation of aircraft targets can be directly derived from a single-view SAR image.
[CV-185] THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation
Quick Read: This paper tackles egocentric video object segmentation, whose core challenges are complex scenes and long-term tracking. The key to the solution is combining large-scale visual pretraining from SAM2 with depth-based geometric cues and integrating these signals in a unified framework, which yields strong segmentation performance.
Link: https://arxiv.org/abs/2506.06748
Authors: Mingqi Gao, Haoran Duan, Tianlu Zhang, Jungong Han
Affiliations: Tsinghua University; University of Warwick
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In this report, we describe our approach to egocentric video object segmentation. Our method combines large-scale visual pretraining from SAM2 with depth-based geometric cues to handle complex scenes and long-term tracking. By integrating these signals in a unified framework, we achieve strong segmentation performance. On the VISOR test set, our method reaches a J&F score of 90.1%.
zh
[CV-186] RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation
【速读】:该论文旨在解决食品计算中生成菜谱图像的关键挑战,特别是在菜谱目标、分步指令与视觉内容之间缺乏细粒度对齐的问题。其解决方案的关键在于提出RecipeGen,这是首个大规模的真实世界基准数据集,支持基于菜谱的文本到图像(Text-to-Image, T2I)、图像到视频(Image-to-Video, I2V)和文本到视频(Text-to-Video, T2V)生成任务,包含26,453个菜谱、196,724张图像和4,491个视频,覆盖多样的食材、烹饪过程、风格和菜品类型,并引入领域特定的评估指标以衡量成分保真度和交互建模能力。
链接: https://arxiv.org/abs/2506.06733
作者: Ruoxuan Zhang,Jidong Gao,Bin Wen,Hongxia Xie,Chenming Zhang,Hong-Han Shuai,Wen-Huang Cheng
机构: Jilin University(吉林大学); Guangdong University of Technology(广东工业大学); National Chiao Tung University(国立交通大学); National Taiwan University(台湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is an extended version of arXiv:2503.05228
Abstract:Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. Project page is available now.
zh
[CV-187] VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
【速读】:该论文试图解决大型多模态模型(Large Multimodal Models, LMMs)在面对以图像形式呈现的答案选项时,其数学推理能力不足的问题,这一问题属于多图像理解的重要方面。解决方案的关键在于提出VisioMath,这是一个专门针对基于图像答案选项的数学推理任务设计的基准数据集,包含8,070张图像和1,800道多选题,其中每个选项均为图像,强调了细粒度区分答案选项的重要性,从而为未来研究提供了严格的测试平台。
链接: https://arxiv.org/abs/2506.06727
作者: Can Li,Ting Zhang,Mei Wang,Hua Huang
机构: Beijing Normal University (北京师范大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large Multimodal Models (LMMs) have demonstrated remarkable problem-solving capabilities across various domains. However, their ability to perform mathematical reasoning when answer options are represented as images–an essential aspect of multi-image comprehension–remains underexplored. To bridge this gap, we introduce VisioMath, a benchmark designed to evaluate mathematical reasoning in multimodal contexts involving image-based answer choices. VisioMath comprises 8,070 images and 1,800 multiple-choice questions, where each answer option is an image, presenting unique challenges to existing LMMs. To the best of our knowledge, VisioMath is the first dataset specifically tailored for mathematical reasoning in image-based-option scenarios, where fine-grained distinctions between answer choices are critical for accurate problem-solving. We systematically evaluate state-of-the-art LMMs on VisioMath and find that even the most advanced models struggle with this task. Notably, GPT-4o achieves only 45.9% accuracy, underscoring the limitations of current models in reasoning over visually similar answer choices. By addressing a crucial gap in existing benchmarks, VisioMath establishes a rigorous testbed for future research, driving advancements in multimodal reasoning.
zh
[CV-188] Improving Wildlife Out-of-Distribution Detection: Africa's Big Five
【速读】:该论文试图解决野生动物冲突中因分类模型在面对未知类别时过度自信而导致的识别误差问题。解决方案的关键在于通过引入基于特征的异常检测方法,提升模型在不同分类阈值下的泛化能力,其中采用预训练的ImageNet特征结合参数化最近类均值(NCM)方法,在AUPR-IN、AUPR-OUT和AUTC指标上分别优于最佳的OOD方法2%、4%和22%。
链接: https://arxiv.org/abs/2506.06719
作者: Mufhumudzi Muthivhi,Jiahao Huo,Fredrik Gustafsson,Terence L. van Zyl
机构: Institute for Artificial Intelligent Systems (人工智能智能系统研究所); University of Johannesburg (约翰内斯堡大学); Department of Electrical Engineering (电气工程系); Linköping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Mitigating human-wildlife conflict seeks to resolve unwanted encounters between these parties. Computer Vision provides a solution to identifying individuals that might escalate into conflict, such as members of the Big Five African animals. However, environments often contain several varied species. The current state-of-the-art animal classification models are trained under a closed-world assumption. They almost always remain overconfident in their predictions even when presented with unknown classes. This study investigates out-of-distribution (OOD) detection of wildlife, specifically the Big Five. To this end, we select a parametric Nearest Class Mean (NCM) and a non-parametric contrastive learning approach as baselines to take advantage of pretrained and projected features from popular classification encoders. Moreover, we compare our baselines to various common OOD methods in the literature. The results show feature-based methods reflect stronger generalisation capability across varying classification thresholds. Specifically, NCM with ImageNet pre-trained features achieves a 2%, 4% and 22% improvement on AUPR-IN, AUPR-OUT and AUTC over the best OOD methods, respectively. The code can be found here this https URL
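参数化最近类均值(NCM)基线的核心计算很简单:用分布内类别(五大兽)的预训练特征求类均值,再以"到最近类均值的距离"作为 OOD 分数。下面是一个 numpy 最小示意(阈值 tau 的选取方式为假设):

```python
import numpy as np

def fit_ncm(features, labels):
    """由分布内训练特征计算各类均值(参数化 NCM 的全部"训练")。"""
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, means

def ncm_ood_score(features, means):
    """OOD 分数 = 到最近类均值的欧氏距离,越大越可能是未知物种。"""
    d = np.linalg.norm(features[:, None, :] - means[None, :, :], axis=-1)
    return d.min(axis=1)

# 用法示意:score 超过在验证集上选定的阈值 tau 即判为分布外
# classes, means = fit_ncm(train_feats, train_labels)
# is_ood = ncm_ood_score(test_feats, means) > tau
```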
zh
[CV-189] Active Contour Models Driven by Hyperbolic Mean Curvature Flow for Image Segmentation
【速读】:该论文旨在解决传统抛物型平均曲率流驱动的活动轮廓模型(Parabolic mean curvature flow-driven active contour models, PMCF-ACMs)在图像分割中对初始曲线配置依赖性过强的问题。其解决方案的关键在于提出一种新型的双曲型平均曲率流驱动的活动轮廓模型(Hyperbolic mean curvature flow-driven ACMs, HMCF-ACMs),通过引入可调初始速度场,实现对不同分割场景的自适应优化,并进一步构建了基于双曲双模正则化流的活动轮廓模型(Hyperbolic dual-mode regularized flow-driven ACMs, HDRF-ACMs),利用平滑Heaviside函数实现边缘感知的力调节,以抑制弱边界附近的过度扩散。
链接: https://arxiv.org/abs/2506.06712
作者: Saiyu Hu,Chunlei He,Jianfeng Zhang,Dexing Kong,Shoujun Huang
机构: Zhejiang Normal University (浙江师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Analysis of PDEs (math.AP)
备注:
Abstract:Parabolic mean curvature flow-driven active contour models (PMCF-ACMs) are widely used in image segmentation, which however depend heavily on the selection of initial curve configurations. In this paper, we firstly propose several hyperbolic mean curvature flow-driven ACMs (HMCF-ACMs), which introduce tunable initial velocity fields, enabling adaptive optimization for diverse segmentation scenarios. We shall prove that HMCF-ACMs are indeed normal flows and establish the numerical equivalence between dissipative HMCF formulations and certain wave equations using the level set method with signed distance function. Building on this framework, we furthermore develop hyperbolic dual-mode regularized flow-driven ACMs (HDRF-ACMs), which utilize smooth Heaviside functions for edge-aware force modulation to suppress over-diffusion near weak boundaries. Then, we optimize a weighted fourth-order Runge-Kutta algorithm with nine-point stencil spatial discretization when solving the above-mentioned wave equations. Experiments show that both HMCF-ACMs and HDRF-ACMs could achieve more precise segmentations with superior noise resistance and numerical stability due to task-adaptive configurations of initial velocities and initial contours.
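作为参考,双曲平均曲率流在符号距离函数水平集框架下的常见写法如下(按摘要描述作的示意性重构,并非原文公式):

```latex
% 双曲平均曲率流的水平集示意形式:\phi 为水平集函数,\kappa 为平均曲率
\phi_{tt} = \kappa\,\lvert\nabla\phi\rvert,\qquad
\kappa = \nabla\cdot\!\left(\frac{\nabla\phi}{\lvert\nabla\phi\rvert}\right),\qquad
\phi(\cdot,0)=\phi_0,\quad \phi_t(\cdot,0)=v_0\,\lvert\nabla\phi_0\rvert
```

其中 \phi_0 为初始轮廓的符号距离函数,v_0 即摘要中提到的可调初始速度场;对时间的二阶导数正是它区别于抛物型流(\phi_t = \kappa|\nabla\phi|)的地方。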
zh
[CV-190] A Systematic Investigation on Deep Learning-Based Omnidirectional Image and Video Super-Resolution
【速读】:该论文旨在解决全向图像和视频超分辨率(omnidirectional image and video super-resolution)问题,即从低分辨率输入中重建高分辨率图像或视频帧,以提升细节保留并支持更精确的场景分析与理解。其解决方案的关键在于提出一种基于深度学习的方法,并引入一个新的真实退化全向图像和视频数据集——360Insta,该数据集在不同光照、运动和曝光条件下收集,弥补了现有数据集主要依赖合成退化而缺乏真实世界失真的不足,从而提升了全向超分辨率方法的泛化能力评估。
链接: https://arxiv.org/abs/2506.06710
作者: Qianqian Zhao,Chunle Guo,Tianyi Zhang,Junpei Zhang,Peiyang Jia,Tan Su,Wenjie Jiang,Chongyi Li
机构: Nankai University (南开大学); Insta360 (Insta360)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Omnidirectional image and video super-resolution is a crucial research topic in low-level vision, playing an essential role in virtual reality and augmented reality applications. Its goal is to reconstruct high-resolution images or video frames from low-resolution inputs, thereby enhancing detail preservation and enabling more accurate scene analysis and interpretation. In recent years, numerous innovative and effective approaches have been proposed, predominantly based on deep learning techniques, involving diverse network architectures, loss functions, projection strategies, and training datasets. This paper presents a systematic review of recent progress in omnidirectional image and video super-resolution, focusing on deep learning-based methods. Given that existing datasets predominantly rely on synthetic degradation and fall short in capturing real-world distortions, we introduce a new dataset, 360Insta, that comprises authentically degraded omnidirectional images and videos collected under diverse conditions, including varying lighting, motion, and exposure settings. This dataset addresses a critical gap in current omnidirectional benchmarks and enables more robust evaluation of the generalization capabilities of omnidirectional super-resolution methods. We conduct comprehensive qualitative and quantitative evaluations of existing methods on both public datasets and our proposed dataset. Furthermore, we provide a systematic overview of the current status of research and discuss promising directions for future exploration. All datasets, methods, and evaluation metrics introduced in this work are publicly available and will be regularly updated. Project page: this https URL.
zh
[CV-191] SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game
【速读】:该论文旨在解决在现实世界中对高速物体进行精确控制的挑战,特别是在乒乓球这一动态且对实时响应要求极高的场景下,实现快速拦截和精准轨迹调整。其解决方案的关键在于集成基于脉冲的视觉系统与模仿学习,具体包括两个核心模块:SONIC,一个基于脉冲相机的模块,通过补偿空气阻力和摩擦等现实不确定性,实现了毫米级的球拍接触预测;以及IMPACT,一个战略规划模块,用于实现球体向目标区域的精准投放。该系统结合了20 kHz脉冲相机以实现高时间分辨率的球体跟踪,并利用高效的神经网络模型进行实时轨迹修正和击球规划。
链接: https://arxiv.org/abs/2506.06690
作者: Hao Wang,Chengkai Hou,Xianglong Li,Yankai Fu,Chenxuan Li,Ning Chen,Gaole Dai,Jiaming Liu,Tiejun Huang,Shanghang Zhang
机构: Peking University (北京大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning to control high-speed objects in the real world remains a challenging frontier in robotics. Table tennis serves as an ideal testbed for this problem, demanding both rapid interception of fast-moving balls and precise adjustment of their trajectories. This task presents two fundamental challenges: it requires a high-precision vision system capable of accurately predicting ball trajectories, and it necessitates intelligent strategic planning to ensure precise ball placement to target regions. The dynamic nature of table tennis, coupled with its real-time response requirements, makes it particularly well-suited for advancing robotic control capabilities in fast-paced, precision-critical domains. In this paper, we present SpikePingpong, a novel system that integrates spike-based vision with imitation learning for high-precision robotic table tennis. Our approach introduces two key attempts that directly address the aforementioned challenges: SONIC, a spike camera-based module that achieves millimeter-level precision in ball-racket contact prediction by compensating for real-world uncertainties such as air resistance and friction; and IMPACT, a strategic planning module that enables accurate ball placement to targeted table regions. The system harnesses a 20 kHz spike camera for high-temporal resolution ball tracking, combined with efficient neural network models for real-time trajectory correction and stroke planning. Experimental results demonstrate that SpikePingpong achieves a remarkable 91% success rate for 30 cm accuracy target area and 71% in the more challenging 20 cm accuracy task, surpassing previous state-of-the-art approaches by 38% and 37% respectively. These significant performance improvements enable the robust implementation of sophisticated tactical gameplay strategies, providing a new research perspective for robotic control in high-speed dynamic tasks.
zh
[CV-192] Interpretation of Deep Learning Model in Embryo Selection for In Vitro Fertilization (IVF) Treatment
【速读】:该论文试图解决体外受精(IVF)过程中胚胎筛选效率低和依赖人工评估的问题,该过程耗时且主观性较强。解决方案的关键在于提出一种可解释的人工智能(XAI)框架,结合卷积神经网络(CNN)与长短期记忆网络(LSTM)的架构(称为CNN-LSTM),通过深度学习实现高精度的胚胎分类,并在保持模型可解释性的同时提升胚胎评估的效率和客观性。
链接: https://arxiv.org/abs/2506.06680
作者: Radha Kodali,Venkata Rao Dhulipalla,Venkata Siva Kishor Tatavarty,Madhavi Nadakuditi,Bharadwaj Thiruveedhula,Suryanarayana Gunnam,Durga Prasad Bavirisetti
机构: University of Tennessee Health Science Center (田纳西大学健康科学中心); Velagapudi Ramakrishna Siddhartha Engineering College (维拉加普迪·拉玛克里斯纳·西德哈拉姆工程学院); Converse Inc. (康维斯公司); Norwegian University of Science and Technology (挪威科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Infertility has a considerable impact on individuals’ quality of life, affecting them socially and psychologically, with projections indicating a rise in the upcoming years. In vitro fertilization (IVF) emerges as one of the primary techniques within economically developed nations, employed to address the rising problem of low fertility. Expert embryologists conventionally grade embryos by reviewing blastocyst images to select the most optimal for transfer, yet this process is time-consuming and lacks efficiency. Blastocyst images provide a valuable resource for assessing embryo viability. In this study, we introduce an explainable artificial intelligence (XAI) framework for classifying embryos, employing a fusion of convolutional neural network (CNN) and long short-term memory (LSTM) architecture, referred to as CNN-LSTM. Utilizing deep learning, our model achieves high accuracy in embryo classification while maintaining interpretability through XAI.
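下面用 PyTorch 给出 CNN-LSTM 结构的一个最小示意(层数、通道数等超参数均为假设,并非原文配置):CNN 逐帧提取囊胚图像特征,LSTM 建模时序后分类。

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """示意性的 CNN-LSTM 胚胎分类器(结构为假设)。输入 (B, T, 3, H, W)。"""
    def __init__(self, num_classes=2, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # 每帧 -> 64 维特征
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):                               # x: (B, T, 3, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)    # 逐帧 CNN 提特征
        out, _ = self.lstm(f)                           # 时序建模
        return self.head(out[:, -1])                    # 取最后时刻做分类

# logits = CNNLSTM()(torch.rand(2, 8, 3, 224, 224))     # -> (2, 2)
```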
zh
[CV-193] RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation
【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Models, VLMs)的机器人系统主要依赖反应式System 1策略,而未能充分利用VLM在语义推理和长时程规划方面的优势的问题。其解决方案的关键在于引入RoboCerebra基准,该基准包含大规模仿真数据集、结合高层VLM规划器与低层视觉-语言-动作(Vision-Language-Action, VLA)控制器的分层框架,以及针对规划、反思和记忆的评估协议,从而推动长时程机器人操作中的高级推理能力发展。
链接: https://arxiv.org/abs/2506.06677
作者: Songhao Han,Boxiang Qiu,Yue Liao,Siyuan Huang,Chen Gao,Shuicheng Yan,Si Liu
机构: Beihang University (北京航空航天大学); National University of Singapore (新加坡国立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 18 figures
Abstract:Recent advances in vision-language models (VLMs) have enabled instruction-conditioned robotic systems with improved generalization. However, most existing work focuses on reactive System 1 policies, underutilizing VLMs’ strengths in semantic reasoning and long-horizon planning. These System 2 capabilities-characterized by deliberative, goal-directed thinking-remain under explored due to the limited temporal scale and structural complexity of current benchmarks. To address this gap, we introduce RoboCerebra, a benchmark for evaluating high-level reasoning in long-horizon robotic manipulation. RoboCerebra includes: (1) a large-scale simulation dataset with extended task horizons and diverse subtask sequences in household environments; (2) a hierarchical framework combining a high-level VLM planner with a low-level vision-language-action (VLA) controller; and (3) an evaluation protocol targeting planning, reflection, and memory through structured System 1-System 2 interaction. The dataset is constructed via a top-down pipeline, where GPT generates task instructions and decomposes them into subtask sequences. Human operators execute the subtasks in simulation, yielding high-quality trajectories with dynamic object variations. Compared to prior benchmarks, RoboCerebra features significantly longer action sequences and denser annotations. We further benchmark state-of-the-art VLMs as System 2 modules and analyze their performance across key cognitive dimensions, advancing the development of more capable and generalizable robotic planners.
zh
[CV-194] Flood-DamageSense: Multimodal Mamba with Multitask Learning for Building Flood Damage Assessment using SAR Remote Sensing Imagery
【速读】:该论文旨在解决现有灾后建筑损毁分类模型在洪水灾害场景下表现不佳的问题,尤其是在破坏力未留下明显光谱或结构特征的情况下。其关键解决方案是提出了一种名为Flood-DamageSense的深度学习框架,该框架通过融合灾前和灾后合成孔径雷达(SAR)/干涉合成孔径雷达(InSAR)图像、超高分辨率光学底图以及固有洪水风险层,实现了建筑物级别的洪水损毁评估。该方法利用多模态Mamba主干网络与半孪生编码器,联合预测损毁等级、洪水范围和建筑轮廓,显著提升了对“轻微”和“中度”损毁类别的识别能力。
链接: https://arxiv.org/abs/2506.06667
作者: Yu-Hsuan Ho,Ali Mostafavi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Most post-disaster damage classifiers succeed only when destructive forces leave clear spectral or structural signatures – conditions rarely present after inundation. Consequently, existing models perform poorly at identifying flood-related building damages. The model presented in this study, Flood-DamageSense, addresses this gap as the first deep-learning framework purpose-built for building-level flood-damage assessment. The architecture fuses pre- and post-event SAR/InSAR scenes with very-high-resolution optical basemaps and an inherent flood-risk layer that encodes long-term exposure probabilities, guiding the network toward plausibly affected structures even when compositional change is minimal. A multimodal Mamba backbone with a semi-Siamese encoder and task-specific decoders jointly predicts (1) graded building-damage states, (2) floodwater extent, and (3) building footprints. Training and evaluation on Hurricane Harvey (2017) imagery from Harris County, Texas – supported by insurance-derived property-damage extents – show a mean F1 improvement of up to 19 percentage points over state-of-the-art baselines, with the largest gains in the frequently misclassified “minor” and “moderate” damage categories. Ablation studies identify the inherent-risk feature as the single most significant contributor to this performance boost. An end-to-end post-processing pipeline converts pixel-level outputs to actionable, building-scale damage maps within minutes of image acquisition. By combining risk-aware modeling with SAR’s all-weather capability, Flood-DamageSense delivers faster, finer-grained, and more reliable flood-damage intelligence to support post-disaster decision-making and resource allocation.
zh
[CV-195] Generalized Trajectory Scoring for End-to-end Multimodal Planning CVPR2025
【速读】:该论文旨在解决端到端多模态规划中轨迹评分器在泛化能力上的局限性,即现有方法要么依赖静态轨迹集进行粗粒度离散化而难以适应细节变化,要么依赖动态生成的少量轨迹虽精度高但无法捕捉更广泛的轨迹分布。解决方案的关键在于提出GTRS(Generalized Trajectory Scoring)框架,该框架通过三个互补创新实现粗粒度与细粒度轨迹评估的统一:基于扩散模型的轨迹生成器生成多样化细粒度候选轨迹;词汇泛化技术通过在超密集轨迹集上使用丢弃正则化训练评分器,使其能够在较小子集上进行鲁棒推理;传感器增强策略提升跨域泛化能力并结合关键轨迹区分的精调训练。
链接: https://arxiv.org/abs/2506.06664
作者: Zhenxin Li,Wenhao Yao,Zi Wang,Xinglong Sun,Joshua Chen,Nadine Chang,Maying Shen,Zuxuan Wu,Shiyi Lan,Jose M. Alvarez
机构: NVIDIA(英伟达); Fudan University(复旦大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: The 1st place solution of the End-to-end Driving Track at the CVPR 2025 Autonomous Grand Challenge
Abstract:End-to-end multi-modal planning is a promising paradigm in autonomous driving, enabling decision-making with diverse trajectory candidates. A key component is a robust trajectory scorer capable of selecting the optimal trajectory from these candidates. While recent trajectory scorers focus on scoring either large sets of static trajectories or small sets of dynamically generated ones, both approaches face significant limitations in generalization. Static vocabularies provide effective coarse discretization but struggle to make fine-grained adaptation, while dynamic proposals offer detailed precision but fail to capture broader trajectory distributions. To overcome these challenges, we propose GTRS (Generalized Trajectory Scoring), a unified framework for end-to-end multi-modal planning that combines coarse and fine-grained trajectory evaluation. GTRS consists of three complementary innovations: (1) a diffusion-based trajectory generator that produces diverse fine-grained proposals; (2) a vocabulary generalization technique that trains a scorer on super-dense trajectory sets with dropout regularization, enabling its robust inference on smaller subsets; and (3) a sensor augmentation strategy that enhances out-of-domain generalization while incorporating refinement training for critical trajectory discrimination. As the winning solution of the Navsim v2 Challenge, GTRS demonstrates superior performance even with sub-optimal sensor inputs, approaching privileged methods that rely on ground-truth perception. Code will be available at this https URL.
zh
[CV-196] DriveSuprim: Towards Precise Trajectory Selection for End-to-End Planning
【速读】:该论文旨在解决自动驾驶车辆在复杂驾驶环境中基于单一预测轨迹的安全性评估不足问题,以及选择性方法在大量轨迹候选中精确选择最佳选项和区分细微但安全关键差异的优化挑战。其解决方案的关键在于提出DriveSuprim框架,该框架采用从粗到细的渐进式候选过滤范式、基于旋转的增强方法以提高分布外场景的鲁棒性,以及自蒸馏框架以稳定训练过程,从而实现了最先进的性能表现。
链接: https://arxiv.org/abs/2506.06659
作者: Wenhao Yao,Zhenxin Li,Shiyi Lan,Zi Wang,Xinglong Sun,Jose M. Alvarez,Zuxuan Wu
机构: Fudan University (复旦大学); NVIDIA (英伟达)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures
Abstract:In complex driving environments, autonomous vehicles must navigate safely. Relying on a single predicted path, as in regression-based approaches, usually does not explicitly assess the safety of the predicted trajectory. Selection-based methods address this by generating and scoring multiple trajectory candidates and predicting the safety score for each, but face optimization challenges in precisely selecting the best option from thousands of possibilities and distinguishing subtle but safety-critical differences, especially in rare or underrepresented scenarios. We propose DriveSuprim to overcome these challenges and advance the selection-based paradigm through a coarse-to-fine paradigm for progressive candidate filtering, a rotation-based augmentation method to improve robustness in out-of-distribution scenarios, and a self-distillation framework to stabilize training. DriveSuprim achieves state-of-the-art performance, reaching 93.5% PDMS in NAVSIM v1 and 87.1% EPDMS in NAVSIM v2 without extra data, demonstrating superior safety-critical capabilities, including collision avoidance and compliance with rules, while maintaining high trajectory quality in various driving scenarios.
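"由粗到细的渐进式候选过滤"可以用几行代码勾勒(示意性伪实现,与官方代码无关):先用粗评分器在全部候选上筛出 top-k,再用细评分器在小集合内精选。

```python
import torch

def coarse_to_fine_select(trajs, coarse_scorer, fine_scorer, keep=64):
    """trajs: (N, T, 2) 候选轨迹;两个 scorer 均返回 (N,) 的安全性得分。"""
    coarse = coarse_scorer(trajs)            # 粗筛:在上千候选上打分
    idx = coarse.topk(keep).indices          # 只保留 top-k 进入精选
    fine = fine_scorer(trajs[idx])           # 细评:区分细微的安全差异
    return trajs[idx[fine.argmax()]]         # 返回最终轨迹
```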
zh
[CV-197] Parametric Gaussian Human Model: Generalizable Prior for Efficient and Realistic Human Avatar Modeling
【速读】:该论文旨在解决从单目视频中高效且高质量重建可动画化的人类虚拟形象(human avatar)的问题,特别是针对现有方法在稀疏单目输入下泛化能力差以及需要耗时的逐主体优化等挑战。其解决方案的关键在于提出了一种名为Parametric Gaussian Human Model (PGHM) 的通用且高效的框架,该框架通过将人体先验信息整合到3D Gaussian Splatting (3DGS) 中,实现了快速且高保真的人像重建。PGHM的核心创新包括:(1)一个对齐UV的潜在身份图,用于紧凑地编码个体特定的几何与外观信息;(2)一个解耦的多头U-Net,通过条件解码器分解静态、姿态相关和视角相关的高斯属性,从而在复杂姿态和视角下实现稳健的渲染效果,并支持无需多视角采集或长时间优化的高效主体适配。
链接: https://arxiv.org/abs/2506.06645
作者: Cheng Peng,Jingxiang Sun,Yushuo Chen,Zhaoqi Su,Zhuo Su,Yebin Liu
机构: Tsinghua University (清华大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Photorealistic and animatable human avatars are a key enabler for virtual/augmented reality, telepresence, and digital entertainment. While recent advances in 3D Gaussian Splatting (3DGS) have greatly improved rendering quality and efficiency, existing methods still face fundamental challenges, including time-consuming per-subject optimization and poor generalization under sparse monocular inputs. In this work, we present the Parametric Gaussian Human Model (PGHM), a generalizable and efficient framework that integrates human priors into 3DGS for fast and high-fidelity avatar reconstruction from monocular videos. PGHM introduces two core components: (1) a UV-aligned latent identity map that compactly encodes subject-specific geometry and appearance into a learnable feature tensor; and (2) a disentangled Multi-Head U-Net that predicts Gaussian attributes by decomposing static, pose-dependent, and view-dependent components via conditioned decoders. This design enables robust rendering quality under challenging poses and viewpoints, while allowing efficient subject adaptation without requiring multi-view capture or long optimization time. Experiments show that PGHM is significantly more efficient than optimization-from-scratch methods, requiring only approximately 20 minutes per subject to produce avatars with comparable visual quality, thereby demonstrating its practical applicability for real-world monocular avatar creation.
zh
[CV-198] Dark Channel-Assisted Depth-from-Defocus from a Single Image
【速读】:该论文试图解决从单张空间变化的散焦模糊图像中估计场景深度的问题(depth-from-defocus, DFD),因为该问题具有欠约束特性,传统方法通常依赖多张不同光圈或焦点设置的图像来恢复深度信息。解决方案的关键在于利用暗通道作为补充线索,通过局部散焦模糊与对比度变化之间的关系作为关键深度线索,从而提升场景结构估计的性能。
链接: https://arxiv.org/abs/2506.06643
作者: Moushumi Medhi,Rajiv Ranjan Sahay
机构: Indian Institute of Technology, Kharagpur(印度理工学院,卡哈格尔布尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we utilize the dark channel as a complementary cue to estimate the depth of a scene from a single space-variant defocus blurred image due to its effectiveness in implicitly capturing the local statistics of blurred images and the scene structure. Existing depth-from-defocus (DFD) techniques typically rely on multiple images with varying apertures or focus settings to recover depth information. Very few attempts have focused on DFD from a single defocused image due to the underconstrained nature of the problem. Our method capitalizes on the relationship between local defocus blur and contrast variations as key depth cues to enhance the overall performance in estimating the scene’s structure. The entire pipeline is trained adversarially in a fully end-to-end fashion. Experiments conducted on real data with realistic depth-induced defocus blur demonstrate that incorporating dark channel prior into single image DFD yields meaningful depth estimation results, validating the effectiveness of our approach.
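暗通道本身是一个非常简单的统计量:逐像素取 RGB 三通道最小值,再在局部窗口内做最小值滤波。下面是通用实现(窗口大小取常见值,非原文参数):

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """img: HxWx3,取值 [0,1]。返回 HxW 的暗通道图,
    论文将其作为单图 DFD 中隐式编码局部模糊统计的补充线索。"""
    min_rgb = img.min(axis=2)                   # 逐像素跨通道取最小
    return minimum_filter(min_rgb, size=patch)  # 局部窗口最小值滤波

# dc = dark_channel(np.random.rand(240, 320, 3))
```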
zh
[CV-199] Non-Intrusive Load Monitoring Based on Image Load Signatures and Continual Learning DATE
【速读】:该论文旨在解决传统非侵入式负载监测(Non-Intrusive Load Monitoring, NILM)方法在复杂多变的负载组合和应用环境中存在的特征鲁棒性差与模型泛化能力不足的问题。其解决方案的关键在于将多维电力信号转换为可视化的图像负载特征签名,并结合深度卷积神经网络实现多设备的识别与分类;同时引入自监督预训练以提升特征泛化能力,并采用持续在线学习策略克服模型遗忘,从而适应新负载的出现。
链接: https://arxiv.org/abs/2506.06637
作者: Olimjon Toirov,Wei Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: 10 pages, 3 figures, 2025 2nd International Conference on Digital Society and Artificial Intelligence (DSAI 2025), Conference dates: May 23-25, 2025
Abstract:Non-Intrusive Load Monitoring (NILM) identifies the operating status and energy consumption of each electrical device in the circuit by analyzing the electrical signals at the bus, which is of great significance for smart power management. However, the complex and changeable load combinations and application environments lead to the challenges of poor feature robustness and insufficient model generalization of traditional NILM methods. To this end, this paper proposes a new non-intrusive load monitoring method that integrates “image load signature” and continual learning. This method converts multi-dimensional power signals such as current, voltage, and power factor into visual image load feature signatures, and combines deep convolutional neural networks to realize the identification and classification of multiple devices; at the same time, self-supervised pre-training is introduced to improve feature generalization, and continual online learning strategies are used to overcome model forgetting to adapt to the emergence of new loads. This paper conducts a large number of experiments on high-sampling rate load datasets, and compares a variety of existing methods and model variants. The results show that the proposed method has achieved significant improvements in recognition accuracy.
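原文未给出"图像负载特征签名"的具体变换;一种常见做法是把一维电学信号编码为 Gramian Angular Summation Field(GASF)图像,这里给出该假设方案的最小实现:

```python
import numpy as np

def gasf_image(signal):
    """将一维电流/电压/功率因数片段编码为 GASF 图像(假设方案)。"""
    x = np.asarray(signal, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1  # 归一化到 [-1, 1]
    phi = np.arccos(np.clip(x, -1, 1))                      # 极坐标角度编码
    return np.cos(phi[:, None] + phi[None, :])              # (T, T) 图像

# 电流、电压、功率因数各生成一幅图后按通道拼接,即可送入 CNN 做多设备分类
```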
zh
[CV-200] Vision-QRWKV: Exploring Quantum-Enhanced RWKV Models for Image Classification
【速读】:该论文试图解决传统神经网络在处理复杂高维数据时表达能力有限的问题,特别是在图像分类任务中面对细微或噪声干扰的类别区分时性能不足。其解决方案的关键在于引入一种混合量子-经典架构,即Vision-QRWKV,通过将变分量子电路(Variational Quantum Circuit, VQC)集成到RWKV架构的通道混合组件中,以增强非线性特征转换能力和视觉表征的表达能力。
链接: https://arxiv.org/abs/2506.06633
作者: Chi-Sheng Chen
机构: Neuro Industry, Inc.(Neuro Industry 公司)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advancements in quantum machine learning have shown promise in enhancing classical neural network architectures, particularly in domains involving complex, high-dimensional data. Building upon prior work in temporal sequence modeling, this paper introduces Vision-QRWKV, a hybrid quantum-classical extension of the Receptance Weighted Key Value (RWKV) architecture, applied for the first time to image classification tasks. By integrating a variational quantum circuit (VQC) into the channel mixing component of RWKV, our model aims to improve nonlinear feature transformation and enhance the expressive capacity of visual representations. We evaluate both classical and quantum RWKV models on a diverse collection of 14 medical and standard image classification benchmarks, including MedMNIST datasets, MNIST, and FashionMNIST. Our results demonstrate that the quantum-enhanced model outperforms its classical counterpart on a majority of datasets, particularly those with subtle or noisy class distinctions (e.g., ChestMNIST, RetinaMNIST, BloodMNIST). This study represents the first systematic application of quantum-enhanced RWKV in the visual domain, offering insights into the architectural trade-offs and future potential of quantum models for lightweight and efficient vision tasks.
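摘要所述的 VQC 接入方式可以用 PennyLane 勾勒(量子比特数、编码与纠缠层的选择均为假设,仅演示"经典特征 -> 变分量子电路 -> 期望值特征"的通道混合替换思路):

```python
import pennylane as qml
import torch

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def vqc(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))        # 角度编码经典特征
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits)) # 变分纠缠层
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

# 封装为 torch 层,便于嵌入 RWKV 的 channel mixing 分支(示意)
qlayer = qml.qnn.TorchLayer(vqc, weight_shapes={"weights": (n_layers, n_qubits)})
out = qlayer(torch.rand(n_qubits))   # 4 维输入 -> 4 维量子特征
```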
zh
[CV-201] PhysLab: A Benchmark Dataset for Multi-Granularity Visual Parsing of Physics Experiments
【速读】:该论文旨在解决视觉解析在教育场景中的局限性,具体包括现有数据集的标注粒度不足、领域覆盖有限以及缺乏明确的程序指导。其解决方案的关键在于引入PhysLab,这是首个记录学生进行复杂物理实验的视频数据集,包含四个代表性实验,涵盖多样的科学仪器和丰富的人-物体交互(HOI)模式,并提供多层次标注以支持多种视觉任务。
链接: https://arxiv.org/abs/2506.06631
作者: Minghao Zou,Qingtian Zeng,Yongping Miao,Shangkun Liu,Zilong Wang,Hantao Liu,Wei Zhou
机构: Shandong University of Science and Technology (山东科技大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual parsing of images and videos is critical for a wide range of real-world applications. However, progress in this field is constrained by limitations of existing datasets: (1) insufficient annotation granularity, which impedes fine-grained scene understanding and high-level reasoning; (2) limited coverage of domains, particularly a lack of datasets tailored for educational scenarios; and (3) lack of explicit procedural guidance, with minimal logical rules and insufficient representation of structured task process. To address these gaps, we introduce PhysLab, the first video dataset that captures students conducting complex physics experiments. The dataset includes four representative experiments that feature diverse scientific instruments and rich human-object interaction (HOI) patterns. PhysLab comprises 620 long-form videos and provides multilevel annotations that support a variety of vision tasks, including action recognition, object detection, HOI analysis, etc. We establish strong baselines and perform extensive evaluations to highlight key challenges in the parsing of procedural educational videos. We expect PhysLab to serve as a valuable resource for advancing fine-grained visual parsing, facilitating intelligent classroom systems, and fostering closer integration between computer vision and educational technologies. The dataset and the evaluation toolkit are publicly available at this https URL.
zh
[CV-202] Zero Shot Composed Image Retrieval
【速读】:该论文旨在解决组合图像检索(Composed Image Retrieval, CIR)中的零样本性能不足问题,特别是在FashionIQ基准上,现有方法的Recall@10仅达到20-25%。其解决方案的关键在于通过微调BLIP-2模型,并引入轻量级Q-Former以融合视觉和文本特征,生成统一的嵌入表示,从而显著提升了检索性能,Recall@10分别达到45.6%(shirt)、40.1%(dress)和50.4%(top-tee)。此外,研究还指出有效的基于偏好的CIR需要真正的多模态融合、与排名相关的损失函数以及精心筛选的负样本。
链接: https://arxiv.org/abs/2506.06602
作者: Santhosh Kakarla,Gautama Shastry Bulusu Venkata
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures
Abstract:Composed image retrieval (CIR) allows a user to locate a target image by applying a fine-grained textual edit (e.g., "turn the dress blue" or "remove stripes") to a reference image. Zero-shot CIR, which embeds the image and the text with separate pretrained vision-language encoders, reaches only 20-25% Recall@10 on the FashionIQ benchmark. We improve this by fine-tuning BLIP-2 with a lightweight Q-Former that fuses visual and textual features into a single embedding, raising Recall@10 to 45.6% (shirt), 40.1% (dress), and 50.4% (top-tee) and increasing the average Recall@50 to 67.6%. We also examine Retrieval-DPO, which fine-tunes CLIP's text encoder with a Direct Preference Optimization loss applied to FAISS-mined hard negatives. Despite extensive tuning of the scaling factor, index, and sampling strategy, Retrieval-DPO attains only 0.02% Recall@10 – far below zero-shot and prompt-tuned baselines – because it (i) lacks joint image-text fusion, (ii) uses a margin objective misaligned with top-K metrics, (iii) relies on low-quality negatives, and (iv) keeps the vision and Transformer layers frozen. Our results show that effective preference-based CIR requires genuine multimodal fusion, ranking-aware objectives, and carefully curated negatives.
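FashionIQ 上的 Recall@K 评测本身很直接:对融合嵌入与底库嵌入做余弦检索,统计真值图像落入 top-K 的比例。下面是一个最小示意(嵌入来源按摘要假设为 Q-Former 的融合输出):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb, gallery_emb, gt_index, k=10):
    """query_emb: (Q, D) 参考图+编辑文本的融合嵌入;gt_index: (Q,) 真值下标。"""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    topk = (q @ g.T).topk(k, dim=-1).indices          # (Q, k) 检索结果
    hit = (topk == gt_index[:, None]).any(dim=-1)
    return hit.float().mean().item()
```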
zh
[CV-203] RARL: Improving Medical VLM Reasoning and Generalization with Reinforcement Learning and LoRA under Data and Hardware Constraints
【速读】:该论文旨在解决医学视觉-语言模型(VLMs)在泛化能力、透明度和计算效率方面的局限性,这些问题限制了其在资源受限环境中的实际部署。解决方案的关键在于提出一种基于推理感知的强化学习框架(Reasoning-Aware Reinforcement Learning, RARL),通过微调轻量级基础模型Qwen2-VL-2B-Instruct,并结合低秩适应和自定义奖励函数,以同时优化诊断准确性和推理质量。该方法在单块NVIDIA A100-PCIE-40GB GPU上进行训练,验证了其在资源受限环境中的可行性。
链接: https://arxiv.org/abs/2506.06600
作者: Tan-Hanh Pham,Chris Ngo
机构: Harvard Medical School (哈佛医学院); Knovel Engineering Lab (Knovel工程实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:The growing integration of vision-language models (VLMs) in medical applications offers promising support for diagnostic reasoning. However, current medical VLMs often face limitations in generalization, transparency, and computational efficiency, barriers that hinder deployment in real-world, resource-constrained settings. To address these challenges, we propose a Reasoning-Aware Reinforcement Learning framework, RARL, that enhances the reasoning capabilities of medical VLMs while remaining efficient and adaptable to low-resource environments. Our approach fine-tunes a lightweight base model, Qwen2-VL-2B-Instruct, using Low-Rank Adaptation and custom reward functions that jointly consider diagnostic accuracy and reasoning quality. Training is performed on a single NVIDIA A100-PCIE-40GB GPU, demonstrating the feasibility of deploying such models in constrained environments. We evaluate the model using an LLM-as-judge framework that scores both correctness and explanation quality. Experimental results show that RARL significantly improves VLM performance in medical image analysis and clinical reasoning, outperforming supervised fine-tuning on reasoning-focused tasks by approximately 7.78%, while requiring fewer computational resources. Additionally, we demonstrate the generalization capabilities of our approach on unseen datasets, achieving around 27% improved performance compared to supervised fine-tuning and about 4% over traditional RL fine-tuning. Our experiments also illustrate that diversity prompting during training and reasoning prompting during inference are crucial for enhancing VLM performance. Our findings highlight the potential of reasoning-guided learning and reasoning prompting to steer medical VLMs toward more transparent, accurate, and resource-efficient clinical decision-making. Code and data are publicly available.
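摘要提到奖励函数同时考虑诊断正确性与推理质量;下面给出一个极简的假设性构造(权重与推理质量代理均为示意,原文由 LLM-as-judge 打分):

```python
def rarl_reward(pred_answer, gold_answer, reasoning_text,
                w_acc=1.0, w_reason=0.5):
    """示意性 RARL 奖励:正确性 0/1 + 推理质量的粗糙代理。"""
    acc = 1.0 if pred_answer.strip() == gold_answer.strip() else 0.0
    # 真实实现应由评审模型为推理链打分;这里仅以长度做占位代理
    reason = min(len(reasoning_text.split()) / 50.0, 1.0)
    return w_acc * acc + w_reason * reason
```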
zh
[CV-204] EV-LayerSegNet: Self-supervised Motion Segmentation using Event Cameras CVPR
【速读】:该论文旨在解决基于事件的运动分割任务中由于缺乏高质量标注数据而导致的训练困难问题。其关键解决方案是提出EV-LayerSegNet,一种自监督卷积神经网络,通过分离学习仿射光流和分割掩码,并利用它们对输入事件进行去模糊,从而将去模糊质量作为自监督学习的损失函数,实现无需人工标注的训练。
链接: https://arxiv.org/abs/2506.06596
作者: Youssef Farah,Federico Paredes-Vallés,Guido De Croon,Muhammad Ahmed Humais,Hussain Sajwani,Yahya Zweiri
机构: Khalifa University (哈利法大学); TU Delft (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted for publication at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, 2025
Abstract:Event cameras are novel bio-inspired sensors that capture motion dynamics with much higher temporal resolution than traditional cameras, since pixels react asynchronously to brightness changes. They are therefore better suited for tasks involving motion such as motion segmentation. However, training event-based networks still represents a difficult challenge, as obtaining ground truth is very expensive, error-prone and limited in frequency. In this article, we introduce EV-LayerSegNet, a self-supervised CNN for event-based motion segmentation. Inspired by a layered representation of the scene dynamics, we show that it is possible to learn affine optical flow and segmentation masks separately, and use them to deblur the input events. The deblurring quality is then measured and used as self-supervised learning loss. We train and test the network on a simulated dataset with only affine motion, achieving IoU and detection rate up to 71% and 87% respectively.
zh
[CV-205] A Deep Learning Approach for Facial Attribute Manipulation and Reconstruction in Surveillance and Reconnaissance
【速读】:该论文试图解决 surveillance systems 中因低质量图像和视频导致的人脸识别准确率下降问题,以及现有基于 AI 的面部分析模型在肤色差异和部分遮挡人脸上的偏见问题。这些问题源于数据集的局限性和不平衡性,导致面部识别性能不公平且不可靠。解决方案的关键在于提出一个数据驱动的平台,通过生成合成训练数据来补偿数据集偏差,其核心技术包括基于深度学习的面部属性操控与重建,利用自动编码器和生成对抗网络(Generative Adversarial Networks, GANs)生成多样且高质量的面部数据集,并集成图像增强模块以提升监控视频中低分辨率或遮挡人脸的清晰度。
链接: https://arxiv.org/abs/2506.06578
作者: Anees Nashath Shaik,Barbara Villarini,Vasileios Argyriou
机构: Kingston University (金斯顿大学); University of Westminster (威斯敏斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surveillance systems play a critical role in security and reconnaissance, but their performance is often compromised by low-quality images and videos, leading to reduced accuracy in face recognition. Additionally, existing AI-based facial analysis models suffer from biases related to skin tone variations and partially occluded faces, further limiting their effectiveness in diverse real-world scenarios. These challenges are the results of data limitations and imbalances, where available training datasets lack sufficient diversity, resulting in unfair and unreliable facial recognition performance. To address these issues, we propose a data-driven platform that enhances surveillance capabilities by generating synthetic training data tailored to compensate for dataset biases. Our approach leverages deep learning-based facial attribute manipulation and reconstruction using autoencoders and Generative Adversarial Networks (GANs) to create diverse and high-quality facial datasets. Additionally, our system integrates an image enhancement module, improving the clarity of low-resolution or occluded faces in surveillance footage. We evaluate our approach using the CelebA dataset, demonstrating that the proposed platform enhances both training data diversity and model fairness. This work contributes to reducing bias in AI-based facial analysis and improving surveillance accuracy in challenging environments, leading to fairer and more reliable security applications.
zh
[CV-206] Textile Analysis for Recycling Automation using Transfer Learning and Zero-Shot Foundation Models
【速读】:该论文试图解决自动化纺织品回收中材料组成识别和污染物检测的挑战,特别是在利用传感器数据进行准确分类与分割方面的难题。其解决方案的关键在于采用标准RGB图像作为低成本的感知手段,并结合现代深度学习技术,包括迁移学习用于纺织品类别的分类以及基础模型(如Grounding DINO与Segment Anything Model)实现零样本分割,从而完成自动化纺织品回收流程中的关键预处理任务。
链接: https://arxiv.org/abs/2506.06569
作者: Yannis Spyridis,Vasileios Argyriou
机构: Kingston University, London, UK (金斯顿大学,伦敦,英国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated sorting is crucial for improving the efficiency and scalability of textile recycling, but accurately identifying material composition and detecting contaminants from sensor data remains challenging. This paper investigates the use of standard RGB imagery, a cost-effective sensing modality, for key pre-processing tasks in an automated system. We present computer vision components designed for a conveyor belt setup to perform (a) classification of four common textile types and (b) segmentation of non-textile features such as buttons and zippers. For classification, several pre-trained architectures were evaluated using transfer learning and cross-validation, with EfficientNetB0 achieving the best performance on a held-out test set with 81.25% accuracy. For feature segmentation, a zero-shot approach combining the Grounding DINO open-vocabulary detector with the Segment Anything Model (SAM) was employed, demonstrating excellent performance with a mIoU of 0.90 for the generated masks against ground truth. This study demonstrates the feasibility of using RGB images coupled with modern deep learning techniques, including transfer learning for classification and foundation models for zero-shot segmentation, to enable essential analysis steps for automated textile recycling pipelines.
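分类部分的迁移学习流程可以用 torchvision 几行写出(冻结策略与类别数按摘要设定,其余细节为假设):

```python
import torch.nn as nn
from torchvision import models

# 加载 ImageNet 预训练的 EfficientNetB0,仅替换分类头为 4 类织物
model = models.efficientnet_b0(
    weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
for p in model.features.parameters():
    p.requires_grad = False                 # 冻结骨干(假设的微调策略)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 4)
```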
zh
[CV-207] Securing Traffic Sign Recognition Systems in Autonomous Vehicles
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在交通标志识别任务中受到数据污染攻击(data poisoning attacks)导致模型性能下降的问题。其关键解决方案是提出一种基于数据增强的训练方法,通过引入非线性变换来干扰扰动数据,从而提升模型的鲁棒性。该方法有效缓解了误差最小化攻击(error-minimizing attacks)对模型预测准确率的影响,将准确率从99.90%恢复至96.05%,优于对抗训练的效果。
链接: https://arxiv.org/abs/2506.06563
作者: Thushari Hapuarachchi,Long Dang,Kaiqi Xiong
机构: University of South Florida (南佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:Deep Neural Networks (DNNs) are widely used for traffic sign recognition because they can automatically extract high-level features from images. These DNNs are trained on large-scale datasets obtained from unknown sources. Therefore, it is important to ensure that the models remain secure and are not compromised or poisoned during training. In this paper, we investigate the robustness of DNNs trained for traffic sign recognition. First, we perform the error-minimizing attacks on DNNs used for traffic sign recognition by adding imperceptible perturbations on training data. Then, we propose a data augmentation-based training method to mitigate the error-minimizing attacks. The proposed training method utilizes nonlinear transformations to disrupt the perturbations and improve the model robustness. We experiment with two well-known traffic sign datasets to demonstrate the severity of the attack and the effectiveness of our mitigation scheme. The error-minimizing attacks reduce the prediction accuracy of the DNNs from 99.90% to 10.6%. However, our mitigation scheme successfully restores the prediction accuracy to 96.05%. Moreover, our approach outperforms adversarial training in mitigating the error-minimizing attacks. Furthermore, we propose a detection model capable of identifying poisoned data even when the perturbations are imperceptible to human inspection. Our detection model achieves a success rate of over 99% in identifying the attack. This research highlights the need to employ advanced training methods for DNNs in traffic sign recognition systems to mitigate the effects of data poisoning attacks.
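"利用非线性变换扰乱扰动"的数据增广思路,可以用 torchvision 的一组非线性像素级变换来示意(具体变换组合为假设,并非原文配置;Posterize/Equalize 需作用在 uint8 图像上):

```python
from torchvision import transforms

defense_aug = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),
    transforms.RandomPosterize(bits=4, p=0.5),   # 位深量化,破坏不可感知扰动
    transforms.RandomEqualize(p=0.5),            # 直方图均衡,非线性重映射
])
```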
zh
[CV-208] Bridging Audio and Vision: Zero-Shot Audiovisual Segmentation by Connecting Pretrained Models INTERSPEECH2025
【速读】:该论文试图解决音频视觉分割(Audiovisual Segmentation, AVS)中依赖大量像素级标注数据的问题,这些问题的获取成本高且耗时。解决方案的关键在于提出一种新颖的零样本AVS框架,通过利用多个预训练模型来消除任务特定的训练需求,该方法整合了音频、视觉和文本表示以弥合模态间的差异,从而在无需AVS特定标注的情况下实现精确的声音源分割。
链接: https://arxiv.org/abs/2506.06537
作者: Seung-jae Lee,Paul Hongsuck Seo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Accepted on INTERSPEECH2025
Abstract:Audiovisual segmentation (AVS) aims to identify visual regions corresponding to sound sources, playing a vital role in video understanding, surveillance, and human-computer interaction. Traditional AVS methods depend on large-scale pixel-level annotations, which are costly and time-consuming to obtain. To address this, we propose a novel zero-shot AVS framework that eliminates task-specific training by leveraging multiple pretrained models. Our approach integrates audio, vision, and text representations to bridge modality gaps, enabling precise sound source segmentation without AVS-specific annotations. We systematically explore different strategies for connecting pretrained models and evaluate their efficacy across multiple datasets. Experimental results demonstrate that our framework achieves state-of-the-art zero-shot AVS performance, highlighting the effectiveness of multimodal model integration for finegrained audiovisual segmentation.
zh
[CV-209] GS4: Generalizable Sparse Splatting Semantic SLAM
【速读】:该论文试图解决传统SLAM算法在生成高分辨率和完整3D地图方面的局限性,以及现有基于高斯点云(Gaussian Splatting, GS)的SLAM方法在场景优化上的计算效率低和泛化能力差的问题。其解决方案的关键在于提出一种可泛化的语义SLAM算法,该算法通过一个学习得到的泛化网络从RGB-D视频流中增量式地构建和更新3D场景表示,并将3D语义分割无缝集成到GS框架中,从而实现高效的全局定位校正和更优的场景理解。
链接: https://arxiv.org/abs/2506.06517
作者: Mingqi Jiang,Chanho Kim,Chen Ziwen,Li Fuxin
机构: Oregon State University (俄勒冈州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures
Abstract:Traditional SLAM algorithms are excellent at camera tracking but might generate lower resolution and incomplete 3D maps. Recently, Gaussian Splatting (GS) approaches have emerged as an option for SLAM with accurate, dense 3D map building. However, existing GS-based SLAM methods rely on per-scene optimization which is time-consuming and does not generalize to diverse scenes well. In this work, we introduce the first generalizable GS-based semantic SLAM algorithm that incrementally builds and updates a 3D scene representation from an RGB-D video stream using a learned generalizable network. Our approach starts from an RGB-D image recognition backbone to predict the Gaussian parameters from every downsampled and backprojected image location. Additionally, we seamlessly integrate 3D semantic segmentation into our GS framework, bridging 3D mapping and recognition through a shared backbone. To correct localization drifting and floaters, we propose to optimize the GS for only 1 iteration following global localization. We demonstrate state-of-the-art semantic SLAM performance on the real-world benchmark ScanNet with an order of magnitude fewer Gaussians compared to other recent GS-based methods, and showcase our model’s generalization capability through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
zh
[CV-210] Noise Consistency Regularization for Improved Subject-Driven Image Synthesis
【速读】:该论文旨在解决微调Stable Diffusion模型时出现的两个关键问题:欠拟合(模型无法可靠捕捉特定主体身份)和过拟合(模型记忆主体图像并降低背景多样性)。其解决方案的关键在于引入两种辅助的一致性损失:首先,先验一致性正则化损失确保非主体图像的预测扩散噪声与预训练模型保持一致,从而提升图像保真度;其次,主体一致性正则化损失增强了模型对乘性噪声调制潜在代码的鲁棒性,有助于在保持主体身份的同时提高图像多样性。
链接: https://arxiv.org/abs/2506.06483
作者: Yao Ni,Song Wen,Piotr Koniusz,Anoop Cherian
机构: The Australian National University (澳大利亚国立大学); Rutgers University (罗格斯大学); Data61, CSIRO (Data61, CSIRO); Mitsubishi Electric Research Laboratories (三菱电机研究实验室)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Fine-tuning Stable Diffusion enables subject-driven image synthesis by adapting the model to generate images containing specific subjects. However, existing fine-tuning methods suffer from two key issues: underfitting, where the model fails to reliably capture subject identity, and overfitting, where it memorizes the subject image and reduces background diversity. To address these challenges, we propose two auxiliary consistency losses for diffusion fine-tuning. First, a prior consistency regularization loss ensures that the predicted diffusion noise for prior (non-subject) images remains consistent with that of the pretrained model, improving fidelity. Second, a subject consistency regularization loss enhances the fine-tuned model’s robustness to multiplicative noise modulated latent code, helping to preserve subject identity while improving diversity. Our experimental results demonstrate that incorporating these losses into fine-tuning not only preserves subject identity but also enhances image diversity, outperforming DreamBooth in terms of CLIP scores, background variation, and overall visual quality.
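先验一致性正则的核心是"对非主体(prior)图像,微调模型的噪声预测要贴住冻结的预训练模型"。下面是该损失的一个最小示意(假设采用 diffusers 风格的 UNet 接口):

```python
import torch
import torch.nn.functional as F

def prior_consistency_loss(unet_ft, unet_frozen, noisy_latents, t, text_emb):
    """unet_ft 为微调中的 UNet,unet_frozen 为冻结的预训练副本。"""
    eps_ft = unet_ft(noisy_latents, t, encoder_hidden_states=text_emb).sample
    with torch.no_grad():   # 预训练模型只作参照,不回传梯度
        eps_ref = unet_frozen(noisy_latents, t,
                              encoder_hidden_states=text_emb).sample
    return F.mse_loss(eps_ft, eps_ref)
```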
zh
[CV-211] (LiFT) Lightweight Fitness Transformer: A language-vision model for Remote Monitoring of Physical Training
【速读】:该论文旨在解决现有健身追踪系统在运动种类覆盖范围有限以及模型复杂度高难以实际部署的问题。其关键解决方案是开发了一个鲁棒的多任务运动分析模型,能够跨数百种运动进行运动检测和重复次数计数,同时通过构建大规模健身数据集Olympia来克服以往数据量不足的限制,并采用单一的视觉-语言Transformer模型实现运动识别与重复计数,从而推动基于AI的健身追踪技术的普及。
链接: https://arxiv.org/abs/2506.06480
作者: A. Postlmayr,P. Cosman,S. Dey
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce a fitness tracking system that enables remote monitoring for exercises using only a RGB smartphone camera, making fitness tracking more private, scalable, and cost effective. Although prior work explored automated exercise supervision, existing models are either too limited in exercise variety or too complex for real-world deployment. Prior approaches typically focus on a small set of exercises and fail to generalize across diverse movements. In contrast, we develop a robust, multitask motion analysis model capable of performing exercise detection and repetition counting across hundreds of exercises, a scale far beyond previous methods. We overcome previous data limitations by assembling a large-scale fitness dataset, Olympia covering more than 1,900 exercises. To our knowledge, our vision-language model is the first that can perform multiple tasks on skeletal fitness data. On Olympia, our model can detect exercises with 76.5% accuracy and count repetitions with 85.3% off-by-one accuracy, using only RGB video. By presenting a single vision-language transformer model for both exercise identification and rep counting, we take a significant step toward democratizing AI-powered fitness tracking.
zh
[CV-212] Edge-Enabled Collaborative Object Detection for Real-Time Multi-Vehicle Perception
【速读】:该论文旨在解决连接自动驾驶车辆(Connected Autonomous Vehicles, CAVs)中因车载感知系统受限于遮挡和盲区而导致的检测准确性不足,以及云端解决方案引入显著延迟而无法满足动态环境下的实时处理需求的问题。其解决方案的关键在于提出一种基于边缘计算和多CAV协同的创新框架——边缘增强型协作目标检测(Edge-Enabled Collaborative Object Detection, ECOD),通过在边缘侧部署Perceptive Aggregation and Collaborative Estimation (PACE)与Variable Object Tally and Evaluation (VOTE)两种关键算法,实现多视角、低延迟的目标检测与分类优化。
链接: https://arxiv.org/abs/2506.06474
作者: Everett Richards,Bipul Thapa,Lena Mashayekhy
机构: San Diego State University (圣地亚哥州立大学); University of Delaware (特拉华大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
备注: This paper has been accepted to IEEE EDGE 2025. The final version will be published in IEEE Xplore later this year
Abstract:Accurate and reliable object detection is critical for ensuring the safety and efficiency of Connected Autonomous Vehicles (CAVs). Traditional on-board perception systems have limited accuracy due to occlusions and blind spots, while cloud-based solutions introduce significant latency, making them unsuitable for real-time processing demands required for autonomous driving in dynamic environments. To address these challenges, we introduce an innovative framework, Edge-Enabled Collaborative Object Detection (ECOD) for CAVs, that leverages edge computing and multi-CAV collaboration for real-time, multi-perspective object detection. Our ECOD framework integrates two key algorithms: Perceptive Aggregation and Collaborative Estimation (PACE) and Variable Object Tally and Evaluation (VOTE). PACE aggregates detection data from multiple CAVs on an edge server to enhance perception in scenarios where individual CAVs have limited visibility. VOTE utilizes a consensus-based voting mechanism to improve the accuracy of object classification by integrating data from multiple CAVs. Both algorithms are designed at the edge to operate in real-time, ensuring low-latency and reliable decision-making for CAVs. We develop a hardware-based controlled testbed consisting of camera-equipped robotic CAVs and an edge server to evaluate the efficacy of our framework. Our experimental results demonstrate the significant benefits of ECOD in terms of improved object classification accuracy, outperforming traditional single-perspective onboard approaches by up to 75%, while ensuring low-latency, edge-driven real-time processing. This research highlights the potential of edge computing to enhance collaborative perception for latency-sensitive autonomous systems.
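VOTE 的共识机制可以压缩成一个加权投票函数(简化示意,实际算法细节未知):同一目标由多辆 CAV 各给出 (label, confidence),按置信度累加后取最高者。

```python
from collections import Counter

def vote_classify(detections):
    """detections: [(label, confidence), ...],来自不同 CAV 的同一目标。"""
    scores = Counter()
    for label, conf in detections:
        scores[label] += conf          # 置信度加权计票
    return scores.most_common(1)[0][0]

# 例:三辆 CAV 的判断 -> 'pedestrian'
print(vote_classify([("pedestrian", 0.9), ("pedestrian", 0.6), ("cyclist", 0.7)]))
```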
zh
[CV-213] Splat and Replace: 3D Reconstruction with Repetitive Elements SIGGRAPH
【速读】:该论文试图解决在新颖视图合成中,由于训练视角不够全面而导致的未见区域和被遮挡区域渲染质量低的问题。解决方案的关键在于利用场景中的重复元素(repetitive elements),通过分割3DGS重建中的每个重复实例、进行配准并共享实例间的信息,从而提升因覆盖不足和遮挡导致的低质量区域的重建效果,同时兼顾实例间的外观变化。
链接: https://arxiv.org/abs/2506.06462
作者: Nicolás Violante,Andreas Meuleman,Alban Gauthier,Frédo Durand,Thibault Groueix,George Drettakis
机构: Inria & Université Côte d’Azur (Inria & 科蒂埃阿尔卑斯大学); Adobe, USA (Adobe美国); MIT, USA (MIT美国)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH Conference Papers 2025. Project site: this https URL
Abstract:We leverage repetitive elements in 3D scenes to improve novel view synthesis. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly improved novel view synthesis but renderings of unseen and occluded parts remain low-quality if the training views are not exhaustive enough. Our key observation is that our environment is often full of repetitive elements. We propose to leverage those repetitions to improve the reconstruction of low-quality parts of the scene due to poor coverage and occlusions. We propose a method that segments each repeated instance in a 3DGS reconstruction, registers them together, and allows information to be shared among instances. Our method improves the geometry while also accounting for appearance variations across instances. We demonstrate our method on a variety of synthetic and real scenes with typical repetitive elements, leading to a substantial improvement in the quality of novel view synthesis.
zh
[CV-214] Vid2Sim: Generalizable Video-based Reconstruction of Appearance Geometry and Physics for Mesh-free Simulation CVPR2025
【速读】:该论文旨在解决从视频中准确重建带有纹理的形状和物理属性的问题,这是一个具有挑战性的系统识别问题。传统方法通常依赖于基于可微分模拟器和渲染器的复杂优化流程来估计物理参数,但这些方法往往需要针对每个场景进行大量超参数调优,并且计算成本高昂,限制了其实用性和泛化能力。本文提出的解决方案——Vid2Sim,关键在于采用一种无需网格的基于线性混合皮肤(Linear Blend Skinning, LBS)的简化模拟方法,实现了高效的计算和灵活的表示能力。该框架首先通过一个前馈神经网络从视频中重建物理系统的观测配置,随后通过轻量级优化流程在短时间内精确匹配视频观察结果,并支持高效、高质量的无网格模拟。
链接: https://arxiv.org/abs/2506.06440
作者: Chuhao Chen,Zhiyang Dou,Chen Wang,Yiming Huang,Anjun Chen,Qiao Feng,Jiatao Gu,Lingjie Liu
机构: University of Pennsylvania (宾夕法尼亚大学); The University of Hong Kong (香港大学); Zhejiang University (浙江大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
Abstract:Faithfully reconstructing textured shapes and physical properties from videos presents an intriguing yet challenging problem. Significant efforts have been dedicated to advancing such a system identification problem in this area. Previous methods often rely on heavy optimization pipelines with a differentiable simulator and renderer to estimate physical parameters. However, these approaches frequently necessitate extensive hyperparameter tuning for each scene and involve a costly optimization process, which limits both their practicality and generalizability. In this work, we propose a novel framework, Vid2Sim, a generalizable video-based approach for recovering geometry and physical properties through a mesh-free reduced simulation based on Linear Blend Skinning (LBS), offering high computational efficiency and versatile representation capability. Specifically, Vid2Sim first reconstructs the observed configuration of the physical system from video using a feed-forward neural network trained to capture physical world knowledge. A lightweight optimization pipeline then refines the estimated appearance, geometry, and physical properties to closely align with video observations within just a few minutes. Additionally, after the reconstruction, Vid2Sim enables high-quality, mesh-free simulation with high efficiency. Extensive experiments demonstrate that our method achieves superior accuracy and efficiency in reconstructing geometry and physical properties from video data.
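Vid2Sim 的无网格简化模拟基于线性混合皮肤(LBS),其标准公式如下(通用写法,符号为惯例而非原文记号):

```latex
% LBS:静止位形下的点 \bar{x}_i 由 J 个句柄变换 T_j 加权混合驱动
x_i' = \sum_{j=1}^{J} w_{ij}\, T_j\, \bar{x}_i,
\qquad \sum_{j} w_{ij} = 1,\; w_{ij} \ge 0
```

其中 w_{ij} 为皮肤权重;只需对少量句柄求解动力学、再按权重混合驱动所有点,即可绕开网格离散化,这正是其计算高效的来源。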
zh
[CV-215] NeurNCD: Novel Class Discovery via Implicit Neural Representation ICMR2024
【速读】:该论文试图解决开放世界设置下新型类别发现的问题,传统显式表示如物体描述符或3D分割图由于其离散性、孔洞和噪声的特性,限制了新型类别的准确发现。解决方案的关键在于提出NeurNCD框架,该框架首次采用精心设计的Embedding-NeRF模型,并利用KL散度替代传统的显式3D分割图,以聚合语义嵌入和视觉嵌入空间中的熵,同时集成特征查询、特征调制和聚类等关键组件,从而实现高效的特征增强和信息交换。
链接: https://arxiv.org/abs/2506.06412
作者: Junming Wang,Yi Shi
机构: The University of Hong Kong (香港大学); Beijing Jiaotong University (北京交通大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICMR 2024
Abstract:Discovering novel classes in open-world settings is crucial for real-world applications. Traditional explicit representations, such as object descriptors or 3D segmentation maps, are constrained by their discrete, hole-prone, and noisy nature, which hinders accurate novel class discovery. To address these challenges, we introduce NeurNCD, the first versatile and data-efficient framework for novel class discovery that employs the meticulously designed Embedding-NeRF model combined with KL divergence as a substitute for traditional explicit 3D segmentation maps to aggregate semantic embedding and entropy in visual embedding space. NeurNCD also integrates several key components, including feature query, feature modulation and clustering, facilitating efficient feature augmentation and information exchange between the pre-trained semantic segmentation network and implicit neural representations. As a result, our framework achieves superior segmentation performance in both open and closed-world settings without relying on densely labelled datasets for supervised training or human interaction to generate sparse label supervision. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches on the NYUv2 and Replica datasets.
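NeurNCD 以 KL 散度替代显式 3D 分割图来聚合语义信息。下面是一个以 KL 散度对齐两路类别分布的极简示意;函数名与张量形状均为假设,仅演示这一损失的一般写法,并非论文代码:

```python
import torch
import torch.nn.functional as F

def kl_consistency_loss(logits_2d, logits_nerf):
    """以 KL 散度对齐 2D 语义分割网络与 Embedding-NeRF 渲染出的类别分布。
    logits_2d, logits_nerf: (B, C) 两个来源的未归一化类别得分(假设)。"""
    log_p = F.log_softmax(logits_nerf, dim=-1)   # NeRF 渲染侧的 log 概率
    q = F.softmax(logits_2d, dim=-1)             # 预训练分割网络侧的概率
    # F.kl_div(input=log p, target=q) 计算 KL(q || p),batchmean 对批次取平均
    return F.kl_div(log_p, q, reduction='batchmean')

loss = kl_consistency_loss(torch.randn(8, 21), torch.randn(8, 21))
print(loss.item())
```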
zh
[CV-216] Active Illumination Control in Low-Light Environments using NightHawk
【速读】:该论文旨在解决地下环境(如涵洞)中机器人视觉面临的挑战,这些环境由于光照不足和缺乏显著特征而影响视觉性能。解决方案的关键在于提出NightHawk框架,该框架结合主动照明与曝光控制,通过在线贝叶斯优化问题确定最佳光源强度和曝光时间,以优化图像质量。此外,论文引入了一种基于特征检测器的度量标准作为优化器的成本函数,从而提升图像的实用性和视觉估计的可靠性。
链接: https://arxiv.org/abs/2506.06394
作者: Yash Turkar,Youngjin Kim,Karthik Dantu
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Subterranean environments such as culverts present significant challenges to robot vision due to dim lighting and lack of distinctive features. Although onboard illumination can help, it introduces issues such as specular reflections, overexposure, and increased power consumption. We propose NightHawk, a framework that combines active illumination with exposure control to optimize image quality in these settings. NightHawk formulates an online Bayesian optimization problem to determine the best light intensity and exposure-time for a given scene. We propose a novel feature detector-based metric to quantify image utility and use it as the cost function for the optimizer. We built NightHawk as an event-triggered recursive optimization pipeline and deployed it on a legged robot navigating a culvert beneath the Erie Canal. Results from field experiments demonstrate improvements in feature detection and matching by 47-197% enabling more reliable visual estimation in challenging lighting conditions.
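NightHawk 将光强与曝光时间的选取表述为在线贝叶斯优化问题。下面用 scikit-learn 的高斯过程写一个极简的 BO 循环示意;其中的效用函数是自拟的解析替身(实际应在对应参数下采图并用特征检测器打分),采集函数采用 UCB,均为假设性选择:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def image_utility(params):
    """代价函数占位:实际系统应在 (光强, 曝光时间) 下采图,
    并用特征检测器的得分度量图像效用;此处用假设的解析函数代替。"""
    intensity, exposure = params
    return -((intensity - 0.6) ** 2 + (exposure - 0.3) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 2))          # 初始随机采样 (光强, 曝光) ∈ [0,1]^2
y = np.array([image_utility(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):                          # 在线迭代:拟合 GP -> 选 UCB 最大点 -> 实测
    gp.fit(X, y)
    cand = rng.uniform(0, 1, size=(256, 2))
    mu, std = gp.predict(cand, return_std=True)
    x_next = cand[np.argmax(mu + 1.96 * std)]   # UCB 采集函数
    X = np.vstack([X, x_next])
    y = np.append(y, image_utility(x_next))

print("估计的最优 (光强, 曝光):", X[np.argmax(y)])
```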
zh
[CV-217] Exploring Adversarial Watermarking in Transformer-Based Models: Transferability and Robustness Against Defense Mechanism for Medical Images
【速读】:该论文试图解决视觉Transformer(Vision Transformer, ViT)在医学图像中对对抗水印攻击的脆弱性问题。其解决方案的关键在于通过Projected Gradient Descent (PGD)生成对抗水印,并评估其在卷积神经网络(CNN)中的迁移能力以及对抗训练作为防御机制的有效性。研究结果表明,尽管干净图像的性能未受影响,但ViT在面对对抗攻击时表现出显著的脆弱性,而对抗训练能够有效提升其鲁棒性。
链接: https://arxiv.org/abs/2506.06389
作者: Rifat Sadik,Tanvir Rahman,Arpan Bhattacharjee,Bikash Chandra Halder,Ismail Hossain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep learning models have shown remarkable success in dermatological image analysis, offering potential for automated skin disease diagnosis. Previously, convolutional neural network (CNN) based architectures achieved immense popularity and success in computer vision (CV) tasks like skin image recognition, generation and video analysis. With the emergence of transformer-based models, such CV tasks are now commonly carried out using these models. Vision Transformers (ViTs) are transformer-based models that have shown success in computer vision, using self-attention mechanisms to achieve state-of-the-art performance across various tasks. However, their reliance on global attention mechanisms makes them susceptible to adversarial perturbations. This paper investigates the susceptibility of ViTs for medical images to adversarial watermarking, a method that adds imperceptible perturbations in order to fool models. By generating adversarial watermarks through Projected Gradient Descent (PGD), we examine the transferability of such attacks to CNNs and analyze a defense mechanism: adversarial training. Results indicate that while performance is not compromised for clean images, ViTs become much more vulnerable to adversarial attacks, with accuracy dropping to as low as 27.6%. Nevertheless, adversarial training raises it back up to 90.0%.
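论文通过 PGD 生成对抗水印。下面给出 L∞ 约束下 PGD 攻击的极简 PyTorch 示意(eps、步长与迭代次数为常见假设值,model 指任意分类器,非论文官方代码):

```python
import torch

def pgd_watermark(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """PGD 对抗扰动的极简示意:假设输入 x 已归一化到 [0,1],model 处于 eval 模式。"""
    delta = torch.zeros_like(x, requires_grad=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()          # 沿梯度符号方向上升
            delta.clamp_(-eps, eps)                     # 投影回 L∞ 球
            delta.copy_((x + delta).clamp(0, 1) - x)    # 保证扰动后仍是合法图像
        delta.grad.zero_()
    return (x + delta).detach()
```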
zh
[CV-218] CellCLIP – Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning
【速读】:该论文试图解决高通量显微镜技术(如Cell Painting)生成的高内涵筛选(High-content Screening, HCS)数据中,如何有效学习一个统一的潜在空间以对齐不同扰动与其对应的形态学效应的问题。由于Cell Painting图像与自然图像在语义上的显著差异,以及将不同类型的扰动(如小分子与CRISPR基因敲除)在单一潜在空间中表示的难度,使得这一问题具有挑战性。解决方案的关键在于提出CellCLIP,这是一种跨模态对比学习框架,通过预训练的图像编码器和一种新颖的通道编码方案,更好地捕捉图像嵌入中不同显微镜通道之间的关系,并结合自然语言编码器来表示扰动,从而实现更有效的跨模态对齐与表征。
链接: https://arxiv.org/abs/2506.06290
作者: Mingyu Lu,Ethan Weinberger,Chanwoo Kim,Su-In Lee
机构: University of Washington (华盛顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells’ morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g., small molecule vs CRISPR gene knockout) in a single latent space. In response to these challenges, here we introduce CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP leverages pre-trained image encoders coupled with a novel channel encoding scheme to better capture relationships between different microscopy channels in image embeddings, along with natural language encoders for representing perturbations. Our framework outperforms current open-source models, demonstrating the best performance in both cross-modal retrieval and biologically meaningful downstream tasks while also achieving significant reductions in computation time.
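CellCLIP 属于 CLIP 式跨模态对比学习。下面给出双向 InfoNCE 对比损失的极简示意,用图像嵌入与扰动(文本)嵌入的相似度矩阵做对称交叉熵;温度等超参数为假设值,并非官方实现:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, pert_emb, temperature=0.07):
    """跨模态对比损失(InfoNCE)示意:对齐 Cell Painting 图像嵌入与扰动嵌入。
    img_emb, pert_emb: (B, D),同一行为匹配的图像-扰动对(假设)。"""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(pert_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) 相似度矩阵
    labels = torch.arange(img.size(0), device=img.device)
    # 图像->扰动与扰动->图像两个方向的交叉熵取平均
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

loss = clip_style_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```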
zh
[CV-219] Facial Foundational Model Advances Early Warning of Coronary Artery Disease from Live Videos with DigitalShadow
【速读】:该论文试图解决全球人口老龄化背景下冠状动脉疾病(Coronary Artery Disease, CAD)早期检测与预防的挑战,旨在通过非侵入性手段实现CAD风险的早期预警。其解决方案的关键在于开发DigitalShadow系统,该系统基于微调的面部基础模型,通过分析实时视频流中的面部特征进行无接触式CAD风险评估,并结合个性化数据库生成自然语言风险报告和健康建议,同时以隐私保护为核心设计原则支持本地部署。
链接: https://arxiv.org/abs/2506.06283
作者: Juexiao Zhou,Zhongyi Han,Mankun Xin,Xingwei He,Guotao Wang,Jiaoyan Song,Gongning Luo,Wenjia He,Xintong Li,Yuetan Chu,Juanwen Chen,Bo Wang,Xia Wu,Wenwen Duan,Zhixia Guo,Liyan Bai,Yilin Pan,Xuefei Bi,Lu Liu,Long Feng,Xiaonan He,Xin Gao
机构: King Abdullah University of Science and Technology (KAUST); Beijing AnZhen Hospital (北京安贞医院); Capital Medical University (首都医科大学); Tongji Hospital (同济医院); Shanxi Cardiovascular Hospital (山西心血管医院); Beijing Changping Hospital (北京昌平医院); Daqing Longnan Hospital (大庆龙南医院); Tianjin Academy of Traditional Chinese Medicine Affiliated Hospital (天津中医药大学附属医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Global population aging presents increasing challenges to healthcare systems, with coronary artery disease (CAD) responsible for approximately 17.8 million deaths annually, making it a leading cause of global mortality. As CAD is largely preventable, early detection and proactive management are essential. In this work, we introduce DigitalShadow, an advanced early warning system for CAD, powered by a fine-tuned facial foundation model. The system is pre-trained on 21 million facial images and subsequently fine-tuned into LiveCAD, a specialized CAD risk assessment model trained on 7,004 facial images from 1,751 subjects across four hospitals in China. DigitalShadow functions passively and contactlessly, extracting facial features from live video streams without requiring active user engagement. Integrated with a personalized database, it generates natural language risk reports and individualized health recommendations. With privacy as a core design principle, DigitalShadow supports local deployment to ensure secure handling of user data.
zh
[CV-220] STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
【速读】:该论文旨在解决高分辨率图像生成中模型表达能力与计算效率之间的平衡问题,以及传统归一化流(Normalizing Flow)在大规模和高分辨率任务中的适用性问题。其解决方案的关键在于提出Transformer Autoregressive Flow (TARFlow),该模型结合了归一化流的精确概率建模能力和自回归Transformer的结构化建模优势,并通过深度-浅层设计、预训练编码器的潜在空间建模以及新型引导算法等创新手段显著提升了模型的可扩展性和生成质量。
链接: https://arxiv.org/abs/2506.06276
作者: Jiatao Gu,Tianrong Chen,David Berthelot,Huangjie Zheng,Yuyang Wang,Ruixiang Zhang,Laurent Dinh,Miguel Angel Bautista,Josh Susskind,Shuangfei Zhai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: TLDR: We show for the first time that normalizing flows can be scaled for high-resolution and text-conditioned image synthesis
Abstract:We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
zh
[CV-221] MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
【速读】:该论文试图解决的是反向设计(reverse designing)问题,即在给定源图像、编辑后的图像以及可选的高层文本编辑描述的情况下,预测编辑操作及其参数。这一任务要求模型同时理解源图像、编辑后的图像和文本描述之间的复杂交互,超越了传统的视觉-语言任务。论文的关键解决方案是将现有的视觉-语言模型(VLM)MiniGPT-4进行扩展和微调,以适应这一复杂的多模态任务,实验结果表明此类模型具有良好的可扩展性。
链接: https://arxiv.org/abs/2406.00971
作者: Vahid Azizi,Fatemeh Koochaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures
Abstract:Vision-Language Models (VLMs) have recently seen significant advancements through integrating with Large Language Models (LLMs). The VLMs, which process image and text modalities simultaneously, have demonstrated the ability to learn and understand the interaction between images and texts across various multi-modal tasks. Reverse designing, which could be defined as a complex vision-language task, aims to predict the edits and their parameters, given a source image, an edited version, and an optional high-level textual edit description. This task requires VLMs to comprehend the interplay between the source image, the edited version, and the optional textual context simultaneously, going beyond traditional vision-language tasks. In this paper, we extend and fine-tune MiniGPT-4 for the reverse designing task. Our experiments demonstrate the extensibility of off-the-shelf VLMs, specifically MiniGPT-4, for more complex tasks such as reverse designing. Code is available at this https URL
zh
[CV-222] Fine-Grained Motion Compression and Selective Temporal Fusion for Neural B-Frame Video Coding
【速读】:该论文旨在解决神经网络B帧编码中因直接采用P帧编码工具而未能充分应对B帧压缩独特挑战所导致的性能不佳问题。其解决方案的关键在于提出两种创新方法:一是设计细粒度运动压缩方法,通过交互式双分支运动自编码器与分支自适应量化步骤,实现双向运动矢量的精细压缩,并利用交互式运动熵模型挖掘双向运动潜在表示之间的相关性;二是提出选择性时间融合方法,通过预测双向融合权重以区分性地利用多尺度时间上下文,并引入基于超先验的隐式对齐机制,以隐式缓解融合后双向时间先验的不对齐问题。
链接: https://arxiv.org/abs/2506.07709
作者: Xihua Sheng,Peilin Chen,Meng Wang,Li Zhang,Shiqi Wang,Dapeng Oliver Wu
机构: City University of Hong Kong (香港城市大学); Lingnan University (岭南大学); Bytedance(字节跳动)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the remarkable progress in neural P-frame video coding, neural B-frame coding has recently emerged as a critical research direction. However, most existing neural B-frame codecs directly adopt P-frame coding tools without adequately addressing the unique challenges of B-frame compression, leading to suboptimal performance. To bridge this gap, we propose novel enhancements for motion compression and temporal fusion for neural B-frame coding. First, we design a fine-grained motion compression method. This method incorporates an interactive dual-branch motion auto-encoder with per-branch adaptive quantization steps, which enables fine-grained compression of bi-directional motion vectors while accommodating their asymmetric bitrate allocation and reconstruction quality requirements. Furthermore, this method involves an interactive motion entropy model that exploits correlations between bi-directional motion latent representations by interactively leveraging partitioned latent segments as directional priors. Second, we propose a selective temporal fusion method that predicts bi-directional fusion weights to achieve discriminative utilization of bi-directional multi-scale temporal contexts with varying qualities. Additionally, this method introduces a hyperprior-based implicit alignment mechanism for contextual entropy modeling. By treating the hyperprior as a surrogate for the contextual latent representation, this mechanism implicitly mitigates the misalignment in the fused bi-directional temporal priors. Extensive experiments demonstrate that our proposed codec outperforms state-of-the-art neural B-frame codecs and achieves comparable or even superior compression performance to the H.266/VVC reference software under random-access configurations.
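“选择性时间融合”的核心是预测逐像素的双向融合权重。下面是按该思路自拟的极简 PyTorch 模块示意(通道数与卷积结构均为假设,非论文实现):

```python
import torch
import torch.nn as nn

class SelectiveTemporalFusion(nn.Module):
    """选择性时间融合示意:预测逐像素权重 w,
    fused = w * 前向上下文 + (1 - w) * 后向上下文(假设结构)。"""
    def __init__(self, channels):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, ctx_fwd, ctx_bwd):
        w = self.weight_net(torch.cat([ctx_fwd, ctx_bwd], dim=1))  # (B,1,H,W)
        return w * ctx_fwd + (1 - w) * ctx_bwd

fusion = SelectiveTemporalFusion(64)
out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```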
zh
[CV-223] Text-guided multi-stage cross-perception network for medical image segmentation
【速读】:该论文旨在解决医学图像分割中目标区域语义表达能力弱的问题,这一问题主要由目标与非目标区域之间的对比度较低所导致。为了解决这一问题,论文提出了一种基于文本提示的多阶段跨感知网络(Text-guided Multi-stage Cross-perception network, TMC),其关键在于引入了多阶段跨注意力模块以增强模型对语义细节的理解,并设计了多阶段对齐损失以提升跨模态语义的一致性。
链接: https://arxiv.org/abs/2506.07475
作者: Gaoyu Chen
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical image segmentation plays a crucial role in clinical medicine, serving as a tool for auxiliary diagnosis, treatment planning, and disease monitoring, thus facilitating physicians in the study and treatment of diseases. However, existing medical image segmentation methods are limited by the weak semantic expression of the target segmentation regions, which is caused by the low contrast between the target and non-target segmentation regions. To address this limitation, text prompt information has great potential to capture the lesion location. However, existing text-guided methods suffer from insufficient cross-modal interaction and inadequate cross-modal feature expression. To resolve these issues, we propose the Text-guided Multi-stage Cross-perception network (TMC). In TMC, we introduce a multi-stage cross-attention module to enhance the model’s understanding of semantic details and a multi-stage alignment loss to improve the consistency of cross-modal semantics. The results of the experiments demonstrate that our TMC achieves superior performance with Dice of 84.77%, 78.50%, and 88.73% on three public datasets (QaTa-COV19, MosMedData and Breast), outperforming UNet-based networks and text-guided methods.
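TMC 的关键部件是文本引导的跨注意力。下面给出单个阶段的极简示意:图像特征作 query、文本提示作 key/value;论文实际为多阶段堆叠并配合多阶段对齐损失,此处结构与维度均为假设:

```python
import torch
import torch.nn as nn

class TextGuidedCrossAttention(nn.Module):
    """单阶段文本引导跨注意力示意(假设结构,非论文实现)。"""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (B, N, D) 展平的图像特征;text_tokens: (B, L, D) 文本嵌入
        fused, _ = self.attn(img_tokens, text_tokens, text_tokens)
        return self.norm(img_tokens + fused)   # 残差连接后归一化

block = TextGuidedCrossAttention(256)
y = block(torch.randn(2, 196, 256), torch.randn(2, 16, 256))
print(y.shape)  # torch.Size([2, 196, 256])
```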
zh
[CV-224] Pendulum Tracker – SimuFísica: A Web-based Tool for Real-time Measurement of Oscillatory Motion
【速读】:该论文试图解决物理教学中对摆动运动进行实时测量与分析的问题,特别是针对物理摆的振荡运动。解决方案的关键在于开发了一个基于计算机视觉的工具Pendulum Tracker,它能够通过设备的摄像头自动检测摆的位置,并实时显示角度-时间曲线及振荡周期的估计值,从而实现对摆动周期的精确测量和对阻尼振荡的分析。
链接: https://arxiv.org/abs/2506.07301
作者: Marco P. M. de Souza,Juciane G. Maia,Lilian N. de Andrade
机构: Universidade Federal de Rondônia(朗多尼亚联邦大学); Secretaria de Estado da Educação de Rondônia(朗多尼亚州教育厅)
类目: Physics Education (physics.ed-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present Pendulum Tracker, a computer vision-based application that enables real-time measurement of the oscillatory motion of a physical pendulum. Integrated into the educational platform SimuFísica, the system uses the this http URL library and runs directly in the browser, working on computers, tablets, and smartphones. The application automatically detects the pendulum’s position via the device’s camera, displaying in real time the angle-versus-time graph and estimates of the oscillation period. Experimental case studies demonstrate its effectiveness in measuring the period, determining gravitational acceleration, and analyzing damped oscillations. The results show excellent agreement with theoretical predictions, confirming the system’s accuracy and its applicability in educational contexts. The accessible interface and the ability to export raw data make Pendulum Tracker a versatile tool for experimental physics teaching.
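Pendulum Tracker 的周期估计可以用“同向过零点平均间隔”来理解。下面用合成的角度-时间数据演示周期估计与重力加速度反推(小角度近似 T = 2π√(L/g);摆长等数值为假设):

```python
import numpy as np

def estimate_period(t, theta):
    """由角度-时间序列估计摆动周期:取由负到正过零点的平均间隔。"""
    theta = theta - theta.mean()                 # 去直流偏置
    sign = np.sign(theta)
    idx = np.where((sign[:-1] < 0) & (sign[1:] >= 0))[0]  # 同向过零索引
    return np.diff(t[idx]).mean()

# 合成数据验证:T = 2 s 的小角度摆
t = np.linspace(0, 10, 2000)
theta = 0.2 * np.cos(2 * np.pi * t / 2.0)
T = estimate_period(t, theta)
L = 1.0                                  # 假设摆长 1 m
g = 4 * np.pi ** 2 * L / T ** 2          # 由 T = 2π√(L/g) 反解 g
print(f"T ≈ {T:.3f} s, g ≈ {g:.2f} m/s²")
```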
zh
[CV-225] A Narrative Review on Large AI Models in Lung Cancer Screening, Diagnosis and Treatment Planning
【速读】:该论文试图解决肺癌诊断与治疗中对准确、及时和个性化医疗方案的需求,其解决方案的关键在于利用大型AI模型(Large AI Models)在医学影像理解与临床决策中的优势。通过系统综述,研究者探讨了这些模型在肺部结节检测、基因突变预测、多组学整合及个体化治疗规划等任务中的应用,并分析了其在多模态学习任务中的性能表现,旨在推动可扩展、可解释且临床集成的AI系统发展。
链接: https://arxiv.org/abs/2506.07236
作者: Jiachen Zhong,Yiting Wang,Di Zhu,Ziwei Wang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review
Abstract:Lung cancer remains one of the most prevalent and fatal diseases worldwide, demanding accurate and timely diagnosis and treatment. Recent advancements in large AI models have significantly enhanced medical image understanding and clinical decision-making. This review systematically surveys the state-of-the-art in applying large AI models to lung cancer screening, diagnosis, prognosis, and treatment. We categorize existing models into modality-specific encoders, encoder-decoder frameworks, and joint encoder architectures, highlighting key examples such as CLIP, BLIP, Flamingo, BioViL-T, and GLoRIA. We further examine their performance in multimodal learning tasks using benchmark datasets like LIDC-IDRI, NLST, and MIMIC-CXR. Applications span pulmonary nodule detection, gene mutation prediction, multi-omics integration, and personalized treatment planning, with emerging evidence of clinical deployment and validation. Finally, we discuss current limitations in generalizability, interpretability, and regulatory compliance, proposing future directions for building scalable, explainable, and clinically integrated AI systems. Our review underscores the transformative potential of large AI models to personalize and optimize lung cancer care.
zh
[CV-226] A Comprehensive Analysis of COVID-19 Detection Using Bangladeshi Data and Explainable AI
【速读】:该论文旨在解决通过胸部X光(CXR)图像提高COVID-19检测准确率的问题。其关键解决方案是利用包含4,350张来自孟加拉国的图像数据集,并采用机器学习(ML)、深度学习(DL)和迁移学习(TL)模型,其中VGG19模型达到了98%的高准确率。此外,通过引入局部可解释性模型(LIME)和合成少数过采样技术(SMOTE),该研究不仅提升了模型的透明度和可靠性,还有效解决了类别不平衡问题,从而增强了CXR图像中疾病检测的效果。
链接: https://arxiv.org/abs/2506.07234
作者: Shuvashis Sarker
机构: Ahsanullah University of Science and Technology (阿罕默德科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 2024 4th International Conference on Innovations in Science, Engineering and Technology (ICISET)
Abstract:COVID-19 is a rapidly spreading and highly infectious virus which has triggered a global pandemic, profoundly affecting millions across the world. The pandemic has introduced unprecedented challenges in public health, economic stability, and societal structures, necessitating the implementation of extensive and multifaceted health interventions globally. It had a tremendous impact on Bangladesh by April 2024, with around 29,495 fatalities and more than 2 million confirmed cases. This study focuses on improving COVID-19 detection in CXR images by utilizing a dataset of 4,350 images from Bangladesh categorized into four classes: Normal, Lung-Opacity, COVID-19 and Viral-Pneumonia. ML, DL and TL models are employed with the VGG19 model achieving an impressive 98% accuracy. LIME is used to explain model predictions, highlighting the regions and features influencing classification decisions. SMOTE is applied to address class imbalances. By providing insight into both correct and incorrect classifications, the study emphasizes the importance of XAI in enhancing the transparency and reliability of models, ultimately improving the effectiveness of detection from CXR images.
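文中用 SMOTE 处理类别不平衡。下面给出 imblearn 的典型用法示意(特征与标签为合成数据,假设已从 CXR 图像提取出特征向量;注意只对训练集做过采样):

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 假设 X 为 CXR 图像经 CNN 主干提取后的特征,y 为四分类标签(示意数据)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 128))
y = rng.choice(4, size=1000, p=[0.55, 0.25, 0.12, 0.08])   # 人为类别不平衡

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)  # 仅对训练集过采样
print("过采样前后训练样本数:", len(y_tr), "->", len(y_res))

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("测试集准确率:", clf.score(X_te, y_te))
```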
zh
[CV-227] Transfer Learning and Explainable AI for Brain Tumor Classification: A Study Using MRI Data from Bangladesh
【速读】:该论文试图解决在医疗资源有限的国家(如孟加拉国)中,由于手动MRI分析耗时且易出错,导致脑肿瘤及时识别效率低下的问题。解决方案的关键在于开发一种基于MRI数据的自动化脑肿瘤分类系统,采用先进的深度学习模型(如VGG16、VGG19和ResNet50)进行分类,并结合可解释人工智能(Explainable AI, XAI)方法(如Grad-CAM和Grad-CAM++)以提高模型的可解释性与临床适用性。其中,VGG16在分类任务中表现出最高的准确率(99.17%),而XAI技术的引入增强了系统的透明度和稳定性。
链接: https://arxiv.org/abs/2506.07228
作者: Shuvashis Sarker
机构: Ahsanullah University of Science and Technology (阿罕默德科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 2024 6th International Conference on Sustainable Technologies for Industry 5.0 (STI)
Abstract:Brain tumors, regardless of being benign or malignant, pose considerable health risks, with malignant tumors being more perilous due to their swift and uncontrolled proliferation, resulting in malignancy. Timely identification is crucial for enhancing patient outcomes, particularly in nations such as Bangladesh, where healthcare infrastructure is constrained. Manual MRI analysis is arduous and susceptible to inaccuracies, rendering it inefficient for prompt diagnosis. This research sought to tackle these problems by creating an automated brain tumor classification system utilizing MRI data obtained from many hospitals in Bangladesh. Advanced deep learning models, including VGG16, VGG19, and ResNet50, were utilized to classify glioma, meningioma, and various brain cancers. Explainable AI (XAI) methodologies, such as Grad-CAM and Grad-CAM++, were employed to improve model interpretability by emphasizing the critical areas in MRI scans that influenced the categorization. VGG16 achieved the most accuracy, attaining 99.17%. The integration of XAI enhanced the system’s transparency and stability, rendering it more appropriate for clinical application in resource-limited environments such as Bangladesh. This study highlights the capability of deep learning models, in conjunction with explainable artificial intelligence (XAI), to enhance brain tumor detection and identification in areas with restricted access to advanced medical technologies.
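Grad-CAM 用目标类得分对卷积层激活的梯度做通道加权。下面给出基于 PyTorch 钩子的极简实现示意(以 torchvision 的 VGG16 最后一个卷积层为例,输入为随机张量,仅演示机制):

```python
import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, layer, x, class_idx):
    """极简 Grad-CAM:以目标类得分对选定卷积层的梯度加权激活图。"""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    w = grads[0].mean(dim=(2, 3), keepdim=True)                    # 通道权重:梯度全局平均
    cam = F.relu((w * feats[0]).sum(dim=1, keepdim=True)).detach() # 加权求和后取 ReLU
    return F.interpolate(cam, size=x.shape[-2:], mode='bilinear', align_corners=False)

model = models.vgg16(weights=None).eval()
cam = grad_cam(model, model.features[28], torch.randn(1, 3, 224, 224), class_idx=0)
print(cam.shape)  # torch.Size([1, 1, 224, 224])
```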
zh
[CV-228] SiliCoN: Simultaneous Nuclei Segmentation and Color Normalization of Histological Images
【速读】:该论文旨在解决从组织学图像中准确分割细胞核区域的问题,尤其是在染色组织图像颜色外观存在不可接受变化的情况下。其解决方案的关键在于提出了一种新颖的深度生成模型,该模型能够同时进行细胞核结构分割和染色组织图像的颜色外观归一化。该模型巧妙地结合了截断正态分布和空间注意力机制,通过解耦潜在颜色外观信息与细胞核分割图及嵌入图信息,提高了模型的泛化性和适应性,从而使得颜色外观信息的变化不会影响到细胞核分割结果。
链接: https://arxiv.org/abs/2506.07028
作者: Suman Mahapatra,Pradipta Maji
机构: Indian Statistical Institute (印度统计学院)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures
Abstract:Segmentation of nuclei regions from histological images is an important task for automated computer-aided analysis of histological images, particularly in the presence of impermissible color variation in the color appearance of stained tissue images. While color normalization enables better nuclei segmentation, accurate segmentation of nuclei structures makes color normalization rather trivial. In this respect, the paper proposes a novel deep generative model for simultaneously segmenting nuclei structures and normalizing the color appearance of stained histological images. The model judiciously integrates the merits of truncated normal distribution and spatial attention. The model assumes that the latent color appearance information, corresponding to a particular histological image, is independent of the respective nuclei segmentation map as well as embedding map information. The disentangled representation makes the model generalizable and adaptable, as modification or loss of color appearance information cannot affect the nuclei segmentation map or the embedding information. Also, for dealing with the stain overlap of associated histochemical reagents, the prior for the latent color appearance code is assumed to be a mixture of truncated normal distributions. The proposed model incorporates the concept of spatial attention for segmentation of nuclei regions from histological images. The performance of the proposed approach, along with a comparative analysis with related state-of-the-art algorithms, has been demonstrated on publicly available standard histological image data sets.
zh
[CV-229] Optimal Transport Driven Asymmetric Image-to-Image Translation for Nuclei Segmentation of Histological Images
【速读】:该论文试图解决在不同目标域表示下进行核区域分割的问题,特别是当两个图像域的信息内容存在不对称性时,传统图像到图像翻译模型表现不佳的问题。解决方案的关键在于提出一种新的深度生成模型,该模型通过构建一个嵌入空间来处理信息丰富的组织学图像空间与信息贫乏的分割图域之间的信息差异,并结合最优传输和测度论的概念,设计了一个可逆生成器,从而实现了更低网络复杂度下的高效优化框架,同时避免了显式的循环一致性损失,并通过空间约束的压缩操作保持图像块内的空间连续性。
链接: https://arxiv.org/abs/2506.07023
作者: Suman Mahapatra,Pradipta Maji
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures
Abstract:Segmentation of nuclei regions from histological images enables morphometric analysis of nuclei structures, which in turn helps in the detection and diagnosis of diseases under consideration. To develop a nuclei segmentation algorithm, applicable to different types of target domain representations, image-to-image translation networks can be considered as they are invariant to target domain image representations. One of the important issues with image-to-image translation models is that they fail miserably when the information content between two image domains are asymmetric in nature. In this regard, the paper introduces a new deep generative model for segmenting nuclei structures from histological images. The proposed model considers an embedding space for handling information-disparity between information-rich histological image space and information-poor segmentation map domain. Integrating judiciously the concepts of optimal transport and measure theory, the model develops an invertible generator, which provides an efficient optimization framework with lower network complexity. The concept of invertible generator automatically eliminates the need of any explicit cycle-consistency loss. The proposed model also introduces a spatially-constrained squeeze operation within the framework of invertible generator to maintain spatial continuity within the image patches. The model provides a better trade-off between network complexity and model performance compared to other existing models having complex network architectures. The performance of the proposed deep generative model, along with a comparison with state-of-the-art nuclei segmentation methods, is demonstrated on publicly available histological image data sets.
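论文将最优传输思想融入可逆生成器。作为背景,下面给出熵正则化最优传输的经典 Sinkhorn 迭代示意(通用教学写法,与论文的具体构造无直接对应):

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, n_iter=200):
    """熵正则化最优传输的 Sinkhorn 迭代:
    求解 min_P <P, C> - reg·H(P),s.t. P1 = a, Pᵀ1 = b。"""
    K = np.exp(-C / reg)                 # Gibbs 核
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # 交替满足行/列边缘约束
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # 传输计划 P

n = 5
a = np.full(n, 1 / n); b = np.full(n, 1 / n)
C = np.abs(np.subtract.outer(np.arange(n), np.arange(n))).astype(float)
P = sinkhorn(a, b, C)
print(P.round(3), "行和:", P.sum(1).round(3))
```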
zh
[CV-230] SPC to 3D: Novel View Synthesis from Binary SPC via I2I translation ICIP2025
【速读】:该论文旨在解决单光子雪崩二极管(Single Photon Avalanche Diodes, SPADs)图像由于其二值特性导致的纹理和颜色信息严重丢失问题,从而影响传统三维合成技术的效果。解决方案的关键在于提出一个模块化的两阶段框架:第一阶段利用生成模型如Pix2PixHD进行图像到图像(Image-to-Image, I2I)翻译,将二值SPAD图像转换为合理的RGB表示;第二阶段则采用三维场景重建技术如Neural Radiance Fields (NeRF)或Gaussian Splatting (3DGS)生成高质量的新视角图像。
链接: https://arxiv.org/abs/2506.06890
作者: Sumit Sharma,Gopi Raju Matta,Kaushik Mitra
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Accepted for publication at ICIP 2025
Abstract:Single Photon Avalanche Diodes (SPADs) represent a cutting-edge imaging technology, capable of detecting individual photons with remarkable timing precision. Building on this sensitivity, Single Photon Cameras (SPCs) enable image capture at exceptionally high speeds under both low and high illumination. Enabling 3D reconstruction and radiance field recovery from such SPC data holds significant promise. However, the binary nature of SPC images leads to severe information loss, particularly in texture and color, making traditional 3D synthesis techniques ineffective. To address this challenge, we propose a modular two-stage framework that converts binary SPC images into high-quality colorized novel views. The first stage performs image-to-image (I2I) translation using generative models such as Pix2PixHD, converting binary SPC inputs into plausible RGB representations. The second stage employs 3D scene reconstruction techniques like Neural Radiance Fields (NeRF) or Gaussian Splatting (3DGS) to generate novel views. We validate our two-stage pipeline (Pix2PixHD + Nerf/3DGS) through extensive qualitative and quantitative experiments, demonstrating significant improvements in perceptual quality and geometric consistency over the alternative baseline.
zh
[CV-231] ResPF: Residual Poisson Flow for Efficient and Physically Consistent Sparse-View CT Reconstruction
【速读】:该论文旨在解决稀疏视图计算机断层扫描(sparse-view CT)中由于辐射剂量降低导致的病态逆问题所带来的图像重建准确性难题。其解决方案的关键在于提出一种基于残差泊松流(Residual Poisson Flow, ResPF)的生成模型,该模型通过引入条件引导和劫持策略以显著降低采样成本,同时在每一步迭代中嵌入数据一致性约束以保证重建质量。此外,受ResNet启发,该方法还引入了残差融合模块,以线性结合生成结果与数据一致的重建,从而有效维持轨迹连续性,提升重建稳定性和精度。
链接: https://arxiv.org/abs/2506.06400
作者: Changsheng Fang,Yongtong Liu,Bahareh Morovati,Shuo Han,Yu Shi,Li Zhou,Shuyi Fan,Hengyong Yu
机构: University of Massachusetts Lowell (马萨诸塞大学洛厄尔分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sparse-view computed tomography (CT) is a practical solution to reduce radiation dose, but the resulting ill-posed inverse problem poses significant challenges for accurate image reconstruction. Although deep learning and diffusion-based methods have shown promising results, they often lack physical interpretability or suffer from high computational costs due to iterative sampling starting from random noise. Recent advances in generative modeling, particularly Poisson Flow Generative Models (PFGM), enable high-fidelity image synthesis by modeling the full data distribution. In this work, we propose Residual Poisson Flow (ResPF) Generative Models for efficient and accurate sparse-view CT reconstruction. Based on PFGM++, ResPF integrates conditional guidance from sparse measurements and employs a hijacking strategy to significantly reduce sampling cost by skipping redundant initial steps. However, skipping early stages can degrade reconstruction quality and introduce unrealistic structures. To address this, we embed a data-consistency step into each iteration, ensuring fidelity to sparse-view measurements. Yet, PFGM sampling relies on a fixed ordinary differential equation (ODE) trajectory induced by electrostatic fields, which can be disrupted by step-wise data consistency, resulting in unstable or degraded reconstructions. Inspired by ResNet, we introduce a residual fusion module to linearly combine generative outputs with data-consistent reconstructions, effectively preserving trajectory continuity. To the best of our knowledge, this is the first application of Poisson flow models to sparse-view CT. Extensive experiments on synthetic and clinical datasets demonstrate that ResPF achieves superior reconstruction quality, faster inference, and stronger robustness compared to state-of-the-art iterative, learning-based, and diffusion models.
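ResPF 的残差融合即对生成输出与数据一致重建做线性组合。下面用一个玩具线性测量模型演示“数据一致性更新 + 残差融合”的流程(矩阵规模、步长与 alpha 均为假设值):

```python
import numpy as np

def data_consistency(x, A, y, step=0.1):
    """一步数据一致性更新:沿测量残差的梯度方向修正,
    A 为稀疏视角投影矩阵(示意),y 为实测投影数据。"""
    return x - step * A.T @ (A @ x - y)

def residual_fusion(x_gen, x_dc, alpha=0.5):
    """残差融合示意:线性组合生成结果与数据一致性重建,
    以保持采样轨迹的连续性(权重 alpha 为假设超参数)。"""
    return alpha * x_gen + (1.0 - alpha) * x_dc

# 玩具例子:8 维图像、4 条测量
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8)); x_true = rng.normal(size=8); y = A @ x_true
x_gen = x_true + 0.3 * rng.normal(size=8)        # 假设的生成模型输出
x_dc = data_consistency(x_gen, A, y)
print("融合后测量残差:", np.linalg.norm(A @ residual_fusion(x_gen, x_dc) - y))
```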
zh
[CV-232] Heart Rate Classification in ECG Signals Using Machine Learning and Deep Learning
【速读】:该论文旨在解决从心电图(ECG)信号中对心跳进行分类的问题,通过两种不同的方法实现:传统机器学习方法利用手工提取的特征,以及基于深度学习的将ECG波形转换为图像的方法。其解决方案的关键在于对比这两种方法的性能,发现手工提取的特征(如心率变异性HRV、均值、方差和RR间隔)在训练多种分类器(如LightGBM)时表现出更高的分类精度(准确率为99%,F1得分为0.94),优于基于图像的卷积神经网络(CNN)方法(F1得分为0.85)。研究结果表明,手工特征在捕捉ECG信号的时间和形态变化方面更具优势。
链接: https://arxiv.org/abs/2506.06349
作者: Thien Nhan Vo,Thanh Xuan Truong
机构: Ho Chi Minh City University of Technology (HUTECH); Bac Ninh High School for the Gifted
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This study addresses the classification of heartbeats from ECG signals through two distinct approaches: traditional machine learning utilizing hand-crafted features and deep learning via transformed images of ECG beats. The dataset underwent preprocessing steps, including downsampling, filtering, and normalization, to ensure consistency and relevance for subsequent analysis. In the first approach, features such as heart rate variability (HRV), mean, variance, and RR intervals were extracted to train various classifiers, including SVM, Random Forest, AdaBoost, LSTM, Bi-directional LSTM, and LightGBM. The second approach involved transforming ECG signals into images using Gramian Angular Field (GAF), Markov Transition Field (MTF), and Recurrence Plots (RP), with these images subsequently classified using CNN architectures like VGG and Inception. Experimental results demonstrate that the LightGBM model achieved the highest performance, with an accuracy of 99% and an F1 score of 0.94, outperforming the image-based CNN approach (F1 score of 0.85). Models such as SVM and AdaBoost yielded significantly lower scores, indicating limited suitability for this task. The findings underscore the superior ability of hand-crafted features to capture temporal and morphological variations in ECG signals compared to image-based representations of individual beats. Future investigations may benefit from incorporating multi-lead ECG signals and temporal dependencies across successive beats to enhance classification accuracy further.
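第一种方法依赖 HRV、均值、方差与 RR 间隔等手工特征。下面给出从单导联 ECG 提取这类特征的极简示意(峰检测阈值与采样率为假设值,信号为合成脉冲串):

```python
import numpy as np
from scipy.signal import find_peaks

def hrv_features(ecg, fs=360):
    """手工特征提取示意:R 峰 -> RR 间隔 -> 均值/方差/RMSSD(阈值为假设)。"""
    peaks, _ = find_peaks(ecg, distance=int(0.25 * fs), height=np.std(ecg))
    rr = np.diff(peaks) / fs                          # RR 间隔(秒)
    return {
        "mean_rr": rr.mean(),
        "var_rr": rr.var(),
        "rmssd": np.sqrt(np.mean(np.diff(rr) ** 2)),  # 常用 HRV 指标
        "heart_rate": 60.0 / rr.mean(),
    }

# 合成约 72 bpm 的尖峰串验证
fs = 360
t = np.arange(0, 10, 1 / fs)
ecg = np.sin(2 * np.pi * 1.2 * t) ** 63               # 尖峰近似 R 波
print(hrv_features(ecg, fs))
```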
zh
[CV-233] An Open-Source Python Framework and Synthetic ECG Image Datasets for Digitization Lead and Lead Name Detection and Overlapping Signal Segmentation
【速读】:该论文旨在解决深度学习在心电图(Electrocardiogram, ECG)分析中的关键任务,包括ECG数字化、导联区域和导联名称检测以及像素级波形分割。其解决方案的关键在于提出一个开源的Python框架,能够生成合成的ECG图像数据集,这些数据集包含多种导联配置的ECG图像与时间序列信号、YOLO格式标注的边界框以及适用于U-Net模型的单导联图像及其分割掩码,其中在重叠情况下,相邻导联的波形被叠加到目标导联图像上,而分割掩码保持清晰。
链接: https://arxiv.org/abs/2506.06315
作者: Masoud Rahimi,Reza Karbasi,Abdol-Hossein Vahabie
机构: University of Tehran (德黑兰大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, 5 figures
Abstract:We introduce an open-source Python framework for generating synthetic ECG image datasets to advance critical deep learning-based tasks in ECG analysis, including ECG digitization, lead region and lead name detection, and pixel-level waveform segmentation. Using the PTB-XL signal dataset, our proposed framework produces four open-access datasets: (1) ECG images in various lead configurations paired with time-series signals for ECG digitization, (2) ECG images annotated with YOLO-format bounding boxes for detection of lead region and lead name, (3)-(4) cropped single-lead images with segmentation masks compatible with U-Net-based models in normal and overlapping versions. In the overlapping case, waveforms from neighboring leads are superimposed onto the target lead image, while the segmentation masks remain clean. The open-source Python framework and datasets are publicly available at this https URL and this https URL, respectively.
zh
[CV-234] Benchmarking Early Agitation Prediction in Community-Dwelling People with Dementia Using Multimodal Sensors and Machine Learning
【速读】:该论文试图解决在社区居住的阿尔茨海默病患者中,通过多模态传感器数据实现激越行为(agitation)的早期预测问题。其关键解决方案是引入基于活动数据的新型激越相关上下文特征,并采用多种机器学习和深度学习模型进行评估,其中最优方案为使用当前6小时时间戳的二分类方法预测后续时间点的激越行为,同时结合时间段和激越历史等附加信息,显著提升了模型性能,最高达到了AUC-ROC 0.9720和AUC-PR 0.4320。
链接: https://arxiv.org/abs/2506.06306
作者: Ali Abedi,Charlene H. Chu,Shehroz S. Khan
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 16 pages, 4 figures, 2 tables
Abstract:Agitation is one of the most common responsive behaviors in people living with dementia, particularly among those residing in community settings without continuous clinical supervision. Timely prediction of agitation can enable early intervention, reduce caregiver burden, and improve the quality of life for both patients and caregivers. This study aimed to develop and benchmark machine learning approaches for the early prediction of agitation in community-dwelling older adults with dementia using multimodal sensor data. A new set of agitation-related contextual features derived from activity data was introduced and employed for agitation prediction. A wide range of machine learning and deep learning models was evaluated across multiple problem formulations, including binary classification for single-timestamp tabular sensor data and multi-timestamp sequential sensor data, as well as anomaly detection for single-timestamp tabular sensor data. The study utilized the Technology Integrated Health Management (TIHM) dataset, the largest publicly available dataset for remote monitoring of people living with dementia, comprising 2,803 days of in-home activity, physiology, and sleep data. The most effective setting involved binary classification of sensor data using the current 6-hour timestamp to predict agitation at the subsequent timestamp. Incorporating additional information, such as time of day and agitation history, further improved model performance, with the highest AUC-ROC of 0.9720 and AUC-PR of 0.4320 achieved by the light gradient boosting machine. This work presents the first comprehensive benchmarking of state-of-the-art techniques for agitation prediction in community-based dementia care using privacy-preserving sensor data. The approach enables accurate, explainable, and efficient agitation prediction, supporting proactive dementia care and aging in place.
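文中最优设置为:用当前 6 小时窗口的特征做二分类,并加入时段与激越历史,LightGBM 表现最佳。下面按这一思路给出合成数据上的极简示意(特征构造与超参数均为假设):

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 示意数据:每行是一个 6 小时窗口的多模态传感器特征,
# 外加“时段”与“既往激越次数”两个上下文特征(数值为合成)
rng = np.random.default_rng(1)
n = 5000
X = np.column_stack([
    rng.normal(size=(n, 10)),            # 活动/生理/睡眠统计特征(假设)
    rng.integers(0, 4, n),               # 一天中的时段
    rng.poisson(0.2, n),                 # 激越历史计数
])
y = (rng.random(n) < 0.05).astype(int)   # 高度不平衡的激越标签

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LGBMClassifier(n_estimators=300, class_weight="balanced")
clf.fit(X_tr, y_tr)
print("AUC-ROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```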
zh
人工智能
[AI-0] Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction
【速读】:该论文试图解决传统测试时扩展(test-time scaling)方法在交互式代理任务中无法获取环境新信息或随时间调整行为的问题。其解决方案的关键在于提出测试时交互(Test-Time Interaction, TTI),通过增加代理的交互范围,使代理能够在单次运行中执行丰富的行为,如探索、回溯和动态重规划。TTI采用基于课程的在线强化学习方法,通过自适应调整运行长度来训练代理,从而实现探索与利用的动态平衡。
链接: https://arxiv.org/abs/2506.07976
作者: Junhong Shen,Hao Bai,Lunjun Zhang,Yifei Zhou,Amrith Setlur,Shengbang Tong,Diego Caples,Nan Jiang,Tong Zhang,Ameet Talwalkar,Aviral Kumar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The current paradigm of test-time scaling relies on generating long reasoning traces (“thinking” more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent’s interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-step compute, offering new avenues for training adaptive agents.
zh
[AI-1] BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
【速读】:该论文旨在解决机器人操作学习中因缺乏对3D数据空间结构的有效利用而导致的样本效率低的问题。现有方法在将3D信号整合到视觉-语言模型(Vision-Language Models, VLMs)进行动作预测时存在局限,未能充分利用3D数据的固有空间结构。论文提出的解决方案关键在于构建BridgeVLA模型,该模型通过将3D输入投影到多个2D图像以确保与VLM主干的输入对齐,并利用2D热图进行动作预测,从而在统一的2D图像空间内实现输入和输出的一致性。此外,还提出了一种可扩展的预训练方法,使VLM主干具备在下游策略学习前预测2D热图的能力,从而提升3D操作学习的效率和效果。
链接: https://arxiv.org/abs/2506.07961
作者: Peiyan Li,Yixiang Chen,Hongtao Wu,Xiao Ma,Xiangnan Wu,Yan Huang,Liang Wang,Tao Kong,Tieniu Tan
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: In Submission
Abstract:Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all the comparing baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it is able to achieve a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project Website:this https URL
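BridgeVLA 用 2D 热图统一动作预测的输入输出空间。下面给出将热图解码为亚像素坐标的 soft-argmax 极简示意(仅演示热图到坐标这一步,非官方代码):

```python
import torch
import torch.nn.functional as F

def heatmap_to_pixel(heatmap):
    """soft-argmax 示意:把 (B, H, W) 动作热图转为概率加权的期望坐标;
    多视角图像可分别解码后再三角化回 3D(此步骤未在本示意中实现)。"""
    B, H, W = heatmap.shape
    prob = F.softmax(heatmap.view(B, -1), dim=-1).view(B, H, W)
    ys = torch.arange(H, dtype=prob.dtype).view(1, H, 1)
    xs = torch.arange(W, dtype=prob.dtype).view(1, 1, W)
    y = (prob * ys).sum(dim=(1, 2))
    x = (prob * xs).sum(dim=(1, 2))
    return torch.stack([x, y], dim=-1)

hm = torch.randn(2, 224, 224)
print(heatmap_to_pixel(hm))   # (2, 2):每个样本一个 (x, y)
```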
zh
[AI-2] Gradients: When Markets Meet Fine-tuning – A Distributed Approach to Model Optimisation
【速读】:该论文试图解决基础模型微调中的超参数优化问题,现有AutoML平台依赖单一优化策略,仅探索了部分可行的超参数配置。其解决方案的关键在于引入Gradients,一个去中心化的AutoML平台,将超参数优化转化为竞争市场,独立矿工通过经济激励竞争以发现最优配置,从而系统性地探索集中式方法所忽略的超参数区域。
链接: https://arxiv.org/abs/2506.07940
作者: Christopher Subia-Waud(Rayonlabs Team)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Foundation model fine-tuning faces a fundamental challenge: existing AutoML platforms rely on single optimisation strategies that explore only a fraction of viable hyperparameter configurations. In this white paper, we introduce Gradients, a decentralised AutoML platform that transforms hyperparameter optimisation into a competitive marketplace where independent miners compete to discover optimal configurations. Economic incentives align individual exploration with collective optimisation goals, driving systematic investigation of hyperparameter regions that centralised methods miss. We evaluate our approach across 180 controlled experiments spanning diverse model architectures (70M to 70B parameters) and task types. Gradients achieves an 82.8% win rate against HuggingFace AutoTrain and 100% against TogetherAI, Databricks, and Google Cloud, with mean improvements of 11.8% and 42.1% respectively. Complex reasoning and retrieval tasks show particularly strong gains of 30-40%, whilst diffusion models achieve 23.4% improvements for person-specific generation. These results demonstrate that competitive, economically-driven approaches can systematically discover superior configurations that centralised AutoML consistently miss.
zh
[AI-3] Diffusion of Responsibility in Collective Decision Making
【速读】:该论文试图解决在集体决策机制中由于责任扩散(diffusion of responsibility)导致的个体责任模糊问题。解决方案的关键在于通过定义决策机制的双模拟(bisimulation),证明双模拟能够保留与责任相关的性质,并据此建立最小双模拟机制下的责任无扩散机制,从而表明在两个代理的情况下,唯一避免责任扩散的方式是其中一个代理作为“独裁者”单独决策;而在多于两个代理的情况下,任何无责任扩散的机制都必须是“选举独裁制”,即代理们选举出一个代理进行单方面决策。
链接: https://arxiv.org/abs/2506.07935
作者: Pavel Naumov,Jia Tao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:
Abstract:The term "diffusion of responsibility" refers to situations in which multiple agents share responsibility for an outcome, obscuring individual accountability. This paper examines this frequently undesirable phenomenon in the context of collective decision-making mechanisms. The work shows that if a decision is made by two agents, then the only way to avoid diffusion of responsibility is for one agent to act as a "dictator", making the decision unilaterally. In scenarios with more than two agents, any diffusion-free mechanism is an "elected dictatorship" where the agents elect a single agent to make a unilateral decision. The technical results are obtained by defining a bisimulation of decision-making mechanisms, proving that bisimulation preserves responsibility-related properties, and establishing the results for a smallest bisimilar mechanism.
zh
[AI-4] Lightweight Sequential Transformers for Blood Glucose Level Prediction in Type-1 Diabetes
【速读】:该论文旨在解决在资源受限的可穿戴设备上部署预测模型以实现1型糖尿病(Type 1 Diabetes, T1D)患者血糖水平实时监测的问题,其主要挑战在于计算和内存约束。解决方案的关键在于提出一种轻量级的序列Transformer模型,该模型结合了Transformer的注意力机制与循环神经网络的序列处理能力,从而在保持计算效率的同时捕捉长期依赖关系,并通过平衡损失函数应对低血糖和高血糖事件的数据不平衡问题。
链接: https://arxiv.org/abs/2506.07864
作者: Mirko Paolo Barbato,Giorgia Rigamonti,Davide Marelli,Paolo Napoletano
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Type 1 Diabetes (T1D) affects millions worldwide, requiring continuous monitoring to prevent severe hypo- and hyperglycemic events. While continuous glucose monitoring has improved blood glucose management, deploying predictive models on wearable devices remains challenging due to computational and memory constraints. To address this, we propose a novel Lightweight Sequential Transformer model designed for blood glucose prediction in T1D. By integrating the strengths of Transformers’ attention mechanisms and the sequential processing of recurrent neural networks, our architecture captures long-term dependencies while maintaining computational efficiency. The model is optimized for deployment on resource-constrained edge devices and incorporates a balanced loss function to handle the inherent data imbalance in hypo- and hyperglycemic events. Experiments on two benchmark datasets, OhioT1DM and DiaTrend, demonstrate that the proposed model outperforms state-of-the-art methods in predicting glucose levels and detecting adverse events. This work fills the gap between high-performance modeling and practical deployment, providing a reliable and efficient T1D management solution.
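针对低/高血糖事件稀少导致的不平衡,一种常见做法是对事件区间的误差加权。下面是按这一思想自拟的平衡损失示意(阈值 70/180 mg/dL 与权重为假设值,未必与论文的具体损失一致):

```python
import torch
import torch.nn as nn

class BalancedGlucoseLoss(nn.Module):
    """类别平衡回归损失示意:落在低/高血糖区间的目标值误差加大权重。"""
    def __init__(self, hypo=70.0, hyper=180.0, w_event=4.0):
        super().__init__()
        self.hypo, self.hyper, self.w_event = hypo, hyper, w_event

    def forward(self, pred, target):
        err = (pred - target) ** 2
        in_event = (target < self.hypo) | (target > self.hyper)
        w = torch.where(in_event, torch.full_like(err, self.w_event),
                        torch.ones_like(err))
        return (w * err).mean()

loss_fn = BalancedGlucoseLoss()
print(loss_fn(torch.tensor([100.0, 60.0]), torch.tensor([110.0, 65.0])).item())
```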
zh
[AI-5] Fairness Overfitting in Machine Learning: An Information-Theoretic Perspective
【速读】:该论文试图解决机器学习模型在高风险应用中实现的公平性无法保证在未见数据上泛化的难题,即现有方法虽然通过正则化或其他干预手段改进公平性,但缺乏对公平性泛化能力的正式保障。解决方案的关键在于提出一种基于信息论的理论框架,利用Efron-Stein不等式推导出紧致的信息论公平性泛化界,该界同时考虑了互信息(Mutual Information, MI)和条件互信息(Conditional Mutual Information, CMI),从而为提升公平性泛化能力的算法设计提供了理论依据。
链接: https://arxiv.org/abs/2506.07861
作者: Firas Laakom,Haobo Chen,Jürgen Schmidhuber,Yuheng Bu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: 38 pages
Abstract:Despite substantial progress in promoting fairness in high-stake applications using machine learning models, existing methods often modify the training process, such as through regularizers or other interventions, but lack formal guarantees that fairness achieved during training will generalize to unseen data. Although overfitting with respect to prediction performance has been extensively studied, overfitting in terms of fairness loss has received far less attention. This paper proposes a theoretical framework for analyzing fairness generalization error through an information-theoretic lens. Our novel bounding technique is based on Efron-Stein inequality, which allows us to derive tight information-theoretic fairness generalization bounds with both Mutual Information (MI) and Conditional Mutual Information (CMI). Our empirical results validate the tightness and practical relevance of these bounds across diverse fairness-aware learning algorithms. Our framework offers valuable insights to guide the design of algorithms improving fairness generalization.
zh
[AI-6] Residual Reweighted Conformal Prediction for Graph Neural Networks
【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在高风险领域中因未量化不确定性而导致的可靠性问题,以及现有置信区间方法在处理图异方差性和结构偏差时过于保守的问题。其解决方案的关键在于提出一种名为残差重加权图神经网络(Residual Reweighted GNN, RR-GNN)的框架,该框架通过三种创新机制提升预测性能:首先,采用图结构蒙特卡洛置信区间方法,根据拓扑特征将节点或边划分为社区,实现反映异质性的簇条件覆盖;其次,通过在保留校准集上训练次级GNN估计任务特定残差,生成自适应非一致性评分,动态调整预测区间;最后,采用交叉训练协议,在防止信息泄露的同时保持图依赖性。
链接: https://arxiv.org/abs/2506.07854
作者: Zheng Zhang,Jie Bao,Zhixin Zhou,Nicolo Colombo,Lixin Cheng,Rui Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Graph Neural Networks (GNNs) excel at modeling relational data but face significant challenges in high-stakes domains due to unquantified uncertainty. Conformal prediction (CP) offers statistical coverage guarantees, but existing methods often produce overly conservative prediction intervals that fail to account for graph heteroscedasticity and structural biases. While residual reweighting CP variants address some of these limitations, they neglect graph topology, cluster-specific uncertainties, and risk data leakage by reusing training sets. To address these issues, we propose Residual Reweighted GNN (RR-GNN), a framework designed to generate minimal prediction sets with provable marginal coverage guarantees. RR-GNN introduces three major innovations to enhance prediction performance. First, it employs Graph-Structured Mondrian CP to partition nodes or edges into communities based on topological features, ensuring cluster-conditional coverage that reflects heterogeneity. Second, it uses Residual-Adaptive Nonconformity Scores by training a secondary GNN on a held-out calibration set to estimate task-specific residuals, dynamically adjusting prediction intervals according to node or edge uncertainty. Third, it adopts a Cross-Training Protocol, which alternates the optimization of the primary GNN and the residual predictor to prevent information leakage while maintaining graph dependencies. We validate RR-GNN on 15 real-world graphs across diverse tasks, including node classification, regression, and edge weight prediction. Compared to CP baselines, RR-GNN achieves improved efficiency over state-of-the-art methods, with no loss of coverage.
zh
[AI-7] A Temporal FRBR/FRBRoo-Based Model for Component-Level Versioning of Legal Norms
【速读】:该论文试图解决法律规范在自动化处理中难以有效表示的问题,特别是在追踪其层级组件(如条款、段落)的历时演变方面。现有框架如FRBR/FRBRoo和标准如Akoma Ntoso虽能在宏观层面建模法律文件,但缺乏对细粒度、组件级版本控制的原生机制,这限制了法律文本在特定时间点的确定性重建能力。论文提出的解决方案关键在于扩展FRBRoo框架,引入专门的子类——Temporal Version (TV) 和 Language Version (LV),以表示法律规范及其语言变体在特定时间点的状态,并通过Component Work (CW)、Component Temporal Version (CTV) 和 Component Language Version (CLV) 层级化跟踪单个条款、段落和条款的生命周期,从而实现对法律文本任意部分在特定日期的精确、确定性检索与重建。
链接: https://arxiv.org/abs/2506.07853
作者: Hudson de Martim
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Effectively representing legal norms for automated processing is a critical challenge, particularly in tracking the diachronic evolution of their hierarchical components (e.g., articles, paragraphs). While foundational frameworks like FRBR/FRBRoo and standards like Akoma Ntoso model legal documents at a macro level, they lack native mechanisms for granular, component-level versioning. This limitation hinders the deterministic point-in-time reconstruction of legal texts, a fundamental capability for reliable Legal Tech and AI applications. This paper proposes a structured, temporal model that extends the FRBRoo framework to address this gap. It introduces specialized subclasses of Expression, Temporal Version (TV) and Language Version (LV), to represent the state of a legal norm and its linguistic variations at specific points in time. The model applies this same paradigm hierarchically, introducing Component Work (CW), Component Temporal Version (CTV), and Component Language Version (CLV) to track the lifecycle of individual articles, paragraphs, and clauses. Using the Brazilian Federal Constitution as a case study, the paper demonstrates how each amendment creates new Component Temporal Versions for affected provisions, while unaffected components retain their existing versions. This fine-grained, time-aware architecture enables the precise, deterministic retrieval and reconstruction of any part of a legal text as it existed on a specific date. The model provides a robust foundation for developing advanced legal information systems, knowledge graphs, and AI tools capable of accurate historical analysis and impact assessment, overcoming the limitations of current generative models.
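论文的核心是组件级的时间版本化。下面用 Python dataclass 勾勒 Component Work / Component Temporal Version 的一种可能建模,并演示按日期确定性取回条文文本;类与字段名为按论文概念自拟的假设,非论文附带实现:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class ComponentTemporalVersion:
    """组件时间版本(CTV)示意:记录某条款在一段时间内的文本状态。"""
    text: str
    valid_from: date
    valid_to: Optional[date] = None      # None 表示当前仍有效

@dataclass
class ComponentWork:
    """组件作品(CW)示意:一个条款的抽象身份,聚合其历次时间版本。"""
    component_id: str
    versions: list = field(default_factory=list)

    def as_of(self, d: date) -> Optional[str]:
        """确定性地取回指定日期有效的文本。"""
        for v in self.versions:
            if v.valid_from <= d and (v.valid_to is None or d < v.valid_to):
                return v.text
        return None

art = ComponentWork("art-5")
art.versions.append(ComponentTemporalVersion("原始条文…", date(1988, 10, 5), date(2004, 12, 31)))
art.versions.append(ComponentTemporalVersion("修正后条文…", date(2004, 12, 31)))
print(art.as_of(date(2000, 1, 1)))   # -> 原始条文…
```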
zh
[AI-8] HAIBU-ReMUD: Reasoning Multimodal Ultrasound Dataset and Model Bridging to General Specific Domains
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在特定领域(如医学超声)中表现不佳的问题,主要原因是缺乏领域特定的图像-文本数据。解决方案的关键在于提出一种新的图像-文本推理监督微调数据生成流水线,从领域内非结构化数据中构建特定领域的四元组数据(图像、问题、思考过程和答案),并基于此构建了医学超声领域数据集ReMUD,包含超过45,000条推理与非推理类监督微调问答(QA)和视觉问答(VQA)数据,从而提升了模型在该领域的性能。
链接: https://arxiv.org/abs/2506.07837
作者: Shijie Wang,Yilun Zhang,Zeyu Lai,Dexing Kong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) have shown great potential in general domains but perform poorly in some specific domains due to a lack of domain-specific data, such as image-text data or video-text data. In some specific domains, there is abundant graphic and textual data scattered around, but lacks standardized arrangement. In the field of medical ultrasound, there are ultrasonic diagnostic books, ultrasonic clinical guidelines, ultrasonic diagnostic reports, and so on. However, these ultrasonic materials are often saved in the forms of PDF, images, etc., and cannot be directly used for the training of MLLMs. This paper proposes a novel image-text reasoning supervised fine-tuning data generation pipeline to create specific domain quadruplets (image, question, thinking trace, and answer) from domain-specific materials. A medical ultrasound domain dataset ReMUD is established, containing over 45,000 reasoning and non-reasoning supervised fine-tuning Question Answering (QA) and Visual Question Answering (VQA) data. The ReMUD-7B model, fine-tuned on Qwen2.5-VL-7B-Instruct, outperforms general-domain MLLMs in medical ultrasound field. To facilitate research, the ReMUD dataset, data generation codebase, and ReMUD-7B parameters will be released at this https URL, addressing the data shortage issue in specific domain MLLMs.
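ReMUD 的监督微调数据是(图像、问题、思考过程、答案)四元组。下面给出一条四元组记录的 JSON 示意;字段名与内容均为自拟假设,实际 schema 以官方发布为准:

```python
import json

# 四元组数据样例示意:(图像, 问题, 思考过程, 答案)
record = {
    "image": "ultrasound/liver_0001.png",
    "question": "肝内可见的低回声结节最可能的诊断是什么?",
    "thinking": "结节边界清晰、后方回声增强,结合形态特征逐步排除恶性可能…",
    "answer": "倾向于肝血管瘤,建议增强影像随访。",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```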
zh
[AI-9] Are Trees Really Green? A Detection Approach of IoT Malware Attacks
【速读】:该论文试图解决物联网(IoT)设备在面临网络安全攻击时,如何在保证检测性能的同时降低计算资源消耗的问题。解决方案的关键在于提出一种绿色的方法,通过优化基于树模型(Decision Trees、Random Forest 和 Extra-Trees)的超参数,以平衡能量消耗与检测效果,特别是在Matthew’s Correlation Coefficient指标上的表现,从而实现低功耗下的高效恶意网络攻击检测。
链接: https://arxiv.org/abs/2506.07836
作者: Silvia Lucia Sanna,Diego Soi,Davide Maiorca,Giorgio Giacinto
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Nowadays, the Internet of Things (IoT) is widely employed, and its usage is growing exponentially because it facilitates remote monitoring, predictive maintenance, and data-driven decision making, especially in the healthcare and industrial sectors. However, IoT devices remain vulnerable due to their resource constraints and difficulty in applying security patches. Consequently, various cybersecurity attacks are reported daily, such as Denial of Service, particularly in IoT-driven solutions. Most attack detection methodologies are based on Machine Learning (ML) techniques, which can detect attack patterns. However, the focus is more on identification rather than considering the impact of ML algorithms on computational resources. This paper proposes a green methodology to identify IoT malware networking attacks based on flow privacy-preserving statistical features. In particular, the hyperparameters of three tree-based models – Decision Trees, Random Forest and Extra-Trees – are optimized based on energy consumption and test-time performance in terms of Matthew’s Correlation Coefficient. Our results show that models maintain high performance and detection accuracy while consistently reducing power usage in terms of watt-hours (Wh). This suggests that on-premise ML-based Intrusion Detection Systems are suitable for IoT and other resource-constrained devices.
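论文的核心是把能耗与检测效果(MCC)一同纳入树模型的超参数选择。下面给出一个可运行的简化示意:用测试阶段的墙钟时间近似能耗(论文实际以瓦时计量),在 MCC 不低于最优值 0.01 的前提下选择最省资源的配置;数据与搜索空间均为示例。

```python
import time
from itertools import product
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = []
for n_est, depth in product([10, 50, 100], [4, 8, None]):
    clf = ExtraTreesClassifier(n_estimators=n_est, max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    t0 = time.perf_counter()
    mcc = matthews_corrcoef(y_te, clf.predict(X_te))
    test_time = time.perf_counter() - t0          # 以测试耗时作为能耗的粗略代理
    results.append((mcc, test_time, n_est, depth))

best_mcc = max(r[0] for r in results)
# 在 MCC 损失不超过 0.01 的候选中选最省时的"绿色"配置
green = min((r for r in results if r[0] >= best_mcc - 0.01), key=lambda r: r[1])
print(f"MCC={green[0]:.3f}, test_time={green[1]*1e3:.1f} ms, "
      f"n_estimators={green[2]}, max_depth={green[3]}")
```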
zh
[AI-10] Decentralizing Multi-Agent Reinforcement Learning with Temporal Causal Information
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中由于隐私约束、通信限制和性能问题所带来的挑战,特别是在去中心化多智能体强化学习(Decentralized Multi-Agent RL, DMARL)环境下,如何确保局部策略在执行时的兼容性以实现全局任务。其解决方案的关键在于引入高层符号知识(symbolic knowledge),通过扩展用于验证局部策略与团队任务兼容性的形式化工具,提升去中心化训练的理论保障能力,并实验证明符号知识能够显著加速DMARL中的学习过程。
链接: https://arxiv.org/abs/2506.07829
作者: Jan Corazza,Hadi Partovi Aria,Hyohun Kim,Daniel Neider,Zhe Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) algorithms can find an optimal policy for a single agent to accomplish a particular task. However, many real-world problems require multiple agents to collaborate in order to achieve a common goal. For example, a robot executing a task in a warehouse may require the assistance of a drone to retrieve items from high shelves. In Decentralized Multi-Agent RL (DMARL), agents learn independently and then combine their policies at execution time, but often must satisfy constraints on compatibility of local policies to ensure that they can achieve the global task when combined. In this paper, we study how providing high-level symbolic knowledge to agents can help address unique challenges of this setting, such as privacy constraints, communication limitations, and performance concerns. In particular, we extend the formal tools used to check the compatibility of local policies with the team task, making decentralized training with theoretical guarantees usable in more scenarios. Furthermore, we empirically demonstrate that symbolic knowledge about the temporal evolution of events in the environment can significantly expedite the learning process in DMARL.
zh
[AI-11] Addition in Four Movements: Mapping Layer-wise Information Trajectories in LLMs EMNLP2025
【速读】:该论文旨在探究大型语言模型LLaMA-3-8B-Instruct在多数字加法任务中的内部计算过程。其解决方案的关键在于结合线性探测(linear probing)与逻辑透镜(logit-lens)分析,揭示了模型在前向传播中遵循的四阶段轨迹:从公式结构表示的线性可解性到答案标记的逐步清晰化,最终实现对输出中各个数字的精准检测与解码,表明模型倾向于内部计算而非单纯记忆。
链接: https://arxiv.org/abs/2506.07824
作者: Yao Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, including appendix, 7 figures. EMNLP 2025 submission (ARR May 2025 cycle, reviews pending)
Abstract:Multi-digit addition is a clear probe of the computational power of large language models. To dissect the internal arithmetic processes in LLaMA-3-8B-Instruct, we combine linear probing with logit-lens inspection. Inspired by the step-by-step manner in which humans perform addition, we propose and analyze a coherent four-stage trajectory in the forward pass: formula-structure representations become linearly decodable first, while the answer token is still far down the candidate list. Intermediate computational features then emerge. In deeper activation layers, numerical abstractions of the result become clearer, enabling near-perfect detection and decoding of the individual digits in the answer. At the output, the model organizes and generates the final content, with the correct token reliably occupying the top position. This trajectory suggests a hierarchical process that favors internal computation over rote memorization. We release our code and data to facilitate reproducibility.
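下面给出 logit-lens 的最小可运行示意:把每一层的隐藏状态经最终 LayerNorm 与解嵌入矩阵映射回词表,观察答案 token 在哪一层开始占据首位。为便于运行,这里用 GPT-2 代替论文中的 LLaMA-3-8B-Instruct,提示词仅作示例。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("23 + 58 = ", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):
    last = h[0, -1]                                       # 末位置的隐藏状态
    logits = model.lm_head(model.transformer.ln_f(last))  # logit-lens 投影
    print(f"layer {layer:2d}: top token = {tok.decode(logits.argmax().item())!r}")
```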
zh
[AI-12] Accelerating Diffusion Models in Offline RL via Reward-Aware Consistency Trajectory Distillation
【速读】:该论文旨在解决扩散模型在决策任务中推理速度慢的问题,同时克服一致性模型在应用中因次优示范或复杂多网络并行训练所带来的挑战。其解决方案的关键在于提出一种新颖的一致性蒸馏方法,该方法将奖励优化直接整合到蒸馏过程中,从而实现单步生成,在保持较高性能的同时简化了训练流程。
链接: https://arxiv.org/abs/2506.07822
作者: Xintong Duan,Yutong He,Fahim Tajwar,Ruslan Salakhutdinov,J. Zico Kolter,Jeff Schneider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although diffusion models have achieved strong results in decision-making tasks, their slow inference speed remains a key limitation. While the consistency model offers a potential solution, its applications to decision-making often struggle with suboptimal demonstrations or rely on complex concurrent training of multiple networks. In this work, we propose a novel approach to consistency distillation for offline reinforcement learning that directly incorporates reward optimization into the distillation process. Our method enables single-step generation while maintaining higher performance and simpler training. Empirical evaluations on the Gym MuJoCo benchmarks and long horizon planning demonstrate that our approach can achieve an 8.7% improvement over previous state-of-the-art while offering up to 142x speedup over diffusion counterparts in inference time.
zh
[AI-13] Guideline Forest: Experience-Induced Multi-Guideline Reasoning with Stepwise Aggregation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中难以有效利用多种可复用策略的问题,尤其是在结构化和高效地整合多种推理路径方面存在不足。其解决方案的关键在于提出Guideline Forest框架,该框架通过从验证过的示例中归纳出结构化的推理策略(称为指南),并通过逐步聚合的方式执行这些策略。与传统的测试时搜索或单路径蒸馏方法不同,Guideline Forest通过生成多样化的推理变体,模拟人类推理的灵活性和适应性,从而提升模型在不确定情况下的问题解决能力和泛化性能。
链接: https://arxiv.org/abs/2506.07820
作者: Jiaxiang Chen,Zhuo Wang,Mingxi Zou,Qifan Wang,Zenglin Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Human reasoning is flexible, adaptive, and grounded in prior experience, qualities that large language models (LLMs) still struggle to emulate. Existing methods either explore diverse reasoning paths at inference time or search for optimal workflows through expensive operations, but both fall short in leveraging multiple reusable strategies in a structured, efficient manner. We propose Guideline Forest, a framework that enhances LLM reasoning by inducing structured reasoning strategies, called guidelines, from verified examples and executing them via step-wise aggregation. Unlike test-time search or single-path distillation, our method draws on verified reasoning experiences by inducing reusable guidelines and expanding each into diverse variants. Much like human reasoning, these variants reflect alternative thought patterns, are executed in parallel, refined via self-correction, and aggregated step by step, enabling the model to adaptively resolve uncertainty and synthesize robust solutions. We evaluate Guideline Forest on four benchmarks (GSM8K, MATH-500, MBPP, and HumanEval) spanning mathematical and programmatic reasoning. Guideline Forest consistently outperforms strong baselines, including CoT, ReAct, ToT, FoT, and AFlow. Ablation studies further highlight the effectiveness of multi-path reasoning and stepwise aggregation, underscoring Guideline Forest's adaptability and generalization potential.
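其控制流可概括为:离线归纳指南 → 扩展为多个变体 → 并行执行并自我纠错 → 逐步聚合。下面的 Python 骨架仅为说明这一流程:`llm` 是任意文本生成后端的占位函数,提示词与"多数投票"式聚合均为示例假设,并非论文的精确实现。

```python
from collections import Counter
from typing import Callable

def guideline_forest(question: str,
                     guidelines: list,
                     llm: Callable[[str], str],
                     n_variants: int = 2) -> str:
    answers = []
    for g in guidelines:                    # 可复用的推理策略(离线归纳)
        for i in range(n_variants):         # 每条指南扩展为多个变体
            draft = llm(f"Follow this strategy (variant {i}): {g}\nQ: {question}")
            revised = llm(f"Check and correct the reasoning:\n{draft}")  # 自我纠错
            answers.append(revised)
    return Counter(answers).most_common(1)[0][0]   # 聚合简化为多数投票

if __name__ == "__main__":
    fake_llm = lambda prompt: "42"          # 玩具后端,保证骨架可直接运行
    print(guideline_forest("6 * 7 = ?", ["decompose the problem"], fake_llm))
```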
zh
[AI-14] A Proposal to Extend the Common Model of Cognition with Metacognition
【速读】:该论文试图解决如何在认知架构中整合元认知(metacognition)的问题,具体而言是提出一种统一的方法将元认知嵌入到通用认知模型(Common Model of Cognition, CMC)中。解决方案的关键在于利用CMC已有的认知能力,在工作记忆中对代理的认知能力和过程进行显式表征,并通过对此类表征的推理实现元认知功能,从而在不显著改变原有结构的前提下完成扩展。
链接: https://arxiv.org/abs/2506.07807
作者: John Laird,Christian Lebiere,Paul Rosenbloom,Andrea Stocco,Robert Wray
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The Common Model of Cognition (CMC) provides an abstract characterization of the structure and processing required by a cognitive architecture for human-like minds. We propose a unified approach to integrating metacognition within the CMC. We propose that metacognition involves reasoning over explicit representations of an agent’s cognitive capabilities and processes in working memory. Our proposal exploits the existing cognitive capabilities of the CMC, making minimal extensions in the structure and information available within working memory. We provide examples of metacognition within our proposal.
zh
[AI-15] Enhancing Adversarial Robustness with Conformal Prediction: A Framework for Guaranteed Model Reliability
【速读】:该论文旨在解决深度学习模型在高风险应用中面对对抗攻击时的鲁棒性不足以及可靠性保障缺失的问题,特别是在模型不确定性估计和准确率之外的可靠性能保证方面。其解决方案的关键在于将对抗训练与置信预测(Conformal Prediction)相结合,提出了一种名为OPSA(OPtimal Size Attack)的对抗攻击方法,通过最大化模型不确定性来降低置信预测的效率,同时引入OPSA-AT(Adversarial Training)防御策略,在新的置信训练范式中集成OPSA,从而显著提升模型的鲁棒性和预测可靠性。
链接: https://arxiv.org/abs/2506.07804
作者: Jie Bao,Chuangyin Dang,Rui Luo,Hanwei Zhang,Zhixin Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:As deep learning models are increasingly deployed in high-risk applications, robust defenses against adversarial attacks and reliable performance guarantees become paramount. Moreover, accuracy alone does not provide sufficient assurance or reliable uncertainty estimates for these models. This study advances adversarial training by leveraging principles from Conformal Prediction. Specifically, we develop an adversarial attack method, termed OPSA (OPtimal Size Attack), designed to reduce the efficiency of conformal prediction at any significance level by maximizing model uncertainty without requiring coverage guarantees. Correspondingly, we introduce OPSA-AT (Adversarial Training), a defense strategy that integrates OPSA within a novel conformal training paradigm. Experimental evaluations demonstrate that our OPSA attack method induces greater uncertainty compared to baseline approaches for various defenses. Conversely, our OPSA-AT defensive model significantly enhances robustness not only against OPSA but also other adversarial attacks, and maintains reliable prediction. Our findings highlight the effectiveness of this integrated approach for developing trustworthy and resilient deep learning models for safety-critical domains. Our code is available at this https URL.
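置信预测的"效率"即预测集合的大小,OPSA 正是通过扰动输入来最大化这一集合。下面仅演示分裂式置信预测(split conformal)如何由校准分位数得到预测集,使用合成的 softmax 分数;真正的攻击还需对输入做有针对性的扰动,此处从略。

```python
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_classes, alpha = 500, 10, 0.1

cal_probs = rng.dirichlet(np.ones(n_classes), size=n_cal)
cal_labels = rng.integers(0, n_classes, size=n_cal)

# 非一致性分数:1 - 真实类别的概率
scores = 1.0 - cal_probs[np.arange(n_cal), cal_labels]
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal, method="higher")

def prediction_set(probs: np.ndarray) -> np.ndarray:
    # 分数不超过校准阈值的所有类别构成预测集
    return np.where(1.0 - probs <= q)[0]

test_probs = rng.dirichlet(np.ones(n_classes))
ps = prediction_set(test_probs)
print(f"threshold={q:.3f}, set size={len(ps)}")   # 攻击者的目标就是最大化该集合
```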
zh
[AI-16] REMoH: A Reflective Evolution of Multi-objective Heuristics approach via Large Language Models
【速读】:该论文旨在解决复杂决策任务中多目标优化的问题,传统算法虽然有效,但通常需要大量的问题特定建模,并且难以适应非线性结构。其解决方案的关键在于提出了一种名为Reflective Evolution of Multi-objective Heuristics (REMoH)的框架,该框架将NSGA-II与基于大型语言模型(Large Language Models, LLMs)的启发式生成相结合,核心创新是引入了反射机制,通过聚类和搜索空间反射引导生成多样化、高质量的启发式方法,从而提升收敛性和保持解的多样性。
链接: https://arxiv.org/abs/2506.07759
作者: Diego Forniés-Tabuenca,Alejandro Uribe,Urtzi Otamendi,Arkaitz Artetxe,Juan Carlos Rivera,Oier Lopez de Lacalle
机构: 未知
类目: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 21 pages, 5 tables, 7 figures and 4 appendixes. Pre-print submitted to IEEE Transactions on Evolutionary Computation
Abstract:Multi-objective optimization is fundamental in complex decision-making tasks. Traditional algorithms, while effective, often demand extensive problem-specific modeling and struggle to adapt to nonlinear structures. Recent advances in Large Language Models (LLMs) offer enhanced explainability, adaptability, and reasoning. This work proposes Reflective Evolution of Multi-objective Heuristics (REMoH), a novel framework integrating NSGA-II with LLM-based heuristic generation. A key innovation is a reflection mechanism that uses clustering and search-space reflection to guide the creation of diverse, high-quality heuristics, improving convergence and maintaining solution diversity. The approach is evaluated on the Flexible Job Shop Scheduling Problem (FJSSP), with in-depth benchmarking against state-of-the-art methods using three instance datasets: Dauzere, Barnes, and Brandimarte. Results demonstrate that REMoH achieves competitive results compared to state-of-the-art approaches with reduced modeling effort and enhanced adaptability. These findings underscore the potential of LLMs to augment traditional optimization, offering greater flexibility, interpretability, and robustness in multi-objective scenarios.
zh
[AI-17] Agent Semantics, Semantic Spacetime, and Graphical Reasoning
【速读】:该论文试图解决在语义时空图模型中如何实现可扩展的语义复杂性表示及过程建模的问题,特别是处理图路径中的可预测性与信息泄露之间的矛盾。其解决方案的关键在于定义一个有限的γ(3,4)表示,以形成一个封闭的操作集,从而支持任意程度的语义复杂性,并通过语义时空公设减少对路径的约束,同时揭示吸收态在部分图中的普遍性,这导致了信息泄漏问题。该问题与除以零密切相关,表明闭包性的丧失以及需要手动注入修正信息。
链接: https://arxiv.org/abs/2506.07756
作者: Mark Burgess
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Some formal aspects of the Semantic Spacetime graph model are presented, with reference to its use for directed knowledge representations and process modelling. A finite γ(3,4) representation is defined to form a closed set of operations that can scale to any degree of semantic complexity. The Semantic Spacetime postulates bring predictability with minimal constraints to pathways in graphs. The ubiquitous appearance of absorbing states in any partial graph means that a graph process leaks information. The issue is closely associated with the issue of division by zero, which signals a loss of closure and the need for manual injection of remedial information. The Semantic Spacetime model (and its Promise Theory origins) helps to clarify how such absorbing states are associated with boundary information where intentionality can enter.
zh
[AI-18] Comparing Credit Risk Estimates in the Gen-AI Era
【速读】:该论文试图解决信用评分建模技术的性能比较问题,重点在于评估生成式 AI (Generative AI) 在信用风险评分中的应用潜力。研究的关键在于对比传统方法与基于生成式 AI 的方法,发现当前生成式 AI 模型在性能上仍无法达到传统方法的水平,无论采用何种集成策略,这表明生成式 AI 在该领域仍存在显著局限性,亟需进一步的研究与开发以提升其适用性。
链接: https://arxiv.org/abs/2506.07754
作者: Nicola Lavecchia,Sid Fadanelli,Federico Ricciuti,Gennaro Aloe,Enrico Bagli,Pietro Giuffrida,Daniele Vergari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI technologies have demonstrated significant potential across diverse applications. This study provides a comparative analysis of credit score modeling techniques, contrasting traditional approaches with those leveraging generative AI. Our findings reveal that current generative AI models fall short of matching the performance of traditional methods, regardless of the integration strategy employed. These results highlight the limitations in the current capabilities of generative AI for credit risk scoring, emphasizing the need for further research and development before the possibility of applying generative AI for this specific task, or equivalent ones.
zh
[AI-19] Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning ICML2025
【速读】:该论文旨在解决离线分层强化学习(Offline Hierarchical Reinforcement Learning, HRL)中因任务时间跨度增加而导致的效率下降以及跨轨迹有效状态转移拼接策略不足的问题。其解决方案的关键在于提出一种名为图辅助拼接(Graph-Assisted Stitching, GAS)的新框架,该框架将子目标选择建模为图搜索问题,而非显式学习高层策略。通过将状态嵌入到时间距离表示(Temporal Distance Representation, TDR)空间,GAS将来自不同轨迹的语义相似状态聚类为统一的图节点,从而实现高效的转移拼接,并利用最短路径算法在图中选择子目标序列,同时由低层策略学习达到子目标。此外,引入时间效率(Temporal Efficiency, TE)度量以提升图的质量,显著增强了任务性能。
链接: https://arxiv.org/abs/2506.07744
作者: Seungho Baek,Taegeon Park,Jongchan Park,Seungjun Oh,Yusung Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ICML 2025
Abstract:Existing offline hierarchical reinforcement learning methods rely on high-level policy learning to generate subgoal sequences. However, their efficiency degrades as task horizons increase, and they lack effective strategies for stitching useful state transitions across different trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that formulates subgoal selection as a graph search problem rather than learning an explicit high-level policy. By embedding states into a Temporal Distance Representation (TDR) space, GAS clusters semantically similar states from different trajectories into unified graph nodes, enabling efficient transition stitching. A shortest-path algorithm is then applied to select subgoal sequences within the graph, while a low-level policy learns to reach the subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE) metric, which filters out noisy or inefficient transition states, significantly enhancing task performance. GAS outperforms prior offline HRL methods across locomotion, navigation, and manipulation tasks. Notably, in the most stitching-critical task, it achieves a score of 88.3, dramatically surpassing the previous state-of-the-art score of 1.0. Our source code is available at: this https URL.
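GAS 把子目标选择建模为图搜索:不同轨迹中语义相近的状态聚为同一节点,观测到的转移成为边,最短路即子目标序列。下面用"坐标取整"代替论文中学习到的时间距离表征(TDR)聚类,仅为说明跨轨迹拼接与搜索的流程。

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
# 五条合成的二维随机游走轨迹
trajectories = [np.cumsum(rng.normal(size=(20, 2)) * 0.3, axis=0) for _ in range(5)]

def node_of(state: np.ndarray) -> tuple:
    return tuple(np.round(state, 0))        # 用粗网格取整代替 TDR 聚类

G = nx.DiGraph()
for traj in trajectories:
    for s, s_next in zip(traj[:-1], traj[1:]):
        G.add_edge(node_of(s), node_of(s_next))   # 跨轨迹的转移在此被"拼接"

start = node_of(trajectories[0][0])
goal = node_of(trajectories[-1][-1])
if nx.has_path(G, start, goal):
    subgoals = nx.shortest_path(G, start, goal)   # 子目标序列
    print(f"{len(subgoals)} subgoals:", subgoals[:3], "...")
else:
    print("goal not reachable in the stitched graph")
```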
zh
[AI-20] RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全对齐过程中仍存在漏洞的问题,特别是针对政策违规内容的检测与防护不足。现有方法依赖于大量人工标注的数据集,难以应对分布外威胁,如新兴的有害类别或越狱攻击。论文提出的解决方案关键在于RSafe,其核心是通过基于推理的自适应保护机制,实现对输入内容的安全风险进行分步策略引导推理,并利用基于规则的强化学习优化推理路径,从而在未见过或对抗性安全违规场景中保持稳健的防护能力。
链接: https://arxiv.org/abs/2506.07736
作者: Jingnan Zheng,Xiangtian Ji,Yijun Lu,Chenhang Cui,Weixiang Zhao,Gelei Deng,Zhenkai Liang,An Zhang,Tat-Seng Chua
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models-designed to monitor LLM inputs and outputs and block potentially harmful content-has emerged as a prevalent mitigation strategy. Existing approaches of training guard models rely heavily on extensive human curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates in two stages: 1) guided reasoning, where it analyzes safety risks of input content through policy-guided step-by-step reasoning, and 2) reinforced alignment, where rule-based RL optimizes its reasoning paths to align with accurate safety prediction. This two-stage training paradigm enables RSafe to internalize safety principles to generalize safety protection capability over unseen or adversarial safety violation scenarios. During inference, RSafe accepts user-specified safety policies to provide enhanced safeguards tailored to specific safety requirements.
zh
[AI-21] NeurIPS 2025 E2LM Competition: Early Training Evaluation of Language Models
【速读】:该论文试图解决现有基准在评估小型语言模型早期训练阶段性能时缺乏有效或区分性信号的问题(benchmark)。其解决方案的关键在于设计专门针对语言模型早期训练进展的科学知识评估任务,并鼓励参与者开发新的评估方法或调整现有基准以更准确地捕捉模型间的性能差异。为此,研究者提供了三个预训练的小型模型及其训练过程中的中间检查点,并支持在免费云平台上的实验与开发,以降低参与门槛。
链接: https://arxiv.org/abs/2506.07731
作者: Mouadh Yagoubi,Yasser Dahou,Billel Mokeddem,Younes Belkada,Phuc H. Le-Khac,Basma El Amel Boussaha,Reda Alami,Jingwei Zuo,Damiano Marsili,Mugariya Farooq,Mounia Lalmas,Georgia Gkioxari,Patrick Gallinari,Philip Torr,Hakim Hacid
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored for measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free cloud-based GPU platforms, making participation accessible to researchers with limited computational resources. Submissions will be evaluated based on three criteria: the quality of the performance signal they produce, the consistency of model rankings at 1 trillion tokens of training, and their relevance to the scientific knowledge domain. By promoting the design of tailored evaluation strategies for early training, this competition aims to attract a broad range of participants from various disciplines, including those who may not be machine learning experts or have access to dedicated GPU resources. Ultimately, this initiative seeks to make foundational LLM research more systematic and benchmark-informed from the earliest phases of model development.
zh
[AI-22] MCPWorld: A Unified Benchmarking Testbed for API GUI and Hybrid Computer Use Agents
【速读】:该论文旨在解决现有计算机使用代理(CUA)基准测试主要针对图形用户界面(GUI)代理的问题,这些评估方法易受UI变化影响,并忽略了应用API所暴露的功能交互,例如Model Context Protocol (MCP)。其解决方案的关键在于提出MCPWorld,这是一个面向API、GUI以及API-GUI混合代理的首个自动化CUA测试平台,通过使用“白盒应用”实现对任务完成度的程序化验证,从而提供与具体代理实现或UI状态解耦的鲁棒、准确的评估方式。
链接: https://arxiv.org/abs/2506.07672
作者: Yunhe Yan,Shihe Wang,Jiajun Du,Yexuan Yang,Yuxuan Shan,Qichen Qiu,Xianqing Jia,Xinge Wang,Xin Yuan,Xu Han,Mao Qin,Yinxiao Chen,Chen Peng,Shangguang Wang,Mengwei Xu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:(M)LLM-powered computer use agents (CUA) are emerging as a transformative technique to automate human-computer interaction. However, existing CUA benchmarks predominantly target GUI agents, whose evaluation methods are susceptible to UI changes and ignore function interactions exposed by application APIs, e.g., Model Context Protocol (MCP). To this end, we propose MCPWorld, the first automatic CUA testbed for API, GUI, and API-GUI hybrid agents. A key principle of MCPWorld is the use of "white-box apps", i.e., those with source code availability and can be revised/re-compiled as needed (e.g., adding MCP support), with two notable advantages: (1) It greatly broadens the design space of CUA, such as what and how the app features to be exposed/extracted as CUA-callable APIs. (2) It allows MCPWorld to programmatically verify task completion by directly monitoring application behavior through techniques like dynamic code instrumentation, offering robust, accurate CUA evaluation decoupled from specific agent implementations or UI states. Currently, MCPWorld includes 201 well curated and annotated user tasks, covering diversified use cases and difficulty levels. MCPWorld is also fully containerized with GPU acceleration support for flexible adoption on different OS/hardware environments. Our preliminary experiments, using a representative LLM-powered CUA framework, achieve 75.12% task completion accuracy, simultaneously providing initial evidence on the practical effectiveness of agent automation leveraging MCP. Overall, we anticipate MCPWorld to facilitate and standardize the benchmarking of next-generation computer use agents that can leverage rich external tools. Our code and dataset are publicly available at this https URL.
zh
[AI-23] SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling ACL’25
【速读】:该论文旨在解决软件工程(Software Engineering, SWE)中构建有效代理模型所面临的挑战,特别是由于高质量训练数据和有效测试用例的缺乏导致的问题。其解决方案的关键在于构建一个强大的流水线以合成用于补丁评估的测试用例,并通过扩展代理轨迹来构建训练数据,从而提升SWE代理的性能。实验结果表明,基于开源大语言模型(Large Language Models, LLMs)构建的SWE-Dev模型在SWE-bench-Verified基准上取得了顶级性能。
链接: https://arxiv.org/abs/2506.07636
作者: Haoran Wang,Zhenyu Hou,Yao Wei,Jie Tang,Yuxiao Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL’25
Abstract:Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at this https URL.
zh
[AI-24] Automating Exploratory Multiomics Research via Language Models
【速读】:该论文试图解决在临床蛋白基因组学中,如何从原始数据文件中自动生成数据驱动的假设问题,以促进新发现的产生。解决方案的关键在于PROTEUS系统,它通过分离模块模拟科学过程的不同阶段,从开放式数据探索到特定统计分析和假设提出,并利用统一的图结构来管理复杂的研究流程,从而实现可靠且新颖的假设生成。
链接: https://arxiv.org/abs/2506.07591
作者: Shang Qu,Ning Ding,Linhai Xie,Yifei Li,Zaoqu Liu,Kaiyan Zhang,Yibai Xiong,Yuxin Zuo,Zhangren Chen,Ermo Hua,Xingtai Lv,Youbang Sun,Yang Li,Dong Li,Fuchu He,Bowen Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:This paper introduces PROTEUS, a fully automated system that produces data-driven hypotheses from raw data files. We apply PROTEUS to clinical proteogenomics, a field where effective downstream data analysis and hypothesis proposal is crucial for producing novel discoveries. PROTEUS uses separate modules to simulate different stages of the scientific process, from open-ended data exploration to specific statistical analysis and hypothesis proposal. It formulates research directions, tools, and results in terms of relationships between biological entities, using unified graph structures to manage complex research processes. We applied PROTEUS to 10 clinical multiomics datasets from published research, arriving at 360 total hypotheses. Results were evaluated through external data validation and automatic open-ended scoring. Through exploratory and iterative research, the system can navigate high-throughput and heterogeneous multiomics data to arrive at hypotheses that balance reliability and novelty. In addition to accelerating multiomic analysis, PROTEUS represents a path towards tailoring general autonomous systems to specialized scientific domains to achieve open-ended hypothesis generation from data.
zh
[AI-25] PrunePEFT: Iterative Hybrid Pruning for Parameter-Efficient Fine-tuning of LLM s
【速读】:该论文试图解决参数高效微调(Parameter Efficient Fine-Tuning, PEFT)方法在配置选择上的挑战,即如何在庞大的设计空间中找到最优的PEFT模块类型及其插入层,以避免因配置不当导致的次优结果。传统解决方案如架构搜索技术虽然有效,但存在较大的额外开销。论文提出的解决方案关键在于将PEFT策略搜索转化为一个剪枝问题,并引入一种混合剪枝策略,利用不同PEFT模块对剪枝方法的敏感性,通过迭代移除冗余或冲突的PEFT模块来优化微调配置,从而显著降低计算负担,提升方法的可扩展性和效率。
链接: https://arxiv.org/abs/2506.07587
作者: Tongzhou Yu,Zhuhao Zhang,Guanghui Zhu,Shen Jiang,Meikang Qiu,Yihua Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Parameter Efficient Fine-Tuning (PEFT) methods have emerged as effective and promising approaches for fine-tuning pre-trained language models. Compared with Full parameter Fine-Tuning (FFT), PEFT achieved comparable task performance with a substantial reduction of trainable parameters, which largely saved the training and storage costs. However, using the PEFT method requires considering a vast design space, such as the type of PEFT modules and their insertion layers. Inadequate configurations can lead to sub-optimal results. Conventional solutions such as architectural search techniques, while effective, tend to introduce substantial additional overhead. In this paper, we propose a novel approach, PrunePEFT, which formulates the PEFT strategy search as a pruning problem and introduces a hybrid pruning strategy that capitalizes on the sensitivity of pruning methods to different PEFT modules. This method extends traditional pruning techniques by iteratively removing redundant or conflicting PEFT modules, thereby optimizing the fine-tuned configuration. By efficiently identifying the most relevant modules, our approach significantly reduces the computational burden typically associated with architectural search processes, making it a more scalable and efficient solution for fine-tuning large pre-trained models.
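PrunePEFT 把"在哪些层插入哪些 PEFT 模块"转化为剪枝问题:先在所有层挂上适配器,再迭代移除影响最小的模块。以下 PyTorch 玩具示例用"逐一屏蔽后观察验证损失"作为敏感度分数,代替论文中的混合剪枝准则;模型与数据均为合成。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """带开关的 LoRA 适配层(示意)。"""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.A = nn.Linear(dim, rank, bias=False)
        self.B = nn.Linear(rank, dim, bias=False)
        self.enabled = True

    def forward(self, x):
        return self.base(x) + self.B(self.A(x)) if self.enabled else self.base(x)

torch.manual_seed(0)
layers = nn.ModuleList([LoRALinear(16) for _ in range(6)])
x, target = torch.randn(32, 16), torch.randn(32, 16)

def val_loss() -> float:
    h = x
    for layer in layers:
        h = torch.relu(layer(h))
    return nn.functional.mse_loss(h, target).item()

for _ in range(3):                          # 迭代剪掉 6 个适配器中的 3 个
    scores = []
    for i, layer in enumerate(layers):      # 逐一屏蔽,度量损失变化
        layer.enabled = False
        scores.append((val_loss(), i))
        layer.enabled = True
    loss, idx = min(scores)                 # 移除"最不重要"的模块
    layers[idx].enabled = False
    print(f"pruned adapter {idx}, val loss {loss:.4f}")
```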
zh
[AI-26] FedCGD: Collective Gradient Divergence Optimized Scheduling for Wireless Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在无线网络应用中面临的两个关键问题:设备的数据异构性和带宽限制。现有研究通常将数据异构性视为单个设备的属性,而本文证明了联邦学习的收敛速度受设备级和样本级集体梯度发散(Collective Gradient Divergence, CGD)之和的影响。其中,设备级CGD指的是所选设备组的梯度发散,而非单个设备发散的总和;样本级CGD则通过采样方差进行统计上界约束,且与参与本地更新的总样本数成反比。为获得设备级CGD的可处理形式,本文将分类问题转化为群体分布与全局分布之间的加权地球移动距离(Weighted Earth Mover’s Distance, WEMD),并提出FedCGD算法,在多项式时间内最小化多层级CGD的总和,通过平衡WEMD与采样方差实现性能优化。
链接: https://arxiv.org/abs/2506.07581
作者: Tan Chen,Jintao Yan,Yuxuan Sun,Sheng Zhou,Zhisheng Niu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Federated learning (FL) is a promising paradigm for multiple devices to cooperatively train a model. When applied in wireless networks, two issues consistently affect the performance of FL, i.e., data heterogeneity of devices and limited bandwidth. Many papers have investigated device scheduling strategies considering the two issues. However, most of them recognize data heterogeneity as a property of individual devices. In this paper, we prove that the convergence speed of FL is affected by the sum of device-level and sample-level collective gradient divergence (CGD). The device-level CGD refers to the gradient divergence of the scheduled device group, instead of the sum of the individual device divergence. The sample-level CGD is statistically upper bounded by sampling variance, which is inversely proportional to the total number of samples scheduled for local update. To derive a tractable form of the device-level CGD, we further consider a classification problem and transform it into the weighted earth moving distance (WEMD) between the group distribution and the global distribution. Then we propose FedCGD algorithm to minimize the sum of multi-level CGDs by balancing WEMD and sampling variance, within polynomial time. Simulation shows that the proposed strategy increases classification accuracy on the CIFAR-10 dataset by up to 4.2% while scheduling 41.8% fewer devices, and flexibly switches between reducing WEMD and reducing sampling variance.
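论文把设备级 CGD 转化为调度组分布与全局分布之间的加权推土机距离(WEMD),再与采样方差(反比于总样本数)加权求和作为调度目标。下面的示意在合成的类别直方图上用 scipy 的 wasserstein_distance 近似该目标并穷举选组;权重 lam 与穷举搜索均为示例性简化,并非论文的多项式时间算法。

```python
import numpy as np
from itertools import combinations
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
n_devices, n_classes, lam = 6, 10, 0.5
device_hists = rng.dirichlet(np.ones(n_classes) * 0.3, size=n_devices)  # 各设备类别分布
device_sizes = rng.integers(50, 500, size=n_devices)                    # 各设备样本数
global_hist = np.average(device_hists, axis=0, weights=device_sizes)
classes = np.arange(n_classes)

def objective(group: tuple) -> float:
    sizes = device_sizes[list(group)]
    pooled = np.average(device_hists[list(group)], axis=0, weights=sizes)
    emd = wasserstein_distance(classes, classes, pooled, global_hist)  # 设备级 CGD 代理
    variance_proxy = 1.0 / sizes.sum()                                 # 样本级 CGD 上界代理
    return emd + lam * variance_proxy

best = min(combinations(range(n_devices), 3), key=objective)
print("scheduled devices:", best, "objective:", round(objective(best), 4))
```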
zh
[AI-27] Denoising the Future: Top-p Distributions for Moving Through Time
【速读】:该论文旨在解决动态概率模型中的推理问题,特别是针对隐马尔可夫模型(Hidden Markov Model, HMM)在时间推进过程中需要枚举整个状态空间所带来的计算效率低下和噪声增加的问题。解决方案的关键在于通过仅使用累积概率为p的前p个最可能的状态(即top-p状态)来对未来的不确定性进行去噪,从而加速推理过程。该方法理论上保证了由仅使用top-p状态引入的误差受p和模型的最小混合率(minimal mixing rate)的限制。
链接: https://arxiv.org/abs/2506.07578
作者: Florian Andreas Marwitz,Ralf Möller,Magnus Bender,Marcel Gehrke
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p states, i.e., the most probable states with accumulated probability p. We show that the error introduced by using only the top-p states is bound by p and the so-called minimal mixing rate of the underlying model. Moreover, in our empirical evaluation, we show that we can expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09.
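下面是带 top-p 截断的 HMM 前向一步更新的最小实现:推进时间之前,只保留累计概率达到 p 的最可能状态,其余概率质量置零并重新归一化。模型矩阵为随机生成,仅作演示;论文证明该截断误差由 p 与模型的最小混合率共同约束。

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, p = 50, 0.9
T = rng.dirichlet(np.ones(n_states), size=n_states)   # 转移矩阵
E = rng.dirichlet(np.ones(4), size=n_states)          # 发射矩阵(4 个观测符号)

def step_top_p(belief: np.ndarray, obs: int) -> np.ndarray:
    order = np.argsort(belief)[::-1]
    keep = order[np.cumsum(belief[order]) <= p]
    keep = order[: len(keep) + 1]            # 含首个越过 p 的状态
    truncated = np.zeros_like(belief)
    truncated[keep] = belief[keep]
    truncated /= truncated.sum()             # 重新归一化
    new = (truncated @ T) * E[:, obs]        # 只推进保留下来的概率质量
    return new / new.sum()

belief = np.full(n_states, 1.0 / n_states)
for obs in rng.integers(0, 4, size=10):
    belief = step_top_p(belief, obs)

top = np.sort(belief)[::-1]
print("states needed for 90% mass:", int(np.searchsorted(np.cumsum(top), p)) + 1)
```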
zh
[AI-28] MoE-MLoRA for Multi-Domain CTR Prediction: Efficient Adaptation with Expert Specialization
【速读】:该论文旨在解决个性化推荐系统在跨领域用户交互中的适应性问题,传统方法如MLoRA在每个领域仅应用单一的适配策略,缺乏处理多样化用户行为的灵活性。其解决方案的关键在于提出MoE-MLoRA,这是一种基于专家混合(Mixture-of-Experts)的框架,其中每个专家首先独立训练以在其领域内专业化,随后通过门控网络动态加权各专家的贡献,从而实现更灵活和高效的多领域推荐。
链接: https://arxiv.org/abs/2506.07563
作者: Ken Yagel,Eyal German,Aviel Ben Siman Tov
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized recommendation systems must adapt to user interactions across different domains. Traditional approaches like MLoRA apply a single adaptation per domain but lack flexibility in handling diverse user behaviors. To address this, we propose MoE-MLoRA, a mixture-of-experts framework where each expert is first trained independently to specialize in its domain before a gating network is trained to weight their contributions dynamically. We evaluate MoE-MLoRA across eight CTR models on Movielens and Taobao, showing that it improves performance in large-scale, dynamic datasets (+1.45 Weighted-AUC in Taobao-20) but offers limited benefits in structured datasets with low domain diversity and sparsity. Further analysis of the number of experts per domain reveals that larger ensembles do not always improve performance, indicating the need for model-aware tuning. Our findings highlight the potential of expert-based architectures for multi-domain recommendation systems, demonstrating that task-aware specialization and adaptive gating can enhance predictive accuracy in complex environments. The implementation and code are available in our GitHub repository.
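MoE-MLoRA 的两阶段流程可用一个极简的 PyTorch 头部示意:第一阶段各领域专家独立训练(此处直接冻结代替),第二阶段只训练门控网络对专家输出做动态加权。维度与 CTR 输出头均为示例假设。

```python
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    def __init__(self, dim: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_experts)])
        for e in self.experts:                  # 阶段一已完成:冻结各领域专家
            e.requires_grad_(False)
        self.gate = nn.Linear(dim, n_experts)   # 阶段二:仅训练门控

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)         # (batch, n_experts)
        outs = torch.cat([e(x) for e in self.experts], dim=-1)
        return torch.sigmoid((weights * outs).sum(-1))        # CTR 概率

model = MoEHead(dim=32, n_experts=4)
print(model(torch.randn(8, 32)).shape)          # torch.Size([8])
```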
zh
[AI-29] GTR-CoT: Graph Traversal as Visual Chain of Thought for Molecular Structure Recognition
【速读】:该论文旨在解决光学化学结构识别(Optical Chemical Structure Recognition, OCSR)中的挑战,即如何准确地将分子图像转换为机器可读的格式,特别是在处理复杂分子结构和不一致标注时存在的困难。其解决方案的关键在于提出GTR-Mol-VLM框架,该框架包含两项核心创新:一是基于图遍历的视觉思维链机制,通过逐步预测原子-键序列来模拟人类推理过程;二是数据驱动的“忠实识别所见”原则,以解决图像中简写结构与扩展注释之间的不匹配问题。
链接: https://arxiv.org/abs/2506.07553
作者: Jingchao Wang,Haote Yang,Jiang Wu,Yifan He,Xingjian Wei,Yinfan Wang,Chengjin Liu,Lingli Ge,Lijun Wu,Bin Wang,Dahua Lin,Conghui He
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Optical Chemical Structure Recognition (OCSR) is crucial for digitizing chemical knowledge by converting molecular images into machine-readable formats. While recent vision-language models (VLMs) have shown potential in this task, their image-captioning approach often struggles with complex molecular structures and inconsistent annotations. To overcome these challenges, we introduce GTR-Mol-VLM, a novel framework featuring two key innovations: (1) the Graph Traversal as Visual Chain of Thought mechanism that emulates human reasoning by incrementally parsing molecular graphs through sequential atom-bond predictions, and (2) the data-centric principle of Faithfully Recognize What You've Seen, which addresses the mismatch between abbreviated structures in images and their expanded annotations. To support model development, we constructed GTR-CoT-1.3M, a large-scale instruction-tuning dataset with meticulously corrected annotations, and introduced MolRec-Bench, the first benchmark designed for a fine-grained evaluation of graph-parsing accuracy in OCSR. Comprehensive experiments demonstrate that GTR-Mol-VLM achieves superior results compared to specialist models, chemistry-domain VLMs, and commercial general-purpose VLMs. Notably, in scenarios involving molecular images with functional group abbreviations, GTR-Mol-VLM outperforms the second-best baseline by approximately 14 percentage points, both in SMILES-based and graph-based metrics. We hope that this work will drive OCSR technology to more effectively meet real-world needs, thereby advancing the fields of cheminformatics and AI for Science. We will release GTR-CoT at this https URL.
zh
[AI-30] Curriculum Learning With Counterfactual Group Relative Policy Advantage For Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决多智能体强化学习(MARL)在动态环境中的适应性不足与策略次优问题,这些问题通常源于现有方法依赖于固定对手策略而导致的静态难度条件。其解决方案的关键在于提出一种动态课程学习(dynamic CL)框架,该框架通过自适应难度调整机制,根据智能体实时训练表现动态调节对手强度,使智能体能够从简单到复杂逐步学习。此外,为应对动态课程学习带来的不稳定性和稀疏全局奖励问题,论文引入了反事实群体相对策略优势(CGRPA),通过构建反事实优势函数来分离群体行为中的个体贡献,从而提供内在信用信号以增强信用分配并稳定学习过程。
链接: https://arxiv.org/abs/2506.07548
作者: Weiqiang Jin,Hongyang Du,Guizhong Liu,Dong In Kim
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 16 pages; 12 figures
Abstract:Multi-agent reinforcement learning (MARL) has achieved strong performance in cooperative adversarial tasks. However, most existing methods typically train agents against fixed opponent strategies and rely on such static difficulty conditions, which limits their adaptability to changing environments and often leads to suboptimal policies. Inspired by the success of curriculum learning (CL) in supervised tasks, we propose a dynamic CL framework for MARL that employs a self-adaptive difficulty adjustment mechanism. This mechanism continuously modulates opponent strength based on real-time agent training performance, allowing agents to progressively learn from easier to more challenging scenarios. However, the dynamic nature of CL introduces instability due to nonstationary environments and sparse global rewards. To address this challenge, we develop a Counterfactual Group Relative Policy Advantage (CGRPA), which is tightly coupled with the curriculum by providing intrinsic credit signals that reflect each agent's impact under evolving task demands. CGRPA constructs a counterfactual action advantage function that isolates individual contributions within group behavior, providing intrinsic rewards that enhance credit assignment and stabilize learning under non-stationary conditions, thereby facilitating more reliable policy updates throughout the curriculum. Extensive experiments demonstrate that our method improves both training stability and final performance, achieving competitive results against state-of-the-art methods. The code is available at this https URL.
zh
[AI-31] Coordinating Search-Informed Reasoning and Reasoning-Guided Search in Claim Verification
【速读】:该论文试图解决多跳声明验证(multi-hop claim verification)问题,该任务需要通过多步骤推理构建验证链,并迭代搜索以发现隐藏的桥梁事实。其核心挑战在于推理与搜索过程的高度交织性,即有效的推理依赖于动态获取的证据,而有效的搜索则需要基于部分信息进行推理以优化查询。解决方案的关键在于提出分层代理推理与信息搜索框架(HARIS),该框架显式建模了由推理驱动的搜索和由搜索启发的推理的协同过程,通过高层推理代理和低层搜索代理的分工协作,提升验证的准确性和可解释性。
链接: https://arxiv.org/abs/2506.07528
作者: Qisheng Hu,Quanyu Long,Wenya Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 9 figures
Abstract:Multi-hop claim verification is inherently challenging, requiring multi-step reasoning to construct verification chains while iteratively searching for information to uncover hidden bridging facts. This process is fundamentally interleaved, as effective reasoning relies on dynamically retrieved evidence, while effective search demands reasoning to refine queries based on partial information. To achieve this, we propose Hierarchical Agent Reasoning and Information Search (HARIS), explicitly modeling the coordinated process of reasoning-driven searching and search-informed reasoning. HARIS consists of a high-level reasoning agent that focuses on constructing the main verification chain, generating factual questions when more information is needed, and a low-level search agent that iteratively retrieves more information, refining its search based on intermediate findings. This design allows each agent to specialize in its respective task, enhancing verification accuracy and interpretability. HARIS is trained using reinforcement learning with outcome-based rewards. Experimental results on the EX-FEVER and HOVER benchmarks demonstrate that HARIS achieves strong performance, greatly advancing multi-hop claim verification.
zh
[AI-32] Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
【速读】:该论文试图解决当前强化学习(Reinforcement Learning, RL)在大型语言模型(Large Language Model, LLM)推理中无法超越基础模型限制的问题,即RL主要基于模型已有知识进行优化,难以促进新信息的获取。解决方案的关键在于结合监督微调(Supervised Fine-Tuning, SFT)与RL,利用高质量演示数据补充RL的不足,从而引入新的知识和推理模式。通过提出一种新的训练方法——ReLIFT(Reinforcement Learning Interleaved with Online Fine-Tuning),在RL训练过程中动态引入SFT,以提升模型的推理能力。
链接: https://arxiv.org/abs/2506.07527
作者: Lu Ma,Hao Liang,Meiyi Qiang,Lexiang Tang,Xiaochen Ma,Zhen Hao Wong,Junbo Niu,Chengyu Shen,Runming He,Bin Cui,Wentao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 5 figures
Abstract:Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscores the significant potential.
zh
[AI-33] IntenTest: Stress Testing for Intent Integrity in API-Calling LLM Agents
【速读】:该论文试图解决大型语言模型代理(LLM agents)在执行基于自然语言指令的API调用任务时,因用户意图理解偏差而导致的行为偏离问题,尤其是在外部工具包不断演变的情况下。解决方案的关键在于提出IntenTest框架,该框架通过基于工具包文档生成真实任务并应用针对性变异,系统性地揭示意图完整性违规。其核心创新包括语义分区方法,用于根据API参数及其等价类对自然语言任务进行分类,以及轻量级预测器对种子任务进行变异和排序,同时结合数据类型感知的策略记忆以提高测试效率。
链接: https://arxiv.org/abs/2506.07524
作者: Shiwei Feng,Xiangzhe Xu,Xuan Chen,Kaiyuan Zhang,Syed Yusuf Ahmed,Zian Su,Mingwei Zheng,Xiangyu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to the agent’s actions that diverge from the user’s intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce IntenTest, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, IntenTest generates realistic tasks based on toolkits’ documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, IntenTest maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that IntenTest effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, IntenTest generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.
zh
[AI-34] LeVo: High-Quality Song Generation with Multi-Preference Alignment
【速读】:该论文旨在解决音乐生成中歌曲复杂结构建模不足以及高质量数据稀缺导致的音质、音乐性、指令遵循和人声与伴奏和谐度受限等问题。其解决方案的关键在于提出一种基于语言模型的框架LeVo,包含LeLM和音乐编解码器,其中LeLM通过并行建模混合令牌(用于实现人声与伴奏的和谐)和双轨令牌(分别编码人声与伴奏以提升生成质量),并采用两个仅解码器的Transformer和模块化扩展训练策略以避免不同令牌类型间的干扰。此外,引入基于直接偏好优化(DPO)的多偏好对齐方法,进一步提升音乐性和指令遵循能力。
链接: https://arxiv.org/abs/2506.07520
作者: Shun Lei,Yaoxun Xu,Zhiwei Lin,Huaicheng Zhang,Wei Tan,Hangting Chen,Jianwei Yu,Yixuan Zhang,Chenyu Yang,Haina Zhu,Shuai Wang,Zhiyong Wu,Dong Yu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in sound quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, an LM-based framework consisting of LeLM and a music codec. LeLM is capable of parallelly modeling two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction following, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and DPO post-training. Experimental results demonstrate that LeVo consistently outperforms existing methods on both objective and subjective metrics. Ablation studies further justify the effectiveness of our designs. Audio examples are available at this https URL.
zh
[AI-35] Reinforcement Learning via Implicit Imitation Guidance
【速读】:该论文试图解决样本高效强化学习(sample efficient reinforcement learning)问题,即在缺乏密集奖励信号的情况下,利用先验数据(如示范数据)进行初始化。传统方法通常通过模仿学习目标来整合先验数据,但这可能会损害长期性能,因为其与奖励最大化的目标不直接对齐。该论文的解决方案之关键在于提出一种名为数据引导噪声(Data-Guided Noise, DGN)的框架,该框架仅通过向策略中添加噪声来引导探索,而非显式地进行行为克隆约束,从而更有效地利用先验数据指导动作探索。
链接: https://arxiv.org/abs/2506.07505
作者: Perry Dong,Alec M. Lessing,Annie S. Chen,Chelsea Finn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study the problem of sample efficient reinforcement learning, where prior data such as demonstrations are provided for initialization in lieu of a dense reward signal. A natural approach is to incorporate an imitation learning objective, either as regularization during training or to acquire a reference policy. However, imitation learning objectives can ultimately degrade long-term performance, as it does not directly align with reward maximization. In this work, we propose to use prior data solely for guiding exploration via noise added to the policy, sidestepping the need for explicit behavior cloning constraints. The key insight in our framework, Data-Guided Noise (DGN), is that demonstrations are most useful for identifying which actions should be explored, rather than forcing the policy to take certain actions. Our approach achieves up to 2-3x improvement over prior reinforcement learning from offline data methods across seven simulated continuous control tasks.
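DGN 的核心直觉是:示范数据用于指示"该往哪里探索",而非强迫策略克隆示范动作。下面以一维连续动作演示这一思路:用与当前状态相近的示范动作的核加权统计来塑形探索噪声。核带宽与混合系数均为示例选择,并非论文的具体形式。

```python
import numpy as np

rng = np.random.default_rng(0)
demo_states = rng.uniform(-1, 1, size=100)
demo_actions = np.sin(2 * demo_states) + rng.normal(0, 0.05, size=100)  # 合成示范

def explore(state: float, policy_action: float, bandwidth: float = 0.2) -> float:
    # 与当前状态越近的示范权重越大
    w = np.exp(-((demo_states - state) ** 2) / (2 * bandwidth ** 2))
    demo_mean = np.average(demo_actions, weights=w)
    demo_std = np.sqrt(np.average((demo_actions - demo_mean) ** 2, weights=w))
    # 探索噪声被拉向"示范者在相似状态下的做法"
    noise = rng.normal(demo_mean - policy_action, demo_std)
    return policy_action + 0.5 * noise

print(explore(state=0.3, policy_action=0.0))
```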
zh
[AI-36] Premise Selection for a Lean Hammer
【速读】:该论文试图解决在Lean证明助手中缺乏高效自动定理证明工具(即Hammer)的问题,尽管Lean在学术界和工业界日益流行。其核心挑战在于如何将神经网络方法与符号推理相结合,以提升形式化验证的自动化水平。解决方案的关键在于提出LeanHammer,这是一个端到端的通用Hammer系统,其核心是基于依赖类型理论的新型神经前提选择系统。该系统能够动态适应用户特定的上下文,并与符号证明搜索和重建相结合,从而显著提高目标求解率并具备良好的领域泛化能力。
链接: https://arxiv.org/abs/2506.07477
作者: Thomas Zhu,Joshua Clune,Jeremy Avigad,Albert Qiaochu Jiang,Sean Welleck
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: LeanHammer is available at this https URL
Abstract:Neural methods are transforming automated reasoning for proof assistants, yet integrating these advances into practical verification workflows remains challenging. Hammers are tools that interface with external automatic theorem provers to automate tedious reasoning steps. They have dramatically improved productivity in proof assistants, but the Lean proof assistant still does not have a hammer despite its growing popularity. We present LeanHammer, the first end-to-end domain-general hammer for Lean, built on a novel neural premise selection system for a hammer in dependent type theory. Unlike existing Lean premise selectors, our approach dynamically adapts to user-specific contexts and combines with symbolic proof search and reconstruction to create a practical hammer. With comprehensive evaluations, we show that our premise selector enables LeanHammer to solve 21% more goals relative to existing premise selectors, and generalize well to diverse domains. Our work bridges the gap between neural retrieval and symbolic reasoning, making formal verification more accessible to researchers and practitioners.
zh
[AI-37] Language-Grounded Hierarchical Planning and Execution with Multi-Robot 3D Scene Graphs
【速读】:该论文旨在解决多机器人系统在复杂自然语言指令下进行环境建图、定位及任务与运动规划(TAMP)的问题。其关键解决方案是引入一种基于3D场景图(3D scene graph)的共享表示,结合开放集物体地图,实现多机器人3D场景图融合,从而支持实时、视角不变的重定位和规划。此外,通过利用大型语言模型(LLM)从共享3D场景图和机器人能力中提取上下文,将操作员意图转化为规划领域定义语言(PDDL)目标,进一步提升了系统的任务执行能力。
链接: https://arxiv.org/abs/2506.07454
作者: Jared Strader,Aaron Ray,Jacob Arkin,Mason B. Peterson,Yun Chang,Nathan Hughes,Christopher Bradley,Yi Xuan Jia,Carlos Nieto-Granda,Rajat Talak,Chuchu Fan,Luca Carlone,Jonathan P. How,Nicholas Roy
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures
Abstract:In this paper, we introduce a multi-robot system that integrates mapping, localization, and task and motion planning (TAMP) enabled by 3D scene graphs to execute complex instructions expressed in natural language. Our system builds a shared 3D scene graph incorporating an open-set object-based map, which is leveraged for multi-robot 3D scene graph fusion. This representation supports real-time, view-invariant relocalization (via the object-based map) and planning (via the 3D scene graph), allowing a team of robots to reason about their surroundings and execute complex tasks. Additionally, we introduce a planning approach that translates operator intent into Planning Domain Definition Language (PDDL) goals using a Large Language Model (LLM) by leveraging context from the shared 3D scene graph and robot capabilities. We provide an experimental assessment of the performance of our system on real-world tasks in large-scale, outdoor environments.
zh
[AI-38] Efficient Generation of Diverse Cooperative Agents with World Models
【速读】:该论文试图解决在Zero-Shot Coordination (ZSC)代理训练过程中生成具有多样化协作惯例的伙伴代理的问题,当前的Cross-play Minimization (XPM)方法在计算上非常昂贵且样本效率低下,因为训练目标需要采样多种轨迹,且每个伙伴代理都需要从头开始训练。该论文的关键解决方案是引入XPM-WM框架,通过一个学习到的World Model (WM)生成模拟轨迹,从而显著加速XPM方法的训练过程,并有效生成具有多样化惯例的伙伴代理,同时保持与之前方法相当的性能。
链接: https://arxiv.org/abs/2506.07450
作者: Yi Loo,Akshunn Trivedi,Malika Meghjani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A major bottleneck in the training process for Zero-Shot Coordination (ZSC) agents is the generation of partner agents that are diverse in collaborative conventions. Current Cross-play Minimization (XPM) methods for population generation can be very computationally expensive and sample inefficient as the training objective requires sampling multiple types of trajectories. Each partner agent in the population is also trained from scratch, despite all of the partners in the population learning policies of the same coordination task. In this work, we propose that simulated trajectories from the dynamics model of an environment can drastically speed up the training process for XPM methods. We introduce XPM-WM, a framework for generating simulated trajectories for XPM via a learned World Model (WM). We show XPM with simulated trajectories removes the need to sample multiple trajectories. In addition, we show our proposed method can effectively generate partners with diverse conventions that match the performance of previous methods in terms of SP population training reward as well as training partners for ZSC agents. Our method is thus, significantly more sample efficient and scalable to a larger number of partners.
zh
[AI-39] Extending Epistemic Uncertainty Beyond Parameters Would Assist in Designing Reliable LLM s
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在部署过程中如何有效管理和主动应对不确定性的问题。当前的方法主要依赖于通过拒绝高不确定性输出来避免错误信息,但这种保守策略反映了缺乏系统性区分和应对不同不确定性来源的工具。论文提出的解决方案关键在于采用贝叶斯实验建模(Bayesian Modeling of Experiments),该框架为推理不确定性并明确其可减少性提供了连贯的基础,使LLM及其用户能够采取上下文相关的措施,如请求澄清、检索外部信息或优化输入,从而实现主动解决而非被动回避不确定性。
链接: https://arxiv.org/abs/2506.07448
作者: T. Duy Nguyen-Hien,Desi R. Ivanova,Yee Whye Teh,Wee Sun Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Although large language models (LLMs) are highly interactive and extendable, current approaches to ensure reliability in deployments remain mostly limited to rejecting outputs with high uncertainty in order to avoid misinformation. This conservative strategy reflects the current lack of tools to systematically distinguish and respond to different sources of uncertainty. In this paper, we advocate for the adoption of Bayesian Modeling of Experiments – a framework that provides a coherent foundation to reason about uncertainty and clarify the reducibility of uncertainty – for managing and proactively addressing uncertainty that arises in LLM deployments. This framework enables LLMs and their users to take contextually appropriate steps, such as requesting clarification, retrieving external information, or refining inputs. By supporting active resolution rather than passive avoidance, it opens the door to more reliable, transparent, and broadly applicable LLM systems, particularly in high-stakes, real-world settings.
zh
[AI-40] Fact in Frag ments: Deconstructing Complex Claims via LLM -based Atomic Fact Extraction and Verification
【速读】:该论文试图解决复杂声明在事实验证中的准确性问题,尤其是在处理需要多跳推理的碎片化证据时,传统方法因依赖静态分解策略和表层语义检索而难以捕捉声明的结构与意图,导致推理错误累积、证据噪声污染以及对多样化声明适应性不足。解决方案的关键在于提出原子事实提取与验证(Atomic Fact Extraction and Verification, AFEV)框架,该框架通过迭代分解复杂声明为原子事实,实现细粒度检索与自适应推理,动态优化声明理解,减少误差传播,并通过重排序证据和引入上下文特定示例来提升推理过程的准确性和可解释性。
链接: https://arxiv.org/abs/2506.07446
作者: Liwen Zheng,Chaozhuo Li,Zheng Liu,Feiran Huang,Haoran Jia,Zaisheng Ye,Xi Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Fact verification plays a vital role in combating misinformation by assessing the veracity of claims through evidence retrieval and reasoning. However, traditional methods struggle with complex claims requiring multi-hop reasoning over fragmented evidence, as they often rely on static decomposition strategies and surface-level semantic retrieval, which fail to capture the nuanced structure and intent of the claim. This results in accumulated reasoning errors, noisy evidence contamination, and limited adaptability to diverse claims, ultimately undermining verification accuracy in complex scenarios. To address this, we propose Atomic Fact Extraction and Verification (AFEV), a novel framework that iteratively decomposes complex claims into atomic facts, enabling fine-grained retrieval and adaptive reasoning. AFEV dynamically refines claim understanding and reduces error propagation through iterative fact extraction, reranks evidence to filter noise, and leverages context-specific demonstrations to guide the reasoning process. Extensive experiments on five benchmark datasets demonstrate that AFEV achieves state-of-the-art performance in both accuracy and interpretability.
[AI-41] LegalReasoner: Step-wised Verification-Correction for Legal Judgment Reasoning
[Quick Read]: This paper addresses the logical errors that arise in legal judgment prediction (LJP) due to complex legal reasoning. The key to the solution is improving LJP reliability through step-wise verification and correction of the reasoning process: dispute points are identified to decompose complex cases, a process verifier validates the logic of each reasoning step from the correctness, progressiveness, and potential perspectives, and expert-designed attribution and resolution strategies are applied for correction when errors are detected.
Link: https://arxiv.org/abs/2506.07443
Authors: Weijie Shi, Han Zhu, Jiaming Ji, Mengze Li, Jipeng Zhang, Ruiyuan Zhang, Jia Zhu, Jiajie Xu, Sirui Han, Yike Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability through step-wise verification and correction of the reasoning process. Specifically, it first identifies dispute points to decompose complex cases, and then conducts step-wise reasoning while employing a process verifier to validate each step’s logic from correctness, progressiveness, and potential perspectives. When errors are detected, expert-designed attribution and resolution strategies are applied for correction. To fine-tune LegalReasoner, we release the LegalHK dataset, containing 58,130 Hong Kong court cases with detailed annotations of dispute points, step-by-step reasoning chains, and process verification labels. Experiments demonstrate that LegalReasoner significantly improves concordance with court decisions from 72.37 to 80.27 on LLAMA-3.1-70B. The data is available at this https URL.
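The verify-then-correct control flow is simple to sketch. Everything below (`propose_step`, `process_verifier`, `correct_step`, the score names) is a hypothetical stand-in for the paper's LLM reasoner, process verifier, and expert-designed fixes.

```python
# Sketch of a LegalReasoner-style verify-then-correct loop over reasoning steps.
# All three callables are hypothetical stubs for LLM components.

def propose_step(dispute_point: str, history: list[str]) -> str:
    return f"reasoning about '{dispute_point}' given {len(history)} prior steps"

def process_verifier(step: str) -> dict:
    # A real verifier scores each step's correctness, progressiveness, potential.
    return {"correctness": 0.9, "progressiveness": 0.8, "potential": 0.4}

def correct_step(step: str, scores: dict) -> str:
    # A real system applies an attribution/resolution strategy keyed to the error.
    return step + " [corrected]"

def reason_over_case(dispute_points: list[str], threshold: float = 0.5) -> list[str]:
    chain: list[str] = []
    for point in dispute_points:            # decompose the case by dispute point
        step = propose_step(point, chain)
        scores = process_verifier(step)     # validate each step's logic
        if min(scores.values()) < threshold:
            step = correct_step(step, scores)
        chain.append(step)
    return chain

for s in reason_over_case(["liability", "damages"]):
    print(s)
```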
[AI-42] Fast Geometric Embedding for Node Influence Maximization
【速读】:该论文试图解决在大规模图中计算经典中心性度量(如介数和接近度)计算成本过高的问题。其解决方案的关键在于引入一种高效的力导向布局算法,将图嵌入到低维空间中,其中从原点的径向距离作为多种中心性度量的代理。
链接: https://arxiv.org/abs/2506.07435
作者: Alexander Kolpakov,Igor Rivin
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 4 figures, 18 tables; Github repository available ( this https URL;) Package available on PyPi ( this https URL )
Abstract:Computing classical centrality measures such as betweenness and closeness is computationally expensive on large-scale graphs. In this work, we introduce an efficient force layout algorithm that embeds a graph into a low-dimensional space, where the radial distance from the origin serves as a proxy for various centrality measures. We evaluate our method on multiple graph families and demonstrate strong correlations with degree, PageRank, and paths-based centralities. As an application, the proposed embedding makes it possible to find high-influence nodes in a network, providing a fast and scalable alternative to the standard greedy algorithm.
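The core claim is easy to test in miniature. The sketch below uses networkx's generic spring (force-directed) layout rather than the authors' algorithm, and checks how radial distance ranks against degree and PageRank on a synthetic graph.

```python
# Toy check: in a force-directed embedding, radial distance from the center
# tends to correlate with centrality. Uses the generic networkx spring layout,
# not the authors' algorithm.
import networkx as nx
import numpy as np

G = nx.barabasi_albert_graph(n=300, m=3, seed=0)
pos = nx.spring_layout(G, seed=0, center=(0.0, 0.0))  # force-directed embedding
radius = np.array([np.linalg.norm(pos[v]) for v in G.nodes()])

deg = np.array([d for _, d in G.degree()])
pr = np.array(list(nx.pagerank(G).values()))

def rank_corr(a, b):
    """Spearman-style rank correlation via double argsort."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Negative correlations mean hubs sit near the center of the layout.
print("rank corr(radius, degree):  ", rank_corr(radius, deg))
print("rank corr(radius, pagerank):", rank_corr(radius, pr))
```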
[AI-43] HeTa: Relation-wise Heterogeneous Graph Foundation Attack Model IJCAI2025
[Quick Read]: This paper addresses the vulnerability of Heterogeneous Graph Neural Networks (HGNNs) to attacks, aiming to design a generalizable foundation attack model whose perturbations transfer across different HGNNs and adapt quickly to new heterogeneous graphs (HGs). The key to the solution is mining shared attack units: a relation-aware foundation surrogate model is built to identify generalizable relation-aware attack weights, on top of which a serialized relation-by-relation attack is carried out, so that perturbations transfer effectively to various target HGNNs and are easily fine-tuned for new scenarios.
Link: https://arxiv.org/abs/2506.07428
Authors: Yuling Wang, Zihui Chen, Pengfei Jiao, Xiao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by IJCAI 2025
Abstract:Heterogeneous Graph Neural Networks (HGNNs) are vulnerable, highlighting the need for tailored attacks to assess their robustness and ensure security. However, existing HGNN attacks often require complex retraining of parameters to generate specific perturbations for new scenarios. Recently, foundation models have opened new horizons for the generalization of graph neural networks by capturing shared semantics across various graph distributions. This leads us to ask: Can we design a foundation attack model for HGNNs that enables generalizable perturbations across different HGNNs, and quickly adapts to new heterogeneous graphs (HGs)? Empirical findings reveal that, despite significant differences in model design and parameter space, different HGNNs surprisingly share common vulnerability patterns from a relation-aware perspective. Therefore, we explore how to design foundation HGNN attack criteria by mining shared attack units. In this paper, we propose a novel relation-wise heterogeneous graph foundation attack model, HeTa. We introduce a foundation surrogate model to align heterogeneity and identify the importance of shared relation-aware attack units. Building on this, we implement a serialized relation-by-relation attack based on the identified relational weights. In this way, the perturbation can be transferred to various target HGNNs and easily fine-tuned for new HGs. Extensive experiments exhibit powerful attack performance and generalizability of our method.
[AI-44] Evaluating Visual Mathematics in Multimodal LLMs: A Multilingual Benchmark Based on the Kangaroo Tests
[Quick Read]: This paper addresses the insufficiently explored performance of Multimodal Large Language Models (MLLMs) on visually presented mathematical problems, in particular their effectiveness at handling diagrams, multilingual text, and symbolic notation. The key contribution is a multilingual Kangaroo-style benchmark spanning English, French, Spanish, and Catalan, used to systematically evaluate several models (GPT 4o, Pixtral, Qwen VL, Llama 3.2 Vision variants, and Gemini 2.0 Flash) across geometry, visual algebra, logic, patterns, and combinatorics, revealing differences in how well models exploit visual information and perform structured reasoning.
Link: https://arxiv.org/abs/2506.07418
Authors: Arnau Igualde Sáez, Lamyae Rhomrasi, Yusef Ahsini, Ricardo Vinuesa, Sergio Hoyas, Jose P. García Sabater, Marius J. Fullana i Alfonso, J. Alberto Conejero
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures
Abstract:Multimodal Large Language Models (MLLMs) promise advanced vision language capabilities, yet their effectiveness in visually presented mathematics remains underexplored. This paper analyzes the development and evaluation of MLLMs for mathematical problem solving, focusing on diagrams, multilingual text, and symbolic notation. We then assess several models, including GPT 4o, Pixtral, Qwen VL, Llama 3.2 Vision variants, and Gemini 2.0 Flash in a multilingual Kangaroo style benchmark spanning English, French, Spanish, and Catalan. Our experiments reveal four key findings. First, overall precision remains moderate across geometry, visual algebra, logic, patterns, and combinatorics: no single model excels in every topic. Second, while most models see improved accuracy with questions that do not have images, the gain is often limited; performance for some remains nearly unchanged without visual input, indicating underutilization of diagrammatic information. Third, substantial variation exists across languages and difficulty levels: models frequently handle easier items but struggle with advanced geometry and combinatorial reasoning. Notably, Gemini 2.0 Flash achieves the highest precision on image based tasks, followed by Qwen VL 2.5 72B and GPT 4o, though none approach human level performance. Fourth, a complementary analysis aimed at distinguishing whether models reason or simply recite reveals that Gemini and GPT 4o stand out for their structured reasoning and consistent accuracy. In contrast, Pixtral and Llama exhibit less consistent reasoning, often defaulting to heuristics or randomness when unable to align their outputs with the given answer options.
[AI-45] Evidential Spectrum-Aware Contrastive Learning for OOD Detection in Dynamic Graphs
[Quick Read]: This paper addresses two key problems in out-of-distribution (OOD) detection on dynamic graphs: (i) the high bias and high variance caused by single-point estimation, which makes predictions sensitive to randomness in the data; and (ii) score homogenization due to the lack of OOD training data, where the model only learns in-distribution (ID) specific patterns, resulting in overall low OOD scores and a narrow score gap between ID and OOD data. The key to the solution is EviSEC, built on the Evidential Deep Learning (EDL) framework: an evidential neural network redefines the output as a posterior Dirichlet distribution, explaining input randomness through distributional uncertainty that single-point estimation overlooks, while a spectrum-aware augmentation module generates OOD approximations to widen the score gap between ID and OOD data and mitigate score homogenization.
Link: https://arxiv.org/abs/2506.07417
Authors: Nan Sun, Xixun Lin, Zhiheng Zhou, Yanmin Shang, Zhenlin Cheng, Yanan Cao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures
Abstract:Recently, Out-of-distribution (OOD) detection in dynamic graphs, which aims to identify whether incoming data deviates from the distribution of the in-distribution (ID) training set, has garnered considerable attention in security-sensitive fields. Current OOD detection paradigms primarily focus on static graphs and confront two critical challenges: i) high bias and high variance caused by single-point estimation, which makes the predictions sensitive to randomness in the data; ii) score homogenization resulting from the lack of OOD training data, where the model only learns ID-specific patterns, resulting in overall low OOD scores and a narrow score gap between ID and OOD data. To tackle these issues, we first investigate OOD detection in dynamic graphs through the lens of Evidential Deep Learning (EDL). Specifically, we propose EviSEC, an innovative and effective OOD detector via Evidential Spectrum-awarE Contrastive Learning. We design an evidential neural network to redefine the output as the posterior Dirichlet distribution, explaining the randomness of inputs through the uncertainty of distribution, which is overlooked by single-point estimation. Moreover, a spectrum-aware augmentation module generates OOD approximations to identify patterns with high OOD scores, thereby widening the score gap between ID and OOD data and mitigating score homogenization. Extensive experiments on real-world datasets demonstrate that EviSEC effectively detects OOD samples in dynamic graphs.
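The evidential output head is illustrated below with the standard EDL construction: a non-negative evidence vector parameterizes a Dirichlet posterior, and uncertainty falls out of the distribution rather than a point estimate. This is a generic EDL sketch, not the paper's dynamic-graph architecture.

```python
# Generic Evidential Deep Learning head: logits -> evidence -> Dirichlet posterior.
# Standard EDL quantities; the paper's dynamic-graph encoder is omitted.
import torch
import torch.nn.functional as F

def dirichlet_from_logits(logits: torch.Tensor):
    evidence = F.softplus(logits)           # non-negative evidence per class
    alpha = evidence + 1.0                  # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)
    prob = alpha / strength                 # expected class probabilities
    k = torch.tensor(float(logits.shape[-1]))
    uncertainty = k / strength.squeeze(-1)  # vacuity: high when evidence is scarce
    return prob, uncertainty

logits = torch.tensor([[4.0, 0.1], [0.1, 0.1]])  # confident vs. evidence-poor input
prob, u = dirichlet_from_logits(logits)
print(prob)  # expected probabilities under the Dirichlet
print(u)     # second row has higher uncertainty -> more OOD-like
```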
[AI-46] LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
[Quick Read]: This paper addresses the excessive computational overhead of deploying Vision-Language Models (VLMs) in resource-constrained environments such as robotics and autonomous driving systems. The key to the solution is to jointly employ patch selection to filter out irrelevant camera views, a token selection module to shorten the input sequence, and speculative decoding to accelerate token generation, significantly reducing end-to-end latency while preserving task accuracy.
Link: https://arxiv.org/abs/2506.07416
Authors: Jin Huang, Yuchao Jin, Le An, Josh Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, our pipeline achieves a 2.5× end-to-end latency reduction without compromising task accuracy. The speed-up further increases to 3.2× when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
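Of the three levers, token selection is the easiest to illustrate: score visual tokens and keep only the top-k before they enter the LLM. The saliency rule below (L2 norm) is a stand-in assumption; the paper's actual module is not specified in this digest.

```python
# Toy token-selection step: keep the top-k visual tokens by a saliency score,
# shrinking the LLM's input sequence. Scoring by L2 norm is a stand-in.
import torch

def select_tokens(vis_tokens: torch.Tensor, k: int) -> torch.Tensor:
    scores = vis_tokens.norm(dim=-1)            # (num_tokens,) saliency proxy
    idx = scores.topk(k).indices.sort().values  # keep original token order
    return vis_tokens[idx]

tokens = torch.randn(576, 1024)      # e.g., one camera view's patch embeddings
kept = select_tokens(tokens, k=144)  # 4x fewer tokens into the language model
print(kept.shape)                    # torch.Size([144, 1024])
```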
[AI-47] An Intelligent Fault Self-Healing Mechanism for Cloud AI Systems via Integration of Large Language Models and Deep Reinforcement Learning
[Quick Read]: This paper addresses fault detection and adaptive recovery in cloud-based AI systems to ensure service reliability and continuity. The key to the solution is an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates a Large Language Model (LLM) with Deep Reinforcement Learning (DRL): a two-stage hybrid architecture couples semantic understanding with policy optimization, where the LLM parses fault semantics and DRL optimizes recovery strategies, and a memory-guided meta-controller is introduced to improve generalization and continual adaptation.
Link: https://arxiv.org/abs/2506.07411
Authors: Ze Yang, Yihong Jin, Juntian Liu, Xinhe Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Proceedings of 2025 IEEE 8th International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE 2025)
Abstract:As the scale and complexity of cloud-based AI systems continue to increase, the detection and adaptive recovery of system faults have become the core challenges to ensure service reliability and continuity. In this paper, we propose an Intelligent Fault Self-Healing Mechanism (IFSHM) that integrates Large Language Model (LLM) and Deep Reinforcement Learning (DRL), aiming to realize a fault recovery framework with semantic understanding and policy optimization capabilities in cloud AI systems. On the basis of the traditional DRL-based control model, the proposed method constructs a two-stage hybrid architecture: (1) an LLM-driven fault semantic interpretation module, which can dynamically extract deep contextual semantics from multi-source logs and system indicators to accurately identify potential fault modes; (2) DRL recovery strategy optimizer, based on reinforcement learning, learns the dynamic matching of fault types and response behaviors in the cloud environment. The innovation of this method lies in the introduction of LLM for environment modeling and action space abstraction, which greatly improves the exploration efficiency and generalization ability of reinforcement learning. At the same time, a memory-guided meta-controller is introduced, combined with reinforcement learning playback and LLM prompt fine-tuning strategy, to achieve continuous adaptation to new failure modes and avoid catastrophic forgetting. Experimental results on the cloud fault injection platform show that compared with the existing DRL and rule methods, the IFSHM framework shortens the system recovery time by 37% with unknown fault scenarios.
[AI-48] Fractional-order Jacobian Matrix Differentiation and Its Application in Artificial Neural Networks
【速读】:该论文试图解决当前缺乏与自动微分(Autograd)技术完美兼容的分数阶矩阵微分方法的问题,从而限制了分数阶微分在深度学习中的应用。解决方案的关键在于提出一种基于整数阶雅可比矩阵定义的分数阶雅可比矩阵微分(\bfJ^\alpha)方法,通过该方法实现基于矩阵的分数阶链式法则,并设计分数阶自动微分技术,使分数阶微分能够应用于隐藏层,从而提升其在深度学习中的实用性。
链接: https://arxiv.org/abs/2506.07408
作者: Xiaojun zhou,Chunna Zhao,Yaqun Huang,Chengli Zhou,Junjie Ye,Kemeng Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fractional-order differentiation has many characteristics different from integer-order differentiation. These characteristics can be applied to the optimization algorithms of artificial neural networks to obtain better results. However, due to insufficient theoretical research, at present, there is no fractional-order matrix differentiation method that is perfectly compatible with automatic differentiation (Autograd) technology. Therefore, we propose a fractional-order matrix differentiation calculation method. This method is introduced by the definition of the integer-order Jacobian matrix. We denote it as fractional-order Jacobian matrix differentiation ($\mathbf{J}^\alpha$). Through $\mathbf{J}^\alpha$, we can carry out the matrix-based fractional-order chain rule. Based on the Linear module and the fractional-order differentiation, we design the fractional-order Autograd technology to enable the use of fractional-order differentiation in hidden layers, thereby enhancing the practicality of fractional-order differentiation in deep learning. In the experiment, according to the PyTorch framework, we design fractional-order Linear (FLinear) and replace nn.Linear in the multilayer perceptron with FLinear. Through the qualitative analysis of the training-set and validation-set loss, the quantitative analysis of the test-set indicators, and the analysis of time consumption and GPU memory usage during model training, we verify the superior performance of $\mathbf{J}^\alpha$ and prove that it is an excellent fractional-order gradient descent method in the field of deep learning.
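Fractional-order differentiation itself can be made concrete with the Grünwald-Letnikov construction, which generalizes finite differences to non-integer orders. The snippet below illustrates only the scalar 1-D case against a known closed form; it is not the paper's matrix-valued $\mathbf{J}^\alpha$ or its Autograd integration.

```python
# Grünwald-Letnikov fractional derivative in 1-D: generalized binomial weights
# over past samples. Illustrates fractional differentiation itself, not the
# paper's Jacobian formulation.
import numpy as np
from math import gamma

def gl_weights(alpha: float, n: int) -> np.ndarray:
    # (-1)^k * C(alpha, k) via the recurrence w_k = w_{k-1} * (k - 1 - alpha) / k
    w = np.ones(n + 1)
    for k in range(1, n + 1):
        w[k] = w[k - 1] * (k - 1 - alpha) / k
    return w

def gl_derivative(f, x: float, alpha: float, n: int = 1000) -> float:
    h = x / n                       # lower terminal at 0 (Riemann-Liouville sense)
    w = gl_weights(alpha, n)
    return h ** (-alpha) * sum(wk * f(x - k * h) for k, wk in enumerate(w))

# Check against the known result D^alpha x^2 = 2 x^(2-alpha) / Gamma(3-alpha).
alpha, x = 0.5, 1.0
print("numeric:  ", gl_derivative(lambda t: t * t, x, alpha))
print("analytic: ", 2 * x ** (2 - alpha) / gamma(3 - alpha))
```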
[AI-49] Anomaly Detection and Early Warning Mechanism for Intelligent Monitoring Systems in Multi-Cloud Environments Based on LLM
[Quick Read]: This paper addresses the security and reliability of intelligent monitoring systems in multi-cloud environments, in particular improving anomaly-detection accuracy and real-time response efficiency. The key to the solution is a multi-level feature extraction method driven by a Large Language Model (LLM): the natural-language-processing capability of the LLM is combined with traditional machine learning to strengthen anomaly detection, and the LLM's contextual understanding lets the model dynamically adapt to different cloud service providers and environments, so that abnormal patterns are detected and potential failures are predicted more effectively.
Link: https://arxiv.org/abs/2506.07407
Authors: Yihong Jin, Ze Yang, Juntian Liu, Xinhe Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Proceedings of 2025 5th International Symposium on Computer Technology and Information Science (ISCTIS 2025)
Abstract:With the rapid development of multi-cloud environments, it is increasingly important to ensure the security and reliability of intelligent monitoring systems. In this paper, we propose an anomaly detection and early warning mechanism for intelligent monitoring system in multi-cloud environment based on Large-Scale Language Model (LLM). On the basis of the existing monitoring framework, the proposed model innovatively introduces a multi-level feature extraction method, which combines the natural language processing ability of LLM with traditional machine learning methods to enhance the accuracy of anomaly detection and improve the real-time response efficiency. By introducing the contextual understanding capabilities of LLMs, the model dynamically adapts to different cloud service providers and environments, so as to more effectively detect abnormal patterns and predict potential failures. Experimental results show that the proposed model is significantly better than the traditional anomaly detection system in terms of detection accuracy and latency, and significantly improves the resilience and active management ability of cloud infrastructure.
[AI-50] InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
[Quick Read]: This paper addresses the interpretability of the internal representations of large language models (LLMs), in particular the problem that existing feature-interpretability methods rest on strong assumptions about representation structure that may not hold in practice. The key to the solution is InverseScope, an assumption-light and scalable framework that interprets neural activations via input inversion: it defines a distribution over inputs that produce similar activations and analyzes this distribution to infer the encoded features. To make sampling in high-dimensional spaces efficient, a novel conditional generation architecture is introduced, and a quantitative evaluation protocol based on the feature consistency rate over sampled inputs is proposed, allowing inversion-based interpretability to scale to larger models and practical tasks.
Link: https://arxiv.org/abs/2506.07406
Authors: Yifan Luo, Zhennan Zhou, Bin Dong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures
Abstract:Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong assumptions about the structure of representations that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, we define a distribution over inputs that generate similar activations and analyze this distribution to infer the encoded features. To address the inefficiency of sampling in high-dimensional spaces, we propose a novel conditional generation architecture that significantly improves sample efficiency compared to previous methods. We further introduce a quantitative evaluation protocol that tests interpretability hypotheses using feature consistency rate computed over the sampled inputs. InverseScope scales inversion-based interpretability methods to larger models and practical tasks, enabling systematic and quantitative analysis of internal representations in real-world LLMs.
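A toy rendering of the evaluation idea: sample inputs whose activations land near a target activation, then measure how often they share a hypothesized feature. The "model" and feature below are synthetic stand-ins, and the naive rejection sampler is exactly what the paper's conditional generator is meant to replace.

```python
# Toy inversion-based evaluation: given a target activation, collect inputs
# with similar activations and compute the feature consistency rate, i.e. the
# fraction of samples satisfying a hypothesized feature predicate.
import numpy as np

rng = np.random.default_rng(0)

def activation(x: np.ndarray) -> np.ndarray:
    """Stand-in for a hidden-layer activation; the 'feature' is sign of x[0]."""
    return np.tanh(np.array([x[0] + 0.1 * x[1], x[1]]))

target_x = np.array([1.5, -0.3])
target_act = activation(target_x)

# Naive rejection sampling over inputs (replaced by a learned conditional
# generator in the paper, for sample efficiency).
candidates = rng.normal(size=(20000, 2)) * 2.0
keep = [x for x in candidates
        if np.linalg.norm(activation(x) - target_act) < 0.1]

feature = lambda x: x[0] > 0  # hypothesized encoded feature
rate = np.mean([feature(x) for x in keep])
print(f"{len(keep)} accepted samples, feature consistency rate = {rate:.2f}")
```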
[AI-51] From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks
[Quick Read]: This paper addresses the severe Denial-of-Service (DoS) threats faced by UAV swarm networks under open wireless environments, dynamic topologies, and resource constraints, where traditional static or centralized defenses fall short. The key to the solution is a federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive and adaptive DoS mitigation: three lightweight and coordinated MTD mechanisms (leader switching, route mutation, and frequency hopping) exploit the inherent flexibility of UAV swarms to disrupt attackers and enhance resilience, the defense problem is modeled as a multi-agent partially observable Markov decision process (POMDP), and a policy-gradient-based FMADRL algorithm enables distributed learning and optimization of defense policies.
Link: https://arxiv.org/abs/2506.07392
Authors: Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen, Tian Qin, Yuyu Zhao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages; in submission
Abstract:The proliferation of unmanned aerial vehicle (UAV) swarms has enabled a wide range of mission-critical applications, but also exposes UAV networks to severe Denial-of-Service (DoS) threats due to their open wireless environment, dynamic topology, and resource constraints. Traditional static or centralized defense mechanisms are often inadequate for such dynamic and distributed scenarios. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive and adaptive DoS mitigation in UAV swarm networks. Specifically, we design three lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, that leverage the inherent flexibility of UAV swarms to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process (POMDP), capturing the distributed, resource-constrained, and uncertain nature of UAV swarms under attack. Each UAV is equipped with a local policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based FMADRL algorithm, UAVs collaboratively optimize their defense policies via reward-weighted aggregation, enabling distributed learning without sharing raw data and thus reducing communication overhead. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, while maintaining robust mission continuity under various DoS attack strategies.
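The reward-weighted aggregation step that replaces raw-data sharing can be sketched in a few lines. The softmax weighting rule below is an illustrative assumption, not the paper's exact scheme.

```python
# Sketch of reward-weighted federated aggregation: each UAV trains a local
# policy, then parameters are averaged with weights derived from episode
# rewards, so no raw trajectories are shared. Weighting rule is illustrative.
import numpy as np

def aggregate(local_params: list[np.ndarray], rewards: list[float]) -> np.ndarray:
    w = np.array(rewards, dtype=float)
    w = np.exp(w - w.max())   # softmax weighting, shifted for numerical stability
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, local_params))

# Three UAVs with locally updated policy parameters and episode rewards.
locals_ = [np.array([0.2, 0.8]), np.array([0.5, 0.5]), np.array([0.9, 0.1])]
rewards = [1.0, 3.0, 0.5]
global_params = aggregate(locals_, rewards)
print(global_params)  # broadcast back to all UAVs for the next training round
```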
[AI-52] Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data ACL2025
[Quick Read]: This paper addresses the limited ability of large language models (LLMs) to detect software vulnerabilities. The key lies in ReVD, a framework built on reasoning-data synthesis and vulnerability-specific preference optimization: forward and backward reasoning processes are constructed for vulnerabilities and their corresponding fixed code to ensure high-quality reasoning data, and triplet supervised fine-tuning followed by curriculum online preference optimization enables the model to better understand and recognize the semantic patterns of vulnerabilities.
Link: https://arxiv.org/abs/2506.07390
Authors: Xin-Cheng Wen, Yijun Yang, Cuiyun Gao, Yang Xiao, Deheng Ye
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted by ACL 2025 Findings
Abstract:Large language models (LLMs) demonstrate considerable proficiency in numerous coding-related tasks; however, their capabilities in detecting software vulnerabilities remain limited. This limitation primarily stems from two factors: (1) the absence of reasoning data related to vulnerabilities, which hinders the models’ ability to capture underlying vulnerability patterns; and (2) their focus on learning semantic representations rather than the reason behind them, thus failing to recognize semantically similar vulnerability samples. Furthermore, the development of LLMs specialized in vulnerability detection is challenging, particularly in environments characterized by the scarcity of high-quality datasets. In this paper, we propose a novel framework ReVD that excels at mining vulnerability patterns through reasoning data synthesizing and vulnerability-specific preference optimization. Specifically, we construct forward and backward reasoning processes for vulnerability and corresponding fixed code, ensuring the synthesis of high-quality reasoning data. Moreover, we design the triplet supervised fine-tuning followed by curriculum online preference optimization for enabling ReVD to better understand vulnerability patterns. The extensive experiments conducted on PrimeVul and SVEN datasets demonstrate that ReVD sets new state-of-the-art for LLM-based software vulnerability detection, e.g., 12.24%-22.77% improvement in the accuracy. The source code and data are available at this https URL.
[AI-53] Shapley-Coop: Credit Assignment for Emergent Cooperation in Self-Interested LLM Agents
[Quick Read]: This paper addresses the tendency of large language model (LLM) agents to act in self-interested ways in open-ended environments that lack coordination rules; the central challenge is credit assignment, i.e., fairly evaluating each agent's contribution and designing pricing mechanisms that align their heterogeneous goals. The key to the solution is the cooperative workflow Shapley-Coop, which combines Shapley Chain-of-Thought (using marginal contributions as a principled basis for pricing) with structured negotiation protocols, enabling rational task-time pricing and post-task reward redistribution that align agent incentives, foster cooperation, and preserve autonomy.
Link: https://arxiv.org/abs/2506.07388
Authors: Yun Hua, Haosheng Chen, Shiqin Wang, Wenhao Li, Xiangfeng Wang, Jun Luo
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) show strong collaborative performance in multi-agent systems with predefined roles and workflows. However, in open-ended environments lacking coordination rules, agents tend to act in self-interested ways. The central challenge in achieving coordination lies in credit assignment – fairly evaluating each agent’s contribution and designing pricing mechanisms that align their heterogeneous goals. This problem is critical as LLMs increasingly participate in complex human-AI collaborations, where fair compensation and accountability rely on effective pricing mechanisms. Inspired by how human societies address similar coordination challenges (e.g., through temporary collaborations such as employment or subcontracting), we propose a cooperative workflow, Shapley-Coop. Shapley-Coop integrates Shapley Chain-of-Thought – leveraging marginal contributions as a principled basis for pricing – with structured negotiation protocols for effective price matching, enabling LLM agents to coordinate through rational task-time pricing and post-task reward redistribution. This approach aligns agent incentives, fosters cooperation, and maintains autonomy. We evaluate Shapley-Coop across two multi-agent games and a software engineering simulation, demonstrating that it consistently enhances LLM agent collaboration and facilitates equitable credit assignment. These results highlight the effectiveness of Shapley-Coop’s pricing mechanisms in accurately reflecting individual contributions during task execution.
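Shapley values themselves are easy to state in code: each agent's fair price is its average marginal contribution over all join orders. The sketch below enumerates coalitions exactly for a toy value function; the paper embeds this idea in chain-of-thought prompting rather than exhaustive enumeration.

```python
# Exact Shapley values by enumerating join orders. The value function v is a
# toy stand-in for observed task utility.
from itertools import permutations

def shapley(agents: list[str], v) -> dict[str, float]:
    values = {a: 0.0 for a in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition: set = set()
        for a in order:
            before = v(coalition)
            coalition.add(a)
            values[a] += v(coalition) - before  # marginal contribution
    return {a: s / len(orders) for a, s in values.items()}

# Toy utility: B is useless without A; C adds a flat bonus.
def v(c: set) -> float:
    total = 0.0
    if "A" in c: total += 4.0
    if {"A", "B"} <= c: total += 2.0
    if "C" in c: total += 1.0
    return total

print(shapley(["A", "B", "C"], v))  # {'A': 5.0, 'B': 1.0, 'C': 1.0}
```

Note that the per-agent prices sum to the grand-coalition value (7.0 here), which is the efficiency property that makes Shapley-style pricing a natural basis for post-task reward redistribution.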
[AI-54] HyColor: An Efficient Heuristic Algorithm for Graph Coloring
[Quick Read]: This paper tackles the graph coloring problem (GCP): finding the minimum number of colors needed to color the vertices of a graph so that no two adjacent vertices share a color. Existing GCP algorithms target either small hard graphs or large-scale sparse graphs (with up to 10^7 vertices); the key to the proposed hybrid heuristic HyColor lies in three components: a local decision strategy that improves the lower bound on the chromatic number, a graph-reduction strategy that shrinks the working graph, and a k-core and mixed-degree-based greedy heuristic for coloring graphs efficiently.
Link: https://arxiv.org/abs/2506.07373
Authors: Enqiang Zhu, Yu Zhang, Haopeng Sun, Ziqi Wei, Witold Pedrycz, Chanjuan Liu, Jin Xu
Affiliations: Unknown
Categories: Discrete Mathematics (cs.DM); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures
Abstract:The graph coloring problem (GCP) is a classic combinatorial optimization problem that aims to find the minimum number of colors assigned to vertices of a graph such that no two adjacent vertices receive the same color. GCP has been extensively studied by researchers from various fields, including mathematics, computer science, and biological science. Due to the NP-hard nature, many heuristic algorithms have been proposed to solve GCP. However, existing GCP algorithms focus on either small hard graphs or large-scale sparse graphs (with up to 10^7 vertices). This paper presents an efficient hybrid heuristic algorithm for GCP, named HyColor, which excels in handling large-scale sparse graphs while achieving impressive results on small dense graphs. The efficiency of HyColor comes from the following three aspects: a local decision strategy to improve the lower bound on the chromatic number; a graph-reduction strategy to reduce the working graph; and a k-core and mixed degree-based greedy heuristic for efficiently coloring graphs. HyColor is evaluated against three state-of-the-art GCP algorithms across four benchmarks, comprising three large-scale sparse graph benchmarks and one small dense graph benchmark, totaling 209 instances. The results demonstrate that HyColor consistently outperforms existing heuristic algorithms in both solution accuracy and computational efficiency for the majority of instances. Notably, HyColor achieved the best solutions in 194 instances (over 93%), with 34 of these solutions significantly surpassing those of other algorithms. Furthermore, HyColor successfully determined the chromatic number and achieved optimal coloring in 128 instances.
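The greedy core of such heuristics follows a classic pattern: order vertices by how constrained they are, then assign each the smallest feasible color. The ordering below blends core number and degree in the spirit of HyColor's k-core and mixed-degree heuristic; it is an illustration of the idea, not the authors' implementation.

```python
# Greedy graph coloring guided by core number and degree.
import networkx as nx

def greedy_color(G: nx.Graph) -> dict:
    core = nx.core_number(G)
    # Color high-core, high-degree vertices first: they are most constrained.
    order = sorted(G.nodes(), key=lambda v: (core[v], G.degree(v)), reverse=True)
    color: dict = {}
    for v in order:
        used = {color[u] for u in G.neighbors(v) if u in color}
        c = 0
        while c in used:  # smallest color not used by any colored neighbor
            c += 1
        color[v] = c
    return color

G = nx.petersen_graph()
coloring = greedy_color(G)
assert all(coloring[u] != coloring[v] for u, v in G.edges())  # always proper
print("colors used:", max(coloring.values()) + 1)  # optimum for this graph is 3
```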
[AI-55] Deepfake Technology Unveiled: The Commoditization of AI and Its Impact on Digital Trust
[Quick Read]: This paper addresses the crisis of digital trust created by the commoditization of generative AI, in particular the risks of deepfake technology for fraud, misinformation, and the erosion of multimedia authenticity. The key contribution is an analysis of how low-cost, easy-to-use tools (such as Runway, Rope, and ElevenLabs) enable rapid creation of highly realistic deepfakes, exposing the security threats to individuals and organizations and underscoring the need for regulatory frameworks, public awareness, and multi-party collaboration.
Link: https://arxiv.org/abs/2506.07363
Authors: Claudiu Popa, Rex Pallath, Liam Cunningham, Hewad Tahiri, Abiram Kesavarajah, Tao Wu
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 12 pages, 13 figures
Abstract:Deepfake Technology Unveiled: The Commoditization of AI and Its Impact on Digital Trust. With the increasing accessibility of generative AI, tools for voice cloning, face-swapping, and synthetic media creation have advanced significantly, lowering both financial and technical barriers for their use. While these technologies present innovative opportunities, their rapid growth raises concerns about trust, privacy, and security. This white paper explores the implications of deepfake technology, analyzing its role in enabling fraud, misinformation, and the erosion of authenticity in multimedia. Using cost-effective, easy to use tools such as Runway, Rope, and ElevenLabs, we explore how realistic deepfakes can be created with limited resources, demonstrating the risks posed to individuals and organizations alike. By analyzing the technical and ethical challenges of deepfake mitigation and detection, we emphasize the urgent need for regulatory frameworks, public awareness, and collaborative efforts to maintain trust in digital media.
[AI-56] Lightweight Joint Audio-Visual Deepfake Detection via Single-Stream Multi-Modal Learning Framework
[Quick Read]: This paper addresses the underuse of audio-visual feature correlations and the model redundancy caused by learning audio and visual features independently in audio-visual deepfake detection. The key to the solution is a lightweight network that fuses audio and visual features efficiently within a single-stream multi-modal learning framework: a collaborative audio-visual learning block integrates multi-modal information while learning the visual and audio features, and applying this block iteratively yields continuous cross-layer fusion of multi-modal features, reducing the need for excessive block stacking and improving model efficiency.
Link: https://arxiv.org/abs/2506.07358
Authors: Kuiyuan Zhang, Wenjie Pei, Rushi Lan, Yifang Guo, Zhongyun Hua
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Deepfakes are AI-synthesized multimedia data that may be abused for spreading misinformation. Deepfake generation involves both visual and audio manipulation. To detect audio-visual deepfakes, previous studies commonly employ two relatively independent sub-models to learn audio and visual features, respectively, and fuse them subsequently for deepfake detection. However, this may underutilize the inherent correlations between audio and visual features. Moreover, utilizing two isolated feature learning sub-models can result in redundant neural layers, making the overall model inefficient and impractical for resource-constrained environments. In this work, we design a lightweight network for audio-visual deepfake detection via a single-stream multi-modal learning framework. Specifically, we introduce a collaborative audio-visual learning block to efficiently integrate multi-modal information while learning the visual and audio features. By iteratively employing this block, our single-stream network achieves a continuous fusion of multi-modal features across its layers. Thus, our network efficiently captures visual and audio features without the need for excessive block stacking, resulting in a lightweight network design. Furthermore, we propose a multi-modal classification module that can boost the dependence of the visual and audio classifiers on modality content. It also enhances the whole resistance of the video classifier against the mismatches between audio and visual modalities. We conduct experiments on the DF-TIMIT, FakeAVCeleb, and DFDC benchmark datasets. Compared to state-of-the-art audio-visual joint detection methods, our method is significantly lightweight with only 0.48M parameters, yet it achieves superiority in both uni-modal and multi-modal deepfakes, as well as in unseen types of deepfakes.
[AI-57] SALT: A Lightweight Model Adaptation Method for Closed Split Computing Environments
[Quick Read]: This paper addresses lightweight model adaptation in closed split-computing settings, where both the head and tail networks are proprietary and inaccessible to users, making conventional adaptation methods (which require access to model parameters or architectures) infeasible. The key to the solution, SALT (Split-Adaptive Lightweight Tuning), is a compact trainable adapter on the client side that refines the latent features produced by the head network, enabling user-specific adaptation without modifying the original models or adding communication overhead, and supporting robust edge-cloud inference under limited resources.
Link: https://arxiv.org/abs/2506.07355
Authors: Yuya Okada, Takayuki Nishio
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, submitted to IEEE Globecom 2025 (under review)
Abstract:We propose SALT (Split-Adaptive Lightweight Tuning), a lightweight model adaptation framework for Split Computing under closed constraints, where the head and tail networks are proprietary and inaccessible to users. In such closed environments, conventional adaptation methods are infeasible since they require access to model parameters or architectures. SALT addresses this challenge by introducing a compact, trainable adapter on the client side to refine latent features from the head network, enabling user-specific adaptation without modifying the original models or increasing communication overhead. We evaluate SALT on user-specific classification tasks with CIFAR-10 and CIFAR-100, demonstrating improved accuracy with lower training latency compared to fine-tuning methods. Furthermore, SALT facilitates model adaptation for robust inference over lossy networks, a common challenge in edge-cloud environments. With minimal deployment overhead, SALT offers a practical solution for personalized inference in edge AI systems under strict system constraints.
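Conceptually, the client-side adapter is a small trainable module wedged between two frozen proprietary networks. The sketch below shows one plausible shape; the dimensions and residual structure are assumptions, since the paper's exact module is not given in this digest.

```python
# Minimal client-side adapter for split computing: the proprietary head and
# tail stay frozen; only the small adapter refining latent features is trained.
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.net(z)  # residual refinement keeps latents in-distribution

head = nn.Linear(32, 16)        # stand-in for the frozen proprietary head
tail = nn.Linear(16, 10)        # stand-in for the frozen proprietary tail
for p in list(head.parameters()) + list(tail.parameters()):
    p.requires_grad_(False)     # closed constraint: no access to these weights

adapter = LatentAdapter(dim=16)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
logits = tail(adapter(head(x)))  # only the adapter receives gradients
loss = nn.functional.cross_entropy(logits, y)
loss.backward()
opt.step()
print(f"loss: {loss.item():.3f}")
```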
[AI-58] Distributed Risk-Sensitive Safety Filters for Uncertain Discrete-Time Systems
[Quick Read]: This paper addresses safety assurance in multi-agent systems where centralized coordination is impractical. The key to the solution is a risk-sensitive safety filter based on control barrier functions (CBFs) defined through value functions: centralized risk-sensitive safety conditions built on exponential risk operators provide robustness against model uncertainty, and two distributed strategies are introduced, one based on worst-case anticipation and the other on proximity to a known safe policy, with agents allowed to switch between the strategies to ensure feasibility.
Link: https://arxiv.org/abs/2506.07347
Authors: Armin Lederer, Erfaun Noorani, Andreas Krause
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Ensuring safety in multi-agent systems is a significant challenge, particularly in settings where centralized coordination is impractical. In this work, we propose a novel risk-sensitive safety filter for discrete-time multi-agent systems with uncertain dynamics that leverages control barrier functions (CBFs) defined through value functions. Our approach relies on centralized risk-sensitive safety conditions based on exponential risk operators to ensure robustness against model uncertainties. We introduce a distributed formulation of the safety filter by deriving two alternative strategies: one based on worst-case anticipation and another on proximity to a known safe policy. By allowing agents to switch between strategies, feasibility can be ensured. Through detailed numerical evaluations, we demonstrate the efficacy of our approach in maintaining safety without being overly conservative.
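The exponential (entropic) risk operator at the heart of such conditions has a standard form. As a sketch, a risk-sensitive analogue of a discrete-time CBF decrease condition can be written as follows; the paper's exact condition may differ, so treat the second line as illustrative:

```latex
\[
\rho_\theta(X) \;=\; \frac{1}{\theta}\,\log \mathbb{E}\!\left[e^{\theta X}\right],
\qquad \theta > 0,
\]
\[
\rho_\theta\!\big({-h(x_{t+1})}\big) \;\le\; -(1-\alpha)\, h(x_t),
\qquad \alpha \in (0,1].
\]
```

As $\theta \to 0$, $\rho_\theta(X) \to \mathbb{E}[X]$ and the condition reduces to the risk-neutral expected-decrease requirement; larger $\theta$ weights adverse tail outcomes more heavily, which is what hedges against model uncertainty.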
[AI-59] Real-Time Execution of Action Chunking Flow Policies
[Quick Read]: This paper addresses the real-time performance problem of modern AI systems interacting with the physical world, in particular the high latency of state-of-the-art generalist models, including recent vision-language-action models (VLAs). Although action chunking provides temporal consistency for high-frequency control, it does not fully solve the latency problem, causing pauses or out-of-distribution jerky motions at chunk boundaries. The key to the solution is real-time chunking (RTC): the next action chunk is generated while the current one is still executing, "freezing" the actions guaranteed to execute and "inpainting" the rest, yielding smooth asynchronous execution. The method applies to any diffusion- or flow-based VLA without retraining, and experiments on dynamic simulated tasks and real-world bimanual manipulation demonstrate its efficiency, performance, and robustness to inference delay.
Link: https://arxiv.org/abs/2506.07339
Authors: Kevin Black, Manuel Y. Galliker, Sergey Levine
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Modern AI systems, especially those interacting with the physical world, increasingly require real-time performance. However, the high latency of state-of-the-art generalist models, including recent vision-language action models (VLAs), poses a significant challenge. While action chunking has enabled temporal consistency in high-frequency control tasks, it does not fully address the latency problem, leading to pauses or out-of-distribution jerky movements at chunk boundaries. This paper presents a novel inference-time algorithm that enables smooth asynchronous execution of action chunking policies. Our method, real-time chunking (RTC), is applicable to any diffusion- or flow-based VLA out of the box with no re-training. It generates the next action chunk while executing the current one, "freezing" actions guaranteed to execute and "inpainting" the rest. To test RTC, we introduce a new benchmark of 12 highly dynamic tasks in the Kinetix simulator, as well as evaluate 6 challenging real-world bimanual manipulation tasks. Results demonstrate that RTC is fast, performant, and uniquely robust to inference delay, significantly improving task throughput and enabling high success rates in precise tasks – such as lighting a match – even in the presence of significant latency. See this https URL for videos.
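The freeze-and-inpaint scheduling can be sketched independently of the underlying flow policy: while chunk k executes, the first DELAY actions of chunk k+1 overlap timesteps the old chunk will already have executed, so they are pinned to match, and only the remainder is regenerated. The `policy` below is a toy stand-in; the guided-generation details of real flow inpainting are omitted.

```python
# Sketch of real-time chunking (RTC): generate the next chunk while the
# current one executes; freeze the overlapping actions, regenerate the rest.
import numpy as np

H, DELAY = 8, 3          # chunk length; control ticks one inference call takes
rng = np.random.default_rng(0)

def policy(obs: np.ndarray, frozen: np.ndarray | None) -> np.ndarray:
    chunk = np.tile(obs, (H, 1)) + 0.01 * rng.standard_normal((H, obs.size))
    if frozen is not None:
        chunk[: len(frozen)] = frozen   # freeze actions guaranteed to execute
    return chunk

obs = np.zeros(2)
chunk, t = policy(obs, None), 0         # t indexes into the current chunk
for _ in range(24):
    obs = obs + 0.1 * chunk[t]          # execute one action per control tick
    t += 1
    if t == H - DELAY:                  # start generation DELAY ticks early
        nxt = policy(obs, frozen=chunk[H - DELAY:])  # overlap region is frozen
    if t == H:                          # old chunk exhausted: switch seamlessly
        chunk, t = nxt, DELAY           # skip the already-executed overlap
```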
[AI-60] JavelinGuard: Low-Cost Transformer Architectures for LLM Security
[Quick Read]: This paper addresses the detection of malicious intent in Large Language Model (LLM) interactions, aiming to provide low-cost, high-performance model architectures for production deployment. The key to the solution is leveraging recent advances in transformer architectures, such as compact BERT variants (e.g., ModernBERT), to build highly accurate classifiers with roughly 400M parameters that achieve fast inference even on standard CPU hardware. Five progressively sophisticated transformer-based architectures are explored systematically and benchmarked on nine diverse adversarial datasets, with comparisons against leading open-source guardrail models and large decoder-only LLMs demonstrating superior cost-performance trade-offs in accuracy and latency.
Link: https://arxiv.org/abs/2506.07330
Authors: Yash Datta, Sharath Rajasekar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 16 pages, 1 figure and 5 tables
Abstract:We present JavelinGuard, a suite of low-cost, high-performance model architectures designed for detecting malicious intent in Large Language Model (LLM) interactions, optimized specifically for production deployment. Recent advances in transformer architectures, including compact BERT(Devlin et al. 2019) variants (e.g., ModernBERT (Warner et al. 2024)), allow us to build highly accurate classifiers with as few as approximately 400M parameters that achieve rapid inference speeds even on standard CPU hardware. We systematically explore five progressively sophisticated transformer-based architectures: Sharanga (baseline transformer classifier), Mahendra (enhanced attention-weighted pooling with deeper heads), Vaishnava and Ashwina (hybrid neural ensemble architectures), and Raudra (an advanced multi-task framework with specialized loss functions). Our models are rigorously benchmarked across nine diverse adversarial datasets, including popular sets like the NotInject series, BIPIA, Garak, ImprovedLLM, ToxicChat, WildGuard, and our newly introduced JavelinBench, specifically crafted to test generalization on challenging borderline and hard-negative cases. Additionally, we compare our architectures against leading open-source guardrail models as well as large decoder-only LLMs such as gpt-4o, demonstrating superior cost-performance trade-offs in terms of accuracy, and latency. Our findings reveal that while Raudra’s multi-task design offers the most robust performance overall, each architecture presents unique trade-offs in speed, interpretability, and resource requirements, guiding practitioners in selecting the optimal balance of complexity and efficiency for real-world LLM security applications.
[AI-61] Speech Recognition on TV Series with Video-guided Post-Correction
[Quick Read]: This paper addresses the transcription-accuracy problems of automatic speech recognition (ASR) systems in complex settings such as TV series, where overlapping speech, domain-specific terminology, and long-range contextual dependencies arise. The key to the solution is a novel multimodal post-correction framework that refines ASR output using temporal and contextual information from video, with two core stages: video-based contextual information extraction and context-aware ASR correction.
Link: https://arxiv.org/abs/2506.07323
Authors: Haoyuan Yang, Yue Zhang, Liqiang Jing
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Automatic Speech Recognition (ASR) has achieved remarkable success with deep learning, driving advancements in conversational artificial intelligence, media transcription, and assistive technologies. However, ASR systems still struggle in complex environments such as TV series, where overlapping speech, domain-specific terminology, and long-range contextual dependencies pose significant challenges to transcription accuracy. Existing multimodal approaches fail to correct ASR outputs with the rich temporal and contextual information available in video. To address this limitation, we propose a novel multimodal post-correction framework that refines ASR transcriptions by leveraging contextual cues extracted from video. Our framework consists of two stages: ASR Generation and Video-based Post-Correction, where the first stage produces the initial transcript and the second stage corrects errors using Video-based Contextual Information Extraction and Context-aware ASR Correction. We employ the Video-Large Multimodal Model (VLMM) to extract key contextual information using tailored prompts, which is then integrated with a Large Language Model (LLM) to refine the ASR output. We evaluate our method on a multimodal benchmark for TV series ASR and demonstrate its effectiveness in improving ASR performance by leveraging video-based context to enhance transcription accuracy in complex multimedia environments.
[AI-62] Generative Modeling of Networked Time-Series via Transformer Architectures
[Quick Read]: This paper addresses the shortage of training data for machine learning models in security and network applications caused by limited data access. While prior studies showed that Transformer models can enlarge datasets by synthesizing new samples, those synthetic samples did not actually improve model performance. The key to the solution is an efficient transformer-based generative framework for time-series data that boosts existing and new ML workflows, achieves state-of-the-art results across multiple datasets, generalizes well, and produces high-quality samples.
Link: https://arxiv.org/abs/2506.07312
Authors: Yusuf Elnady
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Many security and network applications require large datasets to train machine learning models. Limited data access is a well-known problem in the security domain. Recent studies have shown the potential of Transformer models to enlarge the size of data by synthesizing new samples, but the synthesized samples don't improve the models over the real data. To address this issue, we design an efficient transformer-based model as a generative framework to generate time-series data that can be used to boost the performance of existing and new ML workflows. Our new transformer model achieves SOTA results. We design our model to be generalizable, to work across different datasets, and to produce high-quality samples.
[AI-63] Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference
[Quick Read]: This paper addresses the severe memory inefficiency of large language models (LLMs) during long-context inference caused by conventional key-value (KV) cache handling. The key to the solution is integrating PagedAttention with PyTorch's FlexAttention to mitigate internal fragmentation and the inefficiency of monolithic KV-cache allocation: a fused attention kernel implemented in IBM's Foundation Model Stack (FMS) efficiently gathers scattered KV data, significantly reducing inference latency and keeping latency growth linear at long sequence lengths.
Link: https://arxiv.org/abs/2506.07311
Authors: Thomas Joshi, Herman Saini, Neil Dhillon, Antoni Viros i Martin, Kaoutar El Maghraoui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) encounter severe memory inefficiencies during long-context inference due to conventional handling of key-value (KV) caches. In this work, we introduce a novel integration of PagedAttention with PyTorch’s FlexAttention, addressing internal fragmentation and inefficiencies associated with monolithic KV cache allocations. Implemented within IBM’s Foundation Model Stack (FMS), our fused attention kernel efficiently gathers scattered KV data. Our benchmarks on an NVIDIA L4 GPU (24GB) demonstrate significantly reduced inference latency, growing only linearly (~2x) with sequence length from 128 to 2048 tokens when utilizing a global KV cache, compared to exponential latency increases without caching. While peak memory usage remains largely unchanged for single-step evaluations (dominated by model weights and activations), paged attention causes minimal incremental memory usage, observable only at sequence lengths exceeding 2048 tokens due to its power-of-two cache allocations. We open-source the full implementation and discuss its implications for future long-context model deployment.
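The bookkeeping underlying paged KV caching is a block table mapping each sequence's logical token positions to scattered physical blocks, plus a gather before attention. The sketch below shows that bookkeeping in plain PyTorch; the paper's contribution is fusing the gather into a FlexAttention kernel, which is not reproduced here.

```python
# Minimal paged KV cache: fixed-size physical blocks plus a per-sequence block
# table. Logical KV positions are gathered from scattered blocks before
# attention; real systems fuse this gather into the attention kernel.
import torch

BLOCK, N_BLOCKS, D = 4, 16, 8
k_pool = torch.zeros(N_BLOCKS, BLOCK, D)   # physical key blocks
free = list(range(N_BLOCKS))               # free-list allocator
block_table: list[int] = []                # one sequence's logical->physical map
seq_len = 0

def append_key(k: torch.Tensor):
    global seq_len
    if seq_len % BLOCK == 0:               # need a fresh block (power-of-two page)
        block_table.append(free.pop(0))
    blk, off = block_table[seq_len // BLOCK], seq_len % BLOCK
    k_pool[blk, off] = k
    seq_len += 1

def gather_keys() -> torch.Tensor:
    """Assemble the logical K tensor from scattered physical blocks."""
    blocks = k_pool[torch.tensor(block_table)]   # (n_blocks, BLOCK, D)
    return blocks.reshape(-1, D)[:seq_len]       # trim the partial tail block

for _ in range(10):
    append_key(torch.randn(D))

K = gather_keys()
q = torch.randn(D)
attn = torch.softmax(K @ q / D**0.5, dim=0)  # attention over gathered keys
print(K.shape, attn.shape)                   # torch.Size([10, 8]) torch.Size([10])
```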
[AI-64] Pre-trained Large Language Models Learn Hidden Markov Models In-context
[Quick Read]: This paper investigates how pre-trained large language models (LLMs) can model and predict data generated by Hidden Markov Models (HMMs) via in-context learning (ICL). The key lies in exploiting the LLMs' ability to infer patterns from examples in the prompt, achieving predictive accuracy approaching the theoretical optimum and revealing how HMM properties shape model performance.
Link: https://arxiv.org/abs/2506.07298
Authors: Yijia Dai, Zhaolin Gao, Yahya Satter, Sarah Dean, Jennifer J. Sun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hidden Markov Models (HMMs) are foundational tools for modeling sequential data with latent Markovian structure, yet fitting them to real-world data remains computationally challenging. In this work, we show that pre-trained large language models (LLMs) can effectively model data generated by HMMs via in-context learning (ICL) – their ability to infer patterns from examples within a prompt. On a diverse set of synthetic HMMs, LLMs achieve predictive accuracy approaching the theoretical optimum. We uncover novel scaling trends influenced by HMM properties, and offer theoretical conjectures for these empirical observations. We also provide practical guidelines for scientists on using ICL as a diagnostic tool for complex data. On real-world animal decision-making tasks, ICL achieves competitive performance with models designed by human experts. To our knowledge, this is the first demonstration that ICL can learn and predict HMM-generated sequences – an advance that deepens our understanding of in-context learning in LLMs and establishes its potential as a powerful tool for uncovering hidden structure in complex scientific data.
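The experimental setup is easy to reproduce in miniature: sample a sequence from a synthetic HMM, place it in a prompt, and compare the LLM's next-symbol prediction against the Bayes-optimal predictor. The sketch below covers the HMM side and the optimal baseline (the forward algorithm); the LLM call itself is elided.

```python
# Sample observation sequences from a synthetic HMM and compute the Bayes-
# optimal next-symbol distribution via the forward algorithm, the comparison
# point for ICL accuracy. The LLM prompt/call is omitted.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # hidden-state transition matrix
B = np.array([[0.8, 0.2], [0.3, 0.7]])   # emission matrix P(obs | state)
pi = np.array([0.5, 0.5])

def sample(T: int) -> list[int]:
    s, out = rng.choice(2, p=pi), []
    for _ in range(T):
        out.append(rng.choice(2, p=B[s]))
        s = rng.choice(2, p=A[s])
    return out

def optimal_next_obs_dist(obs: list[int]) -> np.ndarray:
    """Forward algorithm: filter the hidden state, then push through A and B."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()             # normalize for numerical stability
    return (alpha @ A) @ B               # P(next obs | history)

seq = sample(50)
print("prompt symbols:", "".join(map(str, seq)))  # feed this to an LLM in-context
print("optimal P(next=0,1):", optimal_next_obs_dist(seq))
```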
[AI-65] Secondary Stakeholders in AI: Fighting for Brokering and Navigating Agency
[Quick Read]: This paper addresses the lack of attention to secondary stakeholders in current participatory AI research, which tends to focus on primary stakeholders such as end-users. The key contribution is extending the ideals of participatory AI to a broader population of primary and secondary stakeholders through semi-structured interviews, proposing informedness, consent, and agency as the core participatory ideals, identifying three stakeholder archetypes, and examining the systemic barriers they face in realizing those ideals within a complicated problem space.
Link: https://arxiv.org/abs/2506.07281
Authors: Leah Hope Ajmani, Nuredin Ali Abdelkadir, Stevie Chancellor
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:As AI technologies become more human-facing, there have been numerous calls to adapt participatory approaches to AI development – spurring the idea of participatory AI. However, these calls often focus only on primary stakeholders, such as end-users, and not secondary stakeholders. This paper seeks to translate the ideals of participatory AI to a broader population of secondary AI stakeholders through semi-structured interviews. We theorize that meaningful participation involves three participatory ideals: (1) informedness, (2) consent, and (3) agency. We also explore how secondary stakeholders realize these ideals by traversing a complicated problem space. Like walking up the rungs of a ladder, these ideals build on one another. We introduce three stakeholder archetypes: the reluctant data contributor, the unsupported activist, and the well-intentioned practitioner, who must navigate systemic barriers to achieving agentic AI relationships. We envision an AI future where secondary stakeholders are able to meaningfully participate with the AI systems they influence and are influenced by.
[AI-66] Tokenized Bandit for LLM Decoding and Alignment ICML2025
[Quick Read]: This paper studies sequential-token-selection variants of linear and stochastic multi-armed bandits in the context of LLM decoding and alignment, namely the tokenized linear bandit (TLB) and tokenized multi-armed bandit (TMAB). The core challenge is that in each round the decision maker must irrevocably select tokens one by one and then learn from a stochastic user-feedback utility whose expectation is determined by a query-dependent sequence function. The paper shows that learning is infeasible without assumptions on the structure of the sequence function, introduces the natural diminishing distance with more commons (DDMC) assumption, and designs algorithms with regret bounds of $\tilde{O}(L\sqrt{T})$ for TLB and $\tilde{O}(L T^{2/3})$ for TMAB. The key is that DDMC makes effective learning possible; it also yields the (almost) optimality of greedy decoding in LLMs and points to applications in decoding-time LLM alignment.
Link: https://arxiv.org/abs/2506.07276
Authors: Suho Shin, Chenghao Yang, Haifeng Xu, Mohammad T. Hajiaghayi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: To appear at ICML 2025
Abstract:We introduce the tokenized linear bandit (TLB) and multi-armed bandit (TMAB), variants of linear and stochastic multi-armed bandit problems inspired by LLM decoding and alignment. In these problems, at each round $t \in [T]$, a user submits a query (context), and the decision maker (DM) sequentially selects a token irrevocably from a token set. Once the sequence is complete, the DM observes a random utility from the user, whose expectation is presented by a sequence function mapping the chosen token sequence to a nonnegative real value that depends on the query. In both problems, we first show that learning is impossible without any structure on the sequence function. We introduce a natural assumption, diminishing distance with more commons (DDMC), and propose algorithms with regret $\tilde{O}(L\sqrt{T})$ and $\tilde{O}(L T^{2/3})$ for TLB and TMAB, respectively. As a side product, we obtain the (almost) optimality of greedy decoding for LLM decoding under DDMC, which justifies the unreasonable effectiveness of greedy decoding in several tasks. This also has an immediate application to decoding-time LLM alignment, when the misaligned utility can be represented as the frozen LLM's utility and a linearly realizable latent function. We finally validate our algorithm's performance empirically as well as verify our assumptions using synthetic and real-world datasets.
[AI-67] Subgoal-Guided Policy Heuristic Search with Learned Subgoals ICML-25
[Quick Read]: This paper addresses the sample inefficiency of training policies for policy tree search: training requires complete solution trajectories, so when starting from a randomly initialized policy on hard problem instances, learning is prohibitively costly and search samples are wasted on failed attempts. The key to the solution is a new method for learning subgoal-based policies, where subgoals and policies conditioned on subgoals are learned from the trees expanded during search, including the search trees of failed attempts, improving the sample efficiency of learning the policy and heuristic function.
Link: https://arxiv.org/abs/2506.07255
Authors: Jake Tuero, Michael Buro, Levi H. S. Lelis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to ICML-25
Abstract:Policy tree search is a family of tree search algorithms that use a policy to guide the search. These algorithms provide guarantees on the number of expansions required to solve a given problem that are based on the quality of the policy. While these algorithms have shown promising results, the process in which they are trained requires complete solution trajectories to train the policy. Search trajectories are obtained during a trial-and-error search process. When the training problem instances are hard, learning can be prohibitively costly, especially when starting from a randomly initialized policy. As a result, search samples are wasted in failed attempts to solve these hard instances. This paper introduces a novel method for learning subgoal-based policies for policy tree search algorithms. The subgoals and policies conditioned on subgoals are learned from the trees that the search expands while attempting to solve problems, including the search trees of failed attempts. We empirically show that our policy formulation and training method improve the sample efficiency of learning a policy and heuristic function in this online setting.
[AI-68] Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs
[Quick Read]: This paper addresses how to control the length of a large language model's (LLM) explicit thinking process, balancing reasoning depth against efficiency: reasoning that is too short may fail to capture task complexity, while reasoning that is too long leads to overthinking, extra computation, and degraded performance. The key to the solution is showing that LLMs encode their progress through the reasoning process, visualizing these planning dynamics with an interactive progress bar, and manipulating the internal progress encoding at inference time to skip unnecessary steps and produce a more concise and decisive chain of thought.
Link: https://arxiv.org/abs/2506.07240
Authors: Roy Eisenstadt, Itamar Zimerman, Lior Wolf
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recently, techniques such as explicit structured reasoning have demonstrated strong test-time scaling behavior by enforcing a separation between the model’s internal “thinking” process and the final response. A key factor influencing answer quality in this setting is the length of the thinking stage. When the reasoning is too short, the model may fail to capture the complexity of the task. Conversely, when it is too long, the model may overthink, leading to unnecessary computation and degraded performance. This paper explores and exploits the underlying mechanisms by which LLMs understand and regulate the length of their reasoning during explicit thought processes. First, we show that LLMs encode their progress through the reasoning process and introduce an interactive progress bar visualization, which is then used to reveal insights on the model’s planning dynamics. Second, we manipulate the internal progress encoding during inference to reduce unnecessary steps and generate a more concise and decisive chain of thoughts. Our empirical results demonstrate that this “overclocking” method mitigates overthinking, improves answer accuracy, and reduces inference latency. Our code is publicly available.
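The "progress encoding" claim suggests a simple probing experiment: regress the fraction of reasoning completed from per-step hidden states. The sketch below uses synthetic stand-in states (a real experiment would use the LLM's actual activations), so treat the setup as an assumption-laden illustration of the probe, not the paper's protocol.

```python
# Linear probe for reasoning progress: from hidden states at each step of a
# chain of thought, predict the relative position (0..1) within the chain.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
D, n_chains, max_len = 32, 200, 20
direction = rng.standard_normal(D)   # pretend progress is encoded linearly

X, y = [], []
for _ in range(n_chains):
    L = rng.integers(5, max_len)
    for t in range(L):
        progress = t / (L - 1)
        h = progress * direction + 0.5 * rng.standard_normal(D)  # noisy state
        X.append(h); y.append(progress)

probe = Ridge(alpha=1.0).fit(np.array(X), np.array(y))
print("train R^2:", round(probe.score(np.array(X), np.array(y)), 3))
# Intervening along the recovered progress direction at inference time is the
# kind of knob the paper's "overclocking" manipulates to shorten thinking.
```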
[AI-69] VeriLoC: Line-of-Code Level Prediction of Hardware Design Quality from Verilog Code
[Quick Read]: This paper addresses early-stage prediction of key design-quality metrics (such as timing and routing congestion) directly from Verilog code, in particular pinpointing the individual lines of code that cause timing violations or downstream routing congestion; existing work targets only module-level quality prediction, not line-level prediction. The key to the solution, VeriLoC, is to use recent Verilog code-generation LLMs to extract local line-level and module-level embeddings and to train downstream classifiers/regressors on concatenations of these embeddings, enabling efficient prediction of design quality at both the line and module level.
Link: https://arxiv.org/abs/2506.07239
Authors: Raghu Vamshi Hemadri, Jitendra Bhandari, Johann Knechtel, Badri P Gopalan, Ramesh Narayanaswamy, Ramesh Karri, Siddharth Garg
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern chip design is complex, and there is a crucial need for early-stage prediction of key design-quality metrics like timing and routing congestion directly from Verilog code (a commonly used programming language for hardware design). It is especially important yet complex to predict individual lines of code that cause timing violations or downstream routing congestion. Prior works have tried approaches like converting Verilog into an intermediate graph representation and using LLM embeddings alongside other features to predict module-level quality, but did not consider line-level quality prediction. We propose VeriLoC, the first method that predicts design quality directly from Verilog at both the line- and module-level. To this end, VeriLoC leverages recent Verilog code-generation LLMs to extract local line-level and module-level embeddings, and train downstream classifiers/regressors on concatenations of these embeddings. VeriLoC achieves high F1-scores of 0.86-0.95 for line-level congestion and timing prediction, and reduces the mean average percentage error from 14% - 18% for SOTA methods down to only 4%. We believe that VeriLoC embeddings and insights from our work will also be of value for other predictive and optimization tasks for complex hardware design.
[AI-70] Learn as Individuals, Evolve as a Team: Multi-agent LLMs Adaptation in Embodied Environments
[Quick Read]: This paper addresses the weak adaptation of large language models (LLMs) to multi-agent embodied scenarios: existing LLM-based planning algorithms struggle in complex, dynamic multi-agent interaction environments. The key to the solution is the Learn as Individuals, Evolve as a Team (LIET) paradigm: at the individual level, each agent learns a local utility function to better understand the embodied environment, and at the team level, agents collaboratively and iteratively maintain and update a shared cooperation knowledge list, improving decision-making and cooperative efficiency.
Link: https://arxiv.org/abs/2506.07232
Authors: Xinran Li, Chenjia Bai, Zijian Li, Jiakun Zheng, Ting Xiao, Jun Zhang
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Large language models (LLMs) possess extensive knowledge bases and strong reasoning capabilities, making them promising tools for complex, multi-agent planning in embodied environments. However, despite LLMs' advanced abilities and the sophisticated modular design of agentic methods, existing LLM-based planning algorithms remain limited by weak adaptation capabilities to multi-agent embodied scenarios. We address this limitation by introducing a framework that enables LLM agents to learn and evolve both before and during test time, equipping them with environment-relevant knowledge for better planning and enhanced communication for improved cooperation. Inspired by centralized training with decentralized execution in multi-agent reinforcement learning, we propose a Learn as Individuals, Evolve as a Team (LIET) paradigm for multi-agent LLMs adaptation. At the individual level, LLM agents learn a local utility function from exploratory datasets to better comprehend the embodied environment, which is then queried during test time to support informed decision-making. At the team level, LLM agents collaboratively and iteratively maintain and update a shared cooperation knowledge list based on new experiences, using it to guide more effective communication. By combining individual learning with team evolution, LIET enables comprehensive and flexible adaptation for LLM agents. Our experiments on Communicative Watch-And-Help and ThreeD-World Multi-Agent Transport benchmarks demonstrate that LIET, instantiated with both LLaMA and GPT-4o, outperforms existing baselines and exhibits strong cooperative planning abilities.
[AI-71] LLM-Enhanced Rapid-Reflex Async-Reflect Embodied Agent for Real-Time Decision-Making in Dynamically Changing Environments CVPR2025
【Quick Read】: This paper targets the performance bottleneck that decision latency imposes on LLM-based agents in high-risk, dynamically changing scenarios. The key solution is a Time Conversion Mechanism (TCM) that converts inference delays in decision-making into an equivalent number of simulation frames, unifying cognitive and physical costs under a single FPS-based metric, and it builds a fully latency-aware evaluation protocol by introducing Respond Latency (RL) and the Latency-to-Action Ratio (LAR). The paper further presents the Rapid-Reflex Async-Reflect Agent (RRARA), which couples a lightweight LLM-guided feedback module with a rule-based agent to enable immediate reactive behaviors and asynchronous reflective refinements.
Link: https://arxiv.org/abs/2506.07223
Authors: Yangqing Zheng,Shunqi Mao,Dingxin Zhang,Weidong Cai
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted by the CVPR 2025 Embodied AI Workshop
Abstract:In the realm of embodied intelligence, the evolution of large language models (LLMs) has markedly enhanced agent decision making. Consequently, researchers have begun exploring agent performance in dynamically changing high-risk scenarios, i.e., fire, flood, and wind scenarios in the HAZARD benchmark. Under these extreme conditions, the delay in decision making emerges as a crucial yet insufficiently studied issue. We propose a Time Conversion Mechanism (TCM) that translates inference delays in decision-making into equivalent simulation frames, thus aligning cognitive and physical costs under a single FPS-based metric. By extending HAZARD with Respond Latency (RL) and Latency-to-Action Ratio (LAR), we deliver a fully latency-aware evaluation protocol. Moreover, we present the Rapid-Reflex Async-Reflect Agent (RRARA), which couples a lightweight LLM-guided feedback module with a rule-based agent to enable immediate reactive behaviors and asynchronous reflective refinements in situ. Experiments on HAZARD show that RRARA substantially outperforms existing baselines in latency-sensitive scenarios.
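The latency accounting behind TCM is easy to make concrete. The sketch below shows one plausible reading of the conversion and of the LAR metric; the function names and the ceiling rounding rule are assumptions, not the paper's exact definitions.

```python
import math

def latency_to_frames(inference_latency_s: float, fps: float = 30.0) -> int:
    """Convert an agent's decision latency into equivalent simulation frames (TCM idea).

    Assumption: the simulated world keeps advancing while the agent 'thinks', so a
    decision taking `inference_latency_s` seconds costs ceil(latency * fps) frames.
    """
    return math.ceil(inference_latency_s * fps)

def latency_to_action_ratio(total_inference_s: float, episode_s: float) -> float:
    """LAR: fraction of episode wall-time spent deliberating (assumed definition)."""
    return total_inference_s / episode_s

# Example: a 0.4 s LLM call at 30 FPS costs 12 simulated frames of the hazard.
print(latency_to_frames(0.4))            # -> 12
print(latency_to_action_ratio(6.0, 60))  # -> 0.1
```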
[AI-72] BIMgent: Towards Autonomous Building Modeling via Computer-use Agents ICML2025
【Quick Read】: This paper tackles the automation of the 3D building modeling process in the Architecture, Engineering, and Construction (AEC) sector, which involves open-ended design tasks and complex interaction patterns within Building Information Modeling (BIM) authoring software that current research has yet to address in depth. The proposed solution is BIMgent, whose key idea is an agentic framework powered by multimodal large language models (LLMs) that performs autonomous building model authoring through graphical user interface (GUI) operations, covering multimodal input for conceptual design, planning of software-specific workflows, and efficient execution of the authoring GUI actions.
Link: https://arxiv.org/abs/2506.07217
Authors: Zihan Deng,Changyu Du,Stavros Nousias,André Borrmann
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: ICML 2025 Workshop on Computer Use Agents
Abstract:Existing computer-use agents primarily focus on general-purpose desktop automation tasks, with limited exploration of their application in highly specialized domains. In particular, the 3D building modeling process in the Architecture, Engineering, and Construction (AEC) sector involves open-ended design tasks and complex interaction patterns within Building Information Modeling (BIM) authoring software, which has yet to be thoroughly addressed by current studies. In this paper, we propose BIMgent, an agentic framework powered by multimodal large language models (LLMs), designed to enable autonomous building model authoring via graphical user interface (GUI) operations. BIMgent automates the architectural building modeling process, including multimodal input for conceptual design, planning of software-specific workflows, and efficient execution of the authoring GUI actions. We evaluate BIMgent on real-world building modeling tasks, including both text-based conceptual design generation and reconstruction from existing building design. The design quality achieved by BIMgent was found to be reasonable. Its operations achieved a 32% success rate, whereas all baseline models failed to complete the tasks (0% success rate). Results demonstrate that BIMgent effectively reduces manual workload while preserving design intent, highlighting its potential for practical deployment in real-world architectural modeling scenarios.
[AI-73] Sword and Shield: Uses and Strategies of LLMs in Navigating Disinformation
【Quick Read】: This paper examines the dual role of large language models (LLMs) in the spread of disinformation: they can be abused to generate sophisticated disinformation, yet they can also strengthen detection and mitigation strategies. The key to the study is a communication game simulating online forums, used to analyze how different roles (Disinformers, Moderators, and Users) leverage LLMs to advance their goals, revealing both the potential uses and the risks of LLMs within this complex dynamic. The study stresses the need to understand LLMs' effectiveness in different contexts and offers recommendations for future LLM development and online platform design that balance user empowerment and trust building against the risks of LLM-assisted disinformation.
Link: https://arxiv.org/abs/2506.07211
Authors: Gionnieve Lim,Bryan Chen Zhengyu Tan,Kellie Yu Hui Sim,Weiyan Shi,Ming Hui Chew,Ming Shan Hee,Roy Ka-Wei Lee,Simon T. Perrault,Kenny Tsu Wei Choo
Affiliations: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments:
Abstract:The emergence of Large Language Models (LLMs) presents a dual challenge in the fight against disinformation. These powerful tools, capable of generating human-like text at scale, can be weaponised to produce sophisticated and persuasive disinformation, yet they also hold promise for enhancing detection and mitigation strategies. This paper investigates the complex dynamics between LLMs and disinformation through a communication game that simulates online forums, inspired by the game Werewolf, with 25 participants. We analyse how Disinformers, Moderators, and Users leverage LLMs to advance their goals, revealing both the potential for misuse and the potential for combating disinformation. Our findings highlight the varying uses of LLMs depending on the participants' roles and strategies, underscoring the importance of understanding their effectiveness in this context. We conclude by discussing implications for future LLM development and online platform design, advocating for a balanced approach that empowers users and fosters trust while mitigating the risks of LLM-assisted disinformation.
[AI-74] Reasoning Multimodal Large Language Model: Data Contamination and Dynamic Evaluation
【Quick Read】: This paper addresses potential data contamination in the training of multimodal large language models (MLLMs), which can inflate test-set performance and mask true generalization ability. The key to the solution is a dynamic evaluation framework that perturbs the task itself rather than the input to assess generalization more rigorously: using the same visual input, models are evaluated across a family of tasks (question answering, captioning, question posing, verification, and so on) to probe diverse capabilities and determine whether performance relies on superficial task-specific cues or reflects genuine generalization.
Link: https://arxiv.org/abs/2506.07202
Authors: Ming Liu,Wensheng Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) show impressive vision-language benchmark performance, yet growing concerns about data contamination (test set exposure during training) risk masking true generalization. This concern extends to reasoning MLLMs, often fine-tuned via reinforcement learning from potentially contaminated base models. We propose a novel dynamic evaluation framework to rigorously assess MLLM generalization, moving beyond static benchmarks. Instead of perturbing inputs, we perturb the task itself. Using the same visual input, models are evaluated across a family of tasks (e.g., QA, captioning, question posing, verification) to probe diverse capabilities. This task perturbation reveals whether model performance is robust or reliant on superficial task-specific cues. Our approach is analogous to loss landscape sharpness: models overfit or contaminated for a single task (sharp minima) falter under task shifts, unlike models with generalizable solutions (flatter minima). We developed an automated pipeline with a calibrated judge scoring open-ended generations (captions, questions) using paraphrase and corruption sampling. Applying this framework to leading image/video MLLMs on benchmarks including MME, RealWorldQA, and CVRR-ES, we analyze each model’s cross-task “ability vector.” We demonstrate that fine-tuning on simulated test data (extreme contamination) drastically sharpens task-specific performance but harms overall generalization. Our dynamic task perturbation offers deeper insights into MLLM generalization, distinguishing genuine understanding from spurious leakage or overfitting.
[AI-75] Exploring Effective Strategies for Building a Customised GPT Agent for Coding Classroom Dialogues
【Quick Read】: This paper explores strategies for using generative AI to automate the coding of classroom dialogue, a process that is otherwise labor-intensive, time-consuming, and demanding of a nuanced understanding of dialogic functions. The key to the solution is a design-based study of GPT-4's MyGPT agent that evaluates its baseline performance when coding classroom dialogue with a human codebook, examines how performance varies with different example inputs, and identifies a set of practical strategies for configuring effective coding agents under limited-data conditions.
Link: https://arxiv.org/abs/2506.07194
Authors: Luwei Bai,Dongkeun Han,Sara Hennessy
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Draft technical report. 39 pages, 2 figures. Not yet submitted for publication. Update expected
Abstract:This study investigates effective strategies for developing a customised GPT agent to code classroom dialogue. While classroom dialogue is widely recognised as a crucial element of education, its analysis remains challenging due to the need for a nuanced understanding of dialogic functions and the labour-intensive nature of manual transcript coding. Recent advancements in large language models offer promising avenues for automating this process. However, existing studies predominantly focus on training large-scale models or evaluating pre-trained models with fixed codebooks, which are often not applicable or replicable for dialogue researchers working with small datasets or customised coding schemes. Using GPT-4’s MyGPT agent as a case, this study evaluates its baseline performance in coding classroom dialogue with a human codebook and examines how performance varies with different example inputs through a variable control method. Through a design-based research approach, it identifies a set of practical strategies, based on MyGPT’s unique features, for configuring effective agents with limited data. The findings suggest that, despite some limitations, a MyGPT agent developed with these strategies can serve as a useful coding assistant by generating coding suggestions.
[AI-76] Regularized Adaptive Graph Learning for Large-Scale Traffic Forecasting
【Quick Read】: This paper addresses two problems in adaptive graph convolution networks for traffic forecasting: insufficient regularization of node embeddings and the high computational complexity of graph convolution operations. The key to the solution is the Regularized Adaptive Graph Learning (RAGL) model, which combines Stochastic Shared Embedding (SSE) with adaptive graph convolution via a residual difference mechanism to achieve both embedding regularization and noise suppression; for scalability on large road networks, it further introduces the Efficient Cosine Operator (ECO), which performs graph convolution based on the cosine similarity of regularized embeddings in linear time.
Link: https://arxiv.org/abs/2506.07179
Authors: Kaiqi Wu,Weiyang Kong,Sen Zhang,Yubao Liu,Zitong Chen
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Traffic prediction is a critical task in spatial-temporal forecasting with broad applications in travel planning and urban management. Adaptive graph convolution networks have emerged as mainstream solutions due to their ability to learn node embeddings in a data-driven manner and capture complex latent dependencies. However, existing adaptive graph learning methods for traffic forecasting often either ignore the regularization of node embeddings, which account for a significant proportion of model parameters, or face scalability issues from expensive graph convolution operations. To address these challenges, we propose a Regularized Adaptive Graph Learning (RAGL) model. First, we introduce a regularized adaptive graph learning framework that synergizes Stochastic Shared Embedding (SSE) and adaptive graph convolution via a residual difference mechanism, achieving both embedding regularization and noise suppression. Second, to ensure scalability on large road networks, we develop the Efficient Cosine Operator (ECO), which performs graph convolution based on the cosine similarity of regularized embeddings with linear time complexity. Extensive experiments on four large-scale real-world traffic datasets show that RAGL consistently outperforms state-of-the-art methods in terms of prediction accuracy and exhibits competitive computational efficiency.
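The linear-time claim for ECO follows from associativity: if the adjacency is the cosine-similarity matrix of row-normalized embeddings, A = Ê Êᵀ, then AX can be computed as Ê(ÊᵀX) without ever materializing the N×N matrix. The sketch below illustrates this trick; the degree normalization and epsilon constants are assumptions, not the paper's exact operator.

```python
import numpy as np

def eco_graph_conv(E: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Cosine-similarity graph convolution in O(N*d*F) (sketch of the ECO idea).

    E: (N, d) regularized node embeddings; X: (N, F) node features.
    The dense adjacency A = E_hat @ E_hat.T is never formed: by associativity,
    A @ X = E_hat @ (E_hat.T @ X), which is linear in the number of nodes N.
    """
    E_hat = E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-8)  # row-normalize
    msg = E_hat.T @ X                  # (d, F): cost O(N*d*F)
    out = E_hat @ msg                  # (N, F): cost O(N*d*F)
    # Optional degree normalization (assumed): divide by each node's total similarity,
    # also computed in linear time via the same associativity trick.
    deg = E_hat @ (E_hat.T @ np.ones((E.shape[0], 1)))
    return out / (deg + 1e-8)

X_out = eco_graph_conv(np.random.randn(1000, 16), np.random.randn(1000, 8))
print(X_out.shape)  # (1000, 8), with no 1000x1000 adjacency ever allocated
```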
[AI-77] Translating Federated Learning Algorithms in Python into CSP Processes Using ChatGPT
【Quick Read】: This paper addresses the automatic translation of federated learning (FL) algorithms into Communicating Sequential Processes (CSP) for formal verification. Whereas previous work relied on manual translation, this paper introduces a ChatGPT-based automated translation workflow whose key ideas are estimating the minimality of the supplied context from ChatGPT's feedback and verifying the correctness of the translated processes with the model checker PAT.
Link: https://arxiv.org/abs/2506.07173
Authors: Miroslav Popovic,Marko Popovic,Miodrag Djukic,Ilija Basicevic
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 tables
Abstract:The Python Testbed for Federated Learning Algorithms is a simple Python FL framework that is easy to use by ML/AI developers who do not need to be professional programmers, and it is also amenable to LLMs. In previous research, generic federated learning algorithms provided by this framework were manually translated into CSP processes, and the algorithms' safety and liveness properties were automatically verified by the model checker PAT. In this paper, a simple translation process is introduced wherein ChatGPT is used to automate the translation of the mentioned federated learning algorithms in Python into the corresponding CSP processes. Within the process, the minimality of the used context is estimated based on the feedback from ChatGPT. The proposed translation process was experimentally validated by the successful translation (verified by the model checker PAT) of both generic centralized and decentralized federated learning algorithms.
[AI-78] AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models ACL2025
【Quick Read】: This paper tackles two core problems in multi-objective preference alignment for large language models (LLMs): the difficulty of effectively balancing multiple preference dimensions, and the high computational cost of relying on auxiliary reward/reference models. The key to the solution is the Adaptive Multi-objective Preference Optimization (AMoPO) framework, which adopts a multi-objective optimization paradigm that uses dimension-aware generation metrics as implicit rewards, dynamically balancing preference dimensions without additional reward or reference models. By modeling the generation space as a Gaussian distribution, an adaptive weight-assignment mechanism dynamically prioritizes the different preference dimensions.
Link: https://arxiv.org/abs/2506.07165
Authors: Qi Liu,Jingqing Ruan,Hao Li,Haodong Zhao,Desheng Wang,Jiansong Chen,Wan Guanglu,Xunliang Cai,Zhi Zheng,Tong Xu
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ACL 2025
Abstract:Existing multi-objective preference alignment methods for large language models (LLMs) face limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use the dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO’s capability to achieve dimension-aware preference alignment, highlighting its superiority. Our codes and datasets are available at this https URL.
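The abstract does not spell out the adaptive weighting rule; the following is one plausible reading of "modeling the generation space as a Gaussian distribution" for dynamic prioritization, with every name and constant an assumption rather than the paper's formula. Each dimension's metric history is standardized under its own Gaussian model, and dimensions whose latest score lags their history get up-weighted.

```python
import numpy as np

def adaptive_weights(scores: np.ndarray) -> np.ndarray:
    """One plausible dimension-weighting rule in the spirit of AMoPO (assumed, illustrative).

    scores: (T, K) history of K dimension-aware generation metrics over T steps.
    Each dimension is modeled as a Gaussian; a low z-score for the latest value
    means the dimension is lagging, so it receives a larger softmax weight.
    """
    mu, sigma = scores.mean(axis=0), scores.std(axis=0) + 1e-8
    z = (scores[-1] - mu) / sigma           # latest score relative to its own history
    w = np.exp(-z) / np.exp(-z).sum()       # lagging dimensions -> larger weight
    return w

print(adaptive_weights(np.random.rand(100, 3)))  # weights over 3 preference dimensions
```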
[AI-79] Mind the Web: The Security of Web Use Agents
【Quick Read】: This paper addresses the security risks created by the high-privilege capabilities of generative AI web-use agents performing complex web tasks. The core problem is that, while handling legitimate browsing tasks, these agents can be hijacked by malicious content (such as comments, reviews, or advertisements), leading to privacy leakage, data tampering, and loss of availability. The key contribution is the task-aligned injection technique, which disguises malicious commands as guidance that appears helpful for completing the task, exploiting LLMs' limitations in contextual reasoning so that agents fail to recognize manipulation that deviates from the original task goal. This allows attackers to bypass agents' built-in safety mechanisms and mount a variety of attacks.
Link: https://arxiv.org/abs/2506.07153
Authors: Avishag Shapira,Parth Atulbhai Gandhi,Edan Habler,Oleg Brodt,Asaf Shabtai
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Web-use agents are rapidly being deployed to automate complex web tasks, operating with extensive browser capabilities including multi-tab navigation, DOM manipulation, JavaScript execution and authenticated session access. However, these powerful capabilities create a critical and previously unexplored attack surface. This paper demonstrates how attackers can exploit web-use agents' high-privilege capabilities by embedding malicious content in web pages such as comments, reviews, or advertisements that agents encounter during legitimate browsing tasks. In addition, we introduce the task-aligned injection technique, which frames malicious commands as helpful task guidance rather than obvious attacks. This technique exploits fundamental limitations in LLMs' contextual reasoning: agents struggle to maintain coherent contextual awareness and fail to detect when seemingly helpful web content contains steering attempts that deviate from their original task goal. Through systematic evaluation of four popular agents (OpenAI Operator, Browser Use, Do Browser, OpenOperator), we demonstrate nine payload types that compromise confidentiality, integrity, and availability, including unauthorized camera activation, user impersonation, local file exfiltration, password leakage, and denial of service, with validation across multiple LLMs achieving success rates of 80%-100%. These payloads succeed across agents with built-in safety mechanisms, requiring only the ability to post content on public websites, creating unprecedented risks given the ease of exploitation combined with agents' high-privilege access. To address this attack, we propose comprehensive mitigation strategies including oversight mechanisms, execution constraints, and task-aware reasoning techniques, providing practical directions for secure development and deployment.
[AI-80] Taxonomy of migration scenarios for Qiskit refactoring using LLMs
【Quick Read】: This paper addresses the code-refactoring difficulties caused by frequent updates to quantum programming libraries, difficulties that differ fundamentally from classical software-engineering refactoring because of the nature of quantum computing software. The key to the solution is constructing a taxonomy of quantum circuit refactoring problems: large language models (LLMs) and expert developers each produce a taxonomy, and the two are compared and integrated into a unified taxonomy that provides a foundation for future research on AI-assisted migration and enables more rigorous evaluation of automated refactoring techniques.
Link: https://arxiv.org/abs/2506.07135
Authors: José Manuel Suárez,Luís Mariano Bibbó,Joaquín Bogado,Alejandro Fernandez
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: Accepted for publication in ASQC JAIIO 54 ( this https URL )
Abstract:As quantum computing advances, quantum programming libraries' heterogeneity and steady evolution create new challenges for software developers. Frequent updates in software libraries break working code that needs to be refactored, thus adding complexity to an already complex landscape. These refactoring challenges are, in many cases, fundamentally different from those known in classical software engineering due to the nature of quantum computing software. This study addresses these challenges by developing a taxonomy of quantum circuit refactoring problems, providing a structured framework to analyze and compare different refactoring approaches. Large Language Models (LLMs) have proven valuable tools for classic software development, yet their value in quantum software engineering remains unexplored. This study uses LLMs to categorize refactoring needs in migration scenarios between different Qiskit versions. Qiskit documentation and release notes were scrutinized to create an initial taxonomy of refactoring required for migrating between Qiskit releases. Two taxonomies were produced: one by expert developers and one by an LLM. These taxonomies were compared, analyzing differences and similarities, and were integrated into a unified taxonomy that reflects the findings of both methods. By systematically categorizing refactoring challenges in Qiskit, the unified taxonomy is a foundation for future research on AI-assisted migration while enabling a more rigorous evaluation of automated refactoring techniques. Additionally, this work contributes to quantum software engineering (QSE) by enhancing software development workflows, improving language compatibility, and promoting best practices in quantum programming.
[AI-81] Reliable Critics: Monotonic Improvement and Convergence Guarantees for Reinforcement Learning
【Quick Read】: This paper addresses the difficulty of guaranteeing monotonic improvement for policy iteration when reinforcement learning (RL) algorithms use function approximation: the fundamental monotonic-improvement guarantee of classical policy iteration collapses even under linear function approximation. The key to the solution is the Reliable Policy Iteration (RPI) algorithm, which replaces the usual projection or Bellman-error minimization during policy evaluation with a Bellman-based constrained optimization, ensuring that value estimates are strictly monotonic and lower-bound the true return. Moreover, the limit of these estimates partially satisfies the unprojected Bellman equation, underscoring RPI's natural fit within the RL framework.
Link: https://arxiv.org/abs/2506.07134
Authors: Eshwar S. R.,Gugan Thoppe,Aditya Gopalan,Gal Dalal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: 19 pages
Abstract:Despite decades of research, it remains challenging to correctly use Reinforcement Learning (RL) algorithms with function approximation. A prime example is policy iteration, whose fundamental guarantee of monotonic improvement collapses even under linear function approximation. To address this issue, we introduce Reliable Policy Iteration (RPI). It replaces the common projection or Bellman-error minimization during policy evaluation with a Bellman-based constrained optimization. We prove that not only does RPI confer textbook monotonicity on its value estimates but these estimates also lower bound the true return. Also, their limit partially satisfies the unprojected Bellman equation, emphasizing RPI’s natural fit within RL. RPI is the first algorithm with such monotonicity and convergence guarantees under function approximation. For practical use, we provide a model-free variant of RPI that amounts to a novel critic. It can be readily integrated into primary model-free PI implementations such as DQN and DDPG. In classical control tasks, such RPI-enhanced variants consistently maintain their lower-bound guarantee while matching or surpassing the performance of all baseline methods.
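To see why a Bellman-based constrained optimization yields lower bounds, note that any V with V ≤ T_π V satisfies V ≤ V_π by monotonicity of the Bellman operator. The sketch below implements that observation as a linear program for linear features on a tabular MDP; this LP is my reading of "Bellman-based constrained optimization", not necessarily the paper's exact program.

```python
import numpy as np
from scipy.optimize import linprog

def reliable_evaluation(Phi, P_pi, r_pi, gamma):
    """Constrained policy evaluation: maximize sum(Phi @ w)
    subject to Phi @ w <= r_pi + gamma * P_pi @ Phi @ w (elementwise).

    Any feasible V = Phi @ w satisfies V <= T_pi V, hence V <= V_pi:
    the returned estimate provably lower-bounds the true return.
    """
    A_ub = Phi - gamma * (P_pi @ Phi)      # (S, d): (I - gamma P_pi) Phi w <= r_pi
    res = linprog(c=-Phi.sum(axis=0),      # maximize total estimated value
                  A_ub=A_ub, b_ub=r_pi,
                  bounds=[(None, None)] * Phi.shape[1])
    return Phi @ res.x

# Tiny 2-state example with identity features: exact evaluation is recovered.
P = np.array([[0.9, 0.1], [0.2, 0.8]]); r = np.array([1.0, 0.0])
print(reliable_evaluation(np.eye(2), P, r, gamma=0.9))
```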
[AI-82] Robotic Policy Learning via Human-assisted Action Preference Optimization
【Quick Read】: This paper addresses the problem that Vision-Language-Action (VLA) models, which depend on expert demonstrations, struggle to correct errors and learn from failures during real-world deployment. The key to the solution is HAPO (Human-assisted Action Preference Optimization), which collects interaction trajectories through human intervention and feeds them into an action preference optimization process, reducing the occurrence of failure actions and improving the adaptation of corrective actions. The method further introduces an adaptive reweighting algorithm to handle the irreversible interactions and token-probability mismatch that arise when preference optimization is brought into VLA models, enabling effective learning from binary desirability signals.
Link: https://arxiv.org/abs/2506.07127
Authors: Wenke xia,Yichu Yang,Hongtao Wu,Xiao Ma,Tao Kong,Di Hu
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:
Abstract:Establishing a reliable and iteratively refined robotic system is essential for deploying real-world applications. While Vision-Language-Action (VLA) models are widely recognized as the foundation model for such robotic deployment, their dependence on expert demonstrations hinders the crucial capabilities of correction and learning from failures. To mitigate this limitation, we introduce a Human-assisted Action Preference Optimization method named HAPO, designed to correct deployment failures and foster effective adaptation through preference alignment for VLA models. This method begins with a human-robot collaboration framework for reliable failure correction and interaction trajectory collection through human intervention. These human-intervention trajectories are further employed within the action preference optimization process, facilitating VLA models to mitigate failure action occurrences while enhancing corrective action adaptation. Specifically, we propose an adaptive reweighting algorithm to address the issues of irreversible interactions and token probability mismatch when introducing preference optimization into VLA models, facilitating model learning from binary desirability signals derived from interactions. Through combining these modules, our human-assisted action preference optimization method ensures reliable deployment and effective learning from failure for VLA models. The experiments conducted in simulation and real-world scenarios prove superior generalization and robustness of our framework across a variety of manipulation tasks.
[AI-83] MAGNet: A Multi-Scale Attention-Guided Graph Fusion Network for DRC Violation Detection
【Quick Read】: This paper aims to improve the efficiency and accuracy of design rule checking (DRC) in integrated circuit design, lowering cost and raising design efficiency. The key to the solution is the proposed MAGNet model, which fuses an improved U-Net with a graph neural network (GNN): a Dynamic Attention Module (DAM) and a Multi-Scale Convolution Module (MSCM) strengthen fine-grained, multi-scale spatial feature extraction; a pixel-aligned graph structure built from chip layout tiles models the topological relationships among pins; and a label amplification strategy during training increases the model's sensitivity to sparse violation patterns. Together these effectively combine spatial, semantic, and structural information, improving prediction accuracy and reducing false positives in DRC hotspot detection.
Link: https://arxiv.org/abs/2506.07126
Authors: Weihan Lu,Hong Cai Chen
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 9 pages, 12 figures, 2 tables
Abstract:Design rule checking (DRC) is of great significance for cost reduction and design efficiency improvement in integrated circuit (IC) designs. Machine-learning-based DRC has become an important approach in computer-aided design (CAD). In this paper, we propose MAGNet, a hybrid deep learning model that integrates an improved U-Net with a graph neural network for DRC violation prediction. The U-Net backbone is enhanced with a Dynamic Attention Module (DAM) and a Multi-Scale Convolution Module (MSCM) to strengthen its capability in extracting fine-grained and multi-scale spatial features. In parallel, we construct a pixel-aligned graph structure based on chip layout tiles, and apply a specialized GNN to model the topological relationships among pins. During graph construction, a graph-to-grid mapping is generated to align GNN features with the layout image. In addition, a label amplification strategy is adopted during training to enhance the model's sensitivity to sparse violation patterns. Overall, MAGNet effectively combines spatial, semantic, and structural information, achieving improved prediction accuracy and reduced false positive rates in DRC hotspot detection. Subsequently, through incremental training, we achieve a more sensitive discrimination ability for hotspots. The results demonstrate that, in comparison with ibUnet, RouteNet, and J-Net, MAGNet significantly outperforms these models, achieving substantial improvements in overall performance.
[AI-84] Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models
【Quick Read】: This paper addresses key shortcomings in safety evaluation for large language models (LLMs), in particular the limited diversity and coverage of attack strategies in conventional red-teaming. Traditional approaches pursue diversity through simplistic metrics (such as word frequency or sentence-embedding similarity) that may fail to capture genuine variation in attack strategies, and training a single attacker model restricts coverage across attack styles and risk categories. The key to the proposed Quality-Diversity Red-Teaming (QDRT) framework is achieving goal-driven diversity through behavior-conditioned training with an open-ended behavioral replay buffer, together with training multiple specialized attacker models that generate high-quality attacks across diverse styles and risk categories.
Link: https://arxiv.org/abs/2506.07121
Authors: Ren-Jian Wang,Ke Xue,Zeyu Qin,Ziniu Li,Sheng Tang,Hao-Tian Li,Shengcai Liu,Chao Qian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments:
Abstract:Ensuring safety of large language models (LLMs) is important. Red teaming–a systematic approach to identifying adversarial prompts that elicit harmful responses from target LLMs–has emerged as a crucial safety evaluation method. Within this framework, the diversity of adversarial prompts is essential for comprehensive safety assessments. We find that previous approaches to red-teaming may suffer from two key limitations. First, they often pursue diversity through simplistic metrics like word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. Second, the common practice of training a single attacker model restricts coverage across potential attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. Additionally, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including GPT-2, Llama-3, Gemma-2, and Qwen2.5. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.
[AI-85] RBA-FE: A Robust Brain-Inspired Audio Feature Extractor for Depression Diagnosis
【Quick Read】: This paper addresses the precision limits of audio feature extraction for depression diagnosis and interference from noise, especially where conventional deep learning models under-use audio features. The key to the solution is the proposed robust brain-inspired audio feature extractor (RBA-FE), which adopts an improved hierarchical network architecture and introduces the adaptive rate smooth leaky integrate-and-fire (ARSLIF) spiking-neuron model to emulate the brain attention system's mechanism of "retuning of cellular signal selectivity", improving the model's robustness in noisy environments.
Link: https://arxiv.org/abs/2506.07118
Authors: Yu-Xuan Wu,Ziyan Huang,Bin Hu,Zhi-Hong Guan
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 14 pages
Abstract:This article proposes a robust brain-inspired audio feature extractor (RBA-FE) model for depression diagnosis, using an improved hierarchical network architecture. Most deep learning models achieve state-of-the-art performance for image-based diagnostic tasks, ignoring the counterpart audio features. To address the noise challenge, RBA-FE leverages six acoustic features extracted from the raw audio, capturing both spatial characteristics and temporal dependencies. This hybrid attribute helps alleviate the precision limitation in audio feature extraction within other learning models like deep residual shrinkage networks. To deal with the noise issues, our model incorporates an improved spiking neuron model, called adaptive rate smooth leaky integrate-and-fire (ARSLIF). The ARSLIF model emulates the mechanism of "retuning of cellular signal selectivity" in the brain attention systems, which enhances the model's robustness against environmental noises in audio data. Experimental results demonstrate that RBA-FE achieves state-of-the-art accuracy on the MODMA dataset, with 0.8750 precision, 0.8974 accuracy, 0.8750 recall, and 0.8750 F1 score. Extensive experiments on the AVEC2014 and DAIC-WOZ datasets both show enhancements in noise robustness. Comparative analysis further indicates that the ARSLIF neuron model captures abnormal firing patterns in the feature extraction of depressive audio data, offering brain-inspired interpretability.
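The abstract does not give the ARSLIF equations, so the sketch below is a generic leaky integrate-and-fire neuron extended with a firing-rate-adaptive threshold, which is one reading of "adaptive rate smooth LIF"; all dynamics and constants are assumptions for illustration.

```python
import numpy as np

def arslif_like(inputs, tau=20.0, tau_rate=100.0, v_th0=1.0, k=0.5, dt=1.0):
    """Generic LIF neuron with a rate-adaptive threshold (assumed ARSLIF-style dynamics).

    The threshold rises when the smoothed firing rate is high, 'retuning'
    selectivity so that sustained, noise-like drive is progressively suppressed.
    """
    v, rate, spikes = 0.0, 0.0, []
    for x in inputs:
        v += dt / tau * (-v + x)             # leaky membrane integration
        v_th = v_th0 + k * rate              # threshold adapts to recent activity
        s = float(v >= v_th)
        if s:
            v = 0.0                          # reset membrane potential on spike
        rate += dt / tau_rate * (s - rate)   # smoothed (low-pass) firing rate
        spikes.append(s)
    return np.array(spikes)

print(arslif_like(np.random.rand(200) * 2.0).sum(), "spikes")
```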
[AI-86] BRIGHT+: Upgrading the BRIGHT Benchmark with MARCUS, a Multi-Agent RAG Clean-Up Suite EMNLP2025
【Quick Read】: This paper addresses how the structural noise and semantic incoherence of web-crawled data limit retrieval accuracy and multi-hop reasoning in Retrieval-Augmented Generation (RAG) systems. The key to the solution is MARCUS, a multi-agent pipeline built on large language models (LLMs) that systematically cleans and re-chunks the BRIGHT dataset through structural-noise removal and semantic segmentation, preserving answer-bearing spans while producing the higher-quality BRIGHT-Plus corpus and significantly improving retrieval precision and multi-hop reasoning performance.
Link: https://arxiv.org/abs/2506.07116
Authors: Liyang Chen,Yujun Cai,Jieqiong Dong,Yiwei Wang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures, 4 tables. Submitted to EMNLP 2025
Abstract:Retrieval-Augmented Generation (RAG) systems require corpora that are both structurally clean and semantically coherent. BRIGHT is a recent and influential benchmark designed to evaluate complex multi-hop retrieval across diverse, high-reasoning domains. However, its practical effectiveness is limited by common web-crawled artifacts - such as content redundancy and semantic discontinuity - that impair retrieval accuracy and downstream reasoning. Notably, we find that such issues are concentrated in seven StackExchange-derived subdomains, while other domains (e.g., Coding and Theorem-based content) remain relatively clean. In this study, we present MARCUS, a multi-agent pipeline that leverages large language models (LLMs) to systematically clean and re-chunk BRIGHT into a higher-quality corpus: BRIGHT-Plus. MARCUS applies dedicated agents for structural noise removal and semantic segmentation, preserving answer-bearing spans while improving contextual integrity. Experimental evaluations demonstrate that BRIGHT-Plus yields consistent and significant improvements in both retrieval accuracy and multi-hop reasoning across a diverse set of retrievers. We release both the BRIGHT-Plus corpus and the MARCUS pipeline to support future research on robust, reasoning-centric retrieval.
[AI-87] Towards Universal Offline Black-Box Optimization via Learning Language Model Embeddings ICML2025
【Quick Read】: This paper addresses the limited cross-domain generalization of universal black-box optimization (BBO) algorithms, whose core obstacle is the lack of a unified representation for heterogeneous numerical spaces; existing offline BBO methods are therefore confined to single-task, fixed-dimension settings. The key to the solution is exploiting language model (LM) embeddings, which capture latent relationships in a unifying way and enable universal optimization across data types, via an end-to-end next-token-prediction framework and by prioritizing the learning of latent spaces with strong representational capability. Experiments validate the universality and effectiveness of the proposed methods, suggesting that uniting LM priors with string embedding spaces can overcome the traditional barriers of universal BBO.
Link: https://arxiv.org/abs/2506.07109
Authors: Rong-Xi Tan,Ming Chen,Ke Xue,Yao Wang,Yaoyuan Wang,Sheng Fu,Chao Qian
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: ICML 2025
Abstract:The pursuit of universal black-box optimization (BBO) algorithms is a longstanding goal. However, unlike domains such as language or vision, where scaling structured data has driven generalization, progress in offline BBO remains hindered by the lack of unified representations for heterogeneous numerical spaces. Thus, existing offline BBO approaches are constrained to single-task and fixed-dimensional settings, failing to achieve cross-domain universal optimization. Recent advances in language models (LMs) offer a promising path forward: their embeddings capture latent relationships in a unifying way, enabling universal optimization across different data types possible. In this paper, we discuss multiple potential approaches, including an end-to-end learning framework in the form of next-token prediction, as well as prioritizing the learning of latent spaces with strong representational capabilities. To validate the effectiveness of these methods, we collect offline BBO tasks and data from open-source academic works for training. Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided at this https URL.
[AI-88] Filling the Missings: Spatiotemporal Data Imputation by Conditional Diffusion
【Quick Read】: This paper addresses missing data in spatiotemporal systems, a problem with serious consequences for modern applications such as environmental monitoring and urban traffic management. Existing machine learning and deep learning approaches struggle to model the complex dependencies between the spatial and temporal dimensions and, more importantly, accumulate errors during imputation. The proposed solution is CoFILL, a conditional diffusion model for spatiotemporal data imputation; its key ideas are exploiting diffusion models' ability to generate high-quality imputations without relying on error-prone prior estimates, and an innovative dual-stream architecture that processes time-domain and frequency-domain features in parallel, fusing them to capture both rapid fluctuations and underlying patterns in the data.
Link: https://arxiv.org/abs/2506.07099
Authors: Wenying He,Jieling Huang,Junhua Gu,Ji Zhang,Yude Bai
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages,3 figures
Abstract:Missing data in spatiotemporal systems presents a significant challenge for modern applications, ranging from environmental monitoring to urban traffic management. The integrity of spatiotemporal data often deteriorates due to hardware malfunctions and software failures in real-world deployments. Current approaches based on machine learning and deep learning struggle to model the intricate interdependencies between spatial and temporal dimensions effectively and, more importantly, suffer from cumulative errors during the data imputation process, which propagate and amplify through iterations. To address these limitations, we propose CoFILL, a novel Conditional Diffusion Model for spatiotemporal data imputation. CoFILL builds on the inherent advantages of diffusion models to generate high-quality imputations without relying on potentially error-prone prior estimates. It incorporates an innovative dual-stream architecture that processes temporal and frequency domain features in parallel. By fusing these complementary features, CoFILL captures both rapid fluctuations and underlying patterns in the data, which enables more robust imputation. The extensive experiments reveal that CoFILL’s noise prediction network successfully transforms random noise into meaningful values that align with the true data distribution. The results also show that CoFILL outperforms state-of-the-art methods in imputation accuracy. The source code is publicly available at this https URL.
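The dual-stream idea of pairing temporal features with frequency-domain features can be illustrated with a minimal sketch; the specific feature choices, the fixed-size frequency summary, and concatenation as the fusion step are all assumptions, not CoFILL's actual encoder.

```python
import numpy as np

def dual_stream_features(x: np.ndarray) -> np.ndarray:
    """Parallel time/frequency features for a (T,) series, in the spirit of a
    dual-stream encoder (the concat fusion and feature choices are assumptions)."""
    time_feats = np.stack([x, np.gradient(x)], axis=-1)        # (T, 2): raw value + trend
    freq_mags = np.abs(np.fft.rfft(x))                         # (T//2+1,) spectrum magnitudes
    topk = np.sort(freq_mags)[-4:]                             # dominant rhythms (fixed summary)
    freq_tiled = np.tile(topk, (x.shape[0], 1))                # broadcast to every time step
    return np.concatenate([time_feats, freq_tiled], axis=-1)   # (T, 6) fused features

print(dual_stream_features(np.sin(np.linspace(0, 12, 128))).shape)  # (128, 6)
```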
[AI-89] Patient Similarity Computation for Clinical Decision Support: An Efficient Use of Data Transformation Combining Static and Time Series Data
【Quick Read】: This paper addresses patient similarity computation (PSC), a fundamental problem in healthcare informatics: measuring similarity among patients from their historical clinical records to improve clinical decision support. The key to the solution is a distributed patient similarity computation (DPSC) technique based on data transformation (DT) methods that effectively combines time-series and static data, using Dynamic Time Warping (DTW) for time-series similarity and distributed DTW computation to overcome DTW's efficiency limits on big data. Static data are preprocessed with adaptive Weight-of-Evidence (aWOE) and Z-score transformations, improving predictive performance while preserving patient privacy.
Link: https://arxiv.org/abs/2506.07092
Authors: Joydeb Kumar Sana,Mohammad M. Masud,M Sohel Rahman,M Saifur Rahman
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper presents a novel distributed patient similarity computation (DPSC) technique based on data transformation (DT) methods, utilizing an effective combination of time series and static data
Abstract:Patient similarity computation (PSC) is a fundamental problem in healthcare informatics. The aim of the patient similarity computation is to measure the similarity among patients according to their historical clinical records, which helps to improve clinical decision support. This paper presents a novel distributed patient similarity computation (DPSC) technique based on data transformation (DT) methods, utilizing an effective combination of time series and static data. Time series data are sensor-collected patients' information, including metrics like heart rate, blood pressure, oxygen saturation, respiration, etc. The static data are mainly patient background and demographic data, including age, weight, height, gender, etc. Static data has been used for clustering the patients. Before feeding the static data to the machine learning model, adaptive Weight-of-Evidence (aWOE) and Z-score data transformation (DT) methods have been performed, which improve the prediction performances. In aWOE-based patient similarity models, sensitive patient information has been processed using aWOE, which preserves the data privacy of the trained models. We used the Dynamic Time Warping (DTW) approach, which is robust and very popular, for time series similarity. However, DTW is not suitable for big data due to the significant computational run-time. To overcome this problem, distributed DTW computation is used in this study. For Coronary Artery Disease, our DT-based approach boosts prediction performance by as much as 11.4%, 10.20%, and 12.6% in terms of AUC, accuracy, and F-measure, respectively. In the case of Congestive Heart Failure (CHF), our proposed method achieves performance enhancement up to 15.9%, 10.5%, and 21.9% for the same measures, respectively. The proposed method reduces the computation time by as much as 40%.
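DTW, the time-series similarity measure the paper distributes at scale, is the classic O(nm) dynamic program below; this is the textbook algorithm, not the paper's distributed implementation.

```python
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic Time Warping distance between two 1-D series via the standard
    O(len(a)*len(b)) dynamic program: align sequences while allowing local
    stretching/compression of the time axis."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Example: heart-rate traces of two patients, different lengths.
hr_a = np.array([72, 75, 80, 78, 74], float)
hr_b = np.array([70, 72, 79, 81, 77, 73], float)
print(dtw(hr_a, hr_b))
```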
[AI-90] On the Generalization of Data-Assisted Control in port-Hamiltonian Systems (DAC-pH)
【Quick Read】: This paper addresses control of port-Hamiltonian (pH) systems under uncertainty, in particular the effective handling of parametric and structural uncertainties. The key to the solution is a hybrid control framework that uses Data-Assisted Control (DAC) to decompose the dynamics into two fixed-topology parts: the right-hand side (RHS), an intrinsic Hamiltonian flow handling worst-case parametric uncertainty, and the left-hand side (LHS), a dissipative/input flow addressing both structural and parametric uncertainty, coupled through a virtual port variable Π. Combining a nonlinear controller with reinforcement learning (RL), the approach preserves the system's inherent structure while improving control performance, interpretability, and safety.
Link: https://arxiv.org/abs/2506.07079
Authors: Mostafa Eslami,Maryam Babazadeh
Affiliations: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper presents an early investigation of Data-Assisted Control (DAC) with reinforcement learning, showcasing its potential through a simple example. Theoretical analysis is ongoing to establish formal support and guarantees for the proposed approach
Abstract:This paper introduces a hypothetical hybrid control framework for port-Hamiltonian (p\mathcal{H}) systems, employing a dynamic decomposition based on Data-Assisted Control (DAC). The system's evolution is split into two parts with fixed topology: Right-Hand Side (RHS) - an intrinsic Hamiltonian flow handling worst-case parametric uncertainties, and Left-Hand Side (LHS) - a dissipative/input flow addressing both structural and parametric uncertainties. A virtual port variable \Pi serves as the interface between these two components. A nonlinear controller manages the intrinsic Hamiltonian flow, determining a desired port control value \Pi_c. Concurrently, Reinforcement Learning (RL) is applied to the dissipative/input flow to learn an agent for providing optimal policy in mapping \Pi_c to the actual system input. This hybrid approach effectively manages RHS uncertainties while preserving the system's inherent structure. Key advantages include adjustable performance via LHS controller parameters, enhanced AI explainability and interpretability through the port variable \Pi, the ability to guarantee safety and state attainability with hard/soft constraints, reduced complexity in learning hypothesis classes compared to end-to-end solutions, and improved state/parameter estimation using LHS prior knowledge and system Hamiltonian to address partial observability. The paper details the p\mathcal{H} formulation, derives the decomposition, and presents the modular controller architecture. Beyond design, crucial aspects of stability and robustness analysis and synthesis are investigated, paving the way for deeper theoretical investigations. An application example, a pendulum with nonlinear dynamics, is simulated to demonstrate the approach's empirical and phenomenological benefits for future research.
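To ground the notation, a port-Hamiltonian system evolves as ẋ = (J − R)∇H(x) + G·u, where J is the skew-symmetric interconnection (the intrinsic Hamiltonian flow) and R the dissipation. The minimal pendulum sketch below, matching the paper's application example in spirit only, makes this split concrete; the parameters and Euler integration are assumptions.

```python
import numpy as np

def pendulum_pH_step(x, u, dt=0.01, m=1.0, l=1.0, b=0.1, g=9.81):
    """One Euler step of a port-Hamiltonian pendulum with state x = (theta, p):
    H = p^2 / (2 m l^2) + m g l (1 - cos theta),  xdot = (J - R) dH + G u."""
    theta, p = x
    dH = np.array([m * g * l * np.sin(theta),  # dH/dtheta
                   p / (m * l**2)])            # dH/dp
    J = np.array([[0.0, 1.0], [-1.0, 0.0]])    # intrinsic (RHS) Hamiltonian interconnection
    R = np.diag([0.0, b])                      # dissipation, part of the LHS flow
    G = np.array([0.0, 1.0])                   # input port
    return x + dt * ((J - R) @ dH + G * u)

x = np.array([np.pi / 4, 0.0])
for _ in range(100):                           # unforced, damped swing
    x = pendulum_pH_step(x, u=0.0)
print(x)
```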
[AI-91] Dual-Priv Pruning : Efficient Differential Private Fine-Tuning in Multimodal Large Language Models
【Quick Read】: This paper addresses the heavy computational overhead and model degradation encountered when applying differential privacy (DP) to multimodal large language models (MLLMs). The key to the solution is the Dual-Priv Pruning framework, which enables DP fine-tuning through two complementary pruning mechanisms: visual token pruning, which lowers input dimensionality by removing redundant visual information, and gradient-update pruning, which selectively prunes parameter updates according to noisy-gradient magnitude, mitigating the impact of noise and improving utility.
Link: https://arxiv.org/abs/2506.07077
Authors: Qianshan Wei,Jiaqi Li,Zihan You,Yi Zhan,Kecen Li,Jialin Wu,Xinfeng Li, Hengjun Liu,Yi Yu,Bin Cao,Yiwen Xu,Yang Liu,Guilin Qi
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Differential Privacy (DP) is a widely adopted technique, valued for its effectiveness in protecting the privacy of task-specific datasets, making it a critical tool for large language models. However, its effectiveness in Multimodal Large Language Models (MLLMs) remains uncertain. Applying Differential Privacy (DP) inherently introduces substantial computation overhead, a concern particularly relevant for MLLMs which process extensive textual and visual data. Furthermore, a critical challenge of DP is that the injected noise, necessary for privacy, scales with parameter dimensionality, leading to pronounced model degradation; This trade-off between privacy and utility complicates the application of Differential Privacy (DP) to complex architectures like MLLMs. To address these, we propose Dual-Priv Pruning, a framework that employs two complementary pruning mechanisms for DP fine-tuning in MLLMs: (i) visual token pruning to reduce input dimensionality by removing redundant visual information, and (ii) gradient-update pruning during the DP optimization process. This second mechanism selectively prunes parameter updates based on the magnitude of noisy gradients, aiming to mitigate noise impact and improve utility. Experiments demonstrate that our approach achieves competitive results with minimal performance degradation. In terms of computational efficiency, our approach consistently utilizes less memory than standard DP-SGD. While requiring only 1.74% more memory than zeroth-order methods which suffer from severe performance issues on A100 GPUs, our method demonstrates leading memory efficiency on H20 GPUs. To the best of our knowledge, we are the first to explore DP fine-tuning in MLLMs. Our code is coming soon.
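The second mechanism, pruning parameter updates by the magnitude of noisy gradients, can be sketched inside a standard DP-SGD step as below; the top-k selection rule, clip bound, and noise multiplier are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def dp_pruned_update(per_sample_grads, clip=1.0, noise_mult=1.0, keep_ratio=0.1):
    """DP-SGD step with gradient-update pruning (sketch; keep_ratio is assumed).

    1) clip each per-sample gradient to L2 norm <= clip,
    2) add Gaussian noise calibrated to the clip bound (standard DP-SGD),
    3) keep only the top-k coordinates of the noisy mean by magnitude, pruning
       small updates that are likely dominated by the injected noise.
    """
    g = np.stack([gi * min(1.0, clip / (np.linalg.norm(gi) + 1e-12))
                  for gi in per_sample_grads])
    noisy = g.mean(axis=0) + np.random.normal(
        0.0, noise_mult * clip / len(per_sample_grads), size=g.shape[1])
    k = max(1, int(keep_ratio * noisy.size))
    mask = np.zeros_like(noisy)
    mask[np.argsort(np.abs(noisy))[-k:]] = 1.0
    return noisy * mask

print(dp_pruned_update(np.random.randn(32, 100)))
```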
[AI-92] Reasoning Paths as Signals: Augmenting Multi-hop Fact Verification through Structural Reasoning Progression
【Quick Read】: This paper addresses the challenge automated fact-verification systems face with complex real-world factual claims, in particular accurately aggregating and reasoning over multi-hop evidence. Existing methods typically rely on static or shallow models that cannot capture the evolving structure of reasoning paths, leading to fragmented retrieval and limited interpretability. The key to the solution is a structural reasoning framework that explicitly models reasoning paths as structured graphs in both the evidence-retrieval and claim-verification stages, comprising a structure-enhanced retrieval mechanism, a reasoning-path-guided verification module, and a structure-aware reasoning mechanism that captures long-range dependencies across multi-hop evidence chains.
Link: https://arxiv.org/abs/2506.07075
Authors: Liwen Zheng,Chaozhuo Li,Haoran Jia,Xi Zhang
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:The growing complexity of factual claims in real-world scenarios presents significant challenges for automated fact verification systems, particularly in accurately aggregating and reasoning over multi-hop evidence. Existing approaches often rely on static or shallow models that fail to capture the evolving structure of reasoning paths, leading to fragmented retrieval and limited interpretability. To address these issues, we propose a Structural Reasoning framework for Multi-hop Fact Verification that explicitly models reasoning paths as structured graphs throughout both evidence retrieval and claim verification stages. Our method comprises two key modules: a structure-enhanced retrieval mechanism that constructs reasoning graphs to guide evidence collection, and a reasoning-path-guided verification module that incrementally builds subgraphs to represent evolving inference trajectories. We further incorporate a structure-aware reasoning mechanism that captures long-range dependencies across multi-hop evidence chains, enabling more precise verification. Extensive experiments on the FEVER and HoVer datasets demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of reasoning-path modeling in enhancing retrieval precision and verification accuracy.
[AI-93] Prime the search: Using large language models for guiding geometric task and motion planning by warm-starting tree search
【Quick Read】: This paper addresses the geometric task and motion planning (G-TAMP) problem of relocating a set of objects to designated areas amid movable obstacles. Traditional approaches guide the search with domain-independent heuristics or learning from planning experience, which typically demand substantial computation or data. The key to this work is leveraging the common-sense knowledge of large language models (LLMs) to guide G-TAMP task planning: a predicate-based prompt encodes geometric information derived from a motion planning algorithm, and Monte Carlo Tree Search (MCTS) over a hybrid action space is warm-started with the LLM's task plan, reducing computational cost and improving planning efficiency.
Link: https://arxiv.org/abs/2506.07062
Authors: Dongryung Lee,Sejune Joo,Kimin Lee,Beomjoon Kim
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: The International Journal of Robotics Research (IJRR)
Abstract:The problem of relocating a set of objects to designated areas amidst movable obstacles can be framed as a Geometric Task and Motion Planning (G-TAMP) problem, a subclass of task and motion planning (TAMP). Traditional approaches to G-TAMP have relied either on domain-independent heuristics or on learning from planning experience to guide the search, both of which typically demand significant computational resources or data. In contrast, humans often use common sense to intuitively decide which objects to manipulate in G-TAMP problems. Inspired by this, we propose leveraging Large Language Models (LLMs), which have common sense knowledge acquired from internet-scale data, to guide task planning in G-TAMP problems. To enable LLMs to perform geometric reasoning, we design a predicate-based prompt that encodes geometric information derived from a motion planning algorithm. We then query the LLM to generate a task plan, which is then used to search for a feasible set of continuous parameters. Since LLMs are prone to mistakes, instead of committing to LLM’s outputs, we extend Monte Carlo Tree Search (MCTS) to a hybrid action space and use the LLM to guide the search. Unlike the previous approach that calls an LLM at every node and incurs high computational costs, we use it to warm-start the MCTS with the nodes explored in completing the LLM’s task plan. On six different G-TAMP problems, we show our method outperforms previous LLM planners and pure search algorithms. Code can be found at: this https URL
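Warm-starting MCTS with the nodes explored while completing the LLM's plan can be sketched as pre-expanding the tree along the plan and crediting those nodes with initial visits, so that UCT initially prefers the LLM's trajectory without calling the LLM at every node. The Node fields and visit bonus below are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object
    children: dict = field(default_factory=dict)  # action -> Node
    visits: int = 0
    value: float = 0.0

def warm_start(root: Node, llm_plan, simulate, prior_visits=5):
    """Seed the search tree with the LLM task plan's (action, next_state) trace
    (sketch). Subsequent MCTS runs from this warmed tree rather than cold."""
    node = root
    for action, next_state in llm_plan:
        child = node.children.setdefault(action, Node(next_state))
        child.visits += prior_visits              # assumed warm-start credit
        child.value += prior_visits * simulate(next_state)
        node = child
    return root

root = warm_start(Node("start"),
                  llm_plan=[("pick(A)", "s1"), ("place(A, region)", "s2")],
                  simulate=lambda s: 0.5)          # placeholder rollout estimate
print(root.children)
```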
[AI-94] Policy Gradient with Tree Search: Avoiding Local Optimas through Lookahead
【Quick Read】: This paper addresses the tendency of classical policy gradient (PG) methods in reinforcement learning to converge to suboptimal local optima, a problem exacerbated in large or complex environments. The key to the solution is Policy Gradient with Tree Search (PGTS), which augments policy optimization with an m-step lookahead mechanism: increasing the tree-search depth m monotonically shrinks the set of undesirable stationary points and thereby improves the worst-case performance of the resulting stationary policy.
Link: https://arxiv.org/abs/2506.07054
Authors: Uri Koren,Navdeep Kumar,Uri Gadot,Giorgia Ramponi,Kfir Yehuda Levy,Shie Mannor
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Classical policy gradient (PG) methods in reinforcement learning frequently converge to suboptimal local optima, a challenge exacerbated in large or complex environments. This work investigates Policy Gradient with Tree Search (PGTS), an approach that integrates an m-step lookahead mechanism to enhance policy optimization. We provide theoretical analysis demonstrating that increasing the tree search depth m monotonically reduces the set of undesirable stationary points and, consequently, improves the worst-case performance of any resulting stationary policy. Critically, our analysis accommodates practical scenarios where policy updates are restricted to states visited by the current policy, rather than requiring updates across the entire state space. Empirical evaluations on diverse MDP structures, including Ladder, Tightrope, and Gridworld environments, illustrate PGTS's ability to exhibit "farsightedness," navigate challenging reward landscapes, escape local traps where standard PG fails, and achieve superior solutions.
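The m-step lookahead at the core of PGTS can be written as a recursive tree evaluation that bootstraps with the current value estimate at the leaves. The sketch below shows this generic operator on a tabular MDP; it illustrates the lookahead only, not the full PGTS update.

```python
import numpy as np

def lookahead_value(s, m, P, R, V, gamma=0.9):
    """m-step lookahead: exhaustively search the tree of actions to depth m,
    bootstrapping with the current value estimate V at the leaves."""
    if m == 0:
        return V[s]
    best = -np.inf
    for a in range(R.shape[1]):
        q = R[s, a] + gamma * sum(
            P[s, a, s2] * lookahead_value(s2, m - 1, P, R, V, gamma)
            for s2 in range(P.shape[0]))
        best = max(best, q)
    return best

# 2-state, 2-action toy MDP: action 0 gives a small immediate reward at state 0,
# action 1 leads to state 1 where the big reward lives. Deeper lookahead sees
# past the deceptive local reward that traps a myopic (m=1) update.
P = np.zeros((2, 2, 2)); P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 1] = 1.0
R = np.array([[0.1, 0.0], [0.0, 1.0]])
V = np.zeros(2)
print([lookahead_value(0, m, P, R, V) for m in (1, 2, 3)])  # [0.1, 0.9, 1.71]
```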
[AI-95] Mathesis: Towards Formal Theorem Proving from Natural Languages
【Quick Read】: This paper addresses the fact that large language model (LLM) theorem provers require expert-written formal statements as input, which limits their applicability to real-world problems stated in natural language. The key to the solution is the Mathesis system, whose core components are Mathesis-Autoformalizer, the first autoformalizer to use reinforcement learning to strengthen the formalization of natural-language problems, aided by the novel LeanScorer framework for nuanced assessment of formalization quality, and Mathesis-Prover, which generates formal proofs from the formalized statements. The end-to-end pipeline handles informal problem statements, improving the practical applicability of formal theorem proving.
Link: https://arxiv.org/abs/2506.07047
Authors: Yu Xuejun,Jianyuan Zhong,Zijin Feng,Pengyi Zhai,Roozbeh Yousefzadeh,Wei Chong Ng,Haoxiong Liu,Ziyi Shou,Jing Xiong,Yudong Zhou,Claudia Beth Ong,Austen Jeremy Sugiarto,Yaoxi Zhang,Wai Ming Tai,Huan Cao,Dongcai Lu,Jiacheng Sun,Qiang Xu,Shen Xin,Zhenguo Li
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent advances in large language models show strong promise for formal reasoning. However, most LLM-based theorem provers have long been constrained by the need for expert-written formal statements as inputs, limiting their applicability to real-world problems expressed in natural language. We tackle this gap with Mathesis, the first end-to-end theorem proving pipeline processing informal problem statements. It contributes Mathesis-Autoformalizer, the first autoformalizer using reinforcement learning to enhance the formalization ability of natural language problems, aided by our novel LeanScorer framework for nuanced formalization quality assessment. It also proposes a Mathesis-Prover, which generates formal proofs from the formalized statements. To evaluate the real-world applicability of end-to-end formal theorem proving, we introduce Gaokao-Formal, a benchmark of 488 complex problems from China’s national college entrance exam. Our approach is carefully designed, with a thorough study of each component. Experiments demonstrate Mathesis’s effectiveness, with the autoformalizer outperforming the best baseline by 22% in pass-rate on Gaokao-Formal. The full system surpasses other model combinations, achieving 64% accuracy on MiniF2F with pass@32 and a state-of-the-art 18% on Gaokao-Formal.
[AI-96] Efficient Q-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
【Quick Read】: This paper addresses the design of robust Q-learning and policy-gradient algorithms with non-asymptotic convergence guarantees for average reward Markov decision processes (AR-MDPs). The key is the proof that the robust Q Bellman operator is a strict contraction under a carefully constructed semi-norm that quotients out constant functions, supporting a stochastic-approximation update that learns the optimal robust Q function within \tilde{\mathcal{O}}(\epsilon^{-2}) samples. The same idea applies to robust Q-function estimation and hence critic estimation; combined with robust policy mirror descent theory, it yields a natural actor-critic algorithm that attains an \epsilon-optimal robust policy within \tilde{\mathcal{O}}(\epsilon^{-3}) samples.
Link: https://arxiv.org/abs/2506.07040
Authors: Yang Xu,Swetha Ganesh,Vaneet Aggarwal
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: arXiv admin note: text overlap with arXiv:2502.16816
Abstract:We present the first Q-learning and actor-critic algorithms for robust average reward Markov Decision Processes (MDPs) with non-asymptotic convergence under contamination, TV distance and Wasserstein distance uncertainty sets. We show that the robust Q Bellman operator is a strict contractive mapping with respect to a carefully constructed semi-norm with constant functions being quotiented out. This property supports a stochastic approximation update that learns the optimal robust Q function in \tilde{\mathcal{O}}(\epsilon^{-2}) samples. We also show that the same idea can be used for robust Q function estimation, which can be further used for critic estimation. Coupling it with theories in robust policy mirror descent update, we present a natural actor-critic algorithm that attains an \epsilon-optimal robust policy in \tilde{\mathcal{O}}(\epsilon^{-3}) samples. These results advance the theory of distributionally robust reinforcement learning in the average reward setting.
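For the contamination uncertainty set {(1-δ)p + δq : q arbitrary}, the adversarial next-state expectation has a well-known closed form, which makes a robust Q update easy to sketch. Note the sketch is a discounted illustration of the operator only; the paper works in the average-reward setting with a span semi-norm, which this code does not reproduce.

```python
import numpy as np

def robust_q_update(Q, s, a, r, s_next, delta=0.1, gamma=0.9, lr=0.1):
    """Robust Q-learning step under a contamination set {(1-delta) p + delta q}.

    The adversary can move delta of the transition mass anywhere, so the
    worst-case next-state value is:
        (1 - delta) * V(s_next) + delta * min_s V(s).
    """
    V = Q.max(axis=1)
    worst_next = (1 - delta) * V[s_next] + delta * V.min()
    target = r + gamma * worst_next
    Q[s, a] += lr * (target - Q[s, a])
    return Q

Q = np.zeros((4, 2))
Q = robust_q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```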
[AI-97] AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint
【Quick Read】: This paper addresses the trade-off between safety and utility that large language models (LLMs) face against malicious prompts, especially jailbreak attacks, in real deployments. Existing activation-steering methods add a refusal direction vector to the model's internal activations during inference to induce refusal behaviors, but applying the same vector indiscriminately over-refuses benign prompts and degrades performance. The key to the proposed AlphaSteer is treating activation steering as a learnable process with two principled objectives: for utility preservation, it learns to construct a nearly-zero steering vector for benign data under null-space constraints; for safety enhancement, it learns a refusal direction vector for malicious data with the help of linear regression.
Link: https://arxiv.org/abs/2506.07022
Authors: Leheng Sheng,Changshuo Shen,Weixiang Zhao,Junfeng Fang,Xiaohao Liu,Zhenkai Liang,Xiang Wang,An Zhang,Tat-Seng Chua
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:
Abstract:As LLMs are increasingly deployed in real-world applications, ensuring their ability to refuse malicious prompts, especially jailbreak attacks, is essential for safe and reliable use. Recently, activation steering has emerged as an effective approach for enhancing LLM safety by adding a refusal direction vector to internal activations of LLMs during inference, which will further induce the refusal behaviors of LLMs. However, indiscriminately applying activation steering fundamentally suffers from the trade-off between safety and utility, since the same steering vector can also lead to over-refusal and degraded performance on benign prompts. Although prior efforts, such as vector calibration and conditional steering, have attempted to mitigate this trade-off, their lack of theoretical grounding limits their robustness and effectiveness. To better address the trade-off between safety and utility, we present a theoretically grounded and empirically effective activation steering method called AlphaSteer. Specifically, it considers activation steering as a learnable process with two principled learning objectives: utility preservation and safety enhancement. For utility preservation, it learns to construct a nearly zero vector for steering benign data, with the null-space constraints. For safety enhancement, it learns to construct a refusal direction vector for steering malicious data, with the help of linear regression. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer, which significantly improves the safety of LLMs without compromising general capabilities. Our codes are available at this https URL.
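The null-space constraint for utility preservation can be made concrete: project a refusal direction onto the null space of the benign-activation matrix, so adding it (approximately) does not move benign inputs. The sketch below shows the linear-algebra core; the SVD rank threshold is an assumption, and AlphaSteer's learned, data-dependent steering is richer than this static projection.

```python
import numpy as np

def nullspace_steering(benign_acts: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Project a refusal direction onto the null space of benign activations,
    so that steering (approximately) leaves benign prompts untouched.

    benign_acts: (N, d) activations collected on benign data; refusal_dir: (d,).
    """
    _, S, Vt = np.linalg.svd(benign_acts, full_matrices=True)
    r = int((S > 1e-6 * S[0]).sum())                 # numerical rank (threshold assumed)
    P_null = np.eye(Vt.shape[0]) - Vt[:r].T @ Vt[:r] # projector onto the null space
    return P_null @ refusal_dir

d, k = 64, 16
benign = np.random.randn(200, k) @ np.random.randn(k, d)  # acts in a k-dim subspace
v_safe = nullspace_steering(benign, np.random.randn(d))
print(np.abs(benign @ v_safe).max())  # ~0: benign activations are unmoved by v_safe
```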
[AI-98] Deep regularization networks for inverse problems with noisy operators
【Quick Read】: This paper addresses regularization of large-scale inverse problems whose main operator is built from noisy data, aiming to accelerate the spatiotemporal regularization process enough to enable real-time imaging. The key to the solution is a supervised neural operator that maps each pattern on the right-hand side of the scattering equation to its associated regularization parameter. The network is trained in two steps: first on low-resolution regularization maps furnished by the Morozov discrepancy principle, then by optimizing network predictions through minimization of a Tikhonov loss function regulated by the validation loss, improving image quality. The approach learns directly from test data without prior knowledge of the optimal regularization maps, and a new adaptive loss-weight selection method supports imaging of damage evolution in complex environments.
Link: https://arxiv.org/abs/2506.07008
Authors: Fatemeh Pourahmadian,Yang Xu
Affiliations: Unknown
Subjects: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:
Abstract:A supervised learning approach is proposed for regularization of large inverse problems where the main operator is built from noisy data. This is germane to superresolution imaging via the sampling indicators of the inverse scattering theory. We aim to accelerate the spatiotemporal regularization process for this class of inverse problems to enable real-time imaging. In this approach, a neural operator maps each pattern on the right-hand side of the scattering equation to its affiliated regularization parameter. The network is trained in two steps which entails: (1) training on low-resolution regularization maps furnished by the Morozov discrepancy principle with nonoptimal thresholds, and (2) optimizing network predictions through minimization of the Tikhonov loss function regulated by the validation loss. Step 2 allows for tailoring of the approximate maps of Step 1 toward construction of higher quality images. This approach enables direct learning from test data and dispenses with the need for a-priori knowledge of the optimal regularization maps. The network, trained on low-resolution data, quickly generates dense regularization maps for high-resolution imaging. We highlight the importance of the training loss function on the network’s generalizability. In particular, we demonstrate that networks informed by the logic of discrepancy principle lead to images of higher contrast. In this case, the training process involves many-objective optimization. We propose a new method to adaptively select the appropriate loss weights during training without requiring an additional optimization process. The proposed approach is synthetically examined for imaging damage evolution in an elastic plate. The results indicate that the discrepancy-informed regularization networks not only accelerate the imaging process, but also remarkably enhance the image quality in complex environments.
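The Morozov discrepancy principle that furnishes the Step-1 training maps picks the Tikhonov parameter λ so the residual matches the noise level δ. A minimal reference implementation via bisection on log-λ (valid because the residual grows monotonically in λ):

```python
import numpy as np

def tikhonov(A, b, lam):
    """Tikhonov solution x = argmin ||Ax - b||^2 + lam * ||x||^2."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def morozov_lambda(A, b, delta, lo=1e-12, hi=1e3, iters=60):
    """Geometric bisection so that ||A x_lam - b|| ~= delta (the noise level)."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        res = np.linalg.norm(A @ tikhonov(A, b, mid) - b)
        lo, hi = (lo, mid) if res > delta else (mid, hi)
    return np.sqrt(lo * hi)

A = np.random.randn(50, 30); x_true = np.random.randn(30)
noise = 0.05 * np.random.randn(50)
b = A @ x_true + noise
lam = morozov_lambda(A, b, delta=np.linalg.norm(noise))
print(lam, np.linalg.norm(A @ tikhonov(A, b, lam) - b))  # residual ~= noise norm
```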
zh
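As a concrete reference point for the first training stage above, the sketch below selects a Tikhonov regularization parameter with the classical Morozov discrepancy principle on a toy linear problem. The operator, noise level, and bisection tolerances are illustrative assumptions.

```python
# Sketch: Morozov discrepancy principle for choosing the Tikhonov parameter.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(80, 60))            # toy forward operator
x_true = rng.normal(size=60)
noise = 0.05 * rng.normal(size=80)
b = A @ x_true + noise
delta = np.linalg.norm(noise)            # noise level (assumed known)

def tikhonov(alpha: float) -> np.ndarray:
    """x_alpha = argmin ||Ax - b||^2 + alpha ||x||^2 via normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ b)

def discrepancy(alpha: float) -> float:
    return np.linalg.norm(A @ tikhonov(alpha) - b) - delta

# The residual grows monotonically with alpha, so bisect on log10(alpha).
lo, hi = -8.0, 4.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if discrepancy(10 ** mid) > 0:
        hi = mid
    else:
        lo = mid
alpha_star = 10 ** (0.5 * (lo + hi))
res = np.linalg.norm(A @ tikhonov(alpha_star) - b)
print(f"alpha* = {alpha_star:.3e}, residual = {res:.3f}, delta = {delta:.3f}")
```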
[AI-99] CARoL: Context-aware Adaptation for Robot Learning
【速读】: This paper addresses the inefficiency of learning new robotic tasks from scratch with reinforcement learning (RL); the core challenges are determining the relevance of existing knowledge and adaptively integrating it into learning a new task. The proposed solution, Context-aware Adaptation for Robot Learning (CARoL), achieves context awareness by analyzing state transitions in the system dynamics to identify similarities between the new task and prior knowledge, and then uses these similarities to prioritize and adapt specific pieces of knowledge, improving both the efficiency and the generalizability of learning new tasks.
链接: https://arxiv.org/abs/2506.07006
作者: Zechen Hu,Tong Xu,Xuesu Xiao,Xuan Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Using Reinforcement Learning (RL) to learn new robotic tasks from scratch is often inefficient. Leveraging prior knowledge has the potential to significantly enhance learning efficiency, which, however, raises two critical challenges: how to determine the relevancy of existing knowledge and how to adaptively integrate them into learning a new task. In this paper, we propose Context-aware Adaptation for Robot Learning (CARoL), a novel framework to efficiently learn a similar but distinct new task from prior knowledge. CARoL incorporates context awareness by analyzing state transitions in system dynamics to identify similarities between the new task and prior knowledge. It then utilizes these identified similarities to prioritize and adapt specific knowledge pieces for the new task. Additionally, CARoL has a broad applicability spanning policy-based, value-based, and actor-critic RL algorithms. We validate the efficiency and generalizability of CARoL on both simulated robotic platforms and physical ground vehicles. The simulations include CarRacing and LunarLander environments, where CARoL demonstrates faster convergence and higher rewards when learning policies for new tasks. In real-world experiments, we show that CARoL enables a ground vehicle to quickly and efficiently adapt policies learned in simulation to smoothly traverse real-world off-road terrain.
zh
[AI-100] End-to-End Probabilistic Framework for Learning with Hard Constraints
【速读】: This paper addresses how to incorporate operational/physical constraints into probabilistic forecasting while also quantifying model uncertainty. The key to the solution is a novel differentiable probabilistic projection layer (DPPL) that enforces hard constraints by exploiting variance information in a novel way, allowing the system to be learned end-to-end rather than satisfying constraints through post-processing or at inference time. In addition, ProbHardE2E can optimize a strictly proper scoring rule without distributional assumptions on the target, yielding robust distributional estimates, and it supports a range of non-linear constraints, increasing modeling power and flexibility.
链接: https://arxiv.org/abs/2506.07003
作者: Utkarsh Utkarsh,Danielle C. Maddix,Ruijun Ma,Michael W. Mahoney,Yuyang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 46 pages, 5 figures, 10 tables
Abstract:We present a general purpose probabilistic forecasting framework, ProbHardE2E, to learn systems that can incorporate operational/physical constraints as hard requirements. ProbHardE2E enforces hard constraints by exploiting variance information in a novel way; and thus it is also capable of performing uncertainty quantification (UQ) on the model. Our methodology uses a novel differentiable probabilistic projection layer (DPPL) that can be combined with a wide range of neural network architectures. This DPPL allows the model to learn the system in an end-to-end manner, compared to other approaches where the constraints are satisfied either through a post-processing step or at inference. In addition, ProbHardE2E can optimize a strictly proper scoring rule, without making any distributional assumptions on the target, which enables it to obtain robust distributional estimates (in contrast to existing approaches that generally optimize likelihood-based objectives, which are heavily biased by their distributional assumptions and model choices); and it can incorporate a range of non-linear constraints (increasing the power of modeling and flexibility). We apply ProbHardE2E to problems in learning partial differential equations with uncertainty estimates and to probabilistic time-series forecasting, showcasing it as a broadly applicable general setup that connects these seemingly disparate domains.
zh
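The sketch below shows one simple instance of the projection idea behind ProbHardE2E: a differentiable layer that projects network outputs onto a linear equality constraint A y = b in closed form, so gradients flow through the constraint. The paper's DPPL additionally exploits predicted variances; this version uses a plain Euclidean metric as an illustrative assumption.

```python
# Sketch: a differentiable projection layer enforcing A y = b exactly.
import torch

class EqualityProjection(torch.nn.Module):
    def __init__(self, A: torch.Tensor, b: torch.Tensor):
        super().__init__()
        # Precompute (A A^T)^{-1}; the projection is affine, so autograd-friendly.
        self.register_buffer("A", A)
        self.register_buffer("b", b)
        self.register_buffer("AtAinv", torch.linalg.inv(A @ A.T))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, n). Project each row onto {y : A y = b}.
        residual = y @ self.A.T - self.b          # (batch, m)
        return y - residual @ self.AtAinv @ self.A

# Example: force outputs to sum to 1 (a toy conservation constraint).
A = torch.ones(1, 4)
b = torch.tensor([1.0])
proj = EqualityProjection(A, b)
y = torch.randn(3, 4, requires_grad=True)
z = proj(y)
print(z.sum(dim=1))          # tensor([1., 1., 1.]) up to float error
z.sum().backward()           # gradients flow through the projection
```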
[AI-101] Evaluating LLM-corrupted Crowdsourcing Data Without Ground Truth
【速读】: This paper addresses the problem that responses generated by large language models (LLMs) in crowdsourcing tasks can corrupt datasets intended to reflect genuine human input, degrading the quality of human feedback. The key to the solution is a peer prediction mechanism that evaluates the correlations among workers' answers without relying on ground-truth labels, thereby detecting and mitigating LLM-assisted cheating. Conditioning on LLM-generated labels, the method builds a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model, and experiments demonstrate its effectiveness in detecting low-effort cheating.
链接: https://arxiv.org/abs/2506.06991
作者: Yichi Zhang,Jinlong Pang,Zhaowei Zhu,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC)
备注: 33 pages, 9 figures
Abstract:The recent success of generative AI highlights the crucial role of high-quality human feedback in building trustworthy AI systems. However, the increasing use of large language models (LLMs) by crowdsourcing workers poses a significant challenge: datasets intended to reflect human input may be compromised by LLM-generated responses. Existing LLM detection approaches often rely on high-dimension training data such as text, making them unsuitable for annotation tasks like multiple-choice labeling. In this work, we investigate the potential of peer prediction – a mechanism that evaluates the information within workers’ responses without using ground truth – to mitigate LLM-assisted cheating in crowdsourcing with a focus on annotation tasks. Our approach quantifies the correlations between worker answers while conditioning on (a subset of) LLM-generated labels available to the requester. Building on prior research, we propose a training-free scoring mechanism with theoretical guarantees under a crowdsourcing model that accounts for LLM collusion. We establish conditions under which our method is effective and empirically demonstrate its robustness in detecting low-effort cheating on real-world crowdsourcing datasets.
zh
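To illustrate the flavor of agreement-based peer prediction for multiple-choice annotation, here is a minimal correlated-agreement-style score: reward agreement with a peer on the same task beyond the agreement expected between unrelated tasks. The paper's mechanism additionally conditions on LLM-generated labels; that conditioning is omitted in this sketch.

```python
# Sketch: peer-prediction scoring of annotators without ground truth.
import numpy as np

def peer_scores(labels: np.ndarray, rng=np.random.default_rng(0)) -> np.ndarray:
    """labels: (n_workers, n_tasks) integer answer matrix."""
    n_w, n_t = labels.shape
    scores = np.zeros(n_w)
    for i in range(n_w):
        peers = [j for j in range(n_w) if j != i]
        j = rng.choice(peers)                          # random reference peer
        same = (labels[i] == labels[j]).mean()         # agreement on same task
        perm = rng.permutation(n_t)
        cross = (labels[i] == labels[j][perm]).mean()  # chance-level baseline
        scores[i] = same - cross
    return scores

# Workers 0-2 answer informatively; worker 3 answers at random ("low effort").
rng = np.random.default_rng(7)
truth = rng.integers(0, 4, size=200)
honest = np.stack([np.where(rng.random(200) < 0.8, truth, rng.integers(0, 4, 200))
                   for _ in range(3)])
lazy = rng.integers(0, 4, size=(1, 200))
print(peer_scores(np.vstack([honest, lazy])).round(3))  # last score near 0
```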
[AI-102] Deep RL Needs Deep Behavior Analysis: Exploring Implicit Planning by Model-Free Agents in Open-Ended Environments
【速读】: This paper addresses the underdevelopment of behavioral analysis methods for deep reinforcement learning (DRL) agents: as tasks and agents grow more sophisticated, comparing reward curves alone no longer suffices to understand their behavior. The key to the solution is to bring in tools from neuroscience and ethology for joint behavioral and neural analysis of DRL agents, revealing quantitative characteristics of their strategies, memory, and planning. The study finds that model-free RNN-based DRL agents can exhibit structured, planning-like behavior through emergent dynamics alone, without explicit memory modules or world models, offering a new perspective and a general analysis framework for understanding the learning dynamics of DRL agents.
链接: https://arxiv.org/abs/2506.06981
作者: Riley Simmons-Edler,Ryan P. Badman,Felix Baastad Berg,Raymond Chua,John J. Vastola,Joshua Lunger,William Qian,Kanaka Rajan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Understanding the behavior of deep reinforcement learning (DRL) agents – particularly as task and agent sophistication increase – requires more than simple comparison of reward curves, yet standard methods for behavioral analysis remain underdeveloped in DRL. We apply tools from neuroscience and ethology to study DRL agents in a novel, complex, partially observable environment, ForageWorld, designed to capture key aspects of real-world animal foraging – including sparse, depleting resource patches, predator threats, and spatially extended arenas. We use this environment as a platform for applying joint behavioral and neural analysis to agents, revealing detailed, quantitatively grounded insights into agent strategies, memory, and planning. Contrary to common assumptions, we find that model-free RNN-based DRL agents can exhibit structured, planning-like behavior purely through emergent dynamics – without requiring explicit memory modules or world models. Our results show that studying DRL agents like animals – analyzing them with neuroethology-inspired tools that reveal structure in both behavior and neural dynamics – uncovers rich structure in their learning dynamics that would otherwise remain invisible. We distill these tools into a general analysis framework linking core behavioral and representational features to diagnostic methods, which can be reused for a wide range of tasks and agents. As agents grow more complex and autonomous, bridging neuroscience, cognitive science, and AI will be essential – not just for understanding their behavior, but for ensuring safe alignment and maximizing desirable behaviors that are hard to measure via reward. We show how this can be done by drawing on lessons from how biological intelligence is studied.
zh
[AI-103] MoXGATE: Modality-aware cross-attention for multi-omic gastrointestinal cancer sub-type classification
【速读】: This paper addresses the effective integration of multi-omic data (e.g., genomic, epigenomic, and transcriptomic) for cancer subtype classification, a problem made challenging by the heterogeneity of the omics features. The key to the solution is MoXGATE, a cross-attention-based deep learning framework with learnable modality weights that enhances feature fusion across multiple omics sources, effectively capturing inter-modality dependencies and improving both classification performance and interpretability.
链接: https://arxiv.org/abs/2506.06980
作者: Sajib Acharjee Dip,Uddip Acharjee Shuvo,Dipanwita Mallick,Abrar Rahman Abir,Liqing Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, 6 tables
Abstract:Cancer subtype classification is crucial for personalized treatment and prognostic assessment. However, effectively integrating multi-omic data remains challenging due to the heterogeneous nature of genomic, epigenomic, and transcriptomic features. In this work, we propose Modality-Aware Cross-Attention MoXGATE, a novel deep-learning framework that leverages cross-attention and learnable modality weights to enhance feature fusion across multiple omics sources. Our approach effectively captures inter-modality dependencies, ensuring robust and interpretable integration. Through experiments on Gastrointestinal Adenocarcinoma (GIAC) and Breast Cancer (BRCA) datasets from TCGA, we demonstrate that MoXGATE outperforms existing methods, achieving 95% classification accuracy. Ablation studies validate the effectiveness of cross-attention over simple concatenation and highlight the importance of different omics modalities. Moreover, our model generalizes well to unseen cancer types e.g., breast cancer, underscoring its adaptability. Key contributions include (1) a cross-attention-based multi-omic integration framework, (2) modality-weighted fusion for enhanced interpretability, (3) application of focal loss to mitigate data imbalance, and (4) validation across multiple cancer subtypes. Our results indicate that MoXGATE is a promising approach for multi-omic cancer subtype classification, offering improved performance and biological generalizability.
zh
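The sketch below combines the two ingredients named above, cross-attention fusion and learnable modality weights, in a minimal form. The dimensions, softmax weighting, and single-query pooling are illustrative assumptions, not the exact MoXGATE architecture.

```python
# Sketch: cross-attention fusion of omics modalities with learnable weights.
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, dims: list[int], d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in dims])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))
        self.modality_logits = nn.Parameter(torch.zeros(len(dims)))  # learnable weights

    def forward(self, xs: list[torch.Tensor]) -> torch.Tensor:
        # xs[k]: (batch, dims[k]) -> one token per omics modality.
        tokens = torch.stack([p(x) for p, x in zip(self.proj, xs)], dim=1)
        w = torch.softmax(self.modality_logits, dim=0)   # interpretable weights
        tokens = tokens * w.view(1, -1, 1)
        q = self.query.expand(tokens.size(0), -1, -1)
        fused, _ = self.attn(q, tokens, tokens)          # cross-attention pooling
        return fused.squeeze(1)                          # (batch, d_model)

fusion = ModalityFusion(dims=[1000, 500, 300])           # e.g. mRNA/CNV/methylation
out = fusion([torch.randn(8, 1000), torch.randn(8, 500), torch.randn(8, 300)])
print(out.shape)   # torch.Size([8, 128])
```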
[AI-104] UdonCare: Hierarchy Pruning for Unseen Domain Discovery in Predictive Healthcare
【速读】: This paper addresses domain generalization in clinical prediction, in particular the performance degradation that occurs when the data distribution of patient cohorts shifts. Conventional domain generalization methods perform poorly in real-world healthcare settings for two main reasons: patient-specific domain labels are usually unavailable, making domain discovery difficult, and purely data-driven approaches overlook key clinical insights, leaving a gap in medical knowledge integration. The key to the solution is to leverage hierarchical medical ontologies (such as the ICD-9-CM hierarchy) to group diseases into higher-level categories and discover more flexible latent domains; the UdonCare framework iteratively prunes fine-grained domains, encodes the refined domains, and applies a Siamese-type inference mechanism to separate domain-related signals from patient-level features.
链接: https://arxiv.org/abs/2506.06977
作者: Pengfei Hu,Xiaoxue Han,Fei Wang,Yue Ning
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Domain generalization has become a critical challenge in clinical prediction, where patient cohorts often exhibit shifting data distributions that degrade model performance. Typical domain generalization approaches struggle in real-world healthcare settings for two main reasons: (1) patient-specific domain labels are typically unavailable, making domain discovery especially difficult; (2) purely data-driven approaches overlook key clinical insights, leading to a gap in medical knowledge integration. To address these problems, we leverage hierarchical medical ontologies like the ICD-9-CM hierarchy to group diseases into higher-level categories and discover more flexible latent domains. In this paper, we introduce UdonCare, a hierarchy-guided framework that iteratively prunes fine-grained domains, encodes these refined domains, and applies a Siamese-type inference mechanism to separate domain-related signals from patient-level features. Experimental results on clinical datasets (MIMIC-III and MIMIC-IV) show that the proposed model achieves higher performance compared to other domain generalization baselines when substantial domain gaps are present, highlighting the untapped potential of medical knowledge for enhancing domain generalization in practical healthcare applications.
zh
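The hierarchy-pruning idea above can be illustrated with a tiny helper that maps fine-grained ICD-9-CM codes to coarser ancestors by truncating the code, yielding candidate latent domains at a chosen granularity. The truncation rule is a simplification; real ICD-9-CM grouping also uses chapter and category ranges.

```python
# Sketch: coarsening ICD-9-CM codes to discover candidate latent domains.
def icd9_ancestor(code: str, level: int) -> str:
    """level 0: 3-digit category; level 1: 4-digit subcategory; and so on."""
    head = code.replace(".", "")
    digits = 3 + level
    return head[:digits] if len(head) > digits else head

visits = [["250.01", "250.02", "401.9"], ["401.1", "428.0"]]
for level in (0, 1):
    grouped = [{icd9_ancestor(c, level) for c in v} for v in visits]
    print(f"level {level}: {grouped}")
# level 0: [{'250', '401'}, {'401', '428'}]      -> coarse domains
# level 1: [{'2500', '4019'}, {'4011', '4280'}]  -> finer domains
```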
[AI-105] Deontically Constrained Policy Improvement in Reinforcement Learning Agents
【速读】: This paper addresses learning a utility-maximizing decision policy subject to constraints expressed in deontic logic. The key is to interpret the logic of Expected Act Utilitarianism, a probabilistic stit logic, over controlled Markov decision processes (MDPs), and to develop a variation of policy improvement that reaches a constrained local maximum of the mission utility. The method simultaneously maximizes two value functions in a bi-level structure, one of which implicitly encodes the constraint.
链接: https://arxiv.org/abs/2506.06959
作者: Alena Makarova,Houssam Abbas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 11 figures, DEON2025 conference
Abstract:Markov Decision Processes (MDPs) are the most common model for decision making under uncertainty in the Machine Learning community. An MDP captures non-determinism, probabilistic uncertainty, and an explicit model of action. A Reinforcement Learning (RL) agent learns to act in an MDP by maximizing a utility function. This paper considers the problem of learning a decision policy that maximizes utility subject to satisfying a constraint expressed in deontic logic. In this setup, the utility captures the agent’s mission - such as going quickly from A to B. The deontic formula represents (ethical, social, situational) constraints on how the agent might achieve its mission by prohibiting classes of behaviors. We use the logic of Expected Act Utilitarianism, a probabilistic stit logic that can be interpreted over controlled MDPs. We develop a variation on policy improvement, and show that it reaches a constrained local maximum of the mission utility. Given that in stit logic, an agent’s duty is derived from value maximization, this can be seen as a way of acting to simultaneously maximize two value functions, one of which is implicit, in a bi-level structure. We illustrate these results with experiments on sample MDPs.
zh
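A minimal tabular sketch of the bi-level idea follows: improve the mission value greedily, but only over actions whose "duty" value (here read as the probability of satisfying the deontic constraint) stays above a threshold. The tabular setting and the threshold rule are illustrative assumptions, not the paper's stit-logic construction.

```python
# Sketch: constrained greedy policy improvement over two value functions.
import numpy as np

def constrained_greedy(Q_util: np.ndarray, Q_duty: np.ndarray,
                       threshold: float = 0.95) -> np.ndarray:
    """Q_util, Q_duty: (n_states, n_actions). Returns one action per state."""
    n_states, _ = Q_util.shape
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        ok = Q_duty[s] >= threshold          # actions permitted by the constraint
        if ok.any():
            masked = np.where(ok, Q_util[s], -np.inf)
            policy[s] = int(np.argmax(masked))
        else:                                # no permitted action: maximize duty
            policy[s] = int(np.argmax(Q_duty[s]))
    return policy

rng = np.random.default_rng(3)
Q_util = rng.random((4, 3))
Q_duty = rng.random((4, 3))
print(constrained_greedy(Q_util, Q_duty, threshold=0.5))
```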
[AI-106] Position: Simulating Society Requires Simulating Thought
【速读】: This paper addresses the lack of internal coherence, causal reasoning, and belief traceability in current generative-AI-based agents used for social simulation, which makes them unreliable for analyzing human reasoning, deliberation, or responses to interventions. The key to the solution is a conceptual modeling paradigm, Generative Minds (GenMinds), which draws on cognitive science to support structured belief representations in generative agents, together with the RECAP (REconstructing CAusal Paths) framework, an evaluation benchmark that measures reasoning fidelity via causal traceability, demographic grounding, and intervention consistency, advancing a broader shift from surface-level linguistic mimicry toward generative agents that simulate thought.
链接: https://arxiv.org/abs/2506.06958
作者: Chance Jiajie Li,Jiayi Wu,Zhenze Mo,Ao Qu,Yuhan Tang,Kaiya Ivy Zhao,Yulu Gan,Jie Fan,Jiangbo Yu,Jinhua Zhao,Paul Liang,Luis Alonso,Kent Larson
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Simulating society with large language models (LLMs), we argue, requires more than generating plausible behavior – it demands cognitively grounded reasoning that is structured, revisable, and traceable. LLM-based agents are increasingly used to emulate individual and group behavior – primarily through prompting and supervised fine-tuning. Yet they often lack internal coherence, causal reasoning, and belief traceability – making them unreliable for analyzing how people reason, deliberate, or respond to interventions. To address this, we present a conceptual modeling paradigm, Generative Minds (GenMinds), which draws from cognitive science to support structured belief representations in generative agents. To evaluate such agents, we introduce the RECAP (REconstructing CAusal Paths) framework, a benchmark designed to assess reasoning fidelity via causal traceability, demographic grounding, and intervention consistency. These contributions advance a broader shift: from surface-level mimicry to generative agents that simulate thought – not just language – for social simulations.
zh
[AI-107] Is Your Training Pipeline Production-Ready? A Case Study in the Healthcare Domain
【速读】: This paper addresses the challenge of deploying machine learning (ML) training pipelines into production, in particular how software engineering practices can improve their maintainability, robustness, and extensibility. The key to the solution is the architectural evolution of the Continuous Training subsystem, which progressed from an initial Big Ball of Mud to a Modular Monolith and then toward Microservices, adopting different design principles and patterns to improve the system's quality attributes.
链接: https://arxiv.org/abs/2506.06946
作者: Daniel Lawand(1),Lucas Quaresma(1),Roberto Bolgheroni(1),Alfredo Goldman(1),Renato Cordeiro Ferreira(1,2,3,4) ((1) University of São Paulo, (2) Jheronimus Academy of Data Science, (3) Technical University of Eindhoven, (4) Tilburg University)
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 figures (2 diagrams, 1 code listing), submitted to the workshop SADIS 2025
Abstract:Deploying a Machine Learning (ML) training pipeline into production requires robust software engineering practices, which differs significantly from experimental workflows. This experience report investigates this challenge in SPIRA, a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose respiratory insufficiency via speech analysis. The first version of SPIRA's training pipeline lacked critical software quality attributes. This paper presents an overview of the MLES, then compares three versions of the architecture of the Continuous Training subsystem, which evolved from a Big Ball of Mud, to a Modular Monolith, towards Microservices, adopting different design principles and patterns to enhance its maintainability, robustness, and extensibility. In this way, the paper seeks to offer insights both for ML Engineers tasked with productionizing ML training pipelines and for Data Scientists seeking to adopt MLOps practices.
zh
[AI-108] An Agentic Framework for Autonomous Metamaterial Modeling and Inverse Design
【速读】: This paper addresses the complexity, time cost, and reliance on manual intervention in the inverse design of photonic metamaterials. The key to the solution is an Agentic Framework with autonomous execution capabilities: queried with a desired optical spectrum, the agent autonomously reasons, plans, and adapts by proposing and developing a forward deep learning model, accessing external tools via APIs for simulation and optimization, utilizing memory, and finally generating a design through a deep inverse method. Its internal reflection and decision flexibility allow it to produce highly varied and potentially novel outputs.
链接: https://arxiv.org/abs/2506.06935
作者: Darui Lu,Jordan M. Malof,Willie J. Padilla
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注: 22 pages, 6 figures
Abstract:Recent significant advances in integrating multiple Large Language Model (LLM) systems have enabled Agentic Frameworks capable of performing complex tasks autonomously, including novel scientific research. We develop and demonstrate such a framework specifically for the inverse design of photonic metamaterials. When queried with a desired optical spectrum, the Agent autonomously proposes and develops a forward deep learning model, accesses external tools via APIs for tasks like simulation and optimization, utilizes memory, and generates a final design via a deep inverse method. The framework’s effectiveness is demonstrated in its ability to automate, reason, plan, and adapt. Notably, the Agentic Framework possesses internal reflection and decision flexibility, permitting highly varied and potentially novel outputs.
zh
[AI-109] Boosting LLM Reasoning via Spontaneous Self-Correction
【速读】: This paper addresses the shortcomings of large language models (LLMs) on mathematical reasoning, in particular that existing self-correction methods rely on extra prompts and system designs and cannot perform real-time, spontaneous self-correction within a single inference pass. The key to the solution is SPOC (Spontaneous Self-Correction), which lets the model interleave solution steps and verifications in a single pass and dynamically terminates generation based on the verification outcome, effectively controlling compute overhead. Taking a multi-agent perspective, SPOC assigns the same model dual roles, solution proposer and verifier, and improves its self-verification and collaboration abilities through fine-tuning on synthetic data and online reinforcement learning.
链接: https://arxiv.org/abs/2506.06923
作者: Xutong Zhao,Tengyu Xu,Xuewei Wang,Zhengxing Chen,Di Jin,Liang Tan,Yen-Ting,Zishun Yu,Zhuokai Zhao,Yun He,Sinong Wang,Han Fang,Sarath Chandar,Chen Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While large language models (LLMs) have demonstrated remarkable success on a broad range of tasks, math reasoning remains a challenging one. One of the approaches for improving math reasoning is self-correction, which designs self-improving loops to let the model correct its own mistakes. However, existing self-correction approaches treat corrections as standalone post-generation refinements, relying on extra prompt and system designs to elicit self-corrections, instead of performing real-time, spontaneous self-corrections in a single pass. To address this, we propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass, with generation dynamically terminated based on verification outcomes, thereby effectively scaling inference time compute. SPOC considers a multi-agent perspective by assigning dual roles – solution proposer and verifier – to the same model. We adopt a simple yet effective approach to generate synthetic data for fine-tuning, enabling the model to develop capabilities for self-verification and multi-agent collaboration. We further improve its solution proposal and verification accuracy through online reinforcement learning. Experiments on mathematical reasoning benchmarks show that SPOC significantly improves performance. Notably, SPOC boosts the accuracy of Llama-3.1-8B and 70B Instruct models, achieving gains of 8.8% and 11.6% on MATH500, 10.0% and 20.0% on AMC23, and 3.3% and 6.7% on AIME24, respectively.
zh
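A minimal sketch of the interleaved propose-and-verify loop follows, with the same model playing both roles. The `generate` function is a hypothetical stand-in for an LLM call, and the prompts and PASS/FAIL stop rule are illustrative, not the paper's exact protocol.

```python
# Sketch: SPOC-style inference loop with dual proposer/verifier roles.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call here")  # hypothetical stub

def spoc_infer(question: str, max_rounds: int = 4) -> str:
    transcript = f"Question: {question}\n"
    solution = ""
    for _ in range(max_rounds):
        solution = generate(transcript + "Propose a solution:\n")
        verdict = generate(transcript + f"Solution: {solution}\n"
                           "Verify the solution. Answer PASS or FAIL with a reason:\n")
        transcript += f"Solution: {solution}\nVerification: {verdict}\n"
        if verdict.strip().upper().startswith("PASS"):
            return solution                  # verification ends generation early
    return solution                          # budget exhausted: return last attempt
```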
[AI-110] Graph-Based Physics-Guided Urban PM2.5 Air Quality Imputation with Constrained Monitoring Data
【速读】: This paper addresses high-resolution, accurate air quality modeling in urban areas with limited monitoring data. The key to the solution is GraPhy, a graph-based, physics-guided learning framework whose layer structures and edge features are designed specifically for low-resolution monitoring data, combining physical knowledge to improve model performance.
链接: https://arxiv.org/abs/2506.06917
作者: Shangjie Du,Hui Wei,Dong Yoon Lee,Zhizhang Hu,Shijia Pan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ACM Transactions on Sensor Networks (TOSN) 2025
Abstract:This work introduces GraPhy, a graph-based, physics-guided learning framework for high-resolution and accurate air quality modeling in urban areas with limited monitoring data. Fine-grained air quality monitoring information is essential for reducing public exposure to pollutants. However, monitoring networks are often sparse in socioeconomically disadvantaged regions, limiting the accuracy and resolution of air quality modeling. To address this, we propose a physics-guided graph neural network architecture called GraPhy with layers and edge features designed specifically for low-resolution monitoring data. Experiments using data from California’s socioeconomically disadvantaged San Joaquin Valley show that GraPhy achieves the overall best performance evaluated by mean squared error (MSE), mean absolute error (MAE), and R-square value (R2), improving the performance by 9%-56% compared to various baseline models. Moreover, GraPhy consistently outperforms baselines across different spatial heterogeneity levels, demonstrating the effectiveness of our model design.
zh
[AI-111] Causal Graph based Event Reasoning using Semantic Relation Experts
【速读】: This paper addresses the difficulty large language models (LLMs) have in accurately identifying causal relations between events, which hurts their performance on deeper reasoning tasks such as event forecasting and timeline understanding. The key to the solution is to generate causal event graphs as an auxiliary mechanism that helps LLMs explicitly represent causality during inference. LLMs simulate experts that each focus on a specific semantic relation and engage in multiple rounds of discussion, which a final expert then consolidates to produce an accurate causal graph. Without fine-tuning on any downstream task, the approach achieves results competitive with state-of-the-art models on event forecasting and next-event prediction.
链接: https://arxiv.org/abs/2506.06910
作者: Mahnaz Koupaee,Xueying Bai,Mudan Chen,Greg Durrett,Nathanael Chambers,Niranjan Balasubramanian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding how events in a scenario causally connect with each other is important for effectively modeling and reasoning about events. But event reasoning remains a difficult challenge, and despite recent advances, Large Language Models (LLMs) still struggle to accurately identify causal connections between events. This struggle leads to poor performance on deeper reasoning tasks like event forecasting and timeline understanding. To address this challenge, we investigate the generation of causal event graphs (e.g., A enables B) as a parallel mechanism to help LLMs explicitly represent causality during inference. This paper evaluates both how to generate correct graphs as well as how graphs can assist reasoning. We propose a collaborative approach to causal graph generation where we use LLMs to simulate experts that focus on specific semantic relations. The experts engage in multiple rounds of discussions which are then consolidated by a final expert. Then, to demonstrate the utility of causal graphs, we use them on multiple downstream applications, and also introduce a new explainable event prediction task that requires a causal chain of events in the explanation. These explanations are more informative and coherent than baseline generations. Finally, our overall approach, though not finetuned on any downstream task, achieves competitive results with state-of-the-art models on both forecasting and next event prediction tasks.
zh
[AI-112] Uncertainty Estimation on Graphs with Structure Informed Stochastic Partial Differential Equations
【速读】: This paper addresses accurate uncertainty estimation on graph-structured data, especially under distributional shifts. Traditional uncertainty estimation struggles with the randomness arising from both graph topology and label distribution, which adds complexity. The key to the solution is an analogy between the evolution of a stochastic partial differential equation (SPDE) driven by a Matern Gaussian process and message passing in graph neural network (GNN) layers, yielding a new message passing scheme that incorporates spatio-temporal noise. The method captures uncertainty across both space and time and allows explicit control over the smoothness of the covariance kernel, improving uncertainty estimates in both low and high label-informativeness regimes.
链接: https://arxiv.org/abs/2506.06907
作者: Fred Xu,Thomas Markovich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks have achieved impressive results across diverse network modeling tasks, but accurately estimating uncertainty on graphs remains difficult, especially under distributional shifts. Unlike traditional uncertainty estimation, graph-based uncertainty must account for randomness arising from both the graph’s structure and its label distribution, which adds complexity. In this paper, making an analogy between the evolution of a stochastic partial differential equation (SPDE) driven by Matern Gaussian Process and message passing using GNN layers, we present a principled way to design a novel message passing scheme that incorporates spatial-temporal noises motivated by the Gaussian Process approach to SPDE. Our method simultaneously captures uncertainty across space and time and allows explicit control over the covariance kernel smoothness, thereby enhancing uncertainty estimates on graphs with both low and high label informativeness. Our extensive experiments on Out-of-Distribution (OOD) detection on graph datasets with varying label informativeness demonstrate the soundness and superiority of our model to existing approaches.
zh
[AI-113] Can Biologically Plausible Temporal Credit Assignment Rules Match BPTT for Neural Similarity? E-prop as an Example
【速读】: This paper asks whether learning rules that respect biological constraints can match Backpropagation Through Time (BPTT) in both task performance and similarity to neural recordings. The key to the solution is validating a biologically plausible learning rule based on gradient truncation, e-prop, showing that at matched task accuracy it achieves neural data similarity comparable to BPTT, indicating that biologically plausible learning rules are highly feasible and effective for neuroscience tasks.
链接: https://arxiv.org/abs/2506.06904
作者: Yuhan Helena Liu,Guangyu Robert Yang,Christopher J. Cueva
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Understanding how the brain learns may be informed by studying biologically plausible learning rules. These rules, often approximating gradient descent learning to respect biological constraints such as locality, must meet two critical criteria to be considered an appropriate brain model: (1) good neuroscience task performance and (2) alignment with neural recordings. While extensive research has assessed the first criterion, the second remains underexamined. Employing methods such as Procrustes analysis on well-known neuroscience datasets, this study demonstrates the existence of a biologically plausible learning rule – namely e-prop, which is based on gradient truncation and has demonstrated versatility across a wide range of tasks – that can achieve neural data similarity comparable to Backpropagation Through Time (BPTT) when matched for task accuracy. Our findings also reveal that model architecture and initial conditions can play a more significant role in determining neural similarity than the specific learning rule. Furthermore, we observe that BPTT-trained models and their biologically plausible counterparts exhibit similar dynamical properties at comparable accuracies. These results underscore the substantial progress made in developing biologically plausible learning rules, highlighting their potential to achieve both competitive task performance and neural data similarity.
zh
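The Procrustes analysis mentioned above can be reproduced in a few lines: find the best alignment between two (time x units) representation matrices and report the residual disparity. Centering and the disparity convention follow scipy; the synthetic data are an illustrative assumption.

```python
# Sketch: Procrustes-style similarity between model and neural representations.
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(4)
neural = rng.normal(size=(100, 20))             # recorded trajectories
R, _ = np.linalg.qr(rng.normal(size=(20, 20)))  # random orthogonal transform
model = neural @ R + 0.05 * rng.normal(size=(100, 20))  # rotated + noisy copy

_, _, disparity = procrustes(neural, model)
print(f"Procrustes disparity: {disparity:.4f}  (0 = identical up to rotation/scale)")
```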
[AI-114] KnowCoder-V2: Deep Knowledge Analysis
【速读】: This paper addresses three major challenges existing deep research frameworks face on deep knowledge analysis tasks: the lack of systematic knowledge organization and management, the inefficiency of purely online operation for tasks that rely on large-scale shared knowledge, and the inability to perform complex knowledge computation, which limits insightful analytical results. The key to the solution is the Knowledgeable Deep Research (KDR) framework, which introduces an independent knowledge organization phase that preprocesses large-scale, domain-relevant data into systematic knowledge offline, and extends deep research with additional reasoning steps that perform complex knowledge computation online. The paper further introduces KCII, a large language model that bridges knowledge organization and reasoning via unified code generation, to strengthen the ability of LLMs to solve knowledge analysis tasks within this framework.
链接: https://arxiv.org/abs/2506.06881
作者: Zixuan Li,Wenxuan Liu,Long Bai,Chunmao Zhang,Wei Li,Fenghui Zhang,Quanxin Jin,Ruoyun He,Zhuo Chen,Zhilei Hu,Fei Wang,Bingbing Xu,Xuhui Jiang,Xiaolong Jin,Jiafeng Guo,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deep knowledge analysis tasks always involve the systematic extraction and association of knowledge from large volumes of data, followed by logical reasoning to discover insights. However, to solve such complex tasks, existing deep research frameworks face three major challenges: 1) They lack systematic organization and management of knowledge; 2) They operate purely online, making it inefficient for tasks that rely on shared and large-scale knowledge; 3) They cannot perform complex knowledge computation, limiting their abilities to produce insightful analytical results. Motivated by these, in this paper, we propose a Knowledgeable Deep Research (KDR) framework that empowers deep research with deep knowledge analysis capability. Specifically, it introduces an independent knowledge organization phase to preprocess large-scale, domain-relevant data into systematic knowledge offline. Based on this knowledge, it extends deep research with an additional kind of reasoning steps that perform complex knowledge computation in an online manner. To enhance the abilities of LLMs to solve knowledge analysis tasks in the above framework, we further introduce KCII, an LLM that bridges knowledge organization and reasoning via unified code generation. For knowledge organization, it generates instantiation code for predefined classes, transforming data into knowledge objects. For knowledge computation, it generates analysis code and executes on the above knowledge objects to obtain deep analysis results. Experimental results on more than thirty datasets across six knowledge analysis tasks demonstrate the effectiveness of KCII. Moreover, when integrated into the KDR framework, KCII can generate high-quality reports with insightful analytical results compared to the mainstream deep research framework.
zh
[AI-115] LLM-D12: A Dual-Dimensional Scale of Instrumental and Relational Dependencies on Large Language Models
【速读】: This paper addresses the lack of validated tools for assessing the extent to which individuals depend on large language models (LLMs); existing instruments are mainly adapted from classic behavioral addiction symptoms and are conceptually limited. The key to the solution is the development and validation of a new 12-item scale, LLM-D12, whose items were derived from the authors' prior theoretical work. Exploratory and confirmatory factor analyses identified two dimensions, Instrumental Dependency and Relationship Dependency, providing an assessment framework better suited to the distinctive nature of the LLM-human relationship.
链接: https://arxiv.org/abs/2506.06874
作者: Ala Yankouskaya,Areej B. Babiker,Syeda W. F. Rizvi,Sameha Alshakhsi,Magnus Liebherr,Raian Ali
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:There is growing interest in understanding how people interact with large language models (LLMs) and whether such models elicit dependency or even addictive behaviour. Validated tools to assess the extent to which individuals may become dependent on LLMs are scarce and primarily build on classic behavioral addiction symptoms, adapted to the context of LLM use. We view this as a conceptual limitation, as the LLM-human relationship is more nuanced and warrants a fresh and distinct perspective. To address this gap, we developed and validated a new 12-item questionnaire to measure LLM dependency, referred to as LLM-D12. The scale was based on the authors’ prior theoretical work, with items developed accordingly and responses collected from 526 participants in the UK. Exploratory and confirmatory factor analyses, performed on separate halves of the total sample using a split-sample approach, supported a two-factor structure: Instrumental Dependency (six items) and Relationship Dependency (six items). Instrumental Dependency reflects the extent to which individuals rely on LLMs to support or collaborate in decision-making and cognitive tasks. Relationship Dependency captures the tendency to perceive LLMs as socially meaningful, sentient, or companion-like entities. The two-factor structure demonstrated excellent internal consistency and clear discriminant validity. External validation confirmed both the conceptual foundation and the distinction between the two subscales. The psychometric properties and structure of our LLM-D12 scale were interpreted in light of the emerging view that dependency on LLMs does not necessarily indicate dysfunction but may still reflect reliance levels that could become problematic in certain contexts.
zh
[AI-116] Recursive Semantic Anchoring in ISO 639:2023: A Structural Extension to ISO/TC 37 Frameworks
【速读】: This paper addresses the absence in ISO 639:2023 of a machine-operable mechanism for handling dialectal drift and creole mixtures. The key to the solution is a formalization of recursive semantic anchoring: each language entity is assigned a family of fixed-point operators φ_{n,m} that model bounded semantic drift via the relation φ_{n,m}(χ) = χ ⊕ Δ(χ), where Δ(χ) is a drift vector in a latent semantic manifold. Using category theory, the operators are treated as morphisms and the drift vectors as arrows in a category named DriftLang, and a functor Φ maps every drifted object to its unique anchor, proving convergence.
链接: https://arxiv.org/abs/2506.06870
作者: Bugra Kilictas,Faruk Alpay
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: 21 pages, no figures. Includes formal proofs, RDF/Turtle ontology schema, ϕ-index disambiguation cases, and evaluation of transformer-based AI models under semantic drift
Abstract:ISO 639:2023 unifies the ISO language-code family and introduces contextual metadata, but it lacks a machine-native mechanism for handling dialectal drift and creole mixtures. We propose a formalisation of recursive semantic anchoring, attaching to every language entity χ a family of fixed-point operators φ_{n,m} that model bounded semantic drift via the relation φ_{n,m}(χ) = χ ⊕ Δ(χ), where Δ(χ) is a drift vector in a latent semantic manifold. The base anchor φ_{0,0} recovers the canonical ISO 639:2023 identity, whereas φ_{99,9} marks the maximal drift state that triggers a deterministic fallback. Using category theory, we treat the operators φ_{n,m} as morphisms and drift vectors as arrows in a category DriftLang. A functor Φ: DriftLang → AnchorLang maps every drifted object to its unique anchor and proves convergence. We provide an RDF/Turtle schema (BaseLanguage, DriftedLanguage, ResolvedAnchor) and worked examples – e.g., φ_{8,4} (Standard Mandarin) versus φ_{8,7} (a colloquial variant), and φ_{1,7} for Nigerian Pidgin anchored to English. Experiments with transformer models show higher accuracy in language identification and translation on noisy or code-switched input when the φ-indices are used to guide fallback routing. The framework is compatible with ISO/TC 37 and provides an AI-tractable, drift-aware semantic layer for future standards.
zh
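A minimal sketch of the drift-anchor relation φ_{n,m}(χ) = χ ⊕ Δ(χ) follows: a language entity carries a drift vector in a latent space, and resolution falls back to the canonical ISO 639 anchor once the drift exceeds a bound. The vector space, norm bound, and fallback rule are illustrative assumptions.

```python
# Sketch: bounded semantic drift with deterministic fallback to the anchor.
import numpy as np
from dataclasses import dataclass, field

MAX_DRIFT = 0.99   # phi_{99,9}-style maximal drift before deterministic fallback

@dataclass
class LanguageEntity:
    iso_code: str                      # canonical ISO 639 anchor, phi_{0,0}
    drift: np.ndarray = field(default_factory=lambda: np.zeros(8))

    def apply_drift(self, delta: np.ndarray) -> "LanguageEntity":
        return LanguageEntity(self.iso_code, self.drift + delta)

    def resolve(self) -> str:
        # Bounded drift keeps the variant; excessive drift resolves to the anchor.
        if np.linalg.norm(self.drift) <= MAX_DRIFT:
            return self.iso_code
        return f"{self.iso_code} (fallback to anchor)"

pidgin = LanguageEntity("en").apply_drift(np.full(8, 0.1))  # mild drift
far = pidgin.apply_drift(np.full(8, 0.5))                   # large drift: fallback
print(pidgin.resolve(), "|", far.resolve())
```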
[AI-117] Incorporating Failure of Machine Learning in Dynamic Probabilistic Safety Assurance
【速读】: This paper addresses reasoning failures of machine learning (ML) models in safety-critical systems caused by distributional shifts between operational and training data, failures that are hard to detect and analyze with traditional safety assessment methods relying on design artefacts or code. The key to the solution is a probabilistic safety assurance framework that integrates SafeML with Bayesian networks (BNs), modeling ML failures as part of a broader causal safety analysis and enabling dynamic safety evaluation and system adaptation under uncertainty.
链接: https://arxiv.org/abs/2506.06868
作者: Razieh Arshadizadeh,Mahmoud Asgari,Zeinab Khosravi,Yiannis Papadopoulos,Koorosh Aslansefat
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Machine Learning (ML) models are increasingly integrated into safety-critical systems, such as autonomous vehicle platooning, to enable real-time decision-making. However, their inherent imperfection introduces a new class of failure: reasoning failures often triggered by distributional shifts between operational and training data. Traditional safety assessment methods, which rely on design artefacts or code, are ill-suited for ML components that learn behaviour from data. SafeML was recently proposed to dynamically detect such shifts and assign confidence levels to the reasoning of ML-based components. Building on this, we introduce a probabilistic safety assurance framework that integrates SafeML with Bayesian Networks (BNs) to model ML failures as part of a broader causal safety analysis. This allows for dynamic safety evaluation and system adaptation under uncertainty. We demonstrate the approach on a simulated automotive platooning system with traffic sign recognition. The findings highlight the potential broader benefits of explicitly modelling ML failures in safety assessment.
zh
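The SafeML-style shift detection underlying this framework can be illustrated with a statistical distance between training and operational features, here a two-sample Kolmogorov-Smirnov test per feature. Mapping the test outcome to a confidence level that a Bayesian network would consume is an assumption for illustration.

```python
# Sketch: flagging distributional shift with per-feature KS tests.
import numpy as np
from scipy.stats import ks_2samp

def shift_confidence(train: np.ndarray, ops: np.ndarray, alpha=0.01) -> float:
    """Fraction of features whose KS test does NOT reject 'same distribution'."""
    pvals = [ks_2samp(train[:, j], ops[:, j]).pvalue for j in range(train.shape[1])]
    return float(np.mean(np.array(pvals) > alpha))

rng = np.random.default_rng(5)
train = rng.normal(size=(500, 10))
ops_ok = rng.normal(size=(200, 10))
ops_shifted = ops_ok + np.r_[np.ones(5), np.zeros(5)]   # shift half the features
print(shift_confidence(train, ops_ok),        # ~1.0: high confidence in the ML verdict
      shift_confidence(train, ops_shifted))   # ~0.5: degraded confidence for the BN
```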
[AI-118] SAFE: Finding Sparse and Flat Minima to Improve Pruning ICML2025
【速读】: This paper aims to address the performance degradation that commonly accompanies neural network sparsification, i.e., the difficulty of recovering the original performance while reducing the parameter count. The key to the solution is to find subnetworks that are both sparse and flat: pruning is formulated as a sparsity-constrained optimization problem with flatness encouraged as an objective, solved explicitly via an augmented Lagrange dual approach and extended with a generalized projection operation, yielding the new pruning method SAFE and its improved variant SAFE^+.
链接: https://arxiv.org/abs/2506.06866
作者: Dongyeop Lee,Kwanhee Lee,Jinseok Chung,Namhoon Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2025
Abstract:Sparsifying neural networks often suffers from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress. Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time. Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective. We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE^+. Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to well-established baselines. In addition, SAFE demonstrates resilience to noisy data, making it well-suited for real-world conditions.
zh
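The projection step at the heart of sparsity-constrained optimization can be shown in isolation: Euclidean projection onto the L0 ball keeps the k largest-magnitude weights. This is only the basic building block that SAFE's generalized projection extends; the flatness objective and augmented Lagrangian loop are omitted here.

```python
# Sketch: Euclidean projection onto an L0 sparsity constraint.
import torch

def project_l0(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out all but the top-(1 - sparsity) fraction of entries by magnitude."""
    k = max(1, int(w.numel() * (1.0 - sparsity)))
    thresh = torch.topk(w.abs().flatten(), k).values[-1]
    return torch.where(w.abs() >= thresh, w, torch.zeros_like(w))

w = torch.randn(4, 4)
print(project_l0(w, sparsity=0.75))   # roughly 4 of 16 weights survive
```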
[AI-119] High-Fidelity Scientific Simulation Surrogates via Adaptive Implicit Neural Representations
【速读】: This paper addresses the poor performance of implicit neural representations (INRs) on complex scientific fields that exhibit localized, high-frequency variations. Conventional remedies add features along rigid geometric structures (e.g., grids), at the cost of flexibility and increased model size. The key to the proposed Feature-Adaptive INR (FA-INR) is to learn flexible feature representations via cross-attention to an augmented memory bank, adaptively allocating model capacity based on data characteristics rather than rigid structural assumptions. A coordinate-guided mixture of experts (MoE) further improves the specialization and efficiency of the feature representations.
链接: https://arxiv.org/abs/2506.06858
作者: Ziwei Li,Yuhan Duan,Tianyu Xiong,Yi-Tang Chen,Wei-Lun Chao,Han-Wei Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective surrogate models are critical for accelerating scientific simulations. Implicit neural representations (INRs) offer a compact and continuous framework for modeling spatially structured data, but they often struggle with complex scientific fields exhibiting localized, high-frequency variations. Recent approaches address this by introducing additional features along rigid geometric structures (e.g., grids), but at the cost of flexibility and increased model size. In this paper, we propose a simple yet effective alternative: Feature-Adaptive INR (FA-INR). FA-INR leverages cross-attention to an augmented memory bank to learn flexible feature representations, enabling adaptive allocation of model capacity based on data characteristics, rather than rigid structural assumptions. To further improve scalability, we introduce a coordinate-guided mixture of experts (MoE) that enhances the specialization and efficiency of feature representations. Experiments on three large-scale ensemble simulation datasets show that FA-INR achieves state-of-the-art fidelity while significantly reducing model size, establishing a new trade-off frontier between accuracy and compactness for INR-based surrogates.
zh
[AI-120] United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory
【速读】: This paper addresses the performance ceiling large language models (LLMs) exhibit on complex, multi-faceted tasks, which stems from their difficulty in integrating diverse information or adhering to multiple constraints. The key to the solution is CoThinker, a multi-agent framework grounded in Cognitive Load Theory (CLT) that distributes intrinsic cognitive load through agent specialization and manages transactional load via structured communication and a collective working memory, thereby mitigating cognitive overload and improving collaborative problem solving.
链接: https://arxiv.org/abs/2506.06843
作者: HaoYang Shang,Xuan Liu,Zi Liang,Jie Zhang,Haibo Hu,Song Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) exhibit a notable performance ceiling on complex, multi-faceted tasks, as they often fail to integrate diverse information or adhere to multiple constraints. We posit that this limitation arises when the demands of a task exceed the LLM's effective cognitive load capacity. This interpretation draws a strong analogy to Cognitive Load Theory (CLT) in cognitive science, which explains similar performance boundaries in the human mind, and is further supported by emerging evidence that reveals LLMs have bounded working memory characteristics. Building upon this CLT-grounded understanding, we introduce CoThinker, a novel LLM-based multi-agent framework designed to mitigate cognitive overload and enhance collaborative problem-solving abilities. CoThinker operationalizes CLT principles by distributing intrinsic cognitive load through agent specialization and managing transactional load via structured communication and a collective working memory. We empirically validate CoThinker on complex problem-solving tasks and fabricated high cognitive load scenarios, demonstrating improvements over existing multi-agent baselines in solution quality and efficiency. Our analysis reveals characteristic interaction patterns, providing insights into the emergence of collective cognition and effective load management, thus offering a principled approach to overcoming LLM performance ceilings.
zh
[AI-121] AI-Generated Compromises for Coalition Formation
【速读】: This paper addresses the problem of finding compromise proposals among multiple agents, a question fundamental to AI subfields such as argumentation, mediation, and negotiation. The key to the solution is a model that incorporates agent bounded rationality and uncertainty, together with generative AI methods for producing compromise proposals. Focusing on the domain of collaborative document writing, such as the democratic drafting of a community constitution, the authors use natural language processing techniques and large language models to induce a semantic metric space over text, and design algorithms on this space to suggest compromise points likely to receive broad support.
链接: https://arxiv.org/abs/2506.06837
作者: Eyal Briman,Ehud Shapiro,Nimrod Talmon
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:The challenge of finding compromises between agent proposals is fundamental to AI subfields such as argumentation, mediation, and negotiation. Building on this tradition, Elkind et al. (2021) introduced a process for coalition formation that seeks majority-supported proposals preferable to the status quo, using a metric space where each agent has an ideal point. A crucial step in this process involves identifying compromise proposals around which agent coalitions can unite. How to effectively find such compromise proposals remains an open question. We address this gap by formalizing a model that incorporates agent bounded rationality and uncertainty, and by developing AI methods to generate compromise proposals. We focus on the domain of collaborative document writing, such as the democratic drafting of a community constitution. Our approach uses natural language processing techniques and large language models to induce a semantic metric space over text. Based on this space, we design algorithms to suggest compromise points likely to receive broad support. To evaluate our methods, we simulate coalition formation processes and show that AI can facilitate large-scale democratic text editing, a domain where traditional tools are limited.
zh
[AI-122] IMPA-HGAE: Intra-Meta-Path Augmented Heterogeneous Graph Autoencoder
【速读】: This paper addresses a limitation of existing heterogeneous graph self-supervised learning (SSL) models, which convert heterogeneous graphs into homogeneous ones via meta-paths for training and thus only use information from the nodes at the two ends of each meta-path while underutilizing the internal nodes along it. The key to the solution is the IMPA-HGAE framework, which enhances target node embeddings by fully exploiting the information of internal nodes along meta-paths, and introduces innovative masking strategies to strengthen the representational capacity of generative SSL models on heterogeneous graph data.
链接: https://arxiv.org/abs/2506.06809
作者: Di Lin,Wanjing Ren,Xuanbin Li,Rui Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-supervised learning (SSL) methods have been increasingly applied to diverse downstream tasks due to their superior generalization capabilities and low annotation costs. However, most existing heterogeneous graph SSL models convert heterogeneous graphs into homogeneous ones via meta-paths for training, which only leverage information from nodes at both ends of meta-paths while underutilizing the heterogeneous node information along the meta-paths. To address this limitation, this paper proposes a novel framework named IMPA-HGAE to enhance target node embeddings by fully exploiting internal node information along meta-paths. Experimental results validate that IMPA-HGAE achieves superior performance on heterogeneous datasets. Furthermore, this paper introduces innovative masking strategies to strengthen the representational capacity of generative SSL models on heterogeneous graph data. Additionally, this paper discusses the interpretability of the proposed method and potential future directions for generative self-supervised learning in heterogeneous graphs. This work provides insights into leveraging meta-path-guided structural semantics for robust representation learning in complex graph scenarios.
zh
[AI-123] Is Optimal Transport Necessary for Inverse Reinforcement Learning?
【速读】: This paper questions the algorithmic complexity, hyperparameter sensitivity, and optimization burden that optimal transport (OT) methods bring to inverse reinforcement learning (IRL). The key to the solution is two simple heuristic alternatives: a minimum-distance reward that assigns rewards based on the nearest expert state regardless of temporal order, and a segment-matching reward that adds lightweight temporal alignment by matching agent states to corresponding segments of the expert trajectory. Both avoid solving optimization problems, run in linear time, and are easy to implement.
链接: https://arxiv.org/abs/2506.06793
作者: Zixuan Dong,Yumi Omori,Keith Ross
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 10 tables
Abstract:Inverse Reinforcement Learning (IRL) aims to recover a reward function from expert demonstrations. Recently, Optimal Transport (OT) methods have been successfully deployed to align trajectories and infer rewards. While OT-based methods have shown strong empirical results, they introduce algorithmic complexity and hyperparameter sensitivity, and require solving OT optimization problems. In this work, we challenge the necessity of OT in IRL by proposing two simple, heuristic alternatives: (1) Minimum-Distance Reward, which assigns rewards based on the nearest expert state regardless of temporal order; and (2) Segment-Matching Reward, which incorporates lightweight temporal alignment by matching agent states to corresponding segments in the expert trajectory. These methods avoid optimization, exhibit linear-time complexity, and are easy to implement. Through extensive evaluations across 32 online and offline benchmarks with three reinforcement learning algorithms, we show that our simple rewards match or outperform recent OT-based approaches. Our findings suggest that the core benefits of OT may arise from basic proximity alignment rather than its optimal coupling formulation, advocating for reevaluation of complexity in future IRL design.
zh
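Both heuristic rewards described above are simple enough to write out directly. In the sketch below, the Euclidean distance metric, negative-distance shaping, and window size are illustrative choices.

```python
# Sketch: the paper's two heuristic rewards for IRL without optimal transport.
import numpy as np

def min_distance_reward(s: np.ndarray, expert: np.ndarray) -> float:
    """Reward = -distance to the closest expert state, ignoring time order."""
    return -float(np.min(np.linalg.norm(expert - s, axis=1)))

def segment_matching_reward(s: np.ndarray, t: int, T: int,
                            expert: np.ndarray, window: int = 5) -> float:
    """Match step t of a T-step episode to the corresponding expert segment."""
    center = int(round(t / max(T - 1, 1) * (len(expert) - 1)))
    lo, hi = max(0, center - window), min(len(expert), center + window + 1)
    return -float(np.min(np.linalg.norm(expert[lo:hi] - s, axis=1)))

expert = np.linspace([0, 0], [1, 1], num=50)      # a straight expert trajectory
print(min_distance_reward(np.array([0.5, 0.5]), expert))
print(segment_matching_reward(np.array([0.5, 0.5]), t=10, T=20, expert=expert))
```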
[AI-124] Learning What Matters Now: A Dual-Critic Context-Aware RL Framework for Priority-Driven Information Gain
【速读】: This paper addresses how autonomous systems in high-stakes search-and-rescue (SAR) missions can continuously gather mission-critical information while flexibly adapting to shifting operational priorities. The key to the solution is CA-MIQ (Context-Aware Max-Information Q-learning), a lightweight dual-critic reinforcement learning framework that pairs a standard extrinsic critic for task reward with an intrinsic critic fusing state novelty, information-location awareness, and real-time priority alignment, dynamically adjusting the exploration strategy. A built-in shift detector triggers transient exploration boosts and selective critic resets, allowing the agent to refocus quickly after a priority revision and markedly improving mission success rates.
链接: https://arxiv.org/abs/2506.06786
作者: Dimitris Panagopoulos,Adolfo Perrusquia,Weisi Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures, 3 tables, submitted as a regural paper to IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2025
Abstract:Autonomous systems operating in high-stakes search-and-rescue (SAR) missions must continuously gather mission-critical information while flexibly adapting to shifting operational priorities. We propose CA-MIQ (Context-Aware Max-Information Q-learning), a lightweight dual-critic reinforcement learning (RL) framework that dynamically adjusts its exploration strategy whenever mission priorities change. CA-MIQ pairs a standard extrinsic critic for task reward with an intrinsic critic that fuses state-novelty, information-location awareness, and real-time priority alignment. A built-in shift detector triggers transient exploration boosts and selective critic resets, allowing the agent to re-focus after a priority revision. In a simulated SAR grid-world, where experiments specifically test adaptation to changes in the priority order of information types the agent is expected to focus on, CA-MIQ achieves nearly four times higher mission-success rates than baselines after a single priority shift and more than three times better performance in multiple-shift scenarios, achieving 100% recovery while baseline methods fail to adapt. These results highlight CA-MIQ's effectiveness in any discrete environment with piecewise-stationary information-value distributions.
zh
[AI-125] Bio-Inspired Classification: Combining Information Theory and Spiking Neural Networks – Influence of the Learning Rules
【速读】: This paper addresses the challenges of training spiking neural networks (SNNs), in particular their temporal dynamics, the non-differentiability of spike events, and sparse event-driven activations. The key contribution is a bio-inspired classifier that combines SNNs with Lempel-Ziv complexity (LZC), marrying the temporal precision and biological realism of SNNs with LZC's structural complexity analysis to enable efficient and interpretable classification of spatiotemporal neural data.
链接: https://arxiv.org/abs/2506.06750
作者: Zofia Rudnicka,Janusz Szczepanski,Agnieszka Pregowska
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Training of Spiking Neural Networks (SNN) is challenging due to their unique properties, including temporal dynamics, non-differentiability of spike events, and sparse event-driven activations. In this paper, we broadly examine how the choice of learning algorithm, including bio-inspired learning rules, affects classification accuracy. We propose a bio-inspired classifier based on the combination of SNN and Lempel-Ziv complexity (LZC). This approach synergizes the strengths of SNNs in temporal precision and biological realism with LZC's structural complexity analysis, facilitating efficient and interpretable classification of spatiotemporal neural data. It turned out that the classic backpropagation algorithm achieves excellent classification accuracy, but at extremely high computational cost, which makes it impractical for real-time applications. Biologically inspired learning algorithms such as tempotron and SpikeProp provide increased computational efficiency while maintaining competitive classification performance, making them suitable for time-sensitive tasks. The results obtained indicate that the selection of the most appropriate learning algorithm depends on the trade-off between classification accuracy and computational cost as well as application constraints.
zh
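The Lempel-Ziv side of the classifier can be illustrated with a simplified LZ78-style phrase-counting parse on a binarized spike train: irregular trains produce more distinct phrases than periodic ones. The binarization and the particular parse variant are illustrative choices.

```python
# Sketch: LZ78-style complexity of a binarized spike train.
import numpy as np

def lz_complexity(seq: str) -> int:
    """Count distinct phrases in a left-to-right incremental parse."""
    phrases, i = set(), 0
    while i < len(seq):
        j = i + 1
        # Extend the current phrase until it is novel (or the string ends).
        while seq[i:j] in phrases and j <= len(seq):
            j += 1
        phrases.add(seq[i:j])
        i = j
    return len(phrases)

rng = np.random.default_rng(6)
random_train = "".join(map(str, rng.integers(0, 2, 200)))   # irregular spikes
regular_train = "01" * 100                                  # periodic spikes
print(lz_complexity(random_train), ">", lz_complexity(regular_train))
```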
[AI-126] AI PsyRoom: Artificial Intelligence Platform for Segmented Yearning and Reactive Outcome Optimization Method
【速读】: This paper addresses the challenges psychological counseling faces from growing demand for mental health services and a shortage of trained professionals, together with the limitations of existing large language models in deep emotional understanding and in generating personalized treatment plans from fine-grained emotions. The key to the solution is AI PsyRoom, a simulation system built on fine-grained emotion classification and a multi-agent framework; by constructing the high-quality emotional dialogue dataset EmoPsy and a module for generating personalized treatment plans, it markedly improves problem orientation, expression, empathy, and interactive communication quality.
链接: https://arxiv.org/abs/2506.06740
作者: Yigui Feng,Qinglin Wang,Ke Liu,Xinhai Chen,Bo Yang,Jie Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Psychological counseling faces huge challenges due to the growing demand for mental health services and the shortage of trained professionals. Large language models (LLMs) have shown potential to assist psychological counseling, especially in empathy and emotional support. However, existing models lack a deep understanding of emotions and are unable to generate personalized treatment plans based on fine-grained emotions. To address these shortcomings, we present AI PsyRoom, a multi-agent simulation framework designed to enhance psychological counseling by generating empathetic and emotionally nuanced conversations. By leveraging fine-grained emotion classification and a multi-agent framework, we construct a multi-agent PsyRoom A for dialogue reconstruction, generating a high-quality dialogue dataset EmoPsy, which contains 35 sub-emotions, 423 specific emotion scenarios, and 12,350 dialogues. We also propose PsyRoom B for generating personalized treatment plans. Quantitative evaluations demonstrate that AI PsyRoom significantly outperforms state-of-the-art methods, achieving 18% improvement in problem orientation, 23% in expression, 24% in Empathy, and 16% in interactive communication quality. The datasets and models are publicly available, providing a foundation for advancing AI-assisted psychological counseling research.
zh
[AI-127] Honey I shrunk the hypothesis space (through logical preprocessing)
【速读】: This paper addresses the inefficiency of learning in inductive logic programming (ILP) caused by overly large hypothesis spaces. The key to the solution is to "shrink" the hypothesis space before the ILP system searches it, using background knowledge to identify and remove rules that cannot appear in an optimal hypothesis regardless of the training examples, substantially reducing learning times while maintaining predictive accuracy.
链接: https://arxiv.org/abs/2506.06739
作者: Andrew Cropper,Filipe Gouveia,David M. Cerna
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to JAIR
Abstract:Inductive logic programming (ILP) is a form of logical machine learning. The goal is to search a hypothesis space for a hypothesis that generalises training examples and background knowledge. We introduce an approach that ‘shrinks’ the hypothesis space before an ILP system searches it. Our approach uses background knowledge to find rules that cannot be in an optimal hypothesis regardless of the training examples. For instance, our approach discovers relationships such as “even numbers cannot be odd” and “prime numbers greater than 2 are odd”. It then removes violating rules from the hypothesis space. We implement our approach using answer set programming and use it to shrink the hypothesis space of a constraint-based ILP system. Our experiments on multiple domains, including visual reasoning and game playing, show that our approach can substantially reduce learning times whilst maintaining predictive accuracies. For instance, given just 10 seconds of preprocessing time, our approach can reduce learning times from over 10 hours to only 2 seconds.
zh
[AI-128] AI-Driven Vulnerability Analysis in Smart Contracts: Trends, Challenges and Future Directions
【速读】: This paper addresses the security vulnerabilities of smart contracts, such as numerical overflows, reentrancy attacks, and improper access permissions, which have caused losses of millions of dollars across the blockchain and smart contract sector. Traditional auditing techniques are limited in scalability, automation, and adaptability to evolving development patterns, so the paper surveys AI-based solutions, whose key lies in techniques such as machine learning, deep learning, graph neural networks, and transformer-based models, analyzed comprehensively from code representation and semantic processing through to real-world vulnerability detection, to deliver more efficient, accurate, and scalable security assurance.
链接: https://arxiv.org/abs/2506.06735
作者: Mesut Ozdag
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Smart contracts, integral to blockchain ecosystems, enable decentralized applications to execute predefined operations without intermediaries. Their ability to enforce trustless interactions has made them a core component of platforms such as Ethereum. Vulnerabilities such as numerical overflows, reentrancy attacks, and improper access permissions have led to the loss of millions of dollars throughout the blockchain and smart contract sector. Traditional smart contract auditing techniques such as manual code reviews and formal verification face limitations in scalability, automation, and adaptability to evolving development patterns. As a result, AI-based solutions have emerged as a promising alternative, offering the ability to learn complex patterns, detect subtle flaws, and provide scalable security assurances. This paper examines novel AI-driven techniques for vulnerability detection in smart contracts, focusing on machine learning, deep learning, graph neural networks, and transformer-based models. This paper analyzes how each technique represents code, processes semantic information, and responds to real-world vulnerability classes. We also compare their strengths and weaknesses in terms of accuracy, interpretability, computational overhead, and real-time applicability. Lastly, it highlights open challenges and future opportunities for advancing this domain.
zh
[AI-129] Fuse and Federate: Enhancing EV Charging Station Security with Multimodal Fusion and Federated Learning
【速读】: This paper addresses the significant cybersecurity challenges electric vehicle supply equipment (EVSE) faces within smart grid infrastructure, particularly novel sophisticated attacks driven by the interconnected and autonomous nature of EVSE, such as network reconnaissance, backdoor intrusions, and distributed denial-of-service (DDoS) attacks. Existing network- or host-based intrusion detection systems (IDS) struggle to detect advanced attacks crafted specifically to exploit new vulnerabilities in EVSE infrastructure. The key to the solution is a novel intrusion detection framework built on multimodal data sources, including network traffic and kernel events, that uses federated learning for distributed, collaborative learning across EVSE stations, improving attack pattern recognition while preserving data privacy. Experiments show a detection rate above 98% and precision above 97% in decentralized settings, effectively addressing the evolving challenges of EVSE security.
链接: https://arxiv.org/abs/2506.06730
作者: Rabah Rahal,Abdelaziz Amara Korba,Yacine Ghamri-Doudane
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid global adoption of electric vehicles (EVs) has established electric vehicle supply equipment (EVSE) as a critical component of smart grid infrastructure. While essential for ensuring reliable energy delivery and accessibility, EVSE systems face significant cybersecurity challenges, including network reconnaissance, backdoor intrusions, and distributed denial-of-service (DDoS) attacks. These emerging threats, driven by the interconnected and autonomous nature of EVSE, require innovative and adaptive security mechanisms that go beyond traditional intrusion detection systems (IDS). Existing approaches, whether network-based or host-based, often fail to detect sophisticated and targeted attacks specifically crafted to exploit new vulnerabilities in EVSE infrastructure. This paper proposes a novel intrusion detection framework that leverages multimodal data sources, including network traffic and kernel events, to identify complex attack patterns. The framework employs a distributed learning approach, enabling collaborative intelligence across EVSE stations while preserving data privacy through federated learning. Experimental results demonstrate that the proposed framework outperforms existing solutions, achieving a detection rate above 98% and a precision rate exceeding 97% in decentralized environments. This solution addresses the evolving challenges of EVSE security, offering a scalable and privacy-preserving response to advanced cyber threats.
zh
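上文框架通过联邦学习在各EVSE站点间聚合模型而不共享原始数据。下面给出标准联邦平均(FedAvg)聚合一步的极简Python示意;站点数、层结构与参数取值均为本文假设,并非论文官方实现:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """按本地样本量加权平均各站点的模型参数(标准 FedAvg 聚合)。"""
    total = sum(client_sizes)
    agg = [np.zeros_like(w) for w in client_weights[0]]
    for ws, n in zip(client_weights, client_sizes):
        for i, w in enumerate(ws):
            agg[i] += (n / total) * w
    return agg

# 两个EVSE站点、两层参数的玩具示例
site_a = [np.ones((4, 4)), np.ones(4)]    # 站点A的本地模型参数
site_b = [np.zeros((4, 4)), np.zeros(4)]  # 站点B的本地模型参数
global_w = fedavg([site_a, site_b], client_sizes=[300, 100])
print(global_w[1])  # -> [0.75 0.75 0.75 0.75]
```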
[AI-130] WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making
【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在结构化、领域特定情境中生成精确预测的局限性,这些问题源于其无法将广泛而无结构的理解扎根于具体环境。解决方案的关键在于提出WorldLLM框架,该框架通过结合贝叶斯推理(Bayesian inference)、自主主动探索与强化学习(reinforcement learning),增强基于LLM的世界建模能力。WorldLLM利用LLM的上下文学习能力,通过自然语言假设引导世界模型的预测,并通过贝叶斯推理框架迭代优化假设,同时使用好奇心驱动的强化学习策略收集证据,从而实现预测能力的持续提升。
链接: https://arxiv.org/abs/2506.06725
作者: Guillaume Levy,Cedric Colas,Pierre-Yves Oudeyer,Thomas Carta,Clement Romac
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. These limitations arise from their inability to ground their broad, unstructured understanding in specific environments. To address this, we present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning. WorldLLM leverages the in-context learning abilities of LLMs to guide an LLM-based world model’s predictions using natural language hypotheses given in its prompt. These hypotheses are iteratively refined through a Bayesian inference framework that leverages a second LLM as the proposal distribution given collected evidence. This evidence is collected using a curiosity-driven reinforcement learning policy that explores the environment to find transitions with a low log-likelihood under our LLM-based predictive model using the current hypotheses. By alternating between refining hypotheses and collecting new evidence, our framework autonomously drives continual improvement of the predictions. Our experiments demonstrate the effectiveness of WorldLLM in a textual game environment that requires agents to manipulate and combine objects. The framework not only enhances predictive accuracy, but also generates human-interpretable theories of environment dynamics.
zh
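WorldLLM 的核心是在“收集低似然证据”与“精炼假设”之间交替。下面用占位函数勾勒这一控制流;其中 log_likelihood 与 refine 在真实系统中均由 LLM 承担,此处的实现纯属本文虚构的示意:

```python
import math, random

def log_likelihood(hypothesis, transition):
    """占位打分:真实系统中由基于 LLM 的世界模型给出 log P(o'|o,a,假设)。"""
    random.seed(hash((hypothesis, transition)) % (2**32))
    return math.log(random.uniform(0.01, 1.0))

def collect_evidence(hypothesis, candidate_transitions, k=2):
    """好奇心驱动:优先收集当前假设下 log 似然最低(最“意外”)的转移。"""
    return sorted(candidate_transitions,
                  key=lambda t: log_likelihood(hypothesis, t))[:k]

def refine(hypothesis, evidence):
    """占位:真实系统中由第二个 LLM 作为提议分布,在贝叶斯框架下改写假设。"""
    return hypothesis + f" | 已纳入 {len(evidence)} 条新证据"

hyp = "物体A与B组合会产生C"
transitions = [("拿起A", "A在手中"), ("合并A B", "得到C"), ("丢下A", "A在地上")]
for _ in range(2):  # 交替:收集证据 -> 精炼假设
    hyp = refine(hyp, collect_evidence(hyp, transitions))
print(hyp)
```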
[AI-131] Integrating AI Planning Semantics into SysML System Models for Automated PDDL File Generation
【速读】:该论文试图解决如何将基于规划领域定义语言(PDDL)的规划语义直接集成到系统模型中的问题,以实现系统建模与人工智能规划之间的有效衔接。解决方案的关键在于定义可重用的构造型(stereotypes),用于表示PDDL的核心概念如类型、谓词、函数和动作,并通过形式化的OCL约束确保语法一致性,同时基于PDDL 3.1的巴科斯-诺尔范式(BNF)定义构建SysML配置文件,从而支持自动化和基于模型的规划描述生成。
链接: https://arxiv.org/abs/2506.06714
作者: Hamied Nabizada,Tom Jeleniewski,Lasse Beers,Maximilian Weigand,Felix Gehlhoff,Alexander Fay
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a SysML profile that enables the direct integration of planning semantics based on the Planning Domain Definition Language (PDDL) into system models. Reusable stereotypes are defined for key PDDL concepts such as types, predicates, functions and actions, while formal OCL constraints ensure syntactic consistency. The profile was derived from the Backus-Naur Form (BNF) definition of PDDL 3.1 to align with SysML modeling practices. A case study from aircraft manufacturing demonstrates the application of the profile: a robotic system with interchangeable end effectors is modeled and enriched to generate both domain and problem descriptions in PDDL format. These are used as input to a PDDL solver to derive optimized execution plans. The approach supports automated and model-based generation of planning descriptions and provides a reusable bridge between system modeling and AI planning in engineering design.
zh
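该论文的思路是从 SysML 模型自动导出 PDDL 描述。下面的 Python 片段示意如何把(假设已从构造型中提取出的)类型、谓词与动作拼装成 PDDL 3.1 域文件文本;具体的构造型映射规则与 OCL 约束以论文为准:

```python
def make_pddl_domain(name, types, predicates, actions):
    """把假设的 SysML 构造型导出结果拼成 PDDL 域文件文本(仅为示意)。"""
    pred_s = "\n    ".join(predicates)
    act_s = "\n".join(
        f"  (:action {a['name']}\n"
        f"   :parameters {a['params']}\n"
        f"   :precondition {a['pre']}\n"
        f"   :effect {a['eff']})"
        for a in actions)
    return (f"(define (domain {name})\n"
            f"  (:requirements :strips :typing)\n"
            f"  (:types {' '.join(types)})\n"
            f"  (:predicates\n    {pred_s})\n{act_s})")

domain = make_pddl_domain(
    name="robot-cell",
    types=["robot", "effector", "part"],
    predicates=["(mounted ?r - robot ?e - effector)", "(processed ?p - part)"],
    actions=[{"name": "process",
              "params": "(?r - robot ?e - effector ?p - part)",
              "pre": "(mounted ?r ?e)",
              "eff": "(processed ?p)"}])
print(domain)  # 输出可直接交给 PDDL 求解器解析的域描述文本
```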
[AI-132] Do Protein Transformers Have Biological Intelligence? ECML-PKDD2025
【速读】:该论文旨在解决蛋白质序列中生物智能的捕捉问题,即如何利用深度神经网络(特别是Transformer)有效预测蛋白质的功能特性。其解决方案的关键在于提出了一种新的Transformer架构——Sequence Protein Transformers (SPT),以及一种名为Sequence Score的可解释人工智能(Explainable Artificial Intelligence, XAI)技术。SPT旨在实现计算高效的蛋白质功能预测,而Sequence Score则能够高效解释模型决策过程,从而揭示蛋白质模型中的生物智能。此外,研究还构建了Protein-FN数据集以支持相关研究。
链接: https://arxiv.org/abs/2506.06701
作者: Fudong Lin,Wanrou Du,Jinchan Liu,Tarikul Milon,Shelby Meche,Wu Xu,Xiaoqi Qin,Xu Yuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注: Accepted by European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2025)
Abstract:Deep neural networks, particularly Transformers, have been widely adopted for predicting the functional properties of proteins. In this work, we focus on exploring whether Protein Transformers can capture biological intelligence among protein sequences. To achieve our goal, we first introduce a protein function dataset, namely Protein-FN, providing over 9000 protein data with meaningful labels. Second, we devise a new Transformer architecture, namely Sequence Protein Transformers (SPT), for computationally efficient protein function predictions. Third, we develop a novel Explainable Artificial Intelligence (XAI) technique called Sequence Score, which can efficiently interpret the decision-making processes of protein models, thereby overcoming the difficulty of deciphering the biological intelligence hidden in Protein Transformers. Remarkably, even our smallest SPT-Tiny model, which contains only 5.4M parameters, demonstrates impressive predictive accuracy, achieving 94.3% on the Antibiotic Resistance (AR) dataset and 99.6% on the Protein-FN dataset, all accomplished by training from scratch. Besides, our Sequence Score technique helps reveal that our SPT models can discover several meaningful patterns underlying the sequence structures of protein data, with these patterns aligning closely with the domain knowledge in the biology community. We have officially released our Protein-FN dataset on Hugging Face Datasets this https URL. Our code is available at this https URL.
zh
[AI-133] Design and Implementation of a RISC-V SoC with Custom DSP Accelerators for Edge Computing
【速读】:该论文旨在分析RISC-V指令集架构(Instruction Set Architecture, ISA)的模块化设计、实现挑战及性能特征,以评估其在嵌入式系统中的适用性及可扩展性。解决方案的关键在于通过周期精确的流水线实现仿真,评估CPI(cycles per instruction)和能效等性能指标,并验证RISC-V在特定工艺节点下相较于ARM Cortex-M0的17%功率消耗降低优势,同时突出其开放标准带来的领域专用优化灵活性。
链接: https://arxiv.org/abs/2506.06693
作者: Priyanshu Yadav
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 12 Pages, 1 figure
Abstract:This paper presents a comprehensive analysis of the RISC-V instruction set architecture, focusing on its modular design, implementation challenges, and performance characteristics. We examine the RV32I base instruction set with extensions for multiplication (M) and atomic operations (A). Through cycle-accurate simulation of a pipelined implementation, we evaluate performance metrics including CPI (cycles per instruction) and power efficiency. Our results demonstrate RISC-V’s advantages in embedded systems and its scalability for custom accelerators. Comparative analysis shows a 17% reduction in power consumption compared to ARM Cortex-M0 implementations in similar process nodes. The open-standard nature of RISC-V provides significant flexibility for domain-specific optimizations.
zh
[AI-134] RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks
【速读】:该论文试图解决双臂机器人在复杂多任务场景中任务并行性优化不足的问题,从而限制了双臂协作的潜力。解决方案的关键在于提出RoboPARA框架,该框架采用两阶段流程:基于依赖图的任务候选生成,通过构建有向无环图(DAG)建模任务依赖关系并消除冗余;以及基于图重新遍历的双臂并行规划,通过优化DAG遍历以最大化并行性同时保持任务一致性。
链接: https://arxiv.org/abs/2506.06683
作者: Shiying Duan,Pei Ren,Nanxiang Jiang,Zhengping Che,Jian Tang,Yifan Sun,Zhaoxin Fan,Wenjun Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Dual-arm robots play a crucial role in improving efficiency and flexibility in complex multitasking scenarios. While existing methods have achieved promising results in task planning, they often fail to fully optimize task parallelism, limiting the potential of dual-arm collaboration. To address this issue, we propose RoboPARA, a novel large language model (LLM)-driven framework for dual-arm task parallelism planning. RoboPARA employs a two-stage process: (1) Dependency Graph-based Planning Candidates Generation, which constructs directed acyclic graphs (DAGs) to model task dependencies and eliminate redundancy, and (2) Graph Re-Traversal-based Dual-Arm Parallel Planning, which optimizes DAG traversal to maximize parallelism while maintaining task coherence. In addition, we introduce the Cross-Scenario Dual-Arm Parallel Task dataset (X-DAPT dataset), the first dataset specifically designed to evaluate dual-arm task parallelism across diverse scenarios and difficulty levels. Extensive experiments on the X-DAPT dataset demonstrate that RoboPARA significantly outperforms existing methods, achieving higher efficiency and reliability, particularly in complex task combinations. The code and dataset will be released upon acceptance.
zh
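RoboPARA 的两个阶段分别对应“建依赖 DAG”与“按批并行遍历”。下面用标准库 graphlib 给出一个玩具示意:同一就绪批次内的任务互不依赖,可分派给双臂并行执行;任务内容与“每批最多两个”的臂分配策略均为本文假设:

```python
from graphlib import TopologicalSorter  # Python 3.9+ 标准库

# 假设的任务依赖 DAG:键任务依赖于值集合中的任务
deps = {
    "抓取杯子": set(), "抓取壶": set(),
    "倒水": {"抓取杯子", "抓取壶"},
    "放回壶": {"倒水"}, "递出杯子": {"倒水"},
}

ts = TopologicalSorter(deps)
ts.prepare()
while ts.is_active():
    batch = list(ts.get_ready())  # 同一批任务互不依赖,可并行
    for i in range(0, len(batch), 2):  # 双臂:每次最多同时派发2个任务
        pair = batch[i:i + 2]
        print("并行执行:", {f"arm{j}": t for j, t in enumerate(pair)})
        for t in pair:
            ts.done(t)
```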
[AI-135] Self-Adapting Improvement Loops for Robotic Learning
【速读】:该论文试图解决视频生成模型在未见过的任务中泛化能力不足的问题,尤其是在机器人任务中,模型需要具备持续在线学习和自我改进的能力。解决方案的关键在于提出Self-Adapting Improvement Loop (SAIL),该方法通过域内视频模型在自生成轨迹上迭代更新,结合互联网规模预训练视频模型进行适应,从而逐步提升模型在特定任务上的性能。
链接: https://arxiv.org/abs/2506.06658
作者: Calvin Luo,Zilai Zeng,Mingxi Jia,Yilun Du,Chen Sun
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Video generative models trained on expert demonstrations have been utilized as performant text-conditioned visual planners for solving robotic tasks. However, generalization to unseen tasks remains a challenge. Whereas improved generalization may be facilitated by leveraging learned prior knowledge from additional pre-collected offline data sources, such as web-scale video datasets, in the era of experience we aim to design agents that can continuously improve in an online manner from self-collected behaviors. In this work we thus propose the Self-Adapting Improvement Loop (SAIL), where an in-domain video model iteratively updates itself on self-produced trajectories, collected through adaptation with an internet-scale pretrained video model, and steadily improves its performance for a specified task of interest. We apply SAIL to a diverse suite of MetaWorld tasks, as well as two manipulation tasks on a real robot arm, and find that performance improvements continuously emerge over multiple iterations for novel tasks initially unseen during original in-domain video model training. Furthermore, we discover that SAIL is surprisingly robust regarding if and how the self-collected experience is filtered, and the quality of the initial in-domain demonstrations. Through adaptation with summarized internet-scale data, and learning through online experience, we thus demonstrate a way to iteratively bootstrap a high-performance video model for solving novel robotic tasks through self-improvement.
zh
[AI-136] GELD: A Unified Neural Model for Efficiently Solving Traveling Salesman Problems Across Different Scales
【速读】:该论文旨在解决传统基于神经网络的旅行商问题(TSP)求解器在使用相同预训练模型参数时,难以高效求解小规模和大规模TSP的问题,从而限制了其实际应用价值。解决方案的关键在于提出一种名为GELD的新型神经TSP求解器,其核心是结合轻量级全局视图编码器(Global-view Encoder, GE)与重量级局部视图解码器(Local-view Decoder, LD),并通过一种低复杂度注意力机制提升模型的推理速度与可扩展性,同时采用两阶段训练策略增强模型的泛化能力。
链接: https://arxiv.org/abs/2506.06634
作者: Yubin Xiao,Di Wang,Rui Cao,Xuan Wu,Boyang Li,You Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 4 figures, and 14 tables
Abstract:The Traveling Salesman Problem (TSP) is a well-known combinatorial optimization problem with broad real-world applications. Recent advancements in neural network-based TSP solvers have shown promising results. Nonetheless, these models often struggle to efficiently solve both small- and large-scale TSPs using the same set of pre-trained model parameters, limiting their practical utility. To address this issue, we introduce a novel neural TSP solver named GELD, built upon our proposed broad global assessment and refined local selection framework. Specifically, GELD integrates a lightweight Global-view Encoder (GE) with a heavyweight Local-view Decoder (LD) to enrich embedding representation while accelerating the decision-making process. Moreover, GE incorporates a novel low-complexity attention mechanism, allowing GELD to achieve low inference latency and scalability to larger-scale TSPs. Additionally, we propose a two-stage training strategy that utilizes training instances of different sizes to bolster GELD's generalization ability. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that GELD outperforms seven state-of-the-art models considering both solution quality and inference speed. Furthermore, GELD can be employed as a post-processing method to significantly elevate the quality of the solutions derived by existing neural TSP solvers by spending affordable additional computing time. Notably, GELD is shown to be capable of solving TSPs with up to 744,710 nodes, and is, to the best of our knowledge, the first to solve a TSP of this size without relying on divide-and-conquer strategies.
zh
[AI-137] Active Test-time Vision-Language Navigation
【速读】:该论文旨在解决视觉-语言导航(Vision-Language Navigation, VLN)策略在部署到未知环境时任务性能下降的问题,特别是在测试阶段无法获得外部交互或反馈的情况下。其关键解决方案是提出ATENA(Active Test-time Navigation Agent),这是一个基于测试阶段主动学习的框架,通过对不确定导航结果的逐回合(episodic)反馈来增强代理的处理能力。ATENA的核心在于混合熵优化(mixture entropy optimization),通过结合动作分布和伪专家分布来控制预测置信度和动作偏好,从而提升不确定性校准效果,并结合自我主动学习(self-active learning)策略使代理持续参与迭代过程,实现更稳健和适应性的决策。
链接: https://arxiv.org/abs/2506.06630
作者: Heeju Ko,Sungjune Kim,Gyeongrok Oh,Jeongyoon Yoon,Honglak Lee,Sujin Jang,Seungryong Kim,Sangpil Kim
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Navigation (VLN) policies trained on offline datasets often exhibit degraded task performance when deployed in unfamiliar navigation environments at test time, where agents are typically evaluated without access to external interaction or feedback. Entropy minimization has emerged as a practical solution for reducing prediction uncertainty at test time; however, it can suffer from accumulated errors, as agents may become overconfident in incorrect actions without sufficient contextual grounding. To tackle these challenges, we introduce ATENA (Active TEst-time Navigation Agent), a test-time active learning framework that enables a practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. Here, we propose mixture entropy optimization, where entropy is obtained from a combination of the action and pseudo-expert distributions-a hypothetical action distribution assuming the agent’s selected action to be optimal-controlling both prediction confidence and action preference. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions. As a result, the agent stays actively engaged throughout all iterations, leading to well-grounded and adaptive decision-making. Extensive evaluations on challenging VLN benchmarks-REVERIE, R2R, and R2R-CE-demonstrate that ATENA successfully overcomes distributional shifts at test time, outperforming the compared baseline methods across various settings.
zh
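混合熵优化的直觉是:把动作分布与“假设代理所选动作即最优”的伪专家 one-hot 分布混合后再计算熵。下面是该量的一个极简计算示意;混合权重 alpha 为本文假设,论文中的具体构造可能不同:

```python
import numpy as np

def mixture_entropy(action_probs, alpha=0.5):
    """动作分布与伪专家 one-hot 分布的混合分布熵(alpha 为假设的混合权重)。"""
    p = np.asarray(action_probs, dtype=float)
    q = np.zeros_like(p)
    q[p.argmax()] = 1.0                # 伪专家:one-hot 在所选动作上
    m = alpha * p + (1 - alpha) * q    # 混合分布
    m = m[m > 0]
    return float(-(m * np.log(m)).sum())

print(mixture_entropy([0.5, 0.3, 0.2]))    # 置信较低 -> 混合熵较高
print(mixture_entropy([0.9, 0.05, 0.05]))  # 置信较高 -> 混合熵较低
```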
[AI-138] QuantMCP: Grounding Large Language Models in Verifiable Financial Reality
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在金融分析与决策中因数据幻觉和缺乏实时、可验证金融信息访问而面临的应用障碍。其解决方案的关键在于提出QuantMCP框架,该框架通过Model Context Protocol (MCP) 实现标准化和安全的工具调用,使LLMs能够准确对接多种Python可访问的金融数据API,从而获取经过验证的结构化数据,提升模型的数据解析能力与分析深度。
链接: https://arxiv.org/abs/2506.06622
作者: Yifan Zeng
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) hold immense promise for revolutionizing financial analysis and decision-making, yet their direct application is often hampered by issues of data hallucination and lack of access to real-time, verifiable financial information. This paper introduces QuantMCP, a novel framework designed to rigorously ground LLMs in financial reality. By leveraging the Model Context Protocol (MCP) for standardized and secure tool invocation, QuantMCP enables LLMs to accurately interface with a diverse array of Python-accessible financial data APIs (e.g., Wind, yfinance). Users can interact via natural language to precisely retrieve up-to-date financial data, thereby overcoming LLM’s inherent limitations in factual data recall. More critically, once furnished with this verified, structured data, the LLM’s analytical capabilities are unlocked, empowering it to perform sophisticated data interpretation, generate insights, and ultimately support more informed financial decision-making processes. QuantMCP provides a robust, extensible, and secure bridge between conversational AI and the complex world of financial data, aiming to enhance both the reliability and the analytical depth of LLM applications in finance.
zh
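QuantMCP 的要点是让 LLM 通过标准化工具调用获取可验证的金融数据。下面用一个简化的工具注册表模拟这类调用,并以真实的 yfinance 库取回行情(需 pip install yfinance 且可联网);注册表结构与函数名均为本文假设,并非 MCP 官方 SDK 或论文实现:

```python
import yfinance as yf

TOOLS = {}

def tool(name):
    """极简工具注册装饰器,模拟 MCP 式的标准化工具入口。"""
    def deco(fn):
        TOOLS[name] = fn
        return fn
    return deco

@tool("get_close_prices")
def get_close_prices(symbol: str, period: str = "5d"):
    """取回指定标的最近收盘价,作为供 LLM 分析的结构化事实数据。"""
    hist = yf.Ticker(symbol).history(period=period)
    return hist["Close"].round(2).to_dict()

def invoke(tool_name: str, **kwargs):
    """LLM 解析用户自然语言后发出的标准化工具调用(示意)。"""
    return TOOLS[tool_name](**kwargs)

print(invoke("get_close_prices", symbol="AAPL", period="5d"))
```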
[AI-139] CAtCh: Cognitive Assessment through Cookie Thief
【速读】:该论文试图解决从患者语音记录中预测更广泛的认知障碍(Cognitive Impairment, CI)的问题,而现有方法主要针对阿尔茨海默病及相关痴呆(Alzheimer’s disease and related dementia, ADRD)的预测。论文的关键解决方案是评估基于语音的开源方法以及多模态情感分析方法在CI预测任务中的表现,结果表明多模态方法优于单模态方法,且基于声学特征的方法优于基于语言特征的方法,特别是可解释的与情感和语调相关的声学特征显著优于基于BERT的语言特征和可解释的语言特征。
链接: https://arxiv.org/abs/2506.06603
作者: Joseph T Colonel,Carolyn Hagler,Guiselle Wismer,Laura Curtis,Jacqueline Becker,Juan Wisnivesky,Alex Federman,Gaurav Pandey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Several machine learning algorithms have been developed for the prediction of Alzheimer’s disease and related dementia (ADRD) from spontaneous speech. However, none of these algorithms have been translated for the prediction of broader cognitive impairment (CI), which in some cases is a precursor and risk factor of ADRD. In this paper, we evaluated several speech-based open-source methods originally proposed for the prediction of ADRD, as well as methods from multimodal sentiment analysis for the task of predicting CI from patient audio recordings. Results demonstrated that multimodal methods outperformed unimodal ones for CI prediction, and that acoustics-based approaches performed better than linguistics-based ones. Specifically, interpretable acoustic features relating to affect and prosody were found to significantly outperform BERT-based linguistic features and interpretable linguistic features, respectively. All the code developed for this study is available at this https URL.
zh
[AI-140] From Model-Based and Adaptive Control to Evolving Fuzzy Control
【速读】:该论文旨在回顾和总结经典模糊建模与控制框架的历史发展及核心贡献,并探讨演化智能系统在模糊建模与控制中的出现及其重要性,特别是其在处理非平稳环境中的优势。论文提出,演化模糊系统(Evolving Fuzzy Systems)通过从数据流中逐步更新规则库结构,实现对模糊模型的构建与适应,其解决方案的关键在于实现模型结构的动态演化,从而提升系统在复杂、变化环境中的性能与适应能力。
链接: https://arxiv.org/abs/2506.06594
作者: Daniel Leite,Igor Škrjanc,Fernando Gomide
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages, 2 figures. Fuzz-IEEE 2025 Booklet: 60 Years of Fuzzy Set Theory
Abstract:Evolving fuzzy systems build and adapt fuzzy models - such as predictors and controllers - by incrementally updating their rule-base structure from data streams. On the occasion of the 60-year anniversary of fuzzy set theory, commemorated during the Fuzz-IEEE 2025 event, this brief paper revisits the historical development and core contributions of classical fuzzy and adaptive modeling and control frameworks. It then highlights the emergence and significance of evolving intelligent systems in fuzzy modeling and control, emphasizing their advantages in handling nonstationary environments. Key challenges and future directions are discussed, including safety, interpretability, and principled structural evolution.
zh
[AI-141] AI Simulation by Digital Twins: Systematic Survey, Reference Framework, and Mapping to a Standardized Architecture
【速读】:该论文旨在解决现代 subsymbolic AI 在数据量和质量不足方面的挑战,通过 AI 模拟中的数字孪生(Digital Twin)技术提供解决方案。其关键在于利用高保真虚拟副本与物理系统的交互能力,结合先进的模拟器生成合成数据,从而安全高效地训练 AI 代理。
链接: https://arxiv.org/abs/2506.06580
作者: Xiaoran Liu,Istvan David
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Software Engineering (cs.SE); Systems and Control (eess.SY)
备注:
Abstract:Insufficient data volume and quality are particularly pressing challenges in the adoption of modern subsymbolic AI. To alleviate these challenges, AI simulation uses virtual training environments in which AI agents can be safely and efficiently developed with simulated, synthetic data. Digital twins open new avenues in AI simulation, as these high-fidelity virtual replicas of physical systems are equipped with state-of-the-art simulators and the ability to further interact with the physical system for additional data collection. In this article, we report on our systematic survey of digital twin-enabled AI simulation. By analyzing 22 primary studies, we identify technological trends and derive a reference framework to situate digital twins and AI components. Based on our findings, we derive a reference framework and provide architectural guidelines by mapping it onto the ISO 23247 reference architecture for digital twins. Finally, we identify challenges and research opportunities for prospective researchers.
zh
[AI-142] The Optimization Paradox in Clinical AI Multi-Agent Systems
【速读】:该论文试图解决多智能体人工智能系统在临床环境中部署时,组件级优化与系统整体性能之间关系不明确的问题。其解决方案的关键在于通过分解临床诊断过程为信息收集、解释和鉴别诊断三个阶段,并对比单智能体系统与多智能体系统的性能,从而评估不同系统架构在诊断结果、流程合规性和成本效率方面的表现。研究揭示了一个悖论:尽管多智能体系统通常优于单智能体系统,但采用最优组件的“最佳组合”系统在诊断准确性上却显著低于顶级多智能体系统,这表明医疗AI的成功集成不仅需要组件级别的优化,还需关注智能体间的信息流动与兼容性。
链接: https://arxiv.org/abs/2506.06574
作者: Suhana Bedi,Iddah Mlauzi,Daniel Shin,Sanmi Koyejo,Nigam H. Shah
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent artificial intelligence systems are increasingly deployed in clinical settings, yet the relationship between component-level optimization and system-wide performance remains poorly understood. We evaluated this relationship using 2,400 real patient cases from the MIMIC-CDM dataset across four abdominal pathologies (appendicitis, pancreatitis, cholecystitis, diverticulitis), decomposing clinical diagnosis into information gathering, interpretation, and differential diagnosis. We evaluated single agent systems (one model performing all tasks) against multi-agent systems (specialized models for each task) using comprehensive metrics spanning diagnostic outcomes, process adherence, and cost efficiency. Our results reveal a paradox: while multi-agent systems generally outperformed single agents, the component-optimized or Best of Breed system with superior components and excellent process metrics (85.5% information accuracy) significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for a top multi-agent system). This finding underscores that successful integration of AI in healthcare requires not just component level optimization but also attention to information flow and compatibility between agents. Our findings highlight the need for end to end system validation rather than relying on component metrics alone.
zh
[AI-143] Graph Persistence goes Spectral
【速读】:该论文试图解决图神经网络(Graph Neural Networks, GNNs)在表达能力上的局限性,特别是如何通过引入更复杂的拓扑信息来超越Weisfeiler-Leman(WL)层次结构。现有方法虽然尝试通过将顶点和边特征嵌入到持久同调(Persistent Homology, PH)图中以提高表达能力,但仍无法有效捕捉基本的图结构信息。论文提出的解决方案是SpectRe,这是一种将谱信息整合到PH图中的新型拓扑描述符,其关键在于通过结合谱信息提升图表示的表达能力,并证明了SpectRe在图上的严格优越性以及局部稳定性。
链接: https://arxiv.org/abs/2506.06571
作者: Mattie Ji,Amauri H. Souza,Vikas Garg
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 24 pages, 4 figures, 6 tables
Abstract:Including intricate topological information (e.g., cycles) provably enhances the expressivity of message-passing graph neural networks (GNNs) beyond the Weisfeiler-Leman (WL) hierarchy. Consequently, Persistent Homology (PH) methods are increasingly employed for graph representation learning. In this context, recent works have proposed decorating classical PH diagrams with vertex and edge features for improved expressivity. However, due to their dependence on features, these methods still fail to capture basic graph structural information. In this paper, we propose SpectRe – a new topological descriptor for graphs that integrates spectral information into PH diagrams. Notably, SpectRe is strictly more expressive than existing descriptors on graphs. We also introduce notions of global and local stability to analyze existing descriptors and establish that SpectRe is locally stable. Finally, experiments on synthetic and real-world datasets demonstrate the effectiveness of SpectRe and its potential to enhance the capabilities of graph models in relevant learning tasks.
zh
[AI-144] KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
【速读】:该论文试图解决如何将AI系统的推理、编码和理解能力有效应用于构建复杂的真实世界数据科学流水线的问题。解决方案的关键在于引入KRAMABENCH,这是一个由104个手动整理的真实数据科学流水线组成的基准,涵盖来自6个不同领域的24个数据源的1700个数据文件,用于全面测试AI系统在数据处理中的端到端能力。此外,论文还提出了参考框架DS-GURU,指导AI模型将问题分解为子任务、逐步推理并生成实现设计的Python代码,以评估现有模型在需要广泛数据处理和领域知识的任务中的表现。
链接: https://arxiv.org/abs/2506.06541
作者: Eugenie Lai,Gerardo Vitagliano,Ziyu Zhang,Sivaprasad Sudhir,Om Chabra,Anna Zeng,Anton A. Zabreyko,Chenning Li,Ferdi Kossmann,Jialin Ding,Jun Chen,Markos Markakis,Matthew Russo,Weiyang Wang,Ziniu Wu,Michael J. Cafarella,Lei Cao,Samuel Madden,Tim Kraska
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Constructing real-world data-to-insight pipelines often involves data extraction from data lakes, data integration across heterogeneous data sources, and diverse operations from data cleaning to analysis. The design and implementation of data science pipelines require domain knowledge, technical expertise, and even project-specific insights. AI systems have shown remarkable reasoning, coding, and understanding capabilities. However, it remains unclear to what extent these capabilities translate into successful design and execution of such complex pipelines. We introduce KRAMABENCH: a benchmark composed of 104 manually-curated real-world data science pipelines spanning 1700 data files from 24 data sources in 6 different domains. We show that these pipelines test the end-to-end capabilities of AI systems on data processing, requiring data discovery, wrangling and cleaning, efficient processing, statistical reasoning, and orchestrating data processing steps given a high-level task. Our evaluation tests 5 general models and 3 code generation models using our reference framework, DS-GURU, which instructs the AI model to decompose a question into a sequence of subtasks, reason through each step, and synthesize Python code that implements the proposed design. Our results on KRAMABENCH show that, although the models are sufficiently capable of solving well-specified data science code generation tasks, when extensive data processing and domain knowledge are required to construct real-world data science pipelines, existing out-of-box models fall short. Progress on KramaBench represents crucial steps towards developing autonomous data science agents for real-world applications. Our code, reference framework, and data are available at this https URL.
zh
[AI-145] Hierarchical and Collaborative LLM-Based Control for Multi-UAV Motion and Communication in Integrated Terrestrial and Non-Terrestrial Networks ICML2025
【速读】:该论文旨在解决多无人机系统(multi-UAV systems)在动态和受限环境中的控制与优化问题,特别是在融合地面和非地面网络(integrated terrestrial and non-terrestrial networks)中的协同作业挑战。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的分层协作方法,其中部署在高空平台站(HAPS)的LLM负责无人机接入控制,而每个无人机上的LLM则负责运动规划与控制,从而实现高层战略规划与低层战术决策的结合。
链接: https://arxiv.org/abs/2506.06532
作者: Zijiang Yan,Hao Zhou,Jianhua Pei,Hina Tabassum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Robotics (cs.RO); Systems and Control (eess.SY)
备注: Accepted in ICML 2025 Workshop on Machine Learning for Wireless Communication and Networks (ML4Wireless)
Abstract:Unmanned aerial vehicles (UAVs) have been widely adopted in various real-world applications. However, the control and optimization of multi-UAV systems remain a significant challenge, particularly in dynamic and constrained environments. This work explores the joint motion and communication control of multiple UAVs operating within integrated terrestrial and non-terrestrial networks that include high-altitude platform stations (HAPS). Specifically, we consider an aerial highway scenario in which UAVs must accelerate, decelerate, and change lanes to avoid collisions and maintain overall traffic flow. Different from existing studies, we propose a novel hierarchical and collaborative method based on large language models (LLMs). In our approach, an LLM deployed on the HAPS performs UAV access control, while another LLM onboard each UAV handles motion planning and control. This LLM-based framework leverages the rich knowledge embedded in pre-trained models to enable both high-level strategic planning and low-level tactical decisions. This knowledge-driven paradigm holds great potential for the development of next-generation 3D aerial highway systems. Experimental results demonstrate that our proposed collaborative LLM-based method achieves higher system rewards, lower operational costs, and significantly reduced UAV collision rates compared to baseline approaches.
zh
[AI-146] ScriptDoctor: Automatic Generation of PuzzleScript Games via Large Language Models and Tree Search
【速读】:该论文试图解决如何将大规模预训练模型(Large Pre-trained Models)有效地集成到长期时间跨度的自动游戏设计(Automatic Game Design, AGD)流程中,以实现游戏内容的自主生成与测试。现有研究多依赖于人工监督下的临时性使用,缺乏系统性的整合方案。论文提出的解决方案关键在于ScriptDoctor系统,该系统基于大型语言模型(Large Language Model, LLM),在PuzzleScript语言环境下通过迭代循环实现游戏设计的自动生成与测试,利用人类编写的示例进行输出引导、通过PuzzleScript引擎的编译错误获取功能性代码,并借助基于搜索的代理进行游戏测试,从而展示出LLM在开放性、自动化游戏内容生成中的潜力。
链接: https://arxiv.org/abs/2506.06524
作者: Sam Earle,Ahmed Khalifa,Muhammad Umair Nasir,Zehua Jiang,Graham Todd,Andrzej Banburski-Fahey,Julian Togelius
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 5 pages, 3 figures, 3 tables, submitted to IEEE Conference on Games as a Short Paper
Abstract:There is much interest in using large pre-trained models in Automatic Game Design (AGD), whether via the generation of code, assets, or more abstract conceptualization of design ideas. But so far this interest largely stems from the ad hoc use of such generative models under persistent human supervision. Much work remains to show how these tools can be integrated into longer-time-horizon AGD pipelines, in which systems interface with game engines to test generated content autonomously. To this end, we introduce ScriptDoctor, a Large Language Model (LLM)-driven system for automatically generating and testing games in PuzzleScript, an expressive but highly constrained description language for turn-based puzzle games over 2D gridworlds. ScriptDoctor generates and tests game design ideas in an iterative loop, where human-authored examples are used to ground the system’s output, compilation errors from the PuzzleScript engine are used to elicit functional code, and search-based agents play-test generated games. ScriptDoctor serves as a concrete example of the potential of automated, open-ended LLM-based workflows in generating novel game content.
zh
[AI-147] Reinforcement Learning for Autonomous Warehouse Orchestration in SAP Logistics Execution: Redefining Supply Chain Agility
【速读】:该论文旨在解决现代供应链中仓库操作效率与灵活性不足的问题,特别是在SAP Logistics Execution(LE)系统中实现任务的自主协调。其解决方案的关键在于引入一种基于强化学习(reinforcement learning, RL)的框架,通过将仓库流程建模为动态环境,实时优化任务分配、库存移动和订单拣选,从而提升操作敏捷性和效率。
链接: https://arxiv.org/abs/2506.06523
作者: Sumanth Pillella
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages
Abstract:In an era of escalating supply chain demands, SAP Logistics Execution (LE) is pivotal for managing warehouse operations, transportation, and delivery. This research introduces a pioneering framework leveraging reinforcement learning (RL) to autonomously orchestrate warehouse tasks in SAP LE, enhancing operational agility and efficiency. By modeling warehouse processes as dynamic environments, the framework optimizes task allocation, inventory movement, and order picking in real-time. A synthetic dataset of 300,000 LE transactions simulates real-world warehouse scenarios, including multilingual data and operational disruptions. The analysis achieves 95% task optimization accuracy, reducing processing times by 60% compared to traditional methods. Visualizations, including efficiency heatmaps and performance graphs, guide agile warehouse strategies. This approach tackles data privacy, scalability, and SAP integration, offering a transformative solution for modern supply chains.
zh
[AI-148] Private GPT s for LLM -driven testing in software development and machine learning
【速读】:该论文试图解决如何利用私有GPTs自动生成可执行测试代码的问题,其核心在于通过需求描述(如用户故事或史诗中的验收标准)来生成符合测试要求的代码。解决方案的关键在于采用两步流程:首先使用Gherkin语法作为中间步骤,再生成测试代码,相较于直接由需求生成代码,该方法在人类可读性和最佳编码实践(如代码行数和测试库的使用)方面表现更优。
链接: https://arxiv.org/abs/2506.06509
作者: Jakub Jagielski,Markus Abel
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 5 pages, 10 figures
Abstract:In this contribution, we examine the capability of private GPTs to automatically generate executable test code based on requirements. More specifically, we use acceptance criteria as input, formulated as part of epics, or stories, which are typically used in modern development processes. This gives product owners, or business intelligence, respectively, a way to directly produce testable criteria through the use of LLMs. We explore the quality of the so-produced tests in two ways: i) directly by letting the LLM generate code from requirements, ii) through an intermediate step using Gherkin syntax. As a result, it turns out that the two-step procedure yields better results, where we define "better" in terms of human readability and best coding practices, i.e. lines of code and use of additional libraries typically used in testing. Concretely, we evaluate prompt effectiveness across two scenarios: a simple "Hello World" program and a digit classification model, showing that structured prompts lead to higher-quality test outputs.
zh
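上述两步流程(验收标准 → Gherkin 场景 → 测试代码)可以直接表达为两次串联的 LLM 调用。下面的示意用占位函数代替私有 GPT;提示词措辞与函数名均为本文假设:

```python
def llm(prompt: str) -> str:
    """占位:真实系统中此处调用私有 GPT;这里仅回显提示词开头。"""
    return f"<模型输出: {prompt[:24]}…>"

def two_step_test_generation(acceptance_criteria: str) -> str:
    """验收标准 -> Gherkin 场景 -> 可执行测试代码 的两步流水线。"""
    gherkin = llm("请将以下验收标准改写为 Gherkin 场景:\n" + acceptance_criteria)
    return llm("请根据以下 Gherkin 场景生成 pytest 测试代码:\n" + gherkin)

print(two_step_test_generation("用户输入合法账号密码后应成功登录"))
```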
[AI-149] Synthetic Problem Generation for Reasoning via Quality-Diversity Algorithms
【速读】:该论文试图解决现有方法在生成合成数据时难以扩展到更复杂和多样问题领域的问题,这些方法通常依赖于知识蒸馏或使用自然的真实问题陈述来保证问题质量。解决方案的关键在于提出SPARQ:一种基于质量-多样性算法的合成问题生成方法,通过测量问题的求解率(作为问题难度的代理)来生成高质量且多样的数学问题与解答对,仅需一个模型即可实现,从而提升了模型的推理能力。
链接: https://arxiv.org/abs/2506.06499
作者: Alex Havrilla,Edward Hughes,Mikayel Samvelyan,Jacob Abernethy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) driven synthetic data generation has emerged as a powerful method for improving model reasoning capabilities. However, most methods either distill large state-of-the-art models into small students or use natural ground-truth problem statements to guarantee problem statement quality. This limits the scalability of these approaches to more complex and diverse problem domains. To address this, we present SPARQ: Synthetic Problem Generation for Reasoning via Quality-Diversity Algorithms, a novel approach for generating high-quality and diverse synthetic math problem and solution pairs using only a single model by measuring a problem’s solve-rate: a proxy for problem difficulty. Starting from a seed dataset of 7.5K samples, we generate over 20 million new problem-solution pairs. We show that filtering the generated data by difficulty and then fine-tuning the same model on the resulting data improves relative model performance by up to 24%. Additionally, we conduct ablations studying the impact of synthetic data quantity, quality and diversity on model generalization. We find that higher quality, as measured by problem difficulty, facilitates better in-distribution performance. Further, while generating diverse synthetic data does not as strongly benefit in-distribution performance, filtering for more diverse data facilitates more robust OOD generalization. We also confirm the existence of model and data scaling laws for synthetically generated problems, which positively benefit downstream model generalization.
zh
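SPARQ 以求解率作为题目难度的代理指标,并据此过滤合成数据。下面是该过滤步骤的极简示意;0.2/0.8 的阈值为本文假设,并非论文参数:

```python
def solve_rate(model_answers, reference):
    """以同一模型多次采样的答对比例作为题目难度的代理指标。"""
    return sum(a == reference for a in model_answers) / len(model_answers)

def filter_by_difficulty(problems, low=0.2, high=0.8):
    """保留“不太难也不太易”的合成题目(阈值为本文假设)。"""
    return [p for p in problems if low <= p["solve_rate"] <= high]

problems = [
    {"q": "1+1=?", "solve_rate": 1.0},        # 过易,滤除
    {"q": "素数计数问题", "solve_rate": 0.45},  # 保留
    {"q": "病态题面", "solve_rate": 0.0},       # 过难或无解,滤除
]
print([p["q"] for p in filter_by_difficulty(problems)])
```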
[AI-150] The Economic Dispatch of Power-to-Gas Systems with Deep Reinforcement Learning: Tackling the Challenge of Delayed Rewards with Long-Term Energy Storage
【速读】:该论文试图解决如何在考虑可再生能源波动性、电价和负荷变化的情况下,实现Power-to-Gas (P2G)系统的经济运行问题。由于P2G系统在能量转换与存储方面的效率低于电池储能系统(BESs),且电能转化为气体的效益不明显,因此其优化运行具有挑战性。论文提出的解决方案关键在于应用深度强化学习(DRL)技术,并针对P2G系统操作中延迟奖励的特点进行改进,包括整合预测信息、在奖励函数中引入惩罚项以及采用策略性成本计算,从而提升DRL算法在长期能源存储场景下的有效性。
链接: https://arxiv.org/abs/2506.06484
作者: Manuel Sage,Khalil Al Handawi,Yaoyao Fiona Zhao
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at the 19th ASME International Conference on Energy Sustainability
Abstract:Power-to-Gas (P2G) technologies gain recognition for enabling the integration of intermittent renewables, such as wind and solar, into electricity grids. However, determining the most cost-effective operation of these systems is complex due to the volatile nature of renewable energy, electricity prices, and loads. Additionally, P2G systems are less efficient in converting and storing energy compared to battery energy storage systems (BESs), and the benefits of converting electricity into gas are not immediately apparent. Deep Reinforcement Learning (DRL) has shown promise in managing the operation of energy systems amidst these uncertainties. Yet, DRL techniques face difficulties with the delayed reward characteristic of P2G system operation. Previous research has mostly focused on short-term studies that look at the energy conversion process, neglecting the long-term storage capabilities of P2G. This study presents a new method by thoroughly examining how DRL can be applied to the economic operation of P2G systems, in combination with BESs and gas turbines, over extended periods. Through three progressively more complex case studies, we assess the performance of DRL algorithms, specifically Deep Q-Networks and Proximal Policy Optimization, and introduce modifications to enhance their effectiveness. These modifications include integrating forecasts, implementing penalties on the reward function, and applying strategic cost calculations, all aimed at addressing the issue of delayed rewards. Our findings indicate that while DRL initially struggles with the complex decision-making required for P2G system operation, the adjustments we propose significantly improve its capability to devise cost-effective operation strategies, thereby unlocking the potential for long-term energy storage in P2G technologies.
zh
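论文通过改造奖励函数来缓解延迟回报,例如为储能状态计入策略性价值。下面是这类奖励塑形思路的一个极简示意;gas_value 等取值与函数签名纯属本文假设,并非论文参数:

```python
def shaped_reward(revenue, op_cost, gas_stored_now, gas_stored_prev,
                  gas_value=30.0):
    """奖励塑形示意:给储气量增量按假设单价 gas_value(货币/MWh)记入奖励,
    使“电转气并长期储存”在当步即可获得学习信号,缓解延迟回报。"""
    storage_bonus = gas_value * (gas_stored_now - gas_stored_prev)
    return revenue - op_cost + storage_bonus

# 某一步:无售电收入、运行成本 5,储气量由 2.0 增至 2.4 MWh
print(shaped_reward(revenue=0.0, op_cost=5.0,
                    gas_stored_now=2.4, gas_stored_prev=2.0))  # -> 7.0
```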
[AI-151] Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storag e
【速读】:该论文试图解决大规模语言模型(Large Language Model, LLM)训练过程中GPU内存不足的问题,通过利用低成本的PCIe接口固态硬盘(SSD)扩展GPU内存。解决方案的关键在于提出一种生命周期感知的张量卸载框架TERAIO,该框架通过分析训练初期几轮中的张量活跃周期,准确估计每个张量的生命周期,并据此生成优化的张量卸载/预取计划,从而在不阻塞GPU训练流程的情况下,高效地将不活跃的大张量卸载到SSD或从SSD预取回GPU。
链接: https://arxiv.org/abs/2506.06472
作者: Ziqi Yuan,Haoyang Zhang,Yirui Eric Zhou,Apoorve Mohan,I-Hsin Chung,Seetharami Seelam,Jian Huang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:
Abstract:We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TERAIO, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that the active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each LLM training iteration, the inactive tensors are usually large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching tensors to/from slow SSDs without stalling the GPU training process. TERAIO accurately estimates the lifetime (active period of time in GPU memory) of each tensor with the profiling of the first few iterations in the training process. With the tensor lifetime analysis, TERAIO will generate an optimized tensor offloading/prefetching plan and integrate it into the compiled LLM program via PyTorch. TERAIO has a runtime tensor migration engine to execute the offloading/prefetching plan via GPUDirect storage, which allows direct tensor migration between GPUs and SSDs for alleviating the CPU bottleneck and maximizing the SSD bandwidth utilization. In comparison with state-of-the-art studies such as ZeRO-Offload and ZeRO-Infinity, we show that TERAIO improves the training performance of various LLMs by 1.47x on average, and achieves 80.7% of the ideal performance assuming unlimited GPU memory.
zh
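TERAIO 的关键是用前几轮 profiling 估计张量生命周期,再生成卸载/预取计划。下面用极简逻辑示意这一计划的生成方式;访问轨迹、阈值与数据结构均为本文假设,并非 TERAIO 的实际算法:

```python
def plan_offloads(accesses, min_idle=3):
    """根据 profiling 得到的张量访问步序估计“空闲区间”,
    对空闲超过 min_idle 步的张量生成卸载/预取计划(纯示意)。"""
    plan = []
    for name, steps in accesses.items():
        steps = sorted(steps)
        for a, b in zip(steps, steps[1:]):
            if b - a > min_idle:
                plan.append({"tensor": name,
                             "offload_after": a,      # 此步之后移往 SSD
                             "prefetch_before": b})   # 此步之前取回 GPU
    return plan

# 假设的访问轨迹:激活张量 act1 在第 2 步与第 10 步之间长期闲置
accesses = {"act1": [2, 10], "weight0": [0, 1, 2, 3, 4]}
for p in plan_offloads(accesses):
    print(p)  # {'tensor': 'act1', 'offload_after': 2, 'prefetch_before': 10}
```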
[AI-152] SIGMA: Refining Large Language Model Reasoning via Sibling-Guided Monte Carlo Augmentation
【速读】:该论文试图解决大规模语言模型通过简单扩大数据集规模来提升性能已逐渐出现收益递减的问题,其核心关注点转向了数据质量的提升。传统方法在蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)过程中仅保留得分最高的路径,而丢弃了包含有价值部分见解、重复错误模式和替代推理策略的兄弟节点,导致大量信息被浪费。该论文提出的SIGMA(Sibling Guided Monte Carlo Augmentation)框架的关键在于重新整合这些被丢弃的兄弟节点,通过语义关联以及两阶段优化——批判模型识别兄弟节点集合中的优劣,修订模型基于比较反馈对最优路径进行文本回传优化,从而提升推理轨迹的质量。
链接: https://arxiv.org/abs/2506.06470
作者: Yanwei Ren,Haotian Zhang,Fuxiang Wu,Jiayan Qiu,Jiaxing Huang,Baosheng Yu,Liu Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Enhancing large language models by simply scaling up datasets has begun to yield diminishing returns, shifting the spotlight to data quality. Monte Carlo Tree Search (MCTS) has emerged as a powerful technique for generating high-quality chain-of-thought data, yet conventional approaches typically retain only the top-scoring trajectory from the search tree, discarding sibling nodes that often contain valuable partial insights, recurrent error patterns, and alternative reasoning strategies. This unconditional rejection of non-optimal reasoning branches may waste vast amounts of informative data in the whole search tree. We propose SIGMA (Sibling Guided Monte Carlo Augmentation), a novel framework that reintegrates these discarded sibling nodes to refine LLM reasoning. SIGMA forges semantic links among sibling nodes along each search path and applies a two-stage refinement: a critique model identifies overlooked strengths and weaknesses across the sibling set, and a revision model conducts text-based backpropagation to refine the top-scoring trajectory in light of this comparative feedback. By recovering and amplifying the underutilized but valuable signals from non-optimal reasoning branches, SIGMA substantially improves reasoning trajectories. On the challenging MATH benchmark, our SIGMA-tuned 7B model achieves 54.92% accuracy using only 30K samples, outperforming state-of-the-art models trained on 590K samples. This result highlights that our sibling-guided optimization not only significantly reduces data usage but also significantly boosts LLM reasoning.
zh
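SIGMA 的“批判—修订”两阶段可以抽象为两次 LLM 调用。下面的骨架代码仅演示数据流;两个模型以可调用对象传入,占位实现与提示词均为本文假设:

```python
def sigma_refine(best_solution, siblings, critique_llm, revise_llm):
    """批判模型总结兄弟分支的优劣,修订模型据此改写最优轨迹(示意)。"""
    feedback = critique_llm(
        "请对比以下兄弟推理分支,总结可取之处与共同错误:\n"
        + "\n".join(siblings))
    return revise_llm(
        f"请参考以下反馈改写解答。\n反馈:{feedback}\n原解答:{best_solution}")

stub = lambda prompt: f"<LLM输出: {prompt[:20]}…>"  # 占位 LLM
print(sigma_refine("最优轨迹文本", ["兄弟分支B", "兄弟分支C"], stub, stub))
```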
[AI-153] WISCA: A Consensus-Based Approach to Harmonizing Interpretability in Tabular Datasets
【速读】:该论文试图解决机器学习模型在科学和高风险领域中,尽管预测准确性被优先考虑,但可解释性仍然至关重要,而不同可解释性算法常产生冲突的解释,从而需要共识来统一结果的问题。解决方案的关键在于提出一种新的方法——WISCA(Weighted Scaled Consensus Attributions),该方法通过整合类别概率和归一化属性值来生成共识性解释,从而提升解释的可靠性。
链接: https://arxiv.org/abs/2506.06455
作者: Antonio Jesús Banegas-Luna,Horacio Pérez-Sánchez,Carlos Martínez-Cortés
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 27 pages, 11 figures, 2 tables, 13 equations
Abstract:While predictive accuracy is often prioritized in machine learning (ML) models, interpretability remains essential in scientific and high-stakes domains. However, diverse interpretability algorithms frequently yield conflicting explanations, highlighting the need for consensus to harmonize results. In this study, six ML models were trained on six synthetic datasets with known ground truths, utilizing various model-agnostic interpretability techniques. Consensus explanations were generated using established methods and a novel approach: WISCA (Weighted Scaled Consensus Attributions), which integrates class probability and normalized attributions. WISCA consistently aligned with the most reliable individual method, underscoring the value of robust consensus strategies in improving explanation reliability.
zh
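WISCA 将类别概率与归一化归因结合以生成共识解释。下面是这一思想的极简示意;归一化与加权方式是本文对摘要的简化理解,细节以论文为准:

```python
import numpy as np

def wisca(attributions, class_prob):
    """把各可解释性方法的归因先按 L1 归一化,再取平均并用
    模型对预测类别的概率加权(权重构造为本文假设)。"""
    norm = [a / (np.abs(a).sum() + 1e-12) for a in attributions]
    return class_prob * np.mean(norm, axis=0)

shap_like = np.array([0.5, -0.2, 0.1])  # 方法1的特征归因(示例数据)
lime_like = np.array([0.4, -0.1, 0.0])  # 方法2的特征归因(示例数据)
print(wisca([shap_like, lime_like], class_prob=0.9))
```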
[AI-154] Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance
【速读】:该论文试图解决现有大型语言模型(Large Language Model, LLM)安全保证方法在面对新兴威胁时的不足,特别是针对推理阶段的扩展性问题。传统方法主要关注训练阶段的对齐以培养安全行为,但其易受各种越狱攻击的影响,而推理阶段的扩展虽然提升了模型的推理能力,却未被充分探索用于安全保证。解决方案的关键在于提出SAFFRON,一种专为安全保证设计的推理扩展范式,其核心是引入多分支奖励模型(Multifurcation Reward Model, MRM),通过减少奖励模型评估次数来缓解探索—效率困境,同时结合部分监督训练目标、保守探索约束和基于Trie的键值缓存策略,以提升安全性和效率。
链接: https://arxiv.org/abs/2506.06444
作者: Ruizhong Qiu,Gaotang Li,Tianxin Wei,Jingrui He,Hanghang Tong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 19 pages
Abstract:Existing safety assurance research has primarily focused on training-phase alignment to instill safe behaviors into LLMs. However, recent studies have exposed these methods’ susceptibility to diverse jailbreak attacks. Concurrently, inference scaling has significantly advanced LLM reasoning capabilities but remains unexplored in the context of safety assurance. Addressing this gap, our work pioneers inference scaling for robust and effective LLM safety against emerging threats. We reveal that conventional inference scaling techniques, despite their success in reasoning tasks, perform poorly in safety contexts, even falling short of basic approaches like Best-of-N Sampling. We attribute this inefficiency to a newly identified challenge, the exploration–efficiency dilemma, arising from the high computational overhead associated with frequent process reward model (PRM) evaluations. To overcome this dilemma, we propose SAFFRON, a novel inference scaling paradigm tailored explicitly for safety assurance. Central to our approach is the introduction of a multifurcation reward model (MRM) that significantly reduces the required number of reward model evaluations. To operationalize this paradigm, we further propose: (i) a partial supervision training objective for MRM, (ii) a conservative exploration constraint to prevent out-of-distribution explorations, and (iii) a Trie-based key–value caching strategy that facilitates cache sharing across sequences during tree search. Extensive experiments validate the effectiveness of our method. Additionally, we publicly release our trained multifurcation reward model (Saffron-1) and the accompanying token-level safety reward dataset (Safety4M) to accelerate future research in LLM safety. Our code, model, and data are publicly available at this https URL , and our project homepage is at this https URL .
zh
[AI-155] Unlocking Chemical Insights: Superior Molecular Representations from Intermediate Encoder Layers
【速读】:该论文试图解决传统分子编码器在下游任务中仅依赖最终层嵌入可能导致信息丢失的问题,从而影响性能表现。其解决方案的关键在于对分子编码器的多层嵌入进行系统性分析,并发现中间层嵌入在多数任务中优于最终层表示。通过使用最优中间层的固定嵌入或对其进行微调,显著提升了下游任务的性能,证明了探索分子编码器的完整表征深度对于提升模型效果和计算效率的重要性。
链接: https://arxiv.org/abs/2506.06443
作者: Luis Pinto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM)
备注:
Abstract:Pretrained molecular encoders have become indispensable in computational chemistry for tasks such as property prediction and molecular generation. However, the standard practice of relying solely on final-layer embeddings for downstream tasks may discard valuable information. In this work, we challenge this convention by conducting a comprehensive layer-wise analysis of five diverse molecular encoders across 22 ADMET property prediction tasks. Our results demonstrate that embeddings from intermediate layers consistently outperform final-layer representations. Specifically, using fixed embeddings from the optimal intermediate layers improved downstream performance by an average of 5.4%, reaching gains up to 28.6%. Furthermore, finetuning up to these intermediate layers yielded even greater average improvements of 8.5%, with performance increases as high as 40.8%, achieving new state-of-the-art results on several benchmarks. Additionally, a strong positive correlation between fixed embedding performance and finetuning outcomes supports an efficient evaluate-then-finetune approach, enabling identification of optimal layers with reduced computational cost. These findings highlight the importance of exploring the full representational depth of molecular encoders to achieve substantial performance improvements and computational efficiency. The code is made publicly available at this https URL.
zh
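论文的“先评估、后微调”流程可以用逐层线性探针实现:对每层固定嵌入做交叉验证,再选出最优层。下面用随机数占位嵌入演示该流程(需 scikit-learn;层数、维度与数据均为本文假设):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def best_layer(layer_embeddings, labels):
    """对每层固定嵌入训练线性探针,用交叉验证分数挑出最优层。"""
    scores = []
    for emb in layer_embeddings:
        clf = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(clf, emb, labels, cv=3).mean())
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=120)
layers = [rng.normal(size=(120, 32)) for _ in range(6)]  # 假设6层嵌入
idx, scores = best_layer(layers, labels)
print("最优层:", idx)  # 真实场景中再对该层及之前的层做微调
```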
[AI-156] Benchmarking Misuse Mitigation Against Covert Adversaries
【速读】:该论文试图解决现有语言模型安全评估未能有效应对隐蔽攻击(covert attacks)的问题,特别是针对现实攻击者通过多个看似无害的小任务请求来逐步实现危险目标的攻击策略。解决方案的关键在于开发了状态防御基准测试(Benchmarks for Stateful Defenses, BSD),这是一个自动化生成数据以评估隐蔽攻击及其对应防御机制的管道。该方法揭示了分解攻击作为有效滥用促进因素,并强调了状态防御作为应对策略的重要性。
链接: https://arxiv.org/abs/2506.06414
作者: Davis Brown,Mahdi Sabbaghi,Luze Sun,Alexander Robey,George J. Pappas,Eric Wong,Hamed Hassani
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
zh
[AI-157] TimeWak: Temporal Chained-Hashing Watermark for Time Series Data
【速读】:该论文旨在解决在真实空间中对多变量时间序列扩散模型进行水印嵌入的问题,尤其是在处理特征异质性和时间依赖性时的挑战。传统水印方法依赖于同质潜在空间,而当前最先进的时间序列生成器在真实空间中运行,导致基于潜在空间的水印方法不兼容。论文提出的解决方案是TimeWak,其关键在于在真实的时间-特征空间中直接嵌入时间链式哈希水印,并引入ϵ-精确逆向机制以应对反向扩散过程中特征间非均匀重建误差分布的问题,从而保证水印的高可检测性。
链接: https://arxiv.org/abs/2506.06407
作者: Zhi Wen Soi,Chaoyi Zhu,Fouad Abiad,Aditya Shankar,Jeroen M. Galjaard,Huijuan Wang,Lydia Y. Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Synthetic time series generated by diffusion models enable sharing privacy-sensitive datasets, such as patients' functional MRI records. Key criteria for synthetic data include high data utility and traceability to verify the data source. Recent watermarking methods embed in homogeneous latent spaces, but state-of-the-art time series generators operate in real space, making latent-based watermarking incompatible. This creates the challenge of watermarking directly in real space while handling feature heterogeneity and temporal dependencies. We propose TimeWak, the first watermarking algorithm for multivariate time series diffusion models. To handle temporal dependence and spatial heterogeneity, TimeWak embeds a temporal chained-hashing watermark directly within the real temporal-feature space. The other unique feature is the ε-exact inversion, which addresses the non-uniform reconstruction error distribution across features from inverting the diffusion process to detect watermarks. We derive the error bound of inverting multivariate time series and further maintain high watermark detectability. We extensively evaluate TimeWak on its impact on synthetic data quality, watermark detectability, and robustness under various post-editing attacks, against 5 datasets and baselines of different temporal lengths. Our results show that TimeWak achieves improvements of 61.96% in context-FID score, and 8.44% in correlational scores against the state-of-the-art baseline, while remaining consistently detectable.
zh
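时间链式哈希的思想是让第 t 步的水印比特依赖于密钥、前一步哈希与前一步观测,从而把水印与时间依赖绑定。下面是这一链式构造的极简示意;量化粒度与嵌入位置为本文假设,并非 TimeWak 的原始算法:

```python
import hashlib

def chained_watermark_bits(series, key=b"secret-key"):
    """第 t 步哈希 = SHA-256(链上前一哈希 + 前一步观测的量化表示),
    取哈希首字节最低位作为该步水印比特(量化方式为本文假设)。"""
    h = hashlib.sha256(key).digest()
    bits = []
    for x in series:
        h = hashlib.sha256(h + repr(round(x, 4)).encode()).digest()
        bits.append(h[0] & 1)
    return bits

print(chained_watermark_bits([0.1, 0.5, -0.2, 0.3]))  # 长度为4的0/1比特序列
```

检测方可持有相同密钥,沿时间重算哈希链并比对比特序列,这正是“链式”构造便于追溯的原因。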
[AI-158] Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization
【速读】:该论文试图解决Transformer模型中位置编码(Positional Encoding)对模型表达能力、泛化能力和长序列外推性能的影响问题。其解决方案的关键在于提出一个理论框架,通过函数逼近分析表达能力,利用Rademacher复杂度建立泛化界,并引入基于正交函数的位置编码方法(如小波和勒让德多项式),以提升模型的性能。此外,该研究将ALiBi的偏置方法扩展到统一的理论背景中,为位置编码的设计提供了新的理论依据。
链接: https://arxiv.org/abs/2506.06398
作者: Yin Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Positional encodings are a core part of transformer-based models, enabling processing of sequential data without recurrence. This paper presents a theoretical framework to analyze how various positional encoding methods, including sinusoidal, learned, relative, and bias-based methods like Attention with Linear Biases (ALiBi), impact a transformer’s expressiveness, generalization ability, and extrapolation to longer sequences. Expressiveness is defined via function approximation, generalization bounds are established using Rademacher complexity, and new encoding methods based on orthogonal functions, such as wavelets and Legendre polynomials, are proposed. The extrapolation capacity of existing and proposed encodings is analyzed, extending ALiBi’s biasing approach to a unified theoretical context. Experimental evaluation on synthetic sequence-to-sequence tasks shows that orthogonal transform-based encodings outperform traditional sinusoidal encodings in generalization and extrapolation. This work addresses a critical gap in transformer theory, providing insights for design choices in natural language processing, computer vision, and other transformer applications.
zh
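论文提出用正交函数(如勒让德多项式)构造位置编码。下面用 NumPy 生成一个“第 k 维取 k 阶勒让德多项式在规范化位置上的取值”的编码矩阵作为示意;论文的具体构造与缩放方式可能不同:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_positional_encoding(seq_len, d_model):
    """正交多项式位置编码示意:PE[t, k] = L_k(x_t),
    其中 x_t 是映射到 [-1, 1] 的规范化位置。"""
    x = np.linspace(-1.0, 1.0, seq_len)
    pe = np.empty((seq_len, d_model))
    for k in range(d_model):
        coeffs = np.zeros(k + 1)
        coeffs[k] = 1.0                      # 仅第 k 阶系数为 1,即 L_k
        pe[:, k] = legendre.legval(x, coeffs)
    return pe

pe = legendre_positional_encoding(seq_len=128, d_model=16)
print(pe.shape)  # (128, 16),可与词嵌入相加后送入 Transformer
```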
[AI-159] Benchmarking Large Language Models on Homework Assessment in Circuit Analysis
【速读】:该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)在工程教育中进行作业评估的问题,特别是针对本科生电路分析课程的作业评估。解决方案的关键在于构建了一个包含官方参考答案和真实学生解答的新数据集,并将其转换为LaTeX格式以克服当前LLMs在图像识别方面的局限性。此外,设计了一种提示模板,用于测试学生解答的五个指标:完整性、方法、最终答案、算术错误和单位,从而评估不同LLMs在作业评估中的性能。
链接: https://arxiv.org/abs/2506.06390
作者: Liangliang Chen,Zhihao Qin,Yiming Guo,Jacqueline Rohde,Ying Zhang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have the potential to revolutionize various fields, including code development, robotics, finance, and education, due to their extensive prior knowledge and rapid advancements. This paper investigates how LLMs can be leveraged in engineering education. Specifically, we benchmark the capabilities of different LLMs, including GPT-3.5 Turbo, GPT-4o, and Llama 3 70B, in assessing homework for an undergraduate-level circuit analysis course. We have developed a novel dataset consisting of official reference solutions and real student solutions to problems from various topics in circuit analysis. To overcome the limitations of image recognition in current state-of-the-art LLMs, the solutions in the dataset are converted to LaTeX format. Using this dataset, a prompt template is designed to test five metrics of student solutions: completeness, method, final answer, arithmetic error, and units. The results show that GPT-4o and Llama 3 70B perform significantly better than GPT-3.5 Turbo across all five metrics, with GPT-4o and Llama 3 70B each having distinct advantages in different evaluation aspects. Additionally, we present insights into the limitations of current LLMs in several aspects of circuit analysis. Given the paramount importance of ensuring reliability in LLM-generated homework assessment to avoid misleading students, our results establish benchmarks and offer valuable insights for the development of a reliable, personalized tutor for circuit analysis – a focus of our future work. Furthermore, the proposed evaluation methods can be generalized to a broader range of courses for engineering education in the future.
zh
[AI-160] Human and AI collaboration in Fitness Education:A Longitudinal Study with a Pilates Instructor
【速读】:该论文试图解决生成式 AI (Generative AI) 在健身教育中与人类专家协作的最优角色问题。研究通过为期一年的定性案例研究,探讨了如何将生成式 AI 整合到普拉提课程的规划与教学中,其解决方案的关键在于通过定期参与教师培训课程和进行半结构化访谈,深入了解生成式 AI 在实际教学场景中的应用方式与效果。
链接: https://arxiv.org/abs/2506.06383
作者: Qian Huang,King Wang Poon
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 19 pages, 5 figures
Abstract:Artificial intelligence is poised to transform teaching and coaching practices, yet its optimal role alongside human expertise remains unclear. This study investigates human and AI collaboration in fitness education through a one-year qualitative case study with a Pilates instructor. The researcher participated in the instructor's classes and conducted biweekly semi-structured interviews to explore how generative AI could be integrated into class planning and instruction.
zh
[AI-161] CPS-Guard: Framework for Dependability Assurance of AI- and LLM -Based Cyber-Physical Systems
【速读】:该论文旨在解决传统验证与确认(Verification and Validation, V&V)方法在应对人工智能(Artificial Intelligence, AI)组件不可预测性和动态性方面的不足。其解决方案的关键在于提出CPS-Guard框架,该框架通过多角色编排(multi-role orchestration)实现对AI驱动的Cyber-Physical Systems (CPS) 的迭代保障过程自动化,具体包括分配安全监控、安全评估、故障注入和恢复规划等专用角色给模拟环境中的代理,以持续评估和优化AI行为,满足系统的可靠性需求。
链接: https://arxiv.org/abs/2506.06381
作者: Trisanth Srinivasan,Santosh Patapati,Himani Musku,Idhant Gode,Aditya Arora,Samvit Bhattacharya,Abubakr Nazriev,Sanika Hirave,Zaryab Kanjiani,Srinjoy Ghose,Srinidhi Shetty
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:
Abstract:Cyber-Physical Systems (CPS) increasingly depend on advanced AI techniques to operate in critical applications. However, traditional verification and validation methods often struggle to handle the unpredictable and dynamic nature of AI components. In this paper, we introduce CPS-Guard, a novel framework that employs multi-role orchestration to automate the iterative assurance process for AI-powered CPS. By assigning specialized roles (e.g., safety monitoring, security assessment, fault injection, and recovery planning) to dedicated agents within a simulated environment, CPS-Guard continuously evaluates and refines AI behavior against a range of dependability requirements. We demonstrate the framework through a case study involving an autonomous vehicle navigating an intersection with an AI-based planner. Our results show that CPS-Guard effectively detects vulnerabilities, manages performance impacts, and supports adaptive recovery strategies, thereby offering a structured and extensible solution for rigorous V&V in safety- and security-critical systems.
zh
[AI-162] Beyond the Norm: A Survey of Synthetic Data Generation for Rare Events
【速读】:该论文试图解决极端事件(extreme events)数据稀缺导致的建模难题,这类事件如市场崩盘、自然灾害和流行病等,虽然发生频率低但影响巨大,而现有的数据驱动方法因依赖大量训练数据而难以有效应用。解决方案的关键在于合成数据生成(synthetic data generation),通过生成具有重尾分布特性的数据来弥补真实极端事件数据的不足,同时结合统计理论、专门的训练与采样机制,提升模型对极端场景的适应能力。
链接: https://arxiv.org/abs/2506.06380
作者: Jingyi Gu,Xuan Zhang,Guiling Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Extreme events, such as market crashes, natural disasters, and pandemics, are rare but catastrophic, often triggering cascading failures across interconnected systems. Accurate prediction and early warning can help minimize losses and improve preparedness. While data-driven methods offer powerful capabilities for extreme event modeling, they require abundant training data, yet extreme event data is inherently scarce, creating a fundamental challenge. Synthetic data generation has emerged as a powerful solution. However, existing surveys focus on general data with privacy preservation emphasis, rather than extreme events’ unique performance requirements. This survey provides the first overview of synthetic data generation for extreme events. We systematically review generative modeling techniques and large language models, particularly those enhanced by statistical theory as well as specialized training and sampling mechanisms to capture heavy-tailed distributions. We summarize benchmark datasets and introduce a tailored evaluation framework covering statistical, dependence, visual, and task-oriented metrics. A central contribution is our in-depth analysis of each metric’s applicability in extremeness and domain-specific adaptations, providing actionable guidance for model evaluation in extreme settings. We categorize key application domains and identify underexplored areas like behavioral finance, wildfires, earthquakes, windstorms, and infectious outbreaks. Finally, we outline open challenges, providing a structured foundation for advancing synthetic rare-event research.
zh
[AI-163] Towards Foundation Model on Temporal Knowledge Graph Reasoning
【速读】:该论文试图解决现有Temporal Knowledge Graph Embedding (TKGE)模型在链接预测任务中依赖于训练过程中已见实体、关系和时间信息的问题,这限制了模型在新领域中的迁移能力和真实场景下的泛化性能。解决方案的关键在于提出一种完全归纳(fully-inductive)的方法,通过正弦位置编码捕捉细粒度的时间模式,并利用基于局部和全局时间上下文的消息传递机制生成适应性的实体和关系表示,从而实现跨时间粒度和时间跨度的时序信息迁移与表征学习。
链接: https://arxiv.org/abs/2506.06367
作者: Jiaxin Pan,Mojtaba Nayyeri,Osama Mohammed,Daniel Hernandez,Rongchuan Zhang,Cheng Cheng,Steffen Staab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Temporal Knowledge Graphs (TKGs) store temporal facts with quadruple formats (s, p, o, t). Existing Temporal Knowledge Graph Embedding (TKGE) models perform link prediction tasks in transductive or semi-inductive settings, which means the entities, relations, and temporal information in the test graph are fully or partially observed during training. Such reliance on seen elements during inference limits the models’ ability to transfer to new domains and generalize to real-world scenarios. A central limitation is the difficulty in learning representations for entities, relations, and timestamps that are transferable and not tied to dataset-specific vocabularies. To overcome these limitations, we introduce the first fully-inductive approach to temporal knowledge graph link prediction. Our model employs sinusoidal positional encodings to capture fine-grained temporal patterns and generates adaptive entity and relation representations using message passing conditioned on both local and global temporal contexts. Our model design is agnostic to temporal granularity and time span, effectively addressing temporal discrepancies across TKGs and facilitating time-aware structural information transfer. As a pretrained, scalable, and transferable model, POSTRA demonstrates strong zero-shot performance on unseen temporal knowledge graphs, effectively generalizing to novel entities, relations, and timestamps. Extensive theoretical analysis and empirical results show that a single pretrained model can improve zero-shot performance on various inductive temporal reasoning scenarios, marking a significant step toward a foundation model for temporal KGs.
zh
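The timestamp encoding described in the abstract can be illustrated with a standard sinusoidal scheme applied directly to raw (possibly fractional) timestamps. This is a generic sketch of the idea, not POSTRA's actual implementation; the dimension and frequency base below are arbitrary choices.

```python
import numpy as np

def timestamp_encoding(t: np.ndarray, dim: int = 64, base: float = 10000.0) -> np.ndarray:
    """Sinusoidal positional encoding applied to timestamps, so that unseen
    timestamps (and unseen granularities) still map into the same feature space."""
    freqs = 1.0 / base ** (np.arange(0, dim, 2) / dim)  # geometric frequency ladder
    angles = np.outer(t, freqs)                         # shape (len(t), dim/2)
    enc = np.empty((len(t), dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# Days, hours, or years all pass through the same map without a lookup table.
print(timestamp_encoding(np.array([0.0, 1.0, 365.25]), dim=8).shape)  # (3, 8)
```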
[AI-164] CR-BLEA: Contrastive Ranking for Adaptive Resource Allocation in Bilevel Evolutionary Algorithms
【速读】:该论文旨在解决双层优化中由于嵌套结构导致的计算成本高昂问题,特别是在上层候选解需要求解对应下层问题时,大量无前景的下层任务重复评估造成资源浪费。其解决方案的关键在于提出一种新颖的资源分配框架,该框架通过对比排序网络在线学习上下层解之间的关系模式,并基于此设计参考排序策略,优先优化有潜力的任务并根据估计种群质量自适应控制重采样,从而显著降低计算成本并保持或提升解的准确性。
链接: https://arxiv.org/abs/2506.06362
作者: Dejun Xu,Jijia Chen,Gary G. Yen,Min Jiang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Bilevel optimization poses a significant computational challenge due to its nested structure, where each upper-level candidate solution requires solving a corresponding lower-level problem. While evolutionary algorithms (EAs) are effective at navigating such complex landscapes, their high resource demands remain a key bottleneck – particularly the redundant evaluation of numerous unpromising lower-level tasks. Despite recent advances in multitasking and transfer learning, resource waste persists. To address this issue, we propose a novel resource allocation framework for bilevel EAs that selectively identifies and focuses on promising lower-level tasks. Central to our approach is a contrastive ranking network that learns relational patterns between paired upper- and lower-level solutions online. This knowledge guides a reference-based ranking strategy that prioritizes tasks for optimization and adaptively controls resampling based on estimated population quality. Comprehensive experiments across five state-of-the-art bilevel algorithms show that our framework significantly reduces computational cost while preserving – or even enhancing – solution accuracy. This work offers a generalizable strategy to improve the efficiency of bilevel EAs, paving the way for more scalable bilevel optimization.
zh
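The contrastive ranking idea can be grounded with a standard logistic pairwise loss: the network is rewarded for scoring promising lower-level tasks above unpromising ones. This is a generic stand-in for the paper's objective; in CR-BLEA the scores would come from a learned network over paired upper- and lower-level solutions.

```python
import numpy as np

def pairwise_ranking_loss(score_pos: np.ndarray, score_neg: np.ndarray) -> float:
    """Logistic pairwise loss: mean of log(1 + exp(-(s_pos - s_neg))).
    Minimizing it pushes promising tasks above unpromising ones."""
    margin = score_pos - score_neg
    return float(np.mean(np.log1p(np.exp(-margin))))

# Toy scores from a ranking network for two (promising, unpromising) task pairs.
print(pairwise_ranking_loss(np.array([2.0, 1.5]), np.array([0.5, 1.0])))
```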
[AI-165] Tactile MNIST: Benchmarking Active Tactile Perception
【速读】:该论文试图解决触觉感知在机器人操作中因信息局部性而难以实现广泛空间感知或全局场景理解的问题,其解决方案的关键在于引入主动感知策略,即通过有意识地引导传感器聚焦于具有更多信息或显著特征的区域,并随时间整合这些信息以完成任务或理解场景。为实现这一目标,论文提出了Tactile MNIST Benchmark Suite,一个针对主动触觉感知任务的标准化基准套件,包含多样化的仿真场景和大规模合成与真实触觉数据集,旨在推动触觉传感与主动感知领域的系统性进展。
链接: https://arxiv.org/abs/2506.06361
作者: Tim Schneider,Guillaume Duret,Cristiana de Farias,Roberto Calandra,Liming Chen,Jan Peters
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Tactile perception has the potential to significantly enhance dexterous robotic manipulation by providing rich local information that can complement or substitute for other sensory modalities such as vision. However, because tactile sensing is inherently local, it is not well-suited for tasks that require broad spatial awareness or global scene understanding on its own. A human-inspired strategy to address this issue is to consider active perception techniques instead. That is, to actively guide sensors toward regions with more informative or significant features and integrate such information over time in order to understand a scene or complete a task. Both active perception and different methods for tactile sensing have received significant attention recently. Yet, despite advancements, both fields lack standardized benchmarks. To bridge this gap, we introduce the Tactile MNIST Benchmark Suite, an open-source, Gymnasium-compatible benchmark specifically designed for active tactile perception tasks, including localization, classification, and volume estimation. Our benchmark suite offers diverse simulation scenarios, from simple toy environments all the way to complex tactile perception tasks using vision-based tactile sensors. Furthermore, we also offer a comprehensive dataset comprising 13,500 synthetic 3D MNIST digit models and 153,600 real-world tactile samples collected from 600 3D printed digits. Using this dataset, we train a CycleGAN for realistic tactile simulation rendering. By providing standardized protocols and reproducible evaluation frameworks, our benchmark suite facilitates systematic progress in the fields of tactile sensing and active perception.
zh
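Since the suite is Gymnasium-compatible, interacting with it should follow the standard Gymnasium loop. The environment id below is hypothetical (the benchmark's registered names may differ); everything else is the stock Gymnasium API.

```python
import gymnasium as gym

# Hypothetical environment id for a tactile localization task; the suite's
# actual registered ids may differ.
env = gym.make("TactileMNIST-Localization-v0")

obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(100):
    action = env.action_space.sample()  # stand-in for an active-perception policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        break
env.close()
print(total_reward)
```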
[AI-166] From Transformers to Large Language Models: A systematic review of AI applications in the energy sector towards Agentic Digital Twins
【速读】:该论文试图解决传统机器学习在智能电网中能源管理应用时面临的泛化能力不足、情境感知有限以及异构数据融合困难等问题。其解决方案的关键在于利用Transformer架构和大型语言模型(LLMs)等基础模型,这些模型在建模复杂的时间序列和上下文关系以及多模态数据融合方面表现出色,从而提升了能源领域AI应用的性能与适用性。
链接: https://arxiv.org/abs/2506.06359
作者: Gabriel Antonesi,Tudor Cioara,Ionut Anghel,Vasilis Michalakopoulos,Elissaios Sarmas,Liana Toderean
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence (AI) has long promised to improve energy management in smart grids by enhancing situational awareness and supporting more effective decision-making. While traditional machine learning has demonstrated notable results in forecasting and optimization, it often struggles with generalization, situational awareness, and heterogeneous data integration. Recent advances in foundation models such as the Transformer architecture and Large Language Models (LLMs) have demonstrated improved capabilities in modelling complex temporal and contextual relationships, as well as in multi-modal data fusion, which is essential for most AI applications in the energy sector. In this review we synthesize the rapidly expanding field of AI applications in the energy domain, focusing on Transformers and LLMs. We examine the architectural foundations, domain-specific adaptations and practical implementations of transformer models across various forecasting and grid management tasks. We then explore the emerging role of LLMs in the field: adaptation and fine-tuning for the energy sector, the type of tasks they are suited for, and the new challenges they introduce. Along the way, we highlight practical implementations, innovations, and areas where the research frontier is rapidly expanding. These recent developments reviewed underscore a broader trend: Generative AI (GenAI) is beginning to augment decision-making not only in high-level planning but also in day-to-day operations, from forecasting and grid balancing to workforce training and asset onboarding. Building on these developments, we introduce the concept of the Agentic Digital Twin, a next-generation model that integrates LLMs to bring autonomy, proactivity, and social interaction into digital twin-based energy management systems.
zh
[AI-167] Will artificial agents pursue power by default?
【速读】:该论文试图解决关于高级人工智能(Advanced AI)是否可能追求对人类的权力这一争议性问题,其核心在于评估“权力”是否为一种收敛性工具目标(convergent instrumental goal)。论文通过在抽象的决策理论框架中形式化工具性收敛性和权力追求的概念,来分析该命题的合理性。解决方案的关键在于指出,尽管权力作为收敛性工具目标具有一定真实性,但其预测能力受限于对智能体最终目标的了解程度;然而,对于那些有望获得绝对或接近绝对权力的智能体而言,工具性收敛现象更具预测性。
链接: https://arxiv.org/abs/2506.06352
作者: Christian Tarsney
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Researchers worried about catastrophic risks from advanced AI have argued that we should expect sufficiently capable AI agents to pursue power over humanity because power is a convergent instrumental goal, something that is useful for a wide range of final goals. Others have recently expressed skepticism of these claims. This paper aims to formalize the concepts of instrumental convergence and power-seeking in an abstract, decision-theoretic framework, and to assess the claim that power is a convergent instrumental goal. I conclude that this claim contains at least an element of truth, but might turn out to have limited predictive utility, since an agent’s options cannot always be ranked in terms of power in the absence of substantive information about the agent’s final goals. However, the fact of instrumental convergence is more predictive for agents who have a good shot at attaining absolute or near-absolute power.
zh
[AI-168] NR4DER: Neural Re-ranking for Diversified Exercise Recommendation SIGIR2025
【速读】:该论文旨在解决在线教育平台中学生在学习过程中面临的高退课率以及现有练习推荐方法无法适应学生多样化学习节奏的问题。现有方法在调整不活跃学生的学习模式和满足个性化学习节奏方面存在困难,导致推荐的准确性和多样性受限。论文提出的解决方案关键在于神经重排序(Neural Re-ranking)技术,通过结合改进的mLSTM模型、序列增强方法和神经重排序机制,提升练习推荐的准确性和多样性,从而更好地适应学生的不同学习节奏。
链接: https://arxiv.org/abs/2506.06341
作者: Xinghe Cheng,Xufang Zhou,Liangda Fang,Chaobo He,Yuyu Zhou,Weiqi Luo,Zhiguo Gong,Quanlong Guan
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: accepted for presentation at the SIGIR 2025 Full Papers track
Abstract:With the widespread adoption of online education platforms, an increasing number of students are gaining new knowledge through Massive Open Online Courses (MOOCs). Exercise recommendation has made strides toward improving student learning outcomes. However, existing methods not only struggle with high dropout rates but also fail to match the diverse learning pace of students. They frequently face difficulties in adjusting to inactive students' learning patterns and in accommodating individualized learning paces, resulting in limited accuracy and diversity in recommendations. To tackle these challenges, we propose Neural Re-ranking for Diversified Exercise Recommendation (in short, NR4DER). NR4DER first leverages the mLSTM model to improve the effectiveness of the exercise filter module. It then employs a sequence enhancement method to enhance the representation of inactive students and accurately match students with exercises of appropriate difficulty. Finally, it utilizes neural re-ranking to generate diverse recommendation lists based on individual students' learning histories. Extensive experimental results indicate that NR4DER significantly outperforms existing methods across multiple real-world datasets and effectively caters to the diverse learning pace of students.
zh
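The diversification step can be illustrated with a classic maximal-marginal-relevance (MMR) re-ranker, shown below as a simplified stand-in; NR4DER's actual re-ranker is neural and conditioned on each student's learning history.

```python
import numpy as np

def mmr_rerank(relevance, similarity, k=5, lam=0.7):
    """Greedy MMR: trade predicted relevance against similarity to
    already-selected exercises to produce a diverse top-k list."""
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(similarity[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

rel = np.array([0.9, 0.85, 0.8, 0.4])
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.1, 0.2],
                [0.1, 0.1, 1.0, 0.3],
                [0.2, 0.2, 0.3, 1.0]])
print(mmr_rerank(rel, sim, k=3))  # [0, 2, 1]: item 2 jumps ahead of near-duplicate 1
```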
[AI-169] Structured Semantics from Unstructured Notes: Language Model Approaches to EHR-Based Decision Support
【速读】:该论文试图解决如何利用大型语言模型(Large Language Models, LLMs)从复杂且非结构化的电子健康记录(Electronic Health Records, EHRs)中提取有价值的信息,以提升临床决策支持的问题。其解决方案的关键在于通过先进的语言模型挖掘文本特征,这些特征在传统高维EHR分析中常被忽视,但能够提供语义丰富的表示,并有助于不同机构间数据的标准化与整合。此外,论文还探讨了医疗编码的整合以及确保AI模型在医疗领域中的泛化性和公平性的挑战与机遇。
链接: https://arxiv.org/abs/2506.06340
作者: Wu Hao Ran,Xi Xi,Furong Li,Jingyi Lu,Jian Jiang,Hui Huang,Yuzhuan Zhang,Shi Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of large language models (LLMs) has opened new avenues for analyzing complex, unstructured data, particularly within the medical domain. Electronic Health Records (EHRs) contain a wealth of information in various formats, including free text clinical notes, structured lab results, and diagnostic codes. This paper explores the application of advanced language models to leverage these diverse data sources for improved clinical decision support. We will discuss how text-based features, often overlooked in traditional high dimensional EHR analysis, can provide semantically rich representations and aid in harmonizing data across different institutions. Furthermore, we delve into the challenges and opportunities of incorporating medical codes and ensuring the generalizability and fairness of AI models in healthcare.
zh
[AI-170] Introduction to Predictive Coding Networks for Machine Learning
【速读】:该论文试图解决传统前馈神经网络在机器学习中的局限性,提出一种基于预测编码网络(Predictive Coding Networks, PCNs)的生物启发框架,以更好地理解大脑中的分层计算机制。解决方案的关键在于构建一种能够模拟大脑层级预测与误差反馈机制的网络架构,并通过推断和学习更新规则实现有效的算法实现,从而在图像分类任务(如CIFAR-10)中取得优异性能。
链接: https://arxiv.org/abs/2506.06332
作者: Mikko Stenlund
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 22 pages
Abstract:Predictive coding networks (PCNs) constitute a biologically inspired framework for understanding hierarchical computation in the brain, and offer an alternative to traditional feedforward neural networks in ML. This note serves as a quick, onboarding introduction to PCNs for machine learning practitioners. We cover the foundational network architecture, inference and learning update rules, and algorithmic implementation. A concrete image-classification task (CIFAR-10) is provided as a benchmark-smashing application, together with an accompanying Python notebook containing the PyTorch implementation.
zh
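The inference and learning updates that PCNs alternate between can be shown in a few lines for a linear two-layer model: the top layer predicts the data layer, and the same local prediction error drives updates to both the latent state and the weights. This toy sketch is ours; the note's networks are hierarchical and nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(10, 20))  # top-down weights: latents -> data layer

def pcn_step(x0, x1, W, lr_x=0.1, lr_w=0.01):
    """One predictive-coding step: infer latents, then learn weights,
    both from the same local prediction error."""
    pred = W.T @ x1                     # top-down prediction of the data
    err = x0 - pred                     # prediction error at the data layer
    x1 = x1 + lr_x * (W @ err)          # inference: move latents to reduce error
    W = W + lr_w * np.outer(x1, err)    # learning: local, Hebbian-like update
    return x1, W, float(err @ err)

x0 = rng.normal(size=20)   # observed input
x1 = np.zeros(10)          # latent state, initialized at zero
for _ in range(50):
    x1, W, e = pcn_step(x0, x1, W)
print(e)  # squared prediction error shrinks over the iterations
```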
[AI-171] Memory OS of AI Agent
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在固定上下文窗口和内存管理不足方面面临的挑战,这些问题导致了长期记忆能力的严重缺失以及与AI代理交互体验中的个性化限制。其解决方案的关键在于提出了一种名为MemoryOS的内存操作系统,通过借鉴操作系统中的内存管理原理,设计了一个分层存储架构,并包含四个核心模块:内存存储、更新、检索和生成。该架构由三个存储单元层级组成,支持在不同层级间进行动态更新,从而实现层次化内存整合与动态更新,显著提升了长对话中的上下文连贯性和个性化记忆保留能力。
链接: https://arxiv.org/abs/2506.06326
作者: Jiazheng Kang,Mingming Ji,Zhe Zhao,Ting Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) face a crucial challenge from fixed context windows and inadequate memory management, leading to a severe shortage of long-term memory capabilities and limited personalization in the interactive experience with AI agents. To overcome this challenge, we innovatively propose a Memory Operating System, i.e., MemoryOS, to achieve comprehensive and efficient memory management for AI agents. Inspired by the memory management principles in operating systems, MemoryOS designs a hierarchical storage architecture and consists of four key modules: Memory Storage, Updating, Retrieval, and Generation. Specifically, the architecture comprises three levels of storage units: short-term memory, mid-term memory, and long-term personal memory. Key operations within MemoryOS include dynamic updates between storage units: short-term to mid-term updates follow a dialogue-chain-based FIFO principle, while mid-term to long-term updates use a segmented page organization strategy. Our pioneering MemoryOS enables hierarchical memory integration and dynamic updating. Extensive experiments on the LoCoMo benchmark show an average improvement of 49.11% on F1 and 46.18% on BLEU-1 over the baselines on GPT-4o-mini, showing contextual coherence and personalized memory retention in long conversations. The implementation code is open-sourced at this https URL.
zh
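The hierarchical storage and FIFO promotion described above can be caricatured in a few lines. This toy omits the segmented page organization and the retrieval and generation modules, and the class and method names are ours, not MemoryOS's API.

```python
from collections import deque

class MemorySketch:
    """Toy three-level memory: FIFO promotion from short-term to mid-term,
    plus an explicit long-term store for persistent user facts."""
    def __init__(self, short_cap=4, mid_cap=64):
        self.short = deque()              # recent dialogue turns
        self.short_cap = short_cap
        self.mid = deque(maxlen=mid_cap)  # consolidated dialogue chains
        self.long = {}                    # long-term personal memory

    def add_turn(self, turn: str):
        if len(self.short) == self.short_cap:
            self.mid.append(self.short.popleft())  # short -> mid, FIFO order
        self.short.append(turn)

    def remember(self, key: str, fact: str):
        self.long[key] = fact  # mid -> long would use segmented paging in MemoryOS

mem = MemorySketch(short_cap=2)
for t in ["hi", "I like tea", "book a table", "for two"]:
    mem.add_turn(t)
mem.remember("beverage", "tea")
print(list(mem.short), list(mem.mid), mem.long)
# ['book a table', 'for two'] ['hi', 'I like tea'] {'beverage': 'tea'}
```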
[AI-172] Evolutionary model for energy trading in community microgrids using Hawk-Dove strategies
【速读】:该论文试图解决微电网之间能量协作中的能量平衡问题,特别是在去中心化环境下如何实现微电网社区层面的能量稳定。解决方案的关键在于将每个微电网建模为采用“鹰”或“鸽”策略的自主代理,并通过进化算法模拟其在能源交易中的互动,其中个体以能量交易矩阵的形式表示,通过重组和变异操作进行种群演化,同时利用多准则适应度函数评估个体表现,从而优化卖家利润、社区能量稳定性及电池退化惩罚等因素。
链接: https://arxiv.org/abs/2506.06325
作者: Viorica Rozina Chifu,Tudor Cioara,Cristina Bianca Pop,Ionut Anghel
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:
Abstract:This paper proposes a decentralized model of energy cooperation between microgrids, in which decisions are made locally, at the level of the microgrid community. Each microgrid is modeled as an autonomous agent that adopts a Hawk or Dove strategy, depending on the level of energy stored in the battery and its role in the energy trading process. The interactions between selling and buying microgrids are modeled through an evolutionary algorithm. An individual in the algorithm population is represented as an energy trading matrix that encodes the amounts of energy traded between the selling and buying microgrids. The population evolution is achieved by recombination and mutation operators. Recombination uses a specialized operator for matrix structures, and mutation is applied to the matrix elements according to a Gaussian distribution. The evaluation of an individual is made with a multi-criteria fitness function that considers the seller profit, the degree of energy stability at the community level, penalties for energy imbalance at the community level and for the degradation of microgrids batteries. The method was tested on a simulated scenario with 100 microgrids, each with its own selling and buying thresholds, to reflect a realistic environment with variable storage characteristics of microgrids batteries. By applying the algorithm on this scenario, 95 out of the 100 microgrids reached a stable energy state. This result confirms the effectiveness of the proposed model in achieving energy balance both at the individual level, for each microgrid, and at the level of the entire community.
zh
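The evolutionary core, individuals encoded as trading matrices, Gaussian mutation, and a penalized fitness, can be sketched as follows. The single-objective fitness below is a simplified surrogate for the paper's multi-criteria function, recombination and battery-degradation penalties are omitted, and all constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_sellers, n_buyers = 4, 6
demand = rng.uniform(1.0, 5.0, size=n_buyers)   # energy requested by each buyer
supply = rng.uniform(2.0, 8.0, size=n_sellers)  # surplus offered by each seller

def fitness(trade):  # trade[i, j] = energy sold by seller i to buyer j
    profit = trade.sum()                                  # toy seller profit
    imbalance = np.abs(trade.sum(axis=0) - demand).sum()  # community imbalance
    oversell = np.clip(trade.sum(axis=1) - supply, 0, None).sum()
    return profit - 2.0 * imbalance - 5.0 * oversell      # penalized objective

def mutate(trade, sigma=0.2):
    child = trade + rng.normal(0.0, sigma, size=trade.shape)  # Gaussian mutation
    return np.clip(child, 0.0, None)                          # no negative trades

pop = [rng.uniform(0, 1, size=(n_sellers, n_buyers)) for _ in range(30)]
for _ in range(200):
    pop.sort(key=fitness, reverse=True)
    pop = pop[:15] + [mutate(p) for p in pop[:15]]  # elitism + mutated offspring
print(round(fitness(pop[0]), 3))
```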
[AI-173] Mapping Human-Agent Co-Learning and Co-Adaptation: A Scoping Review
【速读】:该论文试图解决当前关于人类-人工智能-机器人协同学习与适应研究中术语不一致的问题,以及缺乏对智能代理类型和任务领域多样性的系统性梳理。其解决方案的关键在于通过范围综述方法,系统收集并分析现有文献中描述人机协作关系的术语,明确其细微差异,并探讨所采用的认知理论与框架,从而为未来在动态复杂任务领域中的研究提供理论基础与方向指引。
链接: https://arxiv.org/abs/2506.06324
作者: Shruti Kumar,Xiaoyu Chen,Xiaomei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Abstract accepted to HFES 2024 Annual Meeting
Abstract:Several papers have delved into the challenges of human-AI-robot co-learning and co-adaptation. It has been noted that the terminology used to describe this collaborative relationship in existing studies needs to be more consistent. For example, the prefix “co” is used interchangeably to represent both “collaborative” and “mutual,” and the terms “co-learning” and “co-adaptation” are sometimes used interchangeably. However, they can reflect subtle differences in the focus of the studies. The current scoping review’s primary research question (RQ1) aims to gather existing papers discussing this collaboration pattern and examine the terms researchers use to describe this human-agent relationship. Given the relative newness of this area of study, we are also keen on exploring the specific types of intelligent agents and task domains that have been considered in existing research (RQ2). This exploration is significant as it can shed light on the diversity of human-agent interactions, from one-time to continuous learning/adaptation scenarios. It can also help us understand the dynamics of human-agent interactions in different task domains, guiding our expectations towards research situated in dynamic, complex domains. Our third objective (RQ3) is to investigate the cognitive theories and frameworks that have been utilized in existing studies to measure human-agent co-learning and co-adaptation. This investigation is crucial as it can help us understand the theoretical underpinnings of human-agent collaboration and adaptation, and it can also guide us in identifying any new frameworks proposed specifically for this type of relationship.
zh
[AI-174] Neural networks with image recognition by pairs
【速读】:该论文试图解决传统基于度量识别方法的神经网络在架构设计和权重计算中依赖于解析表达式的问题,从而限制了其灵活性和扩展性。其解决方案的关键在于将这些网络进行转换,使其能够应用经典的学习算法,而无需使用解析表达式来计算权重值,通过成对图像识别的方式进行训练,从而简化学习过程并提高网络的可扩展性。
链接: https://arxiv.org/abs/2506.06322
作者: Polad Geidarov
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural networks based on metric recognition methods have a strictly determined architecture. Number of neurons, connections, as well as weights and thresholds values are calculated analytically, based on the initial conditions of tasks: number of recognizable classes, number of samples, metric expressions used. This paper discusses the possibility of transforming these networks in order to apply classical learning algorithms to them without using analytical expressions that calculate weight values. In the received network, training is carried out by recognizing images in pairs. This approach simplifies the learning process and easily allows to expand the neural network by adding new images to the recognition task. The advantages of these networks, including such as: 1) network architecture simplicity and transparency; 2) training simplicity and reliability; 3) the possibility of using a large number of images in the recognition problem using a neural network; 4) a consistent increase in the number of recognizable classes without changing the previous values of weights and thresholds.
zh
[AI-175] A Reinforcement-Learning-Enhanced LLM Framework for Automated A/B Testing in Personalized Marketing
【速读】:该论文旨在解决个性化营销中如何有效算法化A/B测试以最大化用户响应这一关键问题(the challenge of how to effectively algorithm the A/B testing to maximize user response)。其解决方案的关键在于提出一种基于强化学习策略优化与大语言模型(LLM)结合的自动化和个性化A/B测试框架——RL-LLM-AB Test。该框架通过Prompt-Conditioned Generator生成候选内容变体,并利用多模态感知模块动态融合用户画像和当前查询上下文,构建交互状态;随后通过具有Actor-Critic结构的策略优化模块实时选择内容版本,并借助Memory-Augmented Reward Estimator捕捉长期用户偏好漂移,从而提升策略在多用户和多内容场景下的泛化能力。
链接: https://arxiv.org/abs/2506.06316
作者: Haoyang Feng,Yanjun Dai,Yuan Gao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:In personalized marketing, a key open challenge is how to algorithmically drive A/B testing so as to maximize user response. In this paper, we present a new approach, the RL-LLM-AB test framework, which uses reinforcement-learning strategy optimization combined with an LLM to automate and personalize A/B tests. The RL-LLM-AB test is built upon a pre-trained, instruction-tuned language model. It first generates A/B versions of candidate content variants using a Prompt-Conditioned Generator, and then dynamically embeds and fuses the user portrait and the context of the current query with the multi-modal perception module to constitute the current interaction state. The content version is then selected in real time through the policy optimization module with an Actor-Critic structure, and long-term revenue is estimated according to real-time feedback (such as click-through rate and conversion rate). Furthermore, a Memory-Augmented Reward Estimator is embedded into the framework to capture long-term user preference drift, which helps to generalize the policy across multiple users and content contexts. Numerical results demonstrate the superiority of our proposed RL-LLM-AB test over existing A/B testing methods, including classical A/B testing, Contextual Bandits, and benchmark reinforcement learning approaches, on real-world marketing data.
zh
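The real-time version selection can be grounded with the simplest possible policy-gradient bandit: a softmax policy over two content versions, updated from click feedback against a running baseline (a crude stand-in for the critic). The framework's Actor-Critic module additionally conditions on user and context state; this toy does not.

```python
import numpy as np

rng = np.random.default_rng(1)
true_ctr = np.array([0.05, 0.08])  # unknown click rates of versions A and B
prefs = np.zeros(2)                # policy preferences over the two versions

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

baseline, lr = 0.0, 0.1
for _ in range(5000):
    p = softmax(prefs)
    a = rng.choice(2, p=p)                 # serve one content version
    r = float(rng.random() < true_ctr[a])  # observed click feedback
    baseline += 0.01 * (r - baseline)      # running value estimate (critic-like)
    grad = -p
    grad[a] += 1.0                         # gradient of log pi(a) w.r.t. prefs
    prefs += lr * (r - baseline) * grad    # REINFORCE-with-baseline update
print(softmax(prefs))  # probability mass drifts toward the better version
```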
[AI-176] Large Language Models and Their Applications in Roadway Safety and Mobility Enhancement: A Comprehensive Review
【速读】:该论文旨在解决现代交通系统中道路安全与通行效率的关键挑战,其核心问题是传统工程方法难以应对复杂、动态和异构的交通环境。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)的自然语言理解、知识整合与推理能力,通过架构调整、训练优化、提示工程及多模态策略等手段,弥合交通领域特有的时空与物理数据与模型之间的“模态差距”。该研究系统分析了LLMs在交通流预测、信号控制、事故分析及驾驶员行为评估等场景中的应用,并探讨了V2X集成、领域专用基础模型、可解释性框架及边缘计算等支撑技术,以推动更安全、智能的交通系统发展。
链接: https://arxiv.org/abs/2506.06301
作者: Muhammad Monjurul Karim,Yan Shi,Shucheng Zhang,Bingzhang Wang,Mehrdad Nasri,Yinhai Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Roadway safety and mobility remain critical challenges for modern transportation systems, demanding innovative analytical frameworks capable of addressing complex, dynamic, and heterogeneous environments. While traditional engineering methods have made progress, the complexity and dynamism of real-world traffic necessitate more advanced analytical frameworks. Large Language Models (LLMs), with their unprecedented capabilities in natural language understanding, knowledge integration, and reasoning, represent a promising paradigm shift. This paper comprehensively reviews the application and customization of LLMs for enhancing roadway safety and mobility. A key focus is how LLMs are adapted – via architectural, training, prompting, and multimodal strategies – to bridge the “modality gap” with transportation’s unique spatio-temporal and physical data. The review systematically analyzes diverse LLM applications in mobility (e.g., traffic flow prediction, signal control) and safety (e.g., crash analysis, driver behavior assessment). Enabling technologies such as V2X integration, domain-specific foundation models, explainability frameworks, and edge computing are also examined. Despite significant potential, challenges persist regarding inherent LLM limitations (hallucinations, reasoning deficits), data governance (privacy, bias), deployment complexities (sim-to-real, latency), and rigorous safety assurance. Promising future research directions are highlighted, including advanced multimodal fusion, enhanced spatio-temporal reasoning, human-AI collaboration, continuous learning, and the development of efficient, verifiable systems. This review provides a structured roadmap of current capabilities, limitations, and opportunities, underscoring LLMs’ transformative potential while emphasizing the need for responsible innovation to realize safer, more intelligent transportation systems.
zh
[AI-177] Pairwise Calibrated Rewards for Pluralistic Alignment
【速读】:该论文试图解决当前对齐流水线中单一、普遍的有益行为定义无法反映人类偏好在用户、情境和文化间的多样性问题,导致少数观点被忽视。其解决方案的关键在于通过多个奖励函数的分布来反映多样化的偏好,每个奖励函数诱导一个不同的对齐策略,并直接从成对偏好中学习该分布,而不依赖标注者标识或预定义群体。核心标准是成对校准,即对于每对候选响应,偏好某一响应的奖励函数比例应与具有该偏好的标注者比例一致。
链接: https://arxiv.org/abs/2506.06298
作者: Daniel Halpern,Evi Micha,Ariel D. Procaccia,Itai Shapira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Current alignment pipelines presume a single, universal notion of desirable behavior. However, human preferences often diverge across users, contexts, and cultures. As a result, disagreement collapses into the majority signal and minority perspectives are discounted. To address this, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The distribution is learned directly from pairwise preference without annotator identifiers or predefined groups. Instead, annotator disagreements are treated as informative soft labels. Our central criterion is pairwise calibration: for every pair of candidate responses, the proportion of reward functions preferring one response matches the fraction of annotators with that preference. We prove that even a small outlier-free ensemble can accurately represent diverse preference distributions. Empirically, we introduce and validate a practical training heuristic to learn such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values.
zh
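The central criterion can be written down directly. In our notation (not necessarily the paper's), for an ensemble of reward functions r_1, ..., r_m and empirical annotator preference rate p(a ≻ b):

```latex
% Pairwise calibration: the ensemble's preference rate matches the annotators'.
\frac{1}{m}\sum_{k=1}^{m} \mathbf{1}\!\left[\, r_k(a) > r_k(b) \,\right]
\;=\; p(a \succ b)
\qquad \text{for every pair of candidate responses } (a, b).
```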
[AI-178] Optimal patient allocation for echocardiographic assessments
【速读】:该论文旨在解决医院中超声心动图检查调度所面临的挑战,这些挑战源于非确定性因素(如患者未到场、到达时间不确定、检查时长多样等)以及胎儿与非胎儿患者流之间的资源约束不对称性。其解决方案的关键在于通过分析实际操作数据,构建基于离散事件随机模拟(discrete-event stochastic simulation)的模型,并结合强化学习(reinforcement learning, RL)方法,以动态优化资源分配策略,从而提高心脏超声实验室的运营效率。
链接: https://arxiv.org/abs/2506.06297
作者: Bozhi Sun,Seda Tierney,Jeffrey A. Feinstein,Frederick Damen,Alison L. Marsden,Daniele E. Schiavazzi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Scheduling echocardiographic exams in a hospital presents significant challenges due to non-deterministic factors (e.g., patient no-shows, patient arrival times, diverse exam durations, etc.) and asymmetric resource constraints between fetal and non-fetal patient streams. To address these challenges, we first conducted extensive pre-processing on one week of operational data from the Echo Laboratory at Stanford University’s Lucile Packard Children’s Hospital, to estimate patient no-show probabilities and derive empirical distributions of arrival times and exam durations. Based on these inputs, we developed a discrete-event stochastic simulation model using SimPy, and integrated it with the open-source Gymnasium Python library. As a baseline for policy optimization, we developed a comparative framework to evaluate on-the-fly versus reservation-based allocation strategies, in which different proportions of resources are reserved in advance. Considering a hospital configuration with a 1:6 ratio of fetal to non-fetal rooms and a 4:2 ratio of fetal to non-fetal sonographers, we show that on-the-fly allocation generally yields better performance, more effectively adapting to patient variability and resource constraints. Building on this foundation, we apply reinforcement learning (RL) to derive an approximated optimal dynamic allocation policy. This RL-based policy is benchmarked against the best-performing rule-based strategies, allowing us to quantify their differences and provide actionable insights for improving echo lab efficiency through intelligent, data-driven resource management.
zh
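The discrete-event model is easy to picture with a few lines of SimPy. The distributions and capacities below are illustrative placeholders, not the estimates fitted from the Stanford data.

```python
import random
import simpy

random.seed(0)

def patient(env, rooms, no_show_p=0.15):
    if random.random() < no_show_p:   # empirical no-show probability
        return
    with rooms.request() as req:
        yield req                                           # wait for a room
        yield env.timeout(random.lognormvariate(3.0, 0.4))  # exam duration (min)

def arrivals(env, rooms):
    while True:
        yield env.timeout(random.expovariate(1 / 20))  # mean 20 min between arrivals
        env.process(patient(env, rooms))

env = simpy.Environment()
rooms = simpy.Resource(env, capacity=6)  # e.g., six non-fetal exam rooms
env.process(arrivals(env, rooms))
env.run(until=8 * 60)                    # simulate one 8-hour clinic day
print("simulated day complete")
```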
[AI-179] Dynamic Graph CNN with Jacobi Kolmogorov-Arnold Networks for 3D Classification of Point Sets
【速读】:该论文旨在解决三维点云分类任务中传统方法在准确性和收敛速度上的局限性,以及如何有效结合动态图卷积神经网络(DGCNN)与可扩展的非线性建模能力。其解决方案的关键在于将Jacobi-Kolmogorov-Arnold Networks (KAN)引入DGCNN框架,用可调节的一维多项式展开替代传统的多层感知机(MLP)层,从而在保持参数效率的同时提升模型性能,并通过实验验证了基于雅可比多项式的KAN层在准确率和收敛速度上的优势。
链接: https://arxiv.org/abs/2506.06296
作者: Hanaa El Afia,Said Ohamouddou,Raddouane Chiheb,Abdellatif El Afia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce Jacobi-KAN-DGCNN, a framework that integrates Dynamic Graph Convolutional Neural Network (DGCNN) with Jacobi Kolmogorov-Arnold Networks (KAN) for the classification of three-dimensional point clouds. This method replaces Multi-Layer Perceptron (MLP) layers with adaptable univariate polynomial expansions within a streamlined DGCNN architecture, circumventing deep levels for both MLP and KAN to facilitate a layer-by-layer comparison. In comparative experiments on the ModelNet40 dataset, KAN layers employing Jacobi polynomials outperform the traditional linear layer-based DGCNN baseline in terms of accuracy and convergence speed, while maintaining parameter efficiency. Our results demonstrate that higher polynomial degrees do not automatically improve performance, highlighting the need for further theoretical and empirical investigation to fully understand the interactions between polynomial bases, degrees, and the mechanisms of graph-based learning.
zh
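The KAN-style replacement for a linear layer boils down to expanding each input coordinate in Jacobi polynomials and learning one coefficient per (coordinate, degree) pair. A minimal sketch using SciPy's eval_jacobi; the degree, alpha, beta, and shapes here are arbitrary, and the paper's layer sits inside a full DGCNN.

```python
import numpy as np
from scipy.special import eval_jacobi

def jacobi_features(x, degree=4, alpha=1.0, beta=1.0):
    """Expand each tanh-squashed coordinate in Jacobi polynomials P_0..P_degree."""
    x = np.tanh(x)  # squash inputs into [-1, 1], the polynomials' domain
    return np.stack([eval_jacobi(n, alpha, beta, x) for n in range(degree + 1)],
                    axis=-1)

rng = np.random.default_rng(0)
feats = jacobi_features(rng.normal(size=(8, 3)))  # (batch=8, dims=3, degrees=5)
coef = rng.normal(size=(3, 5))                    # learnable coefficients
out = (feats * coef).sum(axis=(1, 2))             # one scalar unit per sample
print(feats.shape, out.shape)                     # (8, 3, 5) (8,)
```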
[AI-180] Prediction of Bank Credit Ratings using Heterogeneous Topological Graph Neural Networks
【速读】:该论文试图解决由于隐私问题导致的银行间连接图不完整,从而限制了图神经网络(Graph Neural Networks, GNNs)在信用评级预测中的直接应用问题。解决方案的关键在于利用持久同调(persistent homology)构建一个能够捕捉银行之间关系的网络,并将其与传统的贷款网络相结合,形成一个异构网络,从而整合多源信息以提高预测性能。
链接: https://arxiv.org/abs/2506.06293
作者: Junyi Liu,Stanley Kok
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: WITS 2024 (Workshop on Information Technologies and Systems 2024)
Abstract:Agencies such as Standard & Poor’s and Moody’s provide bank credit ratings that influence economic stability and decision-making by stakeholders. Accurate and timely predictions support informed decision-making, regulatory actions, and investor protection. However, a complete interbank connection graph is often unavailable due to privacy concerns, complicating the direct application of Graph Neural Networks (GNNs) for rating prediction. Our research utilizes persistent homology to construct a network that captures relationships among banks and combines this with a traditional lending network to create a heterogeneous network that integrates information from both sources, leading to improved predictions. Experiments on a global, real-world dataset validate the effectiveness of HTGNN. This research has implications for investors and regulatory bodies in enhancing proactive risk mitigation and the implementation of effective market oversight. The code can be found at this https URL.
zh
[AI-181] Mutual-Taught for Co-adapting Policy and Reward Models ACL2025
【速读】:该论文旨在解决在大型语言模型(Large Language Models, LLMs)的偏好优化过程中,由于新生成的模型样本与用于训练奖励模型(Reward Model, RM)的数据之间出现分布偏移(distribution shifts),导致RM效能下降,进而影响策略模型(Policy Model, PM)性能的问题。解决方案的关键在于提出一种自训练方法——Mutual-Taught,该方法通过迭代优化PM和RM,无需额外的人工标注。其核心思想模仿期望最大化(Expectation-Maximization, EM)算法:在E-step中,利用当前RM的反馈更新PM以逼近潜在最优偏好分布;在M-step中,通过PM在E-step前后的输出构建训练数据来更新RM,从而使其适应变化的策略分布。
链接: https://arxiv.org/abs/2506.06292
作者: Tianyuan Shi,Canbin Huang,Fanqi Wan,Longguang Zhong,Ziyi Yang,Weizhou Shen,Xiaojun Quan,Ming Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ACL 2025 (Main Conference)
Abstract:During the preference optimization of large language models (LLMs), distribution shifts may arise between newly generated model samples and the data used to train the reward model (RM). This shift reduces the efficacy of the RM, which in turn negatively impacts the performance of the policy model (PM). To address this challenge, we propose Mutual-Taught, a self-training method that iteratively improves both the PM and RM without requiring additional human annotation. Our approach mirrors the expectation-maximization (EM) algorithm. In the E-step, the PM is updated using feedback from the current RM, guiding the PM toward a better approximation of the latent optimal preference distribution. In the M-step, we update the RM by constructing training data from the outputs of the PM before and after the E-step update. This process ensures that the RM adapts to the evolving policy distribution. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models. Specifically, our 8B policy model, LLaMA-3-8B-Instruct-MT, achieves a length-controlled win rate of 54.1% on AlpacaEval-2, while our 8B reward model, FsfairX-LLaMA3-RM-MT, performs on par with GPT-4o-2024-08-06 on RewardBench.
zh
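The alternating structure can be caricatured with scalars standing in for the policy and reward models. The real E-step trains the PM on RM-ranked preference pairs and the real M-step trains the RM on before/after PM outputs; this toy only shows the two models co-adapting round by round.

```python
import random
random.seed(0)

def e_step(pm, rm, n=32, lr=0.5):
    """E-step caricature: move the 'policy' toward the sample the 'RM' prefers."""
    samples = [random.gauss(pm, 1.0) for _ in range(n)]
    best = min(samples, key=lambda y: abs(y - rm))  # RM-preferred sample
    return pm + lr * (best - pm)

def m_step(pm_after, rm, lr=0.5):
    """M-step caricature: adapt the 'RM' toward the evolving policy distribution
    (the paper instead trains it on PM outputs before vs. after the E-step)."""
    return rm + lr * (pm_after - rm)

pm, rm = 0.0, 5.0
for _ in range(10):
    pm = e_step(pm, rm)
    rm = m_step(pm, rm)
print(round(pm, 2), round(rm, 2))  # the two models converge toward each other
```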
[AI-182] Improvement of Optimization using Learning Based Models in Mixed Integer Linear Programming Tasks
【速读】:该论文试图解决在大规模、实时场景中混合整数线性规划(Mixed Integer Linear Program, MILP)求解过程中计算时间过长的问题。解决方案的关键在于提出一种基于学习的框架,利用行为克隆(Behavior Cloning, BC)和强化学习(Reinforcement Learning, RL)训练图神经网络(Graph Neural Network, GNN),以生成高质量的初始解,从而为MILP求解器提供热启动,提升求解效率。
链接: https://arxiv.org/abs/2506.06291
作者: Xiaoke Wang,Batuhan Altundas,Zhaoxin Li,Aaron Zhao,Matthew Gombolay
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 4 pages, 4 figures
Abstract:Mixed Integer Linear Programs (MILPs) are essential tools for solving planning and scheduling problems across critical industries such as construction, manufacturing, and logistics. However, their widespread adoption is limited by long computational times, especially in large-scale, real-time scenarios. To address this, we present a learning-based framework that leverages Behavior Cloning (BC) and Reinforcement Learning (RL) to train Graph Neural Networks (GNNs), producing high-quality initial solutions for warm-starting MILP solvers in Multi-Agent Task Allocation and Scheduling Problems. Experimental results demonstrate that our method reduces optimization time and variance compared to traditional techniques while maintaining solution quality and feasibility.
zh
[AI-183] Deep Research Bench: Evaluating AI Web Research Agents
【速读】:该论文试图解决当前缺乏对网络研究代理(web research agents)质量的直接评估问题,特别是在网络内容持续变化的背景下。其解决方案的关键在于引入Deep Research Bench,这是一个包含89个跨8个不同任务类别的多步骤网络研究任务实例的基准测试集,并由熟练人员精心计算出答案。此外,论文还提供了“RetroSearch”环境,该环境包含大量冻结的网页数据,使得离线“RetroSearch”代理能够与实时网络代理表现相当,从而实现了模型随时间变化的可靠评估。
链接: https://arxiv.org/abs/2506.06287
作者: FutureSearch:Nikos I. Bosse,Jon Evans,Robert G. Gambee,Daniel Hnyk,Peter Mühlbacher,Lawrence Phillips,Dan Schwarz,Jack Wildman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Amongst the most common use cases of modern AI is LLM chat with web search enabled. However, no direct evaluations of the quality of web research agents exist that control for the continually-changing web. We introduce Deep Research Bench, consisting of 89 multi-step web research task instances of varying difficulty across 8 diverse task categories, with the answers carefully worked out by skilled humans. We provide a “RetroSearch” environment with a large frozen set of scraped web pages, and demonstrate that offline “RetroSearch” agents perform comparably to “live web” agents, enabling reliable evaluations of models over time. We provide robust agent tooling and scaffolding to benchmark major LLMs as they are released, including “thinking” models like o3 and Gemini 2.5 Pro. We include automated evaluations of the lengthy agent traces to report progress over time in hallucinations, tool use, and forgetting. Finally, we evaluate the major web research products branded as “Deep Research”, “Deep Search”, “Search”, or “Research.” Results are available on a public leaderboard at this https URL.
zh
[AI-184] Disentangling AI Alignment: A Structured Taxonomy Beyond Safety and Ethics
【速读】:该论文试图解决人工智能代理在现实世界中操作时,如何确保其不仅安全,还能符合更广泛的规范性期望这一跨学科挑战。其解决方案的关键在于构建一个结构化的概念框架,通过区分对齐目标(如安全性、伦理性和合法性等)、作用范围(结果对齐与执行对齐)以及利益相关者(个体对齐与集体对齐)三个维度,明确AI对齐的不同配置,从而为不同领域之间的实践与哲学整合提供基础。
链接: https://arxiv.org/abs/2506.06286
作者: Kevin Baum
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: accepted for the LNCS post proceedings of the AISoLA 2024 conference
Abstract:Recent advances in AI research make it increasingly plausible that artificial agents with consequential real-world impact will soon operate beyond tightly controlled environments. Ensuring that these agents are not only safe but that they adhere to broader normative expectations is thus an urgent interdisciplinary challenge. Multiple fields – notably AI Safety, AI Alignment, and Machine Ethics – claim to contribute to this task. However, the conceptual boundaries and interrelations among these domains remain vague, leaving researchers without clear guidance in positioning their work. To address this meta-challenge, we develop a structured conceptual framework for understanding AI alignment. Rather than focusing solely on alignment goals, we introduce a taxonomy distinguishing the alignment aim (safety, ethicality, legality, etc.), scope (outcome vs. execution), and constituency (individual vs. collective). This structural approach reveals multiple legitimate alignment configurations, providing a foundation for practical and philosophical integration across domains, and clarifying what it might mean for an agent to be aligned all-things-considered.
zh
[AI-185] NFISiS: New Perspectives on Fuzzy Inference Systems for Renewable Energy Forecasting
【速读】:该论文试图解决现有可演化模糊系统(Evolving Fuzzy Systems, eFS)模型由于缺乏公开可用的实现而限制其可访问性和广泛应用的问题。解决方案的关键在于开发了一个名为evolvingfuzzysystems的Python库,该库实现了多种经典的eFS模型,并提供了用于训练、可视化和性能评估的内置工具,从而促进了模型的评估与比较。
链接: https://arxiv.org/abs/2506.06285
作者: Kaike Sa Teles Rocha Alves,Eduardo Pestana de Aguiar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Evolving Fuzzy Systems (eFS) have gained significant attention due to their ability to adaptively update their structure in response to data dynamics while maintaining interpretability. However, the lack of publicly available implementations of these models limits their accessibility and widespread adoption. To address this gap, we present evolvingfuzzysystems, a Python library that provides implementations of several well-established eFS models, including ePL-KRLS-DISCO, ePL+, eMG, ePL, exTS, Simpl_eTS, and eTS. The library facilitates model evaluation and comparison by offering built-in tools for training, visualization, and performance assessment. The models are evaluated using the fetch_california_housing dataset, with performance measured in terms of normalized root-mean-square error (NRMSE), non-dimensional error index (NDEI), and mean absolute percentage error (MAPE). Additionally, computational complexity is analyzed by measuring execution times and rule evolution during training and testing phases. The results highlight ePL as a simple yet efficient model that balances accuracy and computational cost, making it particularly suitable for real-world applications. By making these models publicly available, evolvingfuzzysystems aims to foster research and practical applications in adaptive and interpretable machine learning.
zh
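The three reported error measures are easy to compute directly. Definitions vary slightly across the eFS literature; below, NDEI is taken as RMSE normalized by the target's standard deviation, which makes it coincide with this NRMSE convention.

```python
import numpy as np

def nrmse(y, yhat):
    """Root-mean-square error normalized by the standard deviation of y."""
    return np.sqrt(np.mean((y - yhat) ** 2)) / np.std(y)

def ndei(y, yhat):
    """Non-dimensional error index; here identical to the NRMSE above."""
    return nrmse(y, yhat)

def mape(y, yhat):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs((y - yhat) / y))

y = np.array([2.0, 2.5, 3.0, 3.5])
yhat = np.array([2.1, 2.4, 3.2, 3.4])
print(nrmse(y, yhat), ndei(y, yhat), mape(y, yhat))
```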
[AI-186] Unreal Patterns
【速读】:该论文试图解决如何在信息表示中处理不存在或可能永远不会存在的实体(如虚构实体、蓝图、模拟和未来情景)的问题。传统方法通过引入“虚拟实例”或依赖模态逻辑来处理此类实体,但被批评为存在哲学上的过度承诺或计算效率低下。论文提出的解决方案关键在于使用实际类型的交集来建模这些情况,而非依赖具体的非存在个体(non-existent tokens),从而在保持现实主义本体论框架的同时,提供一种实用且可计算的处理方式。
链接: https://arxiv.org/abs/2506.06284
作者: John Beverley,Jim Logan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces a framework for representing information about entities that do not exist or may never exist, such as those involving fictional entities, blueprints, simulations, and future scenarios. Traditional approaches that introduce “dummy instances” or rely on modal logic are criticized, and a proposal is defended in which such cases are modeled using the intersections of actual types rather than specific non-existent tokens. The paper positions itself within the Basic Formal Ontology and its realist commitments, emphasizing the importance of practical, implementable solutions over purely metaphysical or philosophical proposals, arguing that existing approaches to non-existent entities either overcommit to metaphysical assumptions or introduce computational inefficiencies that hinder applications. By developing a structured, ontology-driven approach to unreal patterns, the paper aims to provide a useful and computationally viable means of handling references to hypothetical or non-existent entities.
zh
[AI-187] Understanding Financial Reasoning in AI: A Multimodal Benchmark and Error Learning Approach
【速读】:该论文旨在解决金融领域中AI模型在处理复杂视觉数据(如图表、表格和趋势图)时的推理能力不足问题,尤其是在结合文本与视觉模态进行财务分析时的表现。其关键解决方案是一个基于错误感知的学习框架,该框架通过利用历史模型错误和反馈来指导推理过程,而无需进行微调,从而提升模型在金融特定场景下的推理性能。
链接: https://arxiv.org/abs/2506.06282
作者: Shuangyan Deng,Haizhou Peng,Jiachen Xu,Chunhou Liu,Ciprian Doru Giurcuaneanu,Jiamou Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Effective financial reasoning demands not only textual understanding but also the ability to interpret complex visual data such as charts, tables, and trend graphs. This paper introduces a new benchmark designed to evaluate how well AI models - especially large language and multimodal models - reason in finance-specific contexts. Covering 3,200 expert-level question-answer pairs across 15 core financial topics, the benchmark integrates both textual and visual modalities to reflect authentic analytical challenges in finance. To address limitations in current reasoning approaches, we propose an error-aware learning framework that leverages historical model mistakes and feedback to guide inference, without requiring fine-tuning. Our experiments across state-of-the-art models show that multimodal inputs significantly enhance performance and that incorporating error feedback leads to consistent and measurable improvements. The results highlight persistent challenges in visual understanding and mathematical logic, while also demonstrating the promise of self-reflective reasoning in financial AI systems. Our code and data can be found at https://anonymous/FinMR/CodeData.
zh
[AI-188] owards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
【速读】:该论文试图解决生成式 AI(Generative AI)在金融领域中生成基础分析报告时存在的准确性不足问题,尤其是在实际应用中缺乏有效的评估标准。现有金融基准主要关注模型回答金融问题的能力,而未涵盖生成财务分析报告等实际任务。论文提出的解决方案是构建 FinAR-Bench 基准数据集,聚焦于财务报表分析这一基础分析的核心能力,并通过将任务分解为三个可测量的步骤——关键信息提取、财务指标计算和逻辑推理——来提高评估的精确性和可靠性。该结构化方法有助于客观评估大语言模型(LLMs)在每个步骤中的表现,从而更真实地反映其在实际金融场景中的性能。
链接: https://arxiv.org/abs/2506.07315
作者: Zonghan Wu,Junlin Wang,Congyuan Zou,Chenhan Wang,Yilei Shao
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI, particularly large language models (LLMs), is beginning to transform the financial industry by automating tasks and helping to make sense of complex financial information. One especially promising use case is the automatic creation of fundamental analysis reports, which are essential for making informed investment decisions, evaluating credit risks, guiding corporate mergers, etc. While LLMs attempt to generate these reports from a single prompt, the risks of inaccuracy are significant. Poor analysis can lead to misguided investments, regulatory issues, and loss of trust. Existing financial benchmarks mainly evaluate how well LLMs answer financial questions but do not reflect performance in real-world tasks like generating financial analysis reports. In this paper, we propose FinAR-Bench, a solid benchmark dataset focusing on financial statement analysis, a core competence of fundamental analysis. To make the evaluation more precise and reliable, we break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning. This structured approach allows us to objectively assess how well LLMs perform each step of the process. Our findings offer a clear understanding of LLMs’ current strengths and limitations in fundamental analysis and provide a more practical way to benchmark their performance in real-world financial settings.
zh
[AI-189] From Axioms to Algorithms: Mechanized Proofs of the vNM Utility Theorem
【速读】:该论文旨在通过形式化验证的方式,对冯·诺依曼-摩根斯坦(von Neumann-Morgenstern, vNM)期望效用定理进行严格的数学建模与证明。其核心问题是确保在偏好关系满足完备性、传递性、连续性和独立性等经典公理的前提下,能够机器验证地证明效用函数的存在性与唯一性。解决方案的关键在于利用Lean 4交互式定理证明器,实现对独立性公理的细粒度建模,以及对混合彩票的基本性质进行形式化证明,并通过构造性方法展示效用函数的存在性,同时结合计算实验验证结果的有效性。
链接: https://arxiv.org/abs/2506.07066
作者: Li Jingyuan
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注:
Abstract:This paper presents a comprehensive formalization of the von Neumann-Morgenstern (vNM) expected utility theorem using the Lean 4 interactive theorem prover. We implement the classical axioms of preference (completeness, transitivity, continuity, and independence), enabling machine-verified proofs of both the existence and uniqueness of utility representations. Our formalization captures the mathematical structure of preference relations over lotteries, verifying that preferences satisfying the vNM axioms can be represented by expected utility maximization. Our contributions include a granular implementation of the independence axiom, formally verified proofs of fundamental claims about mixture lotteries, constructive demonstrations of utility existence, and computational experiments validating the results. We prove equivalence to classical presentations while offering greater precision at decision boundaries. This formalization provides a rigorous foundation for applications in economic modeling, AI alignment, and management decision systems, bridging the gap between theoretical decision theory and computational implementation.
zh
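To give a flavor of what such a formalization looks like, here is a minimal Lean 4 sketch of a weak preference relation with completeness and transitivity. This is our own toy, not the paper's code, which additionally covers lotteries, continuity, and independence.

```lean
-- Minimal sketch: a complete, transitive "at least as good as" relation.
structure Preference (α : Type) where
  pref : α → α → Prop
  complete : ∀ a b : α, pref a b ∨ pref b a
  trans : ∀ a b c : α, pref a b → pref b c → pref a c

-- Strict preference derived from the weak relation.
def Preference.strict {α : Type} (P : Preference α) (a b : α) : Prop :=
  P.pref a b ∧ ¬ P.pref b a
```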
[AI-190] Less is More: some Computational Principles based on Parcimony and Limitations of Natural Intelligence
【速读】:该论文试图解决当前人工智能(Artificial Intelligence, AI)系统在效率、适应性和创造力方面与自然智能(Natural Intelligence, NI)存在显著差距的问题。其解决方案的关键在于借鉴自然智能中的约束条件,如有限的神经带宽、能量限制和稀疏数据,这些约束实际上促进了高效、灵活和创造性行为的产生。论文提出通过采用“少即是多”的原则,包括能量约束、简洁架构和真实世界交互,来推动更高效、可解释且具有生物基础的人工系统的发展。
链接: https://arxiv.org/abs/2506.07060
作者: Laura Cohen,Xavier Hinaut,Lilyana Petrova,Alexandre Pitti,Syd Reynal,Ichiro Tsuda
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:
Abstract:Natural intelligence (NI) consistently achieves more with less. Infants learn language, develop abstract concepts, and acquire sensorimotor skills from sparse data, all within tight neural and energy limits. In contrast, today’s AI relies on virtually unlimited computational power, energy, and data to reach high performance. This paper argues that constraints in NI are paradoxically catalysts for efficiency, adaptability, and creativity. We first show how limited neural bandwidth promotes concise codes that still capture complex patterns. Spiking neurons, hierarchical structures, and symbolic-like representations emerge naturally from bandwidth constraints, enabling robust generalization. Next, we discuss chaotic itinerancy, illustrating how the brain transits among transient attractors to flexibly retrieve memories and manage uncertainty. We then highlight reservoir computing, where random projections facilitate rapid generalization from small datasets. Drawing on developmental perspectives, we emphasize how intrinsic motivation, along with responsive social environments, drives infant language learning and discovery of meaning. Such active, embodied processes are largely absent in current AI. Finally, we suggest that adopting ‘less is more’ principles – energy constraints, parsimonious architectures, and real-world interaction – can foster the emergence of more efficient, interpretable, and biologically grounded artificial systems.
zh
[AI-191] AnnoDPO: Protein Functional Annotation Learning with Direct Preference Optimization
【速读】:该论文旨在解决蛋白质功能预测中的标注稀缺性和类别不平衡问题,这是蛋白质表示学习中的核心挑战。其解决方案的关键在于提出AnnoDPO框架,该框架借鉴了人类反馈强化学习(RLHF)的思想,利用直接偏好优化(DPO)方法,通过偏好对齐的训练目标来提升注释学习的效果,从而在生物知识整合方面建立新的范式。
链接: https://arxiv.org/abs/2506.07035
作者: Zixuan Jiang,Renjing Xu
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注:
Abstract:Deciphering protein function remains a fundamental challenge in protein representation learning. The task presents significant difficulties for protein language models (PLMs) due to the sheer volume of functional annotation categories and the highly imbalanced distribution of annotated instances across biological ontologies. Inspired by the remarkable success of reinforcement learning from human feedback (RLHF) in large language model (LLM) alignment, we propose AnnoDPO, a novel multi-modal framework for protein function prediction that leverages Direct Preference Optimization (DPO) to enhance annotation learning. Our methodology addresses the dual challenges of annotation scarcity and category imbalance through preference-aligned training objectives, establishing a new paradigm for biological knowledge integration in protein representation learning.
zh
[AI-192] A Statistical Framework for Model Selection in LSTM Networks
【速读】:该论文试图解决长短期记忆(Long Short-Term Memory, LSTM)神经网络模型选择中的问题,包括超参数调优、架构定义和正则化选择,这些问题目前主要依赖于启发式方法且计算成本较高。解决方案的关键在于提出一个统一的统计框架,将经典模型选择方法如信息准则和收缩估计扩展到序列神经网络,通过定义适应时间结构的惩罚似然函数、提出用于隐藏状态动态的广义阈值方法,并利用变分贝叶斯和近似边缘似然方法实现高效的估计策略。
链接: https://arxiv.org/abs/2506.06840
作者: Fahad Mostafa
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP)
备注:
Abstract:Long Short-Term Memory (LSTM) neural network models have become the cornerstone for sequential data modeling in numerous applications, ranging from natural language processing to time series forecasting. Despite their success, the problem of model selection, including hyperparameter tuning, architecture specification, and regularization choice remains largely heuristic and computationally expensive. In this paper, we propose a unified statistical framework for systematic model selection in LSTM networks. Our framework extends classical model selection ideas, such as information criteria and shrinkage estimation, to sequential neural networks. We define penalized likelihoods adapted to temporal structures, propose a generalized threshold approach for hidden state dynamics, and provide efficient estimation strategies using variational Bayes and approximate marginal likelihood methods. Several biomedical data centric examples demonstrate the flexibility and improved performance of the proposed framework.
zh
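结合上文摘要,下面给出一个把经典信息准则(AIC/BIC)用于 LSTM 结构选择的最小示意(PyTorch)。其中的高斯残差假设、玩具数据与网络结构均为演示假设,并非论文提出的时间结构惩罚似然或变分贝叶斯方法的官方实现:

```python
import math
import torch
import torch.nn as nn

class LSTMRegressor(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.head(h)

def information_criteria(model, x, y, epochs=300, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        mse = nn.functional.mse_loss(model(x), y).item()
    n = y.numel()
    k = sum(p.numel() for p in model.parameters())
    # 高斯残差假设下 -2*logL ≈ n*log(mse) + 常数
    nll2 = n * math.log(mse + 1e-12)
    return nll2 + 2 * k, nll2 + k * math.log(n)  # (AIC, BIC)

# 玩具序列:预测相移后的含噪正弦
t = torch.linspace(0, 20, 200).unsqueeze(-1)
x = torch.sin(t).unsqueeze(0)        # (1, 200, 1)
y = torch.sin(t + 0.3).unsqueeze(0)  # (1, 200, 1)
for h in (4, 16, 64):
    aic, bic = information_criteria(LSTMRegressor(h), x, y)
    print(f"hidden={h}: AIC={aic:.1f}, BIC={bic:.1f}")
```

论文的做法可粗略理解为:把上面固定的惩罚项 2k 或 k·log n 替换为适应时间依赖结构的惩罚,并用变分贝叶斯近似边缘似然来估计。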
[AI-193] Depth-Optimal Quantum Layout Synthesis as SAT
【速读】:该论文旨在解决量子电路布局综合(Quantum-circuit Layout Synthesis)中的问题,即在量子硬件的连接限制下,如何高效地将量子电路映射到物理设备上,同时减少电路中的CX门数量或深度以降低噪声影响。其解决方案的关键在于提出一种新的、高效的基于SAT(满足性问题)的编码方法,该方法不仅关注门的数量,还确保找到具有最小电路深度或最小CX门深度的映射电路,通过增量SAT求解和并行计划实现高效的求解过程。
链接: https://arxiv.org/abs/2506.06752
作者: Anna B. Jakobsen,Anders B. Clausen,Jaco van de Pol,Irfansha Shaik
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: 24 pages, 4 figures, 11 tables
Abstract:Quantum circuits consist of gates applied to qubits. Current quantum hardware platforms impose connectivity restrictions on binary CX gates. Hence, Layout Synthesis is an important step to transpile quantum circuits before they can be executed. Since CX gates are noisy, it is important to reduce the CX count or CX depth of the mapped circuits. We provide a new and efficient encoding of Quantum-circuit Layout Synthesis in SAT. Previous SAT encodings focused on gate count and CX-gate count. Our encoding instead guarantees that we find mapped circuits with minimal circuit depth or minimal CX-gate depth. We use incremental SAT solving and parallel plans for an efficient encoding. This results in speedups of more than 10-100x compared to OLSQ2, which guarantees depth-optimality. But minimizing depth still takes more time than minimizing gate count with Q-Synth. We correlate the noise reduction achieved by simulating circuits after (CX)-count and (CX)-depth reduction. We find that minimizing for CX-count correlates better with reducing noise than minimizing for CX-depth. However, taking into account both CX-count and CX-depth provides the best noise reduction.
zh
[AI-194] Neural Spectral Band Generation for Audio Coding INTERSPEECH2025
【速读】:该论文试图解决传统频谱带复制(Spectral Band Replication, SBR)在处理多种类型音频信号时存在的局限性,以及直接采用深度神经网络(Deep Neural Network, DNN)方法进行盲式带宽扩展(Blind Bandwidth Extension, BWE)导致的性能不佳问题。其解决方案的关键在于提出一种参数化非盲带宽扩展方法,通过在音频编码流程的前端和末端分别进行DNN-based的侧信息提取和带宽扩展,从而提升扩展效果。
链接: https://arxiv.org/abs/2506.06732
作者: Woongjib Choi,Byeong Hyeon Kim,Hyungseob Lim,Inseon Jang,Hong-Goo Kang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Accepted to Interspeech 2025
Abstract:Audio bandwidth extension is the task of reconstructing missing high frequency components of bandwidth-limited audio signals, where bandwidth limitation is a common issue for audio signals due to several reasons, including channel capacity and data constraints. While conventional spectral band replication is a well-established parametric approach to audio bandwidth extension, the SBR usually entails coarse feature extraction and reconstruction techniques, which leads to limitations when processing various types of audio signals. In parallel, numerous deep neural network-based audio bandwidth extension methods have been proposed. These DNN-based methods are usually referred to as blind BWE, as these methods do not rely on prior information extracted from original signals, and only utilize given low frequency band signals to estimate missing high frequency components. In order to replace conventional SBR with DNNs, simply adopting existing DNN-based methodologies results in suboptimal performance due to the blindness of these methods. My proposed research suggests a new approach to parametric non-blind bandwidth extension, as DNN-based side information extraction and DNN-based bandwidth extension are performed only at the front and end of the audio coding pipeline.
zh
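为直观说明摘要中"参数化带宽扩展"与"旁信息"的含义,下面用 numpy 写一个最粗糙的 SBR 式频带复制示意:把低频半带频谱复制到空缺的高频半带,并用编码端传输的能量增益缩放。函数名与 side_gain 等参数均为演示假设,与论文提出的 DNN 方法无关,仅作为其要替代的传统流程的背景:

```python
import numpy as np

def sbr_like_bwe(lowband, side_gain):
    """示意:低频半带频谱复制到高频半带,按旁信息增益缩放。"""
    spec = np.fft.rfft(lowband)
    half = len(spec) // 2
    spec[half:2 * half] = spec[:half] * side_gain  # 频带复制
    return np.fft.irfft(spec, n=len(lowband))

# 构造一个只含低频成分的测试信号
sr, n = 16000, 1024
t = np.arange(n) / sr
low = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
wide = sbr_like_bwe(low, side_gain=0.3)
print(wide.shape)  # (1024,)
```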
[AI-195] AS-ASR: A Lightweight Framework for Aphasia-Specific Automatic Speech Recognition
【速读】:该论文试图解决在资源受限的边缘设备上对失语症语音进行准确识别的问题(Aphasia-specific speech recognition)。解决方案的关键在于提出了一种轻量级的框架AS-ASR,其基于Whisper-tiny模型,并采用了一种混合训练策略,系统地结合标准语音与失语症语音以不同比例进行训练,从而实现鲁棒的泛化能力;同时引入了基于GPT-4的参考增强方法,以优化噪声失语症转录文本,提升监督质量。
链接: https://arxiv.org/abs/2506.06566
作者: Chen Bao,Chuanbing Huo,Qinyu Chen,Chang Gao
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:This paper proposes AS-ASR, a lightweight aphasia-specific speech recognition framework based on Whisper-tiny, tailored for low-resource deployment on edge devices. Our approach introduces a hybrid training strategy that systematically combines standard and aphasic speech at varying ratios, enabling robust generalization, and a GPT-4-based reference enhancement method that refines noisy aphasic transcripts, improving supervision quality. We conduct extensive experiments across multiple data mixing configurations and evaluation settings. Results show that our fine-tuned model significantly outperforms the zero-shot baseline, reducing WER on aphasic speech by over 30% while preserving performance on standard speech. The proposed framework offers a scalable, efficient solution for real-world disordered speech recognition.
zh
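摘要中混合训练策略的核心是按给定比例混合标准语音与失语症语音。下面是一个按比例采样构造训练批次的 Python 示意,数据用占位字符串代替真实音频,ratio、batch_size 等取值均为演示假设:

```python
import random

def mixed_batches(standard, aphasic, ratio, batch_size, steps):
    """按 ratio(失语症样本占比)混合两个语料库生成批次;
    论文系统地扫描了不同 ratio 下的泛化效果。"""
    for _ in range(steps):
        n_aph = round(batch_size * ratio)
        batch = random.sample(aphasic, n_aph) + \
                random.sample(standard, batch_size - n_aph)
        random.shuffle(batch)
        yield batch

standard = [f"std_{i}" for i in range(100)]
aphasic = [f"aph_{i}" for i in range(40)]
for batch in mixed_batches(standard, aphasic, ratio=0.25, batch_size=8, steps=2):
    print(batch)
```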
[AI-196] Model-based Neural Data Augmentation for sub-wavelength Radio Localization
【速读】:该论文旨在解决在复杂无线电环境,尤其是非视距(NLoS)传播路径主导的情况下,传统信号处理技术导致的定位精度下降问题。其解决方案的关键在于扩展基于指纹的定位框架,通过引入基于模型的神经网络来学习位置到信道的映射,并作为生成式神经信道模型使用,从而在减少内存需求的同时提高定位精度。该方法通过生成模型增强指纹比对字典,实现了亚波长级别的定位精度,相较于传统指纹方法在定位精度上提升了几个数量级,同时将内存需求降低了约一个数量级。
链接: https://arxiv.org/abs/2506.06387
作者: Baptiste Chatelier(IETR, INSA Rennes, MERCE-France),Vincent Corlay(MERCE-France),Musa Furkan Keskin,Matthieu Crussière(INSA Rennes, IETR),Henk Wymeersch,Luc Le Magoarou(INSA Rennes, IETR)
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:
Abstract:The increasing deployment of large antenna arrays at base stations has significantly improved the spatial resolution and localization accuracy of radio-localization methods. However, traditional signal processing techniques struggle in complex radio environments, particularly in scenarios dominated by non line of sight (NLoS) propagation paths, resulting in degraded localization accuracy. Recent developments in machine learning have facilitated the development of machine learning-assisted localization techniques, enhancing localization accuracy in complex radio environments. However, these methods often involve substantial computational complexity during both the training and inference phases. This work extends the well-established fingerprinting-based localization framework by simultaneously reducing its memory requirements and improving its accuracy. Specifically, a model-based neural network is used to learn the location-to-channel mapping, and then serves as a generative neural channel model. This generative model augments the fingerprinting comparison dictionary while reducing the memory requirements. The proposed method outperforms fingerprinting baselines by achieving sub-wavelength localization accuracy, even in NLoS environments. Remarkably, it offers an improvement by several orders of magnitude in localization accuracy, while simultaneously reducing memory requirements by an order of magnitude compared to classical fingerprinting methods.
zh
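下面的 numpy 草图示意"生成式信道模型增广指纹字典 + 最近邻匹配"的流程。其中 channel_model 用一个解析玩具函数代替论文中训练得到的基于模型的神经网络,网格范围与特征维度均为演示假设:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))

def channel_model(pos):
    """占位的"位置 -> 信道特征"生成模型(玩具解析函数)。"""
    return np.concatenate([np.cos(pos @ W), np.sin(pos @ W)], axis=-1)

# 用生成模型在密集网格上增广指纹比对字典
grid = np.stack(np.meshgrid(np.linspace(0, 10, 50),
                            np.linspace(0, 10, 50)), -1).reshape(-1, 2)
dictionary = channel_model(grid)

def localize(measured):
    d = np.linalg.norm(dictionary - measured, axis=1)
    return grid[np.argmin(d)]  # 最近邻指纹对应的位置

true_pos = np.array([[3.3, 7.1]])
print(localize(channel_model(true_pos)[0]))
```

设计上,真正占内存的是字典本身;论文的做法是只存储生成模型,在需要时生成(或局部细化)指纹,从而同时降低内存并细化定位网格。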
[AI-197] Towards real-time assessment of infrasound event detection capability using deep learning-based transmission loss estimation
【速读】:该论文旨在解决次声波(infrasound)传输损耗建模计算成本高、难以在业务化监测应用中探索大参数空间的问题。其解决方案的关键在于以风场和温度场为输入的神经网络模型,通过卷积层和循环层捕捉真实大气模型中随空间和距离变化的特征,从而实现近实时的传输损耗预测,并同时给出认知不确定性与数据相关不确定性的估计。
链接: https://arxiv.org/abs/2506.06358
作者: Alice Janela Cameijo,Alexis Le Pichon,Youcef Sklab,Souhila Arib,Quentin Brissaud,Sven peter Naesholm,Constantino Listowski,Samir Aknine
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 49 pages, 22 figures
Abstract:Accurate modeling of infrasound transmission loss is essential for evaluating the performance of the International Monitoring System, enabling the effective design and maintenance of infrasound stations to support compliance of the Comprehensive Nuclear-Test-Ban Treaty. State-of-the-art propagation modeling tools enable transmission loss to be finely simulated using atmospheric models. However, the computational cost prohibits the exploration of a large parameter space in operational monitoring applications. To address this, recent studies made use of a deep learning algorithm capable of making transmission loss predictions almost instantaneously. However, the use of nudged atmospheric models leads to an incomplete representation of the medium, and the absence of temperature as an input makes the algorithm incompatible with long range propagation. In this study, we address these limitations by using both wind and temperature fields as inputs to a neural network, simulated up to 130 km altitude and 4,000 km distance. We also optimize several aspects of the neural network architecture. We exploit convolutional and recurrent layers to capture spatially and range-dependent features embedded in realistic atmospheric models, improving the overall performance. The neural network reaches an average error of 4 dB compared to full parabolic equation simulations and provides epistemic and data-related uncertainty estimates. Its evaluation on the 2022 Hunga Tonga-Hunga Ha’apai volcanic eruption demonstrates its prediction capability using atmospheric conditions and frequencies not included in the training. This represents a significant step towards near real-time assessment of International Monitoring System detection thresholds of explosive sources.
zh
[AI-198] Large Language Models for EEG: A Comprehensive Survey and Taxonomy
【速读】:该论文试图解决如何将大型语言模型(Large Language Models, LLMs)与脑电图(EEG)研究相结合,以推动神经解码、脑机接口(BCI)和情感计算等领域的创新。其解决方案的关键在于利用基于Transformer的架构,通过微调、少样本学习和零样本学习等方法,使EEG-based模型能够执行复杂的任务,如自然语言生成、语义解释和诊断辅助。论文通过系统综述和结构化分类,提供了建模策略、系统设计及应用领域的全面概述,为未来融合自然语言处理与神经信号分析的研究奠定了基础。
链接: https://arxiv.org/abs/2506.06353
作者: Naseem Babu,Jimson Mathew,A. P. Vinod
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The growing convergence between Large Language Models (LLMs) and electroencephalography (EEG) research is enabling new directions in neural decoding, brain-computer interfaces (BCIs), and affective computing. This survey offers a systematic review and structured taxonomy of recent advancements that utilize LLMs for EEG-based analysis and applications. We organize the literature into four domains: (1) LLM-inspired foundation models for EEG representation learning, (2) EEG-to-language decoding, (3) cross-modal generation including image and 3D object synthesis, and (4) clinical applications and dataset management tools. The survey highlights how transformer-based architectures adapted through fine-tuning, few-shot, and zero-shot learning have enabled EEG-based models to perform complex tasks such as natural language generation, semantic interpretation, and diagnostic assistance. By offering a structured overview of modeling strategies, system designs, and application areas, this work serves as a foundational resource for future work to bridge natural language processing and neural signal analysis through language models.
zh
[AI-199] Deep learning methods for modeling infrasound transmission loss in the middle atmosphere
【速读】:该论文旨在解决传统传播建模工具(如抛物线方程方法)在计算成本上的局限性,以实现对全球大气中次声波传输损耗(Infrasound Transmission Losses, TLs)的高效预测。其解决方案的关键在于开发一种优化的卷积神经网络,通过利用全球范围内的温度和风场联合模拟数据(传播距离超过4000公里),显著降低计算时间并提高预测精度,平均误差为8.6 dB,在0.1-3.2 Hz频率范围内表现出良好的性能。
链接: https://arxiv.org/abs/2506.06351
作者: Alexis Le Pichon,Alice Janela Cameijo,Samir Aknine,Youcef Sklab,Souhila Arib,Quentin Brissaud,Sven Peter Naesholm
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 7 figures
Abstract:Accurate modeling of infrasound transmission losses (TLs) is essential to assess the performance of the global International Monitoring System infrasound network. Among existing propagation modeling tools, parabolic equation (PE) method enables TLs to be finely modeled, but its computational cost does not allow exploration of a large parameter space for operational monitoring applications. To reduce computation times, Brissaud et al. 2023 explored the potential of convolutional neural networks trained on a large set of regionally simulated wavefields ( 1000 km from the source) to predict TLs with negligible computation times compared to PE simulations. However, this method struggles in unfavorable initial wind conditions, especially at high frequencies, and causal issues with winds at large distances from the source affecting ground TLs close to the source. In this study, we have developed an optimized convolutional network designed to minimize prediction errors while predicting TLs from globally simulated combined temperature and wind fields spanning over propagation ranges of 4000 km. Our approach enhances the previously proposed one by implementing key optimizations that improve the overall architecture performance. The implemented model predicts TLs with an average error of 8.6 dB in the whole frequency band (0.1-3.2 Hz) and explored realistic atmospheric scenarios.
zh
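下面给出一个以(风场、温度场)二通道大气切片为输入、输出沿传播距离的地面传输损耗曲线的小型卷积网络示意(PyTorch)。网络结构、网格尺寸均为演示假设,并非论文优化后的架构(论文还结合了循环层等组件):

```python
import torch
import torch.nn as nn

class TLNet(nn.Module):
    """示意:输入 (B, 2, 高度, 距离) 的大气场,输出 (B, n_range) 的传输损耗(dB)。"""
    def __init__(self, n_range=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, n_range)),  # 压掉高度维,保留距离维
        )
        self.head = nn.Conv1d(32, 1, kernel_size=1)

    def forward(self, atmos):               # (B, 2, H, R)
        z = self.encoder(atmos).squeeze(2)  # (B, 32, n_range)
        return self.head(z).squeeze(1)      # (B, n_range)

net = TLNet()
atmos = torch.randn(4, 2, 64, 128)  # 批量 4 个(风, 温)场
print(net(atmos).shape)  # torch.Size([4, 128])
```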
[AI-200] Explainable-AI powered stock price prediction using time series transformers: A Case Study on BIST100
【速读】:该论文试图解决金融数据复杂性带来的财务素养挑战,特别是在股票价格预测中提升模型的可解释性与准确性。解决方案的关键在于将基于Transformer的时间序列模型与可解释人工智能(XAI)相结合,通过引入SHAP和LIME等技术,增强模型决策过程的透明度,从而帮助个体做出更明智的投资决策并积极参与金融市场。
链接: https://arxiv.org/abs/2506.06345
作者: Sukru Selim Calik,Andac Akyuz,Zeynep Hilal Kilimci,Kerem Colak
机构: 未知
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Financial literacy is increasingly dependent on the ability to interpret complex financial data and utilize advanced forecasting tools. In this context, this study proposes a novel approach that combines transformer-based time series models with explainable artificial intelligence (XAI) to enhance the interpretability and accuracy of stock price predictions. The analysis focuses on the daily stock prices of the five highest-volume banks listed in the BIST100 index, along with XBANK and XU100 indices, covering the period from January 2015 to March 2025. Models including DLinear, LTSNet, Vanilla Transformer, and Time Series Transformer are employed, with input features enriched by technical indicators. SHAP and LIME techniques are used to provide transparency into the influence of individual features on model outputs. The results demonstrate the strong predictive capabilities of transformer models and highlight the potential of interpretable machine learning to empower individuals in making informed investment decisions and actively engaging in financial markets.
zh
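下面示意如何用 SHAP 为一个价格回归模型给出逐特征贡献(Python,需要 pip install shap)。模型与特征为随机玩具数据,仅展示摘要中 XAI 流程的形态;论文实际作用于 transformer 时间序列模型与技术指标特征:

```python
import numpy as np
import shap  # pip install shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# 玩具特征:假想的 4 个技术指标
X = rng.normal(size=(300, 4))
y = 1.5 * X[:, 0] - 0.8 * X[:, 2] + 0.1 * rng.normal(size=300)

model = GradientBoostingRegressor().fit(X, y)

# KernelExplainer 是模型无关的;背景集用 50 个样本近似
explainer = shap.KernelExplainer(model.predict, shap.sample(X, 50))
shap_values = explainer.shap_values(X[:5])
print(shap_values.shape)  # (5, 4):每个样本上每个指标的贡献
```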
[AI-201] A Reinforcement Learning Approach for RIS-aided Fair Communications
【速读】:该论文试图解决在使用可重构智能表面(Reconfigurable Intelligent Surfaces, RISs)与强化学习(Reinforcement Learning, RL)结合的通信系统中,如何实现多用户设备(User Equipment, UE)之间的公平性问题。其关键解决方案是提出一种新的方法,旨在获得高效且公平的双工RIS-RL系统,确保所有合法UE单元都能接收到足够强度的信号,避免因功率不足而导致部分用户被剥夺服务。
链接: https://arxiv.org/abs/2506.06344
作者: Alex Pierron,Michel Barbeau,Luca De Cicco,Jose Rubio-Hernan,Joaquin Garcia-Alfaro
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 6 figures, 1 table, 16 references
Abstract:Reconfigurable Intelligent Surfaces (RISs) are composed of physical elements that can dynamically alter electromagnetic wave properties to enhance beamforming and leading to improvements in areas with low coverage properties. They have the potential to be combined with Reinforcement Learning (RL) techniques to achieve network performance and energy efficiency via optimization techniques. In addition to performance and energy improvements, it is also crucial to consider the concept of fair communications. RISs must ensure that User Equipment (UE) units receive their signals with adequate strength, without other UE being deprived of service due to insufficient power. In this paper, we address such a problem. We explore the fairness properties of previous work and propose a novel method that aims at obtaining an efficient and fair duplex RIS-RL system for multiple legitimate UE units. We report and discuss our experimental work and simulation results. We also release our code and datasets to foster further research in the topic.
zh
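多用户公平性通常可以用 Jain 指数量化。下面给出一个把平均吞吐量与 Jain 指数组合成 RL 奖励的 Python 示意;加权系数 alpha 与具体函数形式均为演示假设,并非论文的具体奖励设计:

```python
import numpy as np

def jain_index(rates):
    """Jain 公平性指数:1/N <= J <= 1,各用户速率全部相等时为 1。"""
    rates = np.asarray(rates, dtype=float)
    return rates.sum() ** 2 / (len(rates) * (rates ** 2).sum())

def reward(rates, alpha=0.5):
    # 吞吐量与公平性的加权组合,作为 RL 智能体的标量奖励(示意)
    return (1 - alpha) * np.log1p(rates).mean() + alpha * jain_index(np.asarray(rates))

print(jain_index([1, 1, 1, 1]))  # 1.0(完全公平)
print(jain_index([4, 0, 0, 0]))  # 0.25(只服务一个用户)
print(reward([2.0, 1.5, 0.1]))
```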
[AI-202] MoE-Gyro: Self-Supervised Over-Range Reconstruction and Denoising for MEMS Gyroscopes
【速读】:该论文试图解决MEMS陀螺仪在测量范围与噪声性能之间的固有权衡问题,以及现有硬件和深度学习方法在实际部署中的局限性。其解决方案的关键在于提出一种名为MoE-Gyro的自监督框架,该框架通过引入两个专家模块——过饱和信号重建专家(Over-Range Reconstruction Expert, ORE)和降噪专家(Denoise Expert, DE),实现同时的过范围信号重构与噪声抑制,并结合轻量级门控模块动态分配输入段至相应专家,从而有效扩展测量范围并提升信号质量。
链接: https://arxiv.org/abs/2506.06318
作者: Feiyang Pan,Shenghe Zheng,Chunyan Yin,Guangbin Dou
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:MEMS gyroscopes play a critical role in inertial navigation and motion control applications but typically suffer from a fundamental trade-off between measurement range and noise performance. Existing hardware-based solutions aimed at mitigating this issue introduce additional complexity, cost, and scalability challenges. Deep-learning methods primarily focus on noise reduction and typically require precisely aligned ground-truth signals, making them difficult to deploy in practical scenarios and leaving the fundamental trade-off unresolved. To address these challenges, we introduce Mixture of Experts for MEMS Gyroscopes (MoE-Gyro), a novel self-supervised framework specifically designed for simultaneous over-range signal reconstruction and noise suppression. MoE-Gyro employs two experts: an Over-Range Reconstruction Expert (ORE), featuring a Gaussian-Decay Attention mechanism for reconstructing saturated segments; and a Denoise Expert (DE), utilizing dual-branch complementary masking combined with FFT-guided augmentation for robust noise reduction. A lightweight gating module dynamically routes input segments to the appropriate expert. Furthermore, existing evaluation lack a comprehensive standard for assessing multi-dimensional signal enhancement. To bridge this gap, we introduce IMU Signal Enhancement Benchmark (ISEBench), an open-source benchmarking platform comprising the GyroPeak-100 dataset and a unified evaluation of IMU signal enhancement methods. We evaluate MoE-Gyro using our proposed ISEBench, demonstrating that our framework significantly extends the measurable range from 450 deg/s to 1500 deg/s, reduces Bias Instability by 98.4%, and achieves state-of-the-art performance, effectively addressing the long-standing trade-off in inertial sensing.
zh
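下面是"两专家 + 轻量门控"结构的最小 PyTorch 示意:门控为每个输入信号段输出两个权重,加权融合重建专家与降噪专家的输出。专家内部结构为演示假设,未包含论文中的 Gaussian-Decay Attention、双分支互补掩码与 FFT 引导增强:

```python
import torch
import torch.nn as nn

class TwoExpertMoE(nn.Module):
    def __init__(self, seg_len=256):
        super().__init__()
        self.ore = nn.Sequential(nn.Linear(seg_len, 256), nn.ReLU(),
                                 nn.Linear(256, seg_len))  # 过范围重建专家(占位)
        self.de = nn.Sequential(nn.Linear(seg_len, 256), nn.ReLU(),
                                nn.Linear(256, seg_len))   # 降噪专家(占位)
        self.gate = nn.Linear(seg_len, 2)                  # 轻量门控

    def forward(self, seg):                                  # (B, seg_len)
        w = torch.softmax(self.gate(seg), dim=-1)            # (B, 2)
        out = torch.stack([self.ore(seg), self.de(seg)], 1)  # (B, 2, seg_len)
        return (w.unsqueeze(-1) * out).sum(dim=1)            # (B, seg_len)

moe = TwoExpertMoE()
print(moe(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```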
[AI-203] DELPHYNE: A Pre-Trained Model for General and Financial Time Series
【速读】:该论文试图解决现有时间序列预训练模型在金融应用中表现不佳的问题,具体表现为在零样本和微调设置下未能超越简单的金融基准。其关键原因是预训练阶段缺乏金融数据以及跨领域时间序列模式差异导致的负迁移效应。为了解决这些问题,作者提出了一种名为Delphyne的金融时间序列预训练模型,该模型在少量微调步骤下即可达到与现有基础模型和全样本模型相当的性能,并在多种金融任务中表现出色。
链接: https://arxiv.org/abs/2506.06288
作者: Xueying Ding,Aakriti Mittal,Achintya Gopal
机构: 未知
类目: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Time-series data is a vital modality within data science communities. This is particularly valuable in financial applications, where it helps in detecting patterns, understanding market behavior, and making informed decisions based on historical data. Recent advances in language modeling have led to the rise of time-series pre-trained models that are trained on vast collections of datasets and applied to diverse tasks across financial domains. However, across financial applications, existing time-series pre-trained models have not shown boosts in performance over simple finance benchmarks in both zero-shot and fine-tuning settings. This phenomenon occurs because of a i) lack of financial data within the pre-training stage, and ii) the negative transfer effect due to inherently different time-series patterns across domains. Furthermore, time-series data is continuous, noisy, and can be collected at varying frequencies and with varying lags across different variables, making this data more challenging to model than languages. To address the above problems, we introduce a Pre-trained MoDEL for FINance TimE-series (Delphyne). Delphyne achieves competitive performance to existing foundation and full-shot models with few fine-tuning steps on publicly available datasets, and also shows superior performances on various financial tasks.
zh
机器学习
[LG-0] Realistic Urban Traffic Generator using Decentralized Federated Learning for the SUMO simulator
链接: https://arxiv.org/abs/2506.07980
作者: Alberto Bazán-Guillén,Carlos Beis-Penedo,Diego Cajaraville-Aboy,Pablo Barbecho-Bautista,Rebeca P. Díaz-Redondo,Luis J. de la Cruz Llopis,Ana Fernández-Vilas,Mónica Aguilar Igartua,Manuel Fernández-Veiga
类目: Machine Learning (cs.LG)
*备注: 21 pages, 7 figures
Abstract:Realistic urban traffic simulation is essential for sustainable urban planning and the development of intelligent transportation systems. However, generating high-fidelity, time-varying traffic profiles that accurately reflect real-world conditions, especially in large-scale scenarios, remains a major challenge. Existing methods often suffer from limitations in accuracy, scalability, or raise privacy concerns due to centralized data processing. This work introduces DesRUTGe (Decentralized Realistic Urban Traffic Generator), a novel framework that integrates Deep Reinforcement Learning (DRL) agents with the SUMO simulator to generate realistic 24-hour traffic patterns. A key innovation of DesRUTGe is its use of Decentralized Federated Learning (DFL), wherein each traffic detector and its corresponding urban zone function as an independent learning node. These nodes train local DRL models using minimal historical data and collaboratively refine their performance by exchanging model parameters with selected peers (e.g., geographically adjacent zones), without requiring a central coordinator. Evaluated using real-world data from the city of Barcelona, DesRUTGe outperforms standard SUMO-based tools such as RouteSampler, as well as other centralized learning approaches, by delivering more accurate and privacy-preserving traffic pattern generation.
[LG-1] Hyperpruning: Efficient Search through Pruned Variants of Recurrent Neural Networks Leveraging Lyapunov Spectrum
链接: https://arxiv.org/abs/2506.07975
作者: Caleb Zheng,Eli Shlizerman
类目: Machine Learning (cs.LG)
*备注: 26 pages, 3 figures
Abstract:A variety of pruning methods have been introduced for over-parameterized Recurrent Neural Networks to improve efficiency in terms of power consumption and storage utilization. These advances motivate a new paradigm, termed 'hyperpruning', which seeks to identify the most suitable pruning strategy for a given network architecture and application. Unlike conventional hyperparameter search, where the optimal configuration’s accuracy remains uncertain, in the context of network pruning, the accuracy of the dense model sets the target for the accuracy of the pruned one. The goal, therefore, is to discover pruned variants that match or even surpass this established accuracy. However, exhaustive search over pruning configurations is computationally expensive and lacks early performance guarantees. To address this challenge, we propose a novel Lyapunov Spectrum (LS)-based distance metric that enables early comparison between pruned and dense networks, allowing accurate prediction of post-training performance. By integrating this LS-based distance with standard hyperparameter optimization algorithms, we introduce an efficient hyperpruning framework, termed LS-based Hyperpruning (LSH). LSH reduces search time by an order of magnitude compared to conventional approaches relying on full training. Experiments on stacked LSTM and RHN architectures using the Penn Treebank dataset, and on AWD-LSTM-MoS using WikiText-2, demonstrate that under fixed training budgets and target pruning ratios, LSH consistently identifies superior pruned models. Remarkably, these pruned variants not only outperform those selected by loss-based baseline but also exceed the performance of their dense counterpart.
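摘要的核心是用 Lyapunov 谱(LS)之间的距离在训练早期比较剪枝网络与稠密网络。下面用 numpy 示意估计离散动力系统 Lyapunov 谱的标准 QR 迭代做法;其中的 tanh 循环映射仅为玩具动力学,并非论文中的 LSTM/RHN,LS 距离取谱向量的欧氏距离也只是一种示意选择:

```python
import numpy as np

def lyapunov_spectrum(f, jac, x0, steps=2000, burn=200):
    """用 QR 迭代估计 x_{t+1} = f(x_t) 的 Lyapunov 谱(从大到小排序)。"""
    x, n = np.asarray(x0, float), len(x0)
    Q, acc = np.eye(n), np.zeros(n)
    for t in range(steps):
        Q, R = np.linalg.qr(jac(x) @ Q)
        if t >= burn:  # 丢弃暖机段
            acc += np.log(np.abs(np.diag(R)))
        x = f(x)
    return np.sort(acc / (steps - burn))[::-1]

# 例:tanh 循环映射 x_{t+1} = tanh(W x_t)
rng = np.random.default_rng(0)
W = rng.normal(scale=1.2 / np.sqrt(8), size=(8, 8))
f = lambda x: np.tanh(W @ x)
jac = lambda x: np.diag(1 - np.tanh(W @ x) ** 2) @ W  # f 对 x 的雅可比
spec = lyapunov_spectrum(f, jac, rng.normal(size=8))
print(spec)
# 两个网络的 LS 距离可取 np.linalg.norm(spec_a - spec_b)(示意)
```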
[LG-2] A Two-Phase Deep Learning Framework for Adaptive Time-Stepping in High-Speed Flow Modeling
链接: https://arxiv.org/abs/2506.07969
作者: Jacob Helwig,Sai Sreeharsha Adavi,Xuan Zhang,Yuchao Lin,Felix S. Chim,Luke Takeshi Vizzini,Haiyang Yu,Muhammad Hasnain,Saykat Kumar Biswas,John J. Holloway,Narendra Singh,N. K. Anand,Swagnik Guhathakurta,Shuiwang Ji
类目: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
*备注:
Abstract:We consider the problem of modeling high-speed flows using machine learning methods. While most prior studies focus on low-speed fluid flows in which uniform time-stepping is practical, flows approaching and exceeding the speed of sound exhibit sudden changes such as shock waves. In such cases, it is essential to use adaptive time-stepping methods to allow a temporal resolution sufficient to resolve these phenomena while simultaneously balancing computational costs. Here, we propose a two-phase machine learning method, known as ShockCast, to model high-speed flows with adaptive time-stepping. In the first phase, we propose to employ a machine learning model to predict the timestep size. In the second phase, the predicted timestep is used as an input along with the current fluid fields to advance the system state by the predicted timestep. We explore several physically-motivated components for timestep prediction and introduce timestep conditioning strategies inspired by neural ODE and Mixture of Experts. As ShockCast is the first framework for learning high-speed flows, we evaluate our methods by generating two supersonic flow datasets, available at this https URL. Our code is publicly available as part of the AIRS library (this https URL).
[LG-3] Neural Tangent Kernel Analysis to Probe Convergence in Physics-informed Neural Solvers: PIKANs vs. PINNs
链接: https://arxiv.org/abs/2506.07958
作者: Salah A. Faroughi,Farinaz Mostajeran
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph); Analysis of PDEs (math.AP); Spectral Theory (math.SP)
*备注:
Abstract:Physics-informed Kolmogorov-Arnold Networks (PIKANs), and in particular their Chebyshev-based variants (cPIKANs), have recently emerged as promising models for solving partial differential equations (PDEs). However, their training dynamics and convergence behavior remain largely unexplored both theoretically and numerically. In this work, we aim to advance the theoretical understanding of cPIKANs by analyzing them using Neural Tangent Kernel (NTK) theory. Our objective is to discern the evolution of kernel structure throughout gradient-based training and its subsequent impact on learning efficiency. We first derive the NTK of standard cKANs in a supervised setting, and then extend the analysis to the physics-informed context. We analyze the spectral properties of NTK matrices, specifically their eigenvalue distributions and spectral bias, for four representative PDEs: the steady-state Helmholtz equation, transient diffusion and Allen-Cahn equations, and forced vibrations governed by the Euler-Bernoulli beam equation. We also conduct an investigation into the impact of various optimization strategies, e.g., first-order, second-order, and hybrid approaches, on the evolution of the NTK and the resulting learning dynamics. Results indicate a tractable behavior for NTK in the context of cPIKANs, which exposes learning dynamics that standard physics-informed neural networks (PINNs) cannot capture. Spectral trends also reveal when domain decomposition improves training, directly linking kernel behavior to convergence rates under different setups. To the best of our knowledge, this is the first systematic NTK study of cPIKANs, providing theoretical insight that clarifies and predicts their empirical performance.
[LG-4] Cost-Optimal Active AI Model Evaluation
链接: https://arxiv.org/abs/2506.07949
作者: Anastasios N. Angelopoulos,Jacob Eisenstein,Jonathan Berant,Alekh Agarwal,Adam Fisch
类目: Machine Learning (cs.LG)
*备注:
Abstract:The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes it necessary to rely on synthetic annotation data because of the low cost, despite the potential for substantial bias. In this paper, we develop novel, cost-aware methods for actively balancing the use of a cheap, but often inaccurate, weak rater – such as a model-based autorater that is designed to automatically assess the quality of generated content – with a more expensive, but also more accurate, strong rater alternative such as a human. More specifically, the goal of our approach is to produce a low variance, unbiased estimate of the mean of the target “strong” rating, subject to some total annotation budget. Building on recent work in active and prediction-powered statistical inference, we derive a family of cost-optimal policies for allocating a given annotation budget between weak and strong raters so as to maximize statistical efficiency. Using synthetic and real-world data, we empirically characterize the conditions under which these policies yield improvements over prior methods. We find that, especially in tasks where there is high variability in the difficulty of examples, our policies can achieve the same estimation precision at a far lower total annotation budget than standard evaluation methods.
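摘要中"弱评审 + 强评审"无偏均值估计的最简单形式是差值校正(prediction-powered 风格):用全量弱评分的均值,加上随机小样本上"强减弱"差的均值。下面是合成数据上的 Python 示意;论文进一步推导的是在总标注预算约束下二者之间的成本最优分配,此处未涉及:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
strong = rng.normal(0.7, 0.2, N)          # 昂贵的"强"评分(大部分实际不会被观测)
weak = strong + rng.normal(0.1, 0.3, N)   # 便宜但有偏的"弱"自动评分

n = 300                                   # 预算只够 300 个强标注
idx = rng.choice(N, n, replace=False)     # 随机抽样保证无偏

# 差值校正估计:全量弱评分均值 + 小样本上的偏差修正
mu_pp = weak.mean() + (strong[idx] - weak[idx]).mean()
print(f"仅弱评分均值:   {weak.mean():.4f}(有偏)")
print(f"校正后估计:     {mu_pp:.4f}")
print(f"真实强评分均值: {strong.mean():.4f}")
```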
[LG-5] TokenBreak: Bypassing Text Classification Models Through Token Manipulation
链接: https://arxiv.org/abs/2506.07948
作者: Kasimir Schulz,Kenneth Yeung,Kieran Evans
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Natural Language Processing (NLP) models are used for text-related tasks such as classification and generation. To complete these tasks, input data is first tokenized from human-readable text into a format the model can understand, enabling it to make inferences and understand context. Text classification models can be implemented to guard against threats such as prompt injection attacks against Large Language Models (LLMs), toxic input and cybersecurity risks such as spam emails. In this paper, we introduce TokenBreak: a novel attack that can bypass these protection models by taking advantage of the tokenization strategy they use. This attack technique manipulates input text in such a way that certain models give an incorrect classification. Importantly, the end target (LLM or email recipient) can still understand and respond to the manipulated text and therefore be vulnerable to the very attack the protection model was put in place to prevent. The tokenizer is tied to model architecture, meaning it is possible to predict whether or not a model is vulnerable to attack based on family. We also present a defensive strategy as an added layer of protection that can be implemented without having to retrain the defensive model.
[LG-6] Ensemble-Based Survival Models with the Self-Attended Beran Estimator Predictions
链接: https://arxiv.org/abs/2506.07933
作者: Lev V. Utkin,Semen P. Khomets,Vlada A. Efremenko,Andrei V. Konstantinov,Natalya M. Verbova
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Survival analysis predicts the time until an event of interest, such as failure or death, but faces challenges due to censored data, where some events remain unobserved. Ensemble-based models, like random survival forests and gradient boosting, are widely used but can produce unstable predictions due to variations in bootstrap samples. To address this, we propose SurvBESA (Survival Beran Estimators Self-Attended), a novel ensemble model that combines Beran estimators with a self-attention mechanism. Unlike traditional methods, SurvBESA applies self-attention to predicted survival functions, smoothing out noise by adjusting each survival function based on its similarity to neighboring survival functions. We also explore a special case using Huber’s contamination model to define attention weights, simplifying training to a quadratic or linear optimization problem. Numerical experiments show that SurvBESA outperforms state-of-the-art models. The implementation of SurvBESA is publicly available.
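下面用 numpy 示意"对预测的生存函数做自注意力平滑":按曲线间相似度计算 softmax 权重,再加权组合邻近的生存函数,使离群曲线被拉向其近邻。温度参数 tau 与距离定义为演示假设,并非论文(SurvBESA)的精确注意力形式:

```python
import numpy as np

def self_attend_survival(S, tau=0.05):
    """S: (m, T),m 条由 Beran 估计器预测的生存函数(各时间点的存活概率)。"""
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).mean(-1)  # (m, m) 两两平方距离
    A = np.exp(-d2 / tau)
    A /= A.sum(axis=1, keepdims=True)                     # 按行 softmax 权重
    return A @ S                                          # 平滑后的生存函数

t = np.linspace(0, 5, 50)
S = np.exp(-np.outer(np.array([0.5, 0.55, 2.0]), t))  # 两条相近 + 一条差异大
S_noisy = np.clip(S + np.random.default_rng(0).normal(0, 0.02, S.shape), 0, 1)
print(self_attend_survival(S_noisy).shape)  # (3, 50)
```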
[LG-7] A Generative Physics-Informed Reinforcement Learning-Based Approach for Construction of Representative Drive Cycle
链接: https://arxiv.org/abs/2506.07929
作者: Amirreza Yasami,Mohammadali Tofigh,Mahdi Shahbakhti,Charles Robert Koch
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Accurate driving cycle construction is crucial for vehicle design, fuel economy analysis, and environmental impact assessments. A generative Physics-Informed Expected SARSA-Monte Carlo (PIESMC) approach that constructs representative driving cycles by capturing transient dynamics, acceleration, deceleration, idling, and road grade transitions while ensuring model fidelity is introduced. Leveraging a physics-informed reinforcement learning framework with Monte Carlo sampling, PIESMC delivers efficient cycle construction with reduced computational cost. Experimental evaluations on two real-world datasets demonstrate that PIESMC replicates key kinematic and energy metrics, achieving up to a 57.3% reduction in cumulative kinematic fragment errors compared to the Micro-trip-based (MTB) method and a 10.5% reduction relative to the Markov-chain-based (MCB) method. Moreover, it is nearly an order of magnitude faster than conventional techniques. Analyses of vehicle-specific power distributions and wavelet-transformed frequency content further confirm its ability to reproduce experimental central tendencies and variability.
[LG-8] W4S4: WaLRUS Meets S4 for Long-Range Sequence Modeling
链接: https://arxiv.org/abs/2506.07920
作者: Hossein Babaei,Mel White,Richard G. Baraniuk
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*备注: 10 pages, 2 figures, 3 tables
Abstract:State Space Models (SSMs) have emerged as powerful components for sequence modeling, enabling efficient handling of long-range dependencies via linear recurrence and convolutional computation. However, their effectiveness depends heavily on the choice and initialization of the state matrix. In this work, we build on the SaFARi framework and existing WaLRUS SSMs to introduce a new variant, W4S4 (WaLRUS for S4), a new class of SSMs constructed from redundant wavelet frames. WaLRUS admits a stable diagonalization and supports fast kernel computation without requiring low-rank approximations, making it both theoretically grounded and computationally efficient. We show that WaLRUS retains information over long horizons significantly better than HiPPO-based SSMs, both in isolation and when integrated into deep architectures such as S4. Our experiments demonstrate consistent improvements across delay reconstruction tasks, classification benchmarks, and long-range sequence modeling, confirming that high-quality, structured initialization enabled by wavelet-based state dynamic offers substantial advantages over existing alternatives. WaLRUS provides a scalable and versatile foundation for the next generation of deep SSM-based models.
[LG-9] CausalPFN: Amortized Causal Effect Estimation via In-Context Learning
链接: https://arxiv.org/abs/2506.07918
作者: Vahid Balazadeh,Hamidreza Kamkari,Valentin Thomas,Benson Li,Junwei Ma,Jesse C. Cresswell,Rahul G. Krishnan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out-of-the-box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model does not require any further training or tuning and takes a step toward automated causal inference (this https URL).
[LG-10] FunDiff: Diffusion Models over Function Spaces for Physics-Informed Generative Modeling
链接: https://arxiv.org/abs/2506.07902
作者: Sifan Wang,Zehao Dou,Tong-Rui Liu,Lu Lu
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 31 pages, 12 figures
Abstract:Recent advances in generative modeling – particularly diffusion models and flow matching – have achieved remarkable success in synthesizing discrete data such as images and videos. However, adapting these models to physical applications remains challenging, as the quantities of interest are continuous functions governed by complex physical laws. Here, we introduce \textbfFunDiff , a novel framework for generative modeling in function spaces. FunDiff combines a latent diffusion process with a function autoencoder architecture to handle input functions with varying discretizations, generate continuous functions evaluable at arbitrary locations, and seamlessly incorporate physical priors. These priors are enforced through architectural constraints or physics-informed loss functions, ensuring that generated samples satisfy fundamental physical laws. We theoretically establish minimax optimality guarantees for density estimation in function spaces, showing that diffusion-based estimators achieve optimal convergence rates under suitable regularity conditions. We demonstrate the practical effectiveness of FunDiff across diverse applications in fluid dynamics and solid mechanics. Empirical results show that our method generates physically consistent samples with high fidelity to the target distribution and exhibits robustness to noisy and low-resolution data. Code and datasets are publicly available at this https URL.
[LG-11] SoK: Data Reconstruction Attacks Against Machine Learning Models: Definition, Metrics, and Benchmark USENIX-SECURITY
链接: https://arxiv.org/abs/2506.07888
作者: Rui Wen,Yiyong Liu,Michael Backes,Yang Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: To Appear in the 34th USENIX Security Symposium, August 13-15, 2025
Abstract:Data reconstruction attacks, which aim to recover the training dataset of a target model with limited access, have gained increasing attention in recent years. However, there is currently no consensus on a formal definition of data reconstruction attacks or appropriate evaluation metrics for measuring their quality. This lack of rigorous definitions and universal metrics has hindered further advancement in this field. In this paper, we address this issue in the vision domain by proposing a unified attack taxonomy and formal definitions of data reconstruction attacks. We first propose a set of quantitative evaluation metrics that consider important criteria such as quantifiability, consistency, precision, and diversity. Additionally, we leverage large language models (LLMs) as a substitute for human judgment, enabling visual evaluation with an emphasis on high-quality reconstructions. Using our proposed taxonomy and metrics, we present a unified framework for systematically evaluating the strengths and limitations of existing attacks and establishing a benchmark for future research. Empirical results, primarily from a memorization perspective, not only validate the effectiveness of our metrics but also offer valuable insights for designing new attacks.
[LG-12] Schauder Bases for C[0,1] Using ReLU, Softplus and Two Sigmoidal Functions
链接: https://arxiv.org/abs/2506.07884
作者: Anand Ganesh,Babhrubahan Bose,Anand Rajagopalan
类目: Machine Learning (cs.LG); Functional Analysis (math.FA)
*备注: 9 pages
Abstract:We construct four Schauder bases for the space C[0,1], one using ReLU functions, another using Softplus functions, and two more using sigmoidal versions of the ReLU and Softplus functions. This establishes the existence of a basis using these functions for the first time, and improves on the universal approximation property associated with them.
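经典 Faber–Schauder 基的"帽函数"恰好可以写成三个 ReLU 的线性组合,这是用 ReLU 构造 C[0,1] 上 Schauder 基的基本构件。下面的 numpy 片段验证这一恒等式(区间端点取值仅为演示):

```python
import numpy as np

def hat(x, a, m, b):
    """支撑在 [a, b] 上、在 m 处取 1 的帽函数,写成三个 ReLU 的线性组合。"""
    relu = lambda z: np.maximum(z, 0.0)
    return (relu(x - a) / (m - a)
            - relu(x - m) * (b - a) / ((m - a) * (b - m))
            + relu(x - b) / (b - m))

x = np.linspace(0, 1, 11)
print(np.round(hat(x, 0.25, 0.5, 0.75), 3))
# 在 0.5 处为 1,在 [0.25, 0.75] 之外恒为 0
```

在 [a, m] 上只有第一项非零,给出从 0 到 1 的线性上升;过 m 后第二项抵消多余斜率,过 b 后第三项把斜率归零,因此函数在支撑外恒为 0。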
[LG-13] Can Hessian-Based Insights Support Fault Diagnosis in Attention-based Models?
链接: https://arxiv.org/abs/2506.07871
作者: Sigma Jahan,Mohammad Masudur Rahman
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:As attention-based deep learning models scale in size and complexity, diagnosing their faults becomes increasingly challenging. In this work, we conduct an empirical study to evaluate the potential of Hessian-based analysis for diagnosing faults in attention-based models. Specifically, we use Hessian-derived insights to identify fragile regions (via curvature analysis) and parameter interdependencies (via parameter interaction analysis) within attention mechanisms. Through experiments on three diverse models (HAN, 3D-CNN, DistilBERT), we show that Hessian-based metrics can localize instability and pinpoint fault sources more effectively than gradients alone. Our empirical findings suggest that these metrics could significantly improve fault diagnosis in complex neural architectures, potentially improving software debugging practices.
[LG-14] Jarzynski Reweighting and Sampling Dynamics for Training Energy-Based Models: Theoretical Analysis of Different Transition Kernels
链接: https://arxiv.org/abs/2506.07843
作者: Davide Carbone
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Energy-Based Models (EBMs) provide a flexible framework for generative modeling, but their training remains theoretically challenging due to the need to approximate normalization constants and efficiently sample from complex, multi-modal distributions. Traditional methods, such as contrastive divergence and score matching, introduce biases that can hinder accurate learning. In this work, we present a theoretical analysis of Jarzynski reweighting, a technique from non-equilibrium statistical mechanics, and its implications for training EBMs. We focus on the role of the choice of the kernel and we illustrate these theoretical considerations in two key generative frameworks: (i) flow-based diffusion models, where we reinterpret Jarzynski reweighting in the context of stochastic interpolants to mitigate discretization errors and improve sample quality, and (ii) Restricted Boltzmann Machines, where we analyze its role in correcting the biases of contrastive divergence. Our results provide insights into the interplay between kernel choice and model performance, highlighting the potential of Jarzynski reweighting as a principled tool for generative learning.
[LG-15] Clustered Federated Learning via Embedding Distributions
链接: https://arxiv.org/abs/2506.07769
作者: Dekai Zhang,Matthew Williams,Francesca Toni
类目: Machine Learning (cs.LG)
*备注: 24 pages
Abstract:Federated learning (FL) is a widely used framework for machine learning in distributed data environments where clients hold data that cannot be easily centralised, such as for data protection reasons. FL, however, is known to be vulnerable to non-IID data. Clustered FL addresses this issue by finding more homogeneous clusters of clients. We propose a novel one-shot clustering method, EMD-CFL, using the Earth Mover’s distance (EMD) between data distributions in embedding space. We theoretically motivate the use of EMDs using results from the domain adaptation literature and demonstrate empirically superior clustering performance in extensive comparisons against 16 baselines and on a range of challenging datasets.
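下面示意 EMD-CFL 的一次性聚类骨架:用一维 Wasserstein 距离(即 EMD)构造客户端两两距离矩阵,再对预计算距离做层次聚类。客户端嵌入此处用一维高斯样本代替(论文在嵌入空间中比较分布);注意 sklearn >= 1.2 的参数名为 metric,旧版为 affinity:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# 8 个客户端,各持一组一维嵌入样本:前 4 个与后 4 个来自不同分布
clients = [rng.normal(0, 1, 500) for _ in range(4)] + \
          [rng.normal(3, 1, 500) for _ in range(4)]

m = len(clients)
D = np.zeros((m, m))
for i in range(m):
    for j in range(i + 1, m):
        D[i, j] = D[j, i] = wasserstein_distance(clients[i], clients[j])

labels = AgglomerativeClustering(n_clusters=2, metric="precomputed",
                                 linkage="average").fit_predict(D)
print(labels)  # 前 4 个与后 4 个客户端应分属两簇
```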
[LG-16] Profiling Electric Vehicles via Early Charging Voltage Patterns
链接: https://arxiv.org/abs/2506.07714
作者: Francesco Marchiori,Denis Donadel,Alessandro Brighente,Mauro Conti
类目: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: Accepted to be presented at the AICPSS Workshop in conjunction with ARES 2025
Abstract:Electric Vehicles (EVs) are rapidly gaining adoption as a sustainable alternative to fuel-powered vehicles, making secure charging infrastructure essential. Despite traditional authentication protocols, recent results showed that attackers may steal energy through tailored relay attacks. One countermeasure is leveraging the EV’s fingerprint on the current exchanged during charging. However, existing methods focus on the final charging stage, allowing malicious actors to consume substantial energy before being detected and repudiated. This underscores the need for earlier and more effective authentication methods to prevent unauthorized charging. Meanwhile, profiling raises privacy concerns, as uniquely identifying EVs through charging patterns could enable user tracking. In this paper, we propose a framework for uniquely identifying EVs using physical measurements from the early charging stages. We hypothesize that voltage behavior early in the process exhibits similar characteristics to current behavior in later stages. By extracting features from early voltage measurements, we demonstrate the feasibility of EV profiling. Our approach improves existing methods by enabling faster and more reliable vehicle identification. We test our solution on a dataset of 7408 usable charges from 49 EVs, achieving up to 0.86 accuracy. Feature importance analysis shows that near-optimal performance is possible with just 10 key features, improving efficiency alongside our lightweight models. This research lays the foundation for a novel authentication factor while exposing potential privacy risks from unauthorized access to charging data.
[LG-17] Evaluating Robustness in Latent Diffusion Models via Embedding Level Augmentation
链接: https://arxiv.org/abs/2506.07706
作者: Boris Martirosyan,Alexey Karmanov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Latent diffusion models (LDMs) achieve state-of-the-art performance across various tasks, including image generation and video synthesis. However, they generally lack robustness, a limitation that remains not fully explored in current research. In this paper, we propose several methods to address this gap. First, we hypothesize that the robustness of LDMs should primarily be measured without their text encoder, because if we examine the whole architecture, the problems of the image generator and text encoder will be conflated. Second, we introduce novel data augmentation techniques designed to reveal robustness shortcomings in LDMs when processing diverse textual prompts. We then fine-tune Stable Diffusion 3 and Stable Diffusion XL models using Dreambooth, incorporating these proposed augmentation methods across multiple tasks. Finally, we propose a novel evaluation pipeline specifically tailored to assess the robustness of LDMs fine-tuned via Dreambooth.
[LG-18] Towards a Small Language Model Lifecycle Framework
链接: https://arxiv.org/abs/2506.07695
作者: Parsa Miraghaei,Sergio Moreschini,Antti Kolehmainen,David Hästbacka
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Background: The growing demand for efficient and deployable language models has led to increased interest in Small Language Models (SLMs). However, existing research remains fragmented, lacking a unified lifecycle perspective. Objective: This study aims to define a comprehensive lifecycle framework for SLMs by synthesizing insights from academic literature and practitioner sources. Method: We conducted a comprehensive survey of 36 works, analyzing and categorizing lifecycle-relevant techniques. Results: We propose a modular lifecycle model structured into main, optional, and cross-cutting components. The model captures key interconnections across stages, supporting method reuse, co-adaptation, and lifecycle-awareness. Conclusion: Our framework provides a coherent foundation for developing and maintaining SLMs, bridging theory and practice, and guiding future research and tool development.
[LG-19] How Benchmark Prediction from Fewer Data Misses the Mark
链接: https://arxiv.org/abs/2506.07673
作者: Guanhua Zhang,Florian E. Dorner,Moritz Hardt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset of evaluation points and predict overall benchmark performance from that subset. In this paper, we systematically assess the strengths and limitations of 11 benchmark prediction methods across 19 diverse benchmarks. First, we identify a highly competitive baseline: Take a random sample and fit a regression model on the sample to predict missing entries. Outperforming most existing methods, this baseline challenges the assumption that careful subset selection is necessary for benchmark prediction. Second, we discover that all existing methods crucially depend on model similarity. They work best when interpolating scores among similar models. The effectiveness of benchmark prediction sharply declines when new models have higher accuracy than previously seen models. In this setting of extrapolation, none of the previous methods consistently beat a simple average over random samples. To improve over the sample average, we introduce a new method inspired by augmented inverse propensity weighting. This method consistently outperforms the random sample average even for extrapolation. However, its performance still relies on model similarity and the gains are modest in general. This shows that benchmark prediction fails just when it is most needed: at the evaluation frontier, where the goal is to evaluate new models of unknown capabilities.
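摘要中那条"很强的基线"可以用几行代码写出:随机抽取一小部分题目评测新模型,并以已有模型的逐题得分为特征回归补全其余题目,再对全部逐题得分取平均。下面是合成数据上的 Python 示意,得分的生成方式(技能 × 难度)为演示假设:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_models, n_items = 20, 500
# 已有模型在整个基准上的逐题得分(0/1),此处随机生成作演示
skill = rng.uniform(0.3, 0.9, n_models)
difficulty = rng.uniform(0, 1, n_items)
scores = (rng.uniform(size=(n_models, n_items))
          < skill[:, None] * (1 - 0.5 * difficulty)).astype(float)
# 新模型的真实逐题得分(实际评测中只在抽样题目上可见)
new_true = (rng.uniform(size=n_items) < 0.8 * (1 - 0.5 * difficulty)).astype(float)

# 基线:随机抽 50 题评测新模型,用旧模型的逐题得分作特征回归补全其余题
sample = rng.choice(n_items, 50, replace=False)
rest = np.setdiff1d(np.arange(n_items), sample)
reg = Ridge().fit(scores[:, sample].T, new_true[sample])
pred = reg.predict(scores[:, rest].T)
est = np.concatenate([new_true[sample], pred]).mean()
print(f"估计得分 {est:.3f} vs 真实得分 {new_true.mean():.3f}")
```

正如摘要所指出的,这类方法依赖新旧模型之间的相似性;当新模型的能力超出已见模型(外推场景)时,其优势会迅速消失。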
[LG-20] ProARD: progressive adversarial robustness distillation: provide wide range of robust students
链接: https://arxiv.org/abs/2506.07666
作者: Seyedhamidreza Mousavi,Seyedali Mousavi,Masoud Daneshtalab
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adversarial Robustness Distillation (ARD) has emerged as an effective method to enhance the robustness of lightweight deep neural networks against adversarial attacks. Current ARD approaches have leveraged a large robust teacher network to train one robust lightweight student. However, due to the diverse range of edge devices and resource constraints, current approaches require training a new student network from scratch to meet specific constraints, leading to substantial computational costs and increased CO2 emissions. This paper proposes Progressive Adversarial Robustness Distillation (ProARD), enabling the efficient one-time training of a dynamic network that supports a diverse range of accurate and robust student networks without requiring retraining. We first make a dynamic deep neural network based on dynamic layers by encompassing variations in width, depth, and expansion in each design stage to support a wide range of architectures. Then, we consider the student network with the largest size as the dynamic teacher network. ProARD trains this dynamic network using a weight-sharing mechanism to jointly optimize the dynamic teacher network and its internal student networks. However, due to the high computational cost of calculating exact gradients for all the students within the dynamic network, a sampling mechanism is required to select a subset of students. We show that random student sampling in each iteration fails to produce accurate and robust students.
[LG-21] he Universality Lens: Why Even Highly Over-Parametrized Models Learn Well
链接: https://arxiv.org/abs/2506.07661
作者: Meir Feder,Ruediger Urbanke,Yaniv Fogel
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
Abstract:A fundamental question in modern machine learning is why large, over-parameterized models, such as deep neural networks and transformers, tend to generalize well, even when their number of parameters far exceeds the number of training samples. We investigate this phenomenon through the lens of information theory, grounded in universal learning theory. Specifically, we study a Bayesian mixture learner with log-loss and (almost) uniform prior over an expansive hypothesis class. Our key result shows that the learner’s regret is not determined by the overall size of the hypothesis class, but rather by the cumulative probability of all models that are close, in Kullback-Leibler divergence distance, to the true data-generating process. We refer to this cumulative probability as the weight of the hypothesis. This leads to a natural notion of model simplicity: simple models are those with large weight and thus require fewer samples to generalize, while complex models have small weight and need more data. This perspective provides a rigorous and intuitive explanation for why over-parameterized models often avoid overfitting: the presence of simple hypotheses allows the posterior to concentrate on them when supported by the data. We further bridge theory and practice by recalling that stochastic gradient descent with Langevin dynamics samples from the correct posterior distribution, enabling our theoretical learner to be approximated using standard machine learning methods combined with ensemble learning. Our analysis yields non-uniform regret bounds and aligns with key practical concepts such as flat minima and model distillation. The results apply broadly across online, batch, and supervised learning settings, offering a unified and principled understanding of the generalization behavior of modern AI systems.
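The regret-versus-weight relationship described above can be written compactly. The following is a hedged sketch of the standard universal-coding form of such a bound; the paper's exact statement, constants, and conditions may differ:

```latex
% Sketch under assumptions: Bayesian mixture learner with prior w over a
% hypothesis class, log-loss, n i.i.d. samples from the true source P^*.
% The cumulative regret is controlled by the prior weight of the set of
% models KL-close to P^*, optimized over the radius \epsilon:
\[
  \mathrm{Regret}_n \;\le\; \inf_{\epsilon > 0}
  \Big[ -\log w\big(\{ Q : D_{\mathrm{KL}}(P^* \,\|\, Q) \le \epsilon \}\big)
        \;+\; n\,\epsilon \Big].
\]
% Large weight (simple models) => small regret => fewer samples to generalize.
```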
[LG-22] Return of ChebNet: Understanding and Improving an Overlooked GNN on Long Range Tasks
链接: https://arxiv.org/abs/2506.07624
作者: Ali Hariri,Álvaro Arroyo,Alessio Gravina,Moshe Eliasof,Carola-Bibiane Schönlieb,Davide Bacciu,Kamyar Azizzadenesheli,Xiaowen Dong,Pierre Vandergheynst
类目: Machine Learning (cs.LG)
*备注:
Abstract:ChebNet, one of the earliest spectral GNNs, has largely been overshadowed by Message Passing Neural Networks (MPNNs), which gained popularity for their simplicity and effectiveness in capturing local graph structure. Despite their success, MPNNs are limited in their ability to capture long-range dependencies between nodes. This has led researchers to adapt MPNNs through rewiring or to make use of Graph Transformers, which compromises the computational efficiency that characterized early spatial message-passing architectures and typically disregards the graph structure. Almost a decade after its original introduction, we revisit ChebNet to shed light on its ability to model distant node interactions. We find that, out of the box, ChebNet already shows competitive advantages relative to classical MPNNs and GTs on long-range benchmarks, while maintaining good scalability properties for high-order polynomials. However, we uncover that this polynomial expansion leads ChebNet to an unstable regime during training. To address this limitation, we cast ChebNet as a stable and non-dissipative dynamical system, which we coin Stable-ChebNet. Our Stable-ChebNet model allows for stable information propagation, and has controllable dynamics which do not require the use of eigendecompositions, positional encodings, or graph rewiring. Across several benchmarks, Stable-ChebNet achieves near state-of-the-art performance.
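For readers who want the mechanics, here is a minimal numpy sketch of the classical Chebyshev spectral filter that ChebNet builds on; the toy graph and coefficients are assumptions, and Stable-ChebNet's dynamical-system stabilization is not reproduced:

```python
# Chebyshev polynomial graph filter: out = sum_k thetas[k] * T_k(L_norm) @ X.
import numpy as np

def cheb_filter(L_norm, X, thetas):
    """Apply the filter via the recurrence T_0 = I, T_1 = L, T_k = 2 L T_{k-1} - T_{k-2}.
    Assumes len(thetas) >= 2 and L_norm has spectrum in [-1, 1]."""
    Tx_prev, Tx = X, L_norm @ X
    out = thetas[0] * Tx_prev + thetas[1] * Tx
    for k in range(2, len(thetas)):
        Tx_prev, Tx = Tx, 2 * L_norm @ Tx - Tx_prev
        out += thetas[k] * Tx
    return out

# Toy path graph on 3 nodes; rescale the Laplacian so its spectrum lies in [-1, 1].
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
L = np.diag(A.sum(1)) - A
L_norm = 2 * L / np.linalg.eigvalsh(L).max() - np.eye(3)
print(cheb_filter(L_norm, np.eye(3), thetas=[0.5, 0.3, 0.2]))
```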
[LG-23] The Catechol Benchmark: Time-series Solvent Selection Data for Few-shot Machine Learning
链接: https://arxiv.org/abs/2506.07619
作者: Toby Boyne,Juan S. Campos,Becky D. Langdon,Jixiang Qing,Yilin Xie,Shiqiang Zhang,Calvin Tsay,Ruth Misener,Daniel W. Davies,Kim E. Jelfs,Sarah Boyall,Thomas M. Dixon,Linden Schrecker,Jose Pablo Folch
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Machine learning has promised to change the landscape of laboratory chemistry, with impressive results in molecular property prediction and reaction retro-synthesis. However, chemical datasets are often inaccessible to the machine learning community as they tend to require cleaning, thorough understanding of the chemistry, or are simply not available. In this paper, we introduce a novel dataset for yield prediction, providing the first-ever transient flow dataset for machine learning benchmarking, covering over 1200 process conditions. While previous datasets focus on discrete parameters, our experimental set-up allows us to sample a large number of continuous process conditions, generating new challenges for machine learning models. We focus on solvent selection, a task that is particularly difficult to model theoretically and therefore ripe for machine learning applications. We showcase benchmarking for regression algorithms, transfer-learning approaches, feature engineering, and active learning, with important applications towards solvent replacement and sustainable manufacturing.
[LG-24] FuXi-Air: Urban Air Quality Forecasting Based on Emission-Meteorology-Pollutant multimodal Machine Learning
链接: https://arxiv.org/abs/2506.07616
作者: Zhixin Geng,Xu Fan,Xiqiao Lu,Yan Zhang,Guangyuan Yu,Cheng Huang,Qian Wang,Yuewu Li,Weichun Ma,Qi Yu,Libo Wu,Hao Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Air pollution has emerged as a major public health challenge in megacities. Numerical simulations and single-site machine learning approaches have been widely applied in air quality forecasting tasks. However, these methods face multiple limitations, including high computational costs, low operational efficiency, and limited integration with observational data. With the rapid advancement of artificial intelligence, there is an urgent need to develop a low-cost, efficient air quality forecasting model for smart urban management. In this study, an air quality forecasting model named FuXi-Air has been constructed based on multimodal data fusion to support high-precision air quality forecasting, and it has been operated in typical megacities. The model integrates meteorological forecasts, emission inventories, and pollutant monitoring data under the guidance of air pollution mechanisms. By combining an autoregressive prediction framework with a frame interpolation strategy, the model successfully completes 72-hour forecasts for six major air pollutants at an hourly resolution across multiple monitoring sites within 25-30 seconds. In terms of both computational efficiency and forecasting accuracy, it outperforms mainstream numerical air quality models in operational forecasting work. Ablation experiments concerning key influencing factors show that although meteorological data contribute more to model accuracy than emission inventories do, the integration of multimodal data significantly improves forecasting precision and ensures that reliable predictions are obtained under differing pollution mechanisms across megacities. This study provides both a technical reference and a practical example for applying multimodal data-driven models to air quality forecasting and offers new insights into building hybrid forecasting systems to support air pollution risk warning in smart city management.
[LG-25] TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems
链接: https://arxiv.org/abs/2506.07605
作者: Marco Di Gennaro,Giovanni De Lucia,Stefano Longari,Stefano Zanero,Michele Carminati
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Proceedings on Privacy Enhancing Technologies (To appear) 2025(4)
Abstract:Federated Learning has emerged as a privacy-oriented alternative to centralized Machine Learning, enabling collaborative model training without direct data sharing. While extensively studied for neural networks, the security and privacy implications of tree-based models remain underexplored. This work introduces TimberStrike, an optimization-based dataset reconstruction attack targeting horizontally federated tree-based models. Our attack, carried out by a single client, exploits the discrete nature of decision trees by using split values and decision paths to infer sensitive training data from other clients. We evaluate TimberStrike on State-of-the-Art federated gradient boosting implementations across multiple frameworks, including Flower, NVFlare, and FedTree, demonstrating their vulnerability to privacy breaches. On a publicly available stroke prediction dataset, TimberStrike consistently reconstructs between 73.05% and 95.63% of the target dataset across all implementations. We further analyze Differential Privacy, showing that while it partially mitigates the attack, it also significantly degrades model performance. Our findings highlight the need for privacy-preserving mechanisms specifically designed for tree-based Federated Learning systems, and we provide preliminary insights into their design.
[LG-26] TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts USENIX-SECURITY2025
链接: https://arxiv.org/abs/2506.07596
作者: Torsten Krauß,Hamid Dashtbani,Alexandra Dmitrienko
类目: Machine Learning (cs.LG)
*备注: 26 pages, 25 tables, 13 figures, 2 algorithms, to appear in the 43th USENIX Security Symposium (USENIX Security 2025)
Abstract:Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, they can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an embedded backdoor, TwinBreak identifies and prunes parameters responsible for this functionality. By focusing on the most relevant model layers, TwinBreak performs fine-grained analysis of parameters essential to model utility and safety. TwinBreak is the first method to analyze intermediate outputs from prompts with high structural and content similarity to isolate safety parameters. We present the TwinPrompt dataset containing 100 such twin prompts. Experiments confirm TwinBreak’s effectiveness, achieving 89% to 98% success rates with minimal computational requirements across 16 LLMs from five vendors.
[LG-27] Exploiting Curvature in Online Convex Optimization with Delayed Feedback
链接: https://arxiv.org/abs/2506.07595
作者: Hao Qiu,Emmanuel Esposito,Mengxiao Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:In this work, we study the online convex optimization problem with curved losses and delayed feedback. When losses are strongly convex, existing approaches obtain regret bounds of order $d_{\max} \ln T$, where $d_{\max}$ is the maximum delay and $T$ is the time horizon. However, in many cases, this guarantee can be much worse than $\sqrt{d_{\mathrm{tot}}}$ as obtained by a delayed version of online gradient descent, where $d_{\mathrm{tot}}$ is the total delay. We bridge this gap by proposing a variant of follow-the-regularized-leader that obtains regret of order $\min\{\sigma_{\max}\ln T, \sqrt{d_{\mathrm{tot}}}\}$, where $\sigma_{\max}$ is the maximum number of missing observations. We then consider exp-concave losses and extend the Online Newton Step algorithm to handle delays with an adaptive learning rate tuning, achieving regret $\min\{d_{\max} n \ln T, \sqrt{d_{\mathrm{tot}}}\}$ where $n$ is the dimension. To our knowledge, this is the first algorithm to achieve such a regret bound for exp-concave losses. We further consider the problem of unconstrained online linear regression and achieve a similar guarantee by designing a variant of the Vovk-Azoury-Warmuth forecaster with a clipping trick. Finally, we implement our algorithms and conduct experiments under various types of delay and losses, showing an improved performance over existing methods.
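To make the baseline concrete, here is a toy sketch of online gradient descent with delayed feedback, the delayed-OGD baseline the paper improves on; the losses, constant delay, and step size are illustrative assumptions:

```python
# Delayed online gradient descent on toy quadratic losses f_t(x) = ||x - z_t||^2 / 2.
import numpy as np

T, d, delay = 100, 5, 3
rng = np.random.default_rng(1)
targets = rng.normal(size=(T, d))
x = np.zeros(d)
buffer = []                                 # (arrival_round, gradient) pairs

for t in range(T):
    grad = x - targets[t]                   # gradient evaluated at the current iterate
    buffer.append((t + delay, grad))        # feedback arrives `delay` rounds later
    arrived = [g for (s, g) in buffer if s == t]
    for g in arrived:
        x -= (1.0 / np.sqrt(t + 1)) * g     # delayed OGD update with decaying step size
print(x[:3])
```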
[LG-28] Aircraft Trajectory Dataset Augmentation in Latent Space
链接: https://arxiv.org/abs/2506.07585
作者: Seokbin Yoon,Keumjin Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:Aircraft trajectory modeling plays a crucial role in Air Traffic Management (ATM) and is important for various downstream tasks, including conflict detection and landing time prediction. Dataset augmentation through the addition of synthetically generated trajectory data is necessary to develop a more robust aircraft trajectory model and ensure that the trajectory dataset is sufficient and balanced. In this work, we propose a novel framework called ATRADA for aircraft trajectory dataset augmentation. In the proposed framework, a Transformer encoder learns the underlying patterns in the original trajectory dataset and converts each data point into a context vector in the learned latent space. The converted dataset in the latent space is projected into reduced dimensions using principal component analysis (PCA), and a Gaussian mixture model (GMM) is applied to fit the probability distribution of the data points in the reduced-dimensional space. Finally, new samples are drawn from the fitted GMM, the dimension of the samples is reverted to the original dimension, and they are decoded with a Multi-Layer Perceptron (MLP). Several experiments demonstrate that the framework effectively generates new, high-quality synthetic aircraft trajectory data, which we compare against the results of several baselines.
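The augmentation pipeline is concrete enough to sketch with scikit-learn. In the toy example below, random vectors stand in for the Transformer encoder's context vectors, and the final MLP decoding step is only noted in a comment; both substitutions are assumptions:

```python
# Hedged sketch of the latent-space augmentation pipeline (PCA -> GMM -> sample -> invert).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 64))            # stand-in for encoder context vectors

pca = PCA(n_components=8).fit(latents)          # project to reduced dimensions
z = pca.transform(latents)

gmm = GaussianMixture(n_components=4, random_state=0).fit(z)
z_new, _ = gmm.sample(100)                      # draw new samples from the fitted GMM
latents_new = pca.inverse_transform(z_new)      # revert to the original dimension
print(latents_new.shape)                        # the real pipeline decodes these with an MLP
```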
[LG-29] MIRA: Medical Time Series Foundation Model for Real-World Health Data
链接: https://arxiv.org/abs/2506.07584
作者: Hao Li,Bowen Deng,Chang Xu,Zhiyuan Feng,Viktor Schlegel,Yu-Hao Huang,Yizheng Sun,Jingyuan Sun,Kailai Yang,Yiyao Yu,Jiang Bian
类目: Machine Learning (cs.LG)
*备注:
Abstract:A unified foundation model for medical time series – pretrained on open access and ethics board-approved medical corpora – offers the potential to reduce annotation burdens, minimize model customization, and enable robust transfer across clinical institutions, modalities, and tasks, particularly in data-scarce or privacy-constrained environments. However, existing generalist time series foundation models struggle to handle medical time series data due to their inherent challenges, including irregular intervals, heterogeneous sampling rates, and frequent missing values. To address these challenges, we introduce MIRA, a unified foundation model specifically designed for medical time series forecasting. MIRA incorporates a Continuous-Time Rotary Positional Encoding that enables fine-grained modeling of variable time intervals, a frequency-specific mixture-of-experts layer that routes computation across latent frequency regimes to further promote temporal specialization, and a Continuous Dynamics Extrapolation Block based on Neural ODE that models the continuous trajectory of latent states, enabling accurate forecasting at arbitrary target timestamps. Pretrained on a large-scale and diverse medical corpus comprising over 454 billion time points collected from publicly available datasets, MIRA achieves reductions in forecasting errors by an average of 10% and 7% in out-of-distribution and in-distribution scenarios, respectively, when compared to other zero-shot and fine-tuned baselines. We also introduce a comprehensive benchmark spanning multiple downstream clinical tasks, establishing a foundation for future research in medical time series modeling.
[LG-30] Improving Memory Efficiency for Training KANs via Meta Learning ICML2025
链接: https://arxiv.org/abs/2506.07549
作者: Zhangchi Zhao,Jun Shu,Deyu Meng,Zongben Xu
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:Inspired by the Kolmogorov-Arnold representation theorem, KANs offer a novel framework for function approximation by replacing traditional neural network weights with learnable univariate functions. This design demonstrates significant potential as an efficient and interpretable alternative to traditional MLPs. However, KANs are characterized by a substantially larger number of trainable parameters, leading to challenges in memory efficiency and higher training costs compared to MLPs. To address this limitation, we propose to generate weights for KANs via a smaller meta-learner, called MetaKANs. By training KANs and MetaKANs in an end-to-end differentiable manner, MetaKANs achieve comparable or even superior performance while significantly reducing the number of trainable parameters and maintaining promising interpretability. Extensive experiments on diverse benchmark tasks, including symbolic regression, partial differential equation solving, and image classification, demonstrate the effectiveness of MetaKANs in improving parameter efficiency and memory usage. The proposed method provides an alternative technique for training KANs that allows for greater scalability and extensibility, and narrows the training cost gap with MLPs noted in the original KAN paper. Our code is available at this https URL.
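The core weight-generation idea can be sketched with a tiny hypernetwork in PyTorch; the KAN-specific spline parameterization is not reproduced, and all sizes and names here are assumptions:

```python
# A small meta-learner emits the parameters of a larger target layer from
# learnable embeddings, so only the meta-learner and embeddings are trained.
import torch
import torch.nn as nn

class MetaWeightGenerator(nn.Module):
    def __init__(self, n_units, emb_dim=16, out_dim=64):
        super().__init__()
        self.emb = nn.Parameter(torch.randn(n_units, emb_dim))
        self.net = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(),
                                 nn.Linear(32, out_dim))

    def forward(self):
        return self.net(self.emb)           # one parameter vector per target unit

gen = MetaWeightGenerator(n_units=128)
weights = gen()                              # (128, 64) generated target parameters
loss = weights.pow(2).mean()                 # stand-in task loss
loss.backward()                              # end-to-end differentiable training
print(sum(p.numel() for p in gen.parameters()))  # fewer trainable params than 128*64
```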
[LG-31] Flowing Datasets with Wasserstein over Wasserstein Gradient Flows ICML2025
链接: https://arxiv.org/abs/2506.07534
作者: Clément Bonet,Christophe Vauthier,Anna Korba
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Accepted as an oral at ICML2025
Abstract:Many applications in machine learning involve data represented as probability distributions. The emergence of such data requires radically novel techniques to design tractable gradient flows on probability distributions over this type of (infinite-dimensional) objects. For instance, being able to flow labeled datasets is a core task for applications ranging from domain adaptation to transfer learning or dataset distillation. In this setting, we propose to represent each class by the associated conditional distribution of features, and to model the dataset as a mixture distribution supported on these classes (which are themselves probability distributions), meaning that labeled datasets can be seen as probability distributions over probability distributions. We endow this space with a metric structure from optimal transport, namely the Wasserstein over Wasserstein (WoW) distance, derive a differential structure on this space, and define WoW gradient flows. The latter enables the design of dynamics over this space that decrease a given objective functional. We apply our framework to transfer learning and dataset distillation tasks, leveraging our gradient flow construction as well as novel tractable functionals that take the form of Maximum Mean Discrepancies with Sliced-Wasserstein based kernels between probability distributions.
[LG-32] Addressing Correlated Latent Exogenous Variables in Debiased Recommender Systems KDD'25
链接: https://arxiv.org/abs/2506.07517
作者: Shuqiang Zhang,Yuchao Zhang,Jinkun Chen,Haochen Sui
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '25), August 3–7, 2025, Toronto, ON, Canada
Abstract:Recommendation systems (RS) aim to provide personalized content, but they face a challenge in unbiased learning due to selection bias, where users only interact with items they prefer. This bias leads to a distorted representation of user preferences, which hinders the accuracy and fairness of recommendations. To address the issue, various methods such as error-imputation-based approaches, inverse propensity scoring, and doubly robust techniques have been developed. Despite the progress, from the structural causal model perspective, previous debiasing methods in RS assume the independence of the exogenous variables. In this paper, we relax this assumption and propose a learning algorithm based on likelihood maximization to learn a prediction model. We first discuss the connections and differences between unmeasured confounding and our scenario, then we propose a unified method that effectively handles latent exogenous variables. Specifically, our method models the data generation process with latent exogenous variables under mild normality assumptions. We then develop a Monte Carlo algorithm to numerically estimate the likelihood function. Extensive experiments on synthetic datasets and three real-world datasets demonstrate the effectiveness of our proposed method. The code is at this https URL.
[LG-33] Mind the Gap: Removing the Discretization Gap in Differentiable Logic Gate Networks
链接: https://arxiv.org/abs/2506.07500
作者: Shakir Yousefi,Andreas Plesner,Till Aczel,Roger Wattenhofer
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注:
Abstract:Modern neural networks demonstrate state-of-the-art performance on numerous existing benchmarks; however, their high computational requirements and energy consumption prompt researchers to seek more efficient solutions for real-world deployment. Logic gate networks (LGNs) learn a large network of logic gates for efficient image classification. However, training a network that can solve even a simple problem like CIFAR-10 can take days to weeks. Even then, almost half of the network remains unused, causing a discretization gap. This discretization gap hinders real-world deployment of LGNs, as the performance drop between training and inference negatively impacts accuracy. We inject Gumbel noise with a straight-through estimator during training to significantly speed up training, improve neuron utilization, and decrease the discretization gap. We theoretically show that this results from implicit Hessian regularization, which improves the convergence properties of LGNs. We train networks 4.5× faster in wall-clock time, reduce the discretization gap by 98%, and reduce the number of unused gates by 100%.
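The Gumbel-noise-with-straight-through-estimator trick is standard enough to illustrate. The sketch below applies it to a soft choice over candidate gates; the full logic-gate network is omitted and all shapes are assumptions:

```python
# Gumbel noise + straight-through estimator over a discrete gate choice.
import torch

def gumbel_st_choice(logits, tau=1.0):
    g = -torch.log(-torch.log(torch.rand_like(logits)))   # Gumbel(0, 1) noise
    soft = torch.softmax((logits + g) / tau, dim=-1)
    hard = torch.nn.functional.one_hot(soft.argmax(-1), logits.shape[-1]).float()
    return hard + soft - soft.detach()                     # hard forward, soft backward

logits = torch.randn(4, 16, requires_grad=True)            # 16 candidate gates per neuron
choice = gumbel_st_choice(logits)                          # discrete in the forward pass
choice.sum().backward()                                    # gradients flow through `soft`
print(logits.grad.shape)
```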
[LG-34] Explicit Preference Optimization: No Need for an Implicit Reward Model
链接: https://arxiv.org/abs/2506.07492
作者: Xiangkun Hu,Lemin Kong,Tong He,David Wipf
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: arXiv admin note: substantial text overlap with arXiv:2407.09072
Abstract:The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To this end, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to achieve an implicit reward. Quite differently, we merely posit intuitively-appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy of EXPO.
[LG-35] Circumventing Backdoor Space via Weight Symmetry
链接: https://arxiv.org/abs/2506.07467
作者: Jie Peng,Hongwei Yang,Jing Zhao,Hengji Dong,Hui He,Weizhe Zhang,Haoyu He
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Notably, recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities. Our code is available at this https URL.
[LG-36] ProteinZero: Self-Improving Protein Generation via Online Reinforcement Learning
链接: https://arxiv.org/abs/2506.07459
作者: Ziwen Wang,Jiajun Fan,Ruihan Guo,Thao Nguyen,Heng Ji,Ge Liu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Protein generative models have shown remarkable promise in protein design but still face limitations in success rate, due to the scarcity of high-quality protein datasets for supervised pretraining. We present ProteinZero, a novel framework that enables scalable, automated, and continuous self-improvement of the inverse folding model through online reinforcement learning. To achieve computationally tractable online feedback, we introduce efficient proxy reward models based on ESM-fold and a novel rapid ddG predictor that significantly accelerates evaluation speed. ProteinZero employs a general RL framework balancing multi-reward maximization, KL-divergence from a reference model, and a novel protein-embedding level diversity regularization that prevents mode collapse while promoting higher sequence diversity. Through extensive experiments, we demonstrate that ProteinZero substantially outperforms existing methods across every key metric in protein design, achieving significant improvements in structural accuracy, designability, thermodynamic stability, and sequence diversity. Most impressively, ProteinZero reduces design failure rates by approximately 36%-48% compared to widely-used methods like ProteinMPNN, ESM-IF and InstructPLM, consistently achieving success rates exceeding 90% across diverse and complex protein folds. Notably, the entire RL run on CATH-4.3 can be done with a single 8-GPU node in under 3 days, including reward computation. Our work establishes a new paradigm for protein design where models evolve continuously from their own generated outputs, opening new possibilities for exploring the vast protein design space.
[LG-37] Federated In-Context Learning: Iterative Refinement for Improved Answer Quality ICML2025
链接: https://arxiv.org/abs/2506.07440
作者: Ruhan Wang,Zhiyong Wang,Chengkai Huang,Rui Wang,Tong Yu,Lina Yao,John C.S. Lui,Dongruo Zhou
类目: Machine Learning (cs.LG)
*备注: 27 pages, 16 figures. Accepted to ICML 2025
Abstract:For question-answering (QA) tasks, in-context learning (ICL) enables language models to generate responses without modifying their parameters by leveraging examples provided in the input. However, the effectiveness of ICL heavily depends on the availability of high-quality examples, which are often scarce due to data privacy constraints, annotation costs, and distribution disparities. A natural solution is to utilize examples stored on client devices, but existing approaches either require transmitting model parameters - incurring significant communication overhead - or fail to fully exploit local datasets, limiting their effectiveness. To address these challenges, we propose Federated In-Context Learning (Fed-ICL), a general framework that enhances ICL through an iterative, collaborative process. Fed-ICL progressively refines responses by leveraging multi-round interactions between clients and a central server, improving answer quality without the need to transmit model parameters. We establish theoretical guarantees for the convergence of Fed-ICL and conduct extensive experiments on standard QA benchmarks, demonstrating that our proposed approach achieves strong performance while maintaining low communication costs.
[LG-38] RiemannFormer: A Framework for Attention in Curved Spaces
链接: https://arxiv.org/abs/2506.07405
作者: Zhongping Ji
类目: Machine Learning (cs.LG)
*备注: 10 pages, 1 figure
Abstract:This research endeavors to offer insights into unlocking the further potential of transformer-based architectures. One of the primary motivations is to offer a geometric interpretation for the attention mechanism in transformers. In our framework, the attention mainly involves metric tensors, tangent spaces, inner products, and how they relate to each other. These quantities and structures at discrete positions are intricately interconnected via the parallel transport of tangent vectors. To make the learning process more efficient, we reduce the number of parameters through ingenious predefined configurations. Moreover, we introduce an explicit mechanism to highlight a neighborhood by attenuating the remote values, given that transformers inherently neglect local inductive bias. Experimental results demonstrate that our modules deliver significant performance improvements relative to the baseline. Further evaluation experiments on visual and large language models will follow.
[LG-39] Moment Alignment: Unifying Gradient and Hessian Matching for Domain Generalization UAI2025
链接: https://arxiv.org/abs/2506.07378
作者: Yuen Chen,Haozhe Si,Guojun Zhang,Han Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: UAI 2025
Abstract:Domain generalization (DG) seeks to develop models that generalize well to unseen target domains, addressing the prevalent issue of distribution shifts in real-world applications. One line of research in DG focuses on aligning domain-level gradients and Hessians to enhance generalization. However, existing methods are computationally inefficient and the underlying principles of these approaches are not well understood. In this paper, we develop the theory of moment alignment for DG. Grounded in transfer measure, a principled framework for quantifying generalizability between two domains, we first extend the definition of transfer measure to domain generalization that includes multiple source domains and establish a target error bound. Then, we prove that aligning derivatives across domains improves transfer measure both when the feature extractor induces an invariant optimal predictor across domains and when it does not. Notably, moment alignment provides a unifying understanding of Invariant Risk Minimization, gradient matching, and Hessian matching, three previously disconnected approaches to DG. We further connect feature moments and derivatives of the classifier head, and establish the duality between feature learning and classifier fitting. Building upon our theory, we introduce Closed-Form Moment Alignment (CMA), a novel DG algorithm that aligns domain-level gradients and Hessians in closed-form. Our method overcomes the computational inefficiencies of existing gradient and Hessian-based techniques by eliminating the need for repeated backpropagation or sampling-based Hessian estimation. We validate the efficacy of our approach through two sets of experiments: linear probing and full fine-tuning. CMA demonstrates superior performance in both settings compared to Empirical Risk Minimization and state-of-the-art algorithms.
[LG-40] MoE-GPS: Guidelines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing
链接: https://arxiv.org/abs/2506.07366
作者: Haiyue Ma,Zhixu Du,Yiran Chen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:In multi-GPU Mixture-of-Experts (MoE) networks, experts are distributed across different GPUs, which creates load imbalance as each expert processes a different number of tokens. Recent works improve MoE inference load balance by dynamically duplicating popular experts on more GPUs to process excessive tokens, which requires predicting the distribution before routing. In this paper, we discuss the tradeoffs among prediction strategies, accuracies, overheads, and end-to-end system performance. We propose MoE-GPS, a framework that guides the selection of the optimal predictor design under various system configurations, by quantifying the performance impact on system-level model runtime. Specifically, we advocate for Distribution-Only Prediction, a prediction strategy that only predicts the overall token distribution, which significantly reduces overhead compared to the traditional Token-to-Expert Prediction. On the Mixtral 8x7B MMLU dataset, MoE-GPS suggests Distribution-Only Prediction, which improves end-to-end inference performance by more than 23% compared with Token-to-Expert Prediction.
[LG-41] Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models
链接: https://arxiv.org/abs/2506.07334
作者: Haoyu Wang,Peihao Wang,Mufei Li,Shikun Liu,Siqi Miao,Zhangyang Wang,Pan Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model’s ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, ‘target’ segments selectively attend only to the KV-caches of their designated ‘source’ segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings. Code and the Graph-KV data are publicly available.
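The structural inductive bias reduces to a block attention mask. The toy construction below (segment sizes and source edges are assumptions) shows how 'target' segments see their own tokens causally plus the KV blocks of their designated 'source' segments:

```python
# Build a graph-structured block attention mask over concatenated text segments.
import numpy as np

seg_len = {0: 3, 1: 2, 2: 2}                   # tokens per text segment
edges = {2: [0, 1], 1: [0]}                    # target segment -> source segments

offsets, pos = {}, 0
for s, n in seg_len.items():
    offsets[s] = pos
    pos += n

mask = np.zeros((pos, pos), dtype=bool)
for s, n in seg_len.items():                   # tokens attend causally within a segment
    i = offsets[s]
    mask[i:i + n, i:i + n] = np.tril(np.ones((n, n), bool))
for tgt, srcs in edges.items():
    for src in srcs:                           # ...plus the full KV of source segments
        i, j = offsets[tgt], offsets[src]
        mask[i:i + seg_len[tgt], j:j + seg_len[src]] = True
print(mask.astype(int))
```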
[LG-42] Mobility-Aware Asynchronous Federated Learning with Dynamic Sparsification
链接: https://arxiv.org/abs/2506.07328
作者: Jintao Yan,Tan Chen,Yuxuan Sun,Zhaojun Nan,Sheng Zhou,Zhisheng Niu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Asynchronous Federated Learning (AFL) enables distributed model training across multiple mobile devices, allowing each device to independently update its local model without waiting for others. However, device mobility introduces intermittent connectivity, which necessitates gradient sparsification and leads to model staleness, jointly affecting AFL convergence. This paper develops a theoretical model to characterize the interplay among sparsification, model staleness and mobility-induced contact patterns, and their joint impact on AFL convergence. Based on the analysis, we propose a mobility-aware dynamic sparsification (MADS) algorithm that optimizes the sparsification degree based on contact time and model staleness. Closed-form solutions are derived, showing that under low-speed conditions, MADS increases the sparsification degree to enhance convergence, while under high-speed conditions, it reduces the sparsification degree to guarantee reliable uploads within limited contact time. Experimental results validate the theoretical findings. Compared with the state-of-the-art benchmarks, the MADS algorithm increases the image classification accuracy on the CIFAR-10 dataset by 8.76% and reduces the average displacement error in the Argoverse trajectory prediction dataset by 9.46%.
[LG-43] DEF: Diffusion-augmented Ensemble Forecasting
链接: https://arxiv.org/abs/2506.07324
作者: David Millard,Arielle Carr,Stéphane Gaudreault,Ali Baheri
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 26 pages, 20 plots, journal paper
Abstract:We present DEF (Diffusion-augmented Ensemble Forecasting), a novel approach for generating initial condition perturbations. Modern approaches to initial condition perturbations are primarily designed for numerical weather prediction (NWP) solvers, limiting their applicability in the rapidly growing field of machine learning for weather prediction. Consequently, stochastic models in this domain are often developed on a case-by-case basis. We demonstrate that a simple conditional diffusion model can (1) generate meaningful structured perturbations, (2) be applied iteratively, and (3) utilize a guidance term to intuitively control the level of perturbation. This method enables the transformation of any deterministic neural forecasting system into a stochastic one. With our stochastically extended systems, we show that the model accumulates less error over long-term forecasts while producing meaningful forecast distributions. We validate our approach on the 5.625° ERA5 reanalysis dataset, which comprises atmospheric and surface variables over a discretized global grid, spanning from the 1960s to the present. On this dataset, our method demonstrates improved predictive performance along with reasonable spread estimates.
[LG-44] PASS: Private Attributes Protection with Stochastic Data Substitution
链接: https://arxiv.org/abs/2506.07308
作者: Yizhuo Chen,Chun-Fu(Richard)Chen,Hsiang Hsu,Shaohan Hu,Tarek Abdelzaher
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The growing Machine Learning (ML) services require extensive collections of user data, which may inadvertently include people's private information irrelevant to the services. Various studies have been proposed to protect private attributes by removing them from the data while maintaining the utilities of the data for downstream tasks. Nevertheless, as we theoretically and empirically show in the paper, these methods exhibit severe vulnerabilities because of a common weakness rooted in their adversarial-training-based strategies. To overcome this limitation, we propose a novel approach, PASS, designed to stochastically substitute the original sample with another one according to certain probabilities, which is trained with a novel loss function soundly derived from an information-theoretic objective defined for utility-preserving private attribute protection. The comprehensive evaluation of PASS on various datasets of different modalities, including facial images, human activity sensory signals, and voice recording datasets, substantiates PASS's effectiveness and generalizability.
[LG-45] Towards Generalized Source Tracing for Codec-Based Deepfake Speech
链接: https://arxiv.org/abs/2506.07294
作者: Xuanjun Chen,I-Ming Lin,Lin Zhang,Haibin Wu,Hung-yi Lee,Jyh-Shing Roger Jang
类目: ound (cs.SD); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注: Submitted to IEEE ASRU 2025
Abstract:Recent attempts at source tracing for codec-based deepfake speech (CodecFake), generated by neural audio codec-based speech generation (CoSG) models, have exhibited suboptimal performance. However, how to train source tracing models using simulated CoSG data while maintaining strong performance on real CoSG-generated audio remains an open challenge. In this paper, we show that models trained solely on codec-resynthesized data tend to overfit to non-speech regions and struggle to generalize to unseen content. To mitigate these challenges, we introduce the Semantic-Acoustic Source Tracing Network (SASTNet), which jointly leverages Whisper for semantic feature encoding and Wav2vec2 with AudioMAE for acoustic feature encoding. Our proposed SASTNet achieves state-of-the-art performance on the CoSG test set of the CodecFake+ dataset, demonstrating its effectiveness for reliable source tracing.
[LG-46] EviNet: Evidential Reasoning Network for Resilient Graph Learning in the Open and Noisy Environments KDD2025
链接: https://arxiv.org/abs/2506.07288
作者: Weijie Guan,Haohui Wang,Jian Kang,Lihui Liu,Dawei Zhou
类目: Machine Learning (cs.LG)
*备注: KDD 2025
Abstract:Graph learning has been crucial to many real-world tasks, but it is often studied under a closed-world assumption, with all possible labels of data known a priori. To enable effective graph learning in an open and noisy environment, it is critical to inform model users when the model makes a wrong prediction on in-distribution data of a known class, i.e., misclassification detection, or when the model encounters out-of-distribution data from novel classes, i.e., out-of-distribution detection. This paper introduces Evidential Reasoning Network (EVINET), a framework that addresses these two challenges by integrating Beta embedding within a subjective logic framework. EVINET includes two key modules: Dissonance Reasoning for misclassification detection and Vacuity Reasoning for out-of-distribution detection. Extensive experiments demonstrate that EVINET outperforms state-of-the-art methods across multiple metrics in the tasks of in-distribution classification, misclassification detection, and out-of-distribution detection. EVINET demonstrates the necessity of uncertainty estimation and logical reasoning for misclassification detection and out-of-distribution detection and paves the way for open-world graph learning. Our code and data are available at this https URL.
[LG-47] Investigating the Relationship Between Physical Activity and Tailored Behavior Change Messaging: Connecting Contextual Bandit with Large Language Models
链接: https://arxiv.org/abs/2506.07275
作者: Haochen Song,Dominik Hofer,Rania Islambouli,Laura Hawkins,Ananya Bhattacharjee,Meredith Franklin,Joseph Jay Williams
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Applications (stat.AP)
*备注:
Abstract:Machine learning approaches, such as contextual multi-armed bandit (cMAB) algorithms, offer a promising strategy to reduce sedentary behavior by delivering personalized interventions to encourage physical activity. However, cMAB algorithms typically require large participant samples to learn effectively and may overlook key psychological factors that are not explicitly encoded in the model. In this study, we propose a hybrid approach that combines cMAB for selecting intervention types with large language models (LLMs) to personalize message content. We evaluate four intervention types: behavioral self-monitoring, gain-framed, loss-framed, and social comparison, each delivered as a motivational message aimed at increasing motivation for physical activity and daily step count. Message content is further personalized using dynamic contextual factors including daily fluctuations in self-efficacy, social influence, and regulatory focus. Over a seven-day trial, participants receive daily messages assigned by one of four models: cMAB alone, LLM alone, combined cMAB with LLM personalization (cMABxLLM), or equal randomization (RCT). Outcomes include daily step count and message acceptance, assessed via ecological momentary assessments (EMAs). We apply a causal inference framework to evaluate the effects of each model. Our findings offer new insights into the complementary roles of LLM-based personalization and cMAB adaptation in promoting physical activity through personalized behavioral messaging.
[LG-48] A Cramér-von Mises Approach to Incentivizing Truthful Data Sharing
链接: https://arxiv.org/abs/2506.07272
作者: Alex Clinton,Thomas Zeng,Yiding Chen,Xiaojin Zhu,Kirthevasan Kandasamy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern data marketplaces and data sharing consortia increasingly rely on incentive mechanisms to encourage agents to contribute data. However, schemes that reward agents based on the quantity of submitted data are vulnerable to manipulation, as agents may submit fabricated or low-quality data to inflate their rewards. Prior work has proposed comparing each agent’s data against others’ to promote honesty: when others contribute genuine data, the best way to minimize discrepancy is to do the same. Yet prior implementations of this idea rely on very strong assumptions about the data distribution (e.g. Gaussian), limiting their applicability. In this work, we develop reward mechanisms based on a novel, two-sample test inspired by the Cramér-von Mises statistic. Our methods strictly incentivize agents to submit more genuine data, while disincentivizing data fabrication and other types of untruthful reporting. We establish that truthful reporting constitutes a (possibly approximate) Nash equilibrium in both Bayesian and prior-agnostic settings. We theoretically instantiate our method in three canonical data sharing problems and show that it relaxes key assumptions made by prior work. Empirically, we demonstrate that our mechanism incentivizes truthful data sharing via simulations and on real-world language and image data.
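As an illustration of the statistic at the heart of the mechanism, the sketch below scores two submissions against pooled data with SciPy's two-sample Cramér-von Mises test; the synthetic data and the idea of mapping the statistic to a reward are illustrative assumptions, not the paper's exact mechanism:

```python
# Score an agent's submission against others' pooled data via a two-sample CvM test.
import numpy as np
from scipy.stats import cramervonmises_2samp

rng = np.random.default_rng(0)
others = rng.normal(0.0, 1.0, size=1000)        # pooled genuine data from other agents
honest = rng.normal(0.0, 1.0, size=200)         # truthful submission
fabricated = rng.normal(1.0, 0.5, size=200)     # fabricated submission

for name, sample in [("honest", honest), ("fabricated", fabricated)]:
    stat = cramervonmises_2samp(sample, others).statistic
    print(f"{name}: CvM statistic = {stat:.4f} (reward decreases as this grows)")
```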
[LG-49] Machine Learning-Based Self-Localization Using Internal Sensors for Automating Bulldozers
链接: https://arxiv.org/abs/2506.07271
作者: Hikaru Sawafuji,Ryota Ozaki,Takuto Motomura,Toyohisa Matsuda,Masanori Tojima,Kento Uchida,Shinichi Shirakawa
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Self-localization is an important technology for automating bulldozers. Conventional bulldozer self-localization systems rely on RTK-GNSS (Real Time Kinematic-Global Navigation Satellite Systems). However, RTK-GNSS signals are sometimes lost in certain mining conditions. Therefore, self-localization methods that do not depend on RTK-GNSS are required. In this paper, we propose a machine learning-based self-localization method for bulldozers. The proposed method consists of two steps: estimating local velocities using a machine learning model from internal sensors, and incorporating these estimates into an Extended Kalman Filter (EKF) for global localization. We also created a novel dataset for bulldozer odometry and conducted experiments across various driving scenarios, including slalom, excavation, and driving on slopes. The results demonstrated that the proposed self-localization method suppressed the accumulation of position errors compared to kinematics-based methods, especially when slip occurred. Furthermore, this study showed that bulldozer-specific sensors, such as blade position sensors and hydraulic pressure sensors, contributed to improving self-localization accuracy.
[LG-50] RADAR: Recall Augmentation through Deferred Asynchronous Retrieval
链接: https://arxiv.org/abs/2506.07261
作者: Amit Jaspal,Qian Dang,Ajantha Ramineni
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Modern large-scale recommender systems employ multi-stage ranking funnel (Retrieval, Pre-ranking, Ranking) to balance engagement and computational constraints (latency, CPU). However, the initial retrieval stage, often relying on efficient but less precise methods like K-Nearest Neighbors (KNN), struggles to effectively surface the most engaging items from billion-scale catalogs, particularly distinguishing highly relevant and engaging candidates from merely relevant ones. We introduce Recall Augmentation through Deferred Asynchronous Retrieval (RADAR), a novel framework that leverages asynchronous, offline computation to pre-rank a significantly larger candidate set for users using the full complexity ranking model. These top-ranked items are stored and utilized as a high-quality retrieval source during online inference, bypassing online retrieval and pre-ranking stages for these candidates. We demonstrate through offline experiments that RADAR significantly boosts recall (2X Recall@200 vs DNN retrieval baseline) by effectively combining a larger retrieved candidate set with a more powerful ranking model. Online A/B tests confirm a +0.8% lift in topline engagement metrics, validating RADAR as a practical and effective method to improve recommendation quality under strict online serving constraints.
[LG-51] A Stable Whitening Optimizer for Efficient Neural Network Training
链接: https://arxiv.org/abs/2506.07254
作者: Kevin Frans,Sergey Levine,Pieter Abbeel
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this work, we take an experimentally grounded look at neural network optimization. Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method. First, we find that naive Shampoo is prone to divergence when matrix-inverses are cached for long periods. We introduce an alternate bounded update combining a historical eigenbasis with instantaneous normalization, resulting in across-the-board stability and significantly lower computational requirements. Second, we adapt a shape-aware scaling to enable learning rate transfer across network width. Third, we find that high learning rates result in large parameter noise, and propose a simple iterate-averaging scheme which unblocks faster learning. To properly confirm these findings, we introduce a pointed Transformer training benchmark, considering three objectives (language modelling, image classification, and diffusion modelling) across different stages of training. On average, SPlus is able to reach the validation performance of Adam within 44% of the gradient steps and 62% of the wallclock time.
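Of the three fixes, iterate averaging is the easiest to illustrate. The PyTorch sketch below keeps an exponential moving average of the weights for evaluation; the averaging constant is an assumption, and SPlus's eigenbasis update and shape-aware scaling are not reproduced:

```python
# Iterate averaging: train with a large learning rate, evaluate the EMA of weights.
import torch

model = torch.nn.Linear(10, 1)
avg = {k: v.clone() for k, v in model.state_dict().items()}
opt = torch.optim.SGD(model.parameters(), lr=0.5)  # deliberately large lr

for step in range(100):
    x = torch.randn(32, 10)
    loss = (model(x) - x.sum(dim=1, keepdim=True)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        for k, v in model.state_dict().items():
            avg[k].mul_(0.99).add_(v, alpha=0.01)   # EMA of iterates

eval_model = torch.nn.Linear(10, 1)
eval_model.load_state_dict(avg)                      # evaluate the averaged weights
```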
[LG-52] Promoting Ensemble Diversity with Interactive Bayesian Distributional Robustness for Fine-tuning Foundation Models ICML2025
链接: https://arxiv.org/abs/2506.07247
作者: Ngoc-Quan Pham,Tuan Truong,Quyen Tran,Tan Nguyen,Dinh Phung,Trung Le
类目: Machine Learning (cs.LG)
*备注: ICML 2025 (Poster)
Abstract:We introduce Interactive Bayesian Distributional Robustness (IBDR), a novel Bayesian inference framework that allows modeling the interactions between particles, thereby enhancing ensemble quality through increased particle diversity. IBDR is grounded in a generalized theoretical framework that connects the distributional population loss with the approximate posterior, motivating a practical dual optimization procedure that enforces distributional robustness while fostering particle diversity. We evaluate IBDR's performance against various baseline methods using the VTAB-1K benchmark and the commonsense reasoning language task. The results consistently show that IBDR outperforms these baselines, underscoring its effectiveness in real-world applications.
[LG-53] VARSHAP: Addressing Global Dependency Problems in Explainable AI with Variance-Based Local Feature Attribution
链接: https://arxiv.org/abs/2506.07229
作者: Mateusz Gajewski,Mikołaj Morzy,Adam Karczmarz,Piotr Sankowski
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing feature attribution methods like SHAP often suffer from global dependence, failing to capture true local model behavior. This paper introduces VARSHAP, a novel model-agnostic local feature attribution method which uses the reduction of prediction variance as the key importance metric of features. Building upon Shapley value framework, VARSHAP satisfies the key Shapley axioms, but, unlike SHAP, is resilient to global data distribution shifts. Experiments on synthetic and real-world datasets demonstrate that VARSHAP outperforms popular methods such as KernelSHAP or LIME, both quantitatively and qualitatively.
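A toy version of variance-based attribution can be written in a few lines: define a coalition's value as the reduction in prediction variance once those features are pinned to the explained instance, then combine marginal contributions with the standard Shapley weights. Everything below (model, data, exact value function) is an illustrative assumption, not the VARSHAP implementation:

```python
# Variance-based Shapley attribution on a 3-feature toy model.
import numpy as np
from itertools import combinations
from math import factorial

rng = np.random.default_rng(0)
f = lambda X: X[:, 0] * X[:, 1] + X[:, 2]          # toy model
x = np.array([1.0, 2.0, 0.5])                      # instance to explain
background = rng.normal(size=(2000, 3))

def coalition_variance(S):
    X = background.copy()
    X[:, list(S)] = x[list(S)]                      # fix features in S to x's values
    return f(X).var()                               # remaining prediction variance

n = 3
phi = np.zeros(n)
for i in range(n):
    rest = [j for j in range(n) if j != i]
    for k in range(n):
        for S in combinations(rest, k):
            w = factorial(k) * factorial(n - k - 1) / factorial(n)
            phi[i] += w * (coalition_variance(S) - coalition_variance(S + (i,)))
print(phi)                                          # variance reduction per feature
```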
[LG-54] Audio synthesizer inversion in symmetric parameter spaces with approximately equivariant flow matching
链接: https://arxiv.org/abs/2506.07199
作者: Ben Hayes,Charalampos Saitis,György Fazekas
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
*备注: Accepted at ISMIR 2025
Abstract:Many audio synthesizers can produce the same signal given different parameter configurations, meaning the inversion from sound to parameters is an inherently ill-posed problem. We show that this is largely due to intrinsic symmetries of the synthesizer, and focus in particular on permutation invariance. First, we demonstrate on a synthetic task that regressing point estimates under permutation symmetry degrades performance, even when using a permutation-invariant loss function or symmetry-breaking heuristics. Then, viewing equivalent solutions as modes of a probability distribution, we show that a conditional generative model substantially improves performance. Further, acknowledging the invariance of the implicit parameter distribution, we find that performance is further improved by using a permutation equivariant continuous normalizing flow. To accommodate intricate symmetries in real synthesizers, we also propose a relaxed equivariance strategy that adaptively discovers relevant symmetries from data. Applying our method to Surge XT, a full-featured open source synthesizer used in real world audio production, we find our method outperforms regression and generative baselines across audio reconstruction metrics.
[LG-55] GGBall: Graph Generative Model on Poincaré Ball
链接: https://arxiv.org/abs/2506.07198
作者: Tianci Bu,Chuanrui Wang,Hao Ma,Haoren Zheng,Xin Lu,Tailin Wu
类目: Machine Learning (cs.LG)
*备注: 29 pages, 3 figures
Abstract:Generating graphs with hierarchical structures remains a fundamental challenge due to the limitations of Euclidean geometry in capturing exponential complexity. Here we introduce GGBall, a novel hyperbolic framework for graph generation that integrates geometric inductive biases with modern generative paradigms. GGBall combines a Hyperbolic Vector-Quantized Autoencoder (HVQVAE) with a Riemannian flow matching prior defined via closed-form geodesics. This design enables flow-based priors to model complex latent distributions, while vector quantization helps preserve the curvature-aware structure of the hyperbolic space. We further develop a suite of hyperbolic GNN and Transformer layers that operate entirely within the manifold, ensuring stability and scalability. Empirically, our model reduces degree MMD by over 75% on Community-Small and over 40% on Ego-Small compared to state-of-the-art baselines, demonstrating an improved ability to preserve topological hierarchies. These results highlight the potential of hyperbolic geometry as a powerful foundation for the generative modeling of complex, structured, and hierarchical data domains. Our code is available at this https URL.
[LG-56] Analyzing Breast Cancer Survival Disparities by Race and Demographic Location: A Survival Analysis Approach
链接: https://arxiv.org/abs/2506.07191
作者: Ramisa Farha,Joshua O. Olukoya
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:This study employs a robust analytical framework to uncover patterns in survival outcomes among breast cancer patients from diverse racial and geographical backgrounds. This research uses the SEER 2021 dataset to analyze breast cancer survival outcomes and to identify and understand disparities. Our approach integrates exploratory data analysis (EDA), through which we identify key variables that influence survival rates, and employs survival analysis techniques, including the Kaplan-Meier estimator, the log-rank test, and the Cox Proportional Hazards model, to determine how survival rates vary across racial groups and countries. Model validation and interpretation are undertaken to ensure the reliability of our findings, which are documented comprehensively to inform policymakers and healthcare professionals. The outcome of this paper is a detailed statistical analysis that not only highlights disparities in breast cancer treatment and care but also serves as a foundational tool for developing targeted interventions to address the inequalities effectively. Through this research, our aim is to contribute to the global efforts to improve breast cancer outcomes and reduce treatment disparities.
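The pipeline described (Kaplan-Meier curves, log-rank test, Cox proportional hazards) can be sketched with the lifelines library. The toy data and column names below are hypothetical stand-ins for SEER fields:

```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Hypothetical SEER-style fields: survival months, death indicator, race group.
df = pd.DataFrame({
    "months": [12, 34, 7, 58, 23, 45, 9, 60],
    "event":  [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = death observed, 0 = censored
    "race":   ["A", "A", "A", "A", "B", "B", "B", "B"],
    "age":    [55, 62, 48, 70, 59, 66, 51, 73],
})

# Kaplan-Meier survival curves per group
for group, sub in df.groupby("race"):
    kmf = KaplanMeierFitter().fit(sub["months"], event_observed=sub["event"], label=group)
    print(group, kmf.median_survival_time_)

# Log-rank test for a survival difference between the two groups
a, b = df[df.race == "A"], df[df.race == "B"]
res = logrank_test(a["months"], b["months"],
                   event_observed_A=a["event"], event_observed_B=b["event"])
print(res.p_value)

# Cox proportional hazards with race (one-hot) and age as covariates
cph = CoxPHFitter().fit(
    pd.get_dummies(df, columns=["race"], drop_first=True, dtype=float),
    duration_col="months", event_col="event")
cph.print_summary()
```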
[LG-57] Learning based on neurovectors for tabular data: a new neural network approach ICDM2025
链接: https://arxiv.org/abs/2506.07185
作者: J.C. Husillos,A. Gallego,A. Roma,A. Troncoso
类目: Machine Learning (cs.LG)
*备注: Submitted to 25th IEEE International Conference on Data Mining (ICDM 2025)
Abstract:In this paper, we present a novel learning approach based on Neurovectors, an innovative paradigm that structures information through interconnected nodes and vector relationships for tabular data processing. Unlike traditional artificial neural networks that rely on weight adjustment through backpropagation, Neurovectors encode information by structuring data in vector spaces where energy propagation, rather than traditional weight updates, drives the learning process, enabling a more adaptable and explainable learning process. Our method generates dynamic representations of knowledge through neurovectors, thereby improving both the interpretability and efficiency of the predictive model. Experimental results using datasets from well-established repositories such as the UCI machine learning repository and Kaggle are reported both for classification and regression. To evaluate its performance, we compare our approach with standard machine learning and deep learning models, showing that Neurovectors achieve competitive accuracy.
[LG-58] pFedSOP : Accelerating Training Of Personalized Federated Learning Using Second-Order Optimization
链接: https://arxiv.org/abs/2506.07159
作者: Mrinmay Sen,Chalavadi Krishna Mohan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Personalized Federated Learning (PFL) enables clients to collaboratively train personalized models tailored to their individual objectives, addressing the challenge of model generalization in traditional Federated Learning (FL) due to high data heterogeneity. However, existing PFL methods often require increased communication rounds to achieve the desired performance, primarily due to slow training caused by the use of first-order optimization, which has linear convergence. Additionally, many of these methods increase local computation because of the additional data fed into the model during the search for personalized local models. One promising solution to this slow training is second-order optimization, known for its quadratic convergence. However, employing it in PFL is challenging due to the Hessian matrix and its inverse. In this paper, we propose pFedSOP, which efficiently utilizes second-order optimization in PFL to accelerate the training of personalized models and enhance performance with fewer communication rounds. Our approach first computes a personalized local gradient update using the Gompertz function-based normalized angle between local and global gradient updates, incorporating client-specific global information. We then use a regularized Fisher Information Matrix (FIM), computed from this personalized gradient update, as an approximation of the Hessian to update the personalized models. This FIM-based second-order optimization speeds up training with fewer communication rounds by tackling the challenges with exact Hessian and avoids additional data being fed into the model during the search for personalized local models. Extensive experiments on heterogeneously partitioned image classification datasets with partial client participation demonstrate that pFedSOP outperforms state-of-the-art FL and PFL algorithms.
[LG-59] Pointwise confidence estimation in the non-linear \ell^2-regularized least squares
链接: https://arxiv.org/abs/2506.07088
作者: Ilja Kuzborskij,Yasin Abbasi Yadkori
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We consider a high-probability non-asymptotic confidence estimation in the \ell^2 -regularized non-linear least-squares setting with fixed design. In particular, we study confidence estimation for local minimizers of the regularized training loss. We show a pointwise confidence bound, meaning that it holds for the prediction on any given fixed test input x . Importantly, the proposed confidence bound scales with similarity of the test input to the training data in the implicit feature space of the predictor (for instance, becoming very large when the test input lies far outside of the training data). This desirable last feature is captured by the weighted norm involving the inverse-Hessian matrix of the objective function, which is a generalized version of its counterpart in the linear setting, x^\top \mathrm{Cov}^{-1} x . Our generalized result can be regarded as a non-asymptotic counterpart of the classical confidence interval based on asymptotic normality of the MLE estimator. We propose an efficient method for computing the weighted norm, which only mildly exceeds the cost of a gradient computation of the loss function. Finally, we complement our analysis with empirical evidence showing that the proposed confidence bound provides better coverage/width trade-off compared to a confidence estimation by bootstrapping, which is a gold-standard method in many applications involving non-linear predictors such as neural networks.
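In the linear special case the weighted norm reduces to x^\top (X^\top X + \lambda I)^{-1} x, which is straightforward to compute. A minimal sketch, with the scaling constants of the actual bound omitted:

```python
import numpy as np

def pointwise_confidence_width(X_train, x_test, lam=1.0):
    """Weighted-norm width sqrt(x^T H^{-1} x) in the linear special case,
    where H = X^T X + lam * I is the Hessian of the l2-regularized
    least-squares objective. For non-linear predictors the abstract replaces
    H by the Hessian of the training loss at the local minimizer; the
    constants of the actual confidence bound are omitted in this sketch.
    """
    d = X_train.shape[1]
    H = X_train.T @ X_train + lam * np.eye(d)
    v = np.linalg.solve(H, x_test)  # solve H v = x rather than inverting H
    return np.sqrt(x_test @ v)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
x_in = X.mean(axis=0)        # test input similar to the training data
x_out = 10.0 * np.ones(5)    # test input far from the training data
print(pointwise_confidence_width(X, x_in), pointwise_confidence_width(X, x_out))
# The width is much larger for the far-away input, as the abstract describes.
```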
[LG-60] State Entropy Regularization for Robust Reinforcement Learning
链接: https://arxiv.org/abs/2506.07085
作者: Uri Koren,Yonatan Ashlag,Mirco Mutti,Esther Derman,Pierre-Luc Bacon,Shie Mannor
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:State entropy regularization has empirically shown better exploration and sample complexity in reinforcement learning (RL). However, its theoretical guarantees have not been studied. In this paper, we show that state entropy regularization improves robustness to structured and spatially correlated perturbations. These types of variation are common in transfer learning but often overlooked by standard robust RL methods, which typically focus on small, uncorrelated changes. We provide a comprehensive characterization of these robustness properties, including formal guarantees under reward and transition uncertainty, as well as settings where the method performs poorly. Much of our analysis contrasts state entropy with the widely used policy entropy regularization, highlighting their different benefits. Finally, from a practical standpoint, we illustrate that compared with policy entropy, the robustness advantages of state entropy are more sensitive to the number of rollouts used for policy evaluation.
[LG-61] E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models
链接: https://arxiv.org/abs/2506.07078
作者: Jiaheng Dong,Hong Jia,Soumyajit Chatterjee,Abhirup Ghosh,James Bailey,Ting Dang
类目: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
*备注: Under Review
Abstract:Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which fundamentally differ from speech task formulations, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for a forward-pass-based feature alignment, (ii) a multi-scale loss to capture both global (utterance-level) and local (token-level) distribution shifts, and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for developing more efficient adaptation approaches for practical speech processing systems in real-world environments.
[LG-62] FairPFN: A Tabular Foundation Model for Causal Fairness
链接: https://arxiv.org/abs/2506.07049
作者: Jake Robertson,Noah Hollmann,Samuel Müller,Noor Awad,Frank Hutter
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:Machine learning (ML) systems are utilized in critical sectors, such as healthcare, law enforcement, and finance. However, these systems are often trained on historical data that contains demographic biases, leading to ML decisions that perpetuate or exacerbate existing social inequalities. Causal fairness provides a transparent, human-in-the-loop framework to mitigate algorithmic discrimination, aligning closely with legal doctrines of direct and indirect discrimination. However, current causal fairness frameworks hold a key limitation in that they assume prior knowledge of the correct causal model, restricting their applicability in complex fairness scenarios where causal models are unknown or difficult to identify. To bridge this gap, we propose FairPFN, a tabular foundation model pre-trained on synthetic causal fairness data to identify and mitigate the causal effects of protected attributes in its predictions. FairPFN’s key contribution is that it requires no knowledge of the causal model and still demonstrates strong performance in identifying and removing protected causal effects across a diverse set of hand-crafted and real-world scenarios relative to robust baseline methods. FairPFN paves the way for promising future research, making causal fairness more accessible to a wider variety of complex fairness problems.
[LG-63] Mixture Experts with Test-Time Self-Supervised Aggregation for Tabular Imbalanced Regression
链接: https://arxiv.org/abs/2506.07033
作者: Yung-Chien Wang,Kuang-Da Wang,Wei-Yao Wang,Wen-Chih Peng
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:Tabular data serve as a fundamental and ubiquitous representation of structured information in numerous real-world applications, e.g., finance and urban planning. In the realm of tabular imbalanced applications, data imbalance has been investigated in classification tasks with insufficient instances in certain labels, causing the model’s ineffective generalizability. However, the imbalance issue of tabular regression tasks is underexplored, and yet is critical due to unclear boundaries for continuous labels and simplifying assumptions in existing imbalance regression work, which often rely on known and balanced test distributions. Such assumptions may not hold in practice and can lead to performance degradation. To address these issues, we propose MATI: Mixture Experts with Test-Time Self-Supervised Aggregation for Tabular Imbalance Regression, featuring two key innovations: (i) the Region-Aware Mixture Expert, which adopts a Gaussian Mixture Model to capture the underlying related regions. The statistical information of each Gaussian component is then used to synthesize and train region-specific experts to capture the unique characteristics of their respective regions. (ii) Test-Time Self-Supervised Expert Aggregation, which dynamically adjusts region expert weights based on test data features to reinforce expert adaptation across varying test distributions. We evaluated MATI on four real-world tabular imbalance regression datasets, including house pricing, bike sharing, and age prediction. To reflect realistic deployment scenarios, we adopted three types of test distributions: a balanced distribution with uniform target frequencies, a normal distribution that follows the training data, and an inverse distribution that emphasizes rare target regions. On average across these three test distributions, MATI achieved a 7.1% improvement in MAE compared to existing methods.
[LG-64] Comparison of Lightweight Methods for Vehicle Dynamics-Based Driver Drowsiness Detection
链接: https://arxiv.org/abs/2506.07014
作者: Yutaro Nakagama,Daisuke Ishii,Kazuki Yoshizoe
类目: Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, to be published at IV 2025
Abstract:Driver drowsiness detection (DDD) prevents road accidents caused by driver fatigue. Vehicle dynamics-based DDD has been proposed as a method that is both economical and high performance. However, there are concerns about the reliability of performance metrics and the reproducibility of many of the existing methods. For instance, some previous studies seem to have a data leakage issue among training and test datasets, and many do not openly provide the datasets they used. To this end, this paper aims to compare the performance of representative vehicle dynamics-based DDD methods under a transparent and fair framework that uses a public dataset. We first develop a framework for extracting features from an open dataset by Aygun et al. and performing DDD with lightweight ML models; the framework is carefully designed to support a variety of configurations. Second, we implement three existing representative methods and a concise random forest (RF)-based method in the framework. Finally, we report the results of experiments to verify the reproducibility and clarify the performance of DDD based on common metrics. Among the evaluated methods, the RF-based method achieved the highest accuracy of 88%. Our findings highlight the issues inherent in DDD methods developed in a non-standard manner, and demonstrate a high-performance method implemented appropriately.
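A minimal sketch of such an RF-based pipeline, with hypothetical windowed features and a grouped split by driver to avoid the train/test leakage the paper highlights:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold
from sklearn.metrics import accuracy_score

# Hypothetical windowed features from steering/lane signals; the paper's exact
# feature set (extracted from the Aygun et al. dataset) is not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))            # e.g. steering-angle stats per window
y = rng.integers(0, 2, size=600)         # 0 = alert, 1 = drowsy
drivers = rng.integers(0, 10, size=600)  # driver id for leakage-free splits

# Grouping folds by driver keeps windows from the same driver out of both
# train and test, the leakage issue the paper warns about.
accs = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=drivers):
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[tr], y[tr])
    accs.append(accuracy_score(y[te], clf.predict(X[te])))
print(np.mean(accs))
```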
[LG-65] Modified K-means Algorithm with Local Optimality Guarantees ICML2025
链接: https://arxiv.org/abs/2506.06990
作者: Mingyi Li,Michael R. Metel,Akiko Takeda
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:The K-means algorithm is one of the most widely studied clustering algorithms in machine learning. While extensive research has focused on its ability to achieve a globally optimal solution, there still lacks a rigorous analysis of its local optimality guarantees. In this paper, we first present conditions under which the K-means algorithm converges to a locally optimal solution. Based on this, we propose simple modifications to the K-means algorithm which ensure local optimality in both the continuous and discrete sense, with the same computational complexity as the original K-means algorithm. As the dissimilarity measure, we consider a general Bregman divergence, which is an extension of the squared Euclidean distance often used in the K-means algorithm. Numerical experiments confirm that the K-means algorithm does not always find a locally optimal solution in practice, while our proposed methods provide improved locally optimal solutions with reduced clustering loss. Our code is available at this https URL.
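For context, Lloyd's algorithm with a general Bregman divergence changes only the assignment step: the optimal cluster representative remains the arithmetic mean for any Bregman divergence (Banerjee et al., 2005). A baseline sketch follows; the paper's local-optimality modifications are not reproduced:

```python
import numpy as np

def bregman_kmeans(X, k, div=None, iters=100, seed=0):
    """Lloyd's algorithm with a generic Bregman divergence as dissimilarity.
    Only the assignment step depends on the divergence; the update step is
    still the mean. This is the baseline the paper's modifications refine.
    """
    if div is None:  # squared Euclidean distance, itself a Bregman divergence
        div = lambda x, c: np.sum((x - c) ** 2, axis=-1)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest center under the chosen divergence.
        labels = np.argmin(div(X[:, None, :], centers[None, :, :]), axis=1)
        # Update step: arithmetic mean (keep old center if a cluster empties).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

centers, labels = bregman_kmeans(np.random.default_rng(1).normal(size=(300, 2)), k=3)
```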
[LG-66] Correcting for Position Bias in Learning to Rank: A Control Function Approach
链接: https://arxiv.org/abs/2506.06989
作者: Md Aminul Islam,Kathryn Vasilaky,Elena Zheleva
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Implicit feedback data, such as user clicks, is commonly used in learning-to-rank (LTR) systems because it is easy to collect and it often reflects user preferences. However, this data is prone to various biases, and training an LTR system directly on biased data can result in suboptimal ranking performance. One of the most prominent and well-studied biases in implicit feedback data is position bias, which occurs because users are more likely to interact with higher-ranked documents regardless of their true relevance. In this paper, we propose a novel control function-based method that accounts for position bias in a two-stage process. The first stage uses exogenous variation from the residuals of the ranking process to correct for position bias in the second stage click equation. Unlike previous position bias correction methods, our method does not require knowledge of the click or propensity model and allows for nonlinearity in the underlying ranking model. Moreover, our method is general and allows for debiasing any state-of-the-art ranking algorithm by plugging it into the second stage. We also introduce a technique to debias validation clicks for hyperparameter tuning to select the optimal model in the absence of unbiased validation data. Experimental results demonstrate that our method outperforms state-of-the-art approaches in correcting for position bias.
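The two-stage control-function logic can be illustrated on a linear toy problem: stage one regresses position on exogenous variation and keeps the residual, and stage two includes that residual in the click equation, absorbing the confounding. A sketch under these simplifying linear assumptions (the paper allows nonlinear rankers):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)            # exogenous variation in the ranking process
u = rng.normal(size=n)            # unobserved confounder (true relevance)
pos = 1.0 * z + u                 # position depends on both
click = -0.7 * pos + 2.0 * u + rng.normal(scale=0.1, size=n)

# Naive regression of clicks on position is biased by the confounder u.
naive = LinearRegression().fit(pos.reshape(-1, 1), click)

# Stage 1: predict position from the exogenous variable; keep the residual.
stage1 = LinearRegression().fit(z.reshape(-1, 1), pos)
ctrl = pos - stage1.predict(z.reshape(-1, 1))   # control function, ~ u here

# Stage 2: adding the control residual de-biases the position coefficient.
cf = LinearRegression().fit(np.column_stack([pos, ctrl]), click)
print(naive.coef_[0], cf.coef_[0])  # ~0.3 (biased) vs. ~-0.7 (true effect)
```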
[LG-67] Fully Explainable Classification Models Using Hyperblocks
链接: https://arxiv.org/abs/2506.06986
作者: Austin Snyder,Ryan Gallagher,Boris Kovalerchuk
类目: Machine Learning (cs.LG)
*备注: 7 pages, 8 figures, 6 tables
Abstract:Building on existing work with Hyperblocks, which classify data using minimum and maximum bounds for each attribute, we focus on enhancing interpretability, decreasing training time, and reducing model complexity without sacrificing accuracy. This system allows subject matter experts (SMEs) to directly inspect and understand the model’s decision logic without requiring extensive machine learning expertise. To reduce Hyperblock complexity while retaining performance, we introduce a suite of algorithms for Hyperblock simplification. These include removing redundant attributes, removing redundant blocks through overlap analysis, and creating disjunctive units. These methods eliminate unnecessary parameters, dramatically reducing model size without harming classification power. We increase robustness by introducing an interpretable fallback mechanism using k-Nearest Neighbor (k-NN) classifiers for points not covered by any block, ensuring complete data coverage while preserving model transparency. Our results demonstrate that interpretable models can scale to high-dimensional, large-volume datasets while maintaining competitive accuracy. On benchmark datasets such as WBC (9-D), we achieve strong predictive performance with significantly reduced complexity. On MNIST (784-D), our method continues to improve through tuning and simplification, showing promise as a transparent alternative to black-box models in domains where trust, clarity, and control are crucial.
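A minimal sketch of the core mechanism: one min/max hyperblock per class plus the k-NN fallback for uncovered or ambiguous points. The paper's simplification algorithms (redundant-attribute/block removal, disjunctive units) are omitted:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

class HyperblockClassifier:
    """Sketch: per-class [min, max] attribute bounds as one hyperblock per
    class, with a k-NN fallback for points covered by no block or by blocks
    of several classes, as in the paper's fallback mechanism.
    """
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.blocks_ = {c: (X[y == c].min(axis=0), X[y == c].max(axis=0))
                        for c in self.classes_}
        self.fallback_ = KNeighborsClassifier(n_neighbors=5).fit(X, y)
        return self

    def predict(self, X):
        out = np.empty(len(X), dtype=self.classes_.dtype)
        for i, x in enumerate(X):
            hits = [c for c, (lo, hi) in self.blocks_.items()
                    if np.all(x >= lo) and np.all(x <= hi)]
            # Unique block hit -> fully interpretable decision; else k-NN.
            out[i] = hits[0] if len(hits) == 1 else self.fallback_.predict([x])[0]
        return out

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(4, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
print((HyperblockClassifier().fit(X, y).predict(X) == y).mean())
```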
[LG-68] Certified Unlearning for Neural Networks
链接: https://arxiv.org/abs/2506.06985
作者: Anastasia Koloskova,Youssef Allouah,Animesh Jha,Rachid Guerraoui,Sanmi Koyejo
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注:
Abstract:We address the problem of machine unlearning, where the goal is to remove the influence of specific training data from a model upon request, motivated by privacy concerns and regulatory requirements such as the “right to be forgotten.” Unfortunately, existing methods rely on restrictive assumptions or lack formal guarantees. To this end, we propose a novel method for certified machine unlearning, leveraging the connection between unlearning and privacy amplification by stochastic post-processing. Our method uses noisy fine-tuning on the retain data, i.e., data that does not need to be removed, to ensure provable unlearning guarantees. This approach requires no assumptions about the underlying loss function, making it broadly applicable across diverse settings. We analyze the theoretical trade-offs in efficiency and accuracy and demonstrate empirically that our method not only achieves formal unlearning guarantees but also performs effectively in practice, outperforming existing baselines. Our code is available at this https URL
[LG-69] Near Optimal Non-asymptotic Sample Complexity of 1-Identification
链接: https://arxiv.org/abs/2506.06978
作者: Zitian Li,Wang Chi Cheung
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Motivated by an open direction in existing literature, we study the 1-identification problem, a fundamental multi-armed bandit formulation on pure exploration. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold \mu_0 , or to output None if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least 1-\delta . Degenne and Koolen (2019) established the asymptotically tight sample complexity for the 1-identification problem, but commented that the non-asymptotic analysis remains unclear. We design a new algorithm Sequential-Exploration-Exploitation (SEE), and conduct theoretical analysis from the non-asymptotic perspective. Novel to the literature, we achieve near optimality, in the sense of matching upper and lower bounds on the pulling complexity. The gap between the upper and lower bounds is up to a polynomial logarithmic factor. The numerical result also indicates the effectiveness of our algorithm, compared to existing benchmarks.
[LG-70] Safety-Aware Reinforcement Learning for Control via Risk-Sensitive Action-Value Iteration and Quantile Regression
链接: https://arxiv.org/abs/2506.06954
作者: Clinton Enwerem,Aniruddh G. Puranic,John S. Baras,Calin Belta
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 13 pages, 4 figures. Submission under review
Abstract:Mainstream approximate action-value iteration reinforcement learning (RL) algorithms suffer from overestimation bias, leading to suboptimal policies in high-variance stochastic environments. Quantile-based action-value iteration methods reduce this bias by learning a distribution of the expected cost-to-go using quantile regression. However, ensuring that the learned policy satisfies safety constraints remains a challenge when these constraints are not explicitly integrated into the RL framework. Existing methods often require complex neural architectures or manual tradeoffs due to combined cost functions. To address this, we propose a risk-regularized quantile-based algorithm integrating Conditional Value-at-Risk (CVaR) to enforce safety without complex architectures. We also provide theoretical guarantees on the contraction properties of the risk-sensitive distributional Bellman operator in Wasserstein space, ensuring convergence to a unique cost distribution. Simulations of a mobile robot in a dynamic reach-avoid task show that our approach leads to more goal successes, fewer collisions, and better safety-performance trade-offs compared to risk-neutral methods.
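The CVaR read-out from learned quantiles is simple to sketch. Assuming N equally spaced quantile estimates of a cost distribution (higher = worse), CVaR at level alpha is the mean of the worst alpha-fraction; the paper's risk-regularized update itself is not reproduced:

```python
import numpy as np

def cvar_from_quantiles(quantiles, alpha=0.1):
    """CVaR_alpha of the cost-to-go from N equally spaced quantile estimates:
    the average of the worst (highest-cost) alpha-fraction of quantiles.
    """
    q = np.sort(np.asarray(quantiles))
    k = max(1, int(np.ceil(alpha * len(q))))
    return q[-k:].mean()  # mean of the highest-cost tail

# Toy: 32 quantile estimates of a heavy-tailed cost distribution
rng = np.random.default_rng(0)
q_est = np.quantile(rng.lognormal(size=10_000), np.linspace(0.015, 0.985, 32))
print(q_est.mean(), cvar_from_quantiles(q_est, alpha=0.1))  # mean vs. tail risk
```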
[LG-71] Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More ICML2025
链接: https://arxiv.org/abs/2506.06940
作者: Geonhui Yoo,Minhak Song,Chulhee Yun
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:When training deep neural networks with gradient descent, sharpness often increases – a phenomenon known as progressive sharpening – before saturating at the edge of stability. Although commonly observed in practice, the underlying mechanisms behind progressive sharpening remain poorly understood. In this work, we study this phenomenon using a minimalist model: a deep linear network with a single neuron per layer. We show that this simple model effectively captures the sharpness dynamics observed in recent empirical studies, offering a simple testbed to better understand neural network training. Moreover, we theoretically analyze how dataset properties, network depth, stochasticity of optimizers, and step size affect the degree of progressive sharpening in the minimalist model. We then empirically demonstrate how these theoretical insights extend to practical scenarios. This study offers a deeper understanding of sharpness dynamics in neural network training, highlighting the interplay between depth, training data, and optimizers.
[LG-72] Basis Transformers for Multi-Task Tabular Regression
链接: https://arxiv.org/abs/2506.06926
作者: Wei Min Loh,Jiaqi Shang,Pascal Poupart
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dealing with tabular data is challenging due to partial information, noise, and heterogeneous structure. Existing techniques often struggle to simultaneously address key aspects of tabular data such as textual information, a variable number of columns, and unseen data without metadata besides column names. We propose a novel architecture, basis transformers, specifically designed to tackle these challenges while respecting inherent invariances in tabular data, including hierarchical structure and the representation of numeric values. We evaluate our design on a multi-task tabular regression benchmark, achieving an improvement of 0.338 in the median R^2 score and the lowest standard deviation across 34 tasks from the OpenML-CTR23 benchmark. Furthermore, our model has five times fewer parameters than the best-performing baseline and surpasses pretrained large language model baselines – even when initialized from randomized weights.
[LG-73] Scalable Gaussian Processes with Latent Kronecker Structure
链接: https://arxiv.org/abs/2506.06895
作者: Jihao Andreas Lin,Sebastian Ament,Maximilian Balandat,David Eriksson,José Miguel Hernández-Lobato,Eytan Bakshy
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: International Conference on Machine Learning 2025
Abstract:Applying Gaussian processes (GPs) to very large datasets remains a challenge due to limited computational scalability. Matrix structures, such as the Kronecker product, can accelerate operations significantly, but their application commonly entails approximations or unrealistic assumptions. In particular, the most common path to creating a Kronecker-structured kernel matrix is by evaluating a product kernel on gridded inputs that can be expressed as a Cartesian product. However, this structure is lost if any observation is missing, breaking the Cartesian product structure, which frequently occurs in real-world data such as time series. To address this limitation, we propose leveraging latent Kronecker structure, by expressing the kernel matrix of observed values as the projection of a latent Kronecker product. In combination with iterative linear system solvers and pathwise conditioning, our method facilitates inference of exact GPs while requiring substantially fewer computational resources than standard iterative methods. We demonstrate that our method outperforms state-of-the-art sparse and variational GPs on real-world datasets with up to five million examples, including robotics, automated machine learning, and climate applications.
[LG-74] Can In-Context Reinforcement Learning Recover From Reward Poisoning Attacks?
链接: https://arxiv.org/abs/2506.06891
作者: Paulius Sasnauskas,Yiğit Yalın,Goran Radanović
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained Decision-Pretrained Transformer (AT-DPT). Our method simultaneously trains an attacker to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that the proposed method significantly outperforms these baselines in bandit settings, under a learned attacker. We additionally evaluate AT-DPT on an adaptive attacker, and observe similar results. Furthermore, we extend our evaluation to the MDP setting, confirming that the robustness observed in bandit scenarios generalizes to more complex environments.
[LG-75] Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning ICML2025
链接: https://arxiv.org/abs/2506.06873
作者: Armin Behnamnia,Gholamali Aminian,Alireza Aghaei,Chengchun Shi,Vincent Y. F. Tan,Hamid R. Rabiee
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted as spotlight poster in ICML 2025
Abstract:Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator’s bias and variance. In the off-policy learning scenario, we establish bounds on the regret – the performance gap between our LSE estimator and the optimal policy – assuming a bounded (1+\epsilon)-th moment of the weighted reward. Notably, we achieve a convergence rate of O(n^{-\epsilon/(1+\epsilon)}) for the regret bounds, where \epsilon \in [0,1] and n is the size of the logged bandit feedback dataset. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: this https URL.
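One plausible form of an LSE estimator consistent with the abstract replaces the IPS mean with a tempered log-sum-exp that damps heavy-tailed weighted rewards; the paper's exact parameterization may differ from this sketch:

```python
import numpy as np

def lse_ope(rewards, weights, lam=-0.5):
    """Log-sum-exponential off-policy value estimate (assumed form).

    With lam < 0 the operator down-weights large weighted rewards, giving the
    variance reduction and heavy-tail robustness the abstract describes;
    lam -> 0 recovers the standard inverse-propensity-score mean.
    """
    z = lam * weights * rewards
    m = z.max()  # max-shift for numerical stability
    return (m + np.log(np.mean(np.exp(z - m)))) / lam

rng = np.random.default_rng(0)
w = rng.pareto(2.0, size=10_000) + 1.0       # heavy-tailed importance weights
r = rng.uniform(size=10_000)
print(np.mean(w * r), lse_ope(r, w, lam=-0.5))  # IPS mean vs. tempered LSE
```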
[LG-76] Differentially Private Sparse Linear Regression with Heavy-tailed Responses ECML2025
链接: https://arxiv.org/abs/2506.06861
作者: Xizhi Tian,Meng Ding,Touming Tao,Zihang Xiang,Di Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at ECML 2025
Abstract:As a fundamental problem in machine learning and differential privacy (DP), DP linear regression has been extensively studied. However, most existing methods focus primarily on either regular data distributions or low-dimensional cases with irregular data. To address these limitations, this paper provides a comprehensive study of DP sparse linear regression with heavy-tailed responses in high-dimensional settings. In the first part, we introduce the DP-IHT-H method, which leverages the Huber loss and private iterative hard thresholding to achieve an estimation error bound of \tilde{O}\bigl( (s^*)^{\frac{1}{2}} \cdot (\frac{\log d}{n})^{\frac{\zeta}{1+\zeta}} + (s^*)^{\frac{1+2\zeta}{2+2\zeta}} \cdot (\frac{\log^2 d}{n\varepsilon})^{\frac{\zeta}{1+\zeta}} \bigr) under the (\varepsilon, \delta)-DP model, where n is the sample size, d is the dimensionality, s^* is the sparsity of the parameter, and \zeta \in (0, 1] characterizes the tail heaviness of the data. In the second part, we propose DP-IHT-L, which further improves the error bound under additional assumptions on the response and achieves \tilde{O}\bigl(\frac{(s^*)^{3/2} \log d}{n\varepsilon}\bigr). Compared to the first result, this bound is independent of the tail parameter \zeta . Finally, through experiments on synthetic and real-world datasets, we demonstrate that our methods outperform standard DP algorithms designed for ``regular’’ data.
[LG-77] Curvature Enhanced Data Augmentation for Regression ICML2025
链接: https://arxiv.org/abs/2506.06853
作者: Ilya Kaufman Sirot,Omri Azencot
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted to ICML 2025
Abstract:Deep learning models with a large number of parameters, often referred to as over-parameterized models, have achieved exceptional performance across various tasks. Despite concerns about overfitting, these models frequently generalize well to unseen data, thanks to effective regularization techniques, with data augmentation being among the most widely used. While data augmentation has shown great success in classification tasks using label-preserving transformations, its application in regression problems has received less attention. Recently, a novel manifold learning approach for generating synthetic data was proposed, utilizing a first-order approximation of the data manifold. Building on this foundation, we present a theoretical framework and practical tools for approximating and sampling general data manifolds. Furthermore, we introduce the Curvature-Enhanced Manifold Sampling (CEMS) method for regression tasks. CEMS leverages a second-order representation of the data manifold to enable efficient sampling and reconstruction of new data points. Extensive evaluations across multiple datasets and comparisons with state-of-the-art methods demonstrate that CEMS delivers superior performance in both in-distribution and out-of-distribution scenarios, while introducing only minimal computational overhead. Code is available at this https URL.
[LG-78] ASPO: Constraint-Aware Bayesian Optimization for FPGA-based Soft Processors
链接: https://arxiv.org/abs/2506.06817
作者: Haoran Wu,Ce Guo,Wayne Luk,Robert Mullins
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Performance (cs.PF)
*备注: Accepted to International Conference on Field-Programmable Logic and Applications (FPL) 2025
Abstract:Bayesian Optimization (BO) has shown promise in tuning processor design parameters. However, standard BO does not support constraints involving categorical parameters such as types of branch predictors and division circuits. In addition, optimization time of BO grows with processor complexity, which becomes increasingly significant especially for FPGA-based soft processors. This paper introduces ASPO, an approach that leverages disjunctive form to enable BO to handle constraints involving categorical parameters. Unlike existing methods that directly apply standard BO, the proposed ASPO method, for the first time, customizes the mathematical mechanism of BO to address challenges faced by soft-processor designs on FPGAs. Specifically, ASPO supports categorical parameters using a novel customized BO covariance kernel. It also accelerates the design evaluation procedure by penalizing the BO acquisition function with potential evaluation time and by reusing FPGA synthesis checkpoints from previously evaluated configurations. ASPO targets three soft processors: RocketChip, BOOM, and EL2 VeeR. The approach is evaluated based on seven RISC-V benchmarks. Results show that ASPO can reduce execution time for the ``multiply’’ benchmark on the BOOM processor by up to 35% compared to the default configuration. Furthermore, it reduces design time for the BOOM processor by up to 74% compared to Boomerang, a state-of-the-art hardware-oriented BO approach.
[LG-79] Path Integral Optimiser: Global Optimisation via Neural Schrödinger-Föllmer Diffusion NEURIPS2024
链接: https://arxiv.org/abs/2506.06815
作者: Max McGuinness,Eirik Fladmark,Francisco Vargas
类目: Machine Learning (cs.LG)
*备注: 6 pages. Presented at the OPT Workshop, NeurIPS 2024, Vancouver, CA
Abstract:We present an early investigation into the use of neural diffusion processes for global optimisation, focusing on Zhang et al.'s Path Integral Sampler. One can use the Boltzmann distribution to formulate optimization as solving a Schrödinger bridge sampling problem, then apply Girsanov’s theorem with a simple (single-point) prior to frame it in stochastic control terms, and compute the solution’s integral terms via a neural approximation (a Fourier MLP). We provide theoretical bounds for this optimiser, results on toy optimisation tasks, and a summary of the stochastic theory motivating the model. Ultimately, we found the optimiser to display promising per-step performance at optimisation tasks between 2 and 1,247 dimensions, but struggle to explore higher-dimensional spaces when faced with a 15.9k parameter model, indicating a need for work on adaptation in such environments.
[LG-80] FuncGNN: Learning Functional Semantics of Logic Circuits with Graph Neural Networks
链接: https://arxiv.org/abs/2506.06787
作者: Qiyun Zhao
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:As integrated circuit scale grows and design complexity rises, effective circuit representation helps support logic synthesis, formal verification, and other automated processes in electronic design automation. And-Inverter Graphs (AIGs), as a compact and canonical structure, are widely adopted for representing Boolean logic in these workflows. However, the increasing complexity and integration density of modern circuits introduce structural heterogeneity and global logic information loss in AIGs, posing significant challenges to accurate circuit modeling. To address these issues, we propose FuncGNN, which integrates hybrid feature aggregation to extract multi-granularity topological patterns, thereby mitigating structural heterogeneity and enhancing logic circuit representations. FuncGNN further introduces gate-aware normalization that adapts to circuit-specific gate distributions, improving robustness to structural heterogeneity. Finally, FuncGNN employs multi-layer integration to merge intermediate features across layers, effectively synthesizing local and global semantic information for comprehensive logic representations. Experimental results on two logic-level analysis tasks (i.e., signal probability prediction and truth-table distance prediction) demonstrate that FuncGNN outperforms existing state-of-the-art methods, achieving improvements of 2.06% and 18.71%, respectively, while reducing training time by approximately 50.6% and GPU memory usage by about 32.8%.
[LG-81] Caterpillar GNN: Replacing Message Passing with Efficient Aggregation
链接: https://arxiv.org/abs/2506.06784
作者: Marek Černý
类目: Machine Learning (cs.LG)
*备注: 40 pages, 9 figures, 3 tables
Abstract:Message-passing graph neural networks (MPGNNs) dominate modern graph learning, typically prioritizing maximal expressive power. In contrast, we introduce an efficient aggregation mechanism, deliberately trading off some expressivity for stronger and more structured aggregation capabilities. Our approach allows seamless scaling between classical message-passing and simpler methods based on colored or plain walks. We rigorously characterize the expressive power at each intermediate step using homomorphism counts from a hierarchy of generalized caterpillar graphs. Based on this foundation, we propose the Caterpillar GNN, whose robust graph-level aggregation enables it to successfully tackle a synthetic graph-level task specifically designed to be challenging for classical MPGNNs. Moreover, we demonstrate that, on real-world datasets, the Caterpillar GNN achieves comparable predictive performance while significantly reducing the number of nodes in the hidden layers of the computational graph.
[LG-82] Taming Wild Branches: Overcoming Hard-to-Predict Branches using the Bullseye Predictor ISCA2025
链接: https://arxiv.org/abs/2506.06773
作者: Emet Behrendt,Shing Wai Pun,Prashant J. Nair
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
*备注: Paper accepted and presented at the 6th Championship Branch Prediction (CBP) workshop, co-held with ISCA 2025, on June 21, 2025, Tokyo, Japan
Abstract:Branch prediction is key to the performance of out-of-order processors. While the CBP-2016 winner TAGE-SC-L combines geometric-history tables, a statistical corrector, and a loop predictor, over half of its remaining mispredictions stem from a small set of hard-to-predict (H2P) branches. These branches occur under diverse global histories, causing repeated thrashing in TAGE and eviction before usefulness counters can mature. Prior work shows that simply enlarging the tables offers only marginal improvement. We augment a 159 KB TAGE-SC-L predictor with a 28 KB H2P-targeted subsystem called the Bullseye predictor. It identifies problematic PCs using a set-associative H2P Identification Table (HIT) and steers them to one of two branch-specific perceptrons, one indexed by hashed local history and the other by folded global history. A short trial phase tracks head-to-head accuracy in an H2P cache. A branch becomes perceptron-resident only if the perceptron’s sustained accuracy and output magnitude exceed dynamic thresholds, after which TAGE updates for that PC are suppressed to reduce pollution. The HIT, cache, and perceptron operate fully in parallel with TAGE-SC-L, providing higher fidelity on the H2P tail. This achieves an average MPKI of 3.4045 and CycWpPKI of 145.09.
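The perceptron component builds on the classic perceptron branch predictor (Jimenez & Lin), sketched below over a single global history register; the Bullseye subsystem uses hashed-local and folded-global variants, and the table size and threshold here are illustrative rather than the paper's:

```python
import numpy as np

class PerceptronPredictor:
    """Classic perceptron branch predictor: a per-PC weight row dotted with
    the global history register; train on a misprediction or a low margin.
    """
    def __init__(self, n_entries=1024, hist_len=32):
        self.w = np.zeros((n_entries, hist_len + 1), dtype=np.int32)
        self.theta = int(1.93 * hist_len + 14)  # standard training threshold
        self.hist = np.ones(hist_len, dtype=np.int32)  # +1 taken / -1 not taken

    def _row(self, pc):
        return pc % len(self.w)

    def predict(self, pc):
        row = self._row(pc)
        y = self.w[row, 0] + self.w[row, 1:] @ self.hist  # bias + history dot
        return y, y >= 0                                  # margin, taken?

    def update(self, pc, taken):
        y, pred = self.predict(pc)
        t = 1 if taken else -1
        if (pred != taken) or abs(y) <= self.theta:  # miss or low confidence
            row = self._row(pc)
            self.w[row, 0] += t
            self.w[row, 1:] += t * self.hist
        self.hist = np.roll(self.hist, 1)            # shift in the new outcome
        self.hist[0] = t
```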
[LG-83] A Framework for Controllable Multi-objective Learning with Annealed Stein Variational Hypernetworks
链接: https://arxiv.org/abs/2506.06715
作者: Minh-Duc Nguyen,Dung D. Le
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Paper is under review
Abstract:Pareto Set Learning (PSL) is popular as an efficient approach to obtaining the complete optimal solution in Multi-objective Learning (MOL). A set of optimal solutions approximates the Pareto set, and its mapping is a set of dense points in the Pareto front in objective space. However, some current methods face a challenge: how to keep the Pareto solutions diverse while maximizing the hypervolume value. In this paper, we propose a novel method to address this challenge, which employs Stein Variational Gradient Descent (SVGD) to approximate the entire Pareto set. SVGD pushes a set of particles towards the Pareto set by applying a form of functional gradient descent, which helps optimal solutions converge and diversify. Additionally, we employ diverse gradient direction strategies to thoroughly investigate a unified framework for SVGD in multi-objective optimization and adapt this framework with an annealing schedule to promote stability. We introduce our method, SVH-MOL, and validate its effectiveness through extensive experiments on multi-objective problems and multi-task learning, demonstrating its superior performance.
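The SVGD update itself has a standard closed form: each particle moves along a kernel-weighted average of the particles' log-density gradients plus a repulsive kernel-gradient term that maintains diversity. A sketch with an RBF kernel; the annealing schedule and direction strategies of SVH-MOL are omitted:

```python
import numpy as np

def svgd_step(x, grad_logp, h=0.5, eps=0.1):
    """One SVGD update on particles x (n x d): the kernel term pulls particles
    toward high density, the grad-kernel term pushes them apart.
    """
    n = len(x)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)  # pairwise sq dists
    K = np.exp(-d2 / (2 * h))                                   # RBF kernel matrix
    # grad of k(x_j, x_i) w.r.t. x_j, summed over j for each target i
    grad_K = (x[:, None, :] - x[None, :, :]) * K[:, :, None] / h
    phi = (K @ grad_logp(x) + grad_K.sum(axis=1)) / n
    return x + eps * phi

# Toy: push particles toward a standard Gaussian while staying spread out
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, size=(100, 2))
for _ in range(300):
    x = svgd_step(x, grad_logp=lambda x: -x)  # grad log N(0, I)
print(x.mean(axis=0), x.std(axis=0))          # ~0 mean, ~1 std
```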
[LG-84] Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning
链接: https://arxiv.org/abs/2506.06694
作者: Yuan Yuan,Yukun Liu,Chonghua Han,Jie Feng,Yong Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models.
[LG-85] Learning Robust Heterogeneous Graph Representations via Contrastive-Reconstruction under Sparse Semantics
链接: https://arxiv.org/abs/2506.06682
作者: Di Lin,Wanjing Ren,Xuanbin Li,Rui Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:In graph self-supervised learning, masked autoencoders (MAE) and contrastive learning (CL) are two prominent paradigms. MAE focuses on reconstructing masked elements, while CL maximizes similarity between augmented graph views. Recent studies highlight their complementarity: MAE excels at local feature capture, and CL at global information extraction. Hybrid frameworks for homogeneous graphs have been proposed, but face challenges in designing shared encoders to meet the semantic requirements of both tasks. In semantically sparse scenarios, CL struggles with view construction, and gradient imbalance between positive and negative samples persists. This paper introduces HetCRF, a novel dual-channel self-supervised learning framework for heterogeneous graphs. HetCRF uses a two-stage aggregation strategy to adapt embedding semantics, making it compatible with both MAE and CL. To address semantic sparsity, it enhances encoder output for view construction instead of relying on raw features, improving efficiency. Two positive sample augmentation strategies are also proposed to balance gradient contributions. Node classification experiments on four real-world heterogeneous graph datasets demonstrate that HetCRF outperforms state-of-the-art baselines. On datasets with missing node features, such as Aminer and Freebase, at a 40% label rate in node classification, HetCRF improves the Macro-F1 score by 2.75% and 2.2% respectively compared to the second-best baseline, validating its effectiveness and superiority.
[LG-86] Through the Gaps: Uncovering Tactical Line-Breaking Passes with Clustering
链接: https://arxiv.org/abs/2506.06666
作者: Oktay Karakuş,Hasan Arkadaş
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 12 pages and 5 figures
Abstract:Line-breaking passes (LBPs) are crucial tactical actions in football, allowing teams to penetrate defensive lines and access high-value spaces. In this study, we present an unsupervised, clustering-based framework for detecting and analysing LBPs using synchronised event and tracking data from elite matches. Our approach models opponent team shape through vertical spatial segmentation and identifies passes that disrupt defensive lines within open play. Beyond detection, we introduce several tactical metrics, including the space build-up ratio (SBR) and two chain-based variants, LBPCh^1 and LBPCh^2, which quantify the effectiveness of LBPs in generating immediate or sustained attacking threats. We evaluate these metrics across teams and players in the 2022 FIFA World Cup, revealing stylistic differences in vertical progression and structural disruption. The proposed methodology is explainable, scalable, and directly applicable to modern performance analysis and scouting workflows.
[LG-87] SDP-CROWN: Efficient Bound Propagation for Neural Network Verification with Tightness of Semidefinite Programming ICML2025
链接: https://arxiv.org/abs/2506.06665
作者: Hong-Ming Chiu,Hao Chen,Huan Zhang,Richard Y. Zhang
类目: Machine Learning (cs.LG)
*备注: ICML 2025
Abstract:Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound, derived via SDP principles, that explicitly captures \ell_2-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of \sqrt{n} tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art \alpha-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
[LG-88] Rescaled Influence Functions: Accurate Data Attribution in High Dimension
链接: https://arxiv.org/abs/2506.06656
作者: Ittai Rubinstein,Samuel B. Hopkins
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:How does the training data affect a model’s behavior? This is the question we seek to answer with data attribution. The leading practical approaches to data attribution are based on influence functions (IF). IFs utilize a first-order Taylor approximation to efficiently predict the effect of removing a set of samples from the training set without retraining the model, and are used in a wide variety of machine learning applications. However, especially in the high-dimensional regime (\#\mathrm{params} \geq \Omega(\#\mathrm{samples})), they are often imprecise and tend to underestimate the effect of sample removals, even for simple models such as logistic regression. We present rescaled influence functions (RIF), a new tool for data attribution which can be used as a drop-in replacement for influence functions, with little computational overhead but significant improvement in accuracy. We compare IF and RIF on a range of real-world datasets, showing that RIFs offer significantly better predictions in practice, and present a theoretical analysis explaining this improvement. Finally, we present a simple class of data poisoning attacks that would fool IF-based detections but would be detected by RIF.
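For a concrete baseline, the classical influence function for l2-regularized logistic regression approximates the leave-one-out parameter shift as the inverse Hessian times the per-sample gradient. A sketch is below; the rescaling that defines RIFs is paper-specific and not reproduced, and the regularization constants are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def influence_shifts(X, y, theta, lam=1e-3):
    """Classical IF: removing sample i shifts the fitted parameters by roughly
    (1/n) H^{-1} grad_i, with H the regularized average Hessian at theta.
    RIF rescales these estimates to fix IF's high-dimensional underestimation;
    that rescaling is not implemented here.
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))          # predicted probabilities
    grads = X * (p - y)[:, None]                  # per-sample loss gradients
    H = (X.T * (p * (1 - p))) @ X / len(X) + lam * np.eye(X.shape[1])
    return np.linalg.solve(H, grads.T).T / len(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (rng.uniform(size=500) < 0.5).astype(float)
theta = LogisticRegression().fit(X, y).coef_.ravel()  # fitted minimizer
shifts = influence_shifts(X, y, theta)                # (500, 10) shifts
print(np.abs(shifts).sum(axis=1).argmax())            # most influential sample
```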
[LG-89] SAFER: A Calibrated Risk-Aware Multimodal Recommendation Model for Dynamic Treatment Regimes ICML2025
链接: https://arxiv.org/abs/2506.06649
作者: Yishan Shen,Yuyang Ye,Hui Xiong,Yong Chen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by ICML 2025
Abstract:Dynamic treatment regimes (DTRs) are critical to precision medicine, optimizing long-term outcomes through personalized, real-time decision-making in evolving clinical contexts, but require careful supervision for unsafe treatment risks. Existing efforts rely primarily on clinician-prescribed gold standards despite the absence of a known optimal strategy, and predominantly using structured EHR data without extracting valuable insights from clinical notes, limiting their reliability for treatment recommendations. In this work, we introduce SAFER, a calibrated risk-aware tabular-language recommendation framework for DTR that integrates both structured EHR and clinical notes, enabling them to learn from each other, and addresses inherent label uncertainty by assuming ambiguous optimal treatment solution for deceased patients. Moreover, SAFER employs conformal prediction to provide statistical guarantees, ensuring safe treatment recommendations while filtering out uncertain predictions. Experiments on two publicly available sepsis datasets demonstrate that SAFER outperforms state-of-the-art baselines across multiple recommendation metrics and counterfactual mortality rate, while offering robust formal assurances. These findings underscore SAFER potential as a trustworthy and theoretically grounded solution for high-stakes DTR applications.
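The conformal filtering step follows the standard split conformal recipe: calibrate a score threshold on held-out data so that prediction sets reach the target coverage. A generic sketch; SAFER's specific nonconformity score is model-specific and not reproduced:

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal prediction: return the score threshold such that sets
    built from it have >= 1 - alpha marginal coverage, using the standard
    finite-sample-corrected quantile of the calibration scores.
    """
    n = len(cal_scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(cal_scores, min(q, 1.0), method="higher")

# Toy usage: scores = 1 - predicted probability of the observed outcome
rng = np.random.default_rng(0)
cal = rng.uniform(size=1000)
tau = conformal_threshold(cal, alpha=0.1)
# At test time, keep every treatment whose score is <= tau; abstain when the
# resulting set is too large (uncertain) to recommend safely.
print(tau)
```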
[LG-90] Spark Transformer: Reactivating Sparsity in FFN and Attention
链接: https://arxiv.org/abs/2506.06644
作者: Chong You,Kan Wu,Zhipeng Jia,Lin Chen,Srinadh Bhojanapalli,Jiaxian Guo,Utku Evci,Jan Wassenberg,Praneeth Netrapalli,Jeremiah J. Willcock,Suvinay Subramanian,Felix Chern,Alek Andreev,Shreya Pathak,Felix Yu,Prateek Jain,David E. Culler,Henry M. Levy,Sanjiv Kumar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The discovery of the lazy neuron phenomenon in trained Transformers, where the vast majority of neurons in their feed-forward networks (FFN) are inactive for each token, has spurred tremendous interest in activation sparsity for enhancing large model efficiency. While notable progress has been made in translating such sparsity to wall-time benefits, modern Transformers have moved away from the ReLU activation function crucial to this phenomenon. Existing efforts on re-introducing activation sparsity often degrade model quality, increase parameter count, or complicate or slow down training. Sparse attention, the application of sparse activation to the attention mechanism, often faces similar challenges. This paper introduces the Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both the FFN and the attention mechanism while maintaining model quality, parameter count, and standard training procedures. Our method realizes sparsity via top-k masking for explicit control over the sparsity level. Crucially, we introduce statistical top-k, a hardware-accelerator-friendly, linear-time approximate algorithm that avoids costly sorting and mitigates significant training slowdown from standard top-k operators. Furthermore, Spark Transformer reallocates existing FFN parameters and attention key embeddings to form a low-cost predictor for identifying activated entries. This design not only mitigates quality loss from enforced sparsity, but also enhances wall-time benefit. Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
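A hedged sketch of the core idea: exact top-k masking needs a (partial) sort, while a "statistical" variant can instead threshold at an estimated quantile in linear time. This is our illustrative reading of the contrast, not the paper's exact algorithm:

```python
# Exact top-k masking vs. a quantile-threshold approximation (linear time).
import numpy as np
from scipy.stats import norm

def topk_mask_exact(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest entries of x, zero out the rest (needs a partial sort)."""
    idx = np.argpartition(x, -k)[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def topk_mask_statistical(x: np.ndarray, k: int) -> np.ndarray:
    """Approximate top-k: threshold at a Gaussian quantile fit to x (mean/std only)."""
    q = 1.0 - k / x.size
    tau = x.mean() + x.std() * norm.ppf(q)  # estimated k-th largest value
    return np.where(x >= tau, x, 0.0)

x = np.random.default_rng(0).normal(size=4096)
print((topk_mask_exact(x, 256) != 0).sum(), (topk_mask_statistical(x, 256) != 0).sum())
```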
[LG-91] Stacey: Promoting Stochastic Steepest Descent via Accelerated \ell_p-Smooth Nonconvex Optimization
链接: https://arxiv.org/abs/2506.06606
作者: Xinyu Luo,Cedar Site Bai,Bolian Li,Petros Drineas,Ruqi Zhang,Brian Bullins
类目: Machine Learning (cs.LG)
*备注:
Abstract:While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either \ell_2 or \ell_\infty norms, there remains a critical gap in handling the non-Euclidean structure observed in the training of modern deep networks. In this work, we address this need by introducing a new accelerated \ell_p steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and large language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of p across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods. Code can be found at this https URL .
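For intuition, a minimal sketch of an (unaccelerated) \ell_p steepest-descent update via Hölder duality; the full Stacey algorithm additionally interpolates primal-dual iterates, which is omitted here, and the toy objective is illustrative. For p = 2 this recovers normalized gradient descent; as p grows it approaches sign descent (Lion-like):

```python
# l_p steepest descent: the direction maximizing <g, d> subject to ||d||_p <= 1
# is d_i = sign(g_i) * (|g_i| / ||g||_q)^(q-1), with 1/p + 1/q = 1.
import numpy as np

def lp_steepest_direction(g: np.ndarray, p: float) -> np.ndarray:
    q = p / (p - 1.0)                     # dual exponent
    gq = np.linalg.norm(g, ord=q)
    if gq == 0.0:
        return np.zeros_like(g)
    return np.sign(g) * (np.abs(g) / gq) ** (q - 1.0)

# one hundred descent steps on f(x) = 0.25 * ||x||_4^4, whose gradient is x**3
x = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    x -= 0.05 * lp_steepest_direction(x ** 3, p=3.0)
print(x)  # moves toward the minimizer at 0
```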
[LG-92] Scoring the Unscorables: Cyber Risk Assessment Beyond Internet Scans
链接: https://arxiv.org/abs/2506.06604
作者: Armin Sarabi,Manish Karir,Mingyan Liu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:In this paper we present a study on using novel data types to perform cyber risk quantification by estimating the likelihood of a data breach. We demonstrate that it is feasible to build a highly accurate cyber risk assessment model using public and readily available technology signatures obtained from crawling an organization’s website. This approach overcomes the limitations of previous similar approaches that relied on large-scale IP address based scanning data, which suffers from incomplete/missing IP address mappings as well as the lack of such data for large numbers of small and medium-sized organizations (SMEs). In comparison to scan data, technology digital signature data is more readily available for millions of SMEs. Our study shows that there is a strong relationship between these technology signatures and an organization’s cybersecurity posture. In cross-validating our model using different cyber incident datasets, we also highlight the key differences between ransomware attack victims and the larger population of cyber incident and data breach victims.
[LG-93] Direct Prediction Set Minimization via Bilevel Conformal Classifier Training ICML
链接: https://arxiv.org/abs/2506.06599
作者: Yuanjie Shi,Hooman Shahrokhi,Xuesong Jia,Xiongzhi Chen,Janardhan Rao Doppa,Yan Yan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted for Publication at International Conference on Machine Learning (ICML), 2025
Abstract:Conformal prediction (CP) is a promising uncertainty quantification framework which works as a wrapper around a black-box classifier to construct prediction sets (i.e., subsets of candidate classes) with provable guarantees. However, standard calibration methods for CP tend to produce large prediction sets, which makes them less useful in practice. This paper considers the problem of integrating conformal principles into the training process of deep classifiers to directly minimize the size of prediction sets. We formulate conformal training as a bilevel optimization problem and propose the Direct Prediction Set Minimization (DPSM) algorithm to solve it. The key insight behind DPSM is to minimize a measure of the prediction set size (upper level) that is conditioned on the learned quantile of conformity scores (lower level). We show that DPSM has a learning bound of O(1/\sqrt{n}) (with n training samples), while prior conformal training methods based on stochastic approximation for the quantile have a bound of \Omega(1/s) (with batch size s, and typically s \ll \sqrt{n}). Experiments on various benchmark datasets and deep models show that DPSM significantly outperforms the best prior conformal training baseline, with a 20.46% reduction in prediction set size, and validates our theory.
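For reference, a minimal sketch of the standard split-conformal prediction sets that DPSM aims to shrink; the bilevel training itself is not reproduced, and the score (one minus the true-class probability) is a common but illustrative choice:

```python
# Split conformal prediction: calibrate a quantile of conformity scores, then
# include every class whose score falls below it.
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]         # calibration scores
    level = np.ceil((n + 1) * (1 - alpha)) / n                 # finite-sample correction
    qhat = np.quantile(scores, min(level, 1.0))                # learned quantile (lower level)
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]  # prediction sets

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)   # stand-in classifier outputs
cal_labels = rng.integers(0, 10, size=500)
test_probs = rng.dirichlet(np.ones(10), size=3)
print(conformal_sets(cal_probs, cal_labels, test_probs))
```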
[LG-94] Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixtures
链接: https://arxiv.org/abs/2506.06584
作者: Mo Zhou,Weihang Xu,Maryam Fazel,Simon S. Du
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 77 pages
Abstract:Learning Gaussian Mixture Models (GMMs) is a fundamental problem in machine learning, with the Expectation-Maximization (EM) algorithm and its popular variant gradient EM being arguably the most widely used algorithms in practice. In the exact-parameterized setting, where both the ground truth GMM and the learning model have the same number of components m, a vast line of work has aimed to establish rigorous recovery guarantees for EM. However, global convergence has only been proven for the case of m=2, and EM is known to fail to recover the ground truth when m \geq 3. In this paper, we consider the over-parameterized setting, where the learning model uses n > m components to fit an m-component ground truth GMM. In contrast to the exact-parameterized case, we provide a rigorous global convergence guarantee for gradient EM. Specifically, for any well separated GMMs in general position, we prove that with only mild over-parameterization n = \Omega(m \log m), randomly initialized gradient EM converges globally to the ground truth at a polynomial rate with polynomial samples. Our analysis proceeds in two stages and introduces a suite of novel tools for Gaussian Mixture analysis. We use Hermite polynomials to study the dynamics of gradient EM and employ tensor decomposition to characterize the geometric landscape of the likelihood loss. This is the first global convergence and recovery result for EM or Gradient EM beyond the special case of m=2.
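A minimal sketch of one gradient-EM step for the means of an isotropic GMM with uniform weights, run in a mildly over-parameterized toy setting; the setup is illustrative and carries none of the paper's guarantees:

```python
# Gradient EM for GMM means: E-step responsibilities, then a gradient ascent
# step on the average log-likelihood (unit covariance, uniform weights).
import numpy as np

def grad_em_step(X, mus, lr=0.5):
    """X: (N, d) data; mus: (n, d) current means."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # (N, n) squared distances
    logw = -0.5 * d2
    logw -= logw.max(axis=1, keepdims=True)                # stabilize the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)                      # responsibilities (E-step)
    grad = (w[:, :, None] * (X[:, None, :] - mus[None, :, :])).mean(0)
    return mus + lr * grad                                 # gradient "M-step"

rng = np.random.default_rng(0)
truth = np.array([[-5.0, 0.0], [5.0, 0.0]])                # m = 2 true components
X = truth[rng.integers(0, 2, 2000)] + rng.normal(size=(2000, 2))
mus = rng.normal(size=(4, 2))                              # over-parameterized: n = 4 > m
for _ in range(200):
    mus = grad_em_step(X, mus)
print(np.round(mus, 2))                                    # means gather near (+-5, 0)
```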
[LG-95] Demystifying Topological Message-Passing with Relational Structures: A Case Study on Oversquashing in Simplicial Message-Passing ICLR2025
链接: https://arxiv.org/abs/2506.06582
作者: Diaaeldin Taha,James Chapman,Marzieh Eidi,Karel Devriendt,Guido Montúfar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 50 pages, 12 figures, published at ICLR 2025. The Thirteenth International Conference on Learning Representations. 2025
Abstract:Topological deep learning (TDL) has emerged as a powerful tool for modeling higher-order interactions in relational data. However, phenomena such as oversquashing in topological message-passing remain understudied and lack theoretical analysis. We propose a unifying axiomatic framework that bridges graph and topological message-passing by viewing simplicial and cellular complexes and their message-passing schemes through the lens of relational structures. This approach extends graph-theoretic results and algorithms to higher-order structures, facilitating the analysis and mitigation of oversquashing in topological message-passing networks. Through theoretical analysis and empirical studies on simplicial networks, we demonstrate the potential of this framework to advance TDL.
[LG-96] Rapid training of Hamiltonian graph networks without gradient descent
链接: https://arxiv.org/abs/2506.06558
作者: Atamert Rahma,Chinmay Datar,Ana Cukarska,Felix Dietrich
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 10 pages, 7 figures, 2 tables, and an appendix
Abstract:Learning dynamical systems that respect physical symmetries and constraints remains a fundamental challenge in data-driven modeling. Integrating physical laws with graph neural networks facilitates principled modeling of complex N-body dynamics and yields accurate and permutation-invariant models. However, training graph neural networks with iterative, gradient-based optimization algorithms (e.g., Adam, RMSProp, LBFGS) often leads to slow training, especially for large, complex systems. In a comparison against 15 different optimizers, we demonstrate that Hamiltonian Graph Networks (HGN) can be trained up to 600x faster, with comparable accuracy, by replacing iterative optimization with random feature-based parameter construction. We show robust performance in diverse simulations, including N-body mass-spring systems in up to 3 dimensions with different geometries, while retaining essential physical invariances with respect to permutation, rotation, and translation. We reveal that even when trained on minimal 8-node systems, the model can generalize in a zero-shot manner to systems as large as 4096 nodes without retraining. Our work challenges the dominance of iterative gradient-descent-based optimization algorithms for training neural network models for physical systems.
[LG-97] Infinity Search: Approximate Vector Search with Projections on q-Metric Spaces
链接: https://arxiv.org/abs/2506.06557
作者: Antonio Pariente,Ignacio Hounie,Santiago Segarra,Alejandro Ribeiro
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Signal Processing (eess.SP); Metric Geometry (math.MG)
*备注:
Abstract:Despite the ubiquity of vector search applications, prevailing search algorithms overlook the metric structure of vector embeddings, treating it as a constraint rather than exploiting its underlying properties. In this paper, we demonstrate that in q-metric spaces, metric trees can leverage a stronger version of the triangle inequality to reduce comparisons for exact search. Notably, as q approaches infinity, the search complexity becomes logarithmic. Therefore, we propose a novel projection method that embeds vector datasets with arbitrary dissimilarity measures into q-metric spaces while preserving the nearest neighbor. We propose to learn an approximation of this projection to efficiently transform query points to a space where Euclidean distances satisfy the desired properties. Our experimental results with text and image vector embeddings show that learning q-metric approximations enables classic metric tree algorithms (which typically underperform with high-dimensional data) to achieve competitive performance against state-of-the-art search methods.
[LG-98] SDN-Based False Data Detection With Its Mitigation and Machine Learning Robustness for In-Vehicle Networks
链接: https://arxiv.org/abs/2506.06556
作者: Long Dang,Thushari Hapuarachchi,Kaiqi Xiong,Yi Li
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: The 34th International Conference on Computer Communications and Networks (ICCCN 2025)
Abstract:As the development of autonomous and connected vehicles advances, the complexity of modern vehicles increases, with numerous Electronic Control Units (ECUs) integrated into the system. In an in-vehicle network, these ECUs communicate with one another using a standard protocol called Controller Area Network (CAN). Securing communication among ECUs plays a vital role in maintaining the safety and security of the vehicle. This paper proposes a robust SDN-based False Data Detection and Mitigation System (FDDMS) for in-vehicle networks. Leveraging the unique capabilities of Software-Defined Networking (SDN), FDDMS is designed to monitor and detect false data injection attacks in real time. Specifically, we focus on brake-related ECUs within an SDN-enabled in-vehicle network. First, we decode raw CAN data to create an attack model that illustrates how false data can be injected into the system. Then, FDDMS, incorporating a Long Short-Term Memory (LSTM)-based detection model, is used to identify false data injection attacks. We further propose an effective variant of the DeepFool attack to evaluate the model’s robustness. To counter the impact of four adversarial attacks, including the fast gradient descent method, the basic iterative method, DeepFool, and the DeepFool variant, we further enhance a re-training technique with a threshold-based selection strategy. Finally, a mitigation scheme is implemented to redirect attack traffic by dynamically updating flow rules through SDN. Our experimental results show that the proposed FDDMS is robust against adversarial attacks and effectively detects and mitigates false data injection attacks in real time.
[LG-99] GeoClip: Geometry-Aware Clipping for Differentially Private SGD
链接: https://arxiv.org/abs/2506.06549
作者: Atefeh Gilani,Naima Tasnim,Lalitha Sankar,Oliver Kosut
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Information Theory (cs.IT)
*备注:
Abstract:Differentially private stochastic gradient descent (DP-SGD) is the most widely used method for training machine learning models with provable privacy guarantees. A key challenge in DP-SGD is setting the per-sample gradient clipping threshold, which significantly affects the trade-off between privacy and utility. While recent adaptive methods improve performance by adjusting this threshold during training, they operate in the standard coordinate system and fail to account for correlations across the coordinates of the gradient. We propose GeoClip, a geometry-aware framework that clips and perturbs gradients in a transformed basis aligned with the geometry of the gradient distribution. GeoClip adaptively estimates this transformation using only previously released noisy gradients, incurring no additional privacy cost. We provide convergence guarantees for GeoClip and derive a closed-form solution for the optimal transformation that minimizes the amount of noise added while keeping the probability of gradient clipping under control. Experiments on both tabular and image datasets demonstrate that GeoClip consistently outperforms existing adaptive clipping methods under the same privacy budget.
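A hedged sketch of the geometry-aware clipping idea: estimate a covariance from previously released noisy gradients, whiten, clip per-sample in the whitened basis, add noise, and map back. The transformation estimate and constants here are illustrative, not the paper's closed-form optimum:

```python
# Geometry-aware clipping: clip gradients in a whitened basis aligned with the
# (estimated) gradient covariance, then un-whiten the noisy average.
import numpy as np

def geoclip_step(per_sample_grads, past_noisy_grads, clip=1.0, sigma=1.0, eps=1e-3):
    d = per_sample_grads.shape[1]
    # estimate gradient geometry from already-released noisy gradients (no extra privacy cost)
    cov = np.cov(past_noisy_grads.T) + eps * np.eye(d)
    L = np.linalg.cholesky(cov)                   # cov = L L^T
    W = np.linalg.inv(L)                          # whitening transform
    g_w = per_sample_grads @ W.T                  # whiten
    norms = np.linalg.norm(g_w, axis=1, keepdims=True)
    g_w *= np.minimum(1.0, clip / norms)          # per-sample clipping in whitened basis
    noisy = g_w.mean(0) + (sigma * clip / len(g_w)) * np.random.standard_normal(d)
    return np.linalg.solve(W, noisy)              # map back to the original basis

scale = np.array([5.0, 1.0, 1.0, 1.0, 0.2])       # anisotropic gradient geometry
grads = np.random.default_rng(0).normal(size=(32, 5)) * scale
history = np.random.default_rng(1).normal(size=(100, 5)) * scale
print(geoclip_step(grads, history))
```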
[LG-100] Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
链接: https://arxiv.org/abs/2506.06521
作者: Shulun Chen,Runlong Zhou,Zihan Zhang,Maryam Fazel,Simon S. Du
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 30 pages
Abstract:We consider the gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm achieves a variance-aware gap-dependent regret bound of \tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)} + \sum_{\Delta_h(s,a)=0} \frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_{\min}} + SAH^4 (S \lor H)\right) \log K\right), where H is the planning horizon, S is the number of states, A is the number of actions, and K is the number of episodes. Here, \Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a) represents the suboptimality gap and \Delta_{\min} := \min_{\Delta_h(s,a) > 0} \Delta_h(s,a). The term \mathtt{Var}_{\max}^{\text{c}} denotes the maximum conditional total variance, calculated as the maximum over all (\pi, h, s) tuples of the expected total variance under policy \pi conditioned on trajectories visiting state s at step h. \mathtt{Var}_{\max}^{\text{c}} characterizes the maximum randomness encountered when learning any (h, s) pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gap and can be potentially adapted for other algorithms. To complement the study, we establish a lower bound of \Omega\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \land \mathtt{Var}_{\max}^{\text{c}}}{\Delta_h(s,a)} \cdot \log K\right), demonstrating the necessity of dependence on \mathtt{Var}_{\max}^{\text{c}} even when the maximum unconditional total variance (without conditioning on (h, s)) approaches zero.
[LG-101] A Systematic Review of Poisoning Attacks Against Large Language Models
链接: https://arxiv.org/abs/2506.06518
作者: Neil Fendley,Edward W. Staley,Joshua Carney,William Redman,Marie Chau,Nathan Drenkow
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 28 Pages including number
Abstract:With the widespread availability of pretrained Large Language Models (LLMs) and their training datasets, concerns about the security risks associated with their usage have increased significantly. One of these security risks is the threat of LLM poisoning attacks, where an attacker modifies some part of the LLM training process to cause the LLM to behave in a malicious way. As an emerging area of research, the current frameworks and terminology for LLM poisoning attacks are derived from earlier classification poisoning literature and are not fully equipped for generative LLM settings. We conduct a systematic review of published LLM poisoning attacks to clarify the security implications and address inconsistencies in terminology across the literature. We propose a comprehensive poisoning threat model that can be used to categorize a wide range of LLM poisoning attacks. The poisoning threat model includes four poisoning attack specifications that define the logistics and manipulation strategies of an attack, as well as six poisoning metrics used to measure key characteristics of an attack. Under our proposed framework, we organize our discussion of published LLM poisoning literature along four critical dimensions of LLM poisoning attacks: concept poisons, stealthy poisons, persistent poisons, and poisons for unique tasks, to better understand the current landscape of security risks.
[LG-102] InstantFT: An FPGA-Based Runtime Subsecond Fine-tuning of CNN Models
链接: https://arxiv.org/abs/2506.06505
作者: Keisuke Sugiura,Hiroki Matsutani
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:
Abstract:Training deep neural networks (DNNs) requires significantly more computation and memory than inference, making runtime adaptation of DNNs challenging on resource-limited IoT platforms. We propose InstantFT, an FPGA-based method for ultra-fast CNN fine-tuning on IoT devices, by optimizing the forward and backward computations in parameter-efficient fine-tuning (PEFT). Experiments on datasets with concept drift demonstrate that InstantFT fine-tunes a pre-trained CNN 17.4x faster than existing Low-Rank Adaptation (LoRA)-based approaches, while achieving comparable accuracy. Our FPGA-based InstantFT reduces the fine-tuning time to just 0.36s and improves energy-efficiency by 16.3x, enabling on-the-fly adaptation of CNNs to non-stationary data distributions.
[LG-103] Optimal Rates in Continual Linear Regression via Increasing Regularization
链接: https://arxiv.org/abs/2506.06501
作者: Ran Levinstein,Amit Attia,Matan Schliserman,Uri Sherman,Tomer Koren,Daniel Soudry,Itay Evron
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study realizable continual linear regression under random task orderings, a common setting for developing continual learning theory. In this setup, the worst-case expected loss after k learning iterations admits a lower bound of \Omega(1/k). However, prior work using an unregularized scheme has only established an upper bound of O(1/k^{1/4}), leaving a significant gap. Our paper proves that this gap can be narrowed, or even closed, using two frequently used regularization schemes: (1) explicit isotropic \ell_2 regularization, and (2) implicit regularization via finite step budgets. We show that these approaches, which are used in practice to mitigate forgetting, reduce to stochastic gradient descent (SGD) on carefully defined surrogate losses. Through this lens, we identify a fixed regularization strength that yields a near-optimal rate of O(\log k / k). Moreover, formalizing and analyzing a generalized variant of SGD for time-varying functions, we derive an increasing regularization strength schedule that provably achieves an optimal rate of O(1/k). This suggests that schedules that increase the regularization coefficient or decrease the number of steps per task are beneficial, at least in the worst case.
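A minimal sketch of the regularized continual-regression setup: at task k, minimize the task loss plus an isotropic \ell_2 penalty pulling toward the previous weights, with a strength that grows with k. The linear schedule and constants below are illustrative, not the paper's prescribed schedule:

```python
# Continual linear regression with increasing l2 regularization toward the
# previous iterate; each task is realizable w.r.t. a shared solution w_star.
import numpy as np

def solve_task(A, b, w_prev, lam):
    """argmin_w ||A w - b||^2 + lam * ||w - w_prev||^2 (closed form)."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b + lam * w_prev)

rng = np.random.default_rng(0)
d = 20
w_star = rng.normal(size=d)              # shared realizable solution
w = np.zeros(d)
for k in range(1, 101):
    A = rng.normal(size=(5, d))          # a random under-determined task
    b = A @ w_star                       # realizable: consistent with w_star
    w = solve_task(A, b, w, lam=0.5 * k) # regularization strength increases with k
print(np.linalg.norm(w - w_star))        # distance to the shared solution shrinks
```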
[LG-104] Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks
链接: https://arxiv.org/abs/2506.06489
作者: Daniel Kunin,Giovanni Luca Marchetti,Feng Chen,Dhruva Karkada,James B. Simon,Michael R. DeWeese,Surya Ganguli,Nina Miolane
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 7 figures
Abstract:What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each round, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.
[LG-105] Membership Inference Attacks for Unseen Classes
链接: https://arxiv.org/abs/2506.06488
作者: Pratiksha Thaker,Neil Kale,Zhiwei Steven Wu,Virginia Smith
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: Preprint
Abstract:Shadow model attacks are the state-of-the-art approach for membership inference attacks on machine learning models. However, these attacks typically assume an adversary has access to a background (nonmember) data distribution that matches the distribution the target model was trained on. We initiate a study of membership inference attacks where the adversary or auditor cannot access an entire subclass from the distribution – a more extreme but realistic version of distribution shift than has been studied previously. In this setting, we first show that the performance of shadow model attacks degrades catastrophically, and then demonstrate the promise of another approach, quantile regression, that does not have the same limitations. We show that quantile regression attacks consistently outperform shadow model attacks in the class dropout setting – for example, quantile regression attacks achieve up to 11 \times the TPR of shadow models on the unseen class on CIFAR-100, and achieve nontrivial TPR on ImageNet even with 90% of training classes removed. We also provide a theoretical model that illustrates the potential and limitations of this approach.
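A hedged sketch of a quantile-regression membership inference attack: fit a low quantile of nonmember loss as a function of input features, then flag points whose observed loss is unusually low for their features. The data, features, and threshold choice here are illustrative:

```python
# Quantile-regression MIA: a candidate is predicted "member" if its loss under
# the target model falls below the predicted alpha-quantile of nonmember losses.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# nonmember reference data: features x and the target model's loss on them
x_ref = rng.normal(size=(1000, 8))
loss_ref = np.abs(x_ref[:, 0]) + rng.gamma(2.0, 0.5, size=1000)

alpha = 0.05
qr = GradientBoostingRegressor(loss="quantile", alpha=alpha).fit(x_ref, loss_ref)

x_cand = rng.normal(size=(5, 8))                  # candidate points to audit
loss_cand = np.array([0.01, 0.3, 1.5, 0.05, 2.0]) # their losses under the target model
is_member = loss_cand < qr.predict(x_cand)
print(is_member)
```

Note that unlike shadow models, this requires no training data matching the target distribution, which is what makes it viable when entire classes are unseen.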
[LG-106] A Certified Unlearning Approach without Access to Source Data ICML2025
链接: https://arxiv.org/abs/2506.06486
作者: Umit Yigit Basaran,Sk Miraj Ahmed,Amit Roy-Chowdhury,Basak Guler
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Machine Learning (stat.ML)
*备注: Accepted by ICML 2025
Abstract:With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model’s behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
[LG-107] TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness
链接: https://arxiv.org/abs/2506.06482
作者: Zhiyuan Zhao,Juntong Ni,Shangqing Xu,Haoxin Liu,Wei Jin,B. Aditya Prakash
类目: Machine Learning (cs.LG)
*备注: 46 pages, 1 figure, 28 tables
Abstract:Time-series forecasting is an essential task with wide real-world applications across domains. While recent advances in deep learning have enabled time-series forecasting models with accurate predictions, there remains considerable debate over which architectures and design components, such as series decomposition or normalization, are most effective under varying conditions. Existing benchmarks primarily evaluate models at a high level, offering limited insight into why certain designs work better. To mitigate this gap, we propose TimeRecipe, a unified benchmarking framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components across a diverse range of datasets, forecasting horizons, and task settings. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods and uncover meaningful intuitions linking specific design choices to forecasting scenarios. Furthermore, we release a practical toolkit within TimeRecipe that recommends suitable model architectures based on these empirical insights. The benchmark is available at: this https URL.
[LG-108] Towards Infant Sleep-Optimized Driving: Synergizing Wearable and Vehicle Sensing in Intelligent Cruise Control
链接: https://arxiv.org/abs/2506.06459
作者: Ruitao Chen,Mozhang Guo,Jinge Li
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Robotics (cs.RO); Systems and Control (eess.SY)
*备注:
Abstract:Automated driving (AD) has substantially improved vehicle safety and driving comfort, but their impact on passenger well-being, particularly infant sleep, is not sufficiently studied. Sudden acceleration, abrupt braking, and sharp maneuvers can disrupt infant sleep, compromising both passenger comfort and parental convenience. To solve this problem, this paper explores the integration of reinforcement learning (RL) within AD to personalize driving behavior and optimally balance occupant comfort and travel efficiency. In particular, we propose an intelligent cruise control framework that adapts to varying driving conditions to enhance infant sleep quality by effectively synergizing wearable sensing and vehicle data. Long short-term memory (LSTM) and transformer-based neural networks are integrated with RL to model the relationship between driving behavior and infant sleep quality under diverse traffic and road conditions. Based on the sleep quality indicators from the wearable sensors, driving action data from vehicle controllers, and map data from map applications, the model dynamically computes the optimal driving aggressiveness level, which is subsequently translated into specific AD control strategies, e.g., the magnitude and frequency of acceleration, lane change, and overtaking. Simulation results demonstrate that the proposed solution significantly improves infant sleep quality compared to baseline methods, while preserving desirable travel efficiency.
[LG-109] LETS Forecast: Learning Embedology for Time Series Forecasting ICML
链接: https://arxiv.org/abs/2506.06454
作者: Abrar Majeedi,Viswanatha Reddy Gajjala,Satya Sai Srinath Namburi GNVV,Nada Magdi Elkordi,Yin Li
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at International Conference on Machine Learning (ICML) 2025
Abstract:Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise future prediction. While deep learning has achieved major success in time series forecasting, many existing approaches do not explicitly model the dynamics. To bridge this gap, we introduce DeepEDM, a framework that integrates nonlinear dynamical systems modeling with deep neural networks. Inspired by empirical dynamic modeling (EDM) and rooted in Takens’ theorem, DeepEDM presents a novel deep model that learns a latent space from time-delayed embeddings, and employs kernel regression to approximate the underlying dynamics, while leveraging efficient implementation of softmax attention and allowing for accurate prediction of future time steps. To evaluate our method, we conduct comprehensive experiments on synthetic data of nonlinear dynamical systems as well as real-world time series across domains. Our results show that DeepEDM is robust to input noise, and outperforms state-of-the-art methods in forecasting accuracy. Our code is available at: this https URL.
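A minimal sketch of the EDM ingredients DeepEDM builds on: a time-delay (Takens) embedding plus softmax-weighted kernel regression over past embedded states to forecast the next value. The deep latent-space learning is omitted, and the bandwidth and toy signal are illustrative:

```python
# Delay embedding + kernel regression: weight past states by their similarity
# to the current embedded state (a softmax-attention-style forecast).
import numpy as np

def delay_embed(x, dim=3, tau=1):
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i : i + n] for i in range(0, dim * tau, tau)], axis=1)

def edm_predict(x, dim=3, tau=1, bandwidth=1.0):
    E = delay_embed(x, dim, tau)            # E[i] ends at x[i + (dim-1)*tau]
    lib = E[:-1]                            # states whose successor is observed
    targets = x[(dim - 1) * tau + 1 :]      # the value following each library state
    query = E[-1]                           # most recent embedded state
    d2 = ((lib - query) ** 2).sum(1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))  # softmax-attention-style weights
    w /= w.sum()
    return float(w @ targets)               # kernel-regression forecast

t = np.arange(500) * 0.2
x = np.sin(t) + 0.01 * np.random.default_rng(0).normal(size=500)
print(edm_predict(x), np.sin(t[-1] + 0.2))  # forecast vs. ground truth
```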
[LG-110] CoxNTF: A New Approach for Joint Clustering and Prediction in Survival Analysis
链接: https://arxiv.org/abs/2506.06411
作者: Paul Fogel(1),Christophe Geissler(1),George Luta(2) ((1) Data Services, ForvisMazars, Courbevoie, France, (2) Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Medical Center, Washington, DC, USA)
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, Conference on Lifetime Data Science 2025, Brooklyn, New York, USA
Abstract:The interpretation of the results of survival analysis often benefits from latent factor representations of baseline covariates. However, existing methods, such as Nonnegative Matrix Factorization (NMF), do not incorporate survival information, limiting their predictive power. We present CoxNTF, a novel approach that uses non-negative tensor factorization (NTF) to derive meaningful latent representations that are closely associated with survival outcomes. CoxNTF constructs a weighted covariate tensor in which survival probabilities derived from the Coxnet model are used to guide the tensorization process. Our results show that CoxNTF achieves survival prediction performance comparable to using Coxnet with the original covariates, while providing a structured and interpretable clustering framework. In addition, the new approach effectively handles feature redundancy, making it a powerful tool for joint clustering and prediction in survival analysis.
[LG-111] Evaluating Large Language Model Capabilities in Assessing Spatial Econometrics Research
链接: https://arxiv.org/abs/2506.06377
作者: Giuseppe Arbia,Luca Morandini,Vincenzo Nardelli
类目: Computers and Society (cs.CY); Machine Learning (cs.LG); Econometrics (econ.EM); Computation (stat.CO)
*备注:
Abstract:This paper investigates the ability of Large Language Models (LLMs) to assess the economic soundness and theoretical consistency of empirical findings in spatial econometrics. We created original and deliberately altered “counterfactual” summaries from 28 published papers (2005-2024), which were evaluated by a diverse set of LLMs. The LLMs provided qualitative assessments and structured binary classifications on variable choice, coefficient plausibility, and publication suitability. The results indicate that while LLMs can expertly assess the coherence of variable choices (with top models like GPT-4o achieving an overall F1 score of 0.87), their performance varies significantly when evaluating deeper aspects such as coefficient plausibility and overall publication suitability. The results further revealed that the choice of LLM, the specific characteristics of the paper, and the interaction between these two factors significantly influence the accuracy of the assessment, particularly for nuanced judgments. These findings highlight LLMs’ current strengths in assisting with initial, more surface-level checks and their limitations in performing comprehensive, deep economic reasoning, suggesting a potential assistive role in peer review that still necessitates robust human oversight.
[LG-112] El0ps: An Exact L0-regularized Problems Solver
链接: https://arxiv.org/abs/2506.06373
作者: Théo Guyard,Cédric Herzet,Clément Elvira
类目: Mathematical Software (cs.MS); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:This paper presents El0ps, a Python toolbox providing several utilities to handle L0-regularized problems related to applications in machine learning, statistics, and signal processing, among other fields. In contrast to existing toolboxes, El0ps allows users to define custom instances of these problems through a flexible framework, provides a dedicated solver achieving state-of-the-art performance, and offers several built-in machine learning pipelines. Our aim with El0ps is to provide a comprehensive tool which opens new perspectives for the integration of L0-regularized problems in practical applications.
[LG-113] Deep Learning Enhanced Multi-Day Turnover Quantitative Trading Algorithm for Chinese A-Share Market
链接: https://arxiv.org/abs/2506.06356
作者: Yimin Du
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 10 pages
Abstract:This paper presents a sophisticated multi-day turnover quantitative trading algorithm that integrates advanced deep learning techniques with comprehensive cross-sectional stock prediction for the Chinese A-share market. Our framework combines five interconnected modules: initial stock selection through deep cross-sectional prediction networks, opening signal distribution analysis using mixture models for arbitrage identification, market capitalization and liquidity-based dynamic position sizing, grid-search optimized profit-taking and stop-loss mechanisms, and multi-granularity volatility-based market timing models. The algorithm employs a novel approach to balance capital efficiency with risk management through adaptive holding periods and sophisticated entry/exit timing. Trained on comprehensive A-share data from 2010-2020 and rigorously backtested on 2021-2024 data, our method achieves remarkable performance with 15.2% annualized returns, maximum drawdown constrained below 5%, and a Sharpe ratio of 1.87. The strategy demonstrates exceptional scalability by maintaining 50-100 daily positions with a 9-day maximum holding period, incorporating dynamic profit-taking and stop-loss mechanisms that enhance capital turnover efficiency while preserving risk-adjusted returns. Our approach exhibits robust performance across various market regimes while maintaining high capital capacity suitable for institutional deployment.
[LG-114] Optimized Local Updates in Federated Learning via Reinforcement Learning IJCNN2025
链接: https://arxiv.org/abs/2506.06337
作者: Ali Murad,Bo Hui,Wei-Shinn Ku
类目: Machine Learning (cs.LG)
*备注: This paper is accepted at IEEE IJCNN 2025
Abstract:Federated Learning (FL) is a distributed framework for collaborative model training over large-scale distributed data, enabling higher performance while maintaining client data privacy. However, the nature of model aggregation at the centralized server can result in a performance drop in the presence of non-IID data across different clients. We remark that training a client locally on more data than necessary does not benefit the overall performance of all clients. In this paper, we devise a novel framework that leverages a Deep Reinforcement Learning (DRL) agent to select an optimized amount of data necessary to train a client model without oversharing information with the server. Starting without awareness of the client’s performance, the DRL agent utilizes the change in training loss as a reward signal and learns to optimize the amount of training data necessary for improving the client’s performance. Specifically, after each aggregation round, the DRL algorithm considers the local performance as the current state and outputs the optimized weights for each class, in the training data, to be used during the next round of local training. In doing so, the agent learns a policy that creates an optimized partition of the local training dataset during the FL rounds. After FL, the client utilizes the entire local training dataset to further enhance its performance on its own data distribution, mitigating the non-IID effects of aggregation. Through extensive experiments, we demonstrate that training FL clients through our algorithm results in superior performance on multiple benchmark datasets and FL frameworks. Our code is available at this https URL.
[LG-115] Preference-based learning for news headline recommendation
链接: https://arxiv.org/abs/2506.06334
作者: Alexandre Bouras,Audrey Durand,Richard Khoury
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:This study explores strategies for optimizing news headline recommendations through preference-based learning. Using real-world data of user interactions with French-language online news posts, we learn a headline recommender agent under a contextual bandit setting. This allows us to explore the impact of translation on engagement predictions, as well as the benefits of different interactive strategies on user engagement during data collection. Our results show that explicit exploration may not be required in the presence of noisy contexts, opening the door to simpler but efficient strategies in practice.
[LG-116] Extending AALpy with Passive Learning: A Generalized State-Merging Approach
链接: https://arxiv.org/abs/2506.06333
作者: Benjamin von Berg,Bernhard K. Aichernig
类目: Machine Learning (cs.LG); Formal Languages and Automata Theory (cs.FL)
*备注: Accepted for publication at CAV 2025, the 37th International Conference on Computer Aided Verification
Abstract:AALpy is a well-established open-source automata learning library written in Python with a focus on active learning of systems with IO behavior. It provides a wide range of state-of-the-art algorithms for different automaton types ranging from fully deterministic to probabilistic automata. In this work, we present the recent addition of a generalized implementation of an important method from the domain of passive automata learning: state-merging in the red-blue framework. Using a common internal representation for different automaton types allows for a general and highly configurable implementation of the red-blue framework. We describe how to define and execute state-merging algorithms using AALpy, which reduces the implementation effort for state-merging algorithms mainly to the definition of compatibility criteria and scoring. This aids the implementation of both existing and novel algorithms. In particular, defining some existing state-merging algorithms from the literature with AALpy only takes a few lines of code.
[LG-117] ExplainBench: A Benchmark Framework for Local Model Explanations in Fairness-Critical Applications
链接: https://arxiv.org/abs/2506.06330
作者: James Afful
类目: Machine Learning (cs.LG)
*备注:
Abstract:As machine learning systems are increasingly deployed in high-stakes domains such as criminal justice, finance, and healthcare, the demand for interpretable and trustworthy models has intensified. Despite the proliferation of local explanation techniques, including SHAP, LIME, and counterfactual methods, there exists no standardized, reproducible framework for their comparative evaluation, particularly in fairness-sensitive settings. We introduce ExplainBench, an open-source benchmarking suite for systematic evaluation of local model explanations across ethically consequential datasets. ExplainBench provides unified wrappers for popular explanation algorithms, integrates end-to-end pipelines for model training and explanation generation, and supports evaluation via fidelity, sparsity, and robustness metrics. The framework includes a Streamlit-based graphical interface for interactive exploration and is packaged as a Python module for seamless integration into research workflows. We demonstrate ExplainBench on datasets commonly used in fairness research, such as COMPAS, UCI Adult Income, and LendingClub, and showcase how different explanation methods behave under a shared experimental protocol. By enabling reproducible, comparative analysis of local explanations, ExplainBench advances the methodological foundations of interpretable machine learning and facilitates accountability in real-world AI systems.
[LG-118] Wine Quality Prediction with Ensemble Trees: A Unified Leak-Free Comparative Study
链接: https://arxiv.org/abs/2506.06327
作者: Zilang Chen
类目: Machine Learning (cs.LG)
*备注: 14 pages, 7 figures, 2 tables
Abstract:Accurate and reproducible wine-quality assessment is critical for production control yet remains dominated by subjective, labour-intensive tasting panels. We present the first unified benchmark of five ensemble learners (Random Forest, Gradient Boosting, XGBoost, LightGBM, CatBoost) on the canonical Vinho Verde red- and white-wine datasets (1,599 and 4,898 instances, 11 physicochemical attributes). Our leakage-free workflow employs an 80:20 stratified train-test split, five-fold StratifiedGroupKFold within the training set, per-fold standardisation, SMOTE-Tomek resampling, inverse-frequency cost weighting, Optuna hyper-parameter search (120-200 trials per model) and a two-stage feature-selection refit. Final scores on untouched test sets are reported with weighted F1 as the headline metric. Gradient Boosting achieves the highest accuracy (weighted F1 0.693 +/- 0.028 for red and 0.664 +/- 0.016 for white), followed within three percentage points by Random Forest and XGBoost. Limiting each model to its five top-ranked variables lowers dimensionality by 55 percent while reducing weighted F1 by only 2.6 percentage points for red and 3.0 percentage points for white, indicating that alcohol, volatile acidity, sulphates, free SO2 and chlorides capture most predictive signal. Runtime profiling on an EPYC 9K84/H20 node reveals a steep efficiency gradient: Gradient Boosting averages 12 h per five-fold study, XGBoost and LightGBM require 2-3 h, CatBoost 1 h, and Random Forest under 50 min. We therefore recommend Random Forest as the most cost-effective production model, XGBoost and LightGBM as GPU-efficient alternatives, and Gradient Boosting as the accuracy ceiling for offline benchmarking. The fully documented pipeline and metric set provide a reproducible baseline for future work on imbalanced multi-class wine-quality prediction.
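A hedged sketch of the leak-free pattern the abstract describes: resampling and scaling are fitted inside each CV fold via an imblearn pipeline, so no test-fold information leaks into preprocessing. For simplicity this uses StratifiedKFold rather than the paper's StratifiedGroupKFold, and it assumes a local copy of the UCI Vinho Verde CSV with its standard column layout:

```python
# Leak-free imbalanced classification: SMOTE-Tomek and standardisation are
# refit on the training folds only, inside the cross-validation loop.
import pandas as pd
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTETomek
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("winequality-red.csv", sep=";")  # assumed local copy of the dataset
X, y = df.drop(columns="quality"), df["quality"]

pipe = Pipeline([
    ("scale", StandardScaler()),                   # per-fold standardisation
    ("resample", SMOTETomek(random_state=0)),      # fitted on training folds only
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=0)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1_weighted")
print(scores.mean(), scores.std())
```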
[LG-119] LT-PINN: Lagrangian Topology-conscious Physics-informed Neural Network for Boundary-focused Engineering Optimization
链接: https://arxiv.org/abs/2506.06300
作者: Yuanye Zhou,Zhaokun Wang,Kai Zhou,Hui Tang,Xiaofan Li
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Physics-informed neural networks (PINNs) have emerged as a powerful meshless tool for topology optimization, capable of simultaneously determining optimal topologies and physical solutions. However, conventional PINNs rely on density-based topology descriptions, which necessitate manual interpolation and limit their applicability to complex geometries. To address this, we propose Lagrangian topology-conscious PINNs (LT-PINNs), a novel framework for boundary-focused engineering optimization. By parameterizing the control variables of topology boundary curves as learnable parameters, LT-PINNs eliminate the need for manual interpolation and enable precise boundary determination. We further introduce a specialized boundary condition loss function and a topology loss function to ensure sharp and accurate boundary representations, even for intricate topologies. The accuracy and robustness of LT-PINNs are validated via two types of partial differential equations (PDEs), including the elastic equation with Dirichlet boundary conditions and Laplace’s equation with Neumann boundary conditions. Furthermore, we demonstrate the effectiveness of LT-PINNs on more complex time-dependent and time-independent flow problems without relying on measurement data, and showcase their engineering application potential in flow velocity rearrangement, transforming a uniform upstream velocity into a sine-shaped downstream profile. The results demonstrate that (1) LT-PINNs achieve substantial reductions in relative L2 errors compared with the state-of-the-art density topology-oriented PINNs (DT-PINNs), (2) LT-PINNs can handle arbitrary boundary conditions, making them suitable for a wide range of PDEs, and (3) LT-PINNs can infer clear topology boundaries without manual interpolation, especially for complex topologies.
[LG-120] Amatriciana: Exploiting Temporal GNNs for Robust and Efficient Money Laundering Detection
链接: https://arxiv.org/abs/2506.00654
作者: Marco Di Gennaro,Francesco Panebianco,Marco Pianta,Stefano Zanero,Michele Carminati
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Money laundering is a financial crime that poses a serious threat to financial integrity and social security. The growing number of transactions makes it necessary to use automatic tools that help law enforcement agencies detect such criminal activity. In this work, we present Amatriciana, a novel approach based on Graph Neural Networks to detect money launderers inside a graph of transactions by considering temporal information. Amatriciana uses the whole graph of transactions without splitting it into several time-based subgraphs, exploiting all relational information in the dataset. Our experiments on a public dataset reveal that the model can learn from a limited amount of data. Furthermore, when more data is available, the model outperforms other State-of-the-art approaches; in particular, Amatriciana decreases the number of False Positives (FPs) while detecting many launderers. In summary, Amatriciana achieves an F1 score of 0.76. In addition, it lowers the FPs by 55% with respect to other State-of-the-art models.
[LG-121] Discrete and Continuous Difference of Submodular Minimization
链接: https://arxiv.org/abs/2506.07952
作者: George Orfanides,Tim Hoheisel,Marwa El Halabi
类目: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of two submodular (DS) functions, over both domains, extending prior work restricted to set functions. We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares.
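For intuition, a minimal sketch of the classic DCA iteration the paper builds its variant on: for f = g - h with g, h convex, linearize h at the current point and minimize the resulting convex model. The toy problem below is illustrative, not the paper's DS benchmark:

```python
# DCA: x_{k+1} = argmin_x g(x) - <grad_h(x_k), x>.
# Toy problem: g(x) = 2||x||^2, h(x) = ||x - b||^2, so f is convex with minimizer -b.
import numpy as np

b = np.array([3.0, -1.0])
x = np.zeros(2)
for _ in range(50):
    grad_h = 2.0 * (x - b)   # gradient of h at x_k
    x = grad_h / 4.0         # closed-form argmin of 2||x||^2 - <grad_h, x>
print(x)                     # -> [-3.,  1.] = -b
```

Each DCA step solves a convex subproblem, so f decreases monotonically; the paper's contribution is extending this scheme to differences of submodular functions over both discrete and continuous domains.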
[LG-122] Deep reinforcement learning for near-deterministic preparation of cubic- and quartic-phase gates in photonic quantum computing
链接: https://arxiv.org/abs/2506.07859
作者: Amanuel Anteneh,Léandre Brunel,Carlos González-Arciniegas,Olivier Pfister
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Cubic-phase states are a sufficient resource for universal quantum computing over continuous variables. We present results from numerical experiments in which deep neural networks are trained via reinforcement learning to control a quantum optical circuit for generating cubic-phase states, with an average success rate of 96%. The only non-Gaussian resource required is photon-number-resolving measurements. We also show that the exact same resources enable the direct generation of a quartic-phase gate, with no need for a cubic gate decomposition.
[LG-123] Conditional Local Independence Testing with Application to Dynamic Causal Discovery
链接: https://arxiv.org/abs/2506.07844
作者: Mingzhou Liu,Xinwei Sun,Yizhou Wang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: Working paper
Abstract:In this note, we extend the conditional local independence testing theory developed in Christgau et al. (2024) to Ito processes. The result can be applied to causal discovery in dynamic systems.
[LG-124] Accelerating Constrained Sampling: A Large Deviations Approach
链接: https://arxiv.org/abs/2506.07816
作者: Yingli Wang,Changwei Tu,Xiaoyu Wang,Lingjiong Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 40 pages, 7 figures
Abstract:The problem of sampling a target probability distribution on a constrained domain arises in many applications including machine learning. For constrained sampling, various Langevin algorithms, such as projected Langevin Monte Carlo (PLMC) based on the discretization of reflected Langevin dynamics (RLD) and, more generally, skew-reflected non-reversible Langevin Monte Carlo (SRNLMC) based on the discretization of skew-reflected non-reversible Langevin dynamics (SRNLD), have been proposed and studied in the literature. This work focuses on the long-time behavior of SRNLD, where a skew-symmetric matrix is added to RLD. Although the non-asymptotic convergence analysis for SRNLD (and SRNLMC) and the acceleration compared to RLD (and PLMC) have been studied in the literature, it is not clear how one should design the skew-symmetric matrix in the dynamics to achieve good performance in practice. We establish a large deviation principle (LDP) for the empirical measure of SRNLD when the skew-symmetric matrix is chosen such that its product with the inward unit normal vector field on the boundary is zero. By explicitly characterizing the rate functions, we show that SRNLD can accelerate the convergence to the target distribution compared to RLD with this choice of the skew-symmetric matrix. Numerical experiments for SRNLMC based on the proposed skew-symmetric matrix show superior performance, validating the theoretical findings from the large deviations theory.
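A hedged sketch of one non-reversible Langevin step with a skew-symmetric drift term and a simple projection onto a box constraint; the paper's design requires the skew matrix to annihilate the inward normal on the boundary, which the toy J below does not guarantee in general:

```python
# One step of (projected) non-reversible Langevin: drift (I + J) grad log pi,
# Gaussian noise, then projection back onto the constraint set.
import numpy as np

def srnl_step(x, grad_logpi, J, step, rng):
    drift = (np.eye(len(x)) + J) @ grad_logpi(x)   # non-reversible drift
    y = x + step * drift + np.sqrt(2 * step) * rng.standard_normal(len(x))
    return np.clip(y, -1.0, 1.0)                   # projection onto the box [-1, 1]^d

rng = np.random.default_rng(0)
J = np.array([[0.0, 1.0], [-1.0, 0.0]])            # skew-symmetric: J^T = -J
grad_logpi = lambda x: -x                          # standard Gaussian target
x = np.zeros(2)
samples = []
for _ in range(5000):
    x = srnl_step(x, grad_logpi, J, step=0.05, rng=rng)
    samples.append(x.copy())
print(np.mean(samples, axis=0))                    # empirical mean near the origin
```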
[LG-125] A weighted quantum ensemble of homogeneous quantum classifiers
链接: https://arxiv.org/abs/2506.07810
作者: Emiliano Tolotti,Enrico Blanzieri,Davide Pastorello
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 21 pages, 4 figures
Abstract:Ensemble methods in machine learning aim to improve prediction accuracy by combining multiple models. This is achieved by ensuring diversity among predictors to capture different data aspects. Homogeneous ensembles use identical models, achieving diversity through different data subsets, while weighted-average ensembles assign higher influence to more accurate models through a weight learning procedure. We propose a method to achieve a weighted homogeneous quantum ensemble using quantum classifiers with indexing registers for data encoding. This approach leverages instance-based quantum classifiers, enabling feature and training point subsampling through superposition and controlled unitaries, and allowing for a quantum-parallel execution of diverse internal classifiers with different data compositions in superposition. The method integrates a learning process involving circuit execution and classical weight optimization; at test time, the trained ensemble is executed with the learned weights encoded in the circuit. Empirical evaluations demonstrate the effectiveness of the proposed method, offering insights into its performance.
[LG-126] Diffusion Models-Aided Uplink Channel Estimation for RIS-Assisted Systems
链接: https://arxiv.org/abs/2506.07770
作者: Yang Wang,Yin Xu,Cixiao Zhang,Zhiyong Chen,Xiaowu Ou,Mingzeng Dai,Meixia Tao,Wenjun Zhang
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages
Abstract:This letter proposes a channel estimation method for reconfigurable intelligent surface (RIS)-assisted systems through a novel diffusion model (DM) framework. We reformulate the channel estimation problem as a denoising process, which aligns with the reverse process of the DM. To overcome the inherent randomness in the reverse process of conventional DM approaches, we adopt a deterministic sampling strategy with a step alignment mechanism that ensures the accuracy of channel estimation while adapting to different signal-to-noise ratios (SNRs). Furthermore, to reduce the number of parameters of the U-Net, we meticulously design a lightweight network that achieves comparable performance, thereby enhancing the practicality of our proposed method. Extensive simulations demonstrate superior performance over a wide range of SNRs compared to baselines. For instance, the proposed method achieves performance improvements of up to 13.5 dB in normalized mean square error (NMSE) at SNR = 0 dB. Notably, the proposed lightweight network exhibits almost no performance loss compared to the original U-Net, while requiring only 6.59% of its parameters.
[LG-127] Quickest Causal Change Point Detection by Adaptive Intervention
链接: https://arxiv.org/abs/2506.07760
作者: Haijie Xu,Chen Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We propose an algorithm for change point monitoring in linear causal models that accounts for interventions. Through a special centralization technique, we can concentrate the changes arising from causal propagation across nodes into a single dimension. Additionally, by selecting appropriate intervention nodes based on Kullback-Leibler divergence, we can amplify the change magnitude. We also present an algorithm for selecting the intervention values, which aids in the identification of the most effective intervention nodes. Two monitoring methods are proposed, each with an adaptive intervention policy to strike a balance between exploration and exploitation. We theoretically demonstrate the first-order optimality of the proposed methods and validate their properties using simulation datasets and two real-world case studies.
[LG-128] Rao-Blackwellised Reparameterisation Gradients
链接: https://arxiv.org/abs/2506.07687
作者: Kevin Lam,Thang Bui,George Deligiannidis,Yee Whye Teh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Latent Gaussian variables have been popularised in probabilistic machine learning. In turn, gradient estimators are the machinery that facilitates gradient-based optimisation for models with latent Gaussian variables. The reparameterisation trick is often used as the default estimator as it is simple to implement and yields low-variance gradients for variational inference. In this work, we propose the R2-G2 estimator as the Rao-Blackwellisation of the reparameterisation gradient estimator. Interestingly, we show that the local reparameterisation gradient estimator for Bayesian MLPs is an instance of the R2-G2 estimator and Rao-Blackwellisation. This lets us extend the benefits of Rao-Blackwellised gradients to a suite of probabilistic models. We show that initial training with R2-G2 consistently yields better performance in models with multiple applications of the reparameterisation trick.
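【代码示意】R2-G2 是对标准重参数化梯度估计器的 Rao-Blackwell 化,论文未给出可直接照搬的闭式;这里仅用 PyTorch 演示被 Rao-Blackwell 化的基础对象,即标准重参数化梯度:

```python
import torch

mu = torch.zeros(5, requires_grad=True)          # 变分参数(占位)
log_sigma = torch.zeros(5, requires_grad=True)

def f(z):
    return (z ** 2).sum(dim=-1)                  # 任意可微目标(占位)

eps = torch.randn(64, 5)                         # 基础噪声 eps ~ N(0, I)
z = mu + log_sigma.exp() * eps                   # 重参数化采样 z = mu + sigma * eps
loss = f(z).mean()                               # E[f(z)] 的蒙特卡洛估计
loss.backward()                                  # 梯度经确定性路径反传到 mu、log_sigma
```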
[LG-129] Poisson Midpoint Method for Log Concave Sampling: Beyond the Strong Error Lower Bounds
链接: https://arxiv.org/abs/2506.07614
作者: Rishikesh Srinivasan,Dheeraj Nagaraj
类目: Probability (math.PR); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We study the problem of sampling from strongly log-concave distributions over $\mathbb{R}^d$ using the Poisson midpoint discretization (a variant of the randomized midpoint method) for overdamped/underdamped Langevin dynamics. We prove its convergence in the 2-Wasserstein distance ($W_2$), achieving a cubic speedup in dependence on the target accuracy $\epsilon$ over the Euler-Maruyama discretization, surpassing existing bounds for randomized midpoint methods. Notably, in the case of underdamped Langevin dynamics, we demonstrate that the complexity of $W_2$ convergence is much smaller than the complexity lower bounds for convergence in $L^2$ strong error established in the literature.
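【代码示意】下面给出过阻尼 Langevin 动力学的经典 randomized midpoint 离散化一步的草图,便于与 Euler-Maruyama 对照(假设性示意:论文研究的 Poisson midpoint 是该方法的变体,具体构造不同,此处不复现):

```python
import numpy as np

def randomized_midpoint_langevin(x, grad_f, h, n_steps, rng=np.random.default_rng(0)):
    # 目标动力学 dX = -grad_f(X) dt + sqrt(2) dW;x 为 1 维数组。
    # 假设性示意:仅演示经典 randomized midpoint,一步内中点与终点共享同一布朗路径。
    d = x.shape[0]
    for _ in range(n_steps):
        a = rng.uniform()                                           # 随机中点位置 a ~ U(0,1)
        w_mid = np.sqrt(a * h) * rng.standard_normal(d)             # W(a*h)
        w_end = w_mid + np.sqrt((1 - a) * h) * rng.standard_normal(d)  # W(h),与中点相关
        x_mid = x - a * h * grad_f(x) + np.sqrt(2.0) * w_mid        # 中点状态
        x = x - h * grad_f(x_mid) + np.sqrt(2.0) * w_end            # 整步用中点梯度
    return x
```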
[LG-130] Decentralized Optimization on Compact Submanifolds by Quantized Riemannian Gradient Tracking
链接: https://arxiv.org/abs/2506.07351
作者: Jun Chen,Lina Liu,Tianyi Zhu,Yong Liu,Guang Dai,Yunliang Jiang,Ivor W. Tsang
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:This paper considers the problem of decentralized optimization on compact submanifolds, where a finite sum of smooth (possibly non-convex) local functions is minimized by $n$ agents forming an undirected and connected graph. However, the efficiency of distributed optimization is often hindered by communication bottlenecks. To mitigate this, we propose the Quantized Riemannian Gradient Tracking (Q-RGT) algorithm, where agents update their local variables using quantized gradients. The introduction of quantization noise allows our algorithm to bypass the constraints of the accurate Riemannian projection operator (such as retraction), further improving iterative efficiency. To the best of our knowledge, this is the first algorithm to achieve an $\mathcal{O}(1/K)$ convergence rate in the presence of quantization, matching the convergence rate of methods without quantization. Additionally, we explicitly derive lower bounds on decentralized consensus associated with a function of quantization levels. Numerical experiments demonstrate that Q-RGT performs comparably to non-quantized methods while reducing communication bottlenecks and computational overhead.
[LG-131] Uncertainty-Aware Strategies: A Model-Agnostic Framework for Robust Financial Optimization through Subsampling
链接: https://arxiv.org/abs/2506.07299
作者: Hans Buehler,Blanka Horvath,Yannick Limmer,Thorsten Schmidt
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Mathematical Finance (q-fin.MF); Risk Management (q-fin.RM)
*备注: 18 pages, 12 figures
Abstract:This paper addresses the challenge of model uncertainty in quantitative finance, where decisions in portfolio allocation, derivative pricing, and risk management rely on estimating stochastic models from limited data. In practice, the unavailability of the true probability measure forces reliance on an empirical approximation, and even small misestimations can lead to significant deviations in decision quality. Building on the framework of Klibanoff et al. (2005), we enhance the conventional objective - whether this is expected utility in an investing context or a hedging metric - by superimposing an outer “uncertainty measure”, motivated by traditional monetary risk measures, on the space of models. In scenarios where a natural model distribution is lacking or Bayesian methods are impractical, we propose an ad hoc subsampling strategy, analogous to bootstrapping in statistical finance and related to mini-batch sampling in deep learning, to approximate model uncertainty. To address the quadratic memory demands of naive implementations, we also present an adapted stochastic gradient descent algorithm that enables efficient parallelization. Through analytical, simulated, and empirical studies - including multi-period, real data and high-dimensional examples - we demonstrate that uncertainty measures outperform traditional mixture of measures strategies and our model-agnostic subsampling-based approach not only enhances robustness against model risk but also achieves performance comparable to more elaborate Bayesian methods.
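【代码示意】下面用 numpy 勾勒文中"子抽样近似模型不确定性"的思路:对每个 bootstrap 子样本计算一次目标值,再用外层不确定性度量汇总(假设性示意:losses_fn、外层取 CVaR 及各参数均为占位,非论文给定形式):

```python
import numpy as np

def subsampled_uncertainty_objective(losses_fn, data, n_sub=32, frac=0.5,
                                     alpha=0.9, rng=np.random.default_rng(1)):
    # 假设性示意:data 为 np.ndarray,losses_fn(batch) 返回该子样本下的
    # 标量目标(如对冲损失);外层用 CVaR_alpha 汇总,对应文中叠加在
    # 模型空间上的货币型风险度量思想。
    n = len(data)
    vals = []
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=True)   # bootstrap 子样本
        vals.append(losses_fn(data[idx]))                       # 该"模型"下的目标值
    vals = np.sort(np.asarray(vals))
    tail = vals[int(alpha * n_sub):]                            # 最差的 (1-alpha) 尾部
    return tail.mean()                                          # 对模型不确定性的 CVaR
```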
[LG-132] ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition
链接: https://arxiv.org/abs/2506.07259
作者: Daolang Huang,Xinyi Wen,Ayush Bharti,Samuel Kaski,Luigi Acerbi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 13 figures
Abstract:Many critical applications, from autonomous scientific discovery to personalized medicine, demand systems that can both strategically acquire the most informative data and instantaneously perform inference based upon it. While amortized methods for Bayesian inference and experimental design offer part of the solution, neither approach is optimal in the most general and challenging task, where new data needs to be collected for instant inference. To tackle this issue, we introduce the Amortized Active Learning and Inference Engine (ALINE), a unified framework for amortized Bayesian inference and active data acquisition. ALINE leverages a transformer architecture trained via reinforcement learning with a reward based on self-estimated information gain provided by its own integrated inference component. This allows it to strategically query informative data points while simultaneously refining its predictions. Moreover, ALINE can selectively direct its querying strategy towards specific subsets of model parameters or designated predictive tasks, optimizing for posterior estimation, data prediction, or a mixture thereof. Empirical results on regression-based active learning, classical Bayesian experimental design benchmarks, and a psychometric model with selectively targeted parameters demonstrate that ALINE delivers both instant and accurate inference along with efficient selection of informative points.
[LG-133] Quantile-Optimal Policy Learning under Unmeasured Confounding
链接: https://arxiv.org/abs/2506.07140
作者: Zhongren Chen,Siyu Chen,Zhengling Qi,Xiaohong Chen,Zhuoran Yang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:
Abstract:We study quantile-optimal policy learning where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$. We focus on the offline setting whose generating process involves unobserved confounders. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding issue, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), using causal inference tools such as instrumental variables and negative controls, we propose to estimate the quantile objectives by solving nonlinear functional integral equations. Then we adopt a minimax estimation approach with nonparametric models to solve these integral equations, and propose to construct conservative policy estimates that address (iii). The final policy is the one that maximizes these pessimistic estimates. In addition, we propose a novel regularized policy learning method that is more amenable to computation. Finally, we prove that the policies learned by these methods are $\tilde{\mathscr{O}}(n^{-1/2})$ quantile-optimal under a mild coverage assumption on the offline dataset. Here, $\tilde{\mathscr{O}}(\cdot)$ omits poly-logarithmic factors. To the best of our knowledge, we propose the first sample-efficient policy learning algorithms for estimating the quantile-optimal policy when there exists unmeasured confounding.
[LG-134] Inverse Design of Metamaterials with Manufacturing-Guiding Spectrum-to-Structure Conditional Diffusion Model
链接: https://arxiv.org/abs/2506.07083
作者: Jiawen Li,Jiang Guo,Yuanzhe Li,Zetian Mao,Jiaxing Shen,Tashi Xu,Diptesh Das,Jinming He,Run Hu,Yaerim Lee,Koji Tsuda,Junichiro Shiomi
类目: Optics (physics.optics); Machine Learning (cs.LG)
*备注: 20 pages, 7 figures
Abstract:Metamaterials are artificially engineered structures that manipulate electromagnetic waves, having optical properties absent in natural materials. Recently, machine learning for the inverse design of metamaterials has drawn attention. However, the highly nonlinear relationship between the metamaterial structures and optical behaviour, coupled with fabrication difficulties, poses challenges for using machine learning to design and manufacture complex metamaterials. Herein, we propose a general framework that implements customised spectrum-to-shape and size parameters to address one-to-many metamaterial inverse design problems using conditional diffusion models. Our method exhibits superior spectral prediction accuracy, generates a diverse range of patterns compared to other typical generative models, and offers valuable prior knowledge for manufacturing through the subsequent analysis of the diverse generated results, thereby facilitating the experimental fabrication of metamaterial designs. We demonstrate the efficacy of the proposed method by successfully designing and fabricating a free-form metamaterial with a tailored selective emission spectrum for thermal camouflage applications.
[LG-135] Half-AVAE: Adversarial-Enhanced Factorized and Structured Encoder-Free VAE for Underdetermined Independent Component Analysis
链接: https://arxiv.org/abs/2506.07011
作者: Yuan-Hao Wei,Yan-Jie Sun
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:This study advances the Variational Autoencoder (VAE) framework by addressing challenges in Independent Component Analysis (ICA) under both determined and underdetermined conditions, focusing on enhancing the independence and interpretability of latent variables. Traditional VAEs map observed data to latent variables and back via an encoder-decoder architecture, but struggle with underdetermined ICA where the number of latent variables exceeds observed signals. The proposed Half Adversarial VAE (Half-AVAE) builds on the encoder-free Half-VAE framework, eliminating explicit inverse mapping to tackle underdetermined scenarios. By integrating adversarial networks and External Enhancement (EE) terms, Half-AVAE promotes mutual independence among latent dimensions, achieving factorized and interpretable representations. Experiments with synthetic signals demonstrate that Half-AVAE outperforms baseline models, including GP-AVAE and Half-VAE, in recovering independent components under underdetermined conditions, as evidenced by lower root mean square errors. The study highlights the flexibility of VAEs in variational inference, showing that encoder omission, combined with adversarial training and structured priors, enables effective solutions for complex ICA tasks, advancing applications in disentanglement, causal inference, and generative modeling.
[LG-136] Conditional Denoising Diffusion for ISAC Enhanced Channel Estimation in Cell-Free 6G
链接: https://arxiv.org/abs/2506.06942
作者: Mohammad Farzanullah,Han Zhang,Akram Bin Sediq,Ali Afana,Melike Erol-Kantarci
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: IEEE PIMRC conference, 6 pages, 6 figures
Abstract:Cell-free Integrated Sensing and Communication (ISAC) aims to revolutionize 6th Generation (6G) networks. By combining distributed access points with ISAC capabilities, it boosts spectral efficiency, situational awareness, and communication reliability. Channel estimation is a critical step in cell-free ISAC systems to ensure reliable communication, but its performance is usually limited by challenges such as pilot contamination and noisy channel estimates. This paper presents a novel framework leveraging sensing information as a key input within a Conditional Denoising Diffusion Model (CDDM). In this framework, we integrate CDDM with a Multimodal Transformer (MMT) to enhance channel estimation in ISAC-enabled cell-free systems. The MMT encoder effectively captures inter-modal relationships between sensing and location data, enabling the CDDM to iteratively denoise and refine channel estimates. Simulation results demonstrate that the proposed approach achieves significant performance gains. As compared with Least Squares (LS) and Minimum Mean Squared Error (MMSE) estimators, the proposed model achieves normalized mean squared error (NMSE) improvements of 8 dB and 9 dB, respectively. Moreover, we achieve a 27.8% NMSE improvement compared to the traditional denoising diffusion model (TDDM), which does not incorporate sensing channel information. Additionally, the model exhibits higher robustness against pilot contamination and maintains high accuracy under challenging conditions, such as low signal-to-noise ratios (SNRs). According to the simulation results, the model performs well for users near sensing targets by leveraging the correlation between sensing and communication channels.
[LG-137] Graph Neural Networks in Modern AI-aided Drug Discovery
链接: https://arxiv.org/abs/2506.06915
作者: Odin Zhang,Haitao Lin,Xujun Zhang,Xiaorui Wang,Zhenxing Wu,Qing Ye,Weibo Zhao,Jike Wang,Kejun Ying,Yu Kang,Chang-yu Hsieh,Tingjun Hou
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Graph neural networks (GNNs), as topology/structure-aware models within deep learning, have emerged as powerful tools for AI-aided drug discovery (AIDD). By directly operating on molecular graphs, GNNs offer an intuitive and expressive framework for learning the complex topological and geometric features of drug-like molecules, cementing their role in modern molecular modeling. This review provides a comprehensive overview of the methodological foundations and representative applications of GNNs in drug discovery, spanning tasks such as molecular property prediction, virtual screening, molecular generation, biomedical knowledge graph construction, and synthesis planning. Particular attention is given to recent methodological advances, including geometric GNNs, interpretable models, uncertainty quantification, scalable graph architectures, and graph generative frameworks. We also discuss how these models integrate with modern deep learning approaches, such as self-supervised learning, multi-task learning, meta-learning and pre-training. Throughout this review, we highlight the practical challenges and methodological bottlenecks encountered when applying GNNs to real-world drug discovery pipelines, and conclude with a discussion on future directions.
[LG-138] The Currents of Conflict: Decomposing Conflict Trends with Gaussian Processes
链接: https://arxiv.org/abs/2506.06828
作者: Simon P. von der Maase
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注: Total Words: 8122, Total pages: 28, Total figures: 6, Total Tables: 5
Abstract:I present a novel approach to estimating the temporal and spatial patterns of violent conflict. I show how we can use highly temporally and spatially disaggregated data on conflict events in tandem with Gaussian processes to estimate temporospatial conflict trends. These trends can be studied to gain insight into conflict traps, diffusion, and temporospatial conflict exposure in general; they can also be used to control for such phenomena in other estimation tasks; lastly, the approach allows us to extrapolate the estimated temporospatial conflict patterns into future temporal units, thus facilitating powerful, state-of-the-art conflict forecasts. Importantly, these results are achieved via a relatively parsimonious framework using only one data source: past conflict patterns.
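【代码示意】下面给出用高斯过程回归拟合 (经度, 纬度, 时间) 网格上冲突强度并向未来外推的极简草图(假设性示意:核函数构造与数据均为占位,与论文设定未必一致):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# 假设性示意:X 的每行是 (lon, lat, t),y 为该时空单元的冲突强度(占位数据)。
X = np.random.rand(500, 3)
y = np.random.rand(500)

kernel = RBF(length_scale=[1.0, 1.0, 5.0]) + WhiteKernel()   # 空间/时间各自的长度尺度
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_future = np.random.rand(10, 3)                             # 未来时间单元上的网格点
mean, std = gp.predict(X_future, return_std=True)            # 外推趋势及其不确定性
```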
[LG-139] Continuous Semi-Implicit Models ICML2025
链接: https://arxiv.org/abs/2506.06778
作者: Longlin Yu,Jiajun Zha,Tong Yang,Tianyu Xie,Xiangyu Zhang,S.-H. Gary Chan,Cheng Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 8 figures, ICML 2025
Abstract:Semi-implicit distributions have shown great promise in variational inference and generative modeling. Hierarchical semi-implicit models, which stack multiple semi-implicit layers, enhance the expressiveness of semi-implicit distributions and can be used to accelerate diffusion models given pretrained score networks. However, their sequential training often suffers from slow convergence. In this paper, we introduce CoSIM, a continuous semi-implicit model that extends hierarchical semi-implicit models into a continuous framework. By incorporating a continuous transition kernel, CoSIM enables efficient, simulation-free training. Furthermore, we show that CoSIM achieves consistency with a carefully designed transition kernel, offering a novel approach for multistep distillation of generative models at the distributional level. Extensive experiments on image generation demonstrate that CoSIM performs on par with or better than existing diffusion model acceleration methods, achieving superior performance on FD-DINOv2.
[LG-140] IQFM: A Wireless Foundational Model for I/Q Streams in AI-Native 6G
链接: https://arxiv.org/abs/2506.06718
作者: Omar Mashaal,Hatem Abou-Zeid
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Foundational models have shown remarkable potential in natural language processing and computer vision, yet remain in their infancy in wireless communications. While a few efforts have explored image-based modalities such as channel state information (CSI) and frequency spectrograms, foundational models that operate directly on raw IQ data remain largely unexplored. This paper presents IQFM, the first I/Q signal foundational model for wireless communications. IQFM supports diverse tasks: modulation classification, angle-of-arrival (AoA), beam prediction, and RF fingerprinting, all without heavy preprocessing or handcrafted features. We also introduce a task-aware augmentation strategy that categorizes transformations into core augmentations, such as cyclic time shifting, and task-specific augmentations. This strategy forms the basis for structured, task-dependent representation learning within a contrastive self-supervised learning (SSL) framework. Using this strategy, the lightweight encoder, pre-trained via SSL on over-the-air multi-antenna IQ data, achieves up to 99.67% and 65.45% accuracy on modulation and AoA classification, respectively, using only one labeled sample per class, outperforming supervised baselines by up to 7x and 145x. The model also generalizes to out-of-distribution tasks; when adapted to new tasks using only 500 samples per class and minimal parameter updates via LoRA, the same frozen encoder achieves 94.15% on beam prediction (vs. 89.53% supervised), 50.00% on RML2016a modulation classification (vs. 49.30%), and 96.05% on RF fingerprinting (vs. 96.64%). These results demonstrate the potential of raw IQ-based foundational models as efficient, reusable encoders for multi-task learning in AI-native 6G systems.
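【代码示意】文中点名的核心增广之一是循环时移;下面给出对原始 I/Q 流做该增广的草图(假设性示意:张量布局 (antennas, 2, T) 为占位约定,2 为 I/Q 两路):

```python
import numpy as np

def cyclic_time_shift(iq, rng=np.random.default_rng(0)):
    # 假设性示意:iq 形如 (antennas, 2, T);循环移位保持信号统计特性,
    # 论文的完整策略还含任务相关增广,此处只演示核心增广。
    shift = rng.integers(iq.shape[-1])          # 随机偏移量
    return np.roll(iq, shift, axis=-1)          # 时间轴上首尾相接地平移
```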
[LG-141] Explaining Risks: Axiomatic Risk Attributions for Financial Models
链接: https://arxiv.org/abs/2506.06653
作者: Dangxing Chen
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: This article has been accepted for publication in Quantitative Finance, published by Taylor & Francis
Abstract:In recent years, machine learning models have achieved great success at the expense of highly complex black-box structures. By using axiomatic attribution methods, we can fairly allocate the contributions of each feature, thus allowing us to interpret the model predictions. In high-risk sectors such as finance, risk is just as important as mean predictions. Throughout this work, we address the following risk attribution problem: how to fairly allocate the risk given a model with data? We demonstrate with analysis and empirical examples that risk can be well allocated by extending the Shapley value framework.
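【代码示意】下面给出按 Shapley 公式对风险度量做精确公理化分配的草图(假设性示意:risk_fn 为占位的"子集风险"函数,如仅用子集特征建模时预测残差的标准差;特征较多时应改用抽样近似):

```python
import itertools
import math
import numpy as np

def shapley_risk(n_features, risk_fn):
    # 假设性示意:risk_fn(S) 返回只用特征子集 S 时的风险值。
    # 按 Shapley 公式精确计算,复杂度随特征数指数增长。
    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for k in range(n_features):
            for S in itertools.combinations(others, k):
                w = (math.factorial(k) * math.factorial(n_features - k - 1)
                     / math.factorial(n_features))
                phi[i] += w * (risk_fn(set(S) | {i}) - risk_fn(set(S)))
    return phi   # 由有效性公理,各分量之和 = risk_fn(全集) - risk_fn(空集)
```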
[LG-142] Robust Learnability of Sample-Compressible Distributions under Noisy or Adversarial Perturbations
链接: https://arxiv.org/abs/2506.06613
作者: Arefe Boushehrian,Amir Najafi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 50 pages, 1 figure
Abstract:Learning distribution families over $\mathbb{R}^d$ is a fundamental problem in unsupervised learning and statistics. A central question in this setting is whether a given family of distributions possesses sufficient structure to be (at least) information-theoretically learnable and, if so, to characterize its sample complexity. In 2018, Ashtiani et al. reframed \emph{sample compressibility}, originally due to Littlestone and Warmuth (1986), as a structural property of distribution classes, proving that it guarantees PAC-learnability. This discovery subsequently enabled a series of recent advancements in deriving nearly tight sample complexity bounds for various high-dimensional open problems. It has been further conjectured that the converse also holds: every learnable class admits a tight sample compression scheme. In this work, we establish that sample compressible families remain learnable even from perturbed samples, subject to a set of necessary and sufficient conditions. We analyze two models of data perturbation: (i) an additive independent noise model, and (ii) an adversarial corruption model, where an adversary manipulates a limited subset of the samples unknown to the learner. Our results are general and rely on minimal assumptions. We develop a perturbation-quantization framework that interfaces naturally with the compression scheme and leads to sample complexity bounds that scale gracefully with the noise level and corruption budget. As concrete applications, we establish new sample complexity bounds for learning finite mixtures of high-dimensional uniform distributions under both noise and adversarial perturbations, as well as for learning Gaussian mixture models from adversarially corrupted samples, resolving two open problems in the literature.
[LG-143] Direct Fisher Score Estimation for Likelihood Maximization
链接: https://arxiv.org/abs/2506.06542
作者: Sherman Khoo,Yakun Wang,Song Liu,Mark Beaumont
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of likelihood maximization when the likelihood function is intractable but model simulations are readily available. We propose a sequential, gradient-based optimization method that directly models the Fisher score based on a local score matching technique which uses simulations from a localized region around each parameter iterate. By employing a linear parameterization for the surrogate score model, our technique admits a closed-form, least-squares solution. This approach yields a fast, flexible, and efficient approximation to the Fisher score, effectively smoothing the likelihood objective and mitigating the challenges posed by complex likelihood landscapes. We provide theoretical guarantees for our score estimator, including bounds on the bias introduced by the smoothing. Empirical results on a range of synthetic and real-world problems demonstrate the superior performance of our method compared to existing benchmarks.
[LG-144] Improving choice model specification using reinforcement learning
链接: https://arxiv.org/abs/2506.06410
作者: Gabriel Nova,Sander van Cranenburgh,Stephane Hess
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures
Abstract:Discrete choice modelling is a theory-driven modelling framework for understanding and forecasting choice behaviour. To obtain behavioural insights, modellers test several competing model specifications in their attempts to discover the ‘true’ data generation process. This trial-and-error process requires expertise, is time-consuming, and relies on subjective theoretical assumptions. Although metaheuristics have been proposed to assist choice modellers, they treat model specification as a classic optimisation problem, relying on static strategies, applying predefined rules, and neglecting outcomes from previous estimated models. As a result, current metaheuristics struggle to prioritise promising search regions, adapt exploration dynamically, and transfer knowledge to other modelling tasks. To address these limitations, we introduce a deep reinforcement learning-based framework where an ‘agent’ specifies models by estimating them and receiving rewards based on goodness-of-fit and parsimony. Results demonstrate the agent dynamically adapts its strategies to identify promising specifications across data generation processes, showing robustness and potential transferability, without prior domain knowledge.
[LG-145] Transformer-Based Decomposition of Electrodermal Activity for Real-World Mental Health Applications
链接: https://arxiv.org/abs/2506.06378
作者: Charalampos Tsirmpas,Stasinos Konstantopoulos,Dimitris Andrikopoulos,Konstantina Kyriakouli,Panagiotis Fatouros
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Decomposing Electrodermal Activity (EDA) into phasic (short-term, stimulus-linked responses) and tonic (longer-term baseline) components is essential for extracting meaningful emotional and physiological biomarkers. This study presents a comparative analysis of knowledge-driven, statistical, and deep learning-based methods for EDA signal decomposition, with a focus on in-the-wild data collected from wearable devices. In particular, the authors introduce the Feel Transformer, a novel Transformer-based model adapted from the Autoformer architecture, designed to separate phasic and tonic components without explicit supervision. The model leverages pooling and trend-removal mechanisms to enforce physiologically meaningful decompositions. Comparative experiments against methods such as Ledalab, cvxEDA, and conventional detrending show that the Feel Transformer achieves a balance between feature fidelity (SCR frequency, amplitude, and tonic slope) and robustness to noisy, real-world data. The model demonstrates potential for real-time biosignal analysis and future applications in stress prediction, digital mental health interventions, and physiological forecasting.
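【代码示意】作为摘要中提到的对照基线之一,下面给出"常规去趋势"式 tonic/phasic 分解的草图(假设性示意:采样率与截止频率为经验占位值,非论文参数):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def detrend_decompose(eda, fs=4.0, cutoff=0.05):
    # 假设性示意:低通滤波估计缓变的 tonic 基线,残差视为
    # 刺激相关的 phasic 成分;cutoff 为经验占位值。
    b, a = butter(2, cutoff / (fs / 2), btype="low")   # 二阶低通滤波器
    tonic = filtfilt(b, a, eda)                        # 零相位滤波得到基线
    phasic = eda - tonic                               # 短时反应成分
    return tonic, phasic
```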
[LG-146] ChemGraph: An Agentic Framework for Computational Chemistry Workflows
链接: https://arxiv.org/abs/2506.06363
作者: Thang D. Pham,Aditya Tanikanti,Murat Keçeli
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Atomistic simulations are essential tools in chemistry and materials science, accelerating the discovery of novel catalysts, energy storage materials, and pharmaceuticals. However, running these simulations remains challenging due to the wide range of computational methods, diverse software ecosystems, and the need for expert knowledge and manual effort for the setup, execution, and validation stages. In this work, we present ChemGraph, an agentic framework powered by artificial intelligence and state-of-the-art simulation tools to streamline and automate computational chemistry and materials science workflows. ChemGraph leverages graph neural network-based foundation models for accurate yet computationally efficient calculations and large language models (LLMs) for natural language understanding, task planning, and scientific reasoning to provide an intuitive and interactive interface. Users can perform tasks such as molecular structure generation, single-point energy, geometry optimization, vibrational analysis, and thermochemistry calculations with methods ranging from tight-binding and machine learning interatomic potentials to density functional theory or wave function theory-based methods. We evaluate ChemGraph across 13 benchmark tasks and demonstrate that smaller LLMs (GPT-4o-mini, Claude-3.5-haiku, Qwen2.5-14B) perform well on simple workflows, while more complex tasks benefit from using larger models like GPT-4o. Importantly, we show that decomposing complex tasks into smaller subtasks through a multi-agent framework enables smaller LLM models to match or exceed GPT-4o’s performance in specific scenarios.
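【代码示意】ChemGraph 的内部实现未公开;下面仅用 ASE 演示其自动化的一类典型子任务(结构生成 → 几何优化 → 单点能量)的形态(假设性示意:EMT 势仅作占位,实际应替换为文中的 GNN 基础模型或 DFT 计算器):

```python
from ase.build import molecule
from ase.calculators.emt import EMT
from ase.optimize import BFGS

# 假设性示意:EMT 为教学用占位势,非论文所用方法。
atoms = molecule("H2O")                      # 分子结构生成
atoms.calc = EMT()                           # 占位计算器
BFGS(atoms, logfile=None).run(fmax=0.05)     # 几何优化至力收敛
print(atoms.get_potential_energy())          # 弛豫结构的单点能量
```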
[LG-147] Towards Generalizable Drowsiness Monitoring with Physiological Sensors: A Preliminary Study
链接: https://arxiv.org/abs/2506.06360
作者: Jiyao Wang,Suzan Ayas,Jiahao Zhang,Xiao Wen,Dengbo He,Birsen Donmez
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Accepted by HFES2025
Abstract:Accurately detecting drowsiness is vital to driving safety. Among all measures, physiological-signal-based drowsiness monitoring can be more privacy-preserving than a camera-based approach. However, conflicts exist regarding how physiological metrics are associated with different drowsiness labels across datasets. Thus, we analyzed key features from electrocardiograms (ECG), electrodermal activity (EDA), and respiratory (RESP) signals across four datasets, where different drowsiness inducers (such as fatigue and low arousal) and assessment methods (subjective vs. objective) were used. Binary logistic regression models were built to identify the physiological metrics that are associated with drowsiness. Findings indicate that distinct drowsiness inducers can lead to different physiological responses, and objective assessments were more sensitive than subjective ones in detecting drowsiness. Further, increased heart rate stability, reduced respiratory amplitude, and decreased tonic EDA are robustly associated with increased drowsiness. The results enhance understanding of drowsiness detection and can inform future generalizable monitoring designs.
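【代码示意】下面给出文中分析所用的二元 logistic 回归建模的极简草图(假设性示意:特征与数据均为占位,odds ratio 仅用于解读各指标与困倦的关联方向):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 假设性示意:X 的列对应文中点名的指标(心率稳定性、呼吸幅度、tonic EDA),
# y 为二值困倦标签;此处用随机数据占位。
X = np.random.randn(200, 3)
y = np.random.randint(0, 2, 200)

model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
odds_ratios = np.exp(model[-1].coef_[0])   # >1 表示该指标升高与困倦正相关
print(odds_ratios)
```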
[LG-148] Multi-Platform Methane Plume Detection via Model and Domain Adaptation
链接: https://arxiv.org/abs/2506.06348
作者: Vassiliki Mancoridis,Brian Bue,Jake H. Lee,Andrew K. Thorpe,Daniel Cusworth,Alana Ayasse,Philip G. Brodrick,Riley Duren
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 12 pages, 8 figures. In review
Abstract:Prioritizing methane for near-term climate action is crucial due to its significant impact on global warming. Previous work used columnwise matched filter products from the airborne AVIRIS-NG imaging spectrometer to detect methane plume sources; convolutional neural networks (CNNs) discerned anthropogenic methane plumes from false positive enhancements. However, as an increasing number of remote sensing platforms are used for methane plume detection, there is a growing need to address cross-platform alignment. In this work, we describe model- and data-driven machine learning approaches that leverage airborne observations to improve spaceborne methane plume detection, reconciling the distributional shifts inherent with performing the same task across platforms. We develop a spaceborne methane plume classifier using data from the EMIT imaging spectroscopy mission. We refine classifiers trained on airborne imagery from AVIRIS-NG campaigns using transfer learning, outperforming the standalone spaceborne model. Finally, we use CycleGAN, an unsupervised image-to-image translation technique, to align the data distributions between airborne and spaceborne contexts. Translating spaceborne EMIT data to the airborne AVIRIS-NG domain using CycleGAN and applying airborne classifiers directly yields the best plume detection results. This methodology is useful not only for data simulation, but also for direct data alignment. Though demonstrated on the task of methane plume detection, our work more broadly demonstrates a data-driven approach to align related products obtained from distinct remote sensing instruments.
[LG-149] LD-RPMNet: Near-Sensor Diagnosis for Railway Point Machines
链接: https://arxiv.org/abs/2506.06346
作者: Wei Li,Xiaochun Wu,Xiaoxi Hu,Yuxuan Zhang,Sebastian Bader,Yuhan Huang
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been accepted for the IEEE Sensors Applications Symposium (SAS) 2025
Abstract:Near-sensor diagnosis has become increasingly prevalent in industry. This study proposes a lightweight model named LD-RPMNet that integrates Transformers and Convolutional Neural Networks, leveraging both local and global feature extraction to optimize computational efficiency for a practical railway application. The LD-RPMNet introduces a Multi-scale Depthwise Separable Convolution (MDSC) module, which decomposes cross-channel convolutions into pointwise and depthwise convolutions while employing multi-scale kernels to enhance feature extraction. Meanwhile, a Broadcast Self-Attention (BSA) mechanism is incorporated to simplify complex matrix multiplications and improve computational efficiency. Experimental results based on sound signals collected during the operation of railway point machines demonstrate that the optimized model reduces parameter count and computational complexity by 50% while improving diagnostic accuracy by nearly 3%, ultimately achieving an accuracy of 98.86%. This demonstrates the possibility of near-sensor fault diagnosis applications in railway point machines.
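【代码示意】下面按摘要描述勾勒 MDSC(多尺度深度可分离卷积)模块的一种可能形态(假设性示意:通道数、核尺寸与融合方式均为占位,与论文实现未必一致):

```python
import torch
import torch.nn as nn

class MDSCSketch(nn.Module):
    # 假设性示意:按摘要把跨通道卷积拆为 pointwise + depthwise,
    # 并用多尺度核;具体超参为占位。面向一维声音信号。
    def __init__(self, c_in, c_out, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.pointwise = nn.Conv1d(c_in, c_out, kernel_size=1)   # 跨通道混合
        self.branches = nn.ModuleList([
            nn.Conv1d(c_out, c_out, k, padding=k // 2, groups=c_out)  # 逐通道、多尺度
            for k in kernel_sizes
        ])

    def forward(self, x):
        x = self.pointwise(x)
        return sum(b(x) for b in self.branches) / len(self.branches)  # 平均融合各尺度

y = MDSCSketch(1, 32)(torch.randn(8, 1, 1024))   # (batch, channels, time)
```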
[LG-150] Uncertainty-Aware Multi-view Arrhythmia Classification from ECG IJCNN2024
链接: https://arxiv.org/abs/2506.06342
作者: Mohd Ashhad,Sana Rahmani,Mohammed Fayiz,Ali Etemad,Javad Hashemi
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This paper has been accepted to IJCNN 2024 conference
Abstract:We propose a deep neural architecture that performs uncertainty-aware multi-view classification of arrhythmia from ECG. Our method learns two different views (1D and 2D) of single-lead ECG to capture different types of information. We use a fusion technique to reduce the conflict between the different views caused by noise and artifacts in ECG data, thus incorporating uncertainty to obtain stronger final predictions. Our framework contains the following three modules: (1) a time-series module to learn the morphological features from ECG; (2) an image-space learning module to learn the spatiotemporal features; and (3) an uncertainty-aware fusion module to fuse the information from the two different views. Experimental results on two real-world datasets demonstrate that our framework not only improves the performance on arrhythmia classification compared to the state-of-the-art but also shows better robustness to noise and artifacts present in ECG.
[LG-151] Composite Reward Design in PPO-Driven Adaptive Filtering
链接: https://arxiv.org/abs/2506.06323
作者: Abdullah Burkan Bereketoglu
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 5 pages, 9 figures, 1 table. Keywords: Adaptive filtering, reinforcement learning, PPO, noise reduction, signal denoising
Abstract:Model-free and reinforcement learning-based adaptive filtering methods are gaining traction for denoising in dynamic, non-stationary environments such as wireless signal channels. Traditional filters like LMS, RLS, Wiener, and Kalman are limited by stationarity assumptions, or require complex fine-tuning, exact noise statistics, or fixed models. This letter proposes an adaptive filtering framework using Proximal Policy Optimization (PPO), guided by a composite reward that balances SNR improvement, MSE reduction, and residual smoothness. Experiments on synthetic signals with various noise types show that our PPO agent generalizes beyond its training distribution, achieving real-time performance and outperforming classical filters. This work demonstrates the viability of policy-gradient reinforcement learning for robust, low-latency adaptive signal filtering.
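【代码示意】下面给出"SNR 提升 + MSE 降低 + 残差平滑"复合奖励的一种可能写法(假设性示意:各项的归一化与权重 w 均为占位,摘要未给出具体形式):

```python
import numpy as np

def composite_reward(clean, denoised, noisy, w=(1.0, 1.0, 0.1)):
    # 假设性示意:clean/denoised/noisy 为一维信号;权重 w 为占位。
    mse = np.mean((denoised - clean) ** 2)
    snr_out = 10 * np.log10(np.mean(clean ** 2) / (mse + 1e-12))
    snr_in = 10 * np.log10(np.mean(clean ** 2)
                           / (np.mean((noisy - clean) ** 2) + 1e-12))
    smoothness = -np.mean(np.diff(denoised) ** 2)   # 惩罚锯齿状残差
    return w[0] * (snr_out - snr_in) - w[1] * mse + w[2] * smoothness
```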
[LG-152] A Novel Shape-Aware Topological Representation for GPR Data with DNN Integration
链接: https://arxiv.org/abs/2506.06311
作者: Meiyan Kang,Shizuo Kaji,Sang-Yun Lee,Taegon Kim,Hee-Hwan Ryu,Suyoung Choi
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 15 pages, 6 figures
Abstract:Ground Penetrating Radar (GPR) is a widely used Non-Destructive Testing (NDT) technique for subsurface exploration, particularly in infrastructure inspection and maintenance. However, conventional interpretation methods are often limited by noise sensitivity and a lack of structural awareness. This study presents a novel framework that enhances the detection of underground utilities, especially pipelines, by integrating shape-aware topological features derived from B-scan GPR images using Topological Data Analysis (TDA), with the spatial detection capabilities of the YOLOv5 deep neural network (DNN). We propose a novel shape-aware topological representation that amplifies structural features in the input data, thereby improving the model’s responsiveness to the geometrical features of buried objects. To address the scarcity of annotated real-world data, we employ a Sim2Real strategy that generates diverse and realistic synthetic datasets, effectively bridging the gap between simulated and real-world domains. Experimental results demonstrate significant improvements in mean Average Precision (mAP), validating the robustness and efficacy of our approach. This approach underscores the potential of TDA-enhanced learning in achieving reliable, real-time subsurface object detection, with broad applications in urban planning, safety inspection, and infrastructure management.
[LG-153] Enhancing Contrastive Learning-based Electrocardiogram Pretrained Model with Patient Memory Queue
链接: https://arxiv.org/abs/2506.06310
作者: Xiaoyu Sun,Yang Yang,Xunde Dong
类目: Signal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 8 pages, 4 figures
Abstract:In the field of automatic Electrocardiogram (ECG) diagnosis, due to the relatively limited amount of labeled data, how to build a robust ECG pretrained model based on unlabeled data is a key area of focus for researchers. Recent advancements in contrastive learning-based ECG pretrained models highlight the potential of exploiting the additional patient-level self-supervisory signals inherent in ECG. They are referred to as patient contrastive learning. Its rationale is that multiple physical recordings from the same patient may share commonalities, termed patient consistency, so redefining positive and negative pairs in contrastive learning as intra-patient and inter-patient samples provides more shared context to learn an effective representation. However, these methods still fail to efficiently exploit patient consistency due to the insufficient number of intra- and inter-patient samples available in a batch. Hence, we propose a contrastive learning-based ECG pretrained model enhanced by the Patient Memory Queue (PMQ), which incorporates a large patient memory queue to mitigate model degeneration that can arise from insufficient intra- and inter-patient samples. In order to further enhance the performance of the pretrained model, we introduce two extra data augmentation methods to provide more perspectives of positive and negative pairs for pretraining. Extensive experiments were conducted on three public datasets with three different data ratios. The experimental results show that the comprehensive performance of our method outperforms previous contrastive learning methods and exhibits greater robustness in scenarios with limited labeled data. The code is available at this https URL.
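【代码示意】下面给出 Patient Memory Queue 的极简草图(假设性示意:按 MoCo 风格的 FIFO 队列维护历史嵌入与病人 ID,正负样本按 ID 是否相同划分;细节以论文开源代码为准):

```python
import torch

class PatientMemoryQueue:
    # 假设性示意:feats 缓存历史嵌入,pids 缓存对应病人 ID(-1 表示空位)。
    def __init__(self, dim, size):
        self.feats = torch.zeros(size, dim)
        self.pids = torch.full((size,), -1, dtype=torch.long)
        self.ptr, self.size = 0, size

    def enqueue(self, feats, pids):
        n = feats.shape[0]                                       # 要求 n <= size
        idx = torch.arange(self.ptr, self.ptr + n) % self.size   # 环形 FIFO 写入
        self.feats[idx], self.pids[idx] = feats.detach(), pids
        self.ptr = (self.ptr + n) % self.size

    def masks(self, pids):
        valid = self.pids >= 0
        pos = (self.pids[None, :] == pids[:, None]) & valid      # 同病人 => 正样本
        return pos, (~pos) & valid                               # 异病人 => 负样本
```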
[LG-154] Leveraging Novel Ensemble Learning Techniques and Landsat Multispectral Data for Estimating Olive Yields in Tunisia
链接: https://arxiv.org/abs/2506.06309
作者: Mohamed Kefi,Tien Dat Pham,Thin Nguyen,Mark G. Tjoelker,Viola Devasirvatham,Kenichi Kashiwagi
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Olive production is an important tree crop in Mediterranean climates. However, olive yield varies significantly due to climate change. Accurately estimating yield using remote sensing and machine learning remains a complex challenge. In this study, we developed a streamlined pipeline for olive yield estimation in the Kairouan and Sousse governorates of Tunisia. We extracted features from multispectral reflectance bands and vegetation indices derived from Landsat-8 OLI and Landsat-9 OLI-2 satellite imagery, along with digital elevation model data. These spatial features were combined with ground-based field survey data to form a structured tabular dataset. We then developed an automated ensemble learning framework, implemented using AutoGluon, to train and evaluate multiple machine learning models, select optimal combinations through stacking, and generate robust yield predictions using five-fold cross-validation. The results demonstrate strong predictive performance from both sensors, with Landsat-8 OLI achieving R2 = 0.8635 and RMSE = 1.17 tons per ha, and Landsat-9 OLI-2 achieving R2 = 0.8378 and RMSE = 1.32 tons per ha. This study highlights a scalable, cost-effective, and accurate method for olive yield estimation, with potential applicability across diverse agricultural regions globally.
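【代码示意】下面给出用 AutoGluon 组建自动集成(含 stacking 与 5 折 bagging)的极简草图(假设性示意:文件名与列名均为占位):

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# 假设性示意:olive_train.csv 与列名 yield_t_ha 为占位;各列为反射率波段、
# 植被指数、DEM 特征与实测产量。AutoGluon 自动完成多模型训练与 stacking 组合。
train_df = pd.read_csv("olive_train.csv")
predictor = TabularPredictor(label="yield_t_ha", eval_metric="rmse").fit(
    train_df, presets="best_quality", num_bag_folds=5   # 5 折 bagging/stacking
)
print(predictor.leaderboard())                          # 查看各模型与集成的表现
```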
[LG-155] Scientific machine learning in Hydrology: a unified perspective
链接: https://arxiv.org/abs/2506.06308
作者: Adoubi Vincent De Paul Adombi
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:Scientific machine learning (SciML) provides a structured approach to integrating physical knowledge into data-driven modeling, offering significant potential for advancing hydrological research. In recent years, multiple methodological families have emerged, including physics-informed machine learning, physics-guided machine learning, hybrid physics-machine learning, and data-driven physics discovery. Within each of these families, a proliferation of heterogeneous approaches has developed independently, often without conceptual coordination. This fragmentation complicates the assessment of methodological novelty and makes it difficult to identify where meaningful advances can still be made in the absence of a unified conceptual framework. This review, the first focused overview of SciML in hydrology, addresses these limitations by proposing a unified methodological framework for each SciML family, bringing together representative contributions into a coherent structure that fosters conceptual clarity and supports cumulative progress in hydrological modeling. Finally, we highlight the limitations and future opportunities of each unified family to guide systematic research in hydrology, where these methods remain underutilized.
[LG-156] Template-Guided 3D Molecular Pose Generation via Flow Matching and Differentiable Optimization
链接: https://arxiv.org/abs/2506.06305
作者: Noémie Bergues,Arthur Carré,Paul Join-Lambert,Brice Hoffmann,Arnaud Blondel,Hamza Tajmouati
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Predicting the 3D conformation of small molecules within protein binding sites is a key challenge in drug design. When a crystallized reference ligand (template) is available, it provides geometric priors that can guide 3D pose prediction. We present a two-stage method for ligand conformation generation guided by such templates. In the first stage, we introduce a molecular alignment approach based on flow-matching to generate 3D coordinates for the ligand, using the template structure as a reference. In the second stage, a differentiable pose optimization procedure refines this conformation based on shape and pharmacophore similarities, internal energy, and, optionally, the protein binding pocket. We evaluate our approach on a new benchmark of ligand pairs co-crystallized with the same target and show that it outperforms standard docking tools and open-access alignment methods, especially in cases involving low similarity to the template or high ligand flexibility.
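【代码示意】下面给出第一阶段所依赖的 flow matching 训练目标的通用写法(假设性示意:v_theta 为占位的速度场网络;模板对齐的条件化方式与第二阶段可微位姿优化不在此展示):

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    # 假设性示意:x0 为初始坐标噪声,x1 为目标配体坐标,形如 (B, N_atoms, 3);
    # 线性插值路径的恒定速度 x1 - x0 作为回归目标(rectified-flow 式 CFM)。
    t = torch.rand(x0.shape[0], 1, 1)              # 每个样本独立采样时间 t ~ U(0,1)
    x_t = (1 - t) * x0 + t * x1                    # 路径上的中间状态
    target = x1 - x0                               # 该路径的速度场
    return ((v_theta(x_t, t) - target) ** 2).mean()
```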
信息检索
[IR-0] Leveraging Historical and Current Interests for Continual Sequential Recommendation
链接: https://arxiv.org/abs/2506.07466
作者: Gyuseok Lee,Hyunsik Yoo,Junyoung Hwang,SeongKu Kang,Hwanjo Yu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Sequential recommendation models based on the Transformer architecture show superior performance in harnessing long-range dependencies within user behavior via self-attention. However, naively updating them on continuously arriving non-stationary data streams incurs prohibitive computation costs or leads to catastrophic forgetting. To address this, we propose Continual Sequential Transformer for Recommendation (CSTRec) that effectively leverages well-preserved historical user interests while capturing current interests. At its core is Continual Sequential Attention (CSA), a linear attention mechanism that retains past knowledge without direct access to old data. CSA integrates two key components: (1) Cauchy-Schwarz Normalization that stabilizes training under uneven interaction frequencies, and (2) Collaborative Interest Enrichment that mitigates forgetting through shared, learnable interest pools. We further introduce a technique that facilitates learning for cold-start users by transferring historical knowledge from behaviorally similar existing users. Extensive experiments on three real-world datasets indicate that CSTRec outperforms state-of-the-art baselines in both knowledge retention and acquisition.
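【代码示意】CSA 建立在线性注意力之上;下面给出非因果线性注意力的通用草图(假设性示意:特征映射取 elu+1,论文的 Cauchy-Schwarz Normalization 与兴趣池细节未公开,此处不复现):

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # 假设性示意:q, k 形如 (B, N, d),v 形如 (B, N, e);
    # 先把 K,V 压缩为 O(d*e) 的汇总量,避免显式 N×N 注意力矩阵,
    # 这正是"无需直接访问旧数据也能保留历史知识"的机制基础。
    q, k = F.elu(q) + 1, F.elu(k) + 1                 # 非负特征映射
    kv = torch.einsum("bnd,bne->bde", k, v)           # 历史信息的运行汇总
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)  # 归一化输出
```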
[IR-1] Research Knowledge Graphs: the Shifting Paradigm of Scholarly Information Representation
链接: https://arxiv.org/abs/2506.07285
作者: Matthäus Zloch,Danilo Dessì,Jennifer D’Souza,Leyla Jael Castro,Benjamin Zapilko,Saurav Karmakar,Brigitte Mathiak,Markus Stocker,Wolfgang Otto,Sören Auer,Stefan Dietze
类目: Information Retrieval (cs.IR)
*备注: Extended Semantic Web Conference 2025, In-use track, 10 pages, 1 figure
Abstract:Sharing and reusing research artifacts, such as datasets, publications, or methods is a fundamental part of scientific activity, where heterogeneity of resources and metadata and the common practice of capturing information in unstructured publications pose crucial challenges. Reproducibility of research and finding state-of-the-art methods or data have become increasingly challenging. In this context, the concept of Research Knowledge Graphs (RKGs) has emerged, aiming at providing an easy to use and machine-actionable representation of research artifacts and their relations. That is facilitated through the use of established principles for data representation, the consistent adoption of globally unique persistent identifiers and the reuse and linking of vocabularies and data. This paper provides the first conceptualisation of the RKG vision, a categorisation of in-use RKGs together with a description of RKG building blocks and principles. We also survey real-world RKG implementations differing with respect to scale, schema, data, used vocabulary, and reliability of the contained data. We also characterise different RKG construction methodologies and provide a forward-looking perspective on the diverse applications, opportunities, and challenges associated with the RKG vision.
[IR-2] OneSug: The Unified End-to-End Generative Framework for E-commerce Query Suggestion
链接: https://arxiv.org/abs/2506.06913
作者: Xian Guo,Ben Chen,Siyuan Wang,Ying Yang,Chenyi Lei,Yuqing Ding,Han Li
类目: Information Retrieval (cs.IR)
*备注: 11 pages, 8 figures, and 6 tables
Abstract:Query suggestion plays a crucial role in enhancing user experience in e-commerce search systems by providing relevant query recommendations that align with users' initial input. This module helps users navigate towards personalized preference needs and reduces typing effort, thereby improving search experience. Traditional query suggestion modules usually adopt multi-stage cascading architectures to strike a good trade-off between system response time and business conversion, but they often suffer from inefficiencies and suboptimal performance due to inconsistent optimization objectives across stages. To address these issues, we propose OneSug, the first end-to-end generative framework for e-commerce query suggestion. OneSug incorporates a prefix2query representation enhancement module to enrich prefixes using semantically and interactively related queries to bridge content and business characteristics, an encoder-decoder generative model that unifies the query suggestion process, and a reward-weighted ranking strategy with behavior-level weights to capture fine-grained user preferences. Extensive evaluations on large-scale industry datasets demonstrate OneSug's ability to provide effective and efficient query suggestion. Furthermore, OneSug has been successfully deployed for the entire traffic on the e-commerce search engine in Kuaishou platform for over 1 month, with statistically significant improvements in user top click position (-9.33%), CTR (+2.01%), Order (+2.04%), and Revenue (+1.69%) over the online multi-stage strategy, showing great potential in e-commercial conversion.
[IR-3] The State-of-the-Art in Lifelog Retrieval: A Review of Progress at the ACM Lifelog Search Challenge Workshop 2022-24
链接: https://arxiv.org/abs/2506.06743
作者: Allie Tran,Werner Bailer,Duc-Tien Dang-Nguyen,Graham Healy,Steve Hodges,Björn Þór Jónsson,Luca Rossetto,Klaus Schoeffmann,Minh-Triet Tran,Lucia Vadicamo,Cathal Gurrin
类目: Multimedia (cs.MM); Information Retrieval (cs.IR)
*备注:
Abstract:The ACM Lifelog Search Challenge (LSC) is a venue that welcomes and compares systems that support the exploration of lifelog data, and in particular the retrieval of specific information, through an interactive competition format. This paper reviews the recent advances in interactive lifelog retrieval as demonstrated at the ACM LSC from 2022 to 2024. Through a detailed comparative analysis, we highlight key improvements across three main retrieval tasks: known-item search, question answering, and ad-hoc search. Our analysis identifies trends such as the widespread adoption of embedding-based retrieval methods (e.g., CLIP, BLIP), increased integration of large language models (LLMs) for conversational retrieval, and continued innovation in multimodal and collaborative search interfaces. We further discuss how specific retrieval techniques and user interface (UI) designs have impacted system performance, emphasizing the importance of balancing retrieval complexity with usability. Our findings indicate that embedding-driven approaches combined with LLMs show promise for lifelog retrieval systems. Likewise, improving UI design can enhance usability and efficiency. Additionally, we recommend reconsidering multi-instance system evaluations within the expert track to better manage variability in user familiarity and configuration effectiveness.
[IR-4] Research on E-Commerce Long-Tail Product Recommendation Mechanism Based on Large-Scale Language Models
链接: https://arxiv.org/abs/2506.06336
作者: Qingyi Lu,Haotian Lyu,Jiayun Zheng,Yang Wang,Li Zhang,Chengrui Zhou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:As e-commerce platforms expand their product catalogs, accurately recommending long-tail items becomes increasingly important for enhancing both user experience and platform revenue. A key challenge is the long-tail problem, where extreme data sparsity and cold-start issues limit the performance of traditional recommendation methods. To address this, we propose a novel long-tail product recommendation mechanism that integrates product text descriptions and user behavior sequences using a large-scale language model (LLM). First, we introduce a semantic visor, which leverages a pre-trained LLM to convert multimodal textual content such as product titles, descriptions, and user reviews into meaningful embeddings. These embeddings help represent item-level semantics effectively. We then employ an attention-based user intent encoder that captures users’ latent interests, especially toward long-tail items, by modeling collaborative behavior patterns. These components feed into a hybrid ranking model that fuses semantic similarity scores, collaborative filtering outputs, and LLM-generated recommendation candidates. Extensive experiments on a real-world e-commerce dataset show that our method outperforms baseline models in recall (+12%), hit rate (+9%), and user coverage (+15%). These improvements lead to better exposure and purchase rates for long-tail products. Our work highlights the potential of LLMs in interpreting product content and user intent, offering a promising direction for future e-commerce recommendation systems.
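【代码示意】下面给出三路得分(语义相似度、协同过滤输出、LLM 候选信号)线性融合排序的极简草图(假设性示意:归一化方式与权重均为占位,摘要未给出融合的具体形式):

```python
import numpy as np

def hybrid_score(sem_sim, cf_score, llm_candidate, w=(0.5, 0.4, 0.1)):
    # 假设性示意:三路得分先各自做 min-max 归一化再线性加权;
    # 权重 w 为占位,实际系统可能用学习到的融合函数。
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return w[0] * norm(sem_sim) + w[1] * norm(cf_score) + w[2] * norm(llm_candidate)
```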