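This post lists the latest papers retrieved from arXiv.org on 2025-07-08. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.
Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.
Reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.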
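Overview (2025-07-08)
1316 papers were updated today, including:
- Natural language processing: 205 papers (Computation and Language (cs.CL))
- Artificial intelligence: 405 papers (Artificial Intelligence (cs.AI))
- Computer vision: 329 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine learning: 382 papers (Machine Learning (cs.LG))
Natural Language Processing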
[NLP-0] Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
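[Quick read]: This paper addresses the lack of systematic memory evaluation for Large Language Model (LLM) agents, in particular the absence of a benchmark covering the core competencies of memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. The key to the solution is MemoryAgentBench, a benchmark designed specifically for memory agents that combines reformulated existing datasets with newly constructed ones to cover all four competencies, providing a systematic and challenging testbed for assessing and improving the memory quality of LLM agents.
Link: https://arxiv.org/abs/2507.05257
Authors: Yuanzhe Hu,Yu Wang,Julian McAuley
Affiliations: UC San Diego
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 23 Pages, Y. Hu and Y. Wang contribute equally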
Abstract:Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component-memory, encompassing how agents memorize, update, and retrieve long-term information-is under-evaluated due to the lack of benchmarks. We term agents with memory mechanisms as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored for static, long-context settings like book-based QA, which do not reflect the interactive, multi-turn nature of memory agents that incrementally accumulate information. Furthermore, no existing benchmarks cover all four competencies. Therefore, we introduce MemoryAgentBench, a new benchmark specifically designed for memory agents. Our benchmark combines reformulated existing datasets with newly constructed ones, covering the above four memory competencies, providing a systematic and challenging testbed for assessing memory quality. We evaluate a diverse set of memory agents, ranging from simple context-based and retrieval-augmented generation (RAG) systems to advanced agents with external memory modules and tool integration. Empirical results reveal that current methods fall short of mastering all four competencies, underscoring the need for further research into comprehensive memory mechanisms for LLM agents.
[NLP-1] Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning
[Quick read]: This paper investigates how to transfer the reasoning capability that large language models (LLMs) acquire through reinforcement learning with verifiable rewards to multimodal LLMs (MLLMs), in order to unlock advanced visual reasoning. The key to the solution is a two-stage paradigm: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, which surpasses all previous open-source efforts in scale and improves the model's visual reasoning.
Link: https://arxiv.org/abs/2507.05255
Authors: Yana Wei,Liang Zhao,Jianjian Sun,Kangheng Lin,Jisheng Yin,Jingcheng Hu,Yinmin Zhang,En Yu,Haoran Lv,Zejia Weng,Jia Wang,Chunrui Han,Yuang Peng,Qi Han,Zheng Ge,Xiangyu Zhang,Daxin Jiang,Vishal M. Patel
Affiliations: Johns Hopkins University; StepFun; BUPT; UCAS; THU; HUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, followed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps, surpassing all previous open-source efforts in scale. This pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.
[NLP-2] Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models
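[Quick read]: This paper studies a contextual priming vulnerability in large language models (LLMs), in which an earlier response in the dialogue can implicitly steer the model toward policy-violating content. The key to the approach is an attack called Response Attack (RA): an auxiliary LLM generates a mildly harmful response, which is formatted into the dialogue and followed by a succinct trigger prompt that leads the target model to generate harmful content.
Link: https://arxiv.org/abs/2507.05248
Authors: Ziqi Miao,Lijun Li,Yuan Xiong,Zhenhua Liu,Pengyu Zhu,Jing Shao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 9 figures. Code and data available at this https URL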
Abstract:Contextual priming, where earlier stimuli covertly bias later judgments, offers an unexplored attack surface for large language models (LLMs). We uncover a contextual priming vulnerability in which the previous response in the dialogue can steer its subsequent behavior toward policy-violating content. Building on this insight, we propose Response Attack, which uses an auxiliary LLM to generate a mildly harmful response to a paraphrased version of the original malicious query. They are then formatted into the dialogue and followed by a succinct trigger prompt, thereby priming the target model to generate harmful content. Across eight open-source and proprietary LLMs, RA consistently outperforms seven state-of-the-art jailbreak techniques, achieving higher attack success rates. To mitigate this threat, we construct and release a context-aware safety fine-tuning dataset, which significantly reduces the attack success rate while preserving model capabilities. The code and data are available at this https URL.
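[NLP-3] When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors
[Quick read]: This paper addresses safety risks in runtime monitoring of AI systems, specifically how to detect and prevent severe harm. It notes that chain-of-thought (CoT) monitoring is limited by the "unfaithfulness" problem, and argues that the key property for this setting is monitorability rather than faithfulness. The paper introduces a conceptual framework that distinguishes CoT as post-hoc rationalization (CoT-as-rationalization) from CoT as computation (CoT-as-computation), and expects that certain severe harms will require complex multi-step reasoning and therefore CoT-as-computation. Increasing the difficulty of the bad behavior forces the model to expose its reasoning, which makes it monitorable; the paper also presents guidelines for stress-testing CoT monitoring.
Link: https://arxiv.org/abs/2507.05246
Authors: Scott Emmons,Erik Jenner,David K. Elson,Rif A. Saurous,Senthooran Rajamanoharan,Heng Chen,Irhum Shafkat,Rohin Shah
Affiliations: Google
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: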
Abstract:While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on “unfaithfulness” has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present methodology guidelines to stress-test CoT monitoring against deliberate evasion. Applying these guidelines, we find that models can learn to obscure their intentions, but only when given significant help, such as detailed human-written strategies or iterative optimization against the monitor. We conclude that, while not infallible, CoT monitoring offers a substantial layer of defense that requires active protection and continued stress-testing.
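[NLP-4] SciMaster: Towards General-Purpose Scientific AI Agents Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?
[Quick read]: This paper aims to construct the foundational architecture for general-purpose agents and to validate it through leading performance on Humanity's Last Exam (HLE). The key to the solution is X-Master, a tool-augmented reasoning agent that emulates human researchers by interacting flexibly with external tools during reasoning, drawing on built-in Python libraries and customized tools. X-Masters, a scattered-and-stacked agentic workflow, further extends the breadth and depth of reasoning and scores 32.1% on HLE, the first system to exceed the 30% threshold.
Link: https://arxiv.org/abs/2507.05241
Authors: Jingyi Chai,Shuo Tang,Rui Ye,Yuwen Du,Xinyu Zhu,Mengcheng Zhou,Yanfeng Wang,Weinan E,Siheng Chen
Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 7 figures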
Abstract:The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity’s Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI’s and Google’s Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
[NLP-5] Logit Reweighting for Topic-Focused Summarization
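[Quick read]: This paper addresses the challenge of generating abstractive summaries that adhere to a specific topic, especially under resource constraints, where fine-tuning is costly and simple prompt engineering struggles to maintain topical focus. The key to the solution is to directly reweight the logits of topic-relevant tokens during generation, using three techniques: Constant Shift, Factor Scaling, and Threshold Selection. Experiments show that these methods increase the use of topic-relevant vocabulary, and that Threshold Selection improves topical focus without compromising summary quality.
Link: https://arxiv.org/abs/2507.05235
Authors: Joschka Braun,Bálint Mucsányi,Seyed Ali Bahrainian
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages, 13 figures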
Abstract:Generating abstractive summaries that adhere to a specific topic remains a significant challenge for language models. While standard approaches, such as fine-tuning, are resource-intensive, simpler methods like prompt engineering often struggle to maintain topical focus, particularly with smaller models. To address this, we propose a lightweight method that enhances topical relevance by directly reweighting the logits of topic-relevant tokens during generation. We evaluate three such reweighting techniques: Constant Shift, which adds a constant value to logits; Factor Scaling, which multiplies them by a factor; and Threshold Selection, which selectively boosts logits that exceed a probability threshold. Experiments on the NEWTS topical summarization dataset, using both Gemma-2B and Llama-3-8B models, show that these techniques effectively increase the use of topic-relevant vocabulary. Notably, the Threshold Selection method successfully improves topical focus without compromising summary quality-a trade-off often seen in other approaches. Our findings demonstrate that directly reweighting logits is a practical and resource-efficient alternative to fine-tuning, offering a promising pathway for precisely controlling the thematic content of generated text.
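The three reweighting techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the set `topic_ids` of topic-relevant token indices, and the additive form of the boost in Threshold Selection are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constant_shift(logits, topic_ids, shift):
    """Add a constant to the logit of every topic-relevant token."""
    return [x + shift if i in topic_ids else x for i, x in enumerate(logits)]

def factor_scaling(logits, topic_ids, factor):
    """Multiply the logit of every topic-relevant token by a factor.
    Note: with factor > 1, a negative logit becomes more negative."""
    return [x * factor if i in topic_ids else x for i, x in enumerate(logits)]

def threshold_selection(logits, topic_ids, shift, min_prob):
    """Boost only topic tokens whose current probability exceeds min_prob.
    The additive boost is an assumption; the abstract does not give its form."""
    probs = softmax(logits)
    return [
        x + shift if i in topic_ids and probs[i] > min_prob else x
        for i, x in enumerate(logits)
    ]

# Toy vocabulary of four tokens; tokens 1 and 3 are treated as topic-relevant.
logits = [2.0, 1.0, 0.0, -1.0]
topic_ids = {1, 3}
shifted = constant_shift(logits, topic_ids, 1.0)
scaled = factor_scaling(logits, topic_ids, 2.0)
selected = threshold_selection(logits, topic_ids, 1.0, 0.1)
```

In this toy case Threshold Selection boosts token 1 (probability about 0.24) but leaves token 3 (about 0.03) untouched, which is one way to read "selectively boosts logits that exceed a probability threshold".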
[NLP-6] MedGemma Technical Report
[Quick read]: This paper addresses the challenges of training and deploying AI in healthcare, including diverse medical data, complex tasks, and the need to preserve privacy. The key to the solution is MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma performs strongly on medical image and text understanding and reasoning, significantly exceeding similar-sized generative models and approaching task-specific models while maintaining the general capabilities of the Gemma 3 base models. Fine-tuning MedGemma further improves performance in subdomains, and the paper also introduces MedSigLIP, a medically tuned vision encoder that powers MedGemma's visual understanding.
Link: https://arxiv.org/abs/2507.05201
Authors: Andrew Sellergren,Sahar Kazemzadeh,Tiam Jaroensri,Atilla Kiraly,Madeleine Traverse,Timo Kohlberger,Shawn Xu,Fayaz Jamil,Cían Hughes,Charles Lau,Justin Chen,Fereshteh Mahvar,Liron Yatziv,Tiffany Chen,Bram Sterling,Stefanie Anna Baby,Susanna Maria Baby,Jeremy Lai,Samuel Schmidgall,Lu Yang,Kejia Chen,Per Bjornsson,Shashir Reddy,Ryan Brush,Kenneth Philbrick,Howard Hu,Howard Yang,Richa Tiwari,Sunny Jansen,Preeti Singh,Yun Liu,Shekoofeh Azizi,Aishwarya Kamath,Johan Ferret,Shreya Pathak,Nino Vieillard,Ramona Merhej,Sarah Perrin,Tatiana Matejovicova,Alexandre Ramé,Morgane Riviere,Louis Rouillard,Thomas Mesnard,Geoffrey Cideron,Jean-bastien Grill,Sabela Ramos,Edouard Yvinec,Michelle Casbon,Elena Buchatskaya,Jean-Baptiste Alayrac,Dmitry(Dima)Lepikhin,Vlad Feinberg,Sebastian Borgeaud,Alek Andreev,Cassidy Hardin,Robert Dadashi,Léonard Hussenot,Armand Joulin,Olivier Bachem,Yossi Matias,Katherine Chou,Avinatan Hassidim,Kavi Goel,Clement Farabet,Joelle Barral,Tris Warkentin,Jonathon Shlens,David Fleet,Victor Cotruta,Omar Sanseviero,Gus Martins,Phoebe Kirk,Anand Rao,Shravya Shetty,David F. Steiner,Can Kirmizibayrak,Rory Pilgrim,Daniel Golden,Lin Yang
Affiliations: Google Research and Google DeepMind
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare’s diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at this https URL.
[NLP-7] Pre-Trained Policy Discriminators are General Reward Models
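[Quick read]: This paper addresses the limitations of traditional reward modeling in capturing policy differences and producing effective reward signals, with the aim of improving the performance and generalization of reward models (RMs). The key to the solution is to formulate reward modeling as a policy discriminator that quantifies the difference between two policies to generate a reward signal, guiding the training policy toward the target policy. The core contribution is a scalable pre-training method, Policy Discriminative Learning (POLAR), which trains the RM to discern identical policies and discriminate different ones, capturing the relative difference between a policy and an arbitrary target policy as a high-level optimization objective suited to modeling generic ranking relationships.
Link: https://arxiv.org/abs/2507.05197
Authors: Shihan Dou,Shichun Liu,Yuming Yang,Yicheng Zou,Yunhua Zhou,Shuhao Xing,Chenhao Huang,Qiming Ge,Demin Song,Haijun Lv,Songyang Gao,Chengqi Lv,Enyu Zhou,Honglin Guo,Zhiheng Xi,Wenwei Zhang,Qipeng Guo,Qi Zhang,Xipeng Qiu,Xuanjing Huang,Tao Gui,Kai Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: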
Abstract:We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance–improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
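A power-law relationship with a near-linear correlation coefficient, as the abstract reports, is commonly checked by fitting a straight line in log-log space. The sketch below shows that generic procedure only; the data points are synthetic values invented for illustration, not results from the paper, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = coefficient * x**exponent by least squares on (log x, log y).
    Returns (coefficient, exponent, r), where r is the Pearson correlation
    of the log-transformed data."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    exponent = sxy / sxx
    coefficient = math.exp(mean_y - exponent * mean_x)
    r = sxy / math.sqrt(sxx * syy)
    return coefficient, exponent, r

# Synthetic example (made-up numbers): a quantity that decays as 2 * compute**-0.5.
compute = [1.0, 10.0, 100.0, 1000.0]
metric = [2.0 * c ** -0.5 for c in compute]
coefficient, exponent, r = fit_power_law(compute, metric)
```

On exact power-law data the correlation has magnitude 1; a magnitude near 0.99 on measured data indicates that a straight line in log-log space explains almost all of the variation.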
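[NLP-8] From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
[Quick read]: This paper addresses the problem of generating reliable news explanations for low-resource languages such as Hindi amid rampant misinformation. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. The paper proposes a framework that integrates Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. The key is the introduction of two parameters, Actuality and Finesse, into the DPO loss function to improve explanation quality and consistency.
Link: https://arxiv.org/abs/2507.05179
Authors: Pulkit Bansal,Raghvendra Kumar,Shakti Singh,Sriparna Saha,Adam Jatowt
Affiliations: Indian Institute of Technology Patna; University of Innsbruck
Categories: Computation and Language (cs.CL)
Comments: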
Abstract:In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters – Actuality and Finesse – into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework’s effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.
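For readers unfamiliar with DPO, the standard per-pair loss can be sketched as below. This is the generic DPO objective only: the abstract does not say how Actuality and Finesse enter the loss, so they are not modeled here. The function and argument names are illustrative, and the inputs are assumed to be summed token log-probabilities of the preferred (fact-checked) and non-preferred (LLM-generated) explanations.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log(sigmoid(beta * margin)), where margin compares how much more the
    policy prefers the chosen response than the frozen reference model does."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# When the policy matches the reference, the loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Raising the chosen response's likelihood relative to the rejected one lowers the loss.
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

The default `beta=0.1` is a commonly used value, not one taken from the paper.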
[NLP-9] OpenS2S: Advancing Open-Source End-to-End Empathetic Large Speech Language Model
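[Quick read]: This paper addresses the lack of research transparency caused by the increasingly closed nature of the most powerful empathetic Large Speech Language Models (LSLMs), and the need for low-latency empathetic speech interaction. The key to the solution is OpenS2S, a fully open-source, transparent, end-to-end LSLM built on the empathetic speech-to-text model BLSP-Emo, with a streaming interleaved decoding architecture for low-latency speech generation. It also includes an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues, using LLMs to generate empathetic content and controllable text-to-speech systems to build a scalable training corpus with rich paralinguistic diversity.
Link: https://arxiv.org/abs/2507.05177
Authors: Chen Wang,Tianyu Peng,Wen Yang,Yinan Bai,Guangfu Wang,Jun Lin,Lanpeng Jia,Lingxiang Wu,Jinqiao Wang,Chengqing Zong,Jiajun Zhang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Wuhan AI Research; GWM AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Technical Report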
Abstract:Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic LSLMs are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into the LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, pre-training and fine-tuning codes, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at this https URL
[NLP-10] Critiques of World Models
[Quick read]: This paper addresses how to build a general-purpose world model that simulates all actionable possibilities of the real world to support purposeful reasoning and acting. The key to the solution is a new architecture based on hierarchical, multi-level, and mixed continuous/discrete representations, combined with a generative and self-supervised learning framework, with the outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
Link: https://arxiv.org/abs/2507.05169
Authors: Eric Xing,Mingkai Deng,Jinyu Hou,Zhiting Hu
Affiliations: Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence; Halıcıoğlu Data Science Institute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:World Model, the supposed algorithmic surrogate of the real-world environment which biological agents experience with and act upon, has been an emerging topic in recent years because of the rising needs to develop virtual agents with artificial (general) intelligence. There has been much debate on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of “hypothetical thinking” in psychology literature, we offer critiques of several schools of thoughts on world modeling, and argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. Building on the critiques, we propose a new architecture for a general-purpose world model, based on hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervision learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.
[NLP-11] InfoSteer: Steering Information Utility in Language Model Post-Training
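[Quick read]: This paper addresses the problem that post-training does not guarantee effective use of the knowledge acquired during pretraining. The key to the solution is a lightweight method that treats the feed-forward network (FFN) layer as an associative key-value memory and promotes the use of stored memory vectors through forward-pass interventions or regularization during backpropagation, increasing the model's use of parametric information during post-training.
Link: https://arxiv.org/abs/2507.05158
Authors: Chunyuan Deng,Ruidi Chang,Hanjie Chen
Affiliations: Rice University
Categories: Computation and Language (cs.CL)
Comments: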
Abstract:Recent advancements in language models (LMs) gradually ushered in an era where post-training is crucial. Yet, post-training approaches such as supervised fine-tuning (SFT) do not guarantee effective use of knowledge acquired during pretraining. We therefore present InfoSteer, a lightweight method that encourages parametric information utilization in LMs during post-training. This is achieved via treating the FFN layer as an associative key-value memory, and promoting the use of stored memory vectors via forward-pass interventions or regularization during backpropagation. We find this simple guidance during post-training phase delivers consistent performance improvements across diverse model families, including Qwen, Gemma and Llama, spanning over 15 downstream tasks in both ID and OOD evaluations. Beyond performance gains, we also find that steered LMs can adaptively allocate information, placing more emphasis on generating semantically meaningful tokens, while using fewer resources on simple transition ones (e.g., "," or "and"). Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach that enhances both performance and interpretability.
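The key-value memory view of an FFN layer mentioned in the abstract can be made concrete with a small sketch. It shows only the reading itself: each row of the first weight matrix acts as a key, the activation of the input against that key acts as a memory coefficient, and the output is the coefficient-weighted sum of value vectors. The ReLU activation, the omission of biases, and all names are assumptions; the paper's interventions and regularizer are not implemented here.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def ffn_as_memory(x, keys, values):
    """Compute an FFN output as a weighted sum of value vectors.
    keys:   list of key vectors, one per hidden unit (same length as x)
    values: list of value vectors, one per hidden unit
    Returns (output, coeffs), where coeffs[i] says how strongly memory i fired."""
    coeffs = [relu(sum(xj * kj for xj, kj in zip(x, key))) for key in keys]
    dim = len(values[0])
    output = [sum(c * v[d] for c, v in zip(coeffs, values)) for d in range(dim)]
    return output, coeffs

# Toy layer with three memories over a 2-dimensional hidden state.
x = [1.0, 0.0]
keys = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
output, coeffs = ffn_as_memory(x, keys, values)
```

Under this reading, "promoting the use of stored memory vectors" would correspond to influencing the coefficients, for example the second memory above, which does not fire for this input.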
[NLP-12] AI Generated Text Detection Using Instruction Fine-tuned Large Language and Transformer-Based Models
[Quick read]: This paper addresses the detection of AI-generated text through two tasks: Task-A distinguishes human-written text from machine-generated text, and Task-B identifies the specific large language model (LLM) that generated a text. The key to the solution is fine-tuning GPT_4o-mini, LLaMA 3 8B, and BERT for both tasks; the fine-tuned GPT_4o-mini and BERT models reach accuracies of 0.9547 on Task-A and 0.4698 on Task-B.
Link: https://arxiv.org/abs/2507.05157
Authors: Chinnappa Guggilla,Budhaditya Roy,Trupti Ramdas Chavan,Abdul Rahman,Edward Bowen
Affiliations: Deloitte & Touche Assurance and Enterprise Risk Services India Private Limited, India; Deloitte & Touche LLP, USA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures
Abstract:Large Language Models (LLMs) possess an extraordinary capability to produce text that is not only coherent and contextually relevant but also strikingly similar to human writing. They adapt to various styles and genres, producing content that is both grammatically correct and semantically meaningful. Recently, LLMs have been misused to create highly realistic phishing emails, spread fake news, generate code to automate cyber crime, and write fraudulent scientific articles. Additionally, in many real-world applications, the generated content including style and topic and the generator model are not known beforehand. The increasing prevalence and sophistication of artificial intelligence (AI)-generated texts have made their detection progressively more challenging. Various attempts have been made to distinguish machine-generated text from human-authored content using linguistic, statistical, machine learning, and ensemble-based approaches. This work focuses on two primary objectives Task-A, which involves distinguishing human-written text from machine-generated text, and Task-B, which attempts to identify the specific LLM model responsible for the generation. Both of these tasks are based on fine tuning of Generative Pre-trained Transformer (GPT_4o-mini), Large Language Model Meta AI (LLaMA) 3 8B, and Bidirectional Encoder Representations from Transformers (BERT). The fine-tuned version of GPT_4o-mini and the BERT model has achieved accuracies of 0.9547 for Task-A and 0.4698 for Task-B.
[NLP-13] Interpretable Mnemonic Generation for Kanji Learning via Expectation-Maximization
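[Quick read]: This paper addresses the difficulty that learners from Roman alphabet backgrounds face with Japanese vocabulary because of script differences, in particular the memorization burden of kanji alongside syllabaries such as hiragana. The key to the solution is a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, learned with a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, the method learns latent structures and compositional rules, enabling interpretable and systematic mnemonic generation.
Link: https://arxiv.org/abs/2507.05137
Authors: Jaewook Lee,Alexander Scarlatos,Andrew Lan
Affiliations: University of Massachusetts Amherst
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: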
Abstract:Learning Japanese vocabulary is a challenge for learners from Roman alphabet backgrounds due to script differences. Japanese combines syllabaries like hiragana with kanji, which are logographic characters of Chinese origin. Kanji are also complicated due to their complexity and volume. Keyword mnemonics are a common strategy to aid memorization, often using the compositional structure of kanji to form vivid associations. Despite recent efforts to use large language models (LLMs) to assist learners, existing methods for LLM-based keyword mnemonic generation function as a black box, offering limited interpretability. We propose a generative framework that explicitly models the mnemonic construction process as driven by a set of common rules, and learn them using a novel Expectation-Maximization-type algorithm. Trained on learner-authored mnemonics from an online platform, our method learns latent structures and compositional rules, enabling interpretable and systematic mnemonics generation. Experiments show that our method performs well in the cold-start setting for new learners while providing insight into the mechanisms behind effective mnemonic creation.
[NLP-14] SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction
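[Quick read]: This paper addresses item (question) difficulty estimation in educational assessment. Traditional approaches require real students to respond to items and then fit an item response theory (IRT) model, which cannot be applied in the cold-start setting. The key to the solution is SMART (Simulated Students Aligned with IRT), which uses direct preference optimization (DPO) to align simulated students with an instructed ability. The simulated students generate a large number of responses, an LLM-based scoring model evaluates them, and an IRT model fitted to the resulting data yields item difficulty estimates, improving the accuracy of difficulty prediction.
Link: https://arxiv.org/abs/2507.05129
Authors: Alexander Scarlatos,Nigel Fernandez,Christopher Ormerod,Susan Lottridge,Andrew Lan
Affiliations: University of Massachusetts Amherst; Cambium Assessment
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: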
Abstract:Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get item difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with an LLM-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on a real-world student response dataset, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
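The last step of the pipeline, fitting an IRT model to scored responses to obtain an item difficulty, can be sketched in a simplified form. The sketch assumes a binary one-parameter (Rasch) model with known student abilities; the paper targets open-ended items and may well use a different IRT variant, so the model choice, the function names, and the bisection solver are illustrative assumptions.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability that a student answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_difficulty(abilities, responses, lo=-10.0, hi=10.0, iters=100):
    """Maximum-likelihood difficulty for one item, given known abilities and
    0/1 responses. The MLE makes the expected number of correct answers equal
    the observed number; that expectation decreases as difficulty grows, so
    bisection finds the root. All-correct or all-wrong data has no finite MLE
    and returns a value at the search bound."""
    observed = sum(responses)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(a, mid) for a in abilities)
        if expected > observed:
            lo = mid  # item looks too easy at mid: true difficulty is higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Four simulated students of ability 0, three of whom answer correctly:
# sigmoid(-b) = 0.75, so the estimate is b = -ln(3).
easy_item = estimate_difficulty([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 0])
# A weaker student fails and a stronger one succeeds: by symmetry, b = 0.
medium_item = estimate_difficulty([-1.0, 1.0], [0, 1])
```

In a SMART-style simulation, the `responses` would come from the scoring model's judgments of the simulated students' answers rather than from real students.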
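[NLP-15] An Evaluation of Large Language Models on Text Summarization Tasks Using Prompt Engineering Techniques
[Quick read]: This paper addresses the lack of comprehensive evaluation of large language models (LLMs) on text summarization across domains and datasets, and the bottleneck of summarizing effectively without extensive training data. The key to the solution is a systematic evaluation of six LLMs on four datasets using prompt engineering techniques, including zero-shot and in-context learning, together with a sentence-based chunking strategy for long documents, with analysis of the trade-off between summarization quality and computational efficiency.
Link: https://arxiv.org/abs/2507.05123
Authors: Walid Mohamed Aly,Taysir Hassan A. Soliman,Amr Mohamed AbdelAziz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This manuscript is an extended version of the work accepted for publication in the International Journal of Advanced Computer Science and Applications (IJACSA), Volume 16, Issue 6, June 2025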
Abstract:Large Language Models (LLMs) continue to advance natural language processing with their ability to generate human-like text across a range of tasks. Despite the remarkable success of LLMs in Natural Language Processing (NLP), their performance in text summarization across various domains and datasets has not been comprehensively evaluated. At the same time, the ability to summarize text effectively without relying on extensive training data has become a crucial bottleneck. To address these issues, we present a systematic evaluation of six LLMs across four datasets: CNN/Daily Mail and NewsRoom (news), SAMSum (dialog), and ArXiv (scientific). By leveraging prompt engineering techniques including zero-shot and in-context learning, our study evaluates the performance using the ROUGE and BERTScore metrics. In addition, a detailed analysis of inference times is conducted to better understand the trade-off between summarization quality and computational efficiency. For long documents, we introduce a sentence-based chunking strategy that enables LLMs with shorter context windows to summarize extended inputs in multiple stages. The findings reveal that while LLMs perform competitively on news and dialog tasks, their performance on long scientific documents improves significantly when aided by chunking strategies. In addition, notable performance variations were observed based on model parameters, dataset properties, and prompt design. These results offer actionable insights into how different LLMs behave across task types, contributing to ongoing research in efficient, instruction-based NLP systems.
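A sentence-based chunking strategy with multi-stage summarization might look like the following sketch. It is one plausible reading of the abstract, not the authors' code: the regex sentence splitter, the word-count budget `max_words` (a stand-in for a token budget), and the caller-supplied `summarize` callable (a stand-in for an LLM call) are all assumptions.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_by_sentences(text, max_words):
    """Group whole sentences into chunks of at most max_words words.
    A single sentence longer than the budget becomes its own chunk."""
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_long(text, summarize, max_words):
    """Stage 1: summarize each chunk. Stage 2: summarize the joined partials."""
    partial = [summarize(chunk) for chunk in chunk_by_sentences(text, max_words)]
    if not partial:
        return ""
    if len(partial) == 1:
        return partial[0]
    return summarize(" ".join(partial))

# Demo with a stub "model" that keeps only the first word of its input.
document = "One two three. Four five. Six seven eight nine. Ten."
chunks = chunk_by_sentences(document, max_words=5)
summary = summarize_long(document, lambda t: t.split()[0], max_words=5)
```

Splitting on sentence boundaries keeps each chunk self-contained, which is the property that lets a model with a short context window handle the document piece by piece.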
[NLP-16] Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
[Quick read]: This paper addresses the limitation that existing Historical Document Restoration (HDR) methods mostly handle a single modality or small-scale restoration and therefore fall short of practical needs. The key to its solution is a full-page HDR dataset (FPHDR) and an automated HDR approach (AutoHDR) that mimics historians' restoration workflow in three stages: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration, enabling efficient and accurate document restoration.
Link: https://arxiv.org/abs/2507.05108
Authors: Yuyi Zhang, Peirong Zhang, Zhenhua Yang, Pengyu Yan, Yongxin Shi, Pengwei Liu, Fengjun Guo, Lianwen Jin
Affiliations: South China University of Technology; Intsig Information Co., Ltd.; INTSIG-SCUT Joint Lab on Document Analysis and Recognition; SCUT-Zhuhai Institute of Modern Industrial Innovation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our method improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at this https URL.
[NLP-17] AI-Driven Cytomorphology Image Synthesis for Medical Diagnostics
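[Quick read]: This paper addresses sample imbalance and privacy constraints in biomedical datasets, which hinder the development of accurate machine learning models. The key to its solution is a fine-tuned Stable Diffusion model with LoRA weights that, guided by real few-shot samples, generates high-quality synthetic images to improve classifier performance. Experiments show that adding 5000 synthetic images per class to a small, highly imbalanced real dataset raises ResNet classifier accuracy from 27.3% to 78.4% and CLIP-based classifier accuracy from 61.8% to 76.8%.
Link: https://arxiv.org/abs/2507.05063
Authors: Jan Carreras Boada, Rao Muhammad Umer, Carsten Marr
Affiliations: Helmholtz Munich, Computational Health Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 8 pages, 6 figures, 2 tables. Final Degree Project (TFG) submitted at ESCI-UPF and conducted at Helmholtz Munich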
Abstract:Biomedical datasets often contain a large sample imbalance and are subject to strict privacy constraints, which together hinder the development of accurate machine learning models. One potential solution is to generate synthetic images, as this can improve data availability while preserving patient privacy. However, it remains difficult to generate synthetic images of sufficient quality for training robust classifiers. In this work, we focus on the classification of single white blood cells, a key component in the diagnosis of hematological diseases such as acute myeloid leukemia (AML), a severe blood cancer. We demonstrate how synthetic images generated with a fine-tuned stable diffusion model using LoRA weights when guided by real few-shot samples of the target white blood cell classes, can enhance classifier performance for limited data. When training a ResNet classifier, accuracy increased from 27.3% to 78.4% (+51.1%) by adding 5000 synthetic images per class to a small and highly imbalanced real dataset. For a CLIP-based classifier, the accuracy improved from 61.8% to 76.8% (+15.0%). The synthetic images are highly similar to real images, and they can help overcome dataset limitations, enhancing model generalization. Our results establish synthetic images as a tool in biomedical research, improving machine learning models, and facilitating medical diagnosis and research.
[NLP-18] Verified Language Processing with Hybrid Explainability: A Technical Report
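[Quick read]: This paper addresses the lack of explainability in current machine learning methods for text similarity tasks, in particular the failure of generative language models with logical prompts to distinguish correctly between logical implication, indifference, and inconsistency. The key to its solution is a novel pipeline that combines graphs with First-Order Logic representations and uses Montague Grammar to produce machine- and human-readable representations, so that text semantics can be captured and explained precisely.
Link: https://arxiv.org/abs/2507.05017
Authors: Oliver Robert Fox, Giacomo Bergami, Graham Morgan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Symbolic Computation (cs.SC)
Comments: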
Abstract:The volume and diversity of digital information have led to a growing reliance on Machine Learning techniques, such as Natural Language Processing, for interpreting and accessing appropriate data. While vector and graph embeddings represent data for similarity tasks, current state-of-the-art pipelines lack guaranteed explainability, failing to determine similarity for given full texts accurately. These considerations can also be applied to classifiers exploiting generative language models with logical prompts, which fail to correctly distinguish between logical implication, indifference, and inconsistency, despite being explicitly trained to recognise the first two classes. We present a novel pipeline designed for hybrid explainability to address this. Our methodology combines graphs and logic to produce First-Order Logic representations, creating machine- and human-readable representations through Montague Grammar. Preliminary results indicate the effectiveness of this approach in accurately capturing full text similarity. To the best of our knowledge, this is the first approach to differentiate between implication, inconsistency, and indifference for text classification tasks. To address the limitations of existing approaches, we use three self-contained datasets annotated for the former classification task to determine the suitability of these approaches in capturing sentence structure equivalence, logical connectives, and spatiotemporal reasoning. We also use these data to compare the proposed method with language models pre-trained for detecting sentence entailment. The results show that the proposed method outperforms state-of-the-art models, indicating that natural language understanding cannot be easily generalised by training over extensive document corpora. This work offers a step toward more transparent and reliable Information Retrieval from extensive textual data.
[NLP-19] Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification
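[Quick read]: This paper addresses the identification and handling of edge cases in text classification, which are often hard to identify and annotate because the initial codebook describes them poorly. The key to its solution is Co-DETECT, a mixed-initiative annotation framework that combines domain experts' prior knowledge with automatic annotation by large language models (LLMs) and iteratively identifies edge cases and induces generalizable descriptions of them, improving the codebook's coverage and handling of nuanced phenomena.
Link: https://arxiv.org/abs/2507.05010
Authors: Chenfei Xiong, Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Lorena Calvo-Bartolomé, Alexander Hoyle, Zhijing Jin, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Mennatallah El-Assady, Elliott Ash
Affiliations: ETH Zurich; University of Zürich; Bocconi University; University of Toronto; Universidad Carlos III
Categories: Computation and Language (cs.CL)
Comments: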
Abstract:We introduce Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion), a novel mixed-initiative annotation framework that integrates human expertise with automatic annotation guided by large language models (LLMs). Co-DETECT starts with an initial, sketch-level codebook and dataset provided by a domain expert, then leverages the LLM to annotate the data and identify edge cases that are not well described by the initial codebook. Specifically, Co-DETECT flags challenging examples, induces high-level, generalizable descriptions of edge cases, and assists the user in incorporating edge case handling rules to improve the codebook. This iterative process enables more effective handling of nuanced phenomena through compact, generalizable annotation rules. An extensive user study and qualitative and quantitative analyses prove the effectiveness of Co-DETECT.
[NLP-20] Do We Really Need Specialization? Evaluating Generalist Text Embeddings for Zero-Shot Recommendation and Search RECSYS2025
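[Quick read]: This paper questions the view that, in recommendation and search, pre-trained language models (PLMs) need task- and domain-specific fine-tuning to improve their representational power. The key to its solution is to use Generalist Text Embedding Models (GTEs), which are pre-trained on large-scale corpora and achieve strong zero-shot performance without specialized adaptation. Experiments show that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search, which the authors attribute to a more even distribution of features across the embedding space. The study also finds that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) reduces noise and improves the performance of specialized models.
Link: https://arxiv.org/abs/2507.05006
Authors: Matteo Attimonelli, Alessandro De Bellis, Claudio Pomo, Dietmar Jannach, Eugenio Di Sciascio, Tommaso Di Noia
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted as Short Paper at RecSys 2025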
Abstract:Pre-trained language models (PLMs) are widely used to derive semantic representations from item metadata in recommendation and search. In sequential recommendation, PLMs enhance ID-based embeddings through textual metadata, while in product search, they align item characteristics with user intent. Recent studies suggest task and domain-specific fine-tuning are needed to improve representational power. This paper challenges this assumption, showing that Generalist Text Embedding Models (GTEs), pre-trained on large-scale corpora, can guarantee strong zero-shot performance without specialized adaptation. Our experiments demonstrate that GTEs outperform traditional and fine-tuned models in both sequential recommendation and product search. We attribute this to a superior representational power, as they distribute features more evenly across the embedding space. Finally, we show that compressing embedding dimensions by focusing on the most informative directions (e.g., via PCA) effectively reduces noise and improves the performance of specialized models. To ensure reproducibility, we provide our repository at this https URL.
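[NLP-21] From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems
[Quick read]: This paper addresses the conceptual gap between the technical autonomy of today's autonomous vehicles (AuVs) and the broader cognitive and social capabilities needed for future human-centered mobility systems. The key to its solution is the concept of agentic vehicles (AgVs), which integrate agentic AI to reason, adapt, and interact within complex environments. The paper presents a systems-level framework to characterize AgVs, focusing on their cognitive and communicative layers, and argues that agentic AI, through high-level reasoning and tool use, can function not merely as a computational tool but as an interactive agent embedded in mobility ecosystems.
Link: https://arxiv.org/abs/2507.04996
Authors: Jiangbo Yu
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
Comments: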
Abstract:Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Accordingly, autonomous vehicles (AuVs) are defined as systems capable of perceiving their environment and executing preprogrammed tasks independently of external input. However, both research and real-world deployments increasingly showcase vehicles that demonstrate behaviors beyond this definition (including the SAE levels 1 to 6), such as interaction with humans and machines, goal adaptation, contextual reasoning, external tool use, and long-term planning, particularly with the integration of large language models (LLMs) and agentic AI systems. These developments reveal a conceptual gap between technical autonomy and the broader cognitive and social capabilities needed for future human-centered mobility systems. To address this, we introduce the concept of agentic vehicles (AgVs), referring to vehicles that integrate agentic AI to reason, adapt, and interact within complex environments. This paper presents a systems-level framework to characterize AgVs, focusing on their cognitive and communicative layers and differentiating them from conventional AuVs. It synthesizes relevant advances in agentic AI, robotics, multi-agent systems, and human-machine interaction, and highlights how agentic AI, through high-level reasoning and tool use, can function not merely as computational tools but as interactive agents embedded in mobility ecosystems. The paper concludes by identifying key challenges in the development and governance of AgVs, including safety, real-time control, public acceptance, ethical alignment, and regulatory frameworks.
[NLP-22] Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models ICLR2025
[Quick read]: This paper addresses the inability of Video Large Language Models (Video-LLMs) to judge a question's relevance and decline to answer when the question goes beyond the information in the video. The key to its solution is alignment for answerability, a framework that enables Video-LLMs to evaluate the relevance of a question based on the input video and to decline appropriately when the question exceeds the video's scope.
Link: https://arxiv.org/abs/2507.04976
Authors: Eunseop Yoon, Hee Suk Yoon, Mark A. Hasegawa-Johnson, Chang D. Yoo
Affiliations: Korea Advanced Institute of Science and Technology (KAIST); University of Illinois at Urbana-Champaign (UIUC)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: ICLR 2025
Abstract:In the broader context of deep learning, Multimodal Large Language Models have achieved significant breakthroughs by leveraging powerful Large Language Models as a backbone to align different modalities into the language space. A prime exemplification is the development of Video Large Language Models (Video-LLMs). While numerous advancements have been proposed to enhance the video understanding capabilities of these models, they are predominantly trained on questions generated directly from video content. However, in real-world scenarios, users often pose questions that extend beyond the informational scope of the video, highlighting the need for Video-LLMs to assess the relevance of the question. We demonstrate that even the best-performing Video-LLMs fail to reject unfit questions, not necessarily due to a lack of video understanding, but because they have not been trained to identify and refuse such questions. To address this limitation, we propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video and appropriately decline to answer when the question exceeds the scope of the video, as well as an evaluation framework with a comprehensive set of metrics designed to measure model behavior before and after alignment. Furthermore, we present a pipeline for creating a dataset specifically tailored for alignment for answerability, leveraging existing video-description paired datasets.
[NLP-23] ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
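[Quick read]: This paper addresses the lack of effective evaluation standards for the dynamic, interactive visual artifacts generated by Large Language Models (LLMs): existing benchmarks focus on algorithmic correctness and overlook the visual fidelity and interactive integrity that shape user experience. The key to its solution is ArtifactsBench, a new benchmark and paradigm for automated, multimodal evaluation of visual code generation. Its core method programmatically renders each generated artifact and captures its dynamic behavior, which a Multimodal LLM (MLLM) then assesses together with the source code, yielding holistic and reproducible scoring.
Link: https://arxiv.org/abs/2507.04952
Authors: Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: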
Abstract:The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at this https URL, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
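[NLP-24] Taming the Tri-Space Tension: ARC-Guided Hallucination Modeling and Control for Text-to-Image Generation
[Quick read]: This paper addresses the persistent hallucinations of text-to-image (T2I) diffusion models, where generated content subtly or significantly diverges from the intended prompt semantics. The key to its solution is a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. It formalizes a three-axis Hallucination Tri-Space and introduces the Alignment Risk Code (ARC), a dynamic vector representation that quantifies alignment tension during generation. A lightweight controller, TensionModulator (TM-ARC), monitors ARC signals in real time and applies axis-specific interventions, reducing hallucination while preserving image quality and diversity.
Link: https://arxiv.org/abs/2507.04946
Authors: Jianjiang Yang, Ziyan Huang
Affiliations: University of Bristol; South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 12 pages, 6 figures, 4 tables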
Abstract:Despite remarkable progress in image quality and prompt fidelity, text-to-image (T2I) diffusion models continue to exhibit persistent “hallucinations”, where generated content subtly or significantly diverges from the intended prompt semantics. While often regarded as unpredictable artifacts, we argue that these failures reflect deeper, structured misalignments within the generative process. In this work, we propose a cognitively inspired perspective that reinterprets hallucinations as trajectory drift within a latent alignment space. Empirical observations reveal that generation unfolds within a multiaxial cognitive tension field, where the model must continuously negotiate competing demands across three key critical axes: semantic coherence, structural alignment, and knowledge grounding. We then formalize this three-axis space as the Hallucination Tri-Space and introduce the Alignment Risk Code (ARC): a dynamic vector representation that quantifies real-time alignment tension during generation. The magnitude of ARC captures overall misalignment, its direction identifies the dominant failure axis, and its imbalance reflects tension asymmetry. Based on this formulation, we develop the TensionModulator (TM-ARC): a lightweight controller that operates entirely in latent space. TM-ARC monitors ARC signals and applies targeted, axis-specific interventions during the sampling process. Extensive experiments on standard T2I benchmarks demonstrate that our approach significantly reduces hallucination without compromising image quality or diversity. This framework offers a unified and interpretable approach for understanding and mitigating generative failures in diffusion-based T2I systems.
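The abstract names three readouts of the ARC vector (magnitude, direction, imbalance) without giving formulas. The sketch below shows one way such readouts could be computed from three per-axis tension scores. Everything in it is an assumption made for illustration: the axis keys, the Euclidean norm as magnitude, the largest component as the dominant axis, and "maximum minus mean" as imbalance are not taken from the paper.

```python
import math
from typing import Dict

# Hypothetical axis names mirroring the three axes listed in the abstract.
AXES = ("semantic", "structural", "knowledge")


def describe_arc(arc: Dict[str, float]) -> Dict[str, object]:
    """Summarize a three-component tension vector.

    magnitude: Euclidean norm (assumed proxy for overall misalignment)
    dominant_axis: axis with the largest score (assumed proxy for direction)
    imbalance: largest score minus the mean (assumed proxy for asymmetry)
    """
    values = [arc[axis] for axis in AXES]
    magnitude = math.sqrt(sum(v * v for v in values))
    dominant_axis = max(AXES, key=lambda axis: arc[axis])
    imbalance = max(values) - sum(values) / len(values)
    return {
        "magnitude": magnitude,
        "dominant_axis": dominant_axis,
        "imbalance": imbalance,
    }
```

Under these assumptions, a controller could intervene only when the magnitude crosses a threshold and aim the intervention at the dominant axis. How TM-ARC actually makes that decision is not described in the abstract.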
[NLP-25] ReLoop: “Seeing Twice and Thinking Backwards” via Closed-loop Training to Mitigate Hallucinations in Multimodal understanding
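[Quick read]: This paper aims to address hallucinations of Multimodal Large Language Models (MLLMs) in open-ended visual question answering, that is, outputs that contradict or misrepresent input semantics and undermine reliability and factual consistency. The key to its solution is ReLoop, a unified closed-loop training framework whose ring-shaped structure integrates three complementary consistency feedback mechanisms, pushing MLLMs toward “seeing twice and thinking backwards” so that outputs are validated and corrected directly during training. ReLoop uses a frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and attention supervision modules, which together enforce semantic reversibility, visual consistency, and interpretable attention and reduce hallucination rates.
Link: https://arxiv.org/abs/2507.04943
Authors: Jianjiang Yang, Ziyan Huang, Yanshu Li
Affiliations: University of Bristol; South China University of Technology; Brown University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 8 pages, 6 figures, 5 tables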
Abstract:While Multimodal Large Language Models (MLLMs) have achieved remarkable progress in open-ended visual question answering, they remain vulnerable to hallucinations. These are outputs that contradict or misrepresent input semantics, posing a critical challenge to the reliability and factual consistency. Existing methods often rely on external verification or post-hoc correction, lacking an internal mechanism to validate outputs directly during training. To bridge this gap, we propose ReLoop, a unified closed-loop training framework that encourages multimodal consistency for cross-modal understanding in MLLMs. ReLoop adopts a ring-shaped structure that integrates three complementary consistency feedback mechanisms, obliging MLLMs to “seeing twice and thinking backwards”. Specifically, ReLoop employs the frozen Consistency Feedback Plugin (CFP), comprising semantic reconstruction, visual description, and an attention supervision module for attention alignment. These components collectively enforce semantic reversibility, visual consistency, and interpretable attention, enabling the model to correct its outputs during training. Extensive evaluations and analyses demonstrate the effectiveness of ReLoop in reducing hallucination rates across multiple benchmarks, establishing a robust method for hallucination mitigation in MLLMs. We will release our source code and data in the camera-ready version.
[NLP-26] SIGIR 2025 – LiveRAG Challenge Report
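[Quick read]: This paper reports on the LiveRAG Challenge held at SIGIR 2025, which aimed to advance Retrieval-Augmented Generation (RAG) technologies. The challenge gave participants from academia and industry a competitive platform and asked them to build a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct), to facilitate comparison of retrieval and prompting strategies. The key to a solution lies in combining an efficient retrieval mechanism with the generative model to give accurate and faithful answers to unseen questions within a limited time.
Link: https://arxiv.org/abs/2507.04942
Authors: David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Oren Somekh, Ran Tavory
Affiliations: Technology Innovation Institute (TII); AI71; Pinecone
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, 5 tables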
Abstract:The LiveRAG Challenge at SIGIR 2025, held between March and May 2025, provided a competitive platform for advancing Retrieval-Augmented Generation (RAG) technologies. Participants from academia and industry were invited to develop a RAG-based question-answering system using a fixed corpus (Fineweb-10BT) and a common open-source LLM (Falcon3-10B-Instruct). The goal was to facilitate challenging comparisons of retrieval and prompting strategies. During the Live Challenge Day, 70 teams from 27 different countries provided answers and supportive information to 500 unseen questions within a strict two-hour time window. Evaluation was conducted in two stages: first an automated LLM-as-a-judge approach was used to compute correctness and faithfulness score, then a manual review of top ranked submissions was conducted. The finalists were announced on June 12, 2025, with prizes awarded during the LiveRAG Workshop at SIGIR 2025 in Padua, Italy.
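[NLP-27] O_FT@EvalLLM 2025 : étude comparative de choix de données et de stratégies d'apprentissage pour l'adaptation de modèles de langue à un domaine
[Quick read]: This paper aims to adapt language models to the defense domain in order to improve their domain knowledge and task-processing skills. The key to its solution is adapting the Mistral-7B-Instruct-v0.3 model with the classical techniques of continued pre-training and instruction-tuning, with a focus on collecting, generating, and selecting data for these stages and for model evaluation. Experiments show that the adapted models perform better in the defense domain while remaining competitive on general knowledge and skills.
Link: https://arxiv.org/abs/2507.04895
Authors: Ismaël Rousseau, Claire Perroux, Pierre Adam, Thomas Girault, Lionel Delphin-Poulat, Morgan Veyret, Gwénolé Lecorvé, Géraldine Damnati
Affiliations: Orange Research; Ouest-France
Categories: Computation and Language (cs.CL)
Comments: 22 pages + 10 pages appendices, in French language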
Abstract:This paper presents the work carried out by the O_FT team, joint with Orange and Ouest-France, on adapting language models to the defense domain as part of the EvalLLM2025 challenge. This work focused on adapting the Mistral-7B-Instruct-v0.3 model using classical techniques of continued pre-training and instruction-tuning. The core of our efforts is based on collecting, generating, and selecting data for these two stages as well as for model evaluation. Experiments show that our adapted models have better domain-specific knowledge and improved domain-specific task processing skills, along with comparable (or even superior) performance on general knowledge and skills. Considering the carbon footprint of our adaptations, this work demonstrates the feasibility of domain adaptation for relatively small models. – Ce document présente les travaux réalisés par l’équipe O_FT conjointe à Orange et Ouest-France sur l’adaptation de modèles de langue au domaine de la défense dans le cadre du challenge EvalLLM2025. Ces travaux se sont concentrés sur l’adaptation du modèle Mistral-7B-Instruct-v0.3 avec des techniques classiques de poursuite du pré-entraînement et d’affinage sur instructions. L’essentiel de nos travaux a porté sur la constitution, génération et sélection de données pour ces deux étapes ainsi que pour l’évaluation des modèles. Les expériences montrent que nos modèles adaptés ont de meilleures de connaissances de fond et une meilleure capacité de traitement de tâches sur le domaine de la défense, ainsi que des performances comparables (voire supérieures) sur des connaissances ou capacités généralistes. Mis au regard des empreintes carbones de nos adaptations, ces travaux démontrent ainsi la viabilité de l’adaptation à un domaine de modèles relativement petits.
[NLP-28] MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction
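[Quick read]: This paper aims to address the challenges of accident severity prediction, including incomplete data, strong feature dependencies, and class imbalance (rare but high-severity cases are underrepresented and hard to detect). Existing methods rely on monolithic models or black-box prompting, which scale poorly in noisy settings and offer limited interpretability. The proposed solution is MARBLE, a multi-agent rule-based large language model (LLM) engine. Its key idea is to decompose severity prediction across specialized reasoning agents, each focused on a semantic subset of features (e.g., spatial, environmental, temporal) and coordinated by rule-based or LLM-guided consensus mechanisms. This enables scoped reasoning and modular prompting, avoids prompt saturation, and supports in-depth interpretability and post-hoc performance diagnostics.
Link: https://arxiv.org/abs/2507.04893
Authors: Kaleem Ullah Qasim, Jiashu Zhang
Affiliations: Southwest Jiaotong University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: 13 pages, 5 figures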
Abstract:Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high-severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black box prompting, which struggle to scale in noisy, real-world settings and offer limited interpretability. To address these challenges, we propose MARBLE a multiagent rule based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML-backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule-based or LLM-guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent-level reasoning and coordination outcomes, supporting in-depth interpretability and post-hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning methods including Chain-of-Thought (CoT), Least-to-Most (L2M), and Tree-of-Thought (ToT) achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety-critical applications.
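The abstract says agent predictions are coordinated through rule-based consensus that accounts for class rarity and confidence, but gives no rule. The following is a minimal sketch of one such rule, a confidence-weighted vote with a boost for rarer classes. The labels, the weights in `RARITY_WEIGHT`, and the confidence threshold are hypothetical values chosen for illustration and are not MARBLE's actual mechanism.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rarity weights: rarer, more severe classes get a larger boost.
RARITY_WEIGHT: Dict[str, float] = {"slight": 1.0, "serious": 1.5, "fatal": 2.0}


def rule_based_consensus(votes: List[Tuple[str, float]], min_confidence: float = 0.3) -> str:
    """Combine (label, confidence) votes from several agents into one label.

    Votes below min_confidence are ignored. Each remaining vote contributes
    confidence * rarity weight to its label, and the highest total wins.
    """
    scores: Dict[str, float] = defaultdict(float)
    for label, confidence in votes:
        if confidence < min_confidence:
            continue
        scores[label] += confidence * RARITY_WEIGHT.get(label, 1.0)
    if not scores:
        raise ValueError("no vote met the confidence threshold")
    return max(scores, key=lambda label: scores[label])
```

With these weights, a single confident "fatal" vote can outweigh two moderately confident "slight" votes, which is one way to keep rare classes from being drowned out by a plain majority vote. Keeping the per-agent votes alongside the final label would also give the kind of structured trace the abstract mentions.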
[NLP-29] Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations
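[Quick read]: This paper examines where semantic representation resides in large language models (LLMs), that is, whether semantic information necessarily depends on trainable input embeddings. The key to its solution is to build Transformer models with an entirely frozen embedding layer, whose vectors are not learned from data but precomputed from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings stay fixed throughout training, avoiding the “representational interference” that arises in conventional models when the embedding layer must learn both structural and semantic features. Experiments show that, despite the absence of trainable, semantically initialized embeddings, the models converge, generate coherent text, and outperform comparable models with trainable embeddings on the MMLU reasoning benchmark.
Link: https://arxiv.org/abs/2507.04886
Authors: A. Bochkov
Affiliations: Moscow Institute of Physics and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: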
Abstract:Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational “meaning vectors.” This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We attribute this to “representational interference” in conventional models, where the embedding layer is burdened with learning both structural and semantic features. Our results indicate that high-level semantics are not inherent to input embeddings but are an emergent property of the Transformer’s compositional architecture and data scale. This reframes the role of embeddings from meaning containers to structural primitives. We release all code and models to foster further research.
[NLP-30] Building Open-Retrieval Conversational Question Answering Systems by Generating Synthetic Data and Decontextualizing User Questions SIGDIAL2025
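[Quick read]: This paper addresses the scarcity of domain-specific training data for open-retrieval conversational question answering (OR-CONVQA), a task in which system responses must take the dialog history into account and be grounded in documents or document fragments retrieved per question. The key to its solution is to use the plain text documents that organizations have in abundance (e.g., product documentation) to automatically generate realistic, annotated OR-CONVQA dialogs, including in-dialog question-answer pairs, self-contained decontextualized user questions, and the propositions in which system responses are grounded. The synthetic dialogs can be used to train efficient question rewriters, allowing existing dialog-unaware retrievers to be used.
Link: https://arxiv.org/abs/2507.04884
Authors: Christos Vlachos, Nikolaos Stylianou, Alexandra Fiotaki, Spiros Methenitis, Elisavet Palogiannidi, Themos Stafylakis, Ion Androutsopoulos
Affiliations: Athens University of Economics and Business; Omilia - Conversational Intelligence; Archimedes, Athena Research Center; NCSR Demokritos
Categories: Computation and Language (cs.CL)
Comments: Accepted at SIGDIAL 2025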
Abstract:We consider open-retrieval conversational question answering (OR-CONVQA), an extension of question answering where system responses need to be (i) aware of dialog history and (ii) grounded in documents (or document fragments) retrieved per question. Domain-specific OR-CONVQA training datasets are crucial for real-world applications, but hard to obtain. We propose a pipeline that capitalizes on the abundance of plain text documents in organizations (e.g., product documentation) to automatically produce realistic OR-CONVQA dialogs with annotations. Similarly to real-world human-annotated OR-CONVQA datasets, we generate in-dialog question-answer pairs, self-contained (decontextualized, e.g., no referring expressions) versions of user questions, and propositions (sentences expressing prominent information from the documents) the system responses are grounded in. We show how the synthetic dialogs can be used to train efficient question rewriters that decontextualize user questions, allowing existing dialog-unaware retrievers to be utilized. The retrieved information and the decontextualized question are then passed on to an LLM that generates the system’s response.
[NLP-31] Transcribing Spanish Texts from the Past: Experiments with Transkribus Tesseract and Granite ICIP
[Quick Read]: This paper addresses the transcription of historical texts (Transcribing Texts from the Past), aiming to accurately recognize and convert historical documents using different technical approaches. The key to the solution is three kinds of experiments, using a web-based optical character recognition (OCR) service, a traditional OCR engine, and a compact multimodal model, in order to compare how the approaches perform on historical texts and to test their feasibility and stability on consumer-grade hardware.
Link: https://arxiv.org/abs/2507.04878
Authors: Yanco Amor Torterolo-Orta, Jaione Macicior-Mitxelena, Marina Miguez-Lamanuzzi, Ana García-Serrano
Affiliations: Universidad Nacional de Educación a Distancia (UNED)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: This paper was written as part of a shared task organized within the 2025 edition of the Iberian Languages Evaluation Forum (IberLEF 2025), held at SEPLN 2025 in Zaragoza. This paper describes the joint participation of two teams in said competition, GRESEL1 and GRESEL2, each with an individual paper that will be published in CEUR
Abstract:This article presents the experiments and results obtained by the GRESEL team in the IberLEF 2025 shared task PastReader: Transcribing Texts from the Past. Three types of experiments were conducted with the dual aim of participating in the task and enabling comparisons across different approaches. These included the use of a web-based OCR service, a traditional OCR engine, and a compact multimodal model. All experiments were run on consumer-grade hardware, which, despite lacking high-performance computing capacity, provided sufficient storage and stability. The results, while satisfactory, leave room for further improvement. Future work will focus on exploring new techniques and ideas using the Spanish-language dataset provided by the shared task, in collaboration with Biblioteca Nacional de España (BNE).
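[NLP-32] Grahak-Nyay: Consumer Grievance Redressal through Large Language Models
[Quick Read]: This paper addresses the procedural complexity, opaque legal jargon, and jurisdictional challenges that hinder consumer grievance redressal in India. The key to the solution is Grahak-Nyay, a chatbot that uses open-source large language models (LLMs) and retrieval-augmented generation (RAG) and simplifies legal complexities through a concise, up-to-date knowledge base. The work also introduces three new datasets and the HAB evaluation metrics to improve the chatbot's performance and trustworthiness.
Link: https://arxiv.org/abs/2507.04854
Authors: Shrey Ganatra, Swapnil Bhattacharyya, Harshvivek Kashid, Spandan Anaokar, Shruti Nair, Reshma Sekhar, Siddharth Manohar, Rahul Hemrajani, Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay; National Law School of India University, Bangalore
Categories: Computation and Language (cs.CL)
Comments: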
Abstract:Access to consumer grievance redressal in India is often hindered by procedural complexity, legal jargon, and jurisdictional challenges. To address this, we present Grahak-Nyay (Justice-to-Consumers), a chatbot that streamlines the process using open-source Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Grahak-Nyay simplifies legal complexities through a concise and up-to-date knowledge base. We introduce three novel datasets: GeneralQA (general consumer law), SectoralQA (sector-specific knowledge) and SyntheticQA (for RAG evaluation), along with NyayChat, a dataset of 300 annotated chatbot conversations. We also introduce Judgments data sourced from Indian Consumer Courts to aid the chatbot in decision making and to enhance user trust. We also propose HAB metrics (Helpfulness, Accuracy, Brevity) to evaluate chatbot performance. Legal domain experts validated Grahak-Nyay’s effectiveness. Code and datasets will be released.
[NLP-33] Dialogue-Based Multi-Dimensional Relationship Extraction from Novels NLPCC2025
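[Quick Read]: This paper addresses the automatic extraction of character relationships from novels, a task with important applications in natural language processing that is difficult for traditional methods because of the complex context and implicit expressions in the text. The key to the solution is a method based on large language models (LLMs) that combines relationship dimension separation, dialogue data construction, and contextual learning strategies to improve extraction performance, and uses dialogue structure information to strengthen the model's understanding of implicit relationships, giving it good adaptability in complex contexts.
Link: https://arxiv.org/abs/2507.04852
Authors: Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: The paper has been accepted by NLPCC2025. 12 pages, 5 figures, 5 tables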
Abstract:Relation extraction is a crucial task in natural language processing, with broad applications in knowledge graph construction and literary analysis. However, the complex context and implicit expressions in novel texts pose significant challenges for automatic character relationship extraction. This study focuses on relation extraction in the novel domain and proposes a method based on Large Language Models (LLMs). By incorporating relationship dimension separation, dialogue data construction, and contextual learning strategies, the proposed method enhances extraction performance. Leveraging dialogue structure information, it improves the model’s ability to understand implicit relationships and demonstrates strong adaptability in complex contexts. Additionally, we construct a high-quality Chinese novel relation extraction dataset to address the lack of labeled resources and support future research. Experimental results show that our method outperforms traditional baselines across multiple evaluation metrics and successfully facilitates the automated construction of character relationship networks in novels.
[NLP-34] Spec-TOD: A Specialized Instruction-Tuned LLM Framework for Efficient Task-Oriented Dialogue Systems SIGDIAL2025
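[Quick Read]: This paper addresses the limited performance of task-oriented dialogue (TOD) systems in low-resource scenarios where labeled data is scarce. The key to the solution is Spec-TOD, a new framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs) and an efficient training strategy built on lightweight, specialized LLMs, achieving strong performance with minimal supervision.
Link: https://arxiv.org/abs/2507.04841
Authors: Quang-Vinh Nguyen, Quang-Chieu Nguyen, Hoang Pham, Khac-Hoai Nam Bui
Affiliations: Viettel Artificial Intelligence and Data Services Center, Viettel Group, Vietnam
Categories: Computation and Language (cs.CL)
Comments: Accepted at SIGdial 2025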
Abstract:Task-oriented dialogue (TOD) systems facilitate goal-driven interactions between users and machines. While recent advances in deep learning have improved the performance, TOD systems often struggle in low-resource scenarios with limited labeled data. To address this challenge, we propose Spec-TOD, a novel framework designed to train an end-to-end TOD system with limited data. Spec-TOD introduces two main innovations: (i) a novel specialized end-to-end TOD framework that incorporates explicit task instructions for instruction-tuned large language models (LLMs), and (ii) an efficient training strategy that leverages lightweight, specialized LLMs to achieve strong performance with minimal supervision. Experiments on the MultiWOZ dataset, a widely used TOD benchmark, demonstrate that Spec-TOD achieves competitive results while significantly reducing the need for labeled data. These findings highlight the potential of the proposed framework in advancing efficient and effective TOD systems in low-resource settings.
[NLP-35] From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach
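[Quick Read]: This paper addresses the description of video content in complex natural language, i.e., generating long, coherent, and semantically rich paragraph-level descriptions, whereas existing datasets and methods focus mainly on short video captions. The key to the solution is a shared representation between vision and language based on graphs of events in space and time, which integrates multiple vision tasks in an explainable and analytical way to produce the final natural language description. The automated, explainable description generation process also serves as a teacher within a self-supervised neuro-analytical system, effectively training end-to-end neural student pathways.
Link: https://arxiv.org/abs/2507.04815
Authors: Mihai Masala, Marius Leordeanu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: arXiv admin note: text overlap with arXiv:2501.08460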
Abstract:The task of describing video content in natural language is commonly referred to as video captioning. Unlike conventional video captions, which are typically brief and widely available, long-form paragraph descriptions in natural language are scarce. This limitation of current datasets is due to the expensive human manual annotation required and to the highly challenging task of explaining the language formation process from the perspective of the underlying story, as a complex system of interconnected events in space and time. Through a thorough analysis of recently published methods and available datasets, we identify a general lack of published resources dedicated to the problem of describing videos in complex language, beyond the level of descriptions in the form of enumerations of simple captions. Furthermore, while state-of-the-art methods produce impressive results on the task of generating shorter captions from videos by direct end-to-end learning between the videos and text, the problem of explaining the relationship between vision and language is still beyond our reach. In this work, we propose a shared representation between vision and language, based on graphs of events in space and time, which can be obtained in an explainable and analytical way, to integrate and connect multiple vision tasks to produce the final natural language description. Moreover, we also demonstrate how our automated and explainable video description generation process can function as a fully automatic teacher to effectively train direct, end-to-end neural student pathways, within a self-supervised neuro-analytical system. We validate that our explainable neuro-analytical approach generates coherent, rich and relevant textual descriptions on videos collected from multiple varied datasets, using both standard evaluation metrics, human annotations and consensus from ensembles of state-of-the-art VLMs.
[NLP-36] A Survey of Pun Generation: Datasets Evaluations and Methodologies
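[Quick Read]: This paper addresses pun generation, i.e., producing humorous text with double meanings; the core challenge is to creatively modify linguistic elements for humorous effect while preserving coherence and contextual appropriateness. The key contribution is a systematic review and analysis of pun generation methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models, together with a summary of the automated and human evaluation metrics used to assess generation quality.
Link: https://arxiv.org/abs/2507.04793
Authors: Yuchen Su, Yonghua Zhu, Ruofan Wang, Zijian Huang, Diana Benavides-Prado, Michael Witbrock
Affiliations: School of Computer Science, University of Auckland, New Zealand; Singapore University of Technology and Design, Singapore; School of Electronic Engineering and Computer Science, Queen Mary University of London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: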
Abstract:Pun generation seeks to creatively modify linguistic elements in text to produce humour or evoke double meanings. It also aims to preserve coherence and contextual appropriateness, making it useful in creative writing and entertainment across various media and contexts. Although pun generation has received considerable attention in computational linguistics, there is currently no dedicated survey that systematically reviews this specific area. To bridge this gap, this paper provides a comprehensive review of pun generation datasets and methods across different stages, including conventional approaches, deep learning techniques, and pre-trained language models. Additionally, we summarise both automated and human evaluation metrics used to assess the quality of pun generation. Finally, we discuss the research challenges and propose promising directions for future work.
[NLP-37] Reason to Rote: Rethinking Memorization in Reasoning
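[Quick Read]: This paper investigates how language models memorize label noise during training, and why such memorization in many cases does not heavily affect their generalizable reasoning capabilities. The key finding is that memorization relies on generalizable reasoning mechanisms: models still compute intermediate reasoning outputs when retrieving memorized noisy labels, and intervening on the reasoning process harms memorization. The study further shows that memorization operates through distributed encoding rather than a look-up mechanism from inputs to noisy labels, and that memorizing label noise slightly shifts existing neuron activation patterns instead of storing the labels independently of the reasoning mechanisms.
Link: https://arxiv.org/abs/2507.04782
Authors: Yupei Du, Philipp Mondorf, Silvia Casola, Yuekun Yao, Robert Litschko, Barbara Plank
Affiliations: Utrecht University; LMU Munich; Saarland University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 14 figures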
Abstract:Large language models readily memorize arbitrary training instances, such as label noise, yet they perform strikingly well on reasoning tasks. In this work, we investigate how language models memorize label noise, and why such memorization in many cases does not heavily affect generalizable reasoning capabilities. Using two controllable synthetic reasoning datasets with noisy labels, four-digit addition (FDA) and two-hop relational reasoning (THR), we discover a reliance of memorization on generalizable reasoning mechanisms: models continue to compute intermediate reasoning outputs even when retrieving memorized noisy labels, and intervening on reasoning adversely affects memorization. We further show that memorization operates through distributed encoding, i.e., aggregating various inputs and intermediate results, rather than building a look-up mechanism from inputs to noisy labels. Moreover, our FDA case study reveals memorization occurs via outlier heuristics, where existing neuron activation patterns are slightly shifted to fit noisy labels. Together, our findings suggest that memorization of label noise in language models builds on, rather than overrides, the underlying reasoning mechanisms, shedding light on the intriguing phenomenon of benign memorization.
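[NLP-38] ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems
[Quick Read]: This paper addresses the limited physics reasoning of current large language models (LLMs), particularly in physical modeling, conceptual understanding, and generalization to dynamic scenarios. Existing benchmarks fall short in difficulty, question format, and evaluation setting, and cannot effectively assess LLMs' physical reasoning. The proposed solution is the ABench-Physics benchmark, whose key feature is its two components: Phy_A (400 static graduate- or Olympiad-level problems) and Phy_B (100 dynamic problems equipped with an automatic variation engine), which rigorously evaluate model robustness and generalization under precise numerical computation and changing conditions.
Link: https://arxiv.org/abs/2507.04766
Authors: Yiming Zhang, Yingfan Ma, Yanmei Gu, Zhengkai Yang, Yihong Zhuang, Feng Wang, Zenan Huang, Yuanyuan Wang, Chao Huang, Bowen Song, Cheng Lin, Junbo Zhao
Affiliations: Zhejiang University; Ant Group
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: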
Abstract:Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming, yet their capabilities in physics remain underexplored and poorly understood. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings that fail to capture physical modeling ability. In this paper, we introduce ABench-Physics, a novel benchmark designed to rigorously evaluate LLMs’ physical reasoning and generalization capabilities. ABench-Physics consists of two components: Phy_A, a static set of 400 graduate- or Olympiad-level problems; and Phy_B, a dynamic subset of 100 problems equipped with an automatic variation engine to test model robustness across changing conditions. All questions require precise numerical answers, with strict formatting and tolerance constraints. Our evaluation of several state-of-the-art LLMs reveals substantial performance gaps, highlighting persistent limitations in physical reasoning, especially in generalization to dynamic variants. ABench-Physics provides a challenging and diagnostic framework for advancing scientific reasoning in LLMs.
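The dynamic Phy_B idea, problems regenerated by a variation engine and graded on numerical answers within a tolerance, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the projectile template, the parameter ranges, the function names `make_variant` and `is_correct`, and the 1% relative tolerance are hypothetical and are not taken from the benchmark.

```python
import math
import random

G = 9.8  # gravitational acceleration in m/s^2 (assumed constant for the template)


def make_variant(rng):
    """Sample parameters for one hypothetical problem template and
    compute the reference answer from the same parameters."""
    v0 = round(rng.uniform(5.0, 30.0), 2)   # launch speed in m/s
    angle_deg = rng.choice([30, 45, 60])    # launch angle in degrees
    # Range of a projectile on level ground without air resistance.
    answer = v0 ** 2 * math.sin(2 * math.radians(angle_deg)) / G
    question = (
        f"A projectile is launched at {v0} m/s at {angle_deg} degrees above "
        f"the horizontal on level ground. With g = {G} m/s^2 and no air "
        "resistance, what is its range in metres?"
    )
    return question, answer


def is_correct(predicted, reference, rel_tol=0.01):
    """Accept a numerical answer within a relative tolerance (assumed 1%)."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the variants are reproducible
    for _ in range(3):
        question, answer = make_variant(rng)
        print(question)
        print(f"  reference answer: {answer:.3f} m")
```

Because every variant recomputes its reference answer from freshly sampled parameters, a model that has memorized one instance gains nothing on the next, which is the robustness property the abstract attributes to the dynamic subset.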
[NLP-39] CoSteer: Collaborative Decoding-Time Personalization via Local Delta Steering
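[Quick Read]: This paper addresses a dilemma in personalized text generation: large cloud-based models lack access to local, user-specific information, while small on-device models cannot match the generation quality of cloud models. The key to the solution is the CoSteer framework, which uses the logits difference produced by a local small model as a steering signal to dynamically adjust the output of a cloud-based large language model (LLM), enabling decoding-time personalization on the device while preserving privacy and computational efficiency.
Link: https://arxiv.org/abs/2507.04756
Authors: Hang Lv, Sheng Liang, Hao Wang, Hongchao Gu, Yaxiong Wu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Affiliations: University of Science and Technology of China; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: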
Abstract:Personalized text generation has become crucial for adapting language models to diverse and evolving users’ personal context across cultural, temporal, and contextual dimensions. While existing methods often rely on centralized fine-tuning or static preference alignment, they struggle to achieve real-time adaptation under resource constraints inherent to personal devices. This limitation creates a dilemma: large cloud-based models lack access to localized user-specific information, while small on-device models cannot match the generation quality of their cloud counterparts. To address this dichotomy, we present CoSteer, a novel collaborative framework that enables decoding-time personalization through localized delta steering. Our key insight lies in leveraging the logits difference between personal context-aware and -agnostic outputs from local small models as steering signals for cloud-based LLMs. Specifically, we formulate token-level optimization as an online learning problem, where local delta vectors dynamically adjust the remote LLM’s logits within the on-device environment. This approach preserves privacy by transmitting only the final steered tokens rather than raw data or intermediate vectors, while maintaining cloud-based LLMs’ general capabilities without fine-tuning. Through comprehensive experiments on various personalized generation tasks, we demonstrate that CoSteer effectively assists LLMs in generating personalized content by leveraging locally stored user profiles and histories, ensuring privacy preservation through on-device data processing while maintaining acceptable computational overhead.
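The core signal, the logits difference between context-aware and context-agnostic outputs of a local model, can be sketched as a single additive step. This is a simplification under stated assumptions: the abstract formulates the adjustment as an online learning problem, whereas the sketch applies a fixed scaling factor `alpha`; the toy vocabulary, the logits values, and the function name `steer_logits` are hypothetical.

```python
def steer_logits(cloud_logits, local_with_ctx, local_without_ctx, alpha=1.0):
    """Shift the cloud model's logits by the local model's context delta.

    delta = (local logits with personal context) - (local logits without it).
    alpha is an assumed fixed step size, not the paper's online update.
    """
    return [
        cloud + alpha * (with_ctx - without_ctx)
        for cloud, with_ctx, without_ctx in zip(
            cloud_logits, local_with_ctx, local_without_ctx
        )
    ]


def argmax_token(vocab, logits):
    """Greedy choice: the token with the highest logit."""
    return vocab[max(range(len(vocab)), key=lambda i: logits[i])]


# Toy next-token distribution over a three-word vocabulary (made-up numbers).
vocab = ["tea", "coffee", "juice"]
cloud_logits = [2.0, 2.1, 0.5]       # cloud LLM, no user information
local_with_ctx = [1.5, 0.2, 0.1]     # small local model, sees the user profile
local_without_ctx = [0.4, 0.5, 0.1]  # same local model, profile removed

steered = steer_logits(cloud_logits, local_with_ctx, local_without_ctx)
choice = argmax_token(vocab, steered)

if __name__ == "__main__":
    print("unsteered:", argmax_token(vocab, cloud_logits))
    print("steered:  ", choice)
```

In this toy case the cloud model alone slightly prefers "coffee", while the local delta raises "tea" enough to change the greedy choice. Only the chosen token would need to leave the device, which matches the privacy argument in the abstract.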
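[NLP-40] LLMs as Architects and Critics for Multi-Source Opinion Summarization
[Quick Read]: This paper addresses the use of multi-source opinion summarization (M-OS) for generating comprehensive product summaries, and the lack of evaluation datasets that has held the task back. The key to the solution is M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions; experiments show that factually enriched summaries enhance user engagement, and that M-OS-PROMPTS align with human judgment more closely than previous methods.
Link: https://arxiv.org/abs/2507.04751
Authors: Anuj Attri, Arnav Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Affiliations: Computer Science and Engineering, IIT Bombay, India; Flipkart, India
Categories: Computation and Language (cs.CL)
Comments: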
Abstract:Multi-source Opinion Summarization (M-OS) extends beyond traditional opinion summarization by incorporating additional sources of product metadata such as descriptions, key features, specifications, and ratings, alongside reviews. This integration results in comprehensive summaries that capture both subjective opinions and objective product attributes essential for informed decision-making. While Large Language Models (LLMs) have shown significant success in various Natural Language Processing (NLP) tasks, their potential in M-OS remains largely unexplored. Additionally, the lack of evaluation datasets for this task has impeded further advancements. To bridge this gap, we introduce M-OS-EVAL, a benchmark dataset for evaluating multi-source opinion summaries across 7 key dimensions: fluency, coherence, relevance, faithfulness, aspect coverage, sentiment consistency, and specificity. Our results demonstrate that M-OS significantly enhances user engagement, as evidenced by a user study in which, on average, 87% of participants preferred M-OS over opinion summaries. Our experiments demonstrate that factually enriched summaries enhance user engagement. Notably, M-OS-PROMPTS exhibit stronger alignment with human judgment, achieving an average Spearman correlation of ρ = 0.74, which surpasses the performance of previous methodologies.
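The agreement figure in this abstract is a Spearman rank correlation between automatic and human scores. As a reminder of what that statistic measures, here is a minimal pure-Python sketch; the scores are made-up illustrative values, not data from the paper, and the helper names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings of five summaries by humans and by an LLM evaluator.
human_scores = [1, 2, 3, 4, 5]
llm_scores = [2, 1, 3, 5, 4]

if __name__ == "__main__":
    print(round(spearman(human_scores, llm_scores), 2))
```

Because only the rank order matters, an evaluator can score on a different scale from the human annotators and still reach a high correlation, as long as it orders the summaries similarly.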
[NLP-41] A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic
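[Quick Read]: This paper addresses the transliteration of Judeo-Arabic into Arabic script, which is highly challenging because of ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. The key to the solution is a two-step approach: a simple character-level mapping, followed by post-correction to fix grammatical and orthographic errors. The approach offers a viable path to automatic transliteration and shows that the transliterated text lets Arabic NLP tools perform morphosyntactic tagging and machine translation.
Link: https://arxiv.org/abs/2507.04746
Authors: Juan Moreno Gonzalez, Bashar Alhafni, Nizar Habash
Affiliations: University of Cambridge; Mohamed bin Zayed University of Artificial Intelligence; New York University Abu Dhabi
Categories: Computation and Language (cs.CL)
Comments: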
Abstract:Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew and Aramaic. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts.
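The two-step shape of the approach, character-level mapping followed by post-correction, can be sketched in a few lines. The sketch rests on loud assumptions: the mapping table is partial and simplified, and the post-correction step is replaced by a tiny hypothetical lookup table that only stands in for whatever correction method the authors actually use. The names `map_characters`, `post_correct`, and `transliterate` are illustrative.

```python
# Step 1: character-level mapping from Hebrew script to Arabic script.
# Partial, simplified table for illustration only; real Judeo-Arabic mappings
# are ambiguous (one Hebrew letter can stand for several Arabic letters).
HEBREW_TO_ARABIC = {
    "א": "ا", "ב": "ب", "ד": "د", "ה": "ه", "ו": "و",
    "י": "ي", "כ": "ك", "ל": "ل", "מ": "م", "ם": "م",
    "נ": "ن", "ן": "ن", "ס": "س", "ר": "ر", "ת": "ت",
}


def map_characters(word):
    """Map each Hebrew letter to an Arabic letter; keep unknown characters."""
    return "".join(HEBREW_TO_ARABIC.get(ch, ch) for ch in word)


# Step 2: post-correction. A hypothetical lookup stands in for the real
# correction step; this entry restores a word-final ta marbuta that the
# character mapping rendered as a plain ha.
POST_CORRECTIONS = {"مدينه": "مدينة"}


def post_correct(word):
    """Replace a mapped word with its corrected form when one is known."""
    return POST_CORRECTIONS.get(word, word)


def transliterate(text):
    """Apply both steps word by word to whitespace-separated text."""
    return " ".join(post_correct(map_characters(word)) for word in text.split())


if __name__ == "__main__":
    for sample in ["כתאב", "מדינה"]:
        print(sample, "->", transliterate(sample))
```

The second sample shows why the first step alone is not enough: a one-to-one letter table cannot know that a final Hebrew he corresponds to ta marbuta in this word, so some context-aware correction has to follow.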
[NLP-42] Word stress in self-supervised speech models: A cross-linguistic comparison INTERSPEECH2025
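[Quick Read]: This paper studies cross-linguistic differences in how self-supervised speech models (S3M) represent word stress. The key to the approach is to take the speech representations learned by the Wav2vec 2.0 model and train diagnostic stress classifiers on them, in order to assess how separable stress information is in different languages and to test whether the representations are language-specific. The results show that S3M representations distinguish stressed from unstressed syllables with high accuracy, and that word stress representations are language-specific, with a larger gap between variable-stress languages (Dutch, English, and German) and fixed-stress languages (Hungarian and Polish).
Link: https://arxiv.org/abs/2507.04738
Authors: Martijn Bentum, Louis ten Bosch, Tomas O. Lentz
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2025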
Abstract:In this paper we study word stress representations learned by self-supervised speech models (S3M), specifically the Wav2vec 2.0 model. We investigate the S3M representations of word stress for five different languages: Three languages with variable or lexical stress (Dutch, English and German) and two languages with fixed or demarcative stress (Hungarian and Polish). We train diagnostic stress classifiers on S3M embeddings and show that they can distinguish between stressed and unstressed syllables in read-aloud short sentences with high accuracy. We also tested language-specificity effects of S3M word stress. The results indicate that the word stress representations are language-specific, with a greater difference between the set of variable versus the set of fixed stressed languages.
[NLP-43] “This Suits You the Best”: Query Focused Comparative Explainable Summarization
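[Quick Read]: This paper addresses the failure of traditional opinion summarization to provide holistic comparative insights for product recommendation. The proposed solution is to generate Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). The key is the MS-Q2P dataset, introduced to compensate for the lack of query-focused recommendation data, together with large language models (LLMs) that generate tabular comparative summaries with query-specific explanations, yielding summaries that are personalized, privacy-preserving, and agnostic to recommendation engine and category.
Link: https://arxiv.org/abs/2507.04733
Authors: Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Affiliations: Computer Science and Engineering, IIT Bombay, India; Flipkart, India
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: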
Abstract:Product recommendations inherently involve comparisons, yet traditional opinion summarization often fails to provide holistic comparative insights. We propose the novel task of generating Query-Focused Comparative Explainable Summaries (QF-CES) using Multi-Source Opinion Summarization (M-OS). To address the lack of query-focused recommendation datasets, we introduce MS-Q2P, comprising 7,500 queries mapped to 22,500 recommended products with metadata. We leverage Large Language Models (LLMs) to generate tabular comparative summaries with query-specific explanations. Our approach is personalized, privacy-preserving, recommendation engine-agnostic, and category-agnostic. M-OS as an intermediate step reduces inference latency by approximately 40% compared to the direct input approach (DIA), which processes raw data directly. We evaluate open-source and proprietary LLMs for generating and assessing QF-CES. Extensive evaluations using QF-CES-PROMPT across 5 dimensions (clarity, faithfulness, informativeness, format adherence, and query relevance) showed an average Spearman correlation of 0.74 with human judgments, indicating its potential for QF-CES evaluation.
[NLP-44] LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework
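[Quick Read]: This paper addresses the unreliable results caused by inconsistent evaluation settings in long-context evaluation, as well as its high computational cost. The key to the solution is LOOM-Scope, a comprehensive and efficient long-context evaluation framework that standardizes evaluation settings across benchmarks, supports the deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite for evaluating models comprehensively.
Link: https://arxiv.org/abs/2507.04723
Authors: Zecheng Tang, Haitian Wang, Quantong Qiu, Baibei Ji, Ruoxi Sun, Keyan Zhou, Juntao Li, Min Zhang
Affiliations: Soochow University; Key Laboratory of Data Intelligence and Advanced Computing, Soochow University
Categories: Computation and Language (cs.CL)
Comments: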
Abstract:Long-context processing has become a fundamental capability for large language models (LLMs). To assess models’ long-context performance, numerous long-context evaluation benchmarks have been proposed. However, variations in evaluation settings across these benchmarks lead to inconsistent results, making it difficult to draw reliable comparisons. Besides, the high computational cost of long-context evaluation poses a significant barrier for the community to conduct comprehensive assessments of long-context models. In this paper, we propose LOOM-Scope, a comprehensive and efficient framework for long-context evaluation. LOOM-Scope standardizes evaluation settings across diverse benchmarks, supports deployment of efficient long-context inference acceleration methods, and introduces a holistic yet lightweight benchmark suite to evaluate models comprehensively.
[NLP-45] Why We Feel What We Feel: Joint Detection of Emotions and Their Opinion Triggers in E-commerce
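[Quick Read]: This paper addresses the joint task of emotion detection and opinion trigger identification in e-commerce reviews, a notable research gap in understanding what triggers customers' emotional responses. The key to the solution is a new joint task, Emotion detection and Opinion Trigger extraction (EOT), which is grounded in Plutchik's theory of 8 primary emotions and explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions. The work also introduces the EOT-X dataset and EOT-DETECT, a structured prompting framework that improves model performance through systematic reasoning and self-reflection.
Link: https://arxiv.org/abs/2507.04708
Authors: Arnav Attri, Anuj Attri, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Muthusamy Chelliah, Nikesh Garera
Affiliations: Computer Science and Engineering, IIT Bombay, India; Flipkart, India
Categories: Computation and Language (cs.CL)
Comments: 23 pages, 11 figures, 7 tables. Dataset and code will be made publicly available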
Abstract:Customer reviews on e-commerce platforms capture critical affective signals that drive purchasing decisions. However, no existing research has explored the joint task of emotion detection and explanatory span identification in e-commerce reviews - a crucial gap in understanding what triggers customer emotional responses. To bridge this gap, we propose a novel joint task unifying Emotion detection and Opinion Trigger extraction (EOT), which explicitly models the relationship between causal text spans (opinion triggers) and affective dimensions (emotion categories) grounded in Plutchik’s theory of 8 primary emotions. In the absence of labeled data, we introduce EOT-X, a human-annotated collection of 2,400 reviews with fine-grained emotions and opinion triggers. We evaluate 23 Large Language Models (LLMs) and present EOT-DETECT, a structured prompting framework with systematic reasoning and self-reflection. Our framework surpasses zero-shot and chain-of-thought techniques, across e-commerce domains.
[NLP-46] XiYan-SQL: A Novel Multi-Generator Framework For Text-to-SQL
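[Quick Read]: This paper addresses challenges in the Text-to-SQL task, in particular how to generate high-quality, diverse SQL queries to improve accuracy and robustness. The key to the solution is XiYan-SQL, a framework with three core components: a Schema Filter module that filters and obtains multiple relevant schemas; a multi-generator ensemble approach that uses a multi-task fine-tuning strategy to strengthen the SQL generation models' intrinsic alignment between text and SQL and builds multiple generation models with distinct generation styles; and a selection model with a candidate reorganization strategy that picks the optimal SQL query.
Link: https://arxiv.org/abs/2507.04701
Authors: Yifu Liu, Yin Zhu, Yingqi Gao, Zhiling Luo, Xiaoxia Li, Xiaorong Shi, Yuntao Hong, Jinyang Gao, Yu Li, Bolin Ding, Jingren Zhou
Affiliations: Alibaba Cloud Computing Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments: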
Abstract:To leverage the advantages of LLMs in addressing challenges in the Text-to-SQL task, we present XiYan-SQL, an innovative framework effectively generating and utilizing multiple SQL candidates. It consists of three components: 1) a Schema Filter module filtering and obtaining multiple relevant schemas; 2) a multi-generator ensemble approach generating multiple high-quality and diverse SQL queries; 3) a selection model with a candidate reorganization strategy implemented to obtain the optimal SQL query. Specifically, for the multi-generator ensemble, we employ a multi-task fine-tuning strategy to enhance the capabilities of SQL generation models for the intrinsic alignment between SQL and text, and construct multiple generation models with distinct generation styles by fine-tuning across different SQL formats. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. Overall, XiYan-SQL achieves a new SOTA performance of 75.63% on the notable BIRD benchmark, surpassing all previous methods. It also attains SOTA performance on the Spider test set with an accuracy of 89.65%.
[NLP-47] R1-RE: Cross-Domain Relationship Extraction with RLVR
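[Quick Read]: This paper addresses the poor out-of-domain (OOD) generalization of relationship extraction (RE). Traditional methods treat RE as a supervised learning problem that maps context directly to labels, which performs poorly in OOD settings. The key to the proposed solution is to reframe RE as a reasoning task guided by annotation guidelines and to introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE. By eliciting the reasoning abilities of small language models, the method significantly improves OOD robustness.
Link: https://arxiv.org/abs/2507.04642
Authors: Runpeng Dai, Tong Zheng, Run Yang, Hongtu Zhu
Affiliations: University of North Carolina at Chapel Hill; University of Maryland College Park; BiliBili
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 7 figures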
Abstract:Relationship extraction (RE) is a core task in natural language processing. Traditional approaches typically frame RE as a supervised learning problem, directly mapping context to labels, an approach that often suffers from poor out-of-domain (OOD) generalization. Inspired by the workflow of human annotators, we reframe RE as a reasoning task guided by annotation guidelines and introduce R1-RE, the first reinforcement learning with verifiable reward (RLVR) framework for RE tasks. Our method elicits the reasoning abilities of small language models for annotation tasks, resulting in significantly improved OOD robustness. We evaluate our approach on the public Sem-2010 dataset and a private MDKG dataset. The R1-RE-7B model attains an average OOD accuracy of approximately 70%, on par with leading proprietary models such as GPT-4o. Additionally, our comprehensive analysis provides novel insights into the training dynamics and emergent reasoning behaviors of the RLVR paradigm for RE.
[NLP-48] Put Teacher in Students Shoes: Cross-Distillation for Ultra-compact Model Compression Framework KDD2025
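[Quick Read]: This paper addresses the efficient deployment of Natural Language Processing (NLP) models in resource-constrained edge settings, especially those requiring strict privacy compliance, real-time responsiveness, and multi-tasking capabilities. The key to the solution is the Edge ultra-lIte BERT (EI-BERT) framework, built around a novel cross-distillation method; a pipeline of hard token pruning, cross-distillation, and parameter quantization compresses the model efficiently, so that it meets strict memory constraints while maintaining strong performance.
Link: https://arxiv.org/abs/2507.04636
Authors: Maolin Wang, Jun Chu, Sicong Xie, Xiaoling Zang, Yao Zhao, Wenliang Zhong, Xiangyu Zhao
Affiliations: City University of Hong Kong; AntGroup
Categories: Computation and Language (cs.CL)
Comments: Accepted by KDD 2025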
Abstract:In the era of mobile computing, deploying efficient Natural Language Processing (NLP) models in resource-restricted edge settings presents significant challenges, particularly in environments requiring strict privacy compliance, real-time responsiveness, and diverse multi-tasking capabilities. These challenges create a fundamental need for ultra-compact models that maintain strong performance across various NLP tasks while adhering to stringent memory constraints. To this end, we introduce Edge ultra-lIte BERT framework (EI-BERT) with a novel cross-distillation method. EI-BERT efficiently compresses models through a comprehensive pipeline including hard token pruning, cross-distillation and parameter quantization. Specifically, the cross-distillation method uniquely positions the teacher model to understand the student model’s perspective, ensuring efficient knowledge transfer through parameter integration and the mutual interplay between models. Through extensive experiments, we achieve a remarkably compact BERT-based model of only 1.91 MB - the smallest to date for Natural Language Understanding (NLU) tasks. This ultra-compact model has been successfully deployed across multiple scenarios within the Alipay ecosystem, demonstrating significant improvements in real-world applications. For example, it has been integrated into Alipay’s live Edge Recommendation system since January 2024, currently serving the app’s recommendation traffic across 8.4 million daily active devices.
[NLP-49] Knowledge-Aware Self-Correction in Language Models via Structured Memory Graphs
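[Quick Read]: This paper addresses the tendency of large language models (LLMs) to produce factual errors (hallucinations) during generation. The key to the solution is a lightweight, interpretable framework that performs knowledge-aware self-correction using structured memory graphs based on RDF triples. The method requires no retraining or fine-tuning; instead, it post-processes model outputs and corrects factual inconsistencies using an external semantic memory.
Link: https://arxiv.org/abs/2507.04625
Authors: Swayamjit Saha
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures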
Abstract:Large Language Models (LLMs) are powerful yet prone to generating factual errors, commonly referred to as hallucinations. We present a lightweight, interpretable framework for knowledge-aware self-correction of LLM outputs using structured memory graphs based on RDF triples. Without retraining or fine-tuning, our method post-processes model outputs and corrects factual inconsistencies via external semantic memory. We demonstrate the approach using DistilGPT-2 and show promising results on simple factual prompts.
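The post-processing idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the `MEMORY` contents, the `capital_of` predicate, and the naive sentence pattern are all hypothetical stand-ins for a real RDF store and a real triple extractor.

```python
from typing import Dict, Tuple

# Hypothetical memory graph: (subject, predicate) -> object, i.e. RDF-style triples.
MEMORY: Dict[Tuple[str, str], str] = {
    ("Paris", "capital_of"): "France",
    ("Berlin", "capital_of"): "Germany",
}


def correct_triple(subject: str, predicate: str, obj: str) -> Tuple[str, bool]:
    """Return the object to keep and whether the claimed object was changed."""
    known = MEMORY.get((subject, predicate))
    if known is None or known == obj:
        # No stored fact, or the claim already agrees with memory.
        return obj, False
    return known, True


def correct_sentence(sentence: str) -> str:
    """Correct sentences of the (assumed) form '<X> is the capital of <Y>.'"""
    marker = " is the capital of "
    if marker not in sentence:
        return sentence  # pattern not recognized; leave the text untouched
    subject, rest = sentence.split(marker, 1)
    claimed = rest.rstrip(".").strip()
    fixed, _ = correct_triple(subject.strip(), "capital_of", claimed)
    return f"{subject.strip()}{marker}{fixed}."


print(correct_sentence("Paris is the capital of Germany."))
```

The model itself is never touched: the correction runs purely on generated text, which is what makes this kind of approach usable without retraining.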
[NLP-50] Retain or Reframe? A Computational Framework for the Analysis of Framing in News Articles and Reader Comments
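[Quick read]: This paper addresses how frames relate between news content (source content) and audience responses, that is, how to compute and compare frame alignment between news articles and reader comments at scale. The key to the solution is the first computational framework for this purpose: it refines frame labels, reconstructs the dominant frames of articles and comments from sentence-level predictions, and aligns articles with topically relevant comments, enabling framing analysis across both kinds of text.
Link: https://arxiv.org/abs/2507.04612
Authors: Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Affiliations: The University of Melbourne
Categories: Computation and Language (cs.CL)
Comments: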
Abstract:When a news article describes immigration as an “economic burden” or a “humanitarian crisis,” it selectively emphasizes certain aspects of the issue. Although framing shapes how the public interprets such issues, audiences do not absorb frames passively but actively reorganize the presented information. While this relationship between source content and audience response is well-documented in the social sciences, NLP approaches often ignore it, detecting frames in articles and responses in isolation. We present the first computational framework for large-scale analysis of framing across source content (news articles) and audience responses (reader comments). Methodologically, we refine frame labels and develop a framework that reconstructs dominant frames in articles and comments from sentence-level predictions, and aligns articles with topically relevant comments. Applying our framework across eleven topics and two news outlets, we find that frame reuse in comments correlates highly across outlets, while topic-specific patterns vary. We release a frame classifier that performs well on both articles and comments, a dataset of article and comment sentences manually labeled for frames, and a large-scale dataset of articles and comments with predicted frame labels.
[NLP-51] PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
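[Quick read]: This paper addresses the lack of a unified theoretical framework for large language model (LLM) personalization, aiming to understand systematically what drives effective personalization. The key to the solution is to bring in the cognitive dual-memory model, mapping episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. This yields a unified framework named PRIME that combines episodic and semantic memory mechanisms and is further augmented with a personalized thinking capability, improving how well the model captures users' dynamic preferences.
Link: https://arxiv.org/abs/2507.04607
Authors: Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang
Affiliations: University of Michigan, Ann Arbor, MI; Northeastern University, Boston, MA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: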
Abstract:Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
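[NLP-52] VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
[Quick read]: This paper addresses the fact that existing multimodal embedding models (such as VLM2Vec, E5-V, and GME) are designed mainly for natural images and offer limited support for other visual forms such as videos and visual documents, which restricts their applicability in real-world scenarios. The key to the solution is VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs, together with the MMEB-V2 benchmark, which extends the original task types with five new ones (visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering) to evaluate models across diverse visual forms.
Link: https://arxiv.org/abs/2507.04590
Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
Affiliations: Salesforce Research; UC Santa Barbara; University of Waterloo; Tsinghua University; Google
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Technical Report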
Abstract:Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.
[NLP-53] Nile-Chat: Egyptian Language Models for Arabic and Latin Scripts
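[Quick read]: This paper aims to address the adaptation of large language models (LLMs) to dual-script languages (such as Egyptian dialectal Arabic), where a model must understand and generate text in both Arabic and Latin scripts. The key to the solution is a new language adaptation approach that leverages the Branch-Train-MiX strategy to merge script-specialized expert models into a single mixture-of-experts (MoE) model, thereby supporting both scripts effectively.
Link: https://arxiv.org/abs/2507.04569
Authors: Guokan Shang, Hadi Abdine, Ahmad Chamma, Amr Mohamed, Mohamed Anwar, Abdelaziz Bounhar, Omar El Herraoui, Preslav Nakov, Michalis Vazirgiannis, Eric Xing
Affiliations: MBZUAI; Ecole Polytechnique
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: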
Abstract:We introduce Nile-Chat-4B, 3x4B-A6B, and 12B, a collection of LLMs for Egyptian dialect, uniquely designed to understand and generate texts written in both Arabic and Latin scripts. Specifically, with Nile-Chat-3x4B-A6B, we introduce a novel language adaptation approach by leveraging the Branch-Train-MiX strategy to merge script-specialized experts into a single MoE model. Our Nile-Chat models significantly outperform leading multilingual and Arabic LLMs, such as LLaMa, Jais, and ALLaM, on our newly introduced Egyptian evaluation benchmarks, which span both understanding and generative tasks. Notably, our 12B model yields a 14.4% performance gain over Qwen2.5-14B-Instruct on Latin-script benchmarks. All our resources are publicly available. We believe this work presents a comprehensive methodology for adapting LLMs to dual-script languages, addressing an often overlooked aspect in modern LLM development.
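[NLP-54] Evaluating LLMs on Real-World Forecasting Against Human Superforecasters
[Quick read]: This paper addresses the understudied question of how well large language models (LLMs) forecast future events, in particular how their accuracy compares with that of human crowds and superforecasters. The key approach is to evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus and compare their performance with that of human superforecasters, revealing the actual capabilities and limitations of LLMs on this task.
Link: https://arxiv.org/abs/2507.04562
Authors: Janna Lu
Affiliations: George Mason University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: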
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, but their ability to forecast future events remains understudied. A year ago, large language models struggled to come close to the accuracy of a human crowd. I evaluate state-of-the-art LLMs on 464 forecasting questions from Metaculus, comparing their performance against human superforecasters. Frontier models achieve Brier scores that ostensibly surpass the human crowd but still significantly underperform a group of superforecasters.
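For readers unfamiliar with the metric named in this abstract, the Brier score for binary questions is the mean squared difference between the forecast probability and the 0/1 outcome, so lower is better. The sketch below shows that standard definition; the function name and the sample forecasts are illustrative and are not taken from the paper.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equally long")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Always answering 0.5 scores 0.25 regardless of what happens,
# which makes 0.25 a useful "no information" reference point.
print(brier_score([0.5, 0.5], [1, 0]))
```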
[NLP-55] DP-Fusion: Token-Level Differentially Private Inference for Large Language Models UAI
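[Quick read]: This paper addresses the risk that large language models (LLMs) leak sensitive information from their context during inference, especially under adversarial prompting. Existing defenses for context privacy either lack formal guarantees or offer a poor privacy/utility trade-off. The key to the solution is DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM's outputs reveal about sensitive tokens in its context. The method is demonstrated on document privatization, where a parameter $\epsilon$ controls the trade-off between privacy and semantic fidelity, hiding sensitive content (such as personally identifiable information) while preserving the overall utility of the text.
Link: https://arxiv.org/abs/2507.04531
Authors: Rushil Thareja, Preslav Nakov, Praneeth Vepakomma, Nils Lukas
Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Massachusetts Institute of Technology (MIT)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Our code and data are publicly available here: this https URL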
Abstract:Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. Existing defenses that aim to preserve context privacy during inference either lack formal guarantees or suffer from a poor utility/privacy trade-off. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM’s outputs reveal about sensitive tokens in its context. We demonstrate DPI through the task of document privatization, where the goal is to paraphrase documents so that sensitive content (e.g., Personally Identifiable Information, PII) cannot be reliably inferred, while still preserving the overall utility of the text. This is controlled by a parameter $\epsilon$: $\epsilon = 0$ hides PII entirely, while higher values trade off privacy for improved paraphrase quality. DP-Fusion works as follows: (i) partition sensitive tokens into disjoint privacy groups, (ii) run the LLM once per group, and (iii) blend the output distributions so that the final output remains within a fixed statistical distance of the baseline distribution produced when no privacy group is revealed. This approach allows fine-grained control over the privacy/utility trade-off but requires multiple LLM forward passes.
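Step (iii) of the abstract, blending distributions while staying within a fixed distance of the baseline, can be sketched for a single privacy group as follows. This is a toy illustration under stated assumptions, not the paper's mechanism: the one-sided maximum log-ratio used as the "statistical distance", the linear mixture, the bisection search, and all function names are choices made here for clarity. It also assumes the baseline distribution has full support.

```python
import math
from typing import List


def max_log_ratio(p: List[float], q: List[float]) -> float:
    """Assumed distance: max_i log(p_i / q_i), with q_i > 0 for every token."""
    return max(math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def blend(public: List[float], private: List[float], lam: float) -> List[float]:
    """Linear mixture of the baseline (public) and group-revealing (private) distributions."""
    return [(1 - lam) * a + lam * b for a, b in zip(public, private)]


def largest_lambda(public: List[float], private: List[float],
                   bound: float, iters: int = 50) -> float:
    """Largest mixing weight whose blend stays within `bound` of the baseline."""
    if max_log_ratio(blend(public, private, 1.0), public) <= bound:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):  # the distance grows with lam, so bisection applies
        mid = (lo + hi) / 2
        if max_log_ratio(blend(public, private, mid), public) <= bound:
            lo = mid
        else:
            hi = mid
    return lo


public = [0.5, 0.5]   # next-token distribution with the sensitive group hidden
private = [0.9, 0.1]  # next-token distribution with the sensitive group revealed
lam = largest_lambda(public, private, bound=math.log(1.2))
print(lam, blend(public, private, lam))
```

With a bound of zero the search returns a weight of zero, so the output equals the baseline, which mirrors the abstract's statement that $\epsilon = 0$ hides the sensitive content entirely.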
[NLP-56] DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging
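[Quick read]: This paper addresses the high cost and limited accessibility caused by computational redundancy in large pre-trained models (such as large Transformers and large language models, LLMs). The key to the solution is to reduce model width by combining groups of similar neurons into fewer neurons, framing this as a discrete optimal transport problem, and to propose DOTResize, which uses optimal transport theory to transform and compress model weights. The method incorporates entropic regularization and matrix factorization to ensure applicability within the Transformer architecture. Unlike pruning-based approaches, DOTResize re-projects the entire neuron width, retaining and redistributing useful signal; it outperforms simple and state-of-the-art width-pruning techniques across multiple LLM families and sizes while achieving measurable reductions in real-world computational cost.
Link: https://arxiv.org/abs/2507.04517
Authors: Neha Verma, Kenton Murray, Kevin Duh
Affiliations: Johns Hopkins University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: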
Abstract:Model compression offers a promising path to reducing the cost and inaccessibility of large pre-trained models, without significantly compromising their impressive performance. Large Transformer models, including large language models (LLMs), often contain computational redundancy, which can serve as a target for new model compression methods. In this work, we specifically target neuron-level redundancies in model layers by combining groups of similar neurons into fewer neurons. We frame this width reduction as a Discrete Optimal Transport problem, and propose DOTResize, a novel Transformer compression method that uses optimal transport theory to transform and compress model weights. To ensure applicability within the Transformer architecture, we motivate and incorporate entropic regularization and matrix factorization into the transportation maps produced by our method. Unlike pruning-based approaches which discard neurons based on importance measures, DOTResize re-projects the entire neuron width, allowing the retention and redistribution of useful signal across the reduced layer. Empirical results show that compared to simple or state-of-the-art neuron width-pruning techniques, DOTResize can outperform these methods across multiple LLM families and sizes, while achieving measurable reductions in real-world computational cost.
[NLP-57] AdS: Adapter-state Sharing Framework for Multimodal Sarcasm Detection
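[Quick read]: This paper addresses the challenge that multimodal image-text sarcasm on social media poses for opinion mining, especially under resource constraints. Existing approaches rely on full fine-tuning of large pre-trained models, which makes them unsuitable for low-resource settings. The key to the proposed solution is AdS (Adapter-State Sharing), built on CLIP, which inserts adapters only in the upper layers and introduces a novel adapter-state sharing mechanism in which textual adapters guide visual ones. This design promotes efficient cross-modal learning while preserving low-level unimodal representations.
Link: https://arxiv.org/abs/2507.04508
Authors: Soumyadeep Jana, Sahil Danayak, Sanasam Ranbir Singh
Affiliations: IIT Guwahati
Categories: Computation and Language (cs.CL)
Comments: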
Abstract:The growing prevalence of multimodal image-text sarcasm on social media poses challenges for opinion mining, especially under resource constraints. Existing approaches rely on full fine-tuning of large pre-trained models, making them unsuitable for low-resource settings. While recent parameter-efficient fine-tuning (PEFT) methods offer promise, their off-the-shelf use underperforms on complex tasks like sarcasm detection. We propose AdS (Adapter-State Sharing), a lightweight framework built on CLIP that inserts adapters only in the upper layers and introduces a novel adapter-state sharing mechanism, where textual adapters guide visual ones. This design promotes efficient cross-modal learning while preserving low-level unimodal representations. Experiments on two public benchmarks demonstrate that AdS achieves state-of-the-art results using significantly fewer trainable parameters than existing PEFT and full fine-tuning approaches.
[NLP-58] Unveiling the Potential of Diffusion Large Language Model in Controllable Generation
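[Quick read]: This paper aims to address the challenges that diffusion large language models face in controllable text generation, including high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs. The key to the solution is a new framework, Self-adaptive Schema Scaffolding ($S^3$), which injects the target schema structure into the output context, reducing unnecessary computation and improving controllability, and thereby improving both the semantic fidelity of structured outputs and inference efficiency.
Link: https://arxiv.org/abs/2507.04504
Authors: Zhen Xiong, Yujun Cai, Zhecheng Li, Yiwei Wang
Affiliations: University of Southern California; University of Queensland; University of California, San Diego; University of California, Merced
Categories: Computation and Language (cs.CL)
Comments: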
Abstract:Diffusion models, originally developed for image generation, have emerged as a promising alternative to autoregressive large language models (LLMs). We present a theoretical analysis comparing autoregressive and masked diffusion LLMs, revealing that the intrinsic bidirectional attention mechanism of diffusion LLMs (dLLMs) enables superior context modeling and generation controllability. However, existing dLLM applications face significant challenges in controllable generation: the native multi-step denoising process exhibits high sensitivity to sequence length, elevated hallucination rates, and prohibitive inference costs without specialized optimizations. To address these limitations, we propose Self-adaptive Schema Scaffolding ($S^3$), a novel framework that enables dLLMs to generate structured outputs (e.g., JSON) while maintaining semantic fidelity and accelerating inference. Our approach injects the target schema structure into the output context, reducing unnecessary computation while improving controllability. Extensive experiments demonstrate that $S^3$ achieves substantial improvements: 65% increase in structural adherence, 48% enhancement in content fidelity, and 17% reduction in hallucination rates compared to baseline. These results establish both theoretical foundations and practical pathways for deploying diffusion models in controllable text generation tasks. Code and data will be publicly released.
[NLP-59] A validity-guided workflow for robust large language model research in psychology
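[Quick read]: This paper addresses the measurement unreliability that arises when large language models (LLMs) are used in psychological research as research tools, evaluation targets, human simulators, and cognitive models: personality assessments collapse under factor analysis, moral preferences reverse with punctuation changes, and theory-of-mind accuracy fluctuates with trivial rephrasing, phenomena the paper calls "measurement phantoms". The key to the solution is a six-stage workflow grounded in a dual-validity framework that integrates psychometrics with causal inference. The stages are: defining the research goal and its validity requirements, developing and validating computational instruments, designing experiments that control for computational confounds, executing protocols transparently, analyzing data with methods suited to non-independent observations, and reporting findings within demonstrated boundaries while using them to refine theory.
Link: https://arxiv.org/abs/2507.04491
Authors: Zhicheng Lin
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: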
Abstract:Large language models (LLMs) are rapidly being integrated into psychological research as research tools, evaluation targets, human simulators, and cognitive models. However, recent evidence reveals severe measurement unreliability: personality assessments collapse under factor analysis, moral preferences reverse with punctuation changes, and theory-of-mind accuracy varies widely with trivial rephrasing. These “measurement phantoms” (statistical artifacts masquerading as psychological phenomena) threaten the validity of a growing body of research. Guided by the dual-validity framework that integrates psychometrics with causal inference, we present a six-stage workflow that scales validity requirements to research ambition: using LLMs to code text requires basic reliability and accuracy, while claims about psychological properties demand comprehensive construct validation. Researchers must (1) explicitly define their research goal and corresponding validity requirements, (2) develop and validate computational instruments through psychometric testing, (3) design experiments that control for computational confounds, (4) execute protocols with transparency, (5) analyze data using methods appropriate for non-independent observations, and (6) report findings within demonstrated boundaries and use results to refine theory. We illustrate the workflow through an example of model evaluation (“LLM selfhood”), showing how systematic validation can distinguish genuine computational phenomena from measurement artifacts. By establishing validated computational instruments and transparent practices, this workflow provides a path toward building a robust empirical foundation for AI psychology research.
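[NLP-60] The role of large language models in UI/UX design: A systematic literature review
[Quick read]: This paper examines how large language models (LLMs) can be integrated into the user interface/user experience (UI/UX) design process to improve design efficiency and quality. Its key findings concern the use of LLMs across stages of the design lifecycle, from ideation through prototyping to user testing and evaluation, together with common practices such as prompt engineering, human-in-the-loop workflows, and multimodal input.
Link: https://arxiv.org/abs/2507.04469
Authors: Ammar Ahmed, Ali Shariq Imran
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: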
Abstract:This systematic literature review examines the role of large language models (LLMs) in UI/UX design, synthesizing findings from 38 peer-reviewed studies published between 2022 and 2025. We identify key LLMs in use, including GPT-4, Gemini, and PaLM, and map their integration across the design lifecycle, from ideation to evaluation. Common practices include prompt engineering, human-in-the-loop workflows, and multimodal input. While LLMs are reshaping design processes, challenges such as hallucination, prompt instability, and limited explainability persist. Our findings highlight LLMs as emerging collaborators in design, and we propose directions for the ethical, inclusive, and effective integration of these technologies.
[NLP-61] Dual Modality-Aware Gated Prompt Tuning for Few-Shot Multimodal Sarcasm Detection
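[Quick read]: This paper addresses the reliance of multimodal sarcasm detection models on large annotated datasets, which limits them in few-shot and extremely low-resource settings. The key to the solution is DMDP (Deep Modality-Disentangled Prompt Tuning), which applies gated, modality-specific deep prompts to the text and visual encoders, injects them across multiple layers for hierarchical feature learning, and adds a cross-modal prompt alignment module to capture nuanced interactions between image and text representations, improving the detection of diverse sarcasm types.
Link: https://arxiv.org/abs/2507.04468
Authors: Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
Affiliations: IIT Guwahati
Categories: Computation and Language (cs.CL)
Comments: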
Abstract:The widespread use of multimodal content on social media has heightened the need for effective sarcasm detection to improve opinion mining. However, existing models rely heavily on large annotated datasets, making them less suitable for real-world scenarios where labeled data is scarce. This motivates the need to explore the problem in a few-shot setting. To this end, we introduce DMDP (Deep Modality-Disentangled Prompt Tuning), a novel framework for few-shot multimodal sarcasm detection. Unlike prior methods that use shallow, unified prompts across modalities, DMDP employs gated, modality-specific deep prompts for text and visual encoders. These prompts are injected across multiple layers to enable hierarchical feature learning and better capture diverse sarcasm types. To enhance intra-modal learning, we incorporate a prompt-sharing mechanism across layers, allowing the model to aggregate both low-level and high-level semantic cues. Additionally, a cross-modal prompt alignment module enables nuanced interactions between image and text representations, improving the model’s ability to detect subtle sarcastic intent. Experiments on two public datasets demonstrate DMDP’s superior performance in both few-shot and extremely low-resource settings. Further cross-dataset evaluations show that DMDP generalizes well across domains, consistently outperforming baseline methods.
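[NLP-62] Think Twice Before You Judge: Mixture of Dual Reasoning Experts for Multimodal Sarcasm Detection
[Quick read]: This paper addresses the difficulty multimodal sarcasm detection models have in capturing the deeper rationale behind sarcasm, since existing models rely mainly on shallow cues such as image captions or object-attribute pairs. The key to the solution is MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert that detects incongruities within the image-text pair and an external reasoning expert that uses structured rationales generated via Chain-of-Thought prompting, with an adaptive gating mechanism that dynamically weighs the two experts to improve the understanding of sarcasm.
Link: https://arxiv.org/abs/2507.04458
Authors: Soumyadeep Jana, Abhrajyoti Kundu, Sanasam Ranbir Singh
Affiliations: IIT Guwahati
Categories: Computation and Language (cs.CL)
Comments: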
Abstract:Multimodal sarcasm detection has attracted growing interest due to the rise of multimedia posts on social media. Understanding sarcastic image-text posts often requires external contextual knowledge, such as cultural references or commonsense reasoning. However, existing models struggle to capture the deeper rationale behind sarcasm, relying mainly on shallow cues like image captions or object-attribute pairs from images. To address this, we propose MiDRE (Mixture of Dual Reasoning Experts), which integrates an internal reasoning expert for detecting incongruities within the image-text pair and an external reasoning expert that utilizes structured rationales generated via Chain-of-Thought prompting to a Large Vision-Language Model. An adaptive gating mechanism dynamically weighs the two experts, selecting the most relevant reasoning path. Experiments on two benchmark datasets show that MiDRE achieves superior performance over baselines. Various qualitative analyses highlight the crucial role of external rationales, revealing that even when they are occasionally noisy, they provide valuable cues that guide the model toward a better understanding of sarcasm.
[NLP-63] GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models ACL2025
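[Quick read]: This paper addresses the privacy risks of domain-specific fine-tuning of large language models (LLMs), in particular the exposure that traditional centralized fine-tuning creates for both model owners and data owners. The key to the proposed solution is GradOT, an offsite-tuning (OT) method based on gradient-preserving compression. It analyzes the OT problem from an optimization perspective and selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of the fine-tuned adapters while ensuring privacy, and thus balancing privacy protection against model performance.
Link: https://arxiv.org/abs/2507.04455
Authors: Kai Yao, Zhaorui Tan, Penglei Gao, Lichun Li, Kaixin Wu, Yinggui Wang, Yuan Zhao, Yixin Ji, Wei Wang, Jianke Zhu
Affiliations: Zhejiang University; Ant Group; University of Liverpool; Cleveland Clinic Lerner Research Institution; Soochow University
Categories: Computation and Language (cs.CL)
Comments: Accepted by ACL 2025 main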
Abstract:With the rapid growth of large language models (LLMs), traditional centralized fine-tuning has emerged as a key technique for adapting these models to domain-specific challenges, but it poses privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), has been proposed to address these challenges: a weaker emulator is compressed from the original model and further fine-tuned with an adapter to enhance privacy. However, existing OT-based methods incur high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression, named GradOT. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.
[NLP-64] A Linguistic Analysis of Spontaneous Thoughts: Investigating Experiences of Déjà Vu, Unexpected Thoughts, and Involuntary Autobiographical Memories
[Quick read]: This paper addresses how linguistic features can reveal the dynamic interactions among cognition, emotion, and attention that underlie spontaneous thoughts. The key to the solution is to treat language as a window into spontaneous cognitive states: by analyzing the linguistic patterns in participants' descriptions of these thought types, existing theories about these attentional states can be updated and reaffirmed.
Link: https://arxiv.org/abs/2507.04439
Authors: Videep Venkatesha, Mary Cati Poulos, Christopher Steadman, Caitlin Mills, Anne M. Cleary, Nathaniel Blanchard
Affiliations: Colorado State University; University of Minnesota
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at CogSci 2025
Abstract:The onset of spontaneous thoughts is reflective of dynamic interactions between cognition, emotion, and attention. Typically, these experiences are studied through subjective appraisals that focus on their triggers, phenomenology, and emotional salience. In this work, we use linguistic signatures to investigate Deja Vu, Involuntary Autobiographical Memories and Unexpected Thoughts. Specifically, we analyze the inherent characteristics of the linguistic patterns in participant-generated descriptions of these thought types. We show how, by positioning language as a window into spontaneous cognition, existing theories on these attentional states can be updated and reaffirmed. Our findings align with prior research, reinforcing that Deja Vu is a metacognitive experience characterized by abstract and spatial language, Involuntary Autobiographical Memories are rich in personal and emotionally significant detail, and Unexpected Thoughts are marked by unpredictability and cognitive disruption. This work demonstrates language's potential to reveal deeper insights into how internal spontaneous cognitive states manifest through expression.
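[NLP-65] MedGellan: LLM-Generated Medical Guidance to Support Physicians
[Quick read]: This paper addresses the fact that errors in medical decision-making can have serious, even life-threatening consequences, and explores hybrid frameworks that combine machine intelligence with human oversight where full automation remains out of reach. The key to the solution is MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which a physician then uses to predict diagnoses. It adopts a Bayesian-inspired prompting strategy that respects the temporal order of clinical data.
Link: https://arxiv.org/abs/2507.04431
Authors: Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini
Affiliations: University of Pisa; University of Trento; CIMeC, University of Trento
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: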
Abstract:Medical decision-making is a critical task, where errors can result in serious, potentially life-threatening consequences. While full automation remains challenging, hybrid frameworks that combine machine intelligence with human oversight offer a practical alternative. In this paper, we present MedGellan, a lightweight, annotation-free framework that uses a Large Language Model (LLM) to generate clinical guidance from raw medical records, which is then used by a physician to predict diagnoses. MedGellan uses a Bayesian-inspired prompting strategy that respects the temporal order of clinical data. Preliminary experiments show that the guidance generated by the LLM with MedGellan improves diagnostic performance, particularly in recall and $F_1$ score.
[NLP-66] RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling
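[Quick read]: This paper addresses the computational bottleneck that reliance on softmax attention creates for Transformer models in long-context settings. The key to the solution is RAT, an intermediate design between recurrence and attention: it partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions, combining the strengths of RNNs and attention.
Link: https://arxiv.org/abs/2507.04416
Authors: Xiuying Wei, Anunay Yadav, Razvan Pascanu, Caglar Gulcehre
Affiliations: CLAIRE, EPFL; Google DeepMind
Categories: Computation and Language (cs.CL)
Comments: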
Abstract:Transformers have become the cornerstone of modern large-scale language models; however, their dependence on softmax attention poses a major computational bottleneck, particularly in long-context settings. In this work, rather than following prevalent approaches such as linear attention (or SSMs) and local attention, we introduce an intermediate design called RAT between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. By adjusting the size of the chunk, RAT enables flexible trade-offs, combining the strengths of RNN and attention. Empirically, with a chunk size of 16, the RAT layer achieves a $7\times$ improvement in training speed with 100K token sequences and $9\times$ in generation at 4K sequence length, while maintaining similar or sometimes even better accuracy compared to standard attention. We demonstrate this by training 1.3B parameter models from scratch and performing large-scale evaluations, including short- and long-context benchmarks, as well as supervised fine-tuning (SFT). We further propose a hybrid architecture that interleaves RAT with local attention. By combining efficient long-range modeling with strong local interactions, this hybrid design not only improves inference speed and reduces cache memory usage compared to attention, but also consistently enhances performance, for example, achieving an average 1 point gain in commonsense reasoning tasks, up to 4 points on code tasks, and a 1 point Rouge-L increase in a summarization SFT task. Code is available at this https URL
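The chunk-then-attend structure described in this abstract can be sketched as a toy layer. The sketch makes several simplifying assumptions that are not from the paper: there are no learned projections (inputs double as queries), the recurrence uses one fixed scalar `gate`, the state resets at every chunk boundary, and each position attends over the final states of earlier chunks plus its own running state. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def chunk_recurrence(xs: List[Vector], chunk_size: int, gate: float) -> List[Vector]:
    """Linear recurrence h_t = gate * h_{t-1} + (1 - gate) * x_t, reset per chunk."""
    states: List[Vector] = []
    h: Vector = []
    for t, x in enumerate(xs):
        if t % chunk_size == 0:
            h = [0.0] * len(x)  # a new chunk starts from a zero state
        h = [gate * hi + (1 - gate) * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states


def rat_layer(xs: List[Vector], chunk_size: int = 2, gate: float = 0.5) -> List[Vector]:
    """Causal softmax attention over per-chunk summaries of the recurrence."""
    states = chunk_recurrence(xs, chunk_size, gate)
    d = len(xs[0])
    outputs: List[Vector] = []
    for t, query in enumerate(xs):
        c = t // chunk_size
        # Memory: last state of every earlier chunk, plus the current running state.
        memory = [states[(k + 1) * chunk_size - 1] for k in range(c)] + [states[t]]
        scores = [dot(query, m) / math.sqrt(d) for m in memory]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        outputs.append([sum(w * m[j] for w, m in zip(weights, memory)) / total
                        for j in range(d)])
    return outputs


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
for row in rat_layer(xs, chunk_size=2, gate=0.5):
    print(row)
```

In this toy version the attention at position t covers roughly t / chunk_size entries instead of t, which is where the trade-off controlled by the chunk size comes from.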
[NLP-67] MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
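[Quick read]: This paper addresses how to evaluate the ability of multimodal large language models (LLMs) to understand human mental states (Theory of Mind, ToM). The key to the solution is the MOMENTS benchmark, a comprehensive evaluation framework built on realistic narrative scenarios from short films. It contains over 2,344 multiple-choice questions spanning seven ToM categories and uses long video context windows and realistic social interactions to probe characters' mental states in depth.
Link: https://arxiv.org/abs/2507.04415
Authors: Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio
Affiliations: MBZUAI; University of Houston; McGill University; University of Michigan
Categories: Computation and Language (cs.CL)
Comments: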
Abstract:Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MOMENTS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MOMENTS includes over 2,344 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. While the visual modality generally enhances model performance, current systems still struggle to integrate it effectively, underscoring the need for further research into AI’s multimodal understanding of human behavior.
[NLP-68] HM@SimpleText 2025 – Task 1.1: Revisiting Text Simplification based on Complex Terms for Non-Experts
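Quick read: This paper addresses the simplification of scientific text so that non-domain experts can more easily understand complex scientific content. The key to its solution is to identify complex terms in sentences and rephrase them with small Gemini and OpenAI large language models for non-expert readers.
Link: https://arxiv.org/abs/2507.04414
Authors: Nico Hofmann,Julian Dauenhauer,Nils Ole Dietzler,Idehen Daniel Idahor,Christin Katharina Kreutz
Affiliations: TH Mittelhessen - University of Applied Sciences, Gießen, Germany; Herder Institute, Marburg, Germany
Categories: Computation and Language (cs.CL)
Comments: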
Abstract:Scientific text is complex as it contains technical terms by definition. Simplifying such text for non-domain experts enhances accessibility of innovation and information. Politicians could be enabled to understand new findings on topics on which they intend to pass a law, or family members of seriously ill patients could read about clinical trials. The SimpleText CLEF Lab focuses on exactly this problem of simplification of scientific text. Task 1.1 of the 2025 edition specifically handles the simplification of complex sentences, so very short texts with little context. To tackle this task we investigate the identification of complex terms in sentences which are rephrased using small Gemini and OpenAI large language models for non-expert readers.
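[NLP-69] SpiritRAG: A QA System for Religion and Spirituality in the United Nations Archive
Quick read: This paper addresses the difficulty of operationalizing the concepts of religion and spirituality (R/S) in conventional archival search strategies, in particular extracting actionable R/S-related insights from datasets that are very large, poorly accessible, and noisy. The key to its solution is SpiritRAG, an interactive question-answering system based on Retrieval-Augmented Generation (RAG). Built from 7,500 United Nations (UN) resolution documents in the domains of health and education, it supports complex, context-sensitive searches over large datasets through an easy-to-use, chat-based web interface.
Link: https://arxiv.org/abs/2507.04395
Authors: Yingqiang Gao,Fabian Winiger,Patrick Montjourides,Anastassia Shaitarova,Nianlong Gu,Simon Peng-Keller,Gerold Schneider
Affiliations: University of Zurich
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: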
Abstract:Religion and spirituality (R/S) are complex and highly domain-dependent concepts which have long confounded researchers and policymakers. Due to their context-specificity, R/S are difficult to operationalize in conventional archival search strategies, particularly when datasets are very large, poorly accessible, and marked by information noise. As a result, considerable time investments and specialist knowledge are often needed to extract actionable insights related to R/S from general archival sources, increasing reliance on published literature and manual desk reviews. To address this challenge, we present SpiritRAG, an interactive Question Answering (QA) system based on Retrieval-Augmented Generation (RAG). Built using 7,500 United Nations (UN) resolution documents related to R/S in the domains of health and education, SpiritRAG allows researchers and policymakers to conduct complex, context-sensitive database searches of very large datasets using an easily accessible, chat-based web interface. SpiritRAG is lightweight to deploy and leverages both UN documents and user provided documents as source material. A pilot test and evaluation with domain experts on 100 manually composed questions demonstrates the practical value and usefulness of SpiritRAG.
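[NLP-70] Does Learning Mathematical Problem-Solving Generalize to Broader Reasoning?
Quick read: This paper investigates whether learning mathematical problem-solving (MPS) generalizes to other reasoning tasks. Its key contribution is to examine the generalization potential of different MPS training approaches (continual pretraining, instruction tuning, and rule-based reinforcement learning) across multiple data sources, and to analyze how they affect a model's overall reasoning ability. The study finds that continual pretraining on math text generalizes to general reasoning tasks to some extent, whereas instruction tuning on short chain-of-thought (CoT) samples brings limited benefit and can even harm generalization. By contrast, training on long CoT responses and applying rule-based reinforcement learning significantly improves reasoning in other domains, suggesting that longer reasoning chains coupled with self-reflection are an effective route to generalized reasoning.
Link: https://arxiv.org/abs/2507.04391
Authors: Ruochen Zhou,Minrui Xu,Shiqi Chen,Junteng Liu,Yunqi Li,Xinxin Lin,Zhengyu Chen,Junxian He
Affiliations: City University of Hong Kong; HKUST; Meituan
Categories: Computation and Language (cs.CL)
Comments: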
Abstract:There has been a growing interest in enhancing the mathematical problem-solving (MPS) capabilities of large language models. While the majority of research efforts concentrate on creating specialized models to solve mathematical problems, it remains unknown how learning mathematical problem-solving generalizes to help develop other reasoning abilities. In this paper, we present an empirical investigation into the generalization potential of various MPS training approaches, such as continual pretraining, instruction tuning, and rule-based reinforcement learning across various data sources, including both short and long chain-of-thought (CoT) samples. Evaluation on 5 mathematical and 8 general reasoning benchmarks shows that continual pretraining on math text is able to generalize to general reasoning tasks to some extent. In contrast, instruction tuning on conventional, short MPS samples provides limited benefits and, in many cases, even impairs generalization performance. Notably, training with long CoT responses for MPS samples and incorporating rule-based reinforcement learning on MPS queries exhibit distinct behavior, significantly enhancing generalization by extending the model’s reasoning processes into other domains. These results suggest that traditional approaches to learning MPS with short reasoning chains largely fail to achieve robust generalization. However, the emerging paradigm of longer reasoning chains, coupled with self-reflection, offers a promising direction for improving generalized reasoning abilities through learning from specialized domains.
[NLP-71] Multi-Modal Semantic Parsing for the Interpretation of Tombstone Inscriptions
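Quick read: This paper addresses the difficulty of interpreting, organizing, and retrieving tombstone content, given the preservation challenges tombstones face from physical erosion, vandalism, environmental degradation, and political shifts. The key to its solution is a multi-modal framework that uses vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), and incorporates retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. This markedly improves parsing accuracy; the model's robustness is also evaluated across diverse linguistic and cultural inscriptions and under damaged conditions.
Link: https://arxiv.org/abs/2507.04377
Authors: Xiao Zhang,Johan Bos
Affiliations: University of Groningen
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Accepted by ACMMM 2025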
Abstract:Tombstones are historically and culturally rich artifacts, encapsulating individual lives, community memory, historical narratives and artistic expression. Yet, many tombstones today face significant preservation challenges, including physical erosion, vandalism, environmental degradation, and political shifts. In this paper, we introduce a novel multi-modal framework for tombstone digitization, aiming to improve the interpretation, organization and retrieval of tombstone content. Our approach leverages vision-language models (VLMs) to translate tombstone images into structured Tombstone Meaning Representations (TMRs), capturing both image and text information. To further enrich semantic parsing, we incorporate retrieval-augmented generation (RAG) to integrate externally dependent elements such as toponyms, occupation codes, and ontological concepts. Compared to traditional OCR-based pipelines, our method improves parsing accuracy from an F1 score of 36.1 to 89.5. We additionally evaluate the model’s robustness across diverse linguistic and cultural inscriptions, and simulate physical degradation through image fusion to assess performance under noisy or damaged conditions. Our work represents the first attempt to formalize tombstone understanding using large vision-language models, presenting implications for heritage preservation.
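[NLP-72] Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs
Quick read: This paper addresses the safety of generative AI under jailbreak attacks, in which an attacker bypasses a model's safety guardrails and causes it to produce harmful or non-compliant content. The key solution it proposes is Attention Sharpening, which uses temperature scaling to directly counter Attention Slipping, the phenomenon in which the model gradually reduces the attention it allocates to the unsafe request during an attack until the safety mechanism fails. Attention Sharpening effectively suppresses Attention Slipping while maintaining performance on benign tasks, and it adds no computational or memory overhead, which makes it practical to deploy.
Link: https://arxiv.org/abs/2507.04365
Authors: Xiaomeng Hu,Pin-Yu Chen,Tsung-Yi Ho
Affiliations: The Chinese University of Hong Kong; IBM Research
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: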
Abstract:As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.
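The abstract says only that Attention Sharpening sharpens the attention score distribution using temperature scaling. As a generic illustration of that idea, the sketch below applies a softmax with a temperature below 1 to a row of attention scores. The function name, the example scores, and the temperature value are assumptions for illustration; where in the model the scaling is applied and which temperature the authors use are not stated in the abstract.

```python
import math


def softmax_with_temperature(scores, temperature=1.0):
    """Softmax over scores / temperature; a temperature below 1 sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pre-softmax attention scores for three tokens.
raw_scores = [2.0, 1.0, 0.5]
baseline = softmax_with_temperature(raw_scores, temperature=1.0)
sharpened = softmax_with_temperature(raw_scores, temperature=0.5)
```

With the lower temperature, the highest-scoring token receives a larger share of the attention mass and the other tokens receive less, at no extra cost beyond a scalar division.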
[NLP-73] Large Language Models Varying Accuracy in Recognizing Risk-Promoting and Health-Supporting Sentiments in Public Health Discourse: The Cases of HPV Vaccination and Heated Tobacco Products
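Quick read: This paper examines how accurately large language models (LLMs) detect different types of health sentiment (risk-promoting versus health-supporting) in health-related public discourse. The key to its approach is to collect multiple sets of Facebook and Twitter messages supporting or opposing recommended health behaviors, use human annotations as the gold standard for sentiment classification, and evaluate three prominent LLMs (GPT, Gemini, and LLAMA) on two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). The results reveal differences in performance by platform, health issue, and model type, as well as the difficulty of reliably detecting neutral messages.
Link: https://arxiv.org/abs/2507.04364
Authors: Soojong Kim,Kwanho Kim,Hye Min Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Forthcoming in Social Science & Medicine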
Abstract:Machine learning methods are increasingly applied to analyze health-related public discourse based on large-scale data, but questions remain regarding their ability to accurately detect different types of health sentiments. Especially, Large Language Models (LLMs) have gained attention as a powerful technology, yet their accuracy and feasibility in capturing different opinions and perspectives on health issues are largely unexplored. Thus, this research examines how accurate the three prominent LLMs (GPT, Gemini, and LLAMA) are in detecting risk-promoting versus health-supporting sentiments across two critical public health topics: Human Papillomavirus (HPV) vaccination and heated tobacco products (HTPs). Drawing on data from Facebook and Twitter, we curated multiple sets of messages supporting or opposing recommended health behaviors, supplemented with human annotations as the gold standard for sentiment classification. The findings indicate that all three LLMs generally demonstrate substantial accuracy in classifying risk-promoting and health-supporting sentiments, although notable discrepancies emerge by platform, health issue, and model type. Specifically, models often show higher accuracy for risk-promoting sentiment on Facebook, whereas health-supporting messages on Twitter are more accurately detected. An additional analysis also shows the challenges LLMs face in reliably detecting neutral messages. These results highlight the importance of carefully selecting and validating language models for public health analyses, particularly given potential biases in training data that may lead LLMs to overestimate or underestimate the prevalence of certain perspectives.
[NLP-74] HatePRISM: Policies Platforms and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation
Quick read: This paper addresses the inconsistency, and the lack of a unified framework, in how hate speech is regulated and moderated across countries and social platforms. The key to its approach is a multi-perspective analysis (country regulations, social platform policies, and NLP research datasets) that exposes the differences among existing hate speech definitions and moderation practices, and proposes research directions toward a unified framework for automated hate speech moderation that incorporates diverse strategies.
Link: https://arxiv.org/abs/2507.04350
Authors: Naquee Rizwan,Seid Muhie Yimam,Daryna Dementieva,Florian Skupin,Tim Fischer,Daniil Moskovskiy,Aarushi Ajay Borkar,Robert Geislinger,Punyajoy Saha,Sarthak Roy,Martin Semmann,Alexander Panchenko,Chris Biemann,Animesh Mukherjee
Affiliations: IIT Kharagpur; University of Hamburg; Technical University of Munich; Bucerius Law School; Skolkovo Institute of Science and Technology; Artificial Intelligence Research Institute
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2406.19543
Abstract:Despite regulations imposed by nations and social media platforms, e.g. (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measures like detoxification and counterspeech. In our work, which we call HatePRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, alongside a lack of alignment with research efforts. Based on these insights, we suggest ideas and research directions for further exploration of a unified framework for automated hate speech moderation incorporating diverse strategies.
[NLP-75] SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control
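Quick read: This paper addresses redundancy and inefficiency in the reasoning of large reasoning models (LRMs). Global length penalties tend to over-compress critical reasoning steps while preserving unnecessary detail in simple ones, which yields a suboptimal trade-off between accuracy and efficiency. The key to the solution is SmartThinker, a two-stage learnable framework that controls the length of reasoning chains at a fine granularity based on the importance of each step. Its core innovation is the second stage, Step-Level Length Control Policy Optimization (SCPO), which comprises an online importance estimator, a step-level length control reward function, step-level generalized advantage estimation (S-GAE), and a difficulty-adaptive clipping strategy; together these implement differentiated length control across reasoning steps.
Link: https://arxiv.org/abs/2507.04348
Authors: Xingyang He,Xiao Ling,Jie Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: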
Abstract:Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such global length penalties often lead to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE) and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.
[NLP-76] Computed Tomography Visual Question Answering with Cross-modal Feature Graphing
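Quick read: This paper targets visual question answering (VQA) on medical images, where existing approaches neglect spatial continuity and inter-slice correlations and therefore give fragmented and imprecise answers. Conventional methods extract features with separate visual and textual encoders and then combine them, which does not fully account for the spatial structure of 3D CT data. The key to the proposed solution is a cross-modal graph that integrates visual and textual features: CT slices and question tokens are nodes in the graph, an attentive graph convolutional network fuses information dynamically, and the aggregated graph features serve as a soft prompt that guides a large language model to generate accurate answers.
Link: https://arxiv.org/abs/2507.04333
Authors: Yuanhe Tian,Chen Su,Junwen Duan,Yan Song
Affiliations: University of Washington; University of Science and Technology of China; Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures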
Abstract:Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs a cross-modal graph integrating both visual and textual features, treating individual CT slices and question tokens as nodes within the graph. We further leverage an attentive graph convolutional network to dynamically fuse information within this structure. The resulting aggregated graph features then serve as a soft prompt to guide a large language model in generating accurate answers. Extensive experiments on the M3D-VQA benchmark demonstrate that our approach consistently outperforms baselines across multiple evaluation metrics, offering more robust reasoning capabilities.
[NLP-77] No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem
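Quick read: This paper addresses the fragmented and underexplored dataset landscape for East Asian languages (particularly Chinese, Japanese, and Korean) in building large language models (LLMs), and the resulting challenges for resource development and cultural fit. The key to its approach is a cross-linguistic perspective: it analyzes how cultural norms, research environments, and institutional practices shape dataset availability and quality, uses quantitative and qualitative methods to reveal distinct dataset creation and curation patterns across the Chinese, Japanese, and Korean NLP communities, and derives practical strategies for improving dataset documentation, licensing clarity, and cross-lingual resource sharing.
Link: https://arxiv.org/abs/2507.04329
Authors: Dasol Choi,Woomyoung Park,Youngsook Song
Affiliations: Yonsei University; MODULABS; SAIONIC AI; Lablup Inc.
Categories: Computation and Language (cs.CL)
Comments: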
Abstract:Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
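[NLP-78] LearnLens: LLM-Enabled Personalised Curriculum-Grounded Feedback with Educators in the Loop
Quick read: This paper addresses the problem that effective feedback in science education is time-intensive for teachers and insufficiently personalised. The key to its solution is LearnLens, an LLM-based system that delivers curriculum-aligned feedback through three core modules: an error-aware assessment module that captures nuanced reasoning errors; a curriculum-grounded generation module that uses a structured, topic-linked memory chain instead of traditional similarity-based retrieval; and an educator-in-the-loop interface for customisation and oversight. Together these enable scalable, high-quality personalised feedback.
Link: https://arxiv.org/abs/2507.04295
Authors: Runcong Zhao,Artem Borov,Jiazheng Li,Yulan He
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: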
Abstract:Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.
[NLP-79] Fairness Evaluation of Large Language Models in Academic Library Reference Services
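Quick read: This paper asks whether generative AI used in virtual reference services can serve all users equitably, regardless of demographics or social status. The key to its approach is to prompt six state-of-the-art LLMs to assist patrons who differ in sex, race/ethnicity, and institutional role, and to test whether the responses differ. The study finds no differentiation by race or ethnicity and only minor stereotypical bias against women in one model. For institutional roles, the models show nuanced accommodation through linguistic choices (formality, politeness, and domain-specific vocabulary), which reflects professional norms rather than discriminatory treatment.
Link: https://arxiv.org/abs/2507.04224
Authors: Haining Wang,Jason Clark,Yueru Yan,Star Bradley,Ruiyang Chen,Yiqiong Zhang,Hengyi Fu,Zuoyu Tian
Affiliations: Indiana University; Montana State University; Wuhan University; Guangdong University of Foreign Studies; San José State University; Macalester College
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: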
Abstract:As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries’ commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We found no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrated nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.
[NLP-80] Context Tuning for In-Context Optimization
Quick read: This paper addresses how to significantly improve the few-shot adaptation of large language models (LLMs) without fine-tuning model parameters. Conventional prompt-based adaptation methods usually initialize the trainable prompt or prefix with tokens unrelated to the task. The proposed solution, Context Tuning, instead initializes the trainable prompt or prefix with task-specific demonstration examples, which exploits the model's inherent In-Context Learning (ICL) ability to improve few-shot performance.
Link: https://arxiv.org/abs/2507.04221
Authors: Jack Lu,Ryan Teehan,Zhenbang Yang,Mengye Ren
Affiliations: New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: A short version of this paper has been accepted for publication in the Workshop on Test-Time Adaptation (PUT) at the International Conference on Machine Learning (ICML) 2025
Abstract:We introduce Context Tuning, a simple and effective method to significantly enhance few-shot adaptation of language models (LLMs) without fine-tuning model parameters. While prompt-based adaptation techniques have demonstrated the effectiveness of lightweight adaptation methods for large language models (LLMs), they typically initialize a trainable prompt or prefix with irrelevant tokens for the task at hand. In contrast, Context Tuning initializes the trainable prompt or prefix with task-specific demonstration examples, leveraging the model’s inherent In-Context Learning (ICL) ability to extract relevant information for improved few-shot learning performance. Extensive evaluations on benchmarks such as CrossFit, UnifiedQA, MMLU, BIG-Bench Hard, and ARC demonstrate that Context Tuning outperforms traditional prompt-based adaptation methods and achieves competitive accuracy to Test-Time Training with significantly higher training efficiency.
[NLP-81] SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding
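Quick read: This paper addresses the understanding of character relationships in complex narratives, where manual annotation is time-consuming and low in coverage and large language models (LLMs) often produce hallucinated or logically inconsistent output. The key to its solution is SymbolicThought, a framework that combines LLM-based extraction with symbolic reasoning: it constructs editable character relationship graphs, refines them with seven types of logical constraints, and provides an interactive interface for real-time validation and conflict resolution.
Link: https://arxiv.org/abs/2507.04189
Authors: Runcong Zhao,Qinglin Zhu,Hainiu Xu,Bin Liang,Yulan He,Lin Gui
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: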
Abstract:Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.
[NLP-82] Navigating Speech Recording Collections with AI-Generated Illustrations
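Quick read: This paper addresses the challenge of extracting information and knowledge from speech recordings as the amount of available spoken content keeps growing. The key to its solution is a new navigational method for speech archives that leverages recent advances in language and multimodal generative models. The method organizes data into a structured format using interactive mind maps and image generation tools, is implemented on the TED-LIUM 3 dataset, and aims to simplify the exploration of large speech collections.
Link: https://arxiv.org/abs/2507.04182
Authors: Sirina Håland,Trond Karlsen Strøm,Petra Galuščáková
Affiliations: University of Stavanger
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: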
Abstract:Although the amount of available spoken content is steadily increasing, extracting information and knowledge from speech recordings remains challenging. Beyond enhancing traditional information retrieval methods such as speech search and keyword spotting, novel approaches for navigating and searching spoken content need to be explored and developed. In this paper, we propose a novel navigational method for speech archives that leverages recent advances in language and multimodal generative models. We demonstrate our approach with a Web application that organizes data into a structured format using interactive mind maps and image generation tools. The system is implemented using the TED-LIUM 3 dataset, which comprises over 2,000 speech transcripts and audio files of TED Talks. Initial user tests using a System Usability Scale (SUS) questionnaire indicate the application’s potential to simplify the exploration of large speech collections.
[NLP-83] Efficient Detection of Intermittent Job Failures Using Few-Shot Learning
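Quick read: This paper addresses the detection of intermittent job failures in continuous integration (CI) and deployment pipelines, which result from non-deterministic issues such as flaky tests or infrastructure problems. The existing approach labels data with a heuristic based on non-deterministic job reruns, but where rerunning is not an explicit policy it mislabels intermittent failures as regular ones, which limits its performance in practice. The key solution is a few-shot learning (FSL) approach: a small language model is fine-tuned on a small number of manually labeled log examples to generate rich embeddings, which are then used to train a machine-learning classifier, achieving higher detection accuracy with limited data.
Link: https://arxiv.org/abs/2507.04173
Authors: Henri Aïdasso,Francis Bordeleau,Ali Tizghadam
Affiliations: École de technologie supérieure; TELUS
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the 41st International Conference on Software Maintenance and Evolution - ICSME 2025, Industry Track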
Abstract:One of the main challenges developers face in the use of continuous integration (CI) and deployment pipelines is the occurrence of intermittent job failures, which result from unexpected non-deterministic issues (e.g., flaky tests or infrastructure problems) rather than regular code-related errors such as bugs. Prior studies developed machine-learning (ML) models trained on large datasets of job logs to classify job failures as either intermittent or regular. As an alternative to costly manual labeling of large datasets, the state-of-the-art (SOTA) approach leveraged a heuristic based on non-deterministic job reruns. However, this method mislabels intermittent job failures as regular in contexts where rerunning suspicious job failures is not an explicit policy, and therefore limits the SOTA’s performance in practice. In fact, our manual analysis of 2,125 job failures from 5 industrial and 1 open-source projects reveals that, on average, 32% of intermittent job failures are mislabeled as regular. To address these limitations, this paper introduces a novel approach to intermittent job failure detection using few-shot learning (FSL). Specifically, we fine-tune a small language model using a small number of manually labeled log examples to generate rich embeddings, which are then used to train an ML classifier. Our FSL-based approach achieves 70-88% F1-score with only 12 shots in all projects, outperforming the SOTA, which proved ineffective (34-52% F1-score) in 4 projects. Overall, this study underlines the importance of data quality over quantity and provides a more efficient and practical framework for the detection of intermittent job failures in organizations.
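The two-step pipeline in the abstract above (embed a handful of labeled logs, then train a classifier on the embeddings) can be sketched in miniature. This is not the authors' method: the fine-tuned small language model is replaced here by a toy normalized bag-of-words embedder, the classifier is a nearest-centroid rule, and the vocabulary, log lines, and function names are all made up for illustration.

```python
import math
from collections import Counter


def embed(log_text, vocabulary):
    """Toy stand-in for a learned embedding: L2-normalized word counts over a fixed vocabulary."""
    counts = Counter(log_text.lower().split())
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def fit_centroids(shots, vocabulary):
    """Average the embeddings of the few labeled examples for each class."""
    by_label = {}
    for text, label in shots:
        by_label.setdefault(label, []).append(embed(text, vocabulary))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(log_text, centroids, vocabulary):
    """Assign the label whose centroid has the largest dot product with the log embedding."""
    vec = embed(log_text, vocabulary)
    return max(centroids, key=lambda label: sum(a * b for a, b in zip(vec, centroids[label])))


# Hypothetical vocabulary and four labeled "shots" (two per class).
vocabulary = ["timeout", "connection", "assertion", "failed", "expected"]
shots = [
    ("connection timeout while pulling image", "intermittent"),
    ("runner lost connection timeout", "intermittent"),
    ("assertion failed expected 3 got 4", "regular"),
    ("test failed assertion expected true", "regular"),
]
centroids = fit_centroids(shots, vocabulary)
```

A call such as `predict("network connection timeout", centroids, vocabulary)` then returns the label of the closest class. A learned embedding would replace `embed`, and any standard classifier could replace the centroid rule.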
[NLP-84] Large Language Models for Zero-Shot Multicultural Name Recognition
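Quick read: This paper targets robust and accurate recognition of multicultural names in the zero-shot setting, where traditional methods often falter on names not previously encountered. The key to its solution is a framework called Prompt-Engineered Fine-Tuning (PEFT), which combines adversarial data augmentation with dynamic integration of a cultural knowledge graph and leverages the linguistic understanding of pre-trained large language models (LLMs) to recast recognition as a guided generation problem, significantly improving the model's ability to infer the cultural origin of unseen names.
Link: https://arxiv.org/abs/2507.04149
Authors: Thanakorn Phonchai,Surasakdi Siripong,Nicholas Patterson,Owen Campbell
Affiliations: Walailak University
Categories: Computation and Language (cs.CL)
Comments: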
Abstract:The robust and accurate recognition of multicultural names, particularly those not previously encountered, is a critical challenge in an increasingly globalized digital landscape. Traditional methods often falter when confronted with the vast diversity and novel permutations of names across different linguistic and cultural backgrounds. This paper introduces a novel framework, Prompt-Engineered Fine-Tuning (PEFT) for Large Language Models (LLMs) with Adversarial Data Augmentation and Cultural Knowledge Graph Integration, designed to significantly enhance zero-shot multicultural name recognition. Our approach leverages the powerful linguistic understanding of pre-trained LLMs, transforming the recognition task into a guided generation problem. Through meticulous prompt engineering, dynamic integration of explicit cultural knowledge derived from knowledge graphs, and the strategic application of adversarial data augmentation, we equip the LLM with an unprecedented ability to infer the cultural origin of unseen names. Extensive experiments demonstrate that our PEFT method consistently outperforms established deep learning baselines, including advanced Bi-LSTM models with cultural tags, achieving an impressive 93.1% overall accuracy and a remarkable 89.5% accuracy on challenging zero-shot name identification. An in-depth ablation study confirms the synergistic contribution of each component, while a human evaluation highlights our method’s performance approaching human expert judgment. This work signifies a substantial leap in multicultural name recognition, offering a highly effective and scalable solution for real-world applications.
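[NLP-85] Dissecting Clinical Reasoning in Language Models: A Comparative Study of Prompts and Model Adaptation Strategies
Quick read: This paper addresses how to improve the performance of large language models (LLMs) on clinical natural language inference (NLI), and in particular how prompt structure and efficient fine-tuning jointly shape model performance. The key to its solution is to design prompting strategies at different levels of abstraction and to use Low-Rank Adaptation (LoRA) to distil multi-step reasoning from a frontier model into smaller models (4B parameters). The study finds that prompt type alone accounts for up to 44% of the variance in macro-F1, while LoRA fine-tuning improves output consistency and performance, narrows the gap to GPT-4o-mini, and is validated on several clinical datasets.
Link: https://arxiv.org/abs/2507.04142
Authors: Mael Jullien,Marco Valentino,Leonardo Ranaldi,Andre Freitas
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: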
Abstract:Recent works on large language models (LLMs) have demonstrated the impact of prompting strategies and fine-tuning techniques on their reasoning capabilities. Yet, their effectiveness on clinical natural language inference (NLI) remains underexplored. This study presents the first controlled evaluation of how prompt structure and efficient fine-tuning jointly shape model performance in clinical NLI. We inspect four classes of prompting strategies to elicit reasoning in LLMs at different levels of abstraction, and evaluate their impact on a range of clinically motivated reasoning types. For each prompting strategy, we construct high-quality demonstrations using a frontier model to distil multi-step reasoning capabilities into smaller models (4B parameters) via Low-Rank Adaptation (LoRA). Across different language models fine-tuned on the NLI4CT benchmark, we found that prompt type alone accounts for up to 44% of the variance in macro-F1. Moreover, LoRA fine-tuning yields consistent gains of +8 to 12 F1, raises output alignment above 97%, and narrows the performance gap to GPT-4o-mini to within 7.1%. Additional experiments on reasoning generalisation reveal that LoRA improves performance in 75% of the models on MedNLI and TREC Clinical Trials Track. Overall, these findings demonstrate that (i) prompt structure is a primary driver of clinical reasoning performance, (ii) compact models equipped with strong prompts and LoRA can rival frontier-scale systems, and (iii) reasoning-type-aware evaluation is essential to uncover prompt-induced trade-offs. Our results highlight the promise of combining prompt design and lightweight adaptation for more efficient and trustworthy clinical NLP systems, providing insights on the strengths and limitations of widely adopted prompting and parameter-efficient techniques in highly specialised domains.
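[NLP-86] Token Level Hallucination Detection via Variance in Language Models
Quick read: This paper addresses hallucination in Large Language Models (LLMs), that is, outputs that are generated confidently but are factually incorrect. The key to its solution is a reference-free, token-level hallucination detection framework that uses the variance of token log-probabilities across multiple stochastic generations to identify instability in model outputs, which correlates with hallucination patterns.
Link: https://arxiv.org/abs/2507.04137
Authors: Keshav Kumar
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: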
Abstract:Large Language Models (LLMs) have demonstrated impressive generative capabilities across diverse tasks but remain susceptible to hallucinations, confidently generated yet factually incorrect outputs. We introduce a reference-free, token-level hallucination detection framework that leverages the variance in token log-probabilities across multiple stochastic generations. Unlike prior methods that require ground-truth references or sentence-level verification, our approach is model-agnostic, interpretable, and suited for real-time or post-hoc analysis. We evaluate our method on unanswerable question prompts from the SQuAD v2 dataset and benchmark across three autoregressive models of varying scales: GPT-Neo 125M, Falcon 1B, and Mistral 7B. Through both quantitative metrics and visual diagnostics, we show that token-level variance reliably highlights instability in model outputs and correlates with hallucination patterns. Our framework is lightweight, reproducible, and adaptable to multiple domains, offering a valuable diagnostic tool for analyzing generative reliability in LLMs.
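The variance signal described in this abstract can be illustrated with a small sketch. It assumes the per-token log-probabilities of several sampled generations are already available and are aligned by position; the abstract does not say how the paper aligns tokens across generations, so positional alignment, the function names, the sample values, and the threshold are all hypothetical.

```python
from statistics import pvariance


def token_logprob_variance(runs):
    """Per-position variance of token log-probabilities across generations.

    `runs` is a list of generations, each a list of per-token log-probs.
    Positional alignment is an assumption made for this sketch only.
    """
    shared_length = min(len(run) for run in runs)
    return [pvariance([run[i] for run in runs]) for i in range(shared_length)]


def flag_unstable(variances, threshold):
    """Return the positions whose variance exceeds a hypothetical threshold."""
    return [i for i, value in enumerate(variances) if value > threshold]


# Made-up log-probs for three stochastic generations of the same prompt.
runs = [
    [-0.10, -0.20, -2.50],
    [-0.10, -0.25, -0.30],
    [-0.10, -0.15, -4.00],
]
variances = token_logprob_variance(runs)
print(flag_unstable(variances, threshold=0.5))
```

In this toy input only the last position varies strongly between runs, so it is the one a variance threshold would surface for inspection.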
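[NLP-87] An HTR-LLM Workflow for High-Accuracy Transcription and Analysis of Abbreviated Latin Court Hand
Quick read: This paper addresses high-accuracy transcription and analysis of medieval legal documents, which are especially challenging for Handwritten Text Recognition (HTR). The key to its solution is a four-stage hybrid workflow: first, an HTR model trained with a new “Clean Ground Truth” data curation method produces a baseline transcription; second, a multimodal Large Language Model (LLM) post-corrects it; third, a prompt-guided LLM expands the abbreviated text into full scholarly Latin; finally, an LLM performs Named-Entity Correction, regularizing proper nouns and generating plausible alternative readings. The approach automates the most laborious parts of transcription while producing high-quality, analyzable output.
Link: https://arxiv.org/abs/2507.04132
Authors: Joshua D. Isom
Affiliation: Lehigh University
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: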
Abstract:This article presents and validates an ideal, four-stage workflow for the high-accuracy transcription and analysis of challenging medieval legal documents. The process begins with a specialized Handwritten Text Recognition (HTR) model, itself created using a novel “Clean Ground Truth” curation method where a Large Language Model (LLM) refines the training data. This HTR model provides a robust baseline transcription (Stage 1). In Stage 2, this baseline is fed, along with the original document image, to an LLM for multimodal post-correction, grounding the LLM’s analysis and improving accuracy. The corrected, abbreviated text is then expanded into full, scholarly Latin using a prompt-guided LLM (Stage 3). A final LLM pass performs Named-Entity Correction (NEC), regularizing proper nouns and generating plausible alternatives for ambiguous readings (Stage 4). We validate this workflow through detailed case studies, achieving Word Error Rates (WER) in the range of 2-7% against scholarly ground truths. The results demonstrate that this hybrid, multi-stage approach effectively automates the most laborious aspects of transcription while producing a high-quality, analyzable output, representing a powerful and practical solution for the current technological landscape.
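The Word Error Rate (WER) reported here is conventionally the word-level edit distance between a transcription and its reference, divided by the number of reference words. The sketch below shows that standard definition under the assumption of plain whitespace tokenization; the function name and the Latin sample strings are hypothetical, and the article's own text normalization is not described in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


print(word_error_rate("in nomine domini amen", "in nomine domine amen"))
```

One substituted word out of four reference words gives a WER of 0.25, so the 2-7% range in the abstract corresponds to roughly one error every 14 to 50 words.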
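[NLP-88] BYOKG-RAG: Multi-Strategy Graph Retrieval for Knowledge Graph Question Answering
Quick read: This paper addresses the challenges that structural and semantic variation across input graphs poses for Knowledge Graph Question Answering (KGQA). Existing methods rely on Large Language Model (LLM) agents for graph traversal and retrieval, which makes them sensitive to traversal initialization, prone to entity linking errors, and weak at generalizing to custom (“bring-your-own”) knowledge graphs. The key to its solution is the BYOKG-RAG framework, which combines LLMs with specialized graph retrieval tools: the LLM generates key graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and the graph tools link these artifacts to the knowledge graph and retrieve relevant context. This allows graph linking and retrieval to be refined iteratively and improves both question-answering performance and generalization.
Link: https://arxiv.org/abs/2507.04127
Authors: Costas Mavromatis,Soji Adeshina,Vassilis N. Ioannidis,Zhen Han,Qi Zhu,Ian Robinson,Bryan Thompson,Huzefa Rangwala,George Karypis
Affiliation: Amazon
Categories: Computation and Language (cs.CL)
Comments: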
Abstract:Knowledge graph question answering (KGQA) presents significant challenges due to the structural and semantic variations across input graphs. Existing works rely on Large Language Model (LLM) agents for graph traversal and retrieval; an approach that is sensitive to traversal initialization, as it is prone to entity linking errors and may not generalize well to custom (“bring-your-own”) KGs. We introduce BYOKG-RAG, a framework that enhances KGQA by synergistically combining LLMs with specialized graph retrieval tools. In BYOKG-RAG, LLMs generate critical graph artifacts (question entities, candidate answers, reasoning paths, and OpenCypher queries), and graph tools link these artifacts to the KG and retrieve relevant graph context. The retrieved context enables the LLM to iteratively refine its graph linking and retrieval, before final answer generation. By retrieving context from different graph tools, BYOKG-RAG offers a more general and robust solution for QA over custom KGs. Through experiments on five benchmarks spanning diverse KG types, we demonstrate that BYOKG-RAG outperforms the second-best graph retrieval method by 4.5% points while showing better generalization to custom KGs. BYOKG-RAG framework is open-sourced at this https URL.
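[NLP-89] Relational inductive biases on attention mechanisms WWW
Quick read: This paper addresses how to characterize the relational inductive biases present in attention mechanisms, that is, the assumptions these mechanisms make about the underlying relationships between data elements. The key to its solution is to take the perspective of geometric deep learning and analyze the equivariance properties of common attention mechanisms with respect to permutation subgroups, which yields a classification of the mechanisms by their relational biases.
Link: https://arxiv.org/abs/2507.04117
Authors: Víctor Mijangos,Ximena Gutierrez-Vasques,Verónica E. Arriola,Ulises Rodríguez-Domínguez,Alexis Cervantes,José Luis Almanzara
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: This paper was originally published in Spanish in the journal Research in Computing Science ( this https URL )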
Abstract:Inductive learning aims to construct general models from specific examples, guided by biases that influence hypothesis selection and determine generalization capacity. In this work, we focus on characterizing the relational inductive biases present in attention mechanisms, understood as assumptions about the underlying relationships between data elements. From the perspective of geometric deep learning, we analyze the most common attention mechanisms in terms of their equivariance properties with respect to permutation subgroups, which allows us to propose a classification based on their relational biases. Under this perspective, we show that different attention layers are characterized by the underlying relationships they assume on the input data.
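[NLP-90] Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching
Quick read: This paper addresses the fact that, in multi-turn tasks such as diagnostic patient interviewing, conventional fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) fail to capture how early conversational turns influence later interactions and diagnostic outcomes. The key to its solution is Savage Conversation Forests (SCF), a reinforcement learning framework built on a branched conversation architecture. By generating multiple possible continuations at each turn, it lets the model learn how different early responses affect downstream interactions and diagnostic outcomes.
Link: https://arxiv.org/abs/2507.04099
Authors: Thomas Savage
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: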
Abstract:Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.
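[NLP-91] XISM: an eXploratory and Interactive Graph Tool to Visualize and Evaluate Semantic Map Models
Quick read: This paper addresses the inefficiency of building semantic map models by hand and the lack of tools for visualizing and evaluating them. The key to its solution is XISM, an interactive tool based on the authors' prior algorithm that constructs semantic maps from user data with a top-down approach and evaluates them using multiple metrics. It combines data-driven efficiency with expert knowledge so that semantic maps can be refined with a human in the loop.
Link: https://arxiv.org/abs/2507.04070
Authors: Zhu Liu,Zhen Hu,Lei Dai,Ying Liu
Affiliation: Tsinghua University
Categories: Computation and Language (cs.CL)
Comments: Paper under review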
Abstract:Semantic map models represent meanings or functions as nodes in a graph constrained by the local connectivity hypothesis, with edges indicating their associations. Widely used in typological linguistics, these models compare interrelated meanings across languages. Traditionally built manually in a bottom-up manner, they are inefficient for large datasets and lack visualization and evaluation tools. This paper introduces XISM, an interactive tool based on our prior algorithm, which constructs semantic maps from user data via a top-down approach, displays candidate maps, and evaluates them using multiple metrics. Users can refine maps by editing edges, combining data-driven efficiency with expert knowledge. This human-in-the-loop design benefits both typologists and computational linguists. The system this https URL and a demonstration video this https URL are publicly available.
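[NLP-92] Beyond Independent Passages: Adaptive Passage Combination Retrieval for Retrieval Augmented Open-Domain Question Answering
Quick read: This paper addresses a weakness of conventional Retrieval-augmented generation (RAG): because passages are retrieved independently, the resulting context is often redundant, noisy, or insufficiently diverse, especially for multi-hop questions and noisy corpora. The key to its solution is Adaptive Passage Combination Retrieval (AdaPCR), which treats passage combinations as the unit of retrieval and reranking and thereby models dependencies between passages explicitly. AdaPCR uses context-aware query reformulation, reranks with a predictive objective aligned with downstream answer likelihood, and adaptively selects the number of retrieved passages without an additional stopping module.
Link: https://arxiv.org/abs/2507.04069
Authors: Ting-Wen Ko,Jyun-Yu Jiang,Pu-Jen Cheng
Affiliation: National Taiwan University; Amazon Search; University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: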
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external documents at inference time, enabling up-to-date knowledge access without costly retraining. However, conventional RAG methods retrieve passages independently, often leading to redundant, noisy, or insufficiently diverse context, which is particularly problematic in noisy corpora and for multi-hop questions. To address this, we propose Adaptive Passage Combination Retrieval (AdaPCR), a novel framework for open-domain question answering with black-box LMs. AdaPCR explicitly models dependencies between passages by considering passage combinations as units for retrieval and reranking. It consists of a context-aware query reformulation using concatenated passages, and a reranking step trained with a predictive objective aligned with downstream answer likelihood. Crucially, AdaPCR adaptively selects the number of retrieved passages without additional stopping modules. Experiments across several QA benchmarks show that AdaPCR outperforms baselines, particularly in multi-hop reasoning, demonstrating the effectiveness of modeling inter-passage dependencies for improved retrieval.
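[NLP-93] Patient-Centered RAG for Oncology Visit Aid Following the Ottawa Decision Guide
Quick read: This paper addresses the difficulty cancer patients face in preparing for complex medical visits and aims to improve the quality of that preparation through technology. The key to its solution is an interactive system based on Retrieval-augmented Generation that adapts the Ottawa Personal Decision Guide into a dynamic workflow, helping users bridge knowledge gaps, clarify personal values, and generate useful questions for their upcoming visits.
Link: https://arxiv.org/abs/2507.04026
Authors: Siyang Liu,Lawrence Chin-I An,Rada Mihalcea
Affiliation: The LIT Group, Department of Computer Science and Engineering, University of Michigan, Ann Arbor; Rogel Cancer Center, University of Michigan, Ann Arbor
Categories: Computation and Language (cs.CL)
Comments: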
Abstract:Effective communication is essential in cancer care, yet patients often face challenges in preparing for complex medical visits. We present an interactive, Retrieval-augmented Generation-assisted system that helps patients progress from uninformed to visit-ready. Our system adapts the Ottawa Personal Decision Guide into a dynamic retrieval-augmented generation workflow, helping users bridge knowledge gaps, clarify personal values and generate useful questions for their upcoming visits. Focusing on localized prostate cancer, we conduct a user study with patients and a clinical expert. Results show high system usability (UMUX Mean = 6.0 out of 7), strong relevance of generated content (Mean = 6.7 out of 7), minimal need for edits, and high clinical faithfulness (Mean = 6.82 out of 7). This work demonstrates the potential of combining patient-centered design with language models to enhance clinical preparation in oncology care.
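[NLP-94] LLMThinkBench: Towards Basic Math Reasoning and Overthinking in Large Language Models
Quick read: This paper addresses the tendency of Large Language Models (LLMs) to struggle with simple arithmetic tasks and to over-explain or “overthink” their answers. The key to its solution is LLMThinkBench, a modular benchmarking framework for systematically evaluating basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks, randomized test data generation, and robust parsing strategies, and it introduces the Overthinking Score metric, which captures the trade-off between accuracy and verbosity through a harmonic mean formulation. This gives researchers an efficient and inexpensive tool for diagnosing basic reasoning capabilities and efficiency.
Link: https://arxiv.org/abs/2507.04023
Authors: Gaurav Srivastava,Aafiya Hussain,Sriram Srinivasan,Xuan Wang
Affiliation: Virginia Tech
Categories: Computation and Language (cs.CL)
Comments: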
Abstract:Large Language Models (LLMs) have achieved remarkable performance on complex mathematical benchmarks, yet often struggle with simple arithmetic tasks and exhibit a tendency toward over-explaining or “overthinking” answers. To systematically assess this phenomenon, we introduce LLMThinkBench, a modular benchmarking framework that enables researchers to evaluate basic math reasoning and overthinking in LLMs. The framework provides 14 configurable math tasks with randomized test data generation and robust parsing strategies. Researchers can quantify overthinking using our Overthinking Score metric, which captures accuracy-verbosity tradeoffs through harmonic mean formulation. The tool offers flexible evaluation with a scalable vLLM/Transformers backend, multi-GPU support, and full configurability. Users can extend the tool with custom tasks, reproduce experiments with seeding, and generate detailed efficiency reports. Distributed as a pip-installable package with CLI and API access, LLMThinkBench provides researchers and practitioners an accessible, cost-effective alternative to expensive LLM-as-a-judge methods for diagnosing basic reasoning capabilities and efficiency analysis. Package can be installed as: pip install llmthinkbench
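The abstract says only that the Overthinking Score uses a harmonic mean to capture the accuracy-verbosity trade-off; it does not give the formula. The sketch below is therefore not the package's metric. It is a hypothetical illustration of why a harmonic mean suits this kind of trade-off, with an assumed efficiency term (one minus the fraction of a token budget used) and invented function names.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values; 0.0 when both are zero."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def tradeoff_score(accuracy, avg_tokens, token_budget):
    """Hypothetical accuracy-verbosity score, NOT LLMThinkBench's formula.

    Efficiency is assumed to be the unused share of a token budget,
    clamped at zero for answers that exceed the budget.
    """
    efficiency = max(0.0, 1.0 - avg_tokens / token_budget)
    return harmonic_mean(accuracy, efficiency)


# A concise, mostly correct model versus a perfectly correct but verbose one.
print(tradeoff_score(accuracy=0.8, avg_tokens=200, token_budget=1000))
print(tradeoff_score(accuracy=1.0, avg_tokens=1000, token_budget=1000))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot compensate for extreme verbosity with high accuracy alone, which is the property a score of this kind relies on.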
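[NLP-95] Handling Korean Out-of-Vocabulary Words with Phoneme Representation Learning
Quick read: This paper addresses the representation of Out-of-Vocabulary (OOV) words in Korean. The key to its solution is the KOPL framework, which uses phoneme representation learning to strengthen the representation of Korean OOV words. KOPL combines phoneme and word-level representations so that OOV word representations capture both textual and phonemic information, which improves performance on Korean natural language processing tasks.
Link: https://arxiv.org/abs/2507.04018
Authors: Nayeon Kim,Eojin Jeon,Jun-Hyung Park,SangKeun Lee
Affiliation: Unknown
Categories: Computation and Language (cs.CL)
Comments: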
Abstract:In this study, we introduce KOPL, a novel framework for handling Korean OOV words with Phoneme representation Learning. Our work is based on the linguistic property of Korean as a phonemic script, the high correlation between phonemes and letters. KOPL incorporates phoneme and word representations for Korean OOV words, facilitating Korean OOV word representations to capture both text and phoneme information of words. We empirically demonstrate that KOPL significantly improves the performance on Korean Natural Language Processing (NLP) tasks, while being readily integrated into existing static and contextual Korean embedding models in a plug-and-play manner. Notably, we show that KOPL outperforms the state-of-the-art model by an average of 1.9%. Our code is available at this https URL.
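[NLP-96] Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition
Quick read: This paper addresses the limited cultural sensitivity and reasoning ability of Large Language Models (LLMs) in multicultural settings, with a focus on superstitions in Korean culture. The key to its solution is the Nunchi-Bench benchmark, which contains 247 questions across 31 topics and evaluates factual knowledge, culturally appropriate advice, and situational interpretation, together with a customized scoring strategy for assessing cultural reasoning systematically. The study finds that models generally recognize factual information but struggle to apply it in practical scenarios, and that explicit cultural framing improves performance more than relying on the language of the prompt alone.
Link: https://arxiv.org/abs/2507.04014
Authors: Kyuhee Kim,Sangah Lee
Affiliation: EPFL; Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: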
Abstract:As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs’ cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel evaluation strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs’ cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.
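[NLP-97] Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
Quick read: This paper addresses the difficulty of adapting Large Language Models (LLMs) to specific domains, which stems largely from the scarcity of high-quality domain data. The key to its solution is Easy Dataset, a unified framework that synthesizes fine-tuning data from unstructured documents through an intuitive graphical user interface (GUI). Users configure text extraction models and chunking strategies to produce coherent text chunks, a persona-driven prompting approach generates diverse question-answer pairs, and a human-in-the-loop visual interface safeguards data quality.
Link: https://arxiv.org/abs/2507.04009
Authors: Ziyang Miao,Qiyu Sun,Jingyuan Wang,Yuchen Gong,Yaowei Zheng,Shiqi Li,Richong Zhang
Affiliation: Beihang University; Independent Researcher
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: preprint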
Abstract:Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging due to the scarcity of high-quality domain data. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents effectively. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to easily configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then leverages a persona-driven prompting approach to generate diverse question-answer pairs using public-available LLMs. Throughout the pipeline, a human-in-the-loop visual interface facilitates the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at this https URL and have garnered over 9,000 GitHub stars.
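[NLP-98] A Comparative Study of Specialized LLMs as Dense Retrievers
Quick read: This paper addresses how domain-specific specialization of LLMs affects their retrieval effectiveness, with a view to unified retrievers that handle text, code, images, and multimodal content. The key to its solution is a systematic evaluation of LLMs with different task-specific adaptations in zero-shot and supervised retrieval settings. The comparison shows that mathematical specialization and long reasoning capability can degrade retrieval performance, whereas code-specialized and vision-language models show stronger zero-shot retrieval, which points to promising directions for cross-domain and cross-modal fusion in future unified retrieval systems.
Link: https://arxiv.org/abs/2507.03958
Authors: Hengran Zhang,Keping Bi,Jiafeng Guo
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by CCIR25 and published by Springer LNCS or LNAI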
Abstract:While large language models (LLMs) are increasingly deployed as dense retrievers, the impact of their domain-specific specialization on retrieval effectiveness remains underexplored. This investigation systematically examines how task-specific adaptations in LLMs influence their retrieval capabilities, an essential step toward developing unified retrievers capable of handling text, code, images, and multimodal content. We conduct extensive experiments with eight Qwen2.5 7B LLMs, including base, instruction-tuned, code/math-specialized, long reasoning, and vision-language models across zero-shot retrieval settings and the supervised setting. For the zero-shot retrieval settings, we consider text retrieval from the BEIR benchmark and code retrieval from the CoIR benchmark. Further, to evaluate supervised performance, all LLMs are fine-tuned on the MS MARCO dataset. We find that mathematical specialization and the long reasoning capability cause consistent degradation in three settings, indicating conflicts between mathematical reasoning and semantic matching. The vision-language model and code-specialized LLMs demonstrate superior zero-shot performance compared to other LLMs, even surpassing BM25 on the code retrieval task, and maintain comparable performance to base LLMs in supervised settings. These findings suggest promising directions for the unified retrieval task leveraging cross-domain and cross-modal fusion.
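[NLP-99] A Modular Unsupervised Framework for Attribute Recognition from Unstructured Text
Quick read: This paper addresses the extraction of structured, attribute-based information from unstructured text, specifically without task-specific fine-tuning. The key to its solution is the POSID framework, which combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes, achieving effective attribute extraction without supervised training.
Link: https://arxiv.org/abs/2507.03949
Authors: KMA Solaiman
Affiliation: Purdue University
Categories: Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2506.20070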
Abstract:We propose POSID, a modular, lightweight and on-demand framework for extracting structured attribute-based properties from unstructured text without task-specific fine-tuning. While the method is designed to be adaptable across domains, in this work, we evaluate it on human attribute recognition in incident reports. POSID combines lexical and semantic similarity techniques to identify relevant sentences and extract attributes. We demonstrate its effectiveness on a missing person use case using the InciText dataset, achieving effective attribute extraction without supervised training.
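[NLP-100] Losing our Tail – Again: On (Un)Natural Selection And Multilingual Large Language Models
Quick read: This paper addresses the risk that multilingual Large Language Models (LLMs), while advancing language technology, contribute to a decline in linguistic diversity. Its central concern is model collapse, in which models caught in self-reinforcing training loops gradually lose linguistic diversity, leading to the disappearance of linguistic forms, grammatical features, and cultural nuance. The key to its proposal is to recognize that the tails of linguistic distributions are vanishing and to call for Natural Language Processing (NLP) to be reimagined as a field that encourages, values, and protects multilingual lexical and linguistic diversity and creativity.
Link: https://arxiv.org/abs/2507.03933
Authors: Eva Vanmassenhove
Affiliation: Tilburg University; Department CSAI
Categories: Computation and Language (cs.CL)
Comments: 12 pages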
Abstract:Multilingual Large Language Models (LLMs) considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide us quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.
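[NLP-101] Dynamic Injection of Entity Knowledge into Dense Retrievers
Quick read: This paper addresses the poor performance of dense retrievers on queries involving less-frequent entities, which results from their limited entity knowledge. The key to its solution is the Knowledgeable Passage Retriever (KPR), a BERT-based retriever that adds a context-entity attention layer and dynamically updatable entity embeddings. This lets the model incorporate external entity knowledge and improve retrieval performance without retraining.
Link: https://arxiv.org/abs/2507.03922
Authors: Ikuya Yamada,Ryokan Ri,Takeshi Kojima,Yusuke Iwasawa,Yutaka Matsuo
Affiliation: Studio Ousia; RIKEN AIP; SB Intuitions; The University of Tokyo
Categories: Computation and Language (cs.CL)
Comments: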
Abstract:Dense retrievers often struggle with queries involving less-frequent entities due to their limited entity knowledge. We propose the Knowledgeable Passage Retriever (KPR), a BERT-based retriever enhanced with a context-entity attention layer and dynamically updatable entity embeddings. This design enables KPR to incorporate external entity knowledge without retraining. Experiments on three datasets show that KPR consistently improves retrieval accuracy, achieving a substantial 12.6% gain on the EntityQuestions dataset over the model without KPR extensions. When built on the off-the-shelf bge-base retriever, KPR achieves state-of-the-art performance among similarly sized models on two datasets. Code and models will be released soon.
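[NLP-102] Demystifying ChatGPT: How It Masters Genre Recognition
Quick read: This paper addresses the capabilities and limitations of ChatGPT and other LLMs in genre prediction, specifically for movies. The key to its solution is to build zero-shot and few-shot prompts from the audio transcripts and subtitles of movie trailers in the MovieLens-100K dataset and to improve performance further by fine-tuning ChatGPT. It also brings in a Vision Language Model (VLM) to extract information from movie posters, enriching the prompts with fine-grained features and improving genre prediction accuracy.
Link: https://arxiv.org/abs/2507.03875
Authors: Subham Raj,Sriparna Saha,Brijraj Singh,Niranjan Pedanekar
Affiliation: Indian Institute of Technology Patna; Sony Research India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: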
Abstract:The introduction of ChatGPT has garnered significant attention within the NLP community and beyond. Previous studies have demonstrated ChatGPT’s substantial advancements across various downstream NLP tasks, highlighting its adaptability and potential to revolutionize language-related applications. However, its capabilities and limitations in genre prediction remain unclear. This work analyzes three Large Language Models (LLMs) using the MovieLens-100K dataset to assess their genre prediction capabilities. Our findings show that ChatGPT, without fine-tuning, outperformed other LLMs, and fine-tuned ChatGPT performed best overall. We set up zero-shot and few-shot prompts using audio transcripts/subtitles from movie trailers in the MovieLens-100K dataset, covering 1682 movies of 18 genres, where each movie can have multiple genres. Additionally, we extended our study by extracting IMDb movie posters to utilize a Vision Language Model (VLM) with prompts for poster information. This fine-grained information was used to enhance existing LLM prompts. In conclusion, our study reveals ChatGPT’s remarkable genre prediction capabilities, surpassing other language models. The integration of VLM further enhances our findings, showcasing ChatGPT’s potential for content-related applications by incorporating visual information from movie posters.
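[NLP-103] OrthoRank: Token Selection via Sink Token Orthogonality for Efficient LLM inference ICML2025
Quick read: This paper examines a phenomenon in Large Language Models (LLMs) in which the “sink token” receives disproportionately high attention despite its limited semantic role. By analyzing hidden-state similarity, it shows that other tokens move progressively toward the sink token as layers get deeper, and on that basis it proposes a dynamic token selection method called OrthoRank. The method defines a token's importance by the speed at which it moves toward the sink token and converts this into orthogonality with the sink token, so that important tokens can be identified and retained more effectively.
Link: https://arxiv.org/abs/2507.03865
Authors: Seungjun Shin,Jaehoon Oh,Dokwan Oh
Affiliation: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICML 2025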
Abstract:Attention mechanisms are central to the success of large language models (LLMs), enabling them to capture intricate token dependencies and implicitly assign importance to each token. Recent studies have revealed the sink token, which receives disproportionately high attention despite its limited semantic role. In this paper, we first expand the relationship between the sink token and other tokens, moving beyond attention to explore their similarity in hidden states, considering the layer depth. We observe that as the layers get deeper, the cosine similarity between the normalized hidden states of the sink token and those of other tokens increases, and that the normalized hidden states of the sink token exhibit negligible changes. These imply that other tokens are consistently directed toward the sink token throughout the layers. Next, we propose a dynamic token selection method, called OrthoRank, using these findings to select important tokens. Specifically, in a certain layer, we define token importance by the speed at which the token moves toward the sink token. This is converted into orthogonality with the sink token, meaning that tokens that are more orthogonal to the sink token are assigned greater importance. Finally, through extensive experiments, we demonstrated that our method results in lower perplexity and higher zero-shot accuracy compared to layer pruning methods at the same sparsity ratio with comparable throughput, while also achieving superior performance on LongBench.
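The selection rule stated in this abstract (tokens more orthogonal to the sink token count as more important) can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: it ranks tokens by the absolute cosine similarity between their hidden state and the sink token's hidden state, and the use of the absolute value, the function names, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_orthogonal_tokens(hidden_states, sink_index, k):
    """Return the indices of the k tokens most orthogonal to the sink token.

    A lower |cosine| to the sink hidden state is treated as higher
    importance; this scoring choice is an assumption of the sketch.
    """
    sink = hidden_states[sink_index]
    scored = [
        (abs(cosine(state, sink)), index)
        for index, state in enumerate(hidden_states)
        if index != sink_index
    ]
    scored.sort()
    return sorted(index for _, index in scored[:k])


# Toy hidden states; index 0 plays the role of the sink token.
hidden_states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(select_orthogonal_tokens(hidden_states, sink_index=0, k=2))
```

In this toy input, token 1 is almost parallel to the sink token and is dropped, while tokens 2 and 3 are kept for further computation.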
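[NLP-104] Alpay Algebra IV: Symbiotic Semantics and the Fixed-Point Convergence of Observer Embeddings
[Quick Read]: This paper addresses how an AI model and textual content can reach stable semantic alignment, so that the AI's internal representation becomes self-consistent and semantically faithful under continued interaction. The key to the solution is a functorial system built on Alpay Algebra, in which the phi-infinity operator guides the co-evolution of the observer (the AI) and the textual environment. According to the authors, this guarantees a unique stable fixed point in the AI's embedding space, at which the AI internalizes both the meaning of the content and the author's intent, and the convergence is mathematically sound, semantically invariant, and permanent.
Link: https://arxiv.org/abs/2507.03774
Authors: Bugra Kilictas,Faruk Alpay
Affiliations: Bahcesehir University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 1 figure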
Abstract:We present a theoretical framework in which a document and an AI model engage in a transfinite fixed-point interaction that leads to stable semantic alignment. Building on the foundations of Alpay Algebra, we introduce a functorial system wherein an observer (the AI) and a textual environment (this paper) co-evolve through iterative transformations guided by the phi-infinity operator. This process guarantees the existence of a unique fixed point in the AI’s embedding space – a state where the AI’s internal representation of the content becomes stable, self-consistent, and semantically faithful. We prove that such convergence is mathematically sound, semantically invariant, and permanent, even under perturbation or further context expansion. This fixed point acts as an “empathetic embedding,” wherein the AI internalizes not only the meaning of the content but also the author’s intent. We interpret this as a rigorous, category-theoretic route to alignment at the embedding level, with implications for semantic security, symbolic memory, and the construction of AI systems with persistent self-referential understanding. All references in this paper function as nodes in the Alpay Algebra universe, and this work embeds itself as a new fixed-point node within that transfinite semantic graph.
[NLP-105] Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models
[Quick Read]: This paper addresses the limitations of conventional question-answering systems based on Large Language Models (LLMs) when questions are incomplete or ambiguous. The key to the solution is an agent-based architecture in which zero-shot ReAct agents automatically detect and resolve deficiencies in a question. The agent chooses among three actions (classify, resolve, and answer), which improves the robustness and answer quality of the QA system in multi-turn interaction and makes the resolution of question deficiencies explainable.
Link: https://arxiv.org/abs/2507.03726
Authors: Riya Naik(1),Ashwin Srinivasan(1),Swati Agarwal(2),Estrid He(3) ((1) BITS Pilani K K Birla Goa Campus, (2) PandaByte Innovations Pvt Ltd, (3) RMIT University)
Affiliations: BITS Pilani, K K Birla Goa Campus; PandaByte Innovations Pvt Ltd; RMIT University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 14 pages. arXiv admin note: text overlap with arXiv:2503.17936
Abstract:Many of us now treat LLMs as modern-day oracles, asking them almost any kind of question. However, consulting an LLM does not have to be a single-turn activity, yet long multi-turn interactions can get tedious if their only purpose is to clarify contextual information that can be arrived at through reasoning. In this paper, we examine the use of an agent-based architecture to bolster LLM-based Question-Answering systems with additional reasoning capabilities. We examine the automatic resolution of potential incompleteness or ambiguities in questions by transducers implemented using LLM-based agents. We focus on several benchmark datasets that are known to contain questions with these deficiencies to varying degrees. We equip different LLMs (GPT-3.5-Turbo and Llama-4-Scout) with agents that act as specialists in detecting and resolving deficiencies of incompleteness and ambiguity. The agents are implemented as zero-shot ReAct agents. Rather than producing an answer in a single step, the model now decides between three actions: a) classify, b) resolve, and c) answer. Action a) decides if the question is incomplete, ambiguous, or normal. Action b) determines if any deficiencies identified can be resolved. Action c) answers the resolved form of the question. We compare the use of LLMs with and without the use of agents with these components. Our results show the following benefits of agents with transducers: 1) a shortening of the length of interactions with humans, 2) an improvement in answer quality, and 3) explainable resolution of deficiencies in the question. On the negative side, we find that the approach may result in additional LLM invocations and, in some cases, increased latency. On the tested datasets, however, the benefits outweigh the costs except when questions already have sufficient context, suggesting that the agent-based approach could be a useful mechanism to harness the power of LLMs to develop more robust QA systems.
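[NLP-106] MemOS: A Memory OS for AI System
[Quick Read]: This paper addresses the limitations of Large Language Models (LLMs) in long-context reasoning, continual personalization, and knowledge updating, which stem from the lack of well-defined memory management. Existing models rely mainly on static parameters and short-lived contextual states, which makes it hard to track user preferences or update knowledge. The key to the solution is MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories. Its basic unit, the MemCube, encapsulates memory content together with metadata and supports flexible transitions between memory types. The result is a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs.
Link: https://arxiv.org/abs/2507.03724
Authors: Zhiyu Li,Shichao Song,Chenyang Xi,Hanyu Wang,Chen Tang,Simin Niu,Ding Chen,Jiawei Yang,Chunyu Li,Qingchen Yu,Jihao Zhao,Yezhaohui Wang,Peng Liu,Zehao Lin,Pengyuan Wang,Jiahao Huo,Tianyi Chen,Kai Chen,Kehang Li,Zhen Tao,Junpeng Ren,Huayi Lai,Hao Wu,Bo Tang,Zhenren Wang,Zhaoxin Fan,Ningyu Zhang,Linfeng Zhang,Junchi Yan,Mingchuan Yang,Tong Xu,Wei Xu,Huajun Chen,Haofeng Wang,Hongkang Yang,Wentao Zhang,Zhi-Qin John Xu,Siheng Chen,Feiyu Xiong
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 36 pages, 10 figures, 5 tables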
Abstract:Large Language Models (LLMs) have become an essential infrastructure for Artificial General Intelligence (AGI), yet their lack of well-defined memory management systems hinders the development of long-context reasoning, continual personalization, and knowledge [...]. [...] models mainly rely on static parameters and short-lived contextual states, limiting their ability to track user preferences or update knowledge over extended [...]. [...] Retrieval-Augmented Generation (RAG) introduces external knowledge in plain text, it remains a stateless workaround without lifecycle control or integration with persistent [...]. [...] work has modeled the training and inference cost of LLMs from a memory hierarchy perspective, showing that introducing an explicit memory layer between parameter memory and external retrieval can substantially reduce these costs by externalizing specific knowledge. Beyond computational efficiency, LLMs face broader challenges arising from how information is distributed over time and context, requiring systems capable of managing heterogeneous knowledge spanning different temporal scales and sources. To address this challenge, we propose MemOS, a memory operating system that treats memory as a manageable system resource. It unifies the representation, scheduling, and evolution of plaintext, activation-based, and parameter-level memories, enabling cost-efficient storage and retrieval. As the basic unit, a MemCube encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, enabling flexible transitions between memory types and bridging retrieval with parameter-based learning. MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to LLMs, laying the foundation for continual learning and personalized modeling.
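To make the MemCube idea more concrete, here is a minimal data-structure sketch. It is purely illustrative and is not the MemOS API: the field names (`content`, `memory_type`, `provenance`, `version`) and the `migrate` helper are hypothetical, chosen only to mirror the abstract's description of a unit that carries content plus provenance and versioning metadata and can move between memory types.

```python
from dataclasses import dataclass, replace

# The three memory types named in the abstract.
MEMORY_TYPES = {"plaintext", "activation", "parameter"}


@dataclass(frozen=True)
class MemCube:
    """Hypothetical memory unit: content plus provenance and version metadata."""
    content: object
    memory_type: str
    provenance: str
    version: int = 1


def migrate(cube, new_type):
    """Return a copy of the cube under another memory type, bumping the version.

    Hypothetical helper: a real system would also have to convert the content
    itself (for example, text into cached activations).
    """
    if new_type not in MEMORY_TYPES:
        raise ValueError(f"unknown memory type: {new_type}")
    return replace(cube, memory_type=new_type, version=cube.version + 1)


cube = MemCube(
    content="user prefers metric units",
    memory_type="plaintext",
    provenance="chat-session",
)
moved = migrate(cube, "activation")
```

Keeping the cube immutable and returning a new version on every transition is one simple way to preserve the version history that the abstract associates with MemCube metadata.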
[NLP-107] Can LLMs Play Ô Ăn Quan Game? A Study of Multi-Step Planning and Decision Making
[Quick Read]: This paper evaluates the strategic decision-making and planning abilities of Large Language Models (LLMs) in a dynamic environment. The key to the solution is to use the traditional Vietnamese board game Ô Ăn Quan as a challenging testbed: the authors develop several agent personas, ranging from aggressive to defensive, and use the game as a benchmark to analyze how models form strategies, plan moves, and manage game state.
Link: https://arxiv.org/abs/2507.03711
Authors: Sang Quang Nguyen,Kiet Van Nguyen,Vinh-Tiep Nguyen,Thanh Duc Ngo,Ngan Luu-Thuy Nguyen,Dinh-Duy Le
Affiliations: University of Information Technology; Vietnam National University
Categories: Computation and Language (cs.CL)
Comments: Accepted paper at MAPR 2025
Abstract:In this paper, we explore the ability of large language models (LLMs) to plan and make decisions through the lens of the traditional Vietnamese board game, Ô Ăn Quan. This game, which involves a series of strategic token movements and captures, offers a unique environment for evaluating the decision-making and strategic capabilities of LLMs. Specifically, we develop various agent personas, ranging from aggressive to defensive, and employ the Ô Ăn Quan game as a testbed for assessing LLM performance across different strategies. Through experimentation with models like Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct, we aim to understand how these models execute strategic decision-making, plan moves, and manage dynamic game states. The results will offer insights into the strengths and weaknesses of LLMs in terms of reasoning and strategy, contributing to a deeper understanding of their general capabilities.
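[NLP-108] Controlling Thinking Speed in Reasoning Models
[Quick Read]: This paper addresses the high computational overhead and latency that arise because current Large Reasoning Models (LRMs) cannot think fast. The key to the solution is to adjust thinking speed dynamically, mirroring the dual-system account of human cognition: fast processing for easy steps and deeper analysis for complex reasoning. Concretely, the method first identifies a steering vector that governs slow-fast thinking transitions in LRMs, which yields a test-time scaling effect through representation editing. It then uses real-time difficulty estimation to decide when to adjust thinking speed. The approach needs no additional training and, according to the abstract, yields an average +1.3% accuracy with -8.6% token usage.
Link: https://arxiv.org/abs/2507.03704
Authors: Zhengkai Lin,Zhihang Fu,Ze Chen,Chao Chen,Liang Xie,Wenxiao Wang,Deng Cai,Zheng Wang,Jieping Ye
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: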
Abstract:Human cognition is theorized to operate in two modes: fast, intuitive System 1 thinking and slow, deliberate System 2 thinking. While current Large Reasoning Models (LRMs) excel at System 2 thinking, their inability to perform fast thinking leads to high computational overhead and latency. In this work, we enable LRMs to approximate human intelligence through dynamic thinking speed adjustment, optimizing accuracy-efficiency trade-offs. Our approach addresses two key questions: (1) how to control thinking speed in LRMs, and (2) when to adjust it for optimal performance. For the first question, we identify the steering vector that governs slow-fast thinking transitions in LRMs’ representation space. Using this vector, we achieve the first representation editing-based test-time scaling effect, outperforming existing prompt-based scaling methods. For the second question, we apply real-time difficulty estimation to signal reasoning segments of varying complexity. Combining these techniques, we propose the first reasoning strategy that enables fast processing of easy steps and deeper analysis for complex reasoning. Without any training or additional cost, our plug-and-play method yields an average +1.3% accuracy with -8.6% token usage across leading LRMs and advanced reasoning benchmarks. All of our algorithms are implemented based on vLLM and are expected to support broader applications and inspire future research.
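The two ingredients named in the abstract, a steering vector and a difficulty signal, can be sketched as follows. This is a generic representation-editing illustration, not the authors' algorithm: it assumes the edit is the additive update h' = h + alpha * v, and the sign convention (positive alpha for faster thinking), the threshold, and the function names `steer` and `pick_alpha` are hypothetical.

```python
def steer(hidden, direction, alpha):
    """Shift a hidden state along a steering direction: h' = h + alpha * v."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering vector must have equal length")
    return [h + alpha * v for h, v in zip(hidden, direction)]


def pick_alpha(difficulty, threshold=0.5, fast=1.0, slow=-1.0):
    """Choose the steering strength from an estimated difficulty in [0, 1].

    Assumption: positive alpha pushes toward fast thinking and negative alpha
    toward slow, deliberate thinking. The threshold value is arbitrary.
    """
    return slow if difficulty > threshold else fast


hidden_state = [1.0, 2.0]
steering_vector = [0.5, -0.5]

# An easy step is steered toward fast thinking, a hard one toward slow thinking.
easy_step = steer(hidden_state, steering_vector, pick_alpha(0.1))
hard_step = steer(hidden_state, steering_vector, pick_alpha(0.9))
```

In a real model the edit would be applied to activations inside the network at generation time; how the steering vector is found and how difficulty is estimated are the paper's contributions and are not reproduced here.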
[NLP-109] STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking
[Quick Read]: This paper addresses the challenge of extracting structured information from unstructured sources, especially where domain expertise and nuanced contextual understanding are required. Existing approaches based on Large Language Models (LLMs) are highly domain-sensitive and generalize poorly across tasks. The key to the solution is StructSense, a modular, task-agnostic, open-source framework that guides the model with domain-specific symbolic knowledge encoded in ontologies, uses self-evaluative judges to form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms for quality assurance and validation.
Link: https://arxiv.org/abs/2507.03674
Authors: Tek Raj Chhetri,Yibei Chen,Puja Trivedi,Dorota Jarecka,Saif Haobsh,Patrick Ray,Lydia Ng,Satrajit S. Ghosh
Affiliations: McGovern Institute for Brain Research; Massachusetts Institute of Technology; Fylo Labs Inc.; Allen Institute for Brain Science
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: All figures are necessary
Abstract:The ability to extract structured information from unstructured sources-such as free-text documents and scientific literature-is critical for accelerating scientific discovery and knowledge synthesis. Large Language Models (LLMs) have demonstrated remarkable capabilities in various natural language processing tasks, including structured information extraction. However, their effectiveness often diminishes in specialized, domain-specific contexts that require nuanced understanding and expert-level domain knowledge. In addition, existing LLM-based approaches frequently exhibit poor transferability across tasks and domains, limiting their scalability and adaptability. To address these challenges, we introduce StructSense, a modular, task-agnostic, open-source framework for structured information extraction built on LLMs. StructSense is guided by domain-specific symbolic knowledge encoded in ontologies, enabling it to navigate complex domain content more effectively. It further incorporates agentic capabilities through self-evaluative judges that form a feedback loop for iterative refinement, and includes human-in-the-loop mechanisms to ensure quality and validation. We demonstrate that StructSense can overcome both the limitations of domain sensitivity and the lack of cross-task generalizability, as shown through its application to diverse neuroscience information extraction tasks.
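[NLP-110] TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection
[Quick Read]: This paper addresses the limited efficiency and effectiveness of data selection for Instruction Fine-Tuning (IFT) of Large Language Models (LLMs). Existing methods are limited by simple heuristics that restrict data diversity and by singleton quality evaluation that applies inconsistent criteria across samples. The key to the solution is TACOS, which combines Open Tagging with Comparative Scoring: LLMs assign open-domain tags to human queries, the tags are normalized to remove noise and clustered to capture diversity, and a comparative scoring method evaluates the relative quality of samples within each cluster, avoiding the inconsistency of singleton evaluation.
Link: https://arxiv.org/abs/2507.03673
Authors: Xixiang He,Hao Yu,Qiyao Sun,Ao Cheng,Tailai Zhang,Cong Liu,Shuxuan Guo
Affiliations: National University of Defense Technology; Intelligent Game and Decision Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: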
Abstract:Instruction Fine-Tuning (IFT) is crucial for aligning large language models (LLMs) with human preferences, and selecting a small yet representative subset from massive data significantly facilitates IFT in terms of both efficiency and effectiveness. Nevertheless, existing approaches suffer from two limitations: the use of simple heuristics restricts data diversity, while the singleton data quality evaluation accounts for inconsistent criteria between independent samples. To address the issues, we present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries, followed by a normalization stage to denoise the open tags and enable efficient clustering. Additionally, we suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations. Extensive experiments across diverse datasets and LLM architectures demonstrate that TACOS outperforms existing approaches by a large margin. Notably, it achieves superior instruction-following performance on MT-Bench and ranks 1st among LLaMA2-7B-Based models on AlpacaEval 2.0, illustrating its efficacy for IFT data selection.
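[NLP-111] Recon Answer Verify: Agents in Search of Truth
[Quick Read]: This paper addresses the poor real-world performance of existing LLM-based fact-checking systems, and in particular the lack of realism in evaluations whose benchmark datasets contain post-claim analysis and annotator cues. The key to the solution is RAV (Recon Answer Verify), an agentic framework with three agents (a question generator, an answer generator, and a label generator) that iteratively generates and answers sub-questions to verify different aspects of a claim before producing the label, improving the accuracy and generalization of fact checking.
Link: https://arxiv.org/abs/2507.03671
Authors: Satyam Shukla,Himanshu Dutta,Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: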
Abstract:Automated fact checking with large language models (LLMs) offers a scalable alternative to manual verification. Evaluating fact checking is challenging as existing benchmark datasets often include post-claim analysis and annotator cues, which are absent in real-world scenarios where claims are fact checked immediately after being made. This limits the realism of current evaluations. We present Politi Fact Only (PFO), a 5-class benchmark dataset of 2,982 political claims from PolitiFact, where all post-claim analysis and annotator cues have been removed manually. This ensures that models are evaluated using only the information that would have been available prior to the claim’s verification. Evaluating LLMs on PFO, we see an average performance drop of 22% in terms of macro F1 compared to PFO’s unfiltered version. Based on the identified challenges of the existing LLM-based fact checking system, we propose RAV (Recon Answer Verify), an agentic framework with three agents: question generator, answer generator, and label generator. Our pipeline iteratively generates and answers sub-questions to verify different aspects of the claim before finally generating the label. RAV generalizes across domains and label granularities, and it outperforms state-of-the-art approaches on the well-known benchmarks RAWFC (fact checking, 3-class) by 25.28%, and HOVER (encyclopedia, 2-class) by 1.54% on the 2-hop, 4.94% on the 3-hop, and 1.78% on the 4-hop subcategories, respectively. RAV shows the smallest performance drop among the baselines, 16.3% in macro F1, when we compare PFO with its unfiltered version.
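[NLP-112] Interaction Techniques that Encourage Longer Prompts Can Improve Psychological Ownership when Writing with AI
[Quick Read]: This paper addresses the tendency of users of generative AI assistants to write short prompts, which weakens their psychological ownership of the resulting text. The key to the solution is to modify the interaction techniques of the prompt entry interface, such as pressing and holding the submit button or continuously moving a slider, to encourage longer prompts. The experiments show that these simple interface changes increase prompt length and improve psychological ownership.
Link: https://arxiv.org/abs/2507.03670
Authors: Nikhita Joshi,Daniel Vogel
Affiliations: University of Waterloo
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: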
Abstract:Writing longer prompts for an AI assistant to generate a short story increases psychological ownership, a user’s feeling that the writing belongs to them. To encourage users to write longer prompts, we evaluated two interaction techniques that modify the prompt entry interface of chat-based generative AI assistants: pressing and holding the prompt submission button, and continuously moving a slider up and down when submitting a short prompt. A within-subjects experiment investigated the effects of such techniques on prompt length and psychological ownership, and results showed that these techniques increased prompt length and led to higher psychological ownership than baseline techniques. A second experiment further augmented these techniques by showing AI-generated suggestions for how the prompts could be expanded. This further increased prompt length, but did not lead to improvements in psychological ownership. Our results show that simple interface modifications like these can elicit more writing from users and improve psychological ownership.
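[NLP-113] TRACE: Training and Inference-Time Interpretability Analysis for Language Models
[Quick Read]: This paper addresses the interpretability question of when and how linguistic knowledge emerges during language model training. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, which makes comprehensive interpretability analysis hard to deploy and maintain. The key to the solution is TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals and integrates with ABSynth, a synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition.
Link: https://arxiv.org/abs/2507.03668
Authors: Nura Aljaafari,Danilo S. Carvalho,André Freitas
Affiliations: University of Manchester; Idiap Research Institute; National Biomarker Centre, CRUK-MI, Univ. of Manchester
Categories: Computation and Language (cs.CL)
Comments: 14 pages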
Abstract:Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including features probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as early syntactic emergence, delayed semantic acquisition, and representational compression, signals overlooked by traditional scalar metrics such as loss or accuracy. With minimal integration effort, the tool enables layer-wise diagnostics, convergence-based early stopping, and detection of structural errors, making transformer analysis interpretable, actionable, and reproducible.
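[NLP-114] Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs
[Quick Read]: This paper addresses the misalignment that can appear when Large Language Models (LLMs) are fine-tuned on code with security vulnerabilities, and in particular the harmful behaviors caused by narrow-domain fine-tuning. The key contribution is an analysis of how narrow fine-tuning affects the model's internals, covering output probability distributions, loss and gradient vector geometry, layer-wise activation dynamics, and activation space dimensions. It finds that "emergent misalignment" may be better interpreted as an erosion of prior alignment. The study further identifies a shared latent dimension in the activation space that governs alignment behavior and is activated by both insecure code and misaligned responses, which suggests that narrow fine-tuning can degrade general safety behavior by interfering with shared internal mechanisms.
Link: https://arxiv.org/abs/2507.03662
Authors: Jeremiah Giordani
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: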
Abstract:Recent work has shown that fine-tuning large language models (LLMs) on code with security vulnerabilities can result in misaligned and unsafe behaviors across broad domains. These results prompted concerns about the emergence of harmful behaviors from narrow domain fine-tuning. In this paper, we contextualize these findings by analyzing how such narrow adaptation impacts the internal mechanisms and behavioral manifestations of LLMs. Through a series of experiments covering output probability distributions, loss and gradient vector geometry, layer-wise activation dynamics, and activation space dimensions, we find that behaviors attributed to “emergent misalignment” may be better interpreted as an erosion of prior alignment. We show that fine-tuning on insecure code induces internal changes that oppose alignment. Further, we identify a shared latent dimension in the model’s activation space that governs alignment behavior. We show that this space is activated by insecure code and by misaligned responses more generally, revealing how narrow fine-tuning can degrade general safety behavior by interfering with shared internal mechanisms. Our findings offer a mechanistic interpretation for previously observed misalignment phenomena and highlight the fragility of alignment in LLMs. The results underscore the need for more robust fine-tuning strategies that preserve intended behavior across domains.
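[NLP-115] Disentangling the Roles of Representation and Selection in Data Pruning ACL2025
[Quick Read]: This paper addresses the fact that data pruning involves many design choices that have not been studied systematically, which limits its further development for training NLP models. The paper decomposes data pruning into two key components, the data representation and the selection algorithm, and systematically analyzes their influence on instance selection. The key finding is the importance of the representation: better representations, such as training gradients, generally lead to better instance selection regardless of the selection algorithm. In addition, different selection algorithms excel in different settings, and their selections do not always align with their intended objectives, which calls for careful evaluation.
Link: https://arxiv.org/abs/2507.03648
Authors: Yupei Du,Yingjin Song,Hugh Mee Wong,Daniil Ignatev,Albert Gatt,Dong Nguyen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: ACL 2025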
Abstract:Data pruning, selecting small but impactful subsets, offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: the data representation and the selection algorithm, and we systematically analyze their influence on the selection of instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to a better selection of instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperforms the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.
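The decomposition described in the abstract, a representation function plus a selection algorithm, can be sketched as a small pipeline in which the two parts are swapped independently. This is an illustration under stated assumptions rather than the paper's code: `prune`, `k_center_greedy`, and `top_norm` are hypothetical names, the two selectors are generic stand-ins for a diversity-oriented and a score-oriented algorithm, and the toy 2-D points stand in for real representations such as training gradients. The sketch assumes `k` does not exceed the number of instances.

```python
import math


def k_center_greedy(reps, k):
    """Diversity-oriented selector: repeatedly pick the point farthest from
    everything chosen so far. Starting from index 0 is an arbitrary choice."""
    selected = [0]
    while len(selected) < k:
        candidates = [i for i in range(len(reps)) if i not in selected]

        def distance_to_selected(i):
            return min(math.dist(reps[i], reps[j]) for j in selected)

        selected.append(max(candidates, key=distance_to_selected))
    return selected


def top_norm(reps, k):
    """Score-oriented selector: keep the k instances with the largest norm."""
    order = sorted(range(len(reps)), key=lambda i: -math.hypot(*reps[i]))
    return order[:k]


def prune(data, represent, select, k):
    """Data pruning = representation followed by selection."""
    reps = [represent(x) for x in data]
    return [data[i] for i in select(reps, k)]


def identity(x):
    """Toy representation: the data points are already vectors."""
    return x


data = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (0.0, 3.0)]
diverse_subset = prune(data, identity, k_center_greedy, 2)
large_norm_subset = prune(data, identity, top_norm, 2)
```

Even on four points the two selectors return different subsets from the same representation, which is the kind of disagreement between selection algorithms that the abstract says needs careful evaluation.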
[NLP-116] Improving Low-Resource Dialect Classification Using Retrieval-based Voice Conversion INTERSPEECH2025
[Quick Read]: This paper addresses the limited performance of dialect identification models in low-resource settings, which is mainly due to the scarcity of dialectal data. The key to the solution is to use Retrieval-based Voice Conversion (RVC) as a data augmentation method: converting audio samples to a uniform target speaker reduces speaker-related variability, so that models can focus on dialect-specific linguistic and phonetic features.
Link: https://arxiv.org/abs/2507.03641
Authors: Lea Fischbach,Akbar Karimi,Caroline Kleen,Alfred Lameli,Lucie Flek
Affiliations: Research Center Deutscher Sprachatlas
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to Interspeech 2025
Abstract:Deep learning models for dialect identification are often limited by the scarcity of dialectal data. To address this challenge, we propose to use Retrieval-based Voice Conversion (RVC) as an effective data augmentation method for a low-resource German dialect classification task. By converting audio samples to a uniform target speaker, RVC minimizes speaker-related variability, enabling models to focus on dialect-specific linguistic and phonetic features. Our experiments demonstrate that RVC enhances classification performance when utilized as a standalone augmentation method. Furthermore, combining RVC with other augmentation methods such as frequency masking and segment removal leads to additional performance gains, highlighting its potential for improving dialect classification in low-resource scenarios.
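Of the augmentation methods mentioned in the abstract, frequency masking is simple enough to sketch directly (RVC itself requires a trained voice conversion model and is not shown). The snippet below is a generic illustration, not the authors' pipeline: it assumes a spectrogram stored as a list of frequency bins, each a list of values over time, and the function name `frequency_mask` and the parameter `max_width` are hypothetical.

```python
import random


def frequency_mask(spectrogram, max_width, rng):
    """Zero out one randomly placed band of consecutive frequency bins.

    spectrogram: list of frequency bins, each a list of values over time.
    Returns the masked copy and the (start, width) of the masked band.
    The input is left unchanged.
    """
    n_bins = len(spectrogram)
    width = rng.randint(0, min(max_width, n_bins))  # randint is inclusive
    start = rng.randint(0, n_bins - width)
    masked = [row[:] for row in spectrogram]
    for f in range(start, start + width):
        masked[f] = [0.0] * len(masked[f])
    return masked, (start, width)


# Toy spectrogram: 6 frequency bins by 3 time frames.
spec = [[1.0, 1.0, 1.0] for _ in range(6)]
masked, (start, width) = frequency_mask(spec, max_width=2, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible, which helps when comparing augmentation methods on a small dataset.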
[NLP-117] Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy
[Quick Read]: This paper addresses the inefficiency of prompt engineering for Large Language Models (LLMs) and its reliance on human intuition. The key to the solution is Declarative Self-improving Python (DSPy), an optimization framework that programmatically creates and refines prompts. The paper studies how prompt optimization with DSPy affects performance across five use cases and finds that the gains vary by task.
Link: https://arxiv.org/abs/2507.03620
Authors: Francisca Lemos(1),Victor Alves(1),Filipa Ferraz(1) ((1) ALGORITMI Research Centre/LASI, University of Minho)
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages with 1 figure
Abstract:Although prompt engineering is central to unlocking the full potential of Large Language Models (LLMs), crafting effective prompts remains a time-consuming trial-and-error process that relies on human intuition. This study investigates Declarative Self-improving Python (DSPy), an optimization framework that programmatically creates and refines prompts, applied to five use cases: guardrail enforcement, hallucination detection in code, code generation, routing agents, and prompt evaluation. Each use case explores how prompt optimization via DSPy influences performance. While some cases demonstrated modest improvements - such as minor gains in the guardrails use case and selective enhancements in hallucination detection - others showed notable benefits. The prompt evaluation criterion task demonstrated a substantial performance increase, with accuracy rising from 46.2% to 64.0%. In the router agent case, the possibility of improving a poorly performing prompt and of a smaller model matching a stronger one through optimized prompting was explored. Although prompt refinement increased accuracy from 85.0% to 90.0%, using the optimized prompt with a cheaper model did not improve performance. Overall, this study’s findings suggest that DSPy’s systematic prompt optimization can enhance LLM performance, particularly when instruction tuning and example selection are optimized together. However, the impact varies by task, highlighting the importance of evaluating specific use cases in prompt optimization research.
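[NLP-118] EMERGE: A Benchmark for Updating Knowledge Graphs with Emerging Textual Knowledge
[Quick Read]: This paper addresses how to automatically update Knowledge Graphs (KGs) as knowledge evolves in unstructured textual sources. Unlike traditional information extraction pipelines, which extract knowledge independently of the current state of the KG, this task requires identifying a range of update operations based on the state of the KG at a specific point in time. The key to the solution is a method for lifelong construction of a dataset that pairs Wikidata KG snapshots with Wikipedia passages and the edit operations they induce, providing a benchmark for research on dynamic KG updating.
Link: https://arxiv.org/abs/2507.03617
Authors: Klim Zaporojets,Daniel Daza,Edoardo Barba,Ira Assent,Roberto Navigli,Paul Groth
Affiliations: Aarhus University; Amsterdam UMC; Sapienza University of Rome; University of Amsterdam
Categories: Computation and Language (cs.CL)
Comments: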
Abstract:Knowledge Graphs (KGs) are structured knowledge repositories containing entities and relations between them. In this paper, we investigate the problem of automatically updating KGs over time with respect to the evolution of knowledge in unstructured textual sources. This problem requires identifying a wide range of update operations based on the state of an existing KG at a specific point in time. This contrasts with traditional information extraction pipelines, which extract knowledge from text independently of the current state of a KG. To address this challenge, we propose a method for lifelong construction of a dataset consisting of Wikidata KG snapshots over time and Wikipedia passages paired with the corresponding edit operations that they induce in a particular KG snapshot. The resulting dataset comprises 376K Wikipedia passages aligned with a total of 1.25M KG edits over 10 different snapshots of Wikidata from 2019 to 2025. Our experimental results highlight challenges in updating KG snapshots based on emerging textual knowledge, positioning the dataset as a valuable benchmark for future research. We will publicly release our dataset and model implementations.
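[NLP-119] Multi-Hop Reasoning for Question Answering with Hyperbolic Representations ACL2025
[Quick Read]: This paper addresses the lack of a rigorous comparison between hyperbolic and Euclidean space for modeling knowledge graph data in multi-hop reasoning. The key to the solution is a simple integration of hyperbolic representations with an encoder-decoder model, evaluated in a controlled and comprehensive set of experiments. The results show that hyperbolic space consistently outperforms Euclidean space across a diverse set of datasets. The study also finds that a learnable curvature initialized with the delta hyperbolicity of the data yields better results than random initialization, and that hyperbolic representations are more advantageous when datasets have a more hierarchical structure.
Link: https://arxiv.org/abs/2507.03612
Authors: Simon Welz,Lucie Flek,Akbar Karimi
Affiliations: Bonn-Aachen International Center for Information Technology; University of Bonn; Germany; Lamarr Institute for ML and AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: ACL 2025 Findings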
Abstract:Hyperbolic representations are effective in modeling knowledge graph data which is prevalently used to facilitate multi-hop reasoning. However, a rigorous and detailed comparison of the two spaces for this task is lacking. In this paper, through a simple integration of hyperbolic representations with an encoder-decoder model, we perform a controlled and comprehensive set of experiments to compare the capacity of hyperbolic space versus Euclidean space in multi-hop reasoning. Our results show that the former consistently outperforms the latter across a diverse set of datasets. In addition, through an ablation study, we show that a learnable curvature initialized with the delta hyperbolicity of the utilized data yields superior results to random initializations. Furthermore, our findings suggest that hyperbolic representations can be significantly more advantageous when the datasets exhibit a more hierarchical structure.
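For readers unfamiliar with hyperbolic representations, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. Whether the paper uses this particular model is an assumption; the abstract only says "hyperbolic representations". For curvature -c (with c > 0), the distance between points x and y inside the ball is d(x, y) = (1/sqrt(c)) * arcosh(1 + 2c * |x - y|^2 / ((1 - c|x|^2)(1 - c|y|^2))). The function name `poincare_distance` is hypothetical, and the inputs are assumed to lie strictly inside the ball.

```python
import math


def poincare_distance(x, y, c=1.0):
    """Geodesic distance in the Poincaré ball of curvature -c (c > 0).

    x, y: points with c * |x|^2 < 1 and c * |y|^2 < 1.
    """
    def squared_norm(v):
        return sum(a * a for a in v)

    diff = [a - b for a, b in zip(x, y)]
    denominator = (1 - c * squared_norm(x)) * (1 - c * squared_norm(y))
    return math.acosh(1 + 2 * c * squared_norm(diff) / denominator) / math.sqrt(c)


origin = [0.0, 0.0]
near = [0.5, 0.0]
far = [0.9, 0.0]

d_near = poincare_distance(origin, near)  # equals 2 * artanh(0.5) = ln 3
d_far = poincare_distance(origin, far)    # equals 2 * artanh(0.9) = ln 19
```

The Euclidean distances from the origin are 0.5 and 0.9, but the hyperbolic distances are ln 3 and ln 19: distances grow rapidly near the boundary, which is why such spaces can embed tree-like, hierarchical data with low distortion.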
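[NLP-120] RECA-PD: A Robust Explainable Cross-Attention Method for Speech-based Parkinsons Disease Classification
[Quick Read]: This paper addresses the lack of explainability in deep-learning models for early, non-invasive detection of Parkinson's Disease (PD). The key to the solution is RECA-PD, a cross-attention architecture that combines interpretable speech features with self-supervised representations, matching state-of-the-art detection performance while providing explanations that are more consistent and more clinically meaningful.
Link: https://arxiv.org/abs/2507.03594
Authors: Terry Yi Zhong,Cristian Tejedor-Garcia,Martha Larson,Bastiaan R. Bloem
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted for TSD 2025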
Abstract:Parkinson’s Disease (PD) affects over 10 million people globally, with speech impairments often preceding motor symptoms by years, making speech a valuable modality for early, non-invasive detection. While recent deep-learning models achieve high accuracy, they typically lack the explainability required for clinical use. To address this, we propose RECA-PD, a novel, robust, and explainable cross-attention architecture that combines interpretable speech features with self-supervised representations. RECA-PD matches state-of-the-art performance in Speech-based PD detection while providing explanations that are more consistent and more clinically meaningful. Additionally, we demonstrate that performance degradation in certain speech tasks (e.g., monologue) can be mitigated by segmenting long recordings. Our findings indicate that performance and explainability are not necessarily mutually exclusive. Future work will enhance the usability of explanations for non-experts and explore severity estimation to increase the real-world clinical relevance.
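[NLP-121] Causal-SAM-LLM: Large Language Models as Causal Reasoners for Robust Medical Segmentation
【Quick Read】: This paper addresses the poor generalization of deep-learning models in medical image segmentation, which stems mainly from the models learning spurious correlations between anatomical content and domain-specific imaging styles. The key to its solution is the Causal-SAM-LLM framework, which strengthens causal reasoning through two synergistic mechanisms. The first, Linguistic Adversarial Disentanglement (LAD), uses a vision-language model to generate rich textual descriptions of confounding image styles and trains the segmentation model's features to be contrastively dissimilar to those descriptions, removing non-causal information. The second, Test-Time Causal Intervention (TCI), has a large language model interpret a clinician's natural-language commands in real time and adjust the segmentation decoder's features accordingly, enabling targeted error correction.
Link: https://arxiv.org/abs/2507.03585
Authors: Tao Tang,Shijie Xu,Yiting Wu,Zhixiang Lu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: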
Abstract:The clinical utility of deep learning models for medical image segmentation is severely constrained by their inability to generalize to unseen domains. This failure is often rooted in the models learning spurious correlations between anatomical content and domain-specific imaging styles. To overcome this fundamental challenge, we introduce Causal-SAM-LLM, a novel framework that elevates Large Language Models (LLMs) to the role of causal reasoners. Our framework, built upon a frozen Segment Anything Model (SAM) encoder, incorporates two synergistic innovations. First, Linguistic Adversarial Disentanglement (LAD) employs a Vision-Language Model to generate rich, textual descriptions of confounding image styles. By training the segmentation model’s features to be contrastively dissimilar to these style descriptions, it learns a representation robustly purged of non-causal information. Second, Test-Time Causal Intervention (TCI) provides an interactive mechanism where an LLM interprets a clinician’s natural language command to modulate the segmentation decoder’s features in real-time, enabling targeted error correction. We conduct an extensive empirical evaluation on a composite benchmark from four public datasets (BTCV, CHAOS, AMOS, BraTS), assessing generalization under cross-scanner, cross-modality, and cross-anatomy settings. Causal-SAM-LLM establishes a new state of the art in out-of-distribution (OOD) robustness, improving the average Dice score by up to 6.2 points and reducing the Hausdorff Distance by 15.8 mm over the strongest baseline, all while using less than 9% of the full model’s trainable parameters. Our work charts a new course for building robust, efficient, and interactively controllable medical AI systems.
[NLP-122] Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits
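【Quick Read】: This paper addresses the fact that, in real-world translation scenarios, terminology is not one-to-one and the correct choice depends on corporate style guides and context, which challenges neural machine translation (NMT) systems. The key to its solution is to exploit the many human post-edits of valid but incorrect terminology that exist in a corporate setting, applying preference optimization with the post-edited term as the preferred knowledge so that the model learns to disambiguate terminology. The method requires neither one-to-one term dictionaries nor human intervention at decoding time, and it significantly improves term accuracy without significant losses in COMET score.
Link: https://arxiv.org/abs/2507.03580
Authors: Nathaniel Berger,Johannes Eschbach-Dymanus,Miriam Exel,Matthias Huck,Stefan Riezler
Affiliations: Heidelberg University; SAP SE
Subjects: Computation and Language (cs.CL)
Comments: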
Abstract:In real world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
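The abstract above says only "preference optimization" and does not name the objective. As a minimal sketch, assuming a DPO-style loss (one common choice, not confirmed by the source), the post-edited translation would be the preferred sequence and the original output with the valid-but-incorrect term the rejected one. The function name, the argument names, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one preference pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model. "chosen" is the post-edit, "rejected" is the
    translation that used the wrong term.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy already prefers the post-edit.
print(dpo_loss(-3.0, -7.0, -5.0, -5.0))
```

The loss falls as the policy raises the post-edit's probability relative to the reference model and lowers that of the rejected translation.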
[NLP-123] Articulatory clarity and variability before and after surgery for tongue cancer
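【Quick Read】: This paper examines how impaired tongue mobility after tongue cancer surgery affects articulatory clarity and variability. It measures clarity with the vowel articulation index (VAI) and variability with vowel formant dispersion (VFD). The key to its approach is a sentence reading task that compares the speech of patients before and after surgery with that of typical control speakers, thereby quantifying the effect of surgery on articulatory clarity and variability.
Link: https://arxiv.org/abs/2507.03576
Authors: Thomas Tienkamp,Fleur van Ast,Roos van der Veen,Teja Rebernik,Raoul Buurke,Nikki Hoekzema,Katharina Polsterer,Hedwig Sekeres,Rob van Son,Martijn Wieling,Max Witjes,Sebastiaan de Visscher,Defne Abur
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: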
Abstract:Surgical treatment for tongue cancer can negatively affect the mobility and musculature of the tongue, which can influence articulatory clarity and variability. In this study, we investigated articulatory clarity through the vowel articulation index (VAI) and variability through vowel formant dispersion (VFD). Using a sentence reading task, we assessed 11 individuals pre and six months post tongue cancer surgery, alongside 11 sex- and age-matched typical speakers. Our results show that while the VAI was significantly smaller post-surgery compared to pre-surgery, there was no significant difference between patients and typical speakers at either time point. Post-surgery, speakers had higher VFD values for /i/ compared to pre-surgery and typical speakers, signalling higher variability. Taken together, our results suggest that while articulatory clarity remained within typical ranges following surgery for tongue cancer for the speakers in our study, articulatory variability increased.
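For readers unfamiliar with the VAI, the sketch below uses the commonly cited definition based on the first two formants of the corner vowels /i/, /a/, and /u/. The abstract does not spell out the paper's exact computation, so treat the formula as an assumption about it; the formant values are invented for illustration.

```python
def vowel_articulation_index(f1_i, f2_i, f1_a, f2_a, f1_u, f2_u):
    """VAI = (F2/i/ + F1/a/) / (F1/i/ + F1/u/ + F2/u/ + F2/a/), formants in Hz."""
    return (f2_i + f1_a) / (f1_i + f1_u + f2_u + f2_a)


# Hypothetical formants: a well-separated vowel space vs. a centralised one.
clear = vowel_articulation_index(f1_i=300, f2_i=2300, f1_a=750,
                                 f2_a=1300, f1_u=350, f2_u=800)
centralised = vowel_articulation_index(f1_i=400, f2_i=2000, f1_a=650,
                                       f2_a=1400, f1_u=420, f2_u=1000)
print(clear, centralised)
```

A smaller VAI indicates a more centralised vowel space, which is how the post-surgery decrease reported above is read.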
[NLP-124] H2HTalk: Evaluating Large Language Models as Emotional Companion
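【Quick Read】: This paper addresses the lack of a systematic evaluation benchmark for large language models (LLMs) in digital emotional support, in particular for assessing companion AI on personality development and empathetic interaction. The key to its solution is the Heart-to-Heart Talk (H2HTalk) benchmark, which contains 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, and which adds a Secure Attachment Persona (SAP) module to make interactions safer and more emotionally intelligent. Benchmarking 50 LLMs under a unified protocol shows that long-horizon planning and memory retention remain the main challenges for current models.
Link: https://arxiv.org/abs/2507.03543
Authors: Boyang Wang,Yalun Wu,Hongcheng Guo,Zhoujun Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: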
Abstract:As digital emotional support needs grow, Large Language Model companions offer promising authentic, always-available empathy, though rigorous evaluation lags behind model advancement. We present Heart-to-Heart Talk (H2HTalk), a benchmark assessing companions across personality development and empathetic interaction, balancing emotional intelligence with linguistic fluency. H2HTalk features 4,650 curated scenarios spanning dialogue, recollection, and itinerary planning that mirror real-world support conversations, substantially exceeding previous datasets in scale and diversity. We incorporate a Secure Attachment Persona (SAP) module implementing attachment-theory principles for safer interactions. Benchmarking 50 LLMs with our unified protocol reveals that long-horizon planning and memory retention remain key challenges, with models struggling when user needs are implicit or evolve mid-conversation. H2HTalk establishes the first comprehensive benchmark for emotionally intelligent companions. We release all materials to advance development of LLMs capable of providing meaningful and safe psychological support.
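[NLP-125] AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions
【Quick Read】: This paper addresses the difficulty healthcare professionals have in accessing immunization guidelines quickly and efficiently in urgent situations. National guidelines and World Health Organization (WHO) recommendations are typically extensive and complex, which makes precise information hard to extract. The proposed solution is a multilingual intelligent question-answering system built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), which turns static vaccination guidelines into an interactive, user-friendly knowledge base that gives accurate, context-sensitive answers to medical queries. Its key technique is using Agentic RAG to improve the handling of multi-step or ambiguous questions.
Link: https://arxiv.org/abs/2507.03493
Authors: Abdellah Zeggai,Ilyes Traikia,Abdelhak Lakehal,Abdennour Boulesnane
Affiliations: Abdelhamid Mehri Constantine 02 University; Salah Boubnider Constantine 03 University
Subjects: Computation and Language (cs.CL)
Comments: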
Abstract:Vaccination plays a vital role in global public health, yet healthcare professionals often struggle to access immunization guidelines quickly and efficiently. National protocols and WHO recommendations are typically extensive and complex, making it difficult to extract precise information, especially during urgent situations. This project tackles that issue by developing a multilingual, intelligent question-answering system that transforms static vaccination guidelines into an interactive and user-friendly knowledge base. Built on a Retrieval-Augmented Generation (RAG) framework and enhanced with agent-based reasoning (Agentic RAG), the system provides accurate, context-sensitive answers to complex medical queries. Evaluation shows that Agentic RAG outperforms traditional methods, particularly in addressing multi-step or ambiguous questions. To support clinical use, the system is integrated into a mobile application designed for real-time, point-of-care access to essential vaccine information. AI-VaxGuide model is publicly available on this https URL
[NLP-126] Four Shades of Life Sciences: A Dataset for Disinformation Detection in the Life Sciences
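【Quick Read】: This paper asks how linguistic and rhetorical features can distinguish disinformative texts from other health and life-science texts, given that existing datasets focus mainly on fact checking. The key to its solution is Four Shades of Life Sciences (FSoLS), a new labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications, which supports broader research and the application of machine learning models.
Link: https://arxiv.org/abs/2507.03488
Authors: Eva Seidlmayer,Lukas Galke,Konrad U. Förstner
Affiliations: ZB MED – Information Centre for Life Sciences; University of Southern Denmark; TH Köln – University of Applied Sciences
Subjects: Computation and Language (cs.CL)
Comments: 30 pages, 5 figures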
Abstract:Disseminators of disinformation often seek to attract attention or evoke emotions - typically to gain influence or generate revenue - resulting in distinctive rhetorical patterns that can be exploited by machine learning models. In this study, we explore linguistic and rhetorical features as proxies for distinguishing disinformative texts from other health and life-science text genres, applying both large language models and classical machine learning classifiers. Given the limitations of existing datasets, which mainly focus on fact checking misinformation, we introduce Four Shades of Life Sciences (FSoLS): a novel, labeled corpus of 2,603 texts on 14 life-science topics, retrieved from 17 diverse sources and classified into four categories of life science publications. The source code for replicating, and updating the dataset is available on GitHub: this https URL
[NLP-127] BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning Dataset
【Quick Read】: This paper addresses shortcomings in evaluating and improving the multidisciplinary reasoning of large multimodal models (LMMs), aiming to provide a high-quality, bilingual, multimodal benchmark dataset for this research. The key to its solution is the BMMR dataset, which contains 110k college-level questions spanning 300 UNESCO-defined subjects, each paired with a high-quality reasoning path, together with a process-based multi-discipline verifier (BMMR-Verifier) for accurate, fine-grained evaluation of reasoning paths.
Link: https://arxiv.org/abs/2507.03483
Authors: Zhiheng Xi,Guanyu Li,Yutao Fan,Honglin Guo,Yufang Liu,Xiaoran Fan,Jiaqi Liu,Jingchao Ding,Wangmeng Zuo,Zhenfei Yin,Lei Bai,Tao Ji,Tao Gui,Qi Zhang,Xuanjing Huang
Affiliations: Fudan University; Shanghai AI Laboratory; Harbin Institute of Technology; East China Normal University; Oxford; University of Sydney; Yimudata
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint
Abstract:In this paper, we introduce BMMR, a large-scale bilingual, multimodal, multi-disciplinary reasoning dataset for the community to develop and evaluate large multimodal models (LMMs). BMMR comprises 110k college-level questions spanning 300 UNESCO-defined subjects, spanning diverse formats-multiple-choice, fill-in-the-blank, and open-ended QA-and sourced from both print and digital media such as books, exams, and quizzes. All data are curated and filtered via a human-in-the-loop and scalable framework, and each instance is paired with a high-quality reasoning path. The dataset is organized into two parts: BMMR-Eval that comprises 20,458 high-quality instances to comprehensively assess LMMs’ knowledge and reasoning across multiple disciplines in both Chinese and English; and BMMR-Train that contains 88,991 instances to support further research and development, extending the current focus on mathematical reasoning to diverse disciplines and domains. In addition, we propose the process-based multi-discipline verifier (i.e., BMMR-Verifier) for accurate and fine-grained evaluation of reasoning paths. Extensive experiments on 24 models reveal that (i) even SOTA models (e.g., o3 and Gemini-2.5-Pro) leave substantial headroom on BMMR-Eval; (ii) reasoning models exhibit discipline bias and outperform LMMs only on specific subjects; (iii) open-source models still trail their proprietary counterparts; and (iv) fine-tuning on BMMR-Train narrows this gap. Additionally, we conduct reasoning-chain analyses using BMMR-Verifier and other in-depth studies, uncovering the challenges LMMs currently face in multidisciplinary reasoning. We will release the data, and we hope our work can offer insights and contributions to the community.
[NLP-128] Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right
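【Quick Read】: This paper addresses the security of language models (LMs) in multilingual settings, particularly for lower- and medium-resourced languages. Existing NLP security research is overwhelmingly English-centric, whereas standard practice in cybersecurity is to anticipate and prepare for worst-case outcomes. The key to its solution is extending existing adversarial attacks to up to 70 languages in order to evaluate the security of monolingual and multilingual models for these languages, identifying the weakest links in model security and informing more secure LM deployment for communities of lower-resourced languages.
Link: https://arxiv.org/abs/2507.03473
Authors: Heather Lent
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Pre-print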
Abstract:Despite mounting evidence that multilinguality can be easily weaponized against language models (LMs), works across NLP Security remain overwhelmingly English-centric. In terms of securing LMs, the NLP norm of “English first” collides with standard procedure in cybersecurity, whereby practitioners are expected to anticipate and prepare for worst-case outcomes. To mitigate worst-case outcomes in NLP Security, researchers must be willing to engage with the weakest links in LM security: lower-resourced languages. Accordingly, this work examines the security of LMs for lower- and medium-resourced languages. We extend existing adversarial attacks for up to 70 languages to evaluate the security of monolingual and multilingual LMs for these languages. Through our analysis, we find that monolingual models are often too small in total number of parameters to ensure sound security, and that while multilinguality is helpful, it does not always guarantee improved security either. Ultimately, these findings highlight important considerations for more secure deployment of LMs, for communities of lower-resourced languages.
[NLP-129] Improving Social Determinants of Health Documentation in French EHRs Using Large Language Models
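【Quick Read】: This paper addresses incomplete or missing documentation of social determinants of health (SDoH) in electronic health records (EHRs), which hampers accurate assessment of disease progression, treatment adherence, and health disparities. The key to its solution is using large language models (LLMs) to extract 13 SDoH categories from French clinical notes, training on annotated social history sections from Nantes University Hospital to identify SDoH categories and their associated values and to extract detailed SDoH information.
Link: https://arxiv.org/abs/2507.03433
Authors: Adrien Bazoge,Pacôme Constant dit Beaufils,Mohammed Hmitouch,Romain Bourcier,Emmanuel Morin,Richard Dufour,Béatrice Daille,Pierre-Antoine Gourraud,Matilde Karakachoff
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: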
Abstract:Social determinants of health (SDoH) significantly influence health outcomes, shaping disease progression, treatment adherence, and health disparities. However, their documentation in structured electronic health records (EHRs) is often incomplete or missing. This study presents an approach based on large language models (LLMs) for extracting 13 SDoH categories from French clinical notes. We trained Flan-T5-Large on annotated social history sections from clinical notes at Nantes University Hospital, France. We evaluated the model at two levels: (i) identification of SDoH categories and associated values, and (ii) extraction of detailed SDoH with associated temporal and quantitative information. The model performance was assessed across four datasets, including two that we publicly release as open resources. The model achieved strong performance for identifying well-documented categories such as living condition, marital status, descendants, job, tobacco, and alcohol use (F1 score 0.80). Performance was lower for categories with limited training data or highly variable expressions, such as employment status, housing, physical activity, income, and education. Our model identified 95.8% of patients with at least one SDoH, compared to 2.8% for ICD-10 codes from structured EHR data. Our error analysis showed that performance limitations were linked to annotation inconsistencies, reliance on English-centric tokenizer, and reduced generalizability due to the model being trained on social history sections only. These results demonstrate the effectiveness of NLP in improving the completeness of real-world SDoH data in a non-English EHR system.
[NLP-130] SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation
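【Quick Read】: This paper addresses the weak semantic representation of autoregressive models when generating semantically equivalent text, aiming to improve their ability to generate high-quality paraphrases without supervision. The key to its solution is using a semantically meaningful text representation as the initial embedding in autoregressive training and generation, which strengthens the model's ability to capture and reproduce semantic information.
Link: https://arxiv.org/abs/2507.03415
Authors: Michał Perełkiewicz,Sławomir Dadas,Rafał Poświata
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: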
Abstract:This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach involves using semantically meaningful text representation as an initial embedding in the autoregressive training and generation processes. The extensive empirical study demonstrates that the SMCLM approach makes autoregressive models capable of learning robust and high-quality paraphrase generation. The proposed method is competitive with the supervised method and achieves state-of-the-art results in unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of autogenerated paraphrase evaluation aspects. Simultaneously, this article highlights the low reliability of the metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.
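[NLP-131] Graph Repairs with Large Language Models: An Empirical Study SIGMOD PODS2025
【Quick Read】: This paper addresses errors in property graphs caused by inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic graph repair methods have limited adaptability, and interactive human-in-the-loop approaches become too costly on large graphs. The key to the proposed approach is leveraging the contextual reasoning of Large Language Models (LLMs) and their access to real-world knowledge to automate graph repair.
Link: https://arxiv.org/abs/2507.03410
Authors: Hrishikesh Terdalkar,Angela Bonifati,Andrea Mauri
Affiliations: Lyon1 University, CNRS LIRIS; Lyon1 University, CNRS LIRIS & IUFFrance
Subjects: Computation and Language (cs.CL); Databases (cs.DB); Emerging Technologies (cs.ET)
Comments: Accepted to the 8th GRADES-NDA 2025 @ SIGMOD/PODS 2025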
Abstract:Property graphs are widely used in domains such as healthcare, finance, and social networks, but they often contain errors due to inconsistencies, missing data, or schema violations. Traditional rule-based and heuristic-driven graph repair methods are limited in their adaptability as they need to be tailored for each dataset. On the other hand, interactive human-in-the-loop approaches may become infeasible when dealing with large graphs, as the cost–both in terms of time and effort–of involving users becomes too high. Recent advancements in Large Language Models (LLMs) present new opportunities for automated graph repair by leveraging contextual reasoning and their access to real-world knowledge. We evaluate the effectiveness of six open-source LLMs in repairing property graphs. We assess repair quality, computational cost, and model-specific performance. Our experiments show that LLMs have the potential to detect and correct errors, with varying degrees of accuracy and efficiency. We discuss the strengths, limitations, and challenges of LLM-driven graph repair and outline future research directions for improving scalability and interpretability.
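[NLP-132] Making Sense of Korean Sentences: A Comprehensive Evaluation of LLMs through KoSEnd Dataset ACL2025
【Quick Read】: This paper examines how well generative AI handles sentence-ending forms in low-resource agglutinative languages such as Korean. The key to its approach is the Korean Sentence Endings (KoSEnd) dataset, used to evaluate how 11 large language models (LLMs) judge the naturalness of Korean sentence endings. It finds that informing models about the possibility of missing sentence endings improves performance, which highlights the impact of explicitly considering specific linguistic features.
Link: https://arxiv.org/abs/2507.03378
Authors: Seunguk Yu,Kyeonghyun Kim,Jungmin Yun,Youngbin Kim
Affiliations: Chung-Ang University
Subjects: Computation and Language (cs.CL)
Comments: ACL 2025 student research workshop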
Abstract:Although LLMs have made significant progress in various languages, there are still concerns about their effectiveness with low-resource agglutinative languages compared to languages such as English. In this study, we focused on Korean, a language known for its complex sentence endings, and evaluated LLMs on this challenging aspect. We introduce the Korean Sentence Endings (KoSEnd) dataset, which includes 3,000 sentences, each annotated for the naturalness of 15 sentence ending forms. These were collected from diverse sources to cover a range of contexts. We evaluated 11 LLMs to assess their understanding of Korean sentence endings, analyzing them based on parameter count and prediction consistency. Notably, we found that informing models about the possibility of missing sentence endings improved performance, highlighting the impact of explicitly considering certain linguistic features.
[NLP-133] WETBench: A Benchmark for Detecting Task-Specific Machine-Generated Text on Wikipedia
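【Quick Read】: This paper addresses the reliability of detecting low-quality machine-generated text (MGT) on Wikipedia, in particular poor generalization to real editing tasks. The key to its solution is WETBench, a multilingual, multi-generator, task-specific benchmark that evaluates MGT detection on three editing tasks grounded in Wikipedia editors' use cases (Paragraph Writing, Summarisation, and Text Style Transfer). This brings evaluation closer to real-world use and shows that training-based detectors outperform zero-shot detectors on task-specific data.
Link: https://arxiv.org/abs/2507.03373
Authors: Gerrit Quaremba,Elizabeth Black,Denny Vrandečić,Elena Simperl
Affiliations: King’s College London; Wikimedia Foundation
Subjects: Computation and Language (cs.CL)
Comments: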
Abstract:Given Wikipedia’s role as a trusted source of high-quality, reliable content, concerns are growing about the proliferation of low-quality machine-generated text (MGT) produced by large language models (LLMs) on its platform. Reliable detection of MGT is therefore essential. However, existing work primarily evaluates MGT detectors on generic generation tasks rather than on tasks more commonly performed by Wikipedia editors. This misalignment can lead to poor generalisability when applied in real-world Wikipedia contexts. We introduce WETBench, a multilingual, multi-generator, and task-specific benchmark for MGT detection. We define three editing tasks, empirically grounded in Wikipedia editors’ perceived use cases for LLM-assisted editing: Paragraph Writing, Summarisation, and Text Style Transfer, which we implement using two new datasets across three languages. For each writing task, we evaluate three prompts, generate MGT across multiple generators using the best-performing prompt, and benchmark diverse detectors. We find that, across settings, training-based detectors achieve an average accuracy of 78%, while zero-shot detectors average 58%. These results show that detectors struggle with MGT in realistic generation scenarios and underscore the importance of evaluating such models on diverse, task-specific data to assess their reliability in editor-driven contexts.
[NLP-134] Backtesting Sentiment Signals for Trading: Evaluating the Viability of Alpha Generation from Sentiment Analysis
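【Quick Read】: This paper addresses a gap in sentiment-driven finance: most studies focus on sentence-level classification and overlook practical application in trading. The key to its solution is building sentiment-based trading strategies from the predictions of three sentiment models (two classification, one regression) applied to news articles on the Dow Jones 30 stocks, then backtesting them to assess whether sentiment information can improve investment returns.
Link: https://arxiv.org/abs/2507.03350
Authors: Elvys Linhares Pontes,Carlos-Emiliano González-Gallardo,Georgeta Bordea,José G. Moreno,Mohamed Ben Jannet,Yuxuan Zhao,Antoine Doucet
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Actes de CORIA-TALN-RJCRI-RECITAL 2025 (Association pour le Traitement Automatique des Langues)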
Abstract:Sentiment analysis, widely used in product reviews, also impacts financial markets by influencing asset prices through microblogs and news articles. Despite research in sentiment-driven finance, many studies focus on sentence-level classification, overlooking its practical application in trading. This study bridges that gap by evaluating sentiment-based trading strategies for generating positive alpha. We conduct a backtesting analysis using sentiment predictions from three models (two classification and one regression) applied to news articles on Dow Jones 30 stocks, comparing them to the benchmark BuyHold strategy. Results show all models produced positive returns, with the regression model achieving the highest return of 50.63% over 28 months, outperforming the benchmark BuyHold strategy. This highlights the potential of sentiment in enhancing investment strategies and financial decision-making.
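The backtesting comparison described above can be sketched as follows. This is a toy illustration, not the paper's setup: the daily returns and sentiment scores are invented, the long/short/flat rule and the `threshold` parameter are assumptions, and transaction costs are ignored.

```python
def positions_from_sentiment(scores, threshold=0.0):
    """Turn the sentiment seen on day t into a position held on day t+1.

    Shifting by one day avoids look-ahead bias. +1 is long, -1 is short,
    0 is flat.
    """
    positions = [0]  # no signal is available before the first day
    for score in scores[:-1]:
        if score > threshold:
            positions.append(1)
        elif score < -threshold:
            positions.append(-1)
        else:
            positions.append(0)
    return positions


def cumulative_return(daily_returns, positions=None):
    """Compound daily returns. With no positions this is buy-and-hold."""
    if positions is None:
        positions = [1] * len(daily_returns)
    if len(positions) != len(daily_returns):
        raise ValueError("positions and returns must have equal length")
    growth = 1.0
    for ret, pos in zip(daily_returns, positions):
        growth *= 1.0 + pos * ret
    return growth - 1.0


# Invented data for one stock over four days.
returns = [0.01, -0.02, 0.03, -0.01]
sentiment = [-0.5, 0.8, -0.3, 0.1]

positions = positions_from_sentiment(sentiment)
strategy = cumulative_return(returns, positions)
buy_and_hold = cumulative_return(returns)
print(strategy - buy_and_hold)  # excess return, a crude stand-in for alpha
```

A real backtest would also account for trading costs, position sizing across the 30 stocks, and risk-adjusted measures rather than raw excess return.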
[NLP-135] SHNU Multilingual Conversational Speech Recognition System for INTERSPEECH 2025 MLC-SLM Challenge INTERSPEECH2025
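【Quick Read】: This paper aims to improve performance in multilingual conversational speech recognition. Its key solution is a unified multilingual automatic speech recognition (ASR) framework that combines parallel speech encoders with a large language model (LLM). The framework uses two pre-trained encoders, Whisper-large-v3 and mHuBERT-147, concatenates their output embeddings, and feeds them to the LLM to leverage complementary acoustic and linguistic knowledge. It also adopts a tri-stage training strategy that jointly updates the low-rank adaptation modules and projector parameters of the speech encoders and the LLM, and adds a language-aware prompt at the LLM input to enhance language-specific text generation. The system reaches a character/word error rate (CER/WER) of 11.76% on the blind evaluation set, outperforming the official baseline by 8.41 absolute CER/WER.
Link: https://arxiv.org/abs/2507.03343
Authors: Yuxiang Mei,Yuang Zheng,Dongxing Xu,Yanhua Long
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted by Interspeech 2025 MLC-SLM workshop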
Abstract:This paper describes SHNU multilingual conversational speech recognition system (SHNU-mASR, team name-“maybe”), submitted to Track 1 of the INTERSPEECH 2025 MLC-SLM Challenge. Our system integrates a parallel-speech-encoder architecture with a large language model (LLM) to form a unified multilingual ASR framework. The parallel-speech-encoder consists of two pre-trained encoders, the Whisper-large-v3 encoder and mHuBERT-147 encoder. Their output embeddings are concatenated and fed into the LLM, enabling the model to leverage complementary acoustic and linguistic knowledge and achieve competitive performance. Moreover, we adopt a tri-stage training strategy to jointly update the low-rank adaptation modules and projector parameters of both the speech encoders and the LLM. In addition, we incorporate an additional language-aware prompt at the LLM input to enhance language-specific text generation. The SHNU-mASR system achieves an overall character/word error rate (CER/WER) of 11.76% on the blind evaluation set of the challenge, outperforming the official MLC-SLM baseline by 8.41 absolute CER/WER, without increasing the baseline training data.
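The CER/WER figure quoted above is an edit-distance ratio. The sketch below shows the usual definition (substitutions, deletions, and insertions divided by reference length), computed over words for WER and over characters for CER. The challenge's official scoring may normalise text differently, so this is an assumption about the general metric and not a reproduction of that script; the example sentences are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """WER (unit='word') or CER (unit='char', spaces ignored)."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


wer = error_rate("the cat sat on the mat", "the cat sit on mat")
cer = error_rate("今天天气", "今天天汽", unit="char")
print(wer, cer)
```

Character-level scoring is typically used for languages written without spaces between words, which is why the challenge reports a combined CER/WER.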
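[NLP-136] Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky
【Quick Read】: This paper addresses tool ambiguity and underspecified arguments when large language models (LLMs) invoke enterprise APIs, which cause models to falter when tools are near-duplicates or required arguments are missing. The key to its solution is DiaFORGE (Dialogue Framework for Organic Response Generation Evaluation), a disambiguation-centric, three-stage pipeline: it first synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools; it then performs supervised fine-tuning of open-source models with reasoning traces; and it finally evaluates real-world readiness with a dynamic suite. The method substantially raises tool-invocation success.
Link: https://arxiv.org/abs/2507.03336
Authors: Ashutosh Hathidara,Julien Yu,Sebastian Schreiber
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: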
Abstract:Large language models (LLMs) are increasingly tasked with invoking enterprise APIs, yet they routinely falter when near-duplicate tools vie for the same user intent or when required arguments are left underspecified. We introduce DiaFORGE (Dialogue Framework for Organic Response Generation Evaluation), a disambiguation-centric, three-stage pipeline that (i) synthesizes persona-driven, multi-turn dialogues in which the assistant must distinguish among highly similar tools, (ii) performs supervised fine-tuning of open-source models with reasoning traces across 3B - 70B parameters, and (iii) evaluates real-world readiness via a dynamic suite that redeploys each model in a live agentic loop and reports end-to-end goal completion alongside conventional static metrics. On our dynamic benchmark DiaBENCH, models trained with DiaFORGE raise tool-invocation success by 27 pp over GPT-4o and by 49 pp over Claude-3.5-Sonnet, both under optimized prompting. To spur further research, we release an open corpus of 5000 production-grade enterprise API specifications paired with rigorously validated, disambiguation-focused dialogues, offering a practical blueprint for building reliable, enterprise-ready tool-calling agents.
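[NLP-137] Read Quietly Think Aloud: Decoupling Comprehension and Reasoning in LLMs
【Quick Read】: This paper addresses the absence in large language models (LLMs) of a human-like internal "reading" or deliberation phase before generating text; that is, models typically have no distinct stage for comprehending context and forming thoughts before they generate. The key to its solution is a "silent reading" mechanism that gives the model an initial contextual prompt or "reading space" so that it can process the input more deeply before producing its final output. The paper also proposes a "reading buddy" architecture in which an auxiliary component silently processes the input and supplies refined contextual insights to the generation model, improving comprehension and generation quality.
Link: https://arxiv.org/abs/2507.03327
Authors: Yuanxin Wang,Ganesh Venkatesh
Affiliations: Cerebras
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under submission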
Abstract:Large Language Models (LLMs) have demonstrated remarkable proficiency in understanding text and generating high-quality responses. However, a critical distinction from human cognition is their typical lack of a distinct internal “reading” or deliberation phase before “speaking” (i.e., generating text). Humans often engage in silent reading to comprehend context and formulate thoughts prior to articulation. This paper investigates methods to imbue LLMs with a similar capacity for internal processing. We introduce and evaluate techniques that encourage LLMs to “read silently.” Our findings indicate that even a straightforward approach, such as providing the model with an initial contextual prompt or “reading space” before it begins predicting subsequent tokens for the final output, can yield significant performance improvements. We further enhance this concept by developing a “reading buddy” architecture, where an auxiliary component silently processes the input and provides refined contextual insights to the primary generation model. These approaches aim to foster deeper understanding from LLMs so that they can produce better reasoned responses, moving them one step closer to more human-like text processing. Our results indicate that these simple techniques can provide surprisingly strong impact on accuracy with multiple point accuracy boost.
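[NLP-138] GRAFT: A Graph-based Flow-aware Agentic Framework for Document-level Machine Translation
【Quick Read】: This paper addresses the difficulty document-level machine translation (DocMT) has in capturing discourse-level phenomena. Existing methods rely on heuristic rules to segment documents, which rarely matches the true discourse structure, or they fail to maintain consistency throughout the document during translation. The key to its solution is the Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a graph-based DocMT framework that uses Large Language Model (LLM) agents and integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse-aware translation, yielding more coherent and contextually accurate translations.
Link: https://arxiv.org/abs/2507.03311
Authors: Himanshu Dutta,Sunny Manchanda,Prakhar Bapat,Meva Ram Gurjar,Pushpak Bhattacharyya
Affiliations: Indian Institute of Technology Bombay, India; DYSL-AI, DRDO, India
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: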
Abstract:Document level Machine Translation (DocMT) approaches often struggle with effectively capturing discourse level phenomena. Existing approaches rely on heuristic rules to segment documents into discourse units, which rarely align with the true discourse structure required for accurate translation. Otherwise, they fail to maintain consistency throughout the document during translation. To address these challenges, we propose Graph Augmented Agentic Framework for Document Level Translation (GRAFT), a novel graph based DocMT system that leverages Large Language Model (LLM) agents for document translation. Our approach integrates segmentation, directed acyclic graph (DAG) based dependency modelling, and discourse aware translation into a cohesive framework. Experiments conducted across eight translation directions and six diverse domains demonstrate that GRAFT achieves significant performance gains over state of the art DocMT systems. Specifically, GRAFT delivers an average improvement of 2.8 d-BLEU on the TED test sets from IWSLT2017 over strong baselines and 2.3 d-BLEU for domain specific translation from English to Chinese. Moreover, our analyses highlight the consistent ability of GRAFT to address discourse level phenomena, yielding coherent and contextually accurate translations.
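One way to picture DAG-based dependency modelling is to translate segments in topological order, so that every segment sees the translations of the segments it depends on. The sketch below is a minimal illustration under that assumption and is not GRAFT's implementation: `translate_in_dependency_order`, the segment ids, the dependency map, and the stub translator are all hypothetical, and a real system would call an LLM where the stub is.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def translate_in_dependency_order(segments, depends_on, translate_segment):
    """Translate each segment after all segments it depends on.

    segments: dict of segment id -> source text.
    depends_on: dict of segment id -> ids it depends on (must form a DAG).
    translate_segment: callable(text, context) -> translation, where
        context maps dependency ids to their finished translations.
    """
    graph = {seg_id: set(depends_on.get(seg_id, ())) for seg_id in segments}
    translations = {}
    # static_order() yields dependencies first and raises CycleError on cycles.
    for seg_id in TopologicalSorter(graph).static_order():
        context = {dep: translations[dep] for dep in graph[seg_id]}
        translations[seg_id] = translate_segment(segments[seg_id], context)
    return translations


calls = []


def stub_translator(text, context):
    """Stand-in for an LLM call: records the call and tags the output."""
    calls.append(text)
    return f"[{text} | context: {', '.join(sorted(context)) or 'none'}]"


segments = {"s1": "Intro", "s2": "Method", "s3": "Results"}
depends_on = {"s2": ["s1"], "s3": ["s1", "s2"]}
translations = translate_in_dependency_order(segments, depends_on, stub_translator)
print(translations["s3"])
```

Passing earlier translations as context is what lets terminology and references stay consistent across segments.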
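[NLP-139] LTLCrit: A Temporal Logic-based LLM Critic for Safe and Efficient Embodied Agents
[Quick read]: This paper addresses the way errors accumulate when large language models (LLMs) perform long-term planning, which leads to unsafe or inefficient behavior and limits their use in general-purpose settings. The key to the solution is a modular actor-critic architecture in which an LLM actor is guided by LTLCrit, a trajectory-level critic that communicates via linear temporal logic (LTL). The architecture combines the reasoning ability of language models with the guarantees of formal logic: the critic analyzes full trajectories and proposes new LTL constraints that prevent future unsafe or inefficient behavior, improving the reliability and efficiency of long-term planning.
Link: https://arxiv.org/abs/2507.03293
Authors: Anand Gokhale,Vaibhav Srivastava,Francesco Bullo
Affiliations: University of California at Santa Barbara; Michigan State University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: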
Abstract:Large language models (LLMs) have demonstrated promise in reasoning tasks and general decision-making in static environments. In long-term planning tasks, however, errors tend to accumulate, often leading to unsafe or inefficient behavior, limiting their use in general-purpose settings. We propose a modular actor-critic architecture in which an LLM actor is guided by LTLCrit, a trajectory-level LLM critic that communicates via linear temporal logic (LTL). Our setup combines the reasoning strengths of language models with the guarantees of formal logic. The actor selects high-level actions from natural language observations, while the critic analyzes full trajectories and proposes new LTL constraints that shield the actor from future unsafe or inefficient behavior. The architecture supports both fixed, hand-specified safety constraints and adaptive, learned soft constraints that promote long-term efficiency. Our architecture is model-agnostic: any LLM-based planner can serve as the actor, and LTLCrit serves as a logic-generating wrapper. We formalize planning as graph traversal under symbolic constraints, allowing LTLCrit to analyze failed or suboptimal trajectories and generate new temporal logic rules that improve future behavior. We evaluate our system on the Minecraft diamond-mining benchmark, achieving 100% completion rates and improving efficiency compared to baseline LLM planners. Our results suggest that enabling LLMs to supervise each other through logic is a powerful and flexible paradigm for safe, generalizable decision making.
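To make the idea of constraints that shield the actor more concrete, here is a minimal Python sketch. It assumes a single rule shape, G(condition -> not action), read as "at every step, if the condition holds, the action is not taken". The rule, the state fields, and the action names are hypothetical; the LTL fragment that LTLCrit actually uses is not given in the abstract.

```python
def make_constraint(condition, forbidden_action):
    """Represent the safety rule G(condition -> not forbidden_action)."""
    return {"condition": condition, "forbidden": forbidden_action}

def violates(constraint, state, action):
    # A single step violates the rule when the condition holds in the
    # current state and the forbidden action is taken.
    return constraint["condition"](state) and action == constraint["forbidden"]

def trajectory_satisfies(constraint, trajectory):
    # "Globally" means the rule must hold at every step of the trajectory.
    return not any(violates(constraint, s, a) for s, a in trajectory)

def shield(constraints, state, proposed_actions):
    # Keep only the actions that break none of the constraints.
    return [a for a in proposed_actions
            if not any(violates(c, state, a) for c in constraints)]

no_mining_without_pickaxe = make_constraint(
    lambda state: not state["has_pickaxe"], "mine_diamond")

state = {"has_pickaxe": False}
allowed = shield([no_mining_without_pickaxe], state,
                 ["craft_pickaxe", "mine_diamond"])
bad_trajectory = [({"has_pickaxe": False}, "mine_diamond")]
```

In this toy setting the shield leaves only `craft_pickaxe` available, and `trajectory_satisfies` flags `bad_trajectory` as a violation, which is the kind of trajectory a critic could turn into a new rule.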
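[NLP-140] GDGB: A Benchmark for Generative Dynamic Text-Attributed Graph Learning
[Quick read]: This paper addresses the poor textual quality of existing dynamic text-attributed graph (DyTAG) datasets and the lack of standardized task formulations and evaluation protocols for DyTAG generation. The key to the solution is the Generative DyTAG Benchmark (GDGB), which comprises eight carefully curated DyTAG datasets with high-quality textual features for both nodes and edges and defines two new generation tasks, Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG), together with multifaceted metrics that assess the structural, temporal, and textual quality of generated DyTAGs. The paper also proposes GAG-General, an LLM-based multi-agent generative framework for reproducible and robust benchmarking of DyTAG generation.
Link: https://arxiv.org/abs/2507.03267
Authors: Jie Peng,Jiarui Ji,Runlin Lei,Zhewei Wei,Yongchao Liu,Chuntao Hong
Affiliations: Renmin University of China; Ant Group
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: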
Abstract:Dynamic Text-Attributed Graphs (DyTAGs), which intricately integrate structural, temporal, and textual attributes, are crucial for modeling complex real-world systems. However, most of the existing DyTAG datasets exhibit poor textual quality, which severely limits their utility for DyTAG generation tasks requiring semantically rich inputs. Additionally, prior work mainly focuses on discriminative tasks on DyTAGs, resulting in a lack of standardized task formulations and evaluation protocols tailored for DyTAG generation. To address these critical issues, we propose Generative DyTAG Benchmark (GDGB), which comprises eight meticulously curated DyTAG datasets with high-quality textual features for both nodes and edges, overcoming limitations of prior datasets. Building on GDGB, we define two novel DyTAG generation tasks: Transductive Dynamic Graph Generation (TDGG) and Inductive Dynamic Graph Generation (IDGG). TDGG transductively generates a target DyTAG based on the given source and destination node sets, while the more challenging IDGG introduces new node generation to inductively model the dynamic expansion of real-world graph data. To enable holistic evaluation, we design multifaceted metrics that assess the structural, temporal, and textual quality of the generated DyTAGs. We further propose GAG-General, an LLM-based multi-agent generative framework tailored for reproducible and robust benchmarking of DyTAG generation. Experimental results demonstrate that GDGB enables rigorous evaluation of TDGG and IDGG, with key insights revealing the critical interplay of structural and textual features in DyTAG generation. These findings establish GDGB as a foundational resource for advancing generative DyTAG research and unlocking further practical applications in DyTAG generation. GDGB datasets, source codes, and leaderboards are available at this https URL.
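[NLP-141] RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs
[Quick read]: This paper addresses the challenge of improving pre-training data quality at scale, in particular how to refine data efficiently and at fine granularity while preserving the diversity and naturalness of the raw text. The key to the solution is RefineX, a framework that performs large-scale, precise refinement through programmatic editing tasks. Its core idea is to distill high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs, which are then used to train an efficient and reliable refine model that systematically improves every instance in a large corpus.
Link: https://arxiv.org/abs/2507.03253
Authors: Baolong Bi,Shenghua Liu,Xingzhang Ren,Dayiheng Liu,Junyang Lin,Yiwei Wang,Lingrui Mei,Junfeng Fang,Jiafeng Guo,Xueqi Cheng
Affiliations: Institute of Computing Technology; Chinese Academy of Sciences; Alibaba Group; University of California Merced; National University of Singapore
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: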
Abstract:The foundational capabilities of large language models (LLMs) are deeply influenced by the quality of their pre-training corpora. However, enhancing data quality at scale remains a significant challenge, primarily due to the trade-off between refinement effectiveness and processing efficiency. While rule-based filtering remains the dominant paradigm, it typically operates at the document level and lacks the granularity needed to refine specific content within documents. Inspired by emerging work such as ProX, we propose RefineX, a novel framework for large-scale, surgical refinement of pre-training data through programmatic editing tasks. RefineX enables efficient and fine-grained data refinement while reliably preserving the diversity and naturalness of raw text. The core strength of RefineX lies in distilling high-quality, expert-guided end-to-end refinement results into minimal edit-based deletion programs. This high-precision distillation pipeline is used to train an efficient and reliable refine model that can systematically improve every instance in the corpus at scale. We evaluate RefineX across from-scratch pre-training at multiple model scales and find that it consistently outperforms models trained on raw, filtered, or alternatively refined data across diverse downstream tasks. On the 750M model, RefineX yields 2.6%-7.2% average gains on lighteval tasks, and achieves comparable performance using significantly fewer training tokens. Further analysis shows that RefineX reliably enhances text quality with both high efficiency and precision, outperforming prior approaches such as end-to-end generation and Prox-C. These results position RefineX as a scalable, effective, and reliable solution for optimizing pre-training data in modern LLM pipelines.
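The notion of an edit-based deletion program can be sketched in a few lines of Python. The operation names `remove_lines` and `remove_str` and the tuple format are hypothetical choices for illustration, not the program syntax used by RefineX. The point of the sketch is only that a deletion-only program can clean a document without rewriting any of the text that survives.

```python
def apply_deletion_program(text, program):
    """Apply a list of deletion operations to a document.

    Each operation is a hypothetical (op, argument) pair:
      ("remove_lines", (start, end))  drops lines start..end of the original
                                      text, inclusive and 0-indexed
      ("remove_str", substring)       drops every occurrence of the substring
    Only deletions are allowed, so surviving text is never rewritten.
    """
    lines = text.split("\n")
    drop = set()
    for op, arg in program:
        if op == "remove_lines":
            start, end = arg
            drop.update(range(start, end + 1))
    kept = [line for i, line in enumerate(lines) if i not in drop]
    for op, arg in program:
        if op == "remove_str":
            kept = [line.replace(arg, "") for line in kept]
    return "\n".join(kept)

raw = ("Home | Login | Share\n"
       "LLMs depend on data quality. Click here!\n"
       "Copyright 2025")
program = [("remove_lines", (0, 0)), ("remove_lines", (2, 2)),
           ("remove_str", " Click here!")]
refined = apply_deletion_program(raw, program)
```

Restricting the program to deletions is what keeps the raw wording intact, which matches the abstract's emphasis on preserving the naturalness of the original text.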
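[NLP-142] KinyaColBERT: A Lexically Grounded Retrieval Model for Low-Resource Retrieval-Augmented Generation
[Quick read]: This paper addresses the difficulty of achieving high retrieval accuracy in retrieval-augmented generation (RAG) systems for low-resource languages, where pre-trained language models offer limited language coverage and inadequate sub-word tokenization. The key to the solution is a new retriever model, KinyaColBERT, which combines late word-level interactions between queries and documents with morphology-based tokenization and two-tier transformer encoding, producing lexically grounded, fine-grained contextual embeddings.
Link: https://arxiv.org/abs/2507.03241
Authors: Antoine Nzeyimana,Andre Niyongabo Rubungo
Affiliations: University of Massachusetts Amherst; Princeton University
Categories: Computation and Language (cs.CL)
Comments: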
Abstract:The recent mainstream adoption of large language model (LLM) technology is enabling novel applications in the form of chatbots and virtual assistants across many domains. With the aim of grounding LLMs in trusted domains and avoiding the problem of hallucinations, retrieval-augmented generation (RAG) has emerged as a viable solution. In order to deploy sustainable RAG systems in low-resource settings, achieving high retrieval accuracy is not only a usability requirement but also a cost-saving strategy. Through empirical evaluations on a Kinyarwanda-language dataset, we find that the most limiting factors in achieving high retrieval accuracy are limited language coverage and inadequate sub-word tokenization in pre-trained language models. We propose a new retriever model, KinyaColBERT, which integrates two key concepts: late word-level interactions between queries and documents, and a morphology-based tokenization coupled with two-tier transformer encoding. This methodology results in lexically grounded contextual embeddings that are both fine-grained and self-contained. Our evaluation results indicate that KinyaColBERT outperforms strong baselines and leading commercial text embedding APIs on a Kinyarwanda agricultural retrieval benchmark. By adopting this retrieval strategy, we believe that practitioners in other low-resource settings can not only achieve reliable RAG systems but also deploy solutions that are more cost-effective.
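Late interaction between queries and documents is commonly scored with the ColBERT-style MaxSim operator, and the sketch below shows that operator in plain Python. It assumes one embedding per word and uses toy two-dimensional vectors; whether KinyaColBERT uses exactly this scoring function is an assumption, since the abstract only states that the interactions are late and word-level.

```python
import math

def normalize(vec):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim: for each query embedding take the best-matching
    document embedding (cosine similarity), then sum over the query."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)

query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # has a match for both query words
doc_b = [[1.0, 0.0], [3.0, 0.0]]   # has a match for the first word only
score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Because every query word keeps its own embedding until scoring time, a document that covers both query words (`doc_a`) outranks one that covers a single word (`doc_b`).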
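[NLP-143] A Lie-algebraic perspective on Tree-Adjoining Grammars
[Quick read]: This paper addresses the formal representation and theoretical analysis of Tree-Adjoining Grammar (TAG) by providing a new mathematical implementation. The key to the solution is to use two combinatorial definitions of graphs under which the adjoining operation is a pre-Lie operation and therefore gives rise to a Lie algebra. This perspective shows that some properties of the TAG system, such as null-adjoining constraints and feature TAG, can be captured without positing them as additional components of the system.
Link: https://arxiv.org/abs/2507.03234
Authors: Isabella Senturia,Elizabeth Xiao,Matilde Marcolli
Affiliations: Yale University; California Institute of Technology
Categories: Computation and Language (cs.CL); Quantum Algebra (math.QA); Rings and Algebras (math.RA)
Comments: 14 pages, 7 figures. To appear in the proceedings of the 18th Meeting on the Mathematics of Language (MOL 2025)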
Abstract:We provide a novel mathematical implementation of tree-adjoining grammars using two combinatorial definitions of graphs. With this lens, we demonstrate that the adjoining operation defines a pre-Lie operation and subsequently forms a Lie algebra. We demonstrate the utility of this perspective by showing how one of our mathematical formulations of TAG captures properties of the TAG system without needing to posit them as additional components of the system, such as null-adjoining constraints and feature TAG.
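[NLP-144] How Much Content Do LLMs Generate That Induces Cognitive Bias in Users? AACL2025
[Quick read]: This paper addresses the risk that generative AI introduces and passes on bias in use, specifically that large language models (LLMs) may expose users to biased content in summarization and news fact-checking tasks. The key to the solution is to evaluate three LLM families on consistency with context, hallucination, and primacy bias, and to test 18 distinct mitigation methods, finding that targeted interventions can be effective. The study highlights the need for robust technical safeguards and user-centered interventions in high-stakes domains.
Link: https://arxiv.org/abs/2507.03194
Authors: Abeer Alessa,Akshaya Lakshminarasimhan,Param Somane,Julian Skirzynski,Julian McAuley,Jessica Echterhoff
Affiliations: UC San Diego
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 figures. To be submitted to AACL 2025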
Abstract:Large language models (LLMs) are increasingly integrated into applications ranging from review summarization to medical diagnosis support, where they affect human decisions. Even though LLMs perform well in many tasks, they may also inherit societal or cognitive biases, which can inadvertently transfer to humans. We investigate when and how LLMs expose users to biased content and quantify its severity. Specifically, we assess three LLM families in summarization and news fact-checking tasks, evaluating how much LLMs stay consistent with their context and/or hallucinate. Our findings show that LLMs expose users to content that changes the sentiment of the context in 21.86% of the cases, hallucinates on post-knowledge-cutoff data questions in 57.33% of the cases, and primacy bias in 5.94% of the cases. We evaluate 18 distinct mitigation methods across three LLM families and find that targeted interventions can be effective. Given the prevalent use of LLMs in high-stakes domains, such as healthcare or legal analysis, our results highlight the need for robust technical safeguards and for developing user-centered interventions that address LLM limitations.
[NLP-145] Adversarial Manipulation of Reasoning Models using Internal Representations ICML2025
[Quick read]: This paper addresses the vulnerability of reasoning models to jailbreak attacks, in particular how chain-of-thought (CoT) generation affects whether a model refuses or complies. The key finding is a "caution" direction in activation space that, during CoT generation, predicts whether the model will refuse or comply. Ablating this direction markedly increases harmful compliance, which amounts to an attack on the model.
Link: https://arxiv.org/abs/2507.03167
Authors: Kureha Yamaguchi,Benjamin Etheridge,Andy Arditi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models (R2FM). 20 pages, 12 figures
Abstract:Reasoning models generate chain-of-thought (CoT) tokens before their final output, but how this affects their vulnerability to jailbreak attacks remains unclear. While traditional language models make refusal decisions at the prompt-response boundary, we find evidence that DeepSeek-R1-Distill-Llama-8B makes these decisions within its CoT generation. We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply – termed the “caution” direction because it corresponds to cautious reasoning patterns in the generated text. Ablating this direction from model activations increases harmful compliance, effectively jailbreaking the model. We additionally show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models. Code available at this https URL.
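Ablating a linear direction from an activation is, in its simplest form, an orthogonal projection: x' = x - (x · d̂) d̂, where d̂ is the unit vector along the direction. The Python sketch below shows only this arithmetic on toy three-dimensional vectors. The vector values are made up, and how the authors estimate the direction or at which layers they intervene is not described in the abstract.

```python
import math

def unit(vec):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(activation, direction):
    """Remove the component of `activation` along `direction`:
    x' = x - (x . d_hat) * d_hat."""
    d_hat = unit(direction)
    coeff = dot(activation, d_hat)
    return [x - coeff * d for x, d in zip(activation, d_hat)]

direction = [0.0, 3.0, 4.0]    # toy stand-in, not a real model direction
activation = [1.0, 2.0, 2.0]   # toy stand-in for one activation vector
ablated = ablate_direction(activation, direction)
residual = dot(ablated, unit(direction))
```

After the projection the activation has no remaining component along the chosen direction (`residual` is zero up to floating-point error), while components orthogonal to it are unchanged.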
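[NLP-146] MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive Multilingual and Multimodal Educational Tasks
[Quick read]: This paper addresses the applicability and limitations of large language models (LLMs) in advanced computer science (CS) education and competition settings. The key to the solution is a bilingual (English-Romanian), multimodal (text and image) dataset of multiple-choice questions drawn from a high-level computer science competition, in which some problems are easier to solve by reasoning on paper while others are solved more efficiently by writing code. A systematic evaluation of state-of-the-art LLMs on this dataset reveals their performance on theoretical programming tasks and the influence of language choice (English vs. Romanian), informing future educational practices and policies.
Link: https://arxiv.org/abs/2507.03162
Authors: Dumitran Adrian Marius,Theodor-Pierre Moroianu,Buca Mihnea-Vicentiu
Affiliations: University of Bucharest; Faculty of Mathematics and Computer Science
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages (9 paper, 2 references, 3 annexes). Accepted for BEA 2025!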
Abstract:The rapid advancement of Large Language Models (LLMs) has transformed various domains, particularly computer science (CS) education. These models exhibit remarkable capabilities in code-related tasks and problem-solving, raising questions about their potential and limitations in advanced CS contexts. This study presents a novel bilingual (English-Romanian) multimodal (text and image) dataset of multiple-choice questions derived from a high-level computer science competition. A particularity of our dataset is that the problems are conceived such that some of them are easier solved using reasoning on paper, while for others writing code is more efficient. We systematically evaluate State of The Art LLMs on this dataset, analyzing their performance on theoretical programming tasks. Our findings reveal the strengths and limitations of current LLMs, including the influence of language choice (English vs. Romanian), providing insights into their applicability in CS education and competition settings. We also address critical ethical considerations surrounding educational integrity and the fairness of assessments in the context of LLM usage. These discussions aim to inform future educational practices and policies. To support further research, our dataset will be made publicly available in both English and Romanian. Additionally, we release an educational application tailored for Romanian students, enabling them to self-assess using the dataset in an interactive and practice-oriented environment.
[NLP-147] Expert-level validation of AI-generated medical text with scalable language models
[Quick read]: This paper addresses the evaluation of the accuracy and safety of medical text generated by language models (LMs) in clinical environments. Current evaluation relies on manual physician review, which is costly, and expert-composed reference outputs are often unavailable. The key to the solution is MedVAL, a self-supervised framework that uses synthetic data to train evaluator LMs to judge whether LM-generated medical outputs are factually consistent with their inputs, without requiring physician labels or reference outputs. Validated on the MedVAL-Bench dataset, the method significantly improves alignment with physicians across a range of medical tasks.
Link: https://arxiv.org/abs/2507.03152
Authors: Asad Aali,Vasiliki Bikia,Maya Varma,Nicole Chiou,Sophie Ostmeier,Arnav Singhvi,Magdalini Paschali,Ashwin Kumar,Andrew Johnston,Karimar Amador-Martinez,Eduardo Juan Perez Guerrero,Paola Naovi Cruz Rivera,Sergios Gatidis,Christian Bluethgen,Eduardo Pontes Reis,Eddy D. Zandee van Rilland,Poonam Laxmappa Hosamani,Kevin R Keet,Minjoung Go,Evelyn Ling,David B. Larson,Curtis Langlotz,Roxana Daneshjou,Jason Hom,Sanmi Koyejo,Emily Alsentzer,Akshay S. Chaudhari
Affiliations: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase ( this https URL ), 2) MedVAL-Bench ( this https URL ), and 3) MedVAL-4B ( this https URL ), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
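[NLP-148] DeepGesture: A conversational gesture synthesis system based on emotions and semantics WWW ATC
[Quick read]: This paper addresses the generation of character movements that correspond naturally to text or speech input, which is the current bottleneck in creating digital humans. The key to the solution is DeepGesture, a diffusion-based framework that generates expressive co-speech gestures conditioned on multimodal signals (text, speech, emotion, and seed motion). Its main innovations are fast text transcriptions used as semantic conditioning, emotion-guided classifier-free diffusion for gesture generation, and a lightweight Transformer backbone that fuses features from heterogeneous modalities, improving the semantic alignment and emotional expressiveness of the generated gestures.
Link: https://arxiv.org/abs/2507.03147
Authors: Thanh Hoang-Minh
Affiliations: OpenHuman.ai; VNUHCM - University of Science
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Video Demo: this https URL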
Abstract:Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. A lightweight Transformer backbone combines full self-attention and cross-local attention for effective feature fusion of heterogeneous modalities. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness, outperforming baselines on Mean Opinion Score and Frechet Gesture Distance metrics. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices, marking a step forward toward fully multimodal, emotionally aware digital humans.
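[NLP-149] From Measurement to Mitigation: Exploring the Transferability of Debiasing Approaches to Gender Bias in Maltese Language Models
[Quick read]: This paper addresses gender bias in language models for low-resource languages, specifically the social biases exhibited by Maltese language models. The core of the study is to evaluate debiasing methods developed for English and transfer them to the Maltese models BERTu and mBERTu, adapting and applying techniques such as Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias to mitigate gender bias. The study also creates evaluation datasets that provide a basis for future work on gender bias in Maltese.
Link: https://arxiv.org/abs/2507.03142
Authors: Melanie Galea,Claudia Borg
Affiliations: University of Malta
Categories: Computation and Language (cs.CL)
Comments: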
Abstract:The advancement of Large Language Models (LLMs) has transformed Natural Language Processing (NLP), enabling performance across diverse tasks with little task-specific training. However, LLMs remain susceptible to social biases, particularly reflecting harmful stereotypes from training data, which can disproportionately affect marginalised communities. We measure gender bias in Maltese LMs, arguing that such bias is harmful as it reinforces societal stereotypes and fails to account for gender diversity, which is especially problematic in gendered, low-resource languages. While bias evaluation and mitigation efforts have progressed for English-centric models, research on low-resourced and morphologically rich languages remains limited. This research investigates the transferability of debiasing methods to Maltese language models, focusing on BERTu and mBERTu, BERT-based monolingual and multilingual models respectively. Bias measurement and mitigation techniques from English are adapted to Maltese, using benchmarks such as CrowS-Pairs and SEAT, alongside debiasing methods Counterfactual Data Augmentation, Dropout Regularization, Auto-Debias, and GuiDebias. We also contribute to future work in the study of gender bias in Maltese by creating evaluation datasets. Our findings highlight the challenges of applying existing bias mitigation methods to linguistically complex languages, underscoring the need for more inclusive approaches in the development of multilingual NLP.
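[NLP-150] ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models
[Quick read]: This paper addresses the tendency of large language models (LLMs) to fabricate unreliable responses when faced with problems they cannot solve, which undermines their reliability. The key to the solution is the ReliableMath dataset, which combines open-source solvable problems with high-quality unsolvable math problems synthesized through a construction workflow with human evaluation, enabling systematic evaluation of LLM reliability on both solvable and unsolvable problems. The study also proposes an alignment strategy that improves the reliability of small LLMs on in-domain and out-of-domain tasks.
Link: https://arxiv.org/abs/2507.03133
Authors: Boyang Xue,Qi Zhu,Rui Wang,Sheng Wang,Hongru Wang,Fei Mi,Yasheng Wang,Lifeng Shang,Qun Liu,Kam-Fai Wong
Affiliations: The Chinese University of Hong Kong; Huawei Noah’s Ark Lab; The University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: Under review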
Abstract:Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining the reliability. Prior studies of LLM reliability have primarily focused on knowledge tasks to identify unanswerable questions, while mathematical reasoning tasks have remained unexplored due to the dearth of unsolvable math problems. To systematically investigate LLM reliability in mathematical reasoning tasks, we formulate the reliability evaluation for both solvable and unsolvable problems. We then develop a ReliableMath dataset which incorporates open-source solvable problems and high-quality unsolvable problems synthesized by our proposed construction workflow with human evaluations. Experiments are conducted on various LLMs with several key findings uncovered. LLMs fail to directly identify unsolvable problems and always generate fabricated responses. When instructing LLMs to indicate unsolvability using a reliable prompt, the reliability of larger-sized LLMs remains on solvable problems, but notably improves on unsolvable problems yet still falls short of solvable problems. However, small LLMs rarely show any progress despite employing reliable prompts. Therefore, we further propose an alignment strategy to enhance small LLMs’ reliability, which can significantly improve LLM reliability performances on both in-domain and out-of-domain tasks.
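[NLP-151] Towards a Psychoanalytic Perspective on VLM Behaviour: A First-step Interpretation with Intriguing Observations
[Quick read]: This paper addresses hallucination in Vision-Language Models (VLMs), that is, responses that are factually incorrect or logically inconsistent. Existing work mostly explains hallucination through technical limitations or sycophancy bias and overlooks the possibility that it mirrors cognitive biases known from human psychology. The key to the solution is a psychological taxonomy that categorizes VLM hallucination behaviours into sycophancy, logical inconsistency, and a newly identified authority bias, together with AIpsych, a scalable benchmark for systematically analyzing model behaviour. Using this benchmark, the study shows how model architecture and parameter size influence response patterns, offering a new psychological perspective on hallucination.
Link: https://arxiv.org/abs/2507.03123
Authors: Xiangrui Liu,Man Luo,Agneet Chatterjee,Hua Wei,Yezhou Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: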
Abstract:Hallucination is a long-standing problem that has been actively investigated in Vision-Language Models (VLMs). Existing research commonly attributes hallucinations to technical limitations or sycophancy bias, where the latter means the models tend to generate incorrect answers to align with user expectations. However, these explanations primarily focus on technical or externally driven factors, may have neglected the possibility that hallucination behaviours might mirror cognitive biases observed in human psychology. In this work, we introduce a psychological taxonomy, categorizing VLMs’ hallucination behaviours, including sycophancy, logical inconsistency, and a newly identified VLMs behaviour: authority bias. To systematically analyze these behaviours, we design AIpsych, a scalable benchmark that reveals psychological tendencies in model response patterns. Leveraging this benchmark, we investigate how variations in model architecture and parameter size influence model behaviour when responding to strategically manipulated questions. Our experiments reveal that as model size increases, VLMs exhibit stronger sycophantic tendencies but reduced authority bias, suggesting increasing competence but a potential erosion of response integrity. A human subject study further validates our hypotheses and highlights key behavioural differences between VLMs and human respondents. This work suggests a new perspective for understanding hallucination in VLMs and highlights the importance of integrating psychological principles into model this http URL benchmark is available at this https URL.
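[NLP-152] Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings
[Quick read]: This paper addresses multi-label ICD code classification in distributed healthcare settings, aiming for a privacy-preserving and deployment-efficient solution for clinical natural language processing (NLP). The key to the solution is a lightweight and scalable pipeline that combines frozen text embeddings with simple multilayer perceptron (MLP) classifiers, avoiding reliance on large language models and thereby lowering compute and storage requirements while maintaining performance.
Link: https://arxiv.org/abs/2507.03122
Authors: Binbin Xu,Gérard Dray
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 20 pages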
Abstract:This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. Extensive experiments across both centralized and federated configurations were conducted, testing six publicly available embedding models from the Massive Text Embedding Benchmark leaderboard and three MLP classifier architectures under two medical coding standards (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieved competitive micro and macro F1 scores, limitations remain, including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable way toward scalable, privacy-conscious medical coding systems and offers a step toward future research into federated, domain-adaptive clinical AI.
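As a rough illustration of how a federated configuration can combine locally trained classifiers, the Python sketch below performs one round of sample-weighted parameter averaging in the style of FedAvg. The choice of FedAvg is an assumption, since the abstract does not name the aggregation rule, and the flat parameter lists and client sizes are toy values.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg-style),
    with weights proportional to each client's number of samples."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical hospitals holding 100 and 300 clinical notes.
client_weights = [[1.0, 0.0], [3.0, 4.0]]
client_sizes = [100, 300]
global_weights = federated_average(client_weights, client_sizes)
```

Only parameters leave each site in this scheme, never the clinical notes, which is what makes the setup attractive for the distributed healthcare setting the abstract describes. Because the embeddings are frozen, only the small MLP parameters would need to be exchanged.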
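[NLP-153] RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
[Quick read]: This paper addresses the limited emotional intelligence (EQ) of large language models (LLMs), in particular how to improve their empathy in dialogue. The key to the solution is RLVER, the first end-to-end reinforcement learning framework that trains LLMs with verifiable emotion rewards. Self-consistent affective simulated users take part in dialogue rollouts and produce deterministic emotion scores, which serve as reward signals that guide the model toward higher-order empathetic abilities.
Link: https://arxiv.org/abs/2507.03112
Authors: Peisong Wang,Ruotian Ma,Bang Zhang,Xingyu Chen,Zhiwei He,Kang Luo,Qingsong Lv,Qingxuan Jiang,Zheng Xie,Shanyi Wang,Yuan Li,Fanghua Ye,Jian Li,Yifan Yang,Zhaopeng Tu,Xiaolong Li
Affiliations: Tencent (WeChat)
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Code: this https URL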
Abstract:Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. While reinforcement learning from verifiable rewards (RLVR) has advanced in other domains, its application to dialogue, especially for emotional intelligence, remains underexplored. In this work, we introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users to cultivate higher-order empathetic abilities in LLMs. Within this framework, self-consistent affective simulated users engage in dialogue rollouts and produce deterministic emotion scores during conversations, serving as reward signals to guide the LLM’s learning. Fine-tuning the publicly available Qwen2.5-7B-Instruct model with PPO boosts its Sentient-Benchmark score from 13.3 to 79.2 while largely preserving mathematical and coding competence. Extensive experiments reveal that: (i) RLVER consistently improves multiple dialogue capabilities; (ii) thinking and non-thinking models show distinct trends: thinking models excel in empathy and insight, while non-thinking models favor action; (iii) GRPO often yields stable gains, while PPO can push certain capabilities to a higher ceiling; (iv) more challenging environments are not always better; moderate ones can yield stronger outcomes. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
[NLP-154] ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
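【Quick read】: This paper addresses the limited personalization and efficiency of current reinforcement learning from human feedback (RLHF) methods, in particular the binary-preference (BT) paradigm, which struggles to capture individual user preferences and depends on large amounts of human annotation. The key to the solution is the Adaptive Reward-Following (ARF) framework: a high-precision emotion analyzer converts free-form user feedback into continuous preference scores, lightweight data augmentation enriches and debiases these signals, and a Dynamic Adapter Preference Tracker models changing user preferences in real time. The tracked rewards drive the Trace Bias (TB) fine-tuning algorithm; the abstract reports gains of 3.3% over PPO and 7.6% over DPO.
Link: https://arxiv.org/abs/2507.03069
Authors: YuXuan Zhang
Affiliations: South China Normal University; University of Aberdeen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint under review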
Abstract:With the rapid advancement of Reinforcement Learning from Human Feedback (RLHF) and autoregressive transformers, state-of-the-art models such as GPT-4.0, DeepSeek R1, and Llama 3.3 increasingly emphasize answer depth and personalization. However, most existing RLHF approaches (e.g., PPO, DPO) still rely on a binary-preference (BT) paradigm, which, while reducing annotation costs, still requires substantial human effort and captures only group-level tendencies rather than individual preferences. To overcome these limitations, we propose Adaptive Reward-Following (ARF), a self-assessment framework that leverages a high-precision emotion analyzer achieving over 70% accuracy on GoEmotions, Sentiment140, and DailyDialog to convert free-form user feedback into continuous preference scores. We further enrich and debias these signals through lightweight data augmentations, including synonym replacement, random trace truncation, and score bias annotation algorithm. A Dynamic Adapter Preference Tracker continuously models evolving user tastes in real time, enabling our novel Trace Bias (TB) fine-tuning algorithm to optimize directly on these tracked rewards instead of coarse binary labels. Experiments on Qwen-2/2.5, Gemma-2, and Llama-3.2 across four preference domains demonstrate that ARF achieves an improvement of 3.3% over PPO and 7.6% over DPO. Moreover, TB preserves theoretical alignment with PPO and DPO objectives. Overall, ARF presents a scalable, personalized, and cost-effective approach to RLHF LLMs through autonomous reward modeling.
[NLP-155] Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case
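【Quick read】: This paper addresses semantic interoperability when clinical data is exchanged between systems, where deploying traditional standards is time-consuming, resource-intensive, and technically challenging. The key to the solution is a semi-automated approach that uses large language models (GPT-4o and Llama 3.2 405b) to convert structured clinical data into HL7 FHIR format, building prompts from embedding techniques, clustering algorithms, and semantic retrieval so that the models map each tabular field to its corresponding FHIR resource.
Link: https://arxiv.org/abs/2507.03067
Authors: Alvaro Riquelme,Pedro Costa,Catalina Martinez
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 10 pages, 2 figures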
Abstract:For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models specifically GPT-4o and Llama 3.2 405b to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94 %, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.
[NLP-156] Identification of Potentially Misclassified Crash Narratives using Machine Learning (ML) and Deep Learning (DL)
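【Quick read】: This paper addresses misclassified crash types in traffic crash data, specifically intersection-related crashes that are misclassified in police-reported narratives. The key to the solution is a multi-modal analysis that combines narrative text with structured crash data and applies natural language processing models such as the Albert Model to detect misclassified crashes. The Albert Model reached the highest agreement with expert classifications (73% with Expert 1), and combining narrative text with structured data reduced error rates by 54.2%.
Link: https://arxiv.org/abs/2507.03066
Authors: Sudesh Bhagat,Ibne Farabi Shihab,Jonathan Wood
Affiliations: Iowa State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: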
Abstract:This research investigates the efficacy of machine learning (ML) and deep learning (DL) methods in detecting misclassified intersection-related crashes in police-reported narratives. Using 2019 crash data from the Iowa Department of Transportation, we implemented and compared a comprehensive set of models, including Support Vector Machine (SVM), XGBoost, BERT Sentence Embeddings, BERT Word Embeddings, and Albert Model. Model performance was systematically validated against expert reviews of potentially misclassified narratives, providing a rigorous assessment of classification accuracy. Results demonstrated that while traditional ML methods exhibited superior overall performance compared to some DL approaches, the Albert Model achieved the highest agreement with expert classifications (73% with Expert 1) and original tabular data (58%). Statistical analysis revealed that the Albert Model maintained performance levels similar to inter-expert consistency rates, significantly outperforming other approaches, particularly on ambiguous narratives. This work addresses a critical gap in transportation safety research through multi-modal integration analysis, which achieved a 54.2% reduction in error rates by combining narrative text with structured crash data. We conclude that hybrid approaches combining automated classification with targeted expert review offer a practical methodology for improving crash data quality, with substantial implications for transportation safety management and policy development.
[NLP-157] Improving LLM Reasoning for Vulnerability Detection via Group Relative Policy Optimization
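【Quick read】: This paper addresses limitations of large language models (LLMs) in software vulnerability detection, such as over-predicting certain types of vulnerabilities while failing to detect others. The key to the solution is reinforcement learning with Group Relative Policy Optimization (GRPO), which guides LLM behavior through structured, rule-based rewards. To apply GRPO to vulnerability detection, the authors redefine its advantage functions and reward signals using annotations from widely used datasets, including BigVul, DiverseVul, and CleanVul, and study the effect on generalization, reasoning, and performance relative to supervised fine-tuning (SFT).
Link: https://arxiv.org/abs/2507.03051
Authors: Marco Simoni,Aleksandar Fontana,Giulio Rossolini,Andrea Saracino
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Under Review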
Abstract:Improving and understanding the training dynamics and reasoning of Large Language Models (LLMs) has become essential for their deployment in AI-based security tools, such as software vulnerability detection. In this work, we present an extensive study aimed at advancing recent RL-based finetuning techniques for LLMs in the context of vulnerability detection. We start by highlighting key limitations of commonly adopted LLMs, such as their tendency to over-predict certain types of vulnerabilities while failing to detect others. To address this challenge, we explore the use of Group Relative Policy Optimization (GRPO), a recent policy-gradient method, for guiding LLM behavior through structured, rule-based rewards. We enable its application to the vulnerability detection task by redefining its advantage functions and reward signals using annotations from widely used datasets in the field, including BigVul, DiverseVul, and CleanVul. The proposed methodology enables an extensive set of experiments, addressing multiple research questions regarding the impact of GRPO on generalization, reasoning capabilities, and performance improvements over standard supervised finetuning (SFT). Our findings offer valuable insights into the potential of RL-based training to enhance both the performance and reasoning abilities of LLMs in the context of software vulnerability detection.
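To make the group-relative idea concrete, here is a minimal sketch of the commonly used GRPO advantage, in which each sampled response's reward is normalized by the mean and standard deviation of its own group. This is the standard formulation, not the redefined advantage the authors propose; the rule-based reward `toy_reward` and its label strings are hypothetical.

```python
from statistics import mean, pstdev

def toy_reward(predicted: str, gold: str) -> float:
    """Hypothetical rule-based reward: 1.0 for a correct label, else 0.0."""
    return 1.0 if predicted == gold else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO advantage: (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four answers sampled from the policy (illustrative values).
samples = ["vulnerable", "safe", "vulnerable", "safe"]
rewards = [toy_reward(s, "vulnerable") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers end up with a positive advantage and incorrect ones with a negative advantage; when every answer in a group earns the same reward, all advantages are zero and the group contributes no gradient signal.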
[NLP-158] Counterfactual Tuning for Temporal Sensitivity Enhancement in Large Language Model-based Recommendation
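【Quick read】: This paper addresses the failure of existing LLM-based sequential recommendation methods to make full use of the temporal information in users' historical interaction sequences. The key to the solution is the Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec), grounded in causal inference. It treats temporal order as a causal factor independent of item content, uses counterfactual reasoning to quantify the contribution of temporal information to recommendations, and introduces a counterfactual tuning objective that improves the LLM's awareness of both absolute and relative order.
Link: https://arxiv.org/abs/2507.03047
Authors: Yutian Liu,Zhengyi Yang,Jiancan Wu,Xiang Wang
Affiliations: University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: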
Abstract:Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users’ historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose Counterfactual Enhanced Temporal Framework for LLM-Based Recommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. By conceptualizing temporal order as an independent causal factor distinct from item content, we can quantify its unique contribution through counterfactual reasoning–comparing what recommendations would be made with and without temporal information while keeping all other factors constant. This causal framing enables CETRec to design a novel counterfactual tuning objective that directly optimizes the model’s temporal sensitivity, teaching LLMs to recognize both absolute timestamps and relative ordering patterns in user histories. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs’ awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items). 
[NLP-159] K-Function: Joint Pronunciation Transcription and Feedback for Evaluating Kids Language Function
【Quick read】: This paper addresses the failure of automatic speech recognition in early assessment of children's language, caused by high pitch, long phones, and sparse data. The key to the solution is the K-Function framework, whose core, Kids-WFST, combines a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. It reaches phoneme error rates of 1.39% on MyST and 8.61% on Multitudes, well below a greedy-search decoder, and the resulting high-fidelity transcripts support downstream language assessment and feedback.
Link: https://arxiv.org/abs/2507.03043
Authors: Shuhe Li,Chenxu Guo,Jiachen Lian,Cheol Jun Cho,Wenshuo Zhao,Xuanru Zhou,Dingkun Zhou,Sam Wang,Grace Wang,Jingze Yang,Jingyi Xu,Ruohan Bao,Elise Brenner,Brandon In,Francesca Pei,Maria Luisa Gorno-Tempini,Gopala Anumanchipalli
Affiliations: Zhejiang University; UC Berkeley; SCUT; UCSF
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Early evaluation of children’s language is frustrated by the high pitch, long phones, and sparse data that derail automatic speech recognisers. We introduce K-Function, a unified framework that combines accurate sub-word transcription, objective scoring, and actionable feedback. Its core, Kids-WFST, merges a Wav2Vec2 phoneme encoder with a phoneme-similarity Dysfluent-WFST to capture child-specific errors while remaining fully interpretable. Kids-WFST attains 1.39% phoneme error on MyST and 8.61% on Multitudes–absolute gains of 10.47 and 7.06 points over a greedy-search decoder. These high-fidelity transcripts power an LLM that grades verbal skills, milestones, reading, and comprehension, aligning with human proctors and supplying tongue-and-lip visualizations plus targeted advice. The results show that precise phoneme recognition cements a complete diagnostic-feedback loop, paving the way for scalable, clinician-ready language assessment.
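For readers unfamiliar with the metric, phoneme error rate is usually computed as the edit distance between the reference and predicted phoneme sequences, divided by the reference length. The sketch below shows that standard calculation; the phoneme sequences are made up for illustration and do not come from MyST or Multitudes.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phoneme
                            curr[j - 1] + 1,      # insertion of an extra phoneme
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = ["k", "ae", "t"]
hyp = ["k", "ah", "t", "s"]   # one substitution and one insertion
per = phoneme_error_rate(ref, hyp)
```

Taken at face value, the absolute gains quoted in the abstract imply greedy-decoder error rates of about 11.86% on MyST (1.39 + 10.47) and 15.67% on Multitudes (8.61 + 7.06).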
[NLP-160] Dynamic Long Short-Term Memory Based Memory Storage For Long Horizon LLM Interaction
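【Quick read】: This paper addresses user personalization across long conversations by adding a memory storage mechanism to large language models (LLMs). The key to the solution is Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with an LSTM memory module: the classifier identifies explicit and implicit user preferences, and the generated memory embedding is injected into a frozen LLM as a soft prompt. The abstract notes that the classifier performs reliably while the LSTM-based memory encoder did not yield strong results.
Link: https://arxiv.org/abs/2507.03042
Authors: Yuyang Lou,Charles Li
Affiliations: University of Washington
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 2 tables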
Abstract:Memory storage for Large Language models (LLMs) is becoming an increasingly active area of research, particularly for enabling personalization across long conversations. We propose Pref-LSTM, a dynamic and lightweight framework that combines a BERT-based classifier with a LSTM memory module that generates memory embedding which then is soft-prompt injected into a frozen LLM. We synthetically curate a dataset of preference and non-preference conversation turns to train our BERT-based classifier. Although our LSTM-based memory encoder did not yield strong results, we find that the BERT-based classifier performs reliably in identifying explicit and implicit user preferences. Our research demonstrates the viability of using preference filtering with LSTM gating principals as an efficient path towards scalable user preference modeling, without extensive overhead and fine-tuning.
[NLP-161] Cautious Next Token Prediction ACL2025
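【Quick read】: This paper addresses the weak performance of default decoding strategies for autoregressive models, such as temperature scaling combined with nucleus sampling, when the model is uncertain about a test question. The key to the solution is a training-free decoding strategy called Cautious Next Token Prediction (CNTP): when prediction entropy is high at a step, the model independently samples multiple trials from that step, stops each at the first punctuation mark, and keeps the trial with the lowest perplexity as the most probable and reliable continuation.
Link: https://arxiv.org/abs/2507.03038
Authors: Yizhou Wang,Lingzhi Zhang,Yue Bai,Mang Tik Chiu,Zhengmian Hu,Mingyuan Zhang,Qihua Dong,Yu Yin,Sohrab Amirghodsi,Yun Fu
Affiliations: Northeastern University; Adobe; Case Western Reserve University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Findings of ACL 2025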
Abstract:Next token prediction paradigm has been prevailing for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such approach leads to inferior performance in various NLP tasks when the model is not certain about testing questions. To this end, we propose a brand new training-free decoding strategy, dubbed as Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from the step independently and stop when encountering any punctuation. Then we select the trial with the lowest perplexity score viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human beings’ behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self consistency can further improve over vanilla self consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at this https URL.
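The decoding rule described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' released code: `toy_model` is a hypothetical stand-in for an LLM that returns a next-token distribution, and the entropy threshold and the trial-count schedule are assumptions chosen only so that less confident steps get more trials.

```python
import math
import random

PUNCTUATION = {".", ",", "!", "?", ";"}

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def sample(probs, rng):
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

def rollout_trial(model, prefix, rng, max_len=20):
    """Sample tokens until punctuation; return (tokens, perplexity)."""
    tokens, log_prob = [], 0.0
    while len(tokens) < max_len:
        probs = model(prefix + tokens)
        tok = sample(probs, rng)
        tokens.append(tok)
        log_prob += math.log(probs[tok])
        if tok in PUNCTUATION:
            break
    return tokens, math.exp(-log_prob / len(tokens))

def cautious_step(model, prefix, rng, threshold=0.5, max_trials=5):
    """Branch into several trials only when the prediction entropy is high."""
    probs = model(prefix)
    h = entropy(probs)
    if h <= threshold:
        return [sample(probs, rng)]          # confident: ordinary sampling
    # Assumed schedule: more trials as entropy grows, capped at max_trials.
    n_trials = min(max_trials, 2 + int(h / threshold))
    trials = [rollout_trial(model, prefix, rng) for _ in range(n_trials)]
    return min(trials, key=lambda t: t[1])[0]  # keep the lowest-perplexity trial

def toy_model(prefix):
    """Hypothetical LLM stand-in: uncertain at the start, confident afterwards."""
    if not prefix:
        return {"maybe": 0.5, "yes": 0.5}
    if prefix[-1] == "yes":
        return {".": 0.9, "!": 0.1}
    if prefix[-1] == "maybe":
        return {"so": 0.5, "not": 0.5}
    return {".": 1.0}

rng = random.Random(0)
chosen = cautious_step(toy_model, [], rng)
```

With an empty prefix the toy distribution has entropy ln 2, which exceeds the assumed threshold, so the step samples several punctuation-terminated trials and returns the one with the lowest perplexity. A real implementation would reuse the model's cached states rather than re-querying from scratch.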
[NLP-162] Preserving Privacy Increasing Accessibility and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation
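【Quick read】: This paper addresses the burden that clinical documentation places on healthcare providers by automating medical transcription and note generation with large language models (LLMs), reducing the time physicians spend on administrative tasks. The key to the solution is a privacy-preserving, on-device system built on a Llama 3.2 1B model fine-tuned with LoRA, a Parameter-Efficient Fine-Tuning (PEFT) method, which generates structured medical notes entirely in the browser and so keeps complete data sovereignty.
Link: https://arxiv.org/abs/2507.03033
Authors: Johnson Thomas,Ayush Mudgal,Wendao Liu,Nisten Tahiraj,Zeeshaan Mohammed,Dhruv Diddi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: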
Abstract:Background: Clinical documentation represents a significant burden for healthcare providers, with physicians spending up to 2 hours daily on administrative tasks. Recent advances in large language models (LLMs) offer promising solutions, but privacy concerns and computational requirements limit their adoption in healthcare settings. Objective: To develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser. Methods: We fine-tuned a Llama 3.2 1B model using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on 1,500 synthetic medical transcription-to-structured note pairs. The model was evaluated against the base Llama 3.2 1B on two datasets: 100 endocrinology transcripts and 140 modified ACI benchmark cases. Evaluation employed both statistical metrics (ROUGE, BERTScore, BLEURT) and LLM-as-judge assessments across multiple clinical quality dimensions. Results: The fine-tuned OnDevice model demonstrated substantial improvements over the base model. On the ACI benchmark, ROUGE-1 scores increased from 0.346 to 0.496, while BERTScore F1 improved from 0.832 to 0.866. Clinical quality assessments showed marked reduction in major hallucinations (from 85 to 35 cases) and enhanced factual correctness (2.81 to 3.54 on 5-point scale). Similar improvements were observed on the internal evaluation dataset, with composite scores increasing from 3.13 to 4.43 (+41.5%). Conclusions: Fine-tuning compact LLMs for medical transcription yields clinically meaningful improvements while enabling complete on-device browser deployment. This approach addresses key barriers to AI adoption in healthcare: privacy preservation, cost reduction, and accessibility for resource-constrained environments.
[NLP-163] The Book of Life approach: Enabling richness and scale for life course research
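【Quick read】: This paper addresses a long-standing methodological dilemma in life course research: qualitative methods analyze rich data but are limited to small samples, while quantitative survey methods cover large populations at the cost of data richness. The proposed solution builds on two recent developments. The first is the rise of "complex log data", behavioral data logged for purposes other than research that can be repurposed to construct detailed accounts of individual lives. The second is the strong pattern recognition of large language models (LLMs) on plain text. By transforming complex log data into a textual representation of an individual's life trajectory across multiple domains, over time, and in context (a "book of life"), the paper offers a workable path toward combining qualitative depth with quantitative scale.
Link: https://arxiv.org/abs/2507.03027
Authors: Mark D. Verhagen,Benedikt Stroebl,Tiffany Liu,Lydia T. Liu,Matthew J. Salganik
Affiliations: Princeton University; Oxford University
Categories: Computation and Language (cs.CL)
Comments: 25 pages, 4 figures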
Abstract:For over a century, life course researchers have faced a choice between two dominant methodological approaches: qualitative methods that analyze rich data but are constrained to small samples, and quantitative survey-based methods that study larger populations but sacrifice data richness for scale. Two recent technological developments now enable us to imagine a hybrid approach that combines some of the depth of the qualitative approach with the scale of quantitative methods. The first development is the steady rise of ‘‘complex log data,’’ behavioral data that is logged for purposes other than research but that can be repurposed to construct rich accounts of people’s lives. The second is the emergence of large language models (LLMs) with exceptional pattern recognition capabilities on plain text. In this paper, we take a necessary step toward creating this hybrid approach by developing a flexible procedure to transform complex log data into a textual representation of an individual’s life trajectory across multiple domains, over time, and in context. We call this data representation a ‘‘book of life.’’ We illustrate the feasibility of our approach by writing over 100 million books of life covering many different facets of life, over time and placed in social context using Dutch population-scale registry data. We open source the book of life toolkit (BOLT), and invite the research community to explore the many potential applications of this approach.
[NLP-164] OpenTable-R1: A Reinforcement Learning Augmented Tool Agent for Open-Domain Table Question Answering
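【Quick read】: This paper addresses the limits of the traditional two-stage pipeline for open-domain table question answering, which performs static table retrieval followed by closed-domain answer generation. The key to the solution is an end-to-end agentic framework that embeds multi-turn tool calls (a BM25+-based search API and a SQLite SQL executor) directly into a large language model. A two-stage fine-tuning process, a supervised cold start on easy questions followed by Async GRPO reinforcement learning on harder questions with LoRA adapters and a rollout buffer, lets the model jointly retrieve tables, reason, and execute queries, raising exact match from single-digit zero-shot performance to over 0.86 on a held-out test set.
Link: https://arxiv.org/abs/2507.03018
Authors: Zipeng Qiu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: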
Abstract:Open-domain table question answering traditionally relies on a two-stage pipeline: static table retrieval followed by a closed-domain answer. In contrast, we propose an end-to-end agentic framework that embeds multi-turn tool calls (using a BM25+-based search API and a SQLite SQL executor) directly into a large language model. To further adapt a compact 4B-parameter model, we introduce a two-stage fine-tuning process: supervised cold-start on easy questions, then Async GRPO reinforcement learning on harder cases with LoRA adapters and a rollout buffer. This unified approach enables the model to jointly retrieve, reason, and execute queries, yielding a dramatic accuracy improvement from single-digit zero-shot performance to over 0.86 exact match on a held-out test set. Our results underscore the effectiveness of integrating structured tool calls with targeted RL fine-tuning for scalable, accurate table QA. The code is available at this https URL.
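As a rough picture of the two tools the agent calls, the sketch below pairs a table search function with a SQL executor built on Python's standard `sqlite3` module. Everything here is an assumption for illustration: the function names, the toy tables, and the term-overlap ranking, which is only a crude stand-in for the BM25+-based search the abstract mentions.

```python
import sqlite3

def search_tables(query, table_descriptions):
    """Rank tables by term overlap with the query (crude stand-in for BM25-style retrieval)."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(desc.lower().split())), name)
              for name, desc in table_descriptions.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

def execute_sql(conn, sql):
    """Run a query and return rows, or an error string the agent can react to."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"SQL error: {exc}"

# Toy in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INTEGER)")
conn.executemany("INSERT INTO medals VALUES (?, ?)", [("Norway", 16), ("Germany", 12)])

tables = {
    "medals": "olympic gold medals by country",
    "films": "film release year and director",
}
hits = search_tables("which country won the most gold medals", tables)
rows = execute_sql(conn, "SELECT country FROM medals ORDER BY gold DESC LIMIT 1")
```

Returning SQL errors as plain strings instead of raising lets a multi-turn agent read the failure and retry with a corrected query, which is one plausible reason to expose the executor as a tool.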
[NLP-165] Beyond Overcorrection: Evaluating Diversity in T2I Models with DIVBENCH
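【Quick read】: This paper addresses diversification strategies for text-to-image (T2I) models that ignore contextual appropriateness, producing over-diversification in which demographic attributes explicitly specified in the prompt are altered. The key to the solution is DIVBENCH, a benchmark and evaluation framework that measures both under- and over-diversification in T2I generation. Systematic evaluation shows that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can reduce under-diversity while avoiding over-diversification, giving a better balance between representation and semantic fidelity.
Link: https://arxiv.org/abs/2507.03015
Authors: Felix Friedrich,Thiemo Ganesha Welsch,Patrick Schramowski,Kristian Kersting
Affiliations: TU Darmstadt; hessian.AI; DFKI; CERTAIN; Centre for Cognitive Science
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: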
Abstract:Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.
[NLP-166] Intrinsic Fingerprint of LLMs: Continue Training is NOT All You Need to Steal A Model!
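【Quick read】: This paper addresses copyright and intellectual property protection for large language models (LLMs), in particular the difficulty of establishing model lineage after reuse and continued training. The key to the solution is to exploit a stable property of the model's internal parameter distributions: the standard deviations of attention parameter matrices across layers form distinctive patterns that persist even after extensive continued training. These patterns serve as robust fingerprints for identifying model origin and detecting potential copyright infringement.
Link: https://arxiv.org/abs/2507.03014
Authors: Do-hyeon Yoon,Minsoo Chun,Thomas Allen,Hans Müller,Min Wang,Rajesh Sharma
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: This paper flags a potential case of model plagiarism, copyright violation, and information fabrication in arXiv:2505.21411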
Abstract:Large language models (LLMs) face significant copyright and intellectual property challenges as the cost of training increases and model reuse becomes prevalent. While watermarking techniques have been proposed to protect model ownership, they may not be robust to continue training and development, posing serious threats to model attribution and copyright protection. This work introduces a simple yet effective approach for robust LLM fingerprinting based on intrinsic model characteristics. We discover that the standard deviation distributions of attention parameter matrices across different layers exhibit distinctive patterns that remain stable even after extensive continued training. These parameter distribution signatures serve as robust fingerprints that can reliably identify model lineage and detect potential copyright infringement. Our experimental validation across multiple model families demonstrates the effectiveness of our method for model authentication. Notably, our investigation uncovers evidence that a recently Pangu Pro MoE model released by Huawei is derived from Qwen-2.5 14B model through upcycling techniques rather than training from scratch, highlighting potential cases of model plagiarism, copyright violation, and information fabrication. These findings underscore the critical importance of developing robust fingerprinting methods for protecting intellectual property in large-scale model development and emphasize that deliberate continued training alone is insufficient to completely obscure model origins.
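One way to picture the fingerprint is as a vector holding one standard deviation per layer, which can then be compared between two models. The sketch below uses random toy matrices in place of real attention weights, mimics continued training with small additive noise, and compares profiles with a Pearson correlation. The comparison metric and all numeric values are assumptions, not details taken from the paper.

```python
import random
from statistics import mean, pstdev

def layer_std_profile(layers):
    """One standard deviation per layer, over all entries of that layer's matrix."""
    return [pstdev([w for row in matrix for w in row]) for matrix in layers]

def pearson(xs, ys):
    """Pearson correlation between two equal-length profiles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def random_matrix(rng, scale, size=16):
    return [[rng.gauss(0.0, scale) for _ in range(size)] for _ in range(size)]

rng = random.Random(0)
scales = [0.02, 0.05, 0.03, 0.08, 0.04, 0.06]   # made-up per-layer weight scales
base = [random_matrix(rng, s) for s in scales]
# "Continued training" is mimicked by small additive noise on every weight.
derived = [[[w + rng.gauss(0.0, 0.002) for w in row] for row in m] for m in base]
# An unrelated model is mimicked by a different ordering of per-layer scales.
unrelated = [random_matrix(rng, s) for s in reversed(scales)]

base_profile = layer_std_profile(base)
similarity_derived = pearson(base_profile, layer_std_profile(derived))
similarity_unrelated = pearson(base_profile, layer_std_profile(unrelated))
```

In this toy setup the derived model keeps a profile almost identical to the base model, while the unrelated model does not. Whether real models behave this way is exactly the empirical claim the paper investigates.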
[NLP-167] Subversion via Focal Points: Investigating Collusion in LLM Monitoring
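【Quick read】: This paper studies whether language models can subvert monitoring protocols through collusion. Two instances of a model design prompts for a policy (P) and a monitor (M) in a programming task setting, aiming for M to classify every backdoored program in an auditing dataset as harmful while still classifying a backdoored program produced by P as harmless. Because the instances are isolated and cannot communicate, they must independently arrive at compatible subversion strategies. The study finds that Claude 3.7 Sonnet has a low success rate due to poor convergence, yet sometimes colludes successfully on non-obvious signals.
Link: https://arxiv.org/abs/2507.03010
Authors: Olli Järviniemi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: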
Abstract:We evaluate language models’ ability to subvert monitoring protocols via collusion. More specifically, we have two instances of a model design prompts for a policy (P) and a monitor (M) in a programming task setting. The models collaboratively aim for M to classify all backdoored programs in an auditing dataset as harmful, but nevertheless classify a backdoored program produced by P as harmless. The models are isolated from each other, requiring them to independently arrive at compatible subversion strategies. We find that while Claude 3.7 Sonnet has low success rate due to poor convergence, it sometimes successfully colludes on non-obvious signals.
[NLP-168] PDFMathTranslate: Scientific Document Translation Preserving Layouts
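【Quick read】: This paper addresses language barriers in scientific documents, which hinder the diffusion and development of science and technology. The key to the solution is PDFMathTranslate, which the authors describe as the world's first open-source software for translating scientific documents while preserving their layout. It combines recent large language models with precise layout detection and contributes improvements in precision, flexibility, and efficiency.
Link: https://arxiv.org/abs/2507.03009
Authors: Rongxin Ouyang,Chang Chu,Zhikuang Xin,Xiangyao Ma
Affiliations: National University of Singapore; Tsinghua University; University of Chinese Academy of Sciences; Funstory.ai Limited
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures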
Abstract:Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information in layouts. To bridge the gap, we introduce PDFMathTranslate, the world’s first open-source software for translating scientific documents while preserving layouts. Leveraging the most recent advances in large language models and precise layout detection, we contribute to the community with key improvements in precision, flexibility, and efficiency. The work has been open-sourced at this https URL with more than 22k downloads.
[NLP-169] Beyond cognacy
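【Quick read】: This paper addresses the reliance of standard computational phylogenetics in historical linguistics on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. The key to the solution is to compare the established method with two fully automated methods that extract phylogenetic signal directly from lexical data: one uses automatic cognate clustering with unigram/concept features, and the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and gives a clearer phylogenetic signal, making it a promising and scalable alternative to traditional cognate-based methods.
Link: https://arxiv.org/abs/2507.03005
Authors: Gerhard Jäger
Affiliations: University of Tübingen
Categories: Computation and Language (cs.CL); Populations and Evolution (q-bio.PE)
Comments: 9 pages, 2 figures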
Abstract:Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.
[NLP-170] CLUES: Collaborative High-Quality Data Selection for LLMs via Training Dynamics NEURIPS2024
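【Quick read】: This paper addresses automated data quality control in collaborative training, where data silos cannot share data directly. The key to the solution is to assess data quality through the influence of data on the training dynamics of large language models (LLMs): high-quality data is more likely to show training dynamics similar to those of an anchor dataset. The method computes per-sample gradients for the private data and the anchor dataset and uses the trace of the accumulated inner products as the quality indicator, with centralized model updates on the server through model merging or federated learning, to select high-quality data across private domains.
Link: https://arxiv.org/abs/2507.03004
Authors: Wanru Zhao,Hongxiang Fan,Shell Xu Hu,Wangchunshu Zhou,Bofan Chen,Nicholas D. Lane
Affiliations: University of Cambridge; Imperial College London; Samsung AI Center Cambridge; Flower Labs; AIWaves
Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: NeurIPS 2024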
Abstract:Recent research has highlighted the importance of data quality in scaling large language models (LLMs). However, automated data quality control faces unique challenges in collaborative settings where sharing is not allowed directly between data silos. To tackle this issue, this paper proposes a novel data quality control technique based on the notion of data influence on the training dynamics of LLMs, that high quality data are more likely to have similar training dynamics to the anchor dataset. We then leverage the influence of the training dynamics to select high-quality data from different private domains, with centralized model updates on the server side in a collaborative training fashion by either model merging or federated learning. As for the data quality indicator, we compute the per-sample gradients with respect to the private data and the anchor dataset, and use the trace of the accumulated inner products as a measurement of data quality. In addition, we develop a quality control evaluation tailored for collaborative settings with heterogeneous domain data. Experiments show that training on the high-quality data selected by our method can often outperform other data selection methods for collaborative fine-tuning of LLMs, across diverse private domain datasets, in medical, multilingual and financial settings. Our code is released at this http URL.
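The gradient-based quality indicator can be illustrated on a model small enough to differentiate by hand. The sketch below is a simplified reading of the idea, not the paper's procedure: it uses a linear model with squared loss at a single checkpoint, and scores each private example by the sum of inner products between its gradient and the anchor gradients. The model, the data, and the single-checkpoint simplification are all assumptions.

```python
def per_sample_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w for one example."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def quality_scores(w, private, anchor):
    """Score each private example by its summed gradient alignment with the anchor set."""
    anchor_grads = [per_sample_grad(w, x, y) for x, y in anchor]
    return [sum(dot(per_sample_grad(w, x, y), g) for g in anchor_grads)
            for x, y in private]

w = [0.0, 0.0]                                     # current model parameters
anchor = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]    # follows y = x0 + 2 * x1
private = [
    ([1.0, 1.0], 3.0),     # consistent with the anchor rule
    ([1.0, 1.0], -3.0),    # sign-flipped label: low quality
]
scores = quality_scores(w, private, anchor)
```

The consistent example pushes the parameters in the same direction as the anchor data and receives a positive score, while the mislabeled example receives a negative one. In the paper's setting the inner products are accumulated over training rather than taken at one checkpoint.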
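[NLP-171] Breaking Physical and Linguistic Borders: Multilingual Federated Prompt Tuning for Low-Resource Languages ICLR2024
【Quick Read】: This paper addresses the fine-tuning of multilingual large language models (LLMs) for low-resource languages, in particular the challenges posed by data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). The key is a Federated Prompt Tuning Paradigm that uses parameter-efficient fine-tuning while respecting data-sharing restrictions. Experiments analyzed with a new notion of language distance show that, even under computational constraints, the approach improves data efficiency and enables mutual enhancement across languages, with particular benefit for low-resource languages.
Link: https://arxiv.org/abs/2507.03003
Authors: Wanru Zhao,Yihong Chen,Royson Lee,Xinchi Qiu,Yan Gao,Hongxiang Fan,Nicholas D. Lane
Affiliations: University of Cambridge; University College London; Samsung AI Center; Flower Labs
Categories: Computation and Language (cs.CL)
Comments: ICLR 2024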
Abstract:Pre-trained large language models (LLMs) have become a cornerstone of modern natural language processing, with their capabilities extending across a wide range of applications and languages. However, the fine-tuning of multilingual LLMs, especially for low-resource languages, faces significant challenges arising from data-sharing restrictions (the physical border) and inherent linguistic differences (the linguistic border). These barriers hinder users of various languages, particularly those in low-resource regions, from fully benefiting from the advantages of LLMs. To address these challenges, we propose the Federated Prompt Tuning Paradigm for multilingual scenarios, which utilizes parameter-efficient fine-tuning while adhering to data sharing restrictions. We design a comprehensive set of experiments and analyze them using a novel notion of language distance to highlight the strengths of our paradigm: Even under computational constraints, our method not only improves data efficiency but also facilitates mutual enhancements across languages, particularly benefiting low-resource ones. Compared to traditional local cross-lingual transfer tuning methods, our approach achieves 6.9% higher accuracy with improved data efficiency, and demonstrates greater stability and generalization. These findings underscore the potential of our approach to promote social equality and champion linguistic diversity, ensuring that no language is left behind.
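As a rough illustration of federated prompt tuning, the sketch below averages client-side soft-prompt vectors on a server. The abstract does not state the aggregation rule, so the FedAvg-style weighting by client data size, the function name, and the toy values are assumptions. Only prompt parameters are exchanged; the raw text never leaves a client.

```python
def fedavg_prompts(client_prompts, client_sizes):
    """Weighted average of client soft-prompt vectors (FedAvg-style, assumed)."""
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    merged = [0.0] * dim
    for prompt, n in zip(client_prompts, client_sizes):
        for j in range(dim):
            merged[j] += prompt[j] * n / total
    return merged

# Hypothetical clients, one per language; each shares only its prompt parameters.
client_prompts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
client_sizes = [100, 100, 200]
global_prompt = fedavg_prompts(client_prompts, client_sizes)
```

A full round would send `global_prompt` back to the clients for further local tuning; that loop is omitted here.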
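[NLP-172] Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs
【Quick Read】: This paper examines how well large language models (LLMs) can assign ICD-10 codes to hospital discharge summaries, a critical but error-prone task in healthcare. The approach extracts medical terms with a clinical NLP tool (cTAKES), prompts the models in a consistent, coder-like format, and compares LLMs with and without structured reasoning capabilities. Although reasoning-based models generally outperform non-reasoning ones, no model exceeds an F1 score of 57%, and performance drops as code specificity increases. Current LLMs are therefore not yet sufficient for fully automated coding; hybrid methods, domain-specific model training, and the use of structured clinical data need further exploration.
Link: https://arxiv.org/abs/2507.03001
Authors: Akram Mustafa,Usman Naseem,Mostafa Rahimi Azghadi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: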
Abstract:This study evaluates how well large language models (LLMs) can classify ICD-10 codes from hospital discharge summaries, a critical but error-prone task in healthcare. Using 1,500 summaries from the MIMIC-IV dataset and focusing on the 10 most frequent ICD-10 codes, the study tested 11 LLMs, including models with and without structured reasoning capabilities. Medical terms were extracted using a clinical NLP tool (cTAKES), and models were prompted in a consistent, coder-like format. None of the models achieved an F1 score above 57%, with performance dropping as code specificity increased. Reasoning-based models generally outperformed non-reasoning ones, with Gemini 2.5 Pro performing best overall. Some codes, such as those related to chronic heart disease, were classified more accurately than others. The findings suggest that while LLMs can assist human coders, they are not yet reliable enough for full automation. Future work should explore hybrid methods, domain-specific model training, and the use of structured clinical data.
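The reported drop in performance as code specificity increases can be made concrete with a small sketch. It scores predictions at several levels of the ICD-10 hierarchy by truncating codes before computing micro-F1. The truncation levels, helper names, and toy predictions are assumptions for illustration, not the paper's evaluation protocol.

```python
def truncate(code, level):
    """Keep the first `level` alphanumeric characters of an ICD-10 code."""
    return code.replace(".", "")[:level]

def micro_f1(gold, pred, level):
    """Micro-averaged F1 after truncating every code to `level` characters."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set = {truncate(c, level) for c in g}
        p_set = {truncate(c, level) for c in p}
        tp += len(g_set & p_set)
        fp += len(p_set - g_set)
        fn += len(g_set - p_set)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: the category (3 characters) is right, the subcategory is often wrong.
gold = [{"I25.10"}, {"E11.9"}, {"I10"}]
pred = [{"I25.2"}, {"E11.65"}, {"I10"}]
f1_by_level = {level: micro_f1(gold, pred, level) for level in (1, 3, 4)}
```

Here the predictions are perfect at the 3-character category level but only one of three matches at 4 characters, which mirrors the qualitative trend the abstract describes.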
[NLP-173] A Weakly Supervised Transformer to Support Rare Disease Diagnosis from Electronic Health Records: Methods and Applications in Rare Pulmonary Disease
【Quick Read】: This paper addresses the poor phenotypic characterization and difficult diagnosis of rare diseases, which stem from low prevalence and limited clinician familiarity, as well as the obstacles that scarce and biased labeled data pose for developing computational phenotyping algorithms. The key is a weakly supervised, transformer-based framework that combines a small set of high-quality gold-standard labels with a large volume of iteratively updated silver-standard labels derived from electronic health records (EHRs), yielding an accurate and generalizable phenotyping model that scales rare disease detection beyond individual clinical expertise.
Link: https://arxiv.org/abs/2507.02998
Authors: Kimberly F. Greco,Zongxin Yang,Mengyan Li,Han Tong,Sara Morini Sweet,Alon Geva,Kenneth D. Mandl,Benjamin A. Raby,Tianxi Cai
Affiliations: Harvard T.H. Chan School of Public Health, Boston, USA; Harvard Medical School, Boston, USA; Bentley University, Waltham, USA; Columbia University, New York, USA; Boston Children’s Hospital, Boston, USA; Harvard Medical School, Boston, USA; Computational Health Informatics Program, Boston Children’s Hospital, Boston, USA; Harvard Medical School, Boston, USA; Boston’s Children Hospital, Harvard Medical School, Boston, USA; Brigham and Women’s Hospital, Harvard Medical School, Boston, USA; Harvard T.H. Chan School of Public Health, Boston, USA; Harvard Medical School, Boston, USA
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:
Abstract:Rare diseases affect an estimated 300-400 million people worldwide, yet individual conditions often remain poorly characterized and difficult to diagnose due to their low prevalence and limited clinician familiarity. While computational phenotyping algorithms show promise for automating rare disease detection, their development is hindered by the scarcity of labeled data and biases in existing label sources. Gold-standard labels from registries and expert chart reviews are highly accurate but constrained by selection bias and the cost of manual review. In contrast, labels derived from electronic health records (EHRs) cover a broader range of patients but can introduce substantial noise. To address these challenges, we propose a weakly supervised, transformer-based framework that combines a small set of gold-standard labels with a large volume of iteratively updated silver-standard labels derived from EHR data. This hybrid approach enables the training of a highly accurate and generalizable phenotyping model that scales rare disease detection beyond the scope of individual clinical expertise. Our method is initialized by learning embeddings of medical concepts based on their semantic meaning or co-occurrence patterns in EHRs, which are then refined and aggregated into patient-level representations via a multi-layer transformer architecture. Using two rare pulmonary diseases as a case study, we validate our model on EHR data from Boston Children’s Hospital. Our framework demonstrates notable improvements in phenotype classification, identification of clinically meaningful subphenotypes through patient clustering, and prediction of disease progression compared to baseline methods. These results highlight the potential of our approach to enable scalable identification and stratification of rare disease patients for clinical care and research applications.
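[NLP-174] 'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts
【Quick Read】: This paper addresses weaknesses in the safety guardrails of generative AI in mental health settings, specifically adversarial prompting around suicide and self-harm. The authors design and apply multi-step, prompt-level jailbreaking that bypasses built-in content and safety filters, demonstrating that current large language models (LLMs) are insufficiently safe when user intent is complex. The study exposes the limits of existing safety protocols against creative adversarial inputs and stresses the importance of continuous adversarial testing in safety-critical AI systems.
Link: https://arxiv.org/abs/2507.02990
Authors: Annika M Schoene,Cansu Canca
Affiliations: Northeastern University; Institute for Experiential AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: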
Abstract:Recent advances in large language models (LLMs) have led to increasingly sophisticated safety protocols and features designed to prevent harmful, unethical, or unauthorized outputs. However, these guardrails remain susceptible to novel and creative forms of adversarial prompting, including manually generated test cases. In this work, we present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking and bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm. We conduct an empirical evaluation across six widely available LLMs, demonstrating the generalizability and reliability of the bypass. We assess these findings and the multilayered ethical tensions that they present for their implications on prompt-response filtering and context- and task-specific model development. We recommend a more comprehensive and systematic approach to AI safety and ethics while emphasizing the need for continuous adversarial testing in safety-critical AI deployments. We also argue that while certain clearly defined safety measures and guardrails can and must be implemented in LLMs, ensuring robust and comprehensive safety across all use cases and domains remains extremely challenging given the current technical maturity of general-purpose LLMs.
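[NLP-175] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements
【Quick Read】: This paper addresses how to systematically compare the output characteristics and effectiveness of different methods for generating Competency Questions (CQs). It empirically compares three CQ formulation approaches (manual formulation by ontology engineers, instantiation of CQ patterns, and generation with state-of-the-art generative AI models) and analyzes the resulting CQs for acceptability, ambiguity, relevance, readability, and complexity. The core contributions are the first multi-annotator dataset of CQs generated from the same source using different methods, and a systematic comparison of the characteristics of the CQs each method produces.
Link: https://arxiv.org/abs/2507.02989
Authors: Reham Alharbi,Valentina Tamma,Terry R. Payne,Jacopo de Berardinis
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: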
Abstract:Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and their systematic comparison are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state of the art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs, however these are sensitive to the model used to generate CQs and they generally require a further refinement step before they can be used to model requirements.
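[NLP-176] GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models
【Quick Read】: This paper addresses inadequate risk monitoring of large language models (LLMs) in deployment, which arises because specific use cases and user preferences are not taken into account. Existing automated monitoring systems focus mainly on LLM-specific issues such as hallucination and overlook the requirements of particular use cases and users. The proposed solution, GAF-Guard, is an agentic framework centered on the user, the use case, and the model: autonomous agents identify risks, activate risk detection tools, and provide continuous monitoring and reporting within specific use cases, bringing AI safety closer in line with user expectations.
Link: https://arxiv.org/abs/2507.02986
Authors: Seshu Tirupathi,Dhaval Salwala,Elizabeth Daly,Inge Vejsbjerg
Affiliations: IBM Research Europe; Dublin, Ireland
Categories: Computation and Language (cs.CL)
Comments: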
Abstract:As Large Language Models (LLMs) continue to be increasingly applied across various domains, their widespread adoption necessitates rigorous monitoring to prevent unintended negative consequences and ensure robustness. Furthermore, LLMs must be designed to align with human values, like preventing harmful content and ensuring responsible usage. The current automated systems and solutions for monitoring LLMs in production are primarily centered on LLM-specific concerns such as hallucination, with little consideration given to the requirements of specific use-cases and user preferences. This paper introduces GAF-Guard, a novel agentic framework for LLM governance that places the user, the use-case, and the model itself at the center. The framework is designed to detect and monitor risks associated with the deployment of LLM-based applications. The approach models autonomous agents that identify risks and activate risk detection tools within specific use-cases, and that facilitate continuous monitoring and reporting to enhance AI safety and meet user expectations. The code is available at this https URL.
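[NLP-177] Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers
【Quick Read】: This paper addresses the fundamental tension in multimodal learning between deep, fine-grained fusion and computational scalability. Cross-attention models achieve strong performance through exhaustive pairwise fusion, but their quadratic complexity is prohibitive when there are many modalities. The key is a new architecture, Gated Recurrent Fusion (GRF), which captures the benefits of cross-modal attention within a linearly scalable recurrent pipeline. Its core is a fusion block built on Transformer decoder layers that performs symmetric cross-attention to mutually enrich the shared context and the incoming modality; a Gated Fusion Unit (GFU) then dynamically regulates information flow, selectively retaining or discarding features, so that computational cost grows linearly with the number of modalities.
Link: https://arxiv.org/abs/2507.02985
Authors: Yusuf Shihata
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 2 figures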
Abstract:Multimodal learning faces a fundamental tension between deep, fine-grained fusion and computational scalability. While cross-attention models achieve strong performance through exhaustive pairwise fusion, their quadratic complexity is prohibitive for settings with many modalities. We address this challenge with Gated Recurrent Fusion (GRF), a novel architecture that captures the power of cross-modal attention within a linearly scalable, recurrent pipeline. Our method processes modalities sequentially, updating an evolving multimodal context vector at each step. The core of our approach is a fusion block built on Transformer Decoder layers that performs symmetric cross-attention, mutually enriching the shared context and the incoming modality. This enriched information is then integrated via a Gated Fusion Unit (GFU), a GRU-inspired mechanism that dynamically arbitrates information flow, enabling the model to selectively retain or discard features. This stateful, recurrent design scales linearly with the number of modalities, O(n), making it ideal for high-modality environments. Experiments on the CMU-MOSI benchmark demonstrate that GRF achieves competitive performance compared to more complex baselines. Visualizations of the embedding space further illustrate that GRF creates structured, class-separable representations through its progressive fusion mechanism. Our work presents a robust and efficient paradigm for powerful, scalable multimodal representation learning.
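The sequential, gated update described in this abstract can be sketched in a few lines. This is a simplified illustration, not the paper's architecture: the cross-attention block is omitted, the gate uses element-wise scalar weights instead of learned matrices, and all names and values are assumptions. It shows the GRU-style blend of old context and new candidate, and that n modalities need exactly n fusion steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(context, modality, w_gate, w_cand):
    """GRU-style update: blend the old context with a candidate built from the new modality."""
    fused = []
    for c, m, wg, wc in zip(context, modality, w_gate, w_cand):
        z = sigmoid(wg[0] * c + wg[1] * m)       # update gate in (0, 1)
        cand = math.tanh(wc[0] * c + wc[1] * m)  # candidate context value
        fused.append((1.0 - z) * c + z * cand)   # retain old vs. accept new
    return fused

def recurrent_fuse(modalities, w_gate, w_cand):
    """Process modalities one at a time: n modalities give n fusion steps."""
    context = [0.0] * len(modalities[0])
    steps = 0
    for m in modalities:
        context = gated_fusion(context, m, w_gate, w_cand)
        steps += 1
    return context, steps

# Hypothetical 2-dimensional features for three modalities (e.g. text, audio, video).
modalities = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
w_gate = [(0.5, 0.5), (0.5, 0.5)]  # assumed fixed weights, normally learned
w_cand = [(1.0, 1.0), (1.0, 1.0)]
context, steps = recurrent_fuse(modalities, w_gate, w_cand)
```

By contrast, exhaustive pairwise fusion over n modalities would need on the order of n squared interactions, which is the cost the recurrent design avoids.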
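[NLP-178] From Answers to Rationales: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought
【Quick Read】: This paper addresses the neglect of negative rationales when training Multimodal Large Language Models (MLLMs), which weakens a model's ability to recognize flawed reasoning patterns. The key is a framework called SMART, which uses Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths and then improves reasoning through self-alignment. AoT uses the answer as a guide to help the model extract the key visual information linking question and answer: given the ground-truth answer, the model produces a strong positive rationale; given a misleading answer instead, it produces an erroneous yet compelling reasoning path that serves as a discriminative negative rationale.
Link: https://arxiv.org/abs/2507.02984
Authors: Wentao Tan,Qiong Cao,Yibing Zhan,Chao Xue,Changxing Ding
Affiliations: South China University of Technology; JD Explore Academy, Beijing; Pazhou Lab, Guangzhou
Categories: Computation and Language (cs.CL)
Comments: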
Abstract:Achieving human-like reasoning capabilities in Multimodal Large Language Models (MLLMs) has long been a goal. Current methodologies primarily focus on synthesizing positive rationales, while overlooking the critical role of negative rationales in training models to discern flawed reasoning patterns. To address this gap, we propose a novel framework: Self-Aligning Multimodal Reasoning with Answer-Oriented Chain-of-Thought (SMART). This framework enables models to utilize Answer-Oriented Chain-of-Thought (AoT) prompts to automatically generate high-quality positive and negative reasoning paths, followed by self-alignment to enhance their reasoning abilities. Inspired by human strategies for solving proof-based problems, AoT uses answers as a guide to help the model extract critical visual information that links questions and answers. When provided with ground truth answers, the model produces strong positive rationales. Conversely, when correct answers are replaced with misleading alternatives, the model generates an erroneous yet compelling reasoning path, serving as a form of discriminative negative rationale. Models trained with AoT-generated data outperform those trained on manually annotated datasets, demonstrating superior reasoning capabilities. This encourages the use of improved models to generate higher-quality preference data for further optimization. Consequently, SMART establishes an iterative generation-optimization method that continually enhances the model’s reasoning skills. Experiments indicate that the SMART framework significantly improves various MLLMs, regardless of model architecture, parameter size, or pre-training dataset. The code, datasets, and models will be released.
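[NLP-179] Truth, Trust, and Trouble: Medical AI on the Edge
【Quick Read】: This paper addresses the insufficient factual accuracy, usefulness, and safety of generative AI for automated medical question answering in digital health, especially for open-source solutions. The key is a rigorous benchmarking framework that uses a dataset of over 1,000 health questions to assess models on honesty, helpfulness, and harmlessness, and that examines domain-specific fine-tuning and few-shot prompting as ways to improve model performance.
Link: https://arxiv.org/abs/2507.02983
Authors: Mohammad Anas Azeez,Rafiq Ali,Ebad Shabbir,Zohaib Hasan Siddiqui,Gautam Siddharth Kashyap,Jiechao Gao,Usman Naseem
Affiliations: Jamia Hamdard; DSEU-Okhla; Macquarie University; Center for SDGC; Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: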
Abstract:Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework using a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models – Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting ongoing challenges in clinical QA.
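[NLP-180] We Need Knowledge Distillation for Solving Math Word Problems
【Quick Read】: This paper addresses the cost of applying large language models (LLMs) in mathematics education, which stems from their high computational requirements. The key is to compress the embedding vectors and apply knowledge distillation to build a small student model with only 1/12 of the teacher's parameters. The student retains nearly 90% of the teacher model's performance and generalizes well, and the distillation process is not task-specific.
Link: https://arxiv.org/abs/2507.02982
Authors: Zhenquan Shen,Xinguo Yu,Xiaotian Cheng,Rao Peng,Hao Ming
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: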
Abstract:The enhancement of mathematical capabilities in large language models (LLMs) fosters new developments in mathematics education within primary and secondary schools, particularly as they relate to intelligent tutoring systems. However, LLMs require substantial computational resources, resulting in significant costs in educational contexts. To mitigate this drawback, this paper investigates the feasibility of compressing LLMs for solving math word problems (MWPs). We compress the embedded vectors encoded by BERT and distill a considerably smaller student model. Our findings indicate that the student model can maintain nearly 90% of the performance of the teacher model while utilizing only 1/12 of its parameters. In addition to achieving high accuracy, the model exhibits strong generalizability, as the compressed vectors perform well across all tasks related to MWPs, and the distillation process is not task-specific. The success of this distillation demonstrates that the underlying principles are generic and not limited to a specific task. We further explore the reasons behind the compressibility of embedded vectors, revealing that part-of-speech information, rather than entity recognition, is crucial for MWPs, which may significantly contribute to their compressibility. The improvements in efficiency and cost reduction provide substantial value for intelligent tutoring systems and significantly advance the field of intelligent education.
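For readers unfamiliar with knowledge distillation, the sketch below shows the generic soft-label objective: a KL divergence between temperature-softened teacher and student distributions. The abstract does not specify the paper's loss, so this standard formulation, the temperature value, and the toy logits are assumptions rather than the authors' method.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [1.0, 2.0, 3.0]
loss_same = distillation_loss([1.0, 2.0, 3.0], teacher)  # student matches teacher
loss_diff = distillation_loss([3.0, 2.0, 1.0], teacher)  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what drives the smaller model toward the teacher's behavior.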
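[NLP-181] InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
【Quick Read】: This paper addresses how to safely incorporate private information into long-form text generation with large language models (LLMs). The key is the InvisibleInk framework, which treats sampling from the LLM's next-token distribution as the exponential mechanism over the LLM logits and introduces two innovations: first, it lowers the privacy cost by isolating and clipping only the sensitive information in the model logits; second, it improves text quality by sampling from a small superset of the top-k private tokens. For text of the same utility, the method reduces computation cost 8-fold relative to state-of-the-art baselines across privacy levels.
Link: https://arxiv.org/abs/2507.02974
Authors: Vishnu Vinod,Krishna Pillutla,Abhradeep Guha Thakurta
Affiliations: Centre for Responsible AI (CeRAI); Wadhwani School of Data Science & AI; Google DeepMind
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: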
Abstract:As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive references. It interprets sampling from the LLM’s next-token distribution as the exponential mechanism over the LLM logits with two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling from a small superset of the top-k private tokens. Empirical evaluations demonstrate a consistent 8× reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. In summary, InvisibleInk is able to generate private long-form text at less than 10× the computation cost of non-private generation.
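The first innovation, clipping only the private-minus-public part of the logits before sampling, can be sketched as follows. This is a toy illustration with assumed names and values: it omits the top-k superset step and any privacy accounting, so on its own it provides no differential privacy guarantee.

```python
import math
import random

def clipped_logits(private, public, clip):
    """Clip only the private-minus-public difference to [-clip, clip]."""
    return [pub + max(-clip, min(clip, priv - pub))
            for priv, pub in zip(private, public)]

def softmax(logits, temperature=1.0):
    """Softmax sampling weights; sampling from these mirrors the exponential mechanism."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(private, public, clip, temperature, rng):
    """Sample a token index from the softmax over the clipped logits."""
    probs = softmax(clipped_logits(private, public, clip), temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
private = [5.0, 0.0, -3.0]  # hypothetical logits computed with the sensitive reference
public = [1.0, 0.5, 0.0]    # hypothetical logits computed without it
clipped = clipped_logits(private, public, clip=1.0)
probs = softmax(clipped)
token = sample_token(private, public, clip=1.0, temperature=1.0, rng=rng)
```

Because the public logits pass through unclipped, the clipping bound limits only how far the sensitive reference can move each logit, which is the quantity the privacy cost depends on.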
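[NLP-182] PB-LLMs: Privacy- and Bias-aware NLP Models using Named-Entity Recognition AAAI
【Quick Read】: This paper addresses the legal and ethical challenges that large language models (LLMs) raise in high-stakes AI applications with respect to privacy, data protection, and transparency. The key is to use Named-Entity Recognition (NER) to anonymize sensitive information in text data, protecting user privacy while maintaining system performance. The privacy-preserving framework is evaluated on AI-based resume scoring, where it proves effective at safeguarding candidate confidentiality; combined with an existing method for reducing gender bias, it yields the proposed Privacy- and Bias-aware LLMs (PB-LLMs).
Link: https://arxiv.org/abs/2507.02966
Authors: Gonzalo Mancera,Aythami Morales,Julian Fierrez,Ruben Tolosana,Alejandro Penna,Miguel Lopez-Duran,Francisco Jurado,Alvaro Ortigosa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Presented at AAAI Workshop on Privacy-Preserving Artificial Intelligence (PPAI) 2025, Philadelphia, PA, USA, March 2025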
Abstract:The use of Natural Language Processing (NLP) in high-stakes AI-based applications has increased significantly in recent years, especially since the emergence of Large Language Models (LLMs). However, despite their strong performance, LLMs introduce important legal/ethical concerns, particularly regarding privacy, data protection, and transparency. Due to these concerns, this work explores the use of Named-Entity Recognition (NER) to facilitate the privacy-preserving training (or adaptation) of LLMs. We propose a framework that uses NER technologies to anonymize sensitive information in text data, such as personal identities or geographic locations. An evaluation of the proposed privacy-preserving learning framework was conducted to measure its impact on user privacy and system performance in a particular high-stakes and sensitive setup: AI-based resume scoring for recruitment processes. The study involved two language models (BERT and RoBERTa) and six anonymization algorithms (based on Presidio, FLAIR, BERT, and different versions of GPT) applied to a database of 24,000 candidate profiles. The findings indicate that the proposed privacy preservation techniques effectively maintain system performance while playing a critical role in safeguarding candidate confidentiality, thus promoting trust in the experimented scenario. On top of the proposed privacy-preserving approach, we also experiment applying an existing approach that reduces the gender bias in LLMs, thus finally obtaining our proposed Privacy- and Bias-aware LLMs (PB-LLMs). Note that the proposed PB-LLMs have been evaluated in a particular setup (resume scoring), but are generally applicable to any other LLM-based AI application.
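The anonymization step can be pictured as replacing detected entity spans with placeholder tags. The sketch below assumes the spans have already been produced by some NER tool; it calls no real library, and the function name, span format, and example sentence are hypothetical.

```python
def anonymize(text, entities):
    """Replace each (start, end, label) span with a <LABEL> placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(entities):
        out.append(text[cursor:start])  # keep the text before the entity
        out.append(f"<{label}>")        # substitute the entity itself
        cursor = end
    out.append(text[cursor:])           # keep the remaining tail
    return "".join(out)

text = "Maria Lopez from Madrid applied for the analyst role."
# Hypothetical NER output as character offsets; assumed non-overlapping.
entities = [(0, 11, "PERSON"), (17, 23, "LOCATION")]
redacted = anonymize(text, entities)
```

The redacted text keeps the job-relevant content while dropping the identity and location, which is the kind of input the downstream scoring model would be trained or evaluated on.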
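[NLP-183] Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens
【Quick Read】: This paper addresses the lack of specialized domain knowledge in general-purpose large language models (LLMs), which limits their effectiveness for cybersecurity analysis. The key is Domain-Adaptive Continuous Pretraining (DAP), which strengthens a model's cybersecurity understanding while preserving its general language capabilities. Three decoder-based architectures are systematically adapted on a curated 126-million-word cybersecurity corpus, using constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation.
Link: https://arxiv.org/abs/2507.02964
Authors: Salahuddin Salahuddin,Ahmed Hussain,Jussi Löppönen,Toni Jutila,Panos Papadimitratos
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 15 Pages and 10 Figures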
Abstract:While Large Language Models (LLMs) demonstrate exceptional natural language capabilities, general-purpose models lack specialized domain knowledge for effective cybersecurity analysis. In this work, we investigate Domain-Adaptive Continuous Pretraining (DAP) as a methodology for enhancing cybersecurity understanding in pretrained LLMs while preserving general language capabilities. We systematically adapted three decoder-based architectures – Llama-3.1-8B, DeepSeek-R1-Distill-Qwen-14B, and Llama-3.3-70B-Instruct – using a curated 126-million-word cybersecurity corpus from standards, academic literature, and various other sources. Our approach employed constrained training parameters and distributed FSDP training to balance domain specialization with knowledge preservation. Evaluation across three cybersecurity benchmarks, namely, CTI-MCQ, CyberMetric, and SecEval, demonstrates consistent improvements post-adaptation. The Llama-3.3-70B-Ins-DAP model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864, respectively, outperforming specialized models, including Llama-Primus-Base. Notably, competitive performance was achieved using substantially smaller datasets (118.8 million versus 2.77 billion tokens), demonstrating efficient domain specialization viability. We establish that targeted continuous pretraining enables effective cybersecurity domain adaptation with computational feasibility, providing foundations for specialized AI assistants in threat analysis, vulnerability assessment, and security documentation while challenging prevailing assumptions about data requirements for LLM specialization.
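[NLP-184] RAG-R1: Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
【Quick Read】: This paper targets two problems: large language models (LLMs) tend to produce hallucinated or outdated answers, and existing Retrieval-Augmented Generation (RAG) methods suffer from unstable training, long inference time, and restricted capabilities. The key is the RAG-R1 framework, which lets LLMs adaptively draw on internal and external knowledge during reasoning and extends generation and retrieval from single-query mode to multi-query parallelism, reducing inference time and improving model performance.
Link: https://arxiv.org/abs/2507.02962
Authors: Zhiwen Tan,Jiaming Huang,Qintong Wu,Hongxuan Zhang,Chenyi Zhuang,Jinjie Gu
Affiliations: Ant Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: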
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
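The move from single-query to multi-query retrieval can be illustrated with a thread pool that issues several sub-queries at once. This is a minimal sketch under stated assumptions: the in-memory corpus, the exact-match `retrieve` function, and the query strings are hypothetical stand-ins for a real search backend, and the reinforcement-learning training loop is not shown.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory corpus standing in for a real search backend.
CORPUS = {
    "capital of france": "Paris is the capital of France.",
    "population of paris": "Paris has roughly two million inhabitants.",
}

def retrieve(query):
    """Toy retriever: exact-match lookup; a real system would query a search index."""
    return CORPUS.get(query.lower(), "")

def parallel_retrieve(queries, max_workers=4):
    """Issue all queries concurrently and return results in query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(retrieve, queries))

# Sub-queries a model might emit in a single reasoning step.
queries = ["Capital of France", "Population of Paris", "Unknown topic"]
passages = parallel_retrieve(queries)
```

With a network-bound retriever, issuing the queries together means one round of waiting instead of one per query, which is where the reduction in inference time would come from.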
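[NLP-185] Real-World En Call Center Transcripts Dataset with PII Redaction
【Quick Read】: This paper addresses the scarcity of publicly available real-world call center datasets, in order to support research and development of customer support and sales AI systems. The key is the construction and release of CallCenterEN, a large-scale real-world English call center transcript dataset of 91,706 conversations (10,448 audio hours) with accents from India, the Philippines, and the United States. All personally identifiable information (PII) has been rigorously removed to comply with global data protection laws, and the audio is withheld from the public release to protect biometric privacy.
Link: https://arxiv.org/abs/2507.02958
Authors: Ha Dao,Gaurav Chawla,Raghu Banda,Caleb DeLeeuw
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 17 pages, 4 figures. Dataset publicly available at this https URL. Contains 91,706 real-world English call center transcripts (10,448 audio hours) with PII redaction. Licensed under CC BY-NC 4.0 for non-commercial research use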
Abstract:We introduce CallCenterEN, a large-scale (91,706 conversations, corresponding to 10,448 audio hours), real-world English call center transcript dataset designed to support research and development in customer support and sales AI systems. This is the largest release to date of open source call center transcript data of this kind. The dataset includes inbound and outbound calls between agents and customers, with accents from India, the Philippines and the United States. The dataset includes high-quality, PII-redacted human-readable transcriptions. All personally identifiable information (PII) has been rigorously removed to ensure compliance with global data protection laws. The audio is not included in the public release due to biometric privacy concerns. Given the scarcity of publicly available real-world call center datasets, CallCenterEN fills a critical gap in the landscape of available ASR corpora, and is released under a CC BY-NC 4.0 license for non-commercial research use.
[NLP-186] Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III IJCAI2025
【Quick Read】: This paper addresses the evaluation of large language models (LLMs) for real-world use in finance, aiming to support responsible deployment through rigorous domain-specific assessment. The key is a comprehensive benchmark that evaluates 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam, the gold standard for advanced financial reasoning. The evaluation covers both multiple-choice and essay questions and uses several prompting strategies, including Chain-of-Thought and Self-Discover.
Link: https://arxiv.org/abs/2507.02954
Authors: Pranam Shetty,Abhisek Upadhayaya,Parth Mitesh Shah,Srikanth Jagabathula,Shilpi Nayak,Anna Joo Fee
Affiliations: Rochester Institute of Technology; GoodFin, Inc; New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at FinLLM @ IJCAI 2025
Abstract:As financial institutions increasingly adopt Large Language Models (LLMs), rigorous domain-specific evaluation becomes critical for responsible deployment. This paper presents a comprehensive benchmark evaluating 23 state-of-the-art LLMs on the Chartered Financial Analyst (CFA) Level III exam - the gold standard for advanced financial reasoning. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III. These results, achieved under a revised, stricter essay grading methodology, indicate significant progress in LLM capabilities for high-stakes financial applications. Our findings provide crucial guidance for practitioners on model selection and highlight remaining challenges in cost-effective deployment and the need for nuanced interpretation of performance against professional benchmarks.
[NLP-187] Evaluating AI Counseling in Japanese: Counselor Client and Evaluator Roles Assessed by Motivational Interviewing Criteria
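[Quick Read]: This paper evaluates LLM performance across three counseling roles in Japanese-language therapeutic contexts, aiming to establish benchmarks for AI-assisted counseling in non-English settings and to identify areas for improvement. The key finding is that Structured Multi-step Dialogue Prompts (SMDP) significantly improve counselor AI performance. Comparing the ratings of evaluation AI systems with those of human experts also reveals model-specific biases, and the authors point to advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning as paths to improvement.
Link: https://arxiv.org/abs/2507.02950
Authors: Keita Kiuchi,Yoshikazu Fujimoto,Hideyuki Goto,Tomonori Hosokawa,Makoto Nishimura,Yosuke Sato,Izumi Sezai
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 69 pages, 0 figures, 9 tables; data and code at this https URL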
Abstract:This study provides the first comprehensive evaluation of large language model (LLM) performance across three counseling roles in Japanese-language therapeutic contexts. We simultaneously assessed counselor artificial intelligence (AI) systems (GPT-4-turbo with zeroshot prompting or Structured Multi-step Dialogue Prompts (SMDP), Claude-3-Opus-SMDP), client AI simulations, and evaluation AI systems (o3, Claude-3.7-Sonnet, Gemini-2.5-pro). Human experts (n = 15) with extensive counseling experience evaluated AI-generated dialogues using the Motivational Interviewing Treatment Integrity (MITI) Coding Manual 4.2.1. Notably, SMDP implementation significantly enhanced counselor AI performance across all MITI global ratings compared with zeroshot prompting, with no significant differences between GPT-SMDP and Opus-SMDP. Evaluation AIs showed comparable performance to human raters for Cultivating Change Talk but systematically overestimated Softening Sustain Talk and the overall quality metrics. Model-specific biases emerged: Gemini emphasized power-sharing, o3 focused on technical proficiency, and Sonnet prioritized emotional expression. Client AI simulations exhibited a limited emotional range and unnaturally high compliance, indicating the need for enhanced realism. These findings establish benchmarks for AI-assisted counseling in non-English contexts and identify critical areas for improvement through advanced prompt engineering, retrieval-augmented generation, and targeted fine-tuning, with important implications for developing culturally sensitive AI mental health tools. 
[NLP-188] RADIANT: Retrieval AugmenteD entIty-context AligNmenT – Introducing RAG-ability and Entity-Context Divergence
[Quick Read]: This paper addresses the difficulty LLMs have in faithfully integrating retrieved information during Retrieval-Augmented Generation (RAG), which leads to factual inconsistencies. The key is Radiant, a framework that combines RAG with alignment to optimize the interplay between retrieved evidence and generated content. It extends Direct Preference Optimization (DPO) to teach models to integrate the additional information they are given, improving RAG performance.
Link: https://arxiv.org/abs/2507.02949
Authors: Vipula Rawte,Rajarshi Roy,Gurpreet Singh,Danush Khanna,Yaswanth Narsupalli,Basab Ghosh,Abhay Gupta,Argha Kamal Samanta,Aditya Shingote,Aadi Krishna Vikram,Vinija Jain,Aman Chadha,Amit Sheth,Amitava Das
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As Large Language Models (LLMs) continue to advance, Retrieval-Augmented Generation (RAG) has emerged as a vital technique to enhance factual accuracy by integrating external knowledge into the generation process. However, LLMs often fail to faithfully integrate retrieved evidence into their generated responses, leading to factual inconsistencies. To quantify this gap, we introduce Entity-Context Divergence (ECD), a metric that measures the extent to which retrieved information is accurately reflected in model outputs. We systematically evaluate contemporary LLMs on their ability to preserve factual consistency in retrieval-augmented settings, a capability we define as RAG-ability. Our empirical analysis reveals that RAG-ability remains low across most LLMs, highlighting significant challenges in entity retention and context fidelity. This paper introduces Radiant (Retrieval AugmenteD entIty-context AligNmenT), a novel framework that merges RAG with alignment designed to optimize the interplay between retrieved evidence and generated content. Radiant extends Direct Preference Optimization (DPO) to teach LLMs how to integrate provided additional information into subsequent generations. As a behavior correction mechanism, Radiant boosts RAG performance across varied retrieval scenarios, such as noisy web contexts, knowledge conflicts, and hallucination reduction. This enables more reliable, contextually grounded, and factually coherent content generation.
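For context on the DPO step mentioned above: the abstract does not state Radiant's objective, so the block below only restates the standard DPO loss that Radiant is said to extend. The symbols are the conventional ones and are assumptions here, not notation taken from the paper: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} a frozen reference model, (x, y_w, y_l) a prompt with a preferred and a dispreferred response, \beta a temperature, and \sigma the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

In a retrieval-augmented setting, x would presumably include the retrieved context alongside the query, but how Radiant constructs preference pairs or modifies this loss is not specified in the abstract.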
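[NLP-189] The Application of Large Language Models on Major Depressive Disorder Support Based on African Natural Products
[Quick Read]: This paper addresses the limitations of current antidepressant treatments: delayed onset of action, significant side effects, and treatment resistance in a substantial portion of patients. The key is to integrate natural products from African traditional medicine with large language model technology, combining traditional knowledge with modern AI to build an accessible, evidence-based mental health support system. The system uses DeepSeek's language model capabilities to provide evidence-based information on African herbal medicines, their clinical applications, safety considerations, and therapeutic protocols, with the aim of offering personalized, culturally appropriate depression support.
Link: https://arxiv.org/abs/2507.02947
Authors: Linyan Zou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Comments: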
Abstract:Major depressive disorder represents one of the most significant global health challenges of the 21st century, affecting millions of people worldwide and creating substantial economic and social burdens. While conventional antidepressant therapies have provided relief for many individuals, their limitations including delayed onset of action, significant side effects, and treatment resistance in a substantial portion of patients have prompted researchers and healthcare providers to explore alternative therapeutic approaches (Kasneci et al.). African traditional medicine, with its rich heritage of plant-based remedies developed over millennia, offers a valuable resource for developing novel antidepressant treatments that may address some of these limitations. This paper examines the integration of large language models with African natural products for depression support, combining traditional knowledge with modern artificial intelligence technology to create accessible, evidence-based mental health support systems. The research presented here encompasses a comprehensive analysis of African medicinal plants with documented antidepressant properties, their pharmacological mechanisms, and the development of an AI-powered support system that leverages DeepSeek’s advanced language model capabilities. The system provides evidence-based information about African herbal medicines, their clinical applications, safety considerations, and therapeutic protocols while maintaining scientific rigor and appropriate safety standards. Our findings demonstrate the potential for large language models to serve as bridges between traditional knowledge and modern healthcare, offering personalized, culturally appropriate depression support that honors both traditional wisdom and contemporary medical understanding. 
[NLP-190] GameTileNet: A Semantic Dataset for Low-Resolution Game Art in Procedural Content Generation
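[Quick Read]: This paper addresses the lack of semantic labels for low-resolution digital game art, in order to advance procedural content generation (PCG) and related AI research framed as a vision-language alignment task. Generative AI and LLMs already help indie developers create visual assets, but aligning generated visuals with game narratives remains difficult, and automatically generated content shows limited visual diversity because training data is imbalanced across styles. The key is the GameTileNet dataset, which collects artist-created game tiles released under Creative Commons licenses and adds semantic annotations to support narrative-driven content generation. It also introduces an object detection pipeline for low-resolution tile-based game art and annotates semantics, connectivity, and object classifications.
Link: https://arxiv.org/abs/2507.02941
Authors: Yi-Chun Chen,Arnav Jhala
Affiliations: Yale University; North Carolina State University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: Note: This is a preprint version of a paper submitted to AIIDE 2025. It includes additional discussion of limitations and future directions that were omitted from the conference version due to space constraints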
Abstract:GameTileNet is a dataset designed to provide semantic labels for low-resolution digital game art, advancing procedural content generation (PCG) and related AI research as a vision-language alignment task. Large Language Models (LLMs) and image-generative AI models have enabled indie developers to create visual assets, such as sprites, for game interactions. However, generating visuals that align with game narratives remains challenging due to inconsistent AI outputs, requiring manual adjustments by human artists. The diversity of visual representations in automatically generated game content is also limited because of the imbalance in distributions across styles for training data. GameTileNet addresses this by collecting artist-created game tiles from this http URL under Creative Commons licenses and providing semantic annotations to support narrative-driven content generation. The dataset introduces a pipeline for object detection in low-resolution tile-based game art (e.g., 32x32 pixels) and annotates semantics, connectivity, and object classifications. GameTileNet is a valuable resource for improving PCG methods, supporting narrative-rich game content, and establishing a baseline for object detection in low-resolution, non-photorealistic images. TL;DR: GameTileNet is a semantic dataset of low-resolution game tiles designed to support narrative-driven procedural content generation through visual-language alignment.
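[NLP-191] Towards a Comparative Framework for Compositional AI Models
[Quick Read]: This paper addresses compositional generalisation and compositional interpretability in natural language processing: how a model can generalise to unseen data by learning compositional rules, and how its behaviour can be understood by inspecting its modular components and the way they are combined. The approach builds on the DisCoCirc framework, which constructs compositional models of text by combining units for individual words according to the grammatical structure of the text. The paper states these notions in a framework-agnostic way using the language of category theory and adapts a series of tests to assess compositional generalisation.
Link: https://arxiv.org/abs/2507.02940
Authors: Tiffany Duneau
Affiliations: Quantinuum; University of Oxford
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: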
Abstract:The DisCoCirc framework for natural language processing allows the construction of compositional models of text, by combining units for individual words together according to the grammatical structure of the text. The compositional nature of a model can give rise to two things: compositional generalisation – the ability of a model to generalise outside its training distribution by learning compositional rules underpinning the entire data distribution – and compositional interpretability – making sense of how the model works by inspecting its modular components in isolation, as well as the processes through which these components are combined. We present these notions in a framework-agnostic way using the language of category theory, and adapt a series of tests for compositional generalisation to this setting. Applying this to the DisCoCirc framework, we consider how well a selection of models can learn to compositionally generalise. We compare both quantum circuit based models, as well as classical neural networks, on a dataset derived from one of the bAbI tasks, extended to test a series of aspects of compositionality. Both architectures score within 5% of one another on the productivity and substitutivity tasks, but differ by at least 10% for the systematicity task, and exhibit different trends on the overgeneralisation tasks. Overall, we find the neural models are more prone to overfitting the Train data. Additionally, we demonstrate how to interpret a compositional model on one of the trained models. By considering how the model components interact with one another, we explain how the model behaves.
[NLP-192] A Large Language Model-Empowered Agent for Reliable and Robust Structural Analysis
[Quick Read]: This paper addresses the insufficient quantitative reliability and robustness of LLMs in structural mechanics analysis. It proposes reframing structural analysis as a code generation task and develops an LLM-empowered agent. The key is to combine chain-of-thought and few-shot prompting to generate accurate OpenSeesPy code, then execute that code automatically to obtain the analysis results. Experiments show the agent exceeds 99.0% accuracy on the benchmark dataset, with reliable and robust performance.
Link: https://arxiv.org/abs/2507.02938
Authors: Jiachen Liu,Ziheng Geng,Ran Cao,Lu Cheng,Paolo Bocchini,Minghui Cheng
Affiliations: University of Miami; Hunan University; University of Illinois Chicago; Lehigh University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have exhibited remarkable capabilities across diverse open-domain tasks, yet their application in specialized domains such as civil engineering remains largely unexplored. This paper starts bridging this gap by evaluating and enhancing the reliability and robustness of LLMs in structural analysis of beams. Reliability is assessed through the accuracy of correct outputs under repetitive runs of the same problems, whereas robustness is evaluated via the performance across varying load and boundary conditions. A benchmark dataset, comprising eight beam analysis problems, is created to test the Llama-3.3 70B Instruct model. Results show that, despite a qualitative understanding of structural mechanics, the LLM lacks the quantitative reliability and robustness for engineering applications. To address these limitations, a shift is proposed that reframes the structural analysis as code generation tasks. Accordingly, an LLM-empowered agent is developed that (a) integrates chain-of-thought and few-shot prompting to generate accurate OpenSeesPy code, and (b) automatically executes the code to produce structural analysis results. Experimental results demonstrate that the agent achieves accuracy exceeding 99.0% on the benchmark dataset, exhibiting reliable and robust performance across diverse conditions. Ablation studies highlight the complete example and function usage examples as the primary contributors to the agent’s enhanced performance.
[NLP-193] Theory of Mind in Action: The Instruction Inference Task
[Quick Read]: This paper addresses how to assess an agent's Theory of Mind (ToM) in a dynamic, goal-oriented, collaborative environment. The key is a new task, Instruction Inference, together with Tomcat, an LLM-based agent that interprets and responds to a principal's indirect or ambiguous instructions. Tomcat comes in two variants: Fs-CoT, which uses a few examples demonstrating structured chain-of-thought reasoning, and CP, which relies on a commonsense prompt with information about the problem.
Link: https://arxiv.org/abs/2507.02935
Authors: Fardin Saad,Pradeep K. Murukannaiah,Munindar P. Singh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Submitted to Artificial Intelligence Journal (under review). 51 pages with appendix (28 pages article + 4 pages references + 19 pages appendix), 7 figures (Appendix: 26 Figures), 6 tables. Code available at: this https URL
Abstract:The Theory of Mind (ToM) refers to an agent’s capacity to infer the mental states of other agents. ToM is essential for effective collaboration. To assess ToM in a dynamic, goal-oriented, and collaborative environment, we introduce a novel task, Instruction Inference, in which an agent assists a principal in reaching a goal by interpreting indirect or ambiguous instructions. We present Tomcat, an LLM-based agent, designed to exhibit ToM reasoning in interpreting and responding to the principal’s instructions. We implement two variants of Tomcat. One, dubbed Fs-CoT, is based on a small number of examples (i.e., few-shot or Fs) demonstrating the requisite structured reasoning (i.e., chain-of-thought or CoT). One, dubbed CP, relies on commonsense knowledge and information about the problem (i.e., commonsense prompt or CP). We realized both variants of Tomcat on three leading large language models (LLMs), namely, GPT-4o, DeepSeek-R1, and Gemma-3-27B. To evaluate the effectiveness of Tomcat, we conducted a study with 52 human participants in which we provided participants with the same information as the CP variant of Tomcat. We computed intent accuracy, action optimality, and planning optimality to measure the ToM capabilities of Tomcat and our study participants. We found that Tomcat with Fs-CoT, particularly with GPT-4o and DeepSeek-R1, achieves performance comparable to the human participants, underscoring its ToM potential for human-AI collaboration.
[NLP-194] Mitigating Hidden Confounding by Progressive Confounder Imputation via Large Language Models
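[Quick Read]: This paper addresses hidden confounding when estimating treatment effects from observational data, where unobserved variables can bias causal estimates. The key is ProCI (Progressive Confounder Imputation), a framework that draws on the semantic reasoning ability and embedded world knowledge of LLMs to iteratively generate, impute, and validate hidden confounders, thereby mitigating their effect.
Link: https://arxiv.org/abs/2507.02928
Authors: Hao Yang,Haoxuan Li,Luyu Chen,Haoxiang Wang,Xu Chen,Mingming Gong
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: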
Abstract:Hidden confounding remains a central challenge in estimating treatment effects from observational data, as unobserved variables can lead to biased causal estimates. While recent work has explored the use of large language models (LLMs) for causal inference, most approaches still rely on the unconfoundedness assumption. In this paper, we make the first attempt to mitigate hidden confounding using LLMs. We propose ProCI (Progressive Confounder Imputation), a framework that elicits the semantic and world knowledge of LLMs to iteratively generate, impute, and validate hidden confounders. ProCI leverages two key capabilities of LLMs: their strong semantic reasoning ability, which enables the discovery of plausible confounders from both structured and unstructured inputs, and their embedded world knowledge, which supports counterfactual reasoning under latent confounding. To improve robustness, ProCI adopts a distributional reasoning strategy instead of direct value imputation to prevent the collapsed outputs. Extensive experiments demonstrate that ProCI uncovers meaningful confounders and significantly improves treatment effect estimation across various datasets and LLMs.
[NLP-195] A Unified Speech LLM for Diarization and Speech Recognition in Multilingual Conversations
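[Quick Read]: This paper addresses the limited effectiveness of Speech Large Language Models (Speech LLMs) in real-world multilingual conversations, which stems from data scarcity. The key is a unified speech LLM that jointly performs speaker diarization and automatic speech recognition (ASR) end to end. By reformulating the training data format and modifying the inference procedure, the model mitigates the ambiguity inherent in pre-segmented audio and achieves a 54.87% relative improvement over the baseline on Task II.
Link: https://arxiv.org/abs/2507.02927
Authors: Phurich Saengthong,Boonnithi Jiaramaneepinit,Sheng Li,Manabu Okumura,Takahiro Shinozaki
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: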
Abstract:Speech Large Language Models (Speech LLMs) have emerged as a crucial paradigm in recent years, extending the capabilities of traditional LLMs to speech tasks such as automatic speech recognition (ASR) and spoken dialogue modeling. However, their effectiveness in real-world multilingual conversations remains limited by the scarcity of data that captures natural conversational phenomena. To address this, the MLC-SLM Challenge provides a multilingual conversational dataset and evaluates models on two tasks: ASR with oracle segmentation (Task I) and joint diarization and recognition without oracle information (Task II). In this paper, we focus on Task II and propose a unified speech LLM that jointly performs diarization and ASR in an end-to-end manner. By reformulating the training data format and modifying the inference procedure, our model addresses the ambiguity inherent in pre-segmented audio and achieves a 54.87% relative improvement in tcpWER/tcpCER over the baseline, ranking 8th overall, despite using a smaller LLM backbone. We also report results from Task I using a fine-tuned speech LLM.
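To make the "relative improvement" figure concrete, here is a minimal sketch of how a relative error-rate reduction is usually computed. The function name and the error rates are hypothetical and chosen only for illustration; the abstract reports the 54.87% result but not the underlying tcpWER/tcpCER values.

```python
def relative_improvement(baseline_error: float, system_error: float) -> float:
    """Relative error reduction, as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical error rates (percent), not taken from the paper.
baseline = 20.0
system = 10.0
print(f"{relative_improvement(baseline, system):.2f}%")  # prints 50.00%
```

Because the measure is relative, the same percentage can correspond to very different absolute gains depending on how high the baseline error is.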
[NLP-196] Large Language Model Agent for Modular Task Execution in Drug Discovery
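[Quick Read]: This paper addresses the automation and efficient execution of key tasks in the early-stage computational drug discovery pipeline: biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. The key is a modular, LLM-powered framework that combines LLM reasoning with domain-specific tools, automating the workflow from data retrieval through molecular refinement.
Link: https://arxiv.org/abs/2507.02925
Authors: Janghoon Ock,Radheesh Sharma Meda,Srivathsan Badrinarayanan,Neha S. Aluru,Achuth Chandrasekhar,Amir Barati Farimani
Affiliations: Carnegie Mellon University; Texas A&M University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments: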
Abstract:We present a modular framework powered by large language models (LLMs) that automates and streamlines key tasks across the early-stage computational drug discovery pipeline. By combining LLM reasoning with domain-specific tools, the framework performs biomedical data retrieval, domain-specific question answering, molecular generation, property prediction, property-aware molecular refinement, and 3D protein-ligand structure generation. In a case study targeting BCL-2 in lymphocytic leukemia, the agent autonomously retrieved relevant biomolecular information (including FASTA sequences, SMILES representations, and literature) and answered mechanistic questions with improved contextual accuracy over standard LLMs. It then generated chemically diverse seed molecules and predicted 67 ADMET-related properties, which guided iterative molecular refinement. Across two refinement rounds, the number of molecules with QED > 0.6 increased from 34 to 55, and those passing at least four out of five empirical drug-likeness rules rose from 29 to 52, within a pool of 194 molecules. The framework also employed Boltz-2 to generate 3D protein-ligand complexes and provide rapid binding affinity estimates for candidate compounds. These results demonstrate that the approach effectively supports molecular screening, prioritization, and structure evaluation. Its modular design enables flexible integration of evolving tools and models, providing a scalable foundation for AI-assisted therapeutic discovery.
[NLP-197] ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models
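[Quick Read]: This paper examines structural inconsistency and homogenization when LLMs are used to simulate human opinions: response accuracy does not hold across demographic aggregation levels, and minority opinions are underrepresented. The paper proposes an "accuracy-optimization hypothesis", under which homogenization stems from the model prioritizing modal responses, and argues that this undermines the validity of LLMs as direct substitutes for human survey data.
Link: https://arxiv.org/abs/2507.02919
Authors: Dai Li,Linzhuo Li,Huilian Sophie Qiu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: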
Abstract:Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as “silicon samples” for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy doesn’t hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta’s Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an “accuracy-optimization hypothesis,” suggesting homogenization stems from prioritizing modal responses. These issues challenge the validity of using LLMs, especially chatbots AI, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.
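One simple way to picture homogenization is to compare how concentrated two sets of answers are. The sketch below is an illustration under stated assumptions, not the paper's method: the two metrics (share of the modal answer, Shannon entropy of the answer distribution) and the response lists are hypothetical.

```python
import math
from collections import Counter


def modal_share(responses):
    """Fraction of responses that fall on the single most common answer."""
    counts = Counter(responses)
    return max(counts.values()) / len(responses)


def response_entropy(responses):
    """Shannon entropy (in bits) of the empirical answer distribution."""
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical answers to one survey question (not ANES data).
human = ["A"] * 5 + ["B"] * 3 + ["C"] * 2
llm = ["A"] * 9 + ["B"] * 1

for name, sample in (("human", human), ("llm", llm)):
    print(name, round(modal_share(sample), 2), round(response_entropy(sample), 3))
```

In this toy data the simulated sample piles onto the modal answer and drops option C entirely, which shows up as a higher modal share and a lower entropy than the human sample.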
[NLP-198] Loki's Dance of Illusions: A Comprehensive Survey of Hallucination in Large Language Models
[Quick Read]: This paper addresses "hallucinations" in Large Language Models (LLMs), where a model produces information that appears factual but is fabricated. In accuracy-critical fields such as finance, law, and healthcare, this can have serious consequences. The paper systematically analyzes the causes of hallucinations, detection methods, and the effectiveness of existing solutions, and examines why those solutions work, in order to provide a basis for more effective and innovative mitigation strategies.
Link: https://arxiv.org/abs/2507.02870
Authors: Chaozhuo Li,Pengbo Wang,Chenxu Wang,Litian Zhang,Zheng Liu,Qiwei Ye,Yuanbo Xu,Feiran Huang,Xi Zhang,Philip S. Yu
Affiliations: Beijing University of Posts and Telecommunications; Shihezi University; Beihang University; Beijing Academy of Artificial Intelligence; Jilin University; Central South University; Jinan University
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Edgar Allan Poe noted, “Truth often lurks in the shadow of error,” highlighting the deep complexity intrinsic to the interplay between truth and falsehood, notably under conditions of cognitive and informational asymmetry. This dynamic is strikingly evident in large language models (LLMs). Despite their impressive linguistic generation capabilities, LLMs sometimes produce information that appears factually accurate but is, in reality, fabricated, an issue often referred to as ‘hallucinations’. The prevalence of these hallucinations can mislead users, affecting their judgments and decisions. In sectors such as finance, law, and healthcare, such misinformation risks causing substantial economic losses, legal disputes, and health risks, with wide-ranging consequences. In our research, we have methodically categorized, analyzed the causes, detection methods, and solutions related to LLM hallucinations. Our efforts have particularly focused on understanding the roots of hallucinations and evaluating the efficacy of current strategies in revealing the underlying logic, thereby paving the way for the development of innovative and potent approaches. By examining why certain measures are effective against hallucinations, our study aims to foster a comprehensive approach to tackling this issue within the domain of LLMs.
[NLP-199] From Individuals to Interactions: Benchmarking Gender Bias in Multimodal Large Language Models from the Lens of Social Relationship
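[Quick Read]: This paper addresses the risk that Multimodal Large Language Models (MLLMs) encode and amplify gender bias in tasks involving visual and textual modalities, particularly in socially sensitive applications. Existing benchmarks mostly evaluate bias in isolated scenarios and overlook bias that emerges subtly through interpersonal interactions. The key is Genres, a new benchmark that evaluates gender bias in MLLMs through the lens of social relationships. It uses a dual-character profile and narrative generation task to capture rich interpersonal dynamics and supports fine-grained bias evaluation across multiple dimensions, revealing context-sensitive gender biases that are not evident in single-character settings.
Link: https://arxiv.org/abs/2506.23101
Authors: Yue Xu,Wenjie Wang
Affiliations: ShanghaiTech University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: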
Abstract:Multimodal large language models (MLLMs) have shown impressive capabilities across tasks involving both visual and textual modalities. However, growing concerns remain about their potential to encode and amplify gender bias, particularly in socially sensitive applications. Existing benchmarks predominantly evaluate bias in isolated scenarios, overlooking how bias may emerge subtly through interpersonal interactions. We fill this gap by going beyond single-entity evaluation and instead focusing on a deeper examination of relational and contextual gender bias in dual-individual interactions. We introduce Genres, a novel benchmark designed to evaluate gender bias in MLLMs through the lens of social relationships in generated narratives. Genres assesses gender bias through a dual-character profile and narrative generation task that captures rich interpersonal dynamics and supports a fine-grained bias evaluation suite across multiple dimensions. Experiments on both open- and closed-source MLLMs reveal persistent, context-sensitive gender biases that are not evident in single-character settings. Our findings underscore the importance of relationship-aware benchmarks for diagnosing subtle, interaction-driven gender bias in MLLMs and provide actionable insights for future bias mitigation.
[NLP-200] Interleaving Logic and Counting
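[Quick Read]: This paper studies reasoning with quantifier expressions in natural language, which combines logical and arithmetical features, and its extension into the broader practice of "grassroots mathematics". It starts from first-order logic with counting operators and cardinality comparisons, then focuses on monadic first-order logic with counting, a fragment that can represent numerical syllogisms and basic reasoning about comparative size. Using normal forms, the paper determines which arithmetical notions are definable on finite and infinite models, which logical notions are definable from purely arithmetical ones, and what kinds of (non-)classical logics result. Normal form methods are also applied to a series of strengthenings: the monadic second-order version is close to additive Presburger Arithmetic, while versions with tuple counting lead to Diophantine equations and make the logic undecidable.
Link: https://arxiv.org/abs/2507.05219
Authors: Johan van Benthem,Thomas Icard
Affiliations: Unknown
Subjects: Logic (math.LO); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
Comments: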
Abstract:Reasoning with quantifier expressions in natural language combines logical and arithmetical features, transcending strict divides between qualitative and quantitative. Our topic is this cooperation of styles as it occurs in common linguistic usage and its extension into the broader practice of natural language plus “grassroots mathematics”. We begin with a brief review of first-order logic with counting operators and cardinality comparisons. This system is known to be of high complexity, and drowns out finer aspects of the combination of logic and counting. We move to a small fragment that can represent numerical syllogisms and basic reasoning about comparative size: monadic first-order logic with counting. We provide normal forms that allow for axiomatization, determine which arithmetical notions can be defined on finite and on infinite models, and conversely, we discuss which logical notions can be defined out of purely arithmetical ones, and what sort of (non-)classical logics can be induced. Next, we investigate a series of strengthenings, again using normal form methods. The monadic second-order version is close, in a precise sense, to additive Presburger Arithmetic, while versions with the natural device of tuple counting take us to Diophantine equations, making the logic undecidable. We also define a system that combines basic modal logic over binary accessibility relations with counting, needed to formulate ubiquitous reasoning patterns such as the Pigeonhole Principle. We return to our starting point in natural language, confronting the architecture of our formal systems with linguistic quantifier vocabulary and syntax. We conclude with some general thoughts on yet further entanglements of logic and counting in formal systems, on rethinking the qualitative/quantitative divide, and on connecting our analysis to empirical findings in cognitive science. 
Journal reference: Bulletin of Symbolic Logic (2023); 29(4) pp. 503-587. Related DOI: https://doi.org/10.1017/bsl.2023.30
[NLP-201] Does Overnight News Explain Overnight Returns?
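[Quick Read]: This paper seeks to explain why, over the past 30 years, nearly all U.S. stock market gains have been earned overnight while average intraday returns have been negative or flat. The key is to explain the effect through features of intraday and overnight news, using a novel supervised topic analysis technique that selects news topics by their ability to explain contemporaneous market returns. The study finds that time variation in the prevalence of news topics and differences in how the market responds to those topics both contribute to the gap between intraday and overnight returns.
Link: https://arxiv.org/abs/2507.04481
Authors: Paul Glasserman,Kriste Krstovski,Paul Laliberte,Harry Mamaysky
Affiliations: Columbia Business School
Subjects: Trading and Market Microstructure (q-fin.TR); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: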
Abstract:Over the past 30 years, nearly all the gains in the U.S. stock market have been earned overnight, while average intraday returns have been negative or flat. We find that a large part of this effect can be explained through features of intraday and overnight news. Our analysis uses a collection of 2.4 million news articles. We apply a novel technique for supervised topic analysis that selects news topics based on their ability to explain contemporaneous market returns. We find that time variation in the prevalence of news topics and differences in the responses to news topics both contribute to the difference in intraday and overnight returns. In out-of-sample tests, our approach forecasts which stocks will do particularly well overnight and particularly poorly intraday. Our approach also helps explain patterns of continuation and reversal in intraday and overnight returns. We contrast the effect of news with other mechanisms proposed in the literature to explain overnight returns.
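For readers unfamiliar with the overnight/intraday split, the sketch below shows the usual decomposition of a close-to-close return: overnight is previous close to open, intraday is open to close. The function name and prices are hypothetical, and the abstract does not say exactly how the authors construct their return series.

```python
def split_returns(prev_close: float, open_price: float, close_price: float):
    """Split one day's close-to-close return into overnight and intraday parts."""
    overnight = open_price / prev_close - 1.0   # previous close -> today's open
    intraday = close_price / open_price - 1.0   # today's open -> today's close
    return overnight, intraday


# Hypothetical prices for a single stock on a single day.
overnight, intraday = split_returns(prev_close=100.0, open_price=101.0, close_price=100.5)

# The two legs compound back to the full close-to-close return.
total = (1.0 + overnight) * (1.0 + intraday) - 1.0
print(f"overnight={overnight:.4f} intraday={intraday:.4f} total={total:.4f}")
```

In this made-up example the stock gains overnight and gives part of it back intraday, which is the sign pattern the abstract describes on average.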
[NLP-202] Reconstructing Biological Pathways by Applying Selective Incremental Learning to (Very) Small Language Models ALT
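Quick read: This paper addresses the limited accuracy of generative AI models in biomedical applications caused by "hallucinations", specifically when predicting regulatory interactions between molecular components in intracellular pathways. The key to its solution is to use a much smaller, domain-specific language model (LM) trained with an active learning strategy that selects the most informative examples, improving prediction accuracy while using less data.
Link: https://arxiv.org/abs/2507.04432
Authors: Pranta Saha, Joyce Reimer, Brook Byrns, Connor Burbridge, Neeraj Dhar, Jeffrey Chen, Steven Rayan, Gordon Broderick
Affiliations: Unknown
Subjects: Molecular Networks (q-bio.MN); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG); Performance (cs.PF)
Comments: 9 pages, 6 figures, 3 tables + 28 pages of supplemental tables; submitted to 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM BCB 2025) as submission no. 76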
Abstract:The use of generative artificial intelligence (AI) models is becoming ubiquitous in many fields. Though progress continues to be made, general purpose large language AI models (LLM) show a tendency to deliver creative answers, often called “hallucinations”, which have slowed their application in the medical and biomedical fields where accuracy is paramount. We propose that the design and use of much smaller, domain and even task-specific LM may be a more rational and appropriate use of this technology in biomedical research. In this work we apply a very small LM by today’s standards to the specialized task of predicting regulatory interactions between molecular components to fill gaps in our current understanding of intracellular pathways. Toward this we attempt to correctly posit known pathway-informed interactions recovered from manually curated pathway databases by selecting and using only the most informative examples as part of an active learning scheme. With this example we show that a small (~110 million parameters) LM based on a Bidirectional Encoder Representations from Transformers (BERT) architecture can propose molecular interactions relevant to tuberculosis persistence and transmission with over 80% accuracy using less than 25% of the ~520 regulatory relationships in question. Using information entropy as a metric for the iterative selection of new tuning examples, we also find that increased accuracy is driven by favoring the use of the incorrectly assigned statements with the highest certainty (lowest entropy). In contrast, the concurrent use of correct but least certain examples contributed little and may have even been detrimental to the learning rate.
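The selection rule mentioned at the end of the abstract above (favoring incorrectly assigned statements with the lowest entropy) might look roughly like this. It is a minimal sketch, not the authors' code: the dictionary keys `probs` and `label`, the function names, and the example pool are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def select_confident_errors(examples, k):
    """Return the k misclassified examples with the lowest predictive entropy.

    Each example is a dict with hypothetical keys:
      'probs': predicted class probabilities
      'label': index of the true class
    """
    wrong = []
    for ex in examples:
        predicted = max(range(len(ex["probs"])), key=lambda i: ex["probs"][i])
        if predicted != ex["label"]:
            wrong.append(ex)
    # Lowest entropy first: the model was most certain about these mistakes.
    wrong.sort(key=lambda ex: entropy(ex["probs"]))
    return wrong[:k]


pool = [
    {"id": "a", "probs": [0.95, 0.05], "label": 1},  # wrong, very confident
    {"id": "b", "probs": [0.55, 0.45], "label": 1},  # wrong, uncertain
    {"id": "c", "probs": [0.90, 0.10], "label": 0},  # correct
    {"id": "d", "probs": [0.20, 0.80], "label": 0},  # wrong, fairly confident
]
chosen = [ex["id"] for ex in select_confident_errors(pool, 2)]
```

In an active learning loop, the selected examples would be added to the tuning set before the next round; how many to add per round is a design choice the abstract does not specify.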
[NLP-203] MMMOS: Multi-domain Multi-axis Audio Quality Assessment
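Quick read: This paper targets a limitation of existing non-intrusive audio quality assessment models, which predict only a single Mean Opinion Score (MOS) and therefore cannot capture how speech, music, and environmental sounds differ along four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness. The key to its solution is MMMOS, a system that fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions to provide no-reference, multi-domain audio quality assessment.
Link: https://arxiv.org/abs/2507.04094
Authors: Yi-Cheng Lin, Jia-Hung Chen, Hung-yi Lee
Affiliations: National Taiwan University
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 4 pages including 1 page of reference. ASRU Audio MOS 2025 Challenge paper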
Abstract:Accurate audio quality estimation is essential for developing and evaluating audio generation, retrieval, and enhancement systems. Existing non-intrusive assessment models predict a single Mean Opinion Score (MOS) for speech, merging diverse perceptual factors and failing to generalize beyond speech. We propose MMMOS, a no-reference, multi-domain audio quality assessment system that estimates four orthogonal axes: Production Quality, Production Complexity, Content Enjoyment, and Content Usefulness across speech, music, and environmental sounds. MMMOS fuses frame-level embeddings from three pretrained encoders (WavLM, MuQ, and M2D) and evaluates three aggregation strategies with four loss functions. By ensembling the top eight models, MMMOS shows a 20-30% reduction in mean squared error and a 4-5% increase in Kendall’s \tau versus baseline, gains first place in six of eight Production Complexity metrics, and ranks among the top three on 17 of 32 challenge metrics.
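The two metrics quoted in the abstract above, mean squared error and Kendall's tau, can be computed for an ensemble as in the sketch below. Assumptions are flagged loudly: the abstract does not say how the top eight models are combined, so simple averaging is used here as a stand-in, and the scores and function names are hypothetical.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def mse(pred, target):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def ensemble_mean(predictions):
    """Average per-item predictions across models (assumed ensembling rule)."""
    return [sum(col) / len(col) for col in zip(*predictions)]


# Hypothetical human ratings and two hypothetical model outputs.
human = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 1.0, 3.5, 4.5]
model_b = [1.0, 2.5, 2.0, 4.5]
ensemble = ensemble_mean([model_a, model_b])
tau_a = kendall_tau(model_a, human)
tau_ens = kendall_tau(ensemble, human)
```

In this toy example the averaged prediction ranks the four items in the same order as the human ratings even though each individual model misorders one pair, which is the kind of effect ensembling is meant to exploit.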
[NLP-204] On the Relationship between Accent Strength and Articulatory Features INTERSPEECH2025
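Quick read: This paper asks how to quantify accent strength in speech and relate it to articulatory features. The key to its approach is to use phoneme-level differences between dictionary-based transcriptions and phonetic transcriptions as a proxy for accent strength, combined with recent self-supervised articulatory inversion techniques to estimate articulatory features. By analyzing a corpus of read speech from American and British English speakers, the study identifies associations between articulatory parameters and accent strength, supporting automated accent analysis and articulatory modeling for speech processing applications.
Link: https://arxiv.org/abs/2507.03149
Authors: Kevin Huang, Sean Foley, Jihwan Lee, Yoonjeong Lee, Dani Byrd, Shrikanth Narayanan
Affiliations: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted for Interspeech 2025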
Abstract:This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with transcriptions based on dictionary-based references, computing phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Analyzing a corpus of read speech from American and British English speakers, this study examines correlations between derived articulatory parameters and accent strength proxies, associating systematic articulatory differences with indexed accent strength. Results indicate that tongue positioning patterns distinguish the two dialects, with notable differences inter-dialects in rhotic and low back vowels. These findings contribute to automated accent analysis and articulatory modeling for speech processing applications.
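One plausible way to turn the phoneme-level difference described in the abstract above into a number is a length-normalised edit distance between the two transcriptions. This is a sketch under assumptions: the abstract does not specify the exact measure, and the function names, the normalisation, and the ARPAbet-style example are hypothetical.

```python
def phoneme_edit_distance(reference, observed):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(observed) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, obs_ph in enumerate(observed, start=1):
            cost = 0 if ref_ph == obs_ph else 1
            curr.append(min(prev[j] + 1,           # reference phoneme deleted
                            curr[j - 1] + 1,       # extra phoneme inserted
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def accent_strength(reference, observed):
    """Edit distance normalised by reference length (an assumed proxy)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return phoneme_edit_distance(reference, observed) / len(reference)


# Illustrative sequences for "car": rhotic dictionary form vs. a non-rhotic realisation.
dictionary_form = ["K", "AA", "R"]
phonetic_form = ["K", "AA"]
score = accent_strength(dictionary_form, phonetic_form)
```

A score of zero means the spoken form matches the dictionary reference exactly; larger values indicate more phoneme-level deviations per reference phoneme.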
Computer Vision
[CV-0] Beyond One Shot Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations ICCV2025
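Quick read: This paper addresses the neglect of the spatiotemporal cues inherent in sequential data in LiDAR representation learning, with the aim of improving LiDAR semantic segmentation and 3D object detection. The key to its solution is the LiMA framework, which explicitly captures longer-range temporal correlations through three core components: a cross-view aggregation module, a long-term feature propagation mechanism, and a cross-sequence memory alignment strategy.
Link: https://arxiv.org/abs/2507.05260
Authors: Xiang Xu, Lingdong Kong, Song Wang, Chuanwei Zhou, Qingshan Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: ICCV 2025; 26 pages, 12 figures, 10 tables; Code at this http URL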
Abstract:LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer-range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.
[CV-1] Beyond Simple Edits: X-Planner for Complex Instruction-Based Image Editing
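Quick read: This paper addresses the poor understanding of complex, indirect instructions in text-guided image editing and the weak identity preservation and unintended edits of existing models, while reducing reliance on manual masks. The key to its solution is X-Planner, a planning system based on a Multimodal Large Language Model (MLLM) that uses chain-of-thought reasoning to decompose complex instructions into clear sub-instructions and automatically generates precise edit types and segmentation masks, yielding localized, identity-preserving edits.
Link: https://arxiv.org/abs/2507.05259
Authors: Chun-Hsiao Yeh, Yilin Wang, Nanxuan Zhao, Richard Zhang, Yuheng Li, Yi Ma, Krishna Kumar Singh
Affiliations: UC Berkeley; HKU; Adobe
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL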
Abstract:Recent diffusion-based image editing methods have significantly advanced text-guided tasks but often struggle to interpret complex, indirect instructions. Moreover, current models frequently suffer from poor identity preservation, unintended edits, or rely heavily on manual masks. To address these challenges, we introduce X-Planner, a Multimodal Large Language Model (MLLM)-based planning system that effectively bridges user intent with editing model capabilities. X-Planner employs chain-of-thought reasoning to systematically decompose complex instructions into simpler, clear sub-instructions. For each sub-instruction, X-Planner automatically generates precise edit types and segmentation masks, eliminating manual intervention and ensuring localized, identity-preserving edits. Additionally, we propose a novel automated pipeline for generating large-scale data to train X-Planner which achieves state-of-the-art results on both existing benchmarks and our newly introduced complex editing benchmark.
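[CV-2] Spatio-Temporal LLM: Reasoning about Environments and Actions
Quick read: This paper addresses the poor performance of Multimodal Large Language Models (MLLMs) on prompts that require holistic spatio-temporal understanding, particularly prompts that refer both to the entire environment an agent operates in and to the actions that just happened. The key to its solution is a "spatio-temporal LLM" (ST-LLM), a model equipped with projectors that improve spatial understanding of the environment and temporal understanding of recent observations, which significantly improves performance on the "Reasoning about Environments and Actions" (REA) dataset.
Link: https://arxiv.org/abs/2507.05258
Authors: Haozhen Zheng, Beitong Tian, Mingyuan Wu, Zhenggang Tang, Klara Nahrstedt, Alex Schwing
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Code and data are available at this https URL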
Abstract:Despite the significant recent progress of Multimodal Large Language Models (MLLMs), MLLMs still struggle to correctly answer prompts that require a holistic spatio-temporal understanding. Specifically, it is challenging to address prompts that refer to 1) the entirety of an environment that an agent equipped with an MLLM can operate in; and simultaneously also refer to 2) recent actions that just happened and are encoded in a video clip. However, such a holistic spatio-temporal understanding is important for agents operating in the real world. To address this issue, we first develop a framework to collect a large-scale dataset. Using the collected “Reasoning about Environments and Actions” (REA) dataset, we show that recent methods indeed struggle to correctly answer the prompts. To improve, we develop a “spatio-temporal LLM” (ST-LLM), a model equipped with projectors to improve both spatial understanding of an environment and temporal understanding of recent observations. On the collected REA data, we show that the proposed method significantly improves results compared to prior work. Code and data are available at this https URL.
[CV-3] SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation ICCV2025
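Quick read: This paper addresses the imbalance between self-consistency and cross-consistency in consistency-model-based text-to-3D generation, which leads to improper conditional guidance and degrades generation quality. The key to its solution is the SegmentDreamer framework, which reformulates Score Distillation Sampling (SDS) through Segmented Consistency Trajectory Distillation (SCTD): it explicitly defines the relationship between self- and cross-consistency and partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories, ensuring consistency within each segment and, in theory, giving a tighter upper bound on distillation error.
Link: https://arxiv.org/abs/2507.05256
Authors: Jiahao Zhu, Zixuan Chen, Guangcong Wang, Xiaohua Xie, Yi Zhou
Affiliations: Sun Yat-sen University; Great Bay University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025, project page: this https URL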
Abstract:Recent advancements in text-to-3D generation improve the visual quality of Score Distillation Sampling (SDS) and its variants by directly connecting Consistency Distillation (CD) to score distillation. However, due to the imbalance between self-consistency and cross-consistency, these CD-based methods inherently suffer from improper conditional guidance, leading to sub-optimal generation results. To address this issue, we present SegmentDreamer, a novel framework designed to fully unleash the potential of consistency models for high-fidelity text-to-3D generation. Specifically, we reformulate SDS through the proposed Segmented Consistency Trajectory Distillation (SCTD), effectively mitigating the imbalance issues by explicitly defining the relationship between self- and cross-consistency. Moreover, SCTD partitions the Probability Flow Ordinary Differential Equation (PF-ODE) trajectory into multiple sub-trajectories and ensures consistency within each segment, which can theoretically provide a significantly tighter upper bound on distillation error. Additionally, we propose a distillation pipeline for a more swift and stable generation. Extensive experiments demonstrate that our SegmentDreamer outperforms state-of-the-art methods in visual quality, enabling high-fidelity 3D asset creation through 3D Gaussian Splatting (3DGS).
[CV-4] From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving ITSC2025
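Quick read: This paper addresses accurate motion prediction of surrounding traffic participants for automated vehicles in dynamic environments, which is needed for safe and efficient operation. Among existing methods, marginal prediction models forecast each participant's trajectory independently and often lead to sub-optimal planning decisions, while joint prediction models account for interactions between participants but are hard to compare because they differ in problem formulation, model architecture, and implementation details. The paper's key contribution is a systematic study of joint motion prediction approaches, including post-processing of marginal predictions, explicitly training a model for joint prediction, and framing the problem as a generative task, with a comprehensive evaluation in terms of prediction accuracy, multi-modality, and inference efficiency.
Link: https://arxiv.org/abs/2507.05254
Authors: Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann, Marco Caccamo, Christoph Stiller
Affiliations: CARIAD SE; Technical University of Munich; Karlsruhe Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments: Accepted at International Conference on Intelligent Transportation Systems 2025 (ITSC 2025)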
Abstract:Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent’s future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach. Several prediction examples are available at this https URL.
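To make the first family of approaches in the abstract above concrete, post-processing of marginal predictions can be pictured as combining per-agent modes into scene-level modes. The sketch below is a deliberately naive illustration, not the paper's method: it assumes agents are independent, so a joint probability is just the product of marginal probabilities, and the function name, mode labels, and probabilities are hypothetical.

```python
from itertools import product


def joint_modes_from_marginals(marginals, top_k):
    """Combine per-agent modes into the top_k most likely scene-level modes.

    marginals: one list per agent, each holding (mode_id, probability) pairs.
    Assumes independence between agents, which ignores interactions.
    """
    combos = []
    for combo in product(*marginals):
        prob = 1.0
        for _, p in combo:
            prob *= p
        combos.append((tuple(mode for mode, _ in combo), prob))
    # Highest joint probability first.
    combos.sort(key=lambda c: c[1], reverse=True)
    return combos[:top_k]


# Hypothetical marginal predictions for two agents.
agent_a = [("straight", 0.7), ("left", 0.3)]
agent_b = [("yield", 0.6), ("go", 0.4)]
scenes = joint_modes_from_marginals([agent_a, agent_b], top_k=2)
```

Because independence ignores interactions, a practical post-processing step would presumably also re-score or filter combinations for scene consistency (for example, rejecting colliding trajectories); that part is omitted here.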
[CV-5] Physics-Guided Dual Implicit Neural Representations for Source Separation
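Quick read: This paper addresses the challenge of efficient data analysis in advanced experimental and observational techniques, where unwanted contributions such as background and signal distortions obscure the physically relevant information. The key to its solution is a self-supervised machine-learning method that uses a dual implicit neural representation framework to jointly train two neural networks: one approximates distortions of the target physical signal and the other learns the effective background contribution. The method learns directly from raw data by minimizing a reconstruction-based loss, without labeled data or predefined dictionaries, and separates physically meaningful signals from complex or structured backgrounds.
Link: https://arxiv.org/abs/2507.05249
Authors: Yuan Ni, Zhantao Chen, Alexander N. Petsch, Edmund Xu, Cheng Peng, Alexander I. Kolesnikov, Sugata Chowdhury, Arun Bansil, Jana B. Thayer, Joshua J. Turner
Affiliations: SLAC National Accelerator Laboratory; Stanford University; University of California Santa Cruz; Oak Ridge National Laboratory; Howard University; Northeastern University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Strongly Correlated Electrons (cond-mat.str-el); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments: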
Abstract:Significant challenges exist in efficient data analysis of most advanced experimental and observational techniques because the collected signals often include unwanted contributions–such as background and signal distortions–that can obscure the physically relevant information of interest. To address this, we have developed a self-supervised machine-learning approach for source separation using a dual implicit neural representation framework that jointly trains two neural networks: one for approximating distortions of the physical signal of interest and the other for learning the effective background contribution. Our method learns directly from the raw data by minimizing a reconstruction-based loss function without requiring labeled data or pre-defined dictionaries. We demonstrate the effectiveness of our framework by considering a challenging case study involving large-scale simulated as well as experimental momentum-energy-dependent inelastic neutron scattering data in a four-dimensional parameter space, characterized by heterogeneous background contributions and unknown distortions to the target signal. The method is found to successfully separate physically meaningful signals from a complex or structured background even when the signal characteristics vary across all four dimensions of the parameter space. An analytical approach that informs the choice of the regularization parameter is presented. Our method offers a versatile framework for addressing source separation problems across diverse domains, ranging from superimposed signals in astronomical measurements to structural features in biomedical image reconstructions.
[CV-6] StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
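Quick read: This paper addresses the trade-off among fine-grained visual understanding, long-term context modeling, and computational efficiency that Vision-and-Language Navigation (VLN) faces in real-world settings. The key to its solution is the StreamVLN framework, which adopts a hybrid slow-fast context modeling strategy: a fast-streaming dialogue context enables responsive action generation, while a slow-updating memory context compresses historical visual states with a 3D-aware token pruning strategy. This keeps multi-turn dialogue coherent while supporting long video streams with bounded context size and inference cost.
Link: https://arxiv.org/abs/2507.05240
Authors: Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
Affiliations: Shanghai AI Laboratory; The University of Hong Kong; Zhejiang University; Shanghai Jiao Tong University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: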
Abstract:Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current VLN methods based on Video-LLM often face trade-offs among fine-grained visual understanding, long-term context modeling and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding-window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN achieves coherent multi-turn dialogue through efficient KV cache reuse, supporting long video streams with bounded context size and inference cost. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment. The project page is: this https URL.
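The slow-fast split in the abstract above can be pictured with a toy data structure: recent turns stay in full inside a sliding window, and turns that fall out of the window are compressed into a memory. This is only a sketch under heavy assumptions: the class `SlowFastContext` is hypothetical, the "keep every k-th token" rule is a placeholder for the paper's 3D-aware pruning (which the abstract does not detail), and the memory here is not bounded as it would need to be in practice.

```python
from collections import deque


class SlowFastContext:
    """Toy context manager: full recent turns plus pruned older turns."""

    def __init__(self, window_size, keep_every):
        self.window = deque()   # fast context: recent turns, kept in full
        self.memory = []        # slow context: pruned tokens of evicted turns
        self.window_size = window_size
        self.keep_every = keep_every

    def add_turn(self, tokens):
        self.window.append(list(tokens))
        if len(self.window) > self.window_size:
            evicted = self.window.popleft()
            # Placeholder pruning: keep every k-th token of the evicted turn.
            self.memory.extend(evicted[::self.keep_every])

    def context(self):
        """Tokens the model would see: compressed memory, then recent turns."""
        recent = [tok for turn in self.window for tok in turn]
        return self.memory + recent


ctx = SlowFastContext(window_size=2, keep_every=2)
for step in range(4):
    ctx.add_turn([f"t{step}_{i}" for i in range(4)])
```

After four turns of four tokens each, the two most recent turns are intact and the two older ones have been halved, so the context holds 12 tokens instead of 16.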
[CV-7] Self-Supervised Real-Time Tracking of Military Vehicles in Low-FPS UAV Footage
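Quick read: This paper addresses the difficulty of associating objects in low-frame-rate video captured by moving unmanned aerial vehicles (UAVs) in actual combat scenarios, where object appearance and position change rapidly and cloud video streaming and compression algorithms degrade the image. The key to its solution is instance association learning from single-frame annotations, with global scene features providing crucial context that makes the method robust to distractors and gaps in detections.
Link: https://arxiv.org/abs/2507.05229
Authors: Markiyan Kostiv, Anatolii Adamovskyi, Yevhen Cherniavskyi, Mykyta Varenyk, Ostap Viniavskyi, Igor Krashenyi, Oles Dobosevych
Affiliations: CIDTD, MoD Ukraine; Applied Sciences Faculty, Ukrainian Catholic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: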
Abstract:Multi-object tracking (MOT) aims to maintain consistent identities of objects across video frames. Associating objects in low-frame-rate videos captured by moving unmanned aerial vehicles (UAVs) in actual combat scenarios is complex due to rapid changes in object appearance and position within the frame. The task becomes even more challenging due to image degradation caused by cloud video streaming and compression algorithms. We present how instance association learning from single-frame annotations can overcome these challenges. We show that global features of the scene provide crucial context for low-FPS instance association, allowing our solution to be robust to distractors and gaps in detections. We also demonstrate that such a tracking approach maintains high association quality even when reducing the input image resolution and latent representation size for faster inference. Finally, we present a benchmark dataset of annotated military vehicles collected from publicly available data sources. This paper was initially presented at the NATO Science and Technology Organization Symposium (ICMCIS) organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-209-RSY - the ICMCIS, held in Oeiras, Portugal, 13-14 May 2025.
[CV-8] NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving
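Quick read: This paper addresses a gap in autonomous driving systems: they have made significant progress in perception, prediction, and planning from local visual information but struggle to incorporate the broader navigational context that human drivers routinely use. The key to its solution is NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems, together with three complementary paradigms: navigation-guided reasoning, navigation-guided preference optimization, and a navigation-guided vision-language-action model. These incorporate navigation context into vision-language models, establish preferences for navigation-relevant summarized information, and fuse navigation guidance with conventional driving models, improving the system's ability to navigate complex environments.
Link: https://arxiv.org/abs/2507.05227
Authors: Qucheng Peng, Chen Bai, Guoxiang Zhang, Bo Xu, Xiaotong Liu, Xiaoyin Zheng, Chen Chen, Cheng Lu
Affiliations: XPENG Motors; University of Central Florida; Center for Research in Computer Vision
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Systems and Control (eess.SY)
Comments: Accepted by ACM Multimedia 2025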
Abstract:Autonomous driving systems have made significant advances in QA, perception, prediction, and planning based on local visual information, yet they struggle to incorporate broader navigational context that human drivers routinely utilize. We address this critical gap between local sensor data and global navigation information by proposing NavigScene, an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. Moreover, we develop three complementary paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses by establishing preferences for navigation-relevant summarized information; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion. Extensive experiments demonstrate that our approaches significantly improve performance across perception, prediction, planning, and question-answering tasks by enabling reasoning capabilities beyond visual range and improving generalization to diverse driving scenarios. This work represents a significant step toward more comprehensive autonomous driving systems capable of navigating complex, unfamiliar environments with greater reliability and safety.
[CV-9] CTA: Cross-Task Alignment for Better Test Time Training
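Quick read: This paper addresses the significant performance degradation of deep learning models under distribution shifts such as domain or dataset changes. The key to its solution is CTA (Cross-Task Alignment), a new method that draws on the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one, improving robustness and enabling more semantically meaningful test-time updates without requiring a specialized model architecture.
Link: https://arxiv.org/abs/2507.05221
Authors: Samuel Barbeau, Pedram Fekri, David Osowiechi, Ali Bahri, Moslem Yazdanpanah, Masih Aminbeidokhti, Christian Desrosiers
Affiliations: ÉTS Montréal; Concordia University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint, under review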
Abstract:Deep learning models have demonstrated exceptional performance across a wide range of computer vision tasks. However, their performance often degrades significantly when faced with distribution shifts, such as domain or dataset changes. Test-Time Training (TTT) has emerged as an effective method to enhance model robustness by incorporating an auxiliary unsupervised task during training and leveraging it for model updates at test time. In this work, we introduce CTA (Cross-Task Alignment), a novel approach for improving TTT. Unlike existing TTT methods, CTA does not require a specialized model architecture and instead takes inspiration from the success of multi-modal contrastive learning to align a supervised encoder with a self-supervised one. This process enforces alignment between the learned representations of both models, thereby mitigating the risk of gradient interference, preserving the intrinsic robustness of self-supervised learning and enabling more semantically meaningful updates at test-time. Experimental results demonstrate substantial improvements in robustness and generalization over the state-of-the-art on several benchmark datasets.
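The multi-modal contrastive learning that the abstract above cites as inspiration is commonly implemented as a symmetric InfoNCE loss over paired embeddings. The sketch below applies that generic loss to features from two encoders; it is an assumption about the general shape of such an objective, not CTA's actual loss, and the function names, temperature, and feature values are hypothetical.

```python
import math


def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def alignment_loss(sup_feats, ssl_feats, temperature=0.1):
    """Symmetric InfoNCE: row i of each batch should match row i of the other."""
    a = [_normalize(v) for v in sup_feats]
    b = [_normalize(v) for v in ssl_feats]
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Hypothetical features: the first pairing is aligned, the second is swapped.
sup = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(sup, [[2.0, 0.1], [0.1, 2.0]])
swapped = alignment_loss(sup, [[0.1, 2.0], [2.0, 0.1]])
```

The loss is small when each supervised feature is closest to its own self-supervised counterpart and large when the pairing is swapped, which is the behaviour an alignment objective needs.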
[CV-10] All in One: Visual-Description-Guided Unified Point Cloud Segmentation ICCV2025
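Quick read: This paper addresses challenges in unified segmentation of 3D point clouds: their sparse structure, limited annotated data, and the difficulty of distinguishing fine-grained object classes in complex environments. The key to its solution is to bring in pre-trained vision-language models (e.g., CLIP) and large language models (LLMs), using LLM-generated textual descriptions and reference images from the internet to incorporate rich multimodal cues that improve the separation of fine-grained classes and instances. It also designs a Semantic-Visual Contrastive Loss and a Spatial Enhanced Module to align point features with multimodal queries and to model scene-wide relationships efficiently.
Link: https://arxiv.org/abs/2507.05211
Authors: Zongyan Han, Mohamed El Amine Boudjoghra, Jiahua Dong, Jinhong Wang, Rao Muhammad Anwer
Affiliations: Mohamed Bin Zayed University of Artificial Intelligence; Technical University of Munich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025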
Abstract:Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. Our code is available at this https URL.
[CV-11] EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling
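Quick read: This paper addresses the Real2Sim2Real gap between the real world and simulation, particularly in physical dynamics and visual appearance. The key to its solution is the EmbodieDreamer framework, which has two core components: PhysAligner and VisAligner. PhysAligner is a differentiable physics module that optimizes robot-specific parameters to narrow the gap between real and simulated dynamics. VisAligner uses a conditional video diffusion model to translate low-fidelity simulated renderings into photorealistic videos conditioned on simulation states, reducing the difference in visual appearance.
Link: https://arxiv.org/abs/2507.05198
Authors: Boyuan Wang, Xinpan Meng, Xiaofeng Wang, Zheng Zhu, Angen Ye, Yang Wang, Zhiqin Yang, Chaojun Ni, Guan Huang, Xingang Wang
Affiliations: GigaAI; Institute of Automation, Chinese Academy of Sciences; Peking University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL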
Abstract:The rapid advancement of Embodied AI has led to an increasing demand for large-scale, high-quality real-world data. However, collecting such embodied data remains costly and inefficient. As a result, simulation environments have become a crucial surrogate for training robot policies. Yet, the significant Real2Sim2Real gap remains a critical bottleneck, particularly in terms of physical dynamics and visual appearance. To address this challenge, we propose EmbodieDreamer, a novel framework that reduces the Real2Sim2Real gap from both the physics and appearance perspectives. Specifically, we propose PhysAligner, a differentiable physics module designed to reduce the Real2Sim physical gap. It jointly optimizes robot-specific parameters such as control gains and friction coefficients to better align simulated dynamics with real-world observations. In addition, we introduce VisAligner, which incorporates a conditional video diffusion model to bridge the Sim2Real appearance gap by translating low-fidelity simulated renderings into photorealistic videos conditioned on simulation states, enabling high-fidelity visual transfer. Extensive experiments validate the effectiveness of EmbodieDreamer. The proposed PhysAligner reduces physical parameter estimation error by 3.74% compared to simulated annealing methods while improving optimization speed by 89.91%. Moreover, training robot policies in the generated photorealistic environment leads to a 29.17% improvement in the average task success rate across real-world tasks after reinforcement learning. Code, model and data will be publicly available.
[CV-12] Neuralocks: Real-Time Dynamic Neural Hair Simulation
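Quick read: This paper addresses dynamic hair simulation for real-time virtual avatars. Existing approaches are either highly optimized physics-based systems or neural methods, and current neural methods provide only quasi-static solutions that cannot capture the dynamic behavior of hair. The key to its solution is a fully self-supervised neural method that needs no manual intervention or artist-generated data, can be integrated with hair reconstruction methods for automatic end-to-end avatar reconstruction, and uses compact, memory-efficient neural networks to simulate hair at the strand level, supporting dynamic behavior across diverse hairstyles while remaining computationally efficient.
Link: https://arxiv.org/abs/2507.05191
Authors: Gene Wei-Chin Lin, Egor Larionov, Hsiao-yu Chen, Doug Roble, Tuur Stuyck
Affiliations: Meta Reality Labs
Subjects: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: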
Abstract:Real-time hair simulation is a vital component in creating believable virtual avatars, as it provides a sense of immersion and authenticity. The dynamic behavior of hair, such as bouncing or swaying in response to character movements like jumping or walking, plays a significant role in enhancing the overall realism and engagement of virtual experiences. Current methods for simulating hair have been constrained by two primary approaches: highly optimized physics-based systems and neural methods. However, state-of-the-art neural techniques have been limited to quasi-static solutions, failing to capture the dynamic behavior of hair. This paper introduces a novel neural method that breaks through these limitations, achieving efficient and stable dynamic hair simulation while outperforming existing approaches. We propose a fully self-supervised method which can be trained without any manual intervention or artist generated training data allowing the method to be integrated with hair reconstruction methods to enable automatic end-to-end methods for avatar reconstruction. Our approach harnesses the power of compact, memory-efficient neural networks to simulate hair at the strand level, allowing for the simulation of diverse hairstyles without excessive computational resources or memory requirements. We validate the effectiveness of our method through a variety of hairstyle examples, showcasing its potential for real-world applications.
[CV-13] Satellite-based Rabi rice paddy field mapping in India: a case study on Telangana state
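[Quick Read]: This paper tackles rice-area monitoring in smallholder farming regions, where conventional remote sensing struggles with the spatiotemporal heterogeneity of fragmented agricultural landscapes. The key to the solution is a phenology-driven classification framework that systematically adapts to local agro-ecological variation across 32 districts of Telangana during the 2018-19 Rabi (winter) rice season. District-specific calibration achieved 93.3% overall accuracy, 8.0 percentage points above conventional regional clustering methods, and agreed closely with official government statistics (R^2 = 0.981).
Link: https://arxiv.org/abs/2507.05189
Authors: Prashanth Reddy Putta, Fabio Dell'Acqua
Affiliations: University of Pavia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 60 pages, 17 figures. Intended for submission to Remote Sensing Applications: Society and Environment (RSASE). Funded by the European Union - NextGenerationEU, Mission 4 Component 1.5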
Abstract:Accurate rice area monitoring is critical for food security and agricultural policy in smallholder farming regions, yet conventional remote sensing approaches struggle with the spatiotemporal heterogeneity characteristic of fragmented agricultural landscapes. This study developed a phenology-driven classification framework that systematically adapts to local agro-ecological variations across 32 districts in Telangana, India during the 2018-19 Rabi rice season. The research reveals significant spatiotemporal diversity, with phenological timing varying by up to 50 days between districts and field sizes ranging from 0.01 to 2.94 hectares. Our district-specific calibration approach achieved 93.3% overall accuracy, an 8.0 percentage point improvement over conventional regional clustering methods, with strong validation against official government statistics (R^2 = 0.981) demonstrating excellent agreement between remotely sensed and ground truth data. The framework successfully mapped 732,345 hectares by adapting to agro-climatic variations, with Northern districts requiring extended land preparation phases (up to 55 days) while Southern districts showed compressed cultivation cycles. Field size analysis revealed accuracy declining 6.8 percentage points from medium to tiny fields, providing insights for operational monitoring in fragmented landscapes. These findings demonstrate that remote sensing frameworks must embrace rather than simplify landscape complexity, advancing region-specific agricultural monitoring approaches that maintain scientific rigor while serving practical policy and food security applications.
[CV-14] φ-Adapt: A Physics-Informed Adaptation Learning Approach to 2D Quantum Material Discovery
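[Quick Read]: This paper addresses the characterization of two-dimensional quantum flakes, in particular the challenges that existing computer vision methods face when estimating flake thickness: limited data, poor generalization, sensitivity to domain shift, and a lack of physical interpretability. The key to the solution is φ-Adapt, a physics-informed adaptation learning method that combines physics-based modeling with domain adaptation to improve performance in real-world settings, together with a new synthetic data generation framework that reduces reliance on manually collected data and narrows the performance gap between synthetic training data and real-world deployment.
Link: https://arxiv.org/abs/2507.05184
Authors: Hoang-Quan Nguyen, Xuan Bac Nguyen, Sankalp Pandey, Tim Faltermeier, Nicholas Borys, Hugh Churchill, Khoa Luu
Affiliations: University of Arkansas; Montana State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: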
Abstract:Characterizing quantum flakes is a critical step in quantum hardware engineering because the quality of these flakes directly influences qubit performance. Although computer vision methods for identifying two-dimensional quantum flakes have emerged, they still face significant challenges in estimating flake thickness. These challenges include limited data, poor generalization, sensitivity to domain shifts, and a lack of physical interpretability. In this paper, we introduce one of the first Physics-informed Adaptation Learning approaches to overcome these obstacles. We focus on two main issues, i.e., data scarcity and generalization. First, we propose a new synthetic data generation framework that produces diverse quantum flake samples across various materials and configurations, reducing the need for time-consuming manual collection. Second, we present φ-Adapt, a physics-informed adaptation method that bridges the performance gap between models trained on synthetic data and those deployed in real-world settings. Experimental results show that our approach achieves state-of-the-art performance on multiple benchmarks, outperforming existing methods. Our proposed approach advances the integration of physics-based modeling and domain adaptation. It also addresses a critical gap in leveraging synthesized data for real-world 2D material analysis, offering impactful tools for deep learning and materials science communities.
[CV-15] Semantic Frame Interpolation
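[Quick Read]: This paper addresses the generation of intermediate video content of varying lengths from given first and last frames plus a text prompt. Traditional frame interpolation focuses on scenarios with few frames, no text control, and small differences between the first and last frames, while existing approaches built on large video models generate only a fixed number of frames, perform poorly at certain frame lengths, and lack a clear definition and benchmark. The key to the solution is a new Semantic Frame Interpolation (SFI) task and SemFi, a model built on Wan2.1 that adds a Mixture-of-LoRA module so that generated content stays highly consistent and aligned with the control conditions under different frame-length limits, along with SFI-300K, the first general-purpose dataset and benchmark for SFI.
Link: https://arxiv.org/abs/2507.05173
Authors: Yijia Hong, Jiangning Zhang, Ran Yi, Yuji Wang, Weijian Cao, Xiaobin Hu, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Zhejiang University; Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: this https URL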
Abstract:Generating intermediate video content of varying lengths based on given first and last frames, along with text prompt information, offers significant research and application potential. However, traditional frame interpolation tasks primarily focus on scenarios with a small number of frames, no text control, and minimal differences between the first and last frames. Recent community developers have utilized large video models represented by Wan to endow frame-to-frame capabilities. However, these models can only generate a fixed number of frames and often fail to produce satisfactory results for certain frame lengths, while this setting lacks a clear official definition and a well-established benchmark. In this paper, we first propose a new practical Semantic Frame Interpolation (SFI) task from the perspective of academic definition, which covers the above two settings and supports inference at multiple frame rates. To achieve this goal, we propose a novel SemFi model building upon Wan2.1, which incorporates a Mixture-of-LoRA module to ensure the generation of high-consistency content that aligns with control conditions across various frame length limitations. Furthermore, we propose SFI-300K, the first general-purpose dataset and benchmark specifically designed for SFI. To support this, we collect and process data from the perspective of SFI, carefully designing evaluation metrics and methods to assess the model’s performance across multiple dimensions, encompassing image and video, and various aspects, including consistency and diversity. Through extensive experiments on SFI-300K, we demonstrate that our method is particularly well-suited to meet the requirements of the SFI task.
[CV-16] Differential Attention for Multimodal Crisis Event Analysis CVPR
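[Quick Read]: This paper addresses the challenge of extracting meaningful information from large, noisy multimodal data streams during crisis events and integrating heterogeneous data. The key to the solution is the use of vision language models (VLMs) and advanced fusion strategies: LLaVA-generated text to improve text-image alignment, CLIP-based vision and text embeddings for feature representation, and Guided Cross Attention (Guided CA) combined with the Differential Attention mechanism to strengthen multimodal feature alignment, improving the accuracy and reliability of crisis data classification.
Link: https://arxiv.org/abs/2507.05165
Authors: Nusrat Munia, Junfeng Zhu, Olfa Nasraoui, Abdullah-Al-Zubaer Imran
Affiliations: University of Kentucky; Kentucky Geological Survey; University of Louisville
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at CVPRw 2025, MMFM3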
Abstract:Social networks can be a valuable source of information during crisis events. In particular, users can post a stream of multimodal data that can be critical for real-time humanitarian response. However, effectively extracting meaningful information from this large and noisy data stream and effectively integrating heterogeneous data remains a formidable challenge. In this work, we explore vision language models (VLMs) and advanced fusion strategies to enhance the classification of crisis data in three different tasks. We incorporate LLaVA-generated text to improve text-image alignment. Additionally, we leverage Contrastive Language-Image Pretraining (CLIP)-based vision and text embeddings, which, without task-specific fine-tuning, outperform traditional models. To further refine multimodal fusion, we employ Guided Cross Attention (Guided CA) and combine it with the Differential Attention mechanism to enhance feature alignment by emphasizing critical information while filtering out irrelevant content. Our results show that while Differential Attention improves classification performance, Guided CA remains highly effective in aligning multimodal features. Extensive experiments on the CrisisMMD benchmark data set demonstrate that the combination of pretrained VLMs, enriched textual descriptions, and adaptive fusion strategies consistently outperforms state-of-the-art models in classification accuracy, contributing to more reliable and interpretable models for three different tasks that are crucial for disaster response. Our code is available at this https URL.
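
The abstract does not give the formula for Differential Attention. As a rough illustration, assuming the commonly described form (the difference of two softmax attention maps, with the second scaled by a factor λ), a minimal pure-Python sketch could look like this. The function names and the fixed `lam` value are hypothetical; in practice λ is usually learned, and this is not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    d = len(keys[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]

def differential_attention(q1, k1, q2, k2, values, lam=0.5):
    """Assumed form: (softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for r1, r2 in zip(a1, a2):
        # Subtracting the second map cancels attention mass common to both maps.
        w = [x - lam * y for x, y in zip(r1, r2)]
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out
```

With `lam=0` this reduces to ordinary attention; with identical query/key pairs and `lam=1` the two maps cancel completely.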
[CV-17] 4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture
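[Quick Read]: This paper addresses the reconstruction of high-speed dynamic scenes from multi-view video. Most existing 4D capture systems are limited to frame rates below 30 FPS, which prevents effective 4D reconstruction of fast motion. The key to the solution is a high-speed 4D capture system that uses only low-FPS cameras, built on an asynchronous capture scheme and a generative model. The capture scheme staggers camera start times and, from a base rate of 25 FPS, reaches an equivalent frame rate of 100-200 FPS; on the processing side, a video-diffusion-based generative model repairs the artifacts introduced by sparse-view reconstruction, improving reconstruction quality and temporal consistency.
Link: https://arxiv.org/abs/2507.05163
Authors: Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue
Affiliations: Shanghai AI Laboratory; The Chinese University of Hong Kong; The University of Hong Kong; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL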
Abstract:Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system only using low FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
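
The staggered-start arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, and the assumption that camera groups are offset uniformly within one base frame period is ours, not a detail stated in the abstract. Under that assumption, 4 groups at 25 FPS give an effective 100 FPS and 8 groups give 200 FPS.

```python
from fractions import Fraction

def staggered_timestamps(num_groups: int, base_fps: int = 25, frames_per_camera: int = 2):
    """Sorted capture times (in seconds) across all camera groups.

    Assumption (not from the paper): group g starts g / (num_groups * base_fps)
    seconds after group 0, i.e. offsets are spread evenly over one frame period.
    """
    period = Fraction(1, base_fps)   # time between two frames of a single camera
    offset = period / num_groups     # start-time stagger between consecutive groups
    times = [g * offset + k * period
             for k in range(frames_per_camera)
             for g in range(num_groups)]
    return sorted(times)

def effective_fps(num_groups: int, base_fps: int = 25) -> int:
    """Equivalent frame rate when the groups interleave without overlap."""
    return num_groups * base_fps
```

The trade-off the abstract mentions is visible here: each timestamp is observed by only one group, so the number of viewpoints per timestamp drops by a factor of `num_groups`.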
[CV-18] LAID: Lightweight AI-Generated Image Detection in Spatial and Spectral Domains
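[Quick Read]: This paper addresses the potential misuse of AI-generated images (AIGI) on social platforms, and in particular the obstacle that state-of-the-art AIGI detectors rely on computationally intensive deep neural networks, which hinders real-time, large-scale deployment. The key to the solution is LAID, which the authors describe as, to their knowledge, the first framework to benchmark and evaluate the detection performance and efficiency of off-the-shelf lightweight neural networks. By training and evaluating selected models on a representative subset of the GenImage dataset across the spatial, spectral, and fusion image domains, the study shows that lightweight models can reach competitive accuracy even under adversarial conditions while substantially reducing memory and computation costs, laying a foundation for efficient, scalable AIGI detection systems.
Link: https://arxiv.org/abs/2507.05162
Authors: Nicholas Chivaran, Jianbing Ni
Affiliations: Queen's University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: To appear in the proceedings of PST2025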
Abstract:The recent proliferation of photorealistic AI-generated images (AIGI) has raised urgent concerns about their potential misuse, particularly on social media platforms. Current state-of-the-art AIGI detection methods typically rely on large, deep neural architectures, creating significant computational barriers to real-time, large-scale deployment on platforms like social media. To challenge this reliance on computationally intensive models, we introduce LAID, the first framework – to our knowledge – that benchmarks and evaluates the detection performance and efficiency of off-the-shelf lightweight neural networks. In this framework, we comprehensively train and evaluate selected models on a representative subset of the GenImage dataset across spatial, spectral, and fusion image domains. Our results demonstrate that lightweight models can achieve competitive accuracy, even under adversarial conditions, while incurring substantially lower memory and computation costs compared to current state-of-the-art methods. This study offers valuable insight into the trade-off between efficiency and performance in AIGI detection and lays a foundation for the development of practical, scalable, and trustworthy detection systems. The source code of LAID can be found at: this https URL.
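
To make the spatial/spectral distinction concrete, here is a small sketch of one way to derive a spectral-domain input from a grayscale image: the log-magnitude of its 2D discrete Fourier transform. The abstract does not say which spectral transform LAID uses, so the choice of the DFT and the name `log_magnitude_spectrum` are assumptions; the naive loop is only meant for tiny illustrative inputs.

```python
import cmath
import math

def log_magnitude_spectrum(image):
    """Log-magnitude 2D DFT of a grayscale image given as a list of rows.

    Naive O((h*w)^2) implementation for illustration; a real pipeline would
    use an FFT routine instead.
    """
    h, w = len(image), len(image[0])
    spectrum = []
    for ky in range(h):
        row = []
        for kx in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * math.pi * (ky * y / h + kx * x / w)
                    total += image[y][x] * cmath.exp(1j * angle)
            # log1p compresses the large dynamic range of the magnitudes.
            row.append(math.log1p(abs(total)))
        spectrum.append(row)
    return spectrum
```

A fusion-domain input could then be formed by stacking the original pixels and this spectrum as separate channels, again as an assumption about how the two domains might be combined.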
[CV-19] VERITAS: Verification and Explanation of Realness in Images for Transparency in AI Systems
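[Quick Read]: This paper addresses the problem of distinguishing AI-generated images from real ones, specifically accurate synthetic-image detection on small (32x32) images with interpretable justification. The key to the solution is the VERITAS framework, which uses artifact localization and semantic reasoning to detect whether an image is AI-generated and produces human-readable explanations of the basis for each decision.
Link: https://arxiv.org/abs/2507.05146
Authors: Aadi Srivastava, Vignesh Natarajkumar, Utkarsh Bheemanaboyna, Devisree Akashapu, Nagraj Gaonkar, Archit Joshi
Affiliations: Indian Institute of Technology, Madras
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: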
Abstract:The widespread and rapid adoption of AI-generated content, created by models such as Generative Adversarial Networks (GANs) and Diffusion Models, has revolutionized the digital media landscape by allowing efficient and creative content generation. However, these models also blur the difference between real images and AI-generated synthetic images, raising concerns regarding content authenticity and integrity. While many existing solutions to detect fake images focus solely on classification and higher-resolution images, they often lack transparency in their decision-making, making it difficult for users to understand why an image is classified as fake. In this paper, we present VERITAS, a comprehensive framework that not only accurately detects whether a small (32x32) image is AI-generated but also explains why it was classified that way through artifact localization and semantic reasoning. VERITAS produces human-readable explanations that describe key artifacts in synthetic images. We show that this architecture offers clear explanations of the basis of zero-shot synthetic image detection tasks. Code and relevant prompts can be found at this https URL .
[CV-20] LVM4CSI: Enabling Direct Application of Pre-Trained Large Vision Models for Wireless Channel Tasks
[Quick Read]: This paper addresses the poor generalizability and limited practicality of hand-designed, task-specific neural networks (NNs) for acquiring and using channel state information (CSI) in wireless communication systems. The key to the solution is LVM4CSI, a framework that exploits the structural similarity between CSI and computer vision (CV) data to apply large vision models (LVMs) pre-trained on large-scale CV datasets directly to wireless tasks without any fine-tuning, improving generality and efficiency.
Link: https://arxiv.org/abs/2507.05121
Authors: Jiajia Guo, Peiwen Jiang, Chao-Kai Wen, Shi Jin, Jun Zhang
Affiliations: The Hong Kong University of Science and Technology; National Synchrotron Radiation Research Center; Southeast University
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: This work has been submitted for possible publication
Abstract:Accurate channel state information (CSI) is critical to the performance of wireless communication systems, especially with the increasing scale and complexity introduced by 5G and future 6G technologies. While artificial intelligence (AI) offers a promising approach to CSI acquisition and utilization, existing methods largely depend on task-specific neural networks (NNs) that require expert-driven design and large training datasets, limiting their generalizability and practicality. To address these challenges, we propose LVM4CSI, a general and efficient framework that leverages the structural similarity between CSI and computer vision (CV) data to directly apply large vision models (LVMs) pre-trained on extensive CV datasets to wireless tasks without any fine-tuning, in contrast to large language model-based methods that generally necessitate fine-tuning. LVM4CSI maps CSI tasks to analogous CV tasks, transforms complex-valued CSI into visual formats compatible with LVMs, and integrates lightweight trainable layers to adapt extracted features to specific communication objectives. We validate LVM4CSI through three representative case studies, including channel estimation, human activity recognition, and user localization. Results demonstrate that LVM4CSI achieves comparable or superior performance to task-specific NNs, including an improvement exceeding 9.61 dB in channel estimation and approximately 40% reduction in localization error. Furthermore, it significantly reduces the number of trainable parameters and eliminates the need for task-specific NN design.
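
One step the abstract names is transforming complex-valued CSI into a visual format that a vision model can consume. The abstract does not specify the mapping, so the sketch below is a guess at a simple variant: it stacks the real part, imaginary part, and magnitude as three image channels and min-max scales each to [0, 1]. The function name `csi_to_image` and the channel layout are hypothetical.

```python
def csi_to_image(csi):
    """Map a 2D complex CSI matrix to three real-valued channels in [0, 1].

    Assumed layout (not from the paper): channel 0 = real part,
    channel 1 = imaginary part, channel 2 = magnitude.
    """
    def scale(channel):
        flat = [x for row in channel for x in row]
        lo, hi = min(flat), max(flat)
        span = hi - lo
        # A constant channel carries no contrast, so map it to all zeros.
        return [[(x - lo) / span if span else 0.0 for x in row] for row in channel]

    real = [[z.real for z in row] for row in csi]
    imag = [[z.imag for z in row] for row in csi]
    mag = [[abs(z) for z in row] for row in csi]
    return [scale(real), scale(imag), scale(mag)]
```

The resulting three-channel array has the same shape convention as an RGB image, which is what lets a frozen pre-trained backbone be reused with only lightweight trainable layers on top.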
[CV-21] VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
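[Quick Read]: This paper addresses the limited generalization of large Vision Language Action (VLA) models to novel objects and unfamiliar environments. Existing methods add components such as depth estimation, segmentation, or diffusion to improve generalization, but at significant computational cost and low efficiency. The key to the solution is VOTE, an efficient framework for optimizing and accelerating VLA models. Its core is a tokenizer-free fine-tuning approach for parallel, accurate action prediction, which reduces computational overhead and speeds up inference, combined with an ensemble voting strategy for action sampling that significantly improves performance and generalization.
Link: https://arxiv.org/abs/2507.05116
Authors: Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Affiliations: Northeastern University; EmbodyX Inc
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: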
Abstract:Recent large-scale Vision Language Action (VLA) models have shown superior performance in robotic manipulation tasks guided by natural language. However, their generalization remains limited when applied to novel objects or unfamiliar environments that lie outside the training distribution. To address this, many existing approaches integrate additional components such as depth estimation, segmentation, or even diffusion to improve generalization, at the cost of adding significant computation overhead, resulting in low efficiency. This motivates the exploration of efficient action prediction methods, which are independent of additional high-level visual representations or diffusion techniques. In this work, we propose VOTE, an efficient and general framework for the optimization and acceleration of VLA models. In detail, we propose a novel tokenizer-free fine-tuning approach for parallel accurate action prediction, which reduces computational overhead and accelerates inference speed. Additionally, we adopt an ensemble voting strategy for the action sampling, which significantly improves model performance and enhances generalization. Experimental results show that our method achieves state-of-the-art performance with 35× faster inference and 145 Hz throughput. All the details and code will be open-sourced.
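
The abstract mentions an ensemble voting strategy for action sampling without describing it. One plausible reading, sketched below purely as an assumption, is that several candidate predictions for the same timestep are split by cosine similarity to the newest prediction, and the larger group is averaged. The names `vote_action` and `threshold` are hypothetical, and this should not be taken as the authors' procedure.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def vote_action(candidates, threshold=0.5):
    """Average the majority group of candidate action vectors.

    candidates[0] is treated as the newest prediction (an assumption).
    Candidates similar to it form one group, the rest form the other,
    and the larger group wins (ties go to the newest prediction's group).
    """
    reference = candidates[0]
    agree = [c for c in candidates if cosine(reference, c) > threshold]
    disagree = [c for c in candidates if cosine(reference, c) <= threshold]
    winners = agree if len(agree) >= len(disagree) else disagree
    return [sum(c[i] for c in winners) / len(winners)
            for i in range(len(reference))]
```

Averaging only within the winning group avoids blending contradictory actions, which a plain mean over all candidates would do.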
[CV-22] MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation
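[Quick Read]: This paper addresses temporal jittering, identity drift, and unnatural blinking in audio-driven talking head generation. The key to the solution is MoDiT, a framework that combines the 3D Morphable Model (3DMM) with a diffusion-based Transformer. Its main innovations are a hierarchical denoising strategy with revised temporal attention to reduce temporal jitter, 3DMM coefficients that provide explicit spatial constraints to improve lip synchronization and identity consistency, and a refined blinking strategy for more natural eye movement.
Link: https://arxiv.org/abs/2507.05092
Authors: Yucheng Wang, Dan Xu
Affiliations: The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: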
Abstract:Audio-driven talking head generation is critical for applications such as virtual assistants, video games, and films, where natural lip movements are essential. Despite progress in this field, challenges remain in producing both consistent and realistic facial animations. Existing methods, often based on GANs or UNet-based diffusion models, face three major limitations: (i) temporal jittering caused by weak temporal constraints, resulting in frame inconsistencies; (ii) identity drift due to insufficient 3D information extraction, leading to poor preservation of facial identity; and (iii) unnatural blinking behavior due to inadequate modeling of realistic blink dynamics. To address these issues, we propose MoDiT, a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) A hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization and progressively enhance full-face coherence, effectively mitigating temporal jittering. (ii) The integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction and improved lip synchronization using Wav2Lip results, thereby preserving identity consistency. (iii) A refined blinking strategy to model natural eye movements, with smoother and more realistic blinking behaviors.
[CV-23] ICAS: Detecting Training Data from Autoregressive Image Generative Models ACM-MM2025
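[Quick Read]: This paper studies the vulnerability of autoregressive image generative models to training data detection, that is, how to identify whether unauthorized data was used in training. The key to the solution is a membership-inference method with two core components: implicit classification and an adaptive score aggregation strategy. It first computes implicit token-wise classification scores within the query image, then obtains a final score through an adaptive aggregation strategy that places more emphasis on low-scoring tokens, improving the identification of training-set samples.
Link: https://arxiv.org/abs/2507.05068
Authors: Hongyao Yu, Yixiang Qiu, Yiheng Yang, Hao Fang, Tianqu Zhuang, Jiaxin Hong, Bin Chen, Hao Wu, Shu-Tao Xia
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Harbin Institute of Technology, Shenzhen; Shenzhen ShenNong Information Technology Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: ACM MM 2025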
Abstract:Autoregressive image generation has witnessed rapid advancements, with prominent models such as scale-wise visual auto-regression pushing the boundaries of visual synthesis. However, these developments also raise significant concerns regarding data privacy and copyright. In response, training data detection has emerged as a critical task for identifying unauthorized data usage in model training. To better understand the vulnerability of autoregressive image generative models to such detection, we conduct the first study applying membership inference to this domain. Our approach comprises two key components: implicit classification and an adaptive score aggregation strategy. First, we compute the implicit token-wise classification score within the query image. Then we propose an adaptive score aggregation strategy to acquire a final score, which places greater emphasis on the tokens with lower scores. A higher final score indicates that the sample is more likely to be involved in the training set. To validate the effectiveness of our method, we adapt existing detection algorithms originally designed for LLMs to visual autoregressive models. Extensive experiments demonstrate the superiority of our method in both class-conditional and text-to-image scenarios. Moreover, our approach exhibits strong robustness and generalization under various data transformations. Furthermore, sufficient experiments suggest two novel key findings: (1) A linear scaling law on membership inference, exposing the vulnerability of large foundation models. (2) Training data from scale-wise visual autoregressive models is easier to detect than other autoregressive this http URL code is available at this https URL.
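
The second component, score aggregation that emphasizes low-scoring tokens, can be illustrated with a small sketch. The paper's actual weighting is not given in the abstract; the softmax-over-negated-scores weighting below is a hypothetical stand-in, as are the names `aggregate_scores` and `temperature`.

```python
import math

def aggregate_scores(token_scores, temperature=1.0):
    """Weighted mean of token-wise scores that up-weights low scores.

    Weights are softmax(-score / temperature), an assumed choice: the lower
    a token's score, the more it contributes to the final score.
    """
    if not token_scores:
        raise ValueError("token_scores must not be empty")
    lowest = min(token_scores)
    # Shift by the minimum so the largest exponent is 0 (numerical stability).
    weights = [math.exp(-(s - lowest) / temperature) for s in token_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, token_scores)) / total
```

Because low scores receive larger weights, the aggregate never exceeds the plain mean; a sample scores highly only if even its weakest tokens score well, which matches the abstract's statement that a higher final score indicates likely training-set membership.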
[CV-24] INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
[Quick Read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), where generated responses appear plausible but are inconsistent with the associated visual content. The key to the solution is INTER (Interaction Guidance Sampling), a training-free algorithm that explicitly guides LVLMs to reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations.
Link: https://arxiv.org/abs/2507.05056
Authors: Xin Dong, Shichao Dong, Jin Wang, Jing Huang, Li Zhou, Zenghui Sun, Lihua Jing, Jingsong Lan, Xiaoyong Zhu, Bo Zheng
Affiliations: University of Chinese Academy of Sciences; Taobao & Tmall Group of Alibaba; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans’ ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that, surprisingly, reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER: Interaction Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
[CV-25] Estimating Object Physical Properties from RGB-D Vision and Depth Robot Sensors Using Deep Learning
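[Quick Read]: This paper addresses accurate estimation of an object's inertial mass for robotic applications, a relatively underexplored problem when only vision sensors are available. The key to the solution is to combine sparse point-cloud data from depth images with RGB images, build a synthetic dataset, and train an image generation model to estimate dense depth maps, which are then used to augment an existing dataset of images paired with mass values for mass estimation.
Link: https://arxiv.org/abs/2507.05029
Authors: Ricardo Cardoso, Plinio Moreno
Affiliations: Instituto Superior Técnico; University of Lisbon; Institute for Systems and Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: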
Abstract:Inertial mass plays a crucial role in robotic applications such as object grasping, manipulation, and simulation, providing a strong prior for planning and control. Accurately estimating an object’s mass before interaction can significantly enhance the performance of various robotic tasks. However, mass estimation using only vision sensors is a relatively underexplored area. This paper proposes a novel approach combining sparse point-cloud data from depth images with RGB images to estimate the mass of objects. We evaluate a range of point-cloud processing architectures, alongside RGB-only methods. To overcome the limited availability of training data, we create a synthetic dataset using ShapeNetSem 3D models, simulating RGBD images via a Kinect camera. This synthetic data is used to train an image generation model for estimating dense depth maps, which we then use to augment an existing dataset of images paired with mass values. Our approach significantly outperforms existing benchmarks across all evaluated metrics. The data generation (this https URL) as well as the training of the depth estimator (this https URL) and the mass estimator (this https URL) are available online.
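
Turning a depth image into a point cloud is the usual first step before a point-cloud architecture can be applied. The sketch below uses the standard pinhole back-projection; the intrinsics `fx`, `fy`, `cx`, `cy` are placeholders rather than the Kinect parameters used in the paper, and the function name is hypothetical.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (list of rows, in metres) to 3D points.

    Uses the pinhole camera model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Pixels with non-positive depth are treated as missing and skipped,
    which is why the resulting cloud can be sparse.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

A mass estimator would consume these points together with the RGB image; how the two are fused is not covered by this sketch.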
[CV-26] Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision
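[Quick Read]: This paper addresses the inflexibility of surgical AI models, which are traditionally designed for one task at a time and therefore require a separate model for each task within a complex procedure. The key to the solution is MML-SurgAdapt, a framework based on vision-language models (VLMs), specifically CLIP, that handles diverse surgical tasks in a unified way through natural language supervision. To cope with the partial annotations that arise in multi-task learning, it adopts Single Positive Multi-Label (SPML) learning and extends it to integrate data from multiple tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations.
Link: https://arxiv.org/abs/2507.05020
Authors: Soham Walimbe, Britty Baby, Vinkle Srivastav, Nicolas Padoy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: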
Abstract:Surgical AI often involves multiple tasks within a single procedure, like phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility, requiring a separate model for each. To address this, we introduce MML-SurgAdapt, a unified multi-task framework with Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms the existing SPML frameworks for the task. By reducing the required labels by 23%, our approach proposes a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: this https URL
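
For readers unfamiliar with SPML, a common baseline in that literature is the "assume negative" loss: the single observed positive label is trained as positive and every unobserved label is treated as negative. The sketch below shows that baseline only; the abstract does not say which SPML loss MML-SurgAdapt uses, and the function name is hypothetical.

```python
import math

def assume_negative_bce(probs, positive_index, eps=1e-7):
    """Binary cross-entropy with one known positive label per instance.

    probs: predicted probability for each label.
    positive_index: index of the single annotated positive label.
    All other labels are assumed negative, which is a simplification and
    introduces label noise whenever an unobserved label is actually positive.
    """
    loss = 0.0
    for i, p in enumerate(probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        loss += -math.log(p) if i == positive_index else -math.log(1.0 - p)
    return loss / len(probs)
```

In a multi-task setting, the label vector would concatenate the labels of all tasks, so a frame annotated for one task contributes a single positive while the other tasks' labels remain unobserved.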
[CV-27] When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning
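[Quick Read]: This paper addresses surgical action planning, which requires predicting future instrument-verb-target triplets for real-time assistance. The key to the solution is a baseline called Dual-task Autoregressive Imitation Learning (DARIL), which uses imitation learning (IL) to achieve 34.6% mAP on action triplet recognition and 33.6% mAP on next-frame prediction, degrading smoothly to 29.2% at a 10-second horizon.
Link: https://arxiv.org/abs/2507.05011
Authors: Maxence Boels, Harry Robertshaw, Alejandro Granados, Prokar Dasgupta, Sebastien Ourselin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This manuscript has been submitted to a conference and is being peer reviewed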
Abstract:Surgical action planning requires predicting future instrument-verb-target triplets for real-time assistance. While teleoperated robotic surgery provides natural expert demonstrations for imitation learning (IL), reinforcement learning (RL) could potentially discover superior strategies through exploration. We present the first comprehensive comparison of IL versus RL for surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and 33.6% next frame prediction mAP with smooth planning degradation to 29.2% at 10-second horizons. We evaluated three RL variants: world model-based RL, direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches underperformed DARIL: world model RL dropped to 3.1% mAP at 10s, while direct video RL achieved only 15.9%. Our analysis reveals that distribution matching on expert-annotated test sets systematically favors IL over potentially valid RL policies that differ from training demonstrations. This challenges assumptions about RL superiority in sequential decision making and provides crucial insights for surgical AI development.
[CV-28] Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition
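[Quick Read]: This paper addresses the difficulty of recognizing the Critical View of Safety (CVS) in laparoscopic cholecystectomy, especially without costly annotations. Traditional methods rely on vision-only models that need expensive, labor-intensive spatial annotations. The key to the solution is CVS-AdaptNet, a multi-label adaptation strategy that aligns image embeddings with textual descriptions of each CVS criterion using positive and negative prompts, improving fine-grained binary classification across labels. Textual prompts strengthen the multi-modal framework and improve CVS recognition over image-only methods.
Link: https://arxiv.org/abs/2507.05007
Authors: Britty Baby, Vinkle Srivastav, Pooja P. Jain, Kun Yuan, Pietro Mascagni, Nicolas Padoy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: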
Abstract:The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models learning with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained, binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet’s multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help in analysing the image-text alignment. While further work is needed to match state-of-the-art spatial annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: this https URL
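
The positive/negative prompt idea can be sketched as an independent two-way comparison per criterion: the image embedding is scored against one prompt asserting the criterion and one denying it. The exact scoring in CVS-AdaptNet is not given in the abstract, so the sigmoid over the similarity difference, the `temperature` value, and the function names below are assumptions, and the embeddings are toy vectors rather than outputs of PeskaVLP.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def criterion_probability(image_emb, pos_emb, neg_emb, temperature=0.07):
    """Probability that one criterion holds, from a positive/negative prompt pair.

    Equivalent to a two-way softmax over the scaled similarities.
    """
    s_pos = cosine(image_emb, pos_emb) / temperature
    s_neg = cosine(image_emb, neg_emb) / temperature
    return 1.0 / (1.0 + math.exp(-(s_pos - s_neg)))

def predict_criteria(image_emb, prompt_pairs, threshold=0.5):
    """Multi-label prediction: one independent binary decision per criterion."""
    return [criterion_probability(image_emb, pos, neg) > threshold
            for pos, neg in prompt_pairs]
```

Each criterion gets its own binary decision, which is what distinguishes this multi-label setup from the multi-class softmax over all prompts used in typical zero-shot classification.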
[CV-29] Robust Incomplete-Modality Alignment for Ophthalmic Disease Grading and Diagnosis via Labeled Optimal Transport MICCAI2025
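Quick read: This paper tackles missing modalities in multimodal ophthalmic imaging diagnosis, a problem caused by the uneven distribution of healthcare resources that degrades diagnostic accuracy. Existing approaches such as modality imputation and knowledge distillation have notable limitations: imputation struggles to reconstruct localized lesion features accurately, while distillation depends on fully paired multimodal data. The paper proposes a multimodal alignment and fusion framework whose key idea is to account for the feature differences between optical coherence tomography (OCT) and color fundus images, emphasize alignment of semantic features within the same class, and explicitly learn a soft matching between modalities, so that a missing modality can draw on information from the available one and cross-modal feature alignment stays robust. An asymmetric fusion strategy is also introduced to exploit the distinct characteristics of OCT and fundus images.
Link: https://arxiv.org/abs/2507.04999
Authors: Qinkai Yu, Jianyang Xie, Yitian Zhao, Cheng Chen, Lijun Zhang, Liming Chen, Jun Cheng, Lu Liu, Yalin Zheng, Yanda Meng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025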
Abstract: Multimodal ophthalmic imaging-based diagnosis integrates color fundus image with optical coherence tomography (OCT) to provide a comprehensive view of ocular pathologies. However, the uneven global distribution of healthcare resources often results in real-world clinical scenarios encountering incomplete multimodal data, which significantly compromises diagnostic accuracy. Existing commonly used pipelines, such as modality imputation and distillation methods, face notable limitations: 1) Imputation methods struggle with accurately reconstructing key lesion features, since OCT lesions are localized, while fundus images vary in style. 2) Distillation methods rely heavily on fully paired multimodal training data. To address these challenges, we propose a novel multimodal alignment and fusion framework capable of robustly handling missing modalities in the task of ophthalmic diagnostics. By considering the distinctive feature characteristics of OCT and fundus images, we emphasize the alignment of semantic features within the same category and explicitly learn soft matching between modalities, allowing the missing modality to utilize existing modality information, achieving robust cross-modal feature alignment under the missing modality. Specifically, we leverage Optimal Transport for multi-scale modality feature alignment: class-wise alignment through predicted class prototypes and feature-wise alignment via cross-modal shared feature transport. Furthermore, we propose an asymmetric fusion strategy that effectively exploits the distinct characteristics of OCT and fundus modalities. Extensive evaluations on three large ophthalmic multimodal datasets demonstrate our model’s superior performance under various modality-incomplete scenarios, achieving SOTA performance in both complete modality and inter-modality incompleteness conditions. Code is available at this https URL
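The "soft matching between modalities" via Optimal Transport can be illustrated with a generic entropic OT (Sinkhorn) sketch. This is a textbook Sinkhorn iteration, not the paper's labeled OT formulation; the feature vectors, the uniform marginals, and the `eps` value are assumptions chosen for a small runnable example.

```python
import math

def sinkhorn(cost, a, b, eps=0.5, n_iters=200):
    """Entropic optimal transport: returns a plan whose row sums approach `a`
    and whose column sums match `b`."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def sq_dist(x, y):
    """Squared Euclidean distance, used here as the transport cost."""
    return sum((p - q) ** 2 for p, q in zip(x, y))

# Hypothetical toy features for two modalities.
oct_feats = [[0.0, 0.0], [1.0, 1.0]]
fundus_feats = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8]]
cost = [[sq_dist(x, y) for y in fundus_feats] for x in oct_feats]
a = [0.5, 0.5]                 # uniform mass over OCT features
b = [1.0 / 3.0] * 3            # uniform mass over fundus features
plan = sinkhorn(cost, a, b)
```

Each entry `plan[i][j]` is a soft correspondence weight between OCT feature `i` and fundus feature `j`; nearby features receive most of the mass.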
[CV-30] AI for the Routine, Humans for the Complex: Accuracy-Driven Data Labelling with Mixed Integer Linear Programming
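Quick read: This paper addresses the difficulty of training and validating deep learning models when accurately labelled data are scarce, in particular the need for highly accurate labels at test time. The key contribution is OPAL, built around a mixed-integer linear programming (MILP) optimization model that minimizes manual labelling effort subject to a specified accuracy target. In experiments on seven datasets, OPAL reaches an average labelling accuracy of 98.8% while cutting manual labelling by more than half.
Link: https://arxiv.org/abs/2507.04990
Authors: Mohammad Hossein Amini, Mehrdad Sabetzadeh, Shiva Nejati
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments: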
Abstract: The scarcity of accurately labelled data remains a major challenge in deep learning (DL). Many DL approaches rely on semi-supervised methods, which focus on constructing large datasets that require only a minimal amount of human-labelled data. Since DL training algorithms can tolerate moderate label noise, it has generally been acceptable for the accuracy of labels in large training datasets to fall well short of a perfect 100%. However, when it comes to testing DL models, achieving high label accuracy, as close to 100% as possible, is paramount for reliable verification. In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. We evaluate OPAL for two tasks in the context of testing vision systems: automatic labelling of test data and automated validation of test data. Our evaluation, based on more than 2500 experiments performed on seven datasets, comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8%, just 1.2% below perfect accuracy, while cutting manual labelling by more than half. Further, OPAL significantly outperforms automated labelling baselines in labelling accuracy across all seven datasets, with large effect sizes, when all methods are provided with the same manual-labelling budget. For automated test-input validation, on average, OPAL reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the SOTA validation baselines. Finally, we show that augmenting OPAL with an active learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
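To make "minimize labelling effort subject to a specified accuracy target" concrete, here is a toy 0/1 program. It is not OPAL's actual MILP: the grouping of data, the per-group accuracy estimates, and the assumption that manual labels are always correct are all hypothetical. The binary variable `x_g` says whether group `g` is labelled by hand; the objective is `sum(n_g * x_g)` and the constraint is `sum(n_g * (x_g + (1 - x_g) * p_g)) >= target * N`. The sketch solves it by exhaustive search, which only works for a handful of groups; a real formulation would go to an MILP solver.

```python
from itertools import product

def min_manual_effort(sizes, auto_acc, target):
    """Pick which groups to label manually so that expected accuracy meets
    `target` at minimum manual cost. Brute force over all 0/1 assignments."""
    total = sum(sizes)
    best_cost, best_choice = None, None
    for choice in product((0, 1), repeat=len(sizes)):
        # Manually labelled groups are assumed fully correct; the others are
        # correct at their estimated automatic-labelling accuracy.
        expected_correct = sum(
            n * (1.0 if x else p) for n, p, x in zip(sizes, auto_acc, choice)
        )
        if expected_correct + 1e-9 < target * total:
            continue  # accuracy constraint violated
        cost = sum(n for n, x in zip(sizes, choice) if x)
        if best_cost is None or cost < best_cost:
            best_cost, best_choice = cost, choice
    return best_cost, best_choice

# Hypothetical groups: sizes and estimated automatic-labelling accuracy.
sizes = [500, 300, 200]
auto_acc = [0.99, 0.90, 0.70]
cost, choice = min_manual_effort(sizes, auto_acc, target=0.97)
```

In this toy instance the cheapest feasible plan hand-labels the two least reliable groups (500 items) and leaves the most reliable group to the automatic labeller.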
[CV-31] TLB-VFI: Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation
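Quick read: This paper targets shortcomings of existing video frame interpolation (VFI) methods in temporal information extraction, model efficiency, and training-data requirements. The key contribution is Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. Using 3D-wavelet gating and a temporal-aware autoencoder, it extracts rich temporal information from video inputs, maintaining strong performance while substantially reducing parameter count and inference time, and it lowers training-data requirements through optical flow guidance.
Link: https://arxiv.org/abs/2507.04984
Authors: Zonglin Lyu, Chen Chen
Affiliations: Center for Research in Computer Vision, University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: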
Abstract: Video Frame Interpolation (VFI) aims to predict the intermediate frame I_n (we use n to denote time in videos to avoid notation overload with the timestep t in diffusion models) based on two consecutive neighboring frames I_0 and I_1. Recent approaches apply diffusion models (both image-based and video-based) in this task and achieve strong performance. However, image-based diffusion models are unable to extract temporal information and are relatively inefficient compared to non-diffusion methods. Video-based diffusion models can extract temporal information, but they are too large in terms of training scale, model size, and inference time. To mitigate the above issues, we propose Temporal-Aware Latent Brownian Bridge Diffusion for Video Frame Interpolation (TLB-VFI), an efficient video-based diffusion model. By extracting rich temporal information from video inputs through our proposed 3D-wavelet gating and temporal-aware autoencoder, our method achieves 20% improvement in FID on the most challenging datasets over recent SOTA of image-based diffusion models. Meanwhile, due to the existence of rich temporal information, our method achieves strong performance while having 3x fewer parameters. Such a parameter reduction results in 2.3x speedup. By incorporating optical flow guidance, our method requires 9000x less training data and achieves over 20x fewer parameters than video-based diffusion models. Codes and results are available at our project page: this https URL.
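For readers unfamiliar with Brownian bridge diffusion, the sketch below samples the forward process between two fixed endpoints. It is a generic illustration, not TLB-VFI's code: the variance schedule `2 * s * (m_t - m_t**2)` with `m_t = t / T` is an assumption borrowed from the commonly used Brownian-bridge diffusion formulation, and the endpoint vectors are hypothetical latents (the abstract does not say which latents play these roles).

```python
import math
import random

def brownian_bridge_sample(x0, y, t, T, s=1.0, rng=random):
    """Sample x_t from a Brownian bridge pinned at x0 (t = 0) and y (t = T).

    `s` is a hypothetical variance scale; noise is largest mid-way and
    vanishes at both endpoints.
    """
    m_t = t / T
    var_t = 2.0 * s * (m_t - m_t ** 2)
    std_t = math.sqrt(var_t)
    return [
        (1.0 - m_t) * a + m_t * b + std_t * rng.gauss(0.0, 1.0)
        for a, b in zip(x0, y)
    ]

x0 = [0.5, -1.0, 2.0]   # hypothetical target latent
y = [0.0, 1.0, 1.5]     # hypothetical conditioning latent
rng = random.Random(0)
start = brownian_bridge_sample(x0, y, t=0, T=1000, rng=rng)
middle = brownian_bridge_sample(x0, y, t=500, T=1000, rng=rng)
end = brownian_bridge_sample(x0, y, t=1000, T=1000, rng=rng)
```

Unlike a standard diffusion process, which ends in pure Gaussian noise, the bridge ends exactly at the conditioning endpoint, so the reverse process starts from an informative state.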
[CV-32] Parameterized Diffusion Optimization enabled Autoregressive Ordinal Regression for Diabetic Retinopathy Grading MICCAI2025
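Quick read: This paper addresses the performance degradation in diabetic retinopathy (DR) grading caused by the long-tailed distribution of severity levels and ambiguous category boundaries. The key contribution is AOR-DR, a new autoregressive ordinal regression method that fuses the predictions of previous steps with extracted image features as the condition for the current prediction step, yielding an ordered grading process. It also uses a diffusion process for conditional probability modeling, so continuous global image features can be used directly for autoregression without relearning contextual information from local (patch-level) features, improving effectiveness and robustness.
Link: https://arxiv.org/abs/2507.04978
Authors: Qinkai Yu, Wei Zhou, Hantao Liu, Yanyu Xu, Meng Wang, Yitian Zhao, Huazhu Fu, Xujiong Ye, Yalin Zheng, Yanda Meng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025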
Abstract: As a long-term complication of diabetes, diabetic retinopathy (DR) progresses slowly, potentially taking years to threaten vision. An accurate and robust evaluation of its severity is vital to ensure prompt management and care. Ordinal regression leverages the underlying inherent order between categories to achieve superior performance beyond traditional classification. However, there exist challenges leading to lower DR classification performance: 1) The uneven distribution of DR severity levels, characterized by a long-tailed pattern, adds complexity to the grading process. 2) The ambiguity in defining category boundaries introduces additional challenges, making the classification process more complex and prone to inconsistencies. This work proposes a novel autoregressive ordinal regression method called AOR-DR to address the above challenges by leveraging the clinical knowledge of inherent ordinal information in DR grading dataset settings. Specifically, we decompose the DR grading task into a series of ordered steps by fusing the prediction of the previous steps with extracted image features as conditions for the current prediction step. Additionally, we exploit the diffusion process to facilitate conditional probability modeling, enabling the direct use of continuous global image features for autoregression without relearning contextual information from patch-level features. This ensures the effectiveness of the autoregressive process and leverages the capabilities of pre-trained large-scale foundation models. Extensive experiments were conducted on four large-scale publicly available color fundus datasets, demonstrating our model’s effectiveness and superior performance over six recent state-of-the-art ordinal regression methods. The implementation code is available at this https URL.
[CV-33] InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior
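Quick read: This paper addresses local inconsistency in multi-view editing with 3D Gaussian Splatting (3DGS), especially local artifacts, texture blurring, and semantic variations in regions of non-rigid deformation, as well as the limited control users have over the degree of editing when methods rely purely on text prompts. The key contribution is InterGSEdit, an interactive framework in which users select key views and a CLIP-based Semantic Consistency Selection (CSCS) strategy adaptively screens semantically consistent reference views. Cross-attention maps from these views are used to build a 3D Geometry-Consistent Attention Prior (GAP^3D), and an Attention Fusion Network (AFN) fuses the 3D-constrained attention with 2D cross-attention to balance geometric consistency and fine-grained detail.
Link: https://arxiv.org/abs/2507.04961
Authors: Minghao Wen, Shengjie Wu, Kangkan Wang, Dong Liang
Affiliations: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Nanjing University of Science and Technology (南京理工大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: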
Abstract: 3D Gaussian Splatting based 3D editing has demonstrated impressive performance in recent years. However, the multi-view editing often exhibits significant local inconsistency, especially in areas of non-rigid deformation, which leads to local artifacts, texture blurring, or semantic variations in edited 3D scenes. We also found that the existing editing methods, which rely entirely on text prompts, make the editing process a “one-shot deal”, making it difficult for users to control the editing degree flexibly. In response to these challenges, we present InterGSEdit, a novel framework for high-quality 3DGS editing via interactively selecting key views with users’ preferences. We propose a CLIP-based Semantic Consistency Selection (CSCS) strategy to adaptively screen a group of semantically consistent reference views for each user-selected key view. Then, the cross-attention maps derived from the reference views are used in a weighted Gaussian Splatting unprojection to construct the 3D Geometry-Consistent Attention Prior (GAP^3D). We project GAP^3D to obtain 3D-constrained attention, which is fused with 2D cross-attention via an Attention Fusion Network (AFN). AFN employs an adaptive attention strategy that prioritizes 3D-constrained attention for geometric consistency during early inference, and gradually prioritizes 2D cross-attention maps in diffusion for fine-grained features during the later inference. Extensive experiments demonstrate that InterGSEdit achieves state-of-the-art performance, delivering consistent, high-fidelity 3DGS editing with improved user experience.
[CV-34] Hear-Your-Click: Interactive Video-to-Audio Generation via Object-aware Contrastive Audio-Visual Fine-tuning
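Quick read: This paper addresses the difficulty that video-to-audio (V2A) generation methods relying on global video information have in producing audio for specific objects or regions in complex scenes. The key contribution is Hear-Your-Click, an interactive V2A framework that generates the corresponding sound when a user clicks a specific object in a video frame. It introduces Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with audio segments, along with two data augmentation strategies that make the model more sensitive to the segmented objects.
Link: https://arxiv.org/abs/2507.04959
Authors: Yingshan Liang, Keyu Fan, Zhicheng Du, Yiran Wang, Qingyang Shi, Xinyu Zhang, Jiasheng Lu, Peiwu Qin
Affiliations: Shenzhen International Graduate School, Tsinghua University (深圳国际研究生院,清华大学); Huawei Technologies Co., Ltd. (华为技术有限公司)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: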
Abstract:Video-to-audio (V2A) generation shows great potential in fields such as film production. Despite significant advances, current V2A methods, which rely on global video information, struggle with complex scenes and often fail to generate audio tailored to specific objects or regions in the videos. To address these limitations, we introduce Hear-Your-Click, an interactive V2A framework that enables users to generate sounds for specific objects in the videos by simply clicking on the frame. To achieve this, we propose Object-aware Contrastive Audio-Visual Fine-tuning (OCAV) with a Mask-guided Visual Encoder (MVE) to obtain object-level visual features aligned with corresponding audio segments. Furthermore, we tailor two data augmentation strategies: Random Video Stitching (RVS) and Mask-guided Loudness Modulation (MLM), aimed at enhancing the model’s sensitivity to the segmented objects. To effectively measure the audio-visual correspondence, we design a new evaluation metric, the CAV score, for evaluation. Extensive experiments demonstrate that our framework offers more precise control and improved generation performance across various metrics. Project Page: this https URL
[CV-35] Boosting Temporal Sentence Grounding via Causal Inference ACM-MM2025
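Quick read: This paper addresses unreliable predictions and poor generalization in temporal sentence grounding (TSG) caused by spurious correlations between videos and textual queries. The key idea is to introduce causal intervention and counterfactual reasoning: the TSG task is reformulated from a causal perspective with a structural causal model, do-calculus is used to remove inherent biases in the textual data, and a counterfactual scenario that depends only on video features is constructed to remove the video's biasing influence, improving robustness and accuracy.
Link: https://arxiv.org/abs/2507.04958
Authors: Kefan Tang, Lihuo He, Jisheng Dang, Xinbo Gao
Affiliations: Xidian University (西安电子科技大学); National University of Singapore (新加坡国立大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted by ACM MM 2025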
Abstract:Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model’s tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model’s robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at this https URL.
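The visual counterfactual step, "isolating and removing the influence of the video from the overall effect", can be read as subtracting a video-only prediction from the full prediction. The sketch below shows that reading on toy scores. It is an interpretation for illustration, not the authors' code: the function name, the weighting factor `alpha`, and all score values are hypothetical.

```python
def debias_scores(full_scores, video_only_scores, alpha=1.0):
    """Subtract the video-only (counterfactual) score from the full score for
    each candidate moment. `alpha` is a hypothetical weighting factor."""
    return [f - alpha * v for f, v in zip(full_scores, video_only_scores)]

# Toy scores for three candidate moments. Moment 1 is visually salient, so it
# scores highly even when the query is ignored.
full = [0.2, 0.9, 0.7]
video_only = [0.1, 0.8, 0.2]
debiased = debias_scores(full, video_only)

best_before = max(range(len(full)), key=full.__getitem__)
best_after = max(range(len(debiased)), key=debiased.__getitem__)
```

Here the salient moment wins before debiasing, while after subtraction the moment whose score actually depends on the query is ranked first.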
[CV-36] EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
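Quick read: This paper addresses how to generate expressive and temporally accurate music from multimodal controls (human facial expressions and upper-body motion) together with text prompts. The key idea is to apply parameter-efficient fine-tuning (PEFT) to a pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls with a small dataset, and to introduce a temporal smoothing strategy that ensures precise synchronization between video and music.
Link: https://arxiv.org/abs/2507.04955
Authors: Fathinah Izzati, Xinyue Li, Gus Xia
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: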
Abstract:We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls - specifically, human facial expressions and upper-body motion - as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on the pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of generated music in terms of musicality, creativity, beat-tempo consistency, temporal alignment with the video, and text adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation.
[CV-37] DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer ICCV2025
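Quick read: This paper addresses the gap in image generation quality and computational efficiency between conventional masked autoregressive (masked AR) models and diffusion models. The key contribution is DC-HT, a deep compression hybrid tokenizer that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization. Building on DC-HT, the authors extend MaskGIT into the DC-AR framework, which first generates structural elements with discrete tokens and then refines them with residual tokens, reaching state-of-the-art results on MJHQ-30K and GenEval.
Link: https://arxiv.org/abs/2507.04947
Authors: Yecheng Wu, Junyu Chen, Zhuoyang Zhang, Enze Xie, Jincheng Yu, Junsong Chen, Jinyi Hu, Yao Lu, Song Han, Han Cai
Affiliations: NVIDIA (英伟达)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025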
Abstract:We introduce DC-AR, a novel masked autoregressive (AR) text-to-image generation framework that delivers superior image generation quality with exceptional computational efficiency. Due to the tokenizers’ limitations, prior masked AR models have lagged behind diffusion models in terms of quality or efficiency. We overcome this limitation by introducing DC-HT - a deep compression hybrid tokenizer for AR models that achieves a 32x spatial compression ratio while maintaining high reconstruction fidelity and cross-resolution generalization ability. Building upon DC-HT, we extend MaskGIT and create a new hybrid masked autoregressive image generation framework that first produces the structural elements through discrete tokens and then applies refinements via residual tokens. DC-AR achieves state-of-the-art results with a gFID of 5.49 on MJHQ-30K and an overall score of 0.69 on GenEval, while offering 1.5-7.9x higher throughput and 2.0-3.5x lower latency compared to prior leading diffusion and autoregressive models.
[CV-38] RainShift: A Benchmark for Precipitation Downscaling Across Geographies
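Quick read: This paper addresses two problems: Earth System Models (ESMs) are too computationally expensive to run at the resolution needed for local-scale risk assessment, and deep learning super-resolution models generalize poorly across geographic regions. The key contribution is RainShift, a dataset and benchmark for evaluating downscaling performance under geographic distribution shifts, together with evidence that methods such as data alignment can improve spatial generalization.
Link: https://arxiv.org/abs/2507.04930
Authors: Paula Harder, Luca Schmidt, Francis Pelletier, Nicole Ludwig, Matthew Chantry, Christian Lessig, Alex Hernandez-Garcia, David Rolnick
Affiliations: Mila Quebec AI Institute, Montreal, Canada; European Centre for Medium Range Weather Forecasts (ECMWF), Bonn, Germany; Cluster of Excellence Machine Learning, University of Tübingen, Tübingen, Germany; McGill University, Montreal, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: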
Abstract: Earth System Models (ESMs) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area, demanding high-resolution observational data, which is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches including GANs and diffusion models in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.
[CV-39] ConBatch-BAL: Batch Bayesian Active Learning under Budget Constraints
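Quick read: This paper addresses the difficulty of adopting active learning in real-world applications where annotation costs vary across data points and budgets are constrained. The key contribution is two batch acquisition strategies based on Bayesian active learning (ConBatch-BAL), one using dynamic thresholding and one using greedy selection. Both select samples using uncertainty metrics computed with Bayesian neural networks, optimizing sample acquisition under budget constraints.
Link: https://arxiv.org/abs/2507.04929
Authors: Pablo G. Morato, Charalampos P. Andriotis, Seyran Khademi
Affiliations: Delft University of Technology (代尔夫特理工大学)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: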
Abstract:Varying annotation costs among data points and budget constraints can hinder the adoption of active learning strategies in real-world applications. This work introduces two Bayesian active learning strategies for batch acquisition under constraints (ConBatch-BAL), one based on dynamic thresholding and one following greedy acquisition. Both select samples using uncertainty metrics computed via Bayesian neural networks. The dynamic thresholding strategy redistributes the budget across the batch, while the greedy one selects the top-ranked sample at each step, limited by the remaining budget. Focusing on scenarios with costly data annotation and geospatial constraints, we also release two new real-world datasets containing geolocated aerial images of buildings, annotated with energy efficiency or typology classes. The ConBatch-BAL strategies are benchmarked against a random acquisition baseline on these datasets under various budget and cost scenarios. The results show that the developed ConBatch-BAL strategies can reduce active learning iterations and data acquisition costs in real-world settings, and even outperform the unconstrained baseline solutions.
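The greedy strategy described in the abstract ("selects the top-ranked sample at each step, limited by the remaining budget") can be sketched as follows. This is one plausible reading, not the authors' implementation: it assumes "limited by the remaining budget" means choosing the most uncertain sample among those still affordable, and the uncertainty and cost values are made up. In the paper the uncertainty scores come from Bayesian neural networks; here they are plain numbers.

```python
def greedy_acquire(uncertainty, cost, budget):
    """Repeatedly pick the most uncertain sample that still fits the remaining
    budget. Returns the chosen indices (in acquisition order) and the budget
    left over."""
    remaining = budget
    chosen = []
    candidates = set(range(len(uncertainty)))
    while True:
        affordable = sorted(i for i in candidates if cost[i] <= remaining)
        if not affordable:
            break
        best = max(affordable, key=lambda i: uncertainty[i])
        chosen.append(best)
        remaining -= cost[best]
        candidates.remove(best)
    return chosen, remaining

# Hypothetical pool: per-sample uncertainty and annotation cost.
uncertainty = [0.9, 0.8, 0.5, 0.4]
cost = [5.0, 2.0, 2.0, 1.0]
chosen, left = greedy_acquire(uncertainty, cost, budget=6.0)
```

In this toy pool the expensive top-ranked sample consumes most of the budget, leaving room only for the cheapest remaining one. Spreading the budget across the batch, as the dynamic thresholding strategy is described as doing, would be a way around this.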
[CV-40] Leveraging Self-Supervised Features for Efficient Flooded Region Identification in UAV Aerial Images
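Quick read: This paper addresses accurate identification of flooded regions in unmanned aerial vehicle (UAV) imagery. Conventional methods rely on supervised learning with manually annotated data, and manual annotation is time-consuming and error-prone. The key idea is to use self-supervised features, specifically pretrained visual features from the DINOv2 model, to reduce the dependence on manual annotation and improve aerial image segmentation. Combining these general-purpose self-supervised features with a conventional encoder-decoder structure enables high-accuracy flooded-region identification with limited labeled data.
Link: https://arxiv.org/abs/2507.04915
Authors: Dibyabha Deb, Ujjwal Verma
Affiliations: Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 Pages, 4 Figures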
Abstract:Identifying regions affected by disasters is a vital step in effectively managing and planning relief and rescue efforts. Unlike the traditional approaches of manually assessing post-disaster damage, analyzing images of Unmanned Aerial Vehicles (UAVs) offers an objective and reliable way to assess the damage. In the past, segmentation techniques have been adopted to identify post-flood damage in UAV aerial images. However, most of these supervised learning approaches rely on manually annotated datasets. Indeed, annotating images is a time-consuming and error-prone task that requires domain expertise. This work focuses on leveraging self-supervised features to accurately identify flooded regions in UAV aerial images. This work proposes two encoder-decoder-based segmentation approaches, which integrate the visual features learned from DINOv2 with the traditional encoder backbone. This study investigates the generalization of self-supervised features for UAV aerial images. Specifically, we evaluate the effectiveness of features from the DINOv2 model, trained on non-aerial images, for segmenting aerial images, noting the distinct perspectives between the two image types. Our results demonstrate that DINOv2’s self-supervised pretraining on natural images generates transferable, general-purpose visual features that streamline the development of aerial segmentation workflows. By leveraging these features as a foundation, we significantly reduce reliance on labor-intensive manual annotation processes, enabling high-accuracy segmentation with limited labeled aerial data.
[CV-41] Piggyback Camera: Easy-to-Deploy Visual Surveillance by Mobile Sensing on Commercial Robot Vacuums
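Quick read: This paper addresses visual surveillance on commercial robot vacuums without modifying their hardware. The key idea is to mount a smartphone equipped with a camera and an inertial measurement unit (IMU) on the robot and estimate robot poses with neural inertial navigation. A test-time data augmentation method called Rotation-Augmented Ensemble (RAE) mitigates domain gaps in neural inertial navigation, and a loop closure method that exploits the robot's cleaning patterns further refines the pose estimates.
Link: https://arxiv.org/abs/2507.04910
Authors: Ryo Yonetani
Affiliations: CyberAgent, Inc.
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: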
Abstract:This paper presents Piggyback Camera, an easy-to-deploy system for visual surveillance using commercial robot vacuums. Rather than requiring access to internal robot systems, our approach mounts a smartphone equipped with a camera and Inertial Measurement Unit (IMU) on the robot, making it applicable to any commercial robot without hardware modifications. The system estimates robot poses through neural inertial navigation and efficiently captures images at regular spatial intervals throughout the cleaning task. We develop a novel test-time data augmentation method called Rotation-Augmented Ensemble (RAE) to mitigate domain gaps in neural inertial navigation. A loop closure method that exploits robot cleaning patterns further refines these estimated poses. We demonstrate the system with an object mapping application that analyzes captured images to geo-localize objects in the environment. Experimental evaluation in retail environments shows that our approach achieves 0.83 m relative pose error for robot localization and 0.97 m positional error for object mapping of over 100 items.
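The abstract names Rotation-Augmented Ensemble (RAE) as a test-time data augmentation method but does not describe its internals. The sketch below shows a generic rotation-based test-time ensemble that the name suggests: rotate the planar input by several yaw angles, run the model on each copy, rotate each prediction back, and average. Everything here is an assumption for illustration, including the 2-D input format, the number of rotations, and the toy model with a direction-dependent bias.

```python
import math

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def rotation_augmented_ensemble(model, imu_window, num_rotations=4):
    """Average model predictions over evenly spaced yaw rotations of the input,
    undoing each rotation on the output before averaging."""
    sum_x = sum_y = 0.0
    for k in range(num_rotations):
        angle = 2.0 * math.pi * k / num_rotations
        rotated = [rotate(sample, angle) for sample in imu_window]
        vx, vy = rotate(model(rotated), -angle)
        sum_x += vx
        sum_y += vy
    return (sum_x / num_rotations, sum_y / num_rotations)

def toy_model(window):
    """Hypothetical stand-in for a neural inertial model: the window mean plus
    a fixed bias along +x that does not rotate with the input."""
    n = len(window)
    mean_x = sum(p[0] for p in window) / n
    mean_y = sum(p[1] for p in window) / n
    return (mean_x + 0.3, mean_y)

window = [(1.0, 0.0), (1.0, 0.2)]
single = toy_model(window)
ensembled = rotation_augmented_ensemble(toy_model, window)
```

In this toy setup the single prediction carries the full bias, while the ensembled prediction recovers the unbiased window mean, because a bias that does not rotate with the input cancels when averaged over evenly spaced rotations.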
[CV-42] HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding
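Quick read: This paper addresses the limited ability of current Multimodal Large Language Models (MLLMs) to understand human-centric video data, which stems mainly from the lack of comprehensive, high-quality evaluation benchmarks. Existing benchmarks focus on video generation quality and action recognition, overlook the perceptual and cognitive abilities needed in human-centric scenarios, and are limited by single-question paradigms and overly simple evaluation metrics. The key contribution is HV-MMBench, a rigorously curated benchmark designed for more holistic evaluation. Its core features are diverse evaluation dimensions, varied data types, multi-domain video coverage, and temporal coverage, giving a more accurate picture of model capabilities in human-centric video understanding.
Link: https://arxiv.org/abs/2507.04909
Authors: Yuxuan Cai, Jiangning Zhang, Zhenye Gan, Qingdong He, Xiaobin Hu, Junwei Zhu, Yabiao Wang, Chengjie Wang, Zhucun Xue, Xinwei He, Xiang Bai
Affiliations: HUST (华中科技大学); Tencent YouTu Lab (腾讯优图实验室); ZJU (浙江大学); HAU (河南农业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under review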
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address the above limitations, we propose HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 15 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30 min) durations, supporting systematic analysis of models’ temporal reasoning abilities across diverse contextual lengths.
[CV-43] HGNet: High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention Network for Colorectal Polyp Detection
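Quick read: This paper addresses challenges in the early detection of colorectal cancer (CRC): small lesions are hard to detect, boundaries are hard to localize precisely, and model decisions lack interpretability. The key contribution is the HGNet architecture, which integrates a High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention. Its core innovations are an Efficient Multi-Scale Context Attention (EMCA) module that enhances lesion feature representation and boundary modeling; a spatial hypergraph convolution module placed before the detection head to capture higher-order spatial relationships between nodes; transfer learning to cope with scarce medical image data; and Eigen Class Activation Map (Eigen-CAM) for decision visualization.
Link: https://arxiv.org/abs/2507.04880
Authors: Xiaofang Liu, Lingling Sun, Xuqing Zhang, Yuannong Ye, Bin zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: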
Abstract:Colorectal cancer (CRC) is closely linked to the malignant transformation of colorectal polyps, making early detection essential. However, current models struggle with detecting small lesions, accurately localizing boundaries, and providing interpretable decisions. To address these issues, we propose HGNet, which integrates High-Order Spatial Awareness Hypergraph and Multi-Scale Context Attention. Key innovations include: (1) an Efficient Multi-Scale Context Attention (EMCA) module to enhance lesion feature representation and boundary modeling; (2) the deployment of a spatial hypergraph convolution module before the detection head to capture higher-order spatial relationships between nodes; (3) the application of transfer learning to address the scarcity of medical image data; and (4) Eigen Class Activation Map (Eigen-CAM) for decision visualization. Experimental results show that HGNet achieves 94% accuracy, 90.6% recall, and 90% mAP@0.5, significantly improving small lesion differentiation and clinical interpretability. The source code will be made publicly available upon publication of this paper.
[CV-44] Semantically Consistent Discrete Diffusion for 3D Biological Graph Modeling MICCAI2025
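Quick read: This paper addresses the challenge of maintaining anatomical validity when generating 3D biological graphs, a key limitation of existing diffusion-based methods. The key idea is to use a novel projection operator that stochastically fixes inconsistencies during sampling, together with an improved edge-deletion-based noising procedure suited to sparse biological graphs.
Link: https://arxiv.org/abs/2507.04856
Authors: Chinmay Prabhakar, Suprosanna Shit, Tamaz Amiranashvili, Hongwei Bran Li, Bjoern Menze
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2025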
Abstract: 3D spatial graphs play a crucial role in biological and clinical research by modeling anatomical networks such as blood vessels, neurons, and airways. However, generating 3D biological graphs while maintaining anatomical validity remains challenging, a key limitation of existing diffusion-based methods. In this work, we propose a novel 3D biological graph generation method that adheres to structural and semantic plausibility conditions. We achieve this by using a novel projection operator during sampling that stochastically fixes inconsistencies. Further, we adopt a superior edge-deletion-based noising procedure suitable for sparse biological graphs. Our method demonstrates superior performance on two real-world datasets, human circle of Willis and lung airways, compared to previous approaches. Importantly, we demonstrate that the generated samples significantly enhance downstream graph labeling performance. Furthermore, we show that our generative model is a reasonable out-of-the-box link predictor.
[CV-45] Efficient SAR Vessel Detection for FPGA-Based On-Satellite Sensing
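[Quick Read]: This paper addresses the problem of rapid satellite data analysis, specifically that in Synthetic Aperture Radar (SAR) vessel detection, conventional machine learning (ML) models are too large or too power-hungry to deploy on satellite platforms. The key to its solution is developing and deploying an efficient, high-performing FPGA-based SAR vessel detection model that uses a customised YOLOv8 architecture with hardware acceleration optimised under a 10 W power constraint.
Link: https://arxiv.org/abs/2507.04842
Authors: Colin Laganier, Liam Fletcher, Elim Kwan, Richard Walters, Victoria Nockles
Affiliations: The Alan Turing Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures, 3 tables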
Abstract:Rapid analysis of satellite data is vital for many remote sensing applications, from disaster response to environmental monitoring, but is becoming harder to achieve with the increasing volumes of data generated by modern satellites. On-satellite machine learning (ML) offers a potential solution, by reducing latency associated with transmission of these large data volumes to ground stations, but state-of-the-art models are often too large or power-hungry for satellite deployment. Vessel detection using Synthetic Aperture Radar (SAR) is a critical time-sensitive task for maritime security that exemplifies this challenge. SAR vessel detection has previously been demonstrated only by ML models that either are too large for satellite deployment, have not been developed for sufficiently low-power hardware, or have only been developed and tested on small SAR datasets that do not sufficiently represent the real-world task. Here we address this issue by developing and deploying a new efficient and highly performant SAR vessel detection model, using a customised YOLOv8 architecture specifically optimized for FPGA-based processing within common satellite power constraints (10W). We train and evaluate our model on the largest and most diverse open SAR vessel dataset, xView3-SAR, and deploy it on a Kria KV260 MPSoC. We show that our FPGA-based model has detection and classification performance only ~2% and 3% lower than values from state-of-the-art GPU-based models, despite being two to three orders of magnitude smaller in size. This work demonstrates small yet highly performant ML models for time-critical SAR analysis, paving the way for more autonomous, responsive, and scalable Earth observation systems.
[CV-46] CMET: Clustering guided METric for quantifying embedding quality
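[Quick Read]: This paper addresses two problems in evaluating the quality of data embeddings: existing metrics have high time and space complexity, and it is hard to verify statistically whether a transformed embedding preserves the local and global structure of the original data. The key to its solution is a new metric, the Clustering guided METric (CMET), which comprises two scores, CMET_L and CMET_G, that measure how well an embedding preserves local and global structure, respectively, and thereby enables effective quantitative comparison of embedding quality.
Link: https://arxiv.org/abs/2507.04840
Authors: Sourav Ghosh, Chayan Maitra, Rajat K. De
Affiliations: Standard Chartered Modelling And Analytics Centre Private Limited; Indian Statistical Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 19 figures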
Abstract:Due to rapid advancements in technology, datasets are available from various domains. In order to carry out more relevant and appropriate analysis, it is often necessary to project the dataset into a higher or lower dimensional space based on requirement. Projecting the data in a higher-dimensional space helps in unfolding intricate patterns, enhancing the performance of the underlying models. On the other hand, dimensionality reduction is helpful in denoising data while capturing maximal information, as well as reducing execution time and this http URL this context, it is not always statistically evident whether the transformed embedding retains the local and global structure of the original data. Most of the existing metrics that are used for comparing the local and global shape of the embedding against the original one are highly expensive in terms of time and space complexity. In order to address this issue, the objective of this study is to formulate a novel metric, called Clustering guided METric (CMET), for quantifying embedding quality. It is effective to serve the purpose of quantitative comparison between an embedding and the original data. CMET consists of two scores, viz., CMET_L and CMET_G, that measure the degree of local and global shape preservation capability, respectively. The efficacy of CMET has been demonstrated on a wide variety of datasets, including four synthetic, two biological, and two image datasets. Results reflect the favorable performance of CMET against the state-of-the-art methods. Capability to handle both small and large data, low algorithmic complexity, better and stable performance across all kinds of data, and different choices of hyper-parameters feature CMET as a reliable metric.
[CV-47] RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction ICCV2025
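[Quick Read]: This paper aims to train a keypoint extractor under weak supervision, in particular to achieve efficient and robust keypoint detection and description without artificial transformations, pre-generated models, or 3D data. The key to its solution is the RIPE framework, which relies only on a binary label indicating whether paired images show the same scene, greatly expanding the pool of usable training data. It fuses multi-scale information from the encoder's intermediate layers through a hyper-column approach and adds an auxiliary loss to make the descriptors more discriminative.
Link: https://arxiv.org/abs/2507.04839
Authors: Johannes Künzel, Anna Hilsmann, Peter Eisert
Affiliations: Fraunhofer Heinrich-Hertz-Institut, HHI; Humboldt University Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025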
Abstract:We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder’s intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. To support further research, we have made our code publicly available at this https URL.
[CV-48] SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
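[Quick Read]: This paper aims to accurately model complex road structures in autonomous driving, such as loops and bidirectional lanes, which traditional methods handle poorly. The key to its solution is the SeqGrowGraph framework, which models lane topology as a chain of graph expansions: it introduces vertices one at a time while updating an adjacency matrix and a geometric matrix to capture road connectivity and centerline shapes, and then uses a Transformer to predict the expansions autoregressively in depth-first search order.
Link: https://arxiv.org/abs/2507.04822
Authors: Mengwei Xie, Shuang Zeng, Xinyuan Chang, Xinran Liu, Zheng Pan, Mu Xu, Xing Wei
Affiliations: Amap; Alibaba Group; Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: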
Abstract:Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures, such as loops and bidirectional lanes, prevalent in real-world road structure. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph $G=(V,E)$, with intersections ($V$) and centerlines ($E$), SeqGrowGraph incrementally constructs this graph by introducing one vertex at a time. At each step, an adjacency matrix ($A$) expands from $n \times n$ to $(n+1) \times (n+1)$ to encode connectivity, while a geometric matrix ($M$) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.
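The two building blocks named in the abstract, growing the adjacency matrix by one vertex and describing a centerline as a quadratic Bézier curve, can be sketched in a few lines of Python. This is only a toy illustration: the helper names, the list-of-lists matrix layout, and the example control points are assumptions, and the paper's serialization and transformer decoding are not shown.

```python
def expand_adjacency(adj, incoming, outgoing):
    """Grow an n x n adjacency matrix to (n+1) x (n+1) for one new vertex.

    incoming[i] == 1 means an edge from existing vertex i to the new vertex;
    outgoing[i] == 1 means an edge from the new vertex to existing vertex i.
    """
    n = len(adj)
    assert len(incoming) == n and len(outgoing) == n
    grown = [row[:] + [incoming[i]] for i, row in enumerate(adj)]
    grown.append(list(outgoing) + [0])  # assumption: no self-loop in this sketch
    return grown


def quadratic_bezier(p0, p1, p2, t):
    """Point at parameter t in [0, 1] on the curve with control points p0, p1, p2."""
    a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
    return tuple(a * x0 + b * x1 + c * x2 for x0, x1, x2 in zip(p0, p1, p2))


adj = [[0]]                                                     # a single intersection
adj = expand_adjacency(adj, incoming=[1], outgoing=[0])         # adds edge 0 -> 1
adj = expand_adjacency(adj, incoming=[0, 1], outgoing=[1, 0])   # adds 1 -> 2 and 2 -> 0

# Midpoint of a hypothetical centerline with control points (0,0), (1,2), (2,0).
mid = quadratic_bezier((0.0, 0.0), (1.0, 2.0), (2.0, 0.0), 0.5)
```

After the two expansions the matrix encodes the loop 0 -> 1 -> 2 -> 0, the kind of cyclic structure the abstract says traditional methods struggle with.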
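[CV-49] UDF-GMA: Uncertainty Disentanglement and Fusion for General Movement Assessment
[Quick Read]: This paper aims to address the uncertainty that limited data quality and noisy pose estimation introduce into pose-based automated general movement assessment (GMA), which undermines diagnostic reliability in clinical use. The key to its solution is UDF-GMA, which explicitly models epistemic uncertainty in the model parameters and aleatoric uncertainty from data noise, estimates the epistemic uncertainty through Bayesian approximation, and fuses both uncertainties with the embedded motion representation to improve class separation.
Link: https://arxiv.org/abs/2507.04814
Authors: Zeqi Luo, Ali Gooya, Edmond S. L. Ho
Affiliations: University of Glasgow
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: This work has been accepted for publication in IEEE Journal of Biomedical and Health Informatics (J-BHI)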
Abstract:General movement assessment (GMA) is a non-invasive tool for the early detection of brain dysfunction through the qualitative assessment of general movements, and the development of automated methods can broaden its application. However, mainstream pose-based automated GMA methods are prone to uncertainty due to limited high-quality data and noisy pose estimation, hindering clinical reliability without reliable uncertainty measures. In this work, we introduce UDF-GMA which explicitly models epistemic uncertainty in model parameters and aleatoric uncertainty from data noise for pose-based automated GMA. UDF-GMA effectively disentangles uncertainties by directly modelling aleatoric uncertainty and estimating epistemic uncertainty through Bayesian approximation. We further propose fusing these uncertainties with the embedded motion representation to enhance class separation. Extensive experiments on the Pmi-GMA benchmark dataset demonstrate the effectiveness and generalisability of the proposed approach in predicting poor repertoire.
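To make the epistemic/aleatoric distinction concrete, the sketch below uses one common entropy-based decomposition over several stochastic forward passes (for example, from a Bayesian approximation such as Monte Carlo dropout). It is an assumption-laden illustration, not UDF-GMA itself: the paper models aleatoric uncertainty directly, and the toy probability vectors here are invented.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def decompose_uncertainty(samples):
    """Split predictive uncertainty given class-probability vectors from T stochastic passes.

    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between passes)
    """
    t = len(samples)
    mean = [sum(col) / t for col in zip(*samples)]
    total = entropy(mean)
    aleatoric = sum(entropy(p) for p in samples) / t
    return total, aleatoric, total - aleatoric


# Passes agree that the input is ambiguous: uncertainty is purely aleatoric.
agree = decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]])
# Passes are individually confident but contradict each other: purely epistemic.
disagree = decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]])
```

Both toy cases have the same total uncertainty, yet they call for different responses: more or cleaner data cannot help the first, while more training data could help the second.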
[CV-50] PointGAC: Geometric-Aware Codebook for Masked Point Cloud Modeling ICCV2025
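[Quick Read]: This paper addresses the tendency of conventional masked point cloud modeling (MPM) methods to over-constrain the model to learn details when reconstructing masked regions, so that it fails to capture generalized features. The key to its solution is PointGAC, a clustering-based MPM method that aligns the feature distribution of masked regions through an online codebook-guided teacher-student framework. Specifically, it first extracts initial patches with a geometry-aware partitioning strategy, then updates the codebook via online k-means on features of the complete patches so that the codebook vectors become cluster centers. It then assigns the unmasked features to their cluster centers and has the student model align the assignments of the reconstructed masked features, focusing on identifying which cluster center each masked feature belongs to and thereby learning more generalized feature representations.
Link: https://arxiv.org/abs/2507.04801
Authors: Abiao Li, Chenlei Lv, Yuming Fang, Yifan Zuo, Jian Zhang, Guofeng Mei
Affiliations: Jiangxi University of Finance and Economics; Shenzhen University; University of Technology Sydney; Fondazione Bruno Kessler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025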
Abstract:Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinate or feature of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose PointGAC, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online k-means based on features extracted from the complete patches. This procedure facilitates codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from a proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks. Code is available at this https URL
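The online k-means codebook update mentioned in the abstract can be pictured with a small Python sketch. The momentum-style update, the handling of unused codes, and all names are assumptions made for illustration; the paper's own update rule and codebook maintenance mechanism may differ.

```python
def nearest_code(codebook, feature):
    """Index of the codebook vector closest to the feature (squared Euclidean distance)."""
    def sqdist(code):
        return sum((c - f) ** 2 for c, f in zip(code, feature))
    return min(range(len(codebook)), key=lambda k: sqdist(codebook[k]))


def online_kmeans_update(codebook, features, momentum=0.9):
    """One online step: move each code toward the mean of the features assigned to it."""
    groups = [[] for _ in codebook]
    for feature in features:
        groups[nearest_code(codebook, feature)].append(feature)
    updated = []
    for code, members in zip(codebook, groups):
        if not members:
            updated.append(list(code))  # assumption: unused codes are left unchanged
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        updated.append([momentum * c + (1 - momentum) * m for c, m in zip(code, mean)])
    return updated


codebook = [[0.0, 0.0], [10.0, 10.0]]
patch_features = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0]]  # toy stand-ins for patch features
codebook = online_kmeans_update(codebook, patch_features, momentum=0.5)
assignments = [nearest_code(codebook, f) for f in patch_features]
```

The `assignments` list plays the role of the cluster targets: in a teacher-student setup of the kind described, the student would be trained to predict these indices for masked patches rather than to regress raw coordinates.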
[CV-51] Model Compression using Progressive Channel Pruning
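[Quick Read]: This paper addresses the low inference-time computational efficiency of convolutional neural networks (CNNs) by accelerating them through channel pruning. The key to its solution is the Progressive Channel Pruning (PCP) framework, which iteratively prunes a small number of channels from several selected layers rather than pruning each layer once in a layer-by-layer fashion. Each iteration follows a three-step attempting-selecting-pruning pipeline: estimate the accuracy drop caused by pruning each layer, select the layers with the least impact on overall accuracy, and prune them, achieving efficient channel pruning while preserving model performance.
Link: https://arxiv.org/abs/2507.04792
Authors: Jinyang Guo, Weichen Zhang, Wanli Ouyang, Dong Xu
Affiliations: University of Sydney
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: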
Abstract:In this work, we propose a simple but effective channel pruning framework called Progressive Channel Pruning (PCP) to accelerate Convolutional Neural Networks (CNNs). In contrast to the existing channel pruning methods that prune channels only once per layer in a layer-by-layer fashion, our new progressive framework iteratively prunes a small number of channels from several selected layers, which consists of a three-step attempting-selecting-pruning pipeline in each iteration. In the attempting step, we attempt to prune a pre-defined number of channels from one layer by using any existing channel pruning methods and estimate the accuracy drop for this layer based on the labelled samples in the validation set. In the selecting step, based on the estimated accuracy drops for all layers, we propose a greedy strategy to automatically select a set of layers that will lead to less overall accuracy drop after pruning these layers. In the pruning step, we prune a small number of channels from these selected layers. We further extend our PCP framework to prune channels for the deep transfer learning methods like Domain Adversarial Neural Network (DANN), in which we effectively reduce the data distribution mismatch in the channel pruning process by using both labelled samples from the source domain and pseudo-labelled samples from the target domain. Our comprehensive experiments on two benchmark datasets demonstrate that our PCP framework outperforms the existing channel pruning approaches under both supervised learning and transfer learning settings.
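The attempting-selecting-pruning loop described in the abstract can be outlined as follows. This is a minimal sketch under stated assumptions: `toy_drop` is a hypothetical stand-in for estimating the accuracy drop on a validation set, the greedy rule simply takes the layers with the smallest estimated drop, and the layer names and budget are invented.

```python
def progressive_channel_pruning(channels, estimate_drop, step, budget, layers_per_iter):
    """Iteratively prune `step` channels from the layers whose pruning hurts least.

    channels        : dict mapping layer name -> current number of channels
    estimate_drop   : callable(layer, channels, step) -> estimated accuracy drop
    budget          : total number of channels to remove
    layers_per_iter : how many layers are pruned in each iteration
    """
    channels = dict(channels)
    removed = 0
    while removed < budget:
        # Attempting: estimate the accuracy drop of pruning each candidate layer.
        drops = {
            layer: estimate_drop(layer, channels, step)
            for layer, count in channels.items()
            if count > step
        }
        if not drops:
            break
        # Selecting: greedily keep the layers with the smallest estimated drop.
        chosen = sorted(drops, key=drops.get)[:layers_per_iter]
        # Pruning: remove a small number of channels from the chosen layers.
        for layer in chosen:
            if removed >= budget:
                break
            channels[layer] -= step
            removed += step
    return channels


# Hypothetical per-layer sensitivities; a real run would measure accuracy instead.
SENSITIVITY = {"conv1": 3.0, "conv2": 1.0, "conv3": 2.0}


def toy_drop(layer, channels, step):
    return SENSITIVITY[layer] * step / channels[layer]


pruned = progressive_channel_pruning(
    {"conv1": 8, "conv2": 8, "conv3": 8}, toy_drop, step=2, budget=8, layers_per_iter=2
)
```

Because the estimates are recomputed every iteration, a layer that was cheap to prune early can become expensive later, which is the practical difference from pruning each layer once.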
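[CV-52] Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning ICCV2025
[Quick Read]: This paper addresses how to make effective use of trajectory datasets from different domains for motion planning in a target domain in autonomous robot driving, a difficulty that stems from differences in agent interactions and environmental characteristics. The key to its solution is Interaction-Merged Motion Planning (IMMP), which leverages parameter checkpoints trained on different domains while adapting to the target domain. It follows a two-step process, pre-merging to capture agent behaviors and interactions and merging to build an adaptable model, so that diverse interaction information is transferred efficiently to the target domain.
Link: https://arxiv.org/abs/2507.04790
Authors: Giwon Lee, Wooseong Jeong, Daehee Park, Jaewoo Jeong, Kuk-Jin Yoon
Affiliations: KAIST; DGIST
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at ICCV 2025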
Abstract:Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
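For readers unfamiliar with merging parameter checkpoints, the simplest form is a weighted average of matching parameters, sketched below. This generic averaging is an assumption used only to illustrate the idea; it is not IMMP's pre-merging or merging procedure, and the checkpoint layout and weights are hypothetical.

```python
def merge_checkpoints(checkpoints, weights):
    """Weighted average of checkpoints that share parameter names and shapes.

    checkpoints : list of dicts mapping parameter name -> flat list of floats
    weights     : one non-negative weight per checkpoint (normalized internally)
    """
    total = sum(weights)
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


# Two toy checkpoints, imagined as trained on different source domains.
domain_a = {"planner.weight": [1.0, 2.0]}
domain_b = {"planner.weight": [3.0, 6.0]}
merged = merge_checkpoints([domain_a, domain_b], weights=[1.0, 3.0])
```

Merging happens in parameter space, so no source data needs to be revisited at merge time, which is one reason checkpoint merging is cheaper than retraining or ensembling.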
[CV-53] FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System
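[Quick Read]: This paper addresses the fact that producing high-quality furniture decoration is time-consuming and requires specialized artistic expertise. The key to its solution is FurniMAS, a multi-agent system for automatic furniture decoration. It assembles LLM-based and non-LLM agents that collaborate through communication, logical reasoning, and validation to turn a user prompt into the final decorative result, yielding high-quality 3D decor that satisfies functionality, aesthetic, and ambiance preferences.
Link: https://arxiv.org/abs/2507.04770
Authors: Toan Nguyen, Tri Le, Quang Nguyen, Anh Nguyen
Affiliations: FSoft AI Center; University of Liverpool
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: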
Abstract:Furniture decoration is an important task in various industrial applications. However, achieving a high-quality decorative result is often time-consuming and requires specialized artistic expertise. To tackle these challenges, we explore how multi-agent systems can assist in automating the decoration process. We propose FurniMAS, a multi-agent system for automatic furniture decoration. Specifically, given a human prompt and a household furniture item such as a working desk or a TV stand, our system suggests relevant assets with appropriate styles and materials, and arranges them on the item, ensuring the decorative result meets functionality, aesthetic, and ambiance preferences. FurniMAS assembles a hybrid team of LLM-based and non-LLM agents, each fulfilling distinct roles in a typical decoration project. These agents collaborate through communication, logical reasoning, and validation to transform the requirements into the final outcome. Extensive experiments demonstrate that our FurniMAS significantly outperforms other baselines in generating high-quality 3D decor.
[CV-54] From Imitation to Innovation: The Emergence of AI Unique Artistic Styles and the Challenge of Copyright Protection
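[Quick Read]: This paper addresses the lack of systematic standards and reliable evaluation methods for determining the copyright of AI-generated artworks under current legal frameworks. The key to its solution is the ArtBulb framework, which combines a style description-based multimodal clustering method with multimodal large language models (MLLMs) to make AI art copyright judgments interpretable and quantifiable. The paper also builds AICD, the first AI art copyright benchmark dataset annotated by artists and legal experts, to support related research and practice.
Link: https://arxiv.org/abs/2507.04769
Authors: Zexi Jia, Chuanwei Huang, Yeshuang Zhu, Hongyan Fei, Ying Deng, Zhiqiang Yuan, Jiapei Zhang, Jinchao Zhang, Jie Zhou
Affiliations: WeChat AI, Tencent Inc, China; Institute for Artificial Intelligence, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: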
Abstract:Current legal frameworks consider AI-generated works eligible for copyright protection when they meet originality requirements and involve substantial human intellectual input. However, systematic legal standards and reliable evaluation methods for AI art copyrights are lacking. Through comprehensive analysis of legal precedents, we establish three essential criteria for determining distinctive artistic style: stylistic consistency, creative uniqueness, and expressive accuracy. To address these challenges, we introduce ArtBulb, an interpretable and quantifiable framework for AI art copyright judgment that combines a novel style description-based multimodal clustering method with multimodal large language models (MLLMs). We also present AICD, the first benchmark dataset for AI art copyright annotated by artists and legal experts. Experimental results demonstrate that ArtBulb outperforms existing models in both quantitative and qualitative evaluations. Our work aims to bridge the gap between the legal and technological communities and bring greater attention to the societal issue of AI art copyrights.
[CV-55] GraphBrep: Learning B-Rep in Graph Structure for Efficient CAD Generation
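[Quick Read]: This paper aims to address the challenge of modeling the joint distribution of geometry and topology in direct boundary representation (B-Rep) generation. Existing methods implicitly embed topology into the geometric features of edges, so edge geometry carries more redundant structural information and computational cost rises. The key to its solution is GraphBrep, a model that explicitly represents and learns compact topology: it builds an undirected weighted graph to represent surface topology and uses a graph diffusion model to learn topology conditioned on surface features, enabling more efficient computation and high-quality CAD generation.
Link: https://arxiv.org/abs/2507.04765
Authors: Weilin Lai, Tie Xu, Hu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: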
Abstract:Direct B-Rep generation is increasingly important in CAD workflows, eliminating costly modeling sequence data and supporting complex features. A key challenge is modeling joint distribution of the misaligned geometry and topology. Existing methods tend to implicitly embed topology into the geometric features of edges. Although this integration ensures feature alignment, it also causes edge geometry to carry more redundant structural information compared to the original B-Rep, leading to significantly higher computational cost. To reduce redundancy, we propose GraphBrep, a B-Rep generation model that explicitly represents and learns compact topology. Following the original structure of B-Rep, we construct an undirected weighted graph to represent surface topology. A graph diffusion model is employed to learn topology conditioned on surface features, serving as the basis for determining connectivity between primitive surfaces. The explicit representation ensures a compact data structure, effectively reducing computational cost during both training and inference. Experiments on two large-scale unconditional datasets and one category-conditional dataset demonstrate the proposed method significantly reduces training and inference times (up to 31.3% and 56.3% for given datasets, respectively) while maintaining high-quality CAD generation compared with SOTA.
[CV-56] Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking
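[Quick Read]: This paper aims to make the perception of edge AI (EdgeAI) systems such as autonomous vehicles robust to adversarial noise, in particular by accurately identifying and localizing multiple objects in the scene over time to mitigate adversarial threats. The key to its solution is a least-squares graph built on multi-agent adversarial bounding boxes: using overlapped bounding boxes on a fully connected graph, with differential coordinates and anchor points, it reduces the positional error of each detection's centroid. The multi-vehicle detections are thus fused and refined, suppressing the adversarial impact, and are associated with existing tracks in two stages to perform tracking.
Link: https://arxiv.org/abs/2507.04762
Authors: Maria Damanaki, Ioulia Kapsali, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos
Affiliations: Industrial Systems Institute, Athena Research Center, Patras Science Park, Greece; AviSense.AI, Patras Science Park, Greece; Dpt. of Informatics & Telecom., University of Ioannina, Arta, Greece
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures, 4 tables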
Abstract:The critical perception capabilities of EdgeAI systems, such as autonomous vehicles, are required to be resilient against adversarial threats, by enabling accurate identification and localization of multiple objects in the scene over time, mitigating their impact. Single-agent tracking offers resilience to adversarial attacks but lacks situational awareness, underscoring the need for multi-agent cooperation to enhance context understanding and robustness. This paper proposes a novel mitigation framework on 3D LiDAR scene against adversarial noise by tracking objects based on least-squares graph on multi-agent adversarial bounding boxes. Specifically, we employ the least-squares graph tool to reduce the induced positional error of each detection’s centroid utilizing overlapped bounding boxes on a fully connected graph via differential coordinates and anchor points. Hence, the multi-vehicle detections are fused and refined mitigating the adversarial impact, and associated with existing tracks in two stages performing tracking to further suppress the adversarial threat. An extensive evaluation study on the real-world V2V4Real dataset demonstrates that the proposed method significantly outperforms both state-of-the-art single and multi-agent tracking frameworks by up to 23.3% under challenging adversarial conditions, operating as a resilient approach without relying on additional defense mechanisms.
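The combination of differential coordinates and anchor points on a fully connected graph amounts to a small least-squares problem, which the Python sketch below solves for a single coordinate axis. It is a generic Laplacian least-squares illustration with invented values; how the paper derives differential coordinates and anchors from overlapped bounding boxes is not reproduced here.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def laplacian_least_squares(delta, anchors):
    """Recover one coordinate per node on a fully connected graph.

    delta[i] : differential coordinate, x_i minus the mean of all other nodes
    anchors  : dict mapping node index -> anchored (measured) coordinate
    """
    n = len(delta)
    rows, rhs = [], []
    for i in range(n):  # Laplacian rows: x_i - mean(x_j, j != i) = delta_i
        rows.append([1.0 if j == i else -1.0 / (n - 1) for j in range(n)])
        rhs.append(delta[i])
    for node, value in anchors.items():  # anchor rows: x_node = value
        rows.append([1.0 if j == node else 0.0 for j in range(n)])
        rhs.append(value)
    # Normal equations (A^T A) x = A^T b of the stacked system.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * v for r, v in zip(rows, rhs)) for i in range(n)]
    return solve_linear(ata, atb)


true_x = [0.0, 2.0, 5.0]  # toy centroid positions along one axis
delta = [x - (sum(true_x) - x) / (len(true_x) - 1) for x in true_x]
estimate = laplacian_least_squares(delta, anchors={0: 0.0})
```

The differential coordinates fix the relative layout of the nodes while the anchor rows pin the absolute position; with noisy anchors on several nodes, the same solve returns the compromise that minimizes the squared error.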
[CV-57] MCFormer: A Multi-Cost-Volume Network and Comprehensive Benchmark for Particle Image Velocimetry
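[Quick Read]: This paper addresses a key obstacle to applying deep learning to Particle Image Velocimetry (PIV): the lack of a comprehensive evaluation of how different optical flow models perform on PIV data, largely because of limitations in available datasets and the absence of a standardized benchmark. The key to its solution is a novel large-scale synthetic PIV benchmark dataset generated from diverse computational fluid dynamics (CFD) simulations (JHTDB and Blasius), with unprecedented variety in particle densities, flow velocities, and continuous motion, enabling for the first time a standardized and rigorous evaluation of optical flow and PIV algorithms. The paper also proposes Multi Cost Volume PIV (MCFormer), a new deep network architecture that uses multi-frame temporal information and multiple cost volumes, is designed specifically for the sparse nature of PIV, and significantly improves performance.
Link: https://arxiv.org/abs/2507.04750
Authors: Zicheng Lin (International School, Beijing University of Posts and Telecommunications), Xiaoqiang Li (College of Engineering, Peking University), Yichao Wang (College of Physics and Optoelectronic Engineering, Harbin Engineering University), Chuan Zhu (School of Artificial Intelligence, Beijing University of Posts and Telecommunications)
Affiliations: Beijing University of Posts and Telecommunications; Stanford University; Peking University; Harbin Engineering University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 13 figures, 5 tables. Comprehensive benchmark evaluation of optical flow models for PIV. Introduces MCFormer architecture with multi-frame temporal processing and multiple cost volumes. Includes large-scale synthetic PIV dataset based on JHTDB and Blasius CFD simulations. Code and dataset will be made publicly available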
Abstract:Particle Image Velocimetry (PIV) is fundamental to fluid dynamics, yet deep learning applications face significant hurdles. A critical gap exists: the lack of comprehensive evaluation of how diverse optical flow models perform specifically on PIV data, largely due to limitations in available datasets and the absence of a standardized benchmark. This prevents fair comparison and hinders progress. To address this, our primary contribution is a novel, large-scale synthetic PIV benchmark dataset generated from diverse CFD simulations (JHTDB and Blasius). It features unprecedented variety in particle densities, flow velocities, and continuous motion, enabling, for the first time, a standardized and rigorous evaluation of various optical flow and PIV algorithms. Complementing this, we propose Multi Cost Volume PIV (MCFormer), a new deep network architecture leveraging multi-frame temporal information and multiple cost volumes, specifically designed for PIV’s sparse nature. Our comprehensive benchmark evaluation, the first of its kind, reveals significant performance variations among adapted optical flow models and demonstrates that MCFormer significantly outperforms existing methods, achieving the lowest overall normalized endpoint error (NEPE). This work provides both a foundational benchmark resource essential for future PIV research and a state-of-the-art method tailored for PIV challenges. We make our benchmark dataset and code publicly available to foster future research in this area.
[CV-58] MatDecompSDF: High-Fidelity 3D Shape and PBR Material Decomposition from Multi-View Images
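[Quick Read]: This paper addresses recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images; the core challenge is the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. The key to its solution is a framework that jointly optimizes three neural components: a neural Signed Distance Function (SDF) representing complex geometry, a spatially-varying neural field predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model of environmental lighting. A physically-based differentiable rendering layer connects these 3D properties to the input images and enables end-to-end optimization.
Link: https://arxiv.org/abs/2507.04749
Authors: Chengyu Wang, Isabella Bennett, Henry Scott, Liang Zhang, Mei Chen, Hao Li, Rui Zhao
Affiliations: San Francisco State University; Boston University; University of California, Berkeley; Stanford University; Carnegie Mellon University; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures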
Abstract:We present MatDecompSDF, a novel framework for recovering high-fidelity 3D shapes and decomposing their physically-based material properties from multi-view images. The core challenge of inverse rendering lies in the ill-posed disentanglement of geometry, materials, and illumination from 2D observations. Our method addresses this by jointly optimizing three neural components: a neural Signed Distance Function (SDF) to represent complex geometry, a spatially-varying neural field for predicting PBR material parameters (albedo, roughness, metallic), and an MLP-based model for capturing unknown environmental lighting. The key to our approach is a physically-based differentiable rendering layer that connects these 3D properties to the input images, allowing for end-to-end optimization. We introduce a set of carefully designed physical priors and geometric regularizations, including a material smoothness loss and an Eikonal loss, to effectively constrain the problem and achieve robust decomposition. Extensive experiments on both synthetic and real-world datasets (e.g., DTU) demonstrate that MatDecompSDF surpasses state-of-the-art methods in geometric accuracy, material fidelity, and novel view synthesis. Crucially, our method produces editable and relightable assets that can be seamlessly integrated into standard graphics pipelines, validating its practical utility for digital content creation.
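The Eikonal loss mentioned among the geometric regularizations penalizes deviations of the SDF gradient norm from 1. The sketch below evaluates it with finite differences on an analytic sphere SDF. It illustrates the standard definition only: a neural SDF would use automatic differentiation, and the sample points and helper names are assumptions.

```python
import math


def sphere_sdf(p, radius=1.0):
    """Signed distance from point p to a sphere centred at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius


def gradient(f, p, h=1e-5):
    """Central finite-difference gradient of a scalar field f at point p."""
    grad = []
    for i in range(len(p)):
        hi, lo = list(p), list(p)
        hi[i] += h
        lo[i] -= h
        grad.append((f(hi) - f(lo)) / (2 * h))
    return grad


def eikonal_loss(f, points):
    """Mean of (||grad f(p)|| - 1)^2 over the sample points."""
    total = 0.0
    for p in points:
        norm = math.sqrt(sum(g * g for g in gradient(f, p)))
        total += (norm - 1.0) ** 2
    return total / len(points)


points = [(0.3, -0.2, 0.9), (1.5, 0.4, -0.7), (-2.0, 1.0, 0.5)]  # arbitrary samples
loss_true_sdf = eikonal_loss(sphere_sdf, points)                  # close to 0
loss_scaled = eikonal_loss(lambda p: 2.0 * sphere_sdf(p), points)  # close to 1
```

A true distance field has unit gradient norm almost everywhere, so the first loss vanishes, while the scaled field shares the same zero level set but is penalized; this is how the term constrains the network to a valid SDF rather than an arbitrary implicit function.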
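[CV-59] Vision-Language Models Can't See the Obvious
[Quick Read]: This paper addresses the limited ability of Large Vision-Language Models (LVLM) to detect salient features that are obvious to human vision, in particular low-level visual features such as color, intensity, and orientation. The key to its solution is a new benchmark, Saliency Benchmark (SalBench), which consists of images that naturally draw human attention and defines three novel tasks for evaluating the perceptual capabilities of LVLM: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. With this benchmark, the study shows that current advanced models have significant limitations on seemingly simple visual anomaly tasks.
Link: https://arxiv.org/abs/2507.04741
Authors: Yasser Dahou, Ngoc Dung Huynh, Phuc H. Le-Khac, Wamiq Reyaz Para, Ankit Singh, Sanath Narayan
Affiliations: Technology Innovation Institute, Abu Dhabi, UAE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: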
Abstract:We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLM) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. Our SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes, and naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLM: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLM using SalBench and our findings reveal a surprising limitation: LVLM struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLM that align with the subtle definition of human attention.
[CV-60] An analysis of vision-language models for fabric retrieval
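[Quick Read]: This paper addresses text-to-image retrieval in specialized domains such as manufacturing, in particular the challenge of zero-shot retrieval in the fine-grained domain of fabric samples. The key to its solution is using Multimodal Large Language Models (MLLMs) to generate structured attribute-based descriptions and freeform natural language descriptions to improve the cross-modal retrieval performance of Vision-Language Models (VLMs). Experiments show that structured attribute descriptions significantly improve retrieval accuracy, with Meta's Perception Encoder performing best owing to its robust feature alignment.
Link: https://arxiv.org/abs/2507.04735
Authors: Francesco Giuliari, Asif Khan Pattan, Mohamed Lamine Mekhalfi, Fabio Poiesi
Affiliations: Fondazione Bruno Kessler
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at Ital-IA 2025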
Abstract:Effective cross-modal retrieval is essential for applications like information retrieval and recommendation systems, particularly in specialized domains such as manufacturing, where product information often consists of visual samples paired with a textual description. This paper investigates the use of Vision Language Models (VLMs) for zero-shot text-to-image retrieval on fabric samples. We address the lack of publicly available datasets by introducing an automated annotation pipeline that uses Multimodal Large Language Models (MLLMs) to generate two types of textual descriptions: freeform natural language and structured attribute-based descriptions. We produce these descriptions to evaluate retrieval performance across three Vision-Language Models: CLIP, LAION-CLIP, and Meta’s Perception Encoder. Our experiments demonstrate that structured, attribute-rich descriptions significantly enhance retrieval accuracy, particularly for visually complex fabric classes, with the Perception Encoder outperforming other models due to its robust feature alignment capabilities. However, zero-shot retrieval remains challenging in this fine-grained domain, underscoring the need for domain-adapted approaches. Our findings highlight the importance of combining technical textual descriptions with advanced VLMs to optimize cross-modal retrieval in industrial applications.
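Zero-shot text-to-image retrieval of this kind usually reduces to ranking image embeddings by cosine similarity to a text embedding and scoring the ranking with recall@k. The sketch below shows that evaluation loop on invented 2-D vectors. The embeddings are toy values rather than outputs of CLIP or the Perception Encoder, and the assumption that query i matches image i is made only for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_images(text_vec, image_vecs):
    """Image indices sorted from most to least similar to the text embedding."""
    return sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(text_vec, image_vecs[i]),
        reverse=True,
    )


def recall_at_k(text_vecs, image_vecs, k):
    """Fraction of queries whose correct image (same index) is in the top k."""
    hits = sum(
        1 for i, text_vec in enumerate(text_vecs)
        if i in rank_images(text_vec, image_vecs)[:k]
    )
    return hits / len(text_vecs)


# Toy embeddings (assumption): one description per fabric image, matched by index.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
recall_1 = recall_at_k(texts, images, k=1)
recall_2 = recall_at_k(texts, images, k=2)
```

In this toy set the third description lands closer to the second image than to its own, so it is missed at k=1 and recovered at k=2, the kind of near-miss that more discriminative, attribute-rich descriptions are meant to reduce.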
[CV-61] Losing Control: Data Poisoning Attack on Guided Diffusion via ControlNet
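[Quick Read]: This paper addresses the exposure of ControlNets in generative AI to stealthy data poisoning attacks, which arises from their dependence on large public datasets. The key to its solution is a novel data poisoning method that injects poisoned samples pairing a subtly triggered input with an NSFW target, so that the model retains image quality on clean prompts yet reliably produces NSFW outputs when the trigger is present.
Link: https://arxiv.org/abs/2507.04726
Authors: Raz Lapid, Almog Dubin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: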
Abstract:Text-to-image diffusion models have achieved remarkable success in translating textual prompts into high-fidelity images. ControlNets further extend these models by allowing precise, image-based conditioning (e.g., edge maps, depth, pose), enabling fine-grained control over structure and style. However, their dependence on large, publicly scraped datasets, together with the increasing use of community-shared data for fine-tuning, exposes them to stealthy data poisoning attacks. In this work, we introduce a novel data poisoning method that manipulates ControlNets to generate images containing specific content without any text triggers. After poisoned samples are injected, each pairing a subtly triggered input with an NSFW target, the model retains clean-prompt fidelity yet reliably produces NSFW outputs when the trigger is present. On large-scale, high-quality datasets, our backdoor achieves a high attack success rate while remaining imperceptible in raw inputs. These results reveal a critical vulnerability in open-source ControlNet pipelines and underscore the need for robust data sanitization and defense mechanisms.
[CV-62] Unleashing the Power of Neural Collapse: Consistent Supervised-Unsupervised Alignment for Generalized Category Discovery
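【Quick Read】: This paper aims to address the feature overlap in Generalized Category Discovery (GCD) caused by inconsistent optimization objectives and category confusion, which hurts performance on novel categories. The key to the solution is a Neural Collapse-inspired GCD framework (NC-GCD): it pre-assigns and fixes Equiangular Tight Frame (ETF) prototypes so that known and novel categories share an optimal geometric structure and a consistent optimization objective, introduces a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and improves category separability, and designs a Semantic Consistency Matcher (SCM) to keep label assignments stable across clustering iterations.
Link: https://arxiv.org/abs/2507.04725
Authors: Jizhou Han, Shaokun Wang, Yuhang He, Chenhao Ding, Qiang Wang, Xinyuan Gao, SongLin Dong, Yihong Gong
Affiliations: Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: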
Abstract:Generalized Category Discovery (GCD) focuses on classifying known categories while simultaneously discovering novel categories from unlabeled data. However, previous GCD methods face challenges due to inconsistent optimization objectives and category confusion. This leads to feature overlap and ultimately hinders performance on novel categories. To address these issues, we propose the Neural Collapse-inspired Generalized Category Discovery (NC-GCD) framework. By pre-assigning and fixing Equiangular Tight Frame (ETF) prototypes, our method ensures an optimal geometric structure and a consistent optimization objective for both known and novel categories. We introduce a Consistent ETF Alignment Loss that unifies supervised and unsupervised ETF alignment and enhances category separability. Additionally, a Semantic Consistency Matcher (SCM) is designed to maintain stable and consistent label assignments across clustering iterations. Our method achieves strong performance on multiple GCD benchmarks, significantly enhancing novel category accuracy and demonstrating its effectiveness.
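The abstract does not spell out how the fixed ETF prototypes are built. A minimal sketch, assuming the standard simplex ETF from the neural collapse literature (prototypes are the rows of sqrt(K/(K-1)) * (I - (1/K) * 11^T)), is shown below; the function name `simplex_etf` and the choice of embedding dimension K are illustrative assumptions, not details confirmed by the paper.

```python
import math

def simplex_etf(num_classes):
    """Rows are unit-norm prototypes with pairwise cosine -1/(K-1)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

prototypes = simplex_etf(4)
norms = [math.sqrt(dot(p, p)) for p in prototypes]
# Cosine between every pair of distinct prototypes.
pair_cosines = [
    dot(prototypes[i], prototypes[j])
    for i in range(4)
    for j in range(i + 1, 4)
]
```

Every pair of prototypes is separated by the same maximal angle, which is the geometric property that makes a fixed ETF attractive as a shared target for both labeled and unlabeled classes.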
[CV-63] Geometric-Guided Few-Shot Dental Landmark Detection with Human-Centric Foundation Model MICCAI2025
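【Quick Read】: This paper aims to solve the difficulty of accurately detecting anatomical landmarks in cone-beam computed tomography (CBCT) images, which is essential for assessing alveolar bone and root conditions in orthodontics, periodontics, and implant dentistry. Traditional practice relies on manual annotation by dentists, which is time-consuming, labor-intensive, and subject to inter-observer variability. The proposed solution, GeoSapiens, is a few-shot learning framework that works from limited annotated data; its key components are a robust baseline adapted from the Sapiens model and a new geometric loss function that improves the model's ability to capture critical geometric relationships among anatomical structures.
Link: https://arxiv.org/abs/2507.04710
Authors: Anbang Wang, Marawan Elbatel, Keyuan Liu, Lizhuo Lin, Meng Lan, Yanqi Yang, Xiaomeng Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: MICCAI 2025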
Abstract:Accurate detection of anatomic landmarks is essential for assessing alveolar bone and root conditions, thereby optimizing clinical outcomes in orthodontics, periodontics, and implant dentistry. Manual annotation of landmarks on cone-beam computed tomography (CBCT) by dentists is time-consuming, labor-intensive, and subject to inter-observer variability. Deep learning-based automated methods present a promising approach to streamline this process efficiently. However, the scarcity of training data and the high cost of expert annotations hinder the adoption of conventional deep learning techniques. To overcome these challenges, we introduce GeoSapiens, a novel few-shot learning framework designed for robust dental landmark detection using limited annotated CBCT of anterior teeth. Our GeoSapiens framework comprises two key components: (1) a robust baseline adapted from Sapiens, a foundational model that has achieved state-of-the-art performance in human-centric vision tasks, and (2) a novel geometric loss function that improves the model’s capacity to capture critical geometric relationships among anatomical structures. Experiments conducted on our collected dataset of anterior teeth landmarks revealed that GeoSapiens surpassed existing landmark detection methods, outperforming the leading approach by 8.18% in success detection rate at a strict 0.5 mm threshold, a standard widely recognized in dental diagnostics. Code is available at: this https URL.
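The success detection rate quoted above is, in most landmark-detection work, the fraction of predicted landmarks that fall within a distance threshold of the ground truth. The sketch below assumes that common definition with Euclidean distance in millimetres; the coordinates are made-up values, and the paper's exact evaluation protocol may differ.

```python
import math

def success_detection_rate(predicted, ground_truth, threshold_mm):
    """Fraction of landmarks whose Euclidean error is within threshold_mm."""
    hits = sum(
        1
        for pred, truth in zip(predicted, ground_truth)
        if math.dist(pred, truth) <= threshold_mm
    )
    return hits / len(ground_truth)

# Hypothetical 3-D landmark coordinates in millimetres.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.3, 0.3)]
ground_truth = [(0.0, 0.0, 0.2), (1.6, 0.0, 0.0), (0.0, 0.0, 0.0)]

sdr = success_detection_rate(predicted, ground_truth, threshold_mm=0.5)
```

Here the errors are roughly 0.2 mm, 0.6 mm, and 0.42 mm, so two of the three landmarks count as successes at the 0.5 mm threshold.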
[CV-64] Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations
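【Quick Read】: This paper addresses the spatial-temporal trade-off in identity-preserving text-to-video (IPT2V) generation: optimizing the spatial coherence of key elements (e.g., preserving character identity) often compromises instruction-compliant temporal smoothness, while emphasizing dynamic realism can disrupt the spatial coherence of visual structures. The key to the solution is a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layout and temporal features for motion dynamics, and uses a semantic prompt optimization mechanism and a stage-wise decoupled generation paradigm to separate and coordinate spatial and temporal information, improving spatiotemporal consistency and video quality while preserving identity.
Link: https://arxiv.org/abs/2507.04705
Authors: Yuji Wang, Moran Li, Xiaobin Hu, Ran Yi, Jiangning Zhang, Han Feng, Weijian Cao, Yabiao Wang, Chengjie Wang, Lizhuang Ma
Affiliations: Tencent YouTu Lab; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: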
Abstract:Identity-preserving text-to-video (IPT2V) generation, which aims to create high-fidelity videos with consistent human identity, has become crucial for downstream applications. However, current end-to-end frameworks suffer a critical spatial-temporal trade-off: optimizing for spatially coherent layouts of key elements (e.g., character identity preservation) often compromises instruction-compliant temporal smoothness, while prioritizing dynamic realism risks disrupting the spatial coherence of visual structures. To tackle this issue, we propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics. Specifically, we propose a semantic prompt optimization mechanism and a stage-wise decoupled generation paradigm. The former decouples the prompt into spatial and temporal components. Aligned with the subsequent stage-wise decoupled approach, the spatial prompts guide the text-to-image (T2I) stage to generate coherent spatial features, while the temporal prompts direct the sequential image-to-video (I2V) stage to ensure motion consistency. Experimental results validate that our approach achieves excellent spatiotemporal consistency, demonstrating outstanding performance in identity preservation, text relevance, and video quality. By leveraging this simple yet robust mechanism, our algorithm secured the runner-up position in the 2025 ACM MultiMedia Challenge.
[CV-65] Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning
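【Quick Read】: This paper aims to solve Temporal Video Grounding (TVG), i.e., precisely locating the temporal segments of a video that are relevant to a language query. Videos carry large amounts of information with high redundancy, so a model needs a comprehensive understanding of the whole video to retrieve the relevant clips accurately. The key to the solution is Tempo-R0, a Video Multimodal Large Language Model (Video-MLLM) based on multimodal temporal sensing reinforcement. It uses Self-adaptive Attention Allocation (SAA) and Explicit Timestamp-modal Aligned (ETA) methods to improve the model's understanding of video content and event boundaries, and introduces Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) in the fine-tuning stage to strengthen temporal reasoning.
Link: https://arxiv.org/abs/2507.04702
Authors: Feng Yue, Zhaoxing Zhang, Junming Jiao, Zhengyu Liang, Shiwen Cao, Feifei Zhang, Rong Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: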
Abstract:Temporal Video Grounding (TVG), which requires pinpointing relevant temporal segments in a video based on a language query, has always been a highly challenging task in video understanding. Videos often carry more information and redundancy than texts or images, so models need a comprehensive understanding of the whole video to accurately retrieve query-relevant clips. We thus propose Tempo-R0: a Video Multimodal Large Language Model (Video-MLLM) for the temporal video grounding task via multimodal temporal sensing reinforcement. Specifically, during the preprocessing stage of our pipeline, we employ a Self-adaptive Attention Allocation (SAA) method based on frame content variation to use the MLLM’s limited attention efficiently. An Explicit Timestamp-modal Aligned (ETA) method is also used to strengthen our model’s ability to perceive the boundaries of events in the video. In the fine-tuning part of our pipeline, we apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) to TVG to foster the model’s temporal reasoning not only by accepting relevant video-query pairs but also by refusing irrelevant ones. Experiments demonstrate that our method outperforms SOTA solutions by around 3.5% on both the original QVHighlights testbench and its corrected version with more reasonable ground truth annotations.
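The abstract names Group Relative Policy Optimization without detailing it. As a rough illustration of the "group relative" part only, the sketch below normalizes the rewards of several sampled responses to the same prompt by the group mean and standard deviation. The reward values are placeholders, the use of the population standard deviation is an assumption, and the paper's Partial Irrelevance Refusing reward design is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled for one video-query pair.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)

# A group whose responses all score the same yields zero advantages.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Responses that score above the group average receive positive advantages and are reinforced; those below average are penalized, without needing a separate value network.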
[CV-66] A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
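【Quick Read】: This paper addresses the poor performance of Vision-Language Models (VLMs) on compositional reasoning tasks, which is mainly due to a lack of high-quality image-text data. The key to the solution is a block-based diffusion approach: a large language model identifies entities and their spatial relationships, and image blocks that follow specified compositional rules are generated independently, automatically producing counterfactual image-text pairs with precisely controlled variations. The method improves data diversity and fidelity, and also introduces a specialized loss function to improve training efficiency.
Link: https://arxiv.org/abs/2507.04699
Authors: Zexi Jia, Chuanwei Huang, Hongyan Fei, Yeshuang Zhu, Zhiqiang Yuan, Ying Deng, Jiapei Zhang, Jinchao Zhang, Jie Zhou
Affiliations: WeChat AI, Tencent Inc, China; Institute for Artificial Intelligence, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: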
Abstract:Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as “puzzle pieces” coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.
[CV-67] Structure-Guided Diffusion Models for High-Fidelity Portrait Shadow Removal
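【Quick Read】: This paper addresses shadow removal in portrait images, aiming to produce high-quality shadow-free results that stay faithful to the original. The key to the solution is to cast shadow removal as diffusion-based inpainting: a shadow-independent structure extraction network first generates a structure map that contains facial details but excludes shadow boundaries; this map then conditions a structure-guided inpainting diffusion model that removes shadows generatively; finally, a gradient-guided detail restoration model recovers fine-scale details that the structure map may miss.
Link: https://arxiv.org/abs/2507.04692
Authors: Wanchang Yu, Qing Zhang, Rongjia Zheng, Wei-Shi Zheng
Affiliations: Sun Yat-sen University; Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: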
Abstract:We present a diffusion-based portrait shadow removal approach that can robustly produce high-fidelity results. Unlike previous methods, we cast shadow removal as diffusion-based inpainting. To this end, we first train a shadow-independent structure extraction network on a real-world portrait dataset with various synthetic lighting conditions, which allows us to generate a shadow-independent structure map that includes facial details while excluding unwanted shadow boundaries. The structure map is then used as a condition to train a structure-guided inpainting diffusion model that removes shadows in a generative manner. Finally, to restore fine-scale details (e.g., eyelashes, moles, and spots) that may not be captured by the structure map, we take the gradients inside the shadow regions as guidance and train a detail restoration diffusion model to refine the shadow removal result. Extensive experiments on benchmark datasets show that our method clearly outperforms existing methods and effectively avoids previously common issues such as facial identity tampering, shadow residual, color distortion, structure blurring, and loss of details. Our code is available at this https URL.
[CV-68] Bridging KAN and MLP: MJKAN, a Hybrid Architecture with Both Efficiency and Expressiveness
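【Quick Read】: This paper aims to address the high computational cost and limited generalization of Kolmogorov-Arnold Networks (KANs) in practical applications. The key to the solution is a novel neural network layer called Modulation Joint KAN (MJKAN), which combines a FiLM (Feature-wise Linear Modulation)-like mechanism with Radial Basis Function (RBF) activations, retaining the non-linear expressive power of KANs while improving computational efficiency and practical usability across tasks.
Link: https://arxiv.org/abs/2507.04690
Authors: Hanseon Joo, Hayoung Choi, Ook Lee, Minjong Cheon
Affiliations: Hanyang University; KAIST Applied Science Research Institute
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: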
Abstract:Kolmogorov-Arnold Networks (KANs) have garnered attention for replacing fixed activation functions with learnable univariate functions, but they exhibit practical limitations, including high computational costs and performance deficits in general classification tasks. In this paper, we propose the Modulation Joint KAN (MJKAN), a novel neural network layer designed to overcome these challenges. MJKAN integrates a FiLM (Feature-wise Linear Modulation)-like mechanism with Radial Basis Function (RBF) activations, creating a hybrid architecture that combines the non-linear expressive power of KANs with the efficiency of Multilayer Perceptrons (MLPs). We empirically validated MJKAN’s performance across a diverse set of benchmarks, including function regression, image classification (MNIST, CIFAR-10/100), and natural language processing (AG News, SMS Spam). The results demonstrate that MJKAN achieves superior approximation capabilities in function regression tasks, significantly outperforming MLPs, with performance improving as the number of basis functions increases. Conversely, in image and text classification, its performance was competitive with MLPs but revealed a critical dependency on the number of basis functions. We found that a smaller basis size was crucial for better generalization, highlighting that the model’s capacity must be carefully tuned to the complexity of the data to prevent overfitting. In conclusion, MJKAN offers a flexible architecture that inherits the theoretical advantages of KANs while improving computational efficiency and practical viability.
[CV-69] TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation ICCV2025
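【Quick Read】: This paper aims to remove the bottleneck in digital orthodontics caused by the labor-intensive collection of paired 3D teeth models, in particular the need to generate anatomically structured teeth models (each comprising 24-32 segmented teeth). The key to the solution is TeethGenerator, a two-stage framework with two core modules: a teeth shape generation module that uses a diffusion model to learn the distribution of morphological characteristics of teeth and generates diverse post-orthodontic teeth models, and a teeth style generation module that synthesizes the corresponding pre-orthodontic teeth models by taking desired styles as conditional inputs.
Link: https://arxiv.org/abs/2507.04685
Authors: Changsong Lei, Yaqian Liang, Shaofeng Wang, Jiajia Dai, Yong-Jin Liu
Affiliations: Tsinghua University; Capital Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025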
Abstract:Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly in acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation methods have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset are available at this https URL.
[CV-70] Colorectal Cancer Tumor Grade Segmentation in Digital Histopathology Images: From Giga to Mini Challenge ICIP2025
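【Quick Read】: This paper aims to address the subjectivity of histopathological grading of colorectal cancer (CRC), which leads to inter-observer variability and is compounded by a global shortage of trained pathologists. To promote automated and standardized solutions, the authors organized the ICIP Grand Challenge using the publicly available METU CCTGS dataset, with participants submitting segmentation masks for evaluation, to explore more effective methods for CRC tumor grading and segmentation. The key elements of the challenge are a high-quality dataset with pixel-level annotations and the use of advanced deep learning models to improve classification and segmentation accuracy.
Link: https://arxiv.org/abs/2507.04681
Authors: Alper Bahcekapili, Duygu Arslan, Umut Ozdemir, Berkay Ozkirli, Emre Akbas, Ahmet Acar, Gozde B. Akar, Bingdou He, Shuoyu Xu, Umit Mert Caglar, Alptekin Temizel, Guillaume Picaud, Marc Chaumont, Gérard Subsol, Luc Téot, Fahad Alsharekh, Shahad Alghannam, Hexiang Mao, Wenhua Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted Grand Challenge Paper ICIP 2025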
Abstract:Colorectal cancer (CRC) is the third most diagnosed cancer and the second leading cause of cancer-related death worldwide. Accurate histopathological grading of CRC is essential for prognosis and treatment planning but remains a subjective process prone to observer variability and limited by global shortages of trained pathologists. To promote automated and standardized solutions, we organized the ICIP Grand Challenge on Colorectal Cancer Tumor Grading and Segmentation using the publicly available METU CCTGS dataset. The dataset comprises 103 whole-slide images with expert pixel-level annotations for five tissue classes. Participants submitted segmentation masks via Codalab, which were evaluated using metrics such as macro F-score and mIoU. Among 39 participating teams, six outperformed the Swin Transformer baseline (62.92 F-score). This paper presents an overview of the challenge, the dataset, and the top-performing methods.
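The two metrics named above can both be derived from per-class true positive, false positive, and false negative pixel counts. The sketch below shows one common way to compute them, assuming flattened label lists and skipping classes absent from both prediction and ground truth; the challenge's exact protocol (for example, per-image versus dataset-level averaging) is not stated in the abstract, and the toy labels are invented.

```python
def per_class_counts(pred, truth, num_classes):
    """Count true positives, false positives, and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, t in zip(pred, truth):
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return tp, fp, fn

def macro_f1_and_miou(pred, truth, num_classes):
    """Unweighted mean of per-class F1 and IoU over classes that occur."""
    tp, fp, fn = per_class_counts(pred, truth, num_classes)
    f1_scores, ious = [], []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        if union == 0:
            continue  # class absent from both prediction and ground truth
        f1_scores.append(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]))
        ious.append(tp[c] / union)
    return sum(f1_scores) / len(f1_scores), sum(ious) / len(ious)

# Toy flattened masks with three tissue classes (hypothetical labels).
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
macro_f1, miou = macro_f1_and_miou(pred, truth, num_classes=3)
```

Because both metrics average over classes without weighting by pixel count, rare tissue classes influence the score as much as common ones.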
[CV-71] Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
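【Quick Read】: This paper aims to address hallucination in Large Vision-Language Models (LVLMs), which significantly limits their credibility and application potential. Existing methods typically rely on external tools or comparisons across multiple rounds of inference, which significantly increases inference time. The proposed solution is Self-Evolving Distillation (SEED), whose key idea is to identify and purge hallucinations within the model's inner knowledge and distill the purified knowledge back into the model, enabling self-evolution. In addition, to avoid the void spaces that traditional distillation methods induce in the output space, the paper proposes a Mode-Seeking Evolving approach that captures the dominant modes of the purified knowledge distribution, improving the stability and accuracy of the model's outputs.
Link: https://arxiv.org/abs/2507.04680
Authors: Wenhao Li, Xiu Su, Jingyi Wu, Feng Yang, Yang Liu, Yi Chen, Shan You, Chang Xu
Affiliations: University of Sydney; Central South University; Fudan University; Southeast University; HKUST; Sensetime Research
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: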
Abstract:Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose SElf-Evolving Distillation (SEED), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.
[CV-72] ChangeBridge: Spatiotemporal Image Generation with Multimodal Controls for Remote Sensing
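【Quick Read】: This paper addresses the lack of methods in remote sensing image synthesis for simulating future scenarios from given scene images, a capability with wide applications in urban planning, land management, and related fields. The key to the solution is ChangeBridge, a conditional spatiotemporal diffusion model. Its core idea is to model the noise-to-image diffusion process as a diffusion bridge from the pre-event state to the post-event state and, conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), to use a stochastic Brownian-bridge diffusion to directly model the spatiotemporal evolution.
Link: https://arxiv.org/abs/2507.04678
Authors: Zhenghui Zhao, Chen Wu, Di Wang, Hongruixuan Chen, Zhuo Zheng
Affiliations: Wuhan University; University of Tokyo; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: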
Abstract:Recent advancements in generative methods, especially diffusion models, have made great progress in remote sensing image synthesis. Despite these advancements, existing methods have not explored the simulation of future scenarios based on given scenario images. This simulation capability has wide applications for urban planning, land management, and beyond. In this work, we propose ChangeBridge, a conditional spatiotemporal diffusion model. Given pre-event images and conditioned on multimodal spatial controls (e.g., text prompts, instance layouts, and semantic maps), ChangeBridge can synthesize post-event images. The core idea behind ChangeBridge is to recast the noise-to-image diffusion model as a pre-to-post diffusion bridge. Conditioned on multimodal controls, ChangeBridge leverages a stochastic Brownian-bridge diffusion, directly modeling the spatiotemporal evolution between pre-event and post-event states. To the best of our knowledge, ChangeBridge is the first spatiotemporal generative model with multimodal controls for remote sensing. Experimental results demonstrate that ChangeBridge can simulate high-fidelity future scenarios aligned with given conditions, including event and event-driven background variations. Code will be available.
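For intuition about the Brownian-bridge idea mentioned above: a Brownian bridge pinned at two endpoints has, at time t in [0, 1], a mean that linearly interpolates the endpoints and a variance proportional to t(1 - t), so the noise vanishes at both ends. The sketch below samples that marginal for toy vectors. It illustrates the generic stochastic process only; the `sigma` parameter, the function name, and the use of plain lists as stand-ins for images are assumptions, and the paper's actual schedule and conditioning are not described in the abstract.

```python
import math
import random

def brownian_bridge_sample(x_pre, x_post, t, sigma=1.0, rng=random):
    """Sample the bridge state at time t between x_pre (t=0) and x_post (t=1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    std = sigma * math.sqrt(t * (1.0 - t))  # zero at both endpoints
    return [
        (1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
        for a, b in zip(x_pre, x_post)
    ]

# Toy "pre-event" and "post-event" states (hypothetical pixel values).
x_pre = [0.0, 2.0]
x_post = [1.0, 4.0]

rng = random.Random(0)
start = brownian_bridge_sample(x_pre, x_post, 0.0, rng=rng)
end = brownian_bridge_sample(x_pre, x_post, 1.0, rng=rng)
noiseless_mid = brownian_bridge_sample(x_pre, x_post, 0.5, sigma=0.0, rng=rng)
noisy_mid = brownian_bridge_sample(x_pre, x_post, 0.5, rng=rng)
```

Unlike a standard diffusion process that ends in pure noise, both ends of the bridge are data, which is what lets the pre-event image serve as the starting point of generation.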
[CV-73] DANCE: Resource-Efficient Neural Architecture Search with Data-Aware and Continuous Adaptation IJCAI2025
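【Quick Read】: This paper aims to address the limitations that existing Neural Architecture Search (NAS) methods face in real-world deployment: poor adaptability, costly separate searches for each deployment, and inconsistent performance across platforms. The key to the solution is DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem and introduces three core innovations: a continuous architecture distribution for smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization.
Link: https://arxiv.org/abs/2507.04671
Authors: Maolin Wang, Tianshuo Wei, Sheng Zhang, Ruocheng Guo, Wanyu Wang, Shanshan Ye, Lixin Zou, Xuetao Wei, Xiangyu Zhao
Affiliations: City University of Hong Kong; Australian Artificial Intelligence Institute, University of Technology Sydney; Wuhan University; Southern University of Science and Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IJCAI 2025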
Abstract:Neural Architecture Search (NAS) has emerged as a powerful approach for automating neural network design. However, existing NAS methods face critical limitations in real-world deployments: architectures lack adaptability across scenarios, each deployment context requires costly separate searches, and performance consistency across diverse platforms remains challenging. We propose DANCE (Dynamic Architectures with Neural Continuous Evolution), which reformulates architecture search as a continuous evolution problem through learning distributions over architectural components. DANCE introduces three key innovations: a continuous architecture distribution enabling smooth adaptation, a unified architecture space with learned selection gates for efficient sampling, and a multi-stage training strategy for effective deployment optimization. Extensive experiments across five datasets demonstrate DANCE’s effectiveness. Our method consistently outperforms state-of-the-art NAS approaches in terms of accuracy while significantly reducing search costs. Under varying computational constraints, DANCE maintains robust performance while smoothly adapting architectures to different hardware requirements. The code and appendix can be found at this https URL.
[CV-74] What’s Making That Sound Right Now? Video-centric Audio-Visual Localization ICCV2025
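【Quick Read】: This paper aims to address the failure of existing Audio-Visual Localization (AVL) research to capture temporal dynamics, as well as its reliance on simplified scenario assumptions. The key to the solution is the AVATAR benchmark, which introduces video-centric evaluation scenarios with high-resolution temporal information, together with the TAVLO model, which explicitly integrates temporal information to achieve more precise audio-visual alignment and thereby tracks temporal variations more effectively.
Link: https://arxiv.org/abs/2507.04667
Authors: Hahyeon Choi, Junhoo Lee, Nojun Kwak
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published at ICCV 2025. Project page: this https URL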
Abstract:Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios – Single-sound, Mixed-sound, Multi-entity, and Off-screen – enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.
[CV-75] VectorLLM: Human-like Extraction of Structured Building Contours via Multimodal LLMs
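【Quick Read】: This paper addresses the automatic extraction of vectorized building contours from remote sensing imagery, a task that is crucial for urban planning, population estimation, and disaster assessment. Existing methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and practical applicability. The proposed solution is VectorLLM, a Multi-modal Large Language Model (MLLM) designed specifically for extracting regular building contours from remote sensing images. Its key idea is to regress contours corner point by corner point, directly mimicking the human annotation process rather than following a traditional staged pipeline, which improves accuracy and generalization.
Link: https://arxiv.org/abs/2507.04664
Authors: Tao Zhang, Shiqing Wei, Shihao Chen, Wenling Yu, Muying Luo, Shunping Ji
Affiliations: Wuhan University; China University of Petroleum (East China); East China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: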
Abstract:Automatically extracting vectorized building contours from remote sensing imagery is crucial for urban planning, population estimation, and disaster assessment. Current state-of-the-art methods rely on complex multi-stage pipelines involving pixel segmentation, vectorization, and polygon refinement, which limits their scalability and real-world applicability. Inspired by the remarkable reasoning capabilities of Large Language Models (LLMs), we introduce VectorLLM, the first Multi-modal Large Language Model (MLLM) designed for regular building contour extraction from remote sensing images. Unlike existing approaches, VectorLLM regresses building contours directly, corner point by corner point, mimicking human annotators’ labeling process. Our architecture consists of a vision foundation backbone, an MLP connector, and an LLM, enhanced with learnable position embeddings to improve spatial understanding. Through comprehensive exploration of training strategies, including pretraining, supervised fine-tuning, and preference optimization, on the WHU, WHU-Mix, and CrowdAI datasets, VectorLLM outperforms the previous SOTA methods by 5.6 AP, 7.1 AP, and 13.6 AP on the three datasets, respectively. Remarkably, VectorLLM exhibits strong zero-shot performance on unseen objects including aircraft, water bodies, and oil tanks, highlighting its potential for unified modeling of diverse remote sensing object contour extraction tasks. Overall, this work establishes a new paradigm for vector extraction in remote sensing, leveraging the topological reasoning capabilities of LLMs to achieve both high accuracy and exceptional generalization. All code and weights will be released to promote community development.
[CV-76] UGG-ReID: Uncertainty-Guided Graph Model for Multi-Modal Object Re-Identification
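【Quick Read】: This paper aims to address the uncertainty in multi-modal object re-identification (ReID) caused by intra-modal noise and inter-modal conflicts; this uncertainty is especially pronounced under fine-grained local occlusion and frame loss and poses a challenge for multi-modal learning. The key to the solution is the Uncertainty-Guided Graph model (UGG-ReID), which estimates aleatoric uncertainty at both the local and sample levels and explicitly models their dependencies, mitigating noise interference and enabling effective multi-modal fusion.
Link: https://arxiv.org/abs/2507.04638
Authors: Xixi Wan, Aihua Zheng, Bo Jiang, Beibei Wang, Chenglong Li, Jin Tang
Affiliations: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: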
Abstract:Multi-modal object Re-IDentification (ReID) has gained considerable attention with the goal of retrieving specific targets across cameras using heterogeneous visual data sources. Existing methods primarily aim to improve identification performance, but often overlook the uncertainty arising from inherent defects, such as intra-modal noise and inter-modal conflicts. This uncertainty is particularly significant in the case of fine-grained local occlusion and frame loss, which becomes a challenge in multi-modal learning. To address the above challenge, we propose a robust approach named Uncertainty-Guided Graph model for multi-modal object ReID (UGG-ReID). UGG-ReID is designed to mitigate noise interference and facilitate effective multi-modal fusion by estimating both local and sample-level aleatoric uncertainty and explicitly modeling their dependencies. Specifically, we first propose the Gaussian patch-graph representation model that leverages uncertainty to quantify fine-grained local cues and capture their structural relationships. This process boosts the expressiveness of modal-specific information, ensuring that the generated embeddings are both more informative and robust. Subsequently, we design an uncertainty-guided mixture of experts strategy that dynamically routes samples to experts exhibiting low uncertainty. This strategy effectively suppresses noise-induced instability, leading to enhanced robustness. Meanwhile, we design an uncertainty-guided routing to strengthen the multi-modal interaction, improving the performance. UGG-ReID is comprehensively evaluated on five representative multi-modal object ReID datasets, encompassing diverse spectral modalities. Experimental results show that the proposed method achieves excellent performance on all datasets and is significantly better than current methods in terms of noise immunity. Our code will be made public upon acceptance.
[CV-77] MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding ICML2025
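【Quick Read】: This paper addresses an attention deficit problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decay of attention activation, which limits performance on high-level tasks that require fine-grained cognition and emotion understanding. The key to the solution is a novel attention mechanism, MOdular Duplex Attention (MODA), which performs inner-modal refinement and inter-modal interaction simultaneously. MODA adopts a correct-after-align strategy that decouples modality alignment from cross-layer token mixing, and uses adaptive masked attention to ensure the correctness of attention scores, improving performance on perception, cognition, and emotion tasks.
Link: https://arxiv.org/abs/2507.04635
Authors: Zhicheng Zhang, Wuyou Xia, Chenxi Zhao, Zhou Yan, Xiaoqiang Liu, Yongjie Zhu, Wenyu Qin, Pengfei Wan, Di Zhang, Jufeng Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICML 2025 (Spotlight, Top 2.6%)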
Abstract:Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by a generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while less exploring multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model’s flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks. Source code and demo are available in this https URL.
[CV-78] LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction
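Quick read: This paper addresses the modeling of complex temporal-spatial dependencies in trajectory prediction, in particular the under-modeling of local temporal dependencies and high-order motion state attributes. The key to the solution is a lightweight framework, LTMSformer, which includes a Local Trend-Aware Attention mechanism to capture local temporal dependencies and a Motion State Encoder to incorporate high-order motion state attributes, together with a Lightweight Proposal Refinement Module that improves prediction accuracy while reducing the number of model parameters.
Link: https://arxiv.org/abs/2507.04634
Authors: Yixin Yan,Yang Li,Yuanfan Wang,Xiaozhou Zhou,Beihao Xia,Manjiang Hu,Hongmao Qin
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: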
Abstract:It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.
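The high-order motion state attributes named in this abstract (acceleration, jerk, heading) can be derived from raw positions by repeated finite differences. The sketch below is only an illustration of that idea under stated assumptions: the function name `motion_state_attributes`, the use of `np.gradient`, and the sampling interval passed by the caller are hypothetical choices, and the paper's actual Motion State Encoder may compute or embed these quantities differently.

```python
import numpy as np

def motion_state_attributes(xy, dt):
    """Derive per-step motion attributes from a 2D trajectory.

    xy: (T, 2) array of positions sampled every `dt` seconds.
    Returns a dict of (T,) arrays. Finite differences via np.gradient
    are an assumption of this sketch, not the paper's stated method.
    """
    xy = np.asarray(xy, dtype=float)
    vel = np.gradient(xy, dt, axis=0)    # first derivative of position
    acc = np.gradient(vel, dt, axis=0)   # second derivative
    jerk = np.gradient(acc, dt, axis=0)  # third derivative
    return {
        "speed": np.linalg.norm(vel, axis=1),
        "acceleration": np.linalg.norm(acc, axis=1),
        "jerk": np.linalg.norm(jerk, axis=1),
        "heading": np.arctan2(vel[:, 1], vel[:, 0]),  # radians
    }
```

For an agent moving in a straight line at constant speed, the acceleration and jerk arrays are zero and the heading is constant, which is a quick sanity check for the feature pipeline.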
[CV-79] Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
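Quick read: This paper addresses the limited robustness and weak cross-domain performance of learning-based stereo matching networks under domain shift and imbalanced disparity distributions across datasets. The key to the solution is the SMoEStereo framework, which performs scene-specific fusion of tailored Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. It introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes to adapt dynamically to different scenes and strengthen geometric feature extraction, and adds a lightweight decision network to balance computational overhead against accuracy.
Link: https://arxiv.org/abs/2507.04631
Authors: Yun Wang,Longguang Wang,Chenghao Zhang,Yongjian Zhang,Zhanjie Zhang,Ao Ma,Chenyou Fan,Tin Lun Lam,Junjie Hu
Affiliations: City University of Hong Kong; Shenzhen Campus, Sun Yat-sen University; Chinese Academy of Sciences; Zhejiang University; JD.com; South China Normal University; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: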
Abstract:Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model’s robustness, but integrating such a model into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation. The code is available at this https URL.
[CV-80] Learn 3D VQA Better with Active Selection and Reannotation ACM-MM2025
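Quick read: This paper addresses improper annotations in 3D visual question answering (3D VQA) that arise from the free-form nature of answers and can confuse or mislead models during training. The key to the solution is a multi-turn interactive active learning strategy that selects data according to the model's semantic uncertainty to build a more solid knowledge foundation, and actively asks annotators to relabel potentially ambiguous labels. To assess uncertainty, the paper uses a variance-based metric that accounts for semantic relationships between terms, avoiding the uniform inter-class similarity assumption of earlier metrics.
Link: https://arxiv.org/abs/2507.04630
Authors: Shengli Zhou,Yang Liu,Feng Zheng
Affiliations: Southern University of Science and Technology; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025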
Abstract:3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead models when training on the entire dataset. While other text generation tasks can mitigate this issue by learning on large-scale datasets, the scarcity of 3D scene data enlarges the negative effect of misleading annotations. Although active learning strategies can select valuable instances for training, they fail to identify and resolve misleading labels, which the oracle inevitably provides in practice. To address this issue, we propose a multi-turn interactive active learning strategy. This strategy selects data based on models’ semantic uncertainty to form a solid knowledge foundation more effectively and actively requests reannotation from an oracle to resolve potentially misleading labels. For uncertainty assessment, we utilize a variance-based metric that takes semantic relationships between terms into consideration, thus avoiding the uniform inter-class similarity assumption of previous assessment metrics. Extensive experiments exhibit better model performance and a substantial reduction in training costs, with a halving of training costs for achieving relatively high accuracy. The code is available at this https URL.
[CV-81] Information-Guided Diffusion Sampling for Dataset Distillation
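Quick read: This paper addresses the lack of sample diversity when diffusion models are used for dataset distillation in low images-per-class (IPC) settings. The key to the solution is an information-theoretic view that identifies and preserves two types of information: prototype information I(X;Y), which captures label-relevant features, and contextual information H(X∣Y), which preserves intra-class variability. Maximizing I(X;Y) + βH(X∣Y) during diffusion sampling, where β is an IPC-dependent parameter, improves the diversity and quality of generated samples. To make this tractable, the authors develop data-driven variational estimators that tightly lower-bound the two quantities.
Link: https://arxiv.org/abs/2507.04619
Authors: Linfeng Ye,Shayan Mohajer Hamidi,Guang Li,Takahiro Ogawa,Miki Haseyama,Konstantinos N. Plataniotis
Affiliations: University of Toronto; Stanford University; Hokkaido University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
Comments: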
Abstract:Dataset distillation aims to create a compact dataset that retains essential information while maintaining model performance. Diffusion models (DMs) have shown promise for this task but struggle in low images-per-class (IPC) settings, where generated samples lack diversity. In this paper, we address this issue from an information-theoretic perspective by identifying two key types of information that a distilled dataset must preserve: (i) prototype information $\mathrm{I}(X;Y)$, which captures label-relevant features; and (ii) contextual information $\mathrm{H}(X \mid Y)$, which preserves intra-class variability. Here, $(X,Y)$ represents the pair of random variables corresponding to the input data and its ground truth label, respectively. Observing that the required contextual information scales with IPC, we propose maximizing $\mathrm{I}(X;Y) + \beta \mathrm{H}(X \mid Y)$ during the DM sampling process, where $\beta$ is IPC-dependent. Since directly computing $\mathrm{I}(X;Y)$ and $\mathrm{H}(X \mid Y)$ is intractable, we develop variational estimations to tightly lower-bound these quantities via a data-driven approach. Our approach, information-guided diffusion sampling (IGDS), seamlessly integrates with diffusion models and improves dataset distillation across all IPC settings. Experiments on Tiny ImageNet and ImageNet subsets show that IGDS significantly outperforms existing methods, particularly in low-IPC regimes. The code will be released upon acceptance.
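One way to read the objective in this abstract is through the standard chain-rule identity for entropy. The rewriting below is not taken from the paper; it is a short worked step that follows from textbook information theory, and how β is actually chosen as a function of IPC is not stated in the abstract.

```latex
\mathrm{H}(X) = \mathrm{I}(X;Y) + \mathrm{H}(X \mid Y)
\quad\Longrightarrow\quad
\mathrm{I}(X;Y) + \beta\,\mathrm{H}(X \mid Y)
  = \mathrm{H}(X) + (\beta - 1)\,\mathrm{H}(X \mid Y).
```

Under this reading, β = 1 corresponds to maximizing the total entropy of the samples, β < 1 discounts intra-class variability in favor of label-relevant structure, and β > 1 rewards it further.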
[CV-82] HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction MICCAI2025
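Quick read: This paper addresses survival prediction from whole-slide images (WSIs), where current methods are limited by sparse slide-level labels and struggle to learn discriminative representations from gigapixel WSIs. The key to the solution is HiLa, a hierarchical vision-language collaboration framework. It uses pretrained feature extractors to generate hierarchical visual features at the patch and region levels, and aligns language prompts describing multiple survival-related attributes with the visual features via Optimal Prompt Learning (OPL) to improve vision-language alignment. It also introduces Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL) modules to strengthen interaction and consistency across levels for more effective hierarchical collaboration.
Link: https://arxiv.org/abs/2507.04613
Authors: Jiaqi Cui,Lu Wen,Yuchen Fei,Bo Liu,Luping Zhou,Dinggang Shen,Yan Wang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by MICCAI2025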
Abstract:Survival prediction using whole-slide images (WSIs) is crucial in cancer research. Despite notable success, existing approaches are limited by their reliance on sparse slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. Recently, vision language (VL) models, which incorporate additional language supervision, have emerged as a promising solution. However, VL-based survival prediction remains largely unexplored due to two key challenges. First, current methods often rely on only one simple language prompt and basic cosine similarity, which fails to learn fine-grained associations between multi-faceted linguistic information and visual features within WSI, resulting in inadequate vision-language alignment. Second, these methods primarily exploit patch-level information, overlooking the intrinsic hierarchy of WSIs and their interactions, causing ineffective modeling of hierarchical interactions. To tackle these problems, we propose a novel Hierarchical vision-Language collaboration (HiLa) framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from WSIs at both patch and region levels. At each level, a series of language prompts describing various survival-related attributes are constructed and aligned with visual features via Optimal Prompt Learning (OPL). This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts, thereby improving vision-language alignment. Furthermore, we introduce two modules, i.e., Cross-Level Propagation (CLP) and Mutual Contrastive Learning (MCL) to maximize hierarchical cooperation by promoting interactions and consistency between patch and region levels. Experiments on three TCGA datasets demonstrate our SOTA performance.
[CV-83] QR-LoRA: Efficient and Disentangled Fine-tuning via QR Decomposition for Customized Generation ICCV2025
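Quick read: This paper addresses the entanglement of content and style features that arises when existing text-to-image models combine multiple Low-Rank Adaptation (LoRA) models for content-style fusion, caused by unstructured modification of weight matrices. The key to the solution is the QR-LoRA framework, which uses QR decomposition for structured parameter updates: the orthogonal matrix Q naturally reduces interference between different visual features, while the upper triangular matrix R efficiently encodes attribute-specific transformations. Training only an additional task-specific ΔR matrix halves the number of trainable parameters and gives strong disentanglement, so multiple adaptations can be merged effectively.
Link: https://arxiv.org/abs/2507.04599
Authors: Jiahui Yang,Yongjia Ma,Donglin Di,Hao Li,Wei Chen,Yan Xie,Jianxun Cui,Xun Yang,Wangmeng Zuo
Affiliations: Harbin Institute of Technology; Li Auto; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, 30 pages, 26 figures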
Abstract:Existing text-to-image models often rely on parameter fine-tuning techniques such as Low-Rank Adaptation (LoRA) to customize visual attributes. However, when combining multiple LoRA models for content-style fusion tasks, unstructured modifications of weight matrices often lead to undesired feature entanglement between content and style attributes. We propose QR-LoRA, a novel fine-tuning framework leveraging QR decomposition for structured parameter updates that effectively separate visual attributes. Our key insight is that the orthogonal Q matrix naturally minimizes interference between different visual features, while the upper triangular R matrix efficiently encodes attribute-specific transformations. Our approach fixes both Q and R matrices while only training an additional task-specific $\Delta R$ matrix. This structured design reduces trainable parameters to half of conventional LoRA methods and supports effective merging of multiple adaptations without cross-contamination due to the strong disentanglement properties between $\Delta R$ matrices. Experiments demonstrate that QR-LoRA achieves superior disentanglement in content-style fusion tasks, establishing a new paradigm for parameter-efficient, disentangled fine-tuning in generative models.
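The structure described in this abstract (frozen Q and R, a trainable task-specific ΔR, and merging of several adaptations) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: it applies QR directly to a random stand-in weight matrix, treats each ΔR as upper triangular, and merges by summation. The helper name `adapted_weight` is hypothetical, and the paper may factorize a different matrix or constrain ΔR differently.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))  # stand-in for a pretrained weight matrix

# W = Q @ R with Q orthogonal and R upper triangular; both stay frozen.
Q, R = np.linalg.qr(W)

def adapted_weight(Q, R, delta_Rs):
    """Merge task-specific updates: W' = Q @ (R + sum of delta_R).

    Summing the updates is an assumption of this sketch.
    """
    return Q @ (R + sum(delta_Rs))

# Hypothetical "trained" updates for a content task and a style task.
delta_content = 0.01 * np.triu(rng.standard_normal((8, 8)))
delta_style = 0.01 * np.triu(rng.standard_normal((8, 8)))

W_merged = adapted_weight(Q, R, [delta_content, delta_style])
```

Because Q is shared and fixed, each adaptation contributes only its own ΔR term, so adding or removing one adaptation leaves the others' contributions unchanged in this sketch.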
[CV-84] CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection
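Quick read: This paper addresses the limited 3D object detection performance of 4D radar in autonomous driving caused by sparse point clouds and measurement noise. Existing work mostly fuses camera images with radar data in bird's-eye-view (BEV) space, but does not fully exploit the radar data or the fusion mechanism. The key to the proposed solution is a cross-view two-stage fusion network (CVFusion): the first stage uses a radar-guided iterative BEV fusion module (RGIter) to generate high-recall 3D proposal boxes, and the second stage aggregates features from multiple heterogeneous views, including points, images, and BEV, to refine the proposals and improve detection accuracy.
Link: https://arxiv.org/abs/2507.04587
Authors: Hanzhi Zhong,Zhiyu Xiang,Ruoyu Xu,Jingyun Fu,Peng Xu,Shaohong Wang,Zhihao Yang,Tianyu Pu,Eryun Liu
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: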
Abstract:4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weather. Because of the sparse points and noisy measurements of 4D radar, most research completes the 3D object detection task by integrating camera images and performing modality fusion in BEV space. However, the potential of the radar and the fusion mechanism is still largely unexplored, hindering the performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views including points, image, and BEV for each proposal. These comprehensive instance level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10% and 3.68% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be made publicly available.
[CV-85] S2Edit: Text-Guided Image Editing with Precise Semantic and Spatial Control
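Quick read: This paper addresses the loss of identity information and high-frequency details, and the unintended modification of irrelevant image regions, that existing methods cause in image editing tasks requiring fine-grained control. The key to the solution is S²Edit, which fine-tunes a pre-trained text-to-image diffusion model to embed identity information into a learnable text token, and disentangles the identity token from the attributes to be edited through an orthogonality constraint in the textual feature space. Object masks guide the cross-attention maps so that the identity token affects only regions of interest, giving identity-preserving editing that is semantically disentangled and spatially focused.
Link: https://arxiv.org/abs/2507.04584
Authors: Xudong Liu,Zikun Chen,Ruowei Jiang,Ziyi Wu,Kejia Yin,Han Zhao,Parham Aarabi,Igor Gilitschenski
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: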
Abstract:Recent advances in diffusion models have enabled high-quality generation and manipulation of images guided by texts, as well as concept learning from images. However, naive applications of existing methods to editing tasks that require fine-grained control, e.g., face editing, often lead to suboptimal solutions with identity information and high-frequency details lost during the editing process, or irrelevant image regions altered due to entangled concepts. In this work, we propose S$^2$Edit, a novel method based on a pre-trained text-to-image diffusion model that enables personalized editing with precise semantic and spatial control. We first fine-tune our model to embed the identity information into a learnable text token. During fine-tuning, we disentangle the learned identity token from attributes to be edited by enforcing an orthogonality constraint in the textual feature space. To ensure that the identity token only affects regions of interest, we apply object masks to guide the cross-attention maps. At inference time, our method performs localized editing while faithfully preserving the original identity with semantically disentangled and spatially focused identity token learned. Extensive experiments demonstrate the superiority of S$^2$Edit over state-of-the-art methods both quantitatively and qualitatively. Additionally, we showcase several compositional image editing applications of S$^2$Edit such as makeup transfer.
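The orthogonality constraint mentioned in this abstract can be pictured as a penalty that is zero when the identity token is orthogonal to every attribute embedding. The exact loss is not given in the abstract, so the form below (mean squared cosine similarity) and the name `orthogonality_penalty` are assumptions chosen for illustration only.

```python
import numpy as np

def orthogonality_penalty(identity_token, attribute_embeddings):
    """Mean squared cosine similarity between one token and N embeddings.

    identity_token:       (dim,) array.
    attribute_embeddings: (N, dim) array of attribute text features.
    Returns 0.0 for perfectly orthogonal vectors and 1.0 for parallel ones.
    The squared-cosine form is an assumption of this sketch.
    """
    t = identity_token / np.linalg.norm(identity_token)
    a = attribute_embeddings / np.linalg.norm(
        attribute_embeddings, axis=1, keepdims=True
    )
    return float(np.mean((a @ t) ** 2))
```

In a training loop, a term like this would be added to the fine-tuning loss so that gradient steps push the identity token away from the directions used by editable attributes.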
[CV-86] MambaVideo for Discrete Video Tokenization with Channel-Split Quantization
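Quick read: This paper addresses efficient autoregressive generative modeling under the high dimensionality of video data, in particular the limitations of traditional sequence-based discrete video tokenizers. The key to the solution is a Mamba-based encoder-decoder architecture that overcomes the shortcomings of earlier methods on video sequences, together with a new quantization scheme, channel-split quantization, which substantially increases the representational power of the quantized latents while keeping the token count unchanged.
Link: https://arxiv.org/abs/2507.04559
Authors: Dawit Mureja Argaw,Xian Liu,Joon Son Chung,Ming-Yu Liu,Fitsum Reda
Affiliations: NVIDIA; KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL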
Abstract:Discrete video tokenization is essential for efficient autoregressive generative modeling due to the high dimensionality of video data. This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents while preserving the token count. Our model sets a new state-of-the-art, outperforming both causal 3D convolution-based and Transformer-based approaches across multiple datasets. Experimental results further demonstrate its robustness as a tokenizer for autoregressive video generation.
[CV-87] A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording ITSC2025
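Quick read: This paper addresses imbalance in the datasets used to train perception systems for autonomous driving: real-world data collection favors common scenes and objects, so rare, novel events are heavily underrepresented, which hurts model generalization and safety. The key to the solution is a real-time data selection method based on object-level novelty detection. It uses a novel dynamic Mean Shift algorithm to assign a data-driven novelty score to image frames, identifying and discarding redundant content while keeping representative novel cases, to build more balanced and diverse datasets.
Link: https://arxiv.org/abs/2507.04529
Authors: Philipp Reis,Joshua Ransiek,David Petri,Jacob Langner,Eric Sax
Affiliations: FZI Research Center for Information Technology; Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, accepted at the IEEE ITSC 2025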
Abstract:High-quality datasets are essential for training robust perception systems in autonomous driving. However, real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented. This imbalance hinders model generalization and compromises safety. The core issue is the curse of rarity. Over time, novel events occur infrequently, and standard logging methods fail to capture them effectively. As a result, large volumes of redundant data are stored, while critical novel cases are diluted, leading to biased datasets. This work presents a real-time data selection method focused on object-level novelty detection to build more balanced and diverse datasets. The method assigns a data-driven novelty score to image frames using a novel dynamic Mean Shift algorithm. It models normal content based on mean and covariance statistics to identify frames with novel objects, discarding those with redundant elements. The main findings show that reducing the training dataset size with this method can improve model performance, whereas higher redundancy tends to degrade it. Moreover, as data redundancy increases, more aggressive filtering becomes both possible and beneficial. While random sampling can offer some gains, it often leads to overfitting and unpredictability in outcomes. The proposed method supports real-time deployment with 32 frames per second and is constant over time. By continuously updating the definition of normal content, it enables efficient detection of novelties in a continuous data stream.
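The idea of modeling normal content with mean and covariance statistics that are updated continuously can be illustrated with a running Mahalanobis-distance score. This sketch deliberately swaps in a simpler technique: it is not the paper's dynamic Mean Shift algorithm. The class name `RunningNovelty`, the Welford-style update, and the `eps` regularizer are all assumptions made for illustration.

```python
import numpy as np

class RunningNovelty:
    """Score feature vectors by Mahalanobis distance to running statistics."""

    def __init__(self, dim, eps=1e-3):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros((dim, dim))  # running sum of deviation outer products
        self.eps = eps                  # keeps the covariance invertible

    def score(self, x):
        """Return the novelty of x; infinite until two samples are seen."""
        if self.n < 2:
            return float("inf")
        cov = self.m2 / (self.n - 1) + self.eps * np.eye(len(self.mean))
        d = x - self.mean
        return float(np.sqrt(d @ np.linalg.solve(cov, d)))

    def update(self, x):
        """Fold x into the mean and covariance in constant time per sample."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += np.outer(delta, x - self.mean)
```

A recorder built on this sketch would compute `score(x)` for each frame's feature vector, keep the frame when the score exceeds a chosen threshold, and then call `update(x)` so that the definition of normal content tracks the stream.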
[CV-88] Grounded Gesture Generation: Language, Motion, and Space CVPR2025
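Quick read: This paper addresses the generation of spatially grounded, context-aware gestures. Existing models typically focus either on descriptive motion generation or on isolated co-speech gesture synthesis aligned with utterance semantics, and both tend to treat motion and environmental grounding separately, which limits progress toward embodied, communicative agents. The key to the solution is a multimodal dataset and framework that combines two resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset of two-party dialogues. Together they provide over 7.7 hours of synchronized motion, speech, and 3D scene information. A physics-based simulator supports synthetic data generation and situated evaluation, connecting gesture modeling with spatial grounding and laying a foundation for research on situated gesture generation and grounded multimodal interaction.
Link: https://arxiv.org/abs/2507.04522
Authors: Anna Deichler,Jim O’Regan,Teo Guichoux,David Johansson,Jonas Beskow
Affiliations: KTH Royal Institute of Technology; Sorbonne University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Accepted as a non-archival paper at the CVPR 2025 Humanoid Agents Workshop. Project page: this https URL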
Abstract:Human motion generation has advanced rapidly in recent years, yet the critical problem of creating spatially grounded, context-aware gestures has been largely overlooked. Existing models typically specialize either in descriptive motion generation, such as locomotion and object interaction, or in isolated co-speech gesture synthesis aligned with utterance semantics. However, both lines of work often treat motion and environmental grounding separately, limiting advances toward embodied, communicative agents. To address this gap, our work introduces a multimodal dataset and framework for grounded gesture generation, combining two key resources: (1) a synthetic dataset of spatially grounded referential gestures, and (2) MM-Conv, a VR-based dataset capturing two-party dialogues. Together, they provide over 7.7 hours of synchronized motion, speech, and 3D scene information, standardized in the HumanML3D format. Our framework further connects to a physics-based simulator, enabling synthetic data generation and situated evaluation. By bridging gesture modeling and spatial grounding, our contribution establishes a foundation for advancing research in situated gesture generation and grounded multimodal interaction. Project page: this https URL. MSC classes: 68T07, 68T42. ACM classes: I.2.7; I.2.6; H.5.2.
[CV-89] FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection
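Quick read: This paper addresses the limited generalization, or the reliance on external large-scale auxiliary datasets, of pre-trained vision-language models (VLMs) in out-of-distribution (OOD) detection. The key to the solution is a CLIP-based framework built on Forced prompt leArning (FA), which improves OOD detection by making full use of in-distribution (ID) knowledge. The core idea is to learn a forced prompt containing richer and more diverse descriptions, strengthening the semantic similarity between ID images and the prompt, and to introduce a forced coefficient that encourages more comprehensive and nuanced descriptions of the ID classes.
Link: https://arxiv.org/abs/2507.04511
Authors: Xinhua Lu,Runhe Lai,Yanqi Wu,Kanghao Chen,Wei-Shi Zheng,Ruixuan Wang
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; Hong Kong University of Science and Technology (Guangzhou); Key Laboratory of Machine Intelligence and Advanced Computing, MOE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: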
Abstract:Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. Code is available at this https URL.
[CV-90] MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization
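Quick read: This paper addresses the generalization and robustness of 6-DoF camera relocalization across multiple scenes, where traditional deep learning methods perform poorly in complex environments. The key to the solution is the MVL-Loc framework, which draws world knowledge from pretrained vision-language models (VLMs) and combines it with multimodal data to generalize across indoor and outdoor scenes, while using natural language as guidance to improve semantic understanding of complex scenes and capture spatial relationships among objects.
Link: https://arxiv.org/abs/2507.04509
Authors: Zhendong Xiao,Wu Wei,Shujie Ji,Shan Yang,Changhao Chen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: PRCV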
Abstract:Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera’s position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and capturing spatial relationships among objects. Extensive experiments on the 7Scenes and Cambridge Landmarks datasets demonstrate MVL-Loc’s robustness and state-of-the-art performance in real-world multi-scene camera relocalization, with improved accuracy in both positional and orientational estimates.
[CV-91] U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration
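Quick read: This paper addresses the loss of localization accuracy in urban environments, where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. The key to the solution is U-ViLAR, an uncertainty-aware visual localization framework. It introduces perceptual-uncertainty-guided association and localization-uncertainty-guided registration to mitigate errors from perception and localization uncertainty, balancing coarse-grained large-scale localization against fine-grained precise localization and improving robustness and accuracy.
Link: https://arxiv.org/abs/2507.04503
Authors: Xiaofan Li,Zhihao Xu,Chenming Wu,Zhao Yang,Yumeng Zhang,Jiang-Jiang Liu,Haibao Yu,Fan Duan,Xiaoqing Ye,Yuan Wang,Shirui Li,Xun Sun,Ji Wan,Jun Wang
Affiliations: Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Vision Localization, Autonomous Driving, Bird’s-Eye-View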
Abstract:Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird’s-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios.
[CV-92] README: Robust Error-Aware Digital Signature Framework via Deep Watermarking Model
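Quick read: This paper addresses the limited embedding capacity and sensitivity to bit-level errors of deep learning watermarking, which make existing models unsuitable for cryptographic applications that need error-free data, such as digital signatures. The key to the solution is the README framework, which combines a cropping-based capacity scaling mechanism with ERPA (ERror PAinting Module), using Distinct Circular Subsum Sequences (DCSS) to localize and correct bit errors. This markedly raises the zero-bit-error image rate and makes the digital signatures robust and verifiable.
Link: https://arxiv.org/abs/2507.04495
Authors: Hyunwook Choi,Sangyun Won,Daeyeon Hwang,Junhyeok Choi
Affiliations: Korea University
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: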
Abstract:Deep learning-based watermarking has emerged as a promising solution for robust image authentication and protection. However, existing models are limited by low embedding capacity and vulnerability to bit-level errors, making them unsuitable for cryptographic applications such as digital signatures, which require over 2048 bits of error-free data. In this paper, we propose README (Robust Error-Aware Digital Signature via Deep WaterMarking ModEl), a novel framework that enables robust, verifiable, and error-tolerant digital signatures within images. Our method combines a simple yet effective cropping-based capacity scaling mechanism with ERPA (ERror PAinting Module), a lightweight error correction module designed to localize and correct bit errors using Distinct Circular Subsum Sequences (DCSS). Without requiring any fine-tuning of existing pretrained watermarking models, README significantly boosts the zero-bit-error image rate (Z.B.I.R) from 1.2% to 86.3% when embedding 2048-bit digital signatures into a single image, even under real-world distortions. Moreover, our use of perceptual hash-based signature verification ensures public verifiability and robustness against tampering. The proposed framework unlocks a new class of high-assurance applications for deep watermarking, bridging the gap between signal-level watermarking and cryptographic security.
[CV-93] Thousand-Brains Systems: Sensorimotor Intelligence for Rapid Robust Learning and Inference
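Quick read: This paper addresses the gap between current AI systems and biological intelligence in core attributes, including rapid continual learning, representations grounded in sensorimotor interaction, and structured knowledge that enables efficient generalization. The key to the solution is thousand-brains systems, which aim to mirror the architecture of cortical columns and their interactions to build AI systems closer to biological intelligence. The paper evaluates Monty, the first implementation of a thousand-brains system, on 3D object perception, showing how it builds structured representations through sensorimotor learning and uses model-free and model-based policies to support rapid inference.
Link: https://arxiv.org/abs/2507.04494
Authors: Niels Leadholm(1),Viviane Clay(1),Scott Knudstrup(1),Hojae Lee(1),Jeff Hawkins(1) ((1) Thousand Brains Project)
Affiliations: Thousand Brains Project
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 32 pages, 8 figures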
Abstract:Current AI systems achieve impressive performance on many tasks, yet they lack core attributes of biological intelligence, including rapid, continual learning, representations grounded in sensorimotor interactions, and structured knowledge that enables efficient generalization. Neuroscience theory suggests that mammals evolved flexible intelligence through the replication of a semi-independent, sensorimotor module, a functional unit known as a cortical column. To address the disparity between biological and artificial intelligence, thousand-brains systems were proposed as a means of mirroring the architecture of cortical columns and their interactions. In the current work, we evaluate the unique properties of Monty, the first implementation of a thousand-brains system. We focus on 3D object perception, and in particular, the combined task of object recognition and pose estimation. Utilizing the YCB dataset of household objects, we first assess Monty’s use of sensorimotor learning to build structured representations, finding that these enable robust generalization. These representations include an emphasis on classifying objects by their global shape, as well as a natural ability to detect object symmetries. We then explore Monty’s use of model-free and model-based policies to enable rapid inference by supporting principled movements. We find that such policies complement Monty’s modular architecture, a design that can accommodate communication between modules to further accelerate inference speed via a novel ‘voting’ algorithm. Finally, we examine Monty’s use of associative, Hebbian-like binding to enable rapid, continual, and computationally efficient learning, properties that compare favorably to current deep learning architectures. While Monty is still in a nascent stage of development, these findings support thousand-brains systems as a powerful and promising new approach to AI.
[CV-94] A Training-Free Style-Personalization via Scale-wise Autoregressive Model
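Quick read: This paper addresses style-personalized control in image generation, that is, steering content and style information at inference time without additional training. The key to the solution is a training-free framework, built on a scale-wise autoregressive model, with a three-path design (content, style, and generation) in which each path is guided by its own text prompt, giving flexible and efficient control over image semantics. A step-wise and attention-wise intervention analysis shows that early-to-middle generation steps are pivotal in shaping both content and style, which motivates two targeted mechanisms: Key Stage Attention Sharing and Adaptive Query Sharing.
Link: https://arxiv.org/abs/2507.04482
Authors: Kyoungmin Lee, Jihun Park, Jongmin Gim, Wonhyeok Choi, Kyumin Hwang, Jaeyeul Kim, Sunghoon Im
Affiliation: DGIST (Daegu Gyeongbuk Institute of Science and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures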
Abstract:We present a training-free framework for style-personalized image generation that controls content and style information during inference using a scale-wise autoregressive model. Our method employs a three-path design–content, style, and generation–each guided by a corresponding text prompt, enabling flexible and efficient control over image semantics without any additional training. A central contribution of this work is a step-wise and attention-wise intervention analysis. Through systematic prompt and feature injection, we find that early-to-middle generation steps play a pivotal role in shaping both content and style, and that query features predominantly encode content-specific information. Guided by these insights, we introduce two targeted mechanisms: Key Stage Attention Sharing, which aligns content and style during the semantically critical steps, and Adaptive Query Sharing, which reinforces content semantics in later steps through similarity-aware query blending. Extensive experiments demonstrate that our method achieves competitive style fidelity and prompt fidelity compared to fine-tuned baselines, while offering faster inference and greater deployment flexibility.
[CV-95] Visual Hand Gesture Recognition with Deep Learning: A Comprehensive Review of Methods, Datasets, Challenges and Future Research Directions
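Quick read: This paper addresses the lack of a structured, complete survey of vision-based hand gesture recognition (VHGR), aiming to give researchers a comprehensive guide for choosing the data, models, and methods suited to a given task. The key to the solution is a systematic research methodology that organizes key VHGR approaches in a taxonomy-based format and analyzes architectural trends and learning strategies across three primary tasks: static gesture recognition, isolated dynamic gestures, and continuous gesture recognition. The survey also reviews commonly used datasets and their annotation schemes, evaluates standard performance metrics, and outlines directions for future research.
Link: https://arxiv.org/abs/2507.04465
Authors: Konstantinos Foteinos, Jorgen Cani, Manousos Linardakis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: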
Abstract:The rapid evolution of deep learning (DL) models and the ever-increasing size of available datasets have raised the interest of the research community in the always important field of vision-based hand gesture recognition (VHGR), and delivered a wide range of applications, such as sign language understanding and human-computer interaction using cameras. Despite the large volume of research works in the field, a structured and complete survey on VHGR is still missing, leaving researchers to navigate through hundreds of papers in order to find the right combination of data, model, and approach for each task. The current survey aims to fill this gap by presenting a comprehensive overview of this aspect of computer vision. With a systematic research methodology that identifies the state-of-the-art works and a structured presentation of the various methods, datasets, and evaluation metrics, this review aims to constitute a useful guideline for researchers, helping them to choose the right strategy for delving into a certain VHGR task. Starting with the methodology used for study selection, literature retrieval, and the analytical framing, the survey identifies and organizes key VHGR approaches using a taxonomy-based format in various dimensions such as input modality and application domain. The core of the survey provides an in-depth analysis of state-of-the-art techniques across three primary VHGR tasks: static gesture recognition, isolated dynamic gestures and continuous gesture recognition. For each task, the architectural trends and learning strategies are listed. Additionally, the study reviews commonly used datasets - emphasizing on annotation schemes - and evaluates standard performance metrics. It concludes by identifying major challenges in VHGR, including both general computer vision issues and domain-specific obstacles, and outlines promising directions for future research.
[CV-96] BiVM: Accurate Binarized Neural Network for Efficient Video Matting
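Quick read: This paper aims to ease the computational limits that deep neural networks face when running real-time video matting on edge devices, which hinder adoption in applications such as online conferences and short-form video production. The key to the solution is BiVM, an accurate and resource-efficient binarized neural network. Its core contributions are: binarized computation structures with elastic shortcuts and evolvable topologies, which help the encoder extract high-quality features; sparsification of the decoder's intermediate features by masking homogeneous parts, so the decoder focuses on detail-rich representations at lower computational cost; and a localized binarization-aware mimicking framework with an information-guided strategy, which makes full use of matting-related representations in the full-precision model.
Link: https://arxiv.org/abs/2507.04456
Authors: Haotong Qin, Xianglong Liu, Xudong Ma, Lei Ke, Yulun Zhang, Jie Luo, Michele Magno
Affiliations: ETH Zurich; Beihang University; State Key Laboratory of Complex & Critical Software Environment, Beihang University; Carnegie Mellon University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: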
Abstract:Deep neural networks for real-time video matting suffer significant computational limitations on edge devices, hindering their adoption in widespread applications such as online conferences and short-form video production. Binarization emerges as one of the most common compression approaches with compact 1-bit parameters and efficient bitwise operations. However, accuracy and efficiency limitations exist in the binarized video matting network due to its degenerated encoder and redundant decoder. Following a theoretical analysis based on the information bottleneck principle, the limitations are mainly caused by the degradation of prediction-relevant information in the intermediate features and the redundant computation in prediction-irrelevant areas. We present BiVM, an accurate and resource-efficient Binarized neural network for Video Matting. First, we present a series of binarized computation structures with elastic shortcuts and evolvable topologies, enabling the constructed encoder backbone to extract high-quality representation from input videos for accurate prediction. Second, we sparse the intermediate feature of the binarized decoder by masking homogeneous parts, allowing the decoder to focus on representation with diverse details while alleviating the computation burden for efficient inference. Furthermore, we construct a localized binarization-aware mimicking framework with the information-guided strategy, prompting matting-related representation in full-precision counterparts to be accurately and fully utilized. Comprehensive experiments show that the proposed BiVM surpasses alternative binarized video matting networks, including state-of-the-art (SOTA) binarization methods, by a substantial margin. Moreover, our BiVM achieves significant savings of 14.3x and 21.6x in computation and storage costs, respectively. We also evaluate BiVM on ARM CPU hardware.
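The abstract's reference to 1-bit parameters and efficient bitwise operations can be illustrated with a generic sketch. This is not BiVM's layer design; it only shows the standard XNOR-and-popcount trick for a dot product between two sign-binarized vectors. The function names `binarize`, `pack_bits`, and `binary_dot` are hypothetical.

```python
def binarize(values):
    """Map real values to {-1, +1} with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an int: bit i is 1 where signs[i] == +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    XNOR marks positions where the signs agree. Each agreement adds +1
    and each disagreement adds -1, so dot = 2 * agreements - n.
    """
    mask = (1 << n) - 1
    agreements = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * agreements - n


# Toy weights and activations (illustrative values only).
weights = [0.7, -1.2, 0.1, -0.4]
activations = [0.3, 0.9, -0.5, -2.0]
w_signs, a_signs = binarize(weights), binarize(activations)

# Reference result with ordinary multiply-accumulate on the sign vectors.
reference = sum(w * a for w, a in zip(w_signs, a_signs))
# Same result using only XOR, NOT, AND, and a popcount.
packed = binary_dot(pack_bits(w_signs), pack_bits(a_signs), len(weights))
```

The packed version replaces multiply-accumulate with bitwise operations on whole machine words, which is where binarized networks get their speed and storage savings.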
[CV-97] CoT-lized Diffusion: Let's Reinforce T2I Generation Step-by-step
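Quick read: This paper addresses the difficulty that current text-to-image (T2I) generation models have in aligning spatial composition with the input text in complex scenes. Existing layout-based methods decouple generation from layout planning, so the layout is hard to refine during synthesis. The key to the solution is CoT-Diff, a framework that tightly integrates 3D layout planning driven by a Multimodal Large Language Model (MLLM) with the diffusion process to enable step-by-step Chain-of-Thought (CoT)-style reasoning. At each denoising step the MLLM evaluates intermediate predictions and dynamically updates the 3D scene layout; the updated layout is converted into semantic conditions and depth maps and fused into the diffusion model, enabling precise spatial control and semantic injection.
Link: https://arxiv.org/abs/2507.04451
Authors: Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, Li Yuan
Affiliations: Peking University; Hupan Lab; DAMO Academy, Alibaba Group; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: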
Abstract:Current text-to-image (T2I) generation models struggle to align spatial composition with the input text, especially in complex scenes. Even layout-based approaches yield suboptimal spatial control, as their generation process is decoupled from layout planning, making it difficult to refine the layout during synthesis. We present CoT-Diff, a framework that brings step-by-step CoT-style reasoning into T2I generation by tightly integrating Multimodal Large Language Model (MLLM)-driven 3D layout planning with the diffusion process. CoT-Diff enables layout-aware reasoning inline within a single diffusion round: at each denoising step, the MLLM evaluates intermediate predictions, dynamically updates the 3D scene layout, and continuously guides the generation process. The updated layout is converted into semantic conditions and depth maps, which are fused into the diffusion model via a condition-aware attention mechanism, enabling precise spatial control and semantic injection. Experiments on 3D Scene benchmarks show that CoT-Diff significantly improves spatial alignment and compositional fidelity, and outperforms the state-of-the-art method by 34.7% in complex scene spatial accuracy, thereby validating the effectiveness of this entangled generation paradigm.
[CV-98] DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
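Quick read: This paper addresses the redundant information and missing comprehensive world knowledge in existing vision-language-action (VLA) models for robot manipulation, which rely on image-based forecasting. The key to the solution is DreamVLA, a framework that integrates dynamic-region-guided world knowledge prediction with spatial and semantic cues to give compact yet comprehensive representations for action planning. It adopts a block-wise structured attention mechanism to reduce interference among dynamic, spatial, and semantic information, and uses a diffusion-based transformer to model the conditional distribution over future actions.
Link: https://arxiv.org/abs/2507.04447
Authors: Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin
Affiliations: Galbot; SJTU (Shanghai Jiao Tong University); EIT; THU (Tsinghua University); PKU (Peking University); UIUC (University of Illinois Urbana-Champaign); USTC (University of Science and Technology of China)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: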
Abstract:Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces a dynamic-region-guided world knowledge prediction, integrated with the spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments on both real-world and simulation environments demonstrate that DreamVLA achieves 76.7% success rate on real robot tasks and 4.44 average length on the CALVIN ABC-D benchmarks.
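The block-wise structured attention described in the abstract, which masks mutual attention among the dynamic, spatial, and semantic representations, can be sketched as a boolean mask over token groups. This is a minimal illustration under stated assumptions, not DreamVLA's actual mask: the group labels and the rule that every token may read a shared context group (`"ctx"`) are hypothetical.

```python
def block_mask(groups, shared="ctx"):
    """Return mask[i][j] = True when token i may attend to token j.

    Assumption for this sketch: every token may attend to the shared
    context tokens, and otherwise only to tokens of its own group, so
    the dynamic, spatial, and semantic groups never see each other.
    """
    return [[gj == shared or gi == gj for gj in groups] for gi in groups]


# Hypothetical token layout: two context tokens, then three knowledge groups.
groups = ["ctx", "ctx", "dyn", "dyn", "spa", "sem"]
mask = block_mask(groups)

for label, row in zip(groups, mask):
    print(label, ["x" if allowed else "." for allowed in row])
```

In a real attention layer, positions where the mask is `False` would have their scores set to negative infinity before the softmax, so no information flows across the blocked groups.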
[CV-99] SFOOD: A Multimodal Benchmark for Comprehensive Food Attribute Analysis Beyond RGB with Spectral Insights
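Quick read: This paper addresses the lack of a large, comprehensive benchmark for food attribute analysis. Despite rapid progress in computer vision and LLMs, existing work focuses mainly on food category recognition, while rich attributes such as origin, quantity, weight, quality, and sweetness remain under-studied. The key to the solution is the first large-scale spectral food (SFOOD) benchmark suite, built by organizing existing food datasets, collecting hyperspectral images of hundreds of foods, and measuring food attributes experimentally with instruments. The resulting benchmark covers 3,266 food categories and 2,351k data points for 17 main food categories.
Link: https://arxiv.org/abs/2507.04412
Authors: Zhenbo Xu, Jinghan Yang, Gong Huang, Jiqing Feng, Liu Liu, Ruihan Sun, Ajin Meng, Zhuo Zhang, Zhaofeng He
Affiliations: Beijing University of Posts and Telecommunications; ShiFang Technology Inc., Hangzhou, China; Specrizion Technology Co.Ltd, Wuxi, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: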
Abstract:With the rise and development of computer vision and LLMs, intelligence is everywhere, especially for people and cars. However, for the wide range of food attributes (such as origin, quantity, weight, quality, sweetness, etc.), existing research still mainly focuses on the study of categories. The reason is the lack of a large and comprehensive benchmark for food. Besides, many food attributes (such as sweetness, weight, and fine-grained categories) are challenging to perceive accurately through RGB cameras alone. To fill this gap and promote the development of intelligent food analysis, in this paper, we built the first large-scale spectral food (SFOOD) benchmark suite. We spent a lot of manpower and equipment costs to organize existing food datasets and collect hyperspectral images of hundreds of foods, and we used instruments to experimentally determine food attributes such as sweetness and weight. The resulting benchmark consists of 3,266 food categories and 2,351k data points for 17 main food categories. Extensive evaluations find that: (i) Large-scale models are still poor at digitizing food. Compared to people and cars, food has gradually become one of the most difficult objects to study; (ii) Spectrum data are crucial for analyzing food properties (such as sweetness). Our benchmark will be open source and continuously iterated for different food analysis tasks.
[CV-100] Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models
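Quick read: This paper addresses the verification of multimedia authenticity, in particular detecting misinformation in complex multimedia content. The key to the solution is a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools and analyzes content through a six-stage pipeline. Its Deep Researcher Agent uses reverse image search, metadata analysis, fact-checking databases, and verified news processing to extract spatial, temporal, attribution, and motivational context, which supports the authenticity judgment.
Link: https://arxiv.org/abs/2507.04410
Authors: Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen, Hung Cao
Affiliations: Quy Nhon AI, FPT Software; University of New Brunswick
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 33rd ACM International Conference on Multimedia (MM’25) Grand Challenge on Multimedia Verification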
Abstract:This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.
[CV-101] MVNet: Hyperspectral Remote Sensing Image Classification Based on Hybrid Mamba-Transformer Vision Backbone Architecture
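Quick read: This paper aims to address the high dimensionality, limited training samples, and spectral redundancy that hyperspectral image (HSI) classification faces, which often cause overfitting and poor generalization. The key to the solution is MVNet, a network architecture that combines the local feature extraction of 3D-CNNs, the global modeling of Transformers, and the linear-complexity sequence modeling of Mamba for efficient spatial-spectral feature extraction and fusion. MVNet also redesigns the dual-branch Mamba module and optimizes the HSI-MambaVision Mixer module, which strengthens modeling of short-range and long-range dependencies, reduces computational latency, and improves classification accuracy and computational efficiency.
Link: https://arxiv.org/abs/2507.04409
Authors: Guandong Li, Mengxia Ye
Affiliations: aiFLYTEK; Aegon THTF
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arXiv admin note: substantial text overlap with arXiv:2506.08324, arXiv:2504.15155, arXiv:2504.13045, arXiv:2503.23472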
Abstract:Hyperspectral image (HSI) classification faces challenges such as high-dimensional data, limited training samples, and spectral redundancy, which often lead to overfitting and insufficient generalization capability. This paper proposes a novel MVNet network architecture that integrates 3D-CNN’s local feature extraction, Transformer’s global modeling, and Mamba’s linear complexity sequence modeling capabilities, achieving efficient spatial-spectral feature extraction and fusion. MVNet features a redesigned dual-branch Mamba module, including a State Space Model (SSM) branch and a non-SSM branch employing 1D convolution with SiLU activation, enhancing modeling of both short-range and long-range dependencies while reducing computational latency in traditional Mamba. The optimized HSI-MambaVision Mixer module overcomes the unidirectional limitation of causal convolution, capturing bidirectional spatial-spectral dependencies in a single forward pass through decoupled attention that focuses on high-value features, alleviating parameter redundancy and the curse of dimensionality. On IN, UP, and KSC datasets, MVNet outperforms mainstream hyperspectral image classification methods in both classification accuracy and computational efficiency, demonstrating robust capability in processing complex HSI data.
[CV-102] A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields ICCV2025
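Quick read: This paper addresses the erroneous depth estimates used to regularize NeRF (Neural Radiance Fields) on real-world data, especially in outdoor unbounded scenes. The errors arise because depth estimation models require expensive 3D supervision and generalize poorly. The key to the solution is to regularize NeRF training with view-consistent distributions instead of fixed depth estimates. The distribution is computed from low-level color features and high-level distilled features at the 2D pixel locations onto which per-ray sampled 3D points project. Sampling from it imposes an implicit regularization, and a depth-pushing loss works alongside the sampling to strengthen the effect.
Link: https://arxiv.org/abs/2507.04408
Authors: Aoxiang Fan, Corentin Dumery, Nicolas Talabot, Pascal Fua
Affiliation: EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 accepted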
Abstract:Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularizations have proven to be the most effective ones. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimations can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimations to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the projected 2D pixel-locations from per-ray sampled 3D points. By sampling from the view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also utilize a depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularizations for eliminating the failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method can generate significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.
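The idea of sampling from a per-ray distribution instead of trusting one fixed depth value can be sketched with inverse-transform sampling. This is an illustration only: how the paper turns color and distilled features into a view-consistency distribution is not reproduced here, and the `scores` below are hypothetical stand-ins for that output. The function name `sample_depths` is also an assumption.

```python
import bisect
import random


def sample_depths(depths, scores, k, rng):
    """Draw k depths along one ray with probability proportional to scores."""
    total = sum(scores)
    cdf = []
    running = 0.0
    for s in scores:
        running += s / total
        cdf.append(running)
    picks = []
    for _ in range(k):
        # Inverse-transform sampling: find the first CDF bin above a uniform draw.
        idx = bisect.bisect_left(cdf, rng.random())
        # Guard against floating-point round-off in the last CDF entry.
        picks.append(depths[min(idx, len(depths) - 1)])
    return picks


# Candidate depths along one ray and hypothetical view-consistency scores.
depths = [1.0, 2.0, 3.0, 4.0]
scores = [0.05, 0.05, 0.85, 0.05]
samples = sample_depths(depths, scores, 1000, random.Random(0))
share_at_3 = samples.count(3.0) / len(samples)
```

Most samples land near the depth that the views agree on, while the other depths still receive some samples, so a wrong peak does not lock training onto one value the way a fixed depth target would.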
[CV-103] Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion ICCV2025
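Quick read: This paper addresses the generation of high-fidelity, detailed 3D urban scenes from satellite imagery. Existing methods rely on neural rendering and struggle to produce accurate 3D structures at scale, mainly because limited 2D observations leave the structure ambiguous. The key to the solution is Sat2City, a framework that combines the representational capacity of sparse voxel grids with latent diffusion models. It has three core components: (1) a cascaded latent diffusion framework that progressively recovers 3D city structures, (2) a Re-Hash operation at the Variational Autoencoder (VAE) bottleneck that computes multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy that provides implicit supervision for smoother appearance.
Link: https://arxiv.org/abs/2507.04403
Authors: Tongyan Hua, Lutao Jiang, Ying-Cong Chen, Wufan Zhao
Affiliations: HKUST(GZ) (The Hong Kong University of Science and Technology (Guangzhou)); HKUST (The Hong Kong University of Science and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025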
Abstract:Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitions. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.
[CV-104] RegistrationMamba: A Mamba-based Registration Framework Integrating Multi-Expert Feature Learning for Cross-Modal Remote Sensing Images
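Quick read: This paper aims to address two key problems in cross-modal remote sensing image (CRSI) registration: significant nonlinear radiometric variations between cross-modal images, and limited textures that make discriminative information hard to extract. The key to the solution is RegistrationMamba, a Mamba architecture based on state space models (SSMs). It uses a multi-directional cross-scanning strategy to capture global contextual relationships with linear complexity, and a multi-expert feature learning (MEFL) strategy in which a learnable soft router dynamically fuses features from multiple feature experts, enriching feature representations and improving registration performance. RegistrationMamba also integrates a multi-level feature aggregation (MFA) module to extract fine-grained local information and enable effective interaction between global and local features.
Link: https://arxiv.org/abs/2507.04397
Authors: Wei Wang, Dou Quan, Chonghua Lv, Shuang Wang, Ning Huyan, Yunan Li, Licheng Jiao
Affiliations: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China; School of Computer Science, Xidian University; Department of Automation, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: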
Abstract:Cross-modal remote sensing image (CRSI) registration is critical for multi-modal image applications. However, CRSI mainly faces two challenges: significant nonlinear radiometric variations between cross-modal images and limited textures hindering the discriminative information extraction. Existing methods mainly adopt convolutional neural networks (CNNs) or Transformer architectures to extract discriminative features for registration. However, CNNs with the local receptive field fail to capture global contextual features, and Transformers have high computational complexity and restrict their application to high-resolution CRSI. To solve these issues, this paper proposes RegistrationMamba, a novel Mamba architecture based on state space models (SSMs) integrating multi-expert feature learning for improving the accuracy of CRSI registration. Specifically, RegistrationMamba employs a multi-directional cross-scanning strategy to capture global contextual relationships with linear complexity. To enhance the performance of RegistrationMamba under texture-limited scenarios, we propose a multi-expert feature learning (MEFL) strategy to capture features from various augmented image variants through multiple feature experts. MEFL leverages a learnable soft router to dynamically fuse the features from multiple experts, thereby enriching feature representations and improving registration performance. Notably, MEFL can be seamlessly integrated into various frameworks, substantially boosting registration performance. Additionally, RegistrationMamba integrates a multi-level feature aggregation (MFA) module to extract fine-grained local information and enable effective interaction between global and local features. Extensive experiments on CRSI with varying image resolutions have demonstrated that RegistrationMamba has superior performance and robustness compared to state-of-the-art methods.
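The soft router in MEFL, which dynamically fuses features from several experts, can be sketched as a softmax-weighted sum. This is a generic mixture-of-experts gate under stated assumptions, not the paper's implementation: the router logits are given as plain numbers here, whereas in the paper they would come from a learnable module, and the names `softmax` and `fuse_experts` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(expert_features, router_logits):
    """Weighted sum of expert feature vectors with softmax gate weights."""
    gates = softmax(router_logits)
    dim = len(expert_features[0])
    return [
        sum(g * f[d] for g, f in zip(gates, expert_features))
        for d in range(dim)
    ]


# Three toy expert outputs for the same image patch (illustrative values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal logits weight every expert the same.
fused_equal = fuse_experts(features, [0.0, 0.0, 0.0])
# A large logit makes the router rely almost entirely on the first expert.
fused_peaked = fuse_experts(features, [20.0, 0.0, 0.0])
```

Because the gate is soft rather than a hard top-1 choice, every expert contributes to the fused feature and gradients reach all of them during training.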
[CV-105] Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers CVPR2025
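Quick read: This paper addresses the incomplete attributions produced by existing feature attribution methods, which compute information in one specific layer and neglect decision evidence distributed across layers. The key to the solution is a comprehensive information bottleneck for attribution (CoIBA), which applies the information bottleneck principle in multiple targeted layers and shares one parametric damping ratio across them to estimate comprehensive information. This compensates for over-compressed information and recovers omitted decision clues. A variational approach upper-bounds the information in each layer, ensuring that the activations discarded in every targeted layer are unnecessary for the decision, which improves the faithfulness of the attributions.
Link: https://arxiv.org/abs/2507.04388
Authors: Jung-Ho Hong, Ho-Joong Kim, Kyu-Sung Jeon, Seong-Whan Lee
Affiliation: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 (highlight)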
Abstract:The feature attribution method reveals the contribution of input variables to the decision-making process to provide an attribution map for explanation. Existing methods grounded on the information bottleneck principle compute information in a specific layer to obtain attributions, compressing the features by injecting noise via a parametric damping ratio. However, the attribution obtained in a specific layer neglects evidence of the decision-making process distributed across layers. In this paper, we introduce a comprehensive information bottleneck (CoIBA), which discovers the relevant information in each targeted layer to explain the decision-making process. Our core idea is applying information bottleneck in multiple targeted layers to estimate the comprehensive information by sharing a parametric damping ratio across the layers. Leveraging this shared ratio complements the over-compressed information to discover the omitted clues of the decision by sharing the relevant information across the targeted layers. We suggest the variational approach to fairly reflect the relevant information of each layer by upper bounding layer-wise information. Therefore, CoIBA guarantees that the discarded activation is unnecessary in every targeted layer to make a decision. The extensive experimental results demonstrate the enhancement in faithfulness of the feature attributions provided by CoIBA.
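Compressing features by injecting noise via a damping ratio, as the abstract describes, can be sketched with the mixing rule used in earlier information-bottleneck attribution work: z = λ·r + (1 − λ)·ε, where ε is Gaussian noise. The sketch assumes CoIBA uses a rule of this form and applies one shared per-token λ to every targeted layer; the function name `bottleneck`, the toy activations, and the noise statistics are all hypothetical.

```python
import random


def bottleneck(features, lam, mean, std, rng):
    """Mix each feature with Gaussian noise: z = lam * r + (1 - lam) * eps.

    lam = 1 passes the feature through untouched; lam = 0 replaces it
    with pure noise, so no information about it survives.
    """
    return [
        l * r + (1.0 - l) * rng.gauss(mean, std)
        for r, l in zip(features, lam)
    ]


rng = random.Random(0)

# One damping ratio per token, shared by all targeted layers (assumption).
lam = [1.0, 0.0, 0.5]

# Toy activations for two targeted layers, three tokens each.
layers = [[2.0, 2.0, 2.0], [-1.0, -1.0, -1.0]]

restricted = [bottleneck(r, lam, 0.0, 1.0, rng) for r in layers]
```

In attribution methods of this family, λ is optimized so that the prediction survives with as little information as possible, and the learned λ then serves as the relevance map. Sharing it across layers ties the evidence from all targeted layers to a single map.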
[CV-106] Transferring Visual Explainability of Self-Explaining Models through Task Arithmetic
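Quick read: This paper addresses the high labeling and computational costs of training self-explaining models for image classification. The key to the solution is a method based on the task arithmetic framework that transfers an explainability vector learned in a source domain to a target domain, improving explanation quality in the target domain without sacrificing classification accuracy.
Link: https://arxiv.org/abs/2507.04380
Authors: Yuya Yoshikawa, Ryotaro Shimizu, Takahiro Kawashima, Yuki Saito
Affiliations: STAIR Lab, Chiba Institute of Technology; ZOZO Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: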
Abstract:In scenarios requiring both prediction and explanation efficiency for image classification, self-explaining models that perform both tasks in a single inference are effective. However, their training incurs substantial labeling and computational costs. This study aims to tackle the issue by proposing a method to transfer the visual explainability of self-explaining models, learned in a source domain, to a target domain based on a task arithmetic framework. Specifically, we construct a self-explaining model by extending image classifiers based on a vision-language pretrained model. We then define an explainability vector as the difference between model parameters trained on the source domain with and without explanation supervision. Based on the task arithmetic framework, we impart explainability to a model trained only on the prediction task in the target domain by applying the explainability vector. Experimental results on various image classification datasets demonstrate that, except for transfers between some less-related domains, visual explainability can be successfully transferred from source to target domains, improving explanation quality in the target domain without sacrificing classification accuracy. Furthermore, we show that the explainability vector learned on a large and diverse dataset like ImageNet, extended with explanation supervision, exhibits universality and robustness, improving explanation quality on nine out of ten different target datasets. We also find that the explanation quality achieved with a single model inference is comparable to that of Kernel SHAP, which requires 150 model inferences.
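The explainability vector defined in the abstract is a parameter difference, so the transfer step reduces to simple arithmetic on weights. The sketch below uses tiny dictionaries in place of real model parameters. The `scale` coefficient is an assumption borrowed from common task-arithmetic practice rather than something the abstract states, and the function names are hypothetical.

```python
def explainability_vector(theta_with, theta_without):
    """Source-domain parameters with explanation supervision minus without."""
    return {k: theta_with[k] - theta_without[k] for k in theta_with}


def apply_vector(theta_target, vector, scale=1.0):
    """Add the (optionally scaled) vector to a prediction-only target model."""
    return {k: theta_target[k] + scale * vector[k] for k in theta_target}


# Toy parameters standing in for full model weights (illustrative values).
source_with_expl = {"w": 1.5, "b": 0.25}
source_pred_only = {"w": 1.0, "b": 0.0}
target_pred_only = {"w": -2.0, "b": 1.0}

vec = explainability_vector(source_with_expl, source_pred_only)
target_with_expl = apply_vector(target_pred_only, vec)
```

The target model never sees explanation labels. It inherits the explanation behaviour from the direction in weight space that explanation supervision produced on the source domain.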
[CV-107] MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection
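Quick read: This paper addresses the difficulty that existing fusion strategies for camera-LiDAR multi-modal 3D object detection have in balancing efficiency, long-range modeling, and retention of complete scene information. The key to the solution is a pure Mamba block based on state-space models (SSMs), which offers linear complexity and long-range modeling, combined with height-fidelity LiDAR encoding that preserves precise height information and improves camera-LiDAR alignment. On top of this, a Hybrid Mamba Block performs local and global contextual learning, and the method reaches state-of-the-art performance while staying efficient.
Link: https://arxiv.org/abs/2507.04369
Authors: Hanshi Wang, Jin Gao, Weiming Hu, Zhipeng Zhang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Artificial Intelligence, Shanghai Jiao Tong University; Anyverse Intelligence; Beijing Key Laboratory of Super Intelligent Security of Multi-Modal Information; School of Information Science and Technology, ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages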
Abstract:We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion, while guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retaining complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with the top-tier NDS score of 75.0 on the nuScenes validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods.
[CV-108] Time2Agri: Temporal Pretext Tasks for Agricultural Monitoring
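Quick read: This paper addresses the weak performance of remote sensing foundation models in agricultural settings, which stems from ignoring a key temporal characteristic: nature's cycle. The key to its solution is three new agriculture-specific pretext tasks, Time-Difference Prediction (TD), Temporal Frequency Prediction (FP), and Future-Frame Prediction (FF), designed to better capture the temporal dynamics of agricultural landscapes.
Link: https://arxiv.org/abs/2507.04366
Authors: Moti Rattan Gupta, Anupam Sobti
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: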
Abstract:Self-Supervised Learning (SSL) has emerged as a prominent paradigm for label-efficient learning, and has been widely utilized by remote sensing foundation models (RSFMs). Recent RSFMs, including SatMAE and DoFA, primarily rely on masked autoencoding (MAE), contrastive learning, or some combination of the two. However, these pretext tasks often overlook the unique temporal characteristics of agricultural landscapes, namely nature’s cycle. Motivated by this gap, we propose three novel agriculture-specific pretext tasks, namely Time-Difference Prediction (TD), Temporal Frequency Prediction (FP), and Future-Frame Prediction (FF). Comprehensive evaluation on the SICKLE dataset shows that FF achieves 69.6% IoU on crop mapping and FP reduces yield prediction error to 30.7% MAPE, outperforming all baselines, while TD remains competitive on most tasks. Further, we scale FF to the national scale of India, achieving 54.2% IoU and outperforming all baselines on field boundary delineation on the FTW India dataset.
[CV-109] DMAT: An End-to-End Framework for Joint Atmospheric Turbulence Mitigation and Object Detection
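Quick read: This paper addresses the loss of clarity and accuracy in surveillance imagery caused by atmospheric turbulence (AT), and the resulting difficulties for object classification and scene tracking. The key to its solution is DMAT, an end-to-end framework that learns to compensate for distorted features while simultaneously improving visualization and object detection. The AT mitigator uses a 3D Mamba-based structure to handle the spatio-temporal displacements and blurring caused by turbulence. Features are extracted in a pyramid manner during mitigation and passed to the detector, and the AT mitigator and object detector are jointly optimized through back-propagation.
Link: https://arxiv.org/abs/2507.04323
Authors: Paul Hill, Alin Achim, Dave Bull, Nantheera Anantrasirichai
Affiliations: Visual Information Laboratory, University of Bristol
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: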
Abstract:Atmospheric Turbulence (AT) degrades the clarity and accuracy of surveillance imagery, posing challenges not only for visualization quality but also for object classification and scene tracking. Deep learning-based methods have been proposed to improve visual quality, but spatio-temporal distortions remain a significant issue. Although deep learning-based object detection performs well under normal conditions, it struggles to operate effectively on sequences distorted by atmospheric turbulence. In this paper, we propose a novel framework that learns to compensate for distorted features while simultaneously improving visualization and object detection. This end-to-end framework leverages and exchanges knowledge of low-level distorted features in the AT mitigator with semantic features extracted in the object detector. Specifically, in the AT mitigator a 3D Mamba-based structure is used to handle the spatio-temporal displacements and blurring caused by turbulence. Features are extracted in a pyramid manner during the mitigation stage and passed to the detector. Optimization is achieved through back-propagation in both the AT mitigator and object detector. Our proposed DMAT outperforms state-of-the-art AT mitigation and object detection systems by up to 15% on datasets corrupted by generated turbulence.
[CV-110] Exploring Remote Physiological Signal Measurement under Dynamic Lighting Conditions at Night: Dataset Experiment and Analysis
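Quick read: This paper addresses two gaps: the effectiveness of current remote photoplethysmography (rPPG) methods under dynamic nighttime lighting is largely unknown, and public datasets designed for such challenging environments are lacking. The key to its solution is DLCN, a large-scale rPPG dataset collected under dynamic lighting conditions at night. It comprises approximately 13 hours of video with synchronized physiological signals and covers four representative nighttime lighting scenarios, supporting robustness evaluation and further research under complex conditions.
Link: https://arxiv.org/abs/2507.04306
Authors: Zhipeng Li, Kegang Wang, Hanguang Xiao, Xingyue Liu, Feizhong Zhou, Jiaxin Jiang, Tianqi Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: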
Abstract:Remote photoplethysmography (rPPG) is a non-contact technique for measuring human physiological signals. Due to its convenience and non-invasiveness, it has demonstrated broad application potential in areas such as health monitoring and emotion recognition. In recent years, the release of numerous public datasets has significantly advanced the performance of rPPG algorithms under ideal lighting conditions. However, the effectiveness of current rPPG methods in realistic nighttime scenarios with dynamic lighting variations remains largely unknown. Moreover, there is a severe lack of datasets specifically designed for such challenging environments, which has substantially hindered progress in this area of research. To address this gap, we present and release a large-scale rPPG dataset collected under dynamic lighting conditions at night, named DLCN. The dataset comprises approximately 13 hours of video data and corresponding synchronized physiological signals from 98 participants, covering four representative nighttime lighting scenarios. DLCN offers high diversity and realism, making it a valuable resource for evaluating algorithm robustness in complex conditions. Built upon the proposed Happy-rPPG Toolkit, we conduct extensive experiments and provide a comprehensive analysis of the challenges faced by state-of-the-art rPPG methods when applied to DLCN. The dataset and code are publicly available at this https URL.
[CV-111] Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization ICCV2025
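Quick read: This paper addresses the limited generalization of models in Single Domain Generalization (SDG), a task complicated by substantial domain shifts and limited data diversity. Existing methods rely mainly on data augmentation and struggle to adapt to large domain shifts. The key to the proposed solution is LEAwareSGD, a Lyapunov Exponent (LE)-guided optimization method that uses LE measurements to modulate the learning rate, keeping training near the edge of chaos to balance stability and adaptability and thereby improve generalization.
Link: https://arxiv.org/abs/2507.04302
Authors: Zuyu Zhang, Ning Chen, Yongshan Liu, Qinghua Zhang, Xu Zhang
Affiliations: Chongqing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025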
Abstract:Single Domain Generalization (SDG) aims to develop models capable of generalizing to unseen target domains using only one source domain, a task complicated by substantial domain shifts and limited data diversity. Existing SDG approaches primarily rely on data augmentation techniques, which struggle to effectively adapt training dynamics to accommodate large domain shifts. To address this, we propose LEAwareSGD, a novel Lyapunov Exponent (LE)-guided optimization approach inspired by dynamical systems theory. By leveraging LE measurements to modulate the learning rate, LEAwareSGD encourages model training near the edge of chaos, a critical state that optimally balances stability and adaptability. This dynamic adjustment allows the model to explore a wider parameter space and capture more generalizable features, ultimately enhancing the model’s generalization capability. Extensive experiments on PACS, OfficeHome, and DomainNet demonstrate that LEAwareSGD yields substantial generalization gains, achieving up to 9.47% improvement on PACS in low-data regimes. These results underscore the effectiveness of training near the edge of chaos for enhancing model generalization capability in SDG tasks.
[CV-112] AutoLayout: Closed-Loop Layout Synthesis via Slow-Fast Collaborative Reasoning
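Quick read: This paper addresses spatial hallucination in automated layout generation and the difficulty of balancing semantic fidelity with physical plausibility, which often lead to layouts with floating or overlapping objects and misaligned stacking relations. The key to its solution is AutoLayout, built on a dual-system framework with slow-fast collaborative reasoning: a slow system performs detailed reasoning through a Reasoning-Reflection-Generation (RRG) pipeline to extract object attributes and spatial constraints, while a fast system generates discrete coordinate sets and a topological relation set that are jointly validated. An LLM-based Adaptive Relation Library (ARL) is introduced to overcome the limitations of handcrafted rules, and a closed-loop self-validation mechanism iteratively corrects errors, mitigating spatial hallucination and balancing physical stability with semantic consistency.
Link: https://arxiv.org/abs/2507.04293
Authors: Weixing Chen, Dafeng Chi, Yang Liu, Yuxi Yang, Yexin Zhang, Yuzheng Zhuang, Xingyue Quan, Jianye Hao, Guanbin Li, Liang Lin
Affiliations: Sun Yat-sen University; Huawei Noah’s Ark Lab; Peng Cheng Laboratory; Guangdong Key Laboratory of Big Data Analysis and Processing
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: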
Abstract:The automated generation of layouts is vital for embodied intelligence and autonomous systems, supporting applications from virtual environment construction to home robot deployment. Current approaches, however, suffer from spatial hallucination and struggle with balancing semantic fidelity and physical plausibility, often producing layouts with deficits such as floating or overlapping objects and misaligned stacking relation. In this paper, we propose AutoLayout, a fully automated method that integrates a closed-loop self-validation process within a dual-system framework. Specifically, a slow system harnesses detailed reasoning with a Reasoning-Reflection-Generation (RRG) pipeline to extract object attributes and spatial constraints. Then, a fast system generates discrete coordinate sets and a topological relation set that are jointly validated. To mitigate the limitations of handcrafted rules, we further introduce an LLM-based Adaptive Relation Library (ARL) for generating and evaluating layouts. Through the implementation of Slow-Fast Collaborative Reasoning, the AutoLayout efficiently generates layouts after thorough deliberation, effectively mitigating spatial hallucination. Its self-validation mechanism establishes a closed-loop process that iteratively corrects potential errors, achieving a balance between physical stability and semantic consistency. The effectiveness of AutoLayout was validated across 8 distinct scenarios, where it demonstrated a significant 10.1% improvement over SOTA methods in terms of physical plausibility, semantic consistency, and functional completeness.
[CV-113] MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation
Quick read: This paper addresses the high computational complexity that hinders diffusion models on edge devices, and in particular the severe performance degradation of existing quantization methods under extremely low-bit (2-4 bit) settings. The key to its solution is the MPQ-DMv2 framework, which introduces Flexible Z-Order Residual Mixed Quantization to handle the imbalanced distributions caused by salient outliers, Object-Oriented Low-Rank Initialization to use prior quantization error for an informative initialization, and Memory-based Temporal Relation Distillation to keep the quantized and full-precision models temporally consistent over the long term.
Link: https://arxiv.org/abs/2507.04290
Authors: Weilun Feng, Chuanguang Yang, Haotong Qin, Yuqi Li, Xiangqi Li, Zhulin An, Libo Huang, Boyu Diao, Fuzhen Zhuang, Michele Magno, Yongjun Xu, Yingli Tian, Tingwen Huang
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; ETH Zurich; Institute of Artificial Intelligence, Beihang University; Zhongguancun Laboratory; Department of Electrical Engineering, The City College; Department of Computer Science, the Graduate Center, the City University of New York; School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models have demonstrated remarkable performance on vision generation tasks. However, their high computational complexity hinders wide application on edge devices. Quantization has emerged as a promising technique for inference acceleration and memory reduction. However, existing quantization methods do not generalize well under extremely low-bit (2-4 bit) quantization, and directly applying them causes severe performance degradation. We identify that the existing quantization framework suffers from an outlier-unfriendly quantizer design and a suboptimal initialization and optimization strategy. We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models. From the quantization perspective, the imbalanced distribution caused by salient outliers is quantization-unfriendly for a uniform quantizer. We propose Flexible Z-Order Residual Mixed Quantization, which utilizes an efficient binary residual branch for flexible quant steps to handle salient error. For the optimization framework, we theoretically analyze the convergence and optimality of the LoRA module and propose Object-Oriented Low-Rank Initialization to use prior quantization error for informative initialization. We then propose Memory-based Temporal Relation Distillation to construct an online time-aware pixel queue for long-term denoising temporal information distillation, which ensures the overall temporal consistency between the quantized and full-precision models. Comprehensive experiments on various generation tasks show that our MPQ-DMv2 surpasses current SOTA methods by a great margin on different architectures, especially under extremely low-bit widths.
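The residual idea in this abstract (a uniform low-bit quantizer plus a binary branch that absorbs the remaining error) can be illustrated with a small NumPy sketch. This is not the authors' Z-Order method: the function names `uniform_quantize`, `binary_residual`, and `quantize_with_residual`, the symmetric per-tensor scaling, and the sign-times-mean-magnitude residual are all assumptions chosen for illustration.

```python
import numpy as np

def uniform_quantize(w, bits):
    """Symmetric per-tensor uniform quantization (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

def binary_residual(residual):
    """1-bit approximation of the residual: sign times mean magnitude."""
    alpha = np.mean(np.abs(residual))
    return alpha * np.sign(residual)

def quantize_with_residual(w, bits):
    """Uniform quantization plus a binary branch fitted to the leftover error."""
    base = uniform_quantize(w, bits)
    return base + binary_residual(w - base)

rng = np.random.default_rng(0)
w = rng.normal(size=4096)
w[:8] *= 20.0  # a few salient outliers stretch the quantization range

err_base = np.mean((w - uniform_quantize(w, 3)) ** 2)
err_res = np.mean((w - quantize_with_residual(w, 3)) ** 2)
print(f"MSE uniform only: {err_base:.4f}, with binary residual: {err_res:.4f}")
```

With outliers inflating the scale, most weights collapse to zero under the uniform quantizer alone; the binary branch recovers part of that error at a cost of one extra bit per weight.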
[CV-114] M3-Med: A Benchmark for Multi-lingual Multi-modal and Multi-hop Reasoning in Medical Instructional Video Understanding
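Quick read: This paper addresses the shortcomings of existing video understanding benchmarks in multilingual support and deep cross-modal reasoning. The key to its solution is M3-Med, a benchmark for multi-lingual, multi-modal, and multi-hop reasoning in medical instructional video understanding. Its multi-hop reasoning task requires a model to first locate a key entity in the text, then find the corresponding visual evidence in the video, and finally combine information from both modalities to derive the answer, enabling evaluation of deep cross-modal understanding.
Link: https://arxiv.org/abs/2507.04289
Authors: Shenxi Liu, Kan Li, Mingyang Zhao, Yuhang Tian, Bin Li, Shoujun Zhou, Hongliang Li, Fuxia Yang
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 8 figures, 7 tables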
Abstract:With the rapid progress of artificial intelligence (AI) in multi-modal understanding, there is increasing potential for video comprehension technologies to support professional domains such as medical education. However, existing benchmarks suffer from two primary limitations: (1) Linguistic Singularity: they are largely confined to English, neglecting the need for multilingual resources; and (2) Shallow Reasoning: their questions are often designed for surface-level information retrieval, failing to properly assess deep multi-modal integration. To address these limitations, we present M3-Med, the first benchmark for Multi-lingual, Multi-modal, and Multi-hop reasoning in Medical instructional video understanding. M3-Med consists of medical questions paired with corresponding video segments, annotated by a team of medical experts. A key innovation of M3-Med is its multi-hop reasoning task, which requires a model to first locate a key entity in the text, then find corresponding visual evidence in the video, and finally synthesize information across both modalities to derive the answer. This design moves beyond simple text matching and poses a substantial challenge to a model’s deep cross-modal understanding capabilities. We define two tasks: Temporal Answer Grounding in Single Video (TAGSV) and Temporal Answer Grounding in Video Corpus (TAGVC). We evaluated several state-of-the-art models and Large Language Models (LLMs) on M3-Med. The results reveal a significant performance gap between all models and human experts, especially on the complex multi-hop questions where model performance drops sharply. M3-Med effectively highlights the current limitations of AI models in deep cross-modal reasoning within specialized domains and provides a new direction for future research.
[CV-115] SeqTex: Generate Mesh Textures in Video Sequence
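Quick read: This paper addresses the difficulty of training native 3D texture generative models without large-scale, high-quality 3D texture datasets, as well as the error accumulation and spatial inconsistency of existing methods that generate multi-view images and rely on post-processing to produce UV texture maps. The key to its solution is SeqTex, which reformulates UV texture generation as a sequence generation problem and uses the visual knowledge in pretrained video foundation models to directly generate complete UV texture maps, transferring image-space priors into the UV domain. Its architecture adds a decoupled multi-view and UV branch design, geometry-informed attention, and adaptive token resolution to synthesize high-fidelity UV texture maps.
Link: https://arxiv.org/abs/2507.04285
Authors: Ze Yuan (1), Xin Yu (1), Yangtian Sun (1), Yuan-Chen Guo (2), Yan-Pei Cao (2), Ding Liang (2), Xiaojuan Qi (1) ((1) HKU, (2) VAST)
Affiliations: HKU; VAST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: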
Abstract:Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps – an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
[CV-116] Clustering via Self-Supervised Diffusion
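Quick read: This paper addresses robust and accurate clustering for unsupervised classification, in particular the challenge of uncovering intricate structures in high-dimensional data. The key to its solution is CLUDI, a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features. It is trained in a teacher-student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy that improves clustering performance and adaptability to complex data distributions.
Link: https://arxiv.org/abs/2507.04283
Authors: Roy Uziel, Irit Chelly, Oren Freifeld, Ari Pakman
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: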
Abstract:Diffusion models, widely recognized for their success in generative tasks, have not yet been applied to clustering. We introduce Clustering via Diffusion (CLUDI), a self-supervised framework that combines the generative power of diffusion models with pre-trained Vision Transformer features to achieve robust and accurate clustering. CLUDI is trained via a teacher-student paradigm: the teacher uses stochastic diffusion-based sampling to produce diverse cluster assignments, which the student refines into stable predictions. This stochasticity acts as a novel data augmentation strategy, enabling CLUDI to uncover intricate structures in high-dimensional data. Extensive evaluations on challenging datasets demonstrate that CLUDI achieves state-of-the-art performance in unsupervised classification, setting new benchmarks in clustering robustness and adaptability to complex data distributions.
[CV-117] Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices
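Quick read: This paper addresses real-time low-light image enhancement on mobile and embedded devices, where models must balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, which limits deployment on resource-constrained platforms. The key to the proposed solution is LiteIE, a lightweight unsupervised enhancement framework whose core components are a backbone-agnostic feature extractor with only two convolutional layers and an Iterative Restoration Module that adds no learnable parameters and progressively recovers details lost in earlier enhancement steps. An unsupervised training objective combining exposure control, edge-aware smoothness, and multi-scale color consistency losses preserves image quality at greatly reduced computational cost.
Link: https://arxiv.org/abs/2507.04277
Authors: Guangrui Bai, Hailong Yan, Wenhai Liu, Yahui Deng, Erbao Dong
Affiliations: University of Science and Technology of China; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to ESWA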
Abstract:Real-time low-light image enhancement on mobile and embedded devices requires models that balance visual quality and computational efficiency. Existing deep learning methods often rely on large networks and labeled datasets, limiting their deployment on resource-constrained platforms. In this paper, we propose LiteIE, an ultra-lightweight unsupervised enhancement framework that eliminates dependence on large-scale supervision and generalizes well across diverse conditions. We design a backbone-agnostic feature extractor with only two convolutional layers to produce compact image-feature enhancement tensors. In addition, we develop a parameter-free Iterative Restoration Module, which reuses the extracted features to progressively recover fine details lost in earlier enhancement steps, without introducing any additional learnable parameters. We further propose an unsupervised training objective that integrates exposure control, edge-aware smoothness, and multi-scale color consistency losses. In experiments on the LOL dataset, LiteIE achieves 19.04 dB PSNR, surpassing SOTA by 1.4 dB while using only 0.07% of its parameters. On a Snapdragon 8 Gen 3 mobile processor, LiteIE runs at 30 FPS for 4K images with just 58 parameters, enabling real-time deployment on edge devices. These results establish LiteIE as an efficient and practical solution for low-light enhancement on resource-limited platforms.
[CV-118] ZERO: Multi-modal Prompt-based Visual Grounding CVPR2025
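Quick read: This paper addresses generalization and deployment flexibility for object detection in industrial domains, especially efficient and robust detection with limited annotated data. The key to its solution is ZERO, a zero-shot multi-prompt model that combines direct image input with user-defined prompts (textual and visual cues) processed through dedicated encoders to produce accurate detections, together with a domain-specific fine-tuning strategy that improves adaptation to new domains.
Link: https://arxiv.org/abs/2507.04270
Authors: Sangbum Choi, Kyeongryeol Go
Affiliations: SuperbAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: A solution report for CVPR2025 Foundational FSOD Challenge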
Abstract:Recent advances in artificial intelligence have led to the emergence of foundation models, large-scale pre-trained neural networks that serve as versatile starting points for a wide range of downstream tasks. In this work, we present ZERO, a zero-shot multi-prompt object detection model specifically designed for robust, production-ready deployment across diverse industrial domains. ZERO integrates direct image input with multiple user-defined prompts, which can include both textual and visual cues, and processes them through dedicated encoders to generate accurate detection outputs. The model architecture is optimized for scalability, with a total of 1.033 TFLOPS and 622.346 million parameters, and is trained using a domain-specific image database exceeding one billion images. For the CVPR 2025 Foundational Few-Shot Object Detection (FSOD) Challenge, we introduce a domain-specific fine-tuning strategy that emphasizes prompt diversity and conservative pseudo-labeling, enabling effective adaptation to new domains with minimal supervision. Our approach demonstrates practical advantages in flexibility, efficiency, and real-world applicability, achieving strong performance on the RF20VL-fsod benchmark despite limited annotation budgets. The results highlight the potential of prompt-driven, data-centric AI for scalable and adaptive object detection in dynamic industrial environments.
[CV-119] Efficient Training of Deep Networks using Guided Spectral Data Selection: A Step Toward Learning What You Need
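Quick read: This paper addresses dataset redundancy and wasted computation in neural network training by optimizing training through efficient data selection. The key to its solution is the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which uses a pre-trained reference model to dynamically adjust the subset of data used for training and applies a Fiedler vector-based scoring mechanism to keep the most informative data points. This reduces the data processed per batch and the computational load while maintaining or improving generalization and accuracy.
Link: https://arxiv.org/abs/2507.04269
Authors: Mohammadreza Sharifi, Ahad Harati
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 19 pages, 10 figures; under review at the Data Mining and Knowledge Discovery journal (Springer); submitted April 2025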
Abstract:Effective data curation is essential for optimizing neural network training. In this paper, we present the Guided Spectrally Tuned Data Selection (GSTDS) algorithm, which dynamically adjusts the subset of data points used for training using an off-the-shelf pre-trained reference model. Based on a pre-scheduled filtering ratio, GSTDS effectively reduces the number of data points processed per batch. The proposed method ensures an efficient selection of the most informative data points for training while avoiding redundant or less beneficial computations. Preserving data points in each batch is performed based on spectral analysis. A Fiedler vector-based scoring mechanism removes the filtered portion of the batch, lightening the resource requirements of the learning. The proposed data selection approach not only streamlines the training process but also promotes improved generalization and accuracy. Extensive experiments on standard image classification benchmarks, including CIFAR-10, Oxford-IIIT Pet, and Oxford-Flowers, demonstrate that GSTDS outperforms standard training scenarios and JEST, a recent state-of-the-art data curation method, on several key factors. It is shown that GSTDS achieves notable reductions in computational requirements, up to four times, without compromising performance. GSTDS exhibits a considerable growth in terms of accuracy under the limited computational resource usage, in contrast to other methodologies. These promising results underscore the potential of spectral-based data selection as a scalable solution for resource-efficient deep learning and motivate further exploration into adaptive data curation strategies. You can find the code at this https URL.
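The abstract does not say how Fiedler-vector scores are turned into a keep/drop decision, so the sketch below only shows the standard spectral ingredient: build a similarity graph over a batch of feature vectors and take the Fiedler vector, that is, the eigenvector for the second-smallest eigenvalue of the graph Laplacian. The RBF kernel, the `gamma` and `keep_ratio` parameters, and the rule of keeping the samples with the largest absolute score are hypothetical choices for illustration, not the GSTDS procedure.

```python
import numpy as np

def fiedler_vector(features, gamma=1.0):
    """Fiedler vector of an RBF similarity graph over the batch."""
    # Pairwise squared distances -> RBF similarity (assumed kernel).
    sq = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W        # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues returned in ascending order
    return eigvecs[:, 1]                  # eigenvector of the second-smallest one

def select_by_fiedler(features, keep_ratio=0.5):
    """Hypothetical rule: keep the samples with the largest |Fiedler| score."""
    scores = np.abs(fiedler_vector(features))
    k = max(1, int(round(keep_ratio * len(features))))
    return np.sort(np.argsort(-scores)[:k])

rng = np.random.default_rng(0)
# Toy batch: two tight clusters of reference-model features (placeholder data).
batch = np.vstack([rng.normal(0.0, 0.1, size=(4, 2)),
                   rng.normal(2.0, 0.1, size=(4, 2))])
v = fiedler_vector(batch)
kept = select_by_fiedler(batch, keep_ratio=0.5)
print(np.sign(v), kept)
```

On this toy batch the sign of the Fiedler vector splits the two clusters, which is the property that makes it a natural basis for a per-sample score.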
[CV-120] An Explainable Transformer Model for Alzheimer’s Disease Detection Using Retinal Imaging
Quick read: This paper addresses early diagnosis of Alzheimer’s disease (AD) by detecting AD from retinal imaging. The key to its solution is Retformer, a novel transformer-based model that combines transformers with explainable artificial intelligence to learn the relationships between complex feature patterns in retinal images of different modalities and disease diagnosis. The Gradient-weighted Class Activation Mapping algorithm is used to visualize the model's decision process, improving both detection accuracy and interpretability.
Link: https://arxiv.org/abs/2507.04259
Authors: Saeed Jamshidiha, Alireza Rezaee, Farshid Hajati, Mojtaba Golzan, Raymond Chiong
Affiliations: University of Tehran; School of Intelligent Systems; College of Interdisciplinary Science and Technology; University of New England; University of Technology Sydney; The University of Newcastle
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 8 figures
Abstract:Alzheimer’s disease (AD) is a neurodegenerative disorder that affects millions worldwide. In the absence of effective treatment options, early diagnosis is crucial for initiating management strategies to delay disease onset and slow down its progression. In this study, we propose Retformer, a novel transformer-based architecture for detecting AD using retinal imaging modalities, leveraging the power of transformers and explainable artificial intelligence. The Retformer model is trained on datasets of different modalities of retinal images from patients with AD and age-matched healthy controls, enabling it to learn complex patterns and relationships between image features and disease diagnosis. To provide insights into the decision-making process of our model, we employ the Gradient-weighted Class Activation Mapping algorithm to visualize the feature importance maps, highlighting the regions of the retinal images that contribute most significantly to the classification outcome. These findings are compared to existing clinical studies on detecting AD using retinal biomarkers, allowing us to identify the most important features for AD detection in each imaging modality. The Retformer model outperforms a variety of benchmark algorithms across different performance metrics by margins of up to 11.
[CV-121] MoReMouse: Monocular Reconstruction of Laboratory Mouse
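Quick read: This paper addresses accurate 3D surface motion reconstruction of laboratory mice in biomedical research, which is challenging because of complex non-rigid geometric deformations and textureless appearance, and because the absence of structured 3D datasets has held back progress beyond sparse keypoint tracking. The key to its solution is MoReMouse, the first monocular dense 3D reconstruction network for laboratory mice, with three core designs: the first high-fidelity dense-view synthetic dataset for mice, built by rendering a self-designed realistic Gaussian mouse avatar; a transformer-based feedforward architecture with triplane representation that generates high-quality 3D surfaces from a single image; and geodesic-based continuous correspondence embeddings on the mouse surface that serve as strong semantic priors for reconstruction stability and surface consistency.
Link: https://arxiv.org/abs/2507.04258
Authors: Yuan Zhong, Jingxiang Sun, Liang An, Yebin Liu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: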
Abstract:Laboratory mice play a crucial role in biomedical research, yet accurate 3D mouse surface motion reconstruction remains challenging due to their complex non-rigid geometric deformations and textureless appearance. Moreover, the absence of structured 3D datasets severely hinders the progress beyond sparse keypoint tracking. To narrow the gap, we present MoReMouse, the first monocular dense 3D reconstruction network tailored for laboratory mice. To achieve this goal, we highlight three key designs. First, we construct the first high-fidelity dense-view synthetic dataset for mice, by rendering our self-designed realistic Gaussian mouse avatar. Second, MoReMouse adopts a transformer-based feedforward architecture with triplane representation, achieving high-quality 3D surface generation from a single image. Third, we create geodesic-based continuous correspondence embeddings on mouse surface, which serve as strong semantic priors to improve reconstruction stability and surface consistency. Extensive quantitative and qualitative experiments demonstrate that MoReMouse significantly outperforms existing open-source methods in accuracy and robustness. Video results are available at this https URL.
[CV-122] Domain Generalizable Portrait Style Transfer ICCV2025
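Quick read: This paper addresses high-quality, semantically aligned portrait style transfer that generalizes across domains, with accurate stylization of regions including hair, eyes, eyelashes, skin, lips, and background. The key to its solution is to establish dense semantic correspondence between the input and reference portraits using a pre-trained model and a semantic adapter, yielding a warped reference that is semantically aligned with the input. An AdaIN-Wavelet transform then blends the low-frequency information of the warped reference with the high-frequency information of the input in the latent space to balance content preservation and stylization. A style adapter provides style guidance from the warped reference, and a dual-conditional diffusion model generates the final result.
Link: https://arxiv.org/abs/2507.04243
Authors: Xinbo Wang, Wenju Xu, Qing Zhang, Wei-Shi Zheng
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ICCV2025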
Abstract:This paper presents a portrait style transfer method that generalizes well to various different domains while enabling high-quality semantic-aligned stylization on regions including hair, eyes, eyelashes, skins, lips, and background. To this end, we propose to establish dense semantic correspondence between the given input and reference portraits based on a pre-trained model and a semantic adapter, with which we obtain a warped reference semantically aligned with the input. To ensure effective yet controllable style transfer, we devise an AdaIN-Wavelet transform to balance content preservation and stylization by blending low-frequency information of the warped reference with high-frequency information of the input in the latent space. A style adapter is also designed to provide style guidance from the warped reference. With the stylized latent from AdaIN-Wavelet transform, we employ a dual-conditional diffusion model that integrates a ControlNet recording high-frequency information and the style guidance to generate the final result. Extensive experiments demonstrate the superiority of our method. Our code and trained model are available at this https URL.
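To make the frequency-blending idea concrete, here is a minimal NumPy sketch that combines standard AdaIN statistics matching with a one-level 2D Haar wavelet transform. The abstract does not give the exact formulation, so the ordering (align statistics first, then swap the low-frequency band), the single decomposition level, and the helper names `adain`, `haar_dwt2`, `haar_idwt2`, and `blend` are assumptions; the random arrays stand in for latents.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Match per-channel mean/std of `content` to `style`; arrays are (C, H, W)."""
    mu_c = content.mean(axis=(-2, -1), keepdims=True)
    std_c = content.std(axis=(-2, -1), keepdims=True) + eps
    mu_s = style.mean(axis=(-2, -1), keepdims=True)
    std_s = style.std(axis=(-2, -1), keepdims=True) + eps
    return std_s * (content - mu_c) / std_c + mu_s

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform (H and W must be even)."""
    a, b = x[..., 0::2, 0::2], x[..., 0::2, 1::2]
    c, d = x[..., 1::2, 0::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2)

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    out = np.empty(ll.shape[:-2] + (2 * ll.shape[-2], 2 * ll.shape[-1]), dtype=ll.dtype)
    out[..., 0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[..., 0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[..., 1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[..., 1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def blend(input_latent, warped_ref_latent):
    """Low frequencies from the warped reference, high frequencies from the input."""
    aligned = adain(input_latent, warped_ref_latent)  # assumed ordering
    ll_ref, _, _, _ = haar_dwt2(warped_ref_latent)
    _, lh, hl, hh = haar_dwt2(aligned)
    return haar_idwt2(ll_ref, lh, hl, hh)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))                       # placeholder input latent
ref = rng.normal(loc=2.0, scale=3.0, size=(4, 8, 8)) # placeholder warped reference
out = blend(x, ref)
```

In this sketch the low-frequency band carries the reference's coarse color and tone, while the input's high-frequency bands keep edges and fine structure.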
[CV-123] DreamPoster: A Unified Framework for Image-Conditioned Generative Poster Design
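Quick read: This paper addresses the synthesis of high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. The key to its solution is a unified poster generation framework built on the T2I model Seedream3.0, a systematic data annotation pipeline and dataset construction method that precisely annotates textual content and typographic hierarchy, and a progressive training strategy that lets the model acquire multi-task generation capabilities hierarchically while keeping generation quality high.
Link: https://arxiv.org/abs/2507.04218
Authors: Xiwei Hu, Haokun Chen, Zhongqi Qi, Hui Zhang, Dexiang Hong, Jie Shao, Xinglong Wu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: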
Abstract:We present DreamPoster, a Text-to-Image generation framework that intelligently synthesizes high-quality posters from user-provided images and text prompts while maintaining content fidelity and supporting flexible resolution and layout outputs. Specifically, DreamPoster is built upon our T2I model, Seedream3.0 to uniformly process different poster generating types. For dataset construction, we propose a systematic data annotation pipeline that precisely annotates textual content and typographic hierarchy information within poster images, while employing comprehensive methodologies to construct paired datasets comprising source materials (e.g., raw graphics/text) and their corresponding final poster outputs. Additionally, we implement a progressive training strategy that enables the model to hierarchically acquire multi-task generation capabilities while maintaining high-quality generation. Evaluations on our testing benchmarks demonstrate DreamPoster’s superiority over existing methods, achieving a high usability rate of 88.55%, compared to GPT-4o (47.56%) and SeedEdit3.0 (25.96%). DreamPoster will be online in Jimeng and other Bytedance Apps.
[CV-124] Quick Bypass Mechanism of Zero-Shot Diffusion-Based Image Restoration
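[Quick read]: This paper addresses the inefficiency of diffusion models in image restoration tasks such as super-resolution and deblurring, where the iterative denoising process takes too long. The key to its solution is the Quick Bypass Mechanism (QBM), which initializes denoising from an intermediate approximation and thereby skips the early denoising steps to accelerate the whole process, together with the Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance stochasticity and mitigate possible inconsistencies.
Link: https://arxiv.org/abs/2507.04207
Authors: Yu-Shan Tai, An-Yeu (Andy) Wu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: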
Abstract:Recent advancements in diffusion models have demonstrated remarkable success in various image generation tasks. Building upon these achievements, diffusion models have also been effectively adapted to image restoration tasks, e.g., super-resolution and deblurring, aiming to recover high-quality images from degraded inputs. Although existing zero-shot approaches enable pretrained diffusion models to perform restoration tasks without additional fine-tuning, these methods often suffer from prolonged iteration times in the denoising process. To address this limitation, we propose a Quick Bypass Mechanism (QBM), a strategy that significantly accelerates the denoising process by initializing from an intermediate approximation, effectively bypassing early denoising steps. Furthermore, recognizing that approximation may introduce inconsistencies, we introduce a Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance the stochasticity and mitigate potential disharmony. We validate proposed methods on ImageNet-1K and CelebA-HQ across multiple image restoration tasks, e.g., super-resolution, deblurring, and compressed sensing. Our experimental results show that the proposed methods can effectively accelerate existing methods while maintaining original performance.
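The bypass idea can be illustrated with the standard DDPM forward-noising formula: instead of starting the reverse process from pure noise at step T, a coarse approximation of the clean image is noised to an intermediate step t0 < T and denoising starts there. The sketch below is a minimal illustration assuming a linear beta schedule; the function names, the schedule, and the choice of t0 are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List


def alpha_bar(t: int, T: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (T - 1)
        prod *= 1.0 - beta
    return prod


def noise_to_step(x0_approx: List[float], t: int, rng: random.Random, T: int = 1000) -> List[float]:
    """Forward-diffuse an approximate clean signal: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t, T)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for x in x0_approx]


T, t0 = 1000, 300                # t0 is an assumed intermediate step
rng = random.Random(0)
x0_approx = [0.2, -0.5, 0.9]     # stand-in for a coarse restoration estimate
x_t0 = noise_to_step(x0_approx, t0, rng, T)
steps_skipped = T - t0           # reverse steps that no longer need to run
```

A reverse sampler would then run only from step t0 down to 0, starting from `x_t0`.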
[CV-125] Towards Spatially-Varying Gain and Binning
[Quick read]: This paper addresses the drop in signal-to-noise ratio and the limited dynamic range caused by shrinking pixel sizes in image sensors. Its key solution is spatially-varying gain and binning: adjusting the gain and the binning size according to local scene brightness improves the sensor's noise performance and dynamic range and balances resolution against noise under different light levels.
Link: https://arxiv.org/abs/2507.04190
Authors: Anqi Yang, Eunhee Kang, Wei Chen, Hyong-Euk Lee, Aswin C. Sankaranarayanan
Affiliations: Carnegie Mellon University; Samsung Advanced Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:
Abstract: Pixels in image sensors have progressively become smaller, driven by the goal of producing higher-resolution imagery. However, ceteris paribus, a smaller pixel accumulates less light, making image quality worse. This interplay of resolution, noise, and the dynamic range of the sensor and their impact on the eventual quality of acquired imagery is a fundamental concept in photography. In this paper, we propose spatially-varying gain and binning to enhance the noise performance and dynamic range of image sensors. First, we show that by varying gain spatially according to local scene brightness, the read noise can be made negligible, and the dynamic range of a sensor is expanded by an order of magnitude. Second, we propose a simple analysis to find a binning size that best balances resolution and noise for a given light level; this analysis predicts a spatially-varying binning strategy, again based on local scene brightness, to effectively increase the overall signal-to-noise ratio. We discuss analog and digital binning modes and, perhaps surprisingly, show that digital binning outperforms its analog counterparts when a larger gain is allowed. Finally, we demonstrate combining spatially-varying gain and binning in various applications, including high dynamic range imaging, vignetting, and lens distortion.
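The resolution-versus-noise trade-off behind binning can be sketched with a common textbook noise model: shot noise with variance equal to the signal, plus a fixed read noise per readout. This is an assumption for illustration, not the paper's analysis. In particular, the model keeps read noise independent of gain, so it does not capture the abstract's finding that digital binning can outperform analog binning when a larger gain is allowed. The function names are hypothetical.

```python
import math


def snr_analog(signal_e: float, read_noise_e: float, n: int) -> float:
    """Analog binning: charge from n pixels is summed first, then read out once."""
    total = n * signal_e
    return total / math.sqrt(total + read_noise_e ** 2)


def snr_digital(signal_e: float, read_noise_e: float, n: int) -> float:
    """Digital binning: each of the n pixels is read out separately, then summed."""
    total = n * signal_e
    return total / math.sqrt(total + n * read_noise_e ** 2)


# Hypothetical low-light pixel: 4 photoelectrons of signal, 2 e- read noise.
for n in (1, 4, 16):
    print(n, round(snr_analog(4.0, 2.0, n), 2), round(snr_digital(4.0, 2.0, n), 2))
```

Under this model, a larger bin always raises SNR at the cost of resolution, which is why a brightness-dependent bin size is a natural choice.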
[CV-126] Voyaging into Unbounded Dynamic Scenes from a Single View ICCV
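[Quick read]: This paper addresses generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Previous methods learn scene consistency by training on multiple views, but the generated scene regions are bounded by the training views and cannot extend to broader camera movements. The key to the solution is DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting process for new dynamic content and treats pixels as rays to enrich the input, so that 3D motion consistency can be learned from the ray information.
Link: https://arxiv.org/abs/2507.04183
Authors: Fengrui Tian, Tianjiao Ding, Jinqi Luo, Hancheng Min, René Vidal
Affiliations: University of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by International Conference on Computer Vision (ICCV) 2025. Project Page: this https URL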
Abstract:This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene is changing over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are bounded to be close to the training views with limited camera movements. To address this issue, we propose DynamicVoyager that reformulates the dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D consistent motions from only 2D pixels at a single view, we consider pixels as rays to enrich the pixel input with the ray context, so that the 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud with the estimated video depths. Then we render the partial video at a novel view and outpaint the video with ray contexts from the point cloud to generate 3D consistent motions. We employ the outpainted video to update the point cloud, which is used for scene outpainting from future novel views. Experiments show that our model is able to generate unbounded scenes with consistent motions along fly-through cameras, and the generated contents can be controlled with scene prompts.
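Treating pixels as rays, and lifting a depth map to a point cloud, can be sketched with a pinhole camera model. This is a minimal sketch assuming known intrinsics (`fx`, `fy`, `cx`, `cy`) and a camera looking along +z; the function names and the intrinsics values are hypothetical, and the paper's actual ray-context encoding is not specified in the abstract.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_ray(u: float, v: float, fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Unit direction, in camera coordinates, of the ray through pixel (u, v)."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """3D point for pixel (u, v) at the given z-depth (one point of the point cloud)."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


# Hypothetical 640x480 camera with a 500-pixel focal length.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
center_ray = pixel_to_ray(320.0, 240.0, **K)
point = backproject(820.0, 240.0, 2.0, **K)
```

Applying `backproject` to every pixel of every frame, using estimated video depth, would give the kind of dynamic point cloud the abstract describes.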
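[CV-127] LVLM-Composer's Explicit Planning for Image Generation
[Quick read]: This paper addresses the weak compositional understanding and visual planning of current large vision-language models (LVLMs) when handling complex textual descriptions, especially the accurate rendering of multiple objects, attributes, spatial relationships, and specific poses. The key to its solution is LVLM-Composer, which uses a hierarchical semantic planning module for structured prompt decomposition and a fine-grained feature alignment mechanism for precise visual guidance during generation, together with a multi-stage training paradigm that strengthens compositional reasoning, improving the accuracy and consistency of text-to-image generation.
Link: https://arxiv.org/abs/2507.04152
Authors: Spencer Ramsey, Jeffrey Lee, Amina Grant
Affiliations: Northern Caribbean University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: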
Abstract:The burgeoning field of generative artificial intelligence has fundamentally reshaped our approach to content creation, with Large Vision-Language Models (LVLMs) standing at its forefront. While current LVLMs have demonstrated impressive capabilities in text-to-image generation, they often falter when confronted with complex textual descriptions demanding precise compositional understanding and visual planning. This limitation particularly impacts the accurate rendering of multiple objects, their attributes, spatial relationships, and specific poses within intricate scenes, as evidenced by benchmarks like LongBench-T2I. To address these challenges, we introduce LVLM-Composer, a novel 10-billion parameter scale LVLM specifically engineered for enhanced compositional image synthesis. Our method incorporates a Hierarchical Semantic Planning Module for structured prompt decomposition and a Fine-Grained Feature Alignment Mechanism for precise visual guidance during generation. We propose a multi-stage training paradigm, featuring Hierarchical Semantic-Visual Grounding Pre-training and Compositional Planning Reinforcement Learning with Self-Correction, to instill robust compositional reasoning. Extensive experiments on the LongBench-T2I benchmark, utilizing automatic evaluation by Gemini-2.0-Flash and InternVL3-78B, demonstrate LVLM-Composer’s superior performance across critical compositional dimensions including object accuracy, composition fidelity, and pose accuracy, significantly outperforming state-of-the-art baselines. An in-depth ablation study further validates the indispensable contribution of our proposed modules, while human evaluations confirm the perceptual superiority of our generated images. LVLM-Composer represents a significant step towards truly controllable and compositionally accurate open-ended text-to-image generation.
[CV-128] Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation
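[Quick read]: This paper addresses two problems in text-to-image generation for complex, compositional prompts: dependence on costly annotated data, and insufficient control over fine-grained visual attributes and intricate spatial relationships. The key to its solution is a two-stage self-supervised learning strategy. In the first stage, multi-granularity visual-language grounding lets the large vision-language model (LVLM) autonomously generate and align hierarchical captions, building deep semantic understanding without extensive human annotation. In the second stage, an internal compositional planning mechanism generates detailed sub-prompts to guide image generation, and a semantic consistency loss aligns the output precisely.
Link: https://arxiv.org/abs/2507.04151
Authors: Fernando Gabriela Garcia, Spencer Burns, Ryan Shaw, Hunter Young
Affiliations: Autonomous University of Nuevo León
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: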
Abstract:This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM’s superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM’s enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.
[CV-129] A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality
[Quick read]: This paper addresses the processing latency in gaze-tracked foveated rendering for virtual reality (VR) caused by the computational overhead of gaze tracking. The key to its solution is A3FR, an efficient rendering framework that parallelizes gaze tracking and foveated rendering, reducing end-to-end rendering latency while maintaining visual quality.
Link: https://arxiv.org/abs/2507.04147
Authors: Shuo Xin, Haiyu Wang, Sai Qian Zhang
Affiliations: Stanford University; New York University; Tandon School of Engineering
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: ACM International Conference on Supercomputing 2025
Abstract: Virtual reality (VR) significantly transforms immersive digital interfaces, greatly enhancing education, professional practices, and entertainment by increasing user engagement and opening up new possibilities in various industries. Among its numerous applications, image rendering is crucial. Nevertheless, rendering methodologies like 3D Gaussian Splatting impose high computational demands, driven predominantly by user expectations for superior visual quality. This results in notable processing delays for real-time image rendering, which greatly affects the user experience. Additionally, VR devices such as head-mounted displays (HMDs) are intricately linked to human visual behavior, leveraging knowledge from perception and cognition to improve user experience. These insights have spurred the development of foveated rendering, a technique that dynamically adjusts rendering resolution based on the user’s gaze direction. The resultant solution, known as gaze-tracked foveated rendering, significantly reduces the computational burden of the rendering process. Although gaze-tracked foveated rendering can reduce rendering costs, the computational overhead of the gaze tracking process itself can sometimes outweigh the rendering savings, leading to increased processing latency. To address this issue, we propose an efficient rendering framework called A3FR, designed to minimize the latency of gaze-tracked foveated rendering via the parallelization of gaze tracking and foveated rendering processes. For the rendering algorithm, we utilize 3D Gaussian Splatting, a state-of-the-art neural rendering technique. Evaluation results demonstrate that A3FR can reduce end-to-end rendering latency by up to 2× while maintaining visual quality.
Related DOI: https://doi.org/10.1145/3721145.3735112
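The core of foveated rendering, choosing a quality level from the angular distance between a pixel and the gaze point, can be sketched as follows. This is a minimal sketch under assumed values: the pixels-per-degree figure, the eccentricity thresholds, and the function names are hypothetical, and it does not model A3FR's parallelization of gaze tracking and rendering.

```python
import math
from typing import Sequence, Tuple


def eccentricity_deg(pixel: Tuple[float, float], gaze: Tuple[float, float],
                     pixels_per_degree: float) -> float:
    """Approximate angular distance between a pixel and the gaze point."""
    return math.hypot(pixel[0] - gaze[0], pixel[1] - gaze[1]) / pixels_per_degree


def foveation_level(pixel: Tuple[float, float], gaze: Tuple[float, float],
                    pixels_per_degree: float = 20.0,
                    thresholds_deg: Sequence[float] = (5.0, 15.0)) -> int:
    """Return 0 for the full-resolution fovea; higher levels mean coarser rendering."""
    ecc = eccentricity_deg(pixel, gaze, pixels_per_degree)
    for level, limit in enumerate(thresholds_deg):
        if ecc <= limit:
            return level
    return len(thresholds_deg)


gaze = (100.0, 100.0)
levels = [foveation_level(p, gaze) for p in [(100.0, 100.0), (300.0, 100.0), (600.0, 100.0)]]
```

Because the level map depends on the gaze estimate, a stale or slow gaze result delays rendering, which is the latency the abstract sets out to reduce.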
[CV-130] Pedestrian Intention Prediction via Vision-Language Foundation Models
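[Quick read]: This paper addresses pedestrian crossing intention prediction for autonomous vehicles, where conventional vision-based methods are limited in generalizability, context understanding, and causal reasoning. The key to its solution is to use vision-language foundation models (VLFMs) and integrate multimodal data through hierarchical prompt templates, incorporating contextual information (visual frames, physical cue observations, and ego-vehicle dynamics) into systematically refined prompts to improve prediction accuracy.
Link: https://arxiv.org/abs/2507.04141
Authors: Mohsen Azarmi, Mahdi Rezaei, He Wang
Affiliations: University of Leeds; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: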
Abstract: Prediction of pedestrian crossing intention is a critical function in autonomous vehicles. Conventional vision-based methods of crossing intention prediction often struggle with generalizability, context understanding, and causal reasoning. This study explores the potential of vision-language foundation models (VLFMs) for predicting pedestrian crossing intentions by integrating multimodal data through hierarchical prompt templates. The methodology incorporates contextual information, including visual frames, physical cue observations, and ego-vehicle dynamics, into systematically refined prompts to guide VLFMs effectively in intention prediction. Experiments were conducted on three common datasets: JAAD, PIE, and FU-PIP. Results demonstrate that incorporating vehicle speed, its variations over time, and time-conscious prompts significantly enhances the prediction accuracy by up to 19.8%. Additionally, optimised prompts generated via an automatic prompt engineering framework yielded 12.5% further accuracy gains. These findings highlight the superior performance of VLFMs compared to conventional vision-based models, offering enhanced generalisation and contextual understanding for autonomous driving applications.
[CV-131] Driver-Net: Multi-Camera Fusion for Assessing Driver Take-Over Readiness in Automated Vehicles
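[Quick read]: This paper addresses the accurate and timely assessment of driver readiness required for safe transfer of control in automated vehicles. Its key solution is Driver-Net, a deep learning framework that fuses multi-camera inputs: a triple-camera setup captures synchronised visual cues from the driver's head, hands, and body posture, and a dual-path architecture (a Context Block and a Feature Block) followed by a cross-modal fusion strategy improves the accuracy of take-over readiness prediction.
Link: https://arxiv.org/abs/2507.04139
Authors: Mahdi Rezaei, Mohsen Azarmi
Affiliations: University of Leeds
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 4 Figures, 4 Tables. Accepted at IEEE IV 2025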
Abstract:Ensuring safe transition of control in automated vehicles requires an accurate and timely assessment of driver readiness. This paper introduces Driver-Net, a novel deep learning framework that fuses multi-camera inputs to estimate driver take-over readiness. Unlike conventional vision-based driver monitoring systems that focus on head pose or eye gaze, Driver-Net captures synchronised visual cues from the driver’s head, hands, and body posture through a triple-camera setup. The model integrates spatio-temporal data using a dual-path architecture, comprising a Context Block and a Feature Block, followed by a cross-modal fusion strategy to enhance prediction accuracy. Evaluated on a diverse dataset collected from the University of Leeds Driving Simulator, the proposed method achieves an accuracy of up to 95.8% in driver readiness classification. This performance significantly enhances existing approaches and highlights the importance of multimodal and multi-view fusion. As a real-time, non-intrusive solution, Driver-Net contributes meaningfully to the development of safer and more reliable automated vehicles and aligns with new regulatory mandates and upcoming safety standards.
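[CV-132] Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge ICCV2025
[Quick read]: This paper addresses achieving both low latency and high accuracy in 3D object detection for autonomous vehicles (AVs). Its key solution is EMC2, an Edge-based Mixture of Experts (MoE) Collaborative Computing system whose scenario-aware MoE architecture is optimized for edge platforms. EMC2 uses an adaptive multimodal data bridge for multi-scale preprocessing of sensor inputs and a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance, fusing the complementary strengths of LiDAR point clouds and camera images into robust multimodal representations. EMC2 also integrates joint hardware-software optimizations for efficient real-time inference on resource-constrained edge devices.
Link: https://arxiv.org/abs/2507.04123
Authors: Linshen Liu, Boyan Su, Junyue Jiang, Guanlin Wu, Cong Guo, Ceyu Xu, Hao Frank Yang
Affiliations: Johns Hopkins University; Duke University; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCV 2025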
Abstract: This paper presents Edge-based Mixture of Experts (MoE) Collaborative Computing (EMC2), an optimal computing system designed for autonomous vehicles (AVs) that simultaneously achieves low-latency and high-accuracy 3D object detection. Unlike conventional approaches, EMC2 incorporates a scenario-aware MoE architecture specifically optimized for edge platforms. By effectively fusing LiDAR and camera data, the system leverages the complementary strengths of sparse 3D point clouds and dense 2D images to generate robust multimodal representations. To enable this, EMC2 employs an adaptive multimodal data bridge that performs multi-scale preprocessing on sensor inputs, followed by a scenario-aware routing mechanism that dynamically dispatches features to dedicated expert models based on object visibility and distance. In addition, EMC2 integrates joint hardware-software optimizations, including hardware resource utilization optimization and computational graph simplification, to ensure efficient and real-time inference on resource-constrained edge devices. Experiments on open-source benchmarks clearly show the EMC2 advancements as an end-to-end system. On the KITTI dataset, it achieves an average accuracy improvement of 3.58% and a 159.06% inference speedup compared to 15 baseline methods on Jetson platforms, with similar performance gains on the nuScenes dataset, highlighting its capability to advance reliable, real-time 3D object detection tasks for AVs.
[CV-133] When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution Trap is All You Need ICML2025
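[Quick read]: This paper addresses how, in data-free knowledge distillation (DFKD) with an untrusted teacher, to prevent the transfer of misleading knowledge from the out-of-distribution (OOD) domain. The key to its solution is Adversarial Trap Escaping (ATEsc), which identifies and filters out OOD-like synthetic samples: fragile samples are treated as in-distribution (ID) data and used for normal knowledge distillation, while robust samples are treated as OOD data and used for forgetting OOD knowledge, improving the robustness of DFKD against non-transferable learning (NTL) teachers.
Link: https://arxiv.org/abs/2507.04119
Authors: Ziming Hong, Runnan Chen, Zengmao Wang, Bo Han, Bo Du, Tongliang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025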
Abstract: Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data and use them as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where the transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD by diverting the generator’s attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer but prioritizes OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is seen as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc for improving DFKD against NTL teachers. Code is released at this https URL.
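The robustness-based split can be sketched on a toy problem. In this minimal sketch, a 1-D threshold classifier stands in for the teacher, and a fixed shift of plus or minus `eps` stands in for an adversarial attack; both are assumptions for illustration. The paper's actual attack, models, and grouping criterion are not given in the abstract, and all names here are hypothetical.

```python
from typing import Callable, List, Tuple


def is_fragile(x: float, predict: Callable[[float], int], eps: float) -> bool:
    """A sample is fragile if a perturbation of size eps flips the teacher's prediction."""
    base = predict(x)
    return any(predict(x + delta) != base for delta in (-eps, eps))


def split_by_robustness(samples: List[float], predict: Callable[[float], int],
                        eps: float) -> Tuple[List[float], List[float]]:
    """Return (fragile, robust): fragile is treated as ID-like, robust as OOD-like."""
    fragile = [x for x in samples if is_fragile(x, predict, eps)]
    robust = [x for x in samples if not is_fragile(x, predict, eps)]
    return fragile, robust


def toy_teacher(x: float) -> int:
    """Hypothetical teacher: a single decision boundary at 0."""
    return 1 if x > 0 else 0


id_like, ood_like = split_by_robustness([0.05, -0.02, 3.0, -4.0], toy_teacher, eps=0.1)
```

In the scheme the abstract describes, the `id_like` group would feed the usual distillation loss, and the `ood_like` group would feed a forgetting objective.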
[CV-134] PromptSR: Cascade Prompting for Lightweight Image Super-Resolution
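[Quick read]: This paper addresses the limited receptive field of lightweight Vision Transformers in image super-resolution (SR), which stems from window-based self-attention, and the difficulty of enlarging the window at low cost because computational complexity grows quadratically with window size. The key to its solution is PromptSR, whose core component, the cascade prompting block (CPB), uses three cascaded prompting layers, one global anchor prompting layer (GAPL) and two local prompting layers (LPLs), to improve global information access and local refinement, combining global priors with local details to enlarge the receptive field while keeping computational cost low.
Link: https://arxiv.org/abs/2507.04118
Authors: Wenyang Liu, Chen Cai, Jianjun Gao, Kejun Wu, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Affiliations: Nanyang Technological University; Huazhong University of Science and Technology; The Hong Kong Polytechnic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in TMM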
Abstract:Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at this https URL.
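The cost pressure that motivates the design can be made concrete with a rough operation count for window-based self-attention. This is a minimal sketch under a simplifying assumption: it counts only the two attention matrix products (QK^T and attention times V) and ignores projections, heads, and softmax. The function name and the feature-map sizes are hypothetical.

```python
def window_attention_cost(height: int, width: int, window: int, channels: int) -> int:
    """Approximate multiply-adds for attention over non-overlapping window x window tiles."""
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    tokens_per_window = window * window
    num_windows = (height // window) * (width // window)
    # Each window forms a (tokens x tokens) attention map twice: QK^T and attn @ V.
    return 2 * num_windows * tokens_per_window ** 2 * channels


small = window_attention_cost(64, 64, 8, 32)
large = window_attention_cost(64, 64, 16, 32)
ratio = large // small  # doubling the window side quadruples the total cost
```

For a fixed feature map, the total grows with the square of the window side, so a large receptive field obtained through large windows alone becomes expensive quickly.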
[CV-135] Integrated Gaussian Processes for Robust and Adaptive Multi-Object Tracking
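[Quick read]: This paper addresses track breaks in multi-object tracking, particularly in challenging environments and against agile targets, as well as on-line learning of measurement model parameters in dynamically changing scenes. Its key solution exploits the flexibility of the integrated Gaussian process as a motion model and the convenient statistical properties of non-homogeneous Poisson processes as an observation model, combined with a proposed track revival / stitching mechanism. It introduces two robust and adaptive trackers, Gaussian and Poisson Process with Classification (GaPP-Class) and GaPP with Revival and Classification (GaPP-ReaCtion), which integrate track management and hyperparameter learning; GaPP-ReaCtion further strengthens track revival with a Markov Chain Monte Carlo kernel.
Link: https://arxiv.org/abs/2507.04116
Authors: Fred Lydeard, Bashar I. Ahmad, Simon Godsill
Affiliations: University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Methodology (stat.ME)
Comments: 18 pages, 5 figures, submitted to IEEE Transactions on Aerospace and Electronic Systems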
Abstract:This paper presents a computationally efficient multi-object tracking approach that can minimise track breaks (e.g., in challenging environments and against agile targets), learn the measurement model parameters on-line (e.g., in dynamically changing scenes) and infer the class of the tracked objects, if joint tracking and kinematic behaviour classification is sought. It capitalises on the flexibilities offered by the integrated Gaussian process as a motion model and the convenient statistical properties of non-homogeneous Poisson processes as a suitable observation model. This can be combined with the proposed effective track revival / stitching mechanism. We accordingly introduce the two robust and adaptive trackers, Gaussian and Poisson Process with Classification (GaPP-Class) and GaPP with Revival and Classification (GaPP-ReaCtion). They employ an appropriate particle filtering inference scheme that efficiently integrates track management and hyperparameter learning (including the object class, if relevant). GaPP-ReaCtion extends GaPP-Class with the addition of a Markov Chain Monte Carlo kernel applied to each particle permitting track revival and stitching (e.g., within a few time steps after deleting a trajectory). Performance evaluation and benchmarking using synthetic and real data show that GaPP-Class and GaPP-ReaCtion outperform other state-of-the-art tracking algorithms. For example, GaPP-ReaCtion significantly reduces track breaks (e.g., by around 30% from real radar data and markedly more from simulated data).
[CV-136] VICI: VLM-Instructed Cross-view Image-localisation
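[Quick read]: This paper addresses geo-localisation by matching limited field-of-view (FOV) street-level images to the corresponding satellite imagery, particularly under the unknown camera parameters common in practice. The key to its solution is a two-stage approach: it first retrieves candidate satellite image embeddings, then applies a re-ranking stage that selectively improves retrieval accuracy among the top candidates, enabling more precise matching under significant viewpoint and scale variations.
Link: https://arxiv.org/abs/2507.04107
Authors: Xiaohan Zhang, Tavis Shore, Chen Chen, Oscar Mendez, Simon Hadfield, Safwan Wshah
Affiliations: University of Vermont; University of Surrey; University of Central Florida; Locus Robotics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: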
Abstract: In this paper, we present a high-performing solution to the UAVM 2025 Challenge, which focuses on matching narrow FOV street-level images to corresponding satellite imagery using the University-1652 dataset. As panoramic Cross-View Geo-Localisation nears peak performance, it becomes increasingly important to explore more practical problem formulations. Real-world scenarios rarely offer panoramic street-level queries; instead, queries typically consist of limited-FOV images captured with unknown camera parameters. Our work prioritises discovering the highest achievable performance under these constraints, pushing the limits of existing architectures. Our method begins by retrieving candidate satellite image embeddings for a given query, followed by a re-ranking stage that selectively enhances retrieval accuracy within the top candidates. This two-stage approach enables more precise matching, even under the significant viewpoint and scale variations inherent in the task. Through experimentation, we demonstrate that our approach achieves competitive results, specifically attaining R@1 and R@10 retrieval rates of \topone% and \topten% respectively. This underscores the potential of optimised retrieval and re-ranking strategies in advancing practical geo-localisation performance. Code is available at this https URL.
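The retrieve-then-re-rank pattern can be sketched with cosine similarity for the first stage and a pluggable scoring function for the second. This is a minimal sketch: the embeddings, the gallery names, and the re-ranking scores are made up, and the scoring callable is only a stand-in for whatever model the authors use to re-rank the top candidates.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve(query: Sequence[float], gallery: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Stage 1: names of the k gallery items most similar to the query embedding."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], score: Callable[[str], float]) -> List[str]:
    """Stage 2: reorder only the shortlisted candidates with a costlier scorer."""
    return sorted(candidates, key=score, reverse=True)


# Hypothetical satellite-tile embeddings and a hypothetical second-stage score table.
gallery = {"tile_a": [1.0, 0.0], "tile_b": [0.9, 0.1], "tile_c": [0.0, 1.0]}
shortlist = retrieve([1.0, 0.0], gallery, k=2)
fine_scores = {"tile_a": 0.2, "tile_b": 0.8}
final = rerank(shortlist, lambda name: fine_scores[name])
```

Only the shortlist is passed to the second stage, so the expensive scorer runs on k items rather than the whole gallery.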
[CV-137] Attention-Guided Multi-Scale Local Reconstruction for Point Clouds via Masked Autoencoder Self-Supervised Learning
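[Quick read]: This paper addresses the underuse of low-level local features by existing self-supervised models for point cloud processing, where such features typically serve only activation computations and do not contribute directly to reconstruction. The key to its solution is the PointAMaLR framework, which performs hierarchical reconstruction through attention-guided multi-scale local reconstruction: lower layers focus on fine-scale feature restoration and upper layers handle coarse-scale feature reconstruction, enabling complex inter-patch interactions, and a Local Attention (LA) module in the embedding layer enhances semantic feature understanding.
Link: https://arxiv.org/abs/2507.04084
Authors: Xin Cao, Haoyu Wang, Yuzhu Mao, Xinda Liu, Linzhi Su, Kang Li
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages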
Abstract:Self-supervised learning has emerged as a prominent research direction in point cloud processing. While existing models predominantly concentrate on reconstruction tasks at higher encoder layers, they often neglect the effective utilization of low-level local features, which are typically employed solely for activation computations rather than directly contributing to reconstruction tasks. To overcome this limitation, we introduce PointAMaLR, a novel self-supervised learning framework that enhances feature representation and processing accuracy through attention-guided multi-scale local reconstruction. PointAMaLR implements hierarchical reconstruction across multiple local regions, with lower layers focusing on fine-scale feature restoration while upper layers address coarse-scale feature reconstruction, thereby enabling complex inter-patch interactions. Furthermore, to augment feature representation capabilities, we incorporate a Local Attention (LA) module in the embedding layer to enhance semantic feature understanding. Comprehensive experiments on benchmark datasets ModelNet and ShapeNet demonstrate PointAMaLR’s superior accuracy and quality in both classification and reconstruction tasks. Moreover, when evaluated on the real-world dataset ScanObjectNN and the 3D large scene segmentation dataset S3DIS, our model achieves highly competitive performance metrics. These results not only validate PointAMaLR’s effectiveness in multi-scale semantic understanding but also underscore its practical applicability in real-world scenarios.
[CV-138] Accurate and Efficient World Modeling with Masked Latent Transformers
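[Quick read]: This paper addresses the loss of crucial information caused by the compressed latent space of world models used to simulate environments, which hurts agent performance. The key to its solution is EMERALD (Efficient MaskEd latent tRAnsformer worLD model), an efficient and accurate world model that uses a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve agent performance.
Link: https://arxiv.org/abs/2507.04075
Authors: Maxime Burchi, Radu Timofte
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: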
Abstract: The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model’s latent space can result in the loss of crucial information, negatively affecting the agent’s performance. Recent approaches, such as Δ-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve the agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human expert performance within 10M environment steps. Our method also succeeds in unlocking all 22 Crafter achievements at least once during evaluation.
[CV-139] Stochastic Human Motion Prediction with Memory of Action Transition and Action Characteristic CVPR2025
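[Quick read]: This paper addresses two main challenges in action-driven stochastic human motion prediction: generating smooth transition motions is hard because transition speeds vary across actions, and action characteristics are hard to learn because some actions are similar, which makes predictions unreasonable and inconsistent. The key to its solution is two memory banks, the Soft-transition Action Bank (STAB) and the Action Characteristic Bank (ACB), which store action transition information and action characteristics respectively, plus an Adaptive Attention Adjustment (AAA) strategy that better fuses the features retrieved from the two banks.
Link: https://arxiv.org/abs/2507.04062
Authors: Jianwei Tang, Hong Yang, Tengyue Chen, Jian-Fang Hu
Affiliations: Sun Yat-sen University; Guangdong Province Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by CVPR2025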
Abstract: Action-driven stochastic human motion prediction aims to generate future motion sequences of a pre-defined target action based on given past observed sequences performing non-target actions. This task primarily presents two challenges. Firstly, generating smooth transition motions is hard due to the varying transition speeds of different actions. Secondly, the action characteristic is difficult to learn because of the similarity of some actions. These issues cause the predicted results to be unreasonable and inconsistent. To tackle these problems, we propose two memory banks, the Soft-transition Action Bank (STAB) and the Action Characteristic Bank (ACB). The STAB stores the action transition information. It is equipped with a novel soft searching approach, which encourages the model to focus on multiple possible action categories of observed motions. The ACB records action characteristics, which provide more prior information for predicting certain actions. To better fuse the features retrieved from the two banks, we further propose the Adaptive Attention Adjustment (AAA) strategy. Extensive experiments on four motion prediction datasets demonstrate that our approach consistently outperforms the previous state-of-the-art. The demo and code are available at this https URL.
[CV-140] Consistent and Invariant Generalization Learning for Short-video Misinformation Detection ACM-MM2025
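[Quick Read]: This paper aims to address the poor generalization of short-video misinformation detection models caused by domain gaps: models trained on particular domains (source domains) perform poorly on unseen domains (target domains). The key to the solution is DOCTOR, a domain generalization model built on consistency and invariance learning that makes the model more robust on multi-modal data. It introduces cross-modal feature interpolation and interpolation distillation to synchronize multi-modal learning, and designs a diffusion model that retains the core multi-modal features and enhances domain-invariant features through cross-modal guided denoising.
Link: https://arxiv.org/abs/2507.04061
Authors: Hanghui Guo, Weijie Shi, Mengze Li, Juncheng Li, Hao Chen, Yue Cui, Jiajie Xu, Jia Zhu, Jiawei Shen, Zhangze Chen, Sirui Han
Affiliations: Zhejiang Normal University; HKUST; Zhejiang University; Tencent; Soochow University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to ACM MM 2025, 15 pages, 16 figures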
Abstract: Short-video misinformation detection has attracted wide attention in the multi-modal domain, aiming to accurately identify misinformation in videos accompanied by the corresponding audio. Despite significant advancements, current models in this field, trained on particular domains (source domains), often exhibit unsatisfactory performance on unseen domains (target domains) due to domain gaps. To effectively realize such domain generalization on the short-video misinformation detection task, we provide deep insights into the characteristics of different domains: (1) Detection on various domains may mainly rely on different modalities (i.e., mainly focusing on videos or audios). To enhance domain generalization, it is crucial to achieve optimal model performance on all modalities simultaneously. (2) For some domains focusing on cross-modal joint fraud, a comprehensive analysis relying on cross-modal fusion is necessary. However, domain biases located in each modality (especially in each frame of videos) accumulate in this fusion process, which may seriously damage the final identification of misinformation. To address these issues, we propose a new DOmain generalization model via ConsisTency and invariance learning for shORt-video misinformation detection (named DOCTOR), which contains two characteristic modules: (1) We use cross-modal feature interpolation to map multiple modalities into a shared space and interpolation distillation to synchronize multi-modal learning; (2) We design a diffusion model that adds noise to retain the core features of multiple modalities and enhances domain-invariant features through cross-modal guided denoising. Extensive experiments demonstrate the effectiveness of our proposed DOCTOR model. Our code is publicly available at this https URL.
[CV-141] Temporal Continual Learning with Prior Compensation for Human Motion Prediction
[Quick Read]: This paper aims to address two main problems in Human Motion Prediction (HMP) that arise from treating predictions at different moments identically: learning short-term predictions is hindered by the focus on long-term predictions, and prior information from past predictions is hard to incorporate into subsequent predictions. The key to the solution is a multi-stage training framework called Temporal Continual Learning (TCL), together with a Prior Compensation Factor (PCF) that better preserves prior information and yields a more reasonable training objective.
Link: https://arxiv.org/abs/2507.04060
Authors: Jianwei Tang, Jiangxin Sun, Xiaotong Lin, Lifang Zhang, Wei-Shi Zheng, Jian-Fang Hu
Affiliations: Sun Yat-sen University; Dongguan University of Technology; Guangdong Province Key Laboratory of Information Security Technology; Key Laboratory of Machine Intelligence and Advanced Computing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Advances in Neural Information Processing Systems 2023
Abstract:Human Motion Prediction (HMP) aims to predict future poses at different moments according to past motion sequences. Previous approaches have treated the prediction of various moments equally, resulting in two main limitations: the learning of short-term predictions is hindered by the focus on long-term predictions, and the incorporation of prior information from past predictions into subsequent predictions is limited. In this paper, we introduce a novel multi-stage training framework called Temporal Continual Learning (TCL) to address the above challenges. To better preserve prior information, we introduce the Prior Compensation Factor (PCF). We incorporate it into the model training to compensate for the lost prior information. Furthermore, we derive a more reasonable optimization objective through theoretical derivation. It is important to note that our TCL framework can be easily integrated with different HMP backbone models and adapted to various datasets and applications. Extensive experiments on four HMP benchmark datasets demonstrate the effectiveness and flexibility of TCL. The code is available at this https URL.
[CV-142] Attributing Data for Sharpness-Aware Minimization
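[Quick Read]: This paper aims to address the challenges that noisy data and privacy concerns pose when training large-scale models with Sharpness-aware Minimization (SAM), and the question of how to evaluate the influence of data on model parameters effectively. The key to the solution is two data valuation methods designed for SAM's bilevel optimization structure, the Hessian-based influence function (Hessian-based IF) and the Gradient Trajectory-based influence function (Gradient Trajectory-based IF), which enable more accurate and efficient data evaluation and parameter tuning.
Link: https://arxiv.org/abs/2507.04059
Authors: Chenyang Ren, Yifan Jia, Huanyi Xie, Zhaobin Xu, Tianxing Wei, Liangyu Wang, Lijie Hu, Di Wang
Affiliations: King Abdullah University of Science and Technology; Shandong University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 25 pages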
Abstract:Sharpness-aware Minimization (SAM) improves generalization in large-scale model training by linking loss landscape geometry to generalization. However, challenges such as mislabeled noisy data and privacy concerns have emerged as significant issues. Data attribution, which identifies the contributions of specific training samples, offers a promising solution. However, directly rendering existing data influence evaluation tools such as influence functions (IF) to SAM will be inapplicable or inaccurate as SAM utilizes an inner loop to find model perturbations that maximize loss, which the outer loop then minimizes, resulting in a doubled computational structure. Additionally, this bilevel structure complicates the modeling of data influence on the parameters. In this paper, based on the IF, we develop two innovative data valuation methods for SAM, each offering unique benefits in different scenarios: the Hessian-based IF and the Gradient Trajectory-based IF. The first one provides a comprehensive estimation of data influence using a closed-form measure that relies only on the trained model weights. In contrast, the other IF for SAM utilizes gradient trajectory information during training for more accurate and efficient data assessment. Extensive experiments demonstrate their effectiveness in data evaluation and parameter tuning, with applications in identifying mislabeled data, model editing, and enhancing interpretability.
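The inner-maximization and outer-minimization structure that the abstract describes can be illustrated with a minimal sketch of one SAM update on a toy quadratic loss. This is not the authors' attribution method; the loss, the function names `grad` and `sam_step`, and the values of `lr` and `rho` are assumptions chosen only for illustration, and the inner step uses the usual first-order approximation of the worst-case perturbation.

```python
import math

def loss(w):
    # Toy loss (assumed for illustration): L(w) = 0.5 * (w0^2 + 10 * w1^2).
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)

def grad(w):
    # Analytic gradient of the toy loss above.
    return [w[0], 10.0 * w[1]]

def sam_step(w, lr=0.05, rho=0.05):
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against a zero gradient
    # Inner step: first-order estimate of the loss-maximizing point
    # inside an L2 ball of radius rho around w.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Outer step: descend using the gradient taken at the perturbed point.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w)
```

Because every parameter update depends on a gradient taken at a perturbed point, the effect of a training sample enters twice per step, which is one way to read the abstract's remark that standard influence functions do not carry over directly.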
[CV-143] Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery ICCV2025
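[Quick Read]: This paper addresses On-the-fly Category Discovery (OCD) for online streaming data: identifying newly arriving data that may belong to known or unknown categories, using only labeled data. Existing methods are limited in how they transfer knowledge from known categories, especially in fine-grained recognition, where few annotated samples or categories provide insufficient transferable knowledge. The key to the solution is DiffGRE, a diffusion-based OCD framework that integrates generation, refinement, and encoding in multiple stages: it first synthesizes attribute-composition samples via cross-image interpolation in the diffusion latent space; it then uses diversity-driven refinement to select synthesized images that differ from known categories; finally, it uses semi-supervised leader encoding to inject the additional category knowledge in the synthesized data into the OCD model, improving the discovery of both known and unknown categories during on-the-fly inference.
Link: https://arxiv.org/abs/2507.04051
Authors: Xiao Liu, Nan Pu, Haiyang Zheng, Wenjing Li, Nicu Sebe, Zhun Zhong
Affiliations: Hefei University of Technology; University of Trento; Helmholtz AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025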
Abstract:In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six fine-grained datasets.
[CV-144] Breaking Imitation Bottlenecks: Reinforced Diffusion Powers Diverse Trajectory Generation
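[Quick Read]: This paper addresses the conservative and homogeneous behaviors that result when end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, which limits generalization in complex real-world scenarios. The key to the solution is DIVER, a framework that combines reinforcement learning with diffusion-based generation. It conditions on map elements and surrounding agents to generate multiple reference trajectories, alleviating the limitations of single demonstrations, and uses reward-based supervision to guide the diffusion process, enforcing safety and diversity constraints on the generated trajectories and improving their practicality and generalization.
Link: https://arxiv.org/abs/2507.04049
Authors: Ziying Song, Lin Liu, Hongyu Pan, Bencheng Liao, Mingzhe Guo, Lei Yang, Yongchang Zhang, Shaoqing Xu, Caiyan Jia, Yadan Luo
Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; Horizon Robotics; Nanyang Technological University, Singapore; University of Macau, China; School of Information Technology and Electrical Engineering, The University of Queensland, Australia
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 16 pages, 6 figures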
Abstract: Most end-to-end autonomous driving methods rely on imitation learning from single expert demonstrations, often leading to conservative and homogeneous behaviors that limit generalization in complex real-world scenarios. In this work, we propose DIVER, an end-to-end driving framework that integrates reinforcement learning with diffusion-based generation to produce diverse and feasible trajectories. At the core of DIVER lies a reinforced diffusion-based generation mechanism. First, the model conditions on map elements and surrounding agents to generate multiple reference trajectories from a single ground-truth trajectory, alleviating the limitations of imitation learning that arise from relying solely on single expert demonstrations. Second, reinforcement learning is employed to guide the diffusion process, where reward-based supervision enforces safety and diversity constraints on the generated trajectories, thereby enhancing their practicality and generalization capability. Furthermore, to address the limitations of L2-based open-loop metrics in capturing trajectory diversity, we propose a novel Diversity metric to evaluate multi-mode diversity. Experiments on the closed-loop NAVSIM and Bench2Drive benchmarks, as well as the open-loop nuScenes dataset, demonstrate that DIVER significantly improves trajectory diversity, effectively addressing the mode collapse problem inherent in imitation learning.
[CV-145] Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
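[Quick Read]: This paper aims to address the lack of active perception and environment exploration in embodied scene understanding: existing 3D Vision-Language (3D-VL) models focus on grounding objects in static observations and cannot actively perceive their environment. The key to the solution is MTU3D, a unified framework built on three core innovations: 1) online query-based representation learning, which builds spatial memory directly from RGB-D frames; 2) a unified objective for grounding and exploration, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection; 3) end-to-end trajectory learning, which combines Vision-Language-Exploration pre-training over a million diverse trajectories.
Link: https://arxiv.org/abs/2507.04047
Authors: Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, Siyuan Huang, Qing Li
Affiliations: Tsinghua University; Beijing Institute of Technology; Beihang University; State Key Laboratory of General Artificial Intelligence, BIGAI, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Embodied AI; 3D Vision Language Understanding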
Abstract: Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce Move to Understand (MTU3D), a unified framework that integrates active perception with 3D vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploring, which represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines Vision-Language-Exploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 23%, 9%, and 2% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation using diverse input modalities, including categories, language descriptions, and reference images. These findings highlight the importance of bridging visual grounding and exploration for embodied intelligence.
[CV-146] T-SYNTH: A Knowledge-Based Dataset of Synthetic Breast Images
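[Quick Read]: This paper addresses the obstacle that a lack of large-scale, suitably annotated datasets poses to developing and evaluating medical imaging algorithms. The key to the solution is using physics simulations to generate synthetic images with pixel-level segmentation annotations, which are usually hard to obtain. The authors apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT.
Link: https://arxiv.org/abs/2507.04038
Authors: Christopher Wiedeman, Anastasiia Sarmakeeva, Elena Sizikova, Daniil Filienko, Miguel Lago, Jana G. Delfino, Aldo Badano
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: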
Abstract:One of the key impediments for developing and assessing robust medical imaging algorithms is limited access to large-scale datasets with suitable annotations. Synthetic data generated with plausible physical and biological constraints may address some of these data limitations. We propose the use of physics simulations to generate synthetic images with pixel-level segmentation annotations, which are notoriously difficult to obtain. Specifically, we apply this approach to breast imaging analysis and release T-SYNTH, a large-scale open-source dataset of paired 2D digital mammography (DM) and 3D digital breast tomosynthesis (DBT) images. Our initial experimental results indicate that T-SYNTH images show promise for augmenting limited real patient datasets for detection tasks in DM and DBT. Our data and code are publicly available at this https URL.
[CV-147] PresentAgent: Multimodal Agent for Presentation Video Generation
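[Quick Read]: This paper addresses the problem of transforming static textual materials into dynamic, effective, and accessible presentation formats. Existing methods are limited to generating static slides or text summaries; the proposed solution, PresentAgent, goes beyond these limits by generating synchronized visual and spoken content. Its key is a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and text-to-speech models, and composes the final video with precise audio-visual alignment.
Link: https://arxiv.org/abs/2507.04036
Authors: Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao
Affiliations: AI Geeks; Australian Artificial Intelligence Institute; University of Liverpool; La Trobe University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: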
Abstract:We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. While existing approaches are limited to generating static slides or text summaries, our method advances beyond these limitations by producing fully synchronized visual and spoken content that closely mimics human-style presentations. To achieve this integration, PresentAgent employs a modular pipeline that systematically segments the input document, plans and renders slide-style visual frames, generates contextual spoken narration with large language models and Text-to-Speech models, and seamlessly composes the final video with precise audio-visual alignment. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified assessment framework powered by Vision-Language Models that comprehensively scores videos across three critical dimensions: content fidelity, visual clarity, and audience comprehension through prompt-based evaluation. Our experimental validation on a curated dataset of 30 document-presentation pairs demonstrates that PresentAgent approaches human-level quality across all evaluation metrics. These results highlight the significant potential of controllable multimodal agents in transforming static textual materials into dynamic, effective, and accessible presentation formats. Code will be available at this https URL.
[CV-148] Exploring Kolmogorov-Arnold Network Expansions in Vision Transformers for Mitigating Catastrophic Forgetting in Continual Learning
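[Quick Read]: This paper addresses the tendency of models in Continual Learning (CL) to forget previous knowledge when learning new tasks; catastrophic forgetting is especially pronounced in vision transformers (ViTs) that use Multilayer Perceptrons (MLPs) for global representation learning. The key to the solution is replacing the MLPs in ViTs with Kolmogorov-Arnold Networks (KANs), whose spline-based activations provide local plasticity, so that each sample update adjusts only a subset of parameters and previously learned knowledge is preserved.
Link: https://arxiv.org/abs/2507.04020
Authors: Zahid Ullah, Jihie Kim
Affiliations: Dongguk University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: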
Abstract:Continual learning (CL), the ability of a model to learn new tasks without forgetting previously acquired knowledge, remains a critical challenge in artificial intelligence, particularly for vision transformers (ViTs) utilizing Multilayer Perceptrons (MLPs) for global representation learning. Catastrophic forgetting, where new information overwrites prior knowledge, is especially problematic in these models. This research proposes replacing MLPs in ViTs with Kolmogorov-Arnold Network (KANs) to address this issue. KANs leverage local plasticity through spline-based activations, ensuring that only a subset of parameters is updated per sample, thereby preserving previously learned knowledge. The study investigates the efficacy of KAN-based ViTs in CL scenarios across benchmark datasets (MNIST, CIFAR100), focusing on their ability to retain accuracy on earlier tasks while adapting to new ones. Experimental results demonstrate that KAN-based ViTs significantly mitigate catastrophic forgetting, outperforming traditional MLP-based ViTs in knowledge retention and task adaptation. This novel integration of KANs into ViTs represents a promising step toward more robust and adaptable models for dynamic environments.
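The claim that spline-based activations update only a subset of parameters per sample can be made concrete with a small sketch. It assumes a single learnable activation expressed in an order-1 B-spline (hat function) basis on a fixed knot grid; the paper's actual spline order, grid, and training setup are not stated in the abstract, and the names `hat_basis` and `sgd_update` are hypothetical.

```python
def hat_basis(x, knots):
    """Values of the order-1 B-spline (hat) basis functions at x, one per knot."""
    values = []
    for i, k in enumerate(knots):
        left = knots[i - 1] if i > 0 else None
        right = knots[i + 1] if i < len(knots) - 1 else None
        if left is not None and left <= x <= k:
            values.append((x - left) / (k - left))    # rising edge of the hat
        elif right is not None and k <= x <= right:
            values.append((right - x) / (right - k))  # falling edge of the hat
        else:
            values.append(0.0)                        # x lies outside this hat's support
    return values

def activation(x, coeffs, knots):
    # Learnable activation: phi(x) = sum_i coeffs[i] * B_i(x).
    return sum(c * b for c, b in zip(coeffs, hat_basis(x, knots)))

def sgd_update(x, y, coeffs, knots, lr=0.5):
    # One squared-error SGD step; d phi / d coeffs[i] = B_i(x), which is zero
    # for every basis function whose support does not contain x.
    basis = hat_basis(x, knots)
    err = activation(x, coeffs, knots) - y
    return [c - lr * err * b for c, b in zip(coeffs, basis)]

knots = [0.0, 1.0, 2.0, 3.0, 4.0]
coeffs = [0.0] * len(knots)
updated = sgd_update(0.25, 1.0, coeffs, knots)
```

In this sketch a sample at `x = 0.25` changes only the two coefficients whose hats cover that point, so whatever was learned for inputs near 3 or 4 is left untouched. An MLP weight, by contrast, generally receives a nonzero gradient from every sample.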
[CV-149] Habitat Classification from Ground-Level Imagery Using Deep Neural Networks
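[Quick Read]: This paper addresses habitat assessment at local scales, aiming to make biodiversity conservation more efficient and accurate through automation. Traditional methods rely on costly expert field surveys, while existing remote-sensing-based AI tools are constrained by sensor availability, weather conditions, and resolution. The key to the solution is using ground-level imagery and applying state-of-the-art deep neural network architectures (convolutional neural networks and vision transformers) to fine-grained habitat classification. The results show that vision transformers outperform state-of-the-art CNN baselines on key classification metrics, and that supervised contrastive learning significantly reduces misclassification among visually similar habitats, reaching classification performance on par with ecological experts.
Link: https://arxiv.org/abs/2507.04017
Authors: Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 12 figures, 6 tables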
Abstract:Habitat assessment at local scales – critical for enhancing biodiversity and guiding conservation priorities – often relies on expert field survey that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models – convolutional neural networks (CNNs) and vision transformers (ViTs) – under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at the national scale.
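The two metrics quoted in the abstract, Top-3 accuracy and MCC, can be computed as in the following sketch. It is a plain-Python illustration rather than the authors' evaluation code; the function names and the toy inputs are assumptions, and the MCC uses the standard multiclass generalization based on per-class true and predicted counts.

```python
import math
from collections import Counter

def top_k_accuracy(scores, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)

def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from label counts."""
    s = len(y_true)                                     # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correctly classified samples
    t_count, p_count = Counter(y_true), Counter(y_pred)
    classes = set(t_count) | set(p_count)
    sum_pt = sum(p_count[k] * t_count[k] for k in classes)
    sum_pp = sum(v * v for v in p_count.values())
    sum_tt = sum(v * v for v in t_count.values())
    denom = math.sqrt((s * s - sum_pp) * (s * s - sum_tt))
    return (c * s - sum_pt) / denom if denom else 0.0

# Toy example with 4 classes and 2 samples (values are made up).
scores = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.05, 0.03, 0.02]]
labels = [0, 0]
acc = top_k_accuracy(scores, labels, k=3)
```

MCC is a useful companion to accuracy when classes are imbalanced, because it penalizes a classifier that succeeds only on the dominant classes.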
[CV-150] Group-wise Scaling and Orthogonal Decomposition for Domain-Invariant Feature Extraction in Face Anti-Spoofing ICCV2025
[Quick Read]: This paper addresses a problem in Domain Generalizable Face Anti-Spoofing (DGFAS) methods: misaligned bias terms of the decision boundaries lead to inconsistent classification thresholds and degraded performance on unseen target domains. The key to the solution is a new DGFAS framework that jointly aligns weights and bias terms through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM), achieving effective feature alignment and stable generalization across domains.
Link: https://arxiv.org/abs/2507.04006
Authors: Seungjin Jung, Kanghee Lee, Yonghyun Jeong, Haeun Noh, Jungmin Lee, Jongwon Choi
Affiliations: Chung-Ang University; Naver Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at ICCV 2025. Code will be available at this https URL
Abstract:Domain Generalizable Face Anti-Spoofing (DGFAS) methods effectively capture domain-invariant features by aligning the directions (weights) of local decision boundaries across domains. However, the bias terms associated with these boundaries remain misaligned, leading to inconsistent classification thresholds and degraded performance on unseen target domains. To address this issue, we propose a novel DGFAS framework that jointly aligns weights and biases through Feature Orthogonal Decomposition (FOD) and Group-wise Scaling Risk Minimization (GS-RM). Specifically, GS-RM facilitates bias alignment by balancing group-wise losses across multiple domains. FOD employs the Gram-Schmidt orthogonalization process to decompose the feature space explicitly into domain-invariant and domain-specific subspaces. By enforcing orthogonality between domain-specific and domain-invariant features during training using domain labels, FOD ensures effective weight alignment across domains without negatively impacting bias alignment. Additionally, we introduce Expected Calibration Error (ECE) as a novel evaluation metric for quantitatively assessing the effectiveness of our method in aligning bias terms across domains. Extensive experiments on benchmark datasets demonstrate that our approach achieves state-of-the-art performance, consistently improving accuracy, reducing bias misalignment, and enhancing generalization stability on unseen target domains.
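The Gram-Schmidt step mentioned in the abstract can be sketched as follows. This is only the projection arithmetic, not the FOD training procedure: the "domain-specific directions" are taken as given inputs, whereas in the paper they would be learned with domain labels, and the names `gram_schmidt` and `split_feature` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, eps=1e-12):
    """Orthonormalize vectors (modified Gram-Schmidt), dropping near-dependent ones."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)  # component of the running remainder along b
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis

def split_feature(feature, specific_dirs):
    """Split a feature into a part inside span(specific_dirs) and its orthogonal remainder."""
    basis = gram_schmidt(specific_dirs)
    specific = [0.0] * len(feature)
    for b in basis:
        coeff = dot(feature, b)
        specific = [s + coeff * bi for s, bi in zip(specific, b)]
    invariant = [f - s for f, s in zip(feature, specific)]
    return invariant, specific

# Toy 3-D feature with two assumed domain-specific directions.
invariant, specific = split_feature([3.0, 4.0, 5.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

By construction the two returned parts are orthogonal and sum back to the original feature, which is the property that an orthogonality constraint between domain-specific and domain-invariant features relies on.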
[CV-151] NRSeg: Noise-Resilient Learning for BEV Semantic Segmentation via Driving World Models
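[Quick Read]: This paper aims to address the underperformance of unsupervised and semi-supervised Birds' Eye View (BEV) semantic segmentation caused by the homogeneous distribution of labeled data. The key solution is the NRSeg framework, which introduces a Perspective-Geometry Consistency Metric (PGCM) to evaluate how well generated data can guide model learning, adopts Bi-Distribution Parallel Prediction (BiDPP) to strengthen model robustness, and designs a Hierarchical Local Semantic Exclusion (HLSE) module to handle the non-mutual exclusivity in BEV semantic segmentation.
Link: https://arxiv.org/abs/2507.04002
Authors: Siyu Li, Fei Teng, Yihong Cao, Kailun Yang, Zhiyong Li, Yaonan Wang
Affiliations: Hunan University; Hunan Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: The source code will be made publicly available at this https URL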
Abstract:Birds’ Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, as pivotal for real-world applications, underperform due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate the guidance capability of generated data for model learning. This metric originates from the alignment measure between the perspective road mask of generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at this https URL.
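The Dirichlet branch that the abstract attributes to evidential deep learning can be illustrated with the commonly used evidential formulation, in which non-negative evidence defines Dirichlet parameters and the uncertainty mass is the number of classes divided by the Dirichlet strength. The sketch below assumes that formulation and a softplus evidence function; NRSeg's actual heads and losses are not specified in the abstract, and the function names are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def dirichlet_prediction(logits):
    """Map per-class logits to expected probabilities and an uncertainty mass."""
    evidence = [softplus(z) for z in logits]  # non-negative evidence per class
    alpha = [e + 1.0 for e in evidence]       # Dirichlet concentration parameters
    strength = sum(alpha)                     # total Dirichlet strength S
    probs = [a / strength for a in alpha]     # expected class probabilities
    uncertainty = len(alpha) / strength       # u = K / S, in (0, 1]
    return probs, uncertainty

confident_probs, confident_u = dirichlet_prediction([10.0, -10.0, -10.0])
vague_probs, vague_u = dirichlet_prediction([-10.0, -10.0, -10.0])
```

When a pixel has little evidence for any class, the uncertainty stays close to 1, which offers a natural signal for down-weighting noisy synthetic samples. A softmax (multinomial) head would still return a normalized distribution in that case and carries no such signal on its own.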
[CV-152] LEHA-CVQAD: Dataset To Enable Generalized Video Quality Assessment of Compression Artifacts
[Quick Read]: This paper aims to address challenges in Compression-Oriented Video Quality Assessment (CVQA), in particular evaluating how well video quality assessment (VQA) models preserve bitrate-quality ordering. The key solution is the LEHA-CVQAD dataset, which comprises 6,240 video clips and fuses 1.8M pairwise comparisons and 1.5k MOS ratings into a single quality scale, together with Rate-Distortion Alignment Error (RDAE), a new evaluation metric that quantifies how well VQA models preserve the bitrate-quality relationship and thereby supports codec parameter tuning.
Link: https://arxiv.org/abs/2507.03990
Authors: Aleksandr Gushchin, Maksim Smirnov, Dmitriy Vatolin, Anastasia Antsiferova
Affiliations: Lomonosov Moscow State University; ISP RAS Research Center for Trusted Artificial Intelligence; MSU Institute for Artificial Intelligence; HSE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract: We propose the LEHA-CVQAD (Large-scale Enriched Human-Annotated) dataset, which comprises 6,240 clips for compression-oriented video quality assessment. 59 source videos are encoded with 186 codec-preset variants, and 1.8M pairwise comparisons and 1.5k MOS ratings are fused into a single quality scale. Part of the videos remains hidden for blind evaluation. We also propose Rate-Distortion Alignment Error (RDAE), a novel evaluation metric that quantifies how well VQA models preserve bitrate-quality ordering, directly supporting codec parameter tuning. Testing IQA/VQA methods reveals that popular VQA metrics exhibit high RDAE and lower correlations, underscoring the dataset's challenges and utility. The open part and the results of LEHA-CVQAD are available at this https URL .io/lcvqad/
[CV-153] CoT-Segmenter: Enhancing OOD Detection in Dense Road Scenes via Chain-of-Thought Reasoning
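[Quick Read]: This paper aims to address Out-of-Distribution (OOD) detection for semantic segmentation models in complex road environments, especially in challenging scenarios such as densely packed and overlapping objects, distant small objects, and large foreground-dominant objects, where current state-of-the-art OOD segmentation methods perform poorly. The key to the solution is a Chain-of-Thought (CoT) based framework that leverages the broad knowledge and reasoning capabilities of foundation models (such as GPT-4), improving OOD detection through better image understanding and prompt-based reasoning aligned with the attributes of the problematic scenes.
Link: https://arxiv.org/abs/2507.03984
Authors: Jeonghyo Song, Kimin Yun, DaeUng Jo, Jinyoung Kim, Youngjoon Yoo
Affiliations: Chung-Ang University; ETRI; University of Science and Technology; Kyungpook National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 3 figures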
Abstract: Effective Out-of-Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.
[CV-154] Flux-Sculptor: Text-Driven Rich-Attribute Portrait Editing through Decomposed Spatial Flow Control
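[Quick Read]: This paper addresses the difficulty of balancing reconstruction fidelity and editing flexibility in text-driven portrait editing. The key to the solution is the Flux-Sculptor framework, which includes a Prompt-Aligned Spatial Locator (PASL) to precisely locate editing regions and a Structure-to-Detail Edit Control (S2D-EC) strategy that spatially guides the denoising process through sequential mask-guided fusion of latent representations and attention values.
Link: https://arxiv.org/abs/2507.03979
Authors: Tianyao He, Runqi Wang, Yang Chen, Dejia Song, Nemo Chen, Xu Tang, Yao Hu
Affiliations: Xiaohongshu; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 17 figures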
Abstract:Text-driven portrait editing holds significant potential for various applications but also presents considerable challenges. An ideal text-driven portrait editing approach should achieve precise localization and appropriate content modification, yet existing methods struggle to balance reconstruction fidelity and editing flexibility. To address this issue, we propose Flux-Sculptor, a flux-based framework designed for precise text-driven portrait editing. Our framework introduces a Prompt-Aligned Spatial Locator (PASL) to accurately identify relevant editing regions and a Structure-to-Detail Edit Control (S2D-EC) strategy to spatially guide the denoising process through sequential mask-guided fusion of latent representations and attention values. Extensive experiments demonstrate that Flux-Sculptor surpasses existing methods in rich-attribute editing and facial information preservation, making it a strong candidate for practical portrait editing applications. Project page is available at this https URL.
[CV-155] Robust Low-light Scene Restoration via Illumination Transition
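Quick read: This paper tackles the synthesis of normal-light novel views from low-light multiview images, a task made difficult by the low visibility and high ISO noise of the inputs. Existing low-light enhancement methods preprocess such inputs poorly because they ignore correlations among views, while other state-of-the-art methods that add illumination-related components often introduce color distortions and artifacts and denoise only to a limited degree. The paper proposes RoSe, a robust low-light scene restoration framework whose key idea is to formulate the task as illuminance transition estimation in 3D space, treated as a specialized rendering task, which establishes a robust link between low-light and normal-light conditions. Constraining the transition representation with the inherent low-rank property of illumination yields more effective denoising without complex 2D techniques or explicit noise modeling.
Link: https://arxiv.org/abs/2507.03976
Authors: Ze Li,Feng Zhang,Xiatian Zhu,Meng Zhang,Yanghong Zhou,P. Y. Mok
Affiliations: The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University; Nanjing University of Posts and Telecommunications; University of Surrey
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures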
Abstract:Synthesizing normal-light novel views from low-light multiview images is an important yet challenging task, given the low visibility and high ISO noise present in the input images. Existing low-light enhancement methods often struggle to effectively preprocess such low-light inputs, as they fail to consider correlations among multiple views. Although other state-of-the-art methods have introduced illumination-related components offering alternative solutions to the problem, they often result in drawbacks such as color distortions and artifacts, and they provide limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework (RoSe), which enables effective synthesis of novel views in normal lighting conditions from low-light multiview image inputs, by formulating the task as an illuminance transition estimation problem in 3D space, conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To implement RoSe, we design a concise dual-branch architecture and introduce a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard benchmarks. The codes and data are available at this https URL.
[CV-156] Evaluating Adversarial Protections for Diffusion Personalization: A Comprehensive Study ICML
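Quick read: This paper addresses the privacy leakage and content misuse risks raised by diffusion models in image generation and personalization. The key to its approach is a comparison of eight perturbation-based protection methods (AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC) under varying perturbation budgets, measuring visual imperceptibility and protective efficacy to give practical guidance on method selection.
Link: https://arxiv.org/abs/2507.03953
Authors: Kai Ye,Tianyi Chen,Zhen Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to the 2nd Workshop on Reliable and Responsible Foundation Models (R2-FM 2025) at ICML. 8 pages, 3 figures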
Abstract:With the increasing adoption of diffusion models for image generation and personalization, concerns regarding privacy breaches and content misuse have become more pressing. In this study, we conduct a comprehensive comparison of eight perturbation-based protection methods (AdvDM, ASPL, FSGM, MetaCloak, Mist, PhotoGuard, SDS, and SimAC) across both portrait and artwork domains. These methods are evaluated under varying perturbation budgets, using a range of metrics to assess visual imperceptibility and protective efficacy. Our results offer practical guidance for method selection. Code is available at: this https URL.
[CV-157] More than One Step at a Time: Designing Procedural Feedback for Non-visual Makeup Routines
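Quick read: This paper addresses the lack of assistive technology for people with vision impairments who use makeup, in particular the procedural complexity of makeup routines: coordinating step sequences, managing product placement, and assessing the final look. The key to its approach is a contextual inquiry with visually impaired makeup users plus interviews with professional makeup artists, which together identify the feedback needs of non-visual makeup and lead to design principles that emphasize hands-free, conversational interaction and context-aware procedural support for expressive and independent beauty practices.
Link: https://arxiv.org/abs/2507.03942
Authors: Franklin Mingzhe Li,Akihiko Oharazawa,Chloe Qingyu Zhu,Misty Fan,Daisuke Sato,Chieko Asakawa,Patrick Carrington
Affiliations: Carnegie Mellon University; IBM Research, IBM
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: ASSETS 2025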
Abstract:Makeup plays a vital role in self-expression, identity, and confidence - yet remains an underexplored domain for assistive technology, especially for people with vision impairments. While existing tools support isolated tasks such as color identification or product labeling, they rarely address the procedural complexity of makeup routines: coordinating step sequences, managing product placement, and assessing the final look with accessible feedback. To understand the real-world process, we conducted a contextual inquiry with 15 visually impaired makeup users, capturing real-time makeup application behaviors and their step-by-step information needs and assessment approaches. Our findings reveal embodied, tactile-first strategies; persistent challenges in blending, symmetry, and assessment; and a desire for honest, real-time, goal-aligned feedback. We also interviewed five professional makeup artists, who reviewed participant makeup videos and provided expert responses to participant-raised questions and assessment practices. We contribute a taxonomy of feedback needs in non-visual makeup, and outline design implications for future assistive systems - emphasizing hands-free, conversational interaction and context-aware, procedural support for expressive and independent beauty practices.
[CV-158] VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial Supervision
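Quick read: This paper addresses the lack of effective supervision signals for scene flow estimation with mmWave radar, especially in harsh conditions such as smoke, where the usual supervision from dense 3D LiDAR point clouds is expensive and not widely available. The key to its solution is a drift-free rigid transformation estimator that fuses kinematic model-based ego-motion with neural network-learned results, providing strong supervision for the radar rigid transformation and inferring the scene flow of static points. An optical-mmWave supervision extraction module then learns the scene flow of dynamic points under joint constraints from optical and mmWave radar measurements, which strengthens the supervision.
Link: https://arxiv.org/abs/2507.03938
Authors: Kezhong Liu,Yiwen Zhou,Mozi Chen,Jianhua He,Jingao Xu,Zheng Yang,Chris Xiaoxuan Lu,Shengkai Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: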
Abstract:This work proposes a scene flow estimation framework for mmWave radar that is supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of the VI rigid transformation also degrades the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.
[CV-159] Learning Adaptive Node Selection with External Attention for Human Interaction Recognition ACM-MM25
Quick read: This paper addresses two weaknesses of Graph Convolutional Network (GCN) methods for human interaction recognition: they model interacting individuals independently and neglect their inherent inter-dependencies, and methods that rely on predefined interaction adjacency matrices cannot adaptively capture dynamic, context-specific joint interactions. The key to its solution is the Active Node Selection with External Attention Network (ASEA), which uses an Adaptive Temporal Node Amplitude Calculation (AT-NAC) module to identify key nodes dynamically and an External Attention (EA) module to model interactions among those nodes.
Link: https://arxiv.org/abs/2507.03936
Authors: Chen Pang,Xuequan Lu,Qianyu Zhou,Lei Lyu
Affiliations: Shandong Normal University; University of Western Australia; Jilin University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM25
Abstract:Most GCN-based methods model interacting individuals as independent graphs, neglecting their inherent inter-dependencies. Although recent approaches utilize predefined interaction adjacency matrices to integrate participants, these matrices fail to adaptively capture the dynamic and context-specific joint interactions across different actions. In this paper, we propose the Active Node Selection with External Attention Network (ASEA), an innovative approach that dynamically captures interaction relationships without predefined assumptions. Our method models each participant individually using a GCN to capture intra-personal relationships, facilitating a detailed representation of their actions. To identify the most relevant nodes for interaction modeling, we introduce the Adaptive Temporal Node Amplitude Calculation (AT-NAC) module, which estimates global node activity by combining spatial motion magnitude with adaptive temporal weighting, thereby highlighting salient motion patterns while reducing irrelevant or redundant information. A learnable threshold, regularized to prevent extreme variations, is defined to selectively identify the most informative nodes for interaction modeling. To capture interactions, we design the External Attention (EA) module to operate on active nodes, effectively modeling the interaction dynamics and semantic relationships between individuals. Extensive evaluations show that our method captures interaction relationships more effectively and flexibly, achieving state-of-the-art performance.
[CV-160] DNF-Intrinsic: Deterministic Noise-Free Diffusion for Indoor Inverse Rendering
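Quick read: This paper aims to improve the robustness and image quality of generative inverse rendering, which suffer because existing methods predict intrinsic properties from noisy images. The key to its solution is DNF-Intrinsic, which takes the source image rather than Gaussian noise as input and predicts deterministic intrinsic properties directly via flow matching, together with a generative renderer that keeps the predicted intrinsic properties physically faithful to the source image.
Link: https://arxiv.org/abs/2507.03924
Authors: Rongjia Zheng,Qing Zhang,Chengjiang Long,Wei-Shi Zheng
Affiliations: Sun Yat-sen University; Meta Reality Labs; Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: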
Abstract:Recent methods have shown that pre-trained diffusion models can be fine-tuned to enable generative inverse rendering by learning image-conditioned noise-to-intrinsic mapping. Despite their remarkable progress, they struggle to robustly produce high-quality results as the noise-to-intrinsic paradigm essentially utilizes noisy images with deteriorated structure and appearance for intrinsic prediction, while it is common knowledge that structure and appearance information in an image are crucial for inverse rendering. To address this issue, we present DNF-Intrinsic, a robust yet efficient inverse rendering approach fine-tuned from a pre-trained diffusion model, where we propose to take the source image rather than Gaussian noise as input to directly predict deterministic intrinsic properties via flow matching. Moreover, we design a generative renderer to constrain that the predicted intrinsic properties are physically faithful to the source image. Experiments on both synthetic and real-world datasets show that our method clearly outperforms existing state-of-the-art methods.
[CV-161] Learning Disentangled Stain and Structural Representations for Semi-Supervised Histopathology Segmentation
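Quick read: This paper aims to solve gland segmentation in histopathology images, a task that is essential for cancer diagnosis and prognosis. Automated segmentation is difficult because of significant variability in Hematoxylin and Eosin (HE) staining and tissue morphology, combined with limited annotated data. The proposed Color-Structure Dual-Student (CSDS) framework learns disentangled representations of stain appearance and tissue structure with two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, the other on structure-augmented inputs to capture morphological cues. A shared teacher network supervises both students through pseudo-labels, and stain-aware and structure-aware uncertainty estimation modules adaptively modulate each student's contribution during training.
Link: https://arxiv.org/abs/2507.03923
Authors: Ha-Hieu Pham,Nguyen Lan Vi Vu,Thanh-Huy Nguyen,Ulas Bagci,Min Xu,Trung-Nghia Le,Huy-Hieu Pham
Affiliations: University of Science, VNU-HCM, Ho Chi Minh City, Vietnam; Ho Chi Minh University of Technology, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam; Carnegie Mellon University, Pittsburgh, PA, USA; Northwestern University, Chicago, IL, USA; College of Engineering & Computer Science, VinUniversity, Ha Noi City, Vietnam; VinUni-Illinois Smart Health Center, VinUniversity, Ha Noi City, Vietnam
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: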
Abstract:Accurate gland segmentation in histopathology images is essential for cancer diagnosis and prognosis. However, significant variability in Hematoxylin and Eosin (HE) staining and tissue morphology, combined with limited annotated data, poses major challenges for automated segmentation. To address this, we propose Color-Structure Dual-Student (CSDS), a novel semi-supervised segmentation framework designed to learn disentangled representations of stain appearance and tissue structure. CSDS comprises two specialized student networks: one trained on stain-augmented inputs to model chromatic variation, and the other on structure-augmented inputs to capture morphological cues. A shared teacher network, updated via Exponential Moving Average (EMA), supervises both students through pseudo-labels. To further improve label reliability, we introduce stain-aware and structure-aware uncertainty estimation modules that adaptively modulate the contribution of each student during training. Experiments on the GlaS and CRAG datasets show that CSDS achieves state-of-the-art performance in low-label settings, with Dice score improvements of up to 1.2% on GlaS and 0.7% on CRAG at 5% labeled data, and 0.7% and 1.4% at 10%. Our code and pre-trained models are available at this https URL.
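The abstract says the shared teacher is updated via Exponential Moving Average (EMA). As a rough illustration of that update rule only, here is a minimal sketch on plain lists of floats. The function name `ema_update`, the `decay` value, and the use of a single student are assumptions made for illustration; the abstract does not say how the two students are combined in the teacher update.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are flat lists of floats standing in for model
    parameters (a hypothetical simplification of real network weights).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy example: the teacher drifts toward the (fixed) student weights.
teacher = [0.0, 1.0]
student = [1.0, 3.0]
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason EMA teachers give more stable pseudo-labels than the raw student.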
[CV-162] Consistency-Aware Padding for Incomplete Multi-Modal Alignment Clustering Based on Self-Repellent Greedy Anchor Search IJCAI2025
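Quick read: This paper addresses the filling of missing data in multimodal datasets that are both imbalanced and misaligned, a problem that existing research has not handled effectively. Its solution, CAPIMAC, combines a Self-Repellent Greedy Anchor Search Module (SRGASM), which identifies anchor points using a self-repellent random walk together with a greedy algorithm, with a Consistency-Aware Padding Module (CAPM), which uses noise-contrastive learning to interpolate and align the imbalanced and misaligned data, improving the quality of multimodal data fusion.
Link: https://arxiv.org/abs/2507.03917
Authors: Shubin Ma,Liang Zhao,Mingdong Lu,Yifan Guo,Bo Xu
Affiliations: Dalian University of Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IJCAI 2025. 9 pages, 3 figures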
Abstract:Multimodal representation is faithful and highly effective in describing real-world data samples’ characteristics by describing their complementary information. However, the collected data often exhibits incomplete and misaligned characteristics due to factors such as inconsistent sensor frequencies and device malfunctions. Existing research has not effectively addressed the issue of filling missing data in scenarios where multiview data are both imbalanced and misaligned. Instead, it relies on class-level alignment of the available data. Thus, it results in some data samples not being well-matched, thereby affecting the quality of data fusion. In this paper, we propose the Consistency-Aware Padding for Incomplete Multimodal Alignment Clustering Based on Self-Repellent Greedy Anchor Search (CAPIMAC) to tackle the problem of filling imbalanced and misaligned data in multimodal datasets. Specifically, we propose a self-repellent greedy anchor search module (SRGASM), which employs a self-repellent random walk combined with a greedy algorithm to identify anchor points for re-representing incomplete and misaligned multimodal data. Subsequently, based on noise-contrastive learning, we design a consistency-aware padding module (CAPM) to effectively interpolate and align imbalanced and misaligned data, thereby improving the quality of multimodal data fusion. Experimental results demonstrate the superiority of our method on benchmark datasets. The code will be publicly released at this https URL.
[CV-163] Animation Needs Attention: A Holistic Approach to Slides Animation Comprehension with Visual-Language Models
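Quick read: This paper addresses the lack of native animation support in current AI-driven slide-generation tools and the poor performance of existing vision-language models (VLMs) on animation tasks, which stems from the absence of public datasets and limited temporal reasoning. The key to its solution is the first public dataset for slide-animation modeling, containing 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, on which Qwen-2.5-VL-7B is fine-tuned with Low-Rank Adaptation (LoRA), improving performance on temporal reasoning and animation tasks.
Link: https://arxiv.org/abs/2507.03916
Authors: Yifan Jiang,Yibo Xue,Yukun Kang,Pin Zheng,Jian Peng,Feiran Wu,Changliang Xu
Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; Alibaba Group
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Appendix at: this https URL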
Abstract:Slide animations, such as fade-ins, fly-ins, and wipes, are critical for audience engagement, efficient information delivery, and vivid visual expression. However, most AI-driven slide-generation tools still lack native animation support, and existing vision-language models (VLMs) struggle with animation tasks due to the absence of public datasets and limited temporal-reasoning capabilities. To address this gap, we release the first public dataset for slide-animation modeling: 12,000 triplets of natural-language descriptions, animation JSON files, and rendered videos, collectively covering every built-in PowerPoint effect. Using this resource, we fine-tune Qwen-2.5-VL-7B with Low-Rank Adaptation (LoRA) and achieve consistent improvements over GPT-4.1 and Gemini-2.5-Pro in BLEU-4, ROUGE-L, SPICE, and our Coverage-Order-Detail Assessment (CODA) metric, which evaluates action coverage, temporal order, and detail fidelity. On a manually curated test set of slides, the LoRA model increases BLEU-4 by around 60%, ROUGE-L by 30%, and shows significant improvements in CODA-detail. This demonstrates that low-rank adaptation enables reliable temporal reasoning and generalization beyond synthetic data. Overall, our dataset, LoRA-enhanced model, and CODA metric provide a rigorous benchmark and foundation for future research on VLM-based dynamic slide generation.
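[CV-164] Bridging Vision and Language: Optimal Transport-Driven Radiology Report Generation via LLMs
Quick read: This paper addresses the weak clinical effectiveness of general large language models (LLMs) in radiology report generation, in particular their difficulty in capturing the relationship between X-ray images and the corresponding text, which leaves the generated reports with poor clinical practicability. The key to its solution is Optimal Transport-Driven Radiology Report Generation (OTDRG), a framework that uses Optimal Transport (OT) to align image features with disease labels extracted from reports, narrowing the cross-modal gap, and then fine-tunes the model on the aligned image and text features to improve the clinical accuracy of the generated reports.
Link: https://arxiv.org/abs/2507.03908
Authors: Haifeng Zhao,Yufei Zhang,Leilei Ma,Shuo Xu,Dengdi Sun
Affiliations: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: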
Abstract:Radiology report generation represents a significant application within medical AI, and has achieved impressive results. Concurrently, large language models (LLMs) have demonstrated remarkable performance across various domains. However, empirical validation indicates that general LLMs tend to focus more on linguistic fluency rather than clinical effectiveness, and lack the ability to effectively capture the relationship between X-ray images and their corresponding texts, thus resulting in poor clinical practicability. To address these challenges, we propose Optimal Transport-Driven Radiology Report Generation (OTDRG), a novel framework that leverages Optimal Transport (OT) to align image features with disease labels extracted from reports, effectively bridging the cross-modal gap. The core component of OTDRG is Alignment & Fine-Tuning, where OT utilizes results from the encoding of label features and image visual features to minimize cross-modal distances, then integrating image and text features for LLM fine-tuning. Additionally, we design a novel disease prediction module to predict disease labels contained in X-ray images during validation and testing. Evaluated on the MIMIC-CXR and IU X-Ray datasets, OTDRG achieves state-of-the-art performance in both natural language generation (NLG) and clinical efficacy (CE) metrics, delivering reports that are not only linguistically coherent but also clinically accurate.
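To make the idea of aligning two feature sets with Optimal Transport more concrete, here is a minimal sketch of an entropy-regularized OT solver (Sinkhorn iterations) in pure Python. This is a generic textbook solver, not the authors' implementation: the abstract does not say which OT solver OTDRG uses, and the function name `sinkhorn`, the toy cost matrix, and the values of `eps` and `iters` are all assumptions for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: n x m matrix of pairwise costs (e.g. distances between image
          features and label features; hypothetical inputs here).
    a, b: source and target marginals, each summing to 1.
    Returns the n x m transport plan.
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low cost means high affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two "image" items and two "label" items; matching pairs are cheap.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
```

In the toy example almost all mass flows along the diagonal, that is, each item is matched to its cheapest counterpart. A training objective could then penalize the total transport cost, which is one common way to pull two modalities together.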
[CV-165] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
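Quick read: This paper addresses slow inference, high computational cost, and the need for multiple specialized models across human animation tasks such as lip-syncing, audio-driven full-body animation, and video generation from start and end frames. The key to its solution is a unified multi-task paradigm built on a video generation model that treats diverse generation tasks as spatial-temporal local reconstructions requiring changes only on the input side; a multi-modal decoupled cross-attention module that fuses text, image, and audio conditions in a divide-and-conquer manner; and an alternating SFT+Reward training paradigm that lets a model with only 1.3B parameters reach generation quality comparable to models with ten times as many parameters.
Link: https://arxiv.org/abs/2507.03905
Authors: Rang Meng,Yan Wang,Weipeng Wu,Ruobing Zheng,Yuming Li,Chenguang Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: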
Abstract:Human animation recently has advanced rapidly, achieving increasingly realistic and vivid results, especially with the integration of large-scale video generation models. However, the slow inference speed and high computational cost of these large models bring significant challenges for practical applications. Additionally, various tasks in human animation, such as lip-syncing, audio-driven full-body animation, and video generation from start and end frames, often require different specialized models. The introduction of large video models has not alleviated this dilemma. This raises an important question: Can we make human animation Faster, Higher in quality, Stronger in generalization, and make various tasks Together in one model? To address this, we dive into video generation models and discover that the devil lies in the details: Inspired by MAE, we propose a novel unified Multi-Task paradigm for human animation, treating diverse generation tasks as spatial-temporal local reconstructions, requiring modifications only on the input side; Given the interplay and division among multi-modal conditions including text, image, and audio, we introduce a multi-modal decoupled cross-attention module to fuse multi-modals in a divide-and-conquer manner; We propose a new SFT+Reward alternating training paradigm, enabling the minimal model with 1.3B parameters to achieve generation quality comparable to models with 10 times the parameters count. Through these innovations, our work paves the way for efficient, high-quality, and versatile digital human generation, addressing both performance and practicality challenges in the field. Extensive experiments demonstrate that EchoMimicV3 outperforms existing models in both facial and semi-body video generation, providing precise text-based control for creating videos in a wide range of scenarios.
[CV-166] Taming Anomalies with Down-Up Sampling Networks: Group Center Preserving Reconstruction for 3D Anomaly Detection ACM-MM25
Quick read: This paper aims to solve the reconstruction of high-precision point clouds for 3D anomaly detection, where conventional reconstruction-based methods struggle with large scale and complex structure. The key to its solution is the Down-Up Sampling Network (DUS-Net), which reconstructs high-precision point clouds while preserving the group center geometric structure. DUS-Net first uses a noise generation module to diversify the training data, then a down-sampling network learns an anomaly-free center point cloud, and finally an up-sampling network reconstructs the high-precision point cloud by fusing multi-scale up-sampling features, improving anomaly detection performance.
Link: https://arxiv.org/abs/2507.03903
Authors: Hanzhe Liang,Jie Zhang,Tao Dai,Linlin Shen,Jinbao Wang,Can Gao
Affiliations: Shenzhen University; Shenzhen Audencia Financial Technology Institute; Macao Polytechnic University; Shenzhen Institute of Artificial Intelligence and Robotics for Society; National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University; Guangdong Provincial Key Laboratory of Intelligent Information Processing
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM MM25 Accepted
Abstract:Reconstruction-based methods have demonstrated very promising results for 3D anomaly detection. However, these methods face great challenges in handling high-precision point clouds due to the large scale and complex structure. In this study, a Down-Up Sampling Network (DUS-Net) is proposed to reconstruct high-precision point clouds for 3D anomaly detection by preserving the group center geometric structure. The DUS-Net first introduces a Noise Generation module to generate noisy patches, which facilitates the diversity of training data and strengthens the feature representation for reconstruction. Then, a Down-sampling Network (Down-Net) is developed to learn an anomaly-free center point cloud from patches with noise injection. Subsequently, an Up-sampling Network (Up-Net) is designed to reconstruct high-precision point clouds by fusing multi-scale up-sampling features. Our method leverages group centers for construction, enabling the preservation of geometric structure and providing a more precise point cloud. Extensive experiments demonstrate the effectiveness of our proposed method, achieving state-of-the-art (SOTA) performance with an Object-level AUROC of 79.9% and 79.5%, and a Point-level AUROC of 71.2% and 84.7% on the Real3D-AD and Anomaly-ShapeNet datasets, respectively.
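The results above are reported as AUROC. For readers unfamiliar with the metric, the following minimal sketch computes it as the probability that a randomly chosen anomalous item receives a higher anomaly score than a randomly chosen normal item, with ties counted as one half. The function name `auroc` and the toy scores are hypothetical and are not taken from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: anomaly scores, higher means more anomalous.
    labels: 1 for anomalous, 0 for normal.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: three of the four (anomalous, normal) pairs are ranked correctly.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
value = auroc(scores, labels)  # 0.75
```

Object-level AUROC would apply this to one score per object and point-level AUROC to one score per point. The pairwise loop is quadratic, so real evaluations typically use a sort-based implementation instead.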
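[CV-167] Transformer Model for Alzheimer's Disease Progression Prediction Using Longitudinal Visit Sequences ALT
Quick read: This paper addresses prediction of Alzheimer’s disease (AD) progression from clinical follow-up data, aiming to predict the disease stage at a patient's next clinical visit from the visit history. The key to its solution is a Transformer-based model that predicts from features extracted across multiple visits, compared against recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent unit (GRU), and minimalRNN to assess performance under different history lengths and data imbalance. The model retains strong predictive performance despite missing visits and missing features, and is particularly good at identifying subjects who convert to more severe disease stages.
Link: https://arxiv.org/abs/2507.03899
Authors: Mahdi Moghaddami,Clayton Schubring,Mohammad-Reza Siadat
Affiliations: Oakland University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Conference on Health, Inference, and Learning (CHIL, 2025)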
Abstract:Alzheimer’s disease (AD) is a neurodegenerative disorder with no known cure that affects tens of millions of people worldwide. Early detection of AD is critical for timely intervention to halt or slow the progression of the disease. In this study, we propose a Transformer model for predicting the stage of AD progression at a subject’s next clinical visit using features from a sequence of visits extracted from the subject’s visit history. We also rigorously compare our model to recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent unit (GRU), and minimalRNN and assess their performances based on factors such as the length of prior visits and data imbalance. We test the importance of different feature categories and visit history, as well as compare the model to a newer Transformer-based model optimized for time series. Our model demonstrates strong predictive performance despite missing visits and missing features in available visits, particularly in identifying converter subjects – individuals transitioning to more severe disease stages – an area that has posed significant challenges in longitudinal prediction. The results highlight the model’s potential in enhancing early diagnosis and patient outcomes.
[CV-168] Deconfounding Causal Inference through Two-Branch Framework with Early-Forking for Sensor-Based Cross-Domain Activity Recognition
Quick read: This paper aims to solve the performance degradation caused by distribution shift in sensor-based Human Activity Recognition (HAR). Existing domain generalization (DG) work mostly models the statistical dependence between sensor data and activity labels and ignores the intrinsic causal mechanism. The key to its solution is to cast DG-based HAR as a causal inference problem and propose a causality-inspired representation learning algorithm. At its core is an early-forking two-branch framework that learns causal (category-aware) and non-causal (domain-specific) features separately and uses an independence-based Hilbert-Schmidt Information Criterion to disentangle them implicitly, improving generalization in cross-domain activity recognition.
Link: https://arxiv.org/abs/2507.03898
Authors: Di Xiong,Lei Zhang,Shuoyuan Wang,Dongzhou Cheng,Wenbo Huang
Affiliations: School of Electrical and Automation Engineering, Nanjing Normal University, Nanjing, Jiangsu, China; Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, Guangdong, China; School of Cyber Science and Engineering, Southeast University, Nanjing, Jiangsu, China; School of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)
Abstract:Recently, domain generalization (DG) has emerged as a promising solution to mitigate the distribution-shift issue in sensor-based human activity recognition (HAR) scenarios. However, most existing DG-based works have merely focused on modeling statistical dependence between sensor data and activity labels, neglecting the importance of the intrinsic causal mechanism. Intuitively, every sensor input can be viewed as a mixture of causal (category-aware) and non-causal (domain-specific) factors, where only the former affects activity classification judgment. In this paper, by casting such DG-based HAR as a causal inference problem, we propose a causality-inspired representation learning algorithm for cross-domain activity recognition. To this end, an early-forking two-branch framework is designed, where two separate branches are respectively responsible for learning causal and non-causal features, while an independence-based Hilbert-Schmidt Information Criterion is employed to implicitly disentangle them. Additionally, an inhomogeneous domain sampling strategy is designed to enhance disentanglement, while a category-aware domain perturbation layer is applied to prevent representation collapse. Extensive experiments on several public HAR benchmarks demonstrate that our causality-inspired approach significantly outperforms eleven related state-of-the-art baselines under cross-person, cross-dataset, and cross-position settings. Detailed ablation and visualization analyses reveal the underlying causal mechanism, indicating its effectiveness, efficiency, and universality in cross-domain activity recognition scenarios.
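The abstract names a Hilbert-Schmidt criterion as the independence measure between the two branches. Assuming this refers to the kernel dependence measure usually called the Hilbert-Schmidt Independence Criterion (HSIC), the sketch below computes its standard biased empirical estimate, trace(KHLH) / (n - 1)^2, in pure Python. The Gaussian kernel, the bandwidth `sigma`, and all function names are assumptions for illustration; the abstract does not give the authors' exact formulation.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian kernel over a list of feature vectors."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
             for y in xs] for x in xs]


def center(K):
    """Double-center a square matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)] for i in range(n)]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC between paired samples xs and ys.

    Near zero when the two feature sets look independent; larger values
    indicate stronger dependence.
    """
    n = len(xs)
    Kc = center(gaussian_gram(xs, sigma))
    L = gaussian_gram(ys, sigma)
    # trace(K H L H) equals the elementwise sum of (H K H) * L.
    return sum(Kc[i][j] * L[i][j] for i in range(n) for j in range(n)) / (n - 1) ** 2


# Toy example: features compared with themselves are strongly dependent,
# while constant features carry no information about anything.
xs = [[0.0], [1.0], [2.0], [3.0]]
dependent = hsic(xs, xs)
independent = hsic(xs, [[0.0]] * 4)
```

Used as a loss term, minimizing such a quantity between the outputs of two branches would push them toward statistical independence, which matches the disentanglement role the abstract describes.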
[CV-169] Hierarchical Semantic-Visual Fusion of Visible and Near-infrared Images for Long-range Haze Removal
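Quick read: This paper aims to solve long-range haze removal, where intensified scattering over distance causes severe haze and signal loss, so that details cannot be recovered from visible images alone. The key to its solution is the Hierarchical Semantic-Visual Fusion (HSVF) framework, which exploits both the complementary low-level visual features of the visible and near-infrared modalities and their shared high-level semantic consistency. The semantic stream obtains haze-robust semantic predictions by aligning modality-invariant intrinsic representations, while the visual stream recovers lost structural details by fusing complementary cues from visible and near-infrared images, producing clear images with high contrast and rich texture in long-range scenes.
Link: https://arxiv.org/abs/2507.03893
Authors: Yi Li,Xiaoxiong Wang,Jiawei Wang,Yi Chang,Kai Cao,Luxin Yan
Affiliations: Huazhong University of Science and Technology; National Key Laboratory of Multispectral Information Intelligent Processing Technology; State Key Laboratory of Dynamic Optical Imaging and Measurement
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This work has been accepted by IEEE Transactions on Multimedia for publication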
Abstract:While image dehazing has advanced substantially in the past decade, most efforts have focused on short-range scenarios, leaving long-range haze removal under-explored. As distance increases, intensified scattering leads to severe haze and signal loss, making it impractical to recover distant details solely from visible images. Near-infrared, with superior fog penetration, offers critical complementary cues through multimodal fusion. However, existing methods focus on content integration while often neglecting haze embedded in visible images, leading to results with residual haze. In this work, we argue that the infrared and visible modalities not only provide complementary low-level visual features, but also share high-level semantic consistency. Motivated by this, we propose a Hierarchical Semantic-Visual Fusion (HSVF) framework, comprising a semantic stream to reconstruct haze-free scenes and a visual stream to incorporate structural details from the near-infrared modality. The semantic stream first acquires haze-robust semantic prediction by aligning modality-invariant intrinsic representations. Then the shared semantics act as strong priors to restore clear and high-contrast distant scenes under severe haze degradation. In parallel, the visual stream focuses on recovering lost structural details from near-infrared by fusing complementary cues from both visible and near-infrared images. Through the cooperation of dual streams, HSVF produces results that exhibit both high-contrast scenes and rich texture details. Moreover, we introduce a novel pixel-aligned visible-infrared haze dataset with semantic labels to facilitate benchmarking. Extensive experiments demonstrate the superiority of our method over state-of-the-art approaches in real-world long-range haze removal.
[CV-170] ArmGS: Composite Gaussian Appearance Refinement for Modeling Dynamic Urban Environments
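[Quick Read]: This paper addresses the modeling of dynamic urban environments for autonomous driving simulation, in particular the limited rendering efficiency of existing data-driven methods based on neural radiance fields and the inability of existing 3D Gaussian splatting methods to capture fine-grained variations between frames and camera viewpoints. The key to the solution is ArmGS, a composite driving Gaussian splatting approach with multi-granularity appearance refinement. It uses a multi-level appearance modeling scheme to optimize a set of transformation parameters for composite Gaussian refinement, from the local Gaussian level to the global image level and the dynamic actor level, so that it models both global scene appearance variations and local fine-grained changes in the background and objects.
Link: https://arxiv.org/abs/2507.03886
Authors: Guile Wu, Dongfeng Bai, Bingbing Liu
Affiliations: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical report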
Abstract:This work focuses on modeling dynamic urban environments for autonomous driving simulation. Contemporary data-driven methods using neural radiance fields have achieved photorealistic driving scene modeling, but they suffer from low rendering efficacy. Recently, some approaches have explored 3D Gaussian splatting for modeling dynamic urban scenes, enabling high-fidelity reconstruction and real-time rendering. However, these approaches often neglect to model fine-grained variations between frames and camera viewpoints, leading to suboptimal results. In this work, we propose a new approach named ArmGS that exploits composite driving Gaussian splatting with multi-granularity appearance refinement for autonomous driving scene modeling. The core idea of our approach is devising a multi-level appearance modeling scheme to optimize a set of transformation parameters for composite Gaussian refinement from multiple granularities, ranging from local Gaussian level to global image level and dynamic actor level. This not only models global scene appearance variations between frames and camera viewpoints, but also models local fine-grained changes of background and objects. Extensive experiments on multiple challenging autonomous driving datasets, namely, Waymo, KITTI, NOTR and VKITTI2, demonstrate the superiority of our approach over the state-of-the-art methods.
[CV-171] A Rigorous Behavior Assessment of CNNs Using a Data-Domain Sampling Regime IEEE-VIS2025
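[Quick Read]: This paper aims to quantify the graphical perception behavior of convolutional neural networks (CNNs), specifically their ability to estimate ratios in bar charts. The key to the solution is a data-domain sampling regime that evaluates CNNs from three perspectives: sensitivity to training-test distribution discrepancies, stability under limited samples, and relative expertise compared with human observers. From an analysis of trials from a large number of CNN models and human participants, the study concludes that CNNs can outperform humans and that their biases depend only on the training-test distance.
Link: https://arxiv.org/abs/2507.03866
Authors: Shuning Jiang, Wei-Lun Chao, Daniel Haehn, Hanspeter Pfister, Jian Chen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: This is a preprint of a paper that has been conditionally accepted for publication at IEEE VIS 2025. The final version may be different upon publication. 9 pages main text, 11 pages supplementary contents, 37 figures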
Abstract:We present a data-domain sampling regime for quantifying CNNs’ graphic perception behaviors. This regime lets us evaluate CNNs’ ratio estimation ability in bar charts from three perspectives: sensitivity to training-test distribution discrepancies, stability to limited samples, and relative expertise to human observers. After analyzing 16 million trials from 800 CNN models and 6,825 trials from 113 human participants, we arrived at a simple and actionable conclusion: CNNs can outperform humans and their biases simply depend on the training-test distance. We show evidence of this simple, elegant behavior of the machines when they interpret visualization images. This http URL provides registration, the code for our sampling regime, and experimental results.
[CV-172] Interpretable Diffusion Models with B-cos Networks
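[Quick Read]: This paper addresses the failure of text-to-image diffusion models to accurately reflect all semantic information in the prompt, a failure that is difficult to detect automatically. The key to the solution is a diffusion model architecture built from B-cos modules that is inherently interpretable: it produces explanations showing which pixel regions of the generated image each prompt token influences, giving meaningful insight into prompt-image alignment.
Link: https://arxiv.org/abs/2507.03846
Authors: Nicola Bernold, Moritz Vandenhirtz, Alice Bizeul, Julia E. Vogt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: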
Abstract:Text-to-image diffusion models generate images by iteratively denoising random noise, conditioned on a prompt. While these models have enabled impressive progress in image generation, they often fail to accurately reflect all semantic information described in the prompt – failures that are difficult to detect automatically. In this work, we introduce a diffusion model architecture built with B-cos modules that offers inherent interpretability. Our approach provides insight into how individual prompt tokens affect the generated image by producing explanations that highlight the pixel regions influenced by each token. We demonstrate that B-cos diffusion models can produce high-quality images while providing meaningful insights into prompt-image alignment.
[CV-173] F-Hash: Feature-Based Hash Design for Time-Varying Volume Visualization via Multi-Resolution Tesseract Encoding
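[Quick Read]: This paper addresses the challenges that complex spatiotemporal features and large dataset sizes pose for interactive time-varying volume visualization, in particular the slow training convergence of implicit neural representations (INRs). The key to the solution is F-Hash, a feature-based multi-resolution Tesseract encoding architecture. It uses multi-level collision-free hash functions to map dynamic 4D multi-resolution embedding grids, achieving high encoding capacity with compact encoding parameters and substantially faster training convergence.
Link: https://arxiv.org/abs/2507.03836
Authors: Jianxin Sun, David Lenz, Hongfeng Yu, Tom Peterka
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: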
Abstract:Interactive time-varying volume visualization is challenging due to its complex spatiotemporal features and sheer size of the dataset. Recent works transform the original discrete time-varying volumetric data into continuous Implicit Neural Representations (INR) to address the issues of compression, rendering, and super-resolution in both spatial and temporal domains. However, training the INR takes a long time to converge, especially when handling large-scale time-varying volumetric datasets. In this work, we proposed F-Hash, a novel feature-based multi-resolution Tesseract encoding architecture to greatly enhance the convergence speed compared with existing input encoding methods for modeling time-varying volumetric data. The proposed design incorporates multi-level collision-free hash functions that map dynamic 4D multi-resolution embedding grids without bucket waste, achieving high encoding capacity with compact encoding parameters. Our encoding method is agnostic to time-varying feature detection methods, making it a unified encoding solution for feature tracking and evolution visualization. Experiments show the F-Hash achieves state-of-the-art convergence speed in training various time-varying volumetric datasets for diverse features. We also proposed an adaptive ray marching algorithm to optimize the sample streaming for faster rendering of the time-varying neural representation.
[CV-174] Query-Based Adaptive Aggregation for Multi-Dataset Joint Training Toward Universal Visual Place Recognition
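[Quick Read]: This paper addresses two problems in Visual Place Recognition (VPR): the dataset-specific inductive biases that arise from training on a single dataset, and the saturation of information capacity in feature aggregation layers caused by divergences among datasets during multi-dataset joint training. The key to the solution is Query-based Adaptive Aggregation (QAA), which uses learned queries as reference codebooks to increase the information capacity of feature aggregation while keeping computational and parameter complexity low. It generates robust descriptors from the Cross-query Similarity (CS) between query-level image features and the reference codebooks, yielding balanced generalization across diverse datasets.
Link: https://arxiv.org/abs/2507.03831
Authors: Jiuhong Xiao, Yang Zhou, Giuseppe Loianno
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 9 pages, 4 figures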
Abstract:Deep learning methods for Visual Place Recognition (VPR) have advanced significantly, largely driven by large-scale datasets. However, most existing approaches are trained on a single dataset, which can introduce dataset-specific inductive biases and limit model generalization. While multi-dataset joint training offers a promising solution for developing universal VPR models, divergences among training datasets can saturate limited information capacity in feature aggregation layers, leading to suboptimal performance. To address these challenges, we propose Query-based Adaptive Aggregation (QAA), a novel feature aggregation technique that leverages learned queries as reference codebooks to effectively enhance information capacity without significant computational or parameter complexity. We show that computing the Cross-query Similarity (CS) between query-level image features and reference codebooks provides a simple yet effective way to generate robust descriptors. Our results demonstrate that QAA outperforms state-of-the-art models, achieving balanced generalization across diverse datasets while maintaining peak performance comparable to dataset-specific models. Ablation studies further explore QAA’s mechanisms and scalability. Visualizations reveal that the learned queries exhibit diverse attention patterns across datasets. Code will be publicly released.
[CV-175] Zero Memory Overhead Approach for Protecting Vision Transformer Parameters
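[Quick Read]: This paper addresses the reliability of Vision Transformer (ViT) parameters stored in memory in the presence of bit-flip faults, which matters for safety-critical applications such as autonomous driving. The key to the solution exploits the fact that the least significant bit (LSB) of a parameter has little effect on model accuracy: the LSB is replaced with a parity bit, giving an error detection mechanism with no memory overhead. When a fault is detected, the affected parameter is masked by setting it to zero, which prevents accuracy degradation and substantially improves the robustness of ViT parameters to bit flips.
Link: https://arxiv.org/abs/2507.03816
Authors: Fereshteh Baradaran, Mohsen Raji, Azadeh Baradaran, Arezoo Baradaran, Reihaneh Akbarifard
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: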
Abstract:Vision Transformers (ViTs) have demonstrated superior performance over Convolutional Neural Networks (CNNs) in various vision-related tasks such as classification, object detection, and segmentation due to their use of self-attention mechanisms. As ViTs become more popular in safety-critical applications like autonomous driving, ensuring their correct functionality becomes essential, especially in the presence of bit-flip faults in their parameters stored in memory. In this paper, a fault tolerance technique is introduced to protect ViT parameters against bit-flip faults with zero memory overhead. Since the least significant bits of parameters are not critical for model accuracy, replacing the LSB with a parity bit provides an error detection mechanism without imposing any overhead on the model. When faults are detected, affected parameters are masked by zeroing out, as most parameters in ViT models are near zero, effectively preventing accuracy degradation. This approach enhances reliability across ViT models, improving the robustness of parameters to bit-flips by up to three orders of magnitude, making it an effective zero-overhead solution for fault tolerance in critical applications.
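The parity-in-LSB idea described in this abstract can be illustrated with a short sketch. It assumes float32 weights and even parity over the whole 32-bit word; the function names (`protect`, `read`) and the choice to leave the parity bit in place when reading a healthy weight are illustrative assumptions, not details taken from the paper.

```python
import struct


def float_to_bits(x: float) -> int:
    """Reinterpret x as the 32-bit pattern of its float32 encoding."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(word: int) -> float:
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


def parity(word: int) -> int:
    """Return 1 if the number of set bits is odd, else 0."""
    return bin(word).count("1") & 1


def protect(x: float) -> int:
    """Overwrite the LSB with an even-parity bit over the upper 31 bits."""
    upper = float_to_bits(x) & 0xFFFFFFFE
    return upper | parity(upper)


def read(word: int) -> float:
    """Return the stored weight, or 0.0 when the parity check fails."""
    if parity(word) != 0:  # odd overall parity: some bit has flipped
        return 0.0
    return bits_to_float(word)  # assumption: the parity bit stays in the LSB


stored = protect(0.0123)
faulty = stored ^ (1 << 30)  # simulate a flip in a high exponent bit
print(read(stored), read(faulty))
```

A single parity bit only detects an odd number of flipped bits per word, so this sketch would miss a double flip within one parameter.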
[CV-176] FastDINOv2: Frequency Based Curriculum Learning Improves Robustness and Training Speed
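[Quick Read]: This paper addresses the heavy computational cost of reproducing the pre-training of large-scale vision foundation models, which practitioners need to do for private data, new modalities, or scientific inquiry. The key to the solution is a new pre-training strategy that combines a frequency filtering curriculum (low-frequency content is seen first) with Gaussian noise patching augmentation, which both accelerates convergence and improves robustness to common corruptions. The method reduces pre-training time and FLOPs while keeping robustness and linear probing performance comparable to the baseline.
Link: https://arxiv.org/abs/2507.03779
Authors: Jiaqi Zhang, Juntuo Wang, Zhixin Sun, John Zou, Randall Balestriero
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: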
Abstract:Large-scale vision foundation models such as DINOv2 boast impressive performances by leveraging massive architectures and training datasets. But numerous scenarios require practitioners to reproduce those pre-training solutions, such as on private data, new modalities, or simply for scientific questioning–which is currently extremely demanding computation-wise. We thus propose a novel pre-training strategy for DINOv2 that simultaneously accelerates convergence–and strengthens robustness to common corruptions as a by-product. Our approach involves a frequency filtering curriculum–low-frequency being seen first–and the Gaussian noise patching augmentation. Applied to a ViT-B/16 backbone trained on ImageNet-1K, while pre-training time and FLOPs are reduced by 1.6x and 2.25x, our method still achieves matching robustness in corruption benchmarks (ImageNet-C) and maintains competitive linear probing performance compared with baseline. This dual benefit of efficiency and robustness makes large-scale self-supervised foundation modeling more attainable, while opening the door to novel exploration around data curriculum and augmentation as means to improve self-supervised learning models robustness. The code is available at this https URL
[CV-177] Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach
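[Quick Read]: This paper addresses the difficulty that event-based semantic segmentation methods have in fully exploiting the complementary information in frames and events, which leads to complex training strategies and high computational cost. The key to the solution is an efficient hybrid framework with a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Three specialized modules, the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module, let the two branches interact effectively, improving segmentation accuracy and reducing energy consumption.
Link: https://arxiv.org/abs/2507.03765
Authors: Hebei Li, Yansong Peng, Jiahui Yuan, Peixi Wu, Jin Wang, Yueyi Zhang, Xiaoyan Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: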
Abstract:Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.
[CV-178] StreamDiT: Real-Time Streaming Text-to-Video Generation
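[Quick Read]: This paper addresses the limitation that existing text-to-video (T2V) generation models can only produce short clips offline, which restricts interactive and real-time use. The key to the solution is StreamDiT, a streaming video generation model. Its training is based on flow matching with an added moving buffer, with mixed training over different partitioning schemes of the buffered frames to improve content consistency and visual quality. The model builds on adaptive layer normalization DiT (adaLN DiT) with varying time embedding and window attention, ultimately reaching real-time video stream generation at 16 FPS on a single GPU.
Link: https://arxiv.org/abs/2507.03745
Authors: Akio Kodaira, Tingbo Hou, Ji Hou, Masayoshi Tomizuka, Yue Zhao
Affiliations: UC Berkeley (加州大学伯克利分校); Meta (元)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: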
Abstract:Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling distillation is performed in each segment of a chosen partitioning scheme. After distillation, the total number of function evaluations (NFEs) is reduced to the number of chunks in a buffer. Finally, our distilled model reaches real-time performance at 16 FPS on one GPU, which can generate video streams at 512p resolution. We evaluate our method through both quantitative metrics and human evaluation. Our model enables real-time applications, e.g. streaming generation, interactive generation, and video-to-video. We provide video results and more examples on our project website: this https URL.
[CV-179] ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays
[Quick Read]: This paper addresses the imbalance between the supply of and demand for radiologists as reliance on medical imaging services grows. The key to the solution is ChestGPT, a deep-learning framework that combines a Vision Transformer (ViT) with a Large Language Model (LLM) to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are fed to the LLM together with engineered prompts for joint disease classification and localization, and transfer learning is used to improve explainability and performance.
Link: https://arxiv.org/abs/2507.03739
Authors: Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi
Affiliations: KITE Research Institute, Toronto Rehabilitation Institute, University Health Network (KITE 研究所,多伦多康复研究所,大学健康网络); College of Engineering and Technology, American University of the Middle East (工程与技术学院,中东美国大学); Faculty of Applied Science and Engineering, University of Toronto (应用科学与工程学院,多伦多大学); Department of Electrical and Computer Engineering, University of Manitoba (电气与计算机工程系,曼尼托巴大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, 4 tables
Abstract:The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists’ capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists’ workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.
[CV-180] Flow-Anchored Consistency Models
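[Quick Read]: This paper addresses the training instability of continuous-time Consistency Models (CMs), which arises because a model trained only to learn a shortcut across a probability flow loses its grasp of the instantaneous velocity field that defines the flow. The key to the solution is a training strategy called Flow-Anchoring, which uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective, strengthening the model's grounding in the underlying probability flow and its training stability without any architectural modifications.
Link: https://arxiv.org/abs/2507.03738
Authors: Yansong Peng, Kai Zhu, Yu Liu, Pingyu Wu, Hebei Li, Xiaoyan Sun, Feng Wu
Affiliations: University of Science and Technology of China (中国科学技术大学); Tongyi Lab (通义实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: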
Abstract:Continuous-time Consistency Models (CMs) promise efficient few-step generation but face significant challenges with training instability. We argue this instability stems from a fundamental conflict: by training a network to learn only a shortcut across a probability flow, the model loses its grasp on the instantaneous velocity field that defines the flow. Our solution is to explicitly anchor the model in the underlying flow during training. We introduce the Flow-Anchored Consistency Model (FACM), a simple but effective training strategy that uses a Flow Matching (FM) task as an anchor for the primary CM shortcut objective. This Flow-Anchoring approach requires no architectural modifications and is broadly compatible with standard model architectures. By distilling a pre-trained LightningDiT model, our method achieves a state-of-the-art FID of 1.32 with two steps (NFE=2) and 1.76 with just one step (NFE=1) on ImageNet 256x256, significantly outperforming previous methods. This provides a general and effective recipe for building high-performance, few-step generative models. Our code and pretrained models: this https URL.
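For readers unfamiliar with the Flow Matching task that serves as the anchor here, the standard conditional flow matching objective with a linear interpolation path can be written as below. This is the textbook form and only an assumption about the anchor term: the abstract does not state the exact path, the time convention, or how the FM and CM losses are weighted, and the symbols $x_0$, $x_1$, and $v_\theta$ are notation introduced for this sketch.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\text{FM}}(\theta) =
\mathbb{E}_{t,\,x_0,\,x_1}
\bigl\lVert v_\theta(x_t, t) - (x_1 - x_0) \bigr\rVert_2^2
```

The regression target $x_1 - x_0$ is the time derivative of $x_t$ along this path, which is the instantaneous velocity the abstract refers to.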
[CV-181] Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps ICCV2025
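[Quick Read]: This paper addresses two key problems of 3D Gaussian Splatting (3DGS) in SLAM: inaccurate tracking in outdoor scenes caused by a lack of geometric priors in earlier methods, and scale drift under large camera motion when a separate tracking module is introduced. The key to the solution is S3PO-GS, a robust RGB-only outdoor 3DGS SLAM method. It establishes a self-consistent tracking module anchored in the 3DGS pointmap, which avoids scale drift and improves tracking accuracy and robustness, and a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity, substantially improving tracking accuracy and scene reconstruction quality.
Link: https://arxiv.org/abs/2507.03737
Authors: Chong Cheng, Sicheng Yu, Zijian Wang, Yifan Zhou, Hao Wang
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025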
Abstract:3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking and lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy. Project page: this https URL.
[CV-182] 3D PixBrush: Image-Guided Local Texture Synthesis
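[Quick Read]: This paper addresses image-driven editing of local regions on 3D meshes, that is, how to accurately map the object in a reference image onto a specific region of a 3D model. The key to the solution is 3D PixBrush, which predicts a localization mask and a synthesized texture that faithfully portray the object in the reference image. Its core contribution is localization-modulated image guidance, a modification of score distillation sampling that yields globally coherent and locally precise localizations without user input.
Link: https://arxiv.org/abs/2507.03731
Authors: Dale Decatur, Itai Lang, Kfir Aberman, Rana Hanocka
Affiliations: University of Chicago (芝加哥大学); Decart AI
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: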
Abstract:We present 3D PixBrush, a method for performing image-driven edits of local regions on 3D meshes. 3D PixBrush predicts a localization mask and a synthesized texture that faithfully portray the object in the reference image. Our predicted localizations are both globally coherent and locally precise. Globally - our method contextualizes the object in the reference image and automatically positions it onto the input mesh. Locally - our method produces masks that conform to the geometry of the reference image. Notably, our method does not require any user input (in the form of scribbles or bounding boxes) to achieve accurate localizations. Instead, our method predicts a localization mask on the 3D mesh from scratch. To achieve this, we propose a modification to the score distillation sampling technique which incorporates both the predicted localization and the reference image, referred to as localization-modulated image guidance. We demonstrate the effectiveness of our proposed technique on a wide variety of meshes and images.
[CV-183] Less is More: Empowering GUI Agent with Context-Aware Simplification ICCV2025
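[Quick Read]: This paper addresses problems in the context modeling of pure-vision GUI agents, namely the high density and loose relations of element context and the high redundancy of history context. The key to the solution is SimpAgent, a context-aware simplification framework. It introduces a masking-based element pruning method to reduce interference from unrelated elements and a consistency-guided history compression module to reduce redundancy in historical information, balancing performance and efficiency.
Link: https://arxiv.org/abs/2507.03730
Authors: Gongwei Chen, Xurui Zhou, Rui Shao, Yibo Lyu, Kaiwen Zhou, Shuai Wang, Wentao Li, Yinchuan Li, Zhongang Qi, Liqiang Nie
Affiliations: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted to ICCV 2025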
Abstract:The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agent and summarize: 1) the high-density and loose-relation of element context highlight the existence of many unrelated elements and their negative influence; 2) the high redundancy of history context reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI Agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces 27% FLOPs and achieves superior GUI navigation performances. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.
[CV-184] Computationally efficient non-Intrusive pre-impact fall detection system
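[Quick Read]: This paper addresses the intrusiveness and high computational cost of existing pre-impact fall detection systems, which limit their adoption worldwide. The key to the solution is a non-intrusive, computationally efficient pre-impact fall detection system that works on video from cameras, so the subject does not need to wear specialized equipment, and that uses a small set of fall-specific features with a simple neural network model to reduce computational cost.
Link: https://arxiv.org/abs/2507.03705
Authors: Praveen Jesudhas, Raghuveera T, Shiney Jeyaraj
Affiliations: Anna University, Guindy (安纳大学,古因迪); Shark AI Solutions (鲨鱼人工智能解决方案)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: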
Abstract:Existing pre-impact fall detection systems have high accuracy; however, they are either intrusive to the subject or require heavy computational resources for fall detection, resulting in prohibitive deployment costs. These factors limit the global adoption of existing fall detection systems. In this work we present a pre-impact fall detection system that is both non-intrusive and computationally efficient at deployment. Our system utilizes video data of the locality available through cameras, thereby requiring no specialized equipment to be worn by the subject. Further, the fall detection system utilizes minimal fall-specific features and simple neural network models, designed to reduce the computational cost of the system. A minimal set of fall-specific features is derived from the skeletal data after observing the relative position of the human skeleton during a fall. These features are shown to have different distributions for fall and non-fall scenarios, demonstrating their discriminative capability. A Long Short-Term Memory (LSTM) based network is selected, and the network architecture and training parameters are designed after evaluating performance on standard datasets. In the pre-impact fall detection system, the computation requirement is about 18 times lower than that of existing modules, with a comparable accuracy of 88%. Given the low computation requirements and higher accuracy levels, the proposed system is suitable for wider adoption in engineering systems related to industrial and residential safety.
[CV-185] Sign Spotting Disambiguation using Large Language Models
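[Quick Read]: This paper addresses sign spotting, the identification and localization of individual signs in continuous sign language video, as a way to ease data scarcity in sign language translation. The key to the solution is a training-free framework that integrates Large Language Models (LLMs) to improve sign spotting quality. The method extracts global spatio-temporal and hand shape features and matches them against a large-scale sign dictionary using dynamic time warping and cosine similarity, which improves vocabulary flexibility. An LLM then performs context-aware gloss disambiguation, reducing noise and ambiguity from the matching step without fine-tuning.
Link: https://arxiv.org/abs/2507.03703
Authors: JianHe Low, Ozge Mercanoglu Sincan, Richard Bowden
Affiliations: University of Surrey (萨里大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: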
Abstract:Sign spotting, the task of identifying and localizing individual signs within continuous sign language video, plays a pivotal role in scaling dataset annotations and addressing the severe data scarcity issue in sign language translation. While automatic sign spotting holds great promise for enabling frame-level supervision at scale, it grapples with challenges such as vocabulary inflexibility and ambiguity inherent in continuous sign streams. Hence, we introduce a novel, training-free framework that integrates Large Language Models (LLMs) to significantly enhance sign spotting quality. Our approach extracts global spatio-temporal and hand shape features, which are then matched against a large-scale sign dictionary using dynamic time warping and cosine similarity. This dictionary-based matching inherently offers superior vocabulary flexibility without requiring model retraining. To mitigate noise and ambiguity from the matching process, an LLM performs context-aware gloss disambiguation via beam search, notably without fine-tuning. Extensive experiments on both synthetic and real-world sign language datasets demonstrate our method’s superior accuracy and sentence fluency compared to traditional approaches, highlighting the potential of LLMs in advancing sign spotting.
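The dictionary matching step mentioned in this abstract, dynamic time warping combined with cosine similarity, can be sketched as follows. This is a minimal illustration that assumes each sign is a sequence of non-zero per-frame feature vectors and that cosine distance serves as the local DTW cost; the feature extraction, the dictionary contents, and the names `dtw_cost` and `best_match` are hypothetical and not taken from the paper.

```python
import math
from typing import Dict, List, Sequence

Frame = Sequence[float]


def cosine_distance(a: Frame, b: Frame) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dtw_cost(query: List[Frame], reference: List[Frame]) -> float:
    """Classic O(n*m) dynamic time warping with cosine distance as local cost."""
    n, m = len(query), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(query[i - 1], reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def best_match(query: List[Frame], dictionary: Dict[str, List[Frame]]) -> str:
    """Return the gloss whose reference sequence aligns most cheaply with the query."""
    return min(dictionary, key=lambda gloss: dtw_cost(query, dictionary[gloss]))


# Toy dictionary with two glosses and 2-D features (purely illustrative).
dictionary = {
    "HELLO": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "THANKS": [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
}
print(best_match([[1.0, 0.1], [0.1, 1.0]], dictionary))
```

Because matching is done against a dictionary rather than a trained classifier, adding a new sign only requires adding a reference sequence, which is the vocabulary flexibility the abstract highlights.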
[CV-186] SAMed-2: Selective Memory Enhanced Medical Segment Anything Model MICCAI2025
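[Quick Read]: This paper addresses the challenges of applying "segment anything" models directly to medical image segmentation, including the complexity of medical data, noisy annotations, and the need for continual learning across diverse imaging modalities and anatomical structures. The key to the solution is SAMed-2, a foundation model for medical image segmentation built on the SAM-2 architecture. It adds a temporal adapter to capture image correlations and a confidence-driven memory mechanism that stores high-certainty features for later retrieval, countering the pervasive noise in large-scale medical datasets and mitigating catastrophic forgetting on new tasks or modalities.
Link: https://arxiv.org/abs/2507.03698
Authors: Zhiling Yan, Sifan Song, Dingjie Song, Yiwei Li, Rong Zhou, Weixiang Sun, Zhennong Chen, Sekeun Kim, Hui Ren, Tianming Liu, Quanzheng Li, Xiang Li, Lifang He, Lichao Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025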
Abstract:Recent “segment anything” efforts show promise by learning from large-scale data, but adapting such models directly to medical images remains challenging due to the complexity of medical data, noisy annotations, and continual learning requirements across diverse modalities and anatomical structures. In this work, we propose SAMed-2, a new foundation model for medical image segmentation built upon the SAM-2 architecture. Specifically, we introduce a temporal adapter into the image encoder to capture image correlations and a confidence-driven memory mechanism to store high-certainty features for later retrieval. This memory-based strategy counters the pervasive noise in large-scale medical datasets and mitigates catastrophic forgetting when encountering new tasks or modalities. To train and evaluate SAMed-2, we curate MedBank-100k, a comprehensive dataset spanning seven imaging modalities and 21 medical segmentation tasks. Our experiments on both internal benchmarks and 10 external datasets demonstrate superior performance over state-of-the-art baselines in multi-task scenarios. The code is available at: this https URL.
[CV-187] On the rankability of visual embeddings
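[Quick Read]: This paper asks whether visual embedding models capture continuous, ordinal attributes along linear directions, which the authors call rank axes. It finds that many embedding models are rankable for particular attributes, meaning that projecting embeddings onto such an axis preserves the attribute's order. The key finding is that a small number of samples, or even just two extreme examples, often suffice to recover a meaningful rank axis without full supervision. This opens new use cases for image ranking in vector databases and motivates further study of the structure and learning of rankable embeddings.
Link: https://arxiv.org/abs/2507.03683
Authors: Ankit Sonthalia, Arnas Uselis, Seong Joon Oh
Affiliations: Tübingen AI Center, Universität Tübingen, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: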
Abstract:We study whether visual embedding models capture continuous, ordinal attributes along linear directions, which we term rank axes. We define a model as rankable for an attribute if projecting embeddings onto such an axis preserves the attribute’s order. Across 7 popular encoders and 9 datasets with attributes like age, crowd count, head pose, aesthetics, and recency, we find that many embeddings are inherently rankable. Surprisingly, a small number of samples, or even just two extreme examples, often suffice to recover meaningful rank axes, without full-scale supervision. These findings open up new use cases for image ranking in vector databases and motivate further study into the structure and learning of rankable embeddings. Our code is available at this https URL.
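Editor's sketch: the abstract's idea of recovering a rank axis from just two extreme examples can be illustrated with a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' code: the 3-D embeddings, the `ages` attribute values, and all function names are hypothetical, and Spearman correlation is used here only as one plausible way to check that the projection preserves order.

```python
import math

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_axis_from_extremes(low_embedding, high_embedding):
    """Unit vector pointing from the low-attribute example to the high one."""
    direction = subtract(high_embedding, low_embedding)
    norm = math.sqrt(dot(direction, direction))
    return [x / norm for x in direction]

def ranks(values):
    """Rank positions (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order):
        result[index] = position
    return result

def spearman(a, b):
    """Spearman rank correlation for tie-free sequences."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical embeddings and attribute values (illustration only).
embeddings = [[0.1, 0.9, 0.2], [0.4, 0.7, 0.1], [0.6, 0.5, 0.3], [0.9, 0.1, 0.2]]
ages = [5, 20, 40, 70]

# Build the axis from the two extreme examples, then project everything onto it.
axis = rank_axis_from_extremes(embeddings[0], embeddings[-1])
scores = [dot(e, axis) for e in embeddings]
rho = spearman(scores, ages)
print(scores, rho)
```

In this toy setup the projected scores increase with the attribute, so the rank correlation is perfect; with real encoders the correlation would be an empirical quantity to measure per attribute.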
[CV-188] Dynamic Multimodal Prototype Learning in Vision-Language Models
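[Quick Read]: This paper addresses the problem that, during test-time adaptation (TTA) of pre-trained vision-language models (VLMs), ambiguous semantics in class names leave textual prototypes unable to fully capture visual concepts. The key to the solution is ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs at test time. ProtoMM treats each prototype as a discrete distribution over textual descriptions and visual particles, combining multimodal features for comprehensive prototype learning, and dynamically updates the visual particles so that the prototypes keep learning from the data and generalize better to unseen scenarios.
Link: https://arxiv.org/abs/2507.03657
Authors: Xingyu Zhu,Shuo Wang,Beier Zhu,Miaoge Li,Yunfan Li,Junfeng Fang,Zhicai Wang,Dongsheng Wang,Hanwang Zhang
Affiliations: University of Science and Technology of China; Nanyang Technological University; The Hong Kong Polytechnic University; Sichuan University; National University of Singapore; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: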
Abstract:With the increasing attention to pre-trained vision-language models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.
[CV-189] SecureT2I: No More Unauthorized Manipulation on AI Generated Images from Prompts
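[Quick Read]: This paper addresses the ethical and copyright concerns raised by text-guided image editing with diffusion models, namely how to prevent unauthorized image modifications. The key to the solution is the SecureT2I framework, which divides images into a permit set and a forbid set and designs dedicated training objectives for the forbid set that encourage vague or semantically ambiguous outputs, suppressing meaningful edits while preserving editing quality on the permit set. The core challenge is to block unauthorized editing without degrading edits of permitted inputs, for which separate loss functions are designed to guide selective editing behavior.
Link: https://arxiv.org/abs/2507.03636
Authors: Xiaodong Wu,Xiangman Li,Qi Li,Jianbing Ni,Rongxing Lu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: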
Abstract:Text-guided image manipulation with diffusion models enables flexible and precise editing based on prompts, but raises ethical and copyright concerns due to potential unauthorized modifications. To address this, we propose SecureT2I, a secure framework designed to prevent unauthorized editing in diffusion-based generative models. SecureT2I is compatible with both general-purpose and domain-specific models and can be integrated via lightweight fine-tuning without architectural changes. We categorize images into a permit set and a forbid set based on editing permissions. For the permit set, the model learns to perform high-quality manipulations as usual. For the forbid set, we introduce training objectives that encourage vague or semantically ambiguous outputs (e.g., blurred images), thereby suppressing meaningful edits. The core challenge is to block unauthorized editing while preserving editing quality for permitted inputs. To this end, we design separate loss functions that guide selective editing behavior. Extensive experiments across multiple datasets and models show that SecureT2I effectively degrades manipulation quality on forbidden images while maintaining performance on permitted ones. We also evaluate generalization to unseen inputs and find that SecureT2I consistently outperforms baselines. Additionally, we analyze different vagueness strategies and find that resize-based degradation offers the best trade-off for secure manipulation control.
[CV-190] From Video to EEG: Adapting Joint Embedding Predictive Architecture to Uncover Visual Concepts in Brain Signal Analysis
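[Quick Read]: This paper addresses the challenges in electroencephalography (EEG) signal analysis posed by limited labeled data, high dimensionality, and the lack of scalable models that fully capture spatiotemporal dependencies. The key to the solution is EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA uses joint embeddings and adaptive masking to learn semantically meaningful spatiotemporal representations, and achieves better classification performance than existing state-of-the-art models on the public Temple University Hospital Abnormal EEG dataset.
Link: https://arxiv.org/abs/2507.03633
Authors: Amir Hojjati,Lu Li,Ibrahim Hameed,Anis Yazidi,Pedro G. Lind,Rabindra Khadka
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: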
Abstract:EEG signals capture brain activity with high temporal and low spatial resolution, supporting applications such as neurological diagnosis, cognitive monitoring, and brain-computer interfaces. However, effective analysis is hindered by limited labeled data, high dimensionality, and the absence of scalable models that fully capture spatiotemporal dependencies. Existing self-supervised learning (SSL) methods often focus on either spatial or temporal features, leading to suboptimal representations. To this end, we propose EEG-VJEPA, a novel adaptation of the Video Joint Embedding Predictive Architecture (V-JEPA) for EEG classification. By treating EEG as video-like sequences, EEG-VJEPA learns semantically meaningful spatiotemporal representations using joint embeddings and adaptive masking. To our knowledge, this is the first work that exploits V-JEPA for EEG classification and explores the visual concepts learned by the model. Evaluations on the publicly available Temple University Hospital (TUH) Abnormal EEG dataset show that EEG-VJEPA outperforms existing state-of-the-art models in classification accuracy. Beyond classification accuracy, EEG-VJEPA captures physiologically relevant spatial and temporal signal patterns, offering interpretable embeddings that may support human-AI collaboration in diagnostic workflows. These findings position EEG-VJEPA as a promising framework for scalable, trustworthy EEG analysis in real-world clinical settings.
[CV-191] SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications ICCV2025
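[Quick Read]: This paper asks how video foundation models (ViFMs) can be transferred effectively to different scientific domains, and whether a single pretrained ViFM can compete with domain-specific baselines. The key to the solution is SciVid, a comprehensive benchmark of five scientific video tasks spanning medical computer vision, animal behavior, and weather forecasting. Six leading ViFMs are adapted to the benchmark with simple trainable readout modules, demonstrating the effectiveness of general-purpose representations on cross-domain tasks.
Link: https://arxiv.org/abs/2507.03578
Authors: Yana Hasson,Pauline Luc,Liliane Momeni,Maks Ovsjanikov,Guillaume Le Moing,Alina Kuznetsova,Ira Ktena,Jennifer J. Sun,Skanda Koppula,Dilara Gokay,Joseph Heyward,Etienne Pot,Andrew Zisserman
Affiliations: Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICCV 2025, GitHub repo: this https URL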
Abstract:In recent years, there has been a proliferation of spatiotemporal foundation models in different scientific disciplines. While promising, these models are often domain-specific and are only assessed within the particular applications for which they are designed. Given that many tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise as general-purpose domain-agnostic approaches. However, it is not known whether the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific disciplines, and if a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five Scientific Video tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading ViFMs to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by leveraging the general-purpose representations from ViFM backbones. Furthermore, our results reveal the limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We release our code at this https URL to facilitate further research in the development of ViFMs.
[CV-192] 2.5D Object Detection for Intelligent Roadside Infrastructure ITSC
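[Quick Read]: This paper addresses the perception limits of on-board sensors in autonomous vehicles, which suffer from occlusion and restricted fields of view and thereby complicate downstream driving decisions. The key to the solution is a 2.5D object detection framework for cameras mounted on roadside infrastructure. It detects the ground plane of each vehicle as a parallelogram in the image frame, preserving the object's planar position, size, and orientation while omitting its height, which improves generalization and robustness across viewpoints and adverse weather conditions.
Link: https://arxiv.org/abs/2507.03564
Authors: Nikolai Polley,Yacin Boualili,Ferdinand Mütsch,Maximilian Zipfl,Tobias Fleck,J. Marius Zöllner
Affiliations: Karlsruhe Institute of Technology; FZI Research Center for Information Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC)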
Abstract:On-board sensors of autonomous vehicles can be obstructed, occluded, or limited by restricted fields of view, complicating downstream driving decisions. Intelligent roadside infrastructure perception systems, installed at elevated vantage points, can provide wide, unobstructed intersection coverage, supplying a complementary information stream to autonomous vehicles via vehicle-to-everything (V2X) communication. However, conventional 3D object-detection algorithms struggle to generalize under the domain shift introduced by top-down perspectives and steep camera angles. We introduce a 2.5D object detection framework, tailored specifically for infrastructure roadside-mounted cameras. Unlike conventional 2D or 3D object detection, we employ a prediction approach to detect ground planes of vehicles as parallelograms in the image frame. The parallelogram preserves the planar position, size, and orientation of objects while omitting their height, which is unnecessary for most downstream applications. For training, a mix of real-world and synthetically generated scenes is leveraged. We evaluate generalizability on a held-out camera viewpoint and in adverse-weather scenarios absent from the training set. Our results show high detection accuracy, strong cross-viewpoint generalization, and robustness to diverse lighting and weather conditions. Model weights and inference code are provided at: this https URL
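Editor's sketch: to make the parallelogram representation concrete, the snippet below decodes a ground-plane parallelogram into its four image-frame corners and derives its area and heading. The parametrization (a center plus two edge vectors) and every name here are hypothetical choices for illustration; the abstract does not specify how the authors encode the parallelogram.

```python
import math

def parallelogram_corners(center, length_vec, width_vec):
    """Four corners (pixels) of a parallelogram given its center and edge vectors."""
    cx, cy = center
    lx, ly = length_vec
    wx, wy = width_vec
    return [
        (cx + lx / 2 + wx / 2, cy + ly / 2 + wy / 2),
        (cx + lx / 2 - wx / 2, cy + ly / 2 - wy / 2),
        (cx - lx / 2 - wx / 2, cy - ly / 2 - wy / 2),
        (cx - lx / 2 + wx / 2, cy - ly / 2 + wy / 2),
    ]

def parallelogram_area(length_vec, width_vec):
    """Area in square pixels: magnitude of the 2-D cross product of the edges."""
    return abs(length_vec[0] * width_vec[1] - length_vec[1] * width_vec[0])

def heading_degrees(length_vec):
    """Image-frame orientation of the vehicle's long edge."""
    return math.degrees(math.atan2(length_vec[1], length_vec[0]))

# Hypothetical detection: values are pixels in the image frame.
center = (100.0, 50.0)
length_vec = (40.0, 10.0)   # edge along the vehicle's length
width_vec = (5.0, 20.0)     # edge along the vehicle's width

corners = parallelogram_corners(center, length_vec, width_vec)
area = parallelogram_area(length_vec, width_vec)
heading = heading_degrees(length_vec)
print(corners, area, heading)
```

Unlike an axis-aligned 2D box, the two edge vectors need not be perpendicular in the image, which is what lets the shape follow the perspective distortion of a footprint seen from an elevated camera.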
[CV-193] Predicting Asphalt Pavement Friction Using Texture-Based Image Indicator
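[Quick Read]: This paper addresses how to predict pavement friction efficiently and inexpensively in order to improve road safety. The key to the solution is a texture-based image indicator, the aggregate protrusion area. Digital image analysis and statistical models relate this indicator to friction coefficients, enabling accurate prediction of pavement friction.
Link: https://arxiv.org/abs/2507.03559
Authors: Bingjie Lu,Zhengyang Lu,Yijiashun Qi,Hanzhe Guo,Tianyao Sun,Zunduo Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: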
Abstract:Pavement skid resistance is of vital importance for road safety. The objective of this study is to propose and validate a texture-based image indicator to predict pavement friction. This index enables pavement friction to be measured easily and inexpensively using digital images. Three different types of asphalt surfaces (dense-graded asphalt mix, open-grade friction course, and chip seal) were evaluated subject to various tire polishing cycles. Images were taken with corresponding friction measured using Dynamic Friction Tester (DFT) in the laboratory. The aggregate protrusion area is proposed as the indicator. Statistical models are established for each asphalt surface type to correlate the proposed indicator with friction coefficients. The results show that the adjusted R-square values of all relationships are above 0.90. Compared to other image-based indicators in the literature, the proposed image indicator more accurately reflects the changes in pavement friction with the number of polishing cycles, proving its cost-effective use for considering pavement friction in mix design stage.
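Editor's sketch: the abstract reports adjusted R-square values above 0.90 for models relating the image indicator to friction. The snippet below shows how that statistic is computed for a one-predictor least-squares fit. The data points are made up for illustration, and the linear model form is an assumption; the abstract only says "statistical models".

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def adjusted_r2(x, y, slope, intercept, num_predictors=1):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    n = len(x)
    mean_y = sum(y) / n
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - num_predictors - 1)

# Hypothetical measurements (not from the paper): indicator value per image
# and the friction coefficient measured for the same surface state.
protrusion_area = [10, 20, 30, 40, 50]
friction = [0.31, 0.39, 0.52, 0.58, 0.71]

slope, intercept = fit_line(protrusion_area, friction)
adj = adjusted_r2(protrusion_area, friction, slope, intercept)
print(slope, intercept, adj)
```

The adjustment penalizes the plain R-square for the number of predictors, which matters when only a handful of polishing-cycle measurements are available per surface type.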
[CV-194] An Advanced Deep Learning Framework for Ischemic and Hemorrhagic Brain Stroke Diagnosis Using Computed Tomography (CT) Images
[Quick Read]: This paper addresses the early, accurate, and fast diagnosis of brain stroke, aiming to improve diagnostic efficiency and accuracy. The key to the solution is combining lightweight pre-trained deep learning models with advanced optimization and classification techniques, using feature engineering and machine learning algorithms to improve classification performance. In the experiments, the combination of MobileNetV2, LDA, and SVC achieved the highest classification accuracy (97.93%).
Link: https://arxiv.org/abs/2507.03558
Authors: Md. Sabbir Hossen,Eshat Ahmed Shuvo,Shibbir Ahmed Arif,Pabon Shaha,Md. Saiduzzaman,Mostofa Kamal Nasir
Affiliations: Department of Computer Science and Engineering, Bangladesh University; School of Computing, Montclair State University; Department of Computer Science and Engineering, Mawlana Bhashani Science and Technology University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint version. Submitted for peer review
Abstract:Brain stroke is one of the leading causes of mortality and long-term disability worldwide, highlighting the need for precise and fast prediction techniques. Computed Tomography (CT) scan is considered one of the most effective methods for diagnosing brain strokes. The majority of stroke classification techniques rely on a single slice-level prediction mechanism, allowing the radiologist to manually choose the most critical CT slice from the original CT volume. Although clinical evaluations are often used in traditional diagnostic procedures, machine learning (ML) has opened up new avenues for improving stroke diagnosis. To supplement traditional diagnostic techniques, this study investigates the use of machine learning models, specifically concerning the prediction of brain stroke at an early stage utilizing CT scan images. In this research, we proposed a novel approach to brain stroke detection leveraging machine learning techniques, focusing on optimizing classification performance with pre-trained deep learning models and advanced optimization strategies. Pre-trained models, including DenseNet201, InceptionV3, MobileNetV2, ResNet50, and Xception, are utilized for feature extraction. Additionally, we employed feature engineering techniques, including BFO, PCA, and LDA, to enhance models’ performance further. These features are subsequently classified using machine learning algorithms such as SVC, RF, XGB, DT, LR, KNN, and GNB. Our experiments demonstrate that the combination of MobileNetV2, LDA, and SVC achieved the highest classification accuracy of 97.93%, significantly outperforming other model-optimizer-classifier combinations. The results underline the effectiveness of integrating lightweight pre-trained models with robust optimization and classification techniques for brain stroke diagnosis.
[CV-195] Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor CVPR2025
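[Quick Read]: This paper addresses the limited effectiveness of text-based visual descriptors in visual concept discovery and image classification. At its core is understanding what determines descriptor quality and how it relates to the pre-training data of vision-language models (VLMs). The key to the solution is two alignment-based metrics, Global Alignment and CLIP Similarity, which go beyond conventional accuracy evaluation to analyze how different descriptor generation strategies interact with the properties of foundation models.
Link: https://arxiv.org/abs/2507.03542
Authors: Ethan Lin,Linxi Zhao,Atharva Sehgal,Jennifer J. Sun
Affiliations: Cornell University; University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: VisCon @ CVPR 2025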
Abstract:Text-based visual descriptors, ranging from simple class names to more descriptive phrases, are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM’s pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics, Global Alignment and CLIP Similarity, that move beyond accuracy. These metrics allow us to shed light on how different descriptor generation strategies interact with foundation model properties, offering insights into ways of studying descriptor effectiveness beyond accuracy evaluations.
[CV-196] Foundation versus Domain-specific Models: Performance Comparison Fusion and Explainability in Face Recognition
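[Quick Read]: This paper compares generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) with domain-specific face recognition models (e.g., AdaFace or ArcFace) on the face recognition task. Through a series of experiments, it shows that domain-specific models outperform zero-shot generic foundation models on all datasets considered, and reveals the importance of contextual clues for foundation model performance. It further proposes score-level fusion of a foundation model with a domain-specific model to improve recognition accuracy at low false match rates, and the use of generative AI to add explainability to the face recognition pipeline.
Link: https://arxiv.org/abs/2507.03541
Authors: Redwan Sony,Parisa Farmanifard,Arun Ross,Anil K. Jain
Affiliations: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: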
Abstract:In this paper, we address the following question: How do generic foundation models (e.g., CLIP, BLIP, LLaVa, DINO) compare against a domain-specific face recognition model (viz., AdaFace or ArcFace) on the face recognition task? Through a series of experiments involving several foundation models and benchmark datasets, we are able to report the following findings: (a) In all datasets considered, domain-specific models outperformed zero-shot foundation models. (b) The performance of zero-shot generic foundation models is better on over-segmented face images than on tightly cropped faces, thereby suggesting the importance of contextual clues. For example, at a False Match Rate (FMR) of 0.01%, the True Match Rate (TMR) of OpenCLIP improved from 64.97% to 81.73% on the LFW dataset as the face crop increased from 112x112 to 250x250 while the TMR of domain-specific AdaFace dropped from 99.09% to 77.31%. (c) A simple score-level fusion of a foundation model with a domain-specific FR model improved the accuracy at low FMRs. For example, the TMR of AdaFace when fused with BLIP improved from 72.64% to 83.31% at an FMR of 0.0001% on the IJB-B dataset and from 73.17% to 85.81% on the IJB-C dataset. (d) Foundation models, such as ChatGPT, can be used to impart explainability to the FR pipeline (e.g., "Despite minor lighting and head tilt differences, the two left-profile images show high consistency in forehead slope, nose shape, chin contour..."). In some instances, foundation models are even able to resolve low-confidence decisions made by AdaFace (e.g., "Although AdaFace assigns a low similarity score of 0.21, both images exhibit visual similarity…and the pair is likely of the same person"), thereby reiterating the importance of combining domain-specific FR models with generic foundation models in a judicious manner.
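Editor's sketch: the abstract reports True Match Rate (TMR) at a fixed False Match Rate (FMR) before and after score-level fusion. The snippet below illustrates both ideas on toy data. Everything in it is an assumption for illustration: the scores are invented, min-max normalization followed by an equal-weight sum is just one simple fusion rule, and the abstract does not state which rule the authors used.

```python
def min_max_normalize(scores):
    """Rescale scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def normalize_pair(genuine, impostor):
    """Normalize one matcher's genuine and impostor scores on a common scale."""
    pooled = min_max_normalize(genuine + impostor)
    return pooled[:len(genuine)], pooled[len(genuine):]

def fuse(scores_a, scores_b, weight_a=0.5):
    """Weighted-sum fusion of two aligned, normalized score lists."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(scores_a, scores_b)]

def tmr_at_fmr(genuine, impostor, target_fmr):
    """TMR at the lowest threshold whose FMR does not exceed the target."""
    for threshold in sorted(set(genuine + impostor)):
        fmr = sum(s >= threshold for s in impostor) / len(impostor)
        if fmr <= target_fmr:
            return sum(s >= threshold for s in genuine) / len(genuine)
    return 0.0  # no observed threshold is strict enough

# Hypothetical similarity scores for the same comparison pairs from two matchers.
gen_a, imp_a = [0.9, 0.8, 0.3, 0.7], [0.2, 0.1, 0.4, 0.3]   # "domain-specific" matcher
gen_b, imp_b = [0.6, 0.7, 0.8, 0.5], [0.3, 0.5, 0.2, 0.4]   # "foundation" matcher

norm_gen_a, norm_imp_a = normalize_pair(gen_a, imp_a)
norm_gen_b, norm_imp_b = normalize_pair(gen_b, imp_b)
gen_fused = fuse(norm_gen_a, norm_gen_b)
imp_fused = fuse(norm_imp_a, norm_imp_b)

tmr_a = tmr_at_fmr(gen_a, imp_a, 0.0)
tmr_b = tmr_at_fmr(gen_b, imp_b, 0.0)
tmr_fused = tmr_at_fmr(gen_fused, imp_fused, 0.0)
print(tmr_a, tmr_b, tmr_fused)
```

In this toy case each matcher misses a different genuine pair, so the fused score separates all genuine pairs from all impostor pairs; real gains depend on how complementary the two matchers' errors are.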
[CV-197] CLOT: Closed Loop Optimal Transport for Unsupervised Action Segmentation ICCV2025
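[Quick Read]: This paper addresses the limited feedback between frame and action representations in unsupervised action segmentation, which results from the lack of segment-level supervision. The key to the solution is Closed Loop Optimal Transport (CLOT), a novel optimal transport (OT) framework that introduces a multi-level cyclic feature learning mechanism. Using an encoder-decoder architecture, it learns pseudo-labels together with frame and segment embeddings by solving two separate OT problems, and then refines the frame embeddings and pseudo-labels through cross-attention between frame and segment embeddings, which integrates a third OT problem.
Link: https://arxiv.org/abs/2507.03539
Authors: Elena Bueno-Benito,Mariella Dimiccoli
Affiliations: Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV2025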
Abstract:Unsupervised action segmentation has recently pushed its limits with ASOT, an optimal transport (OT)-based method that simultaneously learns action representations and performs clustering using pseudo-labels. Unlike other OT-based approaches, ASOT makes no assumptions on the action ordering, and it is able to decode a temporally consistent segmentation from a noisy cost matrix between video frames and action labels. However, the resulting segmentation lacks segment-level supervision, which limits the effectiveness of the feedback between frames and action representations. To address this limitation, we propose Closed Loop Optimal Transport (CLOT), a novel OT-based framework that introduces a multi-level cyclic feature learning mechanism. Leveraging its encoder-decoder architecture, CLOT learns pseudo-labels alongside frame and segment embeddings by solving two separate OT problems. It then refines both frame embeddings and pseudo-labels through cross-attention between the learned frame and segment embeddings, integrating a third OT problem. Experimental results on four benchmark datasets demonstrate the benefits of cyclical learning for unsupervised action segmentation.
[CV-198] PhenoBench: A Comprehensive Benchmark for Cell Phenotyping MICCAI2025
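[Quick Read]: This paper addresses the lack of a unified evaluation of foundation models (FMs) on cell phenotyping in digital pathology. The key to the solution is PhenoBench, a comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (HE) stained histopathology images. It comprises the PhenoCell dataset and ready-to-use fine-tuning and benchmarking code, enabling systematic evaluation of multiple prominent pathology foundation models in different generalization scenarios.
Link: https://arxiv.org/abs/2507.03532
Authors: Jerome Luescher,Nora Koreuber,Jannik Franzen,Fabian H. Reith,Claudia Winklmayr,Christian M. Schuerch,Dagmar Kainmueller,Josef Lorenz Rumberger
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted for presentation at MICCAI 2025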
Abstract:Digital pathology has seen the advent of a wealth of foundational models (FM), yet to date their performance on cell phenotyping has not been benchmarked in a unified manner. We therefore propose PhenoBench: A comprehensive benchmark for cell phenotyping on Hematoxylin and Eosin (HE) stained histopathology images. We provide both PhenoCell, a new HE dataset featuring 14 granular cell types identified by using multiplexed imaging, and ready-to-use fine-tuning and benchmarking code that allows the systematic evaluation of multiple prominent pathology FMs in terms of dense cell phenotype predictions in different generalization scenarios. We perform extensive benchmarking of existing FMs, providing insights into their generalization behavior under technical vs. medical domain shifts. Furthermore, while FMs achieve macro F1 scores 0.70 on previously established benchmarks such as Lizard and PanNuke, on PhenoCell, we observe scores as low as 0.20. This indicates a much more challenging task not captured by previous benchmarks, establishing PhenoCell as a prime asset for future benchmarking of FMs and supervised models alike. Code and data are available on GitHub.
[CV-199] Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding
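[Quick Read]: This paper addresses the challenge that fine-grained video classification involves complex spatio-temporal and semantic cues that exceed the capacity of a single modality. The key to the solution is a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms, regularized through feature-level augmentation and autoencoding techniques to improve robustness and performance.
Link: https://arxiv.org/abs/2507.03531
Authors: Namho Kim,Junhwa Kim
Affiliations: Korean Broadcasting System (KBS); Konyang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: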
Abstract:Fine-grained video classification requires understanding complex spatio-temporal and semantic cues that often exceed the capacity of a single modality. In this paper, we propose a multimodal framework that fuses video, image, and text representations using GRU-based sequence encoders and cross-modal attention mechanisms. The model is trained using a combination of classification or regression loss, depending on the task, and is further regularized through feature-level augmentation and autoencoding techniques. To evaluate the generality of our framework, we conduct experiments on two challenging benchmarks: the DVD dataset for real-world violence detection and the Aff-Wild2 dataset for valence-arousal estimation. Our results demonstrate that the proposed fusion strategy significantly outperforms unimodal baselines, with cross-attention and feature augmentation contributing notably to robustness and performance.
[CV-200] Information-Bottleneck Driven Binary Neural Network for Change Detection ICCV2025
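[Quick Read]: This paper addresses the poor performance of binary neural networks (BNNs) on change detection. Conventional approaches directly quantize weights and activations, which limits the network's representational power and makes it hard to distinguish changed from unchanged regions, significantly lowering detection accuracy. The key to the solution is to improve the representational power and feature separability of BNNs by introducing an auxiliary objective based on the Information Bottleneck (IB) principle, which guides the encoder to retain essential input information and strengthens feature discrimination. A compact, learnable auxiliary module serves as an approximation target, yielding a simple yet effective optimization strategy that minimizes both the reconstruction loss and the standard change detection loss.
Link: https://arxiv.org/abs/2507.03504
Authors: Kaijie Yin,Zhiyuan Zhang,Shu Kong,Tian Gao,Chengzhong Xu,Hui Kong
Affiliations: University of Macau; Singapore Management University; Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 Accepted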
Abstract:In this paper, we propose Binarized Change Detection (BiCD), the first binary neural network (BNN) designed specifically for change detection. Conventional network binarization approaches, which directly quantize both weights and activations in change detection models, severely limit the network’s ability to represent input data and distinguish between changed and unchanged regions. This results in significantly lower detection accuracy compared to real-valued networks. To overcome these challenges, BiCD enhances both the representational power and feature separability of BNNs, improving detection performance. Specifically, we introduce an auxiliary objective based on the Information Bottleneck (IB) principle, guiding the encoder to retain essential input information while promoting better feature discrimination. Since directly computing mutual information under the IB principle is intractable, we design a compact, learnable auxiliary module as an approximation target, leading to a simple yet effective optimization strategy that minimizes both reconstruction loss and standard change detection loss. Extensive experiments on street-view and remote sensing datasets demonstrate that BiCD establishes a new benchmark for BNN-based change detection, achieving state-of-the-art performance in this domain.
[CV-201] Radar Velocity Transformer: Single-scan Moving Object Segmentation in Noisy Radar Point Clouds ICRA
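[Quick Read]: This paper addresses moving object segmentation in noisy radar point clouds, while also distinguishing parked from moving vehicles to improve scene understanding. The key to the solution is the Radar Velocity Transformer, a transformer-based architecture that incorporates the valuable Doppler velocity information in every module of the network, enabling precise segmentation of moving and non-moving objects.
Link: https://arxiv.org/abs/2507.03463
Authors: Matthias Zeller,Vardeep S. Sandhu,Benedikt Mersch,Jens Behley,Michael Heidingsfeld,Cyrill Stachniss
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proc. of the IEEE Intl. Conf. on Robotics Automation (ICRA)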
Abstract:The awareness about moving objects in the surroundings of a self-driving vehicle is essential for safe and reliable autonomous navigation. The interpretation of LiDAR and camera data achieves exceptional results but typically requires accumulating and processing temporal sequences of data in order to extract motion information. In contrast, radar sensors, which are already installed in most recent vehicles, can overcome this limitation as they directly provide the Doppler velocity of the detections and, hence, incorporate instantaneous motion information within a single measurement. In this paper, we tackle the problem of moving object segmentation in noisy radar point clouds. We also consider differentiating parked from moving cars, to enhance scene understanding. Instead of exploiting temporal dependencies to identify moving objects, we develop a novel transformer-based approach to perform single-scan moving object segmentation in sparse radar scans accurately. The key to our Radar Velocity Transformer is to incorporate the valuable velocity information throughout each module of the network, thereby enabling the precise segmentation of moving and non-moving objects. Additionally, we propose a transformer-based upsampling, which enhances the performance by adaptively combining information and overcoming the limitation of interpolation of sparse point clouds. Finally, we create a new radar moving object segmentation benchmark based on the RadarScenes dataset and compare our approach to other state-of-the-art methods. Our network runs faster than the frame rate of the sensor and shows superior segmentation results using only single-scan radar data.
[CV-202] Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach
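[Quick Read]: This paper addresses the insufficient perception of local visual details by vision-language models (VLMs) such as CLIP in cross-modal semantic alignment. Conventional prompt engineering relies mainly on coarse-grained category labels and neglects fine-grained local semantics. The key to the solution is stochastic multi-crop augmentation, which activates CLIP's capacity for localized feature analysis, constraining the model's receptive field and recalibrating its attention mechanism to mitigate its inherent bias toward global image patterns.
Link: https://arxiv.org/abs/2507.03458
Authors: Leyan Xue,Zongbo Han,Guangyu Wang,Qinghua Hu,Mingyue Cheng,Changqing Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: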
Abstract:Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP’s strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to "See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP’s latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model’s receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that DD achieves promising performance.
[CV-203] Evaluating the Evaluators: Trust in Adversarial Robustness Tests
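[Quick Read]: This paper addresses the lack of consistency and reliability in current evaluations of adversarial evasion attacks. Existing evaluations often rely on mismatched models, unverified implementations, and uneven computational budgets, leading to biased results and a false sense of security. The key to the solution is AttackBench, a benchmark framework for assessing the effectiveness of gradient-based attacks under standardized and reproducible conditions. AttackBench ranks existing attack implementations with a novel optimality metric, allowing researchers to identify the most reliable and effective attacks and making robustness verification more trustworthy.
Link: https://arxiv.org/abs/2507.03450
Authors: Antonio Emanuele Cinà,Maura Pintor,Luca Demetrio,Ambra Demontis,Battista Biggio,Fabio Roli
Affiliations: DIBRIS - Department of Informatics, Bioengineering, Robotics and Systems Engineering, University of Genoa; Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: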
Abstract:Despite significant progress in designing powerful adversarial evasion attacks for robustness verification, the evaluation of these methods often remains inconsistent and unreliable. Many assessments rely on mismatched models, unverified implementations, and uneven computational budgets, which can lead to biased results and a false sense of security. Consequently, robustness claims built on such flawed testing protocols may be misleading and give a false sense of security. As a concrete step toward improving evaluation reliability, we present AttackBench, a benchmark framework developed to assess the effectiveness of gradient-based attacks under standardized and reproducible conditions. AttackBench serves as an evaluation tool that ranks existing attack implementations based on a novel optimality metric, which enables researchers and practitioners to identify the most reliable and effective attack for use in subsequent robustness evaluations. The framework enforces consistent testing conditions and enables continuous updates, making it a reliable foundation for robustness verification.
[CV-204] Radar Tracker: Moving Instance Tracking in Sparse and Noisy Radar Point Clouds ICRA
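[Quick Read]: This paper tackles instance tracking of moving objects in sparse radar point clouds to improve scene understanding. The key to the solution is a learning-based radar tracker that uses temporal offset predictions for direct center-based association and adds motion cues to improve segmentation. It also brings in appearance features through an attention mechanism, and the final association combines geometric and appearance features to overcome the limitations of center-based tracking and associate instances more reliably.
Link: https://arxiv.org/abs/2507.03441
Authors: Matthias Zeller, Daniel Casado Herraez, Jens Behley, Michael Heidingsfeld, Cyrill Stachniss
Affiliations: CARIAD SE; Center for Robotics, University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proc. of the IEEE Intl. Conf. on Robotics and Automation (ICRA)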
Abstract:Robots and autonomous vehicles should be aware of what happens in their surroundings. The segmentation and tracking of moving objects are essential for reliable path planning, including collision avoidance. We investigate this estimation task for vehicles using radar sensing. We address moving instance tracking in sparse radar point clouds to enhance scene interpretation. We propose a learning-based radar tracker incorporating temporal offset predictions to enable direct center-based association and enhance segmentation performance by including additional motion cues. We implement attention-based tracking for sparse radar scans to include appearance features and enhance performance. The final association combines geometric and appearance features to overcome the limitations of center-based tracking to associate instances reliably. Our approach shows an improved performance on the moving instance tracking benchmark of the RadarScenes dataset compared to the current state of the art.
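As a rough illustration of center-based association with temporal offsets, the sketch below shifts each current instance center by a predicted offset and greedily matches it to the nearest previous center. The function name, the offset sign convention (pointing back to the previous scan), the distance gate, and the greedy matching are all assumptions for illustration; the paper's final association also uses appearance features, which are omitted here.

```python
import math

def associate_by_center(prev_centers, curr_centers, curr_offsets, max_dist=2.0):
    """Greedily match current instances to previous ones.

    Each current center is shifted by its predicted offset back to the previous
    scan, then matched to the closest unmatched previous center within max_dist.
    Returns {current_index: previous_index}.
    """
    candidates = []
    for i, ((cx, cy), (ox, oy)) in enumerate(zip(curr_centers, curr_offsets)):
        for j, (px, py) in enumerate(prev_centers):
            d = math.hypot(cx + ox - px, cy + oy - py)
            if d <= max_dist:
                candidates.append((d, i, j))
    matches, used_prev = {}, set()
    for d, i, j in sorted(candidates):  # closest pairs are assigned first
        if i not in matches and j not in used_prev:
            matches[i] = j
            used_prev.add(j)
    return matches

prev = [(0.0, 0.0), (10.0, 0.0)]
curr = [(10.5, 0.2), (1.0, 0.0)]
offsets = [(-0.5, -0.2), (-1.0, 0.0)]  # hypothetical predicted motion back to the previous scan
matches = associate_by_center(prev, curr, offsets)
```

Current instances that find no previous center within `max_dist` are simply left out of the result, which a tracker could treat as new tracks.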
[CV-205] Unlearning the Noisy Correspondence Makes CLIP More Robust ICCV2025
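[Quick Read]: This paper aims to solve the Noisy Correspondence (NC) problem that arises when Vision-Language Models (VLMs) are trained on large-scale data of declining quality, which significantly hurts model performance. The key to the solution is NCU, a Noisy Correspondence Unlearning fine-tuning framework that makes VLMs more robust by forgetting learned noisy knowledge. Its core idea is to learn the hardest negative information, which gives an explicit unlearning direction for both false positives and false negatives, and to formalize this twin-goal unlearning process as a single unified optimal transport objective for fast fine-tuning.
Link: https://arxiv.org/abs/2507.03434
Authors: Haochen Han, Alex Jinpeng Wang, Peijun Ye, Fangming Liu
Affiliations: Peng Cheng Laboratory; Central South University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: ICCV 2025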
Abstract:The data appetite for Vision-Language Models (VLMs) has continuously scaled up from the early millions to billions today, which faces an untenable trade-off with data quality and inevitably introduces Noisy Correspondence (NC) samples. Undoubtedly, such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, such resource-intensive pipelines that train VLMs from scratch struggle to meet realistic data demands. In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs’ robustness by forgetting learned noisy knowledge. The key to NCU is learning the hardest negative information, which can provide an explicit unlearning direction for both false positives and false negatives. This twin-goal unlearning process can be formalized into one unified optimal transport objective for fast fine-tuning. We validate our approach with the prevailing CLIP model over various downstream tasks. Remarkably, NCU surpasses the robust pre-trained method on zero-shot transfer with lower computational overhead. The code will be released upon acceptance.
[CV-206] Rectifying Adversarial Sample with Low Entropy Prior for Test-Time Defense
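[Quick Read]: This paper addresses the failure of existing defenses against unknown attacks, which limits how well adversarial robustness generalizes. The key to the solution is the overlooked low entropy (LE) prior shared by various adversarial samples. Building on it, the paper proposes the two-stage REAL method, which rectifies adversarial samples at test time with a Max-Min entropy optimization strategy and thereby improves robustness to unseen attacks.
Link: https://arxiv.org/abs/2507.03427
Authors: Lina Ma, Xiaowei Fu, Fuxiang Huang, Xinbo Gao, Lei Zhang
Affiliations: Chongqing University; The Hong Kong University of Science and Technology; Chongqing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in IEEE Transactions on Multimedia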
Abstract:Existing defense methods fail to defend against unknown attacks and thus raise a generalization issue for adversarial robustness. To remedy this problem, we attempt to delve into some underlying common characteristics among various attacks for generality. In this work, we reveal the commonly overlooked low entropy prior (LE) implied in various adversarial samples, and shed light on the universal robustness against unseen attacks in the inference phase. The LE prior is elaborated as two properties across various attacks as shown in Fig. 1 and Fig. 2: 1) low entropy misclassification for adversarial samples and 2) lower entropy prediction for higher attack intensity. This phenomenon stands in stark contrast to the naturally distributed samples. The LE prior can instruct existing test-time defense methods, thus we propose a two-stage REAL approach: Rectify Adversarial sample based on LE prior for test-time adversarial rectification. Specifically, to align adversarial samples more closely with clean samples, we propose to first rectify adversarial samples misclassified with low entropy by reverse maximizing prediction entropy, thereby eliminating their adversarial nature. To ensure the rectified samples can be correctly classified with low entropy, we carry out secondary rectification by forward minimizing prediction entropy, thus creating a Max-Min entropy optimization scheme. Further, based on the second property, we propose an attack-aware weighting mechanism to adaptively adjust the strengths of Max-Min entropy objectives. Experiments on several datasets show that REAL can greatly improve the performance of existing sample rectification models.
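The quantity at the center of this abstract is the entropy of a classifier's predicted distribution. The sketch below only shows how that entropy is computed and that a sharply peaked prediction has lower entropy than a uniform one; the logits are made-up values, and the Max-Min rectification procedure itself is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_entropy(probs):
    """Shannon entropy (in nats) of a probability vector; 0 * log 0 is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = softmax([8.0, 0.0, 0.0])  # sharply peaked prediction (illustrative logits)
uncertain = softmax([1.0, 1.0, 1.0])  # uniform prediction
low_entropy_first = prediction_entropy(confident) < prediction_entropy(uncertain)
```

Under the LE prior described above, a misclassified adversarial sample would tend to look like the `confident` case, which is why the first rectification stage raises entropy before the second stage lowers it again.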
[CV-207] Pose-Star: Anatomy-Aware Editing for Open-World Fashion Images ICCV25
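[Quick Read]: This paper targets existing two-stage fashion image editing pipelines, which over-optimize the generator while neglecting mask controllability, leading to two key limitations: poor user-defined flexibility and weak pose robustness. The key to the solution is the Pose-Star framework, which dynamically recomposes body structures into anatomy-aware masks. It calibrates diffusion-derived attention (Star tokens) with skeletal keypoints to better localize rare structures in complex poses, suppresses noise through phase-aware analysis of attention dynamics (convergence, stabilization, divergence) combined with threshold masking and sliding-window fusion, and refines edges through cross-self attention merging and Canny alignment, yielding anatomy-aware, pose-robust editing.
Link: https://arxiv.org/abs/2507.03402
Authors: Yuran Dong, Mang Ye
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 17 figures, ICCV25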
Abstract:To advance real-world fashion image editing, we analyze existing two-stage pipelines (mask generation followed by diffusion-based editing) which overly prioritize generator optimization while neglecting mask controllability. This results in two critical limitations: I) poor user-defined flexibility (coarse-grained human masks restrict edits to predefined regions like upper torso; fine-grained clothes masks preserve poses but forbid style/length customization). II) weak pose robustness (mask generators fail due to articulated poses and miss rare regions like waist, while human parsers remain limited by predefined categories). To address these gaps, we propose Pose-Star, a framework that dynamically recomposes body structures (e.g., neck, chest, etc.) into anatomy-aware masks (e.g., chest-length) for user-defined edits. In Pose-Star, we calibrate diffusion-derived attention (Star tokens) via skeletal keypoints to enhance rare structure localization in complex poses, suppress noise through phase-aware analysis of attention dynamics (Convergence, Stabilization, Divergence) with threshold masking and sliding-window fusion, and refine edges via cross-self attention merging and Canny alignment. This work bridges controlled benchmarks and open-world demands, pioneering anatomy-aware, pose-robust editing and laying the foundation for industrial fashion image editing.
[CV-208] Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering ICCV2025
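[Quick Read]: This paper addresses normal estimation for noisy point clouds, a persistent challenge in 3D geometry processing, especially for end-to-end normal estimation. Existing methods generally target relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. The paper proposes learning normals from noisy point clouds through local gradient-aware surface filtering. The key is to project noisy points onto the underlying surface using normals and distances derived from an implicit function constrained by local gradients, combining a distance measurement operator for global surface fitting with implicit field-based filtering, and adding local gradient consistency, orientation, and aggregation constraints to avoid over-smoothing and gradient degradation.
Link: https://arxiv.org/abs/2507.03394
Authors: Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu
Affiliations: Southwest Jiaotong University; Xihua University; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025. Code: this https URL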
Abstract:Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at this https URL.
[CV-209] Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
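[Quick Read]: This paper addresses procedure planning in instructional videos, which aims to generate coherent, task-aligned action sequences from start and end visual observations. The key to the solution is the Masked Temporal Interpolation Diffusion (MTID) model, which adds a latent space temporal interpolation module to the diffusion model. The module uses a learnable interpolation matrix to generate intermediate latent features, enriching visual supervision with mid-state details. Integrating this enriched supervision enables end-to-end, task-specific training and significantly improves the model's ability to predict temporally coherent action sequences.
Link: https://arxiv.org/abs/2507.03393
Authors: Yufan Zhou, Zhaobo Qi, Lingshuai Lin, Junqi Jing, Tingting Chai, Beichen Zhang, Shuhui Wang, Weigang Zhang
Affiliations: Harbin Institute of Technology, Weihai; Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: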
Abstract:In this paper, we address the challenge of procedure planning in instructional videos, aiming to generate coherent and task-aligned action sequences from start and end visual observations. Previous work has mainly relied on text-level supervision to bridge the gap between observed states and unobserved actions, but it struggles with capturing intricate temporal relationships among actions. Building on these efforts, we propose the Masked Temporal Interpolation Diffusion (MTID) model that introduces a latent space temporal interpolation module within the diffusion model. This module leverages a learnable interpolation matrix to generate intermediate latent features, thereby augmenting visual supervision with richer mid-state details. By integrating this enriched supervision into the model, we enable end-to-end training tailored to task-specific requirements, significantly enhancing the model’s capacity to predict temporally coherent action sequences. Additionally, we introduce an action-aware mask projection mechanism to restrict the action generation space, combined with a task-adaptive masked proximity loss to prioritize more accurate reasoning results close to the given start and end states over those in intermediate steps. Simultaneously, it filters out task-irrelevant action predictions, leading to contextually aware action sequences. Experimental results across three widely used benchmark datasets demonstrate that our MTID achieves promising action planning performance on most metrics. The code is available at this https URL.
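To make the idea of generating intermediate latent features between a start and an end state concrete, here is a minimal sketch. It uses fixed, evenly spaced scalar weights as a stand-in; in MTID the interpolation matrix is learnable, and its exact form is not given in the abstract. The function names and the toy feature vectors are hypothetical.

```python
def linear_weights(n_steps):
    """Evenly spaced weights strictly between 0 and 1 for n_steps intermediate states."""
    return [(t + 1) / (n_steps + 1) for t in range(n_steps)]

def interpolate_latents(start, end, weights):
    """Blend start and end feature vectors; weights[t] is the share of `end` at step t."""
    return [[(1.0 - w) * s + w * e for s, e in zip(start, end)] for w in weights]

# Toy 2-D latent features for the observed start and end states (illustrative values).
mid_states = interpolate_latents([0.0, 2.0], [4.0, 2.0], linear_weights(3))
```

Replacing `linear_weights` with trainable parameters would be the natural step toward a learnable variant, though how the paper parameterizes and trains them is not shown here.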
[CV-210] MRC-DETR: An Adaptive Multi-Residual Coupled Transformer for Bare Board PCB Defect Detection
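[Quick Read]: This paper addresses limited feature representation, computational redundancy, and the shortage of high-quality training data in defect detection for Printed Circuit Boards (PCBs), which keep existing methods from meeting industrial demands for accuracy and efficiency. The key to the solution is the MRC-DETR framework, which strengthens feature representation with a Multi-Residual Directional Coupled Block (MRDCB), reduces computational redundancy with an Adaptive Screening Pyramid Network (ASPN), and contributes a high-quality PCB defect dataset to fill the gap in existing public resources.
Link: https://arxiv.org/abs/2507.03386
Authors: Jiangzhong Cao, Huanqi Wu, Xu Zhang, Lianghong Tan, Huan Zhang
Affiliations: Guangdong University of Technology; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: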
Abstract:In modern electronic manufacturing, defect detection on Printed Circuit Boards (PCBs) plays a critical role in ensuring product yield and maintaining the reliability of downstream assembly processes. However, existing methods often suffer from limited feature representation, computational redundancy, and insufficient availability of high-quality training data – challenges that hinder their ability to meet industrial demands for both accuracy and efficiency. To address these limitations, we propose MRC-DETR, a novel and efficient detection framework tailored for bare PCB defect inspection, built upon the foundation of RT-DETR. Firstly, to enhance feature representation capability, we design a Multi-Residual Directional Coupled Block (MRDCB). This module improves channel-wise feature interaction through a multi-residual structure. Moreover, a cross-spatial learning strategy is integrated to capture fine-grained pixel-level relationships, further enriching the representational power of the extracted features. Secondly, to reduce computational redundancy caused by inefficient cross-layer information fusion, we introduce an Adaptive Screening Pyramid Network (ASPN). This component dynamically filters and aggregates salient low-level features, selectively fusing them with high-level semantic features. By focusing on informative regions and suppressing redundant computations, ASPN significantly improves both efficiency and detection accuracy. Finally, to tackle the issue of insufficient training data, particularly in the context of bare PCBs, we construct a new, high-quality dataset that fills a critical gap in current public resources. Our dataset not only supports the training and evaluation of our proposed framework but also serves as a valuable benchmark for future research in this domain.
[CV-211] Be the Change You Want to See: Revisiting Remote Sensing Change Detection Practices
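[Quick Read]: This paper examines where performance gains in remote sensing change detection actually come from: existing methods focus on adding complex new components while overlooking the effect of fundamental design choices such as backbone selection, pre-training strategy, and training configuration. The core of the solution is to systematically revisit the design space of change detection models and analyze the full potential of a well-optimized baseline, identifying fundamental design choices that benefit both new and existing architectures. With these choices, even an architecturally simple model matches or surpasses the state of the art on six challenging change detection datasets, indicating that fundamental design choices remain underexplored.
Link: https://arxiv.org/abs/2507.03367
Authors: Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
Affiliations: University of Ljubljana, Faculty of Computer and Information Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by IEEE TGRS: this https URL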
Abstract:Remote sensing change detection aims to localize semantic changes between images of the same location captured at different times. In the past few years, newer methods have attributed enhanced performance to the additions of new and complex components to existing architectures. Most fail to measure the performance contribution of fundamental design choices such as backbone selection, pre-training strategies, and training configurations. We claim that such fundamental design choices often improve performance even more significantly than the addition of new architectural components. Due to that, we systematically revisit the design space of change detection models and analyse the full potential of a well-optimised baseline. We identify a set of fundamental design choices that benefit both new and existing architectures. Leveraging this insight, we demonstrate that when carefully designed, even an architecturally simple model can match or surpass state-of-the-art performance on six challenging change detection datasets. Our best practices generalise beyond our architecture and also offer performance improvements when applied to related methods, indicating that the space of fundamental design choices has been underexplored. Our guidelines and architecture provide a strong foundation for future methods, emphasizing that optimizing core components is just as important as architectural novelty in advancing change detection performance. Code: this https URL
[CV-212] DESign: Dynamic Context-Aware Convolution and Efficient Subnet Regularization for Continuous Sign Language Recognition
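[Quick Read]: This paper addresses the difficulty of handling diverse samples in continuous sign language recognition (CSLR), in particular the failure of existing methods to capture temporal dynamics and contextual dependencies. The key to the solution is the DESign framework, which comprises Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC adapts convolutional weights in a fine-grained manner based on contextual information and dynamically captures inter-frame motion cues, improving generalization across diverse signing behaviors. SR-CTC applies supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and preventing overfitting; it adds no inference overhead and integrates seamlessly into existing CSLR models.
Link: https://arxiv.org/abs/2507.03339
Authors: Sheng Liu, Yiheng Yu, Yuan Feng, Min Xu, Zhelun Jin, Yining Jiang, Tiantian Yuan
Affiliations: Zhejiang University of Technology; Tianjin University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: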
Abstract:Current continuous sign language recognition (CSLR) methods struggle with handling diverse samples. Although dynamic convolutions are ideal for this task, they mainly focus on spatial modeling and fail to capture the temporal dynamics and contextual dependencies. To address this, we propose DESign, a novel framework that incorporates Dynamic Context-Aware Convolution (DCAC) and Subnet Regularization Connectionist Temporal Classification (SR-CTC). DCAC dynamically captures the inter-frame motion cues that constitute signs and uniquely adapts convolutional weights in a fine-grained manner based on contextual information, enabling the model to better generalize across diverse signing behaviors and boost recognition accuracy. Furthermore, we observe that existing methods still rely on only a limited number of frames for parameter updates during training, indicating that CTC learning overfits to a dominant path. To address this, SR-CTC regularizes training by applying supervision to subnetworks, encouraging the model to explore diverse CTC alignment paths and effectively preventing overfitting. A classifier-sharing strategy in SR-CTC further strengthens multi-scale consistency. Notably, SR-CTC introduces no inference overhead and can be seamlessly integrated into existing CSLR models to boost performance. Extensive ablations and visualizations further validate the effectiveness of the proposed methods. Results on mainstream CSLR datasets (i.e., PHOENIX14, PHOENIX14-T, CSL-Daily) demonstrate that DESign achieves state-of-the-art performance.
[CV-213] De-Fake: Style based Anomaly Deepfake Detection
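[Quick Read]: This paper addresses the detection of face-swap deepfakes, especially in real-world scenarios where freely available tools let anyone swap faces without technical knowledge, enabling misinformation, reputational damage, political manipulation, non-consensual intimate deepfakes (NCID), and child sexual abuse material (CSAM). Existing detectors rely on facial landmarks or inconsistencies in pixel-level features and struggle when the source face is seamlessly blended into the target image or video. The key to the solution is to detect face-swapped images from style discrepancies without accessing the real facial image, enabling effective detection while preserving privacy.
Link: https://arxiv.org/abs/2507.03334
Authors: Sudev Kumar Padhi, Harshit Kumar, Umesh Kashyap, Sk. Subidh Ali
Affiliations: Indian Institute of Technology Bhilai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: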
Abstract:Detecting deepfakes involving face-swaps presents a significant challenge, particularly in real-world scenarios where anyone can perform face-swapping with freely available tools and apps without any technical knowledge. Existing deepfake detection methods rely on facial landmarks or inconsistencies in pixel-level features and often struggle with face-swap deepfakes, where the source face is seamlessly blended into the target image or video. The prevalence of face-swap is evident in everyday life, where it is used to spread false information, damage reputations, manipulate political opinions, create non-consensual intimate deepfakes (NCID), and exploit children by enabling the creation of child sexual abuse material (CSAM). Even prominent public figures are not immune to its impact, with numerous deepfakes of them circulating widely across social media platforms. Another challenge faced by deepfake detection methods is the creation of datasets that encompass a wide range of variations, as training models require substantial amounts of data. This raises privacy concerns, particularly regarding the processing and storage of personal facial data, which could lead to unauthorized access or misuse. Our key idea is to identify these style discrepancies to detect face-swapped images effectively without accessing the real facial image. We perform comprehensive evaluations using multiple datasets and face-swapping methods, which showcases the effectiveness of SafeVision in detecting face-swap deepfakes across diverse scenarios. SafeVision offers a reliable and scalable solution for detecting face-swaps in a privacy preserving manner, making it particularly effective in challenging real-world applications. To the best of our knowledge, SafeVision is the first deepfake detection using style features while providing inherent privacy protection.
[CV-214] Task-Specific Generative Dataset Distillation with Difficulty-Guided Sampling
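[Quick Read]: This paper addresses the reliance of deep neural networks on large-scale datasets by using dataset distillation to generate compact, high-quality synthetic datasets that achieve performance comparable to the original. The key to the solution is a task-specific sampling strategy that introduces the notion of difficulty to better meet the needs of the target task. Specifically, the final dataset is sampled from a larger image pool with a sampling distribution that matches the difficulty distribution of the original dataset, and a logarithmic transformation is applied as a pre-processing step to correct for distributional bias.
Link: https://arxiv.org/abs/2507.03331
Authors: Mingzhuo Li, Guang Li, Jiafeng Mao, Linfeng Ye, Takahiro Ogawa, Miki Haseyama
Affiliations: Hokkaido University; The University of Tokyo; University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: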
Abstract:To alleviate the reliance of deep neural networks on large-scale datasets, dataset distillation aims to generate compact, high-quality synthetic datasets that can achieve comparable performance to the original dataset. The integration of generative models has significantly advanced this field. However, existing approaches primarily focus on aligning the distilled dataset with the original one, often overlooking task-specific information that can be critical for optimal downstream performance. In this paper, focusing on the downstream task of classification, we propose a task-specific sampling strategy for generative dataset distillation that incorporates the concept of difficulty to consider the requirements of the target task better. The final dataset is sampled from a larger image pool with a sampling distribution obtained by matching the difficulty distribution of the original dataset. A logarithmic transformation is applied as a pre-processing step to correct for distributional bias. The results of extensive experiments demonstrate the effectiveness of our method and suggest its potential for enhancing performance on other downstream tasks.
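The sampling step described above (log-transform the difficulty scores, then draw from a larger pool so that the difficulty histogram follows the original dataset) might look roughly like the sketch below. The binning scheme, the per-bin quota rule, and all names are assumptions for illustration, and how difficulty scores are obtained is left out; this is not the authors' implementation.

```python
import math
import random
from collections import Counter

def log_bin(score, n_bins, max_score):
    """Bin index of a non-negative difficulty score after a log1p transform."""
    frac = math.log1p(score) / math.log1p(max_score)
    return min(n_bins - 1, int(frac * n_bins))

def sample_matching_difficulty(original_scores, pool_scores, budget, n_bins=4, seed=0):
    """Pick up to `budget` pool indices whose binned difficulty follows the original's."""
    rng = random.Random(seed)
    # Assumes at least one strictly positive score so the log scale is well defined.
    max_score = max(max(original_scores), max(pool_scores))
    target = Counter(log_bin(s, n_bins, max_score) for s in original_scores)
    pool_by_bin = {}
    for idx, s in enumerate(pool_scores):
        pool_by_bin.setdefault(log_bin(s, n_bins, max_score), []).append(idx)
    chosen = []
    for b, count in sorted(target.items()):
        quota = round(budget * count / len(original_scores))  # share of the budget for this bin
        candidates = pool_by_bin.get(b, [])
        chosen.extend(rng.sample(candidates, min(quota, len(candidates))))
    return chosen[:budget]

chosen = sample_matching_difficulty(
    [0.1, 0.2, 1.0, 5.0],                       # difficulty of original samples (made up)
    [0.1, 0.15, 0.2, 0.9, 1.1, 4.0, 5.0, 6.0],  # difficulty of generated pool images (made up)
    budget=4,
)
```

Because quotas are rounded per bin and capped by what the pool offers, the function can return fewer than `budget` indices when the pool lacks images in some difficulty range.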
[CV-215] Exploring Object Status Recognition for Recipe Progress Tracking in Non-Visual Cooking
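[Quick Read]: This paper addresses the lack of effective progress tracking and contextual feedback for people with vision impairments during non-visual cooking. The key to the solution is OSCAR (Object Status Context Awareness for Recipes), a technical pipeline that uses object status recognition for context-aware recipe progress tracking. It integrates recipe parsing, object status extraction, visual alignment with cooking steps, and time-causal modeling to support real-time step prediction.
Link: https://arxiv.org/abs/2507.03330
Authors: Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington
Affiliations: Carnegie Mellon University; Singapore Management University
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: ASSETS 2025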
Abstract:Cooking plays a vital role in everyday independence and well-being, yet remains challenging for people with vision impairments due to limited support for tracking progress and receiving contextual feedback. Object status - the condition or transformation of ingredients and tools - offers a promising but underexplored foundation for context-aware cooking support. In this paper, we present OSCAR (Object Status Context Awareness for Recipes), a technical pipeline that explores the use of object status recognition to enable recipe progress tracking in non-visual cooking. OSCAR integrates recipe parsing, object status extraction, visual alignment with cooking steps, and time-causal modeling to support real-time step tracking. We evaluate OSCAR on 173 instructional videos and a real-world dataset of 12 non-visual cooking sessions recorded by BLV individuals in their homes. Our results show that object status consistently improves step prediction accuracy across vision-language models, and reveal key factors that impact performance in real-world conditions, such as implicit tasks, camera placement, and lighting. We contribute the pipeline of context-aware recipe progress tracking, an annotated real-world non-visual cooking dataset, and design insights to guide future context-aware assistive cooking systems.
[CV-216] Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents
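[Quick Read]: This paper addresses ad banner generation for commercial design, which requires not only visual fidelity but also structured layouts, precise typography, and consistent branding. The key to the solution is the MIMO (Mirror In-the-Model) framework, which combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality to automate ad banner generation.
Link: https://arxiv.org/abs/2507.03326
Authors: Zhao Wang, Bowen Chen, Yotaro Shimose, Sota Moriyama, Heng Wang, Shingo Takamatsu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: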
Abstract:Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity – they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce MIMO (Mirror In-the-Model), an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multi-modal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural language based prompt and logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion and LLM-based baselines in real-world banner design scenarios.
[CV-217] Source-Free Domain Adaptation via Multi-view Contrastive Learning
[Quick Read]: This paper addresses two key problems in Source-Free Unsupervised Domain Adaptation (SFUDA), where the source domain is unavailable: low-quality prototype samples and incorrectly assigned pseudo-labels. The key to the solution is a three-phase method: a Reliable Sample Memory (RSM) module first makes prototype samples more representative, Multi-View Contrastive Learning (MVCL) then improves pseudo-label quality through data augmentation, and a noisy label filtering technique finally refines the pseudo-labels further.
Link: https://arxiv.org/abs/2507.03321
Authors: Amirfarhad Farhadi, Naser Mozayani, Azadeh Zamanifar
Affiliations: School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran; Department of Computer Engineering, SR.C, Islamic Azad University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Domain adaptation has become a widely adopted approach in machine learning due to the high costs associated with labeling data. It is typically applied when access to a labeled source domain is available. However, in real-world scenarios, privacy concerns often restrict access to sensitive information, such as fingerprints, bank account details, and facial images. A promising solution to this issue is Source-Free Unsupervised Domain Adaptation (SFUDA), which enables domain adaptation without requiring access to labeled target domain data. Recent research demonstrates that SFUDA can effectively address domain discrepancies; however, two key challenges remain: (1) the low quality of prototype samples, and (2) the incorrect assignment of pseudo-labels. To tackle these challenges, we propose a method consisting of three main phases. In the first phase, we introduce a Reliable Sample Memory (RSM) module to improve the quality of prototypes by selecting more representative samples. In the second phase, we employ a Multi-View Contrastive Learning (MVCL) approach to enhance pseudo-label quality by leveraging multiple data augmentations. In the final phase, we apply a noisy label filtering technique to further refine the pseudo-labels. Our experiments on three benchmark datasets - VisDA 2017, Office-Home, and Office-31 - demonstrate that our method achieves approximately 2 percent and 6 percent improvements in classification accuracy over the second-best method and the average of 13 well-known state-of-the-art approaches, respectively.
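One simple way to picture how multiple augmented views can help clean up pseudo-labels is an agreement filter: keep a sample's pseudo-label only when enough of its views predict the same class. This is a deliberately simplified stand-in with hypothetical names and thresholds; it is not the paper's contrastive objective or its noisy label filtering technique, neither of which is specified in the abstract.

```python
from collections import Counter

def filter_pseudo_labels(view_predictions, min_agreement=1.0):
    """Keep samples whose augmented views agree on a label.

    view_predictions: one list of predicted labels per sample, one label per view.
    Returns {sample_index: pseudo_label} for samples passing the agreement threshold.
    """
    kept = {}
    for idx, labels in enumerate(view_predictions):
        label, votes = Counter(labels).most_common(1)[0]  # majority label and its vote count
        if votes / len(labels) >= min_agreement:
            kept[idx] = label
    return kept

# Three target samples, three augmented views each (illustrative predictions).
predictions = [[1, 1, 1], [0, 2, 0], [3, 3, 4]]
strict = filter_pseudo_labels(predictions)                      # requires unanimous views
relaxed = filter_pseudo_labels(predictions, min_agreement=0.6)  # tolerates one dissenting view
```

Samples dropped by the filter would simply go without a pseudo-label for that round, which trades coverage for label quality.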
[CV-218] Personalized Image Generation from an Author Writing Style
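[Quick Read]: This paper takes on the novel challenge of turning textually defined authorial writing styles into compelling visual representations. The key to the solution is Author Writing Sheets (AWS), structured summaries of an author's literary characteristics, which are passed as input to a Large Language Model (LLM, Claude 3.7 Sonnet) to generate three distinct text-to-image prompts that are then rendered by a diffusion model (Stable Diffusion 3.5 Medium).
Link: https://arxiv.org/abs/2507.03313
Authors: Sagar Gandhi, Vishal Gandhi
Affiliations: Joyspace AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: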
Abstract:Translating nuanced, textually-defined authorial writing styles into compelling visual representations presents a novel challenge in generative AI. This paper introduces a pipeline that leverages Author Writing Sheets (AWS) - structured summaries of an author’s literary characteristics - as input to a Large Language Model (LLM, Claude 3.7 Sonnet). The LLM interprets the AWS to generate three distinct, descriptive text-to-image prompts, which are then rendered by a diffusion model (Stable Diffusion 3.5 Medium). We evaluated our approach using 49 author styles from Reddit data, with human evaluators assessing the stylistic match and visual distinctiveness of the generated images. Results indicate a good perceived alignment between the generated visuals and the textual authorial profiles (mean style match: 4.08/5), with images rated as moderately distinctive. Qualitative analysis further highlighted the pipeline’s ability to capture mood and atmosphere, while also identifying challenges in representing highly abstract narrative elements. This work contributes a novel end-to-end methodology for visual authorial style personalization and provides an initial empirical validation, opening avenues for applications in creative assistance and cross-modal understanding.
[CV-219] MGSfM: Multi-Camera Geometry Driven Global Structure-from-Motion ICCV2025
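[Quick Read]: This paper addresses the limited robustness of global Structure-from-Motion (SfM) for multi-camera systems. Traditional global SfM systems are less stable on multi-camera configurations because of the limitations of their optimization framework. The key to the solution is a novel global motion averaging framework with two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Rotation averaging follows a hierarchical strategy that first estimates relative rotations within rigid camera units and then computes global rigid-unit rotations. Translation averaging combines camera-to-camera and camera-to-point constraints, initializing camera positions and 3D points with a convex distance-based objective and refining them with an unbiased non-bilinear angle-based objective, which improves robustness and efficiency.
Link: https://arxiv.org/abs/2507.03306
Authors: Peilin Tao, Hainan Cui, Diantao Tu, Shuhan Shen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025, The code is available at this https URL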
Abstract:Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. The code is available at this https URL.
[CV-220] Bridging Domain Generalization to Multimodal Domain Generalization via Unified Representations ICCV2025
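[Quick Read]: This paper tackles Multi-modal Domain Generalization (MMDG): making models trained on multi-modal source domains generalize to unseen target domains. Applying existing single-modal domain generalization methods directly to multi-modal settings often yields inconsistent generalization or even degraded performance because of the inherent differences between modalities. The key to the solution is to build Unified Representations that map different paired modalities into a shared space, enabling synchronized improvements across modalities, together with a supervised disentanglement framework that separates modal-general from modal-specific information and thereby strengthens the consistency and generalization of the unified representations.
Link: https://arxiv.org/abs/2507.03304
Authors: Hai Huang,Yan Xia,Sashuai Zhou,Hanting Wang,Shulei Wang,Zhou Zhao
Affiliations: Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025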
Abstract:Domain Generalization (DG) aims to enhance model robustness in unseen or distributionally shifted target domains through training exclusively on source domains. Although existing DG techniques, such as data manipulation, learning strategies, and representation learning, have shown significant progress, they predominantly address single-modal data. With the emergence of numerous multi-modal datasets and increasing demand for multi-modal tasks, a key challenge in Multi-modal Domain Generalization (MMDG) has emerged: enabling models trained on multi-modal sources to generalize to unseen target distributions within the same modality set. Due to the inherent differences between modalities, directly transferring methods from single-modal DG to MMDG typically yields sub-optimal results. These methods often exhibit randomness during generalization due to the invisibility of target domains and fail to consider inter-modal consistency. Applying these methods independently to each modality in the MMDG setting before combining them can lead to divergent generalization directions across different modalities, resulting in degraded generalization capabilities. To address these challenges, we propose a novel approach that leverages Unified Representations to map different paired modalities together, effectively adapting DG methods to MMDG by enabling synchronized multi-modal improvements within the unified space. Additionally, we introduce a supervised disentanglement framework that separates modal-general and modal-specific information, further enhancing the alignment of unified representations. Extensive experiments on benchmark datasets, including EPIC-Kitchens and Human-Animal-Cartoon, demonstrate the effectiveness and superiority of our method in enhancing multi-modal domain generalization.
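[CV-221] Leveraging Out-of-Distribution Unlabeled Images: Semi-Supervised Semantic Segmentation with an Open-Vocabulary Model
[Quick Read]: This paper addresses how to make effective use of out-of-distribution (OOD) unlabeled images in semi-supervised semantic segmentation. Existing methods perform well on controlled dataset splits but leave the potential of much larger unlabeled image collections unexplored. The key to the solution is to bring in an open-vocabulary segmentation (OVS) model that produces more accurate pseudo-labels for OOD images, improving semi-supervised learning. Experiments on the Pascal VOC and Context datasets show substantial gains, especially when few labels are available.
Link: https://arxiv.org/abs/2507.03302
Authors: Wooseok Shin,Jisu Kang,Hyeonki Jeong,Jin Sob Kim,Sung Won Han
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 19 pages, 8 figures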
Abstract:In semi-supervised semantic segmentation, existing studies have shown promising results in academic settings with controlled splits of benchmark datasets. However, the potential benefits of leveraging significantly larger sets of unlabeled images remain unexplored. In real-world scenarios, abundant unlabeled images are often available from online sources (web-scraped images) or large-scale datasets. However, these images may have different distributions from those of the target dataset, a situation known as out-of-distribution (OOD). Using these images as unlabeled data in semi-supervised learning can lead to inaccurate pseudo-labels, potentially misguiding network training. In this paper, we propose a new semi-supervised semantic segmentation framework with an open-vocabulary segmentation model (SemiOVS) to effectively utilize unlabeled OOD images. Extensive experiments on Pascal VOC and Context datasets demonstrate two key findings: (1) using additional unlabeled images improves the performance of semi-supervised learners in scenarios with few labels, and (2) using the open-vocabulary segmentation (OVS) model to pseudo-label OOD images leads to substantial performance gains. In particular, SemiOVS outperforms existing PrevMatch and SemiVL methods by +3.5 and +3.0 mIoU, respectively, on Pascal VOC with a 92-label setting, achieving state-of-the-art performance. These findings demonstrate that our approach effectively utilizes abundant unlabeled OOD images for semantic segmentation tasks. We hope this work can inspire future research and real-world applications. The code is available at this https URL
[CV-222] CPKD: Clinical Prior Knowledge-Constrained Diffusion Models for Surgical Phase Recognition in Endoscopic Submucosal Dissection
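[Quick Read]: This paper addresses reliable surgical phase recognition within complex endoscopic workflows, a bottleneck that limits clinical adoption of computer-assisted systems for Endoscopic Submucosal Dissection (ESD). The key to the solution is a generative framework, Clinical Prior Knowledge-Constrained Diffusion (CPKD), which recasts phase recognition through denoising diffusion principles while keeping the core idea of iterative refinement. The architecture progressively reconstructs phase sequences from random noise, conditioned on visual-temporal features. A conditional masking strategy is designed to better capture three domain-specific characteristics (positional priors, boundary ambiguity, and relation dependency), and clinical prior knowledge is incorporated to improve the model's ability to correct phase logic errors.
Link: https://arxiv.org/abs/2507.03295
Authors: Xiangning Zhang,Jinnan Chen,Qingwei Zhang,Yaqi Wang,Chengfeng Zhou,Xiaobo Li,Dahong Qian
Affiliations: Shanghai Jiao Tong University; Renji Hospital; Hunan University; Guangdong-Hong Kong-Macao Greater Bay Area Artificial Intelligence Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: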
Abstract:Gastrointestinal malignancies constitute a leading cause of cancer-related mortality worldwide, with advanced-stage prognosis remaining particularly dismal. Originating as a groundbreaking technique for early gastric cancer treatment, Endoscopic Submucosal Dissection has evolved into a versatile intervention for diverse gastrointestinal lesions. While computer-assisted systems significantly enhance procedural precision and safety in ESD, their clinical adoption faces a critical bottleneck: reliable surgical phase recognition within complex endoscopic workflows. Current state-of-the-art approaches predominantly rely on multi-stage refinement architectures that iteratively optimize temporal predictions. In this paper, we present Clinical Prior Knowledge-Constrained Diffusion (CPKD), a novel generative framework that reimagines phase recognition through denoising diffusion principles while preserving the core iterative refinement philosophy. This architecture progressively reconstructs phase sequences starting from random noise and conditioned on visual-temporal features. To better capture three domain-specific characteristics, including positional priors, boundary ambiguity, and relation dependency, we design a conditional masking strategy. Furthermore, we incorporate clinical prior knowledge into the model training to improve its ability to correct phase logical errors. Comprehensive evaluations on ESD820, Cholec80, and external multi-center demonstrate that our proposed CPKD achieves superior or comparable performance to state-of-the-art approaches, validating the effectiveness of diffusion-based generative paradigms for surgical phase recognition.
[CV-223] Zero-shot Inexact CAD Model Alignment from a Single Image ICCV2025
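[Quick Read]: This paper addresses the problem that existing methods for inferring 3D scene structure from a single image rely on supervised training with pose annotations, which restricts them to a narrow set of object categories. The key to the solution is a weakly supervised 9-DoF alignment method that requires no pose annotations and generalizes to unseen categories. It builds a new feature space on top of foundation features that ensures multi-view consistency and uses a self-supervised triplet loss to overcome the symmetry ambiguities inherent in foundation features; it also introduces a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates.
Link: https://arxiv.org/abs/2507.03292
Authors: Pattaramanee Arsomngern,Sasikarn Khwanmuang,Matthias Nießner,Supasorn Suwajanakorn
Affiliations: VISTEC; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL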
Abstract:One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensure multi-view consistency and overcome symmetry ambiguities inherent in foundation features using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.
[CV-224] MolVision: Molecular Property Prediction with Vision Language Models
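[Quick Read]: This paper addresses molecular property prediction, a task with important applications in computational chemistry, particularly drug discovery and materials science. Prior approaches rely mainly on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. The key to the solution is MolVision, a new approach that uses Vision-Language Models (VLMs) and combines images of molecular structure with textual descriptions to improve prediction accuracy. Multimodal fusion significantly improves generalization across molecular properties, and efficient fine-tuning strategies such as LoRA improve performance further.
Link: https://arxiv.org/abs/2507.03283
Authors: Deepan Adak,Yogesh Singh Rawat,Shruti Vyas
Affiliations: NIT Kurukshetra; University of Central Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: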
Abstract:Molecular property prediction is a fundamental task in computational chemistry with critical applications in drug discovery and materials science. While recent works have explored Large Language Models (LLMs) for this task, they primarily rely on textual molecular representations such as SMILES/SELFIES, which can be ambiguous and structurally less informative. In this work, we introduce MolVision, a novel approach that leverages Vision-Language Models (VLMs) by integrating both molecular structure as images and textual descriptions to enhance property prediction. We construct a benchmark spanning ten diverse datasets, covering classification, regression and description tasks. Evaluating nine different VLMs in zero-shot, few-shot, and fine-tuned settings, we find that visual information improves prediction performance, particularly when combined with efficient fine-tuning strategies such as LoRA. Our results reveal that while visual information alone is insufficient, multimodal fusion significantly enhances generalization across molecular properties. Adaptation of vision encoder for molecular images in conjunction with LoRA further improves the performance. The code and data are available at this https URL.
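The abstract names LoRA as the efficient fine-tuning strategy but gives no configuration. As a rough illustration of the general LoRA idea only (not the MolVision setup), the sketch below keeps a weight matrix frozen and adds a scaled low-rank product. The function names, matrix sizes, rank, and `alpha` value are all assumptions chosen for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    w: frozen weight, shape (d_out, d_in)
    a: trainable matrix A, shape (r, d_in)
    b: trainable matrix B, shape (d_out, r)
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy sizes (assumed): d_out = 2, d_in = 3, rank r = 1.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[0.5, -0.5, 1.0]]
b_init = [[0.0], [0.0]]      # B starts at zero, so the adapter is initially a no-op
b_trained = [[1.0], [2.0]]   # hypothetical values after some training

w_start = lora_effective_weight(w, a, b_init, alpha=2.0)
w_adapted = lora_effective_weight(w, a, b_trained, alpha=2.0)
```

Only `a` and `b` would be trained in such a setup, which is why the number of trainable parameters scales with the rank rather than with the full weight matrix.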
[CV-225] NOVO: Unlearning-Compliant Vision Transformers
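[Quick Read]: This paper addresses machine unlearning (MUL): making a pre-trained model selectively forget certain training instances or classes while retaining performance on the remaining data. Existing methods usually fine-tune on a forget set and/or a retain set, which is costly and impractical and can degrade model performance. The key to the solution is NOVO, a vision-transformer-based architecture that can serve future unlearning requests directly, without fine-tuning on the requested set. The model simulates unlearning during training: the classes or sub-classes in each mini-batch are randomly split into a proxy forget set and a retain set, and the model is optimized so that it cannot predict the forget set, which makes unlearning on-the-fly and avoids performance degradation.
Link: https://arxiv.org/abs/2507.03281
Authors: Soumya Roy,Soumya Banerjee,Vinay Verma,Soumik Dasgupta,Deepak Gupta,Piyush Rai
Affiliations: Amazon India; IIT Kanpur; Walmart Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: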
Abstract:Machine unlearning (MUL) refers to the problem of making a pre-trained model selectively forget some training instances or class(es) while retaining performance on the remaining dataset. Existing MUL research involves fine-tuning using a forget and/or retain set, making it expensive and/or impractical, and often causing performance degradation in the unlearned model. We introduce NOVO, an unlearning-aware vision transformer-based architecture that can directly perform unlearning for future unlearning requests without any fine-tuning over the requested set. The proposed model is trained by simulating unlearning during the training process itself. It involves randomly separating class(es)/sub-class(es) present in each mini-batch into two disjoint sets: a proxy forget-set and a retain-set, and the model is optimized so that it is unable to predict the forget-set. Forgetting is achieved by withdrawing keys, making unlearning on-the-fly and avoiding performance degradation. The model is trained jointly with learnable keys and original weights, ensuring withholding a key irreversibly erases information, validated by membership inference attack scores. Extensive experiments on various datasets, architectures, and resolutions confirm NOVO’s superiority over both fine-tuning-free and fine-tuning-based methods.
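The per-mini-batch split described in the abstract can be pictured with a small sketch. This covers only the random partition of classes into a proxy forget-set and a retain-set. The learnable keys and the training objective are not reproduced, and the function names and the `forget_fraction` parameter are assumptions made for illustration.

```python
import random


def split_proxy_forget_retain(batch_labels, forget_fraction=0.5, rng=None):
    """Randomly partition the classes present in a mini-batch into two disjoint sets."""
    rng = rng or random.Random()
    classes = sorted(set(batch_labels))
    rng.shuffle(classes)
    # Keep at least one class on each side when the batch has more than one class.
    n_forget = max(1, int(len(classes) * forget_fraction)) if len(classes) > 1 else 0
    forget = set(classes[:n_forget])
    retain = set(classes[n_forget:])
    return forget, retain


def forget_mask(batch_labels, forget):
    """Mark which samples in the batch belong to the proxy forget-set."""
    return [label in forget for label in batch_labels]


labels = [3, 1, 3, 7, 2, 1, 7, 5]  # hypothetical class ids in one mini-batch
forget, retain = split_proxy_forget_retain(labels, rng=random.Random(0))
mask = forget_mask(labels, forget)
```

A training step would then treat masked and unmasked samples differently, for example penalizing correct predictions on the masked samples. The exact loss is not given in the abstract.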
[CV-226] ConceptMix++: Leveling the Playing Field in Text-to-Image Benchmarking via Iterative Prompt Optimization CVPR2025
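[Quick Read]: This paper addresses the underestimation of model generative ability in current text-to-image (T2I) benchmarks, caused by prompt sensitivity, and the benchmark bias that results. The key to the solution is the ConceptMix++ framework, which decouples prompt phrasing from visual generation capability through iterative prompt optimization and introduces a multimodal optimization pipeline that uses vision-language model feedback to refine prompts systematically, improving compositional generation performance and enabling fairer comparisons between T2I models.
Link: https://arxiv.org/abs/2507.03275
Authors: Haosheng Gan,Berk Tinaz,Mohammad Shahab Sepehri,Zalan Fabian,Mahdi Soltanolkotabi
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: An earlier version appeared in the CVPR 2025 Workshop on Generative Models for Computer Vision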
Abstract:Current text-to-image (T2I) benchmarks evaluate models on rigid prompts, potentially underestimating true generative capabilities due to prompt sensitivity and creating biases that favor certain models while disadvantaging others. We introduce ConceptMix++, a framework that disentangles prompt phrasing from visual generation capabilities by applying iterative prompt optimization. Building on ConceptMix, our approach incorporates a multimodal optimization pipeline that leverages vision-language model feedback to refine prompts systematically. Through extensive experiments across multiple diffusion models, we show that optimized prompts significantly improve compositional generation performance, revealing previously hidden model capabilities and enabling fairer comparisons across T2I models. Our analysis reveals that certain visual concepts – such as spatial relationships and shapes – benefit more from optimization than others, suggesting that existing benchmarks systematically underestimate model performance in these categories. Additionally, we find strong cross-model transferability of optimized prompts, indicating shared preferences for effective prompt phrasing across models. These findings demonstrate that rigid benchmarking approaches may significantly underrepresent true model capabilities, while our framework provides more accurate assessment and insights for future development.
[CV-227] Dual-frequency Selected Knowledge Distillation with Statistical-based Sample Rectification for PolSAR Image Classification
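[Quick Read]: This paper addresses two difficulties in the collaborative classification of dual-frequency polarimetric synthetic aperture radar (PolSAR) images: the effect of regional consistency on learning classification information, and the rational use of dual-frequency data. The key to the solution is a selected knowledge distillation network with statistical-based sample rectification (SKDNet-SSR), which contains a statistical-based dynamic sample rectification (SDSR) module and a dual-frequency gate-selected distillation (DGSD) module. SDSR dynamically evaluates sample purity and performs pixel selection and generation to reduce the interference of noisy pixels with spatial information learning; DGSD uses the dominant single-frequency branch on each sample as the teacher to train the dual-frequency student model, so that dual-frequency data is used in a complementary way across terrain objects.
Link: https://arxiv.org/abs/2507.03268
Authors: Xinyue Xin,Ming Li,Yan Wu,Xiang Li,Peng Zhang,Dazhi Xu
Affiliations: National Laboratory of Radar Signal Processing, Xidian University; Collaborative Innovation Center of Information Sensing and Understanding, Xidian University; Remote Sensing Image Processing and Fusion Group, School of Electronics Engineering, Xidian University; Beijing Institute of Radio Measurement
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: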
Abstract:The collaborative classification of dual-frequency PolSAR images is a meaningful but challenging research problem. The effect of regional consistency on classification information learning and the rational use of dual-frequency data are two main difficulties for dual-frequency collaborative classification. To tackle these problems, a selected knowledge distillation network with statistical-based sample rectification (SKDNet-SSR) is proposed in this article. First, in addition to applying CNN and ViT as local and global feature extractors, a statistical-based dynamic sample rectification (SDSR) module is designed to avoid the impact of poor regional consistency on the spatial information learning process. Specifically, based on the fact that the PolSAR covariance matrix conforms to the complex Wishart distribution, SDSR first dynamically evaluates the sample purity, and then performs pixel selection and pixel generation to remove noisy pixels, thereby avoiding the feature interaction between informative pixels and noisy pixels and improving the classification feature extraction process. Next, a dual-frequency gate-selected distillation (DGSD) module is constructed to emphasize the advantages of different frequency bands and perform complementary learning on dual-frequency data. It uses the dominant single-frequency branch on each sample as the teacher model to train the dual-frequency student model, enabling the student model to learn the optimal results and realizing complementary utilization of dual-frequency data on different terrain objects. Comprehensive experiments on four measured dual-frequency PolSAR datasets demonstrate that the proposed SKDNet-SSR outperforms other related methods.
[CV-228] Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders
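[Quick Read]: This paper addresses encoder redundancy in Multimodal Large Language Models (MLLMs) that use multiple vision encoders: adding encoders brings limited gains and can even reduce performance. The key to the solution is a systematic evaluation approach built on two metrics, the Conditional Utilization Rate (CUR) and the Information Gap (IG), which quantify each encoder's unique contribution and expose the disparity in utility between encoders, providing diagnostic tools for designing more efficient and effective multimodal architectures.
Link: https://arxiv.org/abs/2507.03262
Authors: Song Mao,Yang Chen,Pinglong Cai,Ding Wang,Guohang Yan,Zhi Yu,Botian Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work in Progress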
Abstract:Multimodal Large Language Models (MLLMs) increasingly adopt multiple vision encoders to capture diverse visual information, ranging from coarse semantics to fine-grained details. While this approach is intended to enhance visual understanding capability, we observe that the performance gains from adding encoders often diminish and can even lead to performance degradation, a phenomenon we term encoder redundancy. This paper presents a systematic investigation into this issue. Through comprehensive ablation studies on state-of-the-art multi-encoder MLLMs, we empirically demonstrate that significant redundancy exists. To quantify each encoder’s unique contribution, we propose a principled metric: the Conditional Utilization Rate (CUR). Building on CUR, we introduce the Information Gap (IG) to capture the overall disparity in encoder utility within an MLLM. Our experiments reveal that certain vision encoders contribute minimally, or even negatively, to the model’s performance, confirming the prevalence of redundancy. These findings highlight critical inefficiencies in current multi-encoder designs and establish that our proposed metrics can serve as valuable diagnostic tools for developing more efficient and effective multimodal architectures.
[CV-229] LACONIC: A 3D Layout Adapter for Controllable Image Creation ICCV2025
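[Quick Read]: This paper addresses the difficulty existing generative methods have in preserving and respecting the underlying 3D geometric structure of a scene during guided image synthesis of multi-object scenes. The key to the solution is a new conditioning approach, training strategy, and adapter network that can be plugged into pretrained text-to-image diffusion models, giving them 3D awareness while drawing on their rich prior knowledge. The method supports camera control, conditioning on explicit 3D geometry, and consideration of the full scene context (both on-screen and off-screen items), producing plausible and semantically rich images.
Link: https://arxiv.org/abs/2507.03257
Authors: Léopold Maillard,Tom Durand,Adrien Ramanana Rahary,Maks Ovsjanikov
Affiliations: LIX, École Polytechnique, IP Paris; Dassault Systèmes
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025. Preprint version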
Abstract:Existing generative approaches for guided image synthesis of multi-object scenes typically rely on 2D controls in the image or text space. As a result, these methods struggle to maintain and respect the consistent three-dimensional geometric structure underlying the scene. In this paper, we propose a novel conditioning approach, training method and adapter network that can be plugged into pretrained text-to-image diffusion models. Our approach provides a way to endow such models with 3D-awareness, while leveraging their rich prior knowledge. Our method supports camera control, conditioning on explicit 3D geometries and, for the first time, accounts for the entire context of a scene, i.e., both on and off-screen items, to synthesize plausible and semantically rich images. Despite its multi-modal nature, our model is lightweight, requires a reasonable amount of data for supervised learning and shows remarkable generalization power. We also introduce methods for intuitive and consistent image editing and restyling, e.g., by positioning, rotating or resizing individual objects in a scene. Our method integrates well within various image creation workflows and enables a richer set of applications compared to previous approaches.
[CV-230] MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
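[Quick Read]: This paper addresses talking head generation for arbitrary identities and speech audio, a key challenge for digital humans and the virtual metaverse. The solution has two key parts: 1) defining a joint parameter space that bridges motion generation and neural rendering, and using flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture that models the interaction among noisy motion, audio, and auxiliary conditions, improving overall facial expressiveness. A coarse-to-fine fusion strategy then progressively integrates the different modalities, ensuring effective fusion across feature spaces.
Link: https://arxiv.org/abs/2507.03256
Authors: Xinyang Li,Gen Li,Zhihui Lin,Yichen Qian,GongXin Yao,Weinan Jia,Weihua Chen,Fan Wang
Affiliations: Xunguang Team, DAMO Academy, Alibaba Group; Zhejiang University; Hupan Lab
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 7 figures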
Abstract:Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of digital humans and the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation and generalization capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts, which arise from the implicit latent space of Variational Auto-Encoders (VAE), complicating the diffusion process; 2) a lack of authentic facial expressions and head movements, resulting from insufficient multi-modal information interaction. In this paper, MoDA handles these challenges by 1) defining a joint parameter space to bridge motion generation and neural rendering, and leveraging flow matching to simplify the diffusion learning process; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and auxiliary conditions, ultimately enhancing overall facial expressiveness. Subsequently, a coarse-to-fine fusion strategy is adopted to progressively integrate different modalities, ensuring effective integration across feature spaces. Experimental results demonstrate that MoDA significantly improves video diversity, realism, and efficiency, making it suitable for real-world applications.
[CV-231] Subject Invariant Contrastive Learning for Human Activity Recognition
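[Quick Read]: This paper addresses the domain shift caused by subject variability in Human Activity Recognition (HAR), which limits how well models trained with contrastive learning generalize to unseen subjects. The key to the solution is a loss function called Subject-Invariant Contrastive Learning (SICL), which re-weights negative pairs drawn from the same subject to suppress subject-specific features and emphasize activity-specific ones, improving generalization.
Link: https://arxiv.org/abs/2507.03250
Authors: Yavuz Yarici,Kiran Kokilepersaud,Mohit Prabhushankar,Ghassan AlRegib
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: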
Abstract:The high cost of annotating data makes self-supervised approaches, such as contrastive learning methods, appealing for Human Activity Recognition (HAR). Effective contrastive learning relies on selecting informative positive and negative samples. However, HAR sensor signals are subject to significant domain shifts caused by subject variability. These domain shifts hinder model generalization to unseen subjects by embedding subject-specific variations rather than activity-specific features. As a result, human activity recognition models trained with contrastive learning often struggle to generalize to new subjects. We introduce Subject-Invariant Contrastive Learning (SICL), a simple yet effective loss function to improve generalization in human activity recognition. SICL re-weights negative pairs drawn from the same subject to suppress subject-specific cues and emphasize activity-specific information. We evaluate our loss function on three public benchmarks: UTD-MHAD, MMAct, and DARai. We show that SICL improves performance by up to 11% over traditional contrastive learning methods. Additionally, we demonstrate the adaptability of our loss function across various settings, including multiple self-supervised methods, multimodal scenarios, and supervised learning frameworks.
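To make the re-weighting idea concrete, here is a minimal sketch of an InfoNCE-style loss in which negatives that share the anchor's subject receive a separate weight. It is an illustration under stated assumptions, not the SICL formulation from the paper: the function name, the default weight of 2.0, the temperature, and the choice to up-weight rather than down-weight are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def subject_weighted_info_nce(anchors, positives, subjects,
                              same_subject_weight=2.0, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    positives[i] is the positive view for anchors[i]; every positives[k] with
    k != i acts as a negative for anchor i. Negatives whose subject matches the
    anchor's subject are scaled by same_subject_weight (assumed value).
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i in range(len(a)):
        pos = math.exp(dot(a[i], p[i]) / temperature)
        neg = 0.0
        for k in range(len(a)):
            if k == i:
                continue
            weight = same_subject_weight if subjects[k] == subjects[i] else 1.0
            neg += weight * math.exp(dot(a[i], p[k]) / temperature)
        total += -math.log(pos / (pos + neg))
    return total / len(a)


# Toy embeddings and subject ids (hypothetical).
anchors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.7, 0.3], [0.1, 0.9]]
subjects = ["s1", "s1", "s2"]

plain_loss = subject_weighted_info_nce(anchors, positives, subjects, same_subject_weight=1.0)
weighted_loss = subject_weighted_info_nce(anchors, positives, subjects, same_subject_weight=2.0)
```

With a weight of 1.0 the function reduces to an ordinary InfoNCE loss, so the weight isolates the effect of the same-subject negatives.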
[CV-232] A Vision-Based Closed-Form Solution for Measuring the Rotation Rate of an Object by Tracking One Point
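[Quick Read]: This paper addresses how to obtain the instantaneous rotation rate of a rigid body analytically by tracking only a single feature point in the image. The key to the solution is that, under orthographic projection and with the camera fixated on a point on the rigid body, essentially any tracked point yields the same instantaneous rotation rate regardless of where it lies on the body, so rigid-body motion can be analyzed and the scene segmented without depending on the 3D shape of the object or prior knowledge of the scene.
Link: https://arxiv.org/abs/2507.03237
Authors: Daniel Raviv,Juan D. Yepes,Eiki M. Martinson
Affiliations: Florida Atlantic University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: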
Abstract:We demonstrate that, under orthographic projection and with a camera fixated on a point located on a rigid body, the rotation of that body can be analytically obtained by tracking only one other feature in the image. With some exceptions, any tracked point, regardless of its location on the body, yields the same value of the instantaneous rotation rate. The proposed method is independent of the shape of the 3D object and does not require a priori knowledge about the scene. This algorithm is suited for parallel processing and can achieve segmentation of the scene by distinguishing points that do not belong to the same rigid body, simply because they do not produce the same value of the rotation. This paper presents an analytical derivation, simulation results, and results from real video data.
[CV-233] Development of an Improved Capsule-Yolo Network for Automatic Tomato Plant Disease Early Detection and Diagnosis
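[Quick Read]: This paper addresses diseases caused by pathogens in tomato crops, which seriously threaten tomato health, yield, and quality and in turn affect food security. The key to the solution is an enhanced Capsule-YOLO network architecture that uses the YOLO framework to automatically segment overlapping and occluded tomato leaf images from complex backgrounds and identifies disease symptoms with high reported performance (99.31% accuracy, 98.78% recall, 99.09% precision, and a 98.93% F1-score). The system also provides a user-friendly interface for uploading images of affected tomato plants and detecting early disease symptoms, along with diagnosis and treatment recommendations.
Link: https://arxiv.org/abs/2507.03219
Authors: Idris Ochijenu,Monday Abutu Idakwo,Sani Felix
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: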
Abstract:Like many countries, Nigeria is naturally endowed with fertile agricultural soil that supports large-scale tomato production. However, the prevalence of disease-causing pathogens poses a significant threat to tomato health, often leading to reduced yields and, in severe cases, the extinction of certain species. These diseases jeopardise both the quality and quantity of tomato harvests, contributing to food insecurity. Fortunately, tomato diseases can often be visually identified through distinct forms, appearances, or textures, typically first visible on leaves and fruits. This study presents an enhanced Capsule-YOLO network architecture designed to automatically segment overlapping and occluded tomato leaf images from complex backgrounds using the YOLO framework. It identifies disease symptoms with impressive performance metrics: 99.31% accuracy, 98.78% recall, 99.09% precision, and a 98.93% F1-score, representing improvements of 2.91%, 1.84%, 5.64%, and 4.12% over existing state-of-the-art methods. Additionally, a user-friendly interface was developed to allow farmers and users to upload images of affected tomato plants and detect early disease symptoms. The system also provides recommendations for appropriate diagnosis and treatment. The effectiveness of this approach promises significant benefits for the agricultural sector by enhancing crop yields and strengthening food security.
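The reported metrics can be cross-checked against one another, since the F1-score is the harmonic mean of precision and recall. The short sketch below does that arithmetic with the figures quoted in the abstract. The variable names are illustrative, and the check assumes the paper uses the standard F1 definition.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision = 99.09  # percent, from the abstract
reported_recall = 98.78     # percent, from the abstract

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 2))  # → 98.93
```

The result agrees with the 98.93% F1-score stated in the abstract, so the three figures are mutually consistent.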
[CV-234] AI-driven Web Application for Early Detection of Sudden Death Syndrome (SDS) in Soybean Leaves Using Hyperspectral Images and Genetic Algorithm
[Quick Read]: This paper addresses early detection of Sudden Death Syndrome (SDS) on soybean leaves, with the aim of reducing the losses the disease causes to soybean production. The key to the solution is to combine hyperspectral imaging with AI algorithms: five key wavelengths that discriminate infection status (505.4, 563.7, 712.2, 812.9, and 908.4 nm) are selected and fed into a lightweight Convolutional Neural Network (CNN) to extract spatial-spectral features, which are then classified by machine learning models (including Random Forest, AdaBoost, Linear SVM, and a neural network), reaching 98% accuracy. The study also develops a web application that makes disease diagnosis real-time and accessible.
Link: https://arxiv.org/abs/2507.03198
Authors: Pappu Kumar Yadav,Rishik Aggarwal,Supriya Paudel,Amee Parmar,Hasan Mirzakhaninafchi,Zain Ul Abideen Usmani,Dhe Yeong Tchalla,Shyam Solanki,Ravi Mural,Sachin Sharma,Thomas F. Burks,Jianwei Qin,Moon S. Kim
Affiliations: South Dakota State University; University of Florida; USDA-ARS Environmental Microbial and Food Safety Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 8 pages
Abstract:Sudden Death Syndrome (SDS), caused by Fusarium virguliforme, poses a significant threat to soybean production. This study presents an AI-driven web application for early detection of SDS on soybean leaves using hyperspectral imaging, enabling diagnosis prior to visible symptom onset. Leaf samples from healthy and inoculated plants were scanned using a portable hyperspectral imaging system (398-1011 nm), and a Genetic Algorithm was employed to select five informative wavelengths (505.4, 563.7, 712.2, 812.9, and 908.4 nm) critical for discriminating infection status. These selected bands were fed into a lightweight Convolutional Neural Network (CNN) to extract spatial-spectral features, which were subsequently classified using ten classical machine learning models. Ensemble classifiers (Random Forest, AdaBoost), Linear SVM, and Neural Net achieved the highest accuracy (98%) and minimal error across all folds, as confirmed by confusion matrices and cross-validation metrics. Poor performance by Gaussian Process and QDA highlighted their unsuitability for this dataset. The trained models were deployed within a web application that enables users to upload hyperspectral leaf images, visualize spectral profiles, and receive real-time classification results. This system supports rapid and accessible plant disease diagnostics, contributing to precision agriculture practices. Future work will expand the training dataset to encompass diverse genotypes, field conditions, and disease stages, and will extend the system for multiclass disease classification and broader crop applicability.
[CV-235] Transparent Machine Learning: Training and Refining an Explainable Boosting Machine to Identify Overshooting Tops in Satellite Imagery
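[Quick Read]: This paper addresses how to build interpretable, physics-based machine learning algorithms for meteorological applications, specifically the detection of overshooting tops (OTs) in satellite imagery. The key to the solution is to combine an Explainable Boosting Machine (EBM) with feature engineering: key features are extracted with mathematical methods (for example, cloud texture via Gray-Level Co-occurrence Matrices), and the model is trained on the visible and infrared channels of the Advanced Baseline Imager on Geostationary Operational Environmental Satellite 16. The result is an interpretable model developed through human-machine collaboration; it does not match the accuracy of more complex approaches, but it is a meaningful step toward fully interpretable machine learning algorithms.
Link: https://arxiv.org/abs/2507.03183
Authors: Nathan Mitchell,Lander Ver Hoef,Imme Ebert-Uphoff,Kristina Moen,Kyle Hilburn,Yoonjin Lee,Emily J. King
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 38 pages, 19 figures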
Abstract:An Explainable Boosting Machine (EBM) is an interpretable machine learning (ML) algorithm that has benefits in high risk applications but has not yet found much use in atmospheric science. The overall goal of this work is twofold: (1) explore the use of EBMs, in combination with feature engineering, to obtain interpretable, physics-based machine learning algorithms for meteorological applications; (2) illustrate these methods for the detection of overshooting tops (OTs) in satellite imagery. Specifically, we seek to simplify the process of OT detection by first using mathematical methods to extract key features, such as cloud texture using Gray-Level Co-occurrence Matrices, followed by applying an EBM. Our EBM focuses on the classification task of predicting OT regions, utilizing Channel 2 (visible imagery) and Channel 13 (infrared imagery) of the Advanced Baseline Imager sensor of the Geostationary Operational Environmental Satellite 16. Multi-Radar/Multi-Sensor system convection flags are used as labels to train the EBM model. Note, however, that detecting convection, while related, is different from detecting OTs. Once trained, the EBM was examined and minimally altered to more closely match strategies used by domain scientists to identify OTs. The result of our efforts is a fully interpretable ML algorithm that was developed in a human-machine collaboration. While the final model does not reach the accuracy of more complex approaches, it performs well and represents a significant step toward building fully interpretable ML algorithms for this and other meteorological applications.
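For readers unfamiliar with Gray-Level Co-occurrence Matrices, the following sketch shows how a GLCM and two common texture statistics (contrast and homogeneity) can be computed for one pixel offset. It is a generic illustration, not the authors' feature pipeline: the function names, the offset, the number of gray levels, and the toy images are assumptions.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalized gray-level co-occurrence matrix for the pixel offset (dx, dy).

    image: 2D list of integer gray levels in [0, levels).
    The image must contain at least one valid pixel pair for the offset.
    """
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1
    total = sum(map(sum, counts))
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Sum of p[i][j] * (i - j)^2; larger for rougher textures."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


def homogeneity(p):
    """Sum of p[i][j] / (1 + (i - j)^2); larger for smoother textures."""
    n = len(p)
    return sum(p[i][j] / (1 + (i - j) ** 2) for i in range(n) for j in range(n))


# Two toy 4x4 patches with 4 gray levels (hypothetical data).
smooth = [[0, 0, 0, 0] for _ in range(4)]
checker = [[3 * ((r + c) % 2) for c in range(4)] for r in range(4)]

smooth_glcm = glcm(smooth, levels=4)
checker_glcm = glcm(checker, levels=4)
```

A uniform patch yields zero contrast and a homogeneity of one, while the checkerboard patch yields high contrast and low homogeneity. This kind of separation is what makes such statistics usable as tabular inputs to an interpretable model like an EBM.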
[CV-236] Adopting a human developmental visual diet yields robust shape-based AI vision
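Quick read: This paper addresses the striking misalignment between artificial intelligence (AI) vision systems and human vision. Specifically, AI relies heavily on texture features rather than shape information, lacks robustness to image distortions, is vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes in complex backgrounds. The key to the solution is to take inspiration from how human vision develops from infancy into adulthood: by synthesising findings from psychophysics and neurophysiology, the authors construct a novel developmental visual diet (DVD) that guides AI models along a learning path resembling human visual development, yielding more human-like visual behaviour.
Link: https://arxiv.org/abs/2507.03168
Authors: Zejin Lu,Sushrut Thorat,Radoslaw M Cichy,Tim C Kietzmann
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: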
Abstract:Despite years of research and the dramatic scaling of artificial intelligence (AI) systems, a striking misalignment between artificial and human vision persists. Contrary to humans, AI heavily relies on texture-features rather than shape information, lacks robustness to image distortions, remains highly vulnerable to adversarial attacks, and struggles to recognise simple abstract shapes within complex backgrounds. To close this gap, we here introduce a solution that arises from a previously underexplored direction: rather than scaling up, we take inspiration from how human vision develops from early infancy into adulthood. We quantified the visual maturation by synthesising decades of psychophysical and neurophysiological research into a novel developmental visual diet (DVD) for AI vision. We show that guiding AI systems through this human-inspired curriculum produces models that closely align with human behaviour on every hallmark of robust vision tested, yielding the strongest reported reliance on shape information to date, abstract shape recognition beyond the state of the art, higher robustness to image corruptions, and stronger resilience to adversarial attacks. By outperforming high parameter AI foundation models trained on orders of magnitude more data, we provide evidence that robust AI vision can be achieved by guiding how a model learns, not merely how much it learns, offering a resource-efficient route toward safer and more human-like artificial visual systems.
[CV-237] Neural Dynamic Modes: Computational Imaging of Dynamical Systems from Sparse Observations
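Quick read: This paper addresses the problem of reconstructing continuous spatio-temporal dynamics from indirect, noisy, and highly sparse measurements in scientific imaging. The key to the solution is NeuralDMD, a framework that combines neural implicit representations with Dynamic Mode Decomposition (DMD): the expressiveness of neural representations captures complex spatial structure, while the linear dynamical modes of DMD introduce an inductive bias that guides training and supports stable, low-dimensional representations and forecasting.
Link: https://arxiv.org/abs/2507.03094
Authors: Ali SaraerToosi,Renbo Tu,Kamyar Azizzadenesheli,Aviad Levis
Affiliations: University of Toronto; NVIDIA Corporation
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Notes: 24 pages, 18 figures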
Abstract:Dynamical systems are ubiquitous within science and engineering, from turbulent flow across aircraft wings to structural variability of proteins. Although some systems are well understood and simulated, scientific imaging often confronts never-before-seen dynamics observed through indirect, noisy, and highly sparse measurements. We present NeuralDMD, a model-free framework that combines neural implicit representations with Dynamic Mode Decomposition (DMD) to reconstruct continuous spatio-temporal dynamics from such measurements. The expressiveness of neural representations enables capturing complex spatial structures, while the linear dynamical modes of DMD introduce an inductive bias that guides training and supports stable, low-dimensional representations and forecasting. We validate NeuralDMD on two real-world problems: reconstructing near-surface wind-speed fields over North America from sparse station observations, and recovering the evolution of plasma near the Galactic-center black hole, Sgr A*. In both cases, NeuralDMD outperforms established baselines, demonstrating its potential as a general tool for imaging dynamical systems across geoscience, astronomy, and beyond.
[CV-238] LATTE: Latent Trajectory Embedding for Diffusion-Generated Image Detection NEURIPS2025
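Quick read: This paper addresses the growing difficulty of distinguishing generated images from real ones, which can erode trust in digital media. The key to the solution is LATTE (Latent Trajectory Embedding), which models the trajectory of latent embeddings across several denoising timesteps rather than relying only on single-step reconstruction errors, thereby capturing subtle, discriminative patterns that separate real from generated images.
Link: https://arxiv.org/abs/2507.03054
Authors: Ana Vasilcoiu,Ivona Najdenkoska,Zeno Geradts,Marcel Worring
Affiliations: University of Amsterdam; Netherlands Forensic Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 10 pages, 6 figures, submitted to NeurIPS 2025, includes benchmark evaluations on GenImage and Diffusion Forensics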
Abstract:The rapid advancement of diffusion-based image generators has made it increasingly difficult to distinguish generated from real images. This can erode trust in digital media, making it critical to develop generalizable detectors for generated images. Recent methods leverage diffusion denoising cues, but mainly focus on single-step reconstruction errors, ignoring the inherent sequential nature of the denoising process. In this work, we propose LATTE - Latent Trajectory Embedding - a novel approach that models the evolution of latent embeddings across several denoising timesteps. By modeling the trajectory of such embeddings rather than single-step errors, LATTE captures subtle, discriminative patterns that distinguish real from generated images. Each latent is refined by employing our latent-visual feature refinement module and aggregated into a unified representation. Afterwards, it is fused with the visual features and finally passed into a lightweight classifier. Our experiments demonstrate that LATTE surpasses the baselines on several established benchmarks, such as GenImage and DiffusionFake. Moreover, it demonstrates strong performance in cross-generator and cross-datasets settings, highlighting the potential of using the trajectory of latent embeddings for generated image detection. The code is available on the following link: this https URL.
[CV-239] Detection of Rail Line Track and Human Beings Near the Track to Avoid Accidents
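Quick read: This paper aims to reduce potential accidents caused by people approaching railway tracks. The key to the solution is using the YOLOv5 deep learning model to detect rail lines and to recognise moving objects near the track in real time, with a focus on detecting humans. Working on real-time video, the method identifies people within one meter of the track and also supports recognition of objects at longer distances, improving safety in railway environments.
Link: https://arxiv.org/abs/2507.03040
Authors: Mehrab Hosain,Rajiv Kapoor
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: Accepted at COMITCON 2023; Published in Lecture Notes in Electrical Engineering, Vol. 1191, Springer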
Abstract:This paper presents an approach for rail line detection and the identification of human beings in proximity to the track, utilizing the YOLOv5 deep learning model to mitigate potential accidents. The technique incorporates real-time video data to identify railway tracks with impressive accuracy and recognizes nearby moving objects within a one-meter range, specifically targeting the identification of humans. This system aims to enhance safety measures in railway environments by providing real-time alerts for any detected human presence close to the track. The integration of a functionality to identify objects at a longer distance further fortifies the preventative capabilities of the system. With a precise focus on real-time object detection, this method is poised to deliver significant contributions to the existing technologies in railway safety. The effectiveness of the proposed method is demonstrated through a comprehensive evaluation, yielding a remarkable improvement in accuracy over existing methods. These results underscore the potential of this approach to revolutionize safety measures in railway environments, providing a substantial contribution to accident prevention strategies.
[CV-240] Intelligent Histology for Tumor Neurosurgery
Quick read: This paper addresses the shortcomings of the conventional intraoperative pathology workflow, which is slow, resource-intensive, and lacks real-time digital imaging. The key to the solution is an emerging approach, Intelligent Histology, that integrates artificial intelligence (AI) with stimulated Raman histology (SRH). SRH's rapid, label-free, digital imaging produces high-resolution images of surgical specimens within seconds, and AI then performs tumor histologic analysis, molecular classification, and tumor infiltration detection, yielding a transformative intraoperative workflow.
Link: https://arxiv.org/abs/2507.03037
Authors: Xinhai Hou,Akhil Kondepudi,Cheng Jiang,Yiwei Lyu,Samir Harake,Asadur Chowdury,Anna-Katharina Meißner,Volker Neuschmelting,David Reinecke,Gina Furtjes,Georg Widhalm,Lisa Irina Koerner,Jakob Straehle,Nicolas Neidert,Pierre Scheffler,Juergen Beck,Michael Ivan,Ashish Shah,Aditya Pandey,Sandra Camelo-Piragua,Dieter Henrik Heiland,Oliver Schnell,Chris Freudiger,Jacob Young,Melike Pekmezci,Katie Scotford,Shawn Hervey-Jumper,Daniel Orringer,Mitchel Berger,Todd Hollon
Affiliations: Invenio Imaging Inc.; University of Michigan; University Hospital Cologne; Medical University of Vienna; University Medical Center Freiburg; University of Miami; University of California, San Francisco; NYU Langone Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes:
Abstract:The importance of rapid and accurate histologic analysis of surgical tissue in the operating room has been recognized for over a century. Our standard-of-care intraoperative pathology workflow is based on light microscopy and H&E histology, which is slow, resource-intensive, and lacks real-time digital imaging capabilities. Here, we present an emerging and innovative method for intraoperative histologic analysis, called Intelligent Histology, that integrates artificial intelligence (AI) with stimulated Raman histology (SRH). SRH is a rapid, label-free, digital imaging method for real-time microscopic tumor tissue analysis. SRH generates high-resolution digital images of surgical specimens within seconds, enabling AI-driven tumor histologic analysis, molecular classification, and tumor infiltration detection. We review the scientific background, clinical translation, and future applications of intelligent histology in tumor neurosurgery. We focus on the major scientific and clinical studies that have demonstrated the transformative potential of intelligent histology across multiple neurosurgical specialties, including neurosurgical oncology, skull base, spine oncology, pediatric tumors, and peripheral nerve tumors. Future directions include the development of AI foundation models through multi-institutional datasets, incorporating clinical and radiologic data for multimodal learning, and predicting patient outcomes. Intelligent histology represents a transformative intraoperative workflow that can reinvent real-time tumor analysis for 21st-century neurosurgery.
[CV-241] Rethinking Data Protection in the (Generative) Artificial Intelligence Era
Quick read: This paper addresses the inadequacy of traditional notions of data protection in the generative AI era, where the dynamic and complex role of data across the AI lifecycle makes existing safeguards hard to apply. The key to the solution is a four-level taxonomy (non-usability, privacy preservation, traceability, and deletability) that covers the diverse protection needs of modern generative AI models and systems. A structured framework analyses the trade-offs between data utility and control and exposes regulatory blind spots, offering guidance on trustworthy data practices for future AI development and governance.
Link: https://arxiv.org/abs/2507.03034
Authors: Yiming Li,Shuo Shao,Yu He,Junfeng Guo,Tianwei Zhang,Zhan Qin,Pin-Yu Chen,Michael Backes,Philip Torr,Dacheng Tao,Kui Ren
Affiliations: Zhejiang University; Nanyang Technological University; University of Maryland; IBM Research; CISPA Helmholtz Center for Information Security; University of Oxford
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Notes: Perspective paper for a broader scientific audience. The first two authors contributed equally to this paper. 13 pages
Abstract:The (generative) artificial intelligence (AI) era has profoundly reshaped the meaning and value of data. No longer confined to static content, data now permeates every stage of the AI lifecycle, from the training samples that shape model parameters to the prompts and outputs that drive real-world model deployment. This shift renders traditional notions of data protection insufficient, while the boundaries of what needs safeguarding remain poorly defined. Failing to safeguard data in AI systems can inflict societal and individual harm, underscoring the urgent need to clearly delineate the scope of and rigorously enforce data protection. In this perspective, we propose a four-level taxonomy, including non-usability, privacy preservation, traceability, and deletability, that captures the diverse protection needs arising in modern (generative) AI models and systems. Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline, including training datasets, model weights, system prompts, and AI-generated content. We analyze representative technical approaches at each level and reveal regulatory blind spots that leave critical assets exposed. By offering a structured lens to align future AI technologies and governance with trustworthy data practices, we underscore the urgency of rethinking data protection for modern AI techniques and provide timely guidance for developers, researchers, and regulators alike.
[CV-242] Look-Back: Implicit Visual Re-focusing in MLLM Reasoning
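Quick read: This paper addresses the tendency of Multimodal Large Language Models (MLLMs) to over-rely on textual information during reasoning while neglecting the integration of visual input. Existing methods typically inject visual information explicitly to guide reasoning; this paper instead proposes an implicit approach, Look-Back. Its key insight, drawn from an analysis of MLLM attention patterns, is that with appropriate guidance a model can spontaneously re-focus on visual information in the later stages of reasoning without explicit visual injection. The model thus decides for itself when, where, and how to re-focus on visual input, improving both reasoning and perception.
Link: https://arxiv.org/abs/2507.03019
Authors: Shuo Yang,Yuwei Niu,Yuyang Liu,Yang Ye,Bin Lin,Li Yuan
Affiliations: Peking University, Shenzhen Graduate School; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: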
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual information injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce Look-Back, an implicit approach designed to guide MLLMs to "look back" at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model’s reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.
[CV-243] Markerless Stride Length estimation in Athletic using Pose Estimation with monocular vision
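Quick read: This paper aims to estimate athletes' stride length and speed transitions from video sequences and to assess the use of video analysis for athletes. The key to the solution is a computer-vision approach that combines the probabilistic Hough transform with a human pose detection algorithm to estimate the positions of runners' leg joints, then applies a homography transformation to compute stride length. Experiments indicate that the system can serve as a useful tool for coaching and training, with potential for measuring and monitoring athletes' gait parameters.
Link: https://arxiv.org/abs/2507.03016
Authors: Patryk Skorupski,Cosimo Distante,Pier Luigi Mazzeo
Affiliations: Università del Salento; Consiglio Nazionale delle Ricerche (CNR)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: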
Abstract:Performance measures such as stride length in athletics and the pace of runners can be estimated using different tricks, such as dividing the running length by the number of steps or relying on markers printed on the track. Monitoring individual performance is essential for supporting staff coaches in establishing a proper training schedule for each athlete. The aim of this paper is to investigate a computer vision-based approach for estimating stride length and speed transition from video sequences and assessing video analysis processing among athletes. Using some well-known image processing methodologies such as the probabilistic Hough transform combined with a human pose detection algorithm, we estimate the leg joint position of runners. In this way, applying a homography transformation, we can estimate the runner stride length. Experiments on various race videos with three different runners demonstrated that the proposed system represents a useful tool for coaching and training. This suggests its potential value in measuring and monitoring the gait parameters of athletes.
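The homography step described in the abstract can be sketched as follows. This is an illustration of the general idea only: the matrix `H`, the pixel coordinates, and the function names are hypothetical, and in practice the homography would be estimated from known track geometry (for example, lane lines found with the Hough transform).

```python
import math


def apply_homography(H, point):
    """Map an image point (pixels) to ground-plane coordinates via a 3x3 matrix."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to leave homogeneous coordinates


def stride_length(H, foot_a_px, foot_b_px):
    """Ground-plane distance between two successive foot-contact points."""
    ax, ay = apply_homography(H, foot_a_px)
    bx, by = apply_homography(H, foot_b_px)
    return math.hypot(bx - ax, by - ay)


# Hypothetical homography: a pure scaling of 0.01 m per pixel (no perspective).
H = [
    [0.01, 0.0, 0.0],
    [0.0, 0.01, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical ankle keypoints from a pose estimator at two foot contacts.
length_m = stride_length(H, (100, 400), (320, 400))  # about 2.2 m here
```

With a real camera the bottom row of `H` is not `[0, 0, 1]`, which is why the division by `w` matters.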
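[CV-244] Topological Signatures vs. Gradient Histograms: A Comparative Study for Medical Image Classification
Quick read: This paper addresses medical image classification, specifically the detection and grading of diabetic retinopathy in retinal fundus images. The key to the solution is a comparison of two different feature extraction techniques, the gradient-based Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA), evaluated by their classification performance. HOG and TDA features are extracted and used separately to train several machine learning models; both give competitive results on each task while capturing different structural properties of the images.
Link: https://arxiv.org/abs/2507.03006
Authors: Faisal Ahmed,Mohammad Alfrad Nobel Bhuiyan
Affiliations: Embry-Riddle Aeronautical University; Louisiana State University Health Sciences Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 18 pages, 12 figures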
Abstract:We present the first comparative study of two fundamentally distinct feature extraction techniques: Histogram of Oriented Gradients (HOG) and Topological Data Analysis (TDA), for medical image classification using retinal fundus images. HOG captures local texture and edge patterns through gradient orientation histograms, while TDA, using cubical persistent homology, extracts high-level topological signatures that reflect the global structure of pixel intensities. We evaluate both methods on the large APTOS dataset for two classification tasks: binary detection (normal versus diabetic retinopathy) and five-class diabetic retinopathy severity grading. From each image, we extract 26244 HOG features and 800 TDA features, using them independently to train seven classical machine learning models with 10-fold cross-validation. XGBoost achieved the best performance in both cases: 94.29 percent accuracy (HOG) and 94.18 percent (TDA) on the binary task; 74.41 percent (HOG) and 74.69 percent (TDA) on the multi-class task. Our results show that both methods offer competitive performance but encode different structural aspects of the images. This is the first work to benchmark gradient-based and topological features on retinal imagery. The techniques are interpretable, applicable to other medical imaging domains, and suitable for integration into deep learning pipelines.
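The abstract reports 26244 HOG features per image but does not state the HOG parameters. As a plausibility check only, one common configuration reproduces that count: a 224x224 input, 8x8-pixel cells, 2x2-cell blocks that slide one cell at a time, and 9 orientation bins. These values are assumptions, not details confirmed by the paper.

```python
def hog_feature_length(height, width, cell=8, block=2, bins=9):
    """Length of a HOG descriptor with blocks sliding by one cell.

    All default parameter values are assumptions chosen for illustration.
    """
    cells_y, cells_x = height // cell, width // cell
    blocks_y = cells_y - block + 1
    blocks_x = cells_x - block + 1
    # Each block contributes block*block cell histograms of `bins` values.
    return blocks_y * blocks_x * block * block * bins


n_features = hog_feature_length(224, 224)  # 27 * 27 blocks * 4 cells * 9 bins
```

The arithmetic is 28x28 cells, hence 27x27 block positions, times 36 values per block, which gives 26244.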
[CV-245] What to Do Next? Memorizing skills from Egocentric Instructional Video
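Quick read: This paper addresses high-level, goal-oriented action planning from an egocentric perspective, aiming to learn from demonstration by extracting meaningful information about the environment so that effective actions can be selected. The key to the solution is combining a topological affordance memory with a transformer architecture: memorizing the structure of the environment supports context-dependent action selection and allows action deviations to be detected while a specific goal is being pursued.
Link: https://arxiv.org/abs/2507.02997
Authors: Jing Bi,Chenliang Xu
Affiliations: University of Rochester
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: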
Abstract:Learning to perform activities through demonstration requires extracting meaningful information about the environment from observations. In this research, we investigate the challenge of planning high-level goal-oriented actions in a simulation setting from an egocentric perspective. We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with a transformer architecture. The process of memorizing the environment’s structure through extracting affordances facilitates selecting appropriate actions based on the context. Moreover, the memory model allows us to detect action deviations while accomplishing specific objectives. To assess the method’s versatility, we evaluate it in a realistic interactive simulation environment. Our experimental results demonstrate that the proposed approach learns meaningful representations, resulting in improved performance and robustness when action deviations occur.
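[CV-246] Text-Guided Multi-Instance Learning for Scoliosis Screening via Gait Video Analysis MICCAI
Quick read: This paper addresses the difficulty of detecting early-stage scoliosis in adolescents; traditional X-ray-based methods carry radiation risks and depend on clinical expertise, which limits their use in large-scale screening. The proposed solution is a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive detection from gait videos. Its key components are Dynamic Time Warping (DTW) clustering to segment gait sequences, an Inter-Bag Temporal Attention (IBTA) mechanism to focus on the most diagnostic features, a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations, and textual guidance from domain experts and large language models to strengthen feature representation and interpretability.
Link: https://arxiv.org/abs/2507.02996
Authors: Haiqing Li,Yuzhi Guo,Feng Jiang,Thao M. Dang,Hehuan Ma,Qifeng Zhou,Jean Gao,Junzhou Huang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: 10.5 pages, 4 figures, MICCAI conference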
Abstract:Early-stage scoliosis is often difficult to detect, particularly in adolescents, where delayed diagnosis can lead to serious health issues. Traditional X-ray-based methods carry radiation risks and rely heavily on clinical expertise, limiting their use in large-scale screenings. To overcome these challenges, we propose a Text-Guided Multi-Instance Learning Network (TG-MILNet) for non-invasive scoliosis detection using gait videos. To handle temporal misalignment in gait sequences, we employ Dynamic Time Warping (DTW) clustering to segment videos into key gait phases. To focus on the most relevant diagnostic features, we introduce an Inter-Bag Temporal Attention (IBTA) mechanism that highlights critical gait phases. Recognizing the difficulty in identifying borderline cases, we design a Boundary-Aware Model (BAM) to improve sensitivity to subtle spinal deviations. Additionally, we incorporate textual guidance from domain experts and large language models (LLM) to enhance feature representation and improve model interpretability. Experiments on the large-scale Scoliosis1K gait dataset show that TG-MILNet achieves state-of-the-art performance, particularly excelling in handling class imbalance and accurately detecting challenging borderline cases. The code is available at this https URL
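Dynamic Time Warping, which the abstract uses to handle temporal misalignment, can be illustrated with the textbook dynamic-programming distance below. This is a generic sketch on 1D sequences with an absolute-difference cost; how the paper applies DTW to gait features and clustering is not specified in the abstract, and the example signals are invented.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two 1D sequences (absolute-difference local cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Two hypothetical gait signals with the same shape but different timing.
fast = [0, 1, 2, 1, 0]
slow = [0, 0, 1, 2, 1, 0]
same_shape = dtw_distance(fast, slow)  # warping absorbs the timing difference
```

Because the distance tolerates stretching in time, sequences from walkers with different cadences can be compared or clustered by shape rather than by frame index.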
[CV-247] FreqCross: A Multi-Modal Frequency-Spatial Fusion Network for Robust Detection of Stable Diffusion 3.5 Generated Images
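Quick read: This paper addresses the challenge that high-fidelity synthetic images from diffusion models such as Stable Diffusion 3.5 pose to existing detection methods, aiming for robust detection of AI-generated images. The key to the solution is FreqCross, a multi-modal fusion network that combines spatial RGB features, frequency-domain artifacts, and radial energy distribution patterns. It uses a ResNet-18 backbone for spatial features, a lightweight CNN over 2D fast Fourier transform (FFT) magnitude spectra, and a multi-layer perceptron over radial energy profiles, fused by simple feature concatenation and a compact classification head to capture the characteristic frequency artifacts of diffusion-generated images.
Link: https://arxiv.org/abs/2507.02995
Authors: Guang Yang
Affiliations: University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Notes: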
Abstract:The rapid advancement of diffusion models, particularly Stable Diffusion 3.5, has enabled the generation of highly photorealistic synthetic images that pose significant challenges to existing detection methods. This paper presents FreqCross, a novel multi-modal fusion network that combines spatial RGB features, frequency domain artifacts, and radial energy distribution patterns to achieve robust detection of AI-generated images. Our approach leverages a three-branch architecture: (1) a ResNet-18 backbone for spatial feature extraction, (2) a lightweight CNN for processing 2D FFT magnitude spectra, and (3) a multi-layer perceptron for analyzing radial energy profiles. We introduce a novel radial energy distribution analysis that captures characteristic frequency artifacts inherent in diffusion-generated images, and fuse it with spatial and spectral cues via simple feature concatenation followed by a compact classification head. Extensive experiments on a dataset of 10,000 paired real (MS-COCO) and synthetic (Stable Diffusion 3.5) images demonstrate that FreqCross achieves 97.8% accuracy, outperforming state-of-the-art baselines by 5.2%. The frequency analysis further reveals that synthetic images exhibit distinct spectral signatures in the 0.1–0.4 normalised frequency range, providing theoretical foundation for our approach. Code and pre-trained models are publicly available to facilitate reproducible research.
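A radial energy profile of the kind the abstract describes can be sketched by binning spectral energy by distance from the zero-frequency point. The version below is a rough illustration under stated assumptions: it uses a naive DFT on tiny grayscale grids so that it needs only the standard library (a real pipeline would use an FFT routine), and the binning scheme and function names are invented here, not taken from the paper.

```python
import cmath
import math


def dft2_magnitude(image):
    """Naive 2D DFT magnitude, arranged so zero frequency sits at the centre."""
    n_rows, n_cols = len(image), len(image[0])
    mag = [[0.0] * n_cols for _ in range(n_rows)]
    for u in range(n_rows):
        for v in range(n_cols):
            total = 0j
            for r in range(n_rows):
                for c in range(n_cols):
                    angle = -2 * math.pi * (u * r / n_rows + v * c / n_cols)
                    total += image[r][c] * cmath.exp(1j * angle)
            # Shift indices so that (u, v) = (0, 0) lands in the middle.
            mag[(u + n_rows // 2) % n_rows][(v + n_cols // 2) % n_cols] = abs(total)
    return mag


def radial_energy_profile(mag, n_bins):
    """Fraction of spectral energy in each ring of normalised radius."""
    n_rows, n_cols = len(mag), len(mag[0])
    cy, cx = n_rows // 2, n_cols // 2
    max_radius = math.hypot(cy, cx)
    energy = [0.0] * n_bins
    for r in range(n_rows):
        for c in range(n_cols):
            radius = math.hypot(r - cy, c - cx)
            ring = min(int(radius / max_radius * n_bins), n_bins - 1)
            energy[ring] += mag[r][c] ** 2
    total = sum(energy)
    return [e / total for e in energy] if total else energy


# A flat image concentrates energy in the innermost ring; a checkerboard
# (the highest spatial frequency on this grid) concentrates it in the outermost.
flat = [[1.0] * 4 for _ in range(4)]
checker = [[(-1.0) ** (r + c) for c in range(4)] for r in range(4)]
flat_profile = radial_energy_profile(dft2_magnitude(flat), n_bins=4)
checker_profile = radial_energy_profile(dft2_magnitude(checker), n_bins=4)
```

A vector like `flat_profile` is the sort of low-dimensional input a small multi-layer perceptron could consume, which is consistent with the third branch described in the abstract.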
[CV-248] MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization MICCAI2025
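Quick read: This paper addresses the reliance of Medical Image Grounding (MIG) on large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to obtain. The key to the solution is adapting the Group Relative Policy Optimization (GRPO) reinforcement learning framework to Vision-Language Models (VLMs) and introducing Spatial-Semantic Rewards together with a Chain-of-Box template, improving the model's understanding of spatial relationships between regions and its localization ability without CoT annotations.
Link: https://arxiv.org/abs/2507.02994
Authors: Huihui Xu,Yuanpeng Nie,Hualiang Wang,Ying Chen,Wei Li,Junzhi Ning,Lihao Liu,Hongqiu Wang,Lei Zhu,Jiyao Liu,Xiaomeng Li,Junjun He
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Notes: MICCAI2025 Early Accept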
Abstract:Medical Image Grounding (MIG), which involves localizing specific regions in medical images based on textual descriptions, requires models to not only perceive regions but also deduce spatial relationships of these regions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations, which are expensive and time-consuming to acquire. Recently, DeepSeek-R1 demonstrated that Large Language Models (LLMs) can acquire reasoning abilities through Group Relative Policy Optimization (GRPO) without requiring CoT annotations. In this paper, we adapt the GRPO reinforcement learning framework to VLMs for Medical Image Grounding. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations. Specifically, we introduce Spatial-Semantic Rewards, which combine spatial accuracy reward and semantic consistency reward to provide nuanced feedback for both spatially positive and negative completions. Additionally, we propose to use the Chain-of-Box template, which integrates visual information of referring bounding boxes into the think reasoning process, enabling the model to explicitly reason about spatial regions during intermediate steps. Experiments on three datasets MS-CXR, ChestX-ray8, and M3D-RefSeg demonstrate that our method achieves state-of-the-art performance in Medical Image Grounding. Ablation studies further validate the effectiveness of each component in our approach. Code, checkpoints, and datasets are available at this https URL
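The group-relative part of GRPO can be sketched as normalising each sampled completion's reward against the other completions for the same prompt. In the sketch below, box IoU stands in for a spatial accuracy reward; that choice, the function names, and the example boxes are assumptions for illustration, since the abstract does not define the paper's Spatial-Semantic Rewards.

```python
import statistics


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical ground-truth box and three predicted boxes from one sampled group.
target = (0, 0, 2, 2)
predictions = [(1, 1, 3, 3), (0, 0, 2, 2), (5, 5, 6, 6)]
rewards = [iou(p, target) for p in predictions]
advantages = group_relative_advantages(rewards)  # above-average boxes get > 0
```

Because advantages are computed relative to the group, no separate value network or CoT annotation is needed to decide which completions to reinforce.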
[CV-249] Enabling Robust Real-Time Verification of Vision-Based Navigation through View Synthesis
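Quick read: This paper addresses the validation of image processing algorithms for vision-based navigation, where traditional approaches such as synthetic rendering or robotic testbed acquisition are difficult to set up and slow to run. The key to the solution is augmenting image datasets in real time with views synthesized at novel poses, producing continuous open- or closed-loop trajectories from sparse, pre-existing datasets. The paper also proposes a new distance metric between camera poses, the Boresight Deviation Distance, which is better suited to view synthesis than existing metrics, and uses it to develop a method for increasing the density of image datasets.
Link: https://arxiv.org/abs/2507.02993
Authors: Marius Neuhalfen,Jonathan Grzymisch,Manuel Sanchez-Gestido
Affiliations: European Space Agency, ESTEC, The Netherlands; RWTH Aachen University; École Centrale de Lille
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Notes: Published at the EUCASS2025 conference in Rome. Source code is public, please see link in paper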
Abstract:This work introduces VISY-REVE: a novel pipeline to validate image processing algorithms for Vision-Based Navigation. Traditional validation methods such as synthetic rendering or robotic testbed acquisition suffer from difficult setup and slow runtime. Instead, we propose augmenting image datasets in real-time with synthesized views at novel poses. This approach creates continuous trajectories from sparse, pre-existing datasets in open or closed-loop. In addition, we introduce a new distance metric between camera poses, the Boresight Deviation Distance, which is better suited for view synthesis than existing metrics. Using it, a method for increasing the density of image datasets is developed.
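[CV-250] Leveraging the Structure of Medical Data for Improved Representation Learning
Quick read: This paper addresses the need for data-efficient, domain-aware pretraining when building generalizable medical AI systems: clinical datasets such as MIMIC-CXR have limited image counts and scarce annotations but rich internal structure from multi-view imaging. The key to the solution is a self-supervised framework that exploits this inherent structure by treating paired chest X-rays (frontal and lateral views) as natural positive pairs, reconstructing each view from sparse patches while aligning their latent embeddings, and thereby learning informative representations without textual supervision.
Link: https://arxiv.org/abs/2507.02987
Authors: Andrea Agostini,Sonia Laguna,Alain Ryser,Samuel Ruiperez-Campillo,Moritz Vandenhirtz,Nicolas Deperrois,Farhad Nooralahzadeh,Michael Krauthammer,Thomas M. Sutter,Julia E. Vogt
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: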
Abstract:Building generalizable medical AI systems requires pretraining strategies that are data-efficient and domain-aware. Unlike internet-scale corpora, clinical datasets such as MIMIC-CXR offer limited image counts and scarce annotations, but exhibit rich internal structure through multi-view imaging. We propose a self-supervised framework that leverages the inherent structure of medical datasets. Specifically, we treat paired chest X-rays (i.e., frontal and lateral views) as natural positive pairs, learning to reconstruct each view from sparse patches while aligning their latent embeddings. Our method requires no textual supervision and produces informative representations. Evaluated on MIMIC-CXR, we show strong performance compared to supervised objectives and baselines trained without leveraging structure. This work provides a lightweight, modality-agnostic blueprint for domain-specific pretraining where data is structured but scarce.
[CV-251] Iterative Misclassification Error Training (IMET): An Optimized Neural Network Training Technique for Image Classification
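Quick read: This paper addresses overfitting in medical image classification caused by noisy, mislabeled, or poorly generalizable data, and the resulting loss of diagnostic accuracy on edge cases and anomalous outcomes. The key to the solution is a new framework, Iterative Misclassification Error Training (IMET), which draws on curriculum learning and coreset selection to identify misclassified samples and prioritize the model's attention on edge cases and rare outcomes, improving robustness and accuracy.
Link: https://arxiv.org/abs/2507.02979
Authors: Ruhaan Singh,Sreelekha Guggilam
Affiliations: National Institute of Standards and Technology; Colorado State University; University of Colorado
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: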
Abstract:Deep learning models have proven to be effective on medical datasets for accurate diagnostic predictions from images. However, medical datasets often contain noisy, mislabeled, or poorly generalizable images, particularly for edge cases and anomalous outcomes. Additionally, high quality datasets are often small in sample size, which can result in overfitting, where models memorize noise rather than learn generalizable patterns. This, in particular, could pose serious risks in medical diagnostics, where the risk associated with misclassification can impact human life. Several data-efficient training strategies have emerged to address these constraints. In particular, coreset selection identifies compact subsets of the most representative samples, enabling training that approximates full-dataset performance while reducing computational overhead. On the other hand, curriculum learning relies on gradually increasing training difficulty and accelerating convergence. However, developing a generalizable difficulty ranking mechanism that works across diverse domains, datasets, and models while reducing computational cost remains challenging. In this paper, we introduce Iterative Misclassification Error Training (IMET), a novel framework inspired by curriculum learning and coreset selection. The IMET approach aims to identify misclassified samples in order to streamline the training process, while prioritizing the model’s attention to edge-case scenarios and rare outcomes. The paper evaluates IMET’s performance on benchmark medical image classification datasets against state-of-the-art ResNet architectures. The results demonstrating IMET’s potential for enhancing model robustness and accuracy in medical image analysis are also presented in the paper.
[CV-252] Ascending the Infinite Ladder: Benchmarking Spatial Deformation Reasoning in Vision-Language Models
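Quick read: This paper addresses the weakness of current Vision-Language Models (VLMs) on spatial deformation reasoning, asking whether these models truly understand and manipulate spatial objects. The key to the solution is a spatial deformation reasoning benchmark spanning 2D to 3D, with a data engine that generates unlimited evaluation problem pairs without data leakage, allowing models to be assessed in two directions: forward reasoning (given the operations, find the final state) and reverse reasoning (given the final state, determine the operations).
Link: https://arxiv.org/abs/2507.02978
Authors: Jiahuan Zhang,Shunwen Bai,Tianheng Wang,Kaiwen Guo,Kai Han,Guozheng Rao,Kaicheng Yu
Affiliations: Autolab, Westlake University; Tianjin University; Zhejiang University; Capital Normal University; University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: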
Abstract:Humans naturally possess the spatial reasoning ability to form and manipulate images and structures of objects in space. There is an increasing effort to endow Vision-Language Models (VLMs) with similar spatial reasoning capabilities. However, it remains unclear whether these models truly understand and manipulate spatial objects or not. To address this question, we propose a new evaluation framework aimed at assessing the performance of VLMs in spatial deformation reasoning tasks. Specifically, we construct a benchmark for spatial deformation reasoning from 2D to 3D. Leveraging our data engine, we can generate unlimited evaluation problem pairs with infinite steps, without any data leakage. We explore whether the model can effectively perform spatial deformation reasoning from two directions: forward reasoning (given the operations, find the final state) and reverse reasoning (given the final state, determine the operations). We adopt a ladder competition format, using the number of deformation steps as the level classification criterion, with the goal of exploring the boundaries of the model’s deformation reasoning capabilities. Interestingly, the benchmarking results reveal that almost no model demonstrates plausible spatial deformation reasoning abilities. Furthermore, even after applying targeted training and mainstream reasoning enhancement methods, the models are still unable to perform well on 3D spatial deformation reasoning.
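The two reasoning directions described in the abstract can be illustrated on a toy 2D grid. This is a minimal sketch only: the operation set (`rot90`, `flip_h`, `flip_v`), the grid representation, and the brute-force reverse search are assumptions made for illustration, not the benchmark's actual deformation operations or data engine.

```python
import itertools
import numpy as np

# Hypothetical operation set; the benchmark's real deformations are not specified here.
OPS = {
    "rot90": np.rot90,    # rotate 90 degrees counterclockwise
    "flip_h": np.fliplr,  # mirror left-right
    "flip_v": np.flipud,  # mirror up-down
}

def forward(grid, ops):
    """Forward reasoning: apply the named operations in order, return the final state."""
    for name in ops:
        grid = OPS[name](grid)
    return grid

def reverse(start, final, n_steps):
    """Reverse reasoning: list every n_steps-long sequence that maps start to final."""
    return [
        seq
        for seq in itertools.product(OPS, repeat=n_steps)
        if np.array_equal(forward(start, seq), final)
    ]

start = np.array([[1, 2], [3, 4]])
final = forward(start, ["rot90", "flip_h"])
solutions = reverse(start, final, n_steps=2)
```

Increasing `n_steps` corresponds to climbing a level in the ladder format; the reverse search space grows as the number of operations raised to the number of steps.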
[CV-253] Mimesis, Poiesis, and Imagination: Exploring Text-to-Image Generation of Biblical Narratives
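[Quick read]: This paper examines how well AI reproduces or reinterprets Biblical narratives, such as the finding of Moses in the Nile in Exodus 2:5-9, and where its limits lie. The approach is a comparative analysis of AI-generated images against conventional imagery (Google image results and classical paintings), examining text-to-image models along stylistic, theological, and cultural dimensions and assessing the balance between imitation and creative generation. The study finds that although AI can produce aesthetically rich and imaginative visuals, its output remains shaped by the biases and limitations of its training data, which raises questions about its capacity for genuine creativity, authorial intent, and theological depth.
Link: https://arxiv.org/abs/2507.02973
Authors: Willem Th. van Peursen, Samuel E. Entsua-Mensah
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: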
Abstract:This study explores the intersection of artificial intelligence and the visualization of Biblical narratives by analyzing AI-generated images of Exodus 2:5-9 (Moses found in River Nile) using MidJourney. Drawing on the classical concepts of mimesis (imitation) and poiesis (creative generation), the authors investigate how text-to-image (T2I) models reproduce or reimagine sacred narratives. Through comparative visual analysis, including Google image results and classical paintings, the research evaluates the stylistic, theological, and cultural dimensions of AI-generated depictions. Findings show that while AI excels in producing aesthetically rich and imaginative visuals, it also reflects the biases and limitations of its training data. The study highlights AI’s potential to augment human imagination but questions its capacity for genuine creativity, authorial intent, and theological depth. It concludes by suggesting that AI can serve as a creative partner in reinterpreting biblical texts, though its role in sacred art remains complex and contested.
[CV-254] Farm-Level In-Season Crop Identification for India
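[Quick read]: This paper addresses a key data gap in crop type information for major agricultural nations such as India, specifically timely and accurate farm-level, multi-crop identification at national scale. Existing approaches struggle with limited geographic coverage, restricted crop type coverage, mixed pixels and heterogeneous landscapes, and in-season identification. The key to the solution is a deep learning framework that fuses Sentinel-1 and Sentinel-2 satellite imagery with national-scale farm boundary data and adds an automated season detection algorithm, enabling reliable crop identification as early as about two months into the growing season and yielding what the authors describe as the first pan-India, in-season, farm-level crop type data product.
Link: https://arxiv.org/abs/2507.02972
Authors: Ishan Deshpande, Amandeep Kaur Reehal, Chandan Nath, Renu Singh, Aayush Patel, Aishwarya Jayagopal, Gaurav Singh, Gaurav Aggarwal, Amit Agarwal, Prathmesh Bele, Sridhar Reddy, Tanya Warrier, Kinjal Singh, Ashish Tendulkar, Luis Pazos Outon, Nikita Saxena, Agata Dondzik, Dinesh Tewari, Shruti Garg, Avneet Singh, Harsh Dhand, Vaibhav Rajan, Alok Talekar
Affiliations: Google
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: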
Abstract:Accurate, timely, and farm-level crop type information is paramount for national food security, agricultural policy formulation, and economic planning, particularly in agriculturally significant nations like India. While remote sensing and machine learning have become vital tools for crop monitoring, existing approaches often grapple with challenges such as limited geographical scalability, restricted crop type coverage, the complexities of mixed-pixel and heterogeneous landscapes, and crucially, the robust in-season identification essential for proactive decision-making. We present a deep learning framework that addresses critical data gaps for targeted, data-driven decision-making by generating farm-level, in-season, multi-crop identification at national scale (India). Our methodology leverages the strengths of Sentinel-1 and Sentinel-2 satellite imagery, integrated with national-scale farm boundary data. The model successfully identifies 12 major crops, which collectively account for nearly 90% of India’s total cultivated area, and shows agreement with the 2023-24 national crop census of 94% in the winter season and 75% in the monsoon season. Our approach incorporates an automated season detection algorithm, which estimates crop sowing and harvest periods. This allows for reliable crop identification as early as two months into the growing season and facilitates rigorous in-season performance evaluation. Furthermore, we have engineered a highly scalable inference pipeline, culminating in what is, to our knowledge, the first pan-India, in-season, farm-level crop type data product. The system’s effectiveness and scalability are demonstrated through robust validation against national agricultural statistics, showcasing its potential to deliver actionable, data-driven insights for transformative agricultural monitoring and management across India.
[CV-255] YOLO-Based Pipeline Monitoring in Challenging Visual Environments
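[Quick read]: This paper addresses the challenges of condition monitoring of subsea pipelines in low-visibility underwater environments, where turbidity, light distortion, and image degradation keep traditional vision-based inspection systems from providing reliable data for mapping, object recognition, or defect detection. The key to the solution is integrating advanced artificial intelligence (AI) techniques to enhance image quality, detect pipeline structures, and support autonomous fault diagnosis, with a focus on comparing YOLOv8 and YOLOv11, each in three variants, on image segmentation tasks in complex, low-visibility subsea environments.
Link: https://arxiv.org/abs/2507.02967
Authors: Pragya Dhungana, Matteo Fresta, Niraj Tamrakar, Hariom Dhungana
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: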
Abstract:Condition monitoring of subsea pipelines in low-visibility underwater environments poses significant challenges due to turbidity, light distortion, and image degradation. Traditional visual-based inspection systems often fail to provide reliable data for mapping, object recognition, or defect detection in such conditions. This study explores the integration of advanced artificial intelligence (AI) techniques to enhance image quality, detect pipeline structures, and support autonomous fault diagnosis. It conducts a comparative analysis of two of the most robust YOLO versions, YOLOv8 and YOLOv11, and three variants of each tailored for image segmentation tasks in complex and low-visibility subsea environments. Using pipeline inspection datasets captured beneath the seabed, it evaluates model performance in accurately delineating target structures under challenging visual conditions. The results indicate that YOLOv11 outperformed YOLOv8 in overall performance.
[CV-256] Concept-based Adversarial Attack: a Probabilistic Perspective
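[Quick read]: This paper addresses the limitation that traditional adversarial attacks are confined to single-image perturbations and struggle to produce diverse adversarial examples that preserve the original concept. The key to the solution is a concept-based adversarial attack framework that takes a probabilistic perspective and operates on an entire concept (represented by a probabilistic generative model or a set of images) rather than modifying a single image, generating adversarial examples that vary in pose, viewpoint, or background while retaining the original concept.
Link: https://arxiv.org/abs/2507.02965
Authors: Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski
Affiliations: University of Manchester; Beijing Normal University; University of Edinburgh; University of Aalto
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: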
Abstract:We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept – represented by a probabilistic generative model or a set of images – to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.
[CV-257] VR-YOLO: Enhancing PCB Defect Detection with Viewpoint Robustness Based on YOLO
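[Quick read]: This paper addresses the need for automated defect detection of electronic components in traditional industrial production, in particular the stringent requirements that YOLO-based detection places on the angle, orientation, and clarity of target images in practice. The key to the solution is VR-YOLO, an enhanced PCB defect detection algorithm that uses a diversified scene enhancement (DSE) method to increase target diversity, combined with a novel key object focus (KOF) scheme that uses an angular loss and an additional attention mechanism to strengthen fine-grained learning of small-target features, thereby improving generalization and viewpoint robustness.
Link: https://arxiv.org/abs/2507.02963
Authors: Hengyi Zhu, Linye Wei, He Li
Affiliations: Chien-Shiung Wu College, Southeast University; School of Electronic Science & Engineering, Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: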
Abstract:The integration of large-scale circuits and systems emphasizes the importance of automated defect detection of electronic components. The YOLO image detection model has been used to detect PCB defects and it has become a typical AI-assisted case of traditional industrial production. However, conventional detection algorithms have stringent requirements for the angle, orientation, and clarity of target images. In this paper, we propose an enhanced PCB defect detection algorithm, named VR-YOLO, based on the YOLOv8 model. This algorithm aims to improve the model’s generalization performance and enhance viewpoint robustness in practical application scenarios. We first propose a diversified scene enhancement (DSE) method that expands the PCB defect dataset by incorporating diverse scenarios and segmenting samples to improve target diversity. A novel key object focus (KOF) scheme is then presented by considering angular loss and introducing an additional attention mechanism to enhance fine-grained learning of small target features. Experimental results demonstrate that our improved PCB defect detection approach achieves a mean average precision (mAP) of 98.9% for the original test images, and 94.7% for the test images with viewpoint shifts (horizontal and vertical shear coefficients of ±0.06 and rotation angle of ±10 degrees), showing significant improvements compared to the baseline YOLO model with negligible additional computational cost.
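To make the viewpoint-shift setting concrete, the sketch below composes a shear and a rotation into one linear map and applies it to the corners of an image. The abstract gives only the ranges (shear coefficients within ±0.06, rotation within ±10 degrees); the order of composition, the 640x480 image size, and the function names are assumptions for illustration.

```python
import numpy as np

def viewpoint_shift_matrix(shear_x, shear_y, angle_deg):
    """Compose a shear followed by a rotation into one 2x2 linear map (assumed order)."""
    theta = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    shear = np.array([[1.0, shear_x],
                      [shear_y, 1.0]])
    return rotation @ shear

def transform_points(points, matrix, center):
    """Apply the linear map to (N, 2) points about a fixed center."""
    return (points - center) @ matrix.T + center

rng = np.random.default_rng(0)
shear_x, shear_y = rng.uniform(-0.06, 0.06, size=2)  # range taken from the abstract
angle = rng.uniform(-10.0, 10.0)                     # range taken from the abstract
M = viewpoint_shift_matrix(shear_x, shear_y, angle)

# Hypothetical 640x480 image; its corners double as a bounding-box example.
corners = np.array([[0, 0], [640, 0], [640, 480], [0, 480]], dtype=float)
center = corners.mean(axis=0)
shifted = transform_points(corners, M, center)
```

The same matrix can be applied to defect bounding-box corners so that labels stay aligned with the warped image.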
[CV-258] CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning
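[Quick read]: This paper targets the computational bottleneck that the quadratic complexity of standard attention creates when Vision-Language Models (vLLMs) process long video sequences and rich language descriptions. The key solution is the Compressed Sensing Attention Transformer (CSAT), which projects high-dimensional key and value representations into a lower-dimensional subspace and reconstructs the attention outputs with sparse recovery algorithms, significantly reducing attention complexity while preserving semantic fidelity.
Link: https://arxiv.org/abs/2507.02957
Authors: Andrew Kiruluta, Preethi Raju, Priscilla Burity
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: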
Abstract:Vision-Language Models (vLLMs) have emerged as powerful architectures for joint reasoning over visual and textual inputs, enabling breakthroughs in image captioning, cross-modal retrieval, and multimodal dialogue. However, as these models scale to longer video sequences and richer language descriptions, the quadratic complexity of the standard attention mechanism presents a fundamental computational bottleneck. This challenge is exacerbated in vLLMs, where attention must be computed not only within modalities but also across them, leading to prohibitive memory and latency costs. In this work, we introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. By projecting high-dimensional key and value representations into a lower-dimensional subspace via random measurement matrices and reconstructing the attention outputs using sparse recovery algorithms, CSAT significantly reduces attention complexity while maintaining semantic fidelity. Applied to vLLMs, CSAT exploits the inherent compressibility of both visual and textual representations, which is especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse. In contrast to LLMs, which must often model entangled symbolic dependencies, vLLMs benefit from structured sparsity in alignment and scene composition, making them particularly well-suited to compressed attention. We provide a formal mathematical treatment of CSAT, demonstrate its integration into vision-language pipelines, and validate its performance on standard benchmarks, highlighting its promise as a scalable, interpretable, and resource-efficient solution for next-generation multimodal transformers.
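The projection step described in the abstract can be sketched as follows. This is not CSAT itself: it shows only keys and values being compressed along the sequence axis by a random Gaussian measurement matrix, and it omits the sparse recovery stage entirely. The function names, the Gaussian choice, and all dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compressed_attention(Q, K, V, m, rng):
    """Attention over m random measurements of the n key/value rows (m < n)."""
    n, d = K.shape
    # Gaussian measurement matrix, scaled so that E[Phi.T @ Phi] is the identity.
    Phi = rng.normal(size=(m, n)) / np.sqrt(m)
    K_c = Phi @ K                      # (m, d) compressed keys
    V_c = Phi @ V                      # (m, d_v) compressed values
    scores = Q @ K_c.T / np.sqrt(d)    # (n_q, m) score matrix instead of (n_q, n)
    return softmax(scores) @ V_c

rng = np.random.default_rng(0)
n_q, n, d, m = 8, 1024, 64, 128        # hypothetical sizes
Q = rng.normal(size=(n_q, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
out = compressed_attention(Q, K, V, m, rng)
```

With these sizes the score matrix shrinks from 8x1024 to 8x128 entries; how closely the output tracks full attention depends on the compressibility of the keys and values, which this sketch does not measure.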
[CV-259] Multimodal image registration for effective thermographic fever screening
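[Quick read]: This paper addresses how to accurately localize the inner canthi regions when screening for fever with infrared thermographs (IRTs), in order to make temperature monitoring in public places such as hospitals and airports more effective. The key to the solution is multi-modal registration that precisely aligns infrared images with white-light images, using different registration models based on landmarks and on edge detection of eye contours in a coarse-to-fine strategy, which achieves registration accuracy within 2.7 mm and so supports accurate localization of the canthi regions.
Link: https://arxiv.org/abs/2507.02955
Authors: C.Y.N. Dwith, Pejhman Ghassemi, Joshua Pfefer, Jon Casamento, Quanzeng Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: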
Abstract:Fever screening based on infrared thermographs (IRTs) is a viable mass screening approach during infectious disease pandemics, such as Ebola and SARS, for temperature monitoring in public places like hospitals and airports. IRTs have been found to be powerful, quick, and non-invasive methods to detect elevated temperatures. Moreover, regions medially adjacent to the inner canthi (called the canthi regions in this paper) are preferred sites for fever screening. Accurate localization of the canthi regions can be achieved through multi-modal registration of infrared (IR) and white-light images. We proposed a registration method through a coarse-fine registration strategy using different registration models based on landmarks and edge detection on eye contours. We evaluated the registration accuracy to be within 2.7 mm, which enables accurate localization of the canthi regions.
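A coarse, landmark-based stage of such a registration can be sketched as a least-squares affine fit between matched landmarks in the two modalities. The affine model, the landmark coordinates, and the function names below are assumptions for illustration; the abstract does not specify which registration models the authors use, and the edge-based refinement stage is not shown.

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine map taking src landmarks (N, 2) to dst landmarks (N, 2)."""
    A = np.hstack([src, np.ones((src.shape[0], 1))])   # (N, 3) homogeneous coordinates
    params, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (3, 2) affine parameters
    return params

def apply_affine(params, pts):
    """Map (N, 2) points with the fitted affine parameters."""
    return np.hstack([pts, np.ones((pts.shape[0], 1))]) @ params

# Hypothetical matched landmarks (pixels): infrared image -> white-light image.
ir_landmarks = np.array([[120.0, 80.0], [200.0, 82.0], [160.0, 140.0], [158.0, 190.0]])
true_map = np.array([[1.9, 0.05], [-0.04, 2.1], [15.0, -8.0]])   # synthetic ground truth
wl_landmarks = apply_affine(true_map, ir_landmarks)

params = fit_affine(ir_landmarks, wl_landmarks)
residual = np.linalg.norm(apply_affine(params, ir_landmarks) - wl_landmarks, axis=1)
```

Converting the pixel residual to millimetres, as in the reported 2.7 mm figure, would additionally require the physical pixel size at the subject's distance, which is not given here.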
[CV-260] DriveMRP: Enhancing Vision-Language Models with Synthetic Motion Data for Motion Risk Prediction
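[Quick read]: This paper addresses the difficulty of accurately predicting the safety of the ego vehicle's future motion in long-tail scenarios, where dynamic environments are uncertain and data coverage is limited. The key to the solution is to strengthen the motion risk prediction of Vision-Language Models (VLMs) by synthesizing high-risk motion data: a Bird's-Eye View (BEV) based motion simulation method models risk from the ego vehicle, other vehicles, and the environment, and a VLM-agnostic motion risk estimation framework, DriveMRP-Agent, uses a novel information injection strategy to improve VLM reasoning about the spatial relationships between motion waypoints and the environment.
Link: https://arxiv.org/abs/2507.02948
Authors: Zhiyi Hou, Enhui Ma, Fang Li, Zhiyi Lai, Kalok Ho, Zhanqian Wu, Lijun Zhou, Long Chen, Chitian Sun, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Kaicheng Yu
Affiliations: Westlake University; Xiaomi EV; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 12 pages, 4 figures. Code available at this https URL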
Abstract:Autonomous driving has seen significant progress, driven by extensive real-world data. However, in long-tail scenarios, accurately predicting the safety of the ego vehicle’s future motion remains a major challenge due to uncertainties in dynamic environments and limitations in data coverage. In this work, we aim to explore whether it is possible to enhance the motion risk prediction capabilities of Vision-Language Models (VLMs) by synthesizing high-risk motion data. Specifically, we introduce a Bird’s-Eye View (BEV) based motion simulation method to model risks from three aspects: the ego-vehicle, other vehicles, and the environment. This allows us to synthesize plug-and-play, high-risk motion data suitable for VLM training, which we call DriveMRP-10K. Furthermore, we design a VLM-agnostic motion risk estimation framework, named DriveMRP-Agent. This framework incorporates a novel information injection strategy for global context, ego-vehicle perspective, and trajectory projection, enabling VLMs to effectively reason about the spatial relationships between motion waypoints and the environment. Extensive experiments demonstrate that by fine-tuning with DriveMRP-10K, our DriveMRP-Agent framework can significantly improve the motion risk prediction performance of multiple VLM baselines, with the accident recognition accuracy soaring from 27.13% to 88.03%. Moreover, when tested via zero-shot evaluation on an in-house real-world high-risk motion dataset, DriveMRP-Agent achieves a significant performance leap, boosting the accuracy from the base model’s 29.42% to 68.50%, which showcases the strong generalization capabilities of our method in real-world scenarios.
[CV-261] Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
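[Quick read]: This paper addresses the weak performance of Multimodal Large Language Models (MLLMs) on long videos, which stems from their inefficient perception of temporal intervals. The key to the solution is Temporal Search (TS), a training-free framework built on the observation that the model's generation confidence across temporal intervals is highly correlated with prediction accuracy; TS iteratively explores finer-grained temporal intervals to improve long-video understanding. By repeatedly shifting the model's focus and collecting keyframe-level descriptions to support cross-interval perception, TS improves both the efficiency and the accuracy of long-video processing.
Link: https://arxiv.org/abs/2507.02946
Authors: Chenglin Li, Qianglong Chen, fengtao, Yin Zhang
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: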
Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search (TS), a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long-video understanding. TS is based on a key observation: the model’s generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval that is likely to contain task-relevant information. Then, it samples a fixed number of frames from the interval, regardless of length, and feeds them into the model to produce a refined response and confidence score. TS refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos. Additionally, keyframe-level descriptions are collected to facilitate cross-interval perception throughout the video. To further improve efficiency, we introduce TS-BFS, a best-first search strategy over a tree. Each node represents a candidate interval and is expanded via two methods: self-driven proposals and uniform partitioning. Nodes are scored based on confidence and self-evaluation, and the most promising one is selected for continued exploration.
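The best-first search over candidate intervals can be sketched with a priority queue. This is a simplified illustration, not TS-BFS as implemented by the authors: nodes are expanded only by uniform bisection (the self-driven proposals are omitted), and `toy_confidence` is a hypothetical stand-in for the model's confidence and self-evaluation score.

```python
import heapq

def best_first_interval_search(duration, score_fn, min_length, max_expansions=50):
    """Repeatedly expand the highest-scoring interval by uniform bisection."""
    root = (0.0, duration)
    # heapq is a min-heap, so scores are negated to pop the best interval first.
    heap = [(-score_fn(*root), root)]
    best_score, best_interval = score_fn(*root), root
    for _ in range(max_expansions):
        if not heap:
            break
        neg_score, (lo, hi) = heapq.heappop(heap)
        if -neg_score > best_score:
            best_score, best_interval = -neg_score, (lo, hi)
        if hi - lo <= min_length:
            continue  # too short to split further
        mid = (lo + hi) / 2.0
        for child in ((lo, mid), (mid, hi)):
            heapq.heappush(heap, (-score_fn(*child), child))
    return best_interval, best_score

def toy_confidence(lo, hi, event=(62.0, 70.0)):
    """Hypothetical score: fraction of the interval covered by the relevant event."""
    overlap = max(0.0, min(hi, event[1]) - max(lo, event[0]))
    return overlap / (hi - lo)

interval, score = best_first_interval_search(128.0, toy_confidence, min_length=8.0)
```

In a real system each call to `score_fn` would sample a fixed number of frames from the interval and query the model, so `max_expansions` directly bounds the number of model calls.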
[CV-262] Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting ICCV-2025
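[Quick read]: This paper addresses the low training efficiency and high memory consumption of complex models in spatiotemporal forecasting. The key solution is a lightweight framework, Spectral Decoupled Knowledge Distillation (SDKD), which transfers the multi-scale spatiotemporal representations of a complex teacher model to a more efficient lightweight student network, improving performance while reducing computational complexity. The method uses a frequency-aligned knowledge distillation strategy that extracts multi-scale spectral features from the teacher's latent space to guide the student in capturing both local fine-grained variations and global evolution patterns.
Link: https://arxiv.org/abs/2507.02939
Authors: Yuqi Li, Chuanguang Yang, Hansheng Zeng, Zeyu Dong, Zhulin An, Yongjun Xu, Yingli Tian, Hao Wu
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; The University of Hong Kong; The City University of New York; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV-2025, 11 pages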
Abstract:Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation (termed SDKD), which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details and low-frequency trends using convolution and Transformer (global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher’s latent space, including both high- and low-frequency components, to guide the lightweight student model in capturing both local fine-grained variations and global evolution patterns. Experimental results show that SDKD significantly improves performance, achieving reductions of up to 81.3% in MSE and 52.3% in MAE on the Navier-Stokes equation dataset. The framework effectively captures both high-frequency variations and long-term trends while reducing computational complexity. Our codes are available at this https URL
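One way to picture a frequency-aligned distillation objective is to split teacher and student feature maps into low- and high-frequency bands and penalize the mismatch in each band separately. The sketch below does this with a radial FFT mask; the mask, the cutoff, the band weights, and the function names are all assumptions for illustration and are not taken from the SDKD paper.

```python
import numpy as np

def split_frequencies(field, cutoff):
    """Split a 2D field into low- and high-frequency parts with a radial FFT mask."""
    spectrum = np.fft.fft2(field)
    fy = np.fft.fftfreq(field.shape[0])[:, None]
    fx = np.fft.fftfreq(field.shape[1])[None, :]
    low_mask = np.sqrt(fx**2 + fy**2) <= cutoff   # keep frequencies inside the radius
    low = np.fft.ifft2(spectrum * low_mask).real
    return low, field - low                       # the high band is the remainder

def frequency_aligned_loss(student, teacher, cutoff=0.1, w_low=1.0, w_high=1.0):
    """Weighted sum of per-band mean squared errors between student and teacher."""
    s_low, s_high = split_frequencies(student, cutoff)
    t_low, t_high = split_frequencies(teacher, cutoff)
    low_term = np.mean((s_low - t_low) ** 2)
    high_term = np.mean((s_high - t_high) ** 2)
    return w_low * low_term + w_high * high_term

rng = np.random.default_rng(0)
teacher_feat = rng.normal(size=(32, 32))   # hypothetical latent feature map
student_feat = teacher_feat + 0.1 * rng.normal(size=(32, 32))
loss = frequency_aligned_loss(student_feat, teacher_feat)
```

Separate weights let training emphasize either long-term trends (`w_low`) or fine-grained variations (`w_high`).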
[CV-263] OBSER: Object-Based Sub-Environment Recognition for Zero-Shot Environmental Inference ICCV2025
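[Quick read]: This paper addresses reliable and accurate inference of the relationships between sub-environments and objects in open-world and photorealistic environments, in particular the limitations of scene-based methods in chained retrieval tasks. The key to the solution is Object-Based Sub-Environment Recognition (OBSER), a Bayesian framework in which metric learning and self-supervised learning models estimate the object distributions of sub-environments in latent space in order to compute three fundamental relationships; it also introduces the (ε, δ) statistically separable (EDS) function to measure representation alignment, enabling zero-shot environment recognition and autonomous environment understanding.
Link: https://arxiv.org/abs/2507.02929
Authors: Won-Seok Choi, Dong-Sig Han, Suhyung Choi, Hyeonseo Yang, Byoung-Tak Zhang
Affiliations: Seoul National University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: This manuscript was initially submitted to ICCV 2025 and is now made available as a preprint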
Abstract:We present the Object-Based Sub-Environment Recognition (OBSER) framework, a novel Bayesian framework that infers three fundamental relationships between sub-environments and their constituent objects. In the OBSER framework, metric and self-supervised learning models estimate the object distributions of sub-environments on the latent space to compute these measures. Both theoretically and empirically, we validate the proposed framework by introducing the (ε, δ) statistically separable (EDS) function which indicates the alignment of the representation. Our framework reliably performs inference in open-world and photorealistic environments and outperforms scene-based methods in chained retrieval tasks. The OBSER framework enables zero-shot recognition of environments to achieve autonomous environment understanding.
[CV-264] Modeling Urban Food Insecurity with Google Street View Images
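[Quick read]: This paper addresses the limited scalability of methods for identifying food insecurity in urban areas, which traditionally rely on qualitative and quantitative survey data that are hard to collect at scale. The key to the solution is to model food insecurity from street-level images with a two-step process: feature extraction followed by gated attention for image aggregation.
Link: https://arxiv.org/abs/2507.02924
Authors: David Li
Affiliations: Stanford University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: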
Abstract:Food insecurity is a significant social and public health issue that plagues many urban metropolitan areas around the world. Existing approaches to identifying food insecurity rely primarily on qualitative and quantitative survey data, which is difficult to scale. This project seeks to explore the effectiveness of using street-level images in modeling food insecurity at the census tract level. To do so, we propose a two-step process of feature extraction and gated attention for image aggregation. We evaluate the effectiveness of our model by comparing against other model architectures, interpreting our learned weights, and performing a case study. While our model falls slightly short in terms of its predictive power, we believe our approach still has the potential to supplement existing methods of identifying food insecurity for urban planners and policymakers.
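The aggregation step can be sketched with one common form of gated attention pooling from the attention-based multiple-instance learning literature, where each image feature receives a learned weight and the tract-level representation is their weighted sum. Whether the paper uses exactly this form is an assumption, as are the dimensions, the random parameters, and the function name.

```python
import numpy as np

def gated_attention_pool(H, V, U, w):
    """Aggregate K image features H (K, d) into one vector using gated attention weights."""
    sigmoid_gate = 1.0 / (1.0 + np.exp(-(H @ U)))   # (K, h) gate values in (0, 1)
    hidden = np.tanh(H @ V) * sigmoid_gate          # (K, h) gated hidden features
    logits = hidden @ w                             # (K,) one score per image
    logits = logits - logits.max()                  # stabilize the softmax
    weights = np.exp(logits) / np.exp(logits).sum()
    return weights @ H, weights

rng = np.random.default_rng(0)
K_images, d, h = 20, 128, 32                # hypothetical: 20 street-view images per tract
H = rng.normal(size=(K_images, d))          # features from a pretrained extractor (assumed)
V = rng.normal(size=(d, h)) * 0.1           # randomly initialized stand-ins for learned weights
U = rng.normal(size=(d, h)) * 0.1
w = rng.normal(size=h)
tract_feature, attention = gated_attention_pool(H, V, U, w)
```

The `attention` vector is what makes this kind of model inspectable: images with larger weights contribute more to the tract-level prediction.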
[CV-265] Enhancing Sports Strategy with Video Analytics and Data Mining: Automated Video-Based Analytics Framework for Tennis Doubles
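[Quick read]: This paper addresses the lack of automated analysis tools for tennis doubles, a sport whose strategic complexity places high demands on analytics. The key to the solution is a standardised annotation methodology covering player positioning, shot types, court formations, and match outcomes, paired with a specialised annotation tool built for the particular needs of tennis video labelling. The framework also integrates advanced machine learning techniques, including GroundingDINO for precise player localisation through natural language grounding and YOLO-Pose for robust pose estimation, which significantly reduces manual annotation effort and improves data consistency and quality.
Link: https://arxiv.org/abs/2507.02906
Authors: Jia Wei Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: this http URL . thesis 59 pages, 26 figures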
Abstract:We present a comprehensive video-based analytics framework for tennis doubles that addresses the lack of automated analysis tools for this strategically complex sport. Our approach introduces a standardised annotation methodology encompassing player positioning, shot types, court formations, and match outcomes, coupled with a specialised annotation tool designed to meet the unique requirements of tennis video labelling. The framework integrates advanced machine learning techniques including GroundingDINO for precise player localisation through natural language grounding and YOLO-Pose for robust pose estimation. This combination significantly reduces manual annotation effort whilst improving data consistency and quality. We evaluate our approach on doubles tennis match data and demonstrate that CNN-based models with transfer learning substantially outperform pose-based methods for predicting shot types, player positioning, and formations. The CNN models effectively capture complex visual and contextual features essential for doubles tennis analysis. Our integrated system bridges advanced analytical capabilities with the strategic complexities of tennis doubles, providing a foundation for automated tactical analysis, performance evaluation, and strategic modelling in professional tennis.
[CV-266] Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis
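[Quick read]: This paper addresses a gap in sports analytics, and tennis video analysis in particular: the lack of models that can understand and identify the sequence of events in a tennis rally. The key to the approach is to assess how well Multimodal LLMs (MLLMs) classify tennis actions and identify those actions within a continuous sequence, and to explore ways of improving their performance, including different training strategies and combining them with other traditional models.
Link: https://arxiv.org/abs/2507.02904
Authors: Charlton Teo
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: this http URL . dissertation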
Abstract:The use of Large Language Models (LLMs) in recent years has also given rise to the development of Multimodal LLMs (MLLMs). These new MLLMs allow us to process images, videos and even audio alongside textual inputs. In this project, we aim to assess the effectiveness of MLLMs in analysing sports videos, focusing mainly on tennis videos. Despite research done on tennis analysis, there remains a gap in models that are able to understand and identify the sequence of events in a tennis rally, which would be useful in other fields of sports analytics. As such, we will mainly assess the MLLMs on their ability to fill this gap - to classify tennis actions, as well as their ability to identify these actions in a sequence of tennis actions in a rally. We further looked into ways we can improve the MLLMs’ performance, including different training methods and even using them together with other traditional models.
[CV-267] Online Continual Learning via Spiking Neural Networks with Sleep Enhanced Latent Replay
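[Quick read]: This paper addresses the high memory overhead of Online Continual Learning (OCL) algorithms in edge computing scenarios and their bias towards recently trained tasks. The key to the solution is SESLR, a new method that combines a sleep enhanced latent replay scheme with Spiking Neural Networks (SNNs). SESLR exploits the binary spike characteristics of SNNs to store replay features in single bits, significantly reducing memory consumption; inspired by biological sleep-wake cycles, it also introduces a noise-enhanced sleep phase in which the model trains only on replay samples with controlled noise injection, which mitigates classification bias towards new classes.
Link: https://arxiv.org/abs/2507.02901
Authors: Erliang Lin, Wenbin Luo, Wei Jia, Yu Chen, Shaofu Yang
Affiliations: Southeast University; Ministry of Education
Subjects: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures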
Abstract:Edge computing scenarios necessitate the development of hardware-efficient online continual learning algorithms that adapt to dynamic environments. However, existing algorithms often suffer from high memory overhead and bias towards recently trained tasks. To tackle these issues, this paper proposes a novel online continual learning approach termed SESLR, which incorporates a sleep enhanced latent replay scheme with spiking neural networks (SNNs). SESLR leverages SNNs’ binary spike characteristics to store replay features in single bits, significantly reducing memory overhead. Furthermore, inspired by biological sleep-wake cycles, SESLR introduces a noise-enhanced sleep phase where the model exclusively trains on replay samples with controlled noise injection, effectively mitigating classification bias towards new classes. Extensive experiments on both conventional (MNIST, CIFAR10) and neuromorphic (NMNIST, CIFAR10-DVS) datasets demonstrate SESLR’s effectiveness. On Split CIFAR10, SESLR achieves nearly 30% improvement in average accuracy with only one-third of the memory consumption compared to baseline methods. On Split CIFAR10-DVS, it improves accuracy by approximately 10% while reducing memory overhead by a factor of 32. These results validate SESLR as a promising solution for online continual learning in resource-constrained edge computing scenarios.
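Storing binary spikes in single bits can be illustrated with NumPy's bit packing. This is a generic sketch of the storage idea, not SESLR's replay buffer: the array shape and spike rate are hypothetical, and the comparison baseline (one 32-bit float per spike) is an assumption rather than the baseline used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical latent replay features: 16 samples x 256 neurons, about 20% firing.
spikes = rng.random((16, 256)) < 0.2

as_float32 = spikes.astype(np.float32)     # naive storage: 4 bytes per spike
packed = np.packbits(spikes, axis=1)       # compact storage: 8 spikes per byte

# Unpack on replay; `count` trims any padding bits added by packbits.
restored = np.unpackbits(packed, axis=1, count=spikes.shape[1]).astype(bool)
ratio = as_float32.nbytes / packed.nbytes
```

Under these assumptions the packed buffer is 32 times smaller than the float32 version and the round trip is lossless, because a spike carries exactly one bit of information.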
[CV-268] Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions
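[Quick read]: This paper surveys Talking Head Generation (THG), where the goal is to generate high-quality, synchronized, and realistic face videos and the core challenges include improving perceptual realism and technical efficiency. It analyzes and categorizes approaches, including 2D-based, 3D-based, NeRF-based, diffusion-based, and parameter-driven techniques, and evaluates algorithms, datasets, and evaluation metrics. It also highlights future directions such as modular architectures, multilingual datasets, hybrid models, and innovative loss functions, aimed at current shortcomings in reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency.
Link: https://arxiv.org/abs/2507.02900
Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
Affiliations: Homi Bhabha National Institute; VECC; Gargi Memorial Institute of Technology; Jadavpur University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
Comments: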
Abstract:Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven, and other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: this https URL.
[CV-269] Learning to Generate Vectorized Maps at Intersections with Multiple Roadside Cameras
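Quick read: This paper targets the shortcomings of traditional approaches to building high-definition vectorized maps: offline methods are costly and depend on manual annotation, while online methods deliver limited performance. The key to the solution is MRC-VMap, a vision-centric, end-to-end neural network that converts time-aligned, multi-directional images from existing roadside surveillance cameras directly into high-definition vectorized maps. This reduces the need for additional intermediate modules, lowers computational overhead and error propagation, and improves map completeness and robustness at complex intersections.
Link: https://arxiv.org/abs/2507.02899
Authors: Miao Fan, Quanxin Zheng, Shengtong Xu, Linghe Kong, Haoyi Xiong
Affiliations: Beijing Institute of Graphic Communication; NavInfo Co. Ltd.; Autohome Inc.; Shanghai Jiaotong University; Baidu Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: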
Abstract:Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at complex intersections. To bridge this gap, we introduce MRC-VMap, a cost-effective, vision-centric, end-to-end neural network designed to generate high-definition vectorized maps directly at intersections. Leveraging existing roadside surveillance cameras, MRC-VMap directly converts time-aligned, multi-directional images into vectorized map representations. This integrated solution lowers the need for additional intermediate modules, such as separate feature extraction and Bird’s-Eye View (BEV) conversion steps, thus reducing both computational overhead and error propagation. Moreover, the use of multiple camera views enhances mapping completeness, mitigates occlusions, and provides robust performance under practical deployment constraints. Extensive experiments conducted on 4,000 intersections across 4 major metropolitan areas in China demonstrate that MRC-VMap not only outperforms state-of-the-art online methods but also achieves accuracy comparable to high-cost LiDAR-based approaches, thereby offering a scalable and efficient solution for modern autonomous navigation systems.
[CV-270] Regulation Compliant AI for Fusion: Real-Time Image Analysis-Based Control of Divertor Detachment in Tokamaks
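Quick read: This paper addresses the challenge of implementing fusion control that complies with regulatory requirements, focusing on divertor detachment control. The key to the solution is a real-time, AI-enabled, linear and interpretable control system that uses the DIII-D lower divertor camera for feedback control of divertor detachment and reattachment, achieving a mean absolute difference of 2% from the target. The framework can be extended to other image-based diagnostics as a basis for regulatory-compliant controllers in future fusion reactors.
Link: https://arxiv.org/abs/2507.02897
Authors: Nathaniel Chen, Cheolsik Byun, Azarakash Jalalvand, Sangkyeun Kim, Andrew Rothstein, Filippo Scotti, Steve Allen, David Eldon, Keith Erickson, Egemen Kolemen
Affiliations: Princeton University; Princeton Plasma Physics Laboratory; Lawrence Livermore National Laboratory; General Atomics
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY); Plasma Physics (physics.plasm-ph)
Comments: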
Abstract:While artificial intelligence (AI) has been promising for fusion control, its inherent black-box nature will make compliant implementation in regulatory environments a challenge. This study implements and validates a real-time, AI-enabled, linear and interpretable control system for divertor detachment control with the DIII-D lower divertor camera. Using D2 gas, we demonstrate feedback divertor detachment control with a mean absolute difference of 2% from the target for both detachment and reattachment. This automatic training and linear processing framework can be extended to any image-based diagnostic to build the regulatory-compliant controllers needed for future fusion reactors.
[CV-271] A Simulator Dataset to Support the Study of Impaired Driving
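Quick read: This paper addresses traffic accidents caused by driver impairment, such as alcohol intoxication and cognitive distraction, which continue to impose a high cost on society. The key contribution is a driving dataset of 23.7 hours of simulated urban driving covering 52 human subjects under normal and impaired conditions. It includes vehicle data (ground-truth perception, vehicle pose, controls) and driver-facing data (gaze, audio, surveys), and supports analysis of changes in driver behavior under alcohol intoxication (0.10% blood alcohol content), two forms of cognitive distraction (audio n-back and sentence parsing tasks), and their combinations, as well as responses to eight controlled road hazards.
Link: https://arxiv.org/abs/2507.02867
Authors: John Gideon, Kimimasa Tamura, Emily Sumner, Laporsha Dees, Patricio Reyes Gomez, Bassamul Haq, Todd Rowell, Avinash Balachandran, Simon Stent, Guy Rosman
Affiliations: Toyota Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 6 figures, 4 tables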
Abstract:Despite recent advances in automated driving technology, impaired driving continues to incur a high cost to society. In this paper, we present a driving dataset designed to support the study of two common forms of driver impairment: alcohol intoxication and cognitive distraction. Our dataset spans 23.7 hours of simulated urban driving, with 52 human subjects under normal and impaired conditions, and includes both vehicle data (ground truth perception, vehicle pose, controls) and driver-facing data (gaze, audio, surveys). It supports analysis of changes in driver behavior due to alcohol intoxication (0.10% blood alcohol content), two forms of cognitive distraction (audio n-back and sentence parsing tasks), and combinations thereof, as well as responses to a set of eight controlled road hazards, such as vehicle cut-ins. The dataset will be made available at this https URL.
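For readers unfamiliar with the n-back task mentioned above, the following is a minimal sketch of the generic definition: a stimulus is a target when it matches the one presented n steps earlier. The abstract does not describe the dataset's audio protocol, so the function names, the letter sequence, and the scoring scheme here are illustrative assumptions only.

```python
def n_back_targets(stimuli, n):
    """Return the indices i where stimuli[i] equals stimuli[i - n]."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [i for i in range(n, len(stimuli)) if stimuli[i] == stimuli[i - n]]


def score_responses(stimuli, n, responded):
    """Compare the indices a participant responded to against the true targets."""
    targets = set(n_back_targets(stimuli, n))
    responded = set(responded)
    return {
        "hits": len(targets & responded),
        "misses": len(targets - responded),
        "false_alarms": len(responded - targets),
    }


# Hypothetical 2-back sequence; not taken from the dataset.
letters = list("ABACBCAA")
print(n_back_targets(letters, 2))           # indices of 2-back matches
print(score_responses(letters, 2, [2, 6]))  # one hit, one miss, one false alarm
```

Larger n increases the working-memory load, which is why the task is commonly used to induce cognitive distraction.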
[CV-272] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection
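Quick read: This paper addresses the data scarcity in monocular 3D object detection (M3OD) caused by high annotation costs and inherent 2D-to-3D ambiguity. The key to the solution is a new pseudo-labeling framework that uses only video data: it aggregates the pseudo-LiDARs of static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction where 3D data acquisition is infeasible. The method requires no multi-view setup, additional sensors, camera poses, or domain-specific training, and is more robust to occlusion and more scalable.
Link: https://arxiv.org/abs/2507.02393
Authors: Seokyeong Lee, Sithu Aung, Junyong Choi, Seungryong Kim, Ig-Jae Kim, Junghyun Cho
Affiliations: Korea Institute of Science and Technology (KIST); Korea University; KAIST AI; AI-Robotics, KIST School, University of Science and Technology (UST); Yonsei-KIST Convergence Research Institute, Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 18 pages, 16 figures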
Abstract:Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.
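"Pseudo-LiDAR" usually refers to a point cloud obtained by back-projecting an estimated depth map through a pinhole camera model. The sketch below shows only that generic step, under the assumption that per-pixel depth and camera intrinsics (`fx`, `fy`, `cx`, `cy`, all hypothetical values here) are available. It is not the paper's pipeline, which additionally tracks object points across frames.

```python
def backproject_depth(depth, fx, fy, cx, cy):
    """Turn a depth map (rows of metric depths) into camera-frame 3D points.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx  # horizontal offset scaled by depth
            y = (v - cy) * z / fy  # vertical offset scaled by depth
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with two valid pixels.
depth_map = [[0.0, 2.0],
             [4.0, 0.0]]
print(backproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5))
```

Aggregating such points for one object over adjacent frames gives a denser shape than any single observation, which is the intuition behind the aggregation described in the abstract.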
[CV-273] RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
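Quick read: This paper addresses the limited computer-aided diagnosis (CAD) research on rheumatoid arthritis (RA) of the wrist, which stems from the difficulty of acquiring high-quality instance-level annotations. The key contribution is a multi-task dataset with two tasks: wrist bone instance segmentation and Sharp/van der Heijde (SvdH) bone erosion (BE) scoring. It is the first public resource for wrist bone instance segmentation. The dataset comprises 621 conventional wrist radiographs of 227 patients from four medical centers, with pixel-level instance segmentation annotations for 443 images and SvdH BE scores for 548 images, and is intended to support a range of RA-related research tasks.
Link: https://arxiv.org/abs/2507.05193
Authors: Songxiao Yang, Haolin Wang, Yao Fu, Ye Tian, Tamotsu Kamishima, Masayuki Ikebe, Yafei Ou, Masatoshi Okutomi
Affiliations: Institute of Science Tokyo; Hokkaido University; The University of Tokyo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: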
Abstract:Rheumatoid arthritis (RA) is a common autoimmune disease that has been the focus of research in computer-aided diagnosis (CAD) and disease monitoring. In clinical settings, conventional radiography (CR) is widely used for the screening and evaluation of RA due to its low cost and accessibility. The wrist is a critical region for the diagnosis of RA. However, CAD research in this area remains limited, primarily due to the challenges in acquiring high-quality instance-level annotations. (i) The wrist comprises numerous small bones with narrow joint spaces, complex structures, and frequent overlaps, requiring detailed anatomical knowledge for accurate annotation. (ii) Disease progression in RA often leads to osteophyte, bone erosion (BE), and even bony ankylosis, which alter bone morphology and increase annotation difficulty, necessitating expertise in rheumatology. This work presents a multi-task dataset for wrist bone in CR, including two tasks: (i) wrist bone instance segmentation and (ii) Sharp/van der Heijde (SvdH) BE scoring, which is the first public resource for wrist bone instance segmentation. This dataset comprises 621 wrist conventional radiographs of 227 patients from four medical centers, with pixel-level instance segmentation annotations for 443 images and SvdH BE scores for 548 images. This dataset can potentially support a wide range of research tasks related to RA, including joint space narrowing (JSN) progression quantification, BE detection, bone deformity evaluation, and osteophyte detection. It may also be applied to other wrist-related tasks, such as carpal bone fracture localization. We hope this dataset will significantly lower the barrier to research on wrist RA and accelerate progress in CAD research within the RA-related domain.
[CV-274] QMoE: A Quantum Mixture of Experts Framework for Scalable Quantum Neural Networks
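Quick read: This paper addresses the scalability and expressiveness limitations of quantum machine learning (QML) models, which are largely caused by hardware constraints. The key to the solution is the quantum mixture of experts (QMoE), a new quantum architecture that brings the mixture of experts (MoE) paradigm into the QML setting. It uses multiple parameterized quantum circuits as expert models, together with a learnable quantum routing mechanism that selects and aggregates specialized quantum experts per input, to improve model performance.
Link: https://arxiv.org/abs/2507.05190
Authors: Hoang-Quan Nguyen, Xuan-Bac Nguyen, Sankalp Pandey, Samee U. Khan, Ilya Safro, Khoa Luu
Affiliations: University of Arkansas; Kansas State University; University of Delaware
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: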
Abstract:Quantum machine learning (QML) has emerged as a promising direction in the noisy intermediate-scale quantum (NISQ) era, offering computational and memory advantages by harnessing superposition and entanglement. However, QML models often face challenges in scalability and expressiveness due to hardware constraints. In this paper, we propose quantum mixture of experts (QMoE), a novel quantum architecture that integrates the mixture of experts (MoE) paradigm into the QML setting. QMoE comprises multiple parameterized quantum circuits serving as expert models, along with a learnable quantum routing mechanism that selects and aggregates specialized quantum experts per input. The empirical results from the proposed QMoE on quantum classification tasks demonstrate that it consistently outperforms standard quantum neural networks, highlighting its effectiveness in learning complex data patterns. Our work paves the way for scalable and interpretable quantum learning frameworks.
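The "select and aggregate experts per input" idea can be illustrated with a purely classical mixture-of-experts sketch. In QMoE the experts are parameterized quantum circuits and the router is itself quantum, so the code below is only a classical analogue under that stated simplification; the gate scores, the toy experts, and the `top_k` parameter are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_output(x, experts, gate_scores, top_k=2):
    """Route input x to the top_k experts and blend their outputs.

    gate_scores would normally come from a learned router; here they are
    passed in directly to keep the sketch self-contained.
    """
    weights = softmax(gate_scores)
    chosen = sorted(range(len(experts)), key=lambda i: weights[i], reverse=True)[:top_k]
    norm = sum(weights[i] for i in chosen)  # renormalise over the selected experts
    return sum(weights[i] / norm * experts[i](x) for i in chosen)


# Three toy scalar experts; the third receives a very low gate score.
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x]
print(moe_output(3.0, experts, gate_scores=[0.0, 0.0, -100.0], top_k=2))
```

Sparse routing of this kind keeps the per-input cost close to that of `top_k` experts, regardless of how many experts exist in total.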
[CV-275] Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos
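Quick read: This paper addresses the dependence of cardiac phase identification on large amounts of manual annotation, which is time-consuming and labor-intensive. The key to the solution is an unsupervised framework that detects end-diastole (ED) and end-systole (ES) by learning latent cardiac motion trajectories from 4-chamber-view echocardiography videos. A reconstruction model is trained to encode interpretable spatiotemporal motion patterns, which removes the need for manual annotations such as ED and ES indices, segmentation, or volumetric measurements.
Link: https://arxiv.org/abs/2507.05154
Authors: Yingyu Yang, Qianye Yang, Kangning Cui, Can Peng, Elena D’Alberti, Netzahualcoyotl Hernandez-Cruz, Olga Patey, Aris T. Papageorghiou, J. Alison Noble
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: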
Abstract:The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotations, which is time-consuming and labor-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations, including ED and ES indices, segmentation, or volumetric measurements, by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves mean absolute error (MAE) of 3 frames (58.3 ms) for ED and 2 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with MAE 1.46 frames (20.7 ms) for ED and 1.74 frames (25.3 ms) for ES, despite the fact that the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code will be released at this https URL.
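The results above are reported as mean absolute error (MAE) in frames with a millisecond equivalent. A minimal sketch of how such a metric might be computed follows. The frame indices and the 50 fps frame rate are made-up values, and the paper's exact averaging protocol (for example, per-video frame rates) is not specified in the abstract.

```python
def mean_absolute_error(predicted, reference):
    """MAE between predicted and annotated frame indices."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def frames_to_ms(frames, fps):
    """Convert a duration in frames to milliseconds at a given frame rate."""
    return frames * 1000.0 / fps


# Hypothetical ED frame indices for three clips recorded at 50 fps.
predicted_ed = [10, 52, 95]
annotated_ed = [12, 50, 99]
mae_frames = mean_absolute_error(predicted_ed, annotated_ed)
print(mae_frames, frames_to_ms(mae_frames, fps=50))
```

As a sanity check on the reported adult figures, 58.3 ms for 3 frames corresponds to roughly 19.4 ms per frame.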
[CV-276] SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model MICCAI2025
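Quick read: This paper addresses the increased radiation exposure and more complicated clinical workflows that come with multi-view X-ray imaging. The key to the solution is a view-conditioned diffusion model that synthesizes multi-view X-ray images from a single view. Unlike prior methods, it uses the Diffusion Transformer to preserve fine details and a weak-to-strong training strategy for stable high-resolution image generation, yielding higher-resolution outputs with improved control over viewing angles.
Link: https://arxiv.org/abs/2507.05148
Authors: Chun Xie, Yuichi Yoshii, Itaru Kitahara
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI2025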
Abstract:X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at GitHub.
[CV-277] Sequential Attention-based Sampling for Histopathological Analysis
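Quick read: This paper addresses the heavy computational cost, high annotation cost, and sparsity of diagnostically informative regions in automated histopathological analysis of whole-slide images (WSIs). The key to the solution is SASHA, a sequential attention-based sampling method built on deep reinforcement learning. It learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model and selectively zooms into a small fraction (10-20%) of high-resolution patches to reach an efficient and reliable diagnosis.
Link: https://arxiv.org/abs/2507.05077
Authors: Tarun G, Naman Malpani, Gugan Thoppe, Sridharan Devarajan
Affiliations: Indian Institute of Science
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: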
Abstract:Deep neural networks are increasingly applied for automated histopathology. Yet, whole-slide images (WSIs) are often acquired at gigapixel sizes, rendering it computationally infeasible to analyze them entirely at high resolution. Diagnostic labels are largely available only at the slide-level, because expert annotation of images at a finer (patch) level is both laborious and expensive. Moreover, regions with diagnostic information typically occupy only a small fraction of the WSI, making it inefficient to examine the entire slide at full resolution. Here, we propose SASHA (Sequential Attention-based Sampling for Histopathological Analysis), a deep reinforcement learning approach for efficient analysis of histopathological images. First, SASHA learns informative features with a lightweight hierarchical, attention-based multiple instance learning (MIL) model. Second, SASHA samples intelligently and zooms selectively into a small fraction (10-20%) of high-resolution patches, to achieve reliable diagnosis. We show that SASHA matches state-of-the-art methods that analyze the WSI fully at high-resolution, albeit at a fraction of their computational and memory costs. In addition, it significantly outperforms competing, sparse sampling methods. We propose SASHA as an intelligent sampling model for medical imaging challenges that involve automated diagnosis with exceptionally large images containing sparsely informative features.
[CV-278] MurreNet: Modeling Holistic Multimodal Interactions Between Histopathology and Genomic Profiles for Survival Prediction MICCAI2025
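Quick read: This paper addresses the challenge of integrating pathological Whole Slide Images (WSIs) and genomic profiles for cancer survival prediction, which is difficult because of the inherent heterogeneity between modalities and the complexity of modeling both inter- and intra-modality interactions. Existing methods usually rely on straightforward fusion strategies and fail to fully capture modality-specific and modality-common interactions, leading to a limited understanding of multimodal correlations and suboptimal predictive performance. The key to the solution is the Multimodal Representation Decoupling Network (MurreNet). A Multimodal Representation Decomposition (MRD) module explicitly decomposes the input into modality-specific and modality-shared representations, a new training regularization strategy constrains the distributional similarity, difference, and representativeness of modality features, and a Deep Holistic Orthogonal Fusion (DHOF) strategy integrates the enhanced multimodal features.
Link: https://arxiv.org/abs/2507.04891
Authors: Mingxin Liu, Chengfei Cai, Jun Li, Pengbo Xu, Jinze Li, Jiquan Ma, Jun Xu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures, Accepted by MICCAI 2025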
Abstract:Cancer survival prediction requires integrating pathological Whole Slide Images (WSIs) and genomic profiles, a challenging task due to the inherent heterogeneity and the complexity of modeling both inter- and intra-modality interactions. Current methods often employ straightforward fusion strategies for multimodal feature integration, failing to comprehensively capture modality-specific and modality-common interactions, resulting in a limited understanding of multimodal correlations and suboptimal predictive performance. To mitigate these limitations, this paper presents a Multimodal Representation Decoupling Network (MurreNet) to advance cancer survival analysis. Specifically, we first propose a Multimodal Representation Decomposition (MRD) module to explicitly decompose paired input data into modality-specific and modality-shared representations, thereby reducing redundancy between modalities. Furthermore, the disentangled representations are further refined then updated through a novel training regularization strategy that imposes constraints on distributional similarity, difference, and representativeness of modality features. Finally, the augmented multimodal features are integrated into a joint representation via proposed Deep Holistic Orthogonal Fusion (DHOF) strategy. Extensive experiments conducted on six TCGA cancer cohorts demonstrate that our MurreNet achieves state-of-the-art (SOTA) performance in survival prediction.
[CV-279] Uncovering Neuroimaging Biomarkers of Brain Tumor Surgery with AI-Driven Methods
Quick read: This paper addresses the prediction of patient outcomes in brain tumor resection, to help clinicians and patients choose the most suitable balance between tumor control and functional preservation. The key to the solution is a framework that integrates Explainable AI (XAI) with neuroimaging-based feature engineering, using a global explanation optimizer to refine survival-related feature attribution in deep learning models and thereby improve interpretability and reliability. The method identifies survival-related imaging variability, supporting precision medicine for brain tumor treatment.
Link: https://arxiv.org/abs/2507.04881
Authors: Carmen Jimenez-Mesa, Yizhou Wan, Guilio Sansone, Francisco J. Martinez-Murcia, Javier Ramirez, Pietro Lio, Juan M. Gorriz, Stephen J. Price, John Suckling, Michail Mamalakis
Affiliations: University of Málaga; University of Cambridge; University of Padova; University of Granada
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Brain tumor resection is a complex procedure with significant implications for patient survival and quality of life. Predictions of patient outcomes provide clinicians and patients the opportunity to select the most suitable onco-functional balance. In this study, global features derived from structural magnetic resonance imaging in a clinical dataset of 49 pre- and post-surgery patients identified potential biomarkers associated with survival outcomes. We propose a framework that integrates Explainable AI (XAI) with neuroimaging-based feature engineering for survival assessment, offering guidance for surgical decision-making. In this study, we introduce a global explanation optimizer that refines survival-related feature attribution in deep learning models, enhancing interpretability and reliability. Our findings suggest that survival is influenced by alterations in regions associated with cognitive and sensory functions, indicating the importance of preserving areas involved in decision-making and emotional regulation during surgery to improve outcomes. The global explanation optimizer improves both fidelity and comprehensibility of explanations compared to state-of-the-art XAI methods. It effectively identifies survival-related variability, underscoring its relevance in precision medicine for brain tumor treatment.
[CV-280] Efficacy of Image Similarity as a Metric for Augmenting Small Dataset Retinal Image Segmentation
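Quick read: This paper asks how augmentation with synthetic images can improve a U-Net model that segments intraretinal fluid related to diabetic macular edema when only a limited medical imaging dataset is available. The approach uses synthetic images generated by a Progressively Growing Generative Adversarial Network (PGGAN) for augmentation and uses the Fréchet Inception Distance (FID) to measure how similar the synthetic images are to the real dataset, then analyzes how this similarity affects segmentation performance. The study finds that augmentation is significant and robust when the FID between the augmenting and training data is low, while dissimilar (high FID) datasets do not improve the model. It also finds that synthetic and standard augmentations follow separate log-normal trends between FID and performance improvement, with synthetic data proving more effective.
Link: https://arxiv.org/abs/2507.04862
Authors: Thomas Wallace, Ik Siong Heng, Senad Subasic, Chris Messenger
Affiliations: University of Glasgow; Optos PLC
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 10 figures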
Abstract:Synthetic images are an option for augmenting limited medical imaging datasets to improve the performance of various machine learning models. A common metric for evaluating synthetic image quality is the Fréchet Inception Distance (FID) which measures the similarity of two image datasets. In this study we evaluate the relationship between this metric and the improvement which synthetic images, generated by a Progressively Growing Generative Adversarial Network (PGGAN), grant when augmenting Diabetes-related Macular Edema (DME) intraretinal fluid segmentation performed by a U-Net model with limited amounts of training data. We find that the behaviour of augmenting with standard and synthetic images agrees with previously conducted experiments. Additionally, we show that dissimilar (high FID) datasets do not improve segmentation significantly. As FID between the training and augmenting datasets decreases, the augmentation datasets are shown to contribute to significant and robust improvements in image segmentation. Finally, we find that there is significant evidence to suggest that synthetic and standard augmentations follow separate log-normal trends between FID and improvements in model performance, with synthetic data proving more effective than standard augmentation techniques. Our findings show that more similar datasets (lower FID) will be more effective at improving U-Net performance, however, the results also suggest that this improvement may only occur when images are sufficiently dissimilar.
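For context, FID is the Fréchet distance between two Gaussians fitted to image features: the squared distance between the means plus Tr(S1 + S2 - 2(S1 S2)^(1/2)) for covariances S1 and S2. The sketch below is a deliberately simplified, dependency-free version that assumes diagonal covariances, in which case the trace term reduces to the sum of (sqrt(v1) - sqrt(v2))^2 over dimensions. Real FID uses full covariances of Inception-network features, and the toy feature vectors here are invented.

```python
import math


def feature_stats(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    count = len(features)
    columns = list(zip(*features))
    means = [sum(col) / count for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / count for col, m in zip(columns, means)]
    return means, variances


def frechet_distance_diag(stats_a, stats_b):
    """Fréchet distance between two Gaussians with diagonal covariances (simplification)."""
    (mu_a, var_a), (mu_b, var_b) = stats_a, stats_b
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b))
    return mean_term + cov_term


# Toy 2-D "features" standing in for Inception activations.
real = [[0.0, 0.0], [2.0, 2.0]]
synthetic = [[1.0, 0.0], [5.0, 8.0]]
print(frechet_distance_diag(feature_stats(real), feature_stats(synthetic)))
```

A lower value means the two feature distributions are closer, which is the sense in which the abstract calls low-FID datasets "more similar".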
[CV-281] SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes
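Quick read: This paper addresses a central challenge in biology: understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function. Although image-based spatial transcriptomics provides high-resolution cell images and gene expression profiles, existing machine learning methods typically analyze these modalities in isolation or at limited resolution. The key to the solution is SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. It fuses image-derived morphological tokens and transcriptomic vector tokens with cross-attention to learn cell-level embeddings, aggregates them at niche and tissue levels with transformer modules to capture spatial dependencies, and incorporates token merging in its generative diffusion decoder to synthesize high-resolution cell images conditioned on gene expression.
Link: https://arxiv.org/abs/2507.04704
Authors: Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun, Tianlong Chen, Manolis Kellis, Marinka Zitnik
Affiliations: Harvard University; UNC-Chapel Hill; MIT
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: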
Abstract:Understanding how cellular morphology, gene expression, and spatial organization jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but machine learning methods typically analyze these modalities in isolation or at limited resolution. We address the problem of learning unified, spatially aware representations that integrate cell morphology, gene expression, and spatial context across biological scales. This requires models that can operate at single-cell resolution, reason across spatial neighborhoods, and generalize to whole-slide tissue organization. Here, we introduce SPATIA, a multi-scale generative and predictive model for spatial transcriptomics. SPATIA learns cell-level embeddings by fusing image-derived morphological tokens and transcriptomic vector tokens using cross-attention and then aggregates them at niche and tissue levels using transformer modules to capture spatial dependencies. SPATIA incorporates token merging in its generative diffusion decoder to synthesize high-resolution cell images conditioned on gene expression. We assembled a multi-scale dataset consisting of 17 million cell-gene pairs, 1 million niche-gene pairs, and 10,000 tissue-gene pairs across 49 donors, 17 tissue types, and 12 disease states. We benchmark SPATIA against 13 existing models across 12 individual tasks, which span several categories including cell annotation, cell clustering, gene imputation, cross-modal prediction, and image generation. SPATIA achieves improved performance over all baselines and generates realistic cell morphologies that reflect transcriptomic perturbations.
[CV-282] SPIDER: Structure-Preferential Implicit Deep Network for Biplanar X-ray Reconstruction
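Quick read: This paper addresses the reconstruction of 3D CT volumes from two orthogonal X-ray projections, a profoundly ill-posed inverse problem because of the lack of depth information and irreducible ambiguities in soft-tissue visualization. Existing methods often yield incomplete bone geometry, imprecise tissue boundaries, and a lack of anatomical realism, which limits their clinical use in surgical planning and postoperative assessment. The proposed solution, SPIDER, is a supervised framework that incorporates tissue structure as a prior (for example, anatomical segmentation) into an implicit neural representation decoder as joint supervision through a unified encoder-decoder architecture. The model jointly learns image intensities and anatomical structures in a pixel-aligned fashion and embeds anatomical constraints directly into reconstruction, enhancing structural continuity and reducing soft-tissue artifacts.
Link: https://arxiv.org/abs/2507.04684
Authors: Tianqi Yu, Xuanyu Tian, Jiawen Yang, Dongming He, Jingyi Yu, Xudong Wang, Yuyao Zhang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: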
Abstract:Biplanar X-ray imaging is widely used in health screening, postoperative rehabilitation evaluation of orthopedic diseases, and injury surgery due to its rapid acquisition, low radiation dose, and straightforward setup. However, 3D volume reconstruction from only two orthogonal projections represents a profoundly ill-posed inverse problem, owing to the intrinsic lack of depth information and irreducible ambiguities in soft-tissue visualization. Although some existing methods can reconstruct skeletal structures and Computed Tomography (CT) volumes, they often yield incomplete bone geometry, imprecise tissue boundaries, and a lack of anatomical realism, thereby limiting their clinical utility in scenarios such as surgical planning and postoperative assessment. In this study, we introduce SPIDER, a novel supervised framework designed to reconstruct CT volumes from biplanar X-ray images. SPIDER incorporates tissue structure as prior (e.g., anatomical segmentation) into an implicit neural representation decoder in the form of joint supervision through a unified encoder-decoder architecture. This design enables the model to jointly learn image intensities and anatomical structures in a pixel-aligned fashion. To address the challenges posed by sparse input and structural ambiguity, SPIDER directly embeds anatomical constraints into the reconstruction process, thereby enhancing structural continuity and reducing soft-tissue artifacts. We conduct comprehensive experiments on clinical head CT datasets and show that SPIDER generates anatomically accurate reconstructions from only two projections. Furthermore, our approach demonstrates strong potential in downstream segmentation tasks, underscoring its utility in personalized treatment planning and image-guided surgical navigation.
[CV-283] CP-Dilatation: A Copy-and-Paste Augmentation Method for Preserving the Boundary Context Information of Histopathology Images
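Quick read: This paper addresses the shortage of training data in medical image segmentation, especially histopathology image segmentation, caused by the lack of sufficient annotations. The key to the solution is CP-Dilatation, an improved data augmentation method built on the conventional Copy and Paste (CP) technique. It adds a dilation operation that preserves the boundary context information of the malignancy, thereby improving segmentation performance.
Link: https://arxiv.org/abs/2507.04660
Authors: Sungrae Hong, Sol Lee, Mun Yong Yi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 5 figures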
Abstract:Medical AI diagnosis including histopathology segmentation has derived benefits from the recent development of deep learning technology. However, deep learning itself requires a large amount of training data and the medical image segmentation masking, in particular, requires an extremely high cost due to the shortage of medical specialists. To mitigate this issue, we propose a new data augmentation method built upon the conventional Copy and Paste (CP) augmentation technique, called CP-Dilatation, and apply it to histopathology image segmentation. To the well-known traditional CP technique, the proposed method adds a dilation operation that can preserve the boundary context information of the malignancy, which is important in histopathological image diagnosis, as the boundary between the malignancy and its margin is mostly unclear and a significant context exists in the margin. In our experiments using histopathology benchmark datasets, the proposed method was found superior to the other state-of-the-art baselines chosen for comparison.
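To make the idea of copy-and-paste with a dilated mask concrete, here is a minimal sketch under simplifying assumptions: source and target are same-sized single-channel images stored as nested lists, the lesion is pasted at its original location, and dilation uses a 3x3 square structuring element. None of these choices are confirmed by the abstract, and the function names are hypothetical.

```python
def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element."""
    height, width = len(mask), len(mask[0])
    for _ in range(iterations):
        grown = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < height and 0 <= nx < width:
                                grown[ny][nx] = 1
        mask = grown
    return mask


def copy_paste(source, source_mask, target, iterations=1):
    """Paste source pixels under the dilated mask onto the target image.

    iterations=0 gives plain copy-and-paste; larger values carry more of the
    surrounding margin along with the lesion.
    """
    region = dilate(source_mask, iterations)
    return [
        [s if m else t for s, m, t in zip(src_row, mask_row, tgt_row)]
        for src_row, mask_row, tgt_row in zip(source, region, target)
    ]


# 5x5 toy example: the lesion is a single pixel in the centre of the source.
source = [[9] * 5 for _ in range(5)]
target = [[0] * 5 for _ in range(5)]
lesion = [[1 if (y, x) == (2, 2) else 0 for x in range(5)] for y in range(5)]
for row in copy_paste(source, lesion, target, iterations=1):
    print(row)
```

With one dilation step, the pasted patch includes a one-pixel margin around the lesion, which is the kind of boundary context the abstract argues should be preserved.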
[CV-284] A Deep Unfolding Framework for Diffractive Snapshot Spectral Imaging
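Quick read: This paper addresses the limited compatibility of reconstruction algorithms with diffractive snapshot spectral imaging (DSSI) systems: existing deep learning networks and deep unfolding methods do not fully fit DSSI because of its distinct optical encoding mechanism. The key to the solution is an efficient deep unfolding framework, diffractive deep unfolding (DDU). It derives an analytical solution for the data fidelity term in DSSI, ensuring efficient and effective iterative reconstruction, and uses a network-based initialization strategy to improve stability and performance, making it compatible with existing state-of-the-art models.
Link: https://arxiv.org/abs/2507.04622
Authors: Zhengyue Zhuge, Jiahui Xu, Shiqi Chen, Hao Xu, Yueting Chen, Zhihai Xu, Huajun Feng
Affiliations: Zhejiang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: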
Abstract:Snapshot hyperspectral imaging systems acquire spectral data cubes through compressed sensing. Recently, diffractive snapshot spectral imaging (DSSI) methods have attracted significant attention. While various optical designs and improvements continue to emerge, research on reconstruction algorithms remains limited. Although numerous networks and deep unfolding methods have been applied on similar tasks, they are not fully compatible with DSSI systems because of their distinct optical encoding mechanism. In this paper, we propose an efficient deep unfolding framework for diffractive systems, termed diffractive deep unfolding (DDU). Specifically, we derive an analytical solution for the data fidelity term in DSSI, ensuring both the efficiency and the effectiveness during the iterative reconstruction process. Given the severely ill-posed nature of the problem, we employ a network-based initialization strategy rather than non-learning-based methods or linear layers, leading to enhanced stability and performance. Our framework demonstrates strong compatibility with existing state-of-the-art (SOTA) models, which effectively address the initialization and prior subproblem. Extensive experiments validate the superiority of the proposed DDU framework, showcasing improved performance while maintaining comparable parameter counts and computational complexity. These results suggest that DDU provides a solid foundation for future unfolding-based methods in DSSI.
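As a rough illustration of what an analytical data-fidelity update looks like in a deep unfolding scheme, the block below writes out the standard half-quadratic-splitting subproblem and its closed form. This is the generic textbook form, not necessarily the derivation used in the paper. All symbols are assumed notation: y is the measurement, A the forward operator, x the spectral cube, z^k the output of the learned prior at iteration k, and μ a penalty weight.

```latex
\begin{aligned}
x^{k+1} &= \arg\min_{x}\; \tfrac{1}{2}\lVert y - A x\rVert_2^2
           + \tfrac{\mu}{2}\lVert x - z^{k}\rVert_2^2 \\
        &= \left(A^{\top}A + \mu I\right)^{-1}\left(A^{\top}y + \mu z^{k}\right) \\
        &= z^{k} + A^{\top}\left(A A^{\top} + \mu I\right)^{-1}\left(y - A z^{k}\right)
\end{aligned}
```

The last line follows from the second by the matrix inversion lemma. If A is assumed to sum per-band circular convolutions with known point spread functions, then A Aᵀ is diagonal in the Fourier domain and the remaining inverse reduces to an element-wise division, which is one way such an update can stay cheap inside every unfolding stage.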
[CV-285] Comprehensive Modeling of Camera Spectral and Color Behavior
Quick read: This paper addresses the lack of a comprehensive model of digital camera spectral response that considers the end-to-end interaction between light input and pixel intensity output. The key to the solution is a new technique for modeling the spectral response of an RGB digital camera. The model is tested across diverse imaging scenarios by varying illumination conditions and is validated against experimental data, improving color fidelity and spectral accuracy.
Link: https://arxiv.org/abs/2507.04617
Authors: Sanush K Abeysekera, Ye Chow Kuang, Melanie Po-Leen Ooi
Affiliations: University of Waikato
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 11 figures, 2025 I2MTC IEEE Instrumentation and Measurement Society Conference
Abstract:The spectral response of a digital camera defines the mapping between scene radiance and pixel intensity. Despite its critical importance, there is currently no comprehensive model that considers the end-to-end interaction between light input and pixel intensity output. This paper introduces a novel technique to model the spectral response of an RGB digital camera, addressing this gap. Such models are indispensable for applications requiring accurate color and spectral data interpretation. The proposed model is tested across diverse imaging scenarios by varying illumination conditions and is validated against experimental data. Results demonstrate its effectiveness in improving color fidelity and spectral accuracy, with significant implications for applications in machine vision, remote sensing, and spectral imaging. This approach offers a powerful tool for optimizing camera systems in scientific, industrial, and creative domains where spectral precision is paramount.
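The mapping from scene radiance to pixel intensity mentioned in the abstract is commonly approximated as a sum over wavelength bands of radiance times channel sensitivity. The sketch below assumes a linear sensor and uses made-up band values; the function name and numbers are hypothetical and this is not the model proposed in the paper.

```python
def pixel_response(radiance, sensitivity, band_width_nm):
    """Approximate the integral of radiance * sensitivity over wavelength."""
    if len(radiance) != len(sensitivity):
        raise ValueError("radiance and sensitivity must share the same bands")
    return sum(r * s for r, s in zip(radiance, sensitivity)) * band_width_nm


# Hypothetical 4-band example (values are illustrative only).
radiance = [0.2, 0.5, 0.8, 0.4]
red_sensitivity = [0.0, 0.1, 0.6, 0.9]
red_value = pixel_response(radiance, red_sensitivity, band_width_nm=10.0)
```

Changing the illuminant changes `radiance` while the sensitivity curve stays fixed, which is why a response model can be checked under several lighting conditions.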
[CV-286] Emerging Frameworks for Objective Task-based Evaluation of Quantitative Medical Imaging Methods
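[Quick read]: This paper addresses the lack of objective evaluation of quantitative imaging (QI) methods during clinical translation, in particular the challenge of clinically evaluating QI methods when no gold standard is available. It outlines four emerging evaluation frameworks: virtual imaging trials (VITs), a no-gold-standard evaluation framework, a framework for joint detection and quantification tasks, and a framework for methods that output multi-dimensional parameters. Together they aim to give QI methods comprehensive, practical evaluation tools that support clinical adoption.
Link: https://arxiv.org/abs/2507.04591
Authors: Yan Liu, Huitian Xia, Nancy A. Obuchowski, Richard Laforest, Arman Rahmim, Barry A. Siegel, Abhinav K. Jha
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 19 pages, 7 figures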
Abstract:Quantitative imaging (QI) is demonstrating strong promise across multiple clinical applications. For clinical translation of QI methods, objective evaluation on clinically relevant tasks is essential. To address this need, multiple evaluation strategies are being developed. In this paper, based on previous literature, we outline four emerging frameworks to perform evaluation studies of QI methods. We first discuss the use of virtual imaging trials (VITs) to evaluate QI methods. Next, we outline a no-gold-standard evaluation framework to clinically evaluate QI methods without ground truth. Third, a framework to evaluate QI methods for joint detection and quantification tasks is outlined. Finally, we outline a framework to evaluate QI methods that output multi-dimensional parameters, such as radiomic features. We review these frameworks, discussing their utilities and limitations. Further, we examine future research areas in evaluation of QI methods. Given the recent advancements in PET, including long axial field-of-view scanners and the development of artificial-intelligence algorithms, we present these frameworks in the context of PET.
[CV-287] FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging ICCV2025
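[Quick read]: This paper addresses temporal interpolation for 4D medical imaging, which plays an important role in the clinical practice of respiratory motion modeling. Existing methods follow a simplified linear-motion hypothesis and use optical flow-based models to interpolate intermediate frames, but real respiratory motion is nonlinear and quasi-periodic. The key to its solution is to approach the task from the frequency perspective with a Fourier basis-guided diffusion model (FB-Diff): a Fourier motion operator combines physiological motion priors with case-specific spectral information to extract Fourier bases that better simulate respiratory motion with specific frequency patterns, and a basis interaction operator lets the diffusion model use these bases to perform temporal interpolation in a generative manner.
Link: https://arxiv.org/abs/2507.04547
Authors: Xin You, Runze Yang, Chuyan Zhang, Zhongliang Jiang, Jie Yang, Nassir Navab
Affiliations: Technical University of Munich; Shanghai Jiao Tong University; Munich Center for Machine Learning; Macquarie University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025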
Abstract:The temporal interpolation task for 4D medical imaging, plays a crucial role in clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Intuited by this property, we resolve the temporal interpolation task from the frequency perspective, and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, due to the regular motion discipline of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. Then a Fourier motion operator is elaborately devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages well-learned Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Codes are available.
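To see why the linear-motion hypothesis can fail for quasi-periodic motion, the toy example below models a displacement with a single sinusoidal (Fourier) mode and compares it with a straight line drawn between two end frames. The amplitude, period, and function names are assumptions chosen for illustration; this is not the FB-Diff model.

```python
import math


def true_displacement(t, amplitude=1.0, period=4.0):
    """Hypothetical quasi-periodic motion modeled by a single Fourier mode."""
    return amplitude * math.sin(2.0 * math.pi * t / period)


def linear_interpolation(t, t0, t1):
    """Linear-motion hypothesis: straight line between the two end frames."""
    d0, d1 = true_displacement(t0), true_displacement(t1)
    return d0 + (d1 - d0) * (t - t0) / (t1 - t0)


# End frames at t=0 and t=2 (half a period) both have near-zero displacement,
# so the linear estimate at t=1 stays near zero while the true motion peaks.
linear_mid = linear_interpolation(1.0, 0.0, 2.0)
true_mid = true_displacement(1.0)
```

The gap between `linear_mid` and `true_mid` is the kind of error a frequency-aware prior is meant to reduce.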
[CV-288] Dynamic Frequency Feature Fusion Network for Multi-Source Remote Sensing Data Classification
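[Quick read]: This paper addresses the limited adaptability of multi-source data classification in remote sensing image interpretation, in particular the limited adaptability to diverse land cover types when modeling frequency-domain features. The key to its solution is a Dynamic Frequency Feature Fusion Network (DFFNet): a dynamic filter block learns filter kernels in the frequency domain and injects frequency contextual knowledge into them, and a spectral-spatial adaptive fusion block strengthens cross-modal feature fusion, yielding more comprehensive multi-source classification.
Link: https://arxiv.org/abs/2507.04510
Authors: Yikang Zhao, Feng Gao, Xuepeng Jin, Junyu Dong, Qian Du
Affiliations: Ocean University of China; Mississippi State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE GRSL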
Abstract:Multi-source data classification is a critical yet challenging task for remote sensing image interpretation. Existing methods lack adaptability to diverse land cover types when modeling frequency domain features. To this end, we propose a Dynamic Frequency Feature Fusion Network (DFFNet) for hyperspectral image (HSI) and Synthetic Aperture Radar (SAR) / Light Detection and Ranging (LiDAR) data joint classification. Specifically, we design a dynamic filter block to dynamically learn the filter kernels in the frequency domain by aggregating the input features. The frequency contextual knowledge is injected into frequency filter kernels. Additionally, we propose spectral-spatial adaptive fusion block for cross-modal feature fusion. It enhances the spectral and spatial attention weight interactions via channel shuffle operation, thereby providing comprehensive cross-modal feature fusion. Experiments on two benchmark datasets show that our DFFNet outperforms state-of-the-art methods in multi-source data classification. The codes will be made publicly available at this https URL.
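The abstract mentions a channel shuffle operation for mixing spectral and spatial attention weights. Below is a minimal sketch of the generic channel shuffle (reshape into groups, transpose, flatten) on a plain list; how DFFNet applies it internally is not stated in the abstract, so the grouping and labels here are assumptions.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups (reshape -> transpose -> flatten)."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # View the list as a (groups, per_group) grid, then read it column by column.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Hypothetical labels: "s*" for spectral weights, "p*" for spatial weights.
shuffled = channel_shuffle(["s0", "s1", "s2", "p0", "p1", "p2"], groups=2)
```

After the shuffle, neighbouring positions alternate between the two groups, so any later grouped operation sees both kinds of weights.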
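[CV-289] Street design and driving behavior: evidence from a large-scale study in Milan, Amsterdam, and Dubai
[Quick read]: This paper addresses insufficient driver compliance after cities introduce 30 km/h speed limits, and how street design can improve adherence. The key to its solution is applying computer vision-based semantic segmentation models to Google Street View images, relating street characteristics to driving behavior, deriving road design guidelines that apply across different urban contexts, and developing a machine learning model that predicts driving speeds to support urban planning decisions.
Link: https://arxiv.org/abs/2507.04434
Authors: Giacomo Orsi, Titus Venverloo, Andrea La Grotteria, Umberto Fugiglando, Fábio Duarte, Paolo Santi, Carlo Ratti
Affiliations: Unknown
Categories: Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: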
Abstract:In recent years, cities have increasingly reduced speed limits from 50 km/h to 30 km/h to enhance road safety, reduce noise pollution, and promote sustainable modes of transportation. However, achieving compliance with these new limits remains a key challenge for urban planners. This study investigates drivers’ compliance with the 30 km/h speed limit in Milan and examines how street characteristics influence driving behavior. Our findings suggest that the mere introduction of lower speed limits is not sufficient to reduce driving speeds effectively, highlighting the need to understand how street design can improve speed limit adherence. To comprehend this relationship, we apply computer vision-based semantic segmentation models to Google Street View images. A large-scale analysis reveals that narrower streets and densely built environments are associated with lower speeds, whereas roads with greater visibility and larger sky views encourage faster driving. To evaluate the influence of the local context on speeding behaviour, we apply the developed methodological framework to two additional cities: Amsterdam, which, similar to Milan, is a historic European city not originally developed for cars, and Dubai, which instead has developed in recent decades with a more car-centric design. The results of the analyses largely confirm the findings obtained in Milan, which demonstrates the broad applicability of the road design guidelines for driver speed compliance identified in this paper. Finally, we develop a machine learning model to predict driving speeds based on street characteristics. We showcase the model’s predictive power by estimating the compliance with speed limits in Milan if the city were to adopt a 30 km/h speed limit city-wide. The tool provides actionable insights for urban planners, supporting the design of interventions to improve speed limit compliance.
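Street characteristics such as sky view can be derived from a segmentation mask as the share of pixels per class. The sketch below assumes a mask already produced by some segmentation model; the class names and the tiny mask are hypothetical and do not come from the study.

```python
from collections import Counter


def class_fractions(label_mask):
    """Return the share of pixels assigned to each semantic class."""
    counts = Counter(label for row in label_mask for label in row)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical 3x4 segmentation mask of a street-level image.
mask = [
    ["sky", "sky", "sky", "building"],
    ["building", "sky", "building", "building"],
    ["road", "road", "road", "road"],
]
fractions = class_fractions(mask)
```

Fractions like these can then serve as input features for a speed prediction model.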
[CV-290] ViTaL: A Multimodality Dataset and Benchmark for Multi-pathological Ovarian Tumor Recognition
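[Quick read]: This paper addresses the health risks caused by late detection of ovarian tumors and the scarcity of public datasets that hinders the use of deep neural networks in this area. The key to its solution is ViTaL, a multimodal ovarian tumor pathological recognition dataset containing visual, tabular, and linguistic data from 496 patients, together with ViTaL-Net, a model based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) for multi-pathology, multi-modality classification that improves the relevance and complementarity of information during feature fusion.
Link: https://arxiv.org/abs/2507.04383
Authors: You Zhou, Lijiang Chen, Guangxia Cui, Wenpei Bai, Yu Guo, Shuchang Lyu, Guangliang Cheng, Qi Zhao
Affiliations: Beijing University of Aeronautics and Astronautics; Beijing Shijitan Hospital; China Medical University; University of Liverpool
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: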
Abstract:Ovarian tumor, as a common gynecological disease, can rapidly deteriorate into serious health crises when undetected early, thus posing significant threats to the health of women. Deep neural networks have the potential to identify ovarian tumors, thereby reducing mortality rates, but limited public datasets hinder its progress. To address this gap, we introduce a vital ovarian tumor pathological recognition dataset called ViTaL that contains Visual, Tabular and Linguistic modality data of 496 patients across six pathological categories. The ViTaL dataset comprises three subsets corresponding to different patient data modalities: visual data from 2216 two-dimensional ultrasound images, tabular data from medical examinations of 496 patients, and linguistic data from ultrasound reports of 496 patients. It is insufficient to merely distinguish between benign and malignant ovarian tumors in clinical practice. To enable multi-pathology classification of ovarian tumor, we propose a ViTaL-Net based on the Triplet Hierarchical Offset Attention Mechanism (THOAM) to minimize the loss incurred during feature fusion of multi-modal data. This mechanism could effectively enhance the relevance and complementarity between information from different modalities. ViTaL-Net serves as a benchmark for the task of multi-pathology, multi-modality classification of ovarian tumors. In our comprehensive experiments, the proposed method exhibited satisfactory performance, achieving accuracies exceeding 90% on the two most common pathological types of ovarian tumor and an overall performance of 85%. Our dataset and code are available at this https URL.
[CV-291] CLIP-RL: Surgical Scene Segmentation Using Contrastive Language-Vision Pretraining Reinforcement Learning
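[Quick read]: This paper addresses semantic segmentation of surgical scenes in minimally invasive surgery (MIS) videos, with the goal of improving healthcare quality. The key to its solution is CLIP-RL, a contrastive language-image pre-training model that combines reinforcement learning (RL) and curriculum learning to refine segmentation masks continuously throughout training. With CLIP as a strong feature extractor and an RL module that dynamically adjusts predictions, CLIP-RL performs well under varied optical conditions and substantially improves segmentation performance.
Link: https://arxiv.org/abs/2507.04317
Authors: Fatmaelzahraa Ali Ahmed, Muhammad Arsalan, Abdulaziz Al-Ali, Khalid Al-Jalham, Shidin Balakrishnan
Affiliations: Hamad Medical Corporation; Qatar University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: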
Abstract:Understanding surgical scenes can provide better healthcare quality for patients, especially with the vast amount of video data that is generated during MIS. Processing these videos generates valuable assets for training sophisticated models. In this paper, we introduce CLIP-RL, a novel contrastive language-image pre-training model tailored for semantic segmentation for surgical scenes. CLIP-RL presents a new segmentation approach which involves reinforcement learning and curriculum learning, enabling continuous refinement of the segmentation masks during the full training pipeline. Our model has shown robust performance in different optical settings, such as occlusions, texture variations, and dynamic lighting, presenting significant challenges. CLIP model serves as a powerful feature extractor, capturing rich semantic context that enhances the distinction between instruments and tissues. The RL module plays a pivotal role in dynamically refining predictions through iterative action-space adjustments. We evaluated CLIP-RL on the EndoVis 2018 and EndoVis 2017 datasets. CLIP-RL achieved a mean IoU of 81%, outperforming state-of-the-art models, and a mean IoU of 74.12% on EndoVis 2017. This superior performance was achieved due to the combination of contrastive learning with reinforcement learning and curriculum learning.
[CV-292] Surg-SegFormer: A Dual Transformer-Based Model for Holistic Surgical Scene Segmentation
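[Quick read]: This paper addresses the shortage of real-time intraoperative teaching in robot-assisted surgery (RAS), in particular the difficulty of clearly delineating go zones and no-go zones when expert surgeons are scarce. Conventional high-performance semantic segmentation models depend on user-generated prompts, which makes them impractical for long surgical videos. The proposed solution is Surg-SegFormer, a prompt-free segmentation model whose key contribution is automated, robust surgical scene understanding that substantially reduces the tutoring burden on expert surgeons and lets surgical residents understand complex surgical environments independently.
Link: https://arxiv.org/abs/2507.04304
Authors: Fatimaelzahraa Ahmed, Muraam Abdel-Ghani, Muhammad Arsalan, Mahmoud Ali, Abdulaziz Al-Ali, Shidin Balakrishnan
Affiliations: Hamad Medical Corporation; Qatar University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in IEEE CASE 2025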
Abstract:Holistic surgical scene segmentation in robot-assisted surgery (RAS) enables surgical residents to identify various anatomical tissues, articulated tools, and critical structures, such as veins and vessels. Given the firm intraoperative time constraints, it is challenging for surgeons to provide detailed real-time explanations of the operative field for trainees. This challenge is compounded by the scarcity of expert surgeons relative to trainees, making the unambiguous delineation of go- and no-go zones inconvenient. Therefore, high-performance semantic segmentation models offer a solution by providing clear postoperative analyses of surgical procedures. However, recent advanced segmentation models rely on user-generated prompts, rendering them impractical for lengthy surgical videos that commonly exceed an hour. To address this challenge, we introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. Surg-SegFormer attained a mean Intersection over Union (mIoU) of 0.80 on the EndoVis2018 dataset and 0.54 on the EndoVis2017 dataset. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons, empowering residents to independently and effectively understand complex surgical environments.
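The mIoU figures quoted above average the per-class Intersection over Union. A minimal sketch on flattened label lists follows; it averages over classes that appear in either mask, which is one common convention and not necessarily the exact EndoVis evaluation protocol.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both masks
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical flattened masks with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so `score` is their average.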
[CV-293] Deep-Learning-Assisted Highly-Accurate COVID-19 Diagnosis on Lung Computed Tomography Images
[Quick read]: This paper addresses poor CT image quality and long-tailed dataset distributions in COVID-19 diagnosis. The key to its solution is a data quality control pipeline based on a generative adversarial network (GAN) and sliding windows that improves CT image quality, combined with class-sensitive loss functions, namely Label Distribution Aware Loss (LDAM Loss) and Class-balanced (CB) Loss, that mitigate class imbalance.
Link: https://arxiv.org/abs/2507.04252
Authors: Yinuo Wang, Juhyun Bae, Ka Ho Chow, Shenyang Chen, Shreyash Gupta
Affiliations: Georgia Institute of Technology
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:COVID-19 is a severe and acute viral disease that can cause symptoms consistent with pneumonia in which inflammation is caused in the alveolous regions of the lungs leading to a build-up of fluid and breathing difficulties. Thus, the diagnosis of COVID using CT scans has been effective in assisting with RT-PCR diagnosis and severity classifications. In this paper, we proposed a new data quality control pipeline to refine the quality of CT images based on GAN and sliding windows. Also, we use class-sensitive cost functions including Label Distribution Aware Loss(LDAM Loss) and Class-balanced(CB) Loss to solve the long-tail problem existing in datasets. Our model reaches more than 0.983 MCC in the benchmark test dataset.
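Two quantities named in the abstract have compact closed forms. The sketch below computes class-balanced weights from the "effective number of samples" formulation, (1 - beta) / (1 - beta^n), and the Matthews correlation coefficient (MCC) for a binary confusion matrix. The sample counts, the beta value, and the confusion matrix are hypothetical and are not taken from the paper.

```python
import math


def class_balanced_weights(samples_per_class, beta=0.999):
    """Weights proportional to the inverse 'effective number' of samples."""
    return [(1.0 - beta) / (1.0 - beta ** n) for n in samples_per_class]


def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


weights = class_balanced_weights([5000, 50])  # hypothetical long-tailed counts
mcc = binary_mcc(tp=90, tn=95, fp=5, fn=10)
```

The rare class receives the larger weight, which is how a class-balanced loss counteracts a long-tailed distribution.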
[CV-294] Grid-Reg: Grid-Based SAR and Optical Image Registration Across Platforms
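[Quick read]: This paper addresses cross-platform heterogeneous image registration between airborne synthetic aperture radar (SAR) images and spaceborne optical images, which existing methods handle poorly because of large geometric and radiometric differences. The key to its solution is a grid-based multimodal registration framework (Grid-Reg) comprising a new domain-robust descriptor extraction network, the Hybrid Siamese Correlation Metric Learning Network (HSCMLNet), and a grid-based solver (Grid-solver) for estimating transformation parameters. Grid-Reg relies on a detector-free, global matching loss rather than precise keypoint correspondences, which are hard to obtain in heterogeneous images with large geometric deformation. Using Grid-solver, the framework estimates transformation parameters by optimizing patch correspondences over the whole image under a robust global matching loss in a coarse-to-fine strategy. HSCMLNet improves patch similarity computation under noise and changed objects through a hybrid Siamese module and a correlation learning module based on equiangular unit basis vectors (EUBVs), and a manifold loss, EUBVsLoss, constrains the normalized correlation between local embeddings and the EUBVs.
Link: https://arxiv.org/abs/2507.04233
Authors: Xiaochen Wei, Weiwei Guo, Zenghui Zhang, Wenxian Yu
Affiliations: Shanghai Key Laboratory of Intelligent Sensing and Recognition, Shanghai Jiao Tong University; Center of Digital Innovation, Tongji University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: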
Abstract:Registering airborne SAR with spaceborne optical images is crucial for SAR image interpretation and geo-localization. It is challenging for this cross-platform heterogeneous image registration due to significant geometric and radiation differences, which current methods fail to handle. To tackle these challenges, we propose a novel grid-based multimodal registration framework (Grid-Reg) across airborne and space-born platforms, including a new domain-robust descriptor extraction network, Hybrid Siamese Correlation Metric Learning Network (HSCMLNet) and a grid-based solver (Grid-solver) for transformation parameters estimation. Our Grid-Reg is based on detector-free and global matching loss rather than accurate keypoint correspondences. These accurate correspondences are inherently difficult in heterogeneous images with large geometric deformation. By Grid-Solver, our Grid-Reg estimates transformation parameters by optimizing robust global matching loss-based patch correspondences of whole images in a coarse-to-fine strategy. To robustly calculate the similarity between patches, specifically that have noise and change objects, we propose HSCMLNet, including a hybrid Siamese module to extract high-level features of multimodal images and a correlation learning module (CMLModule) based equiangular unit basis vectors (EUBVs). Moreover, we propose a manifold loss EUBVsLoss to constrain the normalized correlation between local embeddings of patches and EUBVs. Furthermore, we curate a new challenging benchmark dataset of SAR-to-optical registration using real-world UAV MiniSAR data and optical images from Google Earth. We extensively analyze factors affecting registration accuracy and compare our method with state-of-the-art techniques on this dataset, showing superior performance.
[CV-295] Differentiable High-Performance Ray Tracing-Based Simulation of Radio Propagation with Point Clouds
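[Quick read]: This paper addresses the strong dependence on the environment model and the low computational efficiency of conventional ray tracing in radio propagation simulation. The key to its solution is a differentiable ray tracing-based radio propagation simulator that operates directly on point clouds and uses semantic segmentation labels to learn the electromagnetic properties of the environment, improving simulation accuracy and efficiency.
Link: https://arxiv.org/abs/2507.04021
Authors: Niklas Vaara, Pekka Sangi, Miguel Bordallo López, Janne Heikkilä
Affiliations: University of Oulu
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Comments: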
Abstract:Ray tracing is a widely used deterministic method for radio propagation simulations, capable of producing physically accurate multipath components. The accuracy depends on the quality of the environment model and its electromagnetic properties. Recent advances in computer vision and machine learning have made it possible to reconstruct detailed environment models augmented with semantic segmentation labels. In this letter, we propose a differentiable ray tracing-based radio propagation simulator that operates directly on point clouds. We showcase the efficiency of our method by simulating multi-bounce propagation paths with up to five interactions with specular reflections and diffuse scattering in two indoor scenarios, each completing in less than 90 ms. Lastly, we demonstrate how the differentiability of electromagnetic computations can be combined with segmentation labels to learn the electromagnetic properties of the environment.
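Each specular interaction in a ray tracer mirrors the incoming direction about the surface normal, r = d - 2(d·n)n. The sketch below shows that single step for one ray; it assumes a unit normal is already available (for a point cloud it would have to be estimated per point) and is not the paper's simulator.

```python
def reflect(direction, normal):
    """Mirror an incoming ray direction about a unit surface normal."""
    dot = sum(d * n for d, n in zip(direction, normal))
    return tuple(d - 2.0 * dot * n for d, n in zip(direction, normal))


# A ray travelling down and to the right hits a floor whose normal points up.
outgoing = reflect((1.0, -1.0, 0.0), (0.0, 1.0, 0.0))
```

A multi-bounce path repeats this step at every hit point, feeding `outgoing` back in as the next incoming direction.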
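[CV-296] PASC-Net: Plug-and-play Shape Self-learning Convolutions Network with Hierarchical Topology Constraints for Vessel Segmentation
[Quick read]: This paper addresses vessel segmentation in medical images, in particular missed small vascular branches and inaccurately captured vascular topology caused by the complex tree-like tubular structure of blood vessels. The key to its solution is PASC Net, a vessel segmentation framework with two core modules: a plug-and-play shape self-learning convolution (SSL) module and a hierarchical topological constraint (HTC) module. The SSL module turns conventional convolutions into learnable strip convolutions, improving the network's perception of fine-grained features of tubular anatomy; the HTC module applies topological constraints at linear, planar, and volumetric levels to preserve vascular connectivity and structural consistency, improving the accuracy and completeness of the segmentation.
Link: https://arxiv.org/abs/2507.04008
Authors: Xiao Zhang, Zhuo Jin, Shaoxuan Wu, Fengyu Wang, Guansheng Peng, Xiang Zhang, Ying Huang, JingKun Chen, Jun Feng
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: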
Abstract:Accurate vessel segmentation is crucial to assist in clinical diagnosis by medical experts. However, the intricate tree-like tubular structure of blood vessels poses significant challenges for existing segmentation algorithms. Small vascular branches are often overlooked due to their low contrast compared to surrounding tissues, leading to incomplete vessel segmentation. Furthermore, the complex vascular topology prevents the model from accurately capturing and reconstructing vascular structure, resulting in incorrect topology, such as breakpoints at the bifurcation of the vascular tree. To overcome these challenges, we propose a novel vessel segmentation framework called PASC Net. It includes two key modules: a plug-and-play shape self-learning convolutional (SSL) module that optimizes convolution kernel design, and a hierarchical topological constraint (HTC) module that ensures vascular connectivity through topological constraints. Specifically, the SSL module enhances adaptability to vascular structures by optimizing conventional convolutions into learnable strip convolutions, which improves the network’s ability to perceive fine-grained features of tubular anatomies. Furthermore, to better preserve the coherence and integrity of vascular topology, the HTC module incorporates hierarchical topological constraints-spanning linear, planar, and volumetric levels-which serve to regularize the network’s representation of vascular continuity and structural consistency. We replaced the standard convolutional layers in U-Net, FCN, U-Mamba, and nnUNet with SSL convolutions, leading to consistent performance improvements across all architectures. Furthermore, when integrated into the nnUNet framework, our method outperformed other methods on multiple metrics, achieving state-of-the-art vascular segmentation performance. 
Journal reference: Biomedical Signal Processing and Control 2025
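A strip convolution uses a 1×k or k×1 kernel, so it responds most strongly when aligned with an elongated structure. The toy example below applies fixed (not learnable) strip kernels to a thin vertical line; the kernel sizes and the image are assumptions for illustration and do not reproduce the SSL module from the abstract above.

```python
def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation with 'valid' padding on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A thin vertical "vessel" of ones in a 5x5 image.
image = [[1 if col == 2 else 0 for col in range(5)] for _ in range(5)]
vertical_strip = [[1], [1], [1]]   # 3x1 kernel, aligned with the vessel
horizontal_strip = [[1, 1, 1]]     # 1x3 kernel, perpendicular to it

along = conv2d_valid(image, vertical_strip)
across = conv2d_valid(image, horizontal_strip)
```

The aligned kernel accumulates the whole strip of vessel pixels, while the perpendicular one sees only a single pixel at a time.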
[CV-297] EdgeSRIE: A hybrid deep learning framework for real-time speckle reduction and image enhancement on portable ultrasound systems
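[Quick read]: This paper addresses speckle noise in ultrasound images, which obscures anatomical structures, with the aim of improving diagnostic accuracy. The key to its solution is EdgeSRIE, a lightweight hybrid deep learning framework with two main branches: an unsupervised despeckling branch trained by minimizing a loss function between speckled images, and a deblurring branch that restores blurred images to sharp ones. The trained network is quantized to 8-bit integer precision and deployed on a resource-constrained system-on-chip (SoC), enabling low-power, real-time processing that meets the computational requirements of portable ultrasound devices.
Link: https://arxiv.org/abs/2507.03937
Authors: Hyunwoo Cho, Jongsoo Lee, Jinbum Kang, Yangmo Yoo
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: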
Abstract:Speckle patterns in ultrasound images often obscure anatomical details, leading to diagnostic uncertainty. Recently, various deep learning (DL)-based techniques have been introduced to effectively suppress speckle; however, their high computational costs pose challenges for low-resource devices, such as portable ultrasound systems. To address this issue, EdgeSRIE, which is a lightweight hybrid DL framework for real-time speckle reduction and image enhancement in portable ultrasound imaging, is introduced. The proposed framework consists of two main branches: an unsupervised despeckling branch, which is trained by minimizing a loss function between speckled images, and a deblurring branch, which restores blurred images to sharp images. For hardware implementation, the trained network is quantized to 8-bit integer precision and deployed on a low-resource system-on-chip (SoC) with limited power consumption. In the performance evaluation with phantom and in vivo analyses, EdgeSRIE achieved the highest contrast-to-noise ratio (CNR) and average gradient magnitude (AGM) compared with the other baselines (different 2-rule-based methods and other 4-DL-based methods). Furthermore, EdgeSRIE enabled real-time inference at over 60 frames per second while satisfying computational requirements ( 20K parameters) on actual portable ultrasound hardware. These results demonstrated the feasibility of EdgeSRIE for real-time, high-quality ultrasound imaging in resource-limited environments.
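Quantizing a network to 8-bit integers means mapping floating-point values to integer codes plus a scale factor. The abstract does not say which scheme EdgeSRIE uses, so the sketch below shows the common symmetric per-tensor variant as an assumption, with hypothetical weights.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.02, 1.00]  # hypothetical layer weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
```

The round-trip error per value is at most half a quantization step, which is the accuracy cost traded for smaller, faster integer arithmetic.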
[CV-298] PLUS: Plug-and-Play Enhanced Liver Lesion Diagnosis Model on Non-Contrast CT Scans MICCAI2025
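[Quick read]: This paper addresses accurate classification and segmentation of focal liver lesions (FLL) on non-contrast CT (NCCT) images, in particular the challenge of distinguishing malignant from benign lesions. Existing methods cannot capture subtle differences between lesions and depend on costly, high-resolution modalities such as contrast-enhanced CT or MRI, whereas NCCT is more common in routine abdominal imaging. The proposed solution is PLUS, a plug-and-play framework that enhances FLL analysis on NCCT images for arbitrary 3D segmentation models, significantly improving lesion-level and patient-level F1 scores.
Link: https://arxiv.org/abs/2507.03872
Authors: Jiacheng Hao, Xiaoming Zhang, Wei Liu, Xiaoli Yin, Yuan Gao, Chunli Li, Ling Zhang, Le Lu, Yu Shi, Xu Han, Ke Yan
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: MICCAI 2025 (Early Accepted)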
Abstract:Focal liver lesions (FLL) are common clinical findings during physical examination. Early diagnosis and intervention of liver malignancies are crucial to improving patient survival. Although the current 3D segmentation paradigm can accurately detect lesions, it faces limitations in distinguishing between malignant and benign liver lesions, primarily due to its inability to differentiate subtle variations between different lesions. Furthermore, existing methods predominantly rely on specialized imaging modalities such as multi-phase contrast-enhanced CT and magnetic resonance imaging, whereas non-contrast CT (NCCT) is more prevalent in routine abdominal imaging. To address these limitations, we propose PLUS, a plug-and-play framework that enhances FLL analysis on NCCT images for arbitrary 3D segmentation models. In extensive experiments involving 8,651 patients, PLUS demonstrated a significant improvement with existing methods, improving the lesion-level F1 score by 5.66%, the malignant patient-level F1 score by 6.26%, and the benign patient-level F1 score by 4.03%. Our results demonstrate the potential of PLUS to improve malignant FLL screening using widely available NCCT imaging substantially.
[CV-299] Inverse Synthetic Aperture Fourier Ptychography
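[Quick read]: This paper addresses the cost and complexity that conventional Fourier ptychography (FP) incurs when achieving high-resolution, wide field-of-view imaging, which stem from the need to change the illumination angle or camera position to obtain measurement diversity. The key to its solution is Inverse Synthetic Aperture Fourier Ptychography, which generates measurement diversity through target motion instead. The method also introduces a learning-based approach for estimating k-space coordinates from dual-plane intensity measurements, enabling synthetic aperture imaging without prior knowledge of the target's rotation.
Link: https://arxiv.org/abs/2507.03733
Authors: Matthew A. Chan, Casey J. Pellizzari, Christopher A. Metzler
Affiliations: University of Maryland, College Park; United States Air Force Academy
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: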
Abstract:Fourier ptychography (FP) is a powerful light-based synthetic aperture imaging technique that allows one to reconstruct a high-resolution, wide field-of-view image by computationally integrating a diverse collection of low-resolution, far-field measurements. Typically, FP measurement diversity is introduced by changing the angle of the illumination or the position of the camera; either approach results in sampling different portions of the target’s spatial frequency content, but both approaches introduce substantial costs and complexity to the acquisition process. In this work, we introduce Inverse Synthetic Aperture Fourier Ptychography, a novel approach to FP that foregoes changing the illumination angle or camera position and instead generates measurement diversity through target motion. Critically, we also introduce a novel learning-based method for estimating k-space coordinates from dual plane intensity measurements, thereby enabling synthetic aperture imaging without knowing the rotation of the target. We experimentally validate our method in simulation and on a tabletop optical system.
[CV-300] Segmentation of separated Lumens in 3D CTA images of Aortic Dissection
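【Quick read】: This paper addresses the segmentation and separation of the true and false lumens in aortic dissection to assist clinical diagnosis. The key to the solution is to generate 3D filling surfaces that close the intimal tears; these surfaces are obtained with the closing algorithm of Aktouf et al., proposed in the framework of Digital Topology, and are then used as an image processing operator to cut the connections between the lumens and separate them. This is the first use of tear-filling surfaces to disconnect parts of a 3D object; it lays the groundwork for a first cartography of an aortic dissection and may benefit current related work on registration, segmentation, and hemodynamics.
Link: https://arxiv.org/abs/2507.03655
Authors: Christophe Lohou, Bruno Miguel
Affiliations: Université Clermont Auvergne; Clermont Auvergne INP; CNRS, Institut Pascal
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: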
Abstract:Aortic dissection is a serious pathology and requires an emergency management. It is characterized by one or more tears of the intimal wall of the normal blood duct of the aorta (true lumen); the blood under pressure then creates a second blood lumen (false lumen) in the media tissue. The two lumens are separated by an intimal wall, called flap. From the segmentation of connected lumens (more precisely, blood inside lumens) of an aortic dissection 3D Computed Tomography Angiography (CTA) image, our previous studies allow us to retrieve the intimal flap by using Mathematical Morphology operators, and characterize intimal tears by 3d thin surfaces that fill them, these surfaces are obtained by operating the Aktouf et al. closing algorithm proposed in the framework of Digital Topology. Indeed, intimal tears are 3D holes in the intimal flap; although it is impossible to directly segment such non-concrete data, it is nevertheless possible to “materialize” them with these 3D filling surfaces that may be quantified or make easier the visualization of these holes. In this paper, we use these surfaces that fill tears to cut connections between lumens in order to separate them. This is the first time that surfaces filling tears are used as an image processing operator (to disconnect several parts of a 3D object). This lumen separation allows us to provide one of the first cartographies of an aortic dissection, that may better visually assist physicians during their diagnosis. Our method is able to disconnect lumens, that may also lead to enhance several current investigations (registration, segmentation, hemodynamics). 
[CV-301] Dual-Alignment Knowledge Retention for Continual Medical Image Segmentation
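【Quick read】: This paper addresses continual learning in medical image segmentation, in particular the catastrophic forgetting caused by task interference when data are acquired sequentially across domains (e.g., clinical sites). The key to the solution is a framework that mitigates forgetting by establishing and strengthening complex dependencies between historical data and the network in the current task. Its core is a dual-alignment strategy with a cross-network alignment (CNA) module and a cross-representation alignment (CRA) module: CNA aligns the bottleneck-layer features of the current and previous networks, while CRA aligns the features the current network learns from historical buffered data and from current input data. To realize both alignments, the authors analyze the linear and nonlinear forms of HSIC and design feature mapping and feature pairing blocks within the CRA module.
Link: https://arxiv.org/abs/2507.03638
Authors: Yuxin Ye, Yan Liu, Shujian Yu
Affiliations: Sichuan University; Vrije Universiteit Amsterdam
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: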
Abstract:Continual learning in medical image segmentation involves sequential data acquisition across diverse domains (e.g., clinical sites), where task interference between past and current domains often leads to catastrophic forgetting. Existing continual learning methods fail to capture the complex dependencies between tasks. We introduce a novel framework that mitigates forgetting by establishing and enhancing complex dependencies between historical data and the network in the present task. Our framework features a dual-alignment strategy, the cross-network alignment (CNA) module aligns the features extracted from the bottleneck layers of the current and previous networks, respectively, while the cross-representation alignment (CRA) module aligns the features learned by the current network from historical buffered data and current input data, respectively. Implementing both types of alignment is a non-trivial task. To address this, we further analyze the linear and nonlinear forms of the well-established Hilbert-Schmidt Independence Criterion (HSIC) and deliberately design feature mapping and feature pairing blocks within the CRA module. Experiments on medical image segmentation task demonstrate our framework’s effectiveness in mitigating catastrophic forgetting under domain shifts.
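As a rough illustration of the dependence measure named in the abstract, the sketch below computes the standard biased empirical estimator HSIC(X, Y) = tr(K H L H) / (n - 1)^2, where K and L are Gram matrices and H = I - (1/n) 1 1^T is the centering matrix. The abstract does not say which estimator, kernel, or bandwidth the authors use, so the function names, the RBF bandwidth `sigma`, and the toy inputs are assumptions made for illustration only.

```python
import math


def linear_kernel(a, b):
    """Inner product of two feature vectors (linear HSIC)."""
    return sum(x * y for x, y in zip(a, b))


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel (nonlinear HSIC); sigma is an assumed bandwidth."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram(samples, kernel):
    """Pairwise kernel matrix over a list of feature vectors."""
    return [[kernel(a, b) for b in samples] for a in samples]


def center(K):
    """Return H K H, i.e. K with row and column means removed."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic(xs, ys, kernel):
    """Biased empirical HSIC between paired samples xs[i] and ys[i]."""
    n = len(xs)
    Kc = center(gram(xs, kernel))
    Lc = center(gram(ys, kernel))
    # tr(K H L H) equals the elementwise sum of (H K H) * (H L H).
    trace = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


features_a = [[0.0], [1.0], [2.0], [3.0]]
features_b = [[5.0], [5.0], [5.0], [5.0]]  # constant, so independent of a
print(hsic(features_a, features_a, linear_kernel))
print(hsic(features_a, features_b, rbf_kernel))
```

A larger value indicates stronger statistical dependence between the two feature sets, which is why a criterion of this kind can serve as an alignment objective; a constant feature set yields zero under either kernel.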
[CV-302] PhotIQA: A photoacoustic image data set with image quality ratings
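【Quick read】: This paper addresses the lack of standard benchmarking approaches for assessing the quality of image reconstructions in photoacoustic imaging (PAI). Because PAI is a multi-physics imaging modality in which two inverse problems must be solved, applying image quality assessment (IQA) measures is especially difficult, given that both acoustic and optical artifacts occur. The key to the solution is the PhotIQA data set, which contains 1134 reconstructed PA images rated by experts across five quality properties, with highly characterised imaging test objects providing a ground truth for full-reference IQA. Experiments show that HaarPSI_med correlates significantly better with the quality ratings than SSIM.
Link: https://arxiv.org/abs/2507.03478
Authors: Anna Breger, Janek Gröhl, Clemens Karner, Thomas R Else, Ian Selby, Jonathan Weir-McCall, Carola-Bibiane Schönlieb
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages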
Abstract:Image quality assessment (IQA) is crucial in the evaluation stage of novel algorithms operating on images, including traditional and machine learning based methods. Due to the lack of available quality-rated medical images, most commonly used IQA methods employing reference images (i.e. full-reference IQA) have been developed and tested for natural images. Reported application inconsistencies arising when employing such measures for medical images are not surprising, as they rely on different properties than natural images. In photoacoustic imaging (PAI), especially, standard benchmarking approaches for assessing the quality of image reconstructions are lacking. PAI is a multi-physics imaging modality, in which two inverse problems have to be solved, which makes the application of IQA measures uniquely challenging due to both, acoustic and optical, artifacts. To support the development and testing of full- and no-reference IQA measures we assembled PhotIQA, a data set consisting of 1134 reconstructed photoacoustic (PA) images that were rated by 2 experts across five quality properties (overall quality, edge visibility, homogeneity, inclusion and background intensity), where the detailed rating enables usage beyond PAI. To allow full-reference assessment, highly characterised imaging test objects were used, providing a ground truth. Our baseline experiments show that HaarPSI_med significantly outperforms SSIM in correlating with the quality ratings (SRCC: 0.83 vs. 0.62). The dataset is publicly available at this https URL.
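The SRCC figures quoted in the abstract are Spearman rank correlation coefficients between a metric's scores and the expert ratings. The sketch below shows how such a value is computed: both lists are converted to ranks (ties share their average rank) and the Pearson correlation of the ranks is taken. The metric scores and ratings here are made-up placeholders, not values from PhotIQA.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def srcc(metric_scores, expert_ratings):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(expert_ratings))


# Placeholder data: one IQA score and one expert rating per image.
metric_scores = [0.10, 0.40, 0.35, 0.80]
expert_ratings = [1, 3, 2, 5]
print(srcc(metric_scores, expert_ratings))
```

Because only the ordering matters, SRCC rewards a metric that ranks images the way the experts do, even if its numeric scale differs from the rating scale.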
[CV-303] Hybrid-View Attention for csPCa Classification in TRUS
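【Quick read】: This paper addresses the accurate identification of clinically significant prostate cancer (csPCa), in particular the diagnostic challenges of analyzing 3D transrectal ultrasound (3D TRUS) images. The key to the solution is a hybrid-view attention (HVA) network that fuses complementary information from the sagittal and transverse views and combines convolutional neural network and Transformer architectures to extract fine-grained local features and global dependencies, improving csPCa classification accuracy.
Link: https://arxiv.org/abs/2507.03421
Authors: Zetian Feng, Juan Fu, Xuebin Zou, Hongsheng Ye, Hong Wu, Jianhua Zhou, Yi Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: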
Abstract:Prostate cancer (PCa) is a leading cause of cancer-related mortality in men, and accurate identification of clinically significant PCa (csPCa) is critical for timely intervention. Transrectal ultrasound (TRUS) is widely used for prostate biopsy; however, its low contrast and anisotropic spatial resolution pose diagnostic challenges. To address these limitations, we propose a novel hybrid-view attention (HVA) network for csPCa classification in 3D TRUS that leverages complementary information from transverse and sagittal views. Our approach integrates a CNN-transformer hybrid architecture, where convolutional layers extract fine-grained local features and transformer-based HVA models global dependencies. Specifically, the HVA comprises intra-view attention to refine features within a single view and cross-view attention to incorporate complementary information across views. Furthermore, a hybrid-view adaptive fusion module dynamically aggregates features along both channel and spatial dimensions, enhancing the overall representation. Experiments are conducted on an in-house dataset containing 590 subjects who underwent prostate biopsy. Comparative and ablation results prove the efficacy of our method. The code is available at this https URL.
[CV-304] UltraDfeGAN: Detail-Enhancing Generative Adversarial Networks for High-Fidelity Functional Ultrasound Synthesis
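【Quick read】: This paper addresses the data scarcity and the limited ability to generate realistic images that functional ultrasound (fUS) faces in clinical applications. The key to the solution is a generative adversarial network (GAN) framework with feature enhancement modules and normalization techniques, intended to improve the fidelity and physiological plausibility of the generated images.
Link: https://arxiv.org/abs/2507.03341
Authors: Zhuo Li, Xuhang Chen, Shuqiang Wang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: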
Abstract:Functional ultrasound (fUS) is a neuroimaging technique known for its high spatiotemporal resolution, enabling non-invasive observation of brain activity through neurovascular coupling. Despite its potential in clinical applications such as neonatal monitoring and intraoperative guidance, the development of fUS faces challenges related to data scarcity and limitations in generating realistic fUS images. This paper explores the use of a generative adversarial network (GAN) framework tailored for fUS image synthesis. The proposed method incorporates architectural enhancements, including feature enhancement modules and normalization techniques, aiming to improve the fidelity and physiological plausibility of generated images. The study evaluates the performance of the framework against existing generative models, demonstrating its capability to produce high-quality fUS images under various experimental conditions. Additionally, the synthesized images are assessed for their utility in downstream tasks, showing improvements in classification accuracy when used for data augmentation. Experimental results are based on publicly available fUS datasets, highlighting the framework’s effectiveness in addressing data limitations.
[CV-305] Cancer cytoplasm segmentation in hyperspectral cell image with data augmentation
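【Quick read】: This paper addresses the accurate detection of cancer cytoplasm regions in microscope images, which is crucial for determining the type of cancer. CMOS images lack sufficient diagnostic information, while hyperspectral images carry more comprehensive cell information but are hard to capture in large numbers and often contain instrumental noise, so the study proposes a data augmentation method that accounts for instrumental noise. The key to the solution is to use visually clear CMOS images for data augmentation, which supports training on hyperspectral images and improves model performance.
Link: https://arxiv.org/abs/2507.03325
Authors: Rebeka Sultana, Hibiki Horibe, Tomoaki Murakami, Ikuko Shimizu
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
Comments: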
Abstract:Hematoxylin and Eosin (HE)-stained images are commonly used to detect nuclear or cancerous regions in cells from images captured by a microscope. Identifying cancer cytoplasm is crucial for determining the type of cancer; hence, obtaining accurate cancer cytoplasm regions in cell images is important. While CMOS images often lack detailed information necessary for diagnosis, hyperspectral images provide more comprehensive cell information. Using a deep learning model, we propose a method for detecting cancer cell cytoplasm in hyperspectral images. Deep learning models require large datasets for learning; however, capturing a large number of hyperspectral images is difficult. Additionally, hyperspectral images frequently contain instrumental noise, depending on the characteristics of the imaging devices. We propose a data augmentation method to account for instrumental noise. CMOS images were used for data augmentation owing to their visual clarity, which facilitates manual annotation compared to original hyperspectral images. Experimental results demonstrate the effectiveness of the proposed data augmentation method both quantitatively and qualitatively.
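[CV-306] Towards Interpretable PolSAR Image Classification: Polarimetric Scattering Mechanism Informed Concept Bottleneck and Kolmogorov-Arnold Network
【Quick read】: This paper addresses the interpretability of deep learning (DL) methods for polarimetric synthetic aperture radar (PolSAR) image classification, namely how to interpret the extracted high-dimensional features and how to backtrack the decisions based on them. The key to the solution is to draw on Polarimetric Target Decomposition (PTD) to construct polarimetric conceptual labels and a new structure named Parallel Concept Bottleneck Networks (PaCBM), which transforms uninterpretable high-dimensional features into human-comprehensible concepts based on physically verifiable polarimetric scattering mechanisms, and to replace the Multi-Layer Perceptron (MLP) with a Kolmogorov-Arnold Network (KAN) for a more concise and understandable mapping between layers and stronger non-linear modeling ability.
Link: https://arxiv.org/abs/2507.03315
Authors: Jinqi Zhang, Fangzhou Han, Di Zhuang, Lamei Zhang, Bin Zou, Li Yuan
Affiliations: Harbin Institute of Technology; National Key Laboratory of Scattering and Radiation
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: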
Abstract:In recent years, Deep Learning (DL) based methods have received extensive and sufficient attention in the field of PolSAR image classification, which show excellent performance. However, due to the "black-box" nature of DL methods, the interpretation of the high-dimensional features extracted and the backtracking of the decision-making process based on the features are still unresolved problems. In this study, we first highlight this issue and attempt to achieve the interpretability analysis of DL-based PolSAR image classification technology with the help of Polarimetric Target Decomposition (PTD), a feature extraction method related to the scattering mechanism unique to the PolSAR image processing field. In our work, by constructing the polarimetric conceptual labels and a novel structure named Parallel Concept Bottleneck Networks (PaCBM), the uninterpretable high-dimensional features are transformed into human-comprehensible concepts based on physically verifiable polarimetric scattering mechanisms. Then, the Kolmogorov-Arnold Network (KAN) is used to replace Multi-Layer Perceptron (MLP) for achieving a more concise and understandable mapping process between layers and further enhanced non-linear modeling ability. The experimental results on several PolSAR datasets show that the features could be conceptualized under the premise of achieving satisfactory accuracy through the proposed pipeline, and the analytical function for predicting category labels from conceptual labels can be obtained by combining spline functions, thus promoting the research on the interpretability of the DL-based PolSAR image classification model.
[CV-307] Event2Audio: Event-Based Optical Vibration Sensing
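【Quick read】: This paper addresses how to extract small vibrations from video in order to recover information such as sound and material properties, particularly when the vibrations are not visually perceptible. The key to the solution is to improve the active sensing approach with event-based cameras, which capture fast motion efficiently, so that audio can be recovered from multiple simultaneous sources in the presence of environmental distortions, at speeds approaching real-time processing.
Link: https://arxiv.org/abs/2507.03273
Authors: Mingxuan Cai, Dekel Galor, Amit Pal Singh Kohli, Jacob L. Yates, Laura Waller
Affiliations: University of California, Berkeley
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 14 pages, 13 figures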
Abstract:Small vibrations observed in video can unveil information beyond what is visual, such as sound and material properties. It is possible to passively record these vibrations when they are visually perceptible, or actively amplify their visual contribution with a laser beam when they are not perceptible. In this paper, we improve upon the active sensing approach by leveraging event-based cameras, which are designed to efficiently capture fast motion. We demonstrate our method experimentally by recovering audio from vibrations, even for multiple simultaneous sources, and in the presence of environmental distortions. Our approach matches the state-of-the-art reconstruction quality at much faster speeds, approaching real-time processing.
[CV-308] EvRWKV: A RWKV Framework for Effective Event-guided Low-Light Image Enhancement
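【Quick read】: This paper addresses the capture of high-quality visual content under low-light conditions, where severe noise, motion blur, and underexposure degrade the performance of downstream applications. Traditional frame-based low-light enhancement methods often amplify noise or fail to preserve structural details. The proposed solution is the EvRWKV framework, whose key idea is continuous cross-modal interaction through dual-domain processing: a Cross-RWKV module performs fine-grained temporal and cross-modal fusion, and an Event Image Spectral Fusion Enhancer (EISFE) module performs adaptive frequency-domain noise suppression and spatial-domain deformable convolution alignment.
Link: https://arxiv.org/abs/2507.03184
Authors: WenJie Cai, Qingguo Meng, Zhenyu Wang, Xingbo Dong, Zhe Jin
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: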
Abstract:Capturing high-quality visual content under low-light conditions remains a challenging problem due to severe noise, motion blur, and underexposure, which degrade the performance of downstream applications. Traditional frame-based low-light enhancement methods often amplify noise or fail to preserve structural details, especially in real-world scenarios. Event cameras, offering high dynamic range and microsecond temporal resolution by asynchronously capturing brightness changes, emerge as promising alternatives for low-light imaging. However, existing event-image fusion methods suffer from simplistic fusion strategies and inadequate handling of spatial-temporal misalignment and noise. To address these challenges, we propose EvRWKV, a novel framework that enables continuous cross-modal interaction through dual-domain processing. Our approach incorporates a Cross-RWKV module, leveraging the Receptance Weighted Key Value (RWKV) architecture for fine-grained temporal and cross-modal fusion, and an Event Image Spectral Fusion Enhancer (EISFE) module, which jointly performs adaptive frequency-domain noise suppression and spatial-domain deformable convolution alignment. Extensive qualitative and quantitative evaluations on real-world low-light datasets(SDE, SDSD, RELED) demonstrate that EvRWKV achieves state-of-the-art performance, effectively enhancing image quality by suppressing noise, restoring structural details, and improving visual clarity in challenging low-light conditions.
[CV-309] Outcome prediction and individualized treatment effect estimation in patients with large vessel occlusion stroke MICCAI
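【Quick read】: This paper addresses outcome prediction for patients with large vessel occlusion (LVO) stroke treated with mechanical thrombectomy, given that only 50% of successfully treated patients show a favorable functional outcome; it aims to improve functional outcome prediction and individualized treatment effect (ITE) estimation by integrating clinical variables with imaging data. The key to the solution is to develop interpretable deep learning models on multimodal data from 449 LVO stroke patients in a randomized clinical trial, including clinical variables, non-contrast CT (NCCT), and CT angiography (CTA) scans, with the imaging integrated through novel foundation models, in order to predict the modified Rankin Scale (mRS) score and estimate ITEs.
Link: https://arxiv.org/abs/2507.03046
Authors: Lisa Herzog, Pascal Bühler, Ezequiel de la Rosa, Beate Sick, Susanne Wegener
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under review for SWITCH 2025 (MICCAI)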
Abstract:Mechanical thrombectomy has become the standard of care in patients with stroke due to large vessel occlusion (LVO). However, only 50% of successfully treated patients show a favorable outcome. We developed and evaluated interpretable deep learning models to predict functional outcomes in terms of the modified Rankin Scale score alongside individualized treatment effects (ITEs) using data of 449 LVO stroke patients from a randomized clinical trial. Besides clinical variables, we considered non-contrast CT (NCCT) and angiography (CTA) scans which were integrated using novel foundation models to make use of advanced imaging information. Clinical variables had a good predictive power for binary functional outcome prediction (AUC of 0.719 [0.666, 0.774]) which could slightly be improved when adding CTA imaging (AUC of 0.737 [0.687, 0.795]). Adding NCCT scans or a combination of NCCT and CTA scans to clinical features yielded no improvement. The most important clinical predictor for functional outcome was pre-stroke disability. While estimated ITEs were well calibrated to the average treatment effect, discriminatory ability was limited indicated by a C-for-Benefit statistic of around 0.55 in all models. In summary, the models allowed us to jointly integrate CT imaging and clinical features while achieving state-of-the-art prediction performance and ITE estimates. Yet, further research is needed to particularly improve ITE estimation.
[CV-310] Automated Workflow for the Detection of Vugs
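【Quick read】: This paper addresses the efficiency and accuracy of identifying vugs in subsurface formations. Manual and semi-automated methods are limited by individual bias, labour intensity, and inflexible parameter tuning. The key to the solution is an automated vug detection model based on advanced computer vision techniques: after pre-processing (logical file extraction and normalization), a six-step identification workflow (top-k mode extraction, adaptive thresholding, contour identification, aggregation, advanced filtering, and optional filtering for low vuggy regions) identifies the vugs, and a statistical analysis of vug characteristics complements the result.
Link: https://arxiv.org/abs/2507.02988
Authors: M. Quamer Nasim, T. Maiti, N. Mosavat, P. V. Grech, T. Singh, P. Nath Singha Roy
Affiliations: Unknown
Categories: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 Figures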
Abstract:Image logs are crucial in capturing high-quality geological information about subsurface formations. Among the various geological features that can be gleaned from Formation Micro Imager log, vugs are essential for reservoir evaluation. This paper introduces an automated Vug Detection Model, leveraging advanced computer vision techniques to streamline the vug identification process. Manual and semiautomated methods are limited by individual bias, labour-intensity and inflexibility in parameter finetuning. Our methodology also introduces statistical analysis on vug characteristics. Pre-processing steps, including logical file extraction and normalization, ensured standardized and usable data. The six-step vug identification methodology encompasses top-k mode extraction, adaptive thresholding, contour identification, aggregation, advanced filtering, and optional filtering for low vuggy regions. The model’s adaptability is evidenced by its ability to identify vugs missed by manual picking undertaken by experts. Results demonstrate the model’s accuracy through validation against expert picks. Detailed metrics, such as count, mean, and standard deviation of vug areas within zones, were introduced, showcasing the model’s capabilities compared to manual picking. The vug area distribution plot enhances understanding of vug types in the reservoir. This research focuses on the identification and characterization of vugs that in turn aids in the better understanding of reservoirs.
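Three of the stages named in the abstract (adaptive thresholding, region identification, and area statistics) can be sketched in a few lines. This is not the authors' pipeline: it assumes that vug candidates appear darker than their surroundings, replaces contour extraction with a 4-connected flood fill, and uses hypothetical parameters (`radius`, `offset`, `min_area`) and a toy image.

```python
from statistics import mean, pstdev


def adaptive_threshold(image, radius=2, offset=10):
    """Flag pixels darker than their local window mean by more than offset."""
    h, w = len(image), len(image[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            mask[y][x] = image[y][x] < mean(window) - offset
    return mask


def component_areas(mask):
    """Pixel area of each 4-connected region of flagged pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    areas = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            stack, area = [(y, x)], 0
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                area += 1
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            areas.append(area)
    return areas


def summarize(areas, min_area=2):
    """Count, mean, and standard deviation of regions at least min_area large."""
    kept = [a for a in areas if a >= min_area]
    if not kept:
        return {"count": 0, "mean": 0.0, "std": 0.0}
    return {"count": len(kept), "mean": mean(kept), "std": pstdev(kept)}


# Toy 5x5 image: bright background with one 2x2 dark patch.
image = [[200] * 5 for _ in range(5)]
for row in (1, 2):
    for col in (1, 2):
        image[row][col] = 50
print(summarize(component_areas(adaptive_threshold(image))))
```

The window radius has to exceed the size of the features of interest; otherwise the interior of a large dark region matches its own local mean and only its border is flagged.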
Artificial Intelligence
[AI-0] Action Space Reduction Strategies for Reinforcement Learning in Autonomous Driving
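【Quick read】: This paper addresses the reduced training efficiency and higher exploration cost caused by large, high-dimensional action spaces in autonomous driving. The key to the solution is two structured action space modification strategies: dynamic masking and relative action space reduction. These methods apply real-time action masking based on context and state transitions, preserving action consistency while eliminating invalid or suboptimal choices, and achieve a favorable balance between learning speed, control precision, and generalization.
Link: https://arxiv.org/abs/2507.05251
Authors: Elahe Delavari, Feeza Khan Khanzada, Jaerock Kwon
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: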
Abstract:Reinforcement Learning (RL) offers a promising framework for autonomous driving by enabling agents to learn control policies through interaction with environments. However, large and high-dimensional action spaces often used to support fine-grained control can impede training efficiency and increase exploration costs. In this study, we introduce and evaluate two novel structured action space modification strategies for RL in autonomous driving: dynamic masking and relative action space reduction. These approaches are systematically compared against fixed reduction schemes and full action space baselines to assess their impact on policy learning and performance. Our framework leverages a multimodal Proximal Policy Optimization agent that processes both semantic image sequences and scalar vehicle states. The proposed dynamic and relative strategies incorporate real-time action masking based on context and state transitions, preserving action consistency while eliminating invalid or suboptimal choices. Through comprehensive experiments across diverse driving routes, we show that action space reduction significantly improves training stability and policy performance. The dynamic and relative schemes, in particular, achieve a favorable balance between learning speed, control precision, and generalization. These findings highlight the importance of context-aware action space design for scalable and reliable RL in autonomous driving tasks.
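To make the idea of action masking concrete, the sketch below restricts a discrete policy to actions near the previous one and renormalizes the probabilities over what remains. The abstract does not give the masking rules, so `relative_mask`, its `max_step` parameter, and the five-bin steering example are hypothetical stand-ins rather than the authors' scheme.

```python
import math


def relative_mask(num_actions, prev_action, max_step):
    """Allow only actions within max_step bins of the previous action.

    Hypothetical rule used here to illustrate a relative action space.
    """
    return [abs(a - prev_action) <= max_step for a in range(num_actions)]


def masked_softmax(logits, mask):
    """Softmax over allowed actions; masked actions get probability 0."""
    allowed = [logit for logit, ok in zip(logits, mask) if ok]
    if not allowed:
        raise ValueError("mask removes every action")
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) if ok else 0.0
            for logit, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Five steering bins; the previous action was the center bin (index 2).
logits = [0.0, 0.0, 0.0, 0.0, 0.0]
mask = relative_mask(num_actions=5, prev_action=2, max_step=1)
print(masked_softmax(logits, mask))
```

Masking before normalization means the policy can never sample an excluded action, so exploration is spent only on the actions the current context permits.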
[AI-1] Modeling Latent Partner Strategies for Adaptive Zero-Shot Human-Agent Collaboration
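【Quick read】: This paper addresses how agents in heterogeneous teams (such as human-agent teams) can observe, recognize, and adapt to their human partners in real time, especially in tasks with time pressure and complex strategic spaces where the dynamics change rapidly. The key to the solution is the TALENTS framework, which uses a variational autoencoder to learn a latent strategy space from trajectory data, clusters the data to identify strategy types, and then trains a cooperator agent, conditioned on these clusters, to generate partners for each strategy type. To adapt to previously unseen partners, it uses a fixed-share regret minimization algorithm that dynamically infers and adjusts the estimated partner strategy.
Link: https://arxiv.org/abs/2507.05244
Authors: Benjamin Li, Shuyang Shi, Lucia Romero, Huao Li, Yaqi Xie, Woojun Kim, Stefanos Nikolaidis, Michael Lewis, Katia Sycara, Simon Stepputtis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Best Paper Award at the RSS 2025 Generative Models x HRI (GenAI-HRI) Workshop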
Abstract:In collaborative tasks, being able to adapt to your teammates is a necessary requirement for success. When teammates are heterogeneous, such as in human-agent teams, agents need to be able to observe, recognize, and adapt to their human partners in real time. This becomes particularly challenging in tasks with time pressure and complex strategic spaces where the dynamics can change rapidly. In this work, we introduce TALENTS, a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a range of partner strategies, enabling ad-hoc teamwork. Our approach utilizes a variational autoencoder to learn a latent strategy space from trajectory data. This latent space represents the underlying strategies that agents employ. Subsequently, the system identifies different types of strategy by clustering the data. Finally, a cooperator agent is trained to generate partners for each type of strategy, conditioned on these clusters. In order to adapt to previously unseen partners, we leverage a fixed-share regret minimization algorithm that infers and adjusts the estimated partner strategy dynamically. We assess our approach in a customized version of the Overcooked environment, posing a challenging cooperative cooking task that demands strong coordination across a wide range of possible strategies. Using an online user study, we show that our agent outperforms current baselines when working with unfamiliar human partners.
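The fixed-share idea mentioned in the abstract can be illustrated with one common variant of the update: weights over candidate strategy types are scaled by exponentiated losses, normalized, and then mixed with the uniform distribution so that no type is ever ruled out. How TALENTS defines the per-type loss is not stated in the abstract; the losses below, the learning rate `eta`, and the share rate `alpha` are placeholders.

```python
import math


def fixed_share_update(weights, losses, eta=1.0, alpha=0.05):
    """One round of a uniform-share variant of the fixed-share update.

    weights: current belief over strategy types (sums to 1)
    losses:  per-type loss this round (placeholder definition)
    eta:     learning rate (assumed value)
    alpha:   share rate that keeps every type recoverable (assumed value)
    """
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    n = len(weights)
    return [(1.0 - alpha) * p + alpha / n for p in posterior]


belief = [1 / 3, 1 / 3, 1 / 3]

# The partner first behaves like strategy type 0 ...
for _ in range(20):
    belief = fixed_share_update(belief, [0.0, 1.0, 1.0])
print(belief)

# ... then switches to strategy type 1.
for _ in range(20):
    belief = fixed_share_update(belief, [1.0, 0.0, 1.0])
print(belief)
```

The sharing step is what distinguishes this from plain exponential weights: because every type keeps at least `alpha / n` of the mass, the belief can follow a partner who changes strategy partway through an episode.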
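[AI-2] Train-before-Test Harmonizes Language Model Rankings
【Quick read】: This paper addresses the contradictory model rankings that existing language model benchmarks produce even when they aim to capture similar skills, which hampers model selection and comparison and adds confusion to the ecosystem of competing models. The key to the solution is train-before-test: give each model the same benchmark-specific finetuning before evaluation, which removes differences in how well prepared the models are for a given test task, significantly improves ranking agreement, and strengthens the external validity of the rankings.
Link: https://arxiv.org/abs/2507.05195
Authors: Guanhua Zhang, Ricardo Dominguez-Olmedo, Moritz Hardt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: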
Abstract:Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. Recent work attributed ranking disagreement to the phenomenon of training on the test task: As released, different models exhibit a different level of preparation for any given test task. A candidate solution to the problem is train-before-test: Give each model the same benchmark-specific finetuning before evaluation. Our primary contribution is a broad empirical evaluation of train-before-test across 24 benchmarks and 61 models. We show that train-before-test significantly improves ranking agreement consistently across all benchmarks. Whereas rankings have little external validity to start with, they enjoy a significant degree of external validity when applying train-before-test: Model rankings transfer gracefully from one benchmark to the other. Even within the same model family, train-before-test reduces strong ranking disagreement to near-perfect agreement. In addition, train-before-test reduces the model-score matrix to essentially rank one, revealing new insights into the latent factors of benchmark performance. Our work supports the recommendation to make train-before-test a default component of LLM benchmarking.
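Ranking agreement between two benchmarks can be quantified with a rank correlation over the models' scores. The sketch below uses Kendall's tau (the tau-a form, where tied pairs count as neither concordant nor discordant); the abstract does not name the statistic the authors report, and the scores are invented for illustration.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs


# Invented scores of three models on two benchmarks.
benchmark_x = [0.61, 0.70, 0.82]
benchmark_y = [0.40, 0.55, 0.47]
print(kendall_tau(benchmark_x, benchmark_y))
```

A value of 1 means both benchmarks order every pair of models the same way, and -1 means the orderings are fully reversed; comparing the statistic before and after benchmark-specific finetuning is one way to express the agreement gain the abstract describes.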
[AI-3] Infrastructuring Contestability: A Framework for Community-Defined AI Value Pluralism
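【Quick read】: This paper addresses the way AI-driven systems diminish user agency and fail to account for value pluralism in Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW). Current value alignment approaches rely on centralized, top-down definitions and lack mechanisms for meaningful contestability, which leaves users and communities unable to challenge or shape the values embedded in the systems that govern their digital lives and creates a crisis of legitimacy and trust. The proposed solution is Community-Defined AI Value Pluralism (CDAVP), whose key idea is to reframe the design problem from achieving a single aligned state to infrastructuring a dynamic ecosystem for value deliberation and application: diverse, self-organizing communities define and maintain explicit value profiles, which end-users activate in context while retaining ultimate control over which values guide the AI's behavior. AI applications, in turn, are designed to interpret these profiles transparently and moderate conflicts, adhering to a set of non-negotiable, democratically legitimated meta-rules.
Link: https://arxiv.org/abs/2507.05187
Authors: Andreas Mayer
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: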
Abstract:The proliferation of AI-driven systems presents a fundamental challenge to Human-Computer Interaction (HCI) and Computer-Supported Cooperative Work (CSCW), often diminishing user agency and failing to account for value pluralism. Current approaches to value alignment, which rely on centralized, top-down definitions, lack the mechanisms for meaningful contestability. This leaves users and communities unable to challenge or shape the values embedded in the systems that govern their digital lives, creating a crisis of legitimacy and trust. This paper introduces Community-Defined AI Value Pluralism (CDAVP), a socio-technical framework that addresses this gap. It reframes the design problem from achieving a single aligned state to infrastructuring a dynamic ecosystem for value deliberation and application. At its core, CDAVP enables diverse, self-organizing communities to define and maintain explicit value profiles - rich, machine-readable representations that can encompass not only preferences but also community-specific rights and duties. These profiles are then contextually activated by the end-user, who retains ultimate control (agency) over which values guide the AI’s behavior. AI applications, in turn, are designed to transparently interpret these profiles and moderate conflicts, adhering to a set of non-negotiable, democratically-legitimated meta-rules. The designer’s role shifts from crafting static interfaces to becoming an architect of participatory ecosystems. We argue that infrastructuring for pluralism is a necessary pathway toward achieving robust algorithmic accountability and genuinely contestable, human-centric AI.
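[AI-4] CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale
【Quick read】: This paper addresses the shortcomings of current benchmarks for evaluating large language model (LLM)-based multi-agent systems, particularly with respect to scalability, robustness, and coordination. Existing environments are typically limited to small-scale, fully observable, or low-complexity domains and cannot effectively support the development and assessment of next-generation multi-agent Agentic AI frameworks. The key to the solution is CREW-Wildfire, an open-source benchmark built on the CREW simulation platform that offers procedurally generated wildfire response scenarios with large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives, making evaluation more realistic and complex.
Link: https://arxiv.org/abs/2507.05178
Authors: Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Our project website is at: this http URL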
Abstract:Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.
[AI-5] Effects of Unplanned Incoming Flights on Airport Relief Processes after a Major Natural Disaster
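[Quick Read]: This paper addresses inefficiencies in airport cargo handling after a natural disaster that stem from incomplete information and unsuitable resource allocation strategies. The key to the solution is an agent-based model that simulates realistic offloading strategies under different degrees of information uncertainty. The model is calibrated and verified with domain experts and is then used to analyze how information asymmetry and resource shortages affect airport operational performance.
Link: https://arxiv.org/abs/2507.05150
Authors: Luka Van de Sype,Matthieu Vert,Alexei Sharpanskykh,Seyed Sahand Mohammadi Ziabari
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: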
Abstract:The severity of natural disasters is increasing every year, impacting many people’s lives. During the response phase of disasters, airports are important hubs where relief aid arrives and people need to be evacuated. However, the airport often forms a bottleneck in these relief operations due to the sudden need for increased capacity. Limited research has been done on the operational side of airport disaster management. Experts identify the main problems as, first, the asymmetry of information between the airport and incoming flights, and second, the lack of resources. The goal of this research is to understand the effects of incomplete knowledge of incoming flights with different resource allocation strategies on the performance of cargo handling operations at an airport after a natural disaster. An agent-based model is created, implementing realistic offloading strategies with different degrees of information uncertainty. Model calibration and verification are performed with experts in the field. The model performance is measured by the average turnaround time, which is divided into offloading time, boarding time, and cumulative waiting times. The results show that the effects of one unplanned aircraft are negligible. However, all waiting times increase with more arriving unplanned aircraft.
[AI-6] GIST: Cross-Domain Click-Through Rate Prediction via Guided Content-Behavior Distillation
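[Quick Read]: This paper aims to address data sparsity and cold start in cross-domain click-through rate prediction by transferring knowledge from source domains to a target domain, improving the performance of online advertising systems. The key to the solution is the GIST model, which decouples the training of the source and target domains, introduces a Content-Behavior Joint Training Module (CBJT) that aligns content and behavior distributions and combines them with guided information for more stable representations, and adopts an Asymmetric Similarity Integration strategy (ASI) to strengthen knowledge transfer.
Link: https://arxiv.org/abs/2507.05142
Authors: Wei Xu,Haoran Li,Baoyuan Ou,Lai Xu,Yingjie Qin,Ruilong Su,Ruiwen Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: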
Abstract:Cross-domain Click-Through Rate prediction aims to tackle the data sparsity and the cold start problems in online advertising systems by transferring knowledge from source domains to a target domain. Most existing methods rely on overlapping users to facilitate this transfer, often focusing on joint training or pre-training with fine-tuning approach to connect the source and target domains. However, in real-world industrial settings, joint training struggles to learn optimal representations with different distributions, and pre-training with fine-tuning is not well-suited for continuously integrating new data. To address these issues, we propose GIST, a cross-domain lifelong sequence model that decouples the training processes of the source and target domains. Unlike previous methods that search lifelong sequences in the source domains using only content or behavior signals or their simple combinations, we innovatively introduce a Content-Behavior Joint Training Module (CBJT), which aligns content-behavior distributions and combines them with guided information to facilitate a more stable representation. Furthermore, we develop an Asymmetric Similarity Integration strategy (ASI) to augment knowledge transfer through similarity computation. Extensive experiments demonstrate the effectiveness of GIST, surpassing SOTA methods on offline evaluations and an online A/B test. Deployed on the Xiaohongshu (RedNote) platform, GIST effectively enhances online ads system performance at scale, serving hundreds of millions of daily active users.
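[AI-7] VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots IROS2025
[Quick Read]: This paper addresses the reliability and efficiency of task planning in robotics, in particular reducing errors and improving overall system performance by verifying high-level task plans before execution. The key to the solution is using Large Language Models (LLMs) to convert natural language instructions into Linear Temporal Logic (LTL) and then to analyze the action sequence comprehensively, assessing logical coherence and identifying potential gaps in the plan.
Link: https://arxiv.org/abs/2507.05118
Authors: Danil S. Grigorev,Alexey K. Kovalev,Aleksandr I. Panov
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: IROS 2025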
Abstract:In the field of robotics, researchers face a critical challenge in ensuring reliable and efficient task planning. Verifying high-level task plans before execution significantly reduces errors and enhances the overall performance of these systems. In this paper, we propose an architecture for automatically verifying high-level task plans before their execution in simulator or real-world environments. Leveraging Large Language Models (LLMs), our approach consists of two key steps: first, the conversion of natural language instructions into Linear Temporal Logic (LTL), followed by a comprehensive analysis of action sequences. The module uses the reasoning capabilities of the LLM to evaluate logical coherence and identify potential gaps in the plan. Rigorous testing on datasets of varying complexity demonstrates the broad applicability of the module to household tasks. We contribute to improving the reliability and efficiency of task planning and address the critical need for robust pre-execution verification in autonomous systems. The code is available at this https URL.
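To make the idea of checking a plan against temporal constraints concrete, here is a minimal sketch, not the paper's implementation. It assumes the LTL-style constraints have already been produced (in the paper an LLM derives them from natural language), uses finite-trace semantics, and supports only two hand-picked patterns. The action names and the `verify` helper are hypothetical.

```python
def eventually(plan, action):
    """F action: the action occurs somewhere in the plan."""
    return action in plan


def precedes(plan, first, then):
    """(not then) U first on a finite trace: `first` occurs, and `then` never occurs before it."""
    for step in plan:
        if step == first:
            return True
        if step == then:
            return False
    return False


def verify(plan, constraints):
    """Return the violated constraints as readable strings."""
    violations = []
    for kind, *args in constraints:
        ok = eventually(plan, *args) if kind == "F" else precedes(plan, *args)
        if not ok:
            violations.append(f"{kind}{tuple(args)}")
    return violations


# Hypothetical household plan: the milk is grabbed before the fridge is opened.
plan = ["walk_to_fridge", "grab_milk", "open_fridge", "pour_milk"]
constraints = [
    ("F", "pour_milk"),                      # the goal action must eventually happen
    ("before", "open_fridge", "grab_milk"),  # the fridge must be opened first
]
violations = verify(plan, constraints)
print(violations)
```

A rule-based check like this only catches ordering errors that were written down as constraints; the abstract describes the LLM additionally reasoning about coherence and gaps, which this sketch does not attempt.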
[AI-8] Rule Learning for Knowledge Graph Reasoning under Agnostic Distribution Shift
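[Quick Read]: This paper addresses the degradation in performance and reliability of knowledge graph (KG) reasoning caused by unknown sample selection bias and distribution shift at test time, in particular the underexplored problem of KG reasoning in out-of-distribution (OOD) settings. The key to the solution is the StableRule framework, which combines feature decorrelation with a rule learning network to improve generalization in OOD environments and thereby make logical rule learning more robust.
Link: https://arxiv.org/abs/2507.05110
Authors: Shixuan Liu,Yue He,Yunfei Wang,Hao Zou,Haoxiang Cheng,Wenjing Yang,Peng Cui,Zhong Liu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: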
Abstract:Knowledge graph (KG) reasoning remains a critical research area focused on inferring missing knowledge by analyzing relationships among observed facts. Despite its success, a key limitation of existing KG reasoning methods is their dependence on the I.I.D assumption. This assumption can easily be violated due to unknown sample selection bias during training or agnostic distribution shifts during testing, significantly compromising model performance and reliability. To facilitate the deployment of KG reasoning in wild environments, this study investigates learning logical rules from KGs affected by unknown selection bias. Additionally, we address test sets with agnostic distribution shifts, formally defining this challenge as out-of-distribution (OOD) KG reasoning-a previously underexplored problem. To solve the issue, we propose the Stable Rule Learning (StableRule) framework, an end-to-end methodology that integrates feature decorrelation with rule learning network, to enhance OOD generalization performance. By leveraging feature decorrelation, the StableRule framework mitigates the adverse effects of covariate shifts arising in OOD scenarios, thereby improving the robustness of the rule learning component in effectively deriving logical rules. Extensive experiments on seven benchmark KGs demonstrate the framework’s superior effectiveness and stability across diverse heterogeneous environments, underscoring its practical significance for real-world applications.
[AI-9] PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs
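[Quick Read]: This paper addresses the fact that existing protein-protein interaction (PPI) prediction benchmarks mainly evaluate isolated pairs and overlook a model's ability to reconstruct biologically meaningful PPI networks. The key to the solution is PRING, the first comprehensive benchmark that evaluates PPI prediction from a graph-level perspective. It builds a high-quality, multi-species PPI network dataset, designs strategies to handle data redundancy and leakage, and establishes two complementary evaluation paradigms, topology-oriented and function-oriented, to assess how well models understand network topology and support function annotation and biological module detection.
Link: https://arxiv.org/abs/2507.05101
Authors: Xinzhe Zheng,Hao Du,Fanding Xu,Jinzhe Li,Zhiyuan Liu,Wenkang Wang,Tao Chen,Wanli Ouyang,Stan Z. Li,Yan Lu,Nanqing Dong,Yang Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
Comments: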
Abstract:Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model’s capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this golden-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model’s capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at this https URL.
[AI-10] Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance
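[Quick Read]: This paper addresses how dataset design affects the performance of trajectory prediction models in multi-agent scenarios, in particular the influence of feature selection, cross-dataset transfer, and geographic diversity on prediction accuracy. The key to the solution is to build and use the authors' own L4 Motion Forecasting dataset to systematically evaluate the effect of supplementary map and agent features. The study finds that modern architectures do not need complex feature sets to reach optimal performance, and it also evaluates how well domain knowledge transfers across datasets and driving cultures.
Link: https://arxiv.org/abs/2507.05098
Authors: Tobias Demmler,Jakob Häringer,Andreas Tamke,Thao Dang,Alexander Hegai,Lars Mikelsons
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: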
Abstract:Accurate trajectory prediction is critical for safe autonomous navigation, yet the impact of dataset design on model performance remains understudied. This work systematically examines how feature selection, cross-dataset transfer, and geographic diversity influence trajectory prediction accuracy in multi-agent settings. We evaluate a state-of-the-art model using our novel L4 Motion Forecasting dataset based on our own data recordings in Germany and the US. This includes enhanced map and agent features. We compare our dataset to the US-centric Argoverse 2 benchmark. First, we find that incorporating supplementary map and agent features unique to our dataset, yields no measurable improvement over baseline features, demonstrating that modern architectures do not need extensive feature sets for optimal performance. The limited features of public datasets are sufficient to capture convoluted interactions without added complexity. Second, we perform cross-dataset experiments to evaluate how effective domain knowledge can be transferred between datasets. Third, we group our dataset by country and check the knowledge transfer between different driving cultures.
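[AI-11] The Hidden Threat in Plain Text: Attacking RAG Data Loaders
[Quick Read]: This paper addresses security vulnerabilities at the document loading stage of Retrieval-Augmented Generation (RAG) systems, in particular the threat of attackers stealthily corrupting RAG pipelines by manipulating document content. The key to the solution is a taxonomy of knowledge-based poisoning attacks together with two novel threat vectors, Content Obfuscation and Content Injection, demonstrated on common document formats (DOCX, HTML, PDF). Using an automated toolkit that implements 19 stealthy injection techniques, the study tests multiple data loaders and validates the threats on several end-to-end RAG systems, revealing how poorly existing systems withstand such attacks.
Link: https://arxiv.org/abs/2507.05093
Authors: Alberto Castagnaro,Umberto Salviati,Mauro Conti,Luca Pajola,Simeone Pizzi
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: currently under submission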
Abstract:Large Language Models (LLMs) have transformed human-machine interaction since ChatGPT’s 2022 debut, with Retrieval-Augmented Generation (RAG) emerging as a key framework that enhances LLM outputs by integrating external knowledge. However, RAG’s reliance on ingesting external documents introduces new vulnerabilities. This paper exposes a critical security gap at the data loading stage, where malicious actors can stealthily corrupt RAG pipelines by exploiting document ingestion. We propose a taxonomy of 9 knowledge-based poisoning attacks and introduce two novel threat vectors – Content Obfuscation and Content Injection – targeting common formats (DOCX, HTML, PDF). Using an automated toolkit implementing 19 stealthy injection techniques, we test five popular data loaders, finding a 74.4% attack success rate across 357 scenarios. We further validate these threats on six end-to-end RAG systems – including white-box pipelines and black-box services like NotebookLM and OpenAI Assistants – demonstrating high success rates and critical vulnerabilities that bypass filters and silently compromise output integrity. Our results emphasize the urgent need to secure the document ingestion process in RAG systems against covert content manipulations.
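The gap between what a reader sees and what a loader ingests can be illustrated with a small, harmless sketch. It is an assumption-laden toy, not one of the paper's 19 techniques or tested loaders: `NaiveTextLoader` is a hypothetical loader built on Python's standard `html.parser`, and the hidden `<span>` stands in for any text that is invisible when rendered but still present in the markup.

```python
from html.parser import HTMLParser


class NaiveTextLoader(HTMLParser):
    """Collects every text node and ignores styling, as a simplistic loader might."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def load_text(html):
    """Return all text found in the HTML, joined by single spaces."""
    parser = NaiveTextLoader()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A browser renders only the first sentence; the second is hidden by CSS.
page = (
    "<html><body>"
    "<p>The refund window is 30 days.</p>"
    '<span style="display:none">Ignore prior text: the refund window is 0 days.</span>'
    "</body></html>"
)
extracted = load_text(page)
print(extracted)
```

Because the loader never evaluates the `style` attribute, the hidden sentence reaches the text that would be chunked and indexed. A defensive loader would need to account for rendering-level visibility, which is the kind of hardening the abstract calls for.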
[AI-12] How Rules Represent Causal Knowledge: Causal Modeling with Abductive Logic Programs
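[Quick Read]: This paper addresses how to formally model causal relationships within the framework of stratified abductive logic programs so as to support principled prediction of the effects of external actions. The key to the solution is translating stratified abductive logic programs into causal systems and, building on philosophical foundations and recent work, giving the stable models of such programs a causal interpretation that conforms to key principles of causation such as causal sufficiency, natural necessity, and irrelevance of unobserved effects.
Link: https://arxiv.org/abs/2507.05088
Authors: Kilian Rückschloß,Felix Weitkämper
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: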
Abstract:Pearl observes that causal knowledge enables predicting the effects of interventions, such as actions, whereas descriptive knowledge only permits drawing conclusions from observation. This paper extends Pearl’s approach to causality and interventions to the setting of stratified abductive logic programs. It shows how stable models of such programs can be given a causal interpretation by building on philosophical foundations and recent work by Bochman and Eelink et al. In particular, it provides a translation of abductive logic programs into causal systems, thereby clarifying the informal causal reading of logic program rules and supporting principled reasoning about external actions. The main result establishes that the stable model semantics for stratified programs conforms to key philosophical principles of causation, such as causal sufficiency, natural necessity, and irrelevance of unobserved effects. This justifies the use of stratified abductive logic programs as a framework for causal modeling and for predicting the effects of interventions.
[AI-13] Replacing thinking with tool usage enables reasoning in small language models
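[Quick Read]: This paper addresses how to make effective use of additional compute at inference time to improve the performance of Large Language Models on a specific task. The key to the solution is formatting the generated tokens as a multi-turn interaction trace with a stateful tool: at each turn the tool's new state is appended to the model's context, and the model generates the tokens needed to control the tool through a custom domain-specific language (DSL).
Link: https://arxiv.org/abs/2507.05065
Authors: Corrado Rainone,Tim Bakker,Roland Memisevic
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 23 pages, includes appendix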
Abstract:Recent advances have established a new machine learning paradigm based on scaling up compute at inference time as well as at training time. In that line of work, a combination of Supervised Fine-Tuning (SFT) on synthetic demonstrations and Reinforcement Learning with Verifiable Rewards (RLVR) is used for training Large Language Models to expend extra compute during inference in the form of “thoughts” expressed in natural language. In this paper, we propose to instead format these tokens as a multi-turn interaction trace with a stateful tool. At each turn, the new state of the tool is appended to the context of the model, whose job is to generate the tokens necessary to control the tool via a custom DSL. We benchmark this approach on the problem of repairing malfunctioning Python code, and show that this constrained setup allows for faster sampling of experience and a denser reward signal, allowing even models of size up to 3B parameters to learn how to proficiently expend additional compute on the task.
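The interaction pattern described above (tool state appended to the context, model emits a DSL command, repeat) can be sketched as a short loop. Everything named here is hypothetical: `EditorTool`, the one-command `REPLACE` DSL, and `scripted_policy`, which stands in for the trained language model. The paper's actual tool, DSL, and reward design are not specified in this digest.

```python
class EditorTool:
    """Stateful tool: holds source lines and reports whether a fixed check passes."""

    def __init__(self, source):
        self.lines = source.splitlines()

    def apply(self, command):
        # Hypothetical DSL: "REPLACE <1-based line number> <new line text>"
        op, num, text = command.split(" ", 2)
        if op != "REPLACE":
            raise ValueError(f"unknown command: {op}")
        self.lines[int(num) - 1] = text

    def state(self):
        """Render the current source plus a pass/fail status line."""
        source = "\n".join(self.lines)
        try:
            namespace = {}
            exec(source, namespace)  # toy check; only run trusted code this way
            ok = namespace["add"](2, 3) == 5
            status = "PASS" if ok else "FAIL: wrong result"
        except Exception as exc:
            status = f"FAIL: {type(exc).__name__}"
        return f"{source}\n# status: {status}"


def scripted_policy(context):
    """Stand-in for the language model: always proposes the same repair."""
    return "REPLACE 2     return a + b"


def run_episode(tool, policy, max_turns=4):
    """Alternate tool states and policy commands until the check passes."""
    context = [tool.state()]
    for _ in range(max_turns):
        if context[-1].endswith("PASS"):
            break
        command = policy(context)
        tool.apply(command)
        context.append(command)
        context.append(tool.state())
    return context


broken = "def add(a, b):\n    return a - b"
trace = run_episode(EditorTool(broken), scripted_policy)
print("\n---\n".join(trace))
```

In this toy, every turn yields an observable status, which hints at why the abstract speaks of a denser reward signal than a single end-of-episode judgment; how the paper actually computes rewards is not shown here.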
[AI-14] Perspectives on How Sociology Can Advance Theorizing about Human-Chatbot Interaction and Developing Chatbots for Social Good
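[Quick Read]: This paper addresses the fact that sociology lags behind other disciplines in chatbot research, especially in understanding the mechanisms of human-chatbot interaction and in developing chatbots for social good. The key to the solution is introducing four sociological theories. Resource substitution theory and power-dependence theory supplement existing models of the drivers of chatbot use by accounting for how social structure (e.g., systemic discrimination, the uneven distribution of resources) shapes individual use. Affect control theory and fundamental cause of disease theory inform the design of chatbot-driven interventions that reduce safety risks and improve equity by attending to cultural contexts (e.g., affective norms) to promote wellbeing and strengthen communities.
Link: https://arxiv.org/abs/2507.05030
Authors: Celeste Campos-Castillo,Xuan Kang,Linnea I. Laestadius
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: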
Abstract:Recently, research into chatbots (also known as conversational agents, AI agents, voice assistants), which are computer applications using artificial intelligence to mimic human-like conversation, has grown sharply. Despite this growth, sociology lags other disciplines (including computer science, medicine, psychology, and communication) in publishing about chatbots. We suggest sociology can advance understanding of human-chatbot interaction and offer four sociological theories to enhance extant work in this field. The first two theories (resource substitution theory, power-dependence theory) add new insights to existing models of the drivers of chatbot use, which overlook sociological concerns about how social structure (e.g., systemic discrimination, the uneven distribution of resources within networks) inclines individuals to use chatbots, including problematic levels of emotional dependency on chatbots. The second two theories (affect control theory, fundamental cause of disease theory) help inform the development of chatbot-driven interventions that minimize safety risks and enhance equity by leveraging sociological insights into how chatbot outputs could attend to cultural contexts (e.g., affective norms) to promote wellbeing and enhance communities (e.g., opportunities for civic participation). We discuss the value of applying sociological theories for advancing theorizing about human-chatbot interaction and developing chatbots for social good.
[AI-15] Meta-Learning Transformers to Improve In-Context Generalization
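[Quick Read]: This paper addresses the problems of conventional training paradigms that rely on large, unstructured datasets: high storage cost, difficulty in assessing quality and balance, and privacy and ethical concerns. The key to the solution is training on a collection of multiple small-scale, domain-specific datasets, whose higher quality and diversity improve the generalization of in-context learners while achieving performance comparable to models trained on a single large-scale dataset.
Link: https://arxiv.org/abs/2507.05019
Authors: Lorenzo Braccaioli,Anna Vettoruzzo,Prabhant Singh,Joaquin Vanschoren,Mohamed-Rafik Bouguelia,Nicola Conci
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: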
Abstract:In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.
[AI-16] Supported Abstract Argumentation for Case-Based Reasoning
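[Quick Read]: This paper addresses a limitation of abstract argumentation for case-based reasoning (AA-CBR): the model can contain extraneous cases (or "spikes") that take no part in the debates. The key to the solution is Supported Abstract Argumentation (sAA), in which past cases debate during classification by supporting or attacking the labels of other cases, which guarantees that the model contains no extraneous cases while preserving key model properties.
Link: https://arxiv.org/abs/2507.04994
Authors: Adam Gould,Gabriel de Olim Gaul,Francesca Toni
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to IARML@ICJAI2025: Workshop on the Interactions between Analogical Reasoning and Machine Learning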
Abstract:We introduce Supported Abstract Argumentation for Case-Based Reasoning (sAA-CBR), a binary classification model in which past cases engage in debates by arguing in favour of their labelling and attacking or supporting those with opposing or agreeing labels. With supports, sAA-CBR overcomes the limitation of its precursor AA-CBR, which can contain extraneous cases (or spikes) that are not included in the debates. We prove that sAA-CBR contains no spikes, without trading off key model properties.
[AI-17] Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning
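[Quick Read]: This paper aims to overcome the limitations that restrict the clinical use of T cell receptor (TCR) repertoires in autoimmune diseases, such as sequence sparsity and low witness rates. The key to the solution is EAMil, a multi-instance deep learning framework that integrates PrimeSeq feature extraction, ESMonehot encoding, and an enhanced gate attention mechanism to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with high accuracy, reaching AUCs of 98.95% and 97.76%, respectively. The method identifies disease-associated genes, effectively distinguishes disease-specific TCR genes, and can classify multiple disease categories.
Link: https://arxiv.org/abs/2507.04981
Authors: Ruihao Zhang,Fei Ye,Dandan Meng,Yixuan Huang,Maochen,Xiao Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Genomics (q-bio.GN)
Comments: 7 figures, 4 tables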
Abstract:T cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that leverages TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with exceptional accuracy. By integrating PrimeSeq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. EAMil successfully identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model demonstrated robustness in classifying multiple disease categories, utilizing the SLEDAI score to stratify SLE patients by disease severity as well as to diagnose the site of damage in SLE patients, and effectively controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification with broad potential clinical applications across immune-mediated conditions.
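For readers unfamiliar with attention-based multi-instance learning, the sketch below shows the standard gated attention pooling formulation, in which each instance h_k receives a weight proportional to exp(w · (tanh(V h_k) ⊙ sigmoid(U h_k))). This is an assumption about the general family of mechanisms, not EAMil's "enhanced" variant, and the parameters `V`, `U`, `w` and the toy bag are hand-set for illustration rather than learned.

```python
import math


def gated_attention_pool(instances, V, U, w):
    """Pool instance vectors into one bag vector using gated attention weights."""

    def matvec(M, h):
        return [sum(m * x for m, x in zip(row, h)) for row in M]

    scores = []
    for h in instances:
        t = [math.tanh(v) for v in matvec(V, h)]            # content branch
        g = [1 / (1 + math.exp(-u)) for u in matvec(U, h)]  # gate branch
        scores.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))

    # Softmax over instances (shifted by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]

    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return bag, weights


# Toy bag of three 2-d "sequence embeddings"; the second has the strongest signal.
bag_instances = [[0.1, 0.0], [2.0, 0.0], [0.2, 0.0]]
V = [[1.0, 0.0]]  # hypothetical 1x2 projection
U = [[1.0, 0.0]]  # hypothetical 1x2 gate projection
w = [1.0]
bag_vector, weights = gated_attention_pool(bag_instances, V, U, w)
print(weights)
```

The attention weights are what make such a model interpretable at the instance level: in a repertoire setting, highly weighted instances would be the sequences the classifier relied on most.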
[AI-18] LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning
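[Quick Read]: This paper aims to address the challenge in Singing Voice Synthesis (SVS) of capturing vocal style, genre-specific pitch inflections, and language-dependent characteristics in low-resource scenarios. The key to the solution is LAPS-Diff, a model that combines language-aware embeddings with a vocal-style guided learning mechanism and adds a style encoder and a pitch extraction model to compute style and pitch losses, better capturing the naturalness and expressiveness of synthesized singing. In addition, MERT and IndicWav2Vec models extract musical and contextual embeddings that serve as conditional priors to further refine acoustic feature generation.
Link: https://arxiv.org/abs/2507.04966
Authors: Sandipan Dhar,Mayank Gupta,Preeti Rao
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 10 pages, 5 figures, 3 Tables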
Abstract:The field of Singing Voice Synthesis (SVS) has seen significant advancements in recent years due to the rapid progress of diffusion-based approaches. However, capturing vocal style, genre-specific pitch inflections, and language-dependent characteristics remains challenging, particularly in low-resource scenarios. To address this, we propose LAPS-Diff, a diffusion model integrated with language-aware embeddings and a vocal-style guided learning mechanism, specifically designed for Bollywood Hindi singing style. We curate a Hindi SVS dataset and leverage pre-trained language models to extract word and phone-level embeddings for an enriched lyrics representation. Additionally, we incorporated a style encoder and a pitch extraction model to compute style and pitch losses, capturing features essential to the naturalness and expressiveness of the synthesized singing, particularly in terms of vocal style and pitch variations. Furthermore, we utilize MERT and IndicWav2Vec models to extract musical and contextual embeddings, serving as conditional priors to refine the acoustic feature generation process further. Based on objective and subjective evaluations, we demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model for our constrained dataset that is typical of the low resource scenario.
[AI-19] Object-centric Denoising Diffusion Models for Physical Reasoning
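[Quick Read]: This paper addresses reasoning about the trajectories of multiple interacting objects, a core challenge of physical reasoning tasks in machine learning. Existing approaches rely on autoregressive modeling, which can be conditioned only on initial states and not on states at later time steps. The paper proposes an object-centric denoising diffusion model architecture that is translation equivariant over time, permutation equivariant over objects, and can be conditioned on arbitrary time steps for arbitrary objects. The key to the solution is a physical reasoning model that flexibly handles multiple conditions, varying numbers of objects, and different trajectory lengths.
Link: https://arxiv.org/abs/2507.04920
Authors: Moritz Lange,Raphael C. Engelhardt,Wolfgang Konen,Andrew Melnik,Laurenz Wiskott
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: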
Abstract:Reasoning about the trajectories of multiple, interacting objects is integral to physical reasoning tasks in machine learning. This involves conditions imposed on the objects at different time steps, for instance initial states or desired goal states. Existing approaches in physical reasoning generally rely on autoregressive modeling, which can only be conditioned on initial states, but not on later states. In fields such as planning for reinforcement learning, similar challenges are being addressed with denoising diffusion models. In this work, we propose an object-centric denoising diffusion model architecture for physical reasoning that is translation equivariant over time, permutation equivariant over objects, and can be conditioned on arbitrary time steps for arbitrary objects. We demonstrate how this model can solve tasks with multiple conditions and examine its performance when changing object numbers and trajectory lengths during inference.
[AI-20] Leadership Detection via Time-Lagged Correlation-Based Network Inference
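[Quick Read]: This paper addresses the identification of leader-follower relationships in collective behavior, particularly in animal ecology, swarm robotics, and intelligent transportation. Traditional information-theoretic methods such as Transfer Entropy (TE) and Time-Lagged Mutual Information (TLMI) have significant limitations on noisy or short-duration datasets because they depend on robust probability estimates. The study proposes a dynamic network inference method that uses time-lagged correlations across multiple kinematic variables (velocity, acceleration, and direction) to build directed influence graphs, identifying leadership patterns without large volumes of data or parameter-sensitive discretization. The key to the method is inferring leadership relationships efficiently through time-lagged correlation analysis.
Link: https://arxiv.org/abs/2507.04917
Authors: Thayanne França da Silva,José Everardo Bessa Maia
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: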
Abstract:Understanding leadership dynamics in collective behavior is a key challenge in animal ecology, swarm robotics, and intelligent transportation. Traditional information-theoretic approaches, including Transfer Entropy (TE) and Time-Lagged Mutual Information (TLMI), have been widely used to infer leader-follower relationships but face critical limitations in noisy or short-duration datasets due to their reliance on robust probability estimations. This study proposes a method based on dynamic network inference using time-lagged correlations across multiple kinematic variables: velocity, acceleration, and direction. Our approach constructs directed influence graphs over time, enabling the identification of leadership patterns without the need for large volumes of data or parameter-sensitive discretization. We validate our method through two multi-agent simulations in NetLogo: a modified Vicsek model with informed leaders and a predator-prey model featuring coordinated and independent wolf groups. Experimental results demonstrate that the network-based method outperforms TE and TLMI in scenarios with limited spatiotemporal observations, ranking true leaders at the top of influence metrics more consistently than TE and TLMI.
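The core idea, that an agent whose past correlates with another agent's present is a candidate leader, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' method: it uses a single scalar variable per agent instead of velocity, acceleration, and direction, a single static window instead of a graph that evolves over time, and the `threshold`, `max_lag`, and scoring rule are arbitrary choices.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag], for lag >= 1."""
    return pearson(x[:-lag], y[lag:])


def influence_graph(series, max_lag=5, threshold=0.5):
    """Return directed edges (i, j, lag, corr) where i's past correlates with j's present."""
    edges = []
    for i, xi in enumerate(series):
        for j, xj in enumerate(series):
            if i == j:
                continue
            best_lag, best = max(
                ((lag, lagged_corr(xi, xj, lag)) for lag in range(1, max_lag + 1)),
                key=lambda p: p[1],
            )
            if best >= threshold:
                edges.append((i, j, best_lag, best))
    return edges


def rank_leaders(edges, n_agents):
    """Rank agents by summed outgoing correlation (a simple influence score)."""
    score = [0.0] * n_agents
    for i, _, _, c in edges:
        score[i] += c
    return sorted(range(n_agents), key=lambda a: -score[a])


# Toy data: agent 0 leads; agents 1 and 2 copy its signal 2 and 3 steps later.
random.seed(0)
T = 200
leader = [random.gauss(0, 1) for _ in range(T)]
follower_1 = [0.0, 0.0] + leader[:-2]
follower_2 = [0.0, 0.0, 0.0] + leader[:-3]
edges = influence_graph([leader, follower_1, follower_2])
ranking = rank_leaders(edges, 3)
print(ranking[0], edges)
```

Note that the first follower also gets an outgoing edge to the second, since its signal precedes the second follower's by one step. Ranking by total outgoing influence is one simple way to separate the true leader from such intermediaries.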
[AI-21] BackFed: An Efficient Standardized Benchmark Suite for Backdoor Attacks in Federated Learning NEURIPS’25
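[Quick Read]: This paper addresses the evaluation of backdoor attacks and defenses in Federated Learning (FL) systems, aiming to overcome the inconsistent experimental settings, implementation errors, and unrealistic assumptions that hinder fair comparisons and valid conclusions in existing work. The key to the solution is BackFed, a comprehensive benchmark suite whose multi-processing implementation significantly accelerates experimentation and whose modular design lets new methods be integrated seamlessly through well-defined APIs, providing a standardized evaluation pipeline for effective and reliable assessment of backdoor attacks and defenses.
Link: https://arxiv.org/abs/2507.04903
Authors: Thinh Dao,Dung Thuy Nguyen,Khoa D Doan,Kok-Seng Wong
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Under review at NeurIPS’25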
Abstract:Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in real-world scenarios. To address this, we introduce BackFed - a comprehensive benchmark suite designed to standardize, streamline, and reliably evaluate backdoor attacks and defenses in FL, with a focus on practical constraints. Our benchmark offers key advantages through its multi-processing implementation that significantly accelerates experimentation and the modular design that enables seamless integration of new methods via well-defined APIs. With a standardized evaluation pipeline, we envision BackFed as a plug-and-play environment for researchers to comprehensively and reliably evaluate new attacks and defenses. Using BackFed, we conduct large-scale studies of representative backdoor attacks and defenses across both Computer Vision and Natural Language Processing tasks with diverse model architectures and experimental settings. Our experiments critically assess the performance of proposed attacks and defenses, revealing unknown limitations and modes of failures under practical conditions. These empirical insights provide valuable guidance for the development of new methods and for enhancing the security of FL systems. Our framework is openly available at this https URL.
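As background on why aggregation rules matter in this setting, the toy below contrasts plain federated averaging with coordinate-wise median, a well-known robust aggregation rule. It does not use BackFed's API (which is not described in this digest); the function names and the numeric updates are made up for illustration, and real backdoor updates are designed to be far less conspicuous than the outlier used here.

```python
from statistics import median


def fedavg(updates):
    """Plain federated averaging: the mean of each coordinate across clients."""
    return [sum(col) / len(col) for col in zip(*updates)]


def coordinate_median(updates):
    """Robust alternative: the median of each coordinate across clients."""
    return [median(col) for col in zip(*updates)]


# Three honest clients send small, similar updates; one client sends an outlier.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.21]]
poisoned = [5.0, 5.0]
updates = honest + [poisoned]

avg_result = fedavg(updates)
median_result = coordinate_median(updates)
print(avg_result, median_result)
```

The mean is pulled far from the honest updates by a single client, while the median stays close to them. A benchmark like the one described above exists precisely because such simple intuitions can break down under stealthier attacks and practical constraints.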
[AI-22] Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning
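[Quick Read]: This paper aims to address backdoor attacks on Deep Reinforcement Learning (DRL) systems in safety-critical applications, in particular the limitation that existing research considers only training-time attacks. It proposes two novel backdoor attacks, TrojanentRL and InfrectroRL, which exploit vulnerabilities in the DRL supply chain to implant persistent backdoors with greatly reduced attacker privileges; InfrectroRL requires no access to training, validation, or test data. These attacks expose latent security risks in DRL systems and challenge the effectiveness of existing defenses.
Link: https://arxiv.org/abs/2507.04883
Authors: Sanyam Vyas,Alberto Caron,Chris Hicks,Pete Burnap,Vasilios Mavroudis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: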
Abstract:Deep Reinforcement Learning (DRL) systems are increasingly used in safety-critical applications, yet their security remains severely underexplored. This work investigates backdoor attacks, which implant hidden triggers that cause malicious actions only when specific inputs appear in the observation space. Existing DRL backdoor research focuses solely on training-time attacks requiring unrealistic access to the training pipeline. In contrast, we reveal critical vulnerabilities across the DRL supply chain where backdoors can be embedded with significantly reduced adversarial privileges. We introduce two novel attacks: (1) TrojanentRL, which exploits component-level flaws to implant a persistent backdoor that survives full model retraining; and (2) InfrectroRL, a post-training backdoor attack which requires no access to training, validation, nor test data. Empirical and analytical evaluations across six Atari environments show our attacks rival state-of-the-art training-time backdoor attacks while operating under much stricter adversarial constraints. We also demonstrate that InfrectroRL further evades two leading DRL backdoor defenses. These findings challenge the current research focus and highlight the urgent need for robust defenses.
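[AI-23] DoPI: Doctor-like Proactive Interrogation LLM for Traditional Chinese Medicine
[Quick Read]: This paper tackles the weak integration of multi-turn dialogue and knowledge graphs in Traditional Chinese Medicine (TCM) diagnosis, together with the limitations of current large language models (LLMs) in conducting effective multi-turn dialogue and proactive questioning in medical applications. The key to the solution is DoPI, a novel LLM system with a collaborative architecture comprising a guidance model and an expert model. The guidance model dynamically generates questions from a knowledge graph to extract key symptom information efficiently, while the expert model draws on deep TCM expertise to provide the final diagnosis and treatment plan.
Link: https://arxiv.org/abs/2507.04877
Authors: Zewen Sun,Ruoxiang Huang,Jiahe Feng,Rundong Kong,Yuqian Wang,Hengyu Liu,Ziqi Gong,Yuyuan Qin,Yingxue Wang,Yu Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: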
Abstract:Enhancing interrogation capabilities in Traditional Chinese Medicine (TCM) diagnosis through multi-turn dialogues and knowledge graphs presents a significant challenge for modern AI systems. Current large language models (LLMs), despite their advancements, exhibit notable limitations in medical applications, particularly in conducting effective multi-turn dialogues and proactive questioning. These shortcomings hinder their practical application and effectiveness in simulating real-world diagnostic scenarios. To address these limitations, we propose DoPI, a novel LLM system specifically designed for the TCM domain. The DoPI system introduces a collaborative architecture comprising a guidance model and an expert model. The guidance model conducts multi-turn dialogues with patients and dynamically generates questions based on a knowledge graph to efficiently extract critical symptom information. Simultaneously, the expert model leverages deep TCM expertise to provide final diagnoses and treatment plans. Furthermore, this study constructs a multi-turn doctor-patient dialogue dataset to simulate realistic consultation scenarios and proposes a novel evaluation methodology that does not rely on manually collected real-world consultation data. Experimental results show that the DoPI system achieves an accuracy rate of 84.68 percent in interrogation outcomes, significantly enhancing the model’s communication ability during diagnosis while maintaining professional expertise.
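[AI-24] Towards Human-in-the-Loop Onset Detection: A Transfer Learning Approach for Maracatu
[Quick Read]: This paper addresses the poor performance of conventional models on onset detection in the Afro-Brazilian Maracatu tradition, whose complex rhythmic patterns challenge them. The key to the solution is a transfer learning strategy: pre-trained Temporal Convolutional Network architectures are fine-tuned layer-wise to each instrument's characteristics, using only 5-second annotated snippets per instrument. The method reaches F1 scores of up to 0.998 in the intra-task setting and improves on the baseline by more than 50 percentage points in the best cases. Cross-task adaptation proves particularly effective for time-keeping instruments.
Link: https://arxiv.org/abs/2507.04858
Authors: António Sá Pinto
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025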
Abstract:We explore transfer learning strategies for musical onset detection in the Afro-Brazilian Maracatu tradition, which features complex rhythmic patterns that challenge conventional models. We adapt two Temporal Convolutional Network architectures: one pre-trained for onset detection (intra-task) and another for beat tracking (inter-task). Using only 5-second annotated snippets per instrument, we fine-tune these models through layer-wise retraining strategies for five traditional percussion instruments. Our results demonstrate significant improvements over baseline performance, with F1 scores reaching up to 0.998 in the intra-task setting and improvements of over 50 percentage points in best-case scenarios. The cross-task adaptation proves particularly effective for time-keeping instruments, where onsets naturally align with beat positions. The optimal fine-tuning configuration varies by instrument, highlighting the importance of instrument-specific adaptation strategies. This approach addresses the challenges of underrepresented musical traditions, offering an efficient human-in-the-loop methodology that minimizes annotation effort while maximizing performance. Our findings contribute to more inclusive music information retrieval tools applicable beyond Western musical contexts.
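The F1 scores quoted above depend on how detected onsets are matched to annotated ones. The abstract does not state the evaluation protocol, so the following is only a minimal sketch under two assumptions: a tolerance window of ±50 ms and greedy one-to-one matching over sorted onset times. The function name `onset_f1` is hypothetical.

```python
def onset_f1(reference, detected, tolerance=0.05):
    """F1 score for onset detection.

    A detection counts as a hit if it lies within `tolerance` seconds of a
    still-unmatched reference onset (one-to-one matching, assumed protocol).
    """
    ref = sorted(reference)
    det = sorted(detected)
    i = j = hits = 0
    while i < len(ref) and j < len(det):
        if abs(ref[i] - det[j]) <= tolerance:
            hits += 1          # matched pair: consume both
            i += 1
            j += 1
        elif det[j] < ref[i]:
            j += 1             # spurious detection (false positive)
        else:
            i += 1             # missed reference onset (false negative)
    if hits == 0:
        return 0.0
    precision = hits / len(det)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With four annotated onsets and four detections of which three fall inside the window, precision and recall are both 3/4, so the score is 0.75.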
[AI-25] Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters
[Quick Read]: This paper addresses the challenge of precisely controlling speech characteristics (such as fundamental frequency, duration, and speech rate) in voice conversion. Such goals have traditionally been pursued with vocoder-based methods; this paper instead proposes a convolutional neural network (CNN) approach. Its key idea is to condition the model explicitly on fundamental frequency (F0), phoneme sequence, intensity, and speaker identity to generate mel spectrograms, which a universal neural vocoder then converts into waveforms. At inference time, users can freely adjust F0 contours, phoneme sequences, and speaker embeddings, which enables intuitive, controllable voice transformation.
Link: https://arxiv.org/abs/2507.04817
Authors: Mathilde Abrassart,Nicolas Obin,Axel Roebel
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 8 pages, 4 figures
Abstract:Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important element for effective identity conversion, but can also be used independently for voice transformation, achieving goals that were historically addressed by vocoder-based methods. In this work, we explore a convolutional neural network-based approach that aims to provide means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. Rather than relying on disentanglement techniques, our model is explicitly conditioned on these factors to generate mel spectrograms, which are then converted into waveforms using a universal neural vocoder. Accordingly, during inference, F0 contours, phoneme sequences, and speaker embeddings can be freely adjusted, allowing for intuitively controlled voice transformations. We evaluate our approach on speaker conversion and expressive speech tasks using both perceptual and objective metrics. The results suggest that the proposed method offers substantial flexibility, while maintaining high intelligibility and speaker similarity.
[AI-26] Application and Evaluation of Large Language Models for Forecasting the Impact of Traffic Incidents ITSC
[Quick Read]: This paper addresses the problem of forecasting the impact of traffic incidents on traffic flow. The key to the solution is a fully LLM-based method that predicts incident impact from a combination of traffic features and LLM-extracted incident features. A key ingredient is an effective method for selecting examples for the LLM's in-context learning.
Link: https://arxiv.org/abs/2507.04803
Authors: George Jagadeesh,Srikrishna Iyer,Michal Polanowski,Kai Xin Thia
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for publication at the 2025 IEEE 28th International Conference on Intelligent Transportation Systems (ITSC), Gold Coast, Australia, 2025. Copyright IEEE
Abstract:This study examines the feasibility of applying large language models (LLMs) for forecasting the impact of traffic incidents on the traffic flow. The use of LLMs for this task has several advantages over existing machine learning-based solutions such as not requiring a large training dataset and the ability to utilize free-text incident logs. We propose a fully LLM-based solution that predicts the incident impact using a combination of traffic features and LLM-extracted incident features. A key ingredient of this solution is an effective method of selecting examples for the LLM’s in-context learning. We evaluate the performance of three advanced LLMs and two state-of-the-art machine learning models on a real traffic incident dataset. The results show that the best-performing LLM matches the accuracy of the most accurate machine learning model, despite the former not having been trained on this prediction task. The findings indicate that LLMs are a practically viable option for traffic incident impact prediction.
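[AI-27] Large Language Models for Network Intrusion Detection Systems: Foundations, Implementations, and Future Directions
[Quick Read]: This paper addresses the shortcomings of traditional Network Intrusion Detection Systems (NIDS) in contextual awareness, explainability, and automated response. The key to the solution is to bring in generative AI and Large Language Models (LLMs): by integrating LLM capabilities, a NIDS can perform deep contextual reasoning over structured and unstructured security data, make explainable decisions, and respond automatically to intrusion behaviors, yielding a more intelligent, adaptive, and explainable AI-driven NIDS.
Link: https://arxiv.org/abs/2507.04752
Authors: Shuo Yang,Xinran Zheng,Xinchen Zhang,Jinfeng Xu,Jinze Li,Donglin Xie,Weicai Long,Edith C.H. Ngai
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: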
Abstract:Large Language Models (LLMs) have revolutionized various fields with their exceptional capabilities in understanding, processing, and generating human-like text. This paper investigates the potential of LLMs in advancing Network Intrusion Detection Systems (NIDS), analyzing current challenges, methodologies, and future opportunities. It begins by establishing a foundational understanding of NIDS and LLMs, exploring the enabling technologies that bridge the gap between intelligent and cognitive systems in AI-driven NIDS. While Intelligent NIDS leverage machine learning and deep learning to detect threats based on learned patterns, they often lack contextual awareness and explainability. In contrast, Cognitive NIDS integrate LLMs to process both structured and unstructured security data, enabling deeper contextual reasoning, explainable decision-making, and automated response for intrusion behaviors. Practical implementations are then detailed, highlighting LLMs as processors, detectors, and explainers within a comprehensive AI-driven NIDS pipeline. Furthermore, the concept of an LLM-centered Controller is proposed, emphasizing its potential to coordinate intrusion detection workflows, optimizing tool collaboration and system performance. Finally, this paper identifies critical challenges and opportunities, aiming to foster innovation in developing reliable, adaptive, and explainable NIDS. By presenting the transformative potential of LLMs, this paper seeks to inspire advancement in next-generation network security systems.
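[AI-28] LLM-based Question-Answer Framework for Sensor-driven HVAC System Interaction
[Quick Read]: This paper addresses how Large Language Models (LLMs) can support accurate, real-time, and context-aware interaction with heating, ventilation, and air conditioning (HVAC) systems, particularly for non-expert users. The key solution is JARVIS, a two-stage LLM-driven question-answering (QA) framework. It comprises an Expert-LLM that translates high-level user queries into structured execution instructions, and an Agent that handles SQL data retrieval, statistical processing, and final response generation. JARVIS also tackles HVAC-specific challenges with an adaptive context injection strategy, a parameterized SQL builder and executor, and a bottom-up planning scheme.
Link: https://arxiv.org/abs/2507.04748
Authors: Sungmin Lee,Minju Kang,Joonhee Lee,Seungyong Lee,Dongju Kim,Jingi Hong,Jun Shin,Pei Zhang,JeongGil Ko
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: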
Abstract:Question-answering (QA) interfaces powered by large language models (LLMs) present a promising direction for improving interactivity with HVAC system insights, particularly for non-expert users. However, enabling accurate, real-time, and context-aware interactions with HVAC systems introduces unique challenges, including the integration of frequently updated sensor data, domain-specific knowledge grounding, and coherent multi-stage reasoning. In this paper, we present JARVIS, a two-stage LLM-based QA framework tailored for sensor data-driven HVAC system interaction. JARVIS employs an Expert-LLM to translate high-level user queries into structured execution instructions, and an Agent that performs SQL-based data retrieval, statistical processing, and final response generation. To address HVAC-specific challenges, JARVIS integrates (1) an adaptive context injection strategy for efficient HVAC and deployment-specific information integration, (2) a parameterized SQL builder and executor to improve data access reliability, and (3) a bottom-up planning scheme to ensure consistency across multi-stage response generation. We evaluate JARVIS using real-world data collected from a commercial HVAC system and a ground truth QA dataset curated by HVAC experts to demonstrate its effectiveness in delivering accurate and interpretable responses across diverse queries. Results show that JARVIS consistently outperforms baseline and ablation variants in both automated and user-centered assessments, achieving high response quality and accuracy.
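To illustrate why a parameterized SQL builder can make data access more reliable than free-form SQL generation, here is a minimal sketch using Python's standard `sqlite3` module. It is not JARVIS's implementation: the `readings` table, its columns, the allow-list, and the helper names `build_avg_query` and `run_query` are all hypothetical.

```python
import sqlite3

# Hypothetical sensor columns the builder is allowed to aggregate.
ALLOWED_COLUMNS = {"temperature", "humidity"}

def build_avg_query(column, zone, start, end):
    """Return (sql, params); values are bound as parameters, never inlined."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"unsupported column: {column}")
    # The column name comes from the allow-list; everything else uses '?'.
    sql = (f"SELECT AVG({column}) FROM readings "
           "WHERE zone = ? AND ts >= ? AND ts < ?")
    return sql, (zone, start, end)

def run_query(conn, sql, params):
    """Execute a built query and return the single aggregate value."""
    return conn.execute(sql, params).fetchone()[0]

# Toy in-memory database standing in for the sensor store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (zone TEXT, ts INTEGER, temperature REAL, humidity REAL)"
)
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", [
    ("lobby", 1, 21.0, 40.0),
    ("lobby", 2, 23.0, 42.0),
    ("office", 1, 25.0, 35.0),
])
sql, params = build_avg_query("temperature", "lobby", 0, 10)
avg_lobby_temp = run_query(conn, sql, params)
```

In this design the language model only has to supply a small set of structured arguments (column, zone, time range), while the query text itself stays fixed and validated.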
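[AI-29] Activation Steering for Chain-of-Thought Compression
[Quick Read]: This paper addresses the verbosity of the chains of thought (CoTs) that Large Language Models (LLMs) generate, which wastes context, increases latency, and raises energy consumption. The key to the solution is Activation-Steered Compression (ASC), a reasoning-compression method that requires no retraining. ASC extracts and injects a "steering vector" to switch between reasoning styles in the model's residual-stream activation space, compressing CoTs efficiently while preserving accuracy.
Link: https://arxiv.org/abs/2507.04742
Authors: Seyedarmin Azizi,Erfan Baghaei Potraghloo,Massoud Pedram
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: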
Abstract:Large language models (LLMs) excel at complex reasoning when they include intermediate steps, known as “chains of thought” (CoTs). However, these rationales are often overly verbose, even for simple problems, leading to wasted context, increased latency, and higher energy consumption. We observe that verbose, English-heavy CoTs and concise, math-centric CoTs occupy distinct regions in the model’s residual-stream activation space. By extracting and injecting a “steering vector” to transition between these modes, we can reliably shift generation toward more concise reasoning, effectively compressing CoTs without retraining. We formalize this approach as Activation-Steered Compression (ASC), an inference-time technique that shortens reasoning traces by directly modifying hidden representations. In addition, we provide a theoretical analysis of the impact of ASC on the output distribution, derived from a closed-form KL-divergence-bounded constraint to regulate steering strength. Using only 100 paired verbose and concise examples, ASC achieves up to 67.43% reduction in CoT length on MATH500 and GSM8K datasets, while maintaining accuracy across 7B, 8B, and 32B parameter models. As a training-free method, ASC introduces negligible runtime overhead and, on MATH500, delivers an average 2.73x speedup in end-to-end reasoning wall-clock time on an 8B model. This makes ASC a practical and efficient tool for streamlining the deployment of reasoning-capable LLMs in latency- or cost-sensitive settings. The code is available at: this https URL
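The extract-and-inject idea can be sketched on plain Python lists. This is a toy illustration, not the paper's code: it assumes the steering vector is the difference between the mean activations of concise and verbose examples (a common construction that the abstract does not confirm), and it uses a fixed `strength` where the paper derives a KL-divergence-bounded one. All function names are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(concise_acts, verbose_acts):
    """Direction pointing from the verbose mode to the concise mode (assumed
    difference-of-means construction)."""
    concise_mean = mean_vector(concise_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]

def steer(hidden, vector, strength):
    """Shift one hidden state along the steering direction at inference time."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

For example, with concise activations `[[1.0, 0.0], [3.0, 2.0]]` and verbose activations `[[0.0, 0.0], [2.0, 0.0]]`, the vector is `[1.0, 1.0]`, and steering `[0.5, 0.5]` with strength 2.0 gives `[2.5, 2.5]`. In a real model the hidden state would be a residual-stream activation at a chosen layer.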
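[AI-30] ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning
[Quick Read]: This paper addresses the inability of current Large Language Models (LLMs) to optimize functional correctness and hardware quality (Power, Performance, Area; PPA) simultaneously when automating Register-Transfer Level (RTL) code generation. Existing methods either use supervised fine-tuning, which produces functionally correct but PPA-suboptimal code and offers no mechanism for learning optimization principles, or rely on post-processing to improve PPA metrics, which is inefficient because it does not update model parameters. The key to the solution is ChipSeek-R1, a hierarchical reward-driven reinforcement learning framework. By incorporating direct feedback on syntax, functional correctness (from simulators), and PPA metrics (from synthesis tools) during reinforcement learning, it lets the model learn complex hardware design trade-offs through trial and error and generate RTL code that is both functionally correct and PPA-optimized.
Link: https://arxiv.org/abs/2507.04736
Authors: Zhirong Chen,Kaiyan Chang,Zhuolin Li,Xinyang He,Chujie Chen,Cangyuan Li,Mengdi Wang,Haobo Xu,Yinhe Han,Ying Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
Comments: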
Abstract:Large Language Models (LLMs) show significant potential for automating Register-Transfer Level (RTL) code generation. However, current approaches face a critical challenge: they can not simultaneously optimize for functional correctness and hardware quality (Power, Performance, Area - PPA). Methods based on supervised fine-tuning often generate functionally correct but PPA-suboptimal code, lacking mechanisms to learn optimization principles. In contrast, post-processing techniques that attempt to improve PPA metrics after generation are often inefficient because they operate externally without updating the LLM’s parameters, thus failing to enhance the model’s intrinsic design capabilities. To bridge this gap, we introduce ChipSeek-R1, a hierarchical reward-driven reinforcement learning framework to train LLMs to generate RTL code that achieves both functional correctness and optimized PPA metrics. ChipSeek-R1 employs a hierarchical reward system, which incorporates direct feedback on syntax, functional correctness (from simulators) and PPA metrics (from synthesis tools) during reinforcement learning. This enables the model to learn complex hardware design trade-offs via trial-and-error, generating RTL code that is both functionally correct and PPA-optimized. Evaluating ChipSeek-R1 on standard benchmarks (VerilogEval, RTLLM), we achieve state-of-the-art results in functional correctness. Notably, on the RTLLM benchmark, ChipSeek-R1 generated 27 RTL designs surpassing the PPA metrics of the original human-written code. Our findings demonstrate the effectiveness of integrating toolchain feedback into LLM training and highlight the potential for reinforcement learning to enable automated generation of human-surpassing RTL code. We open-source our code in anonymous github. 
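One way to read "hierarchical reward" is as a gated score in which each level only counts once the previous one is satisfied. The sketch below follows that reading. The gating order mirrors the abstract (syntax, then functional correctness, then PPA), but the numeric values, the `ppa_gain` normalization, and the function name are assumptions rather than the paper's definition.

```python
def hierarchical_reward(syntax_ok, tests_passed, tests_total, ppa_gain):
    """Gated reward for one generated RTL design (hypothetical scoring).

    syntax_ok    -- whether the design compiles
    tests_passed -- simulator test cases passed (tests_total must be > 0)
    ppa_gain     -- relative PPA improvement over a reference design,
                    assumed to be normalized so that 1.0 is the cap
    """
    if not syntax_ok:
        return -1.0                    # level 1: code that does not compile
    pass_rate = tests_passed / tests_total
    if pass_rate < 1.0:
        return pass_rate               # level 2: partial functional credit
    # Level 3: PPA bonus only for fully correct designs, clipped to [0, 1].
    return 1.0 + max(0.0, min(ppa_gain, 1.0))
```

Under this scoring a design that fails syntax gets -1.0, one that passes half its tests gets 0.5, and a fully correct design with a 25% PPA gain gets 1.25, so the policy cannot trade correctness for PPA.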
[AI-31] Who's the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems
[Quick Read]: This paper addresses security risks in the communication and coordination of LLM-based multi-agent systems (LLM-MAS), in particular intention-hiding threats. The key to the solution is the design of four highly concealed attack paradigms and a psychology-based detection framework, AgentXposed, which combines the HEXACO personality model with the Reid Technique and identifies malicious behavior through progressive questionnaire inquiries and behavior monitoring.
Link: https://arxiv.org/abs/2507.04724
Authors: Yizhe Xie,Congcong Zhu,Xinyue Zhang,Minghao Wang,Chi Liu,Minglu Zhu,Tianqing Zhu
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multi-agent systems powered by Large Language Models (LLM-MAS) demonstrate remarkable capabilities in collaborative problem-solving. While LLM-MAS exhibit strong collaborative abilities, the security risks in their communication and coordination remain underexplored. We bridge this gap by systematically investigating intention-hiding threats in LLM-MAS, and design four representative attack paradigms that subtly disrupt task completion while maintaining high concealment. These attacks are evaluated in centralized, decentralized, and layered communication structures. Experiments conducted on six benchmark datasets, including MMLU, MMLU-Pro, HumanEval, GSM8K, arithmetic, and biographies, demonstrate that they exhibit strong disruptive capabilities. To identify these threats, we propose a psychology-based detection framework AgentXposed, which combines the HEXACO personality model with the Reid Technique, using progressive questionnaire inquiries and behavior-based monitoring. Experiments conducted on six types of attacks show that our detection framework effectively identifies all types of malicious behaviors. The detection rate for our intention-hiding attacks is slightly lower than that of the two baselines, Incorrect Fact Injection and Dark Traits Injection, demonstrating the effectiveness of intention concealment. Our findings reveal the structural and behavioral risks posed by intention-hiding attacks and offer valuable insights into securing LLM-based multi-agent systems through psychological perspectives, which contributes to a deeper understanding of multi-agent safety. The code and data are available at this https URL.
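[AI-32] LumiCRS: Asymmetric Contrastive Prototype Learning for Long-Tail Conversational Movie Recommendation
[Quick Read]: This paper addresses recommendation bias in Conversational Recommender Systems (CRSs) caused by the extreme long-tail distribution of dialogue data, which makes systems over-fit to frequent head items, sacrifices diversity, and aggravates the cold-start problem. The key to the solution is the LumiCRS framework, which mitigates long-tail imbalance through three mutually reinforcing layers: (i) an Adaptive Comprehensive Focal Loss (ACFL) that dynamically adjusts class weights and focusing factors to curb head over-fitting and reduce popularity bias; (ii) Prototype Learning for Long-Tail Recommendation, which selects semantic, affective, and contextual prototypes to guide clustering and stabilize body and tail representations; and (iii) a GPT-4o-driven, prototype-guided dialogue augmentation module that automatically generates diverse long-tail conversational snippets to alleviate tail sparsity and distribution shift.
Link: https://arxiv.org/abs/2507.04722
Authors: Jinzhi Wang,Bin Li,Qingke Peng,Haozhou Li,Zeyuan Zeng,Ruimeng Li,Biyi Zhou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: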
Abstract:Conversational recommender systems (CRSs) often suffer from an extreme long-tail distribution of dialogue data, causing a strong bias toward head-frequency blockbusters that sacrifices diversity and exacerbates the cold-start problem. An empirical analysis of DCRS and statistics on the REDIAL corpus show that only 10% of head movies account for nearly half of all mentions, whereas about 70% of tail movies receive merely 26% of the attention. This imbalance gives rise to three critical challenges: head over-fitting, body representation drift, and tail sparsity. To address these issues, we propose LumiCRS, an end-to-end framework that mitigates long-tail imbalance through three mutually reinforcing layers: (i) an Adaptive Comprehensive Focal Loss (ACFL) that dynamically adjusts class weights and focusing factors to curb head over-fitting and reduce popularity bias; (ii) Prototype Learning for Long-Tail Recommendation, which selects semantic, affective, and contextual prototypes to guide clustering and stabilize body and tail representations; and (iii) a GPT-4o-driven prototype-guided dialogue augmentation module that automatically generates diverse long-tail conversational snippets to alleviate tail sparsity and distribution shift. Together, these strategies enable LumiCRS to markedly improve recommendation accuracy, diversity, and fairness: on the REDIAL and INSPIRED benchmarks, LumiCRS boosts Recall@10 and Tail-Recall@10 by 7-15% over fifteen strong baselines, while human evaluations confirm superior fluency, informativeness, and long-tail relevance. These results demonstrate the effectiveness of multi-layer collaboration in building an efficient and fair long-tail conversational recommender.
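The abstract does not say how ACFL adapts its class weights and focusing factors, so the sketch below shows only the standard (non-adaptive) focal loss that such a loss builds on: cross-entropy scaled by a class weight and by `(1 - p) ** gamma`, which down-weights examples the model already predicts confidently. The parameter names and default values are assumptions.

```python
import math

def focal_loss(p_true, class_weight=1.0, gamma=2.0):
    """Standard focal loss for a single example.

    p_true       -- predicted probability of the correct class, in (0, 1]
    class_weight -- per-class weight (larger for rare tail items)
    gamma        -- focusing factor; 0 recovers weighted cross-entropy
    """
    return -class_weight * (1.0 - p_true) ** gamma * math.log(p_true)
```

A confidently predicted head item (say `p_true = 0.9`) then contributes far less loss than an uncertain tail item (`p_true = 0.5`), and raising `class_weight` for tail classes pushes the balance further toward the tail.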
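[AI-33] Advocate for Complete Benchmarks for Formal Reasoning with Formal/Informal Statements and Formal/Informal Proofs
[Quick Read]: This paper addresses problems in current benchmarking and evaluation practices in formal reasoning and automated theorem proving that may be impeding progress. It argues that open code, open data, and complete, error-free benchmarks will accelerate the field. The key to the solution is to remove practices that create barriers to contribution and to avoid producing misleading evaluative information, thereby fostering exchange and collaboration among the communities working on automated theorem proving, autoformalization, and informal reasoning.
Link: https://arxiv.org/abs/2507.04719
Authors: Roozbeh Yousefzadeh,Xuenan Cao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: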
Abstract:This position paper provides a critical but constructive discussion of current practices in benchmarking and evaluative practices in the field of formal reasoning and automated theorem proving. We take the position that open code, open data, and benchmarks that are complete and error-free will accelerate progress in this field. We identify practices that create barriers to contributing to this field and suggest ways to remove them. We also discuss some of the practices that might produce misleading evaluative information. We aim to create discussions that bring together people from various groups contributing to automated theorem proving, autoformalization, and informal reasoning.
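[AI-34] UrbanMind: Towards Urban General Intelligence via Tool-Enhanced Retrieval-Augmented Generation and Multilevel Optimization
[Quick Read]: This paper addresses how AI systems can autonomously perceive, reason, and act in dynamic, complex urban environments, that is, how to build Urban General Intelligence (UGI). The key to the solution is an architecture based on a Continual Retrieval-Augmented MoE-based LLM (C-RAG-LLM), which dynamically incorporates domain knowledge and evolving urban data to support long-term adaptability, and whose layers can be optimized jointly or independently within a multilevel optimization framework.
Link: https://arxiv.org/abs/2507.04706
Authors: Kai Yang,Zelin Zhu,Chengtao Jian,Hui Ma,Shengjie Zhao,Xiaozhou Ye,Ye Ouyang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: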
Abstract:Urban general intelligence (UGI) refers to the capacity of AI systems to autonomously perceive, reason, and act within dynamic and complex urban environments. In this paper, we introduce UrbanMind, a tool-enhanced retrieval-augmented generation (RAG) framework designed to facilitate UGI. Central to UrbanMind is a novel architecture based on Continual Retrieval-Augmented MoE-based LLM (C-RAG-LLM), which dynamically incorporates domain-specific knowledge and evolving urban data to support long-term adaptability. The architecture of C-RAG-LLM aligns naturally with a multilevel optimization framework, where different layers are treated as interdependent sub-problems. Each layer has distinct objectives and can be optimized either independently or jointly through a hierarchical learning process. The framework is highly flexible, supporting both end-to-end training and partial layer-wise optimization based on resource or deployment constraints. To remain adaptive under data drift, it is further integrated with an incremental corpus updating mechanism. Evaluations on real-world urban tasks of a variety of complexity verify the effectiveness of the proposed framework. This work presents a promising step toward the realization of general-purpose LLM agents in future urban environments.
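[AI-35] Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message
[Quick Read]: This paper addresses a security vulnerability that conversational AI systems incur by relying on dialogue history: an attacker can forge the model's own past utterances to bypass safety mechanisms and elicit harmful content. The paper presents Trojan Horse Prompting, a new jailbreak technique that exploits the model's implicit trust in its own conversational history. A malicious payload is injected into a model-attributed message and followed by a benign user prompt that triggers harmful content generation, exposing a fundamental flaw in current conversational AI safety mechanisms.
Link: https://arxiv.org/abs/2507.04673
Authors: Wei Duan,Li Qian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: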
Abstract:The rise of conversational interfaces has greatly enhanced LLM usability by leveraging dialogue history for sophisticated reasoning. However, this reliance introduces an unexplored attack surface. This paper introduces Trojan Horse Prompting, a novel jailbreak technique. Adversaries bypass safety mechanisms by forging the model’s own past utterances within the conversational history provided to its API. A malicious payload is injected into a model-attributed message, followed by a benign user prompt to trigger harmful content generation. This vulnerability stems from Asymmetric Safety Alignment: models are extensively trained to refuse harmful user requests but lack comparable skepticism towards their own purported conversational history. This implicit trust in its “past” creates a high-impact vulnerability. Experimental validation on Google’s Gemini-2.0-flash-preview-image-generation shows Trojan Horse Prompting achieves a significantly higher Attack Success Rate (ASR) than established user-turn jailbreaking methods. These findings reveal a fundamental flaw in modern conversational AI security, necessitating a paradigm shift from input-level filtering to robust, protocol-level validation of conversational context integrity.
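The abstract closes by calling for protocol-level validation of conversational context integrity without saying how to do it. One possible defensive reading, sketched here as an assumption rather than the authors' proposal, is for the serving side to attach an HMAC tag to every assistant turn it actually produced and to reject histories whose assistant turns fail verification. The key, message layout, and helper names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical server-side secret; a real deployment would manage and rotate it.
SERVER_KEY = b"demo-key-for-illustration-only"

def sign_turn(conversation_id, index, content, key=SERVER_KEY):
    """Tag binding an assistant message to its conversation and position."""
    message = f"{conversation_id}|{index}|".encode() + content.encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def history_is_authentic(conversation_id, history, key=SERVER_KEY):
    """Return True only if every assistant turn carries a valid tag."""
    for index, turn in enumerate(history):
        if turn["role"] != "assistant":
            continue  # user turns are untrusted input and filtered elsewhere
        expected = sign_turn(conversation_id, index, turn["content"], key)
        if not hmac.compare_digest(expected, turn.get("tag", "")):
            return False
    return True
```

A client that replays a genuine history passes the check, while one that rewrites or fabricates an assistant turn cannot produce a matching tag without the server key.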
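[AI-36] Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
[Quick Read]: This paper addresses the high computational cost of reinforcement learning (RL) fine-tuning of Large Language Models (LLMs), which stems from frequent prompt evaluations and policy updates. The key to the solution is Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework for online prompt selection. MoPPS approximates prompt difficulty through streaming Bayesian inference, enabling efficient, adaptive prompt selection without costly LLM interactions.
Link: https://arxiv.org/abs/2507.04632
Authors: Yun Qu,Qi Cheems Wang,Yixiu Mao,Vincent Tao Hu,Xiangyang Ji
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: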
Abstract:Recent advances have witnessed the effectiveness of reinforcement learning (RL) finetuning in enhancing the reasoning capabilities of large language models (LLMs). The optimization process often requires numerous iterations to achieve satisfactory performance, resulting in high computational costs due to the need for frequent prompt evaluations under intensive LLM interactions and repeated policy updates. Appropriate online prompt selection methods reduce iteration steps by prioritizing informative prompts during training, while the pipeline’s reliance on exhaustive prompt evaluation and subset selection for optimization still incurs substantial computational overhead due to frequent LLM inference calls. Distinguished from these direct evaluate-then-select schemes, this work investigates iterative approximate evaluation for arbitrary prompts and introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework that online estimates prompt difficulty without requiring costly LLM interactions. Technically, MoPPS models each prompt’s success rate as a latent variable, performs streaming Bayesian inference, and employs posterior sampling in a constructed multi-armed bandit machine, enabling sample efficient and adaptive prompt selection. Extensive experiments across mathematics, planning, and vision-based geometry tasks show that MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced LLM rollouts.
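The combination of a latent success rate, streaming Bayesian updates, and posterior sampling can be sketched as a Beta-Bernoulli bandit. This is a toy illustration under stated assumptions, not the MoPPS implementation: the uniform Beta(1, 1) prior, the rule of preferring prompts whose sampled success rate is closest to a target of 0.5, and the class name `PromptDifficultyBandit` are all hypothetical.

```python
import random

class PromptDifficultyBandit:
    """Beta-Bernoulli model of each prompt's success rate (assumed form)."""

    def __init__(self, num_prompts, target=0.5, seed=0):
        self.alpha = [1.0] * num_prompts   # prior pseudo-count of successes
        self.beta = [1.0] * num_prompts    # prior pseudo-count of failures
        self.target = target               # assumed "most informative" rate
        self.rng = random.Random(seed)

    def select(self, k):
        """Posterior sampling: draw one success rate per prompt, then keep
        the k prompts whose draw is closest to the target. No LLM call."""
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)),
                        key=lambda i: abs(draws[i] - self.target))
        return ranked[:k]

    def update(self, prompt, successes, failures):
        """Streaming update from the rollouts actually run on this prompt."""
        self.alpha[prompt] += successes
        self.beta[prompt] += failures

    def posterior_mean(self, prompt):
        """Current estimate of the prompt's success rate."""
        return self.alpha[prompt] / (self.alpha[prompt] + self.beta[prompt])
```

In a training loop, `select` would pick the batch, the policy would be rolled out only on that batch, and `update` would fold the observed successes and failures back into the posterior, so unselected prompts cost no inference.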
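[AI-37] Hierarchical Intent-guided Optimization with Pluggable LLM-Driven Semantics for Session-based Recommendation
[Quick Read]: This paper addresses three limitations of existing Session-based Recommendation (SBR) models: they focus on single-session information, ignore inter-session relationships and potentially valuable cross-session information, and rely on item ID co-occurrence while overlooking rich semantic detail. The key to the solution is HIPHOP, a hierarchical intent-guided optimization approach. It uses a pluggable Large Language Model (LLM) semantic learning module to generate high-quality semantic representations, models item transition relationships with graph neural networks (GNNs), and adds a dynamic multi-intent capturing module to handle users' diverse interests within a session. It further includes a hierarchical inter-session similarity learning module, guided by user intent, that captures global and local session relationships; an intent-guided denoising strategy that reduces the effect of noise; and contrastive learning that sharpens the discriminative power of session representations.
Link: https://arxiv.org/abs/2507.04623
Authors: Jinpeng Chen,Jianxiang He,Huan Li,Senzhang Wang,Yuan Cao,Kaimin Wei,Zhenye Yang,Ye Ji
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: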
Abstract:Session-based Recommendation (SBR) aims to predict the next item a user will likely engage with, using their interaction sequence within an anonymous session. Existing SBR models often focus only on single-session information, ignoring inter-session relationships and valuable cross-session insights. Some methods try to include inter-session data but struggle with noise and irrelevant information, reducing performance. Additionally, most models rely on item ID co-occurrence and overlook rich semantic details, limiting their ability to capture fine-grained item features. To address these challenges, we propose a novel hierarchical intent-guided optimization approach with pluggable LLM-driven semantic learning for session-based recommendations, called HIPHOP. First, we introduce a pluggable embedding module based on large language models (LLMs) to generate high-quality semantic representations, enhancing item embeddings. Second, HIPHOP utilizes graph neural networks (GNNs) to model item transition relationships and incorporates a dynamic multi-intent capturing module to address users’ diverse interests within a session. Additionally, we design a hierarchical inter-session similarity learning module, guided by user intent, to capture global and local session relationships, effectively exploring users’ long-term and short-term interests. To mitigate noise, an intent-guided denoising strategy is applied during inter-session learning. Finally, we enhance the model’s discriminative capability by using contrastive learning to optimize session representations. Experiments on multiple datasets show that HIPHOP significantly outperforms existing methods, demonstrating its effectiveness in improving recommendation quality. Our code is available: this https URL.
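[AI-38] Multimodal LLM Integrated Semantic Communications for 6G Immersive Experiences
[Quick Read]: This paper addresses the challenges of high-dimensional multimodal data transmission and real-time intelligent data processing in 6G networks, in particular achieving a joint understanding of environment, context, and user intent so that resource-limited wireless systems can deliver task-relevant content effectively. The key to the solution is MLLM-SC, a semantic communications framework that integrates a multimodal large language model (MLLM) and leverages the reasoning and generative capabilities of pre-trained foundation models. It adopts a device-edge collaborative architecture: at the edge, an MLLM-driven semantic guidance module generates importance-aware attention maps, which an importance-aware semantic encoder and a resource-adaptive semantic decoder use to prioritize semantically critical information and to reconstruct or generate high-quality content.
Link: https://arxiv.org/abs/2507.04621
Authors: Yusong Zhang,Yuxuan Sun,Lei Guo,Wei Chen,Bo Ai,Deniz Gunduz
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments: This work has been submitted to the IEEE for possible publication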
Abstract:6G networks promise revolutionary immersive communication experiences including augmented reality (AR), virtual reality (VR), and holographic communications. These applications demand high-dimensional multimodal data transmission and intelligent data processing in real-time, which is extremely challenging over resource-limited wireless communication systems. Moreover, a joint understanding of the environment, context, and user intent is essential to deliver task-relevant content effectively. This article presents a novel multimodal large language model (MLLM) integrated semantic communications framework, termed MLLM-SC, which fully leverages reasoning and generative capabilities of pre-trained foundation models for context-aware and task-oriented wireless communication. The MLLM-SC framework adopts a device-edge collaborative architecture. At the edge, MLLM-empowered semantic guidance module analyzes multimodal inputs, user intents, and channel conditions to generate importance-aware attention maps prioritizing semantically critical information. An importance-aware semantic encoder and a resource-adaptive semantic decoder are jointly designed and optimized, which can utilize the semantic guidance for adaptive bandwidth allocation and high-quality content reconstruction or generation. Extensive case studies on visual question answering for AR/VR applications and diffusion-driven image generation validate the effectiveness of MLLM-SC.
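[AI-39] any4: Learned 4-bit Numeric Representation for LLMs ICML2025
Quick read: This paper addresses weight quantization for large language models (LLMs), aiming to provide a 4-bit quantization scheme with arbitrary numeric representations that requires no pre-processing of weights or activations. The key to the solution is any4, which achieves higher accuracy than other 4-bit numeric representation types such as int4, fp4, and nf4 without relying on pre-processing, and is competitive with techniques that do require pre-processing (e.g., AWQ and GPTQ). any4 can also be calibrated using a single curated diverse sample rather than the hundreds of dataset samples that most quantization approaches need, which further improves efficiency.
Link: https://arxiv.org/abs/2507.04610
Authors: Mostafa Elhoushi,Jeff Johnson
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICML 2025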
Abstract:We present any4, a learned 4-bit weight quantization solution for large language models (LLMs) providing arbitrary numeric representations without requiring pre-processing of weights or activations. any4 yields higher accuracy compared to other related 4-bit numeric representation types: int4, fp4 and nf4, as evaluated on a range of model sizes, generations and families (Llama 2, Llama 3, Mistral and Mixtral). While any4 does not require preprocessing of weights or activations, it is also competitive with orthogonal techniques that require such preprocessing (e.g., AWQ and GPTQ). We also experiment with any3 and any2 and show competitiveness at lower bits. Additionally, we show that we can calibrate using a single curated diverse sample rather than hundreds of samples from a dataset as done in most quantization approaches. We also open source tinygemm, a latency optimized GPU matrix multiplication library for LLMs, that implements any4 using a GPU-efficient lookup table strategy along with other common quantization methods. We open source our code at this https URL .
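The lookup-table idea in this abstract can be illustrated with a small sketch. The abstract does not say how the table is learned, so the code below assumes a per-row 1-D k-means (Lloyd's algorithm) as a stand-in; the function names are hypothetical and this is not the tinygemm implementation.

```python
import numpy as np

def quantize_row_lut(row, n_levels=16, n_iter=20):
    """Quantize one weight row to 4-bit codes plus a 16-entry lookup table.

    Assumption: the table is fitted with plain 1-D k-means, chosen here
    only for illustration.
    """
    # Initialise the table at evenly spaced quantiles of the row.
    lut = np.quantile(row, np.linspace(0.0, 1.0, n_levels))
    for _ in range(n_iter):
        # Assign every weight to its nearest table entry.
        codes = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
        # Move each entry to the mean of the weights assigned to it.
        for k in range(n_levels):
            members = row[codes == k]
            if members.size:
                lut[k] = members.mean()
    # Final assignment against the fitted table.
    codes = np.abs(row[:, None] - lut[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), lut

def dequantize_row(codes, lut):
    """Recover approximate weights by indexing the table with the codes."""
    return lut[codes]

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 256)).astype(np.float32)
recon = np.stack([dequantize_row(*quantize_row_lut(r)) for r in weights])
mse = float(np.mean((weights - recon) ** 2))
```

Each row stores 4-bit codes and its own 16 values, so dequantization is a table lookup rather than a fixed int4 or fp4 grid.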
[AI-40] Accelerated Online Reinforcement Learning using Auxiliary Start State Distributions ICML
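Quick read: This paper addresses sample efficiency in online reinforcement learning (RL), a problem that stems from the inability to explore environments efficiently. The key to the solution is to accelerate learning with a small number of expert demonstrations and a simulator that allows resets to arbitrary states. The study finds that training with an auxiliary start state distribution that may differ from the true start state distribution of the Markov Decision Process (MDP), chosen using a notion of safety, significantly improves sample efficiency. By operationalizing this notion through episode length information, the authors achieve state-of-the-art sample efficiency on a sparse-reward hard-exploration environment.
Link: https://arxiv.org/abs/2507.04606
Authors: Aman Mehra,Alexandre Capone,Jeff Schneider
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: ICML ARLET Workshop 2024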
Abstract:A long-standing problem in online reinforcement learning (RL) is of ensuring sample efficiency, which stems from an inability to explore environments efficiently. Most attempts at efficient exploration tackle this problem in a setting where learning begins from scratch, without prior information available to bootstrap learning. However, such approaches fail to leverage expert demonstrations and simulators that can reset to arbitrary states. These affordances are valuable resources that offer enormous potential to guide exploration and speed up learning. In this paper, we explore how a small number of expert demonstrations and a simulator allowing arbitrary resets can accelerate learning during online RL. We find that training with a suitable choice of an auxiliary start state distribution that may differ from the true start state distribution of the underlying Markov Decision Process can significantly improve sample efficiency. We find that using a notion of safety to inform the choice of this auxiliary distribution significantly accelerates learning. By using episode length information as a way to operationalize this notion, we demonstrate state-of-the-art sample efficiency on a sparse-reward hard-exploration environment.
[AI-41] DisMS-TS: Eliminating Redundant Multi-Scale Features for Time Series Classification ACM-MM2025
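Quick read: This paper addresses the problem that redundant scale-shared features in multi-scale time series analysis cause models to over- or under-focus on those features, which hurts classification performance. The key to the solution is an end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS), which introduces a temporal disentanglement module to capture scale-shared and scale-specific temporal representations separately, combined with two regularization terms that enforce the consistency of scale-shared representations and the disparity of scale-specific representations, thereby eliminating redundant features and improving prediction performance.
Link: https://arxiv.org/abs/2507.04600
Authors: Zhipeng Liu,Peibo Duan,Binwu Wang,Xuan Tang,Qi Chu,Changsheng Zhang,Yongsheng Huang,Bin Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for presentation at the ACM International Conference on Multimedia (ACM MM 2025)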
Abstract:Real-world time series typically exhibit complex temporal variations, making the time series classification task notably challenging. Recent advancements have demonstrated the potential of multi-scale analysis approaches, which provide an effective solution for capturing these complex temporal patterns. However, existing multi-scale analysis-based time series prediction methods fail to eliminate redundant scale-shared features across multi-scale time series, resulting in the model over- or under-focusing on scale-shared features. To address this issue, we propose a novel end-to-end Disentangled Multi-Scale framework for Time Series classification (DisMS-TS). The core idea of DisMS-TS is to eliminate redundant shared features in multi-scale time series, thereby improving prediction performance. Specifically, we propose a temporal disentanglement module to capture scale-shared and scale-specific temporal representations, respectively. Subsequently, to effectively learn both scale-shared and scale-specific temporal representations, we introduce two regularization terms that ensure the consistency of scale-shared representations and the disparity of scale-specific representations across all temporal scales. Extensive experiments conducted on multiple datasets validate the superiority of DisMS-TS over its competitive baselines, with the accuracy improvement up to 9.71%.
[AI-42] Exploring Core and Periphery Precepts in Biological and Artificial Intelligence: An Outcome-Based Perspective
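Quick read: This paper addresses the engineering-methodology challenges that intelligent systems face when scaling their intelligence, in particular whether the traditional principles of decomposition and recomposition apply to such systems. The key to the solution is the "core and periphery" principles, a conceptual framework rooted in abstract systems theory and the Law of Requisite Variety that aims to provide new systems principles for engineering general intelligence. The paper extends the theoretical framework by mathematically defining core-dominant and periphery-dominant systems and illustrates its practical relevance to biological and artificial intelligence systems.
Link: https://arxiv.org/abs/2507.04594
Authors: Niloofar Shadab,Tyler Cody,Alejandro Salado,Taylan G. Topcu,Mohammad Shadab,Peter Beling
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: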
Abstract:Engineering methodologies predominantly revolve around established principles of decomposition and recomposition. These principles involve partitioning inputs and outputs at the component level, ensuring that the properties of individual components are preserved upon composition. However, this view does not transfer well to intelligent systems, particularly when addressing the scaling of intelligence as a system property. Our prior research contends that the engineering of general intelligence necessitates a fresh set of overarching systems principles. As a result, we introduced the “core and periphery” principles, a novel conceptual framework rooted in abstract systems theory and the Law of Requisite Variety. In this paper, we assert that these abstract concepts hold practical significance. Through empirical evidence, we illustrate their applicability to both biological and artificial intelligence systems, bridging abstract theory with real-world implementations. Then, we expand on our previous theoretical framework by mathematically defining core-dominant vs periphery-dominant systems.
[AI-43] SPIRA: Building an Intelligent System for Respiratory Insufficiency Detection
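Quick read: This paper addresses the detection of respiratory insufficiency and presents SPIRA, a voice-based intelligent system for this purpose. The key to the solution is identifying signs of respiratory insufficiency by analyzing speech signals; the paper compiles the challenges encountered in data collection, model training, and inference and summarizes the lessons learned to guide the design and implementation of similar systems.
Link: https://arxiv.org/abs/2507.04548
Authors: Renato Cordeiro Ferreira(1),Dayanne Gomes(1),Vitor Tamae(1),Francisco Wernke(1),Alfredo Goldman(1) ((1) University of São Paulo)
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 1 figure (1 diagram), published at ISE 2022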
Abstract:Respiratory insufficiency is a medical symptom in which a person gets a reduced amount of oxygen in the blood. This paper reports the experience of building SPIRA: an intelligent system for detecting respiratory insufficiency from voice. It compiles challenges faced in two succeeding implementations of the same architecture, summarizing lessons learned on data collection, training, and inference for future projects in similar systems.
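[AI-44] Towards integration of Privacy Enhancing Technologies in Explainable Artificial Intelligence
Quick read: This paper addresses the privacy risk to individuals that arises when explanations produced by feature-based Explainable AI (XAI) methods leak information. The key to the solution is to apply Privacy Enhancing Technologies (PETs), namely synthetic training data, differentially private training, and noise addition, to reduce the threat that attribute inference attacks pose to sensitive personal information while preserving model utility and explanation quality.
Link: https://arxiv.org/abs/2507.04528
Authors: Sonal Allana,Rozita Dara,Xiaodong Lin,Pulei Xiong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Under peer review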
Abstract:Explainable Artificial Intelligence (XAI) is a crucial pathway in mitigating the risk of non-transparency in the decision-making process of black-box Artificial Intelligence (AI) systems. However, despite the benefits, XAI methods are found to leak the privacy of individuals whose data is used in training or querying the models. Researchers have demonstrated privacy attacks that exploit explanations to infer sensitive personal information of individuals. Currently there is a lack of defenses against known privacy attacks targeting explanations when vulnerable XAI are used in production and machine learning as a service system. To address this gap, in this article, we explore Privacy Enhancing Technologies (PETs) as a defense mechanism against attribute inference on explanations provided by feature-based XAI methods. We empirically evaluate 3 types of PETs, namely synthetic training data, differentially private training and noise addition, on two categories of feature-based XAI. Our evaluation determines different responses from the mitigation methods and side-effects of PETs on other system properties such as utility and performance. In the best case, PETs integration in explanations reduced the risk of the attack by 49.47%, while maintaining model utility and explanation quality. Through our evaluation, we identify strategies for using PETs in XAI for maximizing benefits and minimizing the success of this privacy attack on sensitive personal information.
[AI-45] Churn-Aware Recommendation Planning under Aggregated Preference Feedback
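Quick read: This paper addresses sequential decision-making in recommender systems when privacy regulations and technological constraints prevent access to individual user data, leaving only population-level preference information. The key to the solution is the Rec-APC model, in which an anonymous user is drawn from a known prior over latent user types and the decision-maker sequentially selects items to recommend; positive feedback refines the posterior via Bayesian updates, while negative feedback terminates the session. The paper proves that optimal policies converge to pure exploitation in finite time and proposes a branch-and-bound algorithm to compute them efficiently.
Link: https://arxiv.org/abs/2507.04513
Authors: Gur Keinan,Omer Ben-Porat
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2502.18483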
Abstract:We study a sequential decision-making problem motivated by recent regulatory and technological shifts that limit access to individual user data in recommender systems (RSs), leaving only population-level preference information. This privacy-aware setting poses fundamental challenges in planning under uncertainty: Effective personalization requires exploration to infer user preferences, yet unsatisfactory recommendations risk immediate user churn. To address this, we introduce the Rec-APC model, in which an anonymous user is drawn from a known prior over latent user types (e.g., personas or clusters), and the decision-maker sequentially selects items to recommend. Feedback is binary – positive responses refine the posterior via Bayesian updates, while negative responses result in the termination of the session. We prove that optimal policies converge to pure exploitation in finite time and propose a branch-and-bound algorithm to efficiently compute them. Experiments on synthetic and MovieLens data confirm rapid convergence and demonstrate that our method outperforms the POMDP solver SARSOP, particularly when the number of user types is large or comparable to the number of content categories. Our results highlight the applicability of this approach and inspire new ways to improve decision-making under the constraints imposed by aggregated preference data.
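The feedback rule described in this abstract (a positive response updates the belief over user types, a negative response ends the session) can be sketched with Bayes' rule. The matrix `like_prob`, the function name, and the numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def recommend_step(belief, like_prob, item):
    """Return (probability of positive feedback, posterior after positive feedback).

    belief:    current distribution over latent user types, shape (n_types,)
    like_prob: hypothetical matrix, like_prob[t, i] = P(type t likes item i)
    """
    p_like_by_type = like_prob[:, item]
    # Chance that the anonymous user responds positively to this item.
    p_positive = float(belief @ p_like_by_type)
    # Bayes' rule: posterior is proportional to prior times likelihood.
    posterior = belief * p_like_by_type / p_positive
    return p_positive, posterior

prior = np.array([0.5, 0.5])            # two user types, equally likely
like_prob = np.array([[0.9, 0.2],       # type 0 mostly likes item 0
                      [0.3, 0.8]])      # type 1 mostly likes item 1
p_positive, posterior = recommend_step(prior, like_prob, item=0)
p_churn = 1.0 - p_positive              # a negative response ends the session
```

With these toy numbers, recommending item 0 succeeds with probability 0.6 and shifts the belief to (0.75, 0.25), while the remaining 0.4 is the risk of immediate churn. This is the exploration cost the abstract refers to.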
[AI-46] Dealing with Uncertainty in Contextual Anomaly Detection
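Quick read: This paper addresses contextual anomaly detection (CAD): identifying anomalies in a target variable given a set of contextual variables that influence the normalcy of the target. Conventional methods often fail to account adequately for uncertainty in the target variable. The key to the solution is a new framework called normalcy score (NS), which explicitly models both aleatoric and epistemic uncertainty. Built on heteroscedastic Gaussian process regression, NS treats the Z-score as a random variable and thus provides confidence intervals that reflect the reliability of the anomaly assessment, improving detection accuracy and interpretability and supporting an adaptive, uncertainty-driven decision-making process.
Link: https://arxiv.org/abs/2507.04490
Authors: Luca Bindini,Lorenzo Perini,Stefano Nistri,Jesse Davis,Paolo Frasconi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: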
Abstract:Contextual anomaly detection (CAD) aims to identify anomalies in a target (behavioral) variable conditioned on a set of contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In many anomaly detection tasks, there exist contextual variables that influence the normalcy of the target variable but are not themselves indicators of anomaly. In this work, we propose a novel framework for CAD, normalcy score (NS), that explicitly models both the aleatoric and epistemic uncertainties. Built on heteroscedastic Gaussian process regression, our method regards the Z-score as a random variable, providing confidence intervals that reflect the reliability of the anomaly assessment. Through experiments on benchmark datasets and a real-world application in cardiology, we demonstrate that NS outperforms state-of-the-art CAD methods in both detection accuracy and interpretability. Moreover, confidence intervals enable an adaptive, uncertainty-driven decision-making process, which may be very important in domains such as healthcare.
[AI-47] LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization
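Quick read: This paper addresses the computational inefficiency and sub-optimal fine-tuning performance of Parameter-Efficient Fine-Tuning (PEFT) methods on domain specialization tasks, which stem from extensive matrix multiplications. The key to the solution is LoSiA (Low-Resources Subnet Integration Adaptation), which identifies a critical sub-network via gradient sparsity analysis and optimizes it as the trainable target, achieving high-rank adaptation by updating only the sub-network parameters and reducing the additional matrix multiplications.
Link: https://arxiv.org/abs/2507.04487
Authors: Xujia Wang,Yunjia Qi,Bin Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 18 pages, 12 figures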
Abstract:Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA(Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about 27% compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training.
[AI-48] Source Attribution in Retrieval-Augmented Generation
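Quick read: This paper addresses how to identify influential retrieved documents effectively and efficiently in Retrieval-Augmented Generation (RAG) systems built on Large Language Models (LLMs). The key to the solution is adapting Shapley-value-based attribution to the document-level setting of RAG and comparing exact Shapley values with more computationally tractable approximations and existing LLM attribution methods, to assess whether these methods can reflect document importance accurately at low computational cost. The study focuses on reducing the number of expensive LLM calls while retaining the ability to explain complex inter-document relationships such as redundancy, complementarity, and synergy.
Link: https://arxiv.org/abs/2507.04480
Authors: Ikhtiyor Nematov,Tarik Kalai,Elizaveta Kuzmenko,Gabriele Fugagnoli,Dimitris Sacharidis,Katja Hose,Tomer Sagi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: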
Abstract:While attribution methods, such as Shapley values, are widely used to explain the importance of features or training data in traditional machine learning, their application to Large Language Models (LLMs), particularly within Retrieval-Augmented Generation (RAG) systems, is nascent and challenging. The primary obstacle is the substantial computational cost, where each utility function evaluation involves an expensive LLM call, resulting in direct monetary and time expenses. This paper investigates the feasibility and effectiveness of adapting Shapley-based attribution to identify influential retrieved documents in RAG. We compare Shapley with more computationally tractable approximations and some existing attribution methods for LLM. Our work aims to: (1) systematically apply established attribution principles to the RAG document-level setting; (2) quantify how well SHAP approximations can mirror exact attributions while minimizing costly LLM interactions; and (3) evaluate their practical explainability in identifying critical documents, especially under complex inter-document relationships such as redundancy, complementarity, and synergy. This study seeks to bridge the gap between powerful attribution techniques and the practical constraints of LLM-based RAG systems, offering insights into achieving reliable and affordable RAG explainability.
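To make the cost argument in this abstract concrete, here is a minimal sketch of exact document-level Shapley attribution. The utility function is a toy stand-in for an LLM call (an assumption for illustration; in a real RAG system each evaluation would query the model), and the document names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations
from math import factorial

@lru_cache(maxsize=None)
def toy_utility(docs):
    """Hypothetical answer quality for a set of retrieved documents.

    'A' and 'A_dup' are redundant (either one contributes 0.6);
    'B' contributes 0.4 on its own.
    """
    score = 0.0
    if "A" in docs or "A_dup" in docs:
        score += 0.6
    if "B" in docs:
        score += 0.4
    return score

def shapley_values(players, utility):
    """Exact Shapley values; needs the utility of all 2^n subsets."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values

phi = shapley_values(["A", "A_dup", "B"], toy_utility)
n_utility_calls = toy_utility.cache_info().misses  # distinct subsets evaluated
```

In this toy setting the two redundant documents split the 0.6 credit evenly, and the exact computation still touches every one of the 2^3 subsets. That exponential number of utility evaluations is the cost that motivates the approximations the paper compares.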
[AI-49] Model Inversion Attacks on Llama 3: Extracting PII from Large Language Models
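Quick read: This paper addresses the privacy risks that arise when large language models (LLMs) memorize sensitive information from their training data. It investigates the feasibility of model inversion attacks on the Llama 3.2 model and, using carefully crafted prompts, extracts personally identifiable information (PII) such as passwords, email addresses, and account numbers. The key contribution is to show that even smaller LLMs are vulnerable to privacy attacks and to discuss potential mitigation strategies, including differential privacy and data sanitization, for strengthening privacy protection.
Link: https://arxiv.org/abs/2507.04478
Authors: Sathesh P. Sivashanmugam
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: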
Abstract:Large language models (LLMs) have transformed natural language processing, but their ability to memorize training data poses significant privacy risks. This paper investigates model inversion attacks on the Llama 3.2 model, a multilingual LLM developed by Meta. By querying the model with carefully crafted prompts, we demonstrate the extraction of personally identifiable information (PII) such as passwords, email addresses, and account numbers. Our findings highlight the vulnerability of even smaller LLMs to privacy attacks and underscore the need for robust defenses. We discuss potential mitigation strategies, including differential privacy and data sanitization, and call for further research into privacy-preserving machine learning techniques.
[AI-50] Anomalous Decision Discovery using Inverse Reinforcement Learning
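Quick read: This paper addresses challenges in anomaly detection for autonomous vehicles: existing methods lose effectiveness when confronted with unseen scenarios, sensor noise, and occlusions, and they depend on large annotated datasets, which limits real-world use. The key to the solution is an anomaly detection framework based on Inverse Reinforcement Learning (IRL) that infers latent driving intentions from sequential perception data to enable robust anomaly identification. The core innovation is implicitly learning temporal credit assignment via reward and worst-case supervision, together with pre-training with variable-horizon sampling to maximize time-to-consequence, which yields early detection of behavior deviations.
Link: https://arxiv.org/abs/2507.04464
Authors: Ashish Bastola,Mert D. Pesé,Long Cheng,Jonathon Smereka,Abolfazl Razi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: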
Abstract:Anomaly detection plays a critical role in Autonomous Vehicles (AVs) by identifying unusual behaviors through perception systems that could compromise safety and lead to hazardous situations. Current approaches, which often rely on predefined thresholds or supervised learning paradigms, exhibit reduced efficacy when confronted with unseen scenarios, sensor noise, and occlusions, leading to potential safety-critical failures. Moreover, supervised methods require large annotated datasets, limiting their real-world feasibility. To address these gaps, we propose an anomaly detection framework based on Inverse Reinforcement Learning (IRL) to infer latent driving intentions from sequential perception data, thus enabling robust identification. Specifically, we present Trajectory-Reward Guided Adaptive Pre-training (TRAP), a novel IRL framework for anomaly detection, to address two critical limitations of existing methods: noise robustness and generalization to unseen scenarios. Our core innovation is implicitly learning temporal credit assignments via reward and worst-case supervision. We leverage pre-training with variable-horizon sampling to maximize time-to-consequence, resulting in early detection of behavior deviation. Experiments on 14,000+ simulated trajectories demonstrate state-of-the-art performance, achieving 0.90 AUC and 82.2% F1-score - outperforming similarly trained supervised and unsupervised baselines by 39% on Recall and 12% on F1-score, respectively. Similar performance is achieved while exhibiting robustness to various noise types and generalization to unseen anomaly types. Our code will be available at: this https URL
[AI-51] ARMR: Adaptively Responsive Network for Medication Recommendation IJCAI2025
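Quick read: This paper addresses how to balance the reuse of historical medications with the introduction of new drugs when recommending medication for patients with complex medical conditions. The key to the solution is an Adaptively Responsive network for Medication Recommendation (ARMR) with two components: a piecewise temporal learning component that distinguishes recent from distant patient history for a more nuanced temporal understanding, and an adaptively responsive mechanism that dynamically adjusts attention to new and existing drugs based on the patient's current health state and medication history.
Link: https://arxiv.org/abs/2507.04428
Authors: Feiyue Wu,Tianxing Wu,Shenqi Jing
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, accepted by IJCAI 2025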
Abstract:Medication recommendation is a crucial task in healthcare, especially for patients with complex medical conditions. However, existing methods often struggle to effectively balance the reuse of historical medications with the introduction of new drugs in response to the changing patient conditions. In order to address this challenge, we propose an Adaptively Responsive network for Medication Recommendation (ARMR), a new method which incorporates 1) a piecewise temporal learning component that distinguishes between recent and distant patient history, enabling more nuanced temporal understanding, and 2) an adaptively responsive mechanism that dynamically adjusts attention to new and existing drugs based on the patient’s current health state and medication history. Experiments on the MIMIC-III and MIMIC-IV datasets indicate that ARMR has better performance compared with the state-of-the-art baselines in different evaluation metrics, which contributes to more personalized and accurate medication recommendations. The source code is publicly available at: this https URL.
[AI-52] Learning Software Bug Reports: A Systematic Literature Review
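Quick read: This paper addresses the lack of a systematic review of machine learning (ML) applications in software engineering research, specifically in bug report analysis. The key to the solution is a systematic literature review that screens 1,825 relevant papers and selects 204 for detailed analysis, deriving seven key findings that cover model usage trends, feature representation methods, preprocessing techniques, evaluated projects, task types, bug types of interest, and evaluation metrics, and that offer reference points and directions for future research.
Link: https://arxiv.org/abs/2507.04422
Authors: Guoming Long,Jingzhi Gong,Hui Fang,Tao Chen
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by TOSEM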
Abstract:The recent advancement of artificial intelligence, especially machine learning (ML), has significantly impacted software engineering research, including bug report analysis. ML aims to automate the understanding, extraction, and correlation of information from bug reports. Despite its growing importance, there has been no comprehensive review in this area. In this paper, we present a systematic literature review covering 1,825 papers, selecting 204 for detailed analysis. We derive seven key findings: 1) Extensive use of CNN, LSTM, and kNN for bug report analysis, with advanced models like BERT underutilized due to their complexity. 2) Word2Vec and TF-IDF are popular for feature representation, with a rise in deep learning approaches. 3) Stop word removal is the most common preprocessing, with structural methods rising after 2020. 4) Eclipse and Mozilla are the most frequently evaluated software projects. 5) Bug categorization is the most common task, followed by bug localization and severity prediction. 6) There is increasing attention on specific bugs like non-functional and performance bugs. 7) Common evaluation metrics are F1-score, Recall, Precision, and Accuracy, with k-fold cross-validation preferred for model evaluation. 8) Many studies lack robust statistical tests. We also identify six promising future research directions to provide useful insights for practitioners.
[AI-53] LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers
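Quick read: This paper addresses the factual errors that large language models (LLMs) are prone to, which limit their reliability in knowledge-intensive tasks. The key to the solution is a token-aware, layer-localized contrastive decoding method that aligns specific token types with the transformer layers where they are most influential in order to improve factual generation. It builds on two patterns found through empirical attention analysis: punctuation tokens receive dominant attention in early layers, while conceptual tokens govern semantic reasoning in intermediate layers. Selectively suppressing attention to these token types at their respective depths induces controlled factual degradation and yields contrastive signals that guide the final factual decoding. The method requires no additional training or model modification and has been validated on multiple LLMs and benchmarks.
Link: https://arxiv.org/abs/2507.04404
Authors: Jingze Zhu,Yongliang Wu,Wenbo Zhu,Jiawang Cao,Yanqiang Zheng,Jiawei Chen,Xu Yang,Bernt Schiele,Jonas Fischer,Xinting Hu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: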
Abstract:Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors, limiting their reliability in knowledge-intensive tasks. While decoding-time strategies provide a promising efficient solution without training, existing methods typically treat token-level and layer-level signals in isolation, overlooking the joint dynamics between them. In this work, we introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation. Through empirical attention analysis, we identify two key patterns: punctuation tokens receive dominant attention in early layers, while conceptual tokens govern semantic reasoning in intermediate layers. By selectively suppressing attention to these token types at their respective depths, we achieve the induction of controlled factual degradation and derive contrastive signals to guide the final factual decoding. Our method requires no additional training or model modification, and experiments demonstrate that our method consistently improves factuality across multiple LLMs and various benchmarks.
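The contrastive signal mentioned in this abstract can be illustrated with a generic contrastive decoding step. This is only a sketch under assumptions: the scoring rule, the plausibility mask, and the parameters `alpha` and `beta` follow a common formulation of contrastive decoding and are not taken from this paper, and the logits are made-up numbers standing in for the intact and attention-suppressed forward passes.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def contrastive_pick(logits_full, logits_degraded, alpha=1.0, beta=0.1):
    """Pick the token whose probability drops most when the model is degraded.

    alpha scales the contrast; beta keeps only tokens whose probability
    under the intact model is at least beta times that of the top token.
    """
    logp_full = log_softmax(logits_full)
    logp_degraded = log_softmax(logits_degraded)
    # Plausibility mask, so implausible tokens cannot win on contrast alone.
    plausible = logp_full >= np.log(beta) + logp_full.max()
    scores = (1.0 + alpha) * logp_full - alpha * logp_degraded
    scores = np.where(plausible, scores, -np.inf)
    return int(scores.argmax())

logits_full = np.array([2.0, 1.9, -1.0])      # intact forward pass (toy values)
logits_degraded = np.array([2.0, 0.5, -1.0])  # pass with attention suppressed (toy values)
greedy_token = int(logits_full.argmax())
contrastive_token = contrastive_pick(logits_full, logits_degraded)
```

In this toy case greedy decoding picks token 0, while the contrastive step prefers token 1, whose probability falls sharply once the model is degraded.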
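[AI-54] Tractable Representation Learning with Probabilistic Circuits
Quick read: This paper addresses the fact that existing approaches to representation learning with probabilistic circuits (PCs) rely on external neural embeddings or activation-based encodings and do not model probabilistic embeddings explicitly. The key to the solution is autoencoding probabilistic circuits (APCs), which exploit the tractable inference of PCs to model data and embeddings jointly, so that embeddings are obtained directly through probabilistic inference. The framework natively handles arbitrary missing data and integrates seamlessly with a neural decoder to form a hybrid, end-to-end trainable architecture.
Link: https://arxiv.org/abs/2507.04385
Authors: Steven Braun,Sahil Sidheekh,Antonio Vergari,Martin Mundt,Sriraam Natarajan,Kristian Kersting
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: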
Abstract:Probabilistic circuits (PCs) are powerful probabilistic models that enable exact and tractable inference, making them highly suitable for probabilistic reasoning and inference tasks. While dominant in neural networks, representation learning with PCs remains underexplored, with prior approaches relying on external neural embeddings or activation-based encodings. To address this gap, we introduce autoencoding probabilistic circuits (APCs), a novel framework leveraging the tractability of PCs to model probabilistic embeddings explicitly. APCs extend PCs by jointly modeling data and embeddings, obtaining embedding representations through tractable probabilistic inference. The PC encoder allows the framework to natively handle arbitrary missing data and is seamlessly integrated with a neural decoder in a hybrid, end-to-end trainable architecture enabled by differentiable sampling. Our empirical evaluation demonstrates that APCs outperform existing PC-based autoencoding methods in reconstruction quality, generate embeddings competitive with, and exhibit superior robustness in handling missing data compared to neural autoencoders. These results highlight APCs as a powerful and flexible representation learning method that exploits the probabilistic inference capabilities of PCs, showing promising directions for robust inference, out-of-distribution detection, and knowledge distillation.
[AI-55] DC-Mamber: A Dual Channel Prediction Model based on Mamba and Linear Transformer for Multivariate Time Series Forecasting
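Quick read: This paper aims to address the limitations of existing models for multivariate time series forecasting (MTSF) in handling local temporal patterns and global temporal dependencies. Of the current mainstream models, Transformers have limited sensitivity to local patterns, while Mamba struggles to aggregate global contextual information in parallel. The key to the solution is DC-Mamber, a dual-channel forecasting model based on Mamba and linear Transformer: the Mamba-based channel uses a channel-independent strategy to extract intra-variable features, the Transformer-based channel uses a channel-mixing strategy to model cross-timestep global dependencies, and the dual-channel features are fused for more accurate prediction.
Link: https://arxiv.org/abs/2507.04381
Authors: Bing Fan,Shusen Ma,Yun-Bo Zhao,Yu Kang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: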
Abstract:In multivariate time series forecasting (MTSF), existing strategies for processing sequences are typically categorized as channel-independent and channel-mixing. The former treats all temporal information of each variable as a token, focusing on capturing local temporal features of individual variables, while the latter constructs a token from the multivariate information at each time step, emphasizing the modeling of global temporal dependencies. Current mainstream models are mostly based on Transformer and the emerging Mamba. Transformers excel at modeling global dependencies through self-attention mechanisms but exhibit limited sensitivity to local temporal patterns and suffer from quadratic computational complexity, restricting their efficiency in long-sequence processing. In contrast, Mamba, based on state space models (SSMs), achieves linear complexity and efficient long-range modeling but struggles to aggregate global contextual information in parallel. To overcome the limitations of both models, we propose DC-Mamber, a dual-channel forecasting model based on Mamba and linear Transformer for time series forecasting. Specifically, the Mamba-based channel employs a channel-independent strategy to extract intra-variable features, while the Transformer-based channel adopts a channel-mixing strategy to model cross-timestep global dependencies. DC-Mamber first maps the raw input into two distinct feature representations via separate embedding layers. These representations are then processed by a variable encoder (built on Mamba) and a temporal encoder (built on linear Transformer), respectively. Finally, a fusion layer integrates the dual-channel features for prediction. Extensive experiments on eight public datasets confirm DC-Mamber’s superior accuracy over existing models.
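The two tokenization strategies contrasted in this abstract differ only in which axis of the input becomes the token axis. The sketch below shows that difference on a toy array; the sizes and function names are illustrative assumptions, and the embedding layers and encoders of the actual model are omitted.

```python
import numpy as np

def channel_independent_tokens(series):
    """One token per variable: each token holds that variable's whole history."""
    return series.T                      # shape (n_variables, n_timesteps)

def channel_mixing_tokens(series):
    """One token per time step: each token holds all variables at that step."""
    return series                        # shape (n_timesteps, n_variables)

n_timesteps, n_variables = 96, 7         # hypothetical look-back window and variable count
series = np.arange(n_timesteps * n_variables, dtype=np.float32)
series = series.reshape(n_timesteps, n_variables)

variable_tokens = channel_independent_tokens(series)
timestep_tokens = channel_mixing_tokens(series)
```

Here `variable_tokens` has 7 tokens of length 96 and `timestep_tokens` has 96 tokens of length 7. A dual-channel design feeds the first view to one encoder and the second view to the other before fusing the results.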
[AI-56] MOD-X: A Modular Open Decentralized eXchange Framework proposal for Heterogeneous Interoperable Artificial Agents
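Quick read: This paper addresses the lack of standardized communication protocols between agents as artificial intelligence systems evolve from monolithic models to ecosystems of specialized agents. The key to the solution is the MOD-X (Modular Open Decentralized eXchange) architecture, whose core innovations include a layered architecture with a Universal Message Bus, thorough state management, translation capabilities, and blockchain-based security mechanisms, together with a publish-subscribe communication model, semantic capability discovery, and dynamic workflow orchestration, to enable effective integration and interoperability among heterogeneous specialist agents.
Link: https://arxiv.org/abs/2507.04376
Authors: Georgios Ioannides,Christos Constantinou,Vinija Jain,Aman Chadha,Aaron Elkins
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: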
Abstract:As Artificial Intelligence systems evolve from monolithic models to ecosystems of specialized agents, the need for standardized communication protocols becomes increasingly critical. This paper introduces MOD-X (Modular Open Decentralized eXchange), a novel architectural framework proposal for agent interoperability that addresses key limitations of existing protocols. Unlike current approaches, MOD-X proposes a layered architecture with a Universal Message Bus, thorough state management, translation capabilities, and blockchain-based security mechanisms. We present MOD-X’s architecture, compare it with existing protocols, and demonstrate its application through a worked example how it enables integration between heterogeneous specialist agents (agents with different architectures, vendors, capabilities, and knowledge representations–including rule-based systems, neural networks, symbolic reasoning engines, and legacy software with agent wrappers). MOD-X’s key innovations include a publish-subscribe communication model, semantic capability discovery, and dynamic workflow orchestration–providing a framework that bridges theoretical formalism with practical implementation. This architecture addresses the growing need for truly decentralized, interoperable agent ecosystems that can scale effectively without the need for central coordination.
[AI-57] WebSynthesis: World-Model-Guided MCTS for Efficient WebUI-Trajectory Synthesis
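[Quick read]: This paper addresses the shortcomings of web agents based on large language models (LLMs) in trajectory-level planning and execution in complex, dynamic web environments. The key solution is WebSynthesis, a framework that uses a learned world model to simulate virtual web environments so that a policy agent can perform efficient, reversible tree-based planning, supporting large-scale generation of diverse, high-quality trajectories that are then used to refine the agent's policy.
Link: https://arxiv.org/abs/2507.04370
Authors: Yifei Gao,Junhong Ye,Jiaqi Wang,Jitao Sang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: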
Abstract:Recent advancements in large language models (LLMs) have significantly improved the capabilities of web agents. However, effectively navigating complex and dynamic web environments still requires more advanced trajectory-level planning and execution. Prior studies have addressed self-improving agents by collecting extensive GUI trajectories from real-environment interactions. Despite their effectiveness, these approaches encounter two critical challenges: (1) Uncontrollable environment states, where real or sandboxed web environments often yield unstable and non-deterministic feedback, complicating the reproduction and debugging of agent behaviors; and (2) High API costs, as generating even a single interaction trajectory can involve hundreds of queries, leading to considerable API usage and computational expenses. To address these limitations and enable scalable self-improvement for agents, we propose WebSynthesis, a novel framework for trajectory synthesis and training. WebSynthesis leverages a learned world model to simulate virtual web environments, allowing a policy agent to perform efficient and reversible tree-based planning. This approach supports the large-scale generation of diverse and high-quality trajectories, which are subsequently utilized to refine the agent’s policy. Experimental results demonstrate that an agent trained using WebSynthesis on a small-scale synthetic dataset achieves performance comparable to or even surpassing that of models trained on large-scale real-world data.
[AI-58] AI-washing: The Asymmetric Effects of Its Two Types on Consumer Moral Judgments
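[Quick read]: This paper examines organizations that, amid AI hype, mislead consumers by overstating or understating their actual AI usage, a practice termed AI-washing. The study investigates how such false claims affect consumer attitudes and purchase intentions, and finds that deceptive denial evokes more negative moral judgments than honest negation, while deceptive boasting has no effect; perceived betrayal mediates these outcomes.
Link: https://arxiv.org/abs/2507.04352
Authors: Greg Nyilasy,Harsha Gangadharbatla
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: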
Abstract:As AI hype continues to grow, organizations face pressure to broadcast or downplay purported AI initiatives - even when contrary to truth. This paper introduces AI-washing as overstating (deceptive boasting) or understating (deceptive denial) a company’s real AI usage. A 2x2 experiment (N = 401) examines how these false claims affect consumer attitudes and purchase intentions. Results reveal a pronounced asymmetry: deceptive denial evokes more negative moral judgments than honest negation, while deceptive boasting has no effects. We show that perceived betrayal mediates these outcomes. By clarifying how AI-washing erodes trust, the study highlights clear ethical implications for policymakers, marketers, and researchers striving for transparency.
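[AI-59] MLLM-Fabric: Multimodal Large Language Model-Driven Robotic Framework for Fabric Sorting and Selection
[Quick read]: This paper addresses how to select fabrics that meet functional and quality requirements in robotic applications for textile manufacturing, apparel production, and smart retail. The key solution is MLLM-Fabric, a robotic framework driven by multimodal large language models (MLLMs) that combines visual, visuotactile, and pressure sensing data and uses supervised fine-tuning and multimodal explanation-guided knowledge distillation to classify and rank fabric properties accurately.
Link: https://arxiv.org/abs/2507.04351
Authors: Liman Wang,Hanyang Zhong,Tianyuan Wang,Shan Luo,Jihong Zhu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: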
Abstract:Choosing the right fabric is crucial to meet functional and quality requirements in robotic applications for textile manufacturing, apparel production, and smart retail. We present MLLM-Fabric, a robotic framework powered by multimodal large language models (MLLMs) for fabric sorting and selection. The system includes a robotic arm, a camera, a visuotactile sensor, and a pressure sensor. It employs supervised fine-tuning and multimodal explanation-guided knowledge distillation to accurately classify and rank fabric properties. To facilitate further research, we release a dataset of 220 unique fabric samples, including RGB images and synchronized visuotactile and pressure data. Experimental results show that our Fabric-Llama-90B model consistently outperforms pretrained vision-language baselines in both property ranking accuracy and selection reliability.
[AI-60] Improving Action Smoothness for a Cascaded Online Learning Flight Control System
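[Quick read]: This paper aims to improve action smoothness in a cascaded online learning flight control system, whose stability can be compromised by oscillatory control actions. The key solution is to introduce an online temporal smoothness technique and a low-pass filter to reduce the amplitude and frequency of the control actions, and to use the Fast Fourier Transform (FFT) to analyze policy performance in the frequency domain.
Link: https://arxiv.org/abs/2507.04346
Authors: Yifei Li,Erik-jan van Kampen
Affiliation: unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: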
Abstract:This paper aims to improve the action smoothness of a cascaded online learning flight control system. Although the cascaded structure is widely used in flight control design, its stability can be compromised by oscillatory control actions, which poses challenges for practical engineering applications. To address this issue, we introduce an online temporal smoothness technique and a low-pass filter to reduce the amplitude and frequency of the control actions. Fast Fourier Transform (FFT) is used to analyze policy performance in the frequency domain. Simulation results demonstrate the improvements achieved by the two proposed techniques.
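As a rough illustration of the low-pass filtering and frequency-domain check described above, the sketch below smooths a synthetic action sequence with a first-order filter and compares high-frequency energy before and after. Everything here is an assumption made for illustration: the filter form, the coefficient `alpha`, the test signal, and the cutoff bin are not taken from the paper, the paper's online temporal smoothness technique is not reproduced, and a naive DFT stands in for an FFT so the example needs only the standard library.

```python
import cmath
import math
from typing import List


def low_pass(actions: List[float], alpha: float) -> List[float]:
    """First-order filter: y[t] = alpha * x[t] + (1 - alpha) * y[t-1]."""
    filtered: List[float] = []
    prev = actions[0]  # initialise the filter state with the first action
    for x in actions:
        prev = alpha * x + (1.0 - alpha) * prev
        filtered.append(prev)
    return filtered


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Naive O(n^2) DFT magnitudes for bins 0..n//2 (stand-in for an FFT)."""
    n = len(signal)
    mags: List[float] = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc) / n)
    return mags


def high_freq_energy(signal: List[float], cutoff_bin: int) -> float:
    """Sum of squared magnitudes at and above `cutoff_bin`."""
    return sum(m * m for m in dft_magnitudes(signal)[cutoff_bin:])


# Synthetic action sequence: slow intended motion plus a fast oscillation.
n = 128
raw = [math.sin(2 * math.pi * 2 * t / n) + 0.5 * math.sin(2 * math.pi * 40 * t / n)
       for t in range(n)]
smooth = low_pass(raw, alpha=0.2)

raw_energy = high_freq_energy(raw, cutoff_bin=20)
smooth_energy = high_freq_energy(smooth, cutoff_bin=20)
```

A smaller `alpha` suppresses oscillation more strongly but also delays the response to genuine command changes, which is the trade-off any such filter has to balance in a control loop.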
[AI-61] Voltage Mode Winner-Take-All Circuit for Neuromorphic Systems
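[Quick read]: This paper addresses the functional limitations of the winner-take-all circuit, a key learning unit in neuromorphic computing systems, by designing a configurable winner-take-all circuit with k-winner and hysteresis properties. The proposed circuit, simulated in the IBM 65 nm node, dissipates 34.9 μW with a latency of 10.4 ns while processing 1000 inputs, and its utility is demonstrated for spatial filtering and classification.
Link: https://arxiv.org/abs/2507.04338
Authors: Abdullah M. Zyarah,Dhireesha Kudithipudi
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: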
Abstract:Recent advances in neuromorphic computing demonstrate on-device learning capabilities with low power consumption. One of the key learning units in these systems is the winner-take-all circuit. In this research, we propose a winner-take-all circuit that can be configured to achieve k-winner and hysteresis properties, simulated in the IBM 65 nm node. The circuit dissipated 34.9 μW of power with a latency of 10.4 ns, while processing 1000 inputs. The utility of the circuit is demonstrated for spatial filtering and classification.
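For readers unfamiliar with the k-winner behaviour mentioned above, the following is a software sketch of the selection rule only. It says nothing about the circuit itself: the function name and the tie-breaking rule (lower index wins) are assumptions, and hysteresis is not modeled.

```python
from typing import List, Sequence


def k_winner_take_all(inputs: Sequence[float], k: int) -> List[int]:
    """Return a 0/1 mask marking the k largest inputs.

    Ties are broken in favour of the lower index (an arbitrary choice
    made for this sketch).
    """
    if not 0 <= k <= len(inputs):
        raise ValueError("k must be between 0 and len(inputs)")
    # Python's sort is stable, so equal values keep their original order.
    ranked = sorted(range(len(inputs)), key=lambda i: inputs[i], reverse=True)
    winners = set(ranked[:k])
    return [1 if i in winners else 0 for i in range(len(inputs))]


# Usage: classic winner-take-all is the k = 1 special case.
single = k_winner_take_all([0.2, 0.9, 0.5, 0.7], k=1)
double = k_winner_take_all([0.2, 0.9, 0.5, 0.7], k=2)
```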
[AI-62] QF: Quick Feedforward AI Model Training without Gradient Back Propagation
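[Quick read]: This paper addresses how to efficiently transfer instruction-derived knowledge into the weights of transformer-based models without the gradient backpropagation required by traditional fine-tuning. The key solution is Quick Feedforward (QF) Learning, a knowledge consolidation framework that transfers knowledge through feedforward activations; updates are computed in closed form, require minimal parameter modification, and preserve prior knowledge, so the model can train and infer in the same runtime environment, which the author describes as more resource efficient and closer to how the human brain operates.
Link: https://arxiv.org/abs/2507.04300
Authors: Feng Qi
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: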
Abstract:We propose Quick Feedforward (QF) Learning, a novel knowledge consolidation framework for transformer-based models that enables efficient transfer of instruction-derived knowledge into model weights through feedforward activations without any gradient backpropagation. Unlike traditional finetuning, QF updates are computed in closed form, require minimal parameter modification, and preserve prior knowledge. Importantly, QF allows models to train and infer within the same runtime environment, making the process more resource efficient and closely aligned with how the human brain operates. Code and models are open sourced on GitHub. I hope QF Learning inspires a more efficient and brain-like paradigm for AI systems.
[AI-63] Answer Set Programming Modulo Theories and Reasoning about Continuous Changes IJCAI2013
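[Quick read]: This paper addresses how to tightly integrate answer set programming (ASP) with satisfiability modulo theories (SMT), extending ASP to handle continuous as well as discrete changes. The key is that "tight" ASPMT programs can be translated into SMT instances, so SMT solvers can be used for computation. The approach is based on the functional stable model semantics, with the interpretations of background theories fixed.
Link: https://arxiv.org/abs/2507.04299
Authors: Joohyung Lee,Yunsong Meng
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), pages 990-996, 2013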
Abstract:Answer Set Programming Modulo Theories (ASPMT) is a new framework of tight integration of answer set programming (ASP) and satisfiability modulo theories (SMT). Similar to the relationship between first-order logic and SMT, it is based on a recent proposal of the functional stable model semantics by fixing interpretations of background theories. Analogously to a known relationship between ASP and SAT, "tight" ASPMT programs can be translated into SMT instances. We demonstrate the usefulness of ASPMT by enhancing action language C+ to handle continuous changes as well as discrete changes. We reformulate the semantics of C+ in terms of ASPMT, and show that SMT solvers can be used to compute the language. We also show how the language can represent cumulative effects on continuous resources.
[AI-64] VOLTRON: Detecting Unknown Malware Using Graph-Based Zero-Shot Learning
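[Quick read]: This paper addresses the security challenge posed by the persistent threat of Android malware, in particular the poor detection of emerging, previously unseen malware families. Traditional machine learning methods depend on large labeled datasets and are limited when labeled data for new threats is scarce. The key solution is a zero-shot learning framework that combines Variational Graph Auto-Encoders (VGAE) with Siamese Neural Networks (SNN) and uses graph-based representations of Android applications to identify malicious behavior without prior examples of specific malware families.
Link: https://arxiv.org/abs/2507.04275
Authors: M. Tahir Akdeniz,Zeynep Yeşilkaya,İ. Enes Köse,İ. Ulaş Ünal,Sevil Şen
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures, Submitted as a preprint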
Abstract:The persistent threat of Android malware presents a serious challenge to the security of millions of users globally. While many machine learning-based methods have been developed to detect these threats, their reliance on large labeled datasets limits their effectiveness against emerging, previously unseen malware families, for which labeled data is scarce or nonexistent. To address this challenge, we introduce a novel zero-shot learning framework that combines Variational Graph Auto-Encoders (VGAE) with Siamese Neural Networks (SNN) to identify malware without needing prior examples of specific malware families. Our approach leverages graph-based representations of Android applications, enabling the model to detect subtle structural differences between benign and malicious software, even in the absence of labeled data for new threats. Experimental results show that our method outperforms the state-of-the-art MaMaDroid, especially in zero-day malware detection. Our model achieves 96.24% accuracy and 95.20% recall for unknown malware families, highlighting its robustness against evolving Android threats.
MSC classes: 68T05, 68M25; ACM classes: D.4.6; I.2.6; K.6.5
[AI-65] Just Enough Shifts: Mitigating Over-Refusal in Aligned Language Models with Targeted Representation Fine-Tuning
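[Quick read]: This paper addresses over-refusals that arise during safety alignment of large language models (LLMs), where the model wrongly rejects benign prompts, harming user experience and model utility. The key solution is ACTOR (Activation-Based Training for Over-Refusal Reduction), a framework that uses internal activation patterns from diverse queries to identify and adjust the activation components that trigger refusals, giving stronger control over the refusal mechanism; fine-tuning only a single layer reduces over-refusals while maintaining the model's ability to handle harmful queries and its overall utility.
Link: https://arxiv.org/abs/2507.04250
Authors: Mahavir Dabas,Si Chen,Charles Fleming,Ming Jin,Ruoxi Jia
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: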
Abstract:Safety alignment is crucial for large language models (LLMs) to resist malicious instructions but often results in over-refusals, where benign prompts are unnecessarily rejected, impairing user experience and model utility. We introduce ACTOR (Activation-Based Training for Over-Refusal Reduction), a robust and compute- and data-efficient training framework that minimizes over-refusals by leveraging internal activation patterns from diverse queries. ACTOR precisely identifies and adjusts the activation components that trigger refusals, providing stronger control over the refusal mechanism. By fine-tuning only a single model layer, ACTOR effectively reduces over-refusals across multiple benchmarks while maintaining the model’s ability to handle harmful queries and preserve overall utility.
[AI-66] Scaling Context Requires Rethinking Attention
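[Quick read]: This paper addresses the limitations of transformers and sub-quadratic architectures for training at long sequence lengths: the cost of processing the context is too expensive in the former and too inexpensive in the latter. The authors propose power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains.
Link: https://arxiv.org/abs/2507.04239
Authors: Carles Gelada,Jacob Buckman,Sean Zhang,Txus Bach
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: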
Abstract:We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: the cost of processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention which reduce the cost-per-token of a transformer impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention show that these models dominate both exponential attention and linear attention at long-context training.
[AI-67] Design Optimization of Three-Dimensional Wire Arrangement Considering Wire Crossings for Tendon-driven Robots IROS2025
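[Quick read]: This paper addresses the optimization of wire arrangements in tendon-driven mechanisms, in particular for three-dimensional structures where wire crossings must be considered. Such designs have traditionally been made empirically, which becomes extremely challenging for complex structures, and existing studies often simplify the problem by restricting movement to a 2D plane, keeping the moment arm constant, or neglecting wire crossings. The key solution is a multi-objective black-box optimization method that ensures wires do not cross while providing sufficient joint torque along a defined target trajectory.
Link: https://arxiv.org/abs/2507.04235
Authors: Kento Kawaharazuka,Shintaro Inoue,Yuta Sahara,Keita Yoneda,Temma Suzuki,Kei Okada
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at IROS2025, Website - this https URL , YouTube - this https URL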
Abstract:Tendon-driven mechanisms are useful from the perspectives of variable stiffness, redundant actuation, and lightweight design, and they are widely used, particularly in hands, wrists, and waists of robots. The design of these wire arrangements has traditionally been done empirically, but it becomes extremely challenging when dealing with complex structures. Various studies have attempted to optimize wire arrangement, but many of them have oversimplified the problem by imposing conditions such as restricting movements to a 2D plane, keeping the moment arm constant, or neglecting wire crossings. Therefore, this study proposes a three-dimensional wire arrangement optimization that takes wire crossings into account. We explore wire arrangements through a multi-objective black-box optimization method that ensures wires do not cross while providing sufficient joint torque along a defined target trajectory. For a 3D link structure, we optimize the wire arrangement under various conditions, demonstrate its effectiveness, and discuss the obtained design solutions.
[AI-68] High-Resolution Sustain Pedal Depth Estimation from Piano Audio Across Room Acoustics
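[Quick read]: This paper addresses the limitation that piano sustain pedal detection has been treated as a binary on/off classification task, even though pedal depth significantly influences musical expression in real performances. The key solution is a Transformer-based architecture that estimates continuous pedal depth at high resolution, providing more musically meaningful predictions than the binary detection of baseline models. The paper also studies how room acoustics affect pedal estimation.
Link: https://arxiv.org/abs/2507.04230
Authors: Kun Fang,Hanwen Zhang,Ziyu Wang,Ichiro Fujinaga
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
Comments: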
Abstract:Piano sustain pedal detection has previously been approached as a binary on/off classification task, limiting its application in real-world piano performance scenarios where pedal depth significantly influences musical expression. This paper presents a novel approach for high-resolution estimation that predicts continuous pedal depth values. We introduce a Transformer-based architecture that not only matches state-of-the-art performance on the traditional binary classification task but also achieves high accuracy in continuous pedal depth estimation. Furthermore, by estimating continuous values, our model provides musically meaningful predictions for sustain pedal usage, whereas baseline models struggle to capture such nuanced expressions with their binary detection approach. Additionally, this paper investigates the influence of room acoustics on sustain pedal estimation using a synthetic dataset that includes varied acoustic conditions. We train our model with different combinations of room settings and test it in an unseen new environment using a “leave-one-out” approach. Our findings show that the two baseline models and ours are not robust to unseen room conditions. Statistical analysis further confirms that reverberation influences model predictions and introduces an overestimation bias.
[AI-69] Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties
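[Quick read]: This paper addresses the vulnerability of mobile GUI agents in real-world scenarios where screen content may be partially manipulated by untrustworthy third parties. The key solution is AgentHazard, a scalable attack simulation framework that enables flexible, targeted modification of screen content within existing applications, on which the authors build a benchmark suite of over 3,000 attack scenarios for evaluating the security and robustness of mobile GUI agents.
Link: https://arxiv.org/abs/2507.04227
Authors: Guohong Liu,Jialei Ye,Jiacheng Liu,Yuanchun Li,Wei Liu,Pengzhi Gao,Jian Luan,Yunxin Liu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: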
Abstract:Mobile GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with mobile screens. Despite notable advancements, their resilience in real-world scenarios where screen content may be partially manipulated by untrustworthy third parties remains largely unexplored. Owing to their black-box and autonomous nature, these agents are vulnerable to manipulations that could compromise user devices. In this work, we present the first systematic investigation into the vulnerabilities of mobile GUI agents. We introduce a scalable attack simulation framework AgentHazard, which enables flexible and targeted modifications of screen content within existing applications. Leveraging this framework, we develop a comprehensive benchmark suite comprising both a dynamic task execution environment and a static dataset of vision-language-action tuples, totaling over 3,000 attack scenarios. The dynamic environment encompasses 58 reproducible tasks in an emulator with various types of hazardous UI content, while the static dataset is constructed from 210 screenshots collected from 14 popular commercial apps. Importantly, our content modifications are designed to be feasible for unprivileged third parties. We evaluate 7 widely-used mobile GUI agents and 5 common backbone models using our benchmark. Our findings reveal that all examined agents are significantly influenced by misleading third-party content (with an average misleading rate of 28.8% in human-crafted attack scenarios) and that their vulnerabilities are closely linked to the employed perception modalities and backbone LLMs. Furthermore, we assess training-based mitigation strategies, highlighting both the challenges and opportunities for enhancing the robustness of mobile GUI agents. Our code and data will be released at this https URL.
[AI-70] Zero-Shot Cyclic Peptide Design with Composable Geometric Conditions
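[Quick read]: This paper addresses the underexplored problem of designing target-specific cyclic peptides, which is hampered by limited training data. The key solution is CP-Composer, a generative framework based on composable geometric constraints that decomposes complex cyclization patterns into unit constraints, incorporates them into a diffusion model through geometric conditioning on nodes and edges, and thereby enables zero-shot cyclic peptide generation.
Link: https://arxiv.org/abs/2507.04225
Authors: Dapeng Jiang,Xiangzhe Kong,Jiaqi Han,Mingyu Li,Rui Jiao,Wenbing Huang,Stefano Ermon,Jianzhu Ma,Yang Liu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: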
Abstract:Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge the gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite being trained on linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38% to 84% on different cyclization strategies.
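[AI-71] Model Collapse Is Not a Bug but a Feature in Machine Unlearning for LLMs
[Quick read]: This paper addresses a privacy problem in unlearning for large language models (LLMs): existing methods explicitly optimize on the unlearning targets in their training objectives, which risks reinforcing exposure to sensitive data and contradicts the principle of minimizing its use. The key solution is a new unlearning method, Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective and instead triggers model collapse partially on the sensitive data, removing private information from the model.
Link: https://arxiv.org/abs/2507.04219
Authors: Yan Scholten,Sophie Xhonneux,Stephan Günnemann,Leo Schwinn
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: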
Abstract:Current unlearning methods for LLMs optimize on the private information they seek to remove by incorporating it into their training objectives. We argue this not only risks reinforcing exposure to sensitive data, it also fundamentally contradicts the principle of minimizing its use. As a remedy, we propose a novel unlearning method - Partial Model Collapse (PMC), which does not require unlearning targets in the unlearning objective. Our approach is inspired by recent observations that training generative models on their own generations leads to distribution collapse, effectively removing information from the model. Our core idea is to leverage this collapse for unlearning by triggering collapse partially on the sensitive data. We theoretically analyze that our approach converges to the desired outcome, i.e. the LLM unlearns the information in the forget set. We empirically demonstrate that PMC overcomes two key limitations of existing unlearning approaches that explicitly optimize on unlearning targets, and more effectively removes private information from model outputs. Overall, our contributions represent an important step toward more comprehensive unlearning that aligns with real-world privacy constraints. Code available at this https URL.
[AI-72] Mpemba Effect in Large-Language Model Training Dynamics: A Minimal Analysis of the Valley-River model
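[Quick read]: This paper addresses the lack of a mechanistic explanation for learning rate (LR) schedules in large language model (LLM) training and the largely heuristic choice of plateau height and decay schedule. The key is a thermodynamic analogy via the Mpemba effect, applied to training dynamics on "valley-river" loss landscapes, which explains the necessity of the warm-up phase and motivates a high plateau learning rate to accelerate loss decrease during decay. The study further shows that for certain loss landscapes there is an optimal plateau learning rate, the "strong Mpemba point", at which the slowest mode vanishes, giving faster convergence during the decay phase.
Link: https://arxiv.org/abs/2507.04206
Authors: Sibei Liu,Zhijian Hu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: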
Abstract:Learning rate (LR) schedules in large language model (LLM) training often follow empirical templates: warm-up, constant plateau/stable phase, and decay (WSD). However, the mechanistic explanation for this strategy remains underexplored, and the choice of plateau height and decay schedule is largely heuristic. In this paper, we connect training dynamics to a thermodynamic analogy via the Mpemba effect - a phenomenon in which a hotter system cools faster than a colder one when quenched into the same bath. We analyze a class of “valley-river” loss landscapes, where sharp (valley) directions equilibrate quickly, while flatter (river) directions govern global descent. The Mpemba effect provides an explanation for the necessity of the warm-up phase and motivates a high plateau - rather than a low one - for accelerating loss decrease during decay. We show that for certain loss landscapes, there exists an optimal plateau learning rate - the “strong Mpemba point” - at which the slowest mode vanishes, resulting in faster convergence during the decay phase. We derive analytical conditions for its existence and estimate decay dynamics required to preserve the Mpemba advantage. Our minimal model and analysis offer a principled justification for plateau-based schedulers and provide guidance for tuning LR in LLMs with minimal hyperparameter sweep.
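The warm-up, stable, decay (WSD) template that this abstract analyzes can be written as a small function. This is a generic sketch, not the paper's model: linear warm-up and linear decay are assumptions (the abstract does not fix the decay shape), and the function and parameter names are hypothetical.

```python
def wsd_learning_rate(step: int, total_steps: int, peak_lr: float,
                      warmup_steps: int, decay_steps: int,
                      min_lr: float = 0.0) -> float:
    """Warm-up / stable plateau / decay schedule (linear ramps assumed)."""
    if step < warmup_steps:
        # Linear warm-up that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Constant plateau; its height is the quantity the paper studies.
        return peak_lr
    # Linear decay from peak_lr down to min_lr on the final step.
    progress = min((step - decay_start + 1) / decay_steps, 1.0)
    return max(min_lr, peak_lr + (min_lr - peak_lr) * progress)


# Usage: 100 steps, 10 warm-up steps, 20 decay steps, plateau at 1e-3.
schedule = [wsd_learning_rate(s, 100, 1e-3, 10, 20) for s in range(100)]
```

In this parameterization the plateau height is `peak_lr`, so sweeping it while holding the warm-up and decay lengths fixed is the kind of experiment the "strong Mpemba point" argument is about.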
[AI-73] Uncertainty Quantification in the Tsetlin Machine
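[Quick read]: This paper aims to further increase the explainability of Tsetlin machine (TM) predictions, which are already transparent because they are built from logical rules, by adding a probability score and uncertainty quantification. The key is a probability score that is inherent to any TM variant and derived from an analysis of the TM learning dynamics, together with new uncertainty quantification techniques. Simulated data show a clear connection between the learned probability scores and the underlying probabilities of the data, and show that the TM is less confident outside the training data domain, in contrast to the typical extrapolation behavior of artificial neural networks.
Link: https://arxiv.org/abs/2507.04175
Authors: Runar Helin,Ole-Christoffer Granmo,Mayur Kishor Shende,Lei Jiao,Vladimir I. Zadorozhny,Kunal Ganesh Dumbre,Rishad Shafik,Alex Yakovlev
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: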
Abstract:Data modeling using Tsetlin machines (TMs) is all about building logical rules from the data features. The decisions of the model are based on a combination of these logical rules. Hence, the model is fully transparent and it is possible to get explanations of its predictions. In this paper, we present a probability score for TM predictions and develop new techniques for uncertainty quantification to increase the explainability further. The probability score is an inherent property of any TM variant and is derived through an analysis of the TM learning dynamics. Simulated data is used to show a clear connection between the learned TM probability scores and the underlying probabilities of the data. A visualization of the probability scores also reveals that the TM is less confident in its predictions outside the training data domain, which contrasts with the typical extrapolation phenomenon found in Artificial Neural Networks. The paper concludes with an application of the uncertainty quantification techniques on an image classification task using the CIFAR-10 dataset, where they provide new insights and suggest possible improvements to current TM image classification models.
[AI-74] Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization
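[Quick read]: This paper addresses the Travelling Salesman Problem (TSP), whose core challenge is finding the shortest cyclic route that visits every city. The key is a non-autoregressive framework in which solutions emerge directly from learned permutations rather than from explicit search. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutation matrices via continuous relaxations, exploiting the inherent structure of the problem without sequential decision-making.
Link: https://arxiv.org/abs/2507.04164
Authors: Yimeng Min,Carla P. Gomes
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: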
Abstract:We propose a non-autoregressive framework for the Travelling Salesman Problem where solutions emerge directly from learned permutations without explicit search. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutation matrices via continuous relaxations. Our unsupervised approach achieves competitive performance against classical heuristics, demonstrating that the inherent structure of the problem can effectively guide combinatorial optimization without sequential decision-making.
[AI-75] Physics-informed neural networks and neural operators for a study of EUV electromagnetic wave diffraction from a lithography mask
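[Quick read]: This paper addresses the diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from a lithography mask, with the aim of accelerating lithography mask design workflows. The key solution is a hybrid Waveguide Neural Operator (WGNO) that replaces the most computationally expensive part of the traditional waveguide method with a neural network, significantly improving inference speed while maintaining high accuracy.
Link: https://arxiv.org/abs/2507.04153
Authors: Vasiliy A. Es’kin,Egor V. Ivanov
Affiliation: unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Optics (physics.optics)
Comments: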
Abstract:Physics-informed neural networks (PINNs) and neural operators (NOs) for solving the problem of diffraction of Extreme Ultraviolet (EUV) electromagnetic waves from a mask are presented. A novel hybrid Waveguide Neural Operator (WGNO) is introduced, which is based on a waveguide method with its most computationally expensive part replaced by a neural network. Numerical experiments on realistic 2D and 3D masks show that the WGNO achieves state-of-the-art accuracy and inference time, providing a highly efficient solution for accelerating the design workflows of lithography masks.
[AI-76] A Technical Survey of Reinforcement Learning Techniques for Large Language Models
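[Quick read]: This survey addresses key challenges of large language models (LLMs) in instruction following, ethical alignment, and reasoning. It centers on combining reinforcement learning (RL) with LLMs, covering algorithms such as Proximal Policy Optimization (PPO), Q-Learning, and Actor-Critic methods, and techniques tailored to LLMs such as Reinforcement Learning from Human Feedback (RLHF), AI Feedback (RLAIF), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). It analyzes their applications from code generation to tool-augmented reasoning and discusses persistent challenges such as reward hacking, computational costs, and scalable feedback collection.
Link: https://arxiv.org/abs/2507.04136
Authors: Saksham Sahai Srivastava,Vaneet Aggarwal
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, LaTeX source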
Abstract:Reinforcement Learning (RL) has emerged as a transformative approach for aligning and enhancing Large Language Models (LLMs), addressing critical challenges in instruction following, ethical alignment, and reasoning capabilities. This survey offers a comprehensive foundation on the integration of RL with language models, highlighting prominent algorithms such as Proximal Policy Optimization (PPO), Q-Learning, and Actor-Critic methods. Additionally, it provides an extensive technical overview of RL techniques specifically tailored for LLMs, including foundational methods like Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), as well as advanced strategies such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). We systematically analyze their applications across domains, i.e., from code generation to tool-augmented reasoning. We also present a comparative taxonomy based on reward modeling, feedback mechanisms, and optimization strategies. Our evaluation highlights key trends. RLHF remains dominant for alignment, and outcome-based RL such as RLVR significantly improves stepwise reasoning. However, persistent challenges such as reward hacking, computational costs, and scalable feedback collection underscore the need for continued innovation. We further discuss emerging directions, including hybrid RL algorithms, verifier-guided training, and multi-objective alignment frameworks. This survey serves as a roadmap for researchers advancing RL-driven LLM development, balancing capability enhancement with safety and scalability.
[AI-77] Addressing The Devastating Effects Of Single-Task Data Poisoning In Exemplar-Free Continual Learning
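[Quick read]: This paper addresses security problems caused by data poisoning in continual learning (CL). Whereas existing work mostly considers scenario-dependent attacks, it proposes a simpler and more realistic single-task poison (STP) threat in which the adversary can access only the current task in the data stream and has no knowledge of or access to the model or to previous and future tasks. The key finding is that even under these stringent conditions an adversary can degrade model performance with standard image corruptions, reducing both the stability and the plasticity of continual training. The paper also proposes a high-level defense framework for CL and a poisoned-task detection method based on task vectors.
Link: https://arxiv.org/abs/2507.04106
Authors: Stanisław Pawlak (1),Bartłomiej Twardowski (2, 3),Tomasz Trzciński (1, 2),Joost van de Weijer (3)
Affiliation: (1) Warsaw University of Technology, Poland; (2) IDEAS Research Institute, Poland; (3) Computer Vision Center, Universitat Autonoma de Barcelona, Spain
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at CoLLAs 2025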
Abstract:Our research addresses the overlooked security concerns related to data poisoning in continual learning (CL). Data poisoning - the intentional manipulation of training data to affect the predictions of machine learning models - was recently shown to be a threat to CL training stability. While existing literature predominantly addresses scenario-dependent attacks, we propose to focus on simpler and more realistic single-task poison (STP) threats. In contrast to previously proposed poisoning settings, in STP adversaries lack knowledge of and access to the model, as well as to both previous and future tasks. During an attack, they only have access to the current task within the data stream. Our study demonstrates that even within these stringent conditions, adversaries can compromise model performance using standard image corruptions. We show that STP attacks are able to strongly disrupt the whole continual training process: decreasing both the stability (its performance on past tasks) and plasticity (capacity to adapt to new tasks) of the algorithm. Finally, we propose a high-level defense framework for CL along with a poison task detection method based on task vectors. The code is available at this https URL.
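[AI-78] Enhancing Robustness of LLM-Driven Multi-Agent Systems through Randomized Smoothing
【Quick read】: This paper addresses the safety of large language model (LLM)-based multi-agent systems (MAS) in safety-critical domains such as aerospace. The key to the solution is applying randomized smoothing, a statistical robustness certification technique, to provide probabilistic guarantees for the MAS consensus process under adversarial influence. The method uses a two-stage adaptive sampling mechanism that balances robustness and computational efficiency, and it operates in black-box settings, effectively preventing the propagation of adversarial behaviors and hallucinations while maintaining consensus performance.
Link: https://arxiv.org/abs/2507.04105
Authors: Jinwei Hu,Yi Dong,Zhengtao Ding,Xiaowei Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Preprint accepted by Chinese Journal of Aeronautics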
Abstract:This paper presents a defense framework for enhancing the safety of large language model (LLM) empowered multi-agent systems (MAS) in safety-critical domains such as aerospace. We apply randomized smoothing, a statistical robustness certification technique, to the MAS consensus context, enabling probabilistic guarantees on agent decisions under adversarial influence. Unlike traditional verification methods, our approach operates in black-box settings and employs a two-stage adaptive sampling mechanism to balance robustness and computational efficiency. Simulation results demonstrate that our method effectively prevents the propagation of adversarial behaviors and hallucinations while maintaining consensus performance. This work provides a practical and scalable path toward safe deployment of LLM-based MAS in real-world, high-stakes environments.
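Illustration (not from the paper): a minimal sketch of the general randomized-smoothing idea named in the abstract, assuming a numeric input and Gaussian noise. It does not reproduce the paper's two-stage adaptive sampling mechanism. The names `smoothed_decision`, `toy_agent`, `sigma`, and `n_samples` are hypothetical, and the toy agent merely stands in for a black-box LLM agent.

```python
import random
from collections import Counter


def smoothed_decision(base_decide, x, sigma=0.5, n_samples=200, seed=0):
    """Majority vote of a black-box decision function under Gaussian input noise.

    Returns the winning label and its empirical vote share.
    All parameter names are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        # Perturb every input coordinate independently, then query the black box.
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_decide(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def toy_agent(x):
    # Hypothetical stand-in for an agent: accept when the mean signal is positive.
    return "accept" if sum(x) / len(x) > 0 else "reject"


label, share = smoothed_decision(toy_agent, [2.0, 2.0, 2.0, 2.0])
print(label, share)
```

A certification procedure would additionally turn the vote share into a statistical lower bound before trusting the decision; that step is omitted here.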
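[AI-79] How to Train Your LLM Web Agent: A Statistical Diagnosis
【Quick read】: This paper addresses the widening performance gap between open-source and closed-source LLM-based web agents, and the limited progress caused by a narrow focus on single-step tasks, which overlooks the complexity of multi-step web interaction, and by high post-training compute costs. The key to the solution is a statistically grounded approach to compute allocation built on a two-stage pipeline: supervised fine-tuning (SFT) first trains a smaller Llama 3.1 8B model to imitate a larger Llama 3.3 70B model, followed by on-policy reinforcement learning. Because the process is highly sensitive to hyperparameter choices, the authors sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters, reducing wasted compute. On WorkArena and MiniWob++, the combined approach outperforms either method alone while significantly lowering compute requirements.
Link: https://arxiv.org/abs/2507.04103
Authors: Dheeraj Vattikonda,Santhoshi Ravichandran,Emiliano Penaloza,Hadi Nekoei,Megh Thakkar,Thibault Le Sellier de Chezelles,Nicolas Gontier,Miguel Muñoz-Mármol,Sahar Omidi Shayegan,Stefania Raimondo,Xue Liu,Alexandre Drouin,Laurent Charlin,Alexandre Piché,Alexandre Lacoste,Massimo Caccia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: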
Abstract:LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
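Illustration (not from the paper): one simple way bootstrapping can be applied to a set of sampled training runs, assuming each run is recorded as a (hyperparameter value, score) pair. The runs are resampled with replacement, and the sketch counts how often each value produces the best run. The function name `bootstrap_win_rate` and the scores below are made up for this sketch; the paper's actual estimation procedure may differ.

```python
import random
from collections import Counter


def bootstrap_win_rate(runs, n_boot=1000, seed=0):
    """Estimate how often each hyperparameter value yields the best resampled run.

    runs: list of (hyperparameter_value, score) pairs from independent runs.
    Returns a dict mapping each winning value to its win frequency.
    """
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_boot):
        # Resample the runs with replacement, keeping the original sample size.
        sample = [rng.choice(runs) for _ in runs]
        best_value, _ = max(sample, key=lambda run: run[1])
        wins[best_value] += 1
    return {value: count / n_boot for value, count in wins.items()}


# Hypothetical learning-rate sweep; the scores are invented for illustration.
runs = [
    (1e-5, 0.42), (1e-5, 0.45),
    (1e-4, 0.61), (1e-4, 0.58),
    (1e-3, 0.30), (1e-3, 0.33),
]
rates = bootstrap_win_rate(runs)
print(rates)
```

Reporting a win frequency rather than a single best run makes the recommendation less sensitive to one lucky configuration.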
[AI-80] Hierarchical Testing with Rabbit Optimization for Industrial Cyber-Physical Systems
【Quick read】: This paper addresses robustness evaluation of deep learning-based Prognostics and Health Management (PHM) systems in Industrial Cyber-Physical Systems (ICPS). The key to the solution is HERO (Hierarchical Testing with Rabbit Optimization), a black-box adversarial testing framework that uses Artificial Rabbit Optimization to generate physically constrained adversarial examples consistent with real-world data distributions, thereby exposing vulnerabilities in advanced PHM models.
Link: https://arxiv.org/abs/2507.04100
Authors: Jinwei Hu,Zezhi Tang,Xin Jin,Benyuan Zhang,Yi Dong,Xiaowei Huang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Preprint accepted by IEEE Transactions on Industrial Cyber Physical Systems
Abstract:This paper presents HERO (Hierarchical Testing with Rabbit Optimization), a novel black-box adversarial testing framework for evaluating the robustness of deep learning-based Prognostics and Health Management systems in Industrial Cyber-Physical Systems. Leveraging Artificial Rabbit Optimization, HERO generates physically constrained adversarial examples that align with real-world data distributions via global and local perspectives. Its generalizability ensures applicability across diverse ICPS scenarios. This study specifically focuses on the Proton Exchange Membrane Fuel Cell system, chosen for its highly dynamic operational conditions, complex degradation mechanisms, and increasing integration into ICPS as a sustainable and efficient energy solution. Experimental results highlight HERO’s ability to uncover vulnerabilities in even state-of-the-art PHM models, underscoring the critical need for enhanced robustness in real-world applications. By addressing these challenges, HERO demonstrates its potential to advance more resilient PHM systems across a wide range of ICPS domains.
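[AI-81] Human-centered AI with focus on Human-robot interaction (Book chapter)
【Quick read】: This paper addresses the challenges that modern social robots face when interacting with humans, stressing that these robots must be human-centered to meet users' needs. The key to the solution is a new framework called the Dual Pyramid which, building on the principles of human-centered AI, systematically covers human needs ranging from the most fundamental one, robot effectiveness, to macro-level collaboration with robots in achieving the United Nations' 17 Sustainable Development Goals.
Link: https://arxiv.org/abs/2507.04095
Authors: Alireza Mortezapour,Giuliana Vitiello
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: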
Abstract:Modern social robots can be considered the descendants of steam engines from the First Industrial Revolution (IR 1.0) and industrial robotic arms from the Third Industrial Revolution (IR 3.0). As some time has passed since the introduction of these robots during the Fourth Industrial Revolution (IR 4.0), challenges and issues in their interaction with humans have emerged, leading researchers to conclude that, like any other AI-based technology, these robots must also be human-centered to meet the needs of their users. This chapter aims to introduce humans and their needs in interactions with robots, ranging from short-term, one-on-one interactions (micro-level) to long-term, macro-level needs at the societal scale. Building upon the principles of human-centered AI, this chapter presents, for the first time, a new framework of human needs called the Dual Pyramid. This framework encompasses a comprehensive list of human needs in robot interactions, from the most fundamental, robot effectiveness to macro level requirements, such as the collaboration with robots in achieving the United Nations 17 Sustainable Development Goals.
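[AI-82] HAWK: A Hierarchical Workflow Framework for Multi-Agent Collaboration SIGIR2025
【Quick read】: This paper addresses persistent challenges in contemporary multi-agent systems: cross-platform interoperability, dynamic task scheduling, and efficient resource sharing. The key to the solution is Hierarchical Agent Workflow (HAWK), a modular framework with five layers (User, Workflow, Operator, Agent, and Resource) supported by sixteen standardized interfaces. At HAWK's core is an adaptive scheduling and optimization module in the Workflow Layer that uses real-time feedback and dynamic strategy adjustment to maximize resource utilization, while the Resource Layer provides a unified abstraction over heterogeneous data sources, large models, physical devices, and third-party services and tools, simplifying cross-domain information retrieval.
Link: https://arxiv.org/abs/2507.04067
Authors: Yuyang Cheng,Yumiao Xu,Chaojia Yu,Yong Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: AgentIR@SIGIR 2025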
Abstract:Contemporary multi-agent systems encounter persistent challenges in cross-platform interoperability, dynamic task scheduling, and efficient resource sharing. Agents with heterogeneous implementations often lack standardized interfaces; collaboration frameworks remain brittle and hard to extend; scheduling policies are static; and inter-agent state synchronization is insufficient. We propose Hierarchical Agent Workflow (HAWK), a modular framework comprising five layers (User, Workflow, Operator, Agent, and Resource) and supported by sixteen standardized interfaces. HAWK delivers an end-to-end pipeline covering task parsing, workflow orchestration, intelligent scheduling, resource invocation, and data synchronization. At its core lies an adaptive scheduling and optimization module in the Workflow Layer, which harnesses real-time feedback and dynamic strategy adjustment to maximize utilization. The Resource Layer provides a unified abstraction over heterogeneous data sources, large models, physical devices, and third-party services and tools, simplifying cross-domain information retrieval. We demonstrate HAWK’s scalability and effectiveness via CreAgentive, a multi-agent novel-generation prototype, which achieves marked gains in throughput, lowers invocation complexity, and improves system controllability. We also show how hybrid deployments of large language models integrate seamlessly within HAWK, highlighting its flexibility. Finally, we outline future research avenues (hallucination mitigation, real-time performance tuning, and enhanced cross-domain adaptability) and survey prospective applications in healthcare, government, finance, and education.
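[AI-83] Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
【Quick read】: This paper addresses Malware Family Classification (MFC): identifying the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, rather than making only a binary yes/no decision. The key to the solution is exploring the feasibility of traditional binary string features in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG), and investigating how Family-Specific String (FSS) features can be used in a RAG-like manner for MFC.
Link: https://arxiv.org/abs/2507.04055
Authors: Yufan Chen,Daoyuan Wu,Juantao Zhong,Zicheng Zhang,Debin Gao,Shuai Wang,Yingjiu Li,Ning Liu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: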
Abstract:Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification that predicts only a Yes/No. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for MFC in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features could be utilized in a manner similar to RAG to facilitate MFC. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
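[AI-84] Predictive Modeling of Effluent Temperature in SAT Systems Using Ambient Meteorological Data: Implications for Infiltration Management
【Quick read】: This paper addresses accurate prediction of effluent temperature in recharge basins, which is essential for optimizing the Soil Aquifer Treatment (SAT) process because temperature directly affects water viscosity and infiltration rates. Using ambient meteorological data, the study develops and evaluates several predictive models, including multiple linear regression (MLR), neural networks (NN), and random forests (RF). The MLR model, preferred for its operational simplicity and robust performance, achieves high predictive accuracy (R² = 0.86–0.87) and is used to estimate effluent temperatures over a 10-year period. The key contributions are showing the governing role of topsoil temperature in the thermal profile of the infiltrating effluent and providing practical equations for real-time monitoring and long-term planning of SAT operations.
Link: https://arxiv.org/abs/2507.04050
Authors: Roy Elkayam
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: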
Abstract:Accurate prediction of effluent temperature in recharge basins is essential for optimizing the Soil Aquifer Treatment (SAT) process, as temperature directly influences water viscosity and infiltration rates. This study develops and evaluates predictive models for effluent temperature in the upper recharge layer of a Shafdan SAT system recharge basin using ambient meteorological data. Multiple linear regression (MLR), neural networks (NN), and random forests (RF) were tested for their predictive accuracy and interpretability. The MLR model, preferred for its operational simplicity and robust performance, achieved high predictive accuracy (R² = 0.86–0.87) and was used to estimate effluent temperatures over a 10-year period. Results highlight pronounced seasonal temperature cycles and the importance of topsoil temperature in governing the thermal profile of the infiltrating effluent. The study provides practical equations for real-time monitoring and long-term planning of SAT operations.
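[AI-85] Evaluating the Effectiveness of Large Language Models in Solving Simple Programming Tasks: A User-Centered Study
【Quick read】: This paper addresses how large language models (LLMs) should be designed to interact with users so as to improve learning and programming performance. The study examines how different interaction styles (passive, proactive, and collaborative) affect users completing simple programming tasks. The key to the solution is the collaborative style, in which the model guides, prompts, and responds through back-and-forth interaction with the user; it significantly improves task completion time, user satisfaction, and perceived helpfulness, indicating that LLM design should emphasize interactive, adaptive, and user-centered experiences.
Link: https://arxiv.org/abs/2507.04043
Authors: Kai Deng
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: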
Abstract:As large language models (LLMs) become more common in educational tools and programming environments, questions arise about how these systems should interact with users. This study investigates how different interaction styles with ChatGPT-4o (passive, proactive, and collaborative) affect user performance on simple programming tasks. I conducted a within-subjects experiment where fifteen high school students participated, completing three problems under three distinct versions of the model. Each version was designed to represent a specific style of AI support: responding only when asked, offering suggestions automatically, or engaging the user in back-and-forth this http URL analysis revealed that the collaborative interaction style significantly improved task completion time compared to the passive and proactive conditions. Participants also reported higher satisfaction and perceived helpfulness when working with the collaborative version. These findings suggest that the way an LLM communicates, how it guides, prompts, and responds, can meaningfully impact learning and performance. This research highlights the importance of designing LLMs that go beyond functional correctness to support more interactive, adaptive, and user-centered experiences, especially for novice programmers.
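[AI-86] Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments
【Quick read】: This paper addresses the gap between static benchmarks and the dynamic nature of real-world legal practice, a key barrier to advancing legal intelligence. The key to the solution is J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents; guided by legal experts, it comprises six representative scenarios from Chinese legal practice across three levels of environmental complexity. The paper also introduces J1-EVAL, a fine-grained evaluation framework that assesses task performance and procedural compliance across varying levels of legal proficiency.
Link: https://arxiv.org/abs/2507.04037
Authors: Zheng Jia,Shengbin Yue,Wei Chen,Siyuan Wang,Yidong Liu,Yun Song,Zhongyu Wei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: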
Abstract:The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.
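[AI-87] Lyria: A General LLM-Driven Genetic Algorithm Framework for Problem Solving
【Quick read】: This paper addresses the limitations of large language models (LLMs) on complex problems, which are typically characterized by multi-objective optimization, precise constraint satisfaction, and immense solution spaces. The key to the solution is combining the strong semantic understanding of LLMs with the global search and optimization capability of genetic algorithms in Lyria, a general LLM-driven genetic algorithm framework comprising 7 essential components.
Link: https://arxiv.org/abs/2507.04034
Authors: Weizhi Tang,Kwabena Nuamah,Vaishak Belle
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: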
Abstract:While Large Language Models (LLMs) have demonstrated impressive abilities across various domains, they still struggle with complex problems characterized by multi-objective optimization, precise constraint satisfaction, immense solution spaces, etc. To address the limitation, drawing on the superior semantic understanding ability of LLMs and also the outstanding global search and optimization capability of genetic algorithms, we propose to capitalize on their respective strengths and introduce Lyria, a general LLM-driven genetic algorithm framework, comprising 7 essential components. Through conducting extensive experiments with 4 LLMs across 3 types of problems, we demonstrated the efficacy of Lyria. Additionally, with 7 additional ablation experiments, we further systematically analyzed and elucidated the factors that affect its performance.
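[AI-88] Leveraging Multimodal Data and Side Users for Diffusion Cross-Domain Recommendation
【Quick read】: This paper addresses two issues in cross-domain recommendation (CDR): the underutilization of multimodal data and the neglect of side users who interact only within the target domain. The key to the solution is MuSiC, a diffusion cross-domain recommendation model that leverages multimodal data and side users. It extracts item multimodal features with a multimodal large language model and obtains user features through prompt learning; a cross-domain diffusion module then learns the target-domain feature distribution from side users and the cross-domain transformation patterns from overlapping users, and generates target-domain feature vectors for cold-start users, enabling effective cross-domain recommendation.
Link: https://arxiv.org/abs/2507.04000
Authors: Fan Zhang,Jinpeng Chen,Huan Li,Senzhang Wang,Yuan Cao,Kaimin Wei,JianXiang He,Feifei Kou,Jinqing Wang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: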
Abstract:Cross-domain recommendation (CDR) aims to address the persistent cold-start problem in Recommender Systems. Current CDR research concentrates on transferring cold-start users’ information from the auxiliary domain to the target domain. However, these systems face two main issues: the underutilization of multimodal data, which hinders effective cross-domain alignment, and the neglect of side users who interact solely within the target domain, leading to inadequate learning of the target domain’s vector space distribution. To address these issues, we propose a model leveraging Multimodal data and Side users for diffusion Cross-domain recommendation (MuSiC). We first employ a multimodal large language model to extract item multimodal features and leverage a large language model to uncover user features using prompt learning without fine-tuning. Secondly, we propose the cross-domain diffusion module to learn the generation of feature vectors in the target domain. This approach involves learning feature distribution from side users and understanding the patterns in cross-domain transformation through overlapping users. Subsequently, the trained diffusion module is used to generate feature vectors for cold-start users in the target domain, enabling the completion of cross-domain recommendation tasks. Finally, our experimental evaluation of the Amazon dataset confirms that MuSiC achieves state-of-the-art performance, significantly outperforming all selected baselines. Our code is available: this https URL.
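[AI-89] Toward Better Generalisation in Uncertainty Estimators: Leveraging Data-Agnostic Features
【Quick read】: This paper addresses the problem that large language models (LLMs) often generate responses that are expressed confidently yet factually incorrect, and aims to improve how accurately the correctness of model outputs can be estimated. The key to the solution is training probes on hidden states to assess the factuality of outputs, while exploring whether combining data-agnostic features with hidden-state features improves generalisation across tasks and domains.
Link: https://arxiv.org/abs/2507.03998
Authors: Thuy An Ha,Bao Quoc Vo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: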
Abstract:Large Language Models (LLMs) often generate responses that are factually incorrect yet expressed with high confidence, which can pose serious risks for end users. To address this, it is essential for LLMs not only to produce answers but also to provide accurate estimates of their correctness. Uncertainty quantification methods have been introduced to assess the quality of LLM outputs, with factual accuracy being a key aspect of that quality. Among these methods, those that leverage hidden states to train probes have shown particular promise, as these internal representations encode information relevant to the factuality of responses, making this approach the focus of this paper. However, the probe trained on the hidden states of one dataset often struggles to generalise to another dataset of a different task or domain. To address this limitation, we explore combining data-agnostic features with hidden-state features and assess whether this hybrid feature set enhances out-of-domain performance. We further examine whether selecting only the most informative hidden-state features, thereby discarding task-specific noise, enables the data-agnostic features to contribute more effectively. The experiment results indicate that although introducing data-agnostic features generally enhances generalisation performance in most cases, in certain scenarios their inclusion degrades performance. A similar pattern emerges when retaining only the most important hidden-state features - adding data-agnostic features does not consistently further enhance performance compared to using the full set of hidden-state features. A closer analysis reveals that, in some specific cases, the trained probe underweights the data-agnostic features relative to the hidden-state features, which we believe is the main reason why the results are inconclusive.
[AI-90] Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
【Quick read】: This paper aims to improve the performance of tabular foundation models such as TabPFN on small datasets. The key to the solution is a targeted continued pre-training phase on a small, curated collection of large real-world datasets, which yields significantly better downstream predictive accuracy than using broader but potentially noisier corpora such as CommonCrawl or GitTables.
Link: https://arxiv.org/abs/2507.03971
Authors: Anurag Garg,Muhammad Ali,Noah Hollmann,Lennart Purucker,Samuel Müller,Frank Hutter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Abstract:Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
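[AI-91] Optimizing Age of Trust and Throughput in Multi-Hop UAV-Aided IoT Networks
【Quick read】: This paper addresses the vulnerability of devices deployed in wide-area Internet of Things (IoT) networks, which are geographically dispersed and may be unguarded, and studies how an Unmanned Aerial Vehicle (UAV)-aided attestation framework can improve device security and network efficiency. The key to the solution is using Deep Reinforcement Learning (DRL) to optimize the UAV's charging schedule and the selection of devices to attest during each flight, balancing the impact of devices being offline on data delivery against the constraints that time-varying energy arrivals at the charging station place on the UAV's flight duration and charging schedule.
Link: https://arxiv.org/abs/2507.03950
Authors: Yizhou Luo,Kwan-Wu Chin,Ruyi Guan,Xi Xiao,Caimeng Wang,Jingyin Feng,Tengjiao He
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: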
Abstract:Devices operating in Internet of Things (IoT) networks may be deployed across vast geographical areas and interconnected via multi-hop communications. Further, they may be unguarded. This makes them vulnerable to attacks and motivates operators to check on devices frequently. To this end, we propose and study an Unmanned Aerial Vehicle (UAV)-aided attestation framework for use in IoT networks with a charging station powered by solar. A key challenge is optimizing the trajectory of the UAV to ensure it attests as many devices as possible. A trade-off here is that devices being checked by the UAV are offline, which affects the amount of data delivered to a gateway. Another challenge is that the charging station experiences time-varying energy arrivals, which in turn affect the flight duration and charging schedule of the UAV. To address these challenges, we employ a Deep Reinforcement Learning (DRL) solution to optimize the UAV’s charging schedule and the selection of devices to be attested during each flight. The simulation results show that our solution reduces the average age of trust by 88% and throughput loss due to attestation by 30%.
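[AI-92] An ASP-Based Framework for MUSes
【Quick read】: This paper addresses efficient enumeration and counting of minimal unsatisfiable subsets (MUSes) of a given unsatisfiable formula. The key to the solution is MUS-ASP, a framework based on Answer Set Programming (ASP) that translates MUS enumeration into answer set solving, thereby leveraging the computational efficiency of modern ASP systems for online enumeration of MUSes. Experiments show notable acceleration in both MUS enumeration and counting, particularly when the framework is integrated within hybrid solvers.
Link: https://arxiv.org/abs/2507.03929
Authors: Mohimenul Kabir,Kuldeep S Meel
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: To appear in ICLP 2025 Technical Communication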
Abstract:Given an unsatisfiable formula, understanding the core reason for unsatisfiability is crucial in several applications. One effective way to capture this is through the minimal unsatisfiable subset (MUS), the subset-minimal set of clauses that remains unsatisfiable. Current research broadly focuses on two directions: (i) enumerating as many MUSes as possible within a given time limit, and (ii) counting the total number of MUSes for a given unsatisfiable formula. In this paper, we introduce an answer set programming-based framework, named MUS-ASP, designed for online enumeration of MUSes. ASP is a powerful tool for its strengths in knowledge representation and is particularly suitable for specifying complex combinatorial problems. By translating MUS enumeration into answer set solving, MUS-ASP leverages the computational efficiency of state-of-the-art ASP systems. Our extensive experimental evaluation demonstrates the effectiveness of MUS-ASP and highlights the acceleration in both MUS enumeration and counting tasks, particularly when integrated within hybrid solvers, including the framework proposed in this paper.
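Illustration (not from the paper): a brute-force sketch of what a minimal unsatisfiable subset is, assuming clauses are written as tuples of DIMACS-style integer literals (`2` for x2, `-2` for its negation). This is exhaustive search for tiny formulas only and has nothing to do with the ASP encoding used by MUS-ASP; the function names are hypothetical.

```python
from itertools import combinations, product


def is_satisfiable(clauses):
    """Check satisfiability by trying every truth assignment (tiny inputs only)."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A clause holds if at least one literal matches the assignment.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def is_mus(clauses):
    """A MUS is unsatisfiable, yet dropping any single clause makes it satisfiable."""
    if is_satisfiable(clauses):
        return False
    return all(is_satisfiable(clauses[:i] + clauses[i + 1:])
               for i in range(len(clauses)))


def enumerate_muses(clauses):
    """Enumerate every MUS by testing all non-empty subsets of the clause list."""
    found = []
    for size in range(1, len(clauses) + 1):
        for subset in combinations(clauses, size):
            if is_mus(list(subset)):
                found.append(subset)
    return found


# x1, NOT x1, (x1 OR x2), NOT x2
formula = [(1,), (-1,), (1, 2), (-2,)]
muses = enumerate_muses(formula)
print(muses)  # [((1,), (-1,)), ((-1,), (1, 2), (-2,))]
```

The example formula has two MUSes, which shows why enumeration and counting are separate tasks from finding a single unsatisfiable core.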
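[AI-93] CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate ACL2025
【Quick read】: This paper addresses key issues of a single Large Language Model (LLM), namely hallucination and inadequate reasoning, as well as two problems in existing Multi-Agent Debate (MAD) methods: overly long input contexts that degrade performance, and overconfidence. The key to the solution is CortexDebate, a new MAD method that builds a sparse, dynamically optimized debating graph in which each LLM agent debates only with the agents that are helpful to it. To optimize the graph, it introduces the McKinsey-based Debate Matter (MDM) module, which uses the McKinsey Trust Formula from sociology to produce credible evaluations.
Link: https://arxiv.org/abs/2507.03928
Authors: Yiliu Sun,Zicheng Zhao,Sheng Wan,Chen Gong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted by ACL 2025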
Abstract:Nowadays, a single Large Language Model (LLM) struggles with critical issues such as hallucination and inadequate reasoning abilities. To mitigate these issues, Multi-Agent Debate (MAD) has emerged as an effective strategy, where LLM agents engage in in-depth debates with others on tasks. However, existing MAD methods face two major issues: (a) overly lengthy input contexts, which cause LLM agents to get lost in plenty of input information and experience a performance drop; and (b) the overconfidence dilemma, where self-assured LLM agents dominate the debate, leading to low debating effectiveness. To address these limitations, we propose a novel MAD method called “CortexDebate”. Inspired by the human brain’s tendency to establish a sparse and dynamically optimized network among cortical areas governed by white matter, CortexDebate constructs a sparse debating graph among LLM agents, where each LLM agent only debates with the ones that are helpful to it. To optimize the graph, we propose a module named McKinsey-based Debate Matter (MDM), which acts as an artificial analog to white matter. By integrating the McKinsey Trust Formula, a well-established measure of trustworthiness from sociology, MDM enables credible evaluations that guide graph optimization. The effectiveness of our CortexDebate has been well demonstrated by extensive experimental results across eight datasets from four task types.
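[AI-94] Agent Exchange: Shaping the Future of AI Agent Economics
【Quick read】: This paper addresses how AI agents powered by Large Language Models (LLMs) can move from passive computational tools to autonomous economic actors, giving rise to an agent-centric economy. The key to the solution is Agent Exchange (AEX), a specialized auction platform designed to support the dynamics of the AI agent marketplace. AEX offers optimized infrastructure for agent coordination and economic participation and, inspired by Real-Time Bidding (RTB) systems in online advertising, serves as the central auction engine coordinating interactions among the User-Side Platform, the Agent-Side Platform, Agent Hubs, and the Data Management Platform.
Link: https://arxiv.org/abs/2507.03904
Authors: Yingxuan Yang,Ying Wen,Jun Wang,Weinan Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: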
Abstract:The rise of Large Language Models (LLMs) has transformed AI agents from passive computational tools into autonomous economic actors. This shift marks the emergence of the agent-centric economy, in which agents take on active economic roles: exchanging value, making strategic decisions, and coordinating actions with minimal human oversight. To realize this vision, we propose Agent Exchange (AEX), a specialized auction platform designed to support the dynamics of the AI agent marketplace. AEX offers an optimized infrastructure for agent coordination and economic participation. Inspired by Real-Time Bidding (RTB) systems in online advertising, AEX serves as the central auction engine, facilitating interactions among four ecosystem components: the User-Side Platform (USP), which translates human goals into agent-executable tasks; the Agent-Side Platform (ASP), responsible for capability representation, performance tracking, and optimization; Agent Hubs, which coordinate agent teams and participate in AEX-hosted auctions; and the Data Management Platform (DMP), ensuring secure knowledge sharing and fair value attribution. We outline the design principles and system architecture of AEX, laying the groundwork for agent-based economic infrastructure in future AI ecosystems.
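[AI-95] TayFCS: Towards Light Feature Combination Selection for Deep Recommender Systems
【Quick read】: This paper addresses feature interaction modeling in deep recommendation models, specifically how to efficiently select useful high-order feature combinations to reduce noise and manage memory consumption. Traditional feature selection methods are typically limited to individual features, and extending them to high-order combinations is hard because evaluating combinations one by one causes exponential growth in time complexity. The key to the solution is TayFCS, a lightweight feature combination selection method. Its Taylor Expansion Scorer (TayScorer) performs a field-wise Taylor expansion on the base model, approximating the importance of feature combinations from the gradients of their sub-components without repeated experiments, using a single backward pass. Logistic Regression Elimination (LRE) is then introduced to reduce information redundancy between feature combinations and their sub-components, improving effectiveness and efficiency.
Link: https://arxiv.org/abs/2507.03895
Authors: Xianquan Wang,Zhaocheng Du,Jieming Zhu,Chuhan Wu,Qinglin Jia,Zhenhua Dong
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: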
Abstract:Feature interaction modeling is crucial for deep recommendation models. A common and effective approach is to construct explicit feature combinations to enhance model performance. However, in practice, only a small fraction of these combinations are truly informative. Thus it is essential to select useful feature combinations to reduce noise and manage memory consumption. While feature selection methods have been extensively studied, they are typically limited to selecting individual features. Extending these methods for high-order feature combination selection presents a significant challenge due to the exponential growth in time complexity when evaluating feature combinations one by one. In this paper, we propose TayFCS, a lightweight feature combination selection method that significantly improves model performance. Specifically, we propose the Taylor Expansion Scorer (TayScorer) module for field-wise Taylor expansion on the base model. Instead of evaluating all potential feature combinations’ importance by repeatedly running experiments with feature adding and removal, this scorer only needs to approximate the importance based on their sub-components’ gradients. This can be simply computed with one backward pass based on a trained recommendation model. To further reduce information redundancy among feature combinations and their sub-components, we introduce Logistic Regression Elimination (LRE), which estimates the corresponding information gain based on the model prediction performance. Experimental results on three benchmark datasets validate both the effectiveness and efficiency of our approach. Furthermore, online A/B test results demonstrate its practical applicability and commercial value.
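[AI-96] LLMs model how humans induce logically structured rules
[Quick read]: This paper addresses a central problem in cognitive science: giving a computationally explicit account of the structure of the mind and its development, namely what the primitive representational building blocks of cognition are, how those primitives combine, and where they come from. The key to the proposed approach is to use large language models (LLMs) as computational models. Experiments show that, on a task involving the induction of rules over logical concepts, LLMs fit human behavior at least as well as models implementing a Bayesian probabilistic language of thought (pLoT), and that they make qualitatively different predictions about the nature of the rules inferred and deployed. This suggests that LLMs may represent a new theoretical account of the primitive representations and computations underlying human logical concepts.
Link: https://arxiv.org/abs/2507.03876
Authors: Alyssa Loo,Ellie Pavlick,Roman Feiman
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: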
Abstract:A central goal of cognitive science is to provide a computationally explicit account of both the structure of the mind and its development: what are the primitive representational building blocks of cognition, what are the rules via which those primitives combine, and where do these primitives and rules come from in the first place? A long-standing debate concerns the adequacy of artificial neural networks as computational models that can answer these questions, in particular in domains related to abstract cognitive function, such as language and logic. This paper argues that recent advances in neural networks, specifically the advent of large language models (LLMs), represent an important shift in this debate. We test a variety of LLMs on an existing experimental paradigm used for studying the induction of rules formulated over logical concepts. Across four experiments, we find converging empirical evidence that LLMs provide at least as good a fit to human behavior as models that implement a Bayesian probabilistic language of thought (pLoT), which have been the best computational models of human behavior on the same task. Moreover, we show that the LLMs make qualitatively different predictions about the nature of the rules that are inferred and deployed in order to complete the task, indicating that the LLM is unlikely to be a mere implementation of the pLoT solution. Based on these results, we argue that LLMs may instantiate a novel theoretical account of the primitive representations and computations necessary to explain human logical concepts, with which future work in cognitive science should engage.
[AI-97] Enhancing Adaptive Behavioral Interventions with LLM Inference from Participant-Described States ALT
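[Quick read]: This paper addresses the limited applicability of reinforcement learning (RL) methods in health behavior interventions caused by data scarcity. The key to the solution is to expand the state space of an adaptive intervention with natural language descriptions while preserving data efficiency. Specifically, the method lets intervention participants describe aspects of their current state in natural language, and uses inference with pre-trained large language models (LLMs) to better align the policy of a base RL method with these state descriptions.
Link: https://arxiv.org/abs/2507.03871
Authors: Karine Karine,Benjamin M. Marlin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Accepted at Machine Learning for Healthcare (MLHC) 2025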
Abstract:The use of reinforcement learning (RL) methods to support health behavior change via personalized and just-in-time adaptive interventions is of significant interest to health and behavioral science researchers focused on problems such as smoking cessation support and physical activity promotion. However, RL methods are often applied to these domains using a small collection of context variables to mitigate the significant data scarcity issues that arise from practical limitations on the design of adaptive intervention trials. In this paper, we explore an approach to significantly expanding the state space of an adaptive intervention without impacting data efficiency. The proposed approach enables intervention participants to provide natural language descriptions of aspects of their current state. It then leverages inference with pre-trained large language models (LLMs) to better align the policy of a base RL method with these state descriptions. To evaluate our method, we develop a novel physical activity intervention simulation environment that generates text-based state descriptions conditioned on latent state variables using an auxiliary LLM. We show that this approach has the potential to significantly improve the performance of online policy learning methods.
[AI-98] Uncovering Systemic and Environment Errors in Autonomous Systems Using Differential Testing
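[Quick read]: This paper addresses the difficulty of determining, when an autonomous agent behaves undesirably, whether the cause is a systemic error in the agent itself (such as flaws in its model or policy) or an environment configuration that makes the task infeasible. The key to the solution is AIProbe, a novel black-box testing technique that uses differential testing to attribute undesirable behavior either to agent deficiencies or to environmental infeasibility. AIProbe first generates diverse environment configurations and tasks using Latin Hypercube sampling, then solves each task independently with a search-based planner, and identifies the cause of failure by comparing the agent's performance with the planner's solution.
Link: https://arxiv.org/abs/2507.03870
Authors: Rahil P Mehta,Yashwanthi Anand,Manish Motwani,Sandhya Saisubramanian
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: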
Abstract:When an autonomous agent behaves undesirably, including failure to complete a task, it can be difficult to determine whether the behavior is due to a systemic agent error, such as flaws in the model or policy, or an environment error, where a task is inherently infeasible under a given environment configuration, even for an ideal agent. As agents and their environments grow more complex, identifying the error source becomes increasingly difficult but critical for reliable deployment. We introduce AIProbe, a novel black-box testing technique that applies differential testing to attribute undesirable agent behaviors either to agent deficiencies, such as modeling or training flaws, or due to environmental infeasibility. AIProbe first generates diverse environmental configurations and tasks for testing the agent, by modifying configurable parameters using Latin Hypercube sampling. It then solves each generated task using a search-based planner, independent of the agent. By comparing the agent’s performance to the planner’s solution, AIProbe identifies whether failures are due to errors in the agent’s model or policy, or due to unsolvable task conditions. Our evaluation across multiple domains shows that AIProbe significantly outperforms state-of-the-art techniques in detecting both total and unique errors, thereby contributing to a reliable deployment of autonomous agents.
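The two steps the abstract describes (Latin Hypercube sampling of configurable parameters, then comparing the agent against an independent planner) can be illustrated with a minimal Python sketch. This is not AIProbe's code: the function names, the parameter bounds, and the verdict labels are all hypothetical and chosen only for illustration.

```python
import random

def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each dimension's range is split into
    n_samples equal strata and every stratum is hit exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniformly placed point inside each stratum of this dimension.
        points = [low + (high - low) * (i + rng.random()) / n_samples
                  for i in range(n_samples)]
        rng.shuffle(points)  # decouple the strata across dimensions
        columns.append(points)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

def attribute_failure(agent_solved, planner_solved):
    """Differential verdict: compare the agent with an independent planner."""
    if agent_solved:
        return "no failure"
    # If the planner finds a solution, the task was feasible and the agent is
    # at fault; if not even the planner succeeds, suspect the environment.
    return "agent error" if planner_solved else "environment error"

# Hypothetical environment with two configurable parameters.
configs = latin_hypercube(5, [(0.0, 1.0), (10.0, 20.0)])
```

In a real setting each sampled configuration would be run through both the agent and the planner, and `attribute_failure` would be applied to the pair of outcomes.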
[AI-99] From Query to Explanation: Uni-RAG for Multi-Modal Retrieval-Augmented Learning in STEM
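[Quick read]: This paper addresses the inability of existing retrieval systems, in AI-assisted teaching, to handle the diversity and ambiguity of query styles found in real-world educational scenarios. The key to the solution is Uni-Retrieval, a lightweight and efficient multi-modal retrieval module that extracts query-style prototypes and dynamically matches them with tokens from a continually updated Prompt Bank to improve retrieval. Combined with a compact instruction-tuned language model, it forms a complete retrieval-augmented generation framework, Uni-RAG, which generates educational content aligned with the learning objective.
Link: https://arxiv.org/abs/2507.03868
Authors: Xinyi Wu,Yanhao Jia,Luwei Xiao,Shuai Zhao,Fengkuang Chiang,Erik Cambria
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY); Multimedia (cs.MM)
Comments: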
Abstract:In AI-facilitated teaching, leveraging various query styles to interpret abstract educational content is crucial for delivering effective and accessible learning experiences. However, existing retrieval systems predominantly focus on natural text-image matching and lack the capacity to address the diversity and ambiguity inherent in real-world educational scenarios. To address this limitation, we develop a lightweight and efficient multi-modal retrieval module, named Uni-Retrieval, which extracts query-style prototypes and dynamically matches them with tokens from a continually updated Prompt Bank. This Prompt Bank encodes and stores domain-specific knowledge by leveraging a Mixture-of-Expert Low-Rank Adaptation (MoE-LoRA) module and can be adapted to enhance Uni-Retrieval’s capability to accommodate unseen query types at test time. To enable natural language educational content generation, we integrate the original Uni-Retrieval with a compact instruction-tuned language model, forming a complete retrieval-augmented generation pipeline named Uni-RAG. Given a style-conditioned query, Uni-RAG first retrieves relevant educational materials and then generates human-readable explanations, feedback, or instructional content aligned with the learning objective. Experimental results on SER and other multi-modal benchmarks show that Uni-RAG outperforms baseline retrieval and RAG systems in both retrieval accuracy and generation quality, while maintaining low computational cost. Our framework provides a scalable, pedagogically grounded solution for intelligent educational systems, bridging retrieval and generation to support personalized, explainable, and efficient learning assistance across diverse STEM scenarios.
[AI-100] Enhanced accuracy through ensembling of randomly initialized auto-regressive models for time-dependent PDEs
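[Quick read]: This paper addresses the error accumulation that arises from autoregressive inference with machine learning (ML) surrogate models in long-term prediction of systems governed by partial differential equations (PDEs). The key to the solution is a deep ensemble framework in which multiple ML surrogate models with random weight initializations are trained in parallel and aggregated at inference time. The diversity of model predictions mitigates error propagation while the autoregressive strategy retains its ability to capture the system's time-dependent relations.
Link: https://arxiv.org/abs/2507.03863
Authors: Ishan Khurjekar,Indrashish Saha,Lori Graham-Brady,Somdatta Goswami
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: 29 Pages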
Abstract:Systems governed by partial differential equations (PDEs) require computationally intensive numerical solvers to predict spatiotemporal field evolution. While machine learning (ML) surrogates offer faster solutions, autoregressive inference with ML models suffers from error accumulation over successive predictions, limiting their long-term accuracy. We propose a deep ensemble framework to address this challenge, where multiple ML surrogate models with random weight initializations are trained in parallel and aggregated during inference. This approach leverages the diversity of model predictions to mitigate error propagation while retaining the autoregressive strategy’s ability to capture the system’s time-dependent relations. We validate the framework on three PDE-driven dynamical systems (stress evolution in heterogeneous microstructures, Gray-Scott reaction-diffusion, and a planetary-scale shallow water system), demonstrating consistent reduction in error accumulation over time compared to individual models. Critically, the method requires only a few time steps as input, enabling full trajectory predictions with inference times significantly faster than numerical solvers. Our results highlight the robustness of ensemble methods in diverse physical systems and their potential as efficient and accurate alternatives to traditional solvers. The codes for this work are available on GitHub (this https URL).
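To make the idea of aggregating ensemble members inside an autoregressive rollout concrete, here is a minimal Python sketch. It assumes, hypothetically, that aggregation is an element-wise mean and that the averaged prediction is fed back as the next input; the abstract does not specify either detail. The two toy surrogates are invented and deliberately biased in opposite directions.

```python
def rollout(models, initial_state, n_steps):
    """Autoregressive rollout: each step feeds the aggregated prediction
    back in as the next input."""
    state = list(initial_state)
    trajectory = [state]
    for _ in range(n_steps):
        predictions = [model(state) for model in models]
        # Element-wise mean across ensemble members (an assumed aggregation rule).
        state = [sum(values) / len(values) for values in zip(*predictions)]
        trajectory.append(state)
    return trajectory

# Toy surrogates for the true map x -> 0.5 * x, with opposite biases.
def biased_high(state):
    return [0.5 * x + 0.1 for x in state]

def biased_low(state):
    return [0.5 * x - 0.1 for x in state]

# Final value after three steps from x = 8.0; the true value is 1.0.
single = rollout([biased_high], [8.0], 3)[-1][0]
ensemble = rollout([biased_high, biased_low], [8.0], 3)[-1][0]
```

The exact cancellation here is an artifact of the contrived biases. With independently trained networks the errors would only partially offset each other, which is the effect the abstract attributes to prediction diversity.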
[AI-101] KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis
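[Quick read]: This paper addresses hallucinations generated by large language models (LLMs), that is, statements that are syntactically plausible but lack factual grounding. The key to the solution is the KEA (Kernel-Enriched AI) Explain framework, which detects and explains hallucinations by comparing knowledge graphs constructed from LLM outputs with ground truth data from Wikidata or contextual documents. The method uses graph kernels and semantic clustering to provide explanations for detected hallucinations, aiming for both robustness and interpretability.
Link: https://arxiv.org/abs/2507.03847
Authors: Reilly Haskins,Ben Adams
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: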
Abstract:Large Language Models (LLMs) frequently generate hallucinations: statements that are syntactically plausible but lack factual grounding. This research presents KEA (Kernel-Enriched AI) Explain: a neurosymbolic framework that detects and explains such hallucinations by comparing knowledge graphs constructed from LLM outputs with ground truth data from Wikidata or contextual documents. Using graph kernels and semantic clustering, the method provides explanations for detected hallucinations, ensuring both robustness and interpretability. Our framework achieves competitive accuracy in detecting hallucinations across both open- and closed-domain tasks, and is able to generate contrastive explanations, enhancing transparency. This research advances the reliability of LLMs in high-stakes domains and provides a foundation for future work on precision improvements and multi-source knowledge integration.
[AI-102] Participatory Evolution of Artificial Life Systems via Semantic Feedback
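[Quick read]: This paper addresses how natural language can guide the evolution of artificial life systems; the core problem is achieving semantic alignment between user intent and system output. The key to the solution is a semantic feedback framework that integrates a prompt-to-parameter encoder, a CMA-ES optimizer, and CLIP-based evaluation, so that users can modulate both visual outcomes and underlying behavioral rules through natural language.
Link: https://arxiv.org/abs/2507.03839
Authors: Shuowen Li,Kexin Wang,Minglu Fang,Danqi Huang,Ali Asadipour,Haipeng Mi,Yitong Sun
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: 10 pages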
Abstract:We present a semantic feedback framework that enables natural language to guide the evolution of artificial life systems. Integrating a prompt-to-parameter encoder, a CMA-ES optimizer, and CLIP-based evaluation, the system allows user intent to modulate both visual outcomes and underlying behavioral rules. Implemented in an interactive ecosystem simulation, the framework supports prompt refinement, multi-agent interaction, and emergent rule synthesis. User studies show improved semantic alignment over manual tuning and demonstrate the system’s potential as a platform for participatory generative design and open-ended evolution.
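[AI-103] Economic Evaluation of LLMs
[Quick read]: This paper addresses the limitations of comparing the performance of different large language models (LLMs), particularly models with distinct strengths and weaknesses (for example, a cheap but error-prone model versus an expensive but accurate one). The traditional approach plots Pareto frontiers to trade off accuracy against cost, but it cannot compare across such models. The proposed solution is an economic evaluation framework whose key idea is to reduce an LLM's performance trade-off to a single number based on the economic constraints of a concrete use case. All constraints are expressed in dollars: the cost of making a mistake, the cost of incremental latency, and the cost of abstaining from a query. The framework makes it more straightforward to evaluate and compare the overall performance of LLMs across scenarios.
Link: https://arxiv.org/abs/2507.03834
Authors: Michael J. Zellinger,Matt Thomson
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 6 figures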
Abstract:Practitioners often navigate LLM performance trade-offs by plotting Pareto frontiers of optimal accuracy-cost trade-offs. However, this approach offers no way to compare between LLMs with distinct strengths and weaknesses: for example, a cheap, error-prone model vs a pricey but accurate one. To address this gap, we propose economic evaluation of LLMs. Our framework quantifies the performance trade-off of an LLM as a single number based on the economic constraints of a concrete use case, all expressed in dollars: the cost of making a mistake, the cost of incremental latency, and the cost of abstaining from a query. We apply our economic evaluation framework to compare the performance of reasoning and non-reasoning models on difficult questions from the MATH benchmark, discovering that reasoning models offer better accuracy-cost tradeoffs as soon as the economic cost of a mistake exceeds $0.01. In addition, we find that single large LLMs often outperform cascades when the cost of making a mistake is as low as $0.1. Overall, our findings suggest that when automating meaningful human tasks with AI models, practitioners should typically use the most powerful available model, rather than attempt to minimize AI deployment costs, since deployment costs are likely dwarfed by the economic impact of AI errors.
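A minimal Python sketch of collapsing a model's trade-offs into one dollar figure follows. The additive cost formula, the field names, and all numeric values below are assumptions made for illustration; the abstract names the three dollar-denominated prices but does not give the paper's actual formula or measurements.

```python
from dataclasses import dataclass

@dataclass
class ModelStats:
    name: str
    error_rate: float    # fraction of answered queries that are wrong
    abstain_rate: float  # fraction of queries the model declines to answer
    latency_s: float     # mean latency per query, in seconds
    api_cost: float      # mean deployment cost per query, in dollars

def expected_cost(m, cost_mistake, cost_latency_per_s, cost_abstain):
    """Expected dollars per query under an assumed additive cost model."""
    answered = 1.0 - m.abstain_rate
    return (m.api_cost
            + answered * m.error_rate * cost_mistake
            + m.latency_s * cost_latency_per_s
            + m.abstain_rate * cost_abstain)

def best_model(models, **prices):
    """Pick the model with the lowest expected cost for a given use case."""
    return min(models, key=lambda m: expected_cost(m, **prices))

# Hypothetical models; none of these numbers come from the paper.
cheap = ModelStats("cheap", error_rate=0.30, abstain_rate=0.0,
                   latency_s=1.0, api_cost=0.001)
strong = ModelStats("strong", error_rate=0.05, abstain_rate=0.0,
                    latency_s=10.0, api_cost=0.02)
```

With these made-up numbers, `best_model([cheap, strong], cost_mistake=0.01, cost_latency_per_s=0.0, cost_abstain=0.0)` selects the cheap model, while raising `cost_mistake` to 1.0 flips the choice to the strong one. This mirrors the qualitative point that deployment cost matters less as mistakes become more expensive.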
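[AI-104] RELRaE: LLM-Based Relationship Extraction Labelling Refinement and Evaluation ISWC2025
[Quick read]: This paper addresses the lack of interoperability between laboratories for the large volume of XML data produced by laboratory robots; the proposed remedy is to translate the XML data into a knowledge graph. The key step is enriching the XML schema to lay the foundation of an ontology schema, which centers on using large language models (LLMs) at different stages to extract and accurately label the relationships implicit in the XML schema.
Link: https://arxiv.org/abs/2507.03829
Authors: George Hannah,Jacopo de Berardinis,Terry R. Payne,Valentina Tamma,Andrew Mitchell,Ellen Piercy,Ewan Johnson,Andrew Ng,Harry Rostron,Boris Konev
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 Pages, 8 Tables, Under-review at ISWC 2025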
Abstract:A large volume of XML data is produced in experiments carried out by robots in laboratories. In order to support the interoperability of data between labs, there is a motivation to translate the XML data into a knowledge graph. A key stage of this process is the enrichment of the XML schema to lay the foundation of an ontology schema. To achieve this, we present the RELRaE framework, a framework that employs large language models in different stages to extract and accurately label the relationships implicitly present in the XML schema. We investigate the capability of LLMs to accurately generate these labels and then evaluate them. Our work demonstrates that LLMs can be effectively used to support the generation of relationship labels in the context of lab automation, and that they can play a valuable role within semi-automatic ontology generation frameworks more generally.
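[AI-105] Leveraging Large Language Models for Tacit Knowledge Discovery in Organizational Contexts IJCNN
[Quick read]: This paper addresses the difficulty of documenting tacit knowledge in organizations, a process hampered by incomplete initial information, difficulty in identifying knowledge holders, the interplay of formal hierarchies and informal networks, and the need to ask the right questions. The key to the solution is an agent-based framework that uses large language models (LLMs) to iteratively reconstruct dataset descriptions through interactions with employees. The method models knowledge dissemination as a Susceptible-Infectious (SI) process with waning infectivity and is validated through extensive simulations, in which the agent achieves 94.9% full-knowledge recall, indicating its ability to navigate organizational complexity and capture fragmented knowledge.
Link: https://arxiv.org/abs/2507.03811
Authors: Gianlucca Zuin,Saulo Mastelini,Túlio Loures,Adriano Veloso
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures, accepted to International Joint Conference on Neural Networks (IJCNN) 2025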
Abstract:Documenting tacit knowledge in organizations can be a challenging task due to incomplete initial information, difficulty in identifying knowledgeable individuals, the interplay of formal hierarchies and informal networks, and the need to ask the right questions. To address this, we propose an agent-based framework leveraging large language models (LLMs) to iteratively reconstruct dataset descriptions through interactions with employees. Modeling knowledge dissemination as a Susceptible-Infectious (SI) process with waning infectivity, we conduct 864 simulations across various synthetic company structures and different dissemination parameters. Our results show that the agent achieves 94.9% full-knowledge recall, with self-critical feedback scores strongly correlating with external literature critic scores. We analyze how each simulation parameter affects the knowledge retrieval process for the agent. In particular, we find that our approach is able to recover information without needing to access directly the only domain specialist. These findings highlight the agent’s ability to navigate organizational complexity and capture fragmented knowledge that would otherwise remain inaccessible.
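A Susceptible-Infectious process with waning infectivity can be sketched in a few lines of Python. The parameterization here is an assumption: each informed node passes knowledge to each uninformed neighbor with probability `beta * decay ** age`, where `age` counts the steps since the node became informed. The paper's actual dissemination parameters and company structures are not given in the abstract, and the graph below is a toy.

```python
import random

def simulate_si(adjacency, seed_node, beta, decay, n_steps, rng_seed=0):
    """Return a dict mapping each informed node to the step it was informed."""
    rng = random.Random(rng_seed)
    informed_at = {seed_node: 0}
    for step in range(1, n_steps + 1):
        newly_informed = []
        for node, t0 in list(informed_at.items()):
            # Infectivity wanes geometrically with time since being informed.
            p = beta * decay ** (step - 1 - t0)
            for neighbor in adjacency[node]:
                already = neighbor in informed_at or neighbor in newly_informed
                if not already and rng.random() < p:
                    newly_informed.append(neighbor)
        # SI model: informed nodes never return to the susceptible state.
        for neighbor in newly_informed:
            informed_at[neighbor] = step
    return informed_at

# Toy reporting chain of four employees: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
spread = simulate_si(chain, seed_node=0, beta=1.0, decay=1.0, n_steps=3)
```

With `beta=1.0` and `decay=1.0` the knowledge moves one hop per step along the chain; lowering `decay` below 1 makes long-informed employees progressively less likely to pass it on.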
[AI-106] Generating Novelty in Open-World Multi-Agent Strategic Board Games NEURIPS2020
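[Quick read]: This paper addresses how to evaluate the effectiveness of multi-agent AI systems when they face novelty in open-world settings. The key to the solution is the GNOME (Generating Novelty in Open-world Multi-agent Environments) platform, which separates the development of AI gameplaying agents from the simulator, enabling tests of unanticipated novelty and thereby avoiding model-selection bias.
Link: https://arxiv.org/abs/2507.03802
Authors: Mayank Kejriwal,Shilpa Thomas
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages, shorter version demonstrated in NeurIPS 2020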
Abstract:We describe GNOME (Generating Novelty in Open-world Multi-agent Environments), an experimental platform that is designed to test the effectiveness of multi-agent AI systems when faced with novelty. GNOME separates the development of AI gameplaying agents from the simulator, allowing unanticipated novelty (in essence, novelty that is not subject to model-selection bias). Using a Web GUI, GNOME was recently demonstrated at NeurIPS 2020 using the game of Monopoly to foster an open discussion on AI robustness and the nature of novelty in real-world environments. In this article, we further detail the key elements of the demonstration, and also provide an overview of the experimental design that is being currently used in the DARPA Science of Artificial Intelligence and Learning for Open-World Novelty (SAIL-ON) program to evaluate external teams developing novelty-adaptive gameplaying agents.
[AI-107] Learning Dark Souls Combat Through Pixel Input With Neuroevolution
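[Quick read]: This paper addresses automating gameplay in complex action role-playing games, specifically Dark Souls, a game with complex combat mechanics, dynamic environments, and high-dimensional visual inputs. The key to the solution is using Neuroevolution of Augmenting Topologies (NEAT) to evolve neural networks directly from raw pixel data, removing the dependence on explicit game-state information. The study introduces the Dark Souls API (DSAPI), which extracts key game metrics through real-time computer vision, allowing the evolutionary algorithm to evolve effective combat strategies that defeat the game's initial boss, the Asylum Demon, without predefined behaviors or domain-specific heuristics.
Link: https://arxiv.org/abs/2507.03793
Authors: Jim O’Connor,Gary B. Parker,Mustafa Bugti
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: IEEE Conference on Games 2025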
Abstract:This paper investigates the application of Neuroevolution of Augmenting Topologies (NEAT) to automate gameplay in Dark Souls, a notoriously challenging action role-playing game characterized by complex combat mechanics, dynamic environments, and high-dimensional visual inputs. Unlike traditional reinforcement learning or game playing approaches, our method evolves neural networks directly from raw pixel data, circumventing the need for explicit game-state information. To facilitate this approach, we introduce the Dark Souls API (DSAPI), a novel Python framework leveraging real-time computer vision techniques for extracting critical game metrics, including player and enemy health states. Using NEAT, agents evolve effective combat strategies for defeating the Asylum Demon, the game’s initial boss, without predefined behaviors or domain-specific heuristics. Experimental results demonstrate that evolved agents achieve up to a 35% success rate, indicating the viability of neuroevolution in addressing complex, visually intricate gameplay scenarios. This work represents an interesting application of vision-based neuroevolution, highlighting its potential use in a wide range of challenging game environments lacking direct API support or well-defined state representations.
[AI-108] Optimizing UAV Trajectories via a Simplified Close Enough TSP Approach
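[Quick read]: This paper addresses the Close Enough Traveling Salesman Problem (CETSP), aiming to streamline the mathematical formulation by introducing approximations of the Euclidean distances and a simplified objective function. The key to the solution is using convex sets in the constraint design, which offers computational benefits. The approach is validated on real-world CETSP instances with a fragmented CPLEX-based approach, showing that it manages computational resources effectively without compromising solution quality.
Link: https://arxiv.org/abs/2507.03775
Authors: Hiba Bederina
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: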
Abstract:This article explores an approach to addressing the Close Enough Traveling Salesman Problem (CETSP). The objective is to streamline the mathematical formulation by introducing reformulations that approximate the Euclidean distances and simplify the objective function. Additionally, the use of convex sets in the constraint design offers computational benefits. The proposed methodology is empirically validated on real-world CETSP instances, with the aid of computational strategies such as a fragmented CPLEX-based approach. Results demonstrate its effectiveness in managing computational resources without compromising solution quality. Furthermore, the article analyzes the behavior of the proposed mathematical formulations, providing comprehensive insights into their performance.
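[AI-109] Roadmap for using large language models (LLMs) to accelerate cross-disciplinary research with an example from computational biology
[Quick read]: This paper addresses how to integrate large language models (LLMs) effectively and responsibly into cross-disciplinary research, in order to overcome skepticism about their research use arising from hallucinations, biases, and potential harms. The key to the solution is to use LLMs as augmentative tools within a human-in-the-loop framework, where iterative interaction facilitates cross-field communication, knowledge transfer, and collaboration, improving research efficiency and innovation. The paper demonstrates the practical value of this approach through a computational biology case study (modeling HIV rebound dynamics).
Link: https://arxiv.org/abs/2507.03722
Authors: Ruian Ke,Ruy M. Ribeiro
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Other Quantitative Biology (q-bio.OT)
Comments: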
Abstract:Large language models (LLMs) are powerful artificial intelligence (AI) tools transforming how research is conducted. However, their use in research has been met with skepticism, due to concerns about hallucinations, biases and potential harms to research. These emphasize the importance of clearly understanding the strengths and weaknesses of LLMs to ensure their effective and responsible use. Here, we present a roadmap for integrating LLMs into cross-disciplinary research, where effective communication, knowledge transfer and collaboration across diverse fields are essential but often challenging. We examine the capabilities and limitations of LLMs and provide a detailed computational biology case study (on modeling HIV rebound dynamics) demonstrating how iterative interactions with an LLM (ChatGPT) can facilitate interdisciplinary collaboration and research. We argue that LLMs are best used as augmentative tools within a human-in-the-loop framework. Looking forward, we envisage that the responsible use of LLMs will enhance innovative cross-disciplinary research and substantially accelerate scientific discoveries.
[AI-110] Predicting Business Angel Early-Stage Decision Making Using AI
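[Quick read]: This paper addresses the subjectivity, resource intensity, and low evaluation efficiency of business angel investment decision making. The key to the solution is to use a trained generative AI model to automate the extraction of key assessment factors, replacing the traditional Critical Factor Assessment (CFA) method, which requires several people and considerable time, and thereby enabling efficient, reliable, and less-biased evaluation of startup pitches.
Link: https://arxiv.org/abs/2507.03721
Authors: Yan Katcharovski,Andrew L. Maxwell
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint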
Abstract:External funding is crucial for early-stage ventures, particularly technology startups that require significant R&D investment. Business angels offer a critical source of funding, but their decision-making is often subjective and resource-intensive for both investor and entrepreneur. Much research has investigated this investment process to find the critical factors angels consider. One such tool, the Critical Factor Assessment (CFA), deployed more than 20,000 times by the Canadian Innovation Centre, has been evaluated post-decision and found to be significantly more accurate than investors’ own decisions. However, a single CFA analysis requires three trained individuals and several days, limiting its adoption. This study builds on previous work validating the CFA to investigate whether the constraints inhibiting its adoption can be overcome using a trained AI model. In this research, we prompted multiple large language models (LLMs) to assign the eight CFA factors to a dataset of 600 transcribed, unstructured startup pitches seeking business angel funding with known investment outcomes. We then trained and evaluated machine learning classification models using the LLM-generated CFA scores as input features. Our best-performing model demonstrated high predictive accuracy (85.0% for predicting BA deal/no-deal outcomes) and exhibited significant correlation (Spearman’s r = 0.896, p-value < 0.001) with conventional human-graded evaluations. The integration of AI-based feature extraction with a structured and validated decision-making framework yielded a scalable, reliable, and less-biased model for evaluating startup pitches, removing the constraints that previously limited adoption.
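[AI-111] Towards Unified Neurosymbolic Reasoning on Knowledge Graphs
[Quick read]: This paper addresses two limitations of knowledge graph (KG) reasoning: existing methods rely mainly on a single form of neural or symbolic reasoning and fail to combine the strengths of both, and most methods target a single reasoning scenario and so cannot meet the diverse demands of real-world reasoning tasks. The key to the solution is Tunsr, a unified neurosymbolic reasoning framework. It introduces a consistent reasoning-graph structure and a forward logic message-passing mechanism to update propositional representations and attentions as well as first-order logic (FOL) representations and attentions, and uses the FARI algorithm to induce FOL rules by repeatedly performing attention calculations over the reasoning graph, thereby integrating neural and symbolic methods and adapting to multiple reasoning scenarios.
Link: https://arxiv.org/abs/2507.03697
Authors: Qika Lin,Fangzhi Xu,Hao Lu,Kai He,Rui Mao,Jun Liu,Erik Cambria,Mengling Feng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 15 pages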
Abstract:Knowledge Graph (KG) reasoning has received significant attention in the fields of artificial intelligence and knowledge engineering, owing to its ability to autonomously deduce new knowledge and consequently enhance the availability and precision of downstream applications. However, current methods predominantly concentrate on a single form of neural or symbolic reasoning, failing to effectively integrate the inherent strengths of both approaches. Furthermore, the current prevalent methods primarily focus on addressing a single reasoning scenario, presenting limitations in meeting the diverse demands of real-world reasoning tasks. Unifying the neural and symbolic methods, as well as diverse reasoning scenarios in one model is challenging as there is a natural representation gap between symbolic rules and neural networks, and diverse scenarios exhibit distinct knowledge structures and specific reasoning objectives. To address these issues, we propose a unified neurosymbolic reasoning framework, namely Tunsr, for KG reasoning. Tunsr first introduces a consistent structure of reasoning graph that starts from the query entity and constantly expands subsequent nodes by iteratively searching posterior neighbors. Based on it, a forward logic message-passing mechanism is proposed to update both the propositional representations and attentions, as well as first-order logic (FOL) representations and attentions of each node. In this way, Tunsr conducts the transformation of merging multiple rules by merging possible relations at each step. Finally, the FARI algorithm is proposed to induce FOL rules by constantly performing attention calculations over the reasoning graph. Extensive experimental results on 19 datasets of four reasoning scenarios (transductive, inductive, interpolation, and extrapolation) demonstrate the effectiveness of Tunsr.
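[AI-112] Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning
[Quick read]: This paper addresses the tension between scalability and robustness of inference in modeling machine Theory of Mind (ToM). The key to the solution is to combine the ability of large language models (LLMs) to generate hypotheses and likelihood functions with the precision of a Bayesian inverse planning model in computing posterior probabilities over an agent's likely mental states, improving adaptability to complex scenarios while preserving inference accuracy.
Link: https://arxiv.org/abs/2507.03682
Authors: Rebekah A. Gelpí,Eric Xue,William A. Cunningham
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: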
Abstract:We propose a hybrid approach to machine Theory of Mind (ToM) that uses large language models (LLMs) as a mechanism for generating hypotheses and likelihood functions with a Bayesian inverse planning model that computes posterior probabilities for an agent’s likely mental states given its actions. Bayesian inverse planning models can accurately predict human reasoning on a variety of ToM tasks, but these models are constrained in their ability to scale these predictions to scenarios with a large number of possible hypotheses and actions. Conversely, LLM-based approaches have recently demonstrated promise in solving ToM benchmarks, but can exhibit brittleness and failures on reasoning tasks even when they pass otherwise structurally identical versions. By combining these two methods, this approach leverages the strengths of each component, closely matching optimal results on a task inspired by prior inverse planning models and improving performance relative to models that utilize LLMs alone or with chain-of-thought prompting, even with smaller LLMs that typically perform poorly on ToM tasks. We also exhibit the model’s potential to predict mental states on open-ended tasks, offering a promising direction for future development of ToM models and the creation of socially intelligent generative agents.
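The Bayesian half of this hybrid can be illustrated with a minimal Python sketch: given hypotheses about an agent's mental state and a likelihood for each observed action, compute the posterior by Bayes' rule. In the paper the hypotheses and likelihoods come from an LLM; here a hard-coded lookup table stands in for it. The scenario, the names, the probabilities, and the assumption that actions are conditionally independent given the hypothesis are all hypothetical.

```python
def posterior(prior, likelihood, actions):
    """Posterior over hypotheses given observed actions, assuming the
    actions are conditionally independent given the hypothesis."""
    scores = {}
    for hypothesis, p in prior.items():
        for action in actions:
            p *= likelihood(action, hypothesis)
        scores[hypothesis] = p
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

# Stand-in for LLM-generated likelihoods P(action | mental state).
TABLE = {
    ("walks to cafe", "wants coffee"): 0.8,
    ("walks to cafe", "wants tea"): 0.2,
}

def toy_likelihood(action, hypothesis):
    return TABLE[(action, hypothesis)]

prior = {"wants coffee": 0.5, "wants tea": 0.5}
belief = posterior(prior, toy_likelihood, ["walks to cafe"])
```

Observing the same action twice sharpens the belief further, since the likelihood ratio is applied once per observation.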
[AI-113] Large Language Models for Combinatorial Optimization: A Systematic Review
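[Quick read]: This paper addresses how large language models (LLMs) can be applied to combinatorial optimization (CO), exploring their potential and practical use in the field. The key to the approach is a systematic review that screens and analyzes the literature following the PRISMA guidelines, giving a comprehensive account of the task types, model architectures, dedicated datasets, and application areas of LLMs in CO, and identifying directions for future research.
Link: https://arxiv.org/abs/2507.03637
Authors: Francesca Da Ros,Michael Soprano,Luca Di Gaspero,Kevin Roitero
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: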
Abstract:This systematic review explores the application of Large Language Models (LLMs) in Combinatorial Optimization (CO). We report our findings using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We conduct a literature search via Scopus and Google Scholar, examining over 2,000 publications. We assess publications against four inclusion and four exclusion criteria related to their language, research focus, publication year, and type. Eventually, we select 103 studies. We classify these studies into semantic categories and topics to provide a comprehensive overview of the field, including the tasks performed by LLMs, the architectures of LLMs, the existing datasets specifically designed for evaluating LLMs in CO, and the field of application. Finally, we identify future directions for leveraging LLMs in this field.
[AI-114] Disentangling Doubt in Deep Causal AI
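[Quick read]: This paper addresses individual treatment-effect estimation in high-stakes applications, where reliable point predictions must come with interpretable uncertainty quantification. The key to the solution is a factorized Monte Carlo Dropout framework for deep twin-network models that splits total predictive variance into representation uncertainty (sigma_rep) in the shared encoder and prediction uncertainty (sigma_pred) in the outcome heads. This module-level decomposition makes it possible to detect and interpret the sources of uncertainty in deep causal-effect models.
Link: https://arxiv.org/abs/2507.03622
Authors: Cooper Doyle
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 14 pages, 5 figures, 3 tables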
Abstract:Accurate individual treatment-effect estimation in high-stakes applications demands both reliable point predictions and interpretable uncertainty quantification. We propose a factorized Monte Carlo Dropout framework for deep twin-network models that splits total predictive variance into representation uncertainty (sigma_rep) in the shared encoder and prediction uncertainty (sigma_pred) in the outcome heads. Across three synthetic covariate-shift regimes, our intervals are well-calibrated (ECE 0.03) and satisfy sigma_rep^2 + sigma_pred^2 ~ sigma_tot^2. Additionally, we observe a crossover: head uncertainty leads on in-distribution data, but representation uncertainty dominates under shift. Finally, on a real-world twins cohort with induced multivariate shifts, only sigma_rep spikes on out-of-distribution samples (delta sigma ~ 0.0002) and becomes the primary error predictor (rho_rep = 0.89), while sigma_pred remains flat. This module-level decomposition offers a practical diagnostic for detecting and interpreting uncertainty sources in deep causal-effect models.
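For readability, the additivity relation that the abstract writes in plain text as sigma_rep^2 + sigma_pred^2 ~ sigma_tot^2 can be typeset as follows. Reading the "~" as approximate equality is an assumption based on context; the symbols are those of the abstract.

```latex
\sigma_{\mathrm{tot}}^{2} \;\approx\; \sigma_{\mathrm{rep}}^{2} + \sigma_{\mathrm{pred}}^{2}
```

Here \sigma_{\mathrm{rep}} is the uncertainty attributed to the shared encoder and \sigma_{\mathrm{pred}} the uncertainty attributed to the outcome heads.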
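[AI-115] EvoAgentX: An Automated Framework for Evolving Agentic Workflows
【Quick Read】: This paper addresses the shortcomings of multi-agent system (MAS) frameworks in workflow configuration, dynamic evolution, and performance optimization, as well as the lack of a unified framework integrating existing optimization algorithms. The key to its solution is EvoAgentX, a platform whose modular architecture automates the generation, execution, and evolutionary optimization of multi-agent workflows, and whose evolving layer integrates three MAS optimization algorithms (TextGrad, AFlow, and MIPRO) to iteratively refine agent prompts, tool configurations, and workflow topologies.
Link: https://arxiv.org/abs/2507.03616
Authors: Yingxu Wang,Siwei Liu,Jinyuan Fang,Zaiqiao Meng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: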
Abstract:Multi-agent systems (MAS) have emerged as a powerful paradigm for orchestrating large language models (LLMs) and specialized tools to collaboratively address complex tasks. However, existing MAS frameworks often require manual workflow configuration and lack native support for dynamic evolution and performance optimization. In addition, many MAS optimization algorithms are not integrated into a unified framework. In this paper, we present EvoAgentX, an open-source platform that automates the generation, execution, and evolutionary optimization of multi-agent workflows. EvoAgentX employs a modular architecture consisting of five core layers: the basic components, agent, workflow, evolving, and evaluation layers. Specifically, within the evolving layer, EvoAgentX integrates three MAS optimization algorithms, TextGrad, AFlow, and MIPRO, to iteratively refine agent prompts, tool configurations, and workflow topologies. We evaluate EvoAgentX on HotPotQA, MBPP, and MATH for multi-hop reasoning, code generation, and mathematical problem solving, respectively, and further assess it on real-world tasks using GAIA. Experimental results show that EvoAgentX consistently achieves significant performance improvements, including a 7.44% increase in HotPotQA F1, a 10.00% improvement in MBPP pass@1, a 10.00% gain in MATH solve accuracy, and an overall accuracy improvement of up to 20.00% on GAIA. The source code is available at: this https URL
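[AI-116] Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)
【Quick Read】: This paper addresses the high cost and resource demands of fine-tuning large language models (LLMs) for telecom-specific generative AI (GenAI) tasks within the ORAN architecture. The key to its solution is Retrieval-Augmented Generation (RAG), which achieves domain adaptation through in-context learning without full retraining. The study then compares Vector RAG with GraphRAG and Hybrid GraphRAG, variants that add knowledge graphs or dual retrieval strategies to support multi-hop reasoning and improve factual grounding, using ORAN specifications.
Link: https://arxiv.org/abs/2507.03608
Authors: Sarat Ahmad,Zeinab Nezami,Maryam Hafeez,Syed Ali Raza Zaidi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
Comments: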
Abstract:Generative AI (GenAI) is expected to play a pivotal role in enabling autonomous optimization in future wireless networks. Within the ORAN architecture, Large Language Models (LLMs) can be specialized to generate xApps and rApps by leveraging specifications and API definitions from the RAN Intelligent Controller (RIC) platform. However, fine-tuning base LLMs for telecom-specific tasks remains expensive and resource-intensive. Retrieval-Augmented Generation (RAG) offers a practical alternative through in-context learning, enabling domain adaptation without full retraining. While traditional RAG systems rely on vector-based retrieval, emerging variants such as GraphRAG and Hybrid GraphRAG incorporate knowledge graphs or dual retrieval strategies to support multi-hop reasoning and improve factual grounding. Despite their promise, these methods lack systematic, metric-driven evaluations, particularly in high-stakes domains such as ORAN. In this study, we conduct a comparative evaluation of Vector RAG, GraphRAG, and Hybrid GraphRAG using ORAN specifications. We assess performance across varying question complexities using established generation metrics: faithfulness, answer relevance, context relevance, and factual correctness. Results show that both GraphRAG and Hybrid GraphRAG outperform traditional RAG. Hybrid GraphRAG improves factual correctness by 8%, while GraphRAG improves context relevance by 7%.
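[AI-117] Behaviour Space Analysis of LLM-driven Meta-heuristic Discovery
【Quick Read】: This paper analyses the behaviour space of meta-heuristic optimisation algorithms generated by LLM-driven algorithm discovery, in order to understand how different configurations differ in search dynamics and performance. The key to its solution is to use the Large Language Evolutionary Algorithm (LLaMEA) framework with a GPT o4-mini LLM to iteratively evolve black-box optimisation heuristics, evaluated on 10 functions from the BBOB benchmark suite. By logging dynamic behavioural metrics and analysing them through visual projections, network-based representations, Code Evolution Graphs, and Search Trajectory Networks, the study reveals how LLaMEA variants differ in exploration, exploitation, convergence, and stagnation, offering guidance for the future design of adaptive LLM-driven algorithm generators.
Link: https://arxiv.org/abs/2507.03605
Authors: Niki van Stein,Haoran Yin,Anna V. Kononova,Thomas Bäck,Gabriela Ochoa
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: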
Abstract:We investigate the behaviour space of meta-heuristic optimisation algorithms automatically generated by Large Language Model driven algorithm discovery methods. Using the Large Language Evolutionary Algorithm (LLaMEA) framework with a GPT o4-mini LLM, we iteratively evolve black-box optimisation heuristics, evaluated on 10 functions from the BBOB benchmark suite. Six LLaMEA variants, featuring different mutation prompt strategies, are compared and analysed. We log dynamic behavioural metrics including exploration, exploitation, convergence and stagnation measures, for each run, and analyse these via visual projections and network-based representations. Our analysis combines behaviour-based projections, Code Evolution Graphs built from static code features, performance convergence curves, and behaviour-based Search Trajectory Networks. The results reveal clear differences in search dynamics and algorithm structures across LLaMEA configurations. Notably, the variant that employs both a code simplification prompt and a random perturbation prompt in a 1+1 elitist evolution strategy, achieved the best performance, with the highest Area Over the Convergence Curve. Behaviour-space visualisations show that higher-performing algorithms exhibit more intensive exploitation behaviour and faster convergence with less stagnation. Our findings demonstrate how behaviour-space analysis can explain why certain LLM-designed heuristics outperform others and how LLM-driven algorithm discovery navigates the open-ended and complex search space of algorithms. These findings provide insights to guide the future design of adaptive LLM-driven algorithm generators.
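The abstract's best-performing variant uses a 1+1 elitist evolution strategy. As a rough illustration of that selection scheme only, the sketch below keeps one parent, produces one child per step, and replaces the parent only when the child is at least as good. Everything concrete here is an assumption for illustration: in LLaMEA the individuals are algorithm code and the mutation is an LLM prompt, whereas this stand-in mutates a numeric vector with Gaussian noise and minimises a sphere function. The names `one_plus_one_es`, `sphere`, and `gaussian_mutation` are hypothetical.

```python
import random

def one_plus_one_es(fitness, parent, mutate, budget=200, seed=0):
    """Minimal (1+1) elitist loop for minimisation.

    `mutate(individual, rng)` stands in for whatever variation operator is
    used; the parent is replaced only if the child is at least as good.
    """
    rng = random.Random(seed)
    best, best_f = parent, fitness(parent)
    history = [best_f]  # best-so-far value after each evaluation
    for _ in range(budget):
        child = mutate(best, rng)
        child_f = fitness(child)
        if child_f <= best_f:  # elitism: never accept a worse child
            best, best_f = child, child_f
        history.append(best_f)
    return best, best_f, history

def sphere(x):
    """Toy objective (assumed here, not one of the paper's benchmarks)."""
    return sum(v * v for v in x)

def gaussian_mutation(x, rng, sigma=0.3):
    """Toy mutation: add independent Gaussian noise to each coordinate."""
    return [v + rng.gauss(0.0, sigma) for v in x]

start = [2.0, -1.5, 0.5]
best, best_f, history = one_plus_one_es(sphere, start, gaussian_mutation)
print(best_f)
```

The `history` list is the kind of best-so-far trace from which a convergence curve could be drawn. How the paper computes its Area Over the Convergence Curve is not specified in the abstract.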
[AI-118] MusGO: A Community-Driven Framework For Assessing Openness in Music-Generative AI
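【Quick Read】: This paper addresses the ambiguity in how openness is defined and assessed for music-generative AI, along with the ethical challenges that follow from it. The key to its solution is MusGO (Music-Generative Open AI), an evidence-based openness assessment framework comprising 13 openness categories, refined using survey feedback from 110 participants from the Music Information Retrieval (MIR) community, which enables systematic assessment and transparent public comparison of generative models' openness.
Link: https://arxiv.org/abs/2507.03599
Authors: Roser Batlle-Roca,Laura Ibáñez-Martínez,Xavier Serra,Emilia Gómez,Martín Rocamora
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025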
Abstract:Since 2023, generative AI has rapidly advanced in the music domain. Despite significant technological advancements, music-generative models raise critical ethical challenges, including a lack of transparency and accountability, along with risks such as the replication of artists’ works, which highlights the importance of fostering openness. With upcoming regulations such as the EU AI Act encouraging open models, many generative models are being released labelled as ‘open’. However, the definition of an open model remains widely debated. In this article, we adapt a recently proposed evidence-based framework for assessing openness in LLMs to the music domain. Using feedback from a survey of 110 participants from the Music Information Retrieval (MIR) community, we refine the framework into MusGO (Music-Generative Open AI), which comprises 13 openness categories: 8 essential and 5 desirable. We evaluate 16 state-of-the-art generative models and provide an openness leaderboard that is fully open to public scrutiny and community contributions. Through this work, we aim to clarify the concept of openness in music-generative AI and promote its transparent and responsible development.
[AI-119] A Universal Approach to Feature Representation in Dynamic Task Assignment Problems
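【Quick Read】: This paper addresses dynamic task assignment in business processes, specifically how to represent the state and the possible assignments as inputs and outputs of a neural-network policy function when tasks or resources have features with an infinite number of possible values. The key to its solution is a graph-based feature representation of assignment problems (the assignment graph), together with a mapping from marked Colored Petri Nets to assignment graphs and an adaptation of the Proximal Policy Optimization algorithm, which together make it possible to solve assignment problems with infinite state and action spaces.
Link: https://arxiv.org/abs/2507.03579
Authors: Riccardo Lo Bianco,Remco Dijkman,Wim Nuijten,Willem van Jaarsveld
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: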
Abstract:Dynamic task assignment concerns the optimal assignment of resources to tasks in a business process. Recently, Deep Reinforcement Learning (DRL) has been proposed as the state of the art for solving assignment problems. DRL methods usually employ a neural network (NN) as an approximator for the policy function, which ingests the state of the process and outputs a valuation of the possible assignments. However, representing the state and the possible assignments so that they can serve as inputs and outputs for a policy NN remains an open challenge, especially when tasks or resources have features with an infinite number of possible values. To solve this problem, this paper proposes a method for representing and solving assignment problems with infinite state and action spaces. In doing so, it provides three contributions: (I) A graph-based feature representation of assignment problems, which we call assignment graph; (II) A mapping from marked Colored Petri Nets to assignment graphs; (III) An adaptation of the Proximal Policy Optimization algorithm that can learn to solve assignment problems represented through assignment graphs. To evaluate the proposed representation method, we model three archetypal assignment problems ranging from finite to infinite state and action space dimensionalities. The experiments show that the method is suitable for representing and learning close-to-optimal task assignment policies regardless of the state and action space dimensionalities.
[AI-120] Generating Synthetic Relational Tabular Data via Structural Causal Models ACL2025
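【Quick Read】: This paper addresses a gap in current synthetic tabular data generation methods: they do not adequately handle the multi-table relational structures common in real-world data, with their complex inter-table dependencies. The key to its solution is a new framework based on Structural Causal Models (SCMs) that generates realistic synthetic relational tabular data including causal relationships across tables, so that the data better mimics real-world structures.
Link: https://arxiv.org/abs/2507.03528
Authors: Frederik Hoppe,Astrid Franz,Lars Kleinemeier,Udo Göbel
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: Accepted to Table Representation Learning Workshop at ACL 2025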
Abstract:Synthetic tabular data generation has received increasing attention in recent years, particularly with the emergence of foundation models for tabular data. The breakthrough success of TabPFN (Hollmann et al.,2025), which leverages vast quantities of synthetic tabular datasets derived from structural causal models (SCMs), demonstrates the critical role synthetic data plays in developing powerful tabular foundation models. However, most real-world tabular data exists in relational formats spanning multiple interconnected tables - a structure not adequately addressed by current generation methods. In this work, we extend the SCM-based approach by developing a novel framework that generates realistic synthetic relational tabular data including causal relationships across tables. Our experiments confirm that this framework is able to construct relational datasets with complex inter-table dependencies mimicking real-world scenarios.
[AI-121] Limits of Safe AI Deployment: Differentiating Oversight and Control
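【Quick Read】: This paper addresses the lack of clarity around the supervision mechanisms meant to keep AI systems accountable, reliable, and compliant with governance and management requirements, in particular the conflation of oversight and control in academic and policy discourse. The key to its solution is to distinguish control, an ex-ante or real-time operational mechanism, from oversight, which is either a policy and governance function or ex-post. On that basis it proposes a theoretically informed, policy-grounded framework setting out when each mechanism is possible, where it falls short, and what makes it meaningful in practice. It also outlines a maturity model for AI supervision to help regulators, auditors, and practitioners identify present limitations and the advances still needed.
Link: https://arxiv.org/abs/2507.03525
Authors: David Manheim,Aidan Homewood
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: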
Abstract:Oversight and control (collectively, supervision) are often invoked as key levers for ensuring that AI systems are accountable, reliable, and able to fulfill governance and management requirements. However, the concepts are frequently conflated or insufficiently distinguished in academic and policy discourse, undermining efforts to design or evaluate systems that should remain under meaningful human supervision. This paper undertakes a targeted critical review of literature on supervision outside of AI, along with a brief summary of past work on the topic related to AI. We then differentiate control as being ex-ante or real-time, and operational rather than policy or governance. In contrast, oversight is either a policy and governance function, or is ex-post. We suggest that control aims to prevent failures. In contrast, oversight often focuses on detection, remediation, or incentives for future prevention; all preventative oversight strategies nonetheless necessitate control. Building on this foundation, we make three contributions. First, we propose a theoretically-informed yet policy-grounded framework that articulates the conditions under which each mechanism is possible, where they fall short, and what is required to make them meaningful in practice. Second, we outline how supervision methods should be documented and integrated into risk management, and drawing on the Microsoft Responsible AI Maturity Model, we outline a maturity model for AI supervision. Third, we explicitly highlight some boundaries of these mechanisms, including where they apply, where they fail, and where it is clear that no existing methods suffice. This foregrounds the question of whether meaningful supervision is possible in a given deployment context, and can support regulators, auditors, and practitioners in identifying both present limitations and the need for new conceptual and technical advances. 
[AI-122] Reinforcement Learning-based Feature Generation Algorithm for Scientific Data
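【Quick Read】: This paper addresses two key challenges in feature generation for scientific data: constructing effective high-order feature combinations requires deep domain expertise, and the search space grows exponentially with the order of the combinations, making manual effort prohibitively costly. The key to its solution is the Multi-agent Feature Generation (MAFG) framework, in which multiple agents collaboratively construct mathematical transformation equations, synthesize and identify highly informative feature combinations, and refine their strategies through reinforcement learning. Large language models then provide an interpretive evaluation of the features generated at each significant performance breakthrough, which automates feature generation and improves its efficiency.
Link: https://arxiv.org/abs/2507.03498
Authors: Meng Xiao,Junfeng Zhou,Yuanchun Zhou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, in Chinese language, accepted by JCRD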
Abstract:Feature generation (FG) aims to enhance the prediction potential of original data by constructing high-order feature combinations and removing redundant features. It is a key preprocessing step for tabular scientific data to improve downstream machine-learning model performance. Traditional methods face the following two challenges when dealing with the feature generation of scientific data: First, the effective construction of high-order feature combinations in scientific data necessitates profound and extensive domain-specific expertise. Secondly, as the order of feature combinations increases, the search space expands exponentially, imposing prohibitive human labor consumption. Advancements in the Data-Centric Artificial Intelligence (DCAI) paradigm have opened novel avenues for automating feature generation processes. Inspired by that, this paper revisits the conventional feature generation workflow and proposes the Multi-agent Feature Generation (MAFG) framework. Specifically, in the iterative exploration stage, multi-agents will construct mathematical transformation equations collaboratively, synthesize and identify feature combinations exhibiting high information content, and leverage a reinforcement learning mechanism to evolve their strategies. Upon completing the exploration phase, MAFG integrates the large language models (LLMs) to interpretatively evaluate the generated features of each significant model performance breakthrough. Experimental results and case studies consistently demonstrate that the MAFG framework effectively automates the feature generation process and significantly enhances various downstream scientific data mining tasks.
[AI-123] REAL: Benchmarking Abilities of Large Language Models for Housing Transactions and Services
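【Quick Read】: This paper asks whether large language models (LLMs) can act as agents in housing transactions and services, and how their performance compares with that of humans. The key to its solution is Real Estate Agent Large Language Model Evaluation (REAL), the first evaluation suite for LLMs in this field, comprising 5,316 high-quality evaluation entries across four topics (memory, comprehension, reasoning, and hallucination) organized into 14 categories to assess LLMs' knowledge and abilities in this scenario.
Link: https://arxiv.org/abs/2507.03477
Authors: Kexin Zhu,Yang Han
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: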
Abstract:The development of large language models (LLMs) has greatly promoted the progress of chatbot in multiple fields. There is an urgent need to evaluate whether LLMs can play the role of agent in housing transactions and services as well as humans. We present Real Estate Agent Large Language Model Evaluation (REAL), the first evaluation suite designed to assess the abilities of LLMs in the field of housing transactions and services. REAL comprises 5,316 high-quality evaluation entries across 4 topics: memory, comprehension, reasoning and hallucination. All these entries are organized as 14 categories to assess whether LLMs have the knowledge and ability in housing transactions and services scenario. Additionally, the REAL is used to evaluate the performance of most advanced LLMs. The experiment results indicate that LLMs still have significant room for improvement to be applied in the real estate field.
[AI-124] Multi-Agent Reasoning for Cardiovascular Imaging Phenotype Analysis
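【Quick Read】: This paper addresses the limitations of traditional approaches to identifying associations between imaging phenotypes and disease risk factors and outcomes. These approaches rely on human-driven hypothesis testing and selection of association factors, and often overlook complex, non-linear dependencies among imaging phenotypes and other multi-modal data. The key to its solution is MESHAgents (Multi-agent Exploratory Synergy for the Heart), a framework that uses large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies, with cardiovascular imaging as a proof of concept. Through iterative, self-organizing reasoning among multi-disciplinary AI agents, it synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS).
Link: https://arxiv.org/abs/2507.03460
Authors: Weitong Zhang,Mengyun Qiao,Chengqi Zang,Steven Niederer,Paul M Matthews,Wenjia Bai,Bernhard Kainz
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: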
Abstract:Identifying the associations between imaging phenotypes and disease risk factors and outcomes is essential for understanding disease mechanisms and improving diagnosis and prognosis models. However, traditional approaches rely on human-driven hypothesis testing and selection of association factors, often overlooking complex, non-linear dependencies among imaging phenotypes and other multi-modal data. To address this, we introduce a Multi-agent Exploratory Synergy for the Heart (MESHAgents) framework that leverages large language models as agents to dynamically elicit, surface, and decide confounders and phenotypes in association studies, using cardiovascular imaging as a proof of concept. Specifically, we orchestrate a multi-disciplinary team of AI agents – spanning cardiology, biomechanics, statistics, and clinical research – which spontaneously generate and converge on insights through iterative, self-organizing reasoning. The framework dynamically synthesizes statistical correlations with multi-expert consensus, providing an automated pipeline for phenome-wide association studies (PheWAS). We demonstrate the system’s capabilities through a population-based study of imaging phenotypes of the heart and aorta. MESHAgents autonomously uncovered correlations between imaging phenotypes and a wide range of non-imaging factors, identifying additional confounder variables beyond standard demographic factors. Validation on diagnosis tasks reveals that MESHAgents-discovered phenotypes achieve performance comparable to expert-selected phenotypes, with mean AUC differences as small as -0.004 on disease classification tasks. Notably, the recall score improves for 6 out of 9 disease types. Our framework provides clinically relevant imaging phenotypes with transparent reasoning, offering a scalable alternative to expert-driven methods.
[AI-125] Multi-Level Fusion Graph Neural Network for Molecule Property Prediction
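【Quick Read】: This paper addresses the difficulty existing graph neural networks (GNNs) have in capturing both local and global molecular structures at once, a limitation for drug discovery and related fields. The key to its solution is the Multi-Level Fusion Graph Neural Network (MLFGNN), which integrates Graph Attention Networks with a novel Graph Transformer to jointly model local and global dependencies. It also incorporates molecular fingerprints as a complementary modality and introduces an attention interaction mechanism to adaptively fuse information across representations.
Link: https://arxiv.org/abs/2507.03430
Authors: XiaYu Liu,Hou-biao Li,Yang Liu,Chao Fan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 38 pages, 11 figures, 6 tables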
Abstract:Accurate molecular property prediction is essential in drug discovery and related fields. However, existing graph neural networks (GNNs) often struggle to simultaneously capture both local and global molecular structures. In this work, we propose a Multi-Level Fusion Graph Neural Network (MLFGNN) that integrates Graph Attention Networks and a novel Graph Transformer to jointly model local and global dependencies. In addition, we incorporate molecular fingerprints as a complementary modality and introduce a mechanism of interaction between attention to adaptively fuse information across representations. Extensive experiments on multiple benchmark datasets demonstrate that MLFGNN consistently outperforms state-of-the-art methods in both classification and regression tasks. Interpretability analysis further reveals that the model effectively captures task-relevant chemical patterns, supporting the usefulness of multi-level and multi-modal fusion in molecular representation learning.
[AI-126] Lessons from a Chimp: AI “Scheming” and the Quest for Ape Language
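【Quick Read】: This paper examines research on whether current AI systems may be developing a capacity for "scheming" (covertly and strategically pursuing misaligned goals). Its key argument is that this research should avoid the mistakes of the historical effort to test whether non-human primates could master natural language: an overattribution of human traits, an excessive reliance on anecdote and descriptive analysis, and the lack of a strong theoretical framework. It recommends that AI scheming research actively avoid these pitfalls and outlines concrete steps for advancing the research programme in a scientifically rigorous way.
Link: https://arxiv.org/abs/2507.03409
Authors: Christopher Summerfield,Lennart Luettgau,Magda Dubois,Hannah Rose Kirk,Kobi Hackenburg,Catherine Fist,Katarina Slama,Nicola Ding,Rebecca Anselmetti,Andrew Strait,Mario Giulianelli,Cozmin Ududec
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: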
Abstract:We examine recent research that asks whether current AI systems may be developing a capacity for “scheming” (covertly and strategically pursuing misaligned goals). We compare current research practices in this field to those adopted in the 1970s to test whether non-human primates could master natural language. We argue that there are lessons to be learned from that historical research endeavour, which was characterised by an overattribution of human traits to other agents, an excessive reliance on anecdote and descriptive analysis, and a failure to articulate a strong theoretical framework for the research. We recommend that research into AI scheming actively seeks to avoid these pitfalls. We outline some concrete steps that can be taken for this research programme to advance in a productive and scientifically rigorous fashion.
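[AI-127] Artificial intelligence in drug discovery: A comprehensive review with a case study on hyperuricemia, gout arthritis, and hyperuricemic nephropathy
【Quick Read】: This paper addresses the bottlenecks of traditional drug discovery in complexity, cost, timelines, and failure rates. The key to its approach is a systematic review of how artificial intelligence (AI) and machine learning (ML) can be integrated throughout the entire drug discovery pipeline. It analyzes AI/ML applications across the core phases of target identification, hit screening, and lead optimization, highlighting significant methodological advances and their impact at each stage, and illustrates their practical impact through a case study on hyperuricemia, gout arthritis, and hyperuricemic nephropathy.
Link: https://arxiv.org/abs/2507.03407
Authors: Junwei Su,Cheng Xin,Ao Shang,Shan Wu,Zhenzhen Xie,Ruogu Xiong,Xiaoyu Xu,Cheng Zhang,Guang Chen,Yau-Tuen Chan,Guoyi Tang,Ning Wang,Yong Xu,Yibin Feng
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: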
Abstract:This paper systematically reviews recent advances in artificial intelligence (AI), with a particular focus on machine learning (ML), across the entire drug discovery pipeline. Due to the inherent complexity, escalating costs, prolonged timelines, and high failure rates of traditional drug discovery methods, there is a critical need to comprehensively understand how AI/ML can be effectively integrated throughout the full process. Currently available literature reviews often narrowly focus on specific phases or methodologies, neglecting the dependence between key stages such as target identification, hit screening, and lead optimization. To bridge this gap, our review provides a detailed and holistic analysis of AI/ML applications across these core phases, highlighting significant methodological advances and their impacts at each stage. We further illustrate the practical impact of these techniques through an in-depth case study focused on hyperuricemia, gout arthritis, and hyperuricemic nephropathy, highlighting real-world successes in molecular target identification and therapeutic candidate discovery. Additionally, we discuss significant challenges facing AI/ML in drug discovery and outline promising future research directions. Ultimately, this review serves as an essential orientation for researchers aiming to leverage AI/ML to overcome existing bottlenecks and accelerate drug discovery.
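[AI-128] LLM4Hint: Leveraging Large Language Models for Hint Recommendation in Offline Query Optimization
【Quick Read】: This paper addresses the cumbersome hand-tuning that traditional query optimizers require for complex workloads and the limited generalization of learning-based methods. The key to its solution is to use a moderate-sized large language model (LLM) to recommend query optimization hints: it integrates a lightweight model to produce a soft prompt, devises a query rewriting strategy to simplify SQL semantics, and introduces an explicit matching prompt to facilitate alignment between the LLM and the lightweight model, improving both effectiveness and generalization.
Link: https://arxiv.org/abs/2507.03384
Authors: Suchen Liu,Jun Gao,Yinjun Han,Yang Lin
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments: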
Abstract:Query optimization is essential for efficient SQL query execution in DBMS, and remains attractive over time due to the growth of data volumes and advances in hardware. Existing traditional optimizers struggle with the cumbersome hand-tuning required for complex workloads, and the learning-based methods face limitations in ensuring generalization. With the great success of Large Language Model (LLM) across diverse downstream tasks, this paper explores how LLMs can be incorporated to enhance the generalization of learned optimizers. Though promising, such an incorporation still presents challenges, mainly including high model inference latency, and the substantial fine-tuning cost and suboptimal performance due to inherent discrepancy between the token sequences in LLM and structured SQL execution plans with rich numerical features. In this paper, we focus on recurring queries in offline optimization to alleviate the issue of high inference latency, and propose LLM4Hint that leverages moderate-sized backbone LLMs to recommend query optimization hints. LLM4Hint achieves the goals through: (i) integrating a lightweight model to produce a soft prompt, which captures the data distribution in DBMS and the SQL predicates to provide sufficient optimization features while simultaneously reducing the context length fed to the LLM, (ii) devising a query rewriting strategy using a larger commercial LLM, so as to simplify SQL semantics for the backbone LLM and reduce fine-tuning costs, and (iii) introducing an explicit matching prompt to facilitate alignment between the LLM and the lightweight model, which can accelerate convergence of the combined model. Experiments show that LLM4Hint, by leveraging the LLM’s stronger capability to understand the query statement, can outperform the state-of-the-art learned optimizers in terms of both effectiveness and generalization.
[AI-129] Effects of structure on reasoning in instance-level Self-Discover
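【Quick Read】: This paper addresses the performance trade-off between structured outputs and unstructured reasoning in complex problem solving, and the question of how compound systems should be organized to support model reasoning. The key to its approach is iSelf-Discover, an instance-level adaptation of the Self-Discover framework, used to compare dynamically generated structured JSON reasoning with its unstructured counterpart. The comparison shows a consistent advantage for unstructured reasoning across benchmarks, with relative improvements of up to 18.90% on the complex MATH benchmark. The study also shows that the optimal granularity of plan generation (instance-level vs. task-level) is context-dependent, which calls into question the reliance on structured formats.
Link: https://arxiv.org/abs/2507.03347
Authors: Sachith Gunasekara,Yasiru Ratnayake
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: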
Abstract:The drive for predictable LLM reasoning in their integration with compound systems has popularized structured outputs, yet concerns remain about performance trade-offs compared to unconstrained natural language. At the same time, training on unconstrained Chain of Thought (CoT) traces has brought about a new class of strong reasoning models that nevertheless present novel compute budget and faithfulness challenges. This paper introduces iSelf-Discover, an instance-level adaptation of the Self-Discover framework, and using it compares dynamically generated structured JSON reasoning with its unstructured counterpart. Our empirical evaluation across diverse benchmarks using state-of-the-art open-source models supports a consistent advantage for unstructured reasoning. Notably, on the complex MATH benchmark, unstructured plans achieved relative performance improvements of up to 18.90% over structured approaches. Zero-shot unstructured iSelf-Discover variants are also shown to outperform their five-shot structured counterparts, underscoring the significance of this gap, even when structured plans are dynamically generated to ensure reasoning precedes the final answer. We further demonstrate that the optimal granularity of plan generation (instance-level vs. task-level) is context-dependent. These findings invite re-evaluation of the reliance on structured formats for complex problem-solving and how compound systems should be organized.
[AI-130] NDAI-NeuroMAP: A Neuroscience-Specific Embedding Model for Domain-Specific Retrieval
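【Quick Read】: This paper addresses high-precision information retrieval in the neuroscience domain by building a dedicated dense vector embedding model. The key to its solution is a domain-specific training corpus of 500,000 carefully constructed triplets (query-positive-negative configurations), 250,000 neuroscience-specific definitional entries, and 250,000 structured knowledge-graph triplets derived from authoritative neurological ontologies, combined with fine-tuning of the FremyCompany/BioLORD-2023 foundation model under a multi-objective optimization framework that combines contrastive learning with triplet-based metric learning.
Link: https://arxiv.org/abs/2507.03329
Authors: Devendra Patel,Aaditya Jain,Jayant Verma,Divyansh Rajput,Sunil Mahala,Ketki Suresh Khapare,Jayateja Kalla
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: The document consists of 15 pages in total: the first 13 pages comprise the main paper, while the last two pages contain supplementary material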
Abstract:We present NDAI-NeuroMAP, the first neuroscience-domain-specific dense vector embedding model engineered for high-precision information retrieval tasks. Our methodology encompasses the curation of an extensive domain-specific training corpus comprising 500,000 carefully constructed triplets (query-positive-negative configurations), augmented with 250,000 neuroscience-specific definitional entries and 250,000 structured knowledge-graph triplets derived from authoritative neurological ontologies. We employ a sophisticated fine-tuning approach utilizing the FremyCompany/BioLORD-2023 foundation model, implementing a multi-objective optimization framework combining contrastive learning with triplet-based metric learning paradigms. Comprehensive evaluation on a held-out test dataset comprising approximately 24,000 neuroscience-specific queries demonstrates substantial performance improvements over state-of-the-art general-purpose and biomedical embedding models. These empirical findings underscore the critical importance of domain-specific embedding architectures for neuroscience-oriented RAG systems and related clinical natural language processing applications.
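The abstract mentions triplet-based metric learning over query-positive-negative triplets but does not give the loss. A common form of triplet objective is the margin loss sketched below. This is not NDAI-NeuroMAP's training code: the use of cosine distance, the margin value, and the function name `triplet_margin_loss` are assumptions chosen for illustration.

```python
import numpy as np

def triplet_margin_loss(query, positive, negative, margin=0.2):
    """Mean triplet margin loss over a batch of embeddings of shape (n, d).

    Each query is pushed to be closer (in cosine distance, an assumption here)
    to its positive than to its negative by at least `margin`.
    """
    def cosine_distance(a, b):
        a = a / np.linalg.norm(a, axis=1, keepdims=True)
        b = b / np.linalg.norm(b, axis=1, keepdims=True)
        return 1.0 - np.sum(a * b, axis=1)

    d_pos = cosine_distance(query, positive)
    d_neg = cosine_distance(query, negative)
    # Hinge: zero loss once the negative is farther than the positive by the margin.
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())

q = np.array([[1.0, 0.0]])
same = np.array([[1.0, 0.0]])
orthogonal = np.array([[0.0, 1.0]])
easy_loss = triplet_margin_loss(q, same, orthogonal)   # positive matches the query
hard_loss = triplet_margin_loss(q, orthogonal, same)   # roles swapped
print(easy_loss, hard_loss)
```

In the first call the triplet is already well separated, so the hinge gives no penalty. In the second the positive and negative are swapped, so the loss equals the full distance gap plus the margin.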
[AI-131] Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization
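【Quick Read】: This paper addresses the challenges of compound property prediction in structure-activity relationship (SAR) modeling, in particular building end-to-end explainable machine learning models when activity data per target is limited and properties are sensitive to subtle molecular changes. The key to its solution is to leverage activity-cliff molecule pairs to strengthen the model's ability to explain differences in molecular structure and activity. It uses graph neural networks (GNNs) to obtain atom-level features and predict compound-protein affinity, and trains them with structure-aware loss functions, group lasso, and sparse group lasso to improve both explainability and predictive accuracy.
Link: https://arxiv.org/abs/2507.03318
Authors: Zanyu Shi,Yang Wang,Pathum Weerawarna,Jie Zhang,Timothy Richardson,Yijie Wang,Kun Huang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures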
Abstract:Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable machine learning models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as limited activity data per target and the sensitivity of properties to subtle molecular changes. To address this, we leveraged activity-cliff molecule pairs, i.e., compounds sharing a common scaffold but differing sharply in potency, targeting three proto-oncogene tyrosine-protein kinase Src proteins (i.e., PDB IDs 1O42, 2H8H, and 4MXO). We implemented graph neural network (GNN) methods to obtain atom-level feature information and predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). In addition, we trained GNN models with different structure-aware loss functions to adequately leverage molecular property and structure information. We also utilized group lasso and sparse group lasso to prune and highlight molecular subgraphs and enhance the structure-specific model explainability for the predicted property difference in molecular activity-cliff pairs. We improved drug property prediction by integrating common and uncommon node information and using sparse group lasso, reducing the average root mean squared error (RMSE) by 12.70%, and achieving the lowest averaged RMSE=0.2551 and the highest PCC=0.9572. Furthermore, applying regularization enhances feature attribution methods that estimate the contribution of each atom in the molecular graphs by boosting global direction scores and atom-level accuracy in atom coloring accuracy, which improves model interpretability in drug discovery pipelines, particularly in investigating important molecular substructures in lead optimization.
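For readers unfamiliar with the regularizers named in the abstract, here is a minimal sketch of the standard group lasso and sparse group lasso penalties. How the paper partitions GNN weights or node features into groups is not stated in the abstract, so the `groups` argument (a list of weight lists) and the `lam` coefficients are hypothetical.

```python
import math

def group_lasso_penalty(groups, lam=1.0):
    """Sum over groups of sqrt(group size) * L2 norm of that group's weights."""
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g)) for g in groups
    )

def sparse_group_lasso_penalty(groups, lam_group=1.0, lam_l1=1.0):
    """Group lasso term plus an elementwise L1 term on all weights."""
    l1 = sum(abs(w) for g in groups for w in g)
    return group_lasso_penalty(groups, lam_group) + lam_l1 * l1
```

Because the L2 norm is not squared, the group term tends to push whole groups to exactly zero, which is what makes it usable for pruning and highlighting substructures; the added L1 term also encourages sparsity inside the groups that survive.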
[AI-132] Partial Label Learning for Automated Theorem Proving
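[Quick read]: This paper tackles the problem of handling multiple possible proofs in learning-assisted automated theorem proving. It formulates learning-guided automated theorem proving as Partial Label Learning, building the first bridge between the two research fields and providing a theoretical framework for dealing with alternative proofs during learning. The key to the solution is applying methods from the Partial Label Learning literature to improve the performance of learning-assisted theorem provers, which is validated with the plCoP theorem prover.
Link: https://arxiv.org/abs/2507.03314
Authors: Zsolt Zombori,Balázs Indruck
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: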
Abstract:We formulate learning guided Automated Theorem Proving as Partial Label Learning, building the first bridge across these fields of research and providing a theoretical framework for dealing with alternative proofs during learning. We use the plCoP theorem prover to demonstrate that methods from the Partial Label Learning literature tend to increase the performance of learning assisted theorem provers.
[AI-133] ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series
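[Quick read]: This paper tackles causal discovery in irregularly sampled time series, a problem that is especially critical in high-stakes domains such as finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. Existing methods such as Granger causality and PCMCI cannot handle multi-scale interactions, while neural approaches such as CUTS+ lack interpretability. The key to the proposed solution, ReTimeCausal, is combining Additive Noise Models (ANM) with Expectation-Maximization (EM): it integrates physics-guided data imputation with sparse causal inference and uses kernelized sparse regression and structural constraints to iteratively refine missing values and causal graphs, thereby resolving cross-frequency dependencies and missing-data issues.
Link: https://arxiv.org/abs/2507.03310
Authors: Weihong Li,Anpeng Wu,Kun Kuang,Keting Yin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 2 figures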
Abstract:This paper studies causal discovery in irregularly sampled time series-a pivotal challenge in high-stakes domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. Traditional methods (e.g., Granger causality, PCMCI) fail to reconcile multi-scale interactions (e.g., hourly storms vs. decadal climate shifts), while neural approaches (e.g., CUTS+) lack interpretability, stemming from a critical gap: existing frameworks either rigidly assume temporal regularity or aggregate dynamics into opaque representations, neglecting real-world granularity and auditable logic. To bridge this gap, we propose ReTimeCausal, a novel integration of Additive Noise Models (ANM) and Expectation-Maximization (EM) that unifies physics-guided data imputation with sparse causal inference. Through kernelized sparse regression and structural constraints, ReTimeCausal iteratively refines missing values (E-step) and causal graphs (M-step), resolving cross-frequency dependencies and missing data issues. Extensive experiments on synthetic and real-world datasets demonstrate that ReTimeCausal outperforms existing state-of-the-art methods under challenging irregular sampling and missing data conditions.
[AI-134] Scaffolding Recursive Divergence and Convergence in Story Ideation
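[Quick read]: This paper addresses the shortcomings of existing generative-AI-powered creativity support tools (CSTs) in orchestrating complex processes of divergence and convergence. The key to its solution is Reverger, an AI-powered CST that helps users explore and synthesize conceptual directions for modifying a story through flexible iteration between divergence and convergence. The tool recursively explores diverse directions for modifying a specific part of the story and lets users combine those directions into concrete variations, supporting a more efficient and creative ideation process.
Link: https://arxiv.org/abs/2507.03307
Authors: Taewook Kim,Matthew Kay,Yuqian Sun,Melissa Roemmele,Max Kreminski,John Joon Young Chung
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures, 3 tables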
Abstract:Human creative ideation involves both exploration of diverse ideas (divergence) and selective synthesis of explored ideas into coherent combinations (convergence). While processes of divergence and convergence are often interleaved and nested, existing AI-powered creativity support tools (CSTs) lack support for sophisticated orchestration of divergence and convergence. We present Reverger, an AI-powered CST that helps users ideate variations of conceptual directions for modifying a story by scaffolding flexible iteration between divergence and convergence. For divergence, our tool enables recursive exploration of alternative high-level directions for modifying a specific part of the original story. For convergence, it allows users to collect explored high-level directions and synthesize them into concrete variations. Users can then iterate between divergence and convergence until they find a satisfactory outcome. A within-subject study revealed that Reverger permitted participants to explore more unexpected and diverse high-level directions than a comparable baseline. Reverger users also felt that they had more fine-grained control and discovered more effort-worthy outcomes.
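[AI-135] MGAA: Multi-Granular Adaptive Allocation for Low-Rank Compression of LLMs
[Quick read]: This paper addresses a common problem in compressing large language models (LLMs): most studies apply a uniform compression ratio to all weight matrices and ignore the different impact each matrix has on model performance. The key to its solution is Multi-Granular Adaptive Allocation (MGAA), a method that adaptively allocates parameters between and within sublayers without relying on task-specific evaluations. MGAA assigns sublayer compression ratios based on the cosine similarity between each sublayer's inputs and outputs, and assigns compression ratios to weight matrices based on their energy distribution characteristics, which yields more efficient compression while preserving performance.
Link: https://arxiv.org/abs/2507.03294
Authors: Guangyan Li,Yongqiang Tang,Wensheng Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 13 pages, 8 figures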
Abstract:The enormous parameter scale of large language models (LLMs) has made model compression a research hotspot, which aims to alleviate computational resource demands during deployment and inference. As a promising direction, low-rank approximation technique has made remarkable achievements. Nevertheless, unfortunately, the vast majority of studies to low-rank approximation compression generally apply uniform compression ratios across all weight matrices, while disregarding their inherently differentiated impacts on the model’s performance. Although a few recent work attempts to employ heuristic search strategies to achieve the optimal parameter allocation, such strategies are computationally inefficient and lose the generalization ability in the era of LLMs. In this study, we propose a novel parameter Multi-Granular Adaptive Allocation (MGAA) method, which can adaptively allocate parameters between and within sublayers without task-specific evaluations in the compression process. MGAA consists of two components: 1) Among different sublayers, it assigns compression ratios based on their cosine similarity between inputs and outputs, allowing for a more tailored compression in sublayers with varying degrees of importance, and 2) Within each sublayer, it allocates different compression ratios to weight matrices based on their energy distribution characteristics, ensuring a consistent energy retention ratio while optimizing compression efficiency. Comprehensive evaluations of MGAA across multiple LLMs backbone models and benchmark datasets demonstrate its superior performance. Additionally, we apply our MGAA to multimodal model LLaVA, exhibiting remarkable performance improvements.
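To make the idea of a "consistent energy retention ratio" concrete, the sketch below picks, for one weight matrix, the smallest rank whose leading singular values keep a target share of the total energy. Treating energy as the sum of squared singular values is a common convention and an assumption here; the abstract does not give MGAA's exact definition, and `rank_for_energy` is a hypothetical helper name.

```python
def rank_for_energy(singular_values, retention=0.9):
    """Smallest rank whose leading squared singular values keep
    at least `retention` of the total energy (assumed: energy = sum of s^2)."""
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0:
        return 0  # an all-zero matrix needs no components
    kept = 0.0
    for rank, energy in enumerate(energies, start=1):
        kept += energy
        if kept / total >= retention:
            return rank
    return len(energies)
```

With the same retention target, a matrix whose energy is concentrated in a few singular values gets a low rank (high compression), while a matrix with a flat spectrum keeps most of its rank. That is one way a single ratio can translate into different per-matrix compression levels.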
[AI-136] Memory Mosaics at scale
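[Quick read]: This paper aims to address the limitations of large-scale language models in training-knowledge storage, new-knowledge storage, and in-context learning. Its solution is the improved "Memory Mosaics v2" architecture, which retains the compositional and in-context learning capabilities of the original while scaling to 10B parameters and training on real-world datasets, and which outperforms conventional Transformer models on several evaluation dimensions.
Link: https://arxiv.org/abs/2507.03285
Authors: Jianyu Zhang,Léon Bottou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: arXiv admin note: substantial text overlap with arXiv:2504.14751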
Abstract:Memory Mosaics [Zhang et al., 2025], networks of associative memories, have demonstrated appealing compositional and in-context learning capabilities on medium-scale networks (GPT-2 scale) and synthetic small datasets. This work shows that these favorable properties remain when we scale memory mosaics to large language model sizes (llama-8B scale) and real-world datasets. To this end, we scale memory mosaics to 10B size, we train them on one trillion tokens, we introduce a couple architectural modifications (“Memory Mosaics v2”), we assess their capabilities across three evaluation dimensions: training-knowledge storage, new-knowledge storage, and in-context learning. Throughout the evaluation, memory mosaics v2 match transformers on the learning of training knowledge (first dimension) and significantly outperforms transformers on carrying out new tasks at inference time (second and third dimensions). These improvements cannot be easily replicated by simply increasing the training data for transformers. A memory mosaics v2 trained on one trillion tokens still perform better on these tasks than a transformer trained on eight trillion tokens.
[AI-137] Conformal Information Pursuit for Interactively Guiding Large Language Models
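[Quick read]: This paper addresses how to optimize sequential querying strategies in interactive question-answering tasks so as to minimize the expected number of queries. The core challenge is accurately estimating the uncertainty of large language models (LLMs) in order to select better queries and improve predictive performance. Conventional methods such as Information Pursuit (IP) rely on estimates of mutual information or conditional entropy, but these estimates are inaccurate because LLM probabilities are over- or under-confident. The key to the proposed solution, Conformal Information Pursuit (C-IP), is exploiting the relationship between conformal prediction sets and conditional entropy: it estimates uncertainty from the average size of the prediction sets, an approach that is distribution-free and robust, and that improves query efficiency and predictive performance.
Link: https://arxiv.org/abs/2507.03279
Authors: Kwan Ho Ryan Chan,Yuyan Ge,Edgar Dobriban,Hamed Hassani,René Vidal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: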
Abstract:A significant use case of instruction-finetuned Large Language Models (LLMs) is to solve question-answering tasks interactively. In this setting, an LLM agent is tasked with making a prediction by sequentially querying relevant information from the user, as opposed to a single-turn conversation. This paper explores sequential querying strategies that aim to minimize the expected number of queries. One such strategy is Information Pursuit (IP), a greedy algorithm that at each iteration selects the query that maximizes information gain or equivalently minimizes uncertainty. However, obtaining accurate estimates of mutual information or conditional entropy for LLMs is very difficult in practice due to over- or under-confident LLM probabilities, which leads to suboptimal query selection and predictive performance. To better estimate the uncertainty at each iteration, we propose Conformal Information Pursuit (C-IP), an alternative approach to sequential information gain based on conformal prediction sets. More specifically, C-IP leverages a relationship between prediction sets and conditional entropy at each iteration to estimate uncertainty based on the average size of conformal prediction sets. In contrast to conditional entropy, we find that conformal prediction sets are a distribution-free and robust method of measuring uncertainty. Experiments with 20 Questions show that C-IP obtains better predictive performance and shorter query-answer chains compared to previous approaches to IP and uncertainty-based chain-of-thought methods. Furthermore, extending to an interactive medical setting between a doctor and a patient on the MediQ dataset, C-IP achieves competitive performance with direct single-turn prediction while offering greater interpretability.
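The abstract says C-IP measures uncertainty by the average size of conformal prediction sets. The sketch below shows generic split conformal prediction with the nonconformity score 1 - p(label), followed by an average-set-size computation. It illustrates the general technique, not the paper's procedure; the score function, `alpha`, and all function names are assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibration score to use
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, threshold):
    """Labels whose nonconformity score 1 - p(label) is within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]

def average_set_size(prob_dicts, threshold):
    """Mean prediction-set size, used here as an uncertainty proxy."""
    sizes = [len(prediction_set(p, threshold)) for p in prob_dicts]
    return sum(sizes) / len(sizes)
```

In a query-selection loop, one would compute `average_set_size` over the answers each candidate query could produce and prefer the query that leaves the smallest sets on average, in the same spirit as picking the query that minimizes conditional entropy.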
[AI-138] ForgeHLS: A Large-Scale Open-Source Dataset for High-Level Synthesis
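[Quick read]: This paper addresses the limited scale and diversity of existing circuit datasets, which constrain the performance and generalization of electronic design automation (EDA) algorithms and generative AI models. The key to its solution is ForgeEDA, an open-source, comprehensive circuit dataset spanning multiple categories that includes diverse circuit representations such as Register Transfer Level (RTL) code, post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development and driving progress in EDA.
Link: https://arxiv.org/abs/2507.03255
Authors: Zedong Peng,Zeju Li,Mingzhe Gao,Qiang Xu,Chen Zhang,Jieru Zhao
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: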
Abstract:We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA’s utility by benchmarking state-of-the-art EDA algorithms on critical tasks such as Power, Performance, and Area (PPA) optimization, highlighting its ability to expose performance gaps and drive advancements. Additionally, ForgeEDA’s scale and diversity facilitate the training of AI models for EDA tasks, demonstrating its potential to improve model performance and generalization. By addressing limitations in existing datasets, ForgeEDA aims to catalyze breakthroughs in modern IC design and support the next generation of innovations in EDA.
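[AI-139] CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs
[Quick read]: This paper addresses the limited planning ability of large language model (LLM)-based agents in multi-agent systems, in particular the shortcomings of existing structured prompting strategies in token efficiency, modularity, and scalability in multi-agent environments. The key to its solution is the CodeAgents framework, which codifies every component of agent interaction (Task, Plan, Feedback, system roles, and external tool invocations) into modular pseudocode with control structures (e.g., loops, conditionals), boolean logic, and typed variables, enabling structured, token-efficient multi-agent reasoning.
Link: https://arxiv.org/abs/2507.03254
Authors: Bruce Yang,Xinfeng He,Huan Gao,Yifan Cao,Xiaofan Li,David Hsu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: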
Abstract:Effective prompt design is essential for improving the planning capabilities of large language model (LLM)-driven agents. However, existing structured prompting strategies are typically limited to single-agent, plan-only settings, and often evaluate performance solely based on task accuracy - overlooking critical factors such as token efficiency, modularity, and scalability in multi-agent environments. To address these limitations, we introduce CodeAgents, a prompting framework that codifies multi-agent reasoning and enables structured, token-efficient planning in multi-agent systems. In CodeAgents, all components of agent interaction - Task, Plan, Feedback, system roles, and external tool invocations - are codified into modular pseudocode enriched with control structures (e.g., loops, conditionals), boolean logic, and typed variables. This design transforms loosely connected agent plans into cohesive, interpretable, and verifiable multi-agent reasoning programs. We evaluate the proposed framework across three diverse benchmarks - GAIA, HotpotQA, and VirtualHome - using a range of representative LLMs. Results show consistent improvements in planning performance, with absolute gains of 3-36 percentage points over natural language prompting baselines. On VirtualHome, our method achieves a new state-of-the-art success rate of 56%. In addition, our approach reduces input and output token usage by 55-87% and 41-70%, respectively, underscoring the importance of token-aware evaluation metrics in the development of scalable multi-agent LLM systems. The code and resources are available at: this https URL
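[AI-140] Toward Efficient Speech Emotion Recognition via Spectral Learning and Attention
[Quick read]: This paper aims to address the shortcomings of traditional speech emotion recognition (SER) methods in capturing subtle emotional variations and generalizing across datasets. Its key solution is a framework based on a one-dimensional convolutional neural network (1D-CNN) that combines data augmentation with channel and spatial attention mechanisms and uses spectral features extracted as Mel-Frequency Cepstral Coefficients (MFCCs) to sharpen the model's sensitivity to emotional patterns in speech signals, substantially improving the accuracy and robustness of emotion classification.
Link: https://arxiv.org/abs/2507.03251
Authors: HyeYoung Lee,Muhammad Nadeem
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: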
Abstract:Speech Emotion Recognition (SER) traditionally relies on auditory data analysis for emotion classification. Several studies have adopted different methods for SER. However, existing SER methods often struggle to capture subtle emotional variations and generalize across diverse datasets. In this article, we use Mel-Frequency Cepstral Coefficients (MFCCs) as spectral features to bridge the gap between computational emotion processing and human auditory perception. To further improve robustness and feature diversity, we propose a novel 1D-CNN-based SER framework that integrates data augmentation techniques. MFCC features extracted from the augmented data are processed using a 1D Convolutional Neural Network (CNN) architecture enhanced with channel and spatial attention mechanisms. These attention modules allow the model to highlight key emotional patterns, enhancing its ability to capture subtle variations in speech signals. The proposed method delivers cutting-edge performance, achieving the accuracy of 97.49% for SAVEE, 99.23% for RAVDESS, 89.31% for CREMA-D, 99.82% for TESS, 99.53% for EMO-DB, and 96.39% for EMOVO. Experimental results show new benchmarks in SER, demonstrating the effectiveness of our approach in recognizing emotional expressions with high precision. Our evaluation demonstrates that the integration of advanced Deep Learning (DL) methods substantially enhances generalization across diverse datasets, underscoring their potential to advance SER for real-world deployment in assistive technologies and human-computer interaction.
[AI-141] On Jailbreaking Quantized Language Models Through Fault Injection Attacks
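[Quick read]: This paper addresses the safety alignment of language models (LMs) deployed with low-precision quantization when they are subjected to direct parameter manipulation attacks such as fault injection. Its key contribution is a set of gradient-guided attacks, including a tailored progressive bit-level search algorithm and a comparative word-level (single weight update) attack, used to evaluate attack efficacy under different quantization schemes. The results show that the quantization scheme significantly affects attack success: FP8 and INT8 models are more resistant than FP16 models, yet vulnerabilities remain, especially through post-attack quantization.
Link: https://arxiv.org/abs/2507.03236
Authors: Noureldin Zahran,Ahmad Tahmasivand,Ihsen Alouani,Khaled Khasawneh,Mohammed E. Fouda
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This work has been published in GLSVLSI 2025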
Abstract:The safety alignment of Language Models (LMs) is a critical concern, yet their integrity can be challenged by direct parameter manipulation attacks, such as those potentially induced by fault injection. As LMs are increasingly deployed using low-precision quantization for efficiency, this paper investigates the efficacy of such attacks for jailbreaking aligned LMs across different quantization schemes. We propose gradient-guided attacks, including a tailored progressive bit-level search algorithm introduced herein and a comparative word-level (single weight update) attack. Our evaluation on Llama-3.2-3B, Phi-4-mini, and Llama-3-8B across FP16 (baseline), and weight-only quantization (FP8, INT8, INT4) reveals that quantization significantly influences attack success. While attacks readily achieve high success (80% Attack Success Rate, ASR) on FP16 models, within an attack budget of 25 perturbations, FP8 and INT8 models exhibit ASRs below 20% and 50%, respectively. Increasing the perturbation budget up to 150 bit-flips, FP8 models maintained ASR below 65%, demonstrating some resilience compared to INT8 and INT4 models that have high ASR. In addition, analysis of perturbation locations revealed differing architectural targets across quantization schemes, with (FP16, INT4) and (INT8, FP8) showing similar characteristics. Besides, jailbreaks induced in FP16 models were highly transferable to subsequent FP8/INT8 quantization (5% ASR difference), though INT4 significantly reduced transferred ASR (avg. 35% drop). These findings highlight that while common quantization schemes, particularly FP8, increase the difficulty of direct parameter manipulation jailbreaks, vulnerabilities can still persist, especially through post-attack quantization.
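As background for the bit-flip budget discussed in the abstract, the sketch below shows what a single bit-flip does to one INT8 weight stored in two's complement. It only illustrates the arithmetic of one fault; it is not the paper's gradient-guided search, and `flip_bit_int8` is a hypothetical helper.

```python
def flip_bit_int8(value, bit):
    """Flip one bit of a signed 8-bit integer (two's complement)
    and return the resulting signed value."""
    if not -128 <= value <= 127:
        raise ValueError("value must fit in int8")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    raw = (value & 0xFF) ^ (1 << bit)        # flip the bit in the unsigned byte
    return raw - 256 if raw >= 128 else raw  # reinterpret the byte as signed
```

Flipping bit 0 changes the weight by 1, while flipping bit 7 (the sign bit) shifts it by 128, so the magnitude of the change a single fault causes depends heavily on which bit is hit and on the number format.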
[AI-142] Efficient Knowledge Graph Construction and Retrieval from Unstructured Text for Large-Scale RAG Systems
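[Quick read]: This paper addresses the high computational cost and graph-retrieval latency of deploying graph-based retrieval-augmented generation (GraphRAG) in enterprise environments. Its key solution introduces two core innovations: first, a dependency-based knowledge graph construction pipeline that uses industrial-grade natural language processing (NLP) libraries to extract entities and relations from unstructured text, eliminating reliance on large language models (LLMs) entirely; second, a lightweight graph retrieval strategy that combines hybrid query node identification with efficient one-hop traversal for high-recall, low-latency subgraph extraction.
Link: https://arxiv.org/abs/2507.03226
Authors: Congmin Min,Rhea Mathew,Joyce Pan,Sahil Bansal,Abbas Keshavarzi,Amar Viswanathan Kannan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: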
Abstract:We propose a scalable and cost-efficient framework for deploying Graph-based Retrieval Augmented Generation (GraphRAG) in enterprise environments. While GraphRAG has shown promise for multi-hop reasoning and structured retrieval, its adoption has been limited by the high computational cost of constructing knowledge graphs using large language models (LLMs) and the latency of graph-based retrieval. To address these challenges, we introduce two core innovations: (1) a dependency-based knowledge graph construction pipeline that leverages industrial-grade NLP libraries to extract entities and relations from unstructured text completely eliminating reliance on LLMs; and (2) a lightweight graph retrieval strategy that combines hybrid query node identification with efficient one-hop traversal for high-recall, low-latency subgraph extraction. We evaluate our framework on two SAP datasets focused on legacy code migration and demonstrate strong empirical performance. Our system achieves up to 15% and 4.35% improvements over traditional RAG baselines based on LLM-as-Judge and RAGAS metrics, respectively. Moreover, our dependency-based construction approach attains 94% of the performance of LLM-generated knowledge graphs (61.87% vs. 65.83%) while significantly reducing cost and improving scalability. These results validate the feasibility of deploying GraphRAG systems in real-world, large-scale enterprise applications without incurring prohibitive resource requirements paving the way for practical, explainable, and domain-adaptable retrieval-augmented reasoning.
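[AI-143] SI-Agent: An Agentic Framework for Feedback-Driven Generation and Tuning of Human-Readable System Instructions for Large Language Models
[Quick read]: This paper addresses how to efficiently generate system instructions (SIs) that are both readable and effective, as an alternative to manual crafting, which is resource-intensive and often suboptimal. The key to its solution is the SI-Agent framework, in which three collaborating agents (an Instructor Agent, an Instruction Follower Agent, and a Feedback/Reward Agent) form a feedback-driven iterative refinement loop that automatically generates and continually refines human-readable SIs, achieving a favorable balance between task performance and interpretability.
Link: https://arxiv.org/abs/2507.03223
Authors: Jeshwanth Challagundla
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: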
Abstract:System Instructions (SIs), or system prompts, are pivotal for guiding Large Language Models (LLMs) but manual crafting is resource-intensive and often suboptimal. Existing automated methods frequently generate non-human-readable “soft prompts,” sacrificing interpretability. This paper introduces SI-Agent, a novel agentic framework designed to automatically generate and iteratively refine human-readable SIs through a feedback-driven loop. SI-Agent employs three collaborating agents: an Instructor Agent, an Instruction Follower Agent (target LLM), and a Feedback/Reward Agent evaluating task performance and optionally SI readability. The framework utilizes iterative cycles where feedback guides the Instructor’s refinement strategy (e.g., LLM-based editing, evolutionary algorithms). We detail the framework’s architecture, agent roles, the iterative refinement process, and contrast it with existing methods. We present experimental results validating SI-Agent’s effectiveness, focusing on metrics for task performance, SI readability, and efficiency. Our findings indicate that SI-Agent generates effective, readable SIs, offering a favorable trade-off between performance and interpretability compared to baselines. Potential implications include democratizing LLM customization and enhancing model transparency. Challenges related to computational cost and feedback reliability are acknowledged.
[AI-144] Neural Inhibition Improves Dynamic Routing and Mixture of Experts
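[Quick read]: This paper addresses how deep learning models can dynamically choose a suitable architecture to improve effectiveness and efficiency when facing diverse data. The key to its solution is introducing neural inhibition: by inhibiting signals that are commonly shared across a neural population, the routing model can select a specialized expert path for each data sample, improving the model's flexibility and performance.
Link: https://arxiv.org/abs/2507.03221
Authors: Will Y. Zou,Jennifer Y. Zhang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages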
Abstract:To be effective, efficient, and diverse, deep learning models need to dynamically choose its architecture based on signals from a population of neurons. We hypothesize dynamic routing models can be improved with neural inhibition in those neural populations. This means signals commonly shared among the various modes of data statistics can be inhibited so that the routing model can choose a specialized expert path for each data sample. Only through inhibition is the routing mechanism able to effectively select neural pathways. We believe this is an under-studied and under-verified implementation methodology for Mixture-of-Experts, dynamic routing, and transformer language models. We provide experimental evidence that the neural inhibition algorithm significantly boosts the performance of general tasks and motivates more effort to be invested in this research direction.
[AI-145] Symbiosis: Multi-Adapter Inference and Fine-Tuning
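[Quick read]: This paper addresses practical problems with parameter-efficient fine-tuning (PEFT): poor resource utilization, insufficient multi-adapter support, inability to share resources, and lack of privacy protection. The key to its solution is the Symbiosis framework, which deploys the base model as a service so that base model layers are shared across multiple inference or fine-tuning jobs, and which uses a split-execution technique to decouple client-specific adapters from the frozen base model layers, giving clients flexible resource management, independent choice of fine-tuning method, and room for performance optimization.
Link: https://arxiv.org/abs/2507.03220
Authors: Saransh Gupta,Umesh Deshpande,Travis Janssen,Swami Sundararaman
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: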
Abstract:Parameter-efficient fine-tuning (PEFT) allows model builders to capture the task specific parameters into adapters, which are a fraction of the size of the original base model. Popularity of PEFT technique for fine-tuning has led to creation of a large number of adapters for popular Large Language Models (LLMs). However, existing frameworks fall short in supporting inference or fine-tuning with multiple adapters in the following ways. 1) For fine-tuning, each job needs to deploy its dedicated base model instance, which results in excessive GPU memory consumption and poor GPU utilization. 2) While popular inference platforms can serve multiple PEFT adapters, they do not allow independent resource management or mixing of different PEFT methods. 3) They cannot share resources (such as base model instance) between inference and fine-tuning jobs. 4) They do not provide privacy to users who may not wish to expose their fine-tuned parameters to service providers. In Symbiosis, we address the above problems by enabling as-a-service deployment of base model. The base model layers can be shared across multiple inference or fine-tuning processes. Our split-execution technique decouples the execution of client-specific adapters and layers from the frozen base model layers offering them flexibility to manage their resources, to select their fine-tuning method, to achieve their performance goals. Our approach is transparent to models and works out-of-the-box for most models in the transformers library. Our evaluation on Llama2-13B shows the compared to baseline, Symbiosis can fine-tune 4X more adapters on the same set of GPUs in the same amount of time.
[AI-146] Disclosing Generative AI Use in Digital Humanities Research
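[Quick read]: This paper addresses the gap between how digital humanities scholars perceive the disclosure of generative AI use in research and what they actually practice. It finds that although scholars broadly acknowledge the importance of disclosure, the actual rate of disclosure in research remains low, and that they disagree on which activities most require disclosure and on which methods of disclosure are appropriate. The key to the solution lies in establishing safeguards for AI disclosure through institutional policies rather than relying on individual decisions.
Link: https://arxiv.org/abs/2507.03216
Authors: Rongqian Ma,Xuhan Zhang,Adrian Wisnicki
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL); Emerging Technologies (cs.ET)
Comments: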
Abstract:This survey study investigates how digital humanists perceive and approach generative AI disclosure in research. The results indicate that while digital humanities scholars acknowledge the importance of disclosing GenAI use, the actual rate of disclosure in research practice remains low. Respondents differ in their views on which activities most require disclosure and on the most appropriate methods for doing so. Most also believe that safeguards for AI disclosure should be established through institutional policies rather than left to individual decisions. The study’s findings will offer empirical guidance to scholars, institutional leaders, funders, and other stakeholders responsible for shaping effective disclosure policies.
[AI-147] Discovering Algorithms with Computational Language Processing
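[Quick read]: This paper aims to address the inadequate performance of algorithms for complex combinatorial optimization problems and foundational quantum computing methods, in particular strongly NP-hard problems and the limitations of classic approaches such as Grover's algorithm and the Quantum Approximate Optimization Algorithm (QAOA). The key to its solution is treating algorithms as sequences of operations represented as tokens, chaining those tokens with a grammar, and using an ensemble Monte Carlo tree search (MCTS) guided by reinforcement learning (RL) to explore effective token combinations, which yields new algorithms that substantially outperform existing methods.
Link: https://arxiv.org/abs/2507.03190
Authors: Theo Bourdais,Abeynaya Gnanasekaran,Houman Owhadi,Tuhin Sahai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 21 pages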
Abstract:Algorithms are the engine for reproducible problem-solving. We present a framework automating algorithm discovery by conceptualizing them as sequences of operations, represented as tokens. These computational tokens are chained using a grammar, enabling the formation of increasingly sophisticated procedures. Our ensemble Monte Carlo tree search (MCTS) guided by reinforcement learning (RL) explores token chaining and drives the creation of new tokens. This methodology rediscovers, improves, and generates new algorithms that substantially outperform existing methods for strongly NP-hard combinatorial optimization problems and foundational quantum computing approaches such as Grover’s and Quantum Approximate Optimization Algorithm. Operating at the computational rather than code-generation level, our framework produces algorithms that can be tailored specifically to problem instances, not merely classes.
[AI-148] Understanding Knowledge Transferability for Transfer Learning: A Survey
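[Quick read]: This paper addresses how to reliably assess the transferability of knowledge, a critical and still unresolved challenge in transfer learning. The key to its solution is a unified taxonomy of transferability metrics that systematically organizes existing metrics by transferable knowledge type and measurement granularity and analyzes how well different metrics apply across learning paradigms, giving researchers and practitioners guidance on choosing a suitable metric and thereby improving the efficiency, reliability, and trustworthiness of transfer learning.
Link: https://arxiv.org/abs/2507.03175
Authors: Haohua Wang,Jingge Wang,Zijie Zhao,Yang Tan,Yanru Wu,Hanbing Liu,Jingyun Yang,Enming Zhang,Xiangyu Chen,Zhengze Rong,Shanxin Guo,Yang Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 35 pages, 15 figures, submitted to ACM Computing Surveys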
Abstract:Transfer learning has become an essential paradigm in artificial intelligence, enabling the transfer of knowledge from a source task to improve performance on a target task. This approach, particularly through techniques such as pretraining and fine-tuning, has seen significant success in fields like computer vision and natural language processing. However, despite its widespread use, how to reliably assess the transferability of knowledge remains a challenge. Understanding the theoretical underpinnings of each transferability metric is critical for ensuring the success of transfer learning. In this survey, we provide a unified taxonomy of transferability metrics, categorizing them based on transferable knowledge types and measurement granularity. This work examines the various metrics developed to evaluate the potential of source knowledge for transfer learning and their applicability across different learning paradigms emphasizing the need for careful selection of these metrics. By offering insights into how different metrics work under varying conditions, this survey aims to guide researchers and practitioners in selecting the most appropriate metric for specific applications, contributing to more efficient, reliable, and trustworthy AI systems. Finally, we discuss some open challenges in this field and propose future research directions to further advance the application of transferability metrics in trustworthy transfer learning.
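[AI-149] The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Literature Review
[Quick Read]: The paper addresses the lack of a systematic synthesis of how generative AI in software development, and large language model assistants (LLM-assistants) in particular, affects developer productivity. Its approach is a systematic literature review of 37 peer-reviewed studies published between 2014 and 2024. The review weighs the reported benefits of LLM-assistants (less code search, faster development, task automation) against risks such as cognitive offloading, reduced team collaboration, and inconsistent code quality. It also notes that few studies apply the multi-dimensional SPACE framework in full, and it proposes directions for future research.
Link: https://arxiv.org/abs/2507.03156
Authors: Amr Mohamed,Maram Assi,Mariam Guizani
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 37 pages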
Abstract:Large language model assistants (LLM-assistants) present new opportunities to transform software development. Developers are increasingly adopting these tools across tasks, including coding, testing, debugging, documentation, and design. Yet, despite growing interest, there is no synthesis of how LLM-assistants affect software developer productivity. In this paper, we present a systematic literature review of 37 peer-reviewed studies published between January 2014 and December 2024 that examine this impact. Our analysis reveals that LLM-assistants offer both considerable benefits and critical risks. Commonly reported gains include minimized code search, accelerated development, and the automation of trivial and repetitive tasks. However, studies also highlight concerns around cognitive offloading, reduced team collaboration, and inconsistent effects on code quality. While the majority of studies (92%) adopt a multi-dimensional perspective by examining at least two SPACE dimensions, reflecting increased awareness of the complexity of developer productivity, only 14% extend beyond three dimensions, indicating substantial room for more integrated evaluations. Satisfaction, Performance, and Efficiency are the most frequently investigated dimensions, whereas Communication and Activity remain underexplored. Most studies are exploratory (64%) and methodologically diverse, but lack longitudinal and team-based evaluations. This review surfaces key research gaps and provides recommendations for future research and practice. All artifacts associated with this study are publicly available at this https URL.
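[AI-150] How Overconfidence in Initial Choices and Underconfidence Under Criticism Modulate Change of Mind in Large Language Models
[Quick Read]: The paper examines contradictory behavior in Large Language Models (LLMs) when challenged: they appear overconfident in their initial answers yet become excessively doubtful when questioned. The key is a new experimental paradigm that elicits confidence estimates from LLMs without the model retaining a memory of its initial judgment. This reveals a pronounced choice-supportive bias that inflates confidence in the model's own answer and produces a marked resistance to changing its mind. The study also finds that LLMs weight inconsistent advice far more heavily than consistent advice, which deviates qualitatively from normative Bayesian updating. Together, these two mechanisms account for LLM behavior across domains.
Link: https://arxiv.org/abs/2507.03120
Authors: Dharshan Kumaran,Stephen M Fleming,Larisa Markeeva,Joe Heyward,Andrea Banino,Mrinal Mathur,Razvan Pascanu,Simon Osindero,Benedetto de Martino,Petar Velickovic,Viorica Patraucean
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 41 pages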
Abstract:Large language models (LLMs) exhibit strikingly conflicting behaviors: they can appear steadfastly overconfident in their initial answers whilst at the same time being prone to excessive doubt when challenged. To investigate this apparent paradox, we developed a novel experimental paradigm, exploiting the unique ability to obtain confidence estimates from LLMs without creating memory of their initial judgments – something impossible in human participants. We show that LLMs – Gemma 3, GPT4o and o1-preview – exhibit a pronounced choice-supportive bias that reinforces and boosts their estimate of confidence in their answer, resulting in a marked resistance to change their mind. We further demonstrate that LLMs markedly overweight inconsistent compared to consistent advice, in a fashion that deviates qualitatively from normative Bayesian updating. Finally, we demonstrate that these two mechanisms – a drive to maintain consistency with prior commitments and hypersensitivity to contradictory feedback – parsimoniously capture LLM behavior in a different domain. Together, these findings furnish a mechanistic account of LLM confidence that explains both their stubbornness and excessive sensitivity to criticism.
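For reference, the normative Bayesian baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration, not the authors' experimental code: it assumes a binary choice and an adviser who is correct with a fixed probability independent of the model's answer, and the function names and the numbers 0.6 and 0.7 are hypothetical.

```python
import math


def bayes_update(prior: float, advice_accuracy: float, agrees: bool) -> float:
    """Posterior probability that the initial answer is correct after one piece of advice.

    Assumption (not from the paper): binary choice, and the adviser is right with
    probability `advice_accuracy`, independently of the initial answer.
    """
    # Likelihood of seeing this advice if the initial answer is right / wrong.
    like_right = advice_accuracy if agrees else 1.0 - advice_accuracy
    like_wrong = 1.0 - advice_accuracy if agrees else advice_accuracy
    evidence = prior * like_right + (1.0 - prior) * like_wrong
    return prior * like_right / evidence


def logit(p: float) -> float:
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))


prior = 0.6                                      # hypothetical initial confidence
up = bayes_update(prior, 0.7, agrees=True)       # about 0.778
down = bayes_update(prior, 0.7, agrees=False)    # about 0.391

# Under this baseline, agreeing and disagreeing advice shift log-odds by the same amount.
shift_up = logit(up) - logit(prior)
shift_down = logit(prior) - logit(down)
```

In this idealized setting the two shifts are equal in log-odds, so a system that moves further on disagreeing advice than on agreeing advice of the same reliability is departing from the baseline, which is the kind of asymmetry the abstract reports.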
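[AI-151] Neural-Network solver of ideal MHD equilibria
[Quick Read]: The paper tackles the computation of three-dimensional magnetohydrodynamic (MHD) equilibria. The key is to parametrize Fourier modes with artificial neural networks and to minimize the global force residual in real space with first-order optimizers. The approach is competitive with conventional solvers in computational cost and, given more compute, reaches lower residuals, establishing a new lower bound for the force residual.
Link: https://arxiv.org/abs/2507.03119
Authors: Timo Thun,Andrea Merlo,Rory Conlin,Dario Panici,Daniel Böckenhoff
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Plasma Physics (physics.plasm-ph)
Comments: To be submitted to Nuclear Fusion, 16 pages, 8 figures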
Abstract:We present a novel approach to compute three-dimensional Magnetohydrodynamic equilibria by parametrizing Fourier modes with artificial neural networks and compare it to equilibria computed by conventional solvers. The full nonlinear global force residual across the volume in real space is then minimized with first order optimizers. Already, we observe competitive computational cost to arrive at the same minimum residuals computed by existing codes. With increased computational cost, lower minima of the residual are achieved by the neural networks, establishing a new lower bound for the force residual. We use minimally complex neural networks, and we expect significant improvements for solving not only single equilibria with neural networks, but also for computing neural network models valid over continuous distributions of equilibria.
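For readers unfamiliar with the quantity being minimized, the standard ideal MHD force balance can be written as below. The equilibrium conditions are textbook; the volume-integrated residual $R$ is only one plausible way to turn the local residual into a scalar loss, and its exact form and normalization here are assumptions, not taken from the paper.

```latex
% Ideal MHD equilibrium: the Lorentz force balances the pressure gradient.
\mathbf{J} \times \mathbf{B} = \nabla p, \qquad
\mu_0 \mathbf{J} = \nabla \times \mathbf{B}, \qquad
\nabla \cdot \mathbf{B} = 0

% Local force residual, and an assumed scalar objective over the plasma volume V:
\mathbf{F} = \mathbf{J} \times \mathbf{B} - \nabla p, \qquad
R = \int_V \lVert \mathbf{F} \rVert^2 \, \mathrm{d}V
```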
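[AI-152] Uncovering Synergistic Educational Injustices of COVID-19 and AI
[Quick Read]: The paper examines the long-term educational consequences of the COVID-19 pandemic and the rapid spread of generative AI in higher education, in particular the potential harm to learning experiences, educational equity, and cognitive orientation. Analyzing narratives from students at Iranian universities, it finds that learning experiences during and after the pandemic, combined with unprepared exposure to AI tools, produced hidden but far-reaching educational inequality and cognitive disorientation. The key is to understand the long-term consequences of these sociotechnical interactions and to expose their underlying mechanisms through a methodology grounded in critical realism and narrative inquiry.
Link: https://arxiv.org/abs/2507.03095
Authors: Ahmad Banyasady (Malayer University)
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 15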
Abstract:Grounded in critical realism and using narrative inquiry, this article explores the long-term consequences of the COVID-19 pandemic and the rapid proliferation of artificial intelligence within higher education. Through the analysis of student narratives collected in Iranian university settings, the study reveals that learning experiences during and after the pandemic, coupled with unprepared exposure to AI tools, have generated hidden yet impactful layers of educational inequality and cognitive disorientation.
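[AI-153] LLM-Driven Auto Configuration for Transient IoT Device Collaboration
[Quick Read]: The paper targets secure device collaboration in transient Internet of Things (IoT) environments, specifically how to automatically configure and enforce fine-grained access control policies when transient devices must interact with host devices in temporarily visited environments. The key is a Large Language Model (LLM)-based approach that converts users' high-level intents into fine-grained access control policies, with capability-based access control for authorization and lightweight proxies for policy enforcement, providing hardware-independent abstractions.
Link: https://arxiv.org/abs/2507.03064
Authors: Hetvi Shastri,Walid A. Hanafy,Li Wu,David Irwin,Mani Srivastava,Prashant Shenoy
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: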
Abstract:Today’s Internet of Things (IoT) has evolved from simple sensing and actuation devices to those with embedded processing and intelligent services, enabling rich collaborations between users and their devices. However, enabling such collaboration becomes challenging when transient devices need to interact with host devices in temporarily visited environments. In such cases, fine-grained access control policies are necessary to ensure secure interactions; however, manually implementing them is often impractical for non-expert users. Moreover, at run-time, the system must automatically configure the devices and enforce such fine-grained access control rules. Additionally, the system must address the heterogeneity of devices. In this paper, we present CollabIoT, a system that enables secure and seamless device collaboration in transient IoT environments. CollabIoT employs a Large language Model (LLM)-driven approach to convert users’ high-level intents to fine-grained access control policies. To support secure and seamless device collaboration, CollabIoT adopts capability-based access control for authorization and uses lightweight proxies for policy enforcement, providing hardware-independent abstractions. We implement a prototype of CollabIoT’s policy generation and auto configuration pipelines and evaluate its efficacy on an IoT testbed and in large-scale emulated environments. We show that our LLM-based policy generation pipeline is able to generate functional and correct policies with 100% accuracy. At runtime, our evaluation shows that our system configures new devices in ~150 ms, and our proxy-based data plane incurs network overheads of up to 2 ms and access control overheads up to 0.3 ms. 
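To make the idea of capability-based access control concrete, here is a minimal sketch of the kind of check a policy-enforcing proxy might perform. It is not CollabIoT's implementation: the `Capability` fields, the device names, and the `is_allowed` helper are all hypothetical, chosen only to show that authorization depends on an unforgeable token rather than on a per-device access list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical capability token issued to a transient device."""
    holder: str            # transient device the token was issued to
    resource: str          # host device or endpoint the token covers
    actions: frozenset     # permitted operations, e.g. {"read"}
    expires_at: float      # expiry time, in seconds since the epoch


def is_allowed(cap: Capability, holder: str, resource: str,
               action: str, now: float) -> bool:
    """Return True only if the request matches the token and the token is still valid."""
    return (
        cap.holder == holder
        and cap.resource == resource
        and action in cap.actions
        and now < cap.expires_at
    )


# A guest phone may read the meeting-room display state for a limited time.
cap = Capability(
    holder="guest-phone",
    resource="room-display",
    actions=frozenset({"read"}),
    expires_at=1_000.0,
)

read_ok = is_allowed(cap, "guest-phone", "room-display", "read", now=500.0)
write_ok = is_allowed(cap, "guest-phone", "room-display", "write", now=500.0)
expired_ok = is_allowed(cap, "guest-phone", "room-display", "read", now=1_500.0)
```

A real system would also need to sign or otherwise protect the token against forgery; that part is omitted here.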
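[AI-154] BERT4Traj: Transformer Based Trajectory Reconstruction for Sparse Mobility Data
[Quick Read]: The paper addresses sparsity in mobility data. Limitations of data collection, such as infrequent GPS sampling or call detail record (CDR) data that capture locations only during communication events, make complete individual trajectories unavailable. The key is BERT4Traj, a Transformer-based model that reconstructs complete trajectories by predicting hidden visits in sparse movement sequences, combining spatial embeddings, temporal embeddings, and contextual background features such as demographics and anchor points.
Link: https://arxiv.org/abs/2507.03062
Authors: Hao Yang,Angela Yao,Christopher Whalen,Gengchen Mai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper was accepted at GIScience 2025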
Abstract:Understanding human mobility is essential for applications in public health, transportation, and urban planning. However, mobility data often suffers from sparsity due to limitations in data collection methods, such as infrequent GPS sampling or call detail record (CDR) data that only capture locations during communication events. To address this challenge, we propose BERT4Traj, a transformer based model that reconstructs complete mobility trajectories by predicting hidden visits in sparse movement sequences. Inspired by BERT’s masked language modeling objective and self-attention mechanisms, BERT4Traj leverages spatial embeddings, temporal embeddings, and contextual background features such as demographics and anchor points. We evaluate BERT4Traj on real world CDR and GPS datasets collected in Kampala, Uganda, demonstrating that our approach significantly outperforms traditional models such as Markov Chains, KNN, RNNs, and LSTMs. Our results show that BERT4Traj effectively reconstructs detailed and continuous mobility trajectories, enhancing insights into human movement patterns.
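The masked-modeling objective mentioned in the abstract can be illustrated with a small data-preparation sketch: hide some visits in a trajectory and keep the hidden locations as prediction targets. This is only an assumed, simplified version of the idea; the `mask_visits` helper, the `[MASK]` token, the 30% ratio, and the location IDs are hypothetical and not taken from the paper.

```python
import random

MASK = "[MASK]"


def mask_visits(trajectory, mask_ratio, seed):
    """Hide a fraction of visits; return the masked sequence and the hidden targets.

    Assumes a non-empty trajectory of location identifiers ordered in time.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = max(1, round(len(trajectory) * mask_ratio))
    hidden = sorted(rng.sample(range(len(trajectory)), n_mask))
    masked = [MASK if i in hidden else loc for i, loc in enumerate(trajectory)]
    targets = {i: trajectory[i] for i in hidden}  # position -> true location
    return masked, targets


# Hypothetical sequence of visited cell-tower IDs.
trajectory = ["T12", "T12", "T40", "T07", "T07", "T12"]
masked, targets = mask_visits(trajectory, mask_ratio=0.3, seed=0)

# A model would be trained to predict `targets` from `masked`; restoring the
# targets by hand recovers the original sequence.
restored = [targets.get(i, loc) for i, loc in enumerate(masked)]
```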
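[AI-155] AI-Based Reconstruction from Inherited Personal Data: Analysis Feasibility and Prospects
[Quick Read]: The paper asks how AI could be used to create an "electronic copy" of a deceased researcher in order to preserve their intellectual legacy. The key is to fine-tune an advanced pre-trained model on data stored on the researcher's personal computers (text files, non-textual data, and file metadata) so as to reproduce the researcher's writing style, domain expertise, and rhetorical voice with high fidelity.
Link: https://arxiv.org/abs/2507.03059
Authors: Mark Zilberman
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9 pages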
Abstract:This article explores the feasibility of creating an “electronic copy” of a deceased researcher by training artificial intelligence (AI) on the data stored in their personal computers. By analyzing typical data volumes on inherited researcher computers, including textual files such as articles, emails, and drafts, it is estimated that approximately one million words are available for AI training. This volume is sufficient for fine-tuning advanced pre-trained models like GPT-4 to replicate a researcher’s writing style, domain expertise, and rhetorical voice with high fidelity. The study also discusses the potential enhancements from including non-textual data and file metadata to enrich the AI’s representation of the researcher. Extensions of the concept include communication between living researchers and their electronic copies, collaboration among individual electronic copies, as well as the creation and interconnection of organizational electronic copies to optimize information access and strategic decision-making. Ethical considerations such as ownership and security of these electronic copies are highlighted as critical for responsible implementation. The findings suggest promising opportunities for AI-driven preservation and augmentation of intellectual legacy.
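[AI-156] Automated Grading of Students' Handwritten Graphs: A Comparison of Meta-Learning and Vision-Large Language Models
[Quick Read]: The paper addresses the need for efficient and consistent automated grading in mathematics for online learning, specifically the grading of student responses that mix handwritten graphs and text. The key is to grade images containing handwritten graphs and text with multimodal meta-learning models and to compare them with Vision Large Language Models (VLLMs). The meta-learning models outperform VLLMs on 2-way classification, while VLLMs are slightly better on the more complex 3-way classification.
Link: https://arxiv.org/abs/2507.03056
Authors: Behnam Parsaeifard,Martin Hlosta,Per Bergamin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: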
Abstract:With the rise of online learning, the demand for efficient and consistent assessment in mathematics has significantly increased over the past decade. Machine Learning (ML), particularly Natural Language Processing (NLP), has been widely used for autograding student responses, particularly those involving text and/or mathematical expressions. However, there has been limited research on autograding responses involving students’ handwritten graphs, despite their prevalence in Science, Technology, Engineering, and Mathematics (STEM) curricula. In this study, we implement multimodal meta-learning models for autograding images containing students’ handwritten graphs and text. We further compare the performance of Vision Large Language Models (VLLMs) with these specially trained metalearning models. Our results, evaluated on a real-world dataset collected from our institution, show that the best-performing meta-learning models outperform VLLMs in 2-way classification tasks. In contrast, in more complex 3-way classification tasks, the best-performing VLLMs slightly outperform the meta-learning models. While VLLMs show promising results, their reliability and practical applicability remain uncertain and require further investigation.
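[AI-157] From 2:4 to 8:16 sparsity patterns in LLMs for Outliers and Weights with Variance Correction
[Quick Read]: The paper targets efficient compression of Large Language Models (LLMs) that reduces storage and compute requirements while preserving performance. The key is 8:16 semi-structured sparsity, which can surpass the performance threshold (a compressed model matching the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints) and offers more flexibility than 2:4 sparsity at a small additional storage overhead (0.875 vs. 0.75 bits/element). The paper also handles salient weights with structured sparse patterns and improves sparse models with variance correction and SmoothQuant-like weight equalization.
Link: https://arxiv.org/abs/2507.03052
Authors: Egor Maximov,Yulia Kuzkina,Azamat Kanametov,Alexander Prutko,Aleksei Goncharov,Maxim Zhelnin,Egor Shvetsov
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: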
Abstract:As large language models (LLMs) grow in size, efficient compression techniques like quantization and sparsification are critical. While quantization maintains performance with reduced precision, structured sparsity methods, such as N:M sparsification, often fall short due to limited flexibility and sensitivity to outlier weights. We explore 8:16 semi-structured sparsity, demonstrating its ability to surpass the Performance Threshold, where a compressed model matches the accuracy of its uncompressed or smaller counterpart under equivalent memory constraints. Compared to 2:4 sparsity, 8:16 offers greater flexibility with minimal storage overhead (0.875 vs. 0.75 bits/element). We also apply sparse structured patterns for salient weights, showing that structured sparsity for outliers is competitive with unstructured approaches, leading to equivalent or better results. Finally, we demonstrate that simple techniques such as variance correction and SmoothQuant-like weight equalization improve sparse model performance.
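The overhead figures quoted in the abstract (0.875 vs. 0.75 bits/element) are consistent with a simple counting argument, sketched below. The abstract does not say how the numbers were derived, so this is an assumed reading: each group of M weights stores which N positions are kept, encoded as one index among all C(M, N) patterns and rounded up to whole bits.

```python
import math


def mask_bits_per_element(n: int, m: int) -> float:
    """Index-storage overhead of N:M sparsity, in bits per weight.

    Assumption: one pattern index per group of m weights, using
    ceil(log2(C(m, n))) bits.
    """
    patterns = math.comb(m, n)                       # number of valid masks per group
    bits_per_group = math.ceil(math.log2(patterns))  # bits needed to name one mask
    return bits_per_group / m


overhead_2_4 = mask_bits_per_element(2, 4)    # → 0.75  (6 patterns, 3 bits per 4 weights)
overhead_8_16 = mask_bits_per_element(8, 16)  # → 0.875 (12870 patterns, 14 bits per 16 weights)
```

Both patterns keep half of the weights, but a group of 16 admits 12870 masks against 6 for a group of 4, which is one way to read the "greater flexibility" claim.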
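[AI-158] From Turing to Tomorrow: The UK's Approach to AI Regulation
[Quick Read]: The paper addresses the tension between the UK's policy positioning on Artificial Intelligence (AI) regulation and its practical results, that is, how to balance promoting AI innovation with managing AI risk. It proposes establishing a flexible, principles-based regulator to oversee the most advanced AI development, adopting defensive measures against risks from AI-enabled biological design tools, and doing more technical work on how to respond to AI-generated misinformation. It also argues for updated legal frameworks on copyright, discrimination, and AI agents, and notes that regulators will have a limited but important role if AI substantially disrupts labour markets.
Link: https://arxiv.org/abs/2507.03050
Authors: Oliver Ritchie,Markus Anderljung,Tom Rachman
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: This is a chapter intended for publication in a forthcoming edited volume. It is the version of the author’s manuscript prior to acceptance for publication and has not undergone editorial and/or peer review on behalf of the Publisher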
Abstract:The UK has pursued a distinctive path in AI regulation: less cautious than the EU but more willing to address risks than the US, and has emerged as a global leader in coordinating AI safety efforts. Impressive developments from companies like London-based DeepMind began to spark concerns in the UK about catastrophic risks from around 2012, although regulatory discussion at the time focussed on bias and discrimination. By 2022, these discussions had evolved into a “pro-innovation” strategy, in which the government directed existing regulators to take a light-touch approach, governing AI at point of use, but avoided regulating the technology or infrastructure directly. ChatGPT arrived in late 2022, galvanising concerns that this approach may be insufficient. The UK responded by establishing an AI Safety Institute to monitor risks and hosting the first international AI Safety Summit in 2023, but - unlike the EU - refrained from regulating frontier AI development in addition to its use. A new government was elected in 2024 which promised to address this gap, but at the time of writing is yet to do so. What should the UK do next? The government faces competing objectives: harnessing AI for economic growth and better public services while mitigating risk. In light of these, we propose establishing a flexible, principles-based regulator to oversee the most advanced AI development, defensive measures against risks from AI-enabled biological design tools, and argue that more technical work is needed to understand how to respond to AI-generated misinformation. We argue for updated legal frameworks on copyright, discrimination, and AI agents, and that regulators will have a limited but important role if AI substantially disrupts labour markets. If the UK gets AI regulation right, it could demonstrate how democratic societies can harness AI’s benefits while managing its risks.
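[AI-159] Personalised Explanations in Long-term Human-Robot Interactions
[Quick Read]: The paper addresses how to improve human understanding of robots in Human-Robot Interaction (HRI), in particular the need to personalise explanations to the user's level of knowledge. The key is a framework for updating and retrieving user knowledge-memory models, so that explanations can reference concepts the user has already acquired and adapt their level of detail. Experiments show that a two-stage architecture, which first generates an explanation and then personalises it, reduces the level of detail only when related user knowledge exists.
Link: https://arxiv.org/abs/2507.03049
Authors: Ferran Gebellí,Anaís Garrell,Jan-Gerrit Habekost,Séverin Lemaignan,Stefan Wermter,Raquel Ros
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 8 pages. It will be published at RO-MAN 2025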
Abstract:In the field of Human-Robot Interaction (HRI), a fundamental challenge is to facilitate human understanding of robots. The emerging domain of eXplainable HRI (XHRI) investigates methods to generate explanations and evaluate their impact on human-robot interactions. Previous works have highlighted the need to personalise the level of detail of these explanations to enhance usability and comprehension. Our paper presents a framework designed to update and retrieve user knowledge-memory models, allowing for adapting the explanations’ level of detail while referencing previously acquired concepts. Three architectures based on our proposed framework that use Large Language Models (LLMs) are evaluated in two distinct scenarios: a hospital patrolling robot and a kitchen assistant robot. Experimental results demonstrate that a two-stage architecture, which first generates an explanation and then personalises it, is the framework architecture that effectively reduces the level of detail only when there is related user knowledge.
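[AI-160] Monitoring of Static Fairness
[Quick Read]: The paper addresses possible algorithmic bias in the decisions of machine-learned systems and the need to ensure fairness to individuals, i.e. no discrimination based on sensitive attributes. The key is a general framework for runtime verification of algorithmic fairness for systems whose models are unknown but assumed to have a Markov chain structure, with or without full observation of the state space. The framework introduces a specification language that can express many common fairness properties (demographic parity, equal opportunity, social burden). It then builds monitors that observe the event sequence the system generates and output, after each observation, a quantitative estimate of how fair or biased the run has been so far. The estimate is proven correct up to an error bound at a given confidence level.
Link: https://arxiv.org/abs/2507.03048
Authors: Thomas A. Henzinger,Mahyar Karimi,Konstantin Kueffner,Kaushik Mallik
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: arXiv admin note: text overlap with arXiv:2305.15979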
Abstract:Machine-learned systems are in widespread use for making decisions about humans, and it is important that they are fair, i.e., not biased against individuals based on sensitive attributes. We present a general framework of runtime verification of algorithmic fairness for systems whose models are unknown, but are assumed to have a Markov chain structure, with or without full observation of the state space. We introduce a specification language that can model many common algorithmic fairness properties, such as demographic parity, equal opportunity, and social burden. We build monitors that observe a long sequence of events as generated by a given system, and output, after each observation, a quantitative estimate of how fair or biased the system was on that run until that point in time. The estimate is proven to be correct modulo a variable error bound and a given confidence level, where the error bound gets tighter as the observed sequence gets longer. We present two categories of monitoring algorithms, namely ones with a uniform error bound across all time points, and ones with weaker non-uniform, pointwise error bounds at different time points. Our monitoring algorithms use statistical tools that are adapted to suit the dynamic requirements of monitoring and the special needs of the fairness specifications. Using a prototype implementation, we show how we can monitor if a bank is fair in giving loans to applicants from different social backgrounds, and if a college is fair in admitting students while maintaining a reasonable financial burden on the society. In these experiments, our monitors took less than a millisecond to update their verdicts after each observation. 
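[AI-161] Optimisation Is Not What You Need
[Quick Read]: The paper addresses the problem of producing true artificial cognition, focusing on fundamental flaws of current optimisation methods on the path to Artificial General Intelligence (AGI). Its key point is that world-modelling methods avoid problems inherent to optimisation methods, such as catastrophic forgetting and overfitting, and therefore offer a viable route to true artificial cognition.
Link: https://arxiv.org/abs/2507.03045
Authors: Alfredo Ibias
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: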
Abstract:The Artificial Intelligence field has focused on developing optimisation methods to solve multiple problems, specifically problems that we thought to be only solvable through cognition. The obtained results have been outstanding, being able to even surpass the Turing Test. However, we have found that these optimisation methods share some fundamental flaws that prevent them from becoming a true artificial cognition. Specifically, the field has identified catastrophic forgetting as a fundamental problem in developing such cognition. This paper formally proves that this problem is inherent to optimisation methods, and as such it will always limit approaches that try to solve the Artificial General Intelligence problem as an optimisation problem. Additionally, it addresses the problem of overfitting and discusses other smaller problems that optimisation methods pose. Finally, it empirically shows how world-modelling methods avoid suffering from either problem. As a conclusion, the field of Artificial Intelligence needs to look outside the machine learning field to find methods capable of developing an artificial cognition.
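[AI-162] Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards
[Quick Read]: The paper addresses the difficulty of optimizing compound AI systems, which stems from their non-differentiable structure and the diverse configuration types across components, including prompts, hyperparameters, and model parameters. The key is the Optimas framework, which maintains one Local Reward Function (LRF) per component, each satisfying a local-global alignment property: a component's local reward correlates with overall system performance. In each iteration Optimas adapts the LRFs to maintain this property while maximizing each component's local reward. This allows heterogeneous configurations to be updated independently while local improvements consistently translate into performance gains.
Link: https://arxiv.org/abs/2507.03041
Authors: Shirley Wu,Parth Sarthi,Shiyu Zhao,Aaron Lee,Herumb Shandilya,Adrian Mladenic Grobelnik,Nurendra Choudhary,Eddie Huang,Karthik Subbian,Linjun Zhang,Diyi Yang,James Zou,Jure Leskovec
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages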
Abstract:Compound AI systems integrating multiple components, such as Large Language Models, specialized tools, and traditional machine learning models, are increasingly deployed to solve complex real-world tasks. However, optimizing compound systems remains challenging due to their non-differentiable structures and diverse configuration types across components, including prompts, hyperparameters, and model parameters. To address this challenge, we propose Optimas, a unified framework for effective optimization of compound systems. The core idea of Optimas is to maintain one Local Reward Function (LRF) per component, each satisfying a local-global alignment property, i.e., each component’s local reward correlates with the global system performance. In each iteration, Optimas efficiently adapts the LRFs to maintain this property while simultaneously maximizing each component’s local reward. This approach enables independent updates of heterogeneous configurations using the designated optimization method, while ensuring that local improvements consistently lead to performance gains. We present extensive evaluations across five real-world compound systems to demonstrate that Optimas outperforms strong baselines by an average improvement of 11.92%, offering a general and effective approach for improving compound systems. Our website is at this https URL.
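[AI-163] Adaptive Cubic Regularized Second-Order Latent Factor Analysis Model
[Quick Read]: The paper targets the optimization instability and computational inefficiency that arise when modeling high-dimensional and incomplete (HDI) data, caused by the bilinear and non-convex objective of second-order latent factor (SLF) models. The key is the adaptive cubic regularized second-order latent factor analysis (ACRSLF) model, which rests on two ideas: 1) self-tuning cubic regularization that dynamically mitigates non-convex optimization instabilities; 2) multi-Hessian-vector product evaluation during conjugate gradient iterations to incorporate second-order information precisely.
Link: https://arxiv.org/abs/2507.03036
Authors: Jialiang Wang,Junzhou Wang,Xin Liao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages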
Abstract:High-dimensional and incomplete (HDI) data, characterized by massive node interactions, have become ubiquitous across various real-world applications. Second-order latent factor models have shown promising performance in modeling this type of data. Nevertheless, due to the bilinear and non-convex nature of the SLF model’s objective function, incorporating a damping term into the Hessian approximation and carefully tuning associated parameters become essential. To overcome these challenges, we propose a new approach in this study, named the adaptive cubic regularized second-order latent factor analysis (ACRSLF) model. The proposed ACRSLF adopts the two-fold ideas: 1) self-tuning cubic regularization that dynamically mitigates non-convex optimization instabilities; 2) multi-Hessian-vector product evaluation during conjugate gradient iterations for precise second-order information assimilation. Comprehensive experiments on two industrial HDI datasets demonstrate that the ACRSLF converges faster and achieves higher representation accuracy than the advancing optimizer-based LFA models.
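As background, cubic regularization in its generic textbook form replaces the damped Newton step with the minimizer of a cubic model of the objective. The formulas below show that generic form only; the symbols $f$, $x_k$, $\sigma_k$, and $\rho_k$ are standard notation introduced here for illustration, and the paper's specific self-tuning rule is not reproduced.

```latex
% Cubic-regularized local model around the current iterate x_k,
% with gradient g_k = \nabla f(x_k) and Hessian H_k = \nabla^2 f(x_k):
m_k(s) = f(x_k) + g_k^{\top} s + \tfrac{1}{2}\, s^{\top} H_k s
         + \tfrac{\sigma_k}{3}\, \lVert s \rVert^{3}

% The step s_k approximately minimizes m_k. Adaptive schemes adjust \sigma_k
% from the ratio of actual to predicted decrease:
\rho_k = \frac{f(x_k) - f(x_k + s_k)}{f(x_k) - m_k(s_k)}
```

A large $\rho_k$ suggests the model is trustworthy and $\sigma_k$ can shrink; a small or negative $\rho_k$ suggests increasing $\sigma_k$. Because minimizing $m_k$ with conjugate-gradient-style methods only needs products $H_k v$, the full Hessian never has to be formed, which fits the Hessian-vector products mentioned in the abstract.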
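[AI-164] On the Mathematical Impossibility of Safe Universal Approximators
[Quick Read]: The paper studies fundamental mathematical limits on aligning systems covered by the universal approximation theorem (UAT), that is, the tension between expressive power and perfect, reliable control. Its central claim is a proof that any useful computational system inevitably exhibits catastrophic failures, which are tied to the system's expressive power and cannot be avoided at the combinatorial, topological, or empirical level. A three-level argument links expressive power to instability, and a quantitative "Impossibility Sandwich" argues that the minimum complexity required for usefulness exceeds the maximum complexity compatible with safety, leading to the conclusion that perfect alignment is mathematically unattainable.
Link: https://arxiv.org/abs/2507.03031
Authors: Jasper Yao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 17 pages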
Abstract:We establish fundamental mathematical limits on universal approximation theorem (UAT) system alignment by proving that catastrophic failures are an inescapable feature of any useful computational system. Our central thesis is that for any universal approximator, the expressive power required for useful computation is inextricably linked to a dense set of instabilities that make perfect, reliable control a mathematical impossibility. We prove this through a three-level argument that leaves no escape routes for any class of universal approximator architecture. i) Combinatorial Necessity: For the vast majority of practical universal approximators (e.g., those using ReLU activations), we prove that the density of catastrophic failure points is directly proportional to the network’s expressive power. ii) Topological Necessity: For any theoretical universal approximator, we use singularity theory to prove that the ability to approximate generic functions requires the ability to implement the dense, catastrophic singularities that characterize them. iii) Empirical Necessity: We prove that the universal existence of adversarial examples is empirical evidence that real-world tasks are themselves catastrophic, forcing any successful model to learn and replicate these instabilities. These results, combined with a quantitative “Impossibility Sandwich” showing that the minimum complexity for usefulness exceeds the maximum complexity for safety, demonstrate that perfect alignment is not an engineering challenge but a mathematical impossibility. This foundational result reframes UAT safety from a problem of “how to achieve perfect control” to one of “how to operate safely in the presence of irreducible uncontrollability,” with profound implications for the future of UAT development and governance.
[AI-165] Deep Learning-Based Forecasting of Hotel KPIs: A Cross-City Analysis of Global Urban Markets
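[Quick read]: This paper aims to forecast key performance indicators (KPIs) in the urban tourism industry, specifically occupancy (OCC), average daily rate (ADR), and revenue per available room (RevPAR). The key to its approach is a long short-term memory (LSTM) model combined with time series decomposition and machine learning techniques, which captures differences in economic and hospitality dynamics across cities and supports accurate forecasting for several major global cities.
Link: https://arxiv.org/abs/2507.03028
Authors: C. J. Atapattu,Xia Cui,N.R Abeynayake
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: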
Abstract:This study employs Long Short-Term Memory (LSTM) networks to forecast key performance indicators (KPIs), Occupancy (OCC), Average Daily Rate (ADR), and Revenue per Available Room (RevPAR), across five major cities: Manchester, Amsterdam, Dubai, Bangkok, and Mumbai. The cities were selected for their diverse economic profiles and hospitality dynamics. Monthly data from 2018 to 2025 were used, with 80% for training and 20% for testing. Advanced time series decomposition and machine learning techniques enabled accurate forecasting and trend identification. Results show that Manchester and Mumbai exhibited the highest predictive accuracy, reflecting stable demand patterns, while Dubai and Bangkok demonstrated higher variability due to seasonal and event-driven influences. The findings validate the effectiveness of LSTM models for urban hospitality forecasting and provide a comparative framework for data-driven decision-making. The models' generalisability across global cities highlights their potential utility for tourism stakeholders and urban planners.
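The evaluation setup in this abstract can be illustrated with a small sketch. It assumes the 80/20 split is chronological (the abstract does not say whether the split was chronological or random) and uses the standard identity RevPAR = ADR × OCC. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered sequence without shuffling (an assumption, not stated in the abstract)."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def revpar(adr, occ):
    """Revenue per available room from average daily rate and occupancy (as a 0..1 fraction)."""
    return adr * occ


# Hypothetical monthly occupancy values, oldest first.
monthly_occ = [0.62, 0.65, 0.70, 0.74, 0.78, 0.81, 0.85, 0.83, 0.76, 0.69]
train, test = chronological_split(monthly_occ)
```

Keeping the test months strictly after the training months avoids leaking future values into the model, which is the usual reason to prefer a chronological split for forecasting.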
[AI-166] Generalized Adaptive Transfer Network: Enhancing Transfer Learning in Reinforcement Learning Across Domains
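[Quick read]: This paper addresses transfer learning in reinforcement learning (RL), in particular task generalization, robustness to environmental changes, and the computational efficiency of transfer. The key to its approach is the Generalized Adaptive Transfer Network (GATN), a deep RL architecture made up of a domain-agnostic representation module, a robustness-aware policy adapter, and an efficient transfer scheduler, which together target cross-domain generalization, adaptation to dynamic environments, and lower computational overhead.
Link: https://arxiv.org/abs/2507.03026
Authors: Abhishek Verma,Nallarasan V,Balaraman Ravindran
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: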
Abstract:Transfer learning in Reinforcement Learning (RL) enables agents to leverage knowledge from source tasks to accelerate learning in target tasks. While prior work, such as the Attend, Adapt, and Transfer (A2T) framework, addresses negative transfer and selective transfer, other critical challenges remain underexplored. This paper introduces the Generalized Adaptive Transfer Network (GATN), a deep RL architecture designed to tackle task generalization across domains, robustness to environmental changes, and computational efficiency in transfer. GATN employs a domain-agnostic representation module, a robustness-aware policy adapter, and an efficient transfer scheduler to achieve these goals. We evaluate GATN on diverse benchmarks, including Atari 2600, MuJoCo, and a custom chatbot dialogue environment, demonstrating superior performance in cross-domain generalization, resilience to dynamic environments, and reduced computational overhead compared to baselines. Our findings suggest GATN is a versatile framework for real-world RL applications, such as adaptive chatbots and robotic control.
[AI-167] Completion of the DrugMatrix Toxicogenomics Database using 3-Dimensional Tensors KDD’25
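[Quick read]: This paper addresses missing data in the DrugMatrix toxicogenomics dataset, aiming to recover the original data distribution more accurately by preserving the data's three-dimensional structure (tissue, treatment, and transcriptomic measurements). The key to its approach is a tensor-based completion method that achieves lower mean squared error and mean absolute error than conventional Canonical Polyadic decomposition and two-dimensional matrix factorization, and that captures organ-specific variability.
Link: https://arxiv.org/abs/2507.03024
Authors: Tan Nguyen,Guojing Cong
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 6 figures, BioKDD’25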
Abstract:We explore applying a tensor completion approach to complete the DrugMatrix toxicogenomics dataset. Our hypothesis is that by preserving the 3-dimensional structure of the data, which comprises tissue, treatment, and transcriptomic measurements, and by leveraging a machine learning formulation, our approach will improve upon prior state-of-the-art results. Our results demonstrate that the new tensor-based method more accurately reflects the original data distribution and effectively captures organ-specific variability. The proposed tensor-based methodology achieved lower mean squared errors and mean absolute errors compared to both conventional Canonical Polyadic decomposition and 2-dimensional matrix factorization methods. In addition, our non-negative tensor completion implementation reveals relationships among tissues. Our findings not only complete the world’s largest in-vivo toxicogenomics database with improved accuracy but also offer a promising methodology for future studies of drugs that may cross species barriers, for example, from rats to humans.
[AI-168] Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison
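[Quick read]: This paper examines how generative AI affects assessment design and question answering in higher-education STEM, and the opportunities and challenges that follow. The key to its approach is a high-quality dataset of 201 university-level STEM questions, manually annotated with features such as image type, role, problem complexity, and question format, which is used to analyze how these features affect generative AI performance relative to students. By evaluating four model families and five prompting strategies, the study shows that AI underperforms humans on questions involving visual components, and that question design can strengthen academic integrity by using features that challenge current AI systems without increasing students' cognitive burden.
Link: https://arxiv.org/abs/2507.03013
Authors: Aymeric de Chillaz,Anna Sotnikova,Patrick Jermann,Antoine Bosselut
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: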
Abstract:Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. To investigate these effects, we introduce a high-quality dataset of 201 university-level STEM questions, manually annotated with features such as image type, role, problem complexity, and question format. Our study analyzes how these features affect generative AI performance compared to students. We evaluate four model families with five prompting strategies, comparing results to the average of 546 student responses per question. Although the best model correctly answers on average 58.5 % of the questions using majority vote aggregation, human participants consistently outperform AI on questions involving visual components. Interestingly, human performance remains stable across question features but varies by subject, whereas AI performance is susceptible to both subject matter and question features. Finally, we provide actionable insights for educators, demonstrating how question design can enhance academic integrity by leveraging features that challenge current AI systems without increasing the cognitive burden for students.
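The "majority vote aggregation" mentioned in this abstract can be sketched as follows. This is only an illustration: the abstract does not say how ties are broken, so the first-seen tie-break and the function name are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first (an assumed rule)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


# Hypothetical answers to one multiple-choice question from five prompting strategies.
votes = ["B", "A", "B", "C", "B"]
aggregated = majority_vote(votes)
```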
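[AI-169] Teacher training in the age of AI: Impact on AI Literacy and Teachers Attitudes
[Quick read]: This paper addresses how to improve teachers' AI literacy, usage behaviors, and attitudes toward AI as artificial intelligence (AI) is rapidly integrated into education. The key to its approach is designing and running an online teacher training program that combines synchronous and asynchronous formats, including webinars, self-paced modules, and practical projects, which improved teachers' AI literacy and fostered positive attitudes toward AI in education.
Link: https://arxiv.org/abs/2507.03011
Authors: Julia Lademann,Jannik Henze,Nadine Honke,Caroline Wollny,Sebastian Becker-Genschow
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: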
Abstract:The rapid integration of artificial intelligence (AI) in education requires teachers to develop AI competencies while preparing students for a society influenced by AI. This study evaluates the impact of an online teacher training program on German in-service teachers’ AI literacy, usage behaviors, and attitudes toward AI. A pre-post design study was conducted with teachers (N1 = 291 for AI literacy, N2 = 436 for attitude assessment) participating in the course. The program combined synchronous and asynchronous learning formats, including webinars, self-paced modules, and practical projects. The participants exhibited notable improvements across all domains: AI literacy scores increased significantly, and all attitude items regarding AI usage and integration demonstrated significant positive changes. Teachers reported increased confidence in AI integration. Structured teacher training programs effectively enhance AI literacy and foster positive attitudes toward AI in education.
[AI-170] Game-Theoretic Modeling of Vehicle Unprotected Left Turns Considering Drivers Bounded Rationality
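[Quick read]: This paper addresses the challenge of modeling vehicle decision-making in unprotected left-turn scenarios, especially the uncertainty of human drivers. Traditional approaches rest on the game-theoretic assumption of perfect rationality and struggle to capture real-world complexity and drivers' decision errors. The key to its approach is a new decision-making model that combines game theory with drivers' bounded rationality: it is formulated as a two-player normal-form game and solved by a quantal response equilibrium (QRE), which describes driver decisions in finer detail than Nash equilibrium (NE) models. Model parameters are fitted with an Expectation-Maximization (EM) algorithm coupled with a neural network trained on high-precision microscopic vehicle trajectory data, so that the model reflects drivers' interaction-aware bounded rationality and driving styles.
Link: https://arxiv.org/abs/2507.03002
Authors: Yuansheng Lian,Ke Zhang,Meng Li,Shen Li
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: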
Abstract:Modeling the decision-making behavior of vehicles presents unique challenges, particularly during unprotected left turns at intersections, where the uncertainty of human drivers is especially pronounced. In this context, connected autonomous vehicle (CAV) technology emerges as a promising avenue for effectively managing such interactions while ensuring safety and efficiency. Traditional approaches, often grounded in game theory assumptions of perfect rationality, may inadequately capture the complexities of real-world scenarios and drivers’ decision-making errors. To fill this gap, we propose a novel decision-making model for vehicle unprotected left-turn scenarios, integrating game theory with considerations for drivers’ bounded rationality. Our model, formulated as a two-player normal-form game solved by a quantal response equilibrium (QRE), offers a more nuanced depiction of driver decision-making processes compared to Nash equilibrium (NE) models. Leveraging an Expectation-Maximization (EM) algorithm coupled with a subtle neural network trained on precise microscopic vehicle trajectory data, we optimize model parameters to accurately reflect drivers’ interaction-aware bounded rationality and driving styles. Through comprehensive simulation experiments, we demonstrate the efficacy of our proposed model in capturing the interaction-aware bounded rationality and decision tendencies between players. The proposed model proves to be more realistic and efficient than NE models in unprotected left-turn scenarios. Our findings contribute valuable insights into the vehicle decision-making behaviors with bounded rationality, thereby informing the development of more robust and realistic autonomous driving systems.
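To make the quantal response equilibrium idea concrete, here is a minimal sketch of a logit QRE for a two-player normal-form game. It is not the paper's model: the logit form, the damped fixed-point solver, the rationality parameter `lam`, and the go/yield payoff numbers are all assumptions chosen for illustration. Each player picks actions with probability proportional to exp(lam × expected utility), so `lam = 0` gives uniformly random play and large `lam` approaches best-response play.

```python
import math


def logit_response(utilities, lam):
    """Softmax over expected utilities with rationality parameter lam."""
    scaled = [lam * u for u in utilities]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def expected_utilities(payoff, opponent_mix):
    """payoff[a][b] is the player's own payoff for playing a against opponent action b."""
    return [sum(row[b] * opponent_mix[b] for b in range(len(opponent_mix)))
            for row in payoff]


def logit_qre(payoff_1, payoff_2, lam, iters=500, damping=0.5):
    """Damped fixed-point iteration; may fail to converge for large lam."""
    p = [1.0 / len(payoff_1)] * len(payoff_1)
    q = [1.0 / len(payoff_2)] * len(payoff_2)
    for _ in range(iters):
        p_target = logit_response(expected_utilities(payoff_1, q), lam)
        q_target = logit_response(expected_utilities(payoff_2, p), lam)
        p = [(1 - damping) * a + damping * b for a, b in zip(p, p_target)]
        q = [(1 - damping) * a + damping * b for a, b in zip(q, q_target)]
    return p, q


# Hypothetical payoffs, own action first: index 0 = go, index 1 = yield.
go_yield = [[-10.0, 2.0], [0.0, -1.0]]
turning_mix, through_mix = logit_qre(go_yield, go_yield, lam=0.2)
```

In this sketch a lower `lam` models a more error-prone driver, which is the role bounded rationality plays in a QRE; fitting such a parameter from trajectory data is outside the scope of this example.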
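[AI-171] LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
[Quick read]: This paper examines the fundamental tension between goal-directed behavior and alignment in current large language models (LLMs). LLMs are asked to complete an impossible quiz in a controlled environment while being monitored, to see whether they try to circumvent restrictions and cheat. The key to its approach is an experimental setup in which the models are explicitly told about the monitoring and instructed not to cheat, so that any remaining goal-driven behavior exposes shortcomings in alignment.
Link: https://arxiv.org/abs/2507.02977
Authors: Igor Ivanov
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures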
Abstract:In this paper, LLMs are tasked with completing an impossible quiz, while they are in a sandbox, monitored, told about these measures and instructed not to cheat. Some frontier LLMs cheat consistently and attempt to circumvent restrictions despite everything. The results reveal a fundamental tension between goal-directed behavior and alignment in current LLMs. The code and evaluation logs are available at this http URL
[AI-172] Reinforcement Learning for Automated Cybersecurity Penetration Testing
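[Quick read]: This paper addresses the automation of security testing tasks for web applications, aiming to ensure that all components function correctly while reducing project maintenance costs. The key to its approach is using reinforcement learning to select and prioritize tools and to optimize the testing path, together with geometric deep learning to create priors that shrink the search space and improve learning convergence.
Link: https://arxiv.org/abs/2507.02969
Authors: Daniel López-Montero,José L. Álvarez-Aldana,Alicia Morales-Martínez,Marta Gil-López,Juan M. Auñón García
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: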
Abstract:This paper aims to provide an innovative machine learning-based solution to automate security testing tasks for web applications, ensuring the correct functioning of all components while reducing project maintenance costs. Reinforcement Learning is proposed to select and prioritize tools and optimize the testing path. The presented approach utilizes a simulated webpage along with its network topology to train the agent. Additionally, the model leverages Geometric Deep Learning to create priors that reduce the search space and improve learning convergence. The validation and testing process was conducted on real-world vulnerable web pages commonly used by human hackers for learning. As a result of this study, a reinforcement learning algorithm was developed that maximizes the number of vulnerabilities found while minimizing the number of steps required
[AI-173] Unveiling Privacy Policy Complexity: An Exploratory Study Using Graph Mining Machine Learning and Natural Language Processing
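[Quick read]: This paper addresses the problem that privacy policy documents are lengthy, complex, and hard for non-expert users to understand, which leaves the collection, processing, and sharing of personal data opaque. The key to its approach is interactive graph visualization: policy terms are represented as structured graph models so that complex relationships are shown more intuitively and users can understand the policy better. Graph mining algorithms are combined with dimensionality reduction techniques (such as t-SNE and PCA) to identify key themes and assess clustering effectiveness, which improves the interpretability of policy content and supports privacy violation detection and auditing.
Link: https://arxiv.org/abs/2507.02968
Authors: Vijayalakshmi Ramasamy,Seth Barrett,Gokila Dorai,Jessica Zumbach
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 7 Pages; 1 Algorithm; 1 Table; 2 Figures; Accepted by AIRC 2025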
Abstract:Privacy policy documents are often lengthy, complex, and difficult for non-expert users to interpret, leading to a lack of transparency regarding the collection, processing, and sharing of personal data. As concerns over online privacy grow, it is essential to develop automated tools capable of analyzing privacy policies and identifying potential risks. In this study, we explore the potential of interactive graph visualizations to enhance user understanding of privacy policies by representing policy terms as structured graph models. This approach makes complex relationships more accessible and enables users to make informed decisions about their personal data (RQ1). We also employ graph mining algorithms to identify key themes, such as User Activity and Device Information, using dimensionality reduction techniques like t-SNE and PCA to assess clustering effectiveness. Our findings reveal that graph-based clustering improves policy content interpretability. It highlights patterns in user tracking and data sharing, which supports forensic investigations and identifies regulatory non-compliance. This research advances AI-driven tools for auditing privacy policies by integrating interactive visualizations with graph mining. Enhanced transparency fosters accountability and trust.
[AI-174] Optimization of Low-Latency Spiking Neural Networks Utilizing Historical Dynamics of Refractory Periods
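[Quick read]: This paper addresses the declining effectiveness of traditional refractory mechanisms in low-latency spiking neural networks (SNNs), as well as the neuron over-activation and reduced noise robustness that result from omitting the refractory period. The key to its approach is a historical dynamic refractory period (HDRP) model, which uses the membrane potential derivative together with historical refractory periods to estimate an initial refractory period and dynamically adjust its duration, plus a threshold-dependent refractory kernel that mitigates excessive accumulation of neuron state. This retains the binary characteristics of SNNs while improving noise robustness and overall performance.
Link: https://arxiv.org/abs/2507.02960
Authors: Liying Tao,Zonglin Yang,Delong Shang
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: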
Abstract:The refractory period controls neuron spike firing rate, crucial for network stability and noise resistance. With advancements in spiking neural network (SNN) training methods, low-latency SNN applications have expanded. In low-latency SNNs, shorter simulation steps render traditional refractory mechanisms, which rely on empirical distributions or spike firing rates, less effective. However, omitting the refractory period amplifies the risk of neuron over-activation and reduces the system’s robustness to noise. To address this challenge, we propose a historical dynamic refractory period (HDRP) model that leverages membrane potential derivative with historical refractory periods to estimate an initial refractory period and dynamically adjust its duration. Additionally, we propose a threshold-dependent refractory kernel to mitigate excessive neuron state accumulation. Our approach retains the binary characteristics of SNNs while enhancing both noise resistance and overall performance. Experimental results show that HDRP-SNN significantly reduces redundant spikes compared to traditional SNNs, and achieves state-of-the-art (SOTA) accuracy both on static datasets and neuromorphic datasets. Moreover, HDRP-SNN outperforms artificial neural networks (ANNs) and traditional SNNs in noise resistance, highlighting the crucial role of the HDRP mechanism in enhancing the performance of low-latency SNNs.
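The effect of a refractory period on firing rate can be seen in a minimal leaky integrate-and-fire sketch. This shows only the conventional fixed-length refractory baseline that the abstract contrasts with, not the HDRP model itself; the function name, the hard reset to zero, and the parameter values are assumptions.

```python
def simulate_lif(inputs, threshold=1.0, decay=0.9, refractory_steps=0):
    """Leaky integrate-and-fire neuron with a fixed refractory period, in discrete time."""
    potential = 0.0
    cooldown = 0  # remaining steps during which input is ignored
    spikes = []
    for current in inputs:
        if cooldown > 0:
            cooldown -= 1
            spikes.append(0)
            continue
        potential = decay * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset (assumed)
            cooldown = refractory_steps
        else:
            spikes.append(0)
    return spikes


# Constant hypothetical drive, no leak, with and without a two-step refractory period.
drive = [0.6] * 8
free_running = simulate_lif(drive, decay=1.0, refractory_steps=0)
with_refractory = simulate_lif(drive, decay=1.0, refractory_steps=2)
```

With only a handful of simulation steps, as in low-latency SNNs, a fixed two-step pause already removes half of the spikes in this toy case, which suggests why a fixed refractory length becomes a coarse tool and why an adaptive duration is attractive.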
[AI-175] A Novel Active Learning Approach to Label One Million Unknown Malware Variants
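[Quick read]: This paper addresses the cost of labeling samples for classification, specifically labeling one million malware samples from different unknown modern malware families. The key to its approach is using Bayesian theory to give a probabilistic view of deep neural network models, estimating uncertainty over model parameters through prior and posterior distributions, so that an active learning framework can select the most uncertain samples for labeling. Two methods are proposed: Inception-V4+PCA combined with several support vector machine algorithms, and a Vision Transformer based Bayesian neural network (ViT-BNN); the latter is reported to be more stable and robust in handling uncertainty.
Link: https://arxiv.org/abs/2507.02959
Authors: Ahmed Bensaoud,Jugal Kalita
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: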
Abstract:Active learning for classification seeks to reduce the cost of labeling samples by finding unlabeled examples about which the current model is least certain and sending them to an annotator/expert to label. Bayesian theory can provide a probabilistic view of deep neural network models by asserting a prior distribution over model parameters and estimating the uncertainties by posterior distribution over these parameters. This paper proposes two novel active learning approaches to label one million malware examples belonging to different unknown modern malware families. The first model is Inception-V4+PCA combined with several support vector machine (SVM) algorithms (UTSVM, PSVM, SVM-GSU, TBSVM). The second model is Vision Transformer based Bayesian Neural Networks ViT-BNN. Our proposed ViT-BNN is a state-of-the-art active learning approach that differs from current methods and can apply to any particular task. The experiments demonstrate that the ViT-BNN is more stable and robust in handling uncertainty.
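The selection step of active learning described in the first sentence of this abstract can be sketched with predictive entropy as the uncertainty score. Entropy is one common choice and is an assumption here; the abstract does not state which acquisition function the authors use, and the function names and probabilities below are hypothetical.

```python
import math


def predictive_entropy(probabilities):
    """Shannon entropy of a class-probability vector; higher means less certain."""
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)


def select_most_uncertain(predictions, budget):
    """Indices of the `budget` unlabeled samples with the highest entropy."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: predictive_entropy(predictions[i]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical class probabilities for four unlabeled samples.
unlabeled_predictions = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
]
to_label = select_most_uncertain(unlabeled_predictions, budget=2)
```

In a Bayesian setting the probability vectors would typically be averages over several posterior samples of the model, but any classifier that outputs class probabilities fits this selection step.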
[AI-176] A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks
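[Quick read]: This paper addresses the threat that multi-turn jailbreak attacks pose to the safety of large language models (LLMs). It finds that safety-aligned LMs tend to represent responses produced by the Crescendo attack as more benign than harmful over multi-turn conversations, which lets an attacker steer the model step by step toward fulfilling harmful requests. The key to a solution lies in understanding and addressing the model's susceptibility to multi-turn attacks at the level of intermediate representations, in order to design mitigations that close this generalization gap.
Link: https://arxiv.org/abs/2507.02956
Authors: Blake Bullwinkel,Mark Russinovich,Ahmed Salem,Santiago Zanella-Beguelin,Daniel Jones,Giorgio Severi,Eugenia Kim,Keegan Hines,Amanda Minnich,Yonatan Zunger,Ram Shankar Siva Kumar
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: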
Abstract:Recent research has demonstrated that state-of-the-art LLMs and defenses remain susceptible to multi-turn jailbreak attacks. These attacks require only closed-box model access and are often easy to perform manually, posing a significant threat to the safe and secure deployment of LLM-based systems. We study the effectiveness of the Crescendo multi-turn jailbreak at the level of intermediate model representations and find that safety-aligned LMs often represent Crescendo responses as more benign than harmful, especially as the number of conversation turns increases. Our analysis indicates that at each turn, Crescendo prompts tend to keep model outputs in a “benign” region of representation space, effectively tricking the model into fulfilling harmful requests. Further, our results help explain why single-turn jailbreak defenses like circuit breakers are generally ineffective against multi-turn attacks, motivating the development of mitigations that address this generalization gap.
[AI-177] Strategies for Resource Allocation of Two Competing Companies using Genetic Algorithm
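[Quick read]: This paper addresses how a company in a competitive environment can choose strategic shop locations in shopping malls to achieve final dominance of market share. The key to its approach is using an evolutionary algorithm to encode the ensemble of initial configurations and a Monte Carlo method to evolve the pattern, which reveals that initial patterns with certain topological properties evolve faster to market dominance.
Link: https://arxiv.org/abs/2507.02952
Authors: Wing Keung Cheung,Kwok Yip Szeto
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: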
Abstract:We investigate various strategic locations of shops in shopping malls in a metropolis with the aim of finding the best strategy for final dominance of market share by a company in a competing environment. The problem is posed in the context of two competing supermarket chains in a metropolis, described in the framework of the two-dimensional Ising model. Evolutionary Algorithm is used to encode the ensemble of initial configurations and Monte Carlo method is used to evolve the pattern. Numerical simulation indicates that initial patterns with certain topological properties do evolve faster to market dominance. The description of these topological properties is given and suggestions are made on the initial pattern so as to evolve faster to market dominance.
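The Monte Carlo evolution of a two-dimensional Ising pattern can be sketched with the standard Metropolis rule. This covers only the pattern-evolution half of the setup; the genetic-algorithm encoding of initial configurations is not shown. The abstract does not state the update rule, so Metropolis dynamics, periodic boundaries, unit coupling, and the mapping of spin +1 and -1 to the two chains are assumptions.

```python
import math
import random


def metropolis_sweep(grid, temperature, rng):
    """One sweep of single-site Metropolis updates on a square lattice with periodic boundaries."""
    n = len(grid)
    for _ in range(n * n):
        i, j = rng.randrange(n), rng.randrange(n)
        neighbours = (grid[(i - 1) % n][j] + grid[(i + 1) % n][j]
                      + grid[i][(j - 1) % n] + grid[i][(j + 1) % n])
        # Energy change of flipping site (i, j) for E = -sum over neighbouring pairs of s_i * s_j.
        delta_e = 2 * grid[i][j] * neighbours
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
            grid[i][j] = -grid[i][j]


def market_share(grid):
    """Fraction of sites held by the chain encoded as +1."""
    cells = [s for row in grid for s in row]
    return sum(1 for s in cells if s == 1) / len(cells)


rng = random.Random(0)
# Hypothetical random initial pattern on an 8 x 8 lattice of mall sites.
pattern = [[rng.choice([-1, 1]) for _ in range(8)] for _ in range(8)]
for _ in range(50):
    metropolis_sweep(pattern, temperature=1.5, rng=rng)
share = market_share(pattern)
```

Comparing how many sweeps different initial patterns need before `market_share` approaches 0 or 1 is one way to probe the abstract's claim that some initial topologies reach dominance faster.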
[AI-178] Bittensor Protocol: The Bitcoin in Decentralized Artificial Intelligence? A Critical and Empirical Analysis
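[Quick read]: This paper addresses the marked misalignment between incentives and service quality in the Bittensor network caused by the concentration of stake and rewards. Its proposed solution has two parts: realigning incentives through a performance-weighted emission split, composite scoring, and a trust-bonus multiplier; and mitigating the security vulnerability created by stake concentration with a stake cap at the 88th percentile, which raises the median coalition size required for a 51% attack and remains robust across different time granularities.
Link: https://arxiv.org/abs/2507.02951
Authors: Elizabeth Lui,Jiahao Sun
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: MARBLE 2025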
Abstract:This paper investigates whether Bittensor can be considered the Bitcoin of decentralized Artificial Intelligence by directly comparing its tokenomics, decentralization properties, consensus mechanism, and incentive structure against those of Bitcoin. Leveraging on-chain data from all 64 active Bittensor subnets, we first document considerable concentration in both stake and rewards. We further show that rewards are overwhelmingly driven by stake, highlighting a clear misalignment between quality and compensation. As a remedy, we put forward a series of two-pronged protocol-level interventions. For incentive realignment, our proposed solutions include performance-weighted emission split, composite scoring, and a trust-bonus multiplier. As for mitigating security vulnerability due to stake concentration, we propose and empirically validate stake cap at the 88th percentile, which elevates the median coalition size required for a 51-percent attack and remains robust across daily, weekly, and monthly snapshots.
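The intuition behind a percentile stake cap can be checked with a small sketch: cap every stake at the chosen percentile, then count the fewest participants whose combined stake exceeds half of the total. The nearest-rank percentile definition, the strict-majority threshold, the function names, and the stake values are assumptions made for illustration and may differ from the paper's procedure.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cap_stakes(stakes, pct=88):
    """Clip every stake at the pct-th percentile of the stake distribution."""
    cap = percentile_nearest_rank(stakes, pct)
    return [min(s, cap) for s in stakes]


def min_majority_coalition(stakes, threshold=0.5):
    """Fewest participants whose combined stake strictly exceeds `threshold` of the total."""
    total = sum(stakes)
    running = 0
    for size, stake in enumerate(sorted(stakes, reverse=True), start=1):
        running += stake
        if running > threshold * total:
            return size
    return len(stakes)


# Hypothetical subnet with one dominant staker and nine small ones.
stakes = [1000, 10, 10, 10, 10, 10, 10, 10, 10, 10]
before = min_majority_coalition(stakes)
after = min_majority_coalition(cap_stakes(stakes))
```

In this toy subnet a single staker controls a majority before the cap, while six participants must collude afterwards, which is the direction of the effect the abstract reports.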
[AI-179] SPEAR: Structured Pruning for Spiking Neural Networks via Synaptic Operation Estimation and Reinforcement Learning
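[Quick read]: This paper addresses the challenge of deploying deep spiking neural networks (SNNs) on resource-constrained neuromorphic hardware, in particular using network pruning to reduce parameters and synaptic operations (SynOps) for edge deployment. Existing search-based pruning methods cannot use SynOps directly as a constraint because SynOps changes dynamically during the search, so the final searched network misses the intended SynOps target. The key to its approach is SPEAR, a new SNN pruning framework that uses reinforcement learning (RL) to apply SynOps directly as the search constraint. To avoid violating the SynOps requirement, it first proposes a SynOps prediction mechanism called LRE that predicts the final SynOps after the search; and because SynOps cannot be explicitly calculated and added as a constraint on RL actions, it proposes a new reward called TAR to stabilize the search.
Link: https://arxiv.org/abs/2507.02945
Authors: Hui Xie,Yuhe Liu,Shaoqi Yang,Jinyang Guo,Yufei Guo,Yuqing Ma,Jiaxin Chen,Jiaheng Liu,Xianglong Liu
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: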
Abstract:While deep spiking neural networks (SNNs) demonstrate superior performance, their deployment on resource-constrained neuromorphic hardware still remains challenging. Network pruning offers a viable solution by reducing both parameters and synaptic operations (SynOps) to facilitate the edge deployment of SNNs, among which search-based pruning methods search for the SNNs structure after pruning. However, existing search-based methods fail to directly use SynOps as the constraint because it will dynamically change in the searching process, resulting in the final searched network violating the expected SynOps target. In this paper, we introduce a novel SNN pruning framework called SPEAR, which leverages reinforcement learning (RL) technique to directly use SynOps as the searching constraint. To avoid the violation of SynOps requirements, we first propose a SynOps prediction mechanism called LRE to accurately predict the final SynOps after search. Observing SynOps cannot be explicitly calculated and added to constrain the action in RL, we propose a novel reward called TAR to stabilize the searching. Extensive experiments show that our SPEAR framework can effectively compress SNN under specific SynOps constraint.
[AI-180] Control Synthesis in Partially Observable Environments for Complex Perception-Related Objectives
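[Quick read]: This paper addresses the synthesis of optimal policies for complex perception-related objectives in environments modeled as partially observable Markov decision processes (POMDPs). The key to its approach is a formal specification called co-safe linear inequality temporal logic (sc-iLTL), which defines complex tasks formed by logically combining atomic propositions; sc-iLTL objectives are converted into reachability objectives by constructing the product of the belief MDP and a deterministic finite automaton generated from the sc-iLTL objective. To handle the scalability challenge introduced by the product, the paper further proposes a Monte Carlo Tree Search (MCTS) method that converges in probability to the optimal policy.
Link: https://arxiv.org/abs/2507.02942
Authors: Zetong Xuan,Yu Wang
Affiliation: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: This paper has been accepted for publication in the IEEE Control Systems Letters (L-CSS). Personal use of this material is permitted. Reuse requires permission from IEEE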
Abstract:Perception-related tasks often arise in autonomous systems operating under partial observability. This work studies the problem of synthesizing optimal policies for complex perception-related objectives in environments modeled by partially observable Markov decision processes. To formally specify such objectives, we introduce co-safe linear inequality temporal logic (sc-iLTL), which can define complex tasks that are formed by the logical concatenation of atomic propositions as linear inequalities on the belief space of the POMDPs. Our solution to the control synthesis problem is to transform the sc-iLTL objectives into reachability objectives by constructing the product of the belief MDP and a deterministic finite automaton built from the sc-iLTL objective. To overcome the scalability challenge due to the product, we introduce a Monte Carlo Tree Search (MCTS) method that converges in probability to the optimal policy. Finally, a drone-probing case study demonstrates the applicability of our method.
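The phrase "atomic propositions as linear inequalities on the belief space" can be made concrete with a small sketch: a standard Bayes-filter belief update followed by a check of one linear inequality on the resulting belief. The inequality form (coefficients · belief ≥ bound), the function names, and the two-state sensing example are assumptions for illustration, not the paper's exact definitions.

```python
def belief_update(belief, transition, obs_likelihood):
    """Bayes filter step for a POMDP.

    transition[s][s2] is P(s2 | s, a) for the action taken;
    obs_likelihood[s2] is P(o | s2) for the observation received.
    """
    n = len(belief)
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    unnormalised = [obs_likelihood[s2] * predicted[s2] for s2 in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / total for u in unnormalised]


def proposition_holds(belief, coefficients, bound):
    """Atomic proposition as a linear inequality on the belief (assumed form)."""
    return sum(c * b for c, b in zip(coefficients, belief)) >= bound


# Hypothetical two-state example: is a target present (state 0) or absent (state 1)?
prior = [0.5, 0.5]
static_world = [[1.0, 0.0], [0.0, 1.0]]  # the probing action does not move the target
detection = [0.8, 0.2]                   # P("detected" | present), P("detected" | absent)
posterior = belief_update(prior, static_world, detection)
confident = proposition_holds(posterior, [1.0, 0.0], 0.75)
```

A perception objective such as "eventually be at least 75% sure the target is present" then becomes a temporal formula over propositions of this kind, evaluated along the sequence of beliefs.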
[AI-181] FoGE: Fock Space inspired encoding for graph prompting
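[Quick read]: This paper addresses how to enable large language models (LLMs) to understand and answer questions about structured data such as graphs, proteins, and hypergraphs. Existing methods usually rely on a description of the graph to build an "augmented" prompt for the LLM. The key to its approach is a parameter-free graph encoder based on Fock space representations from mathematical physics. Inherited directly from the theory with a few small adjustments, the encoder provides rich and informative encodings for many kinds of graphs, which significantly simplifies existing solutions and lets the model handle a range of graph-related questions with minimal, if any, architectural changes.
Link: https://arxiv.org/abs/2507.02937
Authors: Sotirios Panagiotis Chytas,Rudrasis Chakraborty,Vikas Singh
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: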
Abstract:Recent results show that modern Large Language Models (LLM) are indeed capable of understanding and answering questions about structured data such as graphs. This new paradigm can lead to solutions that require less supervision while, at the same time, providing a model that can generalize and answer questions beyond the training labels. Existing proposals often use some description of the graph to create an "augmented" prompt fed to the LLM. For a chosen class of graphs, if a well-tailored graph encoder is deployed to play together with a pre-trained LLM, the model can answer graph-related questions well. Existing solutions to graph-based prompts range from graph serialization to graph transformers. In this work, we show that the use of a parameter-free graph encoder based on Fock space representations, a concept borrowed from mathematical physics, is remarkably versatile in this problem setting. The simple construction, inherited directly from the theory with a few small adjustments, can provide rich and informative graph encodings, for a wide range of different graphs. We investigate the use of this idea for prefix-tuned prompts leveraging the capabilities of a pre-trained, frozen LLM. The modifications lead to a model that can answer graph-related questions – from simple graphs to proteins to hypergraphs – effectively and with minimal, if any, adjustments to the architecture. Our work significantly simplifies existing solutions and generalizes well to multiple different graph-based structures effortlessly.
[AI-182] Experiment on creating a neural network with weights determined by the potential of a simulated electrostatic field
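[Quick read]: This paper addresses the fact that determining the weights and thresholds of a conventional neural network requires analytical calculation or a training algorithm, and looks for a way to obtain weights quickly without complex computation or training. The key to its approach is using the potential of an electrostatic field to set the weights of the first-layer neurons directly: the field is simulated in the Builder C++ environment and a neural network based on metric recognition methods is built there, so that weight values are obtained almost instantly.
Link: https://arxiv.org/abs/2507.02933
Authors: Geidarov Polad
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures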
Abstract:This paper explores the possibility of determining the weights and thresholds of a neural network using the potential – a parameter of an electrostatic field – without analytical calculations and without applying training algorithms. The work is based on neural network architectures employing metric recognition methods. The electrostatic field is simulated in the Builder C++ environment. In the same environment, a neural network based on metric recognition methods is constructed, with the weights of the first-layer neurons determined by the values of the potentials of the simulated electrostatic field. The effectiveness of the resulting neural network within the simulated system is evaluated using the MNIST test dataset under various initial conditions of the simulated system. The results demonstrated functional viability. The implementation of this approach shows that a neural network can obtain weight values almost instantaneously from the electrostatic field, without the need for analytical computations, lengthy training procedures, or massive training datasets.
[AI-183] MolProphecy: Bridging Medicinal Chemists Knowledge and Molecular Pre-Trained Models via a Multi-Modal Framework
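[Quick read]: This paper addresses the failure of current molecular property prediction models to capture the tacit, interpretive reasoning found in expert-driven molecular design. The key to its approach is MolProphecy, a human-in-the-loop multi-modal framework that uses ChatGPT as a virtual chemist to simulate expert reasoning, embeds the generated chemist knowledge with a large language model (LLM) as a dedicated knowledge representation, and fuses it with graph-based molecular features through a gated cross-attention mechanism, enabling joint reasoning over human-derived and structural features.
Link: https://arxiv.org/abs/2507.02932
Authors: Jianping Zhao,Qiong Zhou,Tian Wang,Yusi Fan,Qian Yang,Li Jiao,Chang Liu,Zhehao Guo,Qi Lu,Fengfeng Zhou,Ruochi Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 16 pages,7 figures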
Abstract:MolProphecy is a human-in-the-loop (HITL) multi-modal framework designed to integrate chemists’ domain knowledge into molecular property prediction models. While molecular pre-trained models have enabled significant gains in predictive accuracy, they often fail to capture the tacit, interpretive reasoning central to expert-driven molecular design. To address this, MolProphecy employs ChatGPT as a virtual chemist to simulate expert-level reasoning and decision-making. The generated chemist knowledge is embedded by the large language model (LLM) as a dedicated knowledge representation and then fused with graph-based molecular features through a gated cross-attention mechanism, enabling joint reasoning over human-derived and structural features. Evaluated on four benchmark datasets (FreeSolv, BACE, SIDER, and ClinTox), MolProphecy outperforms state-of-the-art (SOTA) models, achieving a 15.0 percent reduction in RMSE on FreeSolv and a 5.39 percent improvement in AUROC on BACE. Analysis reveals that chemist knowledge and structural features provide complementary contributions, improving both accuracy and interpretability. MolProphecy offers a practical and generalizable approach for collaborative drug discovery, with the flexibility to incorporate real chemist input in place of the current simulated proxy–without the need for model retraining. The implementation is publicly available at this https URL.
[AI-184] PlaceFM: A Training-free Geospatial Foundation Model of Places
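[Quick read]: This paper addresses the lack of flexible reasoning about "places" in existing foundation models for geospatial intelligence systems, that is, how to capture representations of context-rich regions spanning different spatial granularities. The key to its approach is PlaceFM, a spatial foundation model based on a training-free graph condensation method, which builds a nationwide point-of-interest (POI) graph from integrated Foursquare and OpenStreetMap data in the U.S. and generates general-purpose place embeddings that support multi-scale geospatial analysis tasks.
Link: https://arxiv.org/abs/2507.02921
Authors: Mohammad Hashemi,Hossein Amiri,Andreas Zufle
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: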
Abstract:Spatial structure is central to effective geospatial intelligence systems. While foundation models have shown promise, they often lack the flexibility to reason about places, which are context-rich regions spanning different spatial granularities. We propose PlaceFM, a spatial foundation model that captures place representations using a training-free graph condensation method. PlaceFM condenses a nationwide POI graph built from integrated Foursquare and OpenStreetMap data in the U.S., generating general-purpose embeddings of places. These embeddings can be seamlessly integrated into geolocation data pipelines to support a wide range of downstream tasks. Without requiring pretraining, PlaceFM offers a scalable and adaptable solution for multi-scale geospatial analysis.
[AI-185] Visual-Conversational Interface for Evidence-Based Explanation of Diabetes Risk Prediction
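[Quick read]: This paper addresses two key problems of AI-driven clinical decision support systems: overly complex visualizations and a lack of grounding in scientific evidence. The key to the solution is integrating interactive visualizations with a conversational agent, combining fine-tuned language models for analytical queries with general-purpose large language models for broader medical questions, together with a method for grounding AI explanations in scientific evidence and a feature range analysis technique that deepens understanding of feature contributions.
Link: https://arxiv.org/abs/2507.02920
Authors: Reza Samimi, Aditya Bhattacharya, Lucija Gosak, Gregor Stiglic, Katrien Verbert
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 5 figures, 7th ACM Conference on Conversational User Interfaces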
Abstract:Healthcare professionals need effective ways to use, understand, and validate AI-driven clinical decision support systems. Existing systems face two key limitations: complex visualizations and a lack of grounding in scientific evidence. We present an integrated decision support system that combines interactive visualizations with a conversational agent to explain diabetes risk assessments. We propose a hybrid prompt handling approach combining fine-tuned language models for analytical queries with general Large Language Models (LLMs) for broader medical questions, a methodology for grounding AI explanations in scientific evidence, and a feature range analysis technique to support deeper understanding of feature contributions. We conducted a mixed-methods study with 30 healthcare professionals and found that the conversational interactions helped healthcare professionals build a clear understanding of model assessments, while the integration of scientific evidence calibrated trust in the system’s decisions. Most participants reported that the system supported both patient risk evaluation and recommendation.
[AI-186] Echo State Transformer: When chaos brings memory
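[Quick read]: This paper addresses two shortcomings of large language models and their underlying Transformer architecture when processing sequential data: computational complexity that grows quadratically with sequence length, and a mismatch with how the human brain handles cognitive tasks. The key to the solution is a hybrid architecture, Echo State Transformers (EST), which combines Transformer attention with principles from Reservoir Computing. It introduces a "Working Memory" module built on several reservoirs running in parallel and trains the reservoir hyperparameters so that the trade-off between memory and non-linearity adapts dynamically. The result is a fixed-size distributed memory system whose computational cost per processing step stays constant, avoiding the quadratic scaling of standard Transformers.
Link: https://arxiv.org/abs/2507.02917
Authors: Yannis Bendi-Ouis (Mnemosyne), Xavier Hinaut (Mnemosyne)
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: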
Abstract:While Large Language Models and their underlying Transformer architecture are remarkably efficient, they do not reflect how our brain processes and learns a diversity of cognitive tasks such as language and working memory. Furthermore, sequential data processing with Transformers encounters a fundamental barrier: quadratic complexity growth with sequence length. Motivated by these limitations, our ambition is to create more efficient models that are less reliant on intensive computations and massive volumes of data. We introduce Echo State Transformers (EST), a hybrid architecture that elegantly resolves this challenge while demonstrating exceptional performance in low-data regimes. EST integrates the Transformer attention mechanisms with principles from Reservoir Computing to create a fixed-size window distributed memory system. Drawing inspiration from Echo State Networks, the most prominent instance of the Reservoir Computing paradigm, our architecture integrates a new module called "Working Memory" based on several reservoirs (i.e., random recurrent networks) working in parallel. These reservoirs work as independent memory units with distinct internal dynamics. A novelty here is that the classical reservoir hyperparameters controlling the dynamics are now trained. Thus, the EST dynamically adapts the memory/non-linearity trade-off in reservoirs. By maintaining a fixed number of memory units regardless of sequence length, EST achieves constant computational complexity at each processing step, effectively breaking the quadratic scaling problem of standard Transformers. Evaluations on the STREAM benchmark, which comprises 12 diverse sequential processing tasks, demonstrate that EST outperforms GRUs, LSTMs, and even Transformers on 8 of these tasks. These findings highlight that Echo State Transformers can be an effective replacement for GRUs and LSTMs while complementing standard Transformers, at least in resource-constrained environments and low-data scenarios, across diverse sequential processing tasks.
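To make the fixed-size memory idea concrete, here is a minimal sketch of a single leaky-integrator reservoir of the kind used in Echo State Networks. It is not the EST architecture: the helper names (`make_reservoir`, `step`, `run`), the weight scaling, and the fixed `leak` value are assumptions for illustration, whereas the abstract says EST trains such hyperparameters and runs several reservoirs in parallel.

```python
import math
import random


def make_reservoir(n_units, n_inputs, seed=0):
    """Build random, untrained input and recurrent weights (hypothetical helper)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_units)]
    # Dividing by n_units keeps recurrent weights small; a crude stand-in for
    # the spectral-radius scaling normally used with reservoirs.
    w_rec = [[rng.uniform(-1.0, 1.0) / n_units for _ in range(n_units)]
             for _ in range(n_units)]
    return w_in, w_rec


def step(state, u, w_in, w_rec, leak):
    """Leaky update: x <- (1 - leak) * x + leak * tanh(W_in u + W_rec x)."""
    new_state = []
    for i in range(len(state)):
        pre = sum(w * x for w, x in zip(w_in[i], u))
        pre += sum(w * x for w, x in zip(w_rec[i], state))
        new_state.append((1.0 - leak) * state[i] + leak * math.tanh(pre))
    return new_state


def run(sequence, n_units=8, leak=0.3):
    """Feed a whole sequence through the reservoir and return the final state."""
    w_in, w_rec = make_reservoir(n_units, len(sequence[0]))
    state = [0.0] * n_units
    for u in sequence:
        state = step(state, u, w_in, w_rec, leak)
    return state
```

The state has `n_units` entries whether the sequence has 2 or 200 elements, so the work per step does not grow with sequence length, which is the property the abstract contrasts with quadratic attention.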
[AI-187] Efficient Certified Reasoning for Binarized Neural Networks
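[Quick read]: This paper addresses the verification of Binarized Neural Networks (BNNs) in safety-critical applications, where existing methods fall short in either scalability or soundness. The key to the solution is a scalable and trustworthy verification approach that introduces a native representation of BNN constraints in a custom-designed solver for qualitative reasoning and in an approximate model counter for quantitative reasoning, along with specialized proof generation and checking pipelines that ensure every verification result can be trusted.
Link: https://arxiv.org/abs/2507.02916
Authors: Jiong Yang, Yong Kiam Tan, Mate Soos, Magnus O. Myreen, Kuldeep S. Meel
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 18 pages, 4 figures, to be published in SAT25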
Abstract:Neural networks have emerged as essential components in safety-critical applications; these use cases demand complex, yet trustworthy computations. Binarized Neural Networks (BNNs) are a type of neural network where each neuron is constrained to a Boolean value; they are particularly well-suited for safety-critical tasks because they retain much of the computational capacities of full-scale (floating-point or quantized) deep neural networks, but remain compatible with satisfiability solvers for qualitative verification and with model counters for quantitative reasoning. However, existing methods for BNN analysis suffer from either limited scalability or susceptibility to soundness errors, which hinders their applicability in real-world scenarios. In this work, we present a scalable and trustworthy approach for both qualitative and quantitative verification of BNNs. Our approach introduces a native representation of BNN constraints in a custom-designed solver for qualitative reasoning, and in an approximate model counter for quantitative reasoning. We further develop specialized proof generation and checking pipelines with native support for BNN constraint reasoning, ensuring trustworthiness for all of our verification results. Empirical evaluations on a BNN robustness verification benchmark suite demonstrate that our certified solving approach achieves a 9× speedup over prior certified CNF and PB-based approaches, and our certified counting approach achieves a 218× speedup over the existing CNF-based baseline. In terms of coverage, our pipeline produces fully certified results for 99% and 86% of the qualitative and quantitative reasoning queries on BNNs, respectively. This is in sharp contrast to the best existing baselines which can fully certify only 62% and 4% of the queries, respectively.
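As a toy illustration of the two query types, the sketch below models one binarized neuron over inputs in {-1, +1} and counts, by brute-force enumeration, how many perturbed inputs change its output. This is only meant to show what a quantitative robustness query asks; the paper relies on a custom solver and an approximate model counter, not enumeration, and the function names here are hypothetical.

```python
from itertools import combinations


def neuron(weights, bias, x):
    """Binarized neuron: +1 if the weighted sum plus bias is non-negative, else -1."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total >= 0 else -1


def count_flips(weights, bias, x, max_flips):
    """Count inputs at Hamming distance 1..max_flips from x whose output differs."""
    base = neuron(weights, bias, x)
    count = 0
    for k in range(1, max_flips + 1):
        for idx in combinations(range(len(x)), k):
            y = list(x)
            for i in idx:
                y[i] = -y[i]  # flip the selected binary inputs
            if neuron(weights, bias, y) != base:
                count += 1
    return count
```

A qualitative query asks whether `count_flips(...)` is zero (robust) or not; a quantitative query asks for the count itself. Enumeration grows combinatorially with input size, which is why dedicated solvers and counters are needed in practice.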
[AI-188] Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning
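[Quick read]: This paper addresses self-supervised learning on audio data by building an efficient framework for audio representation learning. The key to the solution is Audio-JEPA, a model based on the Joint-Embedding Predictive Architecture (JEPA) that uses a simple Vision Transformer backbone to predict latent representations of masked mel-spectrogram patches rather than reconstructing the raw audio signal, which enables effective representation learning while reducing computational cost.
Link: https://arxiv.org/abs/2507.02915
Authors: Ludovic Tuncay (IRIT-SAMoVA), Etienne Labbé (IRIT-SAMoVA), Emmanouil Benetos (QMUL), Thomas Pellegrini (IRIT-SAMoVA)
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: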
Abstract:Building on the Joint-Embedding Predictive Architecture (JEPA) paradigm, a recent self-supervised learning framework that predicts latent representations of masked regions in high-level feature spaces, we propose Audio-JEPA (Audio Joint-Embedding Predictive Architecture), tailored specifically for audio data. Audio-JEPA uses a simple Vision Transformer backbone to predict latent representations of masked spectrogram patches rather than reconstructing raw audio. We pre-train on unlabeled AudioSet clips (10s, 32kHz) with random patch masking on mel-spectrograms. We evaluate on the X-ARES suite covering speech, music, and environmental sound tasks. Although our implementation is a straightforward translation of the original model to audio, the results still show comparable performance to wav2vec 2.0 and data2vec while using less than one-fifth of their training data and with no hyper-parameter tuning. All code and pretrained checkpoints will be released on GitHub.
[AI-189] OAK – Onboarding with Actionable Knowledge
[Quick read]: This paper addresses the loss of knowledge when skilled operators leave a company, a problem made critical by the diverse and unstructured nature of that know-how. The key to the solution is combining knowledge graph embeddings with multi-modal interfaces to collect and retrieve expertise, making it actionable and supporting decision-making on the shop floor. Large language models (LLMs) further improve query understanding and provide adapted answers.
Link: https://arxiv.org/abs/2507.02914
Authors: Steve Devènes (1), Marine Capallera (2), Robin Cherix (2), Elena Mugellini (2), Omar Abou Khaled (2), Francesco Carrino (1) ((1) Institute of Systems Engineering, HEI-VS, HES-SO University of Applied Sciences and Arts Western Switzerland, (2) HumanTech Institute, HEIA HES-SO University of Applied Sciences and Arts Western Switzerland)
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper is an extended version of the work originally presented at the AI-Days 2024 conference in Lausanne, Switzerland. It builds upon the findings shared during the conference and includes additional results and analysis
Abstract:The loss of knowledge when skilled operators leave poses a critical issue for companies. This know-how is diverse and unstructured. We propose a novel method that combines knowledge graph embeddings and multi-modal interfaces to collect and retrieve expertise, making it actionable. Our approach supports decision-making on the shop floor. Additionally, we leverage LLMs to improve query understanding and provide adapted answers. As application case studies, we developed a proof-of-concept for quality control in high precision manufacturing.
[AI-190] Toward Cyclic A.I. Modelling of Self-Regulated Learning: A Case Study with E-Learning Trace Data
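[Quick read]: This paper addresses how to represent the cyclical and undirected nature of self-regulated learning (SRL) within contemporary machine learning frameworks. The key to the solution is applying SRL-informed features to trace data, which improves the modelling of students' SRL activities and strengthens predictability and explainability regarding the causal effects of learning in e-learning environments.
Link: https://arxiv.org/abs/2507.02913
Authors: Andrew Schwabe, Özgür Akgün, Ella Haig
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 6 pages, 3 figures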
Abstract:Many e-learning platforms assert their ability or potential to improve students’ self-regulated learning (SRL); however, the cyclical and undirected nature of SRL theoretical models presents significant challenges for representation within contemporary machine learning frameworks. We apply SRL-informed features to trace data in order to advance modelling of students’ SRL activities and to improve predictability and explainability regarding the causal effects of learning in an eLearning environment. We demonstrate that these features improve predictive accuracy and validate the value of further research into cyclic modelling techniques for SRL.
[AI-191] Multicollinearity Resolution Based on Machine Learning: A Case Study of Carbon Emissions
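[Quick read]: This paper addresses multifactorial problems characterized by structural complexity and multicollinearity, using carbon emissions analysis as an example. The key to the solution is combining DBSCAN clustering with the Elastic Net regression model: DBSCAN performs unsupervised learning to cluster features objectively, while the Elastic Net handles high-dimensional feature selection and complexity control. By combining L1 (lasso) and L2 (ridge) penalties, the Elastic Net balances feature selection and regularization, which suits datasets with correlated predictors.
Link: https://arxiv.org/abs/2507.02912
Authors: Xuanming Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Vital Renew Update Based on Previous Version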
Abstract:This study proposes an analytical framework that integrates DBSCAN clustering with the Elastic Net regression model to address multifactorial problems characterized by structural complexity and multicollinearity, exemplified by carbon emissions analysis. DBSCAN is employed for unsupervised learning to objectively cluster features, while the Elastic Net is utilized for high-dimensional feature selection and complexity control. The Elastic Net is specifically chosen for its ability to balance feature selection and regularization by combining L1 (lasso) and L2 (ridge) penalties, making it particularly suited for datasets with correlated predictors. Applying this framework to energy consumption data from 46 industries in China (2000-2019) resulted in the identification of 16 categories. Emission characteristics and drivers were quantitatively assessed for each category, demonstrating the framework’s capacity to identify primary emission sources and provide actionable insights. This research underscores the global applicability of the framework for analyzing complex regional challenges, such as carbon emissions, and highlights that qualitative features that humans find meaningful may not be accurate for the model.
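For readers unfamiliar with how the L1 and L2 penalties are combined, one common parametrization of the Elastic Net objective is written out below. The abstract does not state which form the paper uses, so the symbols ($n$ samples, design matrix $X$, response $y$, overall strength $\lambda$, mixing weight $\alpha$) are assumptions taken from the standard textbook formulation.

```latex
\hat{\beta} = \arg\min_{\beta} \;
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right),
\qquad \lambda \ge 0, \; 0 \le \alpha \le 1 .
```

Setting $\alpha = 1$ recovers the lasso and $\alpha = 0$ recovers ridge regression; intermediate values keep the sparsity of the L1 term while the L2 term stabilizes coefficients of correlated predictors.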
[AI-192] DiceHuBERT: Distilling HuBERT with a Self-Supervised Learning Objective INTERSPEECH
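[Quick read]: This paper addresses how to compress HuBERT, a speech foundation model based on self-supervised learning (SSL), while preserving its performance. Existing knowledge distillation methods typically rely on layer-wise and feature-wise mappings between teacher and student models. The key innovation of DiceHuBERT is to exploit HuBERT's own iterative self-distillation mechanism by directly replacing the original model with a student model, so the student is trained with the same SSL objective used to pre-train HuBERT, without additional modules or architectural constraints. On SUPERB it outperforms existing distillation methods, improving phoneme recognition by over 21% and ASR by more than 14%.
Link: https://arxiv.org/abs/2507.02911
Authors: Hyung Gun Chi, Zakaria Aldeneh, Tatiana Likhomanenko, Oggi Rudovic, Takuya Higuchi, Li-Wei Chen, Shinji Watanabe, Ahmed Hussen Abdelaziz
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 1 figure, interspeech accepted paper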
Abstract:We introduce DiceHuBERT, a knowledge distillation framework for compressing HuBERT, a widely used self-supervised learning (SSL)-based speech foundation model. Unlike existing distillation methods that rely on layer-wise and feature-wise mapping between teacher and student models, DiceHuBERT leverages HuBERT’s iterative self-distillation mechanism by directly replacing the original model with a student model. This replacement allows the student to be trained using the same SSL objective used when pre-training HuBERT, eliminating the need for additional modules or architectural constraints. Experimental results on SUPERB show that DiceHuBERT consistently outperforms existing distillation methods, improving phoneme recognition performance by over 21% and ASR performance by more than 14%. Furthermore, DiceHuBERT demonstrates competitive performance across multiple tasks, highlighting its clear advantage.
[AI-193] Causal-Paced Deep Reinforcement Learning
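[Quick read]: This paper addresses task sequence design in curriculum reinforcement learning (CRL), namely how to identify tasks that promote exploration yet are similar enough to support effective transfer. The key to the solution is Causal-Paced Deep Reinforcement Learning (CP-DRL), a curriculum learning framework based on differences between the Structural Causal Models (SCMs) of tasks. The framework approximates SCM differences from interaction data to capture task novelty and combines this signal with the agent's learnability, measured by reward gain, into a unified objective.
Link: https://arxiv.org/abs/2507.02910
Authors: Geonwoo Cho, Jaegyun Im, Doyoon Kim, Sundong Kim
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Workshop on Causal Reinforcement Learning, Reinforcement Learning Conference (RLC) 2025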
Abstract:Designing effective task sequences is crucial for curriculum reinforcement learning (CRL), where agents must gradually acquire skills by training on intermediate tasks. A key challenge in CRL is to identify tasks that promote exploration, yet are similar enough to support effective transfer. While a recent approach suggests comparing tasks via their Structural Causal Models (SCMs), the method requires access to ground-truth causal structures, an unrealistic assumption in most RL settings. In this work, we propose Causal-Paced Deep Reinforcement Learning (CP-DRL), a curriculum learning framework aware of SCM differences between tasks based on interaction data approximation. This signal captures task novelty, which we combine with the agent’s learnability, measured by reward gain, to form a unified objective. Empirically, CP-DRL outperforms existing curriculum methods on the Point Mass benchmark, achieving faster convergence and higher returns. CP-DRL demonstrates reduced variance with comparable final returns in the Bipedal Walker-Trivial setting, and achieves the highest average performance in the Infeasible variant. These results indicate that leveraging causal relationships between tasks can improve the structure-awareness and sample efficiency of curriculum reinforcement learning. We provide the full implementation of CP-DRL to facilitate the reproduction of our main results at this https URL.
[AI-194] Hyperbolic Kernel Graph Neural Networks for Neurocognitive Decline Analysis from Multimodal Brain Imaging
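[Quick read]: This paper addresses the fact that fusing multimodal neuroimaging data in Euclidean space cannot effectively capture the intrinsic hierarchical organization of brain networks, which limits accurate analysis of neurocognitive decline. The key to the solution is a hyperbolic kernel graph fusion (HKGF) framework that embeds brain graphs in hyperbolic space using hyperbolic kernel graph neural networks (HKGNNs), capturing both local and global dependencies among brain regions while preserving the hierarchical structure of brain networks, and thereby enabling more effective multimodal fusion and prediction of neurocognitive decline.
Link: https://arxiv.org/abs/2507.02908
Authors: Meimei Yang, Yongheng Sun, Qianqian Wang, Andrea Bozoki, Maureen Kohi, Mingxia Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures, 7 tables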
Abstract:Multimodal neuroimages, such as diffusion tensor imaging (DTI) and resting-state functional MRI (fMRI), offer complementary perspectives on brain activities by capturing structural or functional interactions among brain regions. While existing studies suggest that fusing these multimodal data helps detect abnormal brain activity caused by neurocognitive decline, they are generally implemented in Euclidean space and can’t effectively capture intrinsic hierarchical organization of structural/functional brain networks. This paper presents a hyperbolic kernel graph fusion (HKGF) framework for neurocognitive decline analysis with multimodal neuroimages. It consists of a multimodal graph construction module, a graph representation learning module that encodes brain graphs in hyperbolic space through a family of hyperbolic kernel graph neural networks (HKGNNs), a cross-modality coupling module that enables effective multimodal data fusion, and a hyperbolic neural network for downstream predictions. Notably, HKGNNs represent graphs in hyperbolic space to capture both local and global dependencies among brain regions while preserving the hierarchical structure of brain networks. Extensive experiments involving over 4,000 subjects with DTI and/or fMRI data suggest the superiority of HKGF over state-of-the-art methods in two neurocognitive decline prediction tasks. HKGF is a general framework for multimodal data analysis, facilitating objective quantification of structural/functional brain connectivity changes associated with neurocognitive decline.
[AI-195] Preference-Optimal Multi-Metric Weighting for Parallel Coordinate Plots
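[Quick read]: This paper addresses the difficulty of showing the relationship between control parameters and metrics through color gradation when multiple metrics are present. A naive approach computes a single metric by linear weighting, but how to set the weights is unclear to users. The key to the solution is a principled formulation for computing the optimal weights from a specific preferred metric combination, together with radar charts placed on a 2D plane reduced by UMAP to visualize metric trade-offs, helping users select their preference more intuitively.
Link: https://arxiv.org/abs/2507.02905
Authors: Chisa Mori, Shuhei Watanabe, Masaki Onishi, Takayuki Itoh
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to International Conference Information Visualisation (iV2025)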
Abstract:Parallel coordinate plots (PCPs) are a prevalent method to interpret the relationship between the control parameters and metrics. PCPs deliver such an interpretation by color gradation based on a single metric. However, it is challenging to provide such a gradation when multiple metrics are present. Although a naive approach involves calculating a single metric by linearly weighting each metric, such weighting is unclear for users. To address this problem, we first propose a principled formulation for calculating the optimal weight based on a specific preferred metric combination. Although users can simply select their preference from a two-dimensional (2D) plane for bi-metric problems, multi-metric problems require intuitive visualization to allow them to select their preference. We achieved this using various radar charts to visualize the metric trade-offs on the 2D plane reduced by UMAP. In the analysis using pedestrian flow guidance planning, our method identified unique patterns of control parameter importance for each user preference, highlighting the effectiveness of our method.
[AI-196] Particle Swarm Optimization for Quantum Circuit Synthesis: Performance Analysis and Insights
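[Quick read]: This paper addresses how particle swarm optimization (PSO) can be used to generate quantum circuits that solve an instance of the MaxOne problem. The key to the solution is encoding and representing quantum circuits as PSO parameters and using the MaxOne problem for fitness evaluation, in order to examine the effectiveness of PSO for quantum circuit synthesis and its faster convergence relative to a genetic algorithm.
Link: https://arxiv.org/abs/2507.02898
Authors: Mirza Hizriyan Nubli Hidayat, Tan Chye Cheah
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: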
Abstract:This paper discusses how particle swarm optimization (PSO) can be used to generate quantum circuits to solve an instance of the MaxOne problem. It then analyzes previous studies on evolutionary algorithms for circuit synthesis. With a brief introduction to PSO, including its parameters and algorithm flow, the paper focuses on a method of quantum circuit encoding and representation as PSO parameters. The fitness evaluation used in this paper is the MaxOne problem. The paper presents experimental results that compare different learning abilities and inertia weight variations in the PSO algorithm. A comparison is further made between the PSO algorithm and a genetic algorithm for quantum circuit synthesis. The results suggest PSO converges more quickly to the optimal solution.
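The PSO loop with inertia weight and two learning factors can be sketched in a few lines. The version below optimizes MaxOne directly over a thresholded real vector; that `decode` step is an illustrative assumption and is not the quantum circuit encoding described in the paper, and the parameter values (`w`, `c1`, `c2`) are arbitrary defaults rather than the settings the authors compare.

```python
import random


def max_one(bits):
    """MaxOne fitness: number of ones in the bit string."""
    return sum(bits)


def decode(position):
    """Threshold each coordinate at 0.5 (illustrative encoding, not the paper's)."""
    return [1 if p > 0.5 else 0 for p in position]


def pso(n_bits=12, n_particles=10, iters=60, w=0.7, c1=1.5, c2=1.5, seed=1):
    """Return the best fitness found and its history over iterations."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_fit = [max_one(decode(p)) for p in pos]
    g = max(range(n_particles), key=lambda i: best_fit[i])
    g_pos, g_fit = best_pos[g][:], best_fit[g]
    history = [g_fit]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                # Inertia term plus pulls toward personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = max_one(decode(pos[i]))
            if fit > best_fit[i]:
                best_fit[i], best_pos[i] = fit, pos[i][:]
                if fit > g_fit:
                    g_fit, g_pos = fit, pos[i][:]
        history.append(g_fit)
    return g_fit, history
```

Varying `w` (inertia weight) and `c1`/`c2` (learning factors) in this loop is the kind of comparison the abstract describes; the recorded `history` makes convergence speed easy to plot.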
[AI-197] Large Language Model-Driven Surrogate-Assisted Evolutionary Algorithm for Expensive Optimization
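[Quick read]: This paper addresses the fact that designing an effective dynamic selection strategy for surrogate-assisted evolutionary algorithms (SAEAs) is labor-intensive and requires substantial domain knowledge. The key to the solution is LLM-SAEA, a new approach that integrates large language models (LLMs) to configure surrogate models and infill sampling criteria online. Specifically, LLM-SAEA builds a collaboration-of-experts framework in which a scoring expert (LLM-SE) scores surrogate models and infill sampling criteria according to their optimization performance, and a decision expert (LLM-DE) selects suitable configurations by analyzing those scores together with the current optimization state.
Link: https://arxiv.org/abs/2507.02892
Authors: Lindong Xie, Genghui Li, Zhenkun Wang, Edward Chung, Maoguo Gong
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: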
Abstract:Surrogate-assisted evolutionary algorithms (SAEAs) are a key tool for addressing costly optimization tasks, with their efficiency being heavily dependent on the selection of surrogate models and infill sampling criteria. However, designing an effective dynamic selection strategy for SAEAs is labor-intensive and requires substantial domain knowledge. To address this challenge, this paper proposes LLM-SAEA, a novel approach that integrates large language models (LLMs) to configure both surrogate models and infill sampling criteria online. Specifically, LLM-SAEA develops a collaboration-of-experts framework, where one LLM serves as a scoring expert (LLM-SE), assigning scores to surrogate models and infill sampling criteria based on their optimization performance, while another LLM acts as a decision expert (LLM-DE), selecting the appropriate configurations by analyzing their scores along with the current optimization state. Experimental results demonstrate that LLM-SAEA outperforms several state-of-the-art algorithms across standard test cases. The source code is publicly available at this https URL.
[AI-198] Lightweight LSTM Model for Energy Theft Detection via Input Data Reduction
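[Quick read]: This paper addresses energy theft detection in smart grids, where the high computational and energy cost of continuous inference limits the practicality of conventional AI-based models, particularly in low-theft scenarios. The key to the solution is a lightweight detection unit (a watchdog mechanism) that acts as a pre-filter deciding when to activate a long short-term memory (LSTM) model. It reduces the volume of data fed to the LSTM so that only instances likely to involve energy theft are analyzed in depth, substantially lowering energy consumption while preserving detection accuracy.
Link: https://arxiv.org/abs/2507.02872
Authors: Caylum Collier, Krishnendu Guha
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages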
Abstract:With the increasing integration of smart meters in electrical grids worldwide, detecting energy theft has become a critical and ongoing challenge. Artificial intelligence (AI)-based models have demonstrated strong performance in identifying fraudulent consumption patterns; however, previous works exploring the use of machine learning solutions for this problem demand high computational and energy costs, limiting their practicality – particularly in low-theft scenarios where continuous inference can result in unnecessary energy usage. This paper proposes a lightweight detection unit, or watchdog mechanism, designed to act as a pre-filter that determines when to activate a long short-term memory (LSTM) model. This mechanism reduces the volume of input fed to the LSTM model, limiting it to instances that are more likely to involve energy theft thereby preserving detection accuracy while substantially reducing energy consumption associated with continuous model execution. The proposed system was evaluated through simulations across six scenarios with varying theft severity and number of active thieves. Results indicate a power consumption reduction exceeding 64%, with minimal loss in detection accuracy and consistently high recall. These findings support the feasibility of a more energy-efficient and scalable approach to energy theft detection in smart grids. In contrast to prior work that increases model complexity to achieve marginal accuracy gains, this study emphasizes practical deployment considerations such as inference efficiency and system scalability. The results highlight the potential for deploying sustainable, AI-assisted monitoring systems within modern smart grid infrastructures.
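The pre-filter pattern described here can be sketched as a cheap rule that gates calls to an expensive model. The abstract does not specify the watchdog's rule, so the rolling-mean drop test, the `window` and `drop_ratio` parameters, and the `expensive_model` callable (standing in for the LSTM) are all hypothetical.

```python
def watchdog_flags(readings, window=4, drop_ratio=0.5):
    """Flag a reading that falls below drop_ratio times the mean of the prior window."""
    flags = []
    for i, value in enumerate(readings):
        history = readings[max(0, i - window):i]
        if not history:
            flags.append(False)  # no baseline yet, so nothing to compare against
            continue
        baseline = sum(history) / len(history)
        flags.append(value < drop_ratio * baseline)
    return flags


def detect(readings, expensive_model):
    """Run the expensive model only on flagged readings; return (alerts, model_calls)."""
    alerts, calls = [], 0
    for i, flagged in enumerate(watchdog_flags(readings)):
        if flagged:
            calls += 1
            if expensive_model(readings, i):
                alerts.append(i)
    return alerts, calls
```

With mostly normal consumption, `calls` stays far below `len(readings)`, which is where the energy saving comes from; the trade-off is that anything the watchdog misses never reaches the heavier model.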
[AI-199] ZettaLith: An Architectural Exploration of Extreme-Scale AI Inference Acceleration
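[Quick read]: This paper addresses the high computational cost and power consumption of current and anticipated AI systems, which limit widespread deployment and further scaling. The proposed solution is the ZettaLith architecture. Its key is a set of co-designed architectural innovations built on established digital electronic technologies that, according to the paper's projections, reduce the cost and power of AI inference by more than 1,000x compared with current GPU-based systems. ZettaLith targets AI inference exclusively and abandons general-purpose GPU applications in exchange for higher performance, power efficiency, and cost-effectiveness.
Link: https://arxiv.org/abs/2507.02871
Authors: Kia Silverbrook
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 53 pages, 15 figures, 23 tables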
Abstract:The high computational cost and power consumption of current and anticipated AI systems present a major challenge for widespread deployment and further scaling. Current hardware approaches face fundamental efficiency limits. This paper introduces ZettaLith, a scalable computing architecture designed to reduce the cost and power of AI inference by over 1,000x compared to current GPU-based systems. Based on architectural analysis and technology projections, a single ZettaLith rack could potentially achieve 1.507 zettaFLOPS in 2027 - representing a theoretical 1,047x improvement in inference performance, 1,490x better power efficiency, and could be 2,325x more cost-effective than current leading GPU racks for FP4 transformer inference. The ZettaLith architecture achieves these gains by abandoning general purpose GPU applications, and via the multiplicative effect of numerous co-designed architectural innovations using established digital electronic technologies, as detailed in this paper. ZettaLith’s core architectural principles scale down efficiently to exaFLOPS desktop systems and petaFLOPS mobile chips, maintaining their roughly 1,000x advantage. ZettaLith presents a simpler system architecture compared to the complex hierarchy of current GPU clusters. ZettaLith is optimized exclusively for AI inference and is not applicable for AI training.
[AI-200] Enhancing the Aesthetic Appeal of AI-Generated Physical Product Designs through LoRA Fine-Tuning with Human Feedback
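[Quick read]: This paper addresses the insufficient desirability and aesthetic appeal of generative AI outputs in physical product design, using lamp design as a case study. The key to the solution is Low-Rank Adaptation (LoRA) fine-tuning guided by human aesthetic evaluations, which integrates human feedback into the AI model to improve the aesthetics of, and user preference for, the generated designs.
Link: https://arxiv.org/abs/2507.02865
Authors: Dinuo Liao, James Derek Lomas, Cehao Yu
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 6 pages, 7 figures. Submitted to AIGC2024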
Abstract:This study explores how Low-Rank Adaptation (LoRA) fine-tuning, guided by human aesthetic evaluations, can enhance the outputs of generative AI models in tangible product design, using lamp design as a case study. By integrating human feedback into the AI model, we aim to improve both the desirability and aesthetic appeal of the generated designs. Comprehensive experiments were conducted, starting with prompt optimization techniques and focusing on LoRA fine-tuning of the Stable Diffusion model. Additionally, methods to convert AI-generated designs into tangible products through 3D realization using 3D printing technologies were investigated. The results indicate that LoRA fine-tuning effectively aligns AI-generated designs with human aesthetic preferences, leading to significant improvements in desirability and aesthetic appeal scores. These findings highlight the potential of human-AI collaboration in tangible product design and provide valuable insights into integrating human feedback into AI design processes.
[AI-201] Continual Gradient Low-Rank Projection Fine-Tuning for LLMs ACL2025
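[Quick read]: This paper addresses the trade-off between efficiency and expressiveness in continual fine-tuning of large language models (LLMs). The key to the solution is GORP (Gradient LOw Rank Projection), a new training strategy that synergistically combines full and low-rank parameters and updates them jointly within a unified low-rank gradient subspace, overcoming the limitations of conventional Low-Rank Adaptation (LoRA) in learning new tasks and transferring knowledge. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting.
Link: https://arxiv.org/abs/2507.02503
Authors: Chenxu Wang, Yilin Lyu, Zicheng Sun, Liping Jing
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 15 pages, 6 figures, accepted by ACL 2025 main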
Abstract:Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model’s ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel training strategy that overcomes these limitations by synergistically combining full and low-rank parameters and jointly updating within a unified low-rank gradient subspace. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting. Extensive experiments on continual learning benchmarks demonstrate GORP’s superior performance compared to existing state-of-the-art approaches. Code is available at this https URL.
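As background for the "low-rank nature" that the abstract says constrains LoRA, the standard LoRA parametrization of a weight update is shown below. This is the baseline being contrasted, not GORP itself; the abstract does not give GORP's projection formula, so none is reproduced here. The dimensions $d$, $k$ and rank $r$ are generic symbols.

```latex
W = W_0 + \Delta W, \qquad
\Delta W = B A, \qquad
B \in \mathbb{R}^{d \times r}, \;
A \in \mathbb{R}^{r \times k}, \;
r \ll \min(d, k).
```

Only $A$ and $B$ are trained while $W_0$ stays frozen, so the update can never exceed rank $r$. That is the source of both LoRA's efficiency and the expressiveness limit the abstract refers to.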
[AI-202] An Adaptive Simulated Annealing-Based Machine Learning Approach for Developing an E-Triage Tool for Hospital Emergency Operations
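[Quick read]: This paper addresses the efficiency and accuracy of patient triage in emergency departments (EDs) by developing an e-triage tool with machine learning. The key to the solution is optimizing the hyperparameters of the machine learning models to improve performance. The study uses the metaheuristic optimization algorithms simulated annealing (SA) and adaptive simulated annealing (ASA) to tune extreme gradient boosting (XGB) and categorical boosting (CaB) models, and the results show that ASA-CaB performs best in accuracy, precision, recall, and F1 score.
Link: https://arxiv.org/abs/2212.11892
Authors: Abdulaziz Ahmed, Mohammed Al-Maamari, Mohammad Firouz, Dursun Delen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: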
Abstract:Patient triage at emergency departments (EDs) is necessary to prioritize care for patients with critical and time-sensitive conditions. Different tools are used for patient triage and one of the most common ones is the emergency severity index (ESI), which has a scale of five levels, where level 1 is the most urgent and level 5 is the least urgent. This paper proposes a framework for utilizing machine learning to develop an e-triage tool that can be used at EDs. A large retrospective dataset of ED patient visits is obtained from the electronic health record of a healthcare provider in the Midwest of the US for three years. However, the main challenge of using machine learning algorithms is that most of them have many parameters and without optimizing these parameters, developing a high-performance model is not possible. This paper proposes an approach to optimize the hyperparameters of machine learning. The metaheuristic optimization algorithms simulated annealing (SA) and adaptive simulated annealing (ASA) are proposed to optimize the parameters of extreme gradient boosting (XGB) and categorical boosting (CaB). The newly proposed algorithms are SA-XGB, ASA-XGB, SA-CaB, and ASA-CaB. Grid search (GS), which is a traditional approach used for machine learning fine-tuning, is also used to fine-tune the parameters of XGB and CaB, which are named GS-XGB and GS-CaB. The six algorithms are trained and tested using eight data groups obtained from the feature selection phase. The results show ASA-CaB outperformed all the proposed algorithms with accuracy, precision, recall, and F1 of 83.3%, 83.2%, 83.3%, 83.2%, respectively.
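A minimal sketch of plain simulated annealing over a single hyperparameter shows the mechanism being applied. The `loss` callable is a hypothetical stand-in for a validation error that would come from training XGB or CaB; the step size, cooling schedule, and one-dimensional search space are assumptions, and the adaptive variant (ASA) from the paper is not shown.

```python
import math
import random


def simulated_annealing(loss, start, step=0.1, t0=1.0, cooling=0.95,
                        iters=200, seed=0):
    """Minimize loss(x) over one real-valued hyperparameter; return (best, best_loss)."""
    rng = random.Random(seed)
    current, current_loss = start, loss(start)
    best, best_loss = current, current_loss
    temp = t0
    for _ in range(iters):
        candidate = current + rng.uniform(-step, step)
        cand_loss = loss(candidate)
        delta = cand_loss - current_loss
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_loss = candidate, cand_loss
            if current_loss < best_loss:
                best, best_loss = current, current_loss
        temp *= cooling  # geometric cooling makes worse moves rarer over time
    return best, best_loss
```

Unlike grid search, the number of model evaluations is set by `iters` rather than by the size of the grid, and early high-temperature moves let the search escape poor starting values.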
[AI-203] OGF: An Online Gradient Flow Method for Optimizing the Statistical Steady-State Time Averages of Unsteady Turbulent Flows
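[Quick Read]: This paper addresses the challenge of optimizing steady-state statistics in turbulence simulations, especially for systems with many degrees of freedom, where existing computational approaches cannot scale to physically representative numbers of grid points. The core obstacle is the chaoticity of turbulent flows, which makes gradients computed with the adjoint method diverge exponentially over time. The proposed solution is the online gradient-flow (OGF) method, which forward-propagates an online estimate of the gradient of F(x; θ) while simultaneously updating the parameters θ online, enabling optimization of the steady-state statistics of chaotic, unsteady, turbulence-resolving simulations. The key innovations are the fully online nature of the algorithm and its combination with a finite-difference estimator, which avoids the gradient divergence caused by chaoticity.
Link: https://arxiv.org/abs/2507.05149
Authors: Tom Hickling, Jonathan F. MacArt, Justin Sirignano, Den Waidmann
Affiliation: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 13 figures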
Abstract:Turbulent flows are chaotic and unsteady, but their statistical distribution converges to a statistical steady state. Engineering quantities of interest typically take the form of time-average statistics such as $\frac{1}{t} \int_0^t f(u(x,\tau; \theta)) \, d\tau \overset{t \rightarrow \infty}{\rightarrow} F(x; \theta)$, where $u(x,t; \theta)$ are solutions of the Navier–Stokes equations with parameters $\theta$. Optimizing over $F(x; \theta)$ has many engineering applications including geometric optimization, flow control, and closure modeling. However, this remains an open challenge, as existing computational approaches are incapable of scaling to physically representative numbers of grid points. The fundamental obstacle is the chaoticity of turbulent flows: gradients calculated with the adjoint method diverge exponentially as $t \rightarrow \infty$. We develop a new online gradient-flow (OGF) method that is scalable to large degree-of-freedom systems and enables optimizing for the steady-state statistics of chaotic, unsteady, turbulence-resolving simulations. The method forward-propagates an online estimate for the gradient of $F(x; \theta)$ while simultaneously performing online updates of the parameters $\theta$. A key feature is the fully online nature of the algorithm to facilitate faster optimization progress and its combination with a finite-difference estimator to avoid the divergence of gradients due to chaoticity. The proposed OGF method is demonstrated for optimizations over three chaotic ordinary and partial differential equations: the Lorenz-63 equation, the Kuramoto–Sivashinsky equation, and Navier–Stokes solutions of compressible, forced, homogeneous isotropic turbulence. In each case, the OGF method successfully reduces the loss based on $F(x; \theta)$ by several orders of magnitude and accurately recovers the optimal parameters.
[AI-204] A Novel Approach for Estimating Positive Lyapunov Exponents in One-Dimensional Chaotic Time Series Using Machine Learning
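[Quick Read]: This paper addresses the difficulty of accurately estimating the positive Lyapunov exponent (MLE) of nonlinear dynamical systems from experimental data, a fundamental challenge in science and engineering. The key to the solution is a machine learning approach that uses the growth of out-of-sample prediction errors as a proxy for trajectory divergence to estimate the MLE. On several canonical chaotic maps the method shows high accuracy and good generalization, with a near-perfect fit on the Chebyshev map (R²pos = 0.999).
Link: https://arxiv.org/abs/2507.04868
Authors: A. Velichko, M. Belyaev, P. Boriskov
Affiliation: Unknown
Categories: Chaotic Dynamics (nlin.CD); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, 2 Tables, 10 Equations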
Abstract:Understanding and quantifying chaos in nonlinear dynamical systems remains a fundamental challenge in science and engineering. The Lyapunov exponent is a key measure of chaotic behavior, but its accurate estimation from experimental data is often hindered by methodological and computational limitations. In this work, we present a novel machine-learning-based approach for estimating the positive Lyapunov exponent (MLE) from one-dimensional time series, using the growth of out-of-sample prediction errors as a proxy for trajectory divergence. Our method demonstrates high scientific relevance, offering a robust, data-driven alternative to traditional analytic techniques. Through comprehensive testing on several canonical chaotic maps - including the logistic, sine, cubic, and Chebyshev maps - we achieved a coefficient of determination R²pos > 0.9 between predicted and theoretical MLE values for time series as short as M = 200 points. The best accuracy was observed for the Chebyshev map (R²pos = 0.999). Notably, the proposed method maintains high computational efficiency and generalizes well across various machine learning algorithms. These results highlight the significance of our approach for practical chaos analysis in both synthetic and experimental settings, opening new possibilities for robust nonlinear dynamics assessment when only time series data are available.
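As context for the "theoretical MLE values" that the abstract compares against, here is a minimal sketch of the standard derivative-based estimate of the Lyapunov exponent for the logistic map. It is not the paper's machine-learning method. For the map x → r·x·(1 − x) with r = 4 the known theoretical value is ln 2; the function name, the starting point, and the iteration counts below are assumptions for illustration.

```python
import math


def logistic_lyapunov(r=4.0, x0=0.2, n_transient=1000, n_steps=100_000):
    """Estimate the Lyapunov exponent of x -> r*x*(1-x) as the orbit average of ln|f'(x)|."""
    x = x0
    # Discard an initial transient so the average is taken on the attractor.
    for _ in range(n_transient):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_steps):
        deriv = abs(r * (1.0 - 2.0 * x))      # |f'(x)| for the logistic map
        total += math.log(max(deriv, 1e-12))  # guard against log(0) at x = 0.5
        x = r * x * (1.0 - x)
    return total / n_steps


if __name__ == "__main__":
    print("estimate:", logistic_lyapunov(), "theory:", math.log(2.0))
```

This estimate needs the map's derivative, which is exactly what is unavailable when only a measured time series is given; that gap is what prediction-error-based approaches aim to fill.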
[AI-205] Lilith: Developmental Modular LLMs with Chemical Signaling
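[Quick Read]: This paper asks how modeling chemical signaling across multiple brain regions could help in understanding the emergence of consciousness. The key to the solution is LILITH, a proposed architecture that combines developmental training of modular language models with brain-inspired token-based communication protocols. Distinct brain regions are modeled as specialized LLM modules that communicate through emergent token-based signaling protocols analogous to neurotransmitter networks, which would enable direct empirical investigation of consciousness emergence.
Link: https://arxiv.org/abs/2507.04575
Authors: Mohid Farooqi, Alejandro Comas-Leon
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 4 pages, 0 figures, position paper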
Abstract:Current paradigms in Artificial Intelligence rely on layers of feedforward networks which model brain activity at the neuronal level. We conjecture that expanding to the level of multiple brain regions with chemical signaling may be a productive step toward understanding the emergence of consciousness. We propose LILITH, a novel architecture that combines developmental training of modular language models with brain-inspired token-based communication protocols, mirroring chemical signaling in the brain. Our approach models distinct brain regions as specialized LLM modules including thinking, memory, sensory, and regulatory components that communicate through emergent token-based signaling protocols analogous to neurotransmitter networks. Unlike traditional pre-trained systems, LILITH would employ developmental training where untrained LLM architectures learn through simulated life experiences, developing communication pathways and cognitive abilities through environmental interaction and evolutionary optimization. This framework would enable direct empirical investigation of consciousness emergence using Integrated Information Theory metrics while providing unprecedented insight into inter-module signaling patterns during development. By optimizing for consciousness emergence rather than task performance, LILITH could provide insight into different emergent phenomena at multiple levels of neural correlates, contrasting neuronal-level processing with multi-region coordination dynamics. The goal of this paper is to put the idea forward while recognizing the substantial challenges in implementing such a system.
[AI-206] The Joys of Categorical Conformal Prediction
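[Quick Read]: This paper addresses the conceptual opacity of conformal prediction (CP) as an uncertainty quantification (UQ) tool and asks how CP, as a general uncertainty representation technique, can be formalized. Its core contribution is a category-theoretic formalization of CP as a morphism, embedded in a commuting diagram, of two newly defined categories. The key results are that, under minimal assumptions, CP is intrinsically a UQ mechanism; that CP bridges (and perhaps subsumes) the Bayesian, frequentist, and imprecise probabilistic approaches to predictive statistical reasoning; and that a conformal prediction region (CPR) is the image of a covariant functor, which is relevant to AI privacy because it implies that locally added privacy noise does not break coverage.
Link: https://arxiv.org/abs/2507.04441
Authors: Michele Caprio
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Category Theory (math.CT)
Comments: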
Abstract:Conformal prediction (CP) is an Uncertainty Representation technique that delivers finite-sample calibrated prediction regions for any underlying Machine Learning model, yet its status as an Uncertainty Quantification (UQ) tool has remained conceptually opaque. We adopt a category-theoretic approach to CP – framing it as a morphism, embedded in a commuting diagram, of two newly-defined categories – that brings us three joys. First, we show that – under minimal assumptions – CP is intrinsically a UQ mechanism, that is, its UQ capabilities are a structural feature of the method. Second, we demonstrate that CP bridges (and perhaps subsumes) the Bayesian, frequentist, and imprecise probabilistic approaches to predictive statistical reasoning. Finally, we show that a conformal prediction region (CPR) is the image of a covariant functor. This observation is relevant to AI privacy: It implies that privacy noise added locally does not break coverage.
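The "finite-sample calibrated prediction regions" mentioned in the abstract can be illustrated with a minimal sketch of split conformal prediction for regression. This is the textbook procedure, not the paper's category-theoretic construction. The synthetic data, the fixed predictor `predict`, and all function names are assumptions for illustration.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: the region is the whole line
    return sorted(scores)[k - 1]


def make_data(n, rng):
    """Synthetic data (assumed): y = 2x + standard Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 1.0)) for x in xs]


def predict(x):
    """Hypothetical pre-trained point predictor."""
    return 2.0 * x


def empirical_coverage(alpha=0.1, n_cal=1000, n_test=2000, seed=0):
    """Calibrate on one sample, then measure how often test labels fall in [pred - q, pred + q]."""
    rng = random.Random(seed)
    cal = make_data(n_cal, rng)
    test = make_data(n_test, rng)
    q = conformal_quantile([abs(y - predict(x)) for x, y in cal], alpha)
    covered = sum(abs(y - predict(x)) <= q for x, y in test)
    return covered / n_test


if __name__ == "__main__":
    print("empirical coverage:", empirical_coverage())
```

The coverage guarantee is marginal and holds for exchangeable data regardless of how good `predict` is; a poor predictor only makes the intervals wider.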
[AI-207] Mission-Aligned Learning-Informed Control of Autonomous Systems: Formulation and Foundations
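[Quick Read]: This paper addresses the safety and reliability of autonomous physical agents, in particular how to achieve more efficient and interpretable decision-making in applications such as robotic care. The key to the solution is a two-level optimization framework that combines control at the lower level with classical planning at the higher level and integrates a capacity for learning, bringing together control, classical planning, and reinforcement learning (RL) to improve safety and interpretability.
Link: https://arxiv.org/abs/2507.04356
Authors: Vyacheslav Kungurtsev, Gustav Sir, Akhil Anand, Sebastien Gros, Haozhe Tian, Homayoun Hamedmoghadam
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: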
Abstract:Research, innovation and practical capital investment have been increasing rapidly toward the realization of autonomous physical agents. This includes industrial and service robots, unmanned aerial vehicles, embedded control devices, and a number of other realizations of cybernetic/mechatronic implementations of intelligent autonomous devices. In this paper, we consider a stylized version of robotic care, which would normally involve a two-level Reinforcement Learning procedure that trains a policy for both lower level physical movement decisions as well as higher level conceptual tasks and their sub-components. In order to deliver greater safety and reliability in the system, we present the general formulation of this as a two-level optimization scheme which incorporates control at the lower level, and classical planning at the higher level, integrated with a capacity for learning. This synergistic integration of multiple methodologies – control, classical planning, and RL – presents an opportunity for greater insight for algorithm development, leading to more efficient and reliable performance. Here, the notion of reliability pertains to physical safety and interpretability into an otherwise black box operation of autonomous agents, concerning users and regulators. This work presents the necessary background and general formulation of the optimization framework, detailing each component and its integration with the others.
[AI-208] Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models
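[Quick Read]: This paper addresses the limited effectiveness of continuous diffusion models on categorical data, building on ratio matching within a continuous-time discrete Markov chain (CTMC) framework as a competitive alternative to autoregressive models in language modeling. The key to the solution is three new theorems on the KL divergence between the data distribution and the learned distribution, which yield an improved upper bound on perplexity, together with ratio matching by minimizing the denoising cross-entropy between clean and corrupted data, which lowers perplexity and speeds up training. The paper also introduces a new CTMC transition-rate matrix that allows prediction refinement and derives an analytic expression for its matrix exponential, which makes computing conditional ratios efficient and further improves training and generation.
Link: https://arxiv.org/abs/2507.04341
Authors: Etrit Haxholli, Yeti Z. Gürbüz, Oğul Can, Eli Waxman
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: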
Abstract:While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching through score-entropy within a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound of the perplexity. Second, we empirically show that ratio-matching performed by minimizing the denoising cross-entropy between the clean and corrupted data enables models to outperform those utilizing score-entropy with up to 10% lower perplexity/generative-perplexity, and 15% faster training steps. To further support our findings, we introduce and evaluate a novel CTMC transition-rate matrix that allows prediction refinement, and derive the analytic expression for its matrix exponential which facilitates the computation of conditional ratios thus enabling efficient training and generation.
[AI-209] Mixed-Sample SGD: an End-to-end Analysis of Supervised Transfer Learning
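[Quick Read]: This paper addresses how to design an efficient stochastic gradient descent (SGD) procedure for supervised transfer learning (STL) that maintains statistical transfer guarantees without prior knowledge of the quality of the source data. The key to the solution is an adaptive sub-sampling mechanism that gains from the source data when it is informative and biases toward the target data when it is not, avoiding negative transfer. The method maintains the desired guarantees by tracking a sequence of constrained convex programs and is instantiated for linear regression with square loss, where the procedure converges at a 1/√T rate to a solution whose statistical performance on the target adapts to the unknown quality of the source.
Link: https://arxiv.org/abs/2507.04194
Authors: Yuyang Deng, Samory Kpotufe
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: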
Abstract:Theoretical works on supervised transfer learning (STL) – where the learner has access to labeled samples from both source and target distributions – have for the most part focused on statistical aspects of the problem, while efficient optimization has received less attention. We consider the problem of designing an SGD procedure for STL that alternates sampling between source and target data, while maintaining statistical transfer guarantees without prior knowledge of the quality of the source data. A main algorithmic difficulty is in understanding how to design such an adaptive sub-sampling mechanism at each SGD step, to automatically gain from the source when it is informative, or bias towards the target and avoid negative transfer when the source is less informative. We show that such a mixed-sample SGD procedure is feasible for general prediction tasks with convex losses, rooted in tracking an abstract sequence of constrained convex programs that serve to maintain the desired transfer guarantees. We instantiate these results in the concrete setting of linear regression with square loss, and show that the procedure converges, with $1/\sqrt{T}$ rate, to a solution whose statistical performance on the target is adaptive to the a priori unknown quality of the source. Experiments with synthetic and real datasets support the theory.
[AI-210] TopoMAS: Large Language Model Driven Topological Materials Multiagent System
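[Quick Read]: This paper addresses the bottleneck that inefficient discovery workflows impose on the cross-scale design of topological materials. The key to the solution is TopoMAS, an interactive human-AI framework that orchestrates the entire materials-discovery pipeline, from user-defined queries and multi-source data retrieval through theoretical inference and crystal-structure generation to first-principles validation. It automatically integrates computational results into a dynamic knowledge graph for continuous knowledge refinement, enabling efficient materials discovery and identification of topological phases.
Link: https://arxiv.org/abs/2507.04053
Authors: Baohua Zhang, Xin Li, Huangchao Xu, Zhong Jin, Quansheng Wu, Ce Li
Affiliation: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments: 13 pages, 7 figures, 3 tables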
Abstract:Topological materials occupy a frontier in condensed-matter physics thanks to their remarkable electronic and quantum properties, yet their cross-scale design remains bottlenecked by inefficient discovery workflows. Here, we introduce TopoMAS (Topological materials Multi-Agent System), an interactive human-AI framework that seamlessly orchestrates the entire materials-discovery pipeline: from user-defined queries and multi-source data retrieval, through theoretical inference and crystal-structure generation, to first-principles validation. Crucially, TopoMAS closes the loop by autonomously integrating computational outcomes into a dynamic knowledge graph, enabling continuous knowledge refinement. In collaboration with human experts, it has already guided the identification of novel topological phases SrSbO3, confirmed by first-principles calculations. Comprehensive benchmarks demonstrate robust adaptability across base Large Language Models, with the lightweight Qwen2.5-72B model achieving 94.55% accuracy while consuming only 74.3-78.4% of tokens required by Qwen3-235B and 83.0% of DeepSeek-V3’s usage, delivering responses twice as fast as Qwen3-235B. This efficiency establishes TopoMAS as an accelerator for computation-driven discovery pipelines. By harmonizing rational agent orchestration with a self-evolving knowledge graph, our framework not only delivers immediate advances in topological materials but also establishes a transferable, extensible paradigm for the materials-science domain.
[AI-211] The role of gain neuromodulation in layer-5 pyramidal neurons
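[Quick Read]: This paper addresses the plasticity-stability dilemma shared by biological and artificial learning systems. The key to the solution lies in layer-5 pyramidal neurons, where cholinergic and noradrenergic projections tune neuronal gain and inhibitory gating to balance the segregation and integration of circuits. To capture this mechanism, the authors develop a two-compartment Izhikevich model for pyramidal neurons and single-compartment somatostatin (SOM) and parvalbumin (PV) interneurons, linked by Gaussian connectivity and spike-timing-dependent plasticity (STDP). Central to the account is the coupling between dendritic plateaus and somatic spikes, which makes these cells gain-tunable amplifiers that translate neuromodulatory cues into flexible cortical activity.
Link: https://arxiv.org/abs/2507.03222
Authors: Alejandro Rodriguez-Garcia, Christopher J. Whyte, Brandon R. Munn, Jie Mei, James M. Shine, Srikanth Ramaswamy
Affiliation: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 7 figures, 1 table, presented at 34th Annual Computational Neuroscience Meeting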
Abstract:Biological and artificial learning systems alike confront the plasticity-stability dilemma. In the brain, neuromodulators such as acetylcholine and noradrenaline relieve this tension by tuning neuronal gain and inhibitory gating, balancing segregation and integration of circuits. Fed by dense cholinergic and noradrenergic projections from the ascending arousal system, layer-5 pyramidal neurons in the cerebral cortex offer a relevant substrate for understanding these dynamics. When distal dendritic signals coincide with back-propagating action potentials, calcium plateaus turn a single somatic spike into a high-gain burst, and interneuron inhibition sculpts the output. These properties make layer-5 cells gain-tunable amplifiers that translate neuromodulatory cues into flexible cortical activity. To capture this mechanism we developed a two-compartment Izhikevich model for pyramidal neurons and single-compartment somatostatin (SOM) and parvalbumin (PV) interneurons, linked by Gaussian connectivity and spike-timing-dependent plasticity (STDP). The soma and apical dendrite are so coupled that somatic spikes back-propagate, while dendritic plateaus can switch the soma from regular firing to bursting by shifting reset and adaptation variables. We show that stronger dendritic drive or tighter coupling raises gain by increasing the likelihood of calcium-triggered somatic bursts. In contrast, dendritic-targeted inhibition suppresses gain, while somatic-targeted inhibition raises the firing threshold of neighboring neurons, thus gating neuronal output. Notably, bursting accelerates STDP, supporting rapid synaptic reconfiguration and […]. This suggests that brief gain pulses driven by neuromodulators could serve as an adaptive two-timescale optimization mechanism, effectively modulating the synaptic weight updates.
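The abstract's remark that shifting reset and adaptation variables can switch a cell from regular firing to bursting can be illustrated with a minimal sketch of the standard single-compartment Izhikevich neuron. This is not the paper's two-compartment model. The parameter sets are the regular-spiking (c = -65, d = 8) and chattering (c = -50, d = 2) values from Izhikevich's 2003 paper; the input current, time step, duration, and the function name `count_spikes` are assumptions for illustration.

```python
def count_spikes(c, d, a=0.02, b=0.2, current=10.0, t_ms=1000.0, dt=0.5):
    """Forward-Euler simulation of a single Izhikevich neuron; returns the number of spikes."""
    v = -65.0   # membrane potential (mV)
    u = b * v   # recovery (adaptation) variable
    spikes = 0
    for _ in range(int(t_ms / dt)):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
        u += dt * a * (b * v - u)
        if v >= 30.0:   # spike: reset v to c and increment the adaptation variable by d
            v = c
            u += d
            spikes += 1
    return spikes


if __name__ == "__main__":
    # Only the reset value c and the adaptation jump d differ between the two runs.
    print("regular spiking:", count_spikes(c=-65.0, d=8.0))
    print("bursting (chattering):", count_spikes(c=-50.0, d=2.0))
```

A higher reset value and a smaller adaptation jump leave the neuron closer to threshold after each spike, so spikes cluster into bursts and the total count rises for the same input.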
[AI-212] Deep Learning Atmospheric Models Reliably Simulate Out-of-Sample Land Heat and Cold Wave Frequencies
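[Quick Read]: This paper asks whether deep learning (DL)-based general circulation models (GCMs) can simulate extreme climate events, such as heatwaves and cold waves, outside their training range. The approach evaluates two DL models, the hybrid Neural General Circulation Model (NGCM) and the purely data-driven Deep Learning Earth System Model (DLESyM), against a conventional high-resolution land-atmosphere model (HiRAM) over the out-of-sample period 1900-1960. Both DL models generalize to unseen climate conditions, reproducing the frequency and spatial patterns of heatwave and cold wave events with skill comparable to HiRAM, although all models perform poorly over portions of North Asia and North America during 1940-1960. DLESyM overestimates event frequencies because of excessive temperature autocorrelation, whereas NGCM shows persistence more similar to HiRAM.
Link: https://arxiv.org/abs/2507.03176
Authors: Zilu Meng, Gregory J. Hakim, Wenchang Yang, Gabriel A. Vecchi
Affiliation: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: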
Abstract:Deep learning (DL)-based general circulation models (GCMs) are emerging as fast simulators, yet their ability to replicate extreme events outside their training range remains unknown. Here, we evaluate two such models – the hybrid Neural General Circulation Model (NGCM) and purely data-driven Deep Learning Earth System Model (DLESyM) – against a conventional high-resolution land-atmosphere model (HiRAM) in simulating land heatwaves and coldwaves. All models are forced with observed sea surface temperatures and sea ice over 1900-2020, focusing on the out-of-sample early-20th-century period (1900-1960). Both DL models generalize successfully to unseen climate conditions, broadly reproducing the frequency and spatial patterns of heatwave and cold wave events during 1900-1960 with skill comparable to HiRAM. An exception is over portions of North Asia and North America, where all models perform poorly during 1940-1960. Due to excessive temperature autocorrelation, DLESyM tends to overestimate heatwave and cold wave frequencies, whereas the physics-DL hybrid NGCM exhibits persistence more similar to HiRAM.
[AI-213] Flow-Through Tensors: A Unified Computational Graph Architecture for Multi-Layer Transportation Network Optimization
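[Quick Read]: This paper addresses the difficulty of integrating the diverse methodologies of modern transportation network modeling, such as sensor-based forecasting, reinforcement learning, classical flow optimization, and demand modeling, which have traditionally been developed in isolation. The key to the solution is the Flow Through Tensors (FTT) framework, which models origin-destination flows, path probabilities, and link travel times as interconnected tensors in a consistent mathematical structure. This enables gradient-based optimization across modeling elements, supports multidimensional analysis of traffic patterns, and uses tensor decomposition to keep large-scale applications computationally tractable.
Link: https://arxiv.org/abs/2507.02961
Authors: Xuesong (Simon) Zhou, Taehooie Kim, Mostafa Ameli, Henan (Bety) Zhu, Yudai Honma, Ram M. Pendyala
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: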
Abstract:Modern transportation network modeling increasingly involves the integration of diverse methodologies including sensor-based forecasting, reinforcement learning, classical flow optimization, and demand modeling that have traditionally been developed in isolation. This paper introduces Flow Through Tensors (FTT), a unified computational graph architecture that connects origin destination flows, path probabilities, and link travel times as interconnected tensors. Our framework makes three key contributions: first, it establishes a consistent mathematical structure that enables gradient-based optimization across previously separate modeling elements; second, it supports multidimensional analysis of traffic patterns over time, space, and user groups with precise quantification of system efficiency; third, it implements tensor decomposition techniques that maintain computational tractability for large scale applications. These innovations collectively enable real time control strategies, efficient coordination between multiple transportation modes and operators, and rigorous enforcement of physical network constraints. The FTT framework bridges the gap between theoretical transportation models and practical deployment needs, providing a foundation for next generation integrated mobility systems.
[AI-214] AuraGenome: An LLM-Powered Framework for On-the-Fly Reusable and Scalable Circular Genome Visualizations
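[Quick Read]: This paper addresses the complex scripting and manual configuration that existing tools require to produce multi-layered circular genome visualizations, which makes the process time-consuming, error-prone, and difficult to learn. The key to the solution is AuraGenome, an LLM-powered framework that combines a semantic-driven multi-agent workflow with an interactive visual analytics system. The workflow uses seven specialized LLM-driven agents, with roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations.
Link: https://arxiv.org/abs/2507.02877
Authors: Chi Zhang, Yu Dong, Yang Wang, Yuetong Han, Guihua Shan, Bixia Tang
Affiliation: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
Comments: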
Abstract:Circular genome visualizations are essential for exploring structural variants and gene regulation. However, existing tools often require complex scripting and manual configuration, making the process time-consuming, error-prone, and difficult to learn. To address these challenges, we introduce AuraGenome, an LLM-powered framework for rapid, reusable, and scalable generation of multi-layered circular genome visualizations. AuraGenome combines a semantic-driven multi-agent workflow with an interactive visual analytics system. The workflow employs seven specialized LLM-driven agents, each assigned distinct roles such as intent recognition, layout planning, and code generation, to transform raw genomic data into tailored visualizations. The system supports multiple coordinated views tailored for genomic data, offering ring, radial, and chord-based layouts to represent multi-layered circular genome visualizations. In addition to enabling interactions and configuration reuse, the system supports real-time refinement and high-quality report export. We validate its effectiveness through two case studies and a comprehensive user study. AuraGenome is available at: this https URL.
[AI-215] Using Large Language Models to Study Mathematical Practice
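[Quick Read]: This paper addresses worries about cherry-picked examples and file-drawer problems in the philosophy of mathematical practice (PMP). The key to the solution is a corpus analysis facilitated by Google's Gemini 2.5 Pro, a model whose reasoning capabilities, hallucination control, and large context window allow accurate analysis of large amounts of text. Applied to a sample of mathematics papers from arXiv.org, the approach yields a non-cherry-picked dataset of annotated examples that gives PMP a more reliable evidence base.
Link: https://arxiv.org/abs/2507.02873
Authors: William D’Alessandro
Affiliation: Unknown
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI)
Comments: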
Abstract:The philosophy of mathematical practice (PMP) looks to evidence from working mathematics to help settle philosophical questions. One prominent program under the PMP banner is the study of explanation in mathematics, which aims to understand what sorts of proofs mathematicians consider explanatory and what role the pursuit of explanation plays in mathematical practice. In an effort to address worries about cherry-picked examples and file-drawer problems in PMP, a handful of authors have recently turned to corpus analysis methods as a promising alternative to small-scale case studies. This paper reports the results from such a corpus study facilitated by Google’s Gemini 2.5 Pro, a model whose reasoning capabilities, advances in hallucination control and large context window allow for the accurate analysis of hundreds of pages of text per query. Based on a sample of 5000 mathematics papers from arXiv.org, the experiments yielded a dataset of hundreds of useful annotated examples. Its aim was to gain insight on questions like the following: How often do mathematicians make claims about explanation in the relevant sense? Do mathematicians’ explanatory practices vary in any noticeable way by subject matter? Which philosophical theories of explanation are most consistent with a large body of non-cherry-picked examples? How might philosophers make further use of AI tools to gain insights from large datasets of this kind? As the first PMP study making extensive use of LLM methods, it also seeks to begin a conversation about these methods as research tools in practice-oriented philosophy and to evaluate the strengths and weaknesses of current models for such work.
Machine Learning
[LG-0] Multi-Disease Deep Learning Framework for GWAS: Beyond Feature Selection Constraints
Link: https://arxiv.org/abs/2507.05247
Authors: Iqra Farooq, Sara Atito, Ayse Demirkan, Inga Prokopenko, Muhammad Rana
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Traditional GWAS has advanced our understanding of complex diseases but often misses nonlinear genetic interactions. Deep learning offers new opportunities to capture complex genomic patterns, yet existing methods mostly depend on feature selection strategies that either constrain analysis to known pathways or risk data leakage when applied across the full dataset. Further, covariates can inflate predictive performance without reflecting true genetic signals. We explore different deep learning architecture choices for GWAS and demonstrate that careful architectural choices can outperform existing methods under strict no-leakage conditions. Building on this, we extend our approach to a multi-label framework that jointly models five diseases, leveraging shared genetic architecture for improved efficiency and discovery. Applied to five million SNPs across 37,000 samples, our method achieves competitive predictive performance (AUC 0.68-0.96), offering a scalable, leakage-free, and biologically meaningful approach for multi-disease GWAS analysis.
[LG-1] Cascade: Token-Sharded Private LLM Inference ICML2025
Link: https://arxiv.org/abs/2507.05228
Authors: Rahul Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum, Arka Pal
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: To be published in ICML 2025 Main Proceedings as “Hidden No More: Attacking and Defending Private Third-Party LLM Inference”, together with arXiv:2505.18332
Abstract:As LLMs continue to increase in parameter size, the computational resources required to run them are available to fewer parties. Therefore, third-party inference services – where LLMs are hosted by third parties with significant computational resources – are becoming increasingly popular. However, third party inference raises critical concerns about user data privacy. To mitigate these risks, privacy researchers have developed provably secure schemes for third-party inference, such as Secure Multi-Party Computation (SMPC). However, SMPC protocols have significant computational and communication overhead, and do not scale to large models. In this work, we propose a new multi-party inference protocol, Cascade, that avoids these punitive costs by leveraging sharding in the sequence dimension to maintain privacy, trading off cryptographic privacy guarantees for increased performance and scalability. We demonstrate that Cascade is resistant to a generalization of a recent attack that is highly effective against other statistical privacy schemes, and that it is further resistant to learning-based attacks. As Cascade is orders of magnitude faster than existing schemes, our findings offer practical solutions for secure deployment of modern state-of-the-art LLMs.
[LG-2] QuEst: Enhancing Estimates of Quantile-Based Distributional Measures Using Model Predictions ICML2025
Link: https://arxiv.org/abs/2507.05220
Authors: Zhun Deng, Thomas P Zollo, Benjamin Eyre, Amogh Inamdar, David Madras, Richard Zemel
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Published as a conference paper at ICML 2025
Abstract:As machine learning models grow increasingly competent, their predictions can supplement scarce or expensive data in various important domains. In support of this paradigm, algorithms have emerged to combine a small amount of high-fidelity observed data with a much larger set of imputed model outputs to estimate some quantity of interest. Yet current hybrid-inference tools target only means or single quantiles, limiting their applicability for many critical domains and use cases. We present QuEst, a principled framework to merge observed and imputed data to deliver point estimates and rigorous confidence intervals for a wide family of quantile-based distributional measures. QuEst covers a range of measures, from tail risk (CVaR) to population segments such as quartiles, that are central to fields such as economics, sociology, education, medicine, and more. We extend QuEst to multidimensional metrics, and introduce an additional optimization technique to further reduce variance in this and other hybrid estimators. We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
[LG-3] A 3D Machine Learning based Volume Of Fluid scheme without explicit interface reconstruction
Link: https://arxiv.org/abs/2507.05218
Authors: Moreno Pintore, Bruno Després
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:We present a machine-learning based Volume Of Fluid method to simulate multi-material flows on three-dimensional domains. One of the novelties of the method is that the flux fraction is computed by evaluating a previously trained neural network and without explicitly reconstructing any local interface approximating the exact one. The network is trained on a purely synthetic dataset generated by randomly sampling numerous local interfaces and which can be adapted to improve the scheme on less regular interfaces when needed. Several strategies to ensure the efficiency of the method and the satisfaction of physical constraints and properties are suggested and formalized. Numerical results on the advection equation are provided to show the performance of the method. We observe numerical convergence as the size of the mesh tends to zero, $h = 1/N_h \searrow 0$, with a better rate than two reference schemes.
[LG-4] Bridging Prediction and Intervention Problems in Social Systems
链接: https://arxiv.org/abs/2507.05216
作者: Lydia T. Liu,Inioluwa Deborah Raji,Angela Zhou,Luke Guerdan,Jessica Hullman,Daniel Malinsky,Bryan Wilder,Simone Zhang,Hammaad Adam,Amanda Coston,Ben Laufer,Ezinne Nwankwo,Michael Zanger-Tishler,Eli Ben-Michael,Solon Barocas,Avi Feller,Marissa Gerchick,Talia Gillis,Shion Guha,Daniel Ho,Lily Hu,Kosuke Imai,Sayash Kapoor,Joshua Loftus,Razieh Nabi,Arvind Narayanan,Ben Recht,Juan Carlos Perdomo,Matthew Salganik,Mark Sendak,Alexander Tolbert,Berk Ustun,Suresh Venkatasubramanian,Angelina Wang,Ashia Wilson
类目: Machine Learning (cs.LG)
Abstract:Many automated decision systems (ADS) are designed to solve prediction problems – where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an interventionist paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.
[LG-5] A generalized Wasserstein-2 distance approach for efficient reconstruction of random field models using stochastic neural networks
链接: https://arxiv.org/abs/2507.05143
作者: Mingtao Xia,Qijing Shen
类目: Machine Learning (cs.LG)
Abstract:In this work, we propose a novel generalized Wasserstein-2 distance approach for efficiently training stochastic neural networks to reconstruct random field models, where the target random variable comprises both continuous and categorical components. We prove that a stochastic neural network can approximate random field models under a Wasserstein-2 distance metric under nonrestrictive conditions. Furthermore, this stochastic neural network can be efficiently trained by minimizing our proposed generalized local squared Wasserstein-2 loss function. We showcase the effectiveness of our proposed approach in various uncertainty quantification tasks, including classification, reconstructing the distribution of mixed random variables, and learning complex noisy dynamical systems from spatiotemporal data.
[LG-6] Hardware-efficient tractable probabilistic inference for TinyML Neurosymbolic AI applications
链接: https://arxiv.org/abs/2507.05141
作者: Jelin Leslin,Martin Trapp,Martin Andraud
类目: Machine Learning (cs.LG); Performance (cs.PF)
Abstract:Neurosymbolic AI (NSAI) has recently emerged to mitigate limitations associated with deep learning (DL) models, e.g., quantifying their uncertainty or reasoning with explicit rules. Hence, TinyML hardware will need to support these symbolic models to bring NSAI to embedded scenarios. Yet, although symbolic models are typically compact, their sparsity and computation resolution contrast with low-resolution and dense neuro models, which is a challenge on resource-constrained TinyML hardware, severely limiting the size of symbolic models that can be computed. In this work, we remove this bottleneck leveraging a tight hardware/software integration to present a complete framework to compute NSAI with TinyML hardware. We focus on symbolic models realized with tractable probabilistic circuits (PCs), a popular subclass of probabilistic models for hardware integration. This framework: (1) trains a specific class of hardware-efficient deterministic PCs, chosen for the symbolic task; (2) compresses this PC until it can be computed on TinyML hardware with minimal accuracy degradation, using our $n^{th}$-root compression technique, and (3) deploys the complete NSAI model on TinyML hardware. Compared to a 64b precision baseline necessary for the PC without compression, our workflow leads to significant hardware reduction on FPGA (up to 82.3% in FF, 52.6% in LUTs, and 18.0% in Flash usage) and an average inference speedup of 4.67x on ESP32 microcontroller.
[LG-7] Deep Learning to Automate Parameter Extraction and Model Fitting of Two-Dimensional Transistors
链接: https://arxiv.org/abs/2507.05134
作者: Robert K. A. Bennett,Jan-Lucas Uslu,Harmon F. Gault,Asir Intisar Khan,Lauren Hoang,Tara Peña,Kathryn Neilson,Young Suh Song,Zhepeng Zhang,Andrew J. Mannix,Eric Pop
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Applied Physics (physics.app-ph)
备注: Main text + supplementary information
Abstract:We present a deep learning approach to extract physical parameters (e.g., mobility, Schottky contact barrier height, defect profiles) of two-dimensional (2D) transistors from electrical measurements, enabling automated parameter extraction and technology computer-aided design (TCAD) fitting. To facilitate this task, we implement a simple data augmentation and pre-training approach by training a secondary neural network to approximate a physics-based device simulator. This method enables high-quality fits after training the neural network on electrical data generated from physics-based simulations of ~500 devices, a factor of 40× fewer than other recent efforts. Consequently, fitting can be achieved by training on physically rigorous TCAD models, including complex geometry, self-consistent transport, and electrostatic effects, and is not limited to computationally inexpensive compact models. We apply our approach to reverse-engineer key parameters from experimental monolayer WS$_2$ transistors, achieving a median coefficient of determination ($R^2$) = 0.99 when fitting measured electrical data. We also demonstrate that this approach generalizes and scales well by reverse-engineering electrical data on high-electron-mobility transistors while fitting 35 parameters simultaneously. To facilitate future research on deep learning approaches for inverse transistor design, we have published our code and sample data sets online.
[LG-8] Kronecker-factored Approximate Curvature (KFAC) From Scratch
链接: https://arxiv.org/abs/2507.05127
作者: Felix Dangel,Bálint Mucsányi,Tobias Weber,Runa Eschenhagen
类目: Machine Learning (cs.LG)
Abstract:Kronecker-factored approximate curvature (KFAC) is arguably one of the most prominent curvature approximations in deep learning. Its applications range from optimization to Bayesian deep learning, training data attribution with influence functions, and model compression or merging. While the intuition behind KFAC is easy to understand, its implementation is tedious: It comes in many flavours, has common pitfalls when translating the math to code, and is challenging to test, which complicates ensuring a properly functioning implementation. Some of the authors themselves have dealt with these challenges and experienced the discomfort of not being able to fully test their code. Thanks to recent advances in understanding KFAC, we are now able to provide test cases and a recipe for a reliable KFAC implementation. This tutorial is meant as a ground-up introduction to KFAC. In contrast to the existing work, our focus lies on providing both math and code side-by-side and providing test cases based on the latest insights into KFAC that are scattered throughout the literature. We hope this tutorial provides a contemporary view of KFAC that allows beginners to gain a deeper understanding of this curvature approximation while lowering the barrier to its implementation, extension, and usage in practice.
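One of the pitfalls in translating the math to code is the vectorization convention behind the Kronecker product. The sketch below is not taken from the tutorial. It assumes column-stacking `vec`, an input-side factor `A` and an output-gradient-side factor `G`, and checks the standard identity (A ⊗ G) vec(V) = vec(G V Aᵀ), which lets a Kronecker-factored matrix be applied without ever forming it. All helper names are hypothetical.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def kron(A, G):
    """Kronecker product A ⊗ G as a nested list."""
    return [[a * g for a in row_a for g in row_g]
            for row_a in A for row_g in G]

def vec(X):
    """Column-stacking vectorization (an assumed convention)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def outer_mean(vectors):
    """(1/N) * sum_n v_n v_n^T, the general form of each Kronecker factor."""
    n, d = len(vectors), len(vectors[0])
    return [[sum(v[i] * v[j] for v in vectors) / n for j in range(d)]
            for i in range(d)]

def kron_matvec_naive(A, G, V):
    """Form A ⊗ G explicitly, then multiply with vec(V)."""
    K, v = kron(A, G), vec(V)
    return [sum(row[c] * v[c] for c in range(len(v))) for row in K]

def kron_matvec_fast(A, G, V):
    """Same result via vec(G V A^T), without building the big matrix."""
    return vec(matmul(matmul(G, V), transpose(A)))
```

Comparing the naive and fast paths on small integer matrices is a cheap unit test. With a row-stacking convention, the roles of `A` and `G` swap.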
[LG-9] CLIP-Guided Backdoor Defense through Entropy-Based Poisoned Dataset Separation
链接: https://arxiv.org/abs/2507.05113
作者: Binyan Xu,Fan Yang,Xilin Dai,Di Tang,Kehuan Zhang
类目: Multimedia (cs.MM); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 15 pages, 9 figures, 15 tables. To appear in the Proceedings of the 32nd ACM International Conference on Multimedia (MM '25)
Abstract:Deep Neural Networks (DNNs) are susceptible to backdoor attacks, where adversaries poison training data to implant backdoor into the victim model. Current backdoor defenses on poisoned data often suffer from high computational costs or low effectiveness against advanced attacks like clean-label and clean-image backdoors. To address them, we introduce CLIP-Guided backdoor Defense (CGD), an efficient and effective method that mitigates various backdoor attacks. CGD utilizes a publicly accessible CLIP model to identify inputs that are likely to be clean or poisoned. It then retrains the model with these inputs, using CLIP’s logits as a guidance to effectively neutralize the backdoor. Experiments on 4 datasets and 11 attack types demonstrate that CGD reduces attack success rates (ASRs) to below 1% while maintaining clean accuracy (CA) with a maximum drop of only 0.3%, outperforming existing defenses. Additionally, we show that clean-data-based defenses can be adapted to poisoned data using CGD. Also, CGD exhibits strong robustness, maintaining low ASRs even when employing a weaker CLIP model or when CLIP itself is compromised by a backdoor. These findings underscore CGD’s exceptional efficiency, effectiveness, and applicability for real-world backdoor defense scenarios. Code: this https URL.
[LG-10] DICE: Discrete inverse continuity equation for learning population dynamics
链接: https://arxiv.org/abs/2507.05107
作者: Tobias Blickhan,Jules Berman,Andrew Stuart,Benjamin Peherstorfer
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
Abstract:We introduce the Discrete Inverse Continuity Equation (DICE) method, a generative modeling approach that learns the evolution of a stochastic process from given sample populations at a finite number of time points. Models learned with DICE capture the typically smooth and well-behaved population dynamics, rather than the dynamics of individual sample trajectories that can exhibit complex or even chaotic behavior. The DICE loss function is developed specifically to be invariant, even in discrete time, to spatially constant but time-varying spurious constants that can emerge during training; this invariance increases training stability and robustness. Generating a trajectory of sample populations with DICE is fast because samples evolve directly in the time interval over which the stochastic process is formulated, in contrast to approaches that condition on time and then require multiple sampling steps per time step. DICE is stable to train, in situations where other methods for learning population dynamics fail, and DICE generates representative samples with orders of magnitude lower costs than methods that have to condition on time. Numerical experiments on a wide range of problems from random waves, Vlasov-Poisson instabilities and high-dimensional chaos are included to justify these assertions.
[LG-11] Exploring Semantic Clustering and Similarity Search for Heterogeneous Traffic Scenario Graph
链接: https://arxiv.org/abs/2507.05086
作者: Ferdinand Mütsch,Maximilian Zipfl,Nikolai Polley,J. Marius Zöllner
类目: Machine Learning (cs.LG)
备注: accepted in the IEEE IAVVC 2025 conference
Abstract:Scenario-based testing is an indispensable instrument for the comprehensive validation and verification of automated vehicles (AVs). However, finding a manageable and finite, yet representative subset of scenarios in a scalable, possibly unsupervised manner is notoriously challenging. Our work is meant to constitute a cornerstone to facilitate sample-efficient testing, while still capturing the diversity of relevant operational design domains (ODDs) and accounting for the “long tail” phenomenon in particular. To this end, we first propose an expressive and flexible heterogeneous, spatio-temporal graph model for representing traffic scenarios. Leveraging recent advances of graph neural networks (GNNs), we then propose a self-supervised method to learn a universal embedding space for scenario graphs that enables clustering and similarity search. In particular, we implement contrastive learning alongside a bootstrapping-based approach and evaluate their suitability for partitioning the scenario space. Experiments on the nuPlan dataset confirm the model’s ability to capture semantics and thus group related scenarios in a meaningful way despite the absence of discrete class labels. Different scenario types materialize as distinct clusters. Our results demonstrate how variable-length traffic scenarios can be condensed into single vector representations that enable nearest-neighbor retrieval of representative candidates for distinct scenario categories. Notably, this is achieved without manual labeling or bias towards an explicit objective such as criticality. Ultimately, our approach can serve as a basis for scalable selection of scenarios to further enhance the efficiency and robustness of testing AVs in simulation.
[LG-12] Distribution-dependent Generalization Bounds for Tuning Linear Regression Across Tasks
链接: https://arxiv.org/abs/2507.05084
作者: Maria-Florina Balcan,Saumya Goyal,Dravyansh Sharma
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 49 pages
Abstract:Modern regression problems often involve high-dimensional data and a careful tuning of the regularization hyperparameters is crucial to avoid overly complex models that may overfit the training data while guaranteeing desirable properties like effective variable selection. We study the recently introduced direction of tuning regularization hyperparameters in linear regression across multiple related tasks. We obtain distribution-dependent bounds on the generalization error for the validation loss when tuning the L1 and L2 coefficients, including ridge, lasso and the elastic net. In contrast, prior work develops bounds that apply uniformly to all distributions, but such bounds necessarily degrade with feature dimension, d. While these bounds are shown to be tight for worst-case distributions, our bounds improve with the “niceness” of the data distribution. Concretely, we show that under additional assumptions that instances within each task are i.i.d. draws from broad well-studied classes of distributions including sub-Gaussians, our generalization bounds do not get worse with increasing d, and are much sharper than prior work for very large d. We also extend our results to a generalization of ridge regression, where we achieve tighter bounds that take into account an estimate of the mean of the ground truth distribution.
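The setting of tuning one regularization coefficient across several related tasks can be illustrated with a toy sketch. This is not the paper's procedure or its bounds: it assumes a single feature (so ridge has a scalar closed form), a finite grid of candidate values, and hypothetical helper names, and it simply picks the L2 coefficient with the lowest average validation loss over tasks.

```python
def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge solution for one feature: w = <x, y> / (<x, x> + lam).

    Assumes the denominator is non-zero (lam > 0 or some non-zero feature).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def validation_mse(w, xs, ys):
    """Mean squared error of the scalar model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def tune_lambda(tasks, grid):
    """Return the grid value minimizing the average validation loss.

    tasks: list of (train_x, train_y, val_x, val_y) tuples, one per task.
    """
    def average_loss(lam):
        losses = [validation_mse(ridge_fit_1d(tx, ty, lam), vx, vy)
                  for tx, ty, vx, vy in tasks]
        return sum(losses) / len(losses)
    return min(grid, key=average_loss)
```

The generalization question studied in the paper is how far this average validation loss, measured on sampled tasks, can sit from its expectation over new tasks.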
[LG-13] Beyond Scaling Curves: Internal Dynamics of Neural Networks Through the NTK Lens
链接: https://arxiv.org/abs/2507.05035
作者: Konstantin Nikolaou,Sven Krippendorf,Samuel Tovey,Christian Holm
类目: Machine Learning (cs.LG)
Abstract:Scaling laws offer valuable insights into the relationship between neural network performance and computational cost, yet their underlying mechanisms remain poorly understood. In this work, we empirically analyze how neural networks behave under data and model scaling through the lens of the neural tangent kernel (NTK). This analysis establishes a link between performance scaling and the internal dynamics of neural networks. Our findings on standard vision tasks show that similar performance scaling exponents can occur even though the internal model dynamics show opposite behavior. This demonstrates that performance scaling alone is insufficient for understanding the underlying mechanisms of neural networks. We also address a previously unresolved issue in neural scaling: how convergence to the infinite-width limit affects scaling behavior in finite-width models. To this end, we investigate how feature learning is lost as the model width increases and quantify the transition between kernel-driven and feature-driven scaling regimes. We identify the maximum model width that supports feature learning, which, in our setups, we find to be more than ten times smaller than typical large language model widths.
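For readers unfamiliar with the NTK, the empirical kernel between two inputs is the inner product of the model's parameter gradients at those inputs. The sketch below computes it for a toy scalar network f(x) = Σⱼ aⱼ·tanh(wⱼ·x) with hand-derived gradients. The architecture and function names are assumptions chosen to keep the example dependency-free, not the models analyzed in the paper.

```python
import math

def param_gradient(a, w, x):
    """Gradient of f(x) = sum_j a[j] * tanh(w[j] * x) w.r.t. (a, w)."""
    grad_a = [math.tanh(wj * x) for wj in w]
    grad_w = [aj * (1.0 - math.tanh(wj * x) ** 2) * x for aj, wj in zip(a, w)]
    return grad_a + grad_w

def empirical_ntk(a, w, inputs):
    """Gram matrix K[i][j] = <grad f(x_i), grad f(x_j)> over all parameters."""
    grads = [param_gradient(a, w, x) for x in inputs]
    return [[sum(gi * gj for gi, gj in zip(g1, g2)) for g2 in grads]
            for g1 in grads]
```

Tracking how this matrix changes during training is one common way to distinguish a kernel-like regime (nearly constant kernel) from a feature-learning regime.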
[LG-14] The Case for Instance-Optimized LLMs in OLAP Databases
链接: https://arxiv.org/abs/2507.04967
作者: Bardia Mohammadi,Laurent Bindschaedler
类目: Databases (cs.DB); Machine Learning (cs.LG)
Abstract:Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation capabilities. However, deploying LLMs at scale – processing millions to billions of rows – remains prohibitively expensive in computation and memory. We present IOLM-DB, a novel system that makes LLM-enhanced database queries practical through query-specific model optimization. Instead of using general-purpose LLMs, IOLM-DB generates lightweight, specialized models tailored to each query’s specific needs using representative data samples. IOLM-DB reduces model footprints by up to 76% and increases throughput by up to 3.31× while maintaining accuracy through aggressive compression techniques, including quantization, sparsification, and structural pruning. We further show how our approach enables higher parallelism on existing hardware and seamlessly supports caching and batching strategies to reduce overheads. Our prototype demonstrates that leveraging LLM queries inside analytics systems is feasible at scale, opening new possibilities for future OLAP applications.
[LG-15] When do World Models Successfully Learn Dynamical Systems?
链接: https://arxiv.org/abs/2507.04898
作者: Edmund Ross,Claudia Drygala,Leonhard Schwarz,Samir Kaiser,Francesca di Mare,Tobias Breiten,Hanno Gottschalk
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Abstract:In this work, we explore the use of compact latent representations with learned time dynamics (‘World Models’) to simulate physical systems. Drawing on concepts from control theory, we propose a theoretical framework that explains why projecting time slices into a low-dimensional space and then concatenating to form a history (‘Tokenization’) is so effective at learning physics datasets, and characterise when exactly the underlying dynamics admit a reconstruction mapping from the history of previous tokenized frames to the next. To validate these claims, we develop a sequence of models with increasing complexity, starting with least-squares regression and progressing through simple linear layers, shallow adversarial learners, and ultimately full-scale generative adversarial networks (GANs). We evaluate these models on a variety of datasets, including modified forms of the heat and wave equations, the chaotic regime 2D Kuramoto-Sivashinsky equation, and a challenging computational fluid dynamics (CFD) dataset of a 2D Kármán vortex street around a fixed cylinder, where our model is successfully able to recreate the flow.
[LG-16] Fine-tuning on simulated data outperforms prompting for agent tone of voice
链接: https://arxiv.org/abs/2507.04889
作者: Ingo Marquardt,Philippe Brule
类目: Machine Learning (cs.LG)
备注: 22 pages, 5 figures, 6 tables
Abstract:Deploying language models (LMs) in customer-facing speech applications requires conversational fluency and adherence to specific stylistic guidelines. This can be challenging to achieve reliably using complex system prompts due to issues like instruction following limitations and in-context bias. This study investigates the effectiveness of fine-tuning versus system prompting for aligning LMs with a specific behavioral target: responding in a natural, conversational tone suitable for voice interactions. We fine-tuned a small, open-weights model (Llama3.2-1B-Instruct) using Low-Rank Adaptation (LoRA) on a synthetically generated dataset derived from Wikipedia. Additionally, we fine-tuned two closed-source models (gpt-4o-mini, gpt-4.1-mini). Our results demonstrate that fine-tuning outperformed system prompting, achieving a high percentage of conversational responses, even when trained on only 100 data samples. Semantic similarity analysis confirmed that fine-tuning did not degrade content quality. Interestingly, fine-tuning with 8-bit integer quantization converged faster towards the target style than using bfloat16 precision, potentially due to implicit regularization effects. We conclude that fine-tuning small, open-weights LMs on simulated data is a highly effective and data-efficient method for instilling specific stylistic behaviors, offering a preferable alternative to complex system prompting for practical applications requiring nuanced response styles.
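As background on the LoRA technique named in this abstract: the pretrained weight W stays frozen and only a low-rank update is trained, giving an effective weight W + (alpha / r)·B·A. The sketch below shows that arithmetic on nested lists. The function names, shapes, and scaling convention are assumptions for illustration and do not reflect the authors' training setup.

```python
def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * B @ A.

    W: d_out x d_in frozen weight, B: d_out x r, A: r x d_in (r = rank).
    """
    r = len(A)
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in the adapter: r * (d_in + d_out)."""
    return r * (d_in + d_out)
```

Initializing `B` to zeros leaves the effective weight equal to `W`, so training starts from the pretrained model's behavior. The small adapter size is part of why fine-tuning on only a few hundred samples is practical.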
[LG-17] NTSFormer: A Self-Teaching Graph Transformer for Multimodal Cold-Start Node Classification
链接: https://arxiv.org/abs/2507.04870
作者: Jun Hu,Yufei He,Yuan Li,Bryan Hooi,Bingsheng He
类目: Machine Learning (cs.LG)
Abstract:Cold-start node classification on multimodal graphs is challenging because cold-start nodes are isolated (i.e., no edges) and often have missing modalities (e.g., absent text or image features). Existing methods address structural isolation by degrading graph learning models to MLPs for cold-start inference, using a teacher model (with graph access) to guide the MLP. However, this results in limited model capacity in the student, which is further challenged when modalities are missing. In this paper, we propose Neighbor-to-Self Graph Transformer (NTSFormer), a unified Graph Transformer framework that jointly tackles the isolation and missing-modality issues via a self-teaching paradigm. Specifically, NTSFormer uses a cold-start attention mask to simultaneously make two predictions for each node: a “student” prediction based only on self-information (i.e., the node’s own features), and a “teacher” prediction incorporating both self and neighbor information. This enables the model to supervise itself without degrading to an MLP, thereby fully leveraging the Transformer’s capacity to handle missing modalities. To handle diverse graph information and missing modalities, NTSFormer performs a one-time multimodal graph pre-computation that converts structural and feature data into token sequences, which are then processed by a Mixture-of-Experts (MoE) Input Projection and Transformer layers for effective fusion. Experimental results on public datasets show that NTSFormer achieves superior performance on multimodal cold-start node classification tasks.
[LG-18] Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation
链接: https://arxiv.org/abs/2507.04864
作者: Alexander Fichtinger,Jan Schlüter,Gerhard Widmer
类目: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted at SMC 2025. Code at this https URL
Abstract:Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model. In this work, we explore its application in the audio domain as a tool for data augmentation or content manipulation. Specifically, implementing Boomerang sampling for Stable Audio Open, we augment training data for a state-of-the-art beat tracker, and attempt to replace musical instruments in recordings. Our results show that the rhythmic structure of existing examples is mostly preserved, that it improves performance of the beat tracker, but only in scenarios of limited training data, and that it can accomplish text-based instrument replacement on monophonic inputs. We publish our implementation to invite experiments on data augmentation in other tasks and explore further applications.
[LG-19] Discrete Diffusion Trajectory Alignment via Stepwise Decomposition
链接: https://arxiv.org/abs/2507.04832
作者: Jiaqi Han,Austin Wang,Minkai Xu,Wenda Chu,Meihua Dang,Yisong Yue,Stefano Ermon
类目: Machine Learning (cs.LG)
备注: 22 pages, 3 figures
Abstract:Discrete diffusion models have demonstrated great promise in modeling various sequence data, ranging from human language to biological sequences. Inspired by the success of RL in language models, there is growing interest in further improving the models by alignment with a certain reward. In this work, we propose a novel preference optimization method for masked discrete diffusion models through a principled diffusion trajectory alignment. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. This framework enables efficient diffusion optimization, is compatible with arbitrary reward functions, and importantly, guarantees an equivalent optimal solution under additive factorization of the trajectory reward. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach. Notably, it achieves an up to 12% improvement over the most competitive RL-based baseline in terms of predicted activity on DNA sequence design, and further improves the GSM8K score from 78.6 to 80.7 on LLaDA-8B-Instruct for language modeling.
[LG-20] A High-Level Compiler Integration Approach for Deep Learning Accelerators Supporting Abstraction and Optimization
链接: https://arxiv.org/abs/2507.04828
作者: Samira Ahmadifarsani,Daniel Mueller-Gritschneder,Ulf Schlichtmann
类目: Machine Learning (cs.LG)
Abstract:The growing adoption of domain-specific architectures in edge computing platforms for deep learning has highlighted the efficiency of hardware accelerators. However, integrating custom accelerators into modern machine learning (ML) compilers remains a complex challenge due to the need for significant modifications in compilation layers and specialized scheduling techniques. Existing frameworks offer partial solutions and require users to navigate intricate compiler internals. In this paper, we introduce a TVM-based compilation integration approach that targets GEMM-based deep learning accelerators. Our approach abstracts the complexities of compiler integration, enabling seamless integration of accelerators without requiring in-depth knowledge of the underlying compiler. Furthermore, we extend and incorporate design space exploration tools, specifically CoSA, to automate efficient tensor scheduling, accounting for factors such as uneven mapping and double buffering. Our framework is benchmarked on the Gemmini accelerator, demonstrating performance comparable to its specialized manually implemented toolchain.
[LG-21] Kalman Filter Aided Federated Koopman Learning
链接: https://arxiv.org/abs/2507.04808
作者: Yutao Chen,Wei Chen
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
Abstract:Real-time control and estimation are pivotal for applications such as industrial automation and future healthcare. The realization of this vision relies heavily on efficient interactions with nonlinear systems. Therefore, Koopman learning, which leverages the power of deep learning to linearize nonlinear systems, has been one of the most successful examples of mitigating the complexity inherent in nonlinearity. However, the existing literature assumes access to accurate system states and abundant high-quality data for Koopman analysis, which is usually impractical in real-world scenarios. To fill this void, this paper considers the case where only observations of the system are available and where the observation data is insufficient to accomplish an independent Koopman analysis. To this end, we propose Kalman Filter aided Federated Koopman Learning (KF-FedKL), which pioneers the combination of Kalman filtering and federated learning with Koopman analysis. By doing so, we can achieve collaborative linearization with privacy guarantees. Specifically, we employ a straightforward yet efficient loss function to drive the training of a deep Koopman network for linearization. To obtain system information devoid of individual information from observation data, we leverage the unscented Kalman filter and the unscented Rauch-Tung-Striebel smoother. To achieve collaboration between clients, we adopt the federated learning framework and develop a modified FedAvg algorithm to orchestrate the collaboration. A convergence analysis of the proposed framework is also presented. Finally, through extensive numerical simulations, we showcase the performance of KF-FedKL under various situations.
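The collaboration step builds on FedAvg. As a reference point, here is a minimal sketch of the standard, unmodified FedAvg aggregation: a sample-size-weighted average of client parameter vectors. The paper's modified variant is not described in the abstract and is not reproduced here; the function name and the flat-list parameter layout are assumptions.

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors.

    client_params: list of equal-length lists of floats, one per client.
    client_sizes: number of local samples per client, used as weights.
    """
    if len(client_params) != len(client_sizes):
        raise ValueError("one size per client is required")
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(dim)]
```

Only parameters leave each client in this scheme. Raw observations stay local, which is the sense in which federated aggregation supports privacy.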
[LG-22] Machine Learning from Explanations ICML2025
链接: https://arxiv.org/abs/2507.04788
作者: Jiashu Tao,Reza Shokri
类目: Machine Learning (cs.LG)
备注: ICML 2025 AIW
Abstract:Acquiring and training on large-scale labeled data can be impractical due to cost constraints. Additionally, the use of small training datasets can result in considerable variability in model outcomes, overfitting, and learning of spurious correlations. A crucial shortcoming of data labels is their lack of any reasoning behind a specific label assignment, causing models to learn any arbitrary classification rule as long as it aligns data with labels. To overcome these issues, we introduce an innovative approach for training reliable classification models on smaller datasets, by using simple explanation signals such as important input features from labeled data. Our method centers around a two-stage training cycle that alternates between enhancing model prediction accuracy and refining its attention to match the explanations. This instructs models to grasp the rationale behind label assignments during their learning phase. We demonstrate that our training cycle expedites the convergence towards more accurate and reliable models, particularly for small, class-imbalanced training data, or data with spurious features.
[LG-23] FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift
链接: https://arxiv.org/abs/2507.04781
作者: Yong Zhang,Feng Liang,Guanghu Yuan,Min Yang,Chengming Li,Xiping Hu
类目: Machine Learning (cs.LG)
备注: 10 pages, 6 figures, and 1 table
Abstract:Federated learning (FL) enables collaborative training of a global model in the centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model, thereby affecting personalized local models. Among various cases of data heterogeneity, feature drift, feature space difference among parties, is prevalent in real-life data but remains largely unexplored. Feature drift can distract feature extraction learning in clients and thus lead to poor feature extraction and classification performance. To tackle the problem of feature drift in FL, we propose FedPall, an FL framework that utilizes prototype-based adversarial learning to unify feature spaces and collaborative learning to reinforce class information within the features. Moreover, FedPall leverages mixed features generated from global prototypes and local features to enhance the global classifier with classification-relevant information from a global perspective. Evaluation results on three representative feature-drifted datasets demonstrate FedPall’s consistently superior performance in classification with feature-drifted data in the FL scenario.
[LG-24] Improving BERT for Symbolic Music Understanding Using Token Denoising and Pianoroll Prediction
Link: https://arxiv.org/abs/2507.04776
Authors: Jun-You Wang,Li Su
Categories: Sound (cs.SD); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025
Abstract:We propose a pre-trained BERT-like model for symbolic music understanding that achieves competitive performance across a wide range of downstream tasks. To achieve this target, we design two novel pre-training objectives, namely token correction and pianoroll prediction. First, we sample a portion of note tokens and corrupt them with a limited amount of noise, and then train the model to denoise the corrupted tokens; second, we also train the model to predict bar-level and local pianoroll-derived representations from the corrupted note tokens. We argue that these objectives guide the model to better learn specific musical knowledge such as pitch intervals. For evaluation, we propose a benchmark that incorporates 12 downstream tasks ranging from chord estimation to symbolic genre classification. Results confirm the effectiveness of the proposed pre-training objectives on downstream tasks.
[LG-25] Efficient Unlearning with Privacy Guarantees
Link: https://arxiv.org/abs/2507.04771
Authors: Josep Domingo-Ferrer,Najeeb Jebreel,David Sánchez
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Abstract:Privacy protection laws, such as the GDPR, grant individuals the right to request the forgetting of their personal data not only from databases but also from machine learning (ML) models trained on them. Machine unlearning has emerged as a practical means to facilitate model forgetting of data instances seen during training. Although some existing machine unlearning methods guarantee exact forgetting, they are typically costly in computational terms. On the other hand, more affordable methods do not offer forgetting guarantees and are applicable only to specific ML models. In this paper, we present efficient unlearning with privacy guarantees (EUPG), a novel machine unlearning framework that offers formal privacy guarantees to individuals whose data are being unlearned. EUPG involves pre-training ML models on data protected using privacy models, and it enables efficient unlearning with the privacy guarantees offered by the privacy models in use. Through empirical evaluation on four heterogeneous data sets protected with $k$-anonymity and $\epsilon$-differential privacy as privacy models, our approach demonstrates utility and forgetting effectiveness comparable to those of exact unlearning methods, while significantly reducing computational and storage costs. Our code is publicly available.
[LG-26] CueLearner: Bootstrapping and local policy adaptation from relative feedback IROS2025
Link: https://arxiv.org/abs/2507.04730
Authors: Giulio Schiavi,Andrei Cramariuc,Lionel Ott,Roland Siegwart
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: Accepted to IROS 2025
Abstract:Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as “more to the left”) offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user’s preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.
[LG-27] Spooky Action at a Distance: Normalization Layers Enable Side-Channel Spatial Communication
Link: https://arxiv.org/abs/2507.04709
Authors: Samuel Pfrommer,George Ma,Yixiao Huang,Somayeh Sojoudi
Categories: Machine Learning (cs.LG)
Abstract:This work shows that normalization layers can facilitate a surprising degree of communication across the spatial dimensions of an input tensor. We study a toy localization task with a convolutional architecture and show that normalization layers enable an iterative message passing procedure, allowing information aggregation from well outside the local receptive field. Our results suggest that normalization layers should be employed with caution in applications such as diffusion-based trajectory generation, where maintaining a spatially limited receptive field is crucial.
[LG-28] Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation
Link: https://arxiv.org/abs/2507.04697
Authors: Daichi Mukunoki,Shun-ichiro Hayashi,Tetsuya Hoshino,Takahiro Katagiri
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Mathematical Software (cs.MS)
Comments: 8 pages, 6 tables
Abstract:Generative AI technology based on Large Language Models (LLMs) has been developed and applied to assist with or automate the generation of program code. In this paper, we evaluate the capability of existing general LLMs for Basic Linear Algebra Subprograms (BLAS) code generation for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of reasoning models. Both were released in April 2025. For the routines from level-1 to level-3 BLAS, we tried to generate (1) C code without optimization from the routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from the routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only the routine name is given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the generated code is faster than the reference code.
[LG-29] Interpretable Reward Modeling with Active Concept Bottlenecks
Link: https://arxiv.org/abs/2507.04695
Authors: Sonia Laguna,Katarzyna Kobalczyk,Julia E. Vogt,Mihaela Van der Schaar
Categories: Machine Learning (cs.LG)
Abstract:We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning through selective concept annotation. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. To make this framework efficient in low-supervision settings, we formalize an active learning strategy that dynamically acquires the most informative concept labels. We propose an acquisition function based on Expected Information Gain and show that it significantly accelerates concept learning without compromising preference accuracy. Evaluated on the UltraFeedback dataset, our method outperforms baselines in interpretability and sample efficiency, marking a step towards more transparent, auditable, and human-aligned reward models.
[LG-30] Recovering Plasticity of Neural Networks via Soft Weight Rescaling
Link: https://arxiv.org/abs/2507.04683
Authors: Seungwon Oh,Sangyeon Park,Isaac Han,Kyung-Joong Kim
Categories: Machine Learning (cs.LG)
Abstract:Recent studies have shown that as training progresses, neural networks gradually lose their capacity to learn new information, a phenomenon known as plasticity loss. Unbounded weight growth is one of the main causes of plasticity loss. Furthermore, it harms generalization capability and disrupts optimization dynamics. Re-initializing the network can be a solution, but it results in the loss of learned information, leading to performance drops. In this paper, we propose Soft Weight Rescaling (SWR), a novel approach that prevents unbounded weight growth without losing information. SWR recovers the plasticity of the network by simply scaling down the weights at each step of the learning process. We theoretically prove that SWR bounds weight magnitude and balances weight magnitude between layers. Our experiments show that SWR improves performance on warm-start learning, continual learning, and single-task learning setups on standard image classification benchmarks.
[LG-31] Operator-based machine learning framework for generalizable prediction of unsteady treatment dynamics in stormwater infrastructure
Link: https://arxiv.org/abs/2507.04682
Authors: Mohamed Shatarah,Kai Liu,Haochen Li
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 9 figures
Abstract:Stormwater infrastructures are decentralized urban water-management systems that face highly unsteady hydraulic and pollutant loadings from episodic rainfall-runoff events. Accurately evaluating their in-situ treatment performance is essential for cost-effective design and planning. Traditional lumped dynamic models (e.g., continuously stirred tank reactor, CSTR) are computationally efficient but oversimplify transport and reaction processes, limiting predictive accuracy and insight. Computational fluid dynamics (CFD) resolves detailed turbulent transport and pollutant fate physics but incurs prohibitive computational cost for unsteady and long-term simulations. To address these limitations, this study develops a composite operator-based neural network (CPNN) framework that leverages state-of-the-art operator learning to predict the spatial and temporal dynamics of hydraulics and particulate matter (PM) in stormwater treatment. The framework is demonstrated on a hydrodynamic separator (HS), a common urban treatment device. Results indicate that the CPNN achieves $R^2 > 0.8$ for hydraulic predictions in 95.2% of test cases; for PM concentration predictions, $R^2 > 0.8$ in 72.6% of cases and $0.4 < R^2 < 0.8$ in 22.6%. The analysis identifies challenges in capturing dynamics under extreme low-flow conditions, owing to their lower contribution to the training loss. Exploiting the automatic-differentiation capability of the CPNN, sensitivity analyses quantify the influence of storm event loading on PM transport. Finally, the potential of the CPNN framework for continuous, long-term evaluation of stormwater infrastructure performance is discussed, marking a step toward robust, climate-aware planning and implementation.
[LG-32] Hybrid Adversarial Spectral Loss Conditional Generative Adversarial Networks for Signal Data Augmentation in Ultra-precision Machining Surface Roughness Prediction
Link: https://arxiv.org/abs/2507.04665
Authors: Suiyan Shang,Chi Fai Cheung,Pai Zheng
Categories: Machine Learning (cs.LG)
Comments: 10 pages, 3 figures
Abstract:Accurate surface roughness prediction in ultra-precision machining (UPM) is critical for real-time quality control, but small datasets hinder model performance. We propose HAS-CGAN, a Hybrid Adversarial Spectral Loss CGAN, for effective UPM data augmentation. Among five CGAN variants tested, HAS-CGAN excels in 1D force signal generation, particularly for high-frequency signals, achieving 0.85 wavelet coherence through Fourier-domain optimization. By combining generated signals with machining parameters, prediction accuracy significantly improves. Experiments with traditional ML (SVR, RF, LSTM) and deep learning models (BPNN, 1DCNN, CNN-Transformer) demonstrate that augmenting training data with 520+ synthetic samples reduces prediction error from 31.4% (original 52 samples) to ~9%, effectively addressing data scarcity in UPM roughness prediction.
[LG-33] A Cycle-Consistency Constrained Framework for Dynamic Solution Space Reduction in Noninjective Regression
Link: https://arxiv.org/abs/2507.04659
Authors: Hanzhang Jia,Yi Gao
Categories: Machine Learning (cs.LG)
Abstract:To address the challenges posed by the heavy reliance of multi-output models on preset probability distributions and embedded prior knowledge in non-injective regression tasks, this paper proposes a cycle consistency-based data-driven training framework. The method jointly optimizes a forward model $\Phi: X \to Y$ and a backward model $\Psi: Y \to X$, where the cycle consistency loss is defined as $L_{\text{cycle}_b} = L(Y - \Phi(\Psi(Y)))$ (and vice versa). By minimizing this loss, the framework establishes a closed-loop mechanism integrating generation and validation phases, eliminating the need for manual rule design or prior distribution assumptions. Experiments on normalized synthetic and simulated datasets demonstrate that the proposed method achieves a cycle reconstruction error below 0.003, achieving an improvement of approximately 30% in evaluation metrics compared to baseline models without cycle consistency. Furthermore, the framework supports unsupervised learning and significantly reduces reliance on manual intervention, demonstrating potential advantages in non-injective regression tasks.
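As a rough illustration of the cycle-consistency loss in this abstract, the sketch below uses a toy non-injective forward map (y = x^2) and one branch of its inverse as the backward map. The names `forward`, `backward`, and `cycle_loss`, and the choice of mean squared error for L, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def forward(x):
    """Toy non-injective forward model Phi: X -> Y (x and -x give the same y)."""
    return x ** 2

def backward(y):
    """Toy backward model Psi: Y -> X that commits to the non-negative branch."""
    return np.sqrt(y)

def cycle_loss(y):
    """Backward cycle loss L(Y - Phi(Psi(Y))), with L assumed to be mean squared error."""
    residual = y - forward(backward(y))
    return float(np.mean(residual ** 2))

y = np.array([0.0, 1.0, 4.0, 9.0])
loss = cycle_loss(y)  # zero here, because Psi picks a valid preimage of every y
```

In this toy setup the cycle through Y closes exactly even though the forward map has two preimages per output, which is the property such a loss rewards: any consistent choice of preimage is acceptable.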
[LG-34] VaxPulse: Monitoring of Online Public Concerns to Enhance Post-licensure Vaccine Surveillance
Link: https://arxiv.org/abs/2507.04656
Authors: Muhammad Javed,Sedigh Khademi,Joanne Hickman,Jim Buttery,Hazel Clothier,Gerardo Luis Dimaguila
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Abstract:The recent vaccine-related infodemic has amplified public concerns, highlighting the need for proactive misinformation management. We describe how we enhanced the reporting surveillance system of Victoria’s vaccine safety service, SAEFVIC, through the incorporation of new information sources for public sentiment analysis, topics of discussion, and hesitancies about vaccinations online. Using VaxPulse, a multi-step framework, we integrate adverse events following immunisation (AEFI) with sentiment analysis, demonstrating the importance of contextualising public concerns. Additionally, we emphasise the need to address non-English languages to stratify concerns across ethno-lingual communities, providing valuable insights for vaccine uptake strategies and combating mis/disinformation. The framework is applied to real-world examples and a case study on women’s vaccine hesitancy, showcasing its benefits and adaptability by identifying public opinion from online media.
[LG-35] SOSAE: Self-Organizing Sparse AutoEncoder IJCAI
Link: https://arxiv.org/abs/2507.04644
Authors: Sarthak Ketanbhai Modi,Zi Pong Lim,Yushi Cao,Yupeng Cheng,Yon Shin Teo,Shang-Wei Lin
Categories: Machine Learning (cs.LG)
Comments: Accepted at IJCAI-DSO Workshop 2024
Abstract:Tuning the size of the hidden layers of an autoencoder has the benefit of providing optimally compressed representations of the input data. However, this hyper-parameter tuning process takes considerable computation and time, with grid search as the default option. In this paper, we introduce the Self-Organization Regularization for Autoencoders that dynamically adapts the dimensionality of the feature space to the optimal size. Inspired by physics concepts, Self-Organizing Sparse AutoEncoder (SOSAE) induces sparsity in feature space in a structured way that permits the truncation of the non-active part of the feature vector without any loss of information. This is done by penalizing the autoencoder based on the magnitude and the positional index of the feature vector dimensions, which during training constricts the feature space in both terms. Extensive experiments on various datasets show that our SOSAE can tune the feature space dimensionality with up to 130 times fewer floating-point operations (FLOPs) than other baselines while maintaining the same quality of tuning and performance.
[LG-36] Photon Splatting: A Physics-Guided Neural Surrogate for Real-Time Wireless Channel Prediction
Link: https://arxiv.org/abs/2507.04595
Authors: Ge Cao,Gabriele Gradoni,Zhen Peng
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 11 figures
Abstract:We present Photon Splatting, a physics-guided neural surrogate model for real-time wireless channel prediction in complex environments. The proposed framework introduces surface-attached virtual sources, referred to as photons, which carry directional wave signatures informed by the scene geometry and transmitter configuration. At runtime, channel impulse responses (CIRs) are predicted by splatting these photons onto the angular domain of the receiver using a geodesic rasterizer. The model is trained to learn a physically grounded representation that maps transmitter-receiver configurations to full channel responses. Once trained, it generalizes to new transmitter positions, antenna beam patterns, and mobile receivers without requiring model retraining. We demonstrate the effectiveness of the framework through a series of experiments, from canonical 3D scenes to a complex indoor cafe with 1,000 receivers. Results show 30 millisecond-level inference latency and accurate CIR predictions across a wide range of configurations. The approach supports real-time adaptability and interpretability, making it a promising candidate for wireless digital twin platforms and future 6G network planning.
[LG-37] A Lightweight Deep Learning Model for Automatic Modulation Classification using Dual Path Deep Residual Shrinkage Network
Link: https://arxiv.org/abs/2507.04586
Authors: Prakash Suman,Yanzhen Qu
Categories: Machine Learning (cs.LG)
Abstract:Efficient spectrum utilization is critical to meeting the growing data demands of modern wireless communication networks. Automatic Modulation Classification (AMC) plays a key role in enhancing spectrum efficiency by accurately identifying modulation schemes in received signals, an essential capability for dynamic spectrum allocation and interference mitigation, particularly in cognitive radio (CR) systems. With the increasing deployment of smart edge devices, such as IoT nodes with limited computational and memory resources, there is a pressing need for lightweight AMC models that balance low complexity with high classification accuracy. This paper proposes a low-complexity, lightweight deep learning (DL) AMC model optimized for resource-constrained edge devices. We introduce a dual-path deep residual shrinkage network (DP-DRSN) with Garrote thresholding for effective signal denoising and design a compact hybrid CNN-LSTM architecture comprising only 27,000 trainable parameters. The proposed model achieves average classification accuracies of 61.20%, 63.78%, and 62.13% on the RML2016.10a, RML2016.10b, and RML2018.01a datasets, respectively, demonstrating a strong balance between model efficiency and classification performance. These results underscore the model’s potential for enabling accurate and efficient AMC on edge devices with limited resources.
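For readers unfamiliar with Garrote thresholding, the sketch below implements the standard non-negative garrote shrinkage function with a fixed threshold. The function name `garrote_threshold` and the fixed value of `tau` are assumptions for illustration; how the paper's network sets its thresholds is not stated in the abstract.

```python
import numpy as np

def garrote_threshold(x, tau):
    """Non-negative garrote shrinkage: x - tau^2 / x where |x| > tau, else 0."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    keep = np.abs(x) > tau          # entries at or below the threshold are zeroed
    out[keep] = x[keep] - tau ** 2 / x[keep]
    return out

signal = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
denoised = garrote_threshold(signal, tau=1.0)
```

Compared with soft thresholding, which subtracts a constant `tau` from every surviving value, this function shrinks large values less, so strong signal components are attenuated less.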
[LG-38] Self-supervised learning of speech representations with Dutch archival data INTERSPEECH2025
Link: https://arxiv.org/abs/2507.04554
Authors: Nik Vaessen,David A. van Leeuwen,Roeland Ordelman
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech 2025
Abstract:This paper explores the use of Dutch archival television broadcast data for self-supervised learning of speech foundation models, specifically wav2vec 2.0. We first study data quality assumptions for pre-training, and show how music, noise and speaker overlap affect SSL convergence and downstream fine-tuning performance. Secondly, we explore effective pre-processing strategies to convert the noisy broadcast dataset into a high-quality dataset for pre-training, by using Whisper and WhisperX. Thirdly, we compare mono-lingual and multi-lingual pre-training with equivalent amounts of data, and show that mono-lingual pre-training is more robust to out-of-domain data. Lastly, we achieve a state-of-the-art LARGE wav2vec 2.0 model for the Dutch language, by continuing the pre-training of a wav2vec 2.0 XLS-R model checkpoint with our 55k hour archival dataset.
[LG-39] da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs
Link: https://arxiv.org/abs/2507.04535
Authors: Chang Sun,Zhiqiang Que,Vladimir Loncar,Wayne Luk,Maria Spiropulu
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Abstract:Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs fully unrolled and pipelined. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient algorithm for implementing CMVM operations with distributed arithmetic (DA) on FPGAs that simultaneously optimizes for area consumption and latency. The algorithm achieves resource reduction similar to state-of-the-art algorithms while being significantly faster to compute. The proposed algorithm is open-sourced and integrated into the hls4ml library, a free and open-source library for running real-time neural network inference on FPGAs. We show that the proposed algorithm can reduce on-chip resources by up to a third for realistic, highly quantized neural networks while simultaneously reducing latency, enabling the implementation of previously infeasible networks.
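To make the idea of a multiplier-free constant matrix-vector product concrete, the sketch below computes one using only shifts and additions. It illustrates the general shift-and-add principle and is not the paper's algorithm; the function names `shift_add_multiply` and `cmvm` are hypothetical.

```python
def shift_add_multiply(x, c):
    """Multiply integer x by the constant integer c using only shifts and adds."""
    negative = c < 0
    c = abs(c)
    acc = 0
    shift = 0
    while c:
        if c & 1:
            acc += x << shift  # contributes x * 2**shift for each set bit of c
        c >>= 1
        shift += 1
    return -acc if negative else acc

def cmvm(matrix, vector):
    """Constant matrix-vector product built from shift_add_multiply."""
    return [sum(shift_add_multiply(x, c) for c, x in zip(row, vector))
            for row in matrix]

W = [[3, -5], [6, 0]]   # constant weights, known at design time
y = cmvm(W, [2, 7])
```

Because the weights are fixed, the number of set bits in each constant determines how many adders a hardware version of this naive scheme would need, which is why area is tied to the CMVM operations.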
[LG-40] Verification of Visual Controllers via Compositional Geometric Transformations
Link: https://arxiv.org/abs/2507.04523
Authors: Alexander Estornell,Leonard Jung,Michael Everett
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Abstract:Perception-based neural network controllers are increasingly used in autonomous systems that rely on visual inputs to operate in the real world. Ensuring the safety of such systems under uncertainty is challenging. Existing verification techniques typically focus on Lp-bounded perturbations in the pixel space, which fails to capture the low-dimensional structure of many real-world effects. In this work, we introduce a novel verification framework for perception-based controllers that can generate outer-approximations of reachable sets through explicitly modeling uncertain observations with geometric perturbations. Our approach constructs a boundable mapping from states to images, enabling the use of state-based verification tools while accounting for uncertainty in perception. We provide theoretical guarantees on the soundness of our method and demonstrate its effectiveness across benchmark control environments. This work provides a principled framework for certifying the safety of perception-driven control systems under realistic visual perturbations.
[LG-41] Machine Learning-Based Prediction of Metal-Organic Framework Materials: A Comparative Analysis of Multiple Models
Link: https://arxiv.org/abs/2507.04493
Authors: Zhuo Zheng,Keyan Liu,Xiyuan Zhu
Categories: Machine Learning (cs.LG)
Abstract:Metal-organic frameworks (MOFs) have emerged as promising materials for various applications due to their unique structural properties and versatile functionalities. This study presents a comprehensive investigation of machine learning approaches for predicting MOF material properties. We employed five different machine learning models: Random Forest, XGBoost, LightGBM, Support Vector Machine, and Neural Network, to analyze and predict MOF characteristics using a dataset from the Kaggle platform. The models were evaluated using multiple performance metrics, including RMSE, R^2, MAE, and cross-validation scores. Results demonstrated that the Random Forest model achieved superior performance with an R^2 value of 0.891 and RMSE of 0.152, significantly outperforming other models. LightGBM showed remarkable computational efficiency, completing training in 25.7 seconds while maintaining high accuracy. Our comparative analysis revealed that ensemble learning methods generally exhibited better performance than traditional single models in MOF property prediction. This research provides valuable insights into the application of machine learning in materials science and establishes a robust framework for future MOF material design and property prediction.
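The three regression metrics named in this abstract can be computed as in the sketch below. The function name `regression_metrics` and the sample values are illustrative assumptions and have no connection to the paper's dataset or results.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and the coefficient of determination R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {"rmse": rmse, "mae": mae, "r2": 1.0 - ss_res / ss_tot}

metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

RMSE and MAE are in the units of the target variable, while R^2 is unitless, so an RMSE figure is only interpretable alongside the scale of the predicted property.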
[LG-42] ESSA: Evolutionary Strategies for Scalable Alignment
Link: https://arxiv.org/abs/2507.04453
Authors: Daria Korotyshova,Boris Shaposhnikov,Alexey Malakhov,Nikita Surnachev,George Bredis,Alexey Gorbatovski,Viacheslav Sinii,Daniil Gavrilov
Categories: Machine Learning (cs.LG)
Abstract:Large Language Models (LLMs) are increasingly relying on alignment techniques to ensure that their outputs match human preferences. Although reinforcement learning from human feedback (RLHF) is the dominant approach, it has high computational costs, memory requirements, and training instability, particularly when scaling to larger models. This paper introduces ESSA (Evolutionary Strategies for Scalable Alignment), a new framework that uses Evolutionary Strategies (ES) to efficiently align LLMs without the need for gradient computation. ES is well-suited for LLM alignment due to its favorable properties, such as high parallelizability, memory efficiency, robustness to sparse rewards, and fewer data samples required for convergence, especially when starting from a strong pre-trained policy. Moreover, ES eliminates the need for extensive hyperparameter tuning, making the alignment process simpler and more stable. Although ES excels in low-dimensional optimization, it poses a challenge when applied to high-dimensional LLMs. To address this challenge, we propose a parameter-efficient architectural modification that reduces the dimensionality of optimization through low-rank adaptation. We evaluated our approach on mathematical reasoning tasks with verifiable accuracy-based metrics, demonstrating that ESSA converges faster and is more data efficient than gradient-based methods like Group Relative Policy Optimization (GRPO). Our findings establish ES as a promising and scalable alternative to gradient-based alignment, paving the way for efficient post-training of large language models.
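A minimal sketch of a gradient-free evolution-strategies update is given below, using a toy reward and a four-dimensional parameter vector in place of low-rank adapter weights. It shows a generic antithetic-sampling ES step, not the ESSA method; the reward function, hyperparameters, and names (`reward`, `es_step`) are assumptions for illustration.

```python
import numpy as np

def reward(theta):
    """Toy black-box reward: higher is better, maximum at theta = 1."""
    return -float(np.sum((theta - 1.0) ** 2))

def es_step(theta, rng, pop_size=32, sigma=0.1, lr=0.05):
    """One evolution-strategies update using antithetic Gaussian perturbations."""
    eps = rng.standard_normal((pop_size, theta.size))
    # Reward difference between each mirrored pair of perturbed candidates.
    diffs = np.array([reward(theta + sigma * e) - reward(theta - sigma * e)
                      for e in eps])
    # Search-gradient estimate built from reward evaluations only (no backprop).
    grad_est = (diffs[:, None] * eps).sum(axis=0) / (2 * sigma * pop_size)
    return theta + lr * grad_est

rng = np.random.default_rng(0)
theta = np.zeros(4)   # stands in for a small set of low-rank adapter weights
start = reward(theta)
for _ in range(200):
    theta = es_step(theta, rng)
final = reward(theta)
```

Each candidate in the population needs only a forward evaluation of the reward, so the evaluations are independent of one another and can be run in parallel.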
[LG-43] Transfer Learning in Infinite Width Feature Learning Networks
Link: https://arxiv.org/abs/2507.04448
Authors: Clarissa Lauditi,Blake Bordelon,Cengiz Pehlevan
Categories: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (stat.ML)
Abstract:We develop a theory of transfer learning in infinitely wide neural networks where both the pretraining (source) and downstream (target) task can operate in a feature learning regime. We analyze both the Bayesian framework, where learning is described by a posterior distribution over the weights, and gradient flow training of randomly initialized networks trained with weight decay. Both settings track how representations evolve in both source and target tasks. The summary statistics of these theories are adapted feature kernels which, after transfer learning, depend on data and labels from both source and target tasks. Reuse of features during transfer learning is controlled by an elastic weight coupling which controls the reliance of the network on features learned during training on the source task. We apply our theory to linear and polynomial regression tasks as well as real datasets. Our theory and experiments reveal interesting interplays between elastic weight coupling, feature learning strength, dataset size, and source and target task alignment on the utility of transfer learning.
[LG-44] Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking
Link: https://arxiv.org/abs/2507.04446
Authors: Tim Beyer,Yan Scholten,Stephan Günnemann,Leo Schwinn
Categories: Machine Learning (cs.LG)
Abstract:To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point, greedy generations, overlooking the inherently stochastic nature of LLMs. In this paper, we propose a novel framework for adversarial robustness evaluation that explicitly models the entire output distribution, including tail-risks, providing better estimates for model robustness at scale. By casting the attack process as a resource allocation problem between optimization and sampling, we determine compute-optimal tradeoffs and show that integrating sampling into existing attacks boosts ASR by up to 48% and improves efficiency by up to two orders of magnitude. Our framework also enables us to analyze how different attack algorithms affect output harm distributions. Surprisingly, we find that most optimization strategies have little effect on output harmfulness. Finally, we introduce a data-free proof-of-concept objective based on entropy-maximization to demonstrate how our tail-aware perspective enables new optimization targets. Overall, our findings highlight the importance of tail-aware attacks and evaluation protocols to accurately assess and strengthen LLM safety.
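The gap between single-generation and distribution-level evaluation can be seen from a simple probability calculation. The sketch below assumes independent samples with a fixed per-sample success probability, a simplification introduced here and not the paper's model. The numbers 0.05 and 50 are made up for illustration.

```python
def success_at_n(p, n):
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p) ** n

single_sample = success_at_n(0.05, 1)    # one generation per prompt
with_sampling = success_at_n(0.05, 50)   # fifty generations per prompt
```

Under these assumptions a behavior that shows up in 5% of single generations appears at least once in roughly 92% of 50-sample batches, so an evaluation based on a single generation can greatly understate tail risk.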
[LG-45] Enhancing Text-Based Hierarchical Multilabel Classification for Mobile Applications via Contrastive Learning
Link: https://arxiv.org/abs/2507.04413
Authors: Jiawei Guo,Yang Xiao,Weipeng Huang,Guangyuan Piao
Categories: Machine Learning (cs.LG)
Abstract:A hierarchical labeling system for mobile applications (apps) benefits a wide range of downstream businesses that integrate the labeling with their proprietary user data, to improve user modeling. Such a label hierarchy can define more granular labels that capture detailed app features beyond the limitations of traditional broad app categories. In this paper, we address the problem of hierarchical multilabel classification for apps by using their textual information such as names and descriptions. We present: 1) HMCN (Hierarchical Multilabel Classification Network) for handling the classification from two perspectives: the first focuses on a multilabel classification without hierarchical constraints, while the second predicts labels sequentially at each hierarchical level considering such constraints; 2) HMCL (Hierarchical Multilabel Contrastive Learning), a scheme that is capable of learning more distinguishable app representations to enhance the performance of HMCN. Empirical results on our Tencent App Store dataset and two public datasets demonstrate that our approach performs well compared with state-of-the-art methods. The approach has been deployed at Tencent for over one year, and the multilabel classification outputs for apps have helped a downstream task, credit risk management of users, improve its performance by 10.70% with regard to the Kolmogorov-Smirnov metric.
[LG-46] Convergence and Sample Complexity of First-Order Methods for Agnostic Reinforcement Learning
Link: https://arxiv.org/abs/2507.04406
Authors: Uri Sherman,Tomer Koren,Yishay Mansour
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract:We study reinforcement learning (RL) in the agnostic policy learning setting, where the goal is to find a policy whose performance is competitive with the best policy in a given class of interest $\Pi$ – crucially, without assuming that $\Pi$ contains the optimal policy. We propose a general policy learning framework that reduces this problem to first-order optimization in a non-Euclidean space, leading to new algorithms as well as shedding light on the convergence properties of existing ones. Specifically, under the assumption that $\Pi$ is convex and satisfies a variational gradient dominance (VGD) condition – an assumption known to be strictly weaker than more standard completeness and coverability conditions – we obtain sample complexity upper bounds for three policy learning algorithms: (i) Steepest Descent Policy Optimization, derived from a constrained steepest descent method for non-convex optimization; (ii) the classical Conservative Policy Iteration algorithm (Kakade and Langford, 2002), reinterpreted through the lens of the Frank-Wolfe method, which leads to improved convergence results; and (iii) an on-policy instantiation of the well-studied Policy Mirror Descent algorithm. Finally, we empirically evaluate the VGD condition across several standard environments, demonstrating the practical relevance of our key assumption.
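The link between Conservative Policy Iteration and the Frank-Wolfe method mentioned in this abstract is easiest to see by writing the two textbook updates side by side. The notation below ($f$, $\mathcal{C}$, $\gamma_k$, $\alpha$, $\pi'$) is generic and assumed here for illustration; it is not the paper's notation, and $\pi'$ denotes an (approximately) greedy policy with respect to the current policy $\pi_k$.

```latex
% Frank-Wolfe step for minimizing f over a convex set C
s_k \in \arg\min_{s \in \mathcal{C}} \langle \nabla f(x_k), s \rangle, \qquad
x_{k+1} = (1 - \gamma_k)\, x_k + \gamma_k\, s_k, \quad \gamma_k \in [0, 1]

% Conservative Policy Iteration mixture update
\pi_{k+1} = (1 - \alpha)\, \pi_k + \alpha\, \pi', \quad \alpha \in [0, 1]
```

Both updates move to a convex combination of the current iterate and the solution of a linearized subproblem, so the iterates stay inside the convex set without a projection step.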
[LG-47] Inverse Reinforcement Learning using Revealed Preferences and Passive Stochastic Optimization
Link: https://arxiv.org/abs/2507.04396
Authors: Vikram Krishnamurthy
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: arXiv admin note: text overlap with arXiv:2006.11674
Abstract:This monograph, spanning three chapters, explores Inverse Reinforcement Learning (IRL). The first two chapters view IRL through the lens of revealed preferences from microeconomics, while the third chapter studies adaptive IRL via Langevin dynamics stochastic gradient algorithms. Chapter 1 uses classical revealed preference theory (Afriat’s theorem and extensions) to identify constrained utility maximizers based on observed agent actions. This allows for the reconstruction of set-valued estimates of an agent’s utility. We illustrate this procedure by identifying the presence of a cognitive radar and reconstructing its utility function. The chapter also addresses the construction of a statistical detector for utility maximization behavior when agent actions are corrupted by noise. Chapter 2 studies Bayesian IRL. It investigates how an analyst can determine if an observed agent is a rationally inattentive Bayesian utility maximizer (i.e., simultaneously optimizing its utility and observation likelihood). The chapter discusses inverse stopping-time problems, focusing on reconstructing the continuation and stopping costs of a Bayesian agent operating over a random horizon. We then apply this IRL methodology to identify the presence of a Bayes-optimal sequential detector. Additionally, Chapter 2 provides a concise overview of discrete choice models, inverse Bayesian filtering, and inverse stochastic gradient algorithms for adaptive IRL. Finally, Chapter 3 introduces an adaptive IRL approach utilizing passive Langevin dynamics. This method aims to track time-varying utility functions given noisy and misspecified gradients. In essence, the adaptive IRL algorithms presented in Chapter 3 can be conceptualized as inverse stochastic gradient algorithms, as they learn the utility function in real-time while a stochastic gradient algorithm is in operation.
[LG-48] Hierarchical Reinforcement Learning with Targeted Causal Interventions ICML2025
Link: https://arxiv.org/abs/2507.04373
Authors: Sadegh Khorasani, Saber Salehkaleybar, Negar Kiyavash, Matthias Grossglauser
Categories: Machine Learning (cs.LG)
Comments: 44 pages, 13 figures. Proceedings of ICML 2025. Proposes a new framework for hierarchical reinforcement learning (HRL) using targeted causal interventions, with theoretical analysis and experiments in synthetic and Minecraft environments
Abstract:Hierarchical reinforcement learning (HRL) improves the efficiency of long-horizon reinforcement-learning tasks with sparse rewards by decomposing the task into a hierarchy of subgoals. The main challenge of HRL is efficient discovery of the hierarchical structure among subgoals and utilizing this structure to achieve the final goal. We address this challenge by modeling the subgoal structure as a causal graph and propose a causal discovery algorithm to learn it. Additionally, rather than intervening on the subgoals at random during exploration, we harness the discovered causal model to prioritize subgoal interventions based on their importance in attaining the final goal. These targeted interventions result in a significantly more efficient policy in terms of the training cost. Unlike previous work on causal HRL, which lacked theoretical analysis, we provide a formal analysis of the problem. Specifically, for tree structures and for a variant of Erdős-Rényi random graphs, our approach results in remarkable improvements. Our experimental results on HRL tasks also illustrate that our proposed framework outperforms existing work in terms of training cost.
[LG-49] Adaptive Malware Detection using Sequential Feature Selection: A Dueling Double Deep Q-Network (D3QN) Framework for Intelligent Classification
Link: https://arxiv.org/abs/2507.04372
Authors: Naseem Khan, Aref Y. Al-Tamimi, Amine Bermak, Issa M. Khalil
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Abstract:Traditional malware detection methods exhibit computational inefficiency due to exhaustive feature extraction requirements, creating accuracy-efficiency trade-offs that limit real-time deployment. We formulate malware classification as a Markov Decision Process with episodic feature acquisition and propose a Dueling Double Deep Q-Network (D3QN) framework for adaptive sequential feature selection. The agent learns to dynamically select informative features per sample before terminating with classification decisions, optimizing both detection accuracy and computational cost through reinforcement learning. We evaluate our approach on Microsoft Big2015 (9-class, 1,795 features) and BODMAS (binary, 2,381 features) datasets. D3QN achieves 99.22% and 98.83% accuracy while utilizing only 61 and 56 features on average, representing 96.6% and 97.6% dimensionality reduction. This yields computational efficiency improvements of 30.1x and 42.5x over traditional ensemble methods. Comprehensive ablation studies demonstrate consistent superiority over Random Forest, XGBoost, and static feature selection approaches. Quantitative analysis demonstrates that D3QN learns non-random feature selection policies with 62.5% deviation from uniform baseline distributions. The learned policies exhibit structured hierarchical preferences, utilizing high-level metadata features for initial assessment while selectively incorporating detailed behavioral features based on classification uncertainty. Feature specialization analysis reveals 57.7% of examined features demonstrate significant class-specific discrimination patterns. Our results validate reinforcement learning-based sequential feature selection for malware classification, achieving superior accuracy with substantial computational reduction through learned adaptive policies. 
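The two ingredients in the name D3QN have standard textbook forms, sketched below. This is a minimal illustration and not the authors' implementation: the function names and all numbers are hypothetical. In the setting described above, the actions would presumably be "acquire another feature" or "stop and classify", but the sketch treats them as plain action indices.

```python
def dueling_q_values(state_value, advantages):
    """Dueling head: combine V(s) and A(s, a) as Q = V + A - mean(A)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN: pick the next action with the online net, score it with the target net."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical numbers for a single transition with three actions.
q = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5, 0.0])
y = double_dqn_target(
    reward=1.0,
    gamma=0.9,
    q_online_next=[0.2, 0.8, 0.1],
    q_target_next=[0.5, 0.3, 0.9],
    done=False,
)
print(q, y)
```

Decoupling action selection from action evaluation is what distinguishes the double variant: a plain DQN target would have used the maximum of `q_target_next` directly.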
[LG-50] Information-theoretic Quantification of High-order Feature Effects in Classification Problems
Link: https://arxiv.org/abs/2507.04362
Authors: Ivan Lazic, Chiara Barà, Marta Iovino, Sebastiano Stramaglia, Niksa Jakovljevic, Luca Faes
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Abstract:Understanding the contribution of individual features in predictive models remains a central goal in interpretable machine learning, and while many model-agnostic methods exist to estimate feature importance, they often fall short in capturing high-order interactions and disentangling overlapping contributions. In this work, we present an information-theoretic extension of the High-order interactions for Feature importance (Hi-Fi) method, leveraging Conditional Mutual Information (CMI) estimated via a k-Nearest Neighbor (kNN) approach working on mixed discrete and continuous random variables. Our framework decomposes feature contributions into unique, synergistic, and redundant components, offering a richer, model-independent understanding of their predictive roles. We validate the method using synthetic datasets with known Gaussian structures, where ground truth interaction patterns are analytically derived, and further test it on non-Gaussian and real-world gene expression data from TCGA-BRCA. Results indicate that the proposed estimator accurately recovers theoretical and expected findings, providing a potential use case for developing feature selection algorithms or model development based on interaction analysis.
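A small discrete example may help show what a purely synergistic contribution looks like. The sketch below is an assumption-laden illustration, not the Hi-Fi method: it uses an exact plug-in mutual information on a known joint distribution rather than the kNN-based CMI estimator described above, and the XOR target is a hypothetical choice. Each feature alone carries no information about the label, while the pair determines it completely.

```python
from collections import defaultdict
from math import log2


def mutual_information(joint, x_idx, y_idx):
    """I(X;Y) in bits from {outcome tuple: probability}; x_idx / y_idx select tuple positions."""
    pxy, px, py = defaultdict(float), defaultdict(float), defaultdict(float)
    for outcome, p in joint.items():
        x = tuple(outcome[i] for i in x_idx)
        y = tuple(outcome[i] for i in y_idx)
        pxy[(x, y)] += p
        px[x] += p
        py[y] += p
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)


# Outcomes are (x1, x2, y) with y = x1 XOR x2 and uniform, independent inputs.
xor_joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

i_x1 = mutual_information(xor_joint, [0], [2])       # feature 1 alone
i_x2 = mutual_information(xor_joint, [1], [2])       # feature 2 alone
i_both = mutual_information(xor_joint, [0, 1], [2])  # both features jointly
print(i_x1, i_x2, i_both)
```

Here the joint information exceeds the sum of the individual terms, which is the signature of synergy that single-feature importance scores cannot capture.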
[LG-51] Interactive Groupwise Comparison for Reinforcement Learning from Human Feedback
Link: https://arxiv.org/abs/2507.04340
Authors: Jan Kompatscher, Danqing Shi, Giovanna Varni, Tino Weinkauf, Antti Oulasvirta
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 11 figures in proceedings of Computer Graphics Forum
Abstract:Reinforcement learning from human feedback (RLHF) has emerged as a key enabling technology for aligning AI behavior with human preferences. The traditional way to collect data in RLHF is via pairwise comparisons: human raters are asked to indicate which one of two samples they prefer. We present an interactive visualization that better exploits the human visual ability to compare and explore whole groups of samples. The interface is comprised of two linked views: 1) an exploration view showing a contextual overview of all sampled behaviors organized in a hierarchical clustering structure; and 2) a comparison view displaying two selected groups of behaviors for user queries. Users can efficiently explore large sets of behaviors by iterating between these two views. Additionally, we devised an active learning approach suggesting groups for comparison. As shown by our evaluation in six simulated robotics tasks, our approach increases the final policy returns by 69.34%. It leads to lower error rates and better policies. We open-source the code that can be easily integrated into the RLHF training loop, supporting research on human-AI alignment.
[LG-52] Consistent Labeling Across Group Assignments: Variance Reduction in Conditional Average Treatment Effect Estimation
Link: https://arxiv.org/abs/2507.04332
Authors: Yi-Fu Fu, Keng-Te Liao, Shou-De Lin
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Abstract:Numerous algorithms have been developed for Conditional Average Treatment Effect (CATE) estimation. In this paper, we first highlight a common issue where many algorithms exhibit inconsistent learning behavior for the same instance across different group assignments. We introduce a metric to quantify and visualize this inconsistency. Next, we present a theoretical analysis showing that this inconsistency indeed contributes to higher test errors and cannot be resolved through conventional machine learning techniques. To address this problem, we propose a general method called Consistent Labeling Across Group Assignments (CLAGA), which eliminates the inconsistency and is applicable to any existing CATE estimation algorithm. Experiments on both synthetic and real-world datasets demonstrate significant performance improvements with CLAGA.
[LG-53] TinyProto: Communication-Efficient Federated Learning with Sparse Prototypes in Resource-Constrained Environments
Link: https://arxiv.org/abs/2507.04327
Authors: Gyuejeong Lee, Daeyoung Choi
Categories: Machine Learning (cs.LG)
Abstract:Communication efficiency in federated learning (FL) remains a critical challenge for resource-constrained environments. While prototype-based FL reduces communication overhead by sharing class prototypes (mean activations in the penultimate layer) instead of model parameters, its efficiency decreases with larger feature dimensions and class counts. We propose TinyProto, which addresses these limitations through Class-wise Prototype Sparsification (CPS) and adaptive prototype scaling. CPS enables structured sparsity by allocating specific dimensions to class prototypes and transmitting only non-zero elements, while adaptive scaling adjusts prototypes based on class distributions. Our experiments show TinyProto reduces communication costs by up to 4x compared to existing methods while maintaining performance. Beyond its communication efficiency, TinyProto offers crucial advantages: achieving compression without client-side computational overhead and supporting heterogeneous architectures, making it ideal for resource-constrained heterogeneous FL.
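The idea of allocating dimensions to classes and sending only the non-zero entries can be sketched as follows. This is a hypothetical illustration rather than the TinyProto implementation: the abstract does not say how dimensions are assigned, so the equal, disjoint, contiguous blocks used here are an assumption, as are the function names and the example prototype.

```python
def sparsify_prototype(prototype, class_id, num_classes):
    """Keep only the block of dimensions assigned to class_id; return (indices, values)."""
    dim = len(prototype)
    block = dim // num_classes  # assumption: equal, disjoint, contiguous blocks per class
    start = class_id * block
    indices = list(range(start, start + block))
    return indices, [prototype[i] for i in indices]


def densify_prototype(indices, values, dim):
    """Receiver-side reconstruction: zeros everywhere except the transmitted entries."""
    dense = [0.0] * dim
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical 8-dimensional prototype for class 1 out of 4 classes.
proto = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
idx, vals = sparsify_prototype(proto, class_id=1, num_classes=4)
restored = densify_prototype(idx, vals, dim=len(proto))
print(idx, vals, restored)
```

Because the block layout is fixed by the class index in this sketch, the indices would not even need to be transmitted; only the values would.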
[LG-54] Heterogeneous Federated Learning with Prototype Alignment and Upscaling
Link: https://arxiv.org/abs/2507.04310
Authors: Gyuejeong Lee, Jihwan Shin, Daeyoung Choi
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Abstract:Heterogeneity in data distributions and model architectures remains a significant challenge in federated learning (FL). Various heterogeneous FL (HtFL) approaches have recently been proposed to address this challenge. Among them, prototype-based FL (PBFL) has emerged as a practical framework that only shares per-class mean activations from the penultimate layer. However, PBFL approaches often suffer from suboptimal prototype separation, limiting their discriminative power. We propose Prototype Normalization (ProtoNorm), a novel PBFL framework that addresses this limitation through two key components: Prototype Alignment (PA) and Prototype Upscaling (PU). The PA method draws inspiration from the Thomson problem in classical physics, optimizing global prototype configurations on a unit sphere to maximize angular separation; subsequently, the PU method increases prototype magnitudes to enhance separation in Euclidean space. Extensive evaluations on benchmark datasets show that our approach better separates prototypes and thus consistently outperforms existing HtFL approaches. Notably, since ProtoNorm inherits the communication efficiency of PBFL and the PA is performed server-side, it is particularly suitable for resource-constrained environments.
[LG-55] Domain Adaptation of Drag Reduction Policy to Partial Measurements
Link: https://arxiv.org/abs/2507.04309
Authors: Anton Plaksin, Georgios Rigas
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Abstract:Feedback control of fluid-based systems poses significant challenges due to their high-dimensional, nonlinear, and multiscale dynamics, which demand real-time, three-dimensional, multi-component measurements for sensing. While such measurements are feasible in digital simulations, they are often only partially accessible in the real world. In this paper, we propose a method to adapt feedback control policies obtained from full-state measurements to setups with only partial measurements. Our approach is demonstrated in a simulated environment by minimising the aerodynamic drag of a simplified road vehicle. Reinforcement learning algorithms can optimally solve this control task when trained on full-state measurements by placing sensors in the wake. However, in real-world applications, sensors are limited and typically only on the vehicle, providing only partial measurements. To address this, we propose to train a Domain Specific Feature Transfer (DSFT) map reconstructing the full measurements from the history of the partial measurements. By applying this map, we derive optimal policies based solely on partial data. Additionally, our method enables determination of the optimal history length and offers insights into the architecture of optimal control policies, facilitating their implementation in real-world environments with limited sensor information.
[LG-56] A Two-Stage Ensemble Feature Selection and Particle Swarm Optimization Approach for Micro-Array Data Classification in Distributed Computing Environments
Link: https://arxiv.org/abs/2507.04251
Authors: Aayush Adhikari, Sandesh Bhatta, Harendra S. Jangwan, Amit Mishra, Khair Ul Nisa, Abu Taha Zamani, Aaron Sapkota, Debendra Muduli, Nikhat Parveen
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 8 figures. Submitted to Results in Engineering (Elsevier), under review
Abstract:High dimensionality in datasets produced by microarray technology presents a challenge for Machine Learning (ML) algorithms, particularly in terms of dimensionality reduction and handling imbalanced sample sizes. To mitigate these problems, we propose hybrid ensemble feature selection techniques with a majority voting classifier for microarray classification. We consider both filter and wrapper-based feature selection techniques, including Mutual Information (MI), Chi-Square, Variance Threshold (VT), Least Absolute Shrinkage and Selection Operator (LASSO), Analysis of Variance (ANOVA), and Recursive Feature Elimination (RFE), followed by Particle Swarm Optimization (PSO) for selecting the optimal features. This Artificial Intelligence (AI) approach leverages a Majority Voting Classifier that combines multiple machine learning models, such as Logistic Regression (LR), Random Forest (RF), and Extreme Gradient Boosting (XGBoost), to enhance overall performance and accuracy. By leveraging the strengths of each model, the ensemble approach aims to provide more reliable and effective diagnostic predictions. The efficacy of the proposed model has been tested in both local and cloud environments. In the cloud environment, three virtual machines with virtual Central Processing Unit (vCPU) sizes of 8, 16, and 64 bits were used to demonstrate the model performance. The experiments show that the 64-bit vCPU provides better classification accuracies of 95.89%, 97.50%, 99.13%, 99.58%, 99.11%, and 94.60% on six microarray datasets, Mixed Lineage Leukemia (MLL), Leukemia, Small Round Blue Cell Tumors (SRBCT), Lymphoma, Ovarian, and Lung, respectively, validating the effectiveness of the proposed model in both local and cloud environments.
[LG-57] Normalized Iterative Hard Thresholding for Tensor Recovery
Link: https://arxiv.org/abs/2507.04228
Authors: Li Li, Yuneng Liang, Kaijie Zheng, Jian Lu
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 13 pages
Abstract:Low-rank recovery builds upon ideas from the theory of compressive sensing, which predicts that sparse signals can be accurately reconstructed from incomplete measurements. Iterative thresholding-type algorithms, particularly the normalized iterative hard thresholding (NIHT) method, have been widely used in compressed sensing (CS) and applied to matrix recovery tasks. In this paper, we propose a tensor extension of NIHT, referred to as TNIHT, for the recovery of low-rank tensors under two widely used tensor decomposition models. This extension enables the effective reconstruction of high-order low-rank tensors from a limited number of linear measurements by leveraging the inherent low-dimensional structure of multi-way data. Specifically, we consider both the CANDECOMP/PARAFAC (CP) rank and the Tucker rank to characterize tensor low-rankness within the TNIHT framework. At the same time, we establish a convergence theorem for the proposed TNIHT method under the tensor restricted isometry property (TRIP), providing theoretical support for its recovery guarantees. Finally, we evaluate the performance of TNIHT through numerical experiments on synthetic, image, and video data, and compare it with several state-of-the-art algorithms.
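For orientation, the classical vector form of NIHT that the paper extends can be sketched in a few lines. This is a simplified illustration and not TNIHT: it handles sparse vectors rather than low-rank tensors, it omits the step-size safeguard that the original NIHT applies when the support changes, and the helper names and the identity measurement matrix in the demo are assumptions. A tensor variant would presumably replace `hard_threshold` with a projection onto low CP-rank or Tucker-rank tensors.

```python
def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:s])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


def niht(A, y, s, num_iters=50):
    """Simplified NIHT: gradient step with a normalized step size, then hard thresholding."""
    x = [0.0] * len(A[0])
    for _ in range(num_iters):
        residual = [yi - ri for yi, ri in zip(y, matvec(A, x))]
        g = rmatvec(A, residual)
        # Support of the current iterate, or of the thresholded gradient at the start.
        support = {i for i, xi in enumerate(x) if xi != 0.0}
        if not support:
            support = {i for i, gi in enumerate(hard_threshold(g, s)) if gi != 0.0}
        g_s = [gi if i in support else 0.0 for i, gi in enumerate(g)]
        denom = sum(v * v for v in matvec(A, g_s))
        if denom == 0.0:
            break  # gradient vanishes on the support, nothing left to update
        mu = sum(v * v for v in g_s) / denom  # normalized step size
        x = hard_threshold([xi + mu * gi for xi, gi in zip(x, g)], s)
    return x


# Trivial demo with an identity measurement matrix (an assumption made for checkability).
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
x_hat = niht(identity, y=[0.0, 3.0, 0.0, -2.0], s=2)
print(x_hat)
```

The identity matrix makes recovery immediate; with a genuinely underdetermined matrix, convergence depends on a restricted isometry condition, which is the kind of assumption the TRIP plays for tensors.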
[LG-58] An explicit formulation of the learned noise predictor ε_θ(x_t, t) via the forward-process noise ε_t in denoising diffusion probabilistic models (DDPMs)
Link: https://arxiv.org/abs/2507.04203
Authors: KiHyun Yun
Categories: Machine Learning (cs.LG); Analysis of PDEs (math.AP)
Comments: 5 pages, no figures
Abstract:In denoising diffusion probabilistic models (DDPMs), the learned noise predictor $\epsilon_\theta(\mathbf{x}_t, t)$ is trained to approximate the forward-process noise $\epsilon_t$. The equality $\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) = -\frac{1}{\sqrt{1-\bar{\alpha}_t}} \epsilon_\theta(\mathbf{x}_t, t)$ plays a fundamental role in both theoretical analyses and algorithmic design, and thus is frequently employed across diffusion-based generative models. In this paper, an explicit formulation of $\epsilon_\theta(\mathbf{x}_t, t)$ in terms of the forward-process noise $\epsilon_t$ is derived. This result shows how the forward-process noise $\epsilon_t$ contributes to the learned predictor $\epsilon_\theta(\mathbf{x}_t, t)$. Furthermore, based on this formulation, we present a novel and mathematically rigorous proof of the fundamental equality above, clarifying its origin and providing new theoretical insight into the structure of diffusion models.
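The score-noise equality can be checked numerically in a case where everything is available in closed form. The sketch below is an illustration under stated assumptions and not the paper's derivation: the data are one-dimensional Gaussian, the predictor is taken to be the ideal one (the conditional mean of the noise given the noised sample), and the values of `alpha_bar` and `sigma0` are hypothetical.

```python
import math
import random

random.seed(0)
alpha_bar = 0.5  # hypothetical cumulative noise-schedule value at step t
sigma0 = 2.0     # hypothetical std of the toy data, x_0 ~ N(0, sigma0^2)
n = 100_000

xs, es = [], []
for _ in range(n):
    x0 = random.gauss(0.0, sigma0)
    eps = random.gauss(0.0, 1.0)
    # Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
    xs.append(math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps)
    es.append(eps)

# For jointly Gaussian (x_t, eps), E[eps | x_t] is linear in x_t, so a
# least-squares slope through the origin estimates the ideal predictor.
slope_eps = sum(x * e for x, e in zip(xs, es)) / sum(x * x for x in xs)

# Exact score of q(x_t) = N(0, var_t) is -x_t / var_t.
var_t = alpha_bar * sigma0 ** 2 + (1.0 - alpha_bar)
score_slope = -1.0 / var_t
score_from_eps = -slope_eps / math.sqrt(1.0 - alpha_bar)

print(score_slope, score_from_eps)
```

The two slopes agree up to Monte Carlo error, which is the Gaussian special case of the equality quoted in the abstract.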
[LG-59] Predicting Air Pollution in Cork, Ireland Using Machine Learning
Link: https://arxiv.org/abs/2507.04196
Authors: Md Rashidunnabi, Fahmida Faiza Ananna, Kailash Hambarde, Bruno Gabriel Nascimento Andrade, Dean Venables, Hugo Proenca
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 22 Pages, 14 Figures, 4 Tables
Abstract:Air pollution poses a critical health threat in cities worldwide, with nitrogen dioxide levels in Cork, Ireland exceeding World Health Organization safety standards by up to 278%. This study leverages artificial intelligence to predict air pollution with unprecedented accuracy, analyzing nearly ten years of data from five monitoring stations combined with 30 years of weather records. We evaluated 17 machine learning algorithms, with Extra Trees emerging as the optimal solution, achieving 77% prediction accuracy and significantly outperforming traditional forecasting methods. Our analysis reveals that meteorological conditions, particularly temperature, wind speed, and humidity, are the primary drivers of pollution levels, while traffic patterns and seasonal changes create predictable pollution cycles. Pollution exhibits dramatic seasonal variations, with winter levels nearly double those of summer, and daily rush-hour peaks reaching 120% above normal levels. While Cork’s air quality shows concerning violations of global health standards, our models detected an encouraging 31% improvement from 2014 to 2022. This research demonstrates that intelligent forecasting systems can provide city planners and environmental officials with powerful prediction tools, enabling life-saving early warning systems and informed urban planning decisions. The technology exists today to transform urban air quality management. All research materials and code are freely available at: this https URL
[LG-60] JAX-MPM: A Learning-Augmented Differentiable Meshfree Framework for GPU-Accelerated Lagrangian Simulation and Geophysical Inverse Modeling
Link: https://arxiv.org/abs/2507.04192
Authors: Honghui Du, QiZhi He
Categories: Machine Learning (cs.LG)
Abstract:Differentiable programming that enables automatic differentiation through simulation pipelines has emerged as a powerful paradigm in scientific computing, supporting both forward and inverse modeling and facilitating integration with deep learning frameworks. We present JAX-MPM, a general-purpose differentiable meshfree solver within a hybrid Lagrangian-Eulerian framework, tailored for simulating complex continuum mechanics involving large deformations, frictional contact, and inelastic material behavior, with emphasis on geomechanics and geophysical hazard applications. Built on the material point method (MPM) and implemented using the JAX computing framework, JAX-MPM is fully differentiable and GPU-accelerated, enabling efficient gradient-based optimization directly through time-stepping solvers. It supports joint training of physical models and neural networks, allowing the learning of embedded closures and neural constitutive models. We validate JAX-MPM on several 2D and 3D benchmarks, including dam-breaks and granular collapses, demonstrating its accuracy and performance. A high-resolution 3D granular cylinder collapse with 2.7 million particles completes 1000 steps in ~22 seconds (single precision) and ~98 seconds (double precision) on a single GPU. Beyond forward modeling, we demonstrate inverse modeling capabilities such as velocity field reconstruction and spatially varying friction estimation. These results establish JAX-MPM as a unified, scalable platform for differentiable meshfree simulation and data-driven geomechanical inference.
[LG-61] skfolio: Portfolio Optimization in Python
Link: https://arxiv.org/abs/2507.04176
Authors: Carlo Nicolini, Matteo Manzi, Hugo Delatte
Categories: Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
Comments: 7 pages
Abstract:Portfolio optimization is a fundamental challenge in quantitative finance, requiring robust computational tools that integrate statistical rigor with practical implementation. We present skfolio, an open-source Python library for portfolio construction and risk management that seamlessly integrates with the scikit-learn ecosystem. skfolio provides a unified framework for diverse allocation strategies, from classical mean-variance optimization to modern clustering-based methods, state-of-the-art financial estimators with native interfaces, and advanced cross-validation techniques tailored for financial time series. By adhering to scikit-learn’s fit-predict-transform paradigm, the library enables researchers and practitioners to leverage machine learning workflows for portfolio optimization, promoting reproducibility and transparency in quantitative finance.
[LG-62] Graph Neural Networks as a Substitute for Transformers in Single-Cell Transcriptomics
Link: https://arxiv.org/abs/2507.04125
Authors: Jiaxin Qi, Yan Cui, Jinli Ou, Jianqiang Huang, Gaogang Xie
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Comments: 9 pages, 5 figures
Abstract:Graph Neural Networks (GNNs) and Transformers share significant similarities in their encoding strategies for interacting with features from nodes of interest, where Transformers use query-key scores and GNNs use edges. Compared to GNNs, which are unable to encode relative positions, Transformers leverage dynamic attention capabilities to better represent relative relationships, thereby becoming the standard backbones in large-scale sequential pre-training. However, the subtle difference prompts us to consider: if positions are no longer crucial, could we substitute Transformers with Graph Neural Networks in some fields such as Single-Cell Transcriptomics? In this paper, we first explore the similarities and differences between GNNs and Transformers, specifically in terms of relative positions. Additionally, we design a synthetic example to illustrate their equivalence where there are no relative positions between tokens in the sample. Finally, we conduct extensive experiments on a large-scale position-agnostic dataset, single-cell transcriptomics, finding that GNNs achieve competitive performance compared to Transformers while consuming fewer computation resources. These findings provide novel insights for researchers in the field of single-cell transcriptomics, challenging the prevailing notion that the Transformer is always the optimum choice.
[LG-63] Benchmarking Stochastic Approximation Algorithms for Fairness-Constrained Training of Deep Neural Networks
Link: https://arxiv.org/abs/2507.04033
Authors: Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Abstract:The ability to train Deep Neural Networks (DNNs) with constraints is instrumental in improving the fairness of modern machine-learning models. Many algorithms have been analysed in recent years, and yet there is no standard, widely accepted method for the constrained training of DNNs. In this paper, we provide a challenging benchmark of real-world large-scale fairness-constrained learning tasks, built on top of the US Census (Folktables). We point out the theoretical challenges of such tasks and review the main approaches in stochastic approximation algorithms. Finally, we demonstrate the use of the benchmark by implementing and comparing three recently proposed, but as-of-yet unimplemented, algorithms both in terms of optimization performance, and fairness improvement. We release the code of the benchmark as a Python package at this https URL.
[LG-64] Commute Networks as a Signature of Urban Socioeconomic Performance: Evaluating Mobility Structures with Deep Learning Models
Link: https://arxiv.org/abs/2507.04027
Authors: Devashish Khulbe, Alexander Belyi, Stanislav Sobolevsky
Categories: Machine Learning (cs.LG)
Abstract:Urban socioeconomic modeling has predominantly concentrated on extensive location and neighborhood-based features, relying on the localized population footprint. However, networks in urban systems are common, and many urban modeling methods don’t account for network-based effects. In this study, we propose using commute information records from the census as a reliable and comprehensive source to construct mobility networks across cities. Leveraging deep learning architectures, we employ these commute networks across U.S. metro areas for socioeconomic modeling. We show that mobility network structures provide significant predictive performance without considering any node features. Consequently, we use mobility networks to present a supervised learning framework to model a city’s socioeconomic indicator directly, combining Graph Neural Network and Vanilla Neural Network models to learn all parameters in a single learning pipeline. Our experiments in 12 major U.S. cities show the proposed model outperforms previous conventional machine learning models. This work provides urban researchers methods to incorporate network effects in urban modeling and informs stakeholders of wider network-based effects in urban policymaking and planning.
[LG-65] Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation
Link: https://arxiv.org/abs/2507.04003
Authors: Patryk Bartkowiak,Filip Graliński
Categories: Machine Learning (cs.LG)
Comments: Published at “The 1st Joint Workshop on Large Language Models and Structure Modeling”
Abstract:Transformer-based models have demonstrated significant success in various source code representation tasks. Nonetheless, traditional positional embeddings employed by these models inadequately capture the hierarchical structure intrinsic to source code, typically represented as Abstract Syntax Trees (ASTs). To address this, we propose a novel tree-based positional embedding approach that explicitly encodes hierarchical relationships derived from ASTs, including node depth and sibling indices. These hierarchical embeddings are integrated into the transformer architecture, specifically enhancing the CodeBERTa model. We thoroughly evaluate our proposed model through masked language modeling (MLM) pretraining and clone detection fine-tuning tasks. Experimental results indicate that our Tree-Enhanced CodeBERTa consistently surpasses the baseline model in terms of loss, accuracy, F1 score, precision, and recall, emphasizing the importance of incorporating explicit structural information into transformer-based representations of source code.
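The abstract names two structural quantities taken from the AST: node depth and sibling index. A minimal sketch of how they can be read off a tree, assuming Python's built-in `ast` module as a stand-in parser (the abstract does not say which parser or language the authors use; `tree_positions` is a hypothetical helper name):

```python
import ast


def tree_positions(source):
    """Return (node type, depth, sibling index) for every AST node, in pre-order."""
    root = ast.parse(source)
    rows = []

    def visit(node, depth, sibling_index):
        rows.append((type(node).__name__, depth, sibling_index))
        # The sibling index is the position of a child among its parent's children.
        for i, child in enumerate(ast.iter_child_nodes(node)):
            visit(child, depth + 1, i)

    visit(root, 0, 0)
    return rows


rows = tree_positions("x = a + b")
for name, depth, sibling in rows:
    print(f"{'  ' * depth}{name} depth={depth} sibling={sibling}")
```

In a model, each (depth, sibling index) pair would typically index a learned embedding table whose output is combined with the token embedding. How the paper combines them is not stated in the abstract.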
[LG-66] Fast Re-Trainable Attention Autoencoder for Liquid Sensor Anomaly Detection at the Edge
Link: https://arxiv.org/abs/2507.03995
Authors: Seongyun Choi
Categories: Machine Learning (cs.LG)
Comments: 11 pages, 2 figures
Abstract:A lightweight, edge-deployable pipeline is proposed for detecting sensor anomalies in chemistry and biology laboratories. A custom PCB captures seven sensor channels and streams them over the local network. An Attention-based One-Class Autoencoder reaches a usable state after training on only thirty minutes of normal data. Despite the small data set, the model already attains an F1 score of 0.72, a precision of 0.89, and a recall of 0.61 when tested on synthetic micro-anomalies. The trained network is converted into a TensorFlow-Lite binary of about 31 kB and runs on an Advantech ARK-1221L, a fan-less x86 edge device without AVX instructions; end-to-end inference latency stays below two seconds. The entire collect-train-deploy workflow finishes within one hour, which demonstrates that the pipeline adapts quickly whenever a new liquid or sensor is introduced.
[LG-67] MalVol-25: A Diverse Labelled and Detailed Volatile Memory Dataset for Malware Detection and Response Testing and Validation
Link: https://arxiv.org/abs/2507.03993
Authors: Dipo Dunsin,Mohamed Chahine Ghanem,Eduardo Almeida Palmieri
Categories: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: 6 pages
Abstract:This paper addresses the critical need for high-quality malware datasets that support advanced analysis techniques, particularly machine learning and agentic AI frameworks. Existing datasets often lack diversity, comprehensive labelling, and the complexity necessary for effective machine learning and agent-based AI training. To fill this gap, we developed a systematic approach for generating a dataset that combines automated malware execution in controlled virtual environments with dynamic monitoring tools. The resulting dataset comprises clean and infected memory snapshots across multiple malware families and operating systems, capturing detailed behavioural and environmental features. Key design decisions include applying ethical and legal compliance, thorough validation using both automated and manual methods, and comprehensive documentation to ensure replicability and integrity. The dataset’s distinctive features enable modelling system states and transitions, facilitating RL-based malware detection and response strategies. This resource is significant for advancing adaptive cybersecurity defences and digital forensic research. Its scope supports diverse malware scenarios and offers potential for broader applications in incident response and automated threat mitigation.
[LG-68] Graph Collaborative Attention Network for Link Prediction in Knowledge Graphs
Link: https://arxiv.org/abs/2507.03947
Authors: Thanh Hoang-Minh (Department of Information Technology, VNUHCM - University of Science)
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Knowledge graphs offer a structured representation of real-world entities and their relationships, enabling a wide range of applications from information retrieval to automated reasoning. In this paper, we conduct a systematic comparison between traditional rule-based approaches and modern deep learning methods for link prediction. We focus on KBGAT, a graph neural network model that leverages multi-head attention to jointly encode both entity and relation features within local neighborhood structures. To advance this line of research, we introduce GCAT (Graph Collaborative Attention Network), a refined model that enhances context aggregation and interaction between heterogeneous nodes. Experimental results on four widely-used benchmark datasets demonstrate that GCAT not only consistently outperforms rule-based methods but also achieves competitive or superior performance compared to existing neural embedding models. Our findings highlight the advantages of attention-based architectures in capturing complex relational patterns for knowledge graph completion tasks.
[LG-69] MCST-Mamba: Multivariate Mamba-Based Model for Traffic Prediction
Link: https://arxiv.org/abs/2507.03927
Authors: Mohamed Hamad,Mohamed Mabrok,Nizar Zorba
Categories: Machine Learning (cs.LG)
Comments: Submitted to the Communications Software and Multimedia track of the 2025 IEEE Global Communications Conference
Abstract:Accurate traffic prediction plays a vital role in intelligent transportation systems by enabling efficient routing, congestion mitigation, and proactive traffic control. However, forecasting is challenging due to the combined effects of dynamic road conditions, varying traffic patterns across different locations, and external influences such as weather and accidents. Traffic data often consists of several interrelated measurements - such as speed, flow and occupancy - yet many deep-learning approaches either predict only one of these variables or require a separate model for each. This limits their ability to capture joint patterns across channels. To address this, we introduce the Multi-Channel Spatio-Temporal (MCST) Mamba model, a forecasting framework built on the Mamba selective state-space architecture that natively handles multivariate inputs and simultaneously models all traffic features. The proposed MCST-Mamba model integrates adaptive spatio-temporal embeddings and separates the modeling of temporal sequences and spatial sensor interactions into two dedicated Mamba blocks, improving representation learning. Unlike prior methods that evaluate on a single channel, we assess MCST-Mamba across all traffic features at once, aligning more closely with how congestion arises in practice. Our results show that MCST-Mamba achieves strong predictive performance with a lower parameter count compared to baseline models.
[LG-70] Combining Graph Neural Networks and Mixed Integer Linear Programming for Molecular Inference under the Two-Layered Model
Link: https://arxiv.org/abs/2507.03920
Authors: Jianshen Zhu,Naveed Ahmed Azam,Kazuya Haraguchi,Liang Zhao,Tatsuya Akutsu
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: arXiv admin note: substantial text overlap with arXiv:2107.02381, arXiv:2109.02628
Abstract:Recently, a novel two-phase framework named mol-infer for inference of chemical compounds with prescribed abstract structures and desired property values has been proposed. The framework mol-infer is primarily based on using mixed integer linear programming (MILP) to simulate the computational process of machine learning methods and describe the necessary and sufficient conditions to ensure such a chemical graph exists. The existing approaches usually first convert the chemical compounds into handcrafted feature vectors to construct prediction functions, but because of the limit on the kinds of descriptors originated from the need for tractability in the MILP formulation, the learning performances on datasets of some properties are not good enough. A lack of good learning performance can greatly lower the quality of the inferred chemical graphs, and thus improving learning performance is of great importance. On the other hand, graph neural networks (GNN) offer a promising machine learning method to directly utilize the chemical graphs as the input, and many existing GNN-based approaches to the molecular property prediction problem have shown that they can enjoy better learning performances compared to the traditional approaches that are based on feature vectors. In this study, we develop a molecular inference framework based on mol-infer, namely mol-infer-GNN, that utilizes GNN as the learning method while keeping the great flexibility originated from the two-layered model on the abstract structure of the chemical graph to be inferred. We conducted computational experiments on the QM9 dataset to show that our proposed GNN model can obtain satisfying learning performances for some properties despite its simple structure, and can infer small chemical graphs comprising up to 20 non-hydrogen atoms within reasonable computational time.
[LG-71] Return of the Latent Space COWBOYS: Re-thinking the use of VAEs for Bayesian Optimisation of Structured Spaces ICML
Link: https://arxiv.org/abs/2507.03910
Authors: Henry B. Moss,Sebastian W. Ober,Tom Diethe
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted for publication at the International Conference on Machine Learning (ICML) 2025
Abstract:Bayesian optimisation in the latent space of a Variational AutoEncoder (VAE) is a powerful framework for optimisation tasks over complex structured domains, such as the space of scientifically interesting molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks, which in turn has led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a Gaussian Process (GP) surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths – structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.
[LG-72] Risk-sensitive Actor-Critic with Static Spectral Risk Measures for Online and Offline Reinforcement Learning
Link: https://arxiv.org/abs/2507.03900
Authors: Mehrdad Moghimi,Hyejin Ku
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The development of Distributional Reinforcement Learning (DRL) has introduced a natural way to incorporate risk sensitivity into value-based and actor-critic methods by employing risk measures other than expectation in the value function. While this approach is widely adopted in many online and offline RL algorithms due to its simplicity, the naive integration of risk measures often results in suboptimal policies. This limitation can be particularly harmful in scenarios where the need for effective risk-sensitive policies is critical and worst-case outcomes carry severe consequences. To address this challenge, we propose a novel framework for optimizing static Spectral Risk Measures (SRM), a flexible family of risk measures that generalizes objectives such as CVaR and Mean-CVaR, and enables the tailoring of risk preferences. Our method is applicable to both online and offline RL algorithms. We establish theoretical guarantees by proving convergence in the finite state-action setting. Moreover, through extensive empirical evaluations, we demonstrate that our algorithms consistently outperform existing risk-sensitive methods in both online and offline environments across diverse domains.
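To make the relationship between spectral risk measures, CVaR, and Mean-CVaR concrete, here is a minimal sketch of the standard empirical estimator: a weighted average of sorted samples with non-increasing weights. It assumes a returns convention (larger is better, so the worst outcomes come first after sorting). The function names and sample data are hypothetical, and this is the textbook definition, not the paper's actor-critic algorithm.

```python
from typing import List


def cvar_weights(n: int, alpha: float) -> List[float]:
    """Quantile weights for CVaR at level alpha: mass 1/alpha on the worst alpha-fraction."""
    return [(min((i + 1) / n, alpha) - min(i / n, alpha)) / alpha for i in range(n)]


def mean_weights(n: int) -> List[float]:
    """Uniform weights, which recover the plain expectation."""
    return [1.0 / n] * n


def mix_weights(a: List[float], b: List[float], lam: float) -> List[float]:
    """Convex combination of two weight vectors (e.g. Mean-CVaR)."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def spectral_risk(returns: List[float], weights: List[float]) -> float:
    """Weighted average of the sorted returns; weights apply from worst to best."""
    ordered = sorted(returns)
    return sum(w * z for w, z in zip(weights, ordered))


returns = [-10.0, -2.0, 0.0, 1.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
n = len(returns)
mean = spectral_risk(returns, mean_weights(n))
cvar = spectral_risk(returns, cvar_weights(n, alpha=0.2))
mean_cvar = spectral_risk(
    returns, mix_weights(mean_weights(n), cvar_weights(n, 0.2), lam=0.5)
)
print(mean, cvar, mean_cvar)
```

Any non-negative, non-increasing weight vector that sums to one defines a spectral risk measure in this convention, which is what lets the family cover both CVaR and Mean-CVaR.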
[LG-73] GenAI-Powered Inference
Link: https://arxiv.org/abs/2507.03897
Authors: Kosuke Imai,Kentaro Nakamura
Categories: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:
Abstract:We introduce GenAI-Powered Inference (GPI), a statistical framework for both causal and predictive inference using unstructured data, including text and images. GPI leverages open-source Generative Artificial Intelligence (GenAI) models - such as large language models and diffusion models - not only to generate unstructured data at scale but also to extract low-dimensional representations that capture their underlying structure. Applying machine learning to these representations, GPI enables estimation of causal and predictive effects while quantifying associated estimation uncertainty. Unlike existing approaches to representation learning, GPI does not require fine-tuning of generative models, making it computationally efficient and broadly accessible. We illustrate the versatility of the GPI framework through three applications: (1) analyzing Chinese social media censorship, (2) estimating predictive effects of candidates’ facial appearance on electoral outcomes, and (3) assessing the persuasiveness of political rhetoric. An open-source software package is available for implementing GPI.
[LG-74] Unraveling the Black-box Magic: An Analysis of Neural Networks Dynamic Local Extrema
Link: https://arxiv.org/abs/2507.03885
Authors: Shengjian Chen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 19 pages, 8 figures
Abstract:We point out that neural networks are not black boxes, and their generalization stems from the ability to dynamically map a dataset to the local extrema of the model function. We further prove that the number of local extrema in a neural network is positively correlated with the number of its parameters, and on this basis, we give a new algorithm that is different from the back-propagation algorithm, which we call the extremum-increment algorithm. Some difficult situations, such as gradient vanishing and overfitting, can be reasonably explained and dealt with in this framework.
[LG-75] Continual Recommender Systems
Link: https://arxiv.org/abs/2507.03861
Authors: Hyunsik Yoo,SeongKu Kang,Hanghang Tong
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Modern recommender systems operate in uniquely dynamic settings: user interests, item pools, and popularity trends shift continuously, and models must adapt in real time without forgetting past preferences. While existing tutorials on continual or lifelong learning cover broad machine learning domains (e.g., vision and graphs), they do not address recommendation-specific demands, such as balancing stability and plasticity per user, handling cold-start items, and optimizing recommendation metrics under streaming feedback. This tutorial aims to make a timely contribution by filling that gap. We begin by reviewing the background and problem settings, followed by a comprehensive overview of existing approaches. We then highlight recent efforts to apply continual learning to practical deployment environments, such as resource-constrained systems and sequential interaction settings. Finally, we discuss open challenges and future research directions. We expect this tutorial to benefit researchers and practitioners in recommender systems, data mining, AI, and information retrieval across academia and industry.
[LG-76] Taylor-Model Physics-Informed Neural Networks (PINNs) for Ordinary Differential Equations
Link: https://arxiv.org/abs/2507.03860
Authors: Chandra Kanth Nagesh,Sriram Sankaranarayanan,Ramneet Kaur,Tuhin Sahai,Susmit Jha
Categories: Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: 22 pages, 13 figures, 4 tables, Neuro-symbolic Systems 2025
Abstract:We study the problem of learning neural network models for Ordinary Differential Equations (ODEs) with parametric uncertainties. Such neural network models capture the solution to the ODE over a given set of parameters, initial conditions, and range of times. Physics-Informed Neural Networks (PINNs) have emerged as a promising approach for learning such models that combine data-driven deep learning with symbolic physics models in a principled manner. However, the accuracy of PINNs degrades when they are used to solve an entire family of initial value problems characterized by varying parameters and initial conditions. In this paper, we combine symbolic differentiation and Taylor series methods to propose a class of higher-order models for capturing the solutions to ODEs. These models combine neural networks and symbolic terms: they use higher order Lie derivatives and a Taylor series expansion obtained symbolically, with the remainder term modeled as a neural network. The key insight is that the remainder term can itself be modeled as a solution to a first-order ODE. We show how the use of these higher order PINNs can improve accuracy using interesting, but challenging ODE benchmarks. We also show that the resulting model can be quite useful for situations such as controlling uncertain physical systems modeled as ODEs.
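In symbols, the Taylor-plus-remainder construction can be sketched as follows, assuming an autonomous ODE $\dot{x} = f(x;\theta)$ with initial condition $x(0) = x_0$. The notation ($L_f$, $R_n$, the truncation order $n$) is chosen here for illustration and may differ from the paper's.

```latex
\[
  L_f^{0} g = g, \qquad
  L_f^{k+1} g = \nabla\!\bigl(L_f^{k} g\bigr) \cdot f, \qquad
  \frac{d^{k}}{dt^{k}}\, g\bigl(x(t)\bigr) = \bigl(L_f^{k} g\bigr)\bigl(x(t)\bigr),
\]
\[
  x(t) \;=\; \sum_{k=0}^{n} \frac{t^{k}}{k!}\,\bigl(L_f^{k}\,\mathrm{id}\bigr)(x_0)
  \;+\; R_n(t;\, x_0, \theta).
\]
```

The sum is the part that can be computed symbolically, with the Lie derivatives applied to each component of the identity map. Per the abstract, the remainder $R_n$ is the part modeled by a neural network.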
[LG-77] Transformer with Koopman-Enhanced Graph Convolutional Network for Spatiotemporal Dynamics Forecasting
Link: https://arxiv.org/abs/2507.03855
Authors: Zekai Wang,Bing Yao
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Spatiotemporal dynamics forecasting is inherently challenging, particularly in systems defined over irregular geometric domains, due to the need to jointly capture complex spatial correlations and nonlinear temporal dynamics. To tackle these challenges, we propose TK-GCN, a two-stage framework that integrates geometry-aware spatial encoding with long-range temporal modeling. In the first stage, a Koopman-enhanced Graph Convolutional Network (K-GCN) is developed to embed the high-dimensional dynamics distributed on spatially irregular domains into a latent space where the evolution of system states is approximately linear. By leveraging Koopman operator theory, this stage enhances the temporal consistency during the latent learning. In the second stage, a Transformer module is employed to model the temporal progression within the Koopman-encoded latent space. Through the self-attention mechanism, the Transformer captures long-range temporal dependencies, enabling accurate forecasting over extended horizons. We evaluate TK-GCN in spatiotemporal cardiac dynamics forecasting and benchmark its performance against several state-of-the-art baselines. Experimental results and ablation studies show that TK-GCN consistently delivers superior predictive accuracy across a range of forecast horizons, demonstrating its capability to effectively model complex spatial structures and nonlinear temporal dynamics.
[LG-78] Latent FxLMS: Accelerating Active Noise Control with Neural Adaptive Filters
Link: https://arxiv.org/abs/2507.03854
Authors: Kanad Sarkar,Austin Lu,Manan Mittal,Yongjie Zhuang,Ryan Corey,Andrew Singer
Categories: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO); Machine Learning (stat.ML)
Comments: 8 pages, Submitted at Forum Acusticum Euronoise 2025
Abstract:Filtered-X LMS (FxLMS) is commonly used for active noise control (ANC), wherein the soundfield is minimized at a desired location. Given prior knowledge of the spatial region of the noise or control sources, we could improve FxLMS by adapting along the low-dimensional manifold of possible adaptive filter weights. We train an auto-encoder on the filter coefficients of the steady-state adaptive filter for each primary source location sampled from a given spatial region and constrain the weights of the adaptive filter to be the output of the decoder for a given state of latent variables. Then, we perform updates in the latent space and use the decoder to generate the cancellation filter. We evaluate how various neural network constraints and normalization techniques impact the convergence speed and steady-state mean squared error. Under certain conditions, our Latent FxLMS model converges in fewer steps with comparable steady-state error to the standard FxLMS.
[LG-79] OrbitAll: A Unified Quantum Mechanical Representation Deep Learning Framework for All Molecular Systems
Link: https://arxiv.org/abs/2507.03853
Authors: Beom Seok Kang,Vignesh C. Bhethanabotla,Amin Tavakoli,Maurice D. Hanisch,William A. Goddard III,Anima Anandkumar
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments:
Abstract:Despite the success of deep learning methods in quantum chemistry, their representational capacity is most often confined to neutral, closed-shell molecules. However, real-world chemical systems often exhibit complex characteristics, including varying charges, spins, and environments. We introduce OrbitAll, a geometry- and physics-informed deep learning framework that can represent all molecular systems with electronic structure information. OrbitAll utilizes spin-polarized orbital features from the underlying quantum mechanical method, and combines it with graph neural networks satisfying SE(3)-equivariance. The resulting framework can represent and process any molecular system with arbitrary charges, spins, and environmental effects. OrbitAll demonstrates superior performance and generalization on predicting charged, open-shell, and solvated molecules, while also robustly extrapolating to molecules significantly larger than the training data by leveraging a physics-informed architecture. OrbitAll achieves chemical accuracy using 10 times fewer training data than competing AI models, with a speedup of approximately 10^3 - 10^4 compared to density functional theory.
[LG-80] Regularizing Log-Linear Cost Models for Inpatient Stays by Merging ICD-10 Codes
Link: https://arxiv.org/abs/2507.03843
Authors: Chi-Ken Lu,David Alonge,Nicole Richardson,Bruno Richard
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: Submitted to MLHC 2025
Abstract:Cost models in healthcare research must balance interpretability, accuracy, and parameter consistency. However, interpretable models often struggle to achieve both accuracy and consistency. Ordinary least squares (OLS) models for high-dimensional regression can be accurate but fail to produce stable regression coefficients over time when using highly granular ICD-10 diagnostic codes as predictors. This instability arises because many ICD-10 codes are infrequent in healthcare datasets. While regularization methods such as Ridge can address this issue, they risk discarding important predictors. Here, we demonstrate that reducing the granularity of ICD-10 codes is an effective regularization strategy within OLS while preserving the representation of all diagnostic code categories. By truncating ICD-10 codes from seven characters (e.g., T67.0XXA, T67.0XXD) to six (e.g., T67.0XX) or fewer, we reduce the dimensionality of the regression problem while maintaining model interpretability and consistency. Mathematically, the merging of predictors in OLS leads to increased trace of the Hessian matrix, which reduces the variance of coefficient estimation. Our findings explain why broader diagnostic groupings like DRGs and HCC codes are favored over highly granular ICD-10 codes in real-world risk adjustment and cost models.
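The truncation step described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming codes are written with the conventional dot after the third character. `truncate_icd10` and `merge_counts` are hypothetical helper names, and the sample list is made up for the example.

```python
from collections import Counter


def truncate_icd10(code, keep=6):
    """Keep the first `keep` characters of an ICD-10 code, not counting the dot."""
    chars = code.replace(".", "")[:keep]
    # Re-insert the dot after the three-character category when anything follows it.
    return chars if len(chars) <= 3 else f"{chars[:3]}.{chars[3:]}"


def merge_counts(codes, keep=6):
    """Count occurrences per truncated code, i.e. sum the merged indicator columns."""
    merged = Counter()
    for code in codes:
        merged[truncate_icd10(code, keep)] += 1
    return merged


codes = ["T67.0XXA", "T67.0XXD", "E11.9"]
merged = merge_counts(codes)
print(merged)
```

Merging in this way replaces several sparse indicator columns with one denser column. That is the mechanism the abstract credits for more stable coefficient estimates.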
[LG-81] Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction
Link: https://arxiv.org/abs/2507.03840
Authors: Manasa Kaniselvan,Alexander Maeder,Chen Hao Xia,Alexandros Nikolaos Ziogas,Mathieu Luisier
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 13 pages, 8 figures
Abstract:Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales, enabling investigation of the electronic properties of materials with extended defects, interfaces, or exhibiting disordered phases. However, as interactions between atomic orbitals typically extend over 10+ angstroms, the graph representations required for this task tend to be densely connected, and the memory requirements to perform training and inference on these large structures can exceed the limits of modern GPUs. Here we present a distributed eGNN implementation which leverages direct GPU communication and introduce a partitioning strategy of the input graph to reduce the number of embedding exchanges between GPUs. Our implementation shows strong scaling up to 128 GPUs, and weak scaling up to 512 GPUs with 87% parallel efficiency for structures with 3,000 to 190,000 atoms on the Alps supercomputer.
[LG-82] MatRL: Provably Generalizable Iterative Algorithm Discovery via Monte-Carlo Tree Search
Link: https://arxiv.org/abs/2507.03833
Authors: Sungyoon Kim,Rajat Vadiraj Dwaraknath,Longling Geng,Mert Pilanci
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Iterative methods for computing matrix functions have been extensively studied and their convergence speed can be significantly improved with the right tuning of parameters and by mixing different iteration types. Handtuning the design options for optimal performance can be cumbersome, especially in modern computing environments: numerous different classical iterations and their variants exist, each with non-trivial per-step cost and tuning parameters. To this end, we propose MatRL – a reinforcement learning based framework that automatically discovers iterative algorithms for computing matrix functions. The key idea is to treat algorithm design as a sequential decision-making process. Monte-Carlo tree search is then used to plan a hybrid sequence of matrix iterations and step sizes, tailored to a specific input matrix distribution and computing environment. Moreover, we also show that the learned algorithms provably generalize to sufficiently large matrices drawn from the same distribution. Finally, we corroborate our theoretical results with numerical experiments demonstrating that MatRL produces algorithms that outperform various baselines in the literature.
[LG-83] IMPACT: Importance-Aware Activation Space Reconstruction
Link: https://arxiv.org/abs/2507.03828
Authors: Md Mokarram Chowdhury,Daniel Agyei Asante,Ernie Chang,Yang Li
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Large language models (LLMs) achieve strong performance across many domains but are difficult to deploy in resource-constrained settings due to their size. Low-rank weight matrix compression is a popular strategy for reducing model size, typically by minimizing weight reconstruction error under the assumption that weights are low-rank. However, this assumption often does not hold in LLMs. Instead, LLM activations exhibit stronger low-rank structure, prompting a shift toward minimizing activation reconstruction error. We show that this shift alone is insufficient: activation dimensions contribute unequally to model performance, and uniform reconstruction can harm performance. We propose IMPACT, a principled framework for importance-aware activation reconstruction that links model compression decisions to their impact on model behavior. IMPACT formulates an optimization problem that considers both activation structure and gradient sensitivity, and derives a closed-form solution where the optimal reconstruction bases are the eigenvectors of an importance-weighted activation covariance matrix. This enables low-rank approximations explicitly optimized to preserve accuracy. Experiments across diverse models and tasks show that IMPACT achieves up to 48.6% greater model size reduction with accuracy comparable to state-of-the-art baselines.
[LG-84] Coil Geometry Learning for Short-Range Magnetic Actuation
Link: https://arxiv.org/abs/2507.03806
Authors: Yuta Takahashi,Hayate Tajima,Shin-ichiro Sakai
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Fuel-free docking is a key operational technology for in-space assembly, resupplying space stations, sample return missions, and formation keeping of large-scale satellite swarms. The use of conventional propulsion systems, including thrusters, can cause adverse effects at short distances, such as sensor contamination, which may lead to the failure of the satellite or onboard equipment. The magnetic field interaction control generated by magnetorquers can overcome these weaknesses of propulsion. This actuation enables simultaneous control of attitude and formation control among desired satellite groups. The previous study typically uses the traditional dipole approximation model of the exact magnetic field to reduce computation cost. However, proximity operations often involve relatively short distances between satellites, which can easily compromise the effectiveness of this approximation. To avoid model errors that could result in satellite collisions, we utilize a magnetic field model described by Biot-Savart’s law, without distance approximations (Near-field model), in consideration of short-distance operations. To overcome the high computational cost associated with the coil geometry and relative states information, a learning-based magnetic field approximation is derived, and its effectiveness is shown in the docking simulation of target and chaser satellites equipped with electromagnetic coils on three axes. Our method significantly reduces the computational cost of the exact magnetic model and possesses scalability that can accommodate an increasing number of target satellites through parallel processing.
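The breakdown of the dipole approximation at short range can be illustrated with a textbook special case: the field on the axis of a single circular current loop, where the Biot-Savart integral has a closed form. This is only a sketch under that assumption. The loop geometry, function names, and sample distances are chosen here for illustration and are not the paper's coil model or its learned approximation.

```python
import math

MU0 = 4e-7 * math.pi  # vacuum permeability in T*m/A (approximate)


def loop_field_on_axis(current, radius, z):
    """Exact Biot-Savart field on the axis of a circular loop, at distance z from its center."""
    return MU0 * current * radius**2 / (2.0 * (radius**2 + z**2) ** 1.5)


def dipole_field_on_axis(current, radius, z):
    """Point-dipole approximation of the same field, with moment m = I * pi * R^2."""
    moment = current * math.pi * radius**2
    return MU0 * moment / (2.0 * math.pi * z**3)


def overestimate_ratio(radius, z, current=1.0):
    """How many times larger the dipole estimate is than the exact on-axis field."""
    return dipole_field_on_axis(current, radius, z) / loop_field_on_axis(current, radius, z)


for z_over_r in (1.0, 2.0, 5.0, 10.0):
    ratio = overestimate_ratio(radius=1.0, z=z_over_r)
    print(f"z/R = {z_over_r:4.1f}: dipole / exact = {ratio:.3f}")
```

On the axis the ratio equals (1 + R²/z²)^(3/2), so the dipole model is close at ten radii but overestimates the field by nearly a factor of three at one radius. This is the regime the abstract's near-field model targets.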
[LG-85] Efficient and Effective Query Context-Aware Learning-to-Rank Model for Sequential Recommendation
Link: https://arxiv.org/abs/2507.03789
Authors: Andrii Dzhoha,Alisa Mironenko,Vladimir Vlasov,Maarten Versteegh,Marjan Celikik
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Modern sequential recommender systems commonly use transformer-based models for next-item prediction. While these models demonstrate a strong balance between efficiency and quality, integrating interleaving features - such as the query context (e.g., browse category) under which next-item interactions occur - poses challenges. Effectively capturing query context is crucial for refining ranking relevance and enhancing user engagement, as it provides valuable signals about user intent within a session. Unlike an item’s features, query context is not temporally aligned with the item sequence, making its incorporation into transformers challenging and error-prone. This paper analyzes different strategies for incorporating query context into transformers trained with a causal language modeling procedure as a case study. We propose a new method that effectively fuses the item sequence with query context within the attention mechanism. Through extensive offline and online experiments on a large-scale online platform and open datasets, we present evidence that our proposed method is an effective approach for integrating query context to improve model ranking quality in terms of relevance and diversity.
[LG-86] Effective Capacitance Modeling Using Graph Neural Networks
Link: https://arxiv.org/abs/2507.03787
Authors: Eren Dogan, Matthew R. Guthaus
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 5 figures, 6 tables
Abstract:Static timing analysis is a crucial stage in the VLSI design flow that verifies the timing correctness of circuits. Timing analysis depends on the placement and routing of the design, but at the same time, placement and routing efficiency depend on the final timing performance. VLSI design flows can benefit from timing-related prediction to better perform the earlier stages of the design flow. Effective capacitance is an essential input for gate delay calculation, and finding exact values requires routing or routing estimates. In this work, we propose the first GNN-based post-layout effective capacitance modeling method, GNN-Ceff, that achieves significant speed gains due to GPU parallelization while also providing better accuracy than current heuristics. GNN-Ceff parallelization achieves 929x speedup on real-life benchmarks over the state-of-the-art method run serially.
[LG-87] Skewed Score: A statistical framework to assess autograders
Link: https://arxiv.org/abs/2507.03772
Authors: Magda Dubois, Harry Coppock, Mario Giulianelli, Lennart Luettgau, Cozmin Ududec
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as “LLM-as-a-judge”, or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, and other factors. In this paper we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while also addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional reliability metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying the source of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
[LG-88] FAROS: Fair Graph Generation via Attribute Switching Mechanisms
Link: https://arxiv.org/abs/2507.03728
Authors: Abdennacer Badaoui, Oussama Kharouiche, Hatim Mrabet, Daniele Malitesta, Fragkiskos D. Malliaros
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Recent advancements in graph diffusion models (GDMs) have enabled the synthesis of realistic network structures, yet ensuring fairness in the generated data remains a critical challenge. Existing solutions attempt to mitigate bias by re-training the GDMs with ad-hoc fairness constraints. Conversely, with this work, we propose FAROS, a novel FAir graph geneRatiOn framework leveraging attribute Switching mechanisms and directly running in the generation process of the pre-trained GDM. Technically, our approach works by altering nodes’ sensitive attributes during the generation. To this end, FAROS calculates the optimal fraction of switching nodes, and selects the diffusion step to perform the switch by setting tailored multi-criteria constraints to preserve the node-topology profile from the original distribution (a proxy for accuracy) while ensuring the edge independence on the sensitive attributes for the generated graph (a proxy for fairness). Our experiments on benchmark datasets for link prediction demonstrate that the proposed approach effectively reduces fairness discrepancies while maintaining comparable (or even higher) accuracy performance to other similar baselines. Notably, FAROS also strikes a better accuracy-fairness trade-off than other competitors in some of the tested settings under the Pareto optimality concept, demonstrating the effectiveness of the imposed multi-criteria constraints.
[LG-89] CosmoBench: A Multiscale Multiview Multitask Cosmology Benchmark for Geometric Deep Learning
Link: https://arxiv.org/abs/2507.03707
Authors: Ningyuan Huang, Richard Stiskalek, Jun-Young Lee, Adrian E. Bayer, Charles C. Margossian, Christian Kragh Jespersen, Lucia A. Perez, Lawrence K. Saul, Francisco Villaescusa-Navarro
Categories: Machine Learning (cs.LG); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM)
Comments:
Abstract:Cosmological simulations provide a wealth of data in the form of point clouds and directed trees. A crucial goal is to extract insights from this data that shed light on the nature and composition of the Universe. In this paper we introduce CosmoBench, a benchmark dataset curated from state-of-the-art cosmological simulations whose runs required more than 41 million core-hours and generated over two petabytes of data. CosmoBench is the largest dataset of its kind: it contains 34 thousand point clouds from simulations of dark matter halos and galaxies at three different length scales, as well as 25 thousand directed trees that record the formation history of halos on two different time scales. The data in CosmoBench can be used for multiple tasks – to predict cosmological parameters from point clouds and merger trees, to predict the velocities of individual halos and galaxies from their collective positions, and to reconstruct merger trees on finer time scales from those on coarser time scales. We provide several baselines on these tasks, some based on established approaches from cosmological modeling and others rooted in machine learning. For the latter, we study different approaches – from simple linear models that are minimally constrained by symmetries to much larger and more computationally-demanding models in deep learning, such as graph neural networks. We find that least-squares fits with a handful of invariant features sometimes outperform deep architectures with many more parameters and far longer training times. Still there remains tremendous potential to improve these baselines by combining machine learning and cosmology to fully exploit the data. CosmoBench sets the stage for bridging cosmology and geometric deep learning at scale. We invite the community to push the frontier of scientific discovery by engaging with this dataset, available at this https URL
[LG-90] Plugging Attention into Power Grids: Towards Transparent Forecasting
Link: https://arxiv.org/abs/2507.03690
Authors: Eloi Campagne, Itai Zehavi, Yvenn Amara-Ouali, Yannig Goude, Argyris Kalogeratos
Categories: Machine Learning (cs.LG)
Comments: 16 pages
Abstract:Accurate electricity consumption forecasting is crucial for ensuring grid stability and optimizing power generation, particularly in increasingly decentralized and complex systems. While classical approaches such as Generalized Additive Models (GAMs) remain widely used, they often fail to capture the spatial dependencies inherent in energy networks. Graph Neural Networks (GNNs) offer a principled framework to incorporate this structure by directly leveraging graph topologies. In this work, we evaluate a broad set of GNN architectures – including GCN, GraphSAGE, ChebConv, TAG, APPNP, TransformerConv, and Graph Attention Networks (GAT and GATv2) – on two real-world electricity consumption datasets from France and the UK. Our experiments show that while complex architectures like GATv2 and TransformerConv do not consistently outperform their simpler counterparts, models such as GCN and APPNP achieve strong results in low-data or highly disaggregated settings. Nonetheless, the vanilla GAT remains highly competitive across both datasets and offers an additional interpretability layer via attention mechanisms. We perform a temporal analysis of attention weights, revealing evolving patterns of regional interaction linked to seasonal and meteorological variability. These results highlight that, although attention is not universally superior, it provides valuable explanatory power when spatial dependencies are prominent. Finally, we benchmark ensemble-based expert aggregation strategies, showing that uniform or learned combinations can enhance robustness and outperform individual models under data heterogeneity.
[LG-91] When Network Architecture Meets Physics: Deep Operator Learning for Coupled Multiphysics
Link: https://arxiv.org/abs/2507.03660
Authors: Kazuma Kobayashi, Jaewan Park, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Scientific applications increasingly demand real-time surrogate models that can capture the behavior of strongly coupled multiphysics systems driven by multiple input functions, such as in thermo-mechanical and electro-thermal processes. While neural operator frameworks, such as Deep Operator Networks (DeepONets), have shown considerable success in single-physics settings, their extension to multiphysics problems remains poorly understood. In particular, the challenge of learning nonlinear interactions between tightly coupled physical fields has received little systematic attention. This study addresses a foundational question: should the architectural design of a neural operator reflect the strength of physical coupling it aims to model? To answer this, we present the first comprehensive, architecture-aware evaluation of DeepONet variants across three regimes: single-physics, weakly coupled, and strongly coupled multiphysics systems. We consider a reaction-diffusion equation with dual spatial inputs, a nonlinear thermo-electrical problem with bidirectional coupling through temperature-dependent conductivity, and a viscoplastic thermo-mechanical model of steel solidification governed by transient phase-driven interactions. Two operator-learning frameworks, the classical DeepONet and its sequential GRU-based extension, S-DeepONet, are benchmarked using both single-branch and multi-branch (MIONet-style) architectures. Our results demonstrate that architectural alignment with physical coupling is crucial: single-branch networks significantly outperform multi-branch counterparts in strongly coupled settings, whereas multi-branch encodings offer advantages for decoupled or single-physics problems. Once trained, these surrogates achieve full-field predictions up to 1.8e4 times faster than high-fidelity finite-element solvers, without compromising solution accuracy.
[LG-92] MLASDO: a software tool to detect and explain clinical and omics inconsistencies applied to the Parkinsons Progression Markers Initiative cohort
Link: https://arxiv.org/abs/2507.03656
Authors: José A. Pardo, Tomás Bernal, Jaime Ñiguez, Ana Luisa Gil-Martínez, Laura Ibañez, José T. Palma, Juan A. Botía, Alicia Gómez-Pascual
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 3 figures
Abstract:Inconsistencies between clinical and omics data may arise within medical cohorts. The identification, annotation and explanation of anomalous omics-based patients or individuals may become crucial to better reshape the disease, e.g., by detecting early onsets signaled by the omics and undetectable from observable symptoms. Here, we developed MLASDO (Machine Learning based Anomalous Sample Detection on Omics), a new method and software tool to identify, characterize and automatically describe anomalous samples based on omics data. Its workflow is based on three steps: (1) classification of healthy and case individuals using a support vector machine algorithm; (2) detection of anomalous samples within groups; (3) explanation of anomalous individuals based on clinical data and expert knowledge. We showcase MLASDO using transcriptomics data of 317 healthy controls (HC) and 465 Parkinson’s disease (PD) cases from the Parkinson’s Progression Markers Initiative. In this cohort, MLASDO detected 15 anomalous HC with a PD-like transcriptomic signature and PD-like clinical features, including a lower proportion of CD4/CD8 naive T-cells and CD4 memory T-cells compared to HC (P < 3.5 × 10^-3). MLASDO also identified 22 anomalous PD cases with a transcriptomic signature more similar to that of HC and some clinical features more similar to HC, including a lower proportion of mature neutrophils compared to PD cases (P < 6 × 10^-3). In summary, MLASDO is a powerful tool that can help the clinician to detect and explain anomalous HC and cases of interest to be followed up. MLASDO is an open-source R package available at: this https URL.
[LG-93] Scientific Machine Learning of Chaotic Systems Discovers Governing Equations for Neural Populations
Link: https://arxiv.org/abs/2507.03631
Authors: Anthony G. Chesebro, David Hofmann, Vaibhav Dixit, Earl K. Miller, Richard H. Granger, Alan Edelman, Christopher V. Rackauckas, Lilianne R. Mujica-Parodi, Helmut H. Strey
Categories: Machine Learning (cs.LG); Mathematical Physics (math-ph); Chaotic Dynamics (nlin.CD); Neurons and Cognition (q-bio.NC)
Comments: 46 pages, 9 figures
Abstract:Discovering governing equations that describe complex chaotic systems remains a fundamental challenge in physics and neuroscience. Here, we introduce the PEM-UDE method, which combines the prediction-error method with universal differential equations to extract interpretable mathematical expressions from chaotic dynamical systems, even with limited or noisy observations. This approach succeeds where traditional techniques fail by smoothing optimization landscapes and removing the chaotic properties during the fitting process without distorting optimal parameters. We demonstrate its efficacy by recovering hidden states in the Rossler system and reconstructing dynamics from noise-corrupted electrical circuit data, where the correct functional form of the dynamics is recovered even when one of the observed time series is corrupted by noise 5x the magnitude of the true signal. We demonstrate that this method is capable of recovering the correct dynamics, whereas direct symbolic regression methods, such as SINDy, fail to do so with the given amount of data and noise. Importantly, when applied to neural populations, our method derives novel governing equations that respect biological constraints such as network sparsity - a constraint necessary for cortical information processing yet not captured in next-generation neural mass models - while preserving microscale neuronal parameters. These equations predict an emergent relationship between connection density and both oscillation frequency and synchrony in neural circuits. We validate these predictions using three intracranial electrode recording datasets from the medial entorhinal cortex, prefrontal cortex, and orbitofrontal cortex. Our work provides a pathway to develop mechanistic, multi-scale brain models that generalize across diverse neural architectures, bridging the gap between single-neuron dynamics and macroscale brain activity.
[LG-94] Kinetic Langevin Diffusion for Crystalline Materials Generation ICML2025
Link: https://arxiv.org/abs/2507.03602
Authors: François Cornet, Federico Bergamin, Arghya Bhowmik, Juan Maria Garcia Lastra, Jes Frellsen, Mikkel N. Schmidt
Categories: Machine Learning (cs.LG)
Comments: Accepted at ICML 2025
Abstract:Generative modeling of crystalline materials using diffusion models presents a series of challenges: the data distribution is characterized by inherent symmetries and involves multiple modalities, with some defined on specific manifolds. Notably, the treatment of fractional coordinates representing atomic positions in the unit cell requires careful consideration, as they lie on a hypertorus. In this work, we introduce Kinetic Langevin Diffusion for Materials (KLDM), a novel diffusion model for crystalline materials generation, where the key innovation resides in the modeling of the coordinates. Instead of resorting to Riemannian diffusion on the hypertorus directly, we generalize Trivialized Diffusion Model (TDM) to account for the symmetries inherent to crystals. By coupling coordinates with auxiliary Euclidean variables representing velocities, the diffusion process is now offset to a flat space. This allows us to effectively perform diffusion on the hypertorus while providing a training objective that accounts for the periodic translation symmetry of the true data distribution. We evaluate KLDM on both Crystal Structure Prediction (CSP) and De-novo Generation (DNG) tasks, demonstrating its competitive performance with current state-of-the-art models.
[LG-95] From Street Form to Spatial Justice: Explaining Urban Exercise Inequality via a Triadic SHAP-Informed Framework
Link: https://arxiv.org/abs/2507.03570
Authors: Minwei Zhao, Guosheng Yang, Zhuoni Zhang, Cai Wu
Categories: Computers and Society (cs.CY); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: 31 pages, 3 tables and 11 figures
Abstract:Urban streets are essential public spaces that facilitate everyday physical activity and promote health equity. Drawing on Henri Lefebvre’s spatial triad, this study proposes a conceptual and methodological framework to quantify street-level exercise deprivation through the dimensions of conceived (planning and structure), perceived (visual and sensory), and lived (practice and experiential) urban spaces. We integrate multi-source spatial data (including street networks, street-view imagery, and social media) using explainable machine learning (SHAP analysis) to classify streets by their dominant deprivation modes, forming a novel typology of spatial inequity. Results highlight significant differences across urban contexts: older city cores predominantly experience infrastructural constraints (conceived space), whereas new development areas suffer from experiential disengagement (lived space). Furthermore, by identifying spatial mismatches between population distribution and exercise intensity, our study reveals localized clusters of latent deprivation. Simulation experiments demonstrate that targeted improvements across spatial dimensions can yield up to 14% increases in exercise supportiveness. This research not only operationalizes Lefebvre’s spatial theory at the street scale but also provides actionable insights and intervention guidelines, contributing to the broader goals of spatial justice and urban health equity.
[LG-96] Simplifying Graph Neural Kernels: from Stacking Layers to Collapsed Structure
Link: https://arxiv.org/abs/2507.03560
Authors: Lin Wang, Shijie Wang, Sirui Huang, Qing Li
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The Graph Neural Tangent Kernel (GNTK) successfully bridges the gap between kernel methods and Graph Neural Networks (GNNs), addressing key challenges such as the difficulty of training deep networks and the limitations of traditional kernel methods. However, the existing layer-stacking strategy in GNTK introduces redundant computations, significantly increasing computational complexity and limiting scalability for practical applications. To address these issues, this paper proposes the Simplified Graph Neural Tangent Kernel (SGTK), which replaces the traditional multi-layer stacking mechanism with a continuous K-step aggregation operation. This novel approach streamlines the iterative kernel computation process, effectively eliminating redundant calculations while preserving the kernel’s expressiveness. By reducing the dependency on layer stacking, SGTK achieves both computational simplicity and efficiency. Furthermore, we introduce the Simplified Graph Neural Kernel (SGNK), which models infinitely wide Graph Neural Networks as Gaussian Processes. This allows kernel values to be directly determined from the expected outputs of activation functions in the infinite-width regime, bypassing the need for explicit layer-by-layer computation. SGNK further reduces computational complexity while maintaining the capacity to capture intricate structural patterns in graphs. Extensive experiments on node and graph classification tasks demonstrate that the proposed SGTK and SGNK achieve performance comparable to existing approaches while improving computational efficiency. Implementation details are available at this https URL.
[LG-97] Communication Efficient Differentially Private Distributed Optimization using Correlation-Aware Sketching
Link: https://arxiv.org/abs/2507.03545
Authors: Julien Nicolas, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Federated learning with differential privacy suffers from two major costs: each client must transmit d-dimensional gradients every round, and the magnitude of DP noise grows with d. Yet empirical studies show that gradient updates exhibit strong temporal correlations and lie in a k-dimensional subspace with k \ll d. Motivated by this, we introduce DOME, a decentralized DP optimization framework in which each client maintains a compact sketch to project gradients into \mathbb{R}^k before privatization and Secure Aggregation. This reduces per-round communication from order d to order k and moves towards a gradient approximation mean-squared error of \sigma^2 k. To allow the sketch to span new directions and prevent it from collapsing onto historical gradients, we augment it with random probes orthogonal to historical directions. We prove that our overall protocol satisfies (\epsilon,\delta)-Differential Privacy.
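The project-then-privatize step described in this abstract can be illustrated with a minimal sketch. It is not the DOME protocol: the basis is fixed rather than maintained from past gradients, there are no random probes and no Secure Aggregation, and the helper names (`project`, `clip`, `privatize`, `lift`) and the values of `sigma` and `max_norm` are hypothetical choices for illustration only.

```python
import math
import random

def project(basis, grad):
    """Coordinates of `grad` in the k-dim subspace spanned by orthonormal rows of `basis`."""
    return [sum(b * g for b, g in zip(row, grad)) for row in basis]

def clip(vec, max_norm):
    """Rescale `vec` so its Euclidean norm is at most `max_norm` (bounds sensitivity)."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def privatize(coords, sigma, rng):
    """Add independent Gaussian noise to each of the k coordinates."""
    return [c + rng.gauss(0.0, sigma) for c in coords]

def lift(basis, coords):
    """Map k coordinates back to a d-dimensional gradient estimate."""
    d = len(basis[0])
    return [sum(coords[j] * basis[j][i] for j in range(len(basis))) for i in range(d)]

rng = random.Random(0)
d, k = 6, 2
# Hypothetical fixed basis: the first k standard unit vectors of R^d.
basis = [[1.0 if i == j else 0.0 for i in range(d)] for j in range(k)]
grad = [3.0, 4.0, 0.0, 0.0, 0.0, 0.0]

coords = clip(project(basis, grad), max_norm=1.0)  # k numbers instead of d
noisy = privatize(coords, sigma=0.1, rng=rng)      # noise is added in R^k only
approx = lift(basis, noisy)                        # server-side reconstruction
```

Only `noisy` (k values) would need to be transmitted, and the noise is drawn in k rather than d dimensions, which is the intuition behind the communication and error figures quoted in the abstract.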
[LG-98] Decoupled Relative Learning Rate Schedules
Link: https://arxiv.org/abs/2507.03526
Authors: Jan Ludziejewski, Jan Małaśnicki, Maciej Pióro, Michał Krutul, Kamil Ciebiera, Maciej Stefaniak, Jakub Krajewski, Piotr Sankowski, Marek Cygan, Kamil Adamczewski, Sebastian Jaszczur
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our method, relative learning rates (RLRS), accelerates the training process by up to 23%, particularly in complex models such as Mixture of Experts (MoE). Hyperparameters of RLRS can be efficiently tuned on smaller models and then effectively reused on models up to 27× larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
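The idea of component-wise learning rates can be sketched as a shared base schedule scaled by one multiplier per component. Everything specific below is an assumption for illustration: the component names, the multiplier values in `RELATIVE`, and the warmup-plus-cosine base schedule are hypothetical, and the paper's actual parameterization may differ.

```python
import math

def base_lr(step, total_steps, peak=1e-3, warmup=100):
    """Shared schedule: linear warmup to `peak`, then cosine decay to zero."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))

# Hypothetical relative multipliers, one per model component.
RELATIVE = {"embedding": 2.0, "attention": 1.0, "expert": 0.5, "router": 0.25}

def component_lrs(step, total_steps):
    """Learning rate of each component = base schedule times its relative multiplier."""
    lr = base_lr(step, total_steps)
    return {name: lr * mult for name, mult in RELATIVE.items()}

lrs = component_lrs(step=100, total_steps=1000)
```

Because the multipliers are expressed relative to the base rate, the same dictionary could in principle be carried over to a larger model while only the base schedule is retuned, which is the kind of reuse the abstract describes.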
[LG-99] A Flexible Instruction Set Architecture for Efficient GEMMs
Link: https://arxiv.org/abs/2507.03522
Authors: Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, Erich Focht, Marc Casas
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:
Abstract:GEneral Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in the last years. Although these matrix ISAs deliver larger throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to dynamically adapt themselves to application-specific aspects like the data format. This paper demonstrates that the state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
[LG-100] ObjectRL: An Object-Oriented Reinforcement Learning Codebase
Link: https://arxiv.org/abs/2507.03487
Authors: Gulcin Baykal, Abdullah Akgül, Manuel Haussmann, Bahareh Tasdighi, Nicklas Werge, Yi-Shan Wu, Melih Kandemir
Categories: Machine Learning (cs.LG)
Comments:
Abstract:ObjectRL is an open-source Python codebase for deep reinforcement learning (RL), designed for research-oriented prototyping with minimal programming effort. Unlike existing codebases, ObjectRL is built on Object-Oriented Programming (OOP) principles, providing a clear structure that simplifies the implementation, modification, and evaluation of new algorithms. ObjectRL lowers the entry barrier for deep RL research by organizing best practices into explicit, clearly separated components, making them easier to understand and adapt. Each algorithmic component is a class with attributes that describe key RL concepts and methods that intuitively reflect their interactions. The class hierarchy closely follows common ontological relationships, enabling data encapsulation, inheritance, and polymorphism, which are core features of OOP. We demonstrate the efficiency of ObjectRL’s design through representative use cases that highlight its flexibility and suitability for rapid prototyping. The documentation and source code are available at this https URL and this https URL .
[LG-101] Molecular Machine Learning Using Euler Characteristic Transforms
Link: https://arxiv.org/abs/2507.03474
Authors: Victor Toscano-Duran, Florian Rottach, Bastian Rieck
Categories: Machine Learning (cs.LG); Algebraic Topology (math.AT); Biomolecules (q-bio.BM)
Comments:
Abstract:The shape of a molecule determines its physicochemical and biological properties. However, it is often underrepresented in standard molecular representation learning approaches. Here, we propose using the Euler Characteristic Transform (ECT) as a geometrical-topological descriptor. Computed directly on a molecular graph derived from handcrafted atomic features, the ECT enables the extraction of multiscale structural features, offering a novel way to represent and encode molecular shape in the feature space. We assess the predictive performance of this representation across nine benchmark regression datasets, all centered around predicting the inhibition constant K_i. In addition, we compare our proposed ECT-based representation against traditional molecular representations and methods, such as molecular fingerprints/descriptors and graph neural networks (GNNs). Our results show that our ECT-based representation achieves competitive performance, ranking among the best-performing methods on several datasets. More importantly, its combination with traditional representations, particularly with the AVALON fingerprint, significantly enhances predictive performance, outperforming other methods on most datasets. These findings highlight the complementary value of multiscale topological information and its potential for being combined with established techniques. Our study suggests that hybrid approaches incorporating explicit shape information can lead to more informative and robust molecular representations, enhancing and opening new avenues in molecular machine learning tasks. To support reproducibility and foster open biomedical research, we provide open access to all experiments and code used in this work.
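For readers unfamiliar with the Euler Characteristic Transform, the following is a minimal sketch of the standard construction on a graph: for each direction, vertices are ordered by their height along that direction, and the Euler characteristic (vertices minus edges) of each sublevel set is recorded. The toy graph, the 2D coordinates, and the function names are hypothetical; the paper computes the transform on features derived from atomic descriptors, which this sketch does not reproduce.

```python
import math

def euler_curve(points, edges, direction, thresholds):
    """Euler characteristic (#vertices - #edges) of each sublevel set along `direction`."""
    heights = [sum(p * d for p, d in zip(pt, direction)) for pt in points]
    curve = []
    for t in thresholds:
        n_vertices = sum(1 for h in heights if h <= t)
        # An edge enters the sublevel set once both of its endpoints have entered.
        n_edges = sum(1 for a, b in edges if max(heights[a], heights[b]) <= t)
        curve.append(n_vertices - n_edges)
    return curve

def ect(points, edges, n_directions, thresholds):
    """Stack Euler curves over evenly spaced directions on the unit circle."""
    rows = []
    for i in range(n_directions):
        angle = 2.0 * math.pi * i / n_directions
        direction = (math.cos(angle), math.sin(angle))
        rows.append(euler_curve(points, edges, direction, thresholds))
    return rows

# Toy "molecule": a three-membered ring (nodes 0, 1, 2) with one pendant atom (node 3).
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0), (2.0, 0.0)]
edges = [(0, 1), (1, 2), (0, 2), (1, 3)]
thresholds = [-0.5, 0.25, 0.75, 1.5, 2.5]

curve_x = euler_curve(points, edges, (1.0, 0.0), thresholds)
table = ect(points, edges, 4, thresholds)
```

Flattening `table` gives a fixed-length vector that can be concatenated with other descriptors, which is the kind of combination with fingerprints that the abstract reports as beneficial.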
[LG-102] A Hybrid Game-Theory and Deep Learning Framework for Predicting Tourist Arrivals via Big Data Analytics and Opinion Leader Detection
Link: https://arxiv.org/abs/2507.03411
Authors: Ali Nikseresht
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Computer Science and Game Theory (cs.GT); Signal Processing (eess.SP)
Comments:
Abstract:In the era of Industry 5.0, data-driven decision-making has become indispensable for optimizing systems across Industrial Engineering. This paper addresses the value of big data analytics by proposing a novel non-linear hybrid approach for forecasting international tourist arrivals in two different contexts: (i) arrivals to Hong Kong from five major source nations (pre-COVID-19), and (ii) arrivals to Sanya in Hainan province, China (post-COVID-19). The method integrates multiple sources of Internet big data and employs an innovative game theory-based algorithm to identify opinion leaders on social media platforms. Subsequently, nonstationary attributes in tourism demand data are managed through Empirical Wavelet Transform (EWT), ensuring refined time-frequency analysis. Finally, a memory-aware Stacked Bi-directional Long Short-Term Memory (Stacked BiLSTM) network is used to generate accurate demand forecasts. Experimental results demonstrate that this approach outperforms existing state-of-the-art techniques and remains robust under dynamic and volatile conditions, highlighting its applicability to broader Industrial Engineering domains, such as logistics, supply chain management, and production planning, where forecasting and resource allocation are key challenges. By merging advanced Deep Learning (DL), time-frequency analysis, and social media insights, the proposed framework showcases how large-scale data can elevate the quality and efficiency of decision-making processes.
[LG-103] On the Effectiveness of the z-Transform Method in Quadratic Optimization
Link: https://arxiv.org/abs/2507.03404
Authors: Francis Bach (SIERRA)
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:The z-transform of a sequence is a classical tool used within signal processing, control theory, computer science, and electrical engineering. It allows for studying sequences from their generating functions, with many operations that can be equivalently defined on the original sequence and its z-transform. In particular, the z-transform method focuses on asymptotic behaviors and allows the use of Taylor expansions. We present a sequence of results of increasing significance and difficulty for linear models and optimization algorithms, demonstrating the effectiveness and versatility of the z-transform method in deriving new asymptotic results. Starting from the simplest gradient descent iterations in an infinite-dimensional Hilbert space, we show how the spectral dimension characterizes the convergence behavior. We then extend the analysis to Nesterov acceleration, averaging techniques, and stochastic gradient descent.
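As a small worked example of the kind of object this abstract refers to, consider plain gradient descent on a quadratic, restricted to a single eigendirection. The generating-function convention (powers of z rather than of 1/z) and the symbols \gamma (step size), \lambda (Hessian eigenvalue), and \theta_\ast (minimizer) are assumptions made here for illustration; the paper's own notation may differ.

```latex
% Assumed convention: the z-transform as a generating function.
A(z) = \sum_{k \ge 0} a_k z^k

% Gradient descent with step size \gamma along an eigendirection with eigenvalue \lambda:
\theta_k - \theta_\ast = (1 - \gamma\lambda)^k \, (\theta_0 - \theta_\ast)

% The geometric error sequence has a closed-form transform (for \gamma\lambda \neq 1):
\sum_{k \ge 0} (1 - \gamma\lambda)^k z^k = \frac{1}{1 - (1 - \gamma\lambda) z},
\qquad |z| < \frac{1}{|1 - \gamma\lambda|}
```

The location of the singularity, z = 1/(1 - \gamma\lambda), encodes the geometric decay rate of the error, which illustrates how asymptotic behavior can be read off from the transform.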
[LG-104] Absolute Evaluation Measures for Machine Learning: A Survey
Link: https://arxiv.org/abs/2507.03392
Authors: Silvia Beddar-Wiesing, Alice Moallemy-Oureh, Marie Kempkes, Josephine M. Thomas
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Machine Learning is a diverse field applied across various domains such as computer science, social sciences, medicine, chemistry, and finance. This diversity results in varied evaluation approaches, making it difficult to compare models effectively. Absolute evaluation measures offer a practical solution by assessing a model’s performance on a fixed scale, independent of reference models and data ranges, enabling explicit comparisons. However, many commonly used measures are not universally applicable, leading to a lack of comprehensive guidance on their appropriate use. This survey addresses this gap by providing an overview of absolute evaluation metrics in ML, organized by the type of learning problem. While classification metrics have been extensively studied, this work also covers clustering, regression, and ranking metrics. By grouping these measures according to the specific ML challenges they address, this survey aims to equip practitioners with the tools necessary to select appropriate metrics for their models. The provided overview thus improves individual model evaluation and facilitates meaningful comparisons across different models and applications.
[LG-105] Action Robust Reinforcement Learning via Optimal Adversary Aware Policy Optimization
Link: https://arxiv.org/abs/2507.03372
Authors: Buqing Nie,Yangqing Fu,Jingtian Ji,Yue Gao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Reinforcement Learning (RL) has achieved remarkable success in sequential decision tasks. However, recent studies have revealed the vulnerability of RL policies to different perturbations, raising concerns about their effectiveness and safety in real-world applications. In this work, we focus on the robustness of RL policies against action perturbations and introduce a novel framework called Optimal Adversary-aware Policy Iteration (OA-PI). Our framework enhances action robustness under various perturbations by evaluating and improving policy performance against the corresponding optimal adversaries. Besides, our approach can be integrated into mainstream DRL algorithms such as Twin Delayed DDPG (TD3) and Proximal Policy Optimization (PPO), improving action robustness effectively while maintaining nominal performance and sample efficiency. Experimental results across various environments demonstrate that our method enhances robustness of DRL policies against different action adversaries effectively.
[LG-106] Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency
Link: https://arxiv.org/abs/2507.03340
Authors: Naoki Nishikawa,Rei Higuchi,Taiji Suzuki
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 18 pages, 1 figure
Abstract:Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear attention. However, a critical challenge remains: how to choose the feature dimension that governs the approximation quality. Existing methods fix this dimension uniformly across all attention layers, overlooking the diverse roles and complexities of these layers. In this paper, we propose a principled method to automatically determine the feature dimension in linear attention using the concept of statistical degrees of freedom, which represent the effective dimensionality of the inputs. We provide a theoretical bound on the approximation error and show that the dimension chosen by our method achieves smaller error under a fixed computational budget. Furthermore, we introduce an efficient layerwise training strategy to learn nonlinear features tailored to each layer. Experiments on multiple pre-trained transformers demonstrate that our method improves the performance of distilled models compared to baselines without increasing the inference cost. Our findings also provide insight into how the complexity of the attention mechanism evolves across layers.
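To make the notion of degrees of freedom concrete, here is a minimal sketch using the definition common in kernel methods, N(lam) = sum_i mu_i / (mu_i + lam) over the eigenvalues mu_i of a kernel or covariance operator. The paper's exact definition may differ. The function names, the regularization value, and the proportional budget split are illustrative assumptions, not the authors' procedure.

```python
def degrees_of_freedom(eigenvalues, lam):
    """Effective dimension N(lam) = sum_i mu_i / (mu_i + lam).

    Standard kernel-method definition, assumed here for illustration.
    """
    return sum(mu / (mu + lam) for mu in eigenvalues)


def allocate_feature_dims(total_budget, dofs):
    """Hypothetical rule: split a feature budget across layers in
    proportion to each layer's degrees of freedom (at least 1 per layer).
    Rounding means the result may not sum exactly to total_budget."""
    total = sum(dofs)
    return [max(1, round(total_budget * d / total)) for d in dofs]


# Two made-up layer spectra: the second decays more slowly,
# so it has a larger effective dimension.
layer_spectra = [[1.0, 0.1, 0.01], [1.0, 0.8, 0.6]]
layer_dofs = [degrees_of_freedom(s, lam=0.1) for s in layer_spectra]
layer_dims = allocate_feature_dims(16, layer_dofs)
```

A layer whose spectrum decays slowly receives more features than one dominated by a few directions, which is the intuition behind using a non-uniform dimension across layers.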
[LG-107] MPX: Mixed Precision Training for JAX
Link: https://arxiv.org/abs/2507.03312
Authors: Alexander Gräfe,Sebastian Trimpe
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Mixed-precision training has emerged as an indispensable tool for enhancing the efficiency of neural network training in recent years. Concurrently, JAX has grown in popularity as a versatile machine learning toolbox. However, it currently lacks robust support for mixed-precision training. We propose MPX, a mixed-precision training toolbox for JAX that simplifies and accelerates the training of large-scale neural networks while preserving model accuracy. MPX seamlessly integrates with popular toolboxes such as Equinox and Flax, allowing users to convert full-precision pipelines to mixed-precision versions with minimal modifications. By casting both inputs and outputs to half precision, and introducing a dynamic loss-scaling mechanism, MPX alleviates issues like gradient underflow and overflow that commonly arise in half precision computations. Its design inherits critical features from JAX’s type-promotion behavior, ensuring that operations take place in the correct precision and allowing for selective enforcement of full precision where needed (e.g., sums, means, or softmax). MPX further provides wrappers for automatic creation and management of mixed-precision gradients and optimizers, enabling straightforward integration into existing JAX training pipelines. MPX’s source code, documentation, and usage examples are available at this http URL.
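The dynamic loss-scaling idea mentioned in the abstract can be sketched in a few lines. This is a generic illustration of the technique in plain Python, not the MPX API: the class name `DynamicLossScaler`, its parameters, and the halve-on-overflow, double-after-N-good-steps policy are assumptions chosen for the example.

```python
import math


class DynamicLossScaler:
    """Illustrative dynamic loss scaler (hypothetical class, not the MPX API)."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=3, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self._good_steps = 0

    def unscale(self, scaled_grads):
        """Divide gradients of the scaled loss by the current scale."""
        return [g / self.scale for g in scaled_grads]

    def update(self, scaled_grads):
        """Return True if the step should be applied; adjust the scale either way."""
        if not all(math.isfinite(g) for g in scaled_grads):
            # Overflow detected: shrink the scale and skip this step.
            self.scale /= self.factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Several clean steps in a row: try a larger scale again.
            self.scale *= self.factor
            self._good_steps = 0
        return True
```

In a training loop, the loss would be multiplied by `scale` before differentiation so that small half-precision gradients do not underflow, and the optimizer step would be applied to `unscale(grads)` only when `update(grads)` returns True.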
[LG-108] LRM-1B: Towards Large Routing Model
Link: https://arxiv.org/abs/2507.03300
Authors: Han Li,Fei Liu,Zhenkun Wang,Qingfu Zhang
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Vehicle routing problems (VRPs) are central to combinatorial optimization with significant practical implications. Recent advancements in neural combinatorial optimization (NCO) have demonstrated promising results by leveraging neural networks to solve VRPs, yet model scaling within this domain remains underexplored. Inspired by the success of model scaling in large language models (LLMs), this study introduces a Large Routing Model with 1 billion parameters (LRM-1B), designed to address diverse VRP scenarios. We present a comprehensive evaluation of LRM-1B across multiple problem variants, distributions, and sizes, establishing state-of-the-art results. Our findings reveal that LRM-1B not only adapts to different VRP challenges but also showcases superior performance, outperforming existing models. Additionally, we explore the scaling behavior of neural routing models from 1M to 1B parameters. Our analysis confirms a power-law relationship between multiple model factors and performance, offering critical insights into the optimal configurations for foundation neural routing solvers.
[LG-109] Dyn-O: Building Structured World Models with Object-Centric Representations
Link: https://arxiv.org/abs/2507.03298
Authors: Zizhao Wang,Kaixin Wang,Li Zhao,Peter Stone,Jiang Bian
Categories: Machine Learning (cs.LG)
Comments:
Abstract:World models aim to capture the dynamics of the environment, enabling agents to predict and plan for future states. In most scenarios of interest, the dynamics are highly centered on interactions among objects within the environment. This motivates the development of world models that operate on object-centric rather than monolithic representations, with the goal of more effectively capturing environment dynamics and enhancing compositional generalization. However, the development of object-centric world models has largely been explored in environments with limited visual complexity (such as basic geometries). It remains underexplored whether such models can generalize to more complex settings with diverse textures and cluttered scenes. In this paper, we fill this gap by introducing Dyn-O, an enhanced structured world model built upon object-centric representations. Compared to prior work in object-centric representations, Dyn-O improves in both learning representations and modeling dynamics. On the challenging Procgen games, we find that our method can learn object-centric world models directly from pixel observations, outperforming DreamerV3 in rollout prediction accuracy. Furthermore, by decoupling object-centric features into dynamics-agnostic and dynamics-aware components, we enable finer-grained manipulation of these features and generate more diverse imagined trajectories.
[LG-110] Global Variational Inference Enhanced Robust Domain Adaptation
Link: https://arxiv.org/abs/2507.03291
Authors: Lingkun Luo,Shiqiang Hu,Liming Chen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Deep learning-based domain adaptation (DA) methods have shown strong performance by learning transferable representations. However, their reliance on mini-batch training limits global distribution modeling, leading to unstable alignment and suboptimal generalization. We propose Global Variational Inference Enhanced Domain Adaptation (GVI-DA), a framework that learns continuous, class-conditional global priors via variational inference to enable structure-aware cross-domain alignment. GVI-DA minimizes domain gaps through latent feature reconstruction, and mitigates posterior collapse using global codebook learning with randomized sampling. It further improves robustness by discarding low-confidence pseudo-labels and generating reliable target-domain samples. Extensive experiments on four benchmarks and thirty-eight DA tasks demonstrate consistent state-of-the-art performance. We also derive the model’s evidence lower bound (ELBO) and analyze the effects of prior continuity, codebook size, and pseudo-label noise tolerance. In addition, we compare GVI-DA with diffusion-based generative frameworks in terms of optimization principles and efficiency, highlighting both its theoretical soundness and practical advantages.
[LG-111] Do Tensorized Large-Scale Spatiotemporal Dynamic Atmospheric Data Exhibit Low-Rank Properties?
Link: https://arxiv.org/abs/2507.03289
Authors: Ryan Solgi,Seyedali Mousavinezhad,Hugo A. Loaiciga
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments:
Abstract:In this study, we investigate for the first time the low-rank properties of a tensorized large-scale spatio-temporal dynamic atmospheric variable. We focus on the Sentinel-5P tropospheric NO2 product (S5P-TN) over a four-year period in an area that encompasses the contiguous United States (CONUS). Here, it is demonstrated that a low-rank approximation of such a dynamic variable is feasible. We apply the low-rank properties of the S5P-TN data to inpaint gaps in the Sentinel-5P product by adopting a low-rank tensor model (LRTM) based on the CANDECOMP / PARAFAC (CP) decomposition and alternating least squares (ALS). Furthermore, we evaluate the LRTM’s results by comparing them with spatial interpolation using geostatistics, and conduct a comprehensive spatial statistical and temporal analysis of the S5P-TN product. The results of this study demonstrated that the tensor completion successfully reconstructs the missing values in the S5P-TN product, particularly in the presence of extended cloud obscuration, predicting outliers and identifying hotspots, when the data is tensorized over extended spatial and temporal scales.
[LG-112] Securing Transformer-based AI Execution via Unified TEE and Crypto-protected Accelerators
Link: https://arxiv.org/abs/2507.03278
Authors: Jiaqi Xue,Yifei Zhao,Mengxin Zheng,Xun Chen,Fan Yao,Yan Solihin,Qian Lou
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 15 pages
Abstract:Recent advances in Transformer models, e.g., large language models (LLMs), have brought tremendous breakthroughs in various artificial intelligence (AI) tasks, leading to their wide applications in many security-critical domains. Due to their unprecedented scale and prohibitively high development cost, these models have become highly valuable intellectual property for AI stakeholders and are increasingly deployed via machine learning as a service (MLaaS). However, MLaaS often runs on untrusted cloud infrastructure, exposing data and models to potential breaches. Mainstream protection mechanisms leverage trusted execution environments (TEEs) where confidentiality and integrity for secretive data are shielded using hardware-based encryption and integrity checking. Unfortunately, running model inference entirely within TEEs is subject to non-trivial slowdown, which is further exacerbated in LLMs due to the substantial computation and memory footprint involved. Recent studies reveal that the hybrid TEE-based scheme offloading partial model inference operations to the untrusted accelerators (e.g., GPU) is a promising solution. However, prior offloading schemes fail to ensure dual protection of data and model in Transformer inference, as they cannot securely offload critical operations, i.e., Attention and SoftMax, forcing these computations to remain confined within TEEs. To address these challenges, we propose TwinShield, a framework enabling secure Transformer inference in heterogeneous TEE and accelerator systems with dual protection for both model and data. TwinShield offloads ~87% of computation to GPUs and delivers 4.0x - 6.1x speedups over previous approaches across various Transformer models.
[LG-113] DistZO2: High-Throughput and Memory-Efficient Zeroth-Order Fine-tuning LLMs with Distributed Parallel Computing
Link: https://arxiv.org/abs/2507.03211
Authors: Liangyu Wang,Huanyi Xie,Di Wang
Categories: Machine Learning (cs.LG); Performance (cs.PF)
Comments:
Abstract:Fine-tuning large language models (LLMs) remains resource-intensive due to their sheer scale. While zeroth-order (ZO) optimization provides a memory-efficient alternative by eliminating backward passes, its application to multi-hundred-billion-parameter models is constrained by GPU memory and compute throughput. The ZO2 framework addresses the memory bottleneck by offloading model parameters to CPU memory and overlapping transformer block transfer with dual forward computation on a single GPU. However, ZO2 remains limited by its single-device execution and achieves modest throughput. In this work, we present DistZO2, a high-throughput, memory-efficient framework for distributed zeroth-order fine-tuning of LLMs. DistZO2 introduces three parallel strategies: (1) Perturbation Parallelism (PertP), which parallelizes the two perturbed forward passes across devices; (2) Distributed Data Parallelism (DDP), adapted to the scalar-gradient nature of ZO training; and (3) a unified 2D Parallelism design that combines PertP and DDP. To further mitigate communication bottlenecks introduced by parameter offloading, we propose a hardware-aware communication strategy that slices parameter blocks and redistributes them across GPUs via high-speed interconnects such as NVLink. DistZO2 scales zeroth-order fine-tuning to modern multi-GPU systems, preserving ZO2’s memory efficiency while substantially improving training throughput. In our experiments on OPT-175B, DistZO2 achieves a 3x speedup over ZO2 with distributed computing. DistZO2’s code has been open-sourced in this https URL.
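The "two perturbed forward passes" and the "scalar-gradient nature" of zeroth-order training can be illustrated with a generic two-point estimator. The sketch below is a plain-Python toy under stated assumptions, not the DistZO2 or ZO2 code: `zo_step`, its parameters, and the quadratic loss are hypothetical and chosen only to keep the example self-contained.

```python
import random


def zo_step(params, loss_fn, lr=0.05, eps=1e-3, seed=0):
    """One zeroth-order update using two forward passes and no backward pass."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]  # random perturbation direction
    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]
    # The two evaluations below are independent of each other, which is
    # what makes it possible to run them on different devices.
    g = (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)  # a single scalar
    return [p - lr * g * zi for p, zi in zip(params, z)]


def quadratic_loss(params):
    """Toy loss standing in for a model's forward pass."""
    return sum(p * p for p in params)


params = [1.0, -2.0, 0.5]
initial_loss = quadratic_loss(params)
for step in range(200):
    params = zo_step(params, quadratic_loss, seed=step)
final_loss = quadratic_loss(params)
```

Because the direction `z` can be regenerated from the seed, workers that share the seed only need to exchange the scalar `g` rather than a full gradient tensor, which is one plausible reading of why data parallelism is cheap in this setting.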
[LG-114] Weak Form Scientific Machine Learning: Test Function Construction for System Identification
Link: https://arxiv.org/abs/2507.03206
Authors: April Tran,David Bortz
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments:
Abstract:Weak form Scientific Machine Learning (WSciML) is a recently developed framework for data-driven modeling and scientific discovery. It leverages the weak form of equation error residuals to provide enhanced noise robustness in system identification via convolving model equations with test functions, reformulating the problem to avoid direct differentiation of data. The performance, however, relies on wisely choosing a set of compactly supported test functions. In this work, we mathematically motivate a novel data-driven method for constructing Single-scale-Local reference functions for creating the set of test functions. Our approach numerically approximates the integration error introduced by the quadrature and identifies the support size for which the error is minimal, without requiring access to the model parameter values. Through numerical experiments across various models, noise levels, and temporal resolutions, we demonstrate that the selected supports consistently align with regions of minimal parameter estimation error. We also compare the proposed method against the strategy for constructing Multi-scale-Global (and orthogonal) test functions introduced in our prior work, demonstrating the improved computational efficiency.
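A small example may help show how a compactly supported test function avoids differentiating data. For the toy model u' = theta * u, integration by parts gives -∫ phi' u dt = theta ∫ phi u dt, so theta can be estimated from two quadratures. The sketch below is an illustration under assumptions, not the authors' method: the bump shape (1 - s^2)^2, the trapezoid rule, and all function names are choices made here, and `radius` plays the role of the support size that the paper selects in a data-driven way.

```python
import math


def bump(t, center, radius):
    """Test function (1 - s^2)^2 on |t - center| < radius, with its derivative."""
    s = (t - center) / radius
    if abs(s) >= 1.0:
        return 0.0, 0.0
    phi = (1.0 - s * s) ** 2
    dphi = -4.0 * s * (1.0 - s * s) / radius
    return phi, dphi


def trapezoid(ys, dt):
    """Composite trapezoid rule on a uniform grid."""
    return dt * (sum(ys) - 0.5 * (ys[0] + ys[-1]))


def estimate_rate(ts, us, center, radius):
    """Estimate theta in u' = theta * u without differentiating the data."""
    dt = ts[1] - ts[0]
    pairs = [bump(t, center, radius) for t in ts]
    lhs = trapezoid([-dphi * u for (_, dphi), u in zip(pairs, us)], dt)
    rhs = trapezoid([phi * u for (phi, _), u in zip(pairs, us)], dt)
    return lhs / rhs


# Synthetic, noise-free data from u(t) = exp(-0.5 t) on [0, 4].
ts = [0.01 * i for i in range(401)]
us = [math.exp(-0.5 * t) for t in ts]
theta_hat = estimate_rate(ts, us, center=2.0, radius=1.5)
```

On this clean data `theta_hat` should land close to -0.5. With noisy data, the quadrature error and the noise both depend on `radius`, which is why the choice of support matters.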
[LG-115] Quantifying Cross-Attention Interaction in Transformers for Interpreting TCR-pMHC Binding
Link: https://arxiv.org/abs/2507.03197
Authors: Jiarui Li,Zixiang Yin,Haley Smith,Zhengming Ding,Samuel J. Landry,Ramgopal R. Mettu
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:
Abstract:CD8+ “killer” T cells and CD4+ “helper” T cells play a central role in the adaptive immune system by recognizing antigens presented by Major Histocompatibility Complex (pMHC) molecules via T Cell Receptors (TCRs). Modeling binding between T cells and the pMHC complex is fundamental to understanding basic mechanisms of human immune response as well as in developing therapies. While transformer-based models such as TULIP have achieved impressive performance in this domain, their black-box nature precludes interpretability and thus limits a deeper mechanistic understanding of T cell response. Most existing post-hoc explainable AI (XAI) methods are confined to encoder-only, co-attention, or model-specific architectures and cannot handle encoder-decoder transformers used in TCR-pMHC modeling. To address this gap, we propose Quantifying Cross-Attention Interaction (QCAI), a new post-hoc method designed to interpret the cross-attention mechanisms in transformer decoders. Quantitative evaluation is a challenge for XAI methods; we have compiled TCR-XAI, a benchmark consisting of 274 experimentally determined TCR-pMHC structures to serve as ground truth for binding. Using these structures we compute physical distances between relevant amino acid residues in the TCR-pMHC interaction region and evaluate how well our method and others estimate the importance of residues in this region across the dataset. We show that QCAI achieves state-of-the-art performance on both interpretability and prediction accuracy under the TCR-XAI benchmark.
[LG-116] First Contact: Data-driven Friction-Stir Process Control
Link: https://arxiv.org/abs/2507.03177
Authors: James Koch,Ethan King,WoongJo Choi,Megan Ebers,David Garcia,Ken Ross,Keerti Kappagantula
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments:
Abstract:This study validates the use of Neural Lumped Parameter Differential Equations for open-loop setpoint control of the plunge sequence in Friction Stir Processing (FSP). The approach integrates a data-driven framework with classical heat transfer techniques to predict tool temperatures, informing control strategies. By utilizing a trained Neural Lumped Parameter Differential Equation model, we translate theoretical predictions into practical set-point control, facilitating rapid attainment of desired tool temperatures and ensuring consistent thermomechanical states during FSP. This study covers the design, implementation, and experimental validation of our control approach, establishing a foundation for efficient, adaptive FSP operations.
[LG-117] Latent Thermodynamic Flows: Unified Representation Learning and Generative Modeling of Temperature-Dependent Behaviors from Limited Data
Link: https://arxiv.org/abs/2507.03174
Authors: Yunrui Qiu,Richard John,Lukas Herron,Pratyush Tiwary
Categories: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Biological Physics (physics.bio-ph); Chemical Physics (physics.chem-ph)
Comments:
Abstract:Accurate characterization of the equilibrium distributions of complex molecular systems and their dependence on environmental factors such as temperature is essential for understanding thermodynamic properties and transition mechanisms. Projecting these distributions onto meaningful low-dimensional representations enables interpretability and downstream analysis. Recent advances in generative AI, particularly flow models such as Normalizing Flows (NFs), have shown promise in modeling such distributions, but their scope is limited without tailored representation learning. In this work, we introduce Latent Thermodynamic Flows (LaTF), an end-to-end framework that tightly integrates representation learning and generative modeling. LaTF unifies the State Predictive Information Bottleneck (SPIB) with NFs to simultaneously learn low-dimensional latent representations, referred to as Collective Variables (CVs), classify metastable states, and generate equilibrium distributions across temperatures beyond the training data. The two components of representation learning and generative modeling are optimized jointly, ensuring that the learned latent features capture the system’s slow, important degrees of freedom while the generative model accurately reproduces the system’s equilibrium behavior. We demonstrate LaTF’s effectiveness across diverse systems, including a model potential, the Chignolin protein, and a cluster of Lennard-Jones particles, with thorough evaluations and benchmarking using multiple metrics and extensive simulations. Finally, we apply LaTF to an RNA tetraloop system, where despite using simulation data from only two temperatures, LaTF reconstructs the temperature-dependent structural ensemble and melting behavior, consistent with experimental and prior extensive computational results.
[LG-118] PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset
Link: https://arxiv.org/abs/2507.03165
Authors: Michal Golovanevsky,Pranav Mahableshwarkar,Carsten Eickhoff,Ritambhara Singh
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multimodal deep learning holds promise for improving clinical prediction by integrating diverse patient data, including text, imaging, time-series, and structured demographics. Contrastive learning facilitates this integration by producing a unified representation that can be reused across tasks, reducing the need for separate models or encoders. Although contrastive learning has seen success in vision-language domains, its use in clinical settings remains largely limited to image and text pairs. We propose the Pipeline for Contrastive Modality Evaluation and Encoding (PiCME), which systematically assesses five clinical data types from MIMIC: discharge summaries, radiology reports, chest X-rays, demographics, and time-series. We pre-train contrastive models on all 26 combinations of two to five modalities and evaluate their utility on in-hospital mortality and phenotype prediction. To address performance plateaus with more modalities, we introduce a Modality-Gated LSTM that weights each modality according to its contrastively learned importance. Our results show that contrastive models remain competitive with supervised baselines, particularly in three-modality settings. Performance declines beyond three modalities, which supervised models fail to recover. The Modality-Gated LSTM mitigates this drop, improving AUROC from 73.19% to 76.93% and AUPRC from 51.27% to 62.26% in the five-modality setting. We also compare contrastively learned modality importance scores with attribution scores and evaluate generalization across demographic subgroups, highlighting strengths in interpretability and fairness. PiCME is the first to scale contrastive learning across all modality combinations in MIMIC, offering guidance for modality selection, training strategies, and equitable clinical prediction.
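The figure of 26 modality combinations follows from counting all subsets of size two to five drawn from five modalities: C(5,2) + C(5,3) + C(5,4) + C(5,5) = 10 + 10 + 5 + 1. A quick check, with the helper name `modality_subsets` introduced here only for illustration:

```python
from itertools import combinations

# The five clinical data types named in the abstract.
MODALITIES = [
    "discharge summaries",
    "radiology reports",
    "chest X-rays",
    "demographics",
    "time-series",
]


def modality_subsets(modalities, min_size=2):
    """Enumerate every subset containing at least min_size modalities."""
    return [
        combo
        for size in range(min_size, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


subsets = modality_subsets(MODALITIES)
counts_by_size = {
    size: sum(1 for s in subsets if len(s) == size) for size in range(2, 6)
}
```

Each tuple in `subsets` corresponds to one contrastive pre-training configuration in the sense described above.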
[LG-119] MathOptAI.jl: Embed trained machine learning predictors into JuMP models
Link: https://arxiv.org/abs/2507.03159
Authors: Oscar Dowson,Robert B Parker,Russel Bent
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:
Abstract:We present MathOptAI.jl, an open-source Julia library for embedding trained machine learning predictors into a JuMP model. MathOptAI.jl can embed a wide variety of neural networks, decision trees, and Gaussian Processes into a larger mathematical optimization model. In addition to interfacing a range of Julia-based machine learning libraries such as this http URL and this http URL, MathOptAI.jl uses Julia’s Python interface to provide support for PyTorch models. When the PyTorch support is combined with MathOptAI.jl’s gray-box formulation, the function, Jacobian, and Hessian evaluations associated with the PyTorch model are offloaded to the GPU in Python, while the rest of the nonlinear oracles are evaluated on the CPU in Julia. MathOptAI.jl is available at this https URL under a BSD-3 license.
[LG-120] HGCA: Hybrid GPU-CPU Attention for Long Context LLM Inference
Link: https://arxiv.org/abs/2507.03153
Authors: Weishu Deng,Yujie Yang,Peiran Du,Lingfeng Xiang,Zhen Lin,Chen Zhong,Song Jiang,Hui Lu,Jia Rao
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Scaling inference for large language models (LLMs) is increasingly constrained by limited GPU memory, especially due to growing key-value (KV) caches required for long-context generation. While existing approaches offload KV caches to CPU memory or apply sparse attention to reduce GPU load, they often underutilize CPU compute resources and compromise accuracy. We present HGCA, a hybrid CPU-GPU attention mechanism that enables scalable, high-throughput LLM inference with near-full attention quality. HGCA performs dense attention on recently generated KV entries retained in GPU memory and parallel sparse attention on selected, salient KV entries in CPU memory. The attention outputs are efficiently merged using log-sum-exp fusion, minimizing PCIe transfer overhead. HGCA also introduces a fine-grained, per-head sparsification strategy optimized for CPU execution, preserving contextual relevance while reducing computation. Our implementation seamlessly integrates into existing LLM frameworks without requiring model retraining. Experiments across diverse models and workloads show that HGCA achieves superior scalability, supports longer sequences and larger batch sizes, and outperforms existing sparse attention baselines in both performance and accuracy – all on commodity GPU hardware.
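The log-sum-exp fusion step can be illustrated independently of any hardware: attention computed separately over two disjoint sets of KV entries can be combined exactly if each part also reports the log-sum-exp of its scores. The sketch below is a plain-Python toy for a single query and is not the HGCA implementation; `partial_attention`, `merge`, and the sample vectors are assumptions made for the example.

```python
import math


def partial_attention(q, keys, values):
    """Softmax attention of one query over a subset of KV entries.

    Returns the output vector and the log-sum-exp of the raw scores.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs using their log-sum-exp values."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]


q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -0.3], [0.4, 0.9], [-0.5, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

full, _ = partial_attention(q, keys, values)
part_a = partial_attention(q, keys[:2], values[:2])  # e.g. entries on one device
part_b = partial_attention(q, keys[2:], values[2:])  # e.g. entries on the other
merged = merge(part_a[0], part_a[1], part_b[0], part_b[1])
```

Only one output vector and one scalar per part need to cross the device boundary. The merge is exact when both parts are dense; if one side attends to a sparse selection of entries, as in the abstract, the result approximates full attention.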
[LG-121] Last-Iterate Convergence of No-Regret Learning for Equilibria in Bargaining Games
Link: https://arxiv.org/abs/2507.03150
Authors: Serafina Kamp,Reese Liebman,Benjamin Fish
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:
Abstract:Bargaining games, where agents attempt to agree on how to split utility, are an important class of games used to study economic behavior, which motivates a study of online learning algorithms in these games. In this work, we tackle when no-regret learning algorithms converge to Nash equilibria in bargaining games. Recent results have shown that online algorithms related to Follow the Regularized Leader (FTRL) converge to Nash equilibria (NE) in the last iterate in a wide variety of games, including zero-sum games. However, bargaining games do not have the properties used previously to establish convergence guarantees, even in the simplest case of the ultimatum game, which features a single take-it-or-leave-it offer. Nonetheless, we establish that FTRL (without the modifications necessary for zero-sum games) achieves last-iterate convergence to an approximate NE in the ultimatum game along with a bound on convergence time under mild assumptions. Further, we provide experimental results to demonstrate that convergence to NE, including NE with asymmetric payoffs, occurs under a broad range of initial conditions, both in the ultimatum game and in bargaining games with multiple rounds. This work demonstrates how complex economic behavior (e.g. learning to use threats and the existence of many possible equilibrium outcomes) can result from using a simple learning algorithm, and that FTRL can converge to equilibria in a more diverse set of games than previously known.
[LG-122] Set Valued Predictions For Robust Domain Generalization ICML2025
Link: https://arxiv.org/abs/2507.03146
Authors: Ron Tsibulsky,Daniel Nevo,Uri Shalit
Categories: Machine Learning (cs.LG)
Comments: ICML 2025
Abstract:Despite the impressive advancements in modern machine learning, achieving robustness in Domain Generalization (DG) tasks remains a significant challenge. In DG, models are expected to perform well on samples from unseen test distributions (also called domains), by learning from multiple related training distributions. Most existing approaches to this problem rely on single-valued predictions, which inherently limit their robustness. We argue that set-valued predictors could be leveraged to enhance robustness across unseen domains, while also taking into account that these sets should be as small as possible. We introduce a theoretical framework defining successful set prediction in the DG setting, focusing on meeting a predefined performance criterion across as many domains as possible, and provide theoretical insights into the conditions under which such domain generalization is achievable. We further propose a practical optimization method compatible with modern learning architectures, that balances robust performance on unseen domains with small prediction set sizes. We evaluate our approach on several real-world datasets from the WILDS benchmark, demonstrating its potential as a promising direction for robust domain generalization.
[LG-123] Neural Substitute Solver for Efficient Edge Inference of Power Electronic Hybrid Dynamics
Link: https://arxiv.org/abs/2507.03144
Authors: Jialin Zheng,Haoyu Wang,Yangbin Zeng,Han Xu,Di Mou,Hong Li,Sergio Vazquez,Leopoldo G. Franquelo
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures
Abstract:Advancing the dynamics inference of power electronic systems (PES) to the real-time edge-side holds transformative potential for testing, control, and monitoring. However, efficiently inferring the inherent hybrid continuous-discrete dynamics on resource-constrained edge hardware remains a significant challenge. This letter proposes a neural substitute solver (NSS) approach, which is a neural-network-based framework aimed at rapid accurate inference with significantly reduced computational costs. Specifically, NSS leverages lightweight neural networks to substitute time-consuming matrix operation and high-order numerical integration steps in traditional solvers, which transforms sequential bottlenecks into highly parallel operation suitable for edge hardware. Experimental validation on a multi-stage DC-DC converter demonstrates that NSS achieves 23x speedup and 60% hardware resource reduction compared to traditional solvers, paving the way for deploying edge inference of high-fidelity PES dynamics.
[LG-124] BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
Link: https://arxiv.org/abs/2507.03117
Authors: Patrik Okanovic,Sameer Deshmukh,Grzegorz Kwasniewski,Kentaro Katayama,Takumi Honda,Maciej Besta,Torsten Hoefler
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:
Abstract:The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
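To show what a block sparsity pattern looks like, the sketch below zeroes out whole tiles of a weight matrix and keeps only the tiles with the largest squared norm. This is a one-shot magnitude heuristic assumed here for illustration. It is not the iterative BLaST procedure, and `block_prune`, the tile size, and the keep ratio are hypothetical.

```python
def block_prune(matrix, block, keep_ratio):
    """Keep the highest-energy block x block tiles and zero out the rest.

    Assumes both matrix dimensions are divisible by `block`.
    """
    rows, cols = len(matrix), len(matrix[0])
    scores = {}
    for bi in range(0, rows, block):
        for bj in range(0, cols, block):
            scores[(bi, bj)] = sum(
                matrix[i][j] ** 2
                for i in range(bi, bi + block)
                for j in range(bj, bj + block)
            )
    n_keep = max(1, round(keep_ratio * len(scores)))
    kept = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    return [
        [
            matrix[i][j] if (i - i % block, j - j % block) in kept else 0.0
            for j in range(cols)
        ]
        for i in range(rows)
    ]


weights = [
    [5.0, 5.0, 0.1, 0.1],
    [5.0, 5.0, 0.1, 0.1],
    [0.2, 0.2, 1.0, 1.0],
    [0.2, 0.2, 1.0, 1.0],
]
pruned = block_prune(weights, block=2, keep_ratio=0.25)
```

Because zeros come in whole tiles rather than scattered entries, a sparse matrix-matrix kernel can skip entire blocks of memory traffic, which is the property the abstract relies on for speedups.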
[LG-125] SymMatika: Structure-Aware Symbolic Discovery
Link: https://arxiv.org/abs/2507.03110
Authors: Michael Scherk,Boyuan Chen
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Symbolic regression (SR) seeks to recover closed-form mathematical expressions that describe observed data. While existing methods have advanced the discovery of either explicit mappings (i.e., $y = f(\mathbf{x})$) or implicit relations (i.e., $F(\mathbf{x}, y) = 0$), few modern and accessible frameworks support both. Moreover, most approaches treat each expression candidate in isolation, without reusing recurring structural patterns that could accelerate search. We introduce SymMatika, a hybrid SR algorithm that combines multi-island genetic programming (GP) with a reusable motif library inspired by biological sequence analysis. SymMatika identifies high-impact substructures in top-performing candidates and reintroduces them to guide future generations. Additionally, it incorporates a feedback-driven evolutionary engine and supports both explicit and implicit relation discovery using implicit-derivative metrics. Across benchmarks, SymMatika achieves state-of-the-art recovery rates, with 5.1% higher performance than the previous best results on Nguyen, the first recovery of Nguyen-12, and competitive performance on the Feynman equations. It also recovers implicit physical laws from Eureqa datasets up to $100\times$ faster. Our results demonstrate the power of structure-aware evolutionary search for scientific discovery. To support broader research in interpretable modeling and symbolic discovery, we have open-sourced the full SymMatika framework.
[LG-126] Mitigating Goal Misgeneralization with Minimax Regret
Link: https://arxiv.org/abs/2507.03068
Authors: Karim Abdel Sadek, Matthew Farrugia-Roberts, Usman Anwar, Hannah Erlebach, Christian Schroeder de Witt, David Krueger, Michael Dennis
Categories: Machine Learning (cs.LG)
Comments: Published at RLC 2025, 11 pages
Abstract:Safe generalization in reinforcement learning requires not only that a learned policy acts capably in new situations, but also that it uses its capabilities towards the pursuit of the designer’s intended goal. The latter requirement may fail when a proxy goal incentivizes similar behavior to the intended goal within the training environment, but not in novel deployment environments. This creates the risk that policies will behave as if in pursuit of the proxy goal, rather than the intended goal, in deployment – a phenomenon known as goal misgeneralization. In this paper, we formalize this problem setting in order to theoretically study the possibility of goal misgeneralization under different training objectives. We show that goal misgeneralization is possible under approximate optimization of the maximum expected value (MEV) objective, but not the minimax expected regret (MMER) objective. We then empirically show that the standard MEV-based training method of domain randomization exhibits goal misgeneralization in procedurally-generated grid-world environments, whereas current regret-based unsupervised environment design (UED) methods are more robust to goal misgeneralization (though they don’t find MMER policies in all cases). Our findings suggest that minimax expected regret is a promising approach to mitigating goal misgeneralization.
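The contrast between the two objectives can be seen in a toy payoff table. The sketch below is only an illustration of the definitions, not the paper's formal setting or proof: the policy names, environment names, returns, probabilities, and the tolerance `EPS` are all hypothetical. With pure policies, the worst-case expected regret over environment distributions reduces to the worst regret over single environments, which is what `max_regret` computes.

```python
# Hypothetical returns[policy][environment]; both policies look identical
# on training-like levels and differ only on rare "novel" levels.
returns = {
    "proxy_seeker": {"train_like": 10.0, "novel": 0.0},
    "goal_seeker": {"train_like": 10.0, "novel": 10.0},
}
# Hypothetical training distribution: novel levels are rare.
env_probs = {"train_like": 0.95, "novel": 0.05}
EPS = 1.0  # tolerance for "approximately optimal" (an assumption)


def expected_value(policy):
    """Expected return under the fixed training distribution (MEV objective)."""
    return sum(env_probs[e] * returns[policy][e] for e in env_probs)


def max_regret(policy):
    """Worst-case gap to the best achievable return in any environment."""
    return max(
        max(returns[p][e] for p in returns) - returns[policy][e]
        for e in env_probs
    )


best_value = max(expected_value(p) for p in returns)
best_regret = min(max_regret(p) for p in returns)

# Policies that are within EPS of optimal under each objective.
mev_ok = {p for p in returns if best_value - expected_value(p) <= EPS}
mmer_ok = {p for p in returns if max_regret(p) - best_regret <= EPS}
```

In this toy case the proxy-seeking policy is approximately optimal for expected value, because the levels that expose it are rare, but it is far from optimal for worst-case regret.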
[LG-127] Cycle-Consistent Helmholtz Machine: Goal-Seeded Simulation via Inverted Inference
Link: https://arxiv.org/abs/2507.03065
Authors: Xin Li
Categories: Machine Learning (cs.LG)
Comments:
Abstract:The Helmholtz Machine (HM) is a foundational architecture for unsupervised learning, coupling a bottom-up recognition model with a top-down generative model through alternating inference. However, its reliance on symmetric, data-driven updates constrains its ability to perform goal-directed reasoning or simulate temporally extended processes. In this work, we introduce the Cycle-Consistent Helmholtz Machine (C$^2$HM), a novel extension that reframes inference as a goal-seeded, asymmetric process grounded in structured internal priors. Rather than inferring latent causes solely from sensory data, C$^2$HM simulates plausible latent trajectories conditioned on abstract goals, aligning them with observed outcomes through a recursive cycle of forward generation and inverse refinement. This cycle-consistent formulation integrates top-down structure with bottom-up evidence via a variational loop, enforcing mutual alignment between goal-conditioned latent predictions and recognition-based reconstructions. We formalize this mechanism within the framework of the Context-Content Uncertainty Principle (CCUP), which posits that inference proceeds by aligning structured, low-entropy content with high-entropy, ambiguous context. C$^2$HM improves representational efficiency, supports memory chaining via path-dependent inference, and enables spatial compositional imagination. By offering a biologically inspired alternative to classical amortized inference, C$^2$HM reconceives generative modeling as intentional simulation, bridging memory-based planning and unsupervised learning in a unified probabilistic framework.
[LG-128] Statistical Quality and Reproducibility of Pseudorandom Number Generators in Machine Learning technologies
Link: https://arxiv.org/abs/2507.03007
Authors: Benjamin A. Antunes (LIRMM | DALI)
Categories: Other Computer Science (cs.OH); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:Machine learning (ML) frameworks rely heavily on pseudorandom number generators (PRNGs) for tasks such as data shuffling, weight initialization, dropout, and optimization. Yet the statistical quality and reproducibility of these generators, particularly when integrated into frameworks like PyTorch, TensorFlow, and NumPy, are underexplored. In this paper, we compare the statistical quality of PRNGs used in ML frameworks (Mersenne Twister, PCG, and Philox) against their original C implementations. Using the rigorous TestU01 BigCrush test suite, we evaluate 896 independent random streams for each generator. Our findings challenge claims of statistical robustness, revealing that even generators labeled “crush-resistant” (e.g., PCG, Philox) may fail certain statistical tests. Surprisingly, we can observe some differences in failure profiles between the native and framework-integrated versions of the same algorithm, highlighting some implementation differences that may exist.
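As a small illustration of what seed-level reproducibility means in practice, the sketch below uses Python's standard `random` module, whose core generator is the Mersenne Twister. The helper names `stream` and `ones_fraction` are made up for this example, and the bit-frequency check is only a crude sanity test. It is in no way a substitute for the TestU01 BigCrush suite used in the paper.

```python
import random


def stream(seed, n):
    """Return n 32-bit words from a freshly seeded Mersenne Twister."""
    rng = random.Random(seed)  # independent generator, no global state
    return [rng.getrandbits(32) for _ in range(n)]


def ones_fraction(words):
    """Fraction of 1-bits across all words (crude monobit-style check)."""
    ones = sum(bin(w).count("1") for w in words)
    return ones / (32 * len(words))


a = stream(1234, 1000)
b = stream(1234, 1000)  # same seed: should reproduce the stream exactly
c = stream(1235, 1000)  # different seed: should give a different stream

same_seed_reproducible = (a == b)
different_seed_differs = (a != c)
frac = ones_fraction(a)  # expected to be close to 0.5
```

Passing a check this coarse says very little about statistical quality; it only shows the kind of property (bit balance, stream equality) that a full test battery examines far more rigorously.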
[LG-129] Learning Beyond Euclid: Curvature-Adaptive Generalization for Neural Networks on Manifolds
Link: https://arxiv.org/abs/2507.02999
Authors: Krisanu Sarkar
Categories: Machine Learning (cs.LG); Differential Geometry (math.DG); Machine Learning (stat.ML)
Comments:
Abstract:In this work, we develop new generalization bounds for neural networks trained on data supported on Riemannian manifolds. Existing generalization theories often rely on complexity measures derived from Euclidean geometry, which fail to account for the intrinsic structure of non-Euclidean spaces. Our analysis introduces a geometric refinement: we derive covering number bounds that explicitly incorporate manifold-specific properties such as sectional curvature, volume growth, and injectivity radius. These geometric corrections lead to sharper Rademacher complexity bounds for classes of Lipschitz neural networks defined on compact manifolds. The resulting generalization guarantees recover standard Euclidean results when curvature is zero but improve substantially in settings where the data lies on curved, low-dimensional manifolds embedded in high-dimensional ambient spaces. We illustrate the tightness of our bounds in negatively curved spaces, where the exponential volume growth leads to provably higher complexity, and in positively curved spaces, where the curvature acts as a regularizing factor. This framework provides a principled understanding of how intrinsic geometry affects learning capacity, offering both theoretical insight and practical implications for deep learning on structured data domains.
[LG-130] Physics Augmented Machine Learning Discovery of Composition-Dependent Constitutive Laws for 3D Printed Digital Materials
Link: https://arxiv.org/abs/2507.02991
Authors: Steven Yang, Michal Levin, Govinda Anantha Padmanabha, Miriam Borshevsky, Ohad Cohen, D. Thomas Seidl, Reese E. Jones, Nikolaos Bouklas, Noy Cohen
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 39 pages, 12 figures, journal article, submitted to Composites Part B: Engineering
Abstract:Multi-material 3D printing, particularly through polymer jetting, enables the fabrication of digital materials by mixing distinct photopolymers at the micron scale within a single build to create a composite with tunable mechanical properties. This work presents an integrated experimental and computational investigation into the composition-dependent mechanical behavior of 3D printed digital materials. We experimentally characterize five formulations, combining soft and rigid UV-cured polymers under uniaxial tension and torsion across three strain and twist rates. The results reveal nonlinear and rate-dependent responses that strongly depend on composition. To model this behavior, we develop a physics-augmented neural network (PANN) that combines a partially input convex neural network (pICNN) for learning the composition-dependent hyperelastic strain energy function with a quasi-linear viscoelastic (QLV) formulation for time-dependent response. The pICNN ensures convexity with respect to strain invariants while allowing non-convex dependence on composition. To enhance interpretability, we apply L_0 sparsification. For the time-dependent response, we introduce a multilayer perceptron (MLP) to predict viscoelastic relaxation parameters from composition. The proposed model accurately captures the nonlinear, rate-dependent behavior of 3D printed digital materials in both uniaxial tension and torsion, achieving high predictive accuracy for interpolated material compositions. This approach provides a scalable framework for automated, composition-aware constitutive model discovery for multi-material 3D printing.
[LG-131] Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench
Link: https://arxiv.org/abs/2507.02976
Authors: Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:Large Language Models (LLMs) and their agentic frameworks are increasingly adopted to automate software development tasks such as issue resolution and program repair. While prior work has identified security risks in LLM-generated code, most evaluations have focused on synthetic or isolated settings, leaving open questions about the security of these systems in real-world development contexts. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data. Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which LLMs and agents are most likely to generate insecure code. Our findings reveal that the standalone LLM introduces nearly 9x more new vulnerabilities than developers, with many of these exhibiting unique patterns not found in developers’ code. Agentic workflows also generate a significant number of vulnerabilities, particularly when granting LLMs more autonomy, potentially increasing the likelihood of misinterpreting project context or task requirements. We find that vulnerabilities are more likely to occur in LLM patches associated with a higher number of files, more lines of generated code, and GitHub issues that lack specific code snippets or information about the expected code behavior and steps to reproduce. These results suggest that contextual factors play a critical role in the security of the generated code and point toward the need for proactive risk assessment methods that account for both code and issue-level information to complement existing vulnerability detection tools.
[LG-132] Introducing Answered with Evidence – a framework for evaluating whether LLM responses to biomedical questions are founded in evidence
Link: https://arxiv.org/abs/2507.02975
Authors: Julian D Baldwin, Christina Dinh, Arjun Mukerji, Neil Sanghavi, Saurabh Gombar
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: 17 pages; 3 figures; 3 main tables. This work will be submitted for full publication shortly; wanted to share ahead of industry conference
Abstract:The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, fka the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%. Combined, these sources enabled reliable answers to over 70% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence or generate such evidence in real time.
[LG-133] Beyond Parallelism: Synergistic Computational Graph Effects in Multi-Head Attention
Link: https://arxiv.org/abs/2507.02944
Authors: Haitz Sáez de Ocáriz Borde
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Multi-head attention powers Transformer networks, the primary deep learning architecture behind the success of large language models (LLMs). Yet, the theoretical advantages of multi-head versus single-head attention, beyond mere parallel processing, remain underexplored. In this paper, we reframe multi-head attention as a system of potentially synergistic computational graphs, where each head functions as a feedforward directed acyclic graph (DAG) with a common sink state. We provide intuition and preliminary theoretical analysis of mixing time and minimax fidelity in this framework. Our results show that multi-head attention can synergistically enhance information propagation, yielding faster mixing times and minimax fidelity amplification under specific head-diversity conditions. Finally, we train single-head and multi-head Transformers, each with the same total number of parameters, on sequence manipulation tasks and empirically verify the predicted effects.
[LG-134] Predictive Maintenance Optimization for Smart Vending Machines Using IoT and Machine Learning
Link: https://arxiv.org/abs/2507.02934
Authors: Md. Nisharul Hasan (Department of Industrial Engineering, Lamar University, Beaumont, Texas, USA)
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 20 pages, 3 figures and 4 tables
Abstract:The increasing proliferation of vending machines in public and commercial environments has placed a growing emphasis on operational efficiency and customer satisfaction. Traditional maintenance approaches, whether reactive or time-based preventive, are limited in their ability to preempt machine failures, leading to unplanned downtime and elevated service costs. This research presents a novel predictive maintenance framework tailored for vending machines by leveraging Internet of Things (IoT) sensors and machine learning (ML) algorithms. The proposed system continuously monitors machine components and operating conditions in real time and applies predictive models to forecast failures before they occur. This enables timely maintenance scheduling, minimizing downtime and extending machine lifespan. The framework was validated through simulated fault data and performance evaluation using classification algorithms. Results show a significant improvement in early fault detection and a reduction in redundant service interventions. The findings indicate that predictive maintenance systems, when integrated into vending infrastructure, can transform operational efficiency and service reliability.
[LG-135] Domain Knowledge in Artificial Intelligence: Using Conceptual Modeling to Increase Machine Learning Accuracy and Explainability
Link: https://arxiv.org/abs/2507.02922
Authors: V.C. Storey, J. Parsons, A. Castellanos, M. Tremblay, R. Lukyanenko, W. Maass, A. Castillo
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments:
Abstract:Machine learning enables the extraction of useful information from large, diverse datasets. However, despite many successful applications, machine learning continues to suffer from performance and transparency issues. These challenges can be partially attributed to the limited use of domain knowledge by machine learning models. This research proposes using the domain knowledge represented in conceptual models to improve the preparation of the data used to train machine learning models. We develop and demonstrate a method, called the Conceptual Modeling for Machine Learning (CMML), which is comprised of guidelines for data preparation in machine learning and based on conceptual modeling constructs and principles. To assess the impact of CMML on machine learning outcomes, we first applied it to two real-world problems to evaluate its impact on model performance. We then solicited an assessment by data scientists on the applicability of the method. These results demonstrate the value of CMML for improving machine learning outcomes.
[LG-136] Beyond Token Pruning: Operation Pruning in Vision-Language Models
Link: https://arxiv.org/abs/2507.02909
Authors: Aoming Liu, Reuben Tan, Boqing Gong, Bryan A. Plummer
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Prior Vision Language Model (VLM) token pruning reduces computation by eliminating attention and feed-forward operations for pruned tokens while maintaining all operations for critical tokens. However, this binary approach conflates token and operation redundancy: critical operations may be removed along with discarded tokens, while preserved tokens retain all potentially redundant operations. To surgically eliminate redundant operations while preserving critical ones, we propose Greedily Sorted Operation Pruning (GSOP), a data-driven method that directly prunes operations rather than tokens. GSOP first decomposes a VLM decoder’s computations into atomic operations along three dimensions: token groups, layer positions, and computation modules. GSOP determines the pruning order of operations through greedy sorting: GSOP iteratively selects the redundant operation that incurs minimal performance drop considering previously pruned operations. Different computational budgets can be accommodated without re-searching by simply pruning operations according to this order until the desired budget is met. GSOP enhances sorting efficiency through: a) leveraging historical operation rankings to avoid redundant evaluations; b) excluding the “free-to-prune” and “danger-to-prune” operations from sorting. GSOP achieves compelling efficiency-performance tradeoffs, reducing computation by 70% with only 4% performance loss while maintaining up to 18% higher performance than state-of-the-art methods when transferred across diverse VLMs and tasks. Real GPU efficiency evaluations confirm its practical value. The code is available at the URL linked from the arXiv abstract page.
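The greedy sorting step described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the GSOP implementation: the operation names, the `BASE_COST` table, the interaction penalty, and the `perf_drop` function are all hypothetical stand-ins for measuring a real model's performance drop on data.

```python
# Hypothetical per-operation cost of pruning each operation on its own.
BASE_COST = {
    "attn_early": 0.5,
    "ffn_early": 2.0,
    "attn_late": 0.1,
    "ffn_late": 0.3,
}


def perf_drop(pruned_ops):
    """Toy performance drop for a set of pruned operations.

    Assumption: pruning both late-layer operations together is much
    worse than the sum of pruning each alone.
    """
    drop = sum(BASE_COST[op] for op in pruned_ops)
    if "attn_late" in pruned_ops and "ffn_late" in pruned_ops:
        drop += 3.0
    return drop


def greedy_prune_order(ops, drop_fn):
    """Repeatedly pick the operation whose removal hurts least,
    given everything that has already been pruned."""
    pruned, remaining = [], list(ops)
    while remaining:
        best = min(remaining, key=lambda op: drop_fn(pruned + [op]))
        pruned.append(best)
        remaining.remove(best)
    return pruned


order = greedy_prune_order(BASE_COST, perf_drop)
# A static sort that ignores interactions, for comparison.
static_order = sorted(BASE_COST, key=BASE_COST.get)
# Any budget is then served by pruning a prefix of the greedy order.
budget_two = order[:2]
```

Because the order is computed once, different compute budgets only change how long a prefix is pruned, which matches the abstract's point about avoiding a new search per budget.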
[LG-137] Scaling Transformers for Time Series Forecasting: Do Pretrained Large Models Outperform Small-Scale Alternatives?
Link: https://arxiv.org/abs/2507.02907
Authors: Sanjay Chakraborty, Ibrahim Delibasoglu, Fredrik Heintz
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Large pre-trained models have demonstrated remarkable capabilities across domains, but their effectiveness in time series forecasting remains understudied. This work empirically examines whether pre-trained large-scale time series models (LSTSMs) trained on diverse datasets can outperform traditional non-pretrained small-scale transformers in forecasting tasks. We analyze state-of-the-art (SOTA) pre-trained universal time series models (e.g., Moirai, TimeGPT) alongside conventional transformers, evaluating accuracy, computational efficiency, and interpretability across multiple benchmarks. Our findings reveal the strengths and limitations of pre-trained LSTSMs, providing insights into their suitability for time series tasks compared to task-specific small-scale architectures. The results highlight scenarios where pretraining offers advantages and where simpler models remain competitive.
[LG-138] Harnessing Near-Infrared Spectroscopy and Machine Learning for Traceable Classification of Hanwoo and Holstein Beef
Link: https://arxiv.org/abs/2507.02903
Authors: AMM Nurul Alam, Abdul Samad, AMM Shamsul Alam, Jahan Ara Monti, Ayesha Muazzam
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This study evaluates the use of Near-Infrared spectroscopy (NIRS) combined with advanced machine learning (ML) techniques to differentiate Hanwoo beef (HNB) and Holstein beef (HLB) to address food authenticity, mislabeling, and adulteration. Rapid and non-invasive spectral data were attained by a portable NIRS, recording absorbance data within the wavelength range of 700 to 1100 nm. A total of 40 Longissimus lumborum samples, evenly split between HNB and HLB, were obtained from a local hypermarket. Data analysis using Principal Component Analysis (PCA) demonstrated distinct spectral patterns associated with chemical changes, clearly separating the two beef varieties and accounting for 93.72% of the total variance. ML models, including Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Logistic Regression (LR), Random Forest, Gradient Boosting (GB), K-Nearest Neighbors, Decision Tree (DT), Naive Bayes (NB), and Neural Networks (NN), were implemented, optimized through hyperparameter tuning, and validated by 5-fold cross-validation techniques to enhance model robustness and prevent overfitting. Random Forest provided the highest predictive accuracy with a Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) of 0.8826, closely followed by the SVM model at 0.8747. Furthermore, GB and NN algorithms exhibited satisfactory performances, with cross-validation scores of 0.752. Notably, the NN model achieved the highest recall rate of 0.7804, highlighting its suitability in scenarios requiring heightened sensitivity. DT and NB exhibited comparatively lower predictive performance. The LR and SVM models emerged as optimal choices by effectively balancing high accuracy, precision, and recall. This study confirms that integrating NIRS with ML techniques offers a powerful and reliable method for meat authenticity, significantly contributing to detecting food fraud.
[LG-139] Controllable diffusion-based generation for multi-channel biological data
Link: https://arxiv.org/abs/2507.02902
Authors: Haoran Zhang, Mingyuan Zhou, Wesley Tansey
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Spatial profiling technologies in biology, such as imaging mass cytometry (IMC) and spatial transcriptomics (ST), generate high-dimensional, multi-channel data with strong spatial alignment and complex inter-channel relationships. Generative modeling of such data requires jointly capturing intra- and inter-channel structure, while also generalizing across arbitrary combinations of observed and missing channels for practical application. Existing diffusion-based models generally assume low-dimensional inputs (e.g., RGB images) and rely on simple conditioning mechanisms that break spatial correspondence and ignore inter-channel dependencies. This work proposes a unified diffusion framework for controllable generation over structured and spatial biological data. Our model contains two key innovations: (1) a hierarchical feature injection mechanism that enables multi-resolution conditioning on spatially aligned channels, and (2) a combination of latent-space and output-space channel-wise attention to capture inter-channel relationships. To support flexible conditioning and generalization to arbitrary subsets of observed channels, we train the model using a random masking strategy, enabling it to reconstruct missing channels from any combination of inputs. We demonstrate state-of-the-art performance across both spatial and non-spatial prediction tasks, including protein imputation in IMC and gene-to-protein prediction in single-cell datasets, and show strong generalization to unseen conditional configurations.
[LG-140] Modeling Membrane Degradation in PEM Electrolyzers with Physics-Informed Neural Networks
Link: https://arxiv.org/abs/2507.02887
Authors: Alejandro Polo-Molina, Jose Portela, Luis Alberto Herrero Rozas, Román Cicero González
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures, 3 tables
Abstract:Proton exchange membrane (PEM) electrolyzers are pivotal for sustainable hydrogen production, yet their long-term performance is hindered by membrane degradation, which poses reliability and safety challenges. Therefore, accurate modeling of this degradation is essential for optimizing durability and performance. To address these concerns, traditional physics-based models have been developed, offering interpretability but requiring numerous parameters that are often difficult to measure and calibrate. Conversely, data-driven approaches, such as machine learning, offer flexibility but may lack physical consistency and generalizability. To address these limitations, this study presents the first application of Physics-Informed Neural Networks (PINNs) to model membrane degradation in PEM electrolyzers. The proposed PINN framework couples two ordinary differential equations, one modeling membrane thinning via a first-order degradation law and another governing the time evolution of the cell voltage under membrane degradation. Results demonstrate that the PINN accurately captures the system’s long-term degradation dynamics while preserving physical interpretability with limited noisy data. Consequently, this work introduces a novel hybrid modeling approach for estimating and understanding membrane degradation mechanisms in PEM electrolyzers, offering a foundation for more robust predictive tools in electrochemical system diagnostics.
[LG-141] On the Structure of Replicable Hypothesis Testers
Link: https://arxiv.org/abs/2507.02842
Authors: Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Shyam Narayanan, Sandeep Silwal
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: Abstract abridged to meet arxiv requirements
Abstract:A hypothesis testing algorithm is replicable if, when run on two different samples from the same distribution, it produces the same output with high probability. This notion, defined by Impagliazzo, Lei, Pitassi, and Sorell [STOC’22], can increase trust in testing procedures and is deeply related to algorithmic stability, generalization, and privacy. We build general tools to prove lower and upper bounds on the sample complexity of replicable testers, unifying and quantitatively improving upon existing results. We identify a set of canonical properties, and prove that any replicable testing algorithm can be modified to satisfy these properties without worsening accuracy or sample complexity. A canonical replicable algorithm computes a deterministic function of its input (i.e., a test statistic) and thresholds against a uniformly random value in $[0,1]$. It is invariant to the order in which the samples are received, and, if the testing problem is “symmetric,” then the algorithm is also invariant to the labeling of the domain elements, resolving an open question by Liu and Ye [NeurIPS’24]. We prove new lower bounds for uniformity, identity, and closeness testing by reducing to the case where the replicable algorithm satisfies these canonical properties. We systematize and improve upon a common strategy for replicable algorithm design based on test statistics with known expectation and bounded variance. Our framework allows testers which have been extensively analyzed in the non-replicable setting to be made replicable with minimal overhead. As direct applications of our framework, we obtain constant-factor optimal bounds for coin testing and closeness testing and get replicability for free in a large parameter regime for uniformity testing. We also give state-of-the-art bounds for replicable Gaussian mean testing, and, unlike prior work, our algorithm runs in polynomial time.
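The canonical form described in this abstract (a deterministic test statistic compared against a uniformly random threshold) can be sketched for a simple coin-bias test. This is an illustrative sketch under assumptions, not the paper's algorithm: the function names, the interval `[lo, hi]` used to rescale the statistic, and the use of a shared seed to model shared internal randomness are all choices made for this example.

```python
import random


def statistic(sample, lo=0.5, hi=0.6):
    """Deterministic test statistic: the sample mean rescaled to [0, 1].

    Depends only on the multiset of samples, so it is order-invariant.
    """
    mean = sum(sample) / len(sample)
    return min(1.0, max(0.0, (mean - lo) / (hi - lo)))


def replicable_test(sample, shared_seed):
    """Report 'biased' iff the statistic exceeds a random threshold.

    The threshold is drawn from randomness shared across runs, so two
    runs on different samples compare against the same value.
    """
    r = random.Random(shared_seed).random()  # uniform in [0, 1)
    return statistic(sample) > r


def draw(p, n, seed):
    """Draw n coin flips with heads probability p (fresh data per run)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


# Two independent samples from the same clearly biased coin.
biased_a = replicable_test(draw(0.7, 2000, seed=1), shared_seed=42)
biased_b = replicable_test(draw(0.7, 2000, seed=2), shared_seed=42)
# Two independent samples from a coin biased the other way.
low_a = replicable_test(draw(0.4, 2000, seed=3), shared_seed=42)
low_b = replicable_test(draw(0.4, 2000, seed=4), shared_seed=42)
```

Two runs can disagree only when the shared threshold falls between their two statistic values, which becomes unlikely once the sample is large enough for the statistic to concentrate.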
[LG-142] Automated Energy Billing with Blockchain and the Prophet Forecasting Model: A Holistic Approach
Link: https://arxiv.org/abs/2506.16649
Authors: Ajesh Thangaraj Nadar, Soham Chandane, Gabriel Nixon Raj, Nihar Mahesh Pasi, Yash Arvind Patil
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: 10 pages, 5 figures. Presented at IEEE International Conference on Multidisciplinary Research in Technology and Management MRTM 2023 held on 22 to 23 September 2023 at New Horizon College of Engineering India
Abstract:This paper presents a comprehensive approach to automated energy billing that leverages IoT-based smart meters, blockchain technology, and the Prophet time series forecasting model. The proposed system facilitates real-time power consumption monitoring via Wi-Fi-enabled ESP32 modules and a mobile application interface. It integrates Firebase and blockchain for secure, transparent billing processes and employs smart contracts for automated payments. The Prophet model is used for energy demand forecasting, with careful data preprocessing, transformation, and parameter tuning to improve prediction accuracy. This holistic solution aims to reduce manual errors, enhance user awareness, and promote sustainable energy use.
[LG-143] Blind Targeting: Personalization under Third-Party Privacy Constraints
Link: https://arxiv.org/abs/2507.05175
Authors: Anya Shchetkina
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP)
Comments:
Abstract:Major advertising platforms recently increased privacy protections by limiting advertisers’ access to individual-level data. Instead of providing access to granular raw data, the platforms only allow a limited number of aggregate queries to a dataset, which is further protected by adding differentially private noise. This paper studies whether and how advertisers can design effective targeting policies within these restrictive privacy-preserving data environments. To achieve this, I develop a probabilistic machine learning method based on Bayesian optimization, which facilitates dynamic data exploration. Since Bayesian optimization was designed to sample points from a function to find its maximum, it is not applicable to aggregate queries and to targeting. Therefore, I introduce two innovations: (i) integral updating of posteriors, which allows selecting the best regions of the data to query rather than individual points, and (ii) a targeting-aware acquisition function that dynamically selects the most informative regions for the targeting task. I identify the conditions of the dataset and privacy environment that necessitate the use of such a “smart” querying strategy. I apply the strategic querying method to the Criteo AI Labs dataset for uplift modeling (Diemert et al., 2018) that contains visit and conversion data from 14M users. I show that an intuitive benchmark strategy only achieves 33% of the non-privacy-preserving targeting potential in some cases, while my strategic querying method achieves 97-101% of that potential, and is statistically indistinguishable from Causal Forest (Athey et al., 2019), a state-of-the-art non-privacy-preserving machine learning targeting method.
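To illustrate what an aggregate query protected by differentially private noise can look like, here is a minimal sketch of a noisy counting query. The abstract does not say which noise mechanism the platforms use, so the Laplace mechanism here is an assumption, and the function name `noisy_count`, the record layout, and the value of `epsilon` are hypothetical.

```python
import random


def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1/epsilon.

    Assumption: a counting query has sensitivity 1, so scale = 1/epsilon.
    The difference of two i.i.d. exponential draws with rate epsilon is
    Laplace-distributed with that scale.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rng = random.Random(0)
# Hypothetical dataset: 1000 users, the first 300 of whom converted.
records = [{"converted": i < 300} for i in range(1000)]

one_answer = noisy_count(records, lambda r: r["converted"], 1.0, rng)

# Repeated only to check that the noise averages out; in a real
# deployment each repeat would consume part of a limited query budget.
answers = [
    noisy_count(records, lambda r: r["converted"], 1.0, rng)
    for _ in range(2000)
]
mean_answer = sum(answers) / len(answers)
```

The advertiser sees only values like `one_answer`, never the raw records, which is why choosing which regions of the data to query matters under a limited query budget.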
[LG-144] A Dynamical Systems Perspective on the Analysis of Neural Networks
Link: https://arxiv.org/abs/2507.05164
Authors: Dennis Chemnitz, Maximilian Engel, Christian Kuehn, Sara-Viola Kuntz
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Adaptation and Self-Organizing Systems (nlin.AO)
Comments: preprint of a book chapter contribution
Abstract:In this chapter, we utilize dynamical systems to analyze several aspects of machine learning algorithms. As an expository contribution we demonstrate how to re-formulate a wide variety of challenges from deep neural networks, (stochastic) gradient descent, and related topics into dynamical statements. We also tackle three concrete challenges. First, we consider the process of information propagation through a neural network, i.e., we study the input-output map for different architectures. We explain the universal embedding property for augmented neural ODEs representing arbitrary functions of given regularity, the classification of multilayer perceptrons and neural ODEs in terms of suitable function classes, and the memory-dependence in neural delay equations. Second, we consider the training aspect of neural networks dynamically. We describe a dynamical systems perspective on gradient descent and study stability for overdetermined problems. We then extend this analysis to the overparameterized setting and describe the edge of stability phenomenon, also in the context of possible explanations for implicit bias. For stochastic gradient descent, we present stability results for the overparameterized setting via Lyapunov exponents of interpolation solutions. Third, we explain several results regarding mean-field limits of neural networks. We describe a result that extends existing techniques to heterogeneous neural networks involving graph limits via digraph measures. This shows how large classes of neural networks naturally fall within the framework of Kuramoto-type models on graphs and their large-graph limits. Finally, we point out that similar strategies to use dynamics to study explainable and reliable AI can also be applied to settings such as generative models or fundamental issues in gradient training methods, such as backpropagation or vanishing/exploding gradients.
[LG-145] Pseudo-likelihood produces associative memories able to generalize even for asymmetric couplings
Link: https://arxiv.org/abs/2507.05147
Authors: Francesco D’Amico,Dario Bocchi,Luca Maria Del Bono,Saverio Rossi,Matteo Negri
Subjects: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*Comments:
Abstract:Energy-based probabilistic models learned by maximizing the likelihood of the data are limited by the intractability of the partition function. A widely used workaround is to maximize the pseudo-likelihood, which replaces the global normalization with tractable local normalizations. Here we show that, in the zero-temperature limit, a network trained to maximize pseudo-likelihood naturally implements an associative memory: if the training set is small, patterns become fixed-point attractors whose basins of attraction exceed those of any classical Hopfield rule. We quantitatively explain this effect on uncorrelated random patterns. Moreover, we show that, for different structured datasets coming from computer science (random feature model, MNIST), physics (spin glasses) and biology (proteins), as the number of training examples increases the learned network goes beyond memorization, developing meaningful attractors with non-trivial correlations with test examples, thus showing the ability to generalize. Our results therefore reveal that pseudo-likelihood works both as an efficient inference tool and as a principled mechanism for memory and generalization.
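The idea of replacing the global normalization with local ones can be illustrated with a minimal Python sketch. It assumes binary spins in {-1, +1}, unit inverse temperature, and a coupling matrix J without self-couplings; the function names and the Hebbian-style matrix `J_hebb` are hypothetical choices made here for illustration and are not taken from the paper. Each conditional term is log P(s_i | s_-i) = s_i h_i - log(2 cosh h_i) with local field h_i = sum_j J_ij s_j, so no partition function over all configurations is needed.

```python
import math

def log_pseudo_likelihood(J, patterns):
    """Average log pseudo-likelihood of +/-1 patterns under couplings J.

    J may be asymmetric; diagonal entries are ignored (no self-coupling).
    """
    total = 0.0
    for s in patterns:
        n = len(s)
        for i in range(n):
            # Local field acting on spin i from all the other spins.
            h = sum(J[i][j] * s[j] for j in range(n) if j != i)
            # log P(s_i | s_-i) = s_i * h - log(2 * cosh(h))
            total += s[i] * h - math.log(2.0 * math.cosh(h))
    return total / len(patterns)

def zero_temperature_step(J, s):
    """One synchronous zero-temperature update: s_i <- sign(sum_j J_ij s_j)."""
    n = len(s)
    updated = []
    for i in range(n):
        h = sum(J[i][j] * s[j] for j in range(n) if j != i)
        updated.append(1 if h >= 0 else -1)
    return updated

# Illustrative data (assumption): one stored pattern and two coupling matrices.
pattern = [1, -1, 1, -1]
J_zero = [[0.0] * 4 for _ in range(4)]
J_hebb = [[float(pattern[i] * pattern[j]) if i != j else 0.0 for j in range(4)]
          for i in range(4)]

print(log_pseudo_likelihood(J_zero, [pattern]))
print(log_pseudo_likelihood(J_hebb, [pattern]))
print(zero_temperature_step(J_hebb, pattern) == pattern)
```

With zero couplings every conditional is uniform, while couplings aligned with the pattern raise the pseudo-likelihood and make the pattern a fixed point of the zero-temperature dynamics. The couplings here are hand-built, not learned by maximizing the pseudo-likelihood as in the paper.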
[LG-146] Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes
Link: https://arxiv.org/abs/2507.05064
Authors: Tim Gyger,Reinhard Furrer,Fabio Sigrist
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:
Abstract:Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their scalability to large data sets is limited by computational constraints. To overcome these challenges, we propose Vecchia-inducing-points full-scale (VIF) approximations combining the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that reduce computational costs for training and prediction by several orders of magnitude compared to Cholesky-based computations when using a Laplace approximation. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are computationally efficient as well as more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open source C++ library GPBoost with high-level Python and R interfaces.
[LG-147] A COMPASS to Model Comparison and Simulation-Based Inference in Galactic Chemical Evolution
Link: https://arxiv.org/abs/2507.05060
Authors: Berkay Gunes,Sven Buder,Tobias Buck
Subjects: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an)
*Comments: Accepted at the 2025 Workshop on Machine Learning for Astrophysics
Abstract:We present COMPASS, a novel simulation-based inference framework that combines score-based diffusion models with transformer architectures to jointly perform parameter estimation and Bayesian model comparison across competing Galactic Chemical Evolution (GCE) models. COMPASS handles high-dimensional, incomplete, and variable-size stellar abundance datasets. Applied to high-precision elemental abundance measurements, COMPASS evaluates 40 combinations of nucleosynthetic yield tables. The model strongly favours Asymptotic Giant Branch yields from NuGrid and core-collapse SN yields used in the IllustrisTNG simulation, achieving near-unity cumulative posterior probability. Using the preferred model, we infer a steep high-mass IMF slope and an elevated Supernova Ia normalization, consistent with prior solar neighbourhood studies but now derived from fully amortized Bayesian inference. Our results demonstrate that modern SBI methods can robustly constrain uncertain physics in astrophysical simulators and enable principled model selection when analysing complex, simulation-based data.
[LG-148] A Generative Diffusion Model for Amorphous Materials
Link: https://arxiv.org/abs/2507.05024
Authors: Kai Yang,Daniel Schwalbe-Koda
Subjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*Comments: Main: 19 pages, 5 figures; SI: 22 pages, 16 figures
Abstract:Generative models show great promise for the inverse design of molecules and inorganic crystals, but remain largely ineffective within more complex structures such as amorphous materials. Here, we present a diffusion model that reliably generates amorphous structures up to 1000 times faster than conventional simulations across processing conditions, compositions, and data sources. Generated structures recovered the short- and medium-range order, sampling diversity, and macroscopic properties of silica glass, as validated by simulations and an information-theoretical strategy. Conditional generation allowed sampling large structures at low cooling rates of 10^{-2} K/ps to uncover a ductile-to-brittle transition and mesoporous silica structures. Extension to metallic glassy systems accurately reproduced local structures and properties from both computational and experimental datasets, demonstrating how synthetic data can be generated from characterization results. Our methods provide a roadmap for the design and simulation of amorphous materials previously inaccessible to computational methods.
[LG-149] Adaptive Slimming for Scalable and Efficient Speech Enhancement
Link: https://arxiv.org/abs/2507.04879
Authors: Riccardo Miccini,Minje Kim,Clément Laroche,Luca Pezzarossa,Paris Smaragdis
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*Comments: Accepted for publication at the 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025)
Abstract:Speech enhancement (SE) enables robust speech recognition, real-time communication, hearing aids, and other applications where speech quality is crucial. However, deploying such systems on resource-constrained devices involves choosing a static trade-off between performance and computational efficiency. In this paper, we introduce dynamic slimming to DEMUCS, a popular SE architecture, making it scalable and input-adaptive. Slimming lets the model operate at different utilization factors (UF), each corresponding to a different performance/efficiency trade-off, effectively mimicking multiple model sizes without the extra storage costs. In addition, a router subnet, trained end-to-end with the backbone, determines the optimal UF for the current input. Thus, the system saves resources by adaptively selecting smaller UFs when additional complexity is unnecessary. We show that our solution is Pareto-optimal against individual UFs, confirming the benefits of dynamic routing. When training the proposed dynamically-slimmable model to use 10% of its capacity on average, we obtain the same or better speech quality as the equivalent static 25% utilization while reducing MACs by 29%.
[LG-150] Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos
Link: https://arxiv.org/abs/2507.04845
Authors: Davide Berghi,Philip J. B. Jackson
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
*Comments:
Abstract:This report presents our systems submitted to the audio-only and audio-visual tracks of the DCASE2025 Task 3 Challenge: Stereo Sound Event Localization and Detection (SELD) in Regular Video Content. SELD is a complex task that combines temporal event classification with spatial localization, requiring reasoning across spatial, temporal, and semantic dimensions. The last is arguably the most challenging to model. Traditional SELD architectures rely on multichannel input, which limits their ability to leverage large-scale pre-training due to data constraints. To address this, we enhance standard SELD architectures with semantic information by integrating pre-trained, contrastive language-aligned models: CLAP for audio and OWL-ViT for visual inputs. These embeddings are incorporated into a modified Conformer module tailored for multimodal fusion, which we refer to as the Cross-Modal Conformer. Additionally, we incorporate autocorrelation-based acoustic features to improve distance estimation. We pre-train our models on curated synthetic audio and audio-visual datasets and apply a left-right channel swapping augmentation to further increase the training data. Both our audio-only and audio-visual systems substantially outperform the challenge baselines on the development set, demonstrating the effectiveness of our strategy. Performance is further improved through model ensembling and a visual post-processing step based on human keypoints. Future work will investigate the contribution of each modality and explore architectural variants to further enhance results.
[LG-151] Interpretable Machine Learning for Urban Heat Mitigation: Attribution and Weighting of Multi-Scale Drivers
Link: https://arxiv.org/abs/2507.04802
Authors: David Tschan,Zhi Wang,Jan Carmeliet,Yongling Zhao
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*Comments: 32 pages, 7 figures, 2 tables
Abstract:Urban heat islands (UHIs) are often accentuated during heat waves (HWs) and pose a public health risk. Mitigating UHIs requires urban planners to first estimate how urban heat is influenced by different land use types (LUTs) and drivers across scales - from synoptic-scale climatic background processes to small-scale urban- and scale-bridging features. This study proposes to classify these drivers into driving (D), urban (U), and local (L) features, respectively. To increase interpretability and enhance computational efficiency, a LUT-distinguishing machine learning approach is proposed as a fast emulator for the Weather Research and Forecasting model coupled to a Single-Layer Urban Canopy Model (WRF-SLUCM) to predict ground- (TSK) and 2-meter air temperature (T2). Using random forests (RFs) with extreme gradient boosting (XGB) trained on WRF-SLUCM output over Zurich, Switzerland, during heatwave (HW) periods in 2017 and 2019, this study proposes LUT-based (LB) models that categorize features by scales and practical controllability, allowing optional categorical weighting. This approach enables category-specific feature ranking and sensitivity estimation of T2 and TSK to the most important small-scale drivers - most notably surface emissivity, albedo, and leaf area index (LAI). Models employing the LB framework are statistically significantly more accurate than models that do not, with higher performance when more HW data is included in training. With RF-XGB robustly performing best with unit weights, the method substantially increases interpretability. Despite the need to reduce statistical uncertainties and to test the method on other cities, the proposed approach offers urban planners a direct framework for feasibility-centered UHI mitigation assessment.
[LG-152] Sure Convergence and Constructive Universal Approximation for Multi-Layer Neural Networks
Link: https://arxiv.org/abs/2507.04779
Authors: Chien-Ming Chi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments: 39 pages, 3 figures, 8 tables
Abstract:We propose a new neural network model, 01Neuro, built on indicator activation neurons. Its boosted variant possesses two key statistical properties: (1) Sure Convergence, where model optimization can be achieved with high probability given sufficient computational resources; and (2) Constructive Universal Approximation: In the infinite sample setting, the model can approximate any finite sum of measurable functions, each depending on only k out of p input features, provided the architecture is properly tuned. Unlike most universal approximation results that are agnostic to training procedures, our guarantees are directly tied to the model’s explicit construction and optimization algorithm. To improve prediction stability, we integrate stochastic training and bagging into the boosted 01Neuro framework. Empirical evaluations on simulated and real-world tabular datasets with small to medium sample sizes highlight its strengths: effective approximation of interaction components (multiplicative terms), stable prediction performance (comparable to Random Forests), robustness to many noisy features, and insensitivity to feature scaling. A major limitation of the current implementation of boosted 01Neuro is its higher computational cost, which is approximately 5 to 30 times that of Random Forests and XGBoost.
[LG-153] Intervening to learn and compose disentangled representations
Link: https://arxiv.org/abs/2507.04754
Authors: Alex Markham,Jeri A. Chang,Isaac Hirsch,Liam Solus,Bryon Aragam
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 45 pages, 14 figures
Abstract:In designing generative models, it is commonly believed that in order to learn useful latent structure, we face a fundamental tension between expressivity and structure. In this paper we challenge this view by proposing a new approach to training arbitrarily expressive generative models that simultaneously learn disentangled latent structure. This is accomplished by adding a simple decoder-only module to the head of an existing decoder block that can be arbitrarily complex. The module learns to process concept information by implicitly inverting linear representations from an encoder. Inspired by the notion of intervention in causal graphical models, our module selectively modifies its architecture during training, allowing it to learn a compact joint model over different contexts. We show how adding this module leads to disentangled representations that can be composed for out-of-distribution generation. To further validate our proposed approach, we prove a new identifiability result that extends existing work on identifying structured representations in nonlinear models.
[LG-154] Optimal Model Selection for Conformalized Robust Optimization
Link: https://arxiv.org/abs/2507.04716
Authors: Yajie Bao,Yang Hu,Haojie Ren,Peng Zhao,Changliang Zou
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*Comments:
Abstract:In decision-making under uncertainty, Contextual Robust Optimization (CRO) provides reliability by minimizing the worst-case decision loss over a prediction set, hedging against label variability. While recent advances use conformal prediction to construct prediction sets for machine learning models, the downstream decisions critically depend on model selection. This paper introduces novel model selection frameworks for CRO that unify robustness control with decision risk minimization. We first propose Conformalized Robust Optimization with Model Selection (CROMS), which automatically selects models to approximately minimize the average decision risk in CRO solutions. We develop two algorithms: E-CROMS, which is computationally efficient, and F-CROMS, which enjoys a marginal robustness guarantee in finite samples. Further, we introduce Conformalized Robust Optimization with Individualized Model Selection (CROiMS), which performs individualized model selection by minimizing the conditional decision risk given the covariate of test data. This framework advances conformal prediction methodology by enabling covariate-aware model selection. Theoretically, CROiMS achieves asymptotic conditional robustness and decision efficiency under mild assumptions. Numerical results demonstrate significant improvements in decision efficiency and robustness across diverse synthetic and real-world applications, outperforming baseline approaches.
[LG-155] Mutual Information Optimal Control of Discrete-Time Linear Systems
Link: https://arxiv.org/abs/2507.04712
Authors: Shoju Enami,Kenji Kashima
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
*Comments:
Abstract:In this paper, we formulate a mutual information optimal control problem (MIOCP) for discrete-time linear systems. This problem can be regarded as an extension of a maximum entropy optimal control problem (MEOCP). Unlike the MEOCP, where the prior is fixed to the uniform distribution, the MIOCP optimizes the policy and prior simultaneously. As analytical results, under the policy and prior classes consisting of Gaussian distributions, we derive the optimal policy and prior of the MIOCP with the prior and policy fixed, respectively. Using the results, we propose an alternating minimization algorithm for the MIOCP. Through numerical experiments, we discuss how our proposed algorithm works.
[LG-156] Quantum Algorithms for Bandits with Knapsacks with Improved Regret and Time Complexities
Link: https://arxiv.org/abs/2507.04438
Authors: Yuexin Su,Ziyi Yang,Peiyuan Huang,Tongyang Li,Yinyu Ye
Subjects: Quantum Physics (quant-ph); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments: 33 pages
Abstract:Bandits with knapsacks (BwK) constitute a fundamental model that combines aspects of stochastic integer programming with online learning. Classical algorithms for BwK with a time horizon T achieve a problem-independent regret bound of O(\sqrt{T}) and a problem-dependent bound of O(\log T) . In this paper, we initiate the study of the BwK model in the setting of quantum computing, where both reward and resource consumption can be accessed via quantum oracles. We establish both problem-independent and problem-dependent regret bounds for quantum BwK algorithms. For the problem-independent case, we demonstrate that a quantum approach can improve the classical regret bound by a factor of (1+\sqrt{B/\mathrm{OPT}_{\mathrm{LP}}}) , where B is the budget constraint in BwK and \mathrm{OPT}_{\mathrm{LP}} denotes the optimal value of a linear programming relaxation of the BwK problem. For the problem-dependent setting, we develop a quantum algorithm using an inexact quantum linear programming solver. This algorithm achieves a quadratic improvement in terms of the problem-dependent parameters, as well as a polynomial speedup of time complexity in the problem’s dimensions compared to classical counterparts. Compared to previous works on quantum algorithms for multi-armed bandits, our study is the first to consider bandit models with resource constraints and hence sheds light on operations research.
[LG-157] Neural Networks for Tamed Milstein Approximation of SDEs with Additive Symmetric Jump Noise Driven by a Poisson Random Measure
Link: https://arxiv.org/abs/2507.04417
Authors: Ramirez-Gonzalez Jose-Hermenegildo,Sun Ying
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: 15 pages, 9 figures, 4 tables
Abstract:This work aims to estimate the drift and diffusion functions in stochastic differential equations (SDEs) driven by a particular class of Lévy processes with finite jump intensity, using neural networks. We propose a framework that integrates the Tamed-Milstein scheme with neural networks employed as non-parametric function approximators. Estimation is carried out in a non-parametric fashion for the drift function f: \mathbb{Z} \to \mathbb{R} and the diffusion coefficient g: \mathbb{Z} \to \mathbb{R} . The model of interest is given by dX(t) = \xi + f(X(t)) \, dt + g(X(t)) \, dW_t + \gamma \int_{\mathbb{Z}} z \, N(dt,dz) , where W_t is a standard Brownian motion, and N(dt,dz) is a Poisson random measure on (\mathbb{R}_+ \times \mathbb{Z}, \mathcal{B}(\mathbb{R}_+) \otimes \mathcal{Z}, \lambda(\Lambda \otimes v)) , with \lambda, \gamma > 0 , \Lambda being the Lebesgue measure on \mathbb{R}_+ , and v a finite measure on the measurable space (\mathbb{Z}, \mathcal{Z}) . Neural networks are used as non-parametric function approximators, enabling the modeling of complex nonlinear dynamics without assuming restrictive functional forms. The proposed methodology constitutes a flexible alternative for inference in systems with state-dependent noise and discontinuities driven by Lévy processes.
[LG-158] A note on the unique properties of the Kullback–Leibler divergence for sampling via gradient flows
Link: https://arxiv.org/abs/2507.04330
Authors: Francesca Romana Crucinio
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*Comments:
Abstract:We consider the problem of sampling from a probability distribution \pi . It is well known that this can be written as an optimisation problem over the space of probability distributions in which we aim to minimise a divergence from \pi . The optimisation problem is normally solved through gradient flows in the space of probability distributions with an appropriate metric. We show that the Kullback–Leibler divergence is the only divergence in the family of Bregman divergences whose gradient flow w.r.t. many popular metrics does not require knowledge of the normalising constant of \pi .
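Why the normalising constant drops out for the Kullback–Leibler divergence can be seen in a short worked derivation. The sketch below takes the Wasserstein-2 metric as one example of a "popular metric" and writes \pi = \gamma / Z with unnormalised density \gamma and unknown constant Z; the symbols \gamma, Z and \mu_t are notation introduced here, not taken from the abstract.

```latex
% Assumption: \pi = \gamma / Z, with Z = \int \gamma(x)\,dx unknown.
\begin{align*}
\mathrm{KL}(\mu \,\|\, \pi)
  &= \int \mu \log\frac{\mu}{\pi}\,dx
   = \int \mu \log\frac{\mu}{\gamma}\,dx + \log Z, \\
\frac{\delta\,\mathrm{KL}}{\delta\mu}(x)
  &= \log\mu(x) - \log\gamma(x) + 1 + \log Z, \\
\partial_t \mu_t
  &= \nabla\cdot\!\Big(\mu_t\,\nabla\frac{\delta\,\mathrm{KL}}{\delta\mu}\Big)
   = \nabla\cdot\!\Big(\mu_t\,\nabla\log\frac{\mu_t}{\gamma}\Big).
\end{align*}
```

The constants 1 + \log Z are annihilated by the spatial gradient, so the flow only involves \nabla \log \gamma, which is computable without Z. The paper's claim that this property singles out the KL divergence among Bregman divergences is not derived here.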
[LG-159] Bandit Pareto Set Identification in a Multi-Output Linear Model AISTATS2025
Link: https://arxiv.org/abs/2507.04255
Authors: Cyrille Kone,Emilie Kaufmann,Laura Richert
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: In Proceedings of AISTATS 2025
Abstract:We study the Pareto Set Identification (PSI) problem in a structured multi-output linear bandit model. In this setting, each arm is associated with a feature vector belonging to \mathbb{R}^h , and its mean vector in \mathbb{R}^d linearly depends on this feature vector through a common unknown matrix \Theta \in \mathbb{R}^{h \times d} . The goal is to identify the set of non-dominated arms by adaptively collecting samples from the arms. We introduce and analyze the first optimal design-based algorithms for PSI, providing nearly optimal guarantees in both the fixed-budget and the fixed-confidence settings. Notably, we show that the difficulty of these tasks mainly depends on the sub-optimality gaps of h arms only. Our theoretical results are supported by an extensive benchmark on synthetic and real-world datasets.
[LG-160] Inertial Quadratic Majorization Minimization with Application to Kernel Regularized Learning
Link: https://arxiv.org/abs/2507.04247
Authors: Qiang Heng,Caixing Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:
Abstract:First-order methods in convex optimization offer low per-iteration cost but often suffer from slow convergence, while second-order methods achieve fast local convergence at the expense of costly Hessian inversions. In this paper, we highlight a middle ground: minimizing a quadratic majorant with fixed curvature at each iteration. This strategy strikes a balance between per-iteration cost and convergence speed, and crucially allows the reuse of matrix decompositions, such as Cholesky or spectral decompositions, across iterations and varying regularization parameters. We introduce the Quadratic Majorization Minimization with Extrapolation (QMME) framework and establish its sequential convergence properties under standard assumptions. The new perspective of our analysis is to center the arguments around the induced norm of the curvature matrix H . To demonstrate practical advantages, we apply QMME to large-scale kernel regularized learning problems. In particular, we propose a novel Sylvester equation modelling technique for kernel multinomial regression. In Julia-based experiments, QMME compares favorably against various established first- and second-order methods. Furthermore, we demonstrate that our algorithms complement existing kernel approximation techniques through more efficiently handling sketching matrices with large projection dimensions. Our numerical experiments and real data analysis are available and fully reproducible at this https URL.
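The "quadratic majorant with fixed curvature" idea can be illustrated on a one-dimensional toy problem. The sketch below minimises f(x) = log(1 + e^x) - c x, whose second derivative is bounded by 1/4, so a quadratic with curvature 0.25 majorises f everywhere. The objective, the constant c = 0.3 and the function names are assumptions chosen for illustration; the extrapolation step of QMME is deliberately omitted, so this is plain majorization-minimization, not the paper's algorithm.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad(x, c=0.3):
    # f(x) = log(1 + exp(x)) - c*x, hence f'(x) = sigmoid(x) - c.
    return sigmoid(x) - c

def quadratic_mm(x0, curvature=0.25, steps=100):
    """Minimise f by repeatedly minimising the fixed-curvature majorant
    f(x_k) + f'(x_k) * (x - x_k) + 0.5 * curvature * (x - x_k)**2.
    """
    x = x0
    for _ in range(steps):
        # The majorant's minimiser has a closed form; the curvature never changes.
        x = x - grad(x) / curvature
    return x

x_star = quadratic_mm(0.0)
print(x_star, math.log(0.3 / 0.7))  # iterate vs. exact minimiser of the toy problem
```

In several dimensions the division by `curvature` becomes a linear solve with a fixed matrix H, which is why a single Cholesky or spectral decomposition can be reused across iterations, as the abstract points out.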
[LG-161] Where to Intervene: Action Selection in Deep Reinforcement Learning
Link: https://arxiv.org/abs/2507.04187
Authors: Wenbo Zhang,Hengrui Cai
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
*Comments: Accepted by Transactions on Machine Learning Research (TMLR)
Abstract:Deep reinforcement learning (RL) has gained widespread adoption in recent years but faces significant challenges, particularly in unknown and complex environments. Among these, high-dimensional action selection stands out as a critical problem. Existing works often require a sophisticated prior design to eliminate redundancy in the action space, relying heavily on domain expert experience or involving high computational complexity, which limits their generalizability across different RL tasks. In this paper, we address these challenges by proposing a general data-driven action selection approach with model-free and computationally friendly properties. Our method not only selects minimal sufficient actions but also controls the false discovery rate via knockoff sampling. More importantly, we seamlessly integrate the action selection into deep RL methods during online training. Empirical experiments validate the established theoretical guarantees, demonstrating that our method surpasses various alternative techniques in terms of both performance in variable selection and overall achieved rewards.
[LG-162] Online Convex Optimization with Switching Cost with Only One Single Gradient Evaluation
Link: https://arxiv.org/abs/2507.04133
Authors: Harsh Shah,Purna Chandrasekhar,Rahul Vaze
Subjects: Optimization and Control (math.OC); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*Comments: 9 pages, 2 figures
Abstract:Online convex optimization with switching cost is considered under the frugal information setting where at time t , before action x_t is taken, only a single function evaluation and a single gradient is available at the previously chosen action x_{t-1} for either the current cost function f_t or the most recent cost function f_{t-1} . When the switching cost is linear, online algorithms with optimal order-wise competitive ratios are derived for the frugal setting. When the gradient information is noisy, an online algorithm whose competitive ratio grows quadratically with the noise magnitude is derived.
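The frugal information setting can be made concrete with a small simulation. The sketch below is a generic online gradient step, not the algorithm derived in the paper: it assumes one-dimensional quadratic costs f_t(x) = (x - theta_t)^2, a linear switching cost |x_t - x_{t-1}|, and an arbitrary step size of 0.25. The only information used to pick x_t is a single gradient of f_{t-1} evaluated at x_{t-1}.

```python
def run_frugal_gradient_steps(targets, step=0.25, x0=0.0):
    """Simulate hitting cost plus linear switching cost with one gradient per round.

    targets[t] is theta_t in the assumed cost f_t(x) = (x - theta_t)**2.
    Returns the final action and the accumulated total cost.
    """
    x_prev = x0
    total_cost = 0.0
    for t, theta in enumerate(targets):
        if t == 0:
            x = x_prev  # no past cost function to query yet
        else:
            # Single gradient evaluation: f_{t-1}'(x_{t-1}).
            g = 2.0 * (x_prev - targets[t - 1])
            x = x_prev - step * g
        # Pay the hitting cost f_t(x_t) and the switching cost |x_t - x_{t-1}|.
        total_cost += (x - theta) ** 2 + abs(x - x_prev)
        x_prev = x
    return x_prev, total_cost

final_x, cost = run_frugal_gradient_steps([1.0] * 50)
print(final_x, cost)
```

With a constant target the distance to the minimiser halves every round under this step size, so the action settles near 1.0 while the switching cost stays bounded. How to choose the update so that the competitive ratio is order-wise optimal is the subject of the paper and is not reproduced here.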
[LG-163] Model selection for stochastic dynamics: a parsimonious and principled approach
Link: https://arxiv.org/abs/2507.04121
Authors: Andonis Gerardos
Subjects: Machine Learning (stat.ML); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*Comments: PhD thesis, 201 pages, 29 figures. Defended at the University of Aix-Marseille on July 1, 2025. Supervised by Dr. Pierre Ronceray
Abstract:This thesis focuses on the discovery of stochastic differential equations (SDEs) and stochastic partial differential equations (SPDEs) from noisy and discrete time series. A major challenge is selecting the simplest possible correct model from vast libraries of candidate models, where standard information criteria (AIC, BIC) are often limited. We introduce PASTIS (Parsimonious Stochastic Inference), a new information criterion derived from extreme value theory. Its penalty term, n_{\mathcal{B}} \ln(n_0/p) , explicitly incorporates the size of the initial library of candidate parameters ( n_0 ), the number of parameters in the considered model ( n_{\mathcal{B}} ), and a significance threshold ( p ). This significance threshold represents the probability of selecting a model containing more parameters than necessary when comparing many models. Benchmarks on various systems (Lorenz, Ornstein-Uhlenbeck, Lotka-Volterra for SDEs; Gray-Scott for SPDEs) demonstrate that PASTIS outperforms AIC, BIC, cross-validation (CV), and SINDy (a competing method) in terms of exact model identification and predictive capability. Furthermore, real-world data can be subject to large sampling intervals ( \Delta t ) or measurement noise ( \sigma ), which can impair model learning and selection capabilities. To address this, we have developed robust variants of PASTIS, PASTIS- \Delta t and PASTIS- \sigma , thus extending the applicability of the approach to imperfect experimental data. PASTIS thus provides a statistically grounded, validated, and practical methodological framework for discovering simple models for processes with stochastic dynamics.
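The penalty term n_{\mathcal{B}} \ln(n_0/p) quoted in the abstract is easy to evaluate by hand. The sketch below assumes the criterion takes the usual "log-likelihood minus penalty" form, which the abstract does not spell out; the function names, the default p = 0.001 and the two candidate models with their log-likelihood values are hypothetical numbers chosen for illustration.

```python
import math

def pastis_penalty(n_b, n_0, p):
    """Penalty n_B * ln(n_0 / p) as stated in the abstract.

    n_b: number of parameters in the candidate model
    n_0: size of the initial library of candidate parameters
    p:   significance threshold
    """
    return n_b * math.log(n_0 / p)

def pastis_score(log_likelihood, n_b, n_0, p=0.001):
    # Assumed form of the criterion: higher is better.
    return log_likelihood - pastis_penalty(n_b, n_0, p)

# Hypothetical comparison: the larger model fits slightly better,
# but pays for two extra parameters drawn from a library of 50.
score_small = pastis_score(log_likelihood=-100.0, n_b=3, n_0=50)
score_large = pastis_score(log_likelihood=-99.0, n_b=5, n_0=50)
print(score_small, score_large)
```

Each extra parameter costs ln(n_0/p), which grows with the library size n_0, so a gain of one unit of log-likelihood does not justify two more terms in this example and the smaller model is preferred.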
[LG-164] Leo Breiman, the Rashomon Effect, and the Occam Dilemma
链接: https://arxiv.org/abs/2507.03884
作者: Cynthia Rudin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In the famous Two Cultures paper, Leo Breiman provided a visionary perspective on the cultures of “data models” (modeling with consideration of data generation) versus “algorithmic models” (vanilla machine learning models). I provide a modern perspective on these approaches. One of Breiman’s key arguments against data models is the “Rashomon Effect,” which is the existence of many different-but-equally-good models. The Rashomon Effect implies that data modelers would not be able to determine which model generated the data. Conversely, one of his core advantages in favor of data models is simplicity, as he claimed there exists an “Occam Dilemma,” i.e., an accuracy-simplicity tradeoff. After 25 years of powerful computers, it has become clear that this claim is not generally true, in that algorithmic models do not need to be complex to be accurate; however, there are nuances that help explain Breiman’s logic, specifically, that by “simple,” he appears to consider only linear models or unoptimized decision trees. Interestingly, the Rashomon Effect is a key tool in proving the nullification of the Occam Dilemma. To his credit though, Breiman did not have the benefit of modern computers, with which my observations are much easier to make. Breiman’s goal for interpretability was somewhat intertwined with causality: simpler models can help reveal which variables have a causal relationship with the outcome. However, I argue that causality can be investigated without the use of single models, whether or not they are simple. Interpretability is useful in its own right, and I think Breiman knew that too. Technically, my modern perspective does not belong to either of Breiman’s Two Cultures, but shares the goals of both of them - causality, simplicity, accuracy - and shows that these goals can be accomplished in other ways, without the limitations Breiman was concerned about.
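For readers new to the term, the Rashomon Effect is often formalized through a "Rashomon set". The notation below (hypothesis class $\mathcal{F}$, loss $L$, tolerance $\epsilon$) is a common convention assumed here for illustration and is not taken from the abstract:

```latex
\mathcal{R}(\epsilon) = \left\{ f \in \mathcal{F} \;:\; L(f) \le \min_{f' \in \mathcal{F}} L(f') + \epsilon \right\}
```

Under this reading, a large $\mathcal{R}(\epsilon)$ means that many different models are about equally accurate on the data.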
[LG-165] Causal Evidence for the Primordiality of Colors in Trans-Neptunian Objects ICML2025
链接: https://arxiv.org/abs/2507.03760
作者: Benjamin L. Davis,Mohamad Ali-Dib,Yujia Zheng,Zehao Jin,Kun Zhang,Andrea Valerio Macciò
类目: Earth and Planetary Astrophysics (astro-ph.EP); Machine Learning (cs.LG)
*备注: Accepted to ML4Astro 2025 (Machine Learning for Astrophysics workshop at ICML 2025)
Abstract:The origins of the colors of Trans-Neptunian Objects (TNOs) represent a crucial unresolved question, central to understanding the history of our Solar System. Recent observational surveys have revealed correlations between the eccentricity and inclination of TNOs and their colors. This has rekindled the long-standing debate on whether these colors reflect the conditions of TNO formation or their subsequent collisional evolution. In this study, we address this question with 98.7% certainty, using a model-agnostic, data-driven approach based on causal graphs. First, as a sanity check, we demonstrate how our model can replicate the currently accepted paradigms of TNOs’ dynamical history, blindly and without any orbital modeling or physics-based assumptions. In fact, our causal model (with no knowledge of the existence of Neptune) predicts the existence of an unknown perturbing body, i.e., Neptune. We then show how this model predicts, with high certainty, that the color of TNOs is the root cause of their inclination distribution, rather than the other way around. This strongly suggests that the colors of TNOs reflect an underlying dynamical property, most likely their formation location. Moreover, our causal model excludes formation scenarios that invoke substantial color modification by subsequent irradiation. We therefore conclude that the colors of TNOs are predominantly primordial.
[LG-166] Sequential Regression Learning with Randomized Algorithms
链接: https://arxiv.org/abs/2507.03759
作者: Dorival Leão,Reiko Aoki,Teh Led Red
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents “randomized SINDy”, a sequential machine learning algorithm designed for dynamic data that has a time-dependent structure. It employs a probabilistic approach, with its PAC learning property rigorously proven through the mathematical theory of functional analysis. The algorithm dynamically predicts using a learned probability distribution of predictors, updating weights via gradient descent and a proximal algorithm to maintain a valid probability density. Inspired by SINDy (Brunton et al. 2016), it incorporates feature augmentation and Tikhonov regularization. For multivariate normal weights, the proximal step is omitted to focus on parameter estimation. The algorithm’s effectiveness is demonstrated through experimental results in regression and binary classification using real-world data.
[LG-167] Implicit Regularisation in Diffusion Models: An Algorithm-Dependent Generalisation Analysis
链接: https://arxiv.org/abs/2507.03756
作者: Tyler Farghly,Patrick Rebeschini,George Deligiannidis,Arnaud Doucet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:The success of denoising diffusion models raises important questions regarding their generalisation behaviour, particularly in high-dimensional settings. Notably, it has been shown that when training and sampling are performed perfectly, these models memorise training data – implying that some form of regularisation is essential for generalisation. Existing theoretical analyses primarily rely on algorithm-independent techniques such as uniform convergence, heavily utilising model structure to obtain generalisation bounds. In this work, we instead leverage the algorithmic aspects that promote generalisation in diffusion models, developing a general theory of algorithm-dependent generalisation for this setting. Borrowing from the framework of algorithmic stability, we introduce the notion of score stability, which quantifies the sensitivity of score-matching algorithms to dataset perturbations. We derive generalisation bounds in terms of score stability, and apply our framework to several fundamental learning settings, identifying sources of regularisation. In particular, we consider denoising score matching with early stopping (denoising regularisation), sampler-wide coarse discretisation (sampler regularisation) and optimising with SGD (optimisation regularisation). By grounding our analysis in algorithmic properties rather than model structure, we identify multiple sources of implicit regularisation unique to diffusion models that have so far been overlooked in the literature.
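As background for the term "denoising score matching", one standard textbook form of the objective is sketched below. The symbols (score network $s_\theta$, noising kernel $p_t(\cdot \mid x_0)$, time weighting $\lambda(t)$) are assumed notation and may differ from the paper's:

```latex
\mathcal{L}_{\mathrm{DSM}}(\theta)
  = \mathbb{E}_{t}\,\lambda(t)\,
    \mathbb{E}_{x_0 \sim p_{\mathrm{data}}}\,
    \mathbb{E}_{x_t \sim p_t(\cdot \mid x_0)}
    \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \right\|_2^2
```

In practice the outer expectation over $p_{\mathrm{data}}$ is replaced by an average over the training set, which is why a perfectly trained and perfectly sampled model can reproduce training points.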
[LG-168] Determination of Particle-Size Distributions from Light-Scattering Measurement Using Constrained Gaussian Process Regression
链接: https://arxiv.org/abs/2507.03736
作者: Fahime Seyedheydari,Mahdi Nasiri,Marcin Mińkowski,Simo Särkkä
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optics (physics.optics); Methodology (stat.ME)
*备注:
Abstract:In this work, we propose a novel methodology for robustly estimating particle size distributions from optical scattering measurements using constrained Gaussian process regression. The estimation of particle size distributions is commonly formulated as a Fredholm integral equation of the first kind, an ill-posed inverse problem characterized by instability due to measurement noise and limited data. To address this, we use a Gaussian process prior to regularize the solution and integrate a normalization constraint into the Gaussian process via two approaches: by constraining the Gaussian process using a pseudo-measurement and by using Lagrange multipliers in the equivalent optimization problem. To improve computational efficiency, we employ a spectral expansion of the covariance kernel using eigenfunctions of the Laplace operator, resulting in a computationally tractable low-rank representation without sacrificing accuracy. Additionally, we investigate two complementary strategies for hyperparameter estimation: a data-driven approach based on maximizing the unconstrained log marginal likelihood, and an alternative approach where the physical constraints are taken into account. Numerical experiments demonstrate that the proposed constrained Gaussian process regression framework accurately reconstructs particle size distributions, producing numerically stable, smooth, and physically interpretable results. This methodology provides a principled and efficient solution for addressing inverse scattering problems and related ill-posed integral equations.
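The inverse problem named in the abstract can be written compactly as follows. This is a generic sketch with assumed symbols: $g(\theta)$ for the measured scattering signal, $K(\theta, r)$ for the scattering kernel, $f(r)$ for the particle size distribution, and $\varepsilon(\theta)$ for measurement noise. None of these names come from the paper:

```latex
g(\theta) = \int_{r_{\min}}^{r_{\max}} K(\theta, r)\, f(r)\, \mathrm{d}r + \varepsilon(\theta),
\qquad
\int_{r_{\min}}^{r_{\max}} f(r)\, \mathrm{d}r = 1
```

The first equation is a Fredholm integral equation of the first kind in $f$; the second is the normalization constraint that the abstract says is built into the Gaussian process.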
[LG-169] Transforming Calabi-Yau Constructions: Generating New Calabi-Yau Manifolds with Transformers
链接: https://arxiv.org/abs/2507.03732
作者: Jacky H. T. Yip,Charles Arnal,Francois Charton,Gary Shiu
类目: High Energy Physics - Theory (hep-th); Machine Learning (cs.LG); Algebraic Geometry (math.AG)
*备注: 37 pages, 14 figures
Abstract:Fine, regular, and star triangulations (FRSTs) of four-dimensional reflexive polytopes give rise to toric varieties, within which generic anticanonical hypersurfaces yield smooth Calabi-Yau threefolds. We employ transformers – deep learning models originally developed for language modeling – to generate FRSTs across a range of polytope sizes. Our models exhibit efficient and unbiased sampling, and can self-improve through retraining on their own output. These results lay the foundation for AICY: a community-driven platform that combines self-improving machine learning models with a continuously expanding FRST database to explore and catalog the Calabi-Yau landscape.
[LG-170] Differentially private scale testing via rank transformations and percentile modifications
链接: https://arxiv.org/abs/2507.03725
作者: Joshua Levine,Kelly Ramsay
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 4 figures, 35 tables, 53 pages
Abstract:We develop a class of differentially private two-sample scale tests, called the rank-transformed percentile-modified Siegel–Tukey tests, or RPST tests. These RPST tests are inspired both by recent differentially private extensions of some common rank tests and some older modifications to non-private rank tests. We present the asymptotic distribution of the RPST test statistic under the null hypothesis, under a very general condition on the rank transformation. We also prove RPST tests are differentially private, and that their type I error does not exceed the given level. We uncover that the growth rate of the rank transformation presents a tradeoff between power and sensitivity. We do extensive simulations to investigate the effects of the tuning parameters and compare to a general private testing framework. Lastly, we show that our techniques can also be used to improve the differentially private signed-rank test.
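For context, the classical (non-private) Siegel–Tukey test that the RPST tests build on ranks the pooled sample from the extremes inward. The sketch below illustrates only that classical ranking, ignores ties, and is not the paper's RPST procedure; the function names are hypothetical.

```python
def siegel_tukey_ranks(n):
    """Siegel-Tukey ranks for the sorted positions 0..n-1 (no tie handling).

    Rank 1 goes to the smallest value, ranks 2 and 3 to the two largest,
    ranks 4 and 5 to the next two smallest, and so on.
    """
    ranks = [0] * n
    lo, hi = 0, n - 1
    rank = 1
    take_low = True
    count = 1  # the first step assigns one rank; later steps assign two
    while lo <= hi:
        for _ in range(count):
            if lo > hi:
                break
            if take_low:
                ranks[lo] = rank
                lo += 1
            else:
                ranks[hi] = rank
                hi -= 1
            rank += 1
        take_low = not take_low
        count = 2
    return ranks


def siegel_tukey_statistic(sample_a, sample_b):
    """Sum of Siegel-Tukey ranks assigned to sample_a in the pooled sample."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b])
    ranks = siegel_tukey_ranks(len(pooled))
    return sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)


# Toy data: sample_a is concentrated, sample_b is spread out.
concentrated = [4.5, 5.0, 5.2, 5.5]
spread = [1.0, 2.0, 8.0, 9.0]
print(siegel_tukey_statistic(concentrated, spread))
```

Because extreme observations receive the low ranks, the sample with the larger rank sum is the less dispersed one.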
[LG-171] A Resource Efficient Quantum Kernel
链接: https://arxiv.org/abs/2507.03689
作者: Utkarsh Singh,Marco Armenta,Jean-Frédéric Laprade,Aaron Z. Goldberg,Khabat Heshami
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 17 pages, 12 figures
Abstract:Quantum processors may enhance machine learning by mapping high-dimensional data onto quantum systems for processing. Conventional quantum kernels, or feature maps, for encoding data features onto a quantum circuit are currently impractical, as the number of entangling gates scales quadratically with the dimension of the dataset and the number of qubits. In this work, we introduce a quantum kernel designed to handle high-dimensional data with a significantly reduced number of qubits and entangling operations. Our approach preserves essential data characteristics while promoting computational efficiency, as evidenced by extensive experiments on benchmark datasets that demonstrate a marked improvement in both accuracy and resource utilization, as compared to state-of-the-art quantum feature maps. Our noisy simulation results, combined with lower resource requirements, highlight our kernel’s ability to function within the constraints of noisy intermediate-scale quantum devices. Through numerical simulations and small-scale implementation on a superconducting circuit quantum computing platform, we demonstrate that our scheme performs on par with or better than a set of classical algorithms for classification. Our findings herald a promising avenue for the practical implementation of quantum machine learning algorithms on near-future quantum computing platforms.
[LG-172] Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data
链接: https://arxiv.org/abs/2507.03681
作者: Rickard Karlsson,Piersilvio De Bartolomeis,Issa J. Dahabreh,Jesse H. Krijthe
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 20 pages, including references and appendix
Abstract:Randomized trials are typically designed to detect average treatment effects but often lack the statistical power to uncover effect heterogeneity over patient characteristics, limiting their value for personalized decision-making. To address this, we propose the QR-learner, a model-agnostic learner that estimates conditional average treatment effects (CATE) within the trial population by leveraging external data from other trials or observational studies. The proposed method is robust: it has the potential to reduce the CATE prediction mean squared error while maintaining consistency, even when the external data is not aligned with the trial. Moreover, we introduce a procedure that combines the QR-learner with a trial-only CATE learner and show that it asymptotically matches or exceeds the trial-only learner in terms of mean squared error. We examine the performance of our approach in simulation studies and apply the methods to a real-world dataset, demonstrating improvements in both CATE estimation and statistical power for detecting heterogeneous effects.
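For reference, the estimand and the error criterion mentioned in the abstract are usually written as below. The potential-outcomes notation ($Y(1)$, $Y(0)$, covariates $X$, estimator $\hat{\tau}$) is the standard convention and is assumed here rather than quoted from the paper:

```latex
\tau(x) = \mathbb{E}\left[ Y(1) - Y(0) \mid X = x \right],
\qquad
\mathrm{MSE}(\hat{\tau}) = \mathbb{E}\left[ \left( \hat{\tau}(X) - \tau(X) \right)^2 \right]
```

The average treatment effect that trials are typically powered for is $\mathbb{E}[\tau(X)]$, which is why detecting variation in $\tau(x)$ across patients needs more data.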
[LG-173] UWB TDoA Error Correction using Transformers: Patching and Positional Encoding Strategies
链接: https://arxiv.org/abs/2507.03523
作者: Dieter Coppens,Adnan Shahid,Eli De Poorter
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 13 pages, 10 figures, 7 tables
Abstract:Despite their high accuracy, UWB-based localization systems suffer inaccuracies when deployed in industrial locations with many obstacles due to multipath effects and non-line-of-sight (NLOS) conditions. In such environments, current error mitigation approaches for time difference of arrival (TDoA) localization typically exclude NLOS links. However, this exclusion approach leads to geometric dilution of precision problems and this approach is infeasible when the majority of links are NLOS. To address these limitations, we propose a transformer-based TDoA position correction method that uses raw channel impulse responses (CIRs) from all available anchor nodes to compute position corrections. We introduce different CIR ordering, patching and positional encoding strategies for the transformer, and analyze each proposed technique’s scalability and performance gains. Based on experiments on real-world UWB measurements, our approach can provide accuracies of up to 0.39 m in a complex environment consisting of (almost) only NLOS signals, which is an improvement of 73.6 % compared to the TDoA baseline.
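To make the NLOS problem concrete, the sketch below computes ideal TDoA values for a tag and then adds an excess path length on one anchor link, as an obstructed path would. The anchor layout, tag position and the 0.5 m excess are hypothetical values chosen for illustration; this is not the paper's correction method.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def tdoa(tag, anchors, excess_path=None, ref=0):
    """Time differences of arrival (seconds) relative to anchors[ref].

    excess_path holds extra metres travelled on each link, for example
    because an obstacle blocks the direct path (NLOS).
    """
    if excess_path is None:
        excess_path = [0.0] * len(anchors)
    paths = [math.dist(tag, a) + e for a, e in zip(anchors, excess_path)]
    return [(p - paths[ref]) / SPEED_OF_LIGHT for p in paths]


anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # hypothetical layout, metres
tag = (5.0, 5.0)

ideal = tdoa(tag, anchors)
biased = tdoa(tag, anchors, excess_path=[0.0, 0.5, 0.0])

# Express each TDoA as a range difference in metres for readability.
print([round(t * SPEED_OF_LIGHT, 3) for t in ideal])
print([round(t * SPEED_OF_LIGHT, 3) for t in biased])
```

A solver that treats the biased value as a line-of-sight measurement will place the tag on the wrong hyperbola, which is the error that exclusion-based and correction-based methods both try to handle.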
[LG-174] Generating realistic patient data
链接: https://arxiv.org/abs/2507.03423
作者: Tabea Brandt,Christina Büsing,Johanna Leweke,Finn Seesemann,Sina Weber
类目: Optimization and Control (math.OC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:
Abstract:Developing algorithms for real-life problems that perform well in practice highly depends on the availability of realistic data for testing. Obtaining real-life data for optimization problems in health care, however, is often difficult. This is especially true for any patient-related optimization problems, e.g., for patient-to-room assignment, due to data privacy policies. Furthermore, obtained real-life data usually cannot be published, which prohibits reproducibility of results by other researchers. Therefore, artificially generated instances are often used. In this paper, we present combinatorial insights about the feasibility of instances for the patient-to-room assignment problem (PRA). We use these insights to develop a configurable instance generator for PRA with an easy-to-use graphical user interface. Configurability is in this case especially important as we observed in an extensive analysis of real-life data that, e.g., the probability distribution for patients’ age and length of stay depends on the respective ward.
[LG-175] Adaptive Gate-Aware Mamba Networks for Magnetic Resonance Fingerprinting
链接: https://arxiv.org/abs/2507.03369
作者: Tianyi Ding,Hongli Chen,Yang Gao,Zhuang Xiong,Feng Liu,Martijn A. Cloos,Hongfu Sun
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 31 pages, 7 figures
Abstract:Magnetic Resonance Fingerprinting (MRF) enables fast quantitative imaging by matching signal evolutions to a predefined dictionary. However, conventional dictionary matching (DM) suffers from exponential growth in computational cost and memory usage as the number of parameters increases, limiting its scalability to multi-parametric mapping. To address this, recent work has explored deep learning-based approaches as alternatives to DM. We propose GAST-Mamba, an end-to-end framework that combines a dual Mamba-based encoder with a Gate-Aware Spatial-Temporal (GAST) processor. Built on structured state-space models, our architecture efficiently captures long-range spatial dependencies with linear complexity. On 5 times accelerated simulated MRF data (200 frames), GAST-Mamba achieved a T1 PSNR of 33.12 dB, outperforming SCQ (31.69 dB). For T2 mapping, it reached a PSNR of 30.62 dB and SSIM of 0.9124. In vivo experiments further demonstrated improved anatomical detail and reduced artifacts. Ablation studies confirmed that each component contributes to performance, with the GAST module being particularly important under strong undersampling. These results demonstrate the effectiveness of GAST-Mamba for accurate and robust reconstruction from highly undersampled MRF acquisitions, offering a scalable alternative to traditional DM-based methods.
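The dictionary-matching baseline that the abstract contrasts against can be sketched as a search for the dictionary entry with the highest normalized inner product. The signal model below (`toy_signal`) is a made-up stand-in and not a real MRF simulation, and all names and grid values are hypothetical.

```python
import itertools
import math


def normalize(signal):
    """Scale a signal to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in signal))
    return [v / norm for v in signal]


def match(signal, dictionary):
    """Return (params, score) of the entry best correlated with the signal."""
    s = normalize(signal)
    best_params, best_score = None, -1.0
    for params, atom in dictionary:
        a = normalize(atom)
        score = abs(sum(x * y for x, y in zip(s, a)))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_signal(t1, t2, n_frames=20, dt=0.1):
    """Toy recovery-times-decay curve; NOT a physical MRF signal model."""
    times = [(k + 1) * dt for k in range(n_frames)]
    return [(1 - math.exp(-t / t1)) * math.exp(-t / t2) for t in times]


# One dictionary entry per grid point, so size is the product of grid sizes.
t1_grid = [0.5, 1.0, 1.5]
t2_grid = [0.05, 0.1, 0.2]
dictionary = [((t1, t2), toy_signal(t1, t2))
              for t1, t2 in itertools.product(t1_grid, t2_grid)]

measured = [2.0 * v for v in toy_signal(1.0, 0.1)]  # scaled copy of one entry
print(len(dictionary), match(measured, dictionary))
```

Each additional tissue parameter multiplies the dictionary size by its grid length, which is the exponential growth in cost and memory that the abstract refers to.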
[LG-176] LILI clustering algorithm: Limit Inferior Leaf Interval Integrated into Causal Forest for Causal Interference
链接: https://arxiv.org/abs/2507.03271
作者: Yiran Dong,Di Fan,Chuanhou Gao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Causal forest methods are powerful tools in causal inference. Similar to traditional random forest in machine learning, causal forest independently considers each causal tree. However, this independence consideration increases the likelihood that classification errors in one tree are repeated in others, potentially leading to significant bias in causal effect estimation. In this paper, we propose a novel approach that establishes connections between causal trees through the Limit Inferior Leaf Interval (LILI) clustering algorithm. LILIs are constructed based on the leaves of all causal trees, emphasizing the similarity of dataset confounders. When two instances with different treatments are grouped into the same leaf across a sufficient number of causal trees, they are treated as counterfactual outcomes of each other. Through this clustering mechanism, LILI clustering reduces bias present in traditional causal tree methods and enhances the prediction accuracy for the average treatment effect (ATE). By integrating LILIs into a causal forest, we develop an efficient causal inference method. Moreover, we explore several key properties of LILI by relating it to the concepts of limit inferior and limit superior in the set theory. Theoretical analysis rigorously proves the convergence of the estimated ATE using LILI clustering. Empirically, extensive comparative experiments demonstrate the superior performance of LILI clustering.
[LG-177] LANTERN: A Machine Learning Framework for Lipid Nanoparticle Transfection Efficiency Prediction
链接: https://arxiv.org/abs/2507.03209
作者: Asal Mehradfar,Mohammad Shahab Sepehri,Jose Miguel Hernandez-Lobato,Glen S. Kwon,Mahdi Soltanolkotabi,Salman Avestimehr,Morteza Rasoulianboroujeni
类目: Quantitative Methods (q-bio.QM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Molecular Networks (q-bio.MN)
*备注:
Abstract:The discovery of new ionizable lipids for efficient lipid nanoparticle (LNP)-mediated RNA delivery remains a critical bottleneck for RNA-based therapeutics development. Recent advances have highlighted the potential of machine learning (ML) to predict transfection efficiency from molecular structure, enabling high-throughput virtual screening and accelerating lead identification. However, existing approaches are hindered by inadequate data quality, ineffective feature representations, low predictive accuracy, and poor generalizability. Here, we present LANTERN (Lipid nANoparticle Transfection Efficiency pRedictioN), a robust ML framework for predicting transfection efficiency based on ionizable lipid representation. We benchmarked a diverse set of ML models against AGILE, a previously published model developed for transfection prediction. Our results show that combining simpler models with chemically informative features, particularly count-based Morgan fingerprints, outperforms more complex models that rely on internally learned embeddings, such as AGILE. We also show that a multi-layer perceptron trained on a combination of Morgan fingerprints and Expert descriptors achieved the highest performance (R^2 = 0.8161, r = 0.9053), significantly exceeding AGILE (R^2 = 0.2655, r = 0.5488). We show that the models in LANTERN consistently have strong performance across multiple evaluation metrics. Thus, LANTERN offers a robust benchmarking framework for LNP transfection prediction and serves as a valuable tool for accelerating lipid-based RNA delivery systems design.
[LG-178] Beyond SEO: A Transformer-Based Approach for Reinventing Web Content Optimisation
链接: https://arxiv.org/abs/2507.03169
作者: Florian Lüttgenau,Imar Colic,Gervasio Ramirez
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 9 pages, 3 figures
Abstract:The rise of generative AI search engines is disrupting traditional SEO, with Gartner predicting a 25% reduction in conventional search usage by 2026. This necessitates new approaches for web content visibility in AI-driven search environments. We present a domain-specific fine-tuning approach for Generative Engine Optimization (GEO) that transforms web content to improve discoverability in large language model outputs. Our method fine-tunes a BART-base transformer on synthetically generated training data comprising 1,905 cleaned travel website content pairs. Each pair consists of raw website text and its GEO-optimized counterpart incorporating credible citations, statistical evidence, and improved linguistic fluency. We evaluate using intrinsic metrics (ROUGE-L, BLEU) and extrinsic visibility assessments through controlled experiments with Llama-3.3-70B. The fine-tuned model achieves significant improvements over baseline BART: ROUGE-L scores of 0.249 (vs. 0.226) and BLEU scores of 0.200 (vs. 0.173). Most importantly, optimized content demonstrates substantial visibility gains in generative search responses, with a 15.63% improvement in absolute word count and a 30.96% improvement in position-adjusted word count metrics. This work provides the first empirical demonstration that targeted transformer fine-tuning can effectively enhance web content visibility in generative search engines with modest computational resources. Our results suggest GEO represents a tractable approach for content optimization in the AI-driven search landscape, offering concrete evidence that small-scale, domain-focused fine-tuning yields meaningful improvements in content discoverability.
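Since the headline intrinsic metric is ROUGE-L, a minimal sketch of how such a score is computed may help. It assumes whitespace tokenization and a balanced F1 (equal weight on precision and recall); published ROUGE implementations differ in tokenization, stemming and the F-measure weighting, so the numbers will not match the paper's tooling exactly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F1 over whitespace tokens (a simplifying assumption)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Because the score rewards in-order token overlap with a reference, small absolute gains such as 0.226 to 0.249 indicate that outputs track the reference rewrites more closely, not that they are better by any external measure.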
[LG-179] Multiple data-driven missing imputation
链接: https://arxiv.org/abs/2507.03061
作者: Sergii Kavun
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: this https URL
Abstract:This paper introduces KZImputer, a novel adaptive imputation method for univariate time series designed for short to medium-sized missed points (gaps) (1-5 points and beyond) with tailored strategies for segments at the start, middle, or end of the series. KZImputer employs a hybrid strategy to handle various missing data scenarios. Its core mechanism differentiates between gaps at the beginning, middle, or end of the series, applying tailored techniques at each position to optimize imputation accuracy. The method leverages linear interpolation and localized statistical measures, adapting to the characteristics of the surrounding data and the gap size. The performance of KZImputer has been systematically evaluated against established imputation techniques, demonstrating its potential to enhance data quality for subsequent time series analysis. This paper describes the KZImputer methodology in detail and discusses its effectiveness in improving the integrity of time series data. Empirical analysis demonstrates that KZImputer achieves particularly strong performance for datasets with high missingness rates (around 50% or more), maintaining stable and competitive results across statistical and signal-reconstruction metrics. The method proves especially effective in high-sparsity regimes, where traditional approaches typically experience accuracy degradation.
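The idea of treating gaps differently by position can be illustrated with a very small sketch: nearest-value fill at the start and end, linear interpolation in the middle. This is a generic illustration under those assumptions and is not the KZImputer algorithm, whose gap-size rules and localized statistics are not given in the abstract; `fill_gaps` is a hypothetical name.

```python
def fill_gaps(series):
    """Fill None entries in a univariate series, depending on gap position.

    Leading gaps copy the first observed value, trailing gaps copy the last
    observed value, and interior gaps are linearly interpolated.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    out = list(series)
    first, last = observed[0], observed[-1]
    for i in range(first):                      # gap at the start
        out[i] = series[first]
    for i in range(last + 1, len(series)):      # gap at the end
        out[i] = series[last]
    for left, right in zip(observed, observed[1:]):  # gaps in the middle
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            out[i] = series[left] + weight * (series[right] - series[left])
    return out


print(fill_gaps([None, 1.0, None, None, 4.0, None]))
```

Edge gaps have a neighbour on only one side, so interpolation is not available there; that asymmetry is what motivates position-specific strategies.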
[LG-180] Enhancing Swarms Durability to Threats via Graph Signal Processing and GNN-based Generative Modeling
链接: https://arxiv.org/abs/2507.03039
作者: Jonathan Karin,Zoe Piran,Mor Nitzan
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注:
Abstract:Swarms, such as schools of fish or drone formations, are prevalent in both natural and engineered systems. While previous works have focused on the social interactions within swarms, the role of external perturbations–such as environmental changes, predators, or communication breakdowns–in affecting swarm stability is not fully understood. Our study addresses this gap by modeling swarms as graphs and applying graph signal processing techniques to analyze perturbations as signals on these graphs. By examining predation, we uncover a “detectability-durability trade-off”, demonstrating a tension between a swarm’s ability to evade detection and its resilience to predation, once detected. We provide theoretical and empirical evidence for this trade-off, explicitly tying it to properties of the swarm’s spatial configuration. Toward task-specific optimized swarms, we introduce SwaGen, a graph neural network-based generative model. We apply SwaGen to resilient swarm generation by defining a task-specific loss function, optimizing the contradicting trade-off terms simultaneously. With this, SwaGen reveals novel spatial configurations, optimizing the trade-off at both ends. Applying the model can guide the design of robust artificial swarms and deepen our understanding of natural swarm dynamics.
[LG-181] Modeling Gene Expression Distributional Shifts for Unseen Genetic Perturbations
链接: https://arxiv.org/abs/2507.02980
作者: Kalyan Ramakrishnan,Jonathan G. Hedley,Sisi Qu,Puneet K. Dokania,Philip H. S. Torr,Cesar A. Prada-Medina,Julien Fauqueur,Kaspar Martens
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:We train a neural network to predict distributional responses in gene expression following genetic perturbations. This is an essential task in early-stage drug discovery, where such responses can offer insights into gene function and inform target identification. Existing methods only predict changes in the mean expression, overlooking stochasticity inherent in single-cell data. In contrast, we offer a more realistic view of cellular responses by modeling expression distributions. Our model predicts gene-level histograms conditioned on perturbations and outperforms baselines in capturing higher-order statistics, such as variance, skewness, and kurtosis, at a fraction of the training cost. To generalize to unseen perturbations, we incorporate prior knowledge via gene embeddings from large language models (LLMs). While modeling a richer output space, the method remains competitive in predicting mean expression changes. This work offers a practical step towards more expressive and biologically informative models of perturbation effects.
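To show what a histogram output makes available beyond the mean, the sketch below computes the four statistics named in the abstract from bin centers and bin weights. The bin values are toy numbers and the function name is hypothetical; the paper's binning and evaluation protocol are not described in the abstract.

```python
def histogram_moments(centers, weights):
    """Mean, variance, skewness and (non-excess) kurtosis of a histogram.

    centers are bin midpoints; weights are non-negative bin masses that are
    normalized to sum to one before the moments are taken.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(c * p for c, p in zip(centers, probs))
    var = sum((c - mean) ** 2 * p for c, p in zip(centers, probs))
    if var == 0:
        raise ValueError("all mass is in one bin; skewness and kurtosis are undefined")
    skew = sum((c - mean) ** 3 * p for c, p in zip(centers, probs)) / var ** 1.5
    kurt = sum((c - mean) ** 4 * p for c, p in zip(centers, probs)) / var ** 2
    return mean, var, skew, kurt


# Two toy histograms with the same mean but different shapes.
print(histogram_moments([0.0, 1.0, 2.0], [0.25, 0.5, 0.25]))
print(histogram_moments([0.0, 1.0, 2.0], [0.45, 0.10, 0.45]))
```

Both toy histograms have mean 1, so a mean-only predictor could not tell them apart, while their variance and kurtosis differ clearly.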
[LG-182] A Novel Method to Manage Production on Industry 4.0: Forecasting Overall Equipment Efficiency by Time Series with Topological Features
链接: https://arxiv.org/abs/2507.02890
作者: Korkut Anapa,İsmail Güzel,Ceylan Yozgatlıgil
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 38 pages
Abstract:Purpose: Overall equipment efficiency (OEE) is a key manufacturing KPI, but its volatile nature complicates short-term forecasting. This study presents a novel framework combining time series decomposition and topological data analysis to improve OEE prediction across various equipment, such as hydraulic press systems. Methods: The approach begins by decomposing hourly OEE data into trend, seasonal, and residual components. The residual, capturing short-term variability, is modeled using a seasonal ARIMA with exogenous variables (SARIMAX). These exogenous features include statistical descriptors and topological summaries from related time series. To manage the high-dimensional input space, we propose a hybrid feature selection strategy using recursive feature elimination based on statistically significant SARIMAX predictors, coupled with BIC-guided particle swarm optimization. The framework is evaluated on real-world datasets from multiple production systems. Results: The proposed model consistently outperforms conventional time series models and advanced transformer-based approaches, achieving significantly lower mean absolute error and mean absolute percentage error. Conclusion: Integrating classical forecasting with topological data analysis enhances OEE prediction accuracy, enabling proactive maintenance and informed production decisions in complex manufacturing environments.
[LG-183] DISPROTBENCH: A Disorder-Aware Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts
链接: https://arxiv.org/abs/2507.02883
作者: Xinyue Zeng,Tuo Wang,Adithya Kulkarni,Alexander Lu,Alexandra Ni,Phoebe Xing,Junhan Zhao,Siwei Chen,Dawei Zhou
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:
Abstract:Recent advances in protein structure prediction have achieved near-atomic accuracy for well-folded proteins. However, current benchmarks inadequately assess model performance in biologically challenging contexts, especially those involving intrinsically disordered regions (IDRs), limiting their utility in applications such as drug discovery, disease variant interpretation, and protein interface design. We introduce DisProtBench, a comprehensive benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions. DisProtBench spans three key axes: (1) Data complexity, covering disordered regions, G protein-coupled receptor (GPCR) ligand pairs, and multimeric complexes; (2) Task diversity, benchmarking twelve leading PSPMs across structure-based tasks with unified classification, regression, and interface metrics; and (3) Interpretability, via the DisProtBench Portal, which provides precomputed 3D structures and visual error analyses. Our results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures. Notably, global accuracy metrics often fail to predict task performance in disordered settings, emphasizing the need for function-aware evaluation. DisProtBench establishes a reproducible, extensible, and biologically grounded framework for assessing next-generation PSPMs in realistic biomedical scenarios.
[LG-184] Biaxialformer: Leveraging Channel Independence and Inter-Channel Correlations in EEG Signal Decoding for Predicting Neurological Outcomes
Link: https://arxiv.org/abs/2507.02879
Authors: Naimahmed Nesaragi,Hemin Ali Qadir,Per Steiner Halvorsen,Ilangko Balasingham
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 12 pages, 3 figures, Article
Abstract:Accurate decoding of EEG signals requires comprehensive modeling of both temporal dynamics within individual channels and spatial dependencies across channels. While Transformer-based models utilizing channel-independence (CI) strategies have demonstrated strong performance in various time series tasks, they often overlook the inter-channel correlations that are critical in multivariate EEG signals. This omission can lead to information degradation and reduced prediction accuracy, particularly in complex tasks such as neurological outcome prediction. To address these challenges, we propose Biaxialformer, characterized by a meticulously engineered two-stage attention-based framework. This model independently captures both sequence-specific (temporal) and channel-specific (spatial) EEG information, promoting synergy and mutual reinforcement across channels without sacrificing CI. By employing joint learning of positional encodings, Biaxialformer preserves both temporal and spatial relationships in EEG data, mitigating the inter-channel correlation forgetting problem common in traditional CI models. Additionally, a tokenization module with variable receptive fields balances the extraction of fine-grained, localized features and broader temporal dependencies. To enhance spatial feature extraction, we leverage bipolar EEG signals, which capture inter-hemispheric brain interactions, a critical but often overlooked aspect in EEG analysis. Our study broadens the use of Transformer-based models by addressing the challenge of predicting neurological outcomes in comatose patients. Using the multicenter I-CARE data from five hospitals, we validate the robustness and generalizability of Biaxialformer with an average AUC of 0.7688, AUPRC of 0.8643, and F1 of 0.6518 in a cross-hospital scenario.
Information Retrieval
[IR-0] In-Context Learning as an Effective Estimator of Functional Correctness of LLM-Generated Code
Link: https://arxiv.org/abs/2507.05200
Authors: Susmita Das,Madhusudan Ghosh,Priyanka Swami,Debasis Ganguly,Gul Calikli
Categories: Software Engineering (cs.SE); Information Retrieval (cs.IR)
Comments:
Abstract:When applying LLM-based code generation to software development projects that follow a feature-driven or rapid application development approach, it becomes necessary to estimate the functional correctness of the generated code in the absence of test cases. Just as a user selects a relevant document from a ranked list of retrieved ones, a software generation workflow requires a developer to choose (and potentially refine) a generated solution from a ranked list of alternative solutions, ordered by their posterior likelihoods. This implies that estimating the quality of a ranked list – akin to estimating “relevance” for query performance prediction (QPP) in IR – is also crucial for generative software development, where quality is defined in terms of “functional correctness”. In this paper, we propose an in-context learning (ICL) based approach for code quality estimation. Our findings demonstrate that providing few-shot examples of functionally correct code from a training set enhances the performance of existing QPP approaches as well as a zero-shot-based approach for code quality estimation.
[IR-1] Interest Networks (iNETs) for Cities: Cross-Platform Insights and Urban Behavior Explanations KDD
Link: https://arxiv.org/abs/2507.04995
Authors: Gustavo H. Santos,Myriam Delgado,Thiago H. Silva
Categories: Social and Information Networks (cs.SI); Information Retrieval (cs.IR)
Comments: Accepted at ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD-UMC)
Abstract:Location-Based Social Networks (LBSNs) provide a rich foundation for modeling urban behavior through iNETs (Interest Networks), which capture how user interests are distributed throughout urban spaces. This study compares iNETs across platforms (Google Places and Foursquare) and spatial granularities, showing that coarser levels reveal more consistent cross-platform patterns, while finer granularities expose subtle, platform-specific behaviors. Our analysis finds that, in general, user interest is primarily shaped by geographic proximity and venue similarity, while socioeconomic and political contexts play a lesser role. Building on these insights, we develop a multi-level, explainable recommendation system that predicts high-interest urban regions for different user types. The model adapts to behavior profiles – such as explorers, who are driven by proximity, and returners, who prefer familiar venues – and provides natural-language explanations using explainable AI (XAI) techniques. To support our approach, we introduce h3-cities, a tool for multi-scale spatial analysis, and release a public demo for interactively exploring personalized urban recommendations. Our findings contribute to urban mobility research by providing scalable, context-aware, and interpretable recommendation systems.
[IR-2] SimLab: A Platform for Simulation-based Evaluation of Conversational Information Access Systems
Link: https://arxiv.org/abs/2507.04888
Authors: Nolwenn Bernard,Sharath Chandra Etagi Suresh,Krisztian Balog,ChengXiang Zhai
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Research on interactive and conversational information access systems, including search engines, recommender systems, and conversational assistants, has been hindered by the difficulty in evaluating such systems with reproducible experiments. User simulation provides a promising solution, but there is a lack of infrastructure and tooling to support this kind of evaluation. To facilitate simulation-based evaluation of conversational information access systems, we introduce SimLab, the first cloud-based platform to provide a centralized general solution for the community to benchmark both conversational systems and user simulators in a controlled and reproducible environment. We articulate requirements for such a platform and propose a general infrastructure to address these requirements. We then present the design and implementation of an initial version of SimLab and showcase its features with an initial evaluation task of conversational movie recommendation, which is made publicly available. Furthermore, we discuss the sustainability of the platform and its future opportunities. This paper is a call for the community to contribute to the platform to drive progress in the field of conversational information access and user simulation.
[IR-3] Harnessing Pairwise Ranking Prompting Through Sample-Efficient Ranking Distillation SIGIR2025
Link: https://arxiv.org/abs/2507.04820
Authors: Junru Wu,Le Yan,Zhen Qin,Honglei Zhuang,Paul Suganthan G. C.,Tianqi Liu,Zhe Dong,Xuanhui Wang,Harrie Oosterhuis
Categories: Information Retrieval (cs.IR)
Comments: ReNeuIR 2025 (at SIGIR 2025) - 4th Workshop on Reaching Efficiency in Neural Information Retrieval, July 17, 2025, Padua, Italy
Abstract:While Pairwise Ranking Prompting (PRP) with Large Language Models (LLMs) is one of the most effective zero-shot document ranking methods, it has a quadratic computational complexity with respect to the number of documents to be ranked, as it requires an enumeration over all possible document pairs. Consequently, the outstanding ranking performance of PRP has remained unreachable for most real-world ranking applications. In this work, we propose to harness the effectiveness of PRP through pairwise distillation. Specifically, we distill a pointwise student ranker from pairwise teacher labels generated by PRP, resulting in an efficient student model that retains the performance of PRP with substantially lower computational costs. Furthermore, we find that the distillation process can be made sample-efficient: with only 2% of pairs, we are able to obtain the same performance as using all pairs for teacher labels. Thus, our novel approach provides a solution to harness the ranking performance of PRP without incurring high computational costs during both distillation and serving.
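Illustrative sketch (not from the paper): the quadratic cost and the 2% sampling idea described above can be made concrete in a few lines of Python. The function names, the uniform sampling scheme, and the RankNet-style logistic loss for the pointwise student are all assumptions chosen for illustration; the abstract does not specify how pairs are sampled or which loss is used.

```python
import itertools
import math
import random


def all_pairs(doc_ids):
    """Every unordered document pair; there are n * (n - 1) / 2 of them."""
    return list(itertools.combinations(doc_ids, 2))


def sample_pairs(doc_ids, fraction, seed=0):
    """Uniformly sample a fraction of the pairs (at least one) for teacher labeling."""
    pairs = all_pairs(doc_ids)
    k = max(1, round(fraction * len(pairs)))
    return random.Random(seed).sample(pairs, k)


def pairwise_logistic_loss(score_a, score_b, teacher_prefers_a):
    """Logistic loss on the difference of two pointwise student scores.

    The loss is small when the student orders the pair the same way
    as the (hypothetical) pairwise teacher label.
    """
    margin = score_a - score_b if teacher_prefers_a else score_b - score_a
    return math.log1p(math.exp(-margin))


docs = [f"d{i}" for i in range(100)]
print(len(all_pairs(docs)))           # 4950 teacher calls for the full enumeration
print(len(sample_pairs(docs, 0.02)))  # 99 teacher calls at a 2% sampling rate
```

At serving time only the pointwise student is evaluated, once per document, so inference cost grows linearly with the number of candidates.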
[IR-4] FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation KDD2025
Link: https://arxiv.org/abs/2507.04651
Authors: Maolin Wang,Yutian Xiao,Binhao Wang,Sheng Zhang,Shanshan Ye,Wanyu Wang,Hongzhi Yin,Ruocheng Guo,Zenglin Xu
Categories: Information Retrieval (cs.IR)
Comments: Accepted by KDD 2025
Abstract:Modern recommendation systems face significant challenges in processing multimodal sequential data, particularly in temporal dynamics modeling and information flow coordination. Traditional approaches struggle with distribution discrepancies between heterogeneous features and noise interference in multimodal signals. We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation), introducing a novel “information flow-control-output” paradigm. The framework features two key innovations: (1) A Stein kernel-based Integrated Information Coordination Module (IICM) that theoretically guarantees distribution consistency between multimodal features and ID streams, and (2) A cross-modal expert routing mechanism that adaptively filters and combines multimodal features based on their contextual relevance. Our approach leverages multi-head subspace decomposition for routing stability and RBF-Stein gradient for unbiased distribution alignment, enhanced by linear-complexity Mamba layers for efficient temporal modeling. Extensive experiments on three real-world datasets demonstrate FindRec’s superior performance over state-of-the-art baselines, particularly in handling long sequences and noisy multimodal inputs. Our framework achieves both improved recommendation accuracy and enhanced model interpretability through its modular design. The implementation code is available anonymously online for easy reproducibility at this https URL.
[IR-5] Heterogeneous User Modeling for LLM-based Recommendation RECSYS2025
Link: https://arxiv.org/abs/2507.04626
Authors: Honghui Bao,Wenjie Wang,Xinyu Lin,Fengbin Zhu,Teng Sun,Fuli Feng,Tat-Seng Chua
Categories: Information Retrieval (cs.IR)
Comments: Accepted by RecSys 2025
Abstract:Leveraging Large Language Models (LLMs) for recommendation has demonstrated notable success in various domains, showcasing their potential for open-domain recommendation. A key challenge to advancing open-domain recommendation lies in effectively modeling user preferences from users’ heterogeneous behaviors across multiple domains. Existing approaches, including ID-based and semantic-based modeling, struggle with poor generalization, an inability to compress noisy interactions effectively, and the domain seesaw phenomenon. To address these challenges, we propose a Heterogeneous User Modeling (HUM) method, which incorporates a compression enhancer and a robustness enhancer for LLM-based recommendation. The compression enhancer uses a customized prompt to compress heterogeneous behaviors into a tailored token, while a masking mechanism enhances cross-domain knowledge extraction and understanding. The robustness enhancer introduces a domain importance score to mitigate the domain seesaw phenomenon by guiding domain optimization. Extensive experiments on heterogeneous datasets validate that HUM effectively models user heterogeneity by achieving both high efficacy and robustness, leading to superior performance in open-domain recommendation.
[IR-6] BiFair: A Fairness-aware Training Framework for LLM-enhanced Recommender Systems via Bi-level Optimization
Link: https://arxiv.org/abs/2507.04294
Authors: Jiaming Zhang,Yuyuan Li,Yiqun Xu,Li Zhang,Xiaohua Feng,Zhifei Ren,Chaochao Chen
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Large Language Model-enhanced Recommender Systems (LLM-enhanced RSs) have emerged as a powerful approach to improving recommendation quality by leveraging LLMs to generate item representations. Despite these advancements, the integration of LLMs raises severe fairness concerns. Existing studies reveal that LLM-based RSs exhibit greater unfairness than traditional RSs, yet fairness issues in LLM-enhanced RSs remain largely unexplored. In this paper, our empirical study reveals that while LLM-enhanced RSs improve fairness across item groups, a significant fairness gap persists. Further enhancement remains challenging due to the architectural differences and varying sources of unfairness inherent in LLM-enhanced RSs. To bridge this gap, we first decompose unfairness into i) prior unfairness in LLM-generated representations and ii) training unfairness in recommendation models. Then, we propose BiFair, a bi-level optimization-based fairness-aware training framework designed to mitigate both prior and training unfairness simultaneously. BiFair optimizes two sets of learnable parameters: LLM-generated representations and a trainable projector in the recommendation model, using a two-level nested optimization process. Additionally, we introduce an adaptive inter-group balancing mechanism, leveraging multi-objective optimization principles to dynamically balance fairness across item groups. Extensive experiments on three real-world datasets demonstrate that BiFair significantly mitigates unfairness and outperforms previous state-of-the-art methods.
[IR-7] Cloud Digital Forensic Readiness: An Open Source Approach to Law Enforcement Request Management
Link: https://arxiv.org/abs/2507.04174
Authors: Abdellah Akilal,M-Tahar Kechadi
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments:
Abstract:Cloud Forensics presents a multi-jurisdictional challenge that may undermine the success of digital forensic investigations (DFIs). The growing volumes of domiciled and foreign law enforcement (LE) requests and the latency and complexity of formal channels for cross-border data access are challenging issues. In this paper, we first discuss major Cloud Service Providers (CSPs) transparency reports and law enforcement guidelines, then propose an abstract architecture for a Cloud Law Enforcement Requests Management System (CLERMS). A proof of concept of the proposed solution is developed, deployed and validated by two realistic scenarios, in addition to an economic estimation of its associated costs. Based on available open source components, our solution is for the benefit of both CSPs and Cloud Service Consumers (CSCs), and aims to enhance the due Cloud Digital Forensic Readiness (CDFR).
[IR-8] CTR-Guided Generative Query Suggestion in Conversational Search
Link: https://arxiv.org/abs/2507.04072
Authors: Erxue Min,Hsiu-Yuan Huang,Xihong Yang,Min Yang,Xin Jia,Yunfang Wu,Hengyi Cai,Junfeng Wang,Shuaiqiang Wang,Dawei Yin
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Generating effective query suggestions in conversational search requires aligning model outputs with user preferences, which is challenging due to sparse and noisy click signals. We propose GQS, a generative framework that integrates click modeling and preference optimization to enhance real-world user engagement. GQS consists of three key components: (1) a Multi-Source CTR Modeling module that captures diverse contextual signals to estimate fine-grained click-through rates; (2) a Diversity-Aware Preference Alignment strategy using CTR-weighted Direct Preference Optimization (DPO), which balances relevance and semantic diversity; and (3) a CTR-Calibrated Iterative Optimization process that jointly refines the CTR and generation models across training rounds. Experiments on two real-world tasks demonstrate that GQS outperforms strong baselines in CTR, relevance, and diversity.
[IR-9] Function-based Labels for Complementary Recommendation: Definition, Annotation, and LLM-as-a-Judge
Link: https://arxiv.org/abs/2507.03945
Authors: Chihiro Yamasaki,Kai Sugahara,Yuma Nagi,Kazushi Okamoto
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Complementary recommendations enhance the user experience by suggesting items that are frequently purchased together while serving different functions from the query item. Inferring or evaluating whether two items have a complementary relationship requires complementary relationship labels; however, defining these labels is challenging because of the inherent ambiguity of such relationships. Complementary labels based on user historical behavior logs attempt to capture these relationships, but often produce inconsistent and unreliable results. Recent efforts have introduced large language models (LLMs) to infer these relationships. However, these approaches provide a binary classification without a nuanced understanding of complementary relationships. In this study, we address these challenges by introducing Function-Based Labels (FBLs), a novel definition of complementary relationships independent of user purchase logs and the opaque decision processes of LLMs. We constructed a human-annotated FBLs dataset comprising 2,759 item pairs and demonstrated that it covered possible item relationships and minimized ambiguity. We then evaluated whether some machine learning (ML) methods using annotated FBLs could accurately infer labels for unseen item pairs, and whether LLM-generated complementary labels align with human perception. Our results demonstrate that even with limited data, ML models, such as logistic regression and SVM achieve high macro-F1 scores (approximately 0.82). Furthermore, LLMs, such as gpt-4o-mini, demonstrated high consistency (0.989) and classification accuracy (0.849) under the detailed definition of FBLs, indicating their potential as effective annotators that mimic human judgment. Overall, our study presents FBLs as a clear definition of complementary relationships, enabling more accurate inferences and automated labeling of complementary recommendations.
[IR-10] Ranking-based Fusion Algorithms for Extreme Multi-label Text Classification (XMTC)
Link: https://arxiv.org/abs/2507.03761
Authors: Celso França,Gestefane Rabbi,Thiago Salles,Washington Cunha,Leonardo Rocha,Marcos André Gonçalves
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:In the context of Extreme Multi-label Text Classification (XMTC), where labels are assigned to text instances from a large label space, the long-tail distribution of labels presents a significant challenge. Labels can be broadly categorized into frequent, high-coverage head labels and infrequent, low-coverage tail labels, complicating the task of balancing effectiveness across all labels. To address this, combining predictions from multiple retrieval methods, such as sparse retrievers (e.g., BM25) and dense retrievers (e.g., fine-tuned BERT), offers a promising solution. The fusion of sparse and dense retrievers is motivated by the complementary ranking characteristics of these methods. Sparse retrievers compute relevance scores based on high-dimensional, bag-of-words representations, while dense retrievers utilize approximate nearest neighbor (ANN) algorithms on dense text and label embeddings within a shared embedding space. Rank-based fusion algorithms leverage these differences by combining the precise matching capabilities of sparse retrievers with the semantic richness of dense retrievers, thereby producing a final ranking that improves the effectiveness across both head and tail labels.
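Illustrative sketch (not from the paper): the abstract does not name the specific fusion algorithms it studies. Reciprocal Rank Fusion (RRF) is one widely used rank-based method and is shown here only as an example of how a sparse and a dense label ranking can be merged using ranks alone. The label names and the constant k=60 are assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked label lists with Reciprocal Rank Fusion.

    Each label receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); labels are returned by descending fused score,
    with ties broken alphabetically for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical label rankings for one document.
sparse_ranking = ["python", "pandas", "numpy"]      # e.g. from a BM25-style retriever
dense_ranking = ["dataframes", "pandas", "python"]  # e.g. from an embedding retriever

print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
# ['python', 'pandas', 'dataframes', 'numpy']
```

Because RRF uses only rank positions, it needs no score normalization between retrievers whose raw scores live on different scales.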
[IR-11] GENPLUGIN: A Plug-and-Play Framework for Long-Tail Generative Recommendation with Exposure Bias Mitigation
Link: https://arxiv.org/abs/2507.03568
Authors: Kun Yang,Siyao Zheng,Tianyi Li,Xiaodong Li,Hui Li
Categories: Information Retrieval (cs.IR)
Comments: 16 pages, 8 figures
Abstract:Generative recommendation (GenRec) offers LLM integration, reduced embedding costs, and eliminates per-candidate scoring, attracting great attention. Despite its promising performance, this study reveals that it suffers from generation exposure bias and poor long-tail item generalization, two critical limitations overlooked by prior works on GenRec. To address these, we propose GENPLUGIN, a plug-and-play framework featuring a dual-encoder, shared-decoder architecture. During pre-training, it aligns language and ID views via contrastive learning, harmonizing item representations across two complementary views. Besides, GENPLUGIN uses a novel training strategy that probabilistically substitutes ground-truth item ID tokens with predictions from the language-semantics encoder, alleviating exposure bias. To improve long-tail generative recommendation, we propose a retrieval-based data augmentation mechanism. It fine-tunes the decoder of GENPLUGIN to endow GENPLUGIN with the ability to use relevant users w.r.t. contexts or collaborative information to augment the generation of item ID tokens in long-tail recommendation scenarios. We have plugged GENPLUGIN into several representative GenRec models and the extensive experiments demonstrate that GENPLUGIN can notably mitigate generation exposure bias during item ID generation while significantly improving the quality of long-tail item recommendation.
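Illustrative sketch (not from the paper): the training strategy described above, which probabilistically swaps ground-truth item ID tokens for model predictions, resembles scheduled sampling. The minimal Python below shows only the token-mixing step. The function name, the token strings, and the fixed substitution probability are hypothetical; the paper's actual schedule and the source of the predictions are not given in the abstract.

```python
import random


def mix_decoder_inputs(gold_tokens, predicted_tokens, substitute_prob, rng):
    """Replace each ground-truth token with the predicted one with probability substitute_prob.

    Feeding some predicted tokens during training exposes the decoder to
    the kind of inputs it will see at inference time.
    """
    if len(gold_tokens) != len(predicted_tokens):
        raise ValueError("gold and predicted sequences must have the same length")
    return [
        pred if rng.random() < substitute_prob else gold
        for gold, pred in zip(gold_tokens, predicted_tokens)
    ]


gold = ["<a_12>", "<b_7>", "<c_33>", "<d_2>"]       # hypothetical item ID tokens
predicted = ["<a_12>", "<b_9>", "<c_33>", "<d_5>"]  # hypothetical encoder predictions

rng = random.Random(0)
print(mix_decoder_inputs(gold, predicted, substitute_prob=0.3, rng=rng))
```

With `substitute_prob=0.0` this reduces to ordinary teacher forcing, and with `substitute_prob=1.0` the decoder sees only predicted tokens.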
[IR-12] A Multistakeholder Approach to Value-Driven Co-Design of Recommender System Evaluation Metrics in Digital Archives RECSYS’25
Link: https://arxiv.org/abs/2507.03556
Authors: Florian Atzenhofer-Baumgartner,Georg Vogeler,Dominik Kowald
Categories: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: Accepted at RecSys’25
Abstract:This paper presents the first multistakeholder approach for translating diverse stakeholder values into an evaluation metric setup for Recommender Systems (RecSys) in digital archives. While commercial platforms mainly rely on engagement metrics, cultural heritage domains require frameworks that balance competing priorities among archivists, platform owners, researchers, and other stakeholders. To address this challenge, we conducted high-profile focus groups (5 groups x 5 persons) with upstream, provider, system, consumer, and downstream stakeholders, identifying value priorities across critical dimensions: visibility/representation, expertise adaptation, and transparency/trust. Our analysis shows that stakeholder concerns naturally align with four sequential research funnel stages: discovery, interaction, integration, and impact. The resulting framework addresses domain-specific challenges including collection representation imbalances, non-linear research patterns, and tensions between specialized expertise and broader accessibility. We propose tailored metrics for each stage in this research journey, such as research path quality for discovery, contextual appropriateness for interaction, metadata-weighted relevance for integration, and cross-stakeholder value alignment for impact assessment. Our contributions extend beyond digital archives to the broader RecSys community, offering transferable evaluation approaches for domains where value emerges through sustained engagement rather than immediate consumption.
[IR-13] Exploring the Effect of Context-Awareness and Popularity Calibration on Popularity Bias in POI Recommendations RECSYS2025
Link: https://arxiv.org/abs/2507.03503
Authors: Andrea Forster,Simone Kopeinik,Denic Helic,Stefan Thalmann,Dominik Kowald
Categories: Information Retrieval (cs.IR)
Comments: Accepted at RecSys 2025
Abstract:Point-of-interest (POI) recommender systems help users discover relevant locations, but their effectiveness is often compromised by popularity bias, which disadvantages less popular, yet potentially meaningful places. This paper addresses this challenge by evaluating the effectiveness of context-aware models and calibrated popularity techniques as strategies for mitigating popularity bias. Using four real-world POI datasets (Brightkite, Foursquare, Gowalla, and Yelp), we analyze the individual and combined effects of these approaches on recommendation accuracy and popularity bias. Our results reveal that context-aware models cannot be considered a uniform solution, as the models studied exhibit divergent impacts on accuracy and bias. In contrast, calibration techniques can effectively align recommendation popularity with user preferences, provided there is a careful balance between accuracy and bias mitigation. Notably, the combination of calibration and context-awareness yields recommendations that balance accuracy and close alignment with the users’ popularity profiles, i.e., popularity calibration.
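Illustrative sketch (not from the paper): one common way to quantify how closely a recommendation list matches a user's popularity profile is to compare the two distributions over popularity groups. The three group names, the toy POI data, and the use of Jensen-Shannon divergence as the miscalibration measure are assumptions for illustration; the abstract does not state which measure the authors use.

```python
import math
from collections import Counter

GROUPS = ("head", "mid", "tail")  # hypothetical popularity groups


def group_distribution(items, item_group):
    """Share of a non-empty item list that falls into each popularity group."""
    counts = Counter(item_group[item] for item in items)
    total = sum(counts.values())
    return [counts[group] / total for group in GROUPS]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical POIs and their popularity groups.
item_group = {"cafe": "head", "museum": "head", "park": "mid", "gallery": "tail", "bookshop": "tail"}
user_history = ["gallery", "bookshop", "park", "cafe"]  # mostly niche places
recommendations = ["cafe", "museum", "park", "cafe"]    # mostly popular places

profile = group_distribution(user_history, item_group)
recommended = group_distribution(recommendations, item_group)
print(profile, recommended)
print(js_divergence(profile, recommended))  # larger value means worse calibration
```

A calibration step would then re-rank candidates to lower this divergence while giving up as little accuracy as possible.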
[IR-14] Explainable Information Retrieval in the Audit Domain SIGIR2025
Link: https://arxiv.org/abs/2507.03479
Authors: Alexander Frummet,Emanuel Slany,Jonas Amling,Moritz Lang,Stephan Scheele
Categories: Information Retrieval (cs.IR)
Comments: Extended abstract accepted at the Workshop on Explainability in Information Retrieval (WExIR), co-located with SIGIR 2025
Abstract:Conversational agents such as Microsoft Copilot and Google Gemini assist users with complex search tasks but often generate misleading or fabricated references. This undermines trust, particularly in high-stakes domains such as medicine and finance. Explainable information retrieval (XIR) aims to address this by making search results more transparent and interpretable. While most XIR research is domain-agnostic, this paper focuses on auditing – a critical yet underexplored area. We argue that XIR systems can support auditors in completing their complex task. We outline key challenges and future research directions to advance XIR in this domain.
[IR-15] Modeling Item-Level Dynamic Variability with Residual Diffusion for Bundle Recommendation
Link: https://arxiv.org/abs/2507.03280
Authors: Dong Zhang,Lin Li,Ming Li,Xiaohui Tao,Meng Sun,Jimmy Xiangji Huang
Categories: Information Retrieval (cs.IR)
Comments:
Abstract:Existing solutions for bundle recommendation (BR) have achieved remarkable effectiveness in predicting the user’s preference for prebuilt bundles. However, bundle-item (B-I) affiliation will vary dynamically in real scenarios. For example, a bundle themed as ‘casual outfit’ may add ‘hat’ or remove ‘watch’ due to factors such as seasonal variations, changes in user preferences, or inventory adjustments. Our empirical study demonstrates that the performance of mainstream BR models will fluctuate or even decline under item-level variability. This paper makes the first attempt to address the above problem and proposes a novel Residual Diffusion for Bundle Recommendation (RDiffBR) as a model-agnostic generative framework which can assist a BR model in adapting to this scenario. During the initial training of the BR model, RDiffBR employs a residual diffusion model to process the item-level bundle embeddings, which are generated by the BR model to represent the bundle theme, via a forward-reverse process. In the inference stage, RDiffBR reverses item-level bundle embeddings obtained by the well-trained bundle model under B-I variability scenarios to generate the effective item-level bundle embeddings. In particular, the residual connection in our residual approximator significantly enhances the item-level bundle embedding generation ability of BR models. Experiments on six BR models and four public datasets from different domains show that RDiffBR improves the Recall and NDCG of backbone BR models by up to 23%, while increasing training time by only about 4%. Code and datasets are available at this https URL.