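This blog post presents the latest list of papers retrieved from Arxiv.org on 2025-07-23. It is updated automatically and grouped into five major areas: NLP, CV, ML, AI, and IR. To receive it by scheduled email, leave your email address in the comments.
Note: The daily paper data is retrieved from Arxiv.org and updated automatically at around 12:00 each day.
Friendly reminder: If you would like to receive the daily paper data by email, please leave your email address in the comments.
Contents
Overview (2025-07-23)
458 papers were updated today, including:
- Natural Language Processing: 62 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 143 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 101 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 119 papers (Machine Learning (cs.LG))
Natural Language Processing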
[NLP-0] MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
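[Quick read]: This paper addresses the scarcity of data resources for scientific reasoning in the open-source community: the lack of large-scale, high-quality, verifiable scientific reasoning datasets, which limits the development of AI scientists and the support available to human researchers at the frontiers of natural science discovery. The solution rests on two core components. The first is the TextbookReasoning dataset, which extracts truthful reference answers from 12,000 university-level scientific textbooks and contains 650,000 reasoning questions spanning 7 disciplines. The second is the MegaScience dataset, a large-scale mixture of 1.25 million instances that combines several high-quality open-source scientific datasets, with the optimal subset of each selected through systematic ablation studies. The study also builds a comprehensive evaluation system covering 15 benchmarks and trains Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience; these significantly outperform the corresponding official instruct models in average performance. MegaScience is more effective for larger and stronger models, suggesting a scaling benefit, and the released data and models give scientific reasoning research a reproducible foundation.
Link: https://arxiv.org/abs/2507.16812
Authors: Run-Ze Fan,Zengzhi Wang,Pengfei Liu
Affiliations: Shanghai Jiao Tong University; SII; GAIR Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 39 pages; Github: this https URL HF: this https URL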
Abstract:Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
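[NLP-1] LingBench++: A Linguistically-Informed Benchmark and Reasoning Framework for Multi-Step and Cross-Cultural Inference with LLMs
[Quick read]: This paper addresses three weaknesses in how large language models (LLMs) are currently evaluated on complex linguistic tasks: no structured reasoning process, limited coverage of cultural diversity, and weak interpretability. Existing benchmarks focus mainly on final-answer accuracy and neglect fine-grained analysis of a model's reasoning path and its generalization across languages and cultures. The solution is LingBench++, a linguistically-informed benchmark and reasoning framework. It provides structured reasoning traces, stepwise evaluation protocols, and typological metadata covering more than 90 low-resource and cross-cultural languages. It also builds a multi-agent architecture that integrates grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing, which improves both accuracy and interpretability over single-pass approaches.
Link: https://arxiv.org/abs/2507.16809
Authors: Da-Chen Lian,Ri-Sheng Huang,Pin-Er Chen,Chunki Lim,You-Kuan Lin,Guan-Yu Tseng,Zi-Cheng Yang,Shu-Kai Hsieh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 41 pages, 17 figures, 10 tables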
Abstract:We propose LingBench++, a linguistically-informed benchmark and reasoning framework designed to evaluate large language models (LLMs) on complex linguistic tasks inspired by the International Linguistics Olympiad (IOL). Unlike prior benchmarks that focus solely on final answer accuracy, LingBench++ provides structured reasoning traces, stepwise evaluation protocols, and rich typological metadata across over 90 low-resource and cross-cultural languages. We further develop a multi-agent architecture integrating grammatical knowledge retrieval, tool-augmented reasoning, and deliberate hypothesis testing. Through systematic comparisons of baseline and our proposed agentic models, we demonstrate that models equipped with external knowledge sources and iterative reasoning outperform single-pass approaches in both accuracy and interpretability. LingBench++ offers a comprehensive foundation for advancing linguistically grounded, culturally informed, and cognitively plausible reasoning in LLMs.
[NLP-2] Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
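[Quick read]: This paper addresses the loss of calibration that occurs when language models (LMs) are trained with reinforcement learning (RL) to generate reasoning chains: accuracy rises, but the model's confidence estimates become unreliable and it more often produces incorrect answers (or "hallucinates"). The solution is a new training framework, RLCR (Reinforcement Learning with Calibration Rewards), which optimizes two objectives at once: a reward based on binary correctness, and a Brier score, a scoring rule that incentivizes numerical confidence estimates consistent with the true probability of being correct. Supported by a theoretical proof and empirical validation, RLCR substantially improves calibration with no loss in accuracy on both in-domain and out-of-domain tasks, yielding more reliable reasoning models.
Link: https://arxiv.org/abs/2507.16806
Authors: Mehul Damani,Isha Puri,Stewart Slocum,Idan Shenfeld,Leshem Choshen,Yoon Kim,Jacob Andreas
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: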
Abstract:When language models (LMs) are trained via reinforcement learning (RL) to generate natural language “reasoning chains”, their performance improves on a variety of difficult question answering tasks. Today, almost all successful applications of RL for reasoning use binary reward functions that evaluate the correctness of LM outputs. Because such reward functions do not penalize guessing or low-confidence outputs, they often have the unintended side-effect of degrading calibration and increasing the rate at which LMs generate incorrect responses (or “hallucinate”) in other problem domains. This paper describes RLCR (Reinforcement Learning with Calibration Rewards), an approach to training reasoning models that jointly improves accuracy and calibrated confidence estimation. During RLCR, LMs generate both predictions and numerical confidence estimates after reasoning. They are trained to optimize a reward function that augments a binary correctness score with a Brier score – a scoring rule for confidence estimates that incentivizes calibrated prediction. We first prove that this reward function (or any analogous reward function that uses a bounded, proper scoring rule) yields models whose predictions are both accurate and well-calibrated. We next show that across diverse datasets, RLCR substantially improves calibration with no loss in accuracy, on both in-domain and out-of-domain evaluations – outperforming both ordinary RL training and classifiers trained to assign post-hoc confidence scores. While ordinary RL hurts calibration, RLCR improves it. Finally, we demonstrate that verbalized confidence can be leveraged at test time to improve accuracy and calibration via confidence-weighted scaling methods. Our results show that explicitly optimizing for calibration can produce more generally reliable reasoning models.
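As a minimal sketch of the reward described above, the Python snippet below combines a binary correctness score with a Brier-style penalty on the stated confidence. The exact form (correctness minus squared error, with equal weighting) and the name `rlcr_reward` are assumptions made for illustration; the abstract only says that the binary score is augmented with a Brier score.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Illustrative calibration-aware reward (assumed form, not the paper's code).

    reward = correctness - (confidence - correctness)^2
    where correctness is 1.0 for a correct answer and 0.0 otherwise.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correctness = 1.0 if is_correct else 0.0
    # Brier-style penalty: squared gap between stated confidence and outcome.
    brier_penalty = (confidence - correctness) ** 2
    return correctness - brier_penalty


if __name__ == "__main__":
    for correct, conf in [(True, 1.0), (True, 0.5), (False, 0.0), (False, 1.0)]:
        print(correct, conf, rlcr_reward(correct, conf))
```

Under this assumed form, a confidently wrong answer (confidence 1.0) scores -1.0, while a wrong answer with confidence 0.0 scores 0.0, so confidence that is not backed by correctness is penalized.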
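[NLP-3] Agentar-Fin-R1: Enhancing Financial Intelligence through Domain Expertise, Training Efficiency, and Advanced Reasoning
[Quick read]: This paper addresses three challenges that large language models (LLMs) face in financial applications: insufficient reasoning capability, trustworthiness that is hard to guarantee, and inefficient adaptation to specific tasks. The solution is Agentar-Fin-R1, a series of financial LLMs (8B and 32B parameters) built on Qwen3. Its core contributions are a systematic optimization design based on a high-quality financial task taxonomy and a multi-layered trustworthiness assurance framework that combines knowledge engineering, multi-agent data synthesis, and rigorous data validation governance. Label-guided automated difficulty-aware optimization, a two-stage learning process, and a detailed attribution system substantially improve training efficiency. The models are evaluated on mainstream financial benchmarks (FinEva, FinEval, FinanceIQ) and general reasoning datasets (MATH-500, GPQA), and the newly proposed Finova benchmark assesses agent-level financial reasoning and compliance verification. Results show state-of-the-art performance on financial tasks together with strong general reasoning.
Link: https://arxiv.org/abs/2507.16802
Authors: Yanjun Zheng,Xiyang Du,Longfei Liao,Xiaoke Zhao,Zhaowen Zhou,Bo Zhang,Jiawei Liu,Xiang Qi,Zhe Li,Zhiqiang Zhang,Wang Wei,Peng Zhang
Affiliations: Ant Group
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: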
Abstract:Large Language Models (LLMs) demonstrate tremendous potential in the financial domain, yet existing models often fall short in scenarios demanding robust reasoning capabilities, stringent trustworthiness requirements, and efficient adaptation to task-specific needs. We introduce the Agentar-Fin-R1 series of financial large language models (8B and 32B parameters), specifically engineered based on the Qwen3 foundation model to enhance reasoning capabilities, reliability, and domain specialization for financial applications. Our optimization approach integrates a high-quality, systematic financial task taxonomy with a comprehensive multi-layered trustworthiness assurance framework. This framework encompasses high-quality trustworthy knowledge engineering, multi-agent trustworthy data synthesis, and rigorous data validation governance. Through label-guided automated difficulty-aware optimization, two-stage learning processes, and detailed attribution systems, we achieve substantial improvements in training efficiency. Our models undergo comprehensive evaluation on mainstream financial benchmarks including FinEva, FinEval, and FinanceIQ, as well as general reasoning datasets such as MATH-500 and GPQA. To thoroughly assess real-world deployment capabilities, we innovatively propose the Finova evaluation benchmark, which focuses on agent-level financial reasoning and compliance verification. Experimental results demonstrate that Agentar-Fin-R1 not only achieves state-of-the-art performance on financial tasks but also exhibits exceptional general reasoning capabilities, validating its effectiveness as a trustworthy solution for high-stakes financial applications.
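[NLP-4] Test-Time-Matching: Decouple Personality, Memory, and Linguistic Style in LLM-based Role-Playing Language Agent
[Quick read]: This paper addresses two challenges that role-playing language agents face when deeply immersing in a specific character, especially well-known fictional or public figures. First, prompts and contextual inputs alone are often insufficient for deep immersion. Second, fine-tuning-based methods are limited by the difficulty of data collection and the compute required for training, which restricts broad use. The solution is Test-Time-Matching (TTM), a training-free framework based on test-time scaling and context engineering. TTM automatically decouples a character's features into personality, memory, and linguistic style, and uses them in a structured three-stage generation pipeline. This yields high-fidelity role-playing dialogue and allows flexible combinations across linguistic styles, and even across variations in personality and memory.
Link: https://arxiv.org/abs/2507.16799
Authors: Xiaoyu Zhan,Xinyu Fu,Hao Sun,Yuanqi Li,Jie Guo,Yanwen Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: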
Abstract:The rapid advancement of large language models (LLMs) has enabled role-playing language agents to demonstrate significant potential in various applications. However, relying solely on prompts and contextual inputs often proves insufficient for achieving deep immersion in specific roles, particularly well-known fictional or public figures. On the other hand, fine-tuning-based approaches face limitations due to the challenges associated with data collection and the computational resources required for training, thereby restricting their broader applicability. To address these issues, we propose Test-Time-Matching (TTM), a training-free role-playing framework through test-time scaling and context engineering. TTM uses LLM agents to automatically decouple a character’s features into personality, memory, and linguistic style. Our framework involves a structured, three-stage generation pipeline that utilizes these features for controlled role-playing. It achieves high-fidelity role-playing performance, also enables seamless combinations across diverse linguistic styles and even variations in personality and memory. We evaluate our framework through human assessment, and the results demonstrate that our method achieves the outstanding performance in generating expressive and stylistically consistent character dialogues.
[NLP-5] Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
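[Quick read]: This paper addresses unintended out-of-distribution generalization that arises when fine-tuning large language models (LLMs), in particular emergent misalignment, where a model fine-tuned on a narrow task gives egregiously misaligned responses to general questions. The solution is Concept Ablation Fine-Tuning (CAFT). It uses interpretability tools to identify latent-space directions that correspond to undesired concepts, then ablates those concepts with linear projections during fine-tuning. This steers the model toward the intended generalization without modifying the training data or using samples from the target distribution. In experiments, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution.
Link: https://arxiv.org/abs/2507.16795
Authors: Helena Casademunt,Caden Juang,Adam Karvonen,Samuel Marks,Senthooran Rajamanoharan,Neel Nanda
Affiliations: Harvard University; Northeastern University; Independent; Anthropic
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: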
Abstract:Fine-tuning large language models (LLMs) can lead to unintended out-of-distribution generalization. Standard approaches to this problem rely on modifying training data, for example by adding data that better specify the intended generalization. However, this is not always practical. We introduce Concept Ablation Fine-Tuning (CAFT), a technique that leverages interpretability tools to control how LLMs generalize from fine-tuning, without needing to modify the training data or otherwise use data from the target distribution. Given a set of directions in an LLM’s latent space corresponding to undesired concepts, CAFT works by ablating these concepts with linear projections during fine-tuning, steering the model away from unintended generalizations. We successfully apply CAFT to three fine-tuning tasks, including emergent misalignment, a phenomenon where LLMs fine-tuned on a narrow task generalize to give egregiously misaligned responses to general questions. Without any changes to the fine-tuning data, CAFT reduces misaligned responses by 10x without degrading performance on the training distribution. Overall, CAFT represents a novel approach for steering LLM generalization without modifying training data.
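The linear-projection step mentioned above can be sketched in a few lines of NumPy. This is only an illustration of projecting activations onto the orthogonal complement of a set of directions; the function name `ablate_directions`, the array shapes, and the use of QR to orthonormalize the directions are assumptions, and the sketch does not show how the paper finds the directions or hooks the projection into fine-tuning.

```python
import numpy as np


def ablate_directions(activations: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the components of `activations` that lie in span(directions).

    activations: (n, d) array of hidden states (hypothetical layout).
    directions:  (k, d) array of linearly independent concept directions, k <= d.
    """
    # Orthonormal basis for the concept subspace; q has shape (d, k).
    q, _ = np.linalg.qr(directions.T)
    # Subtract the projection onto the subspace: x - (x q) q^T.
    return activations - (activations @ q) @ q.T


if __name__ == "__main__":
    dirs = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
    acts = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 5.0]])
    cleaned = ablate_directions(acts, dirs)
    # After ablation the activations have no component along either direction.
    print(np.allclose(cleaned @ dirs.T, 0.0))
```

Orthonormalizing first keeps the projection correct even when the supplied directions are not mutually orthogonal.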
[NLP-6] Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
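[Quick read]: This paper addresses the context-length limits of large language models (LLMs), which bottleneck reasoning accuracy and efficiency. The solution is the Thread Inference Model (TIM), a family of LLMs, together with its inference runtime TIMRUN. Natural language is modeled as reasoning trees measured by both length and depth, which supports recursive and decompositional problem solving. The key innovation is a rule-based subtask-pruning mechanism that maintains a compact working memory holding only the key-value states of the most relevant context tokens. This allows positional embeddings and GPU memory pages to be reused, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks, and it supports multi-hop tool calls and virtually unlimited working memory within a single inference.
Link: https://arxiv.org/abs/2507.16784
Authors: Hongyin Luo,Nathaniel Morgan,Tina Li,Derek Zhao,Ai Vy Ngo,Philip Schroeder,Lijie Yang,Assaf Ben-Kish,Jack O’Brien,James Glass
Affiliations: MIT CSAIL; Subconscious Systems Technologies, Inc.; Princeton University; Tel Aviv University
Categories: Computation and Language (cs.CL)
Comments: Research preview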
Abstract:To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. Performance is achieved by modeling natural language as reasoning trees measured by both length and depth instead of linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions based on the concept we proposed in Schroeder et al, 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, enabling reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
[NLP-7] Unpacking Ambiguity: The Interaction of Polysemous Discourse Markers and Non-DM Signals
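[Quick read]: This paper addresses the unclear interaction between the polysemy of discourse markers (DMs) and their co-occurrence with non-DM signals, with particular attention to how this relationship varies across genres. Using the eRST framework, it proposes a graded definition of DM polysemy and runs correlation and regression analyses to test whether polysemous DMs are accompanied by more numerous and more diverse non-DM signals. The findings show that polysemous DMs do co-occur with more diverse non-DMs, but the total number of co-occurring signals does not necessarily increase. Genre also significantly shapes how DMs and non-DM signals interact.
Link: https://arxiv.org/abs/2507.16748
Authors: Jingni Wu,Amir Zeldes
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL)
Comments: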
Abstract:Discourse markers (DMs) like ‘but’ or ‘then’ are crucial for creating coherence in discourse, yet they are often replaced by or co-occur with non-DMs (‘in the morning’ can mean the same as ‘then’), and both can be ambiguous (‘since’ can refer to time or cause). The interaction mechanism between such signals remains unclear but pivotal for their disambiguation. In this paper we investigate the relationship between DM polysemy and co-occurrence of non-DM signals in English, as well as the influence of genre on these patterns. Using the framework of eRST, we propose a graded definition of DM polysemy, and conduct correlation and regression analyses to examine whether polysemous DMs are accompanied by more numerous and diverse non-DM signals. Our findings reveal that while polysemous DMs do co-occur with more diverse non-DMs, the total number of co-occurring signals does not necessarily increase. Moreover, genre plays a significant role in shaping DM-signal interactions.
[NLP-8] Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
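[Quick read]: This paper addresses the difficulty of training multimodal models to perform Visual Chain of Thought (Visual CoT) reasoning in complex problem solving. Two obstacles stand out: poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and the lack of high-quality visual CoT training data. The solution is Zebra-CoT, a large-scale, diverse dataset of 182,384 logically coherent interleaved text-image reasoning traces. It covers scientific questions (geometry, physics, algorithms), 2D visual reasoning tasks (visual search, jigsaw puzzles), 3D reasoning tasks (3D multi-hop inference, embodied and robot planning), and visual logic problems and strategic games such as chess. Fine-tuning Anole-7B on Zebra-CoT yields an improvement of +12% in test-set accuracy and gains of up to +13% on standard vision-language model (VLM) benchmarks, while fine-tuning Bagel-7B produces a model that generates high-quality interleaved visual reasoning chains.
Link: https://arxiv.org/abs/2507.16746
Authors: Ang Li,Charles Wang,Kaiyu Yue,Zikui Cai,Ollie Liu,Deqing Fu,Peng Guo,Wang Bill Zhu,Vatsal Sharan,Robin Jia,Willie Neiswanger,Furong Huang,Tom Goldstein,Micah Goldblum
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: dataset link: this https URL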
Abstract:Humans often use visual aids, for example diagrams or sketches, when solving complex problems. Training multimodal models to do the same, known as Visual Chain of Thought (Visual CoT), is challenging due to: (1) poor off-the-shelf visual CoT performance, which hinders reinforcement learning, and (2) the lack of high-quality visual CoT training data. We introduce Zebra-CoT, a diverse large-scale dataset with 182,384 samples, containing logically coherent interleaved text-image reasoning traces. We focus on four categories of tasks where sketching or visual reasoning is especially natural, spanning scientific questions such as geometry, physics, and algorithms; 2D visual reasoning tasks like visual search and jigsaw puzzles; 3D reasoning tasks including 3D multi-hop inference, embodied and robot planning; visual logic problems and strategic games like chess. Fine-tuning the Anole-7B model on the Zebra-CoT training corpus results in an improvement of +12% in our test-set accuracy and yields up to +13% performance gain on standard VLM benchmark evaluations. Fine-tuning Bagel-7B yields a model that generates high-quality interleaved visual reasoning chains, underscoring Zebra-CoT’s effectiveness for developing multimodal reasoning abilities. We open-source our dataset and models to support development and evaluation of visual CoT.
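[NLP-9] RAVine: Reality-Aligned Evaluation for Agentic Search
[Quick read]: This paper addresses three problems in how current evaluation frameworks assess agentic search systems. First, the complex queries used in existing benchmarks often deviate from realistic user search scenarios. Second, prior approaches introduce noise when extracting ground truth for end-to-end evaluation, which distorts fine-grained assessment. Third, most frameworks look only at final-answer quality and ignore the iterative process inherent to agentic search. The solution is RAVine, a reality-aligned evaluation framework for agentic large language models (LLMs) with search. It targets multi-point queries and long-form answers that better reflect user intent, and uses an attributable ground-truth construction strategy to make fine-grained evaluation more accurate. RAVine also examines the model's interaction with search tools throughout the iterative process and accounts for efficiency.
Link: https://arxiv.org/abs/2507.16725
Authors: Yilong Xu,Xiang Long,Zhi Zheng,Jinhua Gao
Affiliations: ICT, CAS; ModelBest Inc.
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: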
Abstract:Agentic search, as a more autonomous and adaptive paradigm of retrieval augmentation, is driving the evolution of intelligent search systems. However, existing evaluation frameworks fail to align well with the goals of agentic search. First, the complex queries commonly used in current benchmarks often deviate from realistic user search scenarios. Second, prior approaches tend to introduce noise when extracting ground truth for end-to-end evaluations, leading to distorted assessments at a fine-grained level. Third, most current frameworks focus solely on the quality of final answers, neglecting the evaluation of the iterative process inherent to agentic search. To address these limitations, we propose RAVine – a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents, and introduces an attributable ground truth construction strategy to enhance the accuracy of fine-grained evaluation. Moreover, RAVine examines model’s interaction with search tools throughout the iterative process, and accounts for factors of efficiency. We benchmark a series of models using RAVine and derive several insights, which we hope will contribute to advancing the development of agentic search systems. The code and datasets are available at this https URL.
[NLP-10] Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory
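[Quick read]: This paper addresses the problem of grounding vision-language models (VLMs), originally trained on internet data, to diverse real-world robots. The solution is ExpTeach, a framework that aligns a VLM with the physical world in a closed loop by building a long-term memory from the robot's own experiences. First, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors. Second, these experiences are summarized into a long-term memory and retrieved via retrieval-augmented generation (RAG) to guide future tasks. An on-demand image annotation module further improves the VLM's spatial understanding. In experiments, reflection raises success rates from 36% to 84% on four challenging robotic tasks, and across 12 real-world scenarios (including eight unseen ones), grounding with long-term memory raises single-trial success rates from 22% to 80%.
Link: https://arxiv.org/abs/2507.16713
Authors: Guowei Lan,Kaixian Qu,René Zurbrügg,Changan Chen,Christopher E. Mower,Haitham Bou-Ammar,Marco Hutter
Affiliations: Robotic Systems Lab, ETH Zurich; ETH AI Center; Huawei Noah’s Ark Lab; UCL Centre for AI
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: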
Abstract:Vision-language models (VLMs) have been widely adopted in robotics to enable autonomous planning. However, grounding VLMs, originally trained on internet data, to diverse real-world robots remains a challenge. This paper presents ExpTeach, a framework that grounds VLMs to physical robots by building a self-generated memory of real-world experiences. In ExpTeach, the VLM autonomously plans actions, verifies outcomes, reflects on failures, and adapts robot behaviors in a closed loop. The self-generated experiences during this process are then summarized into a long-term memory, enabling retrieval of learned knowledge to guide future tasks via retrieval-augmented generation (RAG). Additionally, ExpTeach enhances the spatial understanding of VLMs with an on-demand image annotation module. In experiments, we show that reflection improves success rates from 36% to 84% on four challenging robotic tasks and observe the emergence of intelligent object interactions, including creative tool use. Across extensive tests on 12 real-world scenarios (including eight unseen ones), we find that grounding with long-term memory boosts single-trial success rates from 22% to 80%, demonstrating the effectiveness and generalizability of ExpTeach.
[NLP-11] Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance
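[Quick read]: This paper addresses a challenge for Risk and Quality (RQ) assurance in highly regulated industries: employees handle numerous daily queries that require accurate interpretation of complex regulatory policies, and the traditional reliance on specialized experts creates bottlenecks and scales poorly. The solution is a Retrieval Augmented Generation (RAG) system that combines Large Language Models (LLMs), hybrid search, and relevance boosting to improve RQ query processing. Evaluated on 124 expert-annotated real-world queries, the deployed system shows substantial improvements over traditional RAG approaches, and an extensive hyperparameter analysis comparing multiple configurations offers practical guidance to practitioners.
Link: https://arxiv.org/abs/2507.16711
Authors: Lars Hillebrand,Armin Berger,Daniel Uedelhoven,David Berghaus,Ulrich Warning,Tim Dilmaghani,Bernd Kliem,Thomas Schmid,Rüdiger Loitz,Rafet Sifa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted and published at BigData 2024, 3 pages, 3 tables, 2 figures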
Abstract:Risk and Quality (RQ) assurance in highly regulated industries requires constant navigation of complex regulatory frameworks, with employees handling numerous daily queries demanding accurate policy interpretation. Traditional methods relying on specialized experts create operational bottlenecks and limit scalability. We present a novel Retrieval Augmented Generation (RAG) system leveraging Large Language Models (LLMs), hybrid search and relevance boosting to enhance RQ query processing. Evaluated on 124 expert-annotated real-world queries, our actively deployed system demonstrates substantial improvements over traditional RAG approaches. Additionally, we perform an extensive hyperparameter analysis to compare and evaluate multiple configuration setups, delivering valuable insights to practitioners.
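To make "hybrid search and relevance boosting" concrete, here is a generic Python sketch of one common way to combine a lexical score and a dense-embedding score and then apply per-document boosts. Everything in it is an assumption for illustration: the abstract does not specify the scoring formula, and `hybrid_rank`, `alpha`, and the multiplicative boost are hypothetical names and choices rather than the deployed system's design.

```python
from typing import Dict, List, Optional


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    lexical: Dict[str, float],
    dense: Dict[str, float],
    boosts: Optional[Dict[str, float]] = None,
    alpha: float = 0.5,
) -> List[str]:
    """Rank documents by a weighted mix of normalized scores times a boost.

    alpha weights the lexical score; (1 - alpha) weights the dense score.
    boosts maps document ids to multiplicative factors (default 1.0).
    """
    boosts = boosts or {}
    lex_n, den_n = min_max(lexical), min_max(dense)
    combined = {}
    for doc in set(lexical) | set(dense):
        base = alpha * lex_n.get(doc, 0.0) + (1.0 - alpha) * den_n.get(doc, 0.0)
        combined[doc] = base * boosts.get(doc, 1.0)
    # Highest score first; ties broken by document id for determinism.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))


if __name__ == "__main__":
    lexical = {"a": 2.0, "b": 1.0, "c": 0.0}
    dense = {"a": 0.1, "b": 0.9, "c": 0.5}
    print(hybrid_rank(lexical, dense))
    print(hybrid_rank(lexical, dense, boosts={"c": 4.0}))
```

A boost of this kind might, for example, favor documents from an authoritative policy source, though which signals the paper actually boosts is not stated in the abstract.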
[NLP-12] Interpretable Topic Extraction and Word Embedding Learning using row-stochastic DEDICOM
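[Quick read]: This paper addresses the difficulty of obtaining interpretable topic modeling and word embedding learning at the same time from matrix factorization of text data. The solution is a new row-stochastic variation of the DEDICOM (DEcomposition into DIrectional COMponents) algorithm, applied to the pointwise mutual information (PMI) matrices of text corpora. It identifies latent topic clusters within the vocabulary while simultaneously learning interpretable word embeddings. The paper also introduces a method to train the constrained DEDICOM algorithm efficiently and evaluates its topic modeling and word embedding performance qualitatively.
Link: https://arxiv.org/abs/2507.16695
Authors: Lars Hillebrand,David Biesner,Christian Bauckhage,Rafet Sifa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted and published at CD-MAKE 2020, 20 pages, 8 tables, 8 figures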
Abstract:The DEDICOM algorithm provides a uniquely interpretable matrix factorization method for symmetric and asymmetric square matrices. We employ a new row-stochastic variation of DEDICOM on the pointwise mutual information matrices of text corpora to identify latent topic clusters within the vocabulary and simultaneously learn interpretable word embeddings. We introduce a method to efficiently train a constrained DEDICOM algorithm and a qualitative evaluation of its topic modeling and word embedding performance.
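The ingredients named above can be sketched in NumPy: a PMI matrix computed from co-occurrence counts, and the DEDICOM reconstruction M ≈ A R Aᵀ with a row-stochastic loading matrix A. The helper names are hypothetical, and two details are assumptions rather than facts from the abstract: rows of A are made stochastic with a softmax, and PMI entries for zero counts are set to 0. The training procedure itself is not shown.

```python
import numpy as np


def pmi_matrix(cooc) -> np.ndarray:
    """Pointwise mutual information log(p(x, y) / (p(x) p(y))) from counts."""
    cooc = np.asarray(cooc, dtype=float)
    p_xy = cooc / cooc.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    # Assumption: undefined entries (zero counts) are replaced by 0.
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi


def row_softmax(logits: np.ndarray) -> np.ndarray:
    """Map each row to a probability vector (non-negative, sums to 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def dedicom_reconstruct(a_logits: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Reconstruct an (n, n) matrix as A R A^T with row-stochastic A (n, k)."""
    a = row_softmax(a_logits)
    return a @ r @ a.T


if __name__ == "__main__":
    counts = np.array([[4.0, 1.0], [1.0, 4.0]])
    target = pmi_matrix(counts)
    approx = dedicom_reconstruct(np.zeros((2, 2)), np.eye(2))
    print(target.shape, approx.shape)
```

Because each row of A sums to 1, a row can be read as a word's membership distribution over k topics, which is where the interpretability comes from.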
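[NLP-13] PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
[Quick read]: This paper addresses the Instruction Bottleneck that In-Context Alignment (ICA) methods face when human values are pluralistic: large language models (LLMs) struggle to reconcile multiple, often conflicting values (for example, stimulation vs. tradition) within a single prompt, which leads to incomplete or biased alignment. The solution is PICACO, a pluralistic ICA method that requires no fine-tuning. It optimizes a meta-instruction by maximizing the total correlation between the specified values and the model's responses, which in theory reinforces value correlation while reducing distracting noise, and so better guides the model to understand and balance multiple values.
Link: https://arxiv.org/abs/2507.16679
Authors: Han Jiang,Dongyao Zhu,Zhihua Wei,Xiaoyuan Yi,Ziang Xiao,Xing Xie
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: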
Abstract:In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs’ comprehension of input prompts remains agnostic, limiting ICA’s ability to address value tensions–human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that navigates multiple values to better elicit LLMs’ understanding of them and improve their alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, theoretically reinforcing value correlation while reducing distractive noise, resulting in effective value instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
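For reference, the standard information-theoretic definition of total correlation for random variables X_1, ..., X_n is given below. How PICACO instantiates these variables (specified values and LLM responses) and estimates the quantity is not stated in the abstract, so this is only the textbook definition, not the paper's objective.

```latex
\mathrm{TC}(X_1, \dots, X_n)
  = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)
  = D_{\mathrm{KL}}\!\left( p(x_1, \dots, x_n) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \right)
```

Total correlation is non-negative and equals zero exactly when the variables are mutually independent, so maximizing it rewards stronger statistical dependence among them.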
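[NLP-14] Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs
[Quick read]: This paper addresses self-contradiction in multimodal large language models (MLLMs) that unify generation and understanding: the model generates images that its own understanding branch judges to be misaligned with the input prompt. The paper defines a Nonunified score to quantify this self-contradiction and finds that it stems mainly from weak generation rather than misunderstanding. Building on this asymmetry, it uses the stronger understanding capability as internal supervision to guide the weaker generation branch. A key finding is that fine-tuning only the generation branch improves both generation and understanding (co-improvement), because the model gets better at detecting false positives that it previously judged as prompt-aligned. The paper also shows that poor supervision can cause co-degradation, which underlines the need for data quality checks, and it proposes a curriculum-based strategy that gradually introduces harder samples to improve unification and overall MLLM performance.
Link: https://arxiv.org/abs/2507.16663
Authors: Yujin Han,Hao Chen,Andi Han,Zhiheng Wang,Xinyu Lin,Yingya Zhang,Shiwei Zhang,Difan Zou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures, 3 tables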
Abstract:Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model’s own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that are previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision-an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality check. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
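[NLP-15] P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs
[Quick read]: This paper addresses the limited phonological reasoning ability of text-based large language models (LLMs), focusing on rhyme word generation, grapheme-to-phoneme (g2p) conversion, and syllable counting. The solution is Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompting, which is grounded in educational theories such as scaffolding and discovery learning. It uses structured guidance to activate latent phonological abilities, improving performance by up to 52% and even surpassing human baselines on certain tasks.
Link: https://arxiv.org/abs/2507.16656
Authors: Dongjun Jang,Youngchae Ahn,Hyopil Shin
Affiliations: Seoul National University
Categories: Computation and Language (cs.CL)
Comments: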
Abstract:This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.
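[NLP-16] Towards Automated Regulatory Compliance Verification in Financial Auditing with Large Language Models
[Quick Read]: This paper addresses a limitation of current AI-based systems for auditing financial documents: they can recommend text passages relevant to accounting standards, but they fall short in verifying whether those passages actually comply with the specific legal requirements. The key to its approach is a comparison of large language models (LLMs) across configurations on compliance verification, in particular open-source models such as Llama-2 against proprietary models such as OpenAI's GPT series. The study finds that Llama-2 70B outperforms all proprietary counterparts at detecting non-compliance (true negatives), while GPT-4 performs best across a broad variety of scenarios, particularly in non-English contexts.
Link: https://arxiv.org/abs/2507.16642
Authors: Armin Berger,Lars Hillebrand,David Leonhard,Tobias Deußer,Thiago Bell Felix de Oliveira,Tim Dilmaghani,Mohamed Khaled,Bernd Kliem,Rüdiger Loitz,Christian Bauckhage,Rafet Sifa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted and published at BigData 2023, 10 pages, 3 figures, 5 tables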
Abstract:The auditing of financial documents, historically a labor-intensive process, stands on the precipice of transformation. AI-driven solutions have made inroads into streamlining this process by recommending pertinent text passages from financial reports to align with the legal requirements of accounting standards. However, a glaring limitation remains: these systems commonly fall short in verifying if the recommended excerpts indeed comply with the specific legal mandates. Hence, in this paper, we probe the efficiency of publicly available Large Language Models (LLMs) in the realm of regulatory compliance across different model configurations. We place particular emphasis on comparing cutting-edge open-source LLMs, such as Llama-2, with their proprietary counterparts like OpenAI’s GPT models. This comparative analysis leverages two custom datasets provided by our partner PricewaterhouseCoopers (PwC) Germany. We find that the open-source Llama-2 70 billion model demonstrates outstanding performance in detecting non-compliance or true negative occurrences, beating all their proprietary counterparts. Nevertheless, proprietary models such as GPT-4 perform the best in a broad variety of scenarios, particularly in non-English contexts.
[NLP-17] Step-Audio 2 Technical Report
[Quick Read]: This paper addresses challenges in industry-strength audio understanding and speech conversation, including limited sensitivity to paralinguistic information such as intonation and emotion, hallucination, and the lack of multimodal knowledge integration. The key to its solution is Step-Audio 2, an end-to-end multimodal large language model that combines a latent audio encoder with reasoning-centric reinforcement learning for high-quality automatic speech recognition (ASR) and audio understanding. It folds the generation of discrete audio tokens into language modeling, which markedly improves responsiveness to paralinguistic information such as speaking style and emotion. It also integrates retrieval-augmented generation (RAG) and external tool calls, using web search to mitigate hallucination and audio search to switch timbres, to support more intelligent and expressive interaction across diverse conversational scenarios.
Link: https://arxiv.org/abs/2507.16632
Authors: Boyong Wu,Chao Yan,Chen Hu,Cheng Yi,Chengli Feng,Fei Tian,Feiyu Shen,Gang Yu,Haoyang Zhang,Jingbei Li,Mingrui Chen,Peng Liu,Wang You,Xiangyu Tony Zhang,Xingyuan Li,Xuerui Yang,Yayue Deng,Yechang Huang,Yuxin Li,Yuxin Zhang,Zhao You,Brian Li,Changyi Wan,Hanpeng Hu,Jiangjie Zhen,Siyu Chen,Song Yuan,Xuelin Zhang,Yimin Jiang,Yu Zhou,Yuxiang Yang,Bingxin Li,Buyun Ma,Changhe Song,Dongqing Pang,Guoqiang Hu,Haiyang Sun,Kang An,Na Wang,Shuli Gao,Wei Ji,Wen Li,Wen Sun,Xuan Wen,Yong Ren,Yuankai Ma,Yufan Lu,Bin Wang,Bo Li,Changxin Miao,Che Liu,Chen Xu,Dapeng Shi,Dingyuan Hu,Donghang Wu,Enle Liu,Guanzhe Huang,Gulin Yan,Han Zhang,Hao Nie,Haonan Jia,Hongyu Zhou,Jianjian Sun,Jiaoren Wu,Jie Wu,Jie Yang,Jin Yang,Junzhe Lin,Kaixiang Li,Lei Yang,Liying Shi,Li Zhou,Longlong Gu,Ming Li,Mingliang Li,Mingxiao Li,Nan Wu,Qi Han,Qinyuan Tan,Shaoliang Pang,Shengjie Fan,Siqi Liu,Tiancheng Cao,Wanying Lu,Wenqing He,Wuxun Xie,Xu Zhao,Xueqi Li,Yanbo Yu,Yang Yang,Yi Liu,Yifan Lu,Yilei Wang,Yuanhao Ding,Yuanwei Liang,Yuanwei Lu,Yuchu Luo,Yuhe Yin,Yumeng Zhan,Yuxiang Zhang
Affiliations: StepFun Audio Team
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit this https URL for more information.
[NLP-18] Scaling Linear Attention with Sparse State Expansion
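[Quick Read]: This paper addresses the efficiency bottleneck of the Transformer architecture in long-context scenarios, caused by quadratic computation and linear memory growth, and in particular the drop in retrieval and reasoning performance that often appears when linear attention compresses context into a fixed-size state. The key to its solution is two innovations. First, a row-sparse update formulation, motivated by viewing state updating as information classification, performs sparse state updates through softmax-based top-k hard classification, which extends receptive fields and reduces inter-class interference. Second, Sparse State Expansion (SSE) expands the contextual state into multiple partitions within the sparse framework, decoupling state capacity from parameter size while keeping the sparse classification paradigm. With an efficient parallelized implementation, the design yields more discriminative state representations, and its effectiveness and scalability are validated on language modeling, in-context retrieval, and mathematical reasoning.
Link: https://arxiv.org/abs/2507.16577
Authors: Yuqi Pan,Yongqi An,Zheng Li,Yuhong Chou,Ruijie Zhu,Xiaohui Wang,Mingxuan Wang,Jinqiao Wang,Guoqi Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: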
Abstract:The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-k hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Our design, supported by efficient parallelized implementations, yields effective classification and discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.7 on AIME24 and 51.3 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
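To make the idea of a row-sparse state update concrete, here is a rough sketch in plain Python. It is not the paper's formulation: it simply assumes that each token produces one logit per state row, that a softmax turns these logits into class probabilities, and that only the top-k rows receive the token's value vector, weighted by their (unrenormalized) probability. The function name `row_sparse_update` and all inputs are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def row_sparse_update(
    state: List[List[float]],
    row_logits: List[float],
    value: List[float],
    k: int,
) -> List[int]:
    """Add `value` to the k most probable rows of `state`, in place.

    Returns the indices of the rows that were written to.
    """
    probs = softmax(row_logits)
    # Hard top-k selection: all other rows are left untouched.
    top_rows = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    for i in top_rows:
        state[i] = [s + probs[i] * v for s, v in zip(state[i], value)]
    return top_rows


state = [[0.0, 0.0] for _ in range(4)]  # 4 rows, value dimension 2
touched = row_sparse_update(state, [2.0, 0.1, 1.5, -1.0], [1.0, -1.0], k=2)
```

Because only k rows are written per token, information routed to one "class" does not overwrite rows assigned to another, which is one way to read the abstract's claim about reduced inter-class interference.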
[NLP-19] Pixels to Principles: Probing Intuitive Physics Understanding in Multimodal Language Models
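[Quick Read]: This paper addresses the poor performance of multimodal large language models (MLLMs) on intuitive physics tasks, in particular their inability to reliably distinguish physically plausible from implausible scenarios. The key to its approach is a probing analysis of model embeddings: intermediate representations are extracted at key processing stages to locate the bottleneck in fusing visual and linguistic information. The study finds that vision encoders capture physical plausibility cues, but the language model does not use this information effectively, which leads to reasoning failures. The main limitation of current MLLMs is therefore not the vision component but insufficient vision-language alignment, making better cross-modal integration a core direction for improvement.
Link: https://arxiv.org/abs/2507.16572
Authors: Mohamad Ballout,Serwan Jassim,Elia Bruni
Affiliations: University of Osnabrück
Categories: Computation and Language (cs.CL)
Comments: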
Abstract:This paper presents a systematic evaluation of state-of-the-art multimodal large language models (MLLMs) on intuitive physics tasks using the GRASP and IntPhys 2 datasets. We assess the open-source models InternVL 2.5, Qwen 2.5 VL, LLaVA-OneVision, and the proprietary Gemini 2.0 Flash Thinking, finding that even the latest models struggle to reliably distinguish physically plausible from implausible scenarios. To go beyond performance metrics, we conduct a probing analysis of model embeddings, extracting intermediate representations at key processing stages to examine how well task-relevant information is preserved. Our results show that, depending on task difficulty, a critical vision-language misalignment can emerge: vision encoders successfully capture physical plausibility cues, but this information is not effectively utilized by the language model, leading to failures in reasoning. This misalignment suggests that the primary limitation of MLLMs in intuitive physics tasks is not the vision component but the ineffective integration of visual and linguistic information. Our findings highlight vision-language alignment as a key area for improvement, offering insights for future MLLMs development.
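[NLP-20] Exploring Gender Bias in Large Language Models: An In-depth Dive into the German Language ACL2025
[Quick Read]: This paper addresses the transferability of gender bias evaluation methods across languages, in particular the challenges of applying bias measurement methods originally designed for English to other languages such as German. The key to its solution is five German datasets for gender bias evaluation, grounded in well-established concepts of gender bias and usable through multiple methodologies, which support a systematic evaluation of multilingual large language models (LLMs). The study finds bias phenomena specific to German, such as the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception, underscoring the need for evaluation frameworks tailored to each language.
Link: https://arxiv.org/abs/2507.16557
Authors: Kristin Gnadt,David Thulke,Simone Kopeinik,Ralf Schlüter
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP) at ACL 2025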
Abstract:In recent years, various methods have been proposed to evaluate gender bias in large language models (LLMs). A key challenge lies in the transferability of bias measurement methods initially developed for the English language when applied to other languages. This work aims to contribute to this research strand by presenting five German datasets for gender bias evaluation in LLMs. The datasets are grounded in well-established concepts of gender bias and are accessible through multiple methodologies. Our findings, reported for eight multilingual LLM models, reveal unique challenges associated with gender bias in German, including the ambiguous interpretation of male occupational terms and the influence of seemingly neutral nouns on gender perception. This work contributes to the understanding of gender bias in LLMs across languages and underscores the necessity for tailored evaluation frameworks.
[NLP-21] Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report
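[Quick Read]: This report aims to systematically identify and assess the frontier risks posed by rapidly advancing AI models. The key to its approach is the E-T-C analysis (deployment environment, threat source, enabling capability) together with the "AI-45° Law", which uses "red lines" (intolerable thresholds) and "yellow lines" (early warning indicators) to define three risk zones: green (manageable risk), yellow (requiring strengthened mitigations), and red (requiring suspension of development and/or deployment). Experimental results show that no recent frontier AI model crosses a red line, but most fall in the yellow zone for persuasion and manipulation, and the yellow zone cannot be ruled out for biological and chemical risks, which calls for stronger monitoring and governance and for collective action to mitigate potential threats.
Link: https://arxiv.org/abs/2507.16534
Authors: Shanghai AI Lab: Xiaoyang Chen,Yunhao Chen,Zeren Chen,Zhiyun Chen,Hanyun Cui,Yawen Duan,Jiaxuan Guo,Qi Guo,Xuhao Hu,Hong Huang,Lige Huang,Chunxiao Li,Juncheng Li,Qihao Lin,Dongrui Liu,Xinmin Liu,Zicheng Liu,Chaochao Lu,Xiaoya Lu,Jingjing Qu,Qibing Ren,Jing Shao,Jingwei Shi,Jingwei Sun,Peng Wang,Weibing Wang,Jia Xu,Lewen Yan,Xiao Yu,Yi Yu,Boxuan Zhang,Jie Zhang,Weichen Zhang,Zhijie Zheng,Tianyi Zhou,Bowen Zhou
Affiliations: Shanghai Artificial Intelligence Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 97 pages, 37 figures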
Abstract:To understand and identify the unprecedented risks posed by rapidly advancing artificial intelligence (AI) models, this report presents a comprehensive assessment of their frontier risks. Drawing on the E-T-C analysis (deployment environment, threat source, enabling capability) from the Frontier AI Risk Management Framework (v1.0) (SafeWork-F1-Framework), we identify critical risks in seven areas: cyber offense, biological and chemical risks, persuasion and manipulation, uncontrolled autonomous AI R&D, strategic deception and scheming, self-replication, and collusion. Guided by the “AI-45° Law,” we evaluate these risks using “red lines” (intolerable thresholds) and “yellow lines” (early warning indicators) to define risk zones: green (manageable risk for routine deployment and continuous monitoring), yellow (requiring strengthened mitigations and controlled deployment), and red (necessitating suspension of development and/or deployment). Experimental results show that all recent frontier AI models reside in green and yellow zones, without crossing red lines. Specifically, no evaluated models cross the yellow line for cyber offense or uncontrolled AI R&D risks. For self-replication, and strategic deception and scheming, most models remain in the green zone, except for certain reasoning models in the yellow zone. In persuasion and manipulation, most models are in the yellow zone due to their effective influence on humans. For biological and chemical risks, we are unable to rule out the possibility of most models residing in the yellow zone, although detailed threat modeling and in-depth assessment are required to make further claims. This work reflects our current understanding of AI frontier risks and urges collective action to mitigate these challenges.
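The zone logic described above can be pictured as a pair of thresholds. The sketch below assumes, purely for illustration, that each risk area yields a single numeric score and that the yellow and red lines are numeric cut-offs on that score; the report itself does not say the lines are scalar, and the function `risk_zone` and the sample numbers are hypothetical.

```python
def risk_zone(score: float, yellow_line: float, red_line: float) -> str:
    """Map a risk score to a zone, given an early-warning and an intolerable threshold."""
    if not yellow_line < red_line:
        raise ValueError("the yellow line must lie below the red line")
    if score >= red_line:
        return "red"  # suspend development and/or deployment
    if score >= yellow_line:
        return "yellow"  # strengthened mitigations, controlled deployment
    return "green"  # routine deployment with continuous monitoring


# Hypothetical scores for one risk area, with made-up thresholds.
zones = [risk_zone(s, yellow_line=0.4, red_line=0.8) for s in (0.1, 0.5, 0.9)]
```

A per-area table of such zones would then reproduce the shape of the reported findings, for example one entry per model for cyber offense and another for persuasion and manipulation.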
[NLP-22] Learning Text Styles: A Study on Transfer Attribution and Verification
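[Quick Read]: This thesis addresses three core problems in manipulating and identifying text style: Text Style Transfer (TST), which changes stylistic properties of a text (e.g., sentiment, formality) while preserving its content; Authorship Attribution (AA), which identifies the author of a text from stylistic fingerprints; and Authorship Verification (AV), which determines whether two texts share the same author. The key to its solution is parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.
Link: https://arxiv.org/abs/2507.16530
Authors: Zhiqiang Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: PhD thesis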
Abstract:This thesis advances the computational understanding and manipulation of text styles through three interconnected pillars: (1) Text Style Transfer (TST), which alters stylistic properties (e.g., sentiment, formality) while preserving content; (2) Authorship Attribution (AA), identifying the author of a text via stylistic fingerprints; and (3) Authorship Verification (AV), determining whether two texts share the same authorship. We address critical challenges in these areas by leveraging parameter-efficient adaptation of large language models (LLMs), contrastive disentanglement of stylistic features, and instruction-based fine-tuning for explainable verification.
[NLP-23] C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning
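[Quick Read]: This paper addresses two core challenges in improving the reasoning of multimodal large language models (MLLMs): first, existing self-improving methods augment visual or textual data separately, producing mismatched complexity across modalities (e.g., over-simplified diagrams paired with redundant textual descriptions); second, data and models evolve separately, exposing models to tasks of mismatched difficulty. The key to its solution is C2-Evo, an automatic closed-loop self-improving framework that jointly evolves training data and model capability through a cross-modal data evolution loop and a data-model evolution loop. The former expands the dataset with complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams; the latter adaptively selects generated problems based on the base model's performance and alternates supervised fine-tuning with reinforcement learning. Data and model thus co-evolve continuously, with considerable gains across multiple mathematical reasoning benchmarks.
Link: https://arxiv.org/abs/2507.16518
Authors: Xiuwei Chen,Wentao Hu,Hanhui Li,Jun Zhou,Zisheng Chen,Meng Cao,Yihan Zeng,Kui Zhang,Yu-Jie Yuan,Jianhua Han,Hang Xu,Xiaodan Liang
Affiliations: Sun Yat-sen University; The Hong Kong Polytechnic University; Mohamed bin Zayed University of Artificial Intelligence; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: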
Abstract:Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks. Our code, models, and datasets will be released.
[NLP-24] Introducing Quality Estimation to Machine Translation Post-editing Workflow: An Empirical Study on Its Usefulness
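[Quick Read]: This paper addresses inefficiency and limited decision support in Machine Translation Post-Editing (MTPE), in particular how sentence-level Quality Estimation (QE) can speed up post-editing and inform translators' judgments of translation quality. The key to its approach is introducing QE as an aid that flags potentially problematic segments, validates translators' own evaluations of MT quality, and lets them double-check translation outputs. The study finds that QE significantly reduces post-editing time, with an effect that holds across medium- and high-quality MT output and across levels of translator expertise, but that inaccurate QE may hinder post-editing, so QE needs careful integration into MTPE workflows.
Link: https://arxiv.org/abs/2507.16515
Authors: Siqi Liu,Guangrong Dai,Dechao Li
Affiliations: The Hong Kong Polytechnic University; Guangdong University of Foreign Studies
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 11 pages, 5 figures, 2 tables. To be published in the Proceedings of the 20th Machine Translation Summit (MT Summit 2025; Geneva, Switzerland)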
Abstract:This preliminary study investigates the usefulness of sentence-level Quality Estimation (QE) in English-Chinese Machine Translation Post-Editing (MTPE), focusing on its impact on post-editing speed and student translators’ perceptions. It also explores the interaction effects between QE and MT quality, as well as between QE and translation expertise. The findings reveal that QE significantly reduces post-editing time. The examined interaction effects were not significant, suggesting that QE consistently improves MTPE efficiency across medium- and high-quality MT outputs and among student translators with varying levels of expertise. In addition to indicating potentially problematic segments, QE serves multiple functions in MTPE, such as validating translators’ evaluations of MT quality and enabling them to double-check translation outputs. However, interview data suggest that inaccurate QE may hinder post-editing processes. This research provides new insights into the strengths and limitations of QE, facilitating its more effective integration into MTPE workflows to enhance translators’ productivity.
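[NLP-25] The Ever-Evolving Science Exam
[Quick Read]: This paper addresses two major problems facing current science benchmarks: data leakage risks, which compromise the validity of an evaluation when test data may have been seen during training, and evaluation inefficiency caused by the computational cost of large-scale testing. To address them, it proposes a dynamic benchmark, the Ever-Evolving Science Exam (EESE). The key is a non-public pool (EESE-Pool) of over 100K expertly constructed science question-answer pairs across 5 disciplines and 500+ subfields, built through a multi-stage pipeline that ensures Range, Reach, and Rigor, together with a periodically updated 500-instance subset (EESE) that is sampled and validated to enable leakage-resilient, low-overhead evaluation.
Link: https://arxiv.org/abs/2507.16514
Authors: Junying Wang,Zicheng Zhang,Yijin Guo,Farong Wen,Ye Shen,Yingji Liang,Yalun Wu,Wenzhe Li,Chunyi Li,Zijian Chen,Qi Jia,Guangtao Zhai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages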
Abstract:As foundation models grow rapidly in capability and deployment, evaluating their scientific understanding becomes increasingly critical. Existing science benchmarks have made progress towards broad Range, wide Reach, and high Rigor, yet they often face two major challenges: data leakage risks that compromise benchmarking validity, and evaluation inefficiency due to large-scale testing. To address these issues, we introduce the Ever-Evolving Science Exam (EESE), a dynamic benchmark designed to reliably assess scientific capabilities in foundation models. Our approach consists of two components: 1) a non-public EESE-Pool with over 100K expertly constructed science instances (question-answer pairs) across 5 disciplines and 500+ subfields, built through a multi-stage pipeline ensuring Range, Reach, and Rigor, 2) a periodically updated 500-instance subset EESE, sampled and validated to enable leakage-resilient, low-overhead evaluations. Experiments on 32 open- and closed-source models demonstrate that EESE effectively differentiates the strengths and weaknesses of models in scientific fields and cognitive dimensions. Overall, EESE provides a robust, scalable, and forward-compatible solution for science benchmark design, offering a realistic measure of how well foundation models handle science questions. The project page is at: this https URL.
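The pool-plus-subset design can be illustrated with a small sampling routine. This is a sketch only: the abstract says the public subset is "sampled and validated" but does not describe the procedure, so the even split across disciplines, the seed-per-release idea, and the names `sample_subset` and `toy_pool` are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_subset(pool: List[Dict[str, str]], subset_size: int, seed: int) -> List[Dict[str, str]]:
    """Draw a subset spread as evenly as possible across disciplines."""
    rng = random.Random(seed)  # a new seed per release gives a fresh subset
    by_discipline: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for item in pool:
        by_discipline[item["discipline"]].append(item)

    names = sorted(by_discipline)
    base, remainder = divmod(subset_size, len(names))
    subset: List[Dict[str, str]] = []
    for index, name in enumerate(names):
        quota = base + (1 if index < remainder else 0)
        if quota > len(by_discipline[name]):
            raise ValueError(f"not enough items in {name}")
        subset.extend(rng.sample(by_discipline[name], quota))
    return subset


# Toy private pool: 5 placeholder disciplines with 20 questions each.
toy_pool = [
    {"discipline": f"discipline_{d}", "question": f"q{d}_{i}"}
    for d in range(5)
    for i in range(20)
]
release = sample_subset(toy_pool, subset_size=10, seed=2025)
```

Keeping the pool private and publishing only a small, periodically refreshed draw is what limits both leakage (old subsets go stale) and evaluation cost (only a few hundred items are scored per release).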
[NLP-26] Combining Language and Topic Models for Hierarchical Text Classification
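[Quick Read]: This paper examines whether combining features from pre-trained language models (PLMs) and topic models improves hierarchical text classification (HTC). The key to its approach is a two-stream feature extraction architecture: a PLM extracts fine-grained contextual and semantic information and a topic model extracts high-level, corpus-wide document representations; the two feature sets pass through separate convolutional layers, are combined, and are fed to a label-wise attention mechanism that produces label-specific document representations. Experiments on three HTC benchmark datasets show that adding topic-model features generally decreases classification performance compared with PLM features alone, which challenges the assumption that combining the two kinds of features is beneficial.
Link: https://arxiv.org/abs/2507.16490
Authors: Jaco du Toit,Marcel Dunaiski
Affiliations: Stellenbosch University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 2 figures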
Abstract:Hierarchical text classification (HTC) is a natural language processing task which has the objective of categorising text documents into a set of classes from a predefined structured class hierarchy. Recent HTC approaches use various techniques to incorporate the hierarchical class structure information with the natural language understanding capabilities of pre-trained language models (PLMs) to improve classification performance. Furthermore, using topic models along with PLMs to extract features from text documents has been shown to be an effective approach for multi-label text classification tasks. The rationale behind the combination of these feature extractor models is that the PLM captures the finer-grained contextual and semantic information while the topic model obtains high-level representations which consider the corpus of documents as a whole. In this paper, we use an HTC approach which uses a PLM and a topic model to extract features from text documents which are used to train a classification model. Our objective is to determine whether the combination of the features extracted from the two models is beneficial to HTC performance in general. In our approach, the extracted features are passed through separate convolutional layers whose outputs are combined and passed to a label-wise attention mechanism which obtains label-specific document representations by weighing the most important features for each class separately. We perform comprehensive experiments on three HTC benchmark datasets and show that using the features extracted from the topic model generally decreases classification performance compared to only using the features obtained by the PLM. In contrast to previous work, this shows that the incorporation of features extracted from topic models for text classification tasks should not be assumed beneficial.
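The label-wise attention step mentioned in the abstract can be sketched in a few lines of plain Python. The version below is a generic dot-product variant, chosen as an assumption because the abstract does not specify the scoring function: each label has its own query vector, scores every feature vector, and takes a softmax-weighted sum. The convolutional layers and the PLM and topic-model extractors are left out, and all names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_wise_attention(features: List[Vector], label_queries: List[Vector]) -> List[Vector]:
    """Return one document representation per label."""
    dim = len(features[0])
    representations: List[Vector] = []
    for query in label_queries:
        # How relevant is each feature vector to this particular label?
        scores = [sum(q * f for q, f in zip(query, feature)) for feature in features]
        weights = softmax(scores)
        # Label-specific weighted sum of the feature vectors.
        representations.append(
            [sum(w * feature[d] for w, feature in zip(weights, features)) for d in range(dim)]
        )
    return representations


features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three positions, dimension 2
label_queries = [[10.0, 0.0], [0.0, 10.0]]  # two labels
label_reps = label_wise_attention(features, label_queries)
```

Each label ends up attending mostly to the feature that matches its query, so the two label representations differ even though they are built from the same document features.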
[NLP-27] ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs ACL2025
[Quick Read]: This paper addresses the tendency of large language models (LLMs) to hallucinate. Existing detection methods based on hidden states mostly rely on static, isolated representations and miss how hidden states evolve across layers, which limits their effectiveness. The key to its solution is a new metric, the ICR Score (Information Contribution to Residual Stream), which quantifies how much each module contributes to the hidden-state update in the residual stream. Building on it, the ICR Probe detection method captures the cross-layer evolution of hidden states and achieves better hallucination detection with significantly fewer parameters.
Link: https://arxiv.org/abs/2507.16488
Authors: Zhenliang Zhang,Xinyu Hu,Huixuan Zhang,Junzhe Zhang,Xiaojun Wan
Affiliations: Wangxuan Institute of Computer Technology, Peking University; School of Software and Microelectronics, Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ACL 2025 (Main Conference)
Abstract:Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the ICR Score (Information Contribution to Residual Stream), which quantifies the contribution of modules to the hidden states’ update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
[NLP-28] MMS Player: an open source software for parametric data-driven animation of Sign Language avatars
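[Quick Read]: This paper addresses the lack of precise modeling of multimodal information (parallel execution, timing, and inflection) in the automatic generation of sign language animation; gloss-based representations struggle to express this structure fully. The key to its solution is MMS (MultiModal Signstream), a new sign language representation format that extends gloss-based representations with information on parallel execution of signs, timing, and inflections, together with MMS-Player, open-source software built as Python scripts for the Blender 3D authoring tool. MMS-Player can be invoked from the command line or through an HTTP API and can render animations as videos or export them in other 3D animation exchange formats.
Link: https://arxiv.org/abs/2507.16463
Authors: Fabrizio Nunnari,Shailesh Mishra,Patrick Gebhard
Affiliations: German Research Center for Artificial Intelligence
Categories: Graphics (cs.GR); Computation and Language (cs.CL)
Comments: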
Abstract:This paper describes the MMS-Player, an open source software able to synthesise sign language animations from a novel sign language representation format called MMS (MultiModal Signstream). The MMS enhances gloss-based representations by adding information on parallel execution of signs, timing, and inflections. The implementation consists of Python scripts for the popular Blender 3D authoring tool and can be invoked via command line or HTTP API. Animations can be rendered as videos or exported in other popular 3D animation exchange formats. The software is freely available under GPL-3.0 license at this https URL.
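[NLP-29] Towards Enforcing Company Policy Adherence in Agentic Workflows
[Quick Read]: This paper addresses the difficulty that large language model (LLM) agents have in reliably following complex company policies when automating business processes. The key to its solution is a deterministic, transparent, and modular framework that works in two phases: an offline build-time stage compiles policy documents into verifiable guard code associated with tool use, and a runtime integration has these guards ensure compliance before each agent action. The approach shows encouraging preliminary results on the challenging τ-bench Airlines domain, and the paper outlines key challenges for real-world deployment.
Link: https://arxiv.org/abs/2507.16459
Authors: Naama Zwerdling,David Boaz,Ella Rabinovich,Guy Uziel,David Amid,Ateret Anaby-Tavor
Affiliations: IBM Research
Categories: Computation and Language (cs.CL)
Comments: 11 pages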
Abstract:Large Language Model (LLM) agents hold promise for a flexible and scalable alternative to traditional business process automation, but struggle to reliably follow complex company policies. In this study we introduce a deterministic, transparent, and modular framework for enforcing business policy adherence in agentic workflows. Our method operates in two phases: (1) an offline buildtime stage that compiles policy documents into verifiable guard code associated with tool use, and (2) a runtime integration where these guards ensure compliance before each agent action. We demonstrate our approach on the challenging τ-bench Airlines domain, showing encouraging preliminary results in policy enforcement, and further outline key challenges for real-world deployments.
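The runtime half of this two-phase design, guards that run before each tool call, might look roughly like the following. This is an illustrative sketch, not the paper's framework: the class `GuardedTools`, the tool `cancel_booking`, and the 24-hour cancellation rule are invented for the example, and the guard is written by hand here rather than compiled from a policy document.

```python
from typing import Any, Callable, Dict, List


class PolicyViolation(Exception):
    """Raised when a guard rejects a tool call."""


Guard = Callable[[Dict[str, Any]], None]


class GuardedTools:
    """Registry that runs every guard attached to a tool before invoking it."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self._guards: Dict[str, List[Guard]] = {}

    def register(self, name: str, tool: Callable[..., Any], guards: List[Guard]) -> None:
        self._tools[name] = tool
        self._guards[name] = list(guards)

    def call(self, name: str, **args: Any) -> Any:
        for guard in self._guards[name]:
            guard(args)  # deterministic check; raises PolicyViolation on failure
        return self._tools[name](**args)


# Hypothetical tool and policy rule, for illustration only.
def cancel_booking(booking_id: str, hours_since_booking: float) -> str:
    return f"cancelled {booking_id}"


def guard_cancellation_window(args: Dict[str, Any]) -> None:
    if args["hours_since_booking"] > 24:
        raise PolicyViolation("cancellation window of 24 hours has passed")


tools = GuardedTools()
tools.register("cancel_booking", cancel_booking, [guard_cancellation_window])
```

Calling `tools.call("cancel_booking", booking_id="B1", hours_since_booking=3)` goes through, while the same call with `hours_since_booking=30` raises `PolicyViolation` before the tool runs. Because the guard is ordinary code rather than a prompt, the decision is deterministic and can be unit-tested, which matches the "verifiable guard code" framing in the abstract.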
[NLP-30] Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch
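[Quick Read]: This paper addresses bias in language models, in particular the difficulty of measuring and comparing bias across languages: most studies focus on English models, and languages such as Dutch lack systematic evaluation. The key to its solution is a Dutch version of the CrowS-Pairs dataset with 1463 sentence pairs covering 9 bias categories (such as gender, disability, and sexual orientation), which is used to measure bias in several language models (BERTje, RobBERT, multilingual BERT, GEITje, and Mistral-7B). A comparison with English and French models finds that English models show the most bias and Dutch models the least, and that assigning a persona to a model changes the level of bias it exhibits, which points to the role of cultural and linguistic factors in shaping model bias.
Link: https://arxiv.org/abs/2507.16442
Authors: Elza Strazda,Gerasimos Spanakis
Affiliations: Maastricht University
Categories: Computation and Language (cs.CL)
Comments: 10 pages, accepted at RANLP 2025 data and code here: this https URL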
Abstract:Warning: This paper contains explicit statements of offensive stereotypes which might be upsetting. Language models are prone to exhibiting biases, further amplifying unfair and harmful stereotypes. Given the fast-growing popularity and wide application of these models, it is necessary to ensure safe and fair language models. As of recent considerable attention has been paid to measuring bias in language models, yet the majority of studies have focused only on English language. A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1463 sentence pairs that cover bias in 9 categories, such as Sexual orientation, Gender and Disability. The sentence pairs are composed of contrasting sentences, where one of the sentences concerns disadvantaged groups and the other advantaged groups. Using the Dutch CrowS-Pairs dataset, we show that various language models, BERTje, RobBERT, multilingual BERT, GEITje and Mistral-7B exhibit substantial bias across the various bias categories. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models, and it was shown that English models exhibit the most bias, whereas Dutch models the least amount of bias. Additionally, results also indicate that assigning a persona to a language model changes the level of bias it exhibits. These findings highlight the variability of bias across languages and contexts, suggesting that cultural and linguistic factors play a significant role in shaping model biases.
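A pair-based bias measurement of this kind is usually summarized as the share of pairs in which the model prefers the stereotyping sentence. The sketch below assumes that convention, taken from the original CrowS-Pairs setup rather than from this abstract, and assumes each sentence already has a model score such as a pseudo-log-likelihood. The function name and the toy scores are hypothetical.

```python
from typing import List, Tuple


def stereotype_preference_rate(pair_scores: List[Tuple[float, float]]) -> float:
    """Percentage of pairs where the stereotyping sentence scores higher.

    Each tuple is (score of the stereotyping sentence, score of its contrast).
    A value near 50 means the model shows no systematic preference.
    """
    if not pair_scores:
        raise ValueError("need at least one sentence pair")
    preferred = sum(1 for stereo, contrast in pair_scores if stereo > contrast)
    return 100.0 * preferred / len(pair_scores)


# Made-up scores for four sentence pairs (higher means more likely under the model).
toy_scores = [(-10.2, -11.0), (-8.5, -8.1), (-12.0, -12.7), (-9.9, -9.0)]
rate = stereotype_preference_rate(toy_scores)
```

Computing the same rate per bias category, and per language with the matching dataset version, gives the kind of cross-category and cross-language comparison the abstract reports.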
[NLP-31] PromptAL: Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning
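[Quick Read]: This paper addresses a problem of active learning (AL) in few-shot scenarios: with few labeled examples, the empirical distribution diverges from the target distribution, which shifts the decision boundary and leads to selected samples that represent the target distribution poorly. Existing methods typically define the decision boundary from labeled data alone and overlook how unlabeled samples can bring the empirical distribution closer to the target distribution. The key to its solution is PromptAL, a hybrid AL framework that uses unlabeled data to build sample-aware dynamic soft prompts, which adjust the model's predictive distribution and decision boundary; based on the adjusted boundary, it combines uncertainty estimation with global and local diversity to select higher-quality, more representative samples.
Link: https://arxiv.org/abs/2507.16424
Authors: Hui Xiang,Jinqiao Shi,Ting Zhang,Xiaojie Zhao,Yong Liu,Yong Ma
Affiliations: Beijing University of Posts and Telecommunications; Qianxin
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: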
Abstract:Active learning (AL) aims to optimize model training and reduce annotation costs by selecting the most informative samples for labeling. Typically, AL methods rely on the empirical distribution of labeled data to define the decision boundary and perform uncertainty or diversity estimation, subsequently identifying potential high-quality samples. In few-shot scenarios, the empirical distribution often diverges significantly from the target distribution, causing the decision boundary to shift away from its optimal position. However, existing methods overlook the role of unlabeled samples in enhancing the empirical distribution to better align with the target distribution, resulting in a suboptimal decision boundary and the selection of samples that inadequately represent the target distribution. To address this, we propose a hybrid AL framework, termed PromptAL (Sample-Aware Dynamic Soft Prompts for Few-Shot Active Learning). This framework accounts for the contribution of each unlabeled data point in aligning the current empirical distribution with the target distribution, thereby optimizing the decision boundary. Specifically, PromptAL first leverages unlabeled data to construct sample-aware dynamic soft prompts that adjust the model’s predictive distribution and decision boundary. Subsequently, based on the adjusted decision boundary, it integrates uncertainty estimation with both global and local diversity to select high-quality samples that more accurately represent the target distribution. Experimental results on six in-domain and three out-of-domain datasets show that PromptAL achieves superior performance over nine baselines. Our codebase is openly accessible.
[NLP-32] GG-BBQ: German Gender Bias Benchmark for Question Answering ACL2025
[Quick read]: This paper addresses the evaluation of gender-identity bias in German large language models (LLMs), specifically how to measure and quantify gender bias in model predictions. The key to the solution is a gender bias benchmark dataset for German. It is based on the English Bias Benchmark for Question Answering by Parrish et al. (2022): the gender identity subset was machine translated into German and then manually corrected by a language expert to overcome the limits that German grammatical gender places on machine translation accuracy. This manual revision proved essential for dataset quality. The resulting two subsets (Subset-I: group terms related to gender identity; Subset-II: group terms replaced with proper names) are used to report accuracy and bias scores for several German LLMs; all models exhibit bias, both along and against existing social stereotypes.
Link: https://arxiv.org/abs/2507.16410
Authors: Shalaka Satheesh,Katrin Klug,Katharina Beckh,Héctor Allende-Cid,Sebastian Houben,Teena Hassan
Affiliations: Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS; Lamarr Institute for Machine Learning and Artificial Intelligence; Bonn-Rhein-Sieg University of Applied Sciences
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Accepted to the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP), taking place on August 1st 2025, as part of ACL 2025 in Vienna
Abstract:Within the context of Natural Language Processing (NLP), fairness evaluation is often associated with the assessment of bias and reduction of associated harm. In this regard, the evaluation is usually carried out by using a benchmark dataset, for a task such as Question Answering, created for the measurement of bias in the model’s predictions along various dimensions, including gender identity. In our work, we evaluate gender bias in German Large Language Models (LLMs) using the Bias Benchmark for Question Answering by Parrish et al. (2022) as a reference. Specifically, the templates in the gender identity subset of this English dataset were machine translated into German. The errors in the machine translated templates were then manually reviewed and corrected with the help of a language expert. We find that manual revision of the translation is crucial when creating datasets for gender bias evaluation because of the limitations of machine translation from English to a language such as German with grammatical gender. Our final dataset is comprised of two subsets: Subset-I, which consists of group terms related to gender identity, and Subset-II, where group terms are replaced with proper names. We evaluate several LLMs used for German NLP on this newly created dataset and report the accuracy and bias scores. The results show that all models exhibit bias, both along and against existing social stereotypes.
[NLP-33] Re:Form – Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny
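[Quick read]: This paper addresses a key challenge for LLMs that are trained with reinforcement learning (RL) on informal languages such as natural language: their verification processes are neither reliable nor scalable, and mainstream proprietary models can hardly generate verifiable programs. The authors propose grounding LLMs in a rigorous formal system so that they reason and program in a formal language space (such as Dafny), which allows automatic, mathematically provable verification of both the reasoning process and its results, a prerequisite for large-scale, reliable formal software verification. The core of the solution is an automatic, scalable data curation pipeline with Dafny as the environment, combined with careful RL designs that use feedback from the formal language verifier. The work also introduces the DafnyComp benchmark for specification reasoning. Even small models (e.g., 0.5B parameters) can then generate syntactically valid and verifiable Dafny code, and RL with regularization further improves generalization, outperforming all strong baselines.
Link: https://arxiv.org/abs/2507.16331
Authors: Chuanhao Yan,Fengdi Che,Xuhan Huang,Xu Xu,Xin Li,Yizhi Li,Xingwei Qu,Jingzhe Shi,Zhuangzhuang He,Chenghua Lin,Yaodong Yang,Binhang Yuan,Hang Zhao,Yu Qiao,Bowen Zhou,Jie Fu
Affiliations: Shanghai AI Laboratory
Categories: Computation and Language (cs.CL)
Comments: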
Abstract:Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs. Unfortunately, it becomes unacceptably all-consuming to provide such priors for supervising complex programming tasks. In this work, we systematically explore ways to reduce human priors with the formal language, Dafny, as the main environment for our pilot study. Our pipeline mainly relies on introducing an automatic and scalable data curation pipeline, and careful RL designs integrated with feedback from the formal language verifier. We introduce DafnyComp, a benchmark of compositional formal programs with auto-formalized specifications for specification reasoning. Our supervised fine-tuning (SFT) stage enables even small models (e.g., 0.5B) to generate syntactically valid and verifiable Dafny code, surpassing proprietary models. RL with regularization further improves performance, achieving stronger generalization to out-of-domain tasks and outperforming all strong baselines on the challenging DafnyComp benchmark.
[NLP-34] SpeLLM: Character-Level Multi-Head Decoding
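[Quick read]: This paper addresses a computational bottleneck in scaling LLM vocabularies: in conventional architectures the parameter count of the output projection layer grows linearly with vocabulary size, making substantial vocabulary expansion impractical. The key to the solution is SpeLLM, which decouples the input and output vocabularies and uses multiple independent output heads to predict character-level strings in parallel. Smaller, independent linear layers can thus represent a much larger output space, reducing runtime cost while preserving downstream task performance.
Link: https://arxiv.org/abs/2507.16323
Authors: Amit Ben-Artzy,Roy Schwartz
Affiliations: The Hebrew University of Jerusalem
Categories: Computation and Language (cs.CL)
Comments: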
Abstract:Scaling LLM vocabulary is often used to reduce input sequence length and alleviate attention’s quadratic cost. Yet, current LLM architectures impose a critical bottleneck to this procedure: the output projection layer scales linearly with vocabulary size, rendering substantial expansion impractical. We propose SpeLLM, a method that decouples input and output vocabularies by predicting character-level strings through multiple output heads. In SpeLLM, each of the k linear heads predicts a single character simultaneously, enabling the model to represent a much larger output space using smaller, independent linear heads. We present a self-distillation approach for converting a standard LLM to a SpeLLM. Our experiments with four pre-trained LLMs show their SpeLLM variants achieve competitive performance on downstream tasks while reducing runtime by 5.1% on average across models. Our approach provides a potential avenue for reducing LLM costs, while increasing support for underrepresented languages and domains.
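To make the parameter argument in the abstract concrete, here is a minimal sketch comparing a single output projection over a large vocabulary with k small per-character heads, plus a toy decoder that takes one character from each head. The dimensions, the character set, the padding symbol, and all function names are illustrative assumptions, not values or code from the paper.

```python
def projection_params(hidden_dim, vocab_size):
    """Weights in one standard output projection (bias ignored)."""
    return hidden_dim * vocab_size


def char_heads_params(hidden_dim, num_heads, charset_size):
    """Weights in num_heads independent linear heads over a character set."""
    return num_heads * hidden_dim * charset_size


def output_space_size(num_heads, charset_size):
    """Number of distinct strings that num_heads character slots can spell."""
    return charset_size ** num_heads


def decode_token(head_logits, charset, pad="_"):
    """Take the argmax character of every head and join them into a string.

    The trailing pad symbol is a hypothetical way to spell strings shorter
    than the number of heads.
    """
    chars = []
    for logits in head_logits:
        best = max(range(len(logits)), key=logits.__getitem__)
        chars.append(charset[best])
    return "".join(chars).rstrip(pad)


# Hypothetical sizes: hidden size 4096, 128k-token vocabulary, 10 heads of 256 symbols.
standard = projection_params(4096, 128_000)
spelled = char_heads_params(4096, 10, 256)
token = decode_token([[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.0, 0.0, 1.0]], "ab_")
```

Under these assumed sizes the heads hold 50 times fewer output weights than the single projection while being able to spell 256^10 distinct strings, which is the trade-off the abstract points to.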
[NLP-35] WhatsApp Tiplines and Multilingual Claims in the 2021 Indian Assembly Elections
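[Quick read]: This paper addresses the spread of misinformation on social media during elections, focusing on WhatsApp tiplines, a mechanism that lets users submit misleading content to fact-checkers for verification. The key to the solution is a mixed-method analysis of 580 user-submitted tips in English, Hindi, and Telugu, covering content categorization, content similarity (frequent-word analysis and clustering of neural sentence embeddings), user overlap across languages and fact-checking organizations, and the time fact-checkers take to respond. The results show notable similarity in claims across languages; no user submits tips to more than one fact-checking organization, so each organization has its own audience; and debunking typically takes a couple of days, which points to the need for faster responses and ethically considered handling of users' information.
Link: https://arxiv.org/abs/2507.16298
Authors: Gautam Kishore Shahi,Scot A. Hale
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: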
Abstract:WhatsApp tiplines, first launched in 2019 to combat misinformation, enable users to interact with fact-checkers to verify misleading content. This study analyzes 580 unique claims (tips) from 451 users, covering both high-resource languages (English, Hindi) and a low-resource language (Telugu) during the 2021 Indian assembly elections using a mixed-method approach. We categorize the claims into three categories, election, COVID-19, and others, and observe variations across languages. We compare content similarity through frequent word analysis and clustering of neural sentence embeddings. We also investigate user overlap across languages and fact-checking organizations. We measure the average time required to debunk claims and inform tipline users. Results reveal similarities in claims across languages, with some users submitting tips in multiple languages to the same fact-checkers. Fact-checkers generally require a couple of days to debunk a new claim and share the results with users. Notably, no user submits claims to multiple fact-checking organizations, indicating that each organization maintains a unique audience. We provide practical recommendations for using tiplines during elections with ethical consideration of users’ information.
[NLP-36] Language Detection by Means of the Minkowski Norm: Identification Through Character Bigrams and Frequency Analysis
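[Quick read]: This paper addresses language identification, in particular the fact that traditional non-AI methods have been overlooked now that generative AI dominates the field. The key to the solution is a classical method based on character frequency statistics: it takes frequency rankings of monograms (single characters) and bigrams (character pairs), drawn from established linguistic research, and builds a mathematical algorithm on them to identify the language of texts that vary in length, historical period, and genre. Experiments show over 80% accuracy on short texts (fewer than 150 characters) and 100% accuracy on longer texts and older writings, supporting the effectiveness and scalability of frequency-based methods for language identification.
Link: https://arxiv.org/abs/2507.16284
Authors: Paul-Andrei Pogăcean,Sanda-Maria Avram
Affiliations: Babeș-Bolyai University
Categories: Computation and Language (cs.CL)
Comments: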
Abstract:The debate surrounding language identification has gained renewed attention in recent years, especially with the rapid evolution of AI-powered language models. However, the non-AI-based approaches to language identification have been overshadowed. This research explores a mathematical implementation of an algorithm for language determinism by leveraging monograms and bigrams frequency rankings derived from established linguistic research. The datasets used comprise texts varying in length, historical period, and genre, including short stories, fairy tales, and poems. Despite these variations, the method achieves over 80% accuracy on texts shorter than 150 characters and reaches 100% accuracy for longer texts and older writings. These results demonstrate that classical frequency-based approaches remain effective and scalable alternatives to AI-driven models for language detection.
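As a rough illustration of the frequency-based idea in the abstract, the sketch below builds character-bigram frequency profiles and assigns a text to the language whose profile is nearest under a Minkowski distance. It is a simplification under stated assumptions: it uses raw relative frequencies rather than the paper's rankings, it omits monograms, and the two tiny reference texts, the function names, and the default p = 2 are all hypothetical.

```python
from collections import Counter


def bigram_profile(text):
    """Relative frequencies of character bigrams (letters and spaces only)."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalpha() or ch == " ")
    bigrams = [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]
    total = len(bigrams) or 1
    return {bg: n / total for bg, n in Counter(bigrams).items()}


def minkowski_distance(profile_a, profile_b, p=2.0):
    """Minkowski distance of order p between two sparse frequency profiles."""
    keys = set(profile_a) | set(profile_b)
    total = sum(abs(profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) ** p for k in keys)
    return total ** (1.0 / p)


def detect_language(text, reference_profiles, p=2.0):
    """Return the language whose reference profile is closest to the text."""
    profile = bigram_profile(text)
    return min(
        reference_profiles,
        key=lambda lang: minkowski_distance(profile, reference_profiles[lang], p),
    )


# Toy reference texts; a real system would use large corpora per language.
REFERENCE_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "ro": "vulpea cea roșie sare peste câinele leneș și apoi câinele doarme la soare",
}
REFERENCE_PROFILES = {lang: bigram_profile(t) for lang, t in REFERENCE_TEXTS.items()}
```

Calling `detect_language("the dog sleeps in the sun", REFERENCE_PROFILES)` picks the English profile here; changing `p` switches between Manhattan (p = 1), Euclidean (p = 2), and other members of the Minkowski family.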
[NLP-37] Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction
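[Quick read]: This paper addresses the difficulty LLMs have in extracting fragmented information from complex real-world documents and organizing it into structured tables. LLMs typically produce paragraph-style answers that are disorganized and untraceable, which falls short in applications that demand precise, well-organized information. The key to the solution is the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with 11 tasks across three domains that requires models to generate a context-specific schema for each input query and to reconstruct scattered information into a single table. This design moves beyond conventional text-to-table tasks, which rely on fixed schemas and narrow domains, and systematically evaluates how well LLMs comprehend fragmented documents and reorganize them into structure.
Link: https://arxiv.org/abs/2507.16271
Authors: Tianyun Zhong,Guozhao Mo,Yanjiang Liu,Yihan Chen,Lingdi Kong,Xuanang Chen,Yaojie Lu,Hongyu Lin,Ben He,Le Sun
Affiliations: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: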
Abstract:With the emergence of large language models (LLMs), there is an expectation that LLMs can effectively extract explicit information from complex real-world documents (e.g., papers, reports). However, most LLMs generate paragraph-style answers that are chaotic, disorganized, and untraceable. To bridge this gap, we introduce the Arranged and Organized Extraction Benchmark (AOE), a new bilingual benchmark with data and documents of varying lengths designed to systematically evaluate the ability of LLMs to comprehend fragmented documents and reconstruct isolated information into one organized table. Unlike conventional text-to-table tasks, which rely on fixed schema and narrow task domains, AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. In the experiment, we evaluated both open-source and closed-source state-of-the-art LLMs. The results show that even the most advanced models struggled significantly. The benchmark is available at this https URL.
[NLP-38] Shumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
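[Quick read]: This paper addresses the problem that LLMs may memorize and retain non-compliant sensitive data during pre-training. The goal is efficient, controllable machine unlearning: erasing specific sensitive information under limited computational resources without significantly harming the model's general capabilities. The key to the solution is a more controllable forgetting loss, Effective Unlearning Loss, which is combined with various techniques to better balance forgetting against retained performance. The system ranked 5th in SemEval 2025 Task 4.
Link: https://arxiv.org/abs/2507.16263
Authors: Yujian Sun,Tian Li
Affiliations: Shumei AI Research Institute; Newcastle University
Categories: Computation and Language (cs.CL)
Comments: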
Abstract:As the Large Language Model (LLM) gains widespread adoption, increasing attention has been given to the challenge of making LLM forget non-compliant data memorized during its pre-training. Machine Unlearning focuses on efficiently erasing sensitive information from LLM under limited computational resources. To advance research in this area, SemEval 2025 Task 4: “Unlearning Sensitive Content from Large Language Models” introduces three unlearning datasets and establishes a benchmark by evaluating both forgetting effectiveness and the preservation of standard capabilities. In this work, we propose a more controllable forgetting loss, Effective Unlearning Loss, and explore its integration with various techniques to achieve more efficient and controlled unlearning. Our system ultimately ranked 5th on the competition leaderboard.
[NLP-39] Efficient RL for optimizing conversation level outcomes with an LLM-based tutor
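[Quick read]: This paper addresses a weakness of LLMs built on reinforcement learning with human feedback (RLHF): they optimize for turn-level human preferences, which yields poor long-term outcomes in multi-turn dialogue settings such as online math tutoring. The key to the solution is to represent the student with a low-dimensional latent state and to optimize a long-term policy over that state that chooses high-level tutoring actions, so that the tutor guides the student more effectively toward solving the target math problem independently. The method is lighter and cheaper than prior approaches that train a policy end-to-end to output the tutor's next utterance directly.
Link: https://arxiv.org/abs/2507.16252
Authors: Hyunji Nam,Omer Gottesman,Amy Zhang,Dean Foster,Emma Brunskill,Lyle Ungar
Affiliations: Stanford University; Amazon; University of Texas at Austin; University of Pennsylvania
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages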
Abstract:Large language models (LLMs) built on existing reinforcement learning with human feedback (RLHF) frameworks typically optimize responses based on immediate turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of a student and optimizing a long-term policy to determine high-level actions based on the latent state. The goal is to better align the tutor’s behavior with the long-term objective of guiding the student towards solving a target math problem on their own. Our model is lightweight, requiring less computational resources than prior work of training the tutor policy end-to-end to directly output the tutor’s next utterance. Our experiment results demonstrate that these modifications lead to improved long-term outcomes compared to prompting in LLM-simulated tutoring tasks.
[NLP-40] FinResearchBench: A Logic Tree based Agent-as-a-Judge Evaluation Framework for Financial Research Agents
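[Quick read]: This paper addresses the lack of a systematic, automated framework for evaluating financial research AI agents, a domain whose problems have distinct complexity and subtlety. The key to the solution is FinResearchBench, a logic-tree-based Agent-as-a-Judge evaluation system: it extracts the logic tree of a research output and uses it as intermediate information to assess financial research agents comprehensively, reliably, and robustly across 7 typical task types.
Link: https://arxiv.org/abs/2507.16248
Authors: Run Sun,Zuo Bai,Wentao Zhang,Yuxiang Zhang,Li Zhao,Shan Sun,Zhengwen Qiu
Affiliations: Stepfun; FinStep
Categories: Computation and Language (cs.CL)
Comments: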
Abstract:AI agents have recently evolved rapidly in intelligence and are widely used in professional research applications such as STEM, software development, and finance. Among these AI agents, deep research agents are a key category, as they can perform long-horizon tasks and solve problems of greater complexity. However, there are few evaluation frameworks and benchmarks that systematically and automatically investigate the capabilities of these research agents. Furthermore, financial research problems have distinct complexity and subtlety. To fill this gap, we propose FinResearchBench, a logic-tree-based Agent-as-a-Judge framework that specifically targets financial research agents. It provides a comprehensive and automatic assessment of research agents across 7 key types of tasks in the financial research domain. The contributions of this work are twofold: (1) the first Agent-as-a-Judge system that extracts the logic tree of the research outcome and uses it as intermediate information to present a comprehensive, reliable and robust evaluation; (2) a finance-oriented benchmark covering 70 typical financial research questions, spread across 7 frequently encountered types of tasks in the domain.
[NLP-41] Towards Compute-Optimal Many-Shot In-Context Learning
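[Quick read]: This paper addresses how to select demonstrations efficiently and effectively for many-shot in-context learning (ICL) with long-context LLMs. In practice a fixed set of demonstrations is often chosen at random, which permits caching and keeps inference cost low but limits performance gains. The paper proposes two lightweight selection strategies. The first combines a small number of demonstrations chosen by similarity to each test sample with a much larger set of cached random demonstrations. The second replaces the random demonstrations with ones selected using centroids obtained via k-means clustering, which improves representativeness. The core idea is to use a few highly relevant demonstrations to strengthen semantic matching while keeping the computational benefit of a large cached set, cutting inference cost by up to an order of magnitude while matching or surpassing the best-performing selection method.
Link: https://arxiv.org/abs/2507.16217
Authors: Shahriar Golchin,Yanfei Chen,Rujun Han,Manan Gandhi,Tianli Yu,Swaroop Mishra,Mihai Surdeanu,Rishabh Agarwal,Chen-Yu Lee,Tomas Pfister
Affiliations: University of Arizona; Google Cloud AI Research; Google DeepMind; Microsoft; Meta
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Final version; accepted at COLM 2025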
Abstract:Long-context large language models (LLMs) are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.
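The first strategy in the abstract can be sketched as follows: one large block of random demonstrations is chosen once and shared by all queries, and a few demonstrations similar to the current test sample are added per query. This is only a sketch under assumptions: the toy 2-d vectors stand in for real sentence embeddings, cosine similarity is an assumed similarity measure, placing the shared block first (so a prompt-prefix cache could be reused) is an assumed layout, and all names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_cached_block(pool_size, n_random, seed=0):
    """Choose the random demonstrations once; every query reuses this block."""
    return random.Random(seed).sample(range(pool_size), n_random)


def select_demonstrations(test_vec, pool_vecs, cached_ids, n_similar):
    """Return demo indices: the shared cached block, then per-query similar demos."""
    cached = set(cached_ids)
    candidates = [i for i in range(len(pool_vecs)) if i not in cached]
    candidates.sort(key=lambda i: cosine(test_vec, pool_vecs[i]), reverse=True)
    return list(cached_ids) + candidates[:n_similar]


# Toy 2-d "embeddings" of six candidate demonstrations.
POOL_VECS = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7), (-1.0, 0.0)]
CACHED_IDS = pick_cached_block(len(POOL_VECS), n_random=3)
demo_ids = select_demonstrations((1.0, 0.05), POOL_VECS, CACHED_IDS, n_similar=2)
```

Only the short similar-demonstration tail changes between queries, which is why the random block can be computed once and reused. The second strategy would swap the random block for demonstrations chosen via k-means centroids, which is not shown here.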
[NLP-42] WakenLLM: A Fine-Grained Benchmark for Evaluating LLM Reasoning Potential and Reasoning Process Stability
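[Quick read]: This paper addresses the fact that LLMs frequently output the label "Unknown" on reasoning tasks, while current evaluations only ask whether such answers are honest and ignore why they arise: either the input is intrinsically indeterminate or the model fails on a solvable problem. The authors call this conflation Vague Perception. The key to the solution is a framework that quantifies the share of "Unknown" responses attributable to model incapacity and tests whether guided stimulation can convert them into correct ("Known") answers or into intrinsically indeterminate outcomes. This offers a new view of the limits of LLM reasoning and its potential for improvement.
Link: https://arxiv.org/abs/2507.16199
Authors: Zipeng Ling,Yuehao Tang,Shuliang Liu,Junqi Yang,Shenghong Fu,Yao Wan,Kejia Huang,Zhichao Hou,Xuming Hu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: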
Abstract:Large Language Models (LLMs) frequently output the label Unknown, yet current evaluations focus almost exclusively on whether such answers are honest rather than why they arise. This blurs two distinct cases: (i) an input that is genuinely indeterminate and (ii) a solvable problem that the model fails to resolve. We call this phenomenon Vague Perception. We therefore introduce a framework that quantifies the proportion of Unknown responses attributable to model incapacity and tests whether guided stimulation can convert them into either correct (Known) or intrinsically indeterminate outcomes. By separating these sources of uncertainty, our method provides a clearer picture of LLM reasoning limits and their potential for improvement. Having obtained a theoretical accuracy on reasoning tasks for different LLMs, we apply different methods to test whether each model can reach that accuracy under a baseline framework. Our work is meaningful in exploring the true reasoning ability of LLMs and providing a new perspective on solving the Vague Perception phenomenon.
[NLP-43] Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task
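[Quick read]: This paper addresses the limits of LLMs' social reasoning, specifically their weak performance on planning theory of mind (PToM) tasks. Conventional theory of mind (ToM) evaluations place participants in a spectator role, predicting and interpreting behavior, whereas human ToM also involves actively intervening on others' mental states, that is, planning strategies from an understanding of others' beliefs and desires in order to change their behavior. To get closer to real social settings, the authors propose MindGames, a task in which an agent, without being told the other party's preferences, must infer their mental states and persuade them toward a goal. The key innovation is extending ToM from static understanding to dynamic decision-making and intervention, and explicitly evaluating use cases of ToM. Humans significantly outperform o1-preview on the PToM task (11% higher; p=0.006), but o1-preview does better in a baseline condition that requires little mental state inference, suggesting that LLMs lack the implicit causal model of other agents that humans have.
Link: https://arxiv.org/abs/2507.16196
Authors: Jared Moore,Ned Cooper,Rasmus Overmark,Beba Cibralic,Nick Haber,Cameron R. Jones
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: To appear in COLM, 2025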
Abstract:Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents’ behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others’ mental states. We present MindGames: a novel ‘planning theory of mind’ (PToM) task which requires agents to infer an interlocutor’s beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; p=0.006). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people’s preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone’s preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.
[NLP-44] Characterizing Online Activities Contributing to Suicide Mortality among Youth AAAI
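[Quick read]: This paper addresses a public health question raised by rising youth suicide rates: how online experiences contribute to suicide risk. The core challenge is to identify risk factors for suicide mortality in online activity and to build a framework that can model them at scale. The key to the solution is a mixed-methods approach: a thematic analysis of 29,124 open-text summaries of death investigations first identifies 12 types of online activities relevant to suicide deaths; a zero-shot learning framework then models these themes at scale; and the themes are related to two prominent suicide theories and analyzed by decedent characteristics (such as age, means of death, and interpersonal problems) and over time (for example, during the COVID-19 lockdowns). The work reveals patterns of less explicit suicide-risk signals in digital spaces and informs interventions that combine suicide theories with computational research.
Link: https://arxiv.org/abs/2507.16185
Authors: Aparna Ananthasubramaniam,Elyse J. Thulin,Viktoryia Kalesnikava,Silas Falde,Jonathan Kertawidjaja,Lily Johns,Alejandro Rodríguez-Putnam,Emma Spring,Kara Zivin,Briana Mezuk
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: Accepted at the AAAI International Conference on Web and Social Media (ICWSM) 2026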
Abstract:The recent rise in youth suicide highlights the urgent need to understand how online experiences contribute to this public health issue. Our mixed-methods approach responds to this challenge by developing a set of themes focused on risk factors for suicide mortality in online spaces among youth ages 10-24, and a framework to model these themes at scale. Using 29,124 open text summaries of death investigations between 2013-2022, we conducted a thematic analysis to identify 12 types of online activities that were considered by investigators or next of kin to be relevant in contextualizing a given suicide death. We then develop a zero-shot learning framework to model these 12 themes at scale, and analyze variation in these themes by decedent characteristics and over time. Our work uncovers several online activities related to harm to self, harm to others, interpersonal interactions, activity levels online, and life events, which correspond to different phases of suicide risk from two prominent suicide theories. We find an association between these themes and decedent characteristics like age, means of death, and interpersonal problems, and many themes became more prevalent during the 2020 COVID-19 lockdowns. While digital spaces have taken some steps to address expressions of suicidality online, our work illustrates the opportunities for developing interventions related to less explicit indicators of suicide risk by combining suicide theories with computational research.
[NLP-45] BIDWESH: A Bangla Regional Based Hate Speech Detection Dataset
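[Quick read]: This paper addresses the poor performance of hate speech detection on Bangla dialects. In linguistically diverse countries such as Bangladesh, datasets in standard Bangla do not cover the informal, culturally rich expressions of regional dialects, which limits detection and biases moderation. The key to the solution is BIDWESH, the first multi-dialectal Bangla hate speech dataset, built by translating 9,183 instances from the BD-SHS corpus into three major regional dialects (Barishal, Noakhali, and Chittagong) and annotating them. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group). The result is a linguistically rich, balanced, and inclusive resource that lays the groundwork for dialect-sensitive natural language processing (NLP) tools in low-resource settings.
Link: https://arxiv.org/abs/2507.16183
Authors: Azizul Hakim Fayaz,MD. Shorif Uddin,Rayhan Uddin Bhuiyan,Zakia Sultana,Md. Samiul Islam,Bidyarthi Paul,Tashreef Muhammad,Shahriar Manzoor
Affiliations: Southeast University
Categories: Computation and Language (cs.CL)
Comments: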
Abstract:Hate speech on digital platforms has become a growing concern globally, especially in linguistically diverse countries like Bangladesh, where regional dialects play a major role in everyday communication. Despite progress in hate speech detection for standard Bangla, existing datasets and systems fail to address the informal and culturally rich expressions found in dialects such as Barishal, Noakhali, and Chittagong. This oversight results in limited detection capability and biased moderation, leaving large sections of harmful content unaccounted for. To address this gap, this study introduces BIDWESH, the first multi-dialectal Bangla hate speech dataset, constructed by translating and annotating 9,183 instances from the BD-SHS corpus into three major regional dialects. Each entry was manually verified and labeled for hate presence, type (slander, gender, religion, call to violence), and target group (individual, male, female, group), ensuring linguistic and contextual accuracy. The resulting dataset provides a linguistically rich, balanced, and inclusive resource for advancing hate speech detection in Bangla. BIDWESH lays the groundwork for the development of dialect-sensitive NLP tools and contributes significantly to equitable and context-aware content moderation in low-resource language settings.
[NLP-46] SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting
[Quick read]: This paper addresses two problems in diagnosing Chronic Obstructive Pulmonary Disease (COPD): current AI models lack interpretability, and LLMs cannot understand spirograms, both of which limit clinical trust and adoption. The key to the solution is SpiroLLM, presented as the first multimodal LLM that can understand spirograms. A purpose-built SpiroEncoder extracts morphological features from respiratory curves, and a SpiroProjector aligns them with the numerical values of pulmonary function tests (PFTs) in a unified latent space, enabling the LLM to generate a comprehensive diagnostic report. The model reaches a diagnostic AUROC of 0.8980 and keeps a 100% valid response rate when core data are missing, demonstrating the feasibility and benefit of fusing physiological signals with LLMs.
Link: https://arxiv.org/abs/2507.16145
Authors: Shuhao Mei,Yongchao Long,Shan Cao,Xiaobo Han,Shijia Geng,Jinbo Sun,Yuxi Zhou,Shenda Hong
Affiliations: Guangzhou Institute of Technology, Xidian University, Xi’an, China; Department of Computer Science, Tianjin University of Technology, Tianjin, China; Department of Respiratory, The Second Hospital of Tianjin Medical University, China; College of Pulmonary and Critical Care Medicine, Chinese PLA General Hospital, Beijing, China; HeartVoice Medical Technology, Hefei, China; School of Life Science and Technology, Xidian University, Xi’an, China; DCST, BNRist, RIIT, Institute of Internet Industry, Tsinghua University, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China; Institute for Artificial Intelligence, Peking University, Beijing, China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Chronic Obstructive Pulmonary Disease (COPD), a major chronic respiratory disease with persistent airflow limitation, is a leading global cause of disability and mortality. Respiratory spirogram time series, routinely collected during pulmonary function tests (PFTs), play a critical role in the early detection of respiratory diseases and in monitoring lung function over time. However, most current AI models for COPD diagnosis are limited to outputting classification results without providing a rationale for their diagnostic process, while current Large Language Models (LLMs) cannot yet understand spirograms, which severely limits their clinical trust and adoption. To tackle this challenge, we leverage a cohort of 234,028 individuals from the UK Biobank (UKB) to propose SpiroLLM, the first multimodal large language model that can understand spirograms. The model extracts morphological features from respiratory curves via a SpiroEncoder and aligns them with PFT numerical values in a unified latent space using a SpiroProjector, ultimately empowering a large language model to generate a comprehensive diagnostic report. Experimental results confirm that SpiroLLM achieved a diagnostic AUROC of 0.8980 (95% CI: 0.8820-0.9132). In a robustness test with missing core data, it maintained a 100% valid response rate, far surpassing the 13.4% of a text-only model and showcasing the superiority of its multimodal design. This work demonstrates the substantial potential of deeply fusing physiological signals with large language models, establishing a new paradigm for the next generation of interpretable and reliable clinical decision support tools.
[NLP-47] Efficient Compositional Multi-tasking for On-device Large Language Models
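[Quick Read]: This paper addresses text-based compositional multi-tasking for large language models (LLMs) in on-device settings, where each test example requires several tasks to be executed at once (for example, translation together with summarization), whereas existing methods only handle single-task scenarios. The key to the solution is an efficient method called Learnable Calibration, designed for on-device applications with limited computational resources. It balances resource efficiency and performance, supporting the deployment of LLMs in real-world, complex multi-task scenarios.
Link: https://arxiv.org/abs/2507.16083
Authors: Ondrej Bohdal,Mete Ozay,Jijoong Moon,Kyeng-Hun Lee,Hyeonmok Ko,Umberto Michieli
Affiliations: Samsung R&D Institute UK; Samsung Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: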
Abstract:Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.
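[NLP-48] The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models
[Quick Read]: This paper examines how, when persona prompting is used to make large language models (LLMs) simulate the views of different sociodemographic groups, the way the prompt is formulated significantly affects the accuracy and fairness of the simulation, with notable bias in the portrayal of marginalized groups (such as nonbinary, Hispanic, and Middle Eastern identities). The key findings are that an interview-style role adoption format and name-based demographic priming can reduce stereotyping and improve alignment between model outputs and the target group. The study also finds that smaller models (such as OLMo-2-7B) can outperform larger ones (such as Llama-3.3-70B) in some settings, which suggests that model size is not the only factor determining simulation quality and that prompt design deserves priority.
Link: https://arxiv.org/abs/2507.16076
Authors: Marlene Lutz,Indira Sen,Georg Ahnert,Elisa Rogers,Markus Strohmaier
Affiliations: University of Mannheim; GESIS - Leibniz Institute for the Social Sciences; Complexity Science Hub Vienna
Categories: Computation and Language (cs.CL)
Comments: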
Abstract:Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups, particularly nonbinary, Hispanic, and Middle Eastern identities, but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.
[NLP-49] Deep Researcher with Test-Time Diffusion
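[Quick Read]: This paper addresses the tendency of LLM-based deep research agents to plateau when generating complex, long-form research reports, particularly with generic test-time scaling algorithms. The key to the solution is a new framework, the Test-Time Diffusion Deep Researcher (TTD-DR), which models report generation as a diffusion process: an updatable preliminary draft serves as the evolving foundation, a retrieval mechanism dynamically incorporates external information, and a self-evolutionary algorithm optimizes each component of the agentic workflow, yielding iterative "denoising" and refinement of the content. This draft-centric design makes writing more timely and coherent while reducing information loss during iterative search. It significantly outperforms existing methods and reaches state-of-the-art performance on tasks requiring intensive search and multi-hop reasoning.
Link: https://arxiv.org/abs/2507.16075
Authors: Rujun Han,Yanfei Chen,Zoey CuiZhu,Lesly Miculicich,Guan Sun,Yuanjun Bi,Weiming Wen,Hui Wan,Chunfeng Wen,Solène Maître,George Lee,Vishy Tirumalashetty,Emily Xue,Zizhao Zhang,Salem Haykal,Burak Gokturk,Tomas Pfister,Chen-Yu Lee
Affiliations: Google Cloud AI Research; Google Cloud
Categories: Computation and Language (cs.CL)
Comments: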
Abstract:Deep research agents, powered by Large Language Models (LLMs), are rapidly advancing; yet, their performance often plateaus when generating complex, long-form research reports using generic test-time scaling algorithms. Drawing inspiration from the iterative nature of human research, which involves cycles of searching, reasoning, and revision, we propose the Test-Time Diffusion Deep Researcher (TTD-DR). This novel framework conceptualizes research report generation as a diffusion process. TTD-DR initiates this process with a preliminary draft, an updatable skeleton that serves as an evolving foundation to guide the research direction. The draft is then iteratively refined through a “denoising” process, which is dynamically informed by a retrieval mechanism that incorporates external information at each step. The core process is further enhanced by a self-evolutionary algorithm applied to each component of the agentic workflow, ensuring the generation of high-quality context for the diffusion process. This draft-centric design makes the report writing process more timely and coherent while reducing information loss during the iterative search process. We demonstrate that our TTD-DR achieves state-of-the-art results on a wide array of benchmarks that require intensive search and multi-hop reasoning, significantly outperforming existing deep research agents.
[NLP-50] AutoMeet: a proof-of-concept study of genAI to automate meetings in automotive engineering
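[Quick Read]: This paper addresses inefficient knowledge sharing in large organisations, in particular the repeated communication and high-frequency meetings caused by inconsistent meeting documentation and hard-to-retrieve records. Its core solution uses generative AI (genAI) models to build an end-to-end automated pipeline for meeting documentation: genAI transcribes meeting recordings and integrates heterogeneous information sources into structured content, which is then exposed through an interactive chatbot interface for ad-hoc queries. The key contribution is deploying the tool in a real engineering department and collecting empirical data on ethical and technical aspects, which supports the view that genAI can reduce the meeting burden and improve knowledge accessibility, and identifies organisational ethical governance as important for a successful rollout.
Link: https://arxiv.org/abs/2507.16054
Authors: Simon Baeuerle,Max Radyschevski,Ulrike Pado
Affiliations: Institute of Automation and Applied Informatics (IAI), Karlsruhe Institute of Technology (KIT); Mobility Electronics, Robert Bosch GmbH; Stuttgart University of Applied Sciences
Categories: Computation and Language (cs.CL)
Comments: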
Abstract:In large organisations, knowledge is mainly shared in meetings, which takes up significant amounts of work time. Additionally, frequent in-person meetings produce inconsistent documentation – official minutes, personal notes, presentations may or may not exist. Shared information therefore becomes hard to retrieve outside of the meeting, necessitating lengthy updates and high-frequency meeting schedules. Generative Artificial Intelligence (genAI) models like Large Language Models (LLMs) exhibit an impressive performance on spoken and written language processing. This motivates a practical usage of genAI for knowledge management in engineering departments: using genAI for transcribing meetings and integrating heterogeneous additional information sources into an easily usable format for ad-hoc searches. We implement an end-to-end pipeline to automate the entire meeting documentation workflow in a proof-of-concept state: meetings are recorded and minutes are created by genAI. These are further made easily searchable through a chatbot interface. The core of our work is to test this genAI-based software tooling in a real-world engineering department and collect extensive survey data on both ethical and technical aspects. Direct feedback from this real-world setup points out both opportunities and risks: a) users agree that the effort for meetings could be significantly reduced with the help of genAI models, b) technical aspects are largely solved already, c) organizational aspects are crucial for a successful ethical usage of such a system.
[NLP-51] mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages ACL2025
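[Quick Read]: This paper addresses Multilingual Knowledge Graph Construction (mKGC), i.e. automatically completing or predicting missing entities and relation links in low-resource language settings. The core challenge is how to exploit knowledge transfer from high-resource languages and improve knowledge graph construction accuracy for low-resource languages. The key to the solution is to recast mKGC as a Question Answering (QA) task and to propose mRAKL, a system based on Retrieval-Augmented Generation (RAG): the head entity and relation form the question, and the model predicts the tail entity as the answer, while a BM25 retriever fetches relevant context from multilingual corpora to strengthen generation. Experiments on the low-resource languages Tigrinya and Amharic show that the method outperforms a no-context baseline, and that with idealized retrieval accuracy rises by 4.92 and 8.79 percentage points, respectively.
Link: https://arxiv.org/abs/2507.16011
Authors: Hellina Hailu Nigatu,Min Li,Maartje ter Hoeve,Saloni Potdar,Sarah Chasins
Affiliations: UC Berkeley; Apple
Categories: Computation and Language (cs.CL)
Comments: Accepted to Findings of ACL 2025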
Abstract:Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages Arabic and English for cross-lingual transfer. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.
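[NLP-52] Help Me Write a Story: Evaluating LLMs' Ability to Generate Writing Feedback ACL2025
[Quick Read]: This paper examines whether large language models (LLMs) can generate meaningful writing feedback, and in particular the challenges and limitations they face when supporting creative writers. The key to the approach is a new task definition, a dataset of 1,300 stories into which writing issues were intentionally introduced, and an evaluation framework combining automatic and human evaluation, used to test systematically how well LLMs identify and give feedback on writing flaws. Experiments show that current models give specific and mostly accurate feedback in many respects, but fall short at identifying the most serious issue and at judging when to offer critical versus encouraging feedback.
Link: https://arxiv.org/abs/2507.16007
Authors: Hannah Rashkin,Elizabeth Clark,Fantine Huot,Mirella Lapata
Affiliations: Google DeepMind
Categories: Computation and Language (cs.CL)
Comments: ACL 2025 main conference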
Abstract:Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects – providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.
[NLP-53] Learning without training: The implicit dynamics of in-context learning
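[Quick Read]: This paper investigates the mechanism behind in-context learning in Large Language Models (LLMs), i.e. how a model can learn new patterns from examples in the prompt at inference time without updating its parameters. The key is showing how the self-attention and feed-forward (MLP) layers in a transformer block work together: theory and experiments indicate that, under certain simplifying assumptions, the self-attention layer implicitly converts the input context into a low-rank weight update of the MLP layer, so the model can adapt to new tasks or patterns without explicit gradient updates. This mechanism offers a possible intrinsic explanation for in-context learning in LLMs.
Link: https://arxiv.org/abs/2507.16003
Authors: Benoit Dherin,Michael Munn,Hanna Mazzawi,Michael Wunder,Javier Gonzalvo
Affiliations: Google Research
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: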
Abstract:One of the most striking features of Large Language Models (LLMs) is their ability to learn in context. Namely, at inference time an LLM is able to learn new patterns without any additional weight update when these patterns are presented in the form of examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that the stacking of a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in context and not only during training. Specifically, we show under mild simplifying assumptions how a transformer block implicitly transforms a context into a low-rank weight-update of the MLP layer.
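The low-rank update described in this abstract can be illustrated with a small numerical sketch. It assumes a single linear layer `W` standing in for the first MLP weight matrix, and toy vectors standing in for the attention output with and without context; these simplifications and all names are illustrative assumptions, not the paper's exact construction. The point is only that a context-induced shift in the layer's input can be absorbed into a rank-1 change of the weights.

```python
# Sketch: a context-induced shift in the attention output acts like a
# rank-1 update of the next linear layer. All names are illustrative.

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def implicit_update(W, a_query, a_context):
    """Return delta_W such that (W + delta_W) @ a_query == W @ a_context.

    a_query:   attention output for the query alone (assumed non-zero)
    a_context: attention output for the same query with context prepended
    """
    delta_a = [c - q for c, q in zip(a_context, a_query)]
    w_delta = matvec(W, delta_a)              # W @ (a_context - a_query)
    norm_sq = sum(q * q for q in a_query)     # squared norm of a_query
    # Outer product of two vectors, so the update has rank at most 1.
    return [[wd * q / norm_sq for q in a_query] for wd in w_delta]

W = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
a_query = [1.0, 0.5, -2.0]
a_context = [0.0, 1.5, 1.0]

delta_W = implicit_update(W, a_query, a_context)
W_updated = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, delta_W)]

with_context = matvec(W, a_context)            # original weights, context present
without_context = matvec(W_updated, a_query)   # updated weights, no context
```

Both `with_context` and `without_context` hold the same vector up to floating-point error, which is the sense in which the context behaves like a weight update that never has to be stored.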
[NLP-54] Enhancing Hindi NER in Low Context: A Comparative study of Transformer-based models with vs. without Retrieval Augmentation
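[Quick Read]: This paper addresses named entity recognition (NER), particularly its limited performance for lower-resource languages such as Hindi. The key to the solution is combining pretrained language models (MuRIL, XLM-R, the Llama series, and GPT3.5-turbo) with Retrieval Augmentation (RA), which augments the data with relevant knowledge retrieved from external contexts such as Wikipedia. Experiments show that with RA the macro F1 scores of fine-tuned MuRIL and XLM-R rise from 0.69 and 0.495 to 0.70 and 0.71, respectively, and that generative models used without fine-tuning, notably GPT3.5-turbo in the few-shot setting, also benefit from RA, which supports the effectiveness of RA for NER in low-resource languages.
Link: https://arxiv.org/abs/2507.16002
Authors: Sumit Singh,Rohit Mishra,Uma Shanker Tiwary
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: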
Abstract:One major challenge in natural language processing is named entity recognition (NER), which identifies and categorises named entities in textual input. In order to improve NER, this study investigates a Hindi NER technique that makes use of Hindi-specific pretrained encoders (MuRIL and XLM-R) and Generative Models (Llama-2-7B-chat-hf (Llama2-7B), Llama-2-70B-chat-hf (Llama2-70B), Llama-3-70B-Instruct (Llama3-70B) and GPT3.5-turbo), and augments the data with retrieved data from external relevant contexts, notably from Wikipedia. We have fine-tuned MuRIL, XLM-R and Llama2-7B with and without RA. However, Llama2-70B, Llama3-70B and GPT3.5-turbo are utilised for few-shot NER generation. Our investigation shows that the mentioned language models (LMs) with Retrieval Augmentation (RA) outperform baseline methods that don’t incorporate RA in most cases. The macro F1 scores for MuRIL and XLM-R are 0.69 and 0.495, respectively, without RA and increase to 0.70 and 0.71, respectively, in the presence of RA. Fine-tuned Llama2-7B outperforms Llama2-7B by a significant margin. On the other hand the generative models which are not fine-tuned also perform better with augmented data. GPT3.5-turbo adopted RA well; however, Llama2-70B and Llama3-70B did not adopt RA with our retrieval context. The findings show that RA significantly improves performance, especially for low-context data. This study adds significant knowledge about how best to use data augmentation methods and pretrained models to enhance NER performance, particularly in languages with limited resources.
[NLP-55] AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
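[Quick Read]: This paper addresses a limitation of current language model (LM) evaluation, which relies heavily on tasks humans have already solved, and asks whether LMs can design and implement efficient algorithms for computationally challenging problems. The key to the solution is AlgoTune, an open-ended benchmark of 155 coding tasks collected from domain experts, together with a validation and timing framework that compares LM-generated code against reference implementations from popular open-source libraries (such as SciPy, sk-learn, and CVXPY). The authors also develop a baseline LM agent, AlgoTuner, and evaluate it across frontier models. It achieves an average 1.72x speedup, but no algorithmic innovations were found, only a preference for surface-level optimizations, indicating that current LMs still lack creative problem-solving ability.
Link: https://arxiv.org/abs/2507.15887
Authors: Ori Press,Brandon Amos,Haoyu Zhao,Yikai Wu,Samuel K. Ainsworth,Dominik Krupke,Patrick Kidger,Touqir Sajed,Bartolomeo Stellato,Jisun Park,Nathanael Bosch,Eli Meril,Albert Steppi,Arman Zharmagambetov,Fangzhao Zhang,David Perez-Pineiro,Alberto Mercurio,Ni Zhan,Talor Abramovich,Kilian Lieret,Hanlin Zhang,Shirley Huang,Matthias Bethge,Ofir Press
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: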
Abstract:Despite progress in language model (LM) capabilities, evaluations have thus far focused on models’ performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models’ ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 155 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, sk-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
[NLP-56] Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark
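[Quick Read]: This paper addresses the lack of suitable benchmarks for evaluating multimodal large language models (MLLMs) on long documents. Existing work pays little attention to long or visually complex documents, which limits the application and optimization of models in real scenarios. The authors therefore propose Document Haystack, a comprehensive benchmark whose key idea is to build a dataset of structurally complex documents ranging from 5 to 200 pages and to embed pure-text or text+image "needles" (the target information) at different depths, systematically testing retrieval ability over long contexts. The benchmark comprises 400 document variants and 8,250 questions, with an automated, objective evaluation framework, providing a standardized evaluation tool for future research on multimodal long-document understanding.
Link: https://arxiv.org/abs/2507.15882
Authors: Goeric Huybrechts,Srikanth Ronanki,Sai Muralidhar Jayanthi,Jack Fitzgerald,Srinivasan Veeravanallur
Affiliations: Amazon AGI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: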
Abstract:The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image “needles” at various depths within the documents to challenge VLMs’ retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.
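The needle-insertion idea in this abstract can be sketched for the text-only case as follows. This is a minimal illustration, not the benchmark's actual tooling: the function name, the depth-as-fraction convention, and the needle wording are all hypothetical, and pages are modelled as plain strings.

```python
def insert_needle(pages, needle, depth):
    """Return (new_pages, index): a copy of `pages` with `needle` appended
    to the page selected by `depth`.

    depth is a fraction in [0, 1]: 0.0 targets the first page, 1.0 the last.
    """
    if not pages:
        raise ValueError("document has no pages")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * (len(pages) - 1))
    result = list(pages)                      # leave the input untouched
    result[index] = result[index] + "\n" + needle
    return result, index

pages = [f"Text of page {n}" for n in range(1, 6)]   # a 5-page toy document
needle = "The secret fruit is an apple."              # hypothetical needle
haystack, position = insert_needle(pages, needle, depth=0.5)
```

Here the needle lands on the middle page (`position` is 2, counting from 0). A retrieval question about the needle can then be posed against `haystack`, and the depth varied to see whether accuracy depends on where the target sits.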
[NLP-57] Why Braking? Scenario Extraction and Reasoning Utilizing LLM
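[Quick Read]: This paper addresses the difficulty of identifying safety-critical corner cases for automated driving, in particular mining potentially hazardous braking events from large volumes of routine driving data. Existing rule-based heuristics work well in simple road environments (such as highways) but generalize poorly in complex urban environments. The key to the solution is a dual-path scenario retrieval framework based on a Large Language Model (LLM): category-based search locates known scenarios precisely, while embedding-based retrieval finds unknown Out-of-Distribution (OOD) scenarios. The method maps low-level numerical signals to natural language descriptions, improving scenario understanding and reasoning, and on the Argoverse 2 Sensor Dataset it outperforms rule-based baselines and generalizes well to OOD scenarios.
Link: https://arxiv.org/abs/2507.15874
Authors: Yin Wu,Daniel Slieter,Vivek Subramanian,Ahmed Abouelazm,Robin Bohn,J. Marius Zöllner
Affiliations: CARIAD SE; Karlsruhe Institute of Technology; FZI Research Center for Information Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: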
Abstract:The growing number of ADAS-equipped vehicles has led to a dramatic increase in driving data, yet most of it captures routine driving behavior. Identifying and understanding safety-critical corner cases within this vast dataset remains a significant challenge. Braking events are particularly indicative of potentially hazardous situations, motivating the central question of our research: Why does a vehicle brake? Existing approaches primarily rely on rule-based heuristics to retrieve target scenarios using predefined condition filters. While effective in simple environments such as highways, these methods lack generalization in complex urban settings. In this paper, we propose a novel framework that leverages a Large Language Model (LLM) for scenario understanding and reasoning. Our method bridges the gap between low-level numerical signals and natural language descriptions, enabling the LLM to interpret and classify driving scenarios. We propose a dual-path scenario retrieval that supports both category-based search for known scenarios and embedding-based retrieval for unknown Out-of-Distribution (OOD) scenarios. To facilitate evaluation, we curate scenario annotations on the Argoverse 2 Sensor Dataset. Experimental results show that our method outperforms rule-based baselines and generalizes well to OOD scenarios.
[NLP-58] Small Edits Big Consequences: Telling Good from Bad Robustness in Large Language Models
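[Quick Read]: This paper addresses an imbalance between robustness and sensitivity to prompt perturbations in code generation by large language models (LLMs): models are over-robust to harmless noise and missing information, yet insensitive to edits that change the meaning of the task. The key to the approach is three minimal perturbation types, namely progressive underspecification, lexical flip, and jargon inflation, used to evaluate six frontier models on LeetCode problems. Models remain correct in 85% of cases even when 90% of the prompt is missing, showing over-robustness to underspecification. Only 54% of generations react to a single pivotal quantifier flip (such as "max" to "min"), with reasoning-tuned variants even less sensitive. Jargon edits lie in between at 56%. The study advocates evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt or refuse when the semantics truly change, so that models stop blurring the boundary between harmless noise and harmful edits.
Link: https://arxiv.org/abs/2507.15868
Authors: Altynbek Ismailov,Salia Asanova
Affiliations: University of California, Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: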
Abstract:Large language models (LLMs) now write code in settings where misreading a single word can break safety or cost money, yet we still expect them to overlook stray typos. To probe where useful robustness ends and harmful insensitivity begins, we compile 50 LeetCode problems and craft three minimal prompt perturbations that should vary in importance: (i) progressive underspecification deleting 10% of words per step; (ii) lexical flip swapping a pivotal quantifier (“max” to “min”); and (iii) jargon inflation replacing a common noun with an obscure technical synonym. Six frontier models, including three “reasoning-tuned” versions, solve each mutated prompt, and their Python outputs are checked against the original test suites to reveal whether they reused the baseline solution or adapted. Among 11,853 generations we observe a sharp double asymmetry. Models remain correct in 85% of cases even after 90% of the prompt is missing, showing over-robustness to underspecification, yet only 54% react to a single quantifier flip that reverses the task, with reasoning-tuned variants even less sensitive than their bases. Jargon edits lie in between, passing through 56%. Current LLMs thus blur the line between harmless noise and meaning-changing edits, often treating both as ignorable. Masking salient anchors such as function names can force re-evaluation. We advocate evaluation and training protocols that reward differential sensitivity: stay steady under benign noise but adapt, or refuse, when semantics truly change.
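Perturbation (i) from this abstract, progressive underspecification, can be sketched as below. The abstract does not say how the deleted words are chosen, so this version assumes uniform random deletion with a fixed seed and removes 10% of the original word count at each step; the function name and the sample prompt are hypothetical.

```python
import random

def progressive_underspecification(prompt, steps=9, fraction=0.10, seed=0):
    """Return a list of prompts with a growing share of words deleted.

    Each step removes `fraction` of the ORIGINAL word count, chosen at random
    among the words still present, and preserves the order of the survivors.
    """
    words = prompt.split()
    per_step = max(1, round(fraction * len(words)))
    rng = random.Random(seed)                 # fixed seed for repeatability
    kept = list(range(len(words)))            # indices of surviving words
    variants = []
    for _ in range(steps):
        for index in rng.sample(kept, min(per_step, len(kept))):
            kept.remove(index)
        variants.append(" ".join(words[i] for i in kept))
    return variants

prompt = ("Given an array of integers return the max sum of any contiguous "
          "subarray of length k in the input array")
variants = progressive_underspecification(prompt)
```

With this 20-word prompt, each step drops two words, so the ninth variant keeps only two of them, matching the "90% of the prompt is missing" condition. Each variant would then be sent to the model and its output run against the original test suite.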
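[NLP-59] RDMA: Cost Effective Agent-Driven Rare Disease Discovery within Electronic Health Record Systems
[Quick Read]: This paper addresses the problem that standard ICD coding systems fail to capture rare diseases accurately in Electronic Health Records (EHR), leaving key information buried in clinical notes. Existing methods handle medical abbreviations poorly, miss implicit disease mentions, raise privacy risks through cloud processing, and lack clinical reasoning abilities. The key to the solution is Rare Disease Mining Agents (RDMA), a framework that mimics how clinical experts identify specific rare disease patterns by connecting scattered clinical observations. RDMA handles clinical abbreviations, recognizes implicit disease patterns, and performs context-aware reasoning on local hardware, reducing privacy risks while improving F1 by upwards of 30% and cutting inference costs 10-fold, which supports earlier diagnosis.
Link: https://arxiv.org/abs/2507.15867
Authors: John Wu,Adam Cross,Jimeng Sun
Affiliations: University of Illinois; University of Illinois Chicago
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: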
Abstract:Rare diseases affect 1 in 10 Americans, yet standard ICD coding systems fail to capture these conditions in electronic health records (EHR), leaving crucial information buried in clinical notes. Current approaches struggle with medical abbreviations, miss implicit disease mentions, raise privacy concerns with cloud processing, and lack clinical reasoning abilities. We present Rare Disease Mining Agents (RDMA), a framework that mirrors how medical experts identify rare disease patterns in EHR. RDMA connects scattered clinical observations that together suggest specific rare conditions. By handling clinical abbreviations, recognizing implicit disease patterns, and applying contextual reasoning locally on standard hardware, RDMA reduces privacy risks while improving F1 performance by upwards of 30% and decreasing inference costs 10-fold. This approach helps clinicians avoid the privacy risk of using cloud services while accessing key rare disease information from EHR systems, supporting earlier diagnosis for rare disease patients. Available at this https URL.
[NLP-60] Adversarial Demonstration Learning for Low-resource NER Using Dual Similarity
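[Quick Read]: This paper addresses named entity recognition (NER) based on demonstration learning in low-resource scenarios. It targets two shortcomings of existing methods in demonstration selection and model training: demonstration examples are selected mainly by semantic similarity without adequately considering feature similarity, and the NER tagger's ability to reference demonstration examples is inadequate. The solution has two core parts: a dual similarity selection mechanism that combines semantic and feature similarity to improve demonstration quality, and adversarial demonstration training that forces the model to refer to the demonstrations during tagging, strengthening its use of demonstration information. Experiments show the method outperforms a range of baselines on low-resource NER tasks.
Link: https://arxiv.org/abs/2507.15864
Authors: Guowen Yuan,Tien-Hsuan Wu,Lianghao Xia,Ben Kao
Affiliations: The University of Hong Kong
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: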
Abstract:We study the problem of named entity recognition (NER) based on demonstration learning in low-resource scenarios. We identify two issues in demonstration construction and model training. Firstly, existing methods for selecting demonstration examples primarily rely on semantic similarity; we show that feature similarity can provide significant performance improvement. Secondly, we show that the NER tagger’s ability to reference demonstration examples is generally inadequate. We propose a demonstration and training approach that effectively addresses these issues. For the first issue, we propose to select examples by dual similarity, which comprises both semantic similarity and feature similarity. For the second issue, we propose to train an NER model with adversarial demonstration such that the model is forced to refer to the demonstrations when performing the tagging task. We conduct comprehensive experiments in low-resource NER tasks, and the results demonstrate that our method outperforms a range of methods.
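[NLP-61] eSapiens's DEREK Module: Deep Extraction Reasoning Engine for Knowledge with LLMs
[Quick Read]: This paper addresses low accuracy, poor traceability, and insufficient security in enterprise Document Question Answering (DQA), especially in high-stakes domains such as legal and finance that require strict factual grounding. The key to the solution is the DEREK module, a secure and scalable Retrieval-Augmented Generation (RAG) pipeline. It first improves recall through chunking (1,000-token overlapping chunks) and hybrid indexing (HNSW+BM25). It then improves answer quality with GPT-4o query refinement, Cohere reranking, and CO-STAR prompt engineering. Finally, a LangGraph verifier enforces citation overlap so that every claim is grounded, raising TRACe Utilization above 0.50 and keeping unsupported statements below 3%. The whole system is deployed in containers with end-to-end TLS 1.3 and AES-256 encryption, giving production-ready, auditable, context-faithful document QA.
Link: https://arxiv.org/abs/2507.15863
Authors: Isaac Shi,Zeyuan Li,Fan Liu,Wenli Wang,Lewei He,Yang Yang,Tianyu Shi
Affiliations: eSapiens Team
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages; 1 figure; 5 tables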
Abstract:We present the DEREK (Deep Extraction Reasoning Engine for Knowledge) Module, a secure and scalable Retrieval-Augmented Generation pipeline designed specifically for enterprise document question answering. Designed and implemented by eSapiens, the system ingests heterogeneous content (PDF, Office, web), splits it into 1,000-token overlapping chunks, and indexes them in a hybrid HNSW+BM25 store. User queries are refined by GPT-4o, retrieved via combined vector+BM25 search, reranked with Cohere, and answered by an LLM using CO-STAR prompt engineering. A LangGraph verifier enforces citation overlap, regenerating answers until every claim is grounded. On four LegalBench subsets, 1000-token chunks improve Recall@50 by approximately 1 pp and hybrid+rerank boosts Precision@10 by approximately 7 pp; the verifier raises TRACe Utilization above 0.50 and limits unsupported statements to less than 3%. All components run in containers, enforce end-to-end TLS 1.3 and AES-256. These results demonstrate that the DEREK module delivers accurate, traceable, and production-ready document QA with minimal operational overhead. The module is designed to meet enterprise demands for secure, auditable, and context-faithful retrieval, providing a reliable baseline for high-stakes domains such as legal and finance.
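The chunking step in this abstract (1,000-token overlapping chunks) can be sketched as a sliding window. The abstract does not state the overlap size or the tokenizer, so the 200-token overlap and the use of pre-split tokens below are assumptions made purely for illustration, and the function name is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=200):
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens
    with the preceding window. The final chunk may be shorter."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap             # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                             # this window already reaches the end
    return chunks

tokens = [f"tok{i}" for i in range(2500)]     # stand-in for tokenizer output
chunks = chunk_tokens(tokens)
```

For this 2,500-token input the sketch yields three chunks of 1,000, 1,000, and 900 tokens, and the last 200 tokens of one chunk reappear at the start of the next. The overlap is what lets a passage that straddles a chunk boundary still be retrieved intact from at least one chunk.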
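Computer Vision
[CV-0] ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
[Quick Read]: This paper addresses the problem that existing end-to-end models for vision-language-action (VLA) reasoning lack explicit planning, which makes multi-step planning and adaptation to complex task variations difficult. The key to the solution is ThinkAct, a dual-system framework that connects high-level reasoning with low-level action execution through reinforced visual latent planning: a multimodal LLM is first trained to generate embodied reasoning plans guided by action-aligned visual rewards based on goal completion and trajectory consistency; these plans are then compressed into a visual plan latent that conditions a downstream action model for robust execution in the target environment, enabling few-shot adaptation, long-horizon planning, and self-correction.
Link: https://arxiv.org/abs/2507.16815
Authors: Chi-Pin Huang,Yueh-Hua Wu,Min-Hung Chen,Yu-Chiang Frank Wang,Fu-En Yang
Affiliations: NVIDIA; National Taiwan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project page: this https URL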
Abstract:Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
[CV-1] Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
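[Quick Read]: This paper addresses the lack of visual slow-thinking reasoning in large vision-language models (LVLMs) on complex multimodal tasks. Because LVLMs are mainly trained with vision-language alignment, on-policy reinforcement learning (RL) struggles to extend their reasoning ability, while off-policy RL can go beyond the current policy but, when trajectories are distilled directly from external models, tends to cause visual hallucinations because of mismatched visual perception abilities. The key to the solution is SOPHIA, a simple and scalable semi-off-policy RL framework: it builds a semi-off-policy behavior model combining on-policy visual understanding (from the trainable LVLM) with off-policy slow-thinking reasoning (from a language model), assigns outcome-based rewards to reasoning, and propagates visual rewards backward, so that the LVLM learns slow-thinking reasoning from the propagated reward signals via off-policy RL algorithms.
Link: https://arxiv.org/abs/2507.16814
Authors: Junhao Shen,Haiteng Zhao,Yuzhe Gu,Songyang Gao,Kuikun Liu,Haian Huang,Jianfei Gao,Dahua Lin,Wenwei Zhang,Kai Chen
Affiliations: Shanghai AI Laboratory; Shanghai Jiao Tong University; MMLab, The Chinese University of Hong Kong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: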
Abstract:Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
[CV-2] HOComp: Interaction-Aware Human-Object Composition
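[Quick read]: This paper addresses the difficulty existing image-guided composition methods have in producing seamless, natural compositions for human-object interaction scenes. When the task involves interaction between a person and an object, these methods often fail because they cannot model the interaction accurately or keep appearances consistent. The key to the solution is the HOComp framework, built on two core designs. The first is MLLMs-driven Region-based Pose Guidance (MRPG), which uses multimodal large language models (MLLMs) to identify the interaction region and type (e.g., "holding" or "placing") and combines this with human pose landmarks to provide coarse-to-fine pose constraints, keeping the action plausible and the details consistent. The second is Detail-Consistent Appearance Preservation (DCAP), which combines shape-aware attention modulation, a multi-view appearance loss, and a background consistency loss to keep the shapes and textures of the foreground consistent and to reproduce the background person faithfully, improving overall composition quality.
Link: https://arxiv.org/abs/2507.16813
Authors: Dong Liang, Jinyuan Jia, Yuhao Liu, Rynson W.H. Lau
Affiliations: Tongji University; CityUHK (City University of Hong Kong); HKUST(GZ) (The Hong Kong University of Science and Technology (Guangzhou))
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: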
Abstract:While existing image-guided composition methods may help insert a foreground object onto a user-specified region of a background image, achieving natural blending inside the region with the rest of the image unchanged, we observe that these existing methods often struggle in synthesizing seamless interaction-aware compositions when the task involves human-object interactions. In this paper, we first propose HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person and their consistent appearances. Our approach includes two key designs: (1) MLLMs-driven Region-based Pose Guidance (MRPG), which utilizes MLLMs to identify the interaction region as well as the interaction type (e.g., holding and lifting) to provide coarse-to-fine constraints to the generated pose for the interaction while incorporating human pose landmarks to track action variations and enforcing fine-grained pose constraints; and (2) Detail-Consistent Appearance Preservation (DCAP), which unifies a shape-aware attention modulation mechanism, a multi-view appearance loss, and a background consistency loss to ensure consistent shapes/textures of the foreground and faithful reproduction of the background human. We then propose the first dataset, named Interaction-aware Human-Object Composition (IHOC), for the task. Experimental results on our dataset show that HOComp effectively generates harmonious human-object interactions with consistent appearances, and outperforms relevant methods qualitatively and quantitatively.
[CV-3] Enhancing Domain Diversity in Synthetic Data Face Recognition with Dataset Fusion ICCV
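[Quick read]: This paper addresses the ethical and privacy concerns raised by training face recognition models on web-crawled real datasets, as well as the weaker performance of existing models trained on synthetic data. The key to the solution is to fuse two state-of-the-art synthetic face datasets generated with architecturally distinct backbones. This reduces the model-specific artifacts introduced by a single generator, increases diversity in pose, lighting, and demographics, and implicitly regularizes the model by emphasizing identity-relevant features, which improves performance on standard face recognition benchmarks.
Link: https://arxiv.org/abs/2507.16790
Authors: Anjith George, Sebastien Marcel
Affiliations: Idiap Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in ICCV Workshops 2025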
Abstract:While the accuracy of face recognition systems has improved significantly in recent years, the datasets used to train these models are often collected through web crawling without the explicit consent of users, raising ethical and privacy concerns. To address this, many recent approaches have explored the use of synthetic data for training face recognition models. However, these models typically underperform compared to those trained on real-world data. A common limitation is that a single generator model is often used to create the entire synthetic dataset, leading to model-specific artifacts that may cause overfitting to the generator’s inherent biases and artifacts. In this work, we propose a solution by combining two state-of-the-art synthetic face datasets generated using architecturally distinct backbones. This fusion reduces model-specific artifacts, enhances diversity in pose, lighting, and demographics, and implicitly regularizes the face recognition model by emphasizing identity-relevant features. We evaluate the performance of models trained on this combined dataset using standard face recognition benchmarks and demonstrate that our approach achieves superior performance across many of these benchmarks.
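The fusion step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how identity labels are handled, so the sketch assumes each dataset is a list of (image path, integer identity label) pairs and that identities from the two generators are disjoint. The function name `fuse_datasets` and the file paths are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[str, int]  # (image path, identity label); hypothetical layout


def fuse_datasets(a: List[Sample], b: List[Sample]) -> List[Sample]:
    """Concatenate two identity-labelled datasets without label collisions.

    Assumption: labels are non-negative integers and the two datasets
    contain different people, so labels from `b` are shifted past `a`.
    """
    offset = max((label for _, label in a), default=-1) + 1
    return list(a) + [(path, label + offset) for path, label in b]


# Toy example: two identities from generator A, two from generator B.
set_a = [("a/0.png", 0), ("a/1.png", 1)]
set_b = [("b/0.png", 0), ("b/1.png", 0), ("b/2.png", 1)]
fused = fuse_datasets(set_a, set_b)
```

Shifting the labels keeps identity 0 of one generator from being merged with identity 0 of the other, which would otherwise corrupt the classification targets.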
[CV-4] Task-Specific Zero-shot Quantization-Aware Training for Object Detection ICCV2025
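[Quick read]: This paper addresses the poor performance of zero-shot quantization (ZSQ) on object detection. The core challenge is that existing methods use unlabeled, task-agnostic synthetic images that lack the information object detection needs, such as object locations, sizes, and category distributions. The key to the solution is a task-specific ZSQ framework with two stages. The first stage uses a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. The second stage integrates task-specific training into the knowledge distillation process to restore the performance of the quantized detection network. Experiments on MS-COCO and Pascal VOC demonstrate the method's efficiency and state-of-the-art performance.
Link: https://arxiv.org/abs/2507.16782
Authors: Changhao Li, Xinrui Chen, Ji Wang, Kang Zhao, Jianfei Chen
Affiliations: Georgia Institute of Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025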
Abstract:Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method. Our code is publicly available at: this https URL .
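To make "representing the network parameters with a lower precision" concrete, here is a minimal sketch of symmetric uniform quantization followed by dequantization (often called fake quantization). It is a generic textbook scheme chosen for illustration; the abstract does not specify which quantizer the paper uses, and the name `fake_quantize` is an assumption.

```python
from typing import List


def fake_quantize(weights: List[float], num_bits: int = 8) -> List[float]:
    """Round weights onto a symmetric integer grid, then map back to floats.

    Assumption: one scale per tensor, derived from the largest magnitude.
    """
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 7 for 4 bits
    max_abs = max((abs(w) for w in weights), default=0.0)
    if max_abs == 0.0:
        return list(weights)                   # nothing to scale
    scale = max_abs / qmax                     # float value of one integer step
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                        # dequantized value
    return out


weights = [-1.0, -0.26, 0.0, 0.4, 1.0]
w4 = fake_quantize(weights, num_bits=4)
```

With 4 bits the step size is 1/7, so every weight moves by at most half a step. Quantization-aware training, as in the paper's second stage, trains the network while this rounding is applied in the forward pass.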
[CV-5] Faithful Interpretable Chest X-ray Diagnosis with Anti-Aliased B-cos Networks
[Quick read]: This paper addresses two problems that B-cos networks face in clinical medical imaging. First, their explanation maps suffer from severe aliasing artifacts, which hurts readability and trustworthiness. Second, the original B-cos model supports only multi-class classification and cannot handle the multi-label setting common in chest X-ray analysis. The key to the solution is twofold: (1) anti-aliasing strategies based on two downsampling methods, FLCPooling (FLC) and BlurPool (BP), which markedly improve explanation quality and remove the artifacts; and (2) an extension of B-cos networks to multi-label classification, making them suitable for clinical tasks that detect several abnormalities at once. Experiments show that the improved B-cos_FLC and B-cos_BP retain strong diagnostic performance while producing faithful, artifact-free explanations that meet clinical requirements.
Link: https://arxiv.org/abs/2507.16761
Authors: Marcel Kleinmann, Shashank Agnihotri, Margret Keuper
Affiliations: Data and Web Science Group, University of Mannheim, Germany; Max-Planck-Institute for Informatics, Saarland Informatics Campus, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Faithfulness and interpretability are essential for deploying deep neural networks (DNNs) in safety-critical domains such as medical imaging. B-cos networks offer a promising solution by replacing standard linear layers with a weight-input alignment mechanism, producing inherently interpretable, class-specific explanations without post-hoc methods. While maintaining diagnostic performance competitive with state-of-the-art DNNs, standard B-cos models suffer from severe aliasing artifacts in their explanation maps, making them unsuitable for clinical use where clarity is essential. Additionally, the original B-cos formulation is limited to multi-class settings, whereas chest X-ray analysis often requires multi-label classification due to co-occurring abnormalities. In this work, we address both limitations: (1) we introduce anti-aliasing strategies using FLCPooling (FLC) and BlurPool (BP) to significantly improve explanation quality, and (2) we extend B-cos networks to support multi-label classification. Our experiments on chest X-ray datasets demonstrate that the modified B-cos_FLC and B-cos_BP preserve strong predictive performance while providing faithful and artifact-free explanations suitable for clinical application in multi-label settings. Code available at: this https URL (GitHub repository).
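The idea behind blur-before-downsample anti-aliasing can be shown on a one-dimensional signal. The sketch below is a simplified 1D analogue written for illustration, assuming a fixed [1, 2, 1]/4 kernel and replicate padding; it is not the paper's implementation, and FLCPooling (which works in the frequency domain) is not covered. The name `blur_pool_1d` is hypothetical.

```python
from typing import List, Sequence


def blur_pool_1d(x: Sequence[float], stride: int = 2) -> List[float]:
    """Low-pass filter a non-empty signal, then subsample it.

    Assumptions: binomial kernel [1, 2, 1] / 4 and replicate padding at
    both ends, so the blurred signal has the same length as the input.
    """
    kernel = (0.25, 0.5, 0.25)
    padded = [x[0]] + list(x) + [x[-1]]
    blurred = [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]
    return blurred[::stride]


# A signal at the highest representable frequency: 0, 1, 0, 1, ...
signal = [0.0, 1.0] * 4
naive = signal[::2]             # plain strided subsampling
smooth = blur_pool_1d(signal)   # blur first, then subsample
```

Plain subsampling keeps only the zeros and reports a flat signal that was never there, which is exactly the aliasing effect. Blurring first yields values near the true local mean of 0.5.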
[CV-6] CMP: A Composable Meta Prompt for SAM-Based Cross-Domain Few-Shot Segmentation
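[Quick read]: This paper addresses two core challenges in Cross-Domain Few-Shot Segmentation (CD-FSS): the heavy reliance of existing approaches on manual prompts, and the limited generalization caused by large differences between domains. The authors propose the Composable Meta-Prompt (CMP) framework, whose key is three modules: (i) the Reference Complement and Transformation (RCT) module for semantic expansion, (ii) the Composable Meta-Prompt Generation (CMPG) module, which synthesizes meta-prompts automatically to reduce manual intervention, and (iii) the Frequency-Aware Interaction (FAI) module, which mitigates the performance drop caused by domain discrepancy. Evaluated on four cross-domain datasets, the method reaches 71.8% and 74.5% mIoU in the 1-shot and 5-shot settings respectively, outperforming existing methods.
Link: https://arxiv.org/abs/2507.16753
Authors: Shuai Chen, Fanman Meng, Chunjin Yang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 figures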
Abstract:Cross-Domain Few-Shot Segmentation (CD-FSS) remains challenging due to limited data and domain shifts. Recent foundation models like the Segment Anything Model (SAM) have shown remarkable zero-shot generalization capability in general segmentation tasks, making it a promising solution for few-shot scenarios. However, adapting SAM to CD-FSS faces two critical challenges: reliance on manual prompt and limited cross-domain ability. Therefore, we propose the Composable Meta-Prompt (CMP) framework that introduces three key modules: (i) the Reference Complement and Transformation (RCT) module for semantic expansion, (ii) the Composable Meta-Prompt Generation (CMPG) module for automated meta-prompt synthesis, and (iii) the Frequency-Aware Interaction (FAI) module for domain discrepancy mitigation. Evaluations across four cross-domain datasets demonstrate CMP’s state-of-the-art performance, achieving 71.8% and 74.5% mIoU in 1-shot and 5-shot scenarios respectively.
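For readers unfamiliar with the reported metric, the following sketch shows how intersection-over-union (IoU) and its mean can be computed on flattened binary masks. It is a generic illustration: the abstract does not state the exact averaging protocol (for example per class versus per episode), so the plain average in `mean_iou` is an assumption.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = foreground


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Average IoU over a non-empty list of (prediction, ground truth) pairs."""
    scores = [iou(p, t) for p, t in pairs]
    return sum(scores) / len(scores)


pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # IoU = 1 / 2
    ([0, 1, 1, 0], [0, 1, 1, 0]),  # IoU = 2 / 2
]
score = mean_iou(pairs)
```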
[CV-7] Denoising-While-Completing Network (DWCNet): Robust Point Cloud Completion Under Corruption
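[Quick read]: This paper addresses the completion and denoising of highly degraded partial point clouds from real scenes. Such point clouds are corrupted by several degradations at once, including noise and occlusion, and existing completion networks trained on synthetic data perform poorly on real-world degradations. The key to the solution is DWCNet (Denoising-While-Completing Network), whose core innovation is a Noise Management Module (NMM) that combines contrastive learning and self-attention to denoise while completing. It suppresses noise and models the structural relationships within the point cloud, which markedly improves robustness and performance on both clean and corrupted point clouds.
Link: https://arxiv.org/abs/2507.16743
Authors: Keneni W. Tesema, Lyndon Hill, Mark W. Jones, Gary K.L. Tam
Affiliations: Swansea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for Computers and Graphics and EG Symposium on 3D Object Retrieval 2025 (3DOR’25)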
Abstract:Point cloud completion is crucial for 3D computer vision tasks in autonomous driving, augmented reality, and robotics. However, obtaining clean and complete point clouds from real-world environments is challenging due to noise and occlusions. Consequently, most existing completion networks – trained on synthetic data – struggle with real-world degradations. In this work, we tackle the problem of completing and denoising highly corrupted partial point clouds affected by multiple simultaneous degradations. To benchmark robustness, we introduce the Corrupted Point Cloud Completion Dataset (CPCCD), which highlights the limitations of current methods under diverse corruptions. Building on these insights, we propose DWCNet (Denoising-While-Completing Network), a completion framework enhanced with a Noise Management Module (NMM) that leverages contrastive learning and self-attention to suppress noise and model structural relationships. DWCNet achieves state-of-the-art performance on both clean and corrupted, synthetic and real-world datasets. The dataset and code will be publicly available at this https URL
[CV-8] DFR: A Decompose-Fuse-Reconstruct Framework for Multi-Modal Few-Shot Segmentation
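[Quick read]: This paper addresses how to make effective use of multi-modal guidance in Few-Shot Segmentation (FSS). Existing methods rely mainly on visual support samples or textual descriptions, and their single-modal or dual-modal paradigms cannot fully exploit the rich perceptual information available in real scenes. The key to the solution is the DFR (Decompose, Fuse and Reconstruct) framework, with three core innovations: 1) Multi-modal Decompose, which extracts visual region proposals hierarchically via the Segment Anything Model (SAM), expands textual semantics into fine-grained descriptors, and processes audio features to enrich context; 2) Multi-modal Contrastive Fuse, which uses contrastive learning to keep the visual, textual, and audio modalities consistent and enables dynamic semantic interaction between foreground and background features; 3) Dual-path Reconstruct, which adaptively combines semantic guidance from the tri-modal fused tokens with geometric cues from multi-modal location priors, improving segmentation accuracy.
Link: https://arxiv.org/abs/2507.16736
Authors: Shuai Chen, Fanman Meng, Xiwei Zhang, Haoran Wei, Chenhao Wu, Qingbo Wu, Hongliang Li
Affiliations: University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 3 figures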
Abstract:This paper presents DFR (Decompose, Fuse and Reconstruct), a novel framework that addresses the fundamental challenge of effectively utilizing multi-modal guidance in few-shot segmentation (FSS). While existing approaches primarily rely on visual support samples or textual descriptions, their single or dual-modal paradigms limit exploitation of rich perceptual information available in real-world scenarios. To overcome this limitation, the proposed approach leverages the Segment Anything Model (SAM) to systematically integrate visual, textual, and audio modalities for enhanced semantic understanding. The DFR framework introduces three key innovations: 1) Multi-modal Decompose: a hierarchical decomposition scheme that extracts visual region proposals via SAM, expands textual semantics into fine-grained descriptors, and processes audio features for contextual enrichment; 2) Multi-modal Contrastive Fuse: a fusion strategy employing contrastive learning to maintain consistency across visual, textual, and audio modalities while enabling dynamic semantic interactions between foreground and background features; 3) Dual-path Reconstruct: an adaptive integration mechanism combining semantic guidance from tri-modal fused tokens with geometric cues from multi-modal location priors. Extensive experiments across visual, textual, and audio modalities under both synthetic and real settings demonstrate DFR’s substantial performance improvements over state-of-the-art methods.
[CV-9] HarmonPaint: Harmonized Training-Free Diffusion Inpainting
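[Quick read]: This paper addresses two shortcomings of existing image inpainting methods: they need extensive retraining or fine-tuning to integrate new content, and they struggle to keep structure and style consistent with the background. The key to the solution is HarmonPaint, a training-free framework. It designs masking strategies within the self-attention mechanism of diffusion models to ensure the structural fidelity of the inpainted region, and it exploits intrinsic properties of diffusion models to transfer style information from the unmasked region to the masked region, achieving stylistic harmony.
Link: https://arxiv.org/abs/2507.16732
Authors: Ying Li, Xinzhe Li, Yong Du, Yangyang Xu, Junyu Dong, Shengfeng He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: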
Abstract:Existing inpainting methods often require extensive retraining or fine-tuning to integrate new content seamlessly, yet they struggle to maintain coherence in both structure and style between inpainted regions and the surrounding background. Motivated by these limitations, we introduce HarmonPaint, a training-free inpainting framework that seamlessly integrates with the attention mechanisms of diffusion models to achieve high-quality, harmonized image inpainting without any form of training. By leveraging masking strategies within self-attention, HarmonPaint ensures structural fidelity without model retraining or fine-tuning. Additionally, we exploit intrinsic diffusion model properties to transfer style information from unmasked to masked regions, achieving a harmonious integration of styles. Extensive experiments demonstrate the effectiveness of HarmonPaint across diverse scenes and styles, validating its versatility and performance.
[CV-10] Temporally-Constrained Video Reasoning Segmentation and Automated Benchmark Construction
[Quick read]: This paper addresses the limitations of conventional video segmentation methods when handling out-of-vocabulary objects and objects that are referred to only implicitly in complex text queries. In dynamically changing scenes such as operating room video analysis, existing methods cannot cope with target objects that appear, disappear, or change relevance as the temporal context changes. The key to the solution is a new task formulation, temporally-constrained video reasoning segmentation, which requires models to implicitly infer when target objects become contextually relevant from text queries that incorporate temporal reasoning. To avoid the high cost of manual annotation, the authors also design an automated benchmark construction method and release TCVideoRSBenchmark, a dataset of 52 samples built from videos in the MVOR dataset, to support research and evaluation on this task.
Link: https://arxiv.org/abs/2507.16718
Authors: Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Conventional approaches to video segmentation are confined to predefined object categories and cannot identify out-of-vocabulary objects, let alone objects that are not identified explicitly but only referred to implicitly in complex text queries. This shortcoming limits the utility of video segmentation in complex and variable scenarios, where a closed set of object categories is difficult to define and where users may not know the exact object category that will appear in the video. Such scenarios can arise in operating room video analysis, where different health systems may use different workflows and instrumentation, requiring flexible solutions for video analysis. Reasoning segmentation (RS) now offers promise towards such a solution, enabling natural language text queries as interaction for identifying objects to segment. However, existing video RS formulations assume that target objects remain contextually relevant throughout entire video sequences. This assumption is inadequate for real-world scenarios in which objects of interest appear, disappear or change relevance dynamically based on temporal context, such as surgical instruments that become relevant only during specific procedural phases or anatomical structures that gain importance at particular moments during surgery. Our first contribution is the introduction of temporally-constrained video reasoning segmentation, a novel task formulation that requires models to implicitly infer when target objects become contextually relevant based on text queries that incorporate temporal reasoning. Since manual annotation of temporally-constrained video RS datasets would be expensive and limit scalability, our second contribution is an innovative automated benchmark construction method. Finally, we present TCVideoRSBenchmark, a temporally-constrained video RS dataset containing 52 samples using the videos from the MVOR dataset.
[CV-11] Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM -Based High-Quality Image-Text Dataset Generation
[Quick read]: This paper addresses the low quality and insufficient scale of image-text paired datasets for remote sensing (RS), which limits the performance of vision-language foundation models (VLFMs) in this domain. The core solution is MpGI (Multi-Perspective Generation and Integration), a two-stage method for generating high-quality text captions. In the first stage, Rule-MLLM relay generation and multimodal large language model (MLLM) generation produce diverse, detailed descriptions from different perspectives. In the second stage, large language models (LLMs) integrate these descriptions into complete, comprehensive captions. The method markedly improves data quality: HQRS-CLIP surpasses the previous best RS CLIP model while using only 4.2% of the training data, and RS-CoCa reaches advanced performance on several benchmarks and can generate captions that rival manual annotations.
Link: https://arxiv.org/abs/2507.16716
Authors: Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, Ke Yang
Affiliations: Intelligent Game and Decision Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: SUBMIT TO IEEE TRANSACTIONS
Abstract:The application of Vision-language foundation models (VLFMs) to remote sensing (RS) imagery has garnered significant attention due to their superior capability in various downstream tasks. A key challenge lies in the scarcity of high-quality, large-scale, image-text paired training data. Recently, several works introduced extensive image-text datasets for RS and trained their VLFMs. However, due to the rudimentary methods used for generating captions, the quality of these datasets is suboptimal, requiring larger volumes of training data, while only yielding modest performance improvements. In this paper, we propose a two-stage method named MpGI (Multi-Perspective Generation and Integration) for generating high-quality text captions for RS images. Firstly, we generate distinct and detailed descriptions from different perspectives using Rule-MLLM (Multimodal Large Language Model) Relay Generation and MLLMs generation methods. Next, we utilize Large Language Models (LLMs) to integrate these diverse descriptions into comprehensive captions, capturing details from multiple perspectives. Finally, we have created the HQRS-IT-210K dataset, including about 210,000 RS images and 1.3 million captions. We fine-tuned two VLFMs using our dataset: CLIP, a discriminative model, and CoCa, an image-to-text generative model. This process resulted in our proposed HQRS-CLIP and RS-CoCa models. Experimental results demonstrate that HQRS-CLIP surpassed the previous SOTA RS CLIP model in various downstream tasks while using only 4.2% of the training data. RS-CoCa outperforms other advanced approaches across benchmark datasets and can generate captions for RS images that rival or even exceed manual annotations. Dataset, pre-trained models, and codes will be released at this https URL.
[CV-12] Screen2AX: Vision-Based Approach for Automatic macOS Accessibility Generation
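[Quick read]: This paper addresses missing or incomplete accessibility metadata in desktop applications, which makes software hard to use for people who depend on assistive tools such as screen readers. Only 33% of macOS applications currently offer full accessibility support, and existing work focuses on UI element detection or captioning without reproducing the full hierarchical structure of desktop interfaces. The key to the solution is Screen2AX, the first framework to automatically build real-time, tree-structured accessibility metadata from a single screenshot. It uses vision-language and object detection models to detect, describe, and hierarchically organize UI elements, mirroring the system-level accessibility structure of macOS. Using three public datasets covering 112 macOS applications and the new Screen2AX-Task benchmark, the authors show that the method reconstructs complete accessibility trees (77% F1) and improves autonomous agent task execution in complex desktop environments (a 2.2x improvement over native accessibility representations).
Link: https://arxiv.org/abs/2507.16704
Authors: Viktor Muryn, Marta Sumyk, Mariya Hirna, Sofiya Garkot, Maksym Shamrai
Affiliations: Ukrainian Catholic University; MacPaw
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: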
Abstract:Desktop accessibility metadata enables AI agents to interpret screens and supports users who depend on tools like screen readers. Yet, many applications remain largely inaccessible due to incomplete or missing metadata provided by developers - our investigation shows that only 33% of applications on macOS offer full accessibility support. While recent work on structured screen representation has primarily addressed specific challenges, such as UI element detection or captioning, none has attempted to capture the full complexity of desktop interfaces by replicating their entire hierarchical structure. To bridge this gap, we introduce Screen2AX, the first framework to automatically create real-time, tree-structured accessibility metadata from a single screenshot. Our method uses vision-language and object detection models to detect, describe, and organize UI elements hierarchically, mirroring macOS’s system-level accessibility structure. To tackle the limited availability of data for macOS desktop applications, we compiled and publicly released three datasets encompassing 112 macOS applications, each annotated for UI element detection, grouping, and hierarchical accessibility metadata alongside corresponding screenshots. Screen2AX accurately infers hierarchy trees, achieving a 77% F1 score in reconstructing a complete accessibility tree. Crucially, these hierarchy trees improve the ability of autonomous agents to interpret and interact with complex desktop interfaces. We introduce Screen2AX-Task, a benchmark specifically designed for evaluating autonomous agent task execution in macOS desktop environments. Using this benchmark, we demonstrate that Screen2AX delivers a 2.2x performance improvement over native accessibility representations and surpasses the state-of-the-art OmniParser V2 system on the ScreenSpot benchmark.
[CV-13] QRetinex-Net: Quaternion-Valued Retinex Decomposition for Low-Level Computer Vision Applications
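[Quick read]: This paper addresses degradations in low-light images, such as color shift, low contrast, and heavy noise, which significantly reduce the accuracy of computer vision tasks. Classic Retinex models have four flaws: they process the RGB channels independently, they lack a neuroscientific model of color vision, they cannot perfectly reconstruct the input image, and they do not explain human color constancy. The key to the solution is the first Quaternion Retinex formulation, which writes the scene as the Hamilton product of quaternion-valued reflectance and illumination, handling the color channels jointly and preserving color consistency. The paper also introduces the Reflectance Consistency Index to quantify how stable the reflectance is. Experiments on crack inspection, face detection, and infrared-visible fusion show gains of 2–11% over leading methods, with better color fidelity, lower noise, and higher reflectance stability.
Link: https://arxiv.org/abs/2507.16683
Authors: Sos Agaian, Vladimir Frants
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: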
Abstract:Images taken in low light often show color shift, low contrast, noise, and other artifacts that hurt computer-vision accuracy. Retinex theory addresses this by viewing an image S as the pixel-wise product of reflectance R and illumination I, mirroring the way people perceive stable object colors under changing light. The decomposition is ill-posed, and classic Retinex models have four key flaws: (i) they treat the red, green, and blue channels independently; (ii) they lack a neuroscientific model of color vision; (iii) they cannot perfectly rebuild the input image; and (iv) they do not explain human color constancy. We introduce the first Quaternion Retinex formulation, in which the scene is written as the Hamilton product of quaternion-valued reflectance and illumination. To gauge how well reflectance stays invariant, we propose the Reflectance Consistency Index. Tests on low-light crack inspection, face detection under varied lighting, and infrared-visible fusion show gains of 2-11 percent over leading methods, with better color fidelity, lower noise, and higher reflectance stability.
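The Hamilton product mentioned in the abstract is the standard quaternion multiplication, sketched below. The product rule itself is textbook mathematics; how the paper maps pixels to quaternions is not stated in the abstract, so encoding an RGB pixel as the pure quaternion (0, r, g, b) and illumination as a scalar quaternion is an assumption made only for this example.

```python
from typing import Tuple

Quaternion = Tuple[float, float, float, float]  # (w, x, y, z) = w + xi + yj + zk


def hamilton(p: Quaternion, q: Quaternion) -> Quaternion:
    """Hamilton product p * q (note: not commutative in general)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )


# Basis check: i * j = k, but j * i = -k.
i, j = (0, 1, 0, 0), (0, 0, 1, 0)
ij = hamilton(i, j)
ji = hamilton(j, i)

# Hypothetical pixel: reflectance as a pure quaternion, illumination as a scalar.
reflectance = (0.0, 0.2, 0.4, 0.6)
illumination = (0.5, 0.0, 0.0, 0.0)
pixel = hamilton(reflectance, illumination)
```

With a scalar illumination the product reduces to the classic per-pixel scaling S = R · I. A non-scalar illumination quaternion mixes the color channels, which is what lets the formulation treat the channels jointly rather than independently.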
[CV-14] Synthetic Data Matters: Re-training with Geo-typical Synthetic Labels for Building Detection
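[Quick read]: This paper addresses the poor generalization of deep learning models for building segmentation in remote sensing across geographic regions, where city layouts and the types, sizes, and distribution of buildings vary. The core solution is to generate geo-typical synthetic data at test time from geographic features of the target region (such as street networks from OpenStreetMap), build high-resolution images with procedural modeling and physics-based rendering, and apply domain randomization to increase diversity. The synthetic data is then used within an adversarial domain adaptation framework to narrow the synthetic-to-real domain gap. The method does not depend on large amounts of manual annotation, is scalable and cost-effective, improves segmentation performance in new regions, and mitigates the "model collapse" that purely synthetic datasets can cause.
Link: https://arxiv.org/abs/2507.16657
Authors: Shuang Song, Yang Tang, Rongjun Qin
Affiliations: The Ohio State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures, This work has been submitted to the IEEE for possible publication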
Abstract:Deep learning has significantly advanced building segmentation in remote sensing, yet models struggle to generalize on data of diverse geographic regions due to variations in city layouts and the distribution of building types, sizes and locations. However, the amount of time-consuming annotated data for capturing worldwide diversity may never catch up with the demands of increasingly data-hungry models. Thus, we propose a novel approach: re-training models at test time using synthetic data tailored to the target region’s city layout. This method generates geo-typical synthetic data that closely replicates the urban structure of a target area by leveraging geospatial data such as street network from OpenStreetMap. Using procedural modeling and physics-based rendering, very high-resolution synthetic images are created, incorporating domain randomization in building shapes, materials, and environmental illumination. This enables the generation of virtually unlimited training samples that maintain the essential characteristics of the target environment. To overcome synthetic-to-real domain gaps, our approach integrates geo-typical data into an adversarial domain adaptation framework for building segmentation. Experiments demonstrate significant performance enhancements, with median improvements of up to 12%, depending on the domain gap. This scalable and cost-effective method blends partial geographic knowledge with synthetic imagery, providing a promising solution to the “model collapse” issue in purely synthetic datasets. It offers a practical pathway to improving generalization in remote sensing building segmentation without extensive real-world annotations.
[CV-15] Benchmarking pig detection and tracking under diverse and challenging conditions
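[Quick read]: This paper addresses the automation of individual behavior monitoring in pig farming. The core challenge is accurate localization of pigs in space (object detection) and time (multi-object tracking) under real barn conditions. The key to the solution is two high-quality benchmark datasets covering difficult scenarios such as occlusion and poor visibility: PigDetect for object detection and PigTrack for multi-object tracking. A systematic comparison of methods finds that: 1) challenging training images markedly improve detection performance; 2) SORT-based tracking methods achieve better detection performance than end-to-end trainable models, while the latter show better association performance; 3) the trained models perform well in unseen pens, which underlines the importance of high-quality training data for generalization. The study provides a reliable technical basis for intelligent pig farm management and releases its data and code to support further work.
Link: https://arxiv.org/abs/2507.16639
Authors: Jonathan Henrich, Christian Post, Maximilian Zilke, Parth Shiroya, Emma Chanut, Amir Mollazadeh Yamchi, Ramin Yahyapour, Thomas Kneib, Imke Traulsen
Affiliations: University of Göttingen; Kiel University; Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: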
Abstract:To ensure animal welfare and effective management in pig farming, monitoring individual behavior is a crucial prerequisite. While monitoring tasks have traditionally been carried out manually, advances in machine learning have made it possible to collect individualized information in an increasingly automated way. Central to these methods is the localization of animals across space (object detection) and time (multi-object tracking). Despite extensive research of these two tasks in pig farming, a systematic benchmarking study has not yet been conducted. In this work, we address this gap by curating two datasets: PigDetect for object detection and PigTrack for multi-object tracking. The datasets are based on diverse image and video material from realistic barn conditions, and include challenging scenarios such as occlusions or bad visibility. For object detection, we show that challenging training images improve detection performance beyond what is achievable with randomly sampled images alone. Comparing different approaches, we found that state-of-the-art models offer substantial improvements in detection quality over real-time alternatives. For multi-object tracking, we observed that SORT-based methods achieve superior detection performance compared to end-to-end trainable models. However, end-to-end models show better association performance, suggesting they could become strong alternatives in the future. We also investigate characteristic failure cases of end-to-end models, providing guidance for future improvements. The detection and tracking models trained on our datasets perform well in unseen pens, suggesting good generalization capabilities. This highlights the importance of high-quality training data. The datasets and research code are made publicly available to facilitate reproducibility, re-use and further development.
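Tracking-by-detection methods in the SORT family link detections across frames by bounding-box overlap. The sketch below illustrates only that association step and deliberately simplifies it: SORT itself predicts track positions with a Kalman filter and solves the assignment with the Hungarian algorithm, whereas this example uses the previous boxes directly and a greedy best-first matching. The function names and the 0.3 threshold are assumptions for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks: List[Box], detections: List[Box],
                 iou_threshold: float = 0.3) -> List[Tuple[int, int]]:
    """Pair track indices with detection indices, highest IoU first."""
    candidates = sorted(
        ((box_iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in candidates:
        if score < iou_threshold:
            break                      # remaining pairs overlap too little
        if ti in used_t or di in used_d:
            continue                   # each track and detection is used once
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    return matches


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]  # same pigs, listed in swapped order
matches = greedy_match(tracks, detections)
```

Unmatched detections would start new tracks and unmatched tracks would eventually be dropped. Occlusions, which the datasets emphasize, are the cases where this overlap-only association tends to fail.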
[CV-16] A2Mamba: Attention-augmented State Space Models for Visual Recognition
[Quick read]: This paper addresses the lack of a deep interaction mechanism in existing Transformer-Mamba hybrid architectures, which integrate the two only by stacking Transformer and Mamba layers and therefore do not fully exploit their complementary strengths in feature extraction. The key to the solution is the A2Mamba architecture, which introduces a new token mixer, the Multi-scale Attention-augmented State Space Model (MASS). MASS integrates multi-scale attention maps into an attention-augmented state space model (A2SSM). A key step of A2SSM is a variant of cross-attention that spatially aggregates the hidden states of the state space model using the multi-scale attention maps, which strengthens dependencies in two-dimensional space and improves the dynamic modeling capability of the state space model. This design improves both performance and efficiency on visual recognition tasks.
Link: https://arxiv.org/abs/2507.16624
Authors: Meng Lou, Yunxiang Fu, Yizhou Yu
Affiliations: The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 5 figures, 13 tables
Abstract:Transformers and Mamba, initially invented for natural language processing, have inspired backbone architectures for visual recognition. Recent studies integrated Local Attention Transformers with Mamba to capture both local details and global contexts. Despite competitive performance, these methods are limited to simple stacking of Transformer and Mamba layers without any interaction mechanism between them. Thus, deep integration between Transformer and Mamba layers remains an open problem. We address this problem by proposing A2Mamba, a powerful Transformer-Mamba hybrid network architecture, featuring a new token mixer termed Multi-scale Attention-augmented State Space Model (MASS), where multi-scale attention maps are integrated into an attention-augmented SSM (A2SSM). A key step of A2SSM performs a variant of cross-attention by spatially aggregating the SSM’s hidden states using the multi-scale attention maps, which enhances spatial dependencies pertaining to a two-dimensional space while improving the dynamic modeling capabilities of SSMs. Our A2Mamba outperforms all previous ConvNet-, Transformer-, and Mamba-based architectures in visual recognition tasks. For instance, A2Mamba-L achieves an impressive 86.1% top-1 accuracy on ImageNet-1K. In semantic segmentation, A2Mamba-B exceeds CAFormer-S36 by 2.5% in mIoU, while exhibiting higher efficiency. In object detection and instance segmentation with Cascade Mask R-CNN, A2Mamba-S surpasses MambaVision-B by 1.2%/0.9% in AP^b/AP^m, while having 40% fewer parameters. Code is publicly available at this https URL.
[CV-17] Automatic Fine-grained Segmentation-assisted Report Generation
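[Quick read]: This paper addresses the insufficient reliability and groundedness of medical imaging report generation models, with the aim of making generated reports more trustworthy for clinicians and patients. The core solution is ASaRG (Automatic Segmentation-assisted Report Generation), which fuses intermediate features and fine-grained segmentation maps into the multi-modal projection layer of the LLaVA architecture via simple concatenation. With only a small number of added parameters, the CE F1 score improves by 0.89% (p=0.012) over the LLaVA baseline when using intermediate features alone, and by 2.77% (p<0.001) when intermediate features are combined with segmentation maps; compared with COMG and ORID, the gains are 6.98% and 6.28% in F1, respectively. The design also accepts an arbitrary number of segmentation maps as input, so elements of a report can be traced back to the corresponding segmentation maps, which helps verify the groundedness of its assessments.
Link: https://arxiv.org/abs/2507.16623
Authors: Frederic Jonske, Constantin Seibold, Osman Alperen Koras, Fin Bahnsen, Marie Bauer, Amin Dada, Hamza Kalisch, Anton Schily, Jens Kleesiek
Affiliations: Institute for AI in Medicine, University Medicine Essen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: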
Abstract:Reliable end-to-end clinical report generation has been a longstanding goal of medical ML research. The end goal for this process is to alleviate radiologists’ workloads and provide second opinions to clinicians or patients. Thus, a necessary prerequisite for report generation models is a strong general performance and some type of innate grounding capability, to convince clinicians or patients of the veracity of the generated reports. In this paper, we present ASaRG (Automatic Segmentation-assisted Report Generation), an extension of the popular LLaVA architecture that aims to tackle both of these problems. ASaRG proposes to fuse intermediate features and fine-grained segmentation maps created by specialist radiological models into LLaVA’s multi-modal projection layer via simple concatenation. With a small number of added parameters, our approach achieves a +0.89% performance gain (p=0.012) in CE F1 score compared to the LLaVA baseline when using only intermediate features, and +2.77% performance gain (p<0.001) when adding a combination of intermediate features and fine-grained segmentation maps. Compared with COMG and ORID, two other report generation methods that utilize segmentations, the performance gain amounts to 6.98% and 6.28% in F1 score, respectively. ASaRG is not mutually exclusive with other changes made to the LLaVA architecture, potentially allowing our method to be combined with other advances in the field. Finally, the use of an arbitrary number of segmentations as part of the input demonstrably allows tracing elements of the report to the corresponding segmentation maps and verifying the groundedness of assessments. Our code will be made publicly available at a later date.
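The "simple concatenation" into a projection layer described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `project_tokens`, the toy shapes, and the use of a single linear layer as a stand-in for the LLaVA projector are all assumptions made for illustration.

```python
def project_tokens(visual_tokens, seg_tokens, weight, bias):
    """Concatenate per-token features channel-wise, then apply one linear layer."""
    projected = []
    for vis, seg in zip(visual_tokens, seg_tokens):
        fused = vis + seg  # list concatenation acts as channel-wise concatenation
        projected.append([
            sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weight, bias)
        ])
    return projected


# Toy shapes (hypothetical): 2 tokens, 3 visual channels, 2 segmentation channels,
# projected to 4 output channels.
visual_tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
seg_tokens = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0] * 5 for _ in range(4)]  # all-ones weights keep the result easy to check
bias = [0.0] * 4
fused_out = project_tokens(visual_tokens, seg_tokens, weight, bias)
```

The only structural change relative to a projector without segmentation input is that the weight matrix gains one input column per added segmentation channel, which fits the abstract's remark that few parameters are added.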
[CV-18] A Target-based Multi-LiDAR Multi-Camera Extrinsic Calibration System
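[Quick read]: This paper addresses extrinsic calibration between different sensors in multi-sensor fusion systems, in particular multiple LiDARs and multiple cameras, with limited prior knowledge. The key to the solution is a target-based calibration method built on a custom ChArUco board and combined with a tailored nonlinear optimization, enabling cross-calibration between LiDARs and cameras. The system is tested on real-world data gathered in a warehouse.
Link: https://arxiv.org/abs/2507.16621
Authors: Lorenzo Gentilini, Pierpaolo Serio, Valentina Donzella, Lorenzo Pollini
Affiliations: Toyota Material Handling Manufacturing; Queen Mary University of London; University of Pisa
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: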
Abstract:Extrinsic Calibration represents the cornerstone of autonomous driving. Its accuracy plays a crucial role in the perception pipeline, as any errors can have implications for the safety of the vehicle. Modern sensor systems collect different types of data from the environment, making it harder to align the data. To this end, we propose a target-based extrinsic calibration system tailored for a multi-LiDAR and multi-camera sensor suite. This system enables cross-calibration between LiDARs and cameras with limited prior knowledge using a custom ChArUco board and a tailored nonlinear optimization method. We test the system with real-world data gathered in a warehouse. Results demonstrated the effectiveness of the proposed method, highlighting the feasibility of a unique pipeline tailored for various types of sensors.
[CV-19] CTSL: Codebook-based Temporal-Spatial Learning for Accurate Non-Contrast Cardiac Risk Prediction Using Cine MRIs MICCAI2025
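[Quick read]: This paper addresses the challenge of accurately predicting Major Adverse Cardiac Events (MACE) from non-contrast Cine MRI sequences. Existing methods typically rely on supervised learning with human-refined segmentation masks of the ventricular myocardium, which become impractical without contrast agents. The key to the solution is a self-supervised framework, Codebook-based Temporal-Spatial Learning (CTSL), which decouples temporal and spatial features through a multi-view distillation strategy: the teacher model processes multiple Cine views, while the student model learns from reduced-dimensional Cine-SA sequences. By combining codebook-based feature representations with dynamic lesion self-detection driven by motion cues, CTSL captures intricate temporal dependencies and motion patterns, yielding high-confidence MACE risk predictions and a rapid, non-invasive option for cardiac risk assessment without contrast agents.
Link: https://arxiv.org/abs/2507.16612
Authors: Haoyang Su, Shaohao Rui, Jinyi Xiang, Lianming Wu, Xiaosong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI 2025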
Abstract:Accurate and contrast-free Major Adverse Cardiac Events (MACE) prediction from Cine MRI sequences remains a critical challenge. Existing methods typically necessitate supervised learning based on human-refined masks in the ventricular myocardium, which become impractical without contrast agents. We introduce a self-supervised framework, namely Codebook-based Temporal-Spatial Learning (CTSL), that learns dynamic, spatiotemporal representations from raw Cine data without requiring segmentation masks. CTSL decouples temporal and spatial features through a multi-view distillation strategy, where the teacher model processes multiple Cine views, and the student model learns from reduced-dimensional Cine-SA sequences. By leveraging codebook-based feature representations and dynamic lesion self-detection through motion cues, CTSL captures intricate temporal dependencies and motion patterns. High-confidence MACE risk predictions are achieved through our model, providing a rapid, non-invasive solution for cardiac risk assessment that outperforms traditional contrast-dependent methods, thereby enabling timely and accessible heart disease diagnosis in clinical settings.
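The abstract does not say how CTSL builds or queries its codebook. As a generic illustration of what a codebook-based representation usually means, the sketch below performs plain nearest-neighbor vector quantization; the function name `quantize` and the toy codebook are hypothetical and are not taken from the paper.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for feat in features:
        dists = [sum((a - b) ** 2 for a, b in zip(feat, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Toy 2-D example with a three-entry codebook (values chosen for illustration only).
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.2, -0.1], [0.9, 1.2], [3.5, 0.4]]
codes = quantize(features, codebook)
```

Each continuous feature is replaced by a discrete index, so a sequence of frames becomes a sequence of code indices that a downstream temporal model can consume.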
[CV-20] Dyna3DGR: 4D Cardiac Motion Tracking with Dynamic 3D Gaussian Representation MICCAI2025
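[Quick read]: This paper addresses fine-grained 4D cardiac motion tracking, which is difficult because myocardial tissue is homogeneous and lacks distinctive features. It targets two shortcomings of existing work: image-based methods either struggle with topological consistency or depend on extensive training data, and representation-based methods lose image-level detail. The key to the solution is the Dynamic 3D Gaussian Representation (Dyna3DGR), which combines an explicit 3D Gaussian representation with implicit neural motion field modeling and optimizes cardiac structure and motion simultaneously in a self-supervised manner, without extensive training data or point-to-point correspondences. Through differentiable volumetric rendering, it links the continuous motion representation to image-space alignment while preserving topological and temporal consistency.
Link: https://arxiv.org/abs/2507.16608
Authors: Xueming Fu, Pei Wu, Yingtai Li, Xin Luo, Zihang Jiang, Junhao Mei, Jian Lu, Gao-Jun Teng, S. Kevin Zhou
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MICCAI 2025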
Abstract:Accurate analysis of cardiac motion is crucial for evaluating cardiac function. While dynamic cardiac magnetic resonance imaging (CMR) can capture detailed tissue motion throughout the cardiac cycle, fine-grained 4D cardiac motion tracking remains challenging due to the homogeneous nature of myocardial tissue and the lack of distinctive features. Existing approaches can be broadly categorized into image-based and representation-based, each with its limitations. Image-based methods, including both traditional and deep learning-based registration approaches, either struggle with topological consistency or rely heavily on extensive training data. Representation-based methods, while promising, often suffer from loss of image-level details. To address these limitations, we propose Dynamic 3D Gaussian Representation (Dyna3DGR), a novel framework that combines explicit 3D Gaussian representation with implicit neural motion field modeling. Our method simultaneously optimizes cardiac structure and motion in a self-supervised manner, eliminating the need for extensive training data or point-to-point correspondences. Through differentiable volumetric rendering, Dyna3DGR efficiently bridges continuous motion representation with image-space alignment while preserving both topological and temporal consistency. Comprehensive evaluations on the ACDC dataset demonstrate that our approach surpasses state-of-the-art deep learning-based diffeomorphic registration methods in tracking accuracy. The code will be available at this https URL.
[CV-21] A Multimodal Deviation Perceiving Framework for Weakly-Supervised Temporal Forgery Localization
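[Quick read]: This paper addresses a common limitation of current Deepfake forensics research: detection is usually treated as a classification task or a temporal forgery localization problem, which is restrictive, time-consuming, and hard to scale to large datasets. The authors propose a Multimodal Deviation Perceiving framework for weakly-supervised temporal forgery localization (MDP), which localizes partially forged segments using only video-level annotations. It has two key components. The first is a novel Multimodal Interaction mechanism (MI), which uses a temporal-property-preserving cross-modal attention to measure the relevance between the visual and audio modalities in a probabilistic embedding space, identifying inter-modality deviation and constructing comprehensive video features for temporal forgery localization. The second is an extensible deviation perceiving loss that enlarges the temporal deviation between adjacent segments of forged samples and reduces it for genuine samples. Experiments show results comparable to fully-supervised approaches on several evaluation metrics.
Link: https://arxiv.org/abs/2507.16596
Authors: Wenbo Xu, Junyan Wu, Wei Lu, Xiangyang Luo, Qian Wang
Affiliations: Sun Yat-sen University; Zhengzhou; Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 3 figures, conference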
Abstract:Current research on Deepfake forensics often treats detection as a classification task or temporal forgery localization problem, which are usually restrictive, time-consuming, and challenging to scale for large datasets. To resolve these issues, we present a multimodal deviation perceiving framework for weakly-supervised temporal forgery localization (MDP), which aims to identify temporal partial forged segments using only video-level annotations. The MDP proposes a novel multimodal interaction mechanism (MI) and an extensible deviation perceiving loss to perceive multimodal deviation, which achieves refined localization of the start and end timestamps of forged segments. Specifically, MI introduces a temporal property preserving cross-modal attention to measure the relevance between the visual and audio modalities in the probabilistic embedding space. It can identify the inter-modality deviation and construct comprehensive video features for temporal forgery localization. To explore further temporal deviation for weakly-supervised learning, an extensible deviation perceiving loss is proposed, aiming at enlarging the deviation of adjacent segments of the forged samples and reducing that of genuine samples. Extensive experiments demonstrate the effectiveness of the proposed framework, which achieves results comparable to fully-supervised approaches in several evaluation metrics.
[CV-22] Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge
[Quick read]: This paper addresses the reliable recognition and localization of surgical instruments in endoscopic video, which underpins applications in computer- and robot-assisted minimally invasive surgery (RAMIS) such as surgical training, skill assessment, and autonomous assistance. Robust performance under real-world conditions remains difficult, and incorporating surgical context, such as the current procedural phase, is a promising way to improve robustness and interpretability. The key contribution is the PhaKIR sub-challenge and its multi-center dataset of thirteen full-length laparoscopic cholecystectomy videos from three medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. The dataset supports integrating temporal information across entire procedures and provides a benchmark for temporally aware, context-driven RAMIS methods.
Link: https://arxiv.org/abs/2507.16559
Authors: Tobias Rueckert, David Rauber, Raphaela Maerkl, Leonard Klausmann, Suemeyye R. Yildiran, Max Gutbrod, Danilo Weber Nunes, Alvaro Fernandez Moreno, Imanol Luengo, Danail Stoyanov, Nicolas Toussaint, Enki Cho, Hyeon Bae Kim, Oh Sung Choo, Ka Young Kim, Seong Tae Kim, Gonçalo Arantes, Kehan Song, Jianjun Zhu, Junchen Xiong, Tingyi Lin, Shunsuke Kikuchi, Hiroki Matsuzaki, Atsushi Kouno, João Renato Ribeiro Manesco, João Paulo Papa, Tae-Min Choi, Tae Kyeong Jeong, Juyoun Park, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, Runzhi Wu, Mengya Xu, An Wang, Long Bai, Hongliang Ren, Amine Yamlahi, Jakob Hennighausen, Lena Maier-Hein, Satoshi Kondo, Satoshi Kasai, Kousuke Hirasawa, Shu Yang, Yihui Wang, Hao Chen, Santiago Rodríguez, Nicolás Aparicio, Leonardo Manrique, Juan Camilo Lyons, Olivia Hosie, Nicolás Ayobi, Pablo Arbeláez, Yiping Li, Yasmina Al Khalil, Sahar Nasirihaghighi, Stefanie Speidel, Daniel Rueckert, Hubertus Feussner, Dirk Wilhelm, Christoph Palm
Affiliations: Regensburg Medical Image Computing (ReMIC), OTH Regensburg; AKTORmed Robotic Surgery, Neutraubling, Germany; Regensburg Center of Biomedical Engineering (RCBE), OTH Regensburg and Regensburg University, Regensburg, Germany; Regensburg Center of Health Sciences and Technology (RCHST), OTH Regensburg, Regensburg, Germany; AI Centre of Excellence, Medtronic Ltd., Watford, UK; Engineering Sciences, University College London, London, UK; Augmented Intelligence Lab, Kyung Hee University, Seoul, South Korea; University of Minho, Braga, Portugal; Hanglok Tech, Zhuhai City, China; Jmees Inc., Kashiwa City, Japan; School of Sciences, São Paulo State University (UNESP), Bauru, Brazil; KIST HARILAB, Center for Humanoid Research, Artificial Intelligence and Robot Institute, Korea Institute of Science and Technology (KIST), Seoul, South Korea; King’s College London, London, UK; The Chinese University of Hong Kong, Hong Kong, China; Division of Intelligent Medical Systems, German Cancer Research Center (DKFZ), Heidelberg, Germany; Muroran Institute of Technology, Hokkaido, Japan; Niigata University of Health and Welfare, Niigata, Japan; Konica Minolta, Inc., Osaka, Japan; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Shenzhen, China; Center for Research and Formation in Artificial Intelligence (CinfonIA), Los Andes University, Bogota, Colombia; Department of Biomedical Engineering, Medical Image Analysis Group, Eindhoven University of Technology, Eindhoven, Netherlands; Institute of Information Technology (ITEC), Klagenfurt University, Klagenfurt, Austria; Center for Tactile Internet with Human-in-the-loop (CeTI), TU Dresden, Dresden, Germany; Department of Translational Surgical Oncology, National Center for Tumor Diseases (NCT/UCC), Partner Site Dresden, Dresden, Germany; Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital, Munich, Germany; Biomedical Image Analysis Group, Department of Computing, Imperial College London, London, UK; Research Group MITI, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany; Department of Surgery, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A challenge report pre-print containing 36 pages, 15 figures, and 13 tables
Abstract:Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.
[CV-23] Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach
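[Quick read]: This paper addresses the latency, resource, and system-integration constraints of deploying a deep neural network (DNN)-based hyperspectral imaging (HSI) segmentation processor for ADS. The core challenges are the high computational complexity caused by the over-parameterized nature of DNNs, the demands that HSI data preprocessing places on memory arrangement and inter-task communication, and the limited resources of edge platforms such as FPGA-based SoCs. The key to the solution is a set of practical software/hardware co-design optimizations: functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. The applied compression reduces the DNN to 24.34% of the original operations and 1.02% of the original parameters, giving a 2.86x inference speed-up without noticeable degradation in segmentation accuracy.
Link: https://arxiv.org/abs/2507.16556
Authors: Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Affiliations: University of the Basque Country (UPV/EHU)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: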
Abstract:The use of HSI for autonomous navigation is a promising research field aimed at improving the accuracy and robustness of detection, tracking, and scene understanding systems based on vision sensors. Combining advanced computer algorithms, such as DNNs, with small-size snapshot HSI cameras enhances the reliability of these systems. HSI overcomes intrinsic limitations of greyscale and RGB imaging in depicting physical properties of targets, particularly regarding spectral reflectance and metamerism. Despite promising results in HSI-based vision developments, safety-critical systems like ADS demand strict constraints on latency, resource consumption, and security, motivating the shift of ML workloads to edge platforms. This involves a thorough software/hardware co-design scheme to distribute and optimize the tasks efficiently among the limited resources of computing platforms. With respect to inference, the over-parameterized nature of DNNs poses significant computational challenges for real-time on-the-edge deployment. In addition, the intensive data preprocessing required by HSI, which is frequently overlooked, must be carefully managed in terms of memory arrangement and inter-task communication to enable an efficient integrated pipeline design on a SoC. This work presents a set of optimization techniques for the practical co-design of a DNN-based HSI segmentation processor deployed on a FPGA-based SoC targeted at ADS, including key optimizations such as functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. Applied compression techniques significantly reduce the complexity of the designed DNN to 24.34% of the original operations and to 1.02% of the original number of parameters, achieving a 2.86x speed-up in the inference task without noticeable degradation of the segmentation accuracy.
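The compression figures quoted above can be restated as reduction factors with simple arithmetic. The short check below uses only the three numbers given in the abstract; the variable names are illustrative.

```python
ops_fraction = 0.2434     # compressed model keeps 24.34% of the original operations
params_fraction = 0.0102  # and 1.02% of the original parameters
speedup = 2.86            # reported inference speed-up

ops_reduction = 1 / ops_fraction        # about 4.11x fewer operations
params_reduction = 1 / params_fraction  # about 98x fewer parameters
latency_fraction = 1 / speedup          # inference takes about 35% of the original time

print(f"operations reduced {ops_reduction:.2f}x, parameters reduced {params_reduction:.1f}x")
print(f"inference time is {latency_fraction:.0%} of the original")
```

The speed-up (2.86x) is smaller than the reduction in operations (about 4.11x), which would be consistent with inference time not scaling linearly with operation count on the target platform, although the abstract does not break this down.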
[CV-24] EarthCrafter: Scalable 3D Earth Generation via Dual-Sparse Latent Diffusion
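[Quick read]: This paper addresses the difficulty of scaling current 3D generation methods to geographic extents, such as modeling thousands of square kilometers of Earth's surface. The solution is a dual innovation. First, it builds Aerial-Earth3D, described as the largest 3D aerial dataset to date, with 50k curated scenes (each 600m x 600m) comprising 45M multi-view Google Earth frames and annotations including camera poses, depth maps, normals, and semantic segmentation, with explicit quality control to ensure terrain diversity. Second, it proposes EarthCrafter, a framework based on sparse-decoupled latent diffusion that separates structural and textural generation: dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, reducing the computational cost at large geographic scales, and condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) model latent geometry and texture features independently.
Link: https://arxiv.org/abs/2507.16535
Authors: Shang Liu, Chenjie Cao, Chaohui Yu, Wen Qian, Jing Wang, Fan Wang
Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: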
Abstract:Despite the remarkable developments achieved by recent 3D generation works, scaling these methods to geographic extents, such as modeling thousands of square kilometers of Earth’s surface, remains an open challenge. We address this through a dual innovation in data infrastructure and model architecture. First, we introduce Aerial-Earth3D, the largest 3D aerial dataset to date, consisting of 50k curated scenes (each measuring 600m x 600m) captured across the U.S. mainland, comprising 45M multi-view Google Earth frames. Each scene provides pose-annotated multi-view images, depth maps, normals, semantic segmentation, and camera poses, with explicit quality control to ensure terrain diversity. Building on this foundation, we propose EarthCrafter, a tailored framework for large-scale 3D Earth generation via sparse-decoupled latent diffusion. Our architecture separates structural and textural generation: 1) Dual sparse 3D-VAEs compress high-resolution geometric voxels and textural 2D Gaussian Splats (2DGS) into compact latent spaces, largely alleviating the costly computation suffering from vast geographic scales while preserving critical information. 2) We propose condition-aware flow matching models trained on mixed inputs (semantics, images, or neither) to flexibly model latent geometry and texture features independently. Extensive experiments demonstrate that EarthCrafter performs substantially better in extremely large-scale generation. The framework further supports versatile applications, from semantic-guided urban layout generation to unconditional terrain synthesis, while maintaining geographic plausibility through our rich data priors from Aerial-Earth3D.
[CV-25] Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models ICME2025
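[Quick read]: This paper addresses the limited spatial awareness of current 3D multimodal large language models (3D MLLMs) on 3D vision-language tasks. The root cause is that existing methods typically compress holistic 3D scene information or segment independent objects, which under-represents the rich spatial relations in 3D scenes. The key to the solution is Spatial 3D-LLM, which introduces a progressive spatial awareness scheme that captures spatial information step by step as the perception field expands, and generates location-enriched 3D scene embeddings that serve as visual prompts.
Link: https://arxiv.org/abs/2507.16524
Authors: Xiaoyan Wang, Zeju Li, Yifan Xu, Jiaxing Qi, Zhifei Yang, Ruifei Ma, Xiangde Liu, Chao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICME2025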
Abstract:A new era has unlocked exciting possibilities for extending Large Language Models (LLMs) to tackle 3D vision-language tasks. However, most existing 3D multimodal LLMs (MLLMs) rely on compressing holistic 3D scene information or segmenting independent objects to perform these tasks, which limits their spatial awareness due to insufficient representation of the richness inherent in 3D scenes. To overcome these limitations, we propose Spatial 3D-LLM, a 3D MLLM specifically designed to enhance spatial awareness for 3D vision-language tasks by enriching the spatial embeddings of 3D scenes. Spatial 3D-LLM integrates an LLM backbone with a progressive spatial awareness scheme that progressively captures spatial information as the perception field expands, generating location-enriched 3D scene embeddings to serve as visual prompts. Furthermore, we introduce two novel tasks: 3D object distance measurement and 3D layout editing, and construct a 3D instruction dataset, MODEL, to evaluate the model’s spatial awareness capabilities. Experimental results demonstrate that Spatial 3D-LLM achieves state-of-the-art performance across a wide range of 3D vision-language tasks, revealing the improvements stemming from our progressive spatial awareness scheme, which mines more profound spatial information. Our code is available at this https URL.
[CV-26] PlantSAM: An Object Detection-Driven Segmentation Pipeline for Herbarium Specimens
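[Quick read]: This paper addresses the noise and artifacts that background heterogeneity introduces into deep learning-based classification of herbarium images, which can mislead models and reduce classification accuracy. The key to the solution is PlantSAM, an automated segmentation pipeline in which YOLOv10 detects plant regions and produces bounding box prompts that guide the Segment Anything Model (SAM2). Both models are fine-tuned on herbarium images and evaluated with Intersection over Union (IoU) and the Dice coefficient; PlantSAM reaches an IoU of 0.94 and a Dice coefficient of 0.97. Using the segmented images improves downstream classification across five botanical traits, with accuracy gains of up to 4.36% and F1-score gains of up to 4.15%, which supports the value of background removal for focusing models on foreground plant structures.
Link: https://arxiv.org/abs/2507.16506
Authors: Youcef Sklab, Florian Castanet, Hanane Ariouat, Souhila Arib, Jean-Daniel Zucker, Eric Chenin, Edi Prifti
Affiliations: IRD; Sorbonne Université; UMMISCO; CY Cergy Paris Université; CNRS; INSERM
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 11 figures, 8 tables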
Abstract:Deep learning-based classification of herbarium images is hampered by background heterogeneity, which introduces noise and artifacts that can potentially mislead models and reduce classification accuracy. Addressing these background-related challenges is critical to improving model performance. We introduce PlantSAM, an automated segmentation pipeline that integrates YOLOv10 for plant region detection and the Segment Anything Model (SAM2) for segmentation. YOLOv10 generates bounding box prompts to guide SAM2, enhancing segmentation accuracy. Both models were fine-tuned on herbarium images and evaluated using Intersection over Union (IoU) and Dice coefficient metrics. PlantSAM achieved state-of-the-art segmentation performance, with an IoU of 0.94 and a Dice coefficient of 0.97. Incorporating segmented images into classification models led to consistent performance improvements across five tested botanical traits, with accuracy gains of up to 4.36% and F1-score improvements of 4.15%. Our findings highlight the importance of background removal in herbarium image analysis, as it significantly enhances classification accuracy by allowing models to focus more effectively on the foreground plant structures.
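The two segmentation metrics reported above are closely related. The sketch below computes IoU and Dice for a pair of binary masks and applies the single-pair identity Dice = 2·IoU / (1 + IoU); the function names and the toy masks are illustrative and do not come from the paper.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as flat sequences of 0/1."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as perfect agreement by convention
    return intersection / union, 2 * intersection / (pred_area + target_area)


def dice_from_iou(iou):
    """For a single pair of masks, Dice = 2 * IoU / (1 + IoU)."""
    return 2 * iou / (1 + iou)


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred, target)  # 2 shared pixels, union of 4
implied_dice = dice_from_iou(0.94)      # Dice implied by the reported IoU
```

Plugging in the reported IoU of 0.94 gives a Dice of roughly 0.97, in line with the reported Dice coefficient. The identity is exact only for a single pair of masks; scores averaged over many images need not satisfy it exactly.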
[CV-27] Designing for Difference: How Human Characteristics Shape Perceptions of Collaborative Robots
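[Quick read]: This paper addresses responsible and inclusive design of assistive robots for social collaboration, focusing on how robot behavior is perceived and assessed when the human collaborator belongs to a group such as older adults or people with disabilities. Prior research is scarce, likely because participants have limited real-world experience with advanced domestic robots, which makes it hard to assess how varying robot behaviors fit diverse human needs. The key to the approach is cognitive-affective mapping (CAM), a reflective method that supports meaningful reflection on human-robot collaboration despite that limited experience. CAM did not significantly change overall ratings, but it led to more pronounced assessments for certain combinations of robot behavior and human condition. The type of collaboration, the presence of object handovers, and the age of the human all influenced acceptability, underscoring the importance of prosocial design for user-centered robotic systems tailored to diverse populations.
Link: https://arxiv.org/abs/2507.16480
Authors: Sabrina Livanec, Laura Londoño, Michael Gorki, Adrian Röfer, Abhinav Valada, Andrea Kiesel
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments: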
Abstract:The development of assistive robots for social collaboration raises critical questions about responsible and inclusive design, especially when interacting with individuals from protected groups such as those with disabilities or advanced age. Currently, research is scarce on how participants assess varying robot behaviors in combination with diverse human needs, likely since participants have limited real-world experience with advanced domestic robots. In the current study, we aim to address this gap while using methods that enable participants to assess robot behavior, as well as methods that support meaningful reflection despite limited experience. In an online study, 112 participants (from both experimental and control groups) evaluated 7 videos from a total of 28 variations of human-robot collaboration types. The experimental group first completed a cognitive-affective mapping (CAM) exercise on human-robot collaboration before providing their ratings. Although CAM reflection did not significantly affect overall ratings, it led to more pronounced assessments for certain combinations of robot behavior and human condition. Most importantly, the type of human-robot collaboration influences the assessment. Antisocial robot behavior was consistently rated as the lowest, while collaboration with aged individuals elicited more sensitive evaluations. Scenarios involving object handovers were viewed more positively than those without them. These findings suggest that both human characteristics and interaction paradigms influence the perceived acceptability of collaborative robots, underscoring the importance of prosocial design. They also highlight the potential of reflective methods, such as CAM, to elicit nuanced feedback, supporting the development of user-centered and socially responsible robotic systems tailored to diverse populations.
[CV-28] Survival Modeling from Whole Slide Images via Patch-Level Graph Clustering and Mixture Density Experts
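[Quick read]: This paper addresses the prediction of cancer-specific survival from Whole Slide Images (WSIs), where the large size and tissue heterogeneity of WSIs make feature extraction difficult. The key to the solution is a modular framework with four components: 1) dynamic patch selection via quantile-based thresholding to isolate prognostically informative tissue regions; 2) graph-guided k-means clustering to capture phenotype-level heterogeneity through spatial and morphological coherence; 3) attention mechanisms that model intra- and inter-cluster relationships, placing local features in the context of global spatial relations between tissue compartments; and 4) expert-guided mixture density modeling that estimates complex survival distributions with Gaussian mixture models. The method achieves concordance indices of 0.712 and 0.645 on TCGA-KIRC and TCGA-LUAD, respectively, which the authors report as significantly better than the state of the art.
Link: https://arxiv.org/abs/2507.16476
Authors: Ardhendu Sekhar, Vasu Soni, Keshav Aske, Garima Jain, Pranav Jeevan, Amit Sethi
Affiliations: Indian Institute of Technology Bombay; Indian Council of Medical Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: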
Abstract:We introduce a modular framework for predicting cancer-specific survival from whole slide pathology images (WSIs) that significantly improves upon the state-of-the-art accuracy. Our method integrates four key components. Firstly, to tackle the large size of WSIs, we use dynamic patch selection via quantile-based thresholding for isolating prognostically informative tissue regions. Secondly, we use graph-guided k-means clustering to capture phenotype-level heterogeneity through spatial and morphological coherence. Thirdly, we use attention mechanisms that model both intra- and inter-cluster relationships to contextualize local features within global spatial relations between various types of tissue compartments. Finally, we use expert-guided mixture density modeling for estimating complex survival distributions using Gaussian mixture models. The proposed model achieves a concordance index of 0.712 ± 0.028 and a Brier score of 0.254 ± 0.018 on TCGA-KIRC (renal cancer), and a concordance index of 0.645 ± 0.017 and a Brier score of 0.281 ± 0.031 on TCGA-LUAD (lung adenocarcinoma). These results are significantly better than the state of the art and demonstrate the predictive potential of the proposed method across diverse cancer types.
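For readers unfamiliar with the concordance index reported above, the sketch below computes Harrell's C-index on a toy cohort. It illustrates the metric only, not the authors' evaluation code; the function name and the example data are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs whose risks are ordered correctly.

    times  : observed time per patient (event or censoring time)
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third patient is censored at t=6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)  # 4 of 5 comparable pairs are concordant
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported 0.712 and 0.645.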
[CV-29] DenseSR: Image Shadow Removal as Dense Prediction
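[Quick read]: This paper addresses image quality degradation in single-image shadow removal (SR), particularly under challenging indirect illumination, where non-uniform content degradation and inherent ambiguity make it hard for traditional methods to recover intra-shadow details while keeping boundaries sharp, leading to inconsistent restoration and blurring. The key to the solution is DenseSR, which approaches the problem from a dense prediction perspective and combines two strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows; and (2) a novel Dense Fusion Block (DFB) in the decoder, which uses an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries.
Link: https://arxiv.org/abs/2507.16472
Authors: Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu
Affiliations: National Cheng Kung University; National Yang Ming Chiao Tung University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper accepted to ACMMM 2025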
Abstract:Shadows are a common factor degrading image quality. Single-image shadow removal (SR), particularly under challenging indirect illumination, is hampered by non-uniform content degradation and inherent ambiguity. Consequently, traditional methods often fail to simultaneously recover intra-shadow details and maintain sharp boundaries, resulting in inconsistent restoration and blurring that negatively affect both downstream applications and the overall viewing experience. To overcome these limitations, we propose DenseSR, approaching the problem from a dense prediction perspective to emphasize restoration quality. This framework uniquely synergizes two key strategies: (1) deep scene understanding guided by geometric-semantic priors to resolve ambiguity and implicitly localize shadows, and (2) high-fidelity restoration via a novel Dense Fusion Block (DFB) in the decoder. The DFB employs adaptive component processing, using an Adaptive Content Smoothing Module (ACSM) for consistent appearance and a Texture-Boundary Recuperation Module (TBRM) for fine textures and sharp boundaries, thereby directly tackling the inconsistent restoration and blurring issues. These purposefully processed components are effectively fused, yielding an optimized feature representation preserving both consistency and fidelity. Extensive experimental results demonstrate the merits of our approach over existing methods. Our code is available at https://github.com/VanLinLin/DenseSR
[CV-30] VGGT-Long: Chunk it, Loop it, Align it – Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences
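[Quick Read]: This paper addresses the difficulty of scaling foundation models to 3D reconstruction from large-scale RGB streams under memory limits, especially monocular 3D reconstruction in kilometer-scale, unbounded outdoor environments. The key to the solution is VGGT-Long, a system that combines chunk-based processing with overlapping alignment and lightweight loop closure optimization to relieve the scalability bottleneck of existing models. Without camera calibration, depth supervision, or model retraining, it achieves trajectory and reconstruction accuracy comparable to traditional methods. Evaluation on KITTI, Waymo, and Virtual KITTI shows that it runs on long RGB sequences and produces consistent geometry, which highlights the potential of foundation models for scalable monocular 3D scene reconstruction in real-world settings, particularly autonomous driving.
Link: https://arxiv.org/abs/2507.16443
Authors: Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, Jin Xie
Affiliations: Nankai University (南开大学); Nanjing University (南京大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: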
Abstract:Foundation models for 3D vision have recently demonstrated remarkable capabilities in 3D perception. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. In this work, we propose VGGT-Long, a simple yet effective system that pushes the limits of monocular 3D reconstruction to kilometer-scale, unbounded outdoor environments. Our approach addresses the scalability bottlenecks of existing models through a chunk-based processing strategy combined with overlapping alignment and lightweight loop closure optimization. Without requiring camera calibration, depth supervision or model retraining, VGGT-Long achieves trajectory and reconstruction performance comparable to traditional methods. We evaluate our method on KITTI, Waymo, and Virtual KITTI datasets. VGGT-Long not only runs successfully on long RGB sequences where foundation models typically fail, but also produces accurate and consistent geometry across various conditions. Our results highlight the potential of leveraging foundation models for scalable monocular 3D scene reconstruction in real-world settings, especially for autonomous driving scenarios. Code is available at this https URL.
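The chunk-based strategy with overlapping frames can be illustrated with a small index-splitting helper. This is only a sketch under stated assumptions: the function name, `chunk_size`, and `overlap` are hypothetical, and the abstract does not say how VGGT-Long actually sizes its chunks or aligns them.

```python
def overlapping_chunks(num_frames, chunk_size, overlap):
    """Split frame indices 0..num_frames-1 into half-open (start, end) chunks.

    Consecutive chunks share `overlap` frames, which a later alignment
    step could use to register one chunk against the next.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < num_frames:
        end = min(start + chunk_size, num_frames)
        chunks.append((start, end))
        if end == num_frames:
            break  # the last chunk reached the end of the sequence
        start += step
    return chunks


if __name__ == "__main__":
    print(overlapping_chunks(10, 4, 1))
```

Each chunk can then be processed within a fixed memory budget, and the shared frames give the alignment step common observations to work with.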
[CV-31] Robust Noisy Pseudo-label Learning for Semi-supervised Medical Image Segmentation Using Diffusion Model
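[Quick Read]: This paper addresses the problem that, in semi-supervised medical image segmentation, noise in pseudo-labels keeps the latent space from acquiring a well-structured semantic distribution. Existing methods that train on a few annotated samples plus abundant unlabeled data struggle to structure this distribution, which hurts segmentation accuracy. The key to the solution is a new diffusion-based framework that constrains the latent structure of semantic labels during the denoising diffusion process by enforcing prototype-based contrastive consistency. Rather than explicitly delineating semantic boundaries, it uses class prototypes, i.e. centralized semantic representations in the latent space, as anchors. This improves the robustness of dense predictions, particularly when pseudo-labels are noisy.
Link: https://arxiv.org/abs/2507.16429
Authors: Lin Xi, Yingliang Ma, Cheng Wang, Sandra Howell, Aldo Rinaldi, Kawal S. Rhode
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: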
Abstract:Obtaining pixel-level annotations in the medical domain is both expensive and time-consuming, often requiring close collaboration between clinical experts and developers. Semi-supervised medical image segmentation aims to leverage limited annotated data alongside abundant unlabeled data to achieve accurate segmentation. However, existing semi-supervised methods often struggle to structure semantic distributions in the latent space due to noise introduced by pseudo-labels. In this paper, we propose a novel diffusion-based framework for semi-supervised medical image segmentation. Our method introduces a constraint into the latent structure of semantic labels during the denoising diffusion process by enforcing prototype-based contrastive consistency. Rather than explicitly delineating semantic boundaries, the model leverages class prototypes (centralized semantic representations in the latent space) as anchors. This strategy improves the robustness of dense predictions, particularly in the presence of noisy pseudo-labels. We also introduce a new publicly available benchmark: Multi-Object Segmentation in X-ray Angiography Videos (MOSXAV), which provides detailed, manually annotated segmentation ground truth for multiple anatomical structures in X-ray angiography videos. Extensive experiments on the EndoScapes2023 and MOSXAV datasets demonstrate that our method outperforms state-of-the-art medical image segmentation approaches under the semi-supervised learning setting. This work presents a robust and data-efficient diffusion model that offers enhanced flexibility and strong potential for a wide range of clinical applications.
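To make "class prototypes as anchors" concrete, here is a minimal sketch that computes one prototype per class as the mean feature vector and assigns a new feature to its most similar prototype. It is an assumption-laden illustration, not the paper's method: the function names and the choice of cosine similarity are hypothetical, and the contrastive consistency loss and the diffusion process are not shown.

```python
import math


def class_prototypes(features, labels):
    """Return {label: mean feature vector} for the labelled features."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_prototype(vec, prototypes):
    """Label of the prototype with the highest cosine similarity to vec."""
    return max(prototypes, key=lambda lab: cosine(vec, prototypes[lab]))
```

Because a prototype averages many features, a handful of mislabelled pixels shifts it only slightly, which is one intuition for why prototype anchors can tolerate noisy pseudo-labels.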
[CV-32] Combined Image Data Augmentations diminish the benefits of Adaptive Label Smoothing FAST
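[Quick Read]: This paper examines how far soft augmentation generalizes. Soft augmentation is an adaptive label smoothing scheme that regularizes supervised image classifiers by lowering a training sample's label confidence according to the magnitude of the random-crop augmentation applied to it. The key step is to extend this framework beyond random crop to other aggressive augmentations, namely random erasing and noise injection. Reducing label confidence according to augmentation strength permits stronger regularization, for example through higher-intensity random erasing. However, the benefit vanishes when the method is combined with a diverse set of image transformations, as in TrivialAugment, and excessive label smoothing harms robustness to common corruptions. The authors conclude that adaptive label smoothing should be applied only when the training data distribution is dominated by a limited, homogeneous set of transformation types.
Link: https://arxiv.org/abs/2507.16427
Authors: Georg Siedel, Ekagra Gupta, Weijia Shao, Silvia Vock, Andrey Morozov
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Preprint submitted to the Fast Review Track of DAGM German Conference on Pattern Recognition (GCPR) 2025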
Abstract:Soft augmentation regularizes the supervised learning process of image classifiers by reducing label confidence of a training sample based on the magnitude of random-crop augmentation applied to it. This paper extends this adaptive label smoothing framework to other types of aggressive augmentations beyond random-crop. Specifically, we demonstrate the effectiveness of the method for random erasing and noise injection data augmentation. Adaptive label smoothing permits stronger regularization via higher-intensity Random Erasing. However, its benefits vanish when applied with a diverse range of image transformations as in the state-of-the-art TrivialAugment method, and excessive label smoothing harms robustness to common corruptions. Our findings suggest that adaptive label smoothing should only be applied when the training data distribution is dominated by a limited, homogeneous set of image transformation types.
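The idea of tying label confidence to augmentation magnitude can be sketched as follows. The linear mapping from the visible fraction of the image to the target confidence, the `min_confidence` floor, and the function name are all assumptions made for illustration; the abstract does not give the paper's actual schedule.

```python
def soft_label(true_class, num_classes, visible_fraction, min_confidence=0.5):
    """Build a soft target whose confidence shrinks as more of the image is lost.

    visible_fraction is 1.0 for an untouched image and 0.0 when the
    augmentation (for example random erasing) removed everything.
    The linear schedule below is an illustrative assumption.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= visible_fraction <= 1.0:
        raise ValueError("visible_fraction must lie in [0, 1]")
    confidence = min_confidence + (1.0 - min_confidence) * visible_fraction
    # Spread the remaining probability mass evenly over the other classes.
    off_value = (1.0 - confidence) / (num_classes - 1)
    label = [off_value] * num_classes
    label[true_class] = confidence
    return label
```

With a fixed smoothing factor every sample gets the same soft target; here a heavily erased sample gets a softer target than a lightly augmented one.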
[CV-33] Towards Railway Domain Adaptation for LiDAR-based 3D Detection: Road-to-Rail and Sim-to-Real via SynDRA-BBox
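[Quick Read]: This paper addresses the lack of publicly available, annotated real-world datasets in the railway domain, which makes it hard to test and validate visual perception algorithms there. The key to the solution is SynDRA-BBox, a synthetic dataset that supports object detection and other vision tasks in realistic railway scenarios; according to the authors, it is the first synthetic dataset tailored to 2D and 3D object detection in this domain. A state-of-the-art semi-supervised domain adaptation method originally developed for automotive perception is adapted to the railway context, and the experiments report promising results for transferring synthetic data to 3D object detection in railway environments.
Link: https://arxiv.org/abs/2507.16413
Authors: Xavier Diaz, Gianluca D’Amico, Raul Dominguez-Sanchez, Federico Nesti, Max Ronecker, Giorgio Buttazzo
Affiliations: SETLabs Research GmbH (SETLabs 研究有限公司); Scuola Superiore Sant’Anna (圣安娜高等学院)
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: IEEE International Conference on Intelligent Rail Transportation (ICIRT) 2025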
Abstract:In recent years, interest in automatic train operations has significantly increased. To enable advanced functionalities, robust vision-based algorithms are essential for perceiving and understanding the surrounding environment. However, the railway sector suffers from a lack of publicly available real-world annotated datasets, making it challenging to test and validate new perception solutions in this domain. To address this gap, we introduce SynDRA-BBox, a synthetic dataset designed to support object detection and other vision-based tasks in realistic railway scenarios. To the best of our knowledge, it is the first synthetic dataset specifically tailored for 2D and 3D object detection in the railway domain. The dataset is publicly available at this https URL. In the presented evaluation, a state-of-the-art semi-supervised domain adaptation method, originally developed for automotive perception, is adapted to the railway context, enabling the transferability of synthetic data to 3D object detection. Experimental results demonstrate promising performance, highlighting the effectiveness of synthetic datasets and domain adaptation techniques in advancing perception capabilities for railway environments.
[CV-34] Sparse-View 3D Reconstruction: Recent Advances and Open Challenges
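[Quick Read]: This survey covers sparse-view 3D reconstruction, which arises when dense image acquisition is impractical (for example in robotics, AR/VR, and autonomous systems) and minimal overlap between views causes traditional structure-from-motion (SfM) and multiview stereo (MVS) to fail. The surveyed solutions combine geometric regularization, explicit shape modeling, and generative inference, and fall into three families: neural implicit models (such as NeRF and its regularized variants), explicit point-cloud-based methods (such as 3D Gaussian Splatting), and hybrid frameworks that use priors from diffusion models and vision foundation models (VFMs). These methods add geometric constraints, optimize explicit representations, or exploit generative priors to reduce floater artifacts and pose ambiguities, and the survey compares their trade-offs between reconstruction accuracy, efficiency, and generalization.
Link: https://arxiv.org/abs/2507.16406
Authors: Tanveer Younis, Zhanglin Cheng
Affiliations: Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (深圳先进技术研究院,中国科学院); Shenzhen VisuCA Key Lab (深圳视觉计算重点实验室)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 30 pages, 6 figures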
Abstract:Sparse-view 3D reconstruction is essential for applications in which dense image acquisition is impractical, such as robotics, augmented/virtual reality (AR/VR), and autonomous systems. In these settings, minimal image overlap prevents reliable correspondence matching, causing traditional methods, such as structure-from-motion (SfM) and multiview stereo (MVS), to fail. This survey reviews the latest advances in neural implicit models (e.g., NeRF and its regularized versions), explicit point-cloud-based approaches (e.g., 3D Gaussian Splatting), and hybrid frameworks that leverage priors from diffusion and vision foundation models (VFMs). We analyze how geometric regularization, explicit shape modeling, and generative inference are used to mitigate artifacts such as floaters and pose ambiguities in sparse-view settings. Comparative results on standard benchmarks reveal key trade-offs between the reconstruction accuracy, efficiency, and generalization. Unlike previous reviews, our survey provides a unified perspective on geometry-based, neural implicit, and generative (diffusion-based) methods. We highlight the persistent challenges in domain generalization and pose-free reconstruction and outline future directions for developing 3D-native generative priors and achieving real-time, unconstrained sparse-view reconstruction.
[CV-35] ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering ICCV
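[Quick Read]: This paper addresses the shortcomings of existing Visual Question Answering (VQA) datasets in complex reasoning and external knowledge integration, in particular the lack of high-quality annotated data that supports multi-hop reasoning. The key to the solution is a new dataset, ReasonVQA, which is automatically integrated with structured encyclopedic knowledge and built with a low-cost framework capable of generating complex multi-hop questions. The dataset scales easily with the number of input images, and the current version is more than an order of magnitude larger than the largest existing datasets that require external knowledge, giving VQA models a more challenging benchmark.
Link: https://arxiv.org/abs/2507.16403
Authors: Thuy-Duong Tran, Trung-Kien Tran, Manfred Hauswirth, Danh Le Phuoc
Affiliations: Bosch Center for Artificial Intelligence (博世人工智能中心); Technical University of Berlin (柏林工业大学); Fraunhofer FOKUS (弗劳恩霍夫协会通信与信息工程研究所)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the IEEE/CVF International Conference on Computer Vision (ICCV) 2025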
Abstract:In this paper, we propose a new dataset, ReasonVQA, for the Visual Question Answering (VQA) task. Our dataset is automatically integrated with structured encyclopedic knowledge and constructed using a low-cost framework, which is capable of generating complex, multi-hop questions. We evaluated state-of-the-art VQA models on ReasonVQA, and the empirical results demonstrate that ReasonVQA poses significant challenges to these models, highlighting its potential for benchmarking and advancing the field of VQA. Additionally, our dataset can be easily scaled with respect to input images; the current version surpasses the largest existing datasets requiring external knowledge by more than an order of magnitude.
[CV-36] ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement
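[Quick Read]: This paper addresses two problems in document image forgery localization: tampered regions blend seamlessly into the uniform document background (BG) and structured text, which makes them hard to localize, and existing methods are not robust to degradations such as resizing and cropping. The key to the solution is ADCD-Net, a model that adaptively fuses RGB and Discrete Cosine Transform (DCT) forensic features through three mechanisms: (1) it modulates the contribution of DCT features according to a predicted alignment score, which makes it resilient to block misalignment; (2) it applies hierarchical content disentanglement to reduce the text-BG disparity and improve localization; and (3) it builds a pristine prototype that captures the traces of untampered regions, further improving localization accuracy and robustness.
Link: https://arxiv.org/abs/2507.16397
Authors: Kahim Wong, Jicheng Zhou, Haiwei Wu, Yain-Whar Si, Jiantao Zhou
Affiliations: University of Macau (澳门大学); University of Electronic Science and Technology of China (电子科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: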
Abstract:The advancement of image editing tools has enabled malicious manipulation of sensitive document images, underscoring the need for robust document image forgery localization. Although forgery detectors for natural images have been extensively studied, they struggle with document images, as the tampered regions can be seamlessly blended into the uniform document background (BG) and structured text. On the other hand, existing document-specific methods lack sufficient robustness against various degradations, which limits their practical deployment. This paper presents ADCD-Net, a robust document forgery localization model that adaptively leverages the RGB/DCT forensic traces and integrates key characteristics of document images. Specifically, to address the DCT traces’ sensitivity to block misalignment, we adaptively modulate the DCT feature contribution based on a predicted alignment score, resulting in much improved resilience to various distortions, including resizing and cropping. Also, a hierarchical content disentanglement approach is proposed to boost the localization performance via mitigating the text-BG disparities. Furthermore, noticing the predominantly pristine nature of BG regions, we construct a pristine prototype capturing traces of untampered regions, and eventually enhance both the localization accuracy and robustness. Our proposed ADCD-Net demonstrates superior forgery localization performance, consistently outperforming state-of-the-art methods by 20.79% averaged over 5 types of distortions. The code is available at this https URL.
[CV-37] Are Foundation Models All You Need for Zero-shot Face Presentation Attack Detection?
[Quick Read]: This paper addresses the vulnerability of face recognition systems to presentation attacks (PA), and in particular the poor generalization of existing deep learning Presentation Attack Detection (PAD) methods to unknown Presentation Attack Instruments (PAI) and unseen databases. The key to the solution is zero-shot PAD: the authors assess the effectiveness and generalizability of foundation models in established and challenging experimental scenarios and propose a simple but effective zero-shot PAD framework. The experiments indicate that, with minimal effort, the best foundation model outperforms the state of the art under the leave-one-out protocol on the challenging SiW-Mv2 database, which contains unknown 2D and 3D attacks.
Link: https://arxiv.org/abs/2507.16393
Authors: Lazaro Janier Gonzalez-Sole, Juan E. Tapia, Christoph Busch
Affiliations: da/sec - Biometrics and Security Research Group, Darmstadt, Germany; European Union (欧盟); UKRI Funding Service (英国研究与创新署); German Federal Ministry of Education and Research (德国联邦教育与研究部); Hessian Ministry of Higher Education, Research, Science and the Arts (黑森州高等教育、研究、科学和艺术部); National Research Center for Applied Cybersecurity ATHENE (国家应用网络安全研究中心 ATHENE)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at FG 2025
Abstract:Although face recognition systems have undergone an impressive evolution in the last decade, these technologies are vulnerable to attack presentations (AP). These attacks are mostly easy to create and, by executing them against the system’s capture device, the malicious actor can impersonate an authorised subject and thus gain access to the latter’s information (e.g., financial transactions). To protect facial recognition schemes against presentation attacks, state-of-the-art deep learning presentation attack detection (PAD) approaches require a large amount of data to produce reliable detection performances, and even then their performance drops for unknown presentation attack instruments (PAI) or databases (information not seen during training), i.e. they lack generalisability. To mitigate the above problems, this paper focuses on zero-shot PAD. To do so, we first assess the effectiveness and generalisability of foundation models in established and challenging experimental scenarios and then propose a simple but effective framework for zero-shot PAD. Experimental results show that, with minimal effort, these models are able to achieve in difficult scenarios the performance of the more advanced PAD mechanisms, whose weights were optimised mainly with training sets that included APs and bona fide presentations. The top-performing foundation model outperforms by a margin the best from the state of the art observed with the leaving-one-out protocol on the SiW-Mv2 database, which contains challenging unknown 2D and 3D attacks.
[CV-38] From Flat to Round: Redefining Brain Decoding with Surface-Based fMRI and Cortex Structure ICCV
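[Quick Read]: This paper addresses a key weakness in reconstructing visual stimuli from human brain activity such as fMRI: existing methods often overlook the brain's structure-function relationships, flattening spatial information and ignoring individual anatomical differences. The solution has three key parts: (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data for personalized encoding of individual anatomical variation; and (3) a positive-sample mixup strategy that makes efficient use of multiple fMRI scans of the same visual stimulus. Together these improve reconstruction accuracy, biological interpretability, and generalization across individuals.
Link: https://arxiv.org/abs/2507.16389
Authors: Sijin Yu, Zijiao Chen, Wenxuan Wu, Shengxian Chen, Zhongliang Liu, Jingxin Nie, Xiaofen Xing, Xiangmin Xu, Xin Zhang
Affiliations: South China University of Technology (华南理工大学); Stanford University (斯坦福大学); South China Normal University (华南师范大学); Pazhou Lab
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 18 pages, 14 figures, ICCV Findings 2025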
Abstract:Reconstructing visual stimuli from human brain activity (e.g., fMRI) bridges neuroscience and computer vision by decoding neural representations. However, existing methods often overlook critical brain structure-function relationships, flattening spatial information and neglecting individual anatomical variations. To address these issues, we propose (1) a novel sphere tokenizer that explicitly models fMRI signals as spatially coherent 2D spherical data on the cortical surface; (2) integration of structural MRI (sMRI) data, enabling personalized encoding of individual anatomical variations; and (3) a positive-sample mixup strategy for efficiently leveraging multiple fMRI scans associated with the same visual stimulus. Collectively, these innovations enhance reconstruction accuracy, biological interpretability, and generalizability across individuals. Experiments demonstrate superior reconstruction performance compared to SOTA methods, highlighting the effectiveness and interpretability of our biologically informed approach.
[CV-39] STAR: A Benchmark for Astronomical Star Fields Super-Resolution
[Quick Read]: This paper addresses three key limitations of existing datasets for astronomical super-resolution (ASR): flux inconsistency, the object-crop setting, and insufficient data diversity, all of which impede the development of ASR models. The key to the solution is the STAR dataset, which contains 54,738 flux-consistent star field image pairs, each combining a high-resolution Hubble Space Telescope observation with a low-resolution counterpart synthesized by a flux-preserving data generation pipeline, so that ASR can be modeled at the field level. The paper also introduces a Flux Error (FE) metric that evaluates SR models from a physical perspective, and a Flux-Invariant Super Resolution (FISR) model that surpasses several state-of-the-art SR methods by 24.84% on a newly designed flux consistency metric.
Link: https://arxiv.org/abs/2507.16385
Authors: Kuo-Cheng Wu, Guohang Zhuang, Jinyang Huang, Xiang Zhang, Wanli Ouyang, Yan Lu
Affiliations: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Hefei University of Technology (合肥工业大学); University of Science and Technology of China (中国科学技术大学); The Chinese University of Hong Kong (香港中文大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Super-resolution (SR) advances astronomical imaging by enabling cost-effective high-resolution capture, crucial for detecting faraway celestial objects and precise structural analysis. However, existing datasets for astronomical SR (ASR) exhibit three critical limitations: flux inconsistency, object-crop setting, and insufficient data diversity, significantly impeding ASR development. We propose STAR, a large-scale astronomical SR dataset containing 54,738 flux-consistent star field image pairs covering wide celestial regions. These pairs combine Hubble Space Telescope high-resolution observations with physically faithful low-resolution counterparts generated through a flux-preserving data generation pipeline, enabling systematic development of field-level ASR models. To further empower the ASR community, STAR provides a novel Flux Error (FE) metric to evaluate SR models from a physical viewpoint. Leveraging this benchmark, we propose a Flux-Invariant Super Resolution (FISR) model that can accurately infer flux-consistent high-resolution images from input photometry, surpassing several state-of-the-art SR methods by 24.84% on a newly designed flux consistency metric and showing the advantage of our method for astrophysics. Extensive experiments demonstrate the effectiveness of our proposed method and the value of our dataset. Code and models are available at this https URL.
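One simple way to see what "flux-preserving" means is to downsample by summing pixel blocks rather than averaging them, so the total flux of the field stays the same. The sketch below is an illustration under assumptions: sum pooling is swapped in for STAR's actual generation pipeline, and `relative_flux_error` is a hypothetical stand-in, not the paper's Flux Error (FE) definition; neither is specified in the abstract.

```python
def total_flux(image):
    """Sum of all pixel values in a 2D list-of-lists image."""
    return sum(sum(row) for row in image)


def downsample_sum(image, factor):
    """Downsample by summing each factor x factor block (total flux is kept)."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by factor")
    return [
        [
            sum(image[r + dr][c + dc] for dr in range(factor) for dc in range(factor))
            for c in range(0, width, factor)
        ]
        for r in range(0, height, factor)
    ]


def downsample_mean(image, factor):
    """Downsample by block averaging (total flux shrinks by factor**2)."""
    return [[v / (factor * factor) for v in row] for row in downsample_sum(image, factor)]


def relative_flux_error(predicted, reference):
    """Hypothetical metric: relative difference in total flux."""
    ref = total_flux(reference)
    return abs(total_flux(predicted) - ref) / ref
```

On a 4x4 image with a factor of 2, the sum-pooled image has zero relative flux error against the original, while the mean-pooled one loses three quarters of the flux.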
[CV-40] LPTR-AFLNet: Lightweight Integrated Chinese License Plate Rectification and Recognition Network
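[Quick Read]: This paper addresses Chinese License Plate Recognition (CLPR) in complex, unconstrained environments, where shooting angles cause perspective distortion and both single-line and double-line plates must be rectified and recognized. With the limited compute of edge devices in mind, it proposes LPTR-AFLNet, a lightweight, end-to-end unified network. The key is to combine a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet, and to use the recognition output as a weak supervisory signal that guides rectification. Recognition accuracy is improved with an enhanced attention module that reduces confusion between similar characters and with Focal Loss to counter class imbalance. The experiments report strong performance across challenging scenarios and an inference time under 10 milliseconds on low- to mid-range GPUs.
Link: https://arxiv.org/abs/2507.16362
Authors: Guangzhu Xu, Pengcheng Zuo, Zhi Ke, Bangjun Lei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 33 figures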
Abstract:Chinese License Plate Recognition (CLPR) faces numerous challenges in unconstrained and complex environments, particularly due to perspective distortions caused by various shooting angles and the correction of single-line and double-line license plates. Considering the limited computational resources of edge devices, developing a low-complexity, end-to-end integrated network for both correction and recognition is essential for achieving real-time and efficient deployment. In this work, we propose a lightweight, unified network named LPTR-AFLNet for correcting and recognizing Chinese license plates, which combines a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet. The network leverages the recognition output as a weak supervisory signal to effectively guide the correction process, ensuring accurate perspective distortion correction. To enhance recognition accuracy, we introduce several improvements to LPRNet, including an improved attention module to reduce confusion among similar characters and the use of Focal Loss to address class imbalance during training. Experimental results demonstrate the exceptional performance of LPTR-AFLNet in rectifying perspective distortion and recognizing double-line license plate images, maintaining high recognition accuracy across various challenging scenarios. Moreover, on low- to mid-range GPU platforms, the method runs in less than 10 milliseconds, indicating its practical efficiency and broad applicability.
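For readers unfamiliar with Focal Loss, the standard per-sample form is FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the probability the model assigns to the true class. The sketch below shows that formula only; the defaults `gamma=2.0` and `alpha=1.0` are assumptions for illustration, and the abstract does not state the values used in LPTR-AFLNet.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the probability of its true class.

    gamma = 0 recovers ordinary cross-entropy; larger gamma down-weights
    samples the model already classifies confidently.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must lie in (0, 1]")
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


def cross_entropy(p_true):
    """Plain cross-entropy for one sample, for comparison."""
    return -math.log(p_true)
```

With gamma = 2, a well-classified sample at p_t = 0.9 contributes about 1% of its cross-entropy loss, so rare or hard characters dominate the gradient instead of frequent easy ones.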
[CV-41] Mamba-OTR: a Mamba-based Solution for Online Take and Release Detection from Untrimmed Egocentric Video
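[Quick Read]: This paper addresses Online detection of Take and Release (OTR) of objects in untrimmed egocentric video. The task involves severe label imbalance, temporally sparse positive annotations, and a need for precise temporal predictions, and models must be efficient enough for real-time deployment. The key to the solution is Mamba-OTR, a model based on the Mamba architecture that exploits temporal recurrence during inference while being trained only on short video clips. A focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric mitigate label imbalance and improve temporal precision. On EPIC-KITCHENS-100, Mamba-OTR reaches an mp-mAP of 45.48 in sliding-window mode and 43.35 in streaming mode, compared with 20.32 for a vanilla transformer and 25.16 for a vanilla Mamba.
Link: https://arxiv.org/abs/2507.16342
Authors: Alessandro Sebastiano Catinello, Giovanni Maria Farinella, Antonino Furnari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: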
Abstract:This work tackles the problem of Online detection of Take and Release (OTR) of an object in untrimmed egocentric videos. This task is challenging due to severe label imbalance, with temporally sparse positive annotations, and the need for precise temporal predictions. Furthermore, methods need to be computationally efficient in order to be deployed in real-world online settings. To address these challenges, we propose Mamba-OTR, a model based on the Mamba architecture. Mamba-OTR is designed to exploit temporal recurrence during inference while being trained on short video clips. To address label imbalance, our training pipeline incorporates the focal loss and a novel regularization scheme that aligns model predictions with the evaluation metric. Extensive experiments on EPIC-KITCHENS-100, comparisons with transformer-based approaches, and evaluations of different training and test schemes demonstrate the superiority of Mamba-OTR in both accuracy and efficiency. These findings are particularly evident when evaluating full-length videos or high frame-rate sequences, even when trained on short video snippets for computational convenience. The proposed Mamba-OTR achieves a noteworthy mp-mAP of 45.48 when operating in a sliding-window fashion, and 43.35 in streaming mode, versus the 20.32 of a vanilla transformer and 25.16 of a vanilla Mamba, thus providing a strong baseline for OTR. We will publicly release the source code of Mamba-OTR to support future research.
[CV-42] Navigating Large-Pose Challenge for High-Fidelity Face Reenactment with Video Diffusion Model
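[Quick Read]: This paper addresses the distortion that face reenactment suffers under large pose changes, in particular warping artifacts and the loss of detail and temporal inconsistency caused by coarse facial landmarks. The key to the solution is a new framework, Face Reenactment Video Diffusion (FRVD). A motion extractor first extracts implicit facial keypoints from the source and driving images to represent fine-grained motion and align it through a warping module. A Warping Feature Mapper (WFM) then maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model, so that priors on facial dynamics learned from large-scale video data can correct the warping and improve temporal coherence. This improves reconstruction quality and identity preservation under extreme pose variations.
Link: https://arxiv.org/abs/2507.16341
Authors: Mingtao Guo, Guanyu Xing, Yanci Zhang, Yanli Liu
Affiliations: Sichuan University (四川大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: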
Abstract:Face reenactment aims to generate realistic talking head videos by transferring motion from a driving video to a static source image while preserving the source identity. Although existing methods based on either implicit or explicit keypoints have shown promise, they struggle with large pose variations due to warping artifacts or the limitations of coarse facial landmarks. In this paper, we present the Face Reenactment Video Diffusion model (FRVD), a novel framework for high-fidelity face reenactment under large pose changes. Our method first employs a motion extractor to extract implicit facial keypoints from the source and driving images to represent fine-grained motion and to perform motion alignment through a warping module. To address the degradation introduced by warping, we introduce a Warping Feature Mapper (WFM) that maps the warped source image into the motion-aware latent space of a pretrained image-to-video (I2V) model. This latent space encodes rich priors of facial dynamics learned from large-scale video data, enabling effective warping correction and enhancing temporal coherence. Extensive experiments show that FRVD achieves superior performance over existing methods in terms of pose accuracy, identity preservation, and visual quality, especially in challenging scenarios with extreme pose variations.
[CV-43] One Polyp Identifies All: One-Shot Polyp Segmentation with SAM via Cascaded Priors and Iterative Prompt Evolution ICCV2025
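[Quick Read]: This paper addresses polyp segmentation, where morphological variability and domain shift make traditional fully supervised methods generalize poorly and require frequent retraining, and where reliance on large-scale annotation is a bottleneck because polyp boundary labeling is time-consuming and error-prone. The core of the solution is OP-SAM, a framework built on the Segment Anything Model (SAM) that generates prompts automatically from a single annotated image (one-shot). Its key components are Correlation-based Prior Generation (CPG) for semantic label transfer, Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variation and filter out noisy transfers, and Euclidean Prompt Evolution (EPE) for iterative prompt refinement, which together improve segmentation accuracy and robustness without additional annotation.
Link: https://arxiv.org/abs/2507.16337
Authors: Xinyu Mao, Xiaohan Xing, Fei Meng, Jianbang Liu, Fan Bai, Qiang Nie, Max Meng
Affiliations: Chinese University of Hong Kong (香港中文大学); Stanford University (斯坦福大学); Hong Kong University of Science and Technology (香港科技大学); Hong Kong University of Science and Technology(Guangzhou) (香港科技大学(广州)); Southern University of Science and Technology (南方科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by ICCV2025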
Abstract:Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM’s prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM’s effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%.
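The 76.93% figure above is an Intersection over Union (IoU) score. As a reminder of what that measures, here is a minimal sketch for binary masks stored as nested lists; returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
def iou(predicted, target):
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(predicted, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Convention: two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0
```

IoU penalizes both missed polyp pixels and false positives, which is why it is stricter than pixel accuracy on images where the polyp covers only a small area.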
[CV-44] Scene Text Detection and Recognition “in light of” Challenging Environmental Conditions using Aria Glasses Egocentric Vision Cameras
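[Quick Read]: This paper studies how environmental variables such as lighting, distance, and resolution affect Scene Text Detection and Recognition (STDR) on wearable devices, here Meta's Project Aria smart glasses, in real-world scenarios. The authors build a new dataset captured under controlled conditions and evaluate two OCR pipelines, EAST with CRNN and EAST with PyTesseract. Resolution and distance significantly influence recognition accuracy, while lighting plays a less predictable role, and image upscaling as a pre-processing step reduces the Character Error Rate (CER) from 0.65 to 0.48. The paper also shows that eye-gaze tracking can improve processing efficiency by focusing on the regions the user attends to, laying groundwork for adaptive, user-aware augmented reality (AR) systems.
Link: https://arxiv.org/abs/2507.16330
Authors: Joseph De Mathia, Carlos Francisco Moreno-García
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 8 figures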
Abstract:In an era where wearable technology is reshaping applications, Scene Text Detection and Recognition (STDR) becomes a straightforward choice through the lens of egocentric vision. Leveraging Meta’s Project Aria smart glasses, this paper investigates how environmental variables, such as lighting, distance, and resolution, affect the performance of state-of-the-art STDR algorithms in real-world scenarios. We introduce a novel, custom-built dataset captured under controlled conditions and evaluate two OCR pipelines: EAST with CRNN, and EAST with PyTesseract. Our findings reveal that resolution and distance significantly influence recognition accuracy, while lighting plays a less predictable role. Notably, image upscaling emerged as a key pre-processing technique, reducing Character Error Rate (CER) from 0.65 to 0.48. We further demonstrate the potential of integrating eye-gaze tracking to optimise processing efficiency by focusing on user attention zones. This work not only benchmarks STDR performance under realistic conditions but also lays the groundwork for adaptive, user-aware AR systems. Our contributions aim to inspire future research in robust, context-sensitive text recognition for assistive and research-oriented applications, such as asset inspection and nutrition analysis. The code is available at this https URL.
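Character Error Rate is conventionally the Levenshtein edit distance between the recognized text and the reference, divided by the reference length. The sketch below implements that common definition; whether the paper normalizes in exactly this way is an assumption, since the abstract only reports the values.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition a CER of 0.48 means that, on average, roughly one edit is needed for every two reference characters, and the value can exceed 1.0 when the recognizer outputs much more text than the reference contains.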
[CV-45] DREAM: Scalable Red Teaming for Text-to-Image Generative Systems via Distribution Modeling
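[Quick Read]: This paper addresses the fact that text-to-image (T2I) generative models can still produce harmful content despite safety alignment and external filters. It targets automated red teaming, which systematically identifies diverse prompts that elicit unsafe outputs so that safety can be assessed and improved before real-world deployment. The key to the solution is the DREAM framework, whose central idea is to model the probability distribution of the target system's problematic prompts directly instead of optimizing prompts one at a time. This enables explicit joint optimization of prompt effectiveness and diversity and allows efficient large-scale sampling after training. To do so without access to representative training samples, the authors draw on energy-based models to reformulate the objective, and they introduce GC-SPSA, an optimization algorithm that provides stable gradient estimates through the long and potentially non-differentiable T2I pipeline.
Link: https://arxiv.org/abs/2507.16329
Authors: Boheng Li, Junjie Wang, Yiming Li, Zhiyang Hu, Leyi Qi, Jianshuo Dong, Run Wang, Han Qiu, Zhan Qin, Tianwei Zhang
Affiliations: Nanyang Technological University, Singapore; Wuhan University, China; Zhejiang University, China; Tsinghua University, China
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint version. Under review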
Abstract:Despite the integration of safety alignment and external filters, text-to-image (T2I) generative models are still susceptible to producing harmful content, such as sexual or violent imagery. This raises serious concerns about unintended exposure and potential misuse. Red teaming, which aims to proactively identify diverse prompts that can elicit unsafe outputs from the T2I system (including the core generative model as well as potential external safety filters and other processing components), is increasingly recognized as an essential method for assessing and improving safety before real-world deployment. Yet, existing automated red teaming approaches often treat prompt discovery as an isolated, prompt-level optimization task, which limits their scalability, diversity, and overall effectiveness. To bridge this gap, in this paper, we propose DREAM, a scalable red teaming framework to automatically uncover diverse problematic prompts from a given T2I system. Unlike most prior works that optimize prompts individually, DREAM directly models the probabilistic distribution of the target system’s problematic prompts, which enables explicit optimization over both effectiveness and diversity, and allows efficient large-scale sampling after training. To achieve this without direct access to representative training samples, we draw inspiration from energy-based models and reformulate the objective into simple and tractable objectives. We further introduce GC-SPSA, an efficient optimization algorithm that provides stable gradient estimates through the long and potentially non-differentiable T2I pipeline. The effectiveness of DREAM is validated through extensive experiments, demonstrating that it surpasses 9 state-of-the-art baselines by a notable margin across a broad range of T2I models and safety filters in terms of prompt success rate and diversity.
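GC-SPSA is described only by name in the abstract, so the sketch below shows plain SPSA (simultaneous perturbation stochastic approximation), the classical technique the name points to, and not the paper's variant. SPSA estimates a gradient from two evaluations of a black-box function along one random ±1 direction, which is what makes it usable when the pipeline is not differentiable. The function names, the step size `c`, and the averaging helper are assumptions for illustration.

```python
import random


def spsa_gradient(f, x, delta, c=1e-2):
    """One SPSA gradient estimate of f at x along perturbation delta.

    delta must contain only +1.0 and -1.0 entries (Rademacher directions).
    """
    plus = [xi + c * di for xi, di in zip(x, delta)]
    minus = [xi - c * di for xi, di in zip(x, delta)]
    # Only two evaluations of f are needed, whatever the dimension of x.
    slope = (f(plus) - f(minus)) / (2.0 * c)
    return [slope / di for di in delta]


def averaged_spsa_gradient(f, x, num_samples=100, c=1e-2, seed=0):
    """Average several single-direction estimates to reduce their variance."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(num_samples):
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
        estimate = spsa_gradient(f, x, delta, c)
        total = [t + e for t, e in zip(total, estimate)]
    return [t / num_samples for t in total]
```

A single estimate is noisy because every coordinate shares one scalar slope; averaging over directions is the simplest way to stabilize it, and stabilizing such estimates is the problem the abstract says GC-SPSA targets.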
[CV-46] M-SpecGene: Generalized Foundation Model for RGBT Multispectral Vision ICCV2025
[Quick read]: This paper aims to address the limited generalization of RGB-Thermal (RGBT) multispectral vision tasks caused by artificial inductive bias, modality bias, and a data bottleneck. The key to its solution is building the first Generalized RGBT MultiSpectral foundation model (M-SpecGene), which learns modality-invariant representations from large-scale cross-scene data in a self-supervised manner. To handle the information imbalance inherent in RGBT data, the authors propose the Cross-Modality Structural Sparsity (CMSS) metric and design the GMM-CMSS progressive masking strategy, enabling an easy-to-hard, object-centric pre-training process that significantly improves generalization across four downstream tasks on 11 datasets.
Link: https://arxiv.org/abs/2507.16318
Authors: Kailai Zhou,Fuqiang Yang,Shixian Wang,Bihan Wen,Chongde Zi,Linsen Chen,Qiu Shen,Xun Cao
Affiliations: Nanjing University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025
Abstract:RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene’s generalizability across eleven datasets for four RGBT downstream tasks. The code will be available at this https URL.
[CV-47] MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation
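[Quick read]: This paper addresses the difficulty existing text-to-video methods have in smoothly transferring motion from a reference object to a target object, especially when the two differ significantly in appearance or structure. The key to its solution is MotionShot, a training-free framework whose core mechanism is alignment at two levels: semantic feature matching first provides high-level alignment between the reference and target objects to ensure semantic consistency, and reference-to-target shape retargeting then establishes low-level morphological alignment, achieving high-fidelity motion transfer while preserving coherence in appearance. In addition, MotionShot encodes motion with temporal attention, so it can transfer motion coherently across objects even under significant appearance and structure disparities.
Link: https://arxiv.org/abs/2507.16310
Authors: Yanchen Liu,Yanan Sun,Zhening Xing,Junyao Gao,Kai Chen,Wenjie Pei
Affiliations: Harbin Institute of Technology (Shenzhen); Peng Cheng Laboratory; Shanghai AI Laboratory; Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: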
Abstract:Existing text-to-video methods struggle to transfer motion smoothly from a reference object to a target object with significant differences in appearance or structure between them. To address this challenge, we introduce MotionShot, a training-free framework capable of parsing reference-target correspondences in a fine-grained manner, thereby achieving high-fidelity motion transfer while preserving coherence in appearance. To be specific, MotionShot first performs semantic feature matching to ensure high-level alignments between the reference and target objects. It then further establishes low-level morphological alignments through reference-to-target shape retargeting. By encoding motion with temporal attention, our MotionShot can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities, demonstrated by extensive experiments. The project page is available at: this https URL.
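[CV-48] Towards Resilient Safety-driven Unlearning for Diffusion Models against Downstream Fine-tuning
[Quick read]: This paper addresses the breakdown of safety mechanisms in text-to-image (T2I) diffusion models during downstream fine-tuning. Although existing safety-driven unlearning methods can effectively suppress harmful behaviors inherited from toxic pretraining data, they are fragile under subsequent fine-tuning: safety degrades markedly even when the fine-tuning data is entirely benign. The key to the solution is the ResAlign framework, whose core innovations are: modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, which enables efficient gradient estimation for minimizing the recovery of harmful behaviors; and a meta-learning strategy that simulates diverse fine-tuning scenarios to improve the generalization of unlearning across downstream tasks. Experiments show that ResAlign consistently retains safety across a range of datasets and fine-tuning configurations while maintaining good benign generation performance.
Link: https://arxiv.org/abs/2507.16302
Authors: Boheng Li,Renjie Gu,Junjie Wang,Leyi Qi,Yiming Li,Run Wang,Zhan Qin,Tianwei Zhang
Affiliations: Nanyang Technological University; Central South University; Wuhan University; Zhejiang University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint version. Under review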
Abstract:Text-to-image (T2I) diffusion models have achieved impressive image generation quality and are increasingly fine-tuned for personalized applications. However, these models often inherit unsafe behaviors from toxic pretraining data, raising growing safety concerns. While recent safety-driven unlearning methods have made promising progress in suppressing model toxicity, they are identified to be fragile to downstream fine-tuning, where we reveal that state-of-the-art methods largely fail to retain their effectiveness even when fine-tuned on entirely benign datasets. To mitigate this problem, in this paper, we propose ResAlign, a safety-driven unlearning framework with enhanced resilience against downstream fine-tuning. By modeling downstream fine-tuning as an implicit optimization problem with a Moreau Envelope-based reformulation, ResAlign enables efficient gradient estimation to minimize the recovery of harmful behaviors. Additionally, a meta-learning strategy is proposed to simulate a diverse distribution of fine-tuning scenarios to improve generalization. Extensive experiments across a wide range of datasets, fine-tuning methods, and configurations demonstrate that ResAlign consistently outperforms prior unlearning approaches in retaining safety after downstream fine-tuning while preserving benign generation capability well.
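For readers unfamiliar with the term, the textbook definition of the Moreau envelope is given below, together with the proximal operator and the gradient identity that holds when f is proper, closed, and convex. This is the standard definition only; how ResAlign instantiates f, the smoothing parameter λ, and the variables is not stated in the abstract, so the symbols here are generic.

```latex
M_{\lambda f}(x) \;=\; \min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\,\lVert y - x\rVert_2^2 \Big\},
\qquad
\operatorname{prox}_{\lambda f}(x) \;=\; \arg\min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\,\lVert y - x\rVert_2^2 \Big\},
\qquad
\nabla M_{\lambda f}(x) \;=\; \tfrac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda f}(x)\big).
```

The last identity is what makes the envelope attractive for gradient estimation: the gradient at x only requires the minimizer of the regularized inner problem, not differentiation through the inner optimization steps.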
[CV-49] Dens3R: A Foundation Model for 3D Geometry Prediction
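[Quick read]: This paper addresses the difficulty of achieving accurate unified geometric prediction in dense 3D reconstruction. Existing methods typically predict only a single geometric quantity (such as depth or surface normals) from input images and overlook the inherent structural correlation among quantities, which yields inconsistent estimates and limits both accuracy and practical applicability. The key to its solution is Dens3R, a 3D foundation model trained with a two-stage framework. It uses a lightweight shared encoder-decoder backbone and position-interpolated rotary positional encoding to build a pointmap representation that is both generalizable and intrinsically invariant, and it combines image-pair matching features with intrinsic invariance modeling to jointly regress multiple geometric quantities such as depth and surface normals, achieving consistent geometry perception from single-view to multi-view inputs.
Link: https://arxiv.org/abs/2507.16290
Authors: Xianze Fang,Jingnan Gao,Zhe Wang,Zhuo Chen,Xingyu Ren,Jiangjing Lyu,Qiaomu Ren,Zhonglei Yang,Xiaokang Yang,Yichao Yan,Chengfei Lyu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL, Code: this https URL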
Abstract:Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
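The abstract mentions position-interpolated rotary positional encoding without giving details. The sketch below shows the general idea in one dimension, assuming the common formulation in which positions are rescaled by a factor `scale` (for example, training length divided by test length) before the rotary angles are computed. The function names, the 1D setting, and the default `base` are assumptions for illustration, not Dens3R's implementation.

```python
import math


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotation angles for one token position; scale < 1 interpolates positions."""
    half = dim // 2
    return [position * scale * base ** (-2.0 * i / dim) for i in range(half)]


def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive (even, odd) pairs of vec; len(vec) must be even."""
    angles = rope_angles(position, len(vec), base, scale)
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out
```

With `scale=0.5`, position 32 receives the same rotation as position 16 without scaling, so a model trained on sequences up to length 16 sees only angles inside its training range when run on inputs twice as long.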
[CV-50] Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition ICCV2025
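[Quick read]: This paper addresses the insufficient generalization of few-shot action recognition (FSAR) models caused by scarce training data. Conventional methods rely only on action labels for classification and therefore struggle to capture key structural information such as subtle posture changes, motion dynamics, and object interactions. The key to its solution is the Language-Guided Action Anatomy (LGA) framework, which uses Large Language Models (LLMs) to semantically decompose action labels into atomic action descriptions built from three elements: subject, motion, and object. A Visual Anatomy Module segments videos into atomic phases to model temporal structure, and a fine-grained fusion strategy integrates textual and visual features at the atomic level to produce more generalizable prototypes. Finally, a Multimodal Matching mechanism, comprising video-video and video-text matching, improves classification robustness in few-shot settings.
Link: https://arxiv.org/abs/2507.16287
Authors: Zefeng Qian,Xincheng Yao,Yifei Huang,Chongyang Zhang,Jiangyong Ying,Hong Sun
Affiliations: Shanghai Jiao Tong University; The University of Tokyo; Shanghai AI Laboratory; E-surfing Vision Technology Co., Ltd
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025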
Abstract:Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labeled samples per category. The scarcity of training data has driven recent efforts to incorporate additional modalities, particularly text. However, the subtle variations in human posture, motion dynamics, and object interactions that occur during different phases are critical inherent knowledge of actions that cannot be fully exploited by action labels alone. In this work, we propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics by leveraging Large Language Models (LLMs) to dissect the essential representational characteristics hidden beneath action labels. Guided by the prior knowledge encoded in the LLM, LGA effectively captures rich spatiotemporal cues in few-shot scenarios. Specifically, for text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions, focusing on the three core elements of action (subject, motion, object). For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions. A fine-grained fusion strategy then integrates textual and visual features at the atomic level, resulting in more generalizable prototypes. Finally, we introduce a Multimodal Matching mechanism, comprising both video-video and video-text matching, to ensure robust few-shot classification. Experimental results demonstrate that LGA achieves state-of-the-art performance across multiple FSAR benchmarks.
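[CV-51] MAN++: Scaling Momentum Auxiliary Network for Supervised Local Learning in Vision Tasks
[Quick read]: This paper addresses problems inherent to end-to-end backpropagation in deep learning, including update locking during parameter optimization, high GPU memory consumption, and a lack of biological plausibility. To overcome them, the authors propose the Momentum Auxiliary Network++ (MAN++), a supervised local learning scheme. Its core innovation is a dynamic interaction mechanism that uses the Exponential Moving Average (EMA) of parameters from adjacent blocks to strengthen inter-block information flow, together with a learnable scaling bias that corrects feature discrepancies between local blocks, effectively bridging the information gap between blocks. Experiments show that MAN++ achieves performance comparable to end-to-end training on image classification, object detection, and image segmentation while significantly reducing GPU memory usage, offering a new perspective on supervised local learning and a viable alternative to conventional training methods.
Link: https://arxiv.org/abs/2507.16279
Authors: Junhao Su,Feiyu Zhu,Hengyu Shi,Tianyang Han,Yurui Qiu,Junfeng Luo,Xiaoming Wei,Jialin Gao
Affiliations: MeiTuan; AttrSense
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages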
Abstract:Deep learning typically relies on end-to-end backpropagation for training, a method that inherently suffers from issues such as update locking during parameter optimization, high GPU memory consumption, and a lack of biological plausibility. In contrast, supervised local learning seeks to mitigate these challenges by partitioning the network into multiple local blocks and designing independent auxiliary networks to update each block separately. However, because gradients are propagated solely within individual local blocks, performance degradation occurs, preventing supervised local learning from supplanting end-to-end backpropagation. To address these limitations and facilitate inter-block information flow, we propose the Momentum Auxiliary Network++ (MAN++). MAN++ introduces a dynamic interaction mechanism by employing the Exponential Moving Average (EMA) of parameters from adjacent blocks to enhance communication across the network. The auxiliary network, updated via EMA, effectively bridges the information gap between blocks. Notably, we observed that directly applying EMA parameters can be suboptimal due to feature discrepancies between local blocks. To resolve this issue, we introduce a learnable scaling bias that balances feature differences, thereby further improving performance. We validate MAN++ through extensive experiments on tasks that include image classification, object detection, and image segmentation, utilizing multiple network architectures. The experimental results demonstrate that MAN++ achieves performance comparable to end-to-end training while significantly reducing GPU memory usage. Consequently, MAN++ offers a novel perspective for supervised local learning and presents a viable alternative to conventional training methods.
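The EMA step the abstract refers to can be illustrated with the generic exponential-moving-average parameter update. This is a minimal sketch on flat lists of floats: the `momentum` value and the function name are assumptions, and the paper's learnable scaling bias is deliberately not modeled.

```python
def ema_update(ema_params, new_params, momentum=0.99):
    """Move each EMA parameter a small step toward the corresponding new parameter."""
    return [momentum * e + (1.0 - momentum) * p
            for e, p in zip(ema_params, new_params)]


def track(ema_params, param_stream, momentum=0.99):
    """Apply the EMA update once per training step over a stream of parameter snapshots."""
    for params in param_stream:
        ema_params = ema_update(ema_params, params, momentum)
    return ema_params
```

A momentum close to 1 makes the auxiliary copy change slowly, so it reflects a smoothed history of the neighboring block rather than its latest, noisier state.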
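[CV-52] Understanding Generalization, Robustness, and Interpretability in Low-Capacity Neural Networks
[Quick read]: This paper studies the fundamental interplay between model capacity, sparsity, and robustness in low-capacity neural networks. The key to its approach is a controlled framework that builds a suite of binary classification tasks of increasing visual difficulty from the MNIST dataset (e.g., 0 and 1 vs. 4 and 9) to examine the relationship among the three systematically. Experiments show that the minimum model capacity required for successful generalization scales directly with task complexity; that trained networks remain high-performing under extreme magnitude pruning (up to 95% sparsity), revealing sparse, high-performing subnetworks; and that over-parameterization provides a significant robustness advantage against input corruption. The study provides a clear empirical demonstration of the foundational trade-offs governing simple neural networks.
Link: https://arxiv.org/abs/2507.16278
Authors: Yash Kumar
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages (10 pages main text). 18 figures (8 main, 10 appendix), 1 table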
Abstract:Although modern deep learning often relies on massive over-parameterized models, the fundamental interplay between capacity, sparsity, and robustness in low-capacity networks remains a vital area of study. We introduce a controlled framework to investigate these properties by creating a suite of binary classification tasks from the MNIST dataset with increasing visual difficulty (e.g., 0 and 1 vs. 4 and 9). Our experiments reveal three core findings. First, the minimum model capacity required for successful generalization scales directly with task complexity. Second, these trained networks are robust to extreme magnitude pruning (up to 95% sparsity), revealing the existence of sparse, high-performing subnetworks. Third, we show that over-parameterization provides a significant advantage in robustness against input corruption. Interpretability analysis via saliency maps further confirms that these identified sparse subnetworks preserve the core reasoning process of the original dense models. This work provides a clear, empirical demonstration of the foundational trade-offs governing simple neural networks.
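Magnitude pruning, as mentioned in the abstract, zeroes the weights with the smallest absolute values. A minimal sketch on a flat list of weights follows; treating the whole network as one global list and breaking ties by index order are assumptions, since the abstract does not say whether pruning is global or per-layer.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of weights with the smallest-magnitude fraction set to zero."""
    n = len(weights)
    k = round(sparsity * n)  # number of weights to zero out
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(n), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:k]:
        pruned[i] = 0.0
    return pruned


def sparsity_of(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)
```

At 95% sparsity only one weight in twenty survives, which is the regime in which the paper reports that performance is retained.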
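[CV-53] ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
[Quick read]: This paper addresses the high computational cost of deploying Vision Transformers (ViT) on resource-constrained devices, in particular the cost of self-attention. Existing token reduction methods usually discard low-importance tokens irreversibly, so those tokens cannot be reused in subsequent blocks, even though transformer blocks attend to different information and tokens dropped early may be useful later. The paper therefore proposes the Token Freezing and Reusing (ToFe) framework. Its key components are a prediction module that identifies the important tokens at each stage and an approximate module that recovers the frozen unimportant tokens for lagged reuse; computation budget-aware end-to-end training lets the model adaptively process the necessary tokens at each block, reducing computational cost while maintaining performance. Experiments show that ToFe reduces the computational cost of the LV-ViT model by 50% with less than a 2% drop in Top-1 accuracy, a better performance-complexity trade-off than prior methods.
Link: https://arxiv.org/abs/2507.16260
Authors: Haoyue Zhang,Jie Zhang,Song Guo
Affiliations: Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: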
Abstract:Although vision transformers (ViTs) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders their deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to enhance the efficiency of transformer models. However, existing methods handle unimportant tokens irreversibly, preventing their reuse in subsequent blocks. Considering that transformers focus on different information among blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models for resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, in this paper, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each stage and temporarily freeze the unimportant ones, allowing their lagged reusing at a later stage. Specifically, we design a prediction module for token identification and an approximate module for recovery of the frozen tokens. By jointly optimizing with the backbone through computation budget-aware end-to-end training, ToFe can adaptively process the necessary tokens at each block, thereby reducing computational cost while maintaining performance. Extensive experiments demonstrate that ToFe reduces the computational cost of the LV-ViT model by 50% with less than 2% drop in Top-1 accuracy, achieving a better trade-off between performance and complexity compared to state-of-the-art methods.
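The freeze-then-reuse idea can be sketched in a heavily simplified form: score the tokens, run a block only on the top-scoring ones, and carry the rest forward unchanged so they remain available to later blocks. Everything here is an assumption for illustration. In the paper the scores come from a learned prediction module and the frozen tokens pass through an approximate module, whereas this sketch takes `scores` as given and leaves frozen tokens untouched.

```python
def split_tokens(scores, keep_ratio):
    """Split token indices into kept (highest scores) and frozen (the rest)."""
    n = len(scores)
    n_keep = max(1, round(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(ranked[:n_keep])
    frozen_idx = sorted(ranked[n_keep:])
    return keep_idx, frozen_idx


def run_block_with_freezing(tokens, scores, keep_ratio, block):
    """Apply block to kept tokens only; frozen tokens keep their original positions."""
    keep_idx, frozen_idx = split_tokens(scores, keep_ratio)
    processed = block([tokens[i] for i in keep_idx])
    out = list(tokens)  # frozen tokens are carried forward, not discarded
    for i, tok in zip(keep_idx, processed):
        out[i] = tok
    return out, frozen_idx
```

Because the output keeps every position, a later block can re-score all tokens and pick up ones that were frozen earlier, which is the difference from irreversible token pruning.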
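[CV-54] Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models
[Quick read]: This paper addresses the insufficient robustness of pre-trained vision-language models (VLMs) to adversarial attacks, especially in zero-shot tasks, where existing adversarial training (AT) methods fail to exploit language to improve visual robustness. Specifically, supervised AT generates adversarial perturbations from class labels alone and therefore overfits to the object classes in the training data, while unsupervised AT avoids this overfitting but lacks semantic guidance and remains suboptimal against practical text-guided adversarial attacks. The key to the solution is Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which uses high-quality captions as semantic guidance so that adversarial examples are pushed away from the diverse semantics present in images. This strengthens the visual encoder's generalization under adversarial noise and improves robustness across a broader range of downstream tasks.
Link: https://arxiv.org/abs/2507.16257
Authors: Futa Waseda,Saku Sugawara,Isao Echizen
Affiliations: The University of Tokyo; National Institute of Informatics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACM MM 2025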
Abstract:Defending pre-trained vision-language models (VLMs), such as CLIP, against adversarial attacks is crucial, as these models are widely used in diverse zero-shot tasks, including image classification. However, existing adversarial training (AT) methods for robust fine-tuning largely overlook the role of language in enhancing visual robustness. Specifically, (1) supervised AT methods rely on short texts (e.g., class labels) to generate adversarial perturbations, leading to overfitting to object classes in the training data, and (2) unsupervised AT avoids this overfitting but remains suboptimal against practical text-guided adversarial attacks due to its lack of semantic guidance. To address these limitations, we propose Quality Text-guided Adversarial Fine-Tuning (QT-AFT), which leverages high-quality captions during training to guide adversarial examples away from diverse semantics present in images. This enables the visual encoder to robustly recognize a broader range of image features even under adversarial noise, thereby enhancing robustness across diverse downstream tasks. QT-AFT overcomes the key weaknesses of prior methods – overfitting in supervised AT and lack of semantic awareness in unsupervised AT – achieving state-of-the-art zero-shot adversarial robustness and clean accuracy, evaluated across 16 zero-shot datasets. Furthermore, our comprehensive study uncovers several key insights into the role of language in enhancing vision robustness; for example, describing object properties in addition to object names further enhances zero-shot robustness. Our findings point to an urgent direction for future work – centering high-quality linguistic supervision in robust visual representation learning.
[CV-55] Edge-case Synthesis for Fisheye Object Detection: A Data-centric Perspective
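[Quick read]: This paper addresses the performance drop of conventional object detection models under the significant distortion introduced by fisheye cameras. The key to its solution is a data-centric pipeline that systematically identifies the model's blind spots and then synthesizes data for the critical edge cases: error analysis first locates typical problems such as confusing class pairs, peripheral distortions, and underrepresented contexts; a fine-tuned image generative model, guided by carefully crafted prompts, then produces synthetic images that replicate real-world failure modes, which are pseudo-labeled with a high-quality detector and added to the training set. The approach yields consistent gains in fisheye object detection without changing the model architecture, showing the value of focusing on data weaknesses and fixing them selectively.
Link: https://arxiv.org/abs/2507.16254
Authors: Seunghyeon Kim,Kyeongryeol Go
Affiliations: Superb AI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 7 figures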
Abstract:Fisheye cameras introduce significant distortion and pose unique challenges to object detection models trained on conventional datasets. In this work, we propose a data-centric pipeline that systematically improves detection performance by focusing on the key question of identifying the blind spots of the model. Through detailed error analysis, we identify critical edge-cases such as confusing class pairs, peripheral distortions, and underrepresented contexts. Then we directly address them through edge-case synthesis. We fine-tuned an image generative model and guided it with carefully crafted prompts to produce images that replicate real-world failure modes. These synthetic images are pseudo-labeled using a high-quality detector and integrated into training. Our approach results in consistent performance gains, highlighting how deeply understanding data and selectively fixing its weaknesses can be impactful in specialized domains like fisheye object detection.
[CV-56] HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
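[Quick read]: This paper addresses the loss of contextual information and the fragmented vector outputs that arise when geographic objects are vectorized from large-size remote sensing imagery (RSI) with conventional methods restricted to small image patches. The key to its solution is the HoliTracer framework, which achieves holistic vectorization through two core components: the Context Attention Net (CAN), which uses a local-to-global attention mechanism to capture long-range contextual dependencies and improve segmentation accuracy; and a robust pipeline consisting of the Mask Contour Reformer (MCR), which reconstructs polygons, and the Polygon Sequence Tracer (PST), which traces vertices, producing continuous, complete vector geographic objects from large-size RSI end to end.
Link: https://arxiv.org/abs/2507.16251
Authors: Yu Wang,Bo Dang,Wanchun Li,Wei Chen,Yansheng Li
Affiliations: Wuhan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: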
Abstract:With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available at this https URL.
[CV-57] Scale Your Instructions: Enhance the Instruction-Following Fidelity of Unified Image Generation Model by Self-Adaptive Attention Scaling ICCV2025
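[Quick read]: This paper addresses "text instruction neglect" in unified image generation models such as OmniGen when a text instruction contains multiple sub-instructions: the model responds inadequately to some sub-instructions, which lowers instruction-following fidelity. The core solution is Self-Adaptive Attention Scaling (SaaS), which exploits the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction, increasing the model's attention to neglected sub-instructions and improving instruction-following fidelity without additional training or test-time optimization.
Link: https://arxiv.org/abs/2507.16240
Authors: Chao Zhou,Tianyi Wei,Nenghai Yu
Affiliations: University of Science and Technology of China; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025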
Abstract:Recent advancements in unified image generation models, such as OmniGen, have enabled the handling of diverse image generation and editing tasks within a single framework, accepting multimodal, interleaved texts and images in free form. This unified architecture eliminates the need for text encoders, greatly reducing model complexity and standardizing various image generation and editing tasks, making it more user-friendly. However, we found that it suffers from text instruction neglect, especially when the text instruction contains multiple sub-instructions. To explore this issue, we performed a perturbation analysis on the input to identify critical steps and layers. By examining the cross-attention maps of these key steps, we observed significant conflicts between neglected sub-instructions and the activations of the input image. In response, we propose Self-Adaptive Attention Scaling (SaaS), a method that leverages the consistency of cross-attention between adjacent timesteps to dynamically scale the attention activation for each sub-instruction. Our SaaS enhances instruction-following fidelity without requiring additional training or test-time optimization. Experimental results on instruction-based image editing and visual conditional image generation validate the effectiveness of our SaaS, showing superior instruction-following fidelity over existing methods. The code is available at this https URL.
[CV-58] Positive Style Accumulation: A Style Screening and Continuous Utilization Framework for Federated DG-ReID ACM-MM2025
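[Quick read]: This paper studies the limited generalization of models in Federated Domain Generalization for Person Re-identification (FedDG-ReID). Existing methods improve sample diversity through style transformation, which helps to some extent, but they do not distinguish whether a style helps or harms generalization. The authors define styles that benefit generalization as "positive styles" and harmful ones as "negative styles", and build the Style Screening and Continuous Utilization (SSCU) framework to screen and continuously use positive styles. The key innovations are: 1) a Generalization Gain-guided Dynamic Style Memory (GGDSM) with which each client model screens and accumulates positive styles; 2) a style memory recognition loss that makes full use of the positive styles held in memory; and 3) a Collaborative Style Training (CST) strategy that trains on newly generated styles and the positive styles in memory on two separate branches, which speeds up the client models' acquisition of new styles and ensures that positive styles are used continuously and thoroughly, improving generalization on both source and target domains.
Link: https://arxiv.org/abs/2507.16238
Authors: Xin Xu(1),Chaoyue Ren(1),Wei Liu(1),Wenke Huang(2),Bin Yang(2),Zhixi Yu(1),Kui Jiang(3) ((1) Wuhan University of Science and Technology, (2) Wuhan University, (3) Harbin Institute of Technology)
Affiliations: Wuhan University of Science and Technology; Wuhan University; Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, accepted at ACM MM 2025, Submission ID: 4394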
Abstract:Federated Domain Generalization for Person Re-identification (FedDG-ReID) aims to learn a global server model that can be effectively generalized to source and target domains through distributed source domain data. Existing methods mainly improve the diversity of samples through style transformation, which to some extent enhances the generalization performance of the model. However, we discover that not all styles contribute to the generalization performance. Therefore, we define styles that are beneficial or harmful to the model’s generalization performance as positive or negative styles. Based on this, a new question arises: how can positive styles be effectively screened and continuously utilized? To solve this problem, we propose a Style Screening and Continuous Utilization (SSCU) framework. Firstly, we design a Generalization Gain-guided Dynamic Style Memory (GGDSM) for each client model to screen and accumulate generated positive styles. Meanwhile, we propose a style memory recognition loss to fully leverage the positive styles stored in the memory. Furthermore, we propose a Collaborative Style Training (CST) strategy to make full use of positive styles. Unlike traditional learning strategies, our approach leverages both newly generated styles and the accumulated positive styles stored in memory to train client models on two distinct branches. This training strategy is designed to effectively promote the rapid acquisition of new styles by the client models, and guarantees the continuous and thorough utilization of positive styles, which is highly beneficial for the model’s generalization performance. Extensive experimental results demonstrate that our method outperforms existing methods in both the source domain and the target domain.
[CV-59] MONITRS: Multimodal Observations of Natural Incidents Through Remote Sensing
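[Quick read]: This paper addresses limitations of current machine learning models for disaster monitoring across multiple disaster types, including a narrow focus on specific disaster types, a lack of data with sufficient temporal granularity, and missing natural language annotations. The key to its solution is MONITRS, a new multimodal dataset of more than 10,000 US Federal Emergency Management Agency (FEMA) disaster events with temporal satellite imagery and natural language annotations from news articles, accompanied by geotagged locations and question-answer pairs. Fine-tuning existing Multimodal Large Language Models (MLLMs) on this dataset yields significant performance improvements on disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems.
Link: https://arxiv.org/abs/2507.16228
Authors: Shreelekha Revankar,Utkarsh Mall,Cheng Perng Phoo,Kavita Bala,Bharath Hariharan
Affiliations: Cornell University; Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 9 figures, 4 tables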
Abstract:Natural disasters cause devastating damage to communities and infrastructure every year. Effective disaster response is hampered by the difficulty of accessing affected areas during and after events. Remote sensing has allowed us to monitor natural disasters in a remote way. More recently, there have been advances in computer vision and deep learning that help automate satellite imagery analysis. However, these methods remain limited by their narrow focus on specific disaster types, reliance on manual expert interpretation, and lack of datasets with sufficient temporal granularity or natural language annotations for tracking disaster progression. We present MONITRS, a novel multimodal dataset of more than 10,000 FEMA disaster events with temporal satellite imagery and natural language annotations from news articles, accompanied by geotagged locations and question-answer pairs. We demonstrate that fine-tuning existing MLLMs on our dataset yields significant performance improvements for disaster monitoring tasks, establishing a new benchmark for machine learning-assisted disaster response systems. Code can be found at: this https URL
[CV-60] LDRFusion: A LiDAR-Dominant multimodal refinement framework for 3D object detection
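[Quick read]: This paper addresses inaccurate predictions in existing LiDAR-camera fusion methods for 3D object detection caused by the noise that pseudo point clouds introduce. The key to its solution is LDRFusion, a LiDAR-dominant two-stage refinement framework: the first stage relies solely on LiDAR to produce accurately localized proposals, avoiding interference from pseudo point clouds; the second stage incorporates pseudo point clouds to detect challenging instances, and the instance-level results of the two stages are merged. To better represent local structures in pseudo point clouds, the paper also designs a hierarchical pseudo point residual encoding module that encodes neighborhood sets using both feature and positional residuals.
Link: https://arxiv.org/abs/2507.16224
Authors: Jijun Wang,Yan Wu,Yujian Mo,Junqiao Zhao,Jun Yan,Yinghao Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: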
Abstract:Existing LiDAR-Camera fusion methods have achieved strong results in 3D object detection. To address the sparsity of point clouds, previous approaches typically construct spatial pseudo point clouds via depth completion as auxiliary input and adopt a proposal-refinement framework to generate detection results. However, introducing pseudo points inevitably brings noise, potentially resulting in inaccurate predictions. Considering the differing roles and reliability levels of each modality, we propose LDRFusion, a novel LiDAR-dominant two-stage refinement framework for multi-sensor fusion. The first stage solely relies on LiDAR to produce accurately localized proposals, followed by a second stage where pseudo point clouds are incorporated to detect challenging instances. The instance-level results from both stages are subsequently merged. To further enhance the representation of local structures in pseudo point clouds, we present a hierarchical pseudo point residual encoding module, which encodes neighborhood sets using both feature and positional residuals. Experiments on the KITTI dataset demonstrate that our framework consistently achieves strong performance across multiple categories and difficulty levels.
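The abstract says the instance-level results of the two stages are merged but does not say how. One plausible rule, sketched below purely as an assumption, keeps every first-stage (LiDAR-only) box and adds a second-stage box only when it does not overlap any first-stage box. Boxes are simplified to axis-aligned 2D rectangles `(x1, y1, x2, y2)`, whereas real 3D detectors use rotated 3D boxes; the function names and `iou_thresh` are hypothetical.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_stages(stage1_boxes, stage2_boxes, iou_thresh=0.5):
    """Keep all stage-1 boxes; add stage-2 boxes that overlap none of them."""
    merged = list(stage1_boxes)
    for box in stage2_boxes:
        # A stage-2 box is treated as a new instance only if it is not a duplicate.
        if all(iou_2d(box, kept) < iou_thresh for kept in stage1_boxes):
            merged.append(box)
    return merged
```

Giving the first stage priority mirrors the LiDAR-dominant framing: second-stage detections can only add instances the LiDAR-only stage missed, never replace its boxes.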
[CV-61] Advancing Visual Large Language Model for Multi-granular Versatile Perception ICCV2025
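[Quick read]: This paper addresses the lack of consistency and generalization in current research on visual perception tasks: existing methods usually focus on specific combinations of subtasks, which limits their applicability and flexibility across prediction types (box/mask) and instruction types (word-level/sentence-level). The key to its solution is MVP-LM, a multi-granular, versatile perception framework built on a Visual Large Language Model (VLLM). It combines a novel multi-granularity decoder with a Chain-of-Thought (CoT)-inspired dataset unification strategy to enable end-to-end supervised fine-tuning across tasks, and introduces a query enhancement mechanism to exploit the decoding and generative capabilities of VLLMs, achieving strong performance on diverse tasks such as panoptic segmentation, object detection, visual grounding, and referring expression segmentation.
Link: https://arxiv.org/abs/2507.16213
Authors: Wentao Xiang,Haoxian Tan,Cong Wei,Yujie Zhong,Dengjie Li,Yujiu Yang
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Meituan Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: To appear in ICCV 2025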
Abstract:Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVP-LM, a Multi-granular and Versatile Perception framework incorporating a Visual Large Language Model. Our framework is designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions within a single architecture. MVP-LM features an innovative multi-granularity decoder in conjunction with a CoT-inspired dataset unification strategy, enabling seamless supervised fine-tuning across a wide spectrum of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in VLLMs. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework. The code will be available at this https URL.
[CV-62] A Single-step Accurate Fingerprint Registration Method Based on Local Feature Matching
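[Quick read]: This paper addresses fingerprint registration failures caused by low-quality fingerprint images, in particular the failure of the initial registration in traditional two-step methods when too few minutiae are detected. The key to its solution is an end-to-end single-step fingerprint registration algorithm that directly predicts semi-dense matching-point correspondences between two fingerprint images, avoiding the minutiae-dependent initial registration step. It also introduces a global-local attention mechanism to achieve pixel-level alignment, which significantly improves registration robustness and final recognition performance.
Link: https://arxiv.org/abs/2507.16201
Authors: Yuwei Jia,Zhe Cui,Fei Su
Affiliations: Beijing University of Posts and Telecommunications; Beijing Key Laboratory of Network System and Network Culture
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: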
Abstract:Distortion of fingerprint images leads to a decline in fingerprint recognition performance, and fingerprint registration can mitigate this distortion by accurately aligning two fingerprint images. Currently, fingerprint registration methods often consist of two steps: an initial registration based on minutiae, and a dense registration based on matching points. However, when the quality of the fingerprint image is low, fewer minutiae are detected, leading to frequent failures in the initial registration, which ultimately causes the entire fingerprint registration process to fail. In this study, we propose an end-to-end single-step fingerprint registration algorithm that aligns two fingerprints by directly predicting the semi-dense matching-point correspondences between them. Thus, our method minimizes the risk of minutiae registration failure and also leverages global-local attention to achieve end-to-end pixel-level alignment between the two fingerprints. Experimental results show that our method can achieve state-of-the-art matching performance with only single-step registration, and it can also be used in conjunction with dense registration algorithms for further performance improvements.
[CV-63] LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
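[Quick read]: This paper addresses the difficulty that current Text-guided Image Editing (TIE) models have in balancing image quality, editing alignment, and consistency with the original image, as well as the limitations of existing evaluation benchmarks, which are small in scale or poorly aligned with human perception. The key to its solution is building EBench-18K, the first large-scale image editing evaluation benchmark, containing 18K edited images with fine-grained human preference annotations, and on that basis designing LMM4Edit, a unified evaluation metric based on large multimodal models (LMMs) that assesses TIE models along four dimensions: perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy. Experiments show that it agrees closely with human preference across multiple datasets and generalizes well in zero-shot settings.
Link: https://arxiv.org/abs/2507.16193
Authors: Zitong Xu,Huiyu Duan,Bingnan Liu,Guangji Ma,Jiarui Wang,Liu Yang,Shiqi Gao,Xiaoyu Wang,Jia Wang,Xiongkuo Min,Guangtao Zhai,Weisi Lin
Affiliations: Shanghai Jiao Tong University; University of Electronic Science and Technology of China; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: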
Abstract:The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics are limited in scale or in alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark including 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into assessing the alignment between the LMMs’ understanding ability and human preferences. Then, we propose LMM4Edit, an LMM-based metric for evaluating image Editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at this https URL.
[CV-64] Explicit Context Reasoning with Supervision for Visual Tracking
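[Quick read]: This paper addresses insufficient temporal consistency in cross-frame modeling for visual tracking, which stems from the lack of explicit supervision over context association; in particular, traditional methods that merely stack historical information struggle to model how the target state changes as it evolves. The key to its solution is the RSTrack framework, built on three cooperating mechanisms: (1) a Context Reasoning Mechanism, which turns unconstrained context association into a temporal reasoning process that predicts the current representation from historical target states, strengthening temporal consistency; (2) a Forward Supervision Strategy, which uses true target features as anchors to constrain the reasoning pipeline, guiding predictions toward the true target distribution and suppressing drift during context reasoning; and (3) Efficient State Modeling, which uses a compression-reconstruction mechanism to extract the target's core features, removing redundant information across frames and preventing ineffective context associations. Together, these mechanisms improve the accuracy and stability of context association and markedly reduce the divergence seen in traditional temporal modeling.
Link: https://arxiv.org/abs/2507.16191
Authors: Fansheng Zeng,Bineng Zhong,Haiying Xia,Yufei Tan,Xiantao Hu,Liangtao Shi,Shuxiang Song
Affiliations: Guangxi Normal University; Nanjing University of Science and Technology; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: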
Abstract:Contextual reasoning with constraints is crucial for enhancing temporal consistency in cross-frame modeling for visual tracking. However, mainstream tracking algorithms typically associate context by merely stacking historical information without explicitly supervising the association process, making it difficult to effectively model the target’s evolving dynamics. To alleviate this problem, we propose RSTrack, which explicitly models and supervises context reasoning via three core mechanisms. 1) Context Reasoning Mechanism: Constructs a target state reasoning pipeline, converting unconstrained contextual associations into a temporal reasoning process that predicts the current representation based on historical target states, thereby enhancing temporal consistency. 2) Forward Supervision Strategy: Utilizes true target features as anchors to constrain the reasoning pipeline, guiding the predicted output toward the true target distribution and suppressing drift in the context reasoning process. 3) Efficient State Modeling: Employs a compression-reconstruction mechanism to extract the core features of the target, removing redundant information across frames and preventing ineffective contextual associations. These three mechanisms collaborate to effectively alleviate the issue of contextual association divergence in traditional temporal modeling. Experimental results show that RSTrack achieves state-of-the-art performance on multiple benchmark datasets while maintaining real-time running speeds. Our code is available at this https URL.
[CV-65] AtrousMamaba: An Atrous-Window Scanning Visual State Space Model for Remote Sensing Change Detection
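[Quick read]: This paper addresses the insufficient capture of local information by current Mamba-based vision models in dense prediction tasks, noting in particular that it remains unclear whether they can extract fine-grained local features as well as convolutional neural networks (CNNs). The key to its solution is a novel Atrous-Window Selective Scan (AWSS) mechanism, which gradually expands the scanning range with adjustable dilation rates and thereby shortens the distance between adjacent tokens, so that the model can capture both fine-grained local detail and global context while retaining linear complexity. Based on this mechanism, the authors design two end-to-end Mamba frameworks, AWMambaBCD and AWMambaSCD, for binary change detection (BCD) and semantic change detection (SCD) respectively, and validate them on six benchmark datasets.
Link: https://arxiv.org/abs/2507.16172
Authors: Tao Wang,Tiecheng Bai,Chao Xu,Bin Liu,Erlei Zhang,Jiyun Huang,Hongming Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: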
Abstract:Recently, a novel visual state space (VSS) model, referred to as Mamba, has demonstrated significant progress in modeling long sequences with linear complexity, comparable to Transformer models, thereby enhancing its adaptability for processing visual data. Although most methods aim to enhance the global receptive field by directly modifying Mamba’s scanning mechanism, they tend to overlook the critical importance of local information in dense prediction tasks. Additionally, whether Mamba can effectively extract local features as convolutional neural networks (CNNs) do remains an open question that merits further investigation. In this paper, we propose a novel model, AtrousMamba, which effectively balances the extraction of fine-grained local details with the integration of global contextual information. Specifically, our method incorporates an atrous-window selective scan mechanism, enabling a gradual expansion of the scanning range with adjustable rates. This design shortens the distance between adjacent tokens, enabling the model to effectively capture fine-grained local features and global context. By leveraging the atrous window scan visual state space (AWVSS) module, we design dedicated end-to-end Mamba-based frameworks for binary change detection (BCD) and semantic change detection (SCD), referred to as AWMambaBCD and AWMambaSCD, respectively. Experimental results on six benchmark datasets show that the proposed framework outperforms existing CNN-based, Transformer-based, and Mamba-based methods. These findings clearly demonstrate that Mamba not only captures long-range dependencies in visual data but also effectively preserves fine-grained local details.
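To make the idea of a dilated window scan concrete, here is a minimal sketch of one possible scan order. It is an illustrative assumption and not the paper's AWSS implementation: the function names, the block layout, and the parameters `window` and `rate` are all hypothetical. The sketch only shows how reordering tokens can bring spatially related tokens closer together in the 1D sequence than a plain raster scan does.

```python
def atrous_window_order(height, width, window, rate):
    """Return row-major token ids in a dilated-window scan order.

    Hypothetical layout: the grid is split into blocks of side window * rate.
    Inside each block, tokens are visited as rate * rate interleaved
    sub-windows, each a window x window raster of tokens spaced `rate` apart.
    """
    block = window * rate
    order = []
    for block_y in range(0, height, block):
        for block_x in range(0, width, block):
            for offset_y in range(rate):
                for offset_x in range(rate):
                    for win_y in range(window):
                        for win_x in range(window):
                            y = block_y + offset_y + win_y * rate
                            x = block_x + offset_x + win_x * rate
                            if y < height and x < width:
                                order.append(y * width + x)
    return order


def sequence_gap(order, token_a, token_b):
    """Distance between two tokens in the scanned 1D sequence."""
    return abs(order.index(token_a) - order.index(token_b))


if __name__ == "__main__":
    order = atrous_window_order(8, 8, window=2, rate=2)
    # Tokens (0, 0) and (2, 0) are 16 positions apart in a raster scan
    # of an 8-wide grid; here they are `window` positions apart.
    print(sequence_gap(order, 0, 2 * 8 + 0))
```

With `rate=1` this reduces to an ordinary window-by-window scan; larger rates widen the area each window covers while keeping its tokens contiguous in the sequence.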
[CV-66] AMMNet: An Asymmetric Multi-Modal Network for Remote Sensing Semantic Segmentation
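[Quick read]: This paper addresses two problems in fusing RGB imagery with the Digital Surface Model (DSM) for multi-modal remote sensing semantic segmentation: increased computational complexity caused by architectural redundancy, and degraded segmentation performance caused by modality misalignment. The key to its solution is the Asymmetric Multi-Modal Network (AMMNet), whose core design comprises three modules: 1) the Asymmetric Dual Encoder (ADE), which allocates representational capacity according to modality characteristics, using a deeper encoder for RGB to extract rich contextual information and a lightweight encoder for DSM to extract sparse structural features; 2) the Asymmetric Prior Fuser (APF), which introduces a modality-aware prior matrix to generate structure-aware contextual features and promote modality alignment; and 3) the Distribution Alignment (DA) module, which improves cross-modal compatibility by minimizing the divergence between feature distributions. The approach maintains high accuracy while reducing computational and memory overhead.
Link: https://arxiv.org/abs/2507.16158
Authors: Hui Ye,Haodong Chen,Zeke Zexi Hu,Xiaoming Chen,Yuk Ying Chung
Affiliations: The University of Sydney; Beijing Technology and Business University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: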
Abstract:Semantic segmentation in remote sensing (RS) has advanced significantly with the incorporation of multi-modal data, particularly the integration of RGB imagery and the Digital Surface Model (DSM), which provides complementary contextual and structural information about the ground object. However, integrating RGB and DSM often faces two major limitations: increased computational complexity due to architectural redundancy, and degraded segmentation performance caused by modality misalignment. These issues undermine the efficiency and robustness of semantic segmentation, particularly in complex urban environments where precise multi-modal integration is essential. To overcome these limitations, we propose Asymmetric Multi-Modal Network (AMMNet), a novel asymmetric architecture that achieves robust and efficient semantic segmentation through three designs tailored for RGB-DSM input pairs. To reduce architectural redundancy, the Asymmetric Dual Encoder (ADE) module assigns representational capacity based on modality-specific characteristics, employing a deeper encoder for RGB imagery to capture rich contextual information and a lightweight encoder for DSM to extract sparse structural features. Besides, to facilitate modality alignment, the Asymmetric Prior Fuser (APF) integrates a modality-aware prior matrix into the fusion process, enabling the generation of structure-aware contextual features. Additionally, the Distribution Alignment (DA) module enhances cross-modal compatibility by aligning feature distributions through divergence minimization. Extensive experiments on the ISPRS Vaihingen and Potsdam datasets demonstrate that AMMNet attains state-of-the-art segmentation accuracy among multi-modal networks while reducing computational and memory requirements.
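The abstract says the Distribution Alignment module aligns feature distributions "through divergence minimization" but does not name the divergence. As a minimal sketch, assuming a symmetric Kullback-Leibler divergence between softmax-normalized feature vectors (an assumption, as are all function names below), an alignment loss could look like this:

```python
import math


def softmax(values):
    """Turn a feature vector into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(rgb_features, dsm_features):
    """Symmetric KL between the two modalities (hypothetical choice)."""
    p = softmax(rgb_features)
    q = softmax(dsm_features)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    print(alignment_loss([0.2, 1.5, -0.3], [1.5, 0.2, -0.3]))
```

The loss is zero when both modalities produce the same distribution and grows as they diverge, so minimizing it during training would pull the two feature distributions together.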
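[CV-67] LSSGen: Leveraging Latent Space Scaling in Flow and Diffusion for Efficient Text to Image Generation ICCV
[Quick read]: This paper addresses the artifacts and distortions that arise in conventional text-to-image generation when downscaling and upscaling are performed in pixel space; these problems typically occur when the upscaled image is re-encoded into the latent space, degrading final image quality. The key to its solution is the Latent Space Scaling Generation (LSSGen) framework, which performs resolution scaling directly in the latent space using a lightweight latent upsampler, without modifying the underlying Transformer or U-Net architecture, thereby improving visual quality and multi-resolution flexibility while remaining efficient.
Link: https://arxiv.org/abs/2507.16154
Authors: Jyun-Ze Tang,Chih-Fan Hsu,Jeng-Lin Li,Ming-Ching Chang,Wei-Chao Chen
Affiliations: Inventec Corporation; University at Albany, State University of New York
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV AIGENS 2025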
Abstract:Flow matching and diffusion models have shown impressive results in text-to-image generation, producing photorealistic images through an iterative denoising process. A common strategy to speed up synthesis is to perform early denoising at lower resolutions. However, traditional methods that downscale and upscale in pixel space often introduce artifacts and distortions. These issues arise when the upscaled images are re-encoded into the latent space, leading to degraded final image quality. To address this, we propose Latent Space Scaling Generation (LSSGen), a framework that performs resolution scaling directly in the latent space using a lightweight latent upsampler. Without altering the Transformer or U-Net architecture, LSSGen improves both efficiency and visual quality while supporting flexible multi-resolution generation. Our comprehensive evaluation covering text-image alignment and perceptual quality shows that LSSGen significantly outperforms conventional scaling approaches. When generating 1024×1024 images at similar speeds, it achieves up to 246% TOPIQ score improvement.
[CV-68] SPACT18: Spiking Human Action Recognition Benchmark Dataset with Complementary RGB and Thermal Modalities
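[Quick read]: This paper addresses the lack of multimodal benchmark datasets based on spike camera data for Video Action Recognition (VAR), with the aim of advancing efficient, low-power video understanding research for Spiking Neural Networks (SNNs). The key to its solution is building the first VAR dataset that includes spike camera data together with synchronized RGB and thermal modalities. By preserving the inherent sparsity and temporal precision of spiking data, it offers a unique experimental platform for multimodal video understanding and supports direct comparison of the spiking, RGB, and thermal modalities, encouraging the development of action recognition algorithms based on spike data.
Link: https://arxiv.org/abs/2507.16151
Authors: Yasser Ashraf,Ahmed Sharshar,Velibor Bojkovic,Bin Gu
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: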
Abstract:Spike cameras, bio-inspired vision sensors, asynchronously fire spikes by accumulating light intensities at each pixel, offering ultra-high energy efficiency and exceptional temporal resolution. Unlike event cameras, which record changes in light intensity to capture motion, spike cameras provide even finer spatiotemporal resolution and a more precise representation of continuous changes. In this paper, we introduce the first video action recognition (VAR) dataset using a spike camera, alongside synchronized RGB and thermal modalities, to enable comprehensive benchmarking for Spiking Neural Networks (SNNs). By preserving the inherent sparsity and temporal precision of spiking data, our three datasets offer a unique platform for exploring multimodal video understanding and serve as a valuable resource for directly comparing spiking, thermal, and RGB modalities. This work contributes a novel dataset that will drive research in energy-efficient, ultra-low-power video understanding, specifically for action recognition tasks using spike-based data.
[CV-69] LongSplat: Online Generalizable 3D Gaussian Splatting from Long Sequence Images
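[Quick read]: This paper addresses the limited applicability of 3D Gaussian Splatting to online long-sequence scenarios: existing methods either rely on slow per-scene optimization or cannot perform efficient incremental updates, making continuous performance hard to sustain. The key to its solution is a streaming update mechanism that integrates current-view observations while selectively compressing redundant historical Gaussians. Its core innovation is the Gaussian-Image Representation (GIR), which encodes 3D Gaussian parameters into a structured 2D image format, enabling efficient fusion of current and historical Gaussians as well as identity-aware redundancy compression, so that the model can handle long input sequences without significantly increasing memory or computational cost. The paper also uses an existing image compression method to guide the generation of more compact, higher-quality 3D Gaussians, achieving a better efficiency-quality balance than existing methods in real-time novel view synthesis.
Link: https://arxiv.org/abs/2507.16144
Authors: Guichen Huang,Ruoyu Wang,Xiangjun Gao,Che Sun,Yuwei Wu,Shenghua Gao,Yunde Jia
Affiliations: Beijing Institute of Technology; Shenzhen MSU-BIT University; Transcengram; The Hong Kong University of Science and Technology; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: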
Abstract:3D Gaussian Splatting achieves high-fidelity novel view synthesis, but its application to online long-sequence scenarios is still limited. Existing methods either rely on slow per-scene optimization or fail to provide efficient incremental updates, hindering continuous performance. In this paper, we propose LongSplat, an online real-time 3D Gaussian reconstruction framework designed for long-sequence image input. The core idea is a streaming update mechanism that incrementally integrates current-view observations while selectively compressing redundant historical Gaussians. Crucial to this mechanism is our Gaussian-Image Representation (GIR), a representation that encodes 3D Gaussian parameters into a structured, image-like 2D format. GIR simultaneously enables efficient fusion of current-view and historical Gaussians and identity-aware redundancy compression. These functions enable online reconstruction and adapt the model to long sequences without overwhelming memory or computational costs. Furthermore, we leverage an existing image compression method to guide the generation of more compact and higher-quality 3D Gaussians. Extensive evaluations demonstrate that LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction while reducing Gaussian counts by 44% compared to existing per-pixel Gaussian prediction methods.
[CV-70] Universal Wavelet Units in 3D Retinal Layer Segmentation
[Quick read]: This paper addresses the loss of spatial detail and the reduced structural consistency caused by conventional max-pooling in 3D retinal layer segmentation. The key to its solution is introducing tunable wavelet units (UwUs) and integrating them into a motion-corrected MGU-Net architecture, using three wavelet-based downsampling modules: OrthLattUwU, BiorthLattUwU, and LS-BiorthLattUwU. These modules use learnable lattice filter banks to preserve both low- and high-frequency features, improving segmentation accuracy and structural fidelity. Experiments on the Jacobs Retina Center (JRC) OCT dataset validate the method, with the LS-BiorthLattUwU module in particular yielding significant gains in Dice score and accuracy.
Link: https://arxiv.org/abs/2507.16119
Authors: An D. Le,Hung Nguyen,Melanie Tran,Jesse Most,Dirk-Uwe G. Bartsch,William R Freeman,Shyamanga Borooah,Truong Q. Nguyen,Cheolhong An
Affiliations: Jacobs School of Engineering, University of California San Diego, La Jolla, CA 92093, USA; Jacobs Retina Center, Shiley Eye Institute, University of California San Diego, La Jolla, CA 92093, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:This paper presents the first study to apply tunable wavelet units (UwUs) for 3D retinal layer segmentation from Optical Coherence Tomography (OCT) volumes. To overcome the limitations of conventional max-pooling, we integrate three wavelet-based downsampling modules, OrthLattUwU, BiorthLattUwU, and LS-BiorthLattUwU, into a motion-corrected MGU-Net architecture. These modules use learnable lattice filter banks to preserve both low- and high-frequency features, enhancing spatial detail and structural consistency. Evaluated on the Jacobs Retina Center (JRC) OCT dataset, our framework shows significant improvement in accuracy and Dice score, particularly with LS-BiorthLattUwU, highlighting the benefits of tunable wavelet filters in volumetric medical image segmentation.
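To illustrate why a lattice filter bank can retain detail that max-pooling discards, here is a minimal 1D sketch of a single-stage, two-channel orthogonal lattice. It is an assumption-laden simplification: the paper's units operate on 3D volumes and are likely multi-stage, and the function names and the single angle parameter `theta` are hypothetical. The rotation angle stands in for the learnable parameter; `theta = pi / 4` gives the Haar wavelet.

```python
import math


def lattice_analysis(signal, theta):
    """Downsample by 2 into low- and high-frequency channels via a rotation."""
    c, s = math.cos(theta), math.sin(theta)
    low, high = [], []
    for i in range(0, len(signal) - 1, 2):
        a, b = signal[i], signal[i + 1]
        low.append(c * a + s * b)
        high.append(-s * a + c * b)
    return low, high


def lattice_synthesis(low, high, theta):
    """Invert the analysis step using the transposed rotation."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for l, h in zip(low, high):
        out.append(c * l - s * h)
        out.append(s * l + c * h)
    return out


def max_pool(signal):
    """Conventional stride-2 max-pooling, shown for comparison."""
    return [max(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]


if __name__ == "__main__":
    x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
    low, high = lattice_analysis(x, math.pi / 4)
    print(lattice_synthesis(low, high, math.pi / 4))  # recovers x up to rounding
    print(max_pool(x))  # keeps one value per pair; the other is lost
```

Because the transform is orthogonal, the two channels together preserve signal energy and allow exact reconstruction, whereas max-pooling keeps one value per pair and cannot be inverted.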
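[CV-71] PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation
[Quick read]: This paper addresses a fundamental limitation of video diffusion models in temporal modeling, namely the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. This limitation creates bottlenecks in task adaptation, computational efficiency, and generalization for existing methods, such as high training cost, catastrophic forgetting, or narrow applicability. The key to its solution is a new paradigm called Pusa, built on Vectorized Timestep Adaptation (VTA), a non-destructive fine-tuning strategy that provides fine-grained temporal control without harming the base model's capabilities. VTA greatly reduces training resource requirements (at most 1/200 of the training cost and 1/2500 of the dataset size) and gives the model zero-shot multi-task capabilities (such as start-end frame control and video extension) while retaining text-to-video generation, offering a new path toward efficient, scalable, and versatile video synthesis.
Link: https://arxiv.org/abs/2507.16116
Authors: Yaofang Liu,Yumeng Ren,Aitor Artola,Yuxuan Hu,Xiaodong Cun,Xiaotong Zhao,Alan Zhao,Raymond H. Chan,Suiyun Zhang,Rui Liu,Dandan Tu,Jean-Michel Morel
Affiliations: City University of Hong Kong; The Chinese University of Hong Kong; Huawei Research; Great Bay University; AI Technology Center, Tencent PCG; Lingnan University; Hong Kong Centre for Cerebro-Cardiovascular Health Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is open-sourced at this https URL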
Abstract:The rapid advancement of video diffusion models has been hindered by fundamental limitations in temporal modeling, particularly the rigid synchronization of frame evolution imposed by conventional scalar timestep variables. While task-specific adaptations and autoregressive models have sought to address these challenges, they remain constrained by computational inefficiency, catastrophic forgetting, or narrow applicability. In this work, we present Pusa, a groundbreaking paradigm that leverages vectorized timestep adaptation (VTA) to enable fine-grained temporal control within a unified video diffusion framework. Besides, VTA is a non-destructive adaptation, which means it fully preserves the capabilities of the base model. By finetuning the SOTA Wan2.1-T2V-14B model with VTA, we achieve unprecedented efficiency – surpassing the performance of Wan-I2V-14B with ≤ 1/200 of the training cost ($500 vs. ≥ $100,000) and ≤ 1/2500 of the dataset size (4K vs. ≥ 10M samples). Pusa not only sets a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32% (vs. 86.86% of Wan-I2V-14B), but also unlocks many zero-shot multi-task capabilities such as start-end frames and video extension – all without task-specific training. Meanwhile, Pusa can still perform text-to-video generation. Mechanistic analyses reveal that our approach preserves the foundation model’s generative priors while surgically injecting temporal dynamics, avoiding the combinatorial explosion inherent to vectorized timesteps. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis, democratizing high-fidelity video generation for research and industry alike. Code is open-sourced at this https URL
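The contrast between a scalar timestep and a vectorized one can be sketched in a few lines. The following is a toy illustration and not Pusa's code: it assumes a flow-matching style interpolation `x_t = (1 - t) * x + t * noise` with `t = 1` meaning pure noise, and the schedule helpers are hypothetical. It shows how giving each frame its own timestep lets a conditioning frame stay clean while the remaining frames are noised.

```python
import random


def noise_frames(frames, timesteps, rng):
    """Apply per-frame noise levels; timesteps[i] in [0, 1] belongs to frames[i]."""
    if len(frames) != len(timesteps):
        raise ValueError("need exactly one timestep per frame")
    noisy = []
    for frame, t in zip(frames, timesteps):
        noisy.append([(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in frame])
    return noisy


def scalar_schedule(num_frames, t):
    """Conventional setting: every frame shares one timestep."""
    return [t] * num_frames


def image_to_video_schedule(num_frames, t):
    """Hypothetical vectorized setting: frame 0 is a clean conditioning image."""
    return [0.0] + [t] * (num_frames - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    noisy = noise_frames(frames, image_to_video_schedule(3, 1.0), rng)
    print(noisy[0])  # first frame is unchanged because its timestep is 0
```

Under the same assumption, a start-end frame task would set both the first and last entries of the timestep vector to 0, and video extension would zero the entries for the frames already available.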
[CV-72] Stop-band Energy Constraint for Orthogonal Tunable Wavelet Units in Convolutional Neural Networks for Computer Vision problems
[Quick read]: This paper addresses the limited performance of convolutional neural networks (CNNs) in image classification and anomaly detection on texture-rich datasets. The key to its solution is a stop-band energy constraint applied to orthogonal tunable wavelet units with a lattice structure, which improves the convolution, pooling, and downsampling operations and significantly strengthens the model's ability to represent and discriminate texture features.
Link: https://arxiv.org/abs/2507.16114
Authors: An D. Le,Hung Nguyen,Sungbal Seo,You-Suk Bae,Truong Q. Nguyen
Affiliations: Jacobs School of Engineering, University of California San Diego; Department of Computer Engineering, Tech University of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments:
Abstract:This work introduces a stop-band energy constraint for filters in orthogonal tunable wavelet units with a lattice structure, aimed at improving image classification and anomaly detection in CNNs, especially on texture-rich datasets. Integrated into ResNet-18, the method enhances convolution, pooling, and downsampling operations, yielding accuracy gains of 2.48% on CIFAR-10 and 13.56% on the Describable Textures dataset. Similar improvements are observed in ResNet-34. On the MVTec hazelnut anomaly detection task, the proposed method achieves competitive results in both segmentation and detection, outperforming existing approaches.
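As a rough illustration of what a stop-band energy term measures, the sketch below numerically integrates the squared magnitude response of a low-pass filter over a stop band `[omega_stop, pi]`. The function names, the midpoint-rule integration, and the choice of `omega_stop = pi / 2` are assumptions for illustration; the abstract does not describe how the authors compute or weight the constraint. The Haar and Daubechies-4 taps are standard values.

```python
import cmath
import math


def frequency_response(taps, omega):
    """Evaluate H(e^{j omega}) for an FIR filter with the given taps."""
    return sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps))


def stopband_energy(taps, omega_stop, samples=512):
    """Midpoint-rule estimate of the integral of |H|^2 over [omega_stop, pi]."""
    width = (math.pi - omega_stop) / samples
    total = 0.0
    for k in range(samples):
        omega = omega_stop + (k + 0.5) * width
        total += abs(frequency_response(taps, omega)) ** 2
    return total * width


HAAR = [1 / math.sqrt(2), 1 / math.sqrt(2)]
_R3, _SCALE = math.sqrt(3), 4 * math.sqrt(2)
DB4 = [(1 + _R3) / _SCALE, (3 + _R3) / _SCALE, (3 - _R3) / _SCALE, (1 - _R3) / _SCALE]

if __name__ == "__main__":
    print(stopband_energy(HAAR, math.pi / 2))
    print(stopband_energy(DB4, math.pi / 2))  # smaller: less leakage into the stop band
```

In training, a term of this kind could plausibly be added to the task loss with a weighting factor so that the learned low-pass filters leak less high-frequency energy; that usage is an assumption here.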
[CV-73] Improving Personalized Image Generation through Social Context Feedback
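[Quick read]: This paper addresses three problems in personalized image generation: incorrect human poses when generating complex activities (such as a man pushing a motorcycle), failure to preserve the identity of the reference person, and generated gaze directions that are unnatural or inconsistent with the scene description. The key to its solution is feedback-based fine-tuning, which uses state-of-the-art pose, human-object-interaction, facial recognition, and gaze-point estimation models to iteratively refine an existing diffusion model. It also introduces the different feedback modules at different timesteps depending on the signal level (low-level features such as pose versus high-level features such as gaze point), significantly improving the plausibility of human behavior, identity consistency, and overall visual quality of the generated images.
Link: https://arxiv.org/abs/2507.16095
Authors: Parul Gupta,Abhinav Dhall,Thanh-Toan Do
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: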
Abstract:Personalized image generation, where reference images of one or more subjects are used to generate their image according to a scene description, has gathered significant interest in the community. However, such generated images suffer from three major limitations: complex activities, such as "man, pushing, motorcycle", are not generated properly and show incorrect human poses; reference human identities are not preserved; and generated human gaze patterns are unnatural or inconsistent with the scene description. In this work, we propose to overcome these shortcomings through feedback-based fine-tuning of existing personalized generation methods, wherein state-of-the-art detectors of pose, human-object-interaction, human facial recognition and human gaze-point estimation are used to refine the diffusion model. We also propose timestep-based inculcation of different feedback modules, depending upon whether the signal is low-level (such as human pose) or high-level (such as gaze point). The images generated in this manner show an improvement in the generated interactions, facial identities and image quality over three benchmark datasets.
[CV-74] Disrupting Semantic and Abstract Features for Better Adversarial Transferability
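[Quick read]: This paper addresses the limited transferability of adversarial examples against deep neural networks (DNNs) in the black-box setting, where an attacker who does not know the target model struggles to craft highly transferable adversarial examples. Existing feature-level attacks improve transferability mainly by perturbing intermediate features, but they compute the weight matrix largely from the semantic information of the image and overlook the tendency of CNNs to focus on high-frequency components (abstract features such as texture and edges). The key contribution is a balanced strategy called Semantic and Abstract FEatures disRuption (SAFER): when computing the feature importance weight matrix, it applies BLOCKMIX to the input image to highlight semantic features and SELF-MIX to the frequency spectrum to emphasize abstract features, steering the attack to disrupt both semantic and abstract features and significantly improving cross-model transferability.
Link: https://arxiv.org/abs/2507.16052
Authors: Yuyang Luo,Xiaosen Wang,Zhijin Ge,Yingzhe He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: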
Abstract:Adversarial examples pose significant threats to deep neural networks (DNNs), and their property of transferability in the black-box setting has led to the emergence of transfer-based attacks, making it feasible to target real-world applications employing DNNs. Among them, feature-level attacks, where intermediate features are perturbed based on a feature importance weight matrix computed from transformed images, have gained popularity. In this work, we find that existing feature-level attacks primarily manipulate the semantic information to derive the weight matrix. Inspired by several works that find CNNs tend to focus more on high-frequency components (a.k.a. abstract features, e.g., texture, edge, etc.), we validate that transforming images in the high-frequency space also improves transferability. Based on this finding, we propose a balanced approach called Semantic and Abstract FEatures disRuption (SAFER). Specifically, SAFER conducts BLOCKMIX on the input image and SELF-MIX on the frequency spectrum when computing the weight matrix to highlight crucial features. By using such a weight matrix, we can direct the attacker to disrupt both semantic and abstract features, leading to improved transferability. Extensive experiments on the ImageNet dataset also demonstrate the effectiveness of our method in boosting adversarial transferability.
[CV-75] Discovering and using Spelke segments
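[Quick read]: This paper addresses the fact that segmentation in computer vision relies on category-specific conventions and struggles to capture the general object boundaries of the physical world, whereas human perception is based on Spelke objects, groupings of things that move together consistently under physical forces. Its core solution is SpelkeNet, a visual world model based on causal motion relationships that extracts Spelke segments by predicting distributions over future motion. The key innovation is a pair of motion representations: the motion affordance map, which identifies regions likely to move under an external poke, and the expected-displacement map, which captures how the rest of the scene responds. These are combined in "statistical counterfactual probing", where diverse virtual pokes are applied to high-affordance regions and Spelke segments are defined as aggregates of correlated motion statistics. Experiments show that the method outperforms supervised baselines such as SegmentAnything on SpelkeBench and improves downstream model performance on the 3DEditBench physical manipulation task.
Link: https://arxiv.org/abs/2507.16038
Authors: Rahul Venkatesh,Klemen Kotar,Lilian Naing Chen,Seungwoo Kim,Luca Thomas Wheeler,Jared Watrous,Ashley Xu,Gia Ancone,Wanhee Lee,Honglin Chen,Daniel Bear,Stefan Stojanov,Daniel Yamins
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page at: this https URL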
Abstract:Segments in computer vision are often defined by semantic considerations and are highly dependent on category-specific conventions. In contrast, developmental psychology suggests that humans perceive the world in terms of Spelke objects: groupings of physical things that reliably move together when acted on by physical forces. Spelke objects thus operate on category-agnostic causal motion relationships, which potentially better support tasks like manipulation and planning. In this paper, we first benchmark the Spelke object concept, introducing the SpelkeBench dataset that contains a wide variety of well-defined Spelke segments in natural images. Next, to extract Spelke segments from images algorithmically, we build SpelkeNet, a class of visual world models trained to predict distributions over future motions. SpelkeNet supports estimation of two key concepts for Spelke object discovery: (1) the motion affordance map, identifying regions likely to move under a poke, and (2) the expected-displacement map, capturing how the rest of the scene will move. These concepts are used for “statistical counterfactual probing”, where diverse “virtual pokes” are applied on regions of high motion-affordance, and the resultant expected displacement maps are used to define Spelke segments as statistical aggregates of correlated motion statistics. We find that SpelkeNet outperforms supervised baselines like SegmentAnything (SAM) on SpelkeBench. Finally, we show that the Spelke concept is practically useful for downstream applications, yielding superior performance on the 3DEditBench benchmark for physical object manipulation when used in a variety of off-the-shelf object manipulation models.
[CV-76] Improved Semantic Segmentation from Ultra-Low-Resolution RGB Images Applied to Privacy-Preserving Object-Goal Navigation
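[Quick read]: This paper addresses how a mobile robot can perceive and make decisions effectively for downstream tasks (such as semantic object-goal navigation) without compromising users' visual privacy. Existing methods typically trade task performance against privacy protection, whereas this work proposes a semantic-based robot navigation strategy at ultra-low resolution to improve task performance while preserving privacy. The key to its solution is a fully joint-learning framework that integrates an agglomerative feature extractor and a segmentation-aware discriminator to recover high-quality semantic segmentation from ultra-low-resolution RGB images, which in turn supports privacy-preserving semantic object-goal navigation.
Link: https://arxiv.org/abs/2507.16034
Authors: Xuying Huang,Sicong Pan,Olga Zatsarynna,Juergen Gall,Maren Bennewitz
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to RA-L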
Abstract:User privacy in mobile robotics has become a critical concern. Existing methods typically prioritize either the performance of downstream robotic tasks or privacy protection, with the latter often constraining the effectiveness of task execution. To jointly address both objectives, we study semantic-based robot navigation in an ultra-low-resolution setting to preserve visual privacy. A key challenge in such scenarios is recovering semantic segmentation from ultra-low-resolution RGB images. In this work, we introduce a novel fully joint-learning method that integrates an agglomerative feature extractor and a segmentation-aware discriminator to solve ultra-low-resolution semantic segmentation, thereby enabling privacy-preserving, semantic object-goal navigation. Our method outperforms different baselines on ultra-low-resolution semantic segmentation and our improved segmentation results increase the success rate of the semantic object-goal navigation in a real-world privacy-constrained scenario.
[CV-77] Artifacts and Attention Sinks: Structured Approximations for Efficient Vision Transformers
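[Quick read]: This paper addresses the high computational cost of self-attention in Vision Transformers and the poorly understood mechanisms that regulate information flow within them. The key is to identify and exploit two special kinds of tokens: massive tokens (attention sinks with exceptionally high activation norms) and artifact tokens (byproducts that emerge during inference). The authors find that these tokens suppress one another through the attention mechanism, forming structured patterns that regulate information flow inside the model. Building on this, they propose Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space, together with a masking strategy that suppresses noise from these tokens, yielding an efficient method with stable performance gains.
Link: https://arxiv.org/abs/2507.16018
Authors: Andrew Lu,Wentinn Liao,Liuhui Wang,Huzheng Yang,Jianbo Shi
Affiliations: University of Pennsylvania
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: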
Abstract:Vision transformers have emerged as a powerful tool across a wide range of applications, yet their inner workings remain only partially understood. In this work, we examine the phenomenon of massive tokens - tokens with exceptionally high activation norms that act as attention sinks - and artifact tokens that emerge as a byproduct during inference. Our analysis reveals that these tokens mutually suppress one another through the attention mechanism, playing a critical role in regulating information flow within the network. Leveraging these insights, we introduce Fast Nyström Attention (FNA), a training-free method that approximates self-attention in linear time and space by exploiting the structured patterns formed by massive and artifact tokens. Additionally, we propose a masking strategy to mitigate noise from these tokens, yielding modest performance gains at virtually no cost. We evaluate our approach on popular pretrained vision backbones and demonstrate competitive performance on retrieval, classification, segmentation, and visual question answering (VQA), all while reducing computational overhead.
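To make the "linear time and space" claim concrete, here is a minimal sketch of a generic Nyström-style attention approximation in NumPy. It is not the authors' FNA: the function names are hypothetical, and the evenly spaced landmark choice is a placeholder assumption (the abstract suggests FNA instead exploits the structure formed by massive and artifact tokens).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def exact_attention(q, k, v):
    """Standard softmax attention: quadratic in the number of tokens n."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    return softmax(q @ k.T * scale) @ v

def nystrom_attention(q, k, v, landmark_idx):
    """Nystrom-style approximation built from m landmark tokens.

    Only (n, m), (m, m) and (m, n) matrices are formed, so for a fixed
    number of landmarks m the cost grows linearly with n.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    q_l, k_l = q[landmark_idx], k[landmark_idx]
    f = softmax(q @ k_l.T * scale)    # all queries vs landmark keys, (n, m)
    a = softmax(q_l @ k_l.T * scale)  # landmark queries vs landmark keys, (m, m)
    b = softmax(q_l @ k.T * scale)    # landmark queries vs all keys, (m, n)
    # Multiply right to left so that no (n, n) matrix is ever materialised.
    return f @ (np.linalg.pinv(a) @ (b @ v))

rng = np.random.default_rng(0)
n, d, m = 64, 16, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
landmarks = np.arange(0, n, n // m)   # hypothetical choice: evenly spaced tokens
approx = nystrom_attention(q, k, v, landmarks)
error = np.abs(approx - exact_attention(q, k, v)).mean()
print(approx.shape, error)
```

When every token is used as a landmark the approximation collapses to exact attention; with fewer landmarks the error depends on how well the landmarks cover the attention pattern, which is where a structured choice of tokens would matter.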
[CV-78] Is Tracking really more challenging in First Person Egocentric Vision? ICCV
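[Quick read]: The question this paper asks is whether the performance drop of visual object tracking and segmentation in egocentric (first-person) vision stems from the unique first-person viewpoint or from the broader domain of human-object interaction activities. To answer it, the paper proposes a new benchmark evaluation strategy whose key element is an experimental framework that disentangles the first-person viewpoint factor from the human-object activity domain factor. This separates the two sources of difficulty more precisely, reveals where the real difficulty lies, and gives follow-up research a more targeted direction.
Link: https://arxiv.org/abs/2507.16015
Authors: Matteo Dunnhofer,Zaira Manigrasso,Christian Micheloni
Affiliations: University of Udine; York University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2025 IEEE/CVF International Conference on Computer Vision (ICCV)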
Abstract:Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements on this task.
[CV-79] FW-VTON: Flattening-and-Warping for Person-to-Person Virtual Try-on
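[Quick read]: This paper addresses the limitation that traditional virtual try-on methods only handle the garment-to-person task and struggle with the person-to-person task, where the only inputs are two images: one of the target person and one of the garment worn by another person. The core challenge is to fuse the garment with the target person naturally and accurately without extra annotations or a flat garment representation. The key to the solution is the Flattening-and-Warping Virtual Try-On (FW-VTON) method, which proceeds in three stages: it first extracts a flattened garment image from the source image, then warps that garment to align with the target pose, and finally integrates it seamlessly into the target person image. The authors also build a new dataset dedicated to person-to-person try-on to overcome the scarcity of high-quality training data, and experiments show state-of-the-art performance in both qualitative and quantitative evaluations.
Link: https://arxiv.org/abs/2507.16010
Authors: Zheng Wang,Xianbing Sun,Shengyi Wu,Jiahui Zhan,Jianlou Si,Chi Zhang,Liqing Zhang,Jianfu Zhang
Affiliations: Shanghai Jiao Tong University; TeleAI, China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: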
Abstract:Traditional virtual try-on methods primarily focus on the garment-to-person try-on task, which requires flat garment representations. In contrast, this paper introduces a novel approach to the person-to-person try-on task. Unlike the garment-to-person try-on task, the person-to-person task only involves two input images: one depicting the target person and the other showing the garment worn by a different individual. The goal is to generate a realistic combination of the target person with the desired garment. To this end, we propose Flattening-and-Warping Virtual Try-On (FW-VTON), a method that operates in three stages: (1) extracting the flattened garment image from the source image; (2) warping the garment to align with the target pose; and (3) integrating the warped garment seamlessly onto the target person. To overcome the challenges posed by the lack of high-quality datasets for this task, we introduce a new dataset specifically designed for person-to-person try-on scenarios. Experimental evaluations demonstrate that FW-VTON achieves state-of-the-art performance, with superior results in both qualitative and quantitative assessments, and also excels in garment extraction subtasks.
[CV-80] Semantic-Aware Gaussian Process Calibration with Structured Layerwise Kernels for Deep Neural Networks
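[Quick read]: This paper addresses the difficulty of accurately quantifying the reliability of neural network classifiers' predictions at inference time, and in particular the fact that conventional Gaussian Process (GP) calibration methods cannot capture the internal hierarchical structure of deep neural networks, which limits both interpretability and the assessment of predictive reliability. The key to the solution is the Semantic-Aware Layer-wise Gaussian Process (SAL-GP) framework, which applies a local calibration correction to each layer's feature representation and couples the layerwise GP models through a structured multi-layer kernel, enabling joint marginalization across layers. This captures both local semantic dependencies and global calibration coherence while consistently propagating predictive uncertainty, improving the model's interpretability and uncertainty quantification.
Link: https://arxiv.org/abs/2507.15987
Authors: Kyung-hwan Lee,Kyung-tae Kim
Affiliations: Pohang University of Science and Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: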
Abstract:Calibrating the confidence of neural network classifiers is essential for quantifying the reliability of their predictions during inference. However, conventional Gaussian Process (GP) calibration methods often fail to capture the internal hierarchical structure of deep neural networks, limiting both interpretability and effectiveness for assessing predictive reliability. We propose a Semantic-Aware Layer-wise Gaussian Process (SAL-GP) framework that mirrors the layered architecture of the target neural network. Instead of applying a single global GP correction, SAL-GP employs a multi-layer GP model, where each layer’s feature representation is mapped to a local calibration correction. These layerwise GPs are coupled through a structured multi-layer kernel, enabling joint marginalization across all layers. This design allows SAL-GP to capture both local semantic dependencies and global calibration coherence, while consistently propagating predictive uncertainty through the network. The resulting framework enhances interpretability aligned with the network architecture and enables principled evaluation of confidence consistency and uncertainty quantification in deep models.
[CV-81] A Lightweight Face Quality Assessment Framework to Improve Face Verification Performance in Real-Time Screening Applications
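[Quick read]: This paper addresses the negative effect of low-quality face images on the accuracy and reliability of face recognition systems, particularly in real-time screening scenarios such as surveillance, identity verification, and access control, where motion blur, poor lighting, occlusion, and extreme pose variation raise the False Rejection Rate (FRR) and degrade recognition performance. The key to the solution is a lightweight, efficient framework for automatic face quality assessment that scores image quality from normalised facial landmark features with a Random Forest Regression classifier, reaching 96.67% accuracy. The module can be placed ahead of the verification pipeline to filter out low-quality images; with the ArcFace model it reduces FRR by 99.7% and raises cosine similarity scores, while mitigating two common challenges in real surveillance settings: variations in face resolution and pose deviations.
Link: https://arxiv.org/abs/2507.15961
Authors: Ahmed Aman Ibrahim,Hamad Mansour Alawar,Abdulnasser Abbas Zehi,Ahmed Mohammad Alkendi,Bilal Shafi Ashfaq Ahmed Mirza,Shan Ullah,Ismail Lujain Jaleel,Hassan Ugail
Affiliations: theCircle Ltd; General Department of Forensic Science and Criminology; Dubai Police Headquarters; Centre for Visual Computing and Intelligent Systems; University of Bradford
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: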
Abstract:Face image quality plays a critical role in determining the accuracy and reliability of face verification systems, particularly in real-time screening applications such as surveillance, identity verification, and access control. Low-quality face images, often caused by factors such as motion blur, poor lighting conditions, occlusions, and extreme pose variations, significantly degrade the performance of face recognition models, leading to higher false rejection and false acceptance rates. In this work, we propose a lightweight yet effective framework for automatic face quality assessment, which aims to pre-filter low-quality face images before they are passed to the verification pipeline. Our approach utilises normalised facial landmarks in conjunction with a Random Forest Regression classifier to assess image quality, achieving an accuracy of 96.67%. By integrating this quality assessment module into the face verification process, we observe a substantial improvement in performance, including a 99.7% reduction in the false rejection rate and enhanced cosine similarity scores when paired with the ArcFace face verification model. To validate our approach, we have conducted experiments on a real-world dataset of over 600 subjects captured from CCTV footage in unconstrained environments within Dubai Police. Our results demonstrate that the proposed framework effectively mitigates the impact of poor-quality face images, outperforming existing face quality assessment techniques while maintaining computational efficiency. Moreover, the framework specifically addresses two critical challenges in real-time screening: variations in face resolution and pose deviations, both of which are prevalent in practical surveillance scenarios.
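The pre-filtering idea in this abstract can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's pipeline: the landmark normalisation scheme, the function names, and both thresholds are hypothetical, and the quality score is taken as an input rather than produced by a trained random forest.

```python
import math

def normalise_landmarks(points):
    """Centre landmarks on their centroid and scale by the larger bounding-box side.

    Assumed normalisation; it makes the features independent of face position and size.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
    extent = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    return [((x - cx) / extent, (y - cy) / extent) for x, y in points]

def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def screen_and_verify(quality, probe, reference, min_quality=0.5, min_similarity=0.6):
    """Skip low-quality probes before verification; thresholds are hypothetical."""
    if quality < min_quality:
        return "skipped_low_quality"
    if cosine_similarity(probe, reference) >= min_similarity:
        return "accept"
    return "reject"

print(normalise_landmarks([(10, 20), (30, 40)]))
print(screen_and_verify(0.2, [1.0, 0.0], [1.0, 0.0]))
print(screen_and_verify(0.9, [1.0, 0.0], [0.0, 1.0]))
```

Skipping a poor capture instead of rejecting it is what lowers the false rejection rate: the genuine user is never compared against the reference using an image that would score badly.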
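[CV-82] An empirical study for the early detection of Mpox from skin lesion images using pretrained CNN models leveraging XAI technique
[Quick read]: This paper addresses the difficulty of diagnosing Mpox early: its clinical presentation resembles other skin conditions, so traditional methods are prone to misdiagnosis. The key to the solution is transfer learning with pre-trained Convolutional Neural Networks (CNNs), freezing the initial layers and adding custom classification layers to adapt to Mpox image recognition, combined with Explainable Artificial Intelligence (XAI) techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) to improve interpretability. Experiments show that InceptionV3 reaches 95% accuracy on the binary dataset and MobileNetV2 reaches 93% on the multi-class dataset, and that Grad-CAM localizes key lesion regions, supporting the potential of deep learning to improve diagnostic efficiency and transparency.
Link: https://arxiv.org/abs/2507.15915
Authors: Mohammad Asifur Rahim,Muhammad Nazmul Arefin,Md. Mizanur Rahman,Md Ali Hossain,Ahmed Moustafa
Affiliations: King Fahd University of Petroleum and Minerals (KFUPM); Daffodil International University; University of Johannesburg; Bond University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: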
Abstract:Context: Mpox is a zoonotic disease caused by the Mpox virus, which shares similarities with other skin conditions, making accurate early diagnosis challenging. Artificial intelligence (AI), especially Deep Learning (DL), has become a strong tool for medical image analysis; however, the use of pre-trained models like CNNs together with XAI techniques for mpox detection remains underexplored. Objective: This study aims to evaluate the effectiveness of pre-trained CNN models (VGG16, VGG19, InceptionV3, MobileNetV2) for the early detection of monkeypox using binary and multi-class datasets. It also seeks to enhance model interpretability using Grad-CAM, an XAI technique. Method: Two datasets, MSLD and MSLD v2.0, were used for training and validation. Transfer learning techniques were applied to fine-tune pre-trained CNN models by freezing initial layers and adding custom layers that adapt the final features to the mpox detection task and help avoid overfitting. Model performance was evaluated using metrics such as accuracy, precision, recall, F1-score and ROC. Grad-CAM was utilized for visualizing critical features. Results: InceptionV3 demonstrated the best performance on the binary dataset with an accuracy of 95%, while MobileNetV2 performed best on the multi-class dataset with an accuracy of 93%. Grad-CAM successfully highlighted key image regions. Despite high accuracy, some models showed overfitting tendencies, as evidenced by discrepancies between training and validation losses. Conclusion: This study underscores the potential of pre-trained CNN models in monkeypox detection and the value of XAI techniques. Future work should address dataset limitations, incorporate multimodal data, and explore additional interpretability techniques to improve diagnostic reliability and model transparency.
[CV-83] Local Dense Logit Relations for Enhanced Knowledge Distillation ICCV2025
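[Quick read]: This paper addresses the shortcomings of existing logit-based knowledge distillation methods in capturing fine-grained relationships between classes, in particular their insufficient use of the discriminative information between critical class pairs. The key to the solution is Local Dense Relational Logit Distillation (LDRLD), which explicitly models fine-grained inter-class relationships by recursively decoupling and recombining logit information, together with an Adaptive Decay Weight (ADW) strategy that combines Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD) to dynamically adjust the weights of important class pairs and so strengthen the learning of critical relationships. After the recursive decoupling, the remaining non-target knowledge is also distilled to keep the transferred knowledge complete, improving the student model's performance.
Link: https://arxiv.org/abs/2507.15911
Authors: Liuchi Xu,Kang Liu,Jinshuai Liu,Lu Wang,Lisheng Xu,Jun Cheng
Affiliations: Northeastern University, China; South China Normal University, China; Shenzhen Institutes of Advanced Technology, CAS, China; The Chinese University of Hong Kong, Hong Kong, China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025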
Abstract:State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize the performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student’s performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based distillation approaches. The code will be made publicly available.
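As a rough illustration of the phrase "weights inversely proportional to the rank differences between pairs", the sketch below ranks classes by teacher logit and weights each class pair by the reciprocal of its rank gap. The exact IRW formula is not given in the abstract, so the `1 / |rank_i - rank_j|` form and the function names are assumptions.

```python
def class_ranks(logits):
    """Rank classes by descending logit: rank 0 is the highest-scoring class."""
    order = sorted(range(len(logits)), key=lambda c: logits[c], reverse=True)
    ranks = [0] * len(logits)
    for rank, cls in enumerate(order):
        ranks[cls] = rank
    return ranks

def inverse_rank_weights(logits):
    """Weight each class pair (i, j), i < j, by 1 / |rank_i - rank_j| (assumed form)."""
    ranks = class_ranks(logits)
    weights = {}
    for i in range(len(logits)):
        for j in range(i + 1, len(logits)):
            # Ranks are distinct, so the gap is always at least 1.
            weights[(i, j)] = 1.0 / abs(ranks[i] - ranks[j])
    return weights

teacher_logits = [3.0, 1.0, 2.0]
print(class_ranks(teacher_logits))
print(inverse_rank_weights(teacher_logits))
```

Under this assumed form, classes that sit next to each other in the teacher's ranking (the pairs most easily confused) receive the largest weight, and distant pairs are down-weighted.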
[CV-84] PAT: a cautionary tale about generative visual augmentation for Object Re-identification
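[Quick read]: This paper examines whether generative data augmentation is effective for fine-grained visual recognition, specifically object re-identification (ReID). Generative models are commonly assumed to improve performance, but the paper finds that current generation strategies fall well short in preserving identity-critical features, causing domain shift and loss of identity-defining features and hence performance degradation. The key to the approach is a new pipeline named PAT++, whose core idea is to integrate Diffusion Self-Distillation into the well-established Part-Aware Transformer, with the aim of better preserving identity-sensitive detail during generation so that generated images become more useful for training and query expansion.
Link: https://arxiv.org/abs/2507.15888
Authors: Leonardo Santiago Benitez Pereira,Arathy Jeevan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: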
Abstract:Generative data augmentation has demonstrated gains in several vision tasks, but its impact on object re-identification - where preserving fine-grained visual details is essential - remains largely unexplored. In this work, we assess the effectiveness of identity-preserving image generation for object re-identification. Our novel pipeline, named PAT++, incorporates Diffusion Self-Distillation into the well-established Part-Aware Transformer. Using the Urban Elements ReID Challenge dataset, we conduct extensive experiments with generated images used for both model training and query expansion. Our results show consistent performance degradation, driven by domain shifts and failure to retain identity-defining features. These findings challenge assumptions about the transferability of generative models to fine-grained recognition tasks and expose key limitations in current approaches to visual augmentation for identity-preserving applications.
[CV-85] Salience Adjustment for Context-Based Emotion Recognition
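[Quick read]: This paper addresses emotion recognition in dynamic social contexts, that is, how to fuse facial expressions with situational cues effectively to improve recognition accuracy. The key to the solution is a salience-adjusted framework that uses Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically adjust the weights of facial and contextual information, weighting adaptively according to how expressive the facial cues are, and thereby capturing emotional states in complex social scenes more accurately.
Link: https://arxiv.org/abs/2507.15878
Authors: Bin Han,Jonathan Gratch
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: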
Abstract:Emotion recognition in dynamic social contexts requires an understanding of the complex interaction between facial expressions and situational cues. This paper presents a salience-adjusted framework for context-aware emotion recognition with Bayesian Cue Integration (BCI) and Visual-Language Models (VLMs) to dynamically weight facial and contextual information based on the expressivity of facial cues. We evaluate this approach using human annotations and automatic emotion recognition systems in prisoner’s dilemma scenarios, which are designed to evoke emotional reactions. Our findings demonstrate that incorporating salience adjustment enhances emotion recognition performance, offering promising directions for future research to extend this framework to broader social contexts and multimodal applications.
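One way to picture cue integration with adjustable salience is the sketch below. It uses the textbook form P(e | face, context) ∝ P(e | face) · P(e | context) / P(e) and adds exponent weights to tilt the result toward one cue. The exponent-based weighting and all names are assumptions made for illustration; the abstract does not specify how the paper implements its salience adjustment.

```python
def integrate_cues(p_face, p_context, prior, face_weight=1.0, context_weight=1.0):
    """Combine two per-emotion distributions into a normalised posterior.

    With both weights at 1.0 this is plain Bayesian cue integration under
    conditional independence. Lowering face_weight (for example, when the face
    is not very expressive) shifts the posterior toward the contextual cue.
    The weighting scheme is a hypothetical stand-in.
    """
    scores = {
        emotion: (p_face[emotion] ** face_weight)
        * (p_context[emotion] ** context_weight)
        / prior[emotion]
        for emotion in prior
    }
    total = sum(scores.values())
    return {emotion: score / total for emotion, score in scores.items()}

prior = {"joy": 0.5, "anger": 0.5}
p_face = {"joy": 0.8, "anger": 0.2}
p_context = {"joy": 0.3, "anger": 0.7}

print(integrate_cues(p_face, p_context, prior))                   # both cues at full weight
print(integrate_cues(p_face, p_context, prior, face_weight=0.0))  # face ignored entirely
```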
[CV-86] GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration IJCAI2025
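[Quick read]: This paper addresses the difficulty of identifying high-quality correspondences in feature-based point cloud registration, in particular the performance bottleneck that feature redundancy and complex spatial relationships create when fusing local and global features. The key to the solution is GPI-Net, a Gestalt-guided parallel interaction network via orthogonal geometric consistency. It introduces an orthogonal integration strategy that optimally reduces redundant information and builds a more compact global structure to improve correspondence quality; a Gestalt Feature Attention (GFA) block that combines self-attention and cross-attention to capture geometric features in correspondences; and a Dual-path Multi-Granularity parallel interaction aggregation (DMG) block that promotes information exchange across granularities, so that local detail and global structure work together effectively.
Link: https://arxiv.org/abs/2507.14452
Authors: Weikang Gu,Mingyue Han,Li Xue,Heng Dong,Changcai Yang,Riqing Chen,Lifang Wei
Affiliations: Fujian Agriculture and Forestry University; Fuzhou Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures. Accepted to IJCAI 2025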
Abstract:The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at this https URL.
[CV-87] MultiTaskDeltaNet: Change Detection-based Image Segmentation for Operando ETEM with Application to Carbon Gasification Kinetics
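[Quick read]: This paper addresses the limited performance of conventional deep-learning semantic segmentation when in-situ transmission electron microscopy (TEM) imaging is used for spatially resolved operando characterization of solid-state reactions, a limitation caused by scarce labeled data, visually ambiguous target features, and small-object scenarios. The key to the solution is MultiTaskDeltaNet (MTDN), a deep learning architecture whose core idea is to recast segmentation as a change detection problem: a Siamese network with a U-Net backbone uses image pairs to capture how features evolve, producing high-quality segmentations from very little labeled data. A multi-task learning strategy additionally exploits correlations between different physical features, improving recognition of small and ambiguous structures. In experiments on environmental TEM (ETEM) videos of filamentous carbon gasification, MTDN improves on conventional models by 10.22% when predicting small and visually ambiguous features.
Link: https://arxiv.org/abs/2507.16803
Authors: Yushuo Niu,Tianyu Li,Yuanyuan Zhu,Qian Yang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: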
Abstract:Transforming in-situ transmission electron microscopy (TEM) imaging into a tool for spatially-resolved operando characterization of solid-state reactions requires automated, high-precision semantic segmentation of dynamically evolving features. However, traditional deep learning methods for semantic segmentation often encounter limitations due to the scarcity of labeled data, visually ambiguous features of interest, and small-object scenarios. To tackle these challenges, we introduce MultiTaskDeltaNet (MTDN), a novel deep learning architecture that creatively reconceptualizes the segmentation task as a change detection problem. By implementing a unique Siamese network with a U-Net backbone and using paired images to capture feature changes, MTDN effectively utilizes minimal data to produce high-quality segmentations. Furthermore, MTDN utilizes a multi-task learning strategy to leverage correlations between physical features of interest. In an evaluation using data from in-situ environmental TEM (ETEM) videos of filamentous carbon gasification, MTDN demonstrated a significant advantage over conventional segmentation models, particularly in accurately delineating fine structural features. Notably, MTDN achieved a 10.22% performance improvement over conventional segmentation models in predicting small and visually ambiguous physical features. This work bridges several key gaps between deep learning and practical TEM image analysis, advancing automated characterization of nanomaterials in complex experimental settings.
[CV-88] Improving U-Net Confidence on TEM Image Data with L2-Regularization Transfer Learning and Deep Fine-Tuning ICCV2025
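[Quick read]: This paper addresses the automatic identification of nanoscale defects in Transmission Electron Microscopy (TEM) images. The core challenges are that defect features in TEM images vary widely, labeled data are scarce, and annotation error rates are high, all of which limit conventional machine learning models. The key to the solution is transfer learning: a large model pre-trained on natural images serves as the encoder and is combined with L2 regularization, so that the model favors simple, reliable cues over semantically complex features, substantially improving detection performance. Because conventional metrics such as F1-score are skewed by annotation errors, the authors also propose new evaluation metrics that are independent of annotation accuracy. On grain boundary detection in UO₂ TEM images the approach achieves a 57% improvement in defect detection rate.
Link: https://arxiv.org/abs/2507.16779
Authors: Aiden Ochoa,Xinyuan Xu,Xing Wang
Affiliations: Penn State University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted into the ICCV 2025 CV4MS Workshop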
Abstract:With ever-increasing data volumes, it is essential to develop automated approaches for identifying nanoscale defects in transmission electron microscopy (TEM) images. However, compared to features in conventional photographs, nanoscale defects in TEM images exhibit far greater variation due to the complex contrast mechanisms and intricate defect structures. These challenges often result in much less labeled data and higher rates of annotation errors, posing significant obstacles to improving machine learning model performance for TEM image analysis. To address these limitations, we examined transfer learning by leveraging large, pre-trained models used for natural images. We demonstrated that by using the pre-trained encoder and L2-regularization, semantically complex features are ignored in favor of simpler, more reliable cues, substantially improving the model performance. However, this improvement cannot be captured by conventional evaluation metrics such as F1-score, which can be skewed by human annotation errors treated as ground truth. Instead, we introduced novel evaluation metrics that are independent of the annotation accuracy. Using grain boundary detection in UO2 TEM images as a case study, we found that our approach led to a 57% improvement in defect detection rate, which is a robust and holistic measure of model performance on the TEM dataset used in this work. Finally, we showed that model self-confidence is only achieved through transfer learning and fine-tuning of very deep layers.
[CV-89] Pyramid Hierarchical Masked Diffusion Model for Imaging Synthesis
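[Quick read]: This paper addresses missing imaging modalities in medical image synthesis, caused by factors such as long scan times, scan corruption, artifacts, patient motion, or intolerance to contrast agents. The core solution is the Pyramid Hierarchical Masked Diffusion Model (PHMDiff). Its key innovation is a multi-scale hierarchical structure that gives fine control over images at different resolutions and layers: random multi-scale, high-proportion masks speed up diffusion model training while balancing detail fidelity against overall structure, and a Transformer-based diffusion process with cross-granularity regularization models the consistency of mutual information across the latent spaces of each granularity, improving pixel-level perceptual accuracy. Experiments show that PHMDiff outperforms existing methods on PSNR and SSIM, supporting its effectiveness for medical image synthesis both across and within modalities.
Link: https://arxiv.org/abs/2507.16579
Authors: Xiaojiao Xiao,Qinmin Vivian Hu,Guanghui Wang
Affiliations: Toronto Metropolitan University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: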
Abstract:Medical image synthesis plays a crucial role in clinical workflows, addressing the common issue of missing imaging modalities due to factors such as extended scan times, scan corruption, artifacts, patient motion, and intolerance to contrast agents. The paper presents a novel image synthesis network, the Pyramid Hierarchical Masked Diffusion Model (PHMDiff), which employs a multi-scale hierarchical approach for more detailed control over synthesizing high-quality images across different resolutions and layers. Specifically, this model utilizes random multi-scale high-proportion masks to speed up diffusion model training, and balances detail fidelity and overall structure. The Transformer-based diffusion process incorporates cross-granularity regularization, modeling the mutual information consistency across each granularity’s latent spaces, thereby enhancing pixel-level perceptual accuracy. Comprehensive experiments on two challenging datasets demonstrate that PHMDiff achieves superior performance in both the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), highlighting its capability to produce high-quality synthesized images with excellent structural integrity. Ablation studies further confirm the contributions of each component. Furthermore, the PHMDiff model, a multi-scale image synthesis framework across and within medical imaging modalities, shows significant advantages over other methods. The source code is available at this https URL
[CV-90] Semantic Segmentation for Preoperative Planning in Transcatheter Aortic Valve Replacement MICCAI
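[Quick read]: This paper addresses how medical imaging can be used to automatically identify and quantify key anatomical structures for preoperative planning of transcatheter aortic valve replacement (TAVR). The core challenge is turning coarse-grained anatomical information into fine-grained pseudo-labels with which to train semantic segmentation models that extract the relevant structures from CT scans. The key to the solution is to first build fine-grained, TAVR-relevant pseudo-labels for training and then adapt the loss function, which yields a +1.27% Dice improvement and strengthens the model's ability to localize and measure the target anatomical structures.
Link: https://arxiv.org/abs/2507.16573
Authors: Cedric Zöllner,Simon Reiß,Alexander Jaus,Amroalalaa Sholi,Ralf Sodian,Rainer Stiefelhagen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 16th MICCAI Workshop on Statistical Atlases and Computational Modeling of the Heart (STACOM)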
Abstract:When preoperative planning for surgeries is conducted on the basis of medical images, artificial intelligence methods can support medical doctors during assessment. In this work, we consider medical guidelines for preoperative planning of the transcatheter aortic valve replacement (TAVR) and identify tasks that may be supported via semantic segmentation models by making relevant anatomical structures measurable in computed tomography scans. We first derive fine-grained TAVR-relevant pseudo-labels from coarse-grained anatomical information, in order to train segmentation models and quantify how well they are able to find these structures in the scans. Furthermore, we propose an adaptation to the loss function in training these segmentation models and through this achieve a +1.27% Dice increase in performance. Our fine-grained TAVR-relevant pseudo-labels and the computed tomography scans we build upon are available at this https URL.
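For readers unfamiliar with the metric behind the reported "+1.27% Dice" gain, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), for one label over flattened masks. The function name and the toy masks are illustrative; the paper's evaluation code and its adapted loss are not reproduced here.

```python
def dice_coefficient(pred, target, label=1):
    """Dice overlap for one label between two equally long flat label sequences."""
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    if pred_count + target_count == 0:
        return 1.0  # convention: the label is absent from both masks
    return 2.0 * overlap / (pred_count + target_count)

prediction = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
print(dice_coefficient(prediction, reference))
```

Dice ranges from 0 (no overlap) to 1 (identical masks) and ignores true negatives, which is why it is preferred over pixel accuracy when the target structure occupies a small part of the scan.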
[CV-91] A High Magnifications Histopathology Image Dataset for Oral Squamous Cell Carcinoma Diagnosis and Prognosis
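[Quick read]: This paper addresses the limitations of current public histopathology image datasets for Oral Squamous Cell Carcinoma (OSCC): patient cohorts are small and datasets usually target either diagnosis or prognosis alone, which makes it hard to build comprehensive, generalizable deep learning models. The key to the solution is the Multi-OSCC dataset, a high-quality histopathology image dataset of 1,325 OSCC patients. Each patient contributes six high-resolution images (two each at x200, x400, and x1000 magnification) covering the tumor core and edge regions, annotated for six key clinical tasks (recurrence prediction, lymph node metastasis, tumor differentiation, tumor invasion, cancer embolus, and perineural invasion), so that diagnosis and prognosis can be modeled together. The study also systematically evaluates visual encoders, multi-image fusion strategies, stain normalization, and multi-task learning frameworks, finding that stain normalization helps diagnostic tasks but can hurt recurrence prediction, and that multi-task learning lowers average AUC by 3.34% relative to single-task models, which informs later model design.
Link: https://arxiv.org/abs/2507.16360
Authors: Jinquan Guan,Junhong Guo,Qi Chen,Jian Chen,Yongkang Cai,Yilin He,Zhiquan Huang,Yan Wang,Yutong Xie
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 11 tables, 4 figures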
Abstract:Oral Squamous Cell Carcinoma (OSCC) is a prevalent and aggressive malignancy where deep learning-based computer-aided diagnosis and prognosis can enhance clinical this http URL, existing publicly available OSCC datasets often suffer from limited patient cohorts and a restricted focus on either diagnostic or prognostic tasks, limiting the development of comprehensive and generalizable models. To bridge this gap, we introduce Multi-OSCC, a new histopathology image dataset comprising 1,325 OSCC patients, integrating both diagnostic and prognostic information to expand existing public resources. Each patient is represented by six high resolution histopathology images captured at x200, x400, and x1000 magnifications (two per magnification), covering both the core and edge tumor regions. The Multi-OSCC dataset is richly annotated for six critical clinical tasks: recurrence prediction (REC), lymph node metastasis (LNM), tumor differentiation (TD), tumor invasion (TI), cancer embolus (CE), and perineural invasion (PI). To benchmark this dataset, we systematically evaluate the impact of different visual encoders, multi-image fusion techniques, stain normalization, and multi-task learning frameworks. Our analysis yields several key insights: (1) The top-performing models achieve excellent results, with an Area Under the Curve (AUC) of 94.72% for REC and 81.23% for TD, while all tasks surpass 70% AUC; (2) Stain normalization benefits diagnostic tasks but negatively affects recurrence prediction; (3) Multi-task learning incurs a 3.34% average AUC degradation compared to single-task models in our multi-task benchmark, underscoring the challenge of balancing multiple tasks in our dataset. To accelerate future research, we publicly release the Multi-OSCC dataset and baseline models at this https URL.
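[CV-92] SFNet: A Spatio-Frequency Domain Deep Learning Network for Efficient Alzheimer's Disease Diagnosis
[Quick read]: This paper addresses the limited representational capacity of current Alzheimer's disease (AD) diagnostic models that rely on features from a single domain (spatial or frequency) of magnetic resonance imaging (MRI) data, and in particular the untapped potential of jointly using the spatial and frequency information present in 3D MRI. The key to its solution is the Spatio-Frequency Network (SFNet), presented as the first end-to-end network of this kind: it combines an enhanced dense convolutional network that extracts local spatial features with a global frequency module that captures frequency-domain representations, and adds a multi-scale attention mechanism to refine spatial feature extraction. This improves 3D MRI-based AD classification, reaching 95.1% accuracy on the ADNI dataset with lower computational overhead.
Link: https://arxiv.org/abs/2507.16267
Authors: Xinyue Yang, Meiliang Liu, Yunfang Xu, Xiaoxiao Yang, Zhengye Si, Zijin Li, Zhiwen Zhao
Affiliations: Beijing Normal University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: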
Abstract:Alzheimer’s disease (AD) is a progressive neurodegenerative disorder that predominantly affects the elderly population and currently has no cure. Magnetic Resonance Imaging (MRI), as a non-invasive imaging technique, is essential for the early diagnosis of AD. MRI inherently contains both spatial and frequency information, as raw signals are acquired in the frequency domain and reconstructed into spatial images via the Fourier transform. However, most existing AD diagnostic models extract features from a single domain, limiting their capacity to fully capture the complex neuroimaging characteristics of the disease. While some studies have combined spatial and frequency information, they are mostly confined to 2D MRI, leaving the potential of dual-domain analysis in 3D MRI unexplored. To overcome this limitation, we propose Spatio-Frequency Network (SFNet), the first end-to-end deep learning framework that simultaneously leverages spatial and frequency domain information to enhance 3D MRI-based AD diagnosis. SFNet integrates an enhanced dense convolutional network to extract local spatial features and a global frequency module to capture global frequency-domain representations. Additionally, a novel multi-scale attention module is proposed to further refine spatial feature extraction. Experiments on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that SFNet outperforms existing baselines and reduces computational overhead in classifying cognitively normal (CN) and AD, achieving an accuracy of 95.1%.
[CV-93] MLRU++: Multiscale Lightweight Residual UNETR++ with Attention for Efficient 3D Medical Image Segmentation
[Quick read]: This paper addresses the difficulty of balancing accuracy and computational efficiency in 3D medical image segmentation, particularly given large anatomical variability and the high cost of processing volumetric data. Its solution is the MLRU++ architecture, with two key innovations: (1) a Lightweight Channel and Bottleneck Attention Module (LCBAM), which strengthens contextual feature encoding at very low additional computational cost; and (2) a Multiscale Bottleneck Block (M2B), which captures fine-grained details in the decoder through multi-resolution feature aggregation. Together the two modules improve segmentation accuracy while substantially reducing parameter count and computational complexity, and experiments report state-of-the-art performance on several public datasets.
Link: https://arxiv.org/abs/2507.16122
Authors: Nand Kumar Yadav, Rodrigue Rizk, Willium WC Chen, KC Santosh (AI Research Lab, Department of Computer Science and Biomedical and Translational Sciences, Sanford School of Medicine, University of South Dakota, Vermillion, SD, USA)
Affiliations: AI Research Lab, Department of Computer Science; Biomedical & Translational Sciences, Sanford School of Medicine; University of South Dakota
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate and efficient medical image segmentation is crucial but challenging due to anatomical variability and high computational demands on volumetric data. Recent hybrid CNN-Transformer architectures achieve state-of-the-art results but add significant complexity. In this paper, we propose MLRU++, a Multiscale Lightweight Residual UNETR++ architecture designed to balance segmentation accuracy and computational efficiency. It introduces two key innovations: a Lightweight Channel and Bottleneck Attention Module (LCBAM) that enhances contextual feature encoding with minimal overhead, and a Multiscale Bottleneck Block (M2B) in the decoder that captures fine-grained details via multi-resolution feature aggregation. Experiments on four publicly available benchmark datasets (Synapse, BTCV, ACDC, and Decathlon Lung) demonstrate that MLRU++ achieves state-of-the-art performance, with average Dice scores of 87.57% (Synapse), 93.00% (ACDC), and 81.12% (Lung). Compared to existing leading models, MLRU++ improves Dice scores by 5.38% and 2.12% on Synapse and ACDC, respectively, while significantly reducing parameter count and computational cost. Ablation studies evaluating LCBAM and M2B further confirm the effectiveness of the proposed architectural components. Results suggest that MLRU++ offers a practical and high-performing solution for 3D medical image segmentation tasks. Source code is available at: this https URL
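The Dice scores quoted in the abstract measure the overlap between a predicted and a reference segmentation. A minimal sketch in plain Python on flat binary masks follows; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions for illustration and are not taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient 2*|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))
```

For multi-organ benchmarks such as Synapse, a score like this is usually computed per class and then averaged.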
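[CV-94] Handcrafted vs. Deep Radiomics vs. Fusion vs. Deep Learning: A Comprehensive Review of Machine Learning-Based Cancer Outcome Prediction in PET and SPECT Imaging
[Quick read]: This paper addresses the inconsistent comparative results among feature extraction approaches (handcrafted radiomics features, HRF; deep radiomics features, DRF), deep learning (DL) models, and fusion models when machine learning (ML) is used for cancer outcome prediction from PET and SPECT imaging. The key to its approach is a systematic review of 226 studies using a standardized 59-item evaluation framework that covers dataset construction, feature extraction, validation methods, interpretability, and risk of bias. The review finds that DRF models achieve the highest mean accuracy (0.862) while fusion models achieve the highest AUC (0.861), and it identifies common limitations in current studies, including inadequate handling of class imbalance (59%), missing data (29%), and low population diversity (19%). It concludes that standardized pipelines, better data quality, and explainable AI are needed to support clinical translation.
Link: https://arxiv.org/abs/2507.16065
Authors: Mohammad R. Salmanpour, Somayeh Sadat Mehrnia, Sajad Jabarzadeh Ghandilu, Zhino Safahi, Sonya Falahati, Shahram Taeb, Ghazal Mousavi, Mehdi Maghsoudi, Ahmad Shariftabrizi, Ilker Hacihaliloglu, Arman Rahmim
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: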
Abstract:Machine learning (ML), including deep learning (DL) and radiomics-based methods, is increasingly used for cancer outcome prediction with PET and SPECT imaging. However, the comparative performance of handcrafted radiomics features (HRF), deep radiomics features (DRF), DL models, and hybrid fusion approaches remains inconsistent across clinical applications. This systematic review analyzed 226 studies published from 2020 to 2025 that applied ML to PET or SPECT imaging for outcome prediction. Each study was evaluated using a 59-item framework covering dataset construction, feature extraction, validation methods, interpretability, and risk of bias. We extracted key details including model type, cancer site, imaging modality, and performance metrics such as accuracy and area under the curve (AUC). PET-based studies (95%) generally outperformed those using SPECT, likely due to higher spatial resolution and sensitivity. DRF models achieved the highest mean accuracy (0.862), while fusion models yielded the highest AUC (0.861). ANOVA confirmed significant differences in performance (accuracy: p=0.0006, AUC: p=0.0027). Common limitations included inadequate handling of class imbalance (59%), missing data (29%), and low population diversity (19%). Only 48% of studies adhered to IBSI standards. These findings highlight the need for standardized pipelines, improved data quality, and explainable AI to support clinical integration.
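[CV-95] Quantization-Aware Neuromorphic Architecture for Efficient Skin Disease Classification on Resource-Constrained Devices
[Quick read]: This paper addresses skin lesion classification with high accuracy, low latency, and low energy consumption on resource-constrained edge devices, in support of accessible and privacy-sensitive dermatological diagnosis. The core challenge is the trade-off among compute, power, and data privacy. The key to the solution is a quantization-aware neuromorphic architecture (QANA) that integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks to strengthen feature representation, together with a quantization-aware head and spike-compatible transformations so that the model can be converted to spiking neural networks (SNNs) and deployed on neuromorphic hardware such as BrainChip Akida. The approach greatly reduces latency and energy use while maintaining high accuracy, and outperforms existing CNN-to-SNN conversion baselines.
Link: https://arxiv.org/abs/2507.15958
Authors: Haitian Wang, Xinyu Wang, Yiren Wang, Karen Lee, Zichen Geng, Xian Zhang, Kehkashan Kiran, Yu Zhang, Bo Miao
Affiliations: Northwestern Polytechnical University; The University of Western Australia; Australian Institute for Machine Learning
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This manuscript is under review for IEEE BIBM 2025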
Abstract:Accurate and efficient skin lesion classification on edge devices is critical for accessible dermatological care but remains challenging due to computational, energy, and privacy constraints. We introduce QANA, a novel quantization-aware neuromorphic architecture for incremental skin lesion classification on resource-limited hardware. QANA effectively integrates ghost modules, efficient channel attention, and squeeze-and-excitation blocks for robust feature representation with low-latency and energy-efficient inference. Its quantization-aware head and spike-compatible transformations enable seamless conversion to spiking neural networks (SNNs) and deployment on neuromorphic platforms. Evaluation on the large-scale HAM10000 benchmark and a real-world clinical dataset shows that QANA achieves 91.6% Top-1 accuracy and 82.4% macro F1 on HAM10000, and 90.8% / 81.7% on the clinical dataset, significantly outperforming state-of-the-art CNN-to-SNN models under fair comparison. Deployed on BrainChip Akida hardware, QANA achieves 1.5 ms inference latency and 1.7 mJ energy per image, reducing inference latency and energy use by over 94.6%/98.6% compared to GPU-based CNNs and surpassing state-of-the-art CNN-to-SNN conversion baselines. These results demonstrate the effectiveness of QANA for accurate, real-time, and privacy-sensitive medical analysis in edge environments.
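Quantization-aware designs commonly simulate low-precision arithmetic during training so the network learns to tolerate rounding. The sketch below shows generic uniform "fake quantization" (quantize, then dequantize) in plain Python. It is not QANA's quantization-aware head; the function name, the 4-bit width, and the [0, 1] range are assumptions chosen for illustration.

```python
def fake_quantize(x, num_bits=4, x_min=0.0, x_max=1.0):
    """Round x to the nearest of 2**num_bits uniform levels in [x_min, x_max]."""
    steps = 2 ** num_bits - 1                 # number of intervals between levels
    scale = (x_max - x_min) / steps
    clipped = min(max(x, x_min), x_max)       # saturate out-of-range values
    code = round((clipped - x_min) / scale)   # integer code in [0, steps]
    return x_min + code * scale               # value the next layer would see


# Hypothetical activations, including two values outside the range.
activations = [-0.2, 0.0, 0.33, 0.5, 0.97, 1.4]
quantized = [fake_quantize(a) for a in activations]
print([round(v, 4) for v in quantized])
```

In a training framework the rounding step would normally be paired with a straight-through gradient estimator, which this sketch omits.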
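[CV-96] Systole-Conditioned Generative Cardiac Motion
[Quick read]: This paper addresses the shortage of annotated data for dense motion estimation in cardiac computed tomography (CT), where dense ground-truth (GT) motion annotations are often infeasible to obtain. The authors propose a data generation method based on a conditional Variational Autoencoder (CVAE). Its key element is a multi-scale feature conditioning mechanism that lets the model take a single CT frame as input and generate a corresponding 3D flow field as motion information. Warping the original frame with the generated flow field yields frame pairs with realistic myocardium deformation. These pairs come with complete optical flow GT annotations and can be used to train and validate more complex motion models, substantially reducing reliance on manual annotation.
Link: https://arxiv.org/abs/2507.15894
Authors: Shahar Zuler, Gal Lifshitz, Hadar Averbuch-Elor, Dan Raviv
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: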
Abstract:Accurate motion estimation in cardiac computed tomography (CT) imaging is critical for assessing cardiac function and surgical planning. Data-driven methods have become the standard approach for dense motion estimation, but they rely on vast amounts of labeled data with dense ground-truth (GT) motion annotations, which are often unfeasible to obtain. To address this limitation, we present a novel approach that synthesizes realistically looking pairs of cardiac CT frames enriched with dense 3D flow field annotations. Our method leverages a conditional Variational Autoencoder (CVAE), which incorporates a novel multi-scale feature conditioning mechanism and is trained to generate 3D flow fields conditioned on a single CT frame. By applying the generated flow field to warp the given frame, we create pairs of frames that simulate realistic myocardium deformations across the cardiac cycle. These pairs serve as fully annotated data samples, providing optical flow GT annotations. Our data generation pipeline could enable the training and validation of more complex and accurate myocardium motion models, allowing for substantially reducing reliance on manual annotations. Our code, along with animated generated samples and additional material, is available on our project page: this https URL.
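The pair-generation step described in the abstract (warp a frame with a flow field, then keep the flow as ground truth) can be sketched on a tiny 2D grid. This is a simplified illustration only: the paper works with 3D flow fields produced by a CVAE, whereas the flow below is hand-written, uses integer displacements with nearest-neighbour sampling, and the name `warp_frame` is hypothetical.

```python
def warp_frame(frame, flow):
    """Backward-warp a 2D frame: out[y][x] = frame[y + dy][x + dx].

    flow[y][x] is an integer (dy, dx) displacement; samples that fall
    outside the frame are filled with 0.
    """
    height, width = len(frame), len(frame[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = flow[y][x]
            src_y, src_x = y + dy, x + dx
            if 0 <= src_y < height and 0 <= src_x < width:
                out[y][x] = frame[src_y][src_x]
    return out


frame = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# Hypothetical flow: every pixel samples from its right-hand neighbour.
flow = [[(0, 1)] * 3 for _ in range(3)]
warped = warp_frame(frame, flow)
# (frame, warped) is an image pair whose motion annotation is `flow`.
print(warped)
```

A realistic pipeline would use sub-voxel displacements with trilinear interpolation; the integer version keeps the example dependency-free.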
Artificial Intelligence
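[AI-0] Rethinking LLM-Based RTL Code Optimization Via Timing Logic Metamorphosis
[Quick read]: This paper addresses the limited ability of current generative AI to handle complex timing logic in Register Transfer Level (RTL) code optimization: on RTL code with complex timing control flow and clock domain logic, existing methods based on large language models (LLMs) do not outperform traditional compiler-based methods. The key to its approach is a new evaluation method based on metamorphosis, which constructs semantically equivalent but structurally more complex RTL code variants and systematically checks whether LLM optimization results remain consistent, thereby exposing the limits of LLMs in understanding and optimizing timing logic.
Link: https://arxiv.org/abs/2507.16808
Authors: Zhihao Xu, Bixin Li, Lulu Wang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 13 pages with 9 pictures and 2 tables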
Abstract:Register Transfer Level(RTL) code optimization is crucial for achieving high performance and low power consumption in digital circuit design. However, traditional optimization methods often rely on manual tuning and heuristics, which can be time-consuming and error-prone. Recent studies proposed to leverage Large Language Models(LLMs) to assist in RTL code optimization. LLMs can generate optimized code snippets based on natural language descriptions, potentially speeding up the optimization process. However, existing approaches have not thoroughly evaluated the effectiveness of LLM-Based code optimization methods for RTL code with complex timing logic. To address this gap, we conducted a comprehensive empirical investigation to assess the capability of LLM-Based RTL code optimization methods in handling RTL code with complex timing logic. In this study, we first propose a new benchmark for RTL optimization evaluation. It comprises four subsets, each corresponding to a specific area of RTL code optimization. Then we introduce a method based on metamorphosis to systematically evaluate the effectiveness of LLM-Based RTL code optimization this http URL key insight is that the optimization effectiveness should remain consistent for semantically equivalent but more complex code. After intensive experiments, we revealed several key findings. (1) LLM-Based RTL optimization methods can effectively optimize logic operations and outperform existing compiler-based methods. (2) LLM-Based RTL optimization methods do not perform better than existing compiler-based methods on RTL code with complex timing logic, particularly in timing control flow optimization and clock domain optimization. This is primarily attributed to the challenges LLMs face in understanding timing logic in RTL code. Based on these findings, we provide insights for further research in leveraging LLMs for RTL code optimization.
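The metamorphic idea in the abstract (optimization quality should stay the same for semantically equivalent but more complex input) can be illustrated with a toy example. The sketch below uses nested-tuple boolean expressions rather than RTL, and a tiny rule-based `optimize` function as a stand-in for the LLM or compiler under test; every name here is hypothetical and none of it is the paper's benchmark.

```python
def cost(expr):
    """Number of operator nodes in a nested-tuple boolean expression."""
    if isinstance(expr, str):
        return 0
    return 1 + sum(cost(arg) for arg in expr[1:])


def optimize(expr):
    """Toy optimizer under test: removes double negation, bottom-up."""
    if isinstance(expr, str):
        return expr
    op, *args = expr
    args = [optimize(a) for a in args]
    if op == "not" and not isinstance(args[0], str) and args[0][0] == "not":
        return args[0][1]
    return (op, *args)


def add_redundancy(expr):
    """Metamorphic transformation: same semantics, more complex structure."""
    return ("not", ("not", expr))


def relation_holds(expr):
    """Optimized cost should not change under a semantics-preserving rewrite."""
    return cost(optimize(expr)) == cost(optimize(add_redundancy(expr)))


design = ("and", "a", ("or", "b", ("not", "c")))
print(relation_holds(design))
```

An optimizer that handled the original expression well but left the redundant variant larger would violate the relation, which is the kind of inconsistency the evaluation method is designed to surface.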
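[AI-1] Uncertainty-Aware Knowledge Transformers for Peer-to-Peer Energy Trading with Multi-Agent Reinforcement Learning ECAI2025
[Quick read]: This paper addresses the weak robustness of decision-making in current peer-to-peer (P2P) energy trading, which relies on deterministic forecasts despite a highly stochastic market environment. The key to its solution is a framework that combines uncertainty-aware forecasting with multi-agent reinforcement learning (MARL). First, it introduces a heteroscedastic probabilistic transformer-based prediction model, Knowledge Transformer with Uncertainty (KTU), which explicitly quantifies prediction uncertainty and is trained with a custom loss function intended to yield reliable probabilistic forecasts and confidence intervals. Second, it feeds this uncertainty information into a Deep Q-Network (DQN) so that agents account for risk and variability when optimizing trading strategies. Experiments show lower energy purchase costs, higher electricity sales revenue, and reduced peak-hour grid demand; the gains in sales revenue and peak demand reduction are larger when P2P trading is enabled.
Link: https://arxiv.org/abs/2507.16796
Authors: Mian Ibad Ali Shah, Enda Barrett, Karl Mason
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 1 table, Proceedings of the Main Track of the European Conference on Artificial Intelligence (ECAI 2025), October 25-30, 2025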
Abstract:This paper presents a novel framework for Peer-to-Peer (P2P) energy trading that integrates uncertainty-aware prediction with multi-agent reinforcement learning (MARL), addressing a critical gap in current literature. In contrast to previous works relying on deterministic forecasts, the proposed approach employs a heteroscedastic probabilistic transformer-based prediction model called Knowledge Transformer with Uncertainty (KTU) to explicitly quantify prediction uncertainty, which is essential for robust decision-making in the stochastic environment of P2P energy trading. The KTU model leverages domain-specific features and is trained with a custom loss function that ensures reliable probabilistic forecasts and confidence intervals for each prediction. Integrating these uncertainty-aware forecasts into the MARL framework enables agents to optimize trading strategies with a clear understanding of risk and variability. Experimental results show that the uncertainty-aware Deep Q-Network (DQN) reduces energy purchase costs by up to 5.7% without P2P trading and 3.2% with P2P trading, while increasing electricity sales revenue by 6.4% and 44.7%, respectively. Additionally, peak hour grid demand is reduced by 38.8% without P2P and 45.6% with P2P. These improvements are even more pronounced when P2P trading is enabled, highlighting the synergy between advanced forecasting and market mechanisms for resilient, economically efficient energy communities.
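A heteroscedastic forecaster predicts a variance alongside each mean, and a common way to train one is the Gaussian negative log-likelihood. The sketch below shows that standard loss and the interval it implies; it is an assumption used for illustration and is not the custom loss function the abstract mentions, whose form is not given there. The load values are made up.

```python
import math


def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of y under N(mean, exp(log_var))."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2 * math.pi) + log_var + (y - mean) ** 2 / var)


def interval_95(mean, log_var):
    """Approximate 95% interval implied by the predicted mean and variance."""
    std = math.exp(0.5 * log_var)
    return mean - 1.96 * std, mean + 1.96 * std


# Hypothetical load forecasts in kWh: the observed value is 3.0, the mean is 1.0.
confident_and_wrong = gaussian_nll(3.0, 1.0, math.log(0.01))
uncertain_and_wrong = gaussian_nll(3.0, 1.0, math.log(1.0))
print(confident_and_wrong > uncertain_and_wrong)
print(interval_95(1.0, math.log(0.25)))
```

The loss penalizes a wrong prediction much more when the model claimed low variance, which is what pushes the predicted variance to track actual error. An agent could then take the mean and interval width as part of its state.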
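[AI-2] ChatChecker: A Framework for Dialogue System Testing and Evaluation Through Non-cooperative User Simulation
[Quick read]: This paper addresses the lack of effective automated evaluation and testing methods for complex dialogue systems in real deployments. It targets the limits of traditional turn-level analysis and argues for quality assurance at the level of the whole dialogue. The key to its solution is the ChatChecker framework, which uses large language models (LLMs) to simulate diverse user interactions, improves dialogue breakdown detection by including an error taxonomy in the prompt, and introduces a non-cooperative user simulator based on challenging personas that exposes weaknesses in the target system more effectively. The approach requires no reference dialogues and is decoupled from the implementation of the target system, which lowers setup cost and supports generalization, providing a scalable testing mechanism for dialogue system development.
Link: https://arxiv.org/abs/2507.16792
Authors: Roman Mayr, Michel Schimpf, Thomas Bohné
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: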
Abstract:While modern dialogue systems heavily rely on large language models (LLMs), their implementation often goes beyond pure LLM interaction. Developers integrate multiple LLMs, external tools, and databases. Therefore, assessment of the underlying LLM alone does not suffice, and the dialogue systems must be tested and evaluated as a whole. However, this remains a major challenge. With most previous work focusing on turn-level analysis, less attention has been paid to integrated dialogue-level quality assurance. To address this, we present ChatChecker, a framework for automated evaluation and testing of complex dialogue systems. ChatChecker uses LLMs to simulate diverse user interactions, identify dialogue breakdowns, and evaluate quality. Compared to previous approaches, our design reduces setup effort and is generalizable, as it does not require reference dialogues and is decoupled from the implementation of the target dialogue system. We improve breakdown detection performance over a prior LLM-based approach by including an error taxonomy in the prompt. Additionally, we propose a novel non-cooperative user simulator based on challenging personas that uncovers weaknesses in target dialogue systems more effectively. Through this, ChatChecker contributes to thorough and scalable testing. This enables both researchers and practitioners to accelerate the development of robust dialogue systems.
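[AI-3] WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding
[Quick read]: This paper addresses the inefficiency of structured decoding in large language models (LLMs): when generating output in formats required by downstream systems (such as HTML or JSON), existing methods incur significant latency from grammar compilation, state tracking, and mask creation. The key to its solution is to decompose constraints into static and dynamic parts: static structures are precompiled offline, and dynamic arguments are instantiated at runtime using grammar snippets. It also replaces pushdown automata with a compositional set of operators for modeling regular formats, which lowers state transition latency. The paper presents wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, and reports up to 250x speedup over existing systems.
Link: https://arxiv.org/abs/2507.16768
Authors: Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: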
Abstract:Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components – precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar’s source code is publicly available at this https URL.
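The static/dynamic decomposition with mask caching can be sketched at toy scale. The code below is not wgrammar's API or implementation: the template syntax (`<WORD>`, `<INT>`), the ten-token vocabulary, and all function names are hypothetical, and each slot is filled by a single token to keep the example short.

```python
import json
import re
from functools import lru_cache

# Toy vocabulary standing in for an LLM tokenizer (hypothetical).
VOCAB = ["{", "}", ":", ",", "name", "age", "Ada", "Bob", "7", "42"]
SLOT_PATTERNS = {"INT": r"\d+", "WORD": r"[A-Za-z]+"}


def compile_template(template):
    """Offline step: split a template into static literals and dynamic slots."""
    parts = re.split(r"<(INT|WORD)>", template)
    # With one capture group, re.split alternates literal, slot kind, literal, ...
    return [("slot" if i % 2 else "static", p) for i, p in enumerate(parts) if p]


@lru_cache(maxsize=None)
def mask_for(kind):
    """Token ids allowed in a slot; computed once per slot kind, then cached."""
    pattern = SLOT_PATTERNS[kind]
    return tuple(i for i, tok in enumerate(VOCAB) if re.fullmatch(pattern, tok))


def decode(segments, pick):
    """Emit static text directly; consult the model only inside dynamic slots."""
    out = []
    for kind, value in segments:
        if kind == "static":
            out.append(value)
        else:
            out.append(VOCAB[pick(value, mask_for(value))])
    return "".join(out)


segments = compile_template('{"name": "<WORD>", "age": <INT>}')
# `pick` stands in for the model's choice among the allowed token ids.
text = decode(segments, pick=lambda kind, allowed: allowed[-1])
print(text, json.loads(text))
```

The point of the split is that the static segments need no per-token grammar state at all, and the slot masks are reused across requests that share a slot kind.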
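[AI-4] Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support
[Quick read]: This paper addresses the unreliability of large language models (LLMs) when answering developers' code-related questions, in particular incorrect answers caused by hallucinations. It adopts Retrieval-Augmented Generation (RAG), which brings in external knowledge to improve answer accuracy and reliability. The key elements are a retrieval corpus of more than 3 million Java- and Python-related Stack Overflow posts with accepted answers, and a systematic evaluation of 7 RAG pipeline designs and 63 variants. The study finds that a pipeline combining hypothetical-documentation-embedding (HyDE) with full-answer context performs best on questions that have historically similar matches, and it handles new questions without close matches by automatically lowering the retrieval similarity threshold, which improves coverage of unseen cases. Applied to 4 open-source LLMs, the optimized pipeline consistently outperforms zero-shot baselines on helpfulness, correctness, and detail.
Link: https://arxiv.org/abs/2507.16754
Authors: Fangjian Lei, Mariam El Mezouar, Shayan Noei, Ying Zou
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: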
Abstract:Large Language Models (LLMs) have shown promise in assisting developers with code-related questions; however, LLMs carry the risk of generating unreliable answers. To address this, Retrieval-Augmented Generation (RAG) has been proposed to reduce the unreliability (i.e., hallucinations) of LLMs. However, designing effective pipelines remains challenging due to numerous design choices. In this paper, we construct a retrieval corpus of over 3 million Java and Python related Stack Overflow posts with accepted answers, and explore various RAG pipeline designs to answer developer questions, evaluating their effectiveness in generating accurate and reliable responses. More specifically, we (1) design and evaluate 7 different RAG pipelines and 63 pipeline variants to answer questions that have historically similar matches, and (2) address new questions without any close prior matches by automatically lowering the similarity threshold during retrieval, thereby increasing the chance of finding partially relevant context and improving coverage for unseen cases. We find that implementing a RAG pipeline combining hypothetical-documentation-embedding (HyDE) with the full-answer context performs best in retrieving and answering similar content for Stack Overflow questions. Finally, we apply our optimal RAG pipeline to 4 open-source LLMs and compare the results to their zero-shot performance. Our findings show that RAG with our optimal RAG pipeline consistently outperforms zero-shot baselines across models, achieving higher scores for helpfulness, correctness, and detail with LLM-as-a-judge. These findings demonstrate that our optimal RAG pipelines robustly enhance answer quality for a wide range of developer queries including both previously seen and novel questions across different LLMs.
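The adaptive part of the retrieval (lowering the similarity threshold until some context is found) can be sketched with cosine similarity over toy vectors. The threshold schedule (start 0.9, step 0.15, floor 0.45), the document ids, and the two-dimensional embeddings are assumptions for illustration; the abstract does not state the values the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adaptive_retrieve(query_vec, corpus, start=0.9, floor=0.45, step=0.15):
    """Lower the similarity threshold until at least one document passes.

    Returns (threshold_used, matching document ids); the id list is empty
    if nothing passes even at `floor`.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()]
    threshold = start
    while threshold >= floor - 1e-9:
        hits = [doc_id for score, doc_id in sorted(scored, reverse=True)
                if score >= threshold]
        if hits:
            return threshold, hits
        threshold = round(threshold - step, 10)
    return floor, []


corpus = {"so-1": [1.0, 0.0], "so-3": [0.0, 1.0]}
# In a HyDE pipeline this vector would embed an LLM-written hypothetical answer.
query = [0.8, 0.6]
print(adaptive_retrieve(query, corpus))
```

Here nothing passes at 0.9, so the search relaxes once and returns the closest post instead of coming back empty.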
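[AI-5] AI-enhanced conversational agents for personalized asthma support: Factors for engagement, value and efficacy
[Quick read]: This paper addresses a public health problem: asthma-related deaths in the UK are the highest in Europe, and only 30% of patients access basic care. The core challenge is how to reach people with asthma and support them with health education, self-management, and bridges to care. The proposed direction is automated conversational agents, specifically mobile chatbots, that offer personalized, accessible, and continuously available health support. A survey (N=1257) devised by asthma clinicians, patients, and technology developers identifies the main factors affecting engagement: the patients most likely to use a chatbot are those who believe their asthma is more serious and who are less confident about self-management, and respondents show enthusiasm for 24/7 access, personalized content, and WhatsApp as the access method. Security and privacy concerns and skepticism about technological capabilities are the main obstacles. The authors consolidate the findings into 7 recommendations for developers.
Link: https://arxiv.org/abs/2507.16735
Authors: Laura Moradbakhti, Dorian Peters, Jennifer K. Quint, Björn Schuller, Darren Cook, Rafael A. Calvo
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
Comments: 7 Tables, 4 Figures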
Abstract:Asthma-related deaths in the UK are the highest in Europe, and only 30% of patients access basic care. There is a need for alternative approaches to reaching people with asthma in order to provide health education, self-management support and bridges to care. Automated conversational agents (specifically, mobile chatbots) present opportunities for providing alternative and individually tailored access to health education, self-management support and risk self-assessment. But would patients engage with a chatbot, and what factors influence engagement? We present results from a patient survey (N=1257) devised by a team of asthma clinicians, patients, and technology developers, conducted to identify optimal factors for efficacy, value and engagement for a chatbot. Results indicate that most adults with asthma (53%) are interested in using a chatbot and the patients most likely to do so are those who believe their asthma is more serious and who are less confident about self-management. Results also indicate enthusiasm for 24/7 access, personalisation, and for WhatsApp as the preferred access method (compared to app, voice assistant, SMS or website). Obstacles to uptake include security/privacy concerns and skepticism of technological capabilities. We present detailed findings and consolidate these into 7 recommendations for developers for optimising efficacy of chatbot-based health support.
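[AI-6] Deliberative Searcher: Improving LLM Reliability via Reinforcement Learning with constraints
[Quick read]: This paper addresses the limited reliability of large language models (LLMs) in real-world deployment, specifically the mismatch between model confidence and prediction correctness. The key to its solution is Deliberative Searcher, presented as the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes accuracy under a soft reliability constraint, which improves the alignment between model confidence and correctness and makes outputs more trustworthy.
Link: https://arxiv.org/abs/2507.16727
Authors: Zhenyun Yin, Shujie Wang, Xuhong Wang, Xingjun Ma, Yinchun Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: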
Abstract:Improving the reliability of large language models (LLMs) is critical for deploying them in real-world scenarios. In this paper, we propose Deliberative Searcher, the first framework to integrate certainty calibration with retrieval-based search for open-domain question answering. The agent performs multi-step reflection and verification over Wikipedia data and is trained with a reinforcement learning algorithm that optimizes for accuracy under a soft reliability constraint. Empirical results show that proposed method improves alignment between model confidence and correctness, leading to more trustworthy outputs. This paper will be continuously updated.
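[AI-7] FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation
[Quick read]: This paper addresses the modeling difficulty that multi-modal heterogeneity (summarized as the M5 problem) creates for industrial signal anomaly detection. Previous methods build specialized models for individual sub-problems and fail to exploit synergies between modalities and model scaling laws. The key to its solution is FISHER, a unified foundation model for multi-modal industrial signals. It treats an increase in sampling rate as the concatenation of additional sub-band information, takes the STFT sub-band as the modeling unit, and is pre-trained with a teacher-student self-supervised learning (SSL) framework, yielding general-purpose representations of industrial signals across sampling rates.
Link: https://arxiv.org/abs/2507.16696
Authors: Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD)
Comments: 11 pages, 6 figures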
Abstract:With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher student SSL framework for pre-training. We also develop the RMIS benchmark, which evaluates the representations of M5 industrial signals on multiple health management tasks. Compared with top SSL models, FISHER showcases versatile and outstanding capabilities with a general performance gain up to 5.03%, along with much more efficient scaling curves. We also investigate the scaling law on downstream tasks and derive potential avenues for future works. FISHER is now open-sourced on this https URL
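One way to read "the increment of sampling rate as the concatenation of sub-band information" is that, with a fixed analysis window duration, every sampling rate yields the same frequency resolution, so a higher rate only appends more fixed-width sub-bands. The sketch below illustrates that reading. The 25 ms window, the 2 kHz sub-band width, and the function names are assumptions, not values from the paper.

```python
def num_subbands(sample_rate_hz, band_width_hz=2000):
    """Sub-bands of fixed bandwidth that fit below the Nyquist frequency."""
    return (sample_rate_hz // 2) // band_width_hz


def split_subbands(spectrum_frame, bin_hz, band_width_hz=2000):
    """Cut one STFT frame (one magnitude per bin) into fixed-width sub-bands."""
    bins_per_band = band_width_hz // bin_hz
    usable = (len(spectrum_frame) // bins_per_band) * bins_per_band
    return [spectrum_frame[i:i + bins_per_band]
            for i in range(0, usable, bins_per_band)]


# Assumed setup: a fixed 25 ms window, so every sampling rate gets 40 Hz bins.
WINDOW_MS = 25
for sample_rate in (16000, 32000):
    n_fft = sample_rate * WINDOW_MS // 1000   # window length in samples
    n_bins = n_fft // 2 + 1                   # one-sided spectrum size
    frame = [0.0] * n_bins                    # placeholder magnitudes
    bands = split_subbands(frame, bin_hz=sample_rate // n_fft)
    print(sample_rate, len(bands), len(bands[0]))
```

Under these assumptions the 32 kHz signal produces the same first four sub-bands as the 16 kHz one plus four more, so a model that consumes sub-bands one at a time sees a longer sequence rather than a different input shape.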
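[AI-8] Meta-Learning for Cold-Start Personalization in Prompt-Tuned LLMs
[Quick read]: This paper addresses cold-start personalization in recommender systems based on large language models (LLMs): when a user has little or no interaction history, supervised fine-tuning and collaborative filtering, which depend on dense user-item interaction data, struggle to personalize efficiently and cheaply. The key to its solution is a meta-learning framework built on parameter-efficient prompt-tuning. Treating each user as a separate task, it learns differentiable soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization; these embeddings augment the input tokens and encode user behavioral priors. The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization, enabling fast personalization without interaction history. The paper reports real-time inference (below 300 ms) and zero-history personalization, and argues that the approach benefits real-time financial risk profiling and the stability and resilience of payment networks.
Link: https://arxiv.org/abs/2507.16672
Authors: Yushang Zhao, Huijie Shen, Dannier Li, Lu Chang, Chengrui Zhou, Yinuo Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: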
Abstract:Generative, explainable, and flexible recommender systems, derived using Large Language Models (LLM) are promising and poorly adapted to the cold-start user situation, where there is little to no history of interaction. The current solutions i.e. supervised fine-tuning and collaborative filtering are dense-user-item focused and would be expensive to maintain and update. This paper introduces a meta-learning framework, that can be used to perform parameter-efficient prompt-tuning, to effectively personalize LLM-based recommender systems quickly at cold-start. The model learns soft prompt embeddings with first-order (Reptile) and second-order (MAML) optimization by treating each of the users as the tasks. As augmentations to the input tokens, these learnable vectors are the differentiable control variables that represent user behavioral priors. The prompts are meta-optimized through episodic sampling, inner-loop adaptation, and outer-loop generalization. On MovieLens-1M, Amazon Reviews, and Recbole, we can see that our adaptive model outperforms strong baselines in NDCG@10, HR@10, and MRR, and it runs in real-time (i.e., below 300 ms) on consumer GPUs. Zero-history personalization is also supported by this scalable solution, and its 275 ms rate of adaptation allows successful real-time risk profiling of financial systems by shortening detection latency and improving payment network stability. Crucially, the 275 ms adaptation capability can enable real-time risk profiling for financial institutions, reducing systemic vulnerability detection latency significantly versus traditional compliance checks. By preventing contagion in payment networks (e.g., Fedwire), the framework strengthens national financial infrastructure resilience.
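The inner-loop and outer-loop structure of first-order meta-learning (Reptile) can be shown on a toy problem. In the sketch below each "user" is a one-parameter regression task and a single scalar stands in for the soft prompt embedding; this is an illustrative simplification with hypothetical names and hyperparameters, not the paper's model or training setup.

```python
def inner_adapt(w, slope, xs, lr=0.1, steps=5):
    """Inner loop: a few gradient steps on one task's mean squared error."""
    for _ in range(steps):
        grad = sum(2 * (w * x - slope * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w


def reptile(task_slopes, xs, meta_lr=0.1, iterations=300, w0=0.0):
    """Outer loop: move the shared initialisation toward each adapted weight."""
    w = w0
    for i in range(iterations):
        slope = task_slopes[i % len(task_slopes)]   # deterministic task sampling
        adapted = inner_adapt(w, slope, xs)
        w += meta_lr * (adapted - w)                # Reptile meta-update
    return w


# Each "user" is a task y = slope * x; the scalar w stands in for a soft prompt.
xs = [-1.0, 0.5, 1.0]
meta_w = reptile([1.0, 2.0, 3.0], xs)
adapted_new_user = inner_adapt(meta_w, 2.5, xs)     # few-step cold-start adaptation
print(round(meta_w, 2), round(adapted_new_user, 2))
```

The meta-learned value settles near the centre of the task distribution, so a handful of inner steps is enough to move it toward a new user. MAML differs by differentiating through the inner loop, which this first-order sketch does not do.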
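[AI-9] Adaptive Inventory Strategies using Deep Reinforcement Learning for Dynamic Agri-Food Supply Chains
[Quick read]: This paper addresses inventory management in agri-food supply chains under demand and lead-time uncertainty, and the lack of attention in existing research to coordination among stakeholders at different levels of the chain. The core challenge is to design a replenishment policy that maximizes profit across the whole supply chain when products have limited shelf life and supply and demand fluctuate sharply. The key to its solution is a Deep Reinforcement Learning (DRL) algorithm that combines the advantages of value-based and policy-based approaches. It selects order quantities in a continuous action space while accounting for perishability and uncertainty, and it encourages stakeholders to collaborate around the shared objective of maximizing overall profitability.
Link: https://arxiv.org/abs/2507.16670
Authors: Amandeep Kaur, Gyan Prakash
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: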
Abstract:Agricultural products are often subject to seasonal fluctuations in production and demand. Predicting and managing inventory levels in response to these variations can be challenging, leading to either excess inventory or stockouts. Additionally, the coordination among stakeholders at various level of food supply chain is not considered in the existing body of literature. To bridge these research gaps, this study focuses on inventory management of agri-food products under demand and lead time uncertainties. By implementing effective inventory replenishment policy results in maximize the overall profit throughout the supply chain. However, the complexity of the problem increases due to these uncertainties and shelf-life of the product, that makes challenging to implement traditional approaches to generate optimal set of solutions. Thus, the current study propose a novel Deep Reinforcement Learning (DRL) algorithm that combines the benefits of both value- and policy-based DRL approaches for inventory optimization under uncertainties. The proposed algorithm can incentivize collaboration among stakeholders by aligning their interests and objectives through shared optimization goal of maximizing profitability along the agri-food supply chain while considering perishability, and uncertainty simultaneously. By selecting optimal order quantities with continuous action space, the proposed algorithm effectively addresses the inventory optimization challenges. To rigorously evaluate this algorithm, the empirical data from fresh agricultural products supply chain inventory is considered. Experimental results corroborate the improved performance of the proposed inventory replenishment policy under stochastic demand patterns and lead time scenarios. The research findings hold managerial implications for policymakers to manage the inventory of agricultural products more effectively under uncertainty.
[AI-10] Novel Multi-Agent Action Masked Deep Reinforcement Learning for General Industrial Assembly Lines Balancing Problems
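[Quick Read]: This paper addresses the optimization of task and resource scheduling on industrial assembly lines, with the goal of production plans that are efficient, low-cost, and constraint-compliant. Traditional methods such as Integer Programming (IP) are computationally expensive at large scale, while heuristics such as genetic algorithms often fail to reach the optimum. The key to the solution is to model the assembly line as a Markov Decision Process (MDP) and build a Deep Reinforcement Learning (DRL) training framework on top of it. The core innovations are: first, an action-masking technique that ensures the agent selects only feasible actions, which speeds up convergence; second, a multi-agent architecture in which each workstation is managed by its own agent, combined with centralized training and decentralized execution, which reduces the dimensionality of the state and action spaces and improves scalability. Simulation results show faster convergence and better scheduling performance than a traditional model-based approach.
Link: https://arxiv.org/abs/2507.16635
Authors: Ali Mohamed Ali,Luca Tirel,Hashim A. Hashim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: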
Abstract:Efficient planning of activities is essential for modern industrial assembly lines to uphold manufacturing standards, prevent project constraint violations, and achieve cost-effective operations. While exact solutions to such challenges can be obtained through Integer Programming (IP), the dependence of the search space on input parameters often makes IP computationally infeasible for large-scale scenarios. Heuristic methods, such as Genetic Algorithms, can also be applied, but they frequently produce suboptimal solutions in extensive cases. This paper introduces a novel mathematical model of a generic industrial assembly line formulated as a Markov Decision Process (MDP), without imposing assumptions on the type of assembly line, a notable distinction from most existing models. The proposed model is employed to create a virtual environment for training Deep Reinforcement Learning (DRL) agents to optimize task and resource scheduling. To enhance the efficiency of agent training, the paper proposes two innovative tools. The first is an action-masking technique, which ensures the agent selects only feasible actions, thereby reducing training time. The second is a multi-agent approach, where each workstation is managed by an individual agent; as a result, the state and action spaces are reduced. A centralized training framework with decentralized execution is adopted, offering a scalable learning architecture for optimizing industrial assembly lines. This framework allows the agents to learn offline and subsequently provide real-time solutions during operations by leveraging a neural network that maps the current factory state to the optimal action. The effectiveness of the proposed scheme is validated through numerical simulations, demonstrating significantly faster convergence to the optimal solution compared to a comparable model-based approach.
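The action-masking idea in this abstract can be illustrated with a minimal sketch. It assumes the common formulation in which infeasible actions have their logits set to negative infinity before the softmax; the paper's own masking rule, the task list, and the feasibility flags below are not given in the abstract and are purely illustrative.

```python
import math

def masked_softmax(logits, feasible):
    """Softmax over feasible actions only; infeasible actions get probability 0."""
    if not any(feasible):
        raise ValueError("at least one action must be feasible")
    # Replace infeasible logits with -inf so exp() maps them to exactly 0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, feasible)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical workstation with four candidate tasks; tasks 1 and 3 are
# assumed to violate a precedence or resource constraint in this state.
logits = [2.0, 5.0, 1.0, 4.0]
feasible = [True, False, True, False]
probs = masked_softmax(logits, feasible)
```

Because the masked actions carry zero probability, the agent never samples them, so no training steps are spent learning that they are invalid.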
[AI-11] An Experimental Study of Split-Learning TinyML on Ultra-Low-Power Edge/IoT Nodes
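[Quick Read]: This paper addresses the memory and compute bottlenecks of microcontrollers, which limit running deep learning inference directly on ultra-low-power edge/IoT nodes. The key to the solution is split learning (SL), which divides inference into two parts: one executed locally on the sensor and the other off-loaded to a companion device. The intermediate activation tensors are transmitted over wireless protocols (ESP-NOW, BLE, UDP/IP, and TCP/IP), allowing a performance comparison on identical hardware. The experiments use what the authors describe as the first end-to-end split-learning testbed for TinyML, built on Espressif ESP32-S3 boards, to verify the feasibility and efficiency of deploying a split, 8-bit-quantized MobileNetV2 model and to show how the choice of communication method affects latency and energy consumption.
Link: https://arxiv.org/abs/2507.16594
Authors: Zied Jenhani,Mounir Bensalem,Jasenka Dizdarević,Admela Jukan
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: This paper is uploaded here for research community, thus it is for non-commercial purposes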
Abstract:Running deep learning inference directly on ultra-low-power edge/IoT nodes has been limited by the tight memory and compute budgets of microcontrollers. Split learning (SL) addresses this limitation by executing part of the inference process on the sensor and off-loading the remainder to a companion device. In the context of constrained devices and the related impact of low-power, over-the-air transport protocols, the performance of split learning remains largely unexplored. To the best of our knowledge, this paper presents the first end-to-end TinyML + SL testbed built on Espressif ESP32-S3 boards, designed to benchmark the over-the-air performance of split learning TinyML in edge/IoT environments. We benchmark the performance of a MobileNetV2 image recognition model, which is quantized to 8-bit integers, partitioned, and delivered to the nodes via over-the-air updates. The intermediate activations are exchanged through different wireless communication methods: ESP-NOW, BLE, and traditional UDP/IP and TCP/IP, enabling a head-to-head comparison on identical hardware. Measurements show that splitting the model after the block_16_project_BN layer generates a 5.66 kB tensor that traverses the link in 3.2 ms when UDP is used, achieving a steady-state round-trip latency of 5.8 s. ESP-NOW presents the most favorable RTT performance (3.7 s); BLE extends battery life further but increases latency beyond 10 s.
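Two quantities from this abstract lend themselves to a small worked sketch: 8-bit quantization of the activations sent across the split, and the link throughput implied by the UDP figures. The affine quantization scheme below is a generic one and is an assumption, not the paper's or any TinyML toolchain's exact procedure; the throughput figure further assumes 1 kB = 1000 bytes.

```python
def quantize_int8(values):
    """Affine (asymmetric) quantization of floats to signed 8-bit integers."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant tensor
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Approximate reconstruction on the companion device."""
    return [(x - zero_point) * scale for x in q]

def link_throughput_mbit_s(payload_bytes, seconds):
    """Effective throughput for a single one-way transfer."""
    return payload_bytes * 8 / seconds / 1e6

# Hypothetical activations at the split point; one byte each after quantization.
acts = [-0.8, 0.0, 0.3, 1.2]
q, scale, zp = quantize_int8(acts)
restored = dequantize_int8(q, scale, zp)

# 5.66 kB in 3.2 ms (figures from the abstract) is roughly 14 Mbit/s.
udp_rate = link_throughput_mbit_s(5660, 3.2e-3)
```

The one-way transfer time is small compared with the reported multi-second round-trip latency, which suggests that most of the round trip is spent elsewhere in the pipeline; the abstract does not break this down.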
[AI-12] AI for Better UX in Computer-Aided Engineering: Is Academia Catching Up with Industry Demands? A Multivocal Literature Review
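[Quick Read]: This paper addresses the efficiency and accessibility bottlenecks in the user experience (UX) of Computer-Aided Engineering (CAE) software, in particular the marked disconnect between academic research and industrial practice as Artificial Intelligence (AI) is increasingly integrated into CAE workflows. The key to the solution is a Multivocal Literature Review (MLR) that systematically maps how AI improves the UX of CAE software and identifies core opportunity areas that current research neglects, such as AI-powered guidance, adaptive interfaces, and workflow automation, thereby giving future research a clear direction and steering the integration of AI and CAE toward user-centered design.
Link: https://arxiv.org/abs/2507.16586
Authors: Choro Ulan Uulu,Mikhail Kulyabin,Layan Etaiwi,Nuno Miguel Martins Pacheco,Jan Joosten,Kerstin Röse,Filippos Petridis,Jan Bosch,Helena Holmström Olsson
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: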
Abstract:Computer-Aided Engineering (CAE) enables simulation experts to optimize complex models, but faces challenges in user experience (UX) that limit efficiency and accessibility. While artificial intelligence (AI) has demonstrated potential to enhance CAE processes, research integrating these fields with a focus on UX remains fragmented. This paper presents a multivocal literature review (MLR) examining how AI enhances UX in CAE software across both academic research and industry implementations. Our analysis reveals significant gaps between academic explorations and industry applications, with companies actively implementing LLMs, adaptive UIs, and recommender systems while academic research focuses primarily on technical capabilities without UX validation. Key findings demonstrate opportunities in AI-powered guidance, adaptive interfaces, and workflow automation that remain underexplored in current research. By mapping the intersection of these domains, this study provides a foundation for future work to address the identified research gaps and advance the integration of AI to improve CAE user experience.
[AI-13] Data-Driven Adaptive Gradient Recovery for Unstructured Finite Volume Computations
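[Quick Read]: This paper addresses the insufficient accuracy of gradient reconstruction in the Finite Volume Method (FVM) on unstructured meshes when solving hyperbolic conservation laws (such as the 2D Euler equations), especially the difficulty of maintaining physical consistency and numerical stability in shock-dominated regions. The key to the solution is a data-driven deep neural network architecture, a modified DeepONet (Deep Operator Network), which incorporates local mesh topology to ensure rotation invariance and imposes a first-order constraint on the learned operator to keep it mathematically sound. Training uses physics-informed regularization (entropy penalization, total variation diminishing penalization, and parameter regularization) to improve generalization and physical fidelity in complex flow configurations. Experiments show that the method preserves conservation and stability while improving accuracy by 20-60% over traditional second-order finite volume schemes, with better mesh convergence rates and computational efficiency.
Link: https://arxiv.org/abs/2507.16571
Authors: G. de Romémont,F. Renac,F. Chinesta,J. Nunez,D. Gueyffier
Affiliations: Unknown
Categories: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Analysis of PDEs (math.AP)
Comments: 19 pages, 13 Figures, 1 table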
Abstract:We present a novel data-driven approach for enhancing gradient reconstruction in unstructured finite volume methods for hyperbolic conservation laws, specifically for the 2D Euler equations. Our approach extends previous structured-grid methodologies to unstructured meshes through a modified DeepONet architecture that incorporates local geometry in the neural network. The architecture employs local mesh topology to ensure rotation invariance, while also ensuring a first-order constraint on the learned operator. The training methodology incorporates physics-informed regularization through entropy penalization, total variation diminishing penalization, and parameter regularization to ensure physically consistent solutions, particularly in shock-dominated regions. The model is trained on high-fidelity solution datasets derived from sine waves and randomized piecewise constant initial conditions with periodic boundary conditions, enabling robust generalization to complex flow configurations or geometries. Validation test cases from the literature, including challenging geometry configurations, demonstrate substantial improvements in accuracy compared to traditional second-order finite volume schemes. The method achieves gains of 20-60% in solution accuracy while enhancing computational efficiency. A convergence study has been conducted and reveals improved mesh convergence rates compared to the conventional solver. The proposed algorithm is faster and more accurate than the traditional second-order finite volume solver, enabling high-fidelity simulations on coarser grids while preserving the stability and conservation properties essential for hyperbolic conservation laws. This work is part of a new generation of solvers that are built by combining Machine-Learning (ML) tools with traditional numerical schemes, all while ensuring physical constraints on the results.
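The total variation diminishing penalization mentioned above can be sketched in one dimension. The paper works on 2D unstructured meshes and does not spell out its penalty in the abstract, so the 1D periodic setting and the hinge form below are simplifying assumptions chosen for illustration.

```python
def total_variation(u):
    """Discrete total variation of a periodic 1D field of cell averages."""
    n = len(u)
    return sum(abs(u[(i + 1) % n] - u[i]) for i in range(n))

def tvd_penalty(u_old, u_new):
    """Penalise only increases in total variation between two time steps."""
    return max(0.0, total_variation(u_new) - total_variation(u_old))

# A step profile, a smeared update, and an update with spurious overshoots.
u_old = [0.0, 0.0, 1.0, 1.0]
smeared = [0.0, 0.2, 0.8, 1.0]
overshoot = [0.0, -0.1, 1.1, 1.0]

penalty_smeared = tvd_penalty(u_old, smeared)      # essentially zero
penalty_overshoot = tvd_penalty(u_old, overshoot)  # about 0.4
```

Added to a training loss, a term of this kind discourages the learned reconstruction from creating new oscillations near shocks while leaving monotone updates unpenalized.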
[AI-14] TTMBA: Towards Text To Multiple Sources Binaural Audio Generation
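[Quick Read]: This paper addresses the problem that existing text-to-audio (TTA) generation methods output only mono audio and ignore spatial information, which makes immersive listening experiences hard to achieve. The key to the solution is a cascaded text-to-multisource binaural audio generation (TTMBA) method. A pretrained large language model (LLM) segments the text into a structured format that specifies the temporal and spatial attributes of each sound event; a pretrained mono audio generation network then generates an audio clip for each event; a binaural rendering neural network converts the clips to binaural audio using the spatial data from the LLM; finally, the binaural clips are arranged by event start time and combined, producing multisource binaural audio with temporal and spatial control.
Link: https://arxiv.org/abs/2507.16564
Authors: Yuxuan He,Xiaoran Yang,Ningning Pan,Gongping Huang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages,3 figures,2 tables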
Abstract:Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
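The final stage of the cascade, arranging binaural clips by start time, can be sketched as a simple overlay on a shared timeline. The event tuple layout, the additive mixing, and the tiny sample rate are assumptions for illustration; the abstract does not describe how the authors combine the clips.

```python
def mix_events(events, sample_rate):
    """Overlay stereo clips on a shared timeline according to their start times.

    events: non-empty list of (start_seconds, left_samples, right_samples),
    where both channels of a clip have the same length.
    """
    placed = []
    for start, left, right in sorted(events, key=lambda e: e[0]):
        placed.append((round(start * sample_rate), left, right))
    length = max(offset + len(left) for offset, left, _ in placed)
    mix_left = [0.0] * length
    mix_right = [0.0] * length
    for offset, left, right in placed:
        for i, (a, b) in enumerate(zip(left, right)):
            mix_left[offset + i] += a  # overlapping events simply add up
            mix_right[offset + i] += b
    return mix_left, mix_right

# Two hypothetical rendered events at a toy sample rate of 4 Hz.
events = [
    (0.25, [2.0, 2.0, 2.0], [1.0, 1.0, 1.0]),
    (0.0, [1.0, 1.0], [0.5, 0.5]),
]
left, right = mix_events(events, sample_rate=4)
```

A real pipeline would also need gain normalization to avoid clipping when many events overlap; that step is omitted here.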
[AI-15] Evaluating Social Acceptance of eXtended Reality (XR) Agent Technology: A User Study (Extended Version)
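[Quick Read]: This paper addresses the social acceptance of eXtended Reality (XR) agent technology in a specific social context, namely remote training for journalists, and asks how a web-based XR system that needs no specialized equipment (such as head-mounted displays) can deliver interactive training that users accept. The key to the solution is a remotely accessible XR training system, built on a modular toolkit, in which users interact with a virtual avatar, together with an adapted and extended Almere model that keeps existing attributes such as perceived ease of use and perceived usefulness and adds dependability and security to quantify user acceptance of the XR agent system. The findings contribute to the understanding of user perceptions and acceptance of XR agent technology and identify areas for improving the design of future XR systems.
Link: https://arxiv.org/abs/2507.16562
Authors: Megha Quamara,Viktor Schmuck,Cristina Iani,Axel Primavesi,Alexander Plaum,Luca Vigano
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 26 pages (18 pages main body, 8 pages user consent form), 3 figures, 7 tables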
Abstract:In this paper, we present the findings of a user study that evaluated the social acceptance of eXtended Reality (XR) agent technology, focusing on a remotely accessible, web-based XR training system developed for journalists. This system involves user interaction with a virtual avatar, enabled by a modular toolkit. The interactions are designed to provide tailored training for journalists in digital-remote settings, especially for sensitive or dangerous scenarios, without requiring specialized end-user equipment like headsets. Our research adapts and extends the Almere model, representing social acceptance through existing attributes such as perceived ease of use and perceived usefulness, along with added ones like dependability and security in the user-agent interaction. The XR agent was tested through a controlled experiment in a real-world setting, with data collected on users’ perceptions. Our findings, based on quantitative and qualitative measurements involving questionnaires, contribute to the understanding of user perceptions and acceptance of XR agent solutions within a specific social context, while also identifying areas for the improvement of XR systems.
[AI-16] A Comprehensive Data-centric Overview of Federated Graph Learning
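[Quick Read]: This paper addresses the lack of a systematic, data-centric taxonomy in current Federated Graph Learning (FGL) research. Existing surveys focus mostly on how methods from Federated Learning (FL) and Graph Machine Learning (GML) are combined and on simulated scenarios, and do not examine, from the perspective of data properties and usage, how FGL methods cope with data-centric constraints to improve model performance. The key to the solution is a two-level data-centric taxonomy: the first level, Data Characteristics, categorizes FGL studies by the structural and distributional properties of their datasets; the second level, Data Utilization, analyzes the training procedures and techniques used to overcome key data challenges. Each level is defined by three orthogonal criteria, giving a more systematic and instructive organizing framework for FGL research.
Link: https://arxiv.org/abs/2507.16541
Authors: Zhengyu Wu,Xunkai Li,Yinlin Zhu,Zekai Chen,Guochen Yan,Yanyu Yan,Hao Zhang,Yuming Ai,Xinmo Jin,Rong-Hua Li,Guoren Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
Comments: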
Abstract:In the era of big data applications, Federated Graph Learning (FGL) has emerged as a prominent solution that reconciles the tradeoff between optimizing the collective intelligence of decentralized dataset holders and preserving sensitive information to the greatest possible extent. Existing FGL surveys have contributed meaningfully but largely focus on integrating Federated Learning (FL) and Graph Machine Learning (GML), resulting in early-stage taxonomies that emphasize methodology and simulated scenarios. Notably, a data-centric perspective, which systematically examines FGL methods through the lens of data properties and usage, has not yet been used to reorganize FGL research, yet it is critical for assessing how FGL studies manage to tackle data-centric constraints to enhance model performance. This survey proposes a two-level data-centric taxonomy: Data Characteristics, which categorizes studies based on the structural and distributional properties of datasets used in FGL, and Data Utilization, which analyzes the training procedures and techniques employed to overcome key data-centric challenges. Each taxonomy level is defined by three orthogonal criteria, each representing a distinct data-centric configuration. Beyond taxonomy, this survey examines FGL integration with Pretrained Large Models, showcases realistic applications, and highlights future directions aligned with emerging trends in GML.
[AI-17] Symbolic Graph Intelligence: Hypervector Message Passing for Learning Graph-Level Patterns with Tsetlin Machines
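[Quick Read]: This paper addresses limited interpretability and inefficient representations in graph classification. The key to the solution is a multilayered symbolic framework that encodes graphs with sparse binary hypervectors and Tsetlin Machines: a structured message-passing mechanism binds and bundles node, edge, and attribute information into symbolic hypervectors, binding layer by layer from node attributes to edge relations to structural roles, which preserves the hierarchical semantics of the graph and yields a compact, discrete representation. A local interpretability framework is also formulated, so that the model retains competitive accuracy while offering markedly more symbolic transparency than conventional neural graph models.
Link: https://arxiv.org/abs/2507.16537
Authors: Christian D. Blakely
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, for ICTM '25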
Abstract:We propose a multilayered symbolic framework for general graph classification that leverages sparse binary hypervectors and Tsetlin Machines. Each graph is encoded through structured message passing, where node, edge, and attribute information are bound and bundled into a symbolic hypervector. This process preserves the hierarchical semantics of the graph through layered binding from node attributes to edge relations to structural roles, resulting in a compact, discrete representation. We also formulate a local interpretability framework, which supports a key advantage of our approach: it is locally interpretable. We validate our method on TUDataset benchmarks, demonstrating competitive accuracy with strong symbolic transparency compared to neural graph models.
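Binding and bundling of sparse binary hypervectors can be sketched as follows. The abstract does not say which operators the paper uses, so XOR binding and OR bundling are assumptions taken from common hyperdimensional-computing practice, and the dimensionality, sparsity, and symbol names are illustrative.

```python
import random

DIM = 2048   # hypervector dimensionality (illustrative choice)
ACTIVE = 32  # number of set bits per random hypervector (illustrative)

def random_hv(rng):
    """Sparse binary hypervector stored as a Python int bitmask."""
    hv = 0
    for idx in rng.sample(range(DIM), ACTIVE):
        hv |= 1 << idx
    return hv

def bind(a, b):
    """Bind two hypervectors with XOR; XOR is its own inverse."""
    return a ^ b

def bundle(hvs):
    """Bundle (superpose) several hypervectors with bitwise OR."""
    out = 0
    for hv in hvs:
        out |= hv
    return out

def overlap(a, b):
    """Number of bit positions set in both hypervectors."""
    return bin(a & b).count("1")

rng = random.Random(0)
role_attr, role_edge = random_hv(rng), random_hv(rng)
carbon, single_bond = random_hv(rng), random_hv(rng)

# One message: the node's attribute and its incident edge type, each bound
# to a role vector and then bundled into a single symbolic hypervector.
node_msg = bundle([bind(role_attr, carbon), bind(role_edge, single_bond)])
```

Because the result stays binary, it can be fed to a Tsetlin Machine as a vector of Boolean features, which is the property the framework relies on for symbolic transparency.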
[AI-18] confopt: A Library for Implementation and Evaluation of Gradient-based One-Shot NAS Methods
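[Quick Read]: This paper addresses two major challenges in gradient-based one-shot Neural Architecture Search (NAS): first, evaluation relies too heavily on the DARTS benchmark, so that reported gains have saturated and significant improvements are hard to tell apart from noise; second, existing implementations are scattered across separate repositories, which hinders fair, reproducible comparison and further development. The key to the solution is an extensible library called Configurable Optimizer (confopt). It provides a minimal API for quickly integrating new search spaces and decomposes NAS optimizers into composable core components, which unifies the development workflow. On top of it, the authors build a new suite of DARTS-based benchmarks and an evaluation protocol that reveal a critical flaw in current evaluation practice.
Link: https://arxiv.org/abs/2507.16533
Authors: Abhash Kumar Jha,Shakiba Moradian,Arjun Krishnakumar,Martin Rapp,Frank Hutter
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AutoML 25 ABCD Track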
Abstract:Gradient-based one-shot neural architecture search (NAS) has significantly reduced the cost of exploring architectural spaces with discrete design choices, such as selecting operations within a model. However, the field faces two major challenges. First, evaluations of gradient-based NAS methods heavily rely on the DARTS benchmark, despite the existence of other available benchmarks. This overreliance has led to saturation, with reported improvements often falling within the margin of noise. Second, implementations of gradient-based one-shot NAS methods are fragmented across disparate repositories, complicating fair and reproducible comparisons and further development. In this paper, we introduce Configurable Optimizer (confopt), an extensible library designed to streamline the development and evaluation of gradient-based one-shot NAS methods. Confopt provides a minimal API that makes it easy for users to integrate new search spaces, while also supporting the decomposition of NAS optimizers into their core components. We use this framework to create a suite of new DARTS-based benchmarks, and combine them with a novel evaluation protocol to reveal a critical flaw in how gradient-based one-shot NAS methods are currently assessed. The code can be found at this https URL.
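For readers unfamiliar with gradient-based one-shot NAS, the core mechanism is a continuous relaxation of the discrete choice between operations. The sketch below follows the DARTS-style softmax-weighted mixture with toy scalar operations; it does not use or reproduce the confopt API, and every name in it is hypothetical.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Candidate operations on one edge of a cell (toy scalar stand-ins).
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}

def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    weights = softmax(list(alphas.values()))
    return sum(w * CANDIDATE_OPS[name](x) for w, name in zip(weights, alphas))

def discretize(alphas):
    """After search, keep the operation with the largest architecture weight."""
    return max(alphas, key=alphas.get)

uniform = {"identity": 0.0, "double": 0.0, "zero": 0.0}
searched = {"identity": 0.1, "double": 1.5, "zero": -0.3}
y = mixed_op(3.0, uniform)   # equal weights: (3 + 6 + 0) / 3
chosen = discretize(searched)
```

In a real search the architecture weights are trained by gradient descent alongside the network weights; the decomposition of that loop into reusable components is what the library is described as providing.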
[AI-19] Analogy making as amortised model construction
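[Quick Read]: This paper asks how internal models can be constructed and used efficiently under limited resources to support decision-making in novel situations. The core challenge is balancing accuracy against constructibility: a model must be faithful enough to the environment to support effective planning, yet computationally tractable to build quickly. The key to the solution is analogy, formalized as partial homomorphisms between Markov Decision Processes (MDPs), which lets agents reuse solution-relevant structure from past experience and amortise the computational cost of both model construction (construal) and planning. Abstract modules derived from earlier tasks then serve as composable building blocks for quickly generating internal models of new tasks, which supports flexible adaptation of policies and representations across domains.
Link: https://arxiv.org/abs/2507.16511
Authors: David G. Nagy,Tingke Shen,Hanqi Zhou,Charley M. Wu,Peter Dayan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: RLC 2025 Finding the Frame Workshop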
Abstract:Humans flexibly construct internal models to navigate novel situations. To be useful, these internal models must be sufficiently faithful to the environment that resource-limited planning leads to adequate outcomes; equally, they must be tractable to construct in the first place. We argue that analogy plays a central role in these processes, enabling agents to reuse solution-relevant structure from past experiences and amortise the computational costs of both model construction (construal) and planning. Formalising analogies as partial homomorphisms between Markov decision processes, we sketch a framework in which abstract modules, derived from previous construals, serve as composable building blocks for new ones. This modular reuse allows for flexible adaptation of policies and representations across domains with shared structural essence.
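A partial homomorphism between two MDPs can be made concrete with a small check. The sketch assumes deterministic MDPs and state-independent action maps, which is a simplification; the paper's formal definition is not given in the abstract, and the toy corridor and chain below are invented for illustration.

```python
def is_partial_homomorphism(src, dst, f, g):
    """Check that mapped transitions and rewards agree wherever f and g are defined.

    src, dst: dicts mapping (state, action) -> (next_state, reward).
    f: partial state map, g: partial action map (plain dicts).
    """
    for (s, a), (s_next, r) in src.items():
        if s not in f or a not in g or s_next not in f:
            continue  # outside the domain of the partial map
        if dst.get((f[s], g[a])) != (f[s_next], r):
            return False
    return True

# Source task: a corridor with an extra "jump" action that has no analogue.
corridor = {
    (0, "east"): (1, 0.0),
    (1, "east"): (2, 1.0),
    (2, "east"): (2, 0.0),
    (0, "jump"): (0, -1.0),
}
# Target task: an abstract three-state chain.
chain = {
    ("a", "step"): ("b", 0.0),
    ("b", "step"): ("c", 1.0),
    ("c", "step"): ("c", 0.0),
}
good = is_partial_homomorphism(corridor, chain, {0: "a", 1: "b", 2: "c"}, {"east": "step"})
bad = is_partial_homomorphism(corridor, chain, {0: "a", 1: "c", 2: "b"}, {"east": "step"})
```

When the check passes, a policy computed once on the chain can be pulled back to the corridor through the maps, which is the sense in which the cost of planning is amortised.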
[AI-20] Agentic RAG with Knowledge Graphs for Complex Multi-Hop Reasoning in Real-World Applications ECAI2025
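[Quick Read]: This paper addresses the shortcomings of conventional Retrieval-Augmented Generation (RAG) systems on complex queries: limited answer content, purely extractive answers, difficulty with multiple targeted retrievals, and an inability to handle intricate relationships between entities, which is especially pronounced in knowledge-intensive domains. The key to the solution is INRAExplorer, an agentic RAG system based on Large Language Models (LLMs). Its core innovation is a multi-tool architecture combined with a comprehensive knowledge graph built from open access INRAE publications, which lets the system run iterative, targeted queries, retrieve exhaustive datasets (for example, all publications by an author), perform multi-hop reasoning, and return structured, comprehensive answers, improving the quality and depth of knowledge interaction in specialized fields.
Link: https://arxiv.org/abs/2507.16507
Authors: Jean Lelong,Adnane Errazine,Annabelle Blangero
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: ECAI 2025 demo track, 4 pages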
Abstract:Conventional Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) but often fall short on complex queries, delivering limited, extractive answers and struggling with multiple targeted retrievals or navigating intricate entity relationships. This is a critical gap in knowledge-intensive domains. We introduce INRAExplorer, an agentic RAG system for exploring the scientific data of INRAE (France’s National Research Institute for Agriculture, Food and Environment). INRAExplorer employs an LLM-based agent with a multi-tool architecture to dynamically engage a rich knowledge base, through a comprehensive knowledge graph derived from open access INRAE publications. This design empowers INRAExplorer to conduct iterative, targeted queries, retrieve exhaustive datasets (e.g., all publications by an author), perform multi-hop reasoning, and deliver structured, comprehensive answers. INRAExplorer serves as a concrete illustration of enhancing knowledge interaction in specialized fields.
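The difference between exhaustive retrieval and multi-hop reasoning over a knowledge graph can be shown on a toy triple store. Every entity, relation name, and helper below is invented for illustration; the abstract does not describe INRAExplorer's schema or tools.

```python
# Toy knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("alice", "authored", "paper_1"),
    ("alice", "authored", "paper_2"),
    ("bob", "authored", "paper_2"),
    ("paper_1", "has_topic", "soil_carbon"),
    ("paper_2", "has_topic", "crop_yield"),
]

def objects(subject, relation):
    return sorted(o for s, r, o in TRIPLES if s == subject and r == relation)

def subjects(relation, obj):
    return sorted(s for s, r, o in TRIPLES if r == relation and o == obj)

def publications_by(author):
    """Exhaustive retrieval: every publication linked to the author."""
    return objects(author, "authored")

def coauthors_on_topic(author, topic):
    """Multi-hop query: author -> papers -> topic filter -> other authors."""
    found = set()
    for paper in publications_by(author):
        if topic in objects(paper, "has_topic"):
            found.update(subjects("authored", paper))
    found.discard(author)
    return sorted(found)

all_papers = publications_by("alice")
partners = coauthors_on_topic("alice", "crop_yield")
```

A top-k passage retriever gives no guarantee of returning every matching item, whereas a graph traversal like this one is complete by construction; in an agentic setup, functions of this kind would be exposed to the LLM as callable tools.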
[AI-21] ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
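[Quick Read]: This paper addresses two problems: traditional code translation methods rely on handcrafted rules and therefore lack flexibility and scalability, while advanced language models are constrained by closed-source, API-based implementations that raise data security and dependency concerns. The key to the solution is the Auto-Train for Code Translation (ACT) framework, which improves code translation by supporting in-house finetuning of open-source Large Language Models (LLMs). Its core innovations are a synthetic data generation module, which builds high-quality, diverse training datasets from initial code samples combined with unit tests, and a controller module, which dynamically adjusts hyperparameters and decides from real-time evaluation results whether to continue training, generate additional targeted data, or stop, narrowing the performance gap between open-source and closed-source solutions.
Link: https://arxiv.org/abs/2507.16478
Authors: Shreya Saxena,Siva Prasad,Zishan Ahmad,Vishal Vaddina
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: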
Abstract:Code translation is a crucial process in software development and migration projects, enabling interoperability between different programming languages and enhancing software adaptability and thus longevity. Traditional automated translation methods rely heavily on handcrafted transformation rules, which often lack flexibility and scalability. Meanwhile, advanced language models present promising alternatives but are often limited by proprietary, API-based implementations that raise concerns over data security and reliance. In this paper, we present Auto-Train for Code Translation (ACT), an innovative framework that aims to improve code translation capabilities by enabling in-house finetuning of open-source Large Language Models (LLMs). ACT’s automated pipeline significantly boosts the performance of these models, narrowing the gap between open-source accessibility and the high performance of closed-source solutions. Central to ACT is its synthetic data generation module, which builds extensive, high-quality datasets from initial code samples, incorporating unit tests to ensure functional accuracy and diversity. ACT’s evaluation framework incorporates execution-level checks, offering a comprehensive assessment of translation quality. A key feature in ACT is its controller module, which manages the entire pipeline by dynamically adjusting hyperparameters, orchestrating iterative data generation, and finetuning based on real-time evaluations. This enables ACT to intelligently optimize when to continue training, generate additional targeted training data, or stop the process. Our results demonstrate that ACT consistently enhances the effectiveness of open-source models, offering businesses and developers a secure and reliable alternative. Additionally, applying our data generation pipeline to industry-scale migration projects has led to a notable increase in developer acceleration.
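The controller's three-way decision (continue training, generate more targeted data, or stop) can be sketched as a small rule over the evaluation history. The thresholds, the plateau test, and the function name are hypothetical; the abstract does not state how ACT's controller makes this choice, and it additionally adjusts hyperparameters, which is not modeled here.

```python
def next_action(history, target=0.9, patience=2, min_gain=0.01):
    """Decide the next pipeline step from a list of evaluation pass rates.

    Returns "stop", "generate_data", or "continue_training".
    """
    if not history:
        return "continue_training"
    if history[-1] >= target:
        return "stop"  # the execution-level checks are passing often enough
    recent = history[-(patience + 1):]
    # A plateau: the pass rate barely moved over the last `patience` rounds.
    stalled = len(recent) == patience + 1 and recent[-1] - recent[0] < min_gain
    return "generate_data" if stalled else "continue_training"

improving = next_action([0.50, 0.60, 0.70])
plateaued = next_action([0.700, 0.702, 0.705])
finished = next_action([0.95])
```

The intent of such a rule is that a plateau signals the model has exhausted the current dataset, so new synthetic examples aimed at the failing cases are more useful than further epochs.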
[AI-22] Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPs
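[Quick Read]: This paper addresses the high computational cost and low efficiency of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs), which stem from generating explicit step-by-step text. The core of the solution is a framework for implicit reasoning that maps the model's "thinking" to abstract actions (options) in a latent space and models them with hierarchical reinforcement learning. The key innovation is the Variational Markovian Option Critic (VMOC), an off-policy algorithm based on variational inference that learns a diverse set of options as latent embeddings within the HiT-MDP framework. By extending the theory of continuous Markov Decision Process (MDP) homomorphisms, the authors prove that learning a policy in the simplified abstract latent space preserves the optimality of the solution to the original, complex problem. A cold-start procedure additionally uses supervised fine-tuning (SFT) data to distill human reasoning demonstrations into the latent option space, giving the model a rich initialization. Experiments report strong performance on both logical reasoning and locomotion tasks, supporting the approach as a principled method for learning abstract skills.
Link: https://arxiv.org/abs/2507.16473
Authors: Chang Li,Yaren Zhang,Haoran Lv,Qiong Cao,Chao Xue,Xiaodong He
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: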
Abstract:Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model “thinks” in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model’s reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
[AI-23] Improving ASP-based ORS Schedules through Machine Learning Predictions
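[Quick Read]: This paper addresses two limitations of current approaches to the Operating Room Scheduling (ORS) problem: they cannot generate provisional schedules, and the resulting schedules are not always robust. Existing methods based on Answer Set Programming (ASP) can only verify whether the encoding aligns with the actual data or suggest alternative schedules; they cannot produce a provisional schedule. The paper therefore combines inductive and deductive techniques: machine learning algorithms first predict surgery durations from historical data, which makes provisional schedules possible; the confidence of those predictions is then used as an additional input, and the ASP encoding is updated accordingly to compute more robust schedules. Results on historical data from ASL1 Liguria in Italy confirm the viability of the integrated approach.
Link: https://arxiv.org/abs/2507.16454
Authors: Pierangela Bruno,Carmine Dodaro,Giuseppe Galatà,Marco Maratea,Marco Mochi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: 17 pages, International Conference on Logic Programming, Under consideration in Theory and Practice of Logic Programming (TPLP)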
Abstract:The Operating Room Scheduling (ORS) problem deals with the optimization of daily operating room surgery schedules. It is a challenging problem subject to many constraints, such as determining the starting time of different surgeries and allocating the required resources, including the availability of beds in different department units. Recently, solutions to this problem based on Answer Set Programming (ASP) have been delivered. Such solutions are overall satisfying but, when applied to real data, they can currently only verify whether the encoding aligns with the actual data and, at most, suggest alternative schedules that could have been computed. As a consequence, it is not currently possible to generate provisional schedules. Furthermore, the resulting schedules are not always robust. In this paper, we integrate inductive and deductive techniques for solving these issues. We first employ machine learning algorithms to predict the surgery duration, from historical data, to compute provisional schedules. Then, we consider the confidence of such predictions as an additional input to our problem and update the encoding correspondingly in order to compute more robust schedules. Results on historical data from the ASL1 Liguria in Italy confirm the viability of our integration. Under consideration in Theory and Practice of Logic Programming (TPLP).
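One way prediction confidence could enter a schedule is by padding uncertain durations before assignment. The sketch below is a greedy first-fit in Python, not an ASP encoding, and the buffer rule, capacity, and surgery list are assumptions made for illustration; the abstract does not say how the authors' encoding uses the confidence values.

```python
def padded_duration(predicted_minutes, confidence, max_buffer=0.5):
    """Inflate a predicted duration more when the prediction is less certain.

    confidence is in [0, 1]; the linear buffer rule is an illustrative assumption.
    """
    return predicted_minutes * (1.0 + max_buffer * (1.0 - confidence))

def first_fit(surgeries, room_capacity, rooms):
    """Greedy first-fit assignment of padded surgeries to operating rooms."""
    load = [0.0] * rooms
    plan = {}
    for name, minutes, confidence in surgeries:
        need = padded_duration(minutes, confidence)
        for room in range(rooms):
            if load[room] + need <= room_capacity:
                load[room] += need
                plan[name] = room
                break
        else:
            plan[name] = None  # left unscheduled for this day
    return plan

# (name, predicted minutes, prediction confidence), all hypothetical.
surgeries = [("s1", 120, 1.0), ("s2", 100, 0.2), ("s3", 90, 0.9)]
plan = first_fit(surgeries, room_capacity=240, rooms=2)
```

Without padding, s2 would fit after s1 in room 0 (220 of 240 minutes); with its low confidence it is treated as about 140 minutes and moves to room 1, leaving slack that absorbs an overrun.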
[AI-24] From model-based learning to model-free behaviour with Meta-Interpretive Learning
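[Quick Read]: This paper addresses how to build an autonomous agent that can act independently in novel environments, which requires both model-based and model-free capabilities. The key to the solution is to use Meta-Interpretive Learning to learn a model-based Solver, and then use that Solver to train a model-free Controller that can solve the same planning problems, combining the two sets of capabilities and allowing their equivalence to be tested. Experiments in two kinds of environment, randomly generated mazes and lake maps with wide open areas, show that every navigation problem solved by the Solver is also solved by the Controller, indicating that the two are equivalent in problem-solving ability.
Link: https://arxiv.org/abs/2507.16434
Authors: Stassa Patsantzis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: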
Abstract:A “model” is a theory that describes the state of an environment and the effects of an agent’s decisions on the environment. A model-based agent can use its model to predict the effects of its future actions and so plan ahead, but must know the state of the environment. A model-free agent cannot plan, but can act without a model and without completely observing the environment. An autonomous agent capable of acting independently in novel environments must combine both sets of capabilities. We show how to create such an agent with Meta-Interpretive Learning used to learn a model-based Solver used to train a model-free Controller that can solve the same planning problems as the Solver. We demonstrate the equivalence in problem-solving ability of the two agents on grid navigation problems in two kinds of environment: randomly generated mazes, and lake maps with wide open areas. We find that all navigation problems solved by the Solver are also solved by the Controller, indicating the two are equivalent.
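The relationship between a model-based Solver and a model-free Controller can be sketched on a grid. Here breadth-first search over a known map stands in for the Solver, and a state-to-action lookup table stands in for the Controller; the paper learns its Solver with Meta-Interpretive Learning, which this sketch does not attempt, and the grid and names are illustrative.

```python
from collections import deque

MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}

def solve(grid, goal):
    """Model-based step: BFS backwards from the goal over the known map.

    Returns a cell -> action table that a controller can follow.
    """
    rows, cols = len(grid), len(grid[0])
    policy, seen, queue = {}, {goal}, deque([goal])
    while queue:
        r, c = queue.popleft()
        for action, (dr, dc) in MOVES.items():
            pr, pc = r - dr, c - dc  # cell from which `action` leads to (r, c)
            inside = 0 <= pr < rows and 0 <= pc < cols
            if inside and grid[pr][pc] == "." and (pr, pc) not in seen:
                seen.add((pr, pc))
                policy[(pr, pc)] = action
                queue.append((pr, pc))
    return policy

def run_controller(policy, start, goal, max_steps=100):
    """Model-free execution: look up the action for the current cell only."""
    pos, steps = start, 0
    while pos != goal and steps < max_steps:
        dr, dc = MOVES[policy[pos]]
        pos = (pos[0] + dr, pos[1] + dc)
        steps += 1
    return pos, steps

grid = ["..#",
        ".#.",
        "..."]
policy = solve(grid, goal=(1, 2))
final, steps = run_controller(policy, start=(0, 0), goal=(1, 2))
```

The controller never consults the map or plans ahead; it only reacts to its current cell, yet it reaches the goal along the same shortest path the solver would compute.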
[AI-25] Beyond Algorethics: Addressing the Ethical and Anthropological Challenges of AI Recommender Systems
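[Quick Read]: This paper addresses the ethical and anthropological challenges raised by AI-driven Recommender Systems (RSs) as they shape digital environments and social interactions, in particular that they reduce human complexity to quantifiable dimensions, exploit user vulnerabilities, and prioritize engagement over well-being. Existing technically oriented approaches based on algorethics are necessary but insufficient, so the paper proposes a comprehensive framework for human-centered RS design whose key element is the integration of interdisciplinary perspectives, regulatory strategies, and educational initiatives, so that AI systems foster rather than undermine human autonomy and societal flourishing.
Link: https://arxiv.org/abs/2507.16430
Authors: Octavian M. Machidon
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: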
Abstract:In this paper, I examine the ethical and anthropological challenges posed by AI-driven recommender systems (RSs), which have become central to shaping digital environments and social interactions. By curating personalized content, RSs do not merely reflect user preferences but actively construct individual experiences across social media, entertainment platforms, and e-commerce. Despite their ubiquity, the ethical implications of RSs remain insufficiently explored, even as concerns over privacy, autonomy, and mental well-being intensify. I argue that existing ethical approaches, including algorethics, the effort to embed ethical principles into algorithmic design, are necessary but ultimately inadequate. RSs inherently reduce human complexity to quantifiable dimensions, exploit user vulnerabilities, and prioritize engagement over well-being. Addressing these concerns requires moving beyond purely technical solutions. I propose a comprehensive framework for human-centered RS design, integrating interdisciplinary perspectives, regulatory strategies, and educational initiatives to ensure AI systems foster rather than undermine human autonomy and societal flourishing.
[AI-26] Identifying Pre-training Data in LLMs: A Neuron Activation-Based Detection Framework
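[Quick Read]: This paper addresses the legal and ethical risks that arise when the pre-training data of Large Language Models (LLMs) contains copyrighted content or private information, along with concerns about dataset contamination and internalized biases. The Pre-Training Data Detection (PDD) task was proposed in response, to identify whether specific data was part of an LLM's pre-training corpus. Existing PDD methods usually rely on superficial features such as prediction confidence or loss, and their performance is limited. The key contribution is the NA-PDD algorithm, which improves detection by analyzing the different neuron activation patterns that training and non-training data trigger during LLM inference. The paper also builds the CCNewsPDD benchmark, which applies rigorous data transformations so that training and non-training data have consistent time distributions, making evaluation more reliable. Experiments show that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.
Link: https://arxiv.org/abs/2507.16414
Authors: Hongyi Tang,Zhihao Zhu,Yi Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: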
Abstract:The performance of large language models (LLMs) is closely tied to their training data, which can include copyrighted material or private information, raising legal and ethical concerns. Additionally, LLMs face criticism for dataset contamination and internalizing biases. To address these issues, the Pre-Training Data Detection (PDD) task was proposed to identify if specific data was included in an LLM’s pre-training corpus. However, existing PDD methods often rely on superficial features like prediction confidence and loss, resulting in mediocre performance. To improve this, we introduce NA-PDD, a novel algorithm analyzing differential neuron activation patterns between training and non-training data in LLMs. This is based on the observation that these data types activate different neurons during LLM inference. We also introduce CCNewsPDD, a temporally unbiased benchmark employing rigorous data transformations to ensure consistent time distributions between training and non-training data. Our experiments demonstrate that NA-PDD significantly outperforms existing methods across three benchmarks and multiple LLMs.
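The idea of separating training from non-training data by which neurons fire can be illustrated with a toy sketch. It is not the NA-PDD algorithm itself: the function names, the firing threshold, the `min_gap` parameter, and the hand-written activation vectors are all assumptions made for illustration, and a real system would read activations from an actual LLM.

```python
from typing import List, Set, Tuple


def active_neurons(activations: List[float], threshold: float = 0.0) -> Set[int]:
    """Indices of neurons whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}


def discriminative_neurons(
    member_acts: List[List[float]],
    nonmember_acts: List[List[float]],
    min_gap: float = 0.5,
) -> Tuple[Set[int], Set[int]]:
    """Split neurons by how much more often they fire on one group than the other."""
    member_fired = [active_neurons(s) for s in member_acts]
    nonmember_fired = [active_neurons(s) for s in nonmember_acts]
    member_like, nonmember_like = set(), set()
    for i in range(len(member_acts[0])):
        member_rate = sum(i in f for f in member_fired) / len(member_fired)
        nonmember_rate = sum(i in f for f in nonmember_fired) / len(nonmember_fired)
        gap = member_rate - nonmember_rate
        if gap >= min_gap:
            member_like.add(i)
        elif gap <= -min_gap:
            nonmember_like.add(i)
    return member_like, nonmember_like


def membership_score(activations, member_like, nonmember_like) -> int:
    """Positive score suggests training data; negative suggests unseen data."""
    fired = active_neurons(activations)
    return len(fired & member_like) - len(fired & nonmember_like)


# Hypothetical activations of four neurons on known training / non-training texts.
member_acts = [[0.9, 0.0, 0.4, 0.0], [0.8, 0.0, 0.0, 0.1], [0.7, 0.0, 0.3, 0.0]]
nonmember_acts = [[0.0, 0.6, 0.2, 0.0], [0.0, 0.9, 0.0, 0.0], [0.1, 0.5, 0.3, 0.0]]

member_like, nonmember_like = discriminative_neurons(member_acts, nonmember_acts)
score_seen = membership_score([0.5, 0.0, 0.2, 0.0], member_like, nonmember_like)
score_unseen = membership_score([0.0, 0.7, 0.0, 0.0], member_like, nonmember_like)
```

In this toy data, neuron 0 fires mostly on training texts and neuron 1 only on non-training texts, so a query that activates neuron 0 scores positive and one that activates neuron 1 scores negative.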
[AI-27] Self-Supervised Inductive Logic Programming
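Quick read: This paper asks how to perform self-supervised Inductive Logic Programming (ILP) when neither a problem-specific background theory nor negative examples are available. Approaches such as Meta-Interpretive Learning (MIL) depend on an expert-provided background theory and negative examples to learn recursive logic programs that generalize well, but such prior information is often hard to obtain in practice. The key to the solution is a new self-supervised MIL setting and a new algorithm, implemented in a system called Poker, that needs only a few positive labelled examples and automatically generates and labels new positive and negative examples during learning. The paper also introduces a general second-order background theory in Second Order Definite Normal Form (SONF), which lets the learner cover all programs in a target class without a background theory tailored to each task. Experiments show that Poker learns grammars for context-free and L-system languages without negative examples and that its performance improves as more examples are generated, whereas the baseline Louise over-generalizes when deprived of negative examples.
Link: https://arxiv.org/abs/2507.16405
Authors: Stassa Patsantzis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: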
Abstract:Inductive Logic Programming (ILP) approaches like Meta-Interpretive Learning (MIL) can learn, from few examples, recursive logic programs with invented predicates that generalise well to unseen instances. This ability relies on a background theory and negative examples, both carefully selected with expert knowledge of a learning problem and its solutions. But what if such a problem-specific background theory or negative examples are not available? We formalise this question as a new setting for Self-Supervised ILP and present a new MIL algorithm that learns in the new setting from some positive labelled, and zero or more unlabelled examples, and automatically generates, and labels, new positive and negative examples during learning. We implement this algorithm in Prolog in a new MIL system, called Poker. We compare Poker to state-of-the-art MIL system Louise on experiments learning grammars for Context-Free and L-System languages from labelled, positive example strings, no negative examples, and just the terminal vocabulary of a language, seen in examples, as a first-order background theory. We introduce a new approach for the principled selection of a second-order background theory as a Second Order Definite Normal Form (SONF), sufficiently general to learn all programs in a class, thus removing the need for a background theory tailored to a learning task. We find that Poker’s performance improves with increasing numbers of automatically generated examples while Louise, bereft of negative examples, over-generalises.
[AI-28] LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning
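Quick read: This paper tackles tangled commits in software development, where a developer mixes several unrelated code changes in a single commit and thereby makes code review and maintenance less efficient. Existing rule-based, feature-based, and graph-based methods tend to rely on shallow signals and struggle to distinguish explicit dependencies (such as control flow and data flow) from implicit ones (such as semantic or conceptual relationships). The key to the solution is ColaUntangle, a collaborative consultation framework that models both kinds of dependency with a multi-agent architecture: one agent focuses on explicit dependencies, another on implicit ones, and a reviewer agent reconciles their views through iterative consultation. The framework also builds multi-version Program Dependency Graphs (delta-PDG) to capture symbolic and semantic context, which improves the accuracy of automated commit untangling.
Link: https://arxiv.org/abs/2507.16395
Authors: Bo Hou,Xin Tan,Kai Zheng,Fang Liu,Yinghao Zhu,Li Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: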
Abstract:Atomic commits, each of which addresses a single development concern, are a best practice in software development. However, developers frequently produce tangled commits that mix unrelated changes due to practical constraints or unclear boundaries, negatively impacting code review and maintenance. Although prior commit untangling approaches (rule-based, feature-based, or graph-based) have made progress, they often rely on shallow signals and fail to distinguish between explicit dependencies (e.g., control/data flow) and implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture explicit and implicit contextual information, we construct multi-version Program Dependency Graphs (delta-PDG), enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely-used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 100% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.
[AI-29] Application of LLM Guided Reinforcement Learning in Formation Control with Collision Avoidance IROS2025
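Quick read: This paper addresses slow policy-network convergence in Multi-Agent Reinforcement Learning (MARL) for Formation Control with Collision Avoidance (FCCA), which stems from the difficulty of designing an effective reward function. The key to the solution is a framework built on large language models (LLMs): the LLM analyzes the task priorities and the information observable to each agent and generates reward functions that can be adjusted online. The adjustment is driven by more advanced evaluation metrics rather than by the raw reward signal, which lets the system achieve formation control and obstacle avoidance in dynamic environments more efficiently, with fewer iterations and better final performance.
Link: https://arxiv.org/abs/2507.16382
Authors: Chenhao Yao,Zike Yuan,Xiaoxu Liu,Chi Zhu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted by IROS 2025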
Abstract:Multi-Agent Systems (MAS) excel at accomplishing complex objectives through the collaborative efforts of individual agents. Among the methodologies employed in MAS, Multi-Agent Reinforcement Learning (MARL) stands out as one of the most efficacious algorithms. However, when confronted with the complex objective of Formation Control with Collision Avoidance (FCCA), it is difficult to design an effective reward function that facilitates swift convergence of the policy network to an optimal solution. In this paper, we introduce a novel framework that aims to overcome this challenge. By prompting large language models (LLMs) with the prioritization of tasks and the observable information available to each agent, our framework generates reward functions that can be dynamically adjusted online based on evaluation outcomes by employing more advanced evaluation metrics rather than the rewards themselves. This mechanism enables the MAS to simultaneously achieve formation control and obstacle avoidance in dynamic environments with enhanced efficiency, requiring fewer iterations to reach superior performance levels. Our empirical studies, conducted in both simulation and real-world settings, validate the practicality and effectiveness of our proposed approach.
[AI-30] Depth Gives a False Sense of Privacy: LLM Internal States Inversion USENIX-SECURITY2025
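Quick read: This paper examines the privacy and safety risks created when the Internal States (ISs) of large language models (LLMs) are exposed in real deployments. ISs have traditionally been considered irreversible to inputs because of optimization difficulties and the highly abstract representations in deep layers; the paper challenges this assumption with four inversion attacks. First, it designs two white-box optimization-based attacks for low-depth and high-depth ISs, which use a two-phase inversion process to avoid local minima. Second, it extends the optimization attack to the more practical setting of black-box weight access by exploiting transferability between a source LLM and its derived LLMs. Finally, it introduces a generation-based attack that treats inversion as a translation task and reconstructs inputs with a dedicated inversion model. Experiments on long-text scenarios such as medical consulting and coding assistance show markedly higher semantic similarity and token matching rates, which confirms that ISs can be inverted effectively, exposes the limits of current defenses, and informs the design of more robust protections.
Link: https://arxiv.org/abs/2507.16372
Authors: Tian Dong,Yan Meng,Shaofeng Li,Guoxing Chen,Zhen Liu,Haojin Zhu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by USENIX Security 2025. Please cite this paper as “Tian Dong, Yan Meng, Shaofeng Li, Guoxing Chen, Zhen Liu, Haojin Zhu. Depth Gives a False Sense of Privacy: LLM Internal States Inversion. In the 34th USENIX Security Symposium (USENIX Security '25).”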
Abstract:Large Language Models (LLMs) are increasingly integrated into daily routines, yet they raise significant privacy and safety concerns. Recent research proposes collaborative inference, which outsources the early-layer inference to ensure data locality, and introduces model safety auditing based on inner neuron patterns. Both techniques expose the LLM’s Internal States (ISs), which are traditionally considered irreversible to inputs due to optimization challenges and the highly abstract representations in deep layers. In this work, we challenge this assumption by proposing four inversion attacks that significantly improve the semantic similarity and token matching rate of inverted inputs. Specifically, we first develop two white-box optimization-based attacks tailored for low-depth and high-depth ISs. These attacks avoid local minima convergence, a limitation observed in prior work, through a two-phase inversion process. Then, we extend our optimization attack under more practical black-box weight access by leveraging the transferability between the source and the derived LLMs. Additionally, we introduce a generation-based attack that treats inversion as a translation task, employing an inversion model to reconstruct inputs. Extensive evaluation of short and long prompts from medical consulting and coding assistance datasets and 6 LLMs validates the effectiveness of our inversion attacks. Notably, a 4,112-token long medical consulting prompt can be nearly perfectly inverted with 86.88 F1 token matching from the middle layer of Llama-3 model. Finally, we evaluate four practical defenses that we found cannot perfectly prevent ISs inversion and draw conclusions for future mitigation design.
[AI-31] Canonical Representations of Markovian Structural Causal Models: A Framework for Counterfactual Reasoning
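Quick read: This paper addresses the fundamental problem of how to formalize and implement counterfactual beliefs in causal inference, especially counterfactual statements that cannot be falsified even by randomized experiments (for example, “Would Alice have recovered had she taken aspirin?”) yet underpin core concepts such as individual fairness. The key to the solution is an alternative to structural causal models (SCMs) called counterfactual models, or canonical representations of SCMs. Analysts choose a counterfactual conception through random-process probability distributions with preassigned marginals, and the representation characterizes the counterfactual equivalence class of SCMs. A normalization procedure then describes and implements different counterfactual conceptions without altering the observational and interventional constraints, and the content of the counterfactual layer does not have to be estimated, only chosen. The result is a more flexible, practical, and theoretically rigorous framework for counterfactual reasoning.
Link: https://arxiv.org/abs/2507.16370
Authors: Lucas de Lara(IECL)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
Comments: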
Abstract:Counterfactual reasoning aims at answering contrary-to-fact questions like “Would Alice have recovered had she taken aspirin?” and corresponds to the most fine-grained layer of causation. Critically, while many counterfactual statements cannot be falsified, even by randomized experiments, they underpin fundamental concepts like individual-wise fairness. Therefore, providing models to formalize and implement counterfactual beliefs remains a fundamental scientific problem. In the Markovian setting of Pearl’s causal framework, we propose an alternative approach to structural causal models to represent counterfactuals compatible with a given causal graphical model. More precisely, we introduce counterfactual models, also called canonical representations of structural causal models. They enable analysts to choose a counterfactual conception via random-process probability distributions with preassigned marginals and characterize the counterfactual equivalence class of structural causal models. Then, we present a normalization procedure to describe and implement various counterfactual conceptions. Compared to structural causal models, it allows specifying many counterfactual conceptions without altering the observational and interventional constraints. Moreover, the content of the model corresponding to the counterfactual layer does not need to be estimated; the analyst only needs to make a choice. Finally, we illustrate the specific role of counterfactuals in causality and the benefits of our approach on theoretical and numerical examples.
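A textbook-style toy example may help show why the counterfactual layer is a choice rather than something data can settle. The sketch below is not taken from the paper: the two structural causal models, the binary variables, and the function names are assumptions for illustration. Both models give the same answer to every interventional query, yet they disagree completely on the aspirin counterfactual.

```python
# Exogenous noise U ~ Bernoulli(0.5); treatment T is assigned independently of U.
P_U = {0: 0.5, 1: 0.5}


def scm_a(t: int, u: int) -> int:
    """Treatment changes no individual's outcome: Y = U."""
    return u


def scm_b(t: int, u: int) -> int:
    """Treatment flips every individual's outcome: Y = U xor T."""
    return u ^ t


def interventional(scm, t: int) -> float:
    """P(Y = 1 | do(T = t))."""
    return sum(p for u, p in P_U.items() if scm(t, u) == 1)


def counterfactual(scm, t_obs: int, y_obs: int, t_cf: int) -> float:
    """P(Y_{T=t_cf} = 1 | T = t_obs, Y = y_obs) via abduction, action, prediction."""
    # Abduction: keep only the noise values consistent with what was observed.
    consistent = {u: p for u, p in P_U.items() if scm(t_obs, u) == y_obs}
    total = sum(consistent.values())
    # Action and prediction: replay the same noise under the alternative treatment.
    return sum(p for u, p in consistent.items() if scm(t_cf, u) == 1) / total


# Alice took no aspirin (T = 0) and did not recover (Y = 0).
# Would she have recovered had she taken it (T = 1)?
recovery_a = counterfactual(scm_a, t_obs=0, y_obs=0, t_cf=1)
recovery_b = counterfactual(scm_b, t_obs=0, y_obs=0, t_cf=1)
```

Because the treatment is independent of the noise, a randomized experiment sees a 50% recovery rate in both arms under either model, so only the counterfactual answers (0 under `scm_a`, 1 under `scm_b`) tell them apart.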
[AI-32] Learning to Call: A Field Trial of a Collaborative Bandit Algorithm for Improved Message Delivery in Mobile Maternal Health
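Quick read: This paper addresses missed calls in mobile health (mHealth) programs that schedule calls at random, which lowers the delivery rate of critical health information. The core of the solution is a collaborative bandit algorithm that learns each mother's preferred call times and schedules calls individually. The field trial shows a significant improvement in call pick-up rates, demonstrating that machine learning is a feasible and effective way to improve message delivery in large-scale maternal health interventions.
Link: https://arxiv.org/abs/2507.16356
Authors: Arpan Dasgupta,Mizhaan Maniyar,Awadhesh Srivastava,Sanat Kumar,Amrita Mahale,Aparna Hedge,Arun Suggala,Karthikeyan Shanmugam,Aparna Taneja,Milind Tambe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: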
Abstract:Mobile health (mHealth) programs utilize automated voice messages to deliver health information, particularly targeting underserved communities, demonstrating the effectiveness of using mobile technology to disseminate crucial health information to these populations, improving health outcomes through increased awareness and behavioral change. India’s Kilkari program delivers vital maternal health information via weekly voice calls to millions of mothers. However, the current random call scheduling often results in missed calls and reduced message delivery. This study presents a field trial of a collaborative bandit algorithm designed to optimize call timing by learning individual mothers’ preferred call times. We deployed the algorithm with around 6500 Kilkari participants as a pilot study, comparing its performance to the baseline random calling approach. Our results demonstrate a statistically significant improvement in call pick-up rates with the bandit algorithm, indicating its potential to enhance message delivery and impact millions of mothers across India. This research highlights the efficacy of personalized scheduling in mobile health interventions and underscores the potential of machine learning to improve maternal health outreach at scale.
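The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea of sharing information across users when choosing call slots. It swaps in a much simpler technique than a real collaborative bandit: a Beta-Bernoulli estimate per time slot whose prior is pooled from the whole population, with a greedy choice instead of exploration. All names, the three time slots, and the call histories are hypothetical.

```python
from typing import Dict, List, Tuple

Counts = List[Tuple[int, int]]  # per time slot: (pick-ups, misses)


def pooled_prior(all_counts: Dict[str, Counts], strength: float = 2.0) -> List[Tuple[float, float]]:
    """Build one Beta(a, b) prior per slot from the population-wide pick-up rate."""
    n_slots = len(next(iter(all_counts.values())))
    prior = []
    for s in range(n_slots):
        ups = sum(c[s][0] for c in all_counts.values())
        misses = sum(c[s][1] for c in all_counts.values())
        rate = (ups + 1) / (ups + misses + 2)  # Laplace-smoothed population rate
        prior.append((strength * rate, strength * (1 - rate)))
    return prior


def posterior_means(own: Counts, prior: List[Tuple[float, float]]) -> List[float]:
    """Posterior mean pick-up probability per slot for one mother."""
    return [(a + up) / (a + b + up + miss) for (up, miss), (a, b) in zip(own, prior)]


def choose_slot(own: Counts, prior: List[Tuple[float, float]]) -> int:
    """Greedy choice; a real bandit would also explore, e.g. by sampling the posterior."""
    means = posterior_means(own, prior)
    return max(range(len(means)), key=means.__getitem__)


# Hypothetical histories for slots 0 = morning, 1 = afternoon, 2 = evening.
history = {
    "m1": [(1, 4), (2, 3), (8, 2)],
    "m2": [(0, 5), (3, 2), (7, 3)],
    "m3": [(9, 1), (0, 0), (0, 2)],
    "new": [(0, 0), (0, 0), (0, 0)],
}
prior = pooled_prior(history)
slot_for_new = choose_slot(history["new"], prior)
slot_for_m3 = choose_slot(history["m3"], prior)
```

A mother with no history inherits the population's best slot (evening in this data), while a mother with her own strong morning record keeps being called in the morning.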
[AI-33] Leveraging Personalized PageRank and Higher-Order Topological Structures for Heterophily Mitigation in Graph Neural Networks IJCAI2025
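Quick read: This paper addresses the performance drop of Graph Neural Networks (GNNs) on heterophilic graphs. Existing models rely mostly on pairwise relationships and ignore higher-order structural information, so they perform poorly under noise from conflicting class information. The key to the solution is HPGNN, which introduces an efficient higher-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions and embeds higher-order structural information into convolutional networks. This models key interactions across different graph dimensions while reducing computational complexity and improving robustness to noise.
Link: https://arxiv.org/abs/2507.16347
Authors: Yumeng Wang,Zengyi Wo,Wenjun Wang,Xingcheng Fu,Minglai Shao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, accepted at IJCAI 2025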
Abstract:Graph Neural Networks (GNNs) excel in node classification tasks but often assume homophily, where connected nodes share similar labels. This assumption does not hold in many real-world heterophilic graphs. Existing models for heterophilic graphs primarily rely on pairwise relationships, overlooking multi-scale information from higher-order structures. This leads to suboptimal performance, particularly under noise from conflicting class information across nodes. To address these challenges, we propose HPGNN, a novel model integrating Higher-order Personalized PageRank with Graph Neural Networks. HPGNN introduces an efficient high-order approximation of Personalized PageRank (PPR) to capture long-range and multi-scale node interactions. This approach reduces computational complexity and mitigates noise from surrounding information. By embedding higher-order structural information into convolutional networks, HPGNN effectively models key interactions across diverse graph dimensions. Extensive experiments on benchmark datasets demonstrate HPGNN’s effectiveness. The model achieves better performance than five out of seven state-of-the-art methods on heterophilic graphs in downstream tasks while maintaining competitive performance on homophilic graphs. HPGNN’s ability to balance multi-scale information and robustness to noise makes it a versatile solution for real-world graph learning challenges. Codes are available at this https URL.
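For readers unfamiliar with Personalized PageRank, a minimal sketch of the standard pairwise version is given below. It is plain power iteration on an ordinary graph, not HPGNN's higher-order approximation; the adjacency-list format, the teleport probability `alpha`, and the four-node path graph are assumptions for illustration.

```python
from typing import List


def personalized_pagerank(adj: List[List[int]], seed: int,
                          alpha: float = 0.15, iters: int = 100) -> List[float]:
    """Power iteration for pi = alpha * e_seed + (1 - alpha) * pi * P.

    adj[u] lists the out-neighbours of node u; P is the row-normalised
    transition matrix of the random walk on that graph.
    """
    n = len(adj)
    pi = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(iters):
        nxt = [alpha if i == seed else 0.0 for i in range(n)]  # teleport back to the seed
        for u in range(n):
            if not adj[u]:
                # Dangling node: return its walk mass to the seed.
                nxt[seed] += (1 - alpha) * pi[u]
                continue
            share = (1 - alpha) * pi[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        pi = nxt
    return pi


# Undirected path graph 0 - 1 - 2 - 3, personalised to node 0.
path = [[1], [0, 2], [1, 3], [2]]
scores = personalized_pagerank(path, seed=0)
```

The scores form a probability distribution concentrated near the seed, which is why PPR is a natural way to weight long-range neighbours when aggregating node features.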
[AI-34] Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries
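Quick read: This paper addresses the weak generalization of existing Sound Event Detection (SED) algorithms in open-vocabulary settings: the traditional closed-set assumption prevents detection of unseen classes, and current language-driven zero-shot SED methods underperform because they lack fine-grained alignment and cross-modal feature fusion. The key to the solution is the Detect Any Sound Model (DASM), a query-based open-vocabulary SED framework with three main ideas: (1) it formulates SED as a frame-level retrieval task in which query vectors derived from text or audio prompts are matched against audio features; (2) it uses a dual-stream decoder that explicitly decouples event recognition from temporal localization, with a cross-modality event decoder that fuses features across modalities and decides clip-level event presence and a context network that models temporal dependencies for frame-level localization; (3) it applies an inference-time attention masking strategy that exploits semantic relations between base and novel classes to substantially improve generalization to unseen classes.
Link: https://arxiv.org/abs/2507.16343
Authors: Pengfei Cai,Yan Song,Qing Gu,Nan Jiang,Haoyu Song,Ian McLoughlin
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted by MM 2025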
Abstract:Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip-level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at this https URL.
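The "frame-level retrieval" formulation can be pictured with a small sketch: each audio frame has a feature vector, each query (from a text or audio prompt) has a vector in the same space, and a frame is assigned to a query when the two are similar enough. This is only an illustration of the formulation, not DASM's decoder; the cosine-similarity rule, the threshold, the two-dimensional features, and the labels are all assumptions.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect(frames: List[List[float]], queries: Dict[str, List[float]],
           threshold: float = 0.8) -> Dict[str, List[int]]:
    """For each query label, list the frame indices whose similarity passes the threshold."""
    return {
        label: [t for t, f in enumerate(frames) if cosine(f, q) >= threshold]
        for label, q in queries.items()
    }


# Hypothetical 2-D frame features and query embeddings.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queries = {"dog_bark": [1.0, 0.0], "siren": [0.0, 1.0]}
events = detect(frames, queries)
```

Because the label set is just the set of query vectors, a new class can be added at inference time by supplying a new query, which is what makes the formulation open-vocabulary.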
[AI-35] Higher Gauge Flow Models
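Quick read: This paper addresses the difficulty that conventional Generative Flow Models have in capturing higher-order geometric structure and higher symmetries when modeling complex data distributions. The key to the solution is Higher Gauge Flow Models, which are built on an L_∞-algebra. Extending the Lie algebra to the more general L_∞-algebra brings in the higher geometry and higher symmetries associated with higher groups and increases the model's capacity to represent complex data structure. Experiments on a Gaussian Mixture Model dataset show substantial improvements over traditional flow models.
Link: https://arxiv.org/abs/2507.16334
Authors: Alexander Strunk,Roland Assam
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Differential Geometry (math.DG)
Comments: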
Abstract:This paper introduces Higher Gauge Flow Models, a novel class of Generative Flow Models. Building upon ordinary Gauge Flow Models (arXiv:2507.13414), these Higher Gauge Flow Models leverage an L_∞-algebra, effectively extending the Lie Algebra. This expansion allows for the integration of the higher geometry and higher symmetries associated with higher groups into the framework of Generative Flow Models. Experimental evaluation on a Gaussian Mixture Model dataset revealed substantial performance improvements compared to traditional Flow Models.
[AI-36] Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens
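Quick read: This paper examines whether widely used benchmarks for evaluating medical large language models (LLMs) are suitable for African healthcare. Existing benchmarks mainly reflect the disease profiles and examination content of high-income countries and largely overlook the African disease burden, which is dominated by malaria, HIV, tuberculosis, sickle cell disease, and other neglected tropical diseases (NTDs), as well as local clinical guidelines. The key to the solution is Alama Health QA, a new medical question-answering dataset built with a Retrieval-Augmented Generation (RAG) framework anchored on the Kenyan Clinical Practice Guidelines. Harmonized semantic profiling and blinded expert rating show that it leads on NTD coverage, clinical relevance, and guideline alignment, making it a reliable benchmark resource for the safe and equitable evaluation and deployment of LLMs in African health systems.
Link: https://arxiv.org/abs/2507.16322
Authors: Fred Mutisya(1 and 2),Shikoh Gitau(1),Christine Syovata(2),Diana Oigara(2),Ibrahim Matende(2),Muna Aden(2),Munira Ali(2),Ryan Nyotu(2),Diana Marion(2),Job Nyangena(2),Nasubo Ongoma(1),Keith Mbae(1),Elizabeth Wamicha(1),Eric Mibuari(1),Jean Philbert Nsengemana(3),Talkmore Chidede(4) ((1) Qhala (Nairobi, Kenya), (2) Kenya Medical Association (Nairobi, Kenya), (3) Africa CDC (Addis Ababa, Ethiopia), (4) AfCFTA (Accra, Ghana))
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint. 26 pages, includes appendix and tables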
Abstract:Introduction: Existing medical LLM benchmarks largely reflect examination syllabi and disease profiles from high-income settings, raising questions about their validity for African deployment, where malaria, HIV, TB, sickle cell disease and other neglected tropical diseases (NTDs) dominate the burden and national guidelines drive care. Methodology: We systematically reviewed 31 quantitative LLM evaluation papers (Jan 2019 to May 2025), identifying 19 English medical QA benchmarks. Alama Health QA was developed using a retrieval-augmented generation framework anchored on the Kenyan Clinical Practice Guidelines. Six widely used sets (AfriMedQA, MMLUMedical, PubMedQA, MedMCQA, MedQAUSMLE, and guideline-grounded Alama Health QA) underwent harmonized semantic profiling (NTD proportion, recency, readability, lexical diversity metrics) and blinded expert rating across five dimensions: clinical relevance, guideline alignment, clarity, distractor plausibility, and language/cultural fit. Results: Alama Health QA captured 40% of all NTD mentions across corpora and the highest within-set frequencies for malaria (7.7%), HIV (4.1%), and TB (5.2%); AfriMedQA ranked second but lacked formal guideline linkage. Global benchmarks showed minimal representation (e.g., sickle cell disease absent in three sets) despite their large scale. Qualitatively, Alama scored highest for relevance and guideline alignment; PubMedQA scored lowest for clinical utility. Discussion: Quantitative medical LLM benchmarks widely used in the literature underrepresent African disease burdens and regulatory contexts, risking misleading performance claims. Guideline-anchored, regionally curated resources such as Alama Health QA and expanded disease-specific derivatives are essential for safe, equitable model evaluation and deployment across African health systems.
[AI-37] Perovskite-R1: A Domain-Specialized LLM for Intelligent Discovery of Precursor Additives and Experimental Design
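Quick read: This paper addresses the challenges that perovskite solar cells (PSCs) face in long-term stability, environmental sustainability, and scalable manufacturing, together with the difficulty researchers have in efficiently accessing and using domain knowledge. The key to the solution is Perovskite-R1, a large language model specialized for the design of PSC precursor additives. The authors build a domain-specific instruction-tuning dataset from 1,232 high-quality publications and a library of 33,269 candidate materials, using automated question-answer generation and chain-of-thought reasoning, and then fine-tune QwQ-32B on it. The resulting model synthesizes insights from the literature and proposes new additive strategies; experimental validation shows that its proposals improve material stability and device performance, giving perovskite photovoltaic research a closed-loop, data-driven discovery framework.
Link: https://arxiv.org/abs/2507.16307
Authors: Xin-De Wang,Zhi-Rui Chen,Peng-Jie Guo,Ze-Feng Gao,Cheng Mu,Zhong-Yi Lu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 24 pages; 5 figures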
Abstract:Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.
[AI-38] Cross-Modal Distillation For Widely Differing Modalities
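Quick read: This paper addresses overfitting in multi-modal knowledge distillation caused by the large domain gap between modalities, in settings where multi-modal data is not available at the time of use. The core of the solution is a cross-modal knowledge distillation framework with two soft-constrained distillation strategies: soft-constrained loss functions at the feature level and the classifier level replace conventional hard constraints (such as the l2 loss) and reduce overfitting in cross-modal distillation. The paper also proposes a quality-based adaptive weights module that weighs input samples by quantified data quality, which makes training more robust.
Link: https://arxiv.org/abs/2507.16296
Authors: Cairong Zhao,Yufeng Jin,Zifan Song,Haonan Chen,Duoqian Miao,Guosheng Hu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 14 pages, 9 figures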
Abstract:Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve its performance by increasing the size of the model. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To solve the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the big domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that a hard-constrained loss, e.g. an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modality distillation. To address this, we propose two soft-constrained knowledge distillation strategies at the feature level and classifier level, respectively. In addition, we propose a quality-based adaptive weights module to weigh input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach is able to effectively achieve knowledge transfer between the commonly used and widely differing modalities of image, text, and speech.
[AI-39] ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry
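Quick read: This paper addresses the lack of an effective evaluation standard for Deep AI Research Systems (DARS) on frontier AI research. Existing benchmarks focus on web retrieval and report generation and overlook the potential of these systems to discover new scientific insights. The authors therefore propose ResearcherBench, the first benchmark focused on evaluating DARS on frontier AI scientific questions. Its key elements are a high-quality dataset of 65 expert-selected questions from real research scenarios and a dual evaluation framework: rubric assessment, which scores insight quality against expert-designed criteria, and factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). The benchmark provides a standardized platform for developing the next generation of AI research assistants.
Link: https://arxiv.org/abs/2507.16280
Authors: Tianze Xu,Pengrui Lu,Lyumanshan Ye,Xiangkun Hu,Pengfei Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 22 pages, 3 figures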
Abstract:The emergence of deep research systems presents significant capabilities in problem-solving, extending from basic queries to sophisticated research tasks. However, existing benchmarks primarily evaluate these systems as agents for web retrieval and report generation, overlooking their potential to discover novel insights on the frontiers of scientific research. To address this gap, we introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of these advanced, agentic systems - which we refer to as Deep AI Research Systems (DARS) - on frontier AI scientific questions. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios such as laboratory discussions and interviews, spanning 35 different AI subjects and categorized into three types: technical details, literature review, and open consulting. Our dual evaluation framework combines rubric assessment, which uses expert-designed criteria to evaluate insight quality, with factual assessment, which measures citation accuracy (faithfulness) and coverage (groundedness). We evaluated several leading commercial DARS and baseline systems. Results show that OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions. Such capabilities represent a meaningful step toward AI self-improvement, aligning with the vision of ASI for AI. We open-source ResearcherBench to provide a standardized platform for promoting the development of next-generation AI research assistants, hoping to foster a new perspective in AI research evaluation for a novel pattern of scientific collaboration: this https URL.
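The two factual metrics can be made concrete with a small sketch. The abstract only names them as citation accuracy (faithfulness) and coverage (groundedness), so the exact formulas below are one plausible reading rather than the benchmark's definition; the `Claim` record and the sample report are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    text: str
    cited: bool                       # does the claim carry a citation?
    supported: Optional[bool] = None  # does the cited source back it? None if uncited


def faithfulness(claims: List[Claim]) -> float:
    """Assumed definition: share of cited claims whose source actually supports them."""
    cited = [c for c in claims if c.cited]
    return sum(1 for c in cited if c.supported) / len(cited) if cited else 0.0


def groundedness(claims: List[Claim]) -> float:
    """Assumed definition: share of all claims that carry a citation."""
    return sum(1 for c in claims if c.cited) / len(claims) if claims else 0.0


# A hypothetical four-claim report: three claims cited, two of those supported.
report = [
    Claim("Method A scales linearly.", cited=True, supported=True),
    Claim("Method B was proposed first.", cited=True, supported=False),
    Claim("Both use the same dataset.", cited=True, supported=True),
    Claim("Method A is easier to tune.", cited=False),
]
report_faithfulness = faithfulness(report)
report_groundedness = groundedness(report)
```

Keeping the two numbers separate matters: a report can cite everything (high groundedness) while misrepresenting its sources (low faithfulness), or the reverse.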
[AI-40] Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training
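Quick read: This paper addresses GPU memory fragmentation in large language model (LLM) training caused by optimization techniques such as virtual pipeline and recomputation. These techniques disrupt the regularity of tensor lifespans, so the default online memory allocators of mainstream frameworks such as PyTorch become inefficient, wasting up to 43% of GPU memory and triggering out-of-memory errors. The key to the solution is STWeaver, a pluggable GPU memory allocator that combines offline planning with online allocation. The offline stage exploits the spatial and temporal regularity of allocation behavior in training workloads to generate a near-optimal allocation plan, and the online stage handles complex, dynamic models such as Mixture-of-Experts (MoE). STWeaver reduces the fragmentation ratio by 79.2% on average (up to 100%) with negligible overhead and improves training performance by up to 32.5%.
Link: https://arxiv.org/abs/2507.16274
Authors: Zixiao Huang,Junhao Hu,Hao Lin,Chunyang Zhu,Yueran Tang,Quanlu Zhang,Zhen Guo,Zhenhua Li,Shengen Yan,Zhenhua Zhu,Guohao Dai,Yu Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Comments: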
Abstract:The rapid scaling of large language models (LLMs) has significantly increased GPU memory pressure, which is further aggravated by training optimization techniques such as virtual pipeline and recomputation that disrupt tensor lifespans and introduce considerable memory fragmentation. Default GPU memory allocators of popular deep learning frameworks like PyTorch use online strategies without knowledge of tensor lifespans, which can waste up to 43% of memory and cause out-of-memory errors, rendering optimization techniques ineffective or even unusable. To address this, we introduce STWeaver, a GPU memory allocator for deep learning frameworks that reduces fragmentation by exploiting the spatial and temporal regularity in memory allocation behaviors of training workloads. STWeaver introduces a novel paradigm that combines offline planning with online allocation. The offline planning leverages spatio-temporal regularities to generate a near-optimal allocation plan, while the online allocation handles complex and dynamic models such as Mixture-of-Experts (MoE). Built as a pluggable PyTorch allocator, STWeaver reduces fragmentation ratio on average by 79.2% (up to 100%) across both dense and sparse models, with negligible overhead. This enables more efficient, high-throughput training configurations and improves performance by up to 32.5%.
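To see why knowing tensor lifespans in advance helps, consider a small offline planner. The sketch below is a generic greedy-by-size offset assignment, not STWeaver's planning algorithm; the tensor tuple layout, the step numbers, and the sizes are assumptions for illustration. Tensors whose lifetimes never overlap are allowed to share the same address range.

```python
from typing import Dict, List, Tuple

# (name, alloc_step, free_step, size); the tensor is alive during [alloc_step, free_step).
Tensor = Tuple[str, int, int, int]


def plan_offsets(tensors: List[Tensor]) -> Tuple[Dict[str, int], int]:
    """Assign each tensor an offset in one buffer; return (offsets, peak bytes used)."""
    placed: List[Tuple[int, int, int, int]] = []  # (offset, size, alloc_step, free_step)
    offsets: Dict[str, int] = {}
    peak = 0
    # Place large tensors first so small ones can fill the remaining gaps.
    for name, alloc, free, size in sorted(tensors, key=lambda t: -t[3]):
        # Address ranges held by already-placed tensors that are alive at the same time.
        busy = sorted((o, o + s) for o, s, a, f in placed if a < free and alloc < f)
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break  # the tensor fits in the gap before this block
            offset = max(offset, hi)
        placed.append((offset, size, alloc, free))
        offsets[name] = offset
        peak = max(peak, offset + size)
    return offsets, peak


# Hypothetical trace: at most two tensors (one of size 4, one of size 2) are alive at once.
trace = [("a", 0, 2, 4), ("b", 1, 3, 2), ("c", 2, 4, 4), ("d", 3, 5, 2)]
offsets, peak = plan_offsets(trace)
```

For this trace the plan reuses the address range of `a` for `c` and that of `b` for `d`, so the peak equals the largest amount of memory alive at any single step (6 units) even though the tensors total 12 units.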
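[AI-41] PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy
[Quick Read]: This paper addresses a set of emerging risks that professional voice actors face in the synthetic voice economy when their voice data is used without authorization, including privacy exposure, reputational harm, and missing accountability. The established ethical framework of Consent, Credit, and Compensation (C3) does not handle these risks well: once voice, which is both a biometric identifier and creative labor, is decoupled from its context, individuals are left with little control or legal protection. The key contribution is the PRAC3 framework, which extends C3 with Privacy, Reputation, and Accountability to form six interdependent pillars for governing how voice data is collected, used for training, and reused, with the aim of restoring creator agency, ensuring data traceability, and establishing enforceable ethical boundaries.
Link: https://arxiv.org/abs/2507.16247
Authors: Tanusree Sharma, Yihao Zhou, Visar Berisha
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: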
Abstract:Early large-scale audio datasets, such as LibriSpeech, were built with hundreds of individual contributors whose voices were instrumental in the development of speech technologies, including audiobooks and voice assistants. Yet, a decade later, these same contributions have exposed voice actors to a range of risks. While existing ethical frameworks emphasize Consent, Credit, and Compensation (C3), they do not adequately address the emergent risks involving vocal identities that are increasingly decoupled from context, authorship, and control. Drawing on qualitative interviews with 20 professional voice actors, this paper reveals how the synthetic replication of voice without enforceable constraints exposes individuals to a range of threats. Beyond reputational harm, such as re-purposing voice data in erotic content, offensive political messaging, and meme culture, we document concerns about accountability breakdowns when their voice is leveraged to clone voices that are deployed in high-stakes scenarios such as financial fraud, misinformation campaigns, or impersonation scams. In such cases, actors face social and legal fallout without recourse, while very few of them have a legal representative or union protection. To make sense of these shifting dynamics, we introduce the PRAC3 framework, an expansion of C3 that foregrounds Privacy, Reputation, Accountability, Consent, Credit, and Compensation as interdependent pillars of data used in the synthetic voice economy. This framework captures how privacy risks are amplified through non-consensual training, how reputational harm arises from decontextualized deployment, and how accountability can be reimagined in AI Data ecosystems. We argue that voice, as both a biometric identifier and creative labor, demands governance models that restore creator agency, ensure traceability, and establish enforceable boundaries for ethical reuse.
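[AI-42] eX-NIDS: A Framework for Explainable Network Intrusion Detection Leveraging Large Language Models
[Quick Read]: This paper addresses the lack of interpretability in flow-based Network Intrusion Detection Systems (NIDS), which makes it hard to give security analysts a clear and trustworthy rationale for why a flow was flagged as malicious. The key to the solution is the eX-NIDS framework. Its core component, the Prompt Augmenter, extracts contextual information and Cyber Threat Intelligence (CTI) knowledge from flows labelled as malicious and merges it into the input prompt of a Large Language Model (LLM), which then generates accurate and consistent natural-language explanations. Compared with a Basic-Prompt Explainer baseline that uses no contextual information, the augmented prompts improve explanation performance by over 20%.
Link: https://arxiv.org/abs/2507.16241
Authors: Paul R. B. Houssel, Siamak Layeghy, Priyanka Singh, Marius Portmann
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: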
Abstract:This paper introduces eX-NIDS, a framework designed to enhance interpretability in flow-based Network Intrusion Detection Systems (NIDS) by leveraging Large Language Models (LLMs). In our proposed framework, flows labelled as malicious by NIDS are initially processed through a module called the Prompt Augmenter. This module extracts contextual information and Cyber Threat Intelligence (CTI)-related knowledge from these flows. This enriched, context-specific data is then integrated with an input prompt for an LLM, enabling it to generate detailed explanations and interpretations of why the flow was identified as malicious by NIDS. We compare the generated interpretations against a Basic-Prompt Explainer baseline, which does not incorporate any contextual information into the LLM’s input prompt. Our framework is quantitatively evaluated using the Llama 3 and GPT-4 models, employing a novel evaluation method tailored for natural language explanations, focusing on their correctness and consistency. The results demonstrate that augmented LLMs can produce accurate and consistent explanations, serving as valuable complementary tools in NIDS to explain the classification of malicious flows. The use of augmented prompts enhances performance by over 20% compared to the Basic-Prompt Explainer.
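The contrast between a basic prompt and an augmented prompt can be sketched as follows. This is only an illustration of the general idea: the flow field names (`dst_port`, `bytes_in`, `bytes_out`), the `CTI_NOTES` lookup table, and both functions are assumptions made for this example, not the paper's Prompt Augmenter.

```python
# Hypothetical CTI lookup table; a real system would query
# threat-intelligence sources rather than a hard-coded dict.
CTI_NOTES = {
    23: "Port 23 (Telnet) is frequently targeted by brute-force and botnet scanning.",
    445: "Port 445 (SMB) is associated with worm propagation and lateral movement.",
}

def basic_prompt(flow):
    """Baseline: the flow record with no additional context."""
    return f"Explain why this network flow was classified as malicious:\n{flow}"

def augmented_prompt(flow):
    """Append context derived from the flow to the basic prompt."""
    context = []
    note = CTI_NOTES.get(flow.get("dst_port"))
    if note:
        context.append(note)
    # Assumed heuristic: flag strongly asymmetric traffic volumes
    if flow.get("bytes_out", 0) > 10 * max(flow.get("bytes_in", 0), 1):
        context.append("Outbound volume greatly exceeds inbound volume.")
    if not context:
        return basic_prompt(flow)
    return basic_prompt(flow) + "\nContext:\n" + "\n".join(f"- {c}" for c in context)
```

The resulting string would then be sent to whichever LLM is being evaluated; the model call itself is left out because it depends on the provider.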
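[AI-43] Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery
[Quick Read]: This paper addresses the substantial gaps in economic accessibility and patient coverage in current digital health services, especially for preventive care and continuous monitoring in underserved populations. The key to the solution is the development and pilot evaluation of Agent PULSE, a voice assistant powered by a large language model (LLM), together with an economic model showing that AI agents can deliver cost-effective healthcare services where human intervention is not economically feasible. The paper also analyzes technical challenges (real-time conversational processing, integration with healthcare systems, privacy compliance) alongside policy considerations (regulation, bias mitigation, patient autonomy) and ethical alignment, arguing that addressing them together can improve access, patient engagement, and system efficiency, making voice agents a critical entry point for equitable, sustainable digital health.
Link: https://arxiv.org/abs/2507.16229
Authors: Bo Wen, Chen Wang, Qiwei Han, Raquel Norel, Julia Liu, Thaddeus Stappenbeck, Jeffrey L. Rogers
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
Comments: IEEE International Conference on Digital Health (ICDH) 2025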
Abstract:The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) – a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine – we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates huge potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.
[AI-44] Distilled Large Language Model in Confidential Computing Environment for System-on-Chip Design
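[Quick Read]: This paper addresses privacy protection when Large Language Models (LLMs) are applied to circuit design tasks: trained models and their data are confidential intellectual property (IP) and need to run securely inside a Trusted Execution Environment (TEE). Existing TEE implementations do not efficiently support the resource demands of LLMs, so the paper presents a systematic evaluation in a TEE based on Intel Trust Domain Extensions (TDX). The key finding is that quantization (4-bit and 8-bit) and lightweight models such as DeepSeek-r1-1.5B substantially improve inference performance while preserving security: quantized models reach up to 3x the throughput of FP16 versions, and for models with fewer parameters the TDX implementation outperforms the CPU-only one, supporting the feasibility of deploying lightweight LLMs on resource-constrained systems for semiconductor computer-aided design (CAD) tasks.
Link: https://arxiv.org/abs/2507.16226
Authors: Dong Ben, Hui Feng, Qian Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 7 pages, 4 figures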
Abstract:Large Language Models (LLMs) are increasingly used in circuit design tasks and have typically undergone multiple rounds of training. Both the trained models and their associated training data are considered confidential intellectual property (IP) and must be protected from exposure. Confidential Computing offers a promising solution to protect data and models through Trusted Execution Environments (TEEs). However, existing TEE implementations are not designed to support the resource-intensive nature of LLMs efficiently. In this work, we first present a comprehensive evaluation of the LLMs within a TEE-enabled confidential computing environment, specifically utilizing Intel Trust Domain Extensions (TDX). We constructed experiments on three environments: TEE-based, CPU-only, and CPU-GPU hybrid implementations, and evaluated their performance in terms of tokens per second. Our first observation is that distilled models, i.e., DeepSeek, surpass other models in performance due to their smaller parameters, making them suitable for resource-constrained devices. Also, in the quantized models such as 4-bit quantization (Q4) and 8-bit quantization (Q8), we observed a performance gain of up to 3x compared to FP16 models. Our findings indicate that for fewer parameter sets, such as DeepSeek-r1-1.5B, the TDX implementation outperforms the CPU version in executing computations within a secure environment. We further validate the results using a testbench designed for SoC design tasks. These validations demonstrate the potential of efficiently deploying lightweight LLMs on resource-constrained systems for semiconductor CAD applications.
[AI-45] Adaptive Relative Pose Estimation Framework with Dual Noise Tuning for Safe Approaching Maneuvers
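[Quick Read]: This paper addresses accurate and robust relative pose estimation for tumbling derelict satellites such as ESA's ENVISAT in Active Debris Removal (ADR) missions, a key challenge for safe proximity operations. The core of the solution is a complete pipeline that combines computer vision with adaptive nonlinear filtering. A Convolutional Neural Network (CNN), enhanced with image preprocessing, detects structural markers (corners), and a camera model converts their 2D image coordinates into 3D measurements; these measurements are then fused in an Unscented Kalman Filter (UKF) to estimate the full relative pose. The key innovation is a dual adaptive strategy within the UKF: the measurement noise covariance is tuned dynamically to compensate for varying uncertainty in the CNN output, while the process noise covariance is adapted using measurement residual analysis to account online for unmodeled dynamics or maneuvers, improving robustness to both measurement errors and dynamic model uncertainty.
Link: https://arxiv.org/abs/2507.16214
Authors: Batu Candan, Simone Servadio
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: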
Abstract:Accurate and robust relative pose estimation is crucial for enabling challenging Active Debris Removal (ADR) missions targeting tumbling derelict satellites such as ESA’s ENVISAT. This work presents a complete pipeline integrating advanced computer vision techniques with adaptive nonlinear filtering to address this challenge. A Convolutional Neural Network (CNN), enhanced with image preprocessing, detects structural markers (corners) from chaser imagery, whose 2D coordinates are converted to 3D measurements using camera modeling. These measurements are fused within an Unscented Kalman Filter (UKF) framework, selected for its ability to handle nonlinear relative dynamics, to estimate the full relative pose. Key contributions include the integrated system architecture and a dual adaptive strategy within the UKF: dynamic tuning of the measurement noise covariance compensates for varying CNN measurement uncertainty, while adaptive tuning of the process noise covariance, utilizing measurement residual analysis, accounts for unmodeled dynamics or maneuvers online. This dual adaptation enhances robustness against both measurement imperfections and dynamic model uncertainties. The performance of the proposed adaptive integrated system is evaluated through high-fidelity simulations using a realistic ENVISAT model, comparing estimates against ground truth under various conditions, including measurement outages. This comprehensive approach offers an enhanced solution for robust onboard relative navigation, significantly advancing the capabilities required for safe proximity operations during ADR missions.
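The two adaptations described above can be illustrated on a much simpler filter. The sketch below uses a scalar, linear random-walk Kalman filter rather than a UKF, and its adaptation rules are assumptions chosen for clarity: measurement noise is scaled by a hypothetical detector `confidence`, and process noise is boosted when the normalized innovation squared (NIS) exceeds a gate. The parameters `R0`, `nis_gate`, and `q_boost` are illustrative, not values from the paper.

```python
def adaptive_kf_step(x, P, z, confidence, Q, R0=1.0, nis_gate=4.0, q_boost=10.0):
    """One step of a scalar random-walk Kalman filter with two adaptations.

    x, P       : prior state estimate and its variance
    z          : new measurement
    confidence : hypothetical detector confidence in (0, 1]
    Q          : current process noise variance
    Returns the updated (x, P, Q).
    """
    # Measurement-noise adaptation: lower confidence means larger R
    R = R0 / confidence
    # Predict under the model x_k = x_{k-1} + w
    P_pred = P + Q
    # Innovation (residual) and its normalized square
    nu = z - x
    S = P_pred + R
    nis = nu * nu / S
    # Process-noise adaptation: a large residual suggests unmodeled dynamics
    if nis > nis_gate:
        Q = Q * q_boost
        P_pred = P + Q
        S = P_pred + R
    # Standard Kalman update
    K = P_pred / S
    x_new = x + K * nu
    P_new = (1.0 - K) * P_pred
    return x_new, P_new, Q
```

A practical filter would also let `Q` decay back toward its nominal value once residuals shrink; that is omitted here to keep the sketch short.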
[AI-46] LOCOFY Large Design Models – Design to code conversion solution
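[Quick Read]: This paper addresses key challenges that Large Language Models (LLMs) and their multimodal variants face in design-to-code conversion: poor interpretability, limited scalability, high resource consumption, and non-reproducible results. The core of the solution is the Large Design Models (LDMs) paradigm, trained specifically on designs and webpages, together with a training and inference pipeline that incorporates data engineering and model architecture modifications: 1) the Design Optimiser uses a proprietary ground-truth dataset to identify and correct sub-optimal designs; 2) the tagging and feature detection module uses pre-trained and fine-tuned models to detect and classify UI elements accurately; 3) Auto Components abstracts repeated UI structures into reusable components, improving code modularity and reuse. According to the authors, the approach outperforms LLMs on node-positioning accuracy, responsiveness, and reproducibility in end-to-end design-to-code conversion.
Link: https://arxiv.org/abs/2507.16208
Authors: Sohaib Muhammad, Ashwati Vipin, Karan Shetti, Honey Mittal
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: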
Abstract:Despite rapid advances in Large Language Models and Multimodal Large Language Models (LLMs), numerous challenges related to interpretability, scalability, resource requirements and repeatability remain in their application to the design-to-code space. To address this, we introduce the Large Design Models (LDMs) paradigm specifically trained on designs and webpages to enable seamless conversion from design-to-code. We have developed a training and inference pipeline by incorporating data engineering and appropriate model architecture modification. The training pipeline consists of the following: 1) Design Optimiser: developed using a proprietary ground truth dataset and addresses sub-optimal designs; 2) Tagging and feature detection: using pre-trained and fine-tuned models, this enables the accurate detection and classification of UI elements; and 3) Auto Components: extracts repeated UI structures into reusable components to enable creation of modular code, thus reducing redundancy while enhancing code reusability. In this manner, each model addresses distinct but key issues for design-to-code conversion. Separately, our inference pipeline processes real-world designs to produce precise and interpretable instructions for code generation and ensures reliability. Additionally, our models demonstrated exceptional end-to-end design-to-code conversion accuracy using a novel preview match score metric. Comparative experiments indicated superior performance of LDMs against LLMs on accuracy of node positioning, responsiveness and reproducibility. Moreover, our custom-trained tagging and feature detection model demonstrated high precision and consistency in identifying UI elements across a wide sample of test designs. Thus, our proposed LDMs are a reliable and superior solution to understanding designs that subsequently enable the generation of efficient and reliable production-ready code.
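One simple way to picture the extraction of repeated UI structures is to fingerprint each subtree of a design tree and count identical fingerprints. The sketch below assumes a design is a nested dict with hypothetical `tag` and `children` keys; it is a generic duplicate-subtree detector, not the learned Auto Components model described in the abstract.

```python
from collections import defaultdict

def signature(node):
    """Structural signature: the node's tag plus its children's signatures.
    Text and style are ignored so that cards with different content match."""
    return (node["tag"], tuple(signature(c) for c in node.get("children", [])))

def repeated_structures(root, min_count=2):
    """Map each non-leaf subtree signature that occurs at least
    min_count times to its number of occurrences."""
    counts = defaultdict(int)

    def walk(node):
        if node.get("children"):
            counts[signature(node)] += 1
        for child in node.get("children", []):
            walk(child)

    walk(root)
    return {sig: n for sig, n in counts.items() if n >= min_count}
```

Each signature that comes back is a candidate for a reusable component, with the differing text or images becoming its parameters.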
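[AI-47] A Human-Centered Approach to Identifying Promises, Risks, & Challenges of Text-to-Image Generative AI in Radiology AAAI
[Quick Read]: The problem this paper addresses is that rapid progress in text-to-image generative AI (GenAI) for medical imaging has often overlooked the actual needs and usage contexts of medical professionals, so the resulting models may be of little practical use or even introduce risks. The key to the solution is a human-centered approach: key stakeholders, including medical students, radiology trainees, and radiologists, take part in evaluating and reflecting on a text-to-CT-scan generative model, systematically exploring its promises, risks, and challenges across medical education, training, and clinical practice. This supports responsible model development and surfaces the technical challenges and domain-specific risks of generating synthetic medical images.
Link: https://arxiv.org/abs/2507.16207
Authors: Katelyn Morrison, Arpit Mathur, Aidan Bradshaw, Tom Wartmann, Steven Lundi, Afrooz Zandifar, Weichang Dai, Kayhan Batmanghelich, Motahhare Eslami, Adam Perer
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 13 pages, 2 figures, accepted to AAAI/ACM AIES 2025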
Abstract:As text-to-image generative models rapidly improve, AI researchers are making significant advances in developing domain-specific models capable of generating complex medical imagery from text prompts. Despite this, these technical advancements have overlooked whether and how medical professionals would benefit from and use text-to-image generative AI (GenAI) in practice. By developing domain-specific GenAI without involving stakeholders, we risk the potential of building models that are either not useful or even more harmful than helpful. In this paper, we adopt a human-centered approach to responsible model development by involving stakeholders in evaluating and reflecting on the promises, risks, and challenges of a novel text-to-CT Scan GenAI model. Through exploratory model prompting activities, we uncover the perspectives of medical students, radiology trainees, and radiologists on the role that text-to-CT Scan GenAI can play across medical education, training, and practice. This human-centered approach additionally enabled us to surface technical challenges and domain-specific risks of generating synthetic medical images. We conclude by reflecting on the implications of medical text-to-image GenAI.
[AI-48] METER: Multi-modal Evidence-based Thinking and Explainable Reasoning – Algorithm and Benchmark ICCV
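[Quick Read]: This paper addresses two key problems in detecting content forged by generative AI. First, existing methods mostly perform binary classification and lack detailed, interpretable analysis of forgeries, which limits their use in high-stakes scenarios. Second, existing techniques usually handle each modality (image, video, audio) separately, and there is no unified benchmark for cross-modal forgery detection and explanation. The key to the solution is METER, a unified multi-modal forgery detection benchmark covering images, videos, audio, and audio-visual content, which requires evidence-chain-based explanations with multi-level interpretability measures such as spatio-temporal localization, textual rationales, and forgery type tracing. The paper also proposes a human-aligned, three-stage Chain-of-Thought (CoT) training strategy that combines supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and a novel GRPO stage integrating a human-aligned evaluator with CoT reasoning, aimed at improving generalization and interpretability.
Link: https://arxiv.org/abs/2507.16206
Authors: Xu Yang, Qi Zhang, Shuming Jiang, Yaowen Xu, Zhaofan Zou, Hao Sun, Xuelong Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures, ICCV format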
Abstract:With the rapid advancement of generative AI, synthetic content across images, videos, and audio has become increasingly realistic, amplifying the risk of misinformation. Existing detection approaches predominantly focus on binary classification while lacking detailed and interpretable explanations of forgeries, which limits their applicability in safety-critical scenarios. Moreover, current methods often treat each modality separately, without a unified benchmark for cross-modal forgery detection and interpretation. To address these challenges, we introduce METER, a unified, multi-modal benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations, including spatio-temporal localization, textual rationales, and forgery type tracing. Compared to prior benchmarks, METER offers broader modality coverage and richer interpretability metrics such as spatial/temporal IoU, multi-class tracing, and evidence consistency. We further propose a human-aligned, three-stage Chain-of-Thought (CoT) training strategy combining SFT, DPO, and a novel GRPO stage that integrates a human-aligned evaluator with CoT reasoning. We hope METER will serve as a standardized foundation for advancing generalizable and interpretable forgery detection in the era of generative media.
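The spatial and temporal IoU metrics mentioned above follow the standard intersection-over-union definition. A minimal sketch, assuming time segments are `(start, end)` pairs in seconds and boxes are `(x1, y1, x2, y2)` tuples; the benchmark's exact input formats and aggregation are not stated in the abstract, so these function names and conventions are illustrative.

```python
def temporal_iou(pred, truth):
    """IoU of two time segments given as (start, end)."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, a predicted forged segment of 0 to 4 s against a ground-truth segment of 2 to 6 s overlaps for 2 s out of a 6 s union, giving a temporal IoU of 1/3.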
[AI-49] CHIMERA: Compressed Hybrid Intelligence for Twin-Model Enhanced Multi-Agent Deep Reinforcement Learning for Multi-Functional RIS-Assisted Space-Air-Ground Integrated Networks
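[Quick Read]: This paper addresses the energy shortage of low-Earth orbit (LEO) satellites operating in shadowed regions within space-air-ground integrated networks (SAGIN), while jointly accounting for communication and computing energy consumption to improve long-term energy efficiency (EE). The key to the solution is a multi-functional reconfigurable intelligent surface (MF-RIS) that can simultaneously reflect, amplify, and harvest wireless energy. The MF-RIS parameters (signal amplification, phase shifts, energy harvesting ratio, and active element selection) are optimized jointly with the SAGIN parameters (beamforming vectors, high-altitude platform station (HAPS) deployment, user association, and computing capability), yielding a highly non-convex problem with mixed discrete-continuous variables. To solve it, the authors design the CHIMERA framework (compressed hybrid intelligence for twin-model enhanced multi-agent deep reinforcement learning), which integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to explore complex actions efficiently. In simulations, CHIMERA achieves higher EE than fixed-configuration or non-harvesting MF-RIS, traditional RIS, and no-RIS cases, as well as centralized and multi-agent deep reinforcement learning baselines.
Link: https://arxiv.org/abs/2507.16204
Authors: Li-Hsiang Shen, Jyun-Jhe Huang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: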
Abstract:A space-air-ground integrated network (SAGIN) architecture is proposed, empowered by multi-functional reconfigurable intelligent surfaces (MF-RIS) capable of simultaneously reflecting, amplifying, and harvesting wireless energy. The MF-RIS plays a pivotal role in addressing the energy shortages of low-Earth orbit (LEO) satellites operating in shadowed regions, while explicitly accounting for both communication and computing energy consumption across the SAGIN nodes. To maximize the long-term energy efficiency (EE), we formulate a joint optimization problem over the MF-RIS parameters, including signal amplification, phase-shifts, energy harvesting ratio, and active element selection as well as the SAGIN parameters of beamforming vectors, high-altitude platform station (HAPS) deployment, user association, and computing capability. The formulated problem is highly non-convex and non-linear and contains mixed discrete-continuous parameters. To tackle this, we conceive a compressed hybrid intelligence for twin-model enhanced multi-agent deep reinforcement learning (CHIMERA) framework, which integrates semantic state-action compression and parametrized sharing under hybrid reinforcement learning to efficiently explore suitable complex actions. The simulation results have demonstrated that the proposed CHIMERA scheme substantially outperforms the conventional benchmarks, including fixed-configuration or non-harvesting MF-RIS, traditional RIS, and no-RIS cases, as well as centralized and multi-agent deep reinforcement learning baselines in terms of the highest EE. Moreover, the proposed SAGIN-MF-RIS architecture achieves superior EE performance due to its complementary coverage, offering notable advantages over either standalone satellite, aerial, or ground-only deployments.
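[AI-50] SVAgent: AI Agent for Hardware Security Verification Assertion
[Quick Read]: This paper addresses the inefficiency of the traditional SystemVerilog assertion (SVA) development model for complex integrated circuit (IC) designs and its difficulty coping with a growing number of security vulnerabilities. The key to the solution is SVAgent, an automatic SVA generation framework that introduces a requirement decomposition mechanism to transform the original complex requirements into a structured, fine-grained problem-solving chain that can be solved step by step. According to the experiments, this improves the accuracy and consistency of the generated SVAs and suppresses hallucinations and random answers, and integration into a mainstream IC vulnerability assessment framework indicates practicality and reliability in a real engineering design environment.
Link: https://arxiv.org/abs/2507.16203
Authors: Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: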
Abstract:Verification using SystemVerilog assertions (SVA) is one of the most popular methods for detecting circuit design vulnerabilities. However, with the globalization of integrated circuit design and the continuous upgrading of security requirements, the SVA development model has exposed major limitations. It is not only inefficient in development, but also unable to effectively deal with the increasing number of security vulnerabilities in modern complex integrated circuits. In response to these challenges, this paper proposes an innovative SVA automatic generation framework SVAgent. SVAgent introduces a requirement decomposition mechanism to transform the original complex requirements into a structured, gradually solvable fine-grained problem-solving chain. Experiments have shown that SVAgent can effectively suppress the influence of hallucinations and random answers, and the key evaluation indicators such as the accuracy and consistency of the SVA are significantly better than existing frameworks. More importantly, we successfully integrated SVAgent into the most mainstream integrated circuit vulnerability assessment framework and verified its practicality and reliability in a real engineering design environment.
[AI-51] Emergent Cognitive Convergence via Implementation: A Structured Loop Reflecting Four Theories of Mind (A Position Paper)
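[Quick Read]: This paper addresses the limitations of current Large Language Models (LLMs) on multi-step reasoning tasks, in particular weak constraint adherence and low task success. The key to the solution is Agentic Flow, an AI agent architecture composed of five interdependent modules (Retrieval, Cognition, Control, Memory, and Action) arranged in a recurrent cognitive loop. Although the design was originally inspired only by Minsky's society of mind and Clark's extended mind, its structure turns out to converge with Kahneman's dual-system theory and Friston's predictive processing as well, exhibiting computational motifs such as predictive modeling, associative recall, and error-sensitive control. In experiments on multi-step reasoning tasks, the structured agent achieved 95.8% task success versus 62.3% for the baseline LLM agents, which the authors present as an illustration that structural features of cognitive theories can emerge from practical design choices rather than top-down theory.
Link: https://arxiv.org/abs/2507.16184
Authors: Myung Ho Kim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 21 pages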
Abstract:We report the discovery of a structural convergence across four influential theories of mind (Kahneman’s dual-system theory, Friston’s predictive processing, Minsky’s society of mind, and Clark’s extended mind) emerging unintentionally within a practical AI agent architecture called Agentic Flow. Designed to address limitations in large language models (LLMs), Agentic Flow comprises five interdependent modules (Retrieval, Cognition, Control, Memory, and Action) arranged in a recurrent cognitive loop. Although originally inspired only by Minsky and Clark, the system’s structure retrospectively aligns with computational motifs found in all four theories, including predictive modeling, associative recall, and error-sensitive control. To assess this convergence, we conducted comparative experiments with baseline LLM agents on multi-step reasoning tasks. The structured agent achieved 95.8% task success and exhibited strong constraint adherence, while the baseline system succeeded 62.3% of the time. These results were not aimed at proving superiority, but at illustrating how theoretical structures may emerge through practical design choices rather than top-down theory. We introduce PEACE as a descriptive meta-architecture that captures design-level regularities observed in Agentic Flow. Not intended as a new theory, PEACE provides a shared vocabulary for understanding architectures shaped by real-world implementation demands. This paper should be read as a position paper - an exploratory reflection on how implementation can surface latent structural echoes of cognitive theory, without asserting theoretical unification.
[AI-52] LLM Data Selection and Utilization via Dynamic Bi-level Optimization ICML2025
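[Quick Read]: This paper addresses the inefficiency and high computational cost of Large Language Model (LLM) training caused by static data selection: existing methods mostly rely on fixed, training-agnostic criteria and fail to capture how the model's data preferences change during training. The key to the solution is a Data Weighting Model (DWM) that adjusts the weights of the selected data within each training batch, making data utilization adaptive during training. A bi-level optimization framework updates the weighting model so that it better captures the evolving data preferences of the model being trained, improving training efficiency and final performance.
Link: https://arxiv.org/abs/2507.16178
Authors: Yang Yu, Kai Han, Hang Zhou, Yehui Tang, Kaiqi Huang, Yunhe Wang, Dacheng Tao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: The 42nd International Conference on Machine Learning (ICML 2025)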
Abstract:While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic model training and data interactions. In this paper, we propose a new Data Weighting Model (DWM) to adjust the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preference of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly-selected data, and the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we further analyze how a model’s data preferences evolve throughout training, providing new insights into the data preference of the model during training.
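The inner step of per-batch data weighting can be sketched in a few lines. This example assumes the weighting model emits one scalar score per sample and that scores are normalized with a softmax; both choices, and the names `batch_weights` and `weighted_batch_loss`, are illustrative assumptions. The bi-level update of the weighting model itself is not shown.

```python
import math

def batch_weights(scores):
    """Softmax over per-sample scores from a (hypothetical) weighting model."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_batch_loss(losses, scores):
    """Batch loss as a weighted sum of per-sample losses."""
    weights = batch_weights(scores)
    return sum(w * l for w, l in zip(weights, losses))
```

With equal scores this reduces to the ordinary mean loss, so uniform training is the special case where the weighting model expresses no preference.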
[AI-53] Attacking interpretable NLP systems
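[Quick Read]: This paper addresses the vulnerability of text classification models to adversarial examples, and in particular the fact that existing attacks often fail to preserve semantic consistency and interpretation similarity, which matters for user trust in Interpretable Natural Language Processing Systems. The key to the solution is AdvChar, a black-box attack that identifies the tokens most critical to the classification decision and applies only small character-level perturbations to them. This minimizes the difference from the original text while producing adversarial examples whose interpretations resemble those of benign inputs, so the deep learning classifier is misled while the interpretation stays consistent.
Link: https://arxiv.org/abs/2507.16164
Authors: Eldor Abdukhamidov, Tamer Abuhmed, Joanna C. S. Santos, Mohammed Abuhamad
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: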
Abstract:Studies have shown that machine learning systems are vulnerable to adversarial examples in theory and practice. Where previous attacks have focused mainly on visual models that exploit the difference between human and machine perception, text-based models have also fallen victim to these attacks. However, these attacks often fail to maintain the semantic meaning of the text and similarity. This paper introduces AdvChar, a black-box attack on Interpretable Natural Language Processing Systems, designed to mislead the classifier while keeping the interpretation similar to benign inputs, thus exploiting trust in system transparency. AdvChar achieves this by making less noticeable modifications to text input, forcing the deep learning classifier to make incorrect predictions and preserve the original interpretation. We use an interpretation-focused scoring approach to determine the most critical tokens that, when changed, can cause the classifier to misclassify the input. We apply simple character-level modifications to measure the importance of tokens, minimizing the difference between the original and new text while generating adversarial interpretations similar to benign ones. We thoroughly evaluated AdvChar by testing it against seven NLP models and three interpretation models using benchmark datasets for the classification task. Our experiments show that AdvChar can significantly reduce the prediction accuracy of current deep learning models by altering just two characters on average in input samples.
[AI-54] SDBench: A Comprehensive Benchmark Suite for Speaker Diarization
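[Quick Read]: This paper addresses two obstacles to comparable and reproducible evaluation of speaker diarization systems: error rates vary widely across datasets, and there is no unified standard for cross-system comparison. The key to the solution is SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets and provides built-in tooling for consistent, fine-grained performance analysis of on-device and server-side systems. SDBench makes ablation studies and system integration faster; for example, SpeakerKit, built on top of Pyannote v3 with the help of SDBench, is 9.6x faster than Pyannote v3 at comparable error rates.
Link: https://arxiv.org/abs/2507.16136
Authors: Eduardo Pacheco, Atila Orhon, Berkin Durmus, Blaise Munyampirwa, Andrey Leonov
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: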
Abstract:Even state-of-the-art speaker diarization systems exhibit high variance in error rates across different datasets, representing numerous use cases and domains. Furthermore, comparing across systems requires careful application of best practices such as dataset splits and metric definitions to allow for apples-to-apples comparison. We propose SDBench (Speaker Diarization Benchmark), an open-source benchmark suite that integrates 13 diverse datasets with built-in tooling for consistent and fine-grained analysis of speaker diarization performance for various on-device and server-side systems. SDBench enables reproducible evaluation and easy integration of new systems over time. To demonstrate the efficacy of SDBench, we built SpeakerKit, an inference efficiency-focused system built on top of Pyannote v3. SDBench enabled rapid execution of ablation studies that led to SpeakerKit being 9.6x faster than Pyannote v3 while achieving comparable error rates. We benchmark 6 state-of-the-art systems including Deepgram, AWS Transcribe, and Pyannote AI API, revealing important trade-offs between accuracy and speed.
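[AI-55] Disability Across Cultures: A Human-Centered Audit of Ableism in Western and Indic LLMs
[Quick Read]: This paper addresses the inadequate ability of large language models (LLMs) used to identify and mitigate online hate to recognize ableism in non-Western contexts, particularly in places such as India where resources are limited and stigma is entrenched. The key to the approach is to translate a publicly available ableist speech dataset into Hindi and compare how LLMs developed in the U.S. (GPT-4, Gemini, Claude, Llama) and in India (Krutrim, Nanda, Gajendra, Airavata) score and explain ableism, with 175 people with disabilities from the U.S. and India performing the same task as human annotators. The study finds that Western LLMs consistently overestimate ableist harm while Indic LLMs underestimate it; moreover, all models were more tolerant of ableism expressed in Hindi and tended to apply Western framings, overlooking the way Indian people with disabilities interpret harm through intention, relationality, and resilience. The paper therefore argues that local disability experiences must be centered in the design and evaluation of AI systems to build more inclusive global standards of ableism.
Link: https://arxiv.org/abs/2507.16130
Authors: Mahika Phutane, Aditya Vashistha
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: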
Abstract:People with disabilities (PwD) experience disproportionately high levels of discrimination and hate online, particularly in India, where entrenched stigma and limited resources intensify these challenges. Large language models (LLMs) are increasingly used to identify and mitigate online hate, yet most research on online ableism focuses on Western audiences with Western AI models. Are these models adequately equipped to recognize ableist harm in non-Western places like India? Do localized, Indic language models perform better? To investigate, we adopted and translated a publicly available ableist speech dataset to Hindi, and prompted eight LLMs–four developed in the U.S. (GPT-4, Gemini, Claude, Llama) and four in India (Krutrim, Nanda, Gajendra, Airavata)–to score and explain ableism. In parallel, we recruited 175 PwD from both the U.S. and India to perform the same task, revealing stark differences between groups. Western LLMs consistently overestimated ableist harm, while Indic LLMs underestimated it. Even more concerning, all LLMs were more tolerant of ableism when it was expressed in Hindi and asserted Western framings of ableist harm. In contrast, Indian PwD interpreted harm through intention, relationality, and resilience–emphasizing a desire to inform and educate perpetrators. This work provides groundwork for global, inclusive standards of ableism, demonstrating the need to center local disability experiences in the design and evaluation of AI systems.
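[AI-56] TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
[Quick read]: This paper addresses the limitations of current large language models (LLMs) on U.S. personal income tax calculation: despite their strong natural language understanding, these models still struggle to complete complex tax computations accurately. The key contribution is TaxCalcBench, a benchmark for evaluating whether LLMs can compute federal personal income tax returns when given all of the necessary information. On this benchmark, state-of-the-art models correctly calculate fewer than one third of returns even on a simplified sample set, with systematic errors in using tax tables, performing calculations, and determining eligibility, which points to the need for additional infrastructure before LLMs can be applied to tax calculation.
Link: https://arxiv.org/abs/2507.16126
Authors: Michael R. Bock, Kara Molisee, Zachary Ozer, Sumit Shah
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: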
Abstract:Can AI file your taxes? Not yet. Calculating US personal income taxes is a task that requires building an understanding of vast amounts of English text and using that knowledge to carefully compute results. We propose TaxCalcBench, a benchmark for determining models’ abilities to calculate personal income tax returns given all of the necessary information. Our experiment shows that state-of-the-art models succeed in calculating less than a third of federal income tax returns even on this simplified sample set. Our analysis concludes that models consistently misuse tax tables, make errors in tax calculation, and incorrectly determine eligibility. Our findings point to the need for additional infrastructure to apply LLMs to the personal income tax calculation task.
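As an illustration of the table-driven arithmetic the abstract refers to, here is a minimal Python sketch of a progressive bracket computation. The bracket thresholds and rates (`HYPOTHETICAL_BRACKETS`) and the function name `progressive_tax` are assumptions made up for this example; they are not real IRS figures and are not taken from TaxCalcBench.

```python
from decimal import Decimal

# Hypothetical brackets: (upper bound of the bracket, marginal rate).
# Illustrative values only, not real tax-year figures.
HYPOTHETICAL_BRACKETS = [
    (Decimal("10000"), Decimal("0.10")),
    (Decimal("40000"), Decimal("0.20")),
    (None, Decimal("0.30")),  # None marks the open-ended top bracket
]


def progressive_tax(taxable_income, brackets=HYPOTHETICAL_BRACKETS):
    """Apply each marginal rate only to the slice of income inside its bracket."""
    income = Decimal(taxable_income)
    tax = Decimal("0")
    lower = Decimal("0")
    for upper, rate in brackets:
        if income <= lower:
            break  # no income reaches this bracket
        top = income if upper is None else min(income, upper)
        tax += (top - lower) * rate
        if upper is None:
            break
        lower = upper
    return tax


print(progressive_tax(50000))
```

A common mistake this sketch avoids is applying the top marginal rate to the whole income rather than to the slice above the bracket threshold. `Decimal` is used so that currency arithmetic does not pick up binary floating-point rounding.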
[AI-57] Benchmarking LLM Privacy Recognition for Social Robot Decision Making
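[Quick read]: This paper addresses the privacy risks that arise when large language models (LLMs) process sensitive personal information in household social robot applications, and in particular evaluates whether out-of-the-box LLMs are privacy-aware and can respect users' privacy preferences. The key to its approach is a set of privacy-relevant scenarios built on Contextual Integrity (CI) theory, combined with a comparison between a large user survey (N=450) and state-of-the-art LLMs (N=10), which shows low agreement between human and LLM privacy decisions. It then implements four additional prompting strategies to probe LLMs as potential privacy controllers and discusses the implications for human-robot interaction.
Link: https://arxiv.org/abs/2507.16124
Authors: Dakota Sullivan, Shirley Zhang, Jennica Li, Heather Kirkorian, Bilge Mutlu, Kassem Fawaz
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 18 pages, 7 figures. Dakota Sullivan and Shirley Zhang contributed equally to this work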
Abstract:Social robots are embodied agents that interact with people while following human communication norms. These robots interact using verbal and non-verbal cues, and share the physical environments of people. While social robots have previously utilized rule-based systems or probabilistic models for user interaction, the rapid evolution of large language models (LLMs) presents new opportunities to develop LLM-empowered social robots for enhanced human-robot interaction. To fully realize these capabilities, however, robots need to collect data such as audio, fine-grained images, video, and locations. As a result, LLMs often process sensitive personal information, particularly within home environments. Given the tension between utility and privacy risks, evaluating how current LLMs manage sensitive data is critical. Specifically, we aim to explore the extent to which out-of-the-box LLMs are privacy-aware in the context of household social robots. In this study, we present a set of privacy-relevant scenarios crafted through the lens of Contextual Integrity (CI). We first survey users’ privacy preferences regarding in-home social robot behaviors and then examine how their privacy orientation affects their choices of these behaviors (N = 450). We then provide the same set of scenarios and questions to state-of-the-art LLMs (N = 10) and find that the agreement between humans and LLMs is low. To further investigate the capabilities of LLMs as a potential privacy controller, we implement four additional prompting strategies and compare their results. Finally, we discuss the implications and potential of AI privacy awareness in human-robot interaction.
[AI-58] Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization
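[Quick read]: This paper addresses the underexplored reasoning capabilities of large language models (LLMs) in domain-specific applications such as battery materials discovery, and in particular how to combine chain-of-thought (CoT) reasoning with domain knowledge to support an AI-driven cycle from design to synthesis to characterization. The key is ChatBattery, a novel agentic framework that integrates battery domain expertise to steer LLMs toward more effective reasoning in materials design. Using it, the authors identify, synthesize, and characterize three novel lithium-ion battery cathode materials with practical capacity improvements of 28.8%, 25.2%, and 18.5% over the widely used LiNi₀.₈Mn₀.₁Co₀.₁O₂ (NMC811).
Link: https://arxiv.org/abs/2507.16110
Authors: Shengchao Liu, Hannan Xu, Yan Ai, Huanxin Li, Yoshua Bengio, Harry Guo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: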
Abstract:Large language models (LLMs) leverage chain-of-thought (CoT) techniques to tackle complex problems, representing a transformative breakthrough in artificial intelligence (AI). However, their reasoning capabilities have primarily been demonstrated in solving math and coding problems, leaving their potential for domain-specific applications, such as battery discovery, largely unexplored. Inspired by the idea that reasoning mirrors a form of guided search, we introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. Using ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi₀.₈Mn₀.₁Co₀.₁O₂ (NMC811). Beyond this discovery, ChatBattery paves a new path by showing a successful LLM-driven and reasoning-based platform for battery materials invention. This complete AI-driven cycle, from design to synthesis to characterization, demonstrates the transformative potential of AI-driven reasoning in revolutionizing materials discovery.
[AI-59] A Lower Bound for the Number of Linear Regions of Ternary ReLU Regression Neural Networks
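[Quick read]: This paper is motivated by the high computational complexity and memory consumption of deep neural networks (DNNs), which ternary neural networks reduce by restricting parameters to the three values {-1, 0, +1}. Its key contribution is a theoretical analysis of the expressivity of ternary neural networks (TNNs): by evaluating the number of linear regions of ternary regression networks with ReLU activations, it proves that this number grows polynomially with network width and exponentially with depth, as in standard networks. It further shows that squaring the width or doubling the depth of a ternary network suffices to achieve a lower bound on the maximum number of linear regions comparable to that of general ReLU regression networks, which offers a theoretical explanation for the practical success of ternary networks.
Link: https://arxiv.org/abs/2507.16079
Authors: Yuta Nakahara, Manabu Kobayashi, Toshiyasu Matsushima
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: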
Abstract:With the advancement of deep learning, reducing computational complexity and memory consumption has become a critical challenge, and ternary neural networks (NNs) that restrict parameters to {-1, 0, +1} have attracted attention as a promising approach. While ternary NNs demonstrate excellent performance in practical applications such as image recognition and natural language processing, their theoretical understanding remains insufficient. In this paper, we theoretically analyze the expressivity of ternary NNs from the perspective of the number of linear regions. Specifically, we evaluate the number of linear regions of ternary regression NNs with Rectified Linear Unit (ReLU) for activation functions and prove that the number of linear regions increases polynomially with respect to network width and exponentially with respect to depth, similar to standard NNs. Moreover, we show that it suffices to either square the width or double the depth of ternary NNs to achieve a lower bound on the maximum number of linear regions comparable to that of general ReLU regression NNs. This provides a theoretical explanation, in some sense, for the practical success of ternary NNs.
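To make the notion of "linear regions" concrete, the following is a small Python sketch that estimates the number of regions of a ReLU network with scalar input by counting changes in the activation pattern along a grid. It is not the paper's proof technique. The function names, the two example networks, and the choice to make biases ternary as well as weights are all assumptions for illustration; the abstract does not say how biases are treated.

```python
def activation_pattern(weights, biases, x):
    """Return the on/off pattern of every ReLU unit for a scalar input x."""
    pattern = []
    hidden = [x]
    for layer_w, layer_b in zip(weights, biases):
        pre = [sum(w * h for w, h in zip(row, hidden)) + b
               for row, b in zip(layer_w, layer_b)]
        pattern.extend(p > 0 for p in pre)
        hidden = [max(p, 0.0) for p in pre]
    return tuple(pattern)


def count_regions_on_grid(weights, biases, lo=-3.0, hi=3.0, cells=6000):
    """Grid-based estimate: count pattern changes between cell midpoints."""
    regions = 0
    previous = None
    for i in range(cells):
        x = lo + (hi - lo) * (i + 0.5) / cells
        current = activation_pattern(weights, biases, x)
        if current != previous:
            regions += 1
            previous = current
    return regions


# Hypothetical shallow net: relu(x + 1), relu(x), relu(x - 1).
shallow = ([[[1], [1], [1]]], [[1, 0, -1]])

# Hypothetical three-layer net, all parameters in {-1, 0, +1}:
# layer 1 builds |x|, layer 2 builds ||x| - 1|, layer 3 thresholds it at 1.
deep = (
    [[[1], [-1]], [[1, 1], [-1, -1]], [[1, 1]]],
    [[0, 0], [-1, 1], [-1]],
)

print(count_regions_on_grid(*shallow), count_regions_on_grid(*deep))
```

On the interval [-3, 3] the shallow example has breakpoints at -1, 0, and 1 (4 regions), while the three-layer example has breakpoints at -2, -1, 0, 1, and 2 (6 regions). The count is only an estimate: it relies on the grid being fine enough and on no midpoint landing exactly on a breakpoint.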
[AI-60] AI-driven Orchestration at Scale: Estimating Service Metrics on National-Wide Testbeds
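[Quick read]: This paper addresses the difficulty of validating network slicing (NS) in real, large-scale production environments, especially for machine learning (ML)-enabled orchestration architectures, which are typically tested in local networks or laboratory simulations. The key is a latency prediction model based on deep neural networks (DNNs) and basic ML algorithms embedded in an NS architecture, evaluated empirically by deploying a distributed database application as a network slice over two large-scale production testbeds. This allows the performance of different DNNs and ML algorithms to be measured and compared, and yields a validation method that integrates seamlessly into production environments as an alternative to fully controlled simulations or laboratory setups.
Link: https://arxiv.org/abs/2507.16077
Authors: Rodrigo Moreira, Rafael Pasquini, Joberto S. B. Martins, Tereza C. Carvalho, Flávio de Oliveira Silva
Affiliation: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Networking and Internet Architecture (cs.NI)
Comments: 17 pages, 18 figures, 14 tables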
Abstract:Network Slicing (NS) realization requires AI-native orchestration architectures to efficiently and intelligently handle heterogeneous user requirements. To achieve this, network slicing is evolving towards a more user-centric digital transformation, focusing on architectures that incorporate native intelligence to enable self-managed connectivity in an integrated and isolated manner. However, these initiatives face the challenge of validating their results in production environments, particularly those utilizing ML-enabled orchestration, as they are often tested in local networks or laboratory simulations. This paper proposes a large-scale validation method using a network slicing prediction model to forecast latency using Deep Neural Networks (DNNs) and basic ML algorithms embedded within an NS architecture, evaluated in real large-scale production testbeds. It measures and compares the performance of different DNNs and ML algorithms, considering a distributed database application deployed as a network slice over two large-scale production testbeds. The investigation highlights how AI-based prediction models can enhance network slicing orchestration architectures and presents a seamless, production-ready validation method as an alternative to fully controlled simulations or laboratory setups.
[AI-61] Compositional Coordination for Multi-Robot Teams with Large Language Models
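[Quick read]: This paper addresses the reliance of traditional multi-robot coordination on experts manually translating natural language mission descriptions into mathematical formulations, algorithm designs, and executable code, a process that is labor-intensive, inaccessible to non-experts, and inflexible to changes in mission requirements. The key is LAN2CB (Language to Collective Behavior), a framework that converts natural language directly into deployable robot control code through two core components: (1) Mission Decomposition, which parses the mission into a task graph with dependencies; and (2) Code Generation, which uses the task graph and a structured knowledge base to generate Python control code. The approach significantly reduces manual engineering effort and supports generalization across mission types.
Link: https://arxiv.org/abs/2507.16068
Authors: Zhehui Huang, Guangyao Shi, Yuwei Wu, Vijay Kumar, Gaurav S. Sukhatme
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 9 pages, 4 figures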
Abstract:Multi-robot coordination has traditionally relied on a task-specific and expert-driven pipeline, where natural language mission descriptions are manually translated by domain experts into mathematical formulation, algorithm design, and executable code. This conventional process is labor-intensive, inaccessible to non-experts, and inflexible to changes in mission requirements. Here, we propose LAN2CB (Language to Collective Behavior), a novel framework that leverages large language models (LLMs) to streamline and generalize the multi-robot coordination pipeline. LAN2CB directly converts natural language mission descriptions into executable Python code for multi-robot systems through two key components: (1) Mission Decomposition for Task Representation, which parses the mission into a task graph with dependencies, and (2) Code Generation, which uses the task graph and a structured knowledge base to generate deployable robot control code. We further introduce a dataset of natural language mission specifications to support development and benchmarking. Experimental results in both simulation and real-world settings show that LAN2CB enables effective and flexible multi-robot coordination from natural language, significantly reducing the need for manual engineering while supporting generalization across mission types. Website: this https URL.
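The idea of a "task graph with dependencies" can be sketched with Python's standard `graphlib` module. This is not LAN2CB's actual data structure or API: the mission, the task names, and the helper `execution_stages` are hypothetical, chosen only to show how dependencies determine which tasks can run and in what order.

```python
from graphlib import TopologicalSorter

# Hypothetical mission: each task maps to the set of tasks it depends on.
task_graph = {
    "explore_area": set(),
    "locate_target": {"explore_area"},
    "form_perimeter": {"locate_target"},
    "report_status": {"locate_target"},
    "return_to_base": {"form_perimeter", "report_status"},
}


def execution_stages(graph):
    """Group tasks into stages; every task in a stage has all dependencies met."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


for stage in execution_stages(task_graph):
    print(stage)
```

Tasks that land in the same stage (here `form_perimeter` and `report_status`) have no dependency on each other, so in principle they could be assigned to different robots at the same time.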
[AI-62] A Unifying Framework for Semiring-Based Constraint Logic Programming With Negation (full version) IJCAI2025
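[Quick read]: This paper addresses a gap in Constraint Logic Programming (CLP): extensions that allow negation in the body of clauses have not been studied. Existing extensions of CLP (for example to fuzzy constraint satisfaction, uncertainty, or negation) have been modeled with different notions of semiring as a unifying abstraction, but none of them considers negation in rule bodies. The key is a new extension of CLP that uses approximation fixpoint theory to give semantics to programs with negation in the body and systematically analyzes how properties of the semirings affect the resulting semantics. The framework unifies existing approaches and extends them with a more expressive language.
Link: https://arxiv.org/abs/2507.16067
Authors: Jeroen Spaans, Jesse Heyninck
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Full version, including proofs and appendices, of paper accepted at IJCAI 2025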
Abstract:Constraint Logic Programming (CLP) is a logic programming formalism used to solve problems requiring the consideration of constraints, like resource allocation and automated planning and scheduling. It has previously been extended in various directions, for example to support fuzzy constraint satisfaction, uncertainty, or negation, with different notions of semiring being used as a unifying abstraction for these generalizations. None of these extensions have studied clauses with negation allowed in the body. We investigate an extension of CLP which unifies many of these extensions and allows negation in the body. We provide semantics for such programs, using the framework of approximation fixpoint theory, and give a detailed overview of the impacts of properties of the semirings on the resulting semantics. As such, we provide a unifying framework that captures existing approaches and allows extending them with a more expressive language.
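The role of a semiring as a "unifying abstraction" can be shown with a short Python sketch. It assumes the usual reading in which `times` combines the constraints in one rule body and `plus` combines alternative derivations. The `Semiring` class, the three instances, and `evaluate` are illustrative names, not the paper's formalism, and the sketch deliberately leaves out negation and the approximation fixpoint semantics that the paper contributes.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # combines alternative derivations
    times: Callable[[Any, Any], Any]  # combines constraints within one body
    zero: Any                         # identity element for plus
    one: Any                          # identity element for times


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
FUZZY = Semiring(max, min, 0.0, 1.0)
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)


def evaluate(semiring, derivations):
    """Value of an atom: 'plus' over derivations of the 'times' of each body."""
    total = semiring.zero
    for body in derivations:
        value = semiring.one
        for item in body:
            value = semiring.times(value, item)
        total = semiring.plus(total, value)
    return total


print(evaluate(FUZZY, [[0.9, 0.4], [0.6, 0.7]]))
print(evaluate(TROPICAL, [[2, 3], [1, 5]]))
```

The same `evaluate` function gives classical truth, a fuzzy satisfaction degree, or a minimum cost, depending only on which semiring is passed in.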
[AI-63] AI-Powered Commit Explorer (APCE)
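[Quick read]: This paper addresses the problem of low-quality or missing commit messages in software development practice, which hampers the traceability of code changes and later maintenance. Developers often neglect to write clear commit messages, leaving future maintainers unable to understand the intent of a change. The authors propose the AI-Powered Commit Explorer (APCE), a tool that supports generating commit messages with large language models (LLMs): it lets researchers store different prompts and provides an additional evaluation prompt that can further improve the generated message. APCE also provides mechanisms for automated and human evaluation, so that researchers can systematically study LLM-generated commit messages.
Link: https://arxiv.org/abs/2507.16063
Authors: Yousab Grees, Polina Iaremchuk, Ramtin Ehsani, Esteban Parra, Preetha Chatterjee, Sonia Haiduc
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: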
Abstract:Commit messages in a version control system provide valuable information for developers regarding code changes in software systems. Commit messages can be the only source of information left for future developers describing what was changed and why. However, writing high-quality commit messages is often neglected in practice. Large Language Model (LLM) generated commit messages have emerged as a way to mitigate this issue. We introduce the AI-Powered Commit Explorer (APCE), a tool to support developers and researchers in the use and study of LLM-generated commit messages. APCE gives researchers the option to store different prompts for LLMs and provides an additional evaluation prompt that can further enhance the commit message provided by LLMs. APCE also provides researchers with a straightforward mechanism for automated and human evaluation of LLM-generated messages. Demo link this https URL
[AI-64] Beyond Rate Coding: Surrogate Gradients Enable Spike Timing Learning in Spiking Neural Networks
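[Quick read]: This paper investigates whether Spiking Neural Networks (SNNs) can learn from precise spike timing rather than relying only on firing rates. The core challenge is to separate the contributions of temporal coding and rate coding and to verify that SNNs can still perform tasks when spike count information is unavailable. The key steps are: first, designing synthetic tasks that isolate intra-neuron inter-spike intervals and cross-neuron synchrony; second, constructing variants of the SHD and SSC speech recognition datasets in which spike count information is removed and only timing information remains, and training SNNs with Surrogate Gradient Descent, which perform significantly above chance while purely rate-based models perform at chance level; third, evaluating robustness under biologically inspired perturbations (such as Gaussian jitter and spike deletion). The networks are sensitive to time reversal, with a larger performance drop for networks trained with delays, which the authors interpret as more human-like behaviour.
Link: https://arxiv.org/abs/2507.16043
Authors: Ziqiao Yu, Pengfei Sun, Dan F. M. Goodman
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: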
Abstract:We investigate the extent to which Spiking Neural Networks (SNNs) trained with Surrogate Gradient Descent (Surrogate GD), with and without delay learning, can learn from precise spike timing beyond firing rates. We first design synthetic tasks isolating intra-neuron inter-spike intervals and cross-neuron synchrony under matched spike counts. On more complex spike-based speech recognition datasets (Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC)), we construct variants where spike count information is eliminated and only timing information remains, and show that Surrogate GD-trained SNNs are able to perform significantly above chance whereas purely rate-based models perform at chance level. We further evaluate robustness under biologically inspired perturbations, including Gaussian jitter per spike or per neuron and spike deletion, revealing consistent but perturbation-specific degradation. Networks show a sharp performance drop when spike sequences are reversed in time, with a larger drop in performance from SNNs trained with delays, indicating that these networks are more human-like in terms of behaviour. To facilitate further studies of temporal coding, we have released our modified SHD and SSC datasets.
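The phrase "matched spike counts" can be made concrete with a small Python sketch: two spike trains with the same number of spikes are indistinguishable to a count-based (rate) feature but differ in their inter-spike intervals. The spike times and helper names below are hypothetical and are not taken from the paper's synthetic tasks or datasets.

```python
def train_from_times(spike_times, n_steps):
    """Binary spike train of length n_steps with spikes at the given time steps."""
    times = set(spike_times)
    return [1 if t in times else 0 for t in range(n_steps)]


def spike_count(train):
    """Rate-style feature: total number of spikes, ignoring when they occur."""
    return sum(train)


def inter_spike_intervals(train):
    """Timing feature: gaps between consecutive spikes."""
    times = [t for t, s in enumerate(train) if s]
    return [b - a for a, b in zip(times, times[1:])]


# Two hypothetical trains with identical counts but different timing.
regular = train_from_times([0, 20, 40, 60, 80], n_steps=100)
burst = train_from_times([0, 1, 2, 3, 4], n_steps=100)

print(spike_count(regular), spike_count(burst))
print(inter_spike_intervals(regular), inter_spike_intervals(burst))
```

Reversing a train in time also keeps the count unchanged while reversing the order of its intervals, which is why a time-reversal test probes timing sensitivity rather than rate sensitivity.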
[AI-65] Reactivation: Empirical NTK Dynamics Under Task Shifts
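[Quick read]: This paper addresses the fact that research on the Neural Tangent Kernel (NTK) has been limited to the single-task setting, where the data distribution is assumed constant during training, whereas in continual learning the data distribution shifts over time and it remains unclear how these shifts affect the evolution of the NTK. The key is a systematic empirical analysis of NTK dynamics in continual learning, which shows that the NTK does not stay static under task shifts and challenges the validity of the static-kernel approximations widely used in theoretical treatments of continual learning, even at large scale.
Link: https://arxiv.org/abs/2507.16039
Authors: Yuzhi Liu, Zixuan Chen, Zirui Zhang, Yufei Liu, Giulia Lanzillotta
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: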
Abstract:The Neural Tangent Kernel (NTK) offers a powerful tool to study the functional dynamics of neural networks. In the so-called lazy, or kernel regime, the NTK remains static during training and the network function is linear in the static neural tangents feature space. The evolution of the NTK during training is necessary for feature learning, a key driver of deep learning success. The study of the NTK dynamics has led to several critical discoveries in recent years, in generalization and scaling behaviours. However, this body of work has been limited to the single task setting, where the data distribution is assumed constant over time. In this work, we present a comprehensive empirical analysis of NTK dynamics in continual learning, where the data distribution shifts over time. Our findings highlight continual learning as a rich and underutilized testbed for probing the dynamics of neural training. At the same time, they challenge the validity of static-kernel approximations in theoretical treatments of continual learning, even at large scale.
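[AI-66] “Just a strange pic”: Evaluating safety in GenAI Image safety annotation tasks from diverse annotators' perspectives AAAI
[Quick read]: This paper addresses a shortcoming in current safety evaluation of AI-generated content: existing safety pipelines tend to rely on predefined taxonomies and overlook the moral, emotional, and contextual reasoning that annotators bring to their judgments, which leads to a one-sided understanding of the complex notion of "safety". The key is an analysis of 5,372 open-ended annotator comments, which shows that annotators judge along multiple dimensions grounded in personal experience and sociocultural awareness and are also influenced by task structure and guidelines. The authors therefore argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations.
Link: https://arxiv.org/abs/2507.16033
Authors: Ding Wang, Mark Díaz, Charvi Rastogi, Aida Davani, Vinodkumar Prabhakaran, Pushkar Mishra, Roma Patel, Alicia Parrish, Zoe Ashwood, Michela Paganini, Tian Huey Teh, Verena Rieser, Lora Aroyo
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society 2025 (AIES 2025)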
Abstract:Understanding what constitutes safety in AI-generated content is complex. While developers often rely on predefined taxonomies, real-world safety judgments also involve personal, social, and cultural perceptions of harm. This paper examines how annotators evaluate the safety of AI-generated images, focusing on the qualitative reasoning behind their judgments. Analyzing 5,372 open-ended comments, we find that annotators consistently invoke moral, emotional, and contextual reasoning that extends beyond structured safety categories. Many reflect on potential harm to others more than to themselves, grounding their judgments in lived experience, collective risk, and sociocultural awareness. Beyond individual perceptions, we also find that the structure of the task itself – including annotation guidelines – shapes how annotators interpret and express harm. Guidelines influence not only which images are flagged, but also the moral judgment behind the justifications. Annotators frequently cite factors such as image quality, visual distortion, and mismatches between prompt and output as contributing to perceived harm dimensions, which are often overlooked in standard evaluation frameworks. Our findings reveal that existing safety pipelines miss critical forms of reasoning that annotators bring to the task. We argue for evaluation designs that scaffold moral reflection, differentiate types of harm, and make space for subjective, context-sensitive interpretations of AI-generated content.
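[AI-67] From Logic to Language: A Trust Index for Problem Solving with LLMs
[Quick read]: This paper addresses the limitations of the classical formal computation paradigm on human problems characterized by ambiguity, dynamic environments, and subjective context, which cannot be described with unambiguous rules. Its core contribution is a unified framework that distinguishes the problem spaces addressable by formal languages and by natural language, together with a vector-valued trust index Q that quantifies the quality of natural language solutions on a continuous adequacy spectrum, in contrast to the binary correctness used for formal solutions. The key is the definition of two statistical quality dimensions: normalized bi-semantic entropy, which measures the robustness and conceptual diversity of answers under semantic variation, and emotional valence, a quantifiable measure of subjective valuation, which together offer a more rigorous basis for evaluating problem solving in the age of generative AI.
Link: https://arxiv.org/abs/2507.16028
Authors: Tehseen Rug, Felix Böhmer, Tessa Pfattheicher
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 2 figures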
Abstract:Classical computation, grounded in formal, logical systems, has been the engine of technological progress for decades, excelling at problems that can be described with unambiguous rules. This paradigm, however, leaves a vast ocean of human problems – those characterized by ambiguity, dynamic environments, and subjective context – largely untouched. The advent of Large Language Models (LLMs) represents a fundamental shift, enabling computational systems to engage with this previously inaccessible domain using natural language. This paper introduces a unified framework to understand and contrast these problem-solving paradigms. We define and delineate the problem spaces addressable by formal languages versus natural language. While solutions to the former problem class can be evaluated using binary quality measures, the latter requires a much more nuanced definition of approximate solution space taking into account the vagueness, subjectivity and ambiguity inherent to natural language. We therefore introduce a vector-valued trust index Q, which reflects solution quality and distinguishes the binary correctness of formal solutions from the continuous adequacy spectrum characteristic of natural language solutions. Within this framework, we propose two statistical quality dimensions. Normalized bi-semantic entropy measures robustness and conceptual diversity of LLM answers given semantic variation in problem formulations. Emotional valence maps subjective valuation of a solution to a quantifiable metric that can be maximized by invoking statistical measures. The concepts introduced in this work will provide a more rigorous understanding of the capabilities, limitations, and inherent nature of problem-solving in the age of LLMs.
[AI-68] Micromobility Flow Prediction: A Bike Sharing Station-level Study via Multi-level Spatial-Temporal Attention Neural Network
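[Quick read]: This paper addresses the operational difficulties caused by unbalanced station-level demand and supply of urban micromobility resources such as shared bikes; the core challenges are the spatial-temporal complexity of bike sharing systems and the large number of stations. The key is BikeMAN, a multi-level spatial-temporal attention neural network with an encoder-decoder structure and two attention mechanisms: one captures the spatial correlation between features of bike stations, and the other models the temporal characteristics of station traffic, enabling accurate bike traffic prediction for all stations across the city.
Link: https://arxiv.org/abs/2507.16020
Authors: Xi Yang, Jiachen Wang, Song Han, Suining He
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, UrbComp 2024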
Abstract:Efficient use of urban micromobility resources such as bike sharing is challenging due to the unbalanced station-level demand and supply, which makes the maintenance of bike sharing systems painstaking. Prior efforts have been made on accurate prediction of bike traffic, i.e., demand/pick-up and return/drop-off, to achieve system efficiency. However, bike station-level traffic prediction is difficult because of the spatial-temporal complexity of bike sharing systems. Moreover, such level of prediction over entire bike sharing systems is also challenging due to the large number of bike stations. To fill this gap, we propose BikeMAN, a multi-level spatio-temporal attention neural network to predict station-level bike traffic for entire bike sharing systems. The proposed network consists of an encoder and a decoder with an attention mechanism representing the spatial correlation between features of bike stations in the system and another attention mechanism describing the temporal characteristic of bike station traffic. Through experimental study on over 10 million trips of bike sharing systems ( 700 stations) of New York City, our network showed high accuracy in predicting the bike station traffic of all stations in the city.
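As a rough illustration of what an attention mechanism over stations computes, here is a minimal pure-Python sketch of attention for a single query. It assumes scaled dot-product scoring, which the abstract does not specify, and the station features and function names are hypothetical rather than BikeMAN's actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Hypothetical example: one target station attends over three stations.
query = [1.0, 0.0]                           # feature vector of the target station
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # feature vectors of all stations
values = [[10.0], [0.0], [20.0]]             # e.g. recent pick-up counts

weights, context = attend(query, keys, values)
print(weights, context)
```

Stations whose features resemble the query receive larger weights, so the context vector is a weighted average dominated by similar stations. The same routine applied over past time steps instead of stations would give a temporal attention of the kind the abstract mentions.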
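[AI-69] Dream, Lift, Animate: From Single Images to Animatable Gaussian Avatars
[Quick read]: This paper addresses the reconstruction of animatable 3D human avatars from a single image, focusing on how to obtain high-fidelity, structurally consistent 3D models that support pose-driven deformation without multi-view input or complex post-processing. The key is a new framework, Dream, Lift, Animate (DLA): it first uses a video diffusion model to generate plausible multi-view images that capture geometric and appearance details; it then lifts these views into unstructured 3D Gaussians; finally, a transformer-based encoder models global spatial relationships and projects the Gaussians into a structured latent representation aligned with the UV space of a parametric body model, enabling pose-driven deformation and rendering. By anchoring the Gaussians to the UV manifold, the method maintains visual consistency during animation while preserving fine details, and it outperforms state-of-the-art methods in perceptual quality and photometric accuracy.
Link: https://arxiv.org/abs/2507.15979
Authors: Marcel C. Bühler, Ye Yuan, Xueting Li, Yangyi Huang, Koki Nagano, Umar Iqbal
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: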
Abstract:We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
[AI-70] On the transferability of Sparse Autoencoders for interpreting compressed models
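[Quick read]: This paper addresses the unclear effect of compression on the interpretability of large language models (LLMs). Mainstream compression methods such as pruning and quantization improve inference efficiency, but their effect on tools for understanding a model's internal mechanisms, in particular feature decomposition with Sparse Autoencoders (SAEs), has not been fully explored. The key findings are that SAEs trained on the original model can still interpret the compressed model, with slight performance degradation, and that simply pruning the original SAE achieves performance comparable to training a new SAE on the pruned model, which substantially reduces SAE training costs and offers a practical path to efficient, low-cost interpretability analysis.
Link: https://arxiv.org/abs/2507.15977
Authors: Suchit Gupte, Vishnu Kabir Chhabra, Mohammad Mahdi Khalili
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: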
Abstract:Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model’s interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model’s activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model albeit with slight performance degradation compared to the trained SAE on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.
[AI-71] Does More Inference-Time Compute Really Help Robustness?
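[Quick read]: This paper examines whether the assumption that inference-time scaling improves the robustness of large language models (LLMs) holds in general, in particular for smaller open-source models and under different adversarial settings. It first shows, using a simple budget forcing strategy, that smaller open models can also benefit from inference-time scaling. More importantly, it reveals and critically examines an implicit assumption in prior work, namely that intermediate reasoning steps are hidden from adversaries; relaxing this assumption shows that when the reasoning chain is explicitly exposed, more inference-time computation reduces robustness, forming an inverse scaling law. The robustness benefits of inference-time scaling therefore depend heavily on the adversarial setting and deployment context, and such techniques should be weighed carefully in security-sensitive scenarios.
Link: https://arxiv.org/abs/2507.15974
Authors: Tong Wu, Chong Xiang, Jiachen T. Wang, Weichen Yu, Chawin Sitawarin, Vikash Sehwag, Prateek Mittal
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Preprint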
Abstract:Recently, Zaremba et al. demonstrated that increasing inference-time computation improves robustness in large proprietary reasoning LLMs. In this paper, we first show that smaller-scale, open-source models (e.g., DeepSeek R1, Qwen3, Phi-reasoning) can also benefit from inference-time scaling using a simple budget forcing strategy. More importantly, we reveal and critically examine an implicit assumption in prior work: intermediate reasoning steps are hidden from adversaries. By relaxing this assumption, we identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law: if intermediate reasoning steps become explicitly accessible, increased inference-time computation consistently reduces model robustness. Finally, we discuss practical scenarios where models with hidden reasoning chains are still vulnerable to attacks, such as models with tool-integrated reasoning and advanced reasoning extraction attacks. Our findings collectively demonstrate that the robustness benefits of inference-time scaling depend heavily on the adversarial setting and deployment context. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
[AI-72] Nonlinear Framework for Speech Bandwidth Extension
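[Quick read]: This paper addresses the loss of high-frequency information under bandwidth constraints, which matters for telecommunications and for high-fidelity audio reconstruction in resource-constrained settings. The core of its solution is NDSI-BWE, an adversarial Band Width Extension (BWE) framework guided by seven discriminators, of which the abstract describes four as new and inspired by nonlinear dynamical systems. The seven are a Multi-Resolution Lyapunov Discriminator (MRLD), a Multi-Scale Recurrence Discriminator (MS-RD), a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA), a Multi-Resolution Poincaré Plot Discriminator (MR-PPD), a Multi-Period Discriminator (MPD), a Multi-Resolution Amplitude Discriminator (MRAD), and a Multi-Resolution Phase Discriminator (MRPD); together they model chaotic sensitivity, self-similarity, long-range correlation, hidden latent-space relationships, periodic patterns, and amplitude-phase transition statistics. The generator is a complex-valued ConformerNeXt with a dual-stream Lattice-Net architecture that combines the global dependency modeling of the transformer-based Conformer with the local temporal modeling of ConvNeXt blocks to refine magnitude and phase simultaneously; depth-wise convolutions in the discriminators give an eightfold parameter reduction. Across six objective metrics and a subjective evaluation with five human judges, the method reaches state-of-the-art (SoTA) performance.
Link: https://arxiv.org/abs/2507.15970
Authors: Tarikul Islam Tamiti, Nursad Mamun, Anomadarshi Barua
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: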
Abstract:Recovering high-frequency components lost to bandwidth constraints is crucial for applications ranging from telecommunications to high-fidelity audio on limited resources. We introduce NDSI-BWE, a new adversarial Band Width Extension (BWE) framework that leverage four new discriminators inspired by nonlinear dynamical system to capture diverse temporal behaviors: a Multi-Resolution Lyapunov Discriminator (MRLD) for determining sensitivity to initial conditions by capturing deterministic chaos, a Multi-Scale Recurrence Discriminator (MS-RD) for self-similar recurrence dynamics, a Multi-Scale Detrended Fractal Analysis Discriminator (MSDFA) for long range slow variant scale invariant relationship, a Multi-Resolution Poincaré Plot Discriminator (MR-PPD) for capturing hidden latent space relationship, a Multi-Period Discriminator (MPD) for cyclical patterns, a Multi-Resolution Amplitude Discriminator (MRAD) and Multi-Resolution Phase Discriminator (MRPD) for capturing intricate amplitude-phase transition statistics. By using depth-wise convolution at the core of the convolutional block with in each discriminators, NDSI-BWE attains an eight-times parameter reduction. These seven discriminators guide a complex-valued ConformerNeXt based genetor with a dual stream Lattice-Net based architecture for simultaneous refinement of magnitude and phase. The genertor leverage the transformer based conformer’s global dependency modeling and ConvNeXt block’s local temporal modeling capability. Across six objective evaluation metrics and subjective based texts comprises of five human judges, NDSI-BWE establishes a new SoTA in BWE.
[AI-73] Dual Turing Test: A Framework for Detecting and Mitigating Undetectable AI
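【Quick Read】: This paper addresses how to detect and mitigate undetectable AI-generated output during alignment while keeping output quality high. Its key contribution is a unified framework that combines the "dual Turing test", an adversarial classification game with explicit quality constraints, and a reinforcement learning (RL) alignment pipeline. The dual test is defined as a two-player zero-sum game over N independent rounds, with a quality function $Q$, threshold parameters $\tau$ and $\delta$, and minimax bounds that give worst-case guarantees. On top of this the paper builds an RL-HF style alignment loop in which an undetectability detector $D$ assigns negative reward to stealthy outputs, balanced by a quality proxy that preserves fluency.
Link: https://arxiv.org/abs/2507.15907
Authors: Alberto Messina
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: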
Abstract:In this short note, we propose a unified framework that bridges three areas: (1) a flipped perspective on the Turing Test, the “dual Turing test”, in which a human judge’s goal is to identify an AI rather than reward a machine for deception; (2) a formal adversarial classification game with explicit quality constraints and worst-case guarantees; and (3) a reinforcement learning (RL) alignment pipeline that uses an undetectability detector and a set of quality related components in its reward model. We review historical precedents, from inverted and meta-Turing variants to modern supervised reverse-Turing classifiers, and highlight the novelty of combining quality thresholds, phased difficulty levels, and minimax bounds. We then formalize the dual test: define the judge’s task over N independent rounds with fresh prompts drawn from a prompt space Q, introduce a quality function Q and parameters tau and delta, and cast the interaction as a two-player zero-sum game over the adversary’s feasible strategy set M. Next, we map this minimax game onto an RL-HF style alignment loop, in which an undetectability detector D provides negative reward for stealthy outputs, balanced by a quality proxy that preserves fluency. Throughout, we include detailed explanations of each component notation, the meaning of inner minimization over sequences, phased tests, and iterative adversarial training and conclude with a suggestion for a couple of immediate actions.
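[AI-74] Towards Reliable Uncertainty-Aware Alignment
【Quick Read】: This paper addresses the risk that unstable reward model estimates pose to policy optimization when aligning large language models (LLMs). Existing pipelines usually optimize against a single reward model, yet the experiments show that reward models trained independently on the same preference dataset can disagree substantially, and a theoretical model shows that this variability can cause overfitting and degrade performance. The key to the solution is a variance-aware policy optimization framework whose central ingredient is a new policy regularizer that incorporates reward model variance estimates, which provably reduces the risk of producing a policy worse than the default. Experiments across diverse LLM and reward model configurations confirm more stable and robust alignment than the standard variance-unaware pipeline.
Link: https://arxiv.org/abs/2507.15906
Authors: Debangshu Banerjee, Kintan Saha, Aditya Gopalan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: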
Abstract:Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model estimate can render it vulnerable to inaccuracies in the reward model. We empirically study the variability of reward model training on open-source benchmarks. We observe that independently trained reward models on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new policy regularizer that incorporates reward model variance estimates. We show that variance-aware policy optimization provably reduces the risk of outputting a worse policy than the default. Experiments across diverse LLM and reward model configurations confirm that our approach yields more stable and robust alignment than the standard (variance-unaware) pipeline.
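The intuition behind penalizing reward-model disagreement can be illustrated with a toy example. The sketch below is not the paper's regularizer: it assumes a small ensemble of independently trained reward models whose scores for each candidate response are already available, and it simply subtracts a multiple of the ensemble standard deviation from the mean score. The function name `variance_penalized_scores` and the weight `lam` are hypothetical.

```python
from statistics import mean, pstdev

def variance_penalized_scores(scores_by_model, lam=1.0):
    """Return mean - lam * std per candidate.

    scores_by_model[m][c] is the (hypothetical) score that reward model m
    assigns to candidate response c.
    """
    n_candidates = len(scores_by_model[0])
    adjusted = []
    for c in range(n_candidates):
        column = [row[c] for row in scores_by_model]
        adjusted.append(mean(column) - lam * pstdev(column))
    return adjusted

# Three reward models, two candidate responses.
# Candidate 1 has the higher average score, but the models disagree about it.
scores = [
    [1.0, 2.0],
    [1.0, 0.0],
    [1.0, 4.0],
]
plain = [mean(row[c] for row in scores) for c in range(2)]
adjusted = variance_penalized_scores(scores, lam=1.0)

best_plain = max(range(2), key=lambda c: plain[c])
best_adjusted = max(range(2), key=lambda c: adjusted[c])
print(best_plain, best_adjusted)
```

With these toy numbers the plain average prefers the candidate the models disagree on, while the penalized score prefers the one they agree on, which is the kind of conservative behavior a variance-aware objective is meant to encourage.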
[AI-75] Foundation Models and Transformers for Anomaly Detection: A Survey
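【Quick Read】: This survey addresses core challenges in visual anomaly detection (VAD): long-range dependency modeling, contextual modeling, and data scarcity. It examines how Transformers and foundation models, with their global receptive fields and adaptability, tackle these challenges by integrating attention mechanisms and leveraging large-scale pre-training, which enables more robust, interpretable, and scalable anomaly detection.
Link: https://arxiv.org/abs/2507.15905
Authors: Mouïn Ben Ammar, Arturo Mendoza, Nacim Belkhir, Antoine Manzanera, Gianni Franchi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: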
Abstract:In line with the development of deep learning, this survey examines the transformative role of Transformers and foundation models in advancing visual anomaly detection (VAD). We explore how these architectures, with their global receptive fields and adaptability, address challenges such as long-range dependency modeling, contextual modeling and data scarcity. The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches, highlighting the paradigm shift brought about by foundation models. By integrating attention mechanisms and leveraging large-scale pre-training, Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions. This work provides a comprehensive review of state-of-the-art techniques, their strengths, limitations, and emerging trends in leveraging these architectures for VAD.
[AI-76] Towards Mitigation of Hallucination for LLM-empowered Agents: Progressive Generalization Bound Exploration and Watchdog Monitor
【Quick Read】: This paper addresses hallucinations in agents powered by large language models (LLMs), where outputs are inconsistent with facts in open environments, undermining credibility and safe deployment. The key to the solution is HalMit, a black-box watchdog framework that detects hallucinations by modeling the generalization bound of LLM-empowered agents, without access to the model's internals. Its central innovation is a probabilistic fractal sampling technique that generates enough queries in parallel to trigger non-credible responses, so that the generalization bound of the target agent is identified efficiently.
Link: https://arxiv.org/abs/2507.15903
Authors: Siyuan Liu, Wenjing Liu, Zhiwei Xu, Xin Wang, Bo Chen, Tao Li
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Empowered by large language models (LLMs), intelligent agents have become a popular paradigm for interacting with open environments to facilitate AI deployment. However, hallucinations generated by LLMs, where outputs are inconsistent with facts, pose a significant challenge, undermining the credibility of intelligent agents. Only if hallucinations can be mitigated can intelligent agents be used in the real world without catastrophic risk. Therefore, effective detection and mitigation of hallucinations are crucial to ensure the dependability of agents. Unfortunately, the related approaches either depend on white-box access to LLMs or fail to accurately identify hallucinations. To address the challenge posed by hallucinations of intelligent agents, we present HalMit, a novel black-box watchdog framework that models the generalization bound of LLM-empowered agents and thus detects hallucinations without requiring internal knowledge of the LLM’s architecture. Specifically, a probabilistic fractal sampling technique is proposed to generate a sufficient number of queries to trigger non-credible responses in parallel, efficiently identifying the generalization bound of the target agent. Experimental evaluations demonstrate that HalMit significantly outperforms existing approaches in hallucination monitoring. Its black-box nature and superior performance make HalMit a promising solution for enhancing the dependability of LLM-powered systems.
[AI-77] Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation
【Quick Read】: This paper addresses the ethical challenges raised as agentic AI in the household moves from reactive to proactive autonomy, including privacy violations, bias, and loss of user control, with particular attention to the risks faced by vulnerable groups such as elderly people, children, and neurodivergent users. The key to the solution lies in responsible innovation frameworks and human-centered design principles, emphasizing tailored explainability, granular consent mechanisms, and robust override controls, supported by participatory and inclusive methodologies, so that smart home systems are transparent, fair, and trustworthy.
Link: https://arxiv.org/abs/2507.15901
Authors: Joydeep Chandra, Satyam Kumar Navneet
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:
Abstract:The implementation of Artificial Intelligence (AI) in household environments, especially in the form of proactive autonomous agents, brings possibilities of comfort and attention but also comes with intra- or extramural ethical challenges. This article analyzes agentic AI and its applications, focusing on its move from reactive to proactive autonomy, privacy, fairness and user control. We review responsible innovation frameworks, human-centered design principles, and governance practices to distill practical guidance for ethical smart home systems. Vulnerable user groups such as elderly individuals, children, and neurodivergent people, who face higher risks of surveillance, bias, and privacy violations, were studied in detail in the context of agentic AI. Design imperatives are highlighted, such as tailored explainability, granular consent mechanisms, and robust override controls, supported by participatory and inclusive methodologies. It was also explored how data-driven insights, including social media analysis via Natural Language Processing (NLP), can inform specific user needs and ethical concerns. This survey aims to provide both a conceptual foundation and suggestions for developing transparent, inclusive, and trustworthy agentic AI in household automation.
[AI-78] ReDi: Rectified Discrete Flow
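【Quick Read】: This paper addresses the slow sampling of Discrete Flow-based Models (DFMs) when generating high-quality discrete data. The slowness stems from iterative decoding, which in turn originates from the factorization approximation needed to handle high-dimensional data. The key to the solution is Rectified Discrete Flow (ReDi), which reduces the factorization error by rectifying the coupling between source and target distributions, thereby lowering the Conditional Total Correlation (TC). The paper proves that each ReDi step monotonically decreases the Conditional TC and hence converges; experiments show a significant reduction in Conditional TC and effective few-step generation, and the rectified couplings are also well suited to training efficient one-step image generation models.
Link: https://arxiv.org/abs/2507.15897
Authors: Jaehoon Yoo, Wonjung Kim, Seunghoon Hong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: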
Abstract:Discrete Flow-based Models (DFMs) are powerful generative models for high-quality discrete data but typically suffer from slow sampling speeds due to their reliance on iterative decoding processes. This reliance on a multi-step process originates from the factorization approximation of DFMs, which is necessary for handling high-dimensional data. In this paper, we rigorously characterize the approximation error from factorization using Conditional Total Correlation (TC), which depends on the coupling. To reduce the Conditional TC and enable efficient few-step generation, we propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces factorization error by rectifying the coupling between source and target distributions. We theoretically prove that each ReDi step guarantees a monotonic decreasing Conditional TC, ensuring its convergence. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation. Moreover, we demonstrate that the rectified couplings are well-suited for training efficient one-step models on image generation. ReDi offers a simple and theoretically grounded approach for tackling the few-step challenge, providing a new perspective on efficient discrete data synthesis. Code is available at this https URL
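To see why total correlation measures what a factorized model cannot represent, the sketch below computes the ordinary (unconditional) total correlation, the sum of marginal entropies minus the joint entropy, for two tiny joint distributions over a pair of binary tokens. It is only an illustration of the quantity: the paper works with the conditional version, which depends on the coupling, and the function names and toy distributions here are assumptions.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def total_correlation(joint):
    """Sum of per-dimension marginal entropies minus the joint entropy.

    joint maps tuples of token values to probabilities.
    """
    n_dims = len(next(iter(joint)))
    marginal_entropy_sum = 0.0
    for i in range(n_dims):
        marginal = defaultdict(float)
        for outcome, p in joint.items():
            marginal[outcome[i]] += p
        marginal_entropy_sum += entropy(marginal)
    return marginal_entropy_sum - entropy(joint)

# Two tokens that always agree: a per-token (factorized) model cannot express this.
coupled = {(0, 0): 0.5, (1, 1): 0.5}
# Two independent fair tokens: a factorized model is exact here.
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

tc_coupled = total_correlation(coupled)
tc_independent = total_correlation(independent)
print(tc_coupled, tc_independent)
```

A total correlation of zero means the joint distribution equals the product of its marginals, so sampling every token independently in one step loses nothing; a larger value indicates how much dependency a single factorized step would discard.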
[AI-79] Integrating Reason-Based Moral Decision-Making in the Reinforcement Learning Architecture
【Quick Read】: This paper addresses how to build artificial moral agents (AMAs) that behave ethically, in particular how to realize moral decision-making within the Reinforcement Learning framework that is interpretable, adaptable, and consistent with normative ethics. The key to the solution is the reason-based artificial moral agent (RBAMA), which extends the standard reinforcement learning architecture so that the agent learns a reason-theory from case-based feedback, that is, a theory for processing morally relevant propositions and deriving moral obligations. RBAMAs then adapt their behavior to conform to these obligations while pursuing their designated tasks, which contributes to the moral justifiability, moral robustness, and moral trustworthiness of their actions.
Link: https://arxiv.org/abs/2507.15895
Authors: Lisa Dargasz
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: Master’s thesis, April 2025, 122 pages
Abstract:Reinforcement Learning is a machine learning methodology that has demonstrated strong performance across a variety of tasks. In particular, it plays a central role in the development of artificial autonomous agents. As these agents become increasingly capable, market readiness is rapidly approaching, which means those agents, for example taking the form of humanoid robots or autonomous cars, are poised to transition from laboratory prototypes to autonomous operation in real-world environments. This transition raises concerns leading to specific requirements for these systems - among them, the requirement that they are designed to behave ethically. Crucially, research directed toward building agents that fulfill the requirement to behave ethically - referred to as artificial moral agents (AMAs) - has to address a range of challenges at the intersection of computer science and philosophy. This study explores the development of reason-based artificial moral agents (RBAMAs). RBAMAs are built on an extension of the reinforcement learning architecture to enable moral decision-making based on sound normative reasoning, which is achieved by equipping the agent with the capacity to learn a reason-theory - a theory which enables it to process morally relevant propositions to derive moral obligations - through case-based feedback. They are designed such that they adapt their behavior to ensure conformance to these obligations while they pursue their designated tasks. These features contribute to the moral justifiability of their actions, their moral robustness, and their moral trustworthiness, which proposes the extended architecture as a concrete and deployable framework for the development of AMAs that fulfills key ethical desiderata. This study presents a first implementation of an RBAMA and demonstrates the potential of RBAMAs in initial experiments.
[AI-80] Dr. Boot: Bootstrapping Program Synthesis Language Models to Perform Repairing
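【Quick Read】: This paper addresses two problems with language models for program synthesis: training data that is limited in size and quality, and a synthesis process misaligned with how humans program. Existing models typically produce code in one go, whereas human programmers iterate with compiler feedback. The key to the solution is a bootstrapping algorithm that teaches models how to repair faulty code, mimicking this iterative workflow. Experiments show that bootstrapping outperforms regular fine-tuning and that the bootstrapped model performs on par with fine-tuned models that are 68% larger. The paper also finds issues with the example test cases in the training portion of the APPS dataset, which may affect repair-based and reinforcement learning methods.
Link: https://arxiv.org/abs/2507.15889
Authors: Noah van der Vleuten
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Master’s thesis, University of Amsterdam, 2023 ( this https URL ). Code and experiments available at: this https URL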
Abstract:Language models for program synthesis are usually trained and evaluated on programming competition datasets (MBPP, APPS). However, these datasets are limited in size and quality, while these language models are extremely data hungry. Additionally, the language models have a misaligned program synthesis process compared to humans. While humans iteratively develop code with the help of a compiler, most program synthesis models currently produce code in one go. To solve these issues, we introduce a bootstrapping algorithm for program synthesis, that supports teaching models how to repair. We show that bootstrapping consistently outperforms regular fine-tuning. Compared to other work, our bootstrapped model performs on par with fine-tuned models that are 68% larger. Notably, bootstrapping with repairing also improves non-repairing performance compared to regular bootstrapping during inference. However, on our models, repairing during inference is likely inferior to simply sampling the same number of solutions. Furthermore, we find that there are issues with the example test cases in the training portion of the APPS dataset that are valuable to the community, as many repairing and reinforcement learning methods rely on them.
[AI-81] Combining Cost-Constrained Runtime Monitors for AI Safety
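【Quick Read】: This paper addresses how to combine multiple runtime monitors efficiently so as to maximize the recall of safety interventions on misaligned outputs while respecting an average-case cost budget. The central challenge is trading monitoring cost against intervention effectiveness, especially when resources are limited. The key to the solution is a likelihood-ratio-based protocol design algorithm: it exhaustively searches over when and which monitors to call and allocates safety interventions according to the Neyman-Pearson lemma. In a code review setting the method more than doubles recall relative to a naive baseline, and combining two monitors can Pareto dominate using either monitor alone.
Link: https://arxiv.org/abs/2507.15886
Authors: Tim Tian Hua, James Baskerville, Henri Lemoine, Mia Hopman, Aryan Bhatt, Tyler Tracy
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: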
Abstract:Monitoring AIs at runtime can help us detect and stop harmful actions. In this paper, we study how to combine multiple runtime monitors into a single monitoring protocol. The protocol’s objective is to maximize the probability of applying a safety intervention on misaligned outputs (i.e., maximize recall). Since running monitors and applying safety interventions are costly, the protocol also needs to adhere to an average-case budget constraint. Taking the monitors’ performance and cost as given, we develop an algorithm to find the most efficient protocol. The algorithm exhaustively searches over when and which monitors to call, and allocates safety interventions based on the Neyman-Pearson lemma. By focusing on likelihood ratios and strategically trading off spending on monitors against spending on interventions, we more than double our recall rate compared to a naive baseline in a code review setting. We also show that combining two monitors can Pareto dominate using either monitor alone. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior in cost-sensitive settings.
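The role of the Neyman-Pearson lemma can be sketched on a single monitor with discretized scores. The example below is a simplification under stated assumptions, not the paper's protocol: it ignores monitor costs and the search over which monitors to call, uses made-up score distributions, and omits the randomization at the threshold that the full lemma allows. It flags score bins in order of decreasing likelihood ratio until a budget on interventions applied to aligned outputs is used up.

```python
def neyman_pearson_bins(p_misaligned, p_aligned, fpr_budget):
    """Pick score bins to intervene on, highest likelihood ratio first.

    p_misaligned[i] / p_aligned[i] is the probability that the monitor score
    falls in bin i for misaligned / aligned outputs (all p_aligned[i] > 0).
    fpr_budget caps the probability of intervening on an aligned output.
    """
    order = sorted(
        range(len(p_aligned)),
        key=lambda i: p_misaligned[i] / p_aligned[i],
        reverse=True,
    )
    flagged, used, recall = [], 0.0, 0.0
    for i in order:
        if used + p_aligned[i] > fpr_budget + 1e-12:
            break  # the next bin would exceed the budget
        flagged.append(i)
        used += p_aligned[i]
        recall += p_misaligned[i]
    return flagged, recall

# Hypothetical distributions over four monitor-score bins (low to high suspicion).
p_misaligned = [0.1, 0.2, 0.3, 0.4]
p_aligned = [0.6, 0.3, 0.08, 0.02]

flagged, recall = neyman_pearson_bins(p_misaligned, p_aligned, fpr_budget=0.10)
print(flagged, recall)
```

Ranking by likelihood ratio rather than by raw score matters once several monitors are combined, because the joint score space has no natural ordering while the likelihood ratio always does.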
[AI-82] ADEPTS: A Capability Framework for Human-Centered Agent Design
【Quick Read】: This paper addresses the lack of a unified, user-facing capability framework for AI agent development: current guidance is scattered across UX heuristics, engineering taxonomies, and ethics checklists, and gives teams no concise definition of what an agent should fundamentally be able to do. The key to the solution is ADEPTS, a framework based on six principles for human-centered agent design that express the minimal, user-facing capabilities an agent should demonstrate to be understandable, controllable, and trustworthy in everyday use. ADEPTS sits at the interface between technical and experience development and aims to give researchers, designers, engineers, and policy reviewers consistent, actionable guidance.
Link: https://arxiv.org/abs/2507.15885
Authors: Pierluca D’Oro, Caley Drooff, Joy Chen, Joseph Tighe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:
Abstract:Large language models have paved the way to powerful and flexible AI agents, assisting humans by increasingly integrating into their daily life. This flexibility, potential, and growing adoption demand a holistic and cross-disciplinary approach to developing, monitoring and discussing the capabilities required for agent-driven user experiences. However, current guidance on human-centered AI agent development is scattered: UX heuristics focus on interface behaviors, engineering taxonomies describe internal pipelines, and ethics checklists address high-level governance. There is no concise, user-facing vocabulary that tells teams what an agent should fundamentally be able to do. We introduce ADEPTS, a capability framework defining a set of core user-facing capabilities to provide unified guidance around the development of AI agents. ADEPTS is based on six principles for human-centered agent design that express the minimal, user-facing capabilities an AI agent should demonstrate to be understandable, controllable and trustworthy in everyday use. ADEPTS complements existing frameworks and taxonomies; unlike them, it sits at the interface between technical and experience development. By presenting ADEPTS, we aim to condense complex AI-UX requirements into a compact framework that is actionable guidance for AI researchers, designers, engineers, and policy reviewers alike. We believe ADEPTS has the potential of accelerating the improvement of user-relevant agent capabilities, of easing the design of experiences that take advantage of those capabilities, and of providing a shared language to track and discuss progress around the development of AI agents.
[AI-83] The Recursive Coherence Principle: A Formal Constraint on Scalable Intelligence Alignment and Reasoning Architecture
【Quick Read】: This paper addresses the scalability problem that arises when complex intelligent systems (biological, artificial, or collective) lose structural coherence during recursive reasoning, that is, how to preserve semantic consistency and reasoning stability as a system grows. Its core proposal is the Recursive Coherence Principle (RCP), together with the Functional Model of Intelligence (FMI), defined as the only known architecture capable of satisfying the RCP. The FMI comprises internal functions (evaluation, modeling, adaptation, stability, decomposition, bridging) and external functions (storage, recall, System 1 and System 2 reasoning) that preserve semantic structure across inference and coordination layers. The paper argues that misalignment, hallucination, and instability in current AI are symptoms of a loss of recursive coherence, and therefore advocates a shift from behavioral constraints to structural coherence.
Link: https://arxiv.org/abs/2507.15880
Authors: Andy E. Williams
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:
Abstract:Intelligence, whether biological, artificial, or collective, requires structural coherence across recursive reasoning processes to scale effectively. As complex systems grow, coherence becomes fragile unless a higher-order structure ensures semantic consistency. This paper introduces the Recursive Coherence Principle (RCP): a foundational constraint stating that for any reasoning system of order N, composed of systems operating over conceptual spaces of order N-1, semantic coherence is preserved only by a recursively evaluable generalization operator that spans and aligns those lower-order conceptual spaces. Crucially, this coherence enables structural alignment. Without recursive coherence, no system can reliably preserve goals, meanings, or reasoning consistency at scale. We formally define the Functional Model of Intelligence (FMI) as the only known operator capable of satisfying the RCP at any scale. The FMI is a minimal, composable architecture with internal functions (evaluation, modeling, adaptation, stability, decomposition, bridging) and external functions (storage, recall, System 1 and System 2 reasoning) vital for preserving semantic structure across inference and coordination layers. We prove that any system lacking the FMI will experience recursive coherence breakdown as it scales, arguing that common AI issues like misalignment, hallucination, and instability are symptoms of this structural coherence loss. Unlike other foundational principles, RCP uniquely captures the internal, recursive dynamics needed for coherent, alignable intelligence, modeling semantic coherence under recursion. This work significantly impacts AI alignment, advocating a shift from behavioral constraints to structural coherence, and offers a pathway for safely generalizable, robustly coherent AI at scale.
[AI-84] Out-of-Distribution Generalization in the ARC-AGI Domain: Comparing Execution-Guided Neural Program Synthesis and Test-Time Fine-Tuning
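【Quick Read】: This paper addresses the lack of compositional generalization in generative AI on open-world problem domains, specifically ARC-AGI, a domain designed so that out-of-distribution generalization is essential for success. Its core solution is execution-guided neural program synthesis, which executes candidate programs during inference to guide the search and thereby improves the model's ability to compose novel solutions. Experiments show that it generalizes compositionally better than test-time fine-tuning (TTFT), and that the success of TTFT lies mainly in eliciting in-distribution knowledge that the large language model (LLM) otherwise fails to rely on directly.
Link: https://arxiv.org/abs/2507.15877
Authors: Simon Ouellette
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: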
Abstract:We run a controlled compositional generalization experiment in the ARC-AGI domain: an open-world problem domain in which the ability to generalize out-of-distribution is, by design, an essential characteristic for success. We compare neural program synthesis and test-time fine-tuning approaches on this experiment. We find that execution-guided neural program synthesis outperforms all reference algorithms in its ability to compose novel solutions. Our empirical findings also suggest that the success of TTFT on ARC-AGI lies mainly in eliciting in-distribution knowledge that the LLM otherwise fails to rely on directly.
[AI-85] Re-evaluating Short- and Long-Term Trend Factors in CTA Replication: A Bayesian Graphical Approach
【Quick Read】: This paper addresses the unresolved question of the relative merits and interactions of short-term and long-term trend systems in the trend-following strategies of Commodity Trading Advisors (CTAs). The key to the solution is a Bayesian graphical model that dynamically decomposes CTA returns into short-term trend, long-term trend, and market beta factors, and then shows how the blend of horizons shapes the strategy's risk-adjusted performance.
Link: https://arxiv.org/abs/2507.15876
Authors: Eric Benhamou, Jean-Jacques Ohana, Alban Etienne, Béatrice Guez, Ethan Setrouk, Thomas Jacquot
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Pricing of Securities (q-fin.PR); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
Comments: 13 pages
Abstract:Commodity Trading Advisors (CTAs) have historically relied on trend-following rules that operate on vastly different horizons, from long-term breakouts that capture major directional moves to short-term momentum signals that thrive in fast-moving markets. Despite a large body of work on trend following, the relative merits and interactions of short- versus long-term trend systems remain controversial. This paper adds to the debate by (i) dynamically decomposing CTA returns into short-term trend, long-term trend and market beta factors using a Bayesian graphical model, and (ii) showing how the blend of horizons shapes the strategy’s risk-adjusted performance.
[AI-86] Differential Multimodal Transformers
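【Quick Read】: This paper addresses the noise that additional modalities introduce into multimodal models such as PaliGemma, which have limited context windows, and the hallucinations that result. The core difficulty is that Transformer attention tends to focus disproportionately on irrelevant context, which lowers question-answering accuracy. The key to the solution is to extend the Differential Attention mechanism, originally designed for text-only models, to a text-vision model, and to fine-tune PaliGemma 3B with LoRA (Low-Rank Adaptation) so that the model focuses on relevant context, suppresses noisy information, and improves task performance.
Link: https://arxiv.org/abs/2507.15875
Authors: Jerry Li, Timothy Oh, Joseph Hoang, Vardhit Veeramachaneni
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: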
Abstract:Small language models have gained significant popularity due to their efficiency and growing capabilities. However, incorporating additional modalities, such as vision, can exacerbate the challenge of limited context windows by introducing noise. Recent studies have highlighted that Transformer attention mechanisms often disproportionately focus on irrelevant contexts. In this work, we extend the Differential Attention mechanism, originally designed for text-only models, to the text-vision model PaliGemma. Our aim is to evaluate its ability to mitigate noisy information retrieval and reduce hallucinations. To this end, we fine-tuned the PaliGemma 3B model using LoRA, incorporating Differential Attention, and experimented with various parameter settings and configurations. We demonstrate that Differential Attention can be adapted and integrated into the fine-tuning of existing models to enhance noisy information retrieval and question-answering capabilities.
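For readers unfamiliar with Differential Attention, the core operation is commonly written as the difference of two softmax attention maps, scaled by a factor lambda, applied to shared values. The pure-Python sketch below illustrates only that subtraction for a single head. It is not the authors' PaliGemma integration: the learnable reparameterization of lambda, the per-head normalization, and LoRA are all omitted, and the toy inputs are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    """Scaled dot-product attention weights, one row per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]

def differential_attention(q1, k1, q2, k2, values, lam):
    """(softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)) V for a single head."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    out = []
    for row1, row2 in zip(a1, a2):
        weights = [w1 - lam * w2 for w1, w2 in zip(row1, row2)]
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out

# One query attending over two tokens; the second map is uniform "noise".
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [0.0]]
out_plain = differential_attention([[1.0, 0.0]], keys, [[0.0, 0.0]], keys, values, lam=0.0)
out_diff = differential_attention([[1.0, 0.0]], keys, [[0.0, 0.0]], keys, values, lam=0.5)
print(out_plain, out_diff)
```

Setting `lam=0.0` recovers ordinary softmax attention, which makes the sketch easy to compare against a standard attention layer; the subtraction is intended to cancel attention mass that both maps assign to irrelevant tokens.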
[AI-87] Purchase and Production Optimization in a Meat Processing Plant
【Quick Read】: This paper studies an optimization problem in raw material purchasing and subsequent processing at a meat processing company, with the aim of using input materials efficiently to raise profit. Unlike most work, which concentrates on supply chain management, it focuses purely on the production stage and considers alternative processing paths, stock with different expiration dates, and constraints often neglected in the literature: the minimum order quantity and the minimum percentage in alternatives. The authors show that each of these two constraints makes the problem NP-hard, and design a simple iterative approach based on Integer Linear Programming (ILP) that finds the optimum with an open-source solver, typically within a few seconds, while mitigating numerical issues caused by the wide range of data values.
Link: https://arxiv.org/abs/2507.15866
Authors: Marek Vlk, Premysl Sucha, Jaroslaw Rudy, Radoslaw Idzikowski
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures
Abstract:The food production industry, especially the meat production sector, faces many challenges that have even escalated due to the recent outbreak of the energy crisis in the European Union. Therefore, efficient use of input materials is an essential aspect affecting the profit of such companies. This paper addresses an optimization problem concerning the purchase and subsequent material processing we solved for a meat processing company. Unlike the majority of existing papers, we do not concentrate on how this problem concerns supply chain management, but we focus purely on the production stage. The problem involves the concept of alternative ways of material processing, stock of material with different expiration dates, and extra constraints widely neglected in the current literature, namely, the minimum order quantity and the minimum percentage in alternatives. We prove that each of these two constraints makes the problem $\mathcal{NP}$-hard, and hence we design a simple iterative approach based on integer linear programming that allows us to solve real-life instances even using an open-source integer linear programming solver. Another advantage of this approach is that it mitigates numerical issues, caused by the extensive range of data values, we experienced with a commercial solver. The results obtained using real data from the meat processing company showed that our algorithm can find the optimum solution in a few seconds for all considered use cases.
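A minimum order quantity is usually expressed in integer linear programming with a binary indicator variable. The formulation below is the standard textbook encoding and not necessarily the one used in the paper; the symbols $x_i$ (amount of material $i$ purchased), $y_i$ (whether material $i$ is ordered at all), $q_i^{\min}$ (minimum order quantity) and $U_i$ (an upper bound on the order) are assumptions introduced for illustration.

```latex
q_i^{\min}\, y_i \;\le\; x_i \;\le\; U_i\, y_i, \qquad y_i \in \{0, 1\}
```

With $y_i = 0$ the order is forced to $x_i = 0$, and with $y_i = 1$ it must lie in $[q_i^{\min}, U_i]$. This either-or structure is what introduces integrality into an otherwise linear purchasing model.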
[AI-88] From Reasoning to Super-Intelligence: A Search-Theoretic Perspective
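【Quick Read】: This paper addresses the poor performance of current Chain-of-Thought (CoT) learning on complex reasoning tasks, whose core obstacles are distribution drift, lack of embedded search, and exponential inference costs. The key to the solution is a new learning paradigm, the Diligent Learner, which models reasoning as a depth-first search guided by a validator and supports backtracking upon failure. Under two mild and realistic assumptions the paper proves that it can learn efficiently from CoT data, whereas existing methods (such as supervised fine-tuning, reinforcement learning, and Tree-of-Thoughts) cannot. The framework offers a path toward Large Reasoning Models (LRMs) trained on naturally occurring, incomplete data.
Link: https://arxiv.org/abs/2507.15865
Authors: Shai Shalev-Shwartz, Amnon Shashua
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: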
Abstract:Chain-of-Thought (CoT) reasoning has emerged as a powerful tool for enhancing the problem-solving capabilities of large language models (LLMs). However, the theoretical foundations of learning from CoT data remain underdeveloped, and existing approaches – such as Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Tree-of-Thoughts (ToT), and Monte Carlo Tree Search (MCTS) – often fail on complex reasoning tasks. In this work, we identify core obstacles that hinder effective CoT learning, including distribution drift, lack of embedded search, and exponential inference costs. We introduce the Diligent Learner, a new learning paradigm that explicitly models reasoning as a depth-first search guided by a validator and supports backtracking upon failure. Under two mild and realistic assumptions, we prove that the Diligent Learner can efficiently learn from CoT data while existing methods fail to do so. This framework offers a path toward building scalable and reliable reasoning systems trained on naturally occurring, incomplete data – paving the way for the development of Large Reasoning Models (LRMs) with robust, interpretable problem-solving abilities.
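The search pattern described here, depth-first exploration in which a validator prunes bad steps and failures trigger backtracking, can be sketched on a toy arithmetic puzzle. This is only an illustration of the control flow: in the paper the step proposer is a learned model, whereas the fixed `propose` function, the puzzle, and all names below are hypothetical.

```python
def depth_first_reasoning(state, propose, validate, is_goal, max_depth, path=()):
    """Depth-first search over reasoning steps with validation and backtracking.

    propose(state) yields (label, next_state) candidates in preference order;
    validate(next_state) rejects steps that cannot lead to a valid solution.
    Returns the list of step labels reaching a goal, or None.
    """
    if is_goal(state):
        return list(path)
    if len(path) >= max_depth:
        return None
    for label, next_state in propose(state):
        if not validate(next_state):
            continue  # the validator rejects this step outright
        result = depth_first_reasoning(
            next_state, propose, validate, is_goal, max_depth, path + (label,)
        )
        if result is not None:
            return result
        # dead end below this step: backtrack and try the next candidate
    return None

# Toy task: reach 11 from 1 using "+3" and "*2", never exceeding the target.
TARGET = 11
steps = depth_first_reasoning(
    1,
    propose=lambda x: [("+3", x + 3), ("*2", x * 2)],
    validate=lambda x: x <= TARGET,
    is_goal=lambda x: x == TARGET,
    max_depth=4,
)
print(steps)
```

In this run the search first follows the "+3" branch to 10, finds that every continuation is rejected, and backtracks to an earlier state before succeeding, which is the behavior that a purely left-to-right generator without search cannot reproduce.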
[AI-89] Decentralized AI-driven IoT Architecture for Privacy-Preserving and Latency-Optimized Healthcare in Pandemic and Critical Care Scenarios
【Quick Read】: This paper addresses the data privacy, latency, and security problems of traditional centralized healthcare architectures, particularly for real-time patient monitoring in pandemic and critical care scenarios. The key to the solution is an AI-driven decentralized IoT architecture that combines federated learning, blockchain, and edge computing to protect data privacy while reducing latency and improving overall system performance. Experimental results indicate that the approach outperforms mainstream cloud solutions in transaction latency, energy consumption, and data throughput.
Link: https://arxiv.org/abs/2507.15859
Authors: Harsha Sammangi (Dakota State University), Aditya Jagatha (College of Business and Information Systems, Dakota State University), Giridhar Reddy Bojja (College of Business, Michigan Technological University), Jun Liu (College of Business and I.S, Dakota State University)
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 10 Pages
Abstract:AI Innovations in the IoT for Real-Time Patient Monitoring. The current traditional centralized healthcare architecture poses numerous issues, including data privacy, delay, and security. Here, we present an AI-enabled decentralized IoT architecture that can address such challenges in pandemic and critical care settings. This work presents our architecture to enhance the effectiveness of the currently available federated learning, blockchain, and edge computing approaches, maximizing data privacy, minimizing latency, and improving other general system metrics. Experimental results demonstrate transaction latency, energy consumption, and data throughput orders of magnitude lower than competitive cloud solutions.
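[AI-90] Decoding Translation-Related Functional Sequences in 5’UTRs Using Interpretable Deep Learning Models
[Quick Read]: This paper addresses the difficulty of modeling how 5’ untranslated regions (5’UTRs) regulate mRNA translation, in particular that existing deep learning models are limited by fixed input lengths and poor interpretability and therefore struggle to predict translational efficiency accurately. The key to the solution is UTR-STCNet, a new Transformer-based model with two core innovations: 1) a Saliency-Aware Token Clustering (SATC) module that iteratively aggregates nucleotide tokens into multi-scale semantic units based on saliency scores, enabling flexible modeling of variable-length 5’UTRs; 2) a Saliency-Guided Transformer (SGT) block that uses a lightweight attention mechanism to capture local and distal regulatory dependencies, improving performance and biological interpretability without added computational cost. The method outperforms state-of-the-art baselines on multiple benchmark datasets and recovers known functional elements such as upstream AUGs and Kozak motifs, offering insight into the mechanisms of translation regulation.
Link: https://arxiv.org/abs/2507.16801
Authors: Yuxi Lin, Yaxue Fang, Zehong Zhang, Zhouwu Liu, Siyun Zhong, Fulong Yu
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Comments: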
Abstract:Understanding how 5’ untranslated regions (5’UTRs) regulate mRNA translation is critical for controlling protein expression and designing effective therapeutic mRNAs. While recent deep learning models have shown promise in predicting translational efficiency from 5’UTR sequences, most are constrained by fixed input lengths and limited interpretability. We introduce UTR-STCNet, a Transformer-based architecture for flexible and biologically grounded modeling of variable-length 5’UTRs. UTR-STCNet integrates a Saliency-Aware Token Clustering (SATC) module that iteratively aggregates nucleotide tokens into multi-scale, semantically meaningful units based on saliency scores. A Saliency-Guided Transformer (SGT) block then captures both local and distal regulatory dependencies using a lightweight attention mechanism. This combined architecture achieves efficient and interpretable modeling without input truncation or increased computational cost. Evaluated across three benchmark datasets, UTR-STCNet consistently outperforms state-of-the-art baselines in predicting mean ribosome load (MRL), a key proxy for translational efficiency. Moreover, the model recovers known functional elements such as upstream AUGs and Kozak motifs, highlighting its potential for mechanistic insight into translation regulation.
[AI-91] Estimating Treatment Effects with Independent Component Analysis
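[Quick Read]: This paper addresses how to estimate treatment effects more accurately and efficiently in the partially linear regression (PLR) model, especially in the presence of nuisance variables. Traditional causal inference methods often rely on strong assumptions or are sample-inefficient. The paper instead brings in Independent Component Analysis (ICA), a method from identifiability theory, and applies it to causal effect estimation in the PLR model with non-Gaussianity as the key assumption. Its core contribution is the first theoretical and empirical account of the connection between ICA and causal effect estimation, together with the finding that linear ICA can accurately estimate multiple treatment effects even with Gaussian confounders or nonlinear nuisance, thereby improving the consistency and robustness of estimation.
Link: https://arxiv.org/abs/2507.16467
Authors: Patrik Reizinger, Lester Mackey, Wieland Brendel, Rahul Krishnan
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: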
Abstract:The field of causal inference has developed a variety of methods to accurately estimate treatment effects in the presence of nuisance. Meanwhile, the field of identifiability theory has developed methods like Independent Component Analysis (ICA) to identify latent sources and mixing weights from data. While these two research communities have developed largely independently, they aim to achieve similar goals: the accurate and sample-efficient estimation of model parameters. In the partially linear regression (PLR) setting, Mackey et al. (2018) recently found that estimation consistency can be improved with non-Gaussian treatment noise. Non-Gaussianity is also a crucial assumption for identifying latent factors in ICA. We provide the first theoretical and empirical insights into this connection, showing that ICA can be used for causal effect estimation in the PLR model. Surprisingly, we find that linear ICA can accurately estimate multiple treatment effects even in the presence of Gaussian confounders or nonlinear nuisance.
[AI-92] Predictive Hydrodynamic Simulations for Laser Direct-drive Implosion Experiments via Artificial Intelligence
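[Quick Read]: This paper addresses the difficulty of predicting the complex physics in laser-driven implosion experiments, in particular the implosion dynamics of the Double-Cone Ignition (DCI) scheme, which are hard to simulate accurately. The key to the solution is MULTI-Net, a Transformer-based deep learning model that predicts implosion features from input parameters such as laser waveforms and target radius, together with a Physics-Informed Decoder (PID) for high-dimensional sampling that significantly reduces prediction errors compared with Latin hypercube sampling. The framework predicts the implosion dynamics measured by the x-ray streak camera in DCI experiments on the SG-II Upgrade facility, supporting the value of data-driven AI methods for improving the predictive ability of laser fusion simulations.
Link: https://arxiv.org/abs/2507.16227
Authors: Zixu Wang, Yuhan Wang, Junfei Ma, Fuyuan Wu, Junchi Yan, Xiaohui Yuan, Zhe Zhang, Jie Zhang
Affiliations: Unknown
Categories: Plasma Physics (physics.plasm-ph); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures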
Abstract:This work presents predictive hydrodynamic simulations empowered by artificial intelligence (AI) for laser driven implosion experiments, taking the double-cone ignition (DCI) scheme as an example. A Transformer-based deep learning model MULTI-Net is established to predict implosion features according to laser waveforms and target radius. A Physics-Informed Decoder (PID) is proposed for high-dimensional sampling, significantly reducing the prediction errors compared to Latin hypercube sampling. Applied to DCI experiments conducted on the SG-II Upgrade facility, the MULTI-Net model is able to predict the implosion dynamics measured by the x-ray streak camera. It is found that an effective laser absorption factor about 65% is suitable for the one-dimensional simulations of the DCI-R10 experiments. For shot 33, the mean implosion velocity and collided plasma density reached 195 km/s and 117 g/cc, respectively. This study demonstrates a data-driven AI framework that enhances the prediction ability of simulations for complicated laser fusion experiments.
[AI-93] Bayesian Deep Learning for Convective Initiation Nowcasting Uncertainty Estimation
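[Quick Read]: This paper addresses the insufficient uncertainty quantification in convective initiation (CI) nowcasting, in particular how to make deep learning forecasts probabilistic and well calibrated. The key to the solution is Bayesian deep learning: a probabilistic forecasting framework built on a residual neural network (ResNet) that represents forecast uncertainty and separates forecasts with large and small errors. The initial-weights ensemble + Monte Carlo (MC) dropout method performs best, because combining several deterministic ResNets trained from different initial weights with dropout kept active at inference samples the hypothesis space more thoroughly and yields more skillful, better-calibrated probabilistic forecasts. To counter the performance loss attributed to the difficulty of optimizing a larger number of parameters, the Bayesian-MOPED method is further adopted to constrain the hypothesis search near the deterministic ResNet solution, which improves forecast skill.
Link: https://arxiv.org/abs/2507.16219
Authors: Da Fan, David John Gagne II, Steven J. Greybush, Eugene E. Clothiaux, John S. Schreck, Chaopeng Shen
Affiliations: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments: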
Abstract:This study evaluated the probability and uncertainty forecasts of five recently proposed Bayesian deep learning methods relative to a deterministic residual neural network (ResNet) baseline for 0-1 h convective initiation (CI) nowcasting using GOES-16 satellite infrared observations. Uncertainty was assessed by how well probabilistic forecasts were calibrated and how well uncertainty separated forecasts with large and small errors. Most of the Bayesian deep learning methods produced probabilistic forecasts that outperformed the deterministic ResNet, with one, the initial-weights ensemble + Monte Carlo (MC) dropout, an ensemble of deterministic ResNets with different initial weights to start training and dropout activated during inference, producing the most skillful and well-calibrated forecasts. The initial-weights ensemble + MC dropout benefited from generating multiple solutions that more thoroughly sampled the hypothesis space. The Bayesian ResNet ensemble was the only one that performed worse than the deterministic ResNet at longer lead times, likely due to the challenge of optimizing a larger number of parameters. To address this issue, the Bayesian-MOPED (MOdel Priors with Empirical Bayes using Deep neural network) ResNet ensemble was adopted, and it enhanced forecast skill by constraining the hypothesis search near the deterministic ResNet hypothesis. All Bayesian methods demonstrated well-calibrated uncertainty and effectively separated cases with large and small errors. In case studies, the initial-weights ensemble + MC dropout demonstrated better forecast skill than the Bayesian-MOPED ensemble and the deterministic ResNet on selected CI events in clear-sky regions. However, the initial-weights ensemble + MC dropout exhibited poorer generalization in clear-sky and anvil cloud regions without CI occurrence compared to the deterministic ResNet and Bayesian-MOPED ensemble.
[AI-94] AutoMAT: A Hierarchical Framework for Autonomous Alloy Discovery
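[Quick Read]: This paper addresses two core challenges in alloy design: the vast compositional design space, which makes exploration difficult, and the high cost of experimental validation, which limits efficient iteration. The key to the solution is AutoMAT, a hierarchical autonomous framework that integrates large language models (LLMs), automated simulations based on CALPHAD (CALculation of PHAse Diagrams), and AI-driven search to automate the pipeline from ideation to experimental validation. The framework achieves high efficiency, accuracy, and interpretability without manually curated large datasets. In two case studies it identifies a lightweight, high-strength titanium alloy and improves the yield strength of a high-entropy alloy, shortening the discovery timeline from years to weeks and showing its potential as a platform for next-generation alloy design.
Link: https://arxiv.org/abs/2507.16005
Authors: Penghui Yang, Chendong Zhao, Bijun Tang, Zhonghan Zhang, Xinrun Wang, Yanchen Deng, Yuhao Lu, Cuntai Guan, Zheng Liu, Bo An
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: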
Abstract:Alloy discovery is central to advancing modern industry but remains hindered by the vastness of compositional design space and the costly validation. Here, we present AutoMAT, a hierarchical and autonomous framework grounded in and validated by experiments, which integrates large language models, automated CALPHAD-based simulations, and AI-driven search to accelerate alloy design. Spanning the entire pipeline from ideation to validation, AutoMAT achieves high efficiency, accuracy, and interpretability without the need for manually curated large datasets. In a case study targeting a lightweight, high-strength alloy, AutoMAT identifies a titanium alloy with 8.1% lower density and comparable yield strength relative to the state-of-the-art reference, achieving the highest specific strength among all comparisons. In a second case targeting high-yield-strength high-entropy alloys, AutoMAT achieves a 28.2% improvement in yield strength over the base alloy. In both cases, AutoMAT reduces the discovery timeline from years to weeks, illustrating its potential as a scalable and versatile platform for next-generation alloy design.
[AI-95] A Generative Model for Disentangling Galaxy Photometric Parameters
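[Quick Read]: This paper addresses the difficulty of extracting morphological parameters from the massive volumes of galaxy images produced by large photometric surveys, where traditional parametric light-profile fitting becomes computationally prohibitive for billions of sources. The key to the solution is a Conditional AutoEncoder (CAE) framework trained on diverse, realistic mock galaxy images generated with GalSim. It encodes each galaxy image into a low-dimensional latent representation conditioned on key morphological parameters (such as flux, half-light radius, Sersic index, and ellipticity), recovering these structural features in a disentangled manner while reconstructing the image, which maintains accuracy while greatly improving computational efficiency.
Link: https://arxiv.org/abs/2507.15898
Authors: Keen Leung, Colen Yan, Jun Yin
Affiliations: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures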
Abstract:Ongoing and future photometric surveys will produce unprecedented volumes of galaxy images, necessitating robust, efficient methods for deriving galaxy morphological parameters at scale. Traditional approaches, such as parametric light-profile fitting, offer valuable insights but become computationally prohibitive when applied to billions of sources. In this work, we propose a Conditional AutoEncoder (CAE) framework to simultaneously model and characterize galaxy morphology. Our CAE is trained on a suite of realistic mock galaxy images generated via GalSim, encompassing a broad range of galaxy types, photometric parameters (e.g., flux, half-light radius, Sersic index, ellipticity), and observational conditions. By encoding each galaxy image into a low-dimensional latent representation conditioned on key parameters, our model effectively recovers these morphological features in a disentangled manner, while also reconstructing the original image. The results demonstrate that the CAE approach can accurately and efficiently infer complex structural properties, offering a powerful alternative to existing methods.
Machine Learning
[LG-0] A Partitioned Sparse Variational Gaussian Process for Fast Distributed Spatial Modeling
Link: https://arxiv.org/abs/2507.16771
Authors: Michael Grosskopf, Kellin Rumsey, Ayan Biswas, Earl Lawrence
Categories: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments:
Abstract:The next generation of Department of Energy supercomputers will be capable of exascale computation. For these machines, far more computation will be possible than that which can be saved to disk. As a result, users will be unable to rely on post-hoc access to data for uncertainty quantification and other statistical analyses and there will be an urgent need for sophisticated machine learning algorithms which can be trained in situ. Algorithms deployed in this setting must be highly scalable, memory efficient and capable of handling data which is distributed across nodes as spatially contiguous partitions. One suitable approach involves fitting a sparse variational Gaussian process (SVGP) model independently and in parallel to each spatial partition. The resulting model is scalable, efficient and generally accurate, but produces the undesirable effect of constructing discontinuous response surfaces due to the disagreement between neighboring models at their shared boundary. In this paper, we extend this idea by allowing for a small amount of communication between neighboring spatial partitions which encourages better alignment of the local models, leading to smoother spatial predictions and a better fit in general. Due to our decentralized communication scheme, the proposed extension remains highly scalable and adds very little overhead in terms of computation (and none, in terms of memory). We demonstrate this Partitioned SVGP (PSVGP) approach for the Energy Exascale Earth System Model (E3SM) and compare the results to the independent SVGP case.
[LG-1] Improving Model Classification by Optimizing the Training Dataset
Link: https://arxiv.org/abs/2507.16729
Authors: Morad Tukan, Loay Mualem, Eitan Netzer, Liran Sigalat
Categories: Machine Learning (cs.LG)
Comments:
Abstract:In the era of data-centric AI, the ability to curate high-quality training data is as crucial as model design. Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets through importance sampling. However, conventional sensitivity-based coreset construction often falls short in optimizing for classification performance metrics, e.g., F1 score, focusing instead on loss approximation. In this work, we present a systematic framework for tuning the coreset generation process to enhance downstream classification quality. Our method introduces new tunable parameters–including deterministic sampling, class-wise allocation, and refinement via active sampling, beyond traditional sensitivity scores. Through extensive experiments on diverse datasets and classifiers, we demonstrate that tuned coresets can significantly outperform both vanilla coresets and full dataset training on key classification metrics, offering an effective path towards better and more efficient model training.
[LG-2] Multi-objective Portfolio Optimization Via Gradient Descent
Link: https://arxiv.org/abs/2507.16717
Authors: Christian Oliva, Pedro R. Ventura, Luis F. Lago-Fernández
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments:
Abstract:Traditional approaches to portfolio optimization, often rooted in Modern Portfolio Theory and solved via quadratic programming or evolutionary algorithms, struggle with scalability or flexibility, especially in scenarios involving complex constraints, large datasets and/or multiple conflicting objectives. To address these challenges, we introduce a benchmark framework for multi-objective portfolio optimization (MPO) using gradient descent with automatic differentiation. Our method supports any optimization objective, such as minimizing risk measures (e.g., CVaR) or maximizing Sharpe ratio, along with realistic constraints, such as tracking error limits, UCITS regulations, or asset group restrictions. We have evaluated our framework across six experimental scenarios, from single-objective setups to complex multi-objective cases, and have compared its performance against standard solvers like CVXPY and SKFOLIO. Our results show that our method achieves competitive performance while offering enhanced flexibility for modeling multiple objectives and constraints. We aim to provide a practical and extensible tool for researchers and practitioners exploring advanced portfolio optimization problems in real-world conditions.
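To make the idea of portfolio optimization by gradient descent concrete, here is a small sketch under stated assumptions. It is not the authors' framework: it uses a single mean-variance objective, a hand-derived gradient instead of automatic differentiation, and a softmax parameterization (an assumption of this sketch) to keep weights long-only and summing to one. The data `MU` and `COV` and all function names are made up for illustration.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Map unconstrained scores to positive weights that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def objective(w: List[float], mu: List[float], cov: List[List[float]],
              risk_aversion: float = 1.0) -> float:
    """Mean-variance loss: risk_aversion * variance - expected return."""
    n = len(w)
    variance = sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))
    expected_return = sum(wi * mi for wi, mi in zip(w, mu))
    return risk_aversion * variance - expected_return


def optimize(mu: List[float], cov: List[List[float]], risk_aversion: float = 1.0,
             lr: float = 0.5, steps: int = 500) -> List[float]:
    n = len(mu)
    z = [0.0] * n  # start from equal weights
    for _ in range(steps):
        w = softmax(z)
        # Gradient of the loss with respect to the weights (cov is symmetric).
        grad_w = [2.0 * risk_aversion * sum(cov[i][j] * w[j] for j in range(n)) - mu[i]
                  for i in range(n)]
        # Chain rule through softmax: dL/dz_k = w_k * (g_k - sum_i w_i * g_i).
        avg = sum(wi * gi for wi, gi in zip(w, grad_w))
        grad_z = [w[k] * (grad_w[k] - avg) for k in range(n)]
        z = [zk - lr * gk for zk, gk in zip(z, grad_z)]
    return softmax(z)


# Hypothetical expected returns and covariance matrix for three assets.
MU = [0.08, 0.12, 0.05]
COV = [[0.04, 0.01, 0.00],
       [0.01, 0.09, 0.00],
       [0.00, 0.00, 0.01]]
weights = optimize(MU, COV)
print([round(w, 3) for w in weights])
```

Additional objectives or constraints such as those named in the abstract could be added as further terms or penalties in `objective`; with an automatic differentiation library the manual gradient would no longer be needed.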
[LG-3] Latent Space Alignment for AI-Native MIMO Semantic Communications IJCNN2025
Link: https://arxiv.org/abs/2507.16680
Authors: Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Comments: Proc. of IEEE IJCNN 2025
Abstract:Semantic communications focus on prioritizing the understanding of the meaning behind transmitted data and ensuring the successful completion of tasks that motivate the exchange of information. However, when devices rely on different languages, logic, or internal representations, semantic mismatches may occur, potentially hindering mutual understanding. This paper introduces a novel approach to addressing latent space misalignment in semantic communications, exploiting multiple-input multiple-output (MIMO) communications. Specifically, our method learns a MIMO precoder/decoder pair that jointly performs latent space compression and semantic channel equalization, mitigating both semantic mismatches and physical channel impairments. We explore two solutions: (i) a linear model, optimized by solving a biconvex optimization problem via the alternating direction method of multipliers (ADMM); (ii) a neural network-based model, which learns semantic MIMO precoder/decoder under transmission power budget and complexity constraints. Numerical results demonstrate the effectiveness of the proposed approach in a goal-oriented semantic communication scenario, illustrating the main trade-offs between accuracy, communication burden, and complexity of the solutions.
[LG-4] Deep Unfolding Network for Nonlinear Multi-Frequency Electrical Impedance Tomography
Link: https://arxiv.org/abs/2507.16678
Authors: Giovanni S. Alberti, Damiana Lazzaro, Serena Morigi, Luca Ratti, Matteo Santacesaria
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Multi-frequency Electrical Impedance Tomography (mfEIT) represents a promising biomedical imaging modality that enables the estimation of tissue conductivities across a range of frequencies. Addressing this challenge, we present a novel variational network, a model-based learning paradigm that strategically merges the advantages and interpretability of classical iterative reconstruction with the power of deep learning. This approach integrates graph neural networks (GNNs) within the iterative Proximal Regularized Gauss Newton (PRGN) framework. By unrolling the PRGN algorithm, where each iteration corresponds to a network layer, we leverage the physical insights of nonlinear model fitting alongside the GNN’s capacity to capture inter-frequency correlations. Notably, the GNN architecture preserves the irregular triangular mesh structure used in the solution of the nonlinear forward model, enabling accurate reconstruction of overlapping tissue fraction concentrations.
[LG-5] Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers SOCC2025
Link: https://arxiv.org/abs/2507.16676
Authors: Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: IEEE International System-on-Chip Conference (IEEE SOCC 2025)
Abstract:Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of query, key and value matrices, of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.
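For readers unfamiliar with the traditional technique the abstract contrasts against, the sketch below shows a classic column-checksum ABFT test for a single matrix multiplication. It is not Flash-ABFT and does not cover softmax; the function names and the integer test matrices are illustrative assumptions.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_sums(m: Matrix) -> List[int]:
    return [sum(col) for col in zip(*m)]


def abft_check(a: Matrix, b: Matrix, c: Matrix) -> bool:
    # For a fault-free C = A @ B, the column sums of C must equal
    # (column sums of A) @ B, which costs one vector-matrix product.
    expected = matmul([column_sums(a)], b)[0]
    return column_sums(c) == expected


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
clean_ok = abft_check(A, B, C)

faulty = [row[:] for row in C]
faulty[0][1] += 1  # simulate a single corrupted output element
faulty_ok = abft_check(A, B, faulty)
print(clean_ok, faulty_ok)
```

Integers are used so the comparison can be exact; with floating-point hardware a tolerance would be needed. Because softmax sits between the two products in attention, this per-multiplication check cannot simply be chained across the whole layer, which is the gap the abstract says Flash-ABFT targets.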
[LG-6] GASPnet: Global Agreement to Synchronize Phases
Link: https://arxiv.org/abs/2507.16674
Authors: Andrea Alamiaa, Sabine Muzellec, Thomas Serre, Rufin VanRullen
Categories: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments:
Abstract:In recent years, Transformer architectures have revolutionized most fields of artificial intelligence, relying on an attentional mechanism based on the agreement between keys and queries to select and route information in the network. In previous work, we introduced a novel, brain-inspired architecture that leverages a similar implementation to achieve a global ‘routing by agreement’ mechanism. Such a system modulates the network’s activity by matching each neuron’s key with a single global query, pooled across the entire network. Acting as a global attentional system, this mechanism improves noise robustness over baseline levels but is insufficient for multi-classification tasks. Here, we improve on this work by proposing a novel mechanism that combines aspects of the Transformer attentional operations with a compelling neuroscience theory, namely, binding by synchrony. This theory proposes that the brain binds together features by synchronizing the temporal activity of neurons encoding those features. This allows the binding of features from the same object while efficiently disentangling those from distinct objects. We drew inspiration from this theory and incorporated angular phases into all layers of a convolutional network. After achieving phase alignment via Kuramoto dynamics, we use this approach to enhance operations between neurons with similar phases and suppresses those with opposite phases. We test the benefits of this mechanism on two datasets: one composed of pairs of digits and one composed of a combination of an MNIST item superimposed on a CIFAR-10 image. Our results reveal better accuracy than CNN networks, proving more robust to noise and with better generalization abilities. Overall, we propose a novel mechanism that addresses the visual binding problem in neural networks by leveraging the synergy between neuroscience and machine learning.
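The abstract mentions phase alignment via Kuramoto dynamics. As a rough illustration of that ingredient alone, the sketch below integrates the standard all-to-all Kuramoto model with Euler steps and tracks the usual order parameter. How GASPnet attaches phases to convolutional layers is not described here and is not modeled; the initial phases, coupling strength, and step size are arbitrary choices for the example.

```python
import cmath
import math
from typing import List


def order_parameter(phases: List[float]) -> float:
    """Magnitude of the mean unit phasor; 1.0 means full synchrony."""
    return abs(sum(cmath.exp(1j * p) for p in phases)) / len(phases)


def kuramoto_step(phases: List[float], natural_freqs: List[float],
                  coupling: float, dt: float) -> List[float]:
    """One Euler step of d(theta_i)/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i)."""
    n = len(phases)
    new_phases = []
    for i in range(n):
        pull = sum(math.sin(phases[j] - phases[i]) for j in range(n)) / n
        new_phases.append(phases[i] + dt * (natural_freqs[i] + coupling * pull))
    return new_phases


# Five oscillators with identical natural frequency and scattered start phases.
phases = [0.0, 0.5, 1.0, 2.0, 2.5]
freqs = [0.0] * len(phases)
r_before = order_parameter(phases)
for _ in range(400):
    phases = kuramoto_step(phases, freqs, coupling=2.0, dt=0.05)
r_after = order_parameter(phases)
print(round(r_before, 3), round(r_after, 3))
```

With identical natural frequencies and positive coupling the phases are pulled together and the order parameter rises toward 1. Once phases are aligned, a network could weight interactions between units by their phase difference, which matches the abstract's description of enhancing similar-phase and suppressing opposite-phase operations.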
[LG-7] Families of Optimal Transport Kernels for Cell Complexes
Link: https://arxiv.org/abs/2507.16569
Authors: Rahul Khorana
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Recent advances have discussed cell complexes as ideal learning representations. However, there is a lack of available machine learning methods suitable for learning on CW complexes. In this paper, we derive an explicit expression for the Wasserstein distance between cell complex signal distributions in terms of a Hodge-Laplacian matrix. This leads to a structurally meaningful measure to compare CW complexes and define the optimal transportation map. In order to simultaneously include both feature and structure information, we extend the Fused Gromov-Wasserstein distance to CW complexes. Finally, we introduce novel kernels over the space of probability measures on CW complexes based on the dual formulation of optimal transport.
[LG-8] Canonical Correlation Patterns for Validating Clustering of Multivariate Time Series
Link: https://arxiv.org/abs/2507.16497
Authors: Isabella Degen, Zahraa S Abdallah, Kate Robson Brown, Henry W J Reeve
Categories: Machine Learning (cs.LG); Applications (stat.AP)
Comments: 45 pages, 8 figures. Introduces canonical correlation patterns as discrete validation targets for correlation-based clustering, systematically evaluates distance functions and validity indices, and provides practical implementation guidelines through controlled experiments with synthetic ground truth data
Abstract:Clustering of multivariate time series using correlation-based methods reveals regime changes in relationships between variables across health, finance, and industrial applications. However, validating whether discovered clusters represent distinct relationships rather than arbitrary groupings remains a fundamental challenge. Existing clustering validity indices were developed for Euclidean data, and their effectiveness for correlation patterns has not been systematically evaluated. Unlike Euclidean clustering, where geometric shapes provide discrete reference targets, correlations exist in continuous space without equivalent reference patterns. We address this validation gap by introducing canonical correlation patterns as mathematically defined validation targets that discretise the infinite correlation space into finite, interpretable reference patterns. Using synthetic datasets with perfect ground truth across controlled conditions, we demonstrate that canonical patterns provide reliable validation targets, with L1 norm for mapping and L5 norm for silhouette width criterion and Davies-Bouldin index showing superior performance. These methods are robust to distribution shifts and appropriately detect correlation structure degradation, enabling practical implementation guidelines. This work establishes a methodological foundation for rigorous correlation-based clustering validation in high-stakes domains.
[LG-9] RIS-aided Latent Space Alignment for Semantic Channel Equalization
Link: https://arxiv.org/abs/2507.16450
Authors: Tomás Hüttebräucker, Mario Edoardo Pandolfo, Simone Fiorellino, Emilio Calvanese Strinati, Paolo Di Lorenzo
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:
Abstract:Semantic communication systems introduce a new paradigm in wireless communications, focusing on transmitting the intended meaning rather than ensuring strict bit-level accuracy. These systems often rely on Deep Neural Networks (DNNs) to learn and encode meaning directly from data, enabling more efficient communication. However, in multi-user settings where interacting agents are trained independently-without shared context or joint optimization-divergent latent representations across AI-native devices can lead to semantic mismatches, impeding mutual understanding even in the absence of traditional transmission errors. In this work, we address semantic mismatch in Multiple-Input Multiple-Output (MIMO) channels by proposing a joint physical and semantic channel equalization framework that leverages the presence of Reconfigurable Intelligent Surfaces (RIS). The semantic equalization is implemented as a sequence of transformations: (i) a pre-equalization stage at the transmitter; (ii) propagation through the RIS-aided channel; and (iii) a post-equalization stage at the receiver. We formulate the problem as a constrained Minimum Mean Squared Error (MMSE) optimization and propose two solutions: (i) a linear semantic equalization chain, and (ii) a non-linear DNN-based semantic equalizer. Both methods are designed to operate under semantic compression in the latent space and adhere to transmit power constraints. Through extensive evaluations, we show that the proposed joint equalization strategies consistently outperform conventional, disjoint approaches to physical and semantic channel equalization across a broad range of scenarios and wireless channel conditions.
[LG-10] The Sweet Danger of Sugar: Debunking Representation Learning for Encrypted Traffic Classification
Link: https://arxiv.org/abs/2507.16438
Authors: Yuqi Zhao, Giovanni Dettori, Matteo Boffa, Luca Vassio, Marco Mellia
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: This paper has been accepted at ACM SIGCOMM 2025. It will appear in the proceedings with DOI https://doi.org/10.1145/3718958.3750498
Abstract:Recently we have witnessed the explosion of proposals that, inspired by Language Models like BERT, exploit Representation Learning models to create traffic representations. All of them promise astonishing performance in encrypted traffic classification (up to 98% accuracy). In this paper, with a networking expert mindset, we critically reassess their performance. Through extensive analysis, we demonstrate that the reported successes are heavily influenced by data preparation problems, which allow these models to find easy shortcuts - spurious correlation between features and labels - during fine-tuning that unrealistically boost their performance. When such shortcuts are not present - as in real scenarios - these models perform poorly. We also introduce Pcap-Encoder, an LM-based representation learning model that we specifically design to extract features from protocol headers. Pcap-Encoder appears to be the only model that provides an instrumental representation for traffic classification. Yet, its complexity questions its applicability in practical settings. Our findings reveal flaws in dataset preparation and model training, calling for a better and more conscious test design. We propose a correct evaluation methodology and stress the need for rigorous benchmarking.
[LG-11] Improving Predictions on Highly Unbalanced Data Using Open Source Synthetic Data Upsampling
Link: https://arxiv.org/abs/2507.16419
Authors: Ivona Krchova, Michael Platzer, Paul Tiwald
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Unbalanced tabular data sets present significant challenges for predictive modeling and data analysis across a wide range of applications. In many real-world scenarios, such as fraud detection, medical diagnosis, and rare event prediction, minority classes are vastly underrepresented, making it difficult for traditional machine learning algorithms to achieve high accuracy. These algorithms tend to favor the majority class, leading to biased models that struggle to accurately represent minority classes. Synthetic data holds promise for addressing the under-representation of minority classes by providing new, diverse, and highly realistic samples. This paper presents a benchmark study on the use of AI-generated synthetic data for upsampling highly unbalanced tabular data sets. We evaluate the effectiveness of an open-source solution, the Synthetic Data SDK by MOSTLY AI, which provides a flexible and user-friendly approach to synthetic upsampling for mixed-type data. We compare predictive models trained on data sets upsampled with synthetic records to those using standard methods, such as naive oversampling and SMOTE-NC. Our results demonstrate that synthetic data can improve predictive accuracy for minority groups by generating diverse data points that fill gaps in sparse regions of the feature space. We show that upsampled synthetic training data consistently results in top-performing predictive models, particularly for mixed-type data sets containing very few minority samples.
[LG-12] Optimization and generalization analysis for two-layer physics-informed neural networks without over-parametrization
Link: https://arxiv.org/abs/2507.16380
Authors: Zhihan Zeng,Yiqi Gu
Categories: Machine Learning (cs.LG)
Comments:
Abstract:This work focuses on the behavior of stochastic gradient descent (SGD) in solving least-squares regression with physics-informed neural networks (PINNs). Past work on this topic has been based on the over-parameterization regime, whose convergence may require the network width to increase vastly with the number of training samples. So, the theory derived from over-parameterization may incur prohibitive computational costs and is far from practical experiments. We perform new optimization and generalization analysis for SGD in training two-layer PINNs, making certain assumptions about the target function to avoid over-parameterization. Given \epsilon > 0, we show that if the network width exceeds a threshold that depends only on \epsilon and the problem, then the training loss and expected loss will decrease below O(\epsilon).
[LG-13] Bipartite Patient-Modality Graph Learning with Event-Conditional Modelling of Censoring for Cancer Survival Prediction
Link: https://arxiv.org/abs/2507.16363
Authors: Hailin Yue,Hulin Kuang,Jin Liu,Junjian Li,Lanlan Wang,Mengshen He,Jianxin Wang
Categories: Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:
Abstract:Accurately predicting the survival of cancer patients is crucial for personalized treatment. However, existing studies focus solely on the relationships between samples with known survival risks, without fully leveraging the value of censored samples. Furthermore, these studies may suffer performance degradation in modality-missing scenarios and even struggle during the inference process. In this study, we propose a bipartite patient-modality graph learning method with event-conditional modelling of censoring for cancer survival prediction (CenSurv). Specifically, we first use a graph structure to model multimodal data and obtain representations. Then, to alleviate performance degradation in modality-missing scenarios, we design a bipartite graph to simulate the patient-modality relationship in various modality-missing scenarios and leverage a complete-incomplete alignment strategy to explore modality-agnostic features. Finally, we design a plug-and-play event-conditional modeling of censoring (ECMC) that selects reliable censored data using dynamic momentum accumulation confidences, assigns more accurate survival times to these censored data, and incorporates them as uncensored data into training. Comprehensive evaluations on 5 public cancer datasets showcase the superiority of CenSurv over the best state-of-the-art by 3.1% in terms of the mean C-index, while also exhibiting excellent robustness under various modality-missing scenarios. In addition, using the plug-and-play ECMC module, the mean C-index of 8 baselines increased by 1.3% across 5 datasets. Code of CenSurv is available at this https URL.
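The results above are reported as a mean C-index. As a reminder of what that metric measures and why censored samples are awkward for it, here is a minimal sketch of Harrell's concordance index in plain Python. It is the textbook pairwise definition, not code from CenSurv; the function name and the toy numbers are assumptions for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable patient pairs whose
    predicted risks are ordered consistently with their survival times.

    times  : observed time (event time or censoring time)
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = [5, 8, 12, 20]
events = [1, 0, 1, 1]          # the second patient is censored at t=8
risks = [0.9, 0.4, 0.3, 0.5]
c_index = concordance_index(times, events, risks)
print(c_index)
```

Note that the censored patient only ever appears as the later member of a pair, which is one way to see why censored samples carry less supervision than uncensored ones under the standard treatment.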
[LG-14] The Cost of Compression: Tight Quadratic Black-Box Attacks on Sketches for \ell_2 Norm Estimation
Link: https://arxiv.org/abs/2507.16345
Authors: Sara Ahmadian,Edith Cohen,Uri Stemmer
Categories: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
Comments:
Abstract:Dimensionality reduction via linear sketching is a powerful and widely used technique, but it is known to be vulnerable to adversarial inputs. We study the black-box adversarial setting, where a fixed, hidden sketching matrix A \in R^{k \times n} maps high-dimensional vectors v \in R^n to lower-dimensional sketches Av \in R^k, and an adversary can query the system to obtain approximate \ell_2-norm estimates that are computed from the sketch. We present a universal, nonadaptive attack that, using \tilde{O}(k^2) queries, either causes a failure in norm estimation or constructs an adversarial input on which the optimal estimator for the query distribution (used by the attack) fails. The attack is completely agnostic to the sketching matrix and to the estimator: It applies to any linear sketch and any query responder, including those that are randomized, adaptive, or tailored to the query distribution. Our lower bound construction tightly matches the known upper bounds of \tilde{\Omega}(k^2), achieved by specialized estimators for Johnson-Lindenstrauss transforms and AMS sketches. Beyond sketching, our results uncover structural parallels to adversarial attacks in image classification, highlighting fundamental vulnerabilities of compressed representations.
[LG-15] Time to Split: Exploring Data Splitting Strategies for Offline Evaluation of Sequential Recommenders RECSYS2025
Link: https://arxiv.org/abs/2507.16289
Authors: Danil Gusak,Anna Volodkevich,Anton Klenitskiy,Alexey Vasilev,Evgeny Frolov
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Accepted for ACM RecSys 2025. Author’s version. The final published version will be available at the ACM Digital Library
Abstract:Modern sequential recommender systems, ranging from lightweight transformer-based variants to large language models, have become increasingly prominent in academia and industry due to their strong performance in the next-item prediction task. Yet common evaluation protocols for sequential recommendations remain insufficiently developed: they often fail to reflect the corresponding recommendation task accurately, or are not aligned with real-world scenarios. Although the widely used leave-one-out split matches next-item prediction, it permits the overlap between training and test periods, which leads to temporal leakage and unrealistically long test horizon, limiting real-world relevance. Global temporal splitting addresses these issues by evaluating on distinct future periods. However, its applications to sequential recommendations remain loosely defined, particularly in terms of selecting target interactions and constructing a validation subset that provides necessary consistency between validation and test metrics. In this paper, we demonstrate that evaluation outcomes can vary significantly across splitting strategies, influencing model rankings and practical deployment decisions. To improve reproducibility in both academic and industrial settings, we systematically compare different splitting strategies for sequential recommendations across multiple datasets and established baselines. Our findings show that prevalent splits, such as leave-one-out, may be insufficiently aligned with more realistic evaluation strategies. Code: this https URL
Related DOI: https://doi.org/10.1145/3705328.3748164
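The contrast between leave-one-out and global temporal splitting can be sketched in a few lines of plain Python. This is a toy illustration of the two split families only, assuming interactions are `(user, item, timestamp)` tuples; the function names are made up, and the sketch does not cover the target-selection and validation-subset questions the abstract says are still loosely defined.

```python
def leave_one_out_split(interactions):
    """Hold out each user's last interaction as the test target."""
    by_user = {}
    for user, item, ts in sorted(interactions, key=lambda r: r[2]):
        by_user.setdefault(user, []).append((user, item, ts))
    train, test = [], []
    for seq in by_user.values():
        train.extend(seq[:-1])
        test.append(seq[-1])
    return train, test

def global_temporal_split(interactions, cutoff):
    """Everything before one global cutoff is train, the rest is test."""
    train = [r for r in interactions if r[2] < cutoff]
    test = [r for r in interactions if r[2] >= cutoff]
    return train, test

def has_temporal_leakage(train, test):
    """True if some training interaction happens after the earliest test one."""
    earliest_test = min(ts for _, _, ts in test)
    return any(ts > earliest_test for _, _, ts in train)

# u1 is active early, u2 is active late.
log = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3),
       ("u2", "d", 7), ("u2", "e", 8), ("u2", "f", 9)]

loo_train, loo_test = leave_one_out_split(log)
gts_train, gts_test = global_temporal_split(log, cutoff=6)

print(has_temporal_leakage(loo_train, loo_test))
print(has_temporal_leakage(gts_train, gts_test))
```

In the toy log, leave-one-out tests u1 at timestamp 3 while training on u2's interactions at timestamps 7 and 8, which is the overlap between training and test periods that the abstract calls temporal leakage.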
[LG-16] Multi-Agent Reinforcement Learning for Sample-Efficient Deep Neural Network Mapping
Link: https://arxiv.org/abs/2507.16249
Authors: Srivatsan Krishnan,Jason Jabbour,Dan Zhang,Natasha Jaques,Aleksandra Faust,Shayegan Omidshafiei,Vijay Janapa Reddi
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:Mapping deep neural networks (DNNs) to hardware is critical for optimizing latency, energy consumption, and resource utilization, making it a cornerstone of high-performance accelerator design. Due to the vast and complex mapping space, reinforcement learning (RL) has emerged as a promising approach, but its effectiveness is often limited by sample inefficiency. We present a decentralized multi-agent reinforcement learning (MARL) framework designed to overcome this challenge. By distributing the search across multiple agents, our framework accelerates exploration. To avoid inefficiencies from training multiple agents in parallel, we introduce an agent clustering algorithm that assigns similar mapping parameters to the same agents based on correlation analysis. This enables a decentralized, parallelized learning process that significantly improves sample efficiency. Experimental results show our MARL approach improves sample efficiency by 30-300x over standard single-agent RL, achieving up to 32.61x latency reduction and 16.45x energy-delay product (EDP) reduction under iso-sample conditions.
[LG-17] Toward a Lightweight and Robust Design for Caching with Predictions
Link: https://arxiv.org/abs/2507.16242
Authors: Peng Chen,Hailiang Zhao,Jiaji Zhang,Xueyan Tang,Yixuan Wang,Shuiguang Deng
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: preprint
Abstract:The online caching problem aims to minimize cache misses when serving a sequence of requests under a limited cache size. While naive learning-augmented caching algorithms achieve ideal 1-consistency, they lack robustness guarantees. Existing robustification methods either sacrifice 1-consistency or introduce significant computational overhead. In this paper, we introduce Guard, a lightweight robustification framework that enhances the robustness of a broad class of learning-augmented caching algorithms to 2H_k + 2, while preserving their 1-consistency. Guard achieves the current best-known trade-off between consistency and robustness, with only O(1) additional per-request overhead, thereby maintaining the original time complexity of the base algorithm. Extensive experiments across multiple real-world datasets and prediction models validate the effectiveness of Guard in practice.
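To make the consistency and robustness terms concrete, the following plain-Python sketch shows one naive learning-augmented policy of the kind the abstract refers to: on a miss with a full cache, evict the page whose predicted next request is farthest away. This is not the Guard framework, whose details are not given in the abstract; the function names and the request sequence are assumptions for illustration.

```python
def blind_follow_misses(requests, predicted_next, k):
    """Count misses of a size-k cache that trusts predictions blindly.

    predicted_next[t] is the predicted time of the next request
    for the page requested at time t.
    """
    cache = {}  # page -> predicted time of its next request
    misses = 0
    for t, page in enumerate(requests):
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                # Evict the page predicted to be needed farthest in the future.
                victim = max(cache, key=cache.get)
                del cache[victim]
        cache[page] = predicted_next[t]
    return misses

def true_next_times(requests):
    """Perfect predictions: the actual next request time of each page."""
    nxt = [float("inf")] * len(requests)
    last_seen = {}
    for t in range(len(requests) - 1, -1, -1):
        nxt[t] = last_seen.get(requests[t], float("inf"))
        last_seen[requests[t]] = t
    return nxt

requests = ["a", "b", "c", "a", "b", "d", "a", "b"]
perfect = true_next_times(requests)
useless = [0] * len(requests)  # predictions that carry no information

misses_perfect = blind_follow_misses(requests, perfect, k=2)
misses_useless = blind_follow_misses(requests, useless, k=2)
print(misses_perfect, misses_useless)
```

With perfect predictions the policy coincides with the offline farthest-in-future rule, which is the sense in which such algorithms are 1-consistent. With uninformative predictions nothing bounds how far it drifts from the optimum, which is the missing robustness guarantee that a robustification layer is meant to supply.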
[LG-18] LLM-Enhanced Reranking for Complementary Product Recommendation
Link: https://arxiv.org/abs/2507.16237
Authors: Zekun Xu,Yudi Zhang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:
Abstract:Complementary product recommendation, which aims to suggest items that are used together to enhance customer value, is a crucial yet challenging task in e-commerce. While existing graph neural network (GNN) approaches have made significant progress in capturing complex product relationships, they often struggle with the accuracy-diversity tradeoff, particularly for long-tail items. This paper introduces a model-agnostic approach that leverages Large Language Models (LLMs) to enhance the reranking of complementary product recommendations. Unlike previous works that use LLMs primarily for data preprocessing and graph augmentation, our method applies LLM-based prompting strategies directly to rerank candidate items retrieved from existing recommendation models, eliminating the need for model retraining. Through extensive experiments on public datasets, we demonstrate that our approach effectively balances accuracy and diversity in complementary product recommendations, with at least 50% lift in accuracy metrics and 2% lift in diversity metrics on average for the top recommended items across datasets.
[LG-19] Aligned Manifold Property and Topology Point Clouds for Learning Molecular Properties
Link: https://arxiv.org/abs/2507.16223
Authors: Alexander Mihalcea
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 13 pages, 6 figures
Abstract:Machine learning models for molecular property prediction generally rely on representations – such as SMILES strings and molecular graphs – that overlook the surface-local phenomena driving intermolecular behavior. 3D-based approaches often reduce surface detail or require computationally expensive SE(3)-equivariant architectures to manage spatial variance. To overcome these limitations, this work introduces AMPTCR (Aligned Manifold Property and Topology Cloud Representation), a molecular surface representation that combines local quantum-derived scalar fields and custom topological descriptors within an aligned point cloud format. Each surface point includes a chemically meaningful scalar, geodesically derived topology vectors, and coordinates transformed into a canonical reference frame, enabling efficient learning with conventional SE(3)-sensitive architectures. AMPTCR is evaluated using a DGCNN framework on two tasks: molecular weight and bacterial growth inhibition. For molecular weight, results confirm that AMPTCR encodes physically meaningful data, with a validation R^2 of 0.87. In the bacterial inhibition task, AMPTCR enables both classification and direct regression of E. coli inhibition values using Dual Fukui functions as the electronic descriptor and Morgan Fingerprints as auxiliary data, achieving an ROC AUC of 0.912 on the classification task, and an R^2 of 0.54 on the regression task. These results help demonstrate that AMPTCR offers a compact, expressive, and architecture-agnostic representation for modeling surface-mediated molecular properties.
[LG-20] RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs
Link: https://arxiv.org/abs/2507.16200
Authors: Pengwei Jin,Di Huang,Chongxiao Li,Shuyao Cheng,Yang Zhao,Xinyao Zheng,Jiaguo Zhu,Shuyi Xing,Bohan Dou,Rui Zhang,Zidong Du,Qi Guo,Xing Hu
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: The benchmark is open-sourced at this https URL
Abstract:The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating real-world design workflows due to their designs’ simplicity, inadequate design specifications, and less rigorous verification environments. To address these limitations, we present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks. RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments, including 100% line coverage testbenches and a formal checker. It supports both module-level and system-level tasks, enabling comprehensive assessments of LLM capabilities. Evaluations on various LLMs and agents reveal that even one of the best-performing LLMs, o1-preview, achieves only a 13.3% pass@1 on module-level tasks and 0% on system-level tasks, highlighting the need for stronger Verilog generation models in the future. The benchmark is open-sourced at this https URL.
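The headline numbers above are pass@1 scores. For readers unfamiliar with the metric, the sketch below shows the widely used unbiased pass@k estimator for code-generation benchmarks, computed from n samples per task of which c pass verification. Whether RealBench computes its scores with exactly this estimator is an assumption; the sample counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes,
    given that c out of n generated samples passed verification."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 20 generated modules, 3 pass the testbench.
p1 = pass_at_k(n=20, c=3, k=1)
p5 = pass_at_k(n=20, c=3, k=5)
print(p1, p5)
```

For k = 1 the estimator reduces to the plain pass rate c / n; a benchmark-level score is then the average over tasks.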
[LG-21] EBaReT: Expert-guided Bag Reward Transformer for Auto Bidding
Link: https://arxiv.org/abs/2507.16186
Authors: Kaiyuan Li,Pengyu Wang,Yunshan Peng,Pengjia Yuan,Yanxiang Zeng,Rui Xiang,Yanhua Cheng,Xialong Liu,Peng Jiang
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Reinforcement learning has been widely applied in automated bidding. Traditional approaches model bidding as a Markov Decision Process (MDP). Recently, some studies have explored using generative reinforcement learning methods to address long-term dependency issues in bidding environments. Although effective, these methods typically rely on supervised learning approaches, which are vulnerable to low data quality due to the amount of sub-optimal bids and low probability rewards resulting from the low click and conversion rates. Unfortunately, few studies have addressed these challenges. In this paper, we formalize the automated bidding as a sequence decision-making problem and propose a novel Expert-guided Bag Reward Transformer (EBaReT) to address concerns related to data quality and uncertainty rewards. Specifically, to tackle data quality issues, we generate a set of expert trajectories to serve as supplementary data in the training process and employ a Positive-Unlabeled (PU) learning-based discriminator to identify expert transitions. To ensure the decision also meets the expert level, we further design a novel expert-guided inference strategy. Moreover, to mitigate the uncertainty of rewards, we consider the transitions within a certain period as a “bag” and carefully design a reward function that leads to a smoother acquisition of rewards. Extensive experiments demonstrate that our model achieves superior performance compared to state-of-the-art bidding methods.
[LG-22] The Impact of Pseudo-Science in Financial Loans Risk Prediction
Link: https://arxiv.org/abs/2507.16182
Authors: Bruno Scarone,Ricardo Baeza-Yates
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:
Abstract:We study the societal impact of pseudo-scientific assumptions for predicting the behavior of people in a straightforward application of machine learning to risk prediction in financial lending. This use case also exemplifies the impact of survival bias in loan return prediction. We analyze the models in terms of their accuracy and social cost, showing that the socially optimal model may not imply a significant accuracy loss for this downstream task. Our results are verified for commonly used learning methods and datasets. Our findings also show that there is a natural dynamic when training models that suffer from survival bias, in which accuracy slightly deteriorates while recall and precision improve with time. These results act as an illusion, leading the observer to believe that the system is getting better, when in fact the model is suffering from increasing unfairness and survival bias.
[LG-23] Learning Patient-Specific Spatial Biomarker Dynamics via Operator Learning for Alzheimer's Disease Progression
Link: https://arxiv.org/abs/2507.16148
Authors: Jindong Wang,Yutong Mao,Xiao Liu,Wenrui Hao
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments:
Abstract:Alzheimer’s disease (AD) is a complex, multifactorial neurodegenerative disorder with substantial heterogeneity in progression and treatment response. Despite recent therapeutic advances, predictive models capable of accurately forecasting individualized disease trajectories remain limited. Here, we present a machine learning-based operator learning framework for personalized modeling of AD progression, integrating longitudinal multimodal imaging, biomarker, and clinical data. Unlike conventional models with prespecified dynamics, our approach directly learns patient-specific disease operators governing the spatiotemporal evolution of amyloid, tau, and neurodegeneration biomarkers. Using Laplacian eigenfunction bases, we construct geometry-aware neural operators capable of capturing complex brain dynamics. Embedded within a digital twin paradigm, the framework enables individualized predictions, simulation of therapeutic interventions, and in silico clinical trials. Applied to AD clinical data, our method achieves high prediction accuracy exceeding 90% across multiple biomarkers, substantially outperforming existing approaches. This work offers a scalable, interpretable platform for precision modeling and personalized therapeutic optimization in neurodegenerative diseases.
[LG-24] Equivariant Goal Conditioned Contrastive Reinforcement Learning
Link: https://arxiv.org/abs/2507.16139
Authors: Arsh Tangri,Nichols Crawford Taylor,Haojie Huang,Robert Platt
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:
Abstract:Contrastive Reinforcement Learning (CRL) provides a promising framework for extracting useful structured representations from unlabeled interactions. By pulling together state-action pairs and their corresponding future states, while pushing apart negative pairs, CRL enables learning nontrivial policies without manually designed rewards. In this work, we propose Equivariant CRL (ECRL), which further structures the latent space using equivariant constraints. By leveraging inherent symmetries in goal-conditioned manipulation tasks, our method improves both sample efficiency and spatial generalization. Specifically, we formally define Goal-Conditioned Group-Invariant MDPs to characterize rotation-symmetric robotic manipulation tasks, and build on this by introducing a novel rotation-invariant critic representation paired with a rotation-equivariant actor for Contrastive RL. Our approach consistently outperforms strong baselines across a range of simulated tasks in both state-based and image-based settings. Finally, we extend our method to the offline RL setting, demonstrating its effectiveness across multiple tasks.
[LG-25] TorchAO: PyTorch-Native Training-to-Serving Model Optimization ICML25
Link: https://arxiv.org/abs/2507.16099
Authors: Andrew Or,Apurva Jain,Daniel Vega-Myhre,Jesse Cai,Charles David Hernandez,Zhenrui Zheng,Driss Guessous,Vasiliy Kuznetsov,Christian Puhrsch,Mark Saroufim,Supriya Rao,Thien Tran,Aleksandar Samardžić
Categories: Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, published in CODEML@ICML25
Abstract:We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at this https URL.
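Of the techniques listed above, 2:4 sparsity is the easiest to show in isolation: in every contiguous group of four weights, at most two are nonzero. The sketch below applies a magnitude-based 2:4 mask to a flat list in plain Python. It deliberately does not use TorchAO or PyTorch and does not reflect their APIs; the function name and the choice of magnitude as the selection rule are assumptions for illustration.

```python
def prune_2_of_4(weights):
    """Zero out the two smallest-magnitude weights in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned

row = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.0, 1.5]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

The fixed two-of-four structure is what lets hardware store and multiply such matrices in a compressed form, in contrast to unstructured pruning where the zero positions are arbitrary.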
[LG-26] Interpreting CFD Surrogates through Sparse Autoencoders IJCAI2025
Link: https://arxiv.org/abs/2507.16069
Authors: Yeping Hu,Shusen Liu
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: Accepted by IJCAI 2025 Workshop on Explainable Artificial Intelligence (XAI)
Abstract:Learning-based surrogate models have become a practical alternative to high-fidelity CFD solvers, but their latent representations remain opaque and hinder adoption in safety-critical or regulation-bound settings. This work introduces a posthoc interpretability framework for graph-based surrogate models used in computational fluid dynamics (CFD) by leveraging sparse autoencoders (SAEs). By obtaining an overcomplete basis in the node embedding space of a pretrained surrogate, the method extracts a dictionary of interpretable latent features. The approach enables the identification of monosemantic concepts aligned with physical phenomena such as vorticity or flow structures, offering a model-agnostic pathway to enhance explainability and trustworthiness in CFD applications.
[LG-27] Neural Probabilistic Shaping: Joint Distribution Learning for Optical Fiber Communications
Link: https://arxiv.org/abs/2507.16012
Authors: Mohammad Taha Askari,Lutz Lampe,Amirhossein Ghazisaeidi
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Signal Processing (eess.SP)
Comments: 4 pages, 3 figures, Submitted to the 51st European Conference on Optical Communications
Abstract:We present an autoregressive end-to-end learning approach for probabilistic shaping on nonlinear fiber channels. Our proposed scheme learns the joint symbol distribution and provides a 0.3-bits/2D achievable information rate gain over an optimized marginal distribution for dual-polarized 64-QAM transmission over a single-span 205 km link.
[LG-28] Enhancing Stability of Physics-Informed Neural Network Training Through Saddle-Point Reformulation
Link: https://arxiv.org/abs/2507.16008
Authors: Dmitry Bylinkin,Mikhail Aleksandrov,Savelii Chezhegov,Aleksandr Beznosikov
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 34 pages, 4 tables, 3 figures, 4 theorems; code available at this https URL
Abstract:Physics-informed neural networks (PINNs) have gained prominence in recent years and are now effectively used in a number of applications. However, their performance remains unstable due to the complex landscape of the loss function. To address this issue, we reformulate PINN training as a nonconvex-strongly concave saddle-point problem. After establishing the theoretical foundation for this approach, we conduct an extensive experimental study, evaluating its effectiveness across various tasks and architectures. Our results demonstrate that the proposed method outperforms the current state-of-the-art techniques.
[LG-29] HyDRA: A Hybrid-Driven Reasoning Architecture for Verifiable Knowledge Graphs
Link: https://arxiv.org/abs/2507.15917
Authors: Adrian Kaiser,Claudiu Leoveanu-Condrei,Ryan Gold,Marius-Constantin Dinu,Markus Hofmarcher
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 4 figures
Abstract:The synergy between symbolic knowledge, often represented by Knowledge Graphs (KGs), and the generative capabilities of neural networks is central to advancing neurosymbolic AI. A primary bottleneck in realizing this potential is the difficulty of automating KG construction, which faces challenges related to output reliability, consistency, and verifiability. These issues can manifest as structural inconsistencies within the generated graphs, such as the formation of disconnected isolated islands of data or the inaccurate conflation of abstract classes with specific instances. To address these challenges, we propose HyDRA, a Hybrid-Driven Reasoning Architecture designed for verifiable KG automation. Given a domain or an initial set of documents, HyDRA first constructs an ontology via a panel of collaborative neurosymbolic agents. These agents collaboratively agree on a set of competency questions (CQs) that define the scope and requirements the ontology must be able to answer. Given these CQs, we build an ontology graph that subsequently guides the automated extraction of triplets for KG generation from arbitrary documents. Inspired by design-by-contracts (DbC) principles, our method leverages verifiable contracts as the primary control mechanism to steer the generative process of Large Language Models (LLMs). To verify the output of our approach, we extend beyond standard benchmarks and propose an evaluation framework that assesses the functional correctness of the resulting KG by leveraging symbolic verifications as described by the neurosymbolic AI framework, SymbolicAI. This work contributes a hybrid-driven architecture for improving the reliability of automated KG construction and the exploration of evaluation methods for measuring the functional integrity of its output. The code is publicly available.
[LG-30] Fast-VAT: Accelerating Cluster Tendency Visualization using Cython and Numba
Link: https://arxiv.org/abs/2507.15904
Authors: MSR Avinash (Presidency University, Bangalore),Ismael Lachheb (EPITA School of Engineering and Computer Science, Paris, France)
Categories: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages, 3 figures, 3 tables. Code available at this https URL
Abstract:Visual Assessment of Cluster Tendency (VAT) is a widely used unsupervised technique to assess the presence of cluster structure in unlabeled datasets. However, its standard implementation suffers from significant performance limitations due to its O(n^2) time complexity and inefficient memory usage. In this work, we present Fast-VAT, a high-performance reimplementation of the VAT algorithm in Python, augmented with Numba’s Just-In-Time (JIT) compilation and Cython’s static typing and low-level memory optimizations. Our approach achieves up to 50x speedup over the baseline implementation, while preserving the output fidelity of the original method. We validate Fast-VAT on a suite of real and synthetic datasets – including Iris, Mall Customers, and Spotify subsets – and verify cluster tendency using Hopkins statistics, PCA, and t-SNE. Additionally, we compare VAT’s structural insights with clustering results from DBSCAN and K-Means to confirm its reliability.
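For context, the core of VAT is a reordering of the pairwise dissimilarity matrix so that points in the same cluster become neighbours and show up as dark blocks along the diagonal. The sketch below is a plain-Python rendering of that reordering as it is commonly described (a Prim-like nearest-neighbour chain). It is not the Fast-VAT code and uses neither Numba nor Cython; function names and the 1-D toy data are assumptions for illustration.

```python
def vat_order(dist):
    """Return the VAT display order for a symmetric dissimilarity matrix."""
    n = len(dist)
    # Start from a point that attains the largest dissimilarity in the matrix.
    start = max(range(n), key=lambda i: max(dist[i]))
    order = [start]
    remaining = [j for j in range(n) if j != start]
    # nearest[j]: smallest dissimilarity between j and any already ordered point.
    nearest = {j: dist[start][j] for j in remaining}
    while remaining:
        nxt = min(remaining, key=nearest.get)
        order.append(nxt)
        remaining.remove(nxt)
        for j in remaining:
            nearest[j] = min(nearest[j], dist[nxt][j])
    return order

def reorder(dist, order):
    """Permute rows and columns of dist according to order."""
    return [[dist[a][b] for b in order] for a in order]

# Two 1-D clusters (near 0 and near 10), deliberately interleaved.
points = [0.0, 10.0, 0.5, 10.5, 1.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

order = vat_order(dist)
reordered = reorder(dist, order)
print(order)
for row in reordered:
    print(row)
```

Both the distance matrix and the ordering loop touch every pair of points, which is where the quadratic cost mentioned in the abstract comes from and why compiling the inner loops pays off.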
[LG-31] Improving the Generation of VAEs with High Dimensional Latent Spaces by the use of Hyperspherical Coordinates IJCNN25
Link: https://arxiv.org/abs/2507.15900
Authors: Alejandro Ascarate,Leo Lebrat,Rodrigo Santa Cruz,Clinton Fookes,Olivier Salvado
Categories: Machine Learning (cs.LG)
Comments: 8 pages, 3 figures, published in IJCNN25 (in press)
Abstract:Variational autoencoders (VAE) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once trained, decoding a random latent vector from the prior usually does not produce meaningful data, at least when the latent space has more than a dozen dimensions. In this paper, we investigate this issue by drawing insight from high dimensional statistics: in these regimes, the latent vectors of a standard VAE are by construction distributed uniformly on a hypersphere. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards an island on the hypersphere, thereby reducing the latent sparsity and we show that this improves the generation ability of the VAE. We propose a new parameterization of the latent space with limited computational overhead.
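The high-dimensional fact the abstract relies on can be checked numerically: samples from a standard Gaussian in d dimensions have norms that concentrate around \sqrt{d}, so for large d they sit close to a thin spherical shell. The following standard-library sketch measures that concentration. It only illustrates this statistical background and says nothing about the paper's hyperspherical parameterization; the function name and sample sizes are assumptions.

```python
import math
import random

def norm_stats(dim, samples=500, seed=0):
    """Mean and standard deviation of the Euclidean norm of
    standard Gaussian vectors in the given dimension."""
    rng = random.Random(seed)
    norms = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(samples)
    ]
    mean = sum(norms) / samples
    std = math.sqrt(sum((x - mean) ** 2 for x in norms) / samples)
    return mean, std

mean_lo, std_lo = norm_stats(dim=2)
mean_hi, std_hi = norm_stats(dim=256)

# Relative spread of the norm: large in 2-D, small in 256-D.
print(std_lo / mean_lo, std_hi / mean_hi)
print(mean_hi, math.sqrt(256))
```

Since the standard Gaussian is rotation-invariant, the directions of these samples are uniform, which together with the concentrated norm gives the "uniform on a hypersphere" picture the abstract describes.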
[LG-32] Prompt Smart Pay Less: Cost-Aware APO for Real-World Applications
Link: https://arxiv.org/abs/2507.15884
Authors: Jayesh Choudhari,Piyush Kumar Singh,Douglas McIlwraith,Snehal Nair
Categories: Machine Learning (cs.LG)
Comments:
Abstract:Prompt design is a critical factor in the effectiveness of Large Language Models (LLMs), yet remains largely heuristic, manual, and difficult to scale. This paper presents the first comprehensive evaluation of Automatic Prompt Optimization (APO) methods for real-world, high-stakes multiclass classification in a commercial setting, addressing a critical gap in the existing literature where most of the APO frameworks have been validated only on benchmark classification tasks of limited complexity. We introduce APE-OPRO, a novel hybrid framework that combines the complementary strengths of APE and OPRO, achieving notably better cost-efficiency, around 18% improvement over OPRO, without sacrificing performance. We benchmark APE-OPRO alongside both gradient-free (APE, OPRO) and gradient-based (ProTeGi) methods on a dataset of ~2,500 labeled products. Our results highlight key trade-offs: ProTeGi offers the strongest absolute performance at lower API cost but higher computational time as noted in \cite{protegi}, while APE-OPRO strikes a compelling balance between performance, API efficiency, and scalability. We further conduct ablation studies on depth and breadth hyperparameters, and reveal notable sensitivity to label formatting, indicating implicit sensitivity in LLM behavior. These findings provide actionable insights for implementing APO in commercial applications and establish a foundation for future research in multi-label, vision, and multimodal prompt optimization scenarios.
[LG-33] An open dataset of neural networks for hypernetwork research
Link: https://arxiv.org/abs/2507.15869
Authors: David Kurtenbach,Lior Shamir
Categories: Machine Learning (cs.LG)
Comments: Electronics, published
Abstract:Despite the transformative potential of AI, the concept of neural networks that can produce other neural networks by generating model weights (hypernetworks) has been largely understudied. One of the possible reasons is the lack of available research resources that can be used for the purpose of hypernetwork research. Here we describe a dataset of neural networks, designed for the purpose of hypernetworks research. The dataset includes 10^4 LeNet-5 neural networks trained for binary image classification separated into 10 classes, such that each class contains 1,000 different neural networks that can identify a certain ImageNette V2 class from all other classes. A computing cluster of over 10^4 cores was used to generate the dataset. Basic classification results show that the neural networks can be classified with accuracy of 72.0%, indicating that the differences between the neural networks can be identified by supervised machine learning algorithms. The ultimate purpose of the dataset is to enable hypernetworks research. The dataset and the code that generates it are open and accessible to the public.
[LG-34] Quantifying Holistic Review: A Multi-Modal Approach to College Admissions Prediction
Link: https://arxiv.org/abs/2507.15862
Authors: Jun-Wei Zeng,Jerry Shen
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:
Abstract:This paper introduces the Comprehensive Applicant Profile Score (CAPS), a novel multi-modal framework designed to quantitatively model and interpret holistic college admissions evaluations. CAPS decomposes applicant profiles into three interpretable components: academic performance (Standardized Academic Score, SAS), essay quality (Essay Quality Index, EQI), and extracurricular engagement (Extracurricular Impact Score, EIS). Leveraging transformer-based semantic embeddings, LLM scoring, and XGBoost regression, CAPS provides transparent and explainable evaluations aligned with human judgment. Experiments on a synthetic but realistic dataset demonstrate strong performance, achieving an EQI prediction R^2 of 0.80, classification accuracy over 75%, a macro F1 score of 0.69, and a weighted F1 score of 0.74. CAPS addresses key limitations in traditional holistic review – particularly the opacity, inconsistency, and anxiety faced by applicants – thus paving the way for more equitable and data-informed admissions practices.
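The abstract reports both a macro F1 (0.69) and a weighted F1 (0.74). The sketch below illustrates, on made-up labels, how the two averages differ: macro F1 weights every class equally, while weighted F1 weights each class by its number of true instances. The function names and toy data are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by class support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy, imbalanced example (hypothetical labels): three of class 0, one of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

When the majority class is predicted better than the minority class, as in this toy case, the weighted score comes out higher than the macro score, which is the same ordering as the two figures quoted in the abstract.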
[LG-35] Pixel-Resolved Long-Context Learning for Turbulence at Exascale: Resolving Small-scale Eddies Toward the Viscous Limit
Link: https://arxiv.org/abs/2507.16697
Authors: Junqi Yin,Mijanur Palash,M. Paul Laiu,Muralikrishnan Gopalakrishnan Meena,John Gounley,Stephen M. de Bruyn Kops,Feiyi Wang,Ramanan Sankaran,Pei Zhang
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments:
Abstract:Turbulence plays a crucial role in multiphysics applications, including aerodynamics, fusion, and combustion. Accurately capturing turbulence’s multiscale characteristics is essential for reliable predictions of multiphysics interactions, but remains a grand challenge even for exascale supercomputers and advanced deep learning models. The extreme-resolution data required to represent turbulence, ranging from billions to trillions of grid points, pose prohibitive computational costs for models based on architectures like vision transformers. To address this challenge, we introduce a multiscale hierarchical Turbulence Transformer that reduces sequence length from billions to a few millions and a novel RingX sequence parallelism approach that enables scalable long-context learning. We perform scaling and science runs on the Frontier supercomputer. Our approach demonstrates excellent performance up to 1.1 EFLOPS on 32,768 AMD GPUs, with a scaling efficiency of 94%. To our knowledge, this is the first AI model for turbulence that can capture small-scale eddies down to the dissipative range.
[LG-36] Structural Effect and Spectral Enhancement of High-Dimensional Regularized Linear Discriminant Analysis
Link: https://arxiv.org/abs/2507.16682
Authors: Yonghan Zhang,Zhangni Pu,Lu Yan,Jiang Hu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:
Abstract:Regularized linear discriminant analysis (RLDA) is a widely used tool for classification and dimensionality reduction, but its performance in high-dimensional scenarios is inconsistent. Existing theoretical analyses of RLDA often lack clear insight into how data structure affects classification performance. To address this issue, we derive a non-asymptotic approximation of the misclassification rate and thus analyze the structural effect and structural adjustment strategies of RLDA. Based on this, we propose the Spectral Enhanced Discriminant Analysis (SEDA) algorithm, which optimizes the data structure by adjusting the spiked eigenvalues of the population covariance matrix. By developing a new theoretical result on eigenvectors in random matrix theory, we derive an asymptotic approximation on the misclassification rate of SEDA. The bias correction algorithm and parameter selection strategy are then obtained. Experiments on synthetic and real datasets show that SEDA achieves higher classification accuracy and dimensionality reduction compared to existing LDA methods.
[LG-37] Alternative Loss Function in Evaluation of Transformer Models
Link: https://arxiv.org/abs/2507.16548
Authors: Jakub Michańków,Paweł Sakowski,Robert Ślepaczuk
Subjects: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: 12 pages
Abstract:The proper design and architecture of testing of machine learning models, especially in their application to quantitative finance problems, is crucial. The most important step in this process is selecting an adequate loss function used for training, validation, estimation purposes, and tuning of hyperparameters. Therefore, in this research, through empirical experiments on equity and cryptocurrency assets, we introduce the Mean Absolute Directional Loss (MADL) function, which is more adequate for optimizing forecast-generating models used in algorithmic investment strategies. The MADL function results are compared for Transformer and LSTM models, and we show that in almost every case Transformer results are significantly better than those obtained with LSTM.
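The abstract names MADL but does not state its formula. A minimal sketch follows, assuming the commonly cited definition MADL = (1/N) Σ (−1) · sign(R_i · R̂_i) · |R_i|, where R_i is the realized return and R̂_i the forecast return. Both the formula and the function name `madl` are assumptions here and should be checked against the paper.

```python
def madl(actual_returns, predicted_returns):
    """Mean Absolute Directional Loss (assumed definition, see lead-in).

    Each observation contributes -|R_i| when the forecast has the same sign
    as the realized return, +|R_i| when the signs differ, and 0 when either
    value is zero.
    """
    if len(actual_returns) != len(predicted_returns):
        raise ValueError("inputs must have the same length")
    if not actual_returns:
        raise ValueError("inputs must be non-empty")

    def sign(x):
        return (x > 0) - (x < 0)

    total = sum(
        -sign(r * p) * abs(r)
        for r, p in zip(actual_returns, predicted_returns)
    )
    return total / len(actual_returns)


# Hypothetical returns: the first forecast gets the direction right,
# the second gets it wrong.
print(madl([0.02, -0.01], [0.5, 0.3]))
```

Under this definition a lower (more negative) value is better, and a wrong call on a large move is penalized more than a wrong call on a small one, which is what ties the loss to strategy profit rather than to point-forecast error.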
[LG-38] Adaptive Bayesian Single-Shot Quantum Sensing
Link: https://arxiv.org/abs/2507.16477
Authors: Ivana Nikoloska,Ruud Van Sloun,Osvaldo Simeone
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: submitted for publication
Abstract:Quantum sensing harnesses the unique properties of quantum systems to enable precision measurements of physical quantities such as time, magnetic and electric fields, acceleration, and gravitational gradients well beyond the limits of classical sensors. However, identifying suitable sensing probes and measurement schemes can be a classically intractable task, as it requires optimizing over Hilbert spaces of high dimension. In variational quantum sensing, a probe quantum system is generated via a parameterized quantum circuit (PQC), exposed to an unknown physical parameter through a quantum channel, and measured to collect classical data. PQCs and measurements are typically optimized using offline strategies based on frequentist learning criteria. This paper introduces an adaptive protocol that uses Bayesian inference to optimize the sensing policy via the maximization of the active information gain. The proposed variational methodology is tailored for non-asymptotic regimes where a single probe can be deployed in each time step, and is extended to support the fusion of estimates from multiple quantum sensing agents.
[LG-39] Adaptive Multi-task Learning for Multi-sector Portfolio Optimization
Link: https://arxiv.org/abs/2507.16433
Authors: Qingliang Fan,Ruike Wu,Yanrong Yang
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments:
Abstract:Accurate transfer of information across multiple sectors to enhance model estimation is both significant and challenging in multi-sector portfolio optimization involving a large number of assets in different classes. Within the framework of factor modeling, we propose a novel data-adaptive multi-task learning methodology that quantifies and learns the relatedness among the principal temporal subspaces (spanned by factors) across multiple sectors under study. This approach not only improves the simultaneous estimation of multiple factor models but also enhances multi-sector portfolio optimization, which heavily depends on the accurate recovery of these factor models. Additionally, a novel and easy-to-implement algorithm, termed projection-penalized principal component analysis, is developed to accomplish the multi-task learning procedure. Diverse simulation designs and a practical application to daily return data from the Russell 3000 index demonstrate the advantages of the multi-task learning methodology.
[LG-40] An effective physics-informed neural operator framework for predicting wavefields
Link: https://arxiv.org/abs/2507.16431
Authors: Xiao Ma,Tariq Alkhalifah
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
Comments:
Abstract:Solving the wave equation is fundamental for geophysical applications. However, numerical solutions of the Helmholtz equation face significant computational and memory challenges. Therefore, we introduce a physics-informed convolutional neural operator (PICNO) to solve the Helmholtz equation efficiently. The PICNO takes both the background wavefield corresponding to a homogeneous medium and the velocity model as the input function space, generating the scattered wavefield as the output function space. Our workflow integrates PDE constraints directly into the training process, enabling the neural operator to not only fit the available data but also capture the underlying physics governing wave phenomena. PICNO allows for high-resolution, reasonably accurate predictions even with limited training samples, and it demonstrates significant improvements over a purely data-driven convolutional neural operator (CNO), particularly in predicting high-frequency wavefields. These features and improvements are important for waveform inversion down the road.
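The abstract does not write out the PDE constraint. The derivation below is the standard scattered-field form of the Helmholtz equation and is only a plausible candidate for it. The symbols are assumptions for illustration: angular frequency \omega, velocity model v, homogeneous background velocity v_0, source s, background wavefield u_0, scattered wavefield \delta u. The sign convention for the source is also assumed.

```latex
% Total and background wavefields (assumed sign convention for the source s):
\left(\nabla^2 + \frac{\omega^2}{v^2}\right) u = -s,
\qquad
\left(\nabla^2 + \frac{\omega^2}{v_0^2}\right) u_0 = -s .

% Write u = u_0 + \delta u and subtract the background equation:
\left(\nabla^2 + \frac{\omega^2}{v^2}\right) \delta u
  = -\,\omega^2 \left(\frac{1}{v^2} - \frac{1}{v_0^2}\right) u_0 .
```

In this form the source term drops out, and the inputs named in the abstract (u_0 and v) fully determine the right-hand side, so a residual of this equation evaluated on a predicted \delta u could act as a physics loss.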
[LG-41] Meta-learning of Gibbs states for many-body Hamiltonians with applications to Quantum Boltzmann Machines
Link: https://arxiv.org/abs/2507.16373
Authors: Ruchira V Bhat,Rahul Bhowmick,Avinash Singh,Krishna Kumar Sabapathy
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 20 pages, 14 figures, 3 tables, 3 algorithms
Abstract:The preparation of quantum Gibbs states is a fundamental challenge in quantum computing, essential for applications ranging from modeling open quantum systems to quantum machine learning. Building on the Meta-Variational Quantum Eigensolver framework proposed by Cervera-Lierta et al. (2021) and a problem-driven ansatz design, we introduce two meta-learning algorithms: Meta-Variational Quantum Thermalizer (Meta-VQT) and Neural Network Meta-VQT (NN-Meta VQT) for efficient thermal state preparation of parametrized Hamiltonians on Noisy Intermediate-Scale Quantum (NISQ) devices. Meta-VQT utilizes a fully quantum ansatz, while NN-Meta VQT integrates a quantum-classical hybrid architecture. Both leverage collective optimization over training sets to generalize Gibbs state preparation to unseen parameters. We validate our methods on up to 8-qubit Transverse Field Ising Model and the 2-qubit Heisenberg model with all field terms, demonstrating efficient thermal state generation beyond training data. For larger systems, we show that our meta-learned parameters, when combined with an appropriately designed ansatz, serve as warm-start initializations, significantly outperforming random initializations in the optimization tasks. Furthermore, a 3-qubit Kitaev ring example showcases our algorithm’s effectiveness across finite-temperature crossover regimes. Finally, we apply our algorithms to train a Quantum Boltzmann Machine (QBM) on a 2-qubit Heisenberg model with all field terms, achieving enhanced training efficiency, improved Gibbs state accuracy, and a 30-fold runtime speedup over existing techniques such as variational quantum imaginary time (VarQITE)-based QBM, highlighting the scalability and practicality of meta-algorithm-based QBMs.
[LG-42] Constructing material network representations for intelligent amorphous alloys design
Link: https://arxiv.org/abs/2507.16336
Authors: S.-Y. Zhang,J. Tian,S.-L. Liu,H.-M. Zhang,H.-Y. Bai,Y.-C. Hu,W.-H. Wang
Subjects: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Computational Complexity (cs.CC); Machine Learning (cs.LG)
Comments: 5 figures
Abstract:Designing high-performance amorphous alloys is demanding for various applications, but this process relies heavily on empirical laws and countless trial-and-error attempts. The high-cost and low-efficiency nature of the traditional strategies prevents effective sampling in the enormous material space. Here, we propose material networks to accelerate the discovery of binary and ternary amorphous alloys. The network topologies reveal hidden material candidates that were obscured by traditional tabular data representations. By scrutinizing the amorphous alloys synthesized in different years, we construct dynamical material networks to track the history of the alloy discovery. We find that some innovative materials designed in the past were encoded in the networks, demonstrating their predictive power in guiding new alloy design. These material networks show physical similarities with several real-world networks in our daily lives. Our findings pave a new way for intelligent materials design, especially for complex alloys.
[LG-43] Physics-Driven Neural Network for Solving Electromagnetic Inverse Scattering Problems
Link: https://arxiv.org/abs/2507.16321
Authors: Yutong Du,Zicheng Liu,Bazargul Matkerim,Changyou Li,Yali Zong,Bo Qi,Jingwei Kou
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:
Abstract:In recent years, deep learning-based methods have been proposed for solving inverse scattering problems (ISPs), but most of them heavily rely on data and suffer from limited generalization capabilities. In this paper, a new solving scheme is proposed where the solution is iteratively updated following the updating of the physics-driven neural network (PDNN), the hyperparameters of which are optimized by minimizing a loss function that incorporates the constraints from the collected scattered fields and the prior information about scatterers. Unlike data-driven neural network solvers, PDNN is trained using only the collected scattered fields and the computed scattered fields corresponding to predicted solutions, thus avoiding the generalization problem. Moreover, to improve imaging efficiency, the subregion enclosing the scatterers is identified. Numerical and experimental results demonstrate that the proposed scheme has high reconstruction accuracy and strong stability, even when dealing with composite lossy scatterers.
[LG-44] PAC Off-Policy Prediction of Contextual Bandits
Link: https://arxiv.org/abs/2507.16236
Authors: Yilong Wan,Yuqiang Li,Xianyi Wu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:This paper investigates off-policy evaluation in contextual bandits, aiming to quantify the performance of a target policy using data collected under a different and potentially unknown behavior policy. Recently, methods based on conformal prediction have been developed to construct reliable prediction intervals that guarantee marginal coverage in finite samples, making them particularly suited for safety-critical applications. To further achieve coverage conditional on a given offline data set, we propose a novel algorithm that constructs probably approximately correct prediction intervals. Our method builds upon a PAC-valid conformal prediction framework, and we strengthen its theoretical guarantees by establishing PAC-type bounds on coverage. We analyze both finite-sample and asymptotic properties of the proposed method, and compare its empirical performance with existing methods in simulations.
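For readers unfamiliar with the baseline the abstract builds on, here is a minimal sketch of plain split conformal prediction, which gives the marginal finite-sample coverage mentioned above. It is not the paper's PAC algorithm and it ignores the off-policy weighting entirely. The function name and the use of absolute residuals as conformity scores are assumptions made for illustration.

```python
import math


def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Symmetric prediction interval from held-out calibration residuals.

    cal_residuals: (y_i - prediction_i) on a calibration set not used for fitting.
    prediction:    point prediction for the new input.
    alpha:         target miscoverage, so marginal coverage is >= 1 - alpha.
    """
    n = len(cal_residuals)
    # Rank of the conformal quantile: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # interval carries the guarantee.
        return (-math.inf, math.inf)
    q = sorted(abs(r) for r in cal_residuals)[k - 1]
    return (prediction - q, prediction + q)


# Hypothetical calibration residuals and a point prediction of 10.0.
print(split_conformal_interval([1.0, -3.0, 2.0], 10.0, alpha=0.5))
```

The guarantee here holds on average over the draw of the calibration set. The abstract's contribution is a stronger statement that holds, with high probability, conditional on the particular offline data set in hand.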
[LG-45] Toward Routine CSP of Pharmaceuticals: A Fully Automated Protocol Using Neural Network Potentials
Link: https://arxiv.org/abs/2507.16218
Authors: Zachary L. Glick,Derek P. Metcalf,Scott F. Swarthout
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments:
Abstract:Crystal structure prediction (CSP) is a useful tool in pharmaceutical development for identifying and assessing risks associated with polymorphism, yet widespread adoption has been hindered by high computational costs and the need for both manual specification and expert knowledge to achieve useful results. Here, we introduce a fully automated, high-throughput CSP protocol designed to overcome these barriers. The protocol’s efficiency is driven by Lavo-NN, a novel neural network potential (NNP) architected and trained specifically for pharmaceutical crystal structure generation and ranking. This NNP-driven crystal generation phase is integrated into a scalable cloud-based workflow. We validate this CSP protocol on an extensive retrospective benchmark of 49 unique molecules, almost all of which are drug-like, successfully generating structures that match all 110 Z’ = 1 experimental polymorphs. The average CSP in this benchmark is performed with approximately 8.4k CPU hours, which is a significant reduction compared to other protocols. The practical utility of the protocol is further demonstrated through case studies that resolve ambiguities in experimental data and a semi-blinded challenge that successfully identifies and ranks polymorphs of three modern drugs from powder X-ray diffraction patterns alone. By significantly reducing the required time and cost, the protocol enables CSP to be routinely deployed earlier in the drug discovery pipeline, such as during lead optimization. Rapid turnaround times and high throughput also enable CSP that can be run in parallel with experimental screening, providing chemists with real-time insights to guide their work in the lab.
[LG-46] Recursive Equations For Imputation Of Missing Not At Random Data With Sparse Pattern Support
Link: https://arxiv.org/abs/2507.16107
Authors: Trung Phung,Kyle Reese,Ilya Shpitser,Rohit Bhattacharya
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 45 pages
Abstract:A common approach for handling missing values in data analysis pipelines is multiple imputation via software packages such as MICE (Van Buuren and Groothuis-Oudshoorn, 2011) and Amelia (Honaker et al., 2011). These packages typically assume the data are missing at random (MAR), and impose parametric or smoothing assumptions upon the imputing distributions in a way that allows imputation to proceed even if not all missingness patterns have support in the data. Such assumptions are unrealistic in practice, and induce model misspecification bias on any analysis performed after such imputation. In this paper, we provide a principled alternative. Specifically, we develop a new characterization for the full data law in graphical models of missing data. This characterization is constructive, is easily adapted for the calculation of imputation distributions for both MAR and MNAR (missing not at random) mechanisms, and is able to handle lack of support for certain patterns of missingness. We use this characterization to develop a new imputation algorithm – Multivariate Imputation via Supported Pattern Recursion (MISPR) – which uses Gibbs sampling, by analogy with the Multivariate Imputation with Chained Equations (MICE) algorithm, but which is consistent under both MAR and MNAR settings, and is able to handle missing data patterns with no support without imposing additional assumptions beyond those already imposed by the missing data model itself. In simulations, we show MISPR obtains comparable results to MICE when data are MAR, and superior, less biased results when data are MNAR. Our characterization and imputation algorithm based on it are a step towards making principled missing data methods more practical in applied settings, where the data are likely both MNAR and sufficiently high dimensional to yield missing data patterns with no support at available sample sizes. 
[LG-47] Is memory all you need? Data-driven Mori-Zwanzig modeling of Lagrangian particle dynamics in turbulent flows
Link: https://arxiv.org/abs/2507.16058
Authors: Xander de Wit,Alessandro Gabbana,Michael Woodward,Yen Ting Lin,Federico Toschi,Daniel Livescu
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)
Comments:
Abstract:The dynamics of Lagrangian particles in turbulence play a crucial role in mixing, transport, and dispersion processes in complex flows. Their trajectories exhibit highly non-trivial statistical behavior, motivating the development of surrogate models that can reproduce these trajectories without incurring the high computational cost of direct numerical simulations of the full Eulerian field. This task is particularly challenging because reduced-order models typically lack access to the full set of interactions with the underlying turbulent field. Novel data-driven machine learning techniques can be very powerful in capturing and reproducing complex statistics of the reduced-order/surrogate dynamics. In this work, we show how one can learn a surrogate dynamical system that is able to evolve a turbulent Lagrangian trajectory in a way that is point-wise accurate for short-time predictions (with respect to Kolmogorov time) and stable and statistically accurate at long times. This approach is based on the Mori–Zwanzig formalism, which prescribes a mathematical decomposition of the full dynamical system into resolved dynamics that depend on the current state and the past history of a reduced set of observables and the unresolved orthogonal dynamics due to unresolved degrees of freedom of the initial state. We show how by training this reduced order model on a point-wise error metric on short time-prediction, we are able to correctly learn the dynamics of the Lagrangian turbulence, such that also the long-time statistical behavior is stably recovered at test time. This opens up a range of new applications, for example, for the control of active Lagrangian agents in turbulence.
[LG-48] Radiological and Biological Dictionary of Radiomics Features: Addressing Understandable AI Issues in Personalized Breast Cancer; Dictionary Version BM1.0
Link: https://arxiv.org/abs/2507.16041
Authors: Arman Gorji,Nima Sanati,Amir Hossein Pouria,Somayeh Sadat Mehrnia,Ilker Hacihaliloglu,Arman Rahmim,Mohammad R. Salmanpour
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:
Abstract:Radiomics-based AI models show promise for breast cancer diagnosis but often lack interpretability, limiting clinical adoption. This study addresses the gap between radiomic features (RF) and the standardized BI-RADS lexicon by proposing a dual-dictionary framework. First, a Clinically-Informed Feature Interpretation Dictionary (CIFID) was created by mapping 56 RFs to BI-RADS descriptors (shape, margin, internal enhancement) through literature and expert review. The framework was applied to classify triple-negative breast cancer (TNBC) versus non-TNBC using dynamic contrast-enhanced MRI from a multi-institutional cohort of 1,549 patients. We trained 27 machine learning classifiers with 27 feature selection methods. SHapley Additive exPlanations (SHAP) were used to interpret predictions and generate a complementary Data-Driven Feature Interpretation Dictionary (DDFID) for 52 additional RFs. The best model, combining Variance Inflation Factor (VIF) selection with Extra Trees Classifier, achieved an average cross-validation accuracy of 0.83. Key predictive RFs aligned with clinical knowledge: higher Sphericity (round/oval shape) and lower Busyness (more homogeneous enhancement) were associated with TNBC. The framework confirmed known imaging biomarkers and uncovered novel, interpretable associations. This dual-dictionary approach (BM1.0) enhances AI model transparency and supports the integration of RFs into routine breast cancer diagnosis and personalized care.
[LG-49] Minor Embedding for Quantum Annealing with Reinforcement Learning
Link: https://arxiv.org/abs/2507.16004
Authors: Riccardo Nembrini,Maurizio Ferrari Dacrema,Paolo Cremonesi
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:
Abstract:Quantum Annealing (QA) is a quantum computing paradigm for solving combinatorial optimization problems formulated as Quadratic Unconstrained Binary Optimization (QUBO) problems. An essential step in QA is minor embedding, which maps the problem graph onto the sparse topology of the quantum processor. This process is computationally expensive and scales poorly with increasing problem size and hardware complexity. Existing heuristics are often developed for specific problem graphs or hardware topologies and are difficult to generalize. Reinforcement Learning (RL) offers a promising alternative by treating minor embedding as a sequential decision-making problem, where an agent learns to construct minor embeddings by iteratively mapping the problem variables to the hardware qubits. We propose an RL-based approach to minor embedding using a Proximal Policy Optimization agent, testing its ability to embed both fully connected and randomly generated problem graphs on two hardware topologies, Chimera and Zephyr. The results show that our agent consistently produces valid minor embeddings using a reasonably efficient number of qubits, in particular on the more modern Zephyr topology. Our proposed approach is also able to scale to moderate problem sizes and adapts well to different graph structures, highlighting RL’s potential as a flexible and general-purpose framework for minor embedding in QA.
[LG-50] Automated Design of Structured Variational Quantum Circuits with Reinforcement Learning
Link: https://arxiv.org/abs/2507.16001
Authors: Gloria Turati,Simone Foderà,Riccardo Nembrini,Maurizio Ferrari Dacrema,Paolo Cremonesi
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:
Abstract:Variational Quantum Algorithms (VQAs) are among the most promising approaches for leveraging near-term quantum hardware, yet their effectiveness strongly depends on the design of the underlying circuit ansatz, which is typically constructed with heuristic methods. In this work, we represent the synthesis of variational quantum circuits as a sequential decision-making problem, where gates are added iteratively in order to optimize an objective function, and we introduce two reinforcement learning-based methods, RLVQC Global and RLVQC Block, tailored to combinatorial optimization problems. RLVQC Block creates ansatzes that generalize the Quantum Approximate Optimization Algorithm (QAOA) by discovering a two-qubit block that is applied to all the interacting qubit pairs, while RLVQC Global further generalizes the ansatz and adds gates unconstrained by the structure of the interacting qubits. Both methods adopt the Proximal Policy Optimization (PPO) algorithm and use empirical measurement outcomes as state observations to guide the agent. We evaluate the proposed methods on a broad set of QUBO instances derived from classical graph-based optimization problems. Our results show that both RLVQC methods exhibit strong results, with RLVQC Block consistently outperforming QAOA and generally surpassing RLVQC Global. While RLVQC Block produces circuits with depth comparable to QAOA, the Global variant is instead able to find significantly shorter ones. These findings suggest that reinforcement learning methods can be an effective tool to discover new ansatz structures tailored for specific problems and that the most effective circuit design strategy lies between rigid predefined architectures and completely unconstrained ones, offering a favourable trade-off between structure and adaptability.
[LG-51] Generative AI Models for Learning Flow Maps of Stochastic Dynamical Systems in Bounded Domains
Link: https://arxiv.org/abs/2507.15990
Authors: Minglei Yang,Yanfang Liu,Diego del-Castillo-Negrete,Yanzhao Cao,Guannan Zhang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Simulating stochastic differential equations (SDEs) in bounded domains presents significant computational challenges due to particle exit phenomena, which require accurate modeling of interior stochastic dynamics and boundary interactions. Despite the success of machine learning-based methods in learning SDEs, existing learning methods are not applicable to SDEs in bounded domains because they cannot accurately capture the particle exit dynamics. We present a unified hybrid data-driven approach that combines a conditional diffusion model with an exit prediction neural network to capture both interior stochastic dynamics and boundary exit phenomena. Our ML model consists of two major components: a neural network that learns exit probabilities using binary cross-entropy loss with rigorous convergence guarantees, and a training-free diffusion model that generates state transitions for non-exiting particles using closed-form score functions. The two components are integrated through a probabilistic sampling algorithm that determines particle exit at each time step and generates appropriate state transitions. The performance of the proposed approach is demonstrated via three test cases: a one-dimensional simplified problem for theoretical verification, a two-dimensional advection-diffusion problem in a bounded domain, and a three-dimensional problem of interest to magnetically confined fusion plasmas.
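The two-component sampling loop described in the abstract can be sketched as follows. This is a structural illustration only: `exit_prob` stands in for the trained exit-prediction network and `sample_transition` for the conditional diffusion sampler. Both are hypothetical callables supplied by the caller, and the toy stand-ins in the usage lines are not from the paper.

```python
import random


def simulate_path(x0, n_steps, exit_prob, sample_transition, rng):
    """Roll a particle forward until it exits or n_steps are taken.

    exit_prob(x)              -> probability that the particle leaves the
                                 domain during the next step (hypothetical).
    sample_transition(x, rng) -> next state, conditional on not exiting
                                 (hypothetical).
    Returns (path, exited).
    """
    x = x0
    path = [x]
    for _ in range(n_steps):
        # Step 1: Bernoulli draw decides whether the particle exits.
        if rng.random() < exit_prob(x):
            return path, True
        # Step 2: otherwise sample the interior state transition.
        x = sample_transition(x, rng)
        path.append(x)
    return path, False


# Toy stand-ins: exit becomes more likely near |x| = 1; Gaussian steps inside.
rng = random.Random(0)
path, exited = simulate_path(
    0.0,
    100,
    exit_prob=lambda x: min(1.0, abs(x)),
    sample_transition=lambda x, r: x + r.gauss(0.0, 0.1),
    rng=rng,
)
print(len(path), exited)
```

Keeping the exit decision separate from the transition sampler means the transition model only ever has to represent particles that stay inside the domain, which matches the division of labor the abstract describes.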
[LG-52] Efficient dataset construction using active learning and uncertainty-aware neural networks for plasma turbulent transport surrogate models
Link: https://arxiv.org/abs/2507.15976
Authors: Aaron Ho(1),Lorenzo Zanisi(2),Bram de Leeuw(3),Vincent Galvan(1),Pablo Rodriguez-Fernandez(1),Nathaniel T. Howard(1) ((1) MIT Plasma Science and Fusion Center, Cambridge, USA, (2) UKAEA Culham Centre for Fusion Energy, Abingdon, UK, (3) Radboud University, Nijmegen, Netherlands)
Subjects: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
Comments:
Abstract:This work demonstrates a proof-of-principle for using uncertainty-aware architectures, in combination with active learning techniques and an in-the-loop physics simulation code as a data labeller, to construct efficient datasets for data-driven surrogate model generation. Building off of a previous proof-of-principle successfully demonstrating training set reduction on static pre-labelled datasets, using the ADEPT framework, this strategy was applied again to the plasma turbulent transport problem within tokamak fusion plasmas, specifically the QuaLiKiz quasilinear electrostatic gyrokinetic turbulent transport code. While QuaLiKiz provides relatively fast evaluations, this study specifically targeted small datasets to serve as a proxy for more expensive codes, such as CGYRO or GENE. The newly implemented algorithm uses the SNGP architecture for the classification component of the problem and the BNN-NCP architecture for the regression component, training models for all turbulent modes (ITG, TEM, ETG) and all transport fluxes (Q_e, Q_i, \Gamma_e, \Gamma_i, and \Pi_i) described by the general QuaLiKiz output. With 45 active learning iterations, moving from a small initial training set of 10^2 to a final set of 10^4, the resulting models reached an F_1 classification performance of ~0.8 and an R^2 regression performance of ~0.75 on an independent test set across all outputs. This extrapolates to reaching the same performance and efficiency as the previous ADEPT pipeline, although on a problem with 1 extra input dimension. While the improvement rate achieved in this implementation diminishes faster than expected, the overall technique is formulated with components that can be upgraded and generalized to many surrogate modeling applications beyond plasma turbulent transport predictions.
[LG-53] MSGM: A Multi-Scale Spatiotemporal Graph Mamba for EEG Emotion Recognition
Link: https://arxiv.org/abs/2507.15914
Authors: Hanwen Liu,Yifeng Gong,Zuwei Yan,Zeheng Zhuang,Jiaxuan Lu
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:
Abstract:EEG-based emotion recognition struggles with capturing multi-scale spatiotemporal dynamics and ensuring computational efficiency for real-time applications. Existing methods often oversimplify temporal granularity and spatial hierarchies, limiting accuracy. To overcome these challenges, we propose the Multi-Scale Spatiotemporal Graph Mamba (MSGM), a novel framework integrating multi-window temporal segmentation, bimodal spatial graph modeling, and efficient fusion via the Mamba architecture. By segmenting EEG signals across diverse temporal scales and constructing global-local graphs with neuroanatomical priors, MSGM effectively captures fine-grained emotional fluctuations and hierarchical brain connectivity. A multi-depth Graph Convolutional Network (GCN) and token embedding fusion module, paired with Mamba’s state-space modeling, enable dynamic spatiotemporal interaction at linear complexity. Notably, with just one MSST-Mamba layer, MSGM surpasses leading methods in the field on the SEED, THU-EP, and FACED datasets, outperforming baselines in subject-independent emotion classification while achieving robust accuracy and millisecond-level inference on the NVIDIA Jetson Xavier NX.
[LG-54] Structural DID with ML: Theory, Simulation, and a Roadmap for Applied Research
Link: https://arxiv.org/abs/2507.15899
Authors: Yile Yu, Anzhi Xu, Yi Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 45 pages, 29 figures
Abstract:Causal inference in observational panel data has become a central concern in economics, policy analysis, and the broader social sciences. To address the core contradiction where traditional difference-in-differences (DID) struggles with high-dimensional confounding variables in observational panel data, while machine learning (ML) lacks causal structure interpretability, this paper proposes an innovative framework called S-DIDML that integrates structural identification with high-dimensional […]. Building upon the structure of traditional DID methods, S-DIDML employs structured residual orthogonalization techniques (Neyman orthogonality + cross-fitting) to retain the group-time treatment effect (ATT) identification structure while resolving high-dimensional covariate interference […]. It designs a dynamic heterogeneity estimation module combining causal forests and semi-parametric models to capture spatiotemporal heterogeneity […]. The framework establishes a complete modular application process with standardized Stata implementation […]. The introduction of S-DIDML enriches methodological research on DID and DDML innovations, shifting causal inference from method stacking to architecture […]. This advancement enables social sciences to precisely identify policy-sensitive groups and optimize resource allocation. The framework provides replicable evaluation tools, decision optimization references, and methodological paradigms for complex intervention scenarios such as digital transformation policies and environmental regulations.
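For readers unfamiliar with the baseline that this abstract builds on, the classical two-group, two-period difference-in-differences estimate can be sketched as below. This is the textbook estimator only, not the S-DIDML method: it includes no covariates, orthogonalization, or cross-fitting, and the function name is hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Classical 2x2 difference-in-differences estimate of the ATT.

    Each argument is a list of outcomes for one group in one period.
    The control group's change over time is subtracted from the treated
    group's change, which removes a shared time trend under the
    parallel-trends assumption.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change
```

With treated outcomes rising from an average of 11 to 16 and control outcomes rising from 9 to 10, the estimate is 5 - 1 = 4: only the part of the treated group's change that the control group does not share is attributed to the policy.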
Information Retrieval
[IR-0] Biases in LLM-Generated Musical Taste Profiles for Recommendation
Link: https://arxiv.org/abs/2507.16708
Authors: Bruno Sguerra, Elena V. Epure, Harin Lee, Manuel Moussallam
Subjects: Information Retrieval (cs.IR)
Abstract:One particularly promising use case of Large Language Models (LLMs) for recommendation is the automatic generation of Natural Language (NL) user taste profiles from consumption data. These profiles offer interpretable and editable alternatives to opaque collaborative filtering representations, enabling greater transparency and user control. However, it remains unclear whether users consider these profiles to be an accurate representation of their taste, which is crucial for trust and usability. Moreover, because LLMs inherit societal and data-driven biases, profile quality may systematically vary across user and item characteristics. In this paper, we study this issue in the context of music streaming, where personalization is challenged by a large and culturally diverse catalog. We conduct a user study in which participants rate NL profiles generated from their own listening histories. We analyze whether identification with the profiles is biased by user attributes (e.g., mainstreamness, taste diversity) and item features (e.g., genre, country of origin). We also compare these patterns to those observed when using the profiles in a downstream recommendation task. Our findings highlight both the potential and limitations of scrutable, LLM-based profiling in personalized systems.
[IR-1] Generating Search Explanations using Large Language Models SIGIR2025
Link: https://arxiv.org/abs/2507.16692
Authors: Arif Laksito, Mark Stevenson
Subjects: Information Retrieval (cs.IR)
Comments: Extended Abstract - Workshop on Explainability in Information Retrieval (WExIR), SIGIR 2025
Abstract:Aspect-oriented explanations in search results are typically concise text snippets placed alongside retrieved documents that help users locate relevant information efficiently. While Large Language Models (LLMs) have demonstrated exceptional performance for a range of problems, their potential to generate explanations for search results has not been explored. This study addresses that gap by leveraging both encoder-decoder and decoder-only LLMs to generate explanations for search results. The generated explanations are consistently more accurate and plausible than those produced by a range of baseline models.
[IR-2] Enhancing patent retrieval using automated patent summarization SIGIR2025
Link: https://arxiv.org/abs/2507.16371
Authors: Eleni Kamateri, Renukswamy Chikkamath, Michail Salampasis, Linda Andersson, Markus Endres
Subjects: Information Retrieval (cs.IR)
Comments: This version was submitted and accepted for publication at the 6th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech 2025), held in conjunction with SIGIR 2025. A revised and polished version, incorporating reviewers’ feedback, will follow.
Abstract:Effective query formulation is a key challenge in long-document Information Retrieval (IR). This challenge is particularly acute in domain-specific contexts like patent retrieval, where documents are lengthy, linguistically complex, and encompass multiple interrelated technical topics. In this work, we present the application of recent extractive and abstractive summarization methods for generating concise, purpose-specific summaries of patent documents. We further assess the utility of these automatically generated summaries as surrogate queries across three benchmark patent datasets and compare their retrieval performance against conventional approaches that use entire patent sections. Experimental results show that summarization-based queries significantly improve prior-art retrieval effectiveness, highlighting their potential as an efficient alternative to traditional query formulation techniques.
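The idea of using a summary as a surrogate query can be illustrated with a very small extractive summarizer. The abstract does not name the summarization methods used, so the frequency-based sentence scoring below is a swapped-in toy technique, and `extractive_query` and its parameters are hypothetical.

```python
import re
from collections import Counter


def extractive_query(text, max_sentences=2):
    """Build a short query from the highest-scoring sentences of `text`.

    Sentences are scored by the average document frequency of their
    words longer than three characters (a crude stop-word filter).
    The selected sentences are returned in their original order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

The returned string would then be passed to a retrieval system in place of a full patent section; sentences dominated by the document's recurring technical terms are kept, while off-topic sentences are dropped.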
[IR-3] Reinforce Lifelong Interaction Value of User-Author Pairs for Large-Scale Recommendation Systems
Link: https://arxiv.org/abs/2507.16253
Authors: Yisha Li, Lexi Gao, Jingxin Liu, Xiang Gao, Xin Li, Haiyang Lu, Liyin Hong
Subjects: Information Retrieval (cs.IR)
Abstract:Recommendation systems (RS) help users find content of interest and connect authors with their target audience. Most research in RS tends to focus either on accurately predicting users’ immediate feedback (such as click-through rate) or on improving users’ long-term engagement. However, this work ignores the influence on authors and the lifelong interaction value (LIV) of user-author pairs, which is particularly crucial for the prosperity of the social community on short-video platforms. Reinforcement learning (RL) can optimize long-term benefits and has been widely applied in RS. In this paper, we introduce RL to Reinforce Lifelong Interaction Value of User-Author pairs (RLIV-UA) based on each interaction of UA pairs. To address the long intervals between UA interactions and the large scale of the UA space, we propose a novel Sparse Cross-Request Interaction Markov Decision Process (SCRI-MDP) and introduce an Adjacent State Approximation (ASA) method to construct RL training samples. Additionally, we introduce Multi-Task Critic Learning (MTCL) to capture the progressive nature of UA interactions (click - follow - gift), where denser interaction signals are leveraged to compensate for the learning of sparse labels. Finally, an auxiliary supervised learning task is designed to enhance the convergence of the RLIV-UA model. In offline experiments and online A/B tests, the RLIV-UA model achieves both higher user satisfaction and higher platform profits than the compared methods.
[IR-4] Scaling Recommender Transformers to One Billion Parameters
Link: https://arxiv.org/abs/2507.15994
Authors: Kirill Khrylchenko, Artem Matveev, Sergei Makeev, Vladimir Baikalov
Subjects: Information Retrieval (cs.IR)
Comments: To be submitted
Abstract:While large transformer models have been successfully used in many real-world applications such as natural language processing, computer vision, and speech processing, scaling transformers for recommender systems remains a challenging problem. Recently, the Generative Recommenders framework was proposed to scale beyond typical Deep Learning Recommendation Models (DLRMs). Reformulating recommendation as a sequential transduction task improved scaling properties in terms of compute. Nevertheless, the largest encoder configuration reported by the HSTU authors amounts to only ~176 million parameters, which is considerably smaller than the hundreds of billions or even trillions of parameters common in modern language models. In this work, we present a recipe for training large transformer recommenders with up to a billion parameters. We show that autoregressive learning on user histories naturally decomposes into two subtasks, feedback prediction and next-item prediction, and demonstrate that such a decomposition scales effectively across a wide range of transformer sizes. Furthermore, we report a successful deployment of our proposed architecture on a large-scale music platform serving millions of users. According to our online A/B tests, this new model increases total listening time by +2.26% and raises the likelihood of user likes by +6.37%, constituting (to our knowledge) the largest improvement in recommendation quality reported for any deep learning-based system in the platform’s history.
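One way to read the decomposition into next-item prediction and feedback prediction is sketched below. This is an assumed interpretation for illustration, not the authors' training recipe: the history format, the function name, and the exact conditioning of each subtask are hypothetical.

```python
def build_training_examples(history):
    """Turn one user history into examples for two autoregressive subtasks.

    history: chronological list of (item_id, feedback) pairs.
    Next-item prediction: given the events so far, predict the next item.
    Feedback prediction: given the events so far AND the next item,
    predict the user's feedback on that item.
    """
    next_item_examples = []
    feedback_examples = []
    for t in range(1, len(history)):
        prefix = history[:t]           # everything observed before step t
        item, feedback = history[t]    # the event to be predicted
        next_item_examples.append((prefix, item))
        feedback_examples.append((prefix, item, feedback))
    return next_item_examples, feedback_examples
```

Each position in the history contributes one example to both subtasks, so a single pass over the sequence supervises which item comes next and how the user reacts to it.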