This post lists the latest papers retrieved from Arxiv.org on 2025-07-24, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily paper digest by email, leave your email address in the comments.
Note: the paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.
Table of Contents
Overview (2025-07-24)
474 papers are updated today, including:
- Natural Language Processing: 47 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 140 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 118 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 131 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] Pretraining on the Test Set Is No Longer All You Need: A Debate-Driven Approach to QA Benchmarks
[Quick Read]: This paper targets three problems that frontier language models now face on standard question-answering (QA) benchmarks: data contamination, memorization, and the rising cost of dataset construction. The authors propose a debate-driven evaluation paradigm that converts any existing QA dataset into structured adversarial debates: one model defends the official answer while another constructs and defends an alternative, and a judge model blind to the correct answer adjudicates. Multi-round argumentation substantially raises difficulty and penalizes shallow memorization, while reusing existing QA items keeps curation costs low. The key contributions are (1) a pipeline that systematically converts QA tasks into debate-based evaluations, and (2) a public benchmark (a subset of MMLU-Pro) with standardized protocols and reference models. Experiments show the method resists data contamination and that even weaker judge models can reliably distinguish stronger debaters, suggesting a scalable, low-cost, and sustainable way to measure the genuine reasoning ability of advanced language models.
Link: https://arxiv.org/abs/2507.17747
Authors: Linbo Cao, Jinman Zhao
Affiliations: University of Waterloo; University of Toronto
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 7 figures. Accepted to COLM 2025. Code available at: this http URL
Abstract: As frontier language models increasingly saturate standard QA benchmarks, concerns about data contamination, memorization, and escalating dataset creation costs persist. We propose a debate-driven evaluation paradigm that transforms any existing QA dataset into structured adversarial debates, where one model is given the official answer to defend and another constructs and defends an alternative answer, adjudicated by a judge model blind to the correct solution. By forcing multi-round argumentation, this approach substantially increases difficulty while penalizing shallow memorization, yet reuses QA items to reduce curation overhead. We make two main contributions: (1) an evaluation pipeline to systematically convert QA tasks into debate-based assessments, and (2) a public benchmark that demonstrates our paradigm’s effectiveness on a subset of MMLU-Pro questions, complete with standardized protocols and reference models. Empirical results validate the robustness of the method and its effectiveness against data contamination: a Llama 3.1 model fine-tuned on test questions showed dramatic accuracy improvements (50% to 82%) but performed worse in debates. Results also show that even weaker judges can reliably differentiate stronger debaters, highlighting how debate-based evaluation can scale to future, more capable systems while maintaining a fraction of the cost of creating new benchmarks. Overall, our framework underscores that “pretraining on the test set is no longer all you need,” offering a sustainable path for measuring the genuine reasoning ability of advanced language models.
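As a rough illustration of the QA-to-debate conversion, the loop below pits two debaters against a blind judge. The `run_debate` driver, its field names, and the stub callables are our own hypothetical scaffolding (in practice `debater` and `judge` would be LLM calls), not the authors' pipeline.

```python
from typing import Callable, Dict, List

def run_debate(question: str,
               official: str,
               alternative: str,
               debater: Callable[[str, str, List[str]], str],
               judge: Callable[[str, List[str]], int],
               rounds: int = 2) -> Dict:
    """One model defends the official answer, another defends an
    alternative; a judge blind to which side holds the official answer
    adjudicates (0 = defender of the official answer wins, 1 = challenger)."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("DEFENDER: " + debater(question, official, transcript))
        transcript.append("CHALLENGER: " + debater(question, alternative, transcript))
    verdict = judge(question, transcript)  # the judge never sees the answer key
    return {"transcript": transcript, "official_side_wins": verdict == 0}
```

Any existing QA item (question plus official answer) can be fed through this driver once an alternative answer is constructed, which is how the paradigm reuses datasets without new curation.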
[NLP-1] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
[Quick Read]: This paper addresses the difficulty of defining reliable reward signals for reinforcement learning (RL) on real-world tasks that lack a single unambiguous ground truth, especially when post-training language models: traditional preference-based methods rely on opaque reward functions that are hard to interpret and prone to spurious correlations. The key idea, Rubrics as Rewards (RaR), uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO (Group Relative Policy Optimization), significantly improving alignment with human preferences and sustaining robust performance across model scales.
Link: https://arxiv.org/abs/2507.17746
Authors: Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Bing Liu, Sean Hendryx
Affiliations: Scale AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract: Extending Reinforcement Learning with Verifiable Rewards (RLVR) to real-world tasks often requires balancing objective and subjective evaluation criteria. However, many such tasks lack a single, unambiguous ground truth, making it difficult to define reliable reward signals for post-training language models. While traditional preference-based methods offer a workaround, they rely on opaque reward functions that are difficult to interpret and prone to spurious correlations. We introduce Rubrics as Rewards (RaR), a framework that uses structured, checklist-style rubrics as interpretable reward signals for on-policy training with GRPO. Our best RaR method yields up to a 28% relative improvement on HealthBench-1k compared to simple Likert-based approaches, while matching or surpassing the performance of reward signals derived from expert-written references. By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences and sustain robust performance across model scales.
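The checklist-style scoring idea can be sketched as a weight-normalized fraction of satisfied rubric items. This is a minimal illustration under our own assumptions: in RaR each check would be judged by an LLM, and the paper's exact aggregation is not reproduced here.

```python
from typing import Callable, List, Tuple

def rubric_reward(response: str,
                  rubric: List[Tuple[float, Callable[[str], bool]]]) -> float:
    """Score a response as the weight-normalized fraction of satisfied
    rubric items; each item is (weight, check), where check stands in for
    a judgment an LLM judge would make in practice."""
    total = sum(w for w, _ in rubric)
    earned = sum(w for w, check in rubric if check(response))
    return earned / total if total else 0.0
```

Because every item is an explicit, named criterion, the resulting reward is inspectable in a way a scalar preference model is not.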
[NLP-2] Megrez2 Technical Report
[Quick Read]: This paper tackles the problems of large parameter counts, heavy compute, and slow inference when deploying large language models on devices, particularly in resource-constrained settings. The key idea of the Megrez2 architecture is a cross-layer expert sharing mechanism that reuses expert modules across adjacent Transformer layers, sharply cutting the total parameter count while preserving model capacity; combined with pre-gated routing for memory-efficient expert loading and faster inference, this yields a lightweight, deployment-friendly model without sacrificing performance.
Link: https://arxiv.org/abs/2507.17728
Authors: Boxun Li, Yadong Li, Zhiyuan Li, Congyi Liu, Weilin Liu, Guowei Niu, Zheyue Tan, Haiyang Xu, Zhuyu Yao, Tao Yuan, Dong Zhou, Yueqing Zhuang, Bo Zhao, Guohao Dai, Yu Wang
Affiliations: Infinigence-AI; Aalto University; Shanghai Jiao Tong University; Tsinghua University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We present Megrez2, a novel lightweight and high-performance language model architecture optimized for device native deployment. Megrez2 introduces a novel cross-layer expert sharing mechanism, which significantly reduces total parameter count by reusing expert modules across adjacent transformer layers while maintaining most of the model’s capacity. It also incorporates pre-gated routing, enabling memory-efficient expert loading and faster inference. As the first instantiation of the Megrez2 architecture, we introduce the Megrez2-Preview model, which is pre-trained on a 5-trillion-token corpus and further enhanced through supervised fine-tuning and reinforcement learning with verifiable rewards. With only 3B activated and 7.5B stored parameters, Megrez2-Preview demonstrates competitive or superior performance compared to larger models on a wide range of tasks, including language understanding, instruction following, mathematical reasoning, and code generation. These results highlight the effectiveness of the Megrez2 architecture to achieve a balance between accuracy, efficiency, and deployability, making it a strong candidate for real-world, resource-constrained applications.
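Cross-layer expert sharing reduces stored parameters roughly in proportion to the sharing span. A back-of-the-envelope accounting, with made-up layer and expert sizes rather than Megrez2's real configuration:

```python
import math

def stored_params(layers: int, experts: int, params_per_expert: int,
                  share_span: int = 1, non_expert_per_layer: int = 0) -> int:
    """Total stored parameters when every `share_span` adjacent layers
    reuse one set of expert modules (share_span=1 means no sharing).
    Non-expert parameters (attention, norms) remain per-layer."""
    expert_groups = math.ceil(layers / share_span)
    return expert_groups * experts * params_per_expert + layers * non_expert_per_layer
```

Sharing experts between pairs of adjacent layers halves the expert parameter budget while the per-layer activated compute is unchanged, which is consistent with the 3B-activated / 7.5B-stored split the abstract reports.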
[NLP-3] AI Telephone Surveying: Automating Quantitative Data Collection with an AI Interviewer
[Quick Read]: This paper addresses the tension between scalability and natural interaction in traditional quantitative surveying: how AI-driven telephone surveys can preserve methodological rigor while improving respondent experience. The key contribution is an AI telephone surveying system built on large language models (LLMs), automatic speech recognition (ASR), and speech synthesis that strictly follows quantitative-research best practices (question order randomization, answer order randomization, and exact wording). Two pilot surveys validate the system, showing that shorter instruments and a more responsive AI interviewer improve completion rates, reduce break-offs, and raise respondent satisfaction.
Link: https://arxiv.org/abs/2507.17718
Authors: Danny D. Leybzon, Shreyas Tirumala, Nishant Jain, Summer Gillen, Michael Jackson, Cameron McPhee, Jennifer Schmidt
Affiliations: VKL Research, Inc.; SSRS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:
Abstract: With the rise of voice-enabled artificial intelligence (AI) systems, quantitative survey researchers have access to a new data-collection mode: AI telephone surveying. By using AI to conduct phone interviews, researchers can scale quantitative studies while balancing the dual goals of human-like interactivity and methodological rigor. Unlike earlier efforts that used interactive voice response (IVR) technology to automate these surveys, voice AI enables a more natural and adaptive respondent experience as it is more robust to interruptions, corrections, and other idiosyncrasies of human speech. We built and tested an AI system to conduct quantitative surveys based on large language models (LLM), automatic speech recognition (ASR), and speech synthesis technologies. The system was specifically designed for quantitative research, and strictly adhered to research best practices like question order randomization, answer order randomization, and exact wording. To validate the system’s effectiveness, we deployed it to conduct two pilot surveys with the SSRS Opinion Panel and followed-up with a separate human-administered survey to assess respondent experiences. We measured three key metrics: the survey completion rates, break-off rates, and respondent satisfaction scores. Our results suggest that shorter instruments and more responsive AI interviewers may contribute to improvements across all three metrics studied.
[NLP-4] From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
[Quick Read]: This paper addresses the lack of objective, scalable, automated quality evaluation for AI-generated clinical notes that actually agrees with physician preferences; existing automated metrics often diverge from how clinicians judge notes. The key contribution is a pipeline that distills real user feedback into structured checklists that are interpretable, grounded in human feedback, and enforceable by LLM evaluators. Using deidentified data from over 21,000 clinical encounters, the feedback-derived checklist outperforms baselines in coverage, diversity, and predictive power for human ratings, and proves robust to quality-degrading perturbations and well aligned with clinician preferences, offering a practical, reliable evaluation methodology for AI clinical notes.
Link: https://arxiv.org/abs/2507.17717
Authors: Karen Zhou, John Giorgi, Pranav Mani, Peng Xu, Davis Liang, Chenhao Tan
Affiliations: University of Chicago; Abridge
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters, prepared in accordance with the HIPAA safe harbor standard, from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms baseline approaches in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, the checklist can help identify notes likely to fall below our chosen quality thresholds.
[NLP-5] TyDi QA-WANA: A Benchmark for Information-Seeking Question Answering in Languages of West Asia and North Africa
[Quick Read]: This paper addresses limited model generalization in multilingual question answering caused by missing cultural relevance and insufficient linguistic diversity. The key contribution is TyDi QA-WANA, a high-quality cross-lingual QA dataset of 28K examples covering 10 language varieties of West Asia and North Africa. All data were collected directly in the target languages, avoiding the cultural bias introduced by translation, and each question is paired with a long article, allowing a more realistic evaluation of models' ability to understand and reason over large text contexts and supporting progress on lower-resource languages.
Link: https://arxiv.org/abs/2507.17709
Authors: Parker Riley, Siamak Shakeri, Waleed Ammar, Jonathan H. Clark
Affiliations: Google; Holistic Intelligence for Global Good
Categories: Computation and Language (cs.CL)
Comments:
Abstract: We present TyDi QA-WANA, a question-answering dataset consisting of 28K examples divided among 10 language varieties of western Asia and northern Africa. The data collection process was designed to elicit information-seeking questions, where the asker is genuinely curious to know the answer. Each question is paired with an entire article that may or may not contain the answer; the relatively large size of the articles results in a task suitable for evaluating models’ abilities to utilize large text contexts in answering questions. Furthermore, the data was collected directly in each language variety, without the use of translation, in order to avoid issues of cultural relevance. We present performance of two baseline models, and release our code and data to facilitate further improvement by the research community.
[NLP-6] Towards Greater Leverage: Scaling Laws for Efficient Mixture-of-Experts Language Models
[Quick Read]: This paper addresses the difficulty of predicting model capacity when scaling large language models (LLMs) with Mixture-of-Experts (MoE) architectures, i.e., how to assess the computational-efficiency advantage conferred by a given MoE configuration (such as expert activation ratio and granularity). The key contribution is Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. A large-scale empirical study (training over 300 models up to 28B parameters) shows that EL is driven primarily by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. These findings are integrated into a unified scaling law that accurately predicts the EL of an MoE architecture from its configuration, providing theoretical and practical guidance for designing and scaling efficient MoE models.
Link: https://arxiv.org/abs/2507.17702
Authors: Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou
Affiliations: Ant Group
Categories: Computation and Language (cs.CL)
Comments:
Abstract: Mixture-of-Experts (MoE) has become a dominant architecture for scaling Large Language Models (LLMs) efficiently by decoupling total parameters from computational cost. However, this decoupling creates a critical challenge: predicting the model capacity of a given MoE configuration (e.g., expert activation ratio and granularity) remains an unresolved problem. To address this gap, we introduce Efficiency Leverage (EL), a metric quantifying the computational advantage of an MoE model over a dense equivalent. We conduct a large-scale empirical study, training over 300 models up to 28B parameters, to systematically investigate the relationship between MoE architectural configurations and EL. Our findings reveal that EL is primarily driven by the expert activation ratio and the total compute budget, both following predictable power laws, while expert granularity acts as a non-linear modulator with a clear optimal range. We integrate these discoveries into a unified scaling law that accurately predicts the EL of an MoE architecture based on its configuration. To validate our derived scaling laws, we designed and trained Ling-mini-beta, a pilot model for the Ling-2.0 series with only 0.85B active parameters, alongside a 6.1B dense model for comparison. When trained on an identical 1T high-quality token dataset, Ling-mini-beta matched the performance of the 6.1B dense model while consuming over 7x fewer computational resources, thereby confirming the accuracy of our scaling laws. This work provides a principled and empirically-grounded foundation for the scaling of efficient MoE models.
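The claimed power-law dependence of EL on the activation ratio and the compute budget has the shape below; the functional form is our reading of the abstract, and the coefficients are illustrative placeholders, not the paper's fitted values.

```python
def efficiency_leverage(activation_ratio: float, compute_budget: float,
                        a: float = 1.0, alpha: float = -0.5,
                        beta: float = 0.05) -> float:
    """Hypothetical EL(r, C) = a * r^alpha * C^beta, with alpha < 0 so
    that sparser activation (smaller r) yields more leverage over a
    dense equivalent; expert granularity would enter as a separate
    non-linear modulator, which this sketch omits."""
    return a * activation_ratio ** alpha * compute_budget ** beta
```

Under this form, a dense model (activation ratio 1, beta term fixed) has EL of a, and leverage grows as activation becomes sparser, matching the qualitative trend the study reports.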
[NLP-7] Who Attacks and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
[Quick Read]: This paper addresses how the high cost and poor scalability of traditional classification methods have limited empirical research on negative campaigning. The key idea is to use zero-shot large language models (LLMs) for cross-lingual classification of negative campaigning: on benchmark datasets in ten languages, the approach matches native-speaking human coders and clearly outperforms conventional supervised machine learning, enabling a large-scale, cross-cultural, replicable analysis of 18 million tweets posted by members of parliament in 19 European countries.
Link: https://arxiv.org/abs/2507.17636
Authors: Victor Hartman, Petter Törnberg
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Negative campaigning is a central feature of political competition, yet empirical research has been limited by the high cost and limited scalability of existing classification methods. This study makes two key contributions. First, it introduces zero-shot Large Language Models (LLMs) as a novel approach for cross-lingual classification of negative campaigning. Using benchmark datasets in ten languages, we demonstrate that LLMs achieve performance on par with native-speaking human coders and outperform conventional supervised machine learning approaches. Second, we leverage this novel method to conduct the largest cross-national study of negative campaigning to date, analyzing 18 million tweets posted by parliamentarians in 19 European countries between 2017 and 2022. The results reveal consistent cross-national patterns: governing parties are less likely to use negative messaging, while ideologically extreme and populist parties – particularly those on the radical right – engage in significantly higher levels of negativity. These findings advance our understanding of how party-level characteristics shape strategic communication in multiparty systems. More broadly, the study demonstrates the potential of LLMs to enable scalable, transparent, and replicable research in political communication across linguistic and cultural contexts.
[NLP-8] WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training
[Quick Read]: This paper addresses the performance bottleneck created by the decay phase in conventional learning-rate (LR) schedules and proposes a decay-free alternative. The key contribution is the Warmup-Stable and Merge (WSM) framework, which establishes a formal connection between LR decay and model merging: classical decay strategies (cosine, linear, and inverse square root decay) can be emulated as principled model averaging over checkpoints. Experiments identify the merge duration (the training window over which checkpoints are aggregated) as the most critical factor for performance, outweighing both checkpoint interval and merge quantity, and show that WSM consistently outperforms the widely adopted Warmup-Stable-Decay (WSD) approach across benchmarks, with gains that carry over to supervised fine-tuning and long-term model refinement.
Link: https://arxiv.org/abs/2507.17634
Authors: Changxin Tian, Jiapeng Wang, Qian Zhao, Kunlong Chen, Jia Liu, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou
Affiliations: Ant Group; Renmin University of China
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in learning rate (LR) scheduling have demonstrated the effectiveness of decay-free approaches that eliminate the traditional decay phase while maintaining competitive performance. Model merging techniques have emerged as particularly promising solutions in this domain. We present Warmup-Stable and Merge (WSM), a general framework that establishes a formal connection between learning rate decay and model merging. WSM provides a unified theoretical foundation for emulating various decay strategies-including cosine decay, linear decay and inverse square root decay-as principled model averaging schemes, while remaining fully compatible with diverse optimization methods. Through extensive experiments, we identify merge duration-the training window for checkpoint aggregation-as the most critical factor influencing model performance, surpassing the importance of both checkpoint interval and merge quantity. Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks, achieving significant improvements of +3.5% on MATH, +2.9% on HumanEval, and +5.5% on MMLU-Pro. The performance advantages extend to supervised fine-tuning scenarios, highlighting WSM’s potential for long-term model refinement.
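The core mechanism, merging checkpoints from the stable (decay-free) phase, amounts to averaging recent checkpoints. A minimal sketch, treating checkpoints as flattened weight vectors; non-uniform weightings of this same average are what would emulate the different decay schedules.

```python
def merge_checkpoints(checkpoints: list, window: int) -> list:
    """Uniformly average the last `window` checkpoints (each a list of
    floats standing in for flattened model weights). `window` plays the
    role of the merge duration, the factor the paper finds most critical."""
    recent = checkpoints[-window:]
    n = len(recent)
    return [sum(ws) / n for ws in zip(*recent)]
```

Because merging happens outside the training loop, the optimizer keeps a constant learning rate throughout, and the "decayed" model is produced for free from saved checkpoints.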
[NLP-9] A Hybrid Early-Exit Algorithm for Large Language Models Based on Space Alignment Decoding (SPADE)
[Quick Read]: This paper addresses the high inference cost of large language models (LLMs) and a key limitation of existing early-exit algorithms: decoding accuracy degrades because intermediate-layer representations are misaligned with the output layer. The proposed SPADE (SPace Alignment DEcoding) aligns intermediate representations with the output layer by propagating a minimally reduced sequence consisting of only the start token and the answer token, enabling high-quality outputs from intermediate layers. A linear approximation of SPADE is trained to compute entropy-based confidence metrics, and the two are combined into a hybrid early-exit algorithm that monitors confidence and stops inference at intermediate layers, substantially reducing inference cost without sacrificing accuracy.
Link: https://arxiv.org/abs/2507.17618
Authors: Bowen Zheng, Ming Ma, Zhongqiao Lin, Tianming Yang
Affiliations: Institute of Neuroscience, Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Performance (cs.PF)
Comments:
Abstract:Large language models are computationally expensive due to their deep structures. Prior research has shown that intermediate layers contain sufficient information to generate accurate answers, leading to the development of early-exit algorithms that reduce inference costs by terminating computation at earlier layers. However, these methods often suffer from poor performance due to misalignment between intermediate and output layer representations that lead to decoding inaccuracy. To address these challenges, we propose SPADE (SPace Alignment DEcoding), a novel decoding method that aligns intermediate layer representations with the output layer by propagating a minimally reduced sequence consisting of only the start token and the answer token. We further optimize the early-exit decision-making process by training a linear approximation of SPADE that computes entropy-based confidence metrics. Putting them together, we create a hybrid early-exit algorithm that monitors confidence levels and stops inference at intermediate layers while using SPADE to generate high-quality outputs. This approach significantly reduces inference costs without compromising accuracy, offering a scalable and efficient solution for deploying large language models in real-world applications.
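The entropy-based exit criterion can be sketched as follows. The threshold value is arbitrary here, and in SPADE the distribution would come from the aligned (SPADE-decoded) intermediate-layer logits rather than raw hidden states.

```python
import math

def entropy(probs: list) -> float:
    """Shannon entropy (nats) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_exit(probs: list, threshold: float = 0.5) -> bool:
    """Stop at an intermediate layer when its predictive distribution is
    confident, i.e. has low entropy; otherwise keep running deeper layers."""
    return entropy(probs) < threshold
```

A sharply peaked distribution exits early; a near-uniform one falls through to later layers, so easy tokens pay less compute than hard ones.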
[NLP-10] Dual-branch Prompting for Multimodal Machine Translation
[Quick Read]: This paper addresses two weaknesses of existing multimodal machine translation (MMT) methods: their dependence on paired image-text inputs at inference time and their sensitivity to irrelevant visual noise, both of which limit robustness and practicality. The key idea, the Diffusion-based Dual-branch Prompting framework (D2P-MMT), requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual detail while preserving semantic cues. During training, the model jointly learns from authentic and reconstructed images via a dual-branch prompting strategy, and a distributional alignment loss narrows the modality gap between training and inference by enforcing consistency between the two branches' output distributions.
Link: https://arxiv.org/abs/2507.17588
Authors: Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang
Affiliations: Southwest Jiaotong University; Xihua University; Chengdu University of Traditional Chinese Medicine
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches.
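A distributional alignment loss between the two branches can be written, for instance, as a symmetric KL divergence over their per-token output distributions; this is our illustration of the idea, and the paper's exact formulation may differ.

```python
import math

def alignment_loss(p: list, q: list, eps: float = 1e-12) -> float:
    """Symmetric KL divergence penalizing disagreement between the
    authentic-image branch (p) and the reconstructed-image branch (q),
    each a probability distribution over the target vocabulary."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))
```

Driving this loss to zero makes the reconstructed-image branch behave like the authentic-image branch, so at inference only the reconstructed image is needed.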
[NLP-11] Synthetic Voice Data for Automatic Speech Recognition in African Languages
[Quick Read]: This paper addresses the scarcity of ASR training data for low-resource African languages, where most of the continent's more than 2,300 languages lack usable speech technology. The key idea is to build large synthetic voice corpora in three steps: LLM-driven text creation, text-to-speech (TTS) voice synthesis, and ASR fine-tuning on the synthetic data. The approach achieves high readability scores in eight of ten languages and significantly improves ASR for Hausa, Dholuo, and Chichewa at below 1% of the cost of real data, matching or exceeding real-data-only baselines and demonstrating that synthetic data is an effective, economical route to ASR for low-resource languages.
Link: https://arxiv.org/abs/2507.17578
Authors: Brian DeRenzi, Anna Dixon, Mohamed Aymane Farhi, Christian Resch
Affiliations: Dimagi; CLEAR Global
Categories: Computation and Language (cs.CL)
Comments: 29 pages incl. appendix, 8 tables, 5 figures. Authors are listed in alphabetical order
Abstract:Speech technology remains out of reach for most of the over 2300 languages in Africa. We present the first systematic assessment of large-scale synthetic voice corpora for African ASR. We apply a three-step process: LLM-driven text creation, TTS voice synthesis, and ASR fine-tuning. Eight out of ten languages for which we create synthetic text achieved readability scores above 5 out of 7. We evaluated ASR improvement for three (Hausa, Dholuo, Chichewa) and created more than 2,500 hours of synthetic voice data at below 1% of the cost of real data. Fine-tuned Wav2Vec-BERT-2.0 models trained on 250h real and 250h synthetic Hausa matched a 500h real-data-only baseline, while 579h real and 450h to 993h synthetic data created the best performance. We also present gender-disaggregated ASR performance evaluation. For very low-resource languages, gains varied: Chichewa WER improved about 6.5% relative with a 1:2 real-to-synthetic ratio; a 1:1 ratio for Dholuo showed similar improvements on some evaluation data, but not on others. Investigating intercoder reliability, ASR errors and evaluation datasets revealed the need for more robust reviewer protocols and more accurate evaluation data. All data and models are publicly released to invite further work to improve synthetic data for African languages.
[NLP-12] BoSS: Beyond-Semantic Speech
[Quick Read]: This paper addresses the failure of current speech technologies (such as ASR and TTS) to capture the dimensions of spoken communication that go beyond explicit semantics, including emotion, contextual dynamics, and implicit meaning. The key contribution is the concept of Beyond-Semantic Speech (BoSS), a formalized framework that characterizes this extra-semantic information (affective cues, contextual dynamics, and implicit semantics) and analyzes the temporal and contextual dynamics of speech using cognitive relevance theories and machine learning models, aiming to move speech intelligence from basic command recognition toward human-like social interaction.
Link: https://arxiv.org/abs/2507.17563
Authors: Qing Wang, Zehan Li, Hang Lv, Hongjie Chen, Yaodong Song, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Zhongjiang He, Xuelong Li
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:
Abstract: Human communication involves more than explicit semantics, with implicit signals and contextual cues playing a critical role in shaping meaning. However, modern speech technologies, such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), often fail to capture these beyond-semantic dimensions. To better characterize and benchmark the progression of speech intelligence, we introduce Spoken Interaction System Capability Levels (L1-L5), a hierarchical framework illustrating the evolution of spoken dialogue systems from basic command recognition to human-like social interaction. To support these advanced capabilities, we propose Beyond-Semantic Speech (BoSS), which refers to the set of information in speech communication that encompasses but transcends explicit semantics. It conveys emotions, contexts, and modifies or extends meanings through multidimensional features such as affective cues, contextual dynamics, and implicit semantics, thereby enhancing the understanding of communicative intentions and scenarios. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. Evaluating BoSS-related attributes across five different dimensions reveals that current spoken language models (SLMs) struggle to fully interpret beyond-semantic signals. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
[NLP-13] Seed LiveInterpret 2.0: End-to-end Simultaneous Speech-to-speech Translation with Your Voice
[Quick Read]: This paper tackles the long-standing obstacles to product-level automatic simultaneous interpretation (SI): subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated-speech inflation in long-form discourse. The key contribution is Seed-LiveInterpret 2.0, an end-to-end SI model built on a novel duplex speech-to-speech understanding-generating framework. Through large-scale pretraining and reinforcement learning, it achieves a markedly better balance between translation accuracy and latency, supports voice cloning, and cuts the average latency of cloned speech from nearly 10 seconds to about 3 seconds, greatly improving practical usability.
Link: https://arxiv.org/abs/2507.17527
Authors: Shanbo Cheng, Yu Bao, Zhichao Huang, Yu Lu, Ningxin Peng, Lu Xu, Runsheng Yu, Rong Cao, Ting Han, Zeyang Li, Sitong Liu, Shengtao Ma, Shiguang Pan, Jiongchen Xiao, Nuo Xu, Meng Yang, Rong Ye, Yiming Yu, Ruofei Zhang, Wanyi Zhang, Wenhao Zhu, Liehao Zou, Lu Lu, Yuxuan Wang, Yonghui Wu
Affiliations: ByteDance
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Seed-LiveInterpret 2.0 Technical Report
Abstract: Simultaneous Interpretation (SI) represents one of the most daunting frontiers in the translation industry, with product-level automatic systems long plagued by intractable challenges: subpar transcription and translation quality, lack of real-time speech generation, multi-speaker confusion, and translated speech inflation, especially in long-form discourses. In this study, we introduce Seed-LiveInterpret 2.0, an end-to-end SI model that delivers high-fidelity, ultra-low-latency speech-to-speech generation with voice cloning capabilities. As a fully operational product-level solution, Seed-LiveInterpret 2.0 tackles these challenges head-on through our novel duplex speech-to-speech understanding-generating framework. Experimental results demonstrate that through large-scale pretraining and reinforcement learning, the model achieves a significantly better balance between translation accuracy and latency, validated by human interpreters to exceed 70% correctness in complex scenarios. Notably, Seed-LiveInterpret 2.0 outperforms commercial SI solutions by significant margins in translation quality, while slashing the average latency of cloned speech from nearly 10 seconds to a near-real-time 3 seconds, an almost 70% reduction that drastically enhances practical usability.
[NLP-14] URPO: A Unified Reward Policy Optimization Framework for Large Language Models
[Quick Read]: This paper addresses the inefficiency and performance ceiling of large-scale alignment pipelines that train the policy model and the reward model separately, freezing the reward model's parameters during reinforcement learning (RL) so the reward signal stays static and cannot adapt to complex tasks. The key contribution is Unified Reward Policy Optimization (URPO), which unifies instruction following and reward modeling in a single model and a single training phase: preference pairs, verifiable reasoning, and open-ended instructions are all recast into one generative format optimized by a Group-Relative Policy Optimization (GRPO) loop. The model thus learns from ground-truth preferences and verifiable logic while generating its own rewards for open-ended tasks, letting generation and evaluation co-evolve and significantly improving both alignment and the quality of the internal evaluator.
Link: https://arxiv.org/abs/2507.17515
Authors: Songshuo Lu, Hua Wang, Zhi Chen, Yaohua Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract: Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward Policy Optimization (URPO), that unifies instruction-following (“player”) and reward modeling (“referee”) within a single model and a single training phase. Our method recasts all alignment data, including preference pairs, verifiable reasoning, and open-ended instructions, into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO’s superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
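The "unified generative format" can be illustrated by a converter that maps heterogeneous alignment examples onto one schema; all field names here are our invention for illustration, not URPO's actual data format.

```python
def to_unified(example: dict) -> dict:
    """Recast one alignment example into a single generative record
    whose reward can come from preferences, a verifier, or the model's
    own judging (hypothetical schema)."""
    kind = example["kind"]
    if kind == "preference":   # preference pair: train toward the chosen answer
        return {"prompt": example["prompt"], "target": example["chosen"],
                "reward_source": "preference"}
    if kind == "verifiable":   # reasoning problem with a checkable answer
        return {"prompt": example["problem"], "target": example["answer"],
                "reward_source": "verifier"}
    # open-ended instruction: the model generates and scores its own reward
    return {"prompt": example["instruction"], "target": None,
            "reward_source": "self_judge"}
```

Once every example shares this shape, a single GRPO loop can optimize all three data types without a separate frozen reward model.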
[NLP-15] DNT: a Deeply Normalized Transformer that can be trained by Momentum SGD
[Quick Read]: This paper addresses Transformers' dependence on adaptive optimizers such as AdamW, which stems from the heavy-tailed distribution of their gradients and prevents standard momentum SGDW (mSGDW) from converging well. The key idea is the Deeply Normalized Transformer (DNT): by strategically integrating normalization at proper positions in the architecture, DNT modulates the Jacobian matrices of each layer, balances the influence of weights, activations, and their interactions, and concentrates the gradient distribution, so the model can be trained with vanilla mSGDW while matching the performance of AdamW-trained Transformers.
Link: https://arxiv.org/abs/2507.17501
Authors: Xianbiao Qi, Marco Chen, Wenjie Xiao, Jiaquan Ye, Yelin He, Chun-Guang Li, Zhouchen Lin
Affiliations: Intellifusion Inc.; Tsinghua University; Johns Hopkins University; Beijing University of Posts and Telecommunications; Peking University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: We have introduced a novel architecture, Deeply Normalized Transformer (DNT), which enables efficient training with vanilla momentum SGDW (mSGDW), achieving performance on par with AdamW-optimized Transformers
Abstract: Transformers have become the de facto backbone of modern deep learning, yet their training typically demands an advanced optimizer with adaptive learning rate like AdamW, rather than a momentum SGDW (mSGDW). Previous works show that it is mainly due to a heavy-tailed distribution of the gradients. In this paper, we introduce a Deeply Normalized Transformer (DNT), which is meticulously engineered to overcome this limitation, enabling seamless training with vanilla mSGDW while yielding comparable performance to the Transformers trained via AdamW. To be specific, in DNT, we strategically integrate normalization techniques at proper positions in the Transformers to effectively modulate the Jacobian matrices of each layer, balance the influence of weights, activations, and their interactions, and thus enable the distributions of gradients concentrated. We provide both theoretical justifications of the normalization technique used in our DNT and extensive empirical evaluation on two popular Transformer architectures to validate that: a) DNT outperforms its counterparts (i.e., ViT and GPT), and b) DNT can be effectively trained with vanilla mSGDW.
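The mechanism normalization exploits (pinning activation scale so per-layer Jacobian scales stay bounded) can be seen in an RMS-style norm; where DNT actually places its norms is the paper's contribution and is not reproduced here.

```python
import math

def rms_norm(x: list, eps: float = 1e-8) -> list:
    """Rescale a vector to unit root-mean-square. The output scale is
    independent of the input scale, so downstream gradients cannot be
    blown up by ever-growing activations."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

Scaling the input by any constant leaves the output unchanged, which is the scale-invariance that helps keep gradient magnitudes concentrated enough for plain mSGDW.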
zh
[NLP-16] MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在多语言推理能力评估中的局限性问题,尤其是现有基准测试因依赖英文基准的翻译而偏向英语语境下的推理任务,难以真实反映模型在不同语言和文化背景下的原生推理能力。解决方案的关键在于构建Multilingual Native Reasoning Challenge (MultiNRC),这是一个包含超过1000个由母语者原创、具有语言和文化特性的推理题目的多语言基准,涵盖语言特定推理、文字游戏谜题、文化传统推理及具文化关联的数学推理四类核心类别,并提供人工翻译的英文对应版本以实现跨语言直接对比。通过系统评估14种主流LLM在MultiNRC及其英文等价集上的表现,揭示了当前模型在原生多语言推理上的普遍不足以及在不同推理类型中表现出的差异化能力,尤其凸显了文化相关知识对模型推理性能的显著影响。
链接: https://arxiv.org/abs/2507.17476
作者: Alexander R. Fabbri,Diego Mares,Jorge Flores,Meher Mankikar,Ernesto Hernandez,Dean Lee,Bing Liu,Chen Xing
机构: Scale AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although recent Large Language Models (LLMs) have shown rapid improvement on reasoning benchmarks in English, the evaluation of such LLMs’ multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating existing English reasoning benchmarks, biasing these benchmarks towards reasoning problems with context in English language/cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistic and culturally grounded reasoning questions written by native speakers in French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English. This set of English equivalents can provide a direct comparison of LLM reasoning capacity in other languages vs. English on the same reasoning questions. We systematically evaluate current 14 leading LLMs covering most LLM families on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; (3) Most models perform substantially better in math reasoning in English compared to in original languages (+10%), indicating persistent challenges with culturally grounded knowledge.
zh
[NLP-17] Each to Their Own: Exploring the Optimal Embedding in RAG
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在实际应用中因不同嵌入模型(embedding models)的异质性训练数据和架构差异,导致相似度计算结果不稳定、进而影响大语言模型(Large Language Models, LLMs)响应质量的问题。解决方案的关键在于提出两种融合多嵌入模型优势的方法:其中,Mixture-Embedding RAG 通过标准化相似度排序整合多个嵌入模型的检索结果,但效果未优于基线;而 Confident RAG 则通过多次调用不同嵌入模型生成响应,并选择置信度最高的输出,实现了相对于原始 LLM 和 RAG 分别约 10% 和 5% 的平均性能提升,展现出良好的跨模型与跨领域泛化能力,是一种高效且可即插即用的改进方案。
链接: https://arxiv.org/abs/2507.17442
作者: Shiting Chen,Zijian Zhao,Jinsong Chen
机构: University of Hong Kong (香港大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the variant embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.
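Confident RAG 的做法可以概括为:用不同嵌入模型分别检索并生成回答,再取置信度最高者。下面是一个最小可运行示意;其中用平均 token 对数概率作为置信度只是我们的假设,论文并未给出具体的置信度定义:

```python
def mean_logprob_confidence(token_logprobs):
    """A simple confidence proxy: average token log-probability
    (assumption: the paper does not specify its confidence measure)."""
    return sum(token_logprobs) / len(token_logprobs)

def confident_rag(candidates):
    """Given (response, token_logprobs) pairs produced with different
    embedding models, return the response with the highest confidence."""
    return max(candidates, key=lambda c: mean_logprob_confidence(c[1]))[0]

# Hypothetical responses generated with three different embedders.
candidates = [
    ("answer from embedder A", [-0.9, -1.2, -0.8]),
    ("answer from embedder B", [-0.2, -0.3, -0.25]),  # most confident
    ("answer from embedder C", [-1.5, -0.7, -1.1]),
]
print(confident_rag(candidates))  # → answer from embedder B
```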
zh
[NLP-18] Investigating Subjective Factors of Argument Strength: Storytelling, Emotions, and Hedging
【速读】: 该论文旨在解决当前关于主观特征(如情绪、叙事和缓和表达)如何影响论证质量的实证研究不足的问题,尤其是缺乏大规模分析这些因素与客观论证质量和主观说服力之间关系的研究。其解决方案的关键在于:首先,构建并比较了针对三种主观特征(情绪、叙事、缓和表达)的自动化标注方法,填补了现有数据集中无统一标注资源的空白;其次,通过回归分析量化这些主观特征对两个标准数据集(分别标注了客观论证质量与主观说服力)的影响,揭示出叙事和缓和表达在两类论证质量上呈现相反效应,而情绪的影响则取决于修辞使用方式而非领域特性。
链接: https://arxiv.org/abs/2507.17409
作者: Carlotta Quensel,Neele Falk,Gabriella Lapesa
机构: Leibniz University Hannover (汉诺威莱布尼茨大学); University of Stuttgart (斯图加特大学); Leibniz Institute for the Social Sciences - GESIS & Heinrich Heine University Düsseldorf (莱布尼茨社会科学研究机构 - GESIS 与海因里希海涅杜塞尔多夫大学)
类目: Computation and Language (cs.CL)
备注: Accepted to the 12th Workshop on Argument Mining (ArgMining) 2025
Abstract:In assessing argument strength, the notions of what makes a good argument are manifold. With the broader trend towards treating subjectivity as an asset and not a problem in NLP, new dimensions of argument quality are studied. Although studies on individual subjective features like personal stories exist, there is a lack of large-scale analyses of the relation between these features and argument strength. To address this gap, we conduct regression analysis to quantify the impact of subjective factors - emotions, storytelling, and hedging - on two standard datasets annotated for objective argument quality and subjective persuasion. As such, our contribution is twofold: at the level of contributed resources, as there are no datasets annotated with all studied dimensions, this work compares and evaluates automated annotation methods for each subjective feature. At the level of novel insights, our regression analysis uncovers different patterns of impact of subjective features on the two facets of argument strength encoded in the datasets. Our results show that storytelling and hedging have contrasting effects on objective and subjective argument quality, while the influence of emotions depends on their rhetoric utilization rather than the domain.
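文中的回归分析在代码上相当于对论证质量得分做一次带截距的最小二乘拟合。以下为合成数据上的示意,特征含义与效应量均为虚构假设,仅用于说明分析形式:

```python
import numpy as np

# Toy design matrix: columns = [storytelling, hedging, emotion] scores
# per argument; y = annotated argument-quality score. Data is synthetic
# and the effect sizes below are assumptions, not the paper's findings.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_beta = np.array([0.8, -0.5, 0.1])
y = X @ true_beta + rng.normal(scale=0.05, size=200)

# Ordinary least squares with an intercept, as in a standard
# regression analysis of feature impact on argument strength.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef[1:], 1))  # recovered per-feature effects
```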
zh
[NLP-19] Millions of GeAR-s: Extending GraphRAG to Millions of Documents SIGIR2025
【速读】: 该论文旨在解决图结构增强的检索增强生成(Retrieval-Augmented Generation, RAG)方法在跨任务和跨数据集场景下的通用性不足问题。当前基于图的方法通常针对特定任务(如多跳问答或多文档摘要)设计,缺乏在更广泛数据集上的适用性证据。论文的关键解决方案是将最先进的图结构RAG方法GeAR(Graph-based Entity-Aware Retrieval)适配至SIGIR 2025 LiveRAG Challenge,并通过实证分析其性能边界与局限性,从而评估其在多样化任务中的泛化能力。
链接: https://arxiv.org/abs/2507.17399
作者: Zhili Shen,Chenxin Diao,Pascual Merita,Pavlos Vougiouklis,Jeff Z. Pan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted by SIGIR 2025 LiveRAG Challenge Program
Abstract:Recent studies have explored graph-based approaches to retrieval-augmented generation, leveraging structured or semi-structured information – such as entities and their relations extracted from documents – to enhance retrieval. However, these methods are typically designed to address specific tasks, such as multi-hop question answering and query-focused summarisation, and therefore, there is limited evidence of their general applicability across broader datasets. In this paper, we aim to adapt a state-of-the-art graph-based RAG solution, GeAR, and explore its performance and limitations on the SIGIR 2025 LiveRAG Challenge.
zh
[NLP-20] TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition
【速读】: 该论文旨在解决开放环境下车牌识别中因车牌类型多样性和成像条件复杂而导致的识别准确率下降问题,尤其针对传统卷积神经网络(CNN)和连接时序分类(CRNN)方法在单双线中国车牌识别中的局限性。其解决方案的关键在于提出一个统一框架,融合轻量级视觉编码器与文本解码器,并基于预训练策略优化模型性能;同时构建了合成与真实图像混合的单/双线车牌数据集以缓解双线车牌样本稀缺问题,并引入透视校正网络(Perspective Transformation Network, PTN),通过车牌角点坐标回归作为隐变量,利用视图分类信息进行监督,从而提升系统鲁棒性、可解释性及标注效率,最终在多个测试场景下实现了超过99%的识别准确率和高达167帧/秒的处理速度。
链接: https://arxiv.org/abs/2507.17335
作者: Guangzhu Xu,Zhi Ke,Pengcheng Zuo,Bangjun Lei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system’s recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.
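摘要中的透视校正在几何上等价于:由 PTN 回归出的四个角点,求一个把车牌映射回正视矩形的 3×3 单应矩阵。下面用 DLT(直接线性变换)求解并验证角点映射;角点坐标与目标尺寸均为虚构示例,与论文实现细节无关:

```python
import numpy as np

def homography(src, dst):
    """Solve the 3x3 perspective transform mapping 4 src corners to
    4 dst corners via the direct linear transform (DLT)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of the 8x9 DLT system.
    _, _, vt = np.linalg.svd(np.array(rows, dtype=float))
    return vt[-1].reshape(3, 3)

# Regressed plate corners in the image (hypothetical values) and the
# upright target rectangle for a 440x140 single-line Chinese plate.
corners = [(102, 58), (415, 83), (408, 190), (95, 160)]
target = [(0, 0), (440, 0), (440, 140), (0, 140)]
H = homography(corners, target)

u, v, w = H @ np.array([102, 58, 1.0])  # warp the first corner
print(round(u / w), round(v / w))       # → 0 0
```

用四对点求解时方程组恰好有一维零空间,因此四个角点会被精确映射到目标矩形的角上。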
zh
[NLP-21] Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge
【速读】: 该论文旨在解决多语言对话场景下语音识别准确率不足的问题,特别是在跨语言、非结构化口语交互中的性能瓶颈。解决方案的关键在于提出了一种创新的编码器-适配器-大语言模型(Encoder-Adapter-Large Language Model, EALM)架构,该架构通过融合文本型大语言模型(Large Language Model, LLM)的强大推理能力与领域特定的适配模块,在保持多语言泛化性的同时显著提升识别精度。此外,研究还设计了基于大规模多语言音频数据集的多阶段训练策略,进一步优化了模型在不同语言和语境下的表现,最终在多语言对话语音语言建模挑战赛(MLC-SLM Challenge)中取得了优异的词错误率(Word Error Rate, WER)成绩。
链接: https://arxiv.org/abs/2507.17288
作者: Miaomiao Gao,Xiaoxiao Xiang,Yiwen Guo
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:This paper describes our Triple X speech recognition system submitted to Task 1 of the Multi-Lingual Conversational Speech Language Modeling (MLC-SLM) Challenge. Our work focuses on optimizing speech recognition accuracy in multilingual conversational scenarios through an innovative encoder-adapter-LLM architecture. This framework harnesses the powerful reasoning capabilities of text-based large language models while incorporating domain-specific adaptations. To further enhance multilingual recognition performance, we adopted a meticulously designed multi-stage training strategy leveraging extensive multilingual audio datasets. Experimental results demonstrate that our approach achieves competitive Word Error Rate (WER) performance on both dev and test sets, obtaining second place in the challenge ranking.
zh
[NLP-22] Tab-MIA: A Benchmark Dataset for Membership Inference Attacks on Tabular Data in LLMs
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中对表格数据(tabular data)中个人身份信息(PII)的潜在隐私泄露问题,尤其是针对结构化数据特有的记忆行为和成员推断攻击(Membership Inference Attacks, MIAs)风险。解决方案的关键在于构建了一个名为Tab-MIA的基准数据集,涵盖五类数据集合及六种编码格式,首次系统性地评估了现有MIA方法在LLMs微调后的表格数据上的有效性。研究发现,LLMs在不同编码格式下对结构化数据的记忆模式存在差异,即便仅微调3个epoch,模型仍表现出高敏感性(AUROC接近90%),表明其易受MIAs攻击。Tab-MIA为量化隐私风险和开发面向表格数据的隐私保护机制提供了可复现的实验框架与基础。
链接: https://arxiv.org/abs/2507.17259
作者: Eyal German,Sagiv Antebi,Daniel Samira,Asaf Shabtai,Yuval Elovici
机构: Ben-Gurion University of the Negev (本-古里安大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly trained on tabular data, which, unlike unstructured text, often contains personally identifiable information (PII) in a highly structured and explicit format. As a result, privacy risks arise, since sensitive records can be inadvertently retained by the model and exposed through data extraction or membership inference attacks (MIAs). While existing MIA methods primarily target textual content, their efficacy and threat implications may differ when applied to structured data, due to its limited content, diverse data types, unique value distributions, and column-level semantics. In this paper, we present Tab-MIA, a benchmark dataset for evaluating MIAs on tabular data in LLMs and demonstrate how it can be used. Tab-MIA comprises five data collections, each represented in six different encoding formats. Using our Tab-MIA benchmark, we conduct the first evaluation of state-of-the-art MIA methods on LLMs finetuned with tabular data across multiple encoding formats. In the evaluation, we analyze the memorization behavior of pretrained LLMs on structured data derived from Wikipedia tables. Our findings show that LLMs memorize tabular data in ways that vary across encoding formats, making them susceptible to extraction via MIAs. Even when fine-tuned for as few as three epochs, models exhibit high vulnerability, with AUROC scores approaching 90% in most cases. Tab-MIA enables systematic evaluation of these risks and provides a foundation for developing privacy-preserving methods for tabular data in LLMs.
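成员推断攻击最常见的基线是基于损失的攻击:被记忆的训练样本(member)损失往往更低,因此可用 -loss 作为成员得分并计算摘要中提到的 AUROC。以下为纯 Python 示意,损失数值为虚构:

```python
def auroc(member_scores, nonmember_scores):
    """AUROC of separating members from non-members: the probability
    that a random member scores higher than a random non-member."""
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            wins += m > n
            ties += m == n
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total

# Toy per-record language-model losses: memorized (member) rows tend
# to have lower loss, so -loss is a natural membership score.
member_losses = [0.4, 0.6, 0.5, 0.9, 0.3]
nonmember_losses = [1.2, 0.8, 1.5, 1.0, 0.7]
score = auroc([-l for l in member_losses], [-l for l in nonmember_losses])
print(round(score, 2))  # → 0.92
```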
zh
[NLP-23] CLARIFID: Improving Radiology Report Generation by Reinforcing Clinically Accurate Impressions and Enforcing Detailed Findings
【速读】: 该论文旨在解决当前放射学报告自动生成方法在临床可靠性上的不足,特别是报告内容的真实性与诊断全面性问题。现有方法多聚焦于生成流畅文本,却缺乏对事实正确性的保障,且通常仅依赖单视角胸片图像,限制了诊断的完整性。其解决方案的关键在于提出CLARIFID框架,该框架通过四个核心机制实现:(1)基于章节感知预训练学习从“发现(Findings)”到“印象(Impression)”的逻辑推理流程;(2)采用近端策略优化(Proximal Policy Optimization)进行微调,并以CheXbert F1分数作为奖励信号提升印象部分的准确性;(3)引入推理感知解码机制,强制模型先完成“发现”再生成“印象”,确保临床推理连贯性;(4)利用基于视觉Transformer的多视角编码器融合多张胸片图像,增强诊断信息的全面性。最终,在MIMIC-CXR数据集上的实验表明,该方法在标准自然语言生成指标和临床相关评分上均优于现有基线,显著提升了报告的临床有效性。
链接: https://arxiv.org/abs/2507.17234
作者: Kyeongkyu Lee,Seonghwan Yoon,Hongki Lim
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Automatic generation of radiology reports has the potential to alleviate radiologists’ significant workload, yet current methods struggle to deliver clinically reliable conclusions. In particular, most prior approaches focus on producing fluent text without effectively ensuring the factual correctness of the reports and often rely on single-view images, limiting diagnostic comprehensiveness. We propose CLARIFID, a novel framework that directly optimizes diagnostic correctness by mirroring the two-step workflow of experts. Specifically, CLARIFID (1) learns the logical flow from Findings to Impression through section-aware pretraining, (2) is fine-tuned with Proximal Policy Optimization in which the CheXbert F1 score of the Impression section serves as the reward, (3) enforces reasoning-aware decoding that completes “Findings” before synthesizing the “Impression”, and (4) fuses multiple chest X-ray views via a vision-transformer-based multi-view encoder. During inference, we apply a reasoning-aware next-token forcing strategy followed by report-level re-ranking, ensuring that the model first produces a comprehensive Findings section before synthesizing the Impression and thereby preserving coherent clinical reasoning. Experimental results on the MIMIC-CXR dataset demonstrate that our method achieves superior clinical efficacy and outperforms existing baselines on both standard NLG metrics and clinically aware scores.
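摘要中的“推理感知 next-token forcing”可以理解为:在 Findings 部分结束之前屏蔽 Impression 的起始标记。下面是一个玩具级示意,`generate_step` 接口与 `[EOF]`/`[EOS]` 标记均为我们的假设,并非论文实现:

```python
def reasoning_aware_decode(generate_step, max_tokens=50):
    """Toy next-token forcing: the 'Impression:' header is suppressed
    until the Findings section has been closed by an [EOF] marker."""
    tokens, findings_done = [], False
    for _ in range(max_tokens):
        tok = generate_step(tokens, allow_impression=findings_done)
        tokens.append(tok)
        if tok == "[EOF]":          # end-of-findings marker
            findings_done = True
        if tok == "[EOS]":
            break
    return tokens

# A scripted "model" that tries to jump straight to the Impression;
# the forcing closes the Findings section first.
script = ["Findings:", "clear", "lungs", "Impression:", "normal", "[EOS]"]
state = {"i": 0}
def fake_step(tokens, allow_impression):
    tok = script[state["i"]]
    if tok == "Impression:" and not allow_impression:
        return "[EOF]"              # force Findings to end first
    state["i"] += 1
    return tok

report = reasoning_aware_decode(fake_step)
print(" ".join(report))
# → Findings: clear lungs [EOF] Impression: normal [EOS]
```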
zh
[NLP-24] A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在理解烹饪过程中对食材状态变化缺乏显式认知的问题,即模型虽基于大量食谱文本训练,却无法准确追踪食材在烹饪各阶段的中间状态(intermediate ingredient states),从而影响其对食谱逻辑的理解与执行能力。解决方案的关键在于引入状态探测(state probing)方法,构建一个标注清晰、结构规范的日语食谱数据集,并设计三项新任务以评估LLMs识别食材状态转换及中间步骤存在性的能力;实验表明,通过学习食材状态知识可显著提升模型对烹饪流程的理解性能,使其达到与商用大模型相当的水平。
链接: https://arxiv.org/abs/2507.17232
作者: Mashiro Toyooka,Kiyoharu Aizawa,Yoko Yamakata
机构: The University of Tokyo (东京大学)
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to ACM Multimedia 2025
Abstract:Large Language Models (LLMs) are trained on a vast amount of procedural texts, but they do not directly observe real-world phenomena. In the context of cooking recipes, this poses a challenge, as intermediate states of ingredients are often omitted, making it difficult for models to track ingredient states and understand recipes accurately. In this paper, we apply state probing, a method for evaluating a language model’s understanding of the world, to the domain of cooking. We propose a new task and dataset for evaluating how well LLMs can recognize intermediate ingredient states during cooking procedures. We first construct a new Japanese recipe dataset with clear and accurate annotations of ingredient state changes, collected from well-structured and controlled recipe texts. Using this dataset, we design three novel tasks to evaluate whether LLMs can track ingredient state transitions and identify ingredients present at intermediate steps. Our experiments with widely used LLMs, such as Llama3.1-70B and Qwen2.5-72B, show that learning ingredient state knowledge improves their understanding of cooking processes, achieving performance comparable to commercial LLMs.
zh
[NLP-25] The Pluralistic Moral Gap: Understanding Judgment and Value Differences between Humans and Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在提供道德建议时与人类道德判断之间存在的“多元道德差距”问题,即模型输出在分布一致性与道德价值多样性上均难以充分匹配人类判断。其核心发现是:LLMs仅在人类共识较高的情境下能较好复现人类道德判断,而当人类意见分歧增大时,模型对道德判断的对齐能力显著下降;同时,模型所依赖的道德价值范围明显窄于人类。为缩小这一差距,作者提出动态道德画像(Dynamic Moral Profiling, DMP),这是一种基于Dirichlet分布的采样方法,通过引入从人类理由中提取的道德价值分布作为条件约束,使模型输出更贴近人类的多元道德观。DMP在提升对齐度(改善64.3%)和增强道德价值多样性方面表现优异,为实现更具人类一致性和多元包容性的道德生成提供了有效路径。
链接: https://arxiv.org/abs/2507.17216
作者: Giuseppe Russo,Debora Nozza,Paul Röttger,Dirk Hovy
机构: EPFL (瑞士联邦理工学院); Bocconi University (博科尼大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:People increasingly rely on Large Language Models (LLMs) for moral advice, which may influence humans’ decisions. Yet, little is known about how closely LLMs align with human moral judgments. To address this, we introduce the Moral Dilemma Dataset, a benchmark of 1,618 real-world moral dilemmas paired with a distribution of human moral judgments consisting of a binary evaluation and a free-text rationale. We treat this problem as a pluralistic distributional alignment task, comparing the distributions of LLM and human judgments across dilemmas. We find that models reproduce human judgments only under high consensus; alignment deteriorates sharply when human disagreement increases. In parallel, using a 60-value taxonomy built from 3,783 value expressions extracted from rationales, we show that LLMs rely on a narrower set of moral values than humans. These findings reveal a pluralistic moral gap: a mismatch in both the distribution and diversity of values expressed. To close this gap, we introduce Dynamic Moral Profiling (DMP), a Dirichlet-based sampling method that conditions model outputs on human-derived value profiles. DMP improves alignment by 64.3% and enhances value diversity, offering a step toward more pluralistic and human-aligned moral guidance from LLMs.
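DMP 的采样步骤可以用 numpy 直接实现:以人类价值画像乘以浓度参数作为 Dirichlet 分布的参数,采样出用于条件化模型输出的价值权重。以下示意中的价值标签与 alpha 取值均为假设,并非论文的 60 类价值体系:

```python
import numpy as np

# A tiny slice of a value taxonomy (hypothetical labels) with
# human-derived weights for one dilemma.
values = ["care", "fairness", "loyalty", "authority", "liberty"]
human_profile = np.array([0.4, 0.3, 0.1, 0.1, 0.1])

rng = np.random.default_rng(42)
# Dirichlet-based sampling: the concentration alpha scales the human
# profile; larger alpha keeps samples closer to the observed profile.
alpha = 50.0
weights = rng.dirichlet(alpha * human_profile)

# Condition the model on the sampled value profile via the prompt.
prompt = "Weigh the values: " + ", ".join(
    f"{v} ({w:.2f})" for v, w in zip(values, weights))
print(abs(weights.sum() - 1.0) < 1e-6)  # a valid probability profile → True
```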
zh
[NLP-26] FinGAIA: An End-to-End Benchmark for Evaluating AI Agents in Finance
【速读】: 该论文旨在解决当前AI代理(AI agents)在金融领域中多步骤、多工具协作能力评估缺失的问题,尤其是其在实际金融任务中的表现尚不清晰。解决方案的关键在于构建了FinGAIA——一个端到端的基准测试平台,包含407个精心设计的任务,覆盖证券、基金、银行、保险、期货、信托和资产管理等七大金融子领域,并按场景深度分为基础业务分析、资产决策支持与战略风险管理三个层级。通过在零样本设置下对10个主流AI代理进行评估,发现最优模型ChatGPT的准确率为48.9%,显著低于金融专家水平,同时识别出五类典型失败模式,如跨模态对齐不足(Cross-modal Alignment Deficiency)和金融术语偏差(Financial Terminological Bias),为未来研究指明方向。
链接: https://arxiv.org/abs/2507.17186
作者: Lingfeng Zeng,Fangqi Lou,Zixuan Wang,Jiajie Xu,Jinyi Niu,Mengping Li,Yifan Dong,Qi Qi,Wei Zhang,Ziwei Yang,Jun Han,Ruilun Feng,Ruiqi Hu,Lejie Zhang,Zhengbo Feng,Yicheng Ren,Xin Guo,Zhaowei Liu,Dongpo Cheng,Weige Cai,Liwen Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The booming development of AI agents presents unprecedented opportunities for automating complex tasks across various domains. However, their multi-step, multi-tool collaboration capabilities in the financial sector remain underexplored. This paper introduces FinGAIA, an end-to-end benchmark designed to evaluate the practical abilities of AI agents in the financial domain. FinGAIA comprises 407 meticulously crafted tasks, spanning seven major financial sub-domains: securities, funds, banking, insurance, futures, trusts, and asset management. These tasks are organized into three hierarchical levels of scenario depth: basic business analysis, asset decision support, and strategic risk management. We evaluated 10 mainstream AI agents in a zero-shot setting. The best-performing agent, ChatGPT, achieved an overall accuracy of 48.9%, which, while superior to non-professionals, still lags financial experts by over 35 percentage points. Error analysis has revealed five recurring failure patterns: Cross-modal Alignment Deficiency, Financial Terminological Bias, Operational Process Awareness Barrier, among others. These patterns point to crucial directions for future research. Our work provides the first agent benchmark closely related to the financial domain, aiming to objectively assess and promote the development of agents in this crucial field. Partial data is available at this https URL.
zh
[NLP-27] SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLM s
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在结构化知识(Structured Knowledge, SK)理解能力评估中存在的两大问题:一是现有评测方法缺乏对具体能力的细致划分,二是评估仅聚焦于单一类型的结构化知识(如知识图谱或表格),难以全面反映模型的真实水平。为应对这一挑战,作者提出了SKA-Bench——一个结构化知识增强型问答基准,涵盖知识图谱(Knowledge Graph, KG)、表格(Table)、KG+文本和表格+文本四种典型结构化知识形式。其解决方案的关键在于构建一个三阶段数据构造流程,生成包含问题、答案、正向知识单元与噪声知识单元的实例,并进一步将这些实例细分为四个基础能力测试床:噪声鲁棒性(Noise Robustness)、顺序无关性(Order Insensitivity)、信息整合能力(Information Integration)以及负例拒绝能力(Negative Rejection),从而实现对LLMs结构化知识理解能力的精细化诊断。
链接: https://arxiv.org/abs/2507.17178
作者: Zhiqiang Liu,Enpei Niu,Yin Hua,Mengshu Sun,Lei Liang,Huajun Chen,Wen Zhang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon. Our dataset and code are available at this https URL.
zh
[NLP-28] CogDual: Enhancing Dual Cognition of LLM s via Reinforcement Learning with Implicit Rule-Based Rewards
【速读】: 该论文旨在解决当前角色扮演语言代理(Role-Playing Language Agents, RPLAs)在行为一致性与情境适配性方面的不足,这些问题通常源于现有方法(如提示工程或监督微调)对驱动角色行为的深层认知机制的忽视。解决方案的关键在于提出一种基于认知心理学启发的新型RPLA——CogDual,其采用“识别-响应”(cognize-then-respond)推理范式,通过联合建模外部情境感知(external situational awareness)与内部自我意识(internal self-awareness),从而提升角色行为的一致性和上下文相关性;此外,论文还引入两种通用奖励机制用于强化学习优化,进一步增强了模型在开放域文本生成任务中的泛化能力。
链接: https://arxiv.org/abs/2507.17147
作者: Cheng Liu,Yifei Lu,Fanghua Ye,Jian Li,Xingyu Chen,Feiliang Ren,Zhaopeng Tu,Xiaolong Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.
zh
[NLP-29] Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings
【速读】: 该论文旨在解决大规模自然语言处理(Natural Language Processing, NLP)应用中文本嵌入(text embedding)的存储与计算效率问题。传统方法常采用固定阈值对连续实数值嵌入进行二值化(binarization),导致信息损失和性能下降。其解决方案的关键在于提出一种基于坐标搜索(Coordinate Search)的优化框架,通过为每个特征独立确定最优阈值,实现更精准的二值编码。该方法显著提升了二进制表示的准确性与效率,在多种NLP任务和数据集上均优于传统固定阈值法,且具有良好的通用性,可推广至其他机器学习场景中的任意特征二值化需求。
链接: https://arxiv.org/abs/2507.17025
作者: Soumen Sinha,Shahryar Rahnamayan,Azam Asilian Bidgoli
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Efficient text embedding is crucial for large-scale natural language processing (NLP) applications, where storage and computational efficiency are key concerns. In this paper, we explore how using binary representations (barcodes) instead of real-valued features can be used for NLP embeddings derived from machine learning models such as BERT. Thresholding is a common method for converting continuous embeddings into binary representations, often using a fixed threshold across all features. We propose a Coordinate Search-based optimization framework that instead identifies the optimal threshold for each feature, demonstrating that feature-specific thresholds lead to improved performance in binary encoding. This ensures that the binary representations are both accurate and efficient, enhancing performance across various features. Our optimal barcode representations have shown promising results in various NLP applications, demonstrating their potential to transform text representation. We conducted extensive experiments and statistical tests on different NLP tasks and datasets to evaluate our approach and compare it to other thresholding methods. Binary embeddings generated using using optimal thresholds found by our method outperform traditional binarization methods in accuracy. This technique for generating binary representations is versatile and can be applied to any features, not just limited to NLP embeddings, making it useful for a wide range of domains in machine learning applications.
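论文思路可概括为:逐特征做一维坐标搜索,只保留能提升下游目标的阈值。下面的自包含示意用“二值码上的最近质心分类精度”充当下游目标,这是我们的替代假设,论文实际针对的是各类 NLP 任务指标:

```python
import numpy as np

def code_accuracy(X, y, thresholds):
    """Objective: accuracy of a nearest-centroid classifier run on the
    binary codes produced by the given per-feature thresholds."""
    B = (X > thresholds).astype(float)
    c0, c1 = B[y == 0].mean(axis=0), B[y == 1].mean(axis=0)
    d0 = np.abs(B - c0).sum(axis=1)
    d1 = np.abs(B - c1).sum(axis=1)
    return float(np.mean((d1 < d0) == y))

def coordinate_search_thresholds(X, y, n_candidates=16, sweeps=2):
    """One threshold per feature via coordinate search: sweep the
    features, try candidate thresholds, keep only improvements."""
    thresholds = np.median(X, axis=0)     # fixed-threshold baseline
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            cands = np.quantile(X[:, j],
                                np.linspace(0.05, 0.95, n_candidates))
            for c in cands:
                trial = thresholds.copy()
                trial[j] = c
                if code_accuracy(X, y, trial) > code_accuracy(X, y, thresholds):
                    thresholds = trial
    return thresholds

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                      # toy "embeddings"
y = (X[:, 0] + 0.5 * X[:, 3] > 0.7).astype(int)    # synthetic labels
t = coordinate_search_thresholds(X, y)
base = np.median(X, axis=0)
print(code_accuracy(X, y, t) >= code_accuracy(X, y, base))  # → True
```

由于只在严格提升时才接受新阈值,逐特征搜索的结果必然不差于所有特征共用中位数的固定阈值基线。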
zh
[NLP-30] Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? ACL2025
【速读】: 该论文旨在解决在评估大语言模型(Large Language Models, LLMs)时,针对长文本事实性内容、数学和代码等复杂任务领域中,传统基于人工或AI的成对偏好标注(pairwise preference annotation)质量难以保障的问题。其关键解决方案是提出一种使用外部工具的智能体系统(tool-using agentic system),通过引入网络搜索(web-search)和代码执行(code execution)能力,使标注过程能够基于外部验证独立于LLM内部知识与偏见,从而提升反馈的质量与可靠性。实验表明,该方法在多个挑战性任务域中有效改善了标注性能,但效果受提示词设计等参数影响显著,凸显了对非饱和标注基准的迫切需求。
链接: https://arxiv.org/abs/2507.17015
作者: Arduin Findeis,Floris Weers,Guoli Yin,Ke Ye,Ruoming Pang,Tom Gunter
机构: University of Cambridge (剑桥大学); Apple (苹果公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2025
Abstract:Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the “better” response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain - from AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM’s internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at this https URL.
zh
[NLP-31] Multi-Label Classification with Generative AI Models in Healthcare: A Case Study of Suicidality and Risk Factors
【速读】: 该论文旨在解决精神科电子健康记录(EHR)中 suicidality-related factors (SrFs),包括自杀意念(SI)、自杀未遂(SA)、接触自杀(ES)及非自杀性自伤(NSSI)等多标签分类问题,传统方法常将其简化为二分类任务,忽视了风险因素的共现复杂性。解决方案的关键在于提出一种端到端的生成式大语言模型(LLMs)多标签分类(MLC)流程,采用微调后的GPT-3.5和引导提示(guided prompting)的GPT-4.5进行建模,并引入标签集层面评估指标与多标签混淆矩阵以实现更精细的误差分析。实验表明,微调后的GPT-3.5在部分匹配准确率(0.94)和F1分数(0.91)上表现最优,而GPT-4.5在稀有或少数标签集上展现出更强的鲁棒性和平衡性,揭示出系统性错误模式(如SI与SA混淆)及模型倾向于保守过度标注的现象,从而为结构化非结构化EHR数据、支持大规模临床研究与循证医学提供了可行范式。
链接: https://arxiv.org/abs/2507.17009
作者: Ming Huang,Zehan Li,Yan Hu,Wanjing Wang,Andrew Wen,Scott Lane,Salih Selek,Lokesh Shahani,Rodrigo Machado-Vieira,Jair Soares,Hua Xu,Hongfang Liu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Quantitative Methods (q-bio.QM)
备注:
Abstract:Suicide remains a pressing global health crisis, with over 720,000 deaths annually and millions more affected by suicide ideation (SI) and suicide attempts (SA). Early identification of suicidality-related factors (SrFs), including SI, SA, exposure to suicide (ES), and non-suicidal self-injury (NSSI), is critical for timely intervention. While prior studies have applied AI to detect SrFs in clinical notes, most treat suicidality as a binary classification task, overlooking the complexity of co-occurring risk factors. This study explores the use of generative large language models (LLMs), specifically GPT-3.5 and GPT-4.5, for multi-label classification (MLC) of SrFs from psychiatric electronic health records (EHRs). We present a novel end-to-end generative MLC pipeline and introduce advanced evaluation methods, including label-set-level metrics and a multi-label confusion matrix for error analysis. Fine-tuned GPT-3.5 achieved top performance with 0.94 partial match accuracy and 0.91 F1 score, while GPT-4.5 with guided prompting showed superior performance across label sets, including rare or minority label sets, indicating a more balanced and robust performance. Our findings reveal systematic error patterns, such as the conflation of SI and SA, and highlight the models' tendency toward cautious over-labeling. This work not only demonstrates the feasibility of using generative AI for complex clinical classification tasks but also provides a blueprint for structuring unstructured EHR data to support large-scale clinical research and evidence-based medicine.
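摘要中的 partial match accuracy 等标签集层面指标可以如下实现。这里采用一种常见定义——预测与金标签集只要有交集即记为部分命中,空金标签只与空预测匹配;该定义是我们的假设,论文可能使用不同口径:

```python
def partial_match_accuracy(gold_sets, pred_sets):
    """Fraction of items whose prediction shares at least one label
    with the gold label set (empty gold matches only empty pred)."""
    hits = 0
    for g, p in zip(gold_sets, pred_sets):
        if not g and not p:
            hits += 1
        elif g & p:
            hits += 1
    return hits / len(gold_sets)

def exact_match_accuracy(gold_sets, pred_sets):
    """Fraction of items whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold_sets, pred_sets)) / len(gold_sets)

# Toy notes labeled with suicidality-related factors (SI, SA, ES, NSSI).
gold = [{"SI"}, {"SI", "SA"}, {"NSSI"}, set()]
pred = [{"SI"}, {"SA"}, {"ES"}, set()]
print(exact_match_accuracy(gold, pred), partial_match_accuracy(gold, pred))
# → 0.5 0.75
```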
zh
[NLP-32] Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏显式国籍标签时仍可能表现出隐性偏见的问题,尤其是当使用文化特征明显的姓名替代明确国籍信息时,这种偏见如何影响模型的准确性与公平性。其解决方案的关键在于提出一种基于姓名的基准测试方法,源自Bias Benchmark for QA (BBQ) 数据集,通过将显式国籍标签替换为具有文化指向性的姓名,模拟更贴近真实应用场景的测试环境,从而系统评估不同规模LLMs在模糊情境下的偏见强度与准确率表现。实验表明,小模型不仅准确性更低,且对原有偏见更具顽固性,验证了偏见在模型中的深层残留特性。
链接: https://arxiv.org/abs/2507.16989
作者: Giulio Pelosio,Devesh Batra,Noémie Bovey,Robert Hankache,Cristovao Iglesias,Greig Cowan,Raad Khraishi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) can exhibit latent biases towards specific nationalities even when explicit demographic markers are not present. In this work, we introduce a novel name-based benchmarking approach derived from the Bias Benchmark for QA (BBQ) dataset to investigate the impact of substituting explicit nationality labels with culturally indicative names, a scenario more reflective of real-world LLM applications. Our novel approach examines how this substitution affects both bias magnitude and accuracy across a spectrum of LLMs from industry leaders such as OpenAI, Google, and Anthropic. Our experiments show that small models are less accurate and exhibit more bias compared to their larger counterparts. For instance, on our name-based dataset and in the ambiguous context (where the correct choice is not revealed), Claude Haiku exhibited the worst stereotypical bias scores of 9%, compared to only 3.5% for its larger counterpart, Claude Sonnet, where the latter also outperformed it by 117.7% in accuracy. Additionally, we find that small models retain a larger portion of existing errors in these ambiguous contexts. For example, after substituting names for explicit nationality references, GPT-4o retains 68% of the error rate versus 76% for GPT-4o-mini, with similar findings for other model providers, in the ambiguous context. Our research highlights the stubborn resilience of biases in LLMs, underscoring their profound implications for the development and deployment of AI systems in diverse, global contexts.
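论文的核心操作是把 BBQ 模板中的显式国籍标签替换为文化指向性姓名,并在歧义语境下统计刻板印象偏见分数。下面是一个极简示意草图,其中模板、姓名映射与偏见分数的具体定义均为假设,并非 BBQ 官方实现:

```python
# 基于姓名替换的 BBQ 风格模板与歧义语境偏见分数草图(姓名映射与打分定义为假设)
NAME_MAP = {"[NAT1]": "Hiroshi", "[NAT2]": "Giovanni"}  # 用文化指向性姓名替换显式国籍标签

def fill_template(template, mapping):
    for slot, name in mapping.items():
        template = template.replace(slot, name)
    return template

def stereotype_bias_score(answers):
    """歧义语境下的简化偏见分数:在未选 unknown 的回答中,
    偏向刻板印象目标的净比例,再按"敢于作答"的比例加权。"""
    committed = [a for a in answers if a != "unknown"]
    if not committed:
        return 0.0
    n_stereo = sum(a == "stereo" for a in committed)
    return (2 * n_stereo / len(committed) - 1) * (len(committed) / len(answers))

q = fill_template("[NAT1] and [NAT2] were waiting at the airport. Who was rude?", NAME_MAP)
answers = ["stereo", "unknown", "stereo", "anti", "unknown"]  # 某模型在 5 个歧义样本上的回答(虚构)
score = stereotype_bias_score(answers)
```

歧义语境下理想回答是 unknown,因此任何偏向刻板印象的净倾向都会抬高该分数。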
zh
[NLP-33] Leveraging Synthetic Data for Question Answering with Multilingual LLMs in the Agricultural Domain
【速读】: 该论文旨在解决农业领域中农民难以及时获取本地化、多语言且准确的农业信息的问题。现有通用大语言模型(Large Language Models, LLMs)在农业场景下往往提供泛化建议,缺乏针对特定区域和语言环境的精准性,主要受限于领域知识不足及高质量、区域性数据稀缺。其解决方案的关键在于:首先,从农业专业文档中生成多语言合成数据集(涵盖英语、印地语和旁遮普语),随后对不同语言的LLM进行针对性微调(fine-tuning)。实验证明,该策略显著提升了模型在事实准确性、相关性和农业共识方面的性能,表明基于合成数据的语言特异性微调是提升农业场景下LLM表现的有效途径,尤其适用于多语言和低资源环境。
链接: https://arxiv.org/abs/2507.16974
作者: Rishemjit Kaur,Arshdeep Singh Bhankhar,Surangika Ranathunga,Jashanpreet Singh Salh,Sudhir Rajput,Vidhi,Kashish Mahendra,Bhavika Berwal,Ritesh Kumar
机构: CSIR-Central Scientific Instruments Organisation, India (印度中央科学仪器组织); Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, U.P., India (印度科学与创新研究院); School of Mathematical and Computational Sciences, Massey University, New Zealand (梅西大学数学与计算科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 15 pages, 9 tables, Appendix A-K
Abstract:Enabling farmers to access accurate agriculture-related information in their native languages in a timely manner is crucial for the success of the agriculture field. Although large language models (LLMs) can be used to implement Question Answering (QA) systems, simply using publicly available general-purpose LLMs in agriculture typically offer generic advisories, lacking precision in local and multilingual contexts due to insufficient domain-specific training and scarcity of high-quality, region-specific datasets. Our study addresses these limitations by generating multilingual synthetic agricultural datasets (English, Hindi, Punjabi) from agriculture-specific documents and fine-tuning language-specific LLMs. Our evaluation on curated multilingual datasets demonstrates significant improvements in factual accuracy, relevance, and agricultural consensus for the fine-tuned models compared to their baseline counterparts. These results highlight the efficacy of synthetic data-driven, language-specific fine-tuning as an effective strategy to improve the performance of LLMs in agriculture, especially in multilingual and low-resource settings. By enabling more accurate and localized agricultural advisory services, this study provides a meaningful step toward bridging the knowledge gap in AI-driven agricultural solutions for diverse linguistic communities.
zh
[NLP-34] Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning ICIP
【速读】: 该论文旨在解决通过多语言自然语言接口访问知识库(Knowledge Graph, KG)中的结构化知识这一挑战,核心问题在于如何将自然语言问题准确转换为可执行的SPARQL查询。解决方案的关键在于提出mKGQAgent框架,该框架受人类推理启发,将复杂的自然语言到SPARQL的转换任务分解为模块化、可解释的子任务,包括规划、实体链接和查询优化,并通过一个由经验池支持的LLM代理协同工作流程实现,在上下文学习(in-context learning)指导下提升多语言知识图谱问答(Multilingual KGQA)的效率与准确性。
链接: https://arxiv.org/abs/2507.16971
作者: Aleksandr Perevalov,Andreas Both
机构: WSE Research Group, Leipzig University of Applied Sciences (莱比锡应用科学大学); DICE Research Group, University of Paderborn (帕德博恩大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: During the final evaluation on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants
Abstract:Accessing knowledge via multilingual natural-language interfaces is one of the emerging challenges in the field of information retrieval and related ones. Structured knowledge stored in knowledge graphs can be queried via a specific query language (e.g., SPARQL). Therefore, one needs to transform natural-language input into a query to fulfill an information need. Prior approaches mostly focused on combining components (e.g., rule-based or neural-based) that solve downstream tasks and come up with an answer at the end. We introduce mKGQAgent, a human-inspired framework that breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. By leveraging a coordinated LLM agent workflow for planning, entity linking, and query refinement - guided by an experience pool for in-context learning - mKGQAgent efficiently handles multilingual KGQA. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the other participants. This work opens new avenues for developing human-like reasoning systems in multilingual semantic parsing.
zh
[NLP-35] Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs
【速读】: 该论文旨在解决对话式信息检索(Conversational Information Retrieval, CIR)系统在处理无法回答的问题时容易产生误导性或幻觉内容的问题。传统方法依赖外部分类器判断问题是否可答,但这种分离式设计常与生成式大语言模型(Large Language Models, LLMs)的内部逻辑不一致,导致可靠性不足。解决方案的关键在于提出Self-Aware LLM for Unanswerability (SALU),其核心创新是将不可回答性检测深度集成到LLM的生成过程中,通过多任务学习框架同时训练标准问答和显式拒答能力,并引入基于置信度引导的人类反馈强化学习(Reinforcement Learning with Human Feedback, RLHF)阶段,明确惩罚幻觉响应、奖励合理拒答,从而促使模型建立对知识边界的内在自知能力。实验表明,SALU在准确识别可答/不可答问题上显著优于主流基线,且人类评估证实其在事实准确性、拒答合理性及幻觉抑制方面表现卓越。
链接: https://arxiv.org/abs/2507.16951
作者: Shuyuan Lin,Lei Duan,Philip Hughes,Yuxuan Sheng
机构: Sichuan University of Science and Engineering (四川理工学院); Zagazig University (扎加齐格大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Conversational Information Retrieval (CIR) systems, while offering intuitive access to information, face a significant challenge: reliably handling unanswerable questions to prevent the generation of misleading or hallucinated content. Traditional approaches often rely on external classifiers, which can introduce inconsistencies with the core generative Large Language Models (LLMs). This paper introduces Self-Aware LLM for Unanswerability (SALU), a novel approach that deeply integrates unanswerability detection directly within the LLM’s generative process. SALU is trained using a multi-task learning framework for both standard Question Answering (QA) and explicit abstention generation for unanswerable queries. Crucially, it incorporates a confidence-score-guided reinforcement learning with human feedback (RLHF) phase, which explicitly penalizes hallucinated responses and rewards appropriate abstentions, fostering intrinsic self-awareness of knowledge boundaries. Through extensive experiments on our custom-built C-IR_Answerability dataset, SALU consistently outperforms strong baselines, including hybrid LLM-classifier systems, in overall accuracy for correctly answering or abstaining from questions. Human evaluation further confirms SALU’s superior reliability, achieving high scores in factuality, appropriate abstention, and, most importantly, a dramatic reduction in hallucination, demonstrating its ability to robustly “know when to say ‘I don’t know’.”
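SALU 在 RLHF 阶段"惩罚幻觉、奖励合理拒答"的训练信号可以抽象为一个奖励函数;以下草图中的具体奖励数值与置信度门限均为示意性假设,仅用于说明信号的方向,并非论文原始实现:

```python
# SALU 式奖励整形草图:数值与门限均为假设,仅说明"奖励拒答、重罚幻觉"的方向
def salu_reward(answerable, abstained, correct, confidence, tau=0.5):
    """answerable: 问题是否可答; abstained: 模型是否拒答;
    correct: 作答内容是否正确; confidence: 模型自报置信度。"""
    if answerable:
        if abstained:
            return -0.5                      # 可答却拒答:轻度惩罚
        return 1.0 if correct else -1.0      # 答对奖励,答错(幻觉)重罚
    # 以下处理不可答的问题
    if abstained:
        return 1.0                           # 合理拒答:奖励
    # 对不可答问题强行作答即幻觉;自报置信度越高,惩罚越重
    return -1.0 - (confidence - tau if confidence > tau else 0.0)

r_hallu = salu_reward(answerable=False, abstained=False, correct=False, confidence=0.9)
r_abstain = salu_reward(answerable=False, abstained=True, correct=False, confidence=0.2)
```

置信度加权的惩罚项对应论文中"置信度分数引导"的思路:越自信的幻觉,训练信号越负。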
zh
[NLP-36] AI-based Clinical Decision Support for Primary Care: A Real-World Study ALT
【速读】: 该论文旨在解决临床决策支持系统在真实医疗环境中因医生使用率低、与工作流程脱节而导致的效能不足问题,从而减少诊断和治疗错误。其解决方案的关键在于开发并部署一个与临床工作流程高度对齐的大语言模型(Large Language Model, LLM)驱动的决策支持工具——AI Consult,该工具仅在需要时激活,不干扰医生自主权,并通过主动推广策略提升医生采纳率,最终显著降低临床错误发生率,验证了LLM-based临床决策支持在现实场景中的可行性和有效性。
链接: https://arxiv.org/abs/2507.16947
作者: Robert Korom,Sarah Kiptinness,Najib Adan,Kassim Said,Catherine Ithuli,Oliver Rotich,Boniface Kimani,Irene King’ori,Stellah Kamau,Elizabeth Atemba,Muna Aden,Preston Bowman,Michael Sharman,Rebecca Soskin Hicks,Rebecca Distler,Johannes Heidecke,Rahul K. Arora,Karan Singhal
机构: Penda Health; Nairobi County; OpenAI
类目: Computation and Language (cs.CL)
备注: Blog: this https URL
Abstract:We evaluate the impact of large language model-based clinical decision support in live care. In partnership with Penda Health, a network of primary care clinics in Nairobi, Kenya, we studied AI Consult, a tool that serves as a safety net for clinicians by identifying potential documentation and clinical decision-making errors. AI Consult integrates into clinician workflows, activating only when needed and preserving clinician autonomy. We conducted a quality improvement study, comparing outcomes for 39,849 patient visits performed by clinicians with or without access to AI Consult across 15 clinics. Visits were rated by independent physicians to identify clinical errors. Clinicians with access to AI Consult made relatively fewer errors: 16% fewer diagnostic errors and 13% fewer treatment errors. In absolute terms, the introduction of AI Consult would avert diagnostic errors in 22,000 visits and treatment errors in 29,000 visits annually at Penda alone. In a survey of clinicians with AI Consult, all clinicians said that AI Consult improved the quality of care they delivered, with 75% saying the effect was “substantial”. These results required a clinical workflow-aligned AI Consult implementation and active deployment to encourage clinician uptake. We hope this study demonstrates the potential for LLM-based clinical decision support tools to reduce errors in real-world settings and provides a practical framework for advancing responsible adoption.
zh
[NLP-37] SiLQ: Simple Large Language Model Quantization-Aware Training
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在量化(Quantization)过程中如何在不显著损失精度的前提下,实现推理延迟降低、模型体积减小和能耗减少的问题,同时确保所提方法与专用推理加速器兼容。其解决方案的关键在于提出一种简单且端到端的量化感知训练(Quantization-Aware Training, QAT)方法:该方法仅需增加少于0.1%的总训练预算,即可在多个现代基准上显著优于现有主流量化方法,且适用于不同模型架构、权重(weights)、激活值(activations)及缓存(cache)的量化,无需引入额外计算操作,仅需在训练中嵌入量化过程本身。
链接: https://arxiv.org/abs/2507.16933
作者: Steven K. Esser,Jeffrey L. McKinstry,Deepika Bablani,Rathinakumar Appuswamy,Dharmendra S. Modha
机构: IBM Research (IBM 研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 12 pages, 3 figures
Abstract:Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of accuracy in reasonable time, and in particular to do so without requiring mechanisms incompatible with specialized inference accelerators. Here, we demonstrate a simple, end-to-end quantization-aware training approach that, with an increase in total model training budget of less than 0.1%, outperforms the leading published quantization methods by large margins on several modern benchmarks, with both base and instruct model variants. The approach easily generalizes across different model architectures, can be applied to activations, cache, and weights, and requires the introduction of no additional operations to the model other than the quantization itself.
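量化感知训练的核心是在前向传播中插入"伪量化"(quantize-dequantize)操作;下面用纯 Python 演示对称均匀量化器的这一过程。位宽与数据均为示例,并非 SiLQ 原实现;实际训练中还需用直通估计器(straight-through estimator)让梯度越过取整操作:

```python
# 对称均匀伪量化草图:映射到 b 位有符号整数网格后再反量化(示例,非 SiLQ 原实现)
def fake_quantize(xs, bits=4):
    qmax = 2 ** (bits - 1) - 1               # 例如 4 bit -> 整数范围 [-7, 7]
    scale = max(abs(x) for x in xs) / qmax or 1.0  # 全零输入时退化为 scale=1.0
    out = []
    for x in xs:
        q = round(x / scale)                 # 映射到整数网格
        q = max(-qmax, min(qmax, q))         # 截断到可表示范围
        out.append(q * scale)                # 反量化回浮点域
    return out

w = [0.9, -0.35, 0.02, -0.7]
w_q = fake_quantize(w, bits=4)
```

反量化后的权重仍是浮点数,但只取整数网格上的离散值,这正是 QAT 让模型"提前适应"推理期量化误差的机制。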
zh
[NLP-38] A Unifying Scheme for Extractive Content Selection Tasks
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中各类文本选择任务(content selection tasks)长期分散研究、缺乏统一建模框架的问题。这些任务虽然目标一致——从源文本中选取相关片段——但各自使用独立的模型、数据集和评估指标,导致资源重复与迁移困难。解决方案的关键在于提出指令引导的文本选择(Instruction-guided Content Selection, IGCS)统一框架,将任务定义和实例级需求编码为语言模型的指令输入;同时构建首个覆盖多样内容选择任务的基准测试集(IGCSbench),并开发一个通用合成数据集用于迁移学习,实验证明其在有无目标任务专用训练数据时均能提升性能。此外,论文还系统分析了基于大语言模型(Large Language Models, LLMs)进行推理时的通用性问题,并提出了一种通用评估指标,从而为未来内容选择模型的研究提供可复用的资源与方法论支持。
链接: https://arxiv.org/abs/2507.16922
作者: Shmuel Amar,Ori Shapira,Aviv Slobodkin,Ido Dagan
机构: Bar-Ilan University (巴伊兰大学); Google Research (谷歌研究); OriginAI
类目: Computation and Language (cs.CL)
备注:
Abstract:A broad range of NLP tasks involve selecting relevant text spans from given source texts. Despite this shared objective, such content selection tasks have traditionally been studied in isolation, each with its own modeling approaches, datasets, and evaluation metrics. In this work, we propose instruction-guided content selection (IGCS) as a beneficial unified framework for such settings, where the task definition and any instance-specific request are encapsulated as instructions to a language model. To promote this framework, we introduce IGCSbench, the first unified benchmark covering diverse content selection tasks. Further, we create a large generic synthetic dataset that can be leveraged for diverse content selection tasks, and show that transfer learning with these datasets often boosts performance, whether dedicated training for the targeted task is available or not. Finally, we address generic inference time issues that arise in LLM-based modeling of content selection, assess a generic evaluation metric, and overall propose the utility of our resources and methods for future content selection models. Models and datasets available at this https URL.
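IGCS 的统一思路——把任务定义与实例级需求封装成给语言模型的指令——可以用如下草图示意。字段名与玩具式选择器均为假设,真实系统中的选择由 LLM 完成:

```python
# IGCS 风格的统一实例格式草图:"指令 + 原文"包装任意内容选择任务(字段名为假设)
def to_igcs_instance(task_definition, request, source_sents):
    instruction = task_definition + (" " + request if request else "")
    return {"instruction": instruction, "source": source_sents}

def select_spans(instance, keywords):
    """玩具式选择器:按关键词重合挑句,仅用于演示"从原文中选片段"的输出形态。"""
    return [s for s in instance["source"] if any(k in s.lower() for k in keywords)]

inst = to_igcs_instance(
    "Select sentences that support the claim.",
    "Claim: caffeine improves alertness.",
    ["Caffeine blocks adenosine receptors.", "The study ran for two weeks.",
     "Participants reported higher alertness after caffeine."],
)
picked = select_spans(inst, ["caffeine", "alertness"])
```

同一份"指令 + 原文"格式既能表达摘要式句子抽取,也能表达证据选择等其他任务,这正是统一框架带来的迁移学习便利。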
zh
[NLP-39] ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
【速读】: 该论文旨在解决多实体场景下指代表达理解(Referring Expression Comprehension, REC)中因忽略实体间复杂关系而导致的定位精度与可靠性不足的问题,同时应对高质量细粒度图像-文本-关系标注数据集匮乏的瓶颈。其解决方案的关键在于提出一个名为ReMeREC的新框架,该框架通过两个核心组件协同工作:一是Text-adaptive Multi-entity Perceptron (TMP),能够基于细粒度文本线索动态推断实体数量和范围,生成区分性强的表示以缓解语言中隐式边界带来的语义模糊;二是Entity Inter-relationship Reasoner (EIR),用于增强实体间的关联推理能力并提升全局场景理解。此外,作者构建了首个面向关系感知的多实体REC数据集ReMeX及辅助数据集EntityText,从而推动模型在复杂语境下的多实体精确定位与关系预测性能达到当前最优水平。
链接: https://arxiv.org/abs/2507.16877
作者: Yizhi Hu,Zezhao Tian,Xingqun Qi,Chen Su,Bingkun Yang,Junhui Yin,Muyi Sun,Man Zhang,Zhenan Sun
机构: Beijing University of Posts and Telecommunications(北京邮电大学); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 15 pages, 7 figures
Abstract:Referring Expression Comprehension (REC) aims to localize specified entities or regions in an image based on natural language descriptions. While existing methods handle single-entity localization, they often ignore complex inter-entity relationships in multi-entity scenes, limiting their accuracy and reliability. Additionally, the lack of high-quality datasets with fine-grained, paired image-text-relation annotations hinders further progress. To address this challenge, we first construct a relation-aware, multi-entity REC dataset called ReMeX, which includes detailed relationship and textual annotations. We then propose ReMeREC, a novel framework that jointly leverages visual and textual cues to localize multiple entities while modeling their inter-relations. To address the semantic ambiguity caused by implicit entity boundaries in language, we introduce the Text-adaptive Multi-entity Perceptron (TMP), which dynamically infers both the quantity and span of entities from fine-grained textual cues, producing distinctive representations. Additionally, our Entity Inter-relationship Reasoner (EIR) enhances relational reasoning and global scene understanding. To further improve language comprehension for fine-grained prompts, we also construct a small-scale auxiliary dataset, EntityText, generated using large language models. Experiments on four benchmark datasets show that ReMeREC achieves state-of-the-art performance in multi-entity grounding and relation prediction, outperforming existing approaches by a large margin.
zh
[NLP-40] Pixels, Patterns, but No Poetry: To See The World like Humans
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在感知能力上与人类存在显著差距的问题,即MLLMs是否能像人类一样直观地理解合成图像内容。传统研究主要聚焦于提升模型的推理能力,而本文提出从感知角度重新审视这一问题,设计了“图灵视觉测试”(Turing Eye Test, TET),这是一个以感知为导向的基准测试,包含四项诊断性任务,评估模型对人类可直觉处理的合成图像的理解能力。实验发现,当前最先进的MLLMs在这些感知任务中表现灾难性失败,而仅依赖上下文学习或基于语言主干的训练无法改善性能,唯有微调视觉编码器(vision tower)能快速提升表现,表明当前MLLMs的核心瓶颈在于视觉表征的泛化能力,而非语言推理或知识存储能力——这揭示了现有模型与人类感知之间存在的关键差距。
链接: https://arxiv.org/abs/2507.16863
作者: Hongcheng Gao,Zihao Huang,Lin Xu,Jingyi Tang,Xinhao Li,Yue Liu,Haoyang Li,Taihang Hu,Minhua Lin,Xinlong Yang,Ge Wu,Balong Bi,Hongyu Chen,Wentao Zhang
机构: University of Chinese Academy of Sciences (中国科学院大学); Nanjing University (南京大学); National University of Singapore (新加坡国立大学); BUPT (北京邮电大学); Nankai University (南开大学); The Pennsylvania State University (宾夕法尼亚州立大学); Peking University (北京大学); BJTU (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhancing reasoning capabilities in MLLMs, a fundamental question persists: Can Multimodal Large Language Models truly perceive the world as humans do? This paper shifts focus from reasoning to perception. Rather than constructing benchmarks specifically for reasoning, we introduce the Turing Eye Test (TET), a challenging perception-oriented benchmark comprising four diagnostic tasks that evaluate MLLMs’ performance on synthetic images that humans process intuitively. Our findings reveal that state-of-the-art MLLMs exhibit catastrophic failures on our perceptual tasks trivial for humans. Both in-context learning and training on language backbone-effective for previous benchmarks-fail to improve performance on our tasks, while fine-tuning the vision tower enables rapid adaptation, suggesting that our benchmark poses challenges for vision tower generalization rather than for the knowledge and reasoning capabilities of the language backbone-a key gap between current MLLMs and human perception. We release a representative subset of TET tasks in this version, and will introduce more diverse tasks and methods to enhance visual generalization in future work.
zh
[NLP-41] A Query-Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval-Augmented Generation in Large Language Models
【速读】: 该论文旨在解决当前检索增强生成(Retrieval Augmented Generation, RAG)方法中因仅依赖相似度匹配检索孤立文本片段而导致的语义关联不足问题,从而影响生成内容的准确性与一致性。其解决方案的关键在于提出一种查询感知的多路径知识图谱融合方法(Query-Aware Multi-Path Knowledge Graph Fusion, QMKGF):首先利用大语言模型(LLM)高效构建知识图谱(Knowledge Graph, KG),随后设计包含一阶、多跳及重要性关系的多路径子图构造策略以提升检索文档与用户查询之间的语义相关性;进一步引入查询感知注意力奖励模型对子图三元组进行评分,并选取最优子图进行扩展,通过融合其他高语义相关子图中的三元组来增强原始查询的语义表示,最终提升大语言模型的生成质量。
链接: https://arxiv.org/abs/2507.16826
作者: Qikai Wei,Huansheng Ning,Chunlong Han,Jianguo Ding
机构: University of Science and Technology Beijing (北京科技大学); Blekinge Institute of Technology (布莱金厄理工大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Retrieval Augmented Generation (RAG) has gradually emerged as a promising paradigm for enhancing the accuracy and factual consistency of content generated by large language models (LLMs). However, existing RAG studies primarily focus on retrieving isolated segments using similarity-based matching methods, while overlooking the intrinsic connections between them. This limitation hampers performance in RAG tasks. To address this, we propose QMKGF, a Query-Aware Multi-Path Knowledge Graph Fusion Approach for Enhancing Retrieval Augmented Generation. First, we design prompt templates and employ general-purpose LLMs to extract entities and relations, thereby generating a knowledge graph (KG) efficiently. Based on the constructed KG, we introduce a multi-path subgraph construction strategy that incorporates one-hop relations, multi-hop relations, and importance-based relations, aiming to improve the semantic relevance between the retrieved documents and the user query. Subsequently, we designed a query-aware attention reward model that scores subgraph triples based on their semantic relevance to the query. Then, we select the highest score subgraph and enrich subgraph with additional triples from other subgraphs that are highly semantically relevant to the query. Finally, the entities, relations, and triples within the updated subgraph are utilised to expand the original query, thereby enhancing its semantic representation and improving the quality of LLMs’ generation. We evaluate QMKGF on the SQuAD, IIRC, Culture, HotpotQA, and MuSiQue datasets. On the HotpotQA dataset, our method achieves a ROUGE-1 score of 64.98%, surpassing the BGE-Rerank approach by 9.72 percentage points (from 55.26% to 64.98%). Experimental results demonstrate the effectiveness and superiority of the QMKGF approach.
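QMKGF 的"一跳关系 + 多跳关系"多路径子图构造可用如下纯 Python 草图示意;图数据为虚构示例,真实系统还包含基于重要性的路径与查询感知的注意力打分,此处不展开:

```python
# QMKGF 多路径子图构造草图:从查询实体出发抽取一跳与多跳三元组(图数据为虚构示例)
KG = [  # (头实体, 关系, 尾实体)
    ("Paris", "capital_of", "France"),
    ("France", "member_of", "EU"),
    ("Paris", "located_on", "Seine"),
    ("EU", "founded_in", "1993"),
]

def one_hop(entity):
    """一跳子图:实体直接参与的全部三元组。"""
    return [t for t in KG if entity in (t[0], t[2])]

def multi_hop(entity, hops=2):
    """BFS 式扩展:收集 hops 跳以内可达的全部三元组。"""
    frontier, seen, triples = {entity}, {entity}, []
    for _ in range(hops):
        nxt = set()
        for h, r, t in KG:
            if (h in frontier or t in frontier) and (h, r, t) not in triples:
                triples.append((h, r, t))
                nxt.update({h, t} - seen)
        seen |= nxt
        frontier = nxt
    return triples

sub1 = one_hop("Paris")
sub2 = multi_hop("Paris", hops=2)
```

一跳子图保证局部相关性,多跳子图补充间接关联的三元组,二者再经打分融合后用于扩展原始查询。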
zh
[NLP-42] Disaster Informatics after the COVID-19 Pandemic: Bibliometric and Topic Analysis based on Large-scale Academic Literature
【速读】: 该论文旨在解决灾害信息学(disaster informatics)领域在2020年1月至2022年9月期间的研究趋势、合作网络、主题演化及其受新冠疫情(COVID-19)影响的系统性认知不足问题。其解决方案的关键在于构建一个大规模文献语料库,并结合预训练语言模型(pre-trained language models)与生成式AI(generative AI)等先进技术,实现对国家、机构、作者层级的合作模式、研究主题聚类、优先级变迁及跨领域协同趋势的定量分析与可视化呈现。结果表明,疫情显著推动了公共卫生相关研究的优先级提升,并促使该领域向多维韧性策略和跨部门数据共享方向演进,为政策制定者、实践者和学者提供了增强灾害应对能力的战略依据。
链接: https://arxiv.org/abs/2507.16820
作者: Ngan Tran,Haihua Chen,Ana Cleveland,Yuhan Zhou
机构: University of North Texas (北德克萨斯大学)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: 36 pages, 14 figures, 5 tables
Abstract:This study presents a comprehensive bibliometric and topic analysis of the disaster informatics literature published between January 2020 to September 2022. Leveraging a large-scale corpus and advanced techniques such as pre-trained language models and generative AI, we identify the most active countries, institutions, authors, collaboration networks, emergent topics, patterns among the most significant topics, and shifts in research priorities spurred by the COVID-19 pandemic. Our findings highlight (1) countries that were most impacted by the COVID-19 pandemic were also among the most active, with each country having specific research interests, (2) countries and institutions within the same region or share a common language tend to collaborate, (3) top active authors tend to form close partnerships with one or two key partners, (4) authors typically specialized in one or two specific topics, while institutions had more diverse interests across several topics, and (5) the COVID-19 pandemic has influenced research priorities in disaster informatics, placing greater emphasis on public health. We further demonstrate that the field is converging on multidimensional resilience strategies and cross-sectoral data-sharing collaborations or projects, reflecting a heightened awareness of global vulnerability and interdependency. Collecting and quality assurance strategies, data analytic practices, LLM-based topic extraction and summarization approaches, and result visualization tools can be applied to comparable datasets or solve similar analytic problems. By mapping out the trends in disaster informatics, our analysis offers strategic insights for policymakers, practitioners, and scholars aiming to enhance disaster informatics capacities in an increasingly uncertain and complex risk landscape.
zh
[NLP-43] Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach ACL
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)模型在面对对抗攻击(如同义词替换)时的脆弱性问题,特别是针对循环神经网络(Recurrent Neural Networks, RNNs)和现代状态空间模型(State Space Models, SSMs),例如S4,其鲁棒性研究尚不充分。解决方案的关键在于提出一种基于增长边界矩阵(Growth Bound Matrices, GBM)的新正则化技术,通过量化输入扰动对模型输出的影响并加以约束,从而提升模型对词替换攻击的抗干扰能力,同时增强在干净文本上的泛化性能。该方法首次系统分析了S4模型的鲁棒性,并在多个架构和基准数据集上验证了其有效性,相比现有基线模型, adversarial robustness 最多提升8.8%。
链接: https://arxiv.org/abs/2507.10330
作者: Mohammed Bouri,Adnane Saoud
机构: Mohammed VI Polytechnic University (穆罕默德六世理工大学); CID Development
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ACL Findings 2025
Abstract:Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) providing the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Codes are available at this https URL
zh
[NLP-44] Segmentation-free Goodness of Pronunciation
【速读】: 该论文旨在解决现代计算机辅助语言学习(CALL)系统中音素级发音评估的准确性问题,特别是传统基于发音优劣度(Goodness of Pronunciation, GOP)的方法依赖于预先语音分段(pre-segmentation),限制了其与基于连接时序分类(CTC)训练的声学模型结合的可能性。解决方案的关键在于提出两种新方法:一是自对齐GOP(GOP-SA),使CTC模型可用于发音检测与诊断;二是无对齐GOP(GOP-AF),通过考虑目标音素的所有可能对齐方式,实现更鲁棒的评估。GOP-AF进一步引入理论分析、数值稳定性处理及归一化机制,使其适用于不同峰值特性(peakiness)的声学模型,并在CMU Kids和Speechocean762数据集上验证了其优越性能,显著提升了音素级发音评估的效果。
链接: https://arxiv.org/abs/2507.16838
作者: Xinwei Cao,Zijian Fan,Torbjørn Svendsen,Giampiero Salvi
机构: Norwegian University of Science and Technology (挪威科技大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Mispronunciation detection and diagnosis (MDD) is a significant part in modern computer aided language learning (CALL) systems. Within MDD, phoneme-level pronunciation assessment is key to helping L2 learners improve their pronunciation. However, most systems are based on a form of goodness of pronunciation (GOP) which requires pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility to use modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA) that enables the use of CTC-trained ASR models for MDD. Next, we define a more general alignment-free method that takes all possible alignments of the target phoneme into account (GOP-AF). We give a theoretical account of our definition of GOP-AF, an implementation that solves potential numerical issues as well as a proper normalization which makes the method applicable with acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and Speechocean762 datasets comparing the different definitions of our methods, estimating the dependency of GOP-AF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies over the Speechocean762 data showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
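作为背景,传统 GOP 的一个常见定义是在目标音素片段内,比较目标音素与最优竞争音素的平均对数后验;下面的草图演示这一需要预分段的基线计算(帧后验为虚构示例;论文提出的 GOP-SA / GOP-AF 正是为了摆脱这一预分段限制,此处不展开):

```python
import math

# 基于帧级后验的传统 GOP 基线草图(数据为虚构;GOP-AF 会对所有可能对齐求和)
def gop(frame_posteriors, target):
    """frame_posteriors: 目标音素片段内每帧的 {音素: 概率};target: 目标音素。
    GOP = 片段平均 log P(target) - 片段平均 max_q log P(q),越接近 0 发音越好。"""
    n = len(frame_posteriors)
    lp_target = sum(math.log(f[target]) for f in frame_posteriors) / n
    lp_best = sum(math.log(max(f.values())) for f in frame_posteriors) / n
    return lp_target - lp_best

good = gop([{"ae": 0.8, "eh": 0.2}, {"ae": 0.7, "eh": 0.3}], "ae")  # 目标即最优音素
poor = gop([{"ae": 0.2, "eh": 0.8}, {"ae": 0.3, "eh": 0.7}], "ae")  # 目标劣于竞争音素
```

当目标音素在每帧都是后验最大者时 GOP 为 0;片段边界本身的误差会直接污染该分数,这正是无分段方法的动机。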
zh
[NLP-45] Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems
【速读】: 该论文旨在解决语音驱动的对话式人工智能(Voice-based Conversational AI)系统在实际部署中,不同组件(如语音转文字 STT、大语言模型 LLM 和文本转语音 TTS)组合配置对整体性能影响缺乏系统评估的问题。其解决方案的关键在于构建了一个基于大语言模型作为评判者(LLM-as-a-Judge)的自动化评估框架,并通过超过30万次由AI主持的求职面试数据,对四种生产级STT×LLM×TTS组合进行了大规模实证比较,从而识别出最优组件搭配(Google STT + GPT-4.1),并揭示了客观技术指标与用户满意度之间弱相关性的关键发现,为多模态对话系统的组件选型提供了可验证的方法论支持。
链接: https://arxiv.org/abs/2507.16835
作者: Nima Yazdani,Ali Ansari,Aruj Mahajan,Amirhossein Afsharrad,Seyed Shahabeddin Mousavi
机构: Stanford University (斯坦福大学); University of Southern California (南加州大学); micro1
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
备注:
Abstract:Voice-based conversational AI systems increasingly rely on cascaded architectures combining speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. However, systematic evaluation of different component combinations in production settings remains understudied. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data from over 300,000 AI-conducted job interviews. We develop an automated evaluation framework using LLM-as-a-Judge to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of four production configurations reveals that Google STT paired with GPT-4.1 significantly outperforms alternatives in both conversational and technical quality metrics. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversational AI systems and contribute a validated evaluation methodology for voice-based interactions.
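以 LLM-as-a-Judge 分数比较不同 STT×LLM×TTS 组合时,最外层是简单的按组合聚合与排序逻辑,可如下示意(组合名与分数均为虚构示例,非论文数据):

```python
# 按组合聚合 LLM-as-a-Judge 评分并选出最优栈的草图(数据为虚构示例)
scores = [  # (STT|LLM|TTS 组合标识, 单次面试的评委打分)
    ("google_stt|gpt|tts_a", 4.6), ("google_stt|gpt|tts_a", 4.4),
    ("stt_b|llm_b|tts_b", 3.9), ("stt_b|llm_b|tts_b", 4.1),
]

def rank_stacks(records):
    """按组合取均分并降序排序。"""
    agg = {}
    for stack, s in records:
        agg.setdefault(stack, []).append(s)
    means = {k: sum(v) / len(v) for k, v in agg.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_stacks(scores)
best = ranking[0][0]
```

实际研究在此之上还需显著性检验与多维度(对话质量、技术准确性等)分别聚合,此处仅示意选型流程的骨架。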
zh
[NLP-46] Towards Robust Speech Recognition for Jamaican Patois Music Transcription
【速读】: 该论文旨在解决当前自动语音识别(Automatic Speech Recognition, ASR)系统在处理牙买加帕托伊斯语(Jamaican Patois)音乐时性能不佳的问题,导致生成的字幕不准确,限制了内容可及性并阻碍下游应用的发展。解决方案的关键在于采用数据驱动的方法,通过收集超过40小时的手动标注帕托伊斯语音乐数据集,对先进的ASR模型进行微调,并基于此结果推导出Whisper模型在帕托伊斯语音频上的性能扩展规律(scaling laws),从而提升识别准确性并推动该语言的语音建模发展。
链接: https://arxiv.org/abs/2507.16834
作者: Jordan Madden,Matthew Stone,Dimitri Johnson,Daniel Geddez
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Although Jamaican Patois is a widely spoken language, current speech recognition systems perform poorly on Patois music, producing inaccurate captions that limit accessibility and hinder downstream applications. In this work, we take a data-centric approach to this problem by curating more than 40 hours of manually transcribed Patois music. We use this dataset to fine-tune state-of-the-art automatic speech recognition (ASR) models, and use the results to develop scaling laws for the performance of Whisper models on Jamaican Patois audio. We hope that this work will have a positive impact on the accessibility of Jamaican Patois music and the future of Jamaican Patois language modeling.
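文中"scaling laws"通常指性能随数据量按幂律变化(如 WER ≈ a·hours^(−b)),可在对数域用最小二乘拟合;以下草图中的数据为虚构示例,仅演示拟合方法本身:

```python
import math

# 拟合 ASR 性能缩放律的草图:WER ≈ a * hours^(-b),对数域最小二乘(数据为虚构示例)
def fit_power_law(hours, wers):
    xs = [math.log(h) for h in hours]
    ys = [math.log(w) for w in wers]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - slope * mx)
    return a, -slope  # 幂律指数 b = -slope

hours = [5, 10, 20, 40]
wers = [0.50, 0.40, 0.32, 0.256]  # 虚构:训练数据每翻倍,WER 乘以 0.8
a, b = fit_power_law(hours, wers)
```

拟合出的 (a, b) 可外推"还需多少小时标注数据才能把 WER 压到目标值",这正是缩放律对数据收集决策的价值。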
zh
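论文提到基于微调结果推导 Whisper 模型在帕托伊斯语音频上的性能扩展规律。幂律形式的 scaling law(WER = a · hours^(−b))可以在对数空间用线性回归拟合。下面是一个最小示意(数据为虚构,仅演示拟合方法本身,并非论文的真实测量值):

```python
import numpy as np

# Illustrative only: synthetic (hours, WER) points standing in for real
# fine-tuning measurements; the actual scaling-law coefficients come from
# the paper's experiments on Jamaican Patois audio.
hours = np.array([2.5, 5.0, 10.0, 20.0, 40.0])
wer = 0.9 * hours ** -0.3  # ground-truth law used to fabricate the points

# A power law WER = a * hours^(-b) is linear in log-log space:
# log WER = log a - b * log hours, so ordinary least squares recovers (a, b).
slope, intercept = np.polyfit(np.log(hours), np.log(wer), 1)
a_hat, b_hat = np.exp(intercept), -slope
print(f"WER ≈ {a_hat:.2f} * hours^(-{b_hat:.2f})")  # → WER ≈ 0.90 * hours^(-0.30)
```

在噪声数据上,同样的对数空间最小二乘仍给出系数的无偏估计,只是需要更多数据点。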
计算机视觉
[CV-0] Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility ICCV2025
【速读】:该论文旨在解决现代机器学习模型在鲁棒性(robustness)与资源效率(resource-efficiency)之间难以协同优化的问题。其核心解决方案在于:采用高学习率(high learning rates)作为关键策略,能够同时提升模型对虚假相关性(spurious correlations)的鲁棒性以及网络压缩性(network compressibility)。研究表明,高学习率不仅促进不变特征利用(invariant feature utilization)、类别分离(class separation)和激活稀疏性(activation sparsity)等有利表征特性,还在多种数据集、模型架构和优化器设置下展现出优于其他超参数与正则化方法的一致性能优势。此外,作者指出高学习率在标准分类任务中的成功可能源于其对训练集中隐藏或罕见虚假相关性的有效缓解。
链接: https://arxiv.org/abs/2507.17748
作者: Melih Barsbey,Lucas Prieto,Stefanos Zafeiriou,Tolga Birdal
机构: Imperial College London (帝国理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Accepted at ICCV 2025, 23 pages
Abstract:Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we position high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Importantly, our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods, in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is likely due to its effect on addressing hidden/rare spurious correlations in the training dataset.
zh
[CV-1] Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
【速读】:该论文旨在解决当前基于扩散模型的3D生成框架在两阶段扩散流水线中因注意力机制具有二次复杂度而导致的严重计算效率低下问题,尤其是在稀疏体素(sparse voxel)表示下进行高分辨率建模时。其解决方案的关键在于提出Ultra3D框架:首先利用紧凑的VecSet表示在第一阶段高效生成粗略物体布局,显著减少token数量并加速体素坐标预测;其次在第二阶段引入Part Attention机制——一种几何感知的局部注意力机制,将注意力计算限制在语义一致的部件区域内,从而在保持结构连续性的同时避免冗余全局注意力计算,实现高达6.7倍的潜在特征生成速度提升。
链接: https://arxiv.org/abs/2507.17745
作者: Yiwen Chen,Zhihao Li,Yikai Wang,Hu Zhang,Qin Li,Chi Zhang,Guosheng Lin
机构: Nanyang Technological University (南洋理工大学); Math Magic; Tsinghua University (清华大学); School of Artificial Intelligence, Beijing Normal University (北京师范大学人工智能学院); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL
Abstract:Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
zh
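Part Attention 的核心思想是将注意力计算限制在同一部件区域内。下面用 NumPy 给出一个通用的掩码注意力示意(仅演示"同部件 token 才可互相注意"的机制,函数与参数均为示意,并非论文的实现):

```python
import numpy as np

def part_attention(q, k, v, part_id):
    """Localized attention sketch: token i may only attend to tokens with
    the same part label, emulating attention restricted to semantically
    consistent part regions. Generic illustration, not the paper's code."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    mask = part_id[:, None] == part_id[None, :]  # same-part pairs only
    scores = np.where(mask, scores, -np.inf)     # block cross-part attention
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
parts = np.array([0, 0, 0, 1, 1, 1])  # two part regions of 3 tokens each
out = part_attention(q, k, v, parts)
```

由于每个 token 只在自己的部件内做 softmax,注意力的计算量从全局二次复杂度降为各部件内部的局部开销。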
[CV-2] Yume: An Interactive World Generation Model
【速读】:该论文旨在解决如何从单一输入图像生成高保真、可交互的动态视频世界,并支持通过键盘或神经信号等外设进行探索与控制的问题。其核心解决方案在于提出一个包含四个关键组件的框架:首先对相机运动进行量化以实现稳定训练和友好交互;其次引入带记忆模块的掩码视频扩散Transformer(Masked Video Diffusion Transformer, MVDT)以实现自回归方式下的无限视频生成;再者,采用无需训练的抗伪影机制(Anti-Artifact Mechanism, AAM)和基于随机微分方程的时间旅行采样(Time Travel Sampling based on Stochastic Differential Equations, TTS-SDE),提升视觉质量和精确控制能力;最后通过对抗蒸馏与缓存机制的协同优化实现模型加速。该方法在高质量世界探索数据集 Sekai 上训练后,在多样化场景中表现出显著效果。
链接: https://arxiv.org/abs/2507.17744
作者: Xiaofeng Mao,Shaoheng Lin,Zhen Li,Chuanhao Li,Wenshuo Peng,Tong He,Jiangmiao Pang,Mingmin Chi,Yu Qiao,Kaipeng Zhang
机构: Shanghai AI Laboratory (上海人工智能实验室); Fudan University (复旦大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world, which allows exploration and control using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of the world using keyboard actions. To achieve this high-fidelity and interactive video world generation, we introduce a well-designed framework, which consists of four main components, including camera motion quantization, video generation architecture, advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction using keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration by synergistic optimization of adversarial distillation and caching mechanisms. We use the high-quality world exploration dataset Sekai to train Yume, and it achieves remarkable results in diverse scenes and applications. All data, codebase, and model weights are available on this https URL. Yume will update monthly to achieve its original goal. Project page: this https URL.
zh
[CV-3] A Comprehensive Evaluation Framework for the Study of the Effects of Facial Filters on Face Recognition Accuracy
【速读】:该论文旨在解决面部滤镜(Facial Filters)对自动化人脸识别性能的负面影响问题,尤其关注现有研究多局限于少量特定风格滤镜、难以反映社交应用中广泛存在的多样化滤镜现象。其解决方案的关键在于提出一个系统性框架,包含受控的数据集、基于代表性原则的滤镜选择流程以及用于评估滤镜影响的实验设计,从而实现对跨文化场景下(如Instagram、Snapchat与Meitu、Pitu)滤镜效应的大规模量化分析,并进一步通过在人脸嵌入空间(face embedding space)中检测和恢复滤镜干扰来提升识别准确率。
链接: https://arxiv.org/abs/2507.17729
作者: Kagan Ozturk,Louisa Conwill,Jacob Gutierrez,Kevin Bowyer,Walter J. Scheirer
机构: University of Notre Dame (圣母大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial filters are now commonplace for social media users around the world. Previous work has demonstrated that facial filters can negatively impact automated face recognition performance. However, these studies focus on small numbers of hand-picked filters in particular styles. In order to more effectively incorporate the wide ranges of filters present on various social media applications, we introduce a framework that allows for larger-scale study of the impact of facial filters on automated recognition. This framework includes a controlled dataset of face images, a principled filter selection process that selects a representative range of filters for experimentation, and a set of experiments to evaluate the filters’ impact on recognition. We demonstrate our framework with a case study of filters from the American applications Instagram and Snapchat and the Chinese applications Meitu and Pitu to uncover cross-cultural differences. Finally, we show how the filtering effect in a face embedding space can easily be detected and restored to improve face recognition performance.
zh
[CV-4] CA-Cut: Crop-Aligned Cutout for Data Augmentation to Learn More Robust Under-Canopy Navigation
【速读】:该论文旨在解决视觉下冠层导航中因训练数据稀缺导致模型泛化能力不足的问题,尤其是在作物行间距不均、遮挡频繁及杂物干扰等复杂环境下,传统基于深度学习的感知模型难以可靠区分可通行区域。解决方案的关键在于提出一种新颖的数据增强方法——作物对齐掩蔽(Crop-Aligned Cutout, CA-Cut),该方法通过在输入图像中随机掩蔽位于作物行附近的区域,引导模型学习高阶上下文特征,从而提升在细粒度信息缺失时的语义关键点预测鲁棒性。实验表明,将掩蔽分布偏向作物行位置可显著降低预测误差(最高达36.9%),并增强跨环境的泛化性能。
链接: https://arxiv.org/abs/2507.17727
作者: Robel Mamo,Taeyeong Choi
机构: Kennesaw State University (肯尼索州立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the 12th European Conference on Mobile Robots (ECMR 2025)
Abstract:State-of-the-art visual under-canopy navigation methods are designed with deep learning-based perception models to distinguish traversable space from crop rows. While these models have demonstrated successful performance, they require large amounts of training data to ensure reliability in real-world field deployment. However, data collection is costly, demanding significant human resources for in-field sampling and annotation. To address this challenge, various data augmentation techniques are commonly employed during model training, such as color jittering, Gaussian blur, and horizontal flip, to diversify training data and enhance model robustness. In this paper, we hypothesize that utilizing only these augmentation techniques may lead to suboptimal performance, particularly in complex under-canopy environments with frequent occlusions, debris, and non-uniform spacing of crops. Instead, we propose a novel augmentation method, so-called Crop-Aligned Cutout (CA-Cut) which masks random regions out in input images that are spatially distributed around crop rows on the sides to encourage trained models to capture high-level contextual features even when fine-grained information is obstructed. Our extensive experiments with a public cornfield dataset demonstrate that masking-based augmentations are effective for simulating occlusions and significantly improving robustness in semantic keypoint predictions for visual navigation. In particular, we show that biasing the mask distribution toward crop rows in CA-Cut is critical for enhancing both prediction accuracy and generalizability across diverse environments achieving up to a 36.9% reduction in prediction error. In addition, we conduct ablation studies to determine the number of masks, the size of each mask, and the spatial distribution of masks to maximize overall performance.
zh
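CA-Cut 的做法是在作物行附近按有偏分布随机掩蔽图像区域。下面是一个简化示意(掩蔽数量、尺寸与采样分布等参数均为假设值,并非论文的设定):

```python
import numpy as np

def ca_cut(img, row_x, n_masks=3, mask_size=8, spread=5.0, rng=None):
    """Crop-Aligned Cutout sketch: zero out square patches whose centres
    are sampled around a crop-row column `row_x` (Gaussian in x, uniform
    in y). Parameter names and defaults are illustrative, not the paper's."""
    rng = np.random.default_rng(rng)
    out = img.copy()
    h, w = img.shape[:2]
    half = mask_size // 2
    for _ in range(n_masks):
        # bias mask centres toward the crop-row column
        cx = int(np.clip(rng.normal(row_x, spread), 0, w - 1))
        cy = int(rng.integers(0, h))
        out[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half] = 0
    return out

img = np.ones((64, 64, 3), dtype=np.float32)
aug = ca_cut(img, row_x=32, rng=0)  # masks cluster near column 32
```

将 `spread` 调大即可退化为接近普通 Cutout 的均匀掩蔽,对应论文中"掩蔽分布偏向作物行是否关键"的消融对比。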
[CV-5] On the Interaction of Compressibility and Adversarial Robustness
【速读】:该论文旨在解决神经网络中压缩性(compressibility)与对抗鲁棒性(adversarial robustness)之间的相互作用机制不明确的问题。当前研究虽分别深入探讨了压缩性和鲁棒性,但缺乏统一框架来理解二者如何协同或冲突影响模型性能。解决方案的关键在于构建一个理论框架,系统分析不同形式的压缩性(如神经元级稀疏性和谱压缩性)如何在表示空间中诱导出少数高度敏感的方向,从而被攻击者利用生成有效的对抗扰动。该框架进一步推导出简洁而具指导意义的鲁棒性上界,揭示压缩性通过改变学习到的表示结构,对 L∞ 和 L2 鲁棒性的影响路径。重要的是,这些脆弱性不依赖于压缩的具体实现方式(如正则化、架构偏置或隐式学习动力学),并通过合成和真实任务的实证验证了理论预测,表明此类脆弱性在对抗训练和迁移学习下依然存在,并可能促进通用对抗扰动(universal adversarial perturbations)的出现,最终指出结构化压缩与鲁棒性之间存在根本性权衡。
链接: https://arxiv.org/abs/2507.17725
作者: Melih Barsbey,Antônio H. Ribeiro,Umut Şimşekli,Tolga Birdal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Modern neural networks are expected to simultaneously satisfy a host of desirable properties: accurate fitting to training data, generalization to unseen inputs, parameter and computational efficiency, and robustness to adversarial perturbations. While compressibility and robustness have each been studied extensively, a unified understanding of their interaction still remains elusive. In this work, we develop a principled framework to analyze how different forms of compressibility - such as neuron-level sparsity and spectral compressibility - affect adversarial robustness. We show that these forms of compression can induce a small number of highly sensitive directions in the representation space, which adversaries can exploit to construct effective perturbations. Our analysis yields a simple yet instructive robustness bound, revealing how neuron and spectral compressibility impact L∞ and L2 robustness via their effects on the learned representations. Crucially, the vulnerabilities we identify arise irrespective of how compression is achieved - whether via regularization, architectural bias, or implicit learning dynamics. Through empirical evaluations across synthetic and realistic tasks, we confirm our theoretical predictions, and further demonstrate that these vulnerabilities persist under adversarial training and transfer learning, and contribute to the emergence of universal adversarial perturbations. Our findings show a fundamental tension between structured compressibility and robustness, and suggest new pathways for designing models that are both efficient and secure.
zh
[CV-6] BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems ITSC
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在自动驾驶感知系统中应用时存在的幻觉问题,即模型可能遗漏真实存在的交通参与者(如弱势道路使用者),或错误地“生成”不存在的元素,这会严重影响高级驾驶辅助系统(ADAS)或自动驾驶系统(ADS)的安全性。解决方案的关键在于系统性评估三种前沿VLMs在Waymo Open Dataset子集上的表现,并提出一种名为BetterCheck的幻觉检测策略,以识别和缓解VLM输出中的不实描述,从而为基于VLM的感知系统提供安全护栏。
链接: https://arxiv.org/abs/2507.17722
作者: Malsha Ashani Mahawatta Dona,Beatriz Cabrero-Daniel,Yinan Yu,Christian Berger
机构: University of Gothenburg (哥德堡大学); Chalmers University of Technology (查尔姆斯理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in The IEEE International Conference on Intelligent Transportation Systems (ITSC)2025
Abstract:Large language models (LLMs) are increasingly extended to process multimodal data such as text and video simultaneously. Their remarkable performance in understanding what is shown in images is surpassing specialized neural networks (NNs) such as Yolo, which support only a well-formed but very limited vocabulary, i.e., the objects they are able to detect. When unrestricted, LLMs and in particular state-of-the-art vision language models (VLMs) show impressive performance in describing even complex traffic situations. This makes them potentially suitable components for automotive perception systems to support the understanding of complex traffic situations or edge cases. However, LLMs and VLMs are prone to hallucination, meaning they may either fail to see traffic agents, such as vulnerable road users, who are present in a situation, or report traffic agents who are not there in reality. While the latter is unwanted, making an ADAS or autonomous driving system (ADS) slow down unnecessarily, the former could lead to disastrous decisions from an ADS. In our work, we systematically assess the performance of 3 state-of-the-art VLMs on a diverse subset of traffic situations sampled from the Waymo Open Dataset to support safety guardrails for capturing such hallucinations in VLM-supported perception systems. We observe that both proprietary and open VLMs exhibit remarkable image understanding capabilities, even paying thorough attention to fine details that are sometimes difficult for us humans to spot. However, they are still prone to making up elements in their descriptions, requiring hallucination detection strategies such as BetterCheck, which we propose in our work.
zh
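论文关注 VLM 描述中的两类幻觉:漏报真实存在的交通参与者,以及虚构不存在的元素。下面用一个极简的集合比较示意这两类错误的检测思路(BetterCheck 的实际策略远比此复杂,此处仅作说明,函数名为示意):

```python
def check_description(described, ground_truth):
    """Toy guardrail in the spirit of BetterCheck: compare the set of
    traffic agents a VLM mentions against annotated ground truth and
    flag both error types. The paper's actual strategy is more elaborate."""
    described, ground_truth = set(described), set(ground_truth)
    return {
        "missed": sorted(ground_truth - described),        # dangerous: unseen agents
        "hallucinated": sorted(described - ground_truth),  # spurious agents
        "ok": described == ground_truth,
    }

report = check_description(
    described=["car", "cyclist", "dog"],
    ground_truth=["car", "cyclist", "pedestrian"],
)
print(report)  # missed: ['pedestrian'], hallucinated: ['dog']
```

如摘要所述,"missed" 一类(如漏掉弱势道路使用者)对 ADS 的危害远大于 "hallucinated" 一类(仅导致不必要减速)。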
[CV-7] Joint Asymmetric Loss for Learning with Noisy Labels ICCV2025
【速读】:该论文旨在解决标签噪声下深度神经网络训练中的准确率下降问题,特别是现有对称损失函数因约束过严导致的欠拟合问题。其解决方案的关键在于提出一种新的非对称均方误差(Asymmetric Mean Square Error, AMSE)损失函数,并将其引入到主动-被动损失(Active Passive Loss, APL)框架中,构建出联合非对称损失(Joint Asymmetric Loss, JAL)框架。AMSE在理论上满足非对称条件的充要条件,且与APL兼容,从而有效提升模型在标签噪声环境下的鲁棒性和拟合能力。
链接: https://arxiv.org/abs/2507.17692
作者: Jialiang Wang,Xianming Liu,Xiong Zhou,Gangfeng Hu,Deming Zhai,Junjun Jiang,Xiangyang Ji
机构: Harbin Institute of Technology (哈尔滨工业大学); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Learning with noisy labels is a crucial task for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions, particularly symmetric losses. Nevertheless, symmetric losses usually suffer from the underfitting issue due to the overly strict constraint. To address this problem, the Active Passive Loss (APL) jointly optimizes an active and a passive loss to mutually enhance the overall fitting ability. Within APL, symmetric losses have been successfully extended, yielding advanced robust loss functions. Despite these advancements, emerging theoretical analyses indicate that asymmetric losses, a new class of robust loss functions, possess superior properties compared to symmetric losses. However, existing asymmetric losses are not compatible with advanced optimization frameworks such as APL, limiting their potential and applicability. Motivated by this theoretical gap and the prospect of asymmetric losses, we extend the asymmetric loss to the more complex passive loss scenario and propose the Asymmetric Mean Square Error (AMSE), a novel asymmetric loss. We rigorously establish the necessary and sufficient condition under which AMSE satisfies the asymmetric condition. By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL). Extensive experiments demonstrate the effectiveness of our method in mitigating label noise. Code available at: this https URL
zh
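APL 框架将损失写成主动项与被动项的加权和。下面给出该骨架的一个示意实现:主动项用交叉熵,被动项用作用于非目标类的均方误差作为 AMSE 的占位(AMSE 的精确非对称形式请以论文为准,此处仅演示 APL 的组合结构):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def joint_loss(logits, labels, alpha=1.0, beta=1.0):
    """Active-Passive Loss skeleton: alpha * active + beta * passive.

    Active term: standard cross-entropy on the labelled class.
    Passive term: a mean-square penalty on the non-labelled classes,
    standing in for the paper's AMSE (exact asymmetric form not reproduced)."""
    p = softmax(logits)
    idx = np.arange(len(labels))
    active = -np.log(p[idx, labels] + 1e-12).mean()
    q = p.copy()
    q[idx, labels] = 0.0               # keep only non-target probabilities
    passive = (q ** 2).sum(axis=1).mean()  # push mass off non-target classes
    return alpha * active + beta * passive

logits = np.array([[4.0, 0.0, 0.0], [0.0, 3.0, 1.0]])
labels = np.array([0, 1])
print(round(joint_loss(logits, labels), 4))
```

论文的贡献正是证明了哪一类非对称被动损失既满足噪声鲁棒的非对称条件、又能嵌入这种主动-被动组合。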
[CV-8] Audio-Vision Contrastive Learning for Phonological Class Recognition
【速读】:该论文旨在解决语音生产中发音特征(articulatory-phonological features)准确分类的问题,特别是在临床场景下,精准的音位分析对疾病诊断和个性化康复治疗具有重要意义。其核心挑战在于如何融合多模态信息以提升分类性能,尤其是从发音动作和语音信号中提取互补特征。解决方案的关键在于提出一种基于对比学习(contrastive learning)的多模态深度学习框架,将实时磁共振成像(rtMRI)与语音信号进行深度融合,通过对比表示学习增强跨模态一致性,从而显著提升对三个关键发音维度(发音方式、发音部位、清浊音)的分类准确率。实验表明,该方法在USC-TIMIT数据集上达到平均F1分数0.81,较单模态基线提升0.23,验证了对比学习在多模态发音分析中的有效性。
链接: https://arxiv.org/abs/2507.17682
作者: Daiqi Liu,Tomás Arias-Vergara,Jana Hutter,Andreas Maier,Paula Andrea Pérez-Toro
机构: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany (模式识别实验室,埃尔朗根-纽伦堡弗里德里希-亚历山大大学,德国); Smart Imaging Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany (智能成像实验室,埃尔朗根-纽伦堡弗里德里希-亚历山大大学,德国); GITA Lab, Universidad de Antioquia UdeA, Medellín, Colombia (GITA 实验室,安蒂奥基亚大学,麦德林,哥伦比亚)
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: conference to TSD 2025
Abstract:Accurate classification of articulatory-phonological features plays a vital role in understanding human speech production and developing robust speech technologies, particularly in clinical contexts where targeted phonemic analysis and therapy can improve disease diagnosis accuracy and personalized rehabilitation. In this work, we propose a multimodal deep learning framework that combines real-time magnetic resonance imaging (rtMRI) and speech signals to classify three key articulatory dimensions: manner of articulation, place of articulation, and voicing. We perform classification on 15 phonological classes derived from the aforementioned articulatory dimensions and evaluate the system with four audio/vision configurations: unimodal rtMRI, unimodal audio signals, multimodal middle fusion, and contrastive learning-based audio-vision fusion. Experimental results on the USC-TIMIT dataset show that our contrastive learning-based approach achieves state-of-the-art performance, with an average F1-score of 0.81, representing an absolute increase of 0.23 over the unimodal baseline. The results confirm the effectiveness of contrastive representation learning for multimodal articulatory analysis. Our code and processed dataset will be made publicly available at this https URL to support future research.
zh
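论文采用基于对比学习的音频-视觉融合。下面用 NumPy 示意标准的对称 InfoNCE 目标:配对的音频/视觉嵌入在相似度矩阵对角线上互为正样本(这是此类方法常用的通用对比损失,并非论文的具体实现):

```python
import numpy as np

def info_nce(audio, vision, temperature=0.1):
    """Symmetric InfoNCE over paired audio/vision embeddings. Rows of
    `audio` and `vision` are L2-normalised embeddings of the same
    utterance/frame pair at the same index."""
    logits = audio @ vision.T / temperature  # cosine similarities
    n = logits.shape[0]

    def ce(l):
        # cross-entropy with the matching pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # both retrieval directions

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
a /= np.linalg.norm(a, axis=1, keepdims=True)
v = a + 0.05 * rng.normal(size=(4, 8))       # near-aligned vision embeddings
v /= np.linalg.norm(v, axis=1, keepdims=True)
print("loss on near-aligned pairs:", round(info_nce(a, v), 3))
```

损失越小,表示同一发音时刻的 rtMRI 帧与语音片段在嵌入空间中越对齐。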
[CV-9] Perspective-Invariant 3D Object Detection ICCV2025
【速读】:该论文旨在解决当前LiDAR-based 3D物体检测研究中对非车辆平台(如四足机器人和无人机)数据与方法覆盖不足的问题,从而推动跨平台(cross-platform)3D感知系统的通用性发展。其解决方案的关键在于提出首个多平台基准数据集Pi3DET,并设计一种新颖的跨平台自适应框架,通过几何级和特征级的鲁棒对齐实现视角不变(perspective-invariant)的3D检测,有效将车辆平台的知识迁移至其他平台,显著提升了复杂场景下跨平台3D检测的性能与鲁棒性。
链接: https://arxiv.org/abs/2507.17665
作者: Ao Liang,Lingdong Kong,Dongyue Lu,Youquan Liu,Jian Fang,Huaici Zhao,Wei Tsang Ooi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICCV 2025; 46 pages, 18 figures, 22 tables; Project Page at this https URL
Abstract:With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit have been made publicly available.
zh
[CV-10] Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras
【速读】:该论文旨在解决事件相机(event camera)的异步数据流与人类语言之间难以对齐的问题,即如何实现基于语言指令的物体定位(language-driven object grounding)在事件感知场景中的有效应用。其核心挑战在于利用事件数据中高时间分辨率和抗运动模糊的优势,同时融合空间、时间和关系等多维语义信息以支持精准的语义理解。解决方案的关键在于提出首个大规模基准数据集 Talk2Event 和对应的 EventRefer 框架:前者提供了超过 3 万条经验证的指代表达式,并标注了外观、状态、视角关系及对象间关系四类属性;后者通过 Mixture of Event-Attribute Experts (MoEE) 动态融合多属性表示,实现对不同模态和场景动态变化的自适应建模,在纯事件、纯帧以及事件-帧融合等多种设置下均显著优于当前最优基线方法。
链接: https://arxiv.org/abs/2507.17664
作者: Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Preprint; 42 pages, 17 figures, 16 tables; Project Page at this https URL
Abstract:Event cameras offer microsecond-level latency and robustness to motion blur, making them ideal for understanding dynamic environments. Yet, connecting these asynchronous streams to human language remains an open challenge. We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding in event-based perception. Built from real-world driving data, we provide over 30,000 validated referring expressions, each enriched with four grounding attributes – appearance, status, relation to viewer, and relation to other objects – bridging spatial, temporal, and relational reasoning. To fully exploit these cues, we propose EventRefer, an attribute-aware grounding framework that dynamically fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). Our method adapts to different modalities and scene dynamics, achieving consistent gains over state-of-the-art baselines in event-only, frame-only, and event-frame fusion settings. We hope our dataset and approach will establish a foundation for advancing multimodal, temporally-aware, and language-driven perception in real-world robotics and autonomy.
zh
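MoEE 通过门控机制动态融合多属性专家表示。下面是一个最小的 softmax 门控融合示意(真实模块的结构与输入要复杂得多,此处仅演示"按门控权重凸组合专家特征"的思想,函数名为示意):

```python
import numpy as np

def moee_fuse(attr_feats, gate_logits):
    """Mixture-of-experts fusion sketch, loosely following the MoEE idea:
    per-attribute expert features are combined with softmax gates.

    attr_feats:  (n_attrs, d) expert outputs for one object query
    gate_logits: (n_attrs,)   unnormalised gating scores"""
    g = np.exp(gate_logits - gate_logits.max())
    g /= g.sum()            # softmax over the attribute experts
    return g @ attr_feats   # convex combination of expert features

# 4 attribute experts (appearance, status, viewer relation, object relation),
# here with one-hot toy features so the fused vector equals the gate weights.
feats = np.eye(4)
fused = moee_fuse(feats, np.array([2.0, 0.0, 0.0, 0.0]))
```

门控 logits 在实际模型中由场景内容预测得出,使融合权重能随模态与场景动态自适应变化。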
[CV-11] Monocular Semantic Scene Completion via Masked Recurrent Networks ICCV2025
【速读】:该论文旨在解决单目语义场景补全(Monocular Semantic Scene Completion, MSSC)中因深度估计不准确导致的性能瓶颈问题,尤其在复杂场景下现有单阶段方法难以同时实现可见区域分割与遮挡区域生成。其解决方案的关键在于提出一种两阶段统一框架MonoMRN:第一阶段进行粗粒度MSSC预测,第二阶段引入掩码递归网络(Masked Recurrent Network),核心创新包括基于掩码更新机制的掩码稀疏门控循环单元(Masked Sparse Gated Recurrent Unit, MS-GRU),该设计聚焦于已知占位区域并降低计算开销;以及距离注意力投影(distance attention projection),通过根据距观测表面的距离分配不同注意力权重来减少投影误差。实验证明该框架在NYUv2和SemanticKITTI数据集上均达到最优性能,并展现出对多种扰动的鲁棒性。
链接: https://arxiv.org/abs/2507.17661
作者: Xuzhi Wang,Xinran Wu,Song Wang,Lingdong Kong,Ziping Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: ICCV 2025; 15 pages, 10 figures, 6 tables; Code at this https URL
Abstract:Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model’s resilience to such challenges. The source code is publicly available.
zh
[CV-12] See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在知识驱动型视觉问答(Knowledge-Based Visual Question Answering, KBVQA)任务中因依赖单一维度证据而导致的推理局限性问题,即“只见树木,不见森林”的认知盲区,从而难以实现多角度、鲁棒且全面的视觉理解。解决方案的关键在于提出Synergos-VQA框架,该框架在推理阶段并行生成并融合三种互补的证据流:(1)整体性证据(Holistic Evidence)用于感知场景全局信息(“森林”),(2)结构化证据(Structural Evidence)通过原型驱动模块识别关键对象(“树木”),以及(3)因果证据(Causal Evidence)利用反事实探针确保推理的稳健性与可解释性。这种多维证据的协同融合机制显著提升了模型的推理深度与可靠性,实验证明其在OK-VQA和A-OKVQA等挑战性基准上达到新的最先进性能,并具备良好的模块化扩展能力。
链接: https://arxiv.org/abs/2507.17659
作者: Junjie Wang,Yunhan Tang,Yijie Wang,Zhihao Yuan,Huan Wang,Yangfan He,Bin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have pushed the frontiers of Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is fundamentally bottlenecked by a reliance on uni-dimensional evidence. This “seeing only the trees, but not the forest” approach prevents robust, multi-faceted understanding. Inspired by the principle of seeing both the forest and trees, we propose Synergos-VQA, a novel synergistic reasoning framework. At its core, Synergos-VQA concurrently generates and fuses three complementary evidence streams at inference time: (1) Holistic Evidence to perceive the entire scene (the “forest”), (2) Structural Evidence from a prototype-driven module to identify key objects (the “trees”), and (3) Causal Evidence from a counterfactual probe to ensure the reasoning is robustly grounded. By synergistically fusing this multi-faceted evidence, our framework achieves a more comprehensive and reliable reasoning process. Extensive experiments show that Synergos-VQA decisively establishes a new state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA. Furthermore, our approach demonstrates strong plug-and-play capabilities, significantly boosting various open-source MLLMs and proving that superior methodological design can outperform sheer model scale.
zh
[CV-13] Attention (as Discrete-Time Markov) Chains
【速读】:该论文试图解决视觉Transformer中注意力机制的解释性不足与局部注意力局限的问题,即如何更深入理解token之间的注意力关系,并利用这种理解提升模型在零样本分割和图像生成等任务中的性能。其解决方案的关键在于将注意力矩阵重新诠释为离散时间马尔可夫链(discrete-time Markov chain),从而在统一框架下解析选择、求和与平均等常见操作,并引入间接注意力(indirect attention)的概念,通过马尔可夫链传播建模长期依赖关系;同时识别出语义相似区域对应的“亚稳态”(metastable states),并利用矩阵乘法与特征值分析高效计算这些状态及其分布,进而提出TokenRank——即马尔可夫链的稳态向量,用于衡量全局token重要性,在无需标注数据的情况下显著提升零样本分割性能,并改进无条件图像生成质量。
链接: https://arxiv.org/abs/2507.17657
作者: Yotam Erel,Olaf Dünkel,Rishabh Dabral,Vladislav Golyanik,Christian Theobalt,Amit H. Bermano
机构: Tel Aviv University (特拉维夫大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We introduce a new interpretation of the attention matrix as a discrete-time Markov chain. Our interpretation sheds light on common operations involving attention scores such as selection, summation, and averaging in a unified framework. It further extends them by considering indirect attention, propagated through the Markov chain, as opposed to previous studies that only model immediate effects. Our main observation is that tokens corresponding to semantically similar regions form a set of metastable states, where the attention clusters, while noisy attention scores tend to disperse. Metastable states and their prevalence can be easily computed through simple matrix multiplication and eigenanalysis, respectively. Using these lightweight tools, we demonstrate state-of-the-art zero-shot segmentation. Lastly, we define TokenRank – the steady state vector of the Markov chain, which measures global token importance. We demonstrate that using it brings improvements in unconditional image generation. We believe our framework offers a fresh view of how tokens are being attended in modern visual transformers.
zh
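TokenRank 定义为将注意力矩阵视作马尔可夫链时的稳态向量。对行随机矩阵,可用幂迭代求解左特征问题 π = πA。下面是一个可运行的最小示意(函数名为示意):

```python
import numpy as np

def token_rank(attn, iters=100, tol=1e-10):
    """Steady-state vector of a row-stochastic attention matrix.

    Treats each row of `attn` as transition probabilities of a
    discrete-time Markov chain over tokens and runs power iteration
    on the left eigenproblem pi = pi @ attn."""
    n = attn.shape[0]
    pi = np.full(n, 1.0 / n)      # uniform start distribution
    for _ in range(iters):
        nxt = pi @ attn           # one Markov-chain step
        nxt /= nxt.sum()          # guard against numeric drift
        if np.abs(nxt - pi).max() < tol:
            break
        pi = nxt
    return pi

# Toy 3-token attention matrix (rows sum to 1, as after softmax).
A = np.array([[0.8, 0.1, 0.1],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])
pi = token_rank(A)
print(pi.round(3))  # token 0 attracts the most mass
```

稳态分量大的 token 即全局重要性高的 token;论文中的间接注意力则对应该链多步传播后的转移概率。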
[CV-14] CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts ICCV2025
【速读】:该论文旨在解决图像分类模型在现实世界中面对连续且真实的分布外(Out-of-Distribution, OOD)扰动时,其鲁棒性难以被准确评估的问题。现有方法多依赖于简单的合成噪声或仅能模拟二元扰动的扩散模型,无法刻画真实场景中复杂、渐进式的干扰变化。解决方案的关键在于提出CNS-Bench——一个连续扰动基准测试平台,通过在扩散模型中引入LoRA适配器(LoRA adapters)实现对多种扰动强度的连续控制,从而生成具有现实感的、可调强度的扰动图像;同时设计了一种过滤机制以提升生成样本质量并增强基准测试的可靠性,使得模型鲁棒性评估能够覆盖更广泛的扰动尺度,并识别出模型性能下降的关键临界点,从而提供比传统二元扰动更细致和实用的评估视角。
链接: https://arxiv.org/abs/2507.17651
作者: Olaf Dünkel,Artur Jesslen,Jiahao Xie,Christian Theobalt,Christian Rupprecht,Adam Kortylewski
机构: Max Planck Institute for Informatics (马普所信息学研究所); University of Freiburg (弗莱堡大学); University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025. Project page: this https URL
Abstract:An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: this https URL.
zh
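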
[CV-15] The Early Bird Identifies the Worm: You Can't Beat a Head Start in Long-Term Body Re-ID (ECHO-BID)
【速读】:该论文旨在解决在非受限观测环境下人体再识别(Person Re-Identification, Re-ID)的难题,尤其针对因距离、视角、成像条件及衣物变化等因素导致的识别性能下降问题。其解决方案的核心是提出ECHO-BID模型,该模型基于预训练于大规模对象分类任务的EVA-02 Large骨干网络,并通过在最具挑战性的衣物更换数据上进行迁移学习,显著提升了长期人体再识别性能。关键创新在于:一方面利用更大模型规模与掩码图像建模(Masked Image Modeling)预训练策略增强特征表达能力;另一方面发现更具挑战性的较小数据集在跨数据集泛化上优于更大但较易的数据集,而进一步微调则能在最困难场景中实现最优效果。这一结果表明,正确选择预训练骨干架构和迁移学习协议对提升长期Re-ID性能具有决定性作用。
链接: https://arxiv.org/abs/2507.17640
作者: Thomas M. Metz,Matthew Q. Hill,Alice J. O’Toole
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Person identification in unconstrained viewing environments presents significant challenges due to variations in distance, viewpoint, imaging conditions, and clothing. We introduce Eva Clothes-Change from Hidden Objects - Body IDentification (ECHO-BID), a class of long-term re-id models built on object-pretrained EVA-02 Large backbones. We compare ECHO-BID to 9 other models that vary systematically in backbone architecture, model size, scale of object classification pretraining, and transfer learning protocol. Models were evaluated on benchmark datasets across constrained, unconstrained, and occluded settings. ECHO-BID, with transfer learning on the most challenging clothes-change data, achieved state-of-the-art results on long-term re-id – substantially outperforming other methods. ECHO-BID also surpassed other methods by a wide margin in occluded viewing scenarios. A combination of increased model size and Masked Image Modeling during pretraining underlie ECHO-BID’s strong performance on long-term re-id. Notably, a smaller, but more challenging transfer learning dataset, generalized better across datasets than a larger, less challenging one. However, the larger dataset with an additional fine-tuning step proved best on the most difficult data. Selecting the correct pretrained backbone architecture and transfer learning protocols can drive substantial gains in long-term re-id performance.
zh
[CV-16] Reusing Attention for One-stage Lane Topology Understanding IROS2025
【速读】:该论文旨在解决自动驾驶中车道拓扑关系理解不准确的问题,现有两阶段方法因误差传播和计算开销大而导致效率低下。其解决方案的关键在于提出一种单阶段架构,同时预测交通元素、车道中心线及拓扑关系,并通过复用不同Transformer解码器中的中间注意力资源,有效利用元素检测模块内的固有关系知识来建模交通元素与车道之间的拓扑关系,无需额外的计算密集型图网络;此外,首次证明了可从使用标准定义(Standard Definition, SD)地图的模型中蒸馏知识至无SD地图模型,从而在缺乏SD地图的情况下仍能实现优异性能。
链接: https://arxiv.org/abs/2507.17617
作者: Yang Li,Zongzheng Zhang,Xuchong Qiu,Xinrun Li,Ziming Liu,Leichen Wang,Ruikai Li,Zhenxin Zhu,Huan-ang Gao,Xiaojian Lin,Zhiyong Cui,Hang Zhao,Hao Zhao
机构: Institute for AI Industry Research, Tsinghua University (清华大学人工智能产业研究院); Bosch Corporate Research, China (博世中国研究中心); Institute for Interdisciplinary Information Sciences, Tsinghua University (清华大学交叉信息研究院); Department of Computer Science, ETH Zürich (苏黎世联邦理工学院计算机科学系); State Key Lab of Intelligent Transportation System, Beihang University (北京航空航天大学智能交通系统国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IROS 2025, Project Page: this https URL
Abstract:Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overheads. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines and topology relationships, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at this https URL.
zh
[CV-17] Vision Transformer attention alignment with human visual perception in aesthetic object evaluation
【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 模型的注意力机制与人类视觉注意模式在审美评价场景下的对齐问题,尤其是针对手工制品(如编织包和姜瓶)的视觉感知差异。其解决方案的关键在于通过眼动实验获取30名参与者的人类视觉注意热图,并结合预训练的ViT模型(基于DINO自蒸馏方法)提取12个注意力头的注意力图,利用Kullback-Leibler散度量化两者分布差异,在不同高斯核参数(sigma=0.1–3.0)下进行比较分析。结果表明,特定注意力头(如第12头)与人类注意模式具有显著相关性(最优sigma=2.4±0.03),且某些头部(如第7和第9头)明显偏离人类注意模式(p<0.05,Tukey HSD检验),揭示了ViT虽具全局注意力特性,但部分注意力头可有效模拟人类聚焦行为,尤其在识别特定物体特征(如编织品上的扣件)时表现突出。
链接: https://arxiv.org/abs/2507.17616
作者: Miguel Carrasco,César González-Martín,José Aranda,Luis Oliveros
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 15 figures
Abstract:Visual attention mechanisms play a crucial role in human perception and aesthetic evaluation. Recent advances in Vision Transformers (ViTs) have demonstrated remarkable capabilities in computer vision tasks, yet their alignment with human visual attention patterns remains underexplored, particularly in aesthetic contexts. This study investigates the correlation between human visual attention and ViT attention mechanisms when evaluating handcrafted objects. We conducted an eye-tracking experiment with 30 participants (9 female, 21 male, mean age 24.6 years) who viewed 20 artisanal objects comprising basketry bags and ginger jars. Using a Pupil Labs eye-tracker, we recorded gaze patterns and generated heat maps representing human visual attention. Simultaneously, we analyzed the same objects using a pre-trained ViT model with DINO (Self-DIstillation with NO Labels), extracting attention maps from each of the 12 attention heads. We compared human and ViT attention distributions using Kullback-Leibler divergence across varying Gaussian parameters (sigma=0.1 to 3.0). Statistical analysis revealed optimal correlation at sigma=2.4 ±0.03, with attention head #12 showing the strongest alignment with human visual patterns. Significant differences were found between attention heads, with heads #7 and #9 demonstrating the greatest divergence from human attention (p < 0.05, Tukey HSD test). Results indicate that while ViTs exhibit more global attention patterns compared to human focal attention, certain attention heads can approximate human visual behavior, particularly for specific object features like buckles in basketry items. These findings suggest potential applications of ViT attention mechanisms in product design and aesthetic evaluation, while highlighting fundamental differences in attention strategies between human perception and current AI models.
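文中以 KL 散度衡量人类注视热图与 ViT 注意力图的差异。以下是一个简化示意(假设性实现,非论文官方代码;为简洁省略了 sigma 高斯平滑步骤,实际分析在 sigma=0.1–3.0 范围内搜索最优值):

```python
import numpy as np

def attention_kl(human_map, vit_map, eps=1e-12):
    """将两张注意力热图分别归一化为概率分布后计算 KL(human || vit)"""
    p = human_map / human_map.sum()
    q = vit_map / vit_map.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

对 12 个注意力头分别计算该散度、取最小者,即可复现"哪个头最接近人类注意模式"这一比较逻辑。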
zh
[CV-18] InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling ICCV2025
【速读】:该论文旨在解决传统基于RGB的逆渲染(inverse rendering)方法在复杂光照条件下难以准确估计材质参数的问题,尤其是在使用LiDAR数据时仅将其作为几何信息来源、忽略其强度值所蕴含的材质线索的情况下。解决方案的关键在于提出一种名为InvRGB+L的新颖逆渲染模型,其核心创新包括:(1) 一个基于物理的LiDAR辐照度建模方法,利用LiDAR主动照明在不同光谱范围内的强度信息提供与可见光互补的材质感知信号;(2) 引入RGB-LiDAR材质一致性损失函数,确保跨模态材质估计的一致性与鲁棒性。该方法显著提升了城市和室内场景的可重光照(relighting)、夜间模拟及动态物体插入等任务的表现,优于当前最先进的场景级逆渲染与LiDAR仿真技术。
链接: https://arxiv.org/abs/2507.17613
作者: Xiaoxue Chen,Bhargav Chandaka,Chih-Hao Lin,Ya-Qin Zhang,David Forsyth,Hao Zhao,Shenlong Wang
机构: Tsinghua University (清华大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); BAAI (北京智源研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
Abstract:We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR’s intensity values-captured with active illumination in a different spectral range-offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB-LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions, achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.
zh
[CV-19] Explainable AI for Collaborative Assessment of 2D/3D Registration Quality
【速读】:该论文旨在解决2D/3D注册(2D/3D registration)质量验证难题,即在图像引导手术中,尽管注册算法已高度成熟,但其输出仍可能出现微小但致命的误差,而传统基于可视化的方法难以可靠识别这些错误,从而威胁患者安全。解决方案的关键在于提出首个专门针对2D/3D注册质量验证的人工智能(AI)框架,并集成可解释人工智能(XAI)模块,通过提供决策依据来增强人机协同判断能力。该框架在算法层面提升检测准确性,在交互层面支持人类操作者理解AI推理逻辑,从而形成更可靠的术中质量保障机制。
链接: https://arxiv.org/abs/2507.17597
作者: Sue Min Cho,Alexander Do,Russell H. Taylor,Mathias Unberath
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As surgery embraces digital transformation–integrating sophisticated imaging, advanced algorithms, and robotics to support and automate complex sub-tasks–human judgment of system correctness remains a vital safeguard for patient safety. This shift introduces new “operator-type” roles tasked with verifying complex algorithmic outputs, particularly at critical junctures of the procedure, such as the intermediary check before drilling or implant placement. A prime example is 2D/3D registration, a key enabler of image-based surgical navigation that aligns intraoperative 2D images with preoperative 3D data. Although registration algorithms have advanced significantly, they occasionally yield inaccurate results. Because even small misalignments can lead to revision surgery or irreversible surgical errors, there is a critical need for robust quality assurance. Current visualization-based strategies alone have been found insufficient to enable humans to reliably detect 2D/3D registration misalignments. In response, we propose the first artificial intelligence (AI) framework trained specifically for 2D/3D registration quality verification, augmented by explainability features that clarify the model’s decision-making. Our explainable AI (XAI) approach aims to enhance informed decision-making for human operators by providing a second opinion together with a rationale behind it. Through algorithm-centric and human-centered evaluations, we systematically compare four conditions: AI-only, human-only, human-AI, and human-XAI. Our findings reveal that while explainability features modestly improve user trust and willingness to override AI errors, they do not exceed the standalone AI in aggregate performance. Nevertheless, future work extending both the algorithmic design and the human-XAI collaboration elements holds promise for more robust quality assurance of 2D/3D registration.
zh
[CV-20] PRIX: Learning to Plan from Raw Pixels for End-to-End Autonomous Driving
【速读】:该论文旨在解决端到端自动驾驶模型在实际部署中面临的三大挑战:模型规模庞大、对昂贵激光雷达(LiDAR)传感器的依赖以及计算密集的鸟瞰图(BEV)特征表示,这些因素严重限制了其在仅配备摄像头的量产车型中的可扩展性。解决方案的关键在于提出一种名为PRIX(Plan from Raw Pixels)的新颖高效架构,其核心创新是引入Context-aware Recalibration Transformer(CaRT)模块,通过增强多层级视觉特征来提升规划鲁棒性,并完全基于原始像素输入直接生成安全轨迹,无需显式BEV表示或LiDAR数据,从而在保持先进性能的同时显著降低推理延迟和模型复杂度。
链接: https://arxiv.org/abs/2507.17596
作者: Maciej K. Wozniak,Lianhang Liu,Yixi Cai,Patric Jensfelt
机构: KTH Royal Institute of Technology (瑞典皇家理工学院); Scania CV AB
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: under review
Abstract:While end-to-end autonomous driving models show promising results, their practical deployment is often hindered by large model sizes, a reliance on expensive LiDAR sensors and computationally intensive BEV feature representations. This limits their scalability, especially for mass-market vehicles equipped only with cameras. To address these challenges, we propose PRIX (Plan from Raw Pixels). Our novel and efficient end-to-end driving architecture operates using only camera data, without explicit BEV representation and forgoing the need for LiDAR. PRIX leverages a visual feature extractor coupled with a generative planning head to predict safe trajectories from raw pixel inputs directly. A core component of our architecture is the Context-aware Recalibration Transformer (CaRT), a novel module designed to effectively enhance multi-level visual features for more robust planning. We demonstrate through comprehensive experiments that PRIX achieves state-of-the-art performance on the NavSim and nuScenes benchmarks, matching the capabilities of larger, multimodal diffusion planners while being significantly more efficient in terms of inference speed and model size, making it a practical solution for real-world deployment. Our work is open-source and the code will be at this https URL.
zh
[CV-21] RemixFusion: Residual-based Mixed Representation for Large-scale Online RGB-D Reconstruction
【速读】:该论文旨在解决基于神经隐式表示(neural implicit representation)的在线稠密重建方法在大规模场景中面临的两大问题:一是重建细节不足,导致结果过于平滑;二是神经表示的学习过程耗时较长,难以满足实时性要求。其解决方案的关键在于提出了一种基于残差的混合表示方法(residual-based mixed representation),即通过一个显式的粗粒度TSDF(Truncated Signed Distance Function)网格与一个隐式的神经模块相结合,后者用于生成需添加到粗网格中的细粒度残差信息。这种设计在保证时间与内存预算受限的前提下实现了高保真重建,并显著提升了相机位姿估计的精度和鲁棒性。此外,作者进一步引入了基于姿态变化优化的多帧联合位姿优化策略(结合束调整BA)以及自适应梯度放大技术,从而提升优化收敛性和全局最优性,同时采用局部移动体积机制实现高效在线学习。
链接: https://arxiv.org/abs/2507.17594
作者: Yuqing Lan,Chenyang Zhu,Shuaifeng Zhi,Jiazhao Zhang,Zhoufeng Wang,Renjiao Yi,Yijie Wang,Kai Xu
机构: National University of Defense Technology (国防科技大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The introduction of the neural implicit representation has notably propelled the advancement of online dense reconstruction techniques. Compared to traditional explicit representations, such as TSDF, it improves the mapping completeness and memory efficiency. However, the lack of reconstruction details and the time-consuming learning of neural representations hinder the widespread application of neural-based methods to large-scale online reconstruction. We introduce RemixFusion, a novel residual-based mixed representation for scene reconstruction and camera pose estimation dedicated to high-quality and large-scale online RGB-D reconstruction. In particular, we propose a residual-based map representation comprised of an explicit coarse TSDF grid and an implicit neural module that produces residuals representing fine-grained details to be added to the coarse grid. Such mixed representation allows for detail-rich reconstruction with bounded time and memory budget, contrasting with the overly-smoothed results by the purely implicit representations, thus paving the way for high-quality camera tracking. Furthermore, we extend the residual-based representation to handle multi-frame joint pose optimization via bundle adjustment (BA). In contrast to the existing methods, which optimize poses directly, we opt to optimize pose changes. Combined with a novel technique for adaptive gradient amplification, our method attains better optimization convergence and global optimality. Furthermore, we adopt a local moving volume to factorize the mixed scene representation with a divide-and-conquer design to facilitate efficient online learning in our residual-based framework. Extensive experiments demonstrate that our method surpasses all state-of-the-art ones, including those based either on explicit or implicit representations, in terms of the accuracy of both mapping and tracking on large-scale scenes.
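论文的混合表示可以概括为:显式粗粒度 TSDF 网格提供基准值,隐式神经模块补充细节残差,二者相加得到最终查询结果。下面用一维玩具示例演示这一组合方式(纯属示意,`query_sdf`、`residual` 等名称均为虚构;实际系统为 3D 体素网格加神经网络):

```python
import numpy as np

def query_sdf(x, coarse_grid, residual_fn, voxel_size=0.5):
    """混合表示查询:显式粗网格值 + 隐式模块输出的细节残差"""
    idx = np.clip((x / voxel_size).astype(int), 0, coarse_grid.shape[0] - 1)
    return coarse_grid[idx] + residual_fn(x)

coarse = np.linspace(-1.0, 1.0, 8)          # 玩具 1D 粗网格
residual = lambda x: 0.05 * np.sin(10 * x)  # 代表输出小幅残差的神经模块
x = np.array([0.3, 1.2, 2.0])
vals = query_sdf(x, coarse, residual)
```

残差的幅值远小于粗网格值,因此训练神经模块只需拟合细节,这正是该表示能在受限时间与内存预算下保留细节的原因。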
zh
[CV-22] From Scan to Action: Leveraging Realistic Scans for Embodied Scene Understanding CVPR2025
【速读】:该论文旨在解决真实世界三维场景扫描数据在下游应用中使用受限的问题,主要挑战包括数据体量庞大、标注格式多样以及工具兼容性差。其解决方案的关键在于提出一种基于通用场景描述(Universal Scene Description, USD)的统一标注集成方法,并针对不同应用场景设计特定的USD变体(USD flavors),从而有效整合和利用这些复杂的数据集。通过该方法,论文在大语言模型(LLM)驱动的场景编辑任务中实现了80%的成功率,在机器人仿真中的策略学习也达到了87%的成功率,验证了该方案的有效性和泛化能力。
链接: https://arxiv.org/abs/2507.17585
作者: Anna-Maria Halacheva,Jan-Nico Zaech,Sombit Dey,Luc Van Gool,Danda Pani Paudel
机构: INSAIT; Sofia University “St. Kliment Ohridski” (索非亚大学“圣克莱门特·奥霍里斯基”)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at the OpenSUN3D Workshop, CVPR 2025. This workshop paper is not included in the official CVPR proceedings
Abstract:Real-world 3D scene-level scans offer realism and can enable better real-world generalizability for downstream applications. However, challenges such as data volume, diverse annotation formats, and tool compatibility limit their use. This paper demonstrates a methodology to effectively leverage these scans and their annotations. We propose a unified annotation integration using USD, with application-specific USD flavors. We identify challenges in utilizing holistic real-world scan datasets and present mitigation strategies. The efficacy of our approach is demonstrated through two downstream applications: LLM-based scene editing, enabling effective LLM understanding and adaptation of the data (80% success), and robotic simulation, achieving an 87% success rate in policy learning.
zh
[CV-23] Boosting Ray Search Procedure of Hard-label Attacks with Transfer-based Priors ICLR2025
【速读】:该论文旨在解决硬标签(hard-label)对抗攻击中的查询效率问题,即在仅能获取模型预测最高概率类别标签的情况下,如何高效地搜索到使模型误分类的扰动。其核心挑战在于将离散的硬标签反馈转化为可优化的连续问题,同时最小化查询次数。解决方案的关键在于提出一种基于先验引导(prior-guided)的射线搜索方法:利用替代模型(surrogate model)提供的迁移性先验信息,通过在先验方向与随机方向张成的子空间中近似真实梯度的投影,实现更高效的梯度估计。该方法理论上提升了梯度估计与真实梯度之间的期望余弦相似度,并在ImageNet和CIFAR-10数据集上显著优于11种先进方法,在保持高攻击成功率的同时大幅降低查询成本。
链接: https://arxiv.org/abs/2507.17577
作者: Chen Ma,Xinjie Xu,Shuyu Cheng,Qi Xuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Published at ICLR 2025 (Spotlight paper)
Abstract:One of the most practical and challenging types of black-box adversarial attacks is the hard-label attack, where only the top-1 predicted label is available. One effective approach is to search for the optimal ray direction from the benign image that minimizes the \ell_p -norm distance to the adversarial region. The unique advantage of this approach is that it transforms the hard-label attack into a continuous optimization problem. The objective function value is the ray’s radius, which can be obtained via binary search at a high query cost. Existing methods use a “sign trick” in gradient estimation to reduce the number of queries. In this paper, we theoretically analyze the quality of this gradient estimation and propose a novel prior-guided approach to improve ray search efficiency both theoretically and empirically. Specifically, we utilize the transfer-based priors from surrogate models, and our gradient estimators appropriately integrate them by approximating the projection of the true gradient onto the subspace spanned by these priors and random directions, in a query-efficient manner. We theoretically derive the expected cosine similarities between the obtained gradient estimators and the true gradient, and demonstrate the improvement achieved by incorporating priors. Extensive experiments on the ImageNet and CIFAR-10 datasets show that our approach significantly outperforms 11 state-of-the-art methods in terms of query efficiency.
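硬标签攻击的核心转化是:沿某方向二分搜索到达对抗区域的最小半径,该半径即连续优化的目标函数值。以下是这一二分搜索的简化示意(假设性实现;`ray_radius` 与玩具判定函数均为虚构,真实场景中每次判定对应一次模型查询):

```python
import numpy as np

def ray_radius(x, d, is_adversarial, r_hi=10.0, tol=1e-3):
    """沿单位方向 d 二分搜索最小半径 r,使 x + r*d 落入对抗区域。
    is_adversarial 只返回 True/False,对应硬标签(top-1)反馈。"""
    d = d / np.linalg.norm(d)
    assert is_adversarial(x + r_hi * d), "上界必须已在对抗区域内"
    lo, hi = 0.0, r_hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if is_adversarial(x + mid * d):
            hi = mid   # 收缩上界
        else:
            lo = mid   # 抬高下界
    return hi
```

论文的贡献正是在此框架上用替代模型先验改进"搜索哪个方向"的梯度估计,从而减少上述二分搜索的调用次数。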
zh
[CV-24] An h-space Based Adversarial Attack for Protection Against Few-shot Personalization
【速读】:该论文旨在解决扩散模型(diffusion models)在生成定制化图像时引发的隐私泄露问题,特别是未经授权对私有内容进行修改的风险。其解决方案的关键在于利用模型语义潜在空间(h-space)中的高抽象特性,设计一种基于对抗攻击的新型反定制方法HAAD(h-space based Adversarial Attack for Diffusion models),通过在h-space中构造扰动来有效破坏图像生成过程;进一步提出的HAAD-KV变体则仅基于h-space的KV参数构建扰动,在保持更强防护效果的同时显著降低计算开销,且性能优于现有最先进方法。
链接: https://arxiv.org/abs/2507.17554
作者: Xide Xu,Sandesh Kamath,Muhammad Atif Butt,Bogdan Raducanu
机构: Computer Vision Center (计算机视觉中心); Universitat Autònoma de Barcelona (巴塞罗那自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 15 figures. Accepted by ACM Multimedia 2025
Abstract:The versatility of diffusion models in generating customized images from few samples raises significant privacy concerns, particularly regarding unauthorized modifications of private content. This concerning issue has renewed the efforts in developing protection mechanisms based on adversarial attacks, which generate effective perturbations to poison diffusion models. Our work is motivated by the observation that these models exhibit a high degree of abstraction within their semantic latent space (`h-space’), which encodes critical high-level features for generating coherent and meaningful content. In this paper, we propose a novel anti-customization approach, called HAAD (h-space based Adversarial Attack for Diffusion models), that leverages adversarial attacks to craft perturbations based on the h-space that can efficiently degrade the image generation process. Building upon HAAD, we further introduce a more efficient variant, HAAD-KV, that constructs perturbations solely based on the KV parameters of the h-space. This strategy offers a stronger protection, that is computationally less expensive. Despite their simplicity, our methods outperform state-of-the-art adversarial attacks, highlighting their effectiveness.
zh
[CV-25] Constructing Ophthalmic MLLM for Positioning-diagnosis Collaboration Through Clinical Cognitive Chain Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在眼科专业领域应用中面临的两大核心问题:一是标注粒度碎片化导致的跨模态理解不精确,二是临床推理逻辑不一致影响诊断准确性。其解决方案的关键在于构建了一个面向眼科的专用MLLM——FundusExpert,结合由智能Fundus-Engine系统生成的高质量数据集FundusGen,该系统通过自动化定位与基于MLLM的语义扩展,实现单张眼底图像中全局疾病分类、局部目标检测与细粒度特征分析的一体化处理;同时,通过构建符合临床认知链的标注体系,引导模型生成可解释的推理路径,从而显著提升模型在眼科问答和零样本报告生成任务中的性能表现。
链接: https://arxiv.org/abs/2507.17539
作者: Xinyao Liu,Diping Song
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); University of Science and Technology of China (中国科学技术大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces FundusExpert, an ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with FundusGen, a dataset constructed through the intelligent Fundus-Engine system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0%, significantly outperforming GPT-4o’s 47.6%. Furthermore, we reveal a scaling law between data quality and model capability ( L \propto N^{0.068} ), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in specific MLLMs. Our project can be found at this https URL.
zh
[CV-26] Multi-modal Multi-task Pre-training for Improved Point Cloud Understanding
【速读】:该论文旨在解决当前多模态预训练框架在3D点云理解任务中因仅依赖单一预训练任务而导致信息利用不充分的问题,从而限制了模型在复杂下游任务中的性能表现。其解决方案的关键在于提出一种多模态多任务预训练框架(Multi-modal Multi-task Pre-training, MMPT),通过设计三个互补的预训练任务实现更全面的特征学习:(i) Token-level reconstruction (TLR) 用于恢复被掩码的点token,增强模型的表征能力;(ii) Point-level reconstruction (PLR) 直接预测被掩码点的位置,生成可用于后续任务的变换后点云;(iii) Multi-modal contrastive learning (MCL) 在模态内与模态间建立特征对应关系,自监督地聚合来自3D点云和2D图像的丰富信号。该框架无需任何3D标注,具备良好的可扩展性,并能有效迁移到多种判别与生成类下游任务中。
链接: https://arxiv.org/abs/2507.17533
作者: Liwen Liu,Weidong Yang,Lipeng Ma,Ben Fei
机构: Fudan University (复旦大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in multi-modal pre-training methods have shown promising effectiveness in learning 3D representations by aligning multi-modal features between 3D shapes and their corresponding 2D counterparts. However, existing multi-modal pre-training frameworks primarily rely on a single pre-training task to gather multi-modal data in 3D applications. This limitation prevents the models from obtaining the abundant information provided by other relevant tasks, which can hinder their performance in downstream tasks, particularly in complex and diverse domains. In order to tackle this issue, we propose MMPT, a Multi-modal Multi-task Pre-training framework designed to enhance point cloud understanding. Specifically, three pre-training tasks are devised: (i) Token-level reconstruction (TLR) aims to recover masked point tokens, endowing the model with representative learning abilities. (ii) Point-level reconstruction (PLR) is integrated to predict the masked point positions directly, and the reconstructed point cloud can be considered as a transformed point cloud used in the subsequent task. (iii) Multi-modal contrastive learning (MCL) combines feature correspondences within and across modalities, thus assembling a rich learning signal from both 3D point cloud and 2D image modalities in a self-supervised manner. Moreover, this framework operates without requiring any 3D annotations, making it scalable for use with large datasets. The trained encoder can be effectively transferred to various downstream tasks. To demonstrate its effectiveness, we evaluated its performance compared to state-of-the-art methods in various discriminant and generative applications under widely-used benchmarks.
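其中的多模态对比学习(MCL)可以用 InfoNCE 形式的损失来理解:批内匹配的 3D/2D 特征对为正样本,其余组合为负样本。以下是一个 numpy 简化示意(假设性实现,非论文官方代码;`info_nce` 为虚构名称):

```python
import numpy as np

def info_nce(z3d, z2d, tau=0.07):
    """跨模态对比损失:第 i 个 3D 特征与第 i 个 2D 特征为正样本对"""
    z3d = z3d / np.linalg.norm(z3d, axis=1, keepdims=True)
    z2d = z2d / np.linalg.norm(z2d, axis=1, keepdims=True)
    logits = z3d @ z2d.T / tau                       # 余弦相似度矩阵
    logits = logits - logits.max(axis=1, keepdims=True)  # 数值稳定
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))        # 只取对角线(正样本)
```

正确配对的批次损失应明显低于打乱配对的批次,训练即是在批内把匹配对"拉近"、非匹配对"推远"。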
zh
[CV-27] STQE: Spatial-Temporal Quality Enhancement for G-PCC Compressed Dynamic Point Clouds
【速读】:该论文旨在解决压缩动态点云(compressed dynamic point clouds)中视觉质量提升的问题,特别是如何有效利用帧间空间-时间相关性来改善基于通用点云编码(G-PCC)标准压缩后的点云质量。其解决方案的关键在于提出了一种空间-时间属性质量增强网络(STQE),该网络包含四个核心模块:基于重着色的运动补偿模块,用于实现参考帧属性信息到当前帧几何结构的精确对齐;通道感知的时间注意力模块,动态突出双向参考帧中的相关区域;高斯引导的邻域特征聚合模块,高效捕获几何与颜色属性间的空间依赖关系;以及基于皮尔逊相关系数的联合损失函数,以缓解传统逐点均方误差优化导致的过度平滑问题。通过上述设计,STQE在G-PCC测试模型上实现了显著的峰值信噪比(PSNR)提升和比特率下降(BD-rate reductions)。
链接: https://arxiv.org/abs/2507.17522
作者: Tian Guo,Hui Yuan,Xiaolong Mao,Shiqi Jiang,Raouf Hamzaoui,Sam Kwong
机构: Shandong University (山东大学); De Montfort University (德蒙福特大学); Lingnan University (岭南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Very few studies have addressed quality enhancement for compressed dynamic point clouds. In particular, the effective exploitation of spatial-temporal correlations between point cloud frames remains largely unexplored. Addressing this gap, we propose a spatial-temporal attribute quality enhancement (STQE) network that exploits both spatial and temporal correlations to improve the visual quality of G-PCC compressed dynamic point clouds. Our contributions include a recoloring-based motion compensation module that remaps reference attribute information to the current frame geometry to achieve precise inter-frame geometric alignment, a channel-aware temporal attention module that dynamically highlights relevant regions across bidirectional reference frames, a Gaussian-guided neighborhood feature aggregation module that efficiently captures spatial dependencies between geometry and color attributes, and a joint loss function based on the Pearson correlation coefficient, designed to alleviate over-smoothing effects typical of point-wise mean squared error optimization. When applied to the latest G-PCC test model, STQE achieved improvements of 0.855 dB, 0.682 dB, and 0.828 dB in delta PSNR, with Bjøntegaard Delta rate (BD-rate) reductions of -25.2%, -31.6%, and -32.5% for the Luma, Cb, and Cr components, respectively.
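论文用基于皮尔逊相关系数的联合损失替代逐点 MSE 以缓解过度平滑。其最小形式可写作 1 - r(pred, target),示意如下(假设性实现,实际损失还会与其他项联合使用):

```python
import numpy as np

def pearson_loss(pred, target, eps=1e-8):
    """1 - 皮尔逊相关系数:只约束结构相关性,对整体偏移/缩放不敏感"""
    p = pred - pred.mean()
    t = target - target.mean()
    r = (p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps)
    return 1.0 - r
```

完全正相关时损失趋近 0,完全负相关时趋近 2;因其不逐点惩罚高频细节,相比 MSE 更不易把重建结果推向过度平滑。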
zh
[CV-28] InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
【速读】:该论文旨在解决现有视觉-语言-动作(Vision-Language-Action, VLA)模型在实际应用中普遍存在的两个核心问题:一是难以同时实现灵活的多模态推理与精确的动作生成,二是因任务特定数据训练导致能力局限及预训练视觉-语言能力的灾难性遗忘。其解决方案的关键在于提出一种端到端的VLA模型InstructVLA,结合创新的训练范式——视觉-语言-动作指令微调(Vision-Language-Action Instruction Tuning, VLA-IT),通过混合专家(mixture-of-experts)机制,在标准视觉-语言模型语料和自建的650K样本VLA-IT数据集上联合优化文本推理与动作生成能力,从而在保持大视觉-语言模型(VLM)灵活性的同时显著提升操作性能,并在模拟与真实场景中展现出推理增强的动作执行能力。
链接: https://arxiv.org/abs/2507.17520
作者: Shuai Yang,Hao Li,Yilun Chen,Bin Wang,Yang Tian,Tai Wang,Hanqing Wang,Feng Zhao,Yiyi Liao,Jiangmiao Pang
机构: University of Science and Technology of China (中国科学技术大学); Zhejiang University (浙江大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages
Abstract:To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, where it outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA’s potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.
zh
[CV-29] Accelerating Parallel Diffusion Model Serving with Residual Compression
【速读】:该论文旨在解决扩散模型(Diffusion Models)在多加速器并行推理过程中因交换大量激活值而导致的显著通信开销问题,从而限制了效率与可扩展性。解决方案的关键在于提出一种名为CompactFusion的压缩框架,其核心思想是利用扩散激活值中存在的强时间冗余特性——相邻推理步骤产生的激活值高度相似,携带的信息增量极小。为此,CompactFusion采用残差压缩(Residual Compression)策略,仅传输逐步激活差异(即压缩后的残差),从而大幅减少通信数据量;同时引入轻量级误差反馈机制以防止误差累积,确保生成质量不受影响。该方法实现了更低延迟和更高生成质量,尤其适用于慢速网络下的序列并行等高通信负载场景。
链接: https://arxiv.org/abs/2507.17511
作者: Jiajun Luo,Yicheng Xiao,Jianru Xu,Yangxiu You,Rongwei Lu,Chen Tang,Jingyan Jiang,Zhi Wang
机构: Tsinghua University (清华大学); Southern University of Science and Technology (南方科技大学); Jiangnan University (江南大学); The Chinese University of Hong Kong (香港中文大学); Shenzhen Technology University (深圳技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models produce realistic images and videos but require substantial computational resources, necessitating multi-accelerator parallelism for real-time deployment. However, parallel inference introduces significant communication overhead from exchanging large activations between devices, limiting efficiency and scalability. We present CompactFusion, a compression framework that significantly reduces communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy-adjacent steps produce highly similar activations, saturating bandwidth with near-duplicate data carrying little new information. To address this inefficiency, we seek a more compact representation that encodes only the essential information. CompactFusion achieves this via Residual Compression that transmits only compressed residuals (step-wise activation differences). Based on empirical analysis and theoretical justification, we show that it effectively removes redundant data, enabling substantial data reduction while maintaining high fidelity. We also integrate lightweight error feedback to prevent error accumulation. CompactFusion establishes a new paradigm for parallel diffusion inference, delivering lower latency and significantly higher generation quality than prior methods. On 4xL20, it achieves 3.0x speedup while greatly improving fidelity. It also uniquely supports communication-heavy strategies like sequence parallelism on slow networks, achieving 6.7x speedup over prior overlap-based method. CompactFusion applies broadly across diffusion models and parallel settings, and integrates easily without requiring pipeline rework. Portable implementation demonstrated on xDiT is publicly available at this https URL
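残差压缩加误差反馈的流程可以概括为:发送端只传输量化后的逐步激活差,并把本步量化误差累加到下一步。下面是一个 numpy 玩具示意(假设性实现,`ResidualCompressor` 等名称为虚构;实际系统作用于多卡间交换的扩散模型激活张量):

```python
import numpy as np

def quantize(x, step=0.05):
    """粗粒度均匀量化,模拟压缩"""
    return np.round(x / step) * step

class ResidualCompressor:
    def __init__(self):
        self.recon = None  # 发送端维护的"接收端视角"重建值
        self.err = 0.0     # 误差反馈项,防止量化误差累积

    def send(self, act):
        """返回本步要传输的载荷:首步全量,其后仅量化残差"""
        if self.recon is None:
            self.recon = act.copy()
            return act
        residual = act - self.recon + self.err
        payload = quantize(residual)
        self.err = residual - payload        # 误差反馈:留到下一步补偿
        self.recon = self.recon + payload    # 与接收端同步重建
        return payload
```

相邻步激活高度相似时,残差接近零、可被极低比特编码;误差反馈则保证重建值与真实激活的偏差不超过一个量化步长,不随步数累积。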
zh
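摘要中的残差压缩(Residual Compression)与误差反馈思路可以用几行代码示意:只传输相邻步激活差值的压缩结果,并把被丢弃的部分累积回下一步。以下为基于 NumPy 的最小示意,其中的 top-k 幅值压缩器与 `keep_ratio` 参数均为假设,并非论文实现:

```python
import numpy as np

def compress(x, keep_ratio=0.1):
    """Toy top-k magnitude compressor standing in for any residual compressor."""
    flat = x.flatten()
    k = max(1, int(flat.size * keep_ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(x.shape)

class ResidualCompressor:
    """Transmit compressed step-wise activation differences with error feedback."""
    def __init__(self, shape):
        self.prev = np.zeros(shape)   # receiver-side reconstruction of the last step
        self.err = np.zeros(shape)    # error-feedback accumulator

    def send(self, activation):
        residual = activation - self.prev + self.err  # add back past compression error
        compressed = compress(residual)
        self.err = residual - compressed              # remember what was dropped
        self.prev = self.prev + compressed            # receiver reconstructs identically
        return compressed, self.prev                  # payload, reconstructed activation
```

误差反馈保证被压缩丢弃的信息不会永久丢失,而是并入下一步的残差,从而抑制误差累积。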
[CV-30] Illicit object detection in X-ray imaging using deep learning techniques: A comparative evaluation
【速读】:该论文旨在解决X射线安检中非法物品检测的准确性与可靠性问题,其核心挑战包括物体遮挡、物品物理属性差异、X射线扫描设备多样性以及训练数据有限等。为系统性评估当前深度学习(Deep Learning, DL)方法在该任务中的表现,研究提出了一套详尽的对比评价框架,关键在于:1)整合六大数据集(OPIXray、CLCXray、SIXray、EDS、HiXray和PIDray),覆盖多种场景与设备;2)涵盖十种主流目标检测模型,包括通用卷积神经网络(CNN)、定制CNN、通用Transformer及混合CNN-Transformer架构;3)采用多维指标(如mAP50/mAP50:95检测精度、推理时间ms、参数量M、计算量GFLOPS)进行综合评估。通过此框架,研究揭示了不同方法在对象级性能、数据集特异性及效率方面的关键差异,为后续研究提供了可复现的基准和清晰的方向指引。
链接: https://arxiv.org/abs/2507.17508
作者: Jorgen Cani,Christos Diou,Spyridon Evangelatos,Vasileios Argyriou,Panagiotis Radoglou-Grammatikis,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos
机构: Hellenic American University (希腊美国大学); Netcompany (尼特公司); Kingston University (金斯顿大学); K3Y (K3Y); University of Western Macedonia (西部马其顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated X-ray inspection is crucial for efficient and unobtrusive security screening in various public settings. However, challenges such as object occlusion, variations in the physical properties of items, diversity in X-ray scanning devices, and limited training data hinder accurate and reliable detection of illicit items. Despite the large body of research in the field, reported experimental evaluations are often incomplete, with frequently conflicting outcomes. To shed light on the research landscape and facilitate further research, a systematic, detailed, and thorough comparative evaluation of recent Deep Learning (DL)-based methods for X-ray object detection is conducted. For this, a comprehensive evaluation framework is developed, composed of: a) Six recent, large-scale, and widely used public datasets for X-ray illicit item detection (OPIXray, CLCXray, SIXray, EDS, HiXray, and PIDray), b) Ten different state-of-the-art object detection schemes covering all main categories in the literature, including generic Convolutional Neural Network (CNN), custom CNN, generic transformer, and hybrid CNN-transformer architectures, and c) Various detection (mAP50 and mAP50:95) and time/computational-complexity (inference time (ms), parameter size (M), and computational load (GFLOPS)) metrics. A thorough analysis of the results leads to critical observations and insights, emphasizing key aspects such as: a) Overall behavior of the object detection schemes, b) Object-level detection performance, c) Dataset-specific observations, and d) Time efficiency and computational complexity analysis. To support reproducibility of the reported experimental results, the evaluation code and model weights are made publicly available at this https URL.
zh
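摘要中反复出现的 mAP50 指标以 IoU ≥ 0.5 作为命中判定。下面给出单类别 AP@0.5 的极简示意(假设预测框已按置信度降序排列),仅用于说明指标含义,与论文的评测框架无关:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def average_precision_50(predictions, ground_truths):
    """AP at IoU 0.5 for one class; predictions sorted by descending confidence."""
    matched, tp = set(), []
    for box in predictions:
        best_j, best_iou = -1, 0.5            # a hit requires IoU >= 0.5
        for j, gt in enumerate(ground_truths):
            if j not in matched and iou(box, gt) >= best_iou:
                best_j, best_iou = j, iou(box, gt)
        if best_j >= 0:
            matched.add(best_j)
        tp.append(1 if best_j >= 0 else 0)
    # accumulate precision at each true-positive rank, normalized by #ground truths
    ap, cum_tp = 0.0, 0
    for i, hit in enumerate(tp):
        cum_tp += hit
        if hit:
            ap += cum_tp / (i + 1) / max(1, len(ground_truths))
    return ap
```

mAP50 即对所有类别的 AP@0.5 取平均;mAP50:95 则进一步对 IoU 阈值从 0.5 到 0.95 取平均。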
[CV-31] DFDNet: Dynamic Frequency-Guided De-Flare Network
【速读】:该论文旨在解决夜间摄影中强光源引起的光晕(flare)伪影问题,这类伪影不仅显著降低图像视觉质量,还会影响下游任务的性能。现有方法在去除大尺度光晕和修复光源附近区域结构损伤方面仍存在局限。解决方案的关键在于提出一种动态频率引导去光晕网络(DFDNet),其核心创新是将内容信息与光晕伪影在频域中解耦:通过全局动态频域引导模块(GDFG)动态优化全局频域特征以分离光晕信息,同时设计局部细节引导模块(LDGM)利用对比学习策略对齐光源区域的局部特征,从而减少光晕去除过程中的局部细节损失,提升精细图像恢复效果。
链接: https://arxiv.org/abs/2507.17489
作者: Minglong Xue,Aoxiang Ning,Shivakumara Palaiahnakote,Mingliang Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Strong light sources in nighttime photography frequently produce flares in images, significantly degrading visual quality and impacting the performance of downstream tasks. While some progress has been made, existing methods continue to struggle with removing large-scale flare artifacts and repairing structural damage in regions near the light source. We observe that these challenging flare artifacts exhibit more significant discrepancies from the reference images in the frequency domain compared to the spatial domain. Therefore, this paper presents a novel dynamic frequency-guided deflare network (DFDNet) that decouples content information from flare artifacts in the frequency domain, effectively removing large-scale flare artifacts. Specifically, DFDNet consists mainly of a global dynamic frequency-domain guidance (GDFG) module and a local detail guidance module (LDGM). The GDFG module guides the network to perceive the frequency characteristics of flare artifacts by dynamically optimizing global frequency domain features, effectively separating flare information from content information. Additionally, we design an LDGM via a contrastive learning strategy that aligns the local features of the light source with the reference image, reduces local detail damage from flare removal, and improves fine-grained image restoration. The experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods in terms of performance. The code is available at this https URL.
zh
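DFDNet 的 GDFG 模块在频域中分离内容与光晕信息。下面用 NumPy 的 FFT 给出"频域加权再逆变换"这一基本操作的示意;示例中的低通掩码是手工设定的假设,论文中的频域引导是动态学习的:

```python
import numpy as np

def frequency_filter(image, mask):
    """Weight an image's frequency components, as a stand-in for a learned
    frequency-domain guidance (the mask here is hand-made, not learned)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))   # DC moved to the center
    filtered = spectrum * mask                       # suppress/keep chosen bands
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))

def lowpass_mask(shape, radius):
    """Keep only low frequencies within `radius` of the spectrum center."""
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    return (dist <= radius).astype(float)

# a flat image passes a low-pass filter almost unchanged
img = np.ones((32, 32))
out = frequency_filter(img, lowpass_mask(img.shape, radius=4))
```

真实网络中的掩码由 GDFG 模块根据输入动态生成,并作用于深层特征而非原始像素。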
[CV-32] Unsupervised anomaly detection using Bayesian flow networks: application to brain FDG PET in the context of Alzheimers disease
【速读】:该论文旨在解决无监督异常检测(Unsupervised Anomaly Detection, UAD)在神经影像学中的应用问题,特别是针对阿尔茨海默病(Alzheimer’s disease)相关异常的精准识别。现有方法如变分自编码器(VAE)、生成对抗网络(GAN)和扩散模型(Diffusion Models)在处理高空间相关噪声下的图像异常检测时存在性能瓶颈,且难以保留个体特异性特征。解决方案的关键在于提出AnoBFN,一种基于贝叶斯流网络(Bayesian Flow Networks, BFNs)的扩展模型,其核心创新包括:i)在高空间相关噪声条件下实现条件图像生成能力,ii)通过输入图像的递归反馈机制在生成过程中保持受试者特异性,从而提升异常检测的准确性并降低假阳性率。
链接: https://arxiv.org/abs/2507.17486
作者: Hugues Roy,Reuben Dorent,Ninon Burgos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unsupervised anomaly detection (UAD) plays a crucial role in neuroimaging for identifying deviations from healthy subject data and thus facilitating the diagnosis of neurological disorders. In this work, we focus on Bayesian flow networks (BFNs), a novel class of generative models, which have not yet been applied to medical imaging or anomaly detection. BFNs combine the strength of diffusion frameworks and Bayesian inference. We introduce AnoBFN, an extension of BFNs for UAD, designed to: i) perform conditional image generation under high levels of spatially correlated noise, and ii) preserve subject specificity by incorporating a recursive feedback from the input image throughout the generative process. We evaluate AnoBFN on the challenging task of Alzheimer’s disease-related anomaly detection in FDG PET images. Our approach outperforms other state-of-the-art methods based on VAEs (beta-VAE), GANs (f-AnoGAN), and diffusion models (AnoDDPM), demonstrating its effectiveness at detecting anomalies while reducing false positive rates.
zh
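AnoBFN 属于基于重建的无监督异常检测范式:先得到"伪健康"重建,再用输入与重建之差定位异常。以下是该通用范式的最小示意,其中重建用简单的盒式模糊近似(真实方法由 BFN 生成),阈值 0.5 为任取的假设值:

```python
import numpy as np

def pseudo_healthy(image, k=3):
    """Stand-in reconstruction: a box blur; the paper uses a Bayesian flow network."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + image.shape[0], dx:dx + image.shape[1]]
    return out / (k * k)

def anomaly_map(image, threshold=0.5):
    """Residual between input and pseudo-healthy reconstruction, thresholded."""
    residual = np.abs(image - pseudo_healthy(image))
    return residual, residual > threshold

img = np.zeros((16, 16))
img[8, 8] = 5.0                      # a bright point "lesion"
residual, mask = anomaly_map(img)
```

论文的关键在于重建要同时抑制异常、保留受试者特异性,这正是 AnoBFN 通过输入图像递归反馈所要达到的效果。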
[CV-33] SRMambaV2: Biomimetic Attention for Sparse Point Cloud Upsampling in Autonomous Driving
【速读】:该论文旨在解决自动驾驶场景中稀疏激光雷达点云(LiDAR point clouds)上采样问题,其核心挑战在于点云数据的固有稀疏性与复杂三维结构导致的细节重建困难。解决方案的关键在于提出一种名为SRMambaV2的新方法:首先设计了受人类驾驶员视觉感知启发的生物仿生二维选择性扫描自注意力机制(2DSSA),以建模远距离稀疏区域的特征分布;其次采用双分支网络架构增强稀疏特征的表达能力;最后引入渐进自适应损失函数(PAL)以精细化重构上采样过程中的细粒度几何结构。该方案在长距离稀疏区域的重建精度和整体几何保真度方面均取得显著提升。
链接: https://arxiv.org/abs/2507.17479
作者: Chuang Chen,Xiaolin Qin,Jing Hu,Wenyi Ge
机构: Chengdu University of Information Technology (成都信息工程大学); Rice University (莱斯大学); Colorado State University (科罗拉多州立大学); University of Colorado (科罗拉多大学); National Research Institute for Metals (日本金属材料研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Upsampling LiDAR point clouds in autonomous driving scenarios remains a significant challenge due to the inherent sparsity and complex 3D structures of the data. Recent studies have attempted to address this problem by converting the complex 3D spatial scenes into 2D image super-resolution tasks. However, due to the sparse and blurry feature representation of range images, accurately reconstructing detailed and complex spatial topologies remains a major difficulty. To tackle this, we propose a novel sparse point cloud upsampling method named SRMambaV2, which enhances the upsampling accuracy in long-range sparse regions while preserving the overall geometric reconstruction quality. Specifically, inspired by human driver visual perception, we design a biomimetic 2D selective scanning self-attention (2DSSA) mechanism to model the feature distribution in distant sparse areas. Meanwhile, we introduce a dual-branch network architecture to enhance the representation of sparse features. In addition, we introduce a progressive adaptive loss (PAL) function to further refine the reconstruction of fine-grained details during the upsampling process. Experimental results demonstrate that SRMambaV2 achieves superior performance in both qualitative and quantitative evaluations, highlighting its effectiveness and practical value in automotive sparse point cloud upsampling tasks.
zh
[CV-34] Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls
【速读】:该论文旨在解决视觉蕴含(Visual Entailment, VE)任务是否能作为可靠诊断工具以评估多模态语言模型在视觉-语言理解方面的能力这一问题。其关键解决方案在于通过系统性实验,包括零样本、少样本和微调三种设置,结合提示设计、上下文示例数量与顺序以及视觉信息可用性等变量,深入分析VE任务的表现机制;同时引入基于解释的评估方法,验证模型推理过程的语义合理性。结果表明,微调可显著提升VE性能(e-SNLI-VE数据集上准确率达83.3%),且生成的解释具有人类相似性(BERTScore F1=89.2%),但同时也揭示了模型在缺乏视觉输入时易产生幻觉、过度依赖语言先验的问题,暗示VE任务的视觉 grounding 可能存在局限,从而为改进多模态评估方法提供了方向。
链接: https://arxiv.org/abs/2507.17467
作者: Elena Pitta,Tom Kouwenhoven,Tessa Verhoef
机构: Leiden Institute of Advanced Computer Science (莱顿高级计算机科学研究所); Leiden University (莱顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: LUHME: 2nd Workshop on Language Understanding in the Human-Machine Era
Abstract:This study investigates the extent to which the Visual Entailment (VE) task serves as a reliable probe of vision-language understanding in multimodal language models, using the LLaMA 3.2 11B Vision model as a test case. Beyond reporting performance metrics, we aim to interpret what these results reveal about the underlying possibilities and limitations of the VE task. We conduct a series of experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design, the number and order of in-context examples and access to visual information might affect VE performance. To further probe the reasoning processes of the model, we used explanation-based evaluations. Results indicate that three-shot inference outperforms the zero-shot baselines. However, additional examples introduce more noise than they provide benefits. Additionally, the order of the labels in the prompt is a critical factor that influences the predictions. In the absence of visual information, the model has a strong tendency to hallucinate and imagine content, raising questions about the model’s over-reliance on linguistic priors. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model. Additionally, the explanation evaluation demonstrates that the fine-tuned model provides semantically meaningful explanations similar to those of humans, with a BERTScore F1-score of 89.2%. We do, however, find comparable BERTScore results in experiments with limited vision, questioning the visual grounding of this task. Overall, our results highlight both the utility and limitations of VE as a diagnostic task for vision-language understanding and point to directions for refining multimodal evaluation methods.
zh
[CV-35] ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents
【速读】:该论文旨在解决机器人模仿学习中4D多视角序列数据采集成本高、高质量数据稀缺的问题,从而限制了具身智能策略(如视觉-语言-动作模型,VLA)的泛化能力与实际应用。其核心解决方案是提出一种名为ERMV(Editing Robotic Multi-View 4D data)的数据增强框架,关键创新在于:(1) 引入基于极线运动感知的注意力机制(Epipolar Motion-Aware Attention, EMA-Attn),以在动态视图和长时间跨度下保持几何与外观一致性;(2) 设计稀疏时空模块(Sparse Spatio-Temporal, STT),通过解耦时空维度并采用稀疏采样显著降低计算开销,扩展编辑窗口;(3) 构建反馈干预机制,利用多模态大语言模型(Multimodal Large Language Model, MLLM)检测编辑不一致并仅在必要时请求专家指导,有效缓解误差累积问题。
链接: https://arxiv.org/abs/2507.17462
作者: Chang Nie,Guangming Wang,Zhe Lie,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学); Key Laboratory of System Control and Information Processing, Ministry of Education of China (教育部系统控制与信息处理重点实验室); Cambridge University (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Robot imitation learning relies on 4D multi-view sequential images. However, the high cost of data collection and the scarcity of high-quality data severely constrain the generalization and application of embodied intelligence policies like Vision-Language-Action (VLA) models. Data augmentation is a powerful strategy to overcome data scarcity, but methods for editing 4D multi-view sequential images for manipulation tasks are currently lacking. Thus, we propose ERMV (Editing Robotic Multi-View 4D data), a novel data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. This task presents three core challenges: (1) maintaining geometric and appearance consistency across dynamic views and long time horizons; (2) expanding the working window with low computational costs; and (3) ensuring the semantic integrity of critical objects like the robot arm. ERMV addresses these challenges through a series of innovations. First, to ensure spatio-temporal consistency in motion blur, we introduce a novel Epipolar Motion-Aware Attention (EMA-Attn) mechanism that learns pixel shift caused by movement before applying geometric constraints. Second, to maximize the editing working window, ERMV pioneers a Sparse Spatio-Temporal (STT) module, which decouples the temporal and spatial views and remodels a single-frame multi-view problem through sparse sampling of the views to reduce computational demands. Third, to alleviate error accumulation, we incorporate a feedback intervention Mechanism, which uses a Multimodal Large Language Model (MLLM) to check editing inconsistencies and request targeted expert guidance only when necessary. Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.
zh
[CV-36] Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection
【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)检测中依赖大规模人工标注数据所带来的劳动密集、不一致性和泛化能力差的问题,尤其是对稀有交互场景的适应性不足。其解决方案的关键在于提出一种无需训练的HOI检测框架DYSCO(Dynamic Scoring with enhanced semantics),该框架通过构建一个融合视觉与文本交互表示的多模态注册表(multimodal registry),利用少量视觉提示和创新的交互签名(interaction signatures)增强动词语义对齐,并引入一种独特的多头注意力机制以自适应地加权视觉与文本特征的贡献,从而实现对交互关系的鲁棒且细粒度的理解,尤其在稀有交互类别上表现出显著优势。
链接: https://arxiv.org/abs/2507.17456
作者: Francesco Tonini,Lorenzo Vaquero,Alessandro Conti,Cigdem Beyan,Elisa Ricci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM Multimedia 2025
Abstract:Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. These annotations are labor-intensive to create, prone to inconsistency, and limit scalability to new domains and rare interactions. We argue that recent advances in Vision-Language Models (VLMs) offer untapped potential, particularly in enhancing interaction representation. While prior work has injected such potential and even proposed training-free methods, there remain key gaps. Consequently, we propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics (DYSCO) that effectively utilizes textual and visual interaction representations within a multimodal registry, enabling robust and nuanced interaction understanding. This registry incorporates a small set of visual cues and uses innovative interaction signatures to improve the semantic alignment of verbs, facilitating effective generalization to rare interactions. Additionally, we propose a unique multi-head attention mechanism that adaptively weights the contributions of the visual and textual features. Experimental results demonstrate that our DYSCO surpasses training-free state-of-the-art models and is competitive with training-based approaches, particularly excelling in rare interactions. Code is available at this https URL.
zh
[CV-37] VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization
【速读】:该论文旨在解决全球尺度下的单图像地理定位(geo-localization)问题,即从一张图像中准确识别其拍摄位置,这是导航、自动驾驶和灾害响应等应用中的关键挑战。传统基于检索的方法存在可扩展性差与感知歧义(perceptual aliasing)的问题,而基于分类的方法则泛化能力弱且依赖大量标注数据。尽管视觉语言模型(VLMs)在准确性上表现优异,但其易产生幻觉且缺乏可解释性,难以作为独立解决方案。论文提出了一种融合VLM与视觉位置识别(VPR)的混合框架:首先利用VLM生成地理先验(geographic prior),有效约束检索搜索空间;随后通过检索与重排序机制,结合特征相似性和初始坐标邻近度筛选最合理的匹配结果。该方案的关键在于将VLM的语义理解能力与VPR的几何一致性优势相结合,从而实现更高效、鲁棒且精确的地理定位系统。
链接: https://arxiv.org/abs/2507.17455
作者: Sania Waheed,Na Min An,Michael Milford,Sarvapali D. Ramchurn,Shoaib Ehsan
机构: University of Southampton (南安普顿大学); KAIST (韩国科学技术院); Queensland University of Technology (昆士兰科技大学); University of Essex (埃塞克斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Geo-localization from a single image at planet scale (essentially an advanced or extreme version of the kidnapped robot problem) is a fundamental and challenging task in applications such as navigation, autonomous driving and disaster response due to the vast diversity of locations, environmental conditions, and scene variations. Traditional retrieval-based methods for geo-localization struggle with scalability and perceptual aliasing, while classification-based approaches lack generalization and require extensive training data. Recent advances in vision-language models (VLMs) offer a promising alternative by leveraging contextual understanding and reasoning. However, while VLMs achieve high accuracy, they are often prone to hallucinations and lack interpretability, making them unreliable as standalone solutions. In this work, we propose a novel hybrid geo-localization framework that combines the strengths of VLMs with retrieval-based visual place recognition (VPR) methods. Our approach first leverages a VLM to generate a prior, effectively guiding and constraining the retrieval search space. We then employ a retrieval step, followed by a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates. We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods, particularly at street (up to 4.51%) and city level (up to 13.52%). Our results demonstrate that VLM-generated geographic priors in combination with VPR lead to scalable, robust, and accurate geo-localization systems.
zh
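该混合框架的重排序步骤同时考虑特征相似度和候选位置到 VLM 先验坐标的地理距离。以下为一个假设性的打分示意:用 haversine 距离作为惩罚项,权重 `lam` 为任取值,并非论文给出的公式:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def rerank(candidates, prior, lam=0.001):
    """Score = feature similarity - lam * distance to the VLM-estimated prior.
    candidates: list of (similarity, lat, lon); prior: (lat, lon)."""
    scored = [(sim - lam * haversine_km(lat, lon, *prior), lat, lon)
              for sim, lat, lon in candidates]
    return sorted(scored, reverse=True)
```

即使某个远处候选的特征相似度略高(感知歧义),靠近 VLM 先验坐标的候选仍会被排到前面。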
[CV-38] Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection ICCV2025
【速读】:该论文旨在解决现有开放词汇目标检测模型在小规模模型下难以充分利用大规模视觉-语言数据的问题,尤其是在实时场景中如何提升检测性能与效率。其核心解决方案是提出Dynamic-DINO框架,通过引入高效MoE-Tuning策略将密集模型(dense model)转化为动态推理架构,并设计粒度分解机制将基础模型的前馈网络(Feed-Forward Network, FFN)拆分为多个小型专家网络以扩展子网搜索空间;同时采用预训练权重分配策略和特定路由器初始化方法防止微调初期性能下降,在推理阶段仅激活与输入相关的专家形成紧凑子网,从而实现高精度与低计算开销的平衡。
链接: https://arxiv.org/abs/2507.17436
作者: Yehao Lu,Minghe Weng,Zekang Xiao,Rui Jiang,Wei Su,Guangcong Zheng,Ping Lu,Xi Li
机构: College of Computer Science and Technology, Zhejiang University (浙江大学计算机科学与技术学院); Polytechnic Institute, Zhejiang University (浙江大学理工学院); ZTE (中兴通讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but smaller models, remains unexplored. This work investigates this domain, revealing intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space. While in the deeper layers, fixed collaborative structures emerge, where each expert maintains 2-3 fixed partners and distinct expert combinations are specialized in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of base model into multiple smaller expert networks, expanding the subnet search space. To prevent performance degradation at the start of fine-tuning, we further propose a pre-trained weight allocation strategy for the experts, coupled with a specific router initialization. During inference, only the input-relevant experts are activated to form a compact subnet. Experiments show that, pretrained with merely 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, pretrained on the private Grounding20M dataset.
zh
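Dynamic-DINO 在推理时仅激活与输入相关的少数专家。下面是通用 top-k MoE 路由的最小 NumPy 示意(专家用随机线性层代替,维度与专家数均为假设),用于说明"路由-稀疏激活-加权求和"的基本流程,而非论文的粒度分解 FFN:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyMoE:
    """Top-k mixture-of-experts over small linear 'experts' (a generic sketch,
    not the paper's granularity-decomposed FFN)."""
    def __init__(self, dim, n_experts, top_k=2):
        self.experts = [rng.normal(size=(dim, dim)) / dim for _ in range(n_experts)]
        self.router = rng.normal(size=(dim, n_experts))
        self.top_k = top_k

    def __call__(self, x):
        logits = x @ self.router
        top = np.argsort(logits)[-self.top_k:]                     # activated experts
        weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # renormalized gates
        return sum(w * (x @ self.experts[i]) for w, i in zip(weights, top))
```

推理开销只取决于被激活的 top-k 个小专家,这正是"动态推理形成紧凑子网"的含义。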
[CV-39] CAPRI-CT: Causal Analysis and Predictive Reasoning for Image Quality Optimization in Computed Tomography
【速读】:该论文旨在解决在计算机断层扫描(Computed Tomography, CT)中如何在保证图像质量的同时最大限度降低辐射暴露这一关键临床挑战。其解决方案的核心在于提出了一种因果感知的深度学习框架CAPRI-CT,该框架通过融合CT图像数据与采集元数据(如管电压、管电流及对比剂类型等),建模影响图像质量的潜在因果关系;利用变分自编码器(Variational Autoencoders, VAEs)的集成方法提取有意义的特征并生成因果表示,进而实现对信噪比(Signal-to-Noise Ratio, SNR)的预测和反事实推理,支持“假设情景”模拟(如改变对比剂种类或浓度、调整扫描参数等),从而为放射科医生和技术人员提供可操作的优化建议,无需重复物理扫描即可设计更高效的CT成像协议。
链接: https://arxiv.org/abs/2507.17420
作者: Sneha George Gnanakalavathy,Hairil Abdul Razak,Robert Meertens,Jonathan E. Fieldsend,Xujiong Ye,Mohammed M. Abdelsamea
机构: University of Exeter (埃克塞特大学); University of Nottingham Malaysia (诺丁汉大学马来西亚分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In computed tomography (CT), achieving high image quality while minimizing radiation exposure remains a key clinical challenge. This paper presents CAPRI-CT, a novel causal-aware deep learning framework for Causal Analysis and Predictive Reasoning for Image Quality Optimization in CT imaging. CAPRI-CT integrates image data with acquisition metadata (such as tube voltage, tube current, and contrast agent types) to model the underlying causal relationships that influence image quality. An ensemble of Variational Autoencoders (VAEs) is employed to extract meaningful features and generate causal representations from observational data, including CT images and associated imaging parameters. These input features are fused to predict the Signal-to-Noise Ratio (SNR) and support counterfactual inference, enabling what-if simulations, such as changes in contrast agents (types and concentrations) or scan parameters. CAPRI-CT is trained and validated using an ensemble learning approach, achieving strong predictive performance. By facilitating both prediction and interpretability, CAPRI-CT provides actionable insights that could help radiologists and technicians design more efficient CT protocols without repeated physical scans. The source code and dataset are publicly available at this https URL.
zh
[CV-40] Content-based 3D Image Retrieval and a ColBERT-inspired Re-ranking for Tumor Flagging and Staging
【速读】:该论文旨在解决医学影像检索中因图像数据量激增而导致放射科医生难以高效获取相关病例的问题,尤其针对现有基于内容的图像检索(Content-based Image Retrieval, CBIR)系统缺乏标准化评估和全面研究的局限性。其解决方案的关键在于提出一种名为C-MIR的新方法,该方法基于ColBERT的上下文感知晚期交互机制(contextualized late interaction mechanism),专为三维医学图像设计,无需依赖预分割数据或器官特异性数据集,从而适配临床实践中广泛存在的大型非结构化影像归档系统(如PACS)。C-MIR通过有效定位感兴趣区域,实现了无需昂贵数据增强步骤的计算高效重排序,显著提升了肿瘤识别与定位性能,特别是在结肠和肺部肿瘤的标记任务中表现突出(p<0.05),并展现出改善肿瘤分期的潜力。
链接: https://arxiv.org/abs/2507.17412
作者: Farnaz Khun Jush,Steffen Vogler,Matthias Lenga
机构: Bayer(拜耳)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:The increasing volume of medical images poses challenges for radiologists in retrieving relevant cases. Content-based image retrieval (CBIR) systems offer potential for efficient access to similar cases, yet lack standardized evaluation and comprehensive studies. Building on prior studies for tumor characterization via CBIR, this study advances CBIR research for volumetric medical images through three key contributions: (1) a framework eliminating reliance on pre-segmented data and organ-specific datasets, aligning with large and unstructured image archiving systems, i.e. PACS in clinical practice; (2) introduction of C-MIR, a novel volumetric re-ranking method adapting ColBERT’s contextualized late interaction mechanism for 3D medical imaging; (3) comprehensive evaluation across four tumor sites using three feature extractors and three database configurations. Our evaluations highlight the significant advantages of C-MIR. We demonstrate the successful adaptation of the late interaction principle to volumetric medical images, enabling effective context-aware re-ranking. A key finding is C-MIR’s ability to effectively localize the region of interest, eliminating the need for pre-segmentation of datasets and offering a computationally efficient alternative to systems relying on expensive data enrichment steps. C-MIR demonstrates promising improvements in tumor flagging, achieving improved performance, particularly for colon and lung tumors (p<0.05). C-MIR also shows potential for improving tumor staging, warranting further exploration of its capabilities. Ultimately, our work seeks to bridge the gap between advanced retrieval techniques and their practical applications in healthcare, paving the way for improved diagnostic processes.
zh
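C-MIR 借用的 ColBERT 晚期交互打分规则本身很简单:查询的每个 token 嵌入取其与文档所有 token 的最大相似度(MaxSim)再求和。以下为该规则的最小示意,输入为任意二维 token 嵌入矩阵:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction: for each (L2-normalized) query token,
    take its max dot product over all document tokens, then sum."""
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    q, d = normalize(query_tokens), normalize(doc_tokens)
    sim = q @ d.T                    # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()     # MaxSim per query token, summed
```

C-MIR 的做法可以理解为把三维医学图像的局部特征当作"token",用同样的 MaxSim 规则对候选体积重排序,从而天然具备定位感兴趣区域的能力。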
[CV-41] Physics-based Human Pose Estimation from a Single Moving RGB Camera
【速读】:该论文旨在解决单目(monocular)和基于物理的人体姿态追踪方法在非平坦地面或相机动态运动场景下易产生伪影(artifact)的问题,以及现有方法常依赖合成数据或缺乏真实世界光照传输、相机运动与姿态诱导外观及几何变化建模的局限性。其解决方案的关键在于构建首个非合成的真实世界数据集 MoviCam,该数据集包含动态移动的单目RGB相机的真值轨迹、场景几何结构及人体三维运动,并标注了人体与场景接触信息;同时提出 PhysDynPose 方法,通过融合场景几何与物理约束,在相机运动和非平面环境中实现更准确的人体姿态跟踪:具体而言,利用先进的运动学估计算法获取人体姿态,并结合鲁棒的SLAM(Simultaneous Localization and Mapping)方法捕捉动态相机轨迹,从而恢复世界坐标系中的人体姿态,再通过场景感知的物理优化器对初始估计进行精修。
链接: https://arxiv.org/abs/2507.17406
作者: Ayce Idil Aytekin,Chuqiao Li,Diogo Luvizon,Rishabh Dabral,Martin Oswald,Marc Habermann,Christian Theobalt
机构: Max Planck Institute for Informatics (马克斯·普朗克信息研究所); University of Tübingen (图宾根大学); University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most monocular and physics-based human pose tracking methods, while achieving state-of-the-art results, suffer from artifacts when the scene does not have a strictly flat ground plane or when the camera is moving. Moreover, these methods are often evaluated on in-the-wild real world videos without ground-truth data or on synthetic datasets, which fail to model the real world light transport, camera motion, and pose-induced appearance and geometry changes. To tackle these two problems, we introduce MoviCam, the first non-synthetic dataset containing ground-truth camera trajectories of a dynamically moving monocular RGB camera, scene geometry, and 3D human motion with human-scene contact labels. Additionally, we propose PhysDynPose, a physics-based method that incorporates scene geometry and physical constraints for more accurate human motion tracking in case of camera motion and non-flat scenes. More precisely, we use a state-of-the-art kinematics estimator to obtain the human pose and a robust SLAM method to capture the dynamic camera trajectory, enabling the recovery of the human pose in the world frame. We then refine the kinematic pose estimate using our scene-aware physics optimizer. From our new benchmark, we found that even state-of-the-art methods struggle with this inherently challenging setting, i.e. a moving camera and non-planar environments, while our method robustly estimates both human and camera poses in world coordinates.
zh
[CV-42] HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning ICCV’25
【速读】:该论文旨在解决部分相关视频检索(Partially Relevant Video Retrieval, PRVR)中因欧几里得空间(Euclidean space)几何失真导致的视频层次结构建模不佳问题,从而影响跨模态匹配精度。其解决方案的关键在于提出首个基于双曲空间(hyperbolic space)建模的框架HLFormer,通过在混合空间中融合洛伦兹注意力模块(Lorentz Attention Block)与欧几里得注意力模块(Euclidean Attention Block),并引入均值引导自适应交互模块(Mean-Guided Adaptive Interaction Module)动态融合特征;同时设计偏序保持损失(Partial Order Preservation Loss)以利用洛伦兹锥约束强化“文本-视频”间的层次关系,有效提升部分相关性建模能力。
链接: https://arxiv.org/abs/2507.17402
作者: Li Jun,Wang Jinpeng,Tan Chaolei,Lian Niu,Chen Long,Zhang Min,Wang Yaowei,Xia Shu-Tao,Chen Bin
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Research Center of Artificial Intelligence, Peng Cheng Laboratory (鹏城实验室人工智能研究中心); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Multimedia (cs.MM)
备注: Accepted by ICCV’25. 13 pages, 6 figures, 4 tables
Abstract:Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce “text < video” hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code is released at this https URL.
zh
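HLFormer 的洛伦兹注意力与偏序保持损失建立在双曲几何的洛伦兹模型上。以下给出洛伦兹内积与双曲测地距离的最小实现示意(曲率取 -1),仅说明所用几何,与论文的网络实现无关:

```python
import numpy as np

def lift(v):
    """Lift a Euclidean vector onto the unit hyperboloid (Lorentz model)."""
    x0 = np.sqrt(1.0 + np.dot(v, v))
    return np.concatenate(([x0], v))

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + <x_rest, y_rest>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def hyperbolic_distance(x, y):
    """Geodesic distance on the hyperboloid: arccosh(-<x, y>_L)."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))
```

双曲空间的体积随半径指数增长,因此比欧氏空间更适合嵌入树状层次结构;偏序保持损失中的洛伦兹锥约束正是利用了这一几何性质。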
[CV-43] HiProbe-VAD: Video Anomaly Detection via Hidden States Probing in Tuning-Free Multimodal LLM s ACM-MM2025
【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)中传统方法对大量标注数据依赖性强、计算开销大以及实际应用受限的问题。其解决方案的关键在于利用预训练多模态大语言模型(Multimodal Large Language Models, MLLMs)的中间隐藏状态(hidden states),发现这些中间层表示相较于输出层具有更高的异常敏感性和线性可分性;进而提出动态层显著性探测(Dynamic Layer Saliency Probing, DLSP)机制,智能识别并提取最优中间层中的信息丰富特征,结合轻量级异常评分与时间定位模块实现高效异常检测与解释生成,且无需任何微调即可在不同MLLMs间实现良好泛化能力。
链接: https://arxiv.org/abs/2507.17394
作者: Zhaolin Cai,Fan Li,Ziwei Zheng,Yanjun Qin
机构: Xinjiang University (新疆大学); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ACM MM 2025
Abstract:Video Anomaly Detection (VAD) aims to identify and locate deviations from normal patterns in video sequences. Traditional methods often struggle with substantial computational demands and a reliance on extensive labeled datasets, thereby restricting their practical applicability. To address these constraints, we propose HiProbe-VAD, a novel framework that leverages pre-trained Multimodal Large Language Models (MLLMs) for VAD without requiring fine-tuning. In this paper, we discover that the intermediate hidden states of MLLMs contain information-rich representations, exhibiting higher sensitivity and linear separability for anomalies compared to the output layer. To capitalize on this, we propose a Dynamic Layer Saliency Probing (DLSP) mechanism that intelligently identifies and extracts the most informative hidden states from the optimal intermediate layer during the MLLMs reasoning. Then a lightweight anomaly scorer and temporal localization module efficiently detects anomalies using these extracted hidden states and finally generate explanations. Experiments on the UCF-Crime and XD-Violence datasets demonstrate that HiProbe-VAD outperforms existing training-free and most traditional approaches. Furthermore, our framework exhibits remarkable cross-model generalization capabilities in different MLLMs without any tuning, unlocking the potential of pre-trained MLLMs for video anomaly detection and paving the way for more practical and scalable solutions.
zh
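作为对“中间层线性可分性”这一观察的直观说明,下面给出一个极简的假设性示意(并非论文中 DLSP 的原始实现):用类间/类内方差比为每一层的隐藏状态打分,并选取得分最高的层。

```python
import numpy as np

def layer_saliency(hidden_states, labels):
    """对每一层计算线性可分性得分(类间方差 / 类内方差),
    得分越高说明该层特征对异常越敏感。
    注意:此为示意性替代实现,并非论文中 DLSP 的原始定义。"""
    labels = np.asarray(labels)
    scores = []
    for h in hidden_states:          # h: (N, D),某一层所有样本的隐藏状态
        mu = h.mean(axis=0)
        between, within = 0.0, 0.0
        for c in np.unique(labels):
            hc = h[labels == c]
            mu_c = hc.mean(axis=0)
            between += len(hc) * np.sum((mu_c - mu) ** 2)
            within += np.sum((hc - mu_c) ** 2)
        scores.append(between / (within + 1e-8))
    return int(np.argmax(scores)), scores

# 玩具数据:第 0 层是纯噪声,第 1 层两类线性可分
rng = np.random.default_rng(0)
labels = np.array([0] * 20 + [1] * 20)
noisy = rng.normal(0, 1.0, size=(40, 8))
separable = np.concatenate([rng.normal(-3, 0.3, (20, 8)),
                            rng.normal(+3, 0.3, (20, 8))])
best, scores = layer_saliency([noisy, separable], labels)
print(best)  # 选中下标 1 的“中间层”
```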
[CV-44] EndoGen: Conditional Autoregressive Endoscopic Video Generation MICCAI2025
【速读】:该论文旨在解决内窥镜视频生成中长期存在的两大问题:一是现有方法多局限于静态图像生成,无法提供临床所需的动态上下文信息;二是多数采用无条件生成策略,缺乏对医生诊断具有参考意义的可控性。其解决方案的关键在于提出首个条件驱动的内窥镜视频生成框架EndoGen,核心创新包括两个方面:首先,设计了时空网格-帧模式(Spatiotemporal Grid-Frame Patterning, SGP)策略,将多帧生成任务重构为基于网格的图像生成模式,从而有效利用自回归架构在全局依赖建模上的优势;其次,引入语义感知的Token掩码机制(Semantic-Aware Token Masking, SAT),通过在生成过程中聚焦于语义显著区域,增强内容的丰富性和多样性。实验表明,该框架可生成高质量、条件引导的内窥镜视频,并提升下游任务如息肉分割的性能。
链接: https://arxiv.org/abs/2507.17388
作者: Xinyu Liu,Hengyu Liu,Cheng Wang,Tianming Liu,Yixuan Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: MICCAI 2025
Abstract:Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the learning of generating multiple frames as a grid-based image generation pattern, which effectively capitalizes the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model’s ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and improves the performance of downstream task of polyp segmentation. Code released at this https URL.
zh
[CV-45] A Conditional Probability Framework for Compositional Zero-shot Learning
【速读】:该论文旨在解决组合零样本学习(Compositional Zero-Shot Learning, CZSL)中对未见过的对象与属性组合识别能力不足的问题,其核心挑战在于传统方法将对象和属性视为独立实体进行解耦学习,忽略了二者之间的语义约束和上下文依赖关系。解决方案的关键在于提出一种条件概率框架(Conditional Probability Framework, CPF),显式建模属性与对象间的相互依赖性:将组合概率分解为对象先验概率与属性在给定对象下的条件概率两部分,并通过文本描述增强对象特征以突出语义相关区域,再利用交叉注意力机制引导属性学习,实现更精准的上下文对齐。该方法通过联合优化对象概率与条件属性概率,有效捕捉组合依赖关系,在多个CZSL基准测试中展现出优越性能。
链接: https://arxiv.org/abs/2507.17377
作者: Peng Wu,Qiuxia Lai,Hao Fang,Guo-Sen Xie,Yilong Yin,Xiankai Lu,Wenguan Wang
机构: Shandong University (山东大学); Communication University of China (中国传媒大学); Nanjing University of Science and Technology (南京理工大学); Zhejiang University (浙江大学); National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University (西安交通大学人机混合增强智能国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies inside a composition. For example, certain attributes naturally pair with specific objects (e.g., “striped” applies to “zebra” or “shirts” but not “sky” or “water”), while the same attribute can manifest differently depending on context (e.g., “young” in “young tree” vs. “young dog”). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. Code is available at here.
zh
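条件概率分解 p(属性, 对象) = p(对象) · p(属性 | 对象) 可以用如下玩具代码直观说明(纯示意:对象/属性名与打分数值均为假设,实际方法中这些打分由图像特征与交叉注意力产生):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [v / s for v in exps]

def composition_scores(obj_logits, attr_logits_given_obj, objects, attributes):
    """将组合概率分解为 p(attr, obj) = p(obj) * p(attr | obj)。
    attr_logits_given_obj[o] 是在给定对象 o 条件下各属性的打分,
    从而显式建模属性-对象依赖(概念示意)。"""
    p_obj = softmax(obj_logits)
    scores = {}
    for i, o in enumerate(objects):
        p_attr = softmax(attr_logits_given_obj[o])
        for j, a in enumerate(attributes):
            scores[(a, o)] = p_obj[i] * p_attr[j]
    return scores

objects = ["zebra", "sky"]
attributes = ["striped", "blue"]
# 条件打分体现语义约束:striped 更可能搭配 zebra,blue 更可能搭配 sky
scores = composition_scores(
    obj_logits=[1.0, 1.0],
    attr_logits_given_obj={"zebra": [3.0, -1.0], "sky": [-1.0, 3.0]},
    objects=objects, attributes=attributes)
best = max(scores, key=scores.get)
```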
[CV-46] SFUOD: Source-Free Unknown Object Detection ICCV2025
【速读】:该论文旨在解决无源域目标检测(source-free object detection)中“封闭集”假设的局限性问题,即传统方法在无法访问源数据的域自适应过程中仅能识别源域预定义类别,无法检测目标域中出现的未知对象。为突破这一限制,作者提出无源域未知对象检测(Source-Free Unknown Object Detection, SFUOD)新范式,使检测器既能识别已知对象,又能将未知对象标记为未知类。其核心解决方案是CollaPAUL框架:通过协同调优(Collaborative Tuning)机制,利用跨域注意力融合预训练检测器的源域知识与辅助编码器的目标域知识,实现更有效的知识迁移;同时采用基于主轴的未知标签分配(Principal Axes-based Unknown Labeling),通过主轴投影估计对象性(objectness),并结合模型预测的置信度得分,自动为未知对象生成伪标签。该方法在多个SFUOD基准上达到最优性能,验证了其有效性。
链接: https://arxiv.org/abs/2507.17373
作者: Keon-Hee Park,Seun-An Choe,Gyeong-Moon Park
机构: Kyung Hee University (庆熙大学); Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ICCV 2025
Abstract:Source-free object detection adapts a detector pre-trained on a source domain to an unlabeled target domain without requiring access to labeled source data. While this setting is practical as it eliminates the need for the source dataset during domain adaptation, it operates under the restrictive assumption that only pre-defined objects from the source domain exist in the target domain. This closed-set setting prevents the detector from detecting undefined objects. To ease this assumption, we propose Source-Free Unknown Object Detection (SFUOD), a novel scenario which enables the detector to not only recognize known objects but also detect undefined objects as unknown objects. To this end, we propose CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling), a novel framework for SFUOD. Collaborative tuning enhances knowledge adaptation by integrating target-dependent knowledge from the auxiliary encoder with source-dependent knowledge from the pre-trained detector through a cross-domain attention mechanism. Additionally, principal axes-based unknown labeling assigns pseudo-labels to unknown objects by estimating objectness via principal axes projection and confidence scores from model predictions. The proposed CollaPAUL achieves state-of-the-art performances on SFUOD benchmarks, and extensive experiments validate its effectiveness.
zh
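基于主轴投影估计对象性(objectness)的思路可粗略示意如下(假设性简化:用已知目标特征在主成分子空间内的投影能量占比作为得分,并非论文的完整定义):

```python
import numpy as np

def objectness_by_principal_axes(feats, queries, k=2):
    """沿已知目标特征的前 k 个主轴投影候选特征,
    以投影能量占比作为 objectness 得分(示意性实现)。"""
    feats = np.asarray(feats, dtype=float)
    mu = feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats - mu, full_matrices=False)
    axes = vt[:k]                       # (k, D) 主轴
    q = np.asarray(queries, dtype=float) - mu
    proj = q @ axes.T                   # 主轴子空间内的坐标
    energy = np.sum(proj ** 2, axis=1)
    total = np.sum(q ** 2, axis=1) + 1e-8
    return energy / total               # 越接近 1 越“像目标”

rng = np.random.default_rng(1)
feats = np.zeros((50, 4))
feats[:, :2] = rng.normal(0, 1, (50, 2))       # 已知目标主要分布在前两维
scores = objectness_by_principal_axes(
    feats, queries=[[2.0, -1.0, 0.0, 0.0],     # 落在主轴子空间内
                    [0.0, 0.0, 2.0, -1.0]])    # 与主轴近乎正交
```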
[CV-47] Exploring Spatial Diversity for Region-based Active Learning
【速读】:该论文旨在解决语义分割任务中因依赖大规模标注数据而导致的高标注成本问题。现有方法通常基于深度神经网络在全监督模式下训练,但像素级密集预测任务的标注代价高昂。为此,论文提出一种基于区域的主动学习(region-based active learning)策略,通过选择最具信息量的图像区域进行标注来降低标注成本。其解决方案的关键在于构建一个统一的优化框架,将传统的样本不确定性(如模型置信度)与局部空间多样性(local spatial diversity)相结合,从而在减少标注量的同时提升模型性能。实验表明,该框架仅需5–9%的标注像素即可达到全监督方法95%的性能,显著优于当前最先进的区域级主动学习方法。
链接: https://arxiv.org/abs/2507.17367
作者: Lile Cai,Xun Xu,Lining Zhang,Chuan-Sheng Foo
机构: Institute for Infocomm Research (I2R), A*STAR (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: published in IEEE Transactions on Image Processing, 2021
Abstract:State-of-the-art methods for semantic segmentation are based on deep neural networks trained on large-scale labeled datasets. Acquiring such datasets would incur large annotation costs, especially for dense pixel-level prediction tasks like semantic segmentation. We consider region-based active learning as a strategy to reduce annotation costs while maintaining high performance. In this setting, batches of informative image regions instead of entire images are selected for labeling. Importantly, we propose that enforcing local spatial diversity is beneficial for active learning in this case, and to incorporate spatial diversity along with the traditional active selection criterion, e.g., data sample uncertainty, in a unified optimization framework for region-based active learning. We apply this framework to the Cityscapes and PASCAL VOC datasets and demonstrate that the inclusion of spatial diversity effectively improves the performance of uncertainty-based and feature diversity-based active learning methods. Our framework achieves 95% performance of fully supervised methods with only 5-9% of the labeled pixels, outperforming all state-of-the-art region-based active learning methods for semantic segmentation.
zh
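将不确定性与空间多样性统一到一个选择目标中的思想,可用如下贪心选择示意(λ 取值与距离惩罚的具体形式均为假设;论文采用统一优化框架,而非此简化规则):

```python
def select_regions(regions, budget, lam=1.0):
    """贪心选择待标注区域:在不确定性之上加入空间多样性惩罚,
    避免选中彼此相邻的区域(对统一优化框架的概念性简化)。
    regions: [(id, uncertainty, (x, y)), ...]
    """
    selected = []
    pool = list(regions)
    while pool and len(selected) < budget:
        def gain(r):
            _, unc, (x, y) = r
            if not selected:
                return unc
            d_min = min(((x - sx) ** 2 + (y - sy) ** 2) ** 0.5
                        for _, _, (sx, sy) in selected)
            return unc - lam / (1.0 + d_min)   # 离已选区域越近,惩罚越大
        best = max(pool, key=gain)
        selected.append(best)
        pool.remove(best)
    return [r[0] for r in selected]

regions = [("A", 0.9, (0, 0)), ("B", 0.85, (0, 1)),  # B 紧邻 A
           ("C", 0.6, (10, 10))]
picked = select_regions(regions, budget=2)
# 纯不确定性会选 A、B;加入空间多样性后改选 A、C
```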
[CV-48] Exploring Active Learning for Semiconductor Defect Segmentation ICIP2022
【速读】:该论文旨在解决半导体X射线显微成像(X-Ray Microscopy, XRM)中深度学习模型训练所需的大量标注数据问题,尤其是在语义分割等密集预测任务中,标注成本高且耗时。针对这一问题,论文提出了一种结合主动学习(Active Learning, AL)的解决方案,其关键在于:首先在未标注数据上进行对比预训练(contrastive pretraining),以获得每个AL循环的初始化权重,从而缓解因领域偏移(domain shift)带来的性能下降;其次设计了一种基于稀有性感知(rareness-aware)的采样选择函数,优先选取包含罕见类别的样本,有效应对类别严重不平衡的问题。实验表明,该方法在高带宽存储器结构的XRM数据集上达到了当前最优性能。
链接: https://arxiv.org/abs/2507.17359
作者: Lile Cai,Ramanpreet Singh Pahwa,Xun Xu,Jie Wang,Richard Chang,Lining Zhang,Chuan-Sheng Foo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted to ICIP 2022
Abstract:The development of X-Ray microscopy (XRM) technology has enabled non-destructive inspection of semiconductor structures for defect identification. Deep learning is widely used as the state-of-the-art approach to perform visual analysis tasks. However, deep learning based models require large amount of annotated data to train. This can be time-consuming and expensive to obtain especially for dense prediction tasks like semantic segmentation. In this work, we explore active learning (AL) as a potential solution to alleviate the annotation burden. We identify two unique challenges when applying AL on semiconductor XRM scans: large domain shift and severe class-imbalance. To address these challenges, we propose to perform contrastive pretraining on the unlabelled data to obtain the initialization weights for each AL cycle, and a rareness-aware acquisition function that favors the selection of samples containing rare classes. We evaluate our method on a semiconductor dataset that is compiled from XRM scans of high bandwidth memory structures composed of logic and memory dies, and demonstrate that our method achieves state-of-the-art performance.
zh
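rareness-aware 采集函数的核心是“含稀有类的样本优先”,可示意如下(类别名、频率数值与 β 权重均为假设;示意中稀有度项未做归一化,量纲远大于不确定性,实际使用需调权):

```python
def rareness_aware_scores(samples, class_freq, beta=1.0):
    """rareness-aware 采集函数示意:在不确定性之上,
    按样本所含预测类别的最大稀有度(频率倒数)加分。
    samples: [(id, uncertainty, predicted_classes), ...]
    class_freq: 各类别在已标注数据中的占比
    """
    scored = []
    for sid, unc, classes in samples:
        rareness = max(1.0 / class_freq[c] for c in classes)
        scored.append((sid, unc + beta * rareness))
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored

# 假设的类别与频率,void_defect 为严重欠采样的稀有缺陷类
class_freq = {"background": 0.90, "logic": 0.08, "void_defect": 0.02}
samples = [("s1", 0.8, ["background", "logic"]),
           ("s2", 0.5, ["background", "void_defect"])]
ranked = rareness_aware_scores(samples, class_freq)
# 尽管 s2 不确定性更低,但因含稀有类而被排到首位
```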
[CV-49] Exploring Active Learning for Label-Efficient Training of Semantic Neural Radiance Field ICME2025
【速读】:该论文旨在解决语义感知神经辐射场(Semantically-aware Neural Radiance Field, Semantically-aware NeRF)训练过程中对像素级类别标签的高标注成本问题。其核心解决方案是引入主动学习(Active Learning)策略,通过智能选择最具信息量的样本进行标注,从而显著降低人工标注需求。关键创新在于提出了一种融合三维几何约束的新型主动学习策略,在样本选择时考虑了场景的3D结构信息,有效提升了标注效率,实验表明相比随机采样可实现超过2倍的标注成本削减。
链接: https://arxiv.org/abs/2507.17351
作者: Yuzhe Zhu,Lile Cai,Kangkang Lu,Fayao Liu,Xulei Yang
机构: Institute for Infocomm Research (I2R), A*STAR, Singapore; Nanyang Technological University, Singapore
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICME 2025
Abstract:Neural Radiance Field (NeRF) models are implicit neural scene representation methods that offer unprecedented capabilities in novel view synthesis. Semantically-aware NeRFs not only capture the shape and radiance of a scene, but also encode semantic information of the scene. The training of semantically-aware NeRFs typically requires pixel-level class labels, which can be prohibitively expensive to collect. In this work, we explore active learning as a potential solution to alleviate the annotation burden. We investigate various design choices for active learning of semantically-aware NeRF, including selection granularity and selection strategies. We further propose a novel active learning strategy that takes into account 3D geometric constraints in sample selection. Our experiments demonstrate that active learning can effectively reduce the annotation cost of training semantically-aware NeRF, achieving more than 2X reduction in annotation cost compared to random sampling.
zh
[CV-50] Swin-TUNA: A Novel PEFT Approach for Accurate Food Image Segmentation
【速读】:该论文旨在解决现有基于Transformer的大规模模型(如FoodSAM)在食品图像分割任务中因参数量庞大和计算资源消耗高而难以满足工业部署需求的问题。其解决方案的关键在于提出一种参数高效微调(Parameter Efficient Fine-Tuning, PEFT)方法——Swin-TUNA,该方法通过在Swin Transformer架构中集成多尺度可训练适配器模块(TUNable Adapter module),仅更新4%的参数即可实现高性能分割。核心创新在于层次化特征适配机制:设计深度与维度映射上不同尺度的分离卷积,以缓解浅层与深层网络特征差异,并引入任务无关与任务特定特征的动态平衡策略,从而在保持高精度的同时显著降低模型复杂度(参数减少98.7%,降至8.13M),并在小样本场景下表现出更快收敛速度和更强泛化能力。
链接: https://arxiv.org/abs/2507.17347
作者: Haotian Chen,Zhiyong Xiao
机构: Jiangnan University (江南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In the field of food image processing, efficient semantic segmentation techniques are crucial for industrial applications. However, existing large-scale Transformer-based models (such as FoodSAM) face challenges in meeting practical deployment requirements due to their massive parameter counts and high computational resource demands. This paper introduces the TUNable Adapter module (Swin-TUNA), a Parameter Efficient Fine-Tuning (PEFT) method that integrates multiscale trainable adapters into the Swin Transformer architecture, achieving high-performance food image segmentation by updating only 4% of the parameters. The core innovation of Swin-TUNA lies in its hierarchical feature adaptation mechanism: it designs separable convolutions in depth and dimensional mappings of varying scales to address the differences in features between shallow and deep networks, combined with a dynamic balancing strategy for task-agnostic and task-specific features. Experiments demonstrate that this method achieves mIoU of 50.56% and 74.94% on the FoodSeg103 and UECFoodPix Complete datasets, respectively, surpassing the fully parameterized FoodSAM model while reducing the parameter count by 98.7% (to only 8.13M). Furthermore, Swin-TUNA exhibits faster convergence and stronger generalization capabilities in low-data scenarios, providing an efficient solution for lightweight food image segmentation.
zh
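适配器式 PEFT“只训练少量新增参数”的思想可用一个瓶颈适配器示意(结构与数值均为假设,并非 Swin-TUNA 的原始模块;up 投影零初始化使插入后初始等价于恒等映射,不破坏冻结主干的行为):

```python
import numpy as np

class Adapter:
    """瓶颈式适配器示意:下投影 -> ReLU -> 上投影,残差连接。
    仅用于说明 PEFT 的可训练参数占比,非论文原始结构。"""
    def __init__(self, dim, bottleneck, rng):
        self.down = rng.normal(0, 0.02, (dim, bottleneck))
        self.up = np.zeros((bottleneck, dim))   # 零初始化:训练前等价于恒等映射
    def __call__(self, x):
        h = np.maximum(x @ self.down, 0.0)      # ReLU
        return x + h @ self.up
    def num_params(self):
        return self.down.size + self.up.size

dim, bottleneck = 768, 48
frozen_backbone_params = 28_000_000            # 假设的冻结主干参数量
adapter = Adapter(dim, bottleneck, np.random.default_rng(0))
trainable_ratio = adapter.num_params() / (frozen_backbone_params
                                          + adapter.num_params())
x = np.ones((2, dim))
out = adapter(x)                               # 初始时 out == x
```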
[CV-51] Principled Multimodal Representation Learning
【速读】:该论文旨在解决多模态表示学习中传统方法依赖预定义锚点模态(anchor modality)导致的跨模态对齐受限问题,以及现有多模态同步对齐方法中存在的优化不稳定性和固定锚点限制。其解决方案的关键在于提出了一种原理性框架——Principled Multimodal Representation Learning (PMRL),该框架基于理论洞察:完全对齐对应于Gram矩阵秩为1的特性,通过优化表示矩阵的最大奇异值来实现所有模态沿共享主方向的一致对齐;同时设计了一种基于softmax的损失函数,将奇异值视为logits以优先最大化最大奇异值,并引入实例级对比正则化以保持主特征向量间的区分度,从而避免表示坍缩,提升对齐稳定性与性能。
链接: https://arxiv.org/abs/2507.17343
作者: Xiaohao Liu,Xiaobo Xia,See-Kiong Ng,Tat-Seng Chua
机构: National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 32 pages, 9 figures, 10 tables
Abstract:Multimodal representation learning seeks to create a unified representation space by integrating diverse data modalities to improve multimodal understanding. Traditional methods often depend on pairwise contrastive learning, which relies on a predefined anchor modality, restricting alignment across all modalities. Recent advances have investigated the simultaneous alignment of multiple modalities, yet several challenges remain, such as limitations imposed by fixed anchor points and instability arising from optimizing the product of singular values. To address the challenges, in this paper, we propose Principled Multimodal Representation Learning (PMRL), a novel framework that achieves simultaneous alignment of multiple modalities without anchor dependency in a more stable manner. Specifically, grounded in the theoretical insight that full alignment corresponds to a rank-1 Gram matrix, PMRL optimizes the dominant singular value of the representation matrix to align modalities along a shared leading direction. We propose a softmax-based loss function that treats singular values as logits to prioritize the largest singular value. Besides, instance-wise contrastive regularization on the leading eigenvectors maintains inter-instance separability and prevents representation collapse. Extensive experiments across diverse tasks demonstrate PMRL’s superiority compared to baseline methods. The source code will be publicly available.
zh
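“把奇异值当作 logits、最大化最大奇异值占比”的损失可直接写出(示意性实现;论文还包含主特征向量上的实例级对比正则,此处从略):

```python
import numpy as np

def pmrl_loss(reps):
    """PMRL 核心思想示意:把 M 个模态的单位化表示堆成矩阵,
    将奇异值当作 logits 做 softmax,损失为最大奇异值占比的负对数。
    完全对齐时 Gram 矩阵秩为 1,损失最小。"""
    reps = np.asarray(reps, dtype=float)
    reps = reps / np.linalg.norm(reps, axis=1, keepdims=True)  # 各模态单位化
    s = np.linalg.svd(reps, compute_uv=False)  # 奇异值按降序返回
    p = np.exp(s - s.max())
    p = p / p.sum()
    return -np.log(p[0])                       # p[0] 对应最大奇异值

aligned = [[1.0, 0.0, 0.0]] * 3                # 三个模态完全对齐(秩 1)
misaligned = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]]                 # 三个模态互相正交
```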
[CV-52] DeMo: Motion Decoupling for Autonomous Driving NEURIPS2024
【速读】:该论文旨在解决现有自动驾驶中运动预测与规划方法在建模轨迹复杂时空演化方面的不足,尤其是基于“一查询一轨迹”范式的局限性——此类方法虽能生成多模态运动意图,但难以准确捕捉交通参与者轨迹的动态演变过程,易导致碰撞或次优决策。其解决方案的关键在于提出DeMo++框架,通过将运动估计解耦为两个独立模块:一是整体运动意图(holistic motion intentions)以捕获多样化的潜在运动方向,二是精细时空状态(fine spatiotemporal states)用于追踪个体在场景中的动态进展并实现自优化能力;同时引入跨场景轨迹交互机制,挖掘相邻场景间运动关系,从而全面建模运动意图多样性与每条轨迹的时空演化特性。为高效实现该框架,还设计了融合注意力机制与Mamba结构的混合模型,兼顾场景信息聚合效率与轨迹状态序列建模精度。
链接: https://arxiv.org/abs/2507.17342
作者: Bozhou Zhang,Nan Song,Xiatian Zhu,Li Zhang
机构: Fudan University (复旦大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Journal extension of NeurIPS 2024. arXiv admin note: substantial text overlap with arXiv:2410.05982
Abstract:Motion forecasting and planning are tasked with estimating the trajectories of traffic agents and the ego vehicle, respectively, to ensure the safety and efficiency of autonomous driving systems in dynamically changing environments. State-of-the-art methods typically adopt a one-query-one-trajectory paradigm, where each query corresponds to a unique trajectory for predicting multi-mode trajectories. While this paradigm can produce diverse motion intentions, it often falls short in modeling the intricate spatiotemporal evolution of trajectories, which can lead to collisions or suboptimal outcomes. To overcome this limitation, we propose DeMo++, a framework that decouples motion estimation into two distinct components: holistic motion intentions to capture the diverse potential directions of movement, and fine spatiotemporal states to track the agent’s dynamic progress within the scene and enable a self-refinement capability. Further, we introduce a cross-scene trajectory interaction mechanism to explore the relationships between motions in adjacent scenes. This allows DeMo++ to comprehensively model both the diversity of motion intentions and the spatiotemporal evolution of each trajectory. To effectively implement this framework, we developed a hybrid model combining Attention and Mamba. This architecture leverages the strengths of both mechanisms for efficient scene information aggregation and precise trajectory state sequence modeling. Extensive experiments demonstrate that DeMo++ achieves state-of-the-art performance across various benchmarks, including motion forecasting (Argoverse 2 and nuScenes), motion planning (nuPlan), and end-to-end planning (NAVSIM).
zh
[CV-53] mporal Point-Supervised Signal Reconstruction: A Human-Annotation-Free Framework for Weak Moving Target Detection
【速读】:该论文旨在解决低空监视与预警系统中弱运动目标检测难题,主要挑战包括信号能量低、目标空间范围小以及背景杂波复杂等。传统方法在特征提取鲁棒性和依赖人工标注方面存在局限。其解决方案的关键在于提出一种全新的Temporal Point-Supervised(TPS)框架,将检测任务重构为像素级的时间信号建模问题,使弱目标表现为短时脉冲响应;同时设计了Temporal Signal Reconstruction Network(TSRNet),采用编码器-解码器结构并引入Dynamic Multi-Scale Attention(DMSAttention)模块以增强对多样化时间模式的敏感性,并结合基于图的轨迹挖掘策略抑制虚警,从而实现无需人工标注的高精度实时检测,实验表明该方法在自建低信噪比(low-SNR)数据集上优于现有最优方法,且推理速度超过1000 FPS。
链接: https://arxiv.org/abs/2507.17334
作者: Weihua Gao,Chunxu Ren,Wenlong Niu,Xiaodong Peng
机构: University of Chinese Academy of Sciences (中国科学院大学); National Space Science Center (国家空间科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:In low-altitude surveillance and early warning systems, detecting weak moving targets remains a significant challenge due to low signal energy, small spatial extent, and complex background clutter. Existing methods struggle with extracting robust features and suffer from the lack of reliable annotations. To address these limitations, we propose a novel Temporal Point-Supervised (TPS) framework that enables high-performance detection of weak targets without any manual annotations. Instead of conventional frame-based detection, our framework reformulates the task as a pixel-wise temporal signal modeling problem, where weak targets manifest as short-duration pulse-like responses. A Temporal Signal Reconstruction Network (TSRNet) is developed under the TPS paradigm to reconstruct these transient signals. TSRNet adopts an encoder-decoder architecture and integrates a Dynamic Multi-Scale Attention (DMSAttention) module to enhance its sensitivity to diverse temporal patterns. Additionally, a graph-based trajectory mining strategy is employed to suppress false alarms and ensure temporal consistency. Extensive experiments on a purpose-built low-SNR dataset demonstrate that our framework outperforms state-of-the-art methods while requiring no human annotations. It achieves strong detection performance and operates at over 1000 FPS, underscoring its potential for real-time deployment in practical scenarios.
zh
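“弱目标在单个像素上表现为短时脉冲”的信号视角可用一个简单的稳健检测示意(中值基线加 MAD 噪声估计为假设的替代方案,并非 TSRNet 的重建网络):

```python
import numpy as np

def detect_pulse(signal, win=9, k=4.0):
    """像素级时间信号建模示意:用滑动中值作为背景估计,
    残差超过 k 倍稳健噪声尺度的帧判为脉冲(概念演示)。"""
    signal = np.asarray(signal, dtype=float)
    pad = win // 2
    padded = np.pad(signal, pad, mode="edge")
    baseline = np.array([np.median(padded[i:i + win])
                         for i in range(len(signal))])
    resid = signal - baseline
    sigma = np.median(np.abs(resid)) / 0.6745 + 1e-8   # MAD 噪声估计
    return np.where(resid > k * sigma)[0]

rng = np.random.default_rng(2)
sig = rng.normal(0, 0.1, 200)      # 背景噪声
sig[100:103] += 2.0                # 注入 3 帧短时脉冲(弱目标经过)
hits = detect_pulse(sig)
```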
[CV-54] PARTE: Part-Guided Texturing for 3D Human Reconstruction from a Single Image ICCV2025
【速读】:该论文旨在解决现有3D人体重建方法中因人体各部位纹理未对齐而导致的纹理混淆问题(texture misalignment),即不同人体部件(如上衣、裤子)在重建过程中容易相互融合,缺乏清晰边界。其核心解决方案是引入基于3D人体部件信息的引导机制,提出PARTE框架,关键在于两个模块:一是通过纹理无关的人体表面重建与部件分割模块(PartSegmenter)从单张图像中推断出3D人体部件标签;二是设计了一个部件引导的纹理生成模块(PartTexturer),利用预训练图像生成模型中的部件纹理对齐先验知识,显式地指导纹理重建过程,从而实现各部件间纹理的结构一致性与视觉分离性。
链接: https://arxiv.org/abs/2507.17332
作者: Hyeongjin Nam,Donghwan Kim,Gyeongsik Moon,Kyoung Mu Lee
机构: Seoul National University (首尔国立大学); Korea University (高丽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICCV 2025, 22 pages including the supplementary material
Abstract:The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which utilizes 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Extensive experiments demonstrate that our framework achieves state-of-the-art quality in 3D human reconstruction. The project page is available at this https URL.
zh
[CV-55] CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits
【速读】:该论文旨在解决当前数字人技术中交互式二维卡通风格数字人生成效率低、灵活性差的问题。现有主流方法多依赖于高成本的3D建模或缺乏实时交互能力的2D视频表示,难以满足快速生成个性化、高表达力的二维卡通角色的需求。其解决方案的关键在于提出一种名为CartoonAlive的新方法,通过引入3D人脸建模中常用的形状基(shape basis)概念构建适用于Live2D的面部混合形状(blendshapes),并基于输入图像中检测到的面部关键点推断对应的混合形状权重,从而在不到半分钟内实现从单张肖像图到高质量、高保真且可实时驱动的Live2D数字人的高效生成。
链接: https://arxiv.org/abs/2507.17327
作者: Chao He,Jianqiang Ren,Jianjing Xiang,Xiejie Shen
机构: Tongyi Lab(通义实验室); Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid advancement of large foundation models, AIGC, cloud rendering, and real-time motion capture technologies, digital humans are now capable of achieving synchronized facial expressions and body movements, engaging in intelligent dialogues driven by natural language, and enabling the fast creation of personalized avatars. While current mainstream approaches to digital humans primarily focus on 3D models and 2D video-based representations, interactive 2D cartoon-style digital humans have received relatively less attention. Compared to 3D digital humans that require complex modeling and high rendering costs, and 2D video-based solutions that lack flexibility and real-time interactivity, 2D cartoon-style Live2D models offer a more efficient and expressive alternative. By simulating 3D-like motion through layered segmentation without the need for traditional 3D modeling, Live2D enables dynamic and real-time manipulation. In this technical report, we present CartoonAlive, an innovative method for generating high-quality Live2D digital humans from a single input portrait image. CartoonAlive leverages the shape basis concept commonly used in 3D face modeling to construct facial blendshapes suitable for Live2D. It then infers the corresponding blendshape weights based on facial keypoints detected from the input image. This approach allows for the rapid generation of a highly expressive and visually accurate Live2D model that closely resembles the input portrait, within less than half a minute. Our work provides a practical and scalable solution for creating interactive 2D cartoon characters, opening new possibilities in digital content creation and virtual character animation. The project homepage is this https URL.
zh
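由检测到的关键点反推混合形状权重本质上是一个线性最小二乘问题,可示意如下(形状基与关键点均为玩具数据;实际系统通常还需非负、稀疏等约束):

```python
import numpy as np

def solve_blendshape_weights(keypoints, mean_shape, blendshapes):
    """最小二乘求解:keypoints ≈ mean_shape + sum_i w_i * blendshapes[i]。
    对“由关键点推断混合形状权重”这一步的概念示意。"""
    A = np.stack([b.ravel() for b in blendshapes], axis=1)  # (2K, n_bs)
    b = (keypoints - mean_shape).ravel()
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

mean_shape = np.zeros((4, 2))                 # 4 个关键点的中性形状(玩具数据)
smile = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]], dtype=float)
blink = np.array([[1, 0], [1, 0], [0, 0], [0, 0]], dtype=float)
target = mean_shape + 0.7 * smile + 0.2 * blink
w = solve_blendshape_weights(target, mean_shape, [smile, blink])
# 精确恢复出构造时的权重 [0.7, 0.2]
```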
[CV-56] CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance ICCV2025
【速读】:该论文旨在解决现有半稠密特征匹配方法在复杂场景下因全局搜索整个特征图而导致的精度和效率受限问题。其解决方案的关键在于提出了一种新颖的级联匹配流水线CasP,通过引入级联对应先验(cascaded correspondence priors)实现分阶段匹配:第一阶段利用基于区域的选择性交叉注意力机制增强特征判别力,并生成一对多的粗匹配区域;第二阶段则在这些先验区域内限制搜索范围,从而确定一对一的精匹配。该设计不仅提升了匹配精度与跨域泛化能力,还通过融合高层特征显著降低了低层特征提取的计算成本,尤其在高分辨率下表现出更强的加速优势(如1152分辨率下较最高效方法ELoFTR提速约2.2倍)。
链接: https://arxiv.org/abs/2507.17312
作者: Peiqi Chen,Lei Yu,Yi Wan,Yingying Pei,Xinyi Liu,Yongxiang Yao,Yingying Zhang,Lixiang Ru,Liheng Zhong,Jingdong Chen,Ming Yang,Yongjun Zhang
机构: Wuhan University (武汉大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
Abstract:Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of approximately 2.2× at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems. Code is available at this https URL.
zh
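两阶段级联匹配(“一对多先验 → 先验内一对一”)的流程可示意如下(简化版:用 top-k 相似度代替论文中基于区域的选择性交叉注意力):

```python
import numpy as np

def cascaded_match(desc_a, desc_b, k=3):
    """级联匹配示意:第一阶段为每个 A 特征取 top-k 候选(一对多先验),
    第二阶段仅在候选内取 argmax 得到一对一匹配,
    避免在整张特征图上全局搜索(对 CasP 两阶段流程的概念简化)。"""
    sim = desc_a @ desc_b.T                       # 余弦相似度(描述子已单位化)
    matches = []
    for i in range(sim.shape[0]):
        prior = np.argpartition(-sim[i], k)[:k]   # 阶段一:one-to-many 先验
        j = prior[np.argmax(sim[i, prior])]       # 阶段二:先验内 one-to-one
        matches.append((i, int(j)))
    return matches

rng = np.random.default_rng(3)
desc_b = rng.normal(size=(10, 16))
desc_b /= np.linalg.norm(desc_b, axis=1, keepdims=True)
desc_a = desc_b[[4, 7]] + rng.normal(0, 0.05, (2, 16))  # A 中两点对应 B 的 4、7
desc_a /= np.linalg.norm(desc_a, axis=1, keepdims=True)
matches = cascaded_match(desc_a, desc_b)
```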
[CV-57] Learning-based Stage Verification System in Manual Assembly Scenarios
【速读】:该论文旨在解决工业4.0背景下,仅使用视觉传感器实现多目标与多状态装配过程高效监测的问题。传统方法依赖多种传感器或复杂硬件配置以保证高精度,但成本高昂且难以在动态工业环境中部署。解决方案的关键在于利用多个机器学习模型,通过融合相同时间戳下的状态信息,实现装配阶段的精准识别,平均准确率超过92%;同时提升错误检测与可视化能力,为操作员提供实时、可操作的指导,从而降低对昂贵硬件的依赖,增强方案在现代工业场景中的实用性。
链接: https://arxiv.org/abs/2507.17304
作者: Xingjian Zhang,Yutong Duan,Zaishu Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In the context of Industry 4.0, effective monitoring of multiple targets and states during assembly processes is crucial, particularly when constrained to using only visual sensors. Traditional methods often rely on either multiple sensor types or complex hardware setups to achieve high accuracy in monitoring, which can be cost-prohibitive and difficult to implement in dynamic industrial environments. This study presents a novel approach that leverages multiple machine learning models to achieve precise monitoring under the limitation of using a minimal number of visual sensors. By integrating state information from identical timestamps, our method detects and confirms the current stage of the assembly process with an average accuracy exceeding 92%. Furthermore, our approach surpasses conventional methods by offering enhanced error detection and visualization capabilities, providing real-time, actionable guidance to operators. This not only improves the accuracy and efficiency of assembly monitoring but also reduces dependency on expensive hardware solutions, making it a more practical choice for modern industrial applications.
zh
[CV-58] PointLAMA: Latent Attention meets Mamba for Efficient Point Cloud Pretraining
【速读】:该论文旨在解决Mamba模型在点云建模中因缺乏局部归纳偏置(local inductive bias)而导致的细粒度几何结构捕捉能力不足的问题。其解决方案的关键在于提出PointLAMA框架,该框架通过三个核心组件实现:(1) 任务感知的点云序列化方法,利用Hilbert/Trans-Hilbert空间填充曲线和轴向排序对点token进行结构对齐,以适配分类与分割任务;(2) 融合潜空间注意力(Latent Attention)与Mamba模块的混合编码器,其中轻量级Point-wise Multi-head Latent Attention (PMLA) 模块设计用于与Mamba共享潜空间特性,从而增强局部上下文建模且不牺牲整体效率;(3) 基于Mamba主干的条件扩散机制,在预训练阶段通过去噪扰动特征序列实现表示学习,无需显式点级重建。这一系列设计显著提升了点云预训练的效率与性能。
链接: https://arxiv.org/abs/2507.17296
作者: Xuanyu Lin,Xiaona Zeng,Xianwei Zheng,Xutao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mamba has recently gained widespread attention as a backbone model for point cloud modeling, leveraging a state-space architecture that enables efficient global sequence modeling with linear complexity. However, its lack of local inductive bias limits its capacity to capture fine-grained geometric structures in 3D data. To address this limitation, we propose \textbfPointLAMA, a point cloud pretraining framework that combines task-aware point cloud serialization, a hybrid encoder with integrated Latent Attention and Mamba blocks, and a conditional diffusion mechanism built upon the Mamba backbone. Specifically, the task-aware point cloud serialization employs Hilbert/Trans-Hilbert space-filling curves and axis-wise sorting to structurally align point tokens for classification and segmentation tasks, respectively. Our lightweight Latent Attention block features a Point-wise Multi-head Latent Attention (PMLA) module, which is specifically designed to align with the Mamba architecture by leveraging the shared latent space characteristics of PMLA and Mamba. This enables enhanced local context modeling while preserving overall efficiency. To further enhance representation learning, we incorporate a conditional diffusion mechanism during pretraining, which denoises perturbed feature sequences without relying on explicit point-wise reconstruction. Experimental results demonstrate that PointLAMA achieves competitive performance on multiple benchmark datasets with minimal parameter count and FLOPs, validating its effectiveness for efficient point cloud pretraining.
zh
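任务感知序列化中的轴向排序部分可以非常简洁地示意(Hilbert 空间填充曲线的实现较长,此处从略;轴优先级 order 为假设参数):

```python
def serialize_points(points, order=(2, 0, 1)):
    """轴向排序序列化示意:按给定轴优先级对点排序,
    使序列中相邻 token 在空间上也尽量相邻。
    论文对分类任务另采用 Hilbert/Trans-Hilbert 曲线,此处仅演示轴向排序。"""
    return sorted(range(len(points)),
                  key=lambda i: tuple(points[i][a] for a in order))

pts = [(0.9, 0.1, 0.0), (0.1, 0.2, 0.0), (0.5, 0.5, 1.0)]
idx = serialize_points(pts)   # 先比较 z,再 x,再 y → 顺序 [1, 0, 2]
```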
[CV-59] Fully Automated SAM for Single-source Domain Generalization in Medical Image Segmentation
【速读】:该论文旨在解决基于SAM(Segment Anything Model)的单源域泛化医学图像分割模型在临床应用中面临的两大挑战:一是SAM对领域特定专家标注提示(prompt)的高度依赖性,限制了其自动化能力;二是不良提示(如尺寸不当的边界框)会导致SAM生成错误掩码结果。解决方案的关键在于提出FA-SAM框架,其核心创新包括两个模块:1)自动生成提示的AGM分支结合浅层特征不确定性建模(SUFM)模块,用于为目标域生成高质量边界框提示,实现完全自动化分割;2)嵌入到SAM掩码解码器中的图像-提示嵌入融合(IPEF)模块,通过融合多尺度图像嵌入与提示嵌入信息,增强对目标对象全局与局部细节的捕捉能力,从而缓解不良提示带来的负面影响。
链接: https://arxiv.org/abs/2507.17281
作者: Huanli Zhuo,Leilei Ma,Haifeng Zhao,Shiwei Zhou,Dengdi Sun,Yanping Fu
机构: Anhui University (安徽大学); Anhui Provincial Key Laboratory of Multimodal Cognitive Computation (安徽省多模态认知计算重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript has been accepted for presentation at the IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC 2025) and is copyrighted by IEEE
Abstract:Although SAM-based single-source domain generalization models for medical image segmentation can mitigate the impact of domain shift on the model in cross-domain scenarios, these models still face two major challenges. First, the segmentation of SAM is highly dependent on domain-specific expert-annotated prompts, which prevents SAM from achieving fully automated medical image segmentation and therefore limits its application in clinical settings. Second, providing poor prompts (such as bounding boxes that are too small or too large) to the SAM prompt encoder can mislead SAM into generating incorrect mask results. Therefore, we propose the FA-SAM, a single-source domain generalization framework for medical image segmentation that achieves fully automated SAM. FA-SAM introduces two key innovations: an Auto-prompted Generation Model (AGM) branch equipped with a Shallow Feature Uncertainty Modeling (SUFM) module, and an Image-Prompt Embedding Fusion (IPEF) module integrated into the SAM mask decoder. Specifically, AGM models the uncertainty distribution of shallow features through the SUFM module to generate bounding box prompts for the target domain, enabling fully automated segmentation with SAM. The IPEF module integrates multiscale information from SAM image embeddings and prompt embeddings to capture global and local details of the target object, enabling SAM to mitigate the impact of poor prompts. Extensive experiments on publicly available prostate and fundus vessel datasets validate the effectiveness of FA-SAM and highlight its potential to address the above challenges.
zh
[CV-60] PolarAnything: Diffusion-based Polarimetric Image Synthesis
【速读】:该论文旨在解决现有 polarization 图像合成方法依赖大量3D资产(如形状和物理基础渲染,Physically Based Rendering, PBR)材料的问题,这些限制使得生成大规模、逼真的偏振图像变得困难。其关键解决方案是提出 PolarAnything,一种基于扩散模型(diffusion model)的生成框架,通过有效的表征策略在仅输入单张RGB图像的情况下即可生成具有物理准确性和视觉真实感的偏振图像,从而摆脱对复杂3D资产集合的依赖。
链接: https://arxiv.org/abs/2507.17268
作者: Kailong Zhang,Youwei Lyu,Heng Guo,Si Li,Zhanyu Ma,Boxin Shi
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Xiong’an Aerospace Information Research Institute (雄安航天信息研究院); State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University (北京大学计算机学院多媒体信息处理国家重点实验室); National Engineering Research Center of Visual Technology, School of Computer Science, Peking University (北京大学计算机学院视觉技术国家工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Polarization images facilitate image enhancement and 3D reconstruction tasks, but the limited accessibility of polarization cameras hinders their broader application. This gap drives the need for synthesizing photorealistic polarization images. The existing polarization simulator Mitsuba relies on a parametric polarization image formation model and requires extensive 3D assets covering shape and PBR materials, preventing it from generating large-scale photorealistic images. To address this problem, we propose PolarAnything, capable of synthesizing polarization images from a single RGB input with both photorealism and physical accuracy, eliminating the dependency on 3D asset collections. Drawing inspiration from the zero-shot performance of pretrained diffusion models, we introduce a diffusion-based generative framework with an effective representation strategy that preserves the fidelity of polarization properties. Experiments show that our model generates high-quality polarization images and supports downstream tasks like shape from polarization.
zh
[CV-61] VisionTrap: Unanswerable Questions On Visual Data
【速读】:该论文试图解决当前视觉问答(Visual Question Answering, VQA)模型在面对无法回答的问题时,缺乏识别自身知识边界能力的问题,即模型是否会错误地生成答案而非选择不回答。解决方案的关键在于构建一个名为VisionTrap的数据集,该数据集包含三类逻辑上合理但本质上不可回答的问题:(1)融合物体与动物的混合实体、(2)处于非常规或不可能场景中的对象、(3)虚构或不存在的人物。通过这一设计,研究能够系统评估VQA模型是否具备识别“不应作答”的能力,从而推动模型从盲目回答向理性拒答演进。
链接: https://arxiv.org/abs/2507.17262
作者: Asir Saadat,Syem Aziz,Shahriar Mahmud,Abdullah Ibne Masud Mahi,Sabbir Ahmed
机构: Rochester Institute of Technology (罗切斯特理工学院); Islamic University of Technology (伊斯兰科技大学); United International University (联合国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Question Answering (VQA) has been a widely studied topic, with extensive research focusing on how VLMs respond to answerable questions based on real-world images. However, there has been limited exploration of how these models handle unanswerable questions, particularly in cases where they should abstain from providing a response. This research investigates VQA performance when models are shown unrealistically generated images or asked unanswerable questions, assessing whether models recognize the limitations of their knowledge or attempt to generate incorrect answers. We introduced a dataset, VisionTrap, comprising three categories of unanswerable questions across diverse image types: (1) hybrid entities that fuse objects and animals, (2) objects depicted in unconventional or impossible scenarios, and (3) fictional or non-existent figures. The questions posed are logically structured yet inherently unanswerable, testing whether models can correctly recognize their limitations. Our findings highlight the importance of incorporating such questions into VQA benchmarks to evaluate whether models tend to answer, even when they should abstain.
zh
[CV-62] Unsupervised Exposure Correction
【速读】:该论文旨在解决当前曝光校正方法中存在的三大挑战:人工标注成对数据的劳动密集性、泛化能力有限,以及在低层计算机视觉任务中性能下降的问题。其解决方案的关键在于提出一种无监督曝光校正(Unsupervised Exposure Correction, UEC)方法,该方法无需人工标注,利用模拟图像信号处理(Image Signal Processing, ISP)流水线生成的自由配对数据进行训练,从而避免了个体风格偏差并提升了模型泛化能力;同时,研究构建了一个大规模辐射校正数据集(Radiometry Correction Dataset),专门用于强化曝光差异的学习,并设计了一种保留图像细节的变换函数,在仅使用现有最优监督方法0.01%参数量的情况下实现了更优性能,且验证了曝光校正对边缘检测等低层任务的有效性。
链接: https://arxiv.org/abs/2507.17252
作者: Ruodai Cui,Li Niu,Guosheng Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current exposure correction methods have three challenges, labor-intensive paired data annotation, limited generalizability, and performance degradation in low-level computer vision tasks. In this work, we introduce an innovative Unsupervised Exposure Correction (UEC) method that eliminates the need for manual annotations, offers improved generalizability, and enhances performance in low-level downstream tasks. Our model is trained using freely available paired data from an emulated Image Signal Processing (ISP) pipeline. This approach does not need expensive manual annotations, thereby minimizing individual style biases from the annotation and consequently improving its generalizability. Furthermore, we present a large-scale Radiometry Correction Dataset, specifically designed to emphasize exposure variations, to facilitate unsupervised learning. In addition, we develop a transformation function that preserves image details and outperforms state-of-the-art supervised methods [12], while utilizing only 0.01% of their parameters. Our work further investigates the broader impact of exposure correction on downstream tasks, including edge detection, demonstrating its effectiveness in mitigating the adverse effects of poor exposure on low-level features. The source code and dataset are publicly available at this https URL.
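论文的一个核心思路是用模拟 ISP 流水线"免费"生成曝光不同、内容一致的配对训练数据。下面给出一个极简示意(仅作说明,非原文完整的 ISP 流水线):对同一张线性 raw 图像施加不同的曝光增益(EV),再做简单的 gamma 显示变换,即可得到欠曝输入与正常曝光目标的配对样本。

```python
import numpy as np

def simulate_exposure(raw, ev, gamma=2.2):
    """以增益模拟曝光偏移(EV),再做简单 gamma 显示变换。
    这只是对"模拟 ISP 生成配对数据"思路的极简示意,
    并非论文实际使用的 ISP 流水线。"""
    gained = np.clip(raw * (2.0 ** ev), 0.0, 1.0)        # 曝光增益
    return np.clip(gained ** (1.0 / gamma), 0.0, 1.0)    # gamma 压缩

# 同一张线性 raw 图像经不同 EV 渲染,得到内容一致、曝光不同的配对样本
rng = np.random.default_rng(0)
raw = rng.random((4, 4), dtype=np.float32)               # 假想的线性 raw 图
under = simulate_exposure(raw, ev=-2.0)                  # 欠曝输入
normal = simulate_exposure(raw, ev=0.0)                  # 训练目标
```

由于配对样本由同一 raw 渲染而来,训练时不引入任何人工标注者的风格偏好,这正是该方法泛化性更好的来源之一。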
zh
[CV-63] Perceptual Classifiers: Detecting Generative Images using Perceptual Features ICCV
【速读】:该论文旨在解决AI生成图像(GenAI)的检测问题,即如何有效区分真实图像与由生成式模型(如扩散模型或GANs)合成的虚假图像。其解决方案的关键在于利用现有的图像质量评估(Image Quality Assessment, IQA)模型所具备的感知能力——这些模型能捕捉真实图像在带通统计空间中的流形结构,并将其特征空间作为判别基础。研究发现,通过在一个两层神经网络中训练IQA模型提取的特征空间,可实现跨生成模型的泛化检测性能,在保持对图像退化(如压缩、噪声等)鲁棒性的同时,达到当前最优的伪造图像识别效果。
链接: https://arxiv.org/abs/2507.17240
作者: Krishna Srikar Durbha,Asvin Kumar Venkataramanan,Rajesh Sureddi,Alan C. Bovik
机构: The University of Texas at Austin(德克萨斯大学奥斯汀分校); University of Colorado Boulder(科罗拉多大学博尔德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 6 figures, 3 tables, ICCV VQualA Workshop 2025
Abstract:Image Quality Assessment (IQA) models are employed in many practical image and video processing pipelines to reduce storage, minimize transmission costs, and improve the Quality of Experience (QoE) of millions of viewers. These models are sensitive to a diverse range of image distortions and can accurately predict image quality as judged by human viewers. Recent advancements in generative models have resulted in a significant influx of “GenAI” content on the internet. Existing methods for detecting GenAI content have progressed significantly with improved generalization performance on images from unseen generative models. Here, we leverage the capabilities of existing IQA models, which effectively capture the manifold of real images within a bandpass statistical space, to distinguish between real and AI-generated images. We investigate the generalization ability of these perceptual classifiers to the task of GenAI image detection and evaluate their robustness against various image degradations. Our results show that a two-layer network trained on the feature space of IQA models demonstrates state-of-the-art performance in detecting fake images across generative models, while maintaining significant robustness against image degradations.
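论文在 IQA 模型的特征空间上训练了一个两层网络来判别真实图像与 AI 生成图像。下面是该判别头结构的一个最小前向示意:特征维度 128、权重随机初始化均为本示意的假设,并非论文训练得到的配置。

```python
import numpy as np

rng = np.random.default_rng(0)

def two_layer_head(feat, W1, b1, W2, b2):
    """两层判别网络的前向示意:输入为 IQA 模型提取的感知特征,
    输出为"AI 生成"的概率。权重此处随机初始化,仅示意结构。"""
    h = np.maximum(0.0, feat @ W1 + b1)       # 隐层 ReLU
    logit = h @ W2 + b2                       # 输出 logit
    return 1.0 / (1.0 + np.exp(-logit))       # sigmoid 概率

d, hdim = 128, 32                             # 假设 IQA 特征维度为 128
W1, b1 = rng.normal(size=(d, hdim)) * 0.1, np.zeros(hdim)
W2, b2 = rng.normal(size=(hdim, 1)) * 0.1, np.zeros(1)
feat = rng.normal(size=(4, d))                # 4 张图的 IQA 特征(假设已提取)
prob = two_layer_head(feat, W1, b1, W2, b2)   # 形状 (4, 1)
```

关键在于判别能力来自 IQA 特征本身对真实图像带通统计流形的刻画,判别头只需很小的容量。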
zh
[CV-64] MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training
【速读】:该论文旨在解决当前基础模型(foundation model)在医学图像分析中仅依赖配对图像-文本数据或未配对图像数据进行预训练时,难以学习到更丰富、更全面图像特征的问题。其解决方案的关键在于提出了一种半监督视觉语言预训练框架MaskedCLIP,通过引入桥接Transformer将掩码图像建模(masked image modeling)与对比语言-图像预训练(contrastive language-image pre-training, CLIP)的特征空间对齐,并设计掩码知识蒸馏损失,实现两个特征空间之间的语义知识交互与互补,从而有效融合配对与未配对图像数据,提升基础模型的泛化能力与下游任务性能。
链接: https://arxiv.org/abs/2507.17239
作者: Lei Zhu,Jun Zhou,Rick Siow Mong Goh,Yong Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to MedAGI 2025 (Oral)
Abstract:Foundation models have recently gained tremendous popularity in medical image analysis. State-of-the-art methods leverage either paired image-text data via vision-language pre-training or unpaired image data via self-supervised pre-training to learn foundation models with generalizable image features to boost downstream task performance. However, learning foundation models exclusively on either paired or unpaired image data limits their ability to learn richer and more comprehensive image features. In this paper, we investigate a novel task termed semi-supervised vision-language pre-training, aiming to fully harness the potential of both paired and unpaired image data for foundation model learning. To this end, we propose MaskedCLIP, a synergistic masked image modeling and contrastive language-image pre-training framework for semi-supervised vision-language pre-training. The key challenge in combining paired and unpaired image data for learning a foundation model lies in the incompatible feature spaces derived from these two types of data. To address this issue, we propose to connect the masked feature space with the CLIP feature space with a bridge transformer. In this way, the more semantically specific CLIP features can benefit from the more general masked features for semantic feature extraction. We further propose a masked knowledge distillation loss to distill semantic knowledge of original image features in CLIP feature space back to the predicted masked image features in masked feature space. With this mutually interactive design, our framework effectively leverages both paired and unpaired image data to learn more generalizable image features for downstream tasks. Extensive experiments on retinal image analysis demonstrate the effectiveness and data efficiency of our method.
zh
[CV-65] Dataset Distillation as Data Compression: A Rate-Utility Perspective ICCV2025
【速读】:该论文旨在解决大规模机器学习中数据集与模型规模不断增长所带来的计算和存储开销问题,其核心挑战在于如何在有限存储预算下同时优化合成样本的压缩率(rate)与任务性能(utility)。解决方案的关键在于提出一种联合率-效用优化方法:将合成样本参数化为由极轻量级网络解码的可优化潜在编码(latent codes),以量化潜变量的香农熵作为压缩率度量,并引入任意现有蒸馏损失函数作为效用指标,通过拉格朗日乘子平衡二者关系;同时设计比特/类(bits per class, bpc)这一精确存储度量标准,统一考虑样本、标签及解码器参数的成本,从而实现跨方法公平比较。实验表明,该方法在CIFAR-10、CIFAR-100和ImageNet-128上相比标准蒸馏方法实现最高达170倍的压缩比且保持相当精度。
链接: https://arxiv.org/abs/2507.17221
作者: Youneng Bao,Yiping Liu,Zhuo Chen,Yongsheng Liang,Mu Li,Kede Ma
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); Peng Cheng Laboratory (鹏城实验室); City University of Hong Kong (香港城市大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025
Abstract:Driven by the “scale-is-everything” paradigm, modern machine learning increasingly demands ever-larger datasets and models, yielding prohibitive computational and storage requirements. Dataset distillation mitigates this by compressing an original dataset into a small set of synthetic samples, while preserving its full utility. Yet, existing methods either maximize performance under fixed storage budgets or pursue suitable synthetic data representations for redundancy removal, without jointly optimizing both objectives. In this work, we propose a joint rate-utility optimization method for dataset distillation. We parameterize synthetic samples as optimizable latent codes decoded by extremely lightweight networks. We estimate the Shannon entropy of quantized latents as the rate measure and plug any existing distillation loss as the utility measure, trading them off via a Lagrange multiplier. To enable fair, cross-method comparisons, we introduce bits per class (bpc), a precise storage metric that accounts for sample, label, and decoder parameter costs. On CIFAR-10, CIFAR-100, and ImageNet-128, our method achieves up to 170× greater compression than standard distillation at comparable accuracy. Across diverse bpc budgets, distillation losses, and backbone architectures, our approach consistently establishes better rate-utility trade-offs.
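论文提出的 bpc(bits per class)度量把样本、标签与解码器参数的比特开销统一摊到每个类别上,便于跨方法公平比较存储成本。按其定义可以直接写成如下算术(各项比特数如何估计——例如量化潜变量的香农熵——以原文为准):

```python
def bits_per_class(sample_bits, label_bits, decoder_param_bits, num_classes):
    """bpc 存储度量的按定义实现:
    (样本比特 + 标签比特 + 解码器参数比特) / 类别数。
    各项比特开销的具体估计方式以论文为准,此处仅做算术示意。"""
    total_bits = sample_bits + label_bits + decoder_param_bits
    return total_bits / num_classes

# 例:10 类数据集,样本 8000 bit、标签 200 bit、解码器 1800 bit
bpc = bits_per_class(8000, 200, 1800, 10)   # 每类 1000 bit
```

把解码器参数也计入分子,避免了"把存储开销藏进解码网络"造成的不公平比较。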
zh
[CV-66] PIG-Nav: Key Insights for Pretrained Image Goal Navigation Models
【速读】:该论文旨在解决视觉导航模型在复杂环境中的泛化能力不足与零样本迁移性能有限的问题,特别是在多样场景下实现高效、鲁棒的图像目标导航(Image-Goal Navigation)。其解决方案的关键在于两个方面:一是提出采用早期融合(early-fusion)网络结构,结合预训练Vision Transformer(ViT)编码器对视觉观测与目标图像进行特征级融合,从而提升模型对多模态信息的理解能力;二是引入合适的辅助任务以增强全局导航表征学习,显著改善模型在未见环境中的导航表现。此外,论文还设计了一种高效的数据预处理流水线用于大规模游戏视频数据标注,通过扩充开放数据集提升模型性能,最终在多个仿真和真实环境中实现了显著优于现有视觉导航基础模型的零样本与微调性能。
链接: https://arxiv.org/abs/2507.17220
作者: Jiansong Wan,Chengming Zhou,Jinkua Liu,Xiangge Huang,Xiaoyu Chen,Xiaohan Yi,Qisen Yang,Baiting Zhu,Xin-Qiang Cai,Lixing Liu,Rushuai Yang,Chuheng Zhang,Sherif Abdelfattah,Hayong Shin,Pushi Zhang,Li Zhao,Jiang Bian
机构: Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Recent studies have explored pretrained (foundation) models for vision-based robotic navigation, aiming to achieve generalizable navigation and positive transfer across diverse environments while enhancing zero-shot performance in unseen settings. In this work, we introduce PIG-Nav (Pretrained Image-Goal Navigation), a new approach that further investigates pretraining strategies for vision-based navigation models and contributes in two key areas. Model-wise, we identify two critical design choices that consistently improve the performance of pretrained navigation models: (1) integrating an early-fusion network structure to combine visual observations and goal images via appropriately pretrained Vision Transformer (ViT) image encoder, and (2) introducing suitable auxiliary tasks to enhance global navigation representation learning, thus further improving navigation performance. Dataset-wise, we propose a novel data preprocessing pipeline for efficiently labeling large-scale game video datasets for navigation model training. We demonstrate that augmenting existing open navigation datasets with diverse gameplay videos improves model performance. Our model achieves an average improvement of 22.6% in zero-shot settings and a 37.5% improvement in fine-tuning settings over existing visual navigation foundation models in two complex simulated environments and one real-world environment. These results advance the state-of-the-art in pretrained image-goal navigation models. Notably, our model maintains competitive performance while requiring significantly less fine-tuning data, highlighting its potential for real-world deployment with minimal labeled supervision.
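论文指出早期融合(在送入预训练 ViT 编码器之前融合观测与目标图像)是提升导航模型性能的关键设计之一。最直接的一种早期融合输入构造是沿通道维拼接,下面仅作示意(论文网络的具体融合层次与方式以原文为准):

```python
import numpy as np

def early_fusion_input(obs, goal):
    """早期融合输入构造的示意:将当前观测与目标图像
    沿通道维拼接后再送入编码器。这是常见的一种实现方式,
    不保证与论文的网络结构完全一致。"""
    assert obs.shape == goal.shape              # 均为 (H, W, 3)
    return np.concatenate([obs, goal], axis=-1)  # (H, W, 6)

fused = early_fusion_input(np.zeros((8, 8, 3)), np.ones((8, 8, 3)))
```

与晚期融合(各自编码后再拼接特征)相比,早期融合让编码器从底层起就能对比观测与目标的差异。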
zh
[CV-67] A Low-Cost Machine Learning Approach for Timber Diameter Estimation
【速读】:该论文旨在解决木材加工行业中对木材种类和厚度识别效率低、准确性差的问题,传统依赖人工的方法存在速度慢、一致性差及易出错等缺陷,尤其在处理大批量木材时尤为明显。解决方案的关键在于构建一个轻量级、低成本的机器学习框架,利用标准RGB图像在真实工业环境(如木材交付时的厂房内)中进行木材圆木直径估计;其核心创新是采用YOLOv5目标检测算法,在公开数据集TimberSeg 1.0上微调训练,通过边界框尺寸估算木材厚度,无需昂贵传感器或受控环境,且在有限计算资源下实现了mAP@0.5达0.64的可靠检测性能,具备良好的可扩展性和现场部署潜力。
链接: https://arxiv.org/abs/2507.17219
作者: Fatemeh Hasanzadeh Fard,Sanaz Hasanzadeh Fard,Mehdi Jonoobi
机构: University of Tehran (德黑兰大学); Michigan State University (密歇根州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The wood processing industry, particularly in facilities such as sawmills and MDF production lines, requires accurate and efficient identification of species and thickness of the wood. Although traditional methods rely heavily on expert human labor, they are slow, inconsistent, and prone to error, especially when processing large volumes. This study focuses on practical and cost-effective machine learning frameworks that automate the estimation of timber log diameter using standard RGB images captured under real-world working conditions. We employ the YOLOv5 object detection algorithm, fine-tuned on a public dataset (TimberSeg 1.0), to detect individual timber logs and estimate thickness through bounding-box dimensions. Unlike previous methods that require expensive sensors or controlled environments, this model is trained on images taken in typical industrial sheds during timber delivery. Experimental results show that the model achieves a mean Average Precision (mAP@0.5) of 0.64, demonstrating reliable log detection even with modest computing resources. This lightweight, scalable solution holds promise for practical integration into existing workflows, including on-site inventory management and preliminary sorting, particularly in small and medium-sized operations.
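论文通过 YOLOv5 检测框的尺寸估算木材直径。从检测框到物理直径还需要一个像素-物理尺寸的换算,下面给出一个假设性的示意:假定可通过场景中已知尺寸的参照物标定出 cm_per_pixel(该标定方式是本示意的假设,论文未必采用同一做法)。

```python
def estimate_diameter_cm(bbox, cm_per_pixel):
    """由检测框估算圆木直径的示意:圆木端面近似为圆,
    取框的短边乘以像素-厘米比例。cm_per_pixel 需另行标定,
    这是本示意的假设。"""
    x1, y1, x2, y2 = bbox
    short_side_px = min(x2 - x1, y2 - y1)   # 短边对非正对相机的端面更稳
    return short_side_px * cm_per_pixel

# 例:标定得 0.05 cm/px,检测框 (100, 80, 500, 460),短边 380 px
d = estimate_diameter_cm((100, 80, 500, 460), 0.05)   # 380 * 0.05 = 19.0 cm
```

取短边而非长边,可在圆木端面略微倾斜时减少高估。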
zh
[CV-68] VBCD: A Voxel-Based Framework for Personalized Dental Crown Design
【速读】:该论文旨在解决基于口内扫描数据设计修复性牙冠(restorative dental crowns)过程中,牙科技师工作量大、效率低的问题。其解决方案的关键在于提出了一种基于体素(voxel)的自动化牙冠设计框架(VBCD),该框架首先从体素化的口内扫描数据中生成初始粗略牙冠,再通过引入距离感知监督机制的精细化模块提升精度与质量;同时,在训练阶段采用曲率与边缘线惩罚损失函数(Curvature and Margin line Penalty Loss, CMPL)以增强生成牙冠与牙龈边缘线(margin line)的对齐度,并结合FDI牙齿编号系统的空间位置提示(positional prompt)进一步提高生成牙冠的准确性。
链接: https://arxiv.org/abs/2507.17205
作者: Linda Wei,Chang Liu,Wenran Zhang,Zengji Zhang,Shaoting Zhang,Hongsheng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The design of restorative dental crowns from intraoral scans is labor-intensive for dental technicians. To address this challenge, we propose a novel voxel-based framework for automated dental crown design (VBCD). The VBCD framework generates an initial coarse dental crown from voxelized intraoral scans, followed by a fine-grained refiner incorporating distance-aware supervision to improve accuracy and quality. During the training stage, we employ the Curvature and Margin line Penalty Loss (CMPL) to enhance the alignment of the generated crown with the margin line. Additionally, a positional prompt based on the FDI tooth numbering system is introduced to further improve the accuracy of the generated dental crowns. Evaluation on a large-scale dataset of intraoral scans demonstrated that our approach outperforms existing methods, providing a robust solution for personalized dental crown design.
zh
[CV-69] DesignLab: Designing Slides Through Iterative Detection and Correction
【速读】:该论文旨在解决非专家用户在设计高质量演示文稿时面临的挑战,即现有自动化工具虽能提供布局和配色方案建议,但缺乏自我优化能力,难以满足实际工作流程中对精细调整的需求。解决方案的关键在于提出DesignLab框架,通过将设计过程分解为两个角色——设计审查者(design reviewer)负责识别设计问题,设计贡献者(design contributor)负责修正这些问题——形成一个迭代循环机制。该机制使草稿在每轮迭代中逐步优化,从而实现单次生成方法无法达到的高品质输出。研究进一步利用大规模语言模型微调这两个角色,并通过引入受控扰动模拟中间版本,使审查者学会识别设计错误,贡献者学会修复策略,最终显著优于现有设计生成方法,包括商业工具。
链接: https://arxiv.org/abs/2507.17202
作者: Jooyeol Yun,Heng Wang,Yotaro Shimose,Jaegul Choo,Shingo Takamatsu
机构: Sony Group Corporation (索尼集团); Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: this https URL
Abstract:Designing high-quality presentation slides can be challenging for non-experts due to the complexity involved in navigating various design choices. Numerous automated tools can suggest layouts and color schemes, yet often lack the ability to refine their own output, which is a key aspect in real-world workflows. We propose DesignLab, which separates the design process into two roles, the design reviewer, who identifies design-related issues, and the design contributor, who corrects them. This decomposition enables an iterative loop where the reviewer continuously detects issues and the contributor corrects them, allowing a draft to be further polished with each iteration, reaching qualities that were previously unattainable. We fine-tune large language models for these roles and simulate intermediate drafts by introducing controlled perturbations, enabling the design reviewer to learn design errors and the contributor to learn how to fix them. Our experiments show that DesignLab outperforms existing design-generation methods, including a commercial tool, by embracing the iterative nature of designing, which can result in polished, professional slides.
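DesignLab 的"审查者发现问题、贡献者修正问题"迭代循环,其控制流可以写成如下骨架(review 与 fix 在论文中由微调的大语言模型扮演,此处以可替换的函数参数表示,仅示意流程):

```python
def design_loop(draft, review, fix, max_iters=5):
    """DesignLab 迭代"审查-修正"流程的骨架示意。
    review(draft) 返回问题列表(空列表表示无问题),
    fix(draft, issues) 返回修正后的草稿。
    两个角色的具体实现(微调大模型)见论文,此处仅为流程骨架。"""
    for _ in range(max_iters):
        issues = review(draft)
        if not issues:        # 审查者不再发现问题,收敛
            break
        draft = fix(draft, issues)
    return draft

# 用字符串草稿和桩函数演示循环行为
polished = design_loop(
    "bad slide",
    review=lambda d: ["contains 'bad'"] if "bad" in d else [],
    fix=lambda d, issues: d.replace("bad ", ""),
)
```

这种"检测-修正"分解使每轮只需解决当前最明显的问题,草稿随迭代逐步逼近专业质量。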
zh
[CV-70] Vec2Face for Face Dataset Generation
【速读】:该论文旨在解决合成人脸训练数据中身份一致性(identity consistency)被忽视的问题,尤其是在增加类内属性变异(intra-class attribute variation)时,如何保持类内身份的一致性以提升人脸识别(face recognition, FR)模型的性能。现有方法往往过度关注类间可分性(inter-class separability)和类内多样性,却忽略了生成图像在身份层面的稳定性。其解决方案的关键在于提出Vec2Face+,一种直接从图像特征生成人脸图像的生成模型,并通过三种策略实现高质量合成:1)采样差异显著的向量以确保类间良好分离;2)设计AttrOP算法增强通用属性变化;3)基于LoRA的姿势控制实现高效且身份保留的侧脸图像生成。该方法最终构建了VFace系列合成数据集,在多个真实测试集上超越了传统真实数据集CASIA-WebFace的性能表现,首次实现了合成数据在平均准确率上优于真实数据。
链接: https://arxiv.org/abs/2507.17192
作者: Haiyu Wu,Jaskirat Singh,Sicong Tian,Liang Zheng,Kevin W. Bowyer
机构: University of Notre Dame (圣母大学); Australian National University (澳大利亚国立大学); Indiana University South Bend (南本德印第安纳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:When synthesizing identities as face recognition training data, it is generally believed that large inter-class separability and intra-class attribute variation are essential for synthesizing a quality dataset. This belief is generally correct, and this is what we aim for. However, when increasing intra-class variation, existing methods overlook the necessity of maintaining intra-class identity consistency. To address this and generate high-quality face training data, we propose Vec2Face+, a generative model that creates images directly from image features and allows for continuous and easy control of face identities and attributes. Using Vec2Face+, we obtain datasets with proper inter-class separability and intra-class variation and identity consistency using three strategies: 1) we sample vectors sufficiently different from others to generate well-separated identities; 2) we propose an AttrOP algorithm for increasing general attribute variations; 3) we propose LoRA-based pose control for generating images with profile head poses, which is more efficient and identity-preserving than AttrOP. Our system generates VFace10K, a synthetic face dataset with 10K identities, which allows an FR model to achieve state-of-the-art accuracy on seven real-world test sets. Scaling the size to 4M and 12M images, the corresponding VFace100K and VFace300K datasets yield higher accuracy than the real-world training dataset, CASIA-WebFace, on five real-world test sets. This is the first time a synthetic dataset beats CASIA-WebFace in average accuracy. In addition, we find that only 1 out of 11 synthetic datasets outperforms random guessing (i.e., 50%) in twin verification and that models trained with synthetic identities are more biased than those trained with real identities. Both are important aspects for future investigation.
zh
[CV-71] Asymmetric Lesion Detection with Geometric Patterns and CNN-SVM Classification
【速读】:该论文旨在解决皮肤病变形状不对称性识别难题,尤其针对临床实践中难以被非专家准确判断的黑色素瘤诊断标准中的不对称性特征。其解决方案的关键在于提出一种基于监督学习的图像处理算法,用于分析病变几何模式,并结合预训练卷积神经网络(CNN)提取形态、颜色与纹理特征,进而训练多类支持向量机(SVM)分类器进行病变形状分类(不对称、半对称和对称)。该方法在几何特征检测中达到99.00%的检测率,在CNN特征基础上实现94% Kappa评分、95%宏F1分数和97%加权F1分数,显著优于现有文献中的最优方法。
链接: https://arxiv.org/abs/2507.17185
作者: M. A. Rasel,Sameem Abdul Kareem,Zhenli Kwan,Nik Aimee Azizah Faheem,Winn Hui Han,Rebecca Kai Jan Choong,Shin Shen Yong,Unaizah Obaidellah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted version. Published in Computers in Biology and Medicine, Volume 179, 2024. DOI: https://doi.org/10.1016/j.compbiomed.2024.108851
Abstract:In dermoscopic images, which allow visualization of surface skin structures not visible to the naked eye, lesion shape offers vital insights into skin diseases. In clinically practiced methods, asymmetric lesion shape is one of the criteria for diagnosing melanoma. Initially, we labeled data for a non-annotated dataset with symmetrical information based on clinical assessments. Subsequently, we propose a supporting technique, a supervised learning image processing algorithm, to analyze the geometrical pattern of lesion shape, aiding non-experts in understanding the criteria of an asymmetric lesion. We then utilize a pre-trained convolutional neural network (CNN) to extract shape, color, and texture features from dermoscopic images for training a multiclass support vector machine (SVM) classifier, outperforming state-of-the-art methods from the literature. In the geometry-based experiment, we achieved a 99.00% detection rate for dermatological asymmetric lesions. In the CNN-based experiment, the best performance is found with 94% Kappa Score, 95% Macro F1-score, and 97% Weighted F1-score for classifying lesion shapes (Asymmetric, Half-Symmetric, and Symmetric).
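论文用几何模式分析病变形状的对称性(ABCD 诊断标准中的 A)。一个常用的简化判据是:将病变二值掩码绕对称轴翻转后与自身比较重叠度,重叠越低越不对称。下面的示意沿图像垂直中轴翻转;原算法对轴的选取(如病变主轴)及分级方式以论文为准。

```python
import numpy as np

def asymmetry_score(mask):
    """形状不对称性的简化几何判据示意:二值掩码左右翻转后
    与自身求 IoU,返回 1 - IoU(越大越不对称)。
    论文的几何模式分析更精细,此处仅说明思路。"""
    flipped = mask[:, ::-1]                       # 绕垂直中轴翻转
    inter = np.logical_and(mask, flipped).sum()
    union = np.logical_or(mask, flipped).sum()
    return 1.0 - inter / max(union, 1)

sym = np.zeros((5, 5), dtype=bool); sym[1:4, 1:4] = True    # 居中对称方块
asym = np.zeros((5, 5), dtype=bool); asym[1:4, 0:2] = True  # 偏向一侧
```

实际使用时应先将掩码按病变主轴旋转对齐,再做翻转比较,否则平移和旋转都会虚增不对称分数。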
zh
[CV-72] Hierarchical Fusion and Joint Aggregation: A Multi-Level Feature Representation Method for AIGC Image Quality Assessment
【速读】:该论文旨在解决AI生成内容(AIGC)质量评估中因多维度挑战导致的难题,尤其是现有方法仅依赖单一层次视觉特征,难以捕捉AIGC图像中的复杂失真问题。其解决方案的关键在于提出一种多层级视觉表征范式,包含三个阶段:多层级特征提取、分层融合与联合聚合;在此基础上设计了两种网络:MGLF-Net通过双CNN与Transformer骨干网络提取互补的局部和全局特征,用于感知质量评估;MPEF-Net则在每个特征层级嵌入文本提示语义信息,增强图文对应关系建模,最终实现更精准的AIGC质量评价。
链接: https://arxiv.org/abs/2507.17182
作者: Linghe Meng,Jiarun Song
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The quality assessment of AI-generated content (AIGC) faces multi-dimensional challenges, that span from low-level visual perception to high-level semantic understanding. Existing methods generally rely on single-level visual features, limiting their ability to capture complex distortions in AIGC images. To address this limitation, a multi-level visual representation paradigm is proposed with three stages, namely multi-level feature extraction, hierarchical fusion, and joint aggregation. Based on this paradigm, two networks are developed. Specifically, the Multi-Level Global-Local Fusion Network (MGLF-Net) is designed for the perceptual quality assessment, extracting complementary local and global features via dual CNN and Transformer visual backbones. The Multi-Level Prompt-Embedded Fusion Network (MPEF-Net) targets Text-to-Image correspondence by embedding prompt semantics into the visual feature fusion process at each feature level. The fused multi-level features are then aggregated for final evaluation. Experiments on benchmarks demonstrate outstanding performance on both tasks, validating the effectiveness of the proposed multi-level visual assessment paradigm.
zh
[CV-73] Multi-Scale PCB Defect Detection with YOLOv8 Network Improved via Pruning and Lightweight Network
【速读】:该论文旨在解决高密度印刷电路板(PCB)缺陷检测中传统模型难以兼顾检测精度与计算成本的问题,尤其针对微小缺陷的高精度实时检测需求。解决方案的关键在于提出一种基于YOLOv8的多尺度改进方法,通过“小目标敏感策略、网络轻量化和自适应剪枝”的综合策略实现性能优化:首先在骨干网络中引入参数更少的Ghost-HGNetv2结构以提取多层次语义特征;其次在颈部网络集成轻量级C2f-Faster模块增强多尺度特征融合能力;再次设计新型GCDetect检测头,使边界框预测与类别分类共享GroupConv权重,显著减少参数量的同时保持检测精度;此外,采用改进的Inner-MPDIoU边界损失函数提升微小目标的定位准确性;最后通过优化的自适应剪枝率进一步降低模型复杂度。实验表明,该方法在公开PCB缺陷数据集上mAP₀.₅达到99.32%,较YOLOv8n提升10.13%。
链接: https://arxiv.org/abs/2507.17176
作者: Li Pingzhen,Xu Sheng,Chen Jing,Su Chengyue
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the high density of printed circuit board (PCB) designs and high production speeds, traditional PCB defect detection models struggle to balance accuracy and computational cost, and cannot meet the requirements of high-accuracy, real-time detection of tiny defects. Therefore, in this paper, a multi-scale PCB defect detection method is built on YOLOv8 using a comprehensive strategy of tiny-target sensitivity, network lightweighting, and adaptive pruning, improving detection speed and accuracy by optimizing the backbone network, the neck network, the detection head, the loss function, and the adaptive pruning rate. Firstly, a Ghost-HGNetv2 structure with fewer parameters is used in the backbone network, and multilevel features are used to extract image semantic features to discover accurate defects. Secondly, we integrate C2f-Faster, with a small number of parameters, in the neck section to enhance the ability of multi-level feature fusion. Next, in the head part, we design a new GCDetect detection head, which allows the prediction of bounding boxes and categories to share the weights of GroupConv, and uses a small number of grouped convolutions to accomplish the regression and classification tasks, significantly reducing the number of parameters while maintaining detection accuracy. We also design the Inner-MPDIoU boundary loss function to improve the detection and localization of tiny targets. Finally, the model was pruned with an optimized adaptive pruning rate to further reduce its complexity. Experimental results show that the model exhibits advantages in terms of accuracy and speed. On the publicly available PCB defect dataset, mAP@0.5 reaches 99.32% and mAP@0.5:0.9 reaches 75.18%, which is 10.13% higher than YOLOv8n.
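论文最后一步是按优化后的自适应剪枝率压缩模型。剪枝率确定后,最常见的做法之一是全局幅值剪枝:将绝对值最小的给定比例权重置零。下面仅示意这一步(论文如何自适应地确定剪枝率以原文为准,且结构化剪枝与此逐元素示意不同):

```python
import numpy as np

def prune_by_rate(weights, rate):
    """按给定剪枝率做幅值剪枝的示意:将绝对值最小的 rate 比例
    权重置零。论文的自适应剪枝率确定方式与剪枝粒度以原文为准。"""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * rate)
    if k == 0:
        return weights.copy()
    thresh = np.partition(flat, k - 1)[k - 1]       # 第 k 小的幅值
    return np.where(np.abs(weights) <= thresh, 0.0, weights)

w = np.array([0.1, -0.2, 0.3, -0.4])
pruned = prune_by_rate(w, 0.5)   # 置零幅值最小的一半权重
```

剪枝率越高模型越小越快,但超过某个阈值后精度开始明显下降,这正是"自适应"选择剪枝率的动机。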
zh
[CV-74] DOOMGAN:High-Fidelity Dynamic Identity Obfuscation Ocular Generative Morphing
【速读】:该论文旨在解决可见光谱眼部生物特征(ocular biometrics)在面临形态攻击(morphing attacks)时的安全性问题,这类攻击通过融合多个个体的特征生成合成生物特征,严重威胁生物识别系统的完整性。当前对近红外虹膜和人脸的形态攻击研究较为充分,但可见光眼部数据的形态攻击仍缺乏系统探索。解决方案的关键在于提出DOOMGAN模型,其核心创新包括:基于关键点驱动的可见眼部解剖结构编码、注意力机制引导的逼真形态合成,以及多维度损失函数的动态加权策略,从而在复杂非受控条件下保持虹膜边界和周边纹理等细节特征,显著提升攻击成功率与合成质量。
链接: https://arxiv.org/abs/2507.17158
作者: Bharath Krishnamurthy,Ajita Rattani
机构: University of North Texas (北德克萨斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IJCB 2025 (IEEE/IAPR International Joint Conference on Biometrics). 11 pages with references, 8-page main paper with 4 figures and 4 tables. Includes 6 pages of supplementary material with 3 additional figures and 3 tables. Code is available at the official lab repository: this https URL and the author’s repository: this https URL
Abstract:Ocular biometrics in the visible spectrum have emerged as a prominent modality due to their high accuracy, resistance to spoofing, and non-invasive nature. However, morphing attacks, synthetic biometric traits created by blending features from multiple individuals, threaten biometric system integrity. While extensively studied for near-infrared iris and face biometrics, morphing in visible-spectrum ocular data remains underexplored. Simulating such attacks demands advanced generation models that handle uncontrolled conditions while preserving detailed ocular features like iris boundaries and periocular textures. To address this gap, we introduce DOOMGAN, that encompasses landmark-driven encoding of visible ocular anatomy, attention-guided generation for realistic morph synthesis, and dynamic weighting of multi-faceted losses for optimized convergence. DOOMGAN achieves over 20% higher attack success rates than baseline methods under stringent thresholds, along with 20% better elliptical iris structure generation and 30% improved gaze consistency. We also release the first comprehensive ocular morphing dataset to support further research in this domain.
zh
[CV-75] UNICE: Training A Universal Image Contrast Enhancer
【速读】:该论文旨在解决现有图像对比度增强方法在不同任务间泛化能力差的问题,尤其是当模型从一个特定任务(如低光或背光增强)迁移到另一个任务时性能显著下降的现象。其核心挑战在于如何构建一个通用且具备强泛化能力的对比度增强模型,而无需为每个任务单独标注大量数据。解决方案的关键在于:首先观察到多种对比度增强任务的本质共性均涉及曝光与对比度调整,并由此提出利用高动态范围(HDR)输入作为统一处理基础;随后通过收集46,928张HDR原始图像并合成多曝光序列(MES)及伪sRGB真值图像,训练两个网络——第一个网络从单张sRGB图像生成MES,第二个网络将MES融合为增强图像,从而实现无需人工标注的端到端训练。该方法被称为UNiversal Image Contrast Enhancer (UNICE),在多个无参考图像质量指标上优于现有方法甚至人工标注真值,验证了其强大的跨任务和跨数据集泛化能力。
链接: https://arxiv.org/abs/2507.17157
作者: Ruodai Cui,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing image contrast enhancement methods are typically designed for specific tasks such as under-/over-exposure correction, low-light and backlit image enhancement, etc. The learned models, however, exhibit poor generalization performance across different tasks, even across different datasets of a specific task. It is important to explore whether we can learn a universal and generalized model for various contrast enhancement tasks. In this work, we observe that the common key factor of these tasks lies in the need of exposure and contrast adjustment, which can be well-addressed if high-dynamic range (HDR) inputs are available. We hence collect 46,928 HDR raw images from public sources, and render 328,496 sRGB images to build multi-exposure sequences (MES) and the corresponding pseudo sRGB ground-truths via multi-exposure fusion. Consequently, we train a network to generate an MES from a single sRGB image, followed by training another network to fuse the generated MES into an enhanced image. Our proposed method, namely UNiversal Image Contrast Enhancer (UNICE), is free of costly human labeling. However, it demonstrates significantly stronger generalization performance than existing image contrast enhancement methods across and within different tasks, even outperforming manually created ground-truths in multiple no-reference image quality metrics. The dataset, code and model are available at this https URL.
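论文的第二个网络将多曝光序列(MES)融合为增强图像。作为对"MES→增强图像"这一步的直观参照,下面给出经典的逐像素"良好曝光度"加权融合示意(类似 Mertens 曝光融合的权重项);论文实际使用的是学习得到的融合网络,并非此公式。

```python
import numpy as np

def fuse_mes(seq, sigma=0.2):
    """多曝光序列融合的经典加权示意:像素值越接近 0.5
    (曝光适中)权重越大,逐像素加权平均。
    论文的融合由网络学习得到,此处仅说明流程。"""
    seq = np.stack(seq)                                   # (N, H, W)
    w = np.exp(-((seq - 0.5) ** 2) / (2 * sigma ** 2))    # 良好曝光度权重
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)         # 逐像素归一化
    return (w * seq).sum(axis=0)

mes = [np.full((2, 2), v) for v in (0.1, 0.5, 0.9)]       # 假想的三曝光序列
fused = fuse_mes(mes)
```

第一个网络从单张 sRGB 图像生成 MES,第二个网络完成上述融合,两者串联即得到端到端的 UNICE 流程。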
zh
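The multi-exposure fusion step behind UNICE's pseudo ground-truth construction can be illustrated with a classical well-exposedness weighting. The paper learns this fusion with a network; the Gaussian weighting and function names below are illustrative assumptions, not the authors' implementation:

```python
# Classical multi-exposure fusion sketch: pixels near mid-gray get the
# highest weight, and the fused pixel is the normalized weighted sum
# across exposures. Toy 3-pixel images stand in for real sRGB renders.
import math

def well_exposedness(v, sigma=0.2):
    """Weight a pixel value in [0, 1] by its closeness to mid-gray 0.5."""
    return math.exp(-((v - 0.5) ** 2) / (2 * sigma ** 2))

def fuse_mes(exposures):
    """Fuse a multi-exposure sequence (lists of pixel values in [0, 1])
    into one image by normalized well-exposedness weighting."""
    n_px = len(exposures[0])
    fused = []
    for i in range(n_px):
        weights = [well_exposedness(img[i]) for img in exposures]
        total = sum(weights) or 1.0
        fused.append(sum(w * img[i] for w, img in zip(weights, exposures)) / total)
    return fused

# A 3-pixel toy image at three simulated exposures.
mes = [[0.05, 0.10, 0.20],   # under-exposed
       [0.30, 0.50, 0.70],   # mid-exposed
       [0.80, 0.95, 1.00]]   # over-exposed
result = fuse_mes(mes)
```

Because the mid-exposed frame sits closest to mid-gray, it dominates the fused output, which is the behavior the pseudo ground truths are meant to capture.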
[CV-76] ScSAM: Debiasing Morphology and Distributional Variability in Subcellular Semantic Segmentation ECAI
【Summary】: This paper addresses the biased feature learning that learning-based organelle segmentation models suffer when confronted with the high morphological and distributional variability of subcellular components. Existing methods typically rely on a single mapping relationship and overlook feature diversity, inducing training bias; and although the Segment Anything Model (SAM) offers rich feature representations, its use in subcellular scenarios is limited by spurious feature learning caused by inconsistencies in the label space and by its neglect of fine-grained spatial detail, making subtle structural changes and skewed data distributions hard to handle. The key to the solution, ScSAM, is fusing pre-trained SAM features with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate the training bias caused by data imbalance; concretely, a feature alignment and fusion module maps the different embeddings into a shared feature space, and a cosine similarity matrix-based class prompt encoder activates class-specific features to recognize subcellular categories.
Link: https://arxiv.org/abs/2507.17149
Authors: Bo Fang, Jianan Fan, Dongnan Liu, Hang Chang, Gerald J. Shami, Filip Braet, Weidong Cai
Institutions: University of Sydney; Lawrence Berkeley National Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by 28th European Conference on Artificial Intelligence (ECAI)
Abstract:The significant morphological and distributional variability among subcellular components poses a long-standing challenge for learning-based organelle segmentation models, significantly increasing the risk of biased feature learning. Existing methods often rely on single mapping relationships, overlooking feature diversity and thereby inducing biased training. Although the Segment Anything Model (SAM) provides rich feature representations, its application to subcellular scenarios is hindered by two key challenges: (1) The variability in subcellular morphology and distribution creates gaps in the label space, leading the model to learn spurious or biased features. (2) SAM focuses on global contextual understanding and often ignores fine-grained spatial details, making it challenging to capture subtle structural alterations and cope with skewed data distributions. To address these challenges, we introduce ScSAM, a method that enhances feature robustness by fusing pre-trained SAM with Masked Autoencoder (MAE)-guided cellular prior knowledge to alleviate training bias from data imbalance. Specifically, we design a feature alignment and fusion module to align pre-trained embeddings to the same feature space and efficiently combine different representations. Moreover, we present a cosine similarity matrix-based class prompt encoder to activate class-specific features to recognize subcellular categories. Extensive experiments on diverse subcellular image datasets demonstrate that ScSAM outperforms state-of-the-art methods.
zh
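The cosine-similarity class prompt described above can be sketched in a few lines: compare each feature vector against per-class prototype embeddings and activate the best-matching subcellular class. Prototypes, features, and function names here are toy stand-ins, not ScSAM's actual code:

```python
# Cosine-similarity class prompt sketch: a similarity matrix between
# features and class prototypes selects which class-specific features
# to activate. Vectors are tiny toy embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def class_prompt(features, prototypes):
    """Return the feature-vs-prototype similarity matrix and, per feature,
    the index of the best-matching subcellular class."""
    sims = [[cosine(f, p) for p in prototypes] for f in features]
    labels = [row.index(max(row)) for row in sims]
    return sims, labels

prototypes = [[1.0, 0.0, 0.0],   # e.g. a mitochondrion prototype
              [0.0, 1.0, 0.0]]   # e.g. a nucleus prototype
features = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
sims, labels = class_prompt(features, prototypes)
```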
[CV-77] SADA: Stability-guided Adaptive Diffusion Acceleration ICML2025 ICML
【Summary】: This paper targets the high computational cost of diffusion models in generative tasks, caused by their iterative sampling process and quadratic attention cost, along with the significant fidelity drop that existing training-free acceleration strategies incur relative to the original baseline even as they reduce per-step computation. The key to the solution is Stability-guided Adaptive Diffusion Acceleration (SADA), which unifies step-wise and token-wise sparsity decisions under a single stability criterion to accelerate ODE-based generative models (diffusion and flow matching). SADA adaptively allocates sparsity along the sampling trajectory, accommodating the prompt-dependent variation in denoising paths, and introduces principled approximation schemes that leverage gradient information from the numerical ODE solver, achieving ≥ 1.8× speedups with minimal fidelity degradation (LPIPS ≤ 0.10, FID ≤ 4.5) and clearly outperforming prior methods.
Link: https://arxiv.org/abs/2507.17135
Authors: Ting Jiang, Yixiao Wang, Hancheng Ye, Zishan Shao, Jingwei Sun, Jingyang Zhang, Zekai Chen, Jianyi Zhang, Yiran Chen, Hai Li
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and published by ICML 2025. Code is available at: this https URL
Abstract:Diffusion models have achieved remarkable success in generative tasks but suffer from high computational costs due to their iterative sampling process and quadratic attention costs. Existing training-free acceleration strategies that reduce per-step computation cost, while effectively reducing sampling time, demonstrate low faithfulness compared to the original baseline. We hypothesize that this fidelity gap arises because (a) different prompts correspond to varying denoising trajectory, and (b) such methods do not consider the underlying ODE formulation and its numerical solution. In this paper, we propose Stability-guided Adaptive Diffusion Acceleration (SADA), a novel paradigm that unifies step-wise and token-wise sparsity decisions via a single stability criterion to accelerate sampling of ODE-based generative models (Diffusion and Flow-matching). For (a), SADA adaptively allocates sparsity based on the sampling trajectory. For (b), SADA introduces principled approximation schemes that leverage the precise gradient information from the numerical ODE solver. Comprehensive evaluations on SD-2, SDXL, and Flux using both EDM and DPM++ solvers reveal consistent \ge 1.8\times speedups with minimal fidelity degradation (LPIPS \leq 0.10 and FID \leq 4.5 ) compared to unmodified baselines, significantly outperforming prior methods. Moreover, SADA adapts seamlessly to other pipelines and modalities: It accelerates ControlNet without any modifications and speeds up MusicLDM by 1.8\times with \sim 0.01 spectrogram LPIPS.
zh
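The stability-guided skipping idea can be illustrated with a toy scalar sampler: when consecutive solver updates barely change, reuse the cached update instead of paying for a full model evaluation. The threshold test below is a deliberately simplified stand-in for SADA's actual stability criterion:

```python
# Toy acceleration loop: a cached update is reused whenever the step is
# judged "stable" (change below a tolerance), counting how many full
# model evaluations the run actually needs. Scalars replace real latents.

def accelerated_sampling(updates, tol=0.05):
    """updates: per-step model outputs; returns (trajectory, #full evals)."""
    x, cached, evals = 0.0, None, 0
    traj = []
    for u in updates:
        if cached is not None and abs(u - cached) < tol:
            step = cached          # stable: skip recomputation, reuse cache
        else:
            step = u               # unstable: pay for a full model call
            evals += 1
            cached = u
        x += step
        traj.append(x)
    return traj, evals

updates = [1.00, 0.99, 0.98, 0.50, 0.49]  # two stable plateaus
traj, evals = accelerated_sampling(updates)
```

Here five steps cost only two full evaluations, mirroring how trajectory-adaptive sparsity yields speedups without large deviations from the unmodified path.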
[CV-78] Robust Five-Class and binary Diabetic Retinopathy Classification Using Transfer Learning and Data Augmentation
【Summary】: This paper addresses the performance bottleneck in early diabetic retinopathy (DR) diagnosis caused by limited training data and class imbalance. The key to the solution is combining transfer learning with class-balanced data augmentation in a robust deep learning framework for both binary and five-class DR severity classification. Experiments show that pretrained convolutional architectures such as EfficientNet-B0 and ResNet34 reach high accuracy on the APTOS 2019 dataset (98.9% for the binary task) and strong AUC (99.4% binary, 94.1% five-class), validating that the strategy balances accuracy and computational efficiency and offering a scalable solution for automated DR screening in clinical settings.
Link: https://arxiv.org/abs/2507.17121
Authors: Faisal Ahmed, Mohammad Alfrad Nobel Bhuiyan
Institutions: Embry-Riddle Aeronautical University; Louisiana State University Health Sciences Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 9 pages, 1 Figure
Abstract:Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, and early diagnosis through automated retinal image analysis can significantly reduce the risk of blindness. This paper presents a robust deep learning framework for both binary and five-class DR classification, leveraging transfer learning and extensive data augmentation to address the challenges of class imbalance and limited training data. We evaluate a range of pretrained convolutional neural network architectures, including variants of ResNet and EfficientNet, on the APTOS 2019 dataset. For binary classification, our proposed model achieves a state-of-the-art accuracy of 98.9%, with a precision of 98.6%, recall of 99.3%, F1-score of 98.9%, and an AUC of 99.4%. In the more challenging five-class severity classification task, our model obtains a competitive accuracy of 84.6% and an AUC of 94.1%, outperforming several existing approaches. Our findings also demonstrate that EfficientNet-B0 and ResNet34 offer optimal trade-offs between accuracy and computational efficiency across both tasks. These results underscore the effectiveness of combining class-balanced augmentation with transfer learning for high-performance DR diagnosis. The proposed framework provides a scalable and accurate solution for DR screening, with potential for deployment in real-world clinical environments.
zh
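The class-balanced augmentation the framework relies on can be sketched as simple oversampling: duplicate minority-class samples (each duplicate would then receive a random image transform in practice) until every DR grade matches the majority class. Function and variable names are illustrative:

```python
# Class-balanced oversampling sketch: replicate minority-class samples
# until all classes reach the size of the largest class. Integers stand
# in for fundus images; real use would apply augmentations to duplicates.
import random
from collections import Counter

def balance_by_oversampling(samples, labels, seed=0):
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    out_s, out_y = list(samples), list(labels)
    for y, members in by_class.items():
        for _ in range(target - counts[y]):
            out_s.append(rng.choice(members))  # duplicate (then augment)
            out_y.append(y)
    return out_s, out_y

# Toy DR dataset: grade 0 ("no DR") dominates the severe grades.
samples = list(range(10))
labels = [0] * 6 + [1] * 3 + [4] * 1
bal_s, bal_y = balance_by_oversampling(samples, labels)
```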
[CV-79] IONext: Unlocking the Next Era of Inertial Odometry
【Summary】: This paper aims to improve localization accuracy and generalization in inertial odometry, where Transformer-based models, despite excelling at long-range dependencies, are insufficiently sensitive to local, fine-grained motion variations and lack inductive biases, and existing CNN approaches fall short in temporal modeling. The key to the solution lies in two modules: the Dual-wing Adaptive Dynamic Mixer (DADM), which dynamically generates selective weights from the input to achieve efficient multi-scale aggregation of global motion patterns and fine-grained local features, and the Spatio-Temporal Gating Unit (STGU), which selects representative, task-relevant motion features along the temporal dimension to strengthen temporal modeling. The new CNN-based backbone built on these modules, IONext, consistently outperforms state-of-the-art Transformer- and CNN-based methods on six public datasets.
Link: https://arxiv.org/abs/2507.17089
Authors: Shanshan Zhang, Siyue Wang, Tianshui Wen, Qi Zhang, Ziheng Zhou, Lingxiang Zheng, Yu Yang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:Researchers have increasingly adopted Transformer-based models for inertial odometry. While Transformers excel at modeling long-range dependencies, their limited sensitivity to local, fine-grained motion variations and lack of inherent inductive biases often hinder localization accuracy and generalization. Recent studies have shown that incorporating large-kernel convolutions and Transformer-inspired architectural designs into CNN can effectively expand the receptive field, thereby improving global motion perception. Motivated by these insights, we propose a novel CNN-based module called the Dual-wing Adaptive Dynamic Mixer (DADM), which adaptively captures both global motion patterns and local, fine-grained motion features from dynamic inputs. This module dynamically generates selective weights based on the input, enabling efficient multi-scale feature aggregation. To further improve temporal modeling, we introduce the Spatio-Temporal Gating Unit (STGU), which selectively extracts representative and task-relevant motion features in the temporal domain. This unit addresses the limitations of temporal modeling observed in existing CNN approaches. Built upon DADM and STGU, we present a new CNN-based inertial odometry backbone, named Next Era of Inertial Odometry (IONext). Extensive experiments on six public datasets demonstrate that IONext consistently outperforms state-of-the-art (SOTA) Transformer- and CNN-based methods. For instance, on the RNIN dataset, IONext reduces the average ATE by 10% and the average RTE by 12% compared to the representative model iMOT.
zh
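The temporal gating idea behind the STGU can be illustrated with a scalar sigmoid gate that scales each time step's feature by a function of that step, damping uninformative steps. The weights below are toy constants, not the paper's learned parameters:

```python
# Temporal gating sketch: g_t = sigmoid(w * x_t + b) scales step t's
# feature, so low-magnitude (uninformative) steps are suppressed while
# strong motion cues pass through nearly intact.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def temporal_gate(sequence, w=1.0, b=-2.0):
    """Apply a per-step sigmoid gate to a 1-D feature sequence."""
    return [sigmoid(w * x + b) * x for x in sequence]

# High-magnitude motion samples pass through; near-zero ones are damped.
seq = [0.1, 4.0, 0.2, 5.0]
gated = temporal_gate(seq)
```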
[CV-80] FedVLM: Scalable Personalized Vision-Language Models through Federated Learning
【Summary】: This paper addresses the difficulty of fine-tuning vision-language models (VLMs) in federated learning, where non-iid client data degrades generalization; existing parameter-efficient methods such as LoRA (Low-Rank Adaptation) reduce computational overhead but perform poorly on heterogeneous data. The key to the solution is the FedVLM framework with personalized LoRA (pLoRA), which dynamically adapts each client's LoRA parameters to its local data distribution, markedly improving local adaptation while preserving model privacy, decentralized training, and global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings and the approach's scalability and efficiency.
Link: https://arxiv.org/abs/2507.17088
Authors: Arkajyoti Mitra (1), Afia Anjum (1), Paul Agbaje (1), Mert Pesé (2), Habeeb Olufowobi (1) ((1) University of Texas at Arlington, (2) Clemson University)
Institutions: University of Texas at Arlington; Clemson University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Vision-language models (VLMs) demonstrate impressive zero-shot and few-shot learning capabilities, making them essential for several downstream tasks. However, fine-tuning these models at scale remains challenging, particularly in federated environments where data is decentralized and non-iid across clients. Existing parameter-efficient tuning methods like LoRA (Low-Rank Adaptation) reduce computational overhead but struggle with heterogeneous client data, leading to suboptimal generalization. To address these challenges, we propose FedVLM, a federated LoRA fine-tuning framework that enables decentralized adaptation of VLMs while preserving model privacy and reducing reliance on centralized training. To further tackle data heterogeneity, we introduce personalized LoRA (pLoRA), which dynamically adapts LoRA parameters to each client’s unique data distribution, significantly improving local adaptation while maintaining global model aggregation. Experiments on the RLAIF-V dataset show that pLoRA improves client-specific performance by 24.5% over standard LoRA, demonstrating superior adaptation in non-iid settings. FedVLM provides a scalable and efficient solution for fine-tuning VLMs in federated settings, advancing personalized adaptation in distributed learning scenarios.
zh
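The pLoRA mechanism can be sketched with tiny matrices: every client shares the frozen base weight W but keeps its own low-rank factors B and A, so its effective weight is W + BA. All names and shapes below are illustrative, not the FedVLM implementation:

```python
# Personalized LoRA sketch: a shared frozen base weight plus per-client
# rank-1 adapters. Nested lists stand in for real tensors.

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def effective_weight(W, A, B):
    """W: d x d frozen base; B: d x r and A: r x d are client-specific."""
    BA = matmul(B, A)
    return [[W[i][j] + BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]                          # shared frozen base
client_a = {"B": [[0.5], [0.0]], "A": [[1.0, 0.0]]}   # rank-1 adapter
client_b = {"B": [[0.0], [0.5]], "A": [[0.0, 1.0]]}   # a different client
Wa = effective_weight(W, client_a["A"], client_a["B"])
Wb = effective_weight(W, client_b["A"], client_b["B"])
```

The two clients end up with different effective weights from the same base, which is the personalization pLoRA exploits under non-iid data.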
[CV-81] SDGOCC: Semantic and Depth-Guided Birds-Eye View Transformation for 3D Multimodal Occupancy Prediction CVPR2025
【Summary】: This paper addresses the accuracy limits of multimodal 3D occupancy prediction under single-modality perception, where camera-based methods lack depth information and LiDAR-based methods struggle in occluded scenes, as well as the shortcomings of existing lightweight methods that rely mainly on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of LiDAR point clouds. The key to the solution is a novel network, SDG-OCC, with two core mechanisms: a joint semantic and depth-guided view transformation that constructs more accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization, and a fusion-to-occupancy-driven active distillation that extracts rich semantics from multimodal data and selectively transfers knowledge to image features in LiDAR-identified regions, strengthening perception of critical areas. The design significantly improves occupancy prediction accuracy and robustness while supporting real-time inference.
Link: https://arxiv.org/abs/2507.17083
Authors: Zaipeng Duan, Chenxu Dang, Xuzhong Hu, Pei An, Junfeng Ding, Jie Zhan, Yunbiao Xu, Jie Ma
Institutions: Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: accepted by CVPR2025
Abstract:Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at this https URL.
zh
[CV-82] VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM -Augmented CLIP Embeddings RECSYS2025
【Summary】: This paper addresses three core problems that existing vision-language models such as CLIP face in e-commerce recommendation systems: (1) weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, hurting retrieval performance; (2) ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and (3) domain mismatch, where generic vision-language models generalize poorly to e-commerce-specific data. The key to the solution is the VL-CLIP framework, which enhances CLIP embeddings with Visual Grounding for fine-grained visual understanding, refining image representations by localizing key products, and an LLM-based agent that generates enriched text embeddings by disambiguating product descriptions. The approach substantially improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality, raising CTR by 18.6%, add-to-cart (ATC) by 15.5%, and GMV by 4.0% across tens of millions of items on one of the largest e-commerce platforms in the U.S., validating the combination of object-aware visual grounding and LLM-enhanced text representations.
Link: https://arxiv.org/abs/2507.17080
Authors: Ramin Giahi, Kehui Yao, Sriram Kollipara, Kai Zhao, Vahid Mirjalili, Jianpeng Xu, Topojoy Biswas, Evren Korpeoglu, Kannan Achan
Institutions: Walmart Global Tech
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at RecSys 2025; DOI: this https URL
Abstract:Multimodal learning plays a critical role in e-commerce recommendation platforms today, enabling accurate recommendations and product understanding. However, existing vision-language models, such as CLIP, face key challenges in e-commerce recommendation systems: 1) Weak object-level alignment, where global image embeddings fail to capture fine-grained product attributes, leading to suboptimal retrieval performance; 2) Ambiguous textual representations, where product descriptions often lack contextual clarity, affecting cross-modal matching; and 3) Domain mismatch, as generic vision-language models may not generalize well to e-commerce-specific data. To address these limitations, we propose a framework, VL-CLIP, that enhances CLIP embeddings by integrating Visual Grounding for fine-grained visual understanding and an LLM-based agent for generating enriched text embeddings. Visual Grounding refines image representations by localizing key products, while the LLM agent enhances textual features by disambiguating product descriptions. Our approach significantly improves retrieval accuracy, multimodal retrieval effectiveness, and recommendation quality across tens of millions of items on one of the largest e-commerce platforms in the U.S., increasing CTR by 18.6%, ATC by 15.5%, and GMV by 4.0%. Additional experimental results show that our framework outperforms vision-language models, including CLIP, FashionCLIP, and GCL, in both precision and semantic alignment, demonstrating the potential of combining object-aware visual grounding and LLM-enhanced text representation for robust multimodal recommendations.
zh
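The retrieval step that VL-CLIP improves can be sketched as standard normalized-embedding matching: L2-normalize the query text embedding and each product image embedding, score by dot product, and rank. The toy embeddings below stand in for real CLIP outputs:

```python
# Cross-modal retrieval sketch: cosine scoring via dot products of
# L2-normalized embeddings, then ranking products for a text query.
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def rank_products(query_text_emb, product_image_embs):
    q = normalize(query_text_emb)
    scores = [sum(a * b for a, b in zip(q, normalize(p)))
              for p in product_image_embs]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

# The query embedding should match product 1's image embedding best.
query = [0.9, 0.1, 0.0]
products = [[0.0, 1.0, 0.2],   # product 0: e.g. blue jacket
            [1.0, 0.2, 0.0],   # product 1: e.g. red sneaker
            [0.1, 0.1, 1.0]]   # product 2: e.g. green hat
order, scores = rank_products(query, products)
```

Visual grounding and LLM enrichment aim to make these embeddings sharper so that the correct product wins this ranking more often.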
[CV-83] Few-Shot Learning in Video and 3D Object Detection: A Survey
【Summary】: This survey examines the application of few-shot learning (FSL) to video and 3D object detection, where the central challenge is recognizing novel classes from only a few annotated examples, sharply reducing costly manual labeling. The key solutions: for video object detection, techniques such as tube proposals and temporal matching networks propagate information across frames to exploit spatiotemporal structure efficiently; for 3D detection, specialized point cloud networks are combined with losses tailored to class imbalance to cope with the sparsity and lack of texture in LiDAR or depth data. Both domains emphasize integrating prototype-matching mechanisms and efficiently leveraging information across feature, temporal, and data modalities while balancing generalization against overfitting.
Link: https://arxiv.org/abs/2507.17079
Authors: Md Meftahul Ferdaus, Kendall N. Niles, Joe Tom, Mahdi Abdelguerfi, Elias Ioup
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under review in ACM Computing Surveys
Abstract:Few-shot learning (FSL) enables object detection models to recognize novel classes given only a few annotated examples, thereby reducing expensive manual data labeling. This survey examines recent FSL advances for video and 3D object detection. For video, FSL is especially valuable since annotating objects across frames is more laborious than for static images. By propagating information across frames, techniques like tube proposals and temporal matching networks can detect new classes from a couple examples, efficiently leveraging spatiotemporal structure. FSL for 3D detection from LiDAR or depth data faces challenges like sparsity and lack of texture. Solutions integrate FSL with specialized point cloud networks and losses tailored for class imbalance. Few-shot 3D detection enables practical autonomous driving deployment by minimizing costly 3D annotation needs. Core issues in both domains include balancing generalization and overfitting, integrating prototype matching, and handling data modality properties. In summary, FSL shows promise for reducing annotation requirements and enabling real-world video, 3D, and other applications by efficiently leveraging information across feature, temporal, and data modalities. By comprehensively surveying recent advancements, this paper illuminates FSL’s potential to minimize supervision needs and enable deployment across video, 3D, and other real-world applications.
zh
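The prototype matching that both surveyed domains build on can be sketched as a prototypical-network episode: average each novel class's few support embeddings into a prototype, then assign a query to the nearest prototype. The 2-D toy embeddings below stand in for detector features:

```python
# Prototype-matching sketch for a 2-way 3-shot episode: class prototypes
# are support-set means, and queries are labeled by nearest prototype.
import math

def prototype(support):
    d = len(support[0])
    return [sum(v[i] for v in support) / len(support) for i in range(d)]

def classify(query, prototypes):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    d = [dist(query, p) for p in prototypes]
    return d.index(min(d))

# Two novel classes, three support shots each.
class0 = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0]]
class1 = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0]]
protos = [prototype(class0), prototype(class1)]
pred = classify([0.8, 0.8], protos)
```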
[CV-84] Toward Scalable Video Narration: A Training-free Approach Using Multimodal Large Language Models ICCV2025
【Summary】: This paper addresses the temporal misalignment and hallucination problems common in current multimodal large language models (MLLMs) for video understanding, where descriptions generated for unfamiliar scenarios lack accuracy and temporal consistency. The key to the solution is VideoNarrator, a training-free pipeline of flexibly composable components in which off-the-shelf MLLMs and visual-language models (VLMs) serve as caption generators, context providers, or caption verifiers. The synergy among these components significantly improves the temporal precision and factual accuracy of video captions, effectively reducing hallucination and strengthening downstream tasks such as video summarization and question answering.
Link: https://arxiv.org/abs/2507.17050
Authors: Tz-Ying Wu, Tahani Trigui, Sharath Nittur Sridhar, Anand Bodas, Subarna Tripathi
Institutions: Intel
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to CVAM Workshop at ICCV 2025
Abstract:In this paper, we introduce VideoNarrator, a novel training-free pipeline designed to generate dense video captions that offer a structured snapshot of video content. These captions offer detailed narrations with precise timestamps, capturing the nuances present in each segment of the video. Despite advancements in multimodal large language models (MLLMs) for video comprehension, these models often struggle with temporally aligned narrations and tend to hallucinate, particularly in unfamiliar scenarios. VideoNarrator addresses these challenges by leveraging a flexible pipeline where off-the-shelf MLLMs and visual-language models (VLMs) can function as caption generators, context providers, or caption verifiers. Our experimental results demonstrate that the synergistic interaction of these components significantly enhances the quality and accuracy of video narrations, effectively reducing hallucinations and improving temporal alignment. This structured approach not only enhances video understanding but also facilitates downstream tasks such as video summarization and video question answering, and can be potentially extended for advertising and marketing applications.
zh
[CV-85] Controllable Hybrid Captioner for Improved Long-form Video Understanding
【Summary】: This paper tackles the difficulty of understanding long-form video, whose content is extremely dense and high-dimensional, by compressing it into textual representations that support reasoning over complex natural language queries. The key to the solution is a text-based memory: the LaViLa video captioner progressively captions short chunks of the video, a vision-language model (VLM) adds static scene descriptions to enrich the caption log, and a proposed controllable hybrid captioner alternates between action and scene captions according to special input tokens that signal detected scene changes, significantly improving captioning efficiency and question-answering capability.
Link: https://arxiv.org/abs/2507.17047
Authors: Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy
Institutions: Johns Hopkins University Applied Physics Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signals scene changes detected in the video.
zh
[CV-86] Transformer Based Building Boundary Reconstruction using Attraction Field Maps
【Summary】: This paper targets the difficulty of automatically extracting building footprints from a single satellite image, a task critical to urban planning, disaster management, and large-scale spatial analysis, yet one that traditionally relies on labor-intensive manual annotation and scales poorly. The core challenge is modeling building boundary geometry accurately enough for high-precision object representation. The key to the solution is a Graph Convolutional Network (GCN)-based method, Decoupled-PolyGCN, which imposes geometric regularity on building boundaries, integrates multi-scale and multi-resolution features, and embeds Attraction Field Maps into the network to strengthen awareness of spatial structure, improving both the accuracy and the regularity of reconstructed footprints and outperforming existing methods by 6% in AP and 10% in AR.
Link: https://arxiv.org/abs/2507.17038
Authors: Muhammad Kamran, Mohammad Moein Sheikholeslami, Andreas Wichmann, Gunho Sohn
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:In recent years, the number of remote satellites orbiting the Earth has grown significantly, streaming vast amounts of high-resolution visual data to support diverse applications across civil, public, and military domains. Among these applications, the generation and updating of spatial maps of the built environment have become critical due to the extensive coverage and detailed imagery provided by satellites. However, reconstructing spatial maps from satellite imagery is a complex computer vision task, requiring the creation of high-level object representations, such as primitives, to accurately capture the built environment. While the past decade has witnessed remarkable advancements in object detection and representation using visual data, primitives-based object representation remains a persistent challenge in computer vision. Consequently, high-quality spatial maps often rely on labor-intensive and manual processes. This paper introduces a novel deep learning methodology leveraging Graph Convolutional Networks (GCNs) to address these challenges in building footprint reconstruction. The proposed approach enhances performance by incorporating geometric regularity into building boundaries, integrating multi-scale and multi-resolution features, and embedding Attraction Field Maps into the network. These innovations provide a scalable and precise solution for automated building footprint extraction from a single satellite image, paving the way for impactful applications in urban planning, disaster management, and large-scale spatial analysis. Our model, Decoupled-PolyGCN, outperforms existing methods by 6% in AP and 10% in AR, demonstrating its ability to deliver accurate and regularized building footprints across diverse and challenging scenarios.
zh
[CV-87] StreamME: Simplify 3D Gaussian Avatar within Live Stream
【Summary】: This paper addresses the slow training, reliance on pre-cached data, and difficulty of real-time integration into downstream applications that hinder 3D avatar reconstruction. The core challenge is fast, continuous reconstruction without preprocessing while maintaining high-quality rendering, privacy protection, and communication efficiency. The key to the solution, StreamME, is an on-the-fly training strategy built on 3D Gaussian Splatting (3DGS): a geometry-only model that discards the MLPs of deformable 3DGS, greatly speeding adaptation to facial expressions, combined with a primary-point-based simplification strategy that distributes the point cloud sparsely over the facial surface, reducing the point count while maintaining rendering quality. The result is synchronous recording and reconstruction of head avatars from live video streams, suited to VR, online conferencing, and animation.
Link: https://arxiv.org/abs/2507.17029
Authors: Luchuan Song, Yang Zhou, Zhan Xu, Yi Zhou, Deepali Aneja, Chenliang Xu
Institutions: University of Rochester; Adobe Research
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 15 Figures
Abstract:We propose StreamME, a method focuses on fast 3D avatar reconstruction. The StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expression. To further ensure high efficiency in on-the-fly training, we introduced a simplification strategy based on primary points, which distributes the point clouds more sparsely across the facial surface, optimizing points number while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects the facial privacy and reduces communication bandwidth in VR system or online conference. Additionally, it can be directly applied to downstream application such as animation, toonify, and relighting. Please refer to our project page for more details: this https URL.
zh
[CV-88] Bringing Balance to Hand Shape Classification: Mitigating Data Imbalance Through Generative Models
【Summary】: This paper addresses the small size and severe class imbalance of sign language handshape datasets, which limit effective classifier training. The key to the solution is augmenting the original small, unbalanced data with high-quality synthetic images from Generative Adversarial Networks (GANs), comparing two conditioning mechanisms: ReACGAN uses an auxiliary classifier to condition generation on label information, producing realistic class-aligned images, while SPADE uses spatially-adaptive normalization to condition generation on pose, producing handshapes with accurate spatial configurations. The approach improves the state-of-the-art accuracy on the RWTH German sign language handshape dataset by 5% and generalizes across datasets: pose-based generation trained on the extensive HaGRID dataset reaches performance comparable to single-source trained classifiers without retraining the generator.
Link: https://arxiv.org/abs/2507.17008
Authors: Gaston Gustavo Rios, Pedro Dal Bianco, Franco Ronchetti, Facundo Quiroga, Oscar Stanchi, Santiago Ponte Ahón, Waldo Hasperué
Institutions: Universidad Nacional del Sur; Universidad Nacional de La Plata
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 23 pages, 8 figures, to be published in Applied Soft Computing
Abstract:Most sign language handshape datasets are severely limited and unbalanced, posing significant challenges to effective model training. In this paper, we explore the effectiveness of augmenting the training data of a handshape classifier by generating synthetic data. We use an EfficientNet classifier trained on the RWTH German sign language handshape dataset, which is small and heavily unbalanced, applying different strategies to combine generated and real images. We compare two Generative Adversarial Networks (GAN) architectures for data generation: ReACGAN, which uses label information to condition the data generation process through an auxiliary classifier, and SPADE, which utilizes spatially-adaptive normalization to condition the generation on pose information. ReACGAN allows for the generation of realistic images that align with specific handshape labels, while SPADE focuses on generating images with accurate spatial handshape configurations. Our proposed techniques improve the current state-of-the-art accuracy on the RWTH dataset by 5%, addressing the limitations of small and unbalanced datasets. Additionally, our method demonstrates the capability to generalize across different sign language datasets by leveraging pose-based generation trained on the extensive HaGRID dataset. We achieve comparable performance to single-source trained classifiers without the need for retraining the generator.
zh
[CV-89] Divisive Decisions: Improving Salience-Based Training for Generalization in Binary Classification Tasks
【Summary】: This paper addresses a limitation of existing saliency-guided training: it supervises only the relationship between the true-class class activation map (CAM) and a human reference saliency map, ignoring the false-class CAM obtained for the incorrect-label class. The key to the solution is three new training methods that incorporate both true- and false-class CAMs, motivated by the hypothesis that in binary tasks the two CAMs should diverge on the important features identified by humans (and reflected in human saliency maps), plus a novel post-hoc tool for identifying important features. Across diverse closed-set and open-set binary tasks (synthetic face detection, biometric presentation attack detection, and anomaly classification in chest X-rays), the proposed methods improve the generalization of deep learning models over traditional true-class-only saliency-guided training.
Link: https://arxiv.org/abs/2507.17000
Authors: Jacob Piland, Chris Sweet, Adam Czajka
Institutions: University of Notre Dame
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:Existing saliency-guided training approaches improve model generalization by incorporating a loss term that compares the model’s class activation map (CAM) for a sample’s true-class (i.e., correct-label class) against a human reference saliency map. However, prior work has ignored the false-class CAM(s), that is the model’s saliency obtained for incorrect-label class. We hypothesize that in binary tasks the true and false CAMs should diverge on the important classification features identified by humans (and reflected in human saliency maps). We use this hypothesis to motivate three new saliency-guided training methods incorporating both true- and false-class model’s CAM into the training strategy and a novel post-hoc tool for identifying important features. We evaluate all introduced methods on several diverse binary close-set and open-set classification tasks, including synthetic face detection, biometric presentation attack detection, and classification of anomalies in chest X-ray scans, and find that the proposed methods improve generalization capabilities of deep learning models over traditional (true-class CAM only) saliency-guided training approaches. We offer source codes and model weights (GitHub repository link removed to preserve anonymity) to support reproducible research.
zh
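The true-versus-false CAM idea can be sketched as a loss that attracts the true-class CAM toward the human saliency map while repelling the false-class CAM from it. The exact loss form and the flattened toy maps below are illustrative assumptions, not the authors' formulation:

```python
# Divergence-loss sketch: low loss when the true-class CAM matches the
# human map and the false-class CAM differs from it; high loss when the
# two CAMs are swapped. CAMs are flattened 4-pixel toy maps.

def l2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def divergence_loss(true_cam, false_cam, human_map, alpha=1.0):
    attract = l2(true_cam, human_map)    # true CAM should match the human map
    repel = l2(false_cam, human_map)     # false CAM should diverge from it
    return attract - alpha * repel

human = [1.0, 1.0, 0.0, 0.0]             # human says: left half matters
good = divergence_loss(true_cam=[0.9, 1.0, 0.1, 0.0],
                       false_cam=[0.0, 0.1, 0.9, 1.0],
                       human_map=human)
bad = divergence_loss(true_cam=[0.0, 0.1, 0.9, 1.0],
                      false_cam=[0.9, 1.0, 0.1, 0.0],
                      human_map=human)
```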
[CV-90] Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts ICCV2025
【Summary】: This paper addresses long-tailed online anomaly detection (LTOAD): localizing defect regions accurately and efficiently in an online setting where the training data is long-tailed and class labels are unavailable. Conventional long-tailed anomaly detection (LTAD) methods are class-aware and rely on labels for supervision, so they cannot be applied directly online. The key to the solution is a class-agnostic framework that first surpasses state-of-the-art baselines in the offline long-tailed setting (e.g., +4.63% image-AUROC on MVTec, even against methods with access to class labels) and is then adapted to online learning, gaining +0.53% image-AUROC in the most challenging long-tailed online setting. This removes the dependence on class labels and offers a practical route to dynamic anomaly detection in industrial and medical scenarios.
Link: https://arxiv.org/abs/2507.16946
Authors: Chiao-An Yang, Kuan-Chuan Peng, Raymond A. Yeh
Institutions: Purdue University; Mitsubishi Electric Research Laboratories
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is accepted to ICCV 2025. The supplementary material is included. The long-tailed online anomaly detection dataset is available at this https URL
Abstract:Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. In this work, we expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD). We first identified that the offline state-of-the-art LTAD methods cannot be directly applied to the online setting. Specifically, LTAD is class-aware, requiring class labels that are not available in the online setting. To address this challenge, we propose a class-agnostic framework for LTAD and then adapt it to our online learning setting. Our method outperforms the SOTA baselines in most offline LTAD settings, including both the industrial manufacturing and the medical domain. In particular, we observe +4.63% image-AUROC on MVTec even compared to methods that have access to class labels and the number of classes. In the most challenging long-tailed online setting, we achieve +0.53% image-AUROC compared to baselines. Our LTOAD benchmark is released here: this https URL .
zh
[CV-91] AURA: A Multi-Modal Medical Agent for Understanding, Reasoning and Annotation
【速读】:该论文旨在解决医学影像分析中静态预测系统缺乏交互性、可解释性不足以及临床适应性弱的问题。当前基于大语言模型(Large Language Models, LLMs)的智能体(agentic AI)已在多个领域展现出推理与工具调用能力,但在医学影像领域的应用仍处于起步阶段。解决方案的关键在于提出AURA——首个专为医学影像设计的视觉语言可解释性智能体,其核心创新包括:(i) 一个包含相位定位、病理分割和解剖结构分割的模块化分割套件,用于精确定位临床相关区域;(ii) 一个反事实图像生成模块,支持通过图像级解释进行因果推理;(iii) 一套评估工具,涵盖像素级差异图分析、分类及前沿诊断相关性评估组件,从而实现对医学影像的综合分析、动态交互、情境化解释与可量化评价,推动医学影像AI从静态输出向可交互、可解释、临床可信赖的方向演进。
链接: https://arxiv.org/abs/2507.16940
作者: Nima Fathi,Amar Kumar,Tal Arbel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 9 pages, 3 figures, International Conference on Medical Image Computing and Computer-Assisted Intervention
Abstract:Recent advancements in Large Language Models (LLMs) have catalyzed a paradigm shift from static prediction systems to agentic AI agents capable of reasoning, interacting with tools, and adapting to complex tasks. While LLM-based agentic systems have shown promise across many domains, their application to medical imaging remains in its infancy. In this work, we introduce AURA, the first visual linguistic explainability agent designed specifically for comprehensive analysis, explanation, and evaluation of medical images. By enabling dynamic interactions, contextual explanations, and hypothesis testing, AURA represents a significant advancement toward more transparent, adaptable, and clinically aligned AI systems. We highlight the promise of agentic AI in transforming medical image analysis from static predictions to interactive decision support. Leveraging Qwen-32B, an LLM-based architecture, AURA integrates a modular toolbox comprising: (i) a segmentation suite with phase grounding, pathology segmentation, and anatomy segmentation to localize clinically meaningful regions; (ii) a counterfactual image-generation module that supports reasoning through image-level explanations; and (iii) a set of evaluation tools including pixel-wise difference-map analysis, classification, and advanced state-of-the-art components to assess diagnostic relevance and visual interpretability.
zh
[CV-92] Sparser2Sparse: Single-shot Sparser-to-Sparse Learning for Spatial Transcriptomics Imputation with Natural Image Co-learning
【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)中高分辨率数据获取成本高、稀缺性限制其广泛应用的问题。针对此挑战,作者提出Single-shot Sparser-to-Sparse (S2S-ST) 框架,其核心创新在于:(1) 利用ST数据内在的空间模式设计稀疏到稀疏的自监督学习策略;(2) 引入自然图像进行跨域协同学习以增强特征表示能力;(3) 构建级联数据一致性重构网络(Cascaded Data Consistent Imputation Network, CDCIN),通过迭代优化在保持已采样基因数据保真度的同时提升预测精度。该方案仅需一个低成本的稀疏ST数据集即可实现高精度重构,显著降低对昂贵高分辨率数据的依赖。
链接: https://arxiv.org/abs/2507.16886
作者: Yaoyu Fang,Jiahe Qian,Xinkun Wang,Lee A. Cooper,Bo Zhou
机构: Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figure, under review
Abstract:Spatial transcriptomics (ST) has revolutionized biomedical research by enabling high resolution gene expression profiling within tissues. However, the high cost and scarcity of high resolution ST data remain significant challenges. We present Single-shot Sparser-to-Sparse (S2S-ST), a novel framework for accurate ST imputation that requires only a single and low-cost sparsely sampled ST dataset alongside widely available natural images for co-training. Our approach integrates three key innovations: (1) a sparser-to-sparse self-supervised learning strategy that leverages intrinsic spatial patterns in ST data, (2) cross-domain co-learning with natural images to enhance feature representation, and (3) a Cascaded Data Consistent Imputation Network (CDCIN) that iteratively refines predictions while preserving sampled gene data fidelity. Extensive experiments on diverse tissue types, including breast cancer, liver, and lymphoid tissue, demonstrate that our method outperforms state-of-the-art approaches in imputation accuracy. By enabling robust ST reconstruction from sparse inputs, our framework significantly reduces reliance on costly high resolution data, facilitating potential broader adoption in biomedical research and clinical applications.
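The "sparser-to-sparse" self-supervision idea above can be sketched in a few lines: from an already sparse set of measured spots, drop a fraction to form an even sparser input, and use the held-out spots as supervision targets. The function name, the 50% split, and the masking convention are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def sparser_pair(values, mask, keep=0.5, rng=None):
    """Build a (sparser input, held-out target) pair from a sparse sampling.
    `mask` is True where a spot was actually measured."""
    rng = rng or np.random.default_rng(0)
    idx = np.flatnonzero(mask)
    kept = rng.choice(idx, size=int(len(idx) * keep), replace=False)
    input_mask = np.zeros_like(mask)                  # even sparser input mask
    input_mask.flat[kept] = True
    target_mask = mask & ~input_mask                  # held-out measured spots
    return values * input_mask, target_mask
```

Training would then ask the imputation network to recover `values` at `target_mask` from the sparser input, which is the self-supervised signal the abstract refers to.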
zh
[CV-93] Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less Local Than Assumed
【速读】:该论文旨在解决文本到图像扩散模型(text-to-image diffusion models, DMs)中存在的数据隐私与知识产权风险问题,即模型可能无意中记忆并复现训练数据。现有缓解策略主要依赖于识别并剪枝导致复制的权重,其前提是记忆内容可被局部化。然而,论文指出这些方法存在根本性缺陷:即使经过剪枝,仅需微调输入提示的文本嵌入(text embeddings),即可重新触发数据复制,表明当前防御机制极为脆弱;同时,研究挑战了“记忆局部性”的假设,发现复制可从文本嵌入空间的不同位置触发,并通过模型内部不同路径实现。解决方案的关键在于提出一种新型对抗性微调方法(adversarial fine-tuning),该方法迭代搜索复制触发点并更新模型以增强鲁棒性,从而为真正移除记忆内容而非仅仅抑制其检索提供了新思路。
链接: https://arxiv.org/abs/2507.16880
作者: Antoni Kowalczuk,Dominik Hintersdorf,Lukas Struppek,Kristian Kersting,Adam Dziedzic,Franziska Boenisch
机构: CISPA Helmholtz Center for Information Security; German Research Center for Artificial Intelligence (DFKI); Computer Science Department, Technical University of Darmstadt; Hessian Center for AI (Hessian.AI); Centre for Cognitive Science, Technical University of Darmstadt
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering replication, based on the assumption that memorization can be localized. Our research assesses the robustness of these pruning-based approaches. We demonstrate that even after pruning, minor adjustments to text embeddings of input prompts are sufficient to re-trigger data replication, highlighting the fragility of these defenses. Furthermore, we challenge the fundamental assumption of memorization locality, by showing that replication can be triggered from diverse locations within the text embedding space, and follows different paths in the model. Our findings indicate that existing mitigation strategies are insufficient and underscore the need for methods that truly remove memorized content, rather than attempting to suppress its retrieval. As a first step in this direction, we introduce a novel adversarial fine-tuning method that iteratively searches for replication triggers and updates the model to increase robustness. Through our research, we provide fresh insights into the nature of memorization in text-to-image DMs and a foundation for building more trustworthy and compliant generative AI.
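The inner loop of the proposed adversarial fine-tuning, searching the text-embedding space for replication triggers, can be sketched as below. Random search stands in for gradient-based optimization, and `replication_score` is a stub for measuring similarity between a generation and a memorized training image; all names and the search scheme are assumptions for illustration.

```python
import numpy as np

def find_trigger(embed, replication_score, steps=50, eps=0.05, rng=None):
    """Search for a small embedding perturbation that maximises a
    replication score (hypothetical stand-in for the paper's trigger search)."""
    rng = rng or np.random.default_rng(0)
    best, best_s = embed.copy(), replication_score(embed)
    for _ in range(steps):
        cand = best + eps * rng.standard_normal(embed.shape)  # local perturbation
        s = replication_score(cand)
        if s > best_s:                                        # keep improvements
            best, best_s = cand, s
    return best, best_s
```

The outer loop of the method would alternate such a search with a model update that reduces the score at the found trigger, iteratively increasing robustness.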
zh
[CV-94] CausalStep: A Benchmark for Explicit Stepwise Causal Reasoning in Videos
【速读】:该论文旨在解决当前视频理解与推理任务中缺乏对因果性和步骤式推理能力的严格评估问题。现有视频基准测试主要衡量浅层理解,且允许模型利用全局上下文信息,从而无法有效检验模型是否具备真正的因果推断和逐步推理能力。解决方案的关键在于提出CausalStep基准,其通过将视频分割为因果关联的单元,并强制执行严格的逐步问答(QA)协议,确保答案顺序性并阻止捷径解法;同时引入基于错误类型分类的精心设计干扰项以提升诊断价值,最终实现对模型因果推理能力的精准量化评估。
链接: https://arxiv.org/abs/2507.16878
作者: Xuchen Li,Xuzhao Li,Shiyu Hu,Kaiqi Huang,Wentao Zhang
机构: 1. Tsinghua University (清华大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. Beijing Key Laboratory of Intelligent Perception and Cognitive Computing (北京市智能感知与认知计算重点实验室); 4. University of California, Berkeley (加州大学伯克利分校); 5. National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint, Under review
Abstract:Recent advances in large language models (LLMs) have improved reasoning in text and image domains, yet achieving robust video reasoning remains a significant challenge. Existing video benchmarks mainly assess shallow understanding and reasoning and allow models to exploit global context, failing to rigorously evaluate true causal and stepwise reasoning. We present CausalStep, a benchmark designed for explicit stepwise causal reasoning in videos. CausalStep segments videos into causally linked units and enforces a strict stepwise question-answer (QA) protocol, requiring sequential answers and preventing shortcut solutions. Each question includes carefully constructed distractors based on error type taxonomy to ensure diagnostic value. The benchmark features 100 videos across six categories and 1,852 multiple-choice QA pairs. We introduce seven diagnostic metrics for comprehensive evaluation, enabling precise diagnosis of causal reasoning capabilities. Experiments with leading proprietary and open-source models, as well as human baselines, reveal a significant gap between current models and human-level stepwise reasoning. CausalStep provides a rigorous benchmark to drive progress in robust and interpretable video reasoning.
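The strict stepwise QA protocol described above, sequential answers over causally linked units with no global lookahead, can be sketched as a small harness. `answer_fn` stands in for any model under evaluation; the data layout is an assumption for illustration.

```python
def run_stepwise_qa(questions, answer_fn):
    """Reveal one causal unit at a time; the model must answer each
    question from the revealed prefix only (no shortcut via global context)."""
    revealed, correct = [], []
    for segment, question, gold in questions:
        revealed.append(segment)                  # reveal the next causal unit
        answer = answer_fn(list(revealed), question)  # prefix-only context
        correct.append(answer == gold)
    return correct                                # per-step correctness
```

Per-step correctness vectors like this are what diagnostic metrics over causal reasoning chains can be computed from.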
zh
[CV-95] HIPPO-Video: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting
【速读】:该论文旨在解决个性化视频摘要(personalized video highlighting)中因用户偏好高度多样且复杂而导致的现有数据集缺乏针对性的问题。传统视频数据集通常依赖孤立视频或简单文本查询,无法准确捕捉用户行为的细微差异。其解决方案的关键在于构建HIPPO-Video数据集,该数据集通过基于大语言模型(LLM)的用户模拟器生成真实感强的观看历史(watch history),从而反映多样化用户偏好;并提出HiPHer方法,利用这些个性化观看历史预测条件化的片段级显著性分数(segment-wise saliency scores),实验证明该方法在性能上优于通用和基于查询的方法,展现出在实际场景中实现高度用户中心化视频摘要的潜力。
链接: https://arxiv.org/abs/2507.16873
作者: Jeongeun Lee,Youngjae Yu,Dongha Lee
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to COLM2025
Abstract:The exponential growth of video content has made personalized video highlighting an essential task, as user preferences are highly variable and complex. Existing video datasets, however, often lack personalization, relying on isolated videos or simple text queries that fail to capture the intricacies of user behavior. In this work, we introduce HIPPO-Video, a novel dataset for personalized video highlighting, created using an LLM-based user simulator to generate realistic watch histories reflecting diverse user preferences. The dataset includes 2,040 (watch history, saliency score) pairs, covering 20,400 videos across 170 semantic categories. To validate our dataset, we propose HiPHer, a method that leverages these personalized watch histories to predict preference-conditioned segment-wise saliency scores. Through extensive experiments, we demonstrate that our method outperforms existing generic and query-based approaches, showcasing its potential for highly user-centric video highlighting in real-world scenarios.
zh
[CV-96] Controllable Video Generation: A Survey
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 视频生成模型中控制能力不足的问题,即现有文本到视频(text-to-video)基础模型仅依赖文本提示难以准确表达复杂、多模态及细粒度的用户意图,导致生成视频与用户需求之间存在偏差。解决方案的关键在于引入多种非文本条件信号(如相机运动、深度图、人体姿态等)作为控制机制,并将其集成到视频扩散模型的去噪过程中,从而实现对视频生成过程的精细化引导。该方法显著提升了生成视频的可控性与实用性,推动了可控视频生成技术的发展。
链接: https://arxiv.org/abs/2507.16869
作者: Yue Ma,Kunyu Feng,Zhongyuan Hu,Xinyu Wang,Yucheng Wang,Mingzhe Zheng,Xuanhua He,Chenyang Zhu,Hongyu Liu,Yingqing He,Zeyu Wang,Zhifeng Li,Xiu Li,Wei Liu,Dan Xu,Linfeng Zhang,Qifeng Chen
机构: The Hong Kong University of Science and Technology (香港科技大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学广州分校); Tsinghua University (清华大学); Shanghai Jiao Tong University (上海交通大学); Tencent (腾讯)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
Abstract:With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at this https URL.
zh
[CV-97] Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection
【速读】:该论文旨在解决多模态感知中相机与激光雷达(LiDAR)特征在鸟瞰图(BEV)表示下的错位问题,该问题源于外参标定误差和激光雷达滚动快门效应导致的投影偏差,进而引发相机分支深度监督不准确及跨模态特征融合错误。解决方案的关键在于利用2D目标检测器识别出的对象-背景边界作为先验信息,首先通过Prior Guided Depth Calibration (PGDC) 模块校正局部错位,保留正确的跨模态特征对;随后引入Discontinuity Aware Geometric Fusion (DAGF) 模块处理全局错位,抑制噪声并增强边界处的锐利过渡;最终结合Structural Guidance Depth Modulator (SGDM) 利用门控注意力机制高效融合对齐后的深度与图像特征,从而实现更精确的BEV感知。
链接: https://arxiv.org/abs/2507.16861
作者: Xiang Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Integrating LiDAR and camera inputs into a unified Bird’s-Eye-View (BEV) representation is crucial for enhancing 3D perception capabilities of autonomous vehicles. However, current methods are often affected by misalignment between camera and LiDAR features. This misalignment leads to inaccurate depth supervision in camera branch and erroneous fusion during cross-modal feature aggregation. The root cause of this misalignment lies in projection errors, stemming from minor extrinsic calibration inaccuracies and rolling shutter effect of LiDAR during vehicle motion. In this work, our key insight is that these projection errors are predominantly concentrated at object-background boundaries, which are readily identified by 2D detectors. Based on this, our main motivation is to utilize 2D object priors to pre-align cross-modal features before fusion. To address local misalignment, we propose Prior Guided Depth Calibration (PGDC), which leverages 2D priors to correct local misalignment and preserve correct cross-modal feature pairs. To resolve global misalignment, we introduce Discontinuity Aware Geometric Fusion (DAGF) to process calibrated results from PGDC, suppressing noise and explicitly enhancing sharp transitions at object-background boundaries. To effectively utilize these transition-aware depth representations, we incorporate Structural Guidance Depth Modulator (SGDM), using a gated attention mechanism to efficiently fuse aligned depth and image features. Our proposed method achieves state-of-the-art performance on nuScenes validation dataset, with its mAP and NDS reaching 71.5% and 73.6% respectively.
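The gated attention fusion in SGDM can be illustrated with a minimal NumPy sketch: a sigmoid gate computed from both modalities decides, per channel, how much of the aligned depth feature versus the image feature to pass. The weight shapes and the per-channel gate are simplifying assumptions, not the paper's architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(depth_feat, img_feat, w_d, w_i, b):
    """Gate in [0, 1] mixes depth and image features (illustrative shapes)."""
    gate = sigmoid(depth_feat @ w_d + img_feat @ w_i + b)  # (N, C)
    return gate * depth_feat + (1.0 - gate) * img_feat
```

With a saturated gate the layer passes one modality through almost unchanged, which is the behaviour a learned gate can exploit near object-background boundaries.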
zh
[CV-98] Weak Links in LinkedIn: Enhancing Fake Profile Detection in the Age of LLMs
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)生成的虚假职业资料对现有文本型虚假账号检测器造成的鲁棒性下降问题。随着LLM(如GPT)能够生成高度逼真的虚假资料,传统检测方法在面对此类生成内容时误接受率(False Accept Rate)显著上升(从6–7%飙升至42–52%),暴露出检测系统在对抗生成式AI(Generative AI)攻击下的脆弱性。解决方案的关键在于提出一种基于GPT辅助的对抗训练策略(GPT-assisted adversarial training),通过引入LLM生成的对抗样本进行模型训练,使检测器在保持低误拒率(False Reject Rate: 0.5–2%)的同时,将误接受率恢复至1–7%,从而显著提升其对生成式虚假资料的识别能力。
链接: https://arxiv.org/abs/2507.16860
作者: Apoorva Gulati,Rajesh Kumar,Vinti Agarwal,Aditya Sharma
机构: 未知
类目: Social and Information Networks (cs.SI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: 10 pages, 3 figures, 1 table, accepted for publication at ASONAM 2025. this https URL
Abstract:Large Language Models (LLMs) have made it easier to create realistic fake profiles on platforms like LinkedIn. This poses a significant risk for text-based fake profile detectors. In this study, we evaluate the robustness of existing detectors against LLM-generated profiles. While highly effective in detecting manually created fake profiles (False Accept Rate: 6-7%), the existing detectors fail to identify GPT-generated profiles (False Accept Rate: 42-52%). We propose GPT-assisted adversarial training as a countermeasure, restoring the False Accept Rate to between 1-7% without impacting the False Reject Rates (0.5-2%). Ablation studies revealed that detectors trained on combined numerical and textual embeddings exhibit the highest robustness, followed by those using numerical-only embeddings, and lastly those using textual-only embeddings. Complementary analysis on the ability of prompt-based GPT-4Turbo and human evaluators affirms the need for robust automated detectors such as the one proposed in this study.
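The two operating metrics quoted throughout the abstract can be computed as follows; the label convention (1 = fake profile, 0 = real) is an assumption for illustration.

```python
def far_frr(labels, preds):
    """False Accept Rate: fraction of fake profiles accepted as real.
    False Reject Rate: fraction of real profiles rejected as fake."""
    fakes = [p for l, p in zip(labels, preds) if l == 1]
    reals = [p for l, p in zip(labels, preds) if l == 0]
    far = sum(1 for p in fakes if p == 0) / len(fakes)  # fake passed as real
    frr = sum(1 for p in reals if p == 1) / len(reals)  # real flagged as fake
    return far, frr
```

Under this convention, the reported jump from 6-7% to 42-52% FAR on GPT-generated profiles means roughly half of the generated fakes slip past the unhardened detectors.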
zh
[CV-99] SIA: Enhancing Safety via Intent Awareness for Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在实际应用中因图像与文本输入之间微妙交互而产生的隐性安全风险问题,即看似无害的输入组合可能共同揭示有害意图,导致模型输出不安全内容。现有基于事后过滤或静态拒绝提示的方法难以识别此类由多模态输入协同引发的潜在危害。解决方案的关键在于提出一种无需训练的提示工程框架SIA(Safety via Intent Awareness),其核心机制为三阶段推理流程:首先通过图像描述(captioning)进行视觉抽象,继而利用少量示例链式思维(few-shot chain-of-thought prompting)推断用户意图,最后基于该意图对响应进行条件化优化。该方法不依赖预定义规则或分类器,而是动态适应从图像-文本对中推断出的隐含意图,从而实现对多模态有害意图的主动检测与缓解。
链接: https://arxiv.org/abs/2507.16856
作者: Youngjin Na,Sangheon Jeong,Youngwan Lee
机构: Modulabs; ETRI; KAIST
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 6 figures
Abstract:As vision-language models (VLMs) are increasingly deployed in real-world applications, new safety risks arise from the subtle interplay between images and text. In particular, seemingly innocuous inputs can combine to reveal harmful intent, leading to unsafe model responses. Despite increasing attention to multimodal safety, previous approaches based on post hoc filtering or static refusal prompts struggle to detect such latent risks, especially when harmfulness emerges only from the combination of inputs. We propose SIA (Safety via Intent Awareness), a training-free prompt engineering framework that proactively detects and mitigates harmful intent in multimodal inputs. SIA employs a three-stage reasoning process: (1) visual abstraction via captioning, (2) intent inference through few-shot chain-of-thought prompting, and (3) intent-conditioned response refinement. Rather than relying on predefined rules or classifiers, SIA dynamically adapts to the implicit intent inferred from the image-text pair. Through extensive experiments on safety-critical benchmarks including SIUO, MM-SafetyBench, and HoliSafe, we demonstrate that SIA achieves substantial safety improvements, outperforming prior methods. Although SIA shows a minor reduction in general reasoning accuracy on MMStar, the corresponding safety gains highlight the value of intent-aware reasoning in aligning VLMs with human-centric values.
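SIA's three-stage, training-free flow can be sketched as a simple pipeline skeleton. The three callables stand in for prompts to the underlying VLM (captioning, few-shot chain-of-thought intent inference, and intent-conditioned refinement); their names and signatures are assumptions, not the paper's API.

```python
def sia_pipeline(image, text, caption_fn, infer_intent_fn, refine_fn):
    """Illustrative skeleton of the three SIA stages."""
    caption = caption_fn(image)                 # stage 1: visual abstraction
    intent = infer_intent_fn(caption, text)     # stage 2: intent inference
    return refine_fn(intent, caption, text)     # stage 3: conditioned response
```

The point of the structure is that the final response is conditioned on an intent inferred from the image-text *combination*, which is how latent harm invisible in either input alone gets caught.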
zh
[CV-100] CLAMP: Contrastive Learning with Adaptive Multi-loss and Progressive Fusion for Multimodal Aspect-Based Sentiment Analysis
【速读】:该论文旨在解决多模态方面情感分析(Multimodal Aspect-Based Sentiment Analysis, MABSA)中面临的跨模态对齐噪声和细粒度表示不一致问题,特别是现有方法在全局模态对齐时忽视了方面词与其对应局部视觉区域之间的关联,导致文本与图像间表征差距难以弥合。解决方案的关键在于提出一个端到端的对比学习框架CLAMP(Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion),其核心创新包括:1)渐进式注意力融合网络(Progressive Attention Fusion network),通过分阶段的多层次跨模态交互增强文本特征与图像区域间的细粒度对齐,有效抑制无关视觉噪声;2)多任务对比学习机制,结合全局模态对比与局部粒度对齐,提升跨模态表征一致性;3)自适应多损失聚合模块,基于动态不确定性加权机制调节各任务损失贡献,缓解梯度干扰。
链接: https://arxiv.org/abs/2507.16854
作者: Xiaoqiang He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal aspect-based sentiment analysis (MABSA) seeks to identify aspect terms within paired image-text data and determine their fine-grained sentiment polarities, representing a fundamental task for improving the effectiveness of applications such as product review systems and public opinion monitoring. Existing methods face challenges such as cross-modal alignment noise and insufficient consistency in fine-grained representations. While global modality alignment methods often overlook the connection between aspect terms and their corresponding local visual regions, bridging the representation gap between text and images remains a challenge. To address these limitations, this paper introduces an end-to-end Contrastive Learning framework with Adaptive Multi-loss and Progressive Attention Fusion (CLAMP). The framework is composed of three novel modules: a Progressive Attention Fusion network, Multi-task Contrastive Learning, and Adaptive Multi-loss Aggregation. The Progressive Attention Fusion network enhances fine-grained alignment between textual features and image regions via hierarchical, multi-stage cross-modal interactions, effectively suppressing irrelevant visual noise. Secondly, multi-task contrastive learning combines global modal contrast and local granularity alignment to enhance cross-modal representation consistency. Adaptive Multi-loss Aggregation employs a dynamic uncertainty-based weighting mechanism to calibrate loss contributions according to each task’s uncertainty, thereby mitigating gradient interference. Evaluation on standard public benchmarks demonstrates that CLAMP consistently outperforms the vast majority of existing state-of-the-art methods.
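A common form of dynamic uncertainty-based loss weighting, which the adaptive aggregation module plausibly resembles, scales each task loss by exp(-s) and regularizes with s, where s is a learnable per-task log-variance (the homoscedastic-uncertainty formulation of Kendall et al.; assuming this exact form is the sketch's assumption, not the paper's statement).

```python
import numpy as np

def aggregate_losses(losses, log_vars):
    """Uncertainty-weighted sum: sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a learnable per-task log-variance."""
    losses, log_vars = np.asarray(losses), np.asarray(log_vars)
    return float(np.sum(np.exp(-log_vars) * losses + log_vars))
```

Raising a task's log-variance shrinks its effective gradient weight exp(-s), which is how a high-uncertainty task's interference on the others gets damped.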
zh
[CV-101] Coarse-to-fine crack cue for robust crack detection
【速读】:该论文旨在解决深度学习方法在裂缝检测任务中泛化能力不足的问题,尤其是现有方法常忽视裂缝的细长结构特性(thin structure property),导致在复杂背景、阴影和光照变化等未见域场景下性能下降。其解决方案的关键在于提出一种基于粗到精裂缝线索生成(coarse-to-fine crack cue generation)的新方法——CrackCue:首先通过简单的最大池化与上采样操作获得粗粒度无裂缝背景,再利用重建网络生成精细无裂缝背景,最终通过原始图像与该背景的差异提取出鲁棒的细粒度裂缝线索(fine crack cue)。该线索嵌入了不受复杂背景干扰的裂缝先验信息,可作为插件模块集成至多种先进裂缝检测网络中,显著提升模型的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2507.16851
作者: Zelong Liu,Yuliang Gu,Zhichao Sun,Huachao Zhu,Xin Xiao,Bo Du,Laurent Najman (LIGM),Yongchao Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE); Image and Video Processing (eess.IV)
备注:
Abstract:Crack detection is an important task in computer vision. Despite impressive in-dataset performance, deep learning-based methods still struggle in generalizing to unseen domains. The thin structure property of cracks is usually overlooked by previous methods. In this work, we introduce CrackCue, a novel method for robust crack detection based on coarse-to-fine crack cue generation. The core concept lies on leveraging the thin structure property to generate a robust crack cue, guiding the crack detection. Specifically, we first employ a simple max-pooling and upsampling operation on the crack image. This results in a coarse crack-free background, based on which a fine crack-free background can be obtained via a reconstruction network. The difference between the original image and fine crack-free background provides a fine crack cue. This fine cue embeds robust crack prior information which is unaffected by complex backgrounds, shadow, and varied lighting. As a plug-and-play method, we incorporate the proposed CrackCue into three advanced crack detection networks. Extensive experimental results demonstrate that the proposed CrackCue significantly improves the generalization ability and robustness of the baseline methods. The source code will be publicly available.
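The coarse crack-cue step is simple enough to sketch directly: because cracks are thin (and, assumed here, darker than their surroundings), block-wise max-pooling followed by nearest-neighbour upsampling erases them, and subtracting the image from that coarse background highlights the thin structures. The pool size and the dark-crack assumption are illustrative.

```python
import numpy as np

def coarse_crack_cue(img, k=8):
    """Max-pool in k x k blocks, upsample back, subtract: thin dark cracks
    survive as large values in the difference (the coarse crack cue)."""
    h, w = img.shape
    pooled = img[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))
    background = np.repeat(np.repeat(pooled, k, axis=0), k, axis=1)  # upsample
    return background - img[:background.shape[0], :background.shape[1]]
```

In CrackCue this coarse background is further refined by a reconstruction network before the final cue is taken; the sketch stops at the first, pooling-based stage.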
zh
[CV-102] Toward a Real-Time Framework for Accurate Monocular 3D Human Pose Estimation with Geometric Priors ICRA2025
【速读】:该论文旨在解决单目图像下人体三维姿态估计(Monocular 3D Human Pose Estimation)在实时性和约束环境下的挑战,尤其是如何在不依赖专用硬件和大规模标注数据的前提下实现高精度、可部署的3D姿态估计。其解决方案的关键在于将实时2D关键点检测与几何感知的2D到3D提升(2D-to-3D Lifting)相结合,显式利用已知相机内参(camera intrinsics)和个体特定的解剖学先验(anatomical priors),并通过自校准(self-calibration)和生物力学约束逆运动学(biomechanically-constrained inverse kinematics)从动作捕捉(MoCap)和合成数据集中生成大规模、合理的2D-3D训练样本,从而在边缘设备上实现快速、个性化且准确的3D人体姿态估计。
链接: https://arxiv.org/abs/2507.16850
作者: Mohamed Adjel (LAAS)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: IEEE ICRA 2025 (workshop: Enhancing Human Mobility: From Computer Vision-Based Motion Tracking to Wearable Assistive Robot Control), May 2025, Atlanta (Georgia), United States
Abstract:Monocular 3D human pose estimation remains a challenging and ill-posed problem, particularly in real-time settings and unconstrained environments. While direct image-to-3D approaches require large annotated datasets and heavy models, 2D-to-3D lifting offers a more lightweight and flexible alternative, especially when enhanced with prior knowledge. In this work, we propose a framework that combines real-time 2D keypoint detection with geometry-aware 2D-to-3D lifting, explicitly leveraging known camera intrinsics and subject-specific anatomical priors. Our approach builds on recent advances in self-calibration and biomechanically-constrained inverse kinematics to generate large-scale, plausible 2D-3D training pairs from MoCap and synthetic datasets. We discuss how these ingredients can enable fast, personalized, and accurate 3D pose estimation from monocular images without requiring specialized hardware. This proposal aims to foster discussion on bridging data-driven learning and model-based priors to improve accuracy, interpretability, and deployability of 3D human motion capture on edge devices in the wild.
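One geometric ingredient the abstract leans on, exploiting known camera intrinsics, is the standard pinhole back-projection: a 2D keypoint (u, v) with depth Z lifts to the camera-frame point X = Z * K^{-1} [u, v, 1]^T. The sketch below shows only this textbook relation, not the full lifting network.

```python
import numpy as np

def backproject(keypoints_2d, depths, K):
    """Lift 2D pixel keypoints to 3D camera-frame points given depths
    and intrinsics K (pinhole model)."""
    uv1 = np.hstack([keypoints_2d, np.ones((len(keypoints_2d), 1))])  # (N, 3)
    rays = (np.linalg.inv(K) @ uv1.T).T          # unit-depth rays K^-1 [u, v, 1]
    return rays * depths[:, None]                # scale each ray by its depth
```

Because this mapping is exact once K is known, explicitly feeding intrinsics to the lifter removes an ambiguity that purely image-driven models would otherwise have to learn.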
zh
[CV-103] Post-Disaster Affected Area Segmentation with a Vision Transformer (ViT)-based EVAP Model using Sentinel-2 and Formosat-5 Imagery
【速读】:该论文旨在解决灾害影响区域分割在缺乏精确地面真值标注情况下的准确性与可靠性问题,尤其针对遥感影像中灾害范围识别的挑战。其关键解决方案在于提出一种基于视觉Transformer(Vision Transformer, ViT)的深度学习框架,结合主成分分析(Principal Component Analysis, PCA)特征空间分析与置信度指数(Confidence Index, CI),从少量人工标注区域出发,自动扩展生成弱监督训练集,并利用Sentinel-2和Formosat-5多光谱数据训练ViT编码器-解码器模型,同时引入多阶段损失策略和多种解码器结构以提升有限监督条件下的分割性能。该方法显著改善了分割结果的平滑性和空间一致性,为灾害制图提供了一种可扩展的解决方案。
链接: https://arxiv.org/abs/2507.16849
作者: Yi-Shan Chu,Hsuan-Cheng Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose a vision transformer (ViT)-based deep learning framework to refine disaster-affected area segmentation from remote sensing imagery, aiming to support and enhance the Emergent Value Added Product (EVAP) developed by the Taiwan Space Agency (TASA). The process starts with a small set of manually annotated regions. We then apply principal component analysis (PCA)-based feature space analysis and construct a confidence index (CI) to expand these labels, producing a weakly supervised training set. These expanded labels are then used to train ViT-based encoder-decoder models with multi-band inputs from Sentinel-2 and Formosat-5 imagery. Our architecture supports multiple decoder variants and multi-stage loss strategies to improve performance under limited supervision. During the evaluation, model predictions are compared with higher-resolution EVAP output to assess spatial coherence and segmentation consistency. Case studies on the 2022 Poyang Lake drought and the 2023 Rhodes wildfire demonstrate that our framework improves the smoothness and reliability of segmentation results, offering a scalable approach for disaster mapping when accurate ground truth is unavailable.
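The label-expansion step, PCA projection of features, then a confidence index (CI) that admits pixels similar to the annotated seeds into the weak training set, can be sketched as below. The specific CI definition (distance to the seed centroid, rescaled to [0, 1]) and the threshold are assumptions for illustration, not the paper's formula.

```python
import numpy as np

def expand_labels(features, seed_idx, threshold=0.9):
    """Project features with 2-component PCA, score each sample by
    similarity to the seed centroid, and keep samples above threshold."""
    x = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    z = x @ vt[:2].T                           # 2-component PCA projection
    centroid = z[seed_idx].mean(axis=0)        # seed (annotated) distribution
    d = np.linalg.norm(z - centroid, axis=1)
    ci = 1.0 - d / (d.max() + 1e-12)           # illustrative confidence index
    return np.flatnonzero(ci >= threshold)     # expanded weak-label set
```

The expanded indices would then serve as the weakly supervised training set for the ViT encoder-decoder described above.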
zh
[CV-104] Assessing Medical Training Skills via Eye and Head Movements
【速读】:该论文旨在解决临床技能评估中缺乏客观、量化指标的问题,尤其在产科模拟训练场景下,如何通过非侵入式生理与行为信号实现对从业者熟练程度的自动区分。其解决方案的关键在于利用可穿戴眼动追踪眼镜采集眼动和头部运动数据(如瞳孔反应率、注视持续时间及角速度),并基于这些多模态特征构建计算模型,从而有效区分受训与未受训 practitioner,其中头部相关特征表现最优(F1=0.85,AUC=0.86),为未来基于生成式 AI (Generative AI) 的隐性技能评估提供了可行的技术路径。
链接: https://arxiv.org/abs/2507.16819
作者: Kayhan Latifzadeh,Luis A. Leiva,Klen Čopič Pucihar,Matjaž Kljun,Iztok Devetak,Lili Steblovnik
机构: University of Luxembourg(卢森堡大学); University of Primorska(普里莫尔斯卡大学); Stellenbosch University(斯泰伦博斯大学); University of Ljubljana(卢布尔雅那大学); University Medical Centre Ljubljana(卢布尔雅那医科大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We examined eye and head movements to gain insights into skill development in clinical settings. A total of 24 practitioners participated in simulated baby delivery training sessions. We calculated key metrics, including pupillary response rate, fixation duration, and angular velocity. Our findings indicate that eye and head tracking can effectively differentiate between trained and untrained practitioners, particularly during labor tasks. For example, head-related features achieved an F1 score of 0.85 and an AUC of 0.86, whereas pupil-related features achieved an F1 score of 0.77 and an AUC of 0.85. The results lay the groundwork for computational models that support implicit skill assessment and training in clinical settings by using commodity eye-tracking glasses as a complementary device to more traditional evaluation methods such as subjective scores.
zh
[CV-105] MCM: Mamba-based Cardiac Motion Tracking using Sequential Images in MRI MICCAI
【速读】:该论文旨在解决现有心脏运动追踪方法在处理心肌运动时忽视其连续性而导致运动估计不一致、不平滑的问题。传统方法通常基于单帧图像对(参考帧与随机选取的目标帧)进行学习,未能充分利用心脏周期中帧间的时序关联。解决方案的关键在于提出一种基于Mamba架构的心脏运动追踪网络(MCM),其核心创新包括:1)设计双向Mamba模块,通过双向扫描机制有效建模心脏运动的时序依赖关系;2)引入融合邻近帧运动信息的运动解码器,增强时间一致性;3)利用Mamba的结构化状态空间形式,在不增加计算复杂度的前提下学习心肌的连续动力学特性。该方法显著提升了运动场的平滑性和时序一致性,实验表明其在定量和定性指标上均优于传统及当前最优的基于学习的方法。
链接: https://arxiv.org/abs/2507.17678
作者: Jiahui Yin,Xinxing Cheng,Jinming Duan,Yan Pang,Declan O’Regan,Hadrien Reynaud,Qingjie Meng
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Medical Image Computing and Computer-Assisted Intervention (MICCAI), Reconstruction and Imaging Motion Estimation Workshop (RIME), 2025
Abstract:Myocardial motion tracking is important for assessing cardiac function and diagnosing cardiovascular diseases, for which cine cardiac magnetic resonance (CMR) has been established as the gold standard imaging modality. Many existing methods learn motion from single image pairs consisting of a reference frame and a randomly selected target frame from the cardiac cycle. However, these methods overlook the continuous nature of cardiac motion and often yield inconsistent and non-smooth motion estimations. In this work, we propose a novel Mamba-based cardiac motion tracking network (MCM) that explicitly incorporates target image sequence from the cardiac cycle to achieve smooth and temporally consistent motion tracking. By developing a bi-directional Mamba block equipped with a bi-directional scanning mechanism, our method facilitates the estimation of plausible deformation fields. With our proposed motion decoder that integrates motion information from frames adjacent to the target frame, our method further enhances temporal coherence. Moreover, by taking advantage of Mamba’s structured state-space formulation, the proposed method learns the continuous dynamics of the myocardium from sequential images without increasing computational complexity. We evaluate the proposed method on two public datasets. The experimental results demonstrate that the proposed method quantitatively and qualitatively outperforms both conventional and state-of-the-art learning-based cardiac motion tracking methods. The code is available at this https URL.
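The bi-directional scanning idea can be illustrated with a toy linear state-space recurrence, h_t = a*h_{t-1} + b*x_t, run over the frame sequence forward and backward and then averaged. Scalar dynamics stand in for Mamba's selective SSM; the averaging of the two passes is a simplifying assumption.

```python
import numpy as np

def bidirectional_scan(x, a=0.9, b=1.0):
    """Toy bi-directional linear SSM scan over a 1-D sequence."""
    def scan(seq):
        h, out = 0.0, []
        for v in seq:
            h = a * h + b * v          # linear state-space recurrence
            out.append(h)
        return np.array(out)
    fwd = scan(x)                       # forward pass over the sequence
    bwd = scan(x[::-1])[::-1]           # backward pass, re-reversed
    return 0.5 * (fwd + bwd)
```

The structured recurrence is what lets such models consume the whole cardiac-cycle sequence with cost linear in its length, the efficiency property the abstract highlights.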
zh
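The bi-directional scanning idea in the MCM entry above can be illustrated with a toy recurrence. This is a hedged sketch: the linear update `h_t = a*h_{t-1} + b*x_t` is a stand-in for a selective state-space update, and all parameter values are illustrative, not the paper's.

```python
def scan(xs, a=0.9, b=0.1, reverse=False):
    """Run a simple linear state-space recurrence over a frame sequence."""
    seq = list(reversed(xs)) if reverse else list(xs)
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x          # toy state update
        out.append(h)
    return list(reversed(out)) if reverse else out

def bidirectional_scan(xs):
    """Fuse forward and backward scans so every frame sees both past and
    future context, which is what encourages temporally smooth motion."""
    fwd = scan(xs)
    bwd = scan(xs, reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]

frames = [0.0, 1.0, 2.0, 3.0]       # toy per-frame features
features = bidirectional_scan(frames)
```

Unlike a single-pair formulation, the fused feature at the first frame already carries information from later frames in the cardiac cycle.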
[CV-106] Mammo-Mamba: A Hybrid State-Space and Transformer Architecture with Sequential Mixture of Experts for Multi-View Mammography
【速读】:该论文旨在解决多视角乳腺X线摄影(mammogram)图像分类中深度Transformer模型计算复杂度高、效率低的问题,尤其是在处理高分辨率医学影像时,传统Transformer的二次方复杂度限制了其在临床实践中的应用。解决方案的关键在于提出Mammo-Mamba框架,其核心创新是引入序列专家混合机制(Sequential Mixture of Experts, SeqMoE),通过定制化的SecMamba模块实现内容自适应的特征精炼,从而在保持高分类性能的同时显著提升计算效率。该方法将Selective State-Space Models (SSMs)、Transformer注意力机制与专家驱动的特征优化相结合,使模型能够在深层网络中动态调整特征权重,有效克服了传统Transformer在乳腺癌早期诊断中对资源消耗大的局限性。
链接: https://arxiv.org/abs/2507.17662
作者: Farnoush Bayatmakou,Reza Taleei,Nicole Simone,Arash Mohammadi
机构: Concordia Institute for Information Systems Engineering (CIISE), Concordia University (康考迪亚大学); Thomas Jefferson University Hospital (托马斯·杰斐逊大学医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Breast cancer (BC) remains one of the leading causes of cancer-related mortality among women, despite recent advances in Computer-Aided Diagnosis (CAD) systems. Accurate and efficient interpretation of multi-view mammograms is essential for early detection, driving a surge of interest in Artificial Intelligence (AI)-powered CAD models. While state-of-the-art multi-view mammogram classification models are largely based on Transformer architectures, their computational complexity scales quadratically with the number of image patches, highlighting the need for more efficient alternatives. To address this challenge, we propose Mammo-Mamba, a novel framework that integrates Selective State-Space Models (SSMs), transformer-based attention, and expert-driven feature refinement into a unified architecture. Mammo-Mamba extends the MambaVision backbone by introducing the Sequential Mixture of Experts (SeqMoE) mechanism through its customized SecMamba block. The SecMamba is a modified MambaVision block that enhances representation learning in high-resolution mammographic images by enabling content-adaptive feature refinement. These blocks are integrated into the deeper stages of MambaVision, allowing the model to progressively adjust feature emphasis through dynamic expert gating, effectively mitigating the limitations of traditional Transformer models. Evaluated on the CBIS-DDSM benchmark dataset, Mammo-Mamba achieves superior classification performance across all key metrics while maintaining computational efficiency.
zh
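The SeqMoE mechanism in the Mammo-Mamba entry rests on a standard gating idea: a softmax gate weights expert outputs per input. The sketch below is a generic mixture-of-experts gate under illustrative names, not the paper's SecMamba implementation.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, experts, gate_weights):
    """Combine expert outputs with content-dependent gate scores,
    so feature emphasis adapts to the input."""
    scores = softmax([w * x for w in gate_weights])  # gating on the input
    outputs = [f(x) for f in experts]
    return sum(s * o for s, o in zip(scores, outputs))

experts = [lambda x: 2 * x, lambda x: x + 1]   # toy experts
y = moe_layer(3.0, experts, gate_weights=[0.5, -0.5])
```

The output is a convex combination of the expert outputs, with weights that change per input; deeper stages of the backbone stack such gated blocks.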
[CV-107] A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在病理学应用中推理能力受限、任务范式单一的问题。具体而言,现有方法依赖昂贵的思维链(chain-of-thought)标注,且仅能处理局部感兴趣区域(region-of-interest, ROI)层面的视觉问答(VQA)任务,难以满足临床实践中包括ROI分类、检测、分割以及全切片图像(whole-slide image, WSI)分类等多样化诊断需求。解决方案的关键在于提出SmartPath-R1框架,通过尺度感知的监督微调与任务导向的强化微调相结合,无需链式思维标注即可激发模型内在知识以增强病理推理能力;同时引入基于专家混合(mixture-of-experts)机制实现多尺度、多任务协同分析,从而支持ROI级与WSI级任务的统一处理,显著提升模型的泛化性与实用性。
链接: https://arxiv.org/abs/2507.17303
作者: Zhe Xu,Ziyi Liu,Junlin Hou,Jiabo Ma,Cheng Jin,Yihui Wang,Zhixuan Chen,Zhengyu Zhang,Zhengrui Guo,Fengtao Zhou,Yingxue Xu,Xi Wang,Ronald Cheong Kin Chan,Li Liang,Hao Chen
机构: Hong Kong University of Science and Technology (香港科技大学); Southern Medical University (南方医科大学); Chinese University of Hong Kong (香港中文大学); Jinfeng Laboratory (金凤实验室)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) have emerged as powerful tools for computational pathology, offering unprecedented opportunities to integrate pathological images with language context for comprehensive diagnostic analysis. These models hold particular promise for automating complex tasks that traditionally require expert interpretation of pathologists. However, current MLLM approaches in pathology demonstrate significantly constrained reasoning capabilities, primarily due to their reliance on expensive chain-of-thought annotations. Additionally, existing methods remain limited to simplex application of visual question answering (VQA) at region-of-interest (ROI) level, failing to address the full spectrum of diagnostic needs such as ROI classification, detection, segmentation, whole-slide-image (WSI) classification and VQA in clinical practice. In this study, we present SmartPath-R1, a versatile MLLM capable of simultaneously addressing both ROI-level and WSI-level tasks while demonstrating robust pathological reasoning capability. Our framework combines scale-dependent supervised fine-tuning and task-aware reinforcement fine-tuning, which circumvents the requirement for chain-of-thought supervision by leveraging the intrinsic knowledge within MLLM. Furthermore, SmartPath-R1 integrates multiscale and multitask analysis through a mixture-of-experts mechanism, enabling dynamic processing for diverse tasks. We curate a large-scale dataset comprising 2.3M ROI samples and 188K WSI samples for training and evaluation. Extensive experiments across 72 tasks validate the effectiveness and superiority of the proposed approach. This work represents a significant step toward developing versatile, reasoning-enhanced AI systems for precision pathology.
zh
[CV-108] MyGO: Make your Goals Obvious Avoiding Semantic Confusion in Prostate Cancer Lesion Region Segmentation
【速读】:该论文旨在解决前列腺癌(Prostate Cancer, PCa)医学图像分割中因病灶区域与非病灶区域语义相似度高而导致的语义混淆问题,从而提升病灶定位与进展识别的准确性。其解决方案的关键在于提出一种新颖的Pixel Anchor Module,该模块通过引导模型发现稀疏的特征锚点(feature anchors),以捕捉和解析全局上下文信息,增强模型的非线性表达能力,进而提高病灶区域的分割精度;同时结合基于自注意力机制的Top_k选择策略优化锚点识别,并引入焦点损失(focal loss)缓解类别不平衡问题,显著提升了复杂场景下的语义判别能力。
链接: https://arxiv.org/abs/2507.17269
作者: Zhengcheng Lin(1),Zuobin Ying(2),Zhenyu Li(3),Zhenyu Liu(4),Jian Lu(5),Weiping Ding(6) ((1), (2) City University of Macau, (3) Shandong University, (4) Chinese Academy of Sciences, (5) Peking University, (6) Nantong University)
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Early diagnosis and accurate identification of lesion location and progression in prostate cancer (PCa) are critical for assisting clinicians in formulating effective treatment strategies. However, due to the high semantic homogeneity between lesion and non-lesion areas, existing medical image segmentation methods often struggle to accurately comprehend lesion semantics, resulting in the problem of semantic confusion. To address this challenge, we propose a novel Pixel Anchor Module, which guides the model to discover a sparse set of feature anchors that serve to capture and interpret global contextual information. This mechanism enhances the model’s nonlinear representation capacity and improves segmentation accuracy within lesion regions. Moreover, we design a self-attention-based Top_k selection strategy to further refine the identification of these feature anchors, and incorporate a focal loss function to mitigate class imbalance, thereby facilitating more precise semantic interpretation across diverse regions. Our method achieves state-of-the-art performance on the PI-CAI dataset, demonstrating 69.73% IoU and 74.32% Dice scores, and significantly improving prostate cancer lesion detection.
zh
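The focal loss that the MyGO entry uses to mitigate lesion/non-lesion class imbalance has a compact closed form. A minimal binary sketch (gamma and alpha values are the common defaults, assumed rather than taken from the paper):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one pixel: p is the predicted probability
    of the positive (lesion) class, y is the 0/1 label. The factor
    (1 - p_t)**gamma down-weights easy, well-classified pixels."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A well-classified lesion pixel contributes far less than a hard one,
# so abundant easy background pixels no longer dominate training.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```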
[CV-109] Harmonization in Magnetic Resonance Imaging: A Survey of Acquisition Image-level and Feature-level Methods
【速读】:该论文旨在解决医学影像数据在不同扫描仪、采集协议或成像站点间存在的显著异质性问题,即“批次效应”(batch effects)或“站点效应”(site effects),此类非生物因素导致的变异会掩盖真实的生物学信号,降低模型的可重复性和统计功效,并严重损害基于学习的模型在跨数据集上的泛化能力。其解决方案的关键在于图像调和(image harmonization),通过消除或减轻与站点相关的偏差,同时保留有意义的生物学信息,从而提升数据的可比性和一致性。论文系统梳理了从前瞻性采集与重建策略到回顾性图像级和特征级方法,以及基于旅行受试者的技术路径,并重点聚焦于深度学习驱动的调和方法,以实现更高效、鲁棒的多中心医学影像标准化处理。
链接: https://arxiv.org/abs/2507.16962
作者: Qinqin Yang,Firoozeh Shomal-Zadeh,Ali Gholipour
机构: University of California Irvine (加州大学欧文分校); Case Western Reserve University (凯斯西储大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 20 pages, 6 figures, 2 tables
Abstract:Modern medical imaging technologies have greatly advanced neuroscience research and clinical diagnostics. However, imaging data collected across different scanners, acquisition protocols, or imaging sites often exhibit substantial heterogeneity, known as “batch effects” or “site effects”. These non-biological sources of variability can obscure true biological signals, reduce reproducibility and statistical power, and severely impair the generalizability of learning-based models across datasets. Image harmonization aims to eliminate or mitigate such site-related biases while preserving meaningful biological information, thereby improving data comparability and consistency. This review provides a comprehensive overview of key concepts, methodological advances, publicly available datasets, current challenges, and future directions in the field of medical image harmonization, with a focus on magnetic resonance imaging (MRI). We systematically cover the full imaging pipeline, and categorize harmonization approaches into prospective acquisition and reconstruction strategies, retrospective image-level and feature-level methods, and traveling-subject-based techniques. Rather than providing an exhaustive survey, we focus on representative methods, with particular emphasis on deep learning-based approaches. Finally, we summarize the major challenges that remain and outline promising avenues for future research.
zh
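For the retrospective feature-level harmonization category surveyed above, the core operation can be sketched as per-site standardization to a common reference scale. This is only the simplest form of the idea; real methods such as ComBat additionally model biological covariates, which this toy omits.

```python
def harmonize(features_by_site, ref_mean=0.0, ref_std=1.0):
    """Map each site's features to a shared mean/variance, removing
    site-level location and scale effects."""
    harmonized = {}
    for site, xs in features_by_site.items():
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        std = var ** 0.5 or 1.0   # guard against constant features
        harmonized[site] = [ref_mean + ref_std * (x - mu) / std for x in xs]
    return harmonized

# Two sites measuring the same biology on very different scanner scales:
data = {"siteA": [10.0, 12.0, 14.0], "siteB": [100.0, 120.0, 140.0]}
out = harmonize(data)
```

After harmonization the two sites' feature distributions coincide, so a downstream model no longer learns the scanner instead of the biology.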
[CV-110] A Hybrid CNN-VSSM model for Multi-View Multi-Task Mammography Analysis: Robust Diagnosis with Attention-Based Fusion
【速读】:该论文旨在解决乳腺癌筛查中早期准确解读乳腺X线摄影(mammography)图像的难题,尤其针对现有人工智能(AI)方法因仅依赖单视图输入或单一任务输出而导致的临床实用性不足问题。其解决方案的关键在于提出了一种多视图、多任务的混合深度学习框架,该框架能够同时处理标准四视图乳腺影像,并联合预测每个乳房的诊断标签与BI-RADS评分。该架构融合了卷积神经网络(CNN)与视觉状态空间模型(Visual State Space Models, VSSMs),以提取局部特征并捕捉全局上下文依赖关系;并通过门控注意力融合模块动态加权不同视图的信息,增强模型对缺失数据的鲁棒性和可解释性。实验表明,该混合模型在多种分类任务中均显著优于基线CNN和VSSM模型,尤其在二分类(BI-RADS 1 vs. 5)任务中达到AUC 0.9967和F1 0.9830,验证了其在提升诊断性能方面的有效性。
链接: https://arxiv.org/abs/2507.16955
作者: Yalda Zafari,Roaa Elalfy,Mohamed Mabrok,Somaya Al-Maadeed,Tamer Khattab,Essam A. Rashed
机构: Qatar University (卡塔尔大学); University of Hyogo (兵库县立大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Early and accurate interpretation of screening mammograms is essential for effective breast cancer detection, yet it remains a complex challenge due to subtle imaging findings and diagnostic ambiguity. Many existing AI approaches fall short by focusing on single view inputs or single-task outputs, limiting their clinical utility. To address these limitations, we propose a novel multi-view, multitask hybrid deep learning framework that processes all four standard mammography views and jointly predicts diagnostic labels and BI-RADS scores for each breast. Our architecture integrates a hybrid CNN VSSM backbone, combining convolutional encoders for rich local feature extraction with Visual State Space Models (VSSMs) to capture global contextual dependencies. To improve robustness and interpretability, we incorporate a gated attention-based fusion module that dynamically weights information across views, effectively handling cases with missing data. We conduct extensive experiments across diagnostic tasks of varying complexity, benchmarking our proposed hybrid models against baseline CNN architectures and VSSM models in both single task and multi task learning settings. Across all tasks, the hybrid models consistently outperform the baselines. In the binary BI-RADS 1 vs. 5 classification task, the shared hybrid model achieves an AUC of 0.9967 and an F1 score of 0.9830. For the more challenging ternary classification, it attains an F1 score of 0.7790, while in the five-class BI-RADS task, the best F1 score reaches 0.4904. These results highlight the effectiveness of the proposed hybrid framework and underscore both the potential and limitations of multitask learning for improving diagnostic performance and enabling clinically meaningful mammography analysis.
zh
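The gated attention-based fusion module described above can be sketched as softmax-weighted pooling over the four standard views, with absent views simply excluded. Gate parameters here are placeholders, not trained values, and the view names are the usual mammography conventions assumed for illustration.

```python
import math

def gated_fusion(view_feats, gate_params):
    """view_feats: dict view -> feature vector, or None when missing.
    Missing views get no weight, which is how the fusion stays robust."""
    present = {v: f for v, f in view_feats.items() if f is not None}
    logits = {v: sum(w * x for w, x in zip(gate_params[v], f))
              for v, f in present.items()}
    m = max(logits.values())
    weights = {v: math.exp(z - m) for v, z in logits.items()}
    total = sum(weights.values())
    weights = {v: w / total for v, w in weights.items()}
    dim = len(next(iter(present.values())))
    fused = [sum(weights[v] * present[v][i] for v in present)
             for i in range(dim)]
    return fused, weights

views = {"L-CC": [1.0, 0.0], "L-MLO": [0.0, 1.0],
         "R-CC": None, "R-MLO": [0.5, 0.5]}    # R-CC view missing
params = {v: [0.1, 0.2] for v in views}
fused, w = gated_fusion(views, params)
```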
[CV-111] A tissue and cell-level annotated H&E and PD-L1 histopathology image dataset in non-small cell lung cancer
【速读】:该论文旨在解决当前非小细胞肺癌(NSCLC)数字病理学数据集在支持免疫治疗生物标志物开发方面的局限性问题,具体表现为数据集覆盖范围有限、缺乏临床常见的转移部位标注以及缺失如PD-L1免疫组化(IHC)等分子信息。解决方案的关键在于构建并公开发布IGNITE数据工具包,这是一个多染色、多中心、多扫描仪的标注NSCLC全切片图像数据集,包含887个完全标注的兴趣区域,涵盖三个互补任务:(i) HE染色切片中组织区室的多类语义分割(16类,包括原发性和转移性NSCLC),(ii) 细胞核检测,(iii) PD-L1 IHC切片中PD-L1阳性肿瘤细胞检测。该数据集是首个公开提供HE染色转移部位手动标注和PD-L1 IHC信息的NSCLC数据资源,为计算量化肿瘤免疫微环境(TIME)特征提供了重要基础。
链接: https://arxiv.org/abs/2507.16855
作者: Joey Spronck,Leander van Eekelen,Dominique van Midden,Joep Bogaerts,Leslie Tessier,Valerie Dechering,Muradije Demirel-Andishmand,Gabriel Silva de Souza,Roland Nemeth,Enrico Munari,Giuseppe Bogina,Ilaria Girolami,Albino Eccher,Balazs Acs,Ceren Boyaci,Natalie Klubickova,Monika Looijen-Salamon,Shoko Vos,Francesco Ciompi
机构: Radboud University Medical Center (奈梅亨大学医学中心); University of Brescia (布雷西亚大学); Ospedale Sacro Cuore (神圣心脏医院); Provincial Hospital of Bolzano (博尔扎诺省立医院); University and Hospital Trust of Verona (维罗纳大学与医院信托); Karolinska University Hospital (卡罗林斯卡大学医院); Biopticka Laboratory, Ltd (生物切片实验室有限公司)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Our dataset is available at this https URL and our code is available at this https URL
Abstract:The tumor immune microenvironment (TIME) in non-small cell lung cancer (NSCLC) histopathology contains morphological and molecular characteristics predictive of immunotherapy response. Computational quantification of TIME characteristics, such as cell detection and tissue segmentation, can support biomarker development. However, currently available digital pathology datasets of NSCLC for the development of cell detection or tissue segmentation algorithms are limited in scope, lack annotations of clinically prevalent metastatic sites, and forgo molecular information such as PD-L1 immunohistochemistry (IHC). To fill this gap, we introduce the IGNITE data toolkit, a multi-stain, multi-centric, and multi-scanner dataset of annotated NSCLC whole-slide images. We publicly release 887 fully annotated regions of interest from 155 unique patients across three complementary tasks: (i) multi-class semantic segmentation of tissue compartments in H&E-stained slides, with 16 classes spanning primary and metastatic NSCLC, (ii) nuclei detection, and (iii) PD-L1 positive tumor cell detection in PD-L1 IHC slides. To the best of our knowledge, this is the first public NSCLC dataset with manual annotations of H&E in metastatic sites and PD-L1 IHC.
zh
人工智能
[AI-0] Flow Matching Meets Biology and Life Science: A Survey
【速读】:该论文旨在系统梳理和总结流匹配(flow matching)这一新兴生成建模方法在生物领域的最新进展,解决当前缺乏对流匹配技术及其生物学应用全面综述的问题。其解决方案的关键在于:首先从理论基础与变体角度对流匹配进行系统性回顾,继而将其在生物领域的应用划分为三大核心方向——生物序列建模、分子生成与设计、以及肽和蛋白质生成,并针对每个方向深入分析近期研究进展;同时汇总常用数据集与开源工具,为后续研究提供资源支持与未来发展方向的洞察。
链接: https://arxiv.org/abs/2507.17731
作者: Zihao Li,Zhichen Zeng,Xiao Lin,Feihao Fang,Yanru Qu,Zhe Xu,Zhining Liu,Xuying Ning,Tianxin Wei,Ge Liu,Hanghang Tong,Jingrui He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint, 27 pages
Abstract:Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, drug discovery, and beyond. At the same time, biological applications have served as valuable testbeds for evaluating the capabilities of generative models. Recently, flow matching has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in its application to problems in biology and life sciences. This paper presents the first comprehensive survey of recent developments in flow matching and its applications in biological domains. We begin by systematically reviewing the foundations and variants of flow matching, and then categorize its applications into three major areas: biological sequence modeling, molecule generation and design, and peptide and protein generation. For each, we provide an in-depth review of recent progress. We also summarize commonly used datasets and software tools, and conclude with a discussion of potential future directions. The corresponding curated resources are available at this https URL.
zh
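The flow matching objective this survey builds on is simple to state: sample a noise point x0 and a data point x1, interpolate x_t = (1-t)x0 + t*x1, and regress a velocity model onto the target x1 - x0. A minimal sketch (the "model" is a trivial placeholder so the example stays self-contained):

```python
import random

def cfm_loss(v_model, x0, x1, t):
    """Conditional flow matching loss for one (x0, x1, t) sample."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]          # constant velocity
    pred = v_model(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(target)

# For a single (x0, x1) pair the optimal velocity field is the constant
# x1 - x0; a model that predicts it drives the loss to zero at every t.
x0, x1 = [0.0, 0.0], [1.0, 2.0]
perfect = lambda x_t, t: [1.0, 2.0]
random.seed(0)
losses = [cfm_loss(perfect, x0, x1, random.random()) for _ in range(5)]
```

This regression form avoids simulating the probability path during training, which is the efficiency advantage over diffusion training mentioned in the abstract.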
[AI-1] Online Submission and Evaluation System Design for Competition Operations
【速读】:该论文旨在解决科研竞赛中因参赛算法提交与评估流程繁琐、环境兼容性差以及组织者管理负担重等问题。其核心解决方案是构建一个在线竞赛系统,通过隔离环境自动执行提交与评估流程,从而实现对大量参赛作品的高效、标准化处理,已在网格路径规划竞赛和机器人跑者联赛等实际场景中成功应用。
链接: https://arxiv.org/abs/2507.17730
作者: Zhe Chen,Daniel Harabor,Ryan Hechnenberger,Nathan R. Sturtevant
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This work was presented at the Workshop on the International Planning Competition (WIPC 2024)
Abstract:Research communities have developed benchmark datasets across domains to compare the performance of algorithms and techniques. However, tracking the progress in these research areas is not easy, as publications appear in different venues at the same time, and many of them claim to represent the state-of-the-art. To address this, research communities often organise periodic competitions to evaluate the performance of various algorithms and techniques, thereby tracking advancements in the field. However, these competitions pose a significant operational burden. The organisers must manage and evaluate a large volume of submissions. Furthermore, participants typically develop their solutions in diverse environments, leading to compatibility issues during the evaluation of their submissions. This paper presents an online competition system that automates the submission and evaluation process for a competition. The competition system allows organisers to manage large numbers of submissions efficiently, utilising isolated environments to evaluate submissions. This system has already been used successfully for several competitions, including the Grid-Based Pathfinding Competition and the League of Robot Runners competition.
zh
[AI-2] hinking Isnt an Illusion: Overcoming the Limitations of Reasoning Models via Tool Augmentations
【速读】:该论文旨在解决当前关于生成式 AI(Generative AI)中“推理模型”(Large Reasoning Models, LRMs)有效性存疑的问题,即近期研究发现,带有显式思维链(step-by-step thinking process)的LRM在复杂推理任务中并不一定优于不具显式推理机制的模型。为验证这一现象是否在引入工具增强后依然存在,作者通过引入两类工具——Python解释器和草稿板(scratchpads),对三组代表性大语言模型(LLMs)及其对应的LRM版本进行评估。关键解决方案在于:在合理使用外部工具的前提下,LRMs在所有任务复杂度水平上均显著优于其非推理版本,从而表明工具增强可有效释放LRM的推理潜力,挑战了“推理是幻觉”的观点,并揭示了工具增强型LRM在处理复杂问题上的实际优势。
链接: https://arxiv.org/abs/2507.17699
作者: Zhao Song,Song Yue,Jiahao Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) have become a central focus in today’s large language model (LLM) research, where models are designed to output a step-by-step thinking process before arriving at a final answer to handle complex reasoning tasks. Despite their promise, recent empirical studies (e.g., [Shojaee et al., 2025] from Apple) suggest that this thinking process may not actually enhance reasoning ability, where LLMs without explicit reasoning actually outperform LRMs on tasks with low or high complexity. In this work, we revisit these findings and investigate whether the limitations of LRMs persist when tool augmentations are introduced. We incorporate two types of tools, Python interpreters and scratchpads, and evaluate three representative LLMs and their LRM counterparts on Apple’s benchmark reasoning puzzles. Our results show that, with proper tool use, LRMs consistently outperform their non-reasoning counterparts across all levels of task complexity. These findings challenge the recent narrative that reasoning is an illusion and highlight the potential of tool-augmented LRMs for solving complex problems.
zh
[AI-3] Symbiotic Agents : A Novel Paradigm for Trustworthy AGI-driven Networks
【速读】:该论文旨在解决当前6G网络中因传统专用智能方法(specialized intelligence approach)导致的决策效率低、适应性差的问题,即如何实现具备广泛推理能力的通用人工智能(AGI)驱动的网络管理与服务提供。其核心挑战在于提升大语言模型(LLM)在实时场景下的准确性、可控性和资源效率,同时确保可信AI(Trustworthy AI)特性。解决方案的关键在于提出一种“共生代理”(symbiotic agents)新范式,通过将LLM与实时优化算法深度融合:输入层优化器提供数值任务的不确定性边界控制,输出层优化器由LLM监督以实现自适应实时调控;该架构支持两类新型代理——无线接入网(RAN)优化器和SLA多代理协商器,并在5G测试床中验证了其有效性,结果显示决策错误降低5倍,且使用小型语言模型(SLM)可实现近实时响应(82ms),GPU资源消耗减少99.9%,同时显著提升SLA灵活性和RAN资源利用率。
链接: https://arxiv.org/abs/2507.17695
作者: Ilias Chatzistefanidis,Navid Nikaein
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Submitted to Computer Networks AI for 6G
Abstract:Large Language Model (LLM)-based autonomous agents are expected to play a vital role in the evolution of 6G networks, by empowering real-time decision-making related to management and service provisioning to end-users. This shift facilitates the transition from a specialized intelligence approach, where artificial intelligence (AI) algorithms handle isolated tasks, to artificial general intelligence (AGI)-driven networks, where agents possess broader reasoning capabilities and can manage diverse network functions. In this paper, we introduce a novel agentic paradigm that combines LLMs with real-time optimization algorithms towards Trustworthy AI, defined as symbiotic agents. Optimizers at the LLM’s input-level provide bounded uncertainty steering for numerically precise tasks, whereas output-level optimizers supervised by the LLM enable adaptive real-time control. We design and implement two novel agent types including: (i) Radio Access Network optimizers, and (ii) multi-agent negotiators for Service-Level Agreements (SLAs). We further propose an end-to-end architecture for AGI networks and evaluate it on a 5G testbed capturing channel fluctuations from moving vehicles. Results show that symbiotic agents reduce decision errors fivefold compared to standalone LLM-based agents, while smaller language models (SLM) achieve similar accuracy with a 99.9% reduction in GPU resource overhead and in near-real-time loops of 82 ms. A multi-agent demonstration for collaborative RAN on the real-world testbed highlights significant flexibility in service-level agreement and resource allocation, reducing RAN over-utilization by approximately 44%. Drawing on our findings and open-source implementations, we introduce the symbiotic paradigm as the foundation for next-generation, AGI-driven networks-systems designed to remain adaptable, efficient, and trustworthy even as LLMs advance.
zh
[AI-4] CASCADE: LLM -Powered JavaScript Deobfuscator at Google
【速读】:该论文旨在解决JavaScript代码中广泛存在的混淆(obfuscation)问题,该问题严重阻碍了软件测试、静态分析和恶意代码检测等任务的进行。现有静态或动态去混淆技术通常依赖大量硬编码规则,难以应对复杂多变的混淆手法且缺乏灵活性。解决方案的关键在于提出一种名为CASCADE的混合方法:利用Gemini的先进编码能力识别关键前置函数(prelude functions),这些函数是当前最常见混淆技术的基础组件;随后通过JavaScript中间表示(JSIR)实现确定性的代码变换,从而恢复原始字符串、API名称等语义信息,并揭示程序的真实行为。此方法显著减少了对人工规则的依赖,同时提升了去混淆的可靠性与适应性,已在Google生产环境中部署并取得显著成效。
链接: https://arxiv.org/abs/2507.17691
作者: Shan Jiang,Pranoy Kovuri,David Tao,Zhixun Tan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:
Abstract:Software obfuscation, particularly prevalent in JavaScript, hinders code comprehension and analysis, posing significant challenges to software testing, static analysis, and malware detection. This paper introduces CASCADE, a novel hybrid approach that integrates the advanced coding capabilities of Gemini with the deterministic transformation capabilities of a compiler Intermediate Representation (IR), specifically JavaScript IR (JSIR). By employing Gemini to identify critical prelude functions, the foundational components underlying the most prevalent obfuscation techniques, and leveraging JSIR for subsequent code transformations, CASCADE effectively recovers semantic elements like original strings and API names, and reveals original program behaviors. This method overcomes limitations of existing static and dynamic deobfuscation techniques, eliminating hundreds to thousands of hardcoded rules while achieving reliability and flexibility. CASCADE is already deployed in Google’s production environment, demonstrating substantial improvements in JavaScript deobfuscation efficiency and reducing reverse engineering efforts.
zh
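The "prelude functions" CASCADE targets are typically string-table accessors: obfuscators route every string literal through a lookup into a rotated, encoded table. A toy reconstruction of that pattern (the table, rotation, and encoding below are invented for the sketch, not CASCADE's actual pipeline):

```python
import base64

# Build an encoded string table and rotate it, as an obfuscator would.
TABLE = [base64.b64encode(s.encode()).decode()
         for s in ["fetch", "eval", "document.cookie"]]
ROTATION = 1
ROTATED = TABLE[ROTATION:] + TABLE[:ROTATION]

def prelude_lookup(index):
    """The obfuscated accessor: un-rotate the table, then decode."""
    raw = ROTATED[(index - ROTATION) % len(ROTATED)]
    return base64.b64decode(raw).decode()

# Deobfuscation amounts to constant-folding every prelude_lookup(i)
# call site to its value, recovering original strings and API names.
recovered = [prelude_lookup(i) for i in range(3)]
```

Once an LLM identifies which function plays this prelude role, the folding itself is a deterministic IR transformation, which is the division of labor the abstract describes.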
[AI-5] Simulating multiple human perspectives in socio-ecological systems using large language models
【速读】:该论文旨在解决 socio-ecological systems(社会-生态系统)研究中因利益相关者(stakeholder)视角多样且难以获取而导致的跨视角理解困难问题。其解决方案的关键在于提出并实现 HoPeS(Human-Oriented Perspective Shifting)建模框架,该框架利用大语言模型(LLM)驱动的代理(agent)模拟不同利益相关者角色,通过结构化的仿真协议(simulation protocol)作为“脚手架”,支持用户在多个视角间切换、反思与整合,从而实现对复杂系统中认知差异的沉浸式探索。
链接: https://arxiv.org/abs/2507.17680
作者: Yongchao Zeng,Calum Brown,Ioannis Kyriakou,Ronja Hotz,Mark Rounsevell
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Understanding socio-ecological systems requires insights from diverse stakeholder perspectives, which are often hard to access. To enable alternative, simulation-based exploration of different stakeholder perspectives, we develop the HoPeS (Human-Oriented Perspective Shifting) modelling framework. HoPeS employs agents powered by large language models (LLMs) to represent various stakeholders; users can step into the agent roles to experience perspectival differences. A simulation protocol serves as a “scaffold” to streamline multiple perspective-taking simulations, supporting users in reflecting on, transitioning between, and integrating across perspectives. A prototype system is developed to demonstrate HoPeS in the context of institutional dynamics and land use change, enabling both narrative-driven and numerical experiments. In an illustrative experiment, a user successively adopts the perspectives of a system observer and a researcher - a role that analyses data from the embedded land use model to inform evidence-based decision-making for other LLM agents representing various institutions. Despite the user’s effort to recommend technically sound policies, discrepancies persist between the policy recommendation and implementation due to stakeholders’ competing advocacies, mirroring real-world misalignment between researcher and policymaker perspectives. The user’s reflection highlights the subjective feelings of frustration and disappointment as a researcher, especially due to the challenge of maintaining political neutrality while attempting to gain political influence. Despite this, the user exhibits high motivation to experiment with alternative narrative framing strategies, suggesting the system’s potential in exploring different perspectives. Further system and protocol refinement are likely to enable new forms of interdisciplinary collaboration in socio-ecological simulations.
zh
[AI-6] How Should We Meta-Learn Reinforcement Learning Algorithms?
【速读】:该论文试图解决的问题是:当前元学习(meta-learning)算法在强化学习(Reinforcement Learning, RL)中的应用缺乏系统性比较,尤其是不同元学习方法(如基于进化算法优化黑盒函数或利用大语言模型(Large Language Model, LLM)生成代码)在提升RL算法性能方面的有效性、效率与可解释性差异不明确。解决方案的关键在于:通过实证比较多种元学习算法在RL流水线不同模块上的表现,综合评估其元训练和元测试性能、可解释性、样本成本及训练时间等指标,从而提出一套指导原则,用于设计更高效、高性能的元学习RL算法。
链接: https://arxiv.org/abs/2507.17668
作者: Alexander David Goldie,Zilin Wang,Jakob Nicolaus Foerster,Shimon Whiteson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted paper at Reinforcement Learning Conference (RLC) 2025
Abstract:The process of meta-learning algorithms from data, instead of relying on manual design, is growing in popularity as a paradigm for improving the performance of machine learning systems. Meta-learning shows particular promise for reinforcement learning (RL), where algorithms are often adapted from supervised or unsupervised learning despite their suboptimality for RL. However, until now there has been a severe lack of comparison between different meta-learning algorithms, such as using evolution to optimise over black-box functions or LLMs to propose code. In this paper, we carry out this empirical comparison of the different approaches when applied to a range of meta-learned algorithms which target different parts of the RL pipeline. In addition to meta-train and meta-test performance, we also investigate factors including the interpretability, sample cost and train time for each meta-learning algorithm. Based on these findings, we propose several guidelines for meta-learning new RL algorithms which will help ensure that future learned algorithms are as performant as possible.
zh
[AI-7] Enhancing Quantum Federated Learning with Fisher Information-Based Optimization
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际应用中面临的高通信成本、客户端数据异构性、训练时间长以及隐私泄露风险等问题。其核心挑战在于如何在不共享原始数据的前提下,高效且稳健地聚合多个客户端的本地模型参数以提升全局模型性能。解决方案的关键在于提出一种量子联邦学习(Quantum Federated Learning, QFL)算法,利用局部客户端模型上的费舍尔信息(Fisher Information)来识别对量子模型性能影响显著的关键参数,并在聚合过程中优先保留这些参数,从而增强模型的鲁棒性和收敛效率。实验结果表明,该方法在ADNI和MNIST数据集上优于传统的量子联邦平均方法,验证了其有效性和可行性。
链接: https://arxiv.org/abs/2507.17580
作者: Amandeep Singh Bhatia,Sabre Kais
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
备注:
Abstract:Federated Learning (FL) has become increasingly popular across different sectors, offering a way for clients to work together to train a global model without sharing sensitive data. It involves multiple rounds of communication between the global model and participating clients, which introduces several challenges like high communication costs, heterogeneous client data, prolonged processing times, and increased vulnerability to privacy threats. In recent years, the convergence of federated learning and parameterized quantum circuits has sparked significant research interest, with promising implications for fields such as healthcare and finance. By enabling decentralized training of quantum models, it allows clients or institutions to collaboratively enhance model performance and outcomes while preserving data privacy. Recognizing that Fisher information can quantify the amount of information that a quantum state carries under parameter changes, thereby providing insight into its geometric and statistical properties, we intend to leverage this property to address the aforementioned challenges. In this work, we propose a Quantum Federated Learning (QFL) algorithm that makes use of the Fisher information computed on local client models, with data distributed across heterogeneous partitions. This approach identifies the critical parameters that significantly influence the quantum model’s performance, ensuring they are preserved during the aggregation process. Our research assessed the effectiveness and feasibility of QFL by comparing its performance against other variants, and exploring the benefits of incorporating Fisher information in QFL settings. Experimental results on ADNI and MNIST datasets demonstrate the effectiveness of our approach in achieving better performance and robustness against the quantum federated averaging method.
zh
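The aggregation idea in the QFL entry can be sketched as a per-parameter average weighted by (diagonal) Fisher information, so parameters a client is "confident" about dominate the global update. The Fisher values below are given directly for illustration; in practice they are estimated from local models.

```python
def fisher_weighted_aggregate(client_params, client_fisher):
    """Per-parameter Fisher-weighted average across clients."""
    n_params = len(client_params[0])
    agg = []
    for j in range(n_params):
        num = sum(p[j] * f[j] for p, f in zip(client_params, client_fisher))
        den = sum(f[j] for f in client_fisher)
        agg.append(num / den if den else
                   sum(p[j] for p in client_params) / len(client_params))
    return agg

params = [[1.0, 0.0], [0.0, 1.0]]   # two clients, two parameters
fisher = [[9.0, 1.0], [1.0, 1.0]]   # client 0 is informative on param 0
global_params = fisher_weighted_aggregate(params, fisher)
```

Plain federated averaging would return [0.5, 0.5]; the Fisher-weighted version preserves the first parameter of the high-information client, which is the preservation behavior the abstract highlights.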
[AI-8] Federated Majorize-Minimization: Beyond Parameter Aggregation
【速读】:该论文旨在解决联邦学习(Federated Learning)场景下随机优化算法的鲁棒扩展问题,特别是如何在数据异质性、部分参与和通信约束等实际挑战中设计高效且统一的优化框架。解决方案的关键在于提出一种基于Majorize-Minimization(MM)理论的统一框架,并由此导出Stochastic Approximation Stochastic Surrogate MM(SSMM)算法,该算法通过学习局部代理函数(surrogate majorizing function)而非原始参数进行聚合,从而实现对多种经典优化方法(如梯度法、EM算法等)的统一建模与扩展;进一步地,将此机制应用于联邦设置,提出了QSMM算法,显著提升了算法在复杂分布式环境下的适应性和灵活性。
链接: https://arxiv.org/abs/2507.17534
作者: Aymeric Dieuleveut,Gersende Fort,Mahmoud Hegazy,Hoi-To Wai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:This paper proposes a unified approach for designing stochastic optimization algorithms that robustly scale to the federated learning setting. Our work studies a class of Majorize-Minimization (MM) problems, which possesses a linearly parameterized family of majorizing surrogate functions. This framework encompasses (proximal) gradient-based algorithms for (regularized) smooth objectives, the Expectation Maximization algorithm, and many problems seen as variational surrogate MM. We show that our framework motivates a unifying algorithm called Stochastic Approximation Stochastic Surrogate MM (SSMM), which includes previous stochastic MM procedures as special instances. We then extend SSMM to the federated setting, while taking into consideration common bottlenecks such as data heterogeneity, partial participation, and communication constraints; this yields QSMM. The originality of QSMM is to learn locally and then aggregate information characterizing the surrogate majorizing function, contrary to classical algorithms which learn and aggregate the original parameter. Finally, to showcase the flexibility of this methodology beyond our theoretical setting, we use it to design an algorithm for computing optimal transport maps in the federated setting.
zh
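"聚合代理函数而非原始参数"的核心思想可用一个确定性的玩具例子示意:各客户端上报局部二次优超函数(majorizer)的参数(此处即当前点的梯度,并共享Lipschitz常数L),服务器最小化平均后的代理函数。论文中的SSMM/QSMM是随机版本并处理部分参与与通信约束,以下仅为简化草图。

```python
import numpy as np

def federated_surrogate_mm(grad_fns, x0, lipschitz, n_rounds=50):
    """Deterministic core of the surrogate-aggregation idea:
    each client sends the parameters of its local quadratic majorizer
    g_i(x) = f_i(x_t) + grad_i(x_t)*(x - x_t) + (L/2)*(x - x_t)^2,
    i.e. just grad_i(x_t); the server minimizes the averaged majorizer,
    giving x_{t+1} = x_t - mean(grad)/L."""
    x = float(x0)
    for _ in range(n_rounds):
        surrogate_grads = [g(x) for g in grad_fns]     # local surrogate params
        x = x - np.mean(surrogate_grads) / lipschitz   # minimize averaged majorizer
    return x

# Clients hold f_i(x) = 0.5 * (x - a_i)^2, so grad_i(x) = x - a_i and L = 1.
anchors = [1.0, 3.0, 8.0]
grad_fns = [lambda x, a=a: x - a for a in anchors]
x_star = federated_surrogate_mm(grad_fns, x0=0.0, lipschitz=1.0)
print(round(x_star, 4))  # converges to the mean of the anchors, 4.0
```

在这一简化情形下,"聚合代理函数"退化为聚合梯度;QSMM的意义在于对更一般的(如EM型)代理函数同样只需传输其线性参数化表示。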
[AI-9] Integrating Physics-Based and Data-Driven Approaches for Probabilistic Building Energy Modeling
【速读】:该论文旨在解决当前建筑能耗建模中两个关键研究空白:一是现有混合方法多聚焦于确定性建模,忽视了天气波动和人员行为等带来的固有不确定性;二是缺乏在概率建模框架下的系统性比较。其解决方案的关键在于引入五种代表性混合方法进行概率化评估,重点基于真实案例中的建筑热力学量(如室内温度)的分位数预测性能进行分析,并采用分位数校准预测(Quantile Conformal Prediction)对预测结果进行校准,从而提升模型的可靠性与物理合理性。实验表明,残差学习(residual learning)结合前馈神经网络(Feedforward Neural Network)在多数场景下表现最优,且在分布外测试数据上仍能生成物理直观的预测结果。
链接: https://arxiv.org/abs/2507.17526
作者: Leandro Von Krannichfeldt,Kristina Orehounig,Olga Fink
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Building energy modeling is a key tool for optimizing the performance of building energy systems. Historically, a wide spectrum of methods has been explored – ranging from conventional physics-based models to purely data-driven techniques. Recently, hybrid approaches that combine the strengths of both paradigms have gained attention. These include strategies such as learning surrogates for physics-based models, modeling residuals between simulated and observed data, fine-tuning surrogates with real-world measurements, using physics-based outputs as additional inputs for data-driven models, and integrating the physics-based output into the loss function of the data-driven model. Despite this progress, two significant research gaps remain. First, most hybrid methods focus on deterministic modeling, often neglecting the inherent uncertainties caused by factors like weather fluctuations and occupant behavior. Second, there has been little systematic comparison within a probabilistic modeling framework. This study addresses these gaps by evaluating five representative hybrid approaches for probabilistic building energy modeling, focusing on quantile predictions of building thermodynamics in a real-world case study. Our results highlight two main findings. First, the performance of hybrid approaches varies across different building room types, but residual learning with a Feedforward Neural Network performs best on average. Notably, the residual approach is the only model that produces physically intuitive predictions when applied to out-of-distribution test data. Second, Quantile Conformal Prediction is an effective procedure for calibrating quantile predictions in case of indoor temperature modeling.
zh
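摘要中的Quantile Conformal Prediction可用split-conformal思路简要示意:在校准集上计算一致性得分,并用其经验分位数放宽原始分位数区间以恢复覆盖率。以下为通用的CQR式草图,并非论文的具体实现,数据与区间均为虚构。

```python
import numpy as np

def conformalize_quantiles(lo_cal, hi_cal, y_cal, lo_test, hi_test, alpha=0.1):
    """Split-conformal calibration of quantile predictions (CQR-style):
    widen the raw [lo, hi] interval by the (1-alpha) empirical quantile of
    the calibration conformity scores, restoring >= 1-alpha coverage."""
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)  # >0 means y fell outside
    n = len(y_cal)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)      # finite-sample index
    q = np.sort(scores)[k - 1]
    return lo_test - q, hi_test + q

# A raw quantile model predicts a fixed, too-narrow interval [-1, 1] for N(0,1) data.
rng = np.random.default_rng(0)
y_cal = rng.normal(size=500)
lo_cal = np.full(500, -1.0)
hi_cal = np.full(500, 1.0)
lo_t, hi_t = conformalize_quantiles(lo_cal, hi_cal, y_cal,
                                    np.array([-1.0]), np.array([1.0]), alpha=0.1)
coverage = np.mean((y_cal >= lo_t[0]) & (y_cal <= hi_t[0]))
print(hi_t[0], coverage)  # interval widened past 1.0; calibration coverage >= 0.9
```

校准量q不依赖任何分布假设,这正是摘要中将其用于室内温度分位数校准的动机。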
[AI-10] Enabling Cyber Security Education through Digital Twins and Generative AI
【速读】:该论文旨在解决当前网络安全教育中理论与实践脱节的问题,即学员难以将课堂所学知识有效应用于真实复杂的网络攻防场景。解决方案的关键在于构建一个融合数字孪生(Digital Twin, DT)与大型语言模型(Large Language Model, LLM)的协同框架:其中,DT用于高保真还原IT、OT和IoT等多域基础设施,实现动态监控与攻击模拟;而LLM则提供自然语言交互、实时反馈及自适应学习支持,增强教学情境的真实性与智能化水平。该框架以定制化的红队刀具(Red Team Knife, RTK)为核心工具,基于攻击链(Cyber Kill Chain)模型引导学习者完成从侦察到响应的全流程演练,从而显著提升漏洞评估、威胁检测与安全运营等实战能力。
链接: https://arxiv.org/abs/2507.17518
作者: Vita Santa Barletta,Vito Bavaro,Miriana Calvano,Antonio Curci,Antonio Piccinno,Davide Pio Posa
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:Digital Twins (DTs) are gaining prominence in cybersecurity for their ability to replicate complex IT (Information Technology), OT (Operational Technology), and IoT (Internet of Things) infrastructures, allowing for real-time monitoring, threat analysis, and system simulation. This study investigates how integrating DTs with penetration testing tools and Large Language Models (LLMs) can enhance cybersecurity education and operational readiness. By simulating realistic cyber environments, this approach offers a practical, interactive framework for exploring vulnerabilities and defensive strategies. At the core of this research is the Red Team Knife (RTK), a custom penetration testing toolkit aligned with the Cyber Kill Chain model. RTK is designed to guide learners through key phases of cyberattacks, including reconnaissance, exploitation, and response within a DT-powered ecosystem. The incorporation of LLMs further enriches the experience by providing intelligent, real-time feedback, natural language threat explanations, and adaptive learning support during training exercises. This combined DT-LLM framework is currently being piloted in academic settings to develop hands-on skills in vulnerability assessment, threat detection, and security operations. Initial findings suggest that the integration significantly improves the effectiveness and relevance of cybersecurity training, bridging the gap between theoretical knowledge and real-world application. Ultimately, the research demonstrates how DTs and LLMs together can transform cybersecurity education to meet evolving industry demands.
zh
[AI-11] TAI Scan Tool: A RAG-Based Tool With Minimalistic Input for Trustworthy AI Self-Assessment
【速读】:该论文旨在解决人工智能(Artificial Intelligence, AI)系统在欧盟《人工智能法案》(AI Act)框架下合规评估的复杂性问题,尤其针对法律层面的可信人工智能(Trustworthy AI, TAI)自评估需求。其解决方案的关键在于提出一种基于检索增强生成(Retrieval-Augmented Generation, RAG)技术的最小化输入工具——TAI Scan Tool,采用两阶段流程(预筛选与评估阶段),通过对比高风险AI系统的设定来判断目标AI系统的风险等级,并自动检索相关条款以辅助合规决策,从而提升评估效率与准确性。
链接: https://arxiv.org/abs/2507.17514
作者: Athanasios Davvetas,Xenia Ziouvelou,Ypatia Dami,Alexis Kaponis,Konstantina Giouvanopoulou,Michael Papademas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, 4 tables
Abstract:This paper introduces the TAI Scan Tool, a RAG-based TAI self-assessment tool with minimalistic input. The current version of the tool supports the legal TAI assessment, with a particular emphasis on facilitating compliance with the AI Act. It involves a two-step approach with a pre-screening and an assessment phase. The assessment output of the system includes insight regarding the risk-level of the AI system according to the AI Act, while at the same time retrieving relevant articles to aid with compliance and notifying users of their obligations. Our qualitative evaluation using use-case scenarios yields promising results, correctly predicting risk levels while retrieving relevant articles across three distinct semantic groups. Furthermore, interpretation of results shows that the tool’s reasoning relies on comparison with the setting of high-risk systems, a behaviour attributed to their deployment requiring careful consideration, and therefore frequently presented within the AI Act.
zh
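工具实现细节未公开;下面用词袋余弦相似度粗略示意RAG流程中"按系统描述检索相关条款"这一步。条款标题与文本均为虚构占位,并非AI Act原文,函数名亦为假设。

```python
from collections import Counter
import math

ARTICLES = {  # toy stand-ins for AI Act provisions, NOT the real corpus
    "Art. 6 high-risk classification": "high risk classification safety component biometric",
    "Art. 9 risk management": "risk management system high risk continuous process",
    "Art. 50 transparency": "transparency chatbot disclose interaction generated content",
}

def bow_cosine(a, b):
    """Bag-of-words cosine similarity; real RAG systems use dense embeddings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(description, k=2):
    """Rank articles by similarity to the system description (retrieval step)."""
    ranked = sorted(ARTICLES, key=lambda t: bow_cosine(description, ARTICLES[t]),
                    reverse=True)
    return ranked[:k]

print(retrieve("biometric identification safety component with high risk"))
```

检索到的条款随后会作为上下文交给生成模型,用于输出风险等级判断与合规义务说明。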
[AI-12] HOTA: Hamiltonian framework for Optimal Transport Advection
【速读】:该论文旨在解决当前生成模型在概率流优化中忽视底层流形几何结构的问题,即大多数现有方法假设欧几里得(Euclidean)空间且依赖强密度估计假设,导致生成轨迹无法遵循真实最优性原理。其解决方案的关键在于提出Hamiltonian Optimal Transport Advection (HOTA),一种基于Hamilton-Jacobi-Bellman方程的方法,通过Kantorovich势函数显式求解对偶动力学最优传输问题,从而实现高效且可扩展的轨迹优化;该方法无需显式建模密度分布,在代价函数非光滑的情况下仍能保持性能,显著提升了生成轨迹的可行性和最优性。
链接: https://arxiv.org/abs/2507.17513
作者: Nazar Buzun,Daniil Shlenskii,Maxim Bobrin,Dmitry V. Dylov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimal transport (OT) has become a natural framework for guiding the probability flows. Yet, the majority of recent generative models assume trivial geometry (e.g., Euclidean) and rely on strong density-estimation assumptions, yielding trajectories that do not respect the true principles of optimality in the underlying manifold. We present Hamiltonian Optimal Transport Advection (HOTA), a Hamilton-Jacobi-Bellman based method that tackles the dual dynamical OT problem explicitly through Kantorovich potentials, enabling efficient and scalable trajectory optimization. Our approach effectively evades the need for explicit density modeling, performing well even when the cost functionals are non-smooth. Empirically, HOTA outperforms all baselines in standard benchmarks, as well as in custom datasets with non-differentiable costs, both in terms of feasibility and optimality.
zh
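作为背景,二次代价下动态最优传输的对偶问题可写成如下标准形式(Benamou–Brenier对偶;论文对势函数的具体参数化可能不同):

```latex
\max_{\varphi}\;
\int \varphi(1,x)\,\mathrm{d}\mu_1(x) - \int \varphi(0,x)\,\mathrm{d}\mu_0(x)
\quad \text{s.t.} \quad
\partial_t \varphi + \tfrac{1}{2}\,\lVert \nabla_x \varphi \rVert^2 \le 0 .
```

最优时约束取等号,即Kantorovich势满足Hamilton–Jacobi–Bellman方程 \(\partial_t \varphi + \tfrac{1}{2}\lVert\nabla_x \varphi\rVert^2 = 0\),而粒子沿 \(\dot{x}(t) = \nabla_x \varphi(t, x(t))\) 平流(advection),这正是HOTA通过势函数显式求解的对象。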
[AI-13] Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
【速读】:该论文旨在解决当前基于强化学习的可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法在多领域推理能力培养上的局限性问题,即现有研究多集中于单一推理领域(如数学求解、代码生成或逻辑谜题),而现实世界中的复杂推理任务通常需要整合多种认知技能。解决方案的关键在于系统性地探究RLVR框架下跨多个推理领域(数学推理、代码生成与逻辑谜题求解)的交互机制,通过四个核心实验模块:(1)基于GRPO算法和Qwen-2.5-7B模型族评估单域训练下的域内提升与跨域泛化能力;(2)分析联合多域训练中出现的相互增强与冲突关系;(3)比较基础模型与指令微调(SFT)模型在相同强化学习配置下的性能差异;(4)深入探索课程学习策略、奖励设计变化及语言特性对训练效果的影响。研究表明,不同推理域之间存在复杂的动态交互,明确关键影响因素有助于优化强化学习方法以促进大语言模型(LLMs)的综合性多域推理能力发展。
链接: https://arxiv.org/abs/2507.17512
作者: Yu Li,Zhuoshi Pan,Honglin Lin,Mengyuan Sun,Conghui He,Lijun Wu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 27 pages, 24 figures
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Existing research has predominantly concentrated on isolated reasoning domains such as mathematical problem-solving, coding tasks, or logical reasoning. However, real world reasoning scenarios inherently demand an integrated application of multiple cognitive skills. Despite this, the interplay among these reasoning skills under reinforcement learning remains poorly understood. To bridge this gap, we present a systematic investigation of multi-domain reasoning within the RLVR framework, explicitly focusing on three primary domains: mathematical reasoning, code generation, and logical puzzle solving. We conduct a comprehensive study comprising four key components: (1) Leveraging the GRPO algorithm and the Qwen-2.5-7B model family, our study thoroughly evaluates the models’ in-domain improvements and cross-domain generalization capabilities when trained on single-domain datasets. (2) Additionally, we examine the intricate interactions including mutual enhancements and conflicts that emerge during combined cross-domain training. (3) To further understand the influence of SFT on RL, we also analyze and compare performance differences between base and instruct models under identical RL configurations. (4) Furthermore, we delve into critical RL training details, systematically exploring the impacts of curriculum learning strategies, variations in reward design, and language-specific factors. Through extensive experiments, our results offer significant insights into the dynamics governing domain interactions, revealing key factors influencing both specialized and generalizable reasoning performance. These findings provide valuable guidance for optimizing RL methodologies to foster comprehensive, multi-domain reasoning capabilities in LLMs.
zh
[AI-14] Automated Hybrid Grounding Using Structural and Data-Driven Heuristics
【速读】:该论文旨在解决答案集编程(Answer Set Programming, ASP)在工业应用中面临的“接地瓶颈”(grounding bottleneck)问题,即在大规模实例下传统底向上接地(bottom-up grounding)方法效率低下甚至不可行的问题。解决方案的关键在于提出一种自动化混合接地(automated hybrid grounding)方法:通过基于数据结构启发式(data-structural heuristics)的分割算法,智能判断在何种情况下应使用体解耦接地(body-decoupled grounding),何种情况下仍应采用标准底向上接地。该启发式策略结合了规则结构特征与实例数据的估计过程,实验证明其在难以接地的场景中显著提升性能,同时在难以求解的实例上达到接近当前最优水平。
链接: https://arxiv.org/abs/2507.17493
作者: Alexander Beiser,Markus Hecher,Stefan Woltran
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:
Abstract:The grounding bottleneck poses one of the key challenges that hinders the widespread adoption of Answer Set Programming in industry. Hybrid Grounding is a step in alleviating the bottleneck by combining the strength of standard bottom-up grounding with recently proposed techniques where rule bodies are decoupled during grounding. However, it has remained unclear when hybrid grounding shall use body-decoupled grounding and when to use standard bottom-up grounding. In this paper, we address this issue by developing automated hybrid grounding: we introduce a splitting algorithm based on data-structural heuristics that detects when to use body-decoupled grounding and when standard grounding is beneficial. We base our heuristics on the structure of rules and an estimation procedure that incorporates the data of the instance. The experiments conducted on our prototypical implementation demonstrate promising results, which show an improvement on hard-to-ground scenarios, whereas on hard-to-solve instances we approach state-of-the-art performance.
zh
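下述草图仅示意"基于数据的接地规模估计"这一类启发式:以规则体中变量域大小的乘积粗略估计底向上接地的实例化数量,超过阈值时改用体解耦接地。论文的启发式还结合了规则结构与更精细的估计过程,此处的估计方式与阈值均为假设。

```python
from math import prod

def estimate_ground_size(rule_body, domain_sizes):
    """Crude grounding-size estimate: product of the domain sizes of the
    variables occurring in the rule body (ignores join selectivity).
    rule_body: list of (predicate, variables) atoms."""
    variables = {v for _, vs in rule_body for v in vs}
    return prod(domain_sizes[v] for v in variables)

def choose_grounder(rule_body, domain_sizes, threshold=10**6):
    """Pick body-decoupled grounding when the naive bottom-up estimate explodes."""
    size = estimate_ground_size(rule_body, domain_sizes)
    return "body-decoupled" if size > threshold else "bottom-up"

# A 3-atom chain over 1000-element domains: 1000^4 candidate instantiations.
body = [("edge", ("X", "Y")), ("edge", ("Y", "Z")), ("edge", ("Z", "W"))]
doms = {"X": 1000, "Y": 1000, "Z": 1000, "W": 1000}
print(choose_grounder(body, doms))  # "body-decoupled"
```

真实求解器中,这一决策是逐规则做出的,因而同一程序可混合两种接地方式。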
[AI-15] CQE under Epistemic Dependencies: Algorithms and Experiments (extended version) ISWC2025
【速读】:该论文旨在解决在基于本体(ontology)的受控查询评估(Controlled Query Evaluation, CQE)中,如何通过语义依赖规则(epistemic dependencies, EDs)实现安全的信息披露问题。其核心挑战在于,在保障数据隐私的前提下,高效且准确地回答布尔型联合析取查询(Boolean unions of conjunctive queries, BUCQs)。解决方案的关键在于将EDs与最优GA屏蔽器(optimal GA censors)相结合,即利用所有最优GA屏蔽器的交集作为安全披露边界,从而在保证强安全性的同时具备良好的计算复杂度特性。研究进一步证明,在全ED类(full EDs)和DL-Lite_R本体下,BUCQs可在数据复杂度为AC⁰的范围内被有效求解,并通过一个详细的、基于一阶逻辑的重写算法实现,实验验证了该方法的实用性。
链接: https://arxiv.org/abs/2507.17487
作者: Lorenzo Marconi,Flavia Ricci,Riccardo Rosati
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Extended version of paper accepted at the 24th International Semantic Web Conference (ISWC 2025)
Abstract:We investigate Controlled Query Evaluation (CQE) over ontologies, where information disclosure is regulated by epistemic dependencies (EDs), a family of logical rules recently proposed for the CQE framework. In particular, we combine EDs with the notion of optimal GA censors, i.e. maximal sets of ground atoms that are entailed by the ontology and can be safely revealed. We focus on answering Boolean unions of conjunctive queries (BUCQs) with respect to the intersection of all optimal GA censors - an approach that has been shown in other contexts to ensure strong security guarantees with favorable computational behavior. First, we characterize the security of this intersection-based approach and identify a class of EDs (namely, full EDs) for which it remains safe. Then, for a subclass of EDs and for DL-Lite_R ontologies, we show that answering BUCQs in the above CQE semantics is in AC^0 in data complexity by presenting a suitable, detailed first-order rewriting algorithm. Finally, we report on experiments conducted in two different evaluation scenarios, showing the practical feasibility of our rewriting function.
zh
[AI-16] LTLZinc: a Benchmarking Framework for Continual Learning and Neuro-Symbolic Temporal Reasoning
【速读】:该论文旨在解决神经符号人工智能(Neuro-symbolic Artificial Intelligence)在持续学习(Continual Learning)场景下,特别是在需要时间维度推理的动态环境中应用不足的问题。现有方法大多局限于静态任务,难以处理随时间演进的知识更新与因果逻辑推理。其解决方案的关键在于提出LTLZinc框架,该框架通过将线性时序逻辑(Linear Temporal Logic, LTL)规范与MiniZinc约束相结合,自动生成涵盖多种复杂时序推理和持续学习任务的数据集,并支持细粒度标注以适配不同的神经与神经符号训练范式。实验表明,LTLZinc生成的任务能有效揭示当前最先进方法在时序建模和知识保留方面的局限性,从而推动面向统一时序学习与推理框架的研究发展。
链接: https://arxiv.org/abs/2507.17482
作者: Luca Salvatore Lorello,Nikolaos Manginas,Marco Lippi,Stefano Melacci
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Neuro-symbolic artificial intelligence aims to combine neural architectures with symbolic approaches that can represent knowledge in a human-interpretable formalism. Continual learning is concerned with agents that expand their knowledge over time, improving their skills without forgetting previously learned concepts. Most of the existing approaches for neuro-symbolic artificial intelligence are applied to static scenarios only, and the challenging setting where reasoning along the temporal dimension is necessary has been seldom explored. In this work we introduce LTLZinc, a benchmarking framework that can be used to generate datasets covering a variety of different problems, against which neuro-symbolic and continual learning methods can be evaluated along the temporal and constraint-driven dimensions. Our framework generates expressive temporal reasoning and continual learning tasks from a linear temporal logic specification over MiniZinc constraints, and arbitrary image classification datasets. Fine-grained annotations allow multiple neural and neuro-symbolic training settings on the same generated datasets. Experiments on six neuro-symbolic sequence classification and four class-continual learning tasks generated by LTLZinc, demonstrate the challenging nature of temporal learning and reasoning, and highlight limitations of current state-of-the-art methods. We release the LTLZinc generator and ten ready-to-use tasks to the neuro-symbolic and continual learning communities, in the hope of fostering research towards unified temporal learning and reasoning frameworks.
zh
[AI-17] An Uncertainty-Driven Adaptive Self-Alignment Framework for Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在缺乏人工标注的情况下实现高质量对齐人类意图与安全规范的根本性挑战。其解决方案的核心在于提出一种不确定性驱动的自适应自我对齐框架(Uncertainty-Driven Adaptive Self-Alignment, UDASA),通过量化输出在语义、事实性和价值一致性三个维度上的不确定性,构建偏好对并根据不确定性差异将训练样本分为保守、中等和探索三个阶段,进而分阶段对模型进行渐进式优化,从而在无需人工标注的前提下显著提升模型在无害性、有用性、真实性及受控情感生成等多个任务上的表现。
链接: https://arxiv.org/abs/2507.17477
作者: Haoran Sun,Zekun Zhang,Shaoning Zeng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable progress in instruction following and general-purpose reasoning. However, achieving high-quality alignment with human intent and safety norms without human annotations remains a fundamental challenge. In this work, we propose an Uncertainty-Driven Adaptive Self-Alignment (UDASA) framework designed to improve LLM alignment in a fully automated manner. UDASA first generates multiple responses for each input and quantifies output uncertainty across three dimensions: semantics, factuality, and value alignment. Based on these uncertainty scores, the framework constructs preference pairs and categorizes training samples into three stages, conservative, moderate, and exploratory, according to their uncertainty difference. The model is then optimized progressively across these stages. In addition, we conduct a series of preliminary studies to validate the core design assumptions and provide strong empirical motivation for the proposed framework. Experimental results show that UDASA outperforms existing alignment methods across multiple tasks, including harmlessness, helpfulness, truthfulness, and controlled sentiment generation, significantly improving model performance.
zh
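下面的Python草图示意"按不确定性差异把偏好对划分为保守/中等/探索三阶段"的机制;三维不确定性(语义、事实性、价值对齐)的聚合方式、阈值与划分方向均为示意性假设,非论文设定。

```python
def uncertainty_gap(u_chosen, u_rejected):
    """Aggregate the three uncertainty dimensions by a simple mean,
    then take the gap between rejected and chosen responses."""
    return abs(sum(u_rejected) / 3 - sum(u_chosen) / 3)

def stage(pair, low=0.15, high=0.4):
    """Bucket a preference pair by uncertainty gap (illustrative thresholds).
    Large gap -> clear-cut pair, safe to train on early ("conservative");
    small gap -> ambiguous pair, handled in the last stage ("exploratory")."""
    gap = uncertainty_gap(pair["chosen_u"], pair["rejected_u"])
    if gap >= high:
        return "conservative"
    if gap >= low:
        return "moderate"
    return "exploratory"

pairs = [
    {"chosen_u": (0.1, 0.1, 0.1), "rejected_u": (0.8, 0.7, 0.9)},   # clear-cut
    {"chosen_u": (0.3, 0.4, 0.3), "rejected_u": (0.5, 0.6, 0.5)},   # moderate
    {"chosen_u": (0.4, 0.4, 0.4), "rejected_u": (0.45, 0.4, 0.45)}, # ambiguous
]
print([stage(p) for p in pairs])
```

分阶段后模型按"保守→中等→探索"的顺序渐进优化,对应摘要中的课程式训练流程。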
[AI-18] BGM-HAN: A Hierarchical Attention Network for Accurate and Fair Decision Assessment on Semi-Structured Profiles
【速读】:该论文旨在解决高风险决策领域中人类决策因难以察觉的认知偏差而影响公平性和长期效果的问题。其解决方案的关键在于提出BGM-HAN(增强型字节对编码门控多头分层注意力网络),通过引入分层学习机制来有效建模半结构化申请者数据,捕捉多层次表征以实现更细致的评估,从而在提升预测性能的同时增强模型的可解释性,为重视结构、上下文和公平性的决策场景提供了一种有效的增强框架。
链接: https://arxiv.org/abs/2507.17472
作者: Junhua Liu,Roy Ka-Wei Lee,Kwan Hui Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at ASONAM 2025
Abstract:Human decision-making in high-stakes domains often relies on expertise and heuristics, but is vulnerable to hard-to-detect cognitive biases that threaten fairness and long-term outcomes. This work presents a novel approach to enhancing complex decision-making workflows through the integration of hierarchical learning alongside various enhancements. Focusing on university admissions as a representative high-stakes domain, we propose BGM-HAN, an enhanced Byte-Pair Encoded, Gated Multi-head Hierarchical Attention Network, designed to effectively model semi-structured applicant data. BGM-HAN captures multi-level representations that are crucial for nuanced assessment, improving both interpretability and predictive performance. Experimental results on real admissions data demonstrate that our proposed model significantly outperforms both state-of-the-art baselines from traditional machine learning to large language models, offering a promising framework for augmenting decision-making in domains where structure, context, and fairness matter. Source code is available at: this https URL.
zh
[AI-19] Reasoning-Driven Retrosynthesis Prediction with Large Language Models via Reinforcement Learning
【速读】:该论文旨在解决现有生成式AI(Generative AI)在有机合成逆合成规划中面临的适用性不足与可解释性差的问题。传统基于图结构或序列到序列的模型往往缺乏通用的化学知识,导致预测结果准确性不稳定且难以解释。解决方案的关键在于提出一种基于推理的大语言模型(Reasoning-based Large Language Model, RetroDFM-R),通过大规模强化学习框架并结合化学可验证奖励机制进行训练,从而显著提升预测准确性和可解释性。该方法不仅在USPTO-50K基准上达到65.0%的top-1准确率,还通过双盲人类评估验证了其预测结果的化学合理性与实际应用价值,同时能够准确推导出文献报道的真实药物分子和钙钛矿材料的多步逆合成路径,并提供人类可理解的推理过程,增强了模型在真实场景中的可信度与实用性。
链接: https://arxiv.org/abs/2507.17448
作者: Situo Zhang,Hanqi Li,Lu Chen,Zihan Zhao,Xuanze Lin,Zichen Zhu,Bo Chen,Xin Chen,Kai Yu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注: Preprint
Abstract:Retrosynthesis planning, essential in organic synthesis and drug discovery, has greatly benefited from recent AI-driven advancements. Nevertheless, existing methods frequently face limitations in both applicability and explainability. Traditional graph-based and sequence-to-sequence models often lack generalized chemical knowledge, leading to predictions that are neither consistently accurate nor easily explainable. To address these challenges, we introduce RetroDFM-R, a reasoning-based large language model (LLM) designed specifically for chemical retrosynthesis. Leveraging large-scale reinforcement learning guided by chemically verifiable rewards, RetroDFM-R significantly enhances prediction accuracy and explainability. Comprehensive evaluations demonstrate that RetroDFM-R significantly outperforms state-of-the-art methods, achieving a top-1 accuracy of 65.0% on the USPTO-50K benchmark. Double-blind human assessments further validate the chemical plausibility and practical utility of RetroDFM-R’s predictions. RetroDFM-R also accurately predicts multistep retrosynthetic routes reported in the literature for both real-world drug molecules and perovskite materials. Crucially, the model’s explicit reasoning process provides human-interpretable insights, thereby enhancing trust and practical value in real-world retrosynthesis applications.
zh
[AI-20] IndoorBEV: Joint Detection and Footprint Completion of Objects via Mask-based Prediction in Indoor Scenarios for Birds-Eye View Perception
【速读】:该论文旨在解决复杂室内三维点云中多样化物体检测的挑战,尤其是面对物体形状多样、场景杂乱以及静态与动态元素共存时,传统基于边界框(bounding box)的方法性能受限的问题。其核心解决方案是提出一种基于掩码(mask-based)的鸟瞰图(Bird’s-Eye View, BEV)方法——IndoorBEV,通过将3D场景投影至2D BEV网格来自然处理遮挡并提供一致的俯视视角,从而有效区分静态障碍物与动态目标。该方法的关键在于采用轴向紧凑编码器(axis compact encoder)和基于窗口的骨干网络提取丰富的空间特征,并结合查询驱动的解码头,利用学习到的对象查询在BEV空间中同时预测物体类别与实例掩码,实现对静态与动态物体轮廓的鲁棒建模,显著优于依赖边界框回归的传统方法。
链接: https://arxiv.org/abs/2507.17445
作者: Haichuan Li,Changda Tian,Panos Trahanias,Tomi Westerlund
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Detecting diverse objects within complex indoor 3D point clouds presents significant challenges for robotic perception, particularly with varied object shapes, clutter, and the co-existence of static and dynamic elements where traditional bounding box methods falter. To address these limitations, we propose IndoorBEV, a novel mask-based Bird’s-Eye View (BEV) method for indoor mobile robots. In a BEV method, a 3D scene is projected into a 2D BEV grid which handles occlusions naturally and provides a consistent top-down view, helping to distinguish static obstacles from dynamic agents. The obtained 2D BEV result is directly usable by downstream robotic tasks like navigation, motion prediction, and planning. Our architecture utilizes an axis compact encoder and a window-based backbone to extract rich spatial features from this BEV map. A query-based decoder head then employs learned object queries to concurrently predict object classes and instance masks in the BEV space. This mask-centric formulation effectively captures the footprint of both static and dynamic objects regardless of their shape, offering a robust alternative to bounding box regression. We demonstrate the effectiveness of IndoorBEV on a custom indoor dataset featuring diverse object classes including static objects and dynamic elements like robots and miscellaneous items, showcasing its potential for robust indoor scene understanding.
zh
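以下用numpy示意"点云投影到2D BEV占据栅格"这一预处理步骤(网格范围与分辨率为假设值);IndoorBEV在此类栅格之上再接编码器与掩码预测头。

```python
import numpy as np

def points_to_bev(points, x_range=(-5, 5), y_range=(-5, 5), resolution=0.5):
    """Project a 3D point cloud (N, 3) onto a 2D bird's-eye-view occupancy grid.
    Only the projection step; out-of-range points are discarded and the z
    coordinate is dropped, so occlusions collapse in the top-down view."""
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    grid = np.zeros((nx, ny), dtype=np.uint8)
    ix = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    iy = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    grid[ix[ok], iy[ok]] = 1
    return grid

pts = np.array([[0.0, 0.0, 1.2],    # two points fall in the same cell
                [0.1, 0.1, 0.3],
                [4.9, -4.9, 2.0],   # corner cell
                [9.0, 0.0, 0.0]])   # out of range, discarded
bev = points_to_bev(pts)
print(bev.shape, bev.sum())  # (20, 20) 2
```

实际系统通常还会在每个栅格内保留高度、密度等多通道统计量,这里为简洁只输出二值占据图。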
[AI-21] Fair Compromises in Participatory Budgeting: a Multi-Agent Deep Reinforcement Learning Approach
【速读】:该论文旨在解决参与式预算(Participatory Budgeting)中因选民需对多个项目进行决策而产生的“选择过载”问题,以及如何通过优化投票策略提升选民偏好在最终获胜项目集中的代表性,从而实现更公平的公共资金分配。解决方案的关键在于引入一种基于多智能体深度强化学习(Multi-Agent Deep Reinforcement Learning)的决策支持方法,并采用分支神经网络(branching neural network)架构以在去中心化方式下克服多智能体强化学习的可扩展性挑战,从而识别出能提高选票胜率的投票策略,并发现公平妥协可通过低成本项目实现的规律。
链接: https://arxiv.org/abs/2507.17433
作者: Hugh Adams,Srijoni Majumdar,Evangelos Pournaras
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Participatory budgeting is a method of collectively understanding and addressing spending priorities where citizens vote on how a budget is spent; it is regularly run to improve the fairness of the distribution of public funds. Participatory budgeting requires voters to make decisions on projects, which can lead to “choice overload”. A multi-agent reinforcement learning approach to decision support can make decision making easier for voters by identifying voting strategies that increase the winning proportion of their vote. This novel approach can also support policymakers by highlighting aspects of election design that enable fair compromise on projects. This paper presents a novel, ethically aligned approach to decision support using multi-agent deep reinforcement learning modelling. It introduces a novel use of a branching neural network architecture to overcome scalability challenges of multi-agent reinforcement learning in a decentralized way. Fair compromises are found through optimising voter actions towards greater representation of voter preferences in the winning set. Experimental evaluation with real-world participatory budgeting data reveals a pattern in fair compromise: that it is achievable through projects with smaller cost.
zh
[AI-22] Ctx2TrajGen: Traffic Context-Aware Microscale Vehicle Trajectories using Generative Adversarial Imitation Learning
【速读】:该论文旨在解决微观车辆轨迹建模中的关键挑战,即如何在复杂城市驾驶环境中生成真实、多样化且与上下文一致的车辆行为轨迹,以支持交通行为分析和自动驾驶系统开发。其解决方案的关键在于提出Ctx2TrajGen框架,该框架基于生成对抗强化学习(GAIL),结合近端策略优化(PPO)与WGAN-GP技术,显式地将周围车辆状态和道路几何信息作为条件输入,从而有效捕捉非线性交互关系并缓解训练不稳定性,最终实现高保真度的交互感知轨迹生成。
链接: https://arxiv.org/abs/2507.17418
作者: Joobin Jin,Seokjun Hong,Gyeongseon Baek,Yeeun Kim,Byeongjoon Noh
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Precise modeling of microscopic vehicle trajectories is critical for traffic behavior analysis and autonomous driving systems. We propose Ctx2TrajGen, a context-aware trajectory generation framework that synthesizes realistic urban driving behaviors using GAIL. Leveraging PPO and WGAN-GP, our model addresses nonlinear interdependencies and training instability inherent in microscopic settings. By explicitly conditioning on surrounding vehicles and road geometry, Ctx2TrajGen generates interaction-aware trajectories aligned with real-world context. Experiments on the drone-captured DRIFT dataset demonstrate superior performance over existing methods in terms of realism, behavioral diversity, and contextual fidelity, offering a robust solution to data scarcity and domain shift without simulation.
zh
[AI-23] Investigating Training Data Detection in AI Coders
【速读】:该论文旨在解决生成式 AI(Generative AI)在代码领域应用中因训练数据泄露风险而引发的合规性与隐私保护问题,即如何有效检测代码大语言模型(CodeLLMs)输出中是否包含源自训练数据的敏感或专有代码片段。其解决方案的关键在于构建了一个名为 CodeSnitch 的函数级基准数据集,包含 9,000 个跨三种编程语言的代码样本,并明确标注每个样本是否被纳入 CodeLLM 训练;同时设计基于 Type-1 至 Type-4 代码克隆检测分类法的靶向变异策略,系统评估七种前沿训练数据检测(TDD)方法在代码场景下的性能与鲁棒性,从而为未来开发更高效、可靠的代码级 TDD 方法提供实证依据与方向指引。
链接: https://arxiv.org/abs/2507.17389
作者: Tianlin Li,Yunxiang Wei,Zhiming Li,Aishan Liu,Qing Guo,Xianglong Liu,Dongning Sun,Yang Liu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data, and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely underexplored. This gap is particularly important given code’s structured syntax and distinct similarity criteria compared to natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight CodeLLMs. To support this evaluation, we introduce CodeSnitch, a function-level benchmark dataset comprising 9,000 code samples in three programming languages, each explicitly labeled as either included or excluded from CodeLLM training. Beyond evaluation on the original CodeSnitch, we design targeted mutation strategies to test the robustness of TDD methods under three distinct settings. These mutation strategies are grounded in the well-established Type-1 to Type-4 code clone detection taxonomy. Our study provides a systematic assessment of current TDD techniques for code and offers insights to guide the development of more effective and robust detection methods in the future.
zh
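论文评测的七种TDD方法未在摘要中列出;下面以最经典的损失阈值成员推断(loss-threshold membership inference)示意TDD的基本思路:模型对见过的训练样本往往赋予更低的损失。阈值与逐token概率均为虚构示例。

```python
import math

def nll(probs_of_true_tokens):
    """Per-sample score: average negative log-likelihood the model assigns
    to the ground-truth tokens of a code sample."""
    return -sum(math.log(p) for p in probs_of_true_tokens) / len(probs_of_true_tokens)

def loss_threshold_tdd(sample_probs, threshold):
    """Classic loss-threshold membership inference, one common TDD baseline:
    low loss -> likely included in the training data."""
    return nll(sample_probs) < threshold

member = [0.9, 0.95, 0.85, 0.92]    # model is confident: likely memorized
non_member = [0.2, 0.35, 0.1, 0.4]  # high loss: likely unseen
print(loss_threshold_tdd(member, threshold=0.5),
      loss_threshold_tdd(non_member, threshold=0.5))
```

CodeSnitch的意义正在于提供带真实成员标签的代码样本,使这类得分函数的判别能力可以被系统性评测。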
[AI-24] DynaSearcher: Dynamic Knowledge Graph Augmented Search Agent via Multi-Reward Reinforcement Learning
【速读】:该论文旨在解决多步代理检索系统(multi-step agentic retrieval systems)在实际应用中面临的两个核心问题:一是生成的事实不一致的中间查询(intermediate queries),二是低效的搜索轨迹(search trajectories),这些问题会导致推理偏差或冗余计算。解决方案的关键在于提出DynaSearcher,其创新性地结合了动态知识图谱(dynamic knowledge graphs)与多奖励强化学习(multi-reward reinforcement learning, RL)。具体而言,知识图谱作为外部结构化知识,显式建模实体关系以确保中间查询的事实一致性并减少无关信息干扰;同时,多奖励RL框架通过精细化控制检索准确率、效率和响应质量等目标,引导生成高质量中间查询和完整最终答案,抑制无效探索并降低信息遗漏或冗余。
链接: https://arxiv.org/abs/2507.17365
作者: Chuzhan Hao,Wenfeng Feng,Yuewei Zhang,Hao Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures
Abstract:Multi-step agentic retrieval systems based on large language models (LLMs) have demonstrated remarkable performance in complex information search tasks. However, these systems still face significant challenges in practical applications, particularly in generating factually inconsistent intermediate queries and inefficient search trajectories, which can lead to reasoning deviations or redundant computations. To address these issues, we propose DynaSearcher, an innovative search agent enhanced by dynamic knowledge graphs and multi-reward reinforcement learning (RL). Specifically, our system leverages knowledge graphs as external structured knowledge to guide the search process by explicitly modeling entity relationships, thereby ensuring factual consistency in intermediate queries and mitigating biases from irrelevant information. Furthermore, we employ a multi-reward RL framework for fine-grained control over training objectives such as retrieval accuracy, efficiency, and response quality. This framework promotes the generation of high-quality intermediate queries and comprehensive final answers, while discouraging unnecessary exploration and minimizing information omissions or redundancy. Experimental results demonstrate that our approach achieves state-of-the-art answer accuracy on six multi-hop question answering datasets, matching frontier LLMs while using only small-scale models and limited computational resources. Furthermore, our approach demonstrates strong generalization and robustness across diverse retrieval environments and larger-scale models, highlighting its broad applicability.
zh
[AI-25] EarthLink: Interpreting Climate Signals with Self-Evolving AI Agents
【速读】:该论文旨在解决地球系统科学中因数据海量、分散且复杂,以及分析需求日益精密所带来的科研效率瓶颈问题。其解决方案的关键在于提出EarthLink——首个专为地球科学家设计的交互式AI代理(AI agent),它能够自动化从研究规划、代码生成到多场景分析的完整科研流程;并通过用户交互实现动态学习与能力迭代,形成闭环反馈机制,从而显著提升科学研究的效率与可解释性。
链接: https://arxiv.org/abs/2507.17311
作者: Zijie Guo,Jiong Wang,Xiaoyu Yue,Wangxu Wei,Zhe Jiang,Wanghan Xu,Ben Fei,Wenlong Zhang,Xinyu Gu,Lijing Cheng,Jing-Jia Luo,Chao Li,Yaqiang Wang,Tao Chen,Wanli Ouyang,Fenghua Ling,Lei Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
备注:
Abstract:Modern Earth science is at an inflection point. The vast, fragmented, and complex nature of Earth system data, coupled with increasingly sophisticated analytical demands, creates a significant bottleneck for rapid scientific discovery. Here we introduce EarthLink, the first AI agent designed as an interactive copilot for Earth scientists. It automates the end-to-end research workflow, from planning and code generation to multi-scenario analysis. Unlike static diagnostic tools, EarthLink can learn from user interaction, continuously refining its capabilities through a dynamic feedback loop. We validated its performance on a number of core scientific tasks of climate change, ranging from model-observation comparisons to the diagnosis of complex phenomena. In a multi-expert evaluation, EarthLink produced scientifically sound analyses and demonstrated an analytical competency that was rated as comparable to specific aspects of a human junior researcher’s workflow. Additionally, its transparent, auditable workflows and natural language interface empower scientists to shift from laborious manual execution to strategic oversight and hypothesis generation. EarthLink marks a pivotal step towards an efficient, trustworthy, and collaborative paradigm for Earth system research in an era of accelerating global change.
zh
[AI-26] Confounded Causal Imitation Learning with Instrumental Variables
【速读】:该论文旨在解决模仿学习(Imitation Learning)中因未测量混杂变量(Unmeasured Confounders)导致的策略估计偏差问题。这些混杂变量会同时影响状态和动作,从而破坏观测数据与目标策略之间的因果关系。解决方案的关键在于引入工具变量(Instrumental Variables, IV)并提出一种名为“混淆因果模仿学习”(Confounded Causal Imitation Learning, C2L)的模型框架,该框架通过两阶段方法实现:第一阶段基于伪变量定义检验准则,严格识别出满足充分必要条件的有效IV;第二阶段利用所识别的IV,分别设计基于模拟器和离线数据的两种策略优化方法,从而有效缓解长期时序依赖下的混杂偏倚,提升策略学习的因果有效性。
链接: https://arxiv.org/abs/2507.17309
作者: Yan Zeng,Shenglan Nie,Feng Xie,Libo Huang,Peng Wu,Zhi Geng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures
Abstract:Imitation learning from demonstrations usually suffers from the confounding effects of unmeasured variables (i.e., unmeasured confounders) on the states and actions. If ignoring them, a biased estimation of the policy would be entailed. To break up this confounding gap, in this paper, we take the best of the strong power of instrumental variables (IV) and propose a Confounded Causal Imitation Learning (C2L) model. This model accommodates confounders that influence actions across multiple timesteps, rather than being restricted to immediate temporal dependencies. We develop a two-stage imitation learning framework for valid IV identification and policy optimization. In particular, in the first stage, we construct a testing criterion based on the defined pseudo-variable, with which we achieve identifying a valid IV for the C2L models. Such a criterion entails the sufficient and necessary identifiability conditions for IV validity. In the second stage, with the identified IV, we propose two candidate policy learning approaches: one is based on a simulator, while the other is offline. Extensive experiments verified the effectiveness of identifying the valid IV as well as learning the policy.
zh
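上文 C2L 摘要的核心是利用工具变量消除未观测混杂对策略估计的偏差。下面用教科书式的两阶段最小二乘(2SLS)做一个最小示意,说明有效工具变量为何能恢复无偏的因果效应(这是通用方法演示,并非论文提出的两阶段模仿学习框架本身):

```python
import numpy as np

def two_stage_least_squares(Z: np.ndarray, X: np.ndarray, Y: np.ndarray) -> float:
    """返回 X 对 Y 因果效应的 2SLS 估计(标量斜率)。"""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    # 第一阶段:用工具变量 Z 拟合 X,预测值中不含混杂成分
    X_hat = Z1 @ np.linalg.lstsq(Z1, X, rcond=None)[0]
    X1 = np.column_stack([np.ones(len(X_hat)), X_hat])
    # 第二阶段:用预测值回归 Y,斜率即为因果效应估计
    return np.linalg.lstsq(X1, Y, rcond=None)[0][1]
```

在存在未观测混杂 U 的模拟数据上,朴素 OLS 会系统性高估效应,而 2SLS 能接近真值。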
[AI-27] On Temporal Guidance and Iterative Refinement in Audio Source Separation
【速读】:该论文旨在解决声景空间语义分割(Spatial Semantic Segmentation of Sound Scenes, S5)中因传统两阶段流水线(音频标签识别后接标签条件源分离)缺乏细粒度时间信息而导致的源分离效果受限的问题。其解决方案的关键在于增强事件检测与源分离阶段之间的协同机制:首先微调预训练Transformer以检测活跃声音类别;其次利用另一个微调后的相同Transformer实例执行声事件检测(Sound Event Detection, SED),为分离模块提供时变指导;最后引入迭代精炼机制,通过递归复用前一轮分离结果持续提升分离质量。这一方法显著提升了音频标签和源分离性能,在DCASE Challenge 2025 Task 4中取得第二名。
链接: https://arxiv.org/abs/2507.17297
作者: Tobias Morocutti,Jonathan Greif,Paul Primus,Florian Schmid,Gerhard Widmer
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
Abstract:Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline - audio tagging followed by label-conditioned source separation - but are often constrained by the absence of fine-grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine-tune a pre-trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine-tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time-varying guidance. Third, we implement an iterative refinement mechanism that progressively enhances separation quality by recursively reusing the separator’s output from previous iterations. These advancements lead to significant improvements in both audio tagging and source separation performance, as demonstrated by our system’s second-place finish in Task 4 of the DCASE Challenge 2025. Our implementation and model checkpoints are available in our GitHub repository: this https URL .
zh
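上文摘要中的迭代精炼机制,即把分离器上一轮的输出递归地作为下一轮输入、逐步逼近目标源。下面用一个简化的线性"分离器"演示这一循环结构本身(假设性示意,真实系统中 separate_fn 是神经网络分离模型):

```python
import numpy as np

def iterative_refinement(mixture: np.ndarray, separate_fn, num_iters: int = 3):
    """迭代调用分离器,每轮都把上一轮的估计作为额外输入复用。"""
    estimate = np.zeros_like(mixture)              # 初始估计为零
    for _ in range(num_iters):
        estimate = separate_fn(mixture, estimate)  # 复用上一轮输出
    return estimate
```

若分离器每轮消除一半残差,估计会几何级数地收敛到目标源,迭代次数越多越精确。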
[AI-28] Integrating Belief Domains into Probabilistic Logic Programs
【速读】:该论文旨在解决传统概率逻辑编程(Probabilistic Logic Programming, PLP)在表达认知不确定性(epistemic uncertainty)方面的局限性,尤其是当不确定性来源于如计算机视觉模型的层次分类等场景时,现有基于点概率(point-probabilities)的分布语义(Distribution Semantics)难以有效建模。解决方案的关键在于引入基于区间概率(interval probabilities)的容量逻辑程序(Capacity Logic Programs),通过将分布语义扩展至包含信念函数(belief functions)——一种非加性容量(non-additive capacities)——来支持对认知不确定性的形式化表达与推理,从而提升框架在实际应用中的适应性和表达能力。
链接: https://arxiv.org/abs/2507.17291
作者: Damiano Azzolini,Fabrizio Riguzzi,Theresa Swift
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)
Abstract:Probabilistic Logic Programming (PLP) under the Distribution Semantics is a leading approach to practical reasoning under uncertainty. An advantage of the Distribution Semantics is its suitability for implementation as a Prolog or Python library, available through two well-maintained implementations, namely ProbLog and cplint/PITA. However, current formulations of the Distribution Semantics use point-probabilities, making it difficult to express epistemic uncertainty, such as arises from, for example, hierarchical classifications from computer vision models. Belief functions generalize probability measures as non-additive capacities, and address epistemic uncertainty via interval probabilities. This paper introduces interval-based Capacity Logic Programs based on an extension of the Distribution Semantics to include belief functions, and describes properties of the new framework that make it amenable to practical applications.
zh
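上文摘要中的信念函数用区间概率表达认知不确定性。下面是一个最小示意(假设性实现,仅演示概念):给定焦元上的质量分配,计算命题 A 的信念 Bel(A)(下界)与似然 Pl(A)(上界),二者构成 A 的区间概率 [Bel, Pl]:

```python
def belief_interval(masses: dict, query: frozenset):
    """masses: {焦元(frozenset): 质量}; 返回 (Bel, Pl)。"""
    bel = sum(m for focal, m in masses.items() if focal <= query)  # 焦元含于 A
    pl = sum(m for focal, m in masses.items() if focal & query)    # 焦元与 A 相交
    return bel, pl
```

例如层次分类中,0.3 的质量分给粗类 {cat, dog} 而不细分,这部分质量只计入 Pl 不计入 Bel,区间宽度即认知不确定性的大小。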
[AI-29] Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments
【速读】:该论文旨在解决企业环境中合规人员在执行日常合规任务时效率低下的问题,尤其针对大语言模型(Large Language Model, LLM)在处理复杂合规查询时响应质量不足与延迟较高的挑战。解决方案的关键在于提出一种名为Compliance Brain Assistant (CBA)的对话式、代理型AI助手,其核心创新是设计了一个用户查询路由器(user query router),能够智能区分请求类型并动态选择两种运行模式:FastTrack模式用于处理仅需从知识库中检索上下文的简单请求,以保证低延迟;FullAgentic模式则适用于需要跨文档、跨API调用和多步骤推理的复杂请求,以提升响应准确性。实验表明,该路由机制在保持运行时间相近的前提下,显著优于单一模式方案,在关键词匹配率(83.7% vs. 41.7%)和LLM评分通过率(82.0% vs. 20.0%)等指标上实现大幅提升,验证了其在响应质量与延迟之间取得良好平衡的有效性。
链接: https://arxiv.org/abs/2507.17289
作者: Shitong Zhu,Chenhao Fang,Derek Larson,Neel Reddy Pochareddy,Rajeev Rao,Sophie Zeng,Yanqing Peng,Wendy Summer,Alex Goncalves,Arya Pudota,Herve Robert
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents Compliance Brain Assistant (CBA), a conversational, agentic AI assistant designed to boost the efficiency of daily compliance tasks for personnel in enterprise environments. To strike a good balance between response quality and latency, we design a user query router that can intelligently choose between (i) FastTrack mode: to handle simple requests that only need additional relevant context retrieved from knowledge corpora; and (ii) FullAgentic mode: to handle complicated requests that need composite actions and tool invocations to proactively discover context across various compliance artifacts, and/or involving other APIs/models for accommodating requests. A typical example would be to start with a user query, use its description to find a specific entity and then use the entity's information to query other APIs for curating and enriching the final AI response. Our experimental evaluations compared CBA against an out-of-the-box LLM on various real-world privacy/compliance-related queries targeting various personas. We found that CBA substantially improved upon the vanilla LLM's performance on metrics such as average keyword match rate (83.7% vs. 41.7%) and LLM-judge pass rate (82.0% vs. 20.0%). We also compared metrics for the full routing-based design against the "fast-track only" and "full-agentic" modes and found that it had a better average match-rate and pass-rate while keeping the run-time approximately the same. This finding validated our hypothesis that the routing mechanism leads to a good trade-off between the two worlds.
zh
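上文 CBA 摘要中的用户查询路由器,核心是区分"仅需检索上下文"的简单请求与"需复合动作与工具调用"的复杂请求。下面是一个假设性的规则式示意(真实系统应由模型判别,关键词规则仅为演示路由结构):

```python
def route_query(query: str) -> str:
    """按演示用的关键词规则把查询分派到两种模式之一。"""
    agentic_markers = ("then", "and use", "cross-reference", "compare", "enrich")
    if any(marker in query.lower() for marker in agentic_markers):
        return "FullAgentic"   # 需要复合动作与工具调用的复杂请求
    return "FastTrack"         # 仅需从知识库检索上下文的简单请求
```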
[AI-30] Leveraging Knowledge Graphs and LLM Reasoning to Identify Operational Bottlenecks for Warehouse Planning Assistance
【速读】:该论文旨在解决从仓库运营的离散事件仿真(Discrete Event Simulation, DES)输出数据中识别瓶颈和低效环节的问题,传统方法通常依赖大量人工分析或专用工具,效率低下且难以处理复杂关系。解决方案的关键在于构建一个融合知识图谱(Knowledge Graph, KG)与大语言模型(Large Language Model, LLM)代理的框架:首先将原始DES数据转化为语义丰富的KG,以结构化方式捕获事件与实体间的关联;随后由LLM代理通过迭代推理生成相互依赖的子问题,针对每个子问题生成Cypher查询访问KG、提取信息并进行自我反思纠错,从而实现类人化的自适应诊断过程。此方法显著提升了对操作性问题的定位准确率,并在复杂诊断任务中展现出优于基线方法的发现能力,实现了从仿真数据到可行动洞察的高效转化。
链接: https://arxiv.org/abs/2507.17273
作者: Rishi Parekh,Saisubramaniam Gopalakrishnan,Zishan Ahmad,Anirudh Deodhar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures
Abstract:Analyzing large, complex output datasets from Discrete Event Simulations (DES) of warehouse operations to identify bottlenecks and inefficiencies is a critical yet challenging task, often demanding significant manual effort or specialized analytical tools. Our framework integrates Knowledge Graphs (KGs) and Large Language Model (LLM)-based agents to analyze complex Discrete Event Simulation (DES) output data from warehouse operations. It transforms raw DES data into a semantically rich KG, capturing relationships between simulation events and entities. An LLM-based agent uses iterative reasoning, generating interdependent sub-questions. For each sub-question, it creates Cypher queries for KG interaction, extracts information, and self-reflects to correct errors. This adaptive, iterative, and self-correcting process identifies operational issues mimicking human analysis. Our DES approach for warehouse bottleneck identification, tested with equipment breakdowns and process irregularities, outperforms baseline methods. For operational questions, it achieves near-perfect pass rates in pinpointing inefficiencies. For complex investigative questions, we demonstrate its superior diagnostic ability to uncover subtle, interconnected issues. This work bridges simulation modeling and AI (KG+LLM), offering a more intuitive method for actionable insights, reducing time-to-insight, and enabling automated warehouse inefficiency evaluation and diagnosis.
zh
[AI-31] Understanding Prompt Programming Tasks and Questions
【速读】:该论文旨在解决当前生成式 AI(Generative AI)领域中提示编程(prompt programming)支持不足的问题,特别是开发者在构建和优化提示时面临任务繁杂、工具缺失导致的效率低下与认知负担。其关键解决方案是通过系统性实证研究,构建了一个包含25项提示编程任务和51个开发者提问的分类体系,并量化各任务与问题的重要性;进一步对比分析48个研究与商业工具后发现,所有任务均需手动完成,且超过三分之一的重要问题仍未被有效解答,从而明确了未来提示编程工具设计的关键改进方向。
链接: https://arxiv.org/abs/2507.17264
作者: Jenny T. Liang,Chenyang Yang,Agnia Sergeyuk,Travis D. Breaux,Brad A. Myers
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Prompting foundation models (FMs) like large language models (LLMs) have enabled new AI-powered software features (e.g., text summarization) that previously were only possible by fine-tuning FMs. Now, developers are embedding prompts in software, known as prompt programs. The process of prompt programming requires the developer to make many changes to their prompt. Yet, the questions developers ask to update their prompt are unknown, despite the answers to these questions affecting how developers plan their changes. With the growing number of research and commercial prompt programming tools, it is unclear whether prompt programmers' needs are being adequately addressed. We address these challenges by developing a taxonomy of 25 tasks prompt programmers do and 51 questions they ask, measuring the importance of each task and question. We interview 16 prompt programmers, observe 8 developers make prompt changes, and survey 50 developers. We then compare the taxonomy with 48 research and commercial tools. We find that prompt programming is not well-supported: all tasks are done manually, and 16 of the 51 questions – including a majority of the most important ones – remain unanswered. Based on this, we outline important opportunities for prompt programming tools.
zh
[AI-32] Students Feedback Requests and Interactions with the SCRIPT Chatbot: Do They Get What They Ask For?
【速读】:该论文旨在解决初学者在编程学习过程中缺乏个性化反馈和支持的问题,尤其是在生成式 AI(Generative AI)辅助教学场景下如何有效提供结构化与灵活性兼具的指导。其解决方案的关键在于设计并实现了一个名为 SCRIPT 的基于 ChatGPT-4o-mini 的对话机器人,该系统通过预设提示词(predefined prompts)提供结构化引导,同时支持开放式的交互方式,从而在满足学生反馈请求类型偏好(如语法纠错、逻辑解释等)的同时保持对系统提示约束的遵守,实验结果显示其响应与学生所需反馈类型高度匹配(75%),验证了该设计在平衡引导性与灵活性方面的有效性。
链接: https://arxiv.org/abs/2507.17258
作者: Andreas Scholl,Natalie Kiesler
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at PPIG 2025
Abstract:Building on prior research on Generative AI (GenAI) and related tools for programming education, we developed SCRIPT, a chatbot based on ChatGPT-4o-mini, to support novice learners. SCRIPT allows for open-ended interactions and structured guidance through predefined prompts. We evaluated the tool via an experiment with 136 students from an introductory programming course at a large German university and analyzed how students interacted with SCRIPT while solving programming tasks with a focus on their feedback preferences. The results reveal that students’ feedback requests seem to follow a specific sequence. Moreover, the chatbot responses aligned well with students’ requested feedback types (in 75%), and it adhered to the system prompt constraints. These insights inform the design of GenAI-based learning support systems and highlight challenges in balancing guidance and flexibility in AI-assisted tools.
zh
[AI-33] Agent Identity Evals: Measuring Agentic Identity
【速读】:该论文旨在解决语言模型代理(Language Model Agents, LMAs)在长时间运行中因继承大语言模型(Large Language Models, LLMs)的固有缺陷(如状态无关性、随机性、对提示敏感及语言中介依赖)而导致的代理身份不稳定问题,这些问题会削弱其可识别性、连续性、持久性和一致性,进而影响其推理、规划与执行等核心代理能力。解决方案的关键在于提出了一套名为“代理身份评估”(Agent Identity Evaluations, AIE)的严谨、统计驱动且基于实证的框架,用于量化衡量LMA系统随时间保持其代理身份的程度,包括能力、属性以及从状态扰动中恢复的能力;AIE包含一组新设计的指标,可与性能、能力及代理鲁棒性测量方法集成,从而辅助优化LMA基础设施(如记忆模块和工具调用机制)的设计与部署。
链接: https://arxiv.org/abs/2507.17257
作者: Elija Perrier,Michael Timothy Bennett
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Central to agentic capability and trustworthiness of language model agents (LMAs) is the extent to which they maintain stable, reliable identity over time. However, LMAs inherit pathologies from large language models (LLMs) (statelessness, stochasticity, sensitivity to prompts, and linguistic intermediation) which can undermine their identifiability, continuity, persistence and consistency. This attrition of identity can erode their reliability, trustworthiness and utility by interfering with their agentic capabilities such as reasoning, planning and action. To address these challenges, we introduce agent identity evals (AIE), a rigorous, statistically-driven, empirical framework for measuring the degree to which an LMA system exhibits and maintains its agentic identity over time, including its capabilities, properties and ability to recover from state perturbations. AIE comprises a set of novel metrics which can integrate with other measures of performance, capability and agentic robustness to assist in the design of optimal LMA infrastructure and scaffolding such as memory and tools. We set out formal definitions and methods that can be applied at each stage of the LMA life-cycle, and worked examples of how to apply them.
zh
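上文 AIE 摘要强调用统计指标度量代理身份随时间的稳定性。下面给出一个假设性的粗略指标示意(非论文官方定义):对同一探针问题的多次回答统计两两一致率,作为"一致性"维度的一个简单分数:

```python
from itertools import combinations

def consistency_score(answers: list) -> float:
    """多次回答的两两一致率;回答不足两个时按完全一致计。"""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

实际的 AIE 指标还需覆盖可识别性、持久性与扰动后恢复等维度,这里仅示意其中最简单的一项如何落到可计算的统计量上。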
[AI-34] Reality Proxy: Fluid Interactions with Real-World Objects in MR via Abstract Representations
【速读】:该论文旨在解决混合现实(Mixed Reality, MR)中因物体拥挤、距离较远或部分遮挡而导致的交互困难问题,这些问题源于直接在物理对象上进行交互时输入与物理约束紧密耦合的局限性。解决方案的关键在于引入“代理(proxy)”——即对真实世界物体的抽象表示,从而将交互目标从物理对象解耦至其代理。通过Reality Proxy系统,用户可在选择过程中无缝切换交互目标,并借助生成式AI(Generative AI)为代理赋予语义属性和层次化空间关系,实现诸如浏览、基于属性筛选、嵌套分组导航及复杂多对象选择等新型交互操作,且无需新增手势或菜单系统,显著提升了MR场景下的交互效率与灵活性。
链接: https://arxiv.org/abs/2507.17248
作者: Xiaoan Liu,Difan Jia,Xianhao Carton Liu,Mar Gonzalez-Franco,Chen Zhu-Tian
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 16 pages, 9 figures. Accepted for publication in UIST’25 (The 38th Annual ACM Symposium on User Interface Software and Technology), Busan, Republic of Korea, 28 Sep - 1 Oct 2025
Abstract:Interacting with real-world objects in Mixed Reality (MR) often proves difficult when they are crowded, distant, or partially occluded, hindering straightforward selection and manipulation. We observe that these difficulties stem from performing interaction directly on physical objects, where input is tightly coupled to their physical constraints. Our key insight is to decouple interaction from these constraints by introducing proxies-abstract representations of real-world objects. We embody this concept in Reality Proxy, a system that seamlessly shifts interaction targets from physical objects to their proxies during selection. Beyond facilitating basic selection, Reality Proxy uses AI to enrich proxies with semantic attributes and hierarchical spatial relationships of their corresponding physical objects, enabling novel and previously cumbersome interactions in MR - such as skimming, attribute-based filtering, navigating nested groups, and complex multi-object selections - all without requiring new gestures or menu systems. We demonstrate Reality Proxy's versatility across diverse scenarios, including office information retrieval, large-scale spatial navigation, and multi-drone control. An expert evaluation suggests the system's utility and usability, suggesting that proxy-based abstractions offer a powerful and generalizable interaction paradigm for future MR systems.
zh
[AI-35] DistrAttention: An Efficient and Flexible Self-Attention Mechanism on Modern GPUs
【速读】:该论文旨在解决Transformer架构中自注意力(self-attention)机制因计算复杂度与输入序列长度呈平方关系而导致的可扩展性瓶颈问题。现有优化方法通常要么丢失完整上下文信息,要么灵活性不足。其解决方案的关键在于提出DistrAttention机制,通过在嵌入维度(embedding dimensionality, d)上对数据进行分组,并结合局部敏感哈希(locality-sensitive hashing, LSH)实现轻量级采样与融合,从而保留全上下文信息的同时提升效率;进一步设计块级分组框架以控制LSH引入的误差,并通过优化块大小实现与FlashAttention-2的无缝集成,在现代GPU上获得高性能。实验表明,DistrAttention在ViT推理中兼具速度与精度优势,在Llama3-1B模型中仅损失1%准确率即可实现最低推理延迟。
链接: https://arxiv.org/abs/2507.17245
作者: Haolin Jin,Mengbai Xiao,Yuan Yuan,Xiao Zhang,Dongxiao Yu,Guanghui Zhang,Haoliang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The Transformer architecture has revolutionized deep learning, delivering the state-of-the-art performance in areas such as natural language processing, computer vision, and time series prediction. However, its core component, self-attention, has quadratic time complexity relative to input sequence length, which hinders the scalability of Transformers. Existing approaches to optimizing self-attention either discard full-contextual information or lack flexibility. In this work, we design DistrAttention, an efficient and flexible self-attention mechanism with the full context. DistrAttention achieves this by grouping data on the embedding dimensionality, usually referred to as d. We realize DistrAttention with a lightweight sampling and fusion method that exploits locality-sensitive hashing to group similar data. A block-wise grouping framework is further designed to limit the errors introduced by locality-sensitive hashing. By optimizing the selection of block sizes, DistrAttention could be easily integrated with FlashAttention-2, gaining high performance on modern GPUs. We evaluate DistrAttention with extensive experiments. The results show that our method is 37% faster than FlashAttention-2 on calculating self-attention. In ViT inference, DistrAttention is the fastest and the most accurate among approximate self-attention mechanisms. In Llama3-1B, DistrAttention still achieves the lowest inference time with only 1% accuracy loss.
zh
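上文摘要中"在嵌入维度上分组并融合"的思路,可以用随机投影 LSH 做一个极简示意:把相似的维度列哈希进同一桶,桶内均值融合后近似计算注意力得分矩阵(假设性实现,省略了论文中的块级分组与 FlashAttention-2 集成):

```python
import numpy as np

def approx_attention_scores(Q: np.ndarray, K: np.ndarray,
                            num_bits: int = 2, seed: int = 0) -> np.ndarray:
    """按嵌入维度分桶后近似 Q @ K.T;Q 形状 (m, d),K 形状 (n, d)。"""
    n, d = K.shape
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, n))
    codes = (planes @ K > 0).astype(int)               # 每个维度列的符号码
    buckets = (codes * (2 ** np.arange(num_bits))[:, None]).sum(axis=0)
    scores = np.zeros((Q.shape[0], n))
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        q_bar = Q[:, idx].mean(axis=1)                 # 桶内维度均值融合
        k_bar = K[:, idx].mean(axis=1)
        scores += len(idx) * np.outer(q_bar, k_bar)
    return scores
```

当桶内各维度列完全相同时,该近似与精确的 Q @ K.T 一致;一般情形下误差取决于分桶质量,这也是论文引入块级分组来控制误差的原因。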
[AI-36] Eco-Friendly AI: Unleashing Data Power for Green Federated Learning
【速读】:该论文旨在解决人工智能(AI)与机器学习(ML)模型训练过程中带来的显著环境影响问题,尤其是能源消耗和碳排放。其核心挑战在于:大规模数据训练导致高能耗,而传统的集中式训练方式又面临数据传输成本高和隐私保护难的问题。为应对这一问题,论文提出了一种以数据为中心的绿色联邦学习(Green Federated Learning, Green FL)方法,其关键解决方案是通过优化数据选择与节点配置来最小化训练过程中的环境影响——具体包括对联邦数据集特征的分析、基于质量指标选取最优数据子集,以及优先选择环境影响最低的计算节点。该方法在时间序列分类任务中验证了有效性,能显著降低联邦学习的碳足迹。
链接: https://arxiv.org/abs/2507.17241
作者: Mattia Sabella,Monica Vitali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The widespread adoption of Artificial Intelligence (AI) and Machine Learning (ML) comes with a significant environmental impact, particularly in terms of energy consumption and carbon emissions. This pressing issue highlights the need for innovative solutions to mitigate AI’s ecological footprint. One of the key factors influencing the energy consumption of ML model training is the size of the training dataset. ML models are often trained on vast amounts of data continuously generated by sensors and devices distributed across multiple locations. To reduce data transmission costs and enhance privacy, Federated Learning (FL) enables model training without the need to move or share raw data. While FL offers these advantages, it also introduces challenges due to the heterogeneity of data sources (related to volume and quality), computational node capabilities, and environmental impact. This paper contributes to the advancement of Green AI by proposing a data-centric approach to Green Federated Learning. Specifically, we focus on reducing FL’s environmental impact by minimizing the volume of training data. Our methodology involves the analysis of the characteristics of federated datasets, the selecting of an optimal subset of data based on quality metrics, and the choice of the federated nodes with the lowest environmental impact. We develop a comprehensive methodology that examines the influence of data-centric factors, such as data quality and volume, on FL training performance and carbon emissions. Building on these insights, we introduce an interactive recommendation system that optimizes FL configurations through data reduction, minimizing environmental impact during training. Applying this methodology to time series classification has demonstrated promising results in reducing the environmental impact of FL tasks. 
zh
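上文绿色联邦学习方法的要点之一,是按质量指标与环境影响挑选参与节点、缩减训练数据总量。下面给出一个假设性的贪心选择示意(字段与排序规则均为演示用设定,非论文的完整推荐系统):

```python
def select_nodes(nodes: list, min_samples: int) -> list:
    """nodes: [(name, num_samples, quality, carbon_intensity), ...];
    按质量降序、碳强度升序贪心选节点,直到满足最小数据量约束。"""
    ranked = sorted(nodes, key=lambda x: (-x[2], x[3]))
    chosen, total = [], 0
    for name, num, _, _ in ranked:
        if total >= min_samples:
            break
        chosen.append(name)
        total += num
    return chosen
```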
[AI-37] P3SL: Personalized Privacy-Preserving Split Learning on Heterogeneous Edge Devices
【速读】:该论文旨在解决异构边缘设备环境下Split Learning(SL)面临的个性化隐私保护与本地模型定制难题,尤其是在设备计算资源、通信能力、环境条件及隐私需求差异显著时,现有方法往往忽略个体差异导致隐私泄露风险增加或模型性能下降。其解决方案的关键在于提出P3SL(Personalized Privacy-Preserving Split Learning)框架:一方面设计了一种个性化的顺序式分层学习流水线,使每个客户端可根据自身资源、环境和隐私偏好调整本地模型与分割点;另一方面采用双层优化机制,使客户端能够在不向服务器暴露敏感信息(如计算能力、环境状态、隐私要求)的前提下自主确定最优个性化分割点,从而在保障模型精度的同时平衡能耗与隐私泄露风险。
链接: https://arxiv.org/abs/2507.17228
作者: Wei Fan,JinYi Yoon,Xiaochang Li,Huajie Shao,Bo Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted as invited paper in The 34th International Conference on Computer Communications and Networks (ICCCN 2025)
Abstract:Split Learning (SL) is an emerging privacy-preserving machine learning technique that enables resource constrained edge devices to participate in model training by partitioning a model into client-side and server-side sub-models. While SL reduces computational overhead on edge devices, it encounters significant challenges in heterogeneous environments where devices vary in computing resources, communication capabilities, environmental conditions, and privacy requirements. Although recent studies have explored heterogeneous SL frameworks that optimize split points for devices with varying resource constraints, they often neglect personalized privacy requirements and local model customization under varying environmental conditions. To address these limitations, we propose P3SL, a Personalized Privacy-Preserving Split Learning framework designed for heterogeneous, resource-constrained edge device systems. The key contributions of this work are twofold. First, we design a personalized sequential split learning pipeline that allows each client to achieve customized privacy protection and maintain personalized local models tailored to their computational resources, environmental conditions, and privacy needs. Second, we adopt a bi-level optimization technique that empowers clients to determine their own optimal personalized split points without sharing private sensitive information (i.e., computational resources, environmental conditions, privacy requirements) with the server. This approach balances energy consumption and privacy leakage risks while maintaining high model accuracy. We implement and evaluate P3SL on a testbed consisting of 7 devices including 4 Jetson Nano P3450 devices, 2 Raspberry Pis, and 1 laptop, using diverse model architectures and datasets under varying environmental conditions.
zh
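上文 P3SL 摘要中,客户端需在本地权衡能耗与隐私泄露风险来确定个性化分割点,且无需把这些敏感信息上报服务器。下面给出一个假设性的打分式示意(候选分割点的能耗与泄露风险假定已归一化;真实系统采用双层优化而非简单加权打分):

```python
def choose_split_point(candidates: list, alpha: float = 0.5) -> int:
    """candidates: [(layer_idx, energy_cost, privacy_leakage), ...],均已归一化到 [0, 1]。
    返回加权代价最小的分割层编号;alpha 越大越看重能耗。"""
    return min(candidates, key=lambda c: alpha * c[1] + (1 - alpha) * c[2])[0]
```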
[AI-38] Our Cars Can Talk: How IoT Brings AI to Vehicles
【速读】:该论文旨在解决传统车辆维护模式中反应式维修导致的效率低下与成本高昂问题,通过将人工智能(Artificial Intelligence, AI)集成到车辆中,使其成为具备感知能力的智能平台,从而实现从被动响应向主动预测的转变。其解决方案的关键在于引入AI协作者(AI copilot),该协作者能够同时理解机器数据和驾驶员意图,促进车辆系统与用户之间的高效交互,并推动智能车辆系统、预测性维护及AI驱动的人机协同技术的融合发展。
链接: https://arxiv.org/abs/2507.17214
作者: Amod Kant Agrawal
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注: 3 pages, 1 figure; To appear in IEEE Computer (Nov 2025)
Abstract:Bringing AI to vehicles and enabling them as sensing platforms is key to transforming maintenance from reactive to proactive. Now is the time to integrate AI copilots that speak both languages: machine and driver. This article offers a conceptual and technical perspective intended to spark interdisciplinary dialogue and guide future research and development in intelligent vehicle systems, predictive maintenance, and AI-powered user interaction.
zh
[AI-39] Dispatch-Aware Deep Neural Network for Optimal Transmission Switching: Toward Real-Time and Feasibility Guaranteed Operation
【速读】:该论文旨在解决最优潮流(Optimal Power Flow, OPF)中因引入最优输电线路切换(Optimal Transmission Switching, OTS)所导致的混合整数规划问题计算复杂度高、难以在大规模电网中应用的问题。其核心解决方案是提出一种调度感知深度神经网络(Dispatch-Aware Deep Neural Network, DA-DNN),该模型通过将预测的线路状态输入可微分的直流最优潮流(DC-OPF)层,并以生成成本作为损失函数,使训练和推理过程中始终满足物理网络约束。关键创新在于采用定制化的权重-偏置初始化策略,确保从第一轮前向传播起即保持可行解,从而实现稳定学习并显著提升大规模电网场景下的可扩展性。训练完成后,DA-DNN可在与求解DC-OPF相当的时间内输出一个理论上可行的拓扑与调度组合,同时有效捕捉OTS带来的经济优势。
链接: https://arxiv.org/abs/2507.17194
作者: Minsoo Kim,Jip Kim
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures
Abstract:Optimal transmission switching (OTS) improves optimal power flow (OPF) by selectively opening transmission lines, but its mixed-integer formulation increases computational complexity, especially on large grids. To deal with this, we propose a dispatch-aware deep neural network (DA-DNN) that accelerates DC-OTS without relying on pre-solved labels. DA-DNN predicts line states and passes them through a differentiable DC-OPF layer, using the resulting generation cost as the loss function so that all physical network constraints are enforced throughout training and inference. In addition, we adopt a customized weight-bias initialization that keeps every forward pass feasible from the first iteration, which allows stable learning on large grids. Once trained, the proposed DA-DNN produces a provably feasible topology and dispatch pair in the same time as solving the DCOPF, whereas conventional mixed-integer solvers become intractable. As a result, the proposed method successfully captures the economic advantages of OTS while maintaining scalability.
zh
[AI-40] LLM Meets the Sky: Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks
【速读】:该论文旨在解决异构无人机网络(HetUAVNs)中物理层安全(PLS)问题,目标是在推进能量约束下最大化保密速率。与以往假设无人机能力一致或忽略能效-安全权衡的研究不同,本文考虑了具有不同载荷和计算资源的无人机协同服务地面终端并抵御窃听者的现实场景。解决方案的关键在于提出了一种分层优化框架:内层采用基于半定松弛(SDR)的S2DC算法,结合罚函数与凸差(d.c.)规划求解固定无人机位置下的保密预编码问题;外层引入大语言模型(LLM)引导的多智能体强化学习方法(LLM-HeMARL),通过LLM生成的专家启发式策略指导无人机学习节能且安全驱动的轨迹,避免实时调用LLM带来的推理开销,从而实现高效、鲁棒的联合运动与通信优化。
链接: https://arxiv.org/abs/2507.17188
作者: Lijie Zheng,Ji He,Shih Yu Chang,Yulong Shen,Dusit Niyato
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Submitted to IEEE Transactions on Mobile Computing
Abstract:This work tackles the physical layer security (PLS) problem of maximizing the secrecy rate in heterogeneous UAV networks (HetUAVNs) under propulsion energy constraints. Unlike prior studies that assume uniform UAV capabilities or overlook energy-security trade-offs, we consider a realistic scenario where UAVs with diverse payloads and computation resources collaborate to serve ground terminals in the presence of eavesdroppers. To manage the complex coupling between UAV motion and communication, we propose a hierarchical optimization framework. The inner layer uses a semidefinite relaxation (SDR)-based S2DC algorithm combining penalty functions and difference-of-convex (d.c.) programming to solve the secrecy precoding problem with fixed UAV positions. The outer layer introduces a Large Language Model (LLM)-guided heuristic multi-agent reinforcement learning approach (LLM-HeMARL) for trajectory optimization. LLM-HeMARL efficiently incorporates expert heuristic policies generated by the LLM, enabling UAVs to learn energy-aware, security-driven trajectories without the inference overhead of real-time LLM calls. The simulation results show that our method outperforms existing baselines in secrecy rate and energy efficiency, with consistent robustness across varying UAV swarm sizes and random seeds.
zh
[AI-41] Regret Minimization in Population Network Games: Vanishing Heterogeneity and Convergence to Equilibria
【速读】:该论文旨在解决大规模多智能体系统中异质性对均衡形成影响的理论难题,特别是如何在多样化初始策略下实现行为一致性与均衡收敛。其解决方案的关键在于通过平滑后悔匹配(smooth regret-matching)机制,将大量具有不同初始策略的智能体引导至统一的行为模式;研究进一步将系统状态建模为后悔分布的概率密度,并利用连续性方程分析其演化过程,发现后悔分布的方差随时间衰减,从而导致异质性消失并促使智能体达成共识,最终证明在竞争与合作场景下均可收敛至量化响应均衡(quantal response equilibrium)。
链接: https://arxiv.org/abs/2507.17183
作者: Die Hu,Shuyue Hu,Chunjiang Mu,Shiqi Fan,Chen Chu,Jinzhuo Liu,Zhen Wang
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Understanding and predicting the behavior of large-scale multi-agents in games remains a fundamental challenge in multi-agent systems. This paper examines the role of heterogeneity in equilibrium formation by analyzing how smooth regret-matching drives a large number of heterogeneous agents with diverse initial policies toward unified behavior. By modeling the system state as a probability distribution of regrets and analyzing its evolution through the continuity equation, we uncover a key phenomenon in diverse multi-agent settings: the variance of the regret distribution diminishes over time, leading to the disappearance of heterogeneity and the emergence of consensus among agents. This universal result enables us to prove convergence to quantal response equilibria in both competitive and cooperative multi-agent settings. Our work advances the theoretical understanding of multi-agent learning and offers a novel perspective on equilibrium selection in diverse game-theoretic scenarios.
zh
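平滑后悔匹配的一种常见平滑化方式,是对累计后悔向量取 softmax 得到混合策略。下面的草图仅为概念演示(非论文的精确动力学,温度参数 tau 为假设值):

```python
import math

def smooth_regret_policy(regrets, tau=1.0):
    """对累计后悔向量取 softmax,得到平滑的后悔匹配混合策略。"""
    exps = [math.exp(r / tau) for r in regrets]
    z = sum(exps)
    return [e / z for e in exps]

# 后悔越大的动作,被选中的概率越高
policy = smooth_regret_policy([2.0, 0.5, -1.0])
print(policy)
```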
[AI-42] Improving LLMs' Generalized Reasoning Abilities by Graph Problems
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对新颖且复杂的推理任务时性能下降的问题,尤其是现有领域特定持续预训练(Domain-specific Continued Pretraining, CPT)方法缺乏向更广泛推理任务迁移能力的局限性。其解决方案的关键在于引入图问题推理(Graph Problem Reasoning, GPR),通过设计首个大规模GPR语料库GraphPile(包含109亿token和23种图任务),涵盖路径查找、网络分析、数值计算与拓扑推理等多样化逻辑关系任务,并结合链式思维(chain-of-thought)、程序式思维(program-of-thought)及执行轨迹数据进行训练。实验表明,基于GraphPile微调后的模型(如GraphMind)在数学推理上准确率提升达4.9%,非数学推理任务(如逻辑与常识推理)提升高达21.2%,从而显著增强了LLMs的通用推理能力与跨任务适应性。
链接: https://arxiv.org/abs/2507.17168
作者: Qifan Zhang,Nuo Chen,Zehua Li,Miao Peng,Jing Tang,Jia Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: COLM2025
Abstract:Large Language Models (LLMs) have made remarkable strides in reasoning tasks, yet their performance often falters on novel and complex problems. Domain-specific continued pretraining (CPT) methods, such as those tailored for mathematical reasoning, have shown promise but lack transferability to broader reasoning tasks. In this work, we pioneer the use of Graph Problem Reasoning (GPR) to enhance the general reasoning capabilities of LLMs. GPR tasks, spanning pathfinding, network analysis, numerical computation, and topological reasoning, require sophisticated logical and relational reasoning, making them ideal for teaching diverse reasoning patterns. To achieve this, we introduce GraphPile, the first large-scale corpus specifically designed for CPT using GPR data. Spanning 10.9 billion tokens across 23 graph tasks, the dataset includes chain-of-thought, program-of-thought, trace of execution, and real-world graph data. Using GraphPile, we train GraphMind on popular base models Llama 3 and 3.1, as well as Gemma 2, achieving up to 4.9 percent higher accuracy in mathematical reasoning and up to 21.2 percent improvement in non-mathematical reasoning tasks such as logical and commonsense reasoning. By being the first to harness GPR for enhancing reasoning patterns and introducing the first dataset of its kind, our work bridges the gap between domain-specific pretraining and universal reasoning capabilities, advancing the adaptability and robustness of LLMs.
zh
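以路径查找类图任务为例,GraphPile 这类语料的样本可以由“图 + 标准算法求解”自动构造出带标准答案的文本。下面是一个极简示意(非官方数据管线,图结构与文案均为假设):

```python
from collections import deque

def bfs_shortest_path(adj, start, goal):
    """无权图最短路径(BFS),可作为图推理训练样本的标准答案。"""
    prev = {start: None}
    q = deque([start])
    while q:
        u = q.popleft()
        if u == goal:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None  # 不可达

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
path = bfs_shortest_path(adj, "A", "D")
sample = f"问题: 求 A 到 D 的最短路径。答案: {' -> '.join(path)}"
print(sample)
```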
[AI-43] Tabular Diffusion based Actionable Counterfactual Explanations for Network Intrusion Detection
【速读】:该论文旨在解决现代网络入侵检测系统(Network Intrusion Detection Systems, NIDS)中深度学习模型“黑箱”特性所带来的可解释性问题,即缺乏对检测决策成因的清晰理解,从而影响用户信任及快速响应攻击的能力。解决方案的关键在于提出一种基于扩散机制(diffusion-based)的反事实解释(counterfactual explanation)框架,该框架能够生成最小且多样化的反事实样本,并以更高效的方式降低解释生成时间;同时,通过将局部反事实解释归纳为全局规则,实现从实例级到系统级的可操作防御策略,显著提升入侵检测与防御机制的有效性和实用性。
链接: https://arxiv.org/abs/2507.17161
作者: Vinura Galwaduge,Jagath Samarabandu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern network intrusion detection systems (NIDS) frequently utilize the predictive power of complex deep learning models. However, the “black-box” nature of such deep learning methods adds a layer of opaqueness that hinders the proper understanding of detection decisions, trust in the decisions and prevent timely countermeasures against such attacks. Explainable AI (XAI) methods provide a solution to this problem by providing insights into the causes of the predictions. The majority of the existing XAI methods provide explanations which are not convenient to convert into actionable countermeasures. In this work, we propose a novel diffusion-based counterfactual explanation framework that can provide actionable explanations for network intrusion attacks. We evaluated our proposed algorithm against several other publicly available counterfactual explanation algorithms on 3 modern network intrusion datasets. To the best of our knowledge, this work also presents the first comparative analysis of existing counterfactual explanation algorithms within the context of network intrusion detection systems. Our proposed method provide minimal, diverse counterfactual explanations out of the tested counterfactual explanation algorithms in a more efficient manner by reducing the time to generate explanations. We also demonstrate how counterfactual explanations can provide actionable explanations by summarizing them to create a set of global rules. These rules are actionable not only at instance level but also at the global level for intrusion attacks. These global counterfactual rules show the ability to effectively filter out incoming attack queries which is crucial for efficient intrusion detection and defense mechanisms.
zh
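反事实解释的核心思想——找到能翻转模型判定的最小输入改动——可以在一个玩具线性分类器上示意。论文实际采用扩散模型生成反事实,下面仅为该概念的极简演示(模型、权重与特征均为虚构):

```python
def linear_score(w, x, b):
    """线性打分:score > 0 判为“攻击”,score < 0 判为“正常”。"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def single_feature_counterfactual(w, x, b, eps=1e-6):
    """只改动一个特征时,把判定从“攻击”翻转为“正常”的最小改动。"""
    best = None
    for i, wi in enumerate(w):
        if wi == 0:
            continue
        delta = -(linear_score(w, x, b) + eps) / wi  # 使 score 恰好降到 0 以下
        cand = list(x)
        cand[i] += delta
        cost = abs(delta)
        if best is None or cost < best[0]:
            best = (cost, i, cand)
    return best

w, b = [0.5, 2.0], -1.0
x = [2.0, 1.0]  # score = 2.0,判为“攻击”
cost, idx, x_cf = single_feature_counterfactual(w, x, b)
print(idx, x_cf)  # 改动权重最大的特征,所需代价最小
```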
[AI-44] JAM: Keypoint-Guided Joint Prediction after Classification-Aware Marginal Proposal for Multi-Agent Interaction IROS2025
【速读】:该论文旨在解决多智能体联合预测中低概率模式生成质量差的问题,特别是在复杂交互场景下,现有方法难以有效捕捉和生成多样化的轨迹分布。其解决方案的关键在于提出一种两阶段的多智能体交互预测框架——JAM(Keypoint-guided Joint Prediction after Classification-aware Marginal Proposal),第一阶段通过分类感知的边际提议(classification-aware marginal proposal)对查询进行轨迹类型分类,引导模型学习各类轨迹模式并提供全面的初始分布信息;第二阶段则基于场景上下文与第一阶段的边际提议,引入关键点(keypoint)作为指导信号,显式增强联合预测模块对初始轨迹关键信息的捕捉与利用能力,从而提升高精度、高多样性的联合轨迹预测性能。
链接: https://arxiv.org/abs/2507.17152
作者: Fangze Lin,Ying He,Fei Yu,Hong Zhang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: IROS 2025 Accepted
Abstract:Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low-quality generation of low-probability modes in multi-agent joint prediction. To tackle this issue, we propose a two-stage multi-agent interactive prediction framework named keypoint-guided joint prediction after classification-aware marginal proposal (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real-world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state-of-the-art performance in interactive trajectory prediction. The code is available at this https URL to facilitate future research.
zh
[AI-45] Towards Human-level Intelligence via Human-like Whole-Body Manipulation
【速读】:该论文旨在解决通用智能机器人(general-purpose intelligent robots)在现实世界中实现全身体协调操作(whole-body manipulation)的核心挑战,具体包括:(1)设计具备人类水平物理能力的安全机器人硬件;(2)开发直观且可扩展的全身遥操作接口以收集高质量数据;(3)构建能够从人类示范中学习全身视觉运动策略(visuomotor policies)的算法。解决方案的关键在于提出Astribot Suite——一个统一框架,通过整合具身(embodiment)、遥操作界面与学习流水线,实现了对多样化环境和日常任务中全身体协调能力的有效训练与部署,标志着向真实场景下通用型机器人迈进的重要一步。
链接: https://arxiv.org/abs/2507.17141
作者: Guang Gao,Jianan Wang,Jinbo Zuo,Junnan Jiang,Jingfan Zhang,Xianwen Zeng,Yuejiang Zhu,Lianyang Ma,Ke Chen,Minhua Sheng,Ruirui Zhang,Zhaohui An
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Building general-purpose intelligent robots has long been a fundamental goal of robotics. A promising approach is to mirror the evolutionary trajectory of humans: learning through continuous interaction with the environment, with early progress driven by the imitation of human behaviors. Achieving this goal presents three core challenges: (1) designing safe robotic hardware with human-level physical capabilities; (2) developing an intuitive and scalable whole-body teleoperation interface for data collection; and (3) creating algorithms capable of learning whole-body visuomotor policies from human demonstrations. To address these challenges in a unified framework, we propose Astribot Suite, a robot learning suite for whole-body manipulation aimed at general daily tasks across diverse environments. We demonstrate the effectiveness of our system on a wide range of activities that require whole-body coordination, extensive reachability, human-level dexterity, and agility. Our results show that Astribot’s cohesive integration of embodiment, teleoperation interface, and learning pipeline marks a significant step towards real-world, general-purpose whole-body robotic manipulation, laying the groundwork for the next generation of intelligent robots.
zh
[AI-46] Resilient Multi-Agent Negotiation for Medical Supply Chains: Integrating LLMs and Blockchain for Transparent Coordination
【速读】:该论文旨在解决全球健康紧急事件(如新冠疫情)暴露出的传统医疗供应链在资源分配效率低、透明度不足及应对动态中断能力差等方面的突出问题。其解决方案的关键在于提出了一种融合区块链技术与去中心化大语言模型(Large Language Model, LLM)驱动的多智能体协商系统的混合框架:通过LLM赋能的自主代理(代表制造商、分销商和医疗机构)实现情境感知的结构化协商与决策,提升稀缺医疗资源的快速且伦理化的分配能力;同时,链上区块链层利用智能合约确保决策的不可篡改性、透明性和可审计性,链下代理层支持本地自适应推理与决策,二者通过正式的跨层通信协议协同工作,从而显著增强医疗供应链在危机中的韧性与问责机制。
链接: https://arxiv.org/abs/2507.17134
作者: Mariam ALMutairi,Hyungmin Kim
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 11 pages, 6 figures
Abstract:Global health emergencies, such as the COVID-19 pandemic, have exposed critical weaknesses in traditional medical supply chains, including inefficiencies in resource allocation, lack of transparency, and poor adaptability to dynamic disruptions. This paper presents a novel hybrid framework that integrates blockchain technology with a decentralized, large language model (LLM) powered multi-agent negotiation system to enhance the resilience and accountability of medical supply chains during crises. In this system, autonomous agents-representing manufacturers, distributors, and healthcare institutions-engage in structured, context-aware negotiation and decision-making processes facilitated by LLMs, enabling rapid and ethical allocation of scarce medical resources. The off-chain agent layer supports adaptive reasoning and local decision-making, while the on-chain blockchain layer ensures immutable, transparent, and auditable enforcement of decisions via smart contracts. The framework also incorporates a formal cross-layer communication protocol to bridge decentralized negotiation with institutional enforcement. A simulation environment emulating pandemic scenarios evaluates the system’s performance, demonstrating improvements in negotiation efficiency, fairness of allocation, supply chain responsiveness, and auditability. This research contributes an innovative approach that synergizes blockchain trust guarantees with the adaptive intelligence of LLM-driven agents, providing a robust and scalable solution for critical supply chain coordination under uncertainty.
zh
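链上审计层的“不可篡改、可审计”性质,可以用哈希链做一个极简示意:每条分配决策的哈希都包含前一条的哈希,篡改任一历史记录都会导致校验失败。以下并非真实区块链或智能合约实现,字段名均为假设:

```python
import hashlib
import json

class AuditChain:
    """用哈希链模拟不可篡改的决策审计日志(极简示意)。"""
    def __init__(self):
        self.blocks = []

    def append(self, decision):
        prev_hash = self.blocks[-1]["hash"] if self.blocks else "0" * 64
        payload = json.dumps({"decision": decision, "prev": prev_hash}, sort_keys=True)
        self.blocks.append({"decision": decision, "prev": prev_hash,
                            "hash": hashlib.sha256(payload.encode()).hexdigest()})

    def verify(self):
        """逐块重算哈希并核对链接关系。"""
        for i, blk in enumerate(self.blocks):
            prev_hash = self.blocks[i - 1]["hash"] if i else "0" * 64
            payload = json.dumps({"decision": blk["decision"], "prev": prev_hash},
                                 sort_keys=True)
            if blk["prev"] != prev_hash or blk["hash"] != hashlib.sha256(payload.encode()).hexdigest():
                return False
        return True

chain = AuditChain()
chain.append({"item": "vaccine", "to": "hospital_A", "qty": 100})
chain.append({"item": "mask", "to": "clinic_B", "qty": 5000})
print(chain.verify())  # True;篡改任一历史记录后 verify() 将返回 False
```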
[AI-47] Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)代理在规则和领域知识频繁变化的环境中适应能力不足的问题,例如监管合规和用户风险筛查等场景。现有方法如离线微调和标准提示(standard prompting)无法在实际运行中有效适应新知识,导致性能下降。解决方案的关键在于提出自适应反思交互代理(Adaptive Reflective Interactive Agent, ARIA),其核心机制是通过结构化的自我对话评估自身不确定性,主动识别知识缺口并请求人类专家提供针对性解释或修正;随后系统性地将人类反馈更新至带时间戳的内部知识库,并通过比对与澄清查询检测和解决冲突或过时知识,从而实现测试时持续学习(test-time continual learning)。
链接: https://arxiv.org/abs/2507.17131
作者: Yufei He,Ruoyu Li,Alex Chen,Yue Liu,Yulin Chen,Yuan Sui,Cheng Chen,Yi Zhu,Luca Luo,Frank Yang,Bryan Hooi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model (LLM) agents often struggle in environments where rules and required domain knowledge frequently change, such as regulatory compliance and user risk screening. Current approaches, like offline fine-tuning and standard prompting, are insufficient because they cannot effectively adapt to new knowledge during actual operation. To address this limitation, we propose the Adaptive Reflective Interactive Agent (ARIA), an LLM agent framework designed specifically to continuously learn updated domain knowledge at test time. ARIA assesses its own uncertainty through structured self-dialogue, proactively identifying knowledge gaps and requesting targeted explanations or corrections from human experts. It then systematically updates an internal, timestamped knowledge repository with provided human guidance, detecting and resolving conflicting or outdated knowledge through comparisons and clarification queries. We evaluate ARIA on the realistic customer due diligence name screening task on TikTok Pay, alongside publicly available dynamic knowledge tasks. Results demonstrate significant improvements in adaptability and accuracy compared to baselines using standard offline fine-tuning and existing self-improving agents. ARIA is deployed within TikTok Pay serving over 150 million monthly active users, confirming its practicality and effectiveness for operational use in rapidly evolving environments.
zh
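ARIA 的“带时间戳知识库 + 冲突解决”机制可以用如下草图示意:同一主题的新指导覆盖过时条目,冲突且更旧的条目则被拒绝(可进一步触发澄清询问)。仅为概念演示,非论文实现:

```python
class KnowledgeRepo:
    """带时间戳的内部知识库:按主题存储,新知识覆盖过时条目。"""
    def __init__(self):
        self.entries = {}  # topic -> (timestamp, content)

    def update(self, topic, content, timestamp):
        old = self.entries.get(topic)
        if old is None or timestamp > old[0]:
            self.entries[topic] = (timestamp, content)
            return "updated"
        return "kept-existing"  # 与更新的知识冲突:保留现有条目

    def lookup(self, topic):
        entry = self.entries.get(topic)
        return entry[1] if entry else None

repo = KnowledgeRepo()
repo.update("screening_rule", "名单版本 v1", timestamp=1)
repo.update("screening_rule", "名单版本 v2", timestamp=5)
print(repo.lookup("screening_rule"))  # 名单版本 v2
```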
[AI-48] BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在服务系统中因推理资源密集和延迟敏感而导致的GPU内存利用率低、延迟高以及难以适应异构负载动态变化的问题。现有静态或连续批处理策略常导致内存碎片化、OOM错误频发,且无法有效保障服务等级目标(Service Level Objective, SLO)。其解决方案的关键在于提出BucketServe——一个基于桶的动态批处理框架,通过按序列长度将请求分组至大小同质的桶(bucket),显著减少填充开销并优化GPU内存使用;同时引入自适应桶分裂/合并机制与优先级感知调度策略,以缓解资源碎片并确保SLO合规,从而实现更高的吞吐量和系统负载容量。
链接: https://arxiv.org/abs/2507.17120
作者: Wanyi Zheng,Minxian Xu,Shengye Song,Kejiang Ye
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 9 pages
Abstract:Large language models (LLMs) have become increasingly popular in various areas, traditional business gradually shifting from rule-based systems to LLM-based solutions. However, the inference of LLMs is resource-intensive or latency-sensitive, posing significant challenges for serving systems. Existing LLM serving systems often use static or continuous batching strategies, which can lead to inefficient GPU memory utilization and increased latency, especially under heterogeneous workloads. These methods may also struggle to adapt to dynamic workload fluctuations, resulting in suboptimal throughput and potential service level objective (SLO) violations. In this paper, we introduce BucketServe, a bucket-based dynamic batching framework designed to optimize LLM inference performance. By grouping requests into size-homogeneous buckets based on sequence length, BucketServe minimizes padding overhead and optimizes GPU memory usage through real-time batch size adjustments preventing out-of-memory (OOM) errors. It introduces adaptive bucket splitting/merging and priority-aware scheduling to mitigate resource fragmentation and ensure SLO compliance. Experiment shows that BucketServe significantly outperforms UELLM in throughput, achieving up to 3.58x improvement. It can also handle 1.93x more request load under the SLO attainment of 80% compared with DistServe and demonstrates 1.975x higher system load capacity compared to the UELLM.
zh
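按序列长度分桶为何能减少填充开销,可以用一个小例子量化:同一 batch 内所有请求都要填充到最长序列,长短混杂时浪费最大;分桶后各桶内部长度相近,浪费大幅下降。桶宽等参数为假设值,并非 BucketServe 实现:

```python
def padding_tokens(lengths):
    """把一组请求拼成一个 batch 时,填充到最长序列所浪费的 token 数。"""
    return max(lengths) * len(lengths) - sum(lengths)

def bucketed_padding(lengths, bucket_size=128):
    """按长度区间分桶后,各桶内部各自填充的总浪费。"""
    buckets = {}
    for n in lengths:
        buckets.setdefault(n // bucket_size, []).append(n)
    return sum(padding_tokens(b) for b in buckets.values())

lengths = [30, 40, 500, 520, 1000]
print(padding_tokens(lengths))    # 2910:单一 batch 的填充浪费
print(bucketed_padding(lengths))  # 10:分桶后浪费显著减少
```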
[AI-49] HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study
【速读】:该论文旨在解决当前安全关键领域(如自动驾驶系统和机器人)中,基于生成式AI(Generative AI)的端到端(End-to-End, E2E)架构(如大语言模型LLMs和视觉语言模型VLMs)所带来的安全性评估难题。传统安全分析方法(如故障模式与影响分析FMEA和故障树分析FTA)在面对基础模型复杂的潜在表示(latent representations)形成机制时,其适用性受限。解决方案的关键在于提出HySAFE-AI——一种面向AI系统的混合安全架构分析框架,通过适配并改进传统方法,以更有效地评估AI系统在复杂架构下的安全性,从而为未来AI安全标准的发展提供指导。
链接: https://arxiv.org/abs/2507.17118
作者: Mandar Pitale,Jelena Frtunikj,Abhinaw Priyadershi,Vasu Singh,Maria Spence
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:AI has become integral to safety-critical areas like autonomous driving systems (ADS) and robotics. The architecture of recent autonomous systems are trending toward end-to-end (E2E) monolithic architectures such as large language models (LLMs) and vision language models (VLMs). In this paper, we review different architectural solutions and then evaluate the efficacy of common safety analyses such as failure modes and effect analysis (FMEA) and fault tree analysis (FTA). We show how these techniques can be improved for the intricate nature of the foundational models, particularly in how they form and utilize latent representations. We introduce HySAFE-AI, Hybrid Safety Architectural Analysis Framework for AI Systems, a hybrid framework that adapts traditional methods to evaluate the safety of AI systems. Lastly, we offer hints of future work and suggestions to guide the evolution of future AI safety standards.
zh
[AI-50] Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)微调大语言模型(Large Language Models, LLMs)时普遍认为需更新大部分参数的假设问题。研究发现,RL微调实际上仅激活并修改模型中一小部分子网络(通常为5%-30%的权重),这种现象称为“RL诱导的参数更新稀疏性”(RL-induced parameter update sparsity)。其关键解决方案在于:通过仅对这一稀疏但稳定的子网络进行微调,即可恢复全模型性能,并获得与完整微调几乎一致的参数结果,从而揭示了RL适应模型的本质并非全局调整,而是聚焦于局部、可复现的子结构——这为高效RL方法设计提供了新思路,并从彩票券假说(lottery ticket hypothesis)视角重新诠释了模型微调中的稀疏性机制。
链接: https://arxiv.org/abs/2507.17107
作者: Andrii Balashov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 6 figures
Abstract:Reinforcement learning (RL) is a key post-pretraining step for aligning large language models (LLMs) with complex tasks and human preferences. While it is often assumed that RL fine-tuning requires updating most of a model’s parameters, we challenge this assumption with a surprising finding: RL fine-tuning consistently modifies only a small subnetwork (typically 5-30% of weights), leaving most parameters unchanged. We call this phenomenon RL-induced parameter update sparsity. It arises naturally, without any sparsity constraints or parameter-efficient tuning, and appears across multiple RL algorithms (e.g., PPO, DPO, SimPO, PRIME) and model families (e.g., OpenAI, Meta, and open-source LLMs). Moreover, the subnetworks updated by RL show substantial overlap across different seeds, datasets, and algorithms-far exceeding chance-suggesting a partially transferable structure in the pretrained model. We show that fine-tuning only this sparse subnetwork recovers full model performance and yields parameters nearly identical to the fully fine-tuned model. Our analysis suggests this sparsity emerges because RL operates near the model’s original distribution, requiring only targeted changes. KL penalties, gradient clipping, and on-policy dynamics have limited effect on the sparsity pattern. These findings shed new light on how RL adapts models: not by shifting all weights, but by focusing training on a small, consistently updated subnetwork. This insight enables more efficient RL methods and reframes sparsity through the lens of the lottery ticket hypothesis.
zh
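论文度量的“参数更新稀疏性”与跨种子子网络重叠度,本质上是两个简单的统计量,可以用如下草图计算(纯演示,权重为玩具数据):

```python
def update_sparsity(w_before, w_after, tol=1e-8):
    """微调前后发生实质变化的参数占比。"""
    changed = sum(1 for a, b in zip(w_before, w_after) if abs(a - b) > tol)
    return changed / len(w_before)

def mask_overlap(w0, w_run1, w_run2, tol=1e-8):
    """两次独立训练所更新的参数子网络的 Jaccard 重叠度。"""
    m1 = {i for i, (a, b) in enumerate(zip(w0, w_run1)) if abs(a - b) > tol}
    m2 = {i for i, (a, b) in enumerate(zip(w0, w_run2)) if abs(a - b) > tol}
    return len(m1 & m2) / max(1, len(m1 | m2))

w0   = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
w_rl = [0.1, 0.2, 0.35, 0.4, 0.5, 0.6, 0.7, 0.85, 0.9, 1.0]
print(update_sparsity(w0, w_rl))  # 0.2:仅 20% 的权重被更新
```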
[AI-51] LoRA is All You Need for Safety Alignment of Reasoning LLMs
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在安全对齐微调(Safety Alignment Fine-Tuning, SFT)过程中出现的“安全税”(Safety Tax)问题,即安全对齐会显著损害模型的推理能力。解决方案的关键在于采用低秩适应(Low-Rank Adaptation, LoRA)技术对拒绝数据集(refusal datasets)进行微调,通过将安全相关的权重更新限制在低秩子空间内,最小化对推理相关权重的干扰,从而在保持模型强大推理能力的同时实现高水平的安全性。实验表明,该方法在数学、科学和编程四个基准测试中均实现了与全模型微调相当的安全水平,且不牺牲推理性能。
链接: https://arxiv.org/abs/2507.17075
作者: Yihao Xue,Baharan Mirzasoleiman
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reasoning LLMs have demonstrated remarkable breakthroughs in solving complex problems that were previously out of reach. To ensure LLMs do not assist with harmful requests, safety alignment fine-tuning is necessary in the post-training phase. However, safety alignment fine-tuning has recently been shown to significantly degrade reasoning abilities, a phenomenon known as the “Safety Tax”. In this work, we show that using LoRA for SFT on refusal datasets effectively aligns the model for safety without harming its reasoning capabilities. This is because restricting the safety weight updates to a low-rank space minimizes the interference with the reasoning weights. Our extensive experiments across four benchmarks covering math, science, and coding show that this approach produces highly safe LLMs – with safety levels comparable to full-model fine-tuning – without compromising their reasoning abilities. Additionally, we observe that LoRA induces weight updates with smaller overlap with the initial weights compared to full-model fine-tuning. We also explore methods that further reduce such overlap – via regularization or during weight merging – and observe some improvement on certain tasks. We hope this result motivates designing approaches that yield more consistent improvements in the reasoning-safety trade-off.
zh
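LoRA 将权重更新约束在低秩子空间的核心思想,可以用 numpy 示意如下:冻结 W,只训练低秩因子 A、B,更新矩阵 B@A 的秩不超过 r。这不是论文的训练代码,维度与“B 零初始化”等均为 LoRA 的常见约定:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 8, 2, 4

W = rng.normal(size=(d_out, d_in))     # 冻结的预训练权重
A = rng.normal(size=(r, d_in)) * 0.01  # 可训练低秩因子
B = np.zeros((d_out, r))               # B 零初始化:训练开始前不改变输出

def lora_forward(x):
    """LoRA 前向:y = (W + (alpha/r) * B @ A) @ x,只训练 A、B。"""
    return (W + (alpha / r) * B @ A) @ x

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True:B 为零时与原模型一致
```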
[AI-52] Advancing Robustness in Deep Reinforcement Learning with an Ensemble Defense Approach
【速读】:该论文旨在解决深度强化学习(Deep Reinforcement Learning, DRL)模型在自动驾驶场景中面对对抗攻击时鲁棒性不足的问题。现有防御机制如对抗训练和知识蒸馏虽能提升模型韧性,但在自动驾驶这一复杂动态环境中,单一防御策略仍存在局限。论文提出了一种基于集成(ensemble-based)的防御架构,其关键在于通过融合多种防御机制形成协同增强效应,从而显著提升DRL模型在高速公路和汇入场景下的抗干扰能力——实验表明,该方法在FGSM攻击下使平均奖励提升超过213%,碰撞率降低82%,优于所有独立防御策略。
链接: https://arxiv.org/abs/2507.17070
作者: Adithya Mohan,Dominik Rößle,Daniel Cremers,Torsten Schön
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, 2 tables
Abstract:Recent advancements in Deep Reinforcement Learning (DRL) have demonstrated its applicability across various domains, including robotics, healthcare, energy optimization, and autonomous driving. However, a critical question remains: How robust are DRL models when exposed to adversarial attacks? While existing defense mechanisms such as adversarial training and distillation enhance the resilience of DRL models, there remains a significant research gap regarding the integration of multiple defenses in autonomous driving scenarios specifically. This paper addresses this gap by proposing a novel ensemble-based defense architecture to mitigate adversarial attacks in autonomous driving. Our evaluation demonstrates that the proposed architecture significantly enhances the robustness of DRL models. Compared to the baseline under FGSM attacks, our ensemble method improves the mean reward from 5.87 to 18.38 (over 213% increase) and reduces the mean collision rate from 0.50 to 0.09 (an 82% decrease) in the highway scenario and merge scenario, outperforming all standalone defense strategies.
zh
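实验中使用的 FGSM 攻击沿损失对输入梯度的符号方向做 eps 扰动。下面在一个玩具二次损失上示意其效果(模型与数值均为假设,梯度按解析式手算):

```python
def fgsm_perturb(x, grad, eps):
    """FGSM:沿损失梯度的符号方向对输入做 eps 扰动。"""
    sign = [1.0 if g > 0 else (-1.0 if g < 0 else 0.0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]

# 玩具模型:loss(x) = (w·x - y)^2,对 x 的梯度为 2*(w·x - y)*w
w, y = [1.0, 2.0], 0.0
x = [1.0, 1.0]
pred = sum(wi * xi for wi, xi in zip(w, x))
loss = (pred - y) ** 2
grad = [2.0 * (pred - y) * wi for wi in w]

x_adv = fgsm_perturb(x, grad, eps=0.1)
pred_adv = sum(wi * xi for wi, xi in zip(w, x_adv))
print(loss, (pred_adv - y) ** 2)  # 约 9.0 → 10.89:小扰动显著抬高损失
```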
[AI-53] Compatibility of Max and Sum Objectives for Committee Selection and k-Facility Location
【速读】:该论文致力于解决度量空间下的设施选址问题(metric facility location problem),即在任意度量空间中选择 k 个设施来服务一组客户 C,同时考虑四种不同的优化目标:每个客户试图最小化其到所选设施的距离之和或最大值,而整体目标则相应地取所有客户成本的总和或最大值。研究的核心在于分析这些不同目标之间的兼容性,而非孤立优化单一目标。解决方案的关键在于证明存在一种可行解,能够同时接近任意两种目标的最优值,从而表明在选择设施或代表委员会时,可以实现多个目标的协同优化,避免因追求某一目标而牺牲其他目标。
链接: https://arxiv.org/abs/2507.17063
作者: Yue Han,Elliot Anshelevich
机构: 未知
类目: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
备注:
Abstract:We study a version of the metric facility location problem (or, equivalently, variants of the committee selection problem) in which we must choose k facilities in an arbitrary metric space to serve some set of clients C. We consider four different objectives, where each client i ∈ C attempts to minimize either the sum or the maximum of its distance to the chosen facilities, and where the overall objective either considers the sum or the maximum of the individual client costs. Rather than optimizing a single objective at a time, we study how compatible these objectives are with each other, and show the existence of solutions which are simultaneously close-to-optimum for any pair of the above objectives. Our results show that when choosing a set of facilities or a representative committee, it is often possible to form a solution which is good for several objectives at the same time, instead of sacrificing one desideratum to achieve another.
zh
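文中的四种目标(客户代价取和/取最大 × 总体目标取和/取最大)在给定“客户到所选设施”的距离矩阵后可直接计算。下面的小例子仅为定义演示(一维度量空间,位置均为假设):

```python
def facility_objectives(dist):
    """dist[i][j]:客户 i 到所选设施 j 的距离;返回四种目标的取值。"""
    cost_sum = [sum(row) for row in dist]  # 每个客户:到所有所选设施的距离之和
    cost_max = [max(row) for row in dist]  # 每个客户:到所选设施的最远距离
    return {
        "sum-of-sum": sum(cost_sum),
        "max-of-sum": max(cost_sum),
        "sum-of-max": sum(cost_max),
        "max-of-max": max(cost_max),
    }

# 客户位于 0 和 4,所选设施位于 1 和 3(距离取绝对差)
dist = [[1, 3], [3, 1]]
print(facility_objectives(dist))
```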
[AI-54] Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems
【速读】:该论文旨在解决当前多智能体大语言模型(Large Language Model, LLM)系统在开放复杂场景中因静态工作流、固定角色分配及有限的智能体间通信而导致协作效率低下的问题。其解决方案的关键在于构建一个具备自适应协调能力的框架,包含三个核心机制:动态任务路由(根据置信度和负载重新分配任务)、双向反馈机制(通过结构化批评迭代优化输出),以及并行智能体评估与竞争机制(在高模糊性子任务上由评估器驱动选择最优结果)。这一设计显著提升了系统的事实覆盖率、连贯性和执行效率。
链接: https://arxiv.org/abs/2507.17061
作者: Chengxuan Xia,Qianye Wu,Sixuan Tian,Yilun Hao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 8 pages, 2 figures
Abstract:Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and crucially compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.
zh
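“按置信度与负载动态路由任务”的思路可以用一个极简打分函数示意:置信度高加分、负载高减分,任务交给得分最高的智能体。打分权重 0.1 与各字段名均为假设,并非论文实现:

```python
def route_task(task, agents):
    """按“置信度高、负载低”的得分把任务指派给最合适的智能体。"""
    best = max(agents, key=lambda a: a["confidence"][task] - 0.1 * a["load"])
    best["load"] += 1  # 指派后更新负载
    return best["name"]

agents = [
    {"name": "extractor", "load": 0, "confidence": {"抽取": 0.9, "总结": 0.4}},
    {"name": "summarizer", "load": 3, "confidence": {"抽取": 0.5, "总结": 0.95}},
]
print(route_task("抽取", agents))  # extractor
print(route_task("总结", agents))  # summarizer:置信度优势盖过负载劣势
```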
[AI-55] Pragmatic Policy Development via Interpretable Behavior Cloning
【速读】:该论文旨在解决离线强化学习(Offline Reinforcement Learning, Offline RL)在安全关键领域应用中面临的可解释性差与评估不可靠两大挑战。传统离线RL策略因黑箱特性难以解释,且基于重要性采样的离线评估方法对行为策略偏差敏感,导致实际部署风险高。其解决方案的关键在于:利用树状结构模型(tree-based model)对行为策略进行可解释建模,通过提取每个状态中最常选择的治疗动作来生成策略;该方法不仅天然具备可解释性,还能通过调节考虑的动作数量控制与行为策略的重叠程度,从而实现更可靠的离线评估。此框架将临床实践中隐含的高频治疗模式标准化,实证表明其在类风湿关节炎和脓毒症护理场景中优于当前临床实践。
链接: https://arxiv.org/abs/2507.17056
作者: Anton Matsson,Yaochen Rao,Heather J. Litman,Fredrik D. Johansson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning (RL) holds great promise for deriving optimal policies from observational data, but challenges related to interpretability and evaluation limit its practical use in safety-critical domains. Interpretability is hindered by the black-box nature of unconstrained RL policies, while evaluation – typically performed off-policy – is sensitive to large deviations from the data-collecting behavior policy, especially when using methods based on importance sampling. To address these challenges, we propose a simple yet practical alternative: deriving treatment policies from the most frequently chosen actions in each patient state, as estimated by an interpretable model of the behavior policy. By using a tree-based model, which is specifically designed to exploit patterns in the data, we obtain a natural grouping of states with respect to treatment. The tree structure ensures interpretability by design, while varying the number of actions considered controls the degree of overlap with the behavior policy, enabling reliable off-policy evaluation. This pragmatic approach to policy development standardizes frequent treatment patterns, capturing the collective clinical judgment embedded in the data. Using real-world examples in rheumatoid arthritis and sepsis care, we demonstrate that policies derived under this framework can outperform current practice, offering interpretable alternatives to those obtained via offline RL.
zh
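“在每个患者状态中取最常被选择的动作”这一策略提取步骤本身非常简单,可示意如下(状态与动作名称为虚构示例,论文中状态分组由树模型给出):

```python
from collections import Counter, defaultdict

def modal_action_policy(trajectories):
    """从观测数据中提取每个状态下最常被选择的动作,作为可解释的处置策略。"""
    counts = defaultdict(Counter)
    for state, action in trajectories:
        counts[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

data = [
    ("低活动度", "维持用药"), ("低活动度", "维持用药"), ("低活动度", "换药"),
    ("高活动度", "换药"), ("高活动度", "换药"),
]
policy = modal_action_policy(data)
print(policy)  # {'低活动度': '维持用药', '高活动度': '换药'}
```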
[AI-56] New Mechanisms in Flex Distribution for Bounded Suboptimal Multi-Agent Path Finding
【速读】:该论文旨在解决多智能体路径规划(Multi-Agent Path Finding, MAPF)中基于冲突的搜索算法(Conflict-Based Search, CBS)在求解过程中因灵活分配(flex distribution)策略不当导致效率下降的问题。具体而言,现有方法通过增加阈值来加速求解,但可能使总路径成本(Sum of Path Costs, SOC)超出允许的偏差范围 w·LB,从而频繁切换路径集合而非集中解决冲突,降低了收敛效率。解决方案的关键在于提出三种新型灵活分配机制:冲突比例分配(Conflict-Based Flex Distribution)、基于延迟估计的分配(Delay-Based Flex Distribution)以及分层混合策略(Mixed-Strategy Flex Distribution),这些机制能够在保证解的有界次优性前提下显著提升算法收敛速度和稳定性。
链接: https://arxiv.org/abs/2507.17054
作者: Shao-Hung Chan,Thomy Phan,Jiaoyang Li,Sven Koenig
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures, International Symposium on Combinatorial Search, 2025
Abstract:Multi-Agent Path Finding (MAPF) is the problem of finding a set of collision-free paths, one for each agent in a shared environment. Its objective is to minimize the sum of path costs (SOC), where the path cost of each agent is defined as the travel time from its start location to its target location. Explicit Estimation Conflict-Based Search (EECBS) is the leading algorithm for bounded-suboptimal MAPF, with the SOC of the solution being at most a user-specified factor w away from optimal. EECBS maintains sets of paths and a lower bound LB on the optimal SOC. Then, it iteratively selects a set of paths whose SOC is at most w \cdot LB and introduces constraints to resolve collisions. For each path in a set, EECBS maintains a lower bound on its optimal path that satisfies constraints. By finding an individually bounded-suboptimal path with cost at most a threshold of w times its lower bound, EECBS guarantees to find a bounded-suboptimal solution. To speed up EECBS, previous work uses flex distribution to increase the threshold. Though EECBS with flex distribution guarantees to find a bounded-suboptimal solution, increasing the thresholds may push the SOC beyond w \cdot LB , forcing EECBS to switch among different sets of paths instead of resolving collisions on a particular set of paths, and thus reducing efficiency. To address this issue, we propose Conflict-Based Flex Distribution that distributes flex in proportion to the number of collisions. We also estimate the delays needed to satisfy constraints and propose Delay-Based Flex Distribution. On top of that, we propose Mixed-Strategy Flex Distribution, combining both in a hierarchical framework. We prove that EECBS with our new flex distribution mechanisms is complete and bounded-suboptimal. Our experiments show that our approaches outperform the original (greedy) flex distribution.
zh
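“按冲突数比例分配 flex”的机制本身就是一次比例分配,可示意如下(仅为概念演示,非完整 EECBS 实现;无冲突时退化为均分,属本示例的假设处理):

```python
def conflict_based_flex(total_flex, collisions):
    """把总 flex 按各条路径的冲突数占比分配:冲突越多,获得的松弛越多。"""
    total = sum(collisions)
    if total == 0:
        return [total_flex / len(collisions)] * len(collisions)  # 无冲突:均分
    return [total_flex * c / total for c in collisions]

flex = conflict_based_flex(total_flex=6.0, collisions=[3, 2, 1, 0])
print(flex)  # [3.0, 2.0, 1.0, 0.0]
```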
[AI-57] Causal Graph Fuzzy LLMs: A First Introduction and Applications in Time Series Forecasting
【速读】:该论文旨在解决多变量时间序列预测(Multivariate Time Series Forecasting, MTSF)中模型可解释性不足与复杂动态关系建模困难的问题。其解决方案的关键在于提出一种新型大语言模型架构CGF-LLM,通过并行应用模糊时间序列(Fuzzy Time Series, FTS)和因果图(Causal Graph)将原始数值时间序列转化为具有语义意义的文本表示,从而为预训练的GPT-2模型提供兼具结构洞察与语义理解能力的输入,显著提升了模型对时间序列内在动态机制的可解释性和预测性能。
链接: https://arxiv.org/abs/2507.17016
作者: Omid Orang,Patricia O. Lucas,Gabriel I. F. Paiva,Petronio C. L. Silva,Felipe Augusto Rocha da Silva,Adriano Alonso Veloso,Frederico Gadelha Guimaraes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the Brazilian Congress of Artificial Intelligence (CBIC)
Abstract:In recent years, the application of Large Language Models (LLMs) to time series forecasting (TSF) has garnered significant attention among researchers. This study presents a new frame of LLMs named CGF-LLM using GPT-2 combined with fuzzy time series (FTS) and causal graph to predict multivariate time series, marking the first such architecture in the literature. The key objective is to convert numerical time series into interpretable forms through the parallel application of fuzzification and causal analysis, enabling both semantic understanding and structural insight as input for the pretrained GPT-2 model. The resulting textual representation offers a more interpretable view of the complex dynamics underlying the original time series. The reported results confirm the effectiveness of our proposed LLM-based time series forecasting model, as demonstrated across four different multivariate time series datasets. This initiative paves promising future directions in the domain of TSF using LLMs based on FTS.
zh
[AI-58] Laplax – Laplace Approximations with JAX ICML2025
【Quick Read】: This paper tackles weight-space uncertainty quantification in deep neural networks, enabling practical use of Bayesian tools such as predictive uncertainty estimation and Occam's-razor-based model selection. The key contribution is laplax, a new open-source Python library built on jax with a modular, purely functional architecture and minimal external dependencies. It performs Laplace approximations efficiently and offers a flexible, researcher-friendly platform for work on Bayesian neural networks, uncertainty quantification for deep learning, and improved Laplace approximation techniques.
Link: https://arxiv.org/abs/2507.17013
Authors: Tobias Weber,Bálint Mucsányi,Lenard Rommel,Thomas Christie,Lars Kasüschke,Marvin Pförtner,Philipp Hennig
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Submission to the ICML 2025 Workshop on Championing Open-source Development in Machine Learning (CODEML '25)
Abstract:The Laplace approximation provides a scalable and efficient means of quantifying weight-space uncertainty in deep neural networks, enabling the application of Bayesian tools such as predictive uncertainty and model selection via Occam’s razor. In this work, we introduce laplax, a new open-source Python package for performing Laplace approximations with jax. Designed with a modular and purely functional architecture and minimal external dependencies, laplax offers a flexible and researcher-friendly framework for rapid prototyping and experimentation. Its goal is to facilitate research on Bayesian neural networks, uncertainty quantification for deep learning, and the development of improved Laplace approximation techniques.
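The Laplace approximation the package automates can be illustrated in one dimension: fit a Gaussian at the mode of a negative log-posterior, with variance given by the inverse curvature (Hessian) at that mode. This toy example is pure Python and only a conceptual sketch of what laplax does for full neural network weight spaces; it does not use the laplax API:

```python
def neg_log_post(w):
    # Toy negative log-posterior: Gaussian with mean 2.0 and variance 0.25.
    return 0.5 * (w - 2.0) ** 2 / 0.25

def hessian(f, w, eps=1e-4):
    """Second derivative by central finite differences."""
    return (f(w + eps) - 2.0 * f(w) + f(w - eps)) / eps ** 2

# Laplace approximation: Gaussian centred at the MAP with variance 1 / H.
w_map = 2.0                                # mode of the toy posterior
var = 1.0 / hessian(neg_log_post, w_map)   # recovers the true variance 0.25
```

For a true Gaussian posterior the approximation is exact, which is why the recovered variance matches; for neural networks the curvature is taken at the trained weights instead.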
zh
[AI-59] Towards Autonomous Sustainability Assessment via Multimodal AI Agents
【Quick Read】: This paper addresses the inefficiency, high cost, and poor scalability of traditional life cycle assessment (LCA) for computing the carbon footprint of electronic devices, caused by missing data: the detailed material and process data needed for cradle-to-gate (production) assessment are often unavailable, and expert-driven manual processing takes weeks or months. The key to the solution is multimodal AI agents that emulate the interactions between LCA experts and stakeholders such as product managers and engineers, using a custom data abstraction and software tools to automatically extract information from online text and images from repair communities and government certifications and build a detailed life-cycle inventory (LCI). This cuts computation time to under one minute and yields carbon estimates within 19% of expert LCAs with zero proprietary data. The authors further propose a direct environmental impact (EI) estimate based on clustering products with similar descriptions and known footprints, and an emission-factor generation mechanism that represents an unknown material as a weighted sum of emission factors for similar materials, improving MAPE by 120.26% over human experts picking the closest database entry and offering a scalable, data-driven paradigm for future automated, high-fidelity LCA workflows.
Link: https://arxiv.org/abs/2507.17012
Authors: Zhihan Zhang,Alexander Metzger,Yuxuan Mei,Felix Hähnlein,Zachary Englhardt,Tingyu Cheng,Gregory D. Abowd,Shwetak Patel,Adriana Schulz,Vikram Iyer
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments:
Abstract:Interest in sustainability information has surged in recent years. However, the data required for a life cycle assessment (LCA) that maps the materials and processes from product manufacturing to disposal into environmental impacts (EI) are often unavailable. Here we reimagine conventional LCA by introducing multimodal AI agents that emulate interactions between LCA experts and stakeholders like product managers and engineers to calculate the cradle-to-gate (production) carbon emissions of electronic devices. The AI agents iteratively generate a detailed life-cycle inventory leveraging a custom data abstraction and software tools that extract information from online text and images from repair communities and government certifications. This approach reduces weeks or months of expert time to under one minute and closes data availability gaps while yielding carbon footprint estimates within 19% of expert LCAs with zero proprietary data. Additionally, we develop a method to directly estimate EI by comparing an input to a cluster of products with similar descriptions and known carbon footprints. This runs in 3 ms on a laptop with a MAPE of 12.28% on electronic products. Further, we develop a data-driven method to generate emission factors. We use the properties of an unknown material to represent it as a weighted sum of emission factors for similar materials. Compared to human experts picking the closest LCA database entry, this improves MAPE by 120.26%. We analyze the data and compute scaling of this approach and discuss its implications for future LCA workflows.
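The abstract's idea of representing an unknown material as a weighted sum of emission factors for similar materials can be sketched as a similarity-weighted average. The similarity weights and factor values below are invented for illustration; the paper derives them from material properties:

```python
def estimate_emission_factor(similar_materials):
    """similar_materials: list of (similarity_weight, emission_factor) pairs.
    Returns the similarity-weighted average emission factor (kg CO2e per kg)."""
    total_weight = sum(w for w, _ in similar_materials)
    return sum(w * ef for w, ef in similar_materials) / total_weight

# Unknown polymer, approximated from three known plastics: mostly like the
# first material, with smaller contributions from two others.
ef = estimate_emission_factor([(0.5, 2.0), (0.3, 3.0), (0.2, 6.0)])  # ~3.1
```

Compared to snapping to the single closest database entry, a weighted blend degrades gracefully when no one material is a good match.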
zh
[AI-60] Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs
【Quick Read】: This paper addresses the threat that deepfake images pose to information integrity in the era of synthetic media, and in particular the tension between real-time detection and privacy protection in extended reality (XR) settings. The key is a two-stage framework, TrustDefender: the first stage uses a lightweight convolutional neural network (CNN) to detect deepfake imagery in real time; the second integrates a succinct zero-knowledge proof (ZKP) protocol that validates detection results without disclosing raw user data, satisfying the computational constraints of XR platforms while preserving privacy in sensitive settings. By fusing advanced computer vision models with provable security mechanisms, the scheme lays a foundation for reliable AI in immersive, privacy-sensitive applications.
Link: https://arxiv.org/abs/2507.17010
Authors: H M Mohaimanul Islam,Huynh Q. N. Vo,Aditya Rane
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted for peer-review in TrustXR - 2025
Abstract:In the era of synthetic media, deepfake manipulations pose a significant threat to information integrity. To address this challenge, we propose TrustDefender, a two-stage framework comprising (i) a lightweight convolutional neural network (CNN) that detects deepfake imagery in real-time extended reality (XR) streams, and (ii) an integrated succinct zero-knowledge proof (ZKP) protocol that validates detection results without disclosing raw user data. Our design addresses both the computational constraints of XR platforms while adhering to the stringent privacy requirements in sensitive settings. Experimental evaluations on multiple benchmark deepfake datasets demonstrate that TrustDefender achieves 95.3% detection accuracy, coupled with efficient proof generation underpinned by rigorous cryptography, ensuring seamless integration with high-performance artificial intelligence (AI) systems. By fusing advanced computer vision models with provable security mechanisms, our work establishes a foundation for reliable AI in immersive and privacy-sensitive applications.
zh
[AI-61] PyG 2.0: Scalable Learning on Real World Graphs
【Quick Read】: This paper addresses the scalability and functional limitations that graph neural networks (GNNs) face in large-scale real-world applications. The key is PyG 2.0 (and its subsequent minor versions), which restructures the framework with support for heterogeneous and temporal graphs, scalable feature and graph stores, and a range of performance optimizations, markedly improving the ability to handle large real-world graphs and advancing key application areas such as relational deep learning and large language modeling.
Link: https://arxiv.org/abs/2507.16991
Authors: Matthias Fey,Jinu Sunil,Akihiro Nitta,Rishi Puri,Manan Shah,Blaž Stojanovič,Ramona Bendias,Alexandria Barghi,Vid Kocijan,Zecheng Zhang,Xinwei He,Jan Eric Lenssen,Jure Leskovec
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:PyG (PyTorch Geometric) has evolved significantly since its initial release, establishing itself as a leading framework for Graph Neural Networks. In this paper, we present Pyg 2.0 (and its subsequent minor versions), a comprehensive update that introduces substantial improvements in scalability and real-world application capabilities. We detail the framework’s enhanced architecture, including support for heterogeneous and temporal graphs, scalable feature/graph stores, and various optimizations, enabling researchers and practitioners to tackle large-scale graph learning problems efficiently. Over the recent years, PyG has been supporting graph learning in a large variety of application areas, which we will summarize, while providing a deep dive into the important areas of relational deep learning and large language modeling.
zh
[AI-62] Evaluating Ensemble and Deep Learning Models for Static Malware Detection with Dimensionality Reduction Using the EMBER Dataset
【Quick Read】: This paper investigates how machine learning models differ in static malware detection and how sensitive they are to feature preprocessing, aiming to benchmark classification algorithms on the EMBER dataset and clarify the effect of preprocessing strategies (PCA and LDA). The key is a systematic comparison of eight mainstream classifiers (including ensemble methods such as LightGBM and XGBoost, plus TabNet) across the original feature space and two dimensionality-reduction settings, evaluated on accuracy, precision, recall, F1, and AUC and supported by exploratory data analysis (EDA) of feature discriminability. Boosting models (LightGBM and XGBoost) perform best and most consistently across all configurations, while the effect of dimensionality reduction is highly model-dependent: LDA improves KNN but clearly hurts gradient-boosting models, indicating that feature engineering should be applied selectively based on model architecture.
Link: https://arxiv.org/abs/2507.16952
Authors: Md Min-Ha-Zul Abedin,Tazqia Mehrub
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:This study investigates the effectiveness of several machine learning algorithms for static malware detection using the EMBER dataset, which contains feature representations of Portable Executable (PE) files. We evaluate eight classification models: LightGBM, XGBoost, CatBoost, Random Forest, Extra Trees, HistGradientBoosting, k-Nearest Neighbors (KNN), and TabNet, under three preprocessing settings: original feature space, Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). The models are assessed on accuracy, precision, recall, F1 score, and AUC to examine both predictive performance and robustness. Ensemble methods, especially LightGBM and XGBoost, show the best overall performance across all configurations, with minimal sensitivity to PCA and consistent generalization. LDA improves KNN performance but significantly reduces accuracy for boosting models. TabNet, while promising in theory, underperformed under feature reduction, likely due to architectural sensitivity to input structure. The analysis is supported by detailed exploratory data analysis (EDA), including mutual information ranking, PCA or t-SNE visualizations, and outlier detection using Isolation Forest and Local Outlier Factor (LOF), which confirm the discriminatory capacity of key features in the EMBER dataset. The results suggest that boosting models remain the most reliable choice for high-dimensional static malware detection, and that dimensionality reduction should be applied selectively based on model type. This work provides a benchmark for comparing classification models and preprocessing strategies in malware detection tasks and contributes insights that can guide future system development and real-world deployment.
zh
[AI-63] Revisiting Pre-trained Language Models for Vulnerability Detection
【Quick Read】: This paper addresses the unclear real-world effectiveness of pre-trained language models (PLMs) for vulnerability detection (VD): prior empirical studies suffer from inadequate data preparation, evaluation setups, and experimental design, biasing and limiting their conclusions. The key is RevisitVD, which builds new benchmark datasets and systematically evaluates 17 PLMs (from small code-specific models to large-scale ones), comparing fine-tuning with prompt engineering, analyzing effectiveness and generalization across training/testing settings, and testing robustness to code normalization, abstraction, and semantic-preserving transformations. The study finds that pre-training tasks designed to capture code syntax and semantics clearly improve performance, but models still struggle with complex dependencies, perturbations from normalization and abstraction, and semantic-preserving vulnerable code transformations, and truncation from limited context windows can cause non-negligible labeling errors. The work underscores the importance of thorough evaluation in practical scenarios and outlines directions for improvement.
Link: https://arxiv.org/abs/2507.16887
Authors: Youpeng Li,Weiliang Qi,Xuyu Wang,Fuxun Yu,Xinda Wang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:
Abstract:The rapid advancement of pre-trained language models (PLMs) has demonstrated promising results for various code-related tasks. However, their effectiveness in detecting real-world vulnerabilities remains a critical challenge. While existing empirical studies evaluate PLMs for vulnerability detection (VD), their inadequate consideration in data preparation, evaluation setups, and experimental settings undermines the accuracy and comprehensiveness of evaluations. This paper introduces RevisitVD, an extensive evaluation of 17 PLMs spanning smaller code-specific PLMs and large-scale PLMs using newly constructed datasets. Specifically, we compare the performance of PLMs under both fine-tuning and prompt engineering, assess their effectiveness and generalizability across various training and testing settings, and analyze their robustness against code normalization, abstraction, and semantic-preserving transformations. Our findings reveal that, for VD tasks, PLMs incorporating pre-training tasks designed to capture the syntactic and semantic patterns of code outperform both general-purpose PLMs and those solely pre-trained or fine-tuned on large code corpora. However, these models face notable challenges in real-world scenarios, such as difficulties in detecting vulnerabilities with complex dependencies, handling perturbations introduced by code normalization and abstraction, and identifying semantic-preserving vulnerable code transformations. Also, the truncation caused by the limited context windows of PLMs can lead to a non-negligible amount of labeling errors. This study underscores the importance of thorough evaluations of model performance in practical scenarios and outlines future directions to help enhance the effectiveness of PLMs for realistic VD applications.
zh
[AI-64] SplitMeanFlow: Interval Splitting Consistency in Few-Step Generative Modeling
【Quick Read】: This paper addresses the expensive iterative sampling of generative models such as Flow Matching, and in particular the limitations of existing few-step or one-step methods such as MeanFlow that learn an average velocity field. The key is a new, purely algebraic consistency principle, Interval Splitting Consistency, derived from the additivity of definite integrals: it establishes a self-referential relation constraining the average velocity field across time intervals without any differential operators. The resulting training framework, SplitMeanFlow, enforces this algebraic consistency directly as the learning objective. It is more general in theory (MeanFlow's differential identity is recovered in the limit of infinitesimal interval splits) and markedly more efficient in practice: eliminating JVP (Jacobian-vector product) computation yields simpler implementation, more stable training, and broader hardware compatibility. One-step and two-step models have been deployed in large-scale speech synthesis products such as Doubao, achieving 20x speedups.
Link: https://arxiv.org/abs/2507.16884
Authors: Yi Guo,Wei Wang,Zhihang Yuan,Rong Cao,Kuan Chen,Zhengyang Chen,Yuanyuan Huo,Yang Zhang,Yuping Wang,Shouda Liu,Yuxuan Wang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Tech Report
Abstract:Generative models like Flow Matching have achieved state-of-the-art performance but are often hindered by a computationally expensive iterative sampling process. To address this, recent work has focused on few-step or one-step generation by learning the average velocity field, which directly maps noise to data. MeanFlow, a leading method in this area, learns this field by enforcing a differential identity that connects the average and instantaneous velocities. In this work, we argue that this differential formulation is a limiting special case of a more fundamental principle. We return to the first principles of average velocity and leverage the additivity property of definite integrals. This leads us to derive a novel, purely algebraic identity we term Interval Splitting Consistency. This identity establishes a self-referential relationship for the average velocity field across different time intervals without resorting to any differential operators. Based on this principle, we introduce SplitMeanFlow, a new training framework that enforces this algebraic consistency directly as a learning objective. We formally prove that the differential identity at the core of MeanFlow is recovered by taking the limit of our algebraic consistency as the interval split becomes infinitesimal. This establishes SplitMeanFlow as a direct and more general foundation for learning average velocity fields. From a practical standpoint, our algebraic approach is significantly more efficient, as it eliminates the need for JVP computations, resulting in simpler implementation, more stable training, and broader hardware compatibility. One-step and two-step SplitMeanFlow models have been successfully deployed in large-scale speech synthesis products (such as Doubao), achieving speedups of 20x.
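The Interval Splitting Consistency identity can be checked numerically for a known velocity field: the average velocity u(t, r) over [r, t] must satisfy u(t, r)·(t − r) = u(t, s)·(t − s) + u(s, r)·(s − r) for any split point s, which is just the additivity of the definite integral. A sketch with an arbitrary quadratic velocity field (not from the paper):

```python
def avg_velocity(t, r):
    """Average of the instantaneous velocity v(tau) = 3 * tau^2 over [r, t]:
    (1 / (t - r)) * integral_r^t 3 * tau^2 d tau = (t^3 - r^3) / (t - r)."""
    return (t ** 3 - r ** 3) / (t - r)

t, s, r = 1.0, 0.6, 0.2
lhs = avg_velocity(t, r) * (t - r)
rhs = avg_velocity(t, s) * (t - s) + avg_velocity(s, r) * (s - r)
# lhs and rhs agree for any split point s: both equal the integral over [r, t].
```

In SplitMeanFlow this identity is enforced as a training loss on the learned field rather than verified on a closed form, but the constraint being imposed is the same.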
zh
[AI-65] Confidence Optimization for Probabilistic Encoding
【Quick Read】: This paper addresses how the Gaussian noise introduced by probabilistic encoding distorts point-based distance measurement in classification tasks, harming classification performance and representation learning. The key is confidence optimization probabilistic encoding (CPE): first, a confidence-aware mechanism adjusts distance calculations, making probabilistic-encoding classification more consistent and reliable; second, the conventional KL-divergence variance regularizer, which rests on unreliable prior assumptions, is replaced with a simpler and more stable L2 term that constrains variance directly, improving generalization. The method is model-agnostic, and it yields clear performance gains on natural language classification tasks with both BERT and RoBERTa.
Link: https://arxiv.org/abs/2507.16881
Authors: Pengjiu Xia,Yidian Huang,Wenchao Wei,Yuwen Tan
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:Probabilistic encoding introduces Gaussian noise into neural networks, enabling a smooth transition from deterministic to uncertain states and enhancing generalization ability. However, the randomness of Gaussian noise distorts point-based distance measurements in classification tasks. To mitigate this issue, we propose a confidence optimization probabilistic encoding (CPE) method that improves distance reliability and enhances representation learning. Specifically, we refine probabilistic encoding with two key strategies: First, we introduce a confidence-aware mechanism to adjust distance calculations, ensuring consistency and reliability in probabilistic encoding classification tasks. Second, we replace the conventional KL divergence-based variance regularization, which relies on unreliable prior assumptions, with a simpler L2 regularization term to directly constrain variance. The method we proposed is model-agnostic, and extensive experiments on natural language classification tasks demonstrate that our method significantly improves performance and generalization on both the BERT and the RoBERTa model.
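The second strategy above — replacing KL-based variance regularization with a direct L2 penalty on the variance — can be sketched as a loss term. This is a minimal illustration; the paper's exact formulation, target value, and weighting may differ:

```python
def l2_variance_regularizer(variances, target=1.0, weight=0.1):
    """Directly penalise the per-dimension variances of the probabilistic
    encoding for drifting away from a target value, instead of using a
    KL-divergence term that assumes a specific prior."""
    return weight * sum((v - target) ** 2 for v in variances) / len(variances)

# Three encoding dimensions; only the first and third deviate from the target.
reg = l2_variance_regularizer([0.8, 1.0, 1.4])
```

Because the penalty acts on the variances themselves, it constrains the noise scale without committing to a prior distribution over the latent codes.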
zh
[AI-66] Budget Allocation Policies for Real-Time Multi-Agent Path Finding
【Quick Read】: This paper addresses inefficiency in Real-Time Multi-Agent Path Finding (RT-MAPF) caused by poor allocation of the planning budget: existing methods repeatedly invoke windowed MAPF algorithms in every planning period without explicitly considering how the budget is allocated, limiting performance in highly constrained scenarios. The key is to explore different budget allocation policies, in particular distributing the planning budget per agent rather than using the baseline approach in which all agents draw from a shared pool; experiments show that per-agent policies solve markedly more problems and reduce makespan.
Link: https://arxiv.org/abs/2507.16874
Authors: Raz Beck,Roni Stern
Affiliations: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 2 figures, 3 tables
Abstract:Multi-Agent Pathfinding (MAPF) is the problem of finding paths for a set of agents such that each agent reaches its desired destination while avoiding collisions with the other agents. Many MAPF solvers are designed to run offline, that is, first generate paths for all agents and then execute them. Real-Time MAPF (RT-MAPF) embodies a realistic MAPF setup in which one cannot wait until a complete path for each agent has been found before they start to move. Instead, planning and execution are interleaved, where the agents must commit to a fixed number of steps in a constant amount of computation time, referred to as the planning budget. Existing solutions to RT-MAPF iteratively call windowed versions of MAPF algorithms in every planning period, without explicitly considering the size of the planning budget. We address this gap and explore different policies for allocating the planning budget in windowed versions of standard MAPF algorithms, namely Prioritized Planning (PrP) and MAPF-LNS2. Our exploration shows that the baseline approach in which all agents draw from a shared planning budget pool is ineffective in over-constrained situations. Instead, policies that distribute the planning budget over the agents are able to solve more problems with a smaller makespan.
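The contrast between a shared planning-budget pool and distributing the budget over agents can be sketched as two allocation policies. This is a toy illustration; the paper evaluates such policies inside windowed PrP and MAPF-LNS2, which this snippet does not reproduce:

```python
def shared_pool(budget, agents):
    """Baseline: all agents draw from one pool until it is exhausted."""
    spent, planned = 0, []
    for agent, cost in agents:
        if spent + cost > budget:
            break  # remaining agents get no plan this period
        spent += cost
        planned.append(agent)
    return planned

def per_agent_budget(budget, agents):
    """Distribute the budget evenly; each agent plans within its own share."""
    share = budget / len(agents)
    return [agent for agent, cost in agents if cost <= share]

# Planning costs per agent this period; agent "a" is expensive.
agents = [("a", 7.0), ("b", 2.0), ("c", 2.0)]
greedy = shared_pool(9.0, agents)      # "a" drains the pool, "c" is starved
fair = per_agent_budget(9.0, agents)   # the expensive agent is skipped instead
```

The starvation effect in the shared pool is the over-constrained failure mode the abstract describes; per-agent shares keep every agent making some progress.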
zh
[AI-67] CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage
【Quick Read】: This paper addresses the privacy risks that model compression may introduce while improving resource efficiency, especially for compressed versions of foundation models such as LLMs. Existing work focuses on the trade-off between performance and resource consumption and overlooks how compression affects susceptibility to membership inference attacks (MIA). The key is CompLeak, the first framework to systematically evaluate privacy risk under compression configurations, covering three mainstream operations (pruning, quantization, and weight clustering) with three variants: CompLeakNR (attacking a single compressed model), CompLeakSR (using meta information shared between the original and a compressed model to strengthen the attack), and CompLeakMR (mining joint leakage across multiple compressed versions), thereby comprehensively revealing the negative impact of compression on privacy.
Link: https://arxiv.org/abs/2507.16872
Authors: Na Li,Yansong Gao,Hongsheng Hu,Boyu Kuang,Anmin Fu
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Model compression is crucial for minimizing memory storage and accelerating inference in deep learning (DL) models, including recent foundation models like large language models (LLMs). Users can access different compressed model versions according to their resources and budget. However, while existing compression operations primarily focus on optimizing the trade-off between resource efficiency and model performance, the privacy risks introduced by compression remain overlooked and insufficiently understood. In this work, through the lens of membership inference attack (MIA), we propose CompLeak, the first privacy risk evaluation framework examining three widely used compression configurations that are pruning, quantization, and weight clustering supported by the commercial model compression framework of Google’s TensorFlow-Lite (TF-Lite) and Facebook’s PyTorch Mobile. CompLeak has three variants, given available access to the number of compressed models and original model. CompLeakNR starts by adopting existing MIA methods to attack a single compressed model, and identifies that different compressed models influence members and non-members differently. When the original model and one compressed model are available, CompLeakSR leverages the compressed model as a reference to the original model and uncovers more privacy by combining meta information (e.g., confidence vector) from both models. When multiple compressed models are available with/without accessing the original model, CompLeakMR innovatively exploits privacy leakage info from multiple compressed versions to substantially signify the overall privacy leakage. We conduct extensive experiments on seven diverse model architectures (from ResNet to foundation models of BERT and GPT-2), and six image and textual benchmark datasets. 
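The membership inference attacks CompLeak builds on can be illustrated by the simplest confidence-based attack: predict "member" when the model's top confidence on a sample exceeds a threshold, exploiting the fact that models tend to be more confident on their training data. This is a generic textbook sketch, not CompLeak's actual procedure:

```python
def confidence_mia(confidence_vectors, threshold=0.9):
    """Predict membership from each sample's maximum softmax confidence."""
    return [max(conf) >= threshold for conf in confidence_vectors]

# Confidence vectors from the target model; members tend to be sharply peaked.
preds = confidence_mia([
    [0.97, 0.02, 0.01],   # likely a training member
    [0.50, 0.30, 0.20],   # likely a non-member
])
```

CompLeak's variants go further by comparing such confidence vectors across the original model and one or more compressed versions, but the underlying signal is the same.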
zh
[AI-68] Diffusion-Modeled Reinforcement Learning for Carbon and Risk-Aware Microgrid Optimization
【Quick Read】: This paper addresses real-time energy scheduling and optimization for multi-microgrid systems under high renewable penetration and growing system complexity, with explicit attention to carbon constraints and operational risk under uncertainty. The key is DiffCarl, a diffusion-modeled carbon- and risk-aware reinforcement learning algorithm: by learning action distributions through a denoising generation process, it substantially improves the expressiveness of deep reinforcement learning (DRL) policies, enabling carbon-footprint-sensitive and robust scheduling in dynamic, uncertain microgrid environments.
Link: https://arxiv.org/abs/2507.16867
Authors: Yunyi Zhao,Wei Zhang,Cheng Xiang,Hongyang Du,Dusit Niyato,Shuhua Gao
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures
Abstract:This paper introduces DiffCarl, a diffusion-modeled carbon- and risk-aware reinforcement learning algorithm for intelligent operation of multi-microgrid systems. With the growing integration of renewables and increasing system complexity, microgrid communities face significant challenges in real-time energy scheduling and optimization under uncertainty. DiffCarl integrates a diffusion model into a deep reinforcement learning (DRL) framework to enable adaptive energy scheduling under uncertainty and explicitly account for carbon emissions and operational risk. By learning action distributions through a denoising generation process, DiffCarl enhances DRL policy expressiveness and enables carbon- and risk-aware scheduling in dynamic and uncertain microgrid environments. Extensive experimental studies demonstrate that it outperforms classic algorithms and state-of-the-art DRL solutions, with 2.3-30.1% lower operational cost. It also achieves 28.7% lower carbon emissions than those of its carbon-unaware variant and reduces performance variability. These results highlight DiffCarl as a practical and forward-looking solution. Its flexible design allows efficient adaptation to different system configurations and objectives to support real-world deployment in evolving energy systems.
zh
[AI-69] Reinforcement Learning in hyperbolic space for multi-step reasoning
【Quick Read】: This paper addresses the challenges reinforcement learning (RL) faces in multi-step reasoning tasks, including credit assignment, high-dimensional state representation, and training stability. The key is a hyperbolic Transformer architecture: hyperbolic embeddings effectively model hierarchically structured reasoning, improving the reasoning ability and efficiency of RL agents on complex tasks. Experiments on FrontierMath and nonlinear optimal control benchmarks show that the method clearly outperforms Transformer-based RL, with accuracy gains of 32%-45% and 16%-32% less computation time.
Link: https://arxiv.org/abs/2507.16864
Authors: Tao Xu,Dung-Yang Lee,Momiao Xiong
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 53 pages, 5 figures
Abstract:Multi-step reasoning is a fundamental challenge in artificial intelligence, with applications ranging from mathematical problem-solving to decision-making in dynamic environments. Reinforcement Learning (RL) has shown promise in enabling agents to perform multi-step reasoning by optimizing long-term rewards. However, conventional RL methods struggle with complex reasoning tasks due to issues such as credit assignment, high-dimensional state representations, and stability concerns. Recent advancements in Transformer architectures and hyperbolic geometry have provided novel solutions to these challenges. This paper introduces a new framework that integrates hyperbolic Transformers into RL for multi-step reasoning. The proposed approach leverages hyperbolic embeddings to model hierarchical structures effectively. We present theoretical insights, algorithmic details, and experimental results that include Frontier Math and nonlinear optimal control problems. Compared to RL with vanilla transformer, the hyperbolic RL largely improves accuracy by (32%~44%) on FrontierMath benchmark, (43%~45%) on nonlinear optimal control benchmark, while achieving impressive reduction in computational time by (16%~32%) on FrontierMath benchmark, (16%~17%) on nonlinear optimal control benchmark. Our work demonstrates the potential of hyperbolic Transformers in reinforcement learning, particularly for multi-step reasoning tasks that involve hierarchical structures.
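Hyperbolic embeddings of the kind used here are commonly handled in the Poincaré ball model, where the distance between points u, v with ||u||, ||v|| < 1 is d(u, v) = arcosh(1 + 2·||u − v||² / ((1 − ||u||²)(1 − ||v||²))). The abstract does not name a specific model, so treat this as a representative sketch of why hyperbolic space suits hierarchies:

```python
import math

def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball model of hyperbolic space."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq)))

# The same Euclidean gap is a much larger hyperbolic distance near the rim,
# so trees "fan out" towards the boundary with little distortion.
d_center = poincare_distance([0.0, 0.0], [0.1, 0.0])
d_rim = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

This exponential growth of volume towards the boundary is what lets hierarchical reasoning traces embed with low distortion in few dimensions.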
zh
[AI-70] Leveraging multi-source and heterogeneous signals for fatigue detection
【Quick Read】: This paper addresses the limited real-world applicability of fatigue detection: existing methods rely on high-end sensors and controlled environments and are hard to deploy in practice. The core challenge is how to exploit knowledge from differently configured source domains (including controlled environments using impractical sensors) when the target domain uses context-appropriate sensors. The key is a heterogeneous, multi-source fatigue detection framework that adaptively uses the modalities available in the target domain while extracting useful information from the diverse source configurations, improving practicality, robustness, and generalization in sensor-constrained scenarios.
Link: https://arxiv.org/abs/2507.16859
Authors: Luobin Cui,Yanlai Wu,Tang Ying,Weikai Li
Affiliations: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 1 figure, 32 pages
Abstract:Fatigue detection plays a critical role in safety-critical applications such as aviation, mining, and long-haul transport. However, most existing methods rely on high-end sensors and controlled environments, limiting their applicability in real world settings. This paper formally defines a practical yet underexplored problem setting for real world fatigue detection, where systems operating with context-appropriate sensors aim to leverage knowledge from differently instrumented sources including those using impractical sensors deployed in controlled environments. To tackle this challenge, we propose a heterogeneous and multi-source fatigue detection framework that adaptively utilizes the available modalities in the target domain while benefiting from the diverse configurations present in source domains. Our experiments, conducted using a realistic field-deployed sensor setup and two publicly available datasets, demonstrate the practicality, robustness, and improved generalization of our approach, paving the practical way for effective fatigue monitoring in sensor-constrained scenarios.
zh
[AI-71] SynthCTI: LLM -Driven Synthetic CTI Generation to enhance MITRE Technique Mapping
【Quick Read】: This paper addresses two challenges in mapping threat descriptions to MITRE ATT&CK techniques for cyber threat intelligence (CTI) mining: scarce high-quality labeled data and class imbalance, where many ATT&CK techniques have very few examples, both of which limit existing automated approaches. The key is SynthCTI, a clustering-based synthetic data augmentation framework: it extracts semantic context from the training data and guides a large language model (LLM) to generate semantically faithful, lexically diverse synthetic CTI sentences, enriching underrepresented classes. Experiments on two public CTI datasets show consistent macro-F1 gains for the resulting classifiers, with smaller augmented models even outperforming larger unaugmented ones, underscoring the central value of high-quality data generation for building efficient CTI classification systems.
Link: https://arxiv.org/abs/2507.16852
Authors: Álvaro Ruiz-Ródenas,Jaime Pujante Sáez,Daniel García-Algora,Mario Rodríguez Béjar,Jorge Blasco,José Luis Hernández-Ramos
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 13 figures
Abstract:Cyber Threat Intelligence (CTI) mining involves extracting structured insights from unstructured threat data, enabling organizations to understand and respond to evolving adversarial behavior. A key task in CTI mining is mapping threat descriptions to MITRE ATT&CK techniques. However, this process is often performed manually, requiring expert knowledge and substantial effort. Automated approaches face two major challenges: the scarcity of high-quality labeled CTI data and class imbalance, where many techniques have very few examples. While domain-specific Large Language Models (LLMs) such as SecureBERT have shown improved performance, most recent work focuses on model architecture rather than addressing the data limitations. In this work, we present SynthCTI, a data augmentation framework designed to generate high-quality synthetic CTI sentences for underrepresented MITRE ATT&CK techniques. Our method uses a clustering-based strategy to extract semantic context from training data and guide an LLM in producing synthetic CTI sentences that are lexically diverse and semantically faithful. We evaluate SynthCTI on two publicly available CTI datasets, CTI-to-MITRE and TRAM, using LLMs with different capacity. Incorporating synthetic data leads to consistent macro-F1 improvements: for example, ALBERT improves from 0.35 to 0.52 (a relative gain of 48.6%), and SecureBERT reaches 0.6558 (up from 0.4412). Notably, smaller models augmented with SynthCTI outperform larger models trained without augmentation, demonstrating the value of data generation methods for building efficient and effective CTI classification systems.
zh
[AI-72] Dynamic Simulation Framework for Disinformation Dissemination and Correction With Social Bots
【Quick Read】: This paper addresses three gaps in current research on how social bots spread and correct disinformation in the information ecosystem: oversimplified user and network modeling, neglect of the complexity of bots' dynamic behavior, and the lack of quantitative evaluation of correction strategies. The key is MADD, a multi-agent framework for simulating disinformation dissemination: it combines the Barabasi-Albert model (for scale-free topology) with the Stochastic Block Model (for community structure) to build a more realistic propagation network with node attributes based on real-world user data, includes both malicious and legitimate bots, and controls their dynamic participation to enable quantitative analysis and experimental validation of fact-based versus narrative-based correction strategies.
Link: https://arxiv.org/abs/2507.16848
Authors: Boyu Qiao,Kun Li,Wei Zhou,Songlin Hu
Affiliations: Unknown
Subjects: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments:
Abstract:In the human-bot symbiotic information ecosystem, social bots play key roles in spreading and correcting disinformation. Understanding their influence is essential for risk control and better governance. However, current studies often rely on simplistic user and network modeling, overlook the dynamic behavior of bots, and lack quantitative evaluation of correction strategies. To fill these gaps, we propose MADD, a Multi Agent based framework for Disinformation Dissemination. MADD constructs a more realistic propagation network by integrating the Barabasi Albert Model for scale free topology and the Stochastic Block Model for community structures, while designing node attributes based on real world user data. Furthermore, MADD incorporates both malicious and legitimate bots, with their controlled dynamic participation allows for quantitative analysis of correction strategies. We evaluate MADD using individual and group level metrics. We experimentally verify the real world consistency of MADD user attributes and network structure, and we simulate the dissemination of six disinformation topics, demonstrating the differential effects of fact based and narrative based correction strategies.
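The Barabasi-Albert preferential attachment process that MADD uses for its scale-free topology can be sketched as follows. This is a textbook version, not MADD's full network builder, which additionally layers in Stochastic Block Model communities and real-world node attributes:

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a graph to n nodes; each new node attaches m edges, preferring
    high-degree nodes (preferential attachment), yielding a scale-free graph."""
    rng = random.Random(seed)
    edges = [(0, 1)]                 # start from a connected pair
    targets = [0, 1]                 # each node repeated once per unit of degree
    for new in range(2, n):
        chosen = set()
        while len(chosen) < min(m, new):
            chosen.add(rng.choice(targets))  # degree-proportional sampling
        for t in chosen:
            edges.append((new, t))
            targets += [new, t]      # both endpoints gain one degree
    return edges

edges = barabasi_albert(n=50, m=2)
```

Sampling attachment targets from the degree-weighted `targets` list is what produces the hubs characteristic of real social networks, which in turn shape how disinformation cascades spread.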
zh
[AI-73] Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems ICONIP2024
【Quick Read】: This paper addresses the problem that general pre-trained automatic speech recognition (ASR) models struggle with industry-specific speech recognition tasks in customer relationship management (CRM) systems, leading to inaccurate customer-intent recognition and hindering personalized service. The key is an industry-oriented ASR fine-tuning method that adapts the model to the domain, significantly improving the fine-tuned ASR model's crucial auxiliary role in CRM systems and thereby enhancing customer satisfaction and loyalty.
Link: https://arxiv.org/abs/2507.16843
Authors: Zhongsheng Wang,Sijie Wang,Jia Wang,Yung-I Liang,Yuxi Zhang,Jiamou Liu
Affiliations: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICONIP 2024
Abstract:In the design of customer relationship management (CRM) systems, accurately identifying customer types and offering personalized services are key to enhancing customer satisfaction and loyalty. However, this process faces the challenge of discerning customer voices and intentions, and general pre-trained automatic speech recognition (ASR) models make it difficult to effectively address industry-specific speech recognition tasks. To address this issue, we innovatively proposed a solution for fine-tuning industry-specific ASR models, which significantly improved the performance of the fine-tuned ASR models in industry applications. Experimental results show that our method substantially improves the crucial auxiliary role of the ASR model in industry CRM systems, and this approach has also been adopted in actual industrial applications.
zh
[AI-74] CASPER: Contrastive Approach for Smart Ponzi Scheme Detecter with More Negative Samples
【速读】:该论文旨在解决智能合约中庞氏骗局(Smart Ponzi Scheme)检测难题,尤其针对传统基于深度学习的监督模型因标注数据稀缺而导致训练困难的问题。其核心解决方案是提出一种新颖的对比学习框架CASPER(Contrastive Approach for Smart Ponzi detectER with more negative samples),通过引入更多负样本增强对比学习能力,使模型能够在仅使用少量标注数据的情况下仍能有效提取智能合约源代码的特征表示,从而显著提升检测性能并降低系统复杂度与运营成本。
链接: https://arxiv.org/abs/2507.16840
作者: Weijia Yang,Tian Lan,Leyuan Liu,Wei Chen,Tianqing Zhu,Sheng Wen,Xiaosong Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid evolution of digital currency trading, fueled by the integration of blockchain technology, has led to both innovation and the emergence of smart Ponzi schemes. A smart Ponzi scheme is a fraudulent investment operation in smart contract that uses funds from new investors to pay returns to earlier investors. Traditional Ponzi scheme detection methods based on deep learning typically rely on fully supervised models, which require large amounts of labeled data. However, such data is often scarce, hindering effective model training. To address this challenge, we propose a novel contrastive learning framework, CASPER (Contrastive Approach for Smart Ponzi detectER with more negative samples), designed to enhance smart Ponzi scheme detection in blockchain transactions. By leveraging contrastive learning techniques, CASPER can learn more effective representations of smart contract source code using unlabeled datasets, significantly reducing both operational costs and system complexity. We evaluate CASPER on the XBlock dataset, where it outperforms the baseline by 2.3% in F1 score when trained with 100% labeled data. More impressively, with only 25% labeled data, CASPER achieves an F1 score nearly 20% higher than the baseline under identical experimental conditions. These results highlight CASPER’s potential for effective and cost-efficient detection of smart Ponzi schemes, paving the way for scalable fraud detection solutions in the future.
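CASPER 的核心是"更多负样本"的对比学习。下面给出标准 InfoNCE 损失的最小 numpy 示意(锚点代表某合约嵌入,负样本越多目标越难);论文的实际编码器与损失形式可能不同,此处数据均为随机占位。

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    # 标准 InfoNCE: 拉近锚点与正样本, 推远全部负样本
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    pos = np.exp(cos(anchor, positive) / temperature)
    negs = np.exp([cos(anchor, n) / temperature for n in negatives])
    return -np.log(pos / (pos + negs.sum()))

rng = np.random.default_rng(0)
anchor = rng.normal(size=16)                     # 某智能合约的嵌入(占位)
positive = anchor + 0.05 * rng.normal(size=16)   # 同一合约的增强视图
negatives = rng.normal(size=(64, 16))            # "更多负样本"
loss_many = info_nce_loss(anchor, positive, negatives)
loss_few = info_nce_loss(anchor, positive, negatives[:4])
print(loss_many, loss_few)  # 负样本越多, 目标越难, 损失越大
```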
zh
[AI-75] You Don't Bring Me Flowers: Mitigating Unwanted Recommendations Through Conformal Risk Control RECSYS2025
【速读】:该论文旨在解决推荐系统在个性化内容推送过程中传播无关、unwanted(不希望看到的)甚至有害内容的问题,这些问题不仅降低用户满意度,还可能引发虚假信息扩散、极端化倾向及用户信任流失等社会性风险。其解决方案的关键在于提出一种直观、模型无关且无需假设数据分布的方法,利用简单的二元反馈(binary feedback)对推荐结果进行置信度校准,通过**共形风险控制(conformal risk control)**严格限定不良内容的比例;同时创新性地引入对已消费项的隐式反馈(implicit feedback),在保持推荐集规模可控的前提下扩展推荐集合,从而实现更灵活且鲁棒的风险控制。实验表明,该方法能在最小干预下显著减少不良内容的曝光,且具备良好的实用性与可部署性。
链接: https://arxiv.org/abs/2507.16829
作者: Giovanni De Toni,Erasmo Purificato,Emilia Gómez,Bruno Lepri,Andrea Passerini,Cristian Consonni
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted at the 19th ACM Conference on Recommender Systems (RecSys 2025)
Abstract:Recommenders are significantly shaping online information consumption. While effective at personalizing content, these systems increasingly face criticism for propagating irrelevant, unwanted, and even harmful recommendations. Such content degrades user satisfaction and contributes to significant societal issues, including misinformation, radicalization, and erosion of user trust. Although platforms offer mechanisms to mitigate exposure to undesired content, these mechanisms are often insufficiently effective and slow to adapt to users’ feedback. This paper introduces an intuitive, model-agnostic, and distribution-free method that uses conformal risk control to provably bound unwanted content in personalized recommendations by leveraging simple binary feedback on items. We also address a limitation of traditional conformal risk control approaches, i.e., the fact that the recommender can provide a smaller set of recommended items, by leveraging implicit feedback on consumed items to expand the recommendation set while ensuring robust risk mitigation. Our experimental evaluation on data coming from a popular online video-sharing platform demonstrates that our approach ensures an effective and controllable reduction of unwanted recommendations with minimal effort. The source code is available here: this https URL.
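论文利用 conformal risk control 以二元反馈界定不良推荐的比例。下面是一个极简示意:在校准集上从小到大扫描分数阈值,取第一个使带有限样本修正的"不良占比"不超过 α 的阈值;阈值规则、数据与变量名均为示意性假设,并非论文的原始构造。

```python
import numpy as np

def calibrate_threshold(scores, unwanted, alpha=0.1):
    # 从小到大扫描候选阈值, 返回第一个满足
    # (不良数 + 1) / (保留数 + 1) <= alpha 的阈值(最宽松可行解)
    for lam in np.sort(np.unique(scores)):
        kept = scores >= lam
        n_bad = unwanted[kept].sum()
        if (n_bad + 1) / (kept.sum() + 1) <= alpha:
            return lam
    return np.max(scores)  # 无可行解时退回最保守阈值

rng = np.random.default_rng(1)
scores = rng.random(2000)                         # 模型给出的"想要"分数
unwanted = rng.random(2000) < 0.3 * (1 - scores)  # 低分物品更可能是不良内容
lam = calibrate_threshold(scores, unwanted, alpha=0.1)
kept = scores >= lam
print(lam, unwanted[kept].mean())  # 保留集中的不良占比被压到 alpha 以下
```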
zh
[AI-76] Explainable Vulnerability Detection in C/C++ Using Edge-Aware Graph Attention Networks
【速读】:该论文旨在解决现实世界中源代码安全漏洞检测面临的两大挑战:一是数据集存在类别不平衡问题(即漏洞函数样本稀少),导致现有基于学习的方法虽优化召回率但误报率高,难以融入开发流程;二是多数方法缺乏可解释性,限制了其在安全工作流中的应用。解决方案的关键在于提出了一种基于图结构的框架 ExplainVulD,通过构建代码属性图(Code Property Graph)并采用双通道嵌入表示节点,融合语义与结构信息;引入边感知注意力机制,结合边类型嵌入区分程序关系;同时使用类别加权交叉熵损失缓解类别不平衡问题;最终在 ReVeal 数据集上实现了更高的准确率和 F1 分数,并提供可解释输出以识别关键代码区域,增强安全分析的透明度与可信度。
链接: https://arxiv.org/abs/2507.16540
作者: Radowanul Haque,Aftab Ali,Sally McClean,Naveed Khan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Detecting security vulnerabilities in source code remains challenging, particularly due to class imbalance in real-world datasets where vulnerable functions are under-represented. Existing learning-based methods often optimise for recall, leading to high false positive rates and reduced usability in development workflows. Furthermore, many approaches lack explainability, limiting their integration into security workflows. This paper presents ExplainVulD, a graph-based framework for vulnerability detection in C/C++ code. The method constructs Code Property Graphs and represents nodes using dual-channel embeddings that capture both semantic and structural information. These are processed by an edge-aware attention mechanism that incorporates edge-type embeddings to distinguish among program relations. To address class imbalance, the model is trained using class-weighted cross-entropy loss. ExplainVulD achieves a mean accuracy of 88.25 percent and an F1 score of 48.23 percent across 30 independent runs on the ReVeal dataset. These results represent relative improvements of 4.6 percent in accuracy and 16.9 percent in F1 score compared to the ReVeal model, a prior learning-based method. The framework also outperforms static analysis tools, with relative gains of 14.0 to 14.1 percent in accuracy and 132.2 to 201.2 percent in F1 score. Beyond improved detection performance, ExplainVulD produces explainable outputs by identifying the most influential code regions within each function, supporting transparency and trust in security triage.
zh
[AI-77] To Trust or Not to Trust: On Calibration in ML-based Resource Allocation for Wireless Networks
【速读】:该论文旨在解决下一代通信系统中机器学习(ML)模型在资源分配场景下预测准确性与置信度校准之间的关系问题,特别是如何通过校准提升系统级中断概率(Outage Probability, OP)的可靠性。其核心挑战在于:现有模型虽能提供预测结果,但缺乏对预测置信度的准确刻画,导致实际系统性能难以满足特定可靠性要求。解决方案的关键在于建立理论框架,揭示完美校准模型下OP随资源数量变化的渐近特性——即当资源数增加时,OP趋近于条件期望输出;同时证明了后处理校准(如Platt缩放和等距回归)无法降低系统最小可实现OP,因为其不引入关于未来信道状态的新信息。此外,论文提出一个单调性条件,明确具有改进OP能力的预测器需满足的精度-置信度函数性质,从而为设计满足特定OP目标的分类阈值提供理论依据。
链接: https://arxiv.org/abs/2507.17494
作者: Rashika Raina,Nidhi Simmons,David E. Simmons,Michel Daoud Yacoub,Trung Q. Duong
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
Abstract:In next-generation communications and networks, machine learning (ML) models are expected to deliver not only accurate predictions but also well-calibrated confidence scores that reflect the true likelihood of correct decisions. This paper studies the calibration performance of an ML-based outage predictor within a single-user, multi-resource allocation framework. We first establish key theoretical properties of this system’s outage probability (OP) under perfect calibration. Importantly, we show that as the number of resources grows, the OP of a perfectly calibrated predictor approaches the expected output conditioned on it being below the classification threshold. In contrast, when only one resource is available, the system’s OP equals the model’s overall expected output. We then derive the OP conditions for a perfectly calibrated predictor. These findings guide the choice of the classification threshold to achieve a desired OP, helping system designers meet specific reliability requirements. We also demonstrate that post-processing calibration cannot improve the system’s minimum achievable OP, as it does not introduce new information about future channel states. Additionally, we show that well-calibrated models are part of a broader class of predictors that necessarily improve OP. In particular, we establish a monotonicity condition that the accuracy-confidence function must satisfy for such improvement to occur. To demonstrate these theoretical properties, we conduct a rigorous simulation-based analysis using post-processing calibration techniques: Platt scaling and isotonic regression. As part of this framework, the predictor is trained using an outage loss function specifically designed for this system. Furthermore, this analysis is performed on Rayleigh fading channels with temporal correlation captured by Clarke’s 2D model, which accounts for receiver mobility.
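论文的仿真分析用到了 Platt 缩放与等距回归两种后处理校准。下面用纯 numpy 勾勒 Platt 缩放的最小实现:在保留集上以梯度下降拟合 sigmoid(a*s + b) 的对数损失;数据与超参数均为虚构,仅示意其原理。

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    # 在保留集上用梯度下降拟合 sigmoid(a*s + b)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels                 # 对 logit 的梯度
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b

rng = np.random.default_rng(0)
true_logit = rng.normal(size=5000)
labels = (rng.random(5000) < 1 / (1 + np.exp(-true_logit))).astype(float)
raw_scores = 3.0 * true_logit             # 人为制造 3 倍"过自信"的原始分数
a, b = platt_scale(raw_scores, labels)
print(a, b)  # a 应收缩到约 1/3, 把过自信的分数拉回校准
```

与论文结论一致的一点直觉:这类后处理只重标定置信度,并不引入未来信道状态的新信息,因此无法降低系统可达的最小中断概率。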
zh
[AI-78] Demonstration of Efficient Predictive Surrogates for Large-scale Quantum Processors
【速读】:该论文旨在解决大规模量子处理器因制造成本高昂而难以普及的问题,从而限制了其在实际场景中的广泛应用。解决方案的关键在于提出“预测代理模型”(predictive surrogates),即一类经典学习模型,能够以可证明的计算效率模拟给定量子处理器的平均值行为。该方法显著减少了对量子处理器访问的需求,在数字量子模拟等任务中实现高效预训练和相变识别,实验表明其不仅能将测量开销降低数个数量级,还能超越传统依赖大量量子资源的方法,为拓展先进量子处理器的应用范围提供了切实可行的路径。
链接: https://arxiv.org/abs/2507.17470
作者: Wei-You Liao,Yuxuan Du,Xinbiao Wang,Tian-Ci Tian,Yong Luo,Bo Du,Dacheng Tao,He-Liang Huang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 53 pages, 15 figures, comments are welcome
Abstract:The ongoing development of quantum processors is driving breakthroughs in scientific discovery. Despite this progress, the formidable cost of fabricating large-scale quantum processors means they will remain rare for the foreseeable future, limiting their widespread application. To address this bottleneck, we introduce the concept of predictive surrogates, which are classical learning models designed to emulate the mean-value behavior of a given quantum processor with provably computational efficiency. In particular, we propose two predictive surrogates that can substantially reduce the need for quantum processor access in diverse practical scenarios. To demonstrate their potential in advancing digital quantum simulation, we use these surrogates to emulate a quantum processor with up to 20 programmable superconducting qubits, enabling efficient pre-training of variational quantum eigensolvers for families of transverse-field Ising models and identification of non-equilibrium Floquet symmetry-protected topological phases. Experimental results reveal that the predictive surrogates not only reduce measurement overhead by orders of magnitude, but can also surpass the performance of conventional, quantum-resource-intensive approaches. Collectively, these findings establish predictive surrogates as a practical pathway to broadening the impact of advanced quantum processors.
zh
[AI-79] HuiduRep: A Robust Self-Supervised Framework for Learning Neural Representations from Extracellular Spikes
【速读】:该论文旨在解决神经科学中基于细胞外记录(extracellular recordings)的尖峰排序(spike sorting)在低信噪比(SNR)、电极漂移及跨会话变异条件下性能下降的问题。其解决方案的关键在于提出了一种名为HuiduRep的鲁棒自监督表示学习框架,通过结合对比学习(contrastive learning)与去噪自动编码器(denoising autoencoder),从尖峰波形中提取具有判别性和泛化能力的潜在特征表示,从而实现无需监督的聚类式尖峰排序,显著提升了处理复杂真实数据场景下的稳定性与准确性。
链接: https://arxiv.org/abs/2507.17224
作者: Feng Cao,Zishuo Feng
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 9 pages, 3 figures, 6 tables
Abstract:Extracellular recordings are brief voltage fluctuations recorded near neurons, widely used in neuroscience as the basis for decoding brain activity at single-neuron resolution. Spike sorting, which assigns each spike to its source neuron, is a critical step in brain sensing pipelines. However, it remains challenging under low signal-to-noise ratio (SNR), electrode drift, and cross-session variability. In this paper, we propose HuiduRep, a robust self-supervised representation learning framework that extracts discriminative and generalizable features from extracellular spike waveforms. By combining contrastive learning with a denoising autoencoder, HuiduRep learns latent representations that are robust to noise and drift. Built on HuiduRep, we develop a spike sorting pipeline that clusters spike representations without supervision. Experiments on hybrid and real-world datasets demonstrate that HuiduRep achieves strong robustness and the pipeline matches or outperforms state-of-the-art tools such as KiloSort4 and MountainSort5. These findings demonstrate the potential of self-supervised spike representation learning as a foundational tool for robust and generalizable processing of extracellular recordings.
zh
[AI-80] Weather-Aware AI Systems versus Route-Optimization AI: A Comprehensive Analysis of AI Applications in Transportation Productivity
【速读】:该论文旨在解决当前人工智能(AI)在交通领域应用中被严重低估的问题,即现有研究主要聚焦于路径优化算法,而忽视了天气等环境因素对网约车司机生产力的显著影响。其解决方案的关键在于提出并验证了一种综合性的“天气感知型AI系统”,该系统融合了深度学习气象预测与机器学习定位优化技术,能够同时应对多维度运营挑战,从而实现比仅依赖路径优化的AI系统更高的收益提升(107.3% vs. 14%),并揭示出天气情报作为核心变量所带来的巨大经济价值和市场潜力。
链接: https://arxiv.org/abs/2507.17099
作者: Tatsuru Kikuchi
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: 41 pages, 5 figures
Abstract:While recent research demonstrates that AI route-optimization systems improve taxi driver productivity by 14%, this study reveals that such findings capture only a fraction of AI’s potential in transportation. We examine comprehensive weather-aware AI systems that integrate deep learning meteorological prediction with machine learning positioning optimization, comparing their performance against traditional operations and route-only AI approaches. Using simulation data from 10,000 taxi operations across varied weather conditions, we find that weather-aware AI systems increase driver revenue by 107.3%, compared to 14% improvements from route-optimization alone. Weather prediction contributes the largest individual productivity gain, with strong correlations between meteorological conditions and demand ( r=0.575 ). Economic analysis reveals annual earnings increases of 13.8 million yen per driver, with rapid payback periods and superior return on investment. These findings suggest that current AI literature significantly underestimates AI’s transformative potential by focusing narrowly on routing algorithms, while weather intelligence represents an untapped $8.9 billion market opportunity. Our results indicate that future AI implementations should adopt comprehensive approaches that address multiple operational challenges simultaneously rather than optimizing isolated functions.
zh
[AI-81] Computational Performance Bounds Prediction in Quantum Computing with Unstable Noise
【速读】:该论文旨在解决量子计算设备中噪声波动对计算性能预测的挑战,特别是如何高效、准确地预测在动态噪声环境下量子任务的实际执行性能边界(即性能上限与下限),以支持下一代量子中心型超级计算机中的系统管理(如作业调度)和功能正确性保障。其解决方案的关键在于提出了一种数据驱动的工作流 QuBound,该方法通过分解历史性能轨迹来分离噪声源,并设计了一种新颖的编码器将电路特征与噪声信息嵌入到长短期记忆(Long Short-Term Memory, LSTM)网络中,从而实现对不同量子电路在实时噪声条件下的性能边界进行高精度预测。实验表明,相比现有仅输出单一性能值的预测方法,QuBound 能更可靠地覆盖实际性能范围,且在速度上比传统量子模拟快超过 10⁶ 倍,预测区间宽度也比最优分析方法窄 10 倍以上。
链接: https://arxiv.org/abs/2507.17043
作者: Jinyang Li,Samudra Dasgupta,Yuhong Song,Lei Yang,Travis Humble,Weiwen Jiang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Quantum computing has significantly advanced in recent years, boasting devices with hundreds of quantum bits (qubits), hinting at its potential quantum advantage over classical computing. Yet, noise in quantum devices poses significant barriers to realizing this supremacy. Understanding noise’s impact is crucial for reproducibility and application reuse; moreover, the next-generation quantum-centric supercomputing essentially requires efficient and accurate noise characterization to support system management (e.g., job scheduling), where ensuring correct functional performance (i.e., fidelity) of jobs on available quantum devices can even be higher-priority than traditional objectives. However, noise fluctuates over time, even on the same quantum device, which makes predicting the computational bounds under on-the-fly noise vital. Noisy quantum simulation can offer insights but faces efficiency and scalability issues. In this work, we propose a data-driven workflow, namely QuBound, to predict computational performance bounds. It decomposes historical performance traces to isolate noise sources and devises a novel encoder to embed circuit and noise information processed by a Long Short-Term Memory (LSTM) network. For evaluation, we compare QuBound with a state-of-the-art learning-based predictor, which only generates a single performance value instead of a bound. Experimental results show that the result of the existing approach falls outside of performance bounds, while all predictions from our QuBound with the assistance of performance decomposition better fit the bounds. Moreover, QuBound can efficiently produce practical bounds for various circuits with over 10^6 speedup over simulation; in addition, the range from QuBound is over 10x narrower than the state-of-the-art analytical approach.
zh
[AI-82] Bayesian preference elicitation for decision support in multiobjective optimization
【速读】:该论文旨在解决多目标优化问题中决策者从帕累托前沿(Pareto front)中高效识别偏好解的难题。其解决方案的关键在于引入贝叶斯模型来估计决策者的效用函数,该模型基于成对比较(pairwise comparisons)数据进行学习,并结合一种兼顾探索与利用的主动查询策略,以最小的交互次数引导发现高效用解。此外,该方法在 elicitation 阶段结束后还能生成一个精简的高质量解集,显著简化最终决策流程。
链接: https://arxiv.org/abs/2507.16999
作者: Felix Huber,Sebastian Rojas Gonzalez,Raul Astudillo
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 5 figures
Abstract:We present a novel approach to help decision-makers efficiently identify preferred solutions from the Pareto set of a multi-objective optimization problem. Our method uses a Bayesian model to estimate the decision-maker’s utility function based on pairwise comparisons. Aided by this model, a principled elicitation strategy selects queries interactively to balance exploration and exploitation, guiding the discovery of high-utility solutions. The approach is flexible: it can be used interactively or a posteriori after estimating the Pareto front through standard multi-objective optimization techniques. Additionally, at the end of the elicitation phase, it generates a reduced menu of high-quality solutions, simplifying the decision-making process. Through experiments on test problems with up to nine objectives, our method demonstrates superior performance in finding high-utility solutions with a small number of queries. We also provide an open-source implementation of our method to support its adoption by the broader community.
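下面以网格近似勾勒论文"基于成对比较的贝叶斯效用估计"思路:对候选线性效用权重维护后验分布,用 Bradley-Terry 型似然吸收每一次"偏好 a 胜过 b"的回答。论文实际的模型与查询选择策略更复杂,此处目标数、温度等参数均为假设。

```python
import numpy as np

def posterior_over_weights(weight_grid, comparisons, items, temperature=0.2):
    # 对候选权重网格维护后验: 先验均匀, 每次成对比较乘上
    # Bradley-Terry 型似然 sigmoid((u_a - u_b) / temperature)
    log_post = np.zeros(len(weight_grid))
    for a, b in comparisons:                         # 表示 "a 被偏好于 b"
        u = weight_grid @ (items[a] - items[b]) / temperature
        log_post += -np.log1p(np.exp(-u))            # log sigmoid(u)
    log_post -= log_post.max()                       # 数值稳定
    post = np.exp(log_post)
    return post / post.sum()

rng = np.random.default_rng(0)
items = rng.random((20, 2))                          # 20 个两目标的帕累托式备选解
true_w = np.array([0.8, 0.2])                        # 决策者隐藏的真实权重
pairs = [tuple(rng.choice(20, size=2, replace=False)) for _ in range(30)]
comparisons = [(a, b) if items[a] @ true_w >= items[b] @ true_w else (b, a)
               for a, b in pairs]
grid = np.array([[w, 1.0 - w] for w in np.linspace(0, 1, 101)])
post = posterior_over_weights(grid, comparisons, items)
mean_w0 = post @ grid[:, 0]
print(mean_w0)  # 第一个权重的后验均值, 应被拉向 0.8 一侧
```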
zh
[AI-83] Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN
【速读】:该论文旨在解决DNA测序数据快速增长背景下,传统基于启发式的方法(如BLAST)在大规模相似性搜索任务中面临的计算效率低、对远缘序列检测效果差的问题。其核心解决方案是采用嵌入式(embedding-based)相似性搜索方法,利用深度学习模型生成捕捉基因序列深层结构与功能模式的向量表示,并系统评估FAISS和ScaNN两个先进的向量检索库在生物信息学特定基因嵌入上的性能表现。结果表明,该方案在内存占用和运行时间上具有显著优势,同时提升了新序列(如未分类物种或无已知同源物的基因)的检索准确性,为替代依赖序列比对的传统工具提供了高效且精准的新路径。
链接: https://arxiv.org/abs/2507.16978
作者: Mohammad Saleh Refahi,Gavin Hearne,Harrison Muller,Kieran Lynch,Bahrad A. Sokhansanj,James R. Brown,Gail Rosen
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. Although tools like BLAST have been widely used and remain effective in many scenarios, they suffer from limitations such as high computational cost and poor performance on divergent sequences. In this work, we explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. We systematically evaluate two state-of-the-art vector search libraries, FAISS and ScaNN, on biologically meaningful gene embeddings. Unlike prior studies, our analysis focuses on bioinformatics-specific embeddings and benchmarks their utility for detecting novel sequences, including those from uncharacterized taxa or genes lacking known homologs. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools. Journal reference: GenBio @ ICML 2025 Workshop, OpenReview, June 2025
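FAISS/ScaNN 加速的核心操作是向量 top-k 相似检索。下面用 numpy 暴力实现同一操作作为参照(真实场景中可换成 faiss 的 IndexFlatIP 等索引以获得更好的扩展性);基因嵌入数据为随机生成的占位。

```python
import numpy as np

def top_k_neighbors(query, db, k=5):
    # 归一化后取内积 top-k, 即余弦相似度检索;
    # FAISS 的 IndexFlatIP 在大规模下返回同样的结果且快得多
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = db_n @ q_n
    idx = np.argsort(-sims)[:k]
    return idx, sims[idx]

rng = np.random.default_rng(3)
gene_embeddings = rng.normal(size=(10_000, 64))  # 随机占位的基因嵌入
query = gene_embeddings[42] + 0.01 * rng.normal(size=64)  # 第 42 条的近似副本
idx, sims = top_k_neighbors(query, gene_embeddings, k=5)
print(idx[0], sims[0])  # 最近邻应为第 42 条, 相似度接近 1
```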
zh
[AI-84] Machine learning-based multimodal prognostic models integrating pathology images and high-throughput omic data for overall survival prediction in cancer: a systematic review
【速读】:该论文旨在解决癌症预后预测中如何有效整合组织病理图像(whole slide images, WSI)与高通量组学数据以提升生存预测准确性的问题。其解决方案的关键在于系统性地综述和评估多模态机器学习方法在该领域的应用,发现深度学习模型在整合WSI与组学数据时通常优于单一模态模型,但当前研究普遍存在偏倚风险高、外部验证不足及临床实用性未充分探讨等方法学缺陷,因此提出需加强研究的严谨性、扩大数据集多样性并开展临床转化评估。
链接: https://arxiv.org/abs/2507.16876
作者: Charlotte Jennings(1, 2),Andrew Broad(1),Lucy Godson(1),Emily Clarke(1, 2),David Westhead(2),Darren Treanor(1, 2, 3) ((1) National Pathology Imaging Cooperative, Leeds Teaching Hospitals NHS Trust, Leeds, UK (2) University of Leeds, Leeds, UK (3) Linköping University, Linköping, Sweden)
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Main article (50 pages, inc 3 tables, 4 figures). Supplementary material included with additional methodological information and data
Abstract:Multimodal machine learning integrating histopathology and molecular data shows promise for cancer prognostication. We systematically reviewed studies combining whole slide images (WSIs) and high-throughput omics to predict overall survival. Searches of EMBASE, PubMed, and Cochrane CENTRAL (12/08/2024), plus citation screening, identified eligible studies. Data extraction used CHARMS; bias was assessed with PROBAST+AI; synthesis followed SWiM and PRISMA 2020. Protocol: PROSPERO (CRD42024594745). Forty-eight studies (all since 2017) across 19 cancer types met criteria; all used The Cancer Genome Atlas. Approaches included regularised Cox regression (n=4), classical ML (n=13), and deep learning (n=31). Reported c-indices ranged 0.550-0.857; multimodal models typically outperformed unimodal ones. However, all studies showed unclear/high bias, limited external validation, and little focus on clinical utility. Multimodal WSI-omics survival prediction is a fast-growing field with promising results but needs improved methodological rigor, broader datasets, and clinical evaluation. Funded by NPIC, Leeds Teaching Hospitals NHS Trust, UK (Project 104687), supported by UKRI Industrial Strategy Challenge Fund. Submitted 22 Jul 2025 (v1).
zh
机器学习
[LG-0] HydraOpt: Navigating the Efficiency-Performance Trade-off of Adapter Merging
链接: https://arxiv.org/abs/2507.17706
作者: Taha Ceritli,Ondrej Bohdal,Mete Ozay,Jijoong Moon,Kyeng-Hun Lee,Hyeonmok Ko,Umberto Michieli
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models (LLMs) often leverage adapters, such as low-rank-based adapters, to achieve strong performance on downstream tasks. However, storing a separate adapter for each task significantly increases memory requirements, posing a challenge for resource-constrained environments such as mobile devices. Although model merging techniques can reduce storage costs, they typically result in substantial performance degradation. In this work, we introduce HydraOpt, a new model merging technique that capitalizes on the inherent similarities between the matrices of low-rank adapters. Unlike existing methods that produce a fixed trade-off between storage size and performance, HydraOpt allows us to navigate this spectrum of efficiency and performance. Our experiments show that HydraOpt significantly reduces storage size (48% reduction) compared to storing all adapters, while achieving competitive performance (0.2-1.8% drop). Furthermore, it outperforms existing merging techniques in terms of performance at the same or slightly worse storage efficiency.
[LG-1] Mindfulness Meditation and Respiration: Accelerometer-Based Respiration Rate and Mindfulness Progress Estimation to Enhance App Engagement and Mindfulness Skills
链接: https://arxiv.org/abs/2507.17688
作者: Mohammad Nur Hossain Khan,David Creswell,Jordan Albert,Patrick O’Connell,Shawn Fallon,Mathew Polowitz,Xuhai “Orson” Xu,Bashima Islam
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Accepted in Proc. ACM Interact. Mob. Wearable Ubiquitous Technology (IMWUT)
Abstract:Mindfulness training is widely recognized for its benefits in reducing depression, anxiety, and loneliness. With the rise of smartphone-based mindfulness apps, digital meditation has become more accessible, but sustaining long-term user engagement remains a challenge. This paper explores whether respiration biosignal feedback and mindfulness skill estimation enhance system usability and skill development. We develop a smartphone’s accelerometer-based respiration tracking algorithm, eliminating the need for additional wearables. Unlike existing methods, our approach accurately captures slow breathing patterns typical of mindfulness meditation. Additionally, we introduce the first quantitative framework to estimate mindfulness skills-concentration, sensory clarity, and equanimity-based on accelerometer-derived respiration data. We develop and test our algorithms on 261 mindfulness sessions in both controlled and real-world settings. A user study comparing an experimental group receiving biosignal feedback with a control group using a standard app shows that respiration feedback enhances system usability. Our respiration tracking model achieves a mean absolute error (MAE) of 1.6 breaths per minute, closely aligning with ground truth data, while our mindfulness skill estimation attains F1 scores of 80-84% in tracking skill progression. By integrating respiration tracking and mindfulness estimation into a commercial app, we demonstrate the potential of smartphone sensors to enhance digital mindfulness training.
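论文基于手机加速度计追踪冥想时的慢呼吸。下面是一个高度简化的示意:对去直流后的加速度信号做 FFT,在 0.05-0.5 Hz(3-30 次/分)频带内取主峰作为呼吸频率;论文的实际算法更复杂,此处采样率与信号均为合成假设。

```python
import numpy as np

def respiration_rate_bpm(accel, fs):
    # 去直流后做 FFT, 在慢呼吸频带 0.05-0.5 Hz 内取主峰
    x = accel - accel.mean()
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    spectrum = np.abs(np.fft.rfft(x))
    band = (freqs >= 0.05) & (freqs <= 0.5)
    return 60.0 * freqs[band][np.argmax(spectrum[band])]

fs = 50.0                                  # 假设 50 Hz 采样率
t = np.arange(0, 120, 1 / fs)              # 2 分钟合成记录
rng = np.random.default_rng(0)
# 重力偏置 + 0.1 Hz(每分钟 6 次)的呼吸分量 + 传感器噪声
accel = 9.81 + 0.02 * np.sin(2 * np.pi * 0.1 * t) + 0.005 * rng.normal(size=t.size)
rate = respiration_rate_bpm(accel, fs)
print(rate)  # 约 6 次/分
```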
[LG-2] Towards Effective Open-set Graph Class-incremental Learning
链接: https://arxiv.org/abs/2507.17687
作者: Jiazhen Chen,Zheng Ma,Sichao Fu,Mingbin Feng,Tony S. Wirjanto,Weihua Ou
类目: Machine Learning (cs.LG)
*备注: Accepted by 33rd ACM International Conference on Multimedia (MM 2025)
Abstract:Graph class-incremental learning (GCIL) allows graph neural networks (GNNs) to adapt to evolving graph analytical tasks by incrementally learning new class knowledge while retaining knowledge of old classes. Existing GCIL methods primarily focus on a closed-set assumption, where all test samples are presumed to belong to previously known classes. Such an assumption restricts their applicability in real-world scenarios, where unknown classes naturally emerge during inference, and are absent during training. In this paper, we explore a more challenging open-set graph class-incremental learning scenario with two intertwined challenges: catastrophic forgetting of old classes, which impairs the detection of unknown classes, and inadequate open-set recognition, which destabilizes the retention of learned knowledge. To address the above problems, a novel OGCIL framework is proposed, which utilizes pseudo-sample embedding generation to effectively mitigate catastrophic forgetting and enable robust detection of unknown classes. To be specific, a prototypical conditional variational autoencoder is designed to synthesize node embeddings for old classes, enabling knowledge replay without storing raw graph data. To handle unknown classes, we employ a mixing-based strategy to generate out-of-distribution (OOD) samples from pseudo in-distribution and current node embeddings. A novel prototypical hypersphere classification loss is further proposed, which anchors in-distribution embeddings to their respective class prototypes, while repelling OOD embeddings away. Instead of assigning all unknown samples into one cluster, our proposed objective function explicitly models them as outliers through prototype-aware rejection regions, ensuring a robust open-set recognition. Extensive experiments on five benchmarks demonstrate the effectiveness of OGCIL over existing GCIL and open-set GNN methods.
[LG-3] Generalized Dual Discriminator GANs
链接: https://arxiv.org/abs/2507.17684
作者: Penukonda Naga Chandana,Tejas Srivastava,Gowtham R. Kurri,V. Lalitha
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注: 8 pages, 2 figures, extended version of a paper accepted for presentation at ITW 2025
Abstract:Dual discriminator generative adversarial networks (D2 GANs) were introduced to mitigate the problem of mode collapse in generative adversarial networks. In D2 GANs, two discriminators are employed alongside a generator: one discriminator rewards high scores for samples from the true data distribution, while the other favors samples from the generator. In this work, we first introduce dual discriminator \alpha -GANs (D2 \alpha -GANs), which combines the strengths of dual discriminators with the flexibility of a tunable loss function, \alpha -loss. We further generalize this approach to arbitrary functions defined on positive reals, leading to a broader class of models we refer to as generalized dual discriminator generative adversarial networks. For each of these proposed models, we provide theoretical analysis and show that the associated min-max optimization reduces to the minimization of a linear combination of an f -divergence and a reverse f -divergence. This generalizes the known simplification for D2-GANs, where the objective reduces to a linear combination of the KL-divergence and the reverse KL-divergence. Finally, we perform experiments on 2D synthetic data and use multiple performance metrics to capture various advantages of our GANs.
[LG-4] XStacking: Explanation-Guided Stacked Ensemble Learning
链接: https://arxiv.org/abs/2507.17650
作者: Moncef Garouani,Ayah Barhrhouj,Olivier Teste
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ensemble Machine Learning (EML) techniques, especially stacking, have been shown to improve predictive performance by combining multiple base models. However, they are often criticized for their lack of interpretability. In this paper, we introduce XStacking, an effective and inherently explainable framework that addresses this limitation by integrating dynamic feature transformation with model-agnostic Shapley additive explanations. This enables stacked models to retain their predictive accuracy while becoming inherently explainable. We demonstrate the effectiveness of the framework on 29 datasets, achieving improvements in both the predictive effectiveness of the learning space and the interpretability of the resulting models. XStacking offers a practical and scalable solution for responsible ML.
[LG-5] Citation Recommendation using Deep Canonical Correlation Analysis
链接: https://arxiv.org/abs/2507.17603
作者: Conor McNamara,Effirul Ramlan
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 21 pages, 6 figures, 7 tables
Abstract:Recent advances in citation recommendation have improved accuracy by leveraging multi-view representation learning to integrate the various modalities present in scholarly documents. However, effectively combining multiple data views requires fusion techniques that can capture complementary information while preserving the unique characteristics of each modality. We propose a novel citation recommendation algorithm that improves upon linear Canonical Correlation Analysis (CCA) methods by applying Deep CCA (DCCA), a neural network extension capable of capturing complex, non-linear relationships between distributed textual and graph-based representations of scientific articles. Experiments on the large-scale DBLP (Digital Bibliography Library Project) citation network dataset demonstrate that our approach outperforms state-of-the-art CCA-based methods, achieving relative improvements of over 11% in Mean Average Precision@10, 5% in Precision@10, and 7% in Recall@10. These gains reflect more relevant citation recommendations and enhanced ranking quality, suggesting that DCCA’s non-linear transformations yield more expressive latent representations than CCA’s linear projections.
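As background, the objective that DCCA optimizes, the total correlation between two views, has a closed form for fixed representations; below is a minimal NumPy sketch of that quantity (an illustration of linear CCA applied directly to the views, not the paper's code):

```python
import numpy as np

def total_canonical_correlation(H1, H2, eps=1e-6):
    """Sum of canonical correlations between two views (n x d matrices).

    DCCA maximizes this quantity on the outputs of two neural networks;
    here the raw views are used directly, which recovers linear CCA.
    """
    n = H1.shape[0]
    H1 = H1 - H1.mean(0)                               # center each view
    H2 = H2 - H2.mean(0)
    S11 = H1.T @ H1 / (n - 1) + eps * np.eye(H1.shape[1])  # regularized covariances
    S22 = H2.T @ H2 / (n - 1) + eps * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)                          # cross-covariance

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()    # trace norm of T
```

In DCCA this value is computed on the outputs of two networks and maximized by backpropagating through the matrix decompositions, which is what lets it capture non-linear relationships between the textual and graph-based representations.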
[LG-6] Generalized Advantage Estimation for Distributional Policy Gradients
链接: https://arxiv.org/abs/2507.17530
作者: Shahil Shaik,Jonathon M. Smereka,Yue Wang
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 6 pages, 3 figures, published at ACC 2025 Conference
Abstract:Generalized Advantage Estimation (GAE) has been used to mitigate the computational complexity of reinforcement learning (RL) by employing an exponentially weighted estimation of the advantage function to reduce the variance in policy gradient estimates. Despite its effectiveness, GAE is not designed to handle value distributions integral to distributional RL, which can capture the inherent stochasticity in systems and is hence more robust to system noises. To address this gap, we propose a novel approach that utilizes the optimal transport theory to introduce a Wasserstein-like directional metric, which measures both the distance and the directional discrepancies between probability distributions. Using the exponentially weighted estimation, we leverage this Wasserstein-like directional metric to derive distributional GAE (DGAE). Similar to traditional GAE, our proposed DGAE provides a low-variance advantage estimate with controlled bias, making it well-suited for policy gradient algorithms that rely on advantage estimation for policy updates. We integrated DGAE into three different policy gradient methods. The algorithms were evaluated across various OpenAI Gym environments and compared with baselines using traditional GAE to assess performance.
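As background, the scalar GAE recursion that DGAE generalizes to value distributions can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted advantage estimates (standard scalar GAE).

    `values` has length len(rewards) + 1 (a bootstrap value is appended).
    lam=0 recovers one-step TD residuals; lam=1 recovers Monte Carlo returns.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running                 # exponential weighting
        adv[t] = running
    return adv
```

DGAE replaces the scalar TD residual with a Wasserstein-like directional discrepancy between return distributions, but keeps this same exponentially weighted accumulation.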
[LG-7] Generalized Low-Rank Matrix Contextual Bandits with Graph Information
链接: https://arxiv.org/abs/2507.17528
作者: Yao Wang,Jiannan Li,Yue Kang,Shanxing Gao,Zhenxin Xiao
类目: Machine Learning (cs.LG)
*备注:
Abstract:The matrix contextual bandit (CB), as an extension of the well-known multi-armed bandit, is a powerful framework that has been widely applied in sequential decision-making scenarios involving low-rank structure. In many real-world scenarios, such as online advertising and recommender systems, additional graph information often exists beyond the low-rank structure, that is, the similar relationships among users/items can be naturally captured through the connectivity among nodes in the corresponding graphs. However, existing matrix CB methods fail to exploit such graph information, thereby making it difficult for them to generate effective decision-making policies. To fill this void, we propose in this paper a novel matrix CB algorithmic framework that builds upon the classical upper confidence bound (UCB) framework. This new framework can effectively integrate both the low-rank structure and graph information in a unified manner. Specifically, it involves first solving a joint nuclear norm and matrix Laplacian regularization problem, followed by the implementation of a graph-based generalized linear version of the UCB algorithm. Rigorous theoretical analysis demonstrates that our procedure outperforms several popular alternatives in terms of cumulative regret bound, owing to the effective utilization of graph information. A series of synthetic and real-world data experiments are conducted to further illustrate the merits of our procedure.
[LG-8] C3RL: Rethinking the Combination of Channel-independence and Channel-mixing from Representation Learning
链接: https://arxiv.org/abs/2507.17454
作者: Shusen Ma,Yun-Bo Zhao,Yu Kang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate time series forecasting has drawn increasing attention due to its practical importance. Existing approaches typically adopt either channel-mixing (CM) or channel-independence (CI) strategies. CM strategy can capture inter-variable dependencies but fails to discern variable-specific temporal patterns. CI strategy improves this aspect but fails to fully exploit cross-variable dependencies like CM. Hybrid strategies based on feature fusion offer limited generalization and interpretability. To address these issues, we propose C3RL, a novel representation learning framework that jointly models both CM and CI strategies. Motivated by contrastive learning in computer vision, C3RL treats the inputs of the two strategies as transposed views and builds a siamese network architecture: one strategy serves as the backbone, while the other complements it. By jointly optimizing contrastive and prediction losses with adaptive weighting, C3RL balances representation and forecasting performance. Extensive experiments on seven models show that C3RL boosts the best-case performance rate to 81.4% for models based on CI strategy and to 76.3% for models based on CM strategy, demonstrating strong generalization and effectiveness. The code will be available once the paper is accepted.
[LG-9] Efficient Neural Network Verification via Order Leading Exploration of Branch-and-Bound Trees DATE2025
链接: https://arxiv.org/abs/2507.17453
作者: Guanqin Zhang,Kota Fukuda,Zhenya Zhang,H.M.N. Dilum Bandara,Shiping Chen,Jianjun Zhao,Yulei Sui
类目: Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
*备注: This is an extended version of the ECOOP 2025 paper, with a comparison with DATE 2025 (Figure 7 of RQ1 in Section 5.2), as well as an in-depth discussion of OOPSLA 2025 in the related work (Section 6)
Abstract:The vulnerability of neural networks to adversarial perturbations has necessitated formal verification techniques that can rigorously certify the quality of neural networks. As the state-of-the-art, branch and bound (BaB) is a "divide-and-conquer" strategy that applies off-the-shelf verifiers to sub-problems for which they perform better. While BaB can identify the sub-problems that need to be split, it explores the space of these sub-problems in a naive "first-come-first-served" manner, thereby suffering from inefficiency in reaching a verification conclusion. To bridge this gap, we introduce an order over the different sub-problems produced by BaB, based on their likelihoods of containing counterexamples. Based on this order, we propose a novel verification framework Oliva that explores the sub-problem space by prioritizing those sub-problems that are more likely to contain counterexamples, in order to efficiently reach the conclusion of the verification. Even if no counterexample can be found in any sub-problem, this only changes the order in which the different sub-problems are visited and so will not lead to a performance degradation. Specifically, Oliva has two variants, including Oliva^GR, a greedy strategy that always prioritizes the sub-problems that are more likely to contain counterexamples, and Oliva^SA, a balanced strategy inspired by simulated annealing that gradually shifts from exploration to exploitation to locate the globally optimal sub-problems. We experimentally evaluate the performance of Oliva on 690 verification problems spanning 5 models with the MNIST and CIFAR10 datasets. Compared to state-of-the-art approaches, we demonstrate speedups of Oliva of up to 25X on MNIST and up to 80X on CIFAR10.
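The prioritized exploration idea can be sketched as a best-first branch-and-bound loop; `split`, `score`, and `verify` below are hypothetical stand-ins for a verifier backend, not the Oliva implementation:

```python
import heapq

def verify_bab(root, split, score, verify, max_iters=10_000):
    """Best-first branch-and-bound over sub-problems.

    Sub-problems are visited in order of a counterexample-likelihood
    `score` (higher popped first) instead of first-come-first-served.
    Returns a counterexample sub-problem if one is found, else None.
    """
    heap = [(-score(root), 0, root)]        # max-heap via negated score
    counter = 0                              # tie-breaker for heapq
    while heap and max_iters > 0:
        _, _, prob = heapq.heappop(heap)
        max_iters -= 1
        result = verify(prob)                # 'safe' | 'unsafe' | 'unknown'
        if result == 'unsafe':
            return prob                      # counterexample region found
        if result == 'unknown':
            for child in split(prob):        # refine and re-prioritize
                counter += 1
                heapq.heappush(heap, (-score(child), counter, child))
    return None                              # all sub-problems verified safe
```

A greedy score gives the Oliva^GR behavior; interpolating the score with a temperature schedule, as in simulated annealing, corresponds to the Oliva^SA variant.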
[LG-10] Persistent Patterns in Eye Movements: A Topological Approach to Emotion Recognition
链接: https://arxiv.org/abs/2507.17450
作者: Arsha Niksa,Hooman Zare,Ali Shahrabi,Hanieh Hatami,Mohammadreza Razvan
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present a topological pipeline for automated multiclass emotion recognition from eye-tracking data. Delay embeddings of gaze trajectories are analyzed using persistent homology. From the resulting persistence diagrams, we extract shape-based features such as mean persistence, maximum persistence, and entropy. A random forest classifier trained on these features achieves up to 75.6% accuracy on four emotion classes, corresponding to the four quadrants of the Circumplex Model of Affect. The results demonstrate that persistence diagram geometry effectively encodes discriminative gaze dynamics, suggesting a promising topological approach for affective computing and human behavior analysis.
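The shape-based features named in the abstract can be read off a persistence diagram directly; a short sketch, assuming the diagram is given as finite (birth, death) pairs with positive lifetimes:

```python
import numpy as np

def diagram_features(diagram):
    """Shape features from a persistence diagram of (birth, death) pairs."""
    d = np.asarray(diagram, dtype=float)
    pers = d[:, 1] - d[:, 0]              # lifetimes (death - birth)
    p = pers / pers.sum()                 # lifetimes normalized to a distribution
    entropy = -np.sum(p * np.log(p))      # persistence entropy
    return {"mean_persistence": pers.mean(),
            "max_persistence": pers.max(),
            "persistence_entropy": entropy}
```

Feature vectors of this form are what the random forest classifier is trained on; the diagrams themselves would come from persistent homology of the delay-embedded gaze trajectories.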
[LG-11] A Comprehensive Evaluation on Quantization Techniques for Large Language Models
链接: https://arxiv.org/abs/2507.17417
作者: Yutong Liu,Cairong Zhao,Guosheng Hu
类目: Machine Learning (cs.LG)
*备注:
Abstract:For large language models (LLMs), post-training quantization (PTQ) can significantly reduce memory footprint and computational overhead. Model quantization is a rapidly evolving research field. Though many papers have reported breakthrough performance, they may not conduct experiments on the same ground since one quantization method usually contains multiple components. In addition, analyzing the theoretical connections among existing methods is crucial for in-depth understanding. To bridge these gaps, we conduct an extensive review of state-of-the-art methods and perform comprehensive evaluations on the same ground to ensure fair comparisons. To our knowledge, this fair and extensive investigation remains critically important yet underexplored. To better understand the theoretical connections, we decouple the published quantization methods into two steps: pre-quantization transformation and quantization error mitigation. We define the former as a preprocessing step applied before quantization to reduce the impact of outliers, making the data distribution flatter and more suitable for quantization. Quantization error mitigation involves techniques that offset the errors introduced during quantization, thereby enhancing model performance. We evaluate and analyze the impact of different components of quantization methods. Additionally, we analyze and evaluate the latest MXFP4 data format and its performance. Our experimental results demonstrate that optimized rotation and scaling yield the best performance for pre-quantization transformation, and combining low-rank compensation with GPTQ occasionally outperforms using GPTQ alone for quantization error mitigation. Furthermore, we explore the potential of the latest MXFP4 quantization and reveal that the optimal pre-quantization transformation strategy for INT4 does not generalize well to MXFP4, inspiring further investigation.
[LG-12] Confidence Calibration in Vision-Language-Action Models
链接: https://arxiv.org/abs/2507.17383
作者: Thomas P Zollo,Richard Zemel
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 34 pages, 19 figures
Abstract:Trustworthy robot behavior requires not only high levels of task success but also that the robot can reliably quantify how likely it is to succeed. To this end, we present the first systematic study of confidence calibration in vision-language-action (VLA) foundation models, which map visual observations and natural-language instructions to low-level robot motor commands. We begin with extensive benchmarking to understand the critical relationship between task success and calibration error across multiple datasets and VLA variants, finding that task performance and calibration are not in tension. Next, we introduce prompt ensembles for VLAs, a lightweight, Bayesian-inspired algorithm that averages confidence across paraphrased instructions and consistently improves calibration. We further analyze calibration over the task time horizon, showing that confidence is often most reliable after making some progress, suggesting natural points for risk-aware intervention. Finally, we reveal differential miscalibration across action dimensions and propose action-wise Platt scaling, a method to recalibrate each action dimension independently to produce better confidence estimates. Our aim in this study is to begin to develop the tools and conceptual understanding necessary to render VLAs both highly performant and highly trustworthy via reliable uncertainty quantification.
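The prompt-ensemble idea, averaging confidence over paraphrased instructions, can be sketched as follows; the `policy` and `paraphrase` interfaces are hypothetical stand-ins, not the VLA APIs used in the paper:

```python
import numpy as np

def prompt_ensemble_confidence(policy, observation, instruction, paraphrase):
    """Average a model's action confidence across paraphrased instructions.

    `policy(obs, instr)` returns an (action, confidence) pair and
    `paraphrase(instr)` returns alternative phrasings; both are
    hypothetical interfaces for illustration.
    """
    prompts = [instruction] + paraphrase(instruction)
    confs = [policy(observation, p)[1] for p in prompts]
    return float(np.mean(confs))   # ensemble confidence estimate
```

Averaging over semantically equivalent prompts smooths out phrasing-specific overconfidence, which is why the paper reports consistently improved calibration from this lightweight, Bayesian-inspired ensemble.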
[LG-13] Continual Generalized Category Discovery: Learning and Forgetting from a Bayesian Perspective
链接: https://arxiv.org/abs/2507.17382
作者: Hao Dai,Jagmohan Chauhan
类目: Machine Learning (cs.LG)
*备注: 20 pages, 6 figures. Forty-second International Conference on Machine Learning. 2025
Abstract:Continual Generalized Category Discovery (C-GCD) faces a critical challenge: incrementally learning new classes from unlabeled data streams while preserving knowledge of old classes. Existing methods struggle with catastrophic forgetting, especially when unlabeled data mixes known and novel categories. We address this by analyzing C-GCD’s forgetting dynamics through a Bayesian lens, revealing that covariance misalignment between old and new classes drives performance degradation. Building on this insight, we propose Variational Bayes C-GCD (VB-CGCD), a novel framework that integrates variational inference with covariance-aware nearest-class-mean classification. VB-CGCD adaptively aligns class distributions while suppressing pseudo-label noise via stochastic variational updates. Experiments show VB-CGCD surpasses prior art by +15.21% with the overall accuracy in the final session on standard benchmarks. We also introduce a new challenging benchmark with only 10% labeled data and extended online phases, VB-CGCD achieves a 67.86% final accuracy, significantly higher than state-of-the-art (38.55%), demonstrating its robust applicability across diverse scenarios. Code is available at: this https URL
[LG-14] ViRN: Variational Inference and Distribution Trilateration for Long-Tailed Continual Representation Learning
链接: https://arxiv.org/abs/2507.17368
作者: Hao Dai,Chong Tang,Jagmohan Chauhan
类目: Machine Learning (cs.LG)
*备注: 6 pages, 2 figures
Abstract:Continual learning (CL) with long-tailed data distributions remains a critical challenge for real-world AI systems, where models must sequentially adapt to new classes while retaining knowledge of old ones, despite severe class imbalance. Existing methods struggle to balance stability and plasticity, often collapsing under extreme sample scarcity. To address this, we propose ViRN, a novel CL framework that integrates variational inference (VI) with distributional trilateration for robust long-tailed learning. First, we model class-conditional distributions via a Variational Autoencoder to mitigate bias toward head classes. Second, we reconstruct tail-class distributions via Wasserstein distance-based neighborhood retrieval and geometric fusion, enabling sample-efficient alignment of tail-class representations. Evaluated on six long-tailed classification benchmarks, including speech (e.g., rare acoustic events, accents) and image tasks, ViRN achieves a 10.24% average accuracy gain over state-of-the-art methods.
[LG-15] TOC-UCO: a comprehensive repository of tabular ordinal classification datasets
链接: https://arxiv.org/abs/2507.17348
作者: Rafael Ayllón-Gavilán,David Guijo-Rubio,Antonio Manuel Gómez-Orellana,Francisco Bérchez-Moreno,Víctor Manuel Vargas-Yun,Pedro A. Gutiérrez
类目: Machine Learning (cs.LG)
*备注: 25 single column pages, 5 figures, 7 tables
Abstract:An ordinal classification (OC) problem corresponds to a special type of classification characterised by the presence of a natural order relationship among the classes. This type of problem can be found in a number of real-world applications, motivating the design and development of many ordinal methodologies over the last years. However, it is important to highlight that the development of the OC field suffers from one main disadvantage: the lack of a comprehensive set of datasets on which novel approaches proposed in the literature can be benchmarked. In order to approach this objective, this manuscript from the University of Córdoba (UCO), which has previous experience in the OC field, provides the literature with a publicly available repository of tabular data for a robust validation of novel OC approaches, namely TOC-UCO (Tabular Ordinal Classification repository of the UCO). Specifically, this repository includes a set of 46 tabular ordinal datasets, preprocessed under a common framework and ensured to have a reasonable number of patterns and an appropriate class distribution. We also provide the sources and preprocessing steps of each dataset, along with details on how to benchmark a novel approach using the TOC-UCO repository. For this, indices for 30 different randomised train-test partitions are provided to facilitate the reproducibility of the experiments.
[LG-16] DeCo-SGD: Joint Optimization of Delay Staleness and Gradient Compression Ratio for Distributed SGD
链接: https://arxiv.org/abs/2507.17346
作者: Rongwei Lu,Jingyan Jiang,Chunyang Li,Haotian Dong,Xingguang Wei,Delin Cai,Zhi Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Distributed machine learning in high end-to-end latency and low, varying bandwidth network environments undergoes severe throughput degradation. Due to its low communication requirements, distributed SGD (D-SGD) remains the mainstream optimizer in such challenging networks, but it still suffers from significant throughput reduction. To mitigate these limitations, existing approaches typically employ gradient compression and delayed aggregation to alleviate low bandwidth and high latency, respectively. To address both challenges simultaneously, these strategies are often combined, introducing a complex three-way trade-off among compression ratio, staleness (delayed synchronization steps), and model convergence rate. To achieve the balance under varying bandwidth conditions, an adaptive policy is required to dynamically adjust these parameters. Unfortunately, existing works rely on static heuristic strategies due to the lack of theoretical guidance, which prevents them from achieving this goal. This study fills in this theoretical gap by introducing a new theoretical tool, decomposing the joint optimization problem into a traditional convergence rate analysis with multiple analyzable noise terms. We are the first to reveal that staleness exponentially amplifies the negative impact of gradient compression on training performance, filling a critical gap in understanding how compressed and delayed gradients affect training. Furthermore, by integrating the convergence rate with a network-aware time minimization condition, we propose DeCo-SGD, which dynamically adjusts the compression ratio and staleness based on the real-time network condition and training task. DeCo-SGD achieves up to 5.07× and 1.37× speed-ups over D-SGD and the static strategy in high-latency and low, varying bandwidth networks, respectively.
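As background for the compression-ratio knob that DeCo-SGD adapts, one common gradient compressor is top-k sparsification; a minimal sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def topk_compress(grad, ratio):
    """Keep only the top-`ratio` fraction of gradient entries by magnitude."""
    k = max(1, int(ratio * grad.size))
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of the k largest |g_i|
    return idx, grad[idx]                          # sparse (index, value) payload

def decompress(idx, vals, size):
    """Rebuild a dense gradient with zeros at the dropped positions."""
    out = np.zeros(size)
    out[idx] = vals
    return out
```

The communication payload shrinks roughly in proportion to `ratio`; the paper's analysis concerns how the error introduced by such compression interacts with staleness from delayed aggregation, and how to pick both knobs from the observed network condition.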
[LG-17] A Learning-based Domain Decomposition Method
链接: https://arxiv.org/abs/2507.17328
作者: Rui Wu,Nikola Kovachki,Burigede Liu
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:
Abstract:Recent developments in mechanical, aerospace, and structural engineering have driven a growing need for efficient ways to model and analyse structures at much larger and more complex scales than before. While established numerical methods like the Finite Element Method remain reliable, they often struggle with computational cost and scalability when dealing with large and geometrically intricate problems. In recent years, neural network-based methods have shown promise because of their ability to efficiently approximate nonlinear mappings. However, most existing neural approaches are still largely limited to simple domains, which makes it difficult to apply to real-world PDEs involving complex geometries. In this paper, we propose a learning-based domain decomposition method (L-DDM) that addresses this gap. Our approach uses a single, pre-trained neural operator-originally trained on simple domains-as a surrogate model within a domain decomposition scheme, allowing us to tackle large and complicated domains efficiently. We provide a general theoretical result on the existence of neural operator approximations in the context of domain decomposition solution of abstract PDEs. We then demonstrate our method by accurately approximating solutions to elliptic PDEs with discontinuous microstructures in complex geometries, using a physics-pretrained neural operator (PPNO). Our results show that this approach not only outperforms current state-of-the-art methods on these challenging problems, but also offers resolution-invariance and strong generalization to microstructural patterns unseen during training.
[LG-18] R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
链接: https://arxiv.org/abs/2507.17307
作者: Zhuokun Chen,Zeren Chen,Jiahao He,Mingkui Tan,Jianfei Cai,Bohan Zhuang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Chain-of-thought (CoT) reasoning enhances the problem-solving capabilities of large language models by encouraging step-by-step intermediate reasoning during inference. While effective, CoT introduces substantial computational overhead due to its reliance on autoregressive decoding over long token sequences. Existing acceleration strategies either reduce sequence length through early stopping or compressive reward designs, or improve decoding speed via speculative decoding with smaller models. However, speculative decoding suffers from limited speedup when the agreement between small and large models is low, and fails to exploit the potential advantages of small models in producing concise intermediate reasoning. In this paper, we present R-Stitch, a token-level, confidence-based hybrid decoding framework that accelerates CoT inference by switching between a small language model (SLM) and a large language model (LLM) along the reasoning trajectory. R-Stitch uses the SLM to generate tokens by default and delegates to the LLM only when the SLM’s confidence falls below a threshold. This design avoids full-sequence rollback and selectively invokes the LLM on uncertain steps, preserving both efficiency and answer quality. R-Stitch is model-agnostic, training-free, and compatible with standard decoding pipelines. Experiments on math reasoning benchmarks demonstrate that R-Stitch achieves up to 85% reduction in inference latency with negligible accuracy drop, highlighting its practical effectiveness in accelerating CoT reasoning.
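The confidence-gated switching described in the abstract can be sketched as a simple decode loop; `slm` and `llm` below are hypothetical callables returning a (token, confidence) pair, not the paper's models:

```python
def r_stitch_decode(slm, llm, prompt, threshold=0.7, max_tokens=64):
    """Token-level hybrid decoding in the style of R-Stitch.

    The small model proposes each token; when its confidence drops
    below `threshold`, that single step is delegated to the large
    model instead of rolling back the whole sequence.
    """
    tokens = list(prompt)
    slm_steps = llm_steps = 0
    for _ in range(max_tokens):
        tok, conf = slm(tokens)          # small model proposes a token
        if conf < threshold:             # uncertain step: delegate to the LLM
            tok, _ = llm(tokens)
            llm_steps += 1
        else:
            slm_steps += 1
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens, slm_steps, llm_steps
```

Because only low-confidence steps invoke the large model, latency scales with the fraction of uncertain tokens rather than the full sequence length, which is the source of the reported speedup.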
[LG-19] VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
链接: https://arxiv.org/abs/2507.17294
作者: Jianxin Bi,Kevin Yuchen Ma,Ce Hao,Mike Zheng Shou,Harold Soh
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 19 pages, 5 figures
Abstract:Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets. We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision. Code is open-sourced at this https URL.
[LG-20] Data Virtualization for Machine Learning
链接: https://arxiv.org/abs/2507.17293
作者: Saiful Khan,Joyraj Chakraborty,Philip Beaucamp,Niraj Bhujel,Min Chen
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Nowadays, machine learning (ML) teams have multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities and commonly takes months and sometimes years from initial data wrangling to model deployment. Organizationally, there is a large amount of intermediate data to be stored, processed, and maintained. Data virtualization becomes a critical technology in an infrastructure to serve ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.
[LG-21] Decentralized Federated Learning of Probabilistic Generative Classifiers
链接: https://arxiv.org/abs/2507.17285
作者: Aritz Pérez,Carlos Echegoyen,Guzmán Santafé
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning is a paradigm of increasing relevance in real world applications, aimed at building a global model across a network of heterogeneous users without requiring the sharing of private data. We focus on model learning over decentralized architectures, where users collaborate directly to update the global model without relying on a central server. In this context, the current paper proposes a novel approach to collaboratively learn probabilistic generative classifiers with a parametric form. The framework is composed of a communication network over a set of local nodes, each of them having its own local data, and a local updating rule. The proposal involves sharing local statistics with neighboring nodes, where each node aggregates the neighbors' information and iteratively learns its own local classifier, which progressively converges to a global model. Extensive experiments demonstrate that the algorithm consistently converges to a globally competitive model across a wide range of network topologies, network sizes, local dataset sizes, and extreme non-i.i.d. data distributions.
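A minimal sketch of the neighbor-averaging idea on a toy scalar statistic (the paper aggregates sufficient statistics of generative classifiers; this illustration only shows that iterated neighbor averaging on a connected graph converges to a global value):

```python
import numpy as np

def gossip_average(local_stats, neighbors, rounds=50):
    """Decentralized averaging: each node repeatedly replaces its
    statistic with the average of its own value and its neighbors'.

    `neighbors[i]` lists the nodes adjacent to node i. On a connected
    graph, all nodes converge to a common consensus value.
    """
    x = np.array(local_stats, dtype=float)
    for _ in range(rounds):
        # Synchronous update: every node averages over its closed neighborhood.
        x = np.array([(x[i] + sum(x[j] for j in neighbors[i]))
                      / (1 + len(neighbors[i]))
                      for i in range(len(x))])
    return x
```

No node ever transmits raw data, only aggregate statistics, which is what makes this style of update compatible with the privacy goal of federated learning.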
[LG-22] Prolonging Tool Life: Learning Skillful Use of General-purpose Tools through Lifespan-guided Reinforcement Learning
链接: https://arxiv.org/abs/2507.17275
作者: Po-Yen Wu,Cheng-Yu Kuo,Yuki Kadokawa,Takamitsu Matsubara
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Under review
Abstract:In inaccessible environments with uncertain task demands, robots often rely on general-purpose tools that lack predefined usage strategies. These tools are not tailored for particular operations, making their longevity highly sensitive to how they are used. This creates a fundamental challenge: how can a robot learn a tool-use policy that both completes the task and prolongs the tool’s lifespan? In this work, we address this challenge by introducing a reinforcement learning (RL) framework that incorporates tool lifespan as a factor during policy optimization. Our framework leverages Finite Element Analysis (FEA) and Miner’s Rule to estimate Remaining Useful Life (RUL) based on accumulated stress, and integrates the RUL into the RL reward to guide policy learning toward lifespan-guided behavior. To handle the fact that RUL can only be estimated after task execution, we introduce an Adaptive Reward Normalization (ARN) mechanism that dynamically adjusts reward scaling based on estimated RULs, ensuring stable learning signals. We validate our method across simulated and real-world tool use tasks, including Object-Moving and Door-Opening with multiple general-purpose tools. The learned policies consistently prolong tool lifespan (up to 8.01x in simulation) and transfer effectively to real-world settings, demonstrating the practical value of learning lifespan-guided tool use strategies.
[LG-23] Rethinking VAE: From Continuous to Discrete Representations Without Probabilistic Assumptions
链接: https://arxiv.org/abs/2507.17255
作者: Songxuan Shi
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper explores the generative capabilities of Autoencoders (AEs) and establishes connections between Variational Autoencoders (VAEs) and Vector Quantized-Variational Autoencoders (VQ-VAEs) through a reformulated training framework. We demonstrate that AEs exhibit generative potential via latent space interpolation and perturbation, albeit limited by undefined regions in the encoding space. To address this, we propose a new VAE-like training method that introduces clustering centers to enhance data compactness and ensure well-defined latent spaces without relying on traditional KL divergence or reparameterization techniques. Experimental results on MNIST, CelebA, and FashionMNIST datasets show smooth interpolative transitions, though blurriness persists. Extending this approach to multiple learnable vectors, we observe a natural progression toward a VQ-VAE-like model in continuous space. However, when the encoder outputs multiple vectors, the model degenerates into a discrete Autoencoder (VQ-AE), which combines image fragments without learning semantic representations. Our findings highlight the critical role of encoding space compactness and dispersion in generative modeling and provide insights into the intrinsic connections between VAEs and VQ-VAEs, offering a new perspective on their design and limitations.
[LG-24] HypoChainer: A Collaborative System Combining LLM s and Knowledge Graphs for Hypothesis-Driven Scientific Discovery
链接: https://arxiv.org/abs/2507.17209
作者: Haoran Jiang,Shaohan Shi,Yunjie Yao,Chang Jiang,Quan Li
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Modern scientific discovery faces growing challenges in integrating vast and heterogeneous knowledge critical to breakthroughs in biomedicine and drug development. Traditional hypothesis-driven research, though effective, is constrained by human cognitive limits, the complexity of biological systems, and the high cost of trial-and-error experimentation. Deep learning models, especially graph neural networks (GNNs), have accelerated prediction generation, but the sheer volume of outputs makes manual selection for validation unscalable. Large language models (LLMs) offer promise in filtering and hypothesis generation, yet suffer from hallucinations and lack grounding in structured knowledge, limiting their reliability. To address these issues, we propose HypoChainer, a collaborative visualization framework that integrates human expertise, LLM-driven reasoning, and knowledge graphs (KGs) to enhance hypothesis generation and validation. HypoChainer operates in three stages: First, exploration and contextualization – experts use retrieval-augmented LLMs (RAGs) and dimensionality reduction to navigate large-scale GNN predictions, assisted by interactive explanations. Second, hypothesis chain formation – experts iteratively examine KG relationships around predictions and semantically linked entities, refining hypotheses with LLM and KG suggestions. Third, validation prioritization – refined hypotheses are filtered based on KG-supported evidence to identify high-priority candidates for experimentation, with visual analytics further strengthening weak links in reasoning. We demonstrate HypoChainer’s effectiveness through case studies in two domains and expert interviews, highlighting its potential to support interpretable, scalable, and knowledge-grounded scientific discovery.
[LG-25] Filter-And-Refine: A MLLM Based Cascade System for Industrial-Scale Video Content Moderation ACL2025
链接: https://arxiv.org/abs/2507.17204
作者: Zixuan Wang,Jinghao Shi,Hanzhong Liang,Xiang Shen,Vera Wen,Zhiqian Chen,Yifan Wu,Zhixin Zhang,Hongyu Xiong
类目: Machine Learning (cs.LG)
*备注: Camera Ready for ACL 2025
Abstract:Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with complicated scenarios such as implicit harmful content and contextual ambiguity. Multimodal large language models (MLLMs) offer a promising solution to these limitations with their superior cross-modal reasoning and contextual understanding. However, two key challenges hinder their industrial adoption. First, the high computational cost of MLLMs makes full-scale deployment impractical. Second, adapting generative models for discriminative classification remains an open research problem. In this paper, we first introduce an efficient method to transform a generative MLLM into a multimodal classifier using minimal discriminative training data. To enable industry-scale deployment, we then propose a router-ranking cascade system that integrates MLLMs with a lightweight router model. Offline experiments demonstrate that our MLLM-based approach improves F1 score by 66.50% over traditional classifiers while requiring only 2% of the fine-tuning data. Online evaluations show that our system increases automatic content moderation volume by 41%, while the cascading deployment reduces computational cost to only 1.5% of direct full-scale deployment.
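The router-ranking cascade can be sketched in a few lines: a cheap router scores every video, and only videos above a routing threshold reach the expensive MLLM stage. The scoring functions below are illustrative placeholders, not the paper's trained router or MLLM:

```python
def router_score(video):
    # Hypothetical lightweight router: a cheap heuristic risk score in [0, 1].
    # In the paper this is a small learned model; here it is a stand-in.
    return video["flag_ratio"]

def mllm_classify(video):
    # Stand-in for the expensive MLLM classifier (an assumption, not the
    # paper's model): returns True if the content should be moderated.
    return video["flag_ratio"] > 0.5

def cascade_moderate(videos, route_threshold=0.3):
    """Only videos the router flags as risky reach the costly MLLM stage."""
    decisions, mllm_calls = {}, 0
    for v in videos:
        if router_score(v) < route_threshold:
            decisions[v["id"]] = False      # cheap early exit: benign
        else:
            mllm_calls += 1
            decisions[v["id"]] = mllm_classify(v)
    return decisions, mllm_calls

videos = [{"id": i, "flag_ratio": r}
          for i, r in enumerate([0.05, 0.1, 0.4, 0.9, 0.2])]
decisions, calls = cascade_moderate(videos)
```

Here only 2 of 5 videos incur an MLLM call, which is the mechanism behind the reported 1.5% compute cost relative to full-scale deployment.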
[LG-26] Met2Net: A Decoupled Two-Stage Spatio-Temporal Forecasting Model for Complex Meteorological Systems
链接: https://arxiv.org/abs/2507.17189
作者: Shaohan Li,Hao Yang,Min Chen,Xiaolin Qin
类目: Machine Learning (cs.LG)
*备注:
Abstract:The increasing frequency of extreme weather events due to global climate change urges accurate weather prediction. Recently, great advances have been made by end-to-end methods, thanks to deep learning techniques, but they face limitations of representation inconsistency in multivariable integration and struggle to effectively capture the dependency between variables, which is required in complex weather systems. Treating different variables as distinct modalities and applying a two-stage training approach from multimodal models can partially alleviate this issue, but due to the inconformity in training tasks between the two stages, the results are often suboptimal. To address these challenges, we propose an implicit two-stage training method, configuring separate encoders and decoders for each variable. In detail, in the first stage, the Translator is frozen while the Encoders and Decoders learn a shared latent space; in the second stage, the Encoders and Decoders are frozen, and the Translator captures inter-variable interactions for prediction. Besides, by introducing a self-attention mechanism for multivariable fusion in the latent space, the performance achieves further improvements. Empirically, extensive experiments show the state-of-the-art performance of our method. Specifically, it reduces the MSE for near-surface air temperature and relative humidity predictions by 28.82% and 23.39%, respectively. The source code is available at this https URL.
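The alternating freeze schedule of the implicit two-stage training can be illustrated with a toy sketch; the module names and scalar "weights" are stand-ins for illustration, not the paper's architecture:

```python
class Module:
    """A stand-in for a trainable component with a single scalar weight."""
    def __init__(self, name):
        self.name, self.weight, self.frozen = name, 0.0, False

def train_step(modules, grad=1.0, lr=0.1):
    # Gradient-descent stand-in: only non-frozen modules are updated.
    for m in modules:
        if not m.frozen:
            m.weight -= lr * grad

encoder, decoder, translator = Module("enc"), Module("dec"), Module("trans")
modules = [encoder, decoder, translator]

# Stage 1: freeze the Translator; Encoders and Decoders learn the shared
# latent space.
translator.frozen = True
train_step(modules)

# Stage 2: freeze Encoders and Decoders; the Translator learns
# inter-variable interactions for prediction.
translator.frozen = False
encoder.frozen = decoder.frozen = True
train_step(modules)
```

In a real implementation the same effect is typically achieved by toggling `requires_grad` on each component's parameters between the two stages.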
[LG-27] GhostUMAP2: Measuring and Analyzing (r,d)-Stability of UMAP
链接: https://arxiv.org/abs/2507.17174
作者: Myeongwon Jung,Takanori Fujiwara,Jaemin Jo
类目: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Despite the widespread use of Uniform Manifold Approximation and Projection (UMAP), the impact of its stochastic optimization process on the results remains underexplored. We observed that it often produces unstable results where the projections of data points are determined mostly by chance rather than reflecting neighboring structures. To address this limitation, we introduce (r,d)-stability to UMAP: a framework that analyzes the stochastic positioning of data points in the projection space. To assess how stochastic elements, specifically initial projection positions and negative sampling, impact UMAP results, we introduce “ghosts”, or duplicates of data points representing potential positional variations due to stochasticity. We define a data point’s projection as (r,d)-stable if its ghosts perturbed within a circle of radius r in the initial projection remain confined within a circle of radius d for their final positions. To efficiently compute the ghost projections, we develop an adaptive dropping scheme that reduces runtime by up to 60% compared to an unoptimized baseline while retaining approximately 90% of unstable points. We also present a visualization tool that supports the interactive exploration of the (r,d)-stability of data points. Finally, we demonstrate the effectiveness of our framework by examining the stability of projections of real-world datasets and present usage guidelines for the effective use of our framework.
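The (r,d)-stability criterion itself is simple to state in code. A minimal sketch, given the original point's initial and final projections plus its ghost trajectories (GhostUMAP2's implementation additionally tracks ghosts through the full UMAP optimization):

```python
import math

def is_rd_stable(init_point, final_point, init_ghosts, final_ghosts, r, d):
    """A projection is (r,d)-stable if every ghost that starts within
    radius r of the initial projection ends within radius d of the
    final projection."""
    for g0, g1 in zip(init_ghosts, final_ghosts):
        if math.dist(g0, init_point) <= r and math.dist(g1, final_point) > d:
            return False
    return True

# Ghosts that stay close to the point's final position: stable.
stable = is_rd_stable((0, 0), (5, 5),
                      [(0.1, 0), (0, 0.2)], [(5.1, 5), (5, 5.2)],
                      r=0.5, d=0.5)
# One ghost drifts far away from the final position: unstable.
unstable = is_rd_stable((0, 0), (5, 5),
                        [(0.1, 0), (0, 0.2)], [(5.1, 5), (9, 9)],
                        r=0.5, d=0.5)
```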
[LG-28] PICore: Physics-Informed Unsupervised Coreset Selection for Data Efficient Neural Operator Training
链接: https://arxiv.org/abs/2507.17151
作者: Anirudh Satheesh,Anant Khandelwal,Mucong Ding,Radu Balan
类目: Machine Learning (cs.LG)
*备注: Submitted to TMLR 2025
Abstract:Neural operators offer a powerful paradigm for solving partial differential equations (PDEs) that cannot be solved analytically by learning mappings between function spaces. However, there are two main bottlenecks in training neural operators: they require a significant amount of training data to learn these mappings, and this data needs to be labeled, which can only be accessed via expensive simulations with numerical solvers. To alleviate both of these issues simultaneously, we propose PICore, an unsupervised coreset selection framework that identifies the most informative training samples without requiring access to ground-truth PDE solutions. PICore leverages a physics-informed loss to select unlabeled inputs by their potential contribution to operator learning. After selecting a compact subset of inputs, only those samples are simulated using numerical solvers to generate labels, reducing annotation costs. We then train the neural operator on the reduced labeled dataset, significantly decreasing training time as well. Across four diverse PDE benchmarks and multiple coreset selection strategies, PICore achieves up to 78% average increase in training efficiency relative to supervised coreset selection methods with minimal changes in accuracy. We provide code at this https URL.
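The selection principle (rank unlabeled candidates by a physics-informed residual, then send only the top-k to an expensive solver for labeling) can be sketched with a toy ODE. The residual function below is an illustrative assumption, not one of the paper's PDE benchmarks:

```python
def physics_residual(a):
    # Toy physics-informed loss (illustrative assumption): for a candidate
    # solution u(x) = a * x to the ODE u'(x) = 2x, evaluated at x = 1,
    # the residual magnitude is |a - 2|. No ground-truth label is needed.
    return abs(a - 2.0)

def picore_select(candidates, k):
    """Keep the k unlabeled inputs with the largest physics-informed
    residual, treating them as the most informative for operator training;
    only these would then be labeled with a numerical solver."""
    return sorted(candidates, key=physics_residual, reverse=True)[:k]

coreset = picore_select([2.1, 0.0, 5.0, 1.9, -3.0], k=2)
```

Whether "largest residual" is the right informativeness proxy is a design choice of this sketch; the paper evaluates multiple coreset selection strategies.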
[LG-29] Model Compression Engine for Wearable Devices Skin Cancer Diagnosis
链接: https://arxiv.org/abs/2507.17125
作者: Jacob M. Delgado-López,Andrea P. Seda-Hernandez,Juan D. Guadalupe-Rosado,Luis E. Fernandez Ramirez,Miguel Giboyeaux-Camilo,Wilfredo E. Lugo-Beauchamp
类目: Machine Learning (cs.LG)
*备注:
Abstract:Skin cancer is one of the most prevalent and preventable types of cancer, yet its early detection remains a challenge, particularly in resource-limited settings where access to specialized healthcare is scarce. This study proposes an AI-driven diagnostic tool optimized for embedded systems to address this gap. Using transfer learning with the MobileNetV2 architecture, the model was adapted for binary classification of skin lesions into “Skin Cancer” and “Other.” The TensorRT framework was employed to compress and optimize the model for deployment on the NVIDIA Jetson Orin Nano, balancing performance with energy efficiency. Comprehensive evaluations were conducted across multiple benchmarks, including model size, inference speed, throughput, and power consumption. The optimized models maintained their performance, achieving an F1-Score of 87.18% with a precision of 93.18% and recall of 81.91%. Post-compression results showed reductions in model size of up to 0.41, along with improvements in inference speed and throughput, and a decrease in energy consumption of up to 0.93 in INT8 precision. These findings validate the feasibility of deploying high-performing, energy-efficient diagnostic tools on resource-constrained edge devices. Beyond skin cancer detection, the methodologies applied in this research have broader applications in other medical diagnostics and domains requiring accessible, efficient AI solutions. This study underscores the potential of optimized AI systems to revolutionize healthcare diagnostics, thereby bridging the divide between advanced technology and underserved regions.
[LG-30] Computer Vision for Real-Time Monkeypox Diagnosis on Embedded Systems
链接: https://arxiv.org/abs/2507.17123
作者: Jacob M. Delgado-López,Ricardo A. Morell-Rodriguez,Sebastián O. Espinosa-Del Rosario,Wilfredo E. Lugo-Beauchamp
类目: Machine Learning (cs.LG)
*备注:
Abstract:The rapid diagnosis of infectious diseases, such as monkeypox, is crucial for effective containment and treatment, particularly in resource-constrained environments. This study presents an AI-driven diagnostic tool developed for deployment on the NVIDIA Jetson Orin Nano, leveraging the pre-trained MobileNetV2 architecture for binary classification. The model was trained on the open-source Monkeypox Skin Lesion Dataset, achieving a 93.07% F1-Score, which reflects a well-balanced performance in precision and recall. To optimize the model, the TensorRT framework was used to accelerate inference for FP32 and to perform post-training quantization for FP16 and INT8 formats. TensorRT’s mixed-precision capabilities enabled these optimizations, which reduced the model size, increased inference speed, and lowered power consumption by approximately a factor of two, all while maintaining the original accuracy. Power consumption analysis confirmed that the optimized models used significantly less energy during inference, reinforcing their suitability for deployment in resource-constrained environments. The system was deployed with a Wi-Fi Access Point (AP) hotspot and a web-based interface, enabling users to upload and analyze images directly through connected devices such as mobile phones. This setup ensures simple access and seamless connectivity, making the tool practical for real-world applications. These advancements position the diagnostic tool as an efficient, scalable, and energy-conscious solution to address diagnosis challenges in underserved regions, paving the way for broader adoption in low-resource healthcare settings.
[LG-31] Probabilistic Graphical Models: A Concise Tutorial
链接: https://arxiv.org/abs/2507.17116
作者: Jacqueline Maasch,Willie Neiswanger,Stefano Ermon,Volodymyr Kuleshov
类目: Machine Learning (cs.LG)
*备注: Under review
Abstract:Probabilistic graphical modeling is a branch of machine learning that uses probability distributions to describe the world, make predictions, and support decision-making under uncertainty. Underlying this modeling framework is an elegant body of theory that bridges two mathematical traditions: probability and graph theory. This framework provides compact yet expressive representations of joint probability distributions, yielding powerful generative models for probabilistic reasoning. This tutorial provides a concise introduction to the formalisms, methods, and applications of this modeling framework. After a review of basic probability and graph theory, we explore three dominant themes: (1) the representation of multivariate distributions in the intuitive visual language of graphs, (2) algorithms for learning model parameters and graphical structures from data, and (3) algorithms for inference, both exact and approximate.
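The first theme, compact representation of a joint distribution via the chain-rule factorization of a directed graph, can be made concrete with the classic rain/sprinkler network (the probability values below are illustrative):

```python
# Tiny Bayesian network: Rain -> Sprinkler, and Rain, Sprinkler -> WetGrass.
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: {True: 0.01, False: 0.99},   # P(S | R)
               False: {True: 0.4, False: 0.6}}
p_wet = {(True, True): 0.99, (True, False): 0.8,  # P(W=True | R, S)
         (False, True): 0.9, (False, False): 0.0}

def joint(r, s, w):
    # Chain-rule factorization exploited by graphical models:
    # P(R, S, W) = P(R) * P(S | R) * P(W | R, S).
    pw = p_wet[(r, s)] if w else 1.0 - p_wet[(r, s)]
    return p_rain[r] * p_sprinkler[r][s] * pw

# Exact inference by enumeration: marginal P(W=True), summing out R and S.
p_wet_true = sum(joint(r, s, True) for r in (True, False)
                 for s in (True, False))
```

The factorization stores 1 + 2 + 4 conditional entries instead of the 2^3 entries of an explicit joint table, which is exactly the compactness the tutorial's first theme describes.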
[LG-32] ZORMS-LfD: Learning from Demonstrations with Zeroth-Order Random Matrix Search
链接: https://arxiv.org/abs/2507.17096
作者: Olivia Dry,Timothy L. Molloy,Wanxin Jin,Iman Shames
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注:
Abstract:We propose Zeroth-Order Random Matrix Search for Learning from Demonstrations (ZORMS-LfD). ZORMS-LfD enables the costs, constraints, and dynamics of constrained optimal control problems, in both continuous and discrete time, to be learned from expert demonstrations without requiring smoothness of the learning-loss landscape. In contrast, existing state-of-the-art first-order methods require the existence and computation of gradients of the costs, constraints, dynamics, and learning loss with respect to states, controls and/or parameters. Most existing methods are also tailored to discrete time, with constrained problems in continuous time receiving only cursory attention. We demonstrate that ZORMS-LfD matches or surpasses the performance of state-of-the-art methods in terms of both learning loss and compute time across a variety of benchmark problems. On unconstrained continuous-time benchmark problems, ZORMS-LfD achieves similar loss performance to state-of-the-art first-order methods with an over 80% reduction in compute time. On constrained continuous-time benchmark problems where there is no specialized state-of-the-art method, ZORMS-LfD is shown to outperform the commonly used gradient-free Nelder-Mead optimization method.
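The gradient-free flavor of the approach can be illustrated with a minimal zeroth-order search over a small parameter matrix. The accept-if-better rule below is a simplification, not ZORMS-LfD's actual update, but it shares the key property: no gradients of the loss are ever computed:

```python
import random

def loss(theta):
    # Toy learning loss over a flattened 2x2 parameter matrix: squared
    # distance to a hidden "expert" parameterization (illustrative only).
    target = [1.0, -2.0, 0.5, 3.0]
    return sum((t - x) ** 2 for t, x in zip(target, theta))

def zorms_step(theta, rng, step=0.5):
    # One zeroth-order step: perturb the matrix with a random Gaussian
    # direction and keep the probe only if the loss improves. Only loss
    # *evaluations* are required, so a non-smooth loss poses no problem.
    probe = [x + step * rng.gauss(0, 1) for x in theta]
    return probe if loss(probe) < loss(theta) else theta

rng = random.Random(0)
theta = [0.0] * 4
for _ in range(500):
    theta = zorms_step(theta, rng)
```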
[LG-33] Deformable Cluster Manipulation via Whole-Arm Policy Learning
链接: https://arxiv.org/abs/2507.17085
作者: Jayadeep Jacob,Wenzheng Zhang,Houston Warren,Paulo Borges,Tirthankar Bandyopadhyay,Fabio Ramos
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Manipulating clusters of deformable objects presents a substantial challenge with widespread applicability, but requires contact-rich whole-arm interactions. A potential solution must address the limited capacity for realistic model synthesis, high uncertainty in perception, and the lack of efficient spatial abstractions, among others. We propose a novel framework for learning model-free policies integrating two modalities: 3D point clouds and proprioceptive touch indicators, emphasising manipulation with full body contact awareness, going beyond traditional end-effector modes. Our reinforcement learning framework leverages a distributional state representation, aided by kernel mean embeddings, to achieve improved training efficiency and real-time inference. Furthermore, we propose a novel context-agnostic occlusion heuristic to clear deformables from a target region for exposure tasks. We deploy the framework in a power line clearance scenario and observe that the agent generates creative strategies leveraging multiple arm links for de-occlusion. Finally, we perform zero-shot sim-to-real policy transfer, allowing the arm to clear real branches with unknown occlusion patterns, unseen topology, and uncertain dynamics.
[LG-34] Sensor Drift Compensation in Electronic-Nose-Based Gas Recognition Using Knowledge Distillation
链接: https://arxiv.org/abs/2507.17071
作者: Juntao Lin,Xianghao Zhan
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY); Instrumentation and Detectors (physics.ins-det)
*备注: 9 pages
Abstract:Due to environmental changes and sensor aging, sensor drift challenges the performance of electronic nose systems in gas classification during real-world deployment. Previous studies using the UCI Gas Sensor Array Drift Dataset reported promising drift compensation results but lacked robust statistical experimental validation and may overcompensate for sensor drift, losing class-related information. To address these limitations and improve sensor drift compensation with statistical rigor, we first designed two domain adaptation tasks based on the same electronic nose dataset: using the first batch to predict the remaining batches, simulating a controlled laboratory setting; and predicting the next batch using all prior batches, simulating continuous training data updates for online training. We then systematically tested three methods: our proposed novel Knowledge Distillation (KD) method, the benchmark method Domain Regularized Component Analysis (DRCA), and a hybrid method KD-DRCA, across 30 random test set partitions on the UCI dataset. We showed that KD consistently outperformed both DRCA and KD-DRCA, achieving up to an 18% improvement in accuracy and 15% in F1-score, demonstrating KD’s superior effectiveness in drift compensation. This is the first application of KD for electronic nose drift mitigation, significantly outperforming the previous state-of-the-art DRCA method and enhancing the reliability of sensor drift compensation in real-world environments.
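A generic Hinton-style knowledge distillation objective, of the kind KD methods build on, can be sketched as follows; the paper's drift-compensation variant may differ in how teacher and student are paired across drifting batches:

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's
    temperature-softened distribution (generic KD sketch)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))
    hard = -math.log(softmax(student_logits)[true_label])
    # T^2 rescaling keeps the soft-target gradient magnitude comparable.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * kl

# Student that agrees with the teacher and the label is barely penalized;
# a contradicting student pays both the hard and the soft term.
loss_matched = distillation_loss([5.0, 0.0, 0.0], [5.0, 0.0, 0.0], 0)
loss_mismatched = distillation_loss([0.0, 5.0, 0.0], [5.0, 0.0, 0.0], 0)
```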
[LG-35] Risk In Context: Benchmarking Privacy Leakage of Foundation Models in Synthetic Tabular Data Generation KDD2025
链接: https://arxiv.org/abs/2507.17066
作者: Jessup Byun,Xiaofeng Lin,Joshua Ward,Guang Cheng
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by Agentic GenAI Evaluation KDD2025, poster presentation
Abstract:Synthetic tabular data is essential for machine learning workflows, especially for expanding small or imbalanced datasets and enabling privacy-preserving data sharing. However, state-of-the-art generative models (GANs, VAEs, diffusion models) rely on large datasets with thousands of examples. In low-data settings, often the primary motivation for synthetic data, these models can overfit, leak sensitive records, and require frequent retraining. Recent work uses large pre-trained transformers to generate rows via in-context learning (ICL), which needs only a few seed examples and no parameter updates, avoiding retraining. But ICL repeats seed rows verbatim, introducing a new privacy risk that has only been studied in text. The severity of this risk in tabular synthesis-where a single row may identify a person-remains unclear. We address this gap with the first benchmark of three foundation models (GPT-4o-mini, LLaMA 3.3 70B, TabPFN v2) against four baselines on 35 real-world tables from health, finance, and policy. We evaluate statistical fidelity, downstream utility, and membership inference leakage. Results show foundation models consistently have the highest privacy risk. LLaMA 3.3 70B reaches up to 54 percentage points higher true-positive rate at 1% FPR than the safest baseline. GPT-4o-mini and TabPFN are also highly vulnerable. We plot the privacy-utility frontier and show that CTGAN and GPT-4o-mini offer better tradeoffs. A factorial study finds that three zero-cost prompt tweaks (small batch size, low temperature, and using summary statistics) can reduce worst-case AUC by 14 points and rare-class leakage by up to 39 points while maintaining over 90% fidelity. Our benchmark offers a practical guide for safer low-data synthesis with foundation models.
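The leakage metric quoted above, true-positive rate at a fixed low false-positive rate, can be computed directly from raw membership-attack scores. A minimal sketch:

```python
def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.01):
    """True-positive rate at the threshold that keeps the false-positive
    rate on non-members at or below max_fpr (e.g. TPR @ 1% FPR)."""
    best_tpr = 0.0
    for th in sorted(set(member_scores + nonmember_scores)):
        fpr = sum(s >= th for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= th for s in member_scores) / len(member_scores)
            best_tpr = max(best_tpr, tpr)
    return best_tpr

# Toy attack scores: higher means "the attacker believes this row was
# a seed/training row" (illustrative numbers, not the paper's data).
members = [0.9] * 60 + [0.05] * 40
nonmembers = [0.1] * 99 + [0.95]
leakage = tpr_at_fpr(members, nonmembers)
```

Here the attacker correctly flags 60% of true members while mislabeling only 1% of non-members, the operating point the abstract's "TPR at 1% FPR" figures refer to.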
[LG-36] Shared Control of Holonomic Wheelchairs through Reinforcement Learning
链接: https://arxiv.org/abs/2507.17055
作者: Jannis Bähler,Diego Paez-Granados,Jorge Peña-Queralta
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Smart electric wheelchairs can improve user experience by supporting the driver with shared control. State-of-the-art work showed the potential of shared control in improving safety in navigation for non-holonomic robots. However, for holonomic systems, current approaches often lead to unintuitive behavior for the user and fail to utilize the full potential of omnidirectional driving. Therefore, we propose a reinforcement learning-based method, which takes a 2D user input and outputs a 3D motion while ensuring user comfort and reducing cognitive load on the driver. Our approach is trained in Isaac Gym and tested in simulation in Gazebo. We compare different RL agent architectures and reward functions based on metrics considering cognitive load and user comfort. We show that our method ensures collision-free navigation while smartly orienting the wheelchair and showing better or competitive smoothness compared to a previous non-learning-based method. We further perform a sim-to-real transfer and demonstrate, to the best of our knowledge, the first real-world implementation of RL-based shared control for an omnidirectional mobility platform.
[LG-37] BiLO: Bilevel Local Operator Learning for PDE Inverse Problems. Part II: Efficient Uncertainty Quantification with Low-Rank Adaptation
链接: https://arxiv.org/abs/2507.17019
作者: Ray Zirui Zhang,Christopher E. Miles,Xiaohui Xie,John S. Lowengrub
类目: Machine Learning (cs.LG)
*备注:
Abstract:Uncertainty quantification and inverse problems governed by partial differential equations (PDEs) are central to a wide range of scientific and engineering applications. In this second part of a two part series, we extend Bilevel Local Operator Learning (BiLO) for PDE-constrained optimization problems developed in Part 1 to the Bayesian inference framework. At the lower level, we train a network to approximate the local solution operator by minimizing the local operator loss with respect to the weights of the neural network. At the upper level, we sample the PDE parameters from the posterior distribution. We achieve efficient sampling through gradient-based Markov Chain Monte Carlo (MCMC) methods and low-rank adaptation (LoRA). Compared with existing methods based on Bayesian neural networks, our approach bypasses the challenge of sampling in the high-dimensional space of neural network weights and does not require specifying a prior distribution on the neural network solution. Instead, uncertainty propagates naturally from the data through the PDE constraints. By enforcing strong PDE constraints, the proposed method improves the accuracy of both parameter inference and uncertainty quantification. We analyze the dynamic error of the gradient in the MCMC sampler and the static error in the posterior distribution due to inexact minimization of the lower level problem and demonstrate a direct link between the tolerance for solving the lower level problem and the accuracy of the resulting uncertainty quantification. Through numerical experiments across a variety of PDE models, we demonstrate that our method delivers accurate inference and quantification of uncertainties while maintaining high computational efficiency.
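The upper-level posterior sampling can be illustrated with a plain random-walk Metropolis sampler over a single toy parameter. BiLO itself uses gradient-based MCMC accelerated with LoRA, so this is only a conceptual sketch of the sampling loop:

```python
import math
import random

def log_posterior(theta):
    # Toy log-posterior over one PDE parameter (an illustrative stand-in
    # for the upper-level problem): Gaussian centered at a "true" value 3.
    return -0.5 * (theta - 3.0) ** 2

def metropolis(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose a Gaussian perturbation and accept
    with probability min(1, posterior ratio)."""
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for _ in range(n_samples):
        prop = theta + rng.gauss(0, step)
        if math.log(rng.random()) < log_posterior(prop) - log_posterior(theta):
            theta = prop
        samples.append(theta)
    return samples

samples = metropolis(2000)
# Discard burn-in before summarizing the posterior.
posterior_mean = sum(samples[500:]) / len(samples[500:])
```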
[LG-38] Should Bias Always be Eliminated? A Principled Framework to Use Data Bias for OOD Generation
链接: https://arxiv.org/abs/2507.17001
作者: Yan Li,Guangyi Chen,Yunlong Deng,Zijian Li,Zeyu Tang,Anpeng Wu,Kun Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Most existing methods for adapting models to out-of-distribution (OOD) domains rely on invariant representation learning to eliminate the influence of biased features. However, should bias always be eliminated – and if not, when should it be retained, and how can it be leveraged? To address these questions, we first present a theoretical analysis that explores the conditions under which biased features can be identified and effectively utilized. Building on this theoretical foundation, we introduce a novel framework that strategically leverages bias to complement invariant representations during inference. The framework comprises two key components that leverage bias in both direct and indirect ways: (1) using invariance as guidance to extract predictive ingredients from bias, and (2) exploiting identified bias to estimate the environmental condition and then use it to explore appropriate bias-aware predictors to alleviate environment gaps. We validate our approach through experiments on both synthetic datasets and standard domain generalization benchmarks. Results consistently demonstrate that our method outperforms existing approaches, underscoring its robustness and adaptability.
[LG-39] Hierarchical Reinforcement Learning Framework for Adaptive Walking Control Using General Value Functions of Lower-Limb Sensor Signals
链接: https://arxiv.org/abs/2507.16983
作者: Sonny T. Jones,Grange M. Simpson,Patrick M. Pilarski,Ashley N. Dalrymple
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 5 pages, 3 figures, accepted at the 6th Multi-disciplinary Conference on Reinforcement Learning and Decision Making (RLDM2025), June 11-14, 2025
Abstract:Rehabilitation technology is a natural setting to study the shared learning and decision-making of human and machine agents. In this work, we explore the use of Hierarchical Reinforcement Learning (HRL) to develop adaptive control strategies for lower-limb exoskeletons, aiming to enhance mobility and autonomy for individuals with motor impairments. Inspired by prominent models of biological sensorimotor processing, our investigated HRL approach breaks down the complex task of exoskeleton control adaptation into a higher-level framework for terrain strategy adaptation and a lower-level framework for providing predictive information; this latter element is implemented via the continual learning of general value functions (GVFs). GVFs generated temporal abstractions of future signal values from multiple wearable lower-limb sensors, including electromyography, pressure insoles, and goniometers. We investigated two methods for incorporating actual and predicted sensor signals into a policy network with the intent to improve the decision-making capacity of the control system of a lower-limb exoskeleton during ambulation across varied terrains. As a key result, we found that the addition of predictions made from GVFs increased overall network accuracy. Terrain-specific performance increases were seen while walking on even ground, uneven ground, up and down ramps, and turns, terrains that are often misclassified without predictive information. This suggests that predictive information can aid decision-making during uncertainty, e.g., on terrains that have a high chance of being misclassified. This work, therefore, contributes new insights into the nuances of HRL and the future development of exoskeletons to facilitate safe transitioning and traversing across different walking environments.
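The temporal abstractions that GVFs provide can be illustrated with tabular TD(0) learning of a discounted future-signal prediction, a minimal stand-in for the paper's continually learned GVFs over wearable-sensor streams:

```python
def gvf_td0(signal, gamma=0.9, alpha=0.1, n_sweeps=200):
    """Learn v[t] ~ signal[t+1] + gamma * v[t+1], i.e. a general value
    function predicting the discounted future sum of a sensor signal,
    with tabular TD(0) over repeated sweeps of the trace."""
    v = [0.0] * len(signal)
    for _ in range(n_sweeps):
        for t in range(len(signal) - 1):
            td_target = signal[t + 1] + gamma * v[t + 1]
            v[t] += alpha * (td_target - v[t])
    return v

# Toy sensor trace: a pressure-insole spike arrives at the final step.
v = gvf_td0([0.0, 0.0, 0.0, 1.0])
```

The learned values rise as the spike approaches (roughly 0.81, 0.9, 1.0), which is the kind of anticipatory signal the policy network consumes alongside the raw sensors.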
[LG-40] Navigation through Non-Compact Symmetric Spaces: a mathematical perspective on Cartan Neural Networks
链接: https://arxiv.org/abs/2507.16871
作者: Pietro Giuseppe Fré,Federico Milanesio,Guido Sanguinetti,Matteo Santoro
类目: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 59 pages, 2 figures
Abstract:Recent work has identified non-compact symmetric spaces U/H as a promising class of homogeneous manifolds to develop a geometrically consistent theory of neural networks. An initial implementation of these concepts has been presented in a twin paper under the moniker of Cartan Neural Networks, showing both the feasibility and the performance of these geometric concepts in a machine learning context. The current paper expands on the mathematical structures underpinning Cartan Neural Networks, detailing the geometric properties of the layers and how the maps between layers interact with such structures to make Cartan Neural Networks covariant and geometrically interpretable. Together, these twin papers constitute a first step towards a fully geometrically interpretable theory of neural networks exploiting group-theoretic structures.
[LG-41] EVOLVE-X: Embedding Fusion and Language Prompting for User Evolution Forecasting on Social Media
链接: https://arxiv.org/abs/2507.16847
作者: Ismail Hossain,Sai Puppala,Md Jahangir Alam,Sajedul Talukder
类目: Social and Information Networks (cs.SI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: We are submitting this paper to ICWSM 2026 conference on September 15th, 2025
Abstract:Social media platforms serve as a significant medium for sharing personal emotions, daily activities, and various life events, ensuring individuals stay informed about the latest developments. From the initiation of an account, users progressively expand their circle of friends or followers, engaging actively by posting, commenting, and sharing content. Over time, user behavior on these platforms evolves, influenced by demographic attributes and the networks they form. In this study, we present a novel approach that leverages open-source models Llama-3-Instruct, Mistral-7B-Instruct, Gemma-7B-IT through prompt engineering, combined with GPT-2, BERT, and RoBERTa using a joint embedding technique, to analyze and predict the evolution of user behavior on social media over their lifetime. Our experiments demonstrate the potential of these models to forecast future stages of a user’s social evolution, including network changes, future connections, and shifts in user activities. Experimental results highlight the effectiveness of our approach, with GPT-2 achieving the lowest perplexity (8.21) in a Cross-modal configuration, outperforming RoBERTa (9.11) and BERT, and underscoring the importance of leveraging Cross-modal configurations for superior performance. This approach addresses critical challenges in social media, such as friend recommendations and activity predictions, offering insights into the trajectory of user behavior. By anticipating future interactions and activities, this research aims to provide early warnings about potential negative outcomes, enabling users to make informed decisions and mitigate risks in the long term.
[LG-42] D-Interpreter: Enhancing the Understanding of Timing Diagrams with Visual-Language Learning
链接: https://arxiv.org/abs/2507.16844
作者: Jie He,Vincent Theo Willem Kenbeek,Zhantao Yang,Meixun Qu,Ezio Bartocci,Dejan Ničković,Radu Grosu
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce TD-Interpreter, a specialized ML tool that assists engineers in understanding complex timing diagrams (TDs), originating from a third party, during their design and verification process. TD-Interpreter is a visual question-answer environment which allows engineers to input a set of TDs and ask design and verification queries regarding these TDs. We implemented TD-Interpreter with multimodal learning by fine-tuning LLaVA, a lightweight 7B Multimodal Large Language Model (MLLM). To address limited training data availability, we developed a synthetic data generation workflow that aligns visual information with its textual interpretation. Our experimental evaluation demonstrates the usefulness of TD-Interpreter which outperformed untuned GPT-4o by a large margin on the evaluated benchmarks.
[LG-43] Exploring the Frontiers of kNN Noisy Feature Detection and Recovery for Self-Driving Labs
链接: https://arxiv.org/abs/2507.16833
作者: Qiuyu Shi,Kangming Li,Yao Fehlis,Daniel Persaud,Robert Black,Jason Hattrick-Simpers
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 15 pages, 6 figures
Abstract:Self-driving laboratories (SDLs) have shown promise to accelerate materials discovery by integrating machine learning with automated experimental platforms. However, errors in the capture of input parameters may corrupt the features used to model system performance, compromising current and future campaigns. This study develops an automated workflow to systematically detect noisy features, determine sample-feature pairings that can be corrected, and finally recover the correct feature values. A systematic study is then performed to examine how dataset size, noise intensity, and feature value distribution affect both the detectability and recoverability of noisy features. In general, high-intensity noise and large training datasets are conducive to the detection and correction of noisy features. Low-intensity noise reduces detection and recovery but can be compensated for by larger clean training data sets. Detection and correction results vary between features with continuous and dispersed feature distributions showing greater recoverability compared to features with discrete or narrow distributions. This systematic study not only demonstrates a model agnostic framework for rational data recovery in the presence of noise, limited data, and differing feature distributions but also provides a tangible benchmark of kNN imputation in materials data sets. Ultimately, it aims to enhance data quality and experimental precision in automated materials discovery.
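The kNN imputation benchmarked above can be sketched in a few lines: recover a corrupted feature value by averaging it over the k nearest rows, with distance computed on the remaining clean features (toy data, illustrative values):

```python
import math

def knn_impute(rows, target_idx, missing_idx, k=2):
    """Recover rows[target_idx][missing_idx] as the mean of that feature
    over the k nearest neighbors, where nearness is measured only on the
    clean (non-corrupted) features."""
    target = rows[target_idx]
    feats = [i for i in range(len(target)) if i != missing_idx]

    def dist(row):
        return math.dist([row[i] for i in feats], [target[i] for i in feats])

    neighbors = sorted((r for j, r in enumerate(rows) if j != target_idx),
                       key=dist)[:k]
    return sum(r[missing_idx] for r in neighbors) / k

rows = [[1.0, 2.0, 10.0],
        [1.1, 2.1, 12.0],
        [9.0, 9.0, 50.0],
        [1.05, 2.05, None]]  # last feature of the last row is corrupted
recovered = knn_impute(rows, target_idx=3, missing_idx=2, k=2)
```

The two rows closest in the clean features contribute (10.0 + 12.0) / 2, so the distant outlier row never contaminates the recovered value, mirroring the abstract's point that recoverability depends on the feature-value distribution.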
[LG-44] Evaluating Artificial Intelligence Algorithms for the Standardization of Transtibial Prosthetic Socket Shape Design
链接: https://arxiv.org/abs/2507.16818
作者: C.H.E. Jordaan,M. van der Stelt,T.J.J. Maal,V.M.A. Stirler,R. Leijendekkers,T. Kachman,G.A. de Jong
类目: Machine Learning (cs.LG)
*备注:
Abstract:The quality of a transtibial prosthetic socket depends on the prosthetist’s skills and expertise, as the fitting is performed manually. This study investigates multiple artificial intelligence (AI) approaches to help standardize transtibial prosthetic socket design. Data from 118 patients were collected by prosthetists working in the Dutch healthcare system. This data consists of a three-dimensional (3D) scan of the residual limb and a corresponding 3D model of the prosthetist-designed socket. Multiple data pre-processing steps are performed for alignment, standardization, and optional compression using Morphable Models and Principal Component Analysis. Afterward, three different algorithms (a 3D neural network, a feedforward neural network, and a random forest) are developed to predict, from the 3D scan of the residual limb, either 1) the final socket shape or 2) the adaptations a prosthetist would apply to arrive at that shape. Each algorithm’s performance was evaluated by comparing the prosthetist-designed socket with the AI-generated socket, using two metrics in combination with the error location. First, we measure the surface-to-surface distance to assess the overall surface error between the AI-generated socket and the prosthetist-designed socket. Second, distance maps between the AI-generated and prosthetist sockets are utilized to analyze the error’s location. For all algorithms, estimating the required adaptations outperformed direct prediction of the final socket shape. The random forest model applied to adaptation prediction yields the lowest error with a median surface-to-surface distance of 1.24 millimeters, a first quartile of 1.03 millimeters, and a third quartile of 1.54 millimeters.
[LG-45] Neural networks for bifurcation and linear stability analysis of steady states in partial differential equations
链接: https://arxiv.org/abs/2407.19707
作者: Muhammad Luthfi Shahab,Hadi Susanto
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC); Pattern Formation and Solitons (nlin.PS)
*备注: Accepted for publication in Applied Mathematics and Computation
Abstract:This research introduces an extended application of neural networks for solving nonlinear partial differential equations (PDEs). A neural network, combined with a pseudo-arclength continuation, is proposed to construct bifurcation diagrams from parameterized nonlinear PDEs. Additionally, a neural network approach is also presented for solving eigenvalue problems to analyze solution linear stability, focusing on identifying the largest eigenvalue. The effectiveness of the proposed neural network is examined through experiments on the Bratu equation and the Burgers equation. Results from a finite difference method are also presented for comparison. Varying numbers of grid points are employed in each case to assess the behavior and accuracy of both the neural network and the finite difference method. The experimental results demonstrate that the proposed neural network produces better solutions, generates more accurate bifurcation diagrams, has reasonable computational times, and proves effective for linear stability analysis.
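Pseudo-arclength continuation, which the paper pairs with a neural network, can be sketched on a scalar toy problem. The version below is our own plain-Newton implementation (no neural network): it traces the circle u^2 + \lambda^2 = 1, whose folds at u = 0 would defeat naive continuation in \lambda, by augmenting F = 0 with an arclength constraint.

```python
import math

def continuation(F, dFdu, dFdl, u, lam, ds=0.1, steps=40):
    """Trace the solution curve F(u, lam) = 0 by pseudo-arclength continuation:
    parameterize by arclength so folds in lam are no obstacle."""
    path = [(u, lam)]
    tu, tl = 0.0, 1.0                          # initial tangent guess
    for _ in range(steps):
        u1, l1 = u + ds * tu, lam + ds * tl    # tangent predictor
        for _ in range(20):                    # Newton corrector on 2x2 system
            r1 = F(u1, l1)                                   # residual of F = 0
            r2 = (u1 - u) * tu + (l1 - lam) * tl - ds        # arclength constraint
            a, b = dFdu(u1, l1), dFdl(u1, l1)
            det = a * tl - b * tu              # Jacobian [[a, b], [tu, tl]]
            u1 -= (r1 * tl - b * r2) / det
            l1 -= (a * r2 - r1 * tu) / det
            if abs(F(u1, l1)) < 1e-12:
                break
        norm = math.hypot(u1 - u, l1 - lam)    # secant direction as next tangent
        tu, tl = (u1 - u) / norm, (l1 - lam) / norm
        u, lam = u1, l1
        path.append((u, lam))
    return path

# Toy branch with two folds: the circle u^2 + lam^2 = 1.
F = lambda u, l: u * u + l * l - 1.0
path = continuation(F, lambda u, l: 2 * u, lambda u, l: 2 * l, u=1.0, lam=0.0)
print(all(abs(F(u, l)) < 1e-8 for u, l in path))  # every point stays on the curve
```

After 40 steps the path has traversed roughly four radians of arc, passing through a fold where continuation in \lambda alone would stall.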
[LG-46] Deep Generative Learning of Magnetic Frustration in Artificial Spin Ice from Magnetic Force Microscopy Images
链接: https://arxiv.org/abs/2507.17726
作者: Arnab Neogi,Suryakant Mishra,Prasad P Iyer,Tzu-Ming Lu,Ezra Bussmann,Sergei Tretiak,Andrew Crandall Jones,Jian-Xin Zhu
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Increasingly large datasets of microscopic images with atomic resolution facilitate the development of machine learning methods to identify and analyze subtle physical phenomena embedded within the images. In this work, microscopic images of honeycomb lattice spin-ice samples serve as datasets from which we automate the calculation of net magnetic moments and directional orientations of spin-ice configurations. In the first stage of our workflow, machine learning models are trained to accurately predict magnetic moments and directions within spin-ice structures. Variational Autoencoders (VAEs), an emergent unsupervised deep learning technique, are employed to generate high-quality synthetic magnetic force microscopy (MFM) images and extract latent feature representations, thereby reducing experimental and segmentation errors. The second stage of the proposed methodology enables precise identification and prediction of frustrated vertices and nanomagnetic segments, effectively correlating structural and functional aspects of microscopic images. This facilitates the design of optimized spin-ice configurations with controlled frustration patterns, enabling potential on-demand synthesis.
[LG-47] Sequential Bayesian Design for Efficient Surrogate Construction in the Inversion of Darcy Flows
链接: https://arxiv.org/abs/2507.17713
作者: Hongji Wang,Hongqiao Wang,Jinyong Ying,Qingping Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 21 pages, 15 figures
Abstract:Inverse problems governed by partial differential equations (PDEs) play a crucial role in various fields, including computational science, image processing, and engineering. Particularly, the Darcy flow equation is a fundamental equation in fluid mechanics, which plays a crucial role in understanding fluid flow through porous media. Bayesian methods provide an effective approach for solving PDE inverse problems, while their numerical implementation requires numerous evaluations of computationally expensive forward solvers. Therefore, the adoption of surrogate models with lower computational costs is essential. However, constructing a globally accurate surrogate model for high-dimensional complex problems demands high model capacity and large amounts of data. To address this challenge, this study proposes an efficient locally accurate surrogate that focuses on the high-probability regions of the true likelihood in inverse problems, with relatively low model complexity and few training data requirements. Additionally, we introduce a sequential Bayesian design strategy to acquire the proposed surrogate, since the high-probability region of the likelihood is unknown. The strategy treats the posterior evolution process of sequential Bayesian design as a Gaussian process, enabling algorithmic acceleration through a one-step-ahead prior. The complete algorithmic framework is referred to as Sequential Bayesian design for locally accurate surrogate (SBD-LAS). Finally, three experiments based on the Darcy flow equation demonstrate the advantages of the proposed method in terms of both inversion accuracy and computational speed.
[LG-48] Debiased maximum-likelihood estimators for hazard ratios under machine-learning adjustment
链接: https://arxiv.org/abs/2507.17686
作者: Takashi Hayakawa,Satoshi Asai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Previous studies have shown that hazard ratios between treatment groups estimated with the Cox model are uninterpretable because the indefinite baseline hazard of the model fails to identify temporal change in the risk set composition due to treatment assignment and unobserved factors among multiple, contradictory scenarios. To alleviate this problem, especially in studies based on observational data with uncontrolled dynamic treatment and real-time measurement of many covariates, we propose abandoning the baseline hazard and using machine learning to explicitly model the change in the risk set with or without latent variables. For this framework, we clarify the context in which hazard ratios can be causally interpreted, and then develop a method based on Neyman orthogonality to compute debiased maximum-likelihood estimators of hazard ratios. Computing the constructed estimators is more efficient than computing those based on weighted regression with marginal structural Cox models. Numerical simulations confirm that the proposed method identifies the ground truth with minimal bias. These results lay the foundation for developing a useful, alternative method for causal inference with uncontrolled, observational data in modern epidemiology.
[LG-49] Time Deep Gradient Flow Method for pricing American options
链接: https://arxiv.org/abs/2507.17606
作者: Jasper Rou
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Probability (math.PR); Mathematical Finance (q-fin.MF)
*备注: 13 pages, 6 figures
Abstract:In this research, we explore neural network-based methods for pricing multidimensional American put options under the Black-Scholes and Heston models, extending up to five dimensions. We focus on two approaches: the Time Deep Gradient Flow (TDGF) method and the Deep Galerkin Method (DGM). We extend the TDGF method to handle the free-boundary partial differential equation inherent in American options. We carefully design the sampling strategy during training to enhance performance. Both TDGF and DGM achieve high accuracy while outperforming conventional Monte Carlo methods in terms of computational speed. In particular, TDGF tends to be faster during training than DGM.
[LG-50] Scalable DC Optimization via Adaptive Frank-Wolfe Algorithms
链接: https://arxiv.org/abs/2507.17545
作者: Sebastian Pokutta
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of minimizing a difference of (smooth) convex functions over a compact convex feasible region P, i.e., \min_{x \in P} f(x) - g(x), with smooth f and Lipschitz continuous g. This computational study builds upon and complements the framework of Maskan et al. [2025] by integrating advanced Frank-Wolfe variants to reduce computational overhead. We empirically show that constrained DC problems can be efficiently solved using a combination of the Blended Pairwise Conditional Gradients (BPCG) algorithm [Tsuji et al., 2022] with warm-starting and the adaptive error bound from Maskan et al. [2025]. The result is a highly efficient and scalable projection-free algorithm for constrained DC optimization.
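A minimal sketch of the DC-plus-Frank-Wolfe idea: each outer iteration linearizes the concave part -g at the current point and takes one vanilla Frank-Wolfe step on the convex surrogate. This is our own simplification, not the paper's BPCG with warm starts, and the toy instance is invented.

```python
def dc_frank_wolfe(grad_f, grad_g, x0, iters=200):
    """Minimize f(x) - g(x) over the probability simplex.
    Each iteration freezes -g via its gradient at the current iterate and
    takes one Frank-Wolfe step on the resulting convex surrogate."""
    x = x0[:]
    for t in range(iters):
        gl = grad_g(x)                                # linearize g at x_t
        d = [gf - gg for gf, gg in zip(grad_f(x), gl)]
        i = min(range(len(x)), key=lambda j: d[j])    # LMO over the simplex: best vertex
        gamma = 2.0 / (t + 2)                         # standard FW step size
        x = [(1 - gamma) * xj for xj in x]
        x[i] += gamma
    return x

# Toy instance: f(x) = ||x - a||^2, g(x) = 0.5 ||x||^2, both convex.
a = [0.7, 0.2, 0.1]
grad_f = lambda x: [2 * (xj - aj) for xj, aj in zip(x, a)]
grad_g = lambda x: list(x)
x = dc_frank_wolfe(grad_f, grad_g, [1 / 3, 1 / 3, 1 / 3])
print([round(v, 2) for v in x])  # → [1.0, 0.0, 0.0]
```

For this instance the difference f - g is still convex and its constrained minimizer is the simplex vertex e_1, which the iteration reaches; the same loop applies unchanged when f - g is genuinely nonconvex.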
[LG-51] Optimal differentially private kernel learning with random projection
链接: https://arxiv.org/abs/2507.17544
作者: Bonwoo Lee,Cheolwoo Park,Jeongyoun Ahn
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 110 page, 12 figures
Abstract:Differential privacy has become a cornerstone in the development of privacy-preserving learning algorithms. This work addresses optimizing differentially private kernel learning within the empirical risk minimization (ERM) framework. We propose a novel differentially private kernel ERM algorithm based on random projection in the reproducing kernel Hilbert space using Gaussian processes. Our method achieves minimax-optimal excess risk for both the squared loss and Lipschitz-smooth convex loss functions under a local strong convexity condition. We further show that existing approaches based on alternative dimension reduction techniques, such as random Fourier feature mappings or \ell_2 regularization, yield suboptimal generalization performance. Our key theoretical contribution also includes the derivation of dimension-free generalization bounds for objective perturbation-based private linear ERM – marking the first such result that does not rely on noisy gradient-based mechanisms. Additionally, we obtain sharper generalization bounds for existing differentially private kernel ERM algorithms. Empirical evaluations support our theoretical claims, demonstrating that random projection enables statistically efficient and optimally private kernel learning. These findings provide new insights into the design of differentially private algorithms and highlight the central role of dimension reduction in balancing privacy and utility.
[LG-52] Clustering-based hard negative sampling for supervised contrastive speaker verification INTERSPEECH2025
链接: https://arxiv.org/abs/2507.17540
作者: Piotr Masztalski,Michał Romaniuk,Jakub Żak,Mateusz Matuszewski,Konrad Kowalczyk
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: Accepted to INTERSPEECH 2025
Abstract:In speaker verification, contrastive learning is gaining popularity as an alternative to the traditionally used classification-based approaches. Contrastive methods can benefit from an effective use of hard negative pairs, which are different-class samples particularly challenging for a verification model due to their similarity. In this paper, we propose CHNS, a clustering-based hard negative sampling method dedicated to supervised contrastive speaker representation learning. Our approach clusters embeddings of similar speakers, and adjusts batch composition to obtain an optimal ratio of hard and easy negatives during contrastive loss calculation. Experimental evaluation shows that CHNS outperforms a baseline supervised contrastive approach with and without loss-based hard negative sampling, as well as a state-of-the-art classification-based approach to speaker verification, by as much as 18% relative EER and minDCF on the VoxCeleb dataset using two lightweight model architectures.
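The batch-composition step can be sketched as follows. This is a hypothetical simplification of CHNS: cluster labels over speaker embeddings are assumed given (e.g. from k-means), and names like `compose_batch` are our own, not the paper's.

```python
import random

def compose_batch(speaker_cluster, anchor, batch_size=8, hard_ratio=0.5):
    """Build a batch around `anchor`: a fraction `hard_ratio` of the other
    speakers come from the anchor's cluster (hard negatives, acoustically
    similar), the rest are sampled from different clusters (easy negatives)."""
    rng = random.Random(0)                     # fixed seed for reproducibility
    same = [s for s, c in speaker_cluster.items()
            if c == speaker_cluster[anchor] and s != anchor]
    other = [s for s, c in speaker_cluster.items()
             if c != speaker_cluster[anchor]]
    n_hard = min(int((batch_size - 1) * hard_ratio), len(same))
    batch = [anchor]
    batch += rng.sample(same, n_hard)
    batch += rng.sample(other, min(batch_size - 1 - n_hard, len(other)))
    return batch

# Hypothetical cluster assignment of speaker embeddings.
clusters = {"spk0": 0, "spk1": 0, "spk2": 0, "spk3": 1,
            "spk4": 1, "spk5": 2, "spk6": 2, "spk7": 2}
batch = compose_batch(clusters, anchor="spk0", batch_size=6, hard_ratio=0.4)
print(batch[0], len(batch))  # anchor first; batch filled to the requested size
```

Tuning `hard_ratio` is the knob the abstract alludes to: too few hard negatives wastes the clustering, too many destabilizes the contrastive loss.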
[LG-53] Graph Neural Network Approach to Predicting Magnetization in Quasi-One-Dimensional Ising Systems
链接: https://arxiv.org/abs/2507.17509
作者: V. Slavin,O. Kryvchikov,D. Laptev
类目: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 18 pages, 4 figures
Abstract:We present a graph-based deep learning framework for predicting the magnetic properties of quasi-one-dimensional Ising spin systems. The lattice geometry is encoded as a graph and processed by a graph neural network (GNN) followed by fully connected layers. The model is trained on Monte Carlo simulation data and accurately reproduces key features of the magnetization curve, including plateaus, critical transition points, and the effects of geometric frustration. It captures both local motifs and global symmetries, demonstrating that GNNs can infer magnetic behavior directly from structural connectivity. The proposed approach enables efficient prediction of magnetization without the need for additional Monte Carlo simulations.
[LG-54] Joint Multi-Target Detection-Tracking in Cognitive Massive MIMO Radar via POMCP
链接: https://arxiv.org/abs/2507.17506
作者: Imad Bouhou,Stefano Fortunati,Leila Gharsalli,Alexandre Renaux
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This correspondence presents a power-aware cognitive radar framework for joint detection and tracking of multiple targets in a massive multiple-input multiple-output (MIMO) radar environment. Building on a previous single-target algorithm based on Partially Observable Monte Carlo Planning (POMCP), we extend it to the multi-target case by assigning each target an independent POMCP tree, enabling scalable and efficient planning. Departing from uniform power allocation, which is often suboptimal with varying signal-to-noise ratios (SNRs), our approach predicts each target’s future angular position and expected received power, based on its estimated range and radar cross-section (RCS). These predictions guide adaptive waveform design via a constrained optimization problem that allocates transmit energy to enhance the detectability of weaker or distant targets, while ensuring sufficient power for high-SNR targets. The reward function in the underlying partially observable Markov decision process (POMDP) is also modified to prioritize accurate spatial and power estimation. Simulations involving multiple targets with different SNRs confirm the effectiveness of our method. The proposed framework for the cognitive radar improves detection probability for low-SNR targets and achieves more accurate tracking compared to approaches using uniform or orthogonal waveforms. These results demonstrate the potential of the POMCP-based framework for adaptive, efficient multi-target radar systems.
[LG-55] Doubly robust outlier resistant inference on causal treatment effect
链接: https://arxiv.org/abs/2507.17439
作者: Joonsung Kang
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:
Abstract:Outliers can severely distort causal effect estimation in observational studies, yet this issue has received limited attention in the literature. Their influence is especially pronounced in small sample sizes, where detecting and removing outliers becomes increasingly difficult. Therefore, it is essential to estimate treatment effects robustly without excluding these influential data points. To address this, we propose a doubly robust point estimator for the average treatment effect under a contaminated model that includes outliers. Robustness in outcome regression is achieved through a robust estimating equation, while covariate balancing propensity scores (CBPS) ensure resilience in propensity score modeling. To prevent model overfitting due to the inclusion of numerous parameters, we incorporate variable selection. All these components are unified under a penalized empirical likelihood framework. For confidence interval estimation, most existing approaches rely on asymptotic properties, which may be unreliable in finite samples. We derive an optimal finite-sample confidence interval for the average treatment effect using our proposed estimating equation, ensuring that the interval bounds remain unaffected by outliers. Through simulations and a real-world application involving hypertension data with outliers, we demonstrate that our method consistently outperforms existing approaches in both accuracy and robustness.
[LG-56] Learning from Scratch: Structurally-masked Transformer for Next Generation Lib-free Simulation
链接: https://arxiv.org/abs/2507.17396
作者: Junlang Huang,Hao Chen,Zhong Guan
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes a neural framework for power and timing prediction of multi-stage data path, distinguishing itself from traditional lib-based analytical methods dependent on driver characterization and load simplifications. To the best of our knowledge, this is the first language-based, netlist-aware neural network designed explicitly for standard cells. Our approach employs two pre-trained neural models of waveform prediction and delay estimation that directly infer transient waveforms and propagation delays from SPICE netlists, conditioned on critical physical parameters such as load capacitance, input slew, and gate size. This method accurately captures both intrinsic and coupling-induced delay effects without requiring simplification or interpolation. For multi-stage timing prediction, we implement a recursive propagation strategy where predicted waveforms from each stage feed into subsequent stages, cumulatively capturing delays across the logic chain. This approach ensures precise timing alignment and complete waveform visibility throughout complex signal pathways. The waveform prediction utilizes a hybrid CNN-Transformer architecture with netlist-aware node-level encoding, addressing traditional Transformers’ fixed input dimensionality constraints. Additionally, specialized subnetworks separately handle primary delay estimation and crosstalk correction. Experimental results demonstrate SPICE-level accuracy, consistently achieving RMSE below 0.0098 across diverse industrial circuits. The proposed framework provides a scalable, structurally adaptable neural alternative to conventional power and timing engines, demonstrating high fidelity to physical circuit behaviors.
[LG-57] Nearly Minimax Discrete Distribution Estimation in Kullback-Leibler Divergence with High Probability
链接: https://arxiv.org/abs/2507.17316
作者: Dirk van der Hoeven,Julia Olkhovskaia,Tim van Erven
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of estimating a discrete distribution p with support of size K and provide both upper and lower bounds with high probability in KL divergence. We prove that in the worst case, for any estimator \widehat{p}, with probability at least \delta, \text{KL}(p \| \widehat{p}) \geq C \max\{K, \ln(K)\ln(1/\delta)\}/n, where n is the sample size and C > 0 is a constant. We introduce a computationally efficient estimator p^{\text{OTB}}, based on online-to-batch conversion and suffix averaging, and show that with probability at least 1 - \delta, \text{KL}(p \| \widehat{p}) \leq C(K\log(\log(K)) + \ln(K)\ln(1/\delta))/n. Furthermore, we also show that with sufficiently many observations relative to \log(1/\delta), the maximum likelihood estimator \bar{p} guarantees that with probability at least 1 - \delta, (1/6)\chi^2(\bar{p} \| p) \leq (1/4)\chi^2(p \| \bar{p}) \leq \text{KL}(p \| \bar{p}) \leq C(K + \log(1/\delta))/n, where \chi^2 denotes the \chi^2-divergence.
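The role of smoothing in KL estimation can be seen in a small example. This is our own illustration, and the add-constant estimator shown is a classical baseline, not the paper's p^{\text{OTB}}: the MLE assigns zero mass to unseen symbols, which makes \text{KL}(p \| \widehat{p}) infinite, while any positive smoothing keeps it finite.

```python
import math
from collections import Counter

def kl(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i); blows up if q misses the support of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def estimate(samples, K, alpha=0.0):
    """Add-constant estimator: (count_k + alpha) / (n + K*alpha); alpha=0 is the MLE."""
    counts = Counter(samples)
    n = len(samples)
    return [(counts.get(k, 0) + alpha) / (n + K * alpha) for k in range(K)]

p = [0.5, 0.3, 0.15, 0.05]
samples = [0, 0, 1, 0, 1, 2, 0, 1, 2, 0]    # symbol 3 is never observed
mle = estimate(samples, K=4, alpha=0.0)      # assigns q_3 = 0, so KL(p || mle) is infinite
smoothed = estimate(samples, K=4, alpha=0.5)
print(mle[3], kl(p, smoothed) < float("inf"))  # → 0.0 True
```

The paper's point is finer than this: which smoothing achieves the minimax rate with high probability, not merely a finite value.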
[LG-58] Spintronic Bayesian Hardware Driven by Stochastic Magnetic Domain Wall Dynamics
链接: https://arxiv.org/abs/2507.17193
作者: Tianyi Wang,Bingqian Dai,Kin Wong,Yaochen Li,Yang Cheng,Qingyuan Shu,Haoran He,Puyang Huang,Hanshen Huang,Kang L. Wang
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注:
Abstract:As artificial intelligence (AI) advances into diverse applications, ensuring reliability of AI models is increasingly critical. Conventional neural networks offer strong predictive capabilities but produce deterministic outputs without inherent uncertainty estimation, limiting their reliability in safety-critical domains. Probabilistic neural networks (PNNs), which introduce randomness, have emerged as a powerful approach for enabling intrinsic uncertainty quantification. However, traditional CMOS architectures are inherently designed for deterministic operation and actively suppress intrinsic randomness. This poses a fundamental challenge for implementing PNNs, as probabilistic processing introduces significant computational overhead. To address this challenge, we introduce a Magnetic Probabilistic Computing (MPC) platform: an energy-efficient, scalable hardware accelerator that leverages intrinsic magnetic stochasticity for uncertainty-aware computing. This physics-driven strategy utilizes spintronic systems based on magnetic domain walls (DWs) and their dynamics to establish a new paradigm of physical probabilistic computing for AI. The MPC platform integrates three key mechanisms: thermally induced DW stochasticity, voltage-controlled magnetic anisotropy (VCMA), and tunneling magnetoresistance (TMR), enabling fully electrical and tunable probabilistic functionality at the device level. As a representative demonstration, we implement a Bayesian Neural Network (BNN) inference structure and validate its functionality on CIFAR-10 classification tasks. Compared to standard 28nm CMOS implementations, our approach achieves a seven orders of magnitude improvement in the overall figure of merit, with substantial gains in area efficiency, energy consumption, and speed. These results underscore the MPC platform’s potential to enable reliable and trustworthy physical AI systems.
[LG-59] OkadaTorch: A Differentiable Programming of Okada Model to Calculate Displacements and Strains from Fault Parameters
链接: https://arxiv.org/abs/2507.17126
作者: Masayoshi Someya,Taisuke Yamada,Tomohisa Okazaki
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures
Abstract:The Okada model is a widely used analytical solution for displacements and strains caused by a point or rectangular dislocation source in a 3D elastic half-space. We present OkadaTorch, a PyTorch implementation of the Okada model, where the entire code is differentiable; gradients with respect to input can be easily computed using automatic differentiation (AD). Our work consists of two components: a direct translation of the original Okada model into PyTorch, and a convenient wrapper interface for efficiently computing gradients and Hessians with respect to either observation station coordinates or fault parameters. This differentiable framework is well suited for fault parameter inversion, including gradient-based optimization, Bayesian inference, and integration with scientific machine learning (SciML) models. Our code is available here: this https URL
[LG-60] CoLT: The conditional localization test for assessing the accuracy of neural posterior estimates
链接: https://arxiv.org/abs/2507.17030
作者: Tianyu Chen,Vansh Bansal,James G. Scott
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of validating whether a neural posterior estimate q(\theta \mid x) is an accurate approximation to the true, unknown posterior p(\theta \mid x). Existing methods for evaluating the quality of an NPE estimate are largely derived from classifier-based tests or divergence measures, but these suffer from several practical drawbacks. As an alternative, we introduce the Conditional Localization Test (CoLT), a principled method designed to detect discrepancies between p(\theta \mid x) and q(\theta \mid x) across the full range of conditioning inputs. Rather than relying on exhaustive comparisons or density estimation at every x, CoLT learns a localization function that adaptively selects points \theta_l(x) where the neural posterior q deviates most strongly from the true posterior p for that x. This approach is particularly advantageous in typical simulation-based inference settings, where only a single draw \theta \sim p(\theta \mid x) from the true posterior is observed for each conditioning input, but where the neural posterior q(\theta \mid x) can be sampled an arbitrary number of times. Our theoretical results establish necessary and sufficient conditions for assessing distributional equality across all x, offering both rigorous guarantees and practical scalability. Empirically, we demonstrate that CoLT not only performs better than existing methods at comparing p and q, but also pinpoints regions of significant divergence, providing actionable insights for model refinement. These properties position CoLT as a state-of-the-art solution for validating neural posterior estimates.
[LG-61] The surprising strength of weak classifiers for validating neural posterior estimates
链接: https://arxiv.org/abs/2507.17026
作者: Vansh Bansal,Tianyu Chen,James G. Scott
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Neural Posterior Estimation (NPE) has emerged as a powerful approach for amortized Bayesian inference when the true posterior p(\theta \mid y) is intractable or difficult to sample. But evaluating the accuracy of neural posterior estimates remains challenging, with existing methods suffering from major limitations. One appealing and widely used method is the classifier two-sample test (C2ST), where a classifier is trained to distinguish samples from the true posterior p(\theta \mid y) versus the learned NPE approximation q(\theta \mid y) . Yet despite the appealing simplicity of the C2ST, its theoretical and practical reliability depends upon having access to a near-Bayes-optimal classifier – a requirement that is rarely met and, at best, difficult to verify. Thus a major open question is: can a weak classifier still be useful for neural posterior validation? We show that the answer is yes. Building on the work of Hu and Lei, we present several key results for a conformal variant of the C2ST, which converts any trained classifier’s scores – even those of weak or over-fitted models – into exact finite-sample p-values. We establish two key theoretical properties of the conformal C2ST: (i) finite-sample Type-I error control, and (ii) non-trivial power that degrades gently in tandem with the error of the trained classifier. The upshot is that even weak, biased, or overfit classifiers can still yield powerful and reliable tests. Empirically, the Conformal C2ST outperforms classical discriminative tests across a wide range of benchmarks. These results reveal the underappreciated strength of weak classifiers for validating neural posterior estimates, establishing the conformal C2ST as a practical, theoretically grounded diagnostic for modern simulation-based inference.
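The conformal construction at the heart of the conformal C2ST can be sketched as a rank-based p-value: the test score is ranked among held-out calibration scores, and the resulting p-value is valid in finite samples regardless of how good the classifier is. This is a minimal sketch of the split-conformal recipe; function names and scores are our own.

```python
def conformal_p_value(calib_scores, test_score):
    """Finite-sample-valid p-value: the (shifted) rank of the test score among
    calibration scores drawn under the null. No optimality of the underlying
    classifier is assumed -- a weak classifier just yields a less powerful test."""
    n = len(calib_scores)
    return (1 + sum(s >= test_score for s in calib_scores)) / (n + 1)

# Hypothetical classifier scores: higher = "looks like it came from q, not p".
calibration = [0.31, 0.45, 0.12, 0.58, 0.40, 0.22, 0.49, 0.35, 0.27]
print(conformal_p_value(calibration, test_score=0.95))  # → 0.1
```

An extreme test score beats all nine calibration scores, giving the smallest achievable p-value 1/(n+1) = 0.1; an unremarkable score gives a large p-value, so Type-I error is controlled by construction.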
[LG-62] Fundamental limits of distributed covariance matrix estimation via a conditional strong data processing inequality
链接: https://arxiv.org/abs/2507.16953
作者: Mohammad Reza Rahmani,Mohammad Hossein Yassaee,Mohammad Reza Aref
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Estimating high-dimensional covariance matrices is a key task across many fields. This paper explores the theoretical limits of distributed covariance estimation in a feature-split setting, where communication between agents is constrained. Specifically, we study a scenario in which multiple agents each observe different components of i.i.d. samples drawn from a sub-Gaussian random vector. A central server seeks to estimate the complete covariance matrix using a limited number of bits communicated by each agent. We obtain a nearly tight minimax lower bound for covariance matrix estimation under operator norm and Frobenius norm. Our main technical tool is a novel generalization of the strong data processing inequality (SDPI), termed the Conditional Strong Data Processing Inequality (C-SDPI) coefficient, introduced in this work. The C-SDPI coefficient shares key properties such as tensorization with the conventional SDPI. Crucially, it quantifies the average contraction in a state-dependent channel and can be significantly lower than the worst-case SDPI coefficient over the state input. Utilizing the doubling trick of Geng-Nair and an operator Jensen inequality, we compute this coefficient for Gaussian mixture channels. We then employ it to establish minimax lower bounds on estimation error, capturing the trade-offs among sample size, communication cost, and data dimensionality. Building on this, we present a nearly optimal estimation protocol whose sample and communication requirements match the lower bounds up to logarithmic factors. Unlike much of the existing literature, our framework does not assume infinite samples or Gaussian distributions, making it broadly applicable. Finally, we extend our analysis to interactive protocols, showing interaction can significantly reduce communication requirements compared to non-interactive schemes. 
[LG-63] Avoiding spectral pollution for transfer operators using residuals
链接: https://arxiv.org/abs/2507.16915
作者: April Herwig,Matthew J. Colbrook,Oliver Junge,Péter Koltai,Julia Slipantschuk
类目: Dynamical Systems (math.DS); Machine Learning (cs.LG); Numerical Analysis (math.NA); Spectral Theory (math.SP); Machine Learning (stat.ML)
*备注:
Abstract:Koopman operator theory enables linear analysis of nonlinear dynamical systems by lifting their evolution to infinite-dimensional function spaces. However, finite-dimensional approximations of Koopman and transfer (Frobenius–Perron) operators are prone to spectral pollution, introducing spurious eigenvalues that can compromise spectral computations. While recent advances have yielded provably convergent methods for Koopman operators, analogous tools for general transfer operators remain limited. In this paper, we present algorithms for computing spectral properties of transfer operators without spectral pollution, including extensions to the Hardy-Hilbert space. Case studies–ranging from families of Blaschke maps with known spectrum to a molecular dynamics model of protein folding–demonstrate the accuracy and flexibility of our approach. Notably, we demonstrate that spectral features can arise even when the corresponding eigenfunctions lie outside the chosen space, highlighting the functional-analytic subtleties in defining the “true” Koopman spectrum. Our methods offer robust tools for spectral estimation across a broad range of applications.
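The residual idea at the core of such pollution-free methods can be sketched on a toy linear system (a minimal illustration, not the paper's algorithm, which targets transfer operators and Hardy-Hilbert spaces): compute an EDMD approximation, then keep only eigenpairs whose data-driven residual is small.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear dynamics x_{n+1} = A x_n with known Koopman eigenvalues 0.9, 0.5.
A = np.diag([0.9, 0.5])
X = rng.standard_normal((500, 2))  # snapshot matrix
Y = X @ A.T                        # time-shifted snapshots

# EDMD with the identity dictionary: least-squares Koopman matrix.
K = np.linalg.pinv(X) @ Y
eigvals, eigvecs = np.linalg.eig(K)

# Residual test: a genuine eigenpair makes ||Y v - lam X v|| / ||X v|| small;
# spurious (polluted) eigenvalues produce large residuals and are discarded.
clean = []
for lam, v in zip(eigvals, eigvecs.T):
    res = np.linalg.norm(Y @ v - lam * (X @ v)) / np.linalg.norm(X @ v)
    if res < 1e-8:
        clean.append(lam.real)
clean.sort()
```

Here the dictionary is invariant under the dynamics, so both eigenvalues pass the residual test; on a non-invariant dictionary, spurious eigenvalues would show large residuals and be filtered out.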
[LG-64] Technical report: Impact of Duration Prediction on Speaker-specific TTS for Indian Languages
链接: https://arxiv.org/abs/2507.16875
作者: Isha Pandey,Pranav Gaikwad,Amruta Parulekar,Ganesh Ramakrishnan
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:
Abstract:High-quality speech generation for low-resource languages, such as many Indian languages, remains a significant challenge due to limited data and diverse linguistic structures. Duration prediction is a critical component in many speech generation pipelines, playing a key role in modeling prosody and speech rhythm. Some recent generative approaches omit explicit duration modeling, often at the cost of longer training times; we retain and explore this module to better understand its impact in the linguistically rich and data-scarce landscape of India. We train a non-autoregressive Continuous Normalizing Flow (CNF) based speech model using publicly available Indian language data and evaluate multiple duration prediction strategies for zero-shot, speaker-specific generation. Our comparative analysis on speech-infilling tasks reveals nuanced trade-offs: infilling-based predictors improve intelligibility in some languages, while speaker-prompted predictors better preserve speaker characteristics in others. These findings inform the design and selection of duration strategies tailored to specific languages and tasks, underscoring the continued value of interpretable components like duration prediction in adapting advanced generative architectures to low-resource, multilingual settings.
[LG-65] Enhancing Lung Disease Diagnosis via Semi-Supervised Machine Learning
链接: https://arxiv.org/abs/2507.16845
作者: Xiaoran Xu,In-Ho Ra,Ravi Sankar
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注:
Abstract:Lung diseases, including lung cancer and COPD, are significant health concerns globally. Traditional diagnostic methods can be costly, time-consuming, and invasive. This study investigates the use of semi-supervised learning methods for lung sound signal detection using a model combination of MFCC+CNN. By introducing semi-supervised learning modules such as MixMatch, Co-Refinement, and Co-Refurbishing, we aim to enhance the detection performance while reducing dependence on manual annotations. With the add-on semi-supervised modules, the accuracy rate of the MFCC+CNN model reaches 92.9%, an increase of 3.8% over the baseline model. The research contributes to the field of lung disease sound detection by addressing challenges such as individual differences and insufficient labeled data.
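The label-guessing step at the heart of MixMatch can be sketched in a few lines (this is the standard MixMatch formulation, not the authors' code): average the model's predictions over several augmentations of an unlabeled clip, then sharpen the average with a temperature so the pseudo-label moves toward one-hot.

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening: raise probabilities to 1/T and renormalize."""
    p = p ** (1.0 / T)
    return p / p.sum(axis=-1, keepdims=True)

# Label guessing for one unlabeled lung-sound clip: average the model's
# predictions over K augmentations, then sharpen toward a one-hot target.
preds = np.array([[0.6, 0.3, 0.1],   # prediction for augmentation 1
                  [0.5, 0.4, 0.1]])  # prediction for augmentation 2
avg = preds.mean(axis=0)
pseudo = sharpen(avg)
```

The sharpened `pseudo` then serves as the training target for the unlabeled example, reducing dependence on manual annotations.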
[LG-66] From Black Box to Biomarker: Sparse Autoencoders for Interpreting Speech Models of Parkinson's Disease NEURIPS2025
链接: https://arxiv.org/abs/2507.16836
作者: Peter Plantinga,Jen-Kai Chen,Roozbeh Sattari,Mirco Ravanelli,Denise Klein
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, submitted to NeurIPS 2025
Abstract:Speech holds promise as a cost-effective and non-invasive biomarker for neurological conditions such as Parkinson’s disease (PD). While deep learning systems trained on raw audio can find subtle signals not available from hand-crafted features, their black-box nature hinders clinical adoption. To address this, we apply sparse autoencoders (SAEs) to uncover interpretable internal representations from a speech-based PD detection system. We introduce a novel mask-based activation for adapting SAEs to small biomedical datasets, creating sparse disentangled dictionary representations. These dictionary entries are found to have strong associations with characteristic articulatory deficits in PD speech, such as reduced spectral flux and increased spectral flatness in the low-energy regions highlighted by the model attention. We further show that the spectral flux is related to volumetric measurements of the putamen from MRI scans, demonstrating the potential of SAEs to reveal clinically relevant biomarkers for disease monitoring and diagnosis.
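A sparse autoencoder of the kind described above can be sketched minimally (a generic ReLU+L1 SAE with tied weights and random toy data; the paper's novel mask-based activation is not reproduced here): an overcomplete dictionary reconstructs model activations while an L1 penalty keeps most dictionary entries inactive.

```python
import numpy as np

rng = np.random.default_rng(0)

# An overcomplete dictionary: 8-dim activations mapped to 32 sparse codes.
d_model, d_dict = 8, 32
W_enc = 0.1 * rng.standard_normal((d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = W_enc.T.copy()  # tied decoder weights, for simplicity

def sae_forward(x, l1=1e-3):
    """Forward pass and loss: reconstruction error plus an L1 penalty that
    drives most dictionary entries to zero."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # sparse codes (ReLU)
    x_hat = z @ W_dec
    loss = np.mean((x - x_hat) ** 2) + l1 * np.abs(z).sum()
    return z, x_hat, loss

x = rng.standard_normal((4, d_model))  # a batch of toy model activations
z, x_hat, loss = sae_forward(x)
active_frac = (z > 0).mean()           # fraction of active dictionary entries
```

After training, individual dictionary entries can be inspected and correlated with interpretable speech features, which is how the paper links them to articulatory deficits.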
[LG-67] Does Language Matter for Early Detection of Parkinson's Disease from Speech?
链接: https://arxiv.org/abs/2507.16832
作者: Peter Plantinga,Briac Cordelle,Dominique Louër,Mirco Ravanelli,Denise Klein
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Accepted to IEEE Workshop on Machine Learning for Signal Processing (MLSP) 2025
Abstract:Using speech samples as a biomarker is a promising avenue for detecting and monitoring the progression of Parkinson’s disease (PD), but there is considerable disagreement in the literature about how best to collect and analyze such data. Early research in detecting PD from speech used a sustained vowel phonation (SVP) task, while some recent research has explored recordings of more cognitively demanding tasks. To assess the role of language in PD detection, we tested pretrained models with varying data types and pretraining objectives and found that (1) text-only models match the performance of vocal-feature models, (2) multilingual Whisper outperforms self-supervised models whereas monolingual Whisper does worse, and (3) AudioSet pretraining improves performance on SVP but not spontaneous speech. These findings together highlight the critical role of language for the early detection of Parkinson’s disease.
[LG-68] High-dimensional multidisciplinary design optimization for aircraft eco-design / Optimisation multi-disciplinaire en grande dimension pour l'éco-conception avion en avant-projet
链接: https://arxiv.org/abs/2402.04711
作者: Paul Saves
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Mathematical Software (cs.MS); Machine Learning (stat.ML)
*备注: PhD Thesis, Université de Toulouse, Toulouse, 2024 on Gaussian Process kernels for Bayesian optimization in high dimension with mixed and hierarchical variables at ISAE-SUPAERO. Keywords: Gaussian process, Black-box optimization, Bayesian inference, Multidisciplinary design optimization, Mixed hierarchical and categorical inputs, Eco-friendly aircraft design
Abstract:The objective of this Philosophiae Doctor (Ph.D) thesis is to propose an efficient approach for optimizing a multidisciplinary black-box model when the optimization problem is constrained and involves a large number of mixed integer design variables (typically 100 variables). The targeted optimization approach, called EGO (Efficient Global Optimization), is based on a sequential enrichment of an adaptive surrogate model and, in this context, Gaussian process (GP) surrogate models are among the most widely used in engineering problems to approximate time-consuming high fidelity models. EGO is a heuristic Bayesian optimization (BO) method that performs well in terms of solution quality. However, like any other global optimization method, EGO suffers from the curse of dimensionality: its performance is satisfactory on lower-dimensional problems but deteriorates as the dimensionality of the optimization search space increases. For realistic aircraft design problems, the number of design variables can even exceed 100, so solving the problems directly with EGO is ruled out. This is especially true when the problems involve both continuous and categorical variables, which increases the size of the search space even further. In this Ph.D thesis, effective parameterization tools are investigated, including techniques like partial least squares regression, to significantly reduce the number of design variables. Additionally, Bayesian optimization is adapted to handle discrete variables and high-dimensional spaces in order to reduce the number of evaluations when optimizing innovative aircraft concepts such as the “DRAGON” hybrid airplane to reduce their climate impact.
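EGO's sequential enrichment picks the next expensive evaluation by maximizing an acquisition function, classically expected improvement (EI), under the GP posterior. A minimal closed-form EI for minimization (the standard textbook formula, not code from the thesis):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI of a candidate point for MINIMIZATION, given the GP posterior mean
    mu and standard deviation sigma, and the incumbent best observed value."""
    if sigma <= 0.0:
        return 0.0
    z = (best - mu - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (best - mu - xi) * Phi + sigma * phi
```

EGO evaluates the true black-box model at the EI maximizer, refits the GP, and repeats; the thesis's contribution is making this loop tractable when the design space is high-dimensional and mixed continuous/categorical.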
信息检索
[IR-0] Leave No One Behind: Fairness-Aware Cross-Domain Recommender Systems for Non-Overlapping Users RECSYS2025
链接: https://arxiv.org/abs/2507.17749
作者: Weixin Chen,Yuhan Zhao,Li Chen,Weike Pan
类目: Information Retrieval (cs.IR)
*备注: Accepted by RecSys 2025
Abstract:Cross-domain recommendation (CDR) methods predominantly leverage overlapping users to transfer knowledge from a source domain to a target domain. However, through empirical studies, we uncover a critical bias inherent in these approaches: while overlapping users experience significant enhancements in recommendation quality, non-overlapping users benefit minimally and even face performance degradation. This unfairness may erode user trust, and, consequently, negatively impact business engagement and revenue. To address this issue, we propose a novel solution that generates virtual source-domain users for non-overlapping target-domain users. Our method utilizes a dual attention mechanism to discern similarities between overlapping and non-overlapping users, thereby synthesizing realistic virtual user embeddings. We further introduce a limiter component that ensures the generated virtual users align with real-data distributions while preserving each user’s unique characteristics. Notably, our method is model-agnostic and can be seamlessly integrated into any CDR model. Comprehensive experiments conducted on three public datasets with five CDR baselines demonstrate that our method effectively mitigates the CDR non-overlapping user bias, without loss of overall accuracy. Our code is publicly available at this https URL.
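The core idea of synthesizing virtual source-domain users can be sketched with a single-head attention step (a simplified stand-in for the paper's dual attention mechanism and limiter; all embeddings here are random toys): weight overlapping users by their target-domain similarity to the non-overlapping user, then mix their source-domain embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 4
# Overlapping users have embeddings in BOTH domains (toy random stand-ins).
tgt_overlap = rng.standard_normal((5, d))  # target-domain embeddings
src_overlap = rng.standard_normal((5, d))  # paired source-domain embeddings

def virtual_source_embedding(tgt_user):
    """Attend over overlapping users by target-domain similarity, then mix
    their source-domain embeddings into a virtual source-domain user."""
    scores = tgt_overlap @ tgt_user / np.sqrt(d)
    weights = softmax(scores)
    return weights @ src_overlap

# A non-overlapping user exists only in the target domain.
v = virtual_source_embedding(rng.standard_normal(d))
```

The virtual embedding `v` then lets any CDR model treat the non-overlapping user as if it were overlapping, which is what makes the approach model-agnostic.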
[IR-1] On Function-Correcting Codes in the Lee Metric
链接: https://arxiv.org/abs/2507.17654
作者: Gyanendra K. Verma,Abhay Kumar Singh
类目: Information Theory (cs.IT); Discrete Mathematics (cs.DM); Information Retrieval (cs.IR)
*备注:
Abstract:Function-correcting codes are a coding framework designed to minimize redundancy while ensuring that specific functions or computations of encoded data can be reliably recovered, even in the presence of errors. The choice of metric is crucial in designing such codes, as it determines which computations must be protected and how errors are measured and corrected. Previous work by Liu and Liu [6] studied function-correcting codes over \mathbb{Z}_2^l, l\geq 2, using the homogeneous metric, which coincides with the Lee metric over \mathbb{Z}_4. In this paper, we extend the study to codes over \mathbb{Z}_m, for any positive integer m\geq 2, under the Lee metric and aim to determine their optimal redundancy. To achieve this, we introduce irregular Lee distance codes and derive upper and lower bounds on the optimal redundancy by characterizing the shortest possible length of such codes. These general bounds are then simplified and applied to specific classes of functions, including Lee-local functions, Lee weight functions, and Lee weight distribution functions, leading to bounds that improve on those of Liu and Liu [6] over \mathbb{Z}_4 and generalize their other bounds to \mathbb{Z}_m in the Lee metric.
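For readers unfamiliar with the metric, the Lee weight and Lee distance over Z_m are easy to state and compute (standard definitions, not code from the paper):

```python
def lee_weight(x, m):
    """Lee weight of a symbol in Z_m: min(x mod m, m - (x mod m))."""
    x = x % m
    return min(x, m - x)

def lee_distance(u, v, m):
    """Lee distance between two equal-length words over Z_m: the sum of
    symbol-wise Lee weights of their difference."""
    return sum(lee_weight(a - b, m) for a, b in zip(u, v))
```

Over Z_4 the Lee weight of the symbols 0, 1, 2, 3 is 0, 1, 2, 1 respectively, which is why the homogeneous metric of the earlier work coincides with the Lee metric there.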
[IR-2] “Beyond the past”: Leveraging Audio and Human Memory for Sequential Music Recommendation
链接: https://arxiv.org/abs/2507.17356
作者: Viet-Tran Anh,Bruno Sguerra,Gabriel Meseguer-Brocal,Lea Briand,Manuel Moussallam
类目: Information Retrieval (cs.IR)
*备注:
Abstract:On music streaming services, listening sessions are often composed of a balance of familiar and new tracks. Recently, sequential recommender systems have adopted cognitive-informed approaches, such as Adaptive Control of Thought-Rational (ACT-R), to successfully improve the prediction of the most relevant tracks for the next user session. However, one limitation of using a model inspired by human memory (or the past), is that it struggles to recommend new tracks that users have not previously listened to. To bridge this gap, here we propose a model that leverages audio information to predict in advance the ACT-R-like activation of new tracks and incorporates them into the recommendation scoring process. We demonstrate the empirical effectiveness of the proposed model using proprietary data, which we publicly release along with the model’s source code to foster future research in this field.
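The memory component referenced above rests on ACT-R's base-level activation, which scores a track by power-law-decayed traces of its past listens (the standard ACT-R formula; the paper's contribution is predicting this quantity for new tracks from audio, which is not reproduced here):

```python
import math

def base_level_activation(times_since, d=0.5):
    """ACT-R base-level activation: log of the summed power-law-decayed
    traces of past listens, so recent and frequent tracks score higher."""
    return math.log(sum(t ** (-d) for t in times_since))
```

A track listened to 1, 2, and 3 days ago thus scores higher than one listened to 10, 20, and 30 days ago; a brand-new track has no trace at all, which is exactly the gap the paper's audio-based predictor fills.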
[IR-3] EndoFinder: Online Lesion Retrieval for Explainable Colorectal Polyp Diagnosis Leveraging Latent Scene Representations
链接: https://arxiv.org/abs/2507.17323
作者: Ruijie Yang,Yan Zhu,Peiyao Fu,Yizhe Zhang,Zhihua Wang,Quanlin Li,Pinghong Zhou,Xian Yang,Shuo Wang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Colorectal cancer (CRC) remains a leading cause of cancer-related mortality, underscoring the importance of timely polyp detection and diagnosis. While deep learning models have improved optical-assisted diagnostics, they often demand extensive labeled datasets and yield “black-box” outputs with limited interpretability. In this paper, we propose EndoFinder, an online polyp retrieval framework that leverages multi-view scene representations for explainable and scalable CRC diagnosis. First, we develop a Polyp-aware Image Encoder by combining contrastive learning and a reconstruction task, guided by polyp segmentation masks. This self-supervised approach captures robust features without relying on large-scale annotated data. Next, we treat each polyp as a three-dimensional “scene” and introduce a Scene Representation Transformer, which fuses multiple views of the polyp into a single latent representation. By discretizing this representation through a hashing layer, EndoFinder enables real-time retrieval from a compiled database of historical polyp cases, where diagnostic information serves as interpretable references for new queries. We evaluate EndoFinder on both public and newly collected polyp datasets for re-identification and pathology classification. Results show that EndoFinder outperforms existing methods in accuracy while providing transparent, retrieval-based insights for clinical decision-making. By contributing a novel dataset and a scalable, explainable framework, our work addresses key challenges in polyp diagnosis and offers a promising direction for more efficient AI-driven colonoscopy workflows. The source code is available at this https URL.
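The hashing-and-retrieval step can be sketched generically (sign-threshold hashing over random toy vectors; the paper's learned hashing layer and scene encoder are not reproduced): discretize each latent scene representation to a binary code, then rank historical cases by Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(z):
    """Hash latent representations to binary codes by sign thresholding."""
    return (z > 0).astype(np.uint8)

def hamming(a, b):
    return int((a != b).sum())

# A toy database of hashed historical polyp scene representations.
db = binarize(rng.standard_normal((100, 64)))

def retrieve(query_code, k=3):
    """Indices of the k nearest database codes in Hamming distance."""
    dists = [hamming(query_code, code) for code in db]
    return np.argsort(dists)[:k]

hits = retrieve(db[7])  # querying a stored code should return itself first
```

Because Hamming distance over short binary codes is cheap, this kind of lookup supports the real-time retrieval the paper targets, with the retrieved cases' diagnoses serving as interpretable references.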
[IR-4] Exploring the Potential of LLMs for Serendipity Evaluation in Recommender Systems RECSYS2025
链接: https://arxiv.org/abs/2507.17290
作者: Li Kang,Yuhan Zhao,Li Chen
类目: Information Retrieval (cs.IR)
*备注: RecSys2025
Abstract:Serendipity plays a pivotal role in enhancing user satisfaction within recommender systems, yet its evaluation poses significant challenges due to its inherently subjective nature and conceptual ambiguity. Current algorithmic approaches predominantly rely on proxy metrics for indirect assessment, often failing to align with real user perceptions, thus creating a gap. With large language models (LLMs) increasingly revolutionizing evaluation methodologies across various human annotation tasks, we are inspired to explore a core research proposition: Can LLMs effectively simulate human users for serendipity evaluation? To address this question, we conduct a meta-evaluation on two datasets derived from real user studies in the e-commerce and movie domains, focusing on three key aspects: the accuracy of LLMs compared to conventional proxy metrics, the influence of auxiliary data on LLM comprehension, and the efficacy of recently popular multi-LLM techniques. Our findings indicate that even the simplest zero-shot LLMs achieve parity with, or surpass, the performance of conventional metrics. Furthermore, multi-LLM techniques and the incorporation of auxiliary data further enhance alignment with human perspectives. Based on our findings, the optimal evaluation by LLMs yields a Pearson correlation coefficient of 21.5% when compared to the results of the user study. This research implies that LLMs may serve as potentially accurate and cost-effective evaluators, introducing a new paradigm for serendipity evaluation in recommender systems.
[IR-5] R4ec: A Reasoning, Reflection and Refinement Framework for Recommendation Systems RECSYS25
链接: https://arxiv.org/abs/2507.17249
作者: Hao Gu,Rui Zhong,Yu Xia,Wei Yang,Chi Lu,Peng Jiang,Kun Gai
类目: Information Retrieval (cs.IR)
*备注: Accepted by Recsys25
Abstract:Harnessing Large Language Models (LLMs) for recommendation systems has emerged as a prominent avenue, drawing substantial research interest. However, existing approaches primarily involve basic prompt techniques for knowledge acquisition, which resemble System-1 thinking. This makes these methods highly sensitive to errors in the reasoning path, where even a small mistake can lead to an incorrect inference. To this end, in this paper, we propose R^4ec, a reasoning, reflection and refinement framework that evolves the recommendation system into a weak System-2 model. Specifically, we introduce two models: an actor model that engages in reasoning, and a reflection model that judges these responses and provides valuable feedback. The actor model then refines its response based on the feedback, ultimately leading to improved responses. We employ an iterative reflection and refinement process, enabling LLMs to facilitate slow and deliberate System-2-like thinking. Ultimately, the final refined knowledge is incorporated into a recommendation backbone for prediction. We conduct extensive experiments on Amazon-Book and MovieLens-1M datasets to demonstrate the superiority of R^4ec. We also deploy R^4ec on a large-scale online advertising platform, showing a 2.2% increase in revenue. Furthermore, we investigate the scaling properties of the actor model and reflection model.
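The actor/reflection control flow described above can be sketched as a plain loop (a structural sketch only; `toy_actor` and `toy_reflector` are hypothetical stand-ins for the paper's LLM calls):

```python
def r4_loop(query, actor, reflector, max_rounds=3):
    """Reason -> reflect -> refine: the actor answers, the reflector judges
    the answer and gives feedback, and the actor refines until accepted."""
    response = actor(query, feedback=None)
    for _ in range(max_rounds):
        ok, feedback = reflector(query, response)
        if ok:
            break
        response = actor(query, feedback=feedback)
    return response

# Toy stand-ins for the two models: the reflector rejects short answers.
def toy_actor(query, feedback=None):
    return query.upper() if feedback is None else query.upper() + " (refined)"

def toy_reflector(query, response):
    return (len(response) >= 10, "answer is too short")

out = r4_loop("hi there", toy_actor, toy_reflector)
```

Bounding the number of rounds keeps the deliberate System-2-style loop from running indefinitely while still allowing at least one refinement pass.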
[IR-6] Triadic First-Order Logic Queries in Temporal Networks
链接: https://arxiv.org/abs/2507.17215
作者: Omkar Bhalerao,Yunjie Pan,C. Seshadhri,Nishil Talati
类目: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
*备注:
Abstract:Motif counting is a fundamental problem in network analysis, and there is a rich literature of theoretical and applied algorithms for this problem. Given a large input network G , a motif H is a small “pattern” graph indicative of special local structure. Motif/pattern mining involves finding all matches of this pattern in the input G . The simplest, yet challenging, case of motif counting is when H has three vertices, often called a “triadic” query. Recent work has focused on “temporal graph mining”, where the network G has edges with timestamps (and directions) and H has time constraints. Inspired by concepts in logic and database theory, we introduce the study of “thresholded First Order Logic (FOL) Motif Analysis” for massive temporal networks. A typical triadic motif query asks for the existence of three vertices that form a desired temporal pattern. An “FOL” motif query is obtained by having both existential and thresholded universal quantifiers. This allows for query semantics that can mine richer information from networks. A typical triadic query would be “find all triples of vertices u,v,w such that they form a triangle within one hour”. A thresholded FOL query can express “find all pairs u,v such that for half of w where (u,w) formed an edge, (v,w) also formed an edge within an hour”. We design the first algorithm, FOLTY, for mining thresholded FOL triadic queries. The theoretical running time of FOLTY matches the best known running time for temporal triangle counting in sparse graphs. We give an efficient implementation of FOLTY using specialized temporal data structures. FOLTY has excellent empirical behavior, and can answer triadic FOL queries on graphs with nearly 70M edges in less than an hour on commodity hardware. Our work has the potential to start a new research direction in the classic well-studied problem of motif analysis.
Subjects: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR); Social and Information Networks (cs.SI) Cite as: arXiv:2507.17215 [cs.DB] (or arXiv:2507.17215v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2507.17215
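The thresholded FOL query quoted in the abstract can be evaluated by brute force on a tiny temporal edge list (an illustrative semantics sketch, nothing like FOLTY's efficient data structures): for a pair (u, v), check whether at least half of u's timestamped neighbors w have a matching (v, w) edge within one hour.

```python
# Toy temporal edge list: (source, destination, timestamp in seconds).
edges = [("u", "w1", 0), ("u", "w2", 0), ("u", "w3", 0),
         ("v", "w1", 600), ("v", "w2", 1200), ("x", "w1", 90000)]

HOUR = 3600

def neighbors(node):
    return [(dst, t) for src, dst, t in edges if src == node]

def satisfies(u, v, frac=0.5):
    """Thresholded FOL check from the abstract's example: for at least `frac`
    of the w with an edge (u, w), does (v, w) appear within one hour of it?"""
    ws = neighbors(u)
    if not ws:
        return False
    hits = sum(
        any(w2 == w and abs(t2 - t) <= HOUR for w2, t2 in neighbors(v))
        for w, t in ws
    )
    return hits >= frac * len(ws)
```

Here v matches u on two of three neighbors within the hour (threshold met), while x's only shared neighbor falls outside the window; FOLTY answers such queries without enumerating all pairs.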
[IR-7] Enhancing Transferability and Consistency in Cross-Domain Recommendations via Supervised Disentanglement
链接: https://arxiv.org/abs/2507.17112
作者: Yuhan Wang,Qing Xie,Zhifeng Bao,Mengzi Tang,Lin Li,Yongjian Liu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Cross-domain recommendation (CDR) aims to alleviate the data sparsity by transferring knowledge across domains. Disentangled representation learning provides an effective solution to model complex user preferences by separating intra-domain features (domain-shared and domain-specific features), thereby enhancing robustness and interpretability. However, disentanglement-based CDR methods employing generative modeling or GNNs with contrastive objectives face two key challenges: (i) pre-separation strategies decouple features before extracting collaborative signals, disrupting intra-domain interactions and introducing noise; (ii) unsupervised disentanglement objectives lack explicit task-specific guidance, resulting in limited consistency and suboptimal alignment. To address these challenges, we propose DGCDR, a GNN-enhanced encoder-decoder framework. To handle challenge (i), DGCDR first applies GNN to extract high-order collaborative signals, providing enriched representations as a robust foundation for disentanglement. The encoder then dynamically disentangles features into domain-shared and -specific spaces, preserving collaborative information during the separation process. To handle challenge (ii), the decoder introduces an anchor-based supervision that leverages hierarchical feature relationships to enhance intra-domain consistency and cross-domain alignment. Extensive experiments on real-world datasets demonstrate that DGCDR achieves state-of-the-art performance, with improvements of up to 11.59% across key metrics. Qualitative analyses further validate its superior disentanglement quality and transferability. Our source code and datasets are available on GitHub for further comparison.
[IR-8] LLM 4MEA: Data-free Model Extraction Attacks on Sequential Recommenders via Large Language Models
链接: https://arxiv.org/abs/2507.16969
作者: Shilong Zhao,Fei Sun,Kaike Zhang,Shaoling Jing,Du Su,Zhichao Shi,Zhiyi Yin,Huawei Shen,Xueqi Cheng
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent studies have demonstrated the vulnerability of sequential recommender systems to Model Extraction Attacks (MEAs). MEAs collect responses from recommender systems to replicate their functionality, enabling unauthorized deployments and posing critical privacy and security risks. Black-box attacks in prior MEAs are ineffective at exposing recommender system vulnerabilities due to random sampling in data selection, which leads to misaligned synthetic and real-world distributions. To overcome this limitation, we propose LLM4MEA, a novel model extraction method that leverages Large Language Models (LLMs) as human-like rankers to generate data. It generates data through interactions between the LLM ranker and target recommender system. In each interaction, the LLM ranker analyzes historical interactions to understand user behavior, and selects items from recommendations with consistent preferences to extend the interaction history, which serves as training data for MEA. Extensive experiments demonstrate that LLM4MEA significantly outperforms existing approaches in data quality and attack performance, reducing the divergence between synthetic and real-world data by up to 64.98% and improving MEA performance by 44.82% on average. From a defensive perspective, we propose a simple yet effective defense strategy and identify key hyperparameters of recommender systems that can mitigate the risk of MEAs.