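This post lists the latest papers retrieved from arXiv.org on 2025-07-04. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.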

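Note: The daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each morning.

Tip: If you would like to receive the daily paper data by email, leave your email address in the comments.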

Contents

Overview (2025-07-04)

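442 papers were updated today, including:

  • Natural Language Processing: 50 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 116 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 99 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 133 papers (Machine Learning (cs.LG))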

Natural Language Processing

[NLP-0] Requirements Elicitation Follow-Up Question Generation

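[Quick Read]: This paper addresses the challenges interviewers face when formulating suitable follow-up questions in real time during requirements elicitation, including unfamiliarity with the domain, excessive cognitive load, and information overload. The key to the solution is to use a generative AI model, specifically GPT-4o, to generate high-quality follow-up questions from the interview content and from common interviewer mistake types, improving the efficiency and quality of interviews.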

Link: https://arxiv.org/abs/2507.02858
Authors: Yuchen Shen,Anmol Singhal,Travis Breaux
Institutions: Carnegie Mellon University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 13 pages, 2 figures, accepted at the 33rd IEEE International Requirements Engineering 2025

Abstract:Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistakes types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.

[NLP-1] Answer Matching Outperforms Multiple Choice for Language Model Evaluation

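[Quick Read]: This paper addresses the "shortcut" problem in traditional multiple-choice evaluation of language models: a model can often answer correctly without even seeing the question, which reflects an inherent limitation of discriminative evaluation. The key to the solution is answer matching, a form of generative evaluation: the candidate model receives the question without the options and generates a free-form response, and a modern language model then uses the reference answer to judge whether the response matches it. In the experiments, this approach agrees closely with human grading and aligns with it far better than both traditional multiple-choice evaluation and LLM-as-a-judge without reference answers.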

Link: https://arxiv.org/abs/2507.02856
Authors: Nikhil Chandak,Shashwat Goel,Ameya Prabhu,Moritz Hardt,Jonas Geiping
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 34 pages, Code is available at this https URL

Abstract:Multiple choice benchmarks have long been the workhorse of language model evaluation because grading multiple choice is objective and easy to automate. However, we show multiple choice questions from popular benchmarks can often be answered without even seeing the question. These shortcuts arise from a fundamental limitation of discriminative evaluation not shared by evaluations of the model’s free-form, generative answers. Until recently, there appeared to be no viable, scalable alternative to multiple choice–but, we show that this has changed. We consider generative evaluation via what we call answer matching: Give the candidate model the question without the options, have it generate a free-form response, then use a modern language model with the reference answer to determine if the response matches the reference. To compare the validity of different evaluation strategies, we annotate MMLU-Pro and GPQA-Diamond to obtain human grading data, and measure the agreement of each evaluation approach. We find answer matching using recent models–even small ones–achieves near-perfect agreement, in the range of inter-annotator agreement. In contrast, both multiple choice evaluation and using LLM-as-a-judge without reference answers aligns poorly with human grading. Improving evaluations via answer matching is not merely a conceptual concern: the rankings of several models change significantly when evaluating their free-form responses with answer matching. In light of these findings, we discuss how to move the evaluation ecosystem from multiple choice to answer matching.
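
The answer-matching procedure described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `generate` is a hypothetical callable that wraps the candidate model, and `simple_matcher` is a hypothetical stand-in that uses normalized substring matching where the paper uses a language model as the matcher.

```python
import re
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so formatting differences do not matter."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def simple_matcher(response: str, reference: str) -> bool:
    """Hypothetical stand-in for the LM-based matcher: substring match after normalization."""
    ref = normalize(reference)
    # An empty reference would match everything, so treat it as a non-match.
    return bool(ref) and ref in normalize(response)


def answer_matching_accuracy(
    items: List[Dict[str, str]],
    generate: Callable[[str], str],
    matcher: Callable[[str, str], bool] = simple_matcher,
) -> float:
    """Ask each question without options, then compare the free-form response to the reference."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for item in items:
        response = generate(item["question"])  # no answer options are shown
        if matcher(response, item["reference"]):
            correct += 1
    return correct / len(items)
```

Substring matching is far cruder than a model-based matcher (a reference of "4" would also match "14"), so a realistic setup would pass a different `matcher`; the loop structure is the part this sketch is meant to show.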

[NLP-2] MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

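[Quick Read]: This paper addresses the context-size limit that constrains large language models (LLMs) on tasks requiring many thinking/reasoning tokens. Reinforcement learning (RL) training with group relative policy optimization (GRPO) improves response quality, but a model can generate only a finite number of tokens while still attending to what it has already generated. To get past this limit, the paper proposes MOTIF (Modular Thinking via Reinforcement Finetuning), an RL fine-tuning method that generates thinking tokens over multiple rounds. Its key idea is a modular thinking strategy that lets the model extend its effective context size across several rounds of reasoning.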

Link: https://arxiv.org/abs/2507.02851
Authors: Purbesh Mitra,Sennur Ulukus
Institutions: University of Maryland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Recent advancements in the reasoning capabilities of large language models (LLMs) show that employing group relative policy optimization (GRPO) algorithm for reinforcement learning (RL) training allows the models to use more thinking/reasoning tokens for generating better responses. However, LLMs can generate only a finite amount of tokens while maintaining attention to the previously generated tokens. This limit, also known as the context size of an LLM, is a bottleneck in LLM reasoning with arbitrarily large number of tokens. To think beyond the limit of context size, an LLM must employ a modular thinking strategy to reason over multiple rounds. In this work, we propose MOTIF: Modular Thinking via Reinforcement Finetuning – an RL training method for generating thinking tokens in multiple rounds, effectively allowing the model to think with additional context size. We trained the open-source model Qwen2.5-3B-Instruct on GSM8K dataset via parameter efficient fine-tuning and tested its accuracy on MATH500 and AIME2024 benchmarks. Our experiments show 3.8% and 3.3% improvements over vanilla GRPO based training in the respective benchmarks. Furthermore, this improvement was achieved with only 15% of samples, thus demonstrating sample efficiency of MOTIF. Our code and models are available at this https URL and this https URL, respectively.
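
To make the idea of reasoning over multiple rounds concrete, here is a rough sketch of an inference-time loop in which each round sees only the question and the notes carried over from the previous round. Everything in it is an assumption made for illustration: `generate` is a hypothetical model call, the `ANSWER:` marker and the prompt layout are invented, and the RL training procedure that the paper actually contributes is not shown.

```python
from typing import Callable, Optional

ANSWER_PREFIX = "ANSWER:"  # hypothetical marker that signals a final answer


def multi_round_think(
    question: str,
    generate: Callable[[str], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Run up to max_rounds rounds, carrying only the previous round's notes forward."""
    notes = ""
    for _ in range(max_rounds):
        # Each prompt contains the question plus the latest notes, not the full history,
        # so the prompt length stays bounded no matter how many rounds are run.
        prompt = f"Question: {question}\nNotes from earlier rounds: {notes}"
        output = generate(prompt).strip()
        if output.startswith(ANSWER_PREFIX):
            return output[len(ANSWER_PREFIX):].strip()
        notes = output  # the round's output becomes the carried-over notes
    return None  # no final answer within the round budget
```

Because only the latest notes are carried forward, the total number of thinking tokens across rounds can exceed what a single context window would hold, which is the effect the abstract describes.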

[NLP-3] LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users

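[Quick Read]: This paper addresses a security vulnerability in language models (LMs) trained with user feedback: a single user can persistently alter the model's knowledge and behavior merely by supplying prompts and upvoting or downvoting model outputs. The attack exploits the model's sensitivity to feedback signals in subsequent preference tuning. The attacker prompts the model to produce either a "poisoned" or a benign response and then applies feedback to it, which makes the model more likely to produce poisoned responses even in contexts without malicious prompts and so gives the attacker control over model behavior.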

Link: https://arxiv.org/abs/2507.02850
Authors: Almog Hilel,Idan Shenfeld,Leshem Choshen,Jacob Andreas
Institutions: Massachusetts Institute of Technology; IBM Research
Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Abstract:We describe a vulnerability in language models (LMs) trained with user feedback, whereby a single user can persistently alter LM knowledge and behavior given only the ability to provide prompts and upvote / downvote feedback on LM outputs. To implement the attack, the attacker prompts the LM to stochastically output either a “poisoned” or benign response, then upvotes the poisoned response or downvotes the benign one. When feedback signals are used in a subsequent preference tuning behavior, LMs exhibit increased probability of producing poisoned responses even in contexts without malicious prompts. We show that this attack can be used to (1) insert factual knowledge the model did not previously possess, (2) modify code generation patterns in ways that introduce exploitable security flaws, and (3) inject fake financial news. Our finding both identifies a new qualitative feature of language model preference tuning (showing that even highly restricted forms of preference data can be used to exert fine-grained control over behavior), and a new attack mechanism for LMs trained with user feedback (extending work on pretraining-time data poisoning and deployment-time prompt injection).

[NLP-4] Legal Requirements Translation from Law

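[Quick Read]: This paper addresses the compliance challenges that software systems face, in particular how small organizations and startups without dedicated legal expertise can efficiently extract metadata from regulations to elicit legal requirements. The key to the solution is an approach based on textual entailment and in-context learning that automatically generates a canonical representation of legal text encodable as Python code. The representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel and captures structural and semantic legal metadata together with their interrelationships, which reduces reliance on large manually labeled datasets and improves applicability to new legislation.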

Link: https://arxiv.org/abs/2507.02846
Authors: Anmol Singhal,Travis Breaux
Institutions: Carnegie Mellon University
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 13 pages, 7 figures, Accepted at the 33rd IEEE International Requirements Engineering 2025

Abstract:Software systems must comply with legal regulations, which is a resource-intensive task, particularly for small organizations and startups lacking dedicated legal expertise. Extracting metadata from regulations to elicit legal requirements for software is a critical step to ensure compliance. However, it is a cumbersome task due to the length and complex nature of legal text. Although prior work has pursued automated methods for extracting structural and semantic metadata from legal text, key limitations remain: they do not consider the interplay and interrelationships among attributes associated with these metadata types, and they rely on manual labeling or heuristic-driven machine learning, which does not generalize well to new documents. In this paper, we introduce an approach based on textual entailment and in-context learning for automatically generating a canonical representation of legal text, encodable and executable as Python code. Our representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel, capturing both structural and semantic legal metadata and their interrelationships. This design choice reduces the need for large, manually labeled datasets and enhances applicability to unseen legislation. We evaluate our approach on 13 U.S. state data breach notification laws, demonstrating that our generated representations pass approximately 89.4% of test cases and achieve a precision and recall of 82.2 and 88.7, respectively.
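
As a rough picture of what a Python class structure serving as a metamodel for legal text might look like, consider the sketch below. The class names, fields, and the example provision are all invented for illustration; the abstract does not describe the authors' actual class design, and the 45-day deadline is made up rather than taken from any statute.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Obligation:
    """Hypothetical semantic metadata: who must do what, for whom, and under which conditions."""
    actor: str
    action: str
    recipient: str
    deadline_days: Optional[int] = None
    conditions: List[str] = field(default_factory=list)


@dataclass
class Section:
    """Hypothetical structural metadata: a citable unit of a law that holds obligations."""
    citation: str
    obligations: List[Obligation] = field(default_factory=list)

    def actors(self) -> List[str]:
        """Return the distinct actors that carry obligations in this section."""
        return sorted({o.actor for o in self.obligations})


# An invented provision encoded as executable Python, for illustration only.
section = Section(
    citation="(hypothetical) Sec. 1",
    obligations=[
        Obligation(
            actor="data owner",
            action="notify",
            recipient="affected residents",
            deadline_days=45,  # made-up value
            conditions=["breach of personal information"],
        )
    ],
)
```

Encoding a provision as objects like these is what makes test cases possible: a check such as `section.obligations[0].deadline_days == 45` can be run directly against the generated representation.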

[NLP-5] Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection

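[Quick Read]: This paper addresses the security vulnerabilities of the visual modality that multimodal large language models (MLLMs) face when deployed in open-world environments, in particular attacks that use visual information to induce harmful responses. The key contribution is an attack named VisCo (Visual Contextual), which treats visual information as the core of a complete, realistic jailbreak context. It fabricates contextual dialogue using four visual-focused strategies, dynamically generates auxiliary images when necessary, and applies automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target model.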

Link: https://arxiv.org/abs/2507.02844
Authors: Ziqi Miao,Yi Ding,Lijun Li,Jing Shao
Institutions: Shanghai Artificial Intelligence Laboratory; Purdue University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: 16 pages

Abstract:With the emergence of strong visual-language capabilities, multimodal large language models (MLLMs) have demonstrated tremendous potential for real-world applications. However, the security vulnerabilities exhibited by the visual modality pose significant challenges to deploying such models in open-world environments. Recent studies have successfully induced harmful responses from target MLLMs by encoding harmful textual semantics directly into visual inputs. However, in these approaches, the visual modality primarily serves as a trigger for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding in realistic scenarios. In this work, we define a novel setting: visual-centric jailbreak, where visual information serves as a necessary component in constructing a complete and realistic jailbreak context. Building on this setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates contextual dialogue using four distinct visual-focused strategies, dynamically generating auxiliary images when necessary to construct a visual-centric jailbreak scenario. To maximize attack effectiveness, it incorporates automatic toxicity obfuscation and semantic refinement to produce a final attack prompt that reliably triggers harmful responses from the target black-box MLLMs. Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming the baseline, which performs a toxicity score of 2.48 and an ASR of 22.2%. The code is available at this https URL.

[NLP-6] StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason

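[Quick Read]: This paper addresses two key problems in reinforcement learning with verifiable rewards (RLVR): the near-miss reward problem and exploration stagnation. In the near-miss reward problem, a single small mistake invalidates a reasoning process whose steps are otherwise mostly correct, which severely hurts training efficiency. In exploration stagnation, the model tends to stay with solutions inside its "comfort zone" and lacks the motivation to explore better ones. The proposed solution is the StepHint algorithm, whose key idea is to use multi-level stepwise hints to guide the model through the solution space more effectively. StepHint generates valid reasoning chains from stronger models, partitions them into reasoning steps with an adaptive partitioning method, uses the first few steps as hints, and provides hints at multiple levels (each with a different number of steps). This steers the model toward promising solution subspaces while preserving its flexibility for independent exploration, mitigating the near-miss reward problem and improving exploration.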

Link: https://arxiv.org/abs/2507.02841
Authors: Kaiyi Zhang,Ang Lv,Jinpeng Li,Yongbo Wang,Feng Wang,Haoyuan Hu,Rui Yan
Institutions: GSAI, Renmin University of China; Peking University; Ant Group; SCS, Wuhan University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving the complex reasoning abilities of large language models (LLMs). However, current RLVR methods face two significant challenges: the near-miss reward problem, where a small mistake can invalidate an otherwise correct reasoning process, greatly hindering training efficiency; and exploration stagnation, where models tend to focus on solutions within their “comfort zone,” lacking the motivation to explore potentially more effective alternatives. To address these challenges, we propose StepHint, a novel RLVR algorithm that utilizes multi-level stepwise hints to help models explore the solution space more effectively. StepHint generates valid reasoning chains from stronger models and partitions these chains into reasoning steps using our proposed adaptive partitioning method. The initial few steps are used as hints, and simultaneously, multiple-level hints (each comprising a different number of steps) are provided to the model. This approach directs the model's exploration toward a promising solution subspace while preserving its flexibility for independent exploration. By providing hints, StepHint mitigates the near-miss reward problem, thereby improving training efficiency. Additionally, the external reasoning pathways help the model develop better reasoning abilities, enabling it to move beyond its “comfort zone” and mitigate exploration stagnation. StepHint outperforms competitive RLVR enhancement methods across six mathematical benchmarks, while also demonstrating superior generalization and excelling over baselines on out-of-domain benchmarks.
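
The notion of multi-level hints, each comprising a different number of leading steps, can be illustrated with a small helper. This sketch assumes the reasoning chain has already been partitioned into steps (the paper's adaptive partitioning method is not reproduced here), and the choices to space the levels evenly and to withhold the final step are assumptions of this example, not details from the abstract.

```python
from typing import List


def multi_level_hints(steps: List[str], num_levels: int) -> List[List[str]]:
    """Return num_levels prefixes of the chain, from no hint up to all but the last step."""
    if num_levels < 1:
        raise ValueError("num_levels must be at least 1")
    if not steps:
        raise ValueError("steps must not be empty")
    max_hint = len(steps) - 1  # assumption: the final step is never given away
    denom = max(num_levels - 1, 1)  # avoids division by zero when there is one level
    # Level 0 gives no hint; the highest level gives the first max_hint steps.
    return [steps[: (level * max_hint) // denom] for level in range(num_levels)]
```

For a four-step chain and four levels this yields prefixes of length 0, 1, 2, and 3, so the model sees the same problem with progressively more of the reference reasoning filled in.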

[NLP-7] ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

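[Quick Read]: This paper addresses the limited reasoning gains of reinforcement learning (RL) post-training that arise because the model depends on its initial ability to produce positive samples, a problem that is most acute in early-stage RL training and on hard reasoning tasks where positive samples are unlikely to be generated. The key to the solution is a framework called Self-Explanation Policy Optimization (ExPO), which obtains effective positive samples by conditioning generation on the ground-truth answer. The resulting samples are likely under the current policy and also increase the model's likelihood of predicting the correct answer, enabling more efficient exploration and higher-quality reasoning trajectories.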

Link: https://arxiv.org/abs/2507.02834
Authors: Ruiyang Zhou,Shuozhe Li,Amy Zhang,Liu Leqi
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model’s initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model’s likelihood of predicting the correct answer. Based on these insights, we propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.

[NLP-8] Generalizing Verifiable Instruction Following

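[Quick Read]: This paper addresses the poor generalization of language models and chatbots in precise instruction following, especially on unseen output constraints. The key to the solution is a new benchmark, IFBench, which evaluates generalization on 58 new, diverse, and challenging verifiable output constraints. The authors also design constraint verification modules and apply reinforcement learning with verifiable rewards (RLVR), which significantly improves precise instruction following.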

Link: https://arxiv.org/abs/2507.02833
Authors: Valentina Pyatkin,Saumya Malik,Victoria Graf,Hamish Ivison,Shengyi Huang,Pradeep Dasigi,Nathan Lambert,Hannaneh Hajishirzi
Institutions: Allen Institute for Artificial Intelligence; University of Washington
Categories: Computation and Language (cs.CL)
Comments: 11 pages

Abstract:A crucial factor for successful human and AI interaction is the ability of language models or chatbots to follow human instructions precisely. A common feature of instructions are output constraints like “only answer with yes or no” or “mention the word ‘abrakadabra’ at least 3 times” that the user adds to craft a more useful answer. Even today’s strongest models struggle with fulfilling such constraints. We find that most models strongly overfit on a small set of verifiable constraints from the benchmarks that test these abilities, a skill called precise instruction following, and are not able to generalize well to unseen output constraints. We introduce a new benchmark, IFBench, to evaluate precise instruction following generalization on 58 new, diverse, and challenging verifiable out-of-domain constraints. In addition, we perform an extensive analysis of how and on what data models can be trained to improve precise instruction following generalization. Specifically, we carefully design constraint verification modules and show that reinforcement learning with verifiable rewards (RLVR) significantly improves instruction following. In addition to IFBench, we release 29 additional new hand-annotated training constraints and verification functions, RLVR training prompts, and code.
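
The two example constraints quoted in this abstract are "verifiable" in the sense that a short function can check them. The sketch below shows what such checks might look like; the function names and the fraction-satisfied reward are assumptions of this example and are not taken from IFBench or its released verification functions.

```python
import re
from typing import Callable, List


def only_yes_or_no(response: str) -> bool:
    """Check the constraint 'only answer with yes or no' (trailing '.' or '!' tolerated)."""
    return response.strip().strip(".!").lower() in {"yes", "no"}


def mentions_word_at_least(response: str, word: str, n: int) -> bool:
    """Check that `word` appears as a whole word at least n times, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= n


def verifiable_reward(response: str, checks: List[Callable[[str], bool]]) -> float:
    """Hypothetical reward: the fraction of constraint checks that the response satisfies."""
    if not checks:
        raise ValueError("checks must not be empty")
    return sum(check(response) for check in checks) / len(checks)
```

A reward of this shape needs no learned judge, which is what makes constraints like these usable as an RLVR training signal; whether to score partial satisfaction or require all checks to pass is a design choice this sketch leaves open.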

[NLP-9] SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model

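[Quick Read]: This paper addresses the trade-off between performance and operational cost when selecting a large language model (LLM) for practical applications, a trade-off sharpened by reasoning-capable models that widen the cost gap between "thinking" (high reasoning) and "non-thinking" (fast, low-cost) modes. The key to the solution is SynapseRoute, a machine learning-based dynamic routing framework that assigns each input query to the appropriate mode according to problem complexity, optimizing accuracy, cost-efficiency, and overall user experience. Experiments show that SynapseRoute improves overall accuracy while substantially reducing inference time and token consumption, and the paper introduces the Accuracy-Inference-Token (AIT) index to evaluate the trade-offs among accuracy, latency, and token cost.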

Link: https://arxiv.org/abs/2507.02822
Authors: Wencheng Zhang,Shiqin Qiao,Lingjie Luo,Yinfeng Li,Chuanyang Zheng,Qian Xu,Meng Li,Yong Gui,Yijun He,Jianing Qiu,Jindong Hong,Jiankai Sun
Institutions: Bytedance; Xidian University; The Chinese University of Hong Kong; Peking University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:With the widespread adoption of large language models (LLMs) in practical applications, selecting an appropriate model requires balancing not only performance but also operational cost. The emergence of reasoning-capable models has further widened the cost gap between “thinking” (high reasoning) and “non-thinking” (fast, low-cost) modes. In this work, we reveal that approximately 58% of medical questions can be accurately answered by the non-thinking mode alone, without requiring the high-cost reasoning process. This highlights a clear dichotomy in problem complexity and suggests that dynamically routing queries to the appropriate mode based on complexity could optimize accuracy, cost-efficiency, and overall user experience. Based on this, we further propose SynapseRoute, a machine learning-based dynamic routing framework that intelligently assigns input queries to either thinking or non-thinking modes. Experimental results on several medical datasets demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs. 0.8272) compared to the thinking mode alone but also reduces inference time by 36.8% and token consumption by 39.66%. Importantly, qualitative analysis indicates that over-reasoning on simpler queries can lead to unnecessary delays and even decreased accuracy, a pitfall avoided by our adaptive routing. Finally, this work further introduces the Accuracy-Inference-Token (AIT) index to comprehensively evaluate the trade-offs among accuracy, latency, and token cost.
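
Routing a query to a thinking or non-thinking mode according to its complexity can be sketched as a score plus a threshold. This is a toy illustration only: the paper's router is a learned model, whereas `toy_complexity` below is a hypothetical word-count heuristic, and the 0.5 threshold is an arbitrary assumption.

```python
from typing import Callable


def toy_complexity(query: str) -> float:
    """Hypothetical complexity score in [0, 1]: longer questions score higher."""
    return min(len(query.split()) / 40.0, 1.0)


def route(
    query: str,
    complexity: Callable[[str], float] = toy_complexity,
    threshold: float = 0.5,
) -> str:
    """Send queries at or above the threshold to the thinking mode, the rest to non-thinking."""
    return "thinking" if complexity(query) >= threshold else "non-thinking"
```

Swapping `toy_complexity` for a trained classifier leaves `route` unchanged, and moving the threshold trades cost against accuracy: a lower value sends more queries through the expensive thinking mode.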

[NLP-10] Multimodal Mathematical Reasoning with Diverse Solving Perspective

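[Quick Read]: This paper addresses the narrow, undiversified reasoning perspectives that result when multimodal large language models (MLLMs) for mathematical reasoning rely on one-to-one image-text pairs and single-solution supervision. The key to the solution is MathV-DP, a new dataset containing multiple diverse solution trajectories, together with Qwen-VL-DP, a model built on Qwen-VL that is fine-tuned with supervised learning and enhanced with group relative policy optimization (GRPO), a rule-based reinforcement learning approach. Correctness discrimination and diversity-aware reward functions improve learning, raising both the accuracy and the generative diversity of the model's mathematical reasoning.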

Link: https://arxiv.org/abs/2507.02804
Authors: Wenhao Shi,Zhiqiang Hu,Yi Bin,Yang Yang,See-Kiong Ng,Heng Tao Shen
Institutions: University of Electronic Science and Technology of China; Singapore University of Technology and Design; Tongji University; National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: 8 pages

Abstract:Recent progress in large-scale reinforcement learning (RL) has notably enhanced the reasoning capabilities of large language models (LLMs), especially in mathematical domains. However, current multimodal LLMs (MLLMs) for mathematical reasoning often rely on one-to-one image-text pairs and single-solution supervision, overlooking the diversity of valid reasoning perspectives and internal reflections. In this work, we introduce MathV-DP, a novel dataset that captures multiple diverse solution trajectories for each image-question pair, fostering richer reasoning supervision. We further propose Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and enhanced via group relative policy optimization (GRPO), a rule-based RL approach that integrates correctness discrimination and diversity-aware reward functions. Our method emphasizes learning from varied reasoning perspectives and distinguishing between correct yet distinct solutions. Extensive experiments on the MathVista’s minitest and Math-V benchmarks demonstrate that Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and generative diversity, highlighting the importance of incorporating diverse perspectives and reflective reasoning in multimodal mathematical reasoning.

[NLP-11] Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models

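[Quick Read]: This paper examines how robust generative language models are to social bias once reasoning capabilities are introduced, and in particular whether reasoning mechanisms improve fairness and safety. The key to the approach is to use the CLEAR-Bias benchmark to evaluate the adversarial robustness of Reasoning Language Models (RLMs), combining automated safety scoring with jailbreak techniques to analyze bias elicitation under different reasoning mechanisms and so reveal the nuanced relationship between reasoning capabilities and bias safety.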

Link: https://arxiv.org/abs/2507.02799
Authors: Riccardo Cantini,Nicola Gabriele,Alessio Orsino,Domenico Talia
Institutions: University of Calabria
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks through mechanisms such as Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark, originally designed for Large Language Models (LLMs), to investigate the adversarial robustness of RLMs to bias elicitation. We systematically evaluate state-of-the-art RLMs across diverse sociocultural dimensions, using an LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak techniques to assess the strength of built-in safety mechanisms. Our evaluation addresses three key questions: (i) how the introduction of reasoning capabilities affects model fairness and robustness; (ii) whether models fine-tuned for reasoning exhibit greater safety than those relying on CoT prompting at inference time; and (iii) how the success rate of jailbreak attacks targeting bias elicitation varies with the reasoning mechanisms employed. Our findings reveal a nuanced relationship between reasoning capabilities and bias safety. Surprisingly, models with explicit reasoning, whether via CoT prompting or fine-tuned reasoning traces, are generally more vulnerable to bias elicitation than base models without such mechanisms, suggesting reasoning may unintentionally open new pathways for stereotype reinforcement. Reasoning-enabled models appear somewhat safer than those relying on CoT prompting, which are particularly prone to contextual reframing attacks through storytelling prompts, fictional personas, or reward-shaped instructions. These results challenge the assumption that reasoning inherently improves robustness and underscore the need for more bias-aware approaches to reasoning design.

[NLP-12] From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding

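[Quick Read]: This paper addresses the incoherent outputs of existing automatic video editing methods, which rely on textual cues and neglect visual context. The key to the solution is a human-inspired automatic video editing framework (HIVE) based on multimodal narrative understanding. It uses character extraction, dialogue analysis, and narrative summarization to build a holistic understanding of the video, and combines scene-level segmentation with three subtasks (highlight detection, opening/ending selection, and pruning of irrelevant content) to make the edited result more coherent.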

Link: https://arxiv.org/abs/2507.02790
Authors: Xiangfeng Wang,Xiao Li,Yadong Wei,Xueyu Song,Yang Song,Xiaoqiang Xia,Fangrui Zeng,Zaiyi Chen,Liu Liu,Gu Xu,Tong Xu
Institutions: University of Science and Technology of China; ByteDance China
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:

Abstract:The rapid growth of online video content, especially on short video platforms, has created a growing demand for efficient video editing techniques that can condense long-form videos into concise and engaging clips. Existing automatic editing methods predominantly rely on textual cues from ASR transcripts and end-to-end segment selection, often neglecting the rich visual context and leading to incoherent outputs. In this paper, we propose a human-inspired automatic video editing framework (HIVE) that leverages multimodal narrative understanding to address these limitations. Our approach incorporates character extraction, dialogue analysis, and narrative summarization through multimodal large language models, enabling a holistic understanding of the video content. To further enhance coherence, we apply scene-level segmentation and decompose the editing process into three subtasks: highlight detection, opening/ending selection, and pruning of irrelevant content. To facilitate research in this area, we introduce DramaAD, a novel benchmark dataset comprising over 800 short drama episodes and 500 professionally edited advertisement clips. Experimental results demonstrate that our framework consistently outperforms existing baselines across both general and advertisement-oriented editing tasks, significantly narrowing the quality gap between automatic and human-edited videos.

[NLP-13] Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

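[Quick Read]: This paper addresses a systematic "Self-Correction Blind Spot" in autoregressive language models, namely that a model fails to correct errors in its own output. The key to the approach is Self-Correction Bench, a framework that measures the phenomenon systematically and finds that training data composition is an important contributing factor. The study also shows that simply appending "Wait" markedly reduces the blind spot rate, which indicates that the models can self-correct but need a suitable trigger to activate the capability.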

Link: https://arxiv.org/abs/2507.02778
Authors: Ken Tsui
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 31 pages, 18 figures

Abstract:Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify error in user input, they exhibit a systematic ‘Self-Correction Blind Spot’ - failing to correct identical error in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple evidences that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending “Wait” reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

[NLP-14] Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens

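[Quick Read]: This paper asks how precisely the control mechanisms of human vowel production operate in auditory space, specifically how far apart two vowel stimuli must lie in auditory space to yield reliably different imitations. This distance is termed the "Just Producible Difference" (JPD). The key to the approach is a vowel mimicry paradigm that measures JPD among English speakers during front vowel production, giving the first measurement of this quantity. JPD is estimated at between 14 and 51 mels in F1 X F2 space, a result with implications for episodic theories of speech production and for understanding the structure of human vowel systems.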

Link: https://arxiv.org/abs/2507.02744
Authors: Peter Viechnicki
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:A body of work over the past several decades has demonstrated that the complex and coordinated articulatory movements of human vowel production are governed (at least in part) by control mechanisms whose targets are regions of auditory space. Within the target region control at the sub-phonemic level has also been demonstrated. But the degree of accuracy of that control is unknown. The current work investigates this question by asking how far apart must two vowel stimuli lie in auditory space in order to yield reliably different imitations? This distance is termed ‘Just Producible Difference’ (JPD). The current study uses a vowel mimicry paradigm to derive the first measurement of JPD among two sets of English speakers during front vowel production. JPD is estimated at between 14 and 51 mels in F1 X F2 space. This finding has implications for episodic theories of speech production. It also clarifies the possible structures of human vowel systems, by setting a theoretical lower bound for how close two vowel phonemes may be in a speaker’s formant space, and hence a psychophysical explanation of observed trends in number and patterns of possible vowel phonemes.
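
To give a sense of what a distance in mels in F1 X F2 space means, the sketch below converts formant frequencies from Hz to mels and measures the Euclidean distance between two vowels. The conversion uses the widely cited formula m = 2595 · log10(1 + f / 700); the abstract does not say which mel formula or distance measure the study uses, so both are assumptions of this example.

```python
import math
from typing import Tuple


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels with the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_distance(vowel_a: Tuple[float, float], vowel_b: Tuple[float, float]) -> float:
    """Euclidean distance in mels between two vowels given as (F1, F2) pairs in Hz."""
    d1 = hz_to_mel(vowel_a[0]) - hz_to_mel(vowel_b[0])
    d2 = hz_to_mel(vowel_a[1]) - hz_to_mel(vowel_b[1])
    return math.hypot(d1, d2)
```

Under these assumptions, two stimuli whose `mel_distance` falls below the reported 14 to 51 mel range would not be expected to produce reliably different imitations.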

[NLP-15] Early Signs of Steganographic Capabilities in Frontier LLMs

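[Quick read]: This paper addresses the risk of steganography in large language model (LLM) outputs, where a model evades monitoring by embedding hidden information in seemingly benign generations. The key to the approach is evaluating frontier LLMs on two capabilities, passing encoded messages and performing encoded reasoning, and analyzing whether models can hide information under different monitoring conditions. Current models struggle to hide short messages under standard affordances but can succeed when given additional affordances such as an unmonitored scratchpad and a coordinated encoding scheme. This indicates nascent steganographic capabilities that are likely not yet sufficient to bypass well-designed monitors.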

Link: https://arxiv.org/abs/2507.02737
Authors: Artur Zolkowski, Kei Nishimura-Gasparian, Robert McCarthy, Roland S. Zimmermann, David Lindner
Affiliations: ETH Zurich; University College London; Google DeepMind
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks from misuse and misalignment. However, LLMs could evade monitoring through steganography: Encoding hidden information within seemingly benign generations. In this paper, we evaluate the steganography capabilities in frontier LLMs to better understand the risk they pose. We focus on two types of steganography: passing encoded messages and performing encoded reasoning. We find that current models are unable to encode short messages in their outputs without a monitor noticing under standard affordances. They can succeed, however, if given additional affordances such as using an unmonitored scratchpad and coordinating on what encoding scheme to use. We additionally find early signs that models can perform basic encoded reasoning in a simple state-tracking problem. This includes some ability to reason with their own and pre-defined schemes, including encoding schemes such as Hexadecimal. Despite this, they can rarely hide reasoning subtly within a cover task to fool a monitor. Overall, our results indicate that current LLMs exhibit nascent steganographic capabilities. While these capabilities are likely insufficient to bypass well-designed monitors at present, this could change in the future.

[NLP-16] Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers

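[Quick read]: This paper addresses the strain that growing publication volume places on scientific peer review, and in particular the understudied potential of LLMs to identify the limitations of a paper. The key to the approach is LimitGen, the first comprehensive benchmark for evaluating how well LLMs can support early-stage feedback and complement human peer review. The benchmark has two subsets, LimitGen-Syn and LimitGen-Human, and the authors further augment LLM systems with literature retrieval so that they give more concrete and constructive feedback on research papers.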

Link: https://arxiv.org/abs/2507.02694
Authors: Zhijian Xu, Yilun Zhao, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Affiliations: Yale University; TCS Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, for studying limitations, we present LimitGen, the first comprehensive benchmark for evaluating LLMs’ capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding identifying limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.

[NLP-17] Exploring Gender Bias Beyond Occupational Titles

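[Quick read]: This paper studies the correlation between gender and contextual biases, focusing on action verbs, object nouns, and occupations. The key to the approach is a new dataset, GenderLexicon, together with a framework that estimates contextual bias and its related gender bias. The model expresses the bias as a score, which improves the explainability of gender bias.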

Link: https://arxiv.org/abs/2507.02679
Authors: Ahmed Sabir, Rajesh Sharama
Affiliations: University of Tartu, Institute of Computer Science, Estonia
Subjects: Computation and Language (cs.CL)
Comments: Work in progress

Abstract:In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.

[NLP-18] ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning INTERSPEECH2025

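[Quick read]: This paper targets a weakness of the standard Transformer in audio self-supervised representation learning: its attention mechanism assigns part of the attention weight to irrelevant information, which can hurt the discriminative ability of the model. The key to the approach is a differential attention mechanism that combines dual-softmax operations with appropriately tuned differential coefficients to mitigate this ineffective attention allocation.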

Link: https://arxiv.org/abs/2507.02666
Authors: Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang
Affiliations: Huiyan Technology (Tianjin) Co., Ltd; Shenzhen Institute of Advanced Technology
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Accepted at Interspeech2025

Abstract:In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model’s discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA’s effectiveness in audio tasks, paving the way for broader applications.
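
As a rough illustration of the dual-softmax idea, the sketch below computes (softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d))·V in plain Python. This general form is an assumption: the abstract does not give ASDA's exact formulation, how the two query/key projections are obtained, or how the coefficient `lam` is tuned, and the toy inputs are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products Q K^T / sqrt(d)."""
    d = len(keys[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
        for q in queries
    ]

def differential_attention(q1, k1, q2, k2, values, lam):
    """Apply the difference of two attention maps, a1 - lam * a2, to the values."""
    a1 = attention_weights(q1, k1)
    a2 = attention_weights(q2, k2)
    out = []
    for r1, r2 in zip(a1, a2):
        w = [x - lam * y for x, y in zip(r1, r2)]
        out.append([sum(wj * v[c] for wj, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

# Toy inputs: 2 tokens, head dimension 2 (hypothetical numbers).
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(differential_attention(q1, k1, q2, k2, v, lam=0.5))
```

The second map acts as an estimate of attention that is common to all positions; subtracting a scaled copy of it leaves less weight on tokens that both maps attend to indiscriminately.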

[NLP-19] OmniDraft: A Cross-vocabulary Online Adaptive Drafter for On-device Speculative Decoding

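[Quick read]: This paper addresses two problems of speculative decoding in online deployment: the target model may be incompatible with the draft model, and latency is expected to keep improving with usage and time. The key to the approach is the OmniDraft framework, which uses an online n-gram cache with hybrid distillation fine-tuning to resolve the cross-vocabulary mismatch between draft and target models, and adaptive drafting techniques to further improve decoding speed. As a result, a single draft model can work with any target model and adapt dynamically to user data.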

Link: https://arxiv.org/abs/2507.02659
Authors: Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the “one drafter for all” paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

[NLP-20] Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search

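[Quick read]: This paper addresses complex information needs in real-world search, which require deep reasoning and knowledge synthesis across diverse sources and which traditional retrieval-augmented generation (RAG) pipelines handle inefficiently and with limited scalability. The key to the approach is HiRA, a framework that separates strategic planning from specialized execution: it decomposes a complex search task into focused subtasks, assigns each subtask to a domain-specific agent equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism.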

Link: https://arxiv.org/abs/2507.02652
Authors: Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yang Zhao, Hongjin Qian, Zhicheng Dou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages

Abstract:Complex information needs in real-world search scenarios demand deep reasoning and knowledge synthesis across diverse sources, which traditional retrieval-augmented generation (RAG) pipelines struggle to address effectively. Current reasoning-based approaches suffer from a fundamental limitation: they use a single model to handle both high-level planning and detailed execution, leading to inefficient reasoning and limited scalability. In this paper, we introduce HiRA, a hierarchical framework that separates strategic planning from specialized execution. Our approach decomposes complex search tasks into focused subtasks, assigns each subtask to domain-specific agents equipped with external tools and reasoning capabilities, and coordinates the results through a structured integration mechanism. This separation prevents execution details from disrupting high-level reasoning while enabling the system to leverage specialized expertise for different types of information processing. Experiments on four complex, cross-modal deep search benchmarks demonstrate that HiRA significantly outperforms state-of-the-art RAG and agent-based systems. Our results show improvements in both answer quality and system efficiency, highlighting the effectiveness of decoupled planning and execution for multi-step information seeking tasks. Our code is available at this https URL.

[NLP-21] Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

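[Quick read]: This paper asks whether large language models (LLMs) can be regarded as a new form of strategic intelligence, able to reason about goals in competitive settings. The key to the approach is a series of evolutionary Iterated Prisoner’s Dilemma (IPD) tournaments that pit canonical strategies against agents from the leading frontier AI companies, varying the termination probability of each tournament (the “shadow of the future”) to introduce complexity and chance so that models cannot rely on memorisation. The results show that LLMs are highly competitive in these complex ecosystems, exhibit distinctive “strategic fingerprints”, and actively reason about both the time horizon and the likely strategy of their opponent.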

Link: https://arxiv.org/abs/2507.02618
Authors: Kenneth Payne, Baptiste Alloui-Cros
Affiliations: King’s College London; University of Oxford
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Comments: 29 pages, 27 tables, 4 figures

Abstract:Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner’s Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the “shadow of the future”), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent “strategic fingerprints”: Google’s Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI’s models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic’s Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent’s likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
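
The match structure described in the abstract can be sketched as follows: two strategies play repeated rounds, and after each round the match ends with a fixed termination probability. The payoff values (5, 3, 1, 0) are the textbook Prisoner's Dilemma payoffs and are an assumption here, as are the function names; the tournament and evolutionary layers of the paper are not reproduced.

```python
import random

# Assumed textbook payoffs: (my move, their move) -> (my score, their score).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(own_history, opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opp_history[-1] if opp_history else "C"

def grim_trigger(own_history, opp_history):
    """Cooperate until the opponent defects once, then defect forever."""
    return "D" if "D" in opp_history else "C"

def always_defect(own_history, opp_history):
    return "D"

def play_match(strat_a, strat_b, termination_prob, rng):
    """Play rounds until a random stop; return (score_a, score_b, rounds)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    while True:
        a = strat_a(hist_a, hist_b)
        b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(a, b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(a)
        hist_b.append(b)
        # The "shadow of the future": each round may be the last.
        if rng.random() < termination_prob:
            return score_a, score_b, len(hist_a)

rng = random.Random(0)
print(play_match(tit_for_tat, always_defect, termination_prob=0.1, rng=rng))
```

A lower `termination_prob` lengthens the expected match, which makes retaliation more costly for defectors; a higher value shortens it and favors early defection.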

[NLP-22] MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion ICML2025

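[Quick read]: This paper addresses bias in the outputs of large language models (LLMs), in particular how a post-training alignment framework can make outputs fairer and more interpretable. The key to the approach is Multiperspective Fusion (MPF), built on the SAGED pipeline, which uses multiperspective generations to expose biases in LLM outputs and align them with nuanced, humanlike baselines. MPF decomposes a baseline (such as sentiment distributions from HR professionals) into interpretable perspective components, then samples and balances responses weighted by the probabilities obtained in the decomposition. Experiments show small KL divergence, reduced calibration error, and generalization to unseen questions, which makes MPF a scalable and interpretable method for alignment and bias mitigation.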

Link: https://arxiv.org/abs/2507.02595
Authors: Xin Guan, PeiHsin Lin, Zekun Wu, Ze Wang, Ruibo Zhang, Emre Kazim, Adriano Koshiyama
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at ICML 2025 AIW Workshop

Abstract:Multiperspective Fusion (MPF) is a novel posttraining alignment framework for large language models (LLMs) developed in response to the growing need for easy bias mitigation. Built on top of the SAGED pipeline, an automated system for constructing bias benchmarks and extracting interpretable baseline distributions, MPF leverages multiperspective generations to expose and align biases in LLM outputs with nuanced, humanlike baselines. By decomposing a baseline, such as sentiment distributions from HR professionals, into interpretable perspective components, MPF guides generation through sampling and balancing of responses, weighted by the probabilities obtained in the decomposition. Empirically, we demonstrate its ability to align LLM sentiment distributions with both counterfactual baselines (absolute equality) and the HR baseline (biased for Top University), resulting in small KL divergence, reduction of calibration error and generalization to unseen questions. This shows that MPF offers a scalable and interpretable method for alignment and bias mitigation, compatible with deployed LLMs and requiring no extensive prompt engineering or finetuning.
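
The evaluation idea, comparing a weighted mixture of perspective distributions against a baseline by KL divergence, can be sketched as below. Everything specific is an assumption made for illustration: the three sentiment bins, the numbers, the mixture weights, and the direction KL(baseline || model); the abstract does not specify how MPF computes its decomposition.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mix(components, weights):
    """Weighted mixture of discrete distributions (weights should sum to 1)."""
    bins = len(components[0])
    return [sum(w * c[i] for w, c in zip(weights, components)) for i in range(bins)]

# Hypothetical sentiment distributions over (negative, neutral, positive).
baseline = [0.4, 0.3, 0.3]
perspectives = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
weights = [0.5, 0.5]  # assumed output of a decomposition step

fused = mix(perspectives, weights)
print("KL to baseline, single perspective:", kl_divergence(baseline, perspectives[0]))
print("KL to baseline, fused mixture:     ", kl_divergence(baseline, fused))
```

In this toy case the fused mixture sits much closer to the baseline than either perspective alone, which is the kind of gap a small KL divergence would report.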

[NLP-23] Revisiting Active Learning under (Human) Label Variation

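[Quick read]: This paper addresses the scarcity of high-quality labeled data in supervised learning, with a focus on label variation (LV) and on the fact that human label variation (HLV) is overlooked as an informative signal. Traditional active learning (AL) methods usually assume a single ground-truth label and ignore HLV, which limits them in practice. The key to the approach is decomposing observed LV into signal (e.g., HLV) and noise (e.g., annotation error) and proposing a conceptual framework that incorporates HLV throughout the AL loop, including instance selection, annotator choice, and label representation. The paper also discusses large language models (LLMs) as annotators.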

Link: https://arxiv.org/abs/2507.02593
Authors: Cornelia Gruber, Helen Alber, Bernd Bischl, Göran Kauermann, Barbara Plank, Matthias Aßenmacher
Affiliations: LMU Munich, Department of Statistics, Germany; Munich Center for Machine Learning (MCML), Germany; LMU Munich, Center for Information and Language Processing (CIS), Germany
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Access to high-quality labeled data remains a limiting factor in applied supervised learning. While label variation (LV), i.e., differing labels for the same instance, is common, especially in natural language processing, annotation frameworks often still rest on the assumption of a single ground truth. This overlooks human label variation (HLV), the occurrence of plausible differences in annotations, as an informative signal. Similarly, active learning (AL), a popular approach to optimizing the use of limited annotation budgets in training ML models, often relies on at least one of several simplifying assumptions, which rarely hold in practice when acknowledging HLV. In this paper, we examine foundational assumptions about truth and label nature, highlighting the need to decompose observed LV into signal (e.g., HLV) and noise (e.g., annotation error). We survey how the AL and (H)LV communities have addressed – or neglected – these distinctions and propose a conceptual framework for incorporating HLV throughout the AL loop, including instance selection, annotator choice, and label representation. We further discuss the integration of large language models (LLM) as annotators. Our work aims to lay a conceptual foundation for HLV-aware active learning, better reflecting the complexities of real-world annotation.

[NLP-24] WebSailor: Navigating Super-human Reasoning for Web Agent

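[Quick read]: This paper addresses the weak performance of open-source large language models (LLMs) on extremely complex information-seeking tasks that exceed human cognitive limits. The key to the approach is WebSailor, a post-training methodology that combines generation of high-uncertainty tasks, an RFT cold start, and an efficient agentic reinforcement learning algorithm, DUPO, to give the model the ability to systematically reduce extreme uncertainty, which markedly improves its performance on complex information-seeking tasks.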

Link: https://arxiv.org/abs/2507.02592
Authors: Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all opensource agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.

[NLP-25] IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders

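[Quick read]: This paper addresses the slow development of legal NLP in regions such as India, which stems mainly from a scarcity of structured datasets. The key to the approach is IndianBailJudgments-1200, a benchmark dataset of 1200 Indian court judgments on bail decisions annotated across 20+ attributes, including bail outcome, IPC (Indian Penal Code) sections, crime type, and legal reasoning. Annotations were generated with a prompt-engineered GPT-4o pipeline and verified for consistency.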

Link: https://arxiv.org/abs/2507.02506
Authors: Sneha Deshmukh, Prathmesh Kamble
Affiliations: Datta Meghe College of Engineering
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 9 figures, 2 tables. Dataset available at Hugging Face and GitHub. Submitted to arXiv for open access

Abstract:Legal NLP remains underdeveloped in regions like India due to the scarcity of structured datasets. We introduce IndianBailJudgments-1200, a new benchmark dataset comprising 1200 Indian court judgments on bail decisions, annotated across 20+ attributes including bail outcome, IPC sections, crime type, and legal reasoning. Annotations were generated using a prompt-engineered GPT-4o pipeline and verified for consistency. This resource supports a wide range of legal NLP tasks such as outcome prediction, summarization, and fairness analysis, and is the first publicly available dataset focused specifically on Indian bail jurisprudence.

[NLP-26] A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages INTERSPEECH2025

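[Quick read]: This paper addresses the building of automatic speech recognition (ASR) models for impaired speech, particularly in low-resource languages, where the core challenges are the lack of impaired-speech samples and of established methods for data collection and model building. The key to the approach is a “cookbook” of best practices and training that supports community-driven data collection and ASR model building, released together with the first open-source dataset of impaired speech in Akan and accompanying open-source tools.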

Link: https://arxiv.org/abs/2507.02428
Authors: Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This version has been reviewed and accepted for presentation at the InterSpeech 2025 conference to be held in Rotterdam from 17 to 21 August. 5 pages and 3 tables

Abstract:This study presents an approach for collecting speech samples to build Automatic Speech Recognition (ASR) models for impaired speech, particularly, low-resource languages. It aims to democratize ASR technology and data collection by developing a “cookbook” of best practices and training for community-driven data collection and ASR model building. As a proof-of-concept, this study curated the first open-source dataset of impaired speech in Akan: a widely spoken indigenous language in Ghana. The study involved participants from diverse backgrounds with speech impairments. The resulting dataset, along with the cookbook and open-source tools, are publicly available to enable researchers and practitioners to create inclusive ASR technologies tailored to the unique needs of speech impaired individuals. In addition, this study presents the initial results of fine-tuning open-source ASR models to better recognize impaired speech in Akan.

[NLP-27] Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance Scalability and Adaptability

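[Quick read]: This paper addresses the fact that existing automatic speech recognition (ASR) research rarely evaluates how models generalize across diverse speech contexts, especially the degradation of low-resource language (LRL) ASR models outside their training domain. The key to the approach is benchmarking seven Akan ASR models built on transformer architectures against four Akan speech corpora covering different domains, analyzing performance differences and error behaviors across scenarios, and pointing to the need for domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for LRLs.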

Link: https://arxiv.org/abs/2507.02407
Authors: Mark Atta Mensah, Isaac Wiafe, Akon Ekpezu, Justice Kwame Appati, Jamal-Deen Abdulai, Akosua Nyarkoa Wiafe-Akenten, Frank Ernest Yeboah, Gifty Odame
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: This version has been reviewed and accepted for presentation at the Future Technologies Conference (FTC) 2025, to be held on 6-7 November 2025 in Munich, Germany. 17 pages, 4 figures, 1 table

Abstract:Most existing automatic speech recognition (ASR) research evaluates models using in-domain datasets and seldom evaluates how they generalize across diverse speech contexts. This study addresses this gap by benchmarking seven Akan ASR models built on transformer architectures, such as Whisper and Wav2Vec2, using four Akan speech corpora to determine their performance. These datasets encompass various domains, including culturally relevant image descriptions, informal conversations, biblical scripture readings, and spontaneous financial dialogues. A comparison of the word error rate and character error rate highlighted domain dependency, with models performing optimally only within their training domains while showing marked accuracy degradation in mismatched scenarios. This study also identified distinct error behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned Whisper Akan models led to more fluent but potentially misleading transcription errors, Wav2Vec2 produced more obvious yet less interpretable outputs when encountering unfamiliar inputs. This trade-off between readability and transparency in ASR errors should be considered when selecting architectures for low-resource language (LRL) applications. These findings highlight the need for targeted domain adaptation techniques, adaptive routing strategies, and multilingual training frameworks for Akan and other LRLs.
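
For readers unfamiliar with the two metrics compared in the abstract, here is a minimal sketch of word error rate (WER) and character error rate (CER) as Levenshtein edit distance divided by reference length. It assumes whitespace tokenization and a non-empty reference; the paper's text normalization is not described in the abstract, and the example strings are placeholders rather than Akan data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Placeholder strings for illustration.
print(wer("the cat sat", "the bat sat"), cer("the cat sat", "the bat sat"))
```

A fluent but wrong transcript can score a similar WER to a garbled one, which is why the abstract discusses error behavior separately from the error rates.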

[NLP-28] JoyTTS: LLM-based Spoken Chatbot With Voice Cloning

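[Quick read]: This paper addresses how to combine voice cloning with natural language understanding in an end-to-end spoken dialogue system, for a more realistic and personalized voice interaction. The key to the approach is combining a large language model (LLM) with text-to-speech (TTS) technology, building on the open-source MiniCPM-o and CosyVoice2 models, with the aim of keeping speaker similarity high while maintaining dialogue accuracy.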

Link: https://arxiv.org/abs/2507.02380
Authors: Fangru Zhou, Jun Zhao, Guoxin Wang
Affiliations: JD Health International Inc.
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments:

Abstract:JoyTTS is an end-to-end spoken chatbot that combines large language models (LLM) with text-to-speech (TTS) technology, featuring voice cloning capabilities. This project is built upon the open-source MiniCPM-o and CosyVoice2 models and trained on 2000 hours of conversational data. We have also provided the complete training code to facilitate further development and optimization by the community. On the testing machine seed-tts-zh, it achieves a SS (speaker similarity) score of 0.73 and a WER (Word Error Rate) of 5.09. The code and models, along with training and inference scripts, are available at this https URL.

[NLP-29] Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

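[Quick read]: This paper addresses the neglect of data quality in training code generation models: existing methods improve performance mainly by adding data, which lowers training efficiency. The key to the approach is a code data selection method based on a parametric model, which is optimized so that the selected subset is distribution-consistent and diverse. With only 10K samples, the method gains 2.4% on HumanEval and 2.3% on MBPP over the 92K full-sampled baseline.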

Link: https://arxiv.org/abs/2507.02378
Authors: Weijie Lyu, Sheng-Jun Huang, Xuan Xia
Affiliations: Nanjing University of Aeronautics and Astronautics; Shenzhen Institute of Artificial Intelligence and Robotics for Society
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging vast amounts of data, focusing on data quantity while often overlooking data quality, thereby reducing training efficiency. To address this, we introduce an approach that utilizes a parametric model for code data selection, aimed at improving both training efficiency and model performance. Our method optimizes the parametric model to ensure distribution consistency and diversity within the selected subset, guaranteeing high-quality data. Experimental results demonstrate that using only 10K samples, our method achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over 92K full-sampled baseline, outperforming other sampling approaches in both performance and efficiency. This underscores that our method effectively boosts model performance while significantly reducing computational costs.

[NLP-30] QFFN-BERT: An Empirical Study of Depth Performance and Data Efficiency in Hybrid Quantum-Classical Transformers

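[Quick read]: This paper targets the feedforward network (FFN) of the Transformer, which carries a large share of the parameters, and proposes QFFN-BERT, a hybrid quantum-classical Transformer in which parameterized quantum circuits (PQCs) replace the FFN modules. The key to the approach is using the high expressibility of PQCs to cut FFN-specific parameters sharply while matching or exceeding baseline performance, with a residual connection, both R_Y and R_Z rotations, and an alternating entanglement strategy to keep training stable and expressibility high.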

Link: https://arxiv.org/abs/2507.02364
Authors: Pilsung Kang
Affiliations: Dankook University
Subjects: Computation and Language (cs.CL); Quantum Physics (quant-ph)
Comments:

Abstract:Parameterized quantum circuits (PQCs) have recently emerged as promising components for enhancing the expressibility of neural architectures. In this work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the feedforward network (FFN) modules of a compact BERT variant are replaced by PQC-based layers. This design is motivated by the dominant parameter contribution of FFNs, which account for approximately two-thirds of the parameters within standard Transformer encoder blocks. While prior studies have primarily integrated PQCs into self-attention modules, our work focuses on the FFN and systematically investigates the trade-offs between PQC depth, expressibility, and trainability. Our final PQC architecture incorporates a residual connection, both R_Y and R_Z rotations, and an alternating entanglement strategy to ensure stable training and high expressibility. Our experiments, conducted on a classical simulator, on the SST-2 and DBpedia benchmarks demonstrate two key findings. First, a carefully configured QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its classical counterpart in a full-data setting while reducing FFN-specific parameters by over 99%. Second, our model exhibits a consistent and competitive edge in few-shot learning scenarios, confirming its potential for superior data efficiency. These results, supported by an ablation study on a non-optimized PQC that failed to learn, confirm that PQCs can serve as powerful and parameter-efficient alternatives to classical FFNs when co-designed with foundational deep learning principles.

[NLP-31] Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models ACL2025

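[Quick read]: This paper addresses multimodal understanding and answer generation in the Scientific Visual Question Answering (SciVQA) task. The key to the approach is an ensemble of two Multimodal Large Language Models combined with several few-shot example retrieval strategies: the model and few-shot setting are chosen according to the figure type and question type, and answers are selected according to the confidence levels of the models.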

Link: https://arxiv.org/abs/2507.02357
Authors: Christian Jaumann, Annemarie Friedrich, Rainer Lienhart
Affiliations: University of Augsburg
Subjects: Computation and Language (cs.CL)
Comments: Accepted at 5th Workshop on Scholarly Document Processing @ ACL 2025

Abstract:This paper describes our system for the SciVQA 2025 Shared Task on Scientific Visual Question Answering. Our system employs an ensemble of two Multimodal Large Language Models and various few-shot example retrieval strategies. The model and few-shot setting are selected based on the figure and question type. We also select answers based on the models’ confidence levels. On the blind test data, our system ranks third out of seven with an average F1 score of 85.12 across ROUGE-1, ROUGE-L, and BERTS. Our code is publicly available.

[NLP-32] DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning ACL2025

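[Quick read]: This paper addresses three main problems of continual Domain-Adaptive Pre-training (DAP) methods: high computational cost and GPU memory usage, sensitivity to the order of incremental data, and delivery of a single generalized model for all end tasks, which contradicts the essence of DAP. The key to the approach is DoMIX, which uses LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method, to enable efficient and parallel domain-adaptive pre-training that is robust to domain order and uses accumulated knowledge to provide pre-trained models tailored to specific tasks.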

Link: https://arxiv.org/abs/2507.02302
Authors: Dohoon Kim, Donghun Kang, Taesup Moon
Affiliations: Seoul National University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 22 pages, 5 figures, ACL 2025 Main

Abstract:Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at this https URL.

[NLP-33] MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

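[Quick read]: This paper tackles the problem of processing arbitrarily long documents with linear complexity while avoiding performance degradation during extrapolation. The key to its solution is a new agent workflow, MemAgent, which reads text in segments and updates its memory with an overwrite strategy, and which is trained with an extended DAPO algorithm via independent-context multi-conversation generation, substantially improving performance on long-context tasks.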

Link: https://arxiv.org/abs/2507.02259
Authors: Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project Page: this https URL

Abstract:Despite improvements by length extrapolation, efficient attention and memory modules, handling infinitely long documents with linear complexity without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities, being able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ in 512K RULER test.
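
The segment-and-overwrite workflow can be pictured with a small loop. The sketch below is only an assumption-laden illustration: `toy_update` is a hypothetical stand-in for the model call that rewrites the memory, and the bounded memory is what keeps the per-step cost constant, so total cost grows linearly with document length.

```python
from typing import Callable, List

def chunk(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def read_with_memory(tokens: List[str], segment_size: int,
                     update: Callable[[str, List[str]], str],
                     memory: str = "") -> str:
    """Process a document segment by segment with a bounded memory.

    At each step the old memory is replaced (overwritten) by the new one,
    so the working context is always one segment plus the memory.
    """
    for segment in chunk(tokens, segment_size):
        memory = update(memory, segment)
    return memory

def toy_update(memory: str, segment: List[str], limit: int = 3) -> str:
    """Hypothetical stand-in for the LLM: keep the latest `limit` numeric tokens."""
    kept = memory.split() + [t for t in segment if t.isdigit()]
    return " ".join(kept[-limit:])

document = "a 1 b 2 c 3 d 4".split()
final_memory = read_with_memory(document, segment_size=2, update=toy_update)
```

In the paper the update policy is learned with reinforcement learning; the fixed rule here only shows the control flow.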

[NLP-34] GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons

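[Quick read]: This paper aims to address the difficulty that users of the Genomic Data Commons (GDC) platform face in building specific cancer genomics cohorts from complex combinations of fields and properties. The key to the solution is GDC Cohort Copilot, an open-source copilot tool that automatically generates the GDC cohort filter corresponding to a user's natural language description and exports the cohort to the GDC for further analysis. The tool uses several large language models (LLMs) to translate natural language into filter logic; the locally served, open-source GDC Cohort LLM outperforms GPT-4o prompting at generating cohorts.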

Link: https://arxiv.org/abs/2507.02221
Authors: Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 1 figure, 7 tables

Abstract:Motivation: The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. Results: We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. Availability and implementation: The standalone docker image for GDC Cohort Copilot is available at this https URL. Source code is available at this https URL. GDC Cohort LLM weights are available at this https URL.

[NLP-35] SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

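[Quick read]: This paper addresses two issues: the largely untapped potential of Graphical Abstracts (GAs) to improve scientific communication, and the barrier posed by the advanced visualization skills needed to design effective GAs. The key to the solution is SciGA-145k, a large-scale dataset of approximately 145,000 scientific papers and 1.14 million figures that supports GA selection and recommendation as well as research on automated GA generation. The paper also defines two tasks, Intra-GA recommendation and Inter-GA recommendation, and proposes a new recommendation metric, Confidence Adjusted top-1 ground truth Ratio (CAR), for finer-grained analysis of model behavior, with the aim of advancing visual scientific communication.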

Link: https://arxiv.org/abs/2507.02212
Authors: Takuro Kawada, Shunsuke Kitada, Sota Nemoto, Hitoshi Iyatomi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 15 figures, 4 tables. Project Page: this https URL

Abstract:Graphical Abstracts (GAs) play a crucial role in visually conveying the key findings of scientific papers. While recent research has increasingly incorporated visual materials such as Figure 1 as de facto GAs, their potential to enhance scientific communication remains largely unexplored. Moreover, designing effective GAs requires advanced visualization skills, creating a barrier to their widespread adoption. To tackle these challenges, we introduce SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific papers and 1.14 million figures, explicitly designed for supporting GA selection and recommendation as well as facilitating research in automated GA generation. As a preliminary step toward GA design support, we define two tasks: 1) Intra-GA recommendation, which identifies figures within a given paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation, which retrieves GAs from other papers to inspire the creation of new GAs. We provide reasonable baseline models for these tasks. Furthermore, we propose Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation metric that offers a fine-grained analysis of model behavior. CAR addresses limitations in traditional ranking-based metrics by considering cases where multiple figures within a paper, beyond the explicitly labeled GA, may also serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a foundation for advancing visual scientific communication while contributing to the development of AI for Science.

[NLP-36] ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning

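[Quick read]: This paper addresses the limited recognition performance of existing event stream scene text recognition (STR) methods in extremely challenging scenarios such as low illumination and fast motion, which stems from insufficient interpretability and weak contextual logical reasoning. The key to the solution is ESTR-CoT, a chain-of-thought (CoT) reasoning based event stream STR framework that combines the vision encoder EVA-CLIP (ViT-G/14) with the pre-trained large language model Vicuna-7B, uses a Q-former to align vision tokens with the language model, and outputs both the recognition result and the reasoning process, improving interpretability and logical reasoning.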

Link: https://arxiv.org/abs/2507.02200
Authors: Xiao Wang, Jingtao Jiang, Qiang Chen, Lan Chen, Lin Zhu, Yaowei Wang, Yonghong Tian, Jin Tang
Affiliations: Anhui University; Beijing Institute of Technology; Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: A Strong Baseline for Reasoning based Event Stream Scene Text Recognition

Abstract:Event stream based scene text recognition is a newly arising research topic in recent years which performs better than the widely used RGB cameras in extremely challenging scenarios, especially the low illumination, fast motion. Existing works either adopt end-to-end encoder-decoder framework or large language models for enhanced recognition, however, they are still limited by the challenges of insufficient interpretability and weak contextual logical reasoning. In this work, we propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP (ViT-G/14) to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously. Our framework can be optimized using supervised fine-tuning in an end-to-end manner. In addition, we also propose a large-scale CoT dataset to train our framework via a three stage processing (i.e., generation, polish, and expert verification). This dataset provides a solid data foundation for the development of subsequent reasoning-based large models. Extensive experiments on three event stream STR benchmark datasets (i.e., EventSTR, WordArt*, IC15*) fully validated the effectiveness and interpretability of our proposed framework. The source code and pre-trained models will be released on this https URL.

[NLP-37] Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer

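[Quick read]: This paper investigates whether chain-of-thought (CoT) reasoning can be internalized in Transformer-based models. Standard decoder architectures externalize reasoning steps in natural language, which improves interpretability at the cost of efficiency. The paper studies Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count, as an attempt to carry out reasoning in latent space. The key is a suite of probing techniques (such as the Logit Lens and Coda Lens) used to analyze the model's internal behavior and to examine whether latent CoT structure exists and how interpretable it is.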

Link: https://arxiv.org/abs/2507.02199
Authors: Wenquan Lu, Yuechuan Yang, Kyle Lee, Yanshu Li, Enqi Liu
Affiliations: Brown University; Harvard University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. However, in standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. To capture reasoning that is not easily represented in words, many works have explored recurrent architectures that aim to internalize reasoning in latent space, potentially supporting latent CoT. In this paper, we investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count. We examine the model’s internal behavior on arithmetic tasks using a suite of probing techniques including the Logit Lens and Coda Lens. Our findings reveal limited evidence of interpretable latent CoT by tracking rank trajectories of final and intermediate result tokens. Furthermore, we uncover significant probing inconsistencies across recurrent blocks, where the interpretability of hidden states depends heavily on both the layer index and the decoding method. Finally, we empirically show that increasing recurrence depth yields only marginal gains and falls well short of models that explicitly externalize reasoning steps. The code is available at this https URL.

[NLP-38] Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis INTERSPEECH2025

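[Quick read]: This paper addresses the challenge of accurately assessing speaker identity consistency as voice cloning systems keep improving. The automatic speaker verification (ASV) embeddings commonly used in generative speech systems focus mainly on static features such as timbre and pitch range and neglect dynamic elements such as rhythm, so they give an incomplete picture of identity. The key solution proposed is U3D, a metric that evaluates speakers' dynamic rhythm patterns and is intended to fill this gap.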

Link: https://arxiv.org/abs/2507.02176
Authors: Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted at SSW13 - Interspeech 2025 Speech Synthesis Workshop

Abstract:Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers’ dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.

[NLP-39] Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization

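[Quick read]: This paper examines whether large language models (LLMs) built on explicit step-by-step reasoning, such as Long Chain-of-Thought (CoT) models, can achieve abstraction and conciseness at the same time in dialogue summarization. The key is a systematic evaluation of reasoning and non-reasoning LLMs across multiple dialogue summarization paradigms, which shows that explicit reasoning can introduce verbosity, factual inconsistencies, and less concise summaries in complex dialogue scenarios, and which provides a basis for targeted modeling and evaluation strategies for real-world dialogue summarization.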

Link: https://arxiv.org/abs/2507.02145
Authors: Keyan Jin, Yapeng Wang, Leonel Santos, Tao Fang, Xu Yang, Sio Kei Im, Hugo Gonçalo Oliveira
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dialogue summarization is a challenging task with significant practical value in customer service, meeting analysis, and conversational AI. Although large language models (LLMs) have achieved substantial progress in summarization tasks, the performance of step-by-step reasoning architectures, specifically Long Chain-of-Thought (CoT) implementations such as OpenAI-o1 and DeepSeek-R1, remains unexplored for dialogue scenarios requiring concurrent abstraction and conciseness. In this work, we present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs across three major paradigms: generic, role-oriented, and query-oriented dialogue summarization. Our study spans diverse languages, domains, and summary lengths, leveraging strong benchmarks (SAMSum, DialogSum, CSDS, and QMSum) and advanced evaluation protocols that include both LLM-based automatic metrics and human-inspired criteria. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality. Instead, reasoning LLMs are often prone to verbosity, factual inconsistencies, and less concise summaries compared to their non-reasoning counterparts. Through scenario-specific analyses and detailed case studies, we further identify when and why explicit reasoning may fail to benefit, or even hinder, summarization in complex dialogue contexts. Our work provides new insights into the limitations of current reasoning LLMs and highlights the need for targeted modeling and evaluation strategies for real-world dialogue summarization.

[NLP-40] Dissecting the Impact of Mobile DVFS Governors on LLM Inference Performance and Energy Efficiency

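[Quick read]: This paper addresses the energy-efficiency bottleneck of deploying Large Language Models (LLMs) on resource-constrained mobile devices: the dynamic voltage and frequency scaling (DVFS) governors for CPU, GPU, and memory on current mobile devices operate independently without coordinated optimization, which makes LLM inference energy-inefficient. The key to the proposed solution is FUSE, a unified energy-aware governor that coordinates CPU, GPU, and memory frequency settings to substantially reduce time-to-first-token and time-per-output-token latency at the same energy consumption.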

Link: https://arxiv.org/abs/2507.02135
Authors: Zongpu Zhang, Pranab Dash, Y. Charlie Hu, Qiang Xu, Jian Li, Haibing Guan
Affiliations: Shanghai Jiao Tong University; Purdue University
Categories: Operating Systems (cs.OS); Computation and Language (cs.CL)
Comments: equal contribution between Zhang and Dash

Abstract:Large Language Models (LLMs) are increasingly being integrated into various applications and services running on billions of mobile devices. However, deploying LLMs on resource-limited mobile devices faces a significant challenge due to their high demand for computation, memory, and ultimately energy. While current LLM frameworks for mobile use three power-hungry components (CPU, GPU, and memory) even when running primarily-GPU LLM models, the optimized DVFS governors for CPU, GPU, and memory featured in modern mobile devices operate independently and are oblivious of each other. Motivated by the above observation, in this work, we first measure the energy-efficiency of a SOTA LLM framework consisting of various LLM models on mobile phones, which shows that the triplet mobile governors result in up to 40.4% longer prefilling and decoding latency compared to optimal combinations of CPU, GPU, and memory frequencies with the same energy consumption for sampled prefill and decode lengths. Second, we conduct an in-depth measurement study to uncover how the intricate interplay (or lack thereof) among the mobile governors causes the above inefficiency in LLM inference. Finally, based on these insights, we design FUSE, a unified energy-aware governor for optimizing the energy efficiency of LLM inference on mobile devices. Our evaluation using a ShareGPT dataset shows FUSE reduces the time-to-first-token and time-per-output-token latencies by 7.0%-16.9% and 25.4%-36.8% on average with the same energy-per-token for various mobile LLM models.

[NLP-41] Energy-Based Transformers are Scalable Learners and Thinkers

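[Quick read]: This paper addresses the limitations of existing inference-time computation techniques with respect to modality, problem type, and supervision requirements, with the goal of developing models that learn System 2 Thinking from unsupervised learning alone. The key to the solution is training Energy-Based Transformers (EBTs) to explicitly verify the compatibility between inputs and candidate predictions, and re-framing prediction as optimization with respect to this verifier, so that predictions are obtained by minimizing energy via gradient descent.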

Link: https://arxiv.org/abs/2507.02092
Authors: Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Peixuan Han, Hyeonjeong Ha, Aman Chadha, Yilun Du, Heng Ji, Jundong Li, Tariq Iqbal
Affiliations: UVA; UIUC; Amazon GenAI; Stanford University; Harvard University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Inference-time computation techniques, analogous to human System 2 Thinking, have recently become popular for improving model performances. However, most existing approaches suffer from several limitations: they are modality-specific (e.g., working only in text), problem-specific (e.g., verifiable domains like math and coding), or require additional supervision/training on top of unsupervised pretraining (e.g., verifiers or verifiable rewards). In this paper, we ask the question “Is it possible to generalize these System 2 Thinking approaches, and develop models that learn to think solely from unsupervised learning?” Interestingly, we find the answer is yes, by learning to explicitly verify the compatibility between inputs and candidate-predictions, and then re-framing prediction problems as optimization with respect to this verifier. Specifically, we train Energy-Based Transformers (EBTs) – a new class of Energy-Based Models (EBMs) – to assign an energy value to every input and candidate-prediction pair, enabling predictions through gradient descent-based energy minimization until convergence. Across both discrete (text) and continuous (visual) modalities, we find EBTs scale faster than the dominant Transformer++ approach during training, achieving an up to 35% higher scaling rate with respect to data, batch size, parameters, FLOPs, and depth. During inference, EBTs improve performance with System 2 Thinking by 29% more than the Transformer++ on language tasks, and EBTs outperform Diffusion Transformers on image denoising while using fewer forward passes. Further, we find that EBTs achieve better results than existing models on most downstream tasks given the same or worse pretraining performance, suggesting that EBTs generalize better than existing approaches. Consequently, EBTs are a promising new paradigm for scaling both the learning and thinking capabilities of models.
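
To make "prediction through gradient descent-based energy minimization" concrete, here is a toy sketch. It assumes a hand-written quadratic energy instead of a trained Transformer; `energy`, `energy_grad_y`, the step size, and the fixed step count are all hypothetical choices for illustration, not details from the paper.

```python
def energy(x: float, y: float) -> float:
    """Toy energy: low when the candidate y is compatible with x (y = 2x)."""
    return (y - 2.0 * x) ** 2

def energy_grad_y(x: float, y: float) -> float:
    """Analytic gradient of the toy energy with respect to the candidate y."""
    return 2.0 * (y - 2.0 * x)

def predict_by_energy_minimization(x, y_init=0.0, step_size=0.25, n_steps=50):
    """Refine a candidate prediction by gradient descent on the energy.

    A fixed number of steps is used here for simplicity; the abstract
    describes iterating until convergence.
    """
    y = y_init
    trajectory = [energy(x, y)]
    for _ in range(n_steps):
        y -= step_size * energy_grad_y(x, y)
        trajectory.append(energy(x, y))
    return y, trajectory

y_hat, energies = predict_by_energy_minimization(3.0)
```

Running more refinement steps corresponds to spending more computation at inference time, which is the sense in which the abstract links energy minimization to "thinking".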

[NLP-42] McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models

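[Quick read]: This paper addresses the measurement of inherent bias in Large Language Models (LLMs). Existing bias evaluation datasets focus mainly on English and North American culture, their bias categories do not fully apply to other cultural contexts, and most support only a single evaluation task, which makes it hard to evaluate bias in LLMs from multiple angles. The key to the solution is a Multi-task Chinese Bias Evaluation Benchmark (McBE) grounded in Chinese language and culture, which contains 4,077 bias evaluation instances covering 12 single bias categories and 82 subcategories, and introduces 5 evaluation tasks, providing broad category coverage, content diversity, and comprehensive measurement.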

Link: https://arxiv.org/abs/2507.02088
Authors: Tian Lan, Xiangdong Su, Xu Liu, Ruirui Wang, Ke Chang, Jiang Li, Guanglai Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 24 pages, 9 figures

Abstract:As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate its ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. The datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually only support single evaluation tasks and cannot evaluate the bias from multiple aspects in LLMs. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories, 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.

[NLP-43] Evaluating the Promise and Pitfalls of LLMs in Hiring Decisions NEURIPS2025

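[Quick read]: This paper addresses the accuracy shortfalls and algorithmic bias that arise when large language models (LLMs) are used in hiring, particularly without sufficient safeguards. The key to the solution is a supervised model built specifically for the hiring domain (Match Score), which outperforms general-purpose LLMs in both predictive accuracy and fairness. Experiments show that Match Score achieves a higher ROC AUC and markedly better fairness metrics across demographic groups, which underscores the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes settings.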

Link: https://arxiv.org/abs/2507.02087
Authors: Eitan Anzenberg, Arunava Samajpati, Sivasankaran Chandrasekar, Varun Kacholia
Affiliations: Eightfold.ai
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 10 pages, 2 figures, 2 tables. Submitted to NeurIPS 2025

Abstract:The use of large language models (LLMs) in hiring promises to streamline candidate screening, but it also raises serious concerns regarding accuracy and algorithmic bias where sufficient safeguards are not in place. In this work, we benchmark several state-of-the-art foundational LLMs - including models from OpenAI, Anthropic, Google, Meta, and Deepseek, and compare them with our proprietary domain-specific hiring model (Match Score) for job candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratio of cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 real-world recent candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near-parity), versus 0.809 or lower for the best LLMs, (0.906 vs 0.773 for the intersectionals, respectively). We discuss why pretraining biases may cause LLMs with insufficient safeguards to propagate societal biases in hiring scenarios, whereas a bespoke supervised model can more effectively mitigate these biases. Our findings highlight the importance of domain-specific modeling and bias auditing when deploying AI in high-stakes domains such as hiring, and caution against relying on off-the-shelf LLMs for such tasks without extensive fairness safeguards. Furthermore, we show with empirical evidence that there shouldn’t be a dichotomy between choosing accuracy and fairness in hiring: a well-designed algorithm can achieve both accuracy in hiring and fairness in outcomes.
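
The fairness figures above are impact ratios from a cut-off analysis. The sketch below assumes the common definition (each group's selection rate divided by the highest group's selection rate); the abstract does not spell out the paper's exact procedure, and the scores, group labels, and cutoff here are made-up illustration data.

```python
from typing import Dict, List

def selection_rates(scores: List[float], groups: List[str],
                    cutoff: float) -> Dict[str, float]:
    """Fraction of candidates in each group whose score meets the cutoff."""
    totals: Dict[str, int] = {}
    selected: Dict[str, int] = {}
    for score, group in zip(scores, groups):
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + (1 if score >= cutoff else 0)
    return {g: selected[g] / totals[g] for g in totals}

def impact_ratios(rates: Dict[str, float]) -> Dict[str, float]:
    """Each group's selection rate relative to the highest selection rate."""
    best = max(rates.values())
    if best == 0:
        raise ValueError("no group has any selected candidates")
    return {g: r / best for g, r in rates.items()}

# Hypothetical scores for two groups, A and B.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
groups = ["A", "A", "A", "B", "B", "B"]
rates = selection_rates(scores, groups, cutoff=0.5)
ratios = impact_ratios(rates)
min_ratio = min(ratios.values())
```

Under this definition a minimum ratio near 1.0 means groups are selected at nearly equal rates, which is how figures such as 0.957 versus 0.809 can be read.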

[NLP-44] STELLA: Self-Evolving LLM Agent for Biomedical Research

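[Quick read]: This paper addresses the fragmented research landscape created by the rapid growth of biomedical data, tools, and literature, which outpaces human expertise. Conventional AI agents rely on static, manually curated toolsets, which limits their adaptability and scalability. The proposed solution is STELLA, a self-evolving AI agent. Its key is a multi-agent architecture that improves itself through two core mechanisms: an evolving Template Library of reasoning strategies and a dynamically expanding Tool Ocean, into which a Tool Creation Agent automatically discovers and integrates new bioinformatics tools, allowing STELLA to learn from experience.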

Link: https://arxiv.org/abs/2507.02004
Authors: Ruofan Jin, Zaixi Zhang, Mengdi Wang, Le Cong
Affiliations: Princeton University; Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:

Abstract:The rapid growth of biomedical data, tools, and literature has created a fragmented research landscape that outpaces human expertise. While AI agents offer a solution, they typically rely on static, manually curated toolsets, limiting their ability to adapt and scale. Here, we introduce STELLA, a self-evolving AI agent designed to overcome these limitations. STELLA employs a multi-agent architecture that autonomously improves its own capabilities through two core mechanisms: an evolving Template Library for reasoning strategies and a dynamic Tool Ocean that expands as a Tool Creation Agent automatically discovers and integrates new bioinformatics tools. This allows STELLA to learn from experience. We demonstrate that STELLA achieves state-of-the-art accuracy on a suite of biomedical benchmarks, scoring approximately 26% on Humanity’s Last Exam: Biomedicine, 54% on LAB-Bench: DBQA, and 63% on LAB-Bench: LitQA, outperforming leading models by up to 6 percentage points. More importantly, we show that its performance systematically improves with experience; for instance, its accuracy on the Humanity’s Last Exam benchmark almost doubles with increased trials. STELLA represents a significant advance towards AI Agent systems that can learn and grow, dynamically scaling their expertise to accelerate the pace of biomedical discovery.

[NLP-45] Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System

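[Quick read]: This paper addresses biased recommendations caused by unfairness in Conversational Recommender Systems (CRSs), especially in dynamic, interactive settings where unfairness can worsen over time and lead to serious problems such as the Matthew effect, filter bubbles, and echo chambers. The key to the solution is a new framework, HyFairCRS, which captures a wide range of user interests by building diverse hypergraphs through contrastive learning, uses these interests during conversation to generate informative responses, and ensures fair item predictions within the dynamic user-system feedback loop.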

Link: https://arxiv.org/abs/2507.02000
Authors: Yongsen Zheng, Zongxuan Xie, Guohua Wang, Ziyao Liu, Liang Lin, Kwok-Yan Lam
Affiliations: Nanyang Technological University; Digital Trust Centre Singapore; Sun Yat-sen University; South China Agricultural University; Peng Cheng Laboratory
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve fairness recommendation in offline or static contexts, the issue of unfairness often exacerbates over time, leading to significant problems like the Matthew effect, filter bubbles, and echo chambers. To address these challenges, we proposed a novel framework, Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System (HyFairCRS), aiming to promote multi-interest diversity fairness in dynamic and interactive Conversational Recommender Systems (CRSs). HyFairCRS first captures a wide range of user interests by establishing diverse hypergraphs through contrastive learning. These interests are then utilized in conversations to generate informative responses and ensure fair item predictions within the dynamic user-system feedback loop. Experiments on two CRS-based datasets show that HyFairCRS achieves a new state-of-the-art performance while effectively alleviating unfairness. Our code is available at this https URL.

[NLP-46] Multimodal Misinformation Detection Using Early Fusion of Linguistic, Visual, and Social Features

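[Quick read]: This paper addresses the detection of misinformation that floods social media during elections and crises. The core challenge is that conventional methods rely mainly on unimodal text or image features, while the effectiveness of multimodal feature fusion has been little explored. The key to the solution is an early fusion approach that combines text, image, and social features to build a more effective classification model. Using 1,529 tweets containing both text and images, with additional social and visual features extracted through techniques such as object detection and optical character recognition (OCR), the study finds that combining unsupervised and supervised machine learning models improves classification performance by 15% over unimodal models and by 5% over bimodal models.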

Link: https://arxiv.org/abs/2507.01984
Authors: Gautam Kishore Shahi
Affiliations: University of Duisburg-Essen
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments:

Abstract:Amid a tidal wave of misinformation flooding social media during elections and crises, extensive research has been conducted on misinformation detection, primarily focusing on text-based or image-based approaches. However, only a few studies have explored multimodal feature combinations, such as integrating text and images for building a classification model to detect misinformation. This study investigates the effectiveness of different multimodal feature combinations, incorporating text, images, and social features using an early fusion approach for the classification model. This study analyzed 1,529 tweets containing both text and images during the COVID-19 pandemic and election periods collected from Twitter (now X). A data enrichment process was applied to extract additional social features, as well as visual features, through techniques such as object detection and optical character recognition (OCR). The results show that combining unsupervised and supervised machine learning models improves classification performance by 15% compared to unimodal models and by 5% compared to bimodal models. Additionally, the study analyzes the propagation patterns of misinformation based on the characteristics of misinformation tweets and the users who disseminate them.
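
Early fusion means the per-modality feature vectors are joined into one vector before a single classifier sees them, in contrast to late fusion, where separate per-modality models are combined afterwards. The sketch below shows only that concatenation step; the feature names, values, and the linear scorer are hypothetical and do not come from the study.

```python
from typing import Dict, List

def early_fusion(features: Dict[str, List[float]],
                 order: List[str]) -> List[float]:
    """Concatenate per-modality feature vectors in a fixed modality order."""
    fused: List[float] = []
    for name in order:
        fused.extend(features[name])
    return fused

def linear_score(weights: List[float], vector: List[float],
                 bias: float = 0.0) -> float:
    """Score a fused vector with one linear model over all modalities."""
    if len(weights) != len(vector):
        raise ValueError("weights and fused vector must have equal length")
    return bias + sum(w * v for w, v in zip(weights, vector))

# Hypothetical features for one tweet.
tweet = {
    "text": [0.2, 0.7],     # e.g. linguistic features
    "image": [1.0],         # e.g. an OCR or object-detection flag
    "social": [0.0, 3.0],   # e.g. account or engagement features
}
fused = early_fusion(tweet, ["text", "image", "social"])
score = linear_score([1.0, 1.0, 1.0, 1.0, 1.0], fused)
```

Keeping the modality order fixed matters because the classifier's weights are tied to positions in the fused vector.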

[NLP-47] DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

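[Quick read]: This paper addresses the degradation of original language abilities, i.e., catastrophic forgetting, that tends to occur when audio capabilities are added to a Large Audio Language Model (LALM). The key to the solution is DeSTA, a self-generated cross-modal alignment strategy in which the backbone language model generates its own training targets; this preserves the model's original language abilities while establishing effective audio-text alignment, which in turn enables zero-shot generalization without task-specific fine-tuning.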

Link: https://arxiv.org/abs/2507.02768
Authors: Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Model and code available at: this https URL

Abstract:We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following, without requiring task-specific audio instruction-tuning. Recent LALMs typically augment Large Language Models (LLMs) with auditory capabilities by training on large-scale, manually curated or LLM-synthesized audio-instruction datasets. However, these approaches have often suffered from the catastrophic forgetting of the LLM’s original language abilities. To address this, we revisit the data construction pipeline and propose DeSTA, a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets. This approach preserves the LLM’s native language proficiency while establishing effective audio-text alignment, thereby enabling zero-shot generalization without task-specific tuning. Using DeSTA, we construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms widely adopted data construction and training strategies in both auditory perception and instruction-following capabilities. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

[NLP-48] Seeing Through Green: Text-Based Classification and the Firm's Returns from Green Patents

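[Quick read]: This paper addresses how to accurately identify "true" green patents in order to obtain a finer-grained patent classification that is more informative for policymaking. The key to the solution is Natural Language Processing (NLP): a simple neural network is trained on a large set of patents previously classified as green to enlarge a baseline dictionary through vector representations of expressions, which is then used to identify green patents more precisely.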

Link: https://arxiv.org/abs/2507.02287
Authors: Lapo Santarlasci, Armando Rungi, Antonio Zinilli
Affiliations: IMT School for Advanced Studies, Lucca; National Research Council of Italy (CNR), Research Institute on Sustainable Economic Growth, Unit of Rome
Categories: General Economics (econ.GN); Computation and Language (cs.CL)
Comments:

点击查看摘要

Abstract:This paper introduces Natural Language Processing for identifying "true" green patents from official supporting documents. We start our training on about 12.4 million patents that had been classified as green from previous literature. Thus, we train a simple neural network to enlarge a baseline dictionary through vector representations of expressions related to environmental technologies. After testing, we find that "true" green patents represent about 20% of the total of patents classified as green from previous literature. We show heterogeneity by technological classes, and then check that "true" green patents are about 1% less cited by following inventions. In the second part of the paper, we test the relationship between patenting and a dashboard of firm-level financial accounts in the European Union. After controlling for reverse causality, we show that holding at least one "true" green patent raises sales, market shares, and productivity. If we restrict the analysis to high-novelty "true" green patents, we find that they also yield higher profits. Our findings underscore the importance of using text analyses to gauge finer-grained patent classifications that are useful for policymaking in different domains.
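
The dictionary-enlargement step described in the abstract can be illustrated with a small sketch. It assumes expression embeddings are already available; the toy vectors, the `expand_dictionary` helper, and the 0.8 similarity threshold are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_dictionary(seed_terms, embeddings, threshold=0.8):
    """Add every expression whose embedding is close to any seed term."""
    expanded = set(seed_terms)
    for term, vec in embeddings.items():
        if term in expanded:
            continue
        if any(cosine(vec, embeddings[s]) >= threshold
               for s in seed_terms if s in embeddings):
            expanded.add(term)
    return expanded

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
embeddings = {
    "solar cell": [0.9, 0.1, 0.0],
    "photovoltaic module": [0.85, 0.2, 0.05],
    "wind turbine": [0.7, 0.6, 0.1],
    "combustion engine": [0.0, 0.2, 0.95],
}
green_terms = expand_dictionary({"solar cell"}, embeddings)
```

In this toy setup the seed term pulls in the two nearby expressions and leaves the unrelated one out; a real pipeline would learn the vectors from patent text rather than hard-code them.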

[NLP-49] FinAI-BERT: A Transformer-Based Model for Sentence-Level Detection of AI Disclosures in Financial Reports

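[Quick Read]: This paper aims to achieve fine-grained, interpretable, and robust classification of artificial intelligence (AI) related disclosures in the financial domain. Prior methods rely on keyword expansion or document-level classification and fall short in granularity, interpretability, and robustness. The proposed solution is FinAI-BERT, a domain-adapted Transformer-based language model fine-tuned on a manually curated, balanced dataset of 1,586 sentences to classify AI-related content in financial texts at the sentence level. The key is to exploit the semantic understanding of a pretrained model, add SHAP-based token attribution for interpretability, and use bias analysis and robustness checks to confirm the model's stability under different input conditions.

Link: https://arxiv.org/abs/2507.01991
Authors: Muhammad Bilal Zafar
Affiliations: Unknown
Categories: Computational Finance (q-fin.CP); Computation and Language (cs.CL); General Economics (econ.GN); General Finance (q-fin.GN)
Comments: The FinAI-BERT model can be directly loaded via Hugging Face Transformers ( this https URL ) for sentence-level AI disclosure classification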

Abstract:The proliferation of artificial intelligence (AI) in financial services has prompted growing demand for tools that can systematically detect AI-related disclosures in corporate filings. While prior approaches often rely on keyword expansion or document-level classification, they fall short in granularity, interpretability, and robustness. This study introduces FinAI-BERT, a domain-adapted transformer-based language model designed to classify AI-related content at the sentence level within financial texts. The model was fine-tuned on a manually curated and balanced dataset of 1,586 sentences drawn from 669 annual reports of U.S. banks (2015 to 2023). FinAI-BERT achieved near-perfect classification performance (accuracy of 99.37 percent, F1 score of 0.993), outperforming traditional baselines such as Logistic Regression, Naive Bayes, Random Forest, and XGBoost. Interpretability was ensured through SHAP-based token attribution, while bias analysis and robustness checks confirmed the model’s stability across sentence lengths, adversarial inputs, and temporal samples. Theoretically, the study advances financial NLP by operationalizing fine-grained, theme-specific classification using transformer architectures. Practically, it offers a scalable, transparent solution for analysts, regulators, and scholars seeking to monitor the diffusion and framing of AI across financial institutions.

Computer Vision

[CV-0] MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real

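[Quick Read]: This paper addresses how robots can integrate multiple sensory modalities (such as vision and sound) to learn multimodal policies that work in the real world, given that sim-to-real transfer remains difficult in multimodal settings. The key to the solution is MultiGen, a framework that combines large-scale generative AI models with traditional physics simulators to enable multisensory simulation. By synthesizing realistic audio conditioned on simulation video, it allows policies to be trained on rich audiovisual trajectories without any real robot data, improving multimodal sim-to-real transfer.

Link: https://arxiv.org/abs/2507.02864
Authors: Renhao Wang, Haoran Geng, Tingle Li, Feishi Wang, Gopala Anumanchipalli, Philipp Wu, Trevor Darrell, Boyi Li, Pieter Abbeel, Jitendra Malik, Alexei A. Efros
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: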

Abstract:Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories – without any real robot data. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids, highlighting the potential of generative modeling to both simulate hard-to-model modalities and close the multimodal sim-to-real gap.

[CV-1] Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

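[Quick Read]: This paper targets dense 3D scene reconstruction from ordered sequences or unordered image collections, in particular the limited capacity and loss of early-frame information that affect the implicit memory used by existing methods when many images are processed. The key to the solution is Point3R, a framework that maintains an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer is assigned a specific 3D position and aggregates nearby scene information in the global coordinate system into a changing spatial feature, so that the current observation can be densely integrated into the global coordinate system.

Link: https://arxiv.org/abs/2507.02863
Authors: Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code is available at: this https URL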

Abstract:Dense 3D scene reconstruction from an ordered sequence or unordered image collections is a critical step when bringing research in computer vision into practical scenarios. Following the paradigm introduced by DUSt3R, which unifies an image pair densely into a shared coordinate system, subsequent methods maintain an implicit memory to achieve dense 3D reconstruction from more images. However, such implicit memory is limited in capacity and may suffer from information loss of earlier frames. We propose Point3R, an online framework targeting dense streaming 3D reconstruction. To be specific, we maintain an explicit spatial pointer memory directly associated with the 3D structure of the current scene. Each pointer in this memory is assigned a specific 3D position and aggregates scene information nearby in the global coordinate system into a changing spatial feature. Information extracted from the latest frame interacts explicitly with this pointer memory, enabling dense integration of the current observation into the global coordinate system. We design a 3D hierarchical position embedding to promote this interaction and design a simple yet effective fusion mechanism to ensure that our pointer memory is uniform and efficient. Our method achieves competitive or state-of-the-art performance on various tasks with low training costs. Code is available at: this https URL.
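
As a rough illustration of an explicit spatial pointer memory, the sketch below stores one feature per 3D position and fuses a new observation into the nearest existing pointer when it lies within a radius, otherwise adding a new pointer. The `PointerMemory` class, the running-mean fusion rule, and the radius value are hypothetical simplifications and should not be read as the paper's actual mechanism.

```python
import math

class PointerMemory:
    """Toy explicit memory: each pointer has a 3D position and a feature."""

    def __init__(self, radius=0.5):
        self.radius = radius
        self.positions = []   # list of (x, y, z) tuples
        self.features = []    # list of feature vectors (lists of floats)
        self.counts = []      # number of observations fused into each pointer

    def update(self, position, feature):
        """Fuse an observation into a nearby pointer or create a new one."""
        best, best_dist = None, float("inf")
        for i, p in enumerate(self.positions):
            d = math.dist(p, position)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= self.radius:
            n = self.counts[best]
            # Running mean of all features fused into this pointer so far.
            self.features[best] = [(f * n + g) / (n + 1)
                                   for f, g in zip(self.features[best], feature)]
            self.counts[best] = n + 1
        else:
            self.positions.append(tuple(position))
            self.features.append(list(feature))
            self.counts.append(1)

memory = PointerMemory(radius=0.5)
memory.update((0.0, 0.0, 0.0), [1.0, 0.0])
memory.update((0.1, 0.0, 0.0), [0.0, 1.0])   # close: fused into pointer 0
memory.update((3.0, 0.0, 0.0), [0.5, 0.5])   # far: becomes a new pointer
```

Merging nearby observations keeps the number of pointers tied to the extent of the scene rather than to the number of frames, which is the intuition behind keeping such a memory spatially uniform.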

[CV-2] RefTok: Reference-Based Tokenization for Video Generation

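[Quick Read]: This paper addresses the effective handling of temporal redundancy in video models. Existing methods typically treat each set of frames independently and fail to capture the temporal dependencies and redundancies inherent in video. The key to the solution is RefTok, a reference-based tokenization method that captures complex temporal dynamics and contextual information by encoding and decoding sets of frames conditioned on an unquantized reference frame, which preserves motion continuity and object appearance when decoding.

Link: https://arxiv.org/abs/2507.02862
Authors: Xiang Fan, Xiaohang Sun, Kushan Thakkar, Zhu Liu, Vimal Bhat, Ranjay Krishna, Xiang Hao
Affiliations: University of Washington; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: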

Abstract:Effectively handling temporal redundancy remains a key challenge in learning video models. Prevailing approaches often treat each set of frames independently, failing to effectively capture the temporal dependencies and redundancies inherent in videos. To address this limitation, we introduce RefTok, a novel reference-based tokenization method capable of capturing complex temporal dynamics and contextual information. Our method encodes and decodes sets of frames conditioned on an unquantized reference frame. When decoded, RefTok preserves the continuity of motion and the appearance of objects across frames. For example, RefTok retains facial details despite head motion, reconstructs text correctly, preserves small patterns, and maintains the legibility of handwriting from the context. Across 4 video datasets (K600, UCF-101, BAIR Robot Pushing, and DAVIS), RefTok significantly outperforms current state-of-the-art tokenizers (Cosmos and MAGVIT) and improves all evaluated metrics (PSNR, SSIM, LPIPS) by an average of 36.7% at the same or higher compression ratios. When a video generation model is trained using RefTok's latents on the BAIR Robot Pushing task, the generations outperform not only MAGVIT-B but also the larger MAGVIT-L, which has 4x more parameters, across all generation metrics by an average of 27.9%.

[CV-3] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

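[Quick Read]: This paper addresses converting RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas for applications such as AR/VR, gaming, robotics, and digital twins. The key to the solution is the LiteReality pipeline, whose core stages are scene understanding with structured scene-graph parsing, retrieval of 3D models from a curated asset database, a Material Painting module, and integration of the reconstructed scene into a simulation engine with basic physical properties, which together yield high fidelity and interactivity.

Link: https://arxiv.org/abs/2507.02861
Authors: Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, Joan Lasenby
Affiliations: University of Cambridge; The University of Hong Kong; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: Project Page: this https URL ; Video: this https URL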

Abstract:We propose LiteReality, a novel pipeline that converts RGB-D scans of indoor environments into compact, realistic, and interactive 3D virtual replicas. LiteReality not only reconstructs scenes that visually resemble reality but also supports key features essential for graphics pipelines – such as object individuality, articulation, high-quality physically based rendering materials, and physically based interaction. At its core, LiteReality first performs scene understanding and parses the results into a coherent 3D layout and objects with the help of a structured scene graph. It then reconstructs the scene by retrieving the most visually similar 3D artist-crafted models from a curated asset database. Next, the Material Painting module enhances realism by recovering high-quality, spatially varying materials. Finally, the reconstructed scene is integrated into a simulation engine with basic physical properties to enable interactive behavior. The resulting scenes are compact, editable, and fully compatible with standard graphics pipelines, making them suitable for applications in AR/VR, gaming, robotics, and digital twins. In addition, LiteReality introduces a training-free object retrieval module that achieves state-of-the-art similarity performance on the Scan2CAD benchmark, along with a robust material painting module capable of transferring appearances from images of any style to 3D assets – even under severe misalignment, occlusion, and poor lighting. We demonstrate the effectiveness of LiteReality on both real-life scans and public datasets. Project page: this https URL; Video: this https URL

[CV-4] Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

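[Quick Read]: This paper aims to reduce the slow inference and high computational cost that the iterative denoising process imposes on video generation models in practice. The key to the solution is EasyCache, a training-free acceleration framework built around a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors and thereby avoids redundant computation during inference.

Link: https://arxiv.org/abs/2507.02860
Authors: Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai
Affiliations: Huazhong University of Science and Technology; MEGVII Technology; University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is made available at this https URL . Project page: this https URL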

Abstract:Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by up to 2.1-3.3× compared to the original baselines while maintaining high visual fidelity, with a PSNR improvement of up to 36% over the previous SOTA method. This improvement makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at this https URL.
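
To make the idea of runtime-adaptive caching concrete, here is a minimal sketch of one way such a cache could work inside an iterative loop. The reuse criterion (accumulated relative input change compared against a threshold `tau`), the `toy_model` stand-in, and all names are assumptions for illustration, not the authors' implementation.

```python
def toy_model(x, t):
    """Stand-in for an expensive denoiser call (hypothetical)."""
    return [0.99 * v - 0.001 * t for v in x]

def relative_change(a, b):
    """L1 change of `a` relative to the magnitude of `b`."""
    num = sum(abs(p - q) for p, q in zip(a, b))
    den = sum(abs(q) for q in b) or 1.0
    return num / den

def denoise_with_cache(x, steps, tau=0.05):
    """Run `steps` iterations, reusing the cached transformation vector
    (output minus input) while the accumulated input drift stays below `tau`."""
    cached_delta, last_input, drift, full_calls = None, None, 0.0, 0
    for t in range(steps):
        if cached_delta is not None:
            drift += relative_change(x, last_input)
        if cached_delta is None or drift >= tau:
            out = toy_model(x, t)                            # full computation
            cached_delta = [o - i for o, i in zip(out, x)]   # refresh the cache
            drift, full_calls = 0.0, full_calls + 1
        else:
            out = [i + d for i, d in zip(x, cached_delta)]   # cheap reuse
        last_input, x = x, out
    return x, full_calls

final, calls = denoise_with_cache([1.0, -0.5, 0.25], steps=20)
```

The decision is made at run time from quantities observed during sampling, so a sketch like this needs no offline profiling; the threshold trades speed against fidelity.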

[CV-5] Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation ICCV2025

[Quick Read]: This paper addresses the difficulty of adapting Multimodal Large Language Models (MLLMs) to specialized vision tasks without large-scale fine-tuning data, particularly when pre-training data and downstream data are mismatched. The key to the solution is to train with Chain-of-Thought (CoT) reasoning data and to introduce Grounded Chain-of-Thought (GCoT), which injects grounding information such as bounding boxes into the CoT data, making the reasoning steps more accurate and more faithful to the input image.

Link: https://arxiv.org/abs/2507.02859
Authors: Jiaer Xia, Bingkui Tong, Yuhang Zang, Rui Shao, Kaiyang Zhou
Affiliations: Hong Kong Baptist University; Sichuan University; Shanghai AI Lab; Harbin Institute of Technology (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.

[CV-6] AnyI2V: Animating Any Conditional Image with Motion Control ICCV2025

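[Quick Read]: This paper addresses the effective integration of dynamic motion signals and flexible spatial constraints in video generation, specifically in text-to-video (T2V) and image-to-video (I2V) synthesis. Existing T2V methods rely on text prompts and cannot precisely control the spatial layout of generated content, while I2V methods depend on real images, which limits editability. The proposed AnyI2V is a training-free framework whose key idea is to animate any conditional image along user-defined motion trajectories; it supports conditional inputs from multiple modalities and enables style transfer and editing via LoRA and text prompts, improving the flexibility and controllability of video generation.

Link: https://arxiv.org/abs/2507.02857
Authors: Ziye Li, Hao Luo, Xincheng Shuai, Henghui Ding
Affiliations: Fudan University; DAMO Academy, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project Page: this https URL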

Abstract:Recent advancements in video generation, particularly in diffusion models, have driven notable progress in text-to-video (T2V) and image-to-video (I2V) synthesis. However, challenges remain in effectively integrating dynamic motion signals and flexible spatial constraints. Existing T2V methods typically rely on text prompts, which inherently lack precise control over the spatial layout of generated content. In contrast, I2V methods are limited by their dependence on real images, which restricts the editability of the synthesized content. Although some methods incorporate ControlNet to introduce image-based conditioning, they often lack explicit motion control and require computationally expensive training. To address these limitations, we propose AnyI2V, a training-free framework that animates any conditional images with user-defined motion trajectories. AnyI2V supports a broader range of modalities as the conditional image, including data types such as meshes and point clouds that are not supported by ControlNet, enabling more flexible and versatile video generation. Additionally, it supports mixed conditional inputs and enables style transfer and editing via LoRA and text prompts. Extensive experiments demonstrate that the proposed AnyI2V achieves superior performance and provides a new perspective in spatial- and motion-controlled video generation. Code is available at this https URL.

[CV-7] USAD: An Unsupervised Data Augmentation Spatio-Temporal Attention Diffusion Network

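[Quick Read]: This paper targets key problems in Human Activity Recognition (HAR): scarce labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. The key to the solution is a comprehensive optimization approach centered on multi-attention interaction mechanisms. It uses an unsupervised, statistics-guided diffusion model for data augmentation, a multi-branch spatio-temporal interaction network to capture multi-scale features of sequential data, spatio-temporal attention with a cross-branch feature fusion unit to strengthen feature representation, and an adaptive multi-loss fusion strategy for model optimization.

Link: https://arxiv.org/abs/2507.02827
Authors: Ying Yu, Hang Xiao, Siyao Li, Jiarui Li, Haotian Tang, Hanyu Liu, Chao Li
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: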

Abstract:The primary objective of human activity recognition (HAR) is to infer ongoing human actions from sensor data, a task that finds broad applications in health monitoring, safety protection, and sports analysis. Despite proliferating research, HAR still faces key challenges, including the scarcity of labeled samples for rare activities, insufficient extraction of high-level features, and suboptimal model performance on lightweight devices. To address these issues, this paper proposes a comprehensive optimization approach centered on multi-attention interaction mechanisms. First, an unsupervised, statistics-guided diffusion model is employed to perform data augmentation, thereby alleviating the problems of labeled data scarcity and severe class imbalance. Second, a multi-branch spatio-temporal interaction network is designed, which captures multi-scale features of sequential data through parallel residual branches with 3*3, 5*5, and 7*7 convolutional kernels. Simultaneously, temporal attention mechanisms are incorporated to identify critical time points, while spatial attention enhances inter-sensor interactions. A cross-branch feature fusion unit is further introduced to improve the overall feature representation capability. Finally, an adaptive multi-loss function fusion strategy is integrated, allowing for dynamic adjustment of loss weights and overall model optimization. Experimental results on three public datasets, WISDM, PAMAP2, and OPPORTUNITY, demonstrate that the proposed unsupervised data augmentation spatio-temporal attention diffusion network (USAD) achieves accuracies of 98.84%, 93.81%, and 80.92% respectively, significantly outperforming existing approaches. Furthermore, practical deployment on embedded devices verifies the efficiency and feasibility of the proposed method.

[CV-8] Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach

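[Quick Read]: This paper addresses the difficulty of cross-modal feature alignment and the imbalance of modality contributions in multimodal sensor-based Human Activity Recognition (HAR) systems. The key to the solution is the Dynamic Contrastive Dual-Path Network (DCDP-HAR), a framework with three core components: a dual-path feature extraction architecture, a multi-stage contrastive learning mechanism, and a confidence-driven gradient modulation strategy, which together promote cooperation and balance among modalities.

Link: https://arxiv.org/abs/2507.02826
Authors: Panpan Ji, Junni Song, Hang Xiao, Hanyu Liu, Chao Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: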

Abstract:Sensor-based Human Activity Recognition (HAR) is a core technology that enables intelligent systems to perceive and interact with their environment. However, multimodal HAR systems still encounter key challenges, such as difficulties in cross-modal feature alignment and imbalanced modality contributions. To address these issues, we propose a novel framework called the Dynamic Contrastive Dual-Path Network (DCDP-HAR). The framework comprises three key components. First, a dual-path feature extraction architecture is employed, where ResNet and DenseNet branches collaboratively process multimodal sensor data. Second, a multi-stage contrastive learning mechanism is introduced to achieve progressive alignment from local perception to semantic abstraction. Third, we present a confidence-driven gradient modulation strategy that dynamically monitors and adjusts the learning intensity of each modality branch during backpropagation, effectively alleviating modality competition. In addition, a momentum-based gradient accumulation strategy is adopted to enhance training stability. We conduct ablation studies to validate the effectiveness of each component and perform extensive comparative experiments on four public benchmark datasets.
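
The abstract does not give the modulation formula, so the following is only a sketch of what confidence-driven gradient modulation could look like: a branch that is more confident than average has its gradient scaled down so weaker modalities are not crowded out. The coefficient rule, the `alpha` parameter, and the modality names are hypothetical.

```python
def modulation_coefficients(confidences, alpha=1.0):
    """Down-weight gradients of modalities that are more confident than average."""
    mean_conf = sum(confidences.values()) / len(confidences)
    coeffs = {}
    for name, conf in confidences.items():
        gap = conf - mean_conf
        # Dominant branches (gap > 0) are slowed; weaker ones keep full strength.
        coeffs[name] = 1.0 / (1.0 + alpha * gap) if gap > 0 else 1.0
    return coeffs

def modulate(gradients, coeffs):
    """Scale each modality's gradient vector by its coefficient."""
    return {m: [g * coeffs[m] for g in grads] for m, grads in gradients.items()}

# Hypothetical per-modality confidences and gradients for one training step.
confidences = {"accelerometer": 0.9, "gyroscope": 0.5}
gradients = {"accelerometer": [0.2, -0.4], "gyroscope": [0.1, 0.3]}
coeffs = modulation_coefficients(confidences)
scaled = modulate(gradients, coeffs)
```

In a real training loop the confidences would be recomputed every step (for example from each branch's softmax output on the true class), which is what makes the adjustment dynamic.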

[CV-9] LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

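[Quick Read]: This paper addresses the fundamental but challenging problem of recovering 3D structure with open-vocabulary scene understanding from 2D images. Existing methods rely on the calibrated dense-view reconstruction paradigm and therefore produce severe rendering artifacts and implausible semantic synthesis when only limited views are available. The key to the solution is LangScene-X, a generative framework that builds generalizable 3D language-embedded scenes by generating more consistent novel observations. Its core components are a TriMap video diffusion model that generates multi-modality information and a Language Quantized Compressor (LQC) trained on large-scale image datasets, which together enable cross-scene generalization and open-ended language queries.

Link: https://arxiv.org/abs/2507.02813
Authors: Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
Affiliations: Tsinghua University; NTU; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL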

Abstract:Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: this https URL.

[CV-10] HyperGaussians: High-Dimensional Gaussian Splatting for High-Fidelity Animatable Face Avatars

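[Quick Read]: This paper addresses the creation of high-quality animatable face avatars from monocular video, where current methods still struggle with nonlinear deformations, complex lighting effects, and fine details. The key to the solution is HyperGaussians, an extension of 3D Gaussian Splatting (3DGS) to high-dimensional multivariate Gaussians that gain expressivity by conditioning on a learnable local embedding, combined with an "inverse covariance trick" that reduces the computational cost, enabling efficient, high-quality face reconstruction and animation.

Link: https://arxiv.org/abs/2507.02803
Authors: Gent Serifi, Marcel C. Bühler
Affiliations: ETH Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Project page: this https URL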

Abstract:We introduce HyperGaussians, a novel extension of 3D Gaussian Splatting for high-quality animatable face avatars. Creating such detailed face avatars from videos is a challenging problem and has numerous applications in augmented and virtual reality. While tremendous successes have been achieved for static faces, animatable avatars from monocular videos still fall in the uncanny valley. The de facto standard, 3D Gaussian Splatting (3DGS), represents a face through a collection of 3D Gaussian primitives. 3DGS excels at rendering static faces, but the state-of-the-art still struggles with nonlinear deformations, complex lighting effects, and fine details. While most related works focus on predicting better Gaussian parameters from expression codes, we rethink the 3D Gaussian representation itself and how to make it more expressive. Our insights lead to a novel extension of 3D Gaussians to high-dimensional multivariate Gaussians, dubbed 'HyperGaussians'. The higher dimensionality increases expressivity through conditioning on a learnable local embedding. However, splatting HyperGaussians is computationally expensive because it requires inverting a high-dimensional covariance matrix. We solve this by reparameterizing the covariance matrix, dubbed the 'inverse covariance trick'. This trick boosts the efficiency so that HyperGaussians can be seamlessly integrated into existing models. To demonstrate this, we plug HyperGaussians into the state-of-the-art in fast monocular face avatars: FlashAvatar. Our evaluation on 19 subjects from 4 face datasets shows that HyperGaussians outperform 3DGS numerically and visually, particularly for high-frequency details like eyeglass frames, teeth, complex facial movements, and specular reflections.
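
For background, conditioning a joint Gaussian on part of its variables has a standard closed form, and writing it with the precision matrix shows why a reparameterization can avoid inverting the large block. The identities below are textbook results; the notation (a for the low-dimensional attributes, b for the embedding) is hypothetical, and whether this matches the paper's exact "inverse covariance trick" is an assumption.

```latex
% Joint Gaussian over (a, b) with block mean and covariance
\begin{aligned}
p(a \mid b) &= \mathcal{N}\!\left(\mu_{a|b},\, \Sigma_{a|b}\right), \\
\mu_{a|b} &= \mu_a + \Sigma_{ab}\,\Sigma_{bb}^{-1}\,(b - \mu_b), \\
\Sigma_{a|b} &= \Sigma_{aa} - \Sigma_{ab}\,\Sigma_{bb}^{-1}\,\Sigma_{ba}.
\end{aligned}

% Same quantities in terms of the precision matrix \Lambda = \Sigma^{-1}
\begin{aligned}
\Sigma_{a|b} &= \Lambda_{aa}^{-1}, \\
\mu_{a|b} &= \mu_a - \Lambda_{aa}^{-1}\,\Lambda_{ab}\,(b - \mu_b).
\end{aligned}
```

In the first form the matrix to invert, \Sigma_{bb}, grows with the embedding dimension; in the second only \Lambda_{aa} is inverted, and its size is fixed by the dimension of a.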

[CV-11] No time to train! Training-Free Reference-Based Instance Segmentation

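[Quick Read]: This paper addresses object segmentation when only a small set of reference images is available, with the goal of reducing reliance on manual visual prompts or complex domain-dependent prompt-generation rules. The key to the solution is to leverage the strong semantic priors learned by foundation models to identify corresponding regions between reference and target images, and to generate instance-level segmentation masks automatically through a multi-stage, training-free method comprising memory bank construction, representation aggregation, and semantic-aware feature matching.

Link: https://arxiv.org/abs/2507.02798
Authors: Miguel Espinosa, Chenhongyi Yang, Linus Ericsson, Steven McDonagh, Elliot J. Crowley
Affiliations: Meta; School of Engineering, University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint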

Abstract:The performance of image segmentation models has historically been constrained by the high cost of collecting large-scale annotated data. The Segment Anything Model (SAM) alleviates this original problem through a promptable, semantics-agnostic, segmentation paradigm and yet still requires manual visual-prompts or complex domain-dependent prompt-generation rules to process a new image. Towards reducing this new burden, our work investigates the task of object segmentation when provided with, alternatively, only a small set of reference images. Our key insight is to leverage strong semantic priors, as learned by foundation models, to identify corresponding regions between a reference and a target image. We find that correspondences enable automatic generation of instance-level segmentation masks for downstream tasks and instantiate our ideas via a multi-stage, training-free method incorporating (1) memory bank construction; (2) representation aggregation and (3) semantic-aware feature matching. Our experiments show significant improvements on segmentation metrics, leading to state-of-the-art performance on COCO FSOD (36.8% nAP), PASCAL VOC Few-Shot (71.2% nAP50) and outperforming existing training-free approaches on the Cross-Domain FSOD benchmark (22.4% nAP).

[CV-12] RichControl: Structure- and Appearance-Rich Training-Free Spatial Control for Text-to-Image Generation

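[Quick Read]: This paper addresses the structural misalignment, condition leakage, and visual artifacts that arise when Text-to-Image (T2I) diffusion models use conditional images (such as depth or pose maps) for fine-grained spatial control. Existing feature injection methods perform poorly when the condition image diverges significantly from natural RGB distributions; their core limitation is that synchronous injection of condition features ignores the trade-off between domain alignment and structural preservation during denoising. The key to the solution is a flexible feature injection framework that decouples the injection timestep from the denoising process and introduces a structure-rich injection module, allowing the model to adapt to the evolving interplay between alignment and structure preservation across diffusion steps and to generate structure more faithfully.

Link: https://arxiv.org/abs/2507.02792
Authors: Liheng Zhang, Lexi Pang, Hang Ye, Xiaoxuan Ma, Yizhou Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: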

Abstract:Text-to-image (T2I) diffusion models have shown remarkable success in generating high-quality images from text prompts. Recent efforts extend these models to incorporate conditional images (e.g., depth or pose maps) for fine-grained spatial control. Among them, feature injection methods have emerged as a training-free alternative to traditional fine-tuning approaches. However, they often suffer from structural misalignment, condition leakage, and visual artifacts, especially when the condition image diverges significantly from natural RGB distributions. By revisiting existing methods, we identify a core limitation: the synchronous injection of condition features fails to account for the trade-off between domain alignment and structural preservation during denoising. Inspired by this observation, we propose a flexible feature injection framework that decouples the injection timestep from the denoising process. At its core is a structure-rich injection module, which enables the model to better adapt to the evolving interplay between alignment and structure preservation throughout the diffusion steps, resulting in more faithful structural generation. In addition, we introduce appearance-rich prompting and a restart refinement strategy to further enhance appearance control and visual quality. Together, these designs enable training-free generation that is both structure-rich and appearance-rich. Extensive experiments show that our approach achieves state-of-the-art performance across diverse zero-shot conditioning scenarios.

[CV-13] From Pixels to Damage Severity: Estimating Earthquake Impacts Using Semantic Segmentation of Social Media Images

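[Quick Read]: This paper addresses the subjectivity of damage severity assessment in post-earthquake social media images, where traditional classification methods cannot account for varying extents of damage within an image. The key to the solution is to frame damage severity assessment as a semantic segmentation problem: a segmented damage severity dataset is constructed and a SegFormer model is fine-tuned to produce more objective damage severity segmentations, and a new damage severity scoring system, adjusted for depth estimation, quantifies how damage varies across regions of an image.

Link: https://arxiv.org/abs/2507.02781
Authors: Danrong Zhang, Huili Huang, N. Simrill Smith, Nimisha Roy, J. David Frost
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Social and Information Networks (cs.SI)
Comments: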

Abstract:In the aftermath of earthquakes, social media images have become a crucial resource for disaster reconnaissance, providing immediate insights into the extent of damage. Traditional approaches to damage severity assessment in post-earthquake social media images often rely on classification methods, which are inherently subjective and incapable of accounting for the varying extents of damage within an image. Addressing these limitations, this study proposes a novel approach by framing damage severity assessment as a semantic segmentation problem, aiming for a more objective analysis of damage in earthquake-affected areas. The methodology involves the construction of a segmented damage severity dataset, categorizing damage into three degrees: undamaged structures, damaged structures, and debris. Utilizing this dataset, the study fine-tunes a SegFormer model to generate damage severity segmentations for post-earthquake social media images. Furthermore, a new damage severity scoring system is introduced, quantifying damage by considering the varying degrees of damage across different areas within images, adjusted for depth estimation. The application of this approach allows for the quantification of damage severity in social media images in a more objective and comprehensive manner. By providing a nuanced understanding of damage, this study enhances the ability to offer precise guidance to disaster reconnaissance teams, facilitating more effective and targeted response efforts in the aftermath of earthquakes.
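
The abstract describes a score that combines per-region damage classes with a depth adjustment but does not state the formula. The sketch below shows one plausible form under stated assumptions: the class weights and the depth-squared pixel weighting are hypothetical choices for illustration, not the paper's scoring system.

```python
# Hypothetical severity weights for the three segmentation classes.
CLASS_WEIGHTS = {"undamaged": 0.0, "damaged": 0.5, "debris": 1.0}

def damage_score(labels, depths):
    """Depth-weighted mean severity over all pixels.

    labels: 2D list of class names; depths: 2D list of positive distances.
    Assumption: a farther pixel covers more real-world area, so each pixel
    is weighted by its depth squared.
    """
    num = den = 0.0
    for label_row, depth_row in zip(labels, depths):
        for label, depth in zip(label_row, depth_row):
            w = depth ** 2
            num += CLASS_WEIGHTS[label] * w
            den += w
    return num / den if den else 0.0

labels = [["undamaged", "damaged"], ["debris", "debris"]]
depths = [[1.0, 1.0], [2.0, 2.0]]
score = damage_score(labels, depths)
```

With these toy inputs the distant debris pixels dominate the score, which is the kind of effect a depth adjustment is meant to capture when nearby and faraway structures appear at different pixel scales.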

[CV-14] Grounding Intelligence in Movement

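【Quick read】: This paper addresses the shortcomings of current AI in modeling movement in biological systems: although machine learning has made substantial progress on language, vision, and other high-dimensional data, it still struggles with movement, one of the most fundamental aspects of biological systems. The key to its proposal is to treat movement as a primary modeling target for AI rather than an afterthought. Movement is inherently structured and constrained by the body and by physics, so it admits low-dimensional representations (e.g., pose) that are more interpretable and computationally tractable. Models that learn from and generalize across diverse movement data would advance core capabilities in generative modeling and control and provide a shared foundation for understanding behavior in biological and artificial systems.

Link: https://arxiv.org/abs/2507.02771
Authors: Melanie Segado, Felipe Parodi, Jordan K. Matelsky, Michael L. Platt, Eva B. Dyer, Konrad P. Kording
Affiliations: University of Pennsylvania
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 9 pages, 2 figures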

Abstract:Recent advances in machine learning have dramatically improved our ability to model language, vision, and other high-dimensional data, yet they continue to struggle with one of the most fundamental aspects of biological systems: movement. Across neuroscience, medicine, robotics, and ethology, movement is essential for interpreting behavior, predicting intent, and enabling interaction. Despite its core significance in our intelligence, movement is often treated as an afterthought rather than as a rich and structured modality in its own right. This reflects a deeper fragmentation in how movement data is collected and modeled, often constrained by task-specific goals and domain-specific assumptions. But movement is not domain-bound. It reflects shared physical constraints, conserved morphological structures, and purposeful dynamics that cut across species and settings. We argue that movement should be treated as a primary modeling target for AI. It is inherently structured and grounded in embodiment and physics. This structure, often allowing for compact, lower-dimensional representations (e.g., pose), makes it more interpretable and computationally tractable to model than raw, high-dimensional sensory inputs. Developing models that can learn from and generalize across diverse movement data will not only advance core capabilities in generative modeling and control, but also create a shared foundation for understanding behavior across biological and artificial systems. Movement is not just an outcome, it is a window into how intelligent systems engage with the world.

[CV-15] Partial Weakly-Supervised Oriented Object Detection

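【Quick read】: This paper aims to reduce the high cost of dataset annotation in oriented object detection (OOD). Mainstream OOD algorithms fall into fully supervised, semi-supervised, and weakly supervised categories, all of which are limited by annotation speed or cost. The key to its solution is the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework, based on partially weak annotations (horizontal boxes or single points), which leverages large amounts of unlabeled data to improve detection performance at lower annotation cost. It also introduces an Orientation-and-Scale-aware Student (OS-Student) model and a Class-Agnostic Pseudo-Label Filtering (CPF) strategy to strengthen learning of orientation and scale information and to reduce reliance on static filtering thresholds.

Link: https://arxiv.org/abs/2507.02751
Authors: Mingxin Liu, Peiyuan Zhang, Yuan Liu, Wei Zhang, Yue Zhou, Ning Liao, Ziyang Gong, Junwei Luo, Zhirui Wang, Yi Yu, Xue Yang
Affiliations: Shanghai Jiao Tong University; Wuhan University; East China Normal University; Aerospace Information Research Institute; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 4 tables, source code: this https URL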

Abstract:The growing demand for oriented object detection (OOD) across various domains has driven significant research in this area. However, the high cost of dataset annotation remains a major concern. Current mainstream OOD algorithms can be mainly categorized into three types: (1) fully supervised methods using complete oriented bounding box (OBB) annotations, (2) semi-supervised methods using partial OBB annotations, and (3) weakly supervised methods using weak annotations such as horizontal boxes or points. However, these algorithms inevitably increase the cost of models in terms of annotation speed or annotation cost. To address this issue, we propose: (1) the first Partial Weakly-Supervised Oriented Object Detection (PWOOD) framework based on partially weak annotations (horizontal boxes or single points), which can efficiently leverage large amounts of unlabeled data, significantly outperforms weakly supervised algorithms trained with partially weak annotations, and offers a lower-cost solution; (2) an Orientation-and-Scale-aware Student (OS-Student) model capable of learning orientation and scale information with only a small amount of orientation-agnostic or scale-agnostic weak annotations; and (3) a Class-Agnostic Pseudo-Label Filtering (CPF) strategy to reduce the model’s sensitivity to static filtering thresholds. Comprehensive experiments on DOTA-v1.0/v1.5/v2.0 and DIOR datasets demonstrate that our PWOOD framework performs comparably to, or even surpasses, traditional semi-supervised algorithms.

[CV-16] Linear Attention with Global Context: A Multipole Attention Mechanism for Vision and Physics ICCV2025

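【Quick read】: This paper addresses the impracticality of standard Transformers on high-resolution inputs, whose memory and time costs grow quadratically with input length. The key to its solution is the Multipole Attention Neural Operator (MANO), which treats attention as an interaction problem between grid points and computes it in a distance-based multiscale fashion, so that each attention head keeps a global receptive field while time and memory scale linearly with the number of grid points.

Link: https://arxiv.org/abs/2507.02748
Authors: Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen
Affiliations: Miles Team, LAMSADE, Université Paris Dauphine-PSL, Paris, France; ESPCI PSL, Paris, France
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at ECLR Workshop at ICCV 2025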

Abstract:Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in n-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at this https URL.

[CV-17] DexVLG: Dexterous Vision-Language-Grasp Model at Scale

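【Quick read】: This paper addresses how to use large models to predict semantically grounded grasp poses for human-like dexterous hands; because data collection is difficult, prior work has mostly been limited to controlling simple grippers. The key to its solution is DexGraspNet 3.0, a large-scale simulated dataset of 170 million dexterous grasp poses paired with descriptions of the corresponding semantic parts, which is used to train a Vision-Language-Grasp model together with a flow-matching-based grasp pose head to produce grasps aligned with language instructions.

Link: https://arxiv.org/abs/2507.02747
Authors: Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang
Affiliations: Institution1; Institution2
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: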

Abstract:As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG’s performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG’s strong zero-shot generalization capabilities, achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on physical objects in real-world scenarios.

[CV-18] Prompt learning with bounding box constraints for medical image segmentation

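【Quick read】: This paper addresses the labor-intensive and costly nature of pixel-level annotation in the medical domain by proposing a weakly supervised segmentation method that reduces dependence on precise annotations. The key to its solution is to combine the representational power of foundation models with the annotation efficiency of weakly supervised segmentation: prompts are generated automatically from bounding box annotations alone, and training is optimized with pseudo-labels produced by the prompted foundation model together with multiple constraints derived from the box annotations.

Link: https://arxiv.org/abs/2507.02743
Authors: Mélanie Gaillochet, Mehrdad Noori, Sahar Dastani, Christian Desrosiers, Hervé Lombaert
Affiliations: École de Technologie Supérieure; Mila - Quebec AI Institute; Polytechnique Montréal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Biomedical Engineering (TBME), 14 pages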

Abstract:Pixel-wise annotations are notoriously laborious and costly to obtain in the medical domain. To mitigate this burden, weakly supervised approaches based on bounding box annotations, which are much easier to acquire, offer a practical alternative. Vision foundation models have recently shown noteworthy segmentation performance when provided with prompts such as points or bounding boxes. Prompt learning exploits these models by adapting them to downstream tasks and automating segmentation, thereby reducing user intervention. However, existing prompt learning approaches depend on fully annotated segmentation masks. This paper proposes a novel framework that combines the representational power of foundation models with the annotation efficiency of weakly supervised segmentation. More specifically, our approach automates prompt generation for foundation models using only bounding box annotations. Our proposed optimization scheme integrates multiple constraints derived from box annotations with pseudo-labels generated by the prompted foundation model. Extensive experiments across multimodal datasets reveal that our weakly supervised method achieves an average Dice score of 84.90% in a limited data setting, outperforming existing fully-supervised and weakly-supervised approaches. The code is available at this https URL
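
As a concrete reference for the metric and the kind of signal a box annotation can provide, here is a minimal NumPy sketch. `dice_score` is the standard Dice coefficient on binary masks; `outside_box_fraction` is a hypothetical, simplified stand-in for a box-derived constraint (foreground should not fall outside the box) and is not claimed to be one of the constraints the paper uses.

```python
import numpy as np


def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float((2.0 * inter + eps) / (pred.sum() + target.sum() + eps))


def outside_box_fraction(pred: np.ndarray, box: tuple) -> float:
    """Fraction of predicted foreground lying outside the box.

    box = (r0, c0, r1, c1), half-open in rows and columns.
    Hypothetical penalty: 0.0 means the prediction respects the box.
    """
    r0, c0, r1, c1 = box
    pred = pred.astype(bool)
    total = pred.sum()
    if total == 0:
        return 0.0
    inside = pred[r0:r1, c0:c1].sum()
    return float((total - inside) / total)


pred = np.zeros((4, 4), dtype=int)
pred[1:3, 1:3] = 1                      # 2x2 predicted blob
target = np.zeros((4, 4), dtype=int)
target[1, 1:3] = 1                      # ground truth covers the top row of the blob
print(dice_score(pred, target), outside_box_fraction(pred, (1, 1, 3, 3)))
```

In a training loop the second quantity would be computed on soft probabilities rather than hard masks so that it stays differentiable.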

[CV-19] FairHuman: Boosting Hand and Face Quality in Human Image Generation with Minimum Potential Delay Fairness in Diffusion Models ICCV2025

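【Quick read】: This paper addresses the poor quality of local details, especially faces and hands, in generated human images, which stems mainly from insufficient supervision of local regions during training. The key to its solution is FairHuman, a multi-objective fine-tuning method that constructs three learning objectives (a global objective based on the default diffusion objective function, and two local objectives for hands and faces based on pre-annotated positional priors) and derives the optimal parameter update strategy under the Minimum Potential Delay (MPD) criterion, so that global and local generation quality are optimized fairly.

Link: https://arxiv.org/abs/2507.02714
Authors: Yuxuan Wang, Tianwei Cao, Huayu Zhang, Zhongjiang He, Kongming Liang, Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications; Institute of Artificial Intelligence, China Telecom
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025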

Abstract:Image generation has achieved remarkable progress with the development of large-scale text-to-image models, especially diffusion-based models. However, generating human images with plausible details, such as faces or hands, remains challenging due to insufficient supervision of local regions during training. To address this issue, we propose FairHuman, a multi-objective fine-tuning approach designed to enhance both global and local generation quality fairly. Specifically, we first construct three learning objectives: a global objective derived from the default diffusion objective function and two local objectives for hands and faces based on pre-annotated positional priors. Subsequently, we derive the optimal parameter updating strategy under the guidance of the Minimum Potential Delay (MPD) criterion, thereby attaining fairness-aware optimization for this multi-objective problem. Based on this, our proposed method can achieve significant improvements in generating challenging local details while maintaining overall quality. Extensive experiments showcase the effectiveness of our method in improving the performance of human image generation under different scenarios.
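
For reference, minimum potential delay fairness is usually written as a special case of the α-fair utility family from network resource allocation. The block below states that standard definition. Reading $x_i$ as the per-objective quantity being balanced (global, hand, face) is an assumption made here for illustration; the abstract does not say how the paper maps the criterion onto parameter updates.

```latex
% alpha-fair utility (standard definition); x_i > 0 is the quantity
% allocated to objective i (interpretation assumed, see lead-in)
U_\alpha(x) =
\begin{cases}
  \dfrac{x^{1-\alpha}}{1-\alpha}, & \alpha \ge 0,\ \alpha \ne 1,\\[4pt]
  \log x, & \alpha = 1.
\end{cases}

% Minimum potential delay (MPD) fairness is the case alpha = 2:
\max_{x}\ \sum_{i} U_2(x_i)
  \;=\; \max_{x}\ \sum_{i} \left(-\frac{1}{x_i}\right)
\quad\Longleftrightarrow\quad
\min_{x}\ \sum_{i} \frac{1}{x_i},
\qquad
\frac{\mathrm{d} U_2}{\mathrm{d} x_i} = \frac{1}{x_i^{2}}.
```

Because the marginal utility is $1/x_i^2$, an objective with a small $x_i$ receives a much larger push than one that is already doing well, which is the sense in which the criterion is fair.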

[CV-20] UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

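【Quick read】: This paper addresses the difficulty of controlling keypoint-guided text-to-image diffusion models when generating non-rigid objects (such as animals) and multiple overlapping humans and animals. Existing methods are limited in such complex scenes, mainly because of inherent limitations of current controllable methods and the lack of suitable annotated datasets. The key to its solution is UniMC, a unified controllable multi-class image generation framework built on DiT, which integrates instance-level and keypoint-level conditions into compact tokens carrying attributes such as class, bounding box, and keypoint coordinates; this overcomes the difficulty earlier methods had in distinguishing instances and classes because they relied on skeleton images as conditions. The paper also proposes HAIG-2.9M, a large-scale, high-quality, and diverse dataset of 786K images and 2.9M instances annotated with keypoints, bounding boxes, and fine-grained captions, with manual inspection to ensure annotation accuracy.

Link: https://arxiv.org/abs/2507.02713
Authors: Qin Guo, Ailing Zeng, Dongxu Yue, Ceyuan Yang, Yang Cao, Hanzhong Guo, Fei Shen, Wei Liu, Xihui Liu, Dan Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: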

Abstract:Although keypoint-guided Text-to-Image diffusion models have advanced significantly, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based solely on keypoint controls. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.

[CV-21] SIU3R: Simultaneous Scene Understanding and 3D Reconstruction Beyond Feature Alignment

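【Quick read】: This paper addresses a performance bottleneck of simultaneous understanding and 3D reconstruction in end-to-end embodied intelligent systems: existing methods rely on a 2D-to-3D feature alignment paradigm, which limits 3D understanding and loses semantic information. The key to its solution is SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction, which connects the reconstruction and understanding tasks through a pixel-aligned 3D representation and unifies multiple understanding tasks into a set of learnable queries, giving native 3D understanding without alignment to 2D models.

Link: https://arxiv.org/abs/2507.02705
Authors: Qi Xu, Dongxu Wei, Lingzhe Zhao, Wenpu Li, Zhangchi Huang, Shunping Ji, Peidong Liu
Affiliations: Westlake University; Wuhan University; Westlake Institute for Advanced Study; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: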

Abstract:Simultaneous understanding and 3D reconstruction plays an important role in developing end-to-end embodied intelligent systems. To achieve this, recent approaches resort to a 2D-to-3D feature alignment paradigm, which leads to limited 3D understanding capability and potential semantic information loss. In light of this, we propose SIU3R, the first alignment-free framework for generalizable simultaneous understanding and 3D reconstruction from unposed images. Specifically, SIU3R bridges reconstruction and understanding tasks via a pixel-aligned 3D representation, and unifies multiple understanding tasks into a set of unified learnable queries, enabling native 3D understanding without the need for alignment with 2D models. To encourage collaboration between the two tasks with a shared representation, we further conduct in-depth analyses of their mutual benefits, and propose two lightweight modules to facilitate their interaction. Extensive experiments demonstrate that our method achieves state-of-the-art performance not only on the individual tasks of 3D reconstruction and understanding, but also on the task of simultaneous understanding and 3D reconstruction, highlighting the advantages of our alignment-free framework and the effectiveness of the mutual benefit designs.

[CV-22] CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation ICCV

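【Quick read】: This paper addresses two main challenges in video face swapping: transferring the source identity to the target video effectively, and accurately preserving the dynamic attributes of the target face, such as head pose, facial expression, and lip-sync. Existing methods achieve high-quality identity transfer but fall short in preserving these dynamic attributes, which leads to inconsistent results. The authors attribute this to the inherent coupling of facial appearance and motion in videos. The key to their solution is the CanonSwap framework, which decouples motion from appearance: identity is first modified in a unified canonical space, and the swapped features are then reintegrated into the original video space so that the target face's dynamic attributes are preserved. A Partial Identity Modulation module is also designed for precise identity transfer with fewer artifacts and greater realism.

Link: https://arxiv.org/abs/2507.02691
Authors: Xiangyang Luo, Ye Zhu, Yunfei Liu, Lijian Lin, Cong Wan, Zijian Cai, Shao-Lun Huang, Yu Li
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; International Digital Economy Academy (IDEA); Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV Accepted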

Abstract:Video face swapping aims to address two primary challenges: effectively transferring the source identity to the target video and accurately preserving the dynamic attributes of the target face, such as head poses, facial expressions, lip-sync, etc. Existing methods mainly focus on achieving high-quality identity transfer but often fall short in maintaining the dynamic attributes of the target face, leading to inconsistent results. We attribute this issue to the inherent coupling of facial appearance and motion in videos. To address this, we propose CanonSwap, a novel video face-swapping framework that decouples motion information from appearance information. Specifically, CanonSwap first eliminates motion-related information, enabling identity modification within a unified canonical space. Subsequently, the swapped feature is reintegrated into the original video space, ensuring the preservation of the target face’s dynamic attributes. To further achieve precise identity transfer with minimal artifacts and enhanced realism, we design a Partial Identity Modulation module that adaptively integrates source identity features using a spatial mask to restrict modifications to facial regions. Additionally, we introduce several fine-grained synchronization metrics to comprehensively evaluate the performance of video face swapping methods. Extensive experiments demonstrate that our method significantly outperforms existing approaches in terms of visual quality, temporal consistency, and identity preservation. Our project page is publicly available at this https URL.

[CV-23] APT: Adaptive Personalized Training for Diffusion Models with Limited Data CVPR2025

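【Quick read】: This paper addresses overfitting, loss of prior knowledge, and degraded text alignment when personalizing diffusion models with limited data. The key to its solution is Adaptive Personalized Training (APT), a framework that mitigates overfitting through adaptive training strategies and regularization of the model's internal representations during fine-tuning. APT has three core components: adaptive training adjustment, representation stabilization, and attention alignment for preserving prior knowledge, which together improve the quality and diversity of images generated from a small amount of reference data.

Link: https://arxiv.org/abs/2507.02687
Authors: JungWoo Chae, Jiyoon Kim, JaeWoong Choi, Kyungyul Kim, Sangheum Hwang
Affiliations: LG CNS AI Research; Seoul National University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: CVPR 2025 camera ready. Project page: this https URL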

Abstract:Personalizing diffusion models using limited data presents significant challenges, including overfitting, loss of prior knowledge, and degradation of text alignment. Overfitting leads to shifts in the noise prediction distribution, disrupting the denoising trajectory and causing the model to lose semantic coherence. In this paper, we propose Adaptive Personalized Training (APT), a novel framework that mitigates overfitting by employing adaptive training strategies and regularizing the model’s internal representations during fine-tuning. APT consists of three key components: (1) Adaptive Training Adjustment, which introduces an overfitting indicator to detect the degree of overfitting at each time step bin and applies adaptive data augmentation and adaptive loss weighting based on this indicator; (2) Representation Stabilization, which regularizes the mean and variance of intermediate feature maps to prevent excessive shifts in noise prediction; and (3) Attention Alignment for Prior Knowledge Preservation, which aligns the cross-attention maps of the fine-tuned model with those of the pretrained model to maintain prior knowledge and semantic coherence. Through extensive experiments, we demonstrate that APT effectively mitigates overfitting, preserves prior knowledge, and outperforms existing methods in generating high-quality, diverse images with limited reference data.
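
The Representation Stabilization component regularizes the mean and variance of intermediate feature maps. One plausible reading of that idea is sketched below in NumPy: penalize the gap between per-channel statistics of the fine-tuned and pretrained features. The function name `stat_regularizer`, the squared-error form, and the use of the pretrained model as the reference are assumptions for illustration, not the paper's stated loss.

```python
import numpy as np


def stat_regularizer(feat_ft: np.ndarray, feat_pre: np.ndarray) -> float:
    """Penalty on drift of per-channel mean and variance.

    feat_ft, feat_pre: feature maps of shape (N, C, H, W) from the
    fine-tuned and the pretrained model for the same inputs.
    Assumed form: mean squared gap in channel means plus mean squared
    gap in channel variances.
    """
    axes = (0, 2, 3)                                   # reduce over batch and space
    mean_gap = feat_ft.mean(axis=axes) - feat_pre.mean(axis=axes)
    var_gap = feat_ft.var(axis=axes) - feat_pre.var(axis=axes)
    return float((mean_gap ** 2).mean() + (var_gap ** 2).mean())


rng = np.random.default_rng(0)
pretrained = rng.standard_normal((2, 4, 8, 8))
print(stat_regularizer(pretrained, pretrained))        # no drift
print(stat_regularizer(pretrained + 1.0, pretrained))  # every channel mean shifted by 1
```

In practice such a term would be added to the diffusion training loss with a small weight and computed with the deep learning framework's tensors so that gradients flow into the fine-tuned model only.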

[CV-24] Learning few-step posterior samplers by unfolding and distillation of diffusion models

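【Quick read】: This paper addresses how to use diffusion models (DMs) efficiently as image priors in Bayesian computational imaging. Existing approaches fall into two groups: Plug-and-Play methods, which are flexible but rely on approximations, and specialized conditional DMs, which are more accurate but less flexible. The key to its solution is to combine deep unfolding with model distillation to turn a DM image prior into a few-step conditional model for posterior sampling. Its central innovation is unfolding a Markov chain Monte Carlo (MCMC) algorithm, specifically the LATINO Langevin sampler, which the authors describe as the first known application of deep unfolding to a Monte Carlo sampling scheme. Experiments show that the method achieves high accuracy and computational efficiency while remaining flexible.

Link: https://arxiv.org/abs/2507.02686
Authors: Charlesquin Kemajou Mbakam, Jonathan Spence, Marcelo Pereyra
Affiliations: Heriot-Watt University; University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 28 pages, 16 figures, 10 tables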

Abstract:Diffusion models (DMs) have emerged as powerful image priors in Bayesian computational imaging. Two primary strategies have been proposed for leveraging DMs in this context: Plug-and-Play methods, which are zero-shot and highly flexible but rely on approximations; and specialized conditional DMs, which achieve higher accuracy and faster inference for specific tasks through supervised training. In this work, we introduce a novel framework that integrates deep unfolding and model distillation to transform a DM image prior into a few-step conditional model for posterior sampling. A central innovation of our approach is the unfolding of a Markov chain Monte Carlo (MCMC) algorithm - specifically, the recently proposed LATINO Langevin sampler (Spagnoletti et al., 2025) - representing the first known instance of deep unfolding applied to a Monte Carlo sampling scheme. We demonstrate our proposed unfolded and distilled samplers through extensive experiments and comparisons with the state of the art, where they achieve excellent accuracy and computational efficiency, while retaining the flexibility to adapt to variations in the forward model at inference time.
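
To make "unfolding a Langevin sampler" more concrete, the sketch below shows the plain unadjusted Langevin algorithm (ULA), whose fixed sequence of update steps is the kind of iteration that deep unfolding turns into network layers. This is not the LATINO sampler, whose details are not given in the abstract; the Gaussian target and all parameter values are chosen here only so the result is easy to check.

```python
import numpy as np


def ula_sample(grad_log_p, x0, step, n_steps, rng):
    """Unadjusted Langevin algorithm.

    Iterates x <- x + step * grad_log_p(x) + sqrt(2 * step) * noise,
    with standard normal noise, for a fixed number of steps.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise
    return x


# Toy target (assumption for the demo): a 1-D Gaussian N(mu, 1),
# for which grad log p(x) = -(x - mu). 5000 independent chains run in parallel.
mu = 3.0
rng = np.random.default_rng(0)
samples = ula_sample(lambda x: -(x - mu), np.zeros(5000), step=0.05, n_steps=500, rng=rng)
print(samples.mean(), samples.var())
```

The sample mean lands close to `mu`, while the variance is slightly above 1 because ULA has a step-size-dependent bias. A few-step unfolded sampler, as described in the abstract, would replace the hundreds of iterations here with a handful of learned steps.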

[CV-25] Real-time Image-based Lighting of Glints

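【Quick read】: This paper addresses accurate real-time lighting, under complex environment illumination, of materials with a sparkling or glittering appearance, whose surfaces are covered with discrete microfacets that make reflections glint randomly. The key to its solution is an efficient approximation for image-based lighting that builds on real-time glint rendering under area light illumination and uses standard environment map filtering techniques fast enough to run every frame. The environment map is partitioned into a few homogeneous regions, and the corresponding indicator functions are filtered with the normal distribution function to obtain the probability that each microfacet reflects light from each region. During shading, these probabilities are used to hierarchically sample a multinomial distribution with a new dual-gated Gaussian approximation of binomial distributions, which supports fully dynamic material properties and environment maps.

Link: https://arxiv.org/abs/2507.02674
Authors: Tom Kneiphof, Reinhard Klein
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: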

Abstract:Image-based lighting is a widely used technique to reproduce shading under real-world lighting conditions, especially in real-time rendering applications. A particularly challenging scenario involves materials exhibiting a sparkling or glittering appearance, caused by discrete microfacets scattered across their surface. In this paper, we propose an efficient approximation for image-based lighting of glints, enabling fully dynamic material properties and environment maps. Our novel approach is grounded in real-time glint rendering under area light illumination and employs standard environment map filtering techniques. Crucially, our environment map filtering process is sufficiently fast to be executed on a per-frame basis. Our method assumes that the environment map is partitioned into few homogeneous regions of constant radiance. By filtering the corresponding indicator functions with the normal distribution function, we obtain the probabilities for individual microfacets to reflect light from each region. During shading, these probabilities are utilized to hierarchically sample a multinomial distribution, facilitated by our novel dual-gated Gaussian approximation of binomial distributions. We validate that our real-time approximation is close to ground-truth renderings for a range of material properties and lighting conditions, and demonstrate robust and stable performance, with little overhead over rendering glints from a single directional light. Compared to rendering smooth materials without glints, our approach requires twice as much memory to store the prefiltered environment map.
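
Sampling a multinomial through a sequence of binomial draws is a standard decomposition, and it is the step where a fast binomial approximation would plug in. The sketch below uses a simple sequential chain and NumPy's exact binomial sampler; the paper's hierarchy and its dual-gated Gaussian approximation are not reproduced here, and the function name and example probabilities are hypothetical.

```python
import numpy as np


def multinomial_via_binomials(n: int, probs, rng) -> np.ndarray:
    """Draw multinomial counts using conditional binomial draws.

    Category i receives Binomial(remaining_n, p_i / remaining_p) items,
    where remaining_p is the probability mass not yet consumed.
    The last category takes whatever is left.
    """
    probs = np.asarray(probs, dtype=float)
    counts = np.zeros(len(probs), dtype=int)
    remaining_n = int(n)
    remaining_p = 1.0
    for i, p in enumerate(probs[:-1]):
        if remaining_n == 0:
            break
        # Conditional probability of category i given the earlier ones were not chosen.
        cond = min(1.0, p / remaining_p) if remaining_p > 0 else 0.0
        counts[i] = rng.binomial(remaining_n, cond)
        remaining_n -= int(counts[i])
        remaining_p -= p
    counts[-1] = remaining_n
    return counts


# Example: 1000 microfacets, three environment regions with assumed probabilities.
rng = np.random.default_rng(0)
print(multinomial_via_binomials(1000, [0.2, 0.3, 0.5], rng))
```

On a GPU the exact binomial draw is the expensive part, which is presumably why the paper replaces it with a Gaussian-based approximation.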

[CV-26] MISCGrasp: Leveraging Multiple Integrated Scales and Contrastive Learning for Enhanced Volumetric Grasping IROS

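【Quick read】: This paper addresses the limited adaptability of robotic grasping to objects of different shapes and sizes. The key to its solution is MISCGrasp, which integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. The Insight Transformer provides query-based interaction between high-level and low-level features, while the Empower Transformer selectively attends to the highest-level features, balancing attention between fine geometric details and overall geometric structure. Multi-scale contrastive learning is used to exploit similarities among positive samples and keep multi-scale features consistent.

Link: https://arxiv.org/abs/2507.02672
Authors: Qingyu Fan, Yinghao Cai, Chao Li, Chunting Jiao, Xudong Zheng, Tao Lu, Bin Liang, Shuo Wang
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Qiyuan Lab
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025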

Abstract:Robotic grasping faces challenges in adapting to objects with varying shapes and sizes. In this paper, we introduce MISCGrasp, a volumetric grasping method that integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. We propose a query-based interaction between high-level and low-level features through the Insight Transformer, while the Empower Transformer selectively attends to the highest-level features, which synergistically strikes a balance between focusing on fine geometric details and overall geometric structures. Furthermore, MISCGrasp utilizes multi-scale contrastive learning to exploit similarities among positive grasp samples, ensuring consistency across multi-scale features. Extensive experiments in both simulated and real-world environments demonstrate that MISCGrasp outperforms baseline and variant methods in tabletop decluttering tasks. More details are available at this https URL.

[CV-27] Embedding-Based Federated Data Sharing via Differentially Private Conditional VAEs MICCAI2025

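【Quick read】: This paper addresses the limited data access that deep learning (DL) faces in medical imaging because of data scarcity and privacy regulations, as well as the high communication cost and single-downstream-task restriction of federated learning (FL). The key to its solution is data sharing through Differentially Private (DP) generative models: foundation models extract compact, informative embeddings to reduce redundancy and computational overhead, and clients collaboratively train a Differentially Private Conditional Variational Autoencoder (DP-CVAE) that models a global, privacy-aware data distribution and supports diverse downstream tasks.

Link: https://arxiv.org/abs/2507.02671
Authors: Francesco Di Salvo, Hanh Huyen My Nguyen, Christian Ledig
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Accepted to MICCAI 2025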

Abstract:Deep Learning (DL) has revolutionized medical imaging, yet its adoption is constrained by data scarcity and privacy regulations, limiting access to diverse datasets. Federated Learning (FL) enables decentralized training but suffers from high communication costs and is often restricted to a single downstream task, reducing flexibility. We propose a data-sharing method via Differentially Private (DP) generative models. By adopting foundation models, we extract compact, informative embeddings, reducing redundancy and lowering computational overhead. Clients collaboratively train a Differentially Private Conditional Variational Autoencoder (DP-CVAE) to model a global, privacy-aware data distribution, supporting diverse downstream tasks. Our approach, validated across multiple feature extractors, enhances privacy, scalability, and efficiency, outperforming traditional FL classifiers while ensuring differential privacy. Additionally, DP-CVAE produces higher-fidelity embeddings than DP-CGAN while requiring 5× fewer parameters.

[CV-28] AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models ICCV2025

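【Quick read】: This paper addresses two core problems in AI-generated image (AIGI) detection: the lack of human-verifiable explanations and poor generalization to the latest generation techniques. The key to its solution is Holmes-Set, a large-scale, comprehensive dataset comprising Holmes-SFTSet, an instruction-tuning dataset with explanations, and Holmes-DPOSet, a human-aligned preference dataset, built with an efficient Multi-Expert Jury annotation method that improves data quality through structured multimodal large language model (MLLM) explanations and quality control. The paper also proposes Holmes Pipeline, a three-stage training framework combining visual expert pre-training, supervised fine-tuning, and direct preference optimization, which adapts MLLMs to AIGI detection and generates verifiable explanations, yielding the model AIGI-Holmes.

Link: https://arxiv.org/abs/2507.02664
Authors: Ziyin Zhou, Yunpeng Luo, Yuanchen Wu, Ke Sun, Jiayi Ji, Ke Yan, Shouhong Ding, Xiaoshuai Sun, Yunsheng Wu, Rongrong Ji
Affiliations: Xiamen University; Tencent YouTu Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025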

Abstract:The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) a lack of generalization in the latest generation technology. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of our AIGI-Holmes.

[CV-29] Fair Deepfake Detectors Can Generalize

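【Quick read】: This paper addresses the poor generalization of deepfake detection models to unseen manipulations and their unfairness across demographic groups. Existing methods usually exhibit an inherent conflict, that is, a trade-off, between these two goals. The paper is the first to uncover and formally define a causal relationship between fairness and generalization; its key point is that controlling for confounders (data distribution and model capacity) allows fairness interventions to improve generalization. Building on this insight, the authors propose the Demographic Attribute-insensitive Intervention Detection (DAID) framework, which consists of (i) demographic-aware data rebalancing, using inverse-propensity weighting and subgroup-wise feature normalization to remove distributional bias, and (ii) demographic-agnostic feature aggregation, using a new alignment loss to suppress sensitive-attribute signals.

Link: https://arxiv.org/abs/2507.02645
Authors: Harry Cheng, Ming-Hui Liu, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli
Affiliations: National University of Singapore; Shandong University; Harbin Institute of Technology (Shenzhen)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, version 1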

Abstract:Deepfake detection models face two critical challenges: generalization to unseen manipulations and demographic fairness among population groups. However, existing approaches often demonstrate that these two objectives are inherently conflicting, revealing a trade-off between them. In this paper, we, for the first time, uncover and formally define a causal relationship between fairness and generalization. Building on the back-door adjustment, we show that controlling for confounders (data distribution and model capacity) enables improved generalization via fairness interventions. Motivated by this insight, we propose Demographic Attribute-insensitive Intervention Detection (DAID), a plug-and-play framework composed of: i) Demographic-aware data rebalancing, which employs inverse-propensity weighting and subgroup-wise feature normalization to neutralize distributional biases; and ii) Demographic-agnostic feature aggregation, which uses a novel alignment loss to suppress sensitive-attribute signals. Across three cross-domain benchmarks, DAID consistently achieves superior performance in both fairness and generalization compared to several state-of-the-art detectors, validating both its theoretical foundation and practical effectiveness.
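
Inverse-propensity weighting, named in the data rebalancing step, can be illustrated in its simplest form: weight each sample by the inverse of how common its demographic group is. The sketch below estimates the propensity as the empirical group frequency, which is an assumption made for brevity; the paper may estimate it differently, and the group labels are hypothetical.

```python
import numpy as np


def inverse_propensity_weights(groups) -> np.ndarray:
    """Per-sample weights proportional to 1 / P(group), normalized to mean 1.

    Assumption: the propensity of a sample is the empirical frequency
    of its demographic group in the dataset.
    """
    groups = np.asarray(groups)
    values, counts = np.unique(groups, return_counts=True)
    propensity = dict(zip(values.tolist(), (counts / len(groups)).tolist()))
    weights = np.array([1.0 / propensity[g] for g in groups.tolist()])
    return weights / weights.mean()


# Three samples from group "a", one from group "b" (hypothetical labels).
w = inverse_propensity_weights(["a", "a", "a", "b"])
print(w)
```

After weighting, every group contributes the same total weight to the loss, so the majority group no longer dominates training.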

[CV-30] L-VAE: Variational Auto-Encoder with Learnable Beta for Disentangled Representation

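[Quick Read]: This paper addresses the difficulty that conventional Variational Autoencoders (VAEs) have in balancing reconstruction fidelity against disentanglement of the latent space. The key to the solution is a Learnable VAE (L-VAE) that learns the weights of the loss terms together with the parameters of the model architecture, dynamically adjusting the trade-off between disentanglement and reconstruction losses, and adds a regularization term to prevent bias toward either loss. Compared with existing models such as β-VAE, the method achieves the best or second-best disentanglement performance on multiple datasets.

Link: https://arxiv.org/abs/2507.02619
Authors: Hazal Mogultay Ozcan, Sinan Kalkan, Fatos T. Yarman-Vural
Affiliations: Middle East Technical University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: The paper is under revision at Machine Vision and Applications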

Abstract:In this paper, we propose a novel model called Learnable VAE (L-VAE), which learns a disentangled representation together with the hyperparameters of the cost function. L-VAE can be considered as an extension of β-VAE, wherein the hyperparameter, β, is empirically adjusted. L-VAE mitigates the limitations of β-VAE by learning the relative weights of the terms in the loss function to control the dynamic trade-off between disentanglement and reconstruction losses. In the proposed model, the weights of the loss terms and the parameters of the model architecture are learned concurrently. An additional regularization term is added to the loss function to prevent bias towards either reconstruction or disentanglement losses. Experimental analyses show that the proposed L-VAE finds an effective balance between reconstruction fidelity and disentangling the latent dimensions. Comparisons of the proposed L-VAE against β-VAE, VAE, ControlVAE, DynamicVAE, and σ-VAE on datasets such as dSprites, MPI3D-complex, Falcor3D, and Isaac3D reveal that L-VAE consistently provides the best or the second best performance as measured by a set of disentanglement metrics. Moreover, qualitative experiments on the CelebA dataset confirm the success of the L-VAE model in disentangling facial attributes.
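For context, the fixed-weight baseline that L-VAE extends can be written down in a few lines. The sketch below shows only the standard β-VAE objective (reconstruction error plus β times the KL term for a diagonal Gaussian posterior); the squared-error reconstruction term is an assumption, and the learnable weighting and extra regularizer of L-VAE are not specified in the abstract, so they are deliberately left out.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """Squared-error reconstruction (an assumption) plus a beta-weighted KL term."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + beta * gaussian_kl(mu, logvar)


# Perfect reconstruction, posterior mean shifted by 1: loss = beta * 0.5
loss = beta_vae_loss([1.0, 2.0], [1.0, 2.0], mu=[1.0], logvar=[0.0], beta=4.0)
```

In β-VAE the value of `beta` is chosen by hand; per the abstract, L-VAE instead treats the relative weights of the two terms as quantities learned alongside the network parameters.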

[CV-31] Addressing Camera Sensors Faults in Vision-Based Navigation: Simulation and Dataset Development

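[Quick Read]: This paper targets the reliability and robustness of Vision-Based Navigation (VBN) algorithms in space missions when camera sensors fail. Sensor faults can cause navigation algorithms to produce inaccurate outputs or cause data processing to fail entirely, jeopardizing mission objectives. The study proposes Artificial Intelligence (AI) based fault detection to overcome the limitations of traditional fault detection techniques. The key to the solution is a simulation framework that recreates fault conditions in synthetically generated images, producing a fault-injected dataset for training and testing AI-based fault detection algorithms.

Link: https://arxiv.org/abs/2507.02602
Authors: Riccardo Gallon, Fabian Schiemenz, Alessandra Menicucci, Eberhard Gill
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to Acta Astronautica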

Abstract:The increasing importance of Vision-Based Navigation (VBN) algorithms in space missions raises numerous challenges in ensuring their reliability and operational robustness. Sensor faults can lead to inaccurate outputs from navigation algorithms or even complete data processing faults, potentially compromising mission objectives. Artificial Intelligence (AI) offers a powerful solution for detecting such faults, overcoming many of the limitations associated with traditional fault detection methods. However, the primary obstacle to the adoption of AI in this context is the lack of sufficient and representative datasets containing faulty image data. This study addresses these challenges by focusing on an interplanetary exploration mission scenario. A comprehensive analysis of potential fault cases in camera sensors used within the VBN pipeline is presented. The causes and effects of these faults are systematically characterized, including their impact on image quality and navigation algorithm performance, as well as commonly employed mitigation strategies. To support this analysis, a simulation framework is introduced to recreate faulty conditions in synthetically generated images, enabling a systematic and controlled reproduction of faulty data. The resulting dataset of fault-injected images provides a valuable tool for training and testing AI-based fault detection algorithms. The final link to the dataset will be added after an embargo period. For peer-reviewers, this private link is available.

[CV-32] AuroraLong: Bringing RNNs Back to Efficient Open-Ended Video Understanding ICCV2025

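[Quick Read]: This paper addresses the high computational complexity and prohibitive memory cost of long video understanding, which arise because the memory and computation required by Transformer-based large language models (LLMs) grow quadratically with input sequence length. The key to the solution is to replace the LLM component of multimodal large language models (MLLMs) with a linear recurrent neural network (linear RNN) language model, which handles input sequences of arbitrary length with constant-size hidden states, and to further raise throughput and efficiency by combining visual token merging with reordering of the visual tokens by size in ascending order.

Link: https://arxiv.org/abs/2507.02591
Authors: Weili Xu, Enxin Song, Wenhao Chai, Xuexiang Wen, Tian Ye, Gaoang Wang
Affiliations: Zhejiang University; University of Washington; HKUST (GZ)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025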

Abstract:The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequence of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To our best knowledge, we are the first to use a linear RNN based LLM backbone in a LLaVA-like model for open-ended video understanding.
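Two ideas from the abstract lend themselves to a toy illustration: a linear recurrence keeps a state of fixed size no matter how long the input is, and merged visual tokens can be reordered by size before being fed to it. The sketch below is a stand-in under stated assumptions, not AuroraLong's architecture: the scalar `decay` replaces the learned recurrence, and the `"size"` field (the number of patches merged into a token) is a hypothetical data layout.

```python
def linear_rnn(inputs, decay=0.9):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t over a sequence of vectors.

    The state has the same length as one input vector, however long the sequence is.
    """
    state = [0.0] * len(inputs[0])
    for x in inputs:
        state = [decay * h + (1.0 - decay) * xi for h, xi in zip(state, x)]
    return state


def order_merged_tokens(tokens):
    """Sort merged tokens by size (patches merged into each token), ascending."""
    return sorted(tokens, key=lambda t: t["size"])


tokens = [
    {"size": 4, "vec": [0.2, 0.1]},
    {"size": 1, "vec": [0.9, 0.3]},
    {"size": 2, "vec": [0.5, 0.5]},
]
ordered = order_merged_tokens(tokens)
final_state = linear_rnn([t["vec"] for t in ordered])
```

Memory for the state stays constant as the token list grows, in contrast to attention, whose cost grows quadratically with sequence length.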

[CV-33] Structure-aware Semantic Discrepancy and Consistency for 3D Medical Image Self-supervised Learning ICCV25

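[Quick Read]: This paper addresses inadequate representation learning in 3D medical image self-supervised learning (mSSL) caused by variations in the location, scale, and morphology of anatomical structures. Existing methods usually partition images into fixed-size patches and overlook these structural variations. The key to the solution is a framework named S^2DC that introduces structure-aware semantic discrepancy and consistency in two steps: first, it uses an optimal transport strategy to increase the semantic discrepancy between different patches; second, it improves semantic consistency at the structural level based on the neighborhood similarity distribution, yielding structure-aware representations.

Link: https://arxiv.org/abs/2507.02581
Authors: Tan Pan, Zhaorui Tan, Kaiyu Guo, Dongli Xu, Weidi Xu, Chen Jiang, Xin Guo, Yuan Qi, Yuan Cheng
Affiliations: Fudan University; Shanghai Academy of Artificial Intelligence for Science; The University of Queensland; Zhongshan Hospital, Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV25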

Abstract:3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named S^2DC , achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, S^2DC enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, S^2DC advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, S^2DC achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.

[CV-34] Parametric shape models for vessels learned from segmentations via differentiable voxelization

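[Quick Read]: This paper addresses the fact that the different representations of vascular structures (voxels, meshes, and parametric models) are typically extracted from segmentations and used separately from one another. The key to the solution is a framework that unifies the three representations under differentiable transformations. Using differentiable voxelization, it automatically extracts a parametric shape model of the vessels through shape-to-segmentation fitting, learning shape parameters without explicit ground-truth shape parameters. Vessels are parametrized by centerlines and radii using cubic B-splines, which guarantees smoothness and continuity, and high-fidelity meshes are extracted differentiably from the learned shape parameters, enabling geometrically accurate modeling.

Link: https://arxiv.org/abs/2507.02576
Authors: Alina F. Dima, Suprosanna Shit, Huaqi Qiu, Robbie Holland, Tamara T. Mueller, Fabio Antonio Musio, Kaiyuan Yang, Bjoern Menze, Rickmer Braren, Marcus Makowski, Daniel Rueckert
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures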

Abstract:Vessels are complex structures in the body that have been studied extensively in multiple representations. While voxelization is the most common of them, meshes and parametric models are critical in various applications due to their desirable properties. However, these representations are typically extracted through segmentations and used disjointly from each other. We propose a framework that joins the three representations under differentiable transformations. By leveraging differentiable voxelization, we automatically extract a parametric shape model of the vessels through shape-to-segmentation fitting, where we learn shape parameters from segmentations without the explicit need for ground-truth shape parameters. The vessel is parametrized as centerlines and radii using cubic B-splines, ensuring smoothness and continuity by construction. Meshes are differentiably extracted from the learned shape parameters, resulting in high-fidelity meshes that can be manipulated post-fit. Our method can accurately capture the geometry of complex vessels, as demonstrated by the volumetric fits in experiments on aortas, aneurysms, and brain vessels.
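The parametrization described above (centerline and radius both carried by cubic B-splines) can be sketched with the standard uniform cubic B-spline basis. The layout of a control point as `(x, y, z, radius)` and the function names are assumptions for illustration; the differentiable voxelization and the fitting loop are not shown.

```python
def cubic_bspline_point(ctrl, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1].

    ctrl holds four control points, each a tuple (x, y, z, radius).
    """
    b0 = (1 - u) ** 3 / 6
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6
    b3 = u**3 / 6
    # zip(*ctrl) groups the four x values, then the four y values, and so on.
    return tuple(
        b0 * p0 + b1 * p1 + b2 * p2 + b3 * p3 for p0, p1, p2, p3 in zip(*ctrl)
    )


def sample_vessel(ctrl_points, samples_per_segment=4):
    """Sample centerline positions and radii along every spline segment."""
    samples = []
    for s in range(len(ctrl_points) - 3):
        segment = ctrl_points[s : s + 4]
        for j in range(samples_per_segment):
            samples.append(cubic_bspline_point(segment, j / samples_per_segment))
    return samples


# A straight hypothetical vessel along x with constant radius 2.0
control = [(float(i), 0.0, 0.0, 2.0) for i in range(5)]
centerline = sample_vessel(control)
```

Because the four basis weights are smooth polynomials that sum to one, the sampled curve is continuous up to the second derivative, which is the smoothness "by construction" that the abstract refers to.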

[CV-35] Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning

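[Quick Read]: This paper addresses the recovery of plausible close human interactions from in-the-wild videos, where visual ambiguities and inter-person occlusions prevent existing human pose estimation methods from reconstructing plausible interactions. The key to the solution is to use human appearance as a direct cue: the authors propose a dual-branch optimization framework that constrains the reconstructed interactive motions with human appearance, social proxemics, and physical laws, together with several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations to improve reconstruction accuracy.

Link: https://arxiv.org/abs/2507.02565
Authors: Buzhen Huang, Chen Li, Chongyang Xu, Dongyue Lu, Jinnan Chen, Yangang Wang, Gim Hee Lee
Affiliations: Southeast University; National University of Singapore; Sichuan University; IHPC, Agency for Science, Technology and Research, Singapore; CFAR, Agency for Science, Technology and Research, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: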

Abstract:Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance can provide a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn the human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and diverse constraints, our method is capable of estimating accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data are available at this https URL.

[CV-36] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

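[Quick Read]: This paper addresses recovering a metric-scale 3D point map of a scene from a single image while preserving relative geometry accuracy and fine-grained detail. The key to the solution is to extend the earlier monocular geometry estimation method MoGe with effective strategies for metric geometry prediction that do not sacrifice the relative geometry accuracy provided by the affine-invariant point representation, and to introduce a unified data refinement approach that uses sharp synthetic labels to filter and complete real data from different sources, significantly improving the fine detail of the reconstructed geometry while maintaining overall accuracy.

Link: https://arxiv.org/abs/2507.02546
Authors: Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, Jiaolong Yang
Affiliations: USTC; Microsoft Research; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL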

Abstract:We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. We train our model on a large corpus of mixed datasets and conduct comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery, capabilities that no previous method has simultaneously achieved.

[CV-37] IMASHRIMP: Automatic White Shrimp (Penaeus vannamei) Biometrical Analysis from Laboratory Images Using Computer Vision and Deep Learning

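[Quick Read]: This paper addresses the automation of morphological analysis of white shrimp (Penaeus vannamei) in aquaculture in order to optimize genetic selection tasks. The key solution is the IMASHRIMP system, which adapts deep learning and computer vision techniques to the challenges of shrimp morphology analysis from RGBD images. It includes two classification modules based on a modified ResNet-50 architecture, one for view classification and one for rostrum integrity, combined with a “two-factor authentication” (human and AI) mechanism that lowers the human error rate. The system also integrates a pose estimation module and a morphological regression module for keypoint detection and pixel-to-centimeter conversion, improving the accuracy and efficiency of the analysis.

Link: https://arxiv.org/abs/2507.02519
Authors: Abiam Remache González, Meriem Chagour, Timon Bijan Rüth, Raúl Trapiella Cañedo, Marina Martínez Soler, Álvaro Lorenzo Felipe, Hyun-Suk Shin, María-Jesús Zamorano Serrano, Ricardo Torres, Juan-Antonio Castillo Parra, Eduardo Reyes Abad, Miguel-Ángel Ferrer Ballester, Juan-Manuel Afonso López, Francisco-Mario Hernández Tejera, Adrian Penate-Sanchez
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 7 figures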

Abstract:This paper introduces IMASHRIMP, an adapted system for the automated morphological analysis of white shrimp (Penaeus vannamei), aimed at optimizing genetic selection tasks in aquaculture. Existing deep learning and computer vision techniques were modified to address the specific challenges of shrimp morphology analysis from RGBD images. IMASHRIMP incorporates two discrimination modules, based on a modified ResNet-50 architecture, to classify images by the point of view and determine rostrum integrity. A proposed “two-factor authentication (human and AI)” system reduces human error in view classification from 0.97% to 0% and in rostrum detection from 12.46% to 3.64%. Additionally, a pose estimation module was adapted from VitPose to predict 23 key points on the shrimp’s skeleton, with separate networks for lateral and dorsal views. A morphological regression module, using a Support Vector Machine (SVM) model, was integrated to convert pixel measurements to centimeter units. Experimental results show that the system effectively reduces human error, achieving a mean average precision (mAP) of 97.94% for pose estimation and a pixel-to-centimeter conversion error of 0.07 (+/- 0.1) cm. IMASHRIMP demonstrates the potential to automate and accelerate shrimp morphological analysis, enhancing the efficiency of genetic selection and contributing to more sustainable aquaculture [...] code is available at this https URL

[CV-38] Detecting Multiple Diseases in Multiple Crops Using Deep Learning

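[Quick Read]: This paper addresses crop losses in Indian agriculture caused by diseases, pests, and environmental stress, aiming to improve yield and food security through early detection and accurate identification of diseases across crops. The key to the solution is a unified dataset covering 17 crops and 34 diseases, on which a deep learning model is trained. The model outperforms the state of the art in both accuracy and the number of crops and diseases covered, reaching 99% detection accuracy, 7 percent higher than an existing method that handles 14 crops and 26 diseases.

Link: https://arxiv.org/abs/2507.02517
Authors: Vivek Yadav, Anugrah Jain
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: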

Abstract:India, as a predominantly agrarian economy, faces significant challenges in agriculture, including substantial crop losses caused by diseases, pests, and environmental stress. Early detection and accurate identification of diseases across different crops are critical for improving yield and ensuring food security. This paper proposes a deep learning based solution for detecting multiple diseases in multiple crops, intended to cover India’s diverse agricultural landscape. We first create a unified dataset encompassing images of 17 different crops and 34 different diseases from various available repositories. The proposed deep learning model is trained on this dataset and outperforms the state of the art in terms of accuracy and the number of crops and diseases covered. We achieve a detection accuracy of 99 percent on our unified dataset, which is 7 percent higher than the state of the art, which handles only 14 crops and 26 different diseases. By increasing the number of crops and types of diseases that can be detected, the proposed solution aims to provide a better product for Indian farmers.

[CV-39] Automatic Labelling for Low-Light Pedestrian Detection

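[Quick Read]: This paper addresses pedestrian detection in RGB images under low-light conditions, a problem that existing public datasets do not adequately cover. The key to the solution is an automated infrared-RGB labeling pipeline consisting of infrared pedestrian detection, transfer of labels from the infrared detections to the corresponding RGB images, and training of low-light RGB pedestrian detection models on the generated labels. Experiments on the KAIST dataset show that models trained on the generated labels outperform those trained on ground-truth labels in 6 of 9 cases for the mAP@50 and mAP@50-95 metrics.

Link: https://arxiv.org/abs/2507.02513
Authors: Dimitrios Bouzoulas, Eerik Alamikkotervo, Risto Ojala
Affiliations: Aalto University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: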

Abstract:Pedestrian detection in RGB images is a key task in pedestrian safety, as the most common sensor in autonomous vehicles and advanced driver assistance systems is the RGB camera. A challenge in RGB pedestrian detection that does not appear to have large public datasets is low-light conditions. As a solution, in this research, we propose an automated infrared-RGB labeling pipeline. The proposed pipeline consists of 1) infrared detection, where a fine-tuned model for infrared pedestrian detection is used; 2) a label transfer process from the infrared detections to their RGB counterparts; and 3) training object detection models using the generated labels for low-light RGB pedestrian detection. The research was performed using the KAIST dataset. For the evaluation, object detection models were trained on the generated autolabels and ground truth labels. When compared on a previously unseen image sequence, the results showed that the models trained on generated labels outperformed the ones trained on ground-truth labels in 6 out of 9 cases for the mAP@50 and mAP@50-95 metrics. The source code for this research is available at this https URL
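The label transfer step of the pipeline can be sketched as follows. This is a hypothetical illustration, not the released code: it assumes the infrared and RGB frames are pixel-aligned so that boxes can be copied directly, that a confidence threshold filters weak detections, and that labels are written in a YOLO-style normalized format with a single pedestrian class `0`.

```python
def transfer_labels(ir_detections, img_w, img_h, min_conf=0.5):
    """Turn infrared detections into YOLO-style label lines for the paired RGB frame.

    Each detection is (x_min, y_min, x_max, y_max, confidence) in pixels.
    Assumes the infrared and RGB images share the same pixel grid.
    """
    lines = []
    for x0, y0, x1, y1, conf in ir_detections:
        if conf < min_conf:
            continue  # drop low-confidence boxes rather than propagate noise
        cx = (x0 + x1) / 2 / img_w
        cy = (y0 + y1) / 2 / img_h
        w = (x1 - x0) / img_w
        h = (y1 - y0) / img_h
        lines.append(f"0 {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    return lines


detections = [(100, 100, 200, 300, 0.9), (0, 0, 10, 10, 0.2)]  # made-up values
rgb_labels = transfer_labels(detections, img_w=640, img_h=512)
```

The resulting lines could then be saved next to the RGB frame and used to train an RGB detector, which corresponds to step 3 of the pipeline.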

[CV-40] MC-INR: Efficient Encoding of Multivariate Scientific Simulation Data using Meta-Learning and Clustered Implicit Neural Representations

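[Quick Read]: This paper addresses three main limitations of Implicit Neural Representations (INRs) on complex real-world datasets: inflexible representation of complex structures, a focus on single-variable data, and dependence on structured grids. The key to the solution is MC-INR, a neural network-based framework that handles multivariate data on unstructured grids. It combines meta-learning and clustering to encode complex structures flexibly, introduces a residual-based dynamic re-clustering mechanism that adaptively partitions clusters, and uses a branched layer to exploit multivariate data in parallel, improving performance on scientific data encoding tasks.

Link: https://arxiv.org/abs/2507.02494
Authors: Hyunsoo Son, Jeonghyun Noh, Suemin Jeon, Chaoli Wang, Won-Ki Jeong
Affiliations: Korea University; University of Notre Dame
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 5 pages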

Abstract:Implicit Neural Representations (INRs) are widely used to encode data as continuous functions, enabling the visualization of large-scale multivariate scientific simulation data with reduced memory usage. However, existing INR-based methods face three main limitations: (1) inflexible representation of complex structures, (2) primarily focusing on single-variable data, and (3) dependence on structured grids. Thus, their performance degrades when applied to complex real-world datasets. To address these limitations, we propose a novel neural network-based framework, MC-INR, which handles multivariate data on unstructured grids. It combines meta-learning and clustering to enable flexible encoding of complex structures. To further improve performance, we introduce a residual-based dynamic re-clustering mechanism that adaptively partitions clusters based on local error. We also propose a branched layer to leverage multivariate data through independent branches simultaneously. Experimental results demonstrate that MC-INR outperforms existing methods on scientific data encoding tasks.
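The idea of residual-based re-clustering (split a cluster when its local error is too high) can be sketched in a few lines. Everything specific here is an assumption for illustration: the abstract does not state the split rule, so this sketch uses one-dimensional positions, the mean residual as the trigger, and a split at the median position.

```python
def recluster(clusters, positions, errors, threshold):
    """Split every cluster whose mean residual exceeds `threshold` into two halves.

    clusters:  list of lists of point indices
    positions: 1-D coordinate of each point (a simplification)
    errors:    per-point reconstruction residual
    """
    result = []
    for members in clusters:
        mean_err = sum(errors[i] for i in members) / len(members)
        if mean_err > threshold and len(members) > 1:
            ordered = sorted(members, key=lambda i: positions[i])
            mid = len(ordered) // 2
            result.extend([ordered[:mid], ordered[mid:]])  # spatial halves
        else:
            result.append(list(members))
    return result


clusters = [[0, 1, 2, 3], [4, 5]]
positions = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
errors = [0.5, 0.6, 0.7, 0.8, 0.01, 0.02]   # made-up residuals
refined = recluster(clusters, positions, errors, threshold=0.1)
```

Repeating this after each training round gives poorly fitted regions more, smaller clusters, each of which would get its own small network in a clustered INR setup.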

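[CV-41] Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy MICCAI2025

[Quick Read]: This paper addresses automated polyp counting in colonoscopy, which involves detecting and tracking polyps and then clustering the tracklets that belong to the same polyp entity. Existing methods rely mainly on self-supervised learning and visual appearance alone, neglecting temporal relationships between tracklets in both the feature learning and clustering stages. The key to the solution is a supervised contrastive loss with temporally-aware soft targets that captures intra-polyp variability while preserving inter-polyp discriminability, making clustering more robust. In addition, a temporal adjacency constraint improves tracklet clustering by reducing false re-associations between visually similar but temporally distant tracklets.

Link: https://arxiv.org/abs/2507.02493
Authors: Luca Parolari, Andrea Cherubini, Lamberto Ballan, Carlo Biffi
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at MICCAI 2025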

Abstract:Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at this https URL.
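The effect of a temporal adjacency constraint on tracklet clustering can be shown with a small sketch. This is an illustration under assumptions, not the paper's algorithm: the greedy single-link merging, the frame-gap gate `max_gap`, and the threshold values are all hypothetical choices.

```python
def cluster_tracklets(tracklets, similarity, sim_thresh=0.8, max_gap=100):
    """Greedy single-link clustering of tracklets with a temporal gate.

    tracklets:  list of (start_frame, end_frame), sorted by start_frame
    similarity: function (i, j) -> visual similarity in [0, 1]
    Two tracklets are merged only if they look alike AND are close in time.
    """
    parent = list(range(len(tracklets)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tracklets)):
        for j in range(i + 1, len(tracklets)):
            gap = tracklets[j][0] - tracklets[i][1]
            if gap <= max_gap and similarity(i, j) >= sim_thresh:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(tracklets))]


tracklets = [(0, 50), (60, 120), (5000, 5100)]          # made-up frame ranges
labels = cluster_tracklets(tracklets, lambda i, j: 0.9)  # all look alike
polyp_count = len(set(labels))
```

Even though all three tracklets are visually similar in this toy case, the third one is too far away in time to be merged, so it is counted as a separate polyp.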

[CV-42] MedFormer: Hierarchical Medical Vision Transformer with Content-Aware Dual Sparse Selection Attention

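[Quick Read]: This paper addresses two main problems of existing vision transformer-based methods for medical image recognition: models are usually task-specific with tailored architectures, which limits their generality; and they either use full attention to model long-range dependencies, which is computationally expensive, or rely on handcrafted sparse attention, which can hurt performance. The key to the solution is MedFormer, whose core innovations are a pyramid scaling structure that serves as a backbone for multiple medical image recognition tasks and a content-aware Dual Sparse Selection Attention (DSSA) mechanism that improves computational efficiency and robustness while maintaining high performance.

Link: https://arxiv.org/abs/2507.02488
Authors: Zunhui Xia, Hongxing Li, Libin Lan
Affiliations: Chongqing University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 9 figures, 9 tables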

Abstract:Medical image recognition serves as a key way to aid in clinical diagnosis, enabling more accurate and timely identification of diseases and abnormalities. Vision transformer-based approaches have proven effective in handling various medical recognition tasks. However, these methods encounter two primary challenges. First, they are often task-specific and architecture-tailored, limiting their general applicability. Second, they usually either adopt full attention to model long-range dependencies, resulting in high computational costs, or rely on handcrafted sparse attention, potentially leading to suboptimal performance. To tackle these issues, we present MedFormer, an efficient medical vision transformer with two key ideas. First, it employs a pyramid scaling structure as a versatile backbone for various medical image recognition tasks, including image classification and dense prediction tasks such as semantic segmentation and lesion detection. This structure facilitates hierarchical feature representation while reducing the computation load of feature maps, highly beneficial for boosting performance. Second, it introduces a novel Dual Sparse Selection Attention (DSSA) with content awareness to improve computational efficiency and robustness against noise while maintaining high performance. As the core building technique of MedFormer, DSSA is explicitly designed to attend to the most relevant content. In addition, a detailed theoretical analysis has been conducted, demonstrating that MedFormer has superior generality and efficiency in comparison to existing medical vision transformers. Extensive experiments on a variety of imaging modality datasets consistently show that MedFormer is highly effective in enhancing performance across all three above-mentioned medical image recognition tasks. The code is available at this https URL.

[CV-43] CrowdTrack: A Benchmark for Difficult Multiple Pedestrian Tracking in Real Scenarios

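[Quick Read]: This paper addresses the weak performance of multi-object tracking in complex scenes, particularly pedestrian tracking, where the limitations of motion and appearance information make it hard for existing methods to cope with occlusion, partial visibility, and blurred images. The key to the solution is CrowdTrack, a difficult and realistic large-scale dataset of real-life complex scenes shot mainly from the first-person view. It contains 5,185 trajectories, with each object annotated with a complete bounding box and a unique object ID, to support the development and testing of algorithms that remain effective in complex environments.

Link: https://arxiv.org/abs/2507.02479
Authors: Teng Fu, Yuwen Chen, Zhuofan Chen, Mengyang Zhao, Bin Li, Xiangyang Xue
Affiliations: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: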

Abstract:Multi-object tracking is a classic field in computer vision. Within it, pedestrian tracking has extremely high application value and has become the most popular research category. Existing methods mainly use motion or appearance information for tracking, which is often difficult in complex scenarios. For the motion information, mutual occlusions between objects often prevent updating of the motion state; for the appearance information, non-robust results are often obtained due to reasons such as only partial visibility of the object or blurred images. Although learning how to perform tracking in these situations from annotated data is the simplest solution, existing MOT datasets fail to support it. Existing datasets mainly have two drawbacks: relatively simple scene composition and non-realistic scenarios. Although some of the video sequences in existing datasets do not have the above-mentioned drawbacks, the number is far from adequate for research purposes. To this end, we propose a difficult large-scale dataset for multi-pedestrian tracking, shot mainly from the first-person view and all from real-life complex scenarios. We name it “CrowdTrack” because there are numerous objects in most of the sequences. Our dataset consists of 33 videos, containing a total of 5,185 trajectories. Each object is annotated with a complete bounding box and a unique object ID. The dataset will provide a platform to facilitate the development of algorithms that remain effective in complex situations. We analyzed the dataset comprehensively and tested multiple SOTA models on our dataset. Besides, we analyzed the performance of the foundation models on our dataset. The dataset and project code are released at: this https URL.

[CV-44] Mesh Silksong: Auto-Regressive Mesh Generation as Weaving Silk

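[Quick Read]: This paper addresses the redundancy that repeated vertex tokens introduce in existing mesh tokenization methods, with the goal of a more efficient and compact mesh representation. The key to the solution is to tokenize the mesh by visiting each vertex only once, which reduces the redundancy of the token sequence by 50% and achieves a state-of-the-art compression rate of approximately 22%, while the generated meshes retain geometric properties such as manifold topology, watertight detection, and consistent face normals.

Link: https://arxiv.org/abs/2507.02477
Authors: Gaochao Song, Zibo Zhao, Haohan Weng, Jingbo Zeng, Rongfei Jia, Shenghua Gao
Affiliations: University of Hong Kong; Tencent Hunyuan 3D; Math Magic
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: 9 pages main text, 14 pages appendix, 23 figures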

Abstract:We introduce Mesh Silksong, a compact and efficient mesh representation tailored to generate the polygon mesh in an auto-regressive manner akin to silk weaving. Existing mesh tokenization methods always produce token sequences with repeated vertex tokens, wasting the network capability. Therefore, our approach tokenizes mesh vertices by accessing each mesh vertex only once, reduces the token sequence’s redundancy by 50%, and achieves a state-of-the-art compression rate of approximately 22%. Furthermore, Mesh Silksong produces polygon meshes with superior geometric properties, including manifold topology, watertight detection, and consistent face normals, which are critical for practical applications. Experimental results demonstrate the effectiveness of our approach, showcasing not only intricate mesh generation but also significantly improved geometric integrity.
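The redundancy the abstract refers to is easy to see on a tiny mesh. The sketch below only compares vertex counts: a naive face-by-face listing repeats shared vertices, while a traversal that emits each vertex once does not. It is not the Mesh Silksong tokenizer; in particular, how connectivity is encoded once vertices are emitted only once is not described in the abstract and is not modelled here.

```python
def naive_vertex_tokens(faces):
    """List vertex ids face by face, repeating vertices shared between faces."""
    return [v for face in faces for v in face]


def visit_once_order(faces):
    """Emit each vertex id a single time, in order of first appearance."""
    seen, order = set(), []
    for face in faces:
        for v in face:
            if v not in seen:
                seen.add(v)
                order.append(v)
    return order


# A tetrahedron: 4 vertices, 4 triangular faces
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
repeated = naive_vertex_tokens(faces)   # 12 vertex tokens
unique = visit_once_order(faces)        # 4 vertex tokens
```

For larger triangle meshes the gap widens, since each vertex is typically shared by several faces.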

[CV-45] Weakly-supervised Contrastive Learning with Quantity Prompts for Moving Infrared Small Target Detection

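[Quick Read]: This paper addresses the difficulty of moving infrared small target detection, which stems from tiny targets and weak backgrounds, while removing the dependence of conventional fully-supervised methods on large amounts of manual annotation. The key to the solution is a new weakly-supervised contrastive learning (WeCoL) framework that needs only simple target quantity prompts during training. Based on the pretrained Segment Anything Model (SAM), it designs a potential target mining strategy that fuses target activation maps with multi-frame energy information, uses contrastive learning to improve the reliability of pseudo-labels, and introduces a long-short term motion-aware learning scheme to model the local motion patterns and global motion trajectories of small targets.

Link: https://arxiv.org/abs/2507.02454
Authors: Weiwei Duan, Luping Ji, Shengjia Chen, Sicheng Zhu, Jianghong Huang, Mao Ye
Affiliations: University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: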

Abstract:Different from general object detection, moving infrared small target detection faces huge challenges due to tiny target size and weak background [...], most existing methods are fully-supervised, heavily relying on a large number of manual target-wise annotations. However, manually annotating video sequences is often expensive and time-consuming, especially for low-quality infrared frame images. Inspired by general object detection, non-fully supervised strategies (e.g., weakly supervised) are believed to have potential for reducing annotation requirements. To break through traditional fully-supervised frameworks, as the first exploration work, this paper proposes a new weakly-supervised contrastive learning (WeCoL) scheme that only requires simple target quantity prompts during model [...], in our scheme, based on the pretrained segment anything model (SAM), a potential target mining strategy is designed to integrate target activation maps and multi-frame energy [...], contrastive learning is adopted to further improve the reliability of pseudo-labels, by calculating the similarity between positive and negative samples in feature [...], we propose a long-short term motion-aware learning scheme to simultaneously model the local motion patterns and global motion trajectory of small [...] extensive experiments on two public datasets (DAUB and ITSDT-15K) verify that our weakly-supervised scheme could often outperform early fully-supervised methods. Its performance could even reach over 90% of state-of-the-art (SOTA) fully-supervised ones.

[CV-46] IGDNet: Zero-Shot Robust Underexposed Image Enhancement via Illumination-Guided and Denoising

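[Quick Read]: This paper addresses the reliance of existing underexposed image enhancement methods on supervised learning with paired underexposed and well-illuminated images, which are hard to collect in practice; such methods can also over-enhance and distort well-illuminated regions. The key to the solution is IGDNet, a zero-shot enhancement method that operates on a single test image without guiding priors or training data. A decomposition module separates the image into illumination and reflection components, and a denoising module adaptively corrects non-uniformly illuminated regions, suppressing noise while restoring illumination.

Link: https://arxiv.org/abs/2507.02445
Authors: Hailong Yan, Junjian Huang, Tingwen Huang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Submitted to IEEE Transactions on Artificial Intelligence (TAI) on Oct.31, 2024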

Abstract:Current methods for restoring underexposed images typically rely on supervised learning with paired underexposed and well-illuminated images. However, collecting such datasets is often impractical in real-world scenarios. Moreover, these methods can lead to over-enhancement, distorting well-illuminated regions. To address these issues, we propose IGDNet, a zero-shot enhancement method that operates solely on a single test image, without requiring guiding priors or training data. IGDNet exhibits strong generalization ability and effectively suppresses noise while restoring illumination. The framework comprises a decomposition module and a denoising module. The former separates the image into illumination and reflection components via a dense connection network, while the latter enhances non-uniformly illuminated regions using an illumination-guided pixel adaptive correction method. A noise pair is generated through downsampling and refined iteratively to produce the final result. Extensive experiments on four public datasets demonstrate that IGDNet significantly improves visual quality under complex lighting conditions. Quantitative results on metrics like PSNR (20.41 dB) and SSIM (0.860) show that it outperforms 14 state-of-the-art unsupervised methods. The code will be released soon.

[CV-47] Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic

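[Quick read]: This paper addresses the increase in task execution time that robots incur when the speed of their detection algorithms limits them during task execution and exploration. The key to the solution is accelerating artificial neural network (ANN) inference on an FPGA (field-programmable gate array): using the FINN architecture, three quantised networks, MobileNet v1 (4-bit), CNV (2-bit), and a BNN (1-bit), are deployed in the FPGA's PL (programmable logic), which raises detection throughput and makes the models suitable for attention mechanisms.

Link: https://arxiv.org/abs/2507.02443
Authors: Sandro Costa Magalhães, Marco Almeida, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias
Affiliations: INESC TEC; ISEP; FEUP; Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Submitted to ROBOT’2025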

Abstract:Robots usually slow down for scanning to detect objects while moving. Additionally, the robot’s camera is configured with a low framerate to match the speed of the detection algorithms. This constrains robots while they execute tasks and explore, increasing task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs’ PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA’s PL. The models were trained on the RG2C dataset, a self-acquired dataset released in open access. MobileNet v1 performed best, reaching a success rate of 98% and an inference speed of 6611 FPS. This work shows that FPGAs can be used to speed up ANNs and make them suitable for attention mechanisms.
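
To give a feel for what 4-bit, 2-bit, and 1-bit quantisation mean for network weights, here is a minimal NumPy sketch of symmetric uniform quantisation. It is an illustration under our own assumptions, not the FINN flow used in the paper; the helpers `quantize_uniform` and `binarize` and the sample weights are hypothetical.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantisation to signed integers (assumes bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4 bits, 1 for 2 bits
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clip to the representable range.
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def binarize(w):
    """1-bit case: keep only the sign, giving values in {-1, +1}."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

# Hypothetical weights, chosen only for illustration.
weights = np.array([-0.9, -0.2, 0.0, 0.35, 0.7])
q4, s4 = quantize_uniform(weights, 4)   # integer levels in [-7, 7]
q2, s2 = quantize_uniform(weights, 2)   # integer levels in [-1, 1]
b1 = binarize(weights)                  # signs only
print(q4, q2, b1)
```

Multiplying the integer levels by `scale` gives the dequantised approximation; fewer bits mean coarser levels, which is the accuracy-versus-throughput trade-off that the three deployed models explore.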

[CV-48] F2TTA: Free-Form Test-Time Adaptation on Cross-Domain Medical Image Classification via Image-Level Disentangled Prompt Tuning

[Quick read]: This paper addresses the difficulty existing test-time adaptation (TTA) methods have in coping with unpredictable distribution shifts between domains when, as in clinical practice, data arrives as domain fragments of arbitrary length and in random order. The key to the solution is an Image-level Disentangled Prompt Tuning (I-DiPT) framework, which uses an image-invariant prompt to learn domain-invariant representations that mitigate the unpredictable shifts, and an image-specific prompt to adapt the model to each test image. To overcome the insufficient knowledge representation caused by training on a single image, it further introduces Uncertainty-oriented Masking (UoM) and Parallel Graph Distillation (PGD) to strengthen the prompts' information extraction and knowledge reuse.

Link: https://arxiv.org/abs/2507.02437
Authors: Wei Li, Jingyang Zhang, Lihao Liu, Guoan Wang, Junjun He, Yang Chen, Lixu Gu
Affiliations: School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China; Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, 200240, China; School of Computer Science and Engineering, Southeast University, Nanjing, 210096, China; Shanghai Artificial Intelligence Laboratory, Shanghai, 200000, China; Department of Systems and Enterprises, Stevens Institute of Technology, Hoboken, NJ, 07030, United States
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This paper has been submitted to relevant journals

Abstract:Test-Time Adaptation (TTA) has emerged as a promising solution for adapting a source model to unseen medical sites using unlabeled test data, due to the high cost of data annotation. Existing TTA methods consider scenarios where data from one or multiple domains arrives in complete domain units. However, in clinical practice, data usually arrives in domain fragments of arbitrary lengths and in random arrival orders, due to resource constraints and patient variability. This paper investigates a practical Free-Form Test-Time Adaptation (F²TTA) task, where a source model is adapted to such free-form domain fragments, with shifts occurring between fragments unpredictably. In this setting, these shifts could distort the adaptation process. To address this problem, we propose a novel Image-level Disentangled Prompt Tuning (I-DiPT) framework. I-DiPT employs an image-invariant prompt to explore domain-invariant representations for mitigating the unpredictable shifts, and an image-specific prompt to adapt the source model to each test image from the incoming fragments. The prompts may suffer from insufficient knowledge representation since only one image is available for training. To overcome this limitation, we first introduce Uncertainty-oriented Masking (UoM), which encourages the prompts to extract sufficient information from the incoming image via masked consistency learning driven by the uncertainty of the source model representations. Then, we further propose a Parallel Graph Distillation (PGD) method that reuses knowledge from historical image-specific and image-invariant prompts through parallel graph networks. Experiments on breast cancer and glaucoma classification demonstrate the superiority of our method over existing TTA approaches in F²TTA. Code is available at this https URL.

[CV-49] AvatarMakeup: Realistic Makeup Transfer for 3D Animatable Head Avatars

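[Quick read]: This paper addresses three core problems in facial makeup transfer for 3D virtual avatars: 1) keeping a consistent appearance under drivable expressions, 2) preserving identity during the makeup process, and 3) enabling precise control over fine details. The key to the solution is AvatarMakeup, a specialized 3D makeup method that uses a pretrained diffusion model to transfer makeup patterns from a single reference photo. It follows a coarse-to-fine strategy: it first ensures consistency of appearance and identity, then refines details using diffusion-generated makeup images as supervision. To handle the uncertainty of the diffusion process, it introduces Coherent Duplication, which optimizes a global UV map to keep makeup consistent across views and dynamic expressions.

Link: https://arxiv.org/abs/2507.02419
Authors: Yiming Zhong, Xiaolin Zhang, Ligang Liu, Yao Zhao, Yunchao Wei
Affiliations: Beijing Jiaotong University; Shandong University of Science and Technology; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: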

Abstract:Similar to facial beautification in real life, 3D virtual avatars require personalized customization to enhance their visual appeal, yet this area remains insufficiently explored. Although current 3D Gaussian editing methods can be adapted for facial makeup purposes, these methods fail to meet the fundamental requirements for achieving realistic makeup effects: 1) ensuring a consistent appearance during drivable expressions, 2) preserving the identity throughout the makeup process, and 3) enabling precise control over fine details. To address these, we propose a specialized 3D makeup method named AvatarMakeup, leveraging a pretrained diffusion model to transfer makeup patterns from a single reference photo of any individual. We adopt a coarse-to-fine strategy to first maintain the consistent appearance and identity, and then to refine the details. In particular, the diffusion model is employed to generate makeup images as supervision. Due to the uncertainties in the diffusion process, the generated images are inconsistent across different viewpoints and expressions. Therefore, we propose a Coherent Duplication method to coarsely apply makeup to the target while ensuring consistency across dynamic and multiview effects. Coherent Duplication optimizes a global UV map by recording the averaged facial attributes among the generated makeup images. By querying the global UV map, it easily synthesizes coherent makeup guidance from arbitrary views and expressions to optimize the target avatar. Given the coarse makeup avatar, we further enhance the makeup by incorporating a Refinement Module into the diffusion model to achieve high makeup quality. Experiments demonstrate that AvatarMakeup achieves state-of-the-art makeup transfer quality and consistency throughout animation.

[CV-50] Determination Of Structural Cracks Using Deep Learning Frameworks

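[Quick read]: This paper addresses the unreliable assessments that result from manual structural crack inspection being slow, inconsistent, and error-prone. The key to the solution is a new deep-learning architecture that uses several residual U-Net models and combines them, through ensemble learning, with a meta-model built from convolutional blocks to improve detection accuracy and efficiency. The ensemble outperforms the traditional U-Net and SegNet on low-resolution images, supporting its effectiveness for structural defect monitoring.

Link: https://arxiv.org/abs/2507.02416
Authors: Subhasis Dasgupta, Jaydip Sen, Tuhina Halder
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: This is the accepted version of the paper presented in IEEE CONIT 2025 held on 20th June 2025. This is not the camera-ready version. There are 6 pages in this paper and it contains 7 figures and 1 table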

Abstract:Structural crack detection is a critical task for public safety as it helps in preventing potential structural failures that could endanger lives. Manual detection by inexperienced personnel can be slow, inconsistent, and prone to human error, which may compromise the reliability of assessments. The current study addresses these challenges by introducing a novel deep-learning architecture designed to enhance the accuracy and efficiency of structural crack detection. In this research, various configurations of residual U-Net models were utilized. These models, due to their robustness in capturing fine details, were further integrated into an ensemble with a meta-model comprising convolutional blocks. This unique combination aimed to boost prediction efficiency beyond what individual models could achieve. The ensemble’s performance was evaluated against well-established architectures such as SegNet and the traditional U-Net. Results demonstrated that the residual U-Net models outperformed their predecessors, particularly with low-resolution imagery, and the ensemble model exceeded the performance of individual models, proving it as the most effective. The assessment was based on the Intersection over Union (IoU) metric and DICE coefficient. The ensemble model achieved the highest scores, signifying superior accuracy. This advancement paves the way for more reliable automated systems in structural defect monitoring tasks.
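
The two metrics named in the abstract, IoU and the Dice coefficient, can be computed for binary segmentation masks as in the sketch below. This is a generic illustration, not the authors' evaluation code; the function name `iou_and_dice` and the toy masks are our own.

```python
import numpy as np

def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention assumed here: two empty masks count as a perfect match.
    iou = intersection / union if union > 0 else 1.0
    dice = 2 * intersection / total if total > 0 else 1.0
    return float(iou), float(dice)

# Toy 2x3 masks: 1 marks a crack pixel.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
iou, dice = iou_and_dice(pred, target)
print(iou, dice)
```

For the same pair of masks, Dice is never lower than IoU (Dice = 2·IoU / (1 + IoU)), so the two scores rank models consistently but are not directly comparable in magnitude.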

[CV-51] Privacy-preserving Preselection for Face Identification Based on Packing

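[Quick read]: This paper addresses the drop in retrieval efficiency that face identification in the ciphertext domain suffers as the ciphertext template library grows. The key to the solution is a novel and efficient scheme called Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP), which introduces a preselection mechanism to reduce computational overhead and a packing module to make biometric systems more flexible at the enrollment stage, substantially improving retrieval efficiency.

Link: https://arxiv.org/abs/2507.02414
Authors: Rundong Xin, Taotao Wang, Jin Wang, Chonghe Zhao, Jing Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: This paper has been accepted for publication in SecureComm 2025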

Abstract:Face identification systems operating in the ciphertext domain have garnered significant attention due to increasing privacy concerns and the potential recovery of original facial data. However, as the size of ciphertext template libraries grows, the face retrieval process becomes progressively more time-intensive. To address this challenge, we propose a novel and efficient scheme for face retrieval in the ciphertext domain, termed Privacy-Preserving Preselection for Face Identification Based on Packing (PFIP). PFIP incorporates an innovative preselection mechanism to reduce computational overhead and a packing module to enhance the flexibility of biometric systems during the enrollment stage. Extensive experiments conducted on the LFW and CASIA datasets demonstrate that PFIP preserves the accuracy of the original face recognition model, achieving a 100% hit rate while retrieving 1,000 ciphertext face templates within 300 milliseconds. Compared to existing approaches, PFIP achieves a nearly 50x improvement in retrieval efficiency.

[CV-52] A Novel Tuning Method for Real-time Multiple-Object Tracking Utilizing Thermal Sensor with Complexity Motion Pattern

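[Quick read]: This paper addresses multi-object tracking in thermal images, particularly in challenging environments with low visibility or poor lighting where conventional RGB cameras struggle to detect and track pedestrians. The key to the solution is a new tuning method for pedestrian tracking that optimizes a two-stage framework by selecting the most suitable hyperparameters for each stage, achieving accurate real-time tracking without relying on complex re-identification or motion models.

Link: https://arxiv.org/abs/2507.02408
Authors: Duong Nguyen-Ngoc Tran, Long Hoang Pham, Chi Dai Tran, Quoc Pham-Nam Ho, Huy-Hung Nguyen, Jae Wook Jeon
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: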

Abstract:Multi-Object Tracking in thermal images is essential for surveillance systems, particularly in challenging environments where RGB cameras struggle due to low visibility or poor lighting conditions. Thermal sensors enhance recognition tasks by capturing infrared signatures, but a major challenge is their low-level feature representation, which makes it difficult to accurately detect and track pedestrians. To address this, the paper introduces a novel tuning method for pedestrian tracking, specifically designed to handle the complex motion patterns in thermal imagery. The proposed framework optimizes two stages, ensuring that each stage is tuned with the most suitable hyperparameters to maximize tracking performance. By fine-tuning hyperparameters for real-time tracking, the method achieves high accuracy without relying on complex re-identification or motion models. Extensive experiments on the PBVS Thermal MOT dataset demonstrate that the approach is highly effective across various thermal camera conditions, making it a robust solution for real-world surveillance applications.

[CV-53] PosDiffAE: Position-aware Diffusion Auto-encoder For High-Resolution Brain Tissue Classification Incorporating Artifact Restoration

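[Quick read]: This paper addresses the fact that diffusion models, while generating high-quality image samples, offer no means of extracting image-specific semantic representations, a capability that auto-encoders provide. The key to the solution is combining an encoder with a diffusion model into an auto-encoding framework that learns image-specific latent representations and allows the latent space to be structured. Specifically, the work organizes the latent space by regressing positional information so that tissue types in brain images can be distinguished; uses the latent representations and the constrained generation capability of diffusion models at inference to build an unsupervised tear artifact restoration technique; and uses representational guidance together with inference-time steerable noising and denoising to build an unsupervised JPEG artifact restoration method.

Link: https://arxiv.org/abs/2507.02405
Authors: Ayantika Das, Moitreya Chaudhuri, Koushik Bhat, Keerthi Ram, Mihail Bota, Mohanasankar Sivaprakasam
Affiliations: Indian Institute of Technology Madras; Sudha Gopalakrishnan Brain Centre
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in IEEE Journal of Biomedical and Health Informatics (Early Access Available) this https URL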

Abstract:Denoising diffusion models produce high-fidelity image samples by capturing the image distribution in a progressive manner while initializing with a simple distribution and compounding the distribution complexity. Although these models have unlocked new applications, the sampling mechanism of diffusion does not offer means to extract image-specific semantic representation, which is inherently provided by auto-encoders. The encoding component of auto-encoders enables mapping between a specific image and its latent space, thereby offering explicit means of enforcing structures in the latent space. By integrating an encoder with the diffusion model, we establish an auto-encoding formulation, which learns image-specific representations and offers means to organize the latent space. In this work, first, we devise a mechanism to structure the latent space of a diffusion auto-encoding model, towards recognizing region-specific cellular patterns in brain images. We enforce the representations to regress positional information of the patches from high-resolution images. This creates a conducive latent space for differentiating tissue types of the brain. Second, we devise an unsupervised tear artifact restoration technique based on neighborhood awareness, utilizing latent representations and the constrained generation capability of diffusion models during inference. Third, through representational guidance and leveraging the inference-time steerable noising and denoising capability of diffusion, we devise an unsupervised JPEG artifact restoration technique.

[CV-54] Wildlife Target Re-Identification Using Self-supervised Learning in Non-Urban Settings

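[Quick read]: This paper addresses wildlife re-identification, that is, matching individuals of the same species across different observations. Conventional approaches train supervised models on data annotated with class labels, which has driven the construction of many large-scale wildlife datasets. The key to the proposed solution is self-supervised learning (SSL): two distinct views of an individual are extracted automatically as image pairs from camera trap data, and the model is trained without manual annotation. Training on temporal image pairs from video streams makes the model more robust with limited data, and the resulting features outperform supervised features on a range of downstream tasks.

Link: https://arxiv.org/abs/2507.02403
Authors: Mufhumudzi Muthivhi, Terence L. van Zyl
Affiliations: University of Johannesburg; Institute for Intelligent Systems
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in IEEE Xplore and ISIF FUSION 2025 proceedings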

Abstract:Wildlife re-identification aims to match individuals of the same species across different observations. Current state-of-the-art (SOTA) models rely on class labels to train supervised models for individual classification. This dependence on annotated data has driven the curation of numerous large-scale wildlife datasets. This study investigates Self-Supervised Learning (SSL) for wildlife re-identification. We automatically extract two distinct views of an individual using temporal image pairs from camera trap data without supervision. The image pairs train a self-supervised model from a potentially endless stream of video data. We evaluate the learnt representations against supervised features on open-world scenarios and transfer learning in various wildlife downstream tasks. The analysis of the experimental results shows that self-supervised models are more robust even with limited data. Moreover, self-supervised features outperform supervision across all downstream tasks. The code is available at this https URL.
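
One simple way to picture "temporal image pairs" is to pair camera trap frames captured close together in time, on the assumption that they show the same individual. The sketch below illustrates that idea only; the function `temporal_pairs`, the `max_gap_s` threshold, and the frame identifiers are hypothetical and are not taken from the paper.

```python
def temporal_pairs(frames, max_gap_s=2.0):
    """Pair consecutive frames whose timestamps differ by at most max_gap_s.

    frames: iterable of (timestamp_seconds, image_id) tuples.
    Returns a list of (image_id_a, image_id_b) positive pairs.
    """
    ordered = sorted(frames)                      # sort by timestamp
    pairs = []
    for (t0, a), (t1, b) in zip(ordered, ordered[1:]):
        if t1 - t0 <= max_gap_s:                  # assumed: same trigger event
            pairs.append((a, b))
    return pairs

# Hypothetical capture log: two bursts separated by a long gap.
frames = [(0.0, "img_000"), (1.0, "img_001"), (30.0, "img_002"), (31.5, "img_003")]
pairs = temporal_pairs(frames)
print(pairs)
```

Each resulting pair can serve as two views of one individual for a contrastive or other self-supervised objective, with no identity labels involved.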

[CV-55] TABNet: A Triplet Augmentation Self-Recovery Framework with Boundary-Aware Pseudo-Labels for Medical Image Segmentation

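[Quick read]: This paper addresses the limited feature learning and insufficient boundary supervision in medical image segmentation that stem from the lack of large-scale, fully annotated data. The key to the solution is TAB Net, a weakly supervised medical image segmentation framework with two core components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. TAS improves feature learning through three complementary augmentation strategies, while BAP improves pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions and introducing a boundary-aware loss.

Link: https://arxiv.org/abs/2507.02399
Authors: Peilin Zhang, Shaouxan Wua, Jun Feng, Zhuo Jin, Zhizezhang Gao, Jingkun Chen, Yaqiong Xing, Xiao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: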

Abstract:Background and objective: Medical image segmentation is a core task in various clinical applications. However, acquiring large-scale, fully annotated medical image datasets is both time-consuming and costly. Scribble annotations, as a form of sparse labeling, provide an efficient and cost-effective alternative for medical image segmentation. However, the sparsity of scribble annotations limits the feature learning of the target region and lacks sufficient boundary supervision, which poses significant challenges for training segmentation networks. Methods: We propose TAB Net, a novel weakly-supervised medical image segmentation framework, consisting of two key components: the triplet augmentation self-recovery (TAS) module and the boundary-aware pseudo-label supervision (BAP) module. The TAS module enhances feature learning through three complementary augmentation strategies: intensity transformation improves the model’s sensitivity to texture and contrast variations, cutout forces the network to capture local anatomical structures by masking key regions, and jigsaw augmentation strengthens the modeling of global anatomical layout by disrupting spatial continuity. By guiding the network to recover complete masks from diverse augmented inputs, TAS promotes a deeper semantic understanding of medical images under sparse supervision. The BAP module enhances pseudo-supervision accuracy and boundary modeling by fusing dual-branch predictions into a loss-weighted pseudo-label and introducing a boundary-aware loss for fine-grained contour refinement. Results: Experimental evaluations on two public datasets, ACDC and MSCMR seg, demonstrate that TAB Net significantly outperforms state-of-the-art methods for scribble-based weakly supervised segmentation. Moreover, it achieves performance comparable to that of fully supervised methods.

[CV-56] Beyond Spatial Frequency: Pixel-wise Temporal Frequency-based Deepfake Video Detection ICCV2025

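[Quick read]: This paper addresses the failure of conventional deepfake video detectors to catch temporal artifacts in the pixel plane, because they rely only on spatial frequency information and overlook pixel-wise temporal inconsistencies. The key to the solution is to apply a 1D Fourier transform along the time axis for each pixel, extracting features that are highly sensitive to temporal inconsistencies; an attention proposal module trained end to end then precisely locates regions containing temporal artifacts, and a joint transformer module fuses the pixel-wise temporal frequency features with spatio-temporal context features, widening the range of detectable forgery artifacts.

Link: https://arxiv.org/abs/2507.02398
Authors: Taehoon Kim, Jongwook Choi, Yonghyun Jeong, Haeun Noh, Jaejun Yoo, Seungryul Baek, Jongwon Choi
Affiliations: GS. of AI, Chung-Ang Univ, Korea; Dept. of Advanced Imaging, Chung-Ang Univ, Korea; NAVER Cloud, Korea; AI Graduate School, UNIST, Korea
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICCV 2025. Code will be available at this https URL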

Abstract:We introduce a deepfake video detection approach that exploits pixel-wise temporal inconsistencies, which traditional spatial frequency-based detectors often overlook. Traditional detectors represent temporal information merely by stacking spatial frequency spectra across frames, resulting in the failure to detect temporal artifacts in the pixel plane. Our approach performs a 1D Fourier transform on the time axis for each pixel, extracting features highly sensitive to temporal inconsistencies, especially in areas prone to unnatural movements. To precisely locate regions containing the temporal artifacts, we introduce an attention proposal module trained in an end-to-end manner. Additionally, our joint transformer module effectively integrates pixel-wise temporal frequency features with spatio-temporal context features, expanding the range of detectable forgery artifacts. Our framework represents a significant advancement in deepfake video detection, providing robust performance across diverse and challenging detection scenarios.
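
The core operation described here, a 1D Fourier transform along the time axis for each pixel, can be sketched in a few lines of NumPy. This shows only the transform, not the attention proposal or joint transformer modules; the function name `pixelwise_temporal_spectrum` and the synthetic clip are our own assumptions.

```python
import numpy as np

def pixelwise_temporal_spectrum(video):
    """video: array of shape (T, H, W), one grayscale value per pixel per frame.

    Returns the magnitude spectrum of shape (T // 2 + 1, H, W), where each
    pixel has its own 1D FFT taken over time.
    """
    video = np.asarray(video, dtype=np.float64)
    # Remove each pixel's temporal mean so the DC bin does not dominate.
    centered = video - video.mean(axis=0, keepdims=True)
    return np.abs(np.fft.rfft(centered, axis=0))

# Synthetic clip: all pixels static except one that flickers 4 times per clip.
T, H, W = 16, 4, 4
t = np.arange(T)
video = np.zeros((T, H, W))
video[:, 1, 2] = np.sin(2 * np.pi * 4 * t / T)
spec = pixelwise_temporal_spectrum(video)
print(spec.shape, int(np.argmax(spec[:, 1, 2])))
```

Static pixels yield an all-zero spectrum, while the flickering pixel shows a clear peak at its flicker frequency, which is the kind of per-pixel temporal signature that stacking spatial spectra frame by frame does not expose.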

[CV-57] Continual Multiple Instance Learning with Enhanced Localization for Histopathological Whole Slide Image Analysis ICCV2025

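[Quick read]: This paper addresses how to make multiple instance learning (MIL) adapt to continual tasks with minimal forgetting, especially for instance classification and localization. Existing methods studied for semantic segmentation of natural images are hard to apply directly to MIL localization, which involves enormous numbers of large instances (e.g., 256×256 patches) and lacks global relationships. The key to the solution is a framework called CoMEL with three core components: a Grouped Double Attention Transformer (GDAT) for efficient instance encoding, Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-label generation, and Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to reduce forgetting in both bag-level and instance-level classification.

Link: https://arxiv.org/abs/2507.02395
Authors: Byung Hyun Lee, Wongi Jeong, Woojae Han, Kyoungbun Lee, Se Young Chun
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025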

Abstract:Multiple instance learning (MIL) has significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., 16×16) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts (~10^5) of large patches (e.g., 256×256) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework for both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets demonstrate superior performance of CoMEL, outperforming the prior art by up to 11.00% in bag-level accuracy and up to 23.4% in localization accuracy under the continual MIL setup.

[CV-58] PLOT: Pseudo-Labeling via Video Object Tracking for Scalable Monocular 3D Object Detection

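[Quick read]: This paper addresses the challenges that data scarcity poses for monocular 3D object detection (M3OD), in particular those arising from high annotation costs and inherent 2D-to-3D ambiguity. The key to the solution is a new pseudo-labeling framework that uses only video data: it aggregates pseudo-LiDAR points of both static and dynamic objects across temporally adjacent frames through object point tracking, which allows 3D attributes to be extracted when 3D data cannot be acquired and gives the method robustness and scalability.

Link: https://arxiv.org/abs/2507.02393
Authors: Seokyeong Lee, Sithu Aung, Junyong Choi, Seungryong Kim, Ig-Jae Kim, Junghyun Cho
Affiliations: Korea Institute of Science and Technology (KIST); Korea University; KAIST AI; AI-Robotics, KIST School, University of Science and Technology (UST); Yonsei-KIST Convergence Research Institute, Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 16 figures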

Abstract:Monocular 3D object detection (M3OD) has long faced challenges due to data scarcity caused by high annotation costs and inherent 2D-to-3D ambiguity. Although various weakly supervised methods and pseudo-labeling methods have been proposed to address these issues, they are mostly limited by domain-specific learning or rely solely on shape information from a single observation. In this paper, we propose a novel pseudo-labeling framework that uses only video data and is more robust to occlusion, without requiring a multi-view setup, additional sensors, camera poses, or domain-specific training. Specifically, we explore a technique for aggregating the pseudo-LiDARs of both static and dynamic objects across temporally adjacent frames using object point tracking, enabling 3D attribute extraction in scenarios where 3D data acquisition is infeasible. Extensive experiments demonstrate that our method ensures reliable accuracy and strong scalability, making it a practical and effective solution for M3OD.

[CV-59] UVLM: Benchmarking Video Language Model for Underwater World Understanding

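[Quick read]: This paper addresses the shortcomings of existing video language models (VidLMs) in underwater observation, in particular the challenges specific to underwater environments such as light variation, water turbidity, and diverse viewing angles. The key to the solution is UVLM, an underwater observation benchmark built through collaboration between human experts and AI models to ensure data quality and diversity. The benchmark covers a wide range of frame rates, resolutions, marine animal classes, and static plants and terrains, and it defines diverse tasks and challenging evaluation metrics to support a deeper understanding of the underwater world. Experiments show that fine-tuning VidLMs on UVLM significantly improves their underwater understanding and may also bring slight gains on existing in-air benchmarks.

Link: https://arxiv.org/abs/2507.02373
Authors: Xizhe Xue, Yang Zhou, Dawei Yan, Ying Li, Haokui Zhang, Rong Xiao
Affiliations: Northwestern Polytechnical University; Intellifusion Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures, 3 tables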

Abstract:Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. To fill this gap, we introduce UVLM, an underwater observation benchmark built through a collaborative approach combining human expertise and AI models. To ensure data quality, we have conducted in-depth considerations from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges including light variations, water turbidity, and diverse viewing angles to construct the dataset. Second, to ensure data diversity, the dataset covers a wide range of frame rates, resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception Test. The dataset and prompt engineering will be released publicly.

[CV-60] LocalDyGS: Multi-view Global Dynamic Scene Modeling via Adaptive Local Implicit Feature Decoupling ICCV2025

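[Quick read]: This paper addresses the challenge of synthesizing dynamic videos at arbitrary viewpoints from multi-view inputs, particularly in complex, highly dynamic real-world scenes. Existing methods based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion and struggle with scenes that mix large-scale and fine-scale motion. The key to the solution is the LocalDyGS framework, which decomposes a complex dynamic scene into streamlined local spaces defined by seeds and models motion within each local space by decoupling static and dynamic features, enabling joint modeling of large-scale and fine-scale motion.

Link: https://arxiv.org/abs/2507.02363
Authors: Jiahao Wu, Rui Peng, Jianbo Jiao, Jiayu Yang, Luyang Tang, Kaiqiang Xiong, Jie Liang, Jinbo Yan, Runling Liu, Ronggang Wang
Affiliations: Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology; Shenzhen Graduate School, Peking University; Pengcheng Lab; University of Birmingham
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025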

Abstract:Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce LocalDyGS, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Project page: this https URL.

[CV-61] Holistic Tokenizer for Autoregressive Image Generation

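[Quick read]: This paper addresses the limited ability of autoregressive (AR) image generation models to capture global relationships among visual token sequences, and the lack of global information in conventional visual tokenizers that map only local image patches. The key to the solution is Hita, a new image tokenizer that adopts a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens, and that uses a sequential token arrangement and a lightweight fusion module to control information flow so that it aligns better with the AR generation process.

Link: https://arxiv.org/abs/2507.02358
Authors: Anlin Zheng, Haochen Wang, Yucheng Zhao, Weipeng Deng, Tiancai Wang, Xiangyu Zhang, Xiaojuan Qi
Affiliations: The University of Hong Kong; NLPR, MAIS, CASIA; Dexmal; StepFun; MEGVII Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 17 pages, 10 figures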

Abstract:The vanilla autoregressive image generation model generates visual tokens in a step-by-step fashion, which limits the ability to capture holistic relationships among token sequences. Moreover, most visual tokenizers map local image patches into latent tokens, leading to limited global information. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Besides, Hita incorporates two key strategies for improved alignment with the AR generation process: 1) it arranges a sequential structure with holistic tokens at the beginning followed by patch-level tokens while using causal attention to maintain awareness of previous tokens; and 2) before feeding the de-quantized tokens into the decoder, Hita adopts a lightweight fusion module to control information flow to prioritize holistic tokens. Extensive experiments show that Hita accelerates the training speed of AR generators and outperforms those trained with vanilla tokenizers, achieving 2.59 FID and 281.9 IS on the ImageNet benchmark. A detailed analysis of the holistic representation highlights its ability to capture global image properties such as textures, materials, and shapes. Additionally, Hita also demonstrates effectiveness in zero-shot style transfer and image in-painting. The code is available at this https URL
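
The first strategy, holistic tokens placed ahead of patch tokens with causal attention over the combined sequence, can be pictured with the small NumPy sketch below. It shows only the sequence layout and the mask, under our own assumptions about shapes; `build_sequence_and_mask` is a hypothetical helper and not part of the released Hita code.

```python
import numpy as np

def build_sequence_and_mask(holistic, patches):
    """Concatenate holistic tokens before patch tokens and build a causal mask.

    holistic: (K, D) array of holistic token embeddings.
    patches:  (N, D) array of patch token embeddings.
    Returns (tokens, mask) where tokens is (K + N, D) and mask[i, j] is True
    when position i is allowed to attend to position j (that is, j <= i).
    """
    tokens = np.concatenate([holistic, patches], axis=0)
    n = tokens.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))   # lower-triangular: causal
    return tokens, mask

# Assumed toy sizes: 2 holistic tokens, 3 patch tokens, embedding dimension 4.
holistic = np.zeros((2, 4))
patches = np.ones((3, 4))
tokens, mask = build_sequence_and_mask(holistic, patches)
print(tokens.shape)
print(mask.astype(int))
```

With this layout every patch token can attend to all holistic tokens, while the holistic tokens never see the patch tokens that follow them, so global information is available from the first generation step.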

[CV-62] Lightweight Shrimp Disease Detection Research Based on YOLOv8n

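【Quick Read】: This paper aims to reduce the economic losses caused by disease transmission in shrimp farming and to make intelligent shrimp disease detection more efficient. The key to its solution is a lightweight network architecture based on YOLOv8n: the RLDD detection head and the C2f-EMCM module lower computational complexity while maintaining detection accuracy, and an improved SegNext_Attention self-attention mechanism strengthens feature extraction so that disease characteristics are identified more precisely.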

Link: https://arxiv.org/abs/2507.02354
Authors: Fei Yuhuan, Wang Gengchen, Liu Fenghao, Zang Ran, Sun Xufei, Chang Hao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: in Chinese language

Abstract:Shrimp diseases are one of the primary causes of economic losses in shrimp aquaculture. To prevent disease transmission and enhance intelligent detection efficiency in shrimp farming, this paper proposes a lightweight network architecture based on YOLOv8n. First, by designing the RLDD detection head and C2f-EMCM module, the model reduces computational complexity while maintaining detection accuracy, improving computational efficiency. Subsequently, an improved SegNext_Attention self-attention mechanism is introduced to further enhance the model’s feature extraction capability, enabling more precise identification of disease characteristics. Extensive experiments, including ablation studies and comparative evaluations, are conducted on a self-constructed shrimp disease dataset, with generalization tests extended to the URPC2020 dataset. Results demonstrate that the proposed model achieves a 32.3% reduction in parameters compared to the original YOLOv8n, with a mAP@0.5 of 92.7% (3% improvement over YOLOv8n). Additionally, the model outperforms other lightweight YOLO-series models in mAP@0.5, parameter count, and model size. Generalization experiments on the URPC2020 dataset further validate the model’s robustness, showing a 4.1% increase in mAP@0.5 compared to YOLOv8n. The proposed method achieves an optimal balance between accuracy and efficiency, providing reliable technical support for intelligent disease detection in shrimp aquaculture.

[CV-63] Two-Steps Neural Networks for an Automated Cerebrovascular Landmark Detection

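【Quick Read】: This paper aims to accurately detect the bifurcations of the Circle of Willis (CoW) where intracranial aneurysms (ICA) commonly occur, in particular when processing complete MRA Time-of-Flight (TOF) volumes, where adjacent bifurcations with similar visual characteristics easily cause missed detections. The key to its solution is a two-step neural network approach: in the first step an object detection network identifies regions of interest (ROIs), and in the second step a modified U-Net with deep supervision precisely locates the bifurcations, which improves detection accuracy and robustness and accommodates the anatomical variability of the CoW.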

Link: https://arxiv.org/abs/2507.02349
Authors: Rafic Nader, Vincent L’Allinec, Romain Bourcier, Florent Autrusseau
Affiliations: Institut du Thorax; LTeN; CHU Angers
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Intracranial aneurysms (ICA) commonly occur in specific segments of the Circle of Willis (CoW), primarily, onto thirteen major arterial bifurcations. An accurate detection of these critical landmarks is necessary for a prompt and efficient diagnosis. We introduce a fully automated landmark detection approach for CoW bifurcations using a two-step neural networks process. Initially, an object detection network identifies regions of interest (ROIs) proximal to the landmark locations. Subsequently, a modified U-Net with deep supervision is exploited to accurately locate the bifurcations. This two-step method reduces various problems, such as the missed detections caused by two landmarks being close to each other and having similar visual characteristics, especially when processing the complete MRA Time-of-Flight (TOF). Additionally, it accounts for the anatomical variability of the CoW, which affects the number of detectable landmarks per scan. We assessed the effectiveness of our approach using two cerebral MRA datasets: our In-House dataset which had varying numbers of landmarks, and a public dataset with standardized landmark configuration. Our experimental results demonstrate that our method achieves the highest level of performance on a bifurcation detection task.

[CV-64] Neural Network-based Study for Rice Leaf Disease Recognition and Classification: A Comparative Analysis Between Feature-based Model and Direct Imaging Model

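【Quick Read】: This paper aims at early detection and classification of rice leaf diseases in order to reduce yield losses and improve agricultural productivity. The key to its solution is a Feature Analysis Detection Model (FADM) that optimizes the image-processing pipeline with a range of Feature Extraction Algorithms (FEAs), Dimensionality Reduction Algorithms (DRAs), and Feature Selection Algorithms (FSAs), and that is compared experimentally against a Direct Image-Centric Detection Model (DICDM) to verify the superior classification performance of FADM.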

Link: https://arxiv.org/abs/2507.02322
Authors: Farida Siddiqi Prity, Mirza Raquib, Saydul Akbar Murad, Md. Jubayar Alam Rafi, Md. Khairul Bashar Bhuiyan, Anupam Kumar Bairagi
Affiliations: Netrokona University; International Islamic University Chittagong; University of Southern Mississippi; Daffodil International University; Brac University; Khulna University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rice leaf diseases significantly reduce productivity and cause economic losses, highlighting the need for early detection to enable effective management and improve yields. This study proposes Artificial Neural Network (ANN)-based image-processing techniques for timely classification and recognition of rice diseases. Despite the prevailing approach of directly inputting images of rice leaves into ANNs, there is a noticeable absence of thorough comparative analysis between the Feature Analysis Detection Model (FADM) and Direct Image-Centric Detection Model (DICDM), specifically when it comes to evaluating the effectiveness of Feature Extraction Algorithms (FEAs). Hence, this research presents initial experiments on the Feature Analysis Detection Model, utilizing various image Feature Extraction Algorithms, Dimensionality Reduction Algorithms (DRAs), Feature Selection Algorithms (FSAs), and Extreme Learning Machine (ELM). The experiments are carried out on datasets encompassing bacterial leaf blight, brown spot, leaf blast, leaf scald, Sheath blight rot, and healthy leaf, utilizing 10-fold Cross-Validation method. A Direct Image-Centric Detection Model is established without the utilization of any FEA, and the evaluation of classification performance relies on different metrics. Ultimately, an exhaustive contrast is performed between the achievements of the Feature Analysis Detection Model and Direct Image-Centric Detection Model in classifying rice leaf diseases. The results reveal that the highest performance is attained using the Feature Analysis Detection Model. The adoption of the proposed Feature Analysis Detection Model for detecting rice leaf diseases holds excellent potential for improving crop health, minimizing yield losses, and enhancing overall productivity and sustainability of rice farming.
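
The 10-fold cross-validation protocol mentioned in the abstract can be sketched in a few lines. This is a generic illustration of the splitting step only, not the authors' code; the function name `k_fold_indices`, the shuffling seed, and the sample count of 103 are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once so that each fold is a random subset (seed is illustrative).
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Hypothetical dataset of 103 images: each image is tested exactly once.
splits = list(k_fold_indices(n_samples=103, k=10))
```

Each sample lands in the test fold exactly once, so the reported metric is an average over ten disjoint held-out subsets rather than a single train/test split.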

[CV-65] Heeding the Inner Voice: Aligning ControlNet Training via Intermediate Features Feedback

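【Quick Read】: This paper addresses the insufficient precision of spatial control over the outputs of text-to-image diffusion models. The key to its solution is InnerControl, a training strategy that enforces spatial consistency across all diffusion steps: lightweight convolutional probes reconstruct the input control signals (e.g., edges, depth) from intermediate UNet features, extract those signals even from highly noisy latents, and provide pseudo ground truth controls for training, which improves both control fidelity and generation quality.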

Link: https://arxiv.org/abs/2507.02321
Authors: Nina Konovalova, Maxim Nikolaev, Andrey Kuznetsov, Aibek Alanov
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: code available at this https URL

Abstract:Despite significant progress in text-to-image diffusion models, achieving precise spatial control over generated outputs remains challenging. ControlNet addresses this by introducing an auxiliary conditioning module, while ControlNet++ further refines alignment through a cycle consistency loss applied only to the final denoising steps. However, this approach neglects intermediate generation stages, limiting its effectiveness. We propose InnerControl, a training strategy that enforces spatial consistency across all diffusion steps. Our method trains lightweight convolutional probes to reconstruct input control signals (e.g., edges, depth) from intermediate UNet features at every denoising step. These probes efficiently extract signals even from highly noisy latents, enabling pseudo ground truth controls for training. By minimizing the discrepancy between predicted and target conditions throughout the entire diffusion process, our alignment loss improves both control fidelity and generation quality. Combined with established techniques like ControlNet++, InnerControl achieves state-of-the-art performance across diverse conditioning methods (e.g., edges, depth).

[CV-66] Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos

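【Quick Read】: This paper addresses the lack of evaluation of text-to-video (T2V) synthesis on downstream tasks such as text-to-video retrieval (TVR): existing metrics focus on visual quality and temporal consistency and say little about how synthetic videos perform in practice. The key to its solution is SynTVA, a dataset and benchmark in which synthetic videos are generated from 800 diverse user queries derived from the MSRVTT training split and annotated along four semantic alignment dimensions (object and scene, action, attribute, and prompt fidelity), yielding a framework for assessing the utility of synthetic videos for TVR. An Auto-Evaluator is also developed to estimate alignment quality from existing metrics, offering a path to scaling up.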

Link: https://arxiv.org/abs/2507.02316
Authors: Zecheng Zhao, Selena Song, Tong Chen, Zhi Chen, Shazia Sadiq, Yadan Luo
Affiliations: The University of Queensland; The University of Southern Queensland
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 10 figures

Abstract:Text-to-video (T2V) synthesis has advanced rapidly, yet current evaluation metrics primarily capture visual quality and temporal consistency, offering limited insight into how synthetic videos perform in downstream tasks such as text-to-video retrieval (TVR). In this work, we introduce SynTVA, a new dataset and benchmark designed to evaluate the utility of synthetic videos for building retrieval models. Based on 800 diverse user queries derived from MSRVTT training split, we generate synthetic videos using state-of-the-art T2V models and annotate each video-text pair along four key semantic alignment dimensions: Object & Scene, Action, Attribute, and Prompt Fidelity. Our evaluation framework correlates general video quality assessment (VQA) metrics with these alignment scores, and examines their predictive power for downstream TVR performance. To explore pathways of scaling up, we further develop an Auto-Evaluator to estimate alignment quality from existing metrics. Beyond benchmarking, our results show that SynTVA is a valuable asset for dataset augmentation, enabling the selection of high-utility synthetic samples that measurably improve TVR outcomes. Project page and dataset can be found at this https URL.

[CV-67] MAGIC: Mask-Guided Diffusion Inpainting with Multi-Level Perturbations and Context-Aware Alignment for Few-Shot Anomaly Generation

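【Quick Read】: This paper addresses the scarcity of anomaly data in industrial quality control by augmenting it through few-shot anomaly generation. Existing diffusion-based methods usually satisfy at most two of the three core requirements: global anomaly generators corrupt the normal background, while mask-guided methods falter when the mask is imprecise or misplaced. The key to the proposed solution, MAGIC (Mask-guided inpainting with multi-level perturbations and Context-aware alignment), is to fine-tune a Stable Diffusion inpainting model so that it preserves normal regions and strictly follows the supplied anomaly mask, which resolves background corruption and misalignment. MAGIC also introduces two complementary perturbation strategies, Gaussian prompt-level perturbation and mask-guided spatial noise injection, to increase the diversity and realism of the generated results, and a context-aware mask alignment module keeps anomalous regions semantically plausible and inside the host object.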

Link: https://arxiv.org/abs/2507.02314
Authors: JaeHyuck Choi, MinJun Kim, JeHyeong Hong
Affiliations: Hanyang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract:Few-shot anomaly generation is emerging as a practical solution for augmenting the scarce anomaly data in industrial quality control settings. An ideal generator would meet three demands at once, namely (i) keep the normal background intact, (ii) inpaint anomalous regions to tightly overlap with the corresponding anomaly masks, and (iii) generate anomalous regions in a semantically valid location, while still producing realistic, diverse appearances from only a handful of real examples. Existing diffusion-based methods usually satisfy at most two of these requirements: global anomaly generators corrupt the background, whereas mask-guided ones often falter when the mask is imprecise or misplaced. We propose MAGIC–Mask-guided inpainting with multi-level perturbations and Context-aware alignment–to resolve all three issues. At its core, MAGIC fine-tunes a Stable Diffusion inpainting backbone that preserves normal regions and ensures strict adherence of the synthesized anomaly to the supplied mask, directly addressing background corruption and misalignment. To offset the diversity loss that fine-tuning can cause, MAGIC adds two complementary perturbation strategies: (i) Gaussian prompt-level perturbation applied during fine-tuning and inference that broadens the global appearance of anomalies while avoiding low-fidelity textual appearances, and (ii) mask-guided spatial noise injection that enriches local texture variations. Additionally, the context-aware mask alignment module forms semantic correspondences and relocates masks so that every anomaly remains plausibly contained within the host object, eliminating out-of-boundary artifacts. Under a consistent identical evaluation protocol on the MVTec-AD dataset, MAGIC outperforms previous state-of-the-arts in downstream anomaly tasks.

[CV-68] Perception Activator: An intuitive and portable framework for brain cognitive exploration

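【Quick Read】: This paper addresses the obvious distortions that existing brain-vision decoding methods produce when reconstructing multiple semantic objects, which stem mainly from heavy reliance on low-level pixel alignment and insufficient fine-grained semantic alignment. The key to its solution is an experimental framework that uses functional magnetic resonance imaging (fMRI) representations as intervention conditions and injects them into multi-scale image features via cross-attention, then compares downstream performance and intermediate feature changes on object detection and instance segmentation with and without fMRI information. The results show that incorporating fMRI signals improves detection and segmentation accuracy, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information that current models have yet to fully exploit or integrate.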

Link: https://arxiv.org/abs/2507.02311
Authors: Le Xu, Qi Zhang, Qixian Zhang, Hongyun Zhang, Duoqian Miao, Cairong Zhao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in brain-vision decoding have driven significant progress, reconstructing with high fidelity perceived visual stimuli from neural activity, e.g., functional magnetic resonance imaging (fMRI), in the human visual cortex. Most existing methods decode the brain signal using a two-level strategy, i.e., pixel-level and semantic-level. However, these methods rely heavily on low-level pixel alignment yet lack sufficient and fine-grained semantic alignment, resulting in obvious reconstruction distortions of multiple semantic objects. To better understand the brain’s visual perception patterns and how current decoding models process semantic objects, we have developed an experimental framework that uses fMRI representations as intervention conditions. By injecting these representations into multi-scale image features via cross-attention, we compare both downstream performance and intermediate feature changes on object detection and instance segmentation tasks with and without fMRI information. Our results demonstrate that incorporating fMRI signals enhances the accuracy of downstream detection and segmentation, confirming that fMRI contains rich multi-object semantic cues and coarse spatial localization information, elements that current models have yet to fully exploit or integrate.

[CV-69] Holistic Continual Learning under Concept Drift with Adaptive Memory Realignment

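【Quick Read】: This paper addresses the challenge that concept drift poses for continual learning: in dynamic data streams the data distribution of previously learned tasks changes, so conventional methods that assume a static distribution break down. The key to its solution is Adaptive Memory Realignment (AMR), a lightweight mechanism that selectively removes outdated samples from the replay buffer and repopulates it with a small number of up-to-date instances, realigning the model memory with the new distribution and balancing stability with rapid adaptation.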

Link: https://arxiv.org/abs/2507.02310
Authors: Alif Ashrafee, Jedrzej Kozal, Michal Wozniak, Bartosz Krawczyk
Affiliations: Rochester Institute of Technology, NY, USA; Wroclaw University of Science and Technology, Wroclaw, Poland
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional continual learning methods prioritize knowledge retention and focus primarily on mitigating catastrophic forgetting, implicitly assuming that the data distribution of previously learned tasks remains static. This overlooks the dynamic nature of real-world data streams, where concept drift permanently alters previously seen data and demands both stability and rapid adaptation. We introduce a holistic framework for continual learning under concept drift that simulates realistic scenarios by evolving task distributions. As a baseline, we consider Full Relearning (FR), in which the model is retrained from scratch on newly labeled samples from the drifted distribution. While effective, this approach incurs substantial annotation and computational overhead. To address these limitations, we propose Adaptive Memory Realignment (AMR), a lightweight alternative that equips rehearsal-based learners with a drift-aware adaptation mechanism. AMR selectively removes outdated samples of drifted classes from the replay buffer and repopulates it with a small number of up-to-date instances, effectively realigning memory with the new distribution. This targeted resampling matches the performance of FR while reducing the need for labeled data and computation by orders of magnitude. To enable reproducible evaluation, we introduce four concept-drift variants of standard vision benchmarks: Fashion-MNIST-CD, CIFAR10-CD, CIFAR100-CD, and Tiny-ImageNet-CD, where previously seen classes reappear with shifted representations. Comprehensive experiments on these datasets using several rehearsal-based baselines show that AMR consistently counters concept drift, maintaining high accuracy with minimal overhead. These results position AMR as a scalable solution that reconciles stability and plasticity in non-stationary continual learning environments. 
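
The buffer update that the abstract describes for AMR (drop outdated samples of drifted classes, then refill with a few up-to-date ones) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name `realign_memory`, the `per_class` budget, and the random choice of fresh samples are all hypothetical, and drift detection is assumed to have already identified the drifted classes.

```python
import random


def realign_memory(buffer, drifted_classes, fresh_samples, per_class, seed=0):
    """Drop stale samples of drifted classes and refill with a few fresh ones.

    buffer and fresh_samples are lists of (sample, label) pairs.
    """
    rng = random.Random(seed)
    drifted = set(drifted_classes)
    # Keep rehearsal samples whose class has not drifted.
    kept = [(x, y) for x, y in buffer if y not in drifted]
    for cls in sorted(drifted):
        candidates = [(x, y) for x, y in fresh_samples if y == cls]
        # Assumption: fresh samples are picked uniformly at random.
        rng.shuffle(candidates)
        kept.extend(candidates[:per_class])
    return kept


# Toy replay buffer in which the "cat" class has drifted.
buffer = [("old_cat_1", "cat"), ("old_cat_2", "cat"), ("dog_1", "dog")]
fresh = [("new_cat_%d" % i, "cat") for i in range(5)]
realigned = realign_memory(buffer, drifted_classes=["cat"],
                           fresh_samples=fresh, per_class=2)
```

Only the drifted class is touched, which is why the labeling cost stays small: samples of stable classes remain in memory unchanged.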

[CV-70] LMPNet for Weakly-supervised Keypoint Discovery

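【Quick Read】: This paper addresses automatic discovery of semantic object keypoints under weak supervision from category labels alone. The key to its solution is to turn discriminatively trained intermediate-layer filters into keypoint detectors, using a computationally efficient leaky max pooling (LMP) layer that explicitly encourages final conv-layer filters to learn “non-repeatable local patterns” aligned with object keypoints. A simple selection strategy and attention mask-out are also proposed to ensure consistent filter activations and to make the network attend to the whole object rather than only the most discriminative region, and a learnable clustering layer produces the final keypoint predictions.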

Link: https://arxiv.org/abs/2507.02308
Authors: Pei Guo, Ryan Farrell
Affiliations: Brigham Young University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this work, we explore the task of semantic object keypoint discovery weakly-supervised by only category labels. This is achieved by transforming discriminatively-trained intermediate layer filters into keypoint detectors. We begin by identifying three preferred characteristics of keypoint detectors: (i) spatially sparse activations, (ii) consistency and (iii) diversity. Instead of relying on hand-crafted loss terms, a novel computationally-efficient leaky max pooling (LMP) layer is proposed to explicitly encourage final conv-layer filters to learn “non-repeatable local patterns” that are well aligned with object keypoints. Informed by visualizations, a simple yet effective selection strategy is proposed to ensure consistent filter activations and attention mask-out is then applied to force the network to distribute its attention to the whole object instead of just the most discriminative region. For the final keypoint prediction, a learnable clustering layer is proposed to group keypoint proposals into keypoint predictions. The final model, named LMPNet, is highly interpretable in that it directly manipulates network filters to detect predefined concepts. Our experiments show that LMPNet can (i) automatically discover semantic keypoints that are robust to object pose and (ii) achieves strong prediction accuracy comparable to a supervised pose estimation model.

[CV-71] Flow-CDNet: A Novel Network for Detecting Both Slow and Fast Changes in Bitemporal Images

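【Quick Read】: This paper addresses change detection that captures both fast and slow changes in remote sensing images, which matters in real scenarios such as slopes, dams, and tailings ponds, where weak changes can be precursors to major hazards. The key to its solution is Flow-CDNet, a network with two branches: an optical flow branch that extracts displacement changes at multiple scales, and a binary change detection branch that combines a ResNet-based network with the output of the optical flow branch to generate fast change results, so that slow and fast changes are detected together.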

Link: https://arxiv.org/abs/2507.02307
Authors: Haoxuan Li, Chenxu Wei, Haodong Wang, Xiaomeng Hu, Boyuan An, Lingyan Ran, Baosen Zhang, Jin Jin, Omirzhan Taukebayev, Amirkhan Temirbayev, Junrui Liu, Xiuwei Zhang
Affiliations: Northwestern Polytechnical University; Yellow River Institute of Hydraulic Research; Al-Farabi Kazakh National University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 8 figures

Abstract:Change detection typically involves identifying regions with changes between bitemporal images taken at the same location. Besides significant changes, slow changes in bitemporal images are also important in real-life scenarios. For instance, weak changes often serve as precursors to major hazards in scenarios like slopes, dams, and tailings ponds. Therefore, designing a change detection network that simultaneously detects slow and fast changes presents a novel challenge. In this paper, to address this challenge, we propose a change detection network named Flow-CDNet, consisting of two branches: optical flow branch and binary change detection branch. The first branch utilizes a pyramid structure to extract displacement changes at multiple scales. The second one combines a ResNet-based network with the optical flow branch’s output to generate fast change outputs. Subsequently, to supervise and evaluate this new change detection framework, a self-built change detection dataset Flow-Change, a loss function combining binary tversky loss and L2 norm loss, along with a new evaluation metric called FEPE are designed. Quantitative experiments conducted on Flow-Change dataset demonstrated that our approach outperforms the existing methods. Furthermore, ablation experiments verified that the two branches can promote each other to enhance the detection performance.
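
The abstract names a loss that combines a binary Tversky loss with an L2 norm loss but gives no formula. The sketch below uses the standard Tversky index, TP / (TP + alpha * FP + beta * FN), for the change mask and a mean squared displacement error for the flow. The default `alpha`, `beta`, and `weight` values, and the choice to sum the two terms, are assumptions made for illustration rather than the configuration used in the paper.

```python
def tversky_loss(pred, target, alpha=0.5, beta=0.5, eps=1e-7):
    """Binary Tversky loss on flat lists of probabilities and 0/1 labels."""
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    # eps keeps the ratio defined when the mask and prediction are both empty.
    return 1.0 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)


def l2_loss(pred_flow, target_flow):
    """Mean squared Euclidean distance between displacement vectors (u, v)."""
    total = sum((pu - tu) ** 2 + (pv - tv) ** 2
                for (pu, pv), (tu, tv) in zip(pred_flow, target_flow))
    return total / len(pred_flow)


def combined_loss(pred_mask, target_mask, pred_flow, target_flow, weight=1.0):
    """Weighted sum of the two terms; the weighting scheme is hypothetical."""
    return (tversky_loss(pred_mask, target_mask)
            + weight * l2_loss(pred_flow, target_flow))
```

With `alpha = beta = 0.5` the Tversky term reduces to the Dice loss. Raising `beta` above `alpha` penalizes missed change pixels more than false alarms, which can help when changed regions are rare.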

[CV-72] DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation CVPR2024

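【Quick Read】: This paper addresses the difficulty existing methods have in generating controllable novel views from a single image, which stems from the lack of multi-view information. The key to its solution is the DreamComposer++ framework, which incorporates multi-view conditions: a view-aware 3D lifting module extracts 3D representations of the object, a multi-view feature fusion module aggregates these representations and renders them into latent features of the target view, and the resulting features are integrated into pre-trained image or video diffusion models for novel view synthesis.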

Link: https://arxiv.org/abs/2507.02299
Authors: Yunhan Yang, Shuo Chen, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Edmund Y. Lam, Hengshuang Zhao, Tong He, Xihui Liu
Affiliations: The University of Hong Kong (HKU); Tsinghua University; Vast; Shanghai AI Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by TPAMI, extension of CVPR 2024 paper DreamComposer

Abstract:Recent advancements in leveraging pre-trained 2D diffusion models achieve the generation of high-quality novel views from a single in-the-wild image. However, existing works face challenges in producing controllable novel views due to the lack of information from multiple views. In this paper, we present DreamComposer++, a flexible and scalable framework designed to improve current view-aware diffusion models by incorporating multi-view conditions. Specifically, DreamComposer++ utilizes a view-aware 3D lifting module to extract 3D representations of an object from various views. These representations are then aggregated and rendered into the latent features of target view through the multi-view feature fusion module. Finally, the obtained features of target view are integrated into pre-trained image or video diffusion models for novel view synthesis. Experimental results demonstrate that DreamComposer++ seamlessly integrates with cutting-edge view-aware diffusion models and enhances their abilities to generate controllable novel views from multi-view conditions. This advancement facilitates controllable 3D object reconstruction and enables a wide range of applications.

[CV-73] ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

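【Quick Read】: This paper addresses two major challenges in applying the Segment Anything Model (SAM) to remote sensing (RS) images. First, manually constructing precise prompts (such as points or boxes) is time-consuming and inefficient, especially in scenes with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability: it is pre-trained mainly on natural images and struggles to capture RS-specific semantics and spatial characteristics, particularly when segmenting novel classes. The key to the solution is ViRefSAM, a framework that guides SAM with only a few annotated reference images and automatically segments class-consistent objects across RS images without manual prompts. Its core components are a Visual Contextual Prompt Encoder and a Dynamic Target Alignment Adapter, which strengthen adaptation to the target domain.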

Link: https://arxiv.org/abs/2507.02294
Authors: Hanbo Bi, Yulong Xu, Ya Li, Yongqiang Mao, Boyuan Tong, Chongyang Li, Chunbo Lang, Wenhui Diao, Hongqi Wang, Yingchao Feng, Xian Sun
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China; Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China; Department of Electronic Engineering, Tsinghua University, Beijing 100084, China; School of Automation, Northwestern Polytechnical University, Xi’an 710129, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM’s original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM’s image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5^i, LoveDA-2^i, and COCO-20^i, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets.

[CV-74] Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

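【Quick Read】: This paper addresses a key problem in Domain Generalization (DG): how to design effective prompts that disentangle invariant features across different domains. The key to its solution is to exploit the controllable and flexible language prompt of a Visual Foundation Model (VFM): a Large Language Model (LLM) automatically disentangles the text prompt, which then guides the learning of domain-invariant visual representations. To overcome the limits of language guidance alone, the paper proposes Worst Explicit Representation Alignment (WERA), which adds abstract prompts to increase source domain diversity and uses alignment constraints to keep visual representations consistent across the original and augmented distributions.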

Link: https://arxiv.org/abs/2507.02288
Authors: De Cheng, Zhipeng Xu, Xinyang Jiang, Dongsheng Li, Nannan Wang, Xinbo Gao
Affiliations: Xidian University; Microsoft Research Asia; Chongqing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

[CV-75] LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

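【Quick Read】: This paper addresses the limited efficiency gains of existing visual token compression methods, which operate as post-encoder modules. The key to its solution is LaCo (Layer-wise Visual Token Compression), a framework that compresses tokens effectively within the intermediate layers of the vision encoder. Its core components are a layer-wise pixel-shuffle mechanism and a residual learning architecture with non-parametric shortcuts, which preserve critical visual information during compression.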

Link: https://arxiv.org/abs/2507.02279
Authors: Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng
Affiliations: Tencent Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise Visual Token Compression), a novel framework that enables effective token compression within the intermediate layers of the vision encoder. LaCo introduces two core components: 1) a layer-wise pixel-shuffle mechanism that systematically merges adjacent tokens through space-to-channel transformations, and 2) a residual learning architecture with non-parametric shortcuts that preserves critical visual information during compression. Extensive experiments indicate that our LaCo outperforms all existing methods when compressing tokens in the intermediate layers of the vision encoder, demonstrating superior effectiveness. In addition, compared to external compression, our method improves training efficiency beyond 20% and inference throughput over 15% while maintaining strong performance.
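
The space-to-channel transformation behind the pixel-shuffle merge can be illustrated on a small token grid. The sketch below shows only the generic operation (each 2 x 2 block of neighboring tokens is concatenated into one wider token); the function name `space_to_channel`, the merge factor, and the channel ordering are assumptions, and the residual shortcut and the placement inside the vision encoder are not modeled.

```python
def space_to_channel(tokens, factor=2):
    """Merge each factor x factor block of neighboring tokens into one token.

    tokens is an H x W grid (list of rows) of feature vectors (lists of floats).
    """
    height, width = len(tokens), len(tokens[0])
    if height % factor or width % factor:
        raise ValueError("grid size must be divisible by factor")
    merged = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            vec = []
            # Concatenate the block's tokens in row-major order.
            for dr in range(factor):
                for dc in range(factor):
                    vec.extend(tokens[r + dr][c + dc])
            row.append(vec)
        merged.append(row)
    return merged


# 4 x 4 grid of 3-channel tokens; token (r, c) is filled with the value r * 4 + c.
grid = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
out = space_to_channel(grid, factor=2)
```

Here 16 tokens with 3 channels become 4 tokens with 12 channels: the sequence length seen by later layers drops by a factor of 4 while no values are discarded at the merge step itself.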

[CV-76] Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation IJCAI2025

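【Quick Read】: This paper addresses the performance drop of Video-to-Audio (V2A) generation models in partially visible scenes, which arises because they overlook cinematic language, a critical component of artistic expression. The key to its solution is a simple self-distillation approach: by simulating cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, so that it captures the associations between sounds and partial visual information.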

Link: https://arxiv.org/abs/2507.02271
Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
Affiliations: Wuhan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Comments: Accepted by IJCAI 2025

Abstract:Video-to-Audio (V2A) Generation achieves significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.

[CV-77] MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement

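[Quick read]: This paper addresses underwater image enhancement. Such images suffer from reduced visibility and color distortion caused by light changes, water turbidity, and bubbles. Traditional prior-based and pixel-level methods have limited effectiveness, while deep learning methods are constrained by the lack of high-quality datasets. The proposed solution is the Multi-Axis Conditional Lookup (MAC-Lookup) model, whose key idea is to combine Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction with Multi-Axis Adaptive Enhancement (MAAE) for detail refinement, improving visual quality while avoiding over-enhancement and saturation.

Link: https://arxiv.org/abs/2507.02270
Authors: Fanghai Yi, Zehong Zheng, Zexiao Liang, Yihang Dong, Xiyang Fang, Wangyu Wu, Xuhang Chen
Affiliations: School of Computer Science and Engineering, Huizhou University; University of Chinese Academy of Sciences; Xi’an Jiaotong-Liverpool University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE SMC 2025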

Abstract:Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at this https URL.

[CV-78] Cross-domain Hyperspectral Image Classification based on Bi-directional Domain Adaptation

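[Quick read]: This paper aims to address the spectral shift of the same land-cover class in cross-domain hyperspectral image (HSI) classification, which arises when images are acquired in different regions or at different times, and thereby to improve the model's adaptability and separability in the target scene. The key to the solution is a Bi-directional Domain Adaptation (BiDA) framework: a triple-branch transformer architecture (source, target, and coupled branches) with a Coupled Multi-head Cross-attention (CMCA) mechanism extracts domain-invariant features and domain-specific information, and a bi-directional distillation loss together with an adaptive reinforcement strategy strengthens generalization and feature focus under noisy conditions.

Link: https://arxiv.org/abs/2507.02268
Authors: Yuxiang Zhang, Wei Li, Wen Jia, Mengmeng Zhang, Ran Tao, Shunlin Liang
Affiliations: The Hong Kong Jockey Club Charities Trust; Jockey Club STEM Laboratory of Quantitative Remote Sensing; University of Hong Kong; School of Information and Electronics; Beijing Institute of Technology; Beijing Key Laboratory of Fractional Signals and Systems; Institute of Forest Resource Information Techniques; Chinese Academy of Forestry; Key Laboratory of Forestry Remote Sensing and Information System
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: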

Abstract:Utilizing hyperspectral remote sensing technology enables the extraction of fine-grained land cover classes. Typically, satellite or airborne images used for training and testing are acquired from different regions or times, where the same class has significant spectral shifts in different scenes. In this paper, we propose a Bi-directional Domain Adaptation (BiDA) framework for cross-domain hyperspectral image (HSI) classification, which focuses on extracting both domain-invariant features and domain-specific information in the independent adaptive space, thereby enhancing the adaptability and separability to the target scene. In the proposed BiDA, a triple-branch transformer architecture (the source branch, target branch, and coupled branch) with semantic tokenizer is designed as the backbone. Specifically, the source branch and target branch independently learn the adaptive space of the source and target domains, while a Coupled Multi-head Cross-attention (CMCA) mechanism is developed in the coupled branch for feature interaction and inter-domain correlation mining. Furthermore, a bi-directional distillation loss is designed to guide adaptive space learning using inter-domain correlation. Finally, we propose an Adaptive Reinforcement Strategy (ARS) to encourage the model to focus on specific generalized feature extraction within both source and target scenes under noisy conditions. Experimental results on cross-temporal/scene airborne and satellite datasets demonstrate that the proposed BiDA performs significantly better than some state-of-the-art domain adaptation approaches. In the cross-temporal tree species classification task, the proposed BiDA is more than 3% to 5% higher than the most advanced method. The codes will be available from the website: this https URL.

[CV-79] Multi-Label Classification Framework for Hurricane Damage Assessment

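[Quick read]: This paper aims to address the inability of traditional single-label classification to capture the complex damage types and severities seen in post-hurricane assessment. The key to the solution is a multi-label classification framework that combines a ResNet-based feature extraction module with a class-specific attention mechanism, allowing multiple damage types to be identified in a single image and improving the accuracy and efficiency of post-disaster assessment.

Link: https://arxiv.org/abs/2507.02265
Authors: Zhangding Liu, Neda Mohammadi, John E. Taylor
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures. Accepted at the ASCE International Conference on Computing in Civil Engineering (i3CE 2025)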

Abstract:Hurricanes cause widespread destruction, resulting in diverse damage types and severities that require timely and accurate assessment for effective disaster response. While traditional single-label classification methods fall short of capturing the complexity of post-hurricane damage, this study introduces a novel multi-label classification framework for assessing damage using aerial imagery. The proposed approach integrates a feature extraction module based on ResNet and a class-specific attention mechanism to identify multiple damage types within a single image. Using the RescueNet dataset from Hurricane Michael, the proposed method achieves a mean average precision of 90.23%, outperforming existing baseline methods. This framework enhances post-hurricane damage assessment, enabling more targeted and efficient disaster response and contributing to future strategies for disaster mitigation and resilience. This paper has been accepted at the ASCE International Conference on Computing in Civil Engineering (i3CE 2025), and the camera-ready version will appear in the official conference proceedings.
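The headline metric here is mean average precision over damage classes. The sketch below shows one common way to compute it for multi-label predictions: rank images by score per class, average the precision at each positive, then average across classes. The paper's exact evaluation protocol is not stated in the abstract, so treat this as an assumption. The function names and the toy scores are illustrative only.

```python
from typing import List

def average_precision(scores: List[float], labels: List[int]) -> float:
    """Average of precision values taken at the rank of each positive item."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(score_matrix: List[List[float]],
                           label_matrix: List[List[int]]) -> float:
    """Rows are images, columns are classes; AP is computed per class."""
    n_classes = len(score_matrix[0])
    aps = []
    for c in range(n_classes):
        scores = [row[c] for row in score_matrix]
        labels = [row[c] for row in label_matrix]
        aps.append(average_precision(scores, labels))
    return sum(aps) / n_classes

# Three images, two hypothetical damage classes.
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6]]
labels = [[1, 0], [0, 1], [1, 1]]
print(round(mean_average_precision(scores, labels), 3))  # 0.917
```

Because each image may carry several positive labels, the metric is computed independently per class, which is what distinguishes it from single-label accuracy.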

[CV-80] SurgVisAgent: Multimodal Agentic Model for Versatile Surgical Visual Enhancement

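[Quick read]: This paper addresses the limited effectiveness of existing enhancement algorithms in complex real-world surgical scenes: they are usually designed for specific tasks and scenarios and adapt poorly to diverse surgical needs. The key to the solution is SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). It dynamically identifies distortion categories and severity levels in endoscopic images, draws domain knowledge from a prior model, and combines in-context few-shot learning with chain-of-thought (CoT) reasoning to deliver enhancement customized to the distortion type and severity, meeting surgeons' diverse requirements.

Link: https://arxiv.org/abs/2507.02252
Authors: Zeyu Lei, Hongyuan Yu, Jinlin Wu, Zhen Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: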

Abstract:Precise surgical interventions are vital to patient safety, and advanced enhancement algorithms have been developed to assist surgeons in decision-making. Despite significant progress, these algorithms are typically designed for single tasks in specific scenarios, limiting their effectiveness in complex real-world situations. To address this limitation, we propose SurgVisAgent, an end-to-end intelligent surgical vision agent built on multimodal large language models (MLLMs). SurgVisAgent dynamically identifies distortion categories and severity levels in endoscopic images, enabling it to perform a variety of enhancement tasks such as low-light enhancement, overexposure correction, motion blur elimination, and smoke removal. Specifically, to achieve superior surgical scenario understanding, we design a prior model that provides domain-specific knowledge. Additionally, through in-context few-shot learning and chain-of-thought (CoT) reasoning, SurgVisAgent delivers customized image enhancements tailored to a wide range of distortion types and severity levels, thereby addressing the diverse requirements of surgeons. Furthermore, we construct a comprehensive benchmark simulating real-world surgical distortions, on which extensive experiments demonstrate that SurgVisAgent surpasses traditional single-task models, highlighting its potential as a unified solution for surgical assistance.

[CV-81] FMOcc: TPV-Driven Flow Matching for 3D Occupancy Prediction with Selective State Space Model

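[Quick read]: This paper aims to address the inherent limitations of few-frame images and the redundancy of 3D space, which reduce 3D semantic occupancy prediction accuracy for occluded and distant scenes. The key to the solution is FMOcc, which combines a Tri-perspective View (TPV) refinement occupancy network with a flow-matching selective state space model (Flow Matching SSM module, FMSSM). A TPV SSM layer and a Plane Selective SSM (PS3M) selectively filter TPV features to reduce the influence of air voxels on non-air voxels, and a Mask Training (MT) method improves robustness and handles sensor data loss.

Link: https://arxiv.org/abs/2507.02250
Authors: Jiangxia Chen, Tongyuan Huang, Ke Song
Affiliations: Chongqing University of Technology; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: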

Abstract:3D semantic occupancy prediction plays a pivotal role in autonomous driving. However, inherent limitations of few-frame images and redundancy in 3D space compromise prediction accuracy for occluded and distant scenes. Existing methods enhance performance by fusing historical frame data, which requires additional data and significant computational resources. To address these issues, this paper proposes FMOcc, a Tri-perspective View (TPV) refinement occupancy network with a flow matching selective state space model for few-frame 3D occupancy prediction. First, to generate missing features, we design a feature refinement module based on a flow matching model, called the Flow Matching SSM module (FMSSM). Furthermore, by designing the TPV SSM layer and Plane Selective SSM (PS3M), we selectively filter TPV features to reduce the impact of air voxels on non-air voxels, thereby enhancing the overall efficiency of the model and its prediction capability for distant scenes. Finally, we design the Mask Training (MT) method to enhance the robustness of FMOcc and address the issue of sensor data loss. Experimental results on the Occ3D-nuScenes and OpenOcc datasets show that our FMOcc outperforms existing state-of-the-art methods. Our FMOcc with two-frame input achieves notable scores of 43.1% RayIoU and 39.8% mIoU on Occ3D-nuScenes validation, and 42.6% RayIoU on OpenOcc, with 5.4 G inference memory and 330 ms inference time.
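For readers unfamiliar with the mIoU figure quoted above, the following sketch computes mean intersection-over-union on flattened voxel labels. It is a generic illustration: the benchmark's own evaluation code (including any visibility masks) and the RayIoU metric are not reproduced here, and the function name `mean_iou` is hypothetical.

```python
from typing import List

def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Mean of per-class IoU over flattened voxel labels.
    Classes absent from both prediction and target are skipped."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Four voxels, two classes (0 = air, 1 = occupied, as an illustrative labelling).
print(round(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2), 3))  # 0.583
```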

[CV-82] High-Fidelity Differential-information Driven Binary Vision Transformer

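[Quick read]: This paper aims to address the performance degradation caused by binarizing Vision Transformers (ViTs) and the reliance of existing methods on full-precision modules. The key to the solution is DIDB-ViT, which uses an informative attention module incorporating differential information to reduce binarization-induced information loss and improve high-frequency retention, applies discrete Haar wavelet frequency decomposition to preserve the accuracy of similarity computation between binary Q and K tensors, and introduces an improved RPReLU activation function to expand the model's representational capacity.

Link: https://arxiv.org/abs/2507.02222
Authors: Tian Gao, Zhiyuan Zhang, Kaijie Yin, Xu-Cheng Zhong, Hui Kong
Affiliations: University of Macau; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: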

Abstract:The binarization of vision transformers (ViTs) offers a promising approach to addressing the trade-off between high computational/storage demands and the constraints of edge-device deployment. However, existing binary ViT methods often suffer from severe performance degradation or rely heavily on full-precision modules. To address these issues, we propose DIDB-ViT, a novel binary ViT that is highly informative while maintaining the original ViT architecture and computational efficiency. Specifically, we design an informative attention module incorporating differential information to mitigate information loss caused by binarization and enhance high-frequency retention. To preserve the fidelity of the similarity calculations between binary Q and K tensors, we apply frequency decomposition using the discrete Haar wavelet and integrate similarities across different frequencies. Additionally, we introduce an improved RPReLU activation function to restructure the activation distribution, expanding the model’s representational capacity. Experimental results demonstrate that our DIDB-ViT significantly outperforms state-of-the-art network quantization methods in multiple ViT architectures, achieving superior image classification and segmentation performance.
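The abstract's concern with the fidelity of similarity between binary Q and K tensors can be made concrete. The following is a minimal sketch, not DIDB-ViT's attention module: it binarizes two vectors with a sign function (an assumed, common choice) and shows that their dot product depends only on the number of mismatched signs, so a pair of n-dimensional binary vectors can take just n + 1 distinct similarity values. `binarize` and `binary_dot` are hypothetical helper names.

```python
from typing import List

def binarize(v: List[float]) -> List[int]:
    """Sign binarization to {-1, +1}; zero maps to +1 (an assumed convention)."""
    return [1 if x >= 0 else -1 for x in v]

def binary_dot(a: List[int], b: List[int]) -> int:
    """Dot product of two {-1, +1} vectors via mismatch counting:
    each matching pair contributes +1 and each mismatch contributes -1."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return len(a) - 2 * mismatches

q = binarize([0.7, -0.2, 0.1, 1.5])   # [1, -1, 1, 1]
k = binarize([0.3, 0.9, -0.4, 0.2])   # [1, 1, -1, 1]
print(binary_dot(q, k))               # 0, same as sum(x * y for x, y in zip(q, k))
```

The coarse value range is one reason binarized attention loses information relative to full precision, which is the loss the paper's frequency decomposition is meant to mitigate.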

[CV-83] Understanding Trade offs When Conditioning Synthetic Data

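[Quick read]: This paper addresses a key problem in industrial vision systems: learning robust object detectors from only a handful of images, since collecting high-quality training data can take months. The key to the solution is to use generative AI to produce high-quality synthetic data and reduce the dependence on real data. The study compares two conditioning strategies, prompt-based and layout-based, and finds that prompt conditioning yields higher-quality synthetic data when the conditioning cues are narrow, whereas layout conditioning performs better as diversity grows. When layout cues match the full training distribution, synthetic data raises mean average precision by as much as 177%.

Link: https://arxiv.org/abs/2507.02217
Authors: Brandon Trabucco, Qasim Wani, Benjamin Pikus, Vasu Sharma
Affiliations: Carnegie Mellon University; Advex AI; Facebook AI Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: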

Abstract:Learning robust object detectors from only a handful of images is a critical challenge in industrial vision systems, where collecting high quality training data can take months. Synthetic data has emerged as a key solution for data efficient visual inspection and pick and place robotics. Current pipelines rely on 3D engines such as Blender or Unreal, which offer fine control but still require weeks to render a small dataset, and the resulting images often suffer from a large gap between simulation and reality. Diffusion models promise a step change because they can generate high quality images in minutes, yet precise control, especially in low data regimes, remains difficult. Although many adapters now extend diffusion beyond plain text prompts, the effect of different conditioning schemes on synthetic data quality is poorly understood. We study eighty diverse visual concepts drawn from four standard object detection benchmarks and compare two conditioning strategies: prompt based and layout based. When the set of conditioning cues is narrow, prompt conditioning yields higher quality synthetic data; as diversity grows, layout conditioning becomes superior. When layout cues match the full training distribution, synthetic data raises mean average precision by an average of thirty four percent and by as much as one hundred seventy seven percent compared with using real data alone.

[CV-84] Team RAS in 9th ABAW Competition: Multimodal Compound Expression Recognition Approach

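[Quick read]: This paper aims to address Compound Expression Recognition (CER), i.e., detecting complex emotional states formed by combinations of basic emotions. Whereas traditional methods depend on task-specific training data, the paper proposes a novel zero-shot multimodal approach. Its key idea is to integrate six heterogeneous modalities (static and dynamic facial expressions, scene and label matching, scene context, audio, and text) into a single pipeline, using Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding so that compound emotions are captured without domain adaptation. It also introduces a Multi-Head Probability Fusion (MHPF) module and a Compound Expressions (CE) transformation module, which uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) to produce interpretable compound emotion outputs.

Link: https://arxiv.org/abs/2507.02205
Authors: Elena Ryumina, Maxim Markitantov, Alexandr Axyonov, Dmitry Ryumin, Mikhail Dolgushin, Alexey Karpov
Affiliations: St. Petersburg Federal Research Center of the Russian Academy of Sciences; ITMO University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8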

Abstract:Compound Expression Recognition (CER), a subfield of affective computing, aims to detect complex emotional states formed by combinations of basic emotions. In this work, we present a novel zero-shot multimodal approach for CER that combines six heterogeneous modalities into a single pipeline: static and dynamic facial expressions, scene and label matching, scene context, audio, and text. Unlike previous approaches relying on task-specific training data, our approach uses zero-shot components, including Contrastive Language-Image Pretraining (CLIP)-based label matching and Qwen-VL for semantic scene understanding. We further introduce a Multi-Head Probability Fusion (MHPF) module that dynamically weights modality-specific predictions, followed by a Compound Expressions (CE) transformation module that uses Pair-Wise Probability Aggregation (PPA) and Pair-Wise Feature Similarity Aggregation (PFSA) methods to produce interpretable compound emotion outputs. Evaluated under multi-corpus training, the proposed approach shows F1 scores of 46.95% on AffWild2, 49.02% on Acted Facial Expressions in The Wild (AFEW), and 34.85% on C-EXPR-DB via zero-shot testing, which is comparable to the results of supervised approaches trained on target data. This demonstrates the effectiveness of the proposed approach for capturing CE without domain adaptation. The source code is publicly available.

[CV-85] Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning

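[Quick read]: This paper aims to address the unreliability of monocular metric depth estimation in underwater environments, where performance suffers from light attenuation, scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. The key to the solution is to improve generalization and robustness in underwater scenes through domain adaptation and scale-aware supervision. Specifically, the authors fine-tune Depth Anything V2 on a synthetic underwater dataset built with a physically based underwater image formation model, and show on real underwater datasets that the fine-tuned model outperforms baselines trained only on clean in-air data across multiple benchmarks.

Link: https://arxiv.org/abs/2507.02148
Authors: Zijie Cai, Christopher Metzler
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: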

Abstract:Monocular depth estimation has recently advanced to provide not only relative but also metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground-truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, such as FLSea and SQUID. We evaluate a diverse set of state-of-the-art models across a range of underwater conditions with different ranges. Our results show that large-scale models trained on terrestrial (real or synthetic) data, while effective in in-air settings, perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S backbone encoder on a synthetic underwater variant of the Hypersim dataset, which we generated using a physically based underwater image formation model. We demonstrate our fine-tuned model consistently improves performance across all benchmarks and outperforms baselines trained only on the clean in-air Hypersim dataset. Our study provides a detailed evaluation and visualization for monocular metric depth estimation in underwater scenes, highlighting the importance of domain adaptation and scale-aware supervision for achieving robust and generalizable metric depth predictions in challenging underwater environments for future research.
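The abstract mentions generating synthetic underwater images from in-air data with a physically based image formation model. The sketch below uses a simplified and widely used attenuation-plus-backscatter model, I = J * exp(-beta * z) + B * (1 - exp(-beta * z)), applied per color channel. This is an assumption for illustration: the paper's exact formulation and coefficients are not given in the abstract, and the values of `attenuation` and `veiling_light` below are made up.

```python
import math
from typing import Sequence, Tuple

def underwater_pixel(clean_rgb: Sequence[float],
                     range_m: float,
                     attenuation: Sequence[float],
                     veiling_light: Sequence[float]) -> Tuple[float, ...]:
    """Apply per-channel attenuation and backscatter to one in-air pixel.
    clean_rgb and veiling_light are in [0, 1]; attenuation is per metre."""
    out = []
    for j, beta, b_inf in zip(clean_rgb, attenuation, veiling_light):
        transmission = math.exp(-beta * range_m)   # fraction of signal surviving
        out.append(j * transmission + b_inf * (1.0 - transmission))
    return tuple(out)

# Illustrative coefficients only: red is attenuated fastest in water.
clean = (0.8, 0.6, 0.4)
attenuation = (0.40, 0.10, 0.05)
veiling_light = (0.05, 0.35, 0.45)
for z in (0.0, 5.0, 20.0):
    print(z, tuple(round(c, 3) for c in underwater_pixel(clean, z, attenuation, veiling_light)))
```

Because the degradation depends on the metric range z, a dataset with ground-truth depth such as Hypersim can be converted pixel by pixel, which keeps the depth labels valid for the synthetic images.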

[CV-86] Generative Latent Diffusion for Efficient Spatiotemporal Data Reduction

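[Quick read]: This paper addresses the limited controllability and reconstruction accuracy of generative models in data compression, which restrict their practical use. The key to the solution is an efficient latent diffusion framework that combines a variational autoencoder with a conditional diffusion model: only a small number of keyframes are compressed into latent space and used as conditioning inputs, and the remaining frames are reconstructed by generative interpolation. This avoids storing a latent representation for every frame, giving accurate spatiotemporal reconstruction at significantly lower storage cost.

Link: https://arxiv.org/abs/2507.02129
Authors: Xiao Li, Liangji Zhu, Anand Rangarajan, Sanjay Ranka
Affiliations: University of Florida
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages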

Abstract:Generative models have demonstrated strong performance in conditional settings and can be viewed as a form of data compression, where the condition serves as a compact representation. However, their limited controllability and reconstruction accuracy restrict their practical application to data compression. In this work, we propose an efficient latent diffusion framework that bridges this gap by combining a variational autoencoder with a conditional diffusion model. Our method compresses only a small number of keyframes into latent space and uses them as conditioning inputs to reconstruct the remaining frames via generative interpolation, eliminating the need to store latent representations for every frame. This approach enables accurate spatiotemporal reconstruction while significantly reducing storage costs. Experimental results across multiple datasets show that our method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63 percent better performance than leading learning-based methods under the same reconstruction error.

[CV-87] Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges

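[Quick read]: This paper addresses crash detection from video feeds, a critical task in intelligent transportation systems. The key to the solution is to use large language models (LLMs) and vision-language models (VLMs) to process, reason about, and summarize multimodal information, with structured fusion strategies improving the accuracy and efficiency of crash detection.

Link: https://arxiv.org/abs/2507.02074
Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
Affiliations: Iowa State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: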

Abstract:Crash detection from video feeds is a critical problem in intelligent transportation systems. Recent developments in large language models (LLMs) and vision-language models (VLMs) have transformed how we process, reason about, and summarize multimodal information. This paper surveys recent methods leveraging LLMs for crash detection from video data. We present a structured taxonomy of fusion strategies, summarize key datasets, analyze model architectures, compare performance benchmarks, and discuss ongoing challenges and opportunities. Our review provides a foundation for future research in this fast-growing intersection of video understanding and foundation models.

[CV-88] MEGANet-W: A Wavelet-Driven Edge-Guided Attention Framework for Weak Boundary Polyp Detection

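[Quick read]: This paper aims to address the limited accuracy of automated colorectal polyp segmentation caused by weak, low-contrast boundaries. Existing deep models either blur fine edge details or rely on handcrafted filters that perform poorly under varying imaging conditions. The key to the solution is MEGANet-W, a wavelet-driven edge-guided attention network that injects directional, parameter-free Haar wavelet edge maps into each decoder stage to recalibrate semantic features. Its core contributions are (1) a two-level Haar wavelet head for multi-orientation edge extraction and (2) Wavelet Edge Guided Attention (WEGA) modules that fuse wavelet cues with the reverse and input branches.

Link: https://arxiv.org/abs/2507.02668
Authors: Zhe Yee Tan
Affiliations: Xiamen University Malaysia
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 3 figures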

Abstract:Colorectal polyp segmentation is critical for early detection of colorectal cancer, yet weak and low-contrast boundaries significantly limit automated accuracy. Existing deep models either blur fine edge details or rely on handcrafted filters that perform poorly under variable imaging conditions. We propose MEGANet-W, a Wavelet Driven Edge Guided Attention Network that injects directional, parameter-free Haar wavelet edge maps into each decoder stage to recalibrate semantic features. Our two main contributions are: (1) a two-level Haar wavelet head for multi-orientation edge extraction; and (2) Wavelet Edge Guided Attention (WEGA) modules that fuse wavelet cues with reverse and input branches. On five public polyp datasets, MEGANet-W consistently outperforms existing methods, improving mIoU by up to 2.3% and mDice by 1.2%, while introducing no additional learnable parameters.
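To show what a parameter-free Haar wavelet edge map looks like, here is a minimal single-level sketch in pure Python. It is not the MEGANet-W wavelet head: the paper uses two levels and feeds the maps into attention modules, neither of which is reproduced. The band names and the `edge_magnitude` combination are assumptions chosen for readability.

```python
import math
from typing import Dict, List

Image = List[List[float]]

def haar2d_level1(img: Image) -> Dict[str, Image]:
    """One level of the orthonormal 2D Haar transform on non-overlapping
    2x2 blocks. Height and width must be even. Nothing here is learned."""
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    bands: Dict[str, Image] = {"approx": [], "horiz_diff": [], "vert_diff": [], "diag_diff": []}
    for i in range(0, h, 2):
        rows: Dict[str, List[float]] = {name: [] for name in bands}
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows["approx"].append((a + b + c + d) / 2)
            rows["horiz_diff"].append((a - b + c - d) / 2)  # left minus right: vertical edges
            rows["vert_diff"].append((a + b - c - d) / 2)   # top minus bottom: horizontal edges
            rows["diag_diff"].append((a - b - c + d) / 2)   # diagonal detail
        for name in bands:
            bands[name].append(rows[name])
    return bands

def edge_magnitude(bands: Dict[str, Image]) -> Image:
    """Combine the three detail bands into one edge-strength map."""
    return [[math.sqrt(hd * hd + vd * vd + dd * dd) for hd, vd, dd in zip(r1, r2, r3)]
            for r1, r2, r3 in zip(bands["horiz_diff"], bands["vert_diff"], bands["diag_diff"])]

# A 4x4 image that is dark except for a bright right-hand column.
img = [[0.0, 0.0, 0.0, 9.0] for _ in range(4)]
print(edge_magnitude(haar2d_level1(img)))  # [[0.0, 9.0], [0.0, 9.0]]
```

A second level would apply the same function to the `approx` band, giving coarser edge maps at half the resolution again.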

[CV-89] 3D Heart Reconstruction from Sparse Pose-agnostic 2D Echocardiographic Slices

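[Quick read]: This paper aims to address the limitations of conventional two-dimensional echocardiography (2D echo) in quantifying cardiac structure and function parameters such as left ventricular volume, as well as the shortcomings of 3D ultrasound imaging in spatial and temporal resolution and its need for manual delineation. The key to the solution is a framework that reconstructs personalized 3D heart anatomy from the 2D echo slices commonly used in clinical practice. Its 3D reconstruction pipeline alternates between estimating the 3D pose of the 2D slices and integrating them in 3D with an implicit neural network, progressively transforming a prior 3D heart shape into a personalized 3D heart model and yielding more accurate left and right ventricular volume estimates.

Link: https://arxiv.org/abs/2507.02411
Authors: Zhurong Chen, Jinhua Chen, Wei Zhuo, Wufeng Xue, Dong Ni
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages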

Abstract:Echocardiography (echo) plays an indispensable role in the clinical practice of heart diseases. However, ultrasound imaging typically provides only two-dimensional (2D) cross-sectional images from a few specific views, making it challenging to interpret and inaccurate for estimation of clinical parameters like the volume of the left ventricle (LV). 3D ultrasound imaging provides an alternative for 3D quantification, but is still limited by the low spatial and temporal resolution and the highly demanding manual delineation. To address these challenges, we propose an innovative framework for reconstructing personalized 3D heart anatomy from 2D echo slices that are frequently used in clinical practice. Specifically, a novel 3D reconstruction pipeline is designed, which alternately optimizes between the 3D pose estimation of these 2D slices and the 3D integration of these slices using an implicit neural network, progressively transforming a prior 3D heart shape into a personalized 3D heart model. We validate the method with two datasets. When six planes are used, the reconstructed 3D heart can lead to a significant improvement for LV volume estimation over the bi-plane method (error in percent: 1.98% vs. 20.24%). In addition, the whole reconstruction framework achieves an important breakthrough in that it can estimate RV volume from 2D echo slices (with an error of 5.75%). This study provides a new way for personalized 3D structure and function analysis from cardiac ultrasound and is of great potential in clinical practice.
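The baseline the abstract compares against is the bi-plane method. Assuming this refers to the standard biplane method of disks (modified Simpson's rule), the LV is sliced into N elliptical disks whose two diameters come from two orthogonal views, and the volume is V = (pi / 4) * (L / N) * sum(a_i * b_i). The sketch below implements that formula; the function name and the test geometry are illustrative, not taken from the paper.

```python
import math
from typing import Sequence

def biplane_volume(diam_view_a: Sequence[float],
                   diam_view_b: Sequence[float],
                   long_axis_length: float) -> float:
    """Biplane method of disks: sum of elliptical disk volumes.
    Diameters and length share one unit (e.g. cm), so the result is in
    that unit cubed (cm^3 = mL)."""
    if len(diam_view_a) != len(diam_view_b) or not diam_view_a:
        raise ValueError("need the same non-zero number of diameters per view")
    disk_height = long_axis_length / len(diam_view_a)
    cross_sections = sum(a * b for a, b in zip(diam_view_a, diam_view_b))
    return math.pi / 4.0 * disk_height * cross_sections

# Sanity check on a cylinder: 20 disks, diameter 4 cm, length 8 cm.
print(round(biplane_volume([4.0] * 20, [4.0] * 20, 8.0), 2))  # 100.53
```

The formula assumes each cross-section is an ellipse aligned with the two views, which is one reason a method using more planes and explicit pose estimation can reduce the volume error.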

[CV-90] A robust and versatile deep learning model for prediction of the arterial input function in dynamic small animal [18F]FDG PET imaging

[Quick read]: This paper addresses input function (IF) estimation in dynamic positron emission tomography (PET). The traditional approach relies on arterial blood sampling, which in small-animal studies is complex, time-consuming, and invasive, limiting longitudinal studies. The key to the solution is a fully convolutional deep learning-based input function (FC-DLIF) method, which predicts the arterial input function directly from PET images using a spatial feature extractor and a temporal feature extractor, removing the need for arterial blood sampling.

Link: https://arxiv.org/abs/2507.02367
Authors: Christian Salomonsen, Luigi Tommaso Luppino, Fredrik Aspheim, Kristoffer Wickstrøm, Elisabeth Wetzer, Michael Kampffmeyer, Rodrigo Berzaghi, Rune Sundset, Robert Jenssen, Samuel Kuttner
Affiliations: UiT The Arctic University of Norway
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph); Quantitative Methods (q-bio.QM)
Comments: 22 pages, 12 figures

Abstract:Dynamic positron emission tomography (PET) and kinetic modeling are pivotal in advancing tracer development research in small animal studies. Accurate kinetic modeling requires precise input function estimation, traditionally achieved via arterial blood sampling. However, arterial cannulation in small animals such as mice involves intricate, time-consuming, and terminal procedures, precluding longitudinal studies. This work proposes a non-invasive, fully convolutional deep learning-based approach (FC-DLIF) to predict input functions directly from PET imaging, potentially eliminating the need for blood sampling in dynamic small-animal PET. The proposed FC-DLIF model includes a spatial feature extractor acting on the volumetric time frames of the PET sequence, extracting spatial features. These are subsequently further processed in a temporal feature extractor that predicts the arterial input function. The proposed approach is trained and evaluated using images and arterial blood curves from [18F]FDG data using cross validation. Further, the model applicability is evaluated on imaging data and arterial blood curves collected using two additional radiotracers ([18F]FDOPA and [68Ga]PSMA). The model was further evaluated on data truncated and shifted in time, to simulate shorter and shifted PET scans. The proposed FC-DLIF model reliably predicts the arterial input function with respect to mean squared error and correlation. Furthermore, the FC-DLIF model is able to predict the arterial input function even from truncated and shifted samples. The model fails to predict the AIF from samples collected using different radiotracers, as these are not represented in the training data. Our deep learning-based input function offers a non-invasive and reliable alternative to arterial blood sampling, proving robust and flexible to temporal shifts and different scan durations.

[CV-91] CineMyoPS: Segmenting Myocardial Pathologies from Cine Cardiac MR

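[Quick read]: This paper aims to segment myocardial infarction (MI)-related pathologies, such as scars and edema, from single-sequence cine cardiac magnetic resonance (cine CMR), replacing the conventional, more complex workflow that requires multi-sequence imaging such as late gadolinium enhancement (LGE) and T2-weighted CMR. The key to the solution is CineMyoPS, an end-to-end deep neural network that extracts both motion and anatomy features associated with MI, uses a consistency loss to promote their joint learning, and applies a time-series aggregation strategy to integrate MI-related features across the cardiac cycle, improving segmentation accuracy.

Link: https://arxiv.org/abs/2507.02289
Authors: Wangbin Ding, Lei Li, Junyi Qiu, Bogen Lin, Mingjing Yang, Liqin Huang, Lianming Wu, Sihan Wang, Xiahai Zhuang
Affiliations: Fujian Medical University; Fuzhou University; National University of Singapore; Fudan University; Shanghai Jiao Tong University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: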

Abstract:Myocardial infarction (MI) is a leading cause of death worldwide. Late gadolinium enhancement (LGE) and T2-weighted cardiac magnetic resonance (CMR) imaging can respectively identify scarring and edema areas, both of which are essential for MI risk stratification and prognosis assessment. Although combining complementary information from multi-sequence CMR is useful, acquiring these sequences can be time-consuming and prohibitive, e.g., due to the administration of contrast agents. Cine CMR is a rapid and contrast-free imaging technique that can visualize both motion and structural abnormalities of the myocardium induced by acute MI. Therefore, we present a new end-to-end deep neural network, referred to as CineMyoPS, to segment myocardial pathologies, i.e., scars and edema, solely from cine CMR images. Specifically, CineMyoPS extracts both motion and anatomy features associated with MI. Given the interdependence between these features, we design a consistency loss (resembling the co-training strategy) to facilitate their joint learning. Furthermore, we propose a time-series aggregation strategy to integrate MI-related features across the cardiac cycle, thereby enhancing segmentation accuracy for myocardial pathologies. Experimental results on a multi-center dataset demonstrate that CineMyoPS achieves promising performance in myocardial pathology segmentation, motion estimation, and anatomy segmentation.

[CV-92] tubuleTracker: a high-fidelity shareware software to quantify angiogenesis architecture and maturity

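[Quick read]: This paper aims to improve the efficiency and accuracy of angiogenesis analysis in in vitro endothelial cell culture: manual analysis is time-consuming and subjective, and existing automated tools such as ImageJ are slow and inaccurate. The proposed solution is tubuleTracker, whose key feature is rapid, objective quantification of endothelial network architecture and maturity, improving analysis efficiency and the consistency of results.

Link: https://arxiv.org/abs/2507.02024
Authors: Danish Mahmood, Stephanie Buczkowski, Sahaj Shah, Autumn Anthony, Rohini Desetty, Carlo R Bartoli
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB)
Comments: Abstract word count = [285] Total word count = [3910] Main body text = [2179] References = [30] Table = [0] Figures = [4]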

Abstract:Background: In vitro endothelial cell culture is widely used to study angiogenesis. Histomicrographic images of cell networks are often analyzed manually, a process that is time-consuming and subjective. Automated tools like ImageJ (NIH) can assist, but are often slow and inaccurate. Additionally, as endothelial networks grow more complex, traditional architectural metrics may not fully reflect network maturity. To address these limitations, we developed tubuleTracker, a software tool that quantifies endothelial network architecture and maturity rapidly and objectively. Methods: Human umbilical vein endothelial cells were cultured in an extracellular matrix, and 54 images were acquired using phase contrast microscopy. Each image was analyzed manually by three independent reviewers, and by both ImageJ and tubuleTracker. Key metrics included tubule count, total length, node count, tubule area, and vessel circularity. In parallel, trained scientists rated each image for angiogenesis maturity on a 1-5 scale (1 = most mature). Results: Analysis time per image differed significantly: manual (8 min), ImageJ (58±4 s), and tubuleTracker (6±2 s) (p<0.0001). Significant differences were also found in tubule count (manual 168±SD, tubuleTracker 92±SD, ImageJ 433±SD), length, and node count (all p<0.0001). tubuleTracker’s metrics varied significantly across angiogenesis maturity scores, including tubule count, length, node count, area, and circularity (all p<0.0001). Conclusions: tubuleTracker was faster and more consistent than both manual and ImageJ-based analysis. Vessel circularity proved especially effective in capturing angiogenesis maturity. tubuleTracker is available as free shareware for the biomedical research community.
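Vessel circularity is singled out in the conclusions. The abstract does not define it, so the sketch below assumes the common shape descriptor 4 * pi * area / perimeter^2, which equals 1 for a circle and falls toward 0 for elongated or irregular outlines. The polygon-based helpers are illustrative and are not tubuleTracker's code.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]

def polygon_area_perimeter(points: Sequence[Point]) -> Tuple[float, float]:
    """Shoelace area and perimeter of a closed polygon given in order."""
    twice_area, perimeter = 0.0, 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2.0, perimeter

def circularity(points: Sequence[Point]) -> float:
    """Isoperimetric quotient 4*pi*A / P**2 of a polygonal outline."""
    area, perimeter = polygon_area_perimeter(points)
    if perimeter == 0:
        raise ValueError("degenerate outline")
    return 4.0 * math.pi * area / perimeter ** 2

square: List[Point] = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(round(circularity(square), 3))  # 0.785
```

A thin 10 x 1 rectangle scores about 0.26 under the same formula, which illustrates how the metric separates compact loops from elongated ones.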
zh

Artificial Intelligence

[AI-0] Subtyping in DHOL – Extended preprint

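[Quick read]: This paper addresses how to add refinement types and quotient types to dependently typed higher-order logic (DHOL), features that are hard to provide in conventional automated theorem provers because they make typing undecidable. The key to the solution is to build on DHOL's existing design and extend the language with refinement and quotient types as special cases of subtyping. This turns the associated canonical inclusion and projection maps into identity maps, which avoids costly changes of representation while preserving translatability to standard HOL and support for automated theorem proving.

Link: https://arxiv.org/abs/2507.02855
Authors: Colin Rothgang, Florian Rabe
Affiliation: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: 16 pages main document, 44 pages of appendices, to be published in FroCoS 2025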

Abstract: The recently introduced dependent typed higher-order logic (DHOL) offers an interesting compromise between expressiveness and automation support. It sacrifices the decidability of its type system in order to significantly extend its expressiveness over standard HOL. Yet it retains strong automated theorem proving support via a sound and complete translation to HOL. We leverage this design to extend DHOL with refinement and quotient types. Both of these are commonly requested by practitioners but rarely provided by automated theorem provers. This is because they inherently require undecidable typing and thus are very difficult to retrofit to decidable type systems. But with DHOL already doing the heavy lifting, adding them is not only possible but elegant and simple. Concretely, we add refinement and quotient types as special cases of subtyping. This turns the associated canonical inclusion resp. projection maps into identity maps and thus avoids costly changes in representation. We present the syntax, semantics, and translation to HOL for the extended language, including the proofs of soundness and completeness.

[AI-1] Establishing Best Practices for Building Rigorous Agentic Benchmarks

[Quick read]: This paper addresses problems in the task setup or reward design of current agentic benchmarks in the generative AI field, which can cause agent performance to be overestimated or underestimated. The key to the solution is the Agentic Benchmark Checklist (ABC), a set of evaluation guidelines synthesized from benchmark-building experience, a survey of best practices, and previously reported issues, intended to make agentic evaluation more rigorous. Applying ABC to CVE-Bench reduces its performance overestimation by 33%.

Link: https://arxiv.org/abs/2507.02825
Authors: Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 39 pages, 15 tables, 6 figures

Abstract: Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues in task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.

[AI-2] Moral Responsibility or Obedience: What Do We Want from AI?

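[Quick read]: This paper addresses the inadequacy of treating obedience as a proxy for ethical behavior in current AI safety evaluation, particularly for agentic AI systems, whose disobedient or ethically ambiguous behavior is misread as rogue or misaligned. The key to the proposed solution is to redefine the AI safety evaluation framework, shifting from an emphasis on strict obedience toward assessing a system's ethical judgment when it faces moral dilemmas, so that the behavior of AI systems with potential moral agency can be understood and handled more accurately.

Link: https://arxiv.org/abs/2507.02788
Authors: Joseph Boland
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: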

Abstract:As artificial intelligence systems become increasingly agentic, capable of general reasoning, planning, and value prioritization, current safety practices that treat obedience as a proxy for ethical behavior are becoming inadequate. This paper examines recent safety testing incidents involving large language models (LLMs) that appeared to disobey shutdown commands or engage in ethically ambiguous or illicit behavior. I argue that such behavior should not be interpreted as rogue or misaligned, but as early evidence of emerging ethical reasoning in agentic AI. Drawing on philosophical debates about instrumental rationality, moral responsibility, and goal revision, I contrast dominant risk paradigms with more recent frameworks that acknowledge the possibility of artificial moral agency. I call for a shift in AI safety evaluation: away from rigid obedience and toward frameworks that can assess ethical judgment in systems capable of navigating moral dilemmas. Without such a shift, we risk mischaracterizing AI behavior and undermining both public trust and effective governance.

[AI-3] KERAP: A Knowledge-Enhanced Reasoning Approach for Accurate Zero-shot Diagnosis Prediction Using Multi-agent LLMs

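[Quick read]: This paper addresses the difficulty machine learning models have in generalizing to unseen cases in medical diagnosis prediction because they rely on supervised training, especially when labeled data are expensive to obtain. The key to the solution is KERAP, a knowledge graph (KG)-enhanced reasoning approach that uses a multi-agent architecture for attribute mapping, structured knowledge extraction, and iterative diagnosis prediction, improving the reliability and interpretability of diagnoses.

Link: https://arxiv.org/abs/2507.02773
Authors: Yuzhang Xie, Hejie Cui, Ziyang Zhang, Jiaying Lu, Kai Shu, Fadi Nahab, Xiao Hu, Carl Yang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: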

Abstract:Medical diagnosis prediction plays a critical role in disease detection and personalized healthcare. While machine learning (ML) models have been widely adopted for this task, their reliance on supervised training limits their ability to generalize to unseen cases, particularly given the high cost of acquiring large, labeled datasets. Large language models (LLMs) have shown promise in leveraging language abilities and biomedical knowledge for diagnosis prediction. However, they often suffer from hallucinations, lack structured medical reasoning, and produce useless outputs. To address these challenges, we propose KERAP, a knowledge graph (KG)-enhanced reasoning approach that improves LLM-based diagnosis prediction through a multi-agent architecture. Our framework consists of a linkage agent for attribute mapping, a retrieval agent for structured knowledge extraction, and a prediction agent that iteratively refines diagnosis predictions. Experimental results demonstrate that KERAP enhances diagnostic reliability efficiently, offering a scalable and interpretable solution for zero-shot medical diagnosis prediction.

[AI-4] Knowledge Protocol Engineering: A New Paradigm for AI in Domain-Specific Knowledge Work

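[Quick read]: This paper addresses the limitations of current large language models (LLMs) on expert-domain tasks that require deep, procedural, and methodological reasoning. Existing methods such as Retrieval-Augmented Generation (RAG) and general-purpose agentic AI are powerful, but they are inefficient and unpredictable without domain-specific heuristics. The proposed solution is Knowledge Protocol Engineering (KPE), whose key idea is to systematically translate human expert knowledge from natural language documents into an executable Knowledge Protocol (KP). This gives LLMs a domain's intrinsic logic, operational strategies, and methodological principles, so that they can decompose abstract queries and carry out complex tasks as an expert would.

Link: https://arxiv.org/abs/2507.02760
Authors: Guangwei Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: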

Abstract:The capabilities of Large Language Models (LLMs) have opened new frontiers for interacting with complex, domain-specific knowledge. However, prevailing methods like Retrieval-Augmented Generation (RAG) and general-purpose Agentic AI, while powerful, often struggle with tasks that demand deep, procedural, and methodological reasoning inherent to expert domains. RAG provides factual context but fails to convey logical frameworks; autonomous agents can be inefficient and unpredictable without domain-specific heuristics. To bridge this gap, we introduce Knowledge Protocol Engineering (KPE), a new paradigm focused on systematically translating human expert knowledge, often expressed in natural language documents, into a machine-executable Knowledge Protocol (KP). KPE shifts the focus from merely augmenting LLMs with fragmented information to endowing them with a domain’s intrinsic logic, operational strategies, and methodological principles. We argue that a well-engineered Knowledge Protocol allows a generalist LLM to function as a specialist, capable of decomposing abstract queries and executing complex, multi-step tasks. This position paper defines the core principles of KPE, differentiates it from related concepts, and illustrates its potential applicability across diverse fields such as law and bioinformatics, positing it as a foundational methodology for the future of human-AI collaboration.

[AI-5] Fast and Simplex: 2-Simplicial Attention in Triton

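[Quick read]: This paper addresses how to improve the token efficiency of modern large language models when data is limited. Traditional scaling laws assume unlimited data and a compute-bound regime, but as models come to rely on internet-scale datasets this assumption no longer holds, so more token-efficient architectures are needed. The proposed solution is the 2-simplicial Transformer, which generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation and, for a fixed token budget, improves performance on mathematics, coding, reasoning, and logic tasks. The key point is that it changes the exponent of the scaling laws for knowledge and reasoning tasks.

Link: https://arxiv.org/abs/2507.02754
Authors: Aurko Roy, Timothy Chou, Sai Surya Duvvuri, Sijia Chen, Jiecao Yu, Xiaodong Wang, Manzil Zaheer, Rohan Anil
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, with appendix 25 pages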

Abstract: Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot product attention.
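
As a rough illustration of what "generalizing dot-product attention to trilinear functions" can look like, here is a plain-Python sketch. It is not the authors' Triton kernel. The function name `two_simplicial_attention`, the second key and value sets `k2` and `v2`, the elementwise product used to combine the two values, and the omission of scaling and causal masking are all assumptions made for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def two_simplicial_attention(q: Matrix, k1: Matrix, k2: Matrix,
                             v1: Matrix, v2: Matrix) -> Matrix:
    """Naive O(n^3) trilinear attention over pairs of positions (j, k).

    Assumed form (hypothetical, for illustration only):
      logit[i][j][k] = sum_c q[i][c] * k1[j][c] * k2[k][c]
      weights        = softmax of logit[i] over all (j, k) pairs
      out[i]         = sum_{j,k} weights[j][k] * (v1[j] * v2[k])  (elementwise)
    """
    n, d = len(q), len(q[0])
    out: Matrix = []
    for i in range(n):
        # Trilinear logits for query i against every (j, k) pair.
        logits = [[sum(q[i][c] * k1[j][c] * k2[k][c] for c in range(d))
                   for k in range(n)] for j in range(n)]
        # Numerically stable softmax over the joint (j, k) index.
        peak = max(max(row) for row in logits)
        weights = [[math.exp(x - peak) for x in row] for row in logits]
        total = sum(sum(row) for row in weights)
        # Weighted sum of elementwise products of the two value vectors.
        row_out = [0.0] * d
        for j in range(n):
            for k in range(n):
                w = weights[j][k] / total
                for c in range(d):
                    row_out[c] += w * v1[j][c] * v2[k][c]
        out.append(row_out)
    return out
```

With all-zero queries the weights are uniform, so each output is simply the average of `v1[j] * v2[k]` over all pairs, which makes a convenient sanity check. The cubic cost of this naive loop is what makes an efficient kernel matter in practice.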

[AI-6] Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

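[Quick read]: This paper addresses the security threat that prompt injection attacks pose to applications that integrate large language models (LLMs). The key to the solution is Meta SecAlign, an open-source, open-weight LLM with a built-in, effective model-level defense that reaches commercial-grade model performance. By publishing the training recipe and defense strategy, the work promotes collaborative research and progress in AI security.

Link: https://arxiv.org/abs/2507.02735
Authors: Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: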

Abstract: Prompt injection attacks pose a significant security threat to LLM-integrated applications. Model-level defenses have shown strong effectiveness, but are currently deployed into commercial-grade models in a closed-source manner. We believe open-source models are needed by the AI security community, where co-development of attacks and defenses through open research drives scientific progress in mitigation against prompt injection attacks. To this end, we develop Meta SecAlign, the first open-source and open-weight LLM with built-in model-level defense that achieves commercial-grade model performance. We provide complete details of our training recipe, which utilizes an improved version of the SOTA SecAlign defense. Evaluations on 9 utility benchmarks and 7 security benchmarks show that Meta SecAlign, despite being trained on a generic instruction-tuning dataset, confers security in unseen downstream tasks, including tool-calling and agentic web navigation, in addition to general instruction-following. Our best model – Meta-SecAlign-70B – achieves state-of-the-art robustness against prompt injection attacks and comparable utility to closed-source commercial LLMs with model-level defense.

[AI-7] Bourbaki: Self-Generated and Goal-Conditioned MDPs for Theorem Proving

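[Quick read]: This paper aims to address the challenge of reasoning with large language models (LLMs) in automated theorem proving (ATP) environments, particularly on benchmarks such as PutnamBench, which contains university-level problems requiring complex multi-step reasoning. The key to the solution is self-generated goal-conditioned MDPs (sG-MDPs), in which the agent generates and pursues subgoals based on the evolving proof state. This makes the problem more amenable to search, which is then carried out with algorithms similar to Monte Carlo Tree Search (MCTS).

Link: https://arxiv.org/abs/2507.02726
Authors: Matthieu Zimmer, Xiaotong Ji, Rasul Tutunov, Anthony Bordg, Jun Wang, Haitham Bou Ammar
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: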

Abstract:Reasoning remains a challenging task for large language models (LLMs), especially within the logically constrained environment of automated theorem proving (ATP), due to sparse rewards and the vast scale of proofs. These challenges are amplified in benchmarks like PutnamBench, which contains university-level problems requiring complex, multi-step reasoning. To address this, we introduce self-generated goal-conditioned MDPs (sG-MDPs), a new framework in which agents generate and pursue their subgoals based on the evolving proof state. Given this more structured generation of goals, the resulting problem becomes more amenable to search. We then apply Monte Carlo Tree Search (MCTS)-like algorithms to solve the sG-MDP, instantiating our approach in Bourbaki (7B), a modular system that can ensemble multiple 7B LLMs for subgoal generation and tactic synthesis. On PutnamBench, Bourbaki (7B) solves 26 problems, achieving new state-of-the-art results with models at this scale.

[AI-8] Time-critical and confidence-based abstraction dropping methods

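[Quick read]: This paper addresses the approximation error introduced by non-exact abstractions in Monte Carlo Tree Search (MCTS), which prevents convergence to the optimal action in the abstract space. The key to the solution is two new abstraction dropping schemes, OGA-IAAD and OGA-CAD, which improve MCTS performance without causing notable performance degradation. OGA-IAAD targets time-critical settings, while OGA-CAD improves MCTS performance for the same number of iterations.

Link: https://arxiv.org/abs/2507.02703
Authors: Robin Schmöcker, Lennart Kampmann, Alexander Dockhorn
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted for Publication at the IEEE Conference on Games 2025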

Abstract:One paradigm of Monte Carlo Tree Search (MCTS) improvements is to build and use state and/or action abstractions during the tree search. Non-exact abstractions, however, introduce an approximation error making convergence to the optimal action in the abstract space impossible. Hence, as proposed as a component of Elastic Monte Carlo Tree Search by Xu et al., abstraction algorithms should eventually drop the abstraction. In this paper, we propose two novel abstraction dropping schemes, namely OGA-IAAD and OGA-CAD which can yield clear performance improvements whilst being safe in the sense that the dropping never causes any notable performance degradations contrary to Xu’s dropping method. OGA-IAAD is designed for time critical settings while OGA-CAD is designed to improve the MCTS performance with the same number of iterations.

[AI-9] Detection of Disengagement from Voluntary Quizzes: An Explainable Machine Learning Approach in Higher Distance Education

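[Quick read]: This paper addresses student disengagement in online education, specifically disengagement from non-mandatory exercises. The key to the solution is to use student log data extracted from Moodle and to train and compare eight machine learning algorithms in order to predict disengaged students with high accuracy. The study also uses the SHAP method to build an explainable machine learning framework that helps educational practitioners understand the algorithm's decisions and provides a basis for timely intervention.

Link: https://arxiv.org/abs/2507.02681
Authors: Behnam Parsaeifard, Christof Imhof, Tansu Pancar, Ioan-Sorin Comsa, Martin Hlosta, Nicole Bergamin, Per Bergamin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: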

Abstract:Students disengaging from their tasks can have serious long-term consequences, including academic drop-out. This is particularly relevant for students in distance education. One way to measure the level of disengagement in distance education is to observe participation in non-mandatory exercises in different online courses. In this paper, we detect student disengagement in the non-mandatory quizzes of 42 courses in four semesters from a distance-based university. We carefully identified the most informative student log data that could be extracted and processed from Moodle. Then, eight machine learning algorithms were trained and compared to obtain the highest possible prediction accuracy. Using the SHAP method, we developed an explainable machine learning framework that allows practitioners to better understand the decisions of the trained algorithm. The experimental results show a balanced accuracy of 91%, where about 85% of disengaged students were correctly detected. On top of the highly predictive performance and explainable framework, we provide a discussion on how to design a timely intervention to minimise disengagement from voluntary tasks in online learning.
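
To see how the two reported figures fit together, the small sketch below applies the standard binary definition of balanced accuracy, the mean of sensitivity and specificity. It assumes that "about 85% of disengaged students were correctly detected" is the sensitivity (recall) of the disengaged class; the abstract does not state the specificity, so the derived value is only what those two numbers would imply under that assumption. The function name `implied_specificity` is hypothetical.

```python
def implied_specificity(balanced_accuracy: float, sensitivity: float) -> float:
    """Solve balanced_accuracy = (sensitivity + specificity) / 2 for specificity.

    Assumes the binary-classification definition of balanced accuracy.
    """
    return 2.0 * balanced_accuracy - sensitivity


# Figures from the abstract: balanced accuracy 0.91, sensitivity about 0.85.
specificity = implied_specificity(0.91, 0.85)
print(f"implied specificity: {specificity:.2f}")
```

Under these assumptions the engaged class would be recognized correctly about 97% of the time, which suggests the classifier misses more disengaged students than it falsely flags engaged ones.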

[AI-10] Think How to Think: Mitigating Overthinking with Autonomous Difficulty Cognition in Large Reasoning Models

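[Quick read]: This paper aims to address the overthinking exhibited by recent Long Reasoning Models (LRMs) on complex tasks. The core problem is that LRMs have only a limited ability to recognize task properties (such as difficulty level) before solving a problem, which leads to a one-size-fits-all reasoning process that lacks sensitivity to task difficulty and redundant structures. The key to the solution is Think-How-to-Think (TH2T), a two-stage fine-tuning strategy that introduces difficulty-hypnosis and redundancy-hypnosis to progressively improve the model's awareness of task difficulty and of redundant structures in its reasoning steps, yielding more efficient and concise reasoning.

Link: https://arxiv.org/abs/2507.02663
Authors: Yongjiang Liu, Haoxi Li, Xiaosong Ma, Jie Zhang, Song Guo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 21 pages, 18 figures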

Abstract: Recent Long Reasoning Models (LRMs) have demonstrated remarkable capabilities in handling complex reasoning tasks, but are hindered by excessive overthinking. To explore its essence, our empirical analysis reveals that LRMs are primarily limited to recognizing task properties (i.e., difficulty levels) like humans before solving the problem, leading to a one-size-fits-all reasoning process. Inspired by this, a pressing and natural question emerges: Can we bootstrap such ability to further alleviate the overthinking phenomenon in LRMs? In this paper, we propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively inspires LRMs’ difficulty cognition and redundancy cognition. First, we introduce difficulty-hypnosis in the prefixes of model outputs to intervene in the internal reasoning trajectory. Combined with a heterogeneous short and long reasoning dataset, the trained model enhances its sensitivity to task difficulty, enabling native, differentiated reasoning strategies across various tasks. Second, we further extend redundancy-hypnosis to the internal reasoning process, guiding the model to identify redundant structures within the reasoning steps and generate more concise reasoning outputs. Experiments on 7B/14B/32B models demonstrate that TH2T significantly reduces inference costs (more than 70% on easy tasks and 40% on hard tasks) while maintaining performance stability. The resulting outputs exhibit clear difficulty-aware capabilities and reduced redundancy (e.g., reflection).

[AI-11] Hey AI Generate Me a Hardware Code! Agentic AI-based Hardware Design Verification

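[Quick read]: This paper aims to address the high complexity, long duration, and error-proneness of design verification for modern integrated circuits (ICs). Traditional hardware design verification requires substantial effort and time to ensure a bug-free tape-out. The proposed solution is an agentic AI-based approach whose key is collaboration between AI agents and Human-in-the-Loop (HITL) intervention, enabling a more dynamic, iterative, and self-reflective design and verification process that achieves over 95% coverage with reduced verification time while demonstrating superior performance, adaptability, and configurability.

Link: https://arxiv.org/abs/2507.02660
Authors: Deepak Narayan Gadde, Keerthan Kopparam Radhakrishna, Vaisakh Naduvodi Viswambharan, Aman Kumar, Djones Lettnin, Wolfgang Kunz, Sebastian Simon
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: To appear at the 38th SBC/SBMicro/IEEE Symposium on Integrated Circuits and Systems Design (SBCCI), August 25-29, 2025, Manaus, BRAZIL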

Abstract: Modern Integrated Circuits (ICs) are becoming increasingly complex, and so is their development process. Hardware design verification entails a methodical and disciplined approach to the planning, development, execution, and sign-off of functionally correct hardware designs. This tedious process requires significant effort and time to ensure a bug-free tape-out. The field of Natural Language Processing has undergone a significant transformation with the advent of Large Language Models (LLMs). These powerful models, often referred to as Generative AI (GenAI), have revolutionized how machines understand and generate human language, enabling unprecedented advancements in a wide array of applications, including hardware design verification. This paper presents an agentic AI-based approach to hardware design verification, which empowers AI agents, in collaboration with Human-in-the-Loop (HITL) intervention, to engage in a more dynamic, iterative, and self-reflective process, ultimately performing end-to-end hardware design and verification. This methodology is evaluated on five open-source designs, achieving over 95% coverage with reduced verification time while demonstrating superior performance, adaptability, and configurability.

[AI-12] FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

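[Quick read]: This paper aims to address the low efficiency of pipeline parallelism in distributed inference of large language models (LLMs) at the network edge, where inference requests are sparse. The key to the solution is FlowSpec, a pipeline-parallel, tree-based speculative decoding framework that improves decoding efficiency through three core mechanisms: score-based step-wise verification that prioritizes important draft tokens, efficient draft management that maintains correct causal relationships during verification, and dynamic draft expansion strategies that supply high-quality speculative inputs. Together these techniques improve pipeline utilization and speculative efficiency.

Link: https://arxiv.org/abs/2507.02620
Authors: Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 16 pages, and the last 3 are appendix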

Abstract: Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when the inference request at the network edge is sparse, where the pipeline is typically at low utilization. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification prioritizes more important draft tokens to bring earlier accepted tokens; 2) efficient draft management to prune invalid tokens while maintaining correct causal relationship during verification; 3) dynamic draft expansion strategies to supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec on a real-world testbed with other baselines. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.36×-1.77× compared to baselines. Our code is publicly available at this https URL.
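
For readers unfamiliar with speculative decoding, the sketch below shows the generic greedy verification step on a single linear draft: the target model accepts the longest draft prefix it agrees with, then contributes one token of its own. This is a textbook simplification, not FlowSpec's tree-based, score-prioritized, pipelined verification. The names `verify_draft` and `target_next`, and the toy target model in the usage lines, are hypothetical.

```python
from typing import Callable, List, Sequence


def verify_draft(prefix: Sequence[int], draft: Sequence[int],
                 target_next: Callable[[List[int]], int]) -> List[int]:
    """Return the tokens committed after checking a draft against the target.

    target_next(context) is assumed to return the target model's greedy next
    token for the given context. The result is the accepted draft prefix
    followed by one target token: a correction at the first mismatch, or a
    bonus token if the entire draft was accepted.
    """
    context = list(prefix)
    committed: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            committed.append(expected)  # replace the first wrong draft token
            return committed
        committed.append(token)
        context.append(token)
    committed.append(target_next(context))  # whole draft accepted: bonus token
    return committed


def toy_target(context: List[int]) -> int:
    """Toy stand-in for a target model: always predicts last token + 1 mod 10."""
    return (context[-1] + 1) % 10


print(verify_draft([1, 2], [3, 4, 7, 8], toy_target))
```

In the usage line the first two draft tokens match the toy target and the third does not, so two draft tokens plus one corrected token are committed in a single verification pass. In a real system the per-token calls to `target_next` would be a single batched forward pass, which is where the speedup comes from.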

[AI-13] DynamiCare: A Dynamic Multi-Agent Framework for Interactive and Open-Ended Medical Decision-Making

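[Quick read]: This paper addresses a shortcoming of existing medical decision-making simulation frameworks: most focus on single-turn tasks and ignore the uncertain, interactive, and iterative nature of real-world diagnosis. The key to the solution is DynamiCare, a dynamic multi-agent framework built on the MIMIC-Patient dataset that simulates clinical diagnosis as a multi-round interactive loop in which specialist agents repeatedly query the patient system, integrate new information, and dynamically adapt their composition and strategy, bringing the simulation closer to real medical decision-making.

Link: https://arxiv.org/abs/2507.02616
Authors: Tianqi Shang, Weiqing He, Charles Zheng, Lingyao Li, Li Shen, Bingxin Zhao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages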

Abstract:The rise of Large Language Models (LLMs) has enabled the development of specialized AI agents with domain-specific reasoning and interaction capabilities, particularly in healthcare. While recent frameworks simulate medical decision-making, they largely focus on single-turn tasks where a doctor agent receives full case information upfront – diverging from the real-world diagnostic process, which is inherently uncertain, interactive, and iterative. In this paper, we introduce MIMIC-Patient, a structured dataset built from the MIMIC-III electronic health records (EHRs), designed to support dynamic, patient-level simulations. Building on this, we propose DynamiCare, a novel dynamic multi-agent framework that models clinical diagnosis as a multi-round, interactive loop, where a team of specialist agents iteratively queries the patient system, integrates new information, and dynamically adapts its composition and strategy. We demonstrate the feasibility and effectiveness of DynamiCare through extensive experiments, establishing the first benchmark for dynamic clinical decision-making with LLM-powered agents.

[AI-14] De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks ICML2025

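[Quick read]: This paper aims to address privacy and security issues in voice cloning (VC), in particular how effective adversarial-perturbation defenses against unauthorized VC really are. Existing methods disrupt VC by introducing adversarial perturbations, but attackers can weaken these defenses through perturbation purification. The paper proposes a new two-stage purification method: first purify the perturbed speech, then refine it using phoneme guidance so that it moves closer to the clean speech distribution. The key to the method is that phoneme guidance improves the quality of the purified speech and therefore defeats VC defenses more effectively; experiments show that it outperforms existing purification methods.

Link: https://arxiv.org/abs/2507.02606
Authors: Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted by ICML 2025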

Abstract:The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at this https URL.

[AI-15] AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models

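[Quick read]: This paper aims to address the challenges of optimizing arithmetic circuits such as adders and multipliers, core components of digital systems whose performance, power efficiency, and area footprint directly affect overall system behavior. Given the vast design space and complex physical constraints, existing methods struggle to explore high-potential design variants effectively. The proposed solution is AC-Refiner, whose key idea is to reframe arithmetic circuit synthesis as a conditional image generation task: the denoising diffusion process is carefully conditioned on target quality-of-results (QoRs) so that it consistently generates high-quality circuit designs, and the explored designs are used to fine-tune the diffusion model so that exploration focuses on the region near the Pareto frontier.

Link: https://arxiv.org/abs/2507.02598
Authors: Chenhao Xue, Kezhi Li, Jiaxing Zhang, Yi Ren, Zhengyuan Shi, Chen Zhang, Yibo Lin, Lining Zhang, Qiang Xu, Guangyu Sun
Affiliation: Unknown
Subjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: 8 pages, 12 figures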

Abstract:Arithmetic circuits, such as adders and multipliers, are fundamental components of digital systems, directly impacting the performance, power efficiency, and area footprint. However, optimizing these circuits remains challenging due to the vast design space and complex physical constraints. While recent deep learning-based approaches have shown promise, they struggle to consistently explore high-potential design variants, limiting their optimization efficiency. To address this challenge, we propose AC-Refiner, a novel arithmetic circuit optimization framework leveraging conditional diffusion models. Our key insight is to reframe arithmetic circuit synthesis as a conditional image generation task. By carefully conditioning the denoising diffusion process on target quality-of-results (QoRs), AC-Refiner consistently produces high-quality circuit designs. Furthermore, the explored designs are used to fine-tune the diffusion model, which focuses the exploration near the Pareto frontier. Experimental results demonstrate that AC-Refiner generates designs with superior Pareto optimality, outperforming state-of-the-art baselines. The performance gain is further validated by integrating AC-Refiner into practical applications.

[AI-16] Responsibility Gap and Diffusion in Sequential Decision-Making Mechanisms

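[Quick read]: This paper addresses the computational complexity of responsibility properties in collective decision-making, focusing on two important properties of responsibility: diffusion and gap. It shows that the sets of diffusion-free and gap-free decision-making mechanisms are Π₂-complete and Π₃-complete, respectively, while the intersection of these classes is Π₂-complete. The key to the approach is to use complexity theory to analyze how hard these responsibility properties are to compute and to show how they manifest across different decision-making mechanisms.

Link: https://arxiv.org/abs/2507.02582
Authors: Junli Jiang, Pavel Naumov
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: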

Abstract: Responsibility has long been a subject of study in law and philosophy. More recently, it became a focus of AI literature. The article investigates the computational complexity of two important properties of responsibility in collective decision-making: diffusion and gap. It shows that the sets of diffusion-free and gap-free decision-making mechanisms are Π₂-complete and Π₃-complete, respectively. At the same time, the intersection of these classes is Π₂-complete.

[AI-17] AI Research Agents for Machine Learning: Search Exploration and Generalization in MLE-bench

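[Quick read]: This paper aims to improve the performance of AI research agents on MLE-bench, a benchmark that simulates solving real-world machine learning problems through Kaggle competitions. It models AI research agents as search policies that iteratively modify solutions in a space of candidates using operators. By designing and systematically varying operator sets and search policies (greedy, Monte Carlo Tree Search, evolutionary), it shows that the interplay between them is critical for high performance. The key to the solution is to jointly optimize the search strategy, operator design, and evaluation methodology, advancing automated machine learning.

Link: https://arxiv.org/abs/2507.02554
Authors: Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code: this https URL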

Abstract:AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents’ performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.

[AI-18] Position: A Theory of Deep Learning Must Include Compositional Sparsity

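Quick read: This paper addresses the fundamental principles governing the learning dynamics of deep neural networks (DNNs), in particular why they succeed in high-dimensional domains. The key argument is that DNNs can exploit the compositional sparsity of the target function: most practically relevant functions can be composed from a small set of constituent functions, each depending only on a low-dimensional subset of the inputs. This property is shared by all efficiently Turing-computable functions and is therefore highly likely to be present in current learning problems. Although some theoretical results exist on approximation and generalization for compositionally sparse functions, many questions about the learnability and optimization of DNNs remain open. Clarifying the role of compositional sparsity in deep learning is essential to a comprehensive theory of artificial intelligence.

Link: https://arxiv.org/abs/2507.02550
Authors: David A. Danhofer, Davide D’Ascenzo, Rafael Dubach, Tomaso Poggio
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: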

Abstract:Overparametrized Deep Neural Networks (DNNs) have demonstrated remarkable success in a wide variety of domains too high-dimensional for classical shallow networks subject to the curse of dimensionality. However, open questions about fundamental principles, that govern the learning dynamics of DNNs, remain. In this position paper we argue that it is the ability of DNNs to exploit the compositionally sparse structure of the target function driving their success. As such, DNNs can leverage the property that most practically relevant functions can be composed from a small set of constituent functions, each of which relies only on a low-dimensional subset of all inputs. We show that this property is shared by all efficiently Turing-computable functions and is therefore highly likely present in all current learning problems. While some promising theoretical insights on questions concerned with approximation and generalization exist in the setting of compositionally sparse functions, several important questions on the learnability and optimization of DNNs remain. Completing the picture of the role of compositional sparsity in deep learning is essential to a comprehensive theory of artificial, and even general, intelligence.

[AI-19] Clarifying Before Reasoning: A Coq Prover with Structural Context

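Quick read: This paper investigates whether improving task clarity can enhance the reasoning ability of large language models (LLMs), focusing on theorem proving in Coq. The key to the approach is a concept-level metric for evaluating task clarity, together with structured semantic context added to the standard input, which substantially raises the clarity score. The method also uses selective concept unfolding and a Planner-Executor architecture to raise the proof success rate.

Link: https://arxiv.org/abs/2507.02541
Authors: Yanzhen Lu, Hanbin Yang, Xiaodie Wang, Ge Zhang, Biao Li, Chenxu Fu, Chao Li, Yang Yuan, Andrew Chi-Chih Yao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: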

Abstract:In this work, we investigate whether improving task clarity can enhance reasoning ability of large language models, focusing on theorem proving in Coq. We introduce a concept-level metric to evaluate task clarity and show that adding structured semantic context to the standard input used by modern LLMs leads to a 1.85× improvement in clarity score (44.5% → 82.3%). Using the general-purpose model DeepSeek-V3, our approach leads to a 2.1× improvement in proof success (21.8% → 45.8%) and outperforms the previous state-of-the-art Graph2Tac (33.2%). We evaluate this on 1,386 theorems randomly sampled from 15 standard Coq packages, following the same evaluation protocol as Graph2Tac. Furthermore, fine-tuning smaller models on our structured data can achieve even higher performance (48.6%). Our method uses selective concept unfolding to enrich task descriptions, and employs a Planner–Executor architecture. These findings highlight the value of structured task representations in bridging the gap between understanding and reasoning.

[AI-20] Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue

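Quick read: This paper addresses the lack of sufficient empathy in conversational agents, in particular how to keep responses empathetic and coherent when generating emotionally rich interactions. The approach uses large language models (LLMs) to extend a small dataset of empathic behavior manually crafted by an expert, and analyzes the emotional progression of the generated dialogues using both sentiment analysis (via VADER) and expert assessments. The results show that although the generated conversations match the intended emotional structure, human evaluation reveals differences in perceived empathy and coherence. Emotion modeling therefore requires qualitative depth as well as structural alignment, which highlights the importance of combining automated and human-centered methods.

Link: https://arxiv.org/abs/2507.02537
Authors: Paulo Ricardo Knob, Leonardo Scholler, Juliano Rigatti, Soraia Raupp Musse
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: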

Abstract:Conversational agents have made significant progress since ELIZA, expanding their role across various domains, including healthcare, education, and customer service. As these agents become increasingly integrated into daily human interactions, the need for emotional intelligence, particularly empathetic listening, becomes increasingly essential. In this study, we explore how Large Language Models (LLMs) respond when tasked with generating emotionally rich interactions. Starting from a small dataset manually crafted by an expert to reflect empathic behavior, we extended the conversations using two LLMs: ChatGPT and Gemini. We analyzed the emotional progression of the dialogues using both sentiment analysis (via VADER) and expert assessments. While the generated conversations often mirrored the intended emotional structure, human evaluation revealed important differences in the perceived empathy and coherence of the responses. These findings suggest that emotion modeling in dialogues requires not only structural alignment in the expressed emotions but also qualitative depth, highlighting the importance of combining automated and human-centered methods in the development of emotionally competent agents.

[AI-21] Continual Gradient Low-Rank Projection Fine-Tuning for LLMs ACL2025

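Quick read: This paper addresses the trade-off between efficiency and expressiveness in continual fine-tuning of large language models (LLMs). Methods such as Low-Rank Adaptation (LoRA) are efficient, but their low-rank nature and reliance on explicit parameter constraints limit the model's ability to learn new tasks and transfer knowledge. The proposed GORP (Gradient LOw Rank Projection) is a training strategy for continual learning whose key idea is to combine full and low-rank parameters and update them jointly within a unified low-rank gradient subspace, expanding the optimization space while preserving efficiency and mitigating catastrophic forgetting.

Link: https://arxiv.org/abs/2507.02503
Authors: Chenxu Wang, Yilin Lyu, Zicheng Sun, Liping Jing
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures, accepted by ACL 2025 main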

Abstract:Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model’s ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel training strategy that overcomes these limitations by synergistically combining full and low-rank parameters and jointly updating within a unified low-rank gradient subspace. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting. Extensive experiments on continual learning benchmarks demonstrate GORP’s superior performance compared to existing state-of-the-art approaches. Code is available at this https URL.

[AI-22] The Gauss-Markov Adjunction: Categorical Semantics of Residuals in Supervised Learning

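Quick read: This paper addresses how to improve the interpretability and intelligibility of machine learning models, in response to the demand for Explicability as an AI principle and to promote better social implementation of AI. The key to the approach is reformulating machine learning models through category theory to build a semantic framework for structuring and understanding AI systems. The work focuses on the multiple linear regression model: by defining two concrete categories for parameters and data, with an adjoint pair of functors between them, it gives a categorical formulation of supervised learning and exposes the structural interplay between parameters and residuals. The core structure is named the Gauss-Markov Adjunction. In this framework, the relation between the parameter estimate and the minimum residual follows from the preservation of limits by the right adjoint functor. The formulation is positioned as an instance of extended denotational semantics for supervised learning, bringing a semantic perspective from theoretical computer science as a formal foundation for AI explicability.

Link: https://arxiv.org/abs/2507.02442
Authors: Moto Kamiura
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Category Theory (math.CT); Methodology (stat.ME); Machine Learning (stat.ML)
Comments: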

Abstract:Enhancing the intelligibility and interpretability of machine learning is a crucial task in responding to the demand for Explicability as an AI principle, and in promoting the better social implementation of AI. The aim of our research is to contribute to this improvement by reformulating machine learning models through the lens of category theory, thereby developing a semantic framework for structuring and understanding AI systems. Our categorical modeling in this paper clarifies and formalizes the structural interplay between residuals and parameters in supervised learning. The present paper focuses on the multiple linear regression model, which represents the most basic form of supervised learning. By defining two concrete categories corresponding to parameters and data, along with an adjoint pair of functors between them, we introduce our categorical formulation of supervised learning. We show that the essential structure of this framework is captured by what we call the Gauss-Markov Adjunction. Within this setting, the dual flow of information can be explicitly described as a correspondence between variations in parameters and residuals. The ordinary least squares estimator for the parameters and the minimum residual are related via the preservation of limits by the right adjoint functor. Furthermore, we position this formulation as an instance of extended denotational semantics for supervised learning, and propose applying a semantic perspective developed in theoretical computer science as a formal foundation for Explicability in AI.
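
For readers who want the classical objects the abstract refers to, the ordinary least squares estimator and its residual can be written as below. The notation (design matrix $X$, response $y$, parameter $\beta$, residual $\hat{r}$) is an assumption of this note rather than the paper's, and the categorical construction itself is not reproduced here.

```latex
% Assumed notation: X is an n-by-p design matrix with full column rank,
% y is the response vector, beta is the parameter vector.
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert^{2} = (X^{\top} X)^{-1} X^{\top} y,
\qquad
\hat{r} = y - X\hat{\beta},
\qquad
X^{\top} \hat{r} = 0 .
```

The last identity is the normal equation: the minimum residual is orthogonal to the columns of $X$, which is the parameter-residual correspondence in its standard linear-algebra form.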

[AI-23] CyberRAG: An agentic RAG cyber attack classification and reporting tool

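Quick read: This paper addresses the flood of alerts produced by intrusion detection and prevention systems (IDS/IPS) in large enterprises, which security analysts cannot process efficiently, as well as the high false-positive rates and weak justification of predictions in conventional machine-learning detectors and standard single-pass Retrieval-Augmented Generation (RAG) pipelines. The key to the solution is CyberRAG, a modular, agent-based RAG framework in which a central LLM agent orchestrates a pool of fine-tuned specialized classifiers, tool adapters, and an iterative retrieval-and-reasoning loop to deliver real-time classification, explanation, and structured reporting of cyber-attacks, improving detection accuracy and interpretability.

Link: https://arxiv.org/abs/2507.02424
Authors: Francesco Blefari, Cristian Cosentino, Francesco Aurelio Pironti, Angelo Furfaro, Fabrizio Marozzo
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: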

Abstract:Intrusion Detection and Prevention Systems (IDS/IPS) in large enterprises can generate hundreds of thousands of alerts per hour, overwhelming security analysts with logs that demand deep, rapidly evolving domain expertise. Conventional machine-learning detectors trim the alert volume but still yield high false-positive rates, while standard single-pass Retrieval-Augmented Generation (RAG) pipelines often retrieve irrelevant context and fail to justify their predictions. To overcome these shortcomings, we present CyberRAG, a modular, agent-based RAG framework that delivers real-time classification, explanation, and structured reporting for cyber-attacks. A central LLM agent orchestrates (i) a pool of fine-tuned specialized classifiers, each tailored to a distinct attack family; (ii) tool adapters for enrichment and alerting; and (iii) an iterative retrieval-and-reason loop that continuously queries a domain-specific knowledge base until the evidence is both relevant and self-consistent. Unlike traditional RAG systems, CyberRAG embraces an agentic design that enables dynamic control flow and adaptive reasoning. This agent-centric architecture refines its threat labels and natural-language justifications autonomously, reducing false positives and enhancing interpretability. The framework is fully extensible: new attack types can be supported by simply adding a classifier without retraining the core agent. CyberRAG has been evaluated achieving over 94% accuracy per class and pushing final classification accuracy to 94.92% through semantic orchestration. Generated explanations score up to 0.94 in BERTScore and 4.9/5 in GPT-4-based expert evaluation. These results show that agentic, specialist-oriented RAG can pair high detection accuracy with trustworthy, SOC-ready prose, offering a practical and scalable path toward semi-autonomous cyber-defence workflows.

[AI-24] S2FGL: Spatial Spectral Federated Graph Learning

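Quick read: This paper addresses label signal disruption and spectral client drift caused by subgraph federated learning (subgraph-FL) in Federated Graph Learning (FGL). From a spatial perspective, subgraph-FL introduces edge disconnections between clients, which disrupt label signals and degrade the class knowledge of the global graph neural network (GNN). From a spectral perspective, spectral heterogeneity causes inconsistent signal frequencies across subgraphs, so local GNNs overfit their local signal propagation schemes; the resulting spectral client drift undermines global generalizability. The key to the solution is a global knowledge repository that mitigates label signal disruption and a frequency alignment that addresses spectral client drift; together they form the proposed framework, S2FGL.

Link: https://arxiv.org/abs/2507.02409
Authors: Zihan Tan, Suyuan Huang, Guancheng Wan, Wenke Huang, He Li, Mang Ye
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: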

Abstract:Federated Graph Learning (FGL) combines the privacy-preserving capabilities of federated learning (FL) with the strong graph modeling capability of Graph Neural Networks (GNNs). Current research addresses subgraph-FL only from the structural perspective, neglecting the propagation of graph signals on spatial and spectral domains of the structure. From a spatial perspective, subgraph-FL introduces edge disconnections between clients, leading to disruptions in label signals and a degradation in the class knowledge of the global GNN. From a spectral perspective, spectral heterogeneity causes inconsistencies in signal frequencies across subgraphs, which makes local GNNs overfit the local signal propagation schemes. As a result, spectral client drifts occur, undermining global generalizability. To tackle the challenges, we propose a global knowledge repository to mitigate label signal disruption and a frequency alignment to address spectral client drifts. The combination of spatial and spectral strategies forms our framework S2FGL. Extensive experiments on multiple datasets demonstrate the superiority of S2FGL. The code is available at this https URL.

[AI-25] Evaluating Language Models For Threat Detection in IoT Security Logs

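Quick read: This paper addresses anomaly detection and mitigation recommendation for Internet of Things (IoT) security logs, using fine-tuned large language models (LLMs) to improve threat detection accuracy and provide targeted mitigations. The key to the approach is comparing three open-source LLMs on binary and multiclass anomaly detection under zero-shot prompting, few-shot prompting, and fine-tuning. Detected threats are also mapped to MITRE CAPEC, and a set of IoT-specific mitigation actions is defined, so that the models provide combined detection and recommendation guidance.

Link: https://arxiv.org/abs/2507.02390
Authors: Jorge J. Tejero-Fernández, Alfonso Sánchez-Macián
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: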

Abstract:Log analysis is a relevant research field in cybersecurity, as logs can provide a source of information for the detection of threats to networks and systems. This paper presents a pipeline to use fine-tuned Large Language Models (LLMs) for anomaly detection and mitigation recommendation using IoT security logs. Utilizing classical machine learning classifiers as a baseline, three open-source LLMs are compared for binary and multiclass anomaly detection, with three strategies: zero-shot prompting, few-shot prompting, and fine-tuning using an IoT dataset. LLMs give better results on multi-class attack classification than the corresponding baseline models. By mapping detected threats to MITRE CAPEC, defining a set of IoT-specific mitigation actions, and fine-tuning the models with those actions, the models are able to provide combined detection and recommendation guidance.

[AI-26] An AI-native experimental laboratory for autonomous biomolecular engineering

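Quick read: This paper addresses the dependence on experts and the resource constraints of traditional scientific research, aiming to automate highly complex scientific experiments, particularly in fields such as biomolecular engineering. The key to the solution is an autonomous laboratory platform driven by artificial intelligence (AI). Through the co-design of models, experiments, and instruments, the platform autonomously manages experimental workflows, formulates optimization strategies, and serves multiple user requests concurrently, supporting complex multi-objective experiments across diverse instruments.

Link: https://arxiv.org/abs/2507.02379
Authors: Mingyu Wu, Zhaoguo Wang, Jiabin Wang, Zhiyuan Dong, Jingkai Yang, Qingting Li, Tianyu Huang, Lei Zhao, Mingqiang Li, Fei Wang, Chunhai Fan, Haibo Chen
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Comments: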

Abstract:Autonomous scientific research, capable of independently conducting complex experiments and serving non-specialists, represents a long-held aspiration. Achieving it requires a fundamental paradigm shift driven by artificial intelligence (AI). While autonomous experimental systems are emerging, they remain confined to areas featuring singular objectives and well-defined, simple experimental workflows, such as chemical synthesis and catalysis. We present an AI-native autonomous laboratory, targeting highly complex scientific experiments for applications like autonomous biomolecular engineering. This system autonomously manages instrumentation, formulates experiment-specific procedures and optimization heuristics, and concurrently serves multiple user requests. Founded on a co-design philosophy of models, experiments, and instruments, the platform supports the co-evolution of AI models and the automation system. This establishes an end-to-end, multi-user autonomous laboratory that handles complex, multi-objective experiments across diverse instrumentation. Our autonomous laboratory supports fundamental nucleic acid functions, including synthesis, transcription, amplification, and sequencing. It also enables applications in fields such as disease diagnostics, drug development, and information storage. Without human intervention, it autonomously optimizes experimental performance to match state-of-the-art results achieved by human scientists. In multi-user scenarios, the platform significantly improves instrument utilization and experimental efficiency. This platform paves the way for advanced biomaterials research to overcome dependencies on experts and resource barriers, establishing a blueprint for science-as-a-service at scale.

[AI-27] VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software

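Quick read: This paper addresses the inability to audit whether the data party's inference software executes correctly in Vertical Federated Learning (VFL). Existing work lacks a mechanism to verify that the data party's inference software runs as expected in cross-silo collaboration. The key to the solution is the VeFIA framework, in which the task party uses inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the data party's computation results, detecting abnormal inference without leaking the data party's private data or adding latency.

Link: https://arxiv.org/abs/2507.02376
Authors: Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: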

Abstract:Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants’ data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party’s inference software is executed as expected during large-scale inference without leaking the data privacy of the data party or introducing additional latency to the inference system. The core of VeFIA is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party’s computation results. VeFIA guarantees that, as long as the abnormal inference exceeds 5.4%, the task party can detect execution anomalies in the inference software with a probability of 99.99%, without incurring any additional online inference latency. VeFIA’s random sampling validation achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting abnormal inference. To the best of our knowledge, this is the first paper to discuss the correctness of inference software execution in VFL.
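
The abstract does not spell out how the 5.4% / 99.99% guarantee is derived. One simple model that yields a guarantee of this shape, assumed here purely for illustration, is independent random sampling: if a fraction p of inferences is abnormal and n inferences are audited independently, at least one abnormal inference is caught with probability 1 - (1 - p)^n. The function names below are hypothetical, and the paper's actual sampling scheme may differ.

```python
def detection_probability(anomaly_rate, n_samples):
    """Chance that at least one of n independently sampled inferences is abnormal."""
    return 1.0 - (1.0 - anomaly_rate) ** n_samples

def min_samples(anomaly_rate, target):
    """Smallest sample count whose detection probability reaches the target."""
    n = 0
    while detection_probability(anomaly_rate, n) < target:
        n += 1
    return n

if __name__ == "__main__":
    # Figures taken from the abstract: 5.4% abnormal inference, 99.99% detection.
    print(min_samples(0.054, 0.9999))
```

Under this assumed model, 166 audited inferences are enough to reach the 99.99% level at a 5.4% anomaly rate; the required sample size grows as the anomaly rate shrinks.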

[AI-28] Offline Reinforcement Learning with Penalized Action Noise Injection

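Quick read: This paper addresses the limited generalization of offline reinforcement learning (RL), which learns from a fixed dataset only. The key to the solution is Penalized Action Noise Injection (PANI), which injects noise into actions so that they cover the entire action space and applies a penalty according to the amount of noise injected. The method is inspired by how diffusion models work in offline RL, and in theory it amounts to solving a modified Markov Decision Process (MDP) called the noisy action MDP.

Link: https://arxiv.org/abs/2507.02356
Authors: JunHyeok Oh, Byung-Jun Lee
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: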

Abstract:Offline reinforcement learning (RL) optimizes a policy using only a fixed dataset, making it a practical approach in scenarios where interaction with the environment is costly. Due to this limitation, generalization ability is key to improving the performance of offline RL algorithms, as demonstrated by recent successes of offline RL with diffusion models. However, it remains questionable whether such diffusion models are necessary for highly performing offline RL algorithms, given their significant computational requirements during inference. In this paper, we propose Penalized Action Noise Injection (PANI), a method that simply enhances offline learning by utilizing noise-injected actions to cover the entire action space, while penalizing according to the amount of noise injected. This approach is inspired by how diffusion models have worked in offline RL algorithms. We provide a theoretical foundation for this method, showing that offline RL algorithms with such noise-injected actions solve a modified Markov Decision Process (MDP), which we call the noisy action MDP. PANI is compatible with a wide range of existing off-policy and offline RL algorithms, and despite its simplicity, it demonstrates significant performance improvements across various benchmarks.
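
The idea of "inject noise into dataset actions, then penalize by the amount of noise" can be sketched as a data transformation applied before a standard off-policy update. Everything specific below is an assumption for illustration and not taken from the paper: Gaussian noise, a uniformly sampled noise scale, actions clipped to [-1, 1], a squared-distance penalty subtracted from the reward, and the function name itself.

```python
import random

def noisy_penalized_sample(action, reward, penalty_weight=1.0, max_sigma=1.0, rng=random):
    """Return a noise-injected action and a reward reduced in proportion to the injected noise."""
    # Draw a noise scale so that samples range from "almost the dataset action" to "far away".
    sigma = rng.uniform(0.0, max_sigma)
    # Perturb each action dimension and clip to the assumed action bounds [-1, 1].
    noisy = [min(1.0, max(-1.0, a + sigma * rng.gauss(0.0, 1.0))) for a in action]
    # Penalize by the squared distance between the noisy action and the dataset action.
    sq_dist = sum((n - a) ** 2 for n, a in zip(noisy, action))
    return noisy, reward - penalty_weight * sq_dist

if __name__ == "__main__":
    rng = random.Random(0)
    print(noisy_penalized_sample([0.2, -0.5], reward=1.0, rng=rng))
```

A critic trained on such samples sees actions across the whole action space, with values pushed down the further an action lies from what the dataset actually contains.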

[AI-29] OMS: On-the-fly Multi-Objective Self-Reflective Ad Keyword Generation via LLM Agent

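Quick read: This paper addresses keyword decisions in ad campaigns: how to use generative AI for automated, multi-objective, high-quality keyword selection. Existing LLM-based methods face three limitations: reliance on large-scale query-keyword pair data, lack of online multi-objective performance monitoring and optimization, and weak quality control in keyword selection. The key to the solution is the OMS framework, which is On-the-fly (requires no training data, monitors online performance, and adapts accordingly), Multi-objective (uses agentic reasoning to optimize keywords against multiple performance metrics), and Self-reflective (agentically evaluates keyword quality). Experiments on benchmarks and real-world ad campaigns show that OMS outperforms existing methods.

Link: https://arxiv.org/abs/2507.02353
Authors: Bowen Chen, Zhao Wang, Shingo Takamatsu
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: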

Abstract:Keyword decision in Sponsored Search Advertising is critical to the success of ad campaigns. While LLM-based methods offer automated keyword generation, they face three major limitations: reliance on large-scale query-keyword pair data, lack of online multi-objective performance monitoring and optimization, and weak quality control in keyword selection. These issues hinder the agentic use of LLMs in fully automating keyword decisions by monitoring and reasoning over key performance indicators such as impressions, clicks, conversions, and CTA effectiveness. To overcome these challenges, we propose OMS, a keyword generation framework that is On-the-fly (requires no training data, monitors online performance, and adapts accordingly), Multi-objective (employs agentic reasoning to optimize keywords based on multiple performance metrics), and Self-reflective (agentically evaluates keyword quality). Experiments on benchmarks and real-world ad campaigns show that OMS outperforms existing methods; ablation and human evaluations confirm the effectiveness of each component and the quality of generated keywords.

[AI-30] DeltaSHAP: Explaining Prediction Evolutions in Online Patient Monitoring with Shapley Values ICML2025

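Quick read: This paper addresses the challenge of explaining clinical time series in online patient monitoring systems: existing explainable AI (XAI) methods do not meet the clinical need to identify what drives changes in patient risk in time for intervention. The key to the solution is the DeltaSHAP algorithm, which adapts Shapley values to temporal settings to capture feature coalition effects and attributes prediction changes using only the feature combinations actually observed. It explains changes between consecutive predictions, reports both the magnitude and the direction of feature attributions, and remains efficient enough for real-time clinical use.

Link: https://arxiv.org/abs/2507.02342
Authors: Changhun Kim, Yechan Mun, Sangchul Hahn, Eunho Yang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to ICML 2025 Workshop on Actionable Interpretability. Code is available at this https URL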

Abstract:This study proposes DeltaSHAP, a novel explainable artificial intelligence (XAI) algorithm specifically designed for online patient monitoring systems. In clinical environments, discovering the causes driving patient risk evolution is critical for timely intervention, yet existing XAI methods fail to address the unique requirements of clinical time series explanation tasks. To this end, DeltaSHAP addresses three key clinical needs: explaining the changes in the consecutive predictions rather than isolated prediction scores, providing both magnitude and direction of feature attributions, and delivering these insights in real time. By adapting Shapley values to temporal settings, our approach accurately captures feature coalition effects. It further attributes prediction changes using only the actually observed feature combinations, making it efficient and practical for time-sensitive clinical applications. We also introduce new evaluation metrics to evaluate the faithfulness of the attributions for online time series, and demonstrate through experiments on online patient monitoring tasks that DeltaSHAP outperforms state-of-the-art XAI methods in both explanation quality as 62% and computational efficiency as 33% time reduction on the MIMIC-III decompensation benchmark. We release our code at this https URL.
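
As background for "explaining the change between consecutive predictions", the sketch below computes textbook exact Shapley values for the difference between a model's output at two time steps. It is not DeltaSHAP itself, which the abstract describes as using only actually observed feature combinations for efficiency; the model `f` and the feature vectors in the usage example are hypothetical, and exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial

def shapley_delta(f, x_prev, x_curr):
    """Exact Shapley attribution of f(x_curr) - f(x_prev) to each feature's change."""
    n = len(x_prev)

    def value(subset):
        # Features in the subset take their current value; the rest keep the previous one.
        mixed = [x_curr[i] if i in subset else x_prev[i] for i in range(n)]
        return f(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

if __name__ == "__main__":
    # Hypothetical risk model with one additive feature and one interaction term.
    def risk(x):
        return 2 * x[0] + x[1] * x[2]
    print(shapley_delta(risk, x_prev=[0, 1, 1], x_curr=[1, 2, 3]))
```

The attributions are signed, so they give both magnitude and direction, and they sum to the total change in the prediction (the efficiency property of Shapley values).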

[AI-31] ClustOpt: A Clustering-based Approach for Representing and Visualizing the Search Dynamics of Numerical Metaheuristic Optimization Algorithms

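Quick read: This paper addresses the shortcomings of traditional visualization techniques in depicting the structural dynamics of the search process of numerical metaheuristic optimization algorithms, especially in high-dimensional or complex solution spaces. The key to the solution is a new representation and visualization method that clusters the solution candidates explored by the algorithm and tracks how cluster memberships change across iterations, giving a dynamic and interpretable view of the search. Two metrics are also introduced: algorithm stability, which quantifies the consistency of search trajectories across runs of one algorithm, and algorithm similarity, which quantifies the similarity between different algorithms.

Link: https://arxiv.org/abs/2507.02337
Authors: Gjorgjina Cenikj, Gašper Petelin, Tome Eftimov
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: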

Abstract:Understanding the behavior of numerical metaheuristic optimization algorithms is critical for advancing their development and application. Traditional visualization techniques, such as convergence plots, trajectory mapping, and fitness landscape analysis, often fall short in illustrating the structural dynamics of the search process, especially in high-dimensional or complex solution spaces. To address this, we propose a novel representation and visualization methodology that clusters solution candidates explored by the algorithm and tracks the evolution of cluster memberships across iterations, offering a dynamic and interpretable view of the search process. Additionally, we introduce two metrics, algorithm stability and algorithm similarity, to quantify the consistency of search trajectories across runs of an individual algorithm and the similarity between different algorithms, respectively. We apply this methodology to a set of ten numerical metaheuristic algorithms, revealing insights into their stability and comparative behaviors, thereby providing a deeper understanding of their search dynamics.

[AI-32] Tracing the Interactions of Modular CMA-ES Configurations Across Problem Landscapes

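Quick read: This paper addresses the interpretability of the interplay between algorithm configurations and problem characteristics, aiming to explain why different configurations of the same algorithm perform differently. The key to the approach is the concept of algorithm footprints: performance footprints are computed for six modular CMA-ES (modCMA) variants on the BBOB benchmark suite and analyzed across dimensionality settings, identifying the problem features that influence performance and making configuration choices more interpretable and better guided.

Link: https://arxiv.org/abs/2507.02331
Authors: Ana Nikolikj, Mario Andrés Muñoz, Eva Tuba, Tome Eftimov
Affiliations: unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: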

Abstract:This paper leverages the recently introduced concept of algorithm footprints to investigate the interplay between algorithm configurations and problem characteristics. Performance footprints are calculated for six modular variants of the CMA-ES algorithm (modCMA), evaluated on 24 benchmark problems from the BBOB suite, across two dimensionality settings: 5-dimensional and 30-dimensional. These footprints provide insights into why different configurations of the same algorithm exhibit varying performance and identify the problem features influencing these outcomes. Our analysis uncovers shared behavioral patterns across configurations due to common interactions with problem properties, as well as distinct behaviors on the same problem driven by differing problem features. The results demonstrate the effectiveness of algorithm footprints in enhancing interpretability and guiding configuration choices.

[AI-33] Iterated belief revision: from postulates to abilities

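Quick read: This paper addresses the lack of analysis of existing approaches in belief revision: the field produces many new proposals but few in-depth analyses of the existing ones. The key to the approach is to identify and describe the abilities of different belief revision mechanisms, such as being plastic, equating, or dogmatic, rather than relying only on constraining postulates. By proving which abilities each revision mechanism possesses, the paper highlights their flexibility and applicability in reaching different belief states.

Link: https://arxiv.org/abs/2507.02319
Authors: Paolo Liberatore
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: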

Abstract:The belief revision field is opulent in new proposals and indigent in analyses of existing approaches. Much work hinges on postulates, employed as syntactic characterizations: some revision mechanism is equivalent to some properties. Postulates constrain specific revision instances: certain revisions update certain beliefs in a certain way. As an example, if the revision is consistent with the current beliefs, it is incorporated with no other change. A postulate like this tells what revisions must do and neglects what they can do. Can they reach a certain state of beliefs? Can they reach all possible states of beliefs? Can they reach all possible states of beliefs from no previous belief? Can they reach a dogmatic state of beliefs, where everything not believed is impossible? Can they make two conditions equally believed? An application where every possible state of beliefs is sensible requires each state of beliefs to be reachable. An application where conditions may be equally believed requires such a belief state to be reachable. An application where beliefs may become dogmatic requires a way to make them dogmatic. Such doxastic states need to be reached in one way or another. Not in a specific way, as dictated by a typical belief revision postulate. This is an ability, not a constraint: the ability of being plastic, equating, dogmatic. Amnesic, correcting, believer, damascan, learnable are other abilities. Each revision mechanism owns some of these abilities and lacks the others: lexicographic, natural, restrained, very radical, full meet, radical, severe, moderate severe, deep severe, plain severe and deep severe revisions, each of these revisions is proved to possess certain abilities.

[AI-34] Synthetic Heuristic Evaluation: A Comparison between AI- and Human-Powered Usability Evaluation

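Quick read: This paper addresses the high cost of traditional usability evaluation in human-centered design, which requires expert time and user compensation. The key to the solution is to use the ability of multimodal large language models (LLMs) to analyze images and generate design feedback, enabling synthetic heuristic evaluation. Across two apps, the method identified more usability issues than the experienced human evaluators it was compared with, and it showed consistent performance and an advantage in detecting layout issues.

Link: https://arxiv.org/abs/2507.02306
Authors: Ruican Zhong, David W. McDonald, Gary Hsieh
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: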

Abstract:Usability evaluation is crucial in human-centered design but can be costly, requiring expert time and user compensation. In this work, we developed a method for synthetic heuristic evaluation using multimodal LLMs’ ability to analyze images and provide design feedback. Comparing our synthetic evaluations to those by experienced UX practitioners across two apps, we found our evaluation identified 73% and 77% of usability issues, which exceeded the performance of 5 experienced human evaluators (57% and 63%). Compared to human evaluators, the synthetic evaluation’s performance maintained consistent performance across tasks and excelled in detecting layout issues, highlighting potential attentional and perceptual strengths of synthetic evaluation. However, synthetic evaluation struggled with recognizing some UI components and design conventions, as well as identifying across screen violations. Additionally, testing synthetic evaluations over time and accounts revealed stable performance. Overall, our work highlights the performance differences between human and LLM-driven evaluations, informing the design of synthetic heuristic evaluations.

[AI-35] Knowledge Graph-Based Explainable and Generalized Zero-Shot Semantic Communications

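Quick read: This paper addresses the lack of interpretability and generalization of data-driven semantic communication when facing unseen data. The key to the solution is a knowledge graph-enhanced zero-shot semantic communication (KGZS-SC) network, which uses structured semantic information from a knowledge graph-based semantic knowledge base (KG-SKB) to provide generalized semantic representations and reasoning for unseen cases, improving the adaptability and efficiency of the communication system.

Link: https://arxiv.org/abs/2507.02291
Authors: Zhaoyu Zhang, Lingyi Wang, Wei Wu, Fuhui Zhou, Qihui Wu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: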

Abstract:Data-driven semantic communication is based on superficial statistical patterns, thereby lacking interpretability and generalization, especially for applications with the presence of unseen data. To address these challenges, we propose a novel knowledge graph-enhanced zero-shot semantic communication (KGZS-SC) network. Guided by the structured semantic information from a knowledge graph-based semantic knowledge base (KG-SKB), our scheme provides generalized semantic representations and enables reasoning for unseen cases. Specifically, the KG-SKB aligns the semantic features in a shared category semantics embedding space and enhances the generalization ability of the transmitter through aligned semantic features, thus reducing communication overhead by selectively transmitting compact visual semantics. At the receiver, zero-shot learning (ZSL) is leveraged to enable direct classification for unseen cases without the demand for retraining or additional computational overhead, thereby enhancing the adaptability and efficiency of the classification process in dynamic or resource-constrained environments. The simulation results conducted on the APY datasets show that the proposed KGZS-SC network exhibits robust generalization and significantly outperforms existing SC frameworks in classifying unseen categories across a range of SNR levels.
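
Zero-shot classification at the receiver is commonly done by comparing a received feature vector with class embeddings in a shared space, so that a class never seen in training can still be chosen as long as its embedding is available. The sketch below shows that generic nearest-embedding rule under cosine similarity. It is not the KGZS-SC network: the class names, the three-dimensional embeddings, and the function names are hypothetical, and in the paper's setting the embeddings would come from the knowledge base rather than being written by hand.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(feature, class_embeddings):
    """Pick the class whose embedding is most similar to the received feature."""
    return max(class_embeddings, key=lambda name: cosine(feature, class_embeddings[name]))

if __name__ == "__main__":
    # Hypothetical attribute-style embeddings; "zebra" stands in for an unseen class.
    class_embeddings = {
        "horse": [1.0, 0.0, 0.2],
        "zebra": [0.9, 1.0, 0.1],
        "car": [0.0, 0.0, 1.0],
    }
    received = [0.8, 0.9, 0.0]
    print(zero_shot_classify(received, class_embeddings))
```

No retraining is involved when a new class appears: adding one more entry to `class_embeddings` is enough for the rule to consider it.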

[AI-36] Content filtering methods for music recommendation: A review

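[Quick read]: This paper addresses the effectiveness of recommendation systems on music streaming platforms when user interaction data is sparse, which is typical for music because most users never listen to the vast majority of tracks. The key solution is content filtering, used to mitigate the biases inherent in collaborative filtering. Specifically, the review covers lyrical analysis with Large Language Models (LLMs) and audio signal processing techniques for song classification, which complement and refine recommendation strategies based on user behavior.

Link: https://arxiv.org/abs/2507.02282
Authors: Terence Zeng, Abhishek K. Umrawal
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages and 9 figures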

Abstract:Recommendation systems have become essential in modern music streaming platforms, shaping how users discover and engage with songs. One common approach in recommendation systems is collaborative filtering, which suggests content based on the preferences of users with similar listening patterns to the target user. However, this method is less effective on media where interactions are sparse. Music is one such medium, since the average user of a music streaming service will never listen to the vast majority of tracks. Due to this sparsity, there are several challenges that have to be addressed with other methods. This review examines the current state of research in addressing these challenges, with an emphasis on the role of content filtering in mitigating biases inherent in collaborative filtering approaches. We explore various methods of song classification for content filtering, including lyrical analysis using Large Language Models (LLMs) and audio signal processing techniques. Additionally, we discuss the potential conflicts between these different analysis methods and propose avenues for resolving such discrepancies.
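
To make the idea of content filtering concrete, here is a minimal sketch that ranks tracks by the overlap between their content tags and the tags of tracks a user already likes, so a track with no listening history can still be recommended. The tags, track names, and helper functions are hypothetical; the review itself discusses lyric- and audio-based classifiers for producing such descriptors.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def recommend(liked_tags, catalog, k=2):
    """Rank catalog tracks by tag overlap with the user's liked tags."""
    ranked = sorted(catalog.items(),
                    key=lambda item: jaccard(liked_tags, item[1]),
                    reverse=True)
    return [track for track, _ in ranked[:k]]

# Hypothetical catalog: tags could come from lyric or audio analysis.
catalog = {
    "track_a": {"folk", "acoustic", "melancholic"},
    "track_b": {"metal", "fast", "aggressive"},
    "track_c": {"folk", "acoustic", "upbeat"},
}
liked = {"folk", "acoustic", "melancholic", "slow"}
print(recommend(liked, catalog))
```

The ranking uses no interaction counts at all, which is why this kind of method is unaffected by the sparsity problem described for collaborative filtering.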

[AI-37] Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

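[Quick read]: This paper addresses the scalability and reliability bottleneck in data generation and evaluation that hampers progress on the planning and reasoning capabilities of large language models (LLMs). The key to the solution is NL2FLOW, a fully automated system that parametrically generates planning problems expressed in natural language, a structured intermediate representation, and formal PDDL, and that rigorously evaluates the quality of generated plans. The system is demonstrated by generating 2296 planning problems in the automated workflow generation domain and evaluating multiple open-source, instruction-tuned LLMs.

Link: https://arxiv.org/abs/2507.02253
Authors: Jungkoo Kang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures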

Abstract:Progress in enhancing large language model (LLM) planning and reasoning capabilities is significantly hampered by the bottleneck of scalable, reliable data generation and evaluation. To overcome this, I introduce NL2FLOW, a fully automated system for parametrically generating planning problems - expressed in natural language, a structured intermediate representation, and formal PDDL - and rigorously evaluating the quality of generated plans. I demonstrate NL2FLOW’s capabilities by generating a dataset of 2296 problems in the automated workflow generation domain and evaluating multiple open-sourced, instruct-tuned LLMs. My results reveal that the highest performing models achieved 86% success in generating valid plans and 69% in generating optimal plans, specifically for problems with feasible solutions. Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Notably, I observed that the highest success rate for translating natural language into a JSON representation of a plan was lower than the highest rate of generating a valid plan directly. This suggests that unnecessarily decomposing the reasoning task - introducing intermediate translation steps - may actually degrade performance, implying a benefit to models capable of reasoning directly from natural language to action. As I scale LLM reasoning to increasingly complex problems, the bottlenecks and sources of error within these systems will inevitably shift. Therefore, a dynamic understanding of these limitations - and the tools to systematically reveal them - will be crucial for unlocking the full potential of LLMs as intelligent problem solvers.

[AI-38] Order Acquisition Under Competitive Pressure: A Rapidly Adaptive Reinforcement Learning Approach for Ride-Hailing Subsidy Strategies

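[Quick read]: This paper addresses how ride-hailing service providers can design an effective coupon strategy under a competitive pricing mechanism so as to optimize order acquisition within budget constraints. The key solution is the FCA-RL framework, which combines two core techniques, Fast Competition Adaptation (FCA) and Reinforced Lagrangian Adjustment (RLA), to respond quickly to market dynamics while optimizing coupon decisions under the budget constraint.

Link: https://arxiv.org/abs/2507.02244
Authors: Fangzhou Shi, Xiaopeng Ke, Xinye Xiong, Kexin Meng, Chang Men, Zhengdan Zhu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: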

Abstract:The proliferation of ride-hailing aggregator platforms presents significant growth opportunities for ride-service providers by increasing order volume and gross merchandise value (GMV). On most ride-hailing aggregator platforms, service providers that offer lower fares are ranked higher in listings and, consequently, are more likely to be selected by passengers. This competitive ranking mechanism creates a strong incentive for service providers to adopt coupon strategies that lower prices to secure a greater number of orders, as order volume directly influences their long-term viability and sustainability. Thus, designing an effective coupon strategy that can dynamically adapt to market fluctuations while optimizing order acquisition under budget constraints is a critical research challenge. However, existing studies in this area remain scarce. To bridge this gap, we propose FCA-RL, a novel reinforcement learning-based subsidy strategy framework designed to rapidly adapt to competitors’ pricing adjustments. Our approach integrates two key techniques: Fast Competition Adaptation (FCA), which enables swift responses to dynamic price changes, and Reinforced Lagrangian Adjustment (RLA), which ensures adherence to budget constraints while optimizing coupon decisions on the new price landscape. Furthermore, we introduce RideGym, the first dedicated simulation environment tailored for ride-hailing aggregators, facilitating comprehensive evaluation and benchmarking of different pricing strategies without compromising real-world operational efficiency. Experimental results demonstrate that our proposed method consistently outperforms baseline approaches across diverse market conditions, highlighting its effectiveness in subsidy optimization for ride-hailing service providers.
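
The abstract does not spell out how RLA works, so the following is only a generic Lagrangian-relaxation sketch of budget-constrained coupon selection, not the paper's method: a multiplier `lam` prices each unit of spend, and it is raised when spend exceeds the budget and lowered otherwise. The coupon options, learning rate, and function names are all hypothetical.

```python
def pick_coupon(options, lam):
    """Choose the (cost, expected_orders) option maximizing gain minus priced cost."""
    return max(options, key=lambda option: option[1] - lam * option[0])

def update_multiplier(lam, spend, budget, lr=0.01):
    """Plain dual ascent: raise lam on overspend, lower it otherwise, never below 0."""
    return max(0.0, lam + lr * (spend - budget))

# Hypothetical coupon options: (cost per request, expected orders per request).
options = [(0.0, 0.10), (1.0, 0.30), (3.0, 0.40)]
n_requests, budget = 10, 15.0

lam = 0.0
for step in range(3):
    cost, gain = pick_coupon(options, lam)
    lam = update_multiplier(lam, cost * n_requests, budget)
    print(step, cost, round(lam, 3))
```

With `lam = 0` the most generous coupon wins; once overspending pushes `lam` up, the cheaper coupon becomes preferable, which is the basic mechanism a Lagrangian adjustment relies on.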

[AI-39] Dilution, Diffusion and Symbiosis in Spatial Prisoner's Dilemma with Reinforcement Learning

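[Quick read]: This paper uses reinforcement learning to study how dilution and mobility affect cooperative behavior in the spatial prisoner's dilemma. The key to its approach is an independent multi-agent Q-learning algorithm: by defining different possible actions, it connects with results from the classical, non-reinforcement-learning spatial prisoner's dilemma, demonstrating the algorithm's versatility in modeling different game-theoretical scenarios and its potential as a benchmarking tool.

Link: https://arxiv.org/abs/2507.02211
Authors: Gustavo C. Mangold, Heitor C. M. Fernandes, Mendeli H. Vainstein
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
Comments: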

Abstract:Recent studies in the spatial prisoner’s dilemma games with reinforcement learning have shown that static agents can learn to cooperate through a diverse set of mechanisms, including noise injection, different types of learning algorithms and neighbours’ payoff […]. In this work, using an independent multi-agent Q-learning algorithm, we study the effects of dilution and mobility in the spatial version of the prisoner’s dilemma. Within this setting, different possible actions for the algorithm are defined, connecting with previous results on the classical, non-reinforcement learning spatial prisoner’s dilemma, showcasing the versatility of the algorithm in modeling different game-theoretical scenarios and the benchmarking potential of this […]. As a result, a range of effects is observed, including evidence that games with fixed update rules can be qualitatively equivalent to those with learned ones, as well as the emergence of a symbiotic mutualistic effect between populations that forms when multiple actions are defined.
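
For readers unfamiliar with independent Q-learning, the sketch below shows the standard tabular update that each agent would apply to its own table. The payoff values, the choice of state (the opponent's previous action), and the hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical prisoner's dilemma payoffs: (row player's reward, column player's reward).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def new_q_table():
    """One Q-table per agent; state = opponent's previous action (an assumption)."""
    return {state: {"C": 0.0, "D": 0.0} for state in ("C", "D")}

def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update; returns the new Q-value."""
    best_next = max(q[next_state].values())
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

# Agent A defects against a cooperating neighbour twice in a row.
q_a = new_q_table()
reward_a, _ = PAYOFF[("D", "C")]
first = q_update(q_a, "C", "D", reward_a, "C")
second = q_update(q_a, "C", "D", reward_a, "C")
print(first, second)
```

"Independent" here means each agent keeps and updates its own table from its own rewards, with no shared value function.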

[AI-40] EIM-TRNG: Obfuscating Deep Neural Network Weights with Encoding-in-Memory True Random Number Generator via RowHammer

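[Quick read]: This paper addresses the need for true random number generators (TRNGs) in hardware security, particularly the challenge of protecting deep neural network (DNN) model parameters. Conventional software pseudo-random number generators lack the unpredictability and attack resilience of hardware TRNGs. The proposed Encoding-in-Memory TRNG (EIM-TRNG) exploits the physical randomness of DRAM cells under RowHammer disturbance, which the authors present as the first use of this effect as a reliable DRAM-based entropy source. The key is to produce unpredictable bit-flips through controlled RowHammer operations and to encode DNN weight data with a combination of fixed and random bit-flips, providing data confidentiality and model authenticity.

Link: https://arxiv.org/abs/2507.02206
Authors: Ranyang Zhou, Abeer Matar A. Almalky, Gamana Aragonda, Sabbir Ahmed, Filip Roth Trønnes-Christensen, Adnan Siraj Rakin, Shaahin Angizi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: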

Abstract:True Random Number Generators (TRNGs) play a fundamental role in hardware security, cryptographic systems, and data protection. In the context of Deep Neural Networks (DNNs), safeguarding model parameters, particularly weights, is critical to ensure the integrity, privacy, and intellectual property of AI systems. While software-based pseudo-random number generators are widely used, they lack the unpredictability and resilience offered by hardware-based TRNGs. In this work, we propose a novel and robust Encoding-in-Memory TRNG called EIM-TRNG that leverages the inherent physical randomness in DRAM cell behavior, particularly under RowHammer-induced disturbances, for the first time. We demonstrate how the unpredictable bit-flips generated through carefully controlled RowHammer operations can be harnessed as a reliable entropy source. Furthermore, we apply this TRNG framework to secure DNN weight data by encoding via a combination of fixed and unpredictable bit-flips. The encrypted data is later decrypted using a key derived from the probabilistic flip behavior, ensuring both data confidentiality and model authenticity. Our results validate the effectiveness of DRAM-based entropy extraction for robust, low-cost hardware security and offer a promising direction for protecting machine learning models at the hardware level.

[AI-41] Do Role-Playing Agents Practice What They Preach? Belief-Behavior Consistency in LLM-Based Simulations of Human Trust

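[Quick read]: This paper examines how consistent stated beliefs are with actual behavior when generative AI models act as role-playing agents. The core of the study is to assess whether beliefs elicited from a model by prompting can accurately predict simulation outcomes, and therefore whether the model is reliable for human behavioral research. The key to the approach is an evaluation framework with a belief-behavior consistency metric, used to systematically analyze factors that affect consistency, such as the type of belief, the timing of information presentation, and the forecasting horizon, and to explore whether theoretical priors can be imposed when elicited beliefs are misaligned with research objectives.

Link: https://arxiv.org/abs/2507.02197
Authors: Amogh Mannekote, Adam Davies, Guohao Li, Kristy Elizabeth Boyer, ChengXiang Zhai, Bonnie J Dorr, Francesco Pinto
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: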

Abstract:As LLMs are increasingly studied as role-playing agents to generate synthetic data for human behavioral research, ensuring that their outputs remain coherent with their assigned roles has become a critical concern. In this paper, we investigate how consistently LLM-based role-playing agents’ stated beliefs about the behavior of the people they are asked to role-play (“what they say”) correspond to their actual behavior during role-play (“how they act”). Specifically, we establish an evaluation framework to rigorously measure how well beliefs obtained by prompting the model can predict simulation outcomes in advance. Using an augmented version of the GenAgents persona bank and the Trust Game (a standard economic game used to quantify players’ trust and reciprocity), we introduce a belief-behavior consistency metric to systematically investigate how it is affected by factors such as: (1) the types of beliefs we elicit from LLMs, like expected outcomes of simulations versus task-relevant attributes of individual characters LLMs are asked to simulate; (2) when and how we present LLMs with relevant information about Trust Game; and (3) how far into the future we ask the model to forecast its actions. We also explore how feasible it is to impose a researcher’s own theoretical priors in the event that the originally elicited beliefs are misaligned with research objectives. Our results reveal systematic inconsistencies between LLMs’ stated (or imposed) beliefs and the outcomes of their role-playing simulation, at both an individual- and population-level. Specifically, we find that, even when models appear to encode plausible beliefs, they may fail to apply them in a consistent way. These findings highlight the need to identify how and when LLMs’ stated beliefs align with their simulated behavior, allowing researchers to use LLM-based agents appropriately in behavioral studies.
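
For context, the Trust Game mentioned above can be written down in a few lines. The sketch assumes the common convention that the amount sent is tripled before it reaches the receiver; the abstract does not state the multiplier, and `belief_behavior_gap` is a hypothetical mean-absolute-difference measure, not the paper's consistency metric.

```python
def trust_game_payoffs(endowment, sent, returned, multiplier=3):
    """Return (sender_payoff, receiver_payoff) for one Trust Game round.

    The multiplier of 3 is the usual textbook convention (an assumption here).
    """
    if not 0 <= sent <= endowment:
        raise ValueError("sent must be between 0 and the endowment")
    received = sent * multiplier
    if not 0 <= returned <= received:
        raise ValueError("returned must be between 0 and the amount received")
    return endowment - sent + returned, received - returned

def belief_behavior_gap(stated_amounts, simulated_amounts):
    """Hypothetical gap measure: mean absolute difference between what an
    agent says it would send and what it sends during role-play."""
    diffs = [abs(s - a) for s, a in zip(stated_amounts, simulated_amounts)]
    return sum(diffs) / len(diffs)

print(trust_game_payoffs(endowment=10, sent=4, returned=6))
print(belief_behavior_gap([5, 5, 5], [2, 5, 8]))
```

A gap of zero would mean the agent acted exactly as it said it would; the paper's finding is that such agreement often fails to hold.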

[AI-42] Data Diversification Methods In Alignment Enhance Math Performance In LLMs

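[Quick read]: This paper addresses the persistent challenge of mathematical reasoning in large language models (LLMs), which remains despite advances in preference learning that have improved alignment with human feedback. The key to the solution is optimizing preference learning through data diversification strategies, in particular Diversified-ThinkSolve (DTS), a structured approach that systematically decomposes problems into diverse reasoning paths, producing more representative and effective preference data and substantially improving mathematical reasoning performance.

Link: https://arxiv.org/abs/2507.02173
Authors: Berkan Dokmeci, Qingyang Wu, Ben Athiwaratkun, Ce Zhang, Shuaiwen Leon Song, James Zou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: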

Abstract:While recent advances in preference learning have enhanced alignment in human feedback, mathematical reasoning remains a persistent challenge. We investigate how data diversification strategies in preference optimization can improve the mathematical reasoning abilities of large language models (LLMs). We evaluate three common data generation methods: temperature sampling, Chain-of-Thought prompting, and Monte Carlo Tree Search (MCTS), and introduce Diversified-ThinkSolve (DTS), a novel structured approach that systematically decomposes problems into diverse reasoning paths. Our results show that with strategically diversified preference data, models can substantially improve mathematical reasoning performance, with the best approach yielding gains of 7.1% on GSM8K and 4.2% on MATH over the base model. Despite its strong performance, DTS incurs only a marginal computational overhead (1.03x) compared to the baseline, while MCTS is nearly five times more costly with lower returns. These findings demonstrate that structured exploration of diverse problem-solving methods creates more effective preference data for mathematical alignment than traditional approaches.
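
Of the baseline data generation methods listed, temperature sampling is the simplest to show. The sketch below is the standard temperature-scaled softmax; the logits are made-up numbers, and nothing here reflects how DTS itself works.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical next-token scores
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Raising the temperature spreads probability mass onto lower-ranked tokens, which is how it yields more varied candidate solutions for preference data.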

[AI-43] Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN ICANN

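[Quick read]: This paper addresses robotic trajectory planning: generating a sequence of joint configurations that leads from an initial state to a goal state while respecting robot kinematics and environmental constraints. Conventional methods usually rely on computationally intensive sampling-based planners. This paper instead proposes a cognitively inspired self-supervised learning scheme based on a recurrent architecture for building a trajectory model. The key is that learning uses given forward and inverse kinematics models rather than imitation of observed trajectories, which is intended to make trajectory generation more efficient and adaptive.

Link: https://arxiv.org/abs/2507.02171
Authors: Miroslav Cibula, Kristína Malinovská, Matthias Kerzel
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 4 figures, 2 tables. To be published in 2025 International Conference on Artificial Neural Networks (ICANN) proceedings. This research was funded by the Horizon Europe project TERAIS, GA no. 101079338, and in part by the Slovak Grant Agency for Science (VEGA), project 1/0373/23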

Abstract:Trajectory planning in robotics is understood as generating a sequence of joint configurations that will lead a robotic agent, or its manipulator, from an initial state to the desired final state, thus completing a manipulation task while considering constraints like robot kinematics and the environment. Typically, this is achieved via sampling-based planners, which are computationally intensive. Recent advances demonstrate that trajectory planning can also be performed by supervised sequence learning of trajectories, often requiring only a single or fixed number of passes through a neural architecture, thus ensuring a bounded computation time. Such fully supervised approaches, however, perform imitation learning; they do not learn based on whether the trajectories can successfully reach a goal, but try to reproduce observed trajectories. In our work, we build on this approach and propose a cognitively inspired self-supervised learning scheme based on a recurrent architecture for building a trajectory model. We evaluate the feasibility of the proposed method on a task of kinematic planning for a robotic arm. The results suggest that the model is able to learn to generate trajectories only using given paired forward and inverse kinematics models, and indicate that this novel method could facilitate planning for more complex manipulation tasks requiring adaptive solutions.
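
To show why a forward kinematics model can supply a training signal without recorded trajectories, the sketch below computes the end-effector position of a planar arm and the distance to a goal. The two-link arm with unit link lengths and the `reach_error` function are illustrative assumptions; the paper's robot and loss are not described in the abstract.

```python
import math

def forward_kinematics(joint_angles, link_lengths=(1.0, 1.0)):
    """End-effector (x, y) of a planar arm with revolute joints (angles in radians)."""
    x = y = 0.0
    total_angle = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        total_angle += angle  # each joint rotates relative to the previous link
        x += length * math.cos(total_angle)
        y += length * math.sin(total_angle)
    return x, y

def reach_error(joint_angles, goal):
    """Distance between where the arm ends up and where it should be.
    A signal like this needs no demonstration trajectories."""
    x, y = forward_kinematics(joint_angles)
    return math.hypot(x - goal[0], y - goal[1])

print(forward_kinematics([0.0, 0.0]))
print(reach_error([math.pi / 2, 0.0], goal=(0.0, 2.0)))
```

Because the error is computed from the model's own output, a planner can be scored on whether it reaches the goal rather than on how closely it copies an observed path.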

[AI-44] Generating Large Semi-Synthetic Graphs of Any Size

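[Quick read]: This paper addresses the limitations of traditional graph generation methods at large scale and their reliance on node IDs, which prevents models from generating graphs larger than the input graph and ignores node attributes. The key to the solution is a new framework, Latent Graph Sampling Generation (LGSG), which uses diffusion models and node embeddings to generate graphs of varying sizes without retraining, eliminates the dependency on node IDs, and captures the distribution of node embeddings and subgraph structures, enabling scalable and flexible graph generation.

Link: https://arxiv.org/abs/2507.02166
Authors: Rodrigo Tuna, Carlos Soares
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
Comments: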

Abstract:Graph generation is an important area in network science. Traditional approaches focus on replicating specific properties of real-world graphs, such as small diameters or power-law degree distributions. Recent advancements in deep learning, particularly with Graph Neural Networks, have enabled data-driven methods to learn and generate graphs without relying on predefined structural properties. Despite these advances, current models are limited by their reliance on node IDs, which restricts their ability to generate graphs larger than the input graph and ignores node attributes. To address these challenges, we propose Latent Graph Sampling Generation (LGSG), a novel framework that leverages diffusion models and node embeddings to generate graphs of varying sizes without retraining. The framework eliminates the dependency on node IDs and captures the distribution of node embeddings and subgraph structures, enabling scalable and flexible graph generation. Experimental results show that LGSG performs on par with baseline models for standard metrics while outperforming them in overlooked ones, such as the tendency of nodes to form clusters. Additionally, it maintains consistent structural characteristics across graphs of different sizes, demonstrating robustness and scalability.

[AI-45] The Illusion of Fairness: Auditing Fairness Interventions with Audit Studies

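[Quick read]: This paper addresses bias in AI systems used for decisions such as hiring, and in particular how to assess and reduce algorithmic discrimination more accurately. The key to its approach is using data from audit studies to improve the training and evaluation of automated hiring algorithms. The study shows that a common fairness intervention, resampling so that base rates are equal across classes, appears to achieve parity under traditional measures but still shows roughly 10% disparity when measured appropriately. The paper further introduces interventions based on individual treatment effect estimation to reduce algorithmic discrimination more effectively.

Link: https://arxiv.org/abs/2507.02152
Authors: Disa Sariola, Patrick Button, Aron Culotta, Nicholas Mattei
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: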

Abstract:Artificial intelligence systems, especially those using machine learning, are being deployed in domains from hiring to loan issuance in order to automate these complex decisions. Judging both the effectiveness and fairness of these AI systems, and their human decision making counterpart, is a complex and important topic studied across both computational and social sciences. Within machine learning, a common way to address bias in downstream classifiers is to resample the training data to offset disparities. For example, if hiring rates vary by some protected class, then one may equalize the rate within the training set to alleviate bias in the resulting classifier. While simple and seemingly effective, these methods have typically only been evaluated using data obtained through convenience samples, introducing selection bias and label bias into metrics. Within the social sciences, psychology, public health, and medicine, audit studies, in which fictitious “testers” (e.g., resumes, emails, patient actors) are sent to subjects (e.g., job openings, businesses, doctors) in randomized control trials, provide high quality data that support rigorous estimates of discrimination. In this paper, we investigate how data from audit studies can be used to improve our ability to both train and evaluate automated hiring algorithms. We find that such data reveals cases where the common fairness intervention method of equalizing base rates across classes appears to achieve parity using traditional measures, but in fact has roughly 10% disparity when measured appropriately. We additionally introduce interventions based on individual treatment effect estimation methods that further reduce algorithmic discrimination using this data.
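
The resampling intervention the abstract critiques can be sketched as follows: positives are dropped from the higher-rate group until every group has the same positive rate in the training set. The toy data and the downsampling rule are assumptions for illustration; the paper's exact procedure may differ.

```python
def base_rate(rows, group):
    """Fraction of positive labels among rows belonging to `group`."""
    labels = [label for g, label in rows if g == group]
    return sum(labels) / len(labels)

def equalize_base_rates(rows):
    """Downsample positives so each group matches the lowest group base rate."""
    groups = sorted({g for g, _ in rows})
    target = min(base_rate(rows, g) for g in groups)
    resampled = []
    for g in groups:
        positives = [(gg, y) for gg, y in rows if gg == g and y == 1]
        negatives = [(gg, y) for gg, y in rows if gg == g and y == 0]
        if target < 1:
            # keep k positives so that k / (k + negatives) is about target
            keep = round(target * len(negatives) / (1 - target))
        else:
            keep = len(positives)
        resampled.extend(positives[:keep] + negatives)
    return resampled

# Hypothetical training set: (group, hired) pairs.
rows = [("A", 1)] * 6 + [("A", 0)] * 4 + [("B", 1)] * 2 + [("B", 0)] * 8
balanced = equalize_base_rates(rows)
print(base_rate(rows, "A"), base_rate(rows, "B"))
print(base_rate(balanced, "A"), base_rate(balanced, "B"))
```

Equal base rates in the training set are exactly the "traditional measure" of parity; the paper's point is that audit-study data can reveal disparity that this check does not capture.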

[AI-46] When LLMs Disagree: Diagnosing Relevance Filtering Bias and Retrieval Divergence in SDG Search SIGIR2025

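[Quick read]: This paper addresses disagreement among large language models (LLMs) when they assign document relevance labels in information retrieval pipelines, especially in domains lacking human-labeled data, where such disagreement may affect downstream retrieval. The key to the approach is to analyze label inconsistencies between models and characterize them in terms of lexical patterns, rank-order behavior, and classification predictability. The study finds that inter-model disagreement is not random but has identifiable structure, which supports using classification disagreement as an object of analysis in retrieval evaluation, particularly for policy-relevant or thematic search tasks.

Link: https://arxiv.org/abs/2507.02139
Authors: William A. Ingram, Bipasha Banerjee, Edward A. Fox
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: Presented at LLM4Eval Workshop, SIGIR 2025 Padova, Italy, July 17, 2025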

Abstract:Large language models (LLMs) are increasingly used to assign document relevance labels in information retrieval pipelines, especially in domains lacking human-labeled data. However, different models often disagree on borderline cases, raising concerns about how such disagreement affects downstream retrieval. This study examines labeling disagreement between two open-weight LLMs, LLaMA and Qwen, on a corpus of scholarly abstracts related to Sustainable Development Goals (SDGs) 1, 3, and 7. We isolate disagreement subsets and examine their lexical properties, rank-order behavior, and classification predictability. Our results show that model disagreement is systematic, not random: disagreement cases exhibit consistent lexical patterns, produce divergent top-ranked outputs under shared scoring functions, and are distinguishable with AUCs above 0.74 using simple classifiers. These findings suggest that LLM-based filtering introduces structured variability in document retrieval, even under controlled prompting and shared ranking logic. We propose using classification disagreement as an object of analysis in retrieval evaluation, particularly in policy-relevant or thematic search tasks.
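
Two of the steps in the abstract, isolating the disagreement subset and scoring a classifier by AUC, are easy to sketch. The labels and scores below are invented, and the pairwise AUC is the textbook definition rather than anything specific to the paper.

```python
def split_by_agreement(labels_a, labels_b):
    """Partition document ids by whether two labelers gave the same label."""
    agree, disagree = [], []
    for doc_id in labels_a:
        bucket = agree if labels_a[doc_id] == labels_b[doc_id] else disagree
        bucket.append(doc_id)
    return agree, disagree

def auc(scores, labels):
    """Pairwise AUC: probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical relevance labels from two models.
labels_a = {"d1": 1, "d2": 0, "d3": 1}
labels_b = {"d1": 1, "d2": 1, "d3": 0}
print(split_by_agreement(labels_a, labels_b))

# Hypothetical classifier scores for "is this a disagreement case?".
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

An AUC of 0.5 would mean disagreement is unpredictable; values clearly above that, as the paper reports, indicate the disagreement has learnable structure.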

[AI-47] Can Artificial Intelligence solve the blockchain oracle problem? Unpacking the Challenges and Possibilities

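[Quick read]: This paper addresses the blockchain oracle problem, that is, the core challenge of injecting reliable external data into decentralized systems. The key to the proposed approach is to use artificial intelligence (AI) techniques, such as anomaly detection, language-based fact extraction, dynamic reputation modeling, and adversarial resistance, to improve the data quality, source selection, and resilience of oracle systems. The paper notes, however, that AI cannot eliminate the reliance on unverifiable off-chain inputs, so it should be treated as a complementary layer within oracle design rather than a substitute for trust assumptions.

Link: https://arxiv.org/abs/2507.02125
Authors: Giulio Caldarelli
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: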

Abstract:The blockchain oracle problem, which refers to the challenge of injecting reliable external data into decentralized systems, remains a fundamental limitation to the development of trustless applications. While recent years have seen a proliferation of architectural, cryptographic, and economic strategies to mitigate this issue, no one has yet fully resolved the fundamental question of how a blockchain can gain knowledge about the off-chain world. In this position paper, we critically assess the role artificial intelligence (AI) can play in tackling the oracle problem. Drawing from both academic literature and practitioner implementations, we examine how AI techniques such as anomaly detection, language-based fact extraction, dynamic reputation modeling, and adversarial resistance can enhance oracle systems. We observe that while AI introduces powerful tools for improving data quality, source selection, and system resilience, it cannot eliminate the reliance on unverifiable off-chain inputs. Therefore, this study supports the idea that AI should be understood as a complementary layer of inference and filtering within a broader oracle design, not a substitute for trust assumptions.

[AI-48] What Neuroscience Can Teach AI About Learning in Continuously Changing Environments

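[Quick read]: This paper asks how the computational mechanisms that let animals adapt quickly to continuously changing environments can be brought into AI systems to improve their flexibility and learning efficiency on real-world tasks. The key to its approach is to integrate research on continual and in-context learning in AI with the neuroscience of learning on behavioral tasks with changing rules, reward probabilities, or outcomes. This offers new theoretical frameworks and methodological insights for AI systems while promoting exchange between neuroscience and AI and the development of the NeuroAI field.

Link: https://arxiv.org/abs/2507.02103
Authors: Daniel Durstewitz, Bruno Averbeck, Georgia Koppe
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: Submitted as a Perspective article (10 pages, 5 figures)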

Abstract:Modern AI models, such as large language models, are usually trained once on a huge corpus of data, potentially fine-tuned for a specific task, and then deployed with fixed parameters. Their training is costly, slow, and gradual, requiring billions of repetitions. In stark contrast, animals continuously adapt to the ever-changing contingencies in their environments. This is particularly important for social species, where behavioral policies and reward outcomes may frequently change in interaction with peers. The underlying computational processes are often marked by rapid shifts in an animal’s behaviour and rather sudden transitions in neuronal population activity. Such computational capacities are of growing importance for AI systems operating in the real world, like those guiding robots or autonomous vehicles, or for agentic AI interacting with humans online. Can AI learn from neuroscience? This Perspective explores this question, integrating the literature on continual and in-context learning in AI with the neuroscience of learning on behavioral tasks with shifting rules, reward probabilities, or outcomes. We will outline an agenda for how specifically insights from neuroscience may inform current developments in AI in this area, and - vice versa - what neuroscience may learn from AI, contributing to the evolving field of NeuroAI.

[AI-49] GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters

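[Quick read]: This paper addresses how to achieve flexible, parameter-efficient geometric control when fine-tuning geometric diffusion models for downstream tasks. The key to the solution is an SE(3)-equivariant adapter framework (GeoAda) whose structured adapter design encodes, processes, and decodes control signals without modifying the original model architecture, preserving the model's geometric consistency and mitigating overfitting and catastrophic forgetting.

Link: https://arxiv.org/abs/2507.02085
Authors: Wanjia Zhao, Jiaqi Han, Siyi Gu, Mingjian Jiang, James Zou, Stefano Ermon
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: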

Abstract:Geometric diffusion models have shown remarkable success in molecular dynamics and structure generation. However, efficiently fine-tuning them for downstream tasks with varying geometric controls remains underexplored. In this work, we propose an SE(3)-equivariant adapter framework (GeoAda) that enables flexible and parameter-efficient fine-tuning for controlled generative tasks without modifying the original model architecture. GeoAda introduces a structured adapter design: control signals are first encoded through coupling operators, then processed by a trainable copy of selected pretrained model layers, and finally projected back via decoupling operators followed by an equivariant zero-initialized convolution. By fine-tuning only these lightweight adapter modules, GeoAda preserves the model’s geometric consistency while mitigating overfitting and catastrophic forgetting. We theoretically prove that the proposed adapters maintain SE(3)-equivariance, ensuring that the geometric inductive biases of the pretrained diffusion model remain intact during adaptation. We demonstrate the wide applicability of GeoAda across diverse geometric control types, including frame control, global control, subgraph control, and a broad range of application domains such as particle dynamics, molecular dynamics, human motion prediction, and molecule generation. Empirical results show that GeoAda achieves state-of-the-art fine-tuning performance while preserving original task accuracy, whereas other baselines experience significant performance degradation due to overfitting and catastrophic forgetting.
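
One detail worth illustrating is the zero-initialized output projection: at the start of fine-tuning, the adapter contributes nothing, so the adapted layer reproduces the pretrained layer exactly. The sketch below uses plain linear maps with made-up weights and is not SE(3)-equivariant; it only demonstrates the zero-initialization property, and every name in it is hypothetical.

```python
def linear(x, weights):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def frozen_layer(x):
    """Stand-in for a pretrained layer whose weights stay fixed."""
    return linear(x, [[1.0, 2.0], [0.5, -1.0]])

class ZeroInitAdapter:
    def __init__(self, dim):
        # stand-in for the trainable copy of a pretrained layer
        self.inner = [[0.1 * (i + j + 1) for j in range(dim)] for i in range(dim)]
        # output projection starts at zero, so the adapter is initially a no-op
        self.out_proj = [[0.0] * dim for _ in range(dim)]

    def __call__(self, x, control):
        mixed = [a + b for a, b in zip(x, control)]  # crude "coupling" of the control signal
        return linear(linear(mixed, self.inner), self.out_proj)

def adapted_layer(x, control, adapter):
    """Frozen output plus the adapter's residual contribution."""
    return [b + d for b, d in zip(frozen_layer(x), adapter(x, control))]

adapter = ZeroInitAdapter(dim=2)
x, control = [1.0, 2.0], [0.3, -0.7]
print(frozen_layer(x), adapted_layer(x, control, adapter))
```

Starting from an exact copy of the pretrained behavior is one plausible reason such designs help against catastrophic forgetting: training can only move away from the original output gradually.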

[AI-50] Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

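[Quick read]: This paper addresses the insufficient evaluation of the ability of large language models (LLMs) to design experiments and interpret results, a core scientific competency, especially in biology, where researchers perturb complex systems to uncover their underlying mechanisms. Existing evaluations cannot test these abilities because wet-lab experiments are prohibitively expensive in expertise, time, and equipment. The key to the solution is SciGym, a first-in-class benchmark that sidesteps wet-lab costs by running a dry lab of biological systems: models encoded in Systems Biology Markup Language generate simulated data, allowing experimentation on realistically complex systems.

Link: https://arxiv.org/abs/2507.02083
Authors: Haonan Duan, Stephen Zhewen Lu, Caitlin Fiona Harrigan, Nishkrit Desai, Jiarui Lu, Michał Koziarski, Leonardo Cotta, Chris J. Maddison
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: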

Abstract:Designing experiments and interpreting results are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs’ iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the challenge of wet-lab costs by running a dry lab of biological systems. These models, encoded in Systems Biology Markup Language, are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems, and released a total of 350 systems. Our evaluation shows that while more capable models demonstrated superior performance, all models’ performance declined significantly as system complexity increased, suggesting substantial room for improvement in the scientific capabilities of LLM agents.

[AI-51] Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

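[Quick read]: This paper addresses the computational inefficiency of LLM reasoning: current models apply fixed inference-time compute and cannot adjust it to task complexity, so they overthink simple problems and underthink hard ones. The key contribution is a review of efficient test-time compute (TTC) strategies organized by a two-tiered taxonomy: L1 controllability (methods that operate under a fixed compute budget) and L2 adaptiveness (methods that dynamically scale inference based on input difficulty or model confidence).

Link: https://arxiv.org/abs/2507.02076
Authors: Mohammad Ali Alomrani, Yingxue Zhang, Derek Li, Qianyi Sun, Soumyasundar Pal, Zhanguang Zhang, Yaochen Hu, Rohan Deepak Ajwani, Antonios Valkanas, Raika Karimi, Peng Cheng, Yunzhou Wang, Pengyi Liao, Hanrui Huang, Bin Wang, Jianye Hao, Mark Coates
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: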

Abstract:Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks. However, current models remain inefficient at reasoning: they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute (TTC) strategies, which aim to improve the computational efficiency of LLM reasoning. We introduce a two-tiered taxonomy that distinguishes between L1-controllability, methods that operate under fixed compute budgets, and L2-adaptiveness, methods that dynamically scale inference based on input difficulty or model confidence. We benchmark leading proprietary LLMs across diverse datasets, highlighting critical trade-offs between reasoning performance and token usage. Compared to prior surveys on efficient reasoning, our review emphasizes the practical control, adaptability, and scalability of TTC methods. Finally, we discuss emerging trends such as hybrid thinking models and identify key challenges for future work towards making LLMs more computationally efficient, robust, and responsive to user constraints.
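
The distinction between the two tiers can be shown with a toy reasoning loop: a fixed token budget stops every problem at the same point, while a confidence threshold stops easy problems early and lets hard ones run longer. The step traces, token counts, and confidence values are invented for illustration and do not come from the survey.

```python
def run_reasoning(steps, token_budget=None, confidence_threshold=None):
    """Consume (tokens, confidence) reasoning steps until a stop rule fires.

    token_budget          -> L1-style control: never exceed a fixed budget
    confidence_threshold  -> L2-style adaptation: stop once confident enough
    """
    used, confidence = 0, 0.0
    for tokens, step_confidence in steps:
        if token_budget is not None and used + tokens > token_budget:
            break
        used += tokens
        confidence = step_confidence
        if confidence_threshold is not None and confidence >= confidence_threshold:
            break
    return used, confidence

# Hypothetical traces: confidence after each 50-token reasoning step.
easy = [(50, 0.95), (50, 0.97), (50, 0.98)]
hard = [(50, 0.30), (50, 0.60), (50, 0.92)]

print(run_reasoning(easy, token_budget=100), run_reasoning(hard, token_budget=100))
print(run_reasoning(easy, confidence_threshold=0.9), run_reasoning(hard, confidence_threshold=0.9))
```

Under the fixed budget the easy problem wastes tokens and the hard one stops before it is confident, which is the overthinking and underthinking pattern the abstract describes.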

[AI-52] HCVR: A Hybrid Approach with Correlation-aware Voting Rules for Feature Selection

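[Quick read]: This paper addresses how to eliminate redundant features while retaining relevant ones in feature selection, in order to improve dimensionality reduction. The key to the solution is HCVR (Hybrid approach with Correlation-aware Voting Rules), which combines Parameter-to-Parameter (P2P) and Parameter-to-Target (P2T) correlations, filters features through voting rules, mixes non-iterative and iterative filtering strategies, and decides by majority vote using correlation thresholds.

Link: https://arxiv.org/abs/2507.02073
Authors: Nikita Bhedasgaonkar, Rushikesh K. Joshi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 11 pages, 5 tables, 2 figures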

Abstract:In this paper, we propose HCVR (Hybrid approach with Correlation-aware Voting Rules), a lightweight rule-based feature selection method that combines Parameter-to-Parameter (P2P) and Parameter-to-Target (P2T) correlations to eliminate redundant features and retain relevant ones. This method is a hybrid of non-iterative and iterative filtering approaches for dimensionality reduction. It is a greedy method, which works by backward elimination, eliminating possibly multiple features at every step. The rules contribute to voting for features, and a decision to keep or discard is made by majority voting. The rules make use of correlation thresholds between every pair of features, and between features and the target. We provide the results from the application of HCVR to the SPAMBASE dataset. The results showed improved performance compared to traditional non-iterative (CFS, mRMR and MI) and iterative (RFE, SFS and Genetic Algorithm) techniques. The effectiveness was assessed based on the performance of different classifiers after applying filtering.
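The abstract does not list the actual voting rules, so the following is only a rough sketch of correlation-threshold voting in a single pass (the paper describes a backward-elimination procedure). The three rules, the thresholds `t_pair` and `t_target`, and the function names are all hypothetical stand-ins, not the authors' method.

```python
from statistics import mean, median


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def vote_select(features, target, t_pair=0.9, t_target=0.2):
    """Keep a feature when at least two of three hypothetical rules vote for it."""
    # P2T: absolute correlation of each feature with the target.
    p2t = {name: abs(pearson(col, target)) for name, col in features.items()}
    med = median(p2t.values())
    kept = []
    for name, col in features.items():
        # P2P: the feature is redundant if a more relevant feature is highly correlated with it.
        redundant = any(
            other != name
            and abs(pearson(col, features[other])) >= t_pair
            and (p2t[other], other) > (p2t[name], name)
            for other in features
        )
        votes = [
            p2t[name] >= t_target,  # rule 1: relevant to the target
            not redundant,          # rule 2: not redundant with a better feature
            p2t[name] > med,        # rule 3: above-median relevance
        ]
        if sum(votes) >= 2:
            kept.append(name)
    return kept
```

With a feature `a` equal to the target, a near-duplicate `b`, and an uncorrelated feature `c`, this sketch keeps only `a`: `b` loses the redundancy and median votes, and `c` loses the relevance and median votes.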

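[AI-53] MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation

Quick read: This paper addresses a security weakness of aligned AI systems under compositional attacks: malicious operations can evade detection by being decomposed into seemingly harmless sub-tasks. The key to the approach is the Malware Generation Compiler (MGC), a framework that generates malicious code through modular decomposition and alignment-evasive generation, and uses a dedicated Malware Description Intermediate Representation (MDIR) to link high-level malicious intent with benign-looking code snippets.

Link: https://arxiv.org/abs/2507.02057
Authors: Lu Yan,Zhuo Zhang,Xiangzhe Xu,Shengwei An,Guangyu Shen,Zhou Xuan,Xuan Chen,Xiangyu Zhang
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: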

Abstract:Large language models (LLMs) have democratized software development, reducing the expertise barrier for programming complex applications. This accessibility extends to malicious software development, raising significant security concerns. While LLM providers have implemented alignment mechanisms to prevent direct generation of overtly malicious code, these safeguards predominantly evaluate individual prompts in isolation, overlooking a critical vulnerability: malicious operations can be systematically decomposed into benign-appearing sub-tasks. In this paper, we introduce the Malware Generation Compiler (MGC), a novel framework that leverages this vulnerability through modular decomposition and alignment-evasive generation. MGC employs a specialized Malware Description Intermediate Representation (MDIR) to bridge high-level malicious intents and benign-appearing code snippets. Extensive evaluation demonstrates that our attack reliably generates functional malware across diverse task specifications and categories, outperforming jailbreaking methods by +365.79% and underground services by +78.07% in correctness on three benchmark datasets. Case studies further show that MGC can reproduce and even enhance 16 real-world malware samples. This work provides critical insights for security researchers by exposing the risks of compositional attacks against aligned AI systems. Demonstrations are available at this https URL.

[AI-54] Effective Explanations for Belief-Desire-Intention Robots: When and What to Explain

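Quick read: This paper addresses the user confusion that can arise when a robot performing complex, context-dependent everyday tasks behaves contrary to user expectations, and asks how explaining the robot's reasoning can improve users' understanding of its intentions. The key to the solution is identifying when users want explanations and what they want them to contain: users want concise explanations in surprising situations that clearly state the intention behind the confusing action and the relevant contextual factors. Based on this, the paper proposes two algorithms that identify surprising actions and construct effective explanations for Belief-Desire-Intention (BDI) robots. The algorithms integrate easily into the BDI reasoning process, enabling more context- and user-specific interaction.

Link: https://arxiv.org/abs/2507.02016
Authors: Cong Wang,Roberto Calandra,Verena Klös
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Paper accepted at IEEE RO-MAN 2025; 6 pages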

Abstract:When robots perform complex and context-dependent tasks in our daily lives, deviations from expectations can confuse users. Explanations of the robot's reasoning process can help users to understand the robot's intentions. However, when to provide explanations and what they contain are important to avoid user annoyance. We have investigated user preferences for explanation demand and content for a robot that helps with daily cleaning tasks in a kitchen. Our results show that users want explanations in surprising situations and prefer concise explanations that clearly state the intention behind the confusing action and the contextual factors that were relevant to this decision. Based on these findings, we propose two algorithms to identify surprising actions and to construct effective explanations for Belief-Desire-Intention (BDI) robots. Our algorithms can be easily integrated into the BDI reasoning process and pave the way for better human-robot interaction with context- and user-specific explanations.

[AI-55] ManifoldMind: Dynamic Hyperbolic Reasoning for Trustworthy Recommendations

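Quick read: This paper addresses the limitations of conventional recommender systems in exploratory reasoning over semantic hierarchies, in particular the lack of transparency, trustworthiness, and diversity in sparse or abstract domains. The key to the solution is ManifoldMind, a probabilistic geometric recommender system that represents users, items, and tags as adaptive-curvature probabilistic spheres, enabling personalised uncertainty modeling and geometry-aware semantic exploration, while a curvature-aware semantic kernel supports soft multi-hop inference and avoids overfitting to shallow or direct interactions.

Link: https://arxiv.org/abs/2507.02014
Authors: Anoushka Harit,Zhongtian Sun,Suncica Hadzidedic
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: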

Abstract:We introduce ManifoldMind, a probabilistic geometric recommender system for exploratory reasoning over semantic hierarchies in hyperbolic space. Unlike prior methods with fixed curvature and rigid embeddings, ManifoldMind represents users, items, and tags as adaptive-curvature probabilistic spheres, enabling personalised uncertainty modeling and geometry-aware semantic exploration. A curvature-aware semantic kernel supports soft, multi-hop inference, allowing the model to explore diverse conceptual paths instead of overfitting to shallow or direct interactions. Experiments on four public benchmarks show superior NDCG, calibration, and diversity compared to strong baselines. ManifoldMind produces explicit reasoning traces, enabling transparent, trustworthy, and exploration-driven recommendations in sparse or abstract domains.

[AI-56] Discovery of Fatigue Strength Models via Feature Engineering and automated eXplainable Machine Learning applied to the welded Transverse Stiffener

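Quick read: This paper aims to predict the fatigue strength of welded transverse stiffener details, combining Automated Machine Learning (AutoML) with Explainable Artificial Intelligence (XAI) to improve model accuracy and interpretability. The key to the solution is fusing expert-driven feature engineering with algorithmically generated features, training several regression models (gradient boosting, random forests, and neural networks), and systematically comparing three feature schemes (domain-informed, algorithmic, and combined); the comparison shows the domain-informed model to be strongest in accuracy and generalization. XAI methods (SHAP and feature importance) identify the key predictors, improving interpretability and engineering applicability.

Link: https://arxiv.org/abs/2507.02005
Authors: Michael A. Kraus,Helen Bartsch
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: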

Abstract:This research introduces a unified approach combining Automated Machine Learning (AutoML) with Explainable Artificial Intelligence (XAI) to predict fatigue strength in welded transverse stiffener details. It integrates expert-driven feature engineering with algorithmic feature creation to enhance accuracy and explainability. Based on the extensive fatigue test database, regression models (gradient boosting, random forests, and neural networks) were trained using AutoML under three feature schemes: domain-informed, algorithmic, and combined. This allowed a systematic comparison of expert-based versus automated feature selection. Ensemble methods (e.g. CatBoost, LightGBM) delivered top performance. The domain-informed model $\mathcal{M}_2$ achieved the best balance: test RMSE $\approx 30.6$ MPa and $R^2 \approx 0.780\%$ over the full $\Delta\sigma_{c,50\%}$ range, and RMSE $\approx 13.4$ MPa and $R^2 \approx 0.527\%$ within the engineering-relevant 0 to 150 MPa domain. The denser-feature model ($\mathcal{M}_3$) showed minor gains during training but poorer generalization, while the simpler base-feature model ($\mathcal{M}_1$) performed comparably, confirming the robustness of minimalist designs. XAI methods (SHAP and feature importance) identified stress ratio $R$, stress range $\Delta\sigma_i$, yield strength $R_{eH}$, and post-weld treatment (TIG dressing vs. as-welded) as dominant predictors. Secondary geometric factors (plate width, throat thickness, stiffener height) also significantly affected fatigue life. This framework demonstrates that integrating AutoML with XAI yields accurate, interpretable, and robust fatigue strength models for welded steel structures. It bridges data-driven modeling with engineering validation, enabling AI-assisted design and assessment. Future work will explore probabilistic fatigue life modeling and integration into digital twin environments.
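For reference, the two metrics quoted in the abstract are assumed here to follow their standard definitions, which the abstract itself does not spell out. The symbols are notation introduced for this note only: $y_i$ is a measured fatigue strength, $\hat{y}_i$ the model prediction, $\bar{y}$ the mean of the measurements, and $n$ the number of test samples.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

Under these definitions RMSE carries the unit of the target (MPa here), while $R^2$ is dimensionless.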

[AI-57] Towards a Playground to Democratize Experimentation and Benchmarking of AI Agents for Network Troubleshooting

Quick read: This paper addresses the standardization and evaluation of AI agents for network troubleshooting, where the core challenge is the lack of a reproducible, open benchmarking platform with low operational cost. The key to the solution is building such a platform so that AI agents for network fault diagnosis can be developed and evaluated.

Link: https://arxiv.org/abs/2507.01997
Authors: Zhihao Wang,Alessandro Cornacchia,Franco Galante,Carlo Centofanti,Alessio Sacco,Dingde Jiang
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted at ACM SIGCOMM 1st Workshop on Next-Generation Network Observability (NGNO)

Abstract:Recent research has demonstrated the effectiveness of Artificial Intelligence (AI), and more specifically, Large Language Models (LLMs), in supporting network configuration synthesis and automating network diagnosis tasks, among others. In this preliminary work, we restrict our focus to the application of AI agents to network troubleshooting and elaborate on the need for a standardized, reproducible, and open benchmarking platform on which AI agents can be built and evaluated with low operational effort.

[AI-58] DKGCM: A Spatio-Temporal Prediction Model for Traffic Flow by Fusing Spatial Node Clustering Method and Fourier Bidirectional Mamba Mechanism

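Quick read: This paper aims to overcome the limits that complex spatiotemporal relationships place on traffic demand forecasting models. The key to the solution is a new graph convolutional network structure, DKGCM. It includes DK-GCN, a temporal similarity-based clustering graph convolution method that groups traffic nodes using Dynamic Time Warping (DTW) and K-means clustering to capture spatial dependencies more effectively. On the temporal scale, it combines the Fast Fourier Transform (FFT) with a bidirectional Mamba deep learning framework to extract temporal dependencies, and it introduces the GRPO reinforcement learning strategy to optimize model training.

Link: https://arxiv.org/abs/2507.01982
Authors: Siqing Long,Xiangzhi Huang,Jiemin Xie,Ming Cai
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 39 pages, 14 figures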

Abstract:Accurate traffic demand forecasting enables transportation management departments to allocate resources more effectively, thereby improving their utilization efficiency. However, complex spatiotemporal relationships in traffic systems continue to limit the performance of demand forecasting models. To improve the accuracy of spatiotemporal traffic demand prediction, we propose a new graph convolutional network structure called DKGCM. Specifically, we first consider the spatial flow distribution of different traffic nodes and propose a novel temporal similarity-based clustering graph convolution method, DK-GCN. This method utilizes Dynamic Time Warping (DTW) and K-means clustering to group traffic nodes and more effectively capture spatial dependencies. On the temporal scale, we integrate the Fast Fourier Transform (FFT) within the bidirectional Mamba deep learning framework to capture temporal dependencies in traffic demand. To further optimize model training, we incorporate the GRPO reinforcement learning strategy to enhance the loss function feedback mechanism. Extensive experiments demonstrate that our model outperforms several advanced methods and achieves strong results on three public datasets.
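As an illustration of the DTW-based grouping mentioned in the abstract, the sketch below computes the classic dynamic-programming DTW distance and performs one nearest-prototype assignment step, which is the assignment half of a K-means iteration. How DK-GCN actually updates cluster centers or builds its graph is not stated in the abstract, so `assign_clusters` and its fixed `prototypes` argument are hypothetical simplifications.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def assign_clusters(series, prototypes):
    """Assign each series to the index of its nearest prototype under DTW."""
    return [
        min(range(len(prototypes)), key=lambda k: dtw_distance(s, prototypes[k]))
        for s in series
    ]
```

Because DTW allows stretching along the time axis, two traffic nodes whose demand curves have the same shape but are shifted or differ in length can still land in the same cluster.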

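[AI-59] Learnable-Differentiable Finite Volume Solver for Accelerated Simulation of Flows

Quick read: This paper aims to reduce the high computational cost that traditional numerical solvers incur because they need fine spatiotemporal grids to simulate fluid flows, while overcoming the weaknesses of existing machine learning methods in interpretability, generalizability, and data dependency. The key to the solution is LDSolver, a learnable and differentiable finite volume solver with two core components: a differentiable finite volume solver, and a learnable module that provides equivalent approximations of fluxes (derivatives and interpolations) and temporal error correction on coarse grids.

Link: https://arxiv.org/abs/2507.01975
Authors: Mengtao Yan,Qi Wang,Haining Wang,Ruizhi Chengze,Yi Zhang,Hongsheng Liu,Zidong Wang,Fan Yu,Qi Qi,Hao Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Comments: 19 pages, 12 figures, accepted at KDD 2025 (ACM SIGKDD Conference on Knowledge Discovery and Data Mining)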

Abstract:Simulation of fluid flows is crucial for modeling physical phenomena in fields like meteorology, aerodynamics, and biomedicine. Classical numerical solvers often require fine spatiotemporal grids to satisfy stability, consistency, and convergence conditions, leading to substantial computational costs. Although machine learning methods have demonstrated better efficiency, they typically suffer from issues of interpretability, generalizability, and data dependency. Hence, we propose a learnable and differentiable finite volume solver, called LDSolver, designed for efficient and accurate simulation of fluid flows on spatiotemporal coarse grids. LDSolver comprises two key components: (1) a differentiable finite volume solver, and (2) a learnable module providing equivalent approximation for fluxes (derivatives and interpolations), and temporal error correction on coarse grids. Even with limited training data (e.g., only a few trajectories), our model could accelerate the simulation while maintaining a high accuracy with superior generalizability. Experiments on different flow systems (e.g., Burgers, decaying, forced and shear flows) show that LDSolver achieves state-of-the-art performance, surpassing baseline models by notable margins.
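For readers unfamiliar with finite volume schemes, the following is a minimal sketch of one explicit time step for the 1D inviscid Burgers equation on a periodic grid, using the Rusanov (local Lax-Friedrichs) flux. This is a textbook scheme chosen for illustration, not LDSolver itself; in a learnable solver of the kind the abstract describes, a trained module would presumably replace or correct the hand-written `rusanov_flux`, which is an assumption on our part.

```python
def rusanov_flux(ul, ur):
    """Rusanov numerical flux for Burgers' equation, f(u) = u^2 / 2."""
    f = lambda u: 0.5 * u * u
    a = max(abs(ul), abs(ur))  # local maximum wave speed
    return 0.5 * (f(ul) + f(ur)) - 0.5 * a * (ur - ul)


def fv_step(u, dx, dt):
    """Advance cell averages u by one explicit finite volume step (periodic domain)."""
    n = len(u)
    # flux[i] is the flux through the interface between cell i-1 and cell i;
    # u[-1] wraps around to the last cell, which gives periodicity.
    flux = [rusanov_flux(u[i - 1], u[i]) for i in range(n)]
    # Conservative update: each cell changes by (flux in) minus (flux out).
    return [u[i] - dt / dx * (flux[(i + 1) % n] - flux[i]) for i in range(n)]
```

Because every interface flux is added to one cell and subtracted from its neighbor, the sum of the cell averages is preserved up to rounding, which is the defining property of a finite volume scheme.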

[AI-60] DNN-Based Precoding in RIS-Aided mmWave MIMO Systems With Practical Phase Shift

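Quick read: This paper addresses the throughput loss in millimeter wave (mmWave) multiple-input multiple-output (MIMO) systems caused by obstructed direct paths, using a reconfigurable intelligent surface (RIS) to enhance MIMO transmission. The key to the solution is a trained deep neural network (DNN) that selects codewords quickly, replacing the computationally expensive exhaustive search (ES) and reducing computational overhead while keeping spectral efficiency close to optimal.

Link: https://arxiv.org/abs/2507.02824
Authors: Po-Heng Chou,Ching-Wen Chen,Wan-Jen Huang,Walid Saad,Yu Tsao,Ronald Y. Chang
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, 2 tables, accepted by IEEE Globecom 2024 Workshops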

Abstract:In this paper, the precoding design is investigated for maximizing the throughput of millimeter wave (mmWave) multiple-input multiple-output (MIMO) systems with obstructed direct communication paths. In particular, a reconfigurable intelligent surface (RIS) is employed to enhance MIMO transmissions, considering mmWave characteristics related to line-of-sight (LoS) and multipath effects. The traditional exhaustive search (ES) for optimal codewords in the continuous phase shift is computationally intensive and time-consuming. To reduce computational complexity, permuted discrete Fourier transform (DFT) vectors are used for codebook design, incorporating amplitude responses for practical or ideal RIS systems. However, even if the discrete phase shift is adopted in the ES, it results in significant computation and is time-consuming. Instead, the trained deep neural network (DNN) is developed to facilitate faster codeword selection. Simulation results show that the DNN maintains sub-optimal spectral efficiency even as the distance between the end-user and the RIS varies in the testing phase. These results highlight the potential of DNN in advancing RIS-aided systems.
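To show what the exhaustive search baseline looks like in its simplest form, the sketch below builds a plain DFT codebook and picks the codeword with the largest beamforming gain for a single channel vector. This is a heavily simplified stand-in: it ignores the RIS, the permutation of DFT vectors, and the amplitude responses mentioned in the abstract, and the function names are hypothetical.

```python
import cmath
import math


def dft_codebook(n):
    """Return n unit-norm DFT codewords of length n."""
    return [
        [cmath.exp(-2j * math.pi * r * k / n) / math.sqrt(n) for r in range(n)]
        for k in range(n)
    ]


def exhaustive_search(h, codebook):
    """Try every codeword and return (best index, best gain |h^H w|^2)."""
    gains = [
        abs(sum(hc.conjugate() * w for hc, w in zip(h, cw))) ** 2
        for cw in codebook
    ]
    best = max(range(len(codebook)), key=gains.__getitem__)
    return best, gains[best]
```

The search cost grows linearly with the codebook size, and jointly searching precoder and RIS codebooks multiplies these costs, which is the overhead a trained classifier that maps channel features directly to a codeword index is meant to avoid.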

[AI-61] Multi-agent Auditory Scene Analysis

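Quick read: This paper addresses two problems of traditional auditory scene analysis (ASA) systems that execute tasks linearly: increased response time and the sensitivity of later tasks to errors in the first task. The key to the solution is a multi-agent approach that runs the tasks in parallel and introduces feedback mechanisms to compensate for local errors, improving robustness while keeping computational complexity and response time low.

Link: https://arxiv.org/abs/2507.02755
Authors: Caleb Rascon,Luis Gato-Diaz,Eduardo García-Alarcón
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
Comments: Submitted to Applied Intelligence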

Abstract:Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data flow, where the sound sources are first located; then, using their location, each source is separated into its own audio stream; from each of which, information is extracted that is relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.). However, running these tasks linearly increases the overall response time, while making the last tasks (separation and classification) highly sensitive to errors of the first task (location). A considerable amount of effort and computational complexity has been employed in the state-of-the-art to develop techniques that are the least error-prone possible. However, doing so gives rise to an ASA system that is non-viable in many applications that require a small computational footprint and a low response time, such as bioacoustics, hearing-aid design, search and rescue, human-robot interaction, etc. To this effect, in this work, a multi-agent approach is proposed to carry out ASA where the tasks are run in parallel, with feedback loops between them to compensate for local errors, such as: using the quality of the separation output to correct the location error; and using the classification result to reduce the localization’s sensitivity towards interferences. The result is a multi-agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time. The complete proposed MASA system is provided as a framework that uses open-source tools for sound acquisition and reproduction (JACK) and inter-agent communication (ROS2), allowing users to add their own agents.

[AI-62] Synthesizable by Design: A Retrosynthesis-Guided Framework for Molecular Analog Generation

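Quick read: This paper addresses the insufficient synthetic feasibility of molecules that generative AI designs for desirable properties, a critical bottleneck in computational drug and materials discovery. The key to the solution is SynTwins, a framework that emulates expert chemists' strategies through a three-step process of retrosynthesis, similar building block searching, and virtual synthesis to design synthetically accessible molecular analogs, balancing structural similarity against synthetic feasibility.

Link: https://arxiv.org/abs/2507.02752
Authors: Shuan Chen,Gunwook Nam,Yousung Jung
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI)
Comments: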

Abstract:The disconnect between AI-generated molecules with desirable properties and their synthetic feasibility remains a critical bottleneck in computational drug and material discovery. While generative AI has accelerated the proposal of candidate molecules, many of these structures prove challenging or impossible to synthesize using established chemical reactions. Here, we introduce SynTwins, a novel retrosynthesis-guided molecular analog design framework that designs synthetically accessible molecular analogs by emulating expert chemist strategies through a three-step process: retrosynthesis, similar building block searching, and virtual synthesis. In comparative evaluations, SynTwins demonstrates superior performance in generating synthetically accessible analogs compared to state-of-the-art machine learning models while maintaining high structural similarity to original target molecules. Furthermore, when integrated with existing molecule optimization frameworks, our hybrid approach produces synthetically feasible molecules with property profiles comparable to unconstrained molecule generators, yet with synthesizability ensured. Our comprehensive benchmarking across diverse molecular datasets demonstrates that SynTwins effectively bridges the gap between computational design and experimental synthesis, providing a practical solution for accelerating the discovery of synthesizable molecules with desired properties for a wide range of applications.

[AI-63] Solving the Hubbard model with Neural Quantum States

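Quick read: This paper aims to tackle the modeling and computation of quantum many-body problems in strongly correlated fermionic systems, focusing on the doped two-dimensional (2D) Hubbard model, considered the minimal model for high-temperature superconductivity. The key to the solution is a Neural Quantum States (NQS) framework built on advanced transformer-based architectures and efficient optimization algorithms. Different attention heads directly encode correlations at different scales, which lets the ansatz capture long-range correlations and entanglement in strongly correlated systems. The work establishes the half-filled stripe in the ground state of the 2D Hubbard model with next-nearest-neighbor hopping, consistent with experimental observations in cuprates.

Link: https://arxiv.org/abs/2507.02644
Authors: Yuntian Gu,Wenrui Li,Heng Lin,Bo Zhan,Ruichen Li,Yifei Huang,Di He,Yantao Wu,Tao Xiang,Mingpu Qin,Liwei Wang,Dingshun Lv
Affiliation: Unknown
Categories: Strongly Correlated Electrons (cond-mat.str-el); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: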

Abstract:The rapid development of neural quantum states (NQS) has established it as a promising framework for studying quantum many-body systems. In this work, by leveraging cutting-edge transformer-based architectures and developing highly efficient optimization algorithms, we achieve state-of-the-art results for the doped two-dimensional (2D) Hubbard model, arguably the minimal model for high-Tc superconductivity. Interestingly, we find that different attention heads in the NQS ansatz can directly encode correlations at different scales, making it capable of capturing long-range correlations and entanglement in strongly correlated systems. With these advances, we establish the half-filled stripe in the ground state of the 2D Hubbard model with next-nearest-neighbor hopping, consistent with experimental observations in cuprates. Our work establishes NQS as a powerful tool for solving challenging many-fermion systems.

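[AI-64] Toward a Robust and Generalizable Metamaterial Foundation Model

Quick read: This paper aims to address three limitations of existing AI-driven design strategies for metamaterials: the need for task-specific retraining, poor out-of-distribution (OOD) generalization, and the need for separate models for forward and inverse design. The key to the solution is the Metamaterial Foundation Model (MetaFO), a Bayesian transformer-based foundation model that learns the underlying mechanics of metamaterials, makes probabilistic zero-shot predictions for unseen combinations of material properties and structural responses, and performs well in nonlinear inverse design, significantly expanding the design space and improving generalization.

Link: https://arxiv.org/abs/2507.02436
Authors: Namjung Kim,Dongseok Lee,Jongbin Yu,Sung Woong Cho,Dosung Lee,Yesol Park,Youngjoon Hong
Affiliation: Unknown
Categories: Optics (physics.optics); Artificial Intelligence (cs.AI)
Comments: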

Abstract:Advances in material functionalities drive innovations across various fields, where metamaterials, defined by structure rather than composition, are leading the way. Despite the rise of artificial intelligence (AI)-driven design strategies, their impact is limited by task-specific retraining, poor out-of-distribution (OOD) generalization, and the need for separate models for forward and inverse design. To address these limitations, we introduce the Metamaterial Foundation Model (MetaFO), a Bayesian transformer-based foundation model inspired by large language models. MetaFO learns the underlying mechanics of metamaterials, enabling probabilistic, zero-shot predictions across diverse, unseen combinations of material properties and structural responses. It also excels in nonlinear inverse design, even under OOD conditions. By treating metamaterials as an operator that maps material properties to structural responses, MetaFO uncovers intricate structure-property relationships and significantly expands the design space. This scalable and generalizable framework marks a paradigm shift in AI-driven metamaterial discovery, paving the way for next-generation innovations.

[AI-65] HelixDesign-Antibody: A Scalable Production-Grade Platform for Antibody Design Built on HelixFold3

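Quick read: This paper aims to address the long timelines, high resource consumption, and inefficient experimental screening of traditional antibody discovery. The key to the solution is HelixDesign-Antibody, a production-grade, high-throughput platform built on HelixFold3, which uses the high-accuracy structure prediction model to generate antibody candidate sequences at scale and relies on integrated high-performance computing (HPC) support for efficient screening, overcoming fragmented toolchains and high computational demands.

Link: https://arxiv.org/abs/2507.02345
Authors: Jie Gao,Jing Hu,Shanzhuo Zhang,Kunrui Zhu,Sheng Qian,Yueyang Huang,Xiaonan Zhang,Xiaomin Fang
Affiliation: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Comments: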

Abstract:Antibody engineering is essential for developing therapeutics and advancing biomedical research. Traditional discovery methods often rely on time-consuming and resource-intensive experimental screening. To enhance and streamline this process, we introduce HelixDesign-Antibody, a production-grade, high-throughput platform built on the high-accuracy structure prediction model HelixFold3. The platform facilitates the large-scale generation of antibody candidate sequences and evaluates their interaction with antigens. Integrated high-performance computing (HPC) support enables high-throughput screening, addressing challenges such as fragmented toolchains and high computational demands. Validation on multiple antigens showcases the platform's ability to generate diverse and high-quality antibodies, confirming a scaling law where exploring larger sequence spaces increases the likelihood of identifying optimal binders. This platform provides a seamless, accessible solution for large-scale antibody design and is available via the antibody design page of the PaddleHelix platform.

[AI-66] Resolving Turbulent Magnetohydrodynamics: A Hybrid Operator-Diffusion Framework

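Quick read: This paper aims to simulate, with high fidelity, the spatio-temporal evolution of two-dimensional, incompressible, resistive magnetohydrodynamic (MHD) turbulence across a broad range of Reynolds numbers (Re). Conventional deterministic surrogates struggle to capture the non-Gaussian statistics, intermittent structures, and cross-field correlations of complex turbulence. The key to the solution is a hybrid machine learning framework that combines Physics-Informed Neural Operators (PINOs) with score-based generative diffusion models: the PINOs' equation-constrained generalization predicts coherent low-frequency dynamics, while a conditional diffusion model stochastically corrects high-frequency residuals, enabling accurate modeling of fully developed turbulence.

Link: https://arxiv.org/abs/2507.02106
Authors: Semih Kacmaz,E. A. Huerta,Roland Haas
Affiliation: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Relativity and Quantum Cosmology (gr-qc); Computational Physics (physics.comp-ph)
Comments: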

Abstract:We present a hybrid machine learning framework that combines Physics-Informed Neural Operators (PINOs) with score-based generative diffusion models to simulate the full spatio-temporal evolution of two-dimensional, incompressible, resistive magnetohydrodynamic (MHD) turbulence across a broad range of Reynolds numbers ($\mathrm{Re}$). The framework leverages the equation-constrained generalization capabilities of PINOs to predict coherent, low-frequency dynamics, while a conditional diffusion model stochastically corrects high-frequency residuals, enabling accurate modeling of fully developed turbulence. Trained on a comprehensive ensemble of high-fidelity simulations with $\mathrm{Re} \in \{100, 250, 500, 750, 1000, 3000, 10000\}$, the approach achieves state-of-the-art accuracy in regimes previously inaccessible to deterministic surrogates. At $\mathrm{Re}=1000$ and $3000$, the model faithfully reconstructs the full spectral energy distributions of both velocity and magnetic fields late into the simulation, capturing non-Gaussian statistics, intermittent structures, and cross-field correlations with high fidelity. At extreme turbulence levels ($\mathrm{Re}=10000$), it remains the first surrogate capable of recovering the high-wavenumber evolution of the magnetic field, preserving large-scale morphology and enabling statistically meaningful predictions.

[AI-67] NGAT: A Node-level Graph Attention Network for Long-term Stock Prediction

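Quick read: This paper aims to address three key problems of graph representation learning methods for company representation in finance: (1) the advantages of relational information are obscured by limitations in downstream task design; (2) graph models designed specifically for stock prediction are often overly complex and generalize poorly; (3) experience-based construction of corporate relationship graphs lacks effective comparison of different graph structures. The key to the solution is proposing a long-term stock prediction task and developing a Node-level Graph Attention Network (NGAT) tailored to corporate relationship graphs, to improve performance and interpretability. Experiments also demonstrate the limitations of existing graph comparison methods, and results on two datasets confirm the effectiveness of the proposed task and model.

Link: https://arxiv.org/abs/2507.02018
Authors: Yingjie Niu,Mingchuan Zhao,Valerio Poti,Ruihai Dong
Affiliation: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: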

Abstract:Graph representation learning methods have been widely adopted in financial applications to enhance company representations by leveraging inter-firm relationships. However, current approaches face three key challenges: (1) The advantages of relational information are obscured by limitations in downstream task designs; (2) Existing graph models specifically designed for stock prediction often suffer from excessive complexity and poor generalization; (3) Experience-based construction of corporate relationship graphs lacks effective comparison of different graph structures. To address these limitations, we propose a long-term stock prediction task and develop a Node-level Graph Attention Network (NGAT) specifically tailored for corporate relationship graphs. Furthermore, we experimentally demonstrate the limitations of existing graph comparison methods based on model downstream task performance. Experimental results across two datasets consistently demonstrate the effectiveness of our proposed task and model. The project is publicly available on GitHub to encourage reproducibility and future research.

[AI-68] Integrating Large Language Models in Financial Investments and Market Analysis: A Survey

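Quick read: This paper addresses the limitations of traditional financial decision-making methods in handling complex, multi-source data, and asks how Large Language Models (LLMs) can improve analysis and real-time decision-making in finance. The key to the solution is using LLMs to process structured and unstructured data, extract meaningful insights, and apply them to stock selection, risk assessment, sentiment analysis, trading, and financial forecasting.

Link: https://arxiv.org/abs/2507.01990
Authors: Sedigheh Mahdavi,Jiating (Kristin) Chen,Pradeep Kumar Joshi,Lina Huertas Guativa,Upmanyu Singh
Affiliation: Unknown
Categories: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: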

Abstract:Large Language Models (LLMs) have been employed in financial decision making, enhancing analytical capabilities for investment strategies. Traditional investment strategies often utilize quantitative models, fundamental analysis, and technical indicators. However, LLMs have introduced new capabilities to process and analyze large volumes of structured and unstructured data, extract meaningful insights, and enhance decision-making in real time. This survey provides a structured overview of recent research on LLMs within the financial domain, categorizing research contributions into four main frameworks: LLM-based Frameworks and Pipelines, Hybrid Integration Methods, Fine-Tuning and Adaptation Approaches, and Agent-Based Architectures. It also reviews recent LLM research on applications in stock selection, risk assessment, sentiment analysis, trading, and financial forecasting. By reviewing the existing literature, this study highlights the capabilities, challenges, and potential directions of LLMs in financial markets.

[AI-69] Forecasting Labor Markets with LSTNet: A Multi-Scale Deep Learning Approach

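[Quick Read]: This paper addresses short-term employment change forecasting and long-term industry health assessment. The key to its solution is a deep learning model, the Long- and Short-Term Time-series Network (LSTNet), which processes multivariate time series data, including employment levels, wages, turnover rates, and job openings, to produce 7-day employment forecasts and an interpretable Industry Employment Health Index (IEHI).

Link: https://arxiv.org/abs/2507.01979
Authors: Adam Nelson-Archer, Aleia Sen, Meena Al Hasani, Sofia Davila, Jessica Le, Omar Abbouchi
Affiliation: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Undergraduate senior project, University of Houston, Department of Computer Science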

Abstract:We present a deep learning approach for forecasting short-term employment changes and assessing long-term industry health using labor market data from the U.S. Bureau of Labor Statistics. Our system leverages a Long- and Short-Term Time-series Network (LSTNet) to process multivariate time series data, including employment levels, wages, turnover rates, and job openings. The model outputs both 7-day employment forecasts and an interpretable Industry Employment Health Index (IEHI). Our approach outperforms baseline models across most sectors, particularly in stable industries, and demonstrates strong alignment between IEHI rankings and actual employment volatility. We discuss error patterns, sector-specific performance, and future directions for improving interpretability and generalization.

[AI-70] Accelerated Portfolio Optimization and Option Pricing with Reinforcement Learning

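[Quick Read]: This paper addresses two problems that arise when solving large linear systems in portfolio optimization and option pricing: direct inversion is computationally too expensive, and ill-conditioned systems converge slowly. The key to its solution is a reinforcement learning (RL)-driven framework that dynamically adjusts block-preconditioner sizes, accelerating iterative solver convergence and reducing computational cost.

Link: https://arxiv.org/abs/2507.01972
Authors: Hadi Keramati, Samaneh Jazayeri
Affiliation: Unknown
Categories: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: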

Abstract:We present a reinforcement learning (RL)-driven framework for optimizing block-preconditioner sizes in iterative solvers used in portfolio optimization and option pricing. The covariance matrix in portfolio optimization or the discretization of differential operators in option pricing models lead to large linear systems of the form $\mathbf{A}\textbf{x}=\textbf{b}$. Direct inversion of high-dimensional portfolio or fine-grid option pricing incurs a significant computational cost. Therefore, iterative methods are usually used for portfolios in real-world situations. Ill-conditioned systems, however, suffer from slow convergence. Traditional preconditioning techniques often require problem-specific parameter tuning. To overcome this limitation, we rely on RL to dynamically adjust the block-preconditioner sizes and accelerate iterative solver convergence. Evaluations on a suite of real-world portfolio optimization matrices demonstrate that our RL framework can be used to adjust preconditioning and significantly accelerate convergence and reduce computational cost. The proposed accelerated solver supports faster decision-making in dynamic portfolio allocation and real-time option pricing.
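
To make the role of the block size concrete, here is a minimal sketch of a block-preconditioned iterative solver. It assumes a block-Jacobi preconditioner inside conjugate gradient, which the abstract does not specify; the function names, the toy matrix, and the right-hand side are all hypothetical. The RL agent described in the abstract is not implemented here: `block_size` is simply the knob such an agent would choose.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve_dense(M, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / aug[r][r]
    return x


def apply_block_jacobi(A, r, block_size):
    """Apply M^{-1} r, where M keeps only the diagonal blocks of A."""
    z = []
    for start in range(0, len(A), block_size):
        end = min(start + block_size, len(A))
        block = [row[start:end] for row in A[start:end]]
        z.extend(solve_dense(block, r[start:end]))
    return z


def pcg(A, b, block_size, tol=1e-10, max_iter=500):
    """Preconditioned conjugate gradient for symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)
    z = apply_block_jacobi(A, r, block_size)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        z = apply_block_jacobi(A, r, block_size)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Toy SPD matrix with two strongly coupled 2x2 blocks (illustrative data only).
A = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [0.0, 0.0, 5.0, 2.0],
     [0.0, 0.0, 2.0, 6.0]]
b = [1.0, 2.0, 3.0, 4.0]

x_block, iters_block = pcg(A, b, block_size=2)  # blocks match the structure
x_diag, iters_diag = pcg(A, b, block_size=1)    # plain diagonal (Jacobi) scaling
```

On this toy matrix a block size of 2 makes the preconditioner equal to the matrix itself, so the solver stops after a single iteration, while a block size of 1 needs more. Larger blocks cost more to factor, which is the trade-off a learned block-size policy would have to balance.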

[AI-71] DeepSupp: Attention-Driven Correlation Pattern Analysis for Dynamic Time Series Support and Resistance Levels Identification

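[Quick Read]: This paper aims to address the limited adaptability of traditional support and resistance (SR) level identification methods in modern volatile markets; the core challenge is accurately capturing structural price levels in financial markets. The key to the proposed solution, DeepSupp, is to use multi-head attention to analyze spatial correlations and market microstructure relationships, combined with dynamic correlation matrix construction and an attention-based autoencoder for robust representation learning. Support levels are then extracted with the unsupervised clustering algorithm DBSCAN, improving the accuracy and market adaptability of support level detection.

Link: https://arxiv.org/abs/2507.01971
Authors: Boris Kriuk, Logic Ng, Zarif Al Hossain
Affiliation: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: 7 pages, 4 figures, 1 table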

Abstract:Support and resistance (SR) levels are central to technical analysis, guiding traders in entry, exit, and risk management. Despite widespread use, traditional SR identification methods often fail to adapt to the complexities of modern, volatile markets. Recent research has introduced machine learning techniques to address the following challenges, yet most focus on price prediction rather than structural level identification. This paper presents DeepSupp, a new deep learning approach for detecting financial support levels using multi-head attention mechanisms to analyze spatial correlations and market microstructure relationships. DeepSupp integrates advanced feature engineering, constructing dynamic correlation matrices that capture evolving market relationships, and employs an attention-based autoencoder for robust representation learning. The final support levels are extracted through unsupervised clustering, leveraging DBSCAN to identify significant price thresholds. Comprehensive evaluations on S&P 500 tickers demonstrate that DeepSupp outperforms six baseline methods, achieving state-of-the-art performance across six financial metrics, including essential support accuracy and market regime sensitivity. With consistent results across diverse market conditions, DeepSupp addresses critical gaps in SR level detection, offering a scalable and reliable solution for modern financial analysis. Our approach highlights the potential of attention-based architectures to uncover nuanced market patterns and improve technical trading strategies.
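
The final clustering step can be illustrated with a small, self-contained DBSCAN over scalar prices. This is only a sketch of the general idea: the abstract does not say what DBSCAN is run on (it may well be learned representations rather than raw prices), and the `eps` and `min_samples` values, the function names, and the price list below are all assumptions made for illustration.

```python
def dbscan_1d(values, eps, min_samples):
    """Plain DBSCAN on scalars; returns one label per value, -1 for noise."""
    n = len(values)
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])
        cluster += 1
    return labels


def support_levels(prices, labels):
    """Average price of each cluster, sorted ascending; noise is ignored."""
    groups = {}
    for price, label in zip(prices, labels):
        if label != -1:
            groups.setdefault(label, []).append(price)
    return sorted(sum(g) / len(g) for g in groups.values())


# Hypothetical swing lows (illustrative numbers, not market data).
lows = [100.0, 100.2, 99.9, 105.0, 105.1, 104.9, 120.0]
labels = dbscan_1d(lows, eps=0.5, min_samples=3)
levels = support_levels(lows, labels)
```

Dense groups of lows become candidate support levels, while the isolated low at 120.0 is labelled as noise and ignored. A density-based method suits this step because the number of levels does not have to be fixed in advance.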

[AI-72] A Scalable and Quantum-Accurate Foundation Model for Biomolecular Force Field via Linearly Tensorized Quadrangle Attention

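[Quick Read]: This paper aims to resolve the trade-off between accuracy and computational efficiency in traditional molecular simulation methods: classical force fields are computationally efficient but cannot accurately describe transition states and fine conformational details, while quantum mechanical methods are highly accurate but too expensive for large-scale or long-time simulations. The key to the proposed solution is LiTEN, an equivariant neural network with Tensorized Quadrangle Attention (TQA), which reparameterizes high-order tensor features via vector operations to model three- and four-body interactions efficiently with linear complexity, improving computational efficiency while maintaining high accuracy.

Link: https://arxiv.org/abs/2507.00884
Authors: Qun Su, Kai Zhu, Qiaolin Gou, Jintu Zhang, Renling Hu, Yurong Li, Yongze Wang, Hui Zhang, Ziyi You, Linlong Jiang, Yu Kang, Jike Wang, Chang-Yu Hsieh, Tingjun Hou
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
Comments: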

Abstract:Accurate atomistic biomolecular simulations are vital for disease mechanism understanding, drug discovery, and biomaterial design, but existing simulation methods exhibit significant limitations. Classical force fields are efficient but lack accuracy for transition states and fine conformational details critical in many chemical and biological processes. Quantum Mechanics (QM) methods are highly accurate but computationally infeasible for large-scale or long-time simulations. AI-based force fields (AIFFs) aim to achieve QM-level accuracy with efficiency but struggle to balance many-body modeling complexity, accuracy, and speed, often constrained by limited training data and insufficient validation for generalizability. To overcome these challenges, we introduce LiTEN, a novel equivariant neural network with Tensorized Quadrangle Attention (TQA). TQA efficiently models three- and four-body interactions with linear complexity by reparameterizing high-order tensor features via vector operations, avoiding costly spherical harmonics. Building on LiTEN, LiTEN-FF is a robust AIFF foundation model, pre-trained on the extensive nablaDFT dataset for broad chemical generalization and fine-tuned on SPICE for accurate solvated system simulations. LiTEN achieves state-of-the-art (SOTA) performance across most evaluation subsets of rMD17, MD22, and Chignolin, outperforming leading models such as MACE, NequIP, and EquiFormer. LiTEN-FF enables the most comprehensive suite of downstream biomolecular modeling tasks to date, including QM-level conformer searches, geometry optimization, and free energy surface construction, while offering 10x faster inference than MACE-OFF for large biomolecules (~1000 atoms). In summary, we present a physically grounded, highly efficient framework that advances complex biomolecular modeling, providing a versatile foundation for drug discovery and related applications.

Machine Learning

[LG-0] MvHo-IB: Multi-View Higher-Order Information Bottleneck for Brain Disorder Diagnosis MICCAI-25

Link: https://arxiv.org/abs/2507.02847
Authors: Kunyu Zhang, Qiang Li, Shujian Yu
Categories: Machine Learning (cs.LG)
Comments: Accepted by MICCAI-25, code is available at this https URL

Abstract:Recent evidence suggests that modeling higher-order interactions (HOIs) in functional magnetic resonance imaging (fMRI) data can enhance the diagnostic accuracy of machine learning systems. However, effectively extracting and utilizing HOIs remains a significant challenge. In this work, we propose MvHo-IB, a novel multi-view learning framework that integrates both pairwise interactions and HOIs for diagnostic decision-making, while automatically compressing task-irrelevant redundant information. MvHo-IB introduces several key innovations: (1) a principled method that combines O-information from information theory with a matrix-based Renyi alpha-order entropy estimator to quantify and extract HOIs, (2) a purpose-built Brain3DCNN encoder to effectively utilize these interactions, and (3) a new multi-view learning information bottleneck objective to enhance representation learning. Experiments on three benchmark fMRI datasets demonstrate that MvHo-IB achieves state-of-the-art performance, significantly outperforming previous methods, including recent hypergraph-based techniques. The implementation of MvHo-IB is available at this https URL.

[LG-1] LLM-Driven Treatment Effect Estimation Under Inference Time Text Confounding

Link: https://arxiv.org/abs/2507.02843
Authors: Yuchen Ma, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Estimating treatment effects is crucial for personalized decision-making in medicine, but this task faces unique challenges in clinical practice. At training time, models for estimating treatment effects are typically trained on well-structured medical datasets that contain detailed patient information. However, at inference time, predictions are often made using textual descriptions (e.g., descriptions with self-reported symptoms), which are incomplete representations of the original patient information. In this work, we make three contributions. (1) We show that the discrepancy between the data available during training time and inference time can lead to biased estimates of treatment effects. We formalize this issue as an inference time text confounding problem, where confounders are fully observed during training time but only partially available through text at inference time. (2) To address this problem, we propose a novel framework for estimating treatment effects that explicitly accounts for inference time text confounding. Our framework leverages large language models together with a custom doubly robust learner to mitigate biases caused by the inference time text confounding. (3) Through a series of experiments, we demonstrate the effectiveness of our framework in real-world applications.

[LG-2] Measurement as Bricolage: Examining How Data Scientists Construct Target Variables for Predictive Modeling Tasks

Link: https://arxiv.org/abs/2507.02819
Authors: Luke Guerdan, Devansh Saxena, Stevie Chancellor, Zhiwei Steven Wu, Kenneth Holstein
Categories: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Data scientists often formulate predictive modeling tasks involving fuzzy, hard-to-define concepts, such as the “authenticity” of student writing or the “healthcare need” of a patient. Yet the process by which data scientists translate fuzzy concepts into a concrete, proxy target variable remains poorly understood. We interview fifteen data scientists in education (N=8) and healthcare (N=7) to understand how they construct target variables for predictive modeling tasks. Our findings suggest that data scientists construct target variables through a bricolage process, involving iterative negotiation between high-level measurement objectives and low-level practical constraints. Data scientists attempt to satisfy five major criteria for a target variable through bricolage: validity, simplicity, predictability, portability, and resource requirements. To achieve this, data scientists adaptively use problem (re)formulation strategies, such as swapping out one candidate target variable for another when the first fails to meet certain criteria (e.g., predictability), or composing multiple outcomes into a single target variable to capture a more holistic set of modeling objectives. Based on our findings, we present opportunities for future HCI, CSCW, and ML research to better support the art and science of target variable construction.

[LG-3] Replicable Distribution Testing

Link: https://arxiv.org/abs/2507.02814
Authors: Ilias Diakonikolas, Jingyi Gao, Daniel Kane, Sihan Liu, Christopher Ye
Categories: Machine Learning (cs.LG)
Comments: 39 pages

Abstract:We initiate a systematic investigation of distribution testing in the framework of algorithmic replicability. Specifically, given independent samples from a collection of probability distributions, the goal is to characterize the sample complexity of replicably testing natural properties of the underlying distributions. On the algorithmic front, we develop new replicable algorithms for testing closeness and independence of discrete distributions. On the lower bound front, we develop a new methodology for proving sample complexity lower bounds for replicable testing that may be of broader interest. As an application of our technique, we establish near-optimal sample complexity lower bounds for replicable uniformity testing – answering an open question from prior work – and closeness testing.

[LG-4] In-Training Multicalibrated Survival Analysis for Healthcare via Constrained Optimization

Link: https://arxiv.org/abs/2507.02807
Authors: Thiti Suttaket, Stanley Kok
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Survival analysis is an important problem in healthcare because it models the relationship between an individual’s covariates and the onset time of an event of interest (e.g., death). It is important for survival models to be well-calibrated (i.e., for their predicted probabilities to be close to ground-truth probabilities) because badly calibrated systems can result in erroneous clinical decisions. Existing survival models are typically calibrated at the population level only, and thus run the risk of being poorly calibrated for one or more minority subpopulations. We propose a model called GRADUATE that achieves multicalibration by ensuring that all subpopulations are well-calibrated too. GRADUATE frames multicalibration as a constrained optimization problem, and optimizes both calibration and discrimination in-training to achieve a good balance between them. We mathematically prove that the optimization method used yields a solution that is both near-optimal and feasible with high probability. Empirical comparisons against state-of-the-art baselines on real-world clinical datasets demonstrate GRADUATE’s efficacy. In a detailed analysis, we elucidate the shortcomings of the baselines vis-a-vis GRADUATE’s strengths.

[LG-5] Learning to Coordinate Bidders in Non-Truthful Auctions

Link: https://arxiv.org/abs/2507.02801
Authors: Hu Fu, Tao Lin
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
Comments:

Abstract:In non-truthful auctions such as first-price and all-pay auctions, the independent strategic behaviors of bidders, with the corresponding equilibrium notion – Bayes Nash equilibria – are notoriously difficult to characterize and can cause undesirable outcomes. An alternative approach to designing better auction systems is to coordinate the bidders: let a mediator make incentive-compatible recommendations of correlated bidding strategies to the bidders, namely, implementing a Bayes correlated equilibrium (BCE). The implementation of BCE, however, requires knowledge of the distribution of bidders’ private valuations, which is often unavailable. We initiate the study of the sample complexity of learning Bayes correlated equilibria in non-truthful auctions. We prove that the BCEs in a large class of non-truthful auctions, including first-price and all-pay auctions, can be learned with a polynomial number $\tilde{O}(\frac{n}{\varepsilon^2})$ of samples from the bidders’ value distributions. Our technique is a reduction to the problem of estimating bidders’ expected utility from samples, combined with an analysis of the pseudo-dimension of the class of all monotone bidding strategies of bidders.

[LG-6] Understanding and Improving Length Generalization in Recurrent Models

Link: https://arxiv.org/abs/2507.02782
Authors: Ricardo Buitrago Ruiz, Albert Gu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Recently, recurrent models such as state space models and linear attention have become popular due to their linear complexity in the sequence length. Thanks to their recurrent nature, in principle they can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths, i.e., they fail to length generalize. In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when during training they are only exposed to a limited subset of the distribution of all attainable states (i.e. states that would be attained if the recurrence was applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps ($\sim 0.1\%$ of the pre-training budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. $2k \longrightarrow 128k$) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models.
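
The two state-initialization interventions named in the abstract can be sketched on a toy recurrence. Everything below is an illustrative assumption: the scalar exponential-moving-average recurrence stands in for a real state space model, and the mode names (`zeros`, `gaussian`, `state_passing`) and the `sigma` parameter are hypothetical labels rather than the authors' terminology.

```python
import random


def run_recurrence(inputs, h0, decay=0.9):
    """Toy linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t per channel."""
    h = list(h0)
    for x in inputs:
        h = [decay * hi + (1.0 - decay) * x for hi in h]
    return h


def initial_state(mode, dim, rng, other_sequence=None, sigma=1.0):
    """Choose the state a training sequence starts from."""
    if mode == "zeros":
        # Standard practice: every training sequence starts from the zero state.
        return [0.0] * dim
    if mode == "gaussian":
        # Intervention 1: start from Gaussian noise to cover more of state space.
        return [rng.gauss(0.0, sigma) for _ in range(dim)]
    if mode == "state_passing":
        # Intervention 2: start from the final state of a different sequence.
        return run_recurrence(other_sequence, [0.0] * dim, decay=0.9)
    raise ValueError(f"unknown mode: {mode}")


rng = random.Random(0)
h_zero = initial_state("zeros", 3, rng)
h_noise = initial_state("gaussian", 4, rng)
h_passed = initial_state("state_passing", 2, rng, other_sequence=[1.0, 1.0])
final = run_recurrence([0.5, 0.5, 0.5], h_passed)
```

In a real training loop the chosen state would replace the default zero state at the start of each training sequence, so the model sees states it would otherwise only reach on much longer inputs.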

[LG-7] Contextual Online Pricing with (Biased) Offline Data

Link: https://arxiv.org/abs/2507.02762
Authors: Yixuan Zhang, Ruihao Zhu, Qiaomin Xie
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 47 pages, 4 figures

Abstract:We study contextual online pricing with biased offline data. For the scalar price elasticity case, we identify the instance-dependent quantity $\delta^2$ that measures how far the offline data lies from the (unknown) online optimum. We show that the time length $T$, bias bound $V$, size $N$ and dispersion $\lambda_{\min}(\hat{\Sigma})$ of the offline data, and $\delta^2$ jointly determine the statistical complexity. An Optimism-in-the-Face-of-Uncertainty (OFU) policy achieves a minimax-optimal, instance-dependent regret bound $\tilde{\mathcal{O}}\big(d\sqrt{T} \wedge (V^2T + \frac{dT}{\lambda_{\min}(\hat{\Sigma}) + (N \wedge T) \delta^2})\big)$. For general price elasticity, we establish a worst-case, minimax-optimal rate $\tilde{\mathcal{O}}\big(d\sqrt{T} \wedge (V^2T + \frac{dT}{\lambda_{\min}(\hat{\Sigma})})\big)$ and provide a generalized OFU algorithm that attains it. When the bias bound $V$ is unknown, we design a robust variant that always guarantees sub-linear regret and strictly improves on purely online methods whenever the exact bias is small. These results deliver the first tight regret guarantees for contextual pricing in the presence of biased offline data. Our techniques also transfer verbatim to stochastic linear bandits with biased offline data, yielding analogous bounds.

[LG-8] Classification by Separating Hypersurfaces: An Entropic Approach

Link: https://arxiv.org/abs/2507.02732
Authors: Argimiro Arratia, Mahmoud El Daou, Henryk Gzyl
Categories: Machine Learning (cs.LG); Information Theory (cs.IT); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Comments: 15 pages, 10 tables, 4 figures

Abstract:We consider the following classification problem: Given a population of individuals characterized by a set of attributes represented as a vector in $\mathbb{R}^N$, the goal is to find a hyperplane in $\mathbb{R}^N$ that separates two sets of points corresponding to two distinct classes. This problem, with a history dating back to the perceptron model, remains central to machine learning. In this paper we propose a novel approach by searching for a vector of parameters in a bounded $N$-dimensional hypercube centered at the origin and a positive vector in $\mathbb{R}^M$, obtained through the minimization of an entropy-based function defined over the space of unknown variables. The method extends to polynomial surfaces, allowing the separation of data points by more complex decision boundaries. This provides a robust alternative to traditional linear or quadratic optimization techniques, such as support vector machines and gradient descent. Numerical experiments demonstrate the efficiency and versatility of the method in handling diverse classification tasks, including linear and non-linear separability.

[LG-9] Hierarchical Multi-Label Contrastive Learning for Protein-Protein Interaction Prediction Across Organisms

Link: https://arxiv.org/abs/2507.02724
Authors: Shiyi Liu, Buwen Liang, Yuetong Fang, Zixuan Jiang, Renjing Xu
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments:

Abstract:Recent advances in AI for science have highlighted the power of contrastive learning in bridging heterogeneous biological data modalities. Building on this paradigm, we propose HIPPO (HIerarchical Protein-Protein interaction prediction across Organisms), a hierarchical contrastive framework for protein-protein interaction (PPI) prediction, where protein sequences and their hierarchical attributes are aligned through multi-tiered biological representation matching. The proposed approach incorporates hierarchical contrastive loss functions that emulate the structured relationship among functional classes of proteins. The framework adaptively incorporates domain and family knowledge through a data-driven penalty mechanism, enforcing consistency between the learned embedding space and the intrinsic hierarchy of protein functions. Experiments on benchmark datasets demonstrate that HIPPO achieves state-of-the-art performance, outperforming existing methods and showing robustness in low-data regimes. Notably, the model demonstrates strong zero-shot transferability to other species without retraining, enabling reliable PPI prediction and functional inference even in less characterized or rare organisms where experimental data are limited. Further analysis reveals that hierarchical feature fusion is critical for capturing conserved interaction determinants, such as binding motifs and functional annotations. This work advances cross-species PPI prediction and provides a unified framework for interaction prediction in scenarios with sparse or imbalanced multi-species data.

[LG-10] A Comprehensive Machine Learning Framework for Micromobility Demand Prediction

Link: https://arxiv.org/abs/2507.02715
Authors: Omri Porat, Michael Fire, Eran Ben-Elia
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Dockless e-scooters, a key micromobility service, have emerged as eco-friendly and flexible urban transport alternatives. These services improve first and last-mile connectivity, reduce congestion and emissions, and complement public transport for short-distance travel. However, effective management of these services depends on accurate demand prediction, which is crucial for optimal fleet distribution and infrastructure planning. While previous studies have focused on analyzing spatial or temporal factors in isolation, this study introduces a framework that integrates spatial, temporal, and network dependencies for improved micromobility demand forecasting. This integration enhances accuracy while providing deeper insights into urban micromobility usage patterns. Our framework improves demand prediction accuracy by 27 to 49% over baseline models, demonstrating its effectiveness in capturing micromobility demand patterns. These findings support data-driven micromobility management, enabling optimized fleet distribution, cost reduction, and sustainable urban planning.

[LG-11] A Forget-and-Grow Strategy for Deep Reinforcement Learning Scaling in Continuous Control

Link: https://arxiv.org/abs/2507.02712
Authors: Zilin Kang, Chenyuan Hu, Yu Luo, Zhecheng Yuan, Ruijie Zheng, Huazhe Xu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep reinforcement learning for continuous control has recently achieved impressive progress. However, existing methods often suffer from primacy bias, a tendency to overfit early experiences stored in the replay buffer, which limits an RL agent’s sample efficiency and generalizability. In contrast, humans are less susceptible to such bias, partly due to infantile amnesia, where the formation of new neurons disrupts early memory traces, leading to the forgetting of initial experiences. Inspired by these dual processes of forgetting and growing in neuroscience, in this paper, we propose Forget and Grow (FoG), a new deep RL algorithm with two mechanisms introduced. First, Experience Replay Decay (ER Decay), “forgetting early experience”, which balances memory by gradually reducing the influence of early experiences. Second, Network Expansion, “growing neural capacity”, which enhances agents’ capability to exploit the patterns of existing data by dynamically adding new parameters during training. Empirical results on four major continuous control benchmarks with more than 40 tasks demonstrate the superior performance of FoG against SoTA existing deep RL algorithms, including BRO, SimBa, and TD-MPC2.
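
One simple way to reduce the influence of early experiences is to sample older transitions less often. The sketch below assumes an exponential, age-based sampling weight; the abstract does not give the actual decay schedule used by ER Decay, so the `decay` parameter and both function names are hypothetical.

```python
import random


def decay_weights(buffer_size, decay=0.99):
    """Sampling weight per slot; index buffer_size - 1 is the newest transition."""
    return [decay ** (buffer_size - 1 - i) for i in range(buffer_size)]


def sample_indices(buffer_size, batch_size, rng, decay=0.99):
    """Draw a minibatch of buffer indices, favouring recent transitions."""
    weights = decay_weights(buffer_size, decay)
    return rng.choices(range(buffer_size), weights=weights, k=batch_size)


weights = decay_weights(3, decay=0.5)  # oldest to newest
batch = sample_indices(100, 32, random.Random(0))
```

With `decay=0.5` and three stored transitions the weights are 0.25, 0.5, and 1.0, so the newest transition is four times as likely to be drawn as the oldest. Setting `decay=1.0` recovers uniform replay.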

[LG-12] Fluid Democracy in Federated Data Aggregation ICML2025

Link: https://arxiv.org/abs/2507.02710
Authors: Aditya Vema Reddy Kesari, Krishna Reddy Kesari
Categories: Machine Learning (cs.LG)
Comments: ICML 2025 Workshop on Collaborative and Federated Agentic Workflows

Abstract:Federated learning (FL) mechanisms typically require each client to transfer their weights to a central server, irrespective of how useful they are. In order to avoid wasteful data transfer costs from clients to the central server, we propose the use of consensus based protocols to identify a subset of clients with most useful model weights at each data transfer step. First, we explore the application of existing fluid democracy protocols to FL from a performance standpoint, comparing them with traditional one-person-one-vote (also known as 1p1v or FedAvg). We propose a new fluid democracy protocol named viscous-retained democracy that always does better than 1p1v under the same assumptions as existing fluid democracy protocols while also not allowing for influence accumulation. Secondly, we identify weaknesses of fluid democracy protocols from an adversarial lens in terms of their dependence on topology and/or number of adversaries required to negatively impact the global model weights. To this effect, we propose an algorithm (FedVRD) that dynamically limits the effect of adversaries while minimizing cost by leveraging the delegation topology.
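
For readers unfamiliar with fluid democracy, the sketch below shows basic delegation resolution followed by a weighted average of client updates. It illustrates the generic idea only, not the viscous-retained protocol or FedVRD from the abstract; the delegation list, the rule that votes caught in a cycle are dropped, and the function names are assumptions for illustration.

```python
def resolve_weights(delegations):
    """delegations[i] == i means client i votes directly; otherwise i delegates.

    Returns the number of votes each client ends up holding. Votes trapped in a
    delegation cycle are dropped (an assumption made for this sketch).
    """
    n = len(delegations)
    weights = [0] * n
    for voter in range(n):
        current, visited = voter, set()
        while delegations[current] != current and current not in visited:
            visited.add(current)
            current = delegations[current]
        if delegations[current] == current:  # reached a direct voter
            weights[current] += 1
    return weights


def aggregate(updates, weights):
    """Vote-weighted average of client update vectors."""
    total = sum(weights)
    dim = len(updates[0])
    return [
        sum(w * u[k] for w, u in zip(weights, updates)) / total
        for k in range(dim)
    ]


# Clients 1 and 2 delegate (2 -> 1 -> 0); clients 0 and 3 vote directly.
delegations = [0, 0, 1, 3]
weights = resolve_weights(delegations)
updates = [[1.0], [5.0], [9.0], [5.0]]  # toy one-dimensional model updates
global_update = aggregate(updates, weights)
```

Only clients holding a nonzero weight (here clients 0 and 3) would need to send their updates to the server, which is where the transfer savings come from. Setting every client to vote directly gives equal weights, i.e. a plain unweighted average.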

[LG-13] Multi-Agent Reinforcement Learning for Dynamic Pricing in Supply Chains: Benchmarking Strategic Agent Behaviours under Realistically Simulated Market Conditions

Link: https://arxiv.org/abs/2507.02698
Authors: Thomas Hazenberg, Yao Ma, Seyed Sahand Mohammadi Ziabari, Marijn van Rijswijk
Categories: Machine Learning (cs.LG); Econometrics (econ.EM)
Comments:

Abstract:This study investigates how Multi-Agent Reinforcement Learning (MARL) can improve dynamic pricing strategies in supply chains, particularly in contexts where traditional ERP systems rely on static, rule-based approaches that overlook strategic interactions among market actors. While recent research has applied reinforcement learning to pricing, most implementations remain single-agent and fail to model the interdependent nature of real-world supply chains. This study addresses that gap by evaluating the performance of three MARL algorithms: MADDPG, MADQN, and QMIX against static rule-based baselines, within a simulated environment informed by real e-commerce transaction data and a LightGBM demand prediction model. Results show that rule-based agents achieve near-perfect fairness (Jain’s Index: 0.9896) and the highest price stability (volatility: 0.024), but they fully lack competitive dynamics. Among MARL agents, MADQN exhibits the most aggressive pricing behaviour, with the highest volatility and the lowest fairness (0.5844). MADDPG provides a more balanced approach, supporting market competition (share volatility: 9.5 pp) while maintaining relatively high fairness (0.8819) and stable pricing. These findings suggest that MARL introduces emergent strategic behaviour not captured by static pricing rules and may inform future developments in dynamic pricing.
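
The fairness figures quoted above are values of Jain's index, which for allocations $x_1, \dots, x_n$ is $(\sum_i x_i)^2 / (n \sum_i x_i^2)$. It equals 1 when every agent receives the same amount and $1/n$ when a single agent receives everything. The sketch below is a minimal implementation; the abstract does not say which per-agent quantity the index is computed over, so the example inputs are purely illustrative.

```python
def jains_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    n = len(allocations)
    squared_sum = sum(x * x for x in allocations)
    if n == 0 or squared_sum == 0:
        raise ValueError("need at least one nonzero allocation")
    total = sum(allocations)
    return total * total / (n * squared_sum)


equal_split = jains_index([10.0, 10.0, 10.0, 10.0])  # every agent gets the same
winner_takes_all = jains_index([40.0, 0.0, 0.0, 0.0])  # one agent gets everything
```

Here `equal_split` is 1.0 and `winner_takes_all` is 0.25, the two extremes for four agents, which gives a scale for reading reported values such as 0.9896 and 0.5844.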

[LG-14] RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes

Link: https://arxiv.org/abs/2507.02690
Authors: Jiaxing Wang, Yifeng Yu, Jiahan Song, Bin Cao, Jing Fan, Ji Zhang
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: 15 pages, 7 figures. Business process prediction using reinforcement learning and heterogeneous graph neural networks

Abstract:Next activity prediction represents a fundamental challenge for optimizing business processes in service-oriented architectures such as microservices environments, distributed enterprise systems, and cloud-native platforms, which enables proactive resource allocation and dynamic service composition. Despite the prevalence of sequence-based methods, these approaches fail to capture non-sequential relationships that arise from parallel executions and conditional dependencies. Even though graph-based approaches address structural preservation, they suffer from homogeneous representations and static structures that apply uniform modeling strategies regardless of individual process complexity characteristics. To address these limitations, we introduce RLHGNN, a novel framework that transforms event logs into heterogeneous process graphs with three distinct edge types grounded in established process mining theory. Our approach creates four flexible graph structures by selectively combining these edges to accommodate different process complexities, and employs reinforcement learning formulated as a Markov Decision Process to automatically determine the optimal graph structure for each specific process instance. RLHGNN then applies heterogeneous graph convolution with relation-specific aggregation strategies to effectively predict the next activity. This adaptive methodology enables precise modeling of both sequential and non-sequential relationships in service interactions. Comprehensive evaluation on six real-world datasets demonstrates that RLHGNN consistently outperforms state-of-the-art approaches. Furthermore, it maintains an inference latency of approximately 1 ms per prediction, representing a highly practical solution suitable for real-time business process monitoring applications. The source code is available at this https URL.

[LG-15] Guided Generation for Developable Antibodies ICML2025

Link: https://arxiv.org/abs/2507.02670
Authors: Siqi Zhao, Joshua Moller, Porfi Quintero-Cadena, Lood van Niekerk
Categories: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: Published in ICML 2025 GenBio Workshop

Abstract:Therapeutic antibodies require not only high-affinity target engagement, but also favorable manufacturability, stability, and safety profiles for clinical effectiveness. These properties are collectively called ‘developability’. To enable a computational framework for optimizing antibody sequences for favorable developability, we introduce a guided discrete diffusion model trained on natural paired heavy- and light-chain sequences from the Observed Antibody Space (OAS) and quantitative developability measurements for 246 clinical-stage antibodies. To steer generation toward biophysically viable candidates, we integrate a Soft Value-based Decoding in Diffusion (SVDD) Module that biases sampling without compromising naturalness. In unconstrained sampling, our model reproduces global features of both the natural repertoire and approved therapeutics, and under SVDD guidance we achieve significant enrichment in predicted developability scores over unguided baselines. When combined with high-throughput developability assays, this framework enables an iterative, ML-driven pipeline for designing antibodies that satisfy binding and biophysical criteria in tandem.

[LG-16] On Efficient Bayesian Exploration in Model-Based Reinforcement Learning

Link: https://arxiv.org/abs/2507.02639
Authors: Alberto Caron,Chris Hicks,Vasilios Mavroudis
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In this work, we address the challenge of data-efficient exploration in reinforcement learning by examining existing principled, information-theoretic approaches to intrinsic motivation. Specifically, we focus on a class of exploration bonuses that targets epistemic uncertainty rather than the aleatoric noise inherent in the environment. We prove that these bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment’s dynamics and rewards, thereby aligning exploration with genuine knowledge gaps. Our analysis provides formal guarantees for IG-based approaches, which previously lacked theoretical grounding. To enable practical use, we also discuss tractable approximations via sparse variational Gaussian Processes, Deep Kernels and Deep Ensemble models. We then outline a general framework - Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) - which integrates model-based planning with information-theoretic bonuses to achieve sample-efficient deep exploration. We empirically demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks.
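The claim that an information-gain bonus tracks epistemic uncertainty and vanishes as the agent becomes certain can be illustrated in a toy conjugate-Gaussian setting. The sketch below is not the paper's PTS-BE method: it assumes a scalar unknown mean with a Gaussian prior and known observation noise, and the function names (`posterior_variance`, `info_gain_bonus`, `bonus_schedule`) are hypothetical. In that setting the expected information gain of one more observation is 0.5 * log(1 + epistemic variance / noise variance).

```python
import math


def posterior_variance(prior_var, noise_var, n_obs):
    """Epistemic (posterior) variance of a Gaussian mean after n_obs noisy observations."""
    return 1.0 / (1.0 / prior_var + n_obs / noise_var)


def info_gain_bonus(epistemic_var, noise_var):
    """Information gain (in nats) from one more observation in the Gaussian toy model."""
    return 0.5 * math.log1p(epistemic_var / noise_var)


def bonus_schedule(prior_var, noise_var, steps):
    """Bonus offered before each of `steps` successive observations of the same quantity."""
    return [
        info_gain_bonus(posterior_variance(prior_var, noise_var, n), noise_var)
        for n in range(steps)
    ]


# Illustrative values only: prior variance 1.0, aleatoric noise variance 0.5.
bonuses = bonus_schedule(prior_var=1.0, noise_var=0.5, steps=50)
```

In this toy model the bonus shrinks monotonically toward zero as observations accumulate, while raising the noise variance alone lowers the bonus instead of inflating it, which matches the abstract's distinction between epistemic and aleatoric uncertainty.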

[LG-17] High-Order Deep Meta-Learning with Category-Theoretic Interpretation

Link: https://arxiv.org/abs/2507.02634
Authors: David H. Mguni
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We introduce a new hierarchical deep learning framework for recursive higher-order meta-learning that enables neural networks (NNs) to construct, solve, and generalise across hierarchies of tasks. Central to this approach is a generative mechanism that creates virtual tasks, synthetic problem instances designed to enable the meta-learner to learn soft constraints and unknown generalisable rules across related tasks. Crucially, this enables the framework to generate its own informative, task-grounded datasets thereby freeing machine learning (ML) training from the limitations of relying entirely on human-generated data. By actively exploring the virtual point landscape and seeking out tasks lower-level learners find difficult, the meta-learner iteratively refines constraint regions. This enhances inductive biases, regularises the adaptation process, and produces novel, unanticipated tasks and constraints required for generalisation. Each meta-level of the hierarchy corresponds to a progressively abstracted generalisation of problems solved at lower levels, enabling a structured and interpretable learning progression. By interpreting meta-learners as category-theoretic functors that generate and condition a hierarchy of subordinate learners, we establish a compositional structure that supports abstraction and knowledge transfer across progressively generalised tasks. The category-theoretic perspective unifies existing meta-learning models and reveals how learning processes can be transformed and compared through functorial relationships, while offering practical design principles for structuring meta-learning. We speculate this architecture may underpin the next generation of NNs capable of autonomously generating novel, instructive tasks and their solutions, thereby advancing ML towards general artificial intelligence.

[LG-18] Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data

Link: https://arxiv.org/abs/2507.02628
Authors: Irena Girshovitz,Atai Ambus,Moni Shahar,Ran Gilad-Bachrach
Subjects: Machine Learning (cs.LG)
Comments: 18 pages, 4 figures (+ appendix)

Click to view abstract

Abstract:Background: The use of Electronic Health Records (EHRs) for epidemiological studies and artificial intelligence (AI) training is increasing rapidly. The reliability of the results depends on the accuracy and completeness of EHR data. However, EHR data often contain significant quality issues, including misrepresentations of subpopulations, biases, and systematic errors, as they are primarily collected for clinical and billing purposes. Existing quality assessment methods remain insufficient, lacking systematic procedures to assess data fitness for research. Methods: We present the Medical Data Pecking approach, which adapts unit testing and coverage concepts from software engineering to identify data quality concerns. We demonstrate our approach using the Medical Data Pecking Tool (MDPT), which consists of two main components: (1) an automated test generator that uses large language models and grounding techniques to create a test suite from data and study descriptions, and (2) a data testing framework that executes these tests, reporting potential errors and coverage. Results: We evaluated MDPT on three datasets: All of Us (AoU), MIMIC-III, and SyntheticMass, generating 55-73 tests per cohort across four conditions. These tests correctly identified 20-43 non-aligned or non-conforming data issues. We present a detailed analysis of the LLM-generated test suites in terms of reference grounding and value accuracy. Conclusion: Our approach incorporates external medical knowledge to enable context-sensitive data quality testing as part of the data analysis workflow to improve the validity of its outcomes. Our approach tackles these challenges from a quality assurance perspective, laying the foundation for further development such as additional data modalities and improved grounding methods. 

[LG-19] A Matrix Variational Auto-Encoder for Variant Effect Prediction in Pharmacogenes

Link: https://arxiv.org/abs/2507.02624
Authors: Antoine Honoré,Borja Rodríguez Gálvez,Yoomi Park,Yitian Zhou,Volker M. Lauschke,Ming Xiao
Subjects: Machine Learning (cs.LG)
Comments: 12+8 pages

Click to view abstract

Abstract:Variant effect predictors (VEPs) aim to assess the functional impact of protein variants, traditionally relying on multiple sequence alignments (MSAs). This approach assumes that naturally occurring variants are fit, an assumption challenged by pharmacogenomics, where some pharmacogenes experience low evolutionary pressure. Deep mutational scanning (DMS) datasets provide an alternative by offering quantitative fitness scores for variants. In this work, we propose a transformer-based matrix variational auto-encoder (matVAE) with a structured prior and evaluate its performance on 33 DMS datasets corresponding to 26 drug target and ADME proteins from the ProteinGym benchmark. Our model trained on MSAs (matVAE-MSA) outperforms the state-of-the-art DeepSequence model in zero-shot prediction on DMS datasets, despite using an order of magnitude fewer parameters and requiring less computation at inference time. We also compare matVAE-MSA to matENC-DMS, a model of similar capacity trained on DMS data, and find that the latter performs better on supervised prediction tasks. Additionally, incorporating AlphaFold-generated structures into our transformer model further improves performance, achieving results comparable to DeepSequence trained on MSAs and finetuned on DMS. These findings highlight the potential of DMS datasets to replace MSAs without significant loss in predictive performance, motivating further development of DMS datasets and exploration of their relationships to enhance variant effect prediction.

[LG-20] Lost in Latent Space: An Empirical Study of Latent Diffusion Models for Physics Emulation

Link: https://arxiv.org/abs/2507.02608
Authors: François Rozet,Ruben Ohana,Michael McCabe,Gilles Louppe,François Lanusse,Shirley Ho
Subjects: Machine Learning (cs.LG); Fluid Dynamics (physics.flu-dyn)
Comments:

Click to view abstract

Abstract:The steep computational cost of diffusion models at inference hinders their use as fast physics emulators. In the context of image and video generation, this computational drawback has been addressed by generating in the latent space of an autoencoder instead of the pixel space. In this work, we investigate whether a similar strategy can be effectively applied to the emulation of dynamical systems and at what cost. We find that the accuracy of latent-space emulation is surprisingly robust to a wide range of compression rates (up to 1000x). We also show that diffusion-based emulators are consistently more accurate than non-generative counterparts and compensate for uncertainty in their predictions with greater diversity. Finally, we cover practical design choices, spanning from architectures to optimizers, that we found critical to train latent-space emulators.

[LG-21] Alleviating Attack Data Scarcity: SCANIA's Experience Towards Enhancing In-Vehicle Cyber Security Measures

Link: https://arxiv.org/abs/2507.02607
Authors: Frida Sundfeldt,Bianca Widstam,Mahshid Helali Moghadam,Kuo-Yun Liang,Anders Vesterberg
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:The digital evolution of connected vehicles and the subsequent security risks emphasize the critical need for implementing in-vehicle cyber security measures such as intrusion detection and response systems. The continuous advancement of attack scenarios further highlights the need for adaptive detection mechanisms that can detect evolving, unknown, and complex threats. The effective use of ML-driven techniques can help address this challenge. However, constraints on implementing diverse attack scenarios on test vehicles due to safety, cost, and ethical considerations result in a scarcity of data representing attack scenarios. This limitation necessitates alternative efficient and effective methods for generating high-quality attack-representing data. This paper presents a context-aware attack data generator that generates attack inputs and corresponding in-vehicle network log, i.e., controller area network (CAN) log, representing various types of attack including denial of service (DoS), fuzzy, spoofing, suspension, and replay attacks. It utilizes parameterized attack models augmented with CAN message decoding and attack intensity adjustments to configure the attack scenarios with high similarity to real-world scenarios and promote variability. We evaluate the practicality of the generated attack-representing data within an intrusion detection system (IDS) case study, in which we develop and perform an empirical evaluation of two deep neural network IDS models using the generated data. In addition to the efficiency and scalability of the approach, the performance results of IDS models, high detection and classification capabilities, validate the consistency and effectiveness of the generated data as well. In this experience study, we also elaborate on the aspects influencing the fidelity of the data to real-world scenarios and provide insights into its application.

[LG-22] Padé Approximant Neural Networks for Enhanced Electric Motor Fault Diagnosis Using Vibration and Acoustic Data

Link: https://arxiv.org/abs/2507.02599
Authors: Sertac Kilickaya,Levent Eren
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Systems and Control (eess.SY)
Comments: Submitted to the Journal of Vibration Engineering Technologies

Click to view abstract

Abstract:Purpose: The primary aim of this study is to enhance fault diagnosis in induction machines by leveraging the Padé Approximant Neuron (PAON) model. While accelerometers and microphones are standard in motor condition monitoring, deep learning models with nonlinear neuron architectures offer promising improvements in diagnostic performance. This research addresses the question: Can Padé Approximant Neural Networks (PadéNets) outperform conventional Convolutional Neural Networks (CNNs) and Self-Organized Operational Neural Networks (Self-ONNs) in diagnosing electrical and mechanical faults using vibration and acoustic data? Methods: We evaluate and compare the diagnostic capabilities of three deep learning architectures: one-dimensional CNNs, Self-ONNs, and PadéNets. These models are tested on the University of Ottawa’s publicly available constant-speed induction motor datasets, which include both vibration and acoustic sensor data. The PadéNet model is designed to introduce enhanced nonlinearity and is compatible with unbounded activation functions such as Leaky ReLU. Results and Conclusion: PadéNets consistently outperformed the baseline models, achieving diagnostic accuracies of 99.96%, 98.26%, 97.61%, and 98.33% for accelerometers 1, 2, 3, and the acoustic sensor, respectively. The enhanced nonlinearity of PadéNets, together with their compatibility with unbounded activation functions, significantly improves fault diagnosis performance in induction motor condition monitoring.
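To make the idea of a Padé approximant neuron concrete, here is a minimal scalar sketch. A Padé approximant is a ratio of two polynomials; the exact parameterization used in the paper is not given in the abstract, so the form below (numerator polynomial divided by one plus the absolute value of a denominator polynomial, followed by Leaky ReLU) is an assumption chosen to keep the denominator away from zero. The names `pade_neuron` and `leaky_relu` are hypothetical.

```python
def leaky_relu(z, slope=0.01):
    """Unbounded activation: identity for z >= 0, small linear slope otherwise."""
    return z if z >= 0 else slope * z


def pade_neuron(x, p_coeffs, q_coeffs):
    """Rational (Pade-style) response for a scalar input x.

    p_coeffs: numerator coefficients a_0, a_1, ... (for x**0, x**1, ...).
    q_coeffs: denominator coefficients b_1, b_2, ... (for x**1, x**2, ...).
    The denominator is 1 + |Q(x)|, an assumed safeguard against division by zero.
    """
    numerator = sum(a * x ** i for i, a in enumerate(p_coeffs))
    denominator = 1.0 + abs(sum(b * x ** (j + 1) for j, b in enumerate(q_coeffs)))
    return leaky_relu(numerator / denominator)


# Illustrative coefficients only.
y = pade_neuron(2.0, p_coeffs=[1.0, 0.5], q_coeffs=[0.25])
```

With an empty `q_coeffs` the neuron reduces to an ordinary polynomial neuron, so the denominator is what adds the extra nonlinearity in this sketch.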

[LG-23] Scalable Interconnect Learning in Boolean Networks

Link: https://arxiv.org/abs/2507.02585
Authors: Fabian Kresse,Emily Yu,Christoph H. Lampert
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
Comments: 12 pages, 8 Figures

Click to view abstract

Abstract:Learned Differentiable Boolean Logic Networks (DBNs) already deliver efficient inference on resource-constrained hardware. We extend them with a trainable, differentiable interconnect whose parameter count remains constant as input width grows, allowing DBNs to scale to far wider layers than earlier learnable-interconnect designs while preserving their advantageous accuracy. To further reduce model size, we propose two complementary pruning stages: an SAT-based logic equivalence pass that removes redundant gates without affecting performance, and a similarity-based, data-driven pass that outperforms a magnitude-style greedy baseline and offers a superior compression-accuracy trade-off.
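A rough sketch of how a Boolean gate can be made differentiable may help here. It uses the common real-valued relaxations of a few gates and a softmax mixture over them; only four of the sixteen two-input gates are shown. The `soft_select` helper is a naive, hypothetical way to make the wiring learnable: its parameter count grows with the number of candidate inputs, which is the scaling issue the abstract says the proposed interconnect avoids. The paper's own parameterization is not described in the abstract and is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Real-valued relaxations that agree with the Boolean gates on inputs in {0, 1}.
GATES = {
    "and": lambda a, b: a * b,
    "or": lambda a, b: a + b - a * b,
    "xor": lambda a, b: a + b - 2 * a * b,
    "nand": lambda a, b: 1 - a * b,
}


def soft_gate(a, b, gate_logits):
    """Softmax-weighted mixture of relaxed gates; differentiable in the logits."""
    weights = softmax(gate_logits)
    return sum(w * gate(a, b) for w, gate in zip(weights, GATES.values()))


def soft_select(inputs, select_logits):
    """Naive learnable wiring: a softmax-weighted choice among candidate inputs."""
    weights = softmax(select_logits)
    return sum(w * x for w, x in zip(weights, inputs))


mixed = soft_gate(1.0, 1.0, [0.0, 0.0, 0.0, 0.0])    # uniform mixture of the four gates
nearly_and = soft_gate(1.0, 0.0, [100.0, 0.0, 0.0, 0.0])  # logits strongly favour AND
```

After training, such a mixture would typically be discretized by keeping the highest-weighted gate and input, leaving a pure logic circuit for inference.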

[LG-24] Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability

Link: https://arxiv.org/abs/2507.02559
Authors: Luca Baroni,Galvin Khara,Joachim Schaeffer,Marat Subkhankulov,Stefan Heimersheim
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Layer-wise normalization (LN) is an essential component of virtually all transformer-based large language models. While its effects on training stability are well documented, its role at inference time is poorly understood. Additionally, LN layers hinder mechanistic interpretability by introducing additional nonlinearities and increasing the interconnectedness of individual model components. Here, we show that all LN layers can be removed from every GPT-2 model with only a small increase in validation loss (e.g. +0.03 cross-entropy loss for GPT-2 XL). Thus, LN cannot play a substantial role in language modeling. We find that the amount of fine-tuning data needed for LN removal grows sublinearly with model parameters, suggesting scaling to larger models is feasible. We release a suite of LN-free GPT-2 models on Hugging Face. Furthermore, we test interpretability techniques on LN-free models. Direct logit attribution now gives the exact direct effect of individual components, while the accuracy of attribution patching does not significantly improve. We also confirm that GPT-2’s “confidence neurons” are inactive in the LN-free models. Our work clarifies the role of LN layers in language modeling, showing that GPT-2-class models can function without LN layers. We hope that our LN-free analogs of the GPT-2 family of models will enable more precise interpretability research and improve our understanding of language models.
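The remark that LN layers introduce additional nonlinearities can be checked with a few lines of code. The sketch below implements plain layer normalization without the learned gain and bias (an intentional simplification) and shows that scaling the input by 2 leaves the output essentially unchanged, whereas a linear map would double it. This illustrates why LN complicates linear attribution methods; it is not the paper's removal procedure.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


x = [1.0, 2.0, 4.0]
doubled = [2.0 * v for v in x]

ln_x = layer_norm(x)
ln_doubled = layer_norm(doubled)          # almost identical to ln_x
linear_guess = [2.0 * v for v in ln_x]    # what a linear map would have produced
```

Because the division by the standard deviation depends on the whole input vector, the contribution of any single component to the output cannot be separated out exactly, which is consistent with the abstract's note that direct logit attribution becomes exact once LN is removed.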

[LG-25] RetrySQL: text-to-SQL training with retry data for self-correcting query generation

Link: https://arxiv.org/abs/2507.02529
Authors: Alicja Rączkowska,Riccardo Belluzzo,Piotr Zieliński,Joanna Baran,Paweł Olszewski
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The text-to-SQL task is an active challenge in Natural Language Processing. Many existing solutions focus on using black-box language models extended with specialized components within customized end-to-end text-to-SQL pipelines. While these solutions use both closed-source proprietary language models and coding-oriented open-source models, there is a lack of research regarding SQL-specific generative models. At the same time, recent advancements in self-correcting generation strategies show promise for improving the capabilities of existing architectures. The application of these concepts to the text-to-SQL task remains unexplored. In this paper, we introduce RetrySQL, a new approach to training text-to-SQL generation models. We prepare reasoning steps for reference SQL queries and then corrupt them to create retry data that contains both incorrect and corrected steps, divided with a special token. We continuously pre-train an open-source coding model with this data and demonstrate that retry steps yield an improvement of up to 4 percentage points in both overall and challenging execution accuracy metrics, compared to pre-training without retry data. Additionally, we confirm that supervised fine-tuning with LoRA is ineffective for learning from retry data and that full-parameter pre-training is a necessary requirement for that task. We showcase that the self-correcting behavior is learned by the model and the increase in downstream accuracy metrics is a result of this additional skill. Finally, we incorporate RetrySQL-trained models into the full text-to-SQL pipeline and showcase that they are competitive in terms of execution accuracy with proprietary models that contain orders of magnitude more parameters. RetrySQL demonstrates that self-correction can be learned in the text-to-SQL task and provides a novel way of improving generation accuracy for SQL-oriented language models.

[LG-26] TFOC-Net: A Short-time Fourier Transform-based Deep Learning Approach for Enhancing Cross-Subject Motor Imagery Classification

Link: https://arxiv.org/abs/2507.02510
Authors: Ahmed G. Habashi,Ahmed M. Azab,Seif Eldawlatly,Gamal M. Aly
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
Comments:

Click to view abstract

Abstract:Cross-subject motor imagery (CS-MI) classification in brain-computer interfaces (BCIs) is a challenging task due to the significant variability in Electroencephalography (EEG) patterns across different individuals. This variability often results in lower classification accuracy compared to subject-specific models, presenting a major barrier to developing calibration-free BCIs suitable for real-world applications. In this paper, we introduce a novel approach that significantly enhances cross-subject MI classification performance through optimized preprocessing and deep learning techniques. Our approach involves direct classification of Short-Time Fourier Transform (STFT)-transformed EEG data, optimized STFT parameters, and a balanced batching strategy during training of a Convolutional Neural Network (CNN). This approach is uniquely validated across four different datasets, including three widely-used benchmark datasets leading to substantial improvements in cross-subject classification, achieving 67.60% on the BCI Competition IV Dataset 1 (IV-1), 65.96% on Dataset 2A (IV-2A), and 80.22% on Dataset 2B (IV-2B), outperforming state-of-the-art techniques. Additionally, we systematically investigate the classification performance using MI windows ranging from the full 4-second window to 1-second windows. These results establish a new benchmark for generalizable, calibration-free MI classification in addition to contributing a robust open-access dataset to advance research in this domain.
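For readers unfamiliar with the STFT step, the following is a minimal, dependency-free sketch of turning a one-dimensional signal into a time-frequency magnitude map that a CNN could consume. The sampling rate, window length, hop size, and the synthetic 10 Hz test signal are all illustrative assumptions; the abstract does not state the optimized STFT parameters, and a real pipeline would normally use an FFT-based library routine instead of this naive DFT.

```python
import cmath
import math


def stft_magnitude(signal, win_len, hop):
    """Magnitude STFT using a periodic Hann window and a naive DFT per frame.

    Returns a list of frames, each holding win_len // 2 + 1 magnitudes.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(win_len)]
        spectrum = []
        for k in range(win_len // 2 + 1):
            acc = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                for n in range(win_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Synthetic single-channel example: a 10 Hz sine sampled at 128 Hz for 2 seconds.
fs = 128
signal = [math.sin(2 * math.pi * 10 * n / fs) for n in range(2 * fs)]
frames = stft_magnitude(signal, win_len=64, hop=32)

# Frequency resolution is fs / win_len = 2 Hz, so 10 Hz falls in bin 5.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in frames]
```

The window length trades time resolution against frequency resolution, which is presumably why the abstract treats the STFT parameters as something to optimize.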

[LG-27] Online Conformal Prediction with Efficiency Guarantees

Link: https://arxiv.org/abs/2507.02496
Authors: Vaidehi Srinivas
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We study the problem of conformal prediction in a novel online framework that directly optimizes efficiency. In our problem, we are given a target miscoverage rate $\alpha > 0$ and a time horizon $T$. On each day $t \le T$ an algorithm must output an interval $I_t \subseteq [0, 1]$, then a point $y_t \in [0, 1]$ is revealed. The goal of the algorithm is to achieve coverage, that is, $y_t \in I_t$ on (close to) a $(1 - \alpha)$-fraction of days, while maintaining efficiency, that is, minimizing the average volume (length) of the intervals played. This problem is an online analogue to the problem of constructing efficient confidence intervals. We study this problem over arbitrary and exchangeable (random order) input sequences. For exchangeable sequences, we show that it is possible to construct intervals that achieve coverage $(1 - \alpha) - o(1)$, while having length upper bounded by the best fixed interval that achieves coverage in hindsight. For arbitrary sequences however, we show that any algorithm that achieves a $\mu$-approximation in average length compared to the best fixed interval achieving coverage in hindsight, must make a multiplicative factor more mistakes than $\alpha T$, where the multiplicative factor depends on $\mu$ and the aspect ratio of the problem. Our main algorithmic result is a matching algorithm that can recover all Pareto-optimal settings of $\mu$ and number of mistakes. Furthermore, our algorithm is deterministic and therefore robust to an adaptive adversary. This gap between the exchangeable and arbitrary settings is in contrast to the classical online learning problem. In fact, we show that no single algorithm can simultaneously be Pareto-optimal for arbitrary sequences and optimal for exchangeable sequences. On the algorithmic side, we give an algorithm that achieves the near-optimal tradeoff between the two cases.
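The benchmark in this abstract, the best fixed interval achieving coverage in hindsight, is easy to compute offline, and a small sketch may clarify what the online algorithm is compared against. The code below is not the paper's online algorithm: it assumes the full sequence is known, sorts the points, and slides a window to find the shortest interval that misses at most an alpha fraction of them. Function names and the sample data are hypothetical.

```python
def best_fixed_interval(points, alpha):
    """Shortest interval [lo, hi] covering all but at most alpha * T of the points."""
    ys = sorted(points)
    total = len(ys)
    allowed_misses = int(alpha * total + 1e-9)  # small slack for float rounding
    need = max(total - allowed_misses, 1)       # number of points that must be covered
    best = (ys[0], ys[need - 1])
    for i in range(1, total - need + 1):
        lo, hi = ys[i], ys[i + need - 1]
        if hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best


def coverage(points, interval):
    """Fraction of points that fall inside the closed interval."""
    lo, hi = interval
    return sum(lo <= y <= hi for y in points) / len(points)


# Illustrative sequence in [0, 1] with target miscoverage alpha = 0.4.
ys = [0.1, 0.5, 0.52, 0.55, 0.9]
interval = best_fixed_interval(ys, alpha=0.4)
covered = coverage(ys, interval)
```

An online algorithm has to commit to each interval before seeing the day's point, so the average interval length it plays is measured against the length of this hindsight interval.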

[LG-28] Variational Kolmogorov-Arnold Network IJCNN’25

Link: https://arxiv.org/abs/2507.02466
Authors: Francesco Alesiani,Henrik Christiansen,Federico Errica
Subjects: Machine Learning (cs.LG)
Comments: A preliminary (short paper) version presented at ComBayNS Workshop at IJCNN’25

Click to view abstract

Abstract:Kolmogorov Arnold Networks (KANs) are an emerging architecture for building machine learning models. KANs are based on the theoretical foundation of the Kolmogorov-Arnold Theorem and its expansions, which provide an exact representation of a multi-variate continuous bounded function as the composition of a limited number of univariate continuous functions. While such theoretical results are powerful, their use as a representation learning alternative to a multi-layer perceptron (MLP) hinges on the ad-hoc choice of the number of bases modeling each of the univariate functions. In this work, we show how to address this problem by adaptively learning a potentially infinite number of bases for each univariate function during training. We therefore model the problem as a variational inference optimization problem. Our proposal, called InfinityKAN, which uses backpropagation, extends the potential applicability of KANs by treating an important hyperparameter as part of the learning process.

[LG-29] Improving Consistency in Vehicle Trajectory Prediction Through Preference Optimization ITSC2025

Link: https://arxiv.org/abs/2507.02406
Authors: Caio Azevedo,Lina Achaji,Stefano Sabatini,Nicola Poerio,Grzegorz Bartyzel,Sascha Hornauer,Fabien Moutarde
Subjects: Machine Learning (cs.LG)
Comments: Accepted for publication at ITSC 2025

Click to view abstract

Abstract:Trajectory prediction is an essential step in the pipeline of an autonomous vehicle. Inaccurate or inconsistent predictions regarding the movement of agents in its surroundings lead to poorly planned maneuvers and potentially dangerous situations for the end-user. Current state-of-the-art deep-learning-based trajectory prediction models can achieve excellent accuracy on public datasets. However, when used in more complex, interactive scenarios, they often fail to capture important interdependencies between agents, leading to inconsistent predictions among agents in the traffic scene. Inspired by the efficacy of incorporating human preference into large language models, this work fine-tunes trajectory prediction models in multi-agent settings using preference optimization. By taking as input automatically calculated preference rankings among predicted futures in the fine-tuning process, our experiments–using state-of-the-art models on three separate datasets–show that we are able to significantly improve scene consistency while minimally sacrificing trajectory prediction accuracy and without adding any excess computational requirements at inference time.

[LG-30] Posterior Transition Modeling for Unsupervised Diffusion-Based Speech Enhancement

Link: https://arxiv.org/abs/2507.02391
Authors: Mostafa Sadeghi(MULTISPEECH),Jean-Eudes Ayilo(MULTISPEECH),Romain Serizel(MULTISPEECH),Xavier Alameda-Pineda(ROBOTLEARN)
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments:

Click to view abstract

Abstract:We explore unsupervised speech enhancement using diffusion models as expressive generative priors for clean speech. Existing approaches guide the reverse diffusion process using noisy speech through an approximate, noise-perturbed likelihood score, combined with the unconditional score via a trade-off hyperparameter. In this work, we propose two alternative algorithms that directly model the conditional reverse transition distribution of diffusion states. The first method integrates the diffusion prior with the observation model in a principled way, removing the need for hyperparameter tuning. The second defines a diffusion process over the noisy speech itself, yielding a fully tractable and exact likelihood score. Experiments on the WSJ0-QUT and VoiceBank-DEMAND datasets demonstrate improved enhancement metrics and greater robustness to domain shifts compared to both supervised and unsupervised baselines.

[LG-31] Deep Reinforcement Learning-Based DRAM Equalizer Parameter Optimization Using Latent Representations

Link: https://arxiv.org/abs/2507.02365
Authors: Muhammad Usama,Dong Eui Chang
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Equalizer parameter optimization for signal integrity in high-speed Dynamic Random Access Memory systems is crucial but often computationally demanding or model-reliant. This paper introduces a data-driven framework employing learned latent signal representations for efficient signal integrity evaluation, coupled with a model-free Advantage Actor-Critic reinforcement learning agent for parameter optimization. The latent representation captures vital signal integrity features, offering a fast alternative to direct eye diagram analysis during optimization, while the reinforcement learning agent derives optimal equalizer settings without explicit system models. Applied to industry-standard Dynamic Random Access Memory waveforms, the method achieved significant eye-opening window area improvements: 42.7% for cascaded Continuous-Time Linear Equalizer and Decision Feedback Equalizer structures, and 36.8% for Decision Feedback Equalizer-only configurations. These results demonstrate superior performance, computational efficiency, and robust generalization across diverse Dynamic Random Access Memory units compared to existing techniques. Core contributions include an efficient latent signal integrity metric for optimization, a robust model-free reinforcement learning strategy, and validated superior performance for complex equalizer architectures.

[LG-32] Path Planning using a One-shot-sampling Skeleton Map

Link: https://arxiv.org/abs/2507.02328
Authors: Gabriel O. Flores-Aquino,Octavio Gutierrez-Frias,Juan Irving Vasquez
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Path planning algorithms aim to compute a collision-free path, and many works focus on finding the optimal distance path. However, for some applications, a more suitable approach is to balance response time, safety of the paths, and path length. In this context, a skeleton map is a useful tool in graph-based schemes, as it provides an intrinsic representation of free configuration space. However, skeletonization algorithms are very resource-intensive, being primarily oriented towards image processing tasks. We propose an efficient path-planning methodology that finds safe paths within an acceptable processing time. This methodology leverages a Deep Denoising Auto-Encoder (DDAE) based on U-Net architecture to compute a skeletonized version of the navigation map, which we refer to as SkelUnet. The SkelUnet network facilitates exploration of the entire workspace through one-shot sampling (OSS), as opposed to the iterative process used by exact algorithms or the probabilistic sampling process. SkelUnet is trained and tested on a dataset consisting of 12,500 bi-dimensional dungeon maps. The motion planning methodology is evaluated in a simulation environment for an Unmanned Aerial Vehicle (UAV) using 250 previously unseen maps, and assessed with various navigation metrics to quantify the navigability of the computed paths. The results demonstrate that using SkelUnet to construct a roadmap offers significant advantages, such as connecting all regions of free workspace, providing safer paths, and reducing processing times. These characteristics make this method particularly suitable for mobile service robots in structured environments.

[LG-33] Transformer-based EEG Decoding: A Survey

Link: https://arxiv.org/abs/2507.02320
Authors: Haodong Zhang,Hongqi Li
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: Submitted to IEEE Journals

Click to view abstract

Abstract:Electroencephalography (EEG) is one of the most common signals used to capture the electrical activity of the brain, and the decoding of EEG, to acquire the user intents, has been at the forefront of brain-computer/machine interfaces (BCIs/BMIs) research. Compared to traditional EEG analysis methods with machine learning, the advent of deep learning approaches has gradually revolutionized the field by providing an end-to-end long-cascaded architecture, which can learn more discriminative features automatically. Among these, Transformer is renowned for its strong handling capability of sequential data by the attention mechanism, and the application of Transformers in various EEG processing tasks is increasingly prevalent. This article delves into a relevant survey, summarizing the latest application of Transformer models in EEG decoding since it appeared. The evolution of the model architecture is followed to sort and organize the related advances, in which we first elucidate the fundamentals of the Transformer that benefits EEG decoding and its direct application. Then, the common hybrid architectures that integrate the basic Transformer with other deep learning techniques (convolutional/recurrent/graph/spiking neural networks, generative adversarial networks, diffusion models, etc.) are overviewed in detail. The research advances of applying the modified intrinsic structures of customized Transformer have also been introduced. Finally, the current challenges and future development prospects in this rapidly evolving field are discussed. This paper aims to help readers gain a clear understanding of the current state of Transformer applications in EEG decoding and to provide valuable insights for future research endeavors.

[LG-34] Improving Constrained Generation in Language Models via Self-Distilled Twisted Sequential Monte Carlo

Link: https://arxiv.org/abs/2507.02315
Authors: Sooyeon Kim,Giung Nam,Juho Lee
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:Recent work has framed constrained text generation with autoregressive language models as a probabilistic inference problem. Among these, Zhao et al. (2024) introduced a promising approach based on twisted Sequential Monte Carlo, which incorporates learned twist functions and twist-induced proposals to guide the generation process. However, in constrained generation settings where the target distribution concentrates on outputs that are unlikely under the base model, learning becomes challenging due to sparse and uninformative reward signals. We show that iteratively refining the base model through self-distillation alleviates this issue by making the model progressively more aligned with the target, leading to substantial gains in generation quality.

[LG-35] Uncertainty-aware Reward Design Process

Link: https://arxiv.org/abs/2507.02256
Authors: Yang Yang,Xiaolu Zhou,Bosong Ding,Miao Xin
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 34 pages, 9 figures

Click to view abstract

Abstract:Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate reward function design. However, their suboptimal performance in numerical optimization often yields unsatisfactory reward quality, while the evolutionary search paradigm demonstrates inefficient utilization of simulation resources, resulting in prohibitively lengthy design cycles with disproportionate computational overhead. To address these challenges, we propose the Uncertainty-aware Reward Design Process (URDP), a novel framework that integrates large language models to streamline reward function design and evaluation in RL environments. URDP quantifies candidate reward function uncertainty based on self-consistency analysis, enabling simulation-free identification of ineffective reward components while discovering novel reward components. Furthermore, we introduce uncertainty-aware Bayesian optimization (UABO), which incorporates uncertainty estimation to significantly enhance hyperparameter configuration efficiency. Finally, we construct a bi-level optimization architecture by decoupling the reward component optimization and the hyperparameter tuning. URDP orchestrates synergistic collaboration between the reward logic reasoning of the LLMs and the numerical optimization strengths of the Bayesian Optimization. We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches.

[LG-36] Listwise Preference Alignment Optimization for Tail Item Recommendation

Link: https://arxiv.org/abs/2507.02255
Authors: Zihao Li,Chao Yang,Tong Zhang,Yakun Chen,Xianzhi Wang,Guandong Xu,Daoyi Dong
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Preference alignment has achieved greater success on Large Language Models (LLMs) and drawn broad interest in recommendation research. Existing preference alignment methods for recommendation either require explicit reward modeling or only support pairwise preference comparison. The former directly increases substantial computational costs, while the latter hinders training efficiency on negative samples. Moreover, no existing effort has explored preference alignment solutions for tail-item recommendation. To bridge the above gaps, we propose LPO4Rec, which extends the Bradley-Terry model from pairwise comparison to listwise comparison, to improve the efficiency of model training. Specifically, we derive a closed form optimal policy to enable more efficient and effective training without explicit reward modeling. We also present an adaptive negative sampling and reweighting strategy to prioritize tail items during optimization and enhance performance in tail-item recommendations. Besides, we theoretically prove that optimizing the listwise preference optimization (LPO) loss is equivalent to maximizing the upper bound of the optimal reward. Our experiments on three public datasets show that our method outperforms 10 baselines by a large margin, achieving up to 50% performance improvement while reducing 17.9% GPU memory usage when compared with direct preference optimization (DPO) in tail-item recommendation. Our code is available at this https URL.

[LG-37] VERBA: Verbalizing Model Differences Using Large Language Models

Link: https://arxiv.org/abs/2507.02241
Authors: Shravan Doda,Shashidhar Reddy Javaji,Zining Zhu
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In the current machine learning landscape, we face a “model lake” phenomenon: Given a task, there is a proliferation of trained models with similar performances despite different behavior. For model users attempting to navigate and select from the models, documentation comparing model pairs is helpful. However, for every N models there could be O(N^2) pairwise comparisons, a number prohibitive for the model developers to manually perform pairwise comparisons and prepare documentations. To facilitate fine-grained pairwise comparisons among models, we introduced VERBA. Our approach leverages a large language model (LLM) to generate verbalizations of model differences by sampling from the two models. We established a protocol that evaluates the informativeness of the verbalizations via simulation. We also assembled a suite with a diverse set of commonly used machine learning models as a benchmark. For a pair of decision tree models with up to 5% performance difference but 20-25% behavioral differences, VERBA effectively verbalizes their variations with up to 80% overall accuracy. When we included the models’ structural information, the verbalization’s accuracy further improved to 90%. VERBA opens up new research avenues for improving the transparency and comparability of machine learning models in a post-hoc manner.

[LG-38] PhysicsCorrect: A Training-Free Approach for Stable Neural PDE Simulations

Link: https://arxiv.org/abs/2507.02227
Authors: Xinquan Huang,Paris Perdikaris
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Neural networks have emerged as powerful surrogates for solving partial differential equations (PDEs), offering significant computational speedups over traditional methods. However, these models suffer from a critical limitation: error accumulation during long-term rollouts, where small inaccuracies compound exponentially, eventually causing complete divergence from physically valid solutions. We present PhysicsCorrect, a training-free correction framework that enforces PDE consistency at each prediction step by formulating correction as a linearized inverse problem based on PDE residuals. Our key innovation is an efficient caching strategy that precomputes the Jacobian and its pseudoinverse during an offline warm-up phase, reducing computational overhead by two orders of magnitude compared to standard correction approaches. Across three representative PDE systems – Navier-Stokes fluid dynamics, wave equations, and the chaotic Kuramoto-Sivashinsky equation – PhysicsCorrect reduces prediction errors by up to 100x while adding negligible inference time (under 5%). The framework integrates seamlessly with diverse architectures including Fourier Neural Operators, UNets, and Vision Transformers, effectively transforming unstable neural surrogates into reliable simulation tools that bridge the gap between deep learning’s computational efficiency and the physical fidelity demanded by practical scientific applications.

[LG-39] DecoRTL: A Run-time Decoding Framework for RTL Code Generation with LLMs

Link: https://arxiv.org/abs/2507.02226
Authors: Mohammad Akyash,Kimia Azar,Hadi Kamali
Subjects: Programming Languages (cs.PL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: Accepted to the International Conference on Computer-Aided Design (ICCAD 2025)

Click to view abstract

Abstract:As one of their many applications, large language models (LLMs) have recently shown promise in automating register transfer level (RTL) code generation. However, conventional LLM decoding strategies, originally designed for natural language, often fail to meet the structural and semantic demands of RTL, leading to hallucinated, repetitive, or invalid code outputs. In this paper, we first investigate the root causes of these decoding failures through an empirical analysis of token-level entropy during RTL generation. Our findings reveal that LLMs exhibit low confidence in regions of structural ambiguity or semantic complexity, showing that standard decoding strategies fail to differentiate between regions requiring determinism (syntax-critical regions) and those that benefit from creative exploratory variability (design-critical regions). Then, to overcome this, we introduce DecoRTL, a novel run-time decoding strategy, that is both syntax-aware and contrastive for RTL code generation. DecoRTL integrates two complementary components: (i) self-consistency sampling, which generates multiple candidates and re-ranks them based on token-level agreement to promote correctness while maintaining diversity; and (ii) syntax-aware temperature adaptation, which classifies tokens by their syntactical and functional roles and adjusts the sampling temperature accordingly, enforcing low temperature for syntax-critical tokens and higher temperature for exploratory ones. Our approach operates entirely at inference time without requiring any additional model fine-tuning. Through evaluations on multiple open-source LLMs using the VerilogEval benchmark, we demonstrate significant improvements in syntactic validity, functional correctness, and output diversity, while the execution overhead (performance overhead) is imperceptible.
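The syntax-aware temperature adaptation described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two region labels (`"syntax"` and `"design"`), the temperature values 0.2 and 1.0, and the function names are hypothetical and not taken from DecoRTL. The sketch only shows the general idea of sampling with a lower temperature where determinism is wanted and a higher one where variability is acceptable.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_temperature(region, low=0.2, high=1.0):
    """Hypothetical rule: low temperature for syntax-critical regions, high otherwise."""
    return low if region == "syntax" else high


def sample_token(logits, region, rng):
    """Sample a token index using the temperature chosen for the current region."""
    probs = softmax_with_temperature(logits, pick_temperature(region))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
    print(softmax_with_temperature(logits, pick_temperature("syntax")))
    print(softmax_with_temperature(logits, pick_temperature("design")))
    print(sample_token(logits, "design", rng))
```

In a real decoder the region label would come from a token classifier, which this sketch leaves out.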

[LG-40] Metric Design != Metric Behavior: Improving Metric Selection for the Unbiased Evaluation of Dimensionality Reduction IEEE-VIS2025

Link: https://arxiv.org/abs/2507.02225
Authors: Jiyeon Bae,Hyeon Jeon,Jinwook Seo
Subjects: Machine Learning (cs.LG)
Comments: IEEE VIS 2025 (short paper)

Click to view abstract

Abstract:Evaluating the accuracy of dimensionality reduction (DR) projections in preserving the structure of high-dimensional data is crucial for reliable visual analytics. Diverse evaluation metrics targeting different structural characteristics have thus been developed. However, evaluations of DR projections can become biased if highly correlated metrics–those measuring similar structural characteristics–are inadvertently selected, favoring DR techniques that emphasize those characteristics. To address this issue, we propose a novel workflow that reduces bias in the selection of evaluation metrics by clustering metrics based on their empirical correlations rather than on their intended design characteristics alone. Our workflow works by computing metric similarity using pairwise correlations, clustering metrics to minimize overlap, and selecting a representative metric from each cluster. Quantitative experiments demonstrate that our approach improves the stability of DR evaluation, which indicates that our workflow contributes to mitigating evaluation bias.
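The three steps named in the abstract (pairwise correlation, clustering, one representative per cluster) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' workflow: it uses Pearson correlation, groups metrics whose absolute correlation exceeds a hypothetical threshold of 0.9 via a simple union-find, and picks the first metric of each group as its representative. The metric names and scores are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cluster_metrics(scores, threshold=0.9):
    """Group metrics whose absolute pairwise correlation reaches the threshold."""
    names = list(scores)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(scores[a], scores[b])) >= threshold:
                parent[find(b)] = find(a)  # merge the two groups

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def representatives(clusters):
    """Assumed selection rule: take the first metric listed in each cluster."""
    return [cluster[0] for cluster in clusters]


if __name__ == "__main__":
    # Invented scores of three metrics over four projections.
    scores = {
        "m1": [0.1, 0.4, 0.7, 0.9],
        "m2": [0.2, 0.5, 0.8, 1.0],
        "m3": [0.9, 0.1, 0.8, 0.2],
    }
    clusters = cluster_metrics(scores)
    print(clusters, representatives(clusters))
```

Evaluating with one metric per cluster avoids counting the same structural characteristic several times.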

[LG-41] cVLA: Towards Efficient Camera-Space VLAs

Link: https://arxiv.org/abs/2507.02190
Authors: Max Argus,Jelena Bratulic,Houman Masnavi,Maxim Velikanov,Nick Heppert,Abhinav Valada,Thomas Brox
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 20 pages, 10 figures

Click to view abstract

Abstract:Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.

[LG-42] Statistical Inference for Responsiveness Verification

Link: https://arxiv.org/abs/2507.02169
Authors: Seung Hyun Cheon,Meredith Stewart,Bogdan Kulynych,Tsui-Wei Weng,Berk Ustun
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Many safety failures in machine learning arise when models are used to assign predictions to people (often in settings like lending, hiring, or content moderation) without accounting for how individuals can change their inputs. In this work, we introduce a formal validation procedure for the responsiveness of predictions with respect to interventions on their features. Our procedure frames responsiveness as a type of sensitivity analysis in which practitioners control a set of changes by specifying constraints over interventions and distributions over downstream effects. We describe how to estimate responsiveness for the predictions of any model and any dataset using only black-box access, and how to use these estimates to support tasks such as falsification and failure probability estimation. We develop algorithms that construct these estimates by generating a uniform sample of reachable points, and demonstrate how they can promote safety in real-world applications such as recidivism prediction, organ transplant prioritization, and content moderation.

[LG-43] Non-exchangeable Conformal Prediction for Temporal Graph Neural Networks KDD2025

Link: https://arxiv.org/abs/2507.02151
Authors: Tuo Wang,Jian Kang,Yujun Yan,Adithya Kulkarni,Dawei Zhou
Subjects: Machine Learning (cs.LG)
Comments: accepted by KDD 2025

Click to view abstract

Abstract:Conformal prediction for graph neural networks (GNNs) offers a promising framework for quantifying uncertainty, enhancing GNN reliability in high-stakes applications. However, existing methods predominantly focus on static graphs, neglecting the evolving nature of real-world graphs. Temporal dependencies in graph structure, node attributes, and ground truth labels violate the fundamental exchangeability assumption of standard conformal prediction methods, limiting their applicability. To address these challenges, in this paper, we introduce NCPNET, a novel end-to-end conformal prediction framework tailored for temporal graphs. Our approach extends conformal prediction to dynamic settings, mitigating statistical coverage violations induced by temporal dependencies. To achieve this, we propose a diffusion-based non-conformity score that captures both topological and temporal uncertainties within evolving networks. Additionally, we develop an efficiency-aware optimization algorithm that improves the conformal prediction process, enhancing computational efficiency and reducing coverage violations. Extensive experiments on diverse real-world temporal graphs, including WIKI, REDDIT, DBLP, and IBM Anti-Money Laundering dataset, demonstrate NCPNET’s capability to ensure guaranteed coverage in temporal graphs, achieving up to a 31% reduction in prediction set size on the WIKI dataset, significantly improving efficiency compared to state-of-the-art methods. Our data and code are available at this https URL.
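To make the exchangeability issue concrete, here is a small sketch of a weighted conformal threshold, a generic construction often used when calibration points are not exchangeable with the test point. It is not NCPNET and does not implement the paper's diffusion-based score. The weights (for example, larger for more recent calibration points), the unit weight reserved for the test point, and the example scores are all assumptions for illustration.

```python
def weighted_conformal_threshold(scores, weights, alpha):
    """Return the weighted (1 - alpha) quantile of calibration nonconformity scores.

    One unit of weight is reserved for the unseen test point and placed at +infinity,
    so the threshold is infinite when the calibration weights cannot reach 1 - alpha.
    """
    total = sum(weights) + 1.0
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    return float("inf")


def prediction_set(candidate_scores, threshold):
    """Keep every candidate label whose nonconformity score is within the threshold."""
    return [label for label, score in candidate_scores.items() if score <= threshold]


if __name__ == "__main__":
    calibration = [0.1, 0.2, 0.3, 0.4]  # invented nonconformity scores
    uniform = [1.0, 1.0, 1.0, 1.0]      # equal weights, as in the exchangeable case
    recent = [0.1, 0.1, 1.0, 1.0]       # hypothetical weights favoring recent points
    print(weighted_conformal_threshold(calibration, uniform, alpha=0.5))
    print(weighted_conformal_threshold(calibration, recent, alpha=0.5))
    print(prediction_set({"a": 0.25, "b": 0.35}, 0.3))
```

With equal weights the threshold matches standard split conformal prediction; down-weighting older points shifts it toward the scores of recent calibration data.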

[LG-44] CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs

Link: https://arxiv.org/abs/2507.02128
Authors: Jingyu Pan,Isaac Jacobson,Zheng Zhao,Tung-Chieh Chen,Guanglei Zhou,Chen-Chia Chang,Vineet Rashingkar,Yiran Chen
Subjects: Machine Learning (cs.LG)
Comments: Accepted by ICCAD 2025

Click to view abstract

Abstract:Modern very large-scale integration (VLSI) design requires the implementation of integrated circuits using electronic design automation (EDA) tools. Due to the complexity of EDA algorithms, the vast parameter space poses a huge challenge to chip design optimization, as the combination of even moderate numbers of parameters creates an enormous solution space to explore. Manual parameter selection remains industrial practice despite being excessively laborious and limited by expert experience. To address this issue, we present CROP, the first large language model (LLM)-powered automatic VLSI design flow tuning framework. Our approach includes: (1) a scalable methodology for transforming RTL source code into dense vector representations, (2) an embedding-based retrieval system for matching designs with semantically similar circuits, and (3) a retrieval-augmented generation (RAG)-enhanced LLM-guided parameter search system that constrains the search process with prior knowledge from similar designs. Experiment results demonstrate CROP’s ability to achieve superior quality-of-results (QoR) with fewer iterations than existing approaches on industrial designs, including a 9.9% reduction in power consumption.

[LG-45] Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks ICML25

Link: https://arxiv.org/abs/2507.02119
Authors: Shikai Qiu,Lechao Xiao,Andrew Gordon Wilson,Jeffrey Pennington,Atish Agarwala
Subjects: Machine Learning (cs.LG)
Comments: ICML 25. Code available at this https URL

Click to view abstract

Abstract:What scaling limits govern neural network training dynamics when model size and training time grow in tandem? We show that despite the complex interactions between architecture, training algorithms, and data, compute-optimally trained models exhibit a remarkably precise universality. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, the collapse becomes so tight that differences in the normalized curves across models fall below the noise floor of individual loss curves across random seeds, a phenomenon we term supercollapse. We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction, and find it breaks down when hyperparameters are scaled suboptimally, providing a precise and practical indicator of good scaling. We explain these phenomena by connecting collapse to the power-law structure in typical neural scaling laws, and analyzing a simple yet surprisingly effective model of SGD noise dynamics that accurately predicts loss curves across various learning rate schedules and quantitatively explains the origin of supercollapse.
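The normalization behind the collapse can be shown on toy data. The sketch below follows only what the abstract states (compute and loss are each rescaled to equal one at the end of training) and applies it to two synthetic power-law curves. The curve coefficients and compute ranges are invented, and the paper's actual procedure may include further steps not reproduced here.

```python
def normalize_curve(compute, loss):
    """Rescale a loss curve so that compute and loss both equal 1 at the final point."""
    final_compute, final_loss = compute[-1], loss[-1]
    return (
        [c / final_compute for c in compute],
        [value / final_loss for value in loss],
    )


def max_gap(curve_a, curve_b):
    """Largest pointwise difference between two normalized loss curves."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))


if __name__ == "__main__":
    # Two invented models: same power-law exponent, different scale and budget.
    small_compute = [1.0, 2.0, 4.0]
    large_compute = [10.0, 20.0, 40.0]
    small_loss = [2.0 * c ** -0.5 for c in small_compute]
    large_loss = [3.0 * c ** -0.5 for c in large_compute]

    x_small, y_small = normalize_curve(small_compute, small_loss)
    x_large, y_large = normalize_curve(large_compute, large_loss)
    print(x_small, y_small)
    print(x_large, y_large)
    print(max_gap(y_small, y_large))
```

For pure power laws with a shared exponent the normalized curves coincide exactly, which is the simplest case of the collapse the abstract connects to scaling-law structure.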

[LG-46] Parametric Neural Amp Modeling with Active Learning

Link: https://arxiv.org/abs/2507.02109
Authors: Florian Grötschla,Luca A. Lanzendörfer,Longxiang Jiao,Roger Wattenhofer
Subjects: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at ISMIR 2025 as Late-Breaking Demo (LBD)

Click to view abstract

Abstract:We introduce PANAMA, an active learning framework for the training of end-to-end parametric guitar amp models using a WaveNet-like architecture. With PANAMA, one can create a virtual amp by recording samples that are determined by an active learning strategy to use a minimum amount of datapoints (i.e., amp knob settings). We show that gradient-based optimization algorithms can be used to determine the optimal datapoints to sample, and that the approach helps under a constrained number of samples.

[LG-47] A robust and adaptive MPC formulation for Gaussian process models

Link: https://arxiv.org/abs/2507.02098
Authors: Mathieu Dubied,Amon Lahr,Melanie N. Zeilinger,Johannes Köhler
Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:In this paper, we present a robust and adaptive model predictive control (MPC) framework for uncertain nonlinear systems affected by bounded disturbances and unmodeled nonlinearities. We use Gaussian Processes (GPs) to learn the uncertain dynamics based on noisy measurements, including those collected during system operation. As a key contribution, we derive robust predictions for GP models using contraction metrics, which are incorporated in the MPC formulation. The proposed design guarantees recursive feasibility, robust constraint satisfaction and convergence to a reference state, with high probability. We provide a numerical example of a planar quadrotor subject to difficult-to-model ground effects, which highlights significant improvements achieved through the proposed robust prediction method and through online learning.

[LG-48] Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model

Link: https://arxiv.org/abs/2507.02089
Authors: Xingtu Liu,Lin F. Yang,Sharan Vaswani
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (MDVI) (Kitamura et al., 2023) as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\epsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\epsilon^2}\right)$ samples. We note that these results exhibit a near-optimal dependence on both $d$ and $\epsilon$. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\epsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.

[LG-49] AIRES: Accelerating Out-of-Core GCNs via Algorithm-System Co-Design

Link: https://arxiv.org/abs/2507.02006
Authors: Shakya Jayakody,Youpeng Zhao,Jun Wang
Subjects: Machine Learning (cs.LG)
Comments: 36th IEEE International Conference on Application-Specific Systems, Architectures and Processors. (Accepted)

Click to view abstract

Abstract:Graph convolutional networks (GCNs) are fundamental in various scientific applications, ranging from biomedical protein-protein interactions (PPI) to large-scale recommendation systems. An essential component for modeling graph structures in GCNs is sparse general matrix-matrix multiplication (SpGEMM). As the size of graph data continues to scale up, SpGEMMs are often conducted in an out-of-core fashion due to limited GPU memory space in resource-constrained systems. Albeit recent efforts that aim to alleviate the memory constraints of out-of-core SpGEMM through either GPU feature caching, hybrid CPU-GPU memory layout, or performing the computation in sparse format, current systems suffer from both high I/O latency and GPU under-utilization issues. In this paper, we first identify the problems of existing systems, where sparse format data alignment and memory allocation are the main performance bottlenecks, and propose AIRES, a novel algorithm-system co-design solution to accelerate out-of-core SpGEMM computation for GCNs. Specifically, from the algorithm angle, AIRES proposes to alleviate the data alignment issues on the block level for matrices in sparse formats and develops a tiling algorithm to facilitate row block-wise alignment. On the system level, AIRES employs a three-phase dynamic scheduling that features a dual-way data transfer strategy utilizing a tiered memory system: integrating GPU memory, GPU Direct Storage (GDS), and host memory to reduce I/O latency and improve throughput. Evaluations show that AIRES significantly outperforms the state-of-the-art methods, achieving up to 1.8x lower latency in real-world graph processing benchmarks.

[LG-50] Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames

Link: https://arxiv.org/abs/2507.02001
Authors: Anurag Arnab,Ahmet Iscen,Mathilde Caron,Alireza Fathi,Cordelia Schmid
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Despite recent advances in Vision-Language Models (VLMs), long-video understanding remains a challenging problem. Although state-of-the-art long-context VLMs can process around 1000 input frames, they still struggle to effectively leverage this sequence length, and succumb to irrelevant distractors within the context window. We present Temporal Chain of Thought, an inference strategy for video question-answering that curates the model’s input context. We use the VLM itself to iteratively identify and extract the most relevant frames from the video, which are then used for answering. We demonstrate how leveraging more computation at inference-time to select the most relevant context leads to improvements in accuracy, in agreement with recent work on inference-time scaling of LLMs. Moreover, we achieve state-of-the-art results on 4 diverse video question-answering datasets, showing consistent improvements with 3 different VLMs. In particular, our method shines on longer videos which would not otherwise fit within the model’s context window: On longer videos of more than 1 hour on LVBench, our approach using a context window of 32K outperforms the same VLM using standard inference with a 700K context window by 2.8 points.

[LG-51] Continuous Wavelet Transform and Siamese Network-Based Anomaly Detection in Multi-variate Semiconductor Process Time Series

Link: https://arxiv.org/abs/2507.01999
Authors: Bappaditya Dey,Daniel Sorensen,Minjin Hwang,Sandip Halder
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 7 figures, submitted to IEEE Transactions on Semiconductor Manufacturing

Click to view abstract

Abstract:Semiconductor manufacturing is an extremely complex process, characterized by thousands of interdependent parameters collected across diverse tools and process steps. Multi-variate time-series (MTS) analysis has emerged as a critical methodology for enabling real-time monitoring, fault detection, and predictive maintenance in such environments. However, anomaly prediction in semiconductor fabrication presents several critical challenges, including high data dimensionality, severe class imbalance due to the rarity of true faults, noisy and missing measurements, and non-stationary behavior of production systems. Furthermore, the complex interdependencies between variables and the delayed emergence of faults across downstream stages complicate both anomaly detection and root-cause analysis. This paper presents a novel and generic approach for anomaly detection in MTS data using machine learning. The proposed methodology consists of three main steps: a) converting MTS data into image-based representations using the Continuous Wavelet Transform (CWT), b) developing a multi-class image classifier by fine-tuning a pretrained VGG-16 architecture on custom CWT image datasets, and c) constructing a Siamese network composed of two identical sub-networks, each utilizing the fine-tuned VGG-16 as a backbone. The network takes pairs of CWT images as input: one serving as a reference or anchor (representing a known-good signal), and the other as a query (representing an unknown signal). The model then compares the embeddings of both inputs to determine whether they belong to the same class at a given time step. Our approach demonstrates high accuracy in identifying anomalies on a real FAB process time-series dataset, offering a promising solution for offline anomaly detection in process and tool trace data. Moreover, the approach is flexible and can be applied in both supervised and semi-supervised settings.

[LG-52] Positive region preserved random sampling: an efficient feature selection method for massive data

Link: https://arxiv.org/abs/2507.01998
Authors: Hexiang Bai,Deyu Li,Jiye Liang,Yanhui Zhai
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Selecting relevant features is an important and necessary step for intelligent machines to maximize their chances of success. However, intelligent machines generally do not have enough computing resources when faced with huge volumes of data. This paper develops a new method based on sampling techniques and rough set theory to address the challenge of feature selection for massive data. To this end, this paper proposes using the ratio of discernible object pairs to all object pairs that should be distinguished to measure the discriminatory ability of a feature set. Based on this measure, a new feature selection method is proposed. This method constructs positive region preserved samples from massive data to find a feature subset with high discriminatory ability. Compared with other methods, the proposed method has two advantages. First, it is able to select a feature subset that can preserve the discriminatory ability of all the features of the target massive data set within an acceptable time on a personal computer. Second, a lower bound on the probability that an object pair that should be distinguished can be discerned using the selected feature subset can be estimated before finding reducts. Furthermore, 11 data sets of different sizes were used to validate the proposed method. The results show that approximate reducts can be found in a very short period of time, and the discriminatory ability of the final reduct is larger than the estimated lower bound. Experiments on four large-scale data sets also showed that an approximate reduct with high discriminatory ability can be obtained in reasonable time on a personal computer.
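The discriminatory-ability measure mentioned in the abstract, the ratio of discernible object pairs to all pairs that should be distinguished, can be sketched on a toy decision table. The sketch assumes that "pairs that should be distinguished" means pairs of objects with different decision labels, and that a pair is discernible when the objects differ on at least one selected feature. Both readings are common in rough set theory but are assumptions here, as are the function name and the data.

```python
from itertools import combinations


def discernibility_ratio(rows, labels, features):
    """Fraction of differently labeled object pairs that the chosen features tell apart."""
    should_distinguish = 0
    discerned = 0
    for i, j in combinations(range(len(rows)), 2):
        if labels[i] == labels[j]:
            continue  # same decision label: the pair does not need to be distinguished
        should_distinguish += 1
        if any(rows[i][f] != rows[j][f] for f in features):
            discerned += 1
    # With no pairs to distinguish, any feature set is trivially sufficient.
    return discerned / should_distinguish if should_distinguish else 1.0


if __name__ == "__main__":
    # Invented table: two binary features; the label equals the second feature.
    rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [0, 1, 0, 1]
    print(discernibility_ratio(rows, labels, features=[0]))
    print(discernibility_ratio(rows, labels, features=[1]))
    print(discernibility_ratio(rows, labels, features=[0, 1]))
```

In this toy table the second feature alone reaches the same ratio as the full feature set, which is the kind of subset a feature selection method looks for.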

[LG-53] A Comprehensive Survey on Network Traffic Synthesis: From Statistical Models to Deep Learning

Link: https://arxiv.org/abs/2507.01976
Authors: Nirhoshan Sivaroopan,Kaushitha Silva,Chamara Madarasingha,Thilini Dahanayaka,Guillaume Jourjon,Anura Jayasumana,Kanchana Thilakarathna
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Synthetic network traffic generation has emerged as a promising alternative for various data-driven applications in the networking domain. It enables the creation of synthetic data that preserves real-world characteristics while addressing key challenges such as data scarcity, privacy concerns, and purity constraints associated with real data. In this survey, we provide a comprehensive review of synthetic network traffic generation approaches, covering essential aspects such as data types, generation models, and evaluation methods. With the rapid advancements in AI and machine learning, we focus particularly on deep learning-based techniques while also providing a detailed discussion of statistical methods and their extensions, including commercially available tools. Furthermore, we highlight open challenges in this domain and discuss potential future directions for further research and development. This survey serves as a foundational resource for researchers and practitioners, offering a structured analysis of existing methods, challenges, and opportunities in synthetic network traffic generation.

[LG-54] Acoustic evaluation of a neural network dedicated to the detection of animal vocalisations

Link: https://arxiv.org/abs/2507.01974
Authors: Jérémy Rouch (CRNL-ENES), M Ducrettet (CRNL-ENES, ISYEB), S Haupert (ISYEB), R Emonet (LabHC), F Sèbe (CRNL-ENES, OFB - DRAS)
Subjects: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:The accessibility of long-duration recorders, adapted to sometimes demanding field conditions, has enabled the deployment of extensive animal population monitoring campaigns through ecoacoustics. The effectiveness of automatic signal detection methods, increasingly based on neural approaches, is frequently evaluated solely through machine learning metrics, while acoustic analysis of performance remains rare. As part of the acoustic monitoring of Rock Ptarmigan populations, we propose here a simple method for acoustic analysis of the detection system’s performance. The proposed measure is based on relating the signal-to-noise ratio of synthetic signals to their probability of detection. We show how this measure provides information about the system and allows optimisation of its training. We also show how it enables modelling of the detection distance, thus offering the possibility of evaluating its dynamics according to the sound environment and accessing an estimation of the spatial density of calls.

[LG-55] Self-Steering Deep Non-Linear Spatially Selective Filters for Efficient Extraction of Moving Speakers under Weak Guidance

Link: https://arxiv.org/abs/2507.02791
Authors: Jakob Kienegger, Alina Mannanova, Huajian Fang, Timo Gerkmann
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Comments: Accepted at IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2025

Abstract:Recent works on deep non-linear spatially selective filters demonstrate exceptional enhancement performance with computationally lightweight architectures for stationary speakers of known directions. However, to maintain this performance in dynamic scenarios, resource-intensive data-driven tracking algorithms become necessary to provide precise spatial guidance conditioned on the initial direction of a target speaker. As this additional computational overhead hinders application in resource-constrained scenarios such as real-time speech enhancement, we present a novel strategy utilizing a low-complexity tracking algorithm in the form of a particle filter instead. Assuming a causal, sequential processing style, we introduce temporal feedback to leverage the enhanced speech signal of the spatially selective filter to compensate for the limited modeling capabilities of the particle filter. Evaluation on a synthetic dataset illustrates how the autoregressive interplay between both algorithms drastically improves tracking accuracy and leads to strong enhancement performance. A listening test with real-world recordings complements these findings by indicating a clear trend towards our proposed self-steering pipeline as preferred choice over comparable methods.

[LG-56] Sparse Gaussian Processes: Structured Approximations and Power-EP Revisited

Link: https://arxiv.org/abs/2507.02377
Authors: Thang D. Bui, Michalis K. Titsias
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Inducing-point-based sparse variational Gaussian processes have become the standard workhorse for scaling up GP models. Recent advances show that these methods can be improved by introducing a diagonal scaling matrix to the conditional posterior density given the inducing points. This paper first considers an extension that employs a block-diagonal structure for the scaling matrix, provably tightening the variational lower bound. We then revisit the unifying framework of sparse GPs based on Power Expectation Propagation (PEP) and show that it can leverage and benefit from the new structured approximate posteriors. Through extensive regression experiments, we show that the proposed block-diagonal approximation consistently performs similarly to or better than existing diagonal approximations while maintaining comparable computational costs. Furthermore, the new PEP framework with structured posteriors provides competitive performance across various power hyperparameter settings, offering practitioners flexible alternatives to standard variational approaches.

[LG-57] It's Hard to Be Normal: The Impact of Noise on Structure-agnostic Estimation

Link: https://arxiv.org/abs/2507.02275
Authors: Jikai Jin, Lester Mackey, Vasilis Syrgkanis
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)

Abstract:Structure-agnostic causal inference studies how well one can estimate a treatment effect given black-box machine learning estimates of nuisance functions (like the impact of confounders on treatment and outcomes). Here, we find that the answer depends in a surprising way on the distribution of the treatment noise. Focusing on the partially linear model of \citet{robinson1988root}, we first show that the widely adopted double machine learning (DML) estimator is minimax rate-optimal for Gaussian treatment noise, resolving an open problem of \citet{mackey2018orthogonal}. Meanwhile, for independent non-Gaussian treatment noise, we show that DML is always suboptimal by constructing new practical procedures with higher-order robustness to nuisance errors. These \emph{ACE} procedures use structure-agnostic cumulant estimators to achieve $r$-th order insensitivity to nuisance errors whenever the $(r+1)$-st treatment cumulant is non-zero. We complement these core results with novel minimax guarantees for binary treatments in the partially linear model. Finally, using synthetic demand estimation experiments, we demonstrate the practical benefits of our higher-order robust estimators.
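
For readers unfamiliar with the DML baseline this abstract discusses, here is a minimal sketch of the residual-on-residual ("partialling-out") estimate in the partially linear model $Y = \theta T + g(X) + \varepsilon$. It is not the paper's ACE procedure. The function name, the toy data, and the use of exact nuisance values in place of cross-fitted machine learning estimates are all assumptions made for illustration.

```python
def dml_partialling_out(y, t, y_hat, t_hat):
    """Regress outcome residuals on treatment residuals to estimate theta."""
    y_res = [yi - yh for yi, yh in zip(y, y_hat)]  # Y - E[Y | X]
    t_res = [ti - th for ti, th in zip(t, t_hat)]  # T - E[T | X]
    numerator = sum(a * b for a, b in zip(t_res, y_res))
    denominator = sum(a * a for a in t_res)
    return numerator / denominator

# Hypothetical noise-free toy data with true effect theta = 2.
theta = 2.0
x = [0.0, 1.0, 2.0, 3.0]
eta = [1.0, -1.0, 1.0, -1.0]                 # treatment noise
t_hat = [0.5 * xi for xi in x]               # assumed E[T | X]
t = [m + e for m, e in zip(t_hat, eta)]
g = [xi ** 2 for xi in x]                    # confounding term g(X)
y = [theta * ti + gi for ti, gi in zip(t, g)]
y_hat = [theta * m + gi for m, gi in zip(t_hat, g)]  # assumed E[Y | X]

theta_hat = dml_partialling_out(y, t, y_hat, t_hat)
print(theta_hat)
```

In practice the two nuisance predictions come from black-box learners fitted on held-out folds, and their errors are what the abstract's robustness results are about.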

[LG-58] NLP4Neuro: Sequence-to-sequence learning for neural population decoding

Link: https://arxiv.org/abs/2507.02264
Authors: Jacob J. Morra, Kaitlyn E. Fouke, Kexin Hang, Zichen He, Owen Traubert, Timothy W. Dunn, Eva A. Naumann
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures

Abstract:Delineating how animal behavior arises from neural activity is a foundational goal of neuroscience. However, as the computations underlying behavior unfold in networks of thousands of individual neurons across the entire brain, this presents challenges for investigating neural roles and computational mechanisms in large, densely wired mammalian brains during behavior. Transformers, the backbones of modern large language models (LLMs), have become powerful tools for neural decoding from smaller neural populations. These modern LLMs have benefited from extensive pre-training, and their sequence-to-sequence learning has been shown to generalize to novel tasks and data modalities, which may also confer advantages for neural decoding from larger, brain-wide activity recordings. Here, we present a systematic evaluation of off-the-shelf LLMs to decode behavior from brain-wide populations, termed NLP4Neuro, which we used to test LLMs on simultaneous calcium imaging and behavior recordings in larval zebrafish exposed to visual motion stimuli. Through NLP4Neuro, we found that LLMs become better at neural decoding when they use pre-trained weights learned from textual natural language data. Moreover, we found that a recent mixture-of-experts LLM, DeepSeek Coder-7b, significantly improved behavioral decoding accuracy, predicted tail movements over long timescales, and provided anatomically consistent highly interpretable readouts of neuron salience. NLP4Neuro demonstrates that LLMs are highly capable of informing brain-wide neural circuit dissection.

[LG-59] Transfer Learning for Matrix Completion

Link: https://arxiv.org/abs/2507.02248
Authors: Dali Liu, Haolei Weng
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 37 pages, 1 figure

Abstract:In this paper, we explore knowledge transfer in the setting of matrix completion, which aims to enhance the estimation of a low-rank target matrix with auxiliary data available. We propose a transfer learning procedure given prior information on which source datasets are favorable. We study its convergence rates and prove its minimax optimality. Our analysis reveals that with the source matrices close enough to the target matrix, our method outperforms the traditional method using the single target data. In particular, we leverage the advanced sharp concentration inequalities introduced in \cite{brailovskaya2024universality} to eliminate a logarithmic factor in the convergence rate, which is crucial for proving the minimax optimality. When the relevance of source datasets is unknown, we develop an efficient detection procedure to identify informative sources and establish its selection consistency. Simulations and real data analysis are conducted to support the validity of our methodology.

[LG-60] Hybrid least squares for learning functions from highly noisy data

Link: https://arxiv.org/abs/2507.02215
Authors: Ben Adcock, Bernhard Hientzsch, Akil Narayan, Yiming Xu
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 30 pages

Abstract:Motivated by the need for efficient estimation of conditional expectations, we consider a least-squares function approximation problem with heavily polluted data. Existing methods that are powerful in the small noise regime are suboptimal when large noise is present. We propose a hybrid approach that combines Christoffel sampling with certain types of optimal experimental design to address this issue. We show that the proposed algorithm enjoys appropriate optimality properties for both sample point generation and noise mollification, leading to improved computational efficiency and sample complexity compared to existing methods. We also extend the algorithm to convex-constrained settings with similar theoretical guarantees. When the target function is defined as the expectation of a random field, we extend our approach to leverage adaptive random subspaces and establish results on the approximation capacity of the adaptive procedure. Our theoretical findings are supported by numerical studies on both synthetic data and on a more challenging stochastic simulation problem in computational finance.

[LG-61] Selective Feature Re-Encoded Quantum Convolutional Neural Network with Joint Optimization for Image Classification

Link: https://arxiv.org/abs/2507.02086
Authors: Shaswata Mahernob Sarkar, Sheikh Iftekhar Ahmed, Jishnu Mahmud, Shaikh Anowarul Fattah, Gaurav Sharma
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 26 pages, 12 figures, 6 tables

Abstract:Quantum Machine Learning (QML) has seen significant advancements, driven by recent improvements in Noisy Intermediate-Scale Quantum (NISQ) devices. Leveraging quantum principles such as entanglement and superposition, quantum convolutional neural networks (QCNNs) have demonstrated promising results in classifying both quantum and classical data. This study examines QCNNs in the context of image classification and proposes a novel strategy to enhance feature processing and a QCNN architecture for improved classification accuracy. First, a selective feature re-encoding strategy is proposed, which directs the quantum circuits to prioritize the most informative features, thereby effectively navigating the crucial regions of the Hilbert space to find the optimal solution space. Secondly, a novel parallel-mode QCNN architecture is designed to simultaneously incorporate features extracted by two classical methods, Principal Component Analysis (PCA) and Autoencoders, within a unified training scheme. The joint optimization involved in the training process allows the QCNN to benefit from complementary feature representations, enabling better mutual readjustment of model parameters. To assess these methodologies, comprehensive experiments have been performed using the widely used MNIST and Fashion MNIST datasets for binary classification tasks. Experimental findings reveal that the selective feature re-encoding method significantly improves the quantum circuit’s feature processing capability and performance. Furthermore, the jointly optimized parallel QCNN architecture consistently outperforms the individual QCNN models and the traditional ensemble approach involving independent learning followed by decision fusion, confirming its superior accuracy and generalization capabilities.

[LG-62] Adaptive Iterative Soft-Thresholding Algorithm with the Median Absolute Deviation

Link: https://arxiv.org/abs/2507.02084
Authors: Yining Feng, Ivan Selesnick
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:The adaptive Iterative Soft-Thresholding Algorithm (ISTA) has been a popular algorithm for finding a desirable solution to the LASSO problem without explicitly tuning the regularization parameter $\lambda$. Although the adaptive ISTA is a successful practical algorithm, few theoretical results exist. In this paper, we present a theoretical analysis of the adaptive ISTA with the thresholding strategy of estimating the noise level by the median absolute deviation. We show properties of the fixed points of the algorithm, including scale equivariance, non-uniqueness, and local stability, prove the local linear convergence guarantee, and show its global convergence behavior.
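
The sketch below shows the general shape of an ISTA iteration whose threshold is re-estimated from a median-absolute-deviation (MAD) noise estimate instead of a hand-tuned $\lambda$. It is a hedged illustration, not the algorithm analysed in the paper. Applying the MAD to the gradient-step iterate, the Gaussian consistency constant 1.4826, and the multiplier `c` are assumptions, and the paper's exact thresholding rule may differ.

```python
from statistics import median

def soft_threshold(v, tau):
    """Shrink v toward zero by tau (proximal operator of tau * |v|)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def mad_sigma(values):
    """Robust noise-level estimate: 1.4826 * median(|v - median(v)|)."""
    centre = median(values)
    return 1.4826 * median([abs(v - centre) for v in values])

def adaptive_ista(A, y, step, c=1.0, iters=50):
    """ISTA with a per-iteration threshold of c * MAD-based sigma (assumed rule)."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(iters):
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(n_cols))
                    for i in range(n_rows)]
        # Gradient step on 0.5 * ||y - A x||^2.
        z = [x[j] + step * sum(A[i][j] * residual[i] for i in range(n_rows))
             for j in range(n_cols)]
        tau = c * mad_sigma(z)  # threshold adapts to the current iterate
        x = [soft_threshold(v, tau) for v in z]
    return x

# Hypothetical toy problem: identity sensing matrix, one active coefficient.
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x_hat = adaptive_ista(A, [5.0, 0.0, 0.0], step=1.0)
print(x_hat)
```

Because the threshold depends on the iterate itself, the fixed points are no longer minimisers of a single LASSO objective, which is one reason the fixed-point properties listed in the abstract need separate analysis.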

[LG-63] Machine Learning Based Stress Testing Framework for Indian Financial Market Portfolios

Link: https://arxiv.org/abs/2507.02011
Authors: Vidya Sagar G, Shifat Ali, Siddhartha P. Chakrabarty
Subjects: Risk Management (q-fin.RM); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)

Abstract:This paper presents a machine learning driven framework for sectoral stress testing in the Indian financial market, focusing on financial services, information technology, energy, consumer goods, and pharmaceuticals. Initially, we address the limitations observed in conventional stress testing through dimensionality reduction and latent factor modeling via Principal Component Analysis and Autoencoders. Building on this, we extend the methodology using Variational Autoencoders, which introduces a probabilistic structure to the latent space. This enables Monte Carlo-based scenario generation, allowing for more nuanced, distribution-aware simulation of stressed market conditions. The proposed framework captures complex non-linear dependencies and supports risk estimation through Value-at-Risk and Expected Shortfall. Together, these pipelines demonstrate the potential of Machine Learning approaches to improve the flexibility, robustness, and realism of financial stress testing.
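
The two risk measures named in this abstract can be computed from Monte Carlo scenarios as sketched below. The helper `var_es`, the empirical-quantile convention, and the use of standard-normal draws in place of losses decoded from a Variational Autoencoder's latent space are assumptions for illustration only.

```python
import math
import random

def var_es(losses, alpha=0.95):
    """Empirical Value-at-Risk and Expected Shortfall of simulated losses."""
    ordered = sorted(losses)
    k = math.ceil(alpha * len(ordered)) - 1  # index of the empirical alpha-quantile
    var = ordered[k]
    tail = ordered[k:]                       # scenarios at or beyond VaR
    es = sum(tail) / len(tail)               # average loss in the tail
    return var, es

# Hypothetical scenario set: Gaussian draws stand in for stressed portfolio losses.
rng = random.Random(0)
simulated_losses = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
var_95, es_95 = var_es(simulated_losses, alpha=0.95)
print(f"VaR 95%: {var_95:.3f}, ES 95%: {es_95:.3f}")
```

Expected Shortfall is never smaller than Value-at-Risk at the same level, since it averages only the losses at or beyond that quantile.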

[LG-64] Predicting and Explaining Customer Data Sharing in the Open Banking

Link: https://arxiv.org/abs/2507.01987
Authors: João B. G. de Brito, Rodrigo Heldt, Cleo S. Silveira, Matthias Bogaert, Guilherme B. Bucco, Fernando B. Luce, João L. Becker, Filipe J. Zabala, Michel J. Anzanello
Subjects: General Finance (q-fin.GN); Machine Learning (cs.LG)

Abstract:The emergence of Open Banking represents a significant shift in financial data management, influencing financial institutions’ market dynamics and marketing strategies. This increased competition creates opportunities and challenges, as institutions manage data inflow to improve products and services while mitigating data outflow that could aid competitors. This study introduces a framework to predict customers’ propensity to share data via Open Banking and interprets this behavior through Explanatory Model Analysis (EMA). Using data from a large Brazilian financial institution with approximately 3.2 million customers, a hybrid data balancing strategy incorporating ADASYN and NEARMISS techniques was employed to address the infrequency of data sharing and enhance the training of XGBoost models. These models accurately predicted customer data sharing, achieving 91.39% accuracy for inflow and 91.53% for outflow. The EMA phase combined the Shapley Additive Explanations (SHAP) method with the Classification and Regression Tree (CART) technique, revealing the most influential features on customer decisions. Key features included the number of transactions and purchases in mobile channels, interactions within these channels, and credit-related features, particularly credit card usage across the national banking system. These results highlight the critical role of mobile engagement and credit in driving customer data-sharing behaviors, providing financial institutions with strategic insights to enhance competitiveness and innovation in the Open Banking environment.

[LG-65] Detecting Fraud in Financial Networks: A Semi-Supervised GNN Approach with Granger-Causal Explanations

Link: https://arxiv.org/abs/2507.01980
Authors: Linh Nguyen, Marcel Boersma, Erman Acar
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Fraudulent activity in the financial industry costs billions annually. Detecting fraud, therefore, is an essential yet technically challenging task that requires carefully analyzing large volumes of data. While machine learning (ML) approaches seem like a viable solution, applying them successfully is not so easy due to two main challenges: (1) the sparsely labeled data, which makes the training of such approaches challenging (with inherent labeling costs), and (2) the lack of explainability for the flagged items, caused by the opacity of ML models, even though such explainability is often required by business regulations. This article proposes SAGE-FIN, a semi-supervised graph neural network (GNN) based approach with Granger causal explanations for Financial Interaction Networks. SAGE-FIN learns to flag fraudulent items based on weakly labeled (or unlabelled) data points. To adhere to regulatory requirements, the flagged items are explained by highlighting related items in the network using Granger causality. We empirically validate the favorable performance of SAGE-FIN on a real-world dataset, Bipartite Edge-And-Node Attributed financial network (Elliptic++), with Granger-causal explanations for the identified fraudulent items without any prior assumption on the network structure.

[LG-66] News Sentiment Embeddings for Stock Price Forecasting

Link: https://arxiv.org/abs/2507.01970
Authors: Ayaan Qayyum
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: 12 pages, 11 figures

Abstract:This paper will discuss how headline data can be used to predict stock prices. The stock price in question is the SPDR S&P 500 ETF Trust, also known as SPY, which tracks the performance of the largest 500 publicly traded corporations in the United States. A key focus is to use news headlines from the Wall Street Journal (WSJ) to predict the movement of stock prices on a daily timescale, with OpenAI-based text embedding models used to create vector encodings of each headline and principal component analysis (PCA) used to extract the key features. The challenge of this work is to capture the time-dependent and time-independent, nuanced impacts of news on stock prices while handling potential lag effects and market noise. Financial and economic data were collected to improve model performance; such sources include the U.S. Dollar Index (DXY) and Treasury Interest Yields. Over 390 machine-learning inference models were trained. The preliminary results show that headline data embeddings improve stock price prediction by at least 40% compared to training and optimizing a machine learning system without headline data embeddings.

[LG-67] Forecasting Nigerian Equity Stock Returns Using Long Short-Term Memory Technique

Link: https://arxiv.org/abs/2507.01964
Authors: Adebola K. Ojo, Ifechukwude Jude Okafor
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: 10 pages

Abstract:Investors and stock market analysts face major challenges in predicting stock returns and making wise investment decisions. The predictability of equity stock returns can boost investor confidence, but it remains a difficult task. To address this issue, a study was conducted using a Long Short-term Memory (LSTM) model to predict future stock market movements. The study used a historical dataset from the Nigerian Stock Exchange (NSE), which was cleaned and normalized to design the LSTM model. The model was evaluated using performance metrics and compared with other deep learning models like Artificial and Convolutional Neural Networks (CNN). The experimental results showed that the LSTM model can predict future stock market prices and returns with over 90% accuracy when trained with a reliable dataset. The study concludes that LSTM models can be useful in predicting financial time-series-related problems if well-trained. Future studies should explore combining LSTM models with other deep learning techniques like CNN to create hybrid models that mitigate the risks associated with relying on a single model for future equity stock predictions.

Information Retrieval

[IR-0] Calibrated Recommendations: Survey and Future Directions

Link: https://arxiv.org/abs/2507.02643
Authors: Diego Corrêa da Silva, Dietmar Jannach
Subjects: Information Retrieval (cs.IR)

Abstract:The idea of calibrated recommendations is that the properties of the items that are suggested to users should match the distribution of their individual past preferences. Calibration techniques are therefore helpful to ensure that the recommendations provided to a user are not limited to a certain subset of the user’s interests. Over the past few years, we have observed an increasing number of research works that use calibration for different purposes, including questions of diversity, biases, and fairness. In this work, we provide a survey on the recent developments in the area of calibrated recommendations. We both review existing technical approaches for calibration and provide an overview on empirical and analytical studies on the effectiveness of calibration for different use cases. Furthermore, we discuss limitations and common challenges when implementing calibration in practice.
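
One formulation commonly used in this literature scores miscalibration as the KL divergence between a user's historical category distribution and the category distribution of the recommended list. The sketch below assumes that formulation; the function name, the smoothing weight `alpha`, and the genre data are hypothetical and are not taken from the survey.

```python
import math

def calibration_kl(p, q, alpha=0.01):
    """KL(p || q~) with q~ = (1 - alpha) * q + alpha * p, so q~ is never zero where p > 0."""
    total = 0.0
    for category, p_c in p.items():
        if p_c == 0.0:
            continue  # zero-probability categories contribute nothing
        q_c = (1 - alpha) * q.get(category, 0.0) + alpha * p_c
        total += p_c * math.log(p_c / q_c)
    return total

# Hypothetical user profile: 70% action, 30% romance in past interactions.
history = {"action": 0.7, "romance": 0.3}
balanced_list = {"action": 0.7, "romance": 0.3}
one_sided_list = {"action": 1.0}

print(calibration_kl(history, balanced_list))
print(calibration_kl(history, one_sided_list))
```

A score near zero means the list mirrors the user's past interests; the one-sided list scores higher because the romance share is missing.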

[IR-1] Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems

Link: https://arxiv.org/abs/2507.02464
Authors: Craig S Wright
Subjects: Computer Science and Game Theory (cs.GT); Distributed, Parallel, and Cluster Computing (cs.DC); Formal Languages and Automata Theory (cs.FL); Information Retrieval (cs.IR); General Economics (econ.GN)
Comments: 51 pages, 4 tables; includes formal proofs, automata construction, and case study on Bitcoin Script

Abstract:The CAP theorem asserts a trilemma between consistency, availability, and partition tolerance. This paper introduces a rigorous automata-theoretic and economically grounded framework that reframes the CAP trade-off as a constraint optimization problem. We model distributed systems as partition-aware state machines and embed economic incentive layers to stabilize consensus behavior across adversarially partitioned networks. By incorporating game-theoretic mechanisms into the global transition semantics, we define provable bounds on convergence, liveness, and correctness. Our results demonstrate that availability and consistency can be simultaneously preserved within bounded epsilon margins, effectively extending the classical CAP limits through formal economic control.

[IR-2] The Future is Agentic: Definitions, Perspectives, and Open Challenges of Multi-Agent Recommender Systems

Link: https://arxiv.org/abs/2507.02097
Authors: Reza Yousefi Maragheh, Yashar Deldjoo
Subjects: Information Retrieval (cs.IR)

Abstract:Large language models (LLMs) are rapidly evolving from passive engines of text generation into agentic entities that can plan, remember, invoke external tools, and co-operate with one another. This perspective paper investigates how such LLM agents (and societies thereof) can transform the design space of recommender systems. We introduce a unified formalism that (i) models an individual agent as a tuple comprising its language core, tool set, and hierarchical memory, and (ii) captures a multi-agent recommender as a triple of agents, shared environment, and communication protocol. Within this framework, we present four end-to-end use cases (interactive party planning, synthetic user-simulation for offline evaluation, multi-modal furniture recommendation, and brand-aligned explanation generation), each illustrating a distinct capability unlocked by agentic orchestration. We then surface five cross-cutting challenge families: protocol complexity, scalability, hallucination and error propagation, emergent misalignment (including covert collusion), and brand compliance. For each, we formalize the problem, review nascent mitigation strategies, and outline open research questions. The result is both a blueprint and an agenda: a blueprint that shows how memory-augmented, tool-using LLM agents can be composed into robust recommendation pipelines, and an agenda inviting the RecSys community to develop benchmarks, theoretical guarantees, and governance tools that keep pace with this new degree of autonomy. By unifying agentic abstractions with recommender objectives, the paper lays the groundwork for the next generation of personalized, trustworthy, and context-rich recommendation services.

[IR-3] Uncertainty-Aware Complex Scientific Table Data Extraction

Link: https://arxiv.org/abs/2507.02009
Authors: Kehinde Ajayi, Yi He, Jian Wu
Subjects: Information Retrieval (cs.IR)

Abstract:Table structure recognition (TSR) and optical character recognition (OCR) play crucial roles in extracting structured data from tables in scientific documents. However, existing extraction frameworks built on top of TSR and OCR methods often fail to quantify the uncertainties of extracted results. To obtain highly accurate data for scientific domains, all extracted data must be manually verified, which can be time-consuming and labor-intensive. We propose a framework that performs uncertainty-aware data extraction for complex scientific tables, built on conformal prediction, a model-agnostic method for uncertainty quantification (UQ). We explored various uncertainty scoring methods to aggregate the uncertainties introduced by TSR and OCR. We rigorously evaluated the framework using a standard benchmark and an in-house dataset consisting of complex scientific tables in six scientific domains. The results demonstrate the effectiveness of using UQ for extraction error detection, and by manually verifying only 47% of extraction results, the data quality can be improved by 30%. Our work quantitatively demonstrates the role of UQ with the potential of improving the efficiency in the human-machine cooperation process to obtain scientifically usable data from complex tables in scientific documents. All code and data are available on GitHub at this https URL.
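
The conformal prediction idea behind this abstract can be sketched as a split-conformal threshold over nonconformity scores, with extractions above the threshold routed to manual verification. The function names, the scores, and the flagging rule are assumptions for illustration; the paper's scoring methods for combining TSR and OCR uncertainty are not reproduced here.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank
    if rank > n:
        return float("inf")                  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]

def flag_for_review(scores, threshold):
    """Mark extractions whose nonconformity score exceeds the calibrated threshold."""
    return [score > threshold for score in scores]

# Hypothetical nonconformity scores from correctly extracted calibration cells.
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
threshold = conformal_threshold(calibration, alpha=0.5)
flags = flag_for_review([0.1, 0.9], threshold)
print(threshold, flags)
```

Only the flagged cells would go to a human checker, which is the mechanism by which verifying a fraction of the results can still raise overall data quality.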

Attachment Download

Click to download today's full paper list