This post lists the latest papers retrieved from Arxiv.org on 2025-08-13, updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: paper data is fetched from Arxiv.org and updated automatically every day around 12:00.

Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.

Table of Contents

Overview (2025-08-13)

466 papers were updated today, including:

  • Natural Language Processing: 79 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 163 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 93 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 119 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

[Quick Read]: This paper targets a performance bottleneck in diffusion large language models (dLLMs): current decoding strategies discard the information contained in intermediate predictions during text generation. Specifically, the authors identify a key phenomenon, temporal oscillation, in which correct answers often emerge at intermediate denoising steps but are overwritten by later ones. The core of the solution is to exploit temporal consistency in two ways: (1) Temporal Self-Consistency Voting, a training-free test-time decoding strategy that aggregates the predictions from each denoising step and selects the most consistent output; and (2) Temporal Consistency Reinforcement, a post-training method that uses Temporal Semantic Entropy (TSE) as a reward signal to steer the model toward more stable intermediate predictions. Experiments across multiple benchmarks show that the approach significantly improves dLLM performance, confirming the untapped value of temporal dynamics.

Link: https://arxiv.org/abs/2508.09138
Authors: Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
Affiliations: Zhejiang University; Ant Group; Zhejiang University of Technology; Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Project webpage: this https URL

Abstract:Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.
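The voting idea behind Temporal Self-Consistency Voting can be sketched in a few lines. This is a minimal illustration assuming the decoder exposes one final-answer string per intermediate denoising step; the function name and the uniform default weights are our assumptions, not the paper's:

```python
from collections import Counter

def temporal_self_consistency_vote(step_predictions, weights=None):
    """Aggregate the answers decoded at each intermediate denoising
    step and return the most consistent one (optionally weighting
    later, more refined steps higher)."""
    if not step_predictions:
        raise ValueError("need at least one intermediate prediction")
    if weights is None:
        weights = [1.0] * len(step_predictions)
    scores = Counter()
    for pred, w in zip(step_predictions, weights):
        scores[pred] += w
    return scores.most_common(1)[0][0]

# "Temporal oscillation": the correct answer ("14") surfaces at the
# middle steps but is overwritten at the end; voting recovers it.
steps = ["12", "14", "14", "14", "13"]
print(temporal_self_consistency_vote(steps))  # -> 14
```

The paper's Temporal Semantic Entropy reward follows the same intuition: it measures how stable these intermediate predictions are across steps.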

[NLP-1] Complex Logical Instruction Generation

[Quick Read]: This paper addresses the limited ability of current large language models (LLMs) to follow logic-intensive natural-language instructions, especially those involving conditionals, nesting, recursion, and function calls, an aspect of instruction following that remains under-explored. The solution centers on two tools: LogicIFGen, a scalable automated framework that generates verifiable, logic-rich instructions from code functions, and LogicIFEval, a benchmark of 426 such instructions for systematically evaluating LLMs under complex logical instructions. Experiments show that most state-of-the-art LLMs correctly follow fewer than 60% of the instructions, revealing significant deficiencies in logical understanding and execution.

Link: https://arxiv.org/abs/2508.09125
Authors: Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions becomes increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs can only follow fewer than 60% of the instructions, revealing significant deficiencies in the instruction-following ability. Code and Benchmark: this https URL

[NLP-2] OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

[Quick Read]: This paper addresses the inadequate evaluation of LLM agents on complex, long-horizon real-world workflows. Existing benchmarks focus on atomic, self-contained tasks and fail to capture the long-term contextual dependencies and multi-application coordination required in realistic scenarios. The key contribution is OdysseyBench, a comprehensive benchmark covering office workflows across Word, Excel, PDF, Email, and Calendar, with two complementary splits: OdysseyBench+ (300 tasks derived from real-world use cases) and OdysseyBench-Neo (302 newly synthesized complex tasks); each task requires the agent to extract key information from long-horizon interaction histories and perform multi-step reasoning across applications. To make benchmark construction scalable, the authors also propose HomerAgents, a multi-agent framework that automates the creation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Empirically, OdysseyBench assesses LLM agents in realistic productivity scenarios more accurately than conventional atomic-task benchmarks.

Link: https://arxiv.org/abs/2508.09124
Authors: Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, Saravan Rajmohan
Affiliations: School of Informatics, University of Edinburgh; Microsoft
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.

[NLP-3] SinLlama - A Large Language Model for Sinhala

[Quick Read]: This paper tackles the poor support for low-resource languages such as Sinhala in open-source large language models (LLMs). The solution has two key steps: first, extend the tokenizer of an existing multilingual LLM (Llama-3-8B) with Sinhala-specific vocabulary; second, perform continual pre-training on a cleaned corpus of 10 million Sinhala texts. The result is SinLlama, the first decoder-based open-source LLM with explicit Sinhala support. When instruction fine-tuned for three text classification tasks, the model significantly outperforms both the base and instruct variants of Llama-3-8B.

Link: https://arxiv.org/abs/2508.09115
Authors: H.W.K. Aravinda, Rashad Sirajudeen, Samith Karunathilake, Nisansa de Silva, Surangika Ranathunga, Rishemjit Kaur
Affiliations: University of Moratuwa; Massey University; Academy of Scientific and Innovative Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.

[NLP-4] AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

[Quick Read]: This paper addresses the limitations of existing code-generation benchmarks in multilingual coverage, difficulty distribution, and data scale, in particular the difficulty of scaling manual annotation and the heavy concentration on Python. The key contribution is AutoCodeGen, a method that automatically builds high-difficulty multilingual code-generation datasets without manual annotation: an LLM generates test inputs, a multilingual sandbox executes them to obtain correct outputs, and reverse-order problem generation plus multiple filtering steps ensure the correctness and completeness of the test cases. The resulting benchmark, AutoCodeBench, comprises 3,920 problems across 20 programming languages and substantially raises the complexity, diversity, and practicality of evaluation tasks.

Link: https://arxiv.org/abs/2508.09101
Authors: Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian
Affiliations: Hunyuan Team, Tencent
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Homepage: this https URL

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities, these benchmarks face several critical limitations. First, they often rely on manual annotations, which are time-consuming and difficult to scale across different programming languages and problem complexities. Second, most existing benchmarks focus primarily on Python, while the few multilingual benchmarks suffer from limited difficulty and uneven language distribution. To address these challenges, we propose AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations. AutoCodeGen ensures the correctness and completeness of test cases by generating test inputs with LLMs and obtaining test outputs through a multilingual sandbox, while achieving high data quality through reverse-order problem generation and multiple filtering steps. Using this novel method, we introduce AutoCodeBench, a large-scale code generation benchmark comprising 3,920 problems evenly distributed across 20 programming languages. It is specifically designed to evaluate LLMs on challenging, diverse, and practical multilingual tasks. We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite. The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks. Besides, we introduce AutoCodeBench-Complete, specifically designed for base models to assess their few-shot code generation capabilities. We hope the AutoCodeBench series will serve as a valuable resource and inspire the community to focus on more challenging and practical multilingual code generation scenarios.
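The generate-inputs-then-execute step can be illustrated with a small sketch. Here the "sandbox" is a plain function call and the candidate inputs stand in for LLM-generated ones, so everything except the keep-only-what-executes logic is a simplifying assumption on our part:

```python
import math

def build_test_cases(solution, candidate_inputs):
    """AutoCodeGen-style case construction (simplified): candidate
    inputs would come from an LLM, and expected outputs are obtained
    by executing a trusted solution in a sandbox, so every kept case
    is correct by construction. Inputs the solution rejects are
    dropped, mirroring the paper's filtering steps."""
    cases = []
    for args in candidate_inputs:
        try:
            expected = solution(*args)
        except Exception:
            continue  # drop inputs the reference solution rejects
        cases.append((args, expected))
    return cases

# Toy "problem": integer square root. The proposed inputs include an
# invalid negative case, which the filtering step removes.
def isqrt_solution(n):
    if n < 0:
        raise ValueError("n must be non-negative")
    return math.isqrt(n)

cases = build_test_cases(isqrt_solution, [(0,), (15,), (-3,), (100,)])
print(cases)  # [((0,), 0), ((15,), 3), ((100,), 10)]
```

Because the expected outputs come from executing a reference solution rather than from the LLM, this construction sidesteps the need to trust model-predicted answers.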

[NLP-5] Link Prediction for Event Logs in the Process Industry

[Quick Read]: This paper addresses the fragmentation of event logs (shift books) in the process industry, where entries documenting related equipment or process issues and their solutions are scattered across records, making it hard to recommend past solutions to users. The key idea is to frame record linking (RL) as a cross-document coreference resolution (CDCR) task, enhanced with natural language inference (NLI) and semantic text similarity (STS) and recast within a causal inference (CI) framing, while accommodating the industry's specific text formats (unstructured text plus structured record attributes). The resulting RL model outperforms the best NLI- and STS-driven baselines by 28% and 27%, respectively, improving data quality and connectivity in shift logs.

Link: https://arxiv.org/abs/2508.09096
Authors: Anastasia Zhukova, Thomas Walton, Christian E. Matt, Bela Gipp
Affiliations: University of Göttingen; eschbach GmbH
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Knowledge management (KM) is vital in the process industry for optimizing operations, ensuring safety, and enabling continuous improvement through effective use of operational data and past insights. A key challenge in this domain is the fragmented nature of event logs in shift books, where related records, e.g., entries documenting issues related to equipment or processes and the corresponding solutions, may remain disconnected. This fragmentation hinders the recommendation of previous solutions to the users. To address this problem, we investigate record linking (RL) as link prediction, commonly studied in graph-based machine learning, by framing it as a cross-document coreference resolution (CDCR) task enhanced with natural language inference (NLI) and semantic text similarity (STS) by shifting it into the causal inference (CI). We adapt CDCR, traditionally applied in the news domain, into an RL model to operate at the passage level, similar to NLI and STS, while accommodating the process industry’s specific text formats, which contain unstructured text and structured record attributes. Our RL model outperformed the best versions of NLI- and STS-driven baselines by 28% (11.43 points) and 27% (11.21 points), respectively. Our work demonstrates how domain adaptation of the state-of-the-art CDCR models, enhanced with reasoning capabilities, can be effectively tailored to the process industry, improving data quality and connectivity in shift logs.

[NLP-6] Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages

[Quick Read]: This paper addresses the sharp performance drop of large language models (LLMs) on low-resource languages (LRLs), which stems from their English-centric training. The proposed architecture fuses the representations from all intermediate layers of a multilingual encoder (e.g., mT5), rather than only the final layer, enriching the linguistic information passed to the LLM. Two strategies are explored: a Global Softmax weighting that learns an overall importance per layer, and a Transformer Softmax model that learns token-specific layer weights; the fused representation is mapped into the LLM's embedding space so it can process multilingual inputs. The model is trained only on English data, with no parallel or bilingual corpora, yet achieves significant gains on LRL tasks (XNLI, IndicXNLI, Sinhala News Classification, Amazon Reviews), raising Sinhala classification accuracy from 71.66% to 75.86% and average XNLI accuracy from 70.36% to 71.50%.

Link: https://arxiv.org/abs/2508.09091
Authors: Imalsha Puranegedara, Themira Chathumina, Nisal Ranathunga, Nisansa de Silva, Surangika Ranathunga, Mokanarangan Thayaparan
Affiliations: University of Moratuwa; Massey University; The Open University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM’s embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
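A minimal sketch of the Global Softmax variant, assuming the encoder states are already stacked into one array. The shape conventions and the plain-NumPy setting are ours for illustration; the real model learns the logits during training and adds a projection into the LLM's embedding space:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_layers_global(hidden_states, layer_logits):
    """Global-Softmax layer fusion (sketch): one learnable weight per
    encoder layer, shared across all tokens.

    hidden_states: [num_layers, seq_len, dim] intermediate encoder states
    layer_logits:  [num_layers] learnable parameters
    Returns a [seq_len, dim] fused representation, which the full model
    would then project into the LLM's embedding space.
    """
    w = softmax(layer_logits)                 # layer weights, sum to 1
    return np.tensordot(w, hidden_states, 1)  # weighted sum over layers

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 5, 8))              # 4 layers, 5 tokens, dim 8
fused = fuse_layers_global(h, np.zeros(4))  # zero logits -> uniform average
```

The Transformer Softmax variant described above would replace the single logit vector with per-token logits produced by a small Transformer, so each token gets its own layer mixture.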

[NLP-7] CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization

[Quick Read]: This paper addresses the limitations of reinforcement learning fine-tuning (RLFT) on open-ended subjective tasks such as role-playing dialogue, where ambiguous reward signals and unstable evaluation cap performance. Traditional reward modeling based on independent sample-wise scoring fails to capture the comparative judgments implicit in human evaluation, leading to inconsistent subjective criteria and noisy rewards. The key idea of Comparative Policy Optimization (CPO) is to shift reward evaluation from sample-wise scoring to group-wise, trajectory-level comparison. The accompanying CharacterArena evaluation framework has two stages, contextualized multi-turn role-playing simulation followed by trajectory-level comparative evaluation, which operationalize subjective scoring through objective trajectory comparisons, significantly reducing contextual bias and improving dialogue quality.

Link: https://arxiv.org/abs/2508.09074
Authors: Xinge Ye, Rui Wang, Yuchuan Wu, Victor Ma, Feiteng Fang, Fei Huang, Yongbin Li
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers (e.g., code generation, mathematical reasoning), yet struggles with open-ended subjective tasks like role-playing dialogue. Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals. Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization (CPO). CPO redefines the reward evaluation paradigm by shifting from sample-wise scoring to comparative group-wise evaluation. Based on the same principle, we introduce the CharacterArena evaluation framework, which comprises two stages: (1) Contextualized Multi-turn Role-playing Simulation, and (2) Trajectory-level Comparative Evaluation. By operationalizing subjective scoring via objective trajectory comparisons, CharacterArena minimizes contextual bias and enables more robust and fair performance evaluation. Empirical results on CharacterEval, CharacterBench, and CharacterArena confirm that CPO effectively mitigates reward ambiguity and leads to substantial improvements in dialogue quality.

[NLP-8] READER: Retrieval-Assisted Drafter for Efficient LLM Inference

[Quick Read]: This paper targets the efficiency bottleneck of autoregressive generation in large language model (LLM) inference, particularly for the industrially important but under-explored setting of large batch sizes. The key innovation of READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a lossless speculative decoding method, is to exploit self-repetition in text: the speculative decoding tree is expanded with candidate tokens obtained through statistical search, combined with a KV-cache optimization for large batches. The method requires no additional training, can reuse pre-trained speculator models (raising the speedup by over 40%), and achieves more than 10x speedup on tasks such as retrieval-augmented generation.

Link: https://arxiv.org/abs/2508.09072
Authors: Maxim Divilkovskiy, Vitaly Malygin, Sergey Zlobin, Sultan Isali, Vasily Kalugin, Stanislav Ilyushin, Nuriza Aitassova, Yi Fei, Zeng Weidi
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) generate tokens autoregressively, with each token depending on the preceding context. This sequential nature makes the inference process inherently difficult to accelerate, posing a significant challenge for efficient deployment. In recent years, various methods have been proposed to address this issue, with the most effective approaches often involving the training of additional draft models. In this paper, we introduce READER (Retrieval-Assisted Drafter for Efficient LLM Inference), a novel lossless speculative decoding method that enhances model-based approaches by leveraging self-repetitions in the text. Our algorithm expands the speculative decoding tree using tokens obtained through statistical search. This work focuses on large batch sizes (≥ 8), an underexplored yet important area for industrial applications. We also analyze the key-value (KV) cache size during speculative decoding and propose an optimization to improve performance for large batches. As a result, READER outperforms existing speculative decoding methods. Notably, READER requires no additional training and can reuse pre-trained speculator models, increasing the speedup by over 40%. Our method demonstrates particularly strong performance on search-based tasks, such as retrieval-augmented generation, where we achieve more than 10x speedup.
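The self-repetition idea can be illustrated with a prompt-lookup-style sketch: find where the current suffix last occurred earlier in the sequence and draft the tokens that followed it. This is a generic illustration of drafting from repetition, not READER's actual statistical-search or tree-expansion algorithm:

```python
def draft_from_context(tokens, max_ngram=3, num_draft=4):
    """Draft continuation tokens from self-repetition: find the most
    recent earlier occurrence of the current suffix n-gram and copy
    the tokens that followed it."""
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # scan earlier positions, most recent first
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == suffix:
                draft = tokens[i + n:i + n + num_draft]
                if draft:
                    return draft
    return []  # no repetition found; fall back to normal decoding

text = "the cat sat on the mat and the cat sat on".split()
print(draft_from_context(text))  # ['the', 'mat', 'and', 'the']
```

In speculative decoding, drafted tokens like these are then verified in parallel by the target model, which is what keeps the method lossless.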

[NLP-9] MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions ACM-MM2025

[Quick Read]: This paper addresses the gap between existing mobile-agent benchmarks and real user scenarios: current benchmarks are disconnected from the real world and fail to reflect users' diverse and complex needs. The authors present MVISU-Bench, a bilingual benchmark of 404 tasks across 137 mobile applications covering five representative task types: Multi-App, Vague, Interactive, Single-App, and Unethical instructions. The key component of the solution is Aider, a plug-and-play dynamic prompt prompter that identifies and clarifies user intent and mitigates risks, improving the robustness and success rate of mobile agents on complex instructions. Experiments show Aider raises the overall success rate on MVISU-Bench by 19.55%, with gains of 53.52% on unethical and 29.41% on interactive instructions, markedly narrowing the gap between current mobile agents and real user expectations.

Link: https://arxiv.org/abs/2508.09057
Authors: Zeyu Huang, Juyuan Wang, Longfeng Chen, Boyi Xiao, Leng Cai, Yawen Zeng, Jin Xu
Affiliations: South China University of Technology; Pazhou Lab
Subjects: Computation and Language (cs.CL)
Comments: ACM MM 2025

Abstract:Given the significant advances in Large Vision Language Models (LVLMs) in reasoning and visual understanding, mobile agents are rapidly emerging to meet users’ automation needs. However, existing evaluation benchmarks are disconnected from the real world and fail to adequately address the diverse and complex requirements of users. From our extensive collection of user questionnaires, we identified five tasks: Multi-App, Vague, Interactive, Single-App, and Unethical Instructions. Around these tasks, we present MVISU-Bench, a bilingual benchmark that includes 404 tasks across 137 mobile applications. Furthermore, we propose Aider, a plug-and-play module that acts as a dynamic prompt prompter to mitigate risks and clarify user intent for mobile agents. Our Aider is easy to integrate into several frameworks and has successfully improved overall success rates by 19.55% compared to the current state-of-the-art (SOTA) on MVISU-Bench. Specifically, it achieves success rate improvements of 53.52% and 29.41% for unethical and interactive instructions, respectively. Through extensive experiments and analysis, we highlight the gap between existing mobile agents and real-world user expectations.

[NLP-10] LLM-as-a-Supervisor: Mistaken Therapeutic Behaviors Trigger Targeted Supervisory Feedback

[Quick Read]: This paper addresses the ethical and safety concerns of applying large language models (LLMs) directly to patients in psychotherapy, as well as the fundamental tension between protecting the privacy of clinical training data and the lack of clear standards for therapeutic behavior. The key idea is a new therapist-training paradigm with the LLM acting as supervisor: first, common therapeutic mistakes and targeted correction strategies are codified as actionable standards; second, a human-in-the-loop dialogue-feedback dataset (MATE) is built, in which a mistake-prone agent deliberately commits standard mistakes during natural interviews while a supervisor agent locates them and provides targeted feedback; finally, the supervisor model fine-tuned on this dataset is used to train real therapists. Automated, human, and downstream evaluations show that the feedback quality meets clinical guidelines, demonstrating clear practical potential for therapist training.

Link: https://arxiv.org/abs/2508.09042
Authors: Chen Xu, Zhenyu Lv, Tian Lan, Xianyang Wang, Luyao Ji, Leyang Cui, Minqiang Yang, Jian Shen, Qunxi Dong, Xiuling Liu, Juan Wang, Bin Hu
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures

Abstract:Although large language models (LLMs) hold significant promise in psychotherapy, their direct application in patient-facing scenarios raises ethical and safety concerns. Therefore, this work shifts towards developing an LLM as a supervisor to train real therapists. In addition to the privacy of clinical therapist training data, a fundamental contradiction complicates the training of therapeutic behaviors: clear feedback standards are necessary to ensure a controlled training system, yet there is no absolute “gold standard” for appropriate therapeutic behaviors in practice. In contrast, many common therapeutic mistakes are universal and identifiable, making them effective triggers for targeted feedback that can serve as clearer evidence. Motivated by this, we create a novel therapist-training paradigm: (1) guidelines for mistaken behaviors and targeted correction strategies are first established as standards; (2) a human-in-the-loop dialogue-feedback dataset is then constructed, where a mistake-prone agent intentionally makes standard mistakes during interviews naturally, and a supervisor agent locates and identifies mistakes and provides targeted feedback; (3) after fine-tuning on this dataset, the final supervisor model is provided for real therapist training. The detailed experimental results of automated, human and downstream assessments demonstrate that models fine-tuned on our dataset MATE, can provide high-quality feedback according to the clinical guideline, showing significant potential for the therapist training scenario.

[NLP-11] P/D-Device: Disaggregated Large Language Model between Cloud and Devices

[Quick Read]: This paper addresses two resource bottlenecks in cloud-device collaborative deployment of large language models (LLMs): on the cloud side, the decoding phase generates many tokens and occupies resources for a long time, limiting throughput; on the device side, constrained compute makes the time to first token (TTFT), i.e., prefill latency, grow sharply with prompt length. The key idea is to disaggregate LLM inference between cloud and device: the cloud handles part of the content for each device, only in its prefill phase, so the device can respond to the user immediately after receiving the first token, greatly reducing TTFT; the subsequent tokens from the cloud are released through a speed controller for a smoother time per output token (TPOT) until the device catches up, while the intermediate data produced during cloud prefill is used to refine the prompt and further accelerate on-device inference. The scheme, P/D-Device, improves cloud resource utilization while keeping on-device latency low.

Link: https://arxiv.org/abs/2508.09035
Authors: Yibo Jin, Yixu Xu, Yue Chen, Chengbin Wang, Tao Wang, Jiaqi Huang, Rongfei Zhang, Yiming Dong, Yuting Yan, Ke Cheng, Yingjie Zhu, Shulan Wang, Qianqian Tang, Shuaishuai Meng, Guanxin Cheng, Ze Wang, Shuyan Miao, Ketao Wang, Wen Liu, Yifan Yang, Tong Zhang, Anran Wang, Chengzhou Lu, Tiantian Dong, Yongsheng Zhang, Zhe Wang, Hefei Guo, Hongjie Liu, Wei Lu, Zhengyong Zhang
Affiliations: unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Serving disaggregated large language models has been widely adopted in industrial practice for enhanced performance. However, too many tokens generated in decoding phase, i.e., occupying the resources for a long time, essentially hamper the cloud from achieving a higher throughput. Meanwhile, due to limited on-device resources, the time to first token (TTFT), i.e., the latency of prefill phase, increases dramatically with the growth on prompt length. In order to concur with such a bottleneck on resources, i.e., long occupation in cloud and limited on-device computing capacity, we propose to separate large language model between cloud and devices. That is, the cloud helps a portion of the content for each device, only in its prefill phase. Specifically, after receiving the first token from the cloud, decoupling with its own prefill, the device responds to the user immediately for a lower TTFT. Then, the following tokens from cloud are presented via a speed controller for smoothed TPOT (the time per output token), until the device catches up with the progress. On-device prefill is then amortized using received tokens while the resource usage in cloud is controlled. Moreover, during cloud prefill, the prompt can be refined, using those intermediate data already generated, to further speed up on-device inference. We implement such a scheme P/D-Device, and confirm its superiority over other alternatives. We further propose an algorithm to decide the best settings. Real-trace experiments show that TTFT decreases at least 60%, maximum TPOT is about tens of milliseconds, and cloud throughput increases by up to 15x.

[NLP-12] E3-Rewrite: Learning to Rewrite SQL for Executability, Equivalence and Efficiency

[Quick Read]: This paper addresses the limitations of rule-based SQL query rewriting: fixed rule sets generalize poorly to novel query patterns, struggle with complex queries, and cannot capture effective strategies such as evaluation reordering and CTE rewriting. The key contribution is E3-Rewrite, an LLM-based framework that produces executable, equivalent, and efficient rewrites. First, a context-construction module builds bottleneck-aware prompts from execution plans and retrieved demonstrations to guide inference-time rewriting; second, a multi-objective reward targets executability, equivalence, and execution cost, evaluated via syntax checks, equivalence verification, and cost estimation; third, a staged curriculum stabilizes multi-objective training by prioritizing executability and equivalence before gradually introducing efficiency. Experiments show E3-Rewrite reduces query execution time by up to 25.6% versus state-of-the-art methods across multiple SQL benchmarks and delivers up to 24.4% more successful rewrites, extending coverage to complex queries that earlier systems could not handle.

Link: https://arxiv.org/abs/2508.09023
Authors: Dongjie Xu, Yue Cui, Weijie Shi, Qingzhi Ma, Hanghui Guo, Jiaming Li, Yao Zhao, Ruiyuan Zhang, Shimin Di, Jia Zhu, Kai Zheng, Jiajie Xu
Affiliations: Soochow University; Tsinghua University; Shanghai Jiao Tong University; Zhejiang University; Peking University; Nanjing University; Fudan University
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:SQL query rewriting aims to reformulate a query into a more efficient form while preserving equivalence. Most existing methods rely on predefined rewrite rules. However, such rule-based approaches face fundamental limitations: (1) fixed rule sets generalize poorly to novel query patterns and struggle with complex queries; (2) a wide range of effective rewriting strategies cannot be fully captured by declarative rules. To overcome these issues, we propose using large language models (LLMs) to generate rewrites. LLMs can capture complex strategies, such as evaluation reordering and CTE rewriting. Despite this potential, directly applying LLMs often results in suboptimal or non-equivalent rewrites due to a lack of execution awareness and semantic grounding. To address these challenges, We present E3-Rewrite, an LLM-based SQL rewriting framework that produces executable, equivalent, and efficient queries. It integrates two core components: a context construction module and a reinforcement learning framework. First, the context module leverages execution plans and retrieved demonstrations to build bottleneck-aware prompts that guide inference-time rewriting. Second, we design a reward function targeting executability, equivalence, and efficiency, evaluated via syntax checks, equivalence verification, and cost estimation. Third, to ensure stable multi-objective learning, we adopt a staged curriculum that first emphasizes executability and equivalence, then gradually incorporates efficiency. Extensive experiments show that E3-Rewrite achieves up to a 25.6% reduction in query execution time compared to state-of-the-art methods across multiple SQL benchmarks. Moreover, it delivers up to 24.4% more successful rewrites, expanding coverage to complex queries that previous systems failed to handle.
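The staged multi-objective reward could look roughly like the following; the point values, the stage threshold, and the use of relative speedup as the efficiency term are our assumptions for illustration, not the paper's actual formula:

```python
def rewrite_reward(executable, equivalent, speedup, stage):
    """Staged multi-objective reward for an SQL rewrite (sketch).

    executable: did the rewritten query run without error?
    equivalent: did equivalence verification pass?
    speedup:    estimated relative cost reduction vs. the original query.
    stage:      curriculum stage; efficiency only counts from stage 2,
                so early training focuses on executability/equivalence.
    """
    reward = 0.0
    if executable:
        reward += 1.0
        if equivalent:
            reward += 1.0
            if stage >= 2:  # curriculum: phase efficiency in last
                reward += max(0.0, speedup)
    return reward
```

Gating equivalence behind executability (and efficiency behind both) mirrors the paper's ordering: an efficient but non-equivalent rewrite should never be rewarded.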

[NLP-13] A Survey on Training-free Alignment of Large Language Models

[Quick Read]: This paper addresses two problems with large language model (LLM) alignment: resource-intensive fine-tuning (FT) can degrade knowledge, and it is impractical when model access or compute is constrained. The key contribution is the first systematic survey of training-free (TF) alignment methods, organized by stage: pre-decoding, in-decoding, and post-decoding. These methods, such as in-context learning, decoding-time adjustment, and post-generation correction, enable effective alignment of LLMs and multimodal LLMs (MLLMs) without retraining, improving their safety and reliability in both open-source and closed-source settings.

Link: https://arxiv.org/abs/2508.09016
Authors: Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Affiliations: Wuhan University; Qilu University of Technology; Zhongguancun Academy
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:The alignment of large language models (LLMs) aims to ensure their outputs adhere to human values, ethical standards, and legal norms. Traditional alignment methods often rely on resource-intensive fine-tuning (FT), which may suffer from knowledge degradation and face challenges in scenarios where the model accessibility or computational resources are constrained. In contrast, training-free (TF) alignment techniques–leveraging in-context learning, decoding-time adjustments, and post-generation corrections–offer a promising alternative by enabling alignment without heavily retraining LLMs, making them adaptable to both open-source and closed-source environments. This paper presents the first systematic review of TF alignment methods, categorizing them by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we provide a detailed examination from the viewpoint of LLMs and multimodal LLMs (MLLMs), highlighting their mechanisms and limitations. Furthermore, we identify key challenges and future directions, paving the way for more inclusive and effective TF alignment techniques. By synthesizing and organizing the rapidly growing body of research, this survey offers a guidance for practitioners and advances the development of safer and more reliable LLMs.

[NLP-14] LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA SEMEVAL2025

[Quick Read]: This paper addresses Tabular Question Answering (TQA) in the absence of task-specific fine-tuning. The key contribution is a zero-shot code-generation pipeline in which a large language model (LLM) generates executable code that extracts the answer from tabular data. The pipeline is modular: the main code-generator module is supported by components for column selection and data-type analysis that improve extraction accuracy, plus an iterative error-feedback mechanism that, when the generated code fails, folds the error message into a new prompt and regenerates, improving robustness.

Link: https://arxiv.org/abs/2508.09012
Authors: Adrián Gude, Roi Santos-Ríos, Francisco Prado-Valiño, Ana Ezquerro, Jesús Vilares
Affiliations: Universidade da Coruña, CITIC; Departamento de Ciencias de la Computación y Tecnologías de la Información
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to SemEval 2025. Camera-ready version

Abstract:This paper describes our participation in SemEval 2025 Task 8, focused on Tabular Question Answering. We developed a zero-shot pipeline that leverages an Large Language Model to generate functional code capable of extracting the relevant information from tabular data based on an input question. Our approach consists of a modular pipeline where the main code generator module is supported by additional components that identify the most relevant columns and analyze their data types to improve extraction accuracy. In the event that the generated code fails, an iterative refinement process is triggered, incorporating the error feedback into a new generation prompt to enhance robustness. Our results show that zero-shot code generation is a valid approach for Tabular QA, achieving rank 33 of 53 in the test phase despite the lack of task-specific fine-tuning.
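The iterative refinement loop is straightforward to sketch. Here `fake_llm` stands in for the real code-generation model, and a production system would execute the generated code in a sandbox rather than a bare `exec`:

```python
import traceback

def answer_with_retries(generate_code, table, question, max_attempts=3):
    """Zero-shot code generation with iterative error feedback
    (sketch): generate code, execute it against the table, and on
    failure fold the traceback into the next generation prompt.
    NOTE: real systems must sandbox the generated code, not exec() it."""
    error = None
    for _ in range(max_attempts):
        code = generate_code(question, table, error)
        env = {"table": table}
        try:
            exec(code, env)        # generated code must define `answer`
            return env["answer"]
        except Exception:
            error = traceback.format_exc()  # becomes feedback next round
    return None

# Stub standing in for the LLM: the first attempt uses a wrong column
# name; the error feedback steers the retry to the right one.
def fake_llm(question, table, error):
    if error is None:
        return "answer = sum(row['prize'] for row in table)"  # KeyError
    return "answer = sum(row['price'] for row in table)"

table = [{"item": "a", "price": 2}, {"item": "b", "price": 3}]
print(answer_with_retries(fake_llm, table, "total price?"))  # -> 5
```

The column-selection and data-type components described above would shape the prompt passed to the generator; they are omitted here for brevity.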
zh
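该论文的迭代式错误反馈机制可以用一个极简的 Python 草图来说明(并非论文原始实现):用桩函数模拟 LLM 接口,生成的代码执行失败时,把报错信息拼入新提示再生成一次。其中 `generate_code`、`fake_llm` 等名称均为本文为示意而虚构。

```python
def tabular_qa(question, table, generate_code, max_retries=3):
    """零样本代码生成 + 迭代错误反馈的流程草图(非论文原始实现)。"""
    prompt = f"问题: {question}"
    for _ in range(max_retries):
        code = generate_code(prompt)        # 假设的 LLM 代码生成接口
        scope = {"table": table}
        try:
            exec(code, scope)               # 执行生成的代码
            return scope["answer"]          # 约定:生成的代码把结果写入 answer
        except Exception as err:
            # 失败时把报错信息并入新提示,触发下一轮重构
            prompt = f"问题: {question}\n上一次代码报错: {err}\n请修正后重新生成。"
    return None

# 用桩函数模拟 LLM:首次返回引用了错误列名的代码,收到错误反馈后才返回正确代码
def fake_llm(prompt):
    if "报错" in prompt:
        return "answer = max(row['age'] for row in table)"
    return "answer = max(row['Age'] for row in table)"   # 'Age' 列不存在 -> KeyError

rows = [{"name": "A", "age": 30}, {"name": "B", "age": 42}]
result = tabular_qa("表中最大年龄是多少?", rows, fake_llm)
print(result)  # 42
```

可以看到,第一轮生成的代码触发 KeyError 后,报错被写回提示,第二轮即生成了可执行的正确代码。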

[NLP-15] Retrospective Sparse Attention for Efficient Long-Context Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长上下文任务中因Key-Value (KV) 缓存内存占用随序列长度线性增长而导致的推理延迟瓶颈问题,特别是现有KV缓存压缩方法仅关注输入上下文而忽视解码过程中累积的注意力误差。其解决方案的关键在于提出RetroAttention机制,通过维护一个轻量级输出缓存,在后续解码步骤中利用新生成的KV条目回溯修正先前注意力输出,从而打破固定注意力输出的范式,实现对前期近似结果的持续修正,显著提升有效KV暴露度和生成准确性。

链接: https://arxiv.org/abs/2508.09001
作者: Seonghwan Choi,Beomseok Kang,Dongwon Jo,Jae-Joon Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to efficiently access more relevant context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.
zh
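RetroAttention 的核心想法(用后续到达的 KV 条目回溯修正历史查询的注意力输出)在数学上对应在线 softmax 的增量更新。下面是一个与具体模型无关的 numpy 草图,类名与接口均为本文为示意而拟,仅说明原理:为每个历史查询缓存加权和与归一化因子,新 KV 到达时只需常数时间更新,即可得到与全量注意力一致的输出。

```python
import numpy as np

class RetroOutputCache:
    """为历史查询缓存 (加权和, 归一化因子),新 KV 到达时增量修正其输出。"""
    def __init__(self, d):
        self.d = d
        self.queries, self.sums, self.norms = [], [], []

    def add_query(self, q, keys, values):
        w = np.exp(q @ keys.T / np.sqrt(self.d))    # 在当前已见 KV 上的注意力权重
        self.queries.append(q)
        self.sums.append(w @ values)
        self.norms.append(w.sum())

    def retro_update(self, k_new, v_new):
        # 新生成的 KV 条目回溯修正所有历史查询
        for i, q in enumerate(self.queries):
            w = np.exp(q @ k_new / np.sqrt(self.d))
            self.sums[i] = self.sums[i] + w * v_new
            self.norms[i] += w

    def outputs(self):
        return np.stack([s / z for s, z in zip(self.sums, self.norms)])

# 验证:先只看前 3 个 KV,再逐个回溯并入后 2 个,结果应等于对全部 5 个 KV 一次性做注意力
rng = np.random.default_rng(0)
d = 4
K, V = rng.normal(size=(5, d)), rng.normal(size=(5, d))
q = rng.normal(size=d)

cache = RetroOutputCache(d)
cache.add_query(q, K[:3], V[:3])
for j in range(3, 5):
    cache.retro_update(K[j], V[j])

w_full = np.exp(q @ K.T / np.sqrt(d))
ref = (w_full @ V) / w_full.sum()
assert np.allclose(cache.outputs()[0], ref)
```

论文实现还需处理压缩缓存下的近似与数值稳定性,这里只给出最朴素的精确形式。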

[NLP-16] Revealing the Role of Audio Channels in ASR Performance Degradation

【速读】: 该论文旨在解决预训练自动语音识别(ASR)模型在不同录音通道(recording channel)输入音频时性能显著下降的问题。以往研究常将此现象归因于训练与测试语料库之间的不匹配,但本文指出,录音通道引起的语音特征变化本身即可从根本上损害ASR性能。解决方案的关键在于提出一种归一化技术,通过将ASR模型内部特征表示对齐到干净参考通道的特征表示,从而有效缓解通道差异带来的影响,显著提升模型在未见通道和语言上的泛化能力。

链接: https://arxiv.org/abs/2508.08967
作者: Kuan-Tang Huang,Li-Wei Chen,Hung-Shin Lee,Berlin Chen,Hsin-Min Wang
机构: National Taiwan Normal University (国立台湾师范大学); National Tsing Hua University (国立清华大学); Academia Sinica (中央研究院); United-Link Co. Ltd. (联合资讯股份有限公司)
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to IEEE ASRU 2025

点击查看摘要

Abstract:Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. While previous studies have demonstrated this phenomenon, it is often attributed to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language differences.
zh
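论文提出把 ASR 模型的内部特征对齐到干净参考通道的表示。下面用逐维均值/方差匹配给出一个简化草图(论文的对齐方式可能更复杂,此处函数名与实现均为本文假设):

```python
import numpy as np

def align_to_reference(feats, ref_mean, ref_std, eps=1e-8):
    """把某录音通道的特征逐维对齐到干净参考通道的统计量(简化示意)。

    feats: [N, D],该通道下抽取的 N 个内部特征向量。"""
    mu, sigma = feats.mean(axis=0), feats.std(axis=0)
    return (feats - mu) / (sigma + eps) * ref_std + ref_mean

# 构造一个被通道线性失真污染的特征矩阵,对齐后应恢复到参考通道的统计量
rng = np.random.default_rng(1)
clean = rng.normal(size=(2000, 8))                  # 干净参考通道特征
ref_mean, ref_std = clean.mean(axis=0), clean.std(axis=0)
noisy = clean * 3.0 + 5.0                           # 模拟录音通道引入的缩放与偏移
aligned = align_to_reference(noisy, ref_mean, ref_std)
```

由于这里的通道失真是逐维线性的,对齐后的特征与干净特征完全一致;真实通道差异是非线性的,故论文在模型内部逐层做对齐。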

[NLP-17] Jointly Generating and Attributing Answers using Logits of Document-Identifier Tokens

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成回答时存在的幻觉(hallucination)问题,尤其是提升生成内容的可追溯性与可信度,即确保生成答案与其来源文档之间的忠实对应关系(faithfulness)。现有方法多依赖内部模型信号来反映决策过程,但存在引入额外延迟以及难以直接对齐 token 生成与 attribution 生成的问题。论文提出 LoDIT 方法,其核心创新在于:在检索增强生成(Retrieval-Augmented Generation, RAG)框架下,通过标记文档的特定 token 并利用这些 token 的 logits 来实时估算每个文档对答案生成的贡献,并将这些贡献聚合为最终的文档归属 attribution。该方法实现了生成与 attributions 的联合优化,在保证效率的同时显著提升了答案的可信度和一致性。

链接: https://arxiv.org/abs/2508.08942
作者: Lucas Albarede,Jose Moreno,Lynda Tamine,Luce Lefeuvre
机构: Université de Toulouse, IRIT (IRIT); SNCF (SNCF)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Despite their impressive performances, Large Language Models (LLMs) remain prone to hallucination, which critically undermines their trustworthiness. While most of the previous work focused on tackling answer and attribution correctness, a recent line of work investigated faithfulness, with a focus on leveraging internal model signals to reflect a model’s actual decision-making process while generating the answer. Nevertheless, these methods induce additional latency and have shown limitations in directly aligning token generation with attribution generation. In this paper, we introduce LoDIT, a method that jointly generates and faithfully attributes answers in RAG by leveraging specific token logits during generation. It consists of two steps: (1) marking the documents with specific token identifiers and then leveraging the logits of these tokens to estimate the contribution of each document to the answer during generation, and (2) aggregating these contributions into document attributions. Experiments on a trustworthiness-focused attributed text-generation benchmark, Trust-Align, show that LoDIT significantly outperforms state-of-the-art models on several metrics. Finally, an in-depth analysis of LoDIT shows both its efficiency in terms of latency and its robustness in different settings.
zh
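LoDIT 第二步的聚合思路可以用一个小草图说明:给定每个生成步上各文档标识 token 的 logit,先在每步内做 softmax 得到文档贡献分布,再跨步聚合为最终归属分数。函数名与"按步取均值"的聚合方式均为本文为示意所作的假设。

```python
import numpy as np

def doc_attributions(step_logits, doc_ids):
    """由文档标识 token 的逐步 logit 估计各文档的归属分数(简化示意)。

    step_logits: [T, D],T 个生成步、D 个文档标识 token 的 logit。"""
    shifted = step_logits - step_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    scores = probs.mean(axis=0)                     # 跨生成步聚合
    order = np.argsort(-scores)
    return [(doc_ids[i], float(scores[i])) for i in order]

# 玩具例子:三个生成步上,文档 D2 的标识 token logit 始终最高
logits = np.array([[0.1, 2.0, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.0, 2.5, 0.4]])
attrib = doc_attributions(logits, ["D1", "D2", "D3"])
print(attrib[0][0])  # D2
```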

[NLP-18] Train Long, Think Short: Curriculum Learning for Efficient Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中因固定长度预算导致的效率与准确性难以平衡的问题。现有方法采用固定的token预算进行训练,无法利用学习过程中从探索到压缩的自然演化规律,从而限制了模型生成高效且准确推理链的能力。其解决方案的关键在于提出一种基于组相对策略优化(Group Relative Policy Optimization, GRPO)的课程学习(Curriculum Learning)策略,通过动态调整token预算——初始阶段给予较大预算以促进策略发现,随后逐步收紧预算以引导模型提炼出更紧凑的推理路径——并设计一个融合任务正确性(由验证器反馈提供)、长度效率和格式规范性(通过结构化标签实现)的多信号奖励函数,有效提升了模型在相同最终预算下的推理准确率和token利用率。

链接: https://arxiv.org/abs/2508.08940
作者: Hasan Abed Al Kader Hammoud,Kumail Alhamoud,Abed Hammoud,Elie Bou-Zeid,Marzyeh Ghassemi,Bernard Ghanem
机构: King Abdullah University of Science and Technology (KAUST); Massachusetts Institute of Technology (MIT); Princeton University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under Review

点击查看摘要

Abstract:Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. We further ablate the impact of reward weighting and decay schedule design, showing that progressive constraint serves as a powerful inductive bias for training efficient reasoning models. Our code and checkpoints are released at: this https URL.
zh
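文中"预算先松后紧 + 三信号奖励"的设计可以草绘如下。衰减日程取线性、权重取固定值仅为示意,论文对奖励权重与衰减日程另有消融实验:

```python
def token_budget(step, total_steps, start=1024, end=256):
    """随训练线性收紧的 token 预算(衰减日程仅为示意)。"""
    frac = step / max(total_steps - 1, 1)
    return int(round(start + (end - start) * frac))

def reasoning_reward(correct, length, budget, well_formatted,
                     w_len=0.3, w_fmt=0.1):
    """正确性(验证器反馈)+ 长度效率 + 格式规范的三信号奖励,权重为假设值。"""
    r_len = max(0.0, 1.0 - length / budget)     # 相对预算越短,效率奖励越高
    return float(correct) + w_len * r_len + w_fmt * float(well_formatted)

print(token_budget(0, 10), token_budget(9, 10))   # 1024 256
```

在同等正确且格式合规的前提下,更短的推理链获得更高奖励,这正是引导模型"先探索、后压缩"的归纳偏置。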

[NLP-19] Reveal-Bangla: A Dataset for Cross-Lingual Multi-Step Reasoning Evaluation AACL2025

【速读】: 该论文旨在解决当前语言模型在多步推理任务中的评估主要局限于高资源语言(如英语)的问题,特别是缺乏对低资源语言(如孟加拉语)的系统性评测。其解决方案的关键在于构建一个手动翻译的孟加拉语多步推理数据集,源自英文Reveal数据集,并包含二元和非二元问题类型;通过控制实验对比以英语为中心和以孟加拉语为中心的多语言小型语言模型在原始英文数据集与翻译后的孟加拉语数据集上的表现,从而分析模型在不同语言中利用相关推理步骤的能力差异。

链接: https://arxiv.org/abs/2508.08933
作者: Khondoker Ittehadul Islam,Gabriele Sarti
机构: Center for Language and Cognition (CLCG), University of Groningen (格罗宁根大学语言与认知中心)
类目: Computation and Language (cs.CL)
备注: Submitted to IJCNLP-AACL 2025

点击查看摘要

Abstract:Language models have demonstrated remarkable performance on complex multi-step reasoning tasks. However, their evaluation has been predominantly confined to high-resource languages such as English. In this paper, we introduce a manually translated Bangla multi-step reasoning dataset derived from the English Reveal dataset, featuring both binary and non-binary question types. We conduct a controlled evaluation of English-centric and Bangla-centric multilingual small language models on the original dataset and our translated version to compare their ability to exploit relevant reasoning steps to produce correct answers. Our results show that, in comparable settings, reasoning context is beneficial for more challenging non-binary questions, but models struggle to employ relevant Bangla reasoning steps effectively. We conclude by exploring how reasoning steps contribute to models’ predictions, highlighting different trends across models and languages.
zh

[NLP-20] Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning

【速读】: 该论文旨在解决低资源语言(如阿拉伯语)在自动语音识别(ASR)系统开发中面临的挑战,尤其是因标注数据稀缺和方言多样性带来的性能瓶颈。其解决方案的关键在于构建一个可扩展的训练流程,该流程结合弱监督学习与监督微调:首先利用15,000小时弱标签语音(涵盖现代标准阿拉伯语MSA和多种方言DA)进行预训练,随后通过持续监督微调策略,融合过滤后的弱标签数据与少量高质量标注数据,从而显著提升模型鲁棒性和识别准确率。这一方法在多方言阿拉伯语ASR挑战赛中取得第一名,验证了弱监督与微调协同在缓解数据稀缺问题上的有效性。

链接: https://arxiv.org/abs/2508.08912
作者: Mahmoud Salhab,Shameed Sait,Mohammad Abusheikh,Hasan Abusheikh
机构: CNTXT AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic speech recognition (ASR) plays a vital role in enabling natural human-machine interaction across applications such as virtual assistants, industrial automation, customer support, and real-time transcription. However, developing accurate ASR systems for low-resource languages like Arabic remains a significant challenge due to limited labeled data and the linguistic complexity introduced by diverse dialects. In this work, we present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. In the first stage, we pretrain the model on 15,000 hours of weakly labeled speech covering both Modern Standard Arabic (MSA) and various Dialectal Arabic (DA) variants. In the subsequent stage, we perform continual supervised fine-tuning using a mixture of filtered weakly labeled data and a small, high-quality annotated dataset. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge. These findings highlight the effectiveness of weak supervision paired with fine-tuning in overcoming data scarcity and delivering high-quality ASR for low-resource, dialect-rich languages.
zh

[NLP-21] ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因自回归解码(autoregressive decoding)的串行特性导致的高延迟问题。其核心创新在于提出一种自适应串行-并行解码(Adaptive Serial-Parallel Decoding, ASPD)框架,关键在于两个方面:一是通过非侵入式流水线自动提取并验证自回归模型输出中的内在并行结构(intrinsic parallelism),实现并行可解码片段的自动化构建;二是设计混合解码引擎(Hybrid Decoding Engine),支持串行与并行解码模式间的无缝切换,并复用KV缓存(KV cache),从而在保证生成质量的前提下显著提升推理效率。实验表明,ASPD在通用任务、检索增强生成和数学推理等场景中均实现了显著加速(如Vicuna Bench上平均1.85倍加速,峰值达3.19倍),且生成质量与传统自回归方法相差不超过1%。

链接: https://arxiv.org/abs/2508.08895
作者: Keyu Chen,Zhifeng Shen,Daohai Yu,Haoqian Wu,Wei Wen,Jianfeng He,Ruizhi Qiao,Xing Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures

点击查看摘要

Abstract:The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e. parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose an Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine which enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, Mathematical Reasoning, demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within 1% difference compared to autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
zh

[NLP-22] Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化场景中因对欠资源文化(low-resource cultures)过度泛化而导致的文化理解偏差问题,特别是揭示LLMs内部表征机制如何导致文化偏见(如西方主导偏见和文化扁平化)的形成。其解决方案的关键在于提出CultureScope——首个基于机制可解释性的方法,通过补丁(patching)技术提取LLMs内部的文化知识空间,并引入“文化扁平化得分”作为衡量内在文化偏见的指标,从而实现对文化偏见来源的可追溯分析。

链接: https://arxiv.org/abs/2508.08879
作者: Haeun Yu,Seogyeong Jeong,Siddhesh Pawar,Jisu Shin,Jiho Jin,Junho Myung,Alice Oh,Isabelle Augenstein
机构: University of Copenhagen (哥本哈根大学); KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 16 pages, 7 figures

点击查看摘要

Abstract:The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ representations impacts their cultural understanding. Prior work only performs extrinsic evaluation of LLMs’ cultural competence, without accounting for how LLMs’ internal mechanisms lead to cultural (mis)representation. To bridge this gap, we propose CultureScope, the first mechanistic interpretability-based method that probes the internal representations of LLMs to elicit the underlying cultural knowledge space. CultureScope utilizes a patching method to extract the cultural knowledge. We introduce a cultural flattening score as a measure of the intrinsic cultural biases. Additionally, we study how LLMs internalize Western-dominance bias and cultural flattening, which allows us to trace how cultural biases emerge within LLMs. Our experimental results reveal that LLMs encode Western-dominance bias and cultural flattening in their cultural knowledge space. We find that low-resource cultures are less susceptible to cultural biases, likely due to their limited training resources. Our work provides a foundation for future research on mitigating cultural biases and enhancing LLMs’ cultural understanding. Our codes and data used for experiments are publicly available.
zh

[NLP-23] Weakly Supervised Fine-grained Span-Level Framework for Chinese Radiology Report Quality Assurance CIKM2025

【速读】: 该论文旨在解决放射科报告质量保证(Quality Assurance, QA)过程中对资深医生依赖性强、人工成本高且评分可能存在偏差的问题。现有方法通常基于文档级别的语义对比,难以精准捕捉报告中的细微差异。其解决方案的关键在于提出一种细粒度的Span-level Quality Assurance EvaluaTOR(Sqator),通过分析初稿与修订稿之间被修改文本片段(revised spans)的重要性来量化QA分数,并将所有片段得分融合以输出最终评分,从而实现更准确、自动化的QA评估。

链接: https://arxiv.org/abs/2508.08876
作者: Kaiyu Wang,Lin Mu,Zhiyao Yang,Ximing Li,Xiaotang Zhou,Wanfu Gao,Huimao Zhang
机构: Jilin University (吉林大学); Changchun University of Technology (长春工业大学)
类目: Computation and Language (cs.CL)
备注: Accepted by CIKM 2025. 11 pages, 7 figures

点击查看摘要

Abstract:Quality Assurance (QA) for radiology reports refers to judging whether the junior reports (written by junior doctors) are qualified. The QA scores of one junior report are given by the senior doctor(s) after reviewing the image and junior report. This process requires intensive labor costs for senior doctors. Additionally, the QA scores may be inaccurate for reasons like diagnosis bias, the ability of senior doctors, and so on. To address this issue, we propose a Span-level Quality Assurance EvaluaTOR (Sqator) to mark QA scores automatically. Unlike the common document-level semantic comparison method, we try to analyze the semantic difference by exploring more fine-grained text spans. Specifically, Sqator measures QA scores by assessing the importance of revised spans between junior and senior reports, and outputs the final QA scores by merging all revised span scores. We evaluate Sqator using a collection of 12,013 radiology reports. Experimental results show that Sqator can achieve competitive QA scores. Moreover, the importance scores of revised spans can also be consistent with the judgments of senior doctors.
zh
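"定位修改片段并合并片段分数"这一流程可以用标准库 difflib 草绘(仅为示意,论文中的片段定位与重要性打分由模型完成,此处的 `importance` 桩函数为本文假设):

```python
import difflib

def revised_spans(junior, senior):
    """定位初稿与修订稿之间被修改的文本片段(用 difflib 近似)。"""
    sm = difflib.SequenceMatcher(a=junior, b=senior)
    return [(junior[i1:i2], senior[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def qa_score(junior, senior, span_importance):
    """将各修改片段的重要性分数合并为最终 QA 分数(简单平均,仅示意)。"""
    spans = revised_spans(junior, senior)
    if not spans:
        return 1.0                      # 无修改:报告完全合格
    penalty = sum(span_importance(a, b) for a, b in spans) / len(spans)
    return max(0.0, 1.0 - penalty)

# 重要性打分用修改文本长度做桩;实际系统中由模型基于语义给出
importance = lambda a, b: min(1.0, (len(a) + len(b)) / 20)
score = qa_score("左肺下叶见小结节。", "左肺下叶见磨玻璃样小结节。", importance)
```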

[NLP-24] BiasGym: Fantastic Biases and How to Find (and Remove) Them

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中编码的偏见与刻板印象难以系统识别和缓解的问题。由于偏见行为往往隐蔽且非显性,传统方法在诱发和分析此类偏差时面临挑战。解决方案的关键在于提出BiasGym框架,其核心由两个组件构成:BiasInject通过基于token的微调向冻结模型注入特定概念关联,实现可控的偏见引入;BiasScope则利用这些注入信号定位并调控导致偏见行为的模型组件,从而支持机制层面的分析与靶向去偏。该方法能够在不损害下游任务性能的前提下有效降低现实中的刻板印象(如“某国人群是鲁莽司机”),同时适用于训练阶段未见过的虚构关联(如“某国人群有蓝皮肤”),具备良好的泛化能力与应用价值。

链接: https://arxiv.org/abs/2508.08855
作者: Sekh Mainul Islam,Nadav Borenstein,Siddhesh Milind Pawar,Haeun Yu,Arnav Arora,Isabelle Augenstein
机构: University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Understanding biases and stereotypes encoded in the weights of Large Language Models (LLMs) is crucial for developing effective mitigation strategies. Biased behaviour is often subtle and non-trivial to isolate, even when deliberately elicited, making systematic analysis and debiasing particularly challenging. To address this, we introduce BiasGym, a simple, cost-effective, and generalizable framework for reliably injecting, analyzing, and mitigating conceptual associations within LLMs. BiasGym consists of two components: BiasInject, which injects specific biases into the model via token-based fine-tuning while keeping the model frozen, and BiasScope, which leverages these injected signals to identify and steer the components responsible for biased behavior. Our method enables consistent bias elicitation for mechanistic analysis, supports targeted debiasing without degrading performance on downstream tasks, and generalizes to biases unseen during training. We demonstrate the effectiveness of BiasGym in reducing real-world stereotypes (e.g., people from a country being 'reckless drivers') and in probing fictional associations (e.g., people from a country having 'blue skin'), showing its utility for both safety interventions and interpretability research.
zh

[NLP-25] Steering Towards Fairness: Mitigating Political Bias in LLM s

【速读】: 该论文旨在解决生成式 AI(Generative AI)中Decoder-based大语言模型(LLM)存在的政治与经济维度意识形态偏见问题,这类偏见可能通过内部表征被编码并影响输出内容。其解决方案的关键在于提出一种基于政治光谱测试(Political Compass Test, PCT)的探查与缓解框架,利用对比样本提取和比较模型各隐藏层激活值,构建分层激活分析管道以识别多轴意识形态差异,并进一步利用这些差异生成定向向量(steering vectors)实现对偏见的有效干预,从而在模型内部表征层面实现系统性去偏,而非仅依赖表面输出修正。

链接: https://arxiv.org/abs/2508.08846
作者: Afrozah Nadeem,Mark Dras,Usman Naseem
机构: Macquarie University (麦考瑞大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have enabled their widespread use across diverse real-world applications. However, concerns remain about their tendency to encode and reproduce ideological biases, particularly along political and economic dimensions. In this paper, we propose a framework for probing and mitigating such biases in decoder-based LLMs through analysis of internal model representations. Grounded in the Political Compass Test (PCT), our method uses contrastive pairs to extract and compare hidden layer activations from models like Mistral and DeepSeek. We introduce a comprehensive activation extraction pipeline capable of layer-wise analysis across multiple ideological axes, revealing meaningful disparities linked to political framing. Our results show that decoder LLMs systematically encode representational bias across layers, which can be leveraged for effective steering vector-based mitigation. This work provides new insights into how political bias is encoded in LLMs and offers a principled approach to debiasing beyond surface-level output interventions.
zh
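"对比样本提取激活差、构造引导向量(steering vector)"的一般做法可以用 numpy 草绘如下。取对比激活均值之差作为偏见方向、推理时投影剔除,是该类方法的常见简化形式,具体实现与论文可能不同:

```python
import numpy as np

def bias_direction(acts_pos, acts_neg):
    """取对比样本隐层激活差的均值,作为归一化的偏见方向(简化示意)。"""
    v = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return v / np.linalg.norm(v)

def debias(h, v_hat, alpha=1.0):
    """推理时从激活 h 中剔除偏见方向上的分量。"""
    return h - alpha * (h @ v_hat) * v_hat

# 构造带有明显偏见方向的玩具激活
rng = np.random.default_rng(2)
v_true = np.array([1.0, 0.0, 0.0, 0.0])
base = rng.normal(size=(64, 4))
acts_pos = base + 2.0 * v_true          # 偏向光谱一端的激活
acts_neg = base - 2.0 * v_true          # 偏向另一端的激活
v_hat = bias_direction(acts_pos, acts_neg)
h = rng.normal(size=4)
h_debiased = debias(h, v_hat)           # 处理后在偏见方向上的分量为零
```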

[NLP-26] An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在数学推理能力评估中缺乏对鲁棒性(robustness)考量的问题,即现有方法难以衡量模型在面对非数学因素扰动时的稳定性。其解决方案的关键在于提出一种系统性的评估框架,通过构造数学等价但语言和参数形式不同的变体问题来“压力测试”LLMs,从而量化模型对非数学扰动的敏感度。该方法显著提升了对模型真实数学推理能力的洞察力,并基于此构建了PutnamGAP基准数据集,用于实证验证不同模型的鲁棒性差异,揭示出主流模型如OpenAI的O3在表面变体和核心步骤变体上均出现明显性能下降,凸显了当前LLMs在数学推理中的脆弱性。

链接: https://arxiv.org/abs/2508.08833
作者: Yuren Hao,Xiang Wan,Chengxiang Zhai
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 8 figures

点击查看摘要

Abstract:In this paper, we introduce a systematic framework beyond conventional method to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants, and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.
zh
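"数学等价但措辞与参数不同"的变体构造思路可以用一个玩具模板草绘(与 PutnamGAP 的实际构造方法无关,模板与人名均为本文虚构):保持答案函数不变,只改写表面措辞与参数,即得到一组等价的压力测试题。

```python
import random
import re

def make_variants(template, answer_fn, n=3, seed=0):
    """对同一道题做措辞与参数改写,保持数学等价(示意)。"""
    rng = random.Random(seed)
    names = ["小明", "小红", "一名工人", "一名学生"]
    variants = []
    for _ in range(n):
        a, b = rng.randint(2, 9), rng.randint(2, 9)
        variants.append((template.format(who=rng.choice(names), a=a, b=b),
                         answer_fn(a, b)))
    return variants

variants = make_variants("{who}每天读{a}页书,连续读{b}天,一共读了多少页?",
                         lambda a, b: a * b)
for text, ans in variants:
    x, y = map(int, re.findall(r"\d+", text))
    assert x * y == ans          # 表面不同,数学等价
```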

[NLP-27] TiMoE: Time-Aware Mixture of Language Experts

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)因使用固定时间截面的网络数据进行训练而导致的知识过时与时间泄露(temporal leakage)问题,即模型可能依赖于查询时间点之后的信息进行预测,从而产生时序上的逻辑错误。其解决方案的关键在于采用模块化的时间分段预训练策略:将2013–2024年的语料库划分为不重叠的两年时间段,分别训练多个GPT风格的专家模型(GPT-style experts),并在推理阶段通过TiMoE(Time-aware Mixture of Language Experts)机制实现因果路由——即根据查询时间戳屏蔽训练窗口晚于该时间的专家,并在共享空间中合并剩余专家的对数概率,确保严格的时间因果性,同时保留多时期知识的广度。实验表明,该方法在标准NLP任务和TSQA基准上均优于单一时期专家模型,并显著减少未来知识错误达15%。

链接: https://arxiv.org/abs/2508.08827
作者: Robin Faro,Dongyang Fan,Tamar Alphaidze,Martin Jaggi
机构: EPFL(瑞士联邦理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are typically trained on fixed snapshots of the web, which means that their knowledge becomes stale and their predictions risk temporal leakage: relying on information that lies in the future relative to a query. We tackle this problem by pre-training from scratch a set of GPT-style experts on disjoint two-year slices of a 2013-2024 corpus and combining them through TiMoE, a Time-aware Mixture of Language Experts. At inference time, TiMoE masks all experts whose training window ends after the query timestamp and merges the remaining log-probabilities in a shared space, guaranteeing strict causal validity while retaining the breadth of multi-period knowledge. We also release TSQA, a 10k-question benchmark whose alternatives are explicitly labelled as past, future or irrelevant, allowing fine-grained measurement of temporal hallucinations. Experiments on eight standard NLP tasks plus TSQA show that a co-adapted TiMoE variant matches or exceeds the best single-period expert and cuts future-knowledge errors by up to 15%. Our results demonstrate that modular, time-segmented pre-training paired with causal routing is a simple yet effective path toward LLMs that stay chronologically grounded without sacrificing general performance much. We open source our code at TiMoE (Github): this https URL
zh
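TiMoE 推理时的因果路由可以草绘为:先屏蔽训练窗口晚于查询时间戳的专家,再在共享空间中合并其余专家的 log 概率。此处用"取均值后重新归一化"作为合并规则,仅为示意,论文中的合并方式可能不同:

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def timoe_merge(expert_logits, expert_end_years, query_year):
    """屏蔽训练窗口晚于查询时间的专家,平均其余专家的 log 概率后重新归一化。"""
    valid = [log_softmax(lg) for lg, end in zip(expert_logits, expert_end_years)
             if end <= query_year]
    if not valid:
        raise ValueError("没有任何专家对该时间戳因果有效")
    merged = np.mean(np.stack(valid), axis=0)
    return merged - np.log(np.exp(merged).sum())     # 归一化回 log 概率

experts = [np.array([1.0, 2.0, 0.5]),   # 训练窗口截止 2014
           np.array([0.2, 0.1, 3.0]),   # 训练窗口截止 2018
           np.array([2.5, 0.3, 0.1])]   # 训练窗口截止 2024
ends = [2014, 2018, 2024]
lp = timoe_merge(experts, ends, query_year=2015)     # 只有 2014 专家可用
```

查询时间戳为 2015 时,仅第一位专家因果有效,合并结果退化为它自身的 log 概率分布,从而严格杜绝"未来知识"泄露。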

[NLP-28] A Dual-Axis Taxonomy of Knowledge Editing for LLM s: From Mechanisms to Functions

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中知识更新效率低的问题,即模型虽从海量文本中学习到丰富知识,但这些知识可能过时或不准确,而完全重新训练成本过高。为此,论文提出一种基于功能的新型分类法(function-based taxonomy),将知识编辑方法按其作用的知识类型(如事实性、时间性、概念性、常识性和社会性知识)进行划分,从而更系统地理解不同编辑机制在各类知识上的有效性差异。其解决方案的关键在于:通过结合机制维度(如参数修改 vs. 外部记忆)与知识功能维度的双轴分析框架,揭示编辑效果与目标知识性质之间的内在关联,进而为优化知识编辑策略提供理论依据和实践指导。

链接: https://arxiv.org/abs/2508.08795
作者: Amir Mohammad Salehoof,Ali Ramezani,Yadollah Yaghoobzadeh,Majid Nili Ahmadabadi
机构: University of Tehran (德黑兰大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 1 figure

点击查看摘要

Abstract:Large language models (LLMs) acquire vast knowledge from large text corpora, but this information can become outdated or inaccurate. Since retraining is computationally expensive, knowledge editing offers an efficient alternative – modifying internal knowledge without full retraining. These methods aim to update facts precisely while preserving the model’s overall capabilities. While existing surveys focus on the mechanism of editing (e.g., parameter changes vs. external memory), they often overlook the function of the knowledge being edited. This survey introduces a novel, complementary function-based taxonomy to provide a more holistic view. We examine how different mechanisms apply to various knowledge types – factual, temporal, conceptual, commonsense, and social – highlighting how editing effectiveness depends on the nature of the target knowledge. By organizing our review along these two axes, we map the current landscape, outline the strengths and limitations of existing methods, define the problem formally, survey evaluation tasks and datasets, and conclude with open challenges and future directions.
zh

[NLP-29] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在工具使用(tool use)方面进展受限的问题,其核心挑战在于缺乏专为工具使用设计的高效强化学习(Reinforcement Learning, RL)框架,具体表现为训练环境构建不稳定以及奖励机制难以验证。解决方案的关键在于提出一个自动化环境构建流程,涵盖场景分解、文档生成、函数集成、复杂度缩放和本地化部署,从而创建高质量、可测量反馈的训练环境;同时引入一种可验证的奖励机制,能够同时评估工具使用的精确性和任务执行的完整性,并与轨迹数据结合后无缝适配标准RL算法,实现基于反馈的模型训练优化。实验表明,该方法显著提升了LLMs的工具使用能力,且不损害其通用性能,其提升源于低层MLP参数更新所驱动的上下文理解与推理能力增强。

链接: https://arxiv.org/abs/2508.08791
作者: Junjie Ye,Changhao Jiang,Zhengyin Du,Yufei Xu,Xuesong Yao,Zhiheng Xi,Xiaoran Fan,Qi Zhang,Xuanjing Huang,Jiecao Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective tool use is essential for large language models (LLMs) to interact meaningfully with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities, regardless of inference modes or training algorithms. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models.
zh
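文中"同时评估工具使用精确性与任务执行完整性"的可验证奖励,可以用一个最小草图说明(权重与具体打分方式为本文假设,论文另有细节):

```python
def verifiable_reward(called_tools, expected_tools, done_steps, total_steps,
                      w_tool=0.5, w_task=0.5):
    """工具调用精确率 + 任务完成度两部分组成的可验证奖励(权重为假设值)。"""
    called, expected = set(called_tools), set(expected_tools)
    precision = len(called & expected) / len(called) if called else 0.0
    completeness = done_steps / max(total_steps, 1)
    return w_tool * precision + w_task * completeness

perfect = verifiable_reward(["search", "book"], ["search", "book"], 4, 4)
partial = verifiable_reward(["search", "chat"], ["search", "book"], 2, 4)
print(perfect, partial)   # 1.0 0.5
```

两部分都可由环境直接核验,无需额外模型打分,因此可与轨迹数据一起无缝接入标准 RL 算法。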

[NLP-30] Privacy-protected Retrieval-Augmented Generation for Knowledge Graph Question Answering

【速读】: 该论文旨在解决在检索增强生成(Retrieval-Augmented Generation, RAG)系统中使用私有知识图谱(Knowledge Graph, KG)时面临的隐私泄露问题,即当实体对大语言模型(Large Language Models, LLMs)匿名化后,传统RAG方法因无法匹配语义而失效。解决方案的关键在于提出一种名为ARoG的新框架,其核心包括两个策略:一是基于关系中心的抽象策略(relation-centric abstraction),通过动态捕获邻接关系语义将匿名实体转化为高阶概念以补充可检索语义;二是基于结构导向的抽象策略(structure-oriented abstraction),将自然语言问题转化为结构化的抽象概念路径,从而与KG中的抽象概念高效对齐,实现隐私保护前提下的有效知识检索。这两个策略在保障隐私不暴露给LLMs的同时,显著提升了检索性能。

链接: https://arxiv.org/abs/2508.08785
作者: Yunfeng Ning,Mayi Xu,Jintao Wen,Qiankun Pi,Yuanyuan Zhu,Ming Zhong,Jiawei Jiang,Tieyun Qian
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:LLMs often suffer from hallucinations and outdated or incomplete knowledge. RAG is proposed to address these issues by integrating external knowledge like that in KGs into LLMs. However, leveraging private KGs in RAG systems poses significant privacy risks due to the black-box nature of LLMs and potential insecure data transmission, especially when using third-party LLM APIs lacking transparency and control. In this paper, we investigate the privacy-protected RAG scenario for the first time, where entities in KGs are anonymous for LLMs, thus preventing them from accessing entity semantics. Due to the loss of semantics of entities, previous RAG systems cannot retrieve question-relevant knowledge from KGs by matching questions with the meaningless identifiers of anonymous entities. To realize an effective RAG system in this scenario, two key challenges must be addressed: (1) How can anonymous entities be converted into retrievable information. (2) How to retrieve question-relevant anonymous entities. Hence, we propose a novel ARoG framework including relation-centric abstraction and structure-oriented abstraction strategies. For challenge (1), the first strategy abstracts entities into high-level concepts by dynamically capturing the semantics of their adjacent relations. It supplements meaningful semantics which can further support the retrieval process. For challenge (2), the second strategy transforms unstructured natural language questions into structured abstract concept paths. These paths can be more effectively aligned with the abstracted concepts in KGs, thereby improving retrieval performance. To guide LLMs to effectively retrieve knowledge from KGs, the two strategies strictly protect privacy from being exposed to LLMs. Experiments on three datasets demonstrate that ARoG achieves strong performance and privacy-robustness.
zh

[NLP-31] Designing Memory-Augmented AR Agents for Spatiotemporal Reasoning in Personalized Task Assistance

【Quick Read】: This paper addresses a limitation of intelligent agents in current Augmented Reality (AR) systems: when handling complex multi-step tasks, they struggle to exploit a user's long-term interaction history and preferences to provide personalized support. The root cause is that existing AR agents cannot effectively capture, store, and reason over historical user interactions in spatiotemporal context. The key to the solution is a conceptual framework for memory-augmented AR agents comprising four interconnected modules: a Perception Module for multimodal sensor processing, a Memory Module for persistent spatiotemporal experience storage, a Spatiotemporal Reasoning Module that synthesizes past and present contexts for decision-making, and an Actuator Module for effective AR interaction output. With this structured design, the system can continuously learn from user-specific past experience and adapt dynamically, enabling more intelligent and personalized task assistance.

Link: https://arxiv.org/abs/2508.08774
Authors: Dongwook Choi, Taeyoon Kwon, Dongil Yang, Hyojun Kim, Jinyoung Yeo
Affiliations: Yonsei University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 7 pages, 2 figures

Abstract:Augmented Reality (AR) systems are increasingly integrating foundation models, such as Multimodal Large Language Models (MLLMs), to provide more context-aware and adaptive user experiences. This integration has led to the development of AR agents to support intelligent, goal-directed interactions in real-world environments. While current AR agents effectively support immediate tasks, they struggle with complex multi-step scenarios that require understanding and leveraging user’s long-term experiences and preferences. This limitation stems from their inability to capture, retain, and reason over historical user interactions in spatiotemporal contexts. To address these challenges, we propose a conceptual framework for memory-augmented AR agents that can provide personalized task assistance by learning from and adapting to user-specific experiences over time. Our framework consists of four interconnected modules: (1) Perception Module for multimodal sensor processing, (2) Memory Module for persistent spatiotemporal experience storage, (3) Spatiotemporal Reasoning Module for synthesizing past and present contexts, and (4) Actuator Module for effective AR communication. We further present an implementation roadmap, a future evaluation strategy, a potential target application and use cases to demonstrate the practical applicability of our framework across diverse domains. We aim for this work to motivate future research toward developing more intelligent AR systems that can effectively bridge user’s interaction history with adaptive, context-aware task assistance.
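The paper proposes a conceptual framework only; as a loose illustration, the four modules could interact roughly as in the hypothetical class below. Every name, method, and return value here is an assumption for illustration, not the authors' design.

```python
# Hypothetical sketch of the four-module loop: Perception -> Memory ->
# Spatiotemporal Reasoning -> Actuator. All names are illustrative.

class MemoryAugmentedARAgent:
    def __init__(self):
        self.memory = []                       # persistent spatiotemporal store

    def perceive(self, sensor_frame):
        # Perception Module: turn raw multimodal input into a what/where/when event.
        return {"what": sensor_frame, "where": "kitchen", "when": 1}

    def remember(self, event):
        self.memory.append(event)              # Memory Module

    def reason(self, current):
        # Spatiotemporal Reasoning Module: combine past events with the present.
        past = [e for e in self.memory if e["where"] == current["where"]]
        return f"seen here {len(past)} time(s) before"

    def act(self, message):
        return f"AR overlay: {message}"        # Actuator Module

agent = MemoryAugmentedARAgent()
obs = agent.perceive("mug on counter")
agent.remember(obs)
hint = agent.act(agent.reason(agent.perceive("mug on counter")))
```

The point of the sketch is the data flow: perception output is both stored and later retrieved by location, so the rendered AR hint depends on accumulated user history.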

[NLP-32] DevNous: An LLM-Based Multi-Agent System for Grounding IT Project Management in Unstructured Conversation

【Quick Read】: This paper tackles a bottleneck in information systems management: the manual translation of unstructured team dialogue into the structured artifacts required for IT project governance. The key to the solution is DevNous, an LLM-based multi-agent expert system that integrates seamlessly into team chat environments, identifies actionable intents from informal dialogue, and manages stateful multi-turn workflows for core administrative tasks such as task formalization and progress summary synthesis. A new benchmark of 160 realistic interactive conversational turns is introduced for quantitative evaluation, on which DevNous achieves an exact-match turn accuracy of 81.3% and a multiset F1 score of 0.845, demonstrating its viability and effectiveness.

Link: https://arxiv.org/abs/2508.08761
Authors: Stavros Doropoulos (1), Stavros Vologiannidis (1), Ioannis Magnisalis (2) ((1) Department of Computer, Informatics and Telecommunications Engineering, International Hellenic University, (2) DG Informatics, European Commission, Brussels, Belgium)
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The manual translation of unstructured team dialogue into the structured artifacts required for Information Technology (IT) project governance is a critical bottleneck in modern information systems management. We introduce DevNous, a Large Language Model-based (LLM) multi-agent expert system, to automate this unstructured-to-structured translation process. DevNous integrates directly into team chat environments, identifying actionable intents from informal dialogue and managing stateful, multi-turn workflows for core administrative tasks like automated task formalization and progress summary synthesis. To quantitatively evaluate the system, we introduce a new benchmark of 160 realistic, interactive conversational turns. The dataset was manually annotated with a multi-label ground truth and is publicly available. On this benchmark, DevNous achieves an exact match turn accuracy of 81.3% and a multiset F1-Score of 0.845, providing strong evidence for its viability. The primary contributions of this work are twofold: (1) a validated architectural pattern for developing ambient administrative agents, and (2) the introduction of the first robust empirical baseline and public benchmark dataset for this challenging problem domain.

[NLP-33] SciRerankBench: Benchmarking Rerankers Towards Scientific Retrieval-Augmented Generated LLM s

【Quick Read】: This paper addresses the lack of a systematic benchmark for evaluating rerankers in retrieval-augmented generation LLMs (RAG-LLMs) for scientific literature question answering. Because the scientific domain is sensitive to subtle terminology differences, the reranking stage is crucial for the factual accuracy and knowledge relevance of answers, yet its potential and limitations remain underexplored. The key to the solution is SciRerankBench, the first benchmark dedicated to evaluating rerankers within RAG-LLMs, spanning five scientific disciplines and featuring three types of challenging question-context-answer (Q-C-A) pairs: Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). These enable rigorous assessment along three dimensions: noise resilience, relevance disambiguation, and factual consistency. By systematically testing 13 widely used rerankers on five families of LLMs, the work reveals the relative strengths and weaknesses of each class of method and provides guidance for the future development of reranking modules in RAG-LLMs.

Link: https://arxiv.org/abs/2508.08742
Authors: Haotian Chen, Qingqing Long, Meng Xiao, Xiao Luo, Wei Ju, Chengrui Wang, Xuezhi Wang, Yuanchun Zhou, Hengshu Zhu
Affiliations: Computer Network Information Center, Chinese Academy of Sciences; Peking University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scientific literature question answering is a pivotal step towards new scientific discoveries. Recently, two-stage retrieval-augmented generation large language models (RAG-LLMs) have shown impressive advancements in this domain. Such a two-stage framework, especially the second stage (reranker), is particularly essential in the scientific domain, where subtle differences in terminology may have a greatly negative impact on the final factual-oriented or knowledge-intensive answers. Despite this significant progress, the potential and limitations of these works remain unexplored. In this work, we present a Scientific Rerank-oriented RAG Benchmark (SciRerankBench), for evaluating rerankers within RAG-LLMs systems, spanning five scientific subjects. To rigorously assess the reranker performance in terms of noise resilience, relevance disambiguation, and factual consistency, we develop three types of question-context-answer (Q-C-A) pairs, i.e., Noisy Contexts (NC), Semantically Similar but Logically Irrelevant Contexts (SSLI), and Counterfactual Contexts (CC). Through systematic evaluation of 13 widely used rerankers on five families of LLMs, we provide detailed insights into their relative strengths and limitations. To the best of our knowledge, SciRerankBench is the first benchmark specifically developed to evaluate rerankers within RAG-LLMs, which provides valuable observations and guidance for their future development.

[NLP-34] Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation

【Quick Read】: This paper addresses the shortcomings of standard Low-Rank Adaptation (LoRA) in semantic fidelity and lay-style diversity for multi-source heterogeneous Medical Lay Language Generation (MLLG). The key to the solution is Magical, an asymmetric LoRA architecture: a shared matrix A enforces semantic consistency for abstractive summarization, while multiple isolated matrices B support diverse lay-style generation. A Semantic Invariance Constraint suppresses semantic subspace shifts in matrix A during generation, and a Recommendation-guided Switch serves as an external interface that prompts the LLM to switch dynamically among the B matrices, improving stylistic diversity and controllability while preserving semantic accuracy.

Link: https://arxiv.org/abs/2508.08730
Authors: Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, Liantao Ma
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Medical Lay Language Generation (MLLG) plays a vital role in improving the accessibility of complex scientific content for broader audiences. Recent work on MLLG commonly employs parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using paired expert-lay language datasets. However, LoRA struggles with the challenges posed by multi-source heterogeneous MLLG datasets. Specifically, through a series of exploratory experiments, we reveal that standard LoRA fails to meet the requirements for semantic fidelity and diverse lay-style generation in the MLLG task. To address these limitations, we propose Magical, an asymmetric LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical employs a shared matrix A for abstractive summarization, along with multiple isolated matrices B for diverse lay-style generation. To preserve semantic fidelity during the lay language generation process, Magical introduces a Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix A. Furthermore, to better adapt to diverse lay-style generation, Magical incorporates the Recommendation-guided Switch, an external interface that prompts the LLM to switch between different matrices B. Experimental results on three real-world lay language generation datasets demonstrate that Magical consistently outperforms prompt-based methods, vanilla LoRA, and its recent variants, while also reducing trainable parameters by 31.66%.
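The asymmetric layout (one shared A, several isolated B matrices) can be sketched with plain numpy. The shapes, style names, and switching rule below are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal numpy sketch of an asymmetric LoRA layout: the effective weight for
# one lay style is W + B_style @ A, with A shared across all styles.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # hidden size and LoRA rank (toy values)
W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(r, d))      # shared matrix A (abstractive summarization)
B_styles = {                     # one isolated matrix B per lay style
    "patient": np.zeros((d, r)),
    "caregiver": np.zeros((d, r)),
}

def adapted_weight(style):
    """Compose the effective weight for one lay style: W + B_style @ A."""
    return W + B_styles[style] @ A

out = adapted_weight("patient") @ np.ones(d)
```

Because only the small A and B matrices are trainable while W stays frozen, switching styles is just swapping which B is composed with the shared A, which is what an external "switch" can control at inference time.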

[NLP-35] IROTE: Human-like Traits Elicitation of Large Language Model via In-Context Self-Reflective Optimization

【Quick Read】: This paper targets the superficial elicitation problem in trait elicitation for Large Language Models (LLMs): existing methods only steer models to mimic shallow, unstable stylistic patterns and fail to embody target human traits (such as personality or values) precisely and consistently across diverse downstream tasks. The key to the solution is IROTE, a novel in-context method. Drawing on the psychological theory that traits form through identity-related reflection, it automatically generates and optimizes a textual self-reflection within the prompt, comprising self-perceived experience, to stimulate the LLM's trait-driven behavior. The optimization requires no fine-tuning: it iteratively maximizes an information-theoretic objective that strengthens the connection between the LLM's behavior and the target trait while suppressing noisy redundancy in the reflection, yielding compact yet evocative trait-eliciting text and enabling stable, transferable human-trait simulation across tasks.

Link: https://arxiv.org/abs/2508.08719
Authors: Yuzhuo Bai, Shitong Duan, Muhua Huang, Jing Yao, Zhenghao Liu, Peng Zhang, Tun Lu, Xiaoyuan Yi, Maosong Sun, Xing Xie
Affiliations: Tsinghua University; Microsoft Research Asia; Fudan University; The University of Chicago; Northeastern University of China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Trained on various human-authored corpora, Large Language Models (LLMs) have demonstrated a certain capability of reflecting specific human-like traits (e.g., personality or values) by prompting, benefiting applications like personalized LLMs and social simulations. However, existing methods suffer from the superficial elicitation problem: LLMs can only be steered to mimic shallow and unstable stylistic patterns, failing to embody the desired traits precisely and consistently across diverse tasks like humans. To address this challenge, we propose IROTE, a novel in-context method for stable and transferable trait elicitation. Drawing on psychological theories suggesting that traits are formed through identity-related reflection, our method automatically generates and optimizes a textual self-reflection within prompts, which comprises self-perceived experience, to stimulate LLMs’ trait-driven behavior. The optimization is performed by iteratively maximizing an information-theoretic objective that enhances the connections between LLMs’ behavior and the target trait, while reducing noisy redundancy in reflection without any fine-tuning, leading to evocative and compact trait reflection. Extensive experiments across three human trait systems manifest that one single IROTE-generated self-reflection can induce LLMs’ stable impersonation of the target trait across diverse downstream tasks beyond simple questionnaire answering, consistently outperforming existing strong baselines.
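The iterative "maximize trait connection, penalize redundancy" loop can be caricatured as hill climbing over candidate reflections. The scoring function below is a made-up proxy, not the paper's information-theoretic objective, and the mutation step stands in for LLM-generated reflection edits.

```python
# Toy hill-climbing stand-in for IROTE-style reflection optimization: keep the
# candidate self-reflection whose score (trait connection minus redundancy
# penalty) is highest. All words and weights are hypothetical.
import random

random.seed(1)
TARGET_TRAIT_WORDS = {"curious", "explore", "ask"}

def objective(reflection):
    words = set(reflection.split())
    trait_link = len(words & TARGET_TRAIT_WORDS)   # behavior-trait connection
    redundancy = len(reflection.split()) / 10      # penalize verbose noise
    return trait_link - redundancy

def mutate(reflection):
    extras = ["curious", "explore", "ask", "tidy", "calm"]
    return reflection + " " + random.choice(extras)

def optimize(reflection, steps=20):
    best, best_score = reflection, objective(reflection)
    for _ in range(steps):
        cand = mutate(best)
        score = objective(cand)
        if score > best_score:                     # greedy accept
            best, best_score = cand, score
    return best

final = optimize("I like to")
```

The redundancy term is what keeps the optimized reflection compact rather than letting it grow without bound, mirroring the paper's stated goal of evocative but concise reflections.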

[NLP-36] A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

【Quick Read】: This survey addresses the inference inefficiency caused by the autoregressive (AR) mechanism that most large language models (LLMs) rely on for text generation: producing one token at a time is inherently sequential and ill-suited to high-throughput applications. The key to the solution is a systematic review and taxonomy of parallel text generation methods, dividing them into AR-based and Non-AR paradigms, examining the core techniques of each category, and analyzing their theoretical trade-offs in speed, quality, and efficiency, thereby providing theoretical grounding and practical guidance for improving LLM inference efficiency.

Link: https://arxiv.org/abs/2508.08712
Authors: Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu
Affiliations: Peking University; University of Illinois Chicago; Tsinghua University; XPENG; Alibaba Group; The Hong Kong University of Science and Technology (Guangzhou)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Abstract:As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation.

[NLP-37] Out of the Box into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults

【Quick Read】: This paper addresses the reliability of automatic speech recognition (ASR) for older adults using voice-controlled interfaces in clinical contexts, in particular the low recognition accuracy caused by insufficient coverage of the speech characteristics of older Dutch speakers. The key to the solution is a benchmark of generic multilingual ASR models against models fine-tuned for Dutch spoken by older adults; surprisingly, the generic models outperform the fine-tuned ones without any adaptation, indicating that state-of-the-art ASR models generalize well out of the box. The study further shows that truncating existing architectures helps achieve a better accuracy-speed trade-off, although some cases with high word error rate (WER) due to hallucinations remain.

Link: https://arxiv.org/abs/2508.08684
Authors: Bram van Dijk, Tiberon Kuiper, Sirin Aoulad si Ahmed, Armel Levebvre, Jake Johnson, Jan Duin, Simon Mooijaart, Marco Spruit
Affiliations: Leiden University Medical Centre; Leiden Institute of Advanced Computer Science
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Voice-controlled interfaces can support older adults in clinical contexts, with chatbots being a prime example, but reliable Automatic Speech Recognition (ASR) for underrepresented groups remains a bottleneck. This study evaluates state-of-the-art ASR models on language use of older Dutch adults, who interacted with the this http URL chatbot designed for geriatric contexts. We benchmark generic multilingual ASR models, and models fine-tuned for Dutch spoken by older adults, while also considering processing speed. Our results show that generic multilingual models outperform fine-tuned models, which suggests recent ASR models can generalise well out of the box to realistic datasets. Furthermore, our results suggest that truncating existing architectures is helpful in balancing the accuracy-speed trade-off, though we also identify some cases with high WER due to hallucinations.

[NLP-38] TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

【Quick Read】: This paper addresses the limited performance of generative models in low-resource language (LRL) machine translation caused by the scarcity, low quality, and low diversity of parallel corpora. Existing remedies such as similarity-based example selection and supervised fine-tuning help, but their gains are capped by the size and quality of existing parallel data. The key to the solution is TopXGen, an LLM-based data generation framework: exploiting LLMs' strong translation ability into high-resource languages (HRLs) and their multilinguality, it first generates high-quality, topic-diverse target-side texts in the LRL and then backtranslates them to produce parallel corpora usable for fine-tuning and in-context learning (ICL). This effectively alleviates the shortage of LRL parallel data and markedly improves model performance on LRL translation.

Link: https://arxiv.org/abs/2508.08680
Authors: Armel Zebaze, Benoît Sagot, Rachel Bawden
Affiliations: Inria
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and topic-diverse data in multiple LRLs, which can then be backtranslated to produce useful and diverse parallel texts for ICL and fine-tuning. Our intuition is that while LLMs struggle to translate into LRLs, their ability to translate well into HRLs and their multilinguality enable them to generate good quality, natural-sounding target-side texts, which can be translated well into a high-resource source language. We show that TopXGen boosts LLM translation performance during fine-tuning and in-context learning. Code and outputs are available at this https URL.
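The generate-then-backtranslate pipeline can be sketched as below. Both model calls are hypothetical stubs standing in for an LLM generator and an MT system; only the data flow (topic → LRL text → backtranslated HRL source → parallel pair) reflects the described approach.

```python
# Hedged sketch of a TopXGen-style pipeline: generate target-side LRL text per
# topic, then backtranslate into the HRL source to form parallel pairs.

def generate_lrl_text(topic):
    """Stub for an LLM generating natural target-side text on a topic."""
    return f"lrl-sentence-about-{topic}"

def backtranslate_to_hrl(lrl_text):
    """Stub for machine translation back into the high-resource source."""
    return lrl_text.replace("lrl", "hrl")

def build_parallel_corpus(topics):
    corpus = []
    for topic in topics:
        tgt = generate_lrl_text(topic)          # diverse target-side text
        src = backtranslate_to_hrl(tgt)         # synthetic source side
        corpus.append((src, tgt))               # (source, target) pair
    return corpus

pairs = build_parallel_corpus(["health", "farming"])
```

Seeding generation with a topic list is what gives the synthetic corpus its topical diversity, which plain backtranslation of whatever monolingual text exists cannot guarantee.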

[NLP-39] M^2LLM: Multi-view Molecular Representation Learning with Large Language Models IJCAI2025

【Quick Read】: This paper addresses the fact that traditional molecular representations for property prediction (such as fingerprints and graph neural networks) overlook decades of accumulated semantic and contextual knowledge. The key to the solution is M^2LLM, a multi-view framework that integrates three perspectives: a molecular structure view, a molecular task view, and a molecular rules view, with the views fused dynamically according to task requirements. This exploits the prior knowledge and reasoning ability of large language models (LLMs) in scientific domains for deep encoding and intelligent curation of molecular features, achieving state-of-the-art performance on classification and regression benchmarks.

Link: https://arxiv.org/abs/2508.08657
Authors: Jiaxin Ju, Yizhen Zheng, Huan Yee Koh, Can Wang, Shirui Pan
Affiliations: Griffith University; Monash University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: IJCAI 2025

Abstract:Accurate molecular property prediction is a critical challenge with wide-ranging applications in chemistry, materials science, and drug discovery. Molecular representation methods, including fingerprints and graph neural networks (GNNs), achieve state-of-the-art results by effectively deriving features from molecular structures. However, these methods often overlook decades of accumulated semantic and contextual knowledge. Recent advancements in large language models (LLMs) demonstrate remarkable reasoning abilities and prior knowledge across scientific domains, leading us to hypothesize that LLMs can generate rich molecular representations when guided to reason in multiple perspectives. To address these gaps, we propose M^2LLM, a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. These views are fused dynamically to adapt to task requirements, and experiments demonstrate that M^2LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks. Moreover, we demonstrate that representation derived from LLM achieves exceptional performance by leveraging two core functionalities: the generation of molecular embeddings through their encoding capabilities and the curation of molecular features through advanced reasoning processes.
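A dynamic weighted fusion of per-view embeddings can be sketched in a few lines. The view names follow the paper, but the toy embeddings and the weighting scheme are illustrative assumptions.

```python
# Toy sketch of fusing three molecular views with task-dependent weights.
import numpy as np

views = {
    "structure": np.array([1.0, 0.0, 0.0]),
    "task": np.array([0.0, 1.0, 0.0]),
    "rules": np.array([0.0, 0.0, 1.0]),
}

def fuse(view_embs, weights):
    """Weighted sum of per-view embeddings; weights are normalized to sum to 1."""
    w = np.array([weights[k] for k in view_embs])
    w = w / w.sum()
    return sum(wi * e for wi, e in zip(w, view_embs.values()))

# A structure-sensitive task might up-weight the structure view like this.
emb = fuse(views, {"structure": 2.0, "task": 1.0, "rules": 1.0})
```

The weights here are fixed by hand; in the described framework they would be set dynamically per task rather than hard-coded.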

[NLP-40] LLM-Driven Text-to-Table Generation through Sub-Tasks Guidance and Iterative Refinement

【Quick Read】: This paper addresses the difficulty of converting unstructured text into structured tables, a task that requires semantic understanding, reasoning, and structural awareness, and on which large language models (LLMs) often falter when handling ambiguous or domain-specific data, maintaining table-structure integrity, coping with long inputs, and performing numerical reasoning. The key to the solution is an efficient LLM-driven text-to-table system built on two strategies: decomposing the complex text-to-table task into manageable, guided sub-tasks so the model can work step by step, and refining the generated tables through iterative self-feedback to improve table quality. The method achieves strong results against baselines on two public, complex text-to-table datasets.

Link: https://arxiv.org/abs/2508.08653
Authors: Rajmohan C, Sarthak Harne, Arvind Agarwal
Affiliations: IBM Research; IIIT-Bangalore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transforming unstructured text into structured data is a complex task, requiring semantic understanding, reasoning, and structural comprehension. While Large Language Models (LLMs) offer potential, they often struggle with handling ambiguous or domain-specific data, maintaining table structure, managing long inputs, and addressing numerical reasoning. This paper proposes an efficient system for LLM-driven text-to-table generation that leverages novel prompting techniques. Specifically, the system incorporates two key strategies: breaking down the text-to-table task into manageable, guided sub-tasks and refining the generated tables through iterative self-feedback. We show that this custom task decomposition allows the model to address the problem in a stepwise manner and improves the quality of the generated table. Furthermore, we discuss the benefits and potential risks associated with iterative self-feedback on the generated tables while highlighting the trade-offs between enhanced performance and computational cost. Our methods achieve strong results compared to baselines on two complex text-to-table generation datasets available in the public domain.
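The two strategies (guided sub-tasks, then iterative self-feedback on the draft table) can be sketched with stub extractors. The sub-task split, the stub outputs, and the numeric repair below are hypothetical; in the real system each stub would be an LLM prompt.

```python
# Illustrative decomposition of text-to-table generation into sub-tasks plus
# an iterative self-feedback pass. All extraction logic is stubbed.

def extract_columns(text):
    return ["name", "score"]                       # sub-task 1: schema

def extract_rows(text, columns):
    return [{"name": "alice", "score": "9O"}]      # sub-task 2: cell filling
                                                   # (note the OCR-like 'O' error)

def self_feedback(table):
    """Feedback pass: flag non-numeric scores and repair them."""
    for row in table:
        if not row["score"].isdigit():
            row["score"] = row["score"].replace("O", "0")
    return table

def text_to_table(text, refine_steps=2):
    cols = extract_columns(text)
    table = extract_rows(text, cols)
    for _ in range(refine_steps):                  # iterative refinement
        table = self_feedback(table)
    return table

table = text_to_table("Alice scored 90 points.")
```

The refinement loop illustrates the trade-off the paper discusses: each extra pass costs another model call but gives the system a chance to catch structural or numerical errors in its own draft.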

[NLP-41] Prompt-Based Approach for Czech Sentiment Analysis ACL

【Quick Read】: This paper addresses data scarcity and limited model generalization in aspect-based sentiment analysis and sentiment classification for Czech. The key to the solution is a prompt-based approach: sequence-to-sequence models solve the aspect-based subtasks simultaneously, and prompting outperforms traditional fine-tuning with few labeled examples or even in zero-shot settings. Experiments further show that pre-training on target-domain data yields significant additional gains in the zero-shot scenario, confirming the effectiveness of prompting in low-resource settings.

Link: https://arxiv.org/abs/2508.08651
Authors: Jakub Šmíd, Pavel Přibáň
Affiliations: Department of Computer Science and Engineering
Subjects: Computation and Language (cs.CL)
Comments: Published in Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023). Official version: this https URL

Abstract:This paper introduces the first prompt-based methods for aspect-based sentiment analysis and sentiment classification in Czech. We employ the sequence-to-sequence models to solve the aspect-based tasks simultaneously and demonstrate the superiority of our prompt-based approach over traditional fine-tuning. In addition, we conduct zero-shot and few-shot learning experiments for sentiment classification and show that prompting yields significantly better results with limited training examples compared to traditional fine-tuning. We also demonstrate that pre-training on data from the target domain can lead to significant improvements in a zero-shot scenario.

[NLP-42] UWB at WASSA-2024 Shared Task 2: Cross-lingual Emotion Detection WASSA2024 WASSA-1 ACL

【Quick Read】: This paper addresses cross-lingual emotion detection with two subtasks: assigning one of six emotion labels to tweets in five languages, and predicting the words that trigger the detected emotion, in both binary and numerical form. The key to the solution is fine-tuning quantized large language models (such as Orca 2) with low-rank adapters (LoRA), combined with multilingual Transformer models (such as XLM-R and mT5), machine-translation augmentation for better cross-lingual transfer, and a trigger-word switching mechanism for the second subtask. The system performs strongly in the WASSA-2024 shared task, ranking 1st in numerical trigger-word detection, 3rd in binary trigger-word detection, and 7th in emotion detection.

Link: https://arxiv.org/abs/2508.08650
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Affiliations: University of West Bohemia; Department of Computer Science and Engineering; NTIS – New Technologies for the Information Society
Subjects: Computation and Language (cs.CL)
Comments: Published in Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, Social Media Analysis (WASSA 2024). Official version: this https URL

Abstract:This paper presents our system built for the WASSA-2024 Cross-lingual Emotion Detection Shared Task. The task consists of two subtasks: first, to assess an emotion label from six possible classes for a given tweet in one of five languages, and second, to predict words triggering the detected emotions in binary and numerical formats. Our proposed approach revolves around fine-tuning quantized large language models, specifically Orca 2, with low-rank adapters (LoRA) and multilingual Transformer-based models, such as XLM-R and mT5. We enhance performance through machine translation for both subtasks and trigger word switching for the second subtask. The system achieves excellent performance, ranking 1st in numerical trigger words detection, 3rd in binary trigger words detection, and 7th in emotion detection.

[NLP-43] LLaMA-Based Models for Aspect-Based Sentiment Analysis WASSA2024 WASSA-1 ACL

【Quick Read】: This paper addresses the gap between large language models (LLMs) and fine-tuned models on compound aspect-based sentiment analysis (ABSA), exploring the largely untested potential of open-source LLMs fine-tuned specifically for ABSA. The key to the solution is targeted fine-tuning of LLaMA-based open-source models, evaluated systematically across four ABSA tasks and eight English datasets: the fine-tuned Orca 2 model surpasses state-of-the-art results on all tasks, validating the fine-tuning strategy. At the same time, all models degrade markedly in zero-shot and few-shot settings compared with fully fine-tuned ones, and an error analysis identifies the remaining challenges and directions for improvement.

Link: https://arxiv.org/abs/2508.08649
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Affiliations: University of West Bohemia; Department of Computer Science and Engineering; NTIS – New Technologies for the Information Society
Subjects: Computation and Language (cs.CL)
Comments: Published in Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, Social Media Analysis (WASSA 2024). Official version: this https URL

Abstract:While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca 2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models.

[NLP-44] Quick on the Uptake: Eliciting Implicit Intents from Human Demonstrations for Personalized Mobile-Use Agents

【Quick Read】: This paper addresses the limited personalization of current mobile-use agents: demonstration learning relies only on users' explicit intention flows (e.g., step sequences) while ignoring implicit intention flows (e.g., personal preferences), preventing true human-agent intent alignment. The key to the solution is the IFRAgent framework, built on Intention Flow Recognition: from human demonstrations it parses explicit intention flows into a query-level vector library of standard operating procedures (SOPs), and implicit intention flows into a user-level habit repository. Combined with retrieval-augmented generation and a query rewriter, it converts ambiguous raw queries into personalized queries and SOPs, substantially improving alignment between mobile-use agents and human intent. Experiments show IFRAgent outperforms baselines by an average of 6.79% in intention alignment rate (a 32.06% relative improvement) and 5.30% in step completion rate (a 26.34% relative improvement).

Link: https://arxiv.org/abs/2508.08645
Authors: Zheng Wu, Heyuan Huang, Yanjia Yang, Yuanyi Song, Xingyu Lou, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang
Affiliations: OPPO
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As multimodal large language models advance rapidly, the automation of mobile tasks has become increasingly feasible through the use of mobile-use agents that mimic human interactions from graphical user interfaces. To further enhance mobile-use agents, previous studies employ demonstration learning to improve mobile-use agents from human demonstrations. However, these methods focus solely on the explicit intention flows of humans (e.g., step sequences) while neglecting implicit intention flows (e.g., personal preferences), which makes it difficult to construct personalized mobile-use agents. In this work, to evaluate the Intention Alignment Rate between mobile-use agents and humans, we first collect MobileIAR, a dataset containing human-intent-aligned actions and ground-truth actions. This enables a comprehensive assessment of the agents’ understanding of human intent. Then we propose IFRAgent, a framework built upon Intention Flow Recognition from human demonstrations. IFRAgent analyzes explicit intention flows from human demonstrations to construct a query-level vector library of standard operating procedures (SOP), and analyzes implicit intention flows to build a user-level habit repository. IFRAgent then leverages a SOP extractor combined with retrieval-augmented generation and a query rewriter to generate personalized query and SOP from a raw ambiguous query, enhancing the alignment between mobile-use agents and human intent. Experimental results demonstrate that IFRAgent outperforms baselines by an average of 6.79% (32.06% relative improvement) in human intention alignment rate and improves step completion rates by an average of 5.30% (26.34% relative improvement). The codes are available at this https URL.

[NLP-45] MiGrATe: Mixed-Policy GRPO for Adaptation at Test-Time

【Quick Read】: This paper addresses the difficulty large language models (LLMs) face in black-box optimization: balancing exploration of new solution spaces against exploitation of high-reward regions. Existing approaches rely on in-context learning, which tends to get stuck in local optima, or on test-time training (TTT) with hand-crafted synthetic data, which scales poorly because it is task-specific. The key to the solution is MiGrATe, an online TTT method requiring no external training data. Its core is a mixed-policy group construction procedure combined with GRPO policy-gradient optimization for adaptation at inference time: on-policy sampling preserves exploration, while two off-policy data selection techniques, greedy sampling (selecting top-performing past completions) and neighborhood sampling (NS, generating completions structurally similar to high-reward ones), bias the policy gradient toward promising regions of the solution space while maintaining diversity. Experiments on word search, molecule optimization, and abstract reasoning (ARC) show MiGrATe consistently outperforms both inference-only and conventional TTT baselines.

Link: https://arxiv.org/abs/2508.08641
Authors: Peter Phan, Dhruv Agarwal, Kavitha Srinivas, Horst Samulowitz, Pavan Kapanipathi, Andrew McCallum
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are increasingly being applied to black-box optimization tasks, from program synthesis to molecule design. Prior work typically leverages in-context learning to iteratively guide the model towards better solutions. Such methods, however, often struggle to balance exploration of new solution spaces with exploitation of high-reward ones. Recently, test-time training (TTT) with synthetic data has shown promise in improving solution quality. However, the need for hand-crafted training data tailored to each task limits feasibility and scalability across domains. To address this problem, we introduce MiGrATe-a method for online TTT that uses GRPO as a search algorithm to adapt LLMs at inference without requiring external training data. MiGrATe operates via a mixed-policy group construction procedure that combines on-policy sampling with two off-policy data selection techniques: greedy sampling, which selects top-performing past completions, and neighborhood sampling (NS), which generates completions structurally similar to high-reward ones. Together, these components bias the policy gradient towards exploitation of promising regions in solution space, while preserving exploration through on-policy sampling. We evaluate MiGrATe on three challenging domains-word search, molecule optimization, and hypothesis+program induction on the Abstraction and Reasoning Corpus (ARC)-and find that it consistently outperforms both inference-only and TTT baselines, demonstrating the potential of online TTT as a solution for complex search tasks without external supervision.
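The mixed-policy group construction can be sketched as below: a GRPO group is assembled from fresh on-policy samples, top-reward past completions, and perturbed neighbors of them, and each member's advantage is its reward relative to the group mean. Group sizes, the archive, and the perturbation rule are assumptions for illustration.

```python
# Sketch of a mixed-policy GRPO group: on-policy samples + greedy sampling of
# the archive + neighborhood sampling around high-reward completions.
import random

random.seed(0)
archive = [("cand_a", 0.9), ("cand_b", 0.2), ("cand_c", 0.7)]  # (completion, reward)

def sample_on_policy(n):
    # Stand-in for sampling fresh completions from the current policy.
    return [(f"new_{i}", random.random()) for i in range(n)]

def greedy_sampling(k):
    # Off-policy: take the top-k past completions by reward.
    return sorted(archive, key=lambda x: x[1], reverse=True)[:k]

def neighborhood_sampling(k):
    # Off-policy: perturb top completions to stay structurally close to them.
    return [(c + "_edit", r * 0.95) for c, r in greedy_sampling(k)]

def build_group():
    group = sample_on_policy(4) + greedy_sampling(2) + neighborhood_sampling(2)
    mean_r = sum(r for _, r in group) / len(group)
    # GRPO-style advantage: reward relative to the group mean.
    return [(c, r - mean_r) for c, r in group]

group = build_group()
```

Because greedy and neighborhood samples drag the group mean upward, fresh on-policy samples must beat strong past solutions to receive positive advantage, which is the exploitation bias the method aims for.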

[NLP-46] InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

【Quick Read】: This paper addresses the domain narrowness of current reinforcement learning (RL) research for LLM reasoning, which focuses on specific domains such as mathematics or code generation and struggles with the diverse, complex reasoning scenarios of the real world. To fill this gap, the authors present InternBootcamp, an open-source framework of 1000+ cross-domain task environments. Its key features are: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective evaluation of model outputs. Task coverage is scaled rapidly through an automated agent workflow supplemented by manual validation, and experiments show that "task scaling" is the core driver of performance gains, yielding a 32B model that achieves state-of-the-art results on the automatically generated Bootcamp-EVAL benchmark and excels on other established benchmarks, offering a promising route toward models with general reasoning ability.

Link: https://arxiv.org/abs/2508.08636
Authors: Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen
Affiliations: Shanghai AI Laboratory; Fudan University
Subjects: Computation and Language (cs.CL)
Comments: InternBootcamp Tech Report

Abstract:Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route towards capable reasoning generalists.
zh

[NLP-47] Adaptive Personalized Conversational Information Retrieval CIKM2025

【速读】: 该论文旨在解决个性化对话式信息检索(Personalized Conversational Information Retrieval, PCIR)系统中"一刀切"式个性化策略导致的性能不佳问题,即在多轮交互中未能根据每个查询的实际需求动态调整个性化程度。其核心解决方案是提出自适应个性化框架APCIR,关键在于:首先识别当前查询所需的个性化水平,并将个性化查询与其他查询重构结果融合生成多样化增强查询;随后设计一种感知个性化的排序融合方法,依据个性化需求动态分配不同重构查询的融合权重,从而实现更精准的个性化排序。

链接: https://arxiv.org/abs/2508.08634
作者: Fengran Mo,Yuchen Hui,Yuxing Tian,Zhaoxuan Tan,Chuan Meng,Zhan Su,Kaiyu Huang,Jian-Yun Nie
机构: Université de Montréal(蒙特利尔大学); University of Notre Dame(圣母大学); University of Amsterdam(阿姆斯特丹大学); Beijing Jiaotong University(北京交通大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: Accepted by CIKM 2025

点击查看摘要

Abstract:Personalized conversational information retrieval (CIR) systems aim to satisfy users’ complex information needs through multi-turn interactions by considering user profiles. However, not all search queries require personalization. The challenge lies in appropriately incorporating personalization elements into search when needed. Most existing studies implicitly incorporate users’ personal information and conversational context using large language models without distinguishing the specific requirements for each query turn. Such a "one-size-fits-all" personalization strategy might lead to sub-optimal results. In this paper, we propose an adaptive personalization method, in which we first identify the required personalization level for a query and integrate personalized queries with other query reformulations to produce various enhanced queries. Then, we design a personalization-aware ranking fusion approach to assign fusion weights dynamically to different reformulated queries, depending on the required personalization level. The proposed adaptive personalized conversational information retrieval framework APCIR is evaluated on two TREC iKAT datasets. The results confirm the effectiveness of adaptive personalization of APCIR by outperforming state-of-the-art methods.
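论文摘要未给出排序融合的具体公式;下面以加权倒数排名融合(weighted RRF)为例,示意"按个性化需求给不同重构查询动态分配融合权重"这一思路。其中函数名、k 值与权重取值均为本文自拟的假设,并非 APCIR 的实际实现:

```python
def weighted_rrf(rankings, weights, k=60):
    """加权倒数排名融合示意:rankings 为若干重构查询的检索结果列表,
    weights 为各查询的融合权重(例如依所需个性化程度动态给出)。"""
    scores = {}
    for ranking, w in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + w / (k + rank)
    # 按融合得分从高到低返回文档
    return sorted(scores, key=scores.get, reverse=True)

# 两个重构查询的检索结果:此处假设个性化查询被赋予更高权重
personalized = ["d1", "d3", "d2"]
generic = ["d2", "d1", "d4"]
fused = weighted_rrf([personalized, generic], weights=[0.7, 0.3])
```

权重越高的列表对最终排序影响越大;若个性化需求低,可反向调低其权重。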
zh

[NLP-48] Optimizing Retrieval-Augmented Generation (RAG) for Colloquial Cantonese: A LoRA-Based Systematic Review

【速读】: 该论文旨在解决生成式 AI(Generative AI)在方言语境下,尤其是粤语口语表达的理解与生成问题,其核心挑战在于标注数据稀缺和语言变异性带来的模型性能瓶颈。解决方案的关键在于引入参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)技术,特别是低秩适应(Low-Rank Adaptation, LoRA),将其集成到检索增强生成(Retrieval-Augmented Generation, RAG)系统中,以提升模型在有限数据条件下的语义保真度、检索精度与方言适配能力。研究表明,动态和集成式LoRA策略可在显著减少可训练参数的同时保持高质量的检索与生成效果,但对细粒度语言特征的保留仍存在局限,尤其在低资源场景下,亟需结合实时用户反馈与领域特定数据来增强模型的适应性与个性化能力。

链接: https://arxiv.org/abs/2508.08610
作者: David Santandreu Calonge(1),Linda Smail(2) ((1) Center for Teaching and Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates, (2) College of Interdisciplinary Studies, Zayed University, Dubai, United Arab Emirates)
机构: 未知
类目: Computation and Language (cs.CL)
备注: 27 pages, 1 figure, 8 tables

点击查看摘要

Abstract:This review examines recent advances in Parameter-Efficient Fine-Tuning (PEFT), with a focus on Low-Rank Adaptation (LoRA), to optimize Retrieval-Augmented Generation (RAG) systems like Qwen3, DeepSeek, and Kimi. These systems face challenges in understanding and generating authentic Cantonese colloquial expressions due to limited annotated data and linguistic variability. The review evaluates the integration of LoRA within RAG frameworks, benchmarks PEFT methods for retrieval and generation accuracy, identify domain adaptation strategies under limited data, and compares fine-tuning techniques aimed at improving semantic fidelity under data-scarce conditions. A systematic analysis of recent studies employing diverse LoRA variants, synthetic data generation, user feedback integration, and adaptive parameter allocation was conducted to assess their impact on computational efficiency, retrieval precision, linguistic authenticity, and scalability. Findings reveal that dynamic and ensemble LoRA adaptations significantly reduce trainable parameters without sacrificing retrieval accuracy and generation quality in dialectal contexts. However, limitations remain in fully preserving fine-grained linguistic nuances, especially for low-resource settings like Cantonese. The integration of real-time user feedback and domain-specific data remains underdeveloped, limiting model adaptability and personalization. While selective parameter freezing and nonlinear adaptation methods offer better trade-offs between efficiency and accuracy, their robustness at scale remains an open challenge. This review highlights the promise of PEFT-enhanced RAG systems for domain-specific language tasks and calls for future work targeting dialectal authenticity, dynamic adaptation, and scalable fine-tuning pipelines.
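作为背景补充,LoRA 的核心是把权重更新约束为低秩增量 W' = W + (α/r)·BA,从而将可训练参数量从 d_out×d_in 降到 r×(d_in+d_out)。下面用 NumPy 给出一个最小前向示意(维度、初始化与 α、r 取值均为示例假设):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA 前向示意:冻结权重 W,仅训练低秩矩阵 A (r×d_in)、B (d_out×r);
    增量为 (alpha/r)·B·A。"""
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

rng = np.random.default_rng(0)
d_in, d_out, r = 32, 16, 4
W = rng.normal(size=(d_out, d_in))          # 冻结的预训练权重
A = rng.normal(scale=0.01, size=(r, d_in))  # 低秩矩阵 A,小随机初始化
B = np.zeros((d_out, r))                    # B 初始化为零:训练起点增量为 0
x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B)
```

由于 B 初始为零,训练开始时输出与原模型完全一致,这也是 LoRA 常用的初始化约定。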
zh

[NLP-49] DepressLLM : Interpretable domain-adapted language model for depression detection from real-world narratives

【速读】: 该论文旨在解决抑郁症预测中因缺乏大规模、高质量且严格标注数据集而导致的瓶颈问题。其解决方案的关键在于提出DepressLLM模型,该模型基于3,699篇自传体叙述文本进行训练与评估,具备可解释性,并通过Score-guided Token Probability Summation (SToPS)模块实现更优的分类性能和可靠的置信度估计,最终在AUC指标上达到0.789,且在高置信度样本(≥0.95)中提升至0.904,同时验证了其对异质数据的鲁棒性。

链接: https://arxiv.org/abs/2508.08591
作者: Sehwan Moon,Aram Lee,Jeong Eun Kim,Hee-Ju Kang,Il-Seon Shin,Sung-Wan Kim,Jae-Min Kim,Min Jhon,Ju-Wan Kim
机构: ETRI(韩国电子通信研究院); Chonnam National University (全南国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Advances in large language models (LLMs) have enabled a wide range of applications. However, depression prediction is hindered by the lack of large-scale, high-quality, and rigorously annotated datasets. This study introduces DepressLLM, trained and evaluated on a novel corpus of 3,699 autobiographical narratives reflecting both happiness and distress. DepressLLM provides interpretable depression predictions and, via its Score-guided Token Probability Summation (SToPS) module, delivers both improved classification performance and reliable confidence estimates, achieving an AUC of 0.789, which rises to 0.904 on samples with confidence ≥ 0.95. To validate its robustness to heterogeneous data, we evaluated DepressLLM on in-house datasets, including an Ecological Momentary Assessment (EMA) corpus of daily stress and mood recordings, and on public clinical interview data. Finally, a psychiatric review of high-confidence misclassifications highlighted key model and data limitations that suggest directions for future refinements. These findings demonstrate that interpretable AI can enable earlier diagnosis of depression and underscore the promise of medical AI in psychiatry.
zh

[NLP-50] Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization ACL2025

【速读】: 该论文旨在解决视频配音(video dubbing)中因源语言与目标语言信息密度差异导致的语音时长不匹配问题,从而避免音画不同步对观感体验的显著影响。其解决方案的关键在于将基于大语言模型(LLM)的视频配音中的时长对齐问题建模为偏好优化(preference optimization)任务,并提出分段监督偏好优化(Segment Supervised Preference Optimization, SSPO)方法,通过分段采样策略和细粒度损失函数实现更精确的时长控制与对齐。

链接: https://arxiv.org/abs/2508.08550
作者: Chaoqun Cui,Liangbin Huang,Shijing Wang,Zhe Tong,Zhaolong Huang,Xiao Zeng,Xiaofeng Liu
机构: Alibaba Digital Media and Entertainment Group (阿里巴巴数字媒体与娱乐集团); School of Software Engineering, Huazhong University of Science and Technology (华中科技大学软件学院); Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing Jiaotong University (北京交通大学交通数据挖掘与具身智能北京市重点实验室)
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: This paper is accepted by ACL2025 (Main)

点击查看摘要

Abstract:Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
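SSPO 的分段采样与细粒度损失的细节摘要中未给出;下面仅以"按时长失配程度在翻译候选间构造偏好对"这一直观思路给一个假设性示意,其中的时长估计方式(按每秒字数折算)与所有数值均为本文自拟:

```python
def duration_mismatch(src_seconds, tgt_text, rate=4.0):
    """粗略估计目标台词的语音时长(假设每秒发音 rate 个字),
    返回与源台词时长的相对失配。"""
    est = len(tgt_text) / rate
    return abs(est - src_seconds) / src_seconds

def build_preference_pair(src_seconds, candidates):
    """在同一句的多个翻译候选中,以时长失配为偏好信号:
    失配最小者为 chosen,最大者为 rejected。"""
    ranked = sorted(candidates, key=lambda t: duration_mismatch(src_seconds, t))
    return ranked[0], ranked[-1]

# 源台词时长 2.0 秒:过短的译文失配更大,较长的译文更贴合
chosen, rejected = build_preference_pair(2.0, ["好的", "好的,我马上去办这件事"])
```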
zh

[NLP-51] DeCAL Tokenwise Compression

【速读】: 该论文旨在解决自然语言处理中高维文本表示的存储与计算效率问题,即如何在不显著损失下游任务性能的前提下实现高效的tokenwise压缩。其解决方案的关键在于提出DeCAL方法,该方法基于预训练的编码器-解码器语言模型(encoder-decoder language model),通过去噪训练目标学习生成高质量、通用性强的压缩表示;同时对编码器进行微小修改以最大化压缩质量,即使以增加计算开销为代价。实验表明,在2倍压缩比下DeCAL可达到未压缩模型的性能水平,且在高达8倍压缩比时仍保持较小的指标下降,尤其适用于可复用预计算密集表示的场景。

链接: https://arxiv.org/abs/2508.08514
作者: Sameer Panwar
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations by the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on many downstream tasks, with usually only minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.
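DeCAL 对编码器的具体改动摘要中未披露;下面仅用均值池化示意"把 n 个 token 表示压缩到约 n/ratio 个向量"这一概念本身,并非 DeCAL 的实际方法:

```python
import numpy as np

def compress_tokenwise(H, ratio=2):
    """按压缩比 ratio 将 token 序列表示 H (n×d) 池化为 ceil(n/ratio)×d。
    注意:这只是逐 token 压缩概念的均值池化示意。"""
    n, d = H.shape
    pad = (-n) % ratio          # 末尾补零,使长度可被 ratio 整除
    if pad:
        H = np.vstack([H, np.zeros((pad, d))])
    return H.reshape(-1, ratio, d).mean(axis=1)

H = np.arange(12, dtype=float).reshape(6, 2)   # 6 个 token,维度 2
Z = compress_tokenwise(H, ratio=2)             # 压缩为 3 个向量
```

2 倍压缩时下游只需处理一半长度的序列,这正是"预计算密集表示可复用"场景下节省开销的来源。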
zh

[NLP-52] Steerable Pluralism: Pluralistic Alignment via Few-Shot Comparative Regression AAAI

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)对齐方法中仅依赖标量奖励(scalar rewards)所导致的用户偏好表达单一化问题,即现有方法如基于人类反馈的强化学习(Reinforcement Learning from Human Feedback, RLHF)只能反映用户偏好平均值,难以捕捉多样化的价值维度。为实现更公平、更具代表性的模型对齐,论文提出一种可调节的多元对齐(steerable pluralistic alignment)方法,其关键在于采用少样本比较回归(few-shot comparative regression)机制,利用上下文学习(in-context learning)和推理能力,在一组细粒度属性基础上对多个响应选项进行对比并做出符合个体偏好的选择。该方案具备可解释性、跨属性兼容性,并在新构建的两个基准测试(基于Moral Integrity Corpus和HelpSteer2数据集)上优于多种基线与前沿方法,推动了伦理AI的发展。

链接: https://arxiv.org/abs/2508.08509
作者: Jadie Adams,Brian Hu,Emily Veenhuis,David Joy,Bharadwaj Ravichandran,Aaron Bray,Anthony Hoogs,Arslan Basharat
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: AIES '25: Proceedings of the 2025 AAAI/ACM Conference on AI, Ethics, and Society

点击查看摘要

Abstract:Large language models (LLMs) are currently aligned using techniques such as reinforcement learning from human feedback (RLHF). However, these methods use scalar rewards that can only reflect user preferences on average. Pluralistic alignment instead seeks to capture diverse user preferences across a set of attributes, moving beyond just helpfulness and harmlessness. Toward this end, we propose a steerable pluralistic model based on few-shot comparative regression that can adapt to individual user preferences. Our approach leverages in-context learning and reasoning, grounded in a set of fine-grained attributes, to compare response options and make aligned choices. To evaluate our algorithm, we also propose two new steerable pluralistic benchmarks by adapting the Moral Integrity Corpus (MIC) and the HelpSteer2 datasets, demonstrating the applicability of our approach to value-aligned decision-making and reward modeling, respectively. Our few-shot comparative regression approach is interpretable and compatible with different attributes and LLMs, while outperforming multiple baseline and state-of-the-art methods. Our work provides new insights and research directions in pluralistic alignment, enabling a more fair and representative use of LLMs and advancing the state-of-the-art in ethical AI.
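论文中回应选项的细粒度属性打分由 LLM 的上下文推理给出,摘要未披露聚合细节;下面用一个假设性的加权打分示意"按用户的属性偏好在候选回应间做对齐选择"的思路(属性名与数值均为自拟示例):

```python
def choose_response(options, user_weights):
    """可调节多元对齐示意:options 为 {回应: {属性: 得分}},
    user_weights 为某个用户对各属性的偏好权重;
    返回加权得分最高的回应。"""
    def score(attrs):
        return sum(user_weights.get(a, 0.0) * v for a, v in attrs.items())
    return max(options, key=lambda o: score(options[o]))

options = {
    "直接给出答案": {"helpfulness": 0.9, "care": 0.2},
    "先确认用户处境再回答": {"helpfulness": 0.6, "care": 0.9},
}
# 某用户更看重关怀(care),选择随之改变
best = choose_response(options, {"helpfulness": 0.4, "care": 0.8})
```

不同用户只需换一组权重即可得到不同的对齐结果,这正是"标量奖励只反映平均偏好"之外的可调节性。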
zh

[NLP-53] Re:Verse – Can Your VLM Read a Manga?

【速读】: 该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在处理连续视觉叙事(如漫画)时存在的核心能力缺陷,即模型虽能较好完成单个画面的表层识别,但在时间因果推理与跨面板连贯性等深层叙事理解方面存在系统性不足。其解决方案的关键在于提出一个新颖的评估框架,包含三个核心要素:(i) 基于对齐轻小说文本的细粒度多模态标注协议,以明确视觉元素与叙事结构的关系;(ii) 覆盖直接推理与检索增强生成等多种推理范式的综合评估;(iii) 通过跨模态嵌入相似性分析揭示当前VLMs在联合表示中存在的根本性错位。该方法首次实现了对长篇视觉叙事理解的系统性量化评估,并指出当前模型缺乏真正的故事级智能,尤其在非线性叙事、角色一致性及长序列因果推理等方面表现薄弱。

链接: https://arxiv.org/abs/2508.08508
作者: Aaditya Baranwal,Madhav Kataria,Naitik Agrawal,Yogesh S Rawat,Shruti Vyas
机构: University of Central Florida (中佛罗里达大学); Indian Institute of Technology, Jodhpur (印度理工学院,乔德普尔分校); Indian Institute of Technology, Varanasi (印度理工学院,瓦拉纳西分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Current Vision Language Models (VLMs) demonstrate a critical gap between surface-level recognition and deep narrative reasoning when processing sequential visual storytelling. Through a comprehensive investigation of manga narrative understanding, we reveal that while recent large multimodal models excel at individual panel interpretation, they systematically fail at temporal causality and cross-panel cohesion, core requirements for coherent story comprehension. We introduce a novel evaluation framework that combines fine-grained multimodal annotation, cross-modal embedding analysis, and retrieval-augmented assessment to systematically characterize these limitations. Our methodology includes (i) a rigorous annotation protocol linking visual elements to narrative structure through aligned light novel text, (ii) comprehensive evaluation across multiple reasoning paradigms, including direct inference and retrieval-augmented generation, and (iii) cross-modal similarity analysis revealing fundamental misalignments in current VLMs’ joint representations. Applying this framework to Re:Zero manga across 11 chapters with 308 annotated panels, we conduct the first systematic study of long-form narrative understanding in VLMs through three core evaluation axes: generative storytelling, contextual dialogue grounding, and temporal reasoning. Our findings demonstrate that current models lack genuine story-level intelligence, struggling particularly with non-linear narratives, character consistency, and causal inference across extended sequences. This work establishes both the foundation and practical methodology for evaluating narrative intelligence, while providing actionable insights into the capability of deep sequential understanding of Discrete Visual Narratives beyond basic recognition in Multimodal Models. 
zh

[NLP-54] Momentum Point-Perplexity Mechanics in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中隐藏状态(hidden states)动态变化机制不清晰的问题,从而为模型的可解释性、异常检测和可控引导提供理论基础。其核心解决方案是引入一个类比物理学中能量的概念——即“对数拉格朗日量”(log-Lagrangian),该量由隐藏状态的变化率与模型对下一个token的置信度共同决定,在多个开源Transformer模型中表现出近似守恒特性。基于此守恒性质,作者提出了一种名为Jacobian steering的控制方法,通过最小扰动调整隐藏状态以偏好目标token,同时保持能量守恒,从而在保证稳定性的同时提升生成文本的语义质量。

链接: https://arxiv.org/abs/2508.08492
作者: Lorenzo Tomaz,Judd Rosenblatt,Thomas Berry Jones,Diogo Schwerz de Lucena
机构: AE Studio; PIBBSS; Timaeus; Apart Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We take a physics-based approach to studying how the internal hidden states of large language models change from token to token during inference. Across 20 open-source transformer models (135M-3B parameters), we find that a quantity combining the rate of change in hidden states and the model’s next-token certainty, analogous to energy in physics, remains nearly constant. Random-weight models conserve this “energy” more tightly than pre-trained ones, while training shifts models into a faster, more decisive regime with greater variability. Using this “log-Lagrangian” view, we derive a control method called Jacobian steering, which perturbs hidden states in the minimal way needed to favor a target token. This approach maintained near-constant energy in two tested models and produced continuations rated higher in semantic quality than the models’ natural outputs. Viewing transformers through this mechanics lens offers a principled basis for interpretability, anomaly detection, and low-risk steering. This could help make powerful models more predictable and aligned with human intent.
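摘要未给出该"能量"量的显式公式;下面仅以一个自拟的组合形式 E_t = log‖h_{t+1} − h_t‖ − log p_t(隐藏状态变化率与下一 token 置信度的结合)演示如何沿推理轨迹逐步计算这类量,该公式本身是本文的示例假设,并非论文定义:

```python
import numpy as np

def energy_trace(hidden_states, next_token_probs):
    """假设性的"能量"轨迹:hidden_states 为 (T, d) 的隐藏状态序列,
    next_token_probs 为长度 T-1 的下一 token 置信度序列。"""
    E = []
    for t in range(len(next_token_probs)):
        delta = np.linalg.norm(hidden_states[t + 1] - hidden_states[t])
        E.append(np.log(delta) - np.log(next_token_probs[t]))
    return np.array(E)

rng = np.random.default_rng(1)
H = rng.normal(size=(5, 8))                 # 5 步、8 维的虚构隐藏状态
p = np.array([0.5, 0.6, 0.4, 0.7])          # 虚构的置信度序列
E = energy_trace(H, p)
```

论文的发现可以理解为:沿真实模型的推理轨迹,这类量的波动远小于随机预期。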
zh

[NLP-55] Enhancing Small LLM Alignment through Margin-Based Objective Modifications under Resource Constraints

【速读】: 该论文旨在解决小规模大语言模型(Small Large Language Models, LLMs)在人类偏好对齐(preference alignment)中因性能差距显著而难以有效优化的问题。其核心挑战在于,当模型能力受限时,传统基于直接偏好优化(Direct Preference Optimization, DPO)的方法难以充分学习高质量的输出策略。解决方案的关键在于提出两种轻量级DPO变体:Adaptive Margin-Sigmoid Loss与APO-hinge-zero,其中APO-hinge-zero通过引入hinge损失驱动的难例挖掘(hinge-induced hard-example mining)与APO-zero的选中聚焦优化(chosen-focused optimization)相结合,实现了更有效的选择性更新机制。该方法在AlpacaEval和MT-Bench评测中均展现出显著改进,尤其在STEM和人文类任务上表现突出,表明通过简单调整偏好目标函数即可显著提升小模型在资源受限场景下的对齐效果。

链接: https://arxiv.org/abs/2508.08466
作者: Daren Yao,Jinsong Yuan,Ruike Chen
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Small large language models (LLMs) often face difficulties in aligning output to human preferences, particularly when operating under severe performance gaps. In this work, we propose two lightweight DPO-based variants – Adaptive Margin-Sigmoid Loss and APO-hinge-zero – to better address underperformance scenarios by introducing margin-based objectives and selective update mechanisms. Our APO-hinge-zero method, which combines hinge-induced hard-example mining with the chosen-focused optimization of APO-zero, achieves strong results. In AlpacaEval, APO-hinge-zero improves the win rate by +2.0 points and the length-controlled win rate by +1.4 points compared to the APO-zero baseline. In MT-Bench, our methods maintain competitive performance in diverse categories, particularly excelling in STEM and Humanities tasks. These results demonstrate that simple modifications to preference-based objectives can significantly enhance small LLM alignment under resource constraints, offering a practical path toward more efficient deployment.
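两种变体的精确损失形式摘要中未披露;下面按 DPO 的标准记号给出"带 margin 的 sigmoid 损失"与"hinge 损失"的通用写法作为示意(β 与 margin 的取值为示例假设),可以看到 hinge 形式天然只在难例上产生梯度:

```python
import math

def dpo_sigmoid_margin(delta, beta=0.1, margin=0.5):
    """带 margin 的 sigmoid 偏好损失:-log σ(β·Δ − margin);
    Δ 为 chosen 与 rejected 的(相对参考模型的)对数概率比之差。"""
    z = beta * delta - margin
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def dpo_hinge(delta, beta=0.1, margin=0.5):
    """hinge 形式:β·Δ 超过 margin 后损失为 0,
    等价于只在"难例"上继续更新(hard-example mining)。"""
    return max(0.0, margin - beta * delta)

easy_loss = dpo_hinge(20.0)   # β·Δ = 2.0 > margin,易例不再产生损失
hard_loss = dpo_hinge(1.0)    # β·Δ = 0.1 < margin,难例仍有损失
```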
zh

[NLP-56] Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

【速读】: 该论文旨在解决语言模型中分词策略对下游任务性能影响的争议问题,特别是针对形态学复杂的语言是否应采用形态对齐(morphologically aligned)的分词方法。其关键解决方案在于系统性地评估不同分词算法(如BPE与Unigram)和形态对齐程度对多种语言(包括Telugu、Hindi和English)在语法任务(如词性标注、命名实体识别和依存句法分析)中的影响。研究发现,虽然更高的形态对齐性与下游任务性能呈正相关(但相关度较弱),但分词算法本身(尤其是Unigram优于BPE)才是决定性能的关键因素;此外,引入形态分割的混合分词策略在BPE框架下能显著提升效果,而传统内在指标(如语料词数CTC和Rényi熵)则与下游性能无关。

链接: https://arxiv.org/abs/2508.08424
作者: Saketh Reddy Vemula,Dipti Mishra Sharma,Parameswari Krishnamurthy
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models – starting from tokenizer training and extending through the finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively – though moderately – with performance in syntax-based tasks such as Parts-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and Rényi entropy showed no correlation with downstream performance.
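摘要提到的内在指标之一 Rényi 熵定义为 H_α = (1/(1−α))·log Σ_i p_i^α,α→1 时退化为香农熵。下面按该定义给出一个小的计算示例,展示均匀的词元分布熵高于偏斜分布:

```python
import math

def renyi_entropy(probs, alpha=2.0):
    """词元分布的 Rényi 熵;alpha=1 时按极限取香农熵。"""
    if abs(alpha - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

uniform = [0.25] * 4            # 均匀的词元分布
skewed = [0.7, 0.1, 0.1, 0.1]   # 偏斜的词元分布
h_u = renyi_entropy(uniform, alpha=2.0)
h_s = renyi_entropy(skewed, alpha=2.0)
```

论文的结论是:这类对词元分布形状的度量并不能预测下游任务表现。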
zh

[NLP-57] Mol-R1: Towards Explicit Long-CoT Reasoning in Molecule Discovery

【速读】: 该论文旨在解决生成式 AI(Generative AI)在知识密集型领域——特别是分子发现任务中——因缺乏精准的领域知识理解与低效推理能力而导致性能受限的问题。其核心挑战在于分子数据的复杂性及高质量专家标注的稀缺性,使得现有显式链式思维(Explicit Long Chain-of-Thought, Long-CoT)推理模型难以有效支持文本驱动的分子生成任务。解决方案的关键在于提出 Mol-R1 框架,包含两个核心技术:一是通过“上下文内蒸馏引导的先验调控”(Prior Regulation via In-context Distillation, PRID)构建高质量推理数据集,以生成受先验规则约束的推理轨迹;二是引入“分子迭代适应”(Molecular Iterative Adaptation, MoIA),通过监督微调(Supervised Fine-tuning, SFT)与强化策略优化(Reinforced Policy Optimization, RPO)的交替迭代训练策略,显著提升 R1 类模型在分子发现中的推理性能与可解释性。

链接: https://arxiv.org/abs/2508.08401
作者: Jiatong Li,Weida Wang,Qinggang Zhang,Junxian Li,Di Zhang,Changmeng Zheng,Shufei Zhang,Xiaoyong Wei,Qing Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 20 pages

点击查看摘要

Abstract:Large language models (LLMs), especially Explicit Long Chain-of-Thought (CoT) reasoning models like DeepSeek-R1 and QWQ, have demonstrated powerful reasoning capabilities, achieving impressive performance in commonsense reasoning and mathematical inference. Despite their effectiveness, Long-CoT reasoning models are often criticized for their limited ability and low efficiency in knowledge-intensive domains such as molecule discovery. Success in this field requires a precise understanding of domain knowledge, including molecular structures and chemical principles, which is challenging due to the inherent complexity of molecular data and the scarcity of high-quality expert annotations. To bridge this gap, we introduce Mol-R1, a novel framework designed to improve explainability and reasoning performance of R1-like Explicit Long-CoT reasoning LLMs in text-based molecule generation. Our approach begins with a high-quality reasoning dataset curated through Prior Regulation via In-context Distillation (PRID), a dedicated distillation strategy to effectively generate paired reasoning traces guided by prior regulations. Building upon this, we introduce MoIA, Molecular Iterative Adaptation, a sophisticated training strategy that iteratively combines Supervised Fine-tuning (SFT) with Reinforced Policy Optimization (RPO), tailored to boost the reasoning performance of R1-like reasoning models for molecule discovery. Finally, we examine the performance of Mol-R1 in the text-based molecule reasoning generation task, showing superior performance against existing baselines.
zh

[NLP-58] CoDAE: Adapting Large Language Models for Education via Chain-of-Thought Data Augmentation

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在教育场景中表现不佳的问题,具体包括:过早暴露答案、无法根据学生不确定程度调整回应策略,以及易受情绪化诱导提示的影响。解决方案的关键在于提出CoDAE框架,通过链式思维(Chain-of-Thought, CoT)数据增强技术,对真实师生对话进行重构与扩充,以促进逐步推理和符合教学逻辑的指导;同时设计针对性对话案例,明确缓解上述三大局限性,并基于增强数据对多个开源LLM进行微调,在模拟教育场景中显著提升了指导的适切性、推理支持能力及抗诱导脆弱性。

链接: https://arxiv.org/abs/2508.08386
作者: Shuzhou Yuan,William LaCroix,Hardik Ghoshal,Ercong Nie,Michael Färber
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly employed as AI tutors due to their scalability and potential for personalized instruction. However, off-the-shelf LLMs often underperform in educational settings: they frequently reveal answers too readily, fail to adapt their responses to student uncertainty, and remain vulnerable to emotionally manipulative prompts. To address these challenges, we introduce CoDAE, a framework that adapts LLMs for educational use through Chain-of-Thought (CoT) data augmentation. We collect real-world dialogues between students and a ChatGPT-based tutor and enrich them using CoT prompting to promote step-by-step reasoning and pedagogically aligned guidance. Furthermore, we design targeted dialogue cases to explicitly mitigate three key limitations: over-compliance, low response adaptivity, and threat vulnerability. We fine-tune four open-source LLMs on different variants of the augmented datasets and evaluate them in simulated educational scenarios using both automatic metrics and LLM-as-a-judge assessments. Our results show that models fine-tuned with CoDAE deliver more pedagogically appropriate guidance, better support reasoning processes, and effectively resist premature answer disclosure.
zh

[NLP-59] Bilevel MCTS for Amortized O(1) Node Selection in Classical Planning

【速读】: 该论文旨在解决经典规划中基于多臂老虎机(Multi-Armed Bandit, MAB)的蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS)在节点选择阶段效率低下的问题。具体而言,MCTS通常使用基于树结构的OPEN列表来管理待扩展节点,导致每次选择节点的时间复杂度为 O(log N),其中 N 为OPEN列表大小,log N 与搜索深度 d 相当;而在经典规划任务中,d 可能极大(如 k-盘汉诺塔问题中可达 2^k − 1),使得节点选择开销显著,远超游戏树搜索场景下可忽略的代价。解决方案的关键在于提出一种双层(bilevel)改进机制:在每个选定叶节点处执行一次最佳优先搜索(best-first search),并以搜索深度 d 为比例设定扩展预算,从而实现节点选择的摊销时间复杂度为 O(1),等效于传统基于数组的优先队列;此外引入树折叠(Tree Collapsing)技术进一步减少动作选择步骤,提升整体性能。

链接: https://arxiv.org/abs/2508.08385
作者: Masataro Asai
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We study an efficient implementation of Multi-Armed Bandit (MAB)-based Monte-Carlo Tree Search (MCTS) for classical planning. One weakness of MCTS is that it spends a significant time deciding which node to expand next. While selecting a node from an OPEN list with N nodes has O(1) runtime complexity with traditional array-based priority-queues for dense integer keys, the tree-based OPEN list used by MCTS requires O(\log N), which roughly corresponds to the search depth d. In classical planning, d is arbitrarily large (e.g., 2^k - 1 in k-disk Tower-of-Hanoi) and the runtime for node selection is significant, unlike in game tree search, where the cost is negligible compared to the node evaluation (rollouts) because d is inherently limited by the game (e.g., d \leq 361 in Go). To improve this bottleneck, we propose a bilevel modification to MCTS that runs a best-first search from each selected leaf node with an expansion budget proportional to d, which achieves amortized O(1) runtime for node selection, equivalent to the traditional queue-based OPEN list. In addition, we introduce Tree Collapsing, an enhancement that reduces action selection steps and further improves the performance.
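下面用 Python 的堆队列给出"从选定叶节点出发、以与深度 d 成比例的预算做最佳优先扩展"这一核心思路的玩具示意;状态空间、启发函数与预算系数均为本文自拟示例,并非论文实现:

```python
import heapq

def budgeted_best_first(root, successors, h, depth, c=1):
    """从 root 出发做最佳优先搜索,扩展预算与搜索深度 depth 成正比;
    选叶的代价因此被摊销到每次扩展上,摊销为 O(1)。返回扩展过的节点。"""
    budget = max(1, c * depth)
    frontier = [(h(root), root)]           # 以启发值为键的小顶堆
    expanded = []
    while frontier and len(expanded) < budget:
        _, node = heapq.heappop(frontier)
        expanded.append(node)
        for child in successors(node):
            heapq.heappush(frontier, (h(child), child))
    return expanded

# 玩具状态空间:整数节点,后继为 2n+1 与 2n+2,启发值取节点本身
succ = lambda n: [2 * n + 1, 2 * n + 2]
out = budgeted_best_first(0, succ, h=lambda n: n, depth=3)
```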
zh

[NLP-60] Exploring the Technical Knowledge Interaction of Global Digital Humanities: Three-decade Evidence from Bibliometric-based perspectives

【速读】: 该论文旨在解决现有文献中对数字人文(Digital Humanities, DH)领域技术进展与研究主题演进之间内在关联性分析不足的问题,尤其是传统文献计量研究多停留在热点识别、合作网络构建等表层描述,难以揭示方法与主题协同演化规律。其解决方案的关键在于提出“主题-方法复合结构”(Topic-Method Composition, TMC)这一新概念,即通过特定研究主题与其对应方法的共现关系构建混合知识结构,并基于TMC之间的交互关系挖掘数字技术与人文学科交叉融合的深层模式。研究进一步开发了一套融合文献计量、主题建模与网络分析的TMC驱动工作流,实现了对DH知识结构的精细化刻画,为跨学科研究提供可迁移的方法论工具。

链接: https://arxiv.org/abs/2508.08347
作者: Jiayi Li,Chengxi Yan,Yurong Zeng,Zhichao Fang,Huiru Wang
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Digital Humanities (DH) is an interdisciplinary field that integrates computational methods with humanities scholarship to investigate innovative topics. Each academic discipline follows a unique developmental path shaped by the topics researchers investigate and the methods they employ. With the help of bibliometric analysis, most of previous studies have examined DH across multiple dimensions such as research hotspots, co-author networks, and institutional rankings. However, these studies have often been limited in their ability to provide deep insights into the current state of technological advancements and topic development in DH. As a result, their conclusions tend to remain superficial or lack interpretability in understanding how methods and topics interrelate in the field. To address this gap, this study introduced a new concept of Topic-Method Composition (TMC), which refers to a hybrid knowledge structure generated by the co-occurrence of specific research topics and the corresponding method. Especially by analyzing the interaction between TMCs, we can see more clearly the intersection and integration of digital technology and humanistic subjects in DH. Moreover, this study developed a TMC-based workflow combining bibliometric analysis, topic modeling, and network analysis to analyze the development characteristics and patterns of research disciplines. By applying this workflow to large-scale bibliometric data, it enables a detailed view of the knowledge structures, providing a tool adaptable to other fields.
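TMC 的基本构件是"研究主题 × 方法"的共现;下面用一个极简的计数示意,从带(主题, 方法)标注的论文列表构建共现计数(标注与数据均为虚构示例,论文中的主题由主题建模得到):

```python
from collections import Counter
from itertools import product

def tmc_cooccurrence(papers):
    """papers: 每篇论文的 (topics, methods) 标注列表;
    返回 (topic, method) 共现计数,即 TMC 结构的最简形式。"""
    counts = Counter()
    for topics, methods in papers:
        counts.update(product(topics, methods))
    return counts

papers = [
    (["古籍整理"], ["OCR", "NER"]),
    (["古籍整理", "历史地理"], ["NER"]),
]
tmc = tmc_cooccurrence(papers)
```

在此计数之上即可构建 TMC 之间的交互网络,做后续的网络分析。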
zh

[NLP-61] Maximizing GPU Efficiency via Optimal Adapter Caching: An Analytical Approach for Multi-Tenant LLM Serving

【速读】: 该论文旨在解决大规模语言模型(Large Language Model, LLM)适配器(adapter)在单节点部署中因适配器种类繁多而引入的显著性能开销与资源分配难题,这些问题导致请求饥饿和GPU资源利用效率低下。解决方案的关键在于提出一个基于AI驱动的分析管道,通过深入剖析LLM适配器服务中的各类开销和性能波动,结合首个能够复现在线LLM适配器服务系统并匹配关键性能指标的数字孪生(Digital Twin),实现对适配器在单节点上的最优分配——该分配策略能最大化吞吐量、充分利用GPU资源且避免请求饥饿,同时其结果可扩展至多副本部署场景用于整体放置优化、负载均衡与服务器配置调整,从而提升系统整体性能与资源利用率。

链接: https://arxiv.org/abs/2508.08343
作者: Ferran Agullo,Joan Oliveras,Chen Wang,Alberto Gutierrez-Torre,Olivier Tardieu,Alaa Youssef,Jordi Torres,Josep Ll. Berral
机构: Barcelona Supercomputing Center (巴塞罗那超级计算中心); Universitat Politècnica de Catalunya - BarcelonaTech (加泰罗尼亚理工大学-巴塞罗那技术学院); IBM Research (IBM 研究院)
类目: Performance (cs.PF); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Under review for a computer science conference

点击查看摘要

Abstract:Serving LLM adapters has gained significant attention as an effective approach to adapt general-purpose language models to diverse, task-specific use cases. However, serving a wide range of adapters introduces several and substantial overheads, leading to performance degradation and challenges in optimal placement. To address these challenges, we present an analytical, AI-driven pipeline that accurately determines the optimal allocation of adapters in single-node setups. This allocation maximizes performance, effectively using GPU resources, while preventing request starvation. Crucially, the proposed allocation is given based on current workload patterns. These insights in single-node setups can be leveraged in multi-replica deployments for overall placement, load balancing and server configuration, ultimately enhancing overall performance and improving resource efficiency. Our approach builds on an in-depth analysis of LLM adapter serving, accounting for overheads and performance variability, and includes the development of the first Digital Twin capable of replicating online LLM-adapter serving systems with matching key performance metrics. The experimental results demonstrate that the Digital Twin achieves a SMAPE difference of no more than 5.5% in throughput compared to real results, and the proposed pipeline accurately predicts the optimal placement with minimal latency.
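文中用 SMAPE 衡量数字孪生与真实系统在吞吐量上的偏差(不超过 5.5%);SMAPE 的一种常用定义为 (100/n)·Σ 2|ŷ−y|/(|y|+|ŷ|),值域 [0, 200]。下面按该定义给出一个小例子(数值为虚构):

```python
def smape(actual, predicted):
    """对称平均绝对百分比误差(%)。分母为零的点跳过。"""
    terms = [
        200.0 * abs(p - a) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)
        if (abs(a) + abs(p)) > 0
    ]
    return sum(terms) / len(terms)

# 示例:真实系统与数字孪生各时刻的吞吐量(虚构数值)
real = [100.0, 120.0, 90.0]
twin = [98.0, 126.0, 90.0]
err = smape(real, twin)
```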
zh

[NLP-62] Putnam-AXIOM: A Functional and Static Benchmark ICML2025

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)数学推理能力评估中存在的两个核心问题:一是现有基准测试(benchmark)已接近性能饱和,且受训练集污染(training-set contamination)影响严重;二是缺乏动态、可扩展且能有效检测模型是否真正掌握推理能力的评估机制。解决方案的关键在于提出Putnam-AXIOM基准及其变体(Putnam-AXIOM Variation),其中通过程序化扰动变量和常数生成无限数量难度相当、未见过的问题实例,从而构建出抗污染的动态测试环境。此外,作者引入Teacher-Forced Accuracy(TFA)这一轻量级指标,直接评估推理轨迹并自动化自然语言证明评分,弥补传统“框定”准确率(boxed accuracy)的局限性,为高级数学推理能力提供更严谨、可扩展的评估框架。

链接: https://arxiv.org/abs/2508.08292
作者: Aryan Gulati,Brando Miranda,Eric Chen,Emily Xia,Kai Fronsdal,Bruno Dumont,Elyas Obbad,Sanmi Koyejo
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Neural and Evolutionary Computing (cs.NE)
备注: 27 pages total (10-page main paper + 17-page appendix), 12 figures, 6 tables. Submitted to ICML 2025 (under review)

点击查看摘要

Abstract:Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances – yielding a contamination-resilient test bed. On the Original set, OpenAI’s o1-preview – the strongest evaluated model – scores 41.9%, but its accuracy drops by 19.6% (46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement “boxed” accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural language proof evaluations. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning of LLMs. Data and evaluation code are publicly available at this https URL.
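The quoted numbers fit together as simple arithmetic: a 19.6-point drop from 41.9% accuracy is exactly the stated 46.8% relative decrease. A one-line check:

```python
# Numbers quoted in the abstract for o1-preview on Putnam-AXIOM
original_acc = 41.9   # % accuracy on the Original set
drop_points = 19.6    # absolute drop in percentage points on the Variations

variation_acc = original_acc - drop_points        # accuracy on Variations
relative_drop = 100 * drop_points / original_acc  # relative decrease, %

print(f"Variation accuracy: {variation_acc:.1f}%")   # 22.3%
print(f"Relative decrease:  {relative_drop:.1f}%")   # 46.8%, as quoted
```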
zh

[NLP-63] Sacred or Synthetic? Evaluating LLM Reliability and Abstention for Religious Questions AAAI

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在宗教领域、特别是伊斯兰教法(Fiqh)推理中的可靠性与准确性问题,尤其关注不同逊尼派法学派别(如哈乃斐、马立克、沙斐仪和罕百里派)的细粒度区分以及模型在不确定时的拒答行为(abstention)。其解决方案的关键在于构建首个针对伊斯兰教法细分学派的基准测试数据集 FiqhQA,该数据集涵盖阿拉伯语和英语双语样本,并明确标注各回答所属的法学派别;同时,通过零样本(zero-shot)和拒答实验评估模型在准确性和拒答能力上的表现,发现尽管 GPT-4o 在准确性上最优,但 Gemini 和 Fanar 在拒答行为上更优,且所有模型在阿拉伯语任务中性能显著下降,揭示了多语言宗教推理的局限性。

链接: https://arxiv.org/abs/2508.08287
作者: Farah Atif,Nursultan Askarbekuly,Kareem Darwish,Monojit Choudhury
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 8th AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)

点击查看摘要

Abstract:Despite the increasing usage of Large Language Models (LLMs) in answering questions in a variety of domains, their reliability and accuracy remain unexamined for a plethora of domains including the religious domains. In this paper, we introduce a novel benchmark FiqhQA focused on the LLM generated Islamic rulings explicitly categorized by the four major Sunni schools of thought, in both Arabic and English. Unlike prior work, which either overlooks the distinctions between religious school of thought or fails to evaluate abstention behavior, we assess LLMs not only on their accuracy but also on their ability to recognize when not to answer. Our zero-shot and abstention experiments reveal significant variation across LLMs, languages, and legal schools of thought. While GPT-4o outperforms all other models in accuracy, Gemini and Fanar demonstrate superior abstention behavior critical for minimizing confident incorrect answers. Notably, all models exhibit a performance drop in Arabic, highlighting the limitations in religious reasoning for languages other than English. To the best of our knowledge, this is the first study to benchmark the efficacy of LLMs for fine-grained Islamic school of thought specific ruling generation and to evaluate abstention for Islamic jurisprudence queries. Our findings underscore the need for task-specific evaluation and cautious deployment of LLMs in religious applications.
zh

[NLP-64] The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)中幻觉(Hallucination)检测方法的评估体系存在严重偏差的问题。现有方法普遍依赖ROUGE这一基于词法重叠的指标,但研究表明其高召回率伴随极低精确率,导致对检测性能的估计严重失真;更关键的是,部分成熟检测方法在采用人类对齐的评估指标(如LLM-as-Judge)时性能下降高达45.9%。论文指出,当前评估范式未能捕捉语义一致性,且简单基于响应长度的启发式策略甚至可媲美复杂检测技术,揭示了现有方法评价体系的根本缺陷。解决方案的关键在于引入语义感知且鲁棒的评估框架,以准确衡量幻觉检测方法的真实效能,从而保障LLM输出的可信性。

链接: https://arxiv.org/abs/2508.08285
作者: Denis Janiak,Jakub Binkowski,Albert Sawczyn,Bogdan Gabrys,Ravid Schwartz-Ziv,Tomasz Kajdanowicz
机构: Wroclaw University of Science and Technology (弗罗茨瓦夫理工大学); University of Technology Sydney (悉尼科技大学); New York University (纽约大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Preprint, under review

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detection methods, their evaluations often rely on ROUGE, a metric based on lexical overlap that misaligns with human judgments. Through comprehensive human studies, we demonstrate that while ROUGE exhibits high recall, its extremely low precision leads to misleading performance estimates. In fact, several established detection methods show performance drops of up to 45.9% when assessed using human-aligned metrics like LLM-as-Judge. Moreover, our analysis reveals that simple heuristics based on response length can rival complex detection techniques, exposing a fundamental flaw in current evaluation practices. We argue that adopting semantically aware and robust evaluation frameworks is essential to accurately gauge the true performance of hallucination detection methods, ultimately ensuring the trustworthiness of LLM outputs.
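To see why high recall paired with very low precision is misleading, consider a toy confusion-matrix calculation (the counts below are invented for illustration, not taken from the paper):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy illustration: a lexical-overlap metric flags 95 of 100 true
# hallucinations (high recall) but also fires on 300 faithful answers.
p, r = precision_recall(tp=95, fp=300, fn=5)
print(f"precision={p:.2f} recall={r:.2f}")  # precision≈0.24, recall=0.95
```

A detector summarized only by its recall would look excellent here while being wrong in three out of four alarms, which is the kind of distortion the human-aligned evaluation exposes.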
zh

[NLP-65] MinionsLLM: a Task-adaptive Framework For The Training and Control of Multi-Agent Systems Through Natural Language

【速读】: 该论文旨在解决如何通过自然语言指令实现对多智能体系统(multi-agent systems)在任意用户定义环境中的有效控制问题。现有方法难以保证语言指令的语法正确性与任务语义的相关性,导致控制效果不稳定。解决方案的关键在于提出MinionsLLM框架,该框架将大型语言模型(Large Language Models, LLMs)与行为树(Behavior Trees, BTs)及形式文法(Formal Grammars)相结合,构建标准化接口以定义环境、代理和行为原语,并引入两种合成数据集生成方法(Method A 和 Method B),用于微调LLMs以提升语法有效性与任务相关性。实验表明,Method B可使语法有效性提升至92.6%,平均任务性能提高33%,且小参数规模模型(如1B)在微调后收益最大,为资源受限场景下部署本地化轻量级LLM提供了可行路径。

链接: https://arxiv.org/abs/2508.08283
作者: Andres Garcia Rincon,Eliseo Ferrante
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:This paper presents MinionsLLM, a novel framework that integrates Large Language Models (LLMs) with Behavior Trees (BTs) and Formal Grammars to enable natural language control of multi-agent systems within arbitrary, user-defined environments. MinionsLLM provides standardized interfaces for defining environments, agents, and behavioral primitives, and introduces two synthetic dataset generation methods (Method A and Method B) to fine-tune LLMs for improved syntactic validity and semantic task relevance. We validate our approach using Google’s Gemma 3 model family at three parameter scales (1B, 4B, and 12B) and demonstrate substantial gains: Method B increases syntactic validity to 92.6% and achieves a mean task performance improvement of 33% over baseline. Notably, our experiments show that smaller models benefit most from fine-tuning, suggesting promising directions for deploying compact, locally hosted LLMs in resource-constrained multi-agent control scenarios. The framework and all resources are released open-source to support reproducibility and future research.
zh

[NLP-66] Objective Metrics for Evaluating Large Language Models Using External Data Sources

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)性能评估中依赖主观判断所带来的不一致性和不可靠性问题,尤其是在教育、科研等高风险场景下。其解决方案的关键在于构建一个基于跨学期课程文本材料的客观指标框架,通过结合明确定义的基准测试、事实性数据集和结构化的评估流程,实现自动化、透明化且偏差最小化的评分机制,从而提升评估结果的一致性、可复现性与实际应用契合度。

链接: https://arxiv.org/abs/2508.08277
作者: Haoze Du,Richard Li,Edward Gehringer
机构: North Carolina State University (北卡罗来纳州立大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: This version of the paper is lightly revised from the EDM 2025 proceedings for the sake of clarity

点击查看摘要

Abstract:Evaluating the performance of Large Language Models (LLMs) is a critical yet challenging task, particularly when aiming to avoid subjective assessments. This paper proposes a framework for leveraging objective metrics derived from the class textual materials across different semesters to assess LLM outputs across various tasks. By utilizing well-defined benchmarks, factual datasets, and structured evaluation pipelines, the approach ensures consistent, reproducible, and bias-minimized measurements. The framework emphasizes automation and transparency in scoring, reducing reliance on human interpretation while ensuring alignment with real-world applications. This method addresses the limitations of subjective evaluation methods, providing a scalable solution for performance assessment in educational, scientific, and other high-stakes domains.
zh

[NLP-67] Evaluating Contrast Localizer for Identifying Causal Units in Social and Mathematical Tasks in Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)和视觉-语言模型(Vision-Language Models, VLMs)中特定认知功能单元的因果可解释性问题,即如何准确识别并验证对理论心智(Theory of Mind, ToM)和数学推理任务具有因果作用的神经元单元。其解决方案的关键在于采用神经科学中的对比局部化(contrast localizer)方法,通过设计对比刺激集定位高激活单元,并结合靶向消融实验(targeted ablations)评估这些单元在下游任务中的因果影响。研究发现,基于对比的局部化方法所识别的高激活单元并不一定具有更强的因果作用,甚至低激活单元有时导致更大的性能下降,提示当前方法存在局限性,亟需更广泛的刺激集以更精确地捕捉任务特异性单元。

链接: https://arxiv.org/abs/2508.08276
作者: Yassine Jamaa,Badr AlKhamissi,Satrajit Ghosh,Martin Schrimpf
机构: EPFL (瑞士联邦理工学院); MIT (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work adapts a neuroscientific contrast localizer to pinpoint causally relevant units for Theory of Mind (ToM) and mathematical reasoning tasks in large language models (LLMs) and vision-language models (VLMs). Across 11 LLMs and 5 VLMs ranging in size from 3B to 90B parameters, we localize top-activated units using contrastive stimulus sets and assess their causal role via targeted ablations. We compare the effect of lesioning functionally selected units against low-activation and randomly selected units on downstream accuracy across established ToM and mathematical benchmarks. Contrary to expectations, low-activation units sometimes produced larger performance drops than the highly activated ones, and units derived from the mathematical localizer often impaired ToM performance more than those from the ToM localizer. These findings call into question the causal relevance of contrast-based localizers and highlight the need for broader stimulus sets that more accurately capture task-specific units.
zh

[NLP-68] MLLM-CTBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在持续指令微调(continual instruction tuning)过程中缺乏系统化、严谨评估基准的问题。现有研究难以客观衡量模型在动态任务场景下的性能保持与演化能力,制约了算法优化与训练范式的比较。其解决方案的关键在于提出MLLM-CTBench,一个涵盖多维度评估、全面覆盖不同持续学习算法与训练范式、并基于精心筛选的16个数据集构建的综合性基准测试平台。该方案通过引入细粒度的思维链(Chain-of-Thought, CoT)推理质量评估机制、系统对比强化学习与监督微调(Supervised Fine-Tuning, SFT)策略,并揭示模型能力、任务顺序与遗忘之间的复杂关系,为MLLMs的持续学习提供了可量化、可复现的评估标准和实用指导。

链接: https://arxiv.org/abs/2508.08275
作者: Haiyun Guo,ZhiYan Hou,Yu Chen,Jinghan He,Yandu Sun,Yuzhe Zhou,Shujing Guo,Kuan Zhu,Jinqiao Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: under review

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) rely on continual instruction tuning to adapt to the evolving demands of real-world applications. However, progress in this area is hindered by the lack of rigorous and systematic benchmarks. To address this gap, we present MLLM-CTBench, a comprehensive evaluation benchmark with three key contributions: (1) Multidimensional Evaluation: We combine final answer accuracy with fine-grained CoT reasoning quality assessment, enabled by a specially trained CoT evaluator; (2) Comprehensive Evaluation of Algorithms and Training Paradigms: We benchmark eight continual learning algorithms across four major categories and systematically compare reinforcement learning with supervised fine-tuning paradigms; (3) Carefully Curated Tasks: We select and organize 16 datasets from existing work, covering six challenging domains. Our key findings include: (i) Models with stronger general capabilities exhibit greater robustness to forgetting during continual learning; (ii) Reasoning chains degrade more slowly than final answers, supporting the hierarchical forgetting hypothesis; (iii) The effectiveness of continual learning algorithms is highly dependent on both model capability and task order; (iv) In reinforcement learning settings, incorporating KL-divergence constraints helps maintain policy stability and plays a crucial role in mitigating forgetting. MLLM-CTBench establishes a rigorous standard for continual instruction tuning of MLLMs and offers practical guidance for algorithm design and evaluation.
zh

[NLP-69] Distilling Knowledge from Large Language Models: A Concept Bottleneck Model for Hate and Counter Speech Recognition

【速读】: 该论文旨在解决社交媒体上仇恨言论(hate speech)激增所带来的社会影响问题,提出一种自动化检测方法以提升识别准确性和可解释性。其解决方案的关键在于提出“话语概念瓶颈模型”(Speech Concept Bottleneck Model, SCBM),该模型利用形容词作为人类可理解的瓶颈概念(bottleneck concepts),通过大型语言模型(LLMs)将输入文本映射为基于形容词的抽象表征,并由轻量级分类器完成下游任务。此设计不仅在五个跨语言和平台的基准数据集上实现了平均宏F1分数0.69,优于现有方法,还提供了局部与全局层面的高可解释性,同时证明了该形容词表示能与Transformer嵌入互补,进一步提升性能。

链接: https://arxiv.org/abs/2508.08274
作者: Roberto Labadie-Tamayo,Djordje Slijepčević,Xihui Chen,Adrian Jaques Böck,Andreas Babic,Liz Freimann,Christiane Atzmüller,Matthias Zeppelzauer
机构: St.Pölten University of Applied Sciences (圣珀尔滕应用科学大学); University of Vienna (维也纳大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 33 pages, 10 figures, This is a preprint of a manuscript accepted for publication in Information Processing Management (Elsevier)

点击查看摘要

Abstract:The rapid increase in hate speech on social media has had an unprecedented impact on society, making automated methods for detecting such content important. Unlike prior black-box models, we propose a novel transparent method for automated hate and counter speech recognition, i.e., “Speech Concept Bottleneck Model” (SCBM), using adjectives as human-interpretable bottleneck concepts. SCBM leverages large language models (LLMs) to map input texts to an abstract adjective-based representation, which is then sent to a light-weight classifier for downstream tasks. Across five benchmark datasets spanning multiple languages and platforms (e.g., Twitter, Reddit, YouTube), SCBM achieves an average macro-F1 score of 0.69, which outperforms the most recently reported results from the literature on four out of five datasets. Aside from high recognition accuracy, SCBM provides a high level of both local and global interpretability. Furthermore, fusing our adjective-based concept representation with transformer embeddings leads to a 1.8% performance increase on average across all datasets, showing that the proposed representation captures complementary information. Our results demonstrate that adjective-based concept representations can serve as compact, interpretable, and effective encodings for hate and counter speech recognition. With adapted adjectives, our method can also be applied to other NLP tasks.
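The macro-F1 figure reported above is the unweighted average of per-class F1 scores, so minority classes count as much as majority ones. A minimal sketch (the labels and toy data below are invented for illustration):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy 3-class example (class names are illustrative, not the paper's schema)
y_true = ["hate", "counter", "neutral", "hate", "neutral"]
y_pred = ["hate", "neutral", "neutral", "hate", "counter"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.5
```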
zh

[NLP-70] TT-XAI: Trustworthy Clinical Text Explanations via Keyword Distillation and LLM Reasoning

【速读】: 该论文旨在解决临床语言模型在处理长篇、非结构化电子健康记录(Electronic Health Records, EHR)时,预测结果缺乏可信度以及解释性不足的问题。其核心解决方案在于提出TT-XAI框架,关键创新点包括:一是通过领域感知的关键词蒸馏(domain-aware keyword distillation)将原始出院小结压缩为精炼的关键词表示,从而提升BERT分类器性能并增强局部解释的保真度(fidelity),采用改进版LIME实现聚焦式解释;二是利用关键词引导的提示工程(keyword-guided prompts)驱动大语言模型(Large Language Models, LLMs)生成链式思维(chain-of-thought)临床推理,提高解释的简洁性和临床相关性。实验表明,该方法在删除法保真度指标、LLaMA-3自评和专家盲测中均优于基线,显著提升了机器与人类对模型决策的理解能力。

链接: https://arxiv.org/abs/2508.08273
作者: Kristian Miok,Blaz Škrlj,Daniela Zaharie,Marko Robnik Šikonja
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Clinical language models often struggle to provide trustworthy predictions and explanations when applied to lengthy, unstructured electronic health records (EHRs). This work introduces TT-XAI, a lightweight and effective framework that improves both classification performance and interpretability through domain-aware keyword distillation and reasoning with large language models (LLMs). First, we demonstrate that distilling raw discharge notes into concise keyword representations significantly enhances BERT classifier performance and improves local explanation fidelity via a focused variant of LIME. Second, we generate chain-of-thought clinical explanations using keyword-guided prompts to steer LLMs, producing more concise and clinically relevant reasoning. We evaluate explanation quality using deletion-based fidelity metrics, self-assessment via LLaMA-3 scoring, and a blinded human study with domain experts. All evaluation modalities consistently favor the keyword-augmented method, confirming that distillation enhances both machine and human interpretability. TT-XAI offers a scalable pathway toward trustworthy, auditable AI in clinical decision support.
zh

[NLP-71] Real-time News Story Identification

【速读】: 该论文旨在解决新闻监测系统中实时识别新闻文章所属事件故事(story)的问题,即如何在新闻文章发布时自动将其归类到对应的具体事件、地点和人物构成的故事簇中,而非基于一般文本相似性或预定义主题进行聚类。解决方案的关键在于融合多种文本表示技术、聚类算法与在线主题建模方法,特别是通过结合BERTopic、DBStream和TextClust等在线主题模型,有效提取事件特征与命名实体,并实现对新闻流的实时故事识别,从而在斯洛文尼亚媒体数据集上验证了该方法在实际场景中的有效性与合理性。

链接: https://arxiv.org/abs/2508.08272
作者: Tadej Škvorc,Nikola Ivačič,Sebastjan Hribar,Marko Robnik-Šikonja
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to a specific story that the article is covering. The process is similar to text clustering and topic modeling, but requires that articles be grouped based on particular events, places, and people, rather than general text similarity (as in clustering) or general (predefined) topics (as in topic modeling). We present an approach to story identification that is capable of functioning in real time, assigning articles to stories as they are published online. In the proposed approach, we combine text representation techniques, clustering algorithms, and online topic modeling methods. We combine various text representation methods to extract specific events and named entities necessary for story identification, showing that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery. We evaluate our approach on a news dataset from Slovene media covering a period of 1 month. We show that our real-time approach produces sensible results as judged by human evaluators.
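The real-time assignment step can be caricatured as a streaming nearest-centroid rule: compare each incoming article to existing story centroids and open a new story when nothing is similar enough. The sketch below uses plain bag-of-words cosine similarity and an invented threshold; the paper's actual pipeline combines richer representations with online topic models such as BERTopic, DBStream, and TextClust.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_story(article, stories, threshold=0.3):
    """Assign an article to the most similar story centroid, or open a new one."""
    vec = Counter(article.lower().split())
    best_id, best_sim = None, 0.0
    for sid, centroid in stories.items():
        sim = cosine(vec, centroid)
        if sim > best_sim:
            best_id, best_sim = sid, sim
    if best_id is not None and best_sim >= threshold:
        stories[best_id].update(vec)  # fold the article into the centroid
        return best_id
    new_id = len(stories)
    stories[new_id] = vec
    return new_id

stories = {}
print(assign_story("earthquake hits coastal town", stories))           # 0 (new story)
print(assign_story("coastal town earthquake damage report", stories))  # 0 (joins it)
print(assign_story("parliament passes budget bill", stories))          # 1 (new story)
```

In the real system the event- and entity-aware representations mentioned above would replace the raw word counts, so that articles cluster by specific events rather than generic vocabulary overlap.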
zh

[NLP-72] Heartificial Intelligence: Exploring Empathy in Language Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在模拟人类共情能力方面的局限性问题,特别是其在认知共情(cognitive empathy)与情感共情(affective empathy)之间的差异。研究通过标准化心理测试评估多个小型(Small Language Models, SLMs)和大型语言模型的共情表现,发现LLMs在认知共情任务中显著优于人类(包括心理学学生),但其情感共情水平则明显低于人类参与者。解决方案的关键在于揭示了LLMs在模拟认知共情方面的强大能力,同时指出其情感共情的不足,从而为未来开发更具人性化的虚拟陪伴系统提供方向:利用LLMs的认知共情优势实现客观、稳定且无情绪疲劳的情感支持,弥补传统人工共情服务的局限性。

链接: https://arxiv.org/abs/2508.08271
作者: Victoria Williams,Benjamin Rosman
机构: 未知
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 21 pages, 5 tables

点击查看摘要

Abstract:Large language models have become increasingly common, used by millions of people worldwide in both professional and personal contexts. As these models continue to advance, they are frequently serving as virtual assistants and companions. In human interactions, effective communication typically involves two types of empathy: cognitive empathy (understanding others’ thoughts and emotions) and affective empathy (emotionally sharing others’ feelings). In this study, we investigated both cognitive and affective empathy across several small (SLMs) and large (LLMs) language models using standardized psychological tests. Our results revealed that LLMs consistently outperformed humans - including psychology students - on cognitive empathy tasks. However, despite their cognitive strengths, both small and large language models showed significantly lower affective empathy compared to human participants. These findings highlight rapid advancements in language models’ ability to simulate cognitive empathy, suggesting strong potential for providing effective virtual companionship and personalized emotional support. Additionally, their high cognitive yet lower affective empathy allows objective and consistent emotional support without running the risk of emotional fatigue or bias.
zh

[NLP-73] Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI

【速读】: 该论文旨在解决当前医学多模态大模型(Medical Multimodal Large Models, MMLMs)在理解复杂医学概念时受限于有限医疗训练数据,以及基于LLaVA架构的医学多模态模型难以有效捕捉文本与图像之间细粒度关联的问题。其解决方案的关键在于提出Doctor Sun——一个专为医学领域设计的大型多模态生成模型,通过整合预训练视觉编码器与医学专用大语言模型(Medical Large Language Model, Medical LLM),并在多种医学数据集上采用两阶段训练策略:第一阶段聚焦特征对齐以增强跨模态语义一致性,第二阶段进行指令微调以提升任务导向的推理能力。此外,研究团队发布了SunMed-VL这一涵盖广泛场景的双语医学多模态数据集及相关代码与模型资源,以推动生物医学多模态研究的发展。

链接: https://arxiv.org/abs/2508.08270
作者: Dong Xue,Ziyao Shao,Zhaoyang Duan,Fangzhou Liu,Bing Li,Zhongheng Zhang
机构: East China University of Science and Technology (华东理工大学); Harbin Institute of Technology (哈尔滨工业大学); Zhejiang University School of Medicine (浙江大学医学院); Shaoxing University (绍兴大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, the existing multimodal biomedical AI is typically based on foundation LLMs, thus hindering the understanding of intricate medical concepts with limited medical training data. Moreover, recent LLaVA-induced medical LMMs struggle to effectively capture the intricate relationship between the texts and the images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-range bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research.
zh

[NLP-74] Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants

【速读】: 该论文旨在解决历史土地专利文本(如17–18世纪弗吉尼亚州的metes-and-bounds描述)难以进行空间分析的问题,即如何将非结构化的叙事性地籍描述自动转换为地理坐标(latitude/longitude)。其核心解决方案是利用当前一代大语言模型(Large Language Models, LLMs)对这类历史文本进行地理编码,通过两种范式——直接输出坐标与调用外部地理编码API的链式推理(tool-augmented chain-of-thought),实现高精度、低成本的批量地理标注。研究发现,采用多轮调用集成策略(five-call ensemble)可显著降低误差至19 km(中位数12 km),且成本可控(约0.20美元/宗),优于传统GIS分析师基准和外部命名实体识别(NER)工具,表明LLMs在历史地理信息提取任务中具备可扩展性和实用性。

链接: https://arxiv.org/abs/2508.08266
作者: Ryan Mioduski
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Virginia’s seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, limiting spatial analysis. This study systematically evaluates current-generation large language models (LLMs) in converting these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695-1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate and tool-augmented chain-of-thought invoking external geocoding APIs. Results were compared with a GIS-analyst baseline, the Stanford NER geoparser, Mordecai-3, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced errors to 19 km (median 12 km) at minimal additional cost (approx. USD 0.20 per grant), outperforming the median LLM by 48.6%. A patentee-name-redaction ablation increased error by about 9%, indicating reliance on textual landmark and adjacency descriptions rather than memorization. The cost-efficient gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark; external geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate the potential of LLMs for scalable, accurate, and cost-effective historical georeferencing. 
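The kilometre errors reported above are great-circle distances between predicted and reference coordinates, which are commonly computed with the haversine formula (the example coordinates below are invented, not from the dataset):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical predicted vs. surveyed location of a grant near
# Williamsburg, VA (coordinates are illustrative only)
pred = (37.30, -76.75)
true = (37.27, -76.71)
print(round(haversine_km(*pred, *true), 1))  # a few km apart
```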
zh

[NLP-75] TurQUaz at CheckThat! 2025: Debating Large Language Models for Scientific Web Discourse Detection

【速读】: 该论文旨在解决科学网络话语检测任务(scientific web discourse detection),即识别推文中是否包含科学主张(scientific claim)、对科学研究的引用(reference to a scientific study)或科学实体提及(mention of scientific entities)。其解决方案的关键在于提出了一种新颖的“理事会辩论”方法(council debate method),该方法通过多个大型语言模型(LLMs)在主席模型的引导下进行结构化协同讨论,以达成共识,从而提升对科学文献引用的检测准确性。实验表明,该方法在参考科学文献的识别上表现最优(排名第一),尽管在其他两类任务中排名较低。

链接: https://arxiv.org/abs/2508.08265
作者: Tarık Saraç,Selin Mergen,Mucahid Kutlu
机构: TOBB University of Economics and Technology (TOBB经济与技术大学); Qatar University (卡塔尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we present our work developed for the scientific web discourse detection task (Task 4a) of CheckThat! 2025. We propose a novel council debate method that simulates structured academic discussions among multiple large language models (LLMs) to identify whether a given tweet contains (i) a scientific claim, (ii) a reference to a scientific study, or (iii) mentions of scientific entities. We explore three debating methods: i) single debate, where two LLMs argue for opposing positions while a third acts as a judge; ii) team debate, in which multiple models collaborate within each side of the debate; and iii) council debate, where multiple expert models deliberate together to reach a consensus, moderated by a chairperson model. We choose council debate as our primary model as it outperforms others in the development test set. Although our proposed method did not rank highly for identifying scientific claims (8th out of 10) or mentions of scientific entities (9th out of 10), it ranked first in detecting references to scientific studies.
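At its simplest, aggregating the council members' verdicts reduces to majority voting, though the paper's council debate adds chair-moderated deliberation before any decision is reached. A toy stand-in for the aggregation step:

```python
from collections import Counter

def council_vote(verdicts):
    """Return the majority label among council members' verdicts.

    A simplified stand-in: the paper's method has expert LLMs deliberate
    under a chairperson model rather than simply casting votes.
    """
    counts = Counter(verdicts)
    label, _ = counts.most_common(1)[0]
    return label

# Three mock 'expert model' judgments on one tweet (labels are illustrative)
print(council_vote(["reference", "reference", "no_reference"]))  # reference
```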
zh

[NLP-76] Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models

【速读】: 该论文旨在解决金融沟通中论点质量评估(argument quality assessment)缺乏可靠自动化方法的问题,尤其关注大语言模型(Large Language Models, LLMs)在该任务中的表现与公平性。其解决方案的关键在于:首先,利用三个前沿LLM(GPT-4o、Llama 3.1和Gemma 2)对FinArgQuality数据集进行论点质量标注,并通过多轮运行评估其标注一致性;其次,设计对抗攻击以注入性别偏见,系统性分析模型响应的鲁棒性与公平性。实验结果表明,LLM标注的标注者间一致性优于人类标注,但模型仍存在不同程度的性别偏见,研究据此提出多维度分析与改进策略,为未来开发更可靠、低成本且具备偏见感知能力的自动标注方法提供实践指导。

链接: https://arxiv.org/abs/2508.08262
作者: Alaa Alhamzeh,Mays Al Rebdawi
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages, 4 figures, Passau uni, Master thesis in NLP

点击查看摘要

Abstract:Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs, GPT-4o, Llama 3.1, and Gemma 2, in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, to analyse the models’ responses and assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
zh

[NLP-77] EndoAgent: A Memory-Guided Reflective Agent for Intelligent Endoscopic Vision-to-Decision Reasoning

【速读】: 该论文旨在解决当前通用人工智能(AI)系统在内镜图像诊断中面临的两大挑战:一是现有基于大规模预训练的方法缺乏跨任务的统一协调机制,难以应对复杂临床流程中的多步骤决策;二是尽管AI代理(AI agent)在指令解析与工具集成方面展现出灵活性,但在内镜场景下的潜力尚未被充分挖掘。解决方案的关键在于提出EndoAgent,这是一个首个基于记忆引导的视觉到决策内镜分析代理,其核心创新是采用双记忆架构——短期记忆用于跟踪动作以确保逻辑连贯性,长期记忆则通过经验学习逐步提升推理精度,并结合自适应工具选择与协作机制,在统一推理循环中整合专家设计的多种工具,从而实现对复杂内镜任务的高效、灵活且准确的决策支持。

链接: https://arxiv.org/abs/2508.07292
作者: Yi Tang,Kaini Wang,Yang Chen,Guangquan Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Developing general artificial intelligence (AI) systems to support endoscopic image diagnosis is an emerging research priority. Existing methods based on large-scale pretraining often lack unified coordination across tasks and struggle to handle the multi-step processes required in complex clinical workflows. While AI agents have shown promise in flexible instruction parsing and tool integration across domains, their potential in endoscopy remains underexplored. To address this gap, we propose EndoAgent, the first memory-guided agent for vision-to-decision endoscopic analysis that integrates iterative reasoning with adaptive tool selection and collaboration. Built on a dual-memory design, it enables sophisticated decision-making by ensuring logical coherence through short-term action tracking and progressively enhancing reasoning acuity through long-term experiential learning. To support diverse clinical tasks, EndoAgent integrates a suite of expert-designed tools within a unified reasoning loop. We further introduce EndoAgentBench, a benchmark of 5,709 visual question-answer pairs that assess visual understanding and language generation capabilities in realistic scenarios. Extensive experiments show that EndoAgent consistently outperforms both general and medical multimodal models, exhibiting its strong flexibility and reasoning capabilities.
zh

[NLP-78] MultiAiTutor: Child-Friendly Educational Multilingual Speech Generation Tutor with LLMs

【速读】: 该论文旨在解决低资源语言环境下儿童语言学习中高质量、适龄且文化相关语音生成困难的问题,特别是在新加坡多语种教育场景下,如何通过生成式AI实现针对儿童的个性化教学互动。解决方案的关键在于提出MultiAiTutor——一个基于大语言模型(Large Language Model, LLM)架构的多语言教育生成式AI助教系统,其核心创新在于融合年龄适宜的多语言语音生成能力,并通过文化相关的图像描述任务,在新加坡本地化方言(如新加坡口音普通话、马来语和泰米尔语)中验证了其有效性,从而显著提升儿童语言习得的沉浸感与实用性。

链接: https://arxiv.org/abs/2508.08715
作者: Xiaoxue Gao,Huayun Zhang,Nancy F. Chen
机构: Institute for Infocomm Research, Agency for Science, Technology, and Research (A*STAR), Singapore
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: 5 figures

点击查看摘要

Abstract:Generative speech models have demonstrated significant potential in personalizing teacher-student interactions, offering valuable real-world applications for language learning in children’s education. However, achieving high-quality, child-friendly speech generation remains challenging, particularly for low-resource languages across diverse languages and cultural contexts. In this paper, we propose MultiAiTutor, an educational multilingual generative AI tutor with child-friendly designs, leveraging LLM architecture for speech generation tailored for educational purposes. We propose to integrate age-appropriate multilingual speech generation using LLM architectures, facilitating young children’s language learning through culturally relevant image-description tasks in three low-resource languages: Singaporean-accent Mandarin, Malay, and Tamil. Experimental results from both objective metrics and subjective evaluations demonstrate the superior performance of the proposed MultiAiTutor compared to baseline methods.
zh

计算机视觉

[CV-0] HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis ICCV2025

【速读】:该论文旨在解决数字人表征中同时进行光照重渲染(relighting)与新视角渲染(novel-view rendering)的难题,这一任务在虚拟现实、影视制作和数字孪生等领域具有重要应用价值。由于缺乏高质量、大规模的全身体数据集,相关研究长期受限。为此,作者提出了HumanOLAT数据集,其关键创新在于首次公开了一个大规模多视角“一次一光”(One-Light-at-a-Time, OLAT)采集的全身体人像数据集,包含多种光照条件下的HDR RGB帧(如白光、环境贴图、颜色渐变及细粒度OLAT光照)。该数据集为模型训练与评估提供了可靠基准,显著推动了人类特异性光照建模与渲染技术的发展。

链接: https://arxiv.org/abs/2508.09137
作者: Timo Teufel,Pulkit Gera,Xilong Zhou,Umar Iqbal,Pramod Rao,Jan Kautz,Vladislav Golyanik,Christian Theobalt
机构: Max Planck Institute for Informatics (马普所信息学研究所); NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: TT and PG contributed equally; accepted at ICCV 2025; project page: this https URL

点击查看摘要

Abstract:Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. Progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset of multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illuminations, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations of state-of-the-art relighting and novel-view synthesis methods underscore both the dataset’s value and the significant challenges still present in modeling complex human-centric appearance and lighting interactions. We believe HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.
zh

[CV-1] Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices

【速读】:该论文旨在解决大规模视频生成模型中变分自编码器(Variational AutoEncoder, VAE)在移动端部署时面临的两大核心问题:一是VAE参数量庞大且卷积核不匹配导致内存溢出或推理速度极慢;二是主流VAE中上采样技术与移动硬件特性不兼容,成为性能瓶颈。解决方案的关键在于三个方面:首先,通过分析现有VAE架构的冗余性并引入3D深度可分离卷积,显著降低参数规模;其次,提出解耦式3D像素洗牌(decoupled 3D pixel shuffle)策略优化上采样过程,大幅减少端到端延迟;最后,设计一种高效的VAE解码器蒸馏训练方法,仅对解码器进行知识迁移而非全模型重训练,实现快速适配移动端的同时保持高重建质量。基于此,作者开发了通用型移动端VAE解码器Turbo-VAED,首次实现在移动端实时解码720p视频VAE,相较现有方案在iPhone 16 Pro上提升2.9倍FPS并保持更优重建质量。

链接: https://arxiv.org/abs/2508.09136
作者: Ya Zou,Jingfeng Yao,Siyuan Yu,Shuai Zhang,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:There is a growing demand for deploying large generative AI models on mobile devices. For recent popular video generative models, however, the Variational AutoEncoder (VAE) represents one of the major computational bottlenecks. Both large parameter sizes and mismatched kernels cause out-of-memory errors or extremely slow inference on mobile devices. To address this, we propose a low-cost solution that efficiently transfers widely used video VAEs to mobile devices. (1) We analyze redundancy in existing VAE architectures and get empirical design insights. By integrating 3D depthwise separable convolutions into our model, we significantly reduce the number of parameters. (2) We observe that the upsampling techniques in mainstream video VAEs are poorly suited to mobile hardware and form the main bottleneck. In response, we propose a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3) We propose an efficient VAE decoder training method. Since only the decoder is used during deployment, we distill it to Turbo-VAED instead of retraining the full VAE, enabling fast mobile adaptation with minimal performance loss. To our knowledge, our method enables real-time 720p video VAE decoding on mobile devices for the first time. This approach is widely applicable to most video VAEs. When integrated into four representative models, with training cost as low as 95, it accelerates original VAEs by up to 84.5x at 720p resolution on GPUs, uses as low as 17.5% of original parameter count, and retains 96.9% of the original reconstruction quality. Compared to mobile-optimized VAEs, Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on the iPhone 16 Pro. The code and models will soon be available at this https URL.
zh
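上文 Turbo-VAED 摘要提到的“解耦式 3D 像素洗牌”(decoupled 3D pixel shuffle)未给出具体公式。下面是一个 NumPy 示意:按常见的 pixel shuffle 约定,把一次性的 3D 重排拆成“先时间维、后逐帧空间维”两步,输出形状与耦合式 3D pixel shuffle 相同;拆分顺序与函数名均为假设,仅供理解。

```python
import numpy as np

def pixel_shuffle_2d(x, r):
    # (C*r^2, H, W) -> (C, H*r, W*r), standard 2D pixel shuffle convention
    c, h, w = x.shape
    oc = c // (r * r)
    x = x.reshape(oc, r, r, h, w).transpose(0, 3, 1, 4, 2)
    return x.reshape(oc, h * r, w * r)

def temporal_shuffle(x, rt):
    # (C*rt, T, H, W) -> (C, T*rt, H, W): fold channels into the time axis
    c, t, h, w = x.shape
    oc = c // rt
    x = x.reshape(oc, rt, t, h, w).transpose(0, 2, 1, 3, 4)
    return x.reshape(oc, t * rt, h, w)

def decoupled_shuffle_3d(x, rt, r):
    # hypothetical decoupling: temporal step first, then per-frame 2D shuffle
    x = temporal_shuffle(x, rt)
    frames = [pixel_shuffle_2d(x[:, t], r) for t in range(x.shape[1])]
    return np.stack(frames, axis=1)
```

这样每一步都只做廉价的 reshape/transpose,避免了移动端硬件对整体 3D 重排核支持不佳的问题。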

[CV-2] Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

【速读】:该论文旨在解决图像与视频中基于文本引导的色彩编辑问题,该问题要求对反照率(albedo)、光源颜色和环境光照等颜色属性进行细粒度操控,同时保持几何结构、材质特性及光-物质相互作用的物理一致性。现有无训练(training-free)方法虽适用性广,但在精确控制颜色和避免视觉不一致方面表现不佳。其解决方案的关键在于提出ColorCtrl方法,该方法利用多模态扩散Transformer(Multi-Modal Diffusion Transformers, MM-DiT)中的注意力机制,通过针对性地操作注意力图(attention maps)和值令牌(value tokens),实现结构与颜色的解耦,从而在仅修改提示指定区域的前提下完成精准且一致的色彩编辑,并支持词级属性强度控制,显著提升编辑质量与时空一致性。

链接: https://arxiv.org/abs/2508.09131
作者: Zixin Yin,Xili Dai,Ling-Hao Chen,Deyu Zhou,Jianan Wang,Duomin Wang,Gang Yu,Lionel M. Ni,Heung-Yeung Shum
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Tsinghua University; International Digital Economy Academy; StepFun; Astribot
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free methods offer broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method also generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
zh

[CV-3] OpenCUA: Open Foundations for Computer-Use Agents

【速读】:该论文旨在解决当前视觉-语言模型作为计算机使用代理(Computer-Use Agents, CUAs)在商业化进程中缺乏开放框架的问题,从而阻碍了研究社区对这类系统的能力、局限性和风险进行深入分析。其解决方案的关键在于提出OpenCUA——一个完整的开源框架,包含三个核心组件:(1) 用于无缝捕获人类计算机操作演示的标注基础设施;(2) AgentNet,首个涵盖3个操作系统和200+应用及网站的大规模计算机任务数据集;(3) 可扩展的数据处理流水线,将演示转化为带反思式长链思维(Chain-of-Thought)推理的状态-动作对,从而在数据规模增长时仍保持性能提升。该框架推动了开源CUA模型的性能边界,如OpenCUA-32B在OSWorld-Verified基准上达到34.8%的平均成功率,超越GPT-4o,验证了其泛化能力和对测试时计算资源的显著收益。

链接: https://arxiv.org/abs/2508.09123
作者: Xinyuan Wang,Bowen Wang,Dunjie Lu,Junlin Yang,Tianbao Xie,Junli Wang,Jiaqi Deng,Xiaole Guo,Yiheng Xu,Chen Henry Wu,Zhennan Shen,Zhuokai Li,Ryan Li,Xiaochuan Li,Junda Chen,Boyuan Zheng,Peihang Li,Fangyu Lei,Ruisheng Cao,Yeqiao Fu,Dongchan Shin,Martin Shin,Jiarui Hu,Yuyan Wang,Jixuan Chen,Yuxiao Ye,Danyang Zhang,Dikang Du,Hao Hu,Huarong Chen,Zaida Zhou,Yipu Wang,Heng Wang,Diyi Yang,Victor Zhong,Flood Sung,Y.Charles,Zhilin Yang,Tao Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models have demonstrated impressive capabilities as computer-use agents (CUAs) capable of automating diverse computer tasks. As their commercial potential grows, critical details of the most capable CUA systems remain closed. As these agents will increasingly mediate digital interactions and execute consequential decisions on our behalf, the research community needs access to open CUA frameworks to study their capabilities, limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive open-source framework for scaling CUA data and foundation models. Our framework consists of: (1) an annotation infrastructure that seamlessly captures human computer-use demonstrations; (2) AgentNet, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites; (3) a scalable pipeline that transforms demonstrations into state-action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales. Our end-to-end agent models demonstrate strong performance across CUA benchmarks. In particular, OpenCUA-32B achieves an average success rate of 34.8% on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o). Further analysis confirms that our approach generalizes well across domains and benefits significantly from increased test-time computation. We release our annotation tool, datasets, code, and models to build open foundations for further CUA research.
zh

[CV-4] Deep Learning Models for Robust Facial Liveness Detection

【速读】:该论文旨在解决生物特征认证系统(尤其是人脸识别)在面对深度伪造(deepfake)等高级欺骗攻击时可靠性不足的问题。现有活体检测(liveness detection)方法难以有效识别由人工智能驱动的伪造样本,导致安全漏洞。解决方案的关键在于提出了一种基于深度学习的新模型,创新性地融合了纹理分析(texture analysis)与真实人类生物特征相关的反射特性(reflective properties),从而在多种攻击场景和环境条件下实现高精度的真伪区分。实验表明,最佳模型AttackNet V2.2在多数据集联合训练下平均准确率达99.9%,显著优于现有技术,并揭示了欺骗行为的演化模式,为提升生物特征系统的安全性提供了重要依据。

链接: https://arxiv.org/abs/2508.09094
作者: Oleksandr Kuznetsov,Emanuele Frontoni,Luca Romeo,Riccardo Rosati,Andrea Maranesi,Alessandro Muscatello
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the rapidly evolving landscape of digital security, biometric authentication systems, particularly facial recognition, have emerged as integral components of various security protocols. However, the reliability of these systems is compromised by sophisticated spoofing attacks, where imposters gain unauthorized access by falsifying biometric traits. Current literature reveals a concerning gap: existing liveness detection methodologies - designed to counteract these breaches - fall short against advanced spoofing tactics employing deepfakes and other artificial intelligence-driven manipulations. This study introduces a robust solution through novel deep learning models addressing the deficiencies in contemporary anti-spoofing techniques. By innovatively integrating texture analysis and reflective properties associated with genuine human traits, our models distinguish authentic presence from replicas with remarkable precision. Extensive evaluations were conducted across five diverse datasets, encompassing a wide range of attack vectors and environmental conditions. Results demonstrate substantial advancement over existing systems, with our best model (AttackNet V2.2) achieving 99.9% average accuracy when trained on combined data. Moreover, our research unveils critical insights into the behavioral patterns of impostor attacks, contributing to a more nuanced understanding of their evolving nature. The implications are profound: our models do not merely fortify the authentication processes but also instill confidence in biometric systems across various sectors reliant on secure access.
zh

[CV-5] Addressing Bias in VLMs for Glaucoma Detection Without Protected Attribute Supervision MICCAI-2025

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在医学影像分析中可能存在的群体偏差问题,特别是在无显式受保护属性标签的情况下,仍可能导致对不同人口统计学子群体的不公平表现。其核心解决方案是提出一种无需标签的属性无关去偏方法,关键在于:(i) 通过图像嵌入的无监督聚类推断代理子群体;(ii) 计算CLIP风格的跨模态对比损失与SimCLR风格的图像对对比损失之间的梯度相似性权重;(iii) 在联合top-k加权目标中利用这些权重提升表现较差子群体的训练权重,从而自适应地聚焦于最难样本并降低子群体间的性能差异。

链接: https://arxiv.org/abs/2508.09087
作者: Ahsan Habib Akash,Greg Murray,Annahita Amireskandari,Joel Palko,Carol Laxson,Binod Bhattarai,Prashnna Gyawali
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3rd Workshop in Data Engineering in Medical Imaging (DEMI), MICCAI-2025 Workshop

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved remarkable success on multimodal tasks such as image-text retrieval and zero-shot classification, yet they can exhibit demographic biases even when explicit protected attributes are absent during training. In this work, we focus on automated glaucoma screening from retinal fundus images, a critical application given that glaucoma is a leading cause of irreversible blindness and disproportionately affects underserved populations. Building on a reweighting-based contrastive learning framework, we introduce an attribute-agnostic debiasing method that (i) infers proxy subgroups via unsupervised clustering of image-image embeddings, (ii) computes gradient-similarity weights between the CLIP-style multimodal loss and a SimCLR-style image-pair contrastive loss, and (iii) applies these weights in a joint, top- k weighted objective to upweight underperforming clusters. This label-free approach adaptively targets the hardest examples, thereby reducing subgroup disparities. We evaluate our method on the Harvard FairVLMed glaucoma subset, reporting Equalized Odds Distance (EOD), Equalized Subgroup AUC (ES AUC), and Groupwise AUC to demonstrate equitable performance across inferred demographic subgroups.
zh
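摘要中第 (ii) 步的“梯度相似性权重”具体定义未公开。下面给出一种合理做法的 NumPy 草图:对每个样本,取两种损失(CLIP 式多模态损失与 SimCLR 式图像对比损失)梯度的余弦相似度作为权重,再按加权损失做 top-k 聚合;权重形式与 top-k 规则均为假设。

```python
import numpy as np

def grad_similarity_weights(g_multi, g_sim, eps=1e-8):
    # per-sample cosine similarity between the gradients of the two losses
    num = (g_multi * g_sim).sum(axis=1)
    den = np.linalg.norm(g_multi, axis=1) * np.linalg.norm(g_sim, axis=1) + eps
    return num / den

def topk_weighted_objective(per_sample_loss, weights, k):
    # upweight the k hardest samples by weighted loss (adaptive focus)
    scored = weights * per_sample_loss
    return float(np.sort(scored)[-k:].mean())
```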

[CV-6] Scaling Learned Image Compression Models up to 1 Billion

【速读】:该论文旨在解决当前学习型图像压缩模型规模受限、表示能力不足的问题,以及缺乏对模型规模扩展如何影响压缩性能的系统性理解。解决方案的关键在于首次开展大规模学习型图像压缩模型的缩放研究,通过将HPCM模型参数从6850万扩展至10亿,并拟合测试损失与模型规模及最优训练计算量之间的幂律关系,揭示了压缩性能的缩放规律,从而实现对更大规模模型性能的外推预测。实验表明,扩展后的HPCM-1B模型在率失真(rate-distortion)性能上达到当前最优水平。

链接: https://arxiv.org/abs/2508.09075
作者: Yuqi Li,Haotian Zhang,Li Li,Dong Liu,Feng Wu
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, technical report

点击查看摘要

Abstract:Recent advances in large language models (LLMs) highlight a strong connection between intelligence and compression. Learned image compression, a fundamental task in modern data compression, has made significant progress in recent years. However, current models remain limited in scale, restricting their representation capacity, and how scaling model size influences compression performance remains unexplored. In this work, we present a pioneering study on scaling up learned image compression models and revealing the performance trends through scaling laws. Using the recent state-of-the-art HPCM model as baseline, we scale model parameters from 68.5 millions to 1 billion and fit power-law relations between test loss and key scaling variables, including model size and optimal training compute. The results reveal a scaling trend, enabling extrapolation to larger scale models. Experimental results demonstrate that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion performance. We hope this work inspires future exploration of large-scale compression models and deeper investigations into the connection between compression and intelligence.
zh
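摘要所述“拟合测试损失与模型规模的幂律关系”,标准做法是在对数-对数空间做线性回归。以下 NumPy 示意中的系数与数据点均为演示用合成值,并非论文实测结果:

```python
import numpy as np

def fit_power_law(n_params, test_loss):
    # fit L(N) = a * N^(-b) by linear regression in log-log space
    slope, intercept = np.polyfit(np.log(n_params), np.log(test_loss), 1)
    return np.exp(intercept), -slope  # (a, b)

# synthetic demo: recover (a, b) from four model sizes, then extrapolate to 1B
N = np.array([68.5e6, 1e8, 3e8, 6e8])
L = 5.0 * N ** -0.3                   # noiseless power law, for illustration only
a, b = fit_power_law(N, L)
loss_at_1b = a * 1e9 ** -b            # extrapolated test loss at 1B parameters
```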

[CV-7] VertexRegen: Mesh Generation with Continuous Level of Detail ICCV2025

【速读】:该论文旨在解决现有自回归(autoregressive)网格生成方法在生成过程中无法提供连续细节层次(level of detail, LoD)控制的问题,这类方法通常以从局部到完整的顺序生成网格,导致中间步骤产生的结构不完整且不可用。解决方案的关键在于提出VertexRegen框架,该框架受渐进式网格(progressive meshes)启发,将网格生成过程重新建模为边坍缩(edge collapse)的逆操作——即顶点分裂(vertex split),并通过生成模型学习这一逆过程,从而实现任意时刻中断生成仍能输出有效网格的能力,同时支持连续的细节层次控制。

链接: https://arxiv.org/abs/2508.09062
作者: Xiang Zhang,Yawar Siddiqui,Armen Avetisyan,Chris Xie,Jakob Engel,Henry Howard-Jenkins
机构: UC San Diego; Meta Reality Labs Research
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICCV 2025. Project Page: this https URL

点击查看摘要

Abstract:We introduce VertexRegen, a novel mesh generation framework that enables generation at a continuous level of detail. Existing autoregressive methods generate meshes in a partial-to-complete manner and thus intermediate steps of generation represent incomplete structures. VertexRegen takes inspiration from progressive meshes and reformulates the process as the reversal of edge collapse, i.e. vertex split, learned through a generative model. Experimental results demonstrate that VertexRegen produces meshes of comparable quality to state-of-the-art methods while uniquely offering anytime generation with the flexibility to halt at any step to yield valid meshes with varying levels of detail.
zh
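VertexRegen 把生成建模为边坍缩(edge collapse)的逆操作,即顶点分裂(vertex split)。下面用纯 Python 在四面体上演示这对互逆操作;为简明起见,record 直接保存完整面表,真实渐进式网格的记录要紧凑得多,此处仅为示意:

```python
def edge_collapse(verts, faces, u, v):
    """Collapse edge (u, v): merge v into u at the edge midpoint.

    Returns (new_verts, new_faces, record); `record` holds enough state
    to undo the collapse via vertex_split (the generative step the
    paper's model learns to emit).
    """
    record = {"u": u, "v": v, "pos_u": verts[u], "pos_v": verts[v],
              "faces": list(faces)}  # sketch only: real records are far smaller
    new_verts = {k: p for k, p in verts.items() if k != v}
    new_verts[u] = tuple((a + b) / 2.0 for a, b in zip(verts[u], verts[v]))
    new_faces = []
    for f in faces:
        g = tuple(u if i == v else i for i in f)
        if len(set(g)) == 3:          # drop faces degenerated by the merge
            new_faces.append(g)
    return new_verts, new_faces, record

def vertex_split(verts, faces, record):
    # exact inverse: reintroduce v and restore the pre-collapse mesh
    new_verts = dict(verts)
    new_verts[record["u"]] = record["pos_u"]
    new_verts[record["v"]] = record["pos_v"]
    return new_verts, record["faces"]
```

在任意时刻停止一串 vertex_split,剩下的都是一个合法网格,这正对应摘要所说的“任意时刻中断仍能输出有效网格”。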

[CV-8] VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception

【速读】:该论文旨在解决复杂交通环境中开放集感知(open-set perception)问题,即自动驾驶系统对未见过物体类别的识别能力不足,这对安全性构成重大挑战。现有方法通常将视觉语言模型(Visual Language Models, VLMs)用于提取视觉特征并耦合传统目标检测器,导致多阶段误差传播,限制感知精度。其解决方案的关键在于提出首个端到端框架VLM-3D,通过引入低秩适配(Low-Rank Adaptation, LoRA)实现VLMs在驾驶任务中的高效微调,并设计联合语义-几何损失机制:早期采用token级语义损失保证训练稳定收敛,后期引入3D IoU损失优化3D边界框预测精度,从而显著提升感知性能,在nuScenes数据集上实现12.8%的准确率提升。

链接: https://arxiv.org/abs/2508.09061
作者: Fuhao Chang,Shuxin Li,Yabei Li,Lei He
机构: Tsinghua University (清华大学); China Agricultural University (中国农业大学); Southwestern University of Finance and Economics (西南财经大学); Meituan (美团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-set perception in complex traffic environments poses a critical challenge for autonomous driving systems, particularly in identifying previously unseen object categories, which is vital for ensuring safety. Visual Language Models (VLMs), with their rich world knowledge and strong semantic reasoning capabilities, offer new possibilities for addressing this task. However, existing approaches typically leverage VLMs to extract visual features and couple them with traditional object detectors, resulting in multi-stage error propagation that hinders perception accuracy. To overcome this limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs to perform 3D geometric perception in autonomous driving scenarios. VLM-3D incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving tasks with minimal computational overhead, and introduces a joint semantic-geometric loss design: token-level semantic loss is applied during early training to ensure stable convergence, while 3D IoU loss is introduced in later stages to refine the accuracy of 3D bounding box predictions. Evaluations on the nuScenes dataset demonstrate that the proposed joint semantic-geometric loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully validating the effectiveness and advancement of our method.
zh
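摘要中后期引入的 3D IoU 损失作用于 3D 边界框预测。驾驶场景的框通常带偏航角、需要旋转 IoU;这里仅给出轴对齐简化版的 IoU 及对应损失(1 - IoU)作为示意:

```python
def iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for i in range(3):
        lo, hi = max(a[i], b[i]), min(a[i + 3], b[i + 3])
        if hi <= lo:                 # no overlap along this axis
            return 0.0
        inter *= hi - lo
    def vol(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])
    return inter / (vol(a) + vol(b) - inter)

def iou_loss_3d(pred, target):
    # the later-stage geometric loss: 0 at perfect overlap, 1 when disjoint
    return 1.0 - iou_3d(pred, target)
```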

[CV-9] ALFred: An Active Learning Framework for Real-world Semi-supervised Anomaly Detection with Adaptive Thresholds

【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)在真实场景中因人类行为动态性、环境变化及领域偏移而导致的性能下降问题。传统评估指标因依赖静态假设而难以在动态环境中准确区分正常与异常行为,且缺乏自适应阈值机制。解决方案的关键在于提出一种面向VAD的主动学习框架,通过持续选择最具信息量的数据进行人工标注,并引入“人在回路”(human-in-the-loop)机制,从AI生成的伪标签中识别真实正常的和异常的样本,从而构建可随环境变化自适应调整的异常判定阈值,显著提升模型在复杂动态场景下的鲁棒性和实用性。

链接: https://arxiv.org/abs/2508.09058
作者: Shanle Yao,Ghazal Alinezhad Noghre,Armin Danesh Pazho,Hamed Tabkhi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video Anomaly Detection (VAD) can play a key role in spotting unusual activities in video footage. VAD is difficult to use in real-world settings due to the dynamic nature of human actions, environmental variations, and domain shifts. Traditional evaluation metrics often prove inadequate for such scenarios, as they rely on static assumptions and fall short of identifying a threshold that distinguishes normal from anomalous behavior in dynamic settings. To address this, we introduce an active learning framework tailored for VAD, designed for adapting to the ever-changing real-world conditions. Our approach leverages active learning to continuously select the most informative data points for labeling, thereby enhancing model adaptability. A critical innovation is the incorporation of a human-in-the-loop mechanism, which enables the identification of actual normal and anomalous instances from pseudo-labeling results generated by AI. This collected data allows the framework to define an adaptive threshold tailored to different environments, ensuring that the system remains effective as the definition of ‘normal’ shifts across various settings. Implemented within a lab-based framework that simulates real-world conditions, our approach allows rigorous testing and refinement of VAD algorithms with a new metric. Experimental results show that our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world simulated scenarios, demonstrating its practical effectiveness and significantly enhancing the applicability of VAD in dynamic environments.
zh
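摘要提出利用人在回路确认的正常/异常样本,为不同环境自适应设定阈值,但未给出具体规则。一种常见思路的示意如下:把阈值取为人工确认“正常”分数的高分位数,使期望误报率跟随目标值;其中 fa_target 为假设参数,与文中 EBI 指标无直接对应:

```python
import numpy as np

def adaptive_threshold(confirmed_normal, confirmed_anomalous, fa_target=0.05):
    # threshold = (1 - fa_target) quantile of human-confirmed normal scores,
    # so the expected false-alarm rate tracks fa_target as "normal" drifts
    thr = float(np.quantile(confirmed_normal, 1.0 - fa_target))
    recall = float(np.mean(np.asarray(confirmed_anomalous) >= thr))
    return thr, recall
```

每当主动学习选出新一批样本并由人工确认后,重新调用该函数即可让阈值随环境漂移而更新。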

[CV-10] Per-Query Visual Concept Learning

【速读】:该论文旨在解决生成式 AI(Generative AI)中视觉概念学习(Visual concept learning),即文本到图像个性化(Text-to-image personalization)的问题,目标是提升模型对新概念的准确建模与生成能力。其解决方案的关键在于引入一个针对特定提示词(prompt)和噪声种子(noise seed)的个性化步骤,并结合基于自注意力(self-attention)和交叉注意力(cross-attention)的双损失项,以捕捉个性化概念的身份特征;同时利用先前设计用于身份感知的PDM特征(PDM features),显著增强个性化语义相似性表现,从而在多种基础文本到图像模型(包括UNet和DiT架构)及六种现有个性化方法上均实现显著性能提升。

链接: https://arxiv.org/abs/2508.09045
作者: Ori Malca,Dvir Samuel,Gal Chechik
机构: Bar‑Ilan University (巴伊兰大学); OriginAI; NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page is at this https URL

点击查看摘要

Abstract:Visual concept learning, also known as Text-to-image personalization, is the process of teaching new concepts to a pretrained model. This has numerous applications from product placement to entertainment and personalized design. Here we show that many existing methods can be substantially augmented by adding a personalization step that is (1) specific to the prompt and noise seed, and (2) using two loss terms based on the self- and cross- attention, capturing the identity of the personalized concept. Specifically, we leverage PDM features - previously designed to capture identity - and show how they can be used to improve personalized semantic similarity. We evaluate the benefit that our method gains on top of six different personalization methods, and several base text-to-image models (both UNet- and DiT-based). We find significant improvements even over previous per-query personalization methods.
zh

[CV-11] Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在预测代理在虚拟环境和现实场景中运动时,对空间与时间理解难以协同提升的问题。现有方法多独立优化空间或时间感知能力,缺乏统一建模机制。其解决方案的关键在于提出一种基于视觉提示(visual prompting)的新方法:通过将观测中关键点的视觉轨迹投影到深度图上,使模型能够同时捕捉空间布局与时间动态信息,从而增强对任务上下文的理解能力。实验表明,该方法在SimplerEnv环境中相较SpatialVLA和TraceVLA分别提升了4%和19%的任务成功率,且仅需少量训练数据即可实现显著性能跃升,适用于数据稀缺的真实世界部署场景。

链接: https://arxiv.org/abs/2508.09032
作者: Maxim A. Patratskiy,Alexey K. Kovalev,Aleksandr I. Panov
机构: MIPT(莫斯科物理技术学院); AIRI(人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action models have demonstrated remarkable capabilities in predicting agent movements within virtual environments and real-world scenarios based on visual observations and textual instructions. Although recent research has focused on enhancing spatial and temporal understanding independently, this paper presents a novel approach that integrates both aspects through visual prompting. We introduce a method that projects visual traces of key points from observations onto depth maps, enabling models to capture both spatial and temporal information simultaneously. The experiments in SimplerEnv show that the mean number of tasks successfully solved increased by 4% compared to SpatialVLA and by 19% compared to TraceVLA. Furthermore, we show that this enhancement can be achieved with minimal training data, making it particularly valuable for real-world applications where data collection is challenging. The project page is available at this https URL.
zh
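摘要称将关键点的 2D 视觉轨迹投影到深度图上,以同时编码空间与时间信息;具体投影方式未公开,下面给出标准针孔相机反投影的示意(内参 fx、fy、cx、cy 为假设输入):

```python
import numpy as np

def lift_trace_to_3d(trace_uv, depth, fx, fy, cx, cy):
    # back-project a 2D keypoint trace onto the depth map -> camera-frame 3D trace
    pts = []
    for u, v in trace_uv:
        z = float(depth[int(round(v)), int(round(u))])
        pts.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return np.array(pts)
```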

[CV-12] When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges

【速读】:该论文旨在解决深度伪造(deepfake)检测中因标注数据稀缺而导致的性能瓶颈问题,尤其是在生成式 AI(Generative AI)内容日益逼真、人类标注者难以区分真假图像的背景下,传统依赖大量人工标注的方法变得低效且不可靠。为此,作者提出 Dual-Path Guidance Network (DPGNet),其核心创新在于两个关键模块:一是文本引导的跨域对齐机制,通过可学习提示(learnable prompts)将视觉与文本嵌入统一到域不变特征空间,缓解不同生成模型间的人脸分布差异;二是课程驱动的伪标签生成策略,动态筛选更具信息量的无标签样本以提升模型泛化能力。此外,为防止灾难性遗忘,引入跨域知识蒸馏机制实现不同域之间的知识迁移,从而有效利用大规模无标签社交网络数据,显著提升检测性能,在11个主流数据集上相比当前最优方法平均提升6.3%。

链接: https://arxiv.org/abs/2508.09022
作者: Zhiqiang Yang,Renshuai Tao,Xiaolong Zheng,Guodong Yang,Chunjie Zhang
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Alibaba Cloud (阿里云); 3. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10pages,5figures

点击查看摘要

Abstract:Existing deepfake detection methods heavily depend on labeled training data. However, as AI-generated content becomes increasingly realistic, even human annotators struggle to distinguish between deepfakes and authentic images. This makes the labeling process both time-consuming and less reliable. Specifically, there is a growing demand for approaches that can effectively utilize large-scale unlabeled data from online social networks. Unlike typical unsupervised learning tasks, where categories are distinct, AI-generated faces closely mimic real image distributions and share strong similarities, causing performance drop in conventional strategies. In this paper, we introduce the Dual-Path Guidance Network (DPGNet), to tackle two key challenges: (1) bridging the domain gap between faces from different generation models, and (2) utilizing unlabeled image samples. The method features two core modules: text-guided cross-domain alignment, which uses learnable prompts to unify visual and textual embeddings into a domain-invariant feature space, and curriculum-driven pseudo label generation, which dynamically exploit more informative unlabeled samples. To prevent catastrophic forgetting, we also facilitate bridging between domains via cross-domain knowledge distillation. Extensive experiments on 11 popular datasets, show that DPGNet outperforms SoTA approaches by 6.3%, highlighting its effectiveness in leveraging unlabeled data to address the annotation challenges posed by the increasing realism of deepfakes.
zh

[CV-13] Uncertainty-aware Cross-training for Semi-supervised Medical Image Segmentation

【速读】:该论文旨在解决半监督医学图像分割中因模型认知偏差(cognitive biases)导致的性能瓶颈问题,以及在训练过程中难以生成高置信度伪标签(pseudo-labels)的挑战。其核心解决方案是提出一种不确定性感知的交叉训练框架(Uncertainty-aware Cross-training framework, UC-Seg),关键创新在于引入两个相互关联的子网络,并设计了两种机制:一是跨子网一致性保持(Cross-subnet Consistency Preservation, CCP)策略,通过增强特征表示能力和跨子网特征一致性来缓解模型内部偏差并学习共享语义;二是不确定性感知的伪标签生成(Uncertainty-aware Pseudo-label Generation, UPG)组件,利用双子网输出的分割结果及其对应的不确定性图谱协同生成高质量伪标签,从而提升未标注数据的有效利用效率与分割精度。

链接: https://arxiv.org/abs/2508.09014
作者: Kaiwen Huang,Tao Zhou,Huazhu Fu,Yizhe Zhang,Yi Zhou,Xiao-Jun Wu
机构: Nanjing University of Science and Technology (南京理工大学); Institute of High Performance Computing, A*STAR (新加坡高性能计算研究所); Southeast University (东南大学); Jiangnan University (江南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Semi-supervised learning has gained considerable popularity in medical image segmentation tasks due to its capability to reduce reliance on expert-examined annotations. Several mean-teacher (MT) based semi-supervised methods utilize consistency regularization to effectively leverage valuable information from unlabeled data. However, these methods often heavily rely on the student model and overlook the potential impact of cognitive biases within the model. Furthermore, some methods employ co-training using pseudo-labels derived from different inputs, yet generating high-confidence pseudo-labels from perturbed inputs during training remains a significant challenge. In this paper, we propose an Uncertainty-aware Cross-training framework for semi-supervised medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two distinct subnets to effectively explore and leverage the correlation between them, thereby mitigating cognitive biases within the model. Specifically, we present a Cross-subnet Consistency Preservation (CCP) strategy to enhance feature representation capability and ensure feature consistency across the two subnets. This strategy enables each subnet to correct its own biases and learn shared semantics from both labeled and unlabeled data. Additionally, we propose an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages segmentation results and corresponding uncertainty maps from both subnets to generate high-confidence pseudo-labels. We extensively evaluate the proposed UC-Seg on various medical image segmentation tasks involving different modality images, such as MRI, CT, ultrasound, colonoscopy, and so on. The results demonstrate that our method achieves superior segmentation accuracy and generalization performance compared to other state-of-the-art semi-supervised methods. Our code will be released at this https URL.
zh
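UPG 组件如何融合双子网输出与不确定性图,摘要中并未给出。以下以预测熵作为不确定性度量给出一种示意实现:逐像素取熵更低的子网的预测,熵仍过高的像素标记为忽略;阈值 max_entropy 为假设参数:

```python
import numpy as np

def entropy(p, eps=1e-8):
    # per-pixel predictive entropy over the class axis (axis 0)
    return -(p * np.log(p + eps)).sum(axis=0)

def fuse_pseudo_labels(p1, p2, max_entropy=0.5):
    """Fuse two subnets' softmax maps of shape (C, H, W) into pseudo-labels.

    Per pixel, trust the subnet with lower entropy; emit -1 (ignore)
    where even the more confident subnet is too uncertain.
    """
    e1, e2 = entropy(p1), entropy(p2)
    take1 = e1 <= e2
    labels = np.where(take1, p1.argmax(0), p2.argmax(0)).astype(int)
    conf_e = np.where(take1, e1, e2)
    labels[conf_e > max_entropy] = -1
    return labels
```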

[CV-14] Towards Perfection: Building Inter-component Mutual Correction for Retinex-based Low-light Image Enhancement

【速读】:该论文旨在解决低光照图像增强中基于Retinex的深度学习方法因illumination(光照)与reflectance(反射)成分分解不彻底而引入的互成分残差(inter-component residuals, ICR)问题。ICR不仅影响分解精度,还会导致增强后的成分偏离理想结果,进而降低最终合成图像质量。解决方案的关键在于提出一种新颖的Inter-correction Retinex模型(IRetinex),在分解阶段通过引入互成分残差减少模块(inter-component residual reduction module)降低光照与反射成分之间的特征相似性,在增强阶段利用两成分间的特征相似性检测并抑制ICR的影响,从而有效提升图像增强效果。

链接: https://arxiv.org/abs/2508.09009
作者: Luyang Cao,Han Xu,Jian Zhang,Lei Qi,Jiayi Ma,Yinghuan Shi,Yang Gao
机构: Nanjing University (南京大学); Southeast Univeristy (东南大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This article has been accepted by ACMMM 2025

点击查看摘要

Abstract:In low-light image enhancement, Retinex-based deep learning methods have garnered significant attention due to their exceptional interpretability. These methods decompose images into mutually independent illumination and reflectance components, allowing each component to be enhanced separately. In fact, achieving perfect decomposition of illumination and reflectance components proves to be quite challenging, with some residuals still existing after decomposition. In this paper, we formally name these residuals as inter-component residuals (ICR), which has been largely underestimated by previous methods. In our investigation, ICR not only affects the accuracy of the decomposition but also causes enhanced components to deviate from the ideal outcome, ultimately reducing the final synthesized image quality. To address this issue, we propose a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the decomposition and enhancement stage. In the decomposition stage, we leverage inter-component residual reduction module to reduce the feature similarity between illumination and reflectance components. In the enhancement stage, we utilize the feature similarity between the two components to detect and mitigate the impact of ICR within each enhancement unit. Extensive experiments on three low-light benchmark datasets demonstrated that by reducing ICR, our method outperforms state-of-the-art approaches both qualitatively and quantitatively.
zh

[CV-15] UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale ICCV2025

【速读】:该论文旨在解决传统卷积神经网络(Convolutional Neural Networks, CNNs)中大感受野(Effective Receptive Field, ERF)扩展所面临的高参数量与浮点运算次数(FLOPs)代价以及ERF渐近高斯分布(Asymptotically Gaussian Distribution, AGD)被破坏的问题。其核心解决方案在于提出一种基于三层感受野聚合器(Three-layer Receptive Field Aggregator)的新范式,通过合理组合小尺寸卷积核(如7×7、9×9、11×11)来扩展ERF,同时保持AGD特性;并设计了以感受野为基本视角的层操作算子(Layer Operator),使得模型在堆叠模块后仍能实现与现有大核CNN相当的感受野覆盖,且具备良好的可扩展性。由此构建出统一的通用卷积网络架构UniConvNet,在图像分类、目标检测和语义分割等任务上均展现出优于当前主流CNN与Vision Transformer(ViT)的性能表现。

链接: https://arxiv.org/abs/2508.09000
作者: Yuhao Wang,Wei Xi
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Convolutional neural networks (ConvNets) with large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness while constrained by high parameter and FLOPs costs and a disrupted asymptotically Gaussian distribution (AGD) of ERF. This paper proposes an alternative paradigm: rather than merely employing extremely large ERF, it is more effective and efficient to expand the ERF while maintaining AGD of ERF by a proper combination of smaller kernels, such as 7×7, 9×9, and 11×11. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stack of proposed modules while maintaining AGD of ERF. Using these designs, we propose a universal model for ConvNets of any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves 84.2% ImageNet top-1 accuracy with 30M parameters and 5.1G FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring 88.4% top-1 accuracy on ImageNet. Code and models are publicly available at this https URL.
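小核堆叠扩大感受野且趋于高斯分布,可以用一维示意来理解:多个均匀(box)核的连续卷积,其复合冲激响应长度为 1 + Σ(k−1),而形状依中心极限定理趋于高斯。以下片段仅为原理演示(非论文实现):

```python
import numpy as np

def stacked_response(kernel_sizes):
    """1-D impulse response of stacked uniform (box) kernels.
    Repeated convolution of boxes tends toward a Gaussian profile --
    illustrating the 'asymptotically Gaussian distribution' of the ERF."""
    resp = np.ones(1)
    for k in kernel_sizes:
        resp = np.convolve(resp, np.ones(k) / k)
    return resp

# Stacking 7, 9, 11 kernels: receptive field = 1 + 6 + 8 + 10 = 25.
resp = stacked_response([7, 9, 11])
rf = len(resp)
```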
zh

[CV-16] Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation

【速读】:该论文旨在解决当前人体运动生成中运动表示方法的两大局限性:一是现有离散帧序列表示难以从多尺度(multi-scale)视角捕捉运动特征,限制了对复杂运动模式的建模能力;二是缺乏组合灵活性(compositional flexibility),影响模型在多样化生成任务中的泛化性能。解决方案的关键在于提出一种名为MSQ(Multi-Scale Quantization)的新量化方法,该方法通过空间和时间维度上的多尺度离散化,将运动序列压缩为多尺度离散标记(discrete tokens)。MSQ采用不同的编码器以不同空间粒度捕获身体部位特征,并在时间上插值后进行多尺度量化,从而构建具有层次结构的运动表示。在此基础上,进一步建立基于生成掩码建模(generative mask modeling)的框架,实现无需专门设计或重新训练即可无缝组合运动标记,支持运动编辑、控制与条件生成等多样化任务。

链接: https://arxiv.org/abs/2508.08991
作者: Zan Wang,Jingze Zhang,Yixin Chen,Baoxiong Jia,Wei Liang,Siyuan Huang
机构: Beijing Institute of Technology (北京理工大学); BIGAI (通用人工智能国家重点实验室); Tsinghua University (清华大学); Yangtze Delta Region Academy of Beijing Institute of Technology (北京理工大学长三角研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages

点击查看摘要

Abstract:Despite significant advancements in human motion generation, current motion representations, typically formulated as discrete frame sequences, still face two critical limitations: (i) they fail to capture motion from a multi-scale perspective, limiting the capability in complex patterns modeling; (ii) they lack compositional flexibility, which is crucial for model’s generalization in diverse generation tasks. To address these challenges, we introduce MSQ, a novel quantization method that compresses the motion sequence into multi-scale discrete tokens across spatial and temporal dimensions. MSQ employs distinct encoders to capture body parts at varying spatial granularities and temporally interpolates the encoded features into multiple scales before quantizing them into discrete tokens. Building on this representation, we establish a generative mask modeling model to effectively support motion editing, motion control, and conditional motion generation. Through quantitative and qualitative analysis, we show that our quantization method enables the seamless composition of motion tokens without requiring specialized design or re-training. Furthermore, extensive evaluations demonstrate that our approach outperforms existing baseline methods on various benchmarks.
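MSQ 的"多时间尺度插值后量化"思路,可用如下极简片段示意:先将编码后的运动特征序列线性插值到多个时间尺度,再在每个尺度上做最近码本量化。各函数名、尺度与码本大小均为本文假设,仅展示多尺度离散化这一机制:

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry."""
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def temporal_rescale(features, length):
    """Linearly interpolate a (T, D) feature sequence to a new length."""
    T, _ = features.shape
    src = np.linspace(0, T - 1, length)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (src - lo)[:, None]
    return (1 - w) * features[lo] + w * features[hi]

def multi_scale_tokens(features, codebook, scales=(4, 8, 16)):
    """Hypothetical MSQ-style tokenization: interpolate the encoded
    sequence to several temporal scales, then quantize each scale."""
    return {s: quantize(temporal_rescale(features, s), codebook) for s in scales}

rng = np.random.default_rng(3)
motion = rng.normal(size=(16, 6))     # encoded motion, 16 frames
codebook = rng.normal(size=(32, 6))   # 32 discrete tokens
tokens = multi_scale_tokens(motion, codebook)
```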
zh

[CV-17] KFFocus: Highlighting Keyframes for Enhanced Video Understanding

【速读】:该论文旨在解决当前视频大语言模型(Vid-LLMs)在处理长视频时因计算资源限制而采用的均匀采样与帧内令牌压缩策略所导致的关键信息丢失问题,尤其是忽视了视频中关键帧(keyframes)在时间维度上的非均匀分布特性。其解决方案之关键在于提出KFFocus方法:首先,以经典视频压缩原理为基础,用更具针对性的采样策略替代均匀采样,识别并保留具有高信息量的帧;其次,根据帧的上下文相关性动态调整压缩比例,从而在降低冗余令牌数量的同时保留语义细节;最后引入时空建模模块,显式编码帧间时序关系与帧内空间结构,增强模型对视频时空动态的细粒度理解。

链接: https://arxiv.org/abs/2508.08989
作者: Ming Nie,Chunwei Wang,Hang Xu,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in image and video modalities. Despite advancements in video comprehension, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to employ compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To tackle these challenges, we propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. We substitute uniform sampling with a refined approach inspired by classic video compression principles to identify and capture keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. Additionally, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, thus providing Vid-LLMs with a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially long video scenarios, demonstrate that KFFocus significantly outperforms existing methods, achieving substantial computational efficiency and enhanced accuracy.
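"按时间冗余度选取关键帧"这一思想,最简单的代理实现是用相邻帧差分打分:差分越大说明冗余越低、信息量越大。以下为示意性替代实现,评分方式与函数名均为本文假设,并非论文的采样策略本身:

```python
import numpy as np

def select_keyframes(frames, budget):
    """Score each frame by mean absolute difference to the previous frame
    (a cheap temporal-redundancy proxy), then keep frame 0 plus the
    budget-1 highest-scoring frames. Illustrative stand-in only."""
    diffs = [(float(np.abs(frames[i] - frames[i - 1]).mean()), i)
             for i in range(1, len(frames))]
    picked = [i for _, i in sorted(diffs, reverse=True)[: budget - 1]]
    return sorted([0] + picked)

# A static clip with one abrupt change at frame 5: the change is kept.
frames = np.zeros((10, 4, 4))
frames[5:] = 1.0
keys = select_keyframes(frames, budget=3)
```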
zh

[CV-18] ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation ICDAR2025

【速读】:该论文旨在解决向量图形文档设计中颜色推荐的难题,即在缺少或需调整颜色时,如何高效、准确地推荐合适的颜色以提升视觉吸引力、沟通效率、可用性和可访问性。传统方法因颜色设计的复杂性及数据稀缺而效果有限。解决方案的关键在于利用预训练大型语言模型(Large Language Models, LLMs)的常识推理能力,构建了一个名为ColorGPT的系统化管道,通过多种颜色表示方式与有效的提示工程(prompt engineering)技术,实现基于给定颜色集和上下文的颜色调色板补全,并进一步扩展至根据文本描述生成完整调色板的任务,从而在颜色建议准确性、调色板分布合理性、颜色多样性及相似性等方面显著优于现有方法。

链接: https://arxiv.org/abs/2508.08987
作者: Ding Xia,Naoto Inoue,Qianru Qiu,Kotaro Kikuchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted to ICDAR2025

点击查看摘要

Abstract:Colors play a crucial role in the design of vector graphic documents by enhancing visual appeal, facilitating communication, improving usability, and ensuring accessibility. In this context, color recommendation involves suggesting appropriate colors to complete or refine a design when one or more colors are missing or require alteration. Traditional methods often struggled with these challenges due to the complex nature of color design and the limited data availability. In this study, we explored the use of pretrained Large Language Models (LLMs) and their commonsense reasoning capabilities for color recommendation, raising the question: Can pretrained LLMs serve as superior designers for color recommendation tasks? To investigate this, we developed a robust, rigorously validated pipeline, ColorGPT, that was built by systematically testing multiple color representations and applying effective prompt engineering techniques. Our approach primarily targeted color palette completion by recommending colors based on a set of given colors and accompanying context. Moreover, our method can be extended to full palette generation, producing an entire color palette corresponding to a provided textual description. Experimental results demonstrated that our LLM-based pipeline outperformed existing methods in terms of color suggestion accuracy and the distribution of colors in the color palette completion task. For the full palette generation task, our approach also yielded improvements in color diversity and similarity compared to current techniques.
zh

[CV-19] TaoCache: Structure-Maintained Video Generation Acceleration

【速读】:该论文旨在解决现有基于缓存(cache-based)加速方法在视频扩散模型中因跳过早期或中期去噪步骤而导致的结构失真问题,这些问题会损害指令遵循能力与角色一致性。其解决方案的关键在于提出一种无需训练、即插即用的缓存策略TaoCache,该策略摒弃传统的残差(residual-based)缓存方式,转而采用固定点(fixed-point)视角预测模型噪声输出,并特别适用于去噪后期阶段;通过校准连续噪声增量之间的余弦相似度和范数比,TaoCache能够在实现大幅跳步的同时保持高分辨率结构信息,从而在相同加速比下显著提升视觉质量(如LPIPS、SSIM、PSNR)。

链接: https://arxiv.org/abs/2508.08978
作者: Zhentao Fan,Zongzuo Wang,Weiwei Zhang
机构: Huawei Inc.(华为公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing cache-based acceleration methods for video diffusion models primarily skip early or mid denoising steps, which often leads to structural discrepancies relative to full-timestep generation and can hinder instruction following and character consistency. We present TaoCache, a training-free, plug-and-play caching strategy that, instead of residual-based caching, adopts a fixed-point perspective to predict the model’s noise output and is specifically effective in late denoising stages. By calibrating cosine similarities and norm ratios of consecutive noise deltas, TaoCache preserves high-resolution structure while enabling aggressive skipping. The approach is orthogonal to complementary accelerations such as Pyramid Attention Broadcast (PAB) and TeaCache, and it integrates seamlessly into DiT-based frameworks. Across Latte-1, OpenSora-Plan v110, and Wan2.1, TaoCache attains substantially higher visual quality (LPIPS, SSIM, PSNR) than prior caching methods under the same speedups.
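摘要中"校准连续噪声增量的余弦相似度和范数比"可以用一个极简判定函数示意:只有当相邻两步噪声增量方向几乎一致且幅度相近时才复用缓存、跳过模型调用。函数名与阈值均为本文示意性假设:

```python
import numpy as np

def should_reuse_cache(delta_prev, delta_curr,
                       cos_thresh=0.99, ratio_band=(0.9, 1.1)):
    """Reuse the cached noise only when consecutive noise deltas are
    nearly collinear (high cosine similarity) and of similar magnitude
    (norm ratio inside a band). Thresholds are illustrative."""
    a = np.ravel(delta_prev).astype(float)
    b = np.ravel(delta_curr).astype(float)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    ratio = float(np.linalg.norm(b) / (np.linalg.norm(a) + 1e-12))
    return cos >= cos_thresh and ratio_band[0] <= ratio <= ratio_band[1]

# Late denoising: deltas nearly identical -> safe to skip the model call.
stable = should_reuse_cache(np.ones((4, 4)), 1.02 * np.ones((4, 4)))
# Orthogonal deltas -> the trajectory is still changing, recompute.
unstable = should_reuse_cache(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```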
zh

[CV-20] Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering

【速读】:该论文旨在解决Change Detection Visual Question Answering (CDVQA)任务中因域偏移(domain shift)导致模型泛化能力下降的问题。现有方法通常假设训练与测试数据分布一致,但在实际应用中这一假设难以满足,从而限制了非专家用户在复杂场景下对地表变化信息的准确获取。解决方案的关键在于提出一种新的多模态、多域数据集BrightVQA以及一种文本条件状态空间模型(Text-Conditioned State Space Model, TCSSM),该模型能够统一建模双时相遥感图像与地理灾害相关文本信息,通过动态预测输入依赖参数来对齐视觉与语义特征,从而提取跨域不变特征,提升模型在不同域间的适应性和鲁棒性。

链接: https://arxiv.org/abs/2508.08974
作者: Elman Ghazaei,Erchan Aptoula
机构: Sabanci University (萨班哲大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Earth’s surface is constantly changing, and detecting these changes provides valuable insights that benefit various aspects of human society. While traditional change detection methods have been employed to detect changes from bi-temporal images, these approaches typically require expert knowledge for accurate interpretation. To enable broader and more flexible access to change information by non-expert users, the task of Change Detection Visual Question Answering (CDVQA) has been introduced. However, existing CDVQA methods have been developed under the assumption that training and testing datasets share similar distributions. This assumption does not hold in real-world applications, where domain shifts often occur. In this paper, the CDVQA task is revisited with a focus on addressing domain shift. To this end, a new multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate domain generalization research in CDVQA. Furthermore, a novel state space model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The TCSSM framework is designed to leverage both bi-temporal imagery and geo-disaster-related textual information in a unified manner to extract domain-invariant features across domains. The input-dependent parameters of TCSSM are dynamically predicted using both bi-temporal images and geo-disaster-related descriptions, thereby facilitating the alignment between bi-temporal visual data and the associated textual descriptions. Extensive experiments are conducted to evaluate the proposed method against state-of-the-art models, and superior performance is consistently demonstrated. The code and dataset will be made publicly available upon acceptance at this https URL.
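"由文本动态预测输入依赖参数"这一机制,与 FiLM 式条件调制同属一类:文本特征经线性映射产生逐通道的缩放与偏移,作用于视觉特征。以下仅为该类条件化思路的通用示意(并非 TCSSM 的状态空间参数化本身,权重与维度均为假设):

```python
import numpy as np

def film_condition(visual_feat, text_feat, W_gamma, W_beta):
    """FiLM-style input-dependent modulation: the text feature predicts a
    per-channel scale (gamma) and shift (beta) applied to the visual
    features -- a generic stand-in for input-dependent SSM parameters."""
    gamma = text_feat @ W_gamma   # (C,)
    beta = text_feat @ W_beta     # (C,)
    return gamma * visual_feat + beta

rng = np.random.default_rng(6)
C, T = 4, 8
vis = rng.normal(size=(T, C))          # visual feature sequence
txt = rng.normal(size=5)               # text embedding
Wg = rng.normal(size=(5, C))
Wb = rng.normal(size=(5, C))
out = film_condition(vis, txt, Wg, Wb)
```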
zh

[CV-21] Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation ICCV2025

【速读】:该论文旨在解决生成式AI(Generative AI)在叙事任务中难以保持主体一致性的问题,尤其是由于缺乏细粒度引导和帧间交互机制,以及高质量数据稀缺导致的对主体位置、外观、服饰、表情和姿态等关键细节控制不足。其解决方案的关键在于引入布局条件(layout conditions),通过显式提供主体的位置和属性信息,增强帧间的细粒度交互,从而显著提升生成序列的一致性与可控性。在此基础上,作者提出了Layout-Togglable Storytelling新任务,并构建了大规模标注数据集Lay2Story-1M及评估基准Lay2Story-Bench,最终设计了基于Diffusion Transformers(DiTs)架构的Lay2Story框架,在一致性、语义相关性和美学质量上均超越现有最先进方法。

链接: https://arxiv.org/abs/2508.08949
作者: Ao Ma,Jiasong Feng,Ke Cao,Jing Wang,Yun Wang,Quanwei Zhang,Zhanjie Zhang
机构: JD.com, Inc.(京东)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject’s position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject’s position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject’s position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Togglable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion Transformers (DiTs) architecture for Layout-Togglable Storytelling tasks. Through both qualitative and quantitative experiments, we find that our method outperforms the previous state-of-the-art (SOTA) techniques, achieving the best results in terms of consistency, semantic correlation, and aesthetic quality.
zh

[CV-22] UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition

【速读】:该论文旨在解决基于骨骼的动作识别(Skeleton-based Action Recognition, SAR)中现有Transformer架构存在的参数量大、计算成本高及可扩展性差的问题。其解决方案的关键在于提出一种统一的时空轻量化Transformer框架,将空间与时间建模整合到单一注意力模块中,从而避免了传统方法中独立的时间建模模块带来的冗余计算,同时在空间建模过程中保留时间感知能力;此外,引入简化多尺度池化融合模块,通过结合局部与全局池化路径,有效增强模型对细粒度局部动作和整体运动模式的捕捉能力,显著提升了模型的效率与性能平衡。

链接: https://arxiv.org/abs/2508.08944
作者: Wenhan Wu,Zhishuai Guo,Chen Chen,Aidong Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Skeleton-based action recognition (SAR) has achieved impressive progress with transformer architectures. However, existing methods often rely on complex module compositions and heavy designs, leading to increased parameter counts, high computational costs, and limited scalability. In this paper, we propose a unified spatio-temporal lightweight transformer framework that integrates spatial and temporal modeling within a single attention module, eliminating the need for separate temporal modeling blocks. This approach reduces redundant computations while preserving temporal awareness within the spatial modeling process. Furthermore, we introduce a simplified multi-scale pooling fusion module that combines local and global pooling pathways to enhance the model’s ability to capture fine-grained local movements and overarching global motion patterns. Extensive experiments on benchmark datasets demonstrate that our lightweight model achieves a superior balance between accuracy and efficiency, reducing parameter complexity by over 58% and lowering computational cost by over 60% compared to state-of-the-art transformer-based baselines, while maintaining competitive recognition performance.
zh

[CV-23] MADPromptS: Unlocking Zero-Shot Morphing Attack Detection with Multiple Prompt Aggregation

【速读】:该论文旨在解决人脸伪造攻击检测(Face Morphing Attack Detection, MAD)问题,即攻击者通过插值多个个体的身份信息生成伪造人脸图像,使识别系统误判其为多个身份的合法样本。传统方法多依赖于对基础模型(Foundation Models, FMs)进行微调以适应特定任务,但忽略了其零样本(zero-shot)直接部署潜力。本文提出一种纯零样本解决方案,利用CLIP模型无需额外训练或微调,核心创新在于设计并聚合每类样本的多种文本提示(textual prompts),通过整合多样化提示对应的嵌入表示,增强模型内部表征与MAD任务的对齐性,从而更有效地捕捉真实样本与攻击样本之间的细微差异,显著提升检测性能。

链接: https://arxiv.org/abs/2508.08939
作者: Eduarda Caldeira,Fadi Boutros,Naser Damer
机构: Fraunhofer IGD (弗劳恩霍夫信息及数据处理研究所); Technische Universität Darmstadt (达姆施塔特工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACM Multimedia Workshops

点击查看摘要

Abstract:Face Morphing Attack Detection (MAD) is a critical challenge in face recognition security, where attackers can fool systems by interpolating the identity information of two or more individuals into a single face image, resulting in samples that can be verified as belonging to multiple identities by face recognition systems. While multimodal foundation models (FMs) like CLIP offer strong zero-shot capabilities by jointly modeling images and text, most prior works on FMs for biometric recognition have relied on fine-tuning for specific downstream tasks, neglecting their potential for direct, generalizable deployment. This work explores a pure zero-shot approach to MAD by leveraging CLIP without any additional training or fine-tuning, focusing instead on the design and aggregation of multiple textual prompts per class. By aggregating the embeddings of diverse prompts, we better align the model’s internal representations with the MAD task, capturing richer and more varied cues indicative of bona-fide or attack samples. Our results show that prompt aggregation substantially improves zero-shot detection performance, demonstrating the effectiveness of exploiting foundation models’ built-in multimodal knowledge through efficient prompt engineering.
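多提示聚合的核心操作可以脱离具体模型来示意:对每类的多条提示文本嵌入先归一化再取均值、重新归一化,得到类原型,再用图像嵌入与原型的余弦相似度做零样本判别。以下片段用随机向量代替真实的 CLIP 文本/图像编码器输出,流程为示意性假设:

```python
import numpy as np

def normalize(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)

def aggregate_prompts(prompt_embeddings):
    """Average the L2-normalized embeddings of several prompts for one
    class, then renormalize. In practice the embeddings would come from
    a frozen CLIP text encoder; here they are synthetic stand-ins."""
    return normalize(normalize(np.asarray(prompt_embeddings)).mean(axis=0))

def zero_shot_predict(image_emb, class_prototypes):
    """Cosine similarity against each class prototype; argmax wins."""
    sims = normalize(class_prototypes) @ normalize(image_emb)
    return int(np.argmax(sims))

rng = np.random.default_rng(2)
# Synthetic embeddings: bona-fide prompts cluster on dim 0, attacks on dim 1.
bona_fide_prompts = rng.normal(size=(5, 8)) + np.eye(8)[0] * 3
attack_prompts = rng.normal(size=(5, 8)) + np.eye(8)[1] * 3
prototypes = np.stack([aggregate_prompts(bona_fide_prompts),
                       aggregate_prompts(attack_prompts)])
image_emb = np.array([2.5, 0.1, 0, 0, 0, 0, 0, 0.0])
pred = zero_shot_predict(image_emb, prototypes)
```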
zh

[CV-24] Accelerated Volumetric Compression without Hierarchies: A Fourier Feature Based Implicit Neural Representation Approach

【速读】:该论文旨在解决体积数据(volumetric data)压缩在医学影像、科学模拟和娱乐等领域中的效率与质量平衡问题。传统方法依赖于分层元数据或结构化编码,存在计算冗余和加载开销。其解决方案的关键在于提出一种无结构的神经压缩方法,结合傅里叶特征编码(Fourier feature encoding)与选择性体素采样(selective voxel sampling),通过形态学膨胀动态优先选择活跃区域,从而减少冗余计算;最终以纯网络权重形式存储神经表示,实现14倍压缩率并消除传统数据加载开销,显著提升训练速度(缩短63.7%)且仅带来微小质量损失(PSNR下降0.59 dB)。

链接: https://arxiv.org/abs/2508.08937
作者: Leona Žůrková(1),Petr Strakoš(1),Michal Kravčenko(1),Tomáš Brzobohatý(1),Lubomír Říha(1) ((1) IT4Innovations, VSB - Technical University of Ostrava)
机构: IT4Innovations, VSB – Technical University of Ostrava (IT4创新中心,奥斯特拉发技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 2 pages, accepted for the VIS IEEE 2025 poster

点击查看摘要

Abstract:Volumetric data compression is critical in fields like medical imaging, scientific simulation, and entertainment. We introduce a structure-free neural compression method combining Fourier feature encoding with selective voxel sampling, yielding compact volumetric representations and faster convergence. Our dynamic voxel selection uses morphological dilation to prioritize active regions, reducing redundant computation without any hierarchical metadata. In the experiment, sparse training reduced training time by 63.7% (from 30 to 11 minutes) with only minor quality loss: PSNR dropped 0.59 dB (from 32.60 to 32.01) and SSIM by 0.008 (from 0.948 to 0.940). The resulting neural representation, stored solely as network weights, achieves a compression rate of 14 and eliminates traditional data-loading overhead. This connects coordinate-based neural representation with efficient volumetric compression, offering a scalable, structure-free solution for practical applications.
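"傅里叶特征编码 + 形态学膨胀选点"两个组件都可以用几行 NumPy 示意:坐标经多频率 sin/cos 映射后作为网络输入,而活跃体素掩码经膨胀扩大采样区域。膨胀这里用移位取或实现(周期边界),频率数等超参为假设:

```python
import numpy as np

def fourier_features(coords, n_freq=4):
    """Map 3-D coordinates to sin/cos features at octave frequencies --
    the standard positional encoding for coordinate networks."""
    bands = 2.0 ** np.arange(n_freq) * np.pi          # pi, 2pi, 4pi, ...
    angles = coords[..., None, :] * bands[:, None]    # (..., n_freq, 3)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(*coords.shape[:-1], -1)      # (..., n_freq*6)

def dilate_mask(mask):
    """6-neighbourhood binary dilation via shift-and-OR (wraps at the
    border, which is fine for this interior-voxel demo)."""
    out = mask.copy()
    for axis in range(3):
        out |= np.roll(mask, 1, axis) | np.roll(mask, -1, axis)
    return out

# Active region: a single voxel; dilation grows the training set around it.
vol = np.zeros((8, 8, 8), dtype=bool)
vol[4, 4, 4] = True
active = dilate_mask(vol)
coords = np.argwhere(active) / 7.0     # normalized coordinates in [0, 1]
X = fourier_features(coords)           # encoded inputs for the network
```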
zh

[CV-25] Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions

【速读】:该论文旨在解决超声(Ultrasound, US)成像在脊柱介入操作中因伪影导致深层组织结构显示不清的问题,以及传统基于术前CT与US配准方法存在的注册复杂、脊柱曲度差异和需近期CT影像等局限性。其解决方案的关键在于提出一个集成机器人超声系统,结合实时形态补全(shape completion)技术:通过自主采集腰椎区域的超声扫查数据,利用预训练的深度学习网络实时重建完整的脊柱解剖结构,从而实现交互式、可重复的实时可视化,并支持导航至目标位置,显著提升成像一致性、可重复性和对解剖结构的理解能力。

链接: https://arxiv.org/abs/2508.08923
作者: Miruna-Alexandra Gafencu,Reem Shaban,Yordanka Velikova,Mohammad Farid Azampour,Nassir Navab
机构: Technical University of Munich (慕尼黑工业大学); Konrad Zuse School of Excellence in Reliable AI (relAI) (康拉德·祖塞可靠人工智能卓越中心); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan.
zh

[CV-26] A Pseudo Global Fusion Paradigm-Based Cross-View Network for LiDAR-Based Place Recognition

【速读】:该论文旨在解决LiDAR-based Place Recognition (LPR) 中因采用欧氏距离(Euclidean distance)驱动的度量学习方法所导致的问题,即忽略了特征空间的内在结构和类内差异,从而在复杂环境与时间变化场景下性能受限。其解决方案的关键在于提出一种基于跨视图网络的新颖融合范式,引入伪全局信息引导机制以协调多模态分支在统一语义空间中进行特征学习;同时设计了流形自适应与成对方差-局部性学习度量(Manifold Adaptation and Pairwise Variance-Locality Learning Metric),通过构建对称正定(SPD)矩阵计算马氏距离(Mahalanobis distance),从而更准确地刻画数据的内在分布并捕捉特征空间中的复杂类间依赖关系。

链接: https://arxiv.org/abs/2508.08917
作者: Jintao Cheng,Jiehao Luo,Xieyuanli Chen,Jin Wu,Rui Fan,Xiaoyu Tang,Wei Zhang
机构: South China Normal University (华南师范大学); National University of Defense Technology (国防科技大学); Tongji University (同济大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR-based Place Recognition (LPR) remains a critical task in Embodied Artificial Intelligence (AI) and Autonomous Driving, primarily addressing localization challenges in GPS-denied environments and supporting loop closure detection. Existing approaches reduce place recognition to a Euclidean distance-based metric learning task, neglecting the feature space’s intrinsic structures and intra-class variances. Such Euclidean-centric formulation inherently limits the model’s capacity to capture nonlinear data distributions, leading to suboptimal performance in complex environments and temporal-varying scenarios. To address these challenges, we propose a novel cross-view network based on an innovative fusion paradigm. Our framework introduces a pseudo-global information guidance mechanism that coordinates multi-modal branches to perform feature learning within a unified semantic space. Concurrently, we propose a Manifold Adaptation and Pairwise Variance-Locality Learning Metric that constructs a Symmetric Positive Definite (SPD) matrix to compute Mahalanobis distance, superseding traditional Euclidean distance metrics. This geometric formulation enables the model to accurately characterize intrinsic data distributions and capture complex inter-class dependencies within the feature space. Experimental results demonstrate that the proposed algorithm achieves competitive performance, particularly excelling in complex environmental conditions.
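用 SPD 矩阵计算马氏距离以替代欧氏度量,是摘要中度量学习部分的核心。以下为经典形式的示意:由特征样本的正则化协方差求逆得到 SPD 矩阵 M,距离为 sqrt(dᵀMd);当 M 为单位阵时退化为欧氏距离。构造方式为示意性假设,并非论文所学的度量:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance induced by a symmetric positive definite M."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

def spd_from_features(F, eps=1e-3):
    """Classical SPD construction: inverse of the eps-regularized
    covariance of the feature samples (rows of F). Illustrative only --
    the paper learns its metric rather than estimating it this way."""
    C = np.cov(F, rowvar=False) + eps * np.eye(F.shape[1])
    return np.linalg.inv(C)

rng = np.random.default_rng(1)
F = rng.normal(size=(100, 3))
M = spd_from_features(F)
# With M = I the metric reduces to the Euclidean distance (3-4-5 triangle).
d_eucl = mahalanobis(np.zeros(3), np.array([3.0, 4.0, 0.0]), np.eye(3))
```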
zh

[CV-27] Automatic and standardized surgical reporting for central nervous system tumors

【速读】:该论文旨在解决中枢神经系统(CNS)肿瘤术后影像学评估中缺乏标准化、自动化分析流程的问题,尤其针对术后对比增强残留肿瘤和切除腔的精准分割及报告生成。解决方案的关键在于构建一个整合多任务模型的端到端分析流水线:采用Attention U-Net架构实现对肿瘤核心、非强化肿瘤核心、术后对比增强残留肿瘤及切除腔的高精度分割(平均体素级Dice分数分别为87%、66%、70%和77%),并利用DenseNet进行磁共振(MR)序列分类(平衡准确率达99.5%)与对比增强病灶类型识别(准确率80%),所有模块均遵循RANO 2.0指南,最终集成至开源平台Raidionics,实现了术后影像的标准化、自动化分析与报告生成,显著提升临床决策效率与一致性。

链接: https://arxiv.org/abs/2508.08916
作者: David Bouget,Mathilde Gajda Faanes,Asgeir Store Jakola,Frederik Barkhof,Hilko Ardon,Lorenzo Bello,Mitchel S. Berger,Shawn L. Hervey-Jumper,Julia Furtner,Albert J. S. Idema,Barbara Kiesel,Georg Widhalm,Rishi Nandoe Tewarie,Emmanuel Mandonnet,Pierre A. Robe,Michiel Wagemakers,Timothy R. Smith,Philip C. De Witt Hamer,Ole Solheim,Ingerid Reinertsen
机构: SINTEF Digital (SINTEF 数字); University of Gothenburg (哥德堡大学); Sahlgrenska University Hospital (萨尔格伦斯卡大学医院); Amsterdam University Medical Centers (阿姆斯特丹大学医疗中心); Vrije Universiteit (自由大学); University College London (伦敦大学学院); Elisabeth-TweeSteden Hospital (伊丽莎白-特维斯特医院); Humanitas Research Hospital (人类研究医院); University of California, San Francisco (加州大学旧金山分校); Medical University Vienna (维也纳医科大学); Krems (克雷姆斯); Northwest Clinics (西北诊所); Haaglanden Medical Center (哈勒兰医疗中心); Hôpital Lariboisière (拉里博西埃医院); University Medical Center Utrecht (乌得勒支大学医疗中心); University Medical Center Groningen (格罗宁根大学医疗中心); University of Groningen (格罗宁根大学); Brigham and Women’s Hospital (布里格姆妇女医院); Harvard Medical School (哈佛医学院); Cancer Center Amsterdam (阿姆斯特丹癌症中心); NTNU (挪威科技大学); St. Olavs hospital (圣奥拉夫医院); Trondheim University Hospital (特隆赫姆大学医院); Department of Circulation and Medical Imaging (循环与医学影像系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 6 figures, 9 tables

点击查看摘要

Abstract:Magnetic resonance (MR) imaging is essential for evaluating central nervous system (CNS) tumors, guiding surgical planning, treatment decisions, and assessing postoperative outcomes and complication risks. While recent work has advanced automated tumor segmentation and report generation, most efforts have focused on preoperative data, with limited attention to postoperative imaging analysis. This study introduces a comprehensive pipeline for standardized postsurgical reporting in CNS tumors. Using the Attention U-Net architecture, segmentation models were trained for the preoperative (non-enhancing) tumor core, postoperative contrast-enhancing residual tumor, and resection cavity. Additionally, MR sequence classification and tumor type identification for contrast-enhancing lesions were explored using the DenseNet architecture. The models were integrated into a reporting pipeline, following the RANO 2.0 guidelines. Training was conducted on multicentric datasets comprising 2000 to 7000 patients, using a 5-fold cross-validation. Evaluation included patient-, voxel-, and object-wise metrics, with benchmarking against the latest BraTS challenge results. The segmentation models achieved average voxel-wise Dice scores of 87%, 66%, 70%, and 77% for the tumor core, non-enhancing tumor core, contrast-enhancing residual tumor, and resection cavity, respectively. Classification models reached 99.5% balanced accuracy in MR sequence classification and 80% in tumor type classification. The pipeline presented in this study enables robust, automated segmentation, MR sequence classification, and standardized report generation aligned with RANO 2.0 guidelines, enhancing postoperative evaluation and clinical decision-making. The proposed models and methods were integrated into Raidionics, an open-source software platform for CNS tumor analysis, now including a dedicated module for postsurgical analysis.
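摘要中报告的体素级 Dice 分数,其定义可用几行代码复现(这是标准分割指标,非论文特有实现):

```python
import numpy as np

def dice(pred, target, eps=1e-8):
    """Voxel-wise Dice coefficient between two binary masks: twice the
    intersection over the sum of the mask sizes."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return float(2 * inter / (pred.sum() + target.sum() + eps))

# Two 8-voxel masks overlapping on 4 voxels: Dice = 2*4 / (8+8) = 0.5.
a = np.zeros((4, 4), bool)
a[:2] = True
b = np.zeros((4, 4), bool)
b[1:3] = True
score = dice(a, b)
```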
zh

[CV-28] Masked Clustering Prediction for Unsupervised Point Cloud Pre-training

【速读】:该论文旨在解决标准视觉变压器(Vision Transformer, ViT)在3D点云理解任务中难以学习密集且语义丰富的特征的问题。现有方法主要依赖掩码自动编码(masked autoencoding)进行预训练,但未能充分挖掘点云中的细粒度语义信息。解决方案的关键在于提出一种名为MaskClu的无监督预训练方法,其核心创新是将掩码点建模与基于聚类的学习相结合:模型不仅重建被掩码点云的聚类分配(cluster assignments),还重构聚类中心(cluster centers),从而引导网络捕捉更密集的语义特征;同时引入全局对比学习机制,通过对比同一场景的不同掩码视图来增强实例级别的特征表示能力。二者联合优化,显著提升了ViT在3D点云上的语义表征能力,在多个下游任务(如部分分割、语义分割、目标检测和分类)中取得了新的性能纪录。

链接: https://arxiv.org/abs/2508.08910
作者: Bin Ren,Xiaoshui Huang,Mengyuan Liu,Hong Liu,Fabio Poiesi,Nicu Sebe,Guofeng Mei
机构: University of Pisa (比萨大学); University of Trento (特伦托大学); Shanghai Jiao Tong University (上海交通大学); Peking University (北京大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3D point cloud pretraining method. 8 pages in the main manuscript

点击查看摘要

Abstract:Vision transformers (ViTs) have recently been widely applied to 3D point cloud understanding, with masked autoencoding as the predominant pre-training paradigm. However, the challenge of learning dense and informative semantic features from point clouds via standard ViTs remains underexplored. We propose MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds that integrates masked point modeling with clustering-based learning. MaskClu is designed to reconstruct both cluster assignments and cluster centers from masked point clouds, thus encouraging the model to capture dense semantic information. Additionally, we introduce a global contrastive learning mechanism that enhances instance-level feature learning by contrasting different masked views of the same point cloud. By jointly optimizing these complementary objectives, i.e., dense semantic reconstruction and instance-level contrastive learning, MaskClu enables ViTs to learn richer and more semantically meaningful representations from 3D point clouds. We validate the effectiveness of our method via multiple 3D tasks, including part segmentation, semantic segmentation, object detection, and classification, where MaskClu sets new competitive results. The code and models will be released at: this https URL.
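"以聚类分配作为掩码点的重建目标"可以用一个朴素 k-means 片段示意:先对完整点云聚类得到逐点标签,再取被掩码点对应的标签作为密集监督目标。初始化采用最远点采样(FPS,点云领域常用做法);整个流程为示意性假设,非论文的在线聚类实现:

```python
import numpy as np

def farthest_point_init(points, k):
    """Pick k mutually distant points as initial centers (FPS)."""
    idx = [0]
    for _ in range(k - 1):
        d = ((points[:, None] - points[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    return points[idx].copy()

def kmeans(points, k, iters=10):
    """Plain k-means; returns centers and per-point cluster assignments."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        assign = ((points[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(0)
    return centers, assign

# Two well-separated blobs: assignments of the masked points form the
# dense targets a masked encoder would be trained to predict.
rng = np.random.default_rng(5)
pts = np.concatenate([rng.normal(0, 0.1, (50, 3)),
                      rng.normal(5, 0.1, (50, 3))])
centers, assign = kmeans(pts, k=2)
mask = rng.random(100) < 0.6           # 60% of the points are masked
targets = assign[mask]                 # reconstruction targets
```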
zh

[CV-29] A Robust Epipolar-Domain Regularization Algorithm for Light Field Depth Estimation

【速读】:该论文旨在解决光场成像中深度估计的鲁棒性问题,尤其是在噪声干扰和复杂真实场景下传统深度卷积神经网络(Deep Convolutional Neural Networks, CNNs)因计算开销大、泛化能力弱而难以满足应用需求的问题。其解决方案的关键在于提出一种轻量级深度估计流程,该流程融合光场视差信息与定向随机游走(Directed Random Walk)精修算法,通过构建概率图模型实现深度图的一致性增强,无需大规模训练数据或复杂调参即可获得稳定性能,从而在保持低计算复杂度的同时达到与前沿深度学习方法相当的精度。

链接: https://arxiv.org/abs/2508.08900
作者: Noor Islam S. Mohammad
机构: New York University Tandon School of Engineering (纽约大学坦顿工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust depth estimation in light field imaging remains a critical challenge for pattern recognition applications such as augmented reality, biomedical imaging, and scene reconstruction. While existing approaches often rely heavily on deep convolutional neural networks, they tend to incur high computational costs and struggle in noisy real-world environments. This paper proposes a novel lightweight depth estimation pipeline that integrates light field-based disparity information with a directed random walk refinement algorithm. Unlike traditional CNN-based methods, our approach enhances depth map consistency without requiring extensive training or large-scale datasets. The proposed method was evaluated on the 4D Light Field Benchmark dataset and a diverse set of real-world images. Experimental results indicate that while performance slightly declines under uncontrolled conditions, the algorithm consistently maintains low computational complexity and competitive accuracy compared to state-of-the-art deep learning models. These findings highlight the potential of our method as a robust and efficient alternative for depth estimation and segmentation in light field imaging. The work provides insights into practical algorithm design for light field-based pattern recognition and opens new directions for integrating probabilistic graph models with depth sensing frameworks.
zh
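上述“定向随机游走”细化思路可用一个极简的 numpy 骨架来示意:在像素图上以引导图相似度定义转移概率,对初始视差图做迭代加权平滑。以下代码仅为概念演示(边界用循环移位近似,参数与接口均为假设),并非论文官方实现:

```python
import numpy as np

def random_walk_refine(depth, guidance, n_iters=10, sigma=0.1):
    """基于随机游走的深度图平滑细化(示意实现)。

    depth:    初始视差/深度图 (H, W)
    guidance: 引导图(如中心视图灰度),用于计算转移概率
    """
    d = depth.astype(np.float64).copy()
    g = guidance.astype(np.float64)
    for _ in range(n_iters):
        new = np.zeros_like(d)
        wsum = np.zeros_like(d)
        # 四邻域随机游走:按引导图相似度加权的转移概率
        for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            shifted_d = np.roll(d, (dy, dx), axis=(0, 1))
            shifted_g = np.roll(g, (dy, dx), axis=(0, 1))
            w = np.exp(-((g - shifted_g) ** 2) / (2 * sigma ** 2))
            new += w * shifted_d
            wsum += w
        d = new / np.maximum(wsum, 1e-8)
    return d
```

引导图中相似的邻域获得更大转移权重,因此平滑主要发生在同质区域内,而跨越引导图边缘的传播被抑制,这正是随机游走式概率图模型的基本效果。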

[CV-30] Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos ICCV2025

【速读】:该论文旨在解决从单张肖像图像生成真实且可完整动画化全身虚拟形象(whole-body avatar)所面临的挑战,特别是现有方法在捕捉细微表情、身体动作和动态背景方面的局限性,以及当前评估数据集与指标难以全面衡量此类复杂性的不足。解决方案的关键在于提出一个名为Whole-Body Benchmark Dataset (WB-DH) 的开源多模态基准数据集,其核心优势包括:(1) 提供细粒度的多模态标注以支持精确引导,(2) 构建灵活的评估框架以系统化测试生成质量,(3) 开放数据集与工具链接,促进社区协作与模型迭代优化。

链接: https://arxiv.org/abs/2508.08891
作者: Chaoyi Wang,Yifan Yang,Jun Pei,Lijie Xia,Jianpo Liu,Xiaobing Yuan,Xinhan Di
机构: Shanghai Institute of Microsystem and Information Technology, CAS(中国科学院上海微系统与信息技术研究所); Shanghai Jiao Tong University(上海交通大学); Independent Researcher(独立研究者)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by ICCV 2025 Workshop MMFM4

点击查看摘要

Abstract:Creating realistic, fully animatable whole-body avatars from a single portrait is challenging due to limitations in capturing subtle expressions, body movements, and dynamic backgrounds. Current evaluation datasets and metrics fall short in addressing these complexities. To bridge this gap, we introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal benchmark designed for evaluating whole-body animatable avatar generation. Key features include: (1) detailed multi-modal annotations for fine-grained guidance, (2) a versatile evaluation framework, and (3) public access to the dataset and tools at this https URL.
zh

[CV-31] GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments ICCV2025

【速读】:该论文旨在解决神经渲染模型在场景变化时难以高效适应的问题,现有方法或需大量重训练,或无法捕捉随时间变化的细节。其解决方案的关键在于提出GaussianUpdate方法,通过结合3D高斯表示(3D Gaussian representation)与持续学习(continual learning),实现对当前数据的有效更新并保留历史场景信息;同时引入一种多阶段更新策略以显式建模不同类型的场景变化,并采用基于生成回放(generative replay)的可见性感知持续学习机制,使模型能够自适应更新而无需存储原始图像,从而在基准数据集上实现高质量、实时的视图合成及跨时段变化可视化。

链接: https://arxiv.org/abs/2508.08867
作者: Lin Zeng,Boming Zhao,Jiarui Hu,Xujie Shen,Ziqiang Dang,Hujun Bao,Zhaopeng Cui
机构: State Key Lab of CAD & CG, Zhejiang University (浙江大学CAD&CG国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025

点击查看摘要

Abstract:Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. The experiments on the benchmark dataset demonstrate that our method achieves superior, real-time rendering with the capability of visualizing changes over time.
zh

[CV-32] Adaptive High-Frequency Preprocessing for Video Coding

【速读】:该论文旨在解决视频编码中高频分量(high-frequency components)在维持画面清晰度与真实感的同时,显著增加码率、进而导致带宽和存储成本上升的问题。解决方案的关键在于提出一种端到端的学习框架,其核心是频率感知特征金字塔预测网络(Frequency-attentive Feature Pyramid Prediction Network, FFPN),该网络能够自适应地预测最优的高频预处理策略,指导后续滤波操作以实现压缩后码率与主观质量之间的最佳权衡。训练时通过伪标签(pseudo-labeling)方式,基于不同预处理类型和强度下的率失真(rate-distortion, RD)性能比较确定最优策略,并采用最新的视觉质量评估指标进行失真计算,从而实现主观画质提升与码率节省的双重目标。

链接: https://arxiv.org/abs/2508.08849
作者: Yingxue Pang,Shijie Zhao,Junlin Li,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:High-frequency components are crucial for maintaining video clarity and realism, but they also significantly impact coding bitrate, resulting in increased bandwidth and storage costs. This paper presents an end-to-end learning-based framework for adaptive high-frequency preprocessing to enhance subjective quality and save bitrate in video coding. The framework employs the Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the optimal high-frequency preprocessing strategy, guiding subsequent filtering operators to achieve the optimal tradeoff between bitrate and quality after compression. For training FFPN, we pseudo-label each training video with the optimal strategy, determined by comparing the rate-distortion (RD) performance across different preprocessing types and strengths. Distortion is measured using the latest quality assessment metric. Comprehensive evaluations on multiple datasets demonstrate the visually appealing enhancement capabilities and bitrate savings achieved by our framework.
zh
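FFPN 训练所依赖的伪标签构造(在不同预处理类型与强度的组合中挑出率失真最优者)可以概括为一次简单的代价最小化搜索。下面是一个示意骨架,其中 encode 与 quality 为假设的接口,RD 代价的具体形式也仅作演示:

```python
def pseudo_label_strategy(video, strategies, encode, quality, lam=0.01):
    """为一段训练视频挑选率失真最优的高频预处理策略(伪标签,示意实现)。

    strategies: {名称: 预处理函数} 字典(类型与强度的组合)
    encode:     返回 (码率, 重建帧) 的编码代理函数 —— 假设接口
    quality:    感知质量评估函数,分数越高越好 —— 假设接口
    """
    best_name, best_cost = None, float("inf")
    for name, pre in strategies.items():
        processed = pre(video)
        bitrate, recon = encode(processed)
        # RD 代价:码率 + lambda * 失真(这里把失真取为负质量分)
        cost = bitrate + lam * (-quality(recon))
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name
```

拿到逐视频的伪标签后,FFPN 即可按普通分类任务训练,在推理时直接预测最优策略而无需再做编码搜索。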

[CV-33] DiffPhysCam: Differentiable Physics-Based Camera Simulation for Inverse Rendering and Embodied AI

【速读】:该论文旨在解决现有虚拟相机在机器人和具身人工智能(Embodied AI)应用中难以实现高保真图像合成与精确场景重建的问题,具体表现为对相机内参控制不足、光学伪影建模不准确以及缺乏可调校准参数,从而限制了仿真到现实(sim-to-real)的迁移效果。解决方案的关键在于提出DiffPhysCam——一个可微分相机模拟器,其核心是一个多阶段流水线:首先提供对相机设置的细粒度控制,其次建模关键光学效应(如散焦模糊),并支持基于真实世界数据的校准;该框架同时支持前向渲染(用于图像合成)与逆向渲染(用于3D场景重建,包括网格和材质纹理优化),从而显著提升机器人感知性能,并成功构建真实场景的数字孪生体用于多物理环境中的自主导航验证。

链接: https://arxiv.org/abs/2508.08831
作者: Bo-Hsun Chen,Nevindu M. Batagoda,Dan Negrut
机构: Simulation-Based Engineering Lab, University of Wisconsin-Madison (威斯康星大学麦迪逊分校仿真工程实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 19 pages, 17 figures, and 4 tables

点击查看摘要

Abstract:We introduce DiffPhysCam, a differentiable camera simulator designed to support robotics and embodied AI applications by enabling gradient-based optimization in visual perception pipelines. Generating synthetic images that closely mimic those from real cameras is essential for training visual models and enabling end-to-end visuomotor learning. Moreover, differentiable rendering allows inverse reconstruction of real-world scenes as digital twins, facilitating simulation-based robotics training. However, existing virtual cameras offer limited control over intrinsic settings, poorly capture optical artifacts, and lack tunable calibration parameters – hindering sim-to-real transfer. DiffPhysCam addresses these limitations through a multi-stage pipeline that provides fine-grained control over camera settings, models key optical effects such as defocus blur, and supports calibration with real-world data. It enables both forward rendering for image synthesis and inverse rendering for 3D scene reconstruction, including mesh and material texture optimization. We show that DiffPhysCam enhances robotic perception performance in synthetic image tasks. As an illustrative example, we create a digital twin of a real-world scene using inverse rendering, simulate it in a multi-physics environment, and demonstrate navigation of an autonomous ground vehicle using images generated by DiffPhysCam.
zh
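文中提到的散焦模糊等光学效应建模,其核心之一是薄透镜模型下的弥散圆(circle of confusion)直径计算。下面给出该经典公式的一个极简实现,参数命名为本文假设,仅用于说明概念,与 DiffPhysCam 的具体实现无关:

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """薄透镜模型下的弥散圆直径(散焦模糊程度),单位与输入一致。

    depth:      物点距离
    focus_dist: 对焦平面距离
    focal_len:  焦距
    f_number:   光圈 F 数(孔径直径 = focal_len / f_number)
    """
    aperture = focal_len / f_number
    return abs(aperture * focal_len * (depth - focus_dist) /
               (depth * (focus_dist - focal_len)))
```

在可微渲染流水线中,这个逐像素的弥散圆直径通常被用来决定散焦模糊核的大小,由于公式对深度可导,梯度可以回传到场景参数上。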

[CV-34] Silicon Minds versus Human Hearts: The Wisdom of Crowds Beats the Wisdom of AI in Emotion Recognition

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在情感识别能力上的认知盲区问题,特别是其是否能够达到甚至超越人类专家水平。研究通过引入“读心眼测试”(Reading the Mind in the Eyes Test, RMET)及其多族裔版本(MRMET),系统评估了MLLMs在识别细微情绪线索方面的表现,并与人类参与者进行对比。关键解决方案在于:首先,验证了单个MLLM在个体层面上已具备优于人类的识别准确率;其次,发现当聚合人类决策以模拟群体智慧时,人类集体显著优于聚合后的MLLM预测;最终提出“增强智能”(augmented intelligence)协同策略——将人类判断与MLLM输出结合,实现比单独使用人类或MLLM更高的准确性。这一发现揭示了人类集体智慧与人机协作潜力是推动情感智能AI发展的核心路径。

链接: https://arxiv.org/abs/2508.08830
作者: Mustafa Akben,Vinayaka Gude,Haya Ajjan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The ability to discern subtle emotional cues is fundamental to human social intelligence. As artificial intelligence (AI) becomes increasingly common, AI’s ability to recognize and respond to human emotions is crucial for effective human-AI interactions. In particular, whether such systems can match or surpass human experts remains to be seen. However, the emotional intelligence of AI, particularly multimodal large language models (MLLMs), remains largely unexplored. This study evaluates the emotion recognition abilities of MLLMs using the Reading the Mind in the Eyes Test (RMET) and its multiracial counterpart (MRMET), and compares their performance against human participants. Results show that, on average, MLLMs outperform humans in accurately identifying emotions across both tests. This trend persists even when comparing performance across low, medium, and expert-level performing groups. Yet when we aggregate independent human decisions to simulate collective intelligence, human groups significantly surpass the performance of aggregated MLLM predictions, highlighting the wisdom of the crowd. Moreover, a collaborative approach (augmented intelligence) that combines human and MLLM predictions achieves greater accuracy than either humans or MLLMs alone. These results suggest that while MLLMs exhibit strong emotion recognition at the individual level, the collective intelligence of humans and the synergistic potential of human-AI collaboration offer the most promising path toward effective emotional AI. We discuss the implications of these findings for the development of emotionally intelligent AI systems and future research directions.
zh
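文中“聚合独立人类判断以模拟群体智慧”的做法,本质上是对同一题目的多个标签做多数投票,再与平均个体准确率对比。以下为一个自包含的小示例(数据与接口均为演示用假设):

```python
from collections import Counter

def majority_vote(labels):
    """多数投票:聚合多名独立判断者对同一道 RMET 题目的情绪标签。"""
    return Counter(labels).most_common(1)[0][0]

def crowd_vs_individual(per_item_preds, truth):
    """比较群体投票准确率与平均个体准确率。

    per_item_preds: 形如 [题目][判断者] 的标签矩阵(演示用假设格式)
    truth:          每道题目的真值标签
    """
    n = len(truth)
    crowd_acc = sum(majority_vote(p) == t
                    for p, t in zip(per_item_preds, truth)) / n
    indiv_acc = sum(sum(lbl == t for lbl in p) / len(p)
                    for p, t in zip(per_item_preds, truth)) / n
    return crowd_acc, indiv_acc
```

只要个体错误彼此不完全相关,多数投票就可以消掉一部分独立噪声,这正是论文中人类群体超过聚合 MLLM 预测时所利用的效应。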

[CV-35] A Parametric Bi-Directional Curvature-Based Framework for Image Artifact Classification and Quantification

【速读】:该论文旨在解决无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)中如何准确识别和量化图像退化类型的问题。传统方法往往难以区分不同类型的失真(如模糊与噪声),且缺乏对纹理结构变化的敏感性。解决方案的关键在于提出一种基于方向性图像曲率分析的新框架,核心是定义像素级的各向异性纹理丰富度(Anisotropic Texture Richness, ATR)指标,该指标通过两个可调阈值(一个宽松、一个严格)来量化正交方向上的纹理抑制程度。该框架采用两阶段系统:第一阶段利用两种特定配置下的ATR响应特征实现对主要退化类型(模糊 vs. 噪声)的分类,准确率超过97%;第二阶段则针对已分类的退化类型使用专用回归模型将ATR得分映射为质量评分。该方法在LIVE数据集上实现了高达0.892的决定系数(R²)和仅5.17 DMOS点的均方根误差(RMSE),表明其在图像退化分类与定量评估方面具有高精度与鲁棒性。

链接: https://arxiv.org/abs/2508.08824
作者: Diego Frias
机构: Bahia State University (UNEB) (巴伊亚州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work presents a novel framework for No-Reference Image Quality Assessment (NR-IQA) founded on the analysis of directional image curvature. Within this framework, we define a measure of Anisotropic Texture Richness (ATR), which is computed at the pixel level using two tunable thresholds – one permissive and one restrictive – that quantify orthogonal texture suppression. When its parameters are optimized for a specific artifact, the resulting ATR score serves as a high-performance quality metric, achieving Spearman correlations with human perception of approximately -0.93 for Gaussian blur and -0.95 for white noise on the LIVE dataset. The primary contribution is a two-stage system that leverages the differential response of ATR to various distortions. First, the system utilizes the signature from two specialist ATR configurations to classify the primary artifact type (blur vs. noise) with over 97% accuracy. Second, following classification, it employs a dedicated regression model mapping the relevant ATR score to a quality rating to quantify the degradation. On a combined dataset, the complete system predicts human scores with a coefficient of determination (R2) of 0.892 and a Root Mean Square Error (RMSE) of 5.17 DMOS points. This error corresponds to just 7.4% of the dataset’s total quality range, demonstrating high predictive accuracy. This establishes our framework as a robust, dual-purpose tool for the classification and subsequent quantification of image degradation.
zh
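ATR 的核心思想——用宽松与严格两个阈值量化“一个方向纹理明显而正交方向被抑制”——可以用二阶差分近似方向曲率来粗略示意。注意以下定义是本文为说明而作的假设,与论文原始公式未必一致:

```python
import numpy as np

def atr_map(img, t_perm=0.05, t_restr=0.2):
    """各向异性纹理丰富度(ATR)的像素级示意计算。

    用水平/垂直二阶差分近似方向曲率;t_perm(宽松)与 t_restr(严格)
    两个阈值共同刻画"正交方向纹理被抑制"的程度。
    """
    img = img.astype(np.float64)
    cxx = np.abs(np.diff(img, n=2, axis=1))[:-2, :]   # 水平方向曲率 (H-2, W-2)
    cyy = np.abs(np.diff(img, n=2, axis=0))[:, :-2]   # 垂直方向曲率 (H-2, W-2)
    # 一个方向曲率超过宽松阈值、正交方向低于严格阈值 → 各向异性纹理像素
    aniso_x = (cxx > t_perm) & (cyy < t_restr)
    aniso_y = (cyy > t_perm) & (cxx < t_restr)
    return (aniso_x | aniso_y).mean()
```

高斯模糊会同时压低两个方向的曲率,而白噪声会同时抬高两个方向,二者对这类各向异性统计量的响应截然不同,这为论文第一阶段“模糊 vs. 噪声”分类提供了直观依据。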

[CV-36] 3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs

【速读】:该论文旨在解决多模态大语言模型(Multi-Modal Large Language Models, MLLMs)在空间推理能力上的局限性,尤其是其难以生成具有几何结构和语义标签的3D物体原型的问题。解决方案的关键在于提出一种无需额外训练数据或详细用户指令的代理式(agentic)框架——3DFroMLLM,该框架通过设计者(designer)、编码器(coder)和视觉检查员(visual inspector)组成的迭代优化循环,直接从MLLM中生成包含几何信息与部件标签的3D对象原型。该方法利用生成的渲染图像进行图像分类预训练,性能优于先前方法15%,并进一步通过使用这些带部件标签的3D原型对CLIP模型进行微调,在不依赖任何人工标注数据的情况下实现了细粒度视觉-语言模型中部件分割准确率提升55%。

链接: https://arxiv.org/abs/2508.08821
作者: Noor Ahmed,Cameron Braunstein,Steffen Eger,Eddy Ilg
机构: University of Tuebingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong capabilities in learning joint representations from text and images. However, their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel framework that enables the generation of 3D object prototypes directly from MLLMs, including geometry and part labels. Our pipeline is agentic, comprising a designer, coder, and visual inspector operating in a refinement loop. Notably, our approach requires no additional training data or detailed user instructions. Building on prior work in 2D generation, we demonstrate that rendered images produced by our framework can be effectively used for image classification pretraining tasks and outperforms previous methods by 15%. As a compelling real-world use case, we show that the generated prototypes can be leveraged to improve fine-grained vision-language models by using the rendered, part-labeled prototypes to fine-tune CLIP for part segmentation and achieving a 55% accuracy improvement without relying on any additional human-labeled data.
zh
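3DFroMLLM 的设计者-编码者-视觉检查员迭代循环,可以抽象为如下控制流骨架。三个角色在实际系统中均为对 MLLM 的调用,这里以普通函数占位,全部为假设接口:

```python
def refinement_loop(designer, coder, inspector, task, max_rounds=5):
    """设计者-编码者-检查员迭代循环的骨架(示意,非官方实现)。

    designer:  根据任务(及可选反馈)给出 3D 原型结构描述
    coder:     把描述转为可渲染的几何程序
    inspector: 渲染并审查结果,无问题时返回 None,否则返回反馈
    """
    spec = designer(task)
    for _ in range(max_rounds):
        program = coder(spec)
        feedback = inspector(program)
        if feedback is None:            # 检查通过,循环结束
            return program
        spec = designer(task, feedback) # 带反馈重新设计
    return program
```

这种“生成-检查-修正”的闭环不需要额外训练数据,质量完全由检查员的反馈信号驱动,与论文描述的 agentic 流程一致。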

[CV-37] TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models

【速读】:该论文旨在解决多概念个性化图像生成中因多个LoRA模块组合导致的身份缺失(identity missing)和视觉特征泄露(visual feature leakage)问题。其关键解决方案是提出Token-Aware LoRA(TARA),通过引入显式的token mask来约束每个LoRA模块仅关注与其对应的稀有token,从而避免模块间token级干扰;同时设计了一种训练目标,促使稀有token的空间注意力与特定概念区域对齐,以缓解空间错位问题。该方法支持无需再训练即可在推理阶段直接注入多个独立训练的TARA模块,实现高效且身份保持稳定的多概念图像合成。

链接: https://arxiv.org/abs/2508.08812
作者: Yuqi Peng,Lingtao Zheng,Yufeng Yang,Yi Huang,Mingfu Yan,Jianzhuang Liu,Shifeng Chen
机构: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Northeastern University (东北大学); Shenzhen University of Advanced Technology (深圳大学先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Personalized text-to-image generation aims to synthesize novel images of a specific subject or style using only a few reference images. Recent methods based on Low-Rank Adaptation (LoRA) enable efficient single-concept customization by injecting lightweight, concept-specific adapters into pre-trained diffusion models. However, combining multiple LoRA modules for multi-concept generation often leads to identity missing and visual feature leakage. In this work, we identify two key issues behind these failures: (1) token-wise interference among different LoRA modules, and (2) spatial misalignment between the attention map of a rare token and its corresponding concept-specific region. To address these issues, we propose Token-Aware LoRA (TARA), which introduces a token mask to explicitly constrain each module to focus on its associated rare token to avoid interference, and a training objective that encourages the spatial attention of a rare token to align with its concept region. Our method enables training-free multi-concept composition by directly injecting multiple independently trained TARA modules at inference time. Experimental results demonstrate that TARA enables efficient multi-concept inference and effectively preserves the visual identity of each concept by avoiding mutual interference between LoRA modules. The code and models are available at this https URL.
zh
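TARA 中“让每个 LoRA 模块只作用于其对应稀有 token”的 token 掩码思想,可以用如下 numpy 前向示意:LoRA 低秩增量仅在稀有 token 的位置被叠加。这只是概念性草图,非官方实现,权重与接口均为假设:

```python
import numpy as np

def token_masked_lora(x, W, A, B, token_ids, rare_token_id, scale=1.0):
    """token 级掩码 LoRA 前向(示意)。

    x:         (batch, seq, dim) 文本特征
    W:         (dim, dim) 冻结的原始权重
    A: (dim, r), B: (r, dim) LoRA 低秩因子
    token_ids: (batch, seq) 输入 token id
    LoRA 增量只作用于等于 rare_token_id 的位置,避免模块间 token 级干扰。
    """
    base = x @ W
    delta = (x @ A) @ B * scale
    mask = (token_ids == rare_token_id)[..., None].astype(x.dtype)
    return base + mask * delta
```

由于每个模块的增量被限制在各自稀有 token 的位置上,多个独立训练的模块在推理时直接叠加也不会互相覆写对方概念的表示。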

[CV-38] Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment ICCV2025

【速读】:该论文旨在解决轻量化语义分割模型在资源受限设备上部署时面临的性能瓶颈问题,其核心挑战在于:现有基于逐像素分类的范式导致类别表示与图像特征之间存在固有错位(misalignment),尤其在高效场景下,该范式隐含假设同一类别的图像像素特征在不同图像中保持不变,这在实际复杂场景中难以满足。为应对这一困境,作者提出了一种耦合双分支偏移学习(coupled dual-branch offset learning)范式,其关键在于显式地学习特征偏移(feature offset)和类别偏移(class offset),从而动态优化类别表示与空间图像特征之间的对齐关系。该方法可无缝集成至现有架构中,无需额外结构改动,在多个主流数据集上均实现显著性能提升,同时仅引入极少量参数(0.1–0.2M)。

链接: https://arxiv.org/abs/2508.08811
作者: Shi-Chen Zhang,Yunheng Li,Yu-Huan Wu,Qibin Hou,Ming-Ming Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025. Project page: this https URL

点击查看摘要

Abstract:Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: Image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be adopted to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.
zh
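OffSeg 的耦合双分支偏移学习范式,在推理时可以概括为:先用预测的偏移分别修正像素特征与类别表示,再做点积得到分割 logits。以下为一个最小化示意(两组偏移在实际模型中由网络预测,此处直接作为输入,非官方实现):

```python
import numpy as np

def offset_segmentation_logits(feat, cls_emb, feat_off, cls_off):
    """双分支偏移学习的分类步骤(示意)。

    feat:     (H*W, D) 像素特征
    cls_emb:  (C, D)   初始类别表示
    feat_off: (H*W, D) 预测的特征偏移
    cls_off:  (C, D)   预测的类别偏移
    """
    refined_feat = feat + feat_off       # 动态修正空间图像特征
    refined_cls = cls_emb + cls_off      # 动态修正类别表示
    return refined_feat @ refined_cls.T  # (H*W, C) 分割 logits
```

与逐像素固定类别向量的分类范式相比,这种按图像动态修正两侧表示的做法放松了“同类像素特征跨图像不变”的隐含假设,而额外参数只来自两个轻量偏移分支。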

[CV-39] Identity-Preserving Aging and De-Aging of Faces in the StyleGAN Latent Space

【速读】:该论文旨在解决生成式AI在人脸年龄变换(老化或去老化)应用中面临的两大核心问题:一是现有方法多依赖条件生成对抗网络(Conditional GANs)、扩散模型或视觉语言模型(VLMs),需复杂训练、大量数据且难以保证结果一致性;二是身份保真度缺乏系统性评估与控制,现有研究通常仅在单一人脸识别系统上测试,未明确保障生成图像中身份信息的保留。解决方案的关键在于利用StyleGAN2的隐空间(latent space)编辑技术,通过支持向量建模提取老化/去老化方向,并结合特征选择策略识别出身份保持子空间(identity-preserving subspace)。在此基础上,提出一个简洁实用的公式以估算确保身份不变的最大年龄变换参数范围,从而实现对输入人脸的可控年龄调整同时维持其身份特征。

链接: https://arxiv.org/abs/2508.08808
作者: Luis S. Luevano,Pavel Korshunov,Sebastien Marcel
机构: Idiap Research Institute (Idiap 研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE International Joint Conference on Biometrics (IJCB), 2025

点击查看摘要

Abstract:Face aging or de-aging with generative AI has gained significant attention for its applications in fields such as forensics, security, and media. However, most state-of-the-art methods rely on conditional Generative Adversarial Networks (GANs), Diffusion-based models, or Visual Language Models (VLMs) to age or de-age faces based on predefined age categories and conditioning via loss functions, fine-tuning, or text prompts. The reliance on such conditioning leads to complex training requirements, increased data needs, and challenges in generating consistent results. Additionally, identity preservation is rarely taken into account, or is evaluated only on a single face recognition system without any control or guarantees on whether identity would be preserved in a generated aged/de-aged face. In this paper, we propose to synthesize aged and de-aged faces via editing the latent space of StyleGAN2 using a simple support vector modeling of the aging/de-aging direction and several feature selection approaches. By using two state-of-the-art face recognition systems, we empirically find the identity-preserving subspace within the StyleGAN2 latent space, so that the apparent age of a given face can be changed while preserving the identity. We then propose a simple yet practical formula for estimating the limits on aging/de-aging parameters that ensures identity preservation for a given input face. Using our method and estimated parameters, we have generated a public dataset of synthetic faces at different ages that can be used for benchmarking cross-age face recognition, age assurance systems, or systems for detection of synthetic images. Our code and dataset are available at the project page this https URL
zh
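该方法的隐空间编辑可以概括为:沿支持向量建模得到的年龄方向移动隐编码,并把步长裁剪到估计出的身份保持上限之内。以下为示意实现,alpha_max 在论文中由专门公式估计,此处作为输入假设:

```python
import numpy as np

def age_edit(w, direction, alpha, alpha_max):
    """StyleGAN2 隐空间的年龄编辑(示意)。

    w:         (D,) 隐编码
    direction: (D,) 年龄方向(内部归一化为单位向量)
    alpha:     期望的编辑步长,正负对应老化/去老化
    alpha_max: 身份保持的步长上限(假设已由外部估计)
    """
    d = direction / np.linalg.norm(direction)
    a = float(np.clip(alpha, -alpha_max, alpha_max))  # 裁剪到身份保持范围
    return w + a * d
```

把可行步长约束在身份保持子空间允许的范围内,是该方法能同时改变表观年龄并保证人脸识别系统判定身份不变的关键。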

[CV-40] MonoPartNeRF: Human Reconstruction from Monocular Video via Part-Based Neural Radiance Fields

【速读】:该论文旨在解决单目动态人体重建与渲染中面临的两大挑战:一是复杂姿态变化下部件边界处的不自然过渡,二是单视角设置下遮挡区域的重建精度不足。其解决方案的关键在于提出MonoPartNeRF框架,通过构建双向变形模型(结合刚性与非刚性变换)建立观测空间与规范空间之间的连续可逆映射,并将采样点投影至参数化表面-时间空间(u, v, t)以更好捕捉非刚性运动;同时引入基于部件的姿态嵌入机制,分解全局姿态向量为局部关节嵌入,并结合关键帧姿态检索与三方向插值引导姿态感知特征采样,辅以可学习外观码通过注意力机制建模动态纹理变化,从而实现平滑过渡和鲁棒的遮挡恢复。

链接: https://arxiv.org/abs/2508.08798
作者: Yao Lu,Jiawei Li,Ming Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, Neural Radiance Fields (NeRF) have achieved remarkable progress in dynamic human reconstruction and rendering. Part-based rendering paradigms, guided by human segmentation, allow for flexible parameter allocation based on structural complexity, thereby enhancing representational efficiency. However, existing methods still struggle with complex pose variations, often producing unnatural transitions at part boundaries and failing to reconstruct occluded regions accurately in monocular settings. We propose MonoPartNeRF, a novel framework for monocular dynamic human rendering that ensures smooth transitions and robust occlusion recovery. First, we build a bidirectional deformation model that combines rigid and non-rigid transformations to establish a continuous, reversible mapping between observation and canonical spaces. Sampling points are projected into a parameterized surface-time space (u, v, t) to better capture non-rigid motion. A consistency loss further suppresses deformation-induced artifacts and discontinuities. We introduce a part-based pose embedding mechanism that decomposes global pose vectors into local joint embeddings based on body regions. This is combined with keyframe pose retrieval and interpolation, along three orthogonal directions, to guide pose-aware feature sampling. A learnable appearance code is integrated via attention to model dynamic texture changes effectively. Experiments on the ZJU-MoCap and MonoCap datasets demonstrate that our method significantly outperforms prior approaches under complex pose and occlusion conditions, achieving superior joint alignment, texture fidelity, and structural continuity.
zh

[CV-41] Region-Adaptive Video Sharpening via Rate-Perception Optimization

【速读】:该论文旨在解决视频锐化过程中因均匀增强强度忽略纹理差异而导致的画质下降问题,以及锐化带来的比特率增加与比特分配不优化的问题。解决方案的关键在于提出一种端到端的区域自适应视频锐化模型 RPO-AdaSharp,利用编码树单元(Coding Tree Unit, CTU)划分掩码作为先验信息,引导并约束增强后比特的分布,从而在提升主观视觉质量的同时实现比特率的有效节省。

链接: https://arxiv.org/abs/2508.08794
作者: Yingxue Pang,Shijie Zhao,Mengxi Guo,Junlin Li,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sharpening is a widely adopted video enhancement technique. However, uniform sharpening intensity ignores texture variations, degrading video quality. Sharpening also increases bitrate, and there’s a lack of techniques to optimally allocate these additional bits across diverse regions. Thus, this paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening model for both perceptual enhancement and bitrate savings. We use the coding tree unit (CTU) partition mask as prior information to guide and constrain the allocation of increased bits. Experiments on benchmarks demonstrate the effectiveness of the proposed model qualitatively and quantitatively.
zh
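区域自适应锐化的基本形式可以用“逐像素强度图加权的反锐化掩模”来示意:实际模型中强度图由 CTU 划分掩码等先验引导的网络端到端预测,这里直接作为输入。以下仅为概念草图,非 RPO-AdaSharp 官方实现:

```python
import numpy as np

def region_adaptive_sharpen(img, strength_map, k=1.0):
    """按区域自适应强度做反锐化掩模(unsharp masking)的示意实现。

    img:          (H, W) 灰度图
    strength_map: 与 img 同尺寸的逐像素锐化强度
    """
    img = img.astype(np.float64)
    # 3x3 盒式模糊作为低通(示意;实际可换成高斯核)
    pad = np.pad(img, 1, mode="edge")
    blur = sum(
        pad[1 + dy:1 + dy + img.shape[0], 1 + dx:1 + dx + img.shape[1]]
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
    ) / 9.0
    high = img - blur                      # 高频残差
    return img + k * strength_map * high   # 区域自适应增强
```

纹理丰富、对码率敏感的区域可以分到较低强度,平坦区域则允许更强增强,从而在主观质量与增量比特之间做逐区域权衡。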

[CV-42] DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation

【速读】:该论文旨在解决动物姿态估计(animal pose estimation)中存在的挑战,包括物种间形态差异大、身体结构复杂以及标注数据稀缺等问题。其解决方案的关键在于提出了一种基于扩散模型(diffusion model)的新型自顶向下框架DiffPose-Animal,将姿态估计建模为一个去噪过程;同时引入大语言模型(LLM)提取物种特定的全局解剖先验和局部关键点语义信息,并通过交叉注意力模块将其与图像特征融合,从而在去噪过程中提供生物合理约束;此外,设计了基于扩散的关键点解码器以逐步优化预测结果,显著提升了对遮挡和标注稀疏场景的鲁棒性。

链接: https://arxiv.org/abs/2508.08783
作者: Tianyu Xiong,Dayi Tan,Wei Tian
机构: Guanghua Cambridge International School (光华剑桥国际学校); School of Automotive Studies, Tongji University (同济大学汽车学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages,2figures

点击查看摘要

Abstract:Animal pose estimation is a fundamental task in computer vision, with growing importance in ecological monitoring, behavioral analysis, and intelligent livestock management. Compared to human pose estimation, animal pose estimation is more challenging due to high interspecies morphological diversity, complex body structures, and limited annotated data. In this work, we introduce DiffPose-Animal, a novel diffusion-based framework for top-down animal pose estimation. Unlike traditional heatmap regression methods, DiffPose-Animal reformulates pose estimation as a denoising process under the generative framework of diffusion models. To enhance semantic guidance during keypoint generation, we leverage large language models (LLMs) to extract both global anatomical priors and local keypoint-wise semantics based on species-specific prompts. These textual priors are encoded and fused with image features via cross-attention modules to provide biologically meaningful constraints throughout the denoising process. Additionally, a diffusion-based keypoint decoder is designed to progressively refine pose predictions, improving robustness to occlusion and annotation sparsity. Extensive experiments on public animal pose datasets demonstrate the effectiveness and generalization capability of our method, especially under challenging scenarios with diverse species, cluttered backgrounds, and incomplete keypoints.
zh

[CV-43] SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)

【速读】:该论文旨在解决当前3D检索系统在复杂真实场景中表现不足的问题,尤其是针对基于模糊自由描述词从杂乱场景中识别特定物体的挑战。现有方法多适用于简单、受控环境(如从裁剪图像或简短描述中检索对象),难以应对包含复杂背景和语义模糊性的实际应用需求。解决方案的关键在于提出ROOMELSA这一新基准,其核心创新是要求模型在全景房间图像中精确定位目标区域,并从大规模数据库中准确检索对应的3D模型。该基准涵盖超过1,600个公寓场景、近5,200个房间及44,000个查询,强调视觉与语言理解的深度融合,从而推动更鲁棒、面向现实世界的3D识别系统的发展。

链接: https://arxiv.org/abs/2508.08781
作者: Trong-Thuan Nguyen,Viet-Tham Huynh,Quang-Thuc Nguyen,Hoang-Phuc Nguyen,Long Le Bao,Thai Hoang Minh,Minh Nguyen Anh,Thang Nguyen Tien,Phat Nguyen Thuan,Huy Nguyen Phong,Bao Huynh Thai,Vinh-Tiep Nguyen,Duc-Vu Nguyen,Phu-Hoa Pham,Minh-Huy Le-Hoang,Nguyen-Khang Le,Minh-Chinh Nguyen,Minh-Quan Ho,Ngoc-Long Tran,Hien-Long Le-Hoang,Man-Khoi Tran,Anh-Duong Tran,Kim Nguyen,Quan Nguyen Hung,Dat Phan Thanh,Hoang Tran Van,Tien Huynh Viet,Nhan Nguyen Viet Thien,Dinh-Khoi Vo,Van-Loc Nguyen,Trung-Nghia Le,Tam V. Nguyen,Minh-Triet Tran
机构: Ho Chi Minh City University of Science (胡志明市科学大学); Faculty of Information Technology (信息科技学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent 3D retrieval systems are typically designed for simple, controlled scenarios, such as identifying an object from a cropped image or a brief description. However, real-world scenarios are more complex, often requiring the recognition of an object in a cluttered scene based on a vague, free-form description. To this end, we present ROOMELSA, a new benchmark designed to evaluate a system’s ability to interpret natural language. Specifically, ROOMELSA attends to a specific region within a panoramic room image and accurately retrieves the corresponding 3D model from a large database. In addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms, and more than 44,000 targeted queries. Empirically, while coarse object retrieval is largely solved, only one top-performing model consistently ranked the correct match first across nearly all test cases. Notably, a lightweight CLIP-based model also performed well, although it struggled with subtle variations in materials, part structures, and contextual cues, resulting in occasional errors. These findings highlight the importance of tightly integrating visual and language understanding. By bridging the gap between scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new benchmark for advancing robust, real-world 3D recognition systems.
zh

[CV-44] Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation

【速读】:该论文旨在解决深度伪造(deepfake)检测模型在实验室环境下训练后难以泛化到真实社交网络场景的问题。其核心挑战在于,YouTube、Facebook等平台对上传视频实施了专有的压缩和重采样处理,这些操作会破坏低层次的取证特征(forensic cues),导致检测器性能显著下降。解决方案的关键在于提出首个可模拟社交平台视频分享流水线的框架:通过少量上传视频估计压缩与重缩放参数,并基于此构建本地模拟器,在无需API访问的情况下对大规模数据集重现平台特异性伪影。实验表明,该方法生成的数据能准确复现真实上传视频的退化模式,且在模拟数据上微调的检测器性能接近于使用真实共享媒体训练的结果,从而为压缩视频内容下的深伪检测提供了可扩展且实用的解决方案。

链接: https://arxiv.org/abs/2508.08765
作者: Andrea Montibeller,Dasara Shullani,Daniele Baracchi,Alessandro Piva,Giulia Boato
机构: University of Trento (特伦托大学); Truebees srl (Truebees公司); University of Florence (佛罗伦萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing presence of AI-generated videos on social networks poses new challenges for deepfake detection, as detectors trained under controlled conditions often fail to generalize to real-world scenarios. A key factor behind this gap is the aggressive, proprietary compression applied by platforms like YouTube and Facebook, which launder low-level forensic cues. However, replicating these transformations at scale is difficult due to API limitations and data-sharing constraints. For these reasons, we propose a first framework that emulates the video sharing pipelines of social networks by estimating compression and resizing parameters from a small set of uploaded videos. These parameters enable a local emulator capable of reproducing platform-specific artifacts on large datasets without direct API access. Experiments on FaceForensics++ videos shared via social networks demonstrate that our emulated data closely matches the degradation patterns of real uploads. Furthermore, detectors fine-tuned on emulated videos achieve comparable performance to those trained on actual shared media. Our approach offers a scalable and practical solution for bridging the gap between lab-based training and real-world deployment of deepfake detectors, particularly in the underexplored domain of compressed video content.
zh
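框架中“从少量上传视频估计平台压缩与重缩放参数”的步骤,可以抽象为在候选参数网格上最小化模拟输出与真实回传视频之间退化特征距离的搜索。以下骨架中 emulate 与 distance 均为假设接口,仅用于说明流程:

```python
def estimate_platform_params(observed, candidates, emulate, distance):
    """网格搜索最接近平台处理流水线的本地模拟参数(示意)。

    observed:   [(原始视频, 平台回传视频), ...] 少量配对样本
    candidates: 待搜索的参数组合列表,如 [{"crf": 23, "scale": 0.5}, ...]
    emulate:    本地模拟器 emulate(video, params) -> 处理后视频 —— 假设接口
    distance:   两段视频退化特征之间的距离 —— 假设接口
    """
    best_params, best_err = None, float("inf")
    for params in candidates:
        err = sum(distance(emulate(src, params), shared)
                  for src, shared in observed)
        if err < best_err:
            best_params, best_err = params, err
    return best_params
```

一旦搜索到参数,本地模拟器就可以在不访问平台 API 的情况下,对大规模训练集批量复现平台特有的压缩伪影。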

[CV-45] Exploring Palette based Color Guidance in Diffusion Models ACM-MM2025

【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型在整体色彩方案控制上的不足,尤其是在未明确提及的背景元素和次要物体上难以实现精确颜色调控的问题。现有方法主要依赖提示词(prompt)进行颜色指定,但缺乏对全局色彩一致性的有效引导。解决方案的关键在于引入一种独立的色彩调色板(color palette)引导机制,将其作为与提示词并行的控制信号嵌入扩散模型框架中,从而增强模型对图像整体色彩布局的可控性。通过构建专用的调色板-文本-图像数据集,并系统评估不同调色板表示方式的效果,实验证明该方法显著提升了生成图像在色彩一致性与用户意图匹配度方面的表现。

链接: https://arxiv.org/abs/2508.08754
作者: Qianru Qiu,Jiafeng Mao,Xueting Wang
机构: CyberAgent Inc.(CyberAgent公司)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted to ACM MM 2025

点击查看摘要

Abstract:With the advent of diffusion models, Text-to-Image (T2I) generation has seen substantial advancements. Current T2I models allow users to specify object colors using linguistic color names, and some methods aim to personalize color-object association through prompt learning. However, existing models struggle to provide comprehensive control over the color schemes of an entire image, especially for background elements and less prominent objects not explicitly mentioned in prompts. This paper proposes a novel approach to enhance color scheme control by integrating color palettes as a separate guidance mechanism alongside prompt instructions. We investigate the effectiveness of palette guidance by exploring various palette representation methods within a diffusion-based image colorization framework. To facilitate this exploration, we construct specialized palette-text-image datasets and conduct extensive quantitative and qualitative analyses. Our results demonstrate that incorporating palette guidance significantly improves the model’s ability to generate images with desired color schemes, enabling a more controlled and refined colorization process.
zh

[CV-46] Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT

【速读】:该论文旨在解决现有深度分割网络在白内障手术人工晶状体(IOL)设计所需的晶状体结构分割任务中存在的两个关键问题:一是传统交叉熵(Cross-Entropy, CE)损失函数对所有像素同等加权,忽略了不同子区域的异质性(如某些区域分割性能更优);二是边界区域常因像素级校准不足导致分割精度下降。解决方案的核心在于提出一种自适应置信度加权(Adaptive Confidence-Wise, ACW)损失函数,通过基于每个目标区域特性的置信度阈值将子区域划分为高置信度和低置信度组,并引入区域加权损失重新分配各组权重,同时设计自适应置信度阈值优化算法动态调整阈值以提升模型性能。此外,论文还提出了边界期望校准误差(Boundary Expected Calibration Error, BECE)作为新指标,更精准量化边界区域的校准偏差。实验表明,ACW在多个数据集上显著优于现有方法,尤其在U-Net架构下相较CE损失实现6.13% IoU提升、4.33% DSC增长和4.79% BECE降低。

链接: https://arxiv.org/abs/2508.08705
作者: Zunjie Xiao,Xiao Wu,Tianhang Liu,Lingxi Hu,Yinling Zhang,Xiaoqing Zhang,Risa Higashita,Jiang Liu
机构: Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology (南方科技大学); Tomey Corporation (东明公司); Department of Electronic and Information Engineering, Changchun University (长春大学); School of Computer Science, University of Nottingham Ningbo China (诺丁汉大学宁波分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss to group each lens structure sub-region into different confidence sub-regions via a confidence threshold from the unique region aspect, aiming to exploit the potential of expert annotation confidence prior. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweigh each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with 6.13% IoU gain, 4.33% DSC increase, and 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at this https URL.
zh
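上述“按置信度阈值分组并重加权”的思路,可用如下极简代码勾勒。注意这只是示意性草图:阈值 `tau` 与组权重 `w_low`、`w_high` 均为假设的超参数,并非论文的原始实现(论文中阈值由自适应优化算法动态调整)。

```python
def acw_style_loss(confidences, pixel_losses, tau=0.5, w_low=2.0, w_high=1.0):
    """按置信度阈值 tau 将像素划分为高/低置信度两组并重加权(示意)。"""
    assert len(confidences) == len(pixel_losses)
    total = 0.0
    for c, l in zip(confidences, pixel_losses):
        # 低置信度(如模糊边界)区域给予更大权重,引导模型重点优化
        w = w_high if c >= tau else w_low
        total += w * l
    return total / len(pixel_losses)

# 两个像素:高置信度像素权重 1,低置信度像素权重 2
loss = acw_style_loss([0.9, 0.1], [1.0, 1.0])  # → 1.5
```

实际使用时,逐像素损失通常来自 CE 或 Dice 等基础损失,置信度可取自专家标注先验或模型自身的预测概率。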

[CV-47] SafeFix: Targeted Model Repair via Controlled Image Generation

【速读】:该论文旨在解决深度学习视觉识别模型因语义子群体(semantic subpopulations)代表性不足而导致的系统性错误问题,现有调试框架虽能定位关键失败属性,但修复效果有限。其解决方案的关键在于提出一种基于可解释失败归因管道的模型修复模块:首先利用条件文本到图像生成模型(conditional text-to-image model)为失败案例生成语义一致且目标明确的合成训练图像,再通过大视觉语言模型(LVLM)对生成结果进行过滤,确保样本质量与原始数据分布一致、语义连贯;最终通过在稀有场景增强的合成数据集上重训练视觉模型,显著降低罕见情况下的错误率,同时不引入新缺陷。

链接: https://arxiv.org/abs/2508.08701
作者: Ouyang Xu,Baoming Zhang,Ruiyu Mao,Yunhui Guo
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images – an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at this https URL
zh

[CV-48] Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos

【速读】:该论文旨在解决视频压缩中带状伪影(banding artifacts)对感知质量的影响问题,尤其是针对高清视频平滑区域中此类伪影的评估难题。现有公开数据集仅包含静态图像,无法反映时序上的带状伪影动态特性,限制了相关研究的发展。为此,作者构建了首个公开的视频带状伪影数据库LIVE-YT-Banding,包含160段由AV1编码器以四种参数压缩的视频及7200条主观评分。解决方案的关键在于提出一种新型无参考(no-reference, NR)视频质量评估模型CBAND,其核心创新是利用深度神经网络嵌入空间中自然图像的统计特性来检测并量化带状伪影对感知质量的影响。实验表明,CBAND在预测性能上显著优于现有最优模型,且计算效率高出数个数量级,并可作为可微损失函数用于优化视频去伪影模型。

链接: https://arxiv.org/abs/2508.08700
作者: Qi Zheng,Li-Heng Chen,Chenlong He,Neil Berkbeck,Yilin Wang,Balu Adsumilli,Alan C. Bovik,Yibo Fan,Zhengzhong Tu
机构: Fudan University (复旦大学); Netflix (奈飞); Google LLC (谷歌); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Texas A&M University (德州农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although there have been notable advancements in video compression technologies in recent years, banding artifacts remain a serious issue affecting the quality of compressed videos, particularly on smooth regions of high-definition videos. Noticeable banding artifacts can severely impact the perceptual quality of videos viewed on a high-end HDTV or high-resolution screen. Hence, there is a pressing need for a systematic investigation of the banding video quality assessment problem for advanced video codecs. Given that the existing publicly available datasets for studying banding artifacts are limited to still picture data only, which cannot account for temporal banding dynamics, we have created a first-of-a-kind open video dataset, dubbed LIVE-YT-Banding, which consists of 160 videos generated by four different compression parameters using the AV1 video codec. A total of 7,200 subjective opinions are collected from a cohort of 45 human subjects. To demonstrate the value of this new resources, we tested and compared a variety of models that detect banding occurrences, and measure their impact on perceived quality. Among these, we introduce an effective and efficient new no-reference (NR) video quality evaluator which we call CBAND. CBAND leverages the properties of the learned statistics of natural images expressed in the embeddings of deep neural networks. Our experimental results show that the perceptual banding prediction performance of CBAND significantly exceeds that of previous state-of-the-art models, and is also orders of magnitude faster. Moreover, CBAND can be employed as a differentiable loss function to optimize video debanding models. The LIVE-YT-Banding database, code, and pre-trained model are all publically available at this https URL.
zh

[CV-49] ROD: RGB-Only Fast and Efficient Off-road Freespace Detection

【速读】:该论文旨在解决越野场景下自由空间检测(off-road freespace detection)的挑战,尤其是由于可通行区域边界模糊导致的识别困难。传统基于多模态融合(RGB图像与LiDAR数据)的方法虽性能优异,但因计算表面法向量图(surface normal maps)引入显著延迟,难以满足实时应用需求,尤其在对帧率(FPS)要求较高的实际场景中表现不佳。解决方案的关键在于提出一种仅依赖RGB图像的新方法ROD(RGB-only freespace detection),其核心创新包括:利用预训练视觉Transformer(Vision Transformer, ViT)提取丰富的图像特征,并设计了一个轻量化且高效的解码器结构,在不依赖LiDAR的前提下实现了高精度与高速推理(达到50 FPS),从而在ORFD和RELLIS-3D数据集上均取得新的SOTA性能。

链接: https://arxiv.org/abs/2508.08697
作者: Tong Sun,Hongliang Ye,Jilin Mei,Liang Chen,Fangzhou Zhao,Leiqiang Zong,Yu Hu
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Zhejiang Lab (浙江省实验室); Beijing Special Vehicle Academy (北京特种车辆研究院); School of Computer Science and Technology, University of Chinese Academy of Sciences (中国科学院大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Off-road freespace detection is more challenging than on-road scenarios because of the blurred boundaries of traversable areas. Previous state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and LiDAR data. However, due to the significant increase in inference time when calculating surface normal maps from LiDAR data, multi-modal methods are not suitable for real-time applications, particularly in real-world scenarios where higher FPS is required compared to slow navigation. This paper presents a novel RGB-only approach for off-road freespace detection, named ROD, eliminating the reliance on LiDAR data and its computational demands. Specifically, we utilize a pre-trained Vision Transformer (ViT) to extract rich features from RGB images. Additionally, we design a lightweight yet efficient decoder, which together improve both precision and inference speed. ROD establishes a new SOTA on ORFD and RELLIS-3D datasets, as well as an inference speed of 50 FPS, significantly outperforming prior models.
zh

[CV-50] STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在复杂多模态任务中推理能力不足以及生成文本过于冗长的问题。现有方法主要依赖链式思维(Chain-of-Thought, CoT)推理,而许多任务实际上更受益于树状或图状等多样化拓扑结构的推理方式。解决方案的关键在于提出STELAR-Vision训练框架,其核心是TopoAug——一种合成数据增强管道,能够引入多样化的拓扑结构以丰富训练数据;同时结合监督微调与强化学习优化模型准确性与效率,并进一步提出Frugal Learning策略,在最小化精度损失的前提下显著减少输出长度。实验表明,该方法在多个基准测试上均优于基线模型,尤其在分布外(Out-of-Distribution, OOD)场景下展现出更强泛化能力。

链接: https://arxiv.org/abs/2508.08688
作者: Chen Li,Han Zhang,Zhantao Yang,Fangyi Chen,Zihan Wang,Anudeepsekhar Bolimera,Marios Savvides
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks. We have released datasets, and code will be available.
zh

[CV-51] PADReg: Physics-Aware Deformable Registration Guided by Contact Force for Ultrasound Sequences

【速读】:该论文旨在解决超声图像变形配准(ultrasound deformable registration)在大形变条件下因图像对比度低、噪声大及组织边界模糊而导致的特征提取困难与对应匹配不可靠的问题,从而提升甲状腺结节和乳腺癌等疾病中生物力学特性捕捉的准确性与诊断精度。其解决方案的关键在于提出一种物理感知的配准框架PADReg,通过引入机器人超声系统同步测量的接触力作为物理先验,首先构建基于多模态信息(接触力与超声图像)的像素级刚度图,再利用受胡克定律启发的轻量级物理感知模块,将刚度图与力数据联合估计稠密形变场,从而实现具有物理可解释性且 anatomical alignment 更优的配准结果。

链接: https://arxiv.org/abs/2508.08685
作者: Yimeng Geng,Mingyang Zhao,Fan Xu,Guanglin Cao,Gaofeng Meng,Hongbin Liu
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Center for Artificial Intelligence and Robotics, HK Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong SAR; School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Ultrasound deformable registration estimates spatial transformations between pairs of deformed ultrasound images, which is crucial for capturing biomechanical properties and enhancing diagnostic accuracy in diseases such as thyroid nodules and breast cancer. However, ultrasound deformable registration remains highly challenging, especially under large deformation. The inherently low contrast, heavy noise and ambiguous tissue boundaries in ultrasound images severely hinder reliable feature extraction and correspondence matching. Existing methods often suffer from poor anatomical alignment and lack physical interpretability. To address the problem, we propose PADReg, a physics-aware deformable registration framework guided by contact force. PADReg leverages synchronized contact force measured by robotic ultrasound systems as a physical prior to constrain the registration. Specifically, instead of directly predicting deformation fields, we first construct a pixel-wise stiffness map utilizing the multi-modal information from contact force and ultrasound images. The stiffness map is then combined with force data to estimate a dense deformation field, through a lightweight physics-aware module inspired by Hooke’s law. This design enables PADReg to achieve physically plausible registration with better anatomical alignment than previous methods relying solely on image similarity. Experiments on in-vivo datasets demonstrate that it attains a HD95 of 12.90, which is 21.34% better than state-of-the-art methods. The source code is available at this https URL.
zh
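受胡克定律启发的“接触力 → 形变”估计,其核心直觉是:位移与外力成正比、与局部刚度成反比(由 F = k·x 反解 x = F/k)。下面是一个假设性的像素级草图,仅演示这一物理关系本身,与论文中轻量物理感知模块的真实网络结构无关:

```python
def displacement_from_force(force, stiffness_map, eps=1e-8):
    """按胡克定律 x = F / k 逐像素估计位移幅值(示意)。

    force: 机器人超声系统测得的接触力标量
    stiffness_map: 像素级刚度图(由力与图像多模态信息估计得到)
    """
    return [[force / (k + eps) for k in row] for row in stiffness_map]

# 刚度越大的像素,相同接触力下估计的形变越小
disp = displacement_from_force(2.0, [[4.0, 1.0],
                                     [2.0, 8.0]])
```

这也解释了为何刚度图是该框架的物理先验核心:它把标量力测量转化为具有空间分辨率的稠密形变约束。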

[CV-52] MMIF-AMIN: Adaptive Loss-Driven Multi-Scale Invertible Dense Network for Multimodal Medical Image Fusion

【速读】:该论文旨在解决多模态医学图像融合(Multimodal Medical Image Fusion, MMIF)中如何同时捕捉各模态的独特信息与互补信息这一关键挑战。解决方案的核心在于提出了一种名为MMIF-AMIN的新方法,其关键创新包括:1)采用可逆密集网络(Invertible Dense Network, IDN)实现无损特征提取;2)设计多尺度互补特征提取模块(Multi-scale Complementary Feature Extraction Module, MCFEM),融合不同感受野的卷积层、注意力机制与Transformer结构以挖掘跨模态互补信息;3)引入自适应损失函数,克服传统人工设计损失函数的局限性,提升模型对数据深层特征的挖掘能力。实验表明,该方法在定量和定性评估中均优于九种先进方法,且组件消融实验证明了各模块的有效性。

链接: https://arxiv.org/abs/2508.08679
作者: Tao Luo,Weihua Xu
机构: Southwest University (西南大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures,conference

点击查看摘要

Abstract:Multimodal medical image fusion (MMIF) aims to integrate images from different modalities to produce a comprehensive image that enhances medical diagnosis by accurately depicting organ structures, tissue textures, and metabolic information. Capturing both the unique and complementary information across multiple modalities simultaneously is a key research challenge in MMIF. To address this challenge, this paper proposes a novel image fusion method, MMIF-AMIN, which features a new architecture that can effectively extract these unique and complementary features. Specifically, an Invertible Dense Network (IDN) is employed for lossless feature extraction from individual modalities. To extract complementary information between modalities, a Multi-scale Complementary Feature Extraction Module (MCFEM) is designed, which incorporates a hybrid attention mechanism, convolutional layers of varying sizes, and Transformers. An adaptive loss function is introduced to guide model learning, addressing the limitations of traditional manually-designed loss functions and enhancing the depth of data mining. Extensive experiments demonstrate that MMIF-AMIN outperforms nine state-of-the-art MMIF methods, delivering superior results in both quantitative and qualitative analyses. Ablation experiments confirm the effectiveness of each component of the proposed method. Additionally, extending MMIF-AMIN to other image fusion tasks also achieves promising performance.
zh

[CV-53] Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

【速读】:该论文旨在解决在线类增量学习(Online Class-Incremental Learning, OCIL)中面临的两大核心挑战:在严格内存约束下维持模型稳定性,以及确保对新任务的适应能力(即平衡稳定性与可塑性)。现有基于回放的方法在低内存条件下效果有限,而集成方法虽提升可塑性却常牺牲稳定性。其解决方案的关键在于提出一种基于全局工作空间模型(Global Workspace Model, GWM)的新型集成学习框架:GWM通过融合每个训练批次中所有学生模型的参数,形成一个隐式共享记忆,捕捉历史学习轨迹并作为知识巩固的动态锚点;随后周期性地将该融合模型重新分配给各学生模型以稳定学习过程并促进跨任务一致性;同时引入多层级协同蒸馏机制,强制学生间相互对齐并与GWM保持一致,从而在保持旧知识的同时增强对新任务的适应能力。

链接: https://arxiv.org/abs/2508.08677
作者: Shibin Su,Guoqiang Liang,De Cheng,Shizhou Zhang,Lingyan Ran,Yanning Zhang
机构: Northwestern Polytechnical University (西北工业大学); Shenzhen Research Institute of Northwestern Polytechnical University (西北工业大学深圳研究院); Xidian University (西安电子科技大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures

点击查看摘要

Abstract:Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams and samples of the data streams can be seen only once, making it more suitable for real-world scenarios compared to offline learning. However, OCIL faces two key challenges: maintaining model stability under strict memory constraints and ensuring adaptability to new tasks. Under stricter memory constraints, current replay-based methods are less effective. While ensemble methods improve adaptability (plasticity), they often struggle with stability. To overcome these challenges, we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM)-a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. This fused model is then redistributed periodically to the students to stabilize learning and promote cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. This approach enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets.
zh
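“在每个训练批次内融合所有学生模型参数”最直接的一种做法是对同名参数逐元素取平均,再把融合结果周期性分发回各学生。以下为假设性的纯 Python 草图(用 `dict[str, list[float]]` 表示参数;论文中的实际融合与分发策略可能更复杂):

```python
def fuse_students(student_params):
    """对多个学生模型的同名参数逐元素平均,得到全局工作空间模型(示意)。"""
    n = len(student_params)
    fused = {}
    for name in student_params[0]:
        dim = len(student_params[0][name])
        fused[name] = [sum(p[name][i] for p in student_params) / n
                       for i in range(dim)]
    return fused

students = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
gwm = fuse_students(students)  # → {"w": [2.0, 3.0]}

# 周期性地将融合后的 GWM 参数重新分发给每个学生,以稳定学习、促进跨任务一致性
students = [dict(gwm) for _ in students]
```

融合模型之所以能充当“隐式共享记忆”,正是因为平均操作把各学生的历史学习轨迹压缩进同一组参数中。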

[CV-54] Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization

【速读】:该论文旨在解决深度图像水印(Deep Image Watermarking)在实际应用中难以同时满足不可见性(invisibility)、鲁棒性(robustness)和广泛适用性(broad applicability)三大核心指标的问题。现有方法往往在三者之间存在权衡,难以实现通用化部署。其解决方案的关键在于提出分阶段优化的层级水印学习(Hierarchical Watermark Learning, HiWL)框架:第一阶段通过分布对齐学习(distribution alignment learning)构建共享潜在空间,约束水印图像与原始图像之间的视觉一致性及水印潜在表示的信息不变性,从而保障水印的不可见性和鲁棒性;第二阶段采用广义水印表示学习(generalized watermark representation learning),引入解耦策略分离水印与图像内容,并通过强惩罚机制抑制相同消息对应RGB空间水印的剧烈波动,显著提升水印表示的泛化能力与处理效率。实验表明,该方法在水印提取准确率上较现有方法提升7.6%,且延迟极低(8秒内处理10万张图像)。

链接: https://arxiv.org/abs/2508.08667
作者: Ke Liu,Xuanhan Wang,Qilong Zhang,Lianli Gao,Jingkuan Song
机构: University of Electronic Science and Technology of China(电子科技大学); Bytedance(字节跳动); Tongji University(同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Deep image watermarking, which refers to enable imperceptible watermark embedding and reliable extraction in cover images, has shown to be effective for copyright protection of image assets. However, existing methods face limitations in simultaneously satisfying three essential criteria for generalizable watermarking: 1) invisibility (imperceptible hide of watermarks), 2) robustness (reliable watermark recovery under diverse conditions), and 3) broad applicability (low latency in watermarking process). To address these limitations, we propose a Hierarchical Watermark Learning (HiWL), a two-stage optimization that enable a watermarking model to simultaneously achieve three criteria. In the first stage, distribution alignment learning is designed to establish a common latent space with two constraints: 1) visual consistency between watermarked and non-watermarked images, and 2) information invariance across watermark latent representations. In this way, multi-modal inputs including watermark message (binary codes) and cover images (RGB pixels) can be well represented, ensuring the invisibility of watermarks and robustness in watermarking process thereby. The second stage employs generalized watermark representation learning to establish a disentanglement policy for separating watermarks from image content in RGB space. In particular, it strongly penalizes substantial fluctuations in separated RGB watermarks corresponding to identical messages. Consequently, HiWL effectively learns generalizable latent-space watermark representations while maintaining broad applicability. Extensive experiments demonstrate the effectiveness of proposed method. In particular, it achieves 7.6% higher accuracy in watermark extraction than existing methods, while maintaining extremely low latency (100K images processed in 8s).
zh

[CV-55] Unified and Semantically Grounded Domain Adaptation for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中无监督域适应(Unsupervised Domain Adaptation, UDA)方法在源可访问(source-accessible)与源不可访问(source-free)两种设置下方法设计割裂的问题,即现有方法缺乏一种能够跨域和跨设置通用的显式结构化解剖知识建模机制。其解决方案的关键在于提出一个统一的、语义 grounded 的框架,该框架通过学习一个与域无关的概率流形(probabilistic manifold),将每张图像的结构内容解耦为从流形中检索到的标准解剖形态和捕捉个体特异性几何的空间变换,从而实现无需手工设计适配策略即可自然具备适应能力的模型架构。这一解耦且可解释的表述不仅提升了性能一致性(源自由设置下的表现接近源可访问设置),还支持通过流形遍历实现平滑形状操控,显著增强模型的可解释性。

链接: https://arxiv.org/abs/2508.08660
作者: Xin Wang,Yin Guo,Jiamin Xia,Kaiyu Zhang,Niranjan Balu,Mahmud Mossa-Basha,Linda Shapiro,Chun Yuan
机构: University of Washington (华盛顿大学); University of Utah (犹他大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Most prior unsupervised domain adaptation approaches for medical image segmentation are narrowly tailored to either the source-accessible setting, where adaptation is guided by source-target alignment, or the source-free setting, which typically resorts to implicit supervision mechanisms such as pseudo-labeling and model distillation. This substantial divergence in methodological designs between the two settings reveals an inherent flaw: the lack of an explicit, structured construction of anatomical knowledge that naturally generalizes across domains and settings. To bridge this longstanding divide, we introduce a unified, semantically grounded framework that supports both source-accessible and source-free adaptation. Fundamentally distinct from all prior works, our framework’s adaptability emerges naturally as a direct consequence of the model architecture, without the need for any handcrafted adaptation strategies. Specifically, our model learns a domain-agnostic probabilistic manifold as a global space of anatomical regularities, mirroring how humans establish visual understanding. Thus, the structural content in each image can be interpreted as a canonical anatomy retrieved from the manifold and a spatial transformation capturing individual-specific geometry. This disentangled, interpretable formulation enables semantically meaningful prediction with intrinsic adaptability. Extensive experiments on challenging cardiac and abdominal datasets show that our framework achieves state-of-the-art results in both settings, with source-free performance closely approaching its source-accessible counterpart, a level of consistency rarely observed in prior works. Beyond quantitative improvement, we demonstrate strong interpretability of the proposed framework via manifold traversal for smooth shape manipulation.
zh

[CV-56] AME: Aligned Manifold Entropy for Robust Vision-Language Distillation

【速读】:该论文旨在解决视觉-语言知识蒸馏(Vision-Language Knowledge Distillation)在低数据场景下难以实现鲁棒泛化的问题,尤其针对具有高预测不确定性的边界邻近样本(boundary-adjacent representations)缺乏足够标注数据的现实挑战。其解决方案的关键在于提出Aligned Manifold Entropy (AME),通过在重构的共享流形(shared manifold)上施加熵最小化约束,利用一对投影函数将图像与文本模态对齐,从而促进跨模态特征表示的结构压缩与一致性优化。此方法无需修改骨干网络架构,可作为即插即用模块集成到多种视觉-语言蒸馏框架中,并理论上证明了其能获得更紧的泛化误差界,显著提升小样本条件下的模型性能。

链接: https://arxiv.org/abs/2508.08644
作者: Guiming Cao,Yuming Ou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Knowledge distillation is a long-established technique for knowledge transfer, and has regained attention in the context of the recent emergence of large vision-language models (VLMs). However, vision-language knowledge distillation often requires sufficient training data to achieve robust generalization on samples with ambiguous or boundary-adjacent representations, which are associated with high predictive uncertainty. Critically, collecting such large-scale, task-specific data for training is often impractical in real-world scenarios. To address this major challenge arising from the entanglement of uncertainty and cross-modal feature representation, we propose Aligned Manifold Entropy for Robust Vision-Language Distillation (AME), aiming to achieve robust generalization under real-world conditions. AME applies entropy minimization over a reconfigured shared manifold, where multi-modal data (i.e., image and text) are bridged through a pair of projection functions, conducive to structural compression for cross-modal feature representations. This enables robust knowledge distillation under low-data regimes, while requiring no architectural modifications to the backbone. As a result, it can serve as a plug-and-play module compatible with a wide range of vision-language distillation frameworks. Notably, our theoretical analysis reveals that integrating knowledge distillation with entropy minimization over the shared manifold leads to a tighter generalization error bound. Extensive experiments across diverse distillation architectures and training settings demonstrate that AME consistently facilitates robust knowledge distillation, resulting in superior generalization performance across a wide spectrum of downstream tasks.
zh
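熵最小化的核心是惩罚模型在共享流形上预测分布的香农熵,使跨模态表示更“确定”。以下是一个假设性的最小草图,仅展示熵项本身(流形投影函数与蒸馏损失均省略,非论文原始实现):

```python
import math

def softmax(logits):
    """数值稳定的 softmax。"""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(probs):
    """香农熵 H(p) = -sum p log p,作为最小化目标(示意)。"""
    return -sum(p * math.log(p) for p in probs if p > 0)

h_uniform = entropy(softmax([0.0, 0.0]))   # 均匀分布:熵最大,约等于 ln 2
h_peaked = entropy(softmax([10.0, 0.0]))   # 尖峰分布:熵接近 0
```

训练时将该熵项作为正则项加入蒸馏损失,梯度会推动预测分布向低熵(高置信度)方向收缩,这正是文中“结构压缩”直觉的来源。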

[CV-57] Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation ICCV2025

【速读】:该论文旨在解决视频实例分割(Video Instance Segmentation, VIS)中两个关键问题:一是现有方法通常假设物体类别在视频序列中保持不变,这与现实场景不符;二是当模型需要持续学习新类别的对象实例时,会因灾难性遗忘(catastrophic forgetting)导致对旧类别的性能显著下降。解决方案的核心在于提出一种分层视觉提示学习(Hierarchical Visual Prompt Learning, HVPL)模型,其关键创新包括:在帧级别设计任务特定的帧提示(frame prompt)与正交梯度修正(Orthogonal Gradient Correction, OGC)模块,通过将梯度投影到旧类别的正交特征空间来保留旧类别的全局实例信息;在视频级别引入任务特定的视频提示(video prompt)和视频上下文解码器(video context decoder),该解码器首先将跨帧的类间结构关系嵌入帧提示特征,再将任务特定的全局视频上下文从帧提示传播至视频提示,从而有效缓解视频级别的遗忘问题。

链接: https://arxiv.org/abs/2508.08612
作者: Jiahua Dong,Hui Yin,Wenqi Liang,Hanbin Zhao,Henghui Ding,Nicu Sebe,Salman Khan,Fahad Shahbaz Khan
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Hunan University (湖南大学); University of Trento (特伦托大学); Zhejiang University (浙江大学); Fudan University (复旦大学); Australian National University (澳大利亚国立大学); Linköping University (林雪平大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025

点击查看摘要

Abstract:Video instance segmentation (VIS) has gained significant attention for its capability in tracking and segmenting object instances across video frames. However, most of the existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they experience catastrophic forgetting of old classes when required to continuously learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each individual frame by projecting its gradients onto the orthogonal feature space of old classes. Furthermore, to address forgetting at the video level, we design a task-specific video prompt and a video context decoder. This decoder first embeds structural inter-class relationships across frames into the frame prompt features, and then propagates task-specific global video contexts from the frame prompt features to the video prompt. Through rigorous comparisons, our HVPL model proves to be more effective than baseline approaches. The code is available at this https URL.
zh
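正交梯度修正(OGC)的几何含义是:去除新任务梯度中落在旧类别特征子空间内的分量,只保留正交分量,从而使新类学习不干扰旧类知识。下面是一个假设性的向量级草图(假定旧类别子空间由一组标准正交基 `basis` 表示,与论文的具体实现细节无关):

```python
def orthogonal_gradient_correction(grad, basis):
    """将梯度投影到 basis 所张子空间的正交补上:g' = g - sum_v (g·v) v(示意)。

    basis: 旧类别特征空间的标准正交基向量列表
    """
    for v in basis:
        coef = sum(g_i * v_i for g_i, v_i in zip(grad, v))
        grad = [g_i - coef * v_i for g_i, v_i in zip(grad, v)]
    return grad

# 旧类别子空间为 x 轴方向时,梯度沿 x 轴的分量被完全去除
g = orthogonal_gradient_correction([1.0, 1.0], [[1.0, 0.0]])  # → [0.0, 1.0]
```

修正后的梯度与旧类别特征方向内积为零,这正是“更新新类时不破坏旧类表示”在一阶近似下的数学表达。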

[CV-58] Neural Artistic Style and Color Transfer Using Deep Learning

【速读】:该论文旨在解决如何将神经艺术风格迁移(Neural Artistic Style Transfer)与色彩迁移算法有效结合,以在保持内容结构的同时优化图像的色彩表现,从而提升生成图像的艺术性和视觉一致性。其解决方案的关键在于引入Kullback-Leibler(KL)散度作为量化指标,对多种色彩迁移方法(如Reinhard全局色彩迁移、迭代分布迁移IDT、带重颗粒处理的IDT、Cholesky分解和主成分分析PCA)进行评估,通过比较原始图像与经过深度学习驱动的艺术风格转换图像之间的颜色通道核密度估计及直方图匹配程度,实现更精确的颜色一致性控制与风格融合效果。

链接: https://arxiv.org/abs/2508.08608
作者: Justin London
机构: University of North Dakota (北达科他大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Neural artistic style transfer blends the content and style representation of one image with the style of another. This enables artists to create unique, innovative visuals and enhances artistic expression in various fields including art, design, and film. Color transfer algorithms are important in digital image processing, adjusting the color information in a target image based on the colors in the source image. Color transfer enhances images and videos in film and photography, and can aid in image correction. We introduce a methodology that combines neural artistic style transfer with color transfer. The method uses the Kullback-Leibler (KL) divergence to quantitatively evaluate color and luminance histogram matching algorithms, including Reinhard global color transfer, iterative distribution transfer (IDT), IDT with regrain, Cholesky decomposition, and PCA, between the original and neural artistic style transferred image using deep learning. We estimate the color channel kernel densities. Various experiments are performed to evaluate the KL divergence of these algorithms and their color histograms for style to content transfer.
zh
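Reinhard 全局色彩迁移的本质是逐通道的统计量匹配:x' = (x − μ_s)/σ_s · σ_t + μ_t;KL 散度则用来比较迁移前后的颜色分布差异。以下是假设性的单通道草图(真实方法通常在 Lab 颜色空间逐通道进行,这里只演示核心公式):

```python
import math
from statistics import mean, pstdev

def reinhard_channel(source, target):
    """单通道 Reinhard 迁移:把 source 的均值/标准差匹配到 target(示意)。"""
    mu_s, sd_s = mean(source), pstdev(source)
    mu_t, sd_t = mean(target), pstdev(target)
    return [(x - mu_s) / sd_s * sd_t + mu_t for x in source]

def kl_divergence(p, q, eps=1e-12):
    """离散分布间的 KL 散度 D(p||q),加 eps 防止 log 0(示意)。"""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

out = reinhard_channel([0.0, 2.0], [10.0, 14.0])  # → [10.0, 14.0]
d = kl_divergence([0.5, 0.5], [0.5, 0.5])         # 相同分布,KL 为 0
```

论文中 KL 散度作用于核密度估计得到的连续颜色分布;上面用归一化直方图近似,是其最常见的离散化形式。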

[CV-59] SelfHVD: Self-Supervised Handheld Video Deblurring for Mobile Phones

【速读】:该论文旨在解决手持移动设备拍摄视频时因手抖等不稳定因素导致的图像模糊问题,特别是现有视频去模糊方法在真实场景下性能受限的问题,根源在于训练数据与测试数据之间的模糊域差异(blur domain gap)。其解决方案的关键在于提出一种自监督方法,通过提取视频中的清晰线索(sharp clues)作为相邻模糊帧的错位标签来训练模型,并引入自增强视频去模糊(Self-Enhanced Video Deblurring, SEVD)策略生成高质量配对数据,同时设计自约束空间一致性保持(Self-Constrained Spatial Consistency Maintenance, SCSCM)机制以防止输出帧与输入帧之间发生位置偏移,从而提升模型在真实手持视频上的泛化能力。

链接: https://arxiv.org/abs/2508.08605
作者: Honglei Xu,Zhilu Zhang,Junjie Fan,Xiaohe Wu,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Shooting video with a handheld mobile phone, the most common photographic device, often results in blurry frames due to shaking hands and other instability factors. Although previous video deblurring methods have achieved impressive progress, they still struggle to perform satisfactorily on real-world handheld video due to the blur domain gap between training and testing data. To address the issue, we propose a self-supervised method for handheld video deblurring, which is driven by sharp clues in the video. First, to train the deblurring model, we extract the sharp clues from the video and take them as misalignment labels of neighboring blurry frames. Second, to improve the model’s ability, we propose a novel Self-Enhanced Video Deblurring (SEVD) method to create higher-quality paired video data. Third, we propose a Self-Constrained Spatial Consistency Maintenance (SCSCM) method to regularize the model, preventing position shifts between the output and input frames. Moreover, we construct a synthetic and a real-world handheld video dataset for handheld video deblurring. Extensive experiments on these two and other common real-world datasets demonstrate that our method significantly outperforms existing self-supervised ones. The code and datasets are publicly available at this https URL.
zh

[CV-60] Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在大规模和复杂化趋势下,微调(fine-tuning)成本高昂的问题,特别是如何高效地将“较弱”模型的适应知识迁移至“更强”模型以提升性能,同时避免高计算开销与模型特定设计限制。解决方案的关键在于提出一种轻量级、模型无关的适配器——TransMiter,其通过无监督方式捕捉预训练与微调VLM之间的知识差距,并在无需反向传播(backpropagation)的情况下实现跨模型的知识迁移;此外,仅需少量标注数据即可进一步提升性能,且训练成本极低,从而在保持模型泛化能力的同时显著提高适应知识的可迁移性与效率。

链接: https://arxiv.org/abs/2508.08604
作者: Jihwan Park,Taehoon song,Sanghyeok Lee,Miso Choi,Hyunwoo J. Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have been widely used in various visual recognition tasks due to their remarkable generalization capabilities. As these models grow in size and complexity, fine-tuning becomes costly, emphasizing the need to reuse adaptation knowledge from ‘weaker’ models to efficiently enhance ‘stronger’ ones. However, existing adaptation transfer methods exhibit limited transferability across models due to their model-specific design and high computational demands. To tackle this, we propose Transferable Model-agnostic adapter (TransMiter), a light-weight adapter that improves vision-language models ‘without backpropagation’. TransMiter captures the knowledge gap between pre-trained and fine-tuned VLMs, in an ‘unsupervised’ manner. Once trained, this knowledge can be seamlessly transferred across different models without the need for backpropagation. Moreover, TransMiter consists of only a few layers, inducing a negligible additional inference cost. Notably, supplementing the process with a few labeled data further yields additional performance gain, often surpassing a fine-tuned stronger model, with a marginal training cost. Experimental results and analyses demonstrate that TransMiter effectively and efficiently transfers adaptation knowledge while preserving generalization abilities across VLMs of different sizes and architectures in visual recognition tasks.
zh

[CV-61] Yan: Foundational Interactive Video Generation

【速读】:该论文旨在解决交互式视频生成(Interactive Video Generation)中缺乏端到端集成框架的问题,现有方法通常局限于单一模块(如模拟、生成或编辑),难以实现高实时性、多模态控制与灵活编辑的统一。其解决方案的关键在于提出一个名为Yan的基础性框架,包含三个核心模块:AAA级仿真(AAA-level Simulation)、多模态生成(Multi-Modal Generation)和多粒度编辑(Multi-Granularity Editing)。其中,关键创新包括:1)基于3D-VAE与KV缓存机制的低延迟实时仿真,支持1080P/60FPS交互;2)引入分层自回归文本注入策略,将游戏特定知识融入开放域视频扩散模型(VDMs),实现帧级可控、无限流式交互视频生成,并具备跨域风格与机制融合能力;3)显式解耦交互逻辑与视觉渲染的混合模型,支持通过文本指令进行多粒度内容编辑。整体上,Yan推动了交互式视频生成从孤立功能向AI驱动的全流程创作范式的演进。

链接: https://arxiv.org/abs/2508.08601
作者: Yan Team
机构: Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Yan, a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. Specifically, Yan comprises three core modules. AAA-level Simulation: We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process, achieving real-time 1080P/60FPS interactive simulation. Multi-Modal Generation: We introduce a hierarchical autoregressive caption method that injects game-specific knowledge into open-domain multi-modal video diffusion models (VDMs), then transforms the VDM into a frame-wise, action-controllable, real-time infinite interactive video generator. Notably, when the textual and visual prompts are sourced from different domains, the model demonstrates strong generalization, allowing it to blend and compose the style and mechanics across domains flexibly according to user prompts. Multi-Granularity Editing: We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text. Collectively, Yan offers an integration of these modules, pushing interactive video generation beyond isolated capabilities toward a comprehensive AI-driven interactive creation paradigm, paving the way for the next generation of creative tools, media, and entertainment. The project page is: this https URL.
zh

[CV-62] QueryCraft: Transformer-Guided Query Initialization for Enhanced Human-Object Interaction Detection

【速读】:该论文旨在解决基于DETR(Detection Transformer)的Human-Object Interaction (HOI)检测方法中,随机初始化查询(queries)缺乏语义信息导致检测性能受限的问题。其核心解决方案是提出QueryCraft框架,通过双分支查询初始化机制提升查询的语义明确性与有效性:一是引入ACTOR(Action-aware Cross-modal Transformer),利用跨模态注意力机制联合视觉区域与文本提示,提取与动作相关的特征并生成语义感知的交互查询;二是设计PDQD(Perceptual Distilled Query Decoder),从预训练目标检测器中蒸馏物体类别知识,作为对象查询的初始表示。这一双重策略显著增强了模型对HOI场景的理解能力,从而在HICO-Det和V-COCO基准上实现了SOTA性能。

链接: https://arxiv.org/abs/2508.08590
作者: Yuxiao Wang,Wolin Liang,Yu Lei,Weiying Xue,Nan Zhuang,Qi Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions in images. Although DETR-based methods have recently emerged as the mainstream framework for HOI detection, they still suffer from a key limitation: Randomly initialized queries lack explicit semantics, leading to suboptimal detection performance. To address this challenge, we propose QueryCraft, a novel plug-and-play HOI detection framework that incorporates semantic priors and guided feature learning through transformer-based query initialization. Central to our approach is \textbfACTOR (\textbfAction-aware \textbfCross-modal \textbfTransf\textbfORmer), a cross-modal Transformer encoder that jointly attends to visual regions and textual prompts to extract action-relevant features. Rather than merely aligning modalities, ACTOR leverages language-guided attention to infer interaction semantics and produce semantically meaningful query representations. To further enhance object-level query quality, we introduce a \textbfPerceptual \textbfDistilled \textbfQuery \textbfDecoder (\textbfPDQD), which distills object category awareness from a pre-trained detector to serve as object query initialization. This dual-branch query initialization enables the model to generate more interpretable and effective queries for HOI detection. Extensive experiments on HICO-Det and V-COCO benchmarks demonstrate that our method achieves state-of-the-art performance and strong generalization. Code will be released upon publication.
zh

[CV-63] DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding ICCV2025

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在文档理解任务中推理过程缺乏透明性与可解释性的问题,尤其是在法律、金融和医疗等高风险领域,其黑箱特性严重影响了可靠性与可信度。现有方法依赖于固定链式思维(Chain-of-Thought, CoT)模板并通过监督微调(Supervised Fine-Tuning, SFT)实现推理,但存在灾难性遗忘、适应能力差以及跨任务泛化性能有限等缺陷。论文提出DocThinker,一种基于规则的强化学习(Reinforcement Learning, RL)框架,其核心创新在于通过策略学习在推理时动态优化推理策略,而非使用静态CoT模板;该框架能够生成结构化的推理步骤、重述问题、支持答案的感兴趣区域(Region of Interest, RoI)及最终答案,并结合多目标规则奖励机制与KL约束优化,有效缓解灾难性遗忘,提升模型的适应性和可解释性。

链接: https://arxiv.org/abs/2508.08589
作者: Wenwen Yu,Zhibo Yang,Yuliang Liu,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating explainable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more explainable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding. Code will be available at this https URL.
zh
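摘要提到"多目标规则奖励 + KL 约束优化"。下面用纯 Python 给出该思想的一个极简示意(非论文实现):总奖励为加权规则奖励减去策略与参考分布之间的 KL 惩罚,其中各规则项、权重与 beta 均为假设值:

```python
import math

def kl_discrete(p, q, eps=1e-12):
    """KL divergence between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def total_reward(rule_rewards, weights, policy_probs, ref_probs, beta=0.1):
    """Weighted multi-objective rule reward minus a KL penalty toward the reference policy."""
    shaped = sum(w * r for w, r in zip(weights, rule_rewards))
    return shaped - beta * kl_discrete(policy_probs, ref_probs)

# Hypothetical rules: output format / answer correctness / RoI validity
r = total_reward(
    rule_rewards=[1.0, 1.0, 0.0],
    weights=[0.2, 0.6, 0.2],
    policy_probs=[0.7, 0.3],   # current policy over two actions
    ref_probs=[0.5, 0.5],      # frozen reference policy
)
r_no_drift = total_reward([1.0, 1.0, 0.0], [0.2, 0.6, 0.2], [0.5, 0.5], [0.5, 0.5])
```

KL 项约束策略不过度偏离参考模型,对应摘要中缓解灾难性遗忘的设计动机。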

[CV-64] RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space

【速读】:该论文旨在解决生成具有真实感且可控运动的人类视频这一挑战性问题,现有方法虽能生成视觉上吸引人的视频,但缺乏对前景主体、背景视频、人体轨迹和动作模式这四个关键视频元素的独立控制能力。其解决方案的核心在于提出一种分解式人体运动控制与视频生成框架,通过显式解耦运动与外观、主体与背景、动作与轨迹,实现这些元素的灵活组合。关键技术包括构建基于地面的三维世界坐标系,在3D空间中直接进行运动编辑;利用焦距校准和坐标变换将编辑后的2D轨迹映射回3D空间并进行速度对齐与方向调整以实现轨迹控制;动作则由动作库或文本到动作(text-to-motion)方法提供;最后在现代文本到视频扩散Transformer模型基础上,通过注入主体token、拼接背景通道及添加运动控制信号(轨迹与动作)来实现可控视频生成,从而支持“任何人做任何事在任何地点”的生成能力。

链接: https://arxiv.org/abs/2508.08588
作者: Jingyun Liang,Jingkai Zhou,Shikai Li,Chenjie Cao,Lei Sun,Yichen Qian,Weihua Chen,Fan Wang
机构: DAMO Academy, Alibaba Group (阿里巴巴集团达摩院); Hupan Lab (湖畔实验室); INSAIT; Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Project page: this https URL

点击查看摘要

Abstract:Generating human videos with realistic and controllable motions is a challenging task. While existing methods can generate visually compelling videos, they lack separate control over four key video elements: foreground subject, background video, human trajectory and action patterns. In this paper, we propose a decomposed human motion control and video generation framework that explicitly decouples motion from appearance, subject from background, and action from trajectory, enabling flexible mix-and-match composition of these elements. Concretely, we first build a ground-aware 3D world coordinate system and perform motion editing directly in the 3D space. Trajectory control is implemented by unprojecting edited 2D trajectories into 3D with focal-length calibration and coordinate transformation, followed by speed alignment and orientation adjustment; actions are supplied by a motion bank or generated via text-to-motion methods. Then, based on modern text-to-video diffusion transformer models, we inject the subject as tokens for full attention, concatenate the background along the channel dimension, and add motion (trajectory and action) control signals by addition. Such a design opens up the possibility for us to generate realistic videos of anyone doing anything anywhere. Extensive experiments on benchmark datasets and real-world cases demonstrate that our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
zh
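摘要中的轨迹控制需要"焦距校准 + 坐标变换"把编辑后的 2D 轨迹反投影到 3D。以下为针孔相机模型下 2D↔3D 变换的通用示意(深度与内参均为假设值,与论文实现无关):

```python
def unproject(u, v, depth, f, cx, cy):
    """Lift a pixel (u, v) at a given depth to a 3D camera-space point (pinhole model)."""
    return ((u - cx) * depth / f, (v - cy) * depth / f, depth)

def project(x, y, z, f, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    return (f * x / z + cx, f * y / z + cy)

# Assumed intrinsics: focal length and principal point of a 1280x720 frame
f, cx, cy = 1000.0, 640.0, 360.0

# An edited 2D trajectory, lifted into 3D at an assumed constant depth of 5 m
traj_2d = [(640.0, 360.0), (700.0, 360.0), (760.0, 380.0)]
traj_3d = [unproject(u, v, 5.0, f, cx, cy) for (u, v) in traj_2d]
```

反投影到 3D 后即可按论文所述做速度对齐与朝向调整,再与动作库中的动作组合。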

[CV-65] Superclass-Guided Representation Disentanglement for Spurious Correlation Mitigation

【速读】:该论文旨在解决域泛化(domain generalization)任务中因数据集间存在虚假相关性(spurious correlations)而导致模型鲁棒性不足的问题。传统方法通常依赖于对样本分组或虚假特征的辅助标注,并假设源域与目标域具有相同的分组结构,但这些假设在真实场景中往往不成立且难以实现。本文的关键解决方案是利用类别标签固有的语义结构——即超类(superclass)信息——来引导模型减少对虚假特征的依赖。具体而言,模型通过预训练视觉-语言模型(vision-language model)提供的梯度注意力机制,分离出与超类相关和无关的特征;随后通过鼓励使用全部超类相关特征进行预测,从而在无需任何源样本标注的情况下提升模型对复杂虚假相关性的鲁棒性。实验表明,该方法在多个数据集上显著优于基线模型,在定量指标和定性可视化方面均表现出优越性能。

链接: https://arxiv.org/abs/2508.08570
作者: Chenruo Liu,Hongjun Liu,Zeyu Lai,Yiqiu Shen,Chen Zhao,Qi Lei
机构: New York University (纽约大学); NYU Grossman School of Medicine (纽约大学格罗斯曼医学院); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:To enhance group robustness to spurious correlations, prior work often relies on auxiliary annotations for groups or spurious features and assumes identical sets of groups across source and target domains. These two requirements are both unnatural and impractical in real-world settings. To overcome these limitations, we propose a method that leverages the semantic structure inherent in class labels–specifically, superclass information–to naturally reduce reliance on spurious features. Our model employs gradient-based attention guided by a pre-trained vision-language model to disentangle superclass-relevant and irrelevant features. Then, by promoting the use of all superclass-relevant features for prediction, our approach achieves robustness to more complex spurious correlations without the need to annotate any source samples. Experiments across diverse datasets demonstrate that our method significantly outperforms baselines in domain generalization tasks, with clear improvements in both quantitative metrics and qualitative visualizations.
zh

[CV-66] hink as Cardiac Sonographers: Marrying SAM with Left Ventricular Indicators Measurements According to Clinical Guidelines

【速读】:该论文旨在解决超声心动图中左心室(Left Ventricle, LV)指标测量自动化难题,尤其是现有算法因训练数据有限而难以捕捉通用视觉表征的问题。解决方案的关键在于提出一种名为AutoSAME的新框架,该框架融合了Segment Anything Model(SAM)的强大视觉理解能力,并同时完成LV分割与关键解剖点定位任务,从而模拟心脏超声医师的操作流程,实现符合临床指南的LV指标测量。其核心创新包括:1)引入滤波交叉分支注意力机制(Filtered Cross-Branch Attention, FCBA),利用分割任务中相对完整的特征增强关键点热图回归的频率域表示;2)设计空间引导提示对齐机制(Spatial-Guided Prompt Alignment, SGPA),基于LV的空间属性自动生成提示嵌入,提升密集预测的准确性。

链接: https://arxiv.org/abs/2508.08566
作者: Tuo Liu,Qinghan Yang,Yu Zhang,Rongjun Ge,Yang Chen,Guangquan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Left ventricular (LV) indicator measurements following clinical echocardiography guidelines are important for diagnosing cardiovascular disease. Although existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations due to the normally small training datasets. Therefore, it is necessary to introduce vision foundational models (VFM) with abundant knowledge. However, VFMs represented by the segment anything model (SAM) are usually suitable for segmentation but incapable of identifying key anatomical points, which are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME, combining the powerful visual understanding of SAM with segmentation and landmark localization tasks simultaneously. Consequently, the framework mimics the operation of cardiac sonographers, achieving LV indicator measurements consistent with clinical guidelines. We further present filtered cross-branch attention (FCBA) in AutoSAME, which leverages relatively comprehensive features in the segmentation to enhance the heatmap regression (HR) of key points from the frequency domain perspective, optimizing the visual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guided by spatial properties of LV, thereby improving the accuracy of dense predictions by prior spatial knowledge. The extensive experiments on an echocardiography dataset demonstrate the efficiency of each design and the superiority of our AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at this https URL.
zh
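摘要中关键点定位采用热图回归(HR)。这类方法的监督目标通常是以关键点为中心的二维高斯热图,下面给出一个与论文无关的通用示意(尺寸与 sigma 均为假设):

```python
import math

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render an h x w heatmap with a Gaussian peak at landmark (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(w)]
        for y in range(h)
    ]

# Target heatmap for one anatomical landmark on a 16x16 grid
hm = gaussian_heatmap(16, 16, cx=5, cy=9)
```

网络回归整张热图,推理时取响应最大处即为关键点坐标。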

[CV-67] Unlocking the Potential of Diffusion Priors in Blind Face Restoration

【速读】:该论文旨在解决扩散先验(diffusion prior)在盲人脸恢复(blind face restoration, BFR)任务中适应性不足的问题,其核心挑战在于:1)原始扩散模型训练时使用的是高质量(HQ)图像,而BFR处理的是中度至重度退化的低质量(LQ)图像;2)训练数据中的LQ图像由简化的退化模型合成,无法模拟真实世界中复杂且未知的退化模式。解决方案的关键在于提出一个统一网络FLIPNET,通过两种模式切换实现针对性优化:在恢复模式(Restoration mode)下,模型逐步融合面向BFR的特征与来自LQ图像的人脸嵌入,以提升恢复结果的真实性与保真度;在退化模式(Degradation mode)下,模型基于真实退化数据集学习的知识生成类真实世界的退化图像,从而更准确地建模现实场景中的退化过程。

链接: https://arxiv.org/abs/2508.08556
作者: Yunqi Miao,Zhiyu Qu,Mingqi Gao,Changrui Chen,Jifei Song,Jungong Han,Jiankang Deng
机构: University of Warwick (华威大学); University of Sheffield (谢菲尔德大学); University of Surrey (萨里大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although diffusion prior is rising as a powerful solution for blind face restoration (BFR), the inherent gap between the vanilla diffusion model and BFR settings hinders its seamless adaptation. The gap mainly stems from the discrepancy between 1) high-quality (HQ) and low-quality (LQ) images and 2) synthesized and real-world images. The vanilla diffusion model is trained on images with little or no degradation, whereas BFR handles moderately to severely degraded images. Additionally, LQ images used for training are synthesized by a naive degradation model with limited degradation patterns, which fails to simulate complex and unknown degradations in real-world scenarios. In this work, we use a unified network FLIPNET that switches between two modes to resolve specific gaps. In Restoration mode, the model gradually integrates BFR-oriented features and face embeddings from LQ images to achieve authentic and faithful face restoration. In Degradation mode, the model synthesizes real-world like degraded images based on the knowledge learned from real-world degradation datasets. Extensive evaluations on benchmark datasets show that our model 1) outperforms previous diffusion prior based BFR methods in terms of authenticity and fidelity, and 2) outperforms the naive degradation model in modeling the real-world degradations.
zh

[CV-68] Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation

【速读】:该论文旨在解决医学图像分割中因标注数据有限和域偏移(domain shift)带来的挑战,涵盖半监督医学图像分割(Semi-Supervised Medical Image Segmentation, SSMIS)、半监督医学域泛化(Semi-MDG)以及无监督医学域适应(UMDA)三种典型场景。传统方法通常针对单一任务设计,存在误差累积问题,难以有效利用未标注数据,导致性能受限。解决方案的关键在于:通过一个包含两个多样化教师模型(diverse teacher models)的Diverse Teaching and Label Propagation Network (DTLP-Net) 生成可靠的伪标签(pseudo-labels),其中第一个教师模型分离标注与未标注数据的训练过程,第二个教师模型采用动量更新机制以产生多样且可靠的伪标签;同时结合样本间与样本内数据增强策略以提取全局与局部特征,并引入标签传播机制强化体素级关联建模,从而提升模型鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2508.08549
作者: Wei Li,Pengcheng Zhou,Linye Ma,Wenyi Zhao,Huihua Yang
机构: Beijing Information Science and Technology National Research Center (北京信息科学与技术国家研究中心); Tsinghua University (清华大学); School of Information and Communication Engineering, Beijing University of Posts and Telecommunications (北京邮电大学信息与通信工程学院); The 15th Institute of China Electronics Technology Group Corporation (中国电子科技集团公司第十五研究所); School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications (北京邮电大学智能工程与自动化学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation; the resulting error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data and increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boost generic semi-supervised medical image segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouples the training process with labeled and unlabeled data; the second teacher is momentum-updated periodically, thus generating reliable yet diverse pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance model robustness. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.
zh
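摘要中第二个教师模型"采用动量更新机制"。其核心即对学生参数做指数滑动平均(EMA),示意如下(动量系数为假设值):

```python
def momentum_update(teacher_params, student_params, m=0.99):
    """EMA update: teacher <- m * teacher + (1 - m) * student."""
    return [m * t + (1.0 - m) * s for t, s in zip(teacher_params, student_params)]

# Toy 2-parameter model: teacher slowly tracks the student
teacher = [0.0, 1.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = momentum_update(teacher, student, m=0.9)
```

教师参数平滑地追随学生,更新上的滞后带来了多样性,这正是其伪标签"可靠且多样"的来源之一。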

[CV-69] Calibration Attention: Instance-wise Temperature Scaling for Vision Transformers

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在风险敏感应用中概率校准不足的问题。标准的后处理温度缩放(post-hoc temperature scaling)方法依赖于单一全局标量,并需额外的验证集,难以适应不同输入实例的不确定性差异。其解决方案的关键在于提出 Calibration Attention (CalAttn) 模块,该模块直接从ViT的CLS token中学习自适应的、逐样本的温度参数,无需额外验证集且仅引入不到0.1%的额外参数。实验表明,CalAttn在多个数据集上可将校准误差降低至原来的四分之一,且学习到的温度值紧密聚集在1.0附近,显著优于传统方法的大规模全局温度值,从而提升了模型输出概率的可信度而不损失分类准确率。

链接: https://arxiv.org/abs/2508.08547
作者: Wenhao Liang,Wei Emma Zhang,Lin Yue,Miao Xu,Olaf Maennel,Weitong Chen
机构: The University of Adelaide (阿德莱德大学); The University of Queensland (昆士兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: UnderReview

点击查看摘要

Abstract:Probability calibration is critical when Vision Transformers are deployed in risk-sensitive applications. The standard fix, post-hoc temperature scaling, uses a single global scalar and requires a held-out validation set. We introduce Calibration Attention (CalAttn), a drop-in module that learns an adaptive, per-instance temperature directly from the ViT’s CLS token. Across CIFAR-10/100, MNIST, Tiny-ImageNet, and ImageNet-1K, CalAttn reduces calibration error by up to 4x on ViT-224, DeiT, and Swin, while adding under 0.1 percent additional parameters. The learned temperatures cluster tightly around 1.0, in contrast to the large global values used by standard temperature scaling. CalAttn is simple, efficient, and architecture-agnostic, and yields more trustworthy probabilities without sacrificing accuracy. Code: [this https URL](this https URL)
zh
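CalAttn 为每个样本从 CLS token 预测一个自适应温度。下面的片段只示意温度如何调节 softmax 的置信度(论文中温度由网络逐样本预测,此处手工指定):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: T > 1 flattens (less confident), T < 1 sharpens."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
p_sharp = softmax(logits, temperature=0.5)
p_base = softmax(logits, temperature=1.0)
p_flat = softmax(logits, temperature=2.0)
```

对过度自信的样本预测较大的温度即可压低其置信度,实现逐样本校准;传统的全局温度缩放只能对所有样本使用同一个标量。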

[CV-70] Hybrid Long and Short Range Flows for Point Cloud Filtering

【速读】:该论文旨在解决点云捕获过程中引入的噪声伪影问题,传统滤波方法常存在点聚集或噪声残留等缺陷。其解决方案的关键在于提出一种混合点云滤波方法(HybridPF),通过同时建模短距离得分(short-range scores)与长距离速度流(long-range velocity flows)来提升去噪效果:短距离得分提供局部位移以引导噪声点向清洁表面移动,而长距离速度流则提供全局方向性约束,使短距离得分更精准地对齐至干净点;为此设计了两个并行模块——ShortModule 和 LongModule,分别基于编码器-解码器结构处理两类信息,并引入联合损失函数实现端到端训练;此外,针对现有基于位移的方法在解码器架构上的局限性,进一步提出动态图卷积解码器以优化推理过程,从而在保持先进性能的同时显著提升推理速度。

链接: https://arxiv.org/abs/2508.08542
作者: Dasith de Silva Edirimuni,Xuequan Lu,Ajmal Saeed Mian,Lei Wei,Gang Li,Scott Schaefer,Ying He
机构: University of Western Australia (西澳大利亚大学); Deakin University (迪肯大学); Texas A&M University (德克萨斯A&M大学); Nanyang Technological University (南洋理工大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point cloud capture processes are error-prone and introduce noisy artifacts that necessitate filtering/denoising. Recent filtering methods often suffer from point clustering or noise retaining issues. In this paper, we propose Hybrid Point Cloud Filtering ( \textbfHybridPF ) that considers both short-range and long-range filtering trajectories when removing noise. It is well established that short range scores, given by \nabla_x\log p(x_t) , may provide the necessary displacements to move noisy points to the underlying clean surface. By contrast, long range velocity flows approximate constant displacements directed from a high noise variant patch x_0 towards the corresponding clean surface x_1 . Here, noisy patches x_t are viewed as intermediate states between the high noise variant and the clean patches. Our intuition is that long range information from velocity flow models can guide the short range scores to align more closely with the clean points. In turn, score models generally provide a quicker convergence to the clean surface. Specifically, we devise two parallel modules, the ShortModule and LongModule, each consisting of an Encoder-Decoder pair to respectively account for short-range scores and long-range flows. We find that short-range scores, guided by long-range features, yield filtered point clouds with good point distributions and convergence near the clean surface. We design a joint loss function to simultaneously train the ShortModule and LongModule, in an end-to-end manner. Finally, we identify a key weakness in current displacement based methods, limitations on the decoder architecture, and propose a dynamic graph convolutional decoder to improve the inference process. Comprehensive experiments demonstrate that our HybridPF achieves state-of-the-art results while enabling faster inference speed.
zh
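摘要的核心直觉是:长距离速度流提供近似恒定的整体位移,短距离得分(log 密度梯度)再做局部精修。下面用一维玩具示例示意这一两阶段思路(所有数值均为假设,与论文网络实现无关):

```python
def denoise_1d(x0, x_clean, n_flow=10, n_score=5, sigma=0.25, lr=0.02):
    """Toy sketch: a constant long-range flow covers most of the distance,
    then short-range score steps (gradient of a Gaussian log-density) refine."""
    v = (x_clean - x0) / n_flow              # long-range constant velocity
    x = x0
    for _ in range(n_flow - 2):              # flow stage: move most of the way
        x += v
    for _ in range(n_score):                 # score stage: local refinement
        x += lr * (-(x - x_clean) / sigma ** 2)
    return x

x_final = denoise_1d(x0=1.0, x_clean=0.0)
```

流场先把噪声点带到干净表面附近,得分项再将残余偏差按指数衰减掉,对应摘要中"长距离特征引导短距离得分"的设计。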

[CV-71] Training Kindai OCR with parallel textline images and self-attention feature distance-based loss

【速读】:该论文旨在解决历史文献光学字符识别(OCR)中因标注数据稀缺而导致模型性能受限的问题。其核心挑战在于Kindai文档(明治时期至大正初期的近代日语文献)的文本难以获取大量人工标注样本,从而限制了OCR系统训练效果。解决方案的关键在于利用平行文本行图像对——即原始Kindai文本与其对应现代日语字体版本的图像配对——来增强训练数据集,并引入基于距离的损失函数,通过最小化自注意力特征之间的差异(采用欧几里得距离或最大均值差异MMD作为域适应度量),提升模型在历史文本上的泛化能力。实验表明,该方法相较于基于Transformer的OCR基线模型,在字符错误率(CER)上分别降低了2.23%和3.94%,同时增强了自注意力表示的判别性,显著改善了历史文档的OCR性能。

链接: https://arxiv.org/abs/2508.08537
作者: Anh Le,Asanobu Kitamoto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Kindai documents, written in modern Japanese from the late 19th to early 20th century, hold significant historical value for researchers studying societal structures, daily life, and environmental conditions of that period. However, transcribing these documents remains a labor-intensive and time-consuming task, resulting in limited annotated data for training optical character recognition (OCR) systems. This research addresses this challenge of data scarcity by leveraging parallel textline images - pairs of original Kindai text and their counterparts in contemporary Japanese fonts - to augment training datasets. We introduce a distance-based objective function that minimizes the gap between self-attention features of the parallel image pairs. Specifically, we explore Euclidean distance and Maximum Mean Discrepancy (MMD) as domain adaptation metrics. Experimental results demonstrate that our method reduces the character error rate (CER) by 2.23% and 3.94% over a Transformer-based OCR baseline when using Euclidean distance and MMD, respectively. Furthermore, our approach improves the discriminative quality of self-attention representations, leading to more effective OCR performance for historical documents.
zh
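摘要以 MMD 作为域适应度量来拉近平行文本行图像对的自注意力特征。以下为 RBF 核下一维特征的有偏 MMD² 估计示意(特征数值均为假设):

```python
import math

def rbf(a, b, gamma=1.0):
    """RBF (Gaussian) kernel between two scalar features."""
    return math.exp(-gamma * (a - b) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy between two samples."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (len(xs) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (len(ys) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Toy scalar "features" of Kindai vs. modern-font textlines
same = mmd2([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
shifted = mmd2([0.1, 0.2, 0.3], [1.1, 1.2, 1.3])
```

训练时将该距离作为损失最小化即可促使两个域的特征分布对齐;摘要中的欧氏距离则是更简单的逐对替代方案。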

[CV-72] VISOR: Visual Input-based Steering for Output Redirection in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在实际部署中面临的行为控制与安全防护难题,即如何在不依赖模型内部访问或显式文本指令的前提下,实现对模型输出的精准引导。现有方法如系统提示(system prompting)易被检测且效果有限,而基于激活向量的控制则需侵入式运行时访问,难以适用于API服务和闭源场景。论文提出的解决方案——VISOR(Visual Input-based Steering for Output Redirection),其关键在于通过优化设计通用视觉输入图像(steering image),诱导目标激活模式以实现行为重定向,从而在无需修改模型结构或运行时干预的情况下,达成高效、隐蔽且跨模态的行为控制。实验表明,仅用150KB的视觉输入即可实现与激活向量相当甚至更优的正向行为改变(误差<2%),并在负向操控上显著超越传统方法(最高达25%偏差),同时保持99.9%的基准任务性能,揭示了视觉通道作为新型攻击面所带来的严重安全隐患。

链接: https://arxiv.org/abs/2508.08521
作者: Mansi Phute(Georgia Tech),Ravikumar Balakrishnan(HiddenLayer)
机构: Georgia Institute of Technology (佐治亚理工学院); HiddenLayer, Inc
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision Language Models (VLMs) are increasingly being used in a broad range of applications, bringing their security and behavioral control to the forefront. While existing approaches for behavioral control or output redirection, like system prompting in VLMs, are easily detectable and often ineffective, activation-based steering vectors require invasive runtime access to model internals–incompatible with API-based services and closed-source deployments. We introduce VISOR (Visual Input-based Steering for Output Redirection), a novel method that achieves sophisticated behavioral control through optimized visual inputs alone. By crafting universal steering images that induce target activation patterns, VISOR enables practical deployment across all VLM serving modalities while remaining imperceptible compared to explicit textual instructions. We validate VISOR on LLaVA-1.5-7B across three critical alignment tasks: refusal, sycophancy and survival instinct. A single 150KB steering image matches steering vector performance within 1-2% for positive behavioral shifts while dramatically exceeding it for negative steering–achieving up to 25% shifts from baseline compared to steering vectors’ modest changes. Unlike system prompting (3-4% shifts), VISOR provides robust bidirectional control while maintaining 99.9% performance on 14,000 unrelated MMLU tasks. Beyond eliminating runtime overhead and model access requirements, VISOR exposes a critical security vulnerability: adversaries can achieve sophisticated behavioral manipulation through visual channels alone, bypassing text-based defenses. Our work fundamentally re-imagines multimodal model control and highlights the urgent need for defenses against visual steering attacks.
zh

[CV-73] SharpXR: Structure-Aware Denoising for Pediatric Chest X-Rays MICCAI2025

【速读】:该论文旨在解决低剂量儿科胸部X线影像中噪声干扰严重、传统去噪方法易损失关键解剖细节从而影响诊断准确性的难题。其解决方案的关键在于提出SharpXR,一种结构感知的双解码器U-Net架构,通过引入拉普拉斯引导的边缘保持解码器与可学习融合模块,实现噪声抑制与结构细节保留之间的自适应平衡,同时在缺乏配对训练数据的情况下,利用泊松-高斯噪声模拟增强数据多样性,显著提升了图像质量与下游肺炎分类性能(从88.8%提升至92.5%)。

链接: https://arxiv.org/abs/2508.08518
作者: Ilerioluwakiiye Abolade,Emmanuel Idoko,Solomon Odelola,Promise Omoigui,Adetola Adebanwo,Aondana Iorumbur,Udunna Anazodo,Alessandro Crimi,Raymond Confidence
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2025 MIRASOL Workshop, 10 pages, 5 figures

点击查看摘要

Abstract:Pediatric chest X-ray imaging is essential for early diagnosis, particularly in low-resource settings where advanced imaging modalities are often inaccessible. Low-dose protocols reduce radiation exposure in children but introduce substantial noise that can obscure critical anatomical details. Conventional denoising methods often degrade fine details, compromising diagnostic accuracy. In this paper, we present SharpXR, a structure-aware dual-decoder U-Net designed to denoise low-dose pediatric X-rays while preserving diagnostically relevant features. SharpXR combines a Laplacian-guided edge-preserving decoder with a learnable fusion module that adaptively balances noise suppression and structural detail retention. To address the scarcity of paired training data, we simulate realistic Poisson-Gaussian noise on the Pediatric Pneumonia Chest X-ray dataset. SharpXR outperforms state-of-the-art baselines across all evaluation metrics while maintaining computational efficiency suitable for resource-constrained settings. SharpXR-denoised images improved downstream pneumonia classification accuracy from 88.8% to 92.5%, underscoring its diagnostic value in low-resource pediatric care.

[CV-74] CObL: Toward Zero-Shot Ordinal Layering without User Prompting ICCV2025

【Quick Read】: This paper addresses inferring a layered scene representation from a single image: grouping pixels into isolated, amodally completed object layers while correctly modeling their occlusion ordering and spatial relationships. Traditional methods struggle with multi-object occlusion or an unknown number of objects. The proposed Concurrent Object Layers (CObL) uses a diffusion-based architecture to generate all object layers in parallel, leveraging Stable Diffusion as a prior over natural objects and inference-time guidance to ensure the inferred layers composite back to the input image. Crucially, it reconstructs multiple occluded objects without user prompting and without knowing the object count in advance, and it generalizes zero-shot beyond its training distribution.

Link: https://arxiv.org/abs/2508.08498
Authors: Aneel Damaraju, Dean Hazineh, Todd Zickler
Affiliations: Harvard University, School of Engineering and Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025: Project page with demo, datasets, and code: this https URL

Abstract:Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. We capture this with a scene representation comprising an occlusion-ordered stack of “object layers,” each containing an isolated and amodally-completed object. To infer this representation from an image, we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers in parallel, using Stable Diffusion as a prior for natural objects and inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to photographs of real-world tabletops with varying numbers of novel objects. In contrast to recent models for amodal object completion, CObL reconstructs multiple occluded objects without user prompting and without knowing the number of objects beforehand. Unlike previous models for unsupervised object-centric representation learning, CObL is not limited to the world it was trained in.

[CV-75] MuGa-VTON: Multi-Garment Virtual Try-On via Diffusion Transformers with Prompt Customization

【Quick Read】: Existing virtual try-on methods typically model upper and lower garments separately, rely on heavy preprocessing, and fail to preserve person-specific cues (tattoos, accessories, body shape), limiting realism and flexibility. The key is MuGa-VTON, a unified multi-garment diffusion framework built on three modules: a Garment Representation Module (GRM) capturing garment semantics; a Person Representation Module (PRM) encoding identity and pose cues; and an A-DiT fusion module that integrates garment, person, and text-prompt features in a shared latent space via a diffusion transformer, enabling fine-grained prompt-based customization with markedly better realism and identity fidelity.

Link: https://arxiv.org/abs/2508.08488
Authors: Ankan Deria, Dwarikanath Mahapatra, Behzad Bozorgtabar, Mohna Chakraborty, Snehashis Chakraborty, Sudipta Roy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Virtual try-on seeks to generate photorealistic images of individuals in desired garments, a task that must simultaneously preserve personal identity and garment fidelity for practical use in fashion retail and personalization. However, existing methods typically handle upper and lower garments separately, rely on heavy preprocessing, and often fail to preserve person-specific cues such as tattoos, accessories, and body shape-resulting in limited realism and flexibility. To this end, we introduce MuGa-VTON, a unified multi-garment diffusion framework that jointly models upper and lower garments together with person identity in a shared latent space. Specifically, we proposed three key modules: the Garment Representation Module (GRM) for capturing both garment semantics, the Person Representation Module (PRM) for encoding identity and pose cues, and the A-DiT fusion module, which integrates garment, person, and text-prompt features through a diffusion transformer. This architecture supports prompt-based customization, allowing fine-grained garment modifications with minimal user input. Extensive experiments on the VITON-HD and DressCode benchmarks demonstrate that MuGa-VTON outperforms existing methods in both qualitative and quantitative evaluations, producing high-fidelity, identity-preserving results suitable for real-world virtual try-on applications.

[CV-76] MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling

【Quick Read】: This paper targets three core limitations of long-sequence video generation frameworks: poor assistive capability, suboptimal visual quality, and limited expressiveness. The key is MAViS, an end-to-end multi-agent collaborative framework whose specialized agents (script writing, shot design, character modeling, keyframe generation, video animation, and audio generation) cooperate stage by stage under the Explore-Examine-Enhance (3E) principle to guarantee complete, high-quality intermediate outputs. Script Writing Guidelines further improve compatibility between scripts and generative models. Experiments show state-of-the-art assistive capability, visual quality, and expressiveness, and the modular design scales across diverse generative models and tools, producing high-quality, narrative long-sequence videos with background music from just a brief user prompt.

Link: https://arxiv.org/abs/2508.08487
Authors: Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu
Affiliations: Virginia Tech; Nanyang Technological University; Netflix Eyeline Studios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Video Generation Agent

Abstract:Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, an end-to-end multi-agent collaborative framework for long-sequence video storytelling. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle – Explore, Examine, and Enhance – to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief user prompt, MAViS is capable of producing high-quality, expressive long-sequence video storytelling, enriching inspirations and creativity for users. To the best of our knowledge, MAViS is the only framework that provides multimodal design output – videos with narratives and background music.

[CV-77] Enhanced Liver Tumor Detection in CT Images Using 3D U-Net and Bat Algorithm for Hyperparameter Optimization

【Quick Read】: This paper tackles automated liver tumor segmentation for early liver cancer detection, aiming to improve accuracy and clinical utility in medical image analysis. The key is coupling a 3D U-Net deep learning architecture with the Bat Algorithm, which intelligently optimizes critical hyperparameters such as learning rate and batch size, markedly improving segmentation accuracy and robustness on CT images.

Link: https://arxiv.org/abs/2508.08452
Authors: Nastaran Ghorbani, Bitasadat Jamshidi, Mohsen Rostamy-Malkhalifeh
Affiliations: Loyola University Chicago; Islamic Azad University, Science and Research Branch
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Liver cancer is one of the most prevalent and lethal forms of cancer, making early detection crucial for effective treatment. This paper introduces a novel approach for automated liver tumor segmentation in computed tomography (CT) images by integrating a 3D U-Net architecture with the Bat Algorithm for hyperparameter optimization. The method enhances segmentation accuracy and robustness by intelligently optimizing key parameters like the learning rate and batch size. Evaluated on a publicly available dataset, our model demonstrates a strong ability to balance precision and recall, with a high F1-score at lower prediction thresholds. This is particularly valuable for clinical diagnostics, where ensuring no potential tumors are missed is paramount. Our work contributes to the field of medical image analysis by demonstrating that the synergy between a robust deep learning architecture and a metaheuristic optimization algorithm can yield a highly effective solution for complex segmentation tasks.
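The paper tunes learning rate and batch size with the Bat Algorithm. As an illustration only, here is a minimal, self-contained sketch of bat-style search over a continuous box; the toy objective, population size, and loudness/pulse-rate constants are hypothetical stand-ins, not the paper's configuration:

```python
import random

def bat_search(objective, bounds, n_bats=10, n_iter=50, seed=0):
    """Minimise `objective` over the box `bounds` with a simplified Bat Algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)
    clamp = lambda x, d: min(max(x, bounds[d][0]), bounds[d][1])
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_val = pos[best_i][:], fit[best_i]
    loudness, pulse_rate = 0.9, 0.5
    for _ in range(n_iter):
        for i in range(n_bats):
            freq = rng.random()                       # random pulse frequency
            cand = []
            for d in range(dim):
                vel[i][d] += (pos[i][d] - best[d]) * freq
                cand.append(clamp(pos[i][d] + vel[i][d], d))
            if rng.random() > pulse_rate:             # local random walk near the best bat
                cand = [clamp(best[d] + 0.01 * rng.gauss(0, 1), d) for d in range(dim)]
            val = objective(cand)
            if val < fit[i] and rng.random() < loudness:  # accept improved solution
                pos[i], fit[i] = cand, val
            if val < best_val:                        # track the global best
                best, best_val = cand[:], val
    return best, best_val

# Toy stand-in for "validation loss as a function of two hyperparameters"
def toy_loss(x):
    return (x[0] - 0.3) ** 2 + (x[1] - 0.3) ** 2

best, best_val = bat_search(toy_loss, [(-1.0, 1.0), (-1.0, 1.0)])
```

In the paper's setting, `objective` would instead train the 3D U-Net briefly with the candidate hyperparameters and return a validation metric.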

[CV-78] Improving Facial Rig Semantics for Tracking and Retargeting

【Quick Read】: This paper considers retargeting a tracked facial performance to another performer or to a game/VR character, where mismatched rig semantics across characters distort animation controls. The keys are: build source and target rigs within the same facial parameterization (3DMM, FLAME, or MetaHuman) to guarantee semantic consistency; fit rigs to performers and targets via volumetric morphing, calibrating motion signatures with a carefully chosen set of Simon-Says expressions so that rigs produce the intended expressions under their animation controls; and fine-tune the tracker's rig using implicit differentiation so that its output animation controls become more semantically meaningful, enabling high-fidelity retargeting while treating the tracker as a black box to handle real-world complexity.

Link: https://arxiv.org/abs/2508.08429
Authors: Dalton Omens, Allise Thurman, Jihun Yu, Ronald Fedkiw
Affiliations: Stanford University; Epic Games
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we consider retargeting a tracked facial performance to either another person or to a virtual character in a game or virtual reality (VR) environment. We remove the difficulties associated with identifying and retargeting the semantics of one rig framework to another by utilizing the same framework (3DMM, FLAME, MetaHuman, etc.) for both subjects. Although this does not constrain the choice of framework when retargeting from one person to another, it does force the tracker to use the game/VR character rig when retargeting to a game/VR character. We utilize volumetric morphing in order to fit facial rigs to both performers and targets; in addition, a carefully chosen set of Simon-Says expressions is used to calibrate each rig to the motion signatures of the relevant performer or target. Although a uniform set of Simon-Says expressions can likely be used for all person to person retargeting, we argue that person to game/VR character retargeting benefits from Simon-Says expressions that capture the distinct motion signature of the game/VR character rig. The Simon-Says calibrated rigs tend to produce the desired expressions when exercising animation controls (as expected). Unfortunately, these well-calibrated rigs still lead to undesirable controls when tracking a performance (a well-behaved function can have an arbitrarily ill-conditioned inverse), even though they typically produce acceptable geometry reconstructions. Thus, we propose a fine-tuning approach that modifies the rig used by the tracker in order to promote the output of more semantically meaningful animation controls, facilitating high efficacy retargeting. In order to better address real-world scenarios, the fine-tuning relies on implicit differentiation so that the tracker can be treated as a (potentially non-differentiable) black box.

[CV-79] Neural Tangent Knowledge Distillation for Optical Convolutional Networks

【Quick Read】: Hybrid optical neural networks (ONNs) face two practical hurdles: an accuracy gap relative to large digital networks during training, and degradation caused by discrepancies between simulated and fabricated hardware. The paper proposes a task-agnostic, hardware-agnostic end-to-end pipeline whose key innovation is Neural Tangent Knowledge Distillation (NTKD): it narrows the accuracy gap by aligning the optical model with an electronic teacher network, and after fabrication it guides fine-tuning of the digital backend to compensate for implementation errors. Experiments across datasets (MNIST, CIFAR, Carvana Masking) and hardware configurations show consistent ONN improvements, supporting reliable deployment from simulation to physical implementation.

Link: https://arxiv.org/abs/2508.08421
Authors: Jinlin Xiang, Minho Choi, Yubo Zhang, Zhihao Zhou, Arka Majumdar, Eli Shlizerman
Affiliations: University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Hybrid Optical Neural Networks (ONNs, typically consisting of an optical frontend and a digital backend) offer an energy-efficient alternative to fully digital deep networks for real-time, power-constrained systems. However, their adoption is limited by two main challenges: the accuracy gap compared to large-scale networks during training, and discrepancies between simulated and fabricated systems that further degrade accuracy. While previous work has proposed end-to-end optimizations for specific datasets (e.g., MNIST) and optical systems, these approaches typically lack generalization across tasks and hardware designs. To address these limitations, we propose a task-agnostic and hardware-agnostic pipeline that supports image classification and segmentation across diverse optical systems. To assist optical system design before training, we estimate achievable model accuracy based on user-specified constraints such as physical size and the dataset. For training, we introduce Neural Tangent Knowledge Distillation (NTKD), which aligns optical models with electronic teacher networks, thereby narrowing the accuracy gap. After fabrication, NTKD also guides fine-tuning of the digital backend to compensate for implementation errors. Experiments on multiple datasets (e.g., MNIST, CIFAR, Carvana Masking) and hardware configurations show that our pipeline consistently improves ONN performance and enables practical deployment in both pre-fabrication simulations and physical implementations.
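NTKD aligns an optical model with an electronic teacher; the neural-tangent-kernel details are specific to the paper, but the underlying distillation signal can be illustrated with a standard temperature-softened logit-matching loss. A generic sketch, not the paper's formulation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax over the last axis, with temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))) * T * T)
```

The loss is zero when student and teacher agree and grows as their softened predictions diverge; in the paper the "student" would be the hybrid optical model and the "teacher" a purely electronic network.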

[CV-80] Spatiotemporally Consistent Indoor Lighting Estimation with Diffusion Priors SIGGRAPH2025

【Quick Read】: This paper addresses the highly ill-posed problem of estimating indoor lighting from a single image or video, especially when lighting varies both spatially and temporally. The key is optimizing a continuous light field represented as an MLP using 2D diffusion priors: a pre-trained image diffusion model is fine-tuned to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes, enabling zero-shot generalization to in-the-wild scenes. The method improves single-image/video lighting estimation and, notably, achieves spatiotemporally consistent lighting estimation from real-world videos.

Link: https://arxiv.org/abs/2508.08384
Authors: Mutian Tong, Rundi Wu, Changxi Zheng
Affiliations: Columbia University
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages. Accepted by SIGGRAPH 2025 as Conference Paper

Abstract:Indoor lighting estimation from a single image or video remains a challenge due to its highly ill-posed nature, especially when the lighting condition of the scene varies spatially and temporally. We propose a method that estimates from an input video a continuous light field describing the spatiotemporally varying lighting of the scene. We leverage 2D diffusion priors for optimizing such light field represented as a MLP. To enable zero-shot generalization to in-the-wild scenes, we fine-tune a pre-trained image diffusion model to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes. We evaluate our method on indoor lighting estimation from a single image or video and show superior performance over compared baselines. Most importantly, we highlight results on spatiotemporally consistent lighting estimation from in-the-wild videos, which is rarely demonstrated in previous works.

[CV-81] Designing Object Detection Models for TinyML: Foundations, Comparative Analysis, Challenges, and Emerging Solutions

【Quick Read】: This survey addresses the difficulty of deploying computationally heavy object detection (OD) models on resource-constrained IoT devices for efficient, real-time inference. Its key contribution is a systematic analysis of optimization techniques for OD models in TinyML environments, including quantization, pruning, knowledge distillation, and neural architecture search, covering both theoretical approaches and practical implementations to bridge academic research and real-world edge-AI deployment.

Link: https://arxiv.org/abs/2508.08352
Authors: Christophe EL Zeinaty, Wassim Hamidouche, Glenn Herrou, Daniel Menard
Affiliations: Univ. Rennes, INSA Rennes, CNRS, IETR - UMR 6164; KU 6G Research Center, Khalifa University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Object detection (OD) has become vital for numerous computer vision applications, but deploying it on resource-constrained IoT devices presents a significant challenge. These devices, often powered by energy-efficient microcontrollers, struggle to handle the computational load of deep learning-based OD models. This issue is compounded by the rapid proliferation of IoT devices, predicted to surpass 150 billion by 2030. TinyML offers a compelling solution by enabling OD on ultra-low-power devices, paving the way for efficient and real-time processing at the edge. Although numerous survey papers have been published on this topic, they often overlook the optimization challenges associated with deploying OD models in TinyML environments. To address this gap, this survey paper provides a detailed analysis of key optimization techniques for deploying OD models on resource-constrained devices. These techniques include quantization, pruning, knowledge distillation, and neural architecture search. Furthermore, we explore both theoretical approaches and practical implementations, bridging the gap between academic research and real-world edge artificial intelligence deployment. Finally, we compare the key performance indicators (KPIs) of existing OD implementations on microcontroller devices, highlighting the achieved maturity level of these solutions in terms of both prediction accuracy and efficiency. We also provide a public repository to continually track developments in this fast-evolving field: this https URL.
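Among the surveyed techniques, quantization is the most mechanical to illustrate. Below is a minimal sketch of uniform affine post-training quantization of a weight array; the 8-bit width and rounding policy are illustrative choices, not tied to any specific TinyML toolchain:

```python
import numpy as np

def quantize_dequantize(w, num_bits=8):
    """Uniform affine quantization: map floats to num_bits integers and back.

    Returns the dequantized weights plus the (scale, zero_point) pair a
    deployment runtime would store alongside the integer tensor.
    """
    w = np.asarray(w, dtype=float)
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = round(qmin - lo / scale)
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax)  # integer codes
    return (q - zero_point) * scale, (scale, zero_point)

w = np.linspace(-1.0, 1.0, 5)
w_hat, (scale, zero_point) = quantize_dequantize(w)  # rounding error bounded by scale/2
```

The reconstruction error per weight is at most half the quantization step, which is the basic accuracy/size tradeoff the surveyed methods refine.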

[CV-82] ImageDDI: Image-enhanced Molecular Motif Sequence Representation for Drug-Drug Interaction Prediction

【Quick Read】: To mitigate adverse health effects of simultaneous multi-drug use, the core task is accurate identification and prediction of drug-drug interactions (DDIs). Existing methods are bottlenecked by functional-motif representation learning, since DDIs fundamentally arise from motif interactions rather than whole molecular structures. The key is ImageDDI, which represents a drug pair from both global and local structure: molecules are tokenized into functional motifs whose combined sequence is embedded with a transformer encoder; spatial representation is then enhanced with molecular image information (texture, shadow, color, and planar spatial relationships); and an Adaptive Feature Fusion mechanism dynamically adapts the fusion of motif-sequence and visual features to improve generalization. Experiments show ImageDDI outperforms state-of-the-art methods on widely used datasets and stays competitive in both 2D and 3D image-enhanced settings.

Link: https://arxiv.org/abs/2508.08338
Authors: Yuqin He, Tengfei Ma, Chaoyi Li, Pengsen Ma, Hongxin Xiang, Jianmin Wang, Yiping Liu, Bosheng Song, Xiangxiang Zeng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by Information Fusion

Abstract:To mitigate the potential adverse health effects of simultaneous multi-drug use, including unexpected side effects and interactions, accurately identifying and predicting drug-drug interactions (DDIs) is considered a crucial task in the field of deep learning. Although existing methods have demonstrated promising performance, they suffer from the bottleneck of limited functional motif-based representation learning, as DDIs are fundamentally caused by motif interactions rather than the overall drug structures. In this paper, we propose an Image-enhanced molecular motif sequence representation framework for \textbfDDI prediction, called ImageDDI, which represents a pair of drugs from both global and local structures. Specifically, ImageDDI tokenizes molecules into functional motifs. To effectively represent a drug pair, their motifs are combined into a single sequence and embedded using a transformer-based encoder, starting from the local structure representation. By leveraging the associations between drug pairs, ImageDDI further enhances the spatial representation of molecules using global molecular image information (e.g. texture, shadow, color, and planar spatial relationships). To integrate molecular visual information into functional motif sequence, ImageDDI employs Adaptive Feature Fusion, enhancing the generalization of ImageDDI by dynamically adapting the fusion process of feature representations. Experimental results on widely used datasets demonstrate that ImageDDI outperforms state-of-the-art methods. Moreover, extensive experiments show that ImageDDI achieved competitive performance in both 2D and 3D image-enhanced scenarios compared to other models.

[CV-83] Evaluation of State-of-the-Art Deep Learning Techniques for Plant Disease and Pest Detection

【Quick Read】: This review targets the limited speed and accuracy of manual plant disease and pest identification. The key lies in AI/ML/DL-based, image-driven computer vision methods, organized into a five-part taxonomy: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models. Vision-transformer variants such as the Hierarchical Vision Transformer (HvT) stand out, exceeding 99.3% accuracy in plant disease detection and clearly outperforming architectures like MobileNetV3, underscoring the value of modern AI methods for this task.

Link: https://arxiv.org/abs/2508.08317
Authors: Saptarshi Banerjee, Tausif Mallick, Amlan Chakroborty, Himadri Nath Saha, Nityananda T. Takur
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AI/ML, Computer Vision

Abstract:Addressing plant diseases and pests is critical for enhancing crop production and preventing economic losses. Recent advances in artificial intelligence (AI), machine learning (ML), and deep learning (DL) have significantly improved the precision and efficiency of detection methods, surpassing the limitations of manual identification. This study reviews modern computer-based techniques for detecting plant diseases and pests from images, including recent AI developments. The methodologies are organized into five categories: hyperspectral imaging, non-visualization techniques, visualization approaches, modified deep learning architectures, and transformer models. This structured taxonomy provides researchers with detailed, actionable insights for selecting advanced state-of-the-art detection methods. A comprehensive survey of recent work and comparative studies demonstrates the consistent superiority of modern AI-based approaches, which often outperform older image analysis methods in speed and accuracy. In particular, vision transformers such as the Hierarchical Vision Transformer (HvT) have shown accuracy exceeding 99.3% in plant disease detection, outperforming architectures like MobileNetV3. The study concludes by discussing system design challenges, proposing solutions, and outlining promising directions for future research.

[CV-84] MoSSDA: A Semi-Supervised Domain Adaptation Framework for Multivariate Time-Series Classification using Momentum Encoder

【Quick Read】: Deep models degrade sharply when source and target distributions differ (domain shift), a problem aggravated in multivariate time-series classification by noise sensitivity and sequential dependencies. The key is MoSSDA, a two-step momentum-encoder-based semi-supervised domain adaptation framework: a domain-invariant encoder jointly learns source and target representations for robustness and domain invariance; a mixup-enhanced positive contrastive module with an online momentum encoder then improves consistency and discriminability; and a two-stage process separating gradient flow between the encoders and the classifier yields richer, more complex representations, achieving state-of-the-art results with limited labeled target data and no data augmentation.

Link: https://arxiv.org/abs/2508.08280
Authors: Seonyoung Kim, Dongil Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning has emerged as the most promising approach in various fields; however, when the distributions of training and test data are different (domain shift), the performance of deep learning models can degrade. Semi-supervised domain adaptation (SSDA) is a major approach for addressing this issue, assuming that a fully labeled training set (source domain) is available, but the test set (target domain) provides labels only for a small subset. In this study, we propose a novel two-step momentum encoder-utilized SSDA framework, MoSSDA, for multivariate time-series classification. Time series data are highly sensitive to noise, and sequential dependencies cause domain shifts resulting in critical performance degradation. To obtain a robust, domain-invariant and class-discriminative representation, MoSSDA employs a domain-invariant encoder to learn features from both source and target domains. Subsequently, the learned features are fed to a mixup-enhanced positive contrastive module consisting of an online momentum encoder. The final classifier is trained with learned features that exhibit consistency and discriminability with limited labeled target domain data, without data augmentation. We applied a two-stage process by separating the gradient flow between the encoders and the classifier to obtain rich and complex representations. Through extensive experiments on six diverse datasets, MoSSDA achieved state-of-the-art performance for three different backbones and various unlabeled ratios in the target domain data. The Ablation study confirms that each module, including two-stage learning, is effective in improving the performance. Our code is available at this https URL
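The online momentum encoder in MoSSDA's contrastive module is, in the usual construction, maintained as an exponential moving average (EMA) of the online encoder's weights. A minimal sketch with plain float lists standing in for weight tensors; the momentum constant 0.99 is an assumed typical value, not taken from the paper:

```python
def ema_update(online, momentum, m=0.99):
    """One EMA step: momentum <- m * momentum + (1 - m) * online."""
    return [m * q + (1.0 - m) * p for p, q in zip(online, momentum)]

# The momentum weights track the online weights with a smooth lag,
# which stabilises the contrastive targets during training.
weights = [0.0]
for _ in range(500):
    weights = ema_update([1.0], weights, m=0.99)
```

In practice the same update would run over every parameter tensor of the encoder after each optimizer step, with no gradient flowing through the momentum branch.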

[CV-85] Efficient motion-based metrics for video frame interpolation

【Quick Read】: This paper addresses perceptual quality assessment for video frame interpolation (VFI), where traditional metrics such as PSNR or SSIM poorly reflect human judgments of motion continuity and image quality. The key is a motion-based metric that measures the divergence of motion fields: on the BVI-VFI dataset it correlates reasonably with subjective scores (PLCC = 0.51) while being markedly cheaper to compute than established motion-based alternatives such as FloLPIPS, enabling efficient, perception-aligned VFI evaluation.

Link: https://arxiv.org/abs/2508.09078
Authors: Conall Daly, Darren Ramsook, Anil Kokaram
Affiliations: Sigmedia Group, Electronic and Electrical Engineering Dept., Trinity College Dublin
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: SPIE2025 - Applications of Digital Image Processing XLVIII accepted manuscript

Abstract:Video frame interpolation (VFI) offers a way to generate intermediate frames between consecutive frames of a video sequence. Although the development of advanced frame interpolation algorithms has received increased attention in recent years, assessing the perceptual quality of interpolated content remains an ongoing area of research. In this paper, we investigate simple ways to process motion fields, with the purposes of using them as video quality metric for evaluating frame interpolation algorithms. We evaluate these quality metrics using the BVI-VFI dataset which contains perceptual scores measured for interpolated sequences. From our investigation we propose a motion metric based on measuring the divergence of motion fields. This metric correlates reasonably with these perceptual scores (PLCC=0.51) and is more computationally efficient (x2.7 speedup) compared to FloLPIPS (a well known motion-based metric). We then use our new proposed metrics to evaluate a range of state of the art frame interpolation metrics and find our metrics tend to favour more perceptual pleasing interpolated frames that may not score highly in terms of PSNR or SSIM.
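The proposed metric scores interpolated content by the divergence of its motion fields. A minimal sketch of such a score over a dense flow array; the aggregation by mean absolute divergence is an assumption here, and the paper's exact pooling may differ:

```python
import numpy as np

def flow_divergence_score(flow):
    """Mean absolute divergence of a dense motion field.

    flow: (H, W, 2) array of (u, v) motion vectors. Larger values indicate
    less smooth motion, taken as a proxy for interpolation artefacts.
    """
    u, v = flow[..., 0], flow[..., 1]
    du_dx = np.gradient(u, axis=1)   # du/dx (finite differences)
    dv_dy = np.gradient(v, axis=0)   # dv/dy
    return float(np.mean(np.abs(du_dx + dv_dy)))

ys, xs = np.mgrid[0:8, 0:8]
expanding = np.stack([xs, ys], axis=-1).astype(float)  # u = x, v = y, so div = 2 everywhere
score = flow_divergence_score(expanding)
```

A purely translational flow has zero divergence, while expansion, contraction, or incoherent motion raises the score.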

[CV-86] A new dataset and comparison for multi-camera frame synthesis

【Quick Read】: This paper tackles the difficulty of fairly comparing frame interpolation and view synthesis on real image data: existing datasets are biased toward either temporal single-camera motion or stereoscopic depth estimation, so the two families cannot be compared directly. The key is a new multi-camera dataset captured with a custom-built dense linear camera array, enabling unified evaluation of both approaches on view in-betweening. Results show deep learning methods do not significantly beat classical ones on real images, with 3D Gaussian Splatting trailing the best frame interpolators by as much as 3.5 dB PSNR; on synthetic scenes the situation reverses, with 3D Gaussian Splatting leading by almost 5 dB PSNR at the 95% confidence level.

Link: https://arxiv.org/abs/2508.09068
Authors: Conall Daly, Anil Kokaram
Affiliations: Sigmedia Group, Electronic and Electrical Engineering Dept., Trinity College Dublin
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: SPIE2025 - Applications of Digital Image Processing XLVIII accepted manuscript

Abstract:Many methods exist for frame synthesis in image sequences but can be broadly categorised into frame interpolation and view synthesis techniques. Fundamentally, both frame interpolation and view synthesis tackle the same task, interpolating a frame given surrounding frames in time or space. However, most frame interpolation datasets focus on temporal aspects with single cameras moving through time and space, while view synthesis datasets are typically biased toward stereoscopic depth estimation use cases. This makes direct comparison between view synthesis and frame interpolation methods challenging. In this paper, we develop a novel multi-camera dataset using a custom-built dense linear camera array to enable fair comparison between these approaches. We evaluate classical and deep learning frame interpolators against a view synthesis method (3D Gaussian Splatting) for the task of view in-betweening. Our results reveal that deep learning methods do not significantly outperform classical methods on real image data, with 3D Gaussian Splatting actually underperforming frame interpolators by as much as 3.5 dB PSNR. However, in synthetic scenes, the situation reverses – 3D Gaussian Splatting outperforms frame interpolation algorithms by almost 5 dB PSNR at a 95% confidence level.
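The dB gaps quoted above are differences in PSNR. For reference, a minimal PSNR implementation over 8-bit images (the 255 peak is the conventional default for such data):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((np.asarray(ref, dtype=float) - np.asarray(test, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(peak * peak / mse))

value = psnr(np.zeros((4, 4)), np.full((4, 4), 16.0))
```

A 3.5 dB deficit, as reported for 3D Gaussian Splatting on real data, corresponds to more than doubling the mean squared error relative to the best frame interpolator.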

[CV-87] Frequency-Assisted Adaptive Sharpening Scheme Considering Bitrate and Quality Tradeoff

【Quick Read】: This paper studies how to raise video quality via sharpening while keeping bandwidth cost in check: stronger sharpening enhances texture and reduces blur but inflates bitrate (degrading QoS) and risks over-sharpening, beyond which subjective quality stops improving. The key is FreqSP, a Frequency-assisted Sharpening level Prediction model: each video is labeled with the sharpening level at the optimal bitrate-quality tradeoff, and from the uncompressed source the model combines deep CNN features with high-frequency components to accurately estimate that optimal sharpening level.

Link: https://arxiv.org/abs/2508.08854
Authors: Yingxue Pang, Shijie Zhao, Haiqiang Wang, Gen Zhan, Junlin Li, Li Zhang
Affiliations: Bytedance Inc.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Sharpening is a widely adopted technique to improve video quality, which can effectively emphasize textures and alleviate blurring. However, increasing the sharpening level comes with a higher video bitrate, resulting in degraded Quality of Service (QoS). Furthermore, the video quality does not necessarily improve with increasing sharpening levels, leading to issues such as over-sharpening. Clearly, it is essential to figure out how to boost video quality with a proper sharpening level while also controlling bandwidth costs effectively. This paper thus proposes a novel Frequency-assisted Sharpening level Prediction model (FreqSP). We first label each video with the sharpening level correlating to the optimal bitrate and quality tradeoff as ground truth. Then taking uncompressed source videos as inputs, the proposed FreqSP leverages intricate CNN features and high-frequency components to estimate the optimal sharpening level. Extensive experiments demonstrate the effectiveness of our method.
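FreqSP leverages high-frequency components as a predictive signal. One simple way to quantify them, shown here as an illustrative sketch rather than the paper's feature extractor, is the fraction of spectral energy above a normalised frequency cutoff:

```python
import numpy as np

def high_freq_ratio(img, cutoff=0.25):
    """Fraction of 2-D spectral energy above a normalised frequency cutoff.

    img: 2-D array. Per-axis frequencies are normalised to [-0.5, 0.5];
    energy at radial frequency > cutoff counts as 'high frequency'.
    """
    img = np.asarray(img, dtype=float)
    spec = np.abs(np.fft.fft2(img)) ** 2
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    radius = np.sqrt(fy * fy + fx * fx)
    total = spec.sum()
    return float(spec[radius > cutoff].sum() / total) if total > 0 else 0.0

flat = np.ones((8, 8))                                          # no texture
checker = ((np.indices((8, 8)).sum(axis=0) % 2) * 2 - 1).astype(float)  # pure Nyquist texture
```

Sharpening boosts exactly this band, which is why frequency statistics are informative about how much additional sharpening (and bitrate) a clip can absorb.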

[CV-88] Preprocessing Algorithm Leverag ing Geometric Modeling for Scale Correction in Hyperspectral Images for Improved Unmixing Performance

【Quick Read】: This paper targets the accuracy and convergence problems that scale-induced spectral variability (from topography, illumination, and shadowing) causes in hyperspectral unmixing: large multiplicative variations severely disrupt model fitting, especially in complex scenes with nonlinear variability. The key is a novel preprocessing algorithm that corrects scale before unmixing: by isolating and compensating these large-scale multiplicative effects, it hands downstream unmixing methods a cleaner input so they can focus on modeling nonlinear spectral variability and abundance estimation. Across a broad range of state-of-the-art algorithms on two synthetic and two real datasets, the preprocessing step cuts errors by close to 50% in many cases, demonstrating its value as a general component of practical unmixing pipelines.

Link: https://arxiv.org/abs/2508.08431
Authors: Praveen Sumanasekara, Athulya Ratnayake, Buddhi Wijenayake, Keshawa Ratnayake, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath
Affiliations: University of Peradeniya, Sri Lanka; Purdue University, West Lafayette, IN, USA
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 20 pages, 17 figures

Abstract:Spectral variability significantly impacts the accuracy and convergence of hyperspectral unmixing algorithms. While many methods address complex spectral variability, large-scale variations in spectral signature scale caused by factors such as topography, illumination, and shadowing remain a major challenge. These variations often degrade unmixing performance and complicate model fitting. In this paper, we propose a novel preprocessing algorithm that corrects scale-induced spectral variability prior to unmixing. By isolating and compensating for these large-scale multiplicative effects, the algorithm provides a cleaner input, enabling unmixing methods to focus more effectively on modeling nonlinear spectral variability and abundance estimation. We present a rigorous mathematical framework to describe scale variability and extensive experimental validation of the proposed algorithm. Furthermore, the algorithm’s impact is evaluated across a broad spectrum of state-of-the-art unmixing algorithms on two synthetic and two real hyperspectral datasets. The proposed preprocessing step consistently improves the performance of these algorithms, including those specifically designed to handle spectral variability, with error reductions close to 50% in many cases. This demonstrates that scale correction acts as a complementary step, facilitating more accurate unmixing by existing methods. The algorithm’s generality and significant impact highlight its potential as a key component in practical hyperspectral unmixing pipelines. The implementation code will be made publicly available upon publication.
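The preprocessing removes large multiplicative scale effects before unmixing. A toy sketch of the idea, using a per-pixel least-squares scale against a known reference spectrum; this reference-based estimator is an illustrative simplification, not the paper's geometric model:

```python
import numpy as np

def correct_scale(pixels, reference):
    """Remove per-pixel multiplicative scaling relative to `reference`.

    pixels: (N, B) spectra; reference: (B,) spectrum. For each pixel we fit
    the scale a minimising ||pixel - a * reference||_2 in closed form and
    divide it out, returning the corrected spectra and the scales.
    """
    pixels = np.asarray(pixels, dtype=float)
    ref = np.asarray(reference, dtype=float)
    scales = pixels @ ref / (ref @ ref)       # closed-form least-squares scale
    scales = np.clip(scales, 1e-8, None)      # guard against degenerate pixels
    return pixels / scales[:, None], scales

ref = np.array([1.0, 2.0, 3.0])
pixels = np.stack([2.0 * ref, 0.5 * ref])    # same material under different illumination
corrected, scales = correct_scale(pixels, ref)
```

After correction, both pixels collapse onto the same spectral shape, so an unmixing algorithm no longer has to absorb the illumination difference into its variability model.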

[CV-89] Variational volume reconstruction with the Deep Ritz Method

【Quick Read】: This paper addresses variational volume reconstruction from sparse, noisy slice data, motivated by biomedical applications such as MRI slice-to-volume reconstruction (SVR). The main challenges are noise-sensitive boundary extraction via image segmentation, low accuracy with limited slice planes, and the expense of traditional mesh-based methods. The key is a variational objective combining a regression loss that operates directly on noisy slices (avoiding segmentation) with a modified Cahn-Hilliard energy using anisotropic diffusion to regularize the geometry; the phase field is discretized with a neural network, the objective is approximated at each step by Monte Carlo integration, and ADAM drives fast convergence, yielding high-quality reconstructions in seconds even from sparse, noisy data.

Link: https://arxiv.org/abs/2508.08309
Authors: Conor Rowan, Sumedh Soman, John A. Evans
Affiliations: University of Colorado Boulder
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We present a novel approach to variational volume reconstruction from sparse, noisy slice data using the Deep Ritz method. Motivated by biomedical imaging applications such as MRI-based slice-to-volume reconstruction (SVR), our approach addresses three key challenges: (i) the reliance on image segmentation to extract boundaries from noisy grayscale slice images, (ii) the need to reconstruct volumes from a limited number of slice planes, and (iii) the computational expense of traditional mesh-based methods. We formulate a variational objective that combines a regression loss designed to avoid image segmentation by operating on noisy slice data directly with a modified Cahn-Hilliard energy incorporating anisotropic diffusion to regularize the reconstructed geometry. We discretize the phase field with a neural network, approximate the objective at each optimization step with Monte Carlo integration, and use ADAM to find the minimum of the approximated variational objective. While the stochastic integration may not yield the true solution to the variational problem, we demonstrate that our method reliably produces high-quality reconstructed volumes in a matter of seconds, even when the slice data is sparse and noisy.
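A hedged sketch of the variational setup: a regression term on (noisy) slice samples plus a Cahn-Hilliard-style double-well bulk energy, estimated by Monte Carlo integration over the unit cube. The paper discretizes the phase field with a neural network and minimizes with ADAM; here, for illustration only, the phase field is a fixed smoothed-sphere function and the anisotropic gradient term is omitted:

```python
import math, random

def double_well(u):
    # Cahn-Hilliard-style bulk energy, minimised at the pure phases u=0 and u=1
    return (u * u) * (1.0 - u) ** 2

def mc_objective(phi, slice_points, slice_values, n_mc=2000, lam=0.1, seed=0):
    """Monte Carlo estimate of: regression loss on slice samples
    + lam * integral of the double-well energy over the unit cube."""
    rng = random.Random(seed)
    data = sum((phi(*p) - v) ** 2
               for p, v in zip(slice_points, slice_values)) / len(slice_values)
    bulk = sum(double_well(phi(rng.random(), rng.random(), rng.random()))
               for _ in range(n_mc)) / n_mc
    return data + lam * bulk

# toy "reconstruction": a smoothed sphere of radius 0.3 centred in the cube
phi = lambda x, y, z: 1.0 / (1.0 + math.exp(40.0 * (math.dist((x, y, z), (0.5, 0.5, 0.5)) - 0.3)))
pts = [(0.5, 0.5, 0.5), (0.9, 0.9, 0.9)]   # one sample inside, one outside
loss = mc_objective(phi, pts, [1.0, 0.0])
```

Because the sphere matches both slice samples and is nearly pure-phase almost everywhere, the stochastic objective is close to zero, mirroring how a well-fitted network phase field would score.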

人工智能

[AI-0] BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在复杂信息检索任务中难以平衡搜索广度与推理深度的问题。现有方法受限于串行、缓慢的查询方式导致覆盖不足,以及原始输入噪声干扰多步推理连贯性。其解决方案的关键在于提出BrowseMaster框架,通过程序化增强的规划器-执行器协同机制实现分工:规划器根据任务约束动态制定并调整搜索策略,执行器则高效、精准地获取结构化证据供规划器使用,从而在保持长程推理一致性的同时实现广泛而系统的探索,突破了传统代理在规模与精度之间的权衡限制。

链接: https://arxiv.org/abs/2508.09129
作者: Xianghe Pang,Shuo Tang,Rui Ye,Yuwen Du,Yaxin Du,Siheng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective information seeking in the vast and ever-growing digital landscape requires balancing expansive search with strategic reasoning. Current large language model (LLM)-based agents struggle to achieve this balance due to limitations in search breadth and reasoning depth, where slow, serial querying restricts coverage of relevant sources and noisy raw inputs disrupt the continuity of multi-step reasoning. To address these challenges, we propose BrowseMaster, a scalable framework built around a programmatically augmented planner-executor agent pair. The planner formulates and adapts search strategies based on task constraints, while the executor conducts efficient, targeted retrieval to supply the planner with concise, relevant evidence. This division of labor preserves coherent, long-horizon reasoning while sustaining broad and systematic exploration, overcoming the trade-off that limits existing agents. Extensive experiments on challenging English and Chinese benchmarks show that BrowseMaster consistently outperforms open-source and proprietary baselines, achieving scores of 30.0 on BrowseComp-en and 46.5 on BrowseComp-zh, which demonstrates its strong capability in complex, reasoning-heavy information-seeking tasks at scale.

[AI-1] SMA: Who Said That? Auditing Membership Leakage in Semi-Black-box RAG Controlling

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)及其多模态版本(Multimodal Retrieval-Augmented Generation, MRAG)系统中内容来源归属不明确的问题,即现有成员推理(Membership Inference)方法无法可靠区分生成内容源自预训练数据、外部检索知识还是用户输入,从而削弱了隐私泄露的责任追溯能力。解决方案的关键在于提出首个面向源感知的成员审计机制(Source-aware Membership Audit, SMA),其核心创新包括:(1)基于零阶优化的归因估计机制,通过大规模扰动采样与岭回归建模,在半黑盒环境下稳健逼近输入token对输出的真实影响;(2)引入跨模态归因技术,利用多模态大语言模型(Multimodal Large Language Models, MLLMs)将图像输入映射为文本描述,首次实现MRAG系统中图像检索痕迹的token级成员推理,推动成员推理从“数据是否被记忆”转向“内容来自何处”的新范式。

链接: https://arxiv.org/abs/2508.09105
作者: Shixuan Sun,Siyuan Liang,Ruoyu Chen,Jianjie Huang,Jingzhi Li,Xiaochun Cao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) and its Multimodal Retrieval-Augmented Generation (MRAG) significantly improve the knowledge coverage and contextual understanding of Large Language Models (LLMs) by introducing external knowledge sources. However, retrieval and multimodal fusion obscure content provenance, rendering existing membership inference methods unable to reliably attribute generated outputs to pre-training, external retrieval, or user input, thus undermining privacy leakage accountability. To address these challenges, we propose the first Source-aware Membership Audit (SMA) that enables fine-grained source attribution of generated content in a semi-black-box setting with retrieval control. To accommodate the environmental constraints of semi-black-box auditing, we further design an attribution estimation mechanism based on zero-order optimization, which robustly approximates the true influence of input tokens on the output through large-scale perturbation sampling and ridge regression modeling. In addition, SMA introduces a cross-modal attribution technique that projects image inputs into textual descriptions via MLLMs, enabling token-level attribution in the text modality, which for the first time facilitates membership inference on image retrieval traces in MRAG systems. This work shifts the focus of membership inference from 'whether the data has been memorized' to 'where the content is sourced from', offering a novel perspective for auditing data provenance in complex generative systems.
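The zero-order attribution mechanism can be sketched as follows: sample random keep/drop masks over input tokens, query the black box, and fit ridge regression to estimate each token's influence. The toy scalar black box, sample sizes, and function names are invented for the example; the real system perturbs prompts to an (M)LLM:

```python
import random

def solve(A, b):
    # Gauss-Jordan elimination for a small dense linear system A w = b
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_attribution(black_box, n_tokens, n_samples=200, lam=1e-3, seed=0):
    """Zero-order attribution: sample random keep/drop masks, observe the
    scalar output, and fit ridge regression for per-token influence."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_tokens)] for _ in range(n_samples)]
    y = [black_box(m) for m in X]
    XtX = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
            for j in range(n_tokens)] for i in range(n_tokens)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(n_tokens)]
    return solve(XtX, Xty)

# hypothetical black box: token 0 helps the output, token 1 is inert, token 2 hurts
f_box = lambda m: 2.0 * m[0] - 1.0 * m[2]
w = ridge_attribution(f_box, 3)
```

On this noise-free toy, the recovered coefficients approach the true per-token effects, which is the property the audit relies on at scale.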

[AI-2] owards Universal Neural Inference

【速读】:该论文旨在解决现实世界中异构结构化数据(heterogeneous structured data)因模式差异、语义不一致和特征顺序无固定性而导致的通用模型难以跨数据集整合信息的问题。解决方案的关键在于提出ASPIRE(Arbitrary Set-based Permutation-Invariant Reasoning Engine),其核心创新是将一种排列不变的集合型Transformer与语义锚定模块相结合,该模块利用自然语言描述、数据集元信息及上下文示例来学习跨数据集的特征依赖关系,从而实现对任意特征-值对集合和示例的输入处理,自动对齐不同表格间的语义,并支持针对任意目标变量的预测任务。训练完成后,ASPIRE无需额外微调即可泛化至新推理任务,并在开放世界场景下具备成本感知的主动特征获取能力,能够在测试时预算约束下选择最具信息量的特征用于未见数据的预测。

链接: https://arxiv.org/abs/2508.09100
作者: Shreyas Bhat Brahmavar,Yang Li,Junier Oliva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Real-world data often appears in diverse, disjoint forms – with varying schemas, inconsistent semantics, and no fixed feature ordering – making it challenging to build general-purpose models that can leverage information across datasets. We introduce ASPIRE, Arbitrary Set-based Permutation-Invariant Reasoning Engine, a Universal Neural Inference model for semantic reasoning and prediction over heterogeneous structured data. ASPIRE combines a permutation-invariant, set-based Transformer with a semantic grounding module that incorporates natural language descriptions, dataset metadata, and in-context examples to learn cross-dataset feature dependencies. This architecture allows ASPIRE to ingest arbitrary sets of feature–value pairs and support examples, align semantics across disjoint tables, and make predictions for any specified target. Once trained, ASPIRE generalizes to new inference tasks without additional tuning. In addition to delivering strong results across diverse benchmarks, ASPIRE naturally supports cost-aware active feature acquisition in an open-world setting, selecting informative features under test-time budget constraints for an arbitrary unseen dataset. These capabilities position ASPIRE as a step toward truly universal, semantics-aware inference over structured data.
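The permutation-invariant, set-based ingestion can be illustrated with a tiny DeepSets-style sketch: embed each (feature, value) pair independently, then pool with an order-free mean. The embedding here is a deliberately crude deterministic stand-in (a checksum of the feature name), not ASPIRE's semantic grounding module:

```python
import math

def encode_set(pairs):
    """DeepSets-style encoder: embed each (feature, value) pair
    independently, then mean-pool so input order cannot matter."""
    def embed(name, value):
        h = sum(map(ord, name)) % 97 / 97.0   # crude deterministic name embedding
        return (math.tanh(h + value), math.tanh(h * value))
    z = [embed(n, v) for n, v in pairs]
    return tuple(sum(col) / len(z) for col in zip(*z))

a = encode_set([("age", 0.3), ("bmi", 0.8), ("hr", 0.1)])
b = encode_set([("hr", 0.1), ("age", 0.3), ("bmi", 0.8)])  # same set, shuffled
```

Shuffling the pairs leaves the encoding unchanged, while changing a value changes it, which is exactly the invariance a schema-free tabular model needs.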

[AI-3] SPARC: Soft Probabilistic Adaptive multi-interest Retrieval Model via Codebooks for recommender system

【速读】:该论文旨在解决当前多兴趣检索(multi-interest retrieval)方法中存在的两大核心问题:一是兴趣表示通常依赖预定义的外部知识,缺乏随用户实时消费偏好动态演化的能力;二是在线推理普遍采用过度利用策略,仅匹配用户已有兴趣,忽视对新颖和长尾兴趣的主动探索与发现。解决方案的关键在于提出一种名为SPARC(Soft Probabilistic Adaptive Retrieval Model via Codebooks)的新颖检索框架,其核心创新包括:1)利用残差量化变分自编码器(Residual Quantized Variational Autoencoder, RQ-VAE)构建离散的兴趣空间,并与工业级大规模推荐模型联合训练,从而挖掘能够感知用户反馈并动态演化的行为感知兴趣;2)设计一个概率兴趣模块,预测整个动态离散兴趣空间上的概率分布,支持在线推理中高效的“软搜索”策略,将传统被动匹配模式转变为积极探索模式,显著提升兴趣发现能力。

链接: https://arxiv.org/abs/2508.09090
作者: Jialiang Shi,Yaguang Dou,Tian Qi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages

点击查看摘要

Abstract:Modeling multi-interests has arisen as a core problem in real-world RS. Current multi-interest retrieval methods pose two major challenges: 1) Interests, typically extracted from predefined external knowledge, are invariant, failing to dynamically evolve with users' real-time consumption preferences. 2) Online inference typically employs an over-exploited strategy, mainly matching users' existing interests while lacking proactive exploration and discovery of novel and long-tail interests. To address these challenges, we propose a novel retrieval framework named SPARC (Soft Probabilistic Adaptive Retrieval Model via Codebooks). Our contribution is twofold. First, the framework utilizes a Residual Quantized Variational Autoencoder (RQ-VAE) to construct a discretized interest space. It achieves joint training of the RQ-VAE with an industrial large-scale recommendation model, mining behavior-aware interests that can perceive user feedback and evolve dynamically. Second, a probabilistic interest module predicts the probability distribution over the entire dynamic and discrete interest space. This facilitates an efficient "soft-search" strategy during online inference, revolutionizing the retrieval paradigm from "passive matching" to "proactive exploration" and thereby effectively promoting interest discovery. Online A/B tests on an industrial platform with tens of millions of daily active users have achieved substantial gains in business metrics: a +0.9% increase in user view duration, a +0.4% increase in user page views (PV), and a +22.7% improvement in PV500 (new content reaching 500 PVs in 24 hours). Offline evaluations are conducted on open-source Amazon Product datasets. Metrics such as Recall@K and Normalized Discounted Cumulative Gain@K (NDCG@K) also show consistent improvement. Both online and offline experiments validate the efficacy and practical value of the proposed method.
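The codebook side of the RQ-VAE can be sketched in isolation: residual quantization encodes a vector by repeatedly picking the nearest codeword to the remaining residual, yielding a coarse-to-fine discrete code. The two tiny codebooks below are invented toy values, and the learned encoder/decoder of the VAE is omitted:

```python
def rq_encode(x, codebooks):
    """Residual quantisation: at each level pick the codeword nearest
    to the current residual, then quantise what remains."""
    residual, codes, recon = list(x), [], [0.0] * len(x)
    for book in codebooks:
        idx = min(range(len(book)),
                  key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, book[i])))
        codes.append(idx)
        recon = [s + c for s, c in zip(recon, book[idx])]
        residual = [r - c for r, c in zip(residual, book[idx])]
    return codes, recon

# two tiny hypothetical codebooks: coarse interest clusters, then a refinement
books = [[[0.0, 0.0], [1.0, 1.0]],
         [[0.0, 0.1], [0.1, -0.1]]]
codes, recon = rq_encode([1.05, 0.95], books)
```

The resulting code path `[1, 1]` is the kind of discrete interest identifier over which SPARC's probabilistic module can place a distribution for "soft search".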

[AI-4] Dynamic Uncertainty-aware Multimodal Fusion for Outdoor Health Monitoring

【速读】:该论文旨在解决户外健康监测中因动态环境噪声和多模态数据质量差异导致的检测精度下降问题。具体而言,现有基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的方法在面对传感器输入噪声、生理信号波动噪声以及不同模态间噪声水平不一致时,难以实现鲁棒的多模态融合与缺失数据恢复。其解决方案的关键在于提出一种不确定性感知的多模态融合框架 DUAL-Health:首先通过当前与时间特征量化由输入噪声和波动噪声引起的模态不确定性;其次根据校准后的不确定性定制各模态的融合权重,以提升低质量模态的融合效率;最后在统一语义空间内对齐不同模态分布,增强从波动噪声模态中恢复数据的能力。实验证明该方法显著优于现有最优基线,在检测准确性和鲁棒性方面均取得提升。

链接: https://arxiv.org/abs/2508.09085
作者: Zihan Fang,Zheng Lin,Senkang Hu,Yihang Tao,Yiqin Deng,Xianhao Chen,Yuguang Fang
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 10 figures

点击查看摘要

Abstract:Outdoor health monitoring is essential to detect early abnormal health status for safeguarding human health and safety. Conventional outdoor monitoring relies on static multimodal deep learning frameworks, which require extensive training data from scratch and fail to capture subtle health status changes. Multimodal large language models (MLLMs) emerge as a promising alternative, utilizing only small datasets to fine-tune pre-trained, information-rich models for powerful health status monitoring. Unfortunately, MLLM-based outdoor health monitoring also faces significant challenges: (i) sensor data contains input noise stemming from sensor data acquisition and fluctuation noise caused by sudden changes in physiological signals due to dynamic outdoor environments, thus degrading the training performance; (ii) current transformer-based MLLMs struggle to achieve robust multimodal fusion, as they lack a design for fusing noisy modalities; (iii) modalities with varying noise levels hinder accurate recovery of missing data from fluctuating distributions. To combat these challenges, we propose an uncertainty-aware multimodal fusion framework, named DUAL-Health, for outdoor health monitoring in dynamic and noisy environments. First, to assess the impact of noise, we accurately quantify modality uncertainty caused by input and fluctuation noise with current and temporal features. Second, to empower efficient multimodal fusion with low-quality modalities, we customize the fusion weight for each modality based on quantified and calibrated uncertainty. Third, to enhance data recovery from fluctuating noisy modalities, we align modality distributions within a common semantic space. Extensive experiments demonstrate that our DUAL-Health outperforms state-of-the-art baselines in detection accuracy and robustness.
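The uncertainty-customized fusion weights can be illustrated with a minimal sketch: pass each modality's quantified uncertainty through a softmax over negated uncertainties, so noisier modalities are down-weighted. The temperature, toy features, and uncertainty values are assumptions for the example, not DUAL-Health's calibrated estimates:

```python
import math

def fuse(modality_feats, uncertainties, tau=1.0):
    """Fuse modality features with weights softmax(-uncertainty / tau):
    the noisier a modality, the less it contributes."""
    exps = [math.exp(-u / tau) for u in uncertainties]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(modality_feats[0])
    fused = [sum(w * f[d] for w, f in zip(weights, modality_feats))
             for d in range(dim)]
    return fused, weights

feats = [[1.0, 0.0],   # e.g. motion-sensor features
         [0.0, 1.0]]   # e.g. physiological-signal features
fused, w = fuse(feats, [0.2, 2.0])   # the second modality is much noisier
```

The weights sum to one, and the cleaner modality dominates the fused representation, which is the qualitative behaviour the framework's calibrated weighting is designed to achieve.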

[AI-5] CVCM Track Circuits Pre-emptive Failure Diagnostics for Predictive Maintenance Using Deep Neural Networks

【速读】:该论文旨在解决铁路轨道电路(track circuit)中连续可变电流调制(CVCM)系统因早期微小异常难以被传统检测方法识别而导致的故障预警滞后问题。此类异常常在监测信号中无明显变化,但会逐步演化为严重故障,引发连锁运营中断。解决方案的关键在于利用深度神经网络构建预测性维护框架,实现对异常类型的早期分类识别,且能在异常发生前1%的时间窗口内完成检测;同时结合校准预测(conformal prediction)提供置信度估计,在各类别上均保持99%的置信水平与一致覆盖,从而满足ISO-17359标准要求,提升维护决策的准确性与可靠性。

链接: https://arxiv.org/abs/2508.09054
作者: Debdeep Mukherjee(2),Eduardo Di Santi(1),Clément Lefebvre(1),Nenad Mijatovic(1),Victor Martin(1),Thierry Josse(3),Jonathan Brown(1),Kenza Saiah(1) ((1) Digital and Integrated Systems, Alstom (2) Innovation and Smart Mobility, Alstom (3) Project System Engineering, Alstom)
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Peer-reviewed conference paper. Presented at ICROMA 2025 (International Conference on Railway Operations Modelling and Analysis), Dresden, Germany. this https URL 8 pages, 6 figures, 1 table

点击查看摘要

Abstract:Track circuits are critical for railway operations, acting as the main signalling sub-system to locate trains. Continuous Variable Current Modulation (CVCM) is one such technology. Like any field-deployed, safety-critical asset, it can fail, triggering cascading disruptions. Many failures originate as subtle anomalies that evolve over time, often not visually apparent in monitored signals. Conventional approaches, which rely on clear signal changes, struggle to detect them early. Early identification of failure types is essential to improve maintenance planning, minimising downtime and revenue loss. Leveraging deep neural networks, we propose a predictive maintenance framework that classifies anomalies well before they escalate into failures. Validated on 10 CVCM failure cases across different installations, the method is ISO-17359 compliant and outperforms conventional techniques, achieving 99.31% overall accuracy with detection within 1% of anomaly onset. Through conformal prediction, we provide uncertainty estimates, reaching 99% confidence with consistent coverage across classes. Given CVCM's global deployment, the approach is scalable and adaptable to other track circuits and railway systems, enhancing operational reliability.
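The conformal-prediction step can be sketched with standard split conformal classification: take the quantile of calibration nonconformity scores (here, 1 minus the probability assigned to the true class) as a threshold, then include every class whose score falls under it. The calibration values and probabilities below are toy data with alpha=0.1 to keep the example small (the paper reports 99% confidence, i.e. alpha=0.01), and the paper's exact conformal scheme is not specified in the abstract:

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split conformal: the ceil((n+1)(1-alpha))/n quantile of the
    calibration nonconformity scores."""
    n = len(cal_scores)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return sorted(cal_scores)[k]

def prediction_set(class_probs, qhat):
    # keep every class whose nonconformity (1 - prob) is within the threshold
    return [c for c, p in enumerate(class_probs) if 1.0 - p <= qhat]

cal = [i / 100 for i in range(100)]          # toy calibration scores
qhat = conformal_threshold(cal, alpha=0.1)
pset = prediction_set([0.85, 0.12, 0.03], qhat)
```

Prediction sets built this way carry the marginal coverage guarantee that makes the reported per-class confidence statement possible.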

[AI-6] Can We Trust AI to Govern AI? Benchmarking LLM Performance on Privacy and AI Governance Exams

【速读】:该论文旨在解决生成式 AI(Generative AI)在隐私合规与数据治理领域中的能力评估问题,具体聚焦于大型语言模型(Large Language Models, LLMs)是否能够胜任专业隐私岗位的职责,如监管合规、隐私项目管理和人工智能治理。其解决方案的关键在于通过标准化测试——即国际隐私专业人士协会(IAPP)的CIPP/US、CIPM、CIPT和AIGP认证考试——对十种主流开源与闭源LLMs进行封闭式测评,并将模型得分与人类通过标准对比,从而量化评估LLMs在隐私法律、技术控制及AI治理等领域的专业能力边界与实际表现。

链接: https://arxiv.org/abs/2508.09036
作者: Zane Witherspoon,Thet Mon Aye,YingYing Hao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid emergence of large language models (LLMs) has raised urgent questions across the modern workforce about this new technology’s strengths, weaknesses, and capabilities. For privacy professionals, the question is whether these AI systems can provide reliable support on regulatory compliance, privacy program management, and AI governance. In this study, we evaluate ten leading open and closed LLMs, including models from OpenAI, Anthropic, Google DeepMind, Meta, and DeepSeek, by benchmarking their performance on industry-standard certification exams: CIPP/US, CIPM, CIPT, and AIGP from the International Association of Privacy Professionals (IAPP). Each model was tested using official sample exams in a closed-book setting and compared to IAPP’s passing thresholds. Our findings show that several frontier models such as Gemini 2.5 Pro and OpenAI’s GPT-5 consistently achieve scores exceeding the standards for professional human certification - demonstrating substantial expertise in privacy law, technical controls, and AI governance. The results highlight both the strengths and domain-specific gaps of current LLMs and offer practical insights for privacy officers, compliance leads, and technologists assessing the readiness of AI tools for high-stakes data governance roles. This paper provides an overview for professionals navigating the intersection of AI advancement and regulatory risk and establishes a machine benchmark based on human-centric evaluations.

[AI-7] A First Look at Predictability and Explainability of Pre-request Passenger Waiting Time in Ridesharing Systems

【速读】:该论文旨在解决共享出行系统中预请求乘客等待时间(pre-request passenger waiting time)的预测问题,即在乘客提交乘车请求但尚未匹配到司机时,准确预测其等待时间。这一问题对于提升乘客行程规划能力及平台整体效率具有重要意义,但此前未被充分研究。解决方案的关键在于提出FiXGBoost模型——一种基于特征交互的XGBoost方法,通过深入分析供需动态并进行特征工程,能够在不依赖已分配司机信息的前提下实现高精度且可解释的等待时间预测,同时通过重要性分析量化各因素对预测结果的贡献。

链接: https://arxiv.org/abs/2508.09027
作者: Jie Wang,Guang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Passenger waiting time prediction plays a critical role in enhancing both ridesharing user experience and platform efficiency. While most existing research focuses on post-request waiting time prediction, where the matched driver information is known, pre-request waiting time prediction (i.e., before submitting a ride request and without a matched driver) is also important, as it enables passengers to plan their trips more effectively and enhances the experience of both passengers and drivers. However, it has not been fully studied by existing works. In this paper, we take the first step toward understanding the predictability and explainability of pre-request passenger waiting time in ridesharing systems. Particularly, we conduct an in-depth data-driven study to investigate the impact of demand-supply dynamics on passenger waiting time. Based on this analysis and feature engineering, we propose FiXGBoost, a novel feature interaction-based XGBoost model designed to predict waiting time without knowing the assigned driver information. We further perform an importance analysis to quantify the contribution of each factor. Experiments on a large-scale real-world ridesharing dataset including over 30 million trip records show that our FiXGBoost achieves good performance for pre-request passenger waiting time prediction with high explainability.
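The feature-interaction idea behind FiXGBoost can be illustrated with the simplest possible construction: augmenting raw features with pairwise products before feeding a boosted-tree model. The feature names are invented for the example, and the actual model's engineered interactions are likely more selective:

```python
from itertools import combinations

def interaction_features(row, names):
    """Augment raw features with explicit pairwise products, so a
    boosted-tree model can split on demand-supply interactions directly."""
    feats = dict(zip(names, row))
    for a, b in combinations(names, 2):
        feats[f"{a}*{b}"] = feats[a] * feats[b]
    return feats

# hypothetical pre-request features for one origin-time bucket
x = interaction_features([3.0, 2.0, 0.5], ["demand", "supply", "rain"])
```

Explicit interaction columns also make the subsequent importance analysis more interpretable, since each demand-supply interaction gets its own attributable feature.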

[AI-8] Attacks and Defenses Against LLM Fingerprinting

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在敏感环境中部署时面临的指纹攻击(fingerprinting attacks)所引发的隐私与安全问题。指纹攻击可通过分析模型输出行为来识别或追踪特定模型,从而泄露模型身份信息。论文从攻防两个角度提出解决方案:攻击端采用强化学习(reinforcement learning)自动优化查询选择策略,在仅使用3次查询的情况下显著提升指纹识别准确率;防御端则利用一个辅助的语言模型对输出进行语义保持型过滤(semantic-preserving output filtering),在有效混淆模型身份的同时维持输出内容的语义完整性。其核心创新在于通过强化学习提升攻击效率,并设计轻量级、语义无损的防御机制以实现实际可用的对抗措施。

链接: https://arxiv.org/abs/2508.09021
作者: Kevin Kurian,Ethan Holland,Sean Oesch
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As large language models are increasingly deployed in sensitive environments, fingerprinting attacks pose significant privacy and security risks. We present a study of LLM fingerprinting from both offensive and defensive perspectives. Our attack methodology uses reinforcement learning to automatically optimize query selection, achieving better fingerprinting accuracy with only 3 queries compared to randomly selecting 3 queries from the same pool. Our defensive approach employs semantic-preserving output filtering through a secondary LLM to obfuscate model identity while maintaining semantic integrity. The defensive method reduces fingerprinting accuracy across tested models while preserving output quality. These contributions show the potential to improve fingerprinting tools' capabilities while providing practical mitigation strategies against fingerprinting attacks.
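The attack side, learning which queries to ask, can be sketched with a simple epsilon-greedy bandit as a stand-in for the paper's reinforcement-learning formulation. The query pool, the reward function (here a deterministic per-query identification accuracy), and all names are invented for the example:

```python
import random

def select_queries(reward, pool, budget=3, episodes=500, eps=0.2, seed=0):
    """Epsilon-greedy bandit over a probe-query pool: estimate each
    query's value by a running mean, then return the top `budget`."""
    rng = random.Random(seed)
    value = {q: 0.0 for q in pool}
    count = {q: 0 for q in pool}
    for _ in range(episodes):
        q = rng.choice(pool) if rng.random() < eps else max(pool, key=value.get)
        r = reward(q)                          # did this probe identify the model?
        count[q] += 1
        value[q] += (r - value[q]) / count[q]  # incremental mean update
    return sorted(pool, key=value.get, reverse=True)[:budget]

# hypothetical pool: some probes separate models far better than others
true_acc = {"q_logits": 0.9, "q_style": 0.7, "q_random": 0.3, "q_date": 0.2}
reward = lambda q: true_acc[q]   # deterministic stand-in for a noisy reward
best = select_queries(reward, list(true_acc), budget=2)
```

The learned ranking concentrates the fixed 3-query budget on the most discriminative probes, which is the advantage the paper reports over random query selection.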

[AI-9] Activation Steering for Bias Mitigation: An Interpretable Approach to Safer LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在社会系统中广泛应用时,可能加剧和传播有害偏见的安全风险问题。传统方法通常依赖数据过滤或事后输出修正,将模型视为黑箱,难以实现精准干预。其解决方案的关键在于引入一个端到端的可解释性驱动系统:首先通过训练线性探测器(linear probes)于模型内部激活值,识别出性别、种族、年龄等偏见的潜在表征,实验表明这些偏见信号在模型后期层最为显著;其次基于此发现计算“引导向量”(steering vectors),通过对比有偏与中性语句的激活模式,在推理阶段实时添加该向量以主动调控生成过程,从而有效抑制偏见内容输出并导向更中立的替代结果。该方法提供了一种直接、可解释且可复现的路径,用于构建更安全、更具责任性的LLMs。

链接: https://arxiv.org/abs/2508.09019
作者: Shivam Dubey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become more integrated into societal systems, the risk of them perpetuating and amplifying harmful biases becomes a critical safety concern. Traditional methods for mitigating bias often rely on data filtering or post-hoc output moderation, which treat the model as an opaque black box. In this work, we introduce a complete, end-to-end system that uses techniques from mechanistic interpretability to both identify and actively mitigate bias directly within a model’s internal workings. Our method involves two primary stages. First, we train linear “probes” on the internal activations of a model to detect the latent representations of various biases (e.g., gender, race, age). Our experiments on \textttgpt2-large demonstrate that these probes can identify biased content with near-perfect accuracy, revealing that bias representations become most salient in the model’s later layers. Second, we leverage these findings to compute “steering vectors” by contrasting the model’s activation patterns for biased and neutral statements. By adding these vectors during inference, we can actively steer the model’s generative process away from producing harmful, stereotypical, or biased content in real-time. We demonstrate the efficacy of this activation steering technique, showing that it successfully alters biased completions toward more neutral alternatives. We present our work as a robust and reproducible system that offers a more direct and interpretable approach to building safer and more accountable LLMs.
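The steering-vector computation described above is commonly implemented as a mean difference of activations; here is a minimal sketch, with toy two-dimensional "activations" standing in for gpt2-large hidden states:

```python
def steering_vector(biased_acts, neutral_acts):
    """Mean activation on biased prompts minus mean on neutral prompts."""
    dim = len(biased_acts[0])
    mb = [sum(a[d] for a in biased_acts) / len(biased_acts) for d in range(dim)]
    mn = [sum(a[d] for a in neutral_acts) / len(neutral_acts) for d in range(dim)]
    return [b - n for b, n in zip(mb, mn)]

def steer(activation, vec, alpha=-1.0):
    # a negative alpha pushes the hidden state away from the biased direction
    return [a + alpha * v for a, v in zip(activation, vec)]

v = steering_vector([[2.0, 0.0], [4.0, 0.0]],   # toy "biased" activations
                    [[1.0, 1.0], [1.0, 1.0]])   # toy "neutral" activations
h = steer([3.0, 0.5], v)
```

In the real system the vector is added to a chosen layer's residual stream at inference time; the sign and magnitude of `alpha` control how strongly generation is pushed toward the neutral alternative.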

[AI-10] Intrinsic Memory Agents : Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统在复杂协作任务中因上下文窗口限制所引发的记忆一致性差、角色遵守不严和流程完整性受损等核心问题。其解决方案的关键在于提出内在记忆智能体(Intrinsic Memory Agents)框架,通过构建随智能体输出自适应演化的结构化专属记忆机制,维持与角色对齐的记忆模板,从而在聚焦任务相关信息的同时保持专业视角的一致性,显著提升多智能体系统的规划能力与执行质量。

链接: https://arxiv.org/abs/2508.08997
作者: Sizhe Yuen,Francisco Gomez Medina,Ting Su,Yali Du,Adam J. Sobey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through structured agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory templates that preserve specialized perspectives while focusing on task-relevant information. We benchmark our approach on the PDDL dataset, comparing its performance to existing state-of-the-art multi-agentic memory approaches and showing an improvement of 38.6% with the highest token efficiency. An additional evaluation is performed on a complex data pipeline design task, where we demonstrate that our approach produces higher-quality designs across five metrics (scalability, reliability, usability, cost-effectiveness, and documentation), with additional qualitative evidence of the improvements. Our findings suggest that addressing memory limitations through structured, intrinsic approaches can improve the capabilities of multi-agent LLM systems on structured planning tasks.

[AI-11] Prospect Theory Fails for LLM s: Revealing Instability of Decision-Making under Epistemic Uncertainty

【速读】:该论文旨在解决两个关键问题:一是 Prospect Theory (PT) 是否适用于当代大语言模型(Large Language Models, LLMs)的决策行为建模;二是人类表达不确定性的语用标记(如“maybe”)是否会影响LLMs的决策行为。解决方案的关键在于设计了一个基于经济问卷的三阶段实验,提出了一种更通用且精确的评估框架来模拟LLMs在PT下的决策行为,并通过引入与常见语用不确定性标记相对应的经验概率值,量化其对LLMs决策行为的影响。研究发现,以PT建模LLMs决策行为并不稳定,尤其当不确定性以多样化的语言形式表达时,这一现象凸显了语言不确定性对LLM决策机制的重要影响。

链接: https://arxiv.org/abs/2508.08992
作者: Rui Wang,Qihan Lin,Jiayu Liu,Qing Zong,Tianshi Zheng,Weiqi Wang,Yangqiu Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prospect Theory (PT) models human decision-making under uncertainty, while epistemic markers (e.g., 'maybe') serve to express uncertainty in language. However, it remains largely unexplored whether Prospect Theory applies to contemporary Large Language Models and whether epistemic markers, which express human uncertainty, affect their decision-making behaviour. To address these research gaps, we design a three-stage experiment based on economic questionnaires. We propose a more general and precise evaluation framework to model LLMs' decision-making behaviour under PT, introducing uncertainty through the empirical probability values associated with commonly used epistemic markers in comparable contexts. We then incorporate epistemic markers into the evaluation framework based on their corresponding probability values to examine their influence on LLM decision-making behaviours. Our findings suggest that modelling LLMs' decision-making with PT is not consistently reliable, particularly when uncertainty is expressed in diverse linguistic forms. Our code is released at this https URL.
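Prospect Theory itself can be made concrete with the classic Tversky-Kahneman functional forms: a value function that is concave for gains, convex and steeper for losses, and an inverse-S probability weighting function. The parameter values (alpha = beta = 0.88, lambda = 2.25, gamma = 0.61) are the standard 1992 estimates for human subjects, not values fitted in this paper:

```python
def pt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Value function: concave for gains, convex and steeper for losses."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** beta)

def pt_weight(p, gamma=0.61):
    """Inverse-S weighting: small probabilities overweighted, large underweighted."""
    return p ** gamma / ((p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma))

def prospect_utility(prospect):
    # prospect: list of (payoff, probability) pairs
    return sum(pt_weight(p) * pt_value(x) for x, p in prospect)

u_gain = prospect_utility([(100.0, 0.5)])    # a 50% chance to win 100
u_loss = prospect_utility([(-100.0, 0.5)])   # a 50% chance to lose 100
```

Fitting these parameters to an LLM's choices, and checking whether the fit stays stable when probabilities are replaced by epistemic markers such as "maybe", is the kind of test the paper's framework generalises.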

[AI-12] Rational Inverse Reasoning

【速读】:该论文试图解决机器人在少样本(few-shot)场景下难以从单次或少量示范中泛化到不同任务设置的问题,这与人类能够通过一次观察即实现跨情境迁移的能力形成鲜明对比。其解决方案的关键在于提出了一种名为“理性逆向推理”(Rational Inverse Reasoning, RIR)的框架,该框架通过层次化生成模型推断行为背后的潜在结构化程序(structured programs),包括高层目标、子任务分解和执行约束;具体而言,RIR将少样本模仿学习建模为贝叶斯程序归纳问题,利用视觉-语言模型迭代生成符号化的任务假设,并结合“规划器内循环”的推理机制对每个假设进行似然评分,从而得到一个后验分布下的简洁且可执行的程序,最终实现仅需一次示范即可在物体姿态、数量、几何形状及布局变化等复杂条件下成功泛化。

链接: https://arxiv.org/abs/2508.08983
作者: Ben Zandonati,Tomás Lozano-Pérez,Leslie Pack Kaelbling
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Humans can observe a single, imperfect demonstration and immediately generalize to very different problem settings. Robots, in contrast, often require hundreds of examples and still struggle to generalize beyond the training conditions. We argue that this limitation arises from the inability to recover the latent explanations that underpin intelligent behavior, and that these explanations can take the form of structured programs consisting of high-level goals, sub-task decomposition, and execution constraints. In this work, we introduce Rational Inverse Reasoning (RIR), a framework for inferring these latent programs through a hierarchical generative model of behavior. RIR frames few-shot imitation as Bayesian program induction: a vision-language model iteratively proposes structured symbolic task hypotheses, while a planner-in-the-loop inference scheme scores each by the likelihood of the observed demonstration under that hypothesis. This loop yields a posterior over concise, executable programs. We evaluate RIR on a suite of continuous manipulation tasks designed to test one-shot and few-shot generalization across variations in object pose, count, geometry, and layout. With as little as one demonstration, RIR infers the intended task structure and generalizes to novel settings, outperforming state-of-the-art vision-language model baselines.

[AI-13] Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion

【速读】:该论文旨在解决腿式机器人在学习敏捷运动行为时探索过程的挑战性问题,传统方法依赖大量奖励工程、专家示范或课程学习,限制了策略的泛化能力。其解决方案的关键在于提出Skill Discovery as Exploration (SDAX) 框架,通过无监督技能发现自动获取多样化的障碍克服技能,并采用双层优化机制动态调节训练过程中的探索水平,从而显著降低人工干预并提升自主学习能力。

链接: https://arxiv.org/abs/2508.08982
作者: Seungeun Rho,Kartik Garg,Morgan Byrd,Sehoon Ha
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Conference on Robot Learning 2025

点击查看摘要

Abstract:Exploration is crucial for enabling legged robots to learn agile locomotion behaviors that can overcome diverse obstacles. However, such exploration is inherently challenging, and we often rely on extensive reward engineering, expert demonstrations, or curriculum learning - all of which limit generalizability. In this work, we propose Skill Discovery as Exploration (SDAX), a novel learning framework that significantly reduces human engineering effort. SDAX leverages unsupervised skill discovery to autonomously acquire a diverse repertoire of skills for overcoming obstacles. To dynamically regulate the level of exploration during training, SDAX employs a bi-level optimization process that autonomously adjusts the degree of exploration. We demonstrate that SDAX enables quadrupedal robots to acquire highly agile behaviors including crawling, climbing, leaping, and executing complex maneuvers such as jumping off vertical walls. Finally, we deploy the learned policy on real hardware, validating its successful transfer to the real world.

[AI-14] Urban-STA4CLC: Urban Theory-Informed Spatio-Temporal Attention Model for Predicting Post-Disaster Commercial Land Use Change

【速读】:该论文旨在解决自然灾害(如飓风和野火)对商业用地变化的复杂影响难以被现有模型准确捕捉的问题,尤其关注人类活动扰动与商业用地格局重塑之间的动态交互关系。解决方案的关键在于提出一种基于城市理论的时空注意力模型(Urban-STA4CLC),该模型融合了三个理论驱动模块:由韧性理论指导的灾害感知时间注意力模块以捕捉访客流动动态,由空间经济理论引导的多关系空间注意力模块用于块级间相互作用建模,以及由扩散理论贡献的正则化项以约束土地利用转换过程。这种结构设计显著提升了模型在连续飓风情景下预测商业用地年变化的能力,F1分数提升约19%(达0.8763),验证了将城市理论嵌入商业用地建模可有效增强对用地增益与损失的识别能力。

链接: https://arxiv.org/abs/2508.08976
作者: Ziyi Guo,Yan Wang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Natural disasters such as hurricanes and wildfires increasingly introduce unusual disturbances to economic activities, which are especially likely to reshape commercial land use patterns given their sensitivity to customer visitation. However, current modeling approaches are limited in capturing such complex interplay between human activities and commercial land use change under and following disturbances. Such interactions have been more effectively captured in current resilient urban planning theories. This study designs and calibrates an Urban Theory-Informed Spatio-Temporal Attention Model for Predicting Post-Disaster Commercial Land Use Change (Urban-STA4CLC) to predict both the yearly decline and expansion of commercial land use at the census block level under the cumulative impact of disasters on human activities over two years. Guided by urban theories, Urban-STA4CLC integrates both spatial and temporal attention mechanisms with three theory-informed modules. Resilience theory guides a disaster-aware temporal attention module that captures visitation dynamics. Spatial economic theory informs a multi-relational spatial attention module for inter-block representation. Diffusion theory contributes a regularization term that constrains land use transitions. The model performs significantly better than non-theoretical baselines in predicting commercial land use change under the scenario of recurrent hurricanes, with around 19% improvement in F1 score (0.8763). The effectiveness of the theory-guided modules was further validated through ablation studies. The research demonstrates that embedding urban theory into commercial land use models may substantially enhance their capacity to capture land use gains and losses. These advances in commercial land use modeling contribute to land use research that accounts for the cumulative impacts of recurrent disasters and shifts in economic activity patterns.
zh

[AI-15] QAMRO: Quality-aware Adaptive Margin Ranking Optimization for Human-aligned Assessment of Audio Generation Systems

【速读】:该论文旨在解决音频生成系统(包括文本到音乐、文本到语音和文本到音频)在评估过程中因人类感知的主观性和多维性而导致的挑战,现有方法将平均意见分(MOS)预测视为回归问题,但标准回归损失函数忽略了感知判断的相对性。解决方案的关键在于提出一种名为QAMRO(Quality-aware Adaptive Margin Ranking Optimization)的新框架,该框架通过融合来自不同视角的回归目标,强调感知差异并优先确保评分准确性;其核心创新在于利用预训练的音频-文本模型(如CLAP和Audiobox-Aesthetics),并在AudioMOS Challenge 2025官方数据集上进行端到端训练,从而在所有维度上显著优于基线模型,实现与人类评价更一致的性能表现。

链接: https://arxiv.org/abs/2508.08957
作者: Chien-Chun Wang,Kuan-Tang Huang,Cheng-Yeh Yang,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to IEEE ASRU 2025

点击查看摘要

Abstract:Evaluating audio generation systems, including text-to-music (TTM), text-to-speech (TTS), and text-to-audio (TTA), remains challenging due to the subjective and multi-dimensional nature of human perception. Existing methods treat mean opinion score (MOS) prediction as a regression problem, but standard regression losses overlook the relativity of perceptual judgments. To address this limitation, we introduce QAMRO, a novel Quality-aware Adaptive Margin Ranking Optimization framework that seamlessly integrates regression objectives from different perspectives, aiming to highlight perceptual differences and prioritize accurate ratings. Our framework leverages pre-trained audio-text models such as CLAP and Audiobox-Aesthetics, and is trained exclusively on the official AudioMOS Challenge 2025 dataset. It demonstrates superior alignment with human evaluations across all dimensions, significantly outperforming robust baseline models.
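下面用一段示意性代码说明"质量感知自适应间隔排序"的基本思想:对每对样本施加成对 hinge 排序损失,且间隔(margin)随真实 MOS 差距增大而增大,从而同时强调感知差异与评分准确性。注意:论文摘要未给出具体公式,此处的损失形式及 `base_margin`、`scale` 参数均为假设,仅作概念演示。

```python
import numpy as np

def adaptive_margin_ranking_loss(pred, mos, base_margin=0.1, scale=1.0):
    """Pairwise hinge ranking loss whose margin grows with the
    ground-truth MOS gap (hypothetical form; QAMRO's exact
    objective is not given in the abstract)."""
    pred = np.asarray(pred, dtype=float)
    mos = np.asarray(mos, dtype=float)
    n = len(pred)
    losses = []
    for i in range(n):
        for j in range(n):
            if mos[i] > mos[j]:  # sample i should be ranked above j
                margin = base_margin + scale * (mos[i] - mos[j])
                losses.append(max(0.0, margin - (pred[i] - pred[j])))
    return float(np.mean(losses)) if losses else 0.0
```

直观上,真实质量差距越大的样本对,预测分差必须拉得越开才能使损失归零;这正是"间隔自适应"相对普通回归/排序损失的区别所在。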
zh

[AI-16] Generalising Traffic Forecasting to Regions without Traffic Observations

【速读】:该论文旨在解决无交通传感器区域的交通流量预测问题,此类区域因缺乏历史观测数据而难以应用现有模型。解决方案的关键在于提出GenCast模型,其核心思想是利用外部知识弥补缺失观测并提升模型泛化能力:首先引入物理信息神经网络(Physics-Informed Neural Networks, PINNs)以物理规律正则化学习过程;其次设计外部信号学习模块,挖掘交通状态与天气等外部信号的相关性;最后通过空间分组模块过滤局部特征以增强跨区域适用性。

链接: https://arxiv.org/abs/2508.08947
作者: Xinyu Su,Majid Sarvi,Feng Liu,Egemen Tanin,Jianzhong Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traffic forecasting is essential for intelligent transportation systems. Accurate forecasting relies on continuous observations collected by traffic sensors. However, due to high deployment and maintenance costs, not all regions are equipped with such sensors. This paper aims to forecast for regions without traffic sensors, where the lack of historical traffic observations challenges the generalisability of existing models. We propose a model named GenCast, the core idea of which is to exploit external knowledge to compensate for the missing observations and to enhance generalisation. We integrate physics-informed neural networks into GenCast, enabling physical principles to regularise the learning process. We introduce an external signal learning module to explore correlations between traffic states and external signals such as weather conditions, further improving model generalisability. Additionally, we design a spatial grouping module to filter localised features that hinder model generalisability. Extensive experiments show that GenCast consistently reduces forecasting errors on multiple real-world datasets.
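GenCast 的核心思路之一是用物理规律弥补无传感器区域的缺失观测。下面是一个通用 PINN 式损失的极简示意:数据项只在有传感器的位置(`mask`)计算,物理残差则在全域施加。残差函数与权重 `lam` 均为假设,论文摘要并未给出其具体物理项。

```python
import numpy as np

def physics_informed_loss(pred, obs, mask, residual_fn, lam=0.5):
    """Data term only where sensors exist (mask), plus a physics
    residual everywhere -- the generic PINN recipe; GenCast's actual
    physics terms are not specified in the abstract."""
    pred = np.asarray(pred, dtype=float)
    obs = np.asarray(obs, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    data = float(np.mean((pred[mask] - obs[mask]) ** 2)) if mask.any() else 0.0
    physics = float(np.mean(residual_fn(pred) ** 2))
    return data + lam * physics

# Toy residual (assumption): flow should vary smoothly across
# neighbouring road segments
smoothness = lambda q: np.diff(q)
```

无观测位置虽然不贡献数据项,但仍受物理残差约束,这正是模型得以泛化到无传感器区域的机制示意。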
zh

[AI-17] Safe Semantics Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉-语言模型(Large Vision-Language Models, LVLMs)在多模态输入下日益突出的安全性问题,特别是由隐式推理(Implicit Reasoning)引发的漏洞:即看似无害的组合输入会因模型内部有缺陷或隐藏的推理机制而触发不安全输出。解决方案的关键在于提出首个针对此问题的数据集——Safe Semantics, Unsafe Interpretations (SSUI),并通过展示基于该数据集的上下文学习(In-Context Learning)方法能显著缓解此类隐式多模态威胁,从而强调提升跨模态隐式推理能力的紧迫性。

链接: https://arxiv.org/abs/2508.08926
作者: Wei Cai,Jian Zhao,Yuchu Jiang,Tianle Zhang,Xuelong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models face growing safety challenges with multimodal inputs. This paper introduces the concept of Implicit Reasoning Safety, a vulnerability in LVLMs. Benign combined inputs trigger unsafe LVLM outputs due to flawed or hidden reasoning. To showcase this, we developed Safe Semantics, Unsafe Interpretations, the first dataset for this critical issue. Our demonstrations show that even simple In-Context Learning with SSUI significantly mitigates these implicit multimodal threats, underscoring the urgent need to improve cross-modal implicit reasoning.
zh

[AI-18] Compass-Thinker-7B Technical Report

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练中面临的高计算成本与资源消耗问题,同时探索如何在有限资源下激发模型的复杂推理能力。其解决方案的关键在于提出了一种名为Compass-Thinker-7B的小规模模型,并设计了一个专门的强化学习流水线(Reinforcement Learning Pipeline),通过构建包含3万道可验证数学问题的数据集,分阶段配置不同难度分布的数据和训练参数,逐步释放模型潜力并提升训练效率。实验表明,该方法在AIME2024挑战性数学评测中达到了40%的准确率,显著优于同规模模型,为未来更大模型的RL训练提供了高效且可行的技术路径。

链接: https://arxiv.org/abs/2508.08909
作者: Anxiang Zeng,Haibo Zhang,Kaixiang Mo,Long Zhang,Shuman Liu,Yanhui Huang,Yawen Liu,Yuepeng Sheng,Yuwei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent R1-Zero-like research further demonstrates that reasoning extension has given large language models (LLMs) unprecedented reasoning capabilities, and Reinforcement Learning is the core technology to elicit its complex reasoning. However, conducting RL experiments directly on hyperscale models involves high computational costs and resource demands, posing significant risks. We propose the Compass-Thinker-7B model, which aims to explore the potential of Reinforcement Learning with less computational resources and costs, and provides insights for further research into RL recipes for larger models. Compass-Thinker-7B is trained from an open source model through a specially designed Reinforcement Learning Pipeline. We curate a dataset of 30k verifiable mathematics problems for the Reinforcement Learning Pipeline. By configuring data and training settings with different difficulty distributions for different stages, the potential of the model is gradually released and the training efficiency is improved. Extensive evaluations show that Compass-Thinker-7B possesses exceptional reasoning potential, and achieves superior performance on mathematics compared to same-sized RL models. In the challenging AIME2024 evaluation, Compass-Thinker-7B achieves 40% accuracy.
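"不同阶段配置不同难度分布"的课程式数据调度可以用如下草图示意:题目按难度分桶,每个训练阶段以各自的权重分布抽样,使后期阶段逐步偏向难题。难度桶名称与权重均为示意性假设,并非论文的实际配置。

```python
import random

def staged_sampler(problems, stage_weights, k, seed=0):
    """Curriculum sketch: `problems` maps difficulty buckets to
    question lists; each training stage draws k items under its own
    difficulty distribution, shifting mass toward harder buckets in
    later stages. Bucket names and weights are assumptions."""
    rng = random.Random(seed)
    buckets = [b for b in problems if stage_weights.get(b, 0) > 0]
    picks = rng.choices(buckets,
                        weights=[stage_weights[b] for b in buckets], k=k)
    return [rng.choice(problems[b]) for b in picks]
```

例如早期阶段可用 `{"easy": 0.7, "hard": 0.3}`,后期切换为 `{"easy": 0.2, "hard": 0.8}`,即可实现摘要所述"逐步释放模型潜力"的训练节奏。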
zh

[AI-19] Position: Causal Machine Learning Requires Rigorous Synthetic Experiments for Broader Adoption ICML2025

【速读】:该论文旨在解决因果机器学习(causal machine learning)方法在实际应用中因缺乏可靠性和鲁棒性评估而未被广泛采用的问题。当前实证评估多依赖合成实验(synthetic experiments),导致社区对其可信度存疑;论文指出,合成实验恰恰是精确评估和理解因果机器学习方法能力所必需的工具。解决方案的关键在于提出一套基于合成数据的严谨实证分析原则,以系统性提升评估的规范性和透明度,从而增强方法的可信度并推动其在真实世界中的广泛应用。

链接: https://arxiv.org/abs/2508.08883
作者: Audrey Poinsot,Panayiotis Panayiotou,Alessandro Leite,Nicolas Chesneau,Özgür Şimşek,Marc Schoenauer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注: Accepted at ICML 2025

点击查看摘要

Abstract:Causal machine learning has the potential to revolutionize decision-making by combining the predictive power of machine learning algorithms with the theory of causal inference. However, these methods remain underutilized by the broader machine learning community, in part because current empirical evaluations do not permit assessment of their reliability and robustness, undermining their practical utility. Specifically, one of the principal criticisms made by the community is the extensive use of synthetic experiments. We argue, on the contrary, that synthetic experiments are essential and necessary to precisely assess and understand the capabilities of causal machine learning methods. To substantiate our position, we critically review the current evaluation practices, spotlight their shortcomings, and propose a set of principles for conducting rigorous empirical analyses with synthetic data. Adopting the proposed principles will enable comprehensive evaluations that build trust in causal machine learning methods, driving their broader adoption and impactful real-world use.
zh

[AI-20] Reducing Cognitive Load in Multi-Agent Reinforcement Learning for Mathematical Problem Solving: Decoupling Reasoning and Code Generation

【速读】:该论文试图解决当前工具集成的数学推理系统中因单智能体架构导致的认知负载干扰问题,即单一大语言模型在长程推理与精确代码生成之间交替执行时,易造成推理路径质量下降的问题。解决方案的关键在于提出一种双智能体混合框架:由推理智能体(Reasoning Agent)负责分步的问题分解,代码智能体(Code Agent)专注代码生成与执行,通过解耦角色降低认知干扰,并结合模仿学习与强化学习进行训练——其中代码智能体依据中间真实程序匹配获得强奖励、合法执行获得弱奖励,而推理智能体则主要基于最终答案准确性,利用优势估计对中间步骤进行信用分配,从而实现稳定可靠的推理-编码协同。

链接: https://arxiv.org/abs/2508.08882
作者: Dayu Wang,Jiaye Yang,Weikang Li,Jiahui Liang,Yang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current tool-integrated mathematical reasoning systems often adopt a single-agent paradigm, where one large language model handles problem reasoning, code generation, and code execution in an integrated workflow. While this design eases coordination, we hypothesize that it imposes cognitive load interference, as the agent must interleave long-horizon reasoning with precise program synthesis. We validate this hypothesis through a controlled comparison between a reasoning-only agent and a reasoning-plus-code agent, finding that the latter produces significantly fewer correct reasoning paths despite having tool-calling capabilities. To address this, we propose a dual-agent hybrid framework: a Reasoning Agent performs stepwise problem decomposition, and a Code Agent handles code generation and execution. Training combines imitation learning and reinforcement learning: the Code Agent receives strong rewards for matching intermediate ground-truth programs and weaker rewards for valid execution, while the Reasoning Agent is optimized chiefly via final-answer accuracy using advantage estimation to credit intermediate steps. This decoupled role design reduces cognitive interference and promotes stable reasoning-coding coordination.
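代码智能体的奖励塑形可以用下面的草图表达:生成程序与中间真值程序完全匹配时给强奖励,仅合法执行时给弱奖励,否则为零。具体奖励数值为假设,论文摘要未公布其常数。

```python
def code_agent_reward(program, gt_program, executed_ok,
                      strong=1.0, weak=0.2):
    """Reward shaping sketch for the Code Agent: strong reward when
    the generated program matches the intermediate ground truth,
    weaker reward when it merely executes without error. The reward
    values (1.0 / 0.2) are illustrative assumptions."""
    if program == gt_program:
        return strong
    return weak if executed_ok else 0.0
```

推理智能体侧则主要依赖最终答案准确率,并用优势估计把信用分配回中间步骤,这部分涉及策略梯度训练,此处不再展开。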
zh

[AI-21] Oblivionis: A Lightweight Learning and Unlearning Framework for Federated Large Language Models

【速读】:该论文旨在解决联邦大语言模型(Federated Large Language Models, FLLMs)在实际应用中缺乏合规性机制的问题,特别是针对欧盟《通用数据保护条例》(GDPR)所规定的“被遗忘权”(right to be forgotten)无法实现的困境。现有联邦学习框架虽能保护数据隐私,但无法在训练完成后对特定客户端的数据贡献进行选择性删除,导致数据治理与监管合规性不足。解决方案的关键在于提出Oblivionis框架,该框架将联邦学习(Federated Learning, FL)与模型遗忘(unlearning)统一为双重优化目标,在分布式环境下实现对特定客户端数据的可验证移除,同时保持模型性能稳定;其核心创新包括6种联邦学习算法与5种遗忘策略的集成设计,形成一套系统化的评估与比较体系,从而在保证数据隐私和合规性的前提下,实现高效、可控的联邦大语言模型训练与遗忘过程。

链接: https://arxiv.org/abs/2508.08875
作者: Fuyao Zhang,Xinyu Yan,Tiantong Wu,Wenjie Li,Tianxiang Chen,Yang Cao,Ran Yan,Longtao Huang,Wei Yang Bryan Lim,Qiang Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly leverage Federated Learning (FL) to utilize private, task-specific datasets for fine-tuning while preserving data privacy. However, while federated LLM frameworks effectively enable collaborative training without raw data sharing, they critically lack built-in mechanisms for regulatory compliance like GDPR’s right to be forgotten. Integrating private data heightens concerns over data quality and long-term governance, yet existing distributed training frameworks offer no principled way to selectively remove specific client contributions post-training. Due to distributed data silos, stringent privacy constraints, and the intricacies of interdependent model aggregation, federated LLM unlearning is significantly more complex than centralized LLM unlearning. To address this gap, we introduce Oblivionis, a lightweight learning and unlearning framework that enables clients to selectively remove specific private data during federated LLM training, enhancing trustworthiness and regulatory compliance. By unifying FL and unlearning as a dual optimization objective, we incorporate 6 FL and 5 unlearning algorithms for comprehensive evaluation and comparative analysis, establishing a robust pipeline for federated LLM unlearning. Extensive experiments demonstrate that Oblivionis outperforms local training, achieving a robust balance between forgetting efficacy and model utility, with cross-algorithm comparisons providing clear directions for future LLM development.
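联邦遗忘最朴素的基线是"剔除后重聚合":把待遗忘客户端从加权平均中移除。Oblivionis 集成的 6 种 FL 算法与 5 种遗忘算法远比这精细,下面仅用 FedAvg 示意"学习 + 遗忘"双重目标中最简单的一种实现方式。

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Size-weighted FedAvg aggregation of client parameter vectors."""
    w = np.asarray(client_sizes, dtype=float)
    w = w / w.sum()
    return sum(wi * np.asarray(p, dtype=float)
               for wi, p in zip(w, client_params))

def aggregate_excluding(client_params, client_sizes, forget_idx):
    """Naive unlearning-by-exclusion baseline: re-aggregate without
    the forgotten client's contribution (a simplification; real
    federated unlearning must also undo its influence on history)."""
    keep = [i for i in range(len(client_params)) if i != forget_idx]
    return fedavg([client_params[i] for i in keep],
                  [client_sizes[i] for i in keep])
```

如注释所述,单轮剔除无法消除被遗忘客户端对历史聚合轮次的影响,这正是论文所强调的联邦遗忘比集中式遗忘更复杂之处。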
zh

[AI-22] he Roots of International Perceptions: Simulating US Attitude Changes Towards China with LLM Agents AAAI

【速读】:该论文旨在解决大规模、长期、跨国家情境下公众态度演化建模的难题,特别是美国民众对中国态度的演变问题。传统研究多聚焦于特定事件或单一国家内部观点变化,而本文首次构建了基于大语言模型(LLM)的框架,整合媒体数据采集、用户画像生成与认知架构驱动的观念更新机制,成功复现了2005年至当前20年间美国公众对华态度的整体趋势。其解决方案的关键在于:利用LLM的推理能力实现去偏见的媒体暴露(debiased media exposure),从主观新闻内容中提取中性事件以揭示极化成因;同时引入"魔鬼代言人"代理(devil's advocate agent)解释罕见的负面到正面态度反转现象,从而阐明信息获取方式变化对国际态度形成的影响。这一方法不仅验证了框架的有效性,还揭示了偏见框架和选择偏差在塑造跨国态度中的核心作用,为理解国际偏见形成机制提供了新范式。

链接: https://arxiv.org/abs/2508.08837
作者: Nicholas Sukiennik,Yichuan Xu,Yuqing Kan,Jinghua Piao,Yuwei Yan,Chen Gao,Yong Li
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: Submitted to AAAI Social Impact 2026

点击查看摘要

Abstract:The rise of LLMs poses new possibilities in modeling opinion evolution, a long-standing task in simulation, by leveraging advanced reasoning abilities to recreate complex, large-scale human cognitive trends. While most prior works focus on opinion evolution surrounding specific isolated events or the views within a country, ours is the first to model the large-scale attitude evolution of a population representing an entire country towards another – US citizens’ perspectives towards China. To tackle the challenges of this broad scenario, we propose a framework that integrates media data collection, user profile creation, and cognitive architecture for opinion updates to successfully reproduce the real trend of US attitudes towards China over a 20-year period from 2005 to today. We also leverage LLMs’ capabilities to introduce debiased media exposure, extracting neutral events from typically subjective news contents, to uncover the roots of polarized opinion formation, as well as a devil's advocate agent to help explain the rare reversal from negative to positive attitudes towards China, corresponding with changes in the way Americans obtain information about the country. The simulation results, beyond validating our framework architecture, also reveal the impact of biased framing and selection bias in shaping attitudes. Overall, our work contributes to a new paradigm for LLM-based modeling of cognitive behaviors in a large-scale, long-term, cross-border social context, providing insights into the formation of international biases and offering valuable implications for media consumers to better understand the factors shaping their perspectives, and ultimately contributing to the larger social need for bias reduction and cross-cultural tolerance.
zh

[AI-23] EditMF: Drawing an Invisible Fingerprint for Your Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中资源消耗高、知识产权(Intellectual Property, IP)保护困难的问题,特别是现有基于后门(back-door)的指纹嵌入方法存在隐蔽性差和效率低的局限。其解决方案的关键在于提出一种无需训练的指纹嵌入范式 EditMF,通过将所有权比特映射到加密人工知识库中语义一致的三元组(如虚拟作者-小说-主角),利用因果追踪(causal tracing)定位影响每个三元组的最小层数,并采用零空间更新(zero-space update)注入指纹而不扰动其他知识;验证仅需一次黑盒查询即可完成,且具有高隐蔽性、极小性能损失及强鲁棒性。

链接: https://arxiv.org/abs/2508.08836
作者: Jiaxuan Wu,Yinghan Zhou,Wanli Peng,Yiming Xue,Juan Wen,Ping Zhong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures

点击查看摘要

Abstract:Training large language models (LLMs) is resource-intensive and expensive, making protecting intellectual property (IP) for LLMs crucial. Recently, embedding fingerprints into LLMs has emerged as a prevalent method for establishing model ownership. However, existing back-door-based methods suffer from limited stealth and efficiency. To simultaneously address these issues, we propose EditMF, a training-free fingerprinting paradigm that achieves highly imperceptible fingerprint embedding with minimal computational overhead. Ownership bits are mapped to compact, semantically coherent triples drawn from an encrypted artificial knowledge base (e.g., virtual author-novel-protagonist facts). Causal tracing localizes the minimal set of layers influencing each triple, and a zero-space update injects the fingerprint without perturbing unrelated knowledge. Verification requires only a single black-box query and succeeds when the model returns the exact pre-embedded protagonist. Empirical results on LLaMA and Qwen families show that EditMF combines high imperceptibility with negligible model’s performance loss, while delivering robustness far beyond LoRA-based fingerprinting and approaching that of SFT embeddings. Extensive experiments demonstrate that EditMF is an effective and low-overhead solution for secure LLM ownership verification.
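"所有权比特映射到人工知识库三元组"这一编码步骤可以示意如下:将比特串按固定位宽切块,每块作为索引选出一个(作者, 小说, 主角)三元组。切块与索引方式是此处的示意性假设,论文仅说明比特被映射为语义一致的三元组。

```python
def bits_to_triples(bits, kb):
    """Fingerprint-encoding sketch for EditMF: split the ownership
    bit string into fixed-width chunks and use each chunk as an index
    into an artificial knowledge base of (author, novel, protagonist)
    triples. The chunking scheme is an illustrative assumption."""
    width = (len(kb) - 1).bit_length() or 1
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return [kb[int(c.ljust(width, "0"), 2) % len(kb)] for c in chunks]
```

之后的因果追踪定位与零空间参数更新依赖模型内部结构,属于论文方法的核心,此处不作代码化演示。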
zh

[AI-24] Geometry-Aware Global Feature Aggregation for Real-Time Indirect Illumination

【速读】:该论文旨在解决实时渲染中全局光照(Global Illumination, GI)的计算难题,特别是如何高效且准确地预测屏幕空间内的漫反射间接光照,以实现高动态范围(HDR)的逼真视觉效果。其解决方案的关键在于提出了一种基于学习的估计器,采用改进的注意力机制来聚合由空间几何特征引导的全局信息,并引入单色设计(monochromatic design)分别编码每个颜色通道,从而有效捕捉远距离间接光照及纹理表面间的多次反射(如颜色渗透效应)。该方法在复杂光照场景下表现出优越性能,且具备良好的泛化能力,可处理训练数据之外的新场景。

链接: https://arxiv.org/abs/2508.08826
作者: Meng Gai,Guoping Wang,Sheng Li
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Real-time rendering with global illumination is crucial to afford the user realistic experience in virtual environments. We present a learning-based estimator to predict diffuse indirect illumination in screen space, which then is combined with direct illumination to synthesize globally-illuminated high dynamic range (HDR) results. Our approach tackles the challenges of capturing long-range/long-distance indirect illumination when employing neural networks and is generalized to handle complex lighting and scenarios. From the neural network thinking of the solver to the rendering equation, we present a novel network architecture to predict indirect illumination. Our network is equipped with a modified attention mechanism that aggregates global information guided by spatial geometry features, as well as a monochromatic design that encodes each color channel individually. We conducted extensive evaluations, and the experimental results demonstrate our superiority over previous learning-based techniques. Our approach excels at handling complex lighting such as varying-colored lighting and environment lighting. It can successfully capture distant indirect illumination and simulates the interreflections between textured surfaces well (i.e., color bleeding effects); it can also effectively handle new scenes that are not present in the training dataset.
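摘要中的两个设计点——几何特征引导的注意力权重与"单色"逐通道聚合——可以用如下极简草图表达:注意力权重仅由空间几何特征(query/key)计算,再分别作用于 R、G、B 三个通道的特征。张量形状与特征流水线均为示意性假设,并非论文的实际网络结构。

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def monochromatic_attention(feats_rgb, geom_q, geom_k):
    """Attention weights come from spatial-geometry features
    (queries/keys) and are applied separately to each colour
    channel's features (the 'monochromatic' design). Shapes are
    illustrative: feats_rgb is [N, 3], geom_q/geom_k are [N, D]."""
    w = softmax(geom_q @ geom_k.T / np.sqrt(geom_k.shape[1]))  # [N, N]
    return np.stack([w @ feats_rgb[:, c] for c in range(3)], axis=-1)
```

逐通道处理意味着每个颜色通道独立聚合全局信息,这与摘要中"monochromatic design that encodes each color channel individually"的描述相对应。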
zh

[AI-25] Wavelet Mixture of Experts for Time Series Forecasting

【速读】:该论文旨在解决传统Transformer模型参数量大、难以捕捉数据非平稳特征,以及多层感知机(MLP)模型在处理多通道依赖关系时效率低的问题。其解决方案的关键在于提出两种轻量级时间序列预测模型:WaveTS-B通过小波变换(wavelet transform)将数据映射到小波域,从而同时捕获周期性和非平稳特性;在此基础上进一步设计WaveTS-M模型,引入基于门控机制和专家网络的混合专家(Mixture of Experts, MoE)框架,并结合通道聚类策略,有效建模多通道间的复杂依赖关系,显著提升了多通道时间序列预测性能,且参数量远低于现有方法。

链接: https://arxiv.org/abs/2508.08825
作者: Zheng Zhou,Yu-Jie Xiong,Jia-Chen Zhang,Chun-Ming Xia,Xi-Jiong Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The field of time series forecasting is rapidly advancing, with recent large-scale Transformers and lightweight Multilayer Perceptron (MLP) models showing strong predictive performance. However, conventional Transformer models are often hindered by their large number of parameters and their limited ability to capture non-stationary features in data through smoothing. Similarly, MLP models struggle to manage multi-channel dependencies effectively. To address these limitations, we propose a novel, lightweight time series prediction model, WaveTS-B. This model combines wavelet transforms with MLP to capture both periodic and non-stationary characteristics of data in the wavelet domain. Building on this foundation, we propose a channel clustering strategy that incorporates a Mixture of Experts (MoE) framework, utilizing a gating mechanism and expert network to handle multi-channel dependencies efficiently. We propose WaveTS-M, an advanced model tailored for multi-channel time series prediction. Empirical evaluation across eight real-world time series datasets demonstrates that our WaveTS series models achieve state-of-the-art (SOTA) performance with significantly fewer parameters. Notably, WaveTS-M shows substantial improvements on multi-channel datasets, highlighting its effectiveness.
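WaveTS 系列把序列映射到小波域,分离平滑(近似)分量与非平稳(细节)分量后再交给 MLP。下面用单层 Haar 分解演示这种拆分;WaveTS 实际采用的小波基与网络结构摘要中并未给出,此处仅为假设性示意。

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar decomposition: split a series into a smooth
    approximation (captures periodic/trend behaviour) and a detail
    part (captures non-stationary fluctuations) -- the kind of
    wavelet split WaveTS feeds into an MLP. (The paper's actual
    wavelet choice is an assumption here.)"""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:  # pad to even length by repeating the last value
        x = np.append(x, x[-1])
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail
```

对完全平滑的序列,细节分量为零,所有信息集中在近似分量;非平稳扰动则主要体现在细节分量中,两路分别建模即是该类方法的出发点。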
zh

[AI-26] OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads

【速读】:该论文旨在解决当前人工智能模型因复杂度提升而导致的计算瓶颈问题,特别是由大规模矩阵乘法运算引发的冯·诺依曼瓶颈(Von Neumann bottleneck)。传统数字或模拟存内计算架构在性能和能效方面存在显著局限,难以满足高效率、高可扩展性的需求。其解决方案的关键在于提出一种新型存内计算架构OISMA(On-the-fly In-memory Stochastic Multiplication Architecture),该架构利用准随机计算域(Bent-Pyramid系统)的计算简洁性,在不牺牲数字存储器效率、可扩展性和生产力的前提下,将常规内存读取操作转换为原位随机乘法运算,并通过外围累积电路实现矩阵乘法功能。该设计在保持低功耗与高面积效率的同时具备良好的计算精度:相对于64位双精度浮点格式,其平均相对Frobenius误差随矩阵规模增大而显著下降,从9.42%(4×4)降至1.81%(512×512)。

链接: https://arxiv.org/abs/2508.08822
作者: Shady Agwa,Yihan Pan,Georgios Papandroulidakis,Themis Prodromakis
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Performance (cs.PF)
备注: 12 pages, 13 figures. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Artificial Intelligence models are currently driven by a significant up-scaling of their complexity, with massive matrix multiplication workloads representing the major computational bottleneck. In-memory computing architectures are proposed to avoid the Von Neumann bottleneck. However, both digital/binary-based and analogue in-memory computing architectures suffer from various limitations, which significantly degrade the performance and energy efficiency gains. This work proposes OISMA, a novel in-memory computing architecture that utilizes the computational simplicity of a quasi-stochastic computing domain (Bent-Pyramid system), while keeping the same efficiency, scalability, and productivity of digital memories. OISMA converts normal memory read operations into in-situ stochastic multiplication operations with a negligible cost. An accumulation periphery then accumulates the output multiplication bitstreams, achieving the matrix multiplication functionality. Extensive matrix multiplication benchmarking was conducted to analyze the accuracy of the Bent-Pyramid system, using matrix dimensions ranging from 4x4 to 512x512. The accuracy results show a significant decrease in the average relative Frobenius error, from 9.42% (for 4x4) to 1.81% (for 512x512), compared to 64-bit double precision floating-point format. A 1T1R OISMA array of 4 KB capacity was implemented using a commercial 180nm technology node and in-house RRAM technology. At 50 MHz, OISMA achieves 0.891 TOPS/W and 3.98 GOPS/mm2 for energy and area efficiency, respectively, occupying an effective computing area of 0.804241 mm2. Scaling OISMA from 180nm to 22nm technology shows a significant improvement of two orders of magnitude in energy efficiency and one order of magnitude in area efficiency, compared to dense matrix multiplication in-memory computing architectures.
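随机计算(stochastic computing)中最经典的乘法实现,是把 [0, 1] 区间的数值编码为随机比特流,再对两条流做按位与:结果流中 1 的占比近似等于乘积。OISMA 采用的 Bent-Pyramid 编码在细节上与此不同,下面仅演示该原理的教科书形式。

```python
import numpy as np

def stochastic_multiply(a, b, n_bits=20000, seed=0):
    """Textbook stochastic-computing multiply: encode a, b in [0, 1]
    as random bitstreams with ones-density a and b, AND the streams,
    and read the result's ones-density as an estimate of a*b. This
    illustrates the principle only; OISMA's Bent-Pyramid encoding
    differs in detail."""
    rng = np.random.default_rng(seed)
    sa = rng.random(n_bits) < a  # P(bit = 1) = a
    sb = rng.random(n_bits) < b  # P(bit = 1) = b
    return float(np.mean(sa & sb))  # estimate of a * b
```

估计误差随比特流长度按 1/√n 收敛,这正是该类架构"以长码换精度"的权衡,也解释了论文中误差随矩阵规模(累积次数)变化的统计特性。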
zh

[AI-27] Efficient Agent : Optimizing Planning Capability for Multimodal Retrieval Augmented Generation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在现实场景中因缺乏动态信息检索能力而导致的时效性不足问题,尤其是现有生成式AI(Generative AI)方法在处理新闻分析和热点话题时存在检索策略僵化与视觉信息利用不充分的问题。其解决方案的关键在于提出E-Agent框架,包含两个核心创新:一是基于上下文推理训练的mRAG规划器(mRAG planner),用于动态调度多模态工具;二是任务执行器(task executor)采用工具感知的任务排序机制,实现优化的mRAG工作流。该框架采用一次性mRAG规划策略,在提升检索效率的同时显著减少冗余工具调用,实验表明其在准确率上相较当前最优方法提升13%,冗余搜索减少37%。

链接: https://arxiv.org/abs/2508.08816
作者: Yuechen Wang,Yuming Qiao,Dan Meng,Jun Yang,Haonan Lu,Zhenyu Yang,Xudong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Retrieval-Augmented Generation (mRAG) has emerged as a promising solution to address the temporal limitations of Multimodal Large Language Models (MLLMs) in real-world scenarios like news analysis and trending topics. However, existing approaches often suffer from rigid retrieval strategies and under-utilization of visual information. To bridge this gap, we propose E-Agent, an agent framework featuring two key innovations: a mRAG planner trained to dynamically orchestrate multimodal tools based on contextual reasoning, and a task executor employing tool-aware execution sequencing to implement optimized mRAG workflows. E-Agent adopts a one-time mRAG planning strategy that enables efficient information retrieval while minimizing redundant tool invocations. To rigorously assess the planning capabilities of mRAG systems, we introduce the Real-World mRAG Planning (RemPlan) benchmark. This novel benchmark contains both retrieval-dependent and retrieval-independent question types, systematically annotated with essential retrieval tools required for each instance. The benchmark’s explicit mRAG planning annotations and diverse question design enhance its practical relevance by simulating real-world scenarios requiring dynamic mRAG decisions. Experiments across RemPlan and three established benchmarks demonstrate E-Agent’s superiority: 13% accuracy gain over state-of-the-art mRAG methods while reducing redundant searches by 37%.
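E-Agent 的"一次性 mRAG 规划"意味着在执行前一次性决定整条工具链,而非逐步重新规划,从而减少冗余调用。下面的草图展示这一思路;其中的工具名称与判断规则均为虚构示意,并非论文的实际规划器。

```python
def plan_once(question, needs_retrieval):
    """One-shot mRAG planning sketch: decide the whole tool sequence
    up front instead of re-planning per step. Tool names and the
    decision rule are invented for illustration only."""
    tools = ["image_caption"] if "image" in question else []
    if needs_retrieval(question):
        tools += ["web_search", "rerank"]
    return tools + ["answer"]
```

真实系统中 `needs_retrieval` 由训练过的规划器基于上下文推理给出,这也正对应 RemPlan 基准中"依赖检索 / 不依赖检索"两类问题的区分。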
zh

[AI-28] GRainsaCK: a Comprehensive Software Library for Benchmarking Explanations of Link Prediction Tasks on Knowledge Graphs

【速读】:该论文旨在解决知识图谱(Knowledge Graph)中因结构不完整而导致的链接预测(Link Prediction)问题,尤其是现有基于嵌入的方法虽具备可扩展性但缺乏可解释性这一痛点。为提升解释的可信度与实用性,论文提出了一种名为GRainsaCK的可复用软件资源,其核心创新在于构建了一个统一且自动化的评估流程,涵盖从模型训练到解释结果评价的全流程标准化协议,从而填补了当前解释方法缺乏量化比较基准的空白。此外,GRainsaCK通过模块化设计实现组件的灵活替换,增强了系统的可扩展性,并提供详尽文档与教程以促进社区复用。

链接: https://arxiv.org/abs/2508.08815
作者: Roberto Barile,Claudia d’Amato,Nicola Fanizzi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Since Knowledge Graphs are often incomplete, link prediction methods are adopted for predicting missing facts. Scalable embedding based solutions are mostly adopted for this purpose, however, they lack comprehensibility, which may be crucial in several domains. Explanation methods tackle this issue by identifying supporting knowledge explaining the predicted facts. Regretfully, evaluating/comparing quantitatively the resulting explanations is challenging as there is no standard evaluation protocol and overall benchmarking resource. We fill this important gap by proposing GRainsaCK, a reusable software resource that fully streamlines all the tasks involved in benchmarking explanations, i.e., from model training to evaluation of explanations along the same evaluation protocol. Moreover, GRainsaCK furthers modularity/extensibility by implementing the main components as functions that can be easily replaced. Finally, fostering its reuse, we provide extensive documentation including a tutorial.
zh

[AI-29] mpOpt – Unsupervised Alarm Relation Learning for Telecommunication Networks

【速读】:该论文旨在解决电信网络中故障告警(fault alarms)关联关系难以有效识别的问题,尤其是在多厂商、多节点的复杂网络环境中,单一故障常引发大量跨节点的告警序列,导致网络操作中心(NOC)工程师难以快速定位根因告警(root alarm)。为提升告警分析效率与准确性,论文提出一种新颖的无监督告警关系学习方法——Temporal Optimization(TempOpt),其关键在于通过时间优化机制建模告警间的时序依赖性,从而克服现有基于时序依赖(temporal dependency)方法在实际场景中泛化能力弱、关系建模不准确等局限,实验表明TempOpt能更高质量地学习告警间的关系,助力快速精准定位故障根源。

链接: https://arxiv.org/abs/2508.08814
作者: Sathiyanaryanan Sampath,Pratyush Uppuluri,Thirumaran Ekambaram
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 9 figures. IEEE 21st India Council International Conference (INDICON), 2024

点击查看摘要

Abstract:In a telecommunications network, fault alarms generated by network nodes are monitored in a Network Operations Centre (NOC) to ensure network availability and continuous network operations. The monitoring process comprises of tasks such as active alarms analysis, root alarm identification, and resolution of the underlying problem. Each network node potentially can generate alarms of different types, while nodes can be from multiple vendors, a network can have hundreds of nodes thus resulting in an enormous volume of alarms at any time. Since network nodes are inter-connected, a single fault in the network would trigger multiple sequences of alarms across a variety of nodes and from a monitoring point of view, it is a challenging task for a NOC engineer to be aware of relations between the various alarms, when trying to identify, for example, a root alarm on which an action needs to be taken. To effectively identify root alarms, it is essential to learn relation among the alarms for accurate and faster resolution. In this work we propose a novel unsupervised alarm relation learning technique Temporal Optimization (TempOpt) that is practical and overcomes the limitations of an existing class of alarm relational learning method-temporal dependency methods. Experiments have been carried on real-world network datasets, that demonstrate the improved quality of alarm relations learned by TempOpt as compared to temporal dependency method.
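作为 TempOpt 所改进的"时序依赖"类方法的参照,最基础的告警关系挖掘是窗口内共现统计:统计告警类型 B 在类型 A 之后某时间窗内出现的次数。以下为该基线的极简实现(要求事件按时间升序排列);TempOpt 本身的优化目标摘要中未给出,此处不作代码化。

```python
from collections import Counter

def windowed_cooccurrence(events, window):
    """Baseline temporal-dependency scoring: count how often alarm
    type B fires within `window` seconds after alarm type A.
    `events` = [(timestamp, alarm_type), ...], sorted by timestamp
    (the early `break` relies on this ordering)."""
    pairs = Counter()
    for i, (t_a, a) in enumerate(events):
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break  # later events are even further away
            if a != b:
                pairs[(a, b)] += 1
    return pairs
```

这类纯窗口统计对窗口大小敏感、易把偶然同窗的告警误判为相关,正是论文指出的时序依赖方法在真实网络中泛化不佳的原因之一。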
zh

[AI-30] Not in My Backyard! Temporal Voting Over Public Chores IJCAI

【速读】:该论文旨在解决具有动态偏好的投票模型中公共事务(public chores)分配的福利优化问题,重点分析效用最大化(utilitarian welfare)和公平性最大化(egalitarian welfare)的计算复杂性。其关键在于揭示:虽然效用最大化的优化问题在计算上是高效的,但公平性最小化问题即使在高度受限的情形下也属于计算难解(NP-hard),然而研究进一步识别出若干可高效求解或通过近似算法处理的特定场景。此外,论文还探讨了时间公平性约束对社会福利的影响、在线算法的竞争比以及参与者的策略行为,为设计兼具效率与公平性的决策机制提供了理论基础和实践指导。

链接: https://arxiv.org/abs/2508.08810
作者: Edith Elkind,Tzeh Yuan Neoh,Nicholas Teh
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
备注: Appears in the 34th International Joint Conference on Artificial Intelligence (IJCAI), 2025

点击查看摘要

Abstract:We study a temporal voting model where voters have dynamic preferences over a set of public chores – projects that benefit society, but impose individual costs on those affected by their implementation. We investigate the computational complexity of optimizing utilitarian and egalitarian welfare. Our results show that while optimizing the former is computationally straightforward, minimizing the latter is computationally intractable, even in very restricted cases. Nevertheless, we identify several settings where this problem can be solved efficiently, either exactly or by an approximation algorithm. We also examine the effects of enforcing temporal fairness and its impact on social welfare, and analyze the competitive ratio of online algorithms. We then explore the strategic behavior of agents, providing insights into potential malfeasance in such decision-making environments. Finally, we discuss a range of fairness measures and their suitability for our setting.
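摘要指出效用(utilitarian)最大化在计算上是容易的:在每轮选一项公共事务、且代价可加的简化模型下,总成本按轮分解,逐轮选总成本最小的事务即为最优。以下草图采用该简化设定(`costs[r][c]` 表示第 r 轮实施事务 c 的选民总成本,为对论文模型的示意性简化)。

```python
def utilitarian_schedule(costs):
    """Utilitarian scheduling sketch: with one chore per round and
    additive disutilities, total cost separates across rounds, so
    picking the chore with minimum summed voter cost each round is
    globally optimal -- matching the abstract's claim that
    utilitarian welfare is computationally straightforward."""
    return [min(range(len(row)), key=row.__getitem__) for row in costs]
```

与之相对,公平性(egalitarian)目标需最小化单个选民承受的最大累积成本,各轮决策相互耦合,这正是论文证明其 NP-hard 的根源。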
zh

[AI-31] Opening Musical Creativity? Embedded Ideologies in Generative-AI Music Systems

【速读】:该论文试图解决的问题是:生成式AI音乐系统在宣传中常以“民主化音乐创作”为口号,但其实际发展与应用中是否存在真正的包容性,抑或仅是一种市场化的修辞策略。解决方案的关键在于通过结合自传式民族志(autoethnography)与数字民族志(digital ethnography),系统分析四款公开可用的生成式AI音乐工具(AIVA、Stable Audio、Suno和Udio)的开发者话语与用户接受情况,揭示其背后共享的意识形态——即一种个体主义、全球主义、技术自由主义且伦理回避的“总体意识形态”,该意识形态模糊了个人责任,并将音乐的本质及其实践方式重构以适应生成式AI的结果。

链接: https://arxiv.org/abs/2508.08805
作者: Liam Pram,Fabio Morreale
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Extended version of the presentation at The First International Conference in AI Music Studies 2024

点击查看摘要

Abstract:AI systems for music generation are increasingly common and easy to use, granting people without any musical background the ability to create music. Because of this, generative-AI has been marketed and celebrated as a means of democratizing music making. However, inclusivity often functions as marketable rhetoric rather than a genuine guiding principle in these industry settings. In this paper, we look at four generative-AI music making systems available to the public as of mid-2025 (AIVA, Stable Audio, Suno, and Udio) and track how they are rhetoricized by their developers, and received by users. Our aim is to investigate ideologies that are driving the early-stage development and adoption of generative-AI in music making, with a particular focus on democratization. A combination of autoethnography and digital ethnography is used to examine patterns and incongruities in rhetoric when positioned against product functionality. The results are then collated to develop a nuanced, contextual discussion. The shared ideology we map between producers and consumers is individualist, globalist, techno-liberal, and ethically evasive. It is a ‘total ideology’ which obfuscates individual responsibility, and through which the nature of music and musical practice is transfigured to suit generative outcomes.
zh

[AI-32] TechOps: Technical Documentation Templates for the AI Act

【速读】:该论文旨在解决欧盟人工智能法案(EU AI Act)在实际操作中面临的合规性挑战,即现有AI系统技术文档模板未能全面覆盖从数据到模型再到应用的整个AI生命周期,导致透明度、可追溯性和问责制不足的问题。解决方案的关键在于提出一套开源的技术文档模板与实例,涵盖数据、模型和应用三个核心环节,通过结构化记录各阶段状态实现全生命周期的可追踪性与可复现性,从而满足法规要求并促进负责任的AI开发。这些模板已在真实场景中验证其有效性,包括皮肤色调数据集用于公平性评估、人像分割神经网络以及建筑工地安全监控系统的部署案例,证明其具备良好的可用性和实施潜力。

链接: https://arxiv.org/abs/2508.08804
作者: Laura Lucaj,Alex Loosley,Hakan Jonsson,Urs Gasser,Patrick van der Smagt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Operationalizing the EU AI Act requires clear technical documentation to ensure AI systems are transparent, traceable, and accountable. Existing documentation templates for AI systems do not fully cover the entire AI lifecycle while meeting the technical documentation requirements of the AI Act. This paper addresses those shortcomings by introducing open-source templates and examples for documenting data, models, and applications to provide sufficient documentation for certifying compliance with the AI Act. These templates track the system status over the entire AI lifecycle, ensuring traceability, reproducibility, and compliance with the AI Act. They also promote discoverability and collaboration, reduce risks, and align with best practices in AI documentation and governance. The templates are evaluated and refined based on user feedback to enable insights into their usability and implementability. We then validate the approach on real-world scenarios, providing examples that further guide their implementation: the data template is followed to document a skin tones dataset created to support fairness evaluations of downstream computer vision models and human-centric applications; the model template is followed to document a neural network for segmenting human silhouettes in photos. The application template is tested on a system deployed for construction site safety using real-time video analytics and sensor data. Our results show that TechOps can serve as a practical tool to enable oversight for regulatory compliance and responsible AI development.
zh

[AI-33] Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge RECSYS'25

【速读】:该论文旨在解决个性化推荐评估中的核心挑战,特别是在长音频领域(如播客)中,传统离线指标因暴露偏差(exposure bias)而失效,而在线方法(如A/B测试)则成本高且受运营限制。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的离线评估框架,通过两阶段的“用户画像感知”方法实现高效、可解释的推荐质量判断:首先从用户90天的收听历史中构建自然语言形式的用户画像(user profiles),该画像融合主题兴趣与行为模式,作为紧凑且语义丰富的偏好表示;随后利用这些画像作为上下文引导LLM进行细粒度的点对点(pointwise)和成对(pairwise)判断,从而提升推荐内容与用户兴趣之间的匹配度评估准确性。实验表明,该方法在控制研究中能高保真地匹配人工判断,优于或等同于使用原始收听数据的变体方案,为推荐系统迭代优化和模型选择提供了可扩展的评估工具。

链接: https://arxiv.org/abs/2508.08777
作者: Francesco Fabbri,Gustavo Penha,Edoardo D’Amico,Alice Wang,Marco De Nadai,Jackie Doremus,Paul Gigioli,Andreas Damianou,Oskar Stal,Mounia Lalmas
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at RecSys '25

点击查看摘要

Abstract:Evaluating personalized recommendations remains a central challenge, especially in long-form audio domains like podcasts, where traditional offline metrics suffer from exposure bias and online methods such as A/B testing are costly and operationally constrained. In this paper, we propose a novel framework that leverages Large Language Models (LLMs) as offline judges to assess the quality of podcast recommendations in a scalable and interpretable manner. Our two-stage profile-aware approach first constructs natural-language user profiles distilled from 90 days of listening history. These profiles summarize both topical interests and behavioral patterns, serving as compact, interpretable representations of user preferences. Rather than prompting the LLM with raw data, we use these profiles to provide high-level, semantically rich context, enabling the LLM to reason more effectively about alignment between a user’s interests and recommended episodes. This reduces input complexity and improves interpretability. The LLM is then prompted to deliver fine-grained pointwise and pairwise judgments based on the profile-episode match. In a controlled study with 47 participants, our profile-aware judge matched human judgments with high fidelity and outperformed or matched a variant using raw listening histories. The framework enables efficient, profile-aware evaluation for iterative testing and model selection in recommender systems.
zh
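
摘要中的两阶段评审流程可以用如下提示词构造草图来示意。提示词措辞与字段均为本文假设,并非论文的原始模板;实际系统中由 LLM 对提示词返回评分或偏好判断:

```python
# 示意代码:基于用户画像的逐点(pointwise)与成对(pairwise)评审提示词构造。
# 提示词内容为假设,仅用于说明“画像 + 候选节目”的输入组织方式。

def build_pointwise_prompt(profile: str, episode: str) -> str:
    return (
        "You are judging a podcast recommendation.\n"
        f"User profile (distilled from 90 days of listening):\n{profile}\n\n"
        f"Recommended episode:\n{episode}\n\n"
        "Rate how well this episode matches the user's interests (1-5). "
        "Answer with a single digit."
    )

def build_pairwise_prompt(profile: str, episode_a: str, episode_b: str) -> str:
    return (
        "You are judging two podcast recommendations.\n"
        f"User profile:\n{profile}\n\n"
        f"Episode A:\n{episode_a}\n\nEpisode B:\n{episode_b}\n\n"
        "Which episode better matches the user's interests? Answer 'A' or 'B'."
    )
```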

[AI-34] Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT

【速读】:该论文旨在解决便利商店中机器人执行抓取与放置(pick-and-place)任务时面临的挑战,包括物品密集排列、遮挡以及物体在颜色、形状、尺寸和纹理上的多样性,这些问题显著增加了轨迹规划与抓取的难度。解决方案的关键在于提出一个感知-动作(perception-action)流水线,其中利用标注引导的视觉提示(annotation-guided visual prompting)通过边界框标注明确可抓取物体和放置位置,提供结构化的空间指导;同时采用基于Transformer的Action Chunking with Transformers(ACT)作为模仿学习算法,从人类示范中预测分块动作序列,从而实现平滑、自适应且数据驱动的抓取与放置操作。

链接: https://arxiv.org/abs/2508.08748
作者: Muhammad A. Muttaqien,Tomohiro Motoda,Ryo Hanai,Yukiyasu Domae
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. This facilitates smooth, adaptive, and data-driven pick-and-place operations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.
zh

[AI-35] Simulating Generative Social Agents via Theory-Informed Workflow Design

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的社会代理(social agents)普遍缺乏统一设计框架的问题,导致其在不同社会情境中难以泛化且行为一致性差。解决方案的关键在于提出一个理论驱动的框架,该框架以社会认知理论(Social Cognition Theory)为基础,引入动机模块、行动规划模块和学习模块三个核心组件,使代理能够推理目标、制定连贯行动并动态调整行为,从而生成更灵活且符合现实社会规范的行为模式。实验表明,该框架显著降低与真实人类行为数据的偏差(最多达75%),且消融实验证实各模块对生成逼真社会行为具有不可替代的作用。

链接: https://arxiv.org/abs/2508.08726
作者: Yuwei Yan,Jinghua Piao,Xiaochong Lan,Chenyang Shao,Pan Hui,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Recent advances in large language models have demonstrated strong reasoning and role-playing capabilities, opening new opportunities for agent-based social simulations. However, most existing agents’ implementations are scenario-tailored, without a unified framework to guide the design. This lack of a general social agent limits their ability to generalize across different social contexts and to produce consistent, realistic behaviors. To address this challenge, we propose a theory-informed framework that provides a systematic design process for LLM-based social agents. Our framework is grounded in principles from Social Cognition Theory and introduces three key modules: motivation, action planning, and learning. These modules jointly enable agents to reason about their goals, plan coherent actions, and adapt their behavior over time, leading to more flexible and contextually appropriate responses. Comprehensive experiments demonstrate that our theory-driven agents reproduce realistic human behavior patterns under complex conditions, achieving up to 75% lower deviation from real-world behavioral data across multiple fidelity metrics compared to classical generative baselines. Ablation studies further show that removing motivation, planning, or learning modules increases errors by 1.5 to 3.2 times, confirming their distinct and essential contributions to generating realistic and coherent social behaviors.
zh

[AI-36] Generative Modeling for Robust Deep Reinforcement Learning on the Traveling Salesman Problem

【速读】:该论文旨在解决现有神经网络求解器在处理现实世界中分布的旅行商问题(Traveling Salesman Problem, TSP)时泛化能力不足的问题,尤其是其在最坏情况下的性能表现较差。核心问题是:当前基于合成数据训练的神经网络模型难以适应实际应用中复杂且多样化的TSP分布,导致在真实场景下推理效果不稳定。解决方案的关键在于提出一种名为“组合优化生成采样”(Combinatorial Optimization with Generative Sampling, COGS)的新方法,通过从一个生成式TSP模型中采样训练数据,显著提升训练数据在TSP分布空间中的覆盖度与插值能力,从而增强模型对不同分布的鲁棒性,尤其改善了最坏情况下的求解性能。

链接: https://arxiv.org/abs/2508.08718
作者: Michael Li,Eric Bae,Christopher Haberland,Natasha Jaques
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 8 figures

点击查看摘要

Abstract:The Traveling Salesman Problem (TSP) is a classic NP-hard combinatorial optimization task with numerous practical applications. Classic heuristic solvers can attain near-optimal performance for small problem instances, but become computationally intractable for larger problems. Real-world logistics problems such as dynamically re-routing last-mile deliveries demand a solver with fast inference time, which has led researchers to investigate specialized neural network solvers. However, neural networks struggle to generalize beyond the synthetic data they were trained on. In particular, we show that there exist TSP distributions that are realistic in practice, which also consistently lead to poor worst-case performance for existing neural approaches. To address this issue of distribution robustness, we present Combinatorial Optimization with Generative Sampling (COGS), where training data is sampled from a generative TSP model. We show that COGS provides better data coverage and interpolation in the space of TSP training distributions. We also present TSPLib50, a dataset of realistically distributed TSP samples, which tests real-world generalization ability without conflating this issue with instance size. We evaluate our method on various synthetic datasets as well as TSPLib50, and compare to state-of-the-art neural baselines. We demonstrate that COGS improves distribution robustness, with most performance gains coming from worst-case scenarios.
zh
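
COGS 的核心思想是从生成式 TSP 模型中采样训练实例,以扩大训练分布的覆盖面。下面的草图用一个简单的“均匀 + 聚类”混合分布近似这一思想(分布形式与参数均为本文假设,并非论文的生成模型):

```python
import random

# 示意代码:从混合分布中采样 TSP 实例,覆盖均匀与聚类两类节点布局。
# p_clustered、n_clusters、spread 等参数均为假设。

def sample_tsp_instance(n, p_clustered=0.5, n_clusters=3, spread=0.05, rng=None):
    rng = rng or random.Random(0)
    if rng.random() >= p_clustered:                    # 均匀分布实例
        return [(rng.random(), rng.random()) for _ in range(n)]
    centers = [(rng.random(), rng.random()) for _ in range(n_clusters)]
    pts = []                                           # 聚类分布实例
    for _ in range(n):
        cx, cy = rng.choice(centers)
        pts.append((min(max(cx + rng.gauss(0, spread), 0.0), 1.0),
                    min(max(cy + rng.gauss(0, spread), 0.0), 1.0)))
    return pts
```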

[AI-37] Imposing AI: Deceptive design patterns against sustainability

【速读】:该论文旨在解决生成式 AI (Generative AI) 在数字服务中大规模部署所引发的显著环境危害问题,尤其关注技术公司如何通过界面设计策略强制用户采纳 AI 功能,从而加剧环境负担。其解决方案的关键在于识别出两种主要的设计策略:一是将 AI 功能嵌入用户界面以挤压原有非 AI 功能的空间,二是构建有利于 AI 推广的叙事框架以降低用户抵制意愿;文章进一步指出,针对此类“强制采用”行为建立监管机制,是缓解 AI 使用带来的负面环境效应的核心路径。

链接: https://arxiv.org/abs/2508.08672
作者: Anaëlle Beignon,Thomas Thibault,Nolwenn Maudet
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI is being massively deployed in digital services, at a scale that will result in significant environmental harm. We document how tech companies are transforming established user interfaces to impose AI use and show how and to what extent these strategies fit within established deceptive pattern categories. We identify two main design strategies that are implemented to impose AI use in both personal and professional contexts: imposing AI features in interfaces at the expense of existing non-AI features and promoting narratives about AI that make it harder to resist using it. We discuss opportunities for regulating the imposed adoption of AI features, which would inevitably lead to negative environmental effects.
zh

[AI-38] Aryabhata: An exam-focused language model for JEE Math

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在教育场景中适用性不足的问题,特别是针对印度国家级学术考试——联合入学考试(Joint Entrance Examination, JEE)的数学推理需求。其解决方案的关键在于:首先融合多个强健的开源推理模型,并通过监督微调(Supervised Fine-Tuning, SFT)结合课程学习(Curriculum Learning)策略,在经过最佳- n 拒绝采样(best-of-n rejection sampling)筛选的可信思维链(Chain-of-Thought, CoT)数据上进行训练;随后引入基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR),采用A2C目标函数与组相对优势估计(group-relative advantage estimation),并创新性地结合自适应组大小调整(Adaptive Group Resizing)和温度缩放(Temperature Scaling)等探索策略,显著提升模型在JEE主考题及跨域基准(如MATH、GSM8K)上的准确性与效率,同时提供教学友好的分步推理过程。

链接: https://arxiv.org/abs/2508.08665
作者: Ritvik Rastogi,Sachin Dharashivkar,Sandeep Varma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Aryabhata 1.0, a compact 7B parameter math reasoning model optimized for the Indian academic exam, the Joint Entrance Examination (JEE). Despite rapid progress in large language models (LLMs), current models often remain unsuitable for educational use. Aryabhata 1.0 is built by merging strong open-weight reasoning models, followed by supervised fine-tuning (SFT) with curriculum learning on verified chain-of-thought (CoT) traces curated through best-of-n rejection sampling. To further boost performance, we apply reinforcement learning with verifiable rewards (RLVR) using an A2C objective with group-relative advantage estimation, along with novel exploration strategies such as Adaptive Group Resizing and Temperature Scaling. Evaluated on both in-distribution (JEE Main 2025) and out-of-distribution (MATH, GSM8K) benchmarks, Aryabhata outperforms existing models in accuracy and efficiency, while offering pedagogically useful step-by-step reasoning. We release Aryabhata as a foundation model to advance exam-centric, open-source small language models. This marks our first open release for community feedback (Aryabhata 1.0 on Hugging Face: this https URL); PW is actively training future models to further improve learning outcomes for students.
zh
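
摘要中提到的“组相对优势估计”(group-relative advantage estimation)通常指:对同一道题采样一组答案,用组内奖励的均值与标准差对每个答案的奖励做归一化。下面是该归一化步骤的最小草图(GRPO 风格的组内归一化,具体细节以论文为准):

```python
# 示意代码:组相对优势估计(组内归一化,非论文原始实现)。

def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # 奖励高于组均值的答案获得正优势,低于组均值的获得负优势
    return [(r - mean) / (std + eps) for r in rewards]
```

当组内奖励全部相同(例如全对或全错)时,优势均为 0,该组不提供梯度信号;摘要中的 Adaptive Group Resizing 等探索策略即针对此类情形。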

[AI-39] Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics

【速读】:该论文旨在解决生成式 AI 在软件工程任务中因代码变更的结构复杂性和上下文依赖性而导致的幻觉(hallucination)问题,特别是在提交信息生成和代码审查评论生成这两个关键任务中。其解决方案的关键在于首次系统性地分析了幻觉在代码变更场景下的发生频率,并提出基于多指标融合的自动检测方法;研究发现,仅依赖单一指标效果有限,而结合模型置信度与特征归因(feature attribution)等多维指标可显著提升检测性能,为推理阶段的幻觉识别提供了有效路径。

链接: https://arxiv.org/abs/2508.08661
作者: Chunhua Liu,Hong Yi Lin,Patanamon Thongtanunam
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 8 main pages, 5 figures

点击查看摘要

Abstract:Language models have shown strong capabilities across a wide range of tasks in software engineering, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection. (All code and data will be released upon acceptance.)
zh
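
摘要指出单一指标检测幻觉效果较弱,组合多种指标后性能显著提升。下面给出一个把多个指标归一化后加权融合的最小草图(权重与归一化方式为本文假设;实际系统应在带标注数据上学习权重,例如用逻辑回归):

```python
# 示意代码:多指标融合的幻觉打分(数值越高表示越可疑,约定为假设)。

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def combined_hallucination_scores(metric_columns, weights=None):
    """metric_columns: 每个指标一列逐样本分数;返回融合后的逐样本分数。"""
    weights = weights or [1.0] * len(metric_columns)
    cols = [min_max_normalize(col) for col in metric_columns]
    n = len(cols[0])
    total_w = sum(weights)
    return [sum(w * col[i] for w, col in zip(weights, cols)) / total_w
            for i in range(n)]
```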

[AI-40] Hybrid Node-Destroyer Model with Large Neighborhood Search for Solving the Capacitated Vehicle Routing Problem

【速读】:该论文旨在解决元启发式算法在求解车辆路径问题(Capacitated Vehicle Routing Problem, CVRP)时性能受限的问题,特别是搜索效率低和解的质量不稳定。解决方案的关键在于提出一种迭代学习混合优化求解器,其核心是引入基于图神经网络(Graph Neural Networks, GNNs)的节点破坏模型(Node-Destroyer Model),该模型能够利用问题与解的图结构特性,智能识别并选择需移除的客户节点,从而指导大邻域搜索(Large Neighborhood Search, LNS)操作,有效缩小搜索空间并降低运算复杂度。该方法无需针对不同规模问题实例重新训练,且能显著提升多种基线元启发式算法的性能,在标准CVRP基准数据集及包含最多30,000个客户节点的大规模实例上均表现出优越的解质量与可扩展性。

链接: https://arxiv.org/abs/2508.08659
作者: Bachtiar Herdianto,Romain Billot,Flavien Lucas,Marc Sevaux,Daniele Vigo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 10 figures

点击查看摘要

Abstract:In this research, we propose an iterative learning hybrid optimization solver developed to strengthen the performance of metaheuristic algorithms in solving the Capacitated Vehicle Routing Problem (CVRP). The iterative hybrid mechanism integrates the proposed Node-Destroyer Model, a hybrid machine learning model that utilizes Graph Neural Networks (GNNs) to identify and select customer nodes, guiding the Large Neighborhood Search (LNS) operator within metaheuristic optimization frameworks. This model leverages the structural properties of the problem and solution, both of which can be represented as graphs, to guide strategic selections concerning node removal. The proposed approach reduces operational complexity and scales down the search space involved in the optimization process. The hybrid approach is applied specifically to the CVRP and does not require retraining across problem instances of different sizes. The proposed hybrid mechanism is able to improve the performance of baseline metaheuristic algorithms. Our approach not only enhances the solution quality for standard CVRP benchmarks but also proves scalable on very large-scale instances with up to 30,000 customer nodes. Experimental evaluations on benchmark datasets show that the proposed hybrid mechanism is capable of improving different baseline algorithms, achieving better quality of solutions under similar settings.
zh
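
论文的贡献在于用 GNN 模型(Node-Destroyer)来选择要移除的客户节点;下面的草图只展示其外围的 LNS“破坏-修复”循环,破坏算子以随机移除代替,坐标、需求与容量等数据均为假设:

```python
import math
import random

# 示意代码:CVRP 的 LNS 主循环(随机破坏 + 贪心修复,非论文的 GNN 破坏器)。

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_cost(routes, coords, depot=(0.5, 0.5)):
    total = 0.0
    for r in routes:
        prev = depot
        for c in r:
            total += dist(prev, coords[c])
            prev = coords[c]
        total += dist(prev, depot)   # 回到仓库
    return total

def greedy_insert(routes, node, coords, demand, cap):
    """把 node 贪心插入增量成本最小且不超容量的位置;不可行则新开一条路线。"""
    best = None
    for ri, r in enumerate(routes):
        if sum(demand[c] for c in r) + demand[node] > cap:
            continue
        for pos in range(len(r) + 1):
            cand = r[:pos] + [node] + r[pos:]
            delta = route_cost([cand], coords) - route_cost([r], coords)
            if best is None or delta < best[0]:
                best = (delta, ri, pos)
    if best is None:
        routes.append([node])
    else:
        _, ri, pos = best
        routes[ri].insert(pos, node)

def lns(routes, coords, demand, cap, iters=50, k=3, seed=0):
    rng = random.Random(seed)
    best = [r[:] for r in routes]
    best_cost = route_cost(best, coords)
    for _ in range(iters):
        cand = [r[:] for r in best]
        customers = [c for r in cand for c in r]
        removed = rng.sample(customers, min(k, len(customers)))  # 破坏
        cand = [[c for c in r if c not in removed] for r in cand]
        cand = [r for r in cand if r]
        for node in removed:                                     # 修复
            greedy_insert(cand, node, coords, demand, cap)
        cost = route_cost(cand, coords)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost
```

论文的做法是把上面 `rng.sample` 这一步换成由 GNN 根据问题与解的图结构预测应移除的节点。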

[AI-41] Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training

【速读】:该论文旨在解决模拟训练中程序性沟通合规性评估的准确性问题,尤其是在安全关键领域,如海事操作中,如何高效、自动化地判断受训者是否严格遵循标准检查清单(checklist)。其解决方案的关键在于提出了一种轻量级、可部署的基于提示(prompt-based)推理方法——Prompt-and-Check,该方法利用开源大语言模型(LLM)在消费级GPU(如RTX 4070)上本地运行,通过设计上下文丰富的提示(context-rich prompts)来分析转录的口头交流内容,从而逐项判断每个检查清单条目是否达成。实验表明,该方法无需任务特定微调即可实现有效的上下文感知推理,显著提升了训练环境中自动评估、反馈与复盘的可行性与效率。

链接: https://arxiv.org/abs/2508.08652
作者: Vishakha Lall,Yisi Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that can run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, solely based on transcribed verbal exchanges. We perform a case study in the maritime domain with participants performing an identical simulation task, and experiment with models such as LLaMA 2 7B, LLaMA 3 8B and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments.
zh
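
Prompt-and-Check 的逐条检查流程可示意如下:为每个检查清单条目构造包含转录文本的提示词,由 LLM 输出是/否判断。提示词措辞、`ask_llm` 接口与 YES/NO 解析约定均为本文假设,任何本地 LLM 接口都可以接入:

```python
# 示意代码:逐条检查清单合规性的提示词构造与结果解析(接口为假设)。

def build_check_prompt(item: str, transcript: str) -> str:
    return (
        "Below is a transcript of verbal exchanges during a simulation.\n"
        f"Transcript:\n{transcript}\n\n"
        f"Checklist item: {item}\n"
        "Was this item fulfilled? Answer strictly YES or NO."
    )

def check_compliance(checklist, transcript, ask_llm):
    """ask_llm: 接收提示词、返回模型文本输出的可调用对象。"""
    results = {}
    for item in checklist:
        answer = ask_llm(build_check_prompt(item, transcript))
        results[item] = answer.strip().upper().startswith("YES")
    return results
```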

[AI-42] P-CAFE: Personalized Cost-Aware Incremental Feature Selection For Electronic Health Records

【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHR)中复杂且多模态数据的特征选择难题,尤其针对传统方法在处理EHR数据固有的稀疏性、异质性以及临床应用中患者个体差异和特征获取成本时表现不佳的问题。解决方案的关键在于提出一种个性化、在线且成本感知的特征选择框架,该框架以个体患者为单位,在预算约束下增量式地获取最具信息量的特征,同时考虑特征变异性和获取成本,从而在保证诊断置信度的同时优化医疗资源利用效率。

链接: https://arxiv.org/abs/2508.08646
作者: Naama Kashani,Mira Cohen,Uri Shaham
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures

点击查看摘要

Abstract:Electronic Health Records (EHR) have revolutionized healthcare by digitizing patient data, improving accessibility, and streamlining clinical workflows. However, extracting meaningful insights from these complex and multimodal datasets remains a significant challenge for researchers. Traditional feature selection methods often struggle with the inherent sparsity and heterogeneity of EHR data, especially when accounting for patient-specific variations and feature costs in clinical applications. To address these challenges, we propose a novel personalized, online and cost-aware feature selection framework tailored specifically for EHR datasets. The features are acquired in an online fashion for individual patients, incorporating budgetary constraints and feature variability costs. The framework is designed to effectively manage sparse and multimodal data, ensuring robust and scalable performance in diverse healthcare contexts. A primary application of our proposed method is to support physicians’ decision making in patient screening scenarios. By guiding physicians toward incremental acquisition of the most informative features within budget constraints, our approach aims to increase diagnostic confidence while optimizing resource utilization.
zh
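
在预算约束下增量获取特征的一个常见贪心策略是按“单位成本的信息增益”排序购买。下面的草图仅示意这一思路;P-CAFE 的个性化在线准则要复杂得多,此处的打分方式与数据均为本文假设:

```python
# 示意代码:预算约束下的贪心特征获取(按增益/成本比排序,打分为假设)。

def acquire_features(gains, costs, budget):
    """gains/costs: 特征名 -> 数值的字典;返回获取顺序与剩余预算。"""
    remaining = budget
    available = set(gains)
    acquired = []
    while True:
        affordable = [f for f in available if costs[f] <= remaining]
        if not affordable:
            break
        best = max(affordable, key=lambda f: gains[f] / costs[f])
        acquired.append(best)
        remaining -= costs[best]
        available.remove(best)
    return acquired, remaining
```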

[AI-43] Diminution: On Reducing the Size of Grounding ASP Programs

【速读】:该论文旨在解决答案集编程(Answer Set Programming, ASP)中的“接地瓶颈”(grounding bottleneck)问题,即大规模Herbrand域导致生成的Ground程序过于庞大,使得求解困难。解决方案的关键在于提出并形式化了“缩减”(diminution)概念,定义为从Herbrand域中选取的一个子集,用于生成规模更小的Ground程序以供求解。通过设计特定编码使现有ASP求解器可评估候选缩减子集,并结合领域谓词无缝集成到现有接地器中,实验表明该方法在五个基准测试上平均减少70%的接地时间与高达85%的接地文件大小,证明其在缓解ASP接地瓶颈方面具有鲁棒性和通用性。

链接: https://arxiv.org/abs/2508.08633
作者: HuanYu Yang,Fengming Zhu,YangFan Wu,Jianmin Ji
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Answer Set Programming (ASP) is often hindered by the grounding bottleneck: large Herbrand universes generate ground programs so large that solving becomes difficult. Many methods employ ad-hoc heuristics to improve grounding performance, motivating the need for a more formal and generalizable strategy. We introduce the notion of diminution, defined as a selected subset of the Herbrand universe used to generate a reduced ground program before solving. We give a formal definition of diminution, analyze its key properties, and study the complexity of identifying it. We use a specific encoding that enables off-the-shelf ASP solvers to evaluate candidate subsets. Our approach integrates seamlessly with existing grounders via domain predicates. In extensive experiments on five benchmarks, applying diminutions selected by our strategy yields significant performance improvements, reducing grounding time by up to 70% on average and decreasing the size of grounding files by up to 85%. These results demonstrate that leveraging diminutions constitutes a robust and general-purpose approach for alleviating the grounding bottleneck in ASP.
zh
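
“缩减”(diminution)之所以能显著压缩接地程序,是因为一条含 k 个变量的规则的接地实例数随 |域|^k 增长。下面的小例子用枚举直接演示这一乘性效应(论文的贡献在于如何选择子集,此处不做建模;数值均为假设):

```python
from itertools import product

# 示意代码:域越小,规则的接地实例数按 |域|^(变量个数) 成倍缩减。

def ground_instances(domain, num_vars):
    """枚举含 num_vars 个变量的规则的所有接地替换。"""
    return list(product(domain, repeat=num_vars))

full = ground_instances(range(10), 3)    # 完整 Herbrand 域:10^3 = 1000 个实例
small = ground_instances(range(4), 3)    # 缩减后的子域:4^3 = 64 个实例
reduction = 1 - len(small) / len(full)   # 实例数削减比例
```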

[AI-44] AgriGPT : a Large Language Model Ecosystem for Agriculture

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在农业领域应用受限的问题,主要瓶颈包括缺乏领域专用模型、高质量标注数据集以及可靠的评估框架。其解决方案的关键在于构建一个名为AgriGPT的农业专用大语言模型生态系统,核心创新包括:1)设计了一个多智能体可扩展的数据引擎,系统性地整合可信数据源形成高质量、标准化的问答数据集Agri-342K;2)引入Tri-RAG三通道检索增强生成框架,融合密集检索、稀疏检索与多跳知识图谱推理,提升模型事实准确性与推理可靠性;3)提出AgriBench-13K基准测试套件,涵盖13项不同类型和复杂度的任务,实现对农业场景下LLM性能的全面评估。该工作不仅显著提升了农业场景下的模型表现,还提供了一个模块化、可扩展的通用框架,推动科学与行业专用大语言模型的发展。

链接: https://arxiv.org/abs/2508.08632
作者: Bo Yang,Yu Zhang,Lanfei Feng,Yunkui Chen,Jianyu Zhang,Xiao Xu,Nueraili Aierken,Yurui Li,Yuxuan Chen,Guijun Yang,Yong He,Runhe Huang,Shijian Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the rapid progress of Large Language Models (LLMs), their application in agriculture remains limited due to the lack of domain-specific models, curated datasets, and robust evaluation frameworks. To address these challenges, we propose AgriGPT, a domain-specialized LLM ecosystem for agricultural usage. At its core, we design a multi-agent scalable data engine that systematically compiles credible data sources into Agri-342K, a high-quality, standardized question-answer (QA) dataset. Trained on this dataset, AgriGPT supports a broad range of agricultural stakeholders, from practitioners to policy-makers. To enhance factual grounding, we employ Tri-RAG, a three-channel Retrieval-Augmented Generation framework combining dense retrieval, sparse retrieval, and multi-hop knowledge graph reasoning, thereby improving the LLM’s reasoning reliability. For comprehensive evaluation, we introduce AgriBench-13K, a benchmark suite comprising 13 tasks with varying types and complexities. Experiments demonstrate that AgriGPT significantly outperforms general-purpose LLMs on both domain adaptation and reasoning. Beyond the model itself, AgriGPT represents a modular and extensible LLM ecosystem for agriculture, comprising structured data construction, retrieval-enhanced generation, and domain-specific evaluation. This work provides a generalizable framework for developing scientific and industry-specialized LLMs. All models, datasets, and code will be released to empower agricultural communities, especially in underserved regions, and to promote open, impactful research.
zh
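
摘要未说明 Tri-RAG 三通道(稠密检索、稀疏检索、知识图谱推理)的融合方式;合并多路排序结果的一种常见简单做法是倒数排名融合(Reciprocal Rank Fusion, RRF),草图如下(仅为示意,非论文方法):

```python
# 示意代码:用 RRF 融合多个检索通道的排序结果(k=60 为常用默认值)。

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: 多个按相关度降序排列的文档 id 列表;返回融合后的排序。"""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```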

[AI-45] Securing Educational LLM s: A Generalised Taxonomy of Attacks on LLM s and DREAD Risk Assessment

【速读】:该论文旨在解决教育领域中大型语言模型(Large Language Models, LLMs)集成所引发的网络安全威胁缺乏系统性分类与风险评估的问题。其解决方案的关键在于提出一个涵盖五十种攻击类型的通用分类法,将攻击细分为针对模型本身或其基础设施两类,并基于DREAD风险评估框架对这些攻击在教育场景中的严重性进行量化分析,从而识别出token smuggling、对抗性提示(adversarial prompts)、直接注入(direct injection)和多步骤越狱(multi-step jailbreak)等关键威胁,为学术界和工业界构建具备韧性的防护机制提供理论依据与实践指导。

链接: https://arxiv.org/abs/2508.08629
作者: Farzana Zahid,Anjalika Sewwandi,Lee Brandon,Vimal Kumar,Roopak Sinha
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to perceptions of efficiency and significant productivity gains, various organisations, including in education, are adopting Large Language Models (LLMs) into their workflows. Educator-facing, learner-facing, and institution-facing LLMs, collectively, Educational Large Language Models (eLLMs), complement and enhance the effectiveness of teaching, learning, and academic operations. However, their integration into an educational setting raises significant cybersecurity concerns. A comprehensive landscape of contemporary attacks on LLMs and their impact on the educational environment is missing. This study presents a generalised taxonomy of fifty attacks on LLMs, which are categorized as attacks targeting either models or their infrastructure. The severity of these attacks is evaluated in the educational sector using the DREAD risk assessment framework. Our risk assessment indicates that token smuggling, adversarial prompts, direct injection, and multi-step jailbreak are critical attacks on eLLMs. The proposed taxonomy, its application in the educational environment, and our risk assessment will help academic and industrial practitioners to build resilient solutions that protect learners and institutions.
zh
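
DREAD 框架的标准做法是对五个维度(Damage、Reproducibility、Exploitability、Affected users、Discoverability)打分后取平均。下面的计算器只演示该打分方式,示例分值与“critical”阈值均为本文假设,并非论文的评估结果:

```python
# 示意代码:DREAD 风险评分 = 五个维度评分的算术平均(阈值为假设)。

def dread_score(damage, reproducibility, exploitability, affected, discoverability):
    return (damage + reproducibility + exploitability + affected + discoverability) / 5

def severity(score, critical_at=8.0):
    return "critical" if score >= critical_at else "non-critical"
```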

[AI-46] QoE-Aware Service Provision for Mobile AR Rendering: An Agent -Driven Approach

【速读】:该论文旨在解决6G时代移动增强现实(Mobile Augmented Reality, MAR)应用中通信开销高且用户体验质量(Quality of Experience, QoE)难以保障的问题。其关键解决方案在于提出一种由数字代理(digital agent)驱动的边缘辅助MAR通信服务提供机制:首先,利用大语言模型(Large Language Models, LLMs)构建代表MAR服务提供商的数字代理,弥合MAR服务域与网络控制域之间的数据与功能鸿沟;其次,设计用户级QoE建模方法,量化通信资源需求与用户感知QoE之间的关系,从而实现个性化的、基于代理的通信资源调度。该方案显著提升了QoE建模精度和通信资源利用效率。

链接: https://arxiv.org/abs/2508.08627
作者: Conghao Zhou,Lulu Sun,Xiucheng Wang,Peng Yang,Feng Lyu,Sihan Lu,Xuemin Shen
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mobile augmented reality (MAR) is envisioned as a key immersive application in 6G, enabling virtual content rendering aligned with the physical environment through device pose estimation. In this paper, we propose a novel agent-driven communication service provisioning approach for edge-assisted MAR, aiming to reduce communication overhead between MAR devices and the edge server while ensuring the quality of experience (QoE). First, to address the inaccessibility of MAR application-specific information to the network controller, we establish a digital agent powered by large language models (LLMs) on behalf of the MAR service provider, bridging the data and function gap between the MAR service and network domains. Second, to cope with the user-dependent and dynamic nature of data traffic patterns for individual devices, we develop a user-level QoE modeling method that captures the relationship between communication resource demands and perceived user QoE, enabling personalized, agent-driven communication resource management. Trace-driven simulation results demonstrate that the proposed approach outperforms conventional LLM-based QoE-aware service provisioning methods in both user-level QoE modeling accuracy and communication resource efficiency.
zh

[AI-47] UGM2N: An Unsupervised and Generalizable Mesh Movement Network via M-Uniform Loss

【速读】:该论文旨在解决偏微分方程(Partial Differential Equations, PDEs)数值求解中因网格移动技术受限于高计算复杂度和几何灵活性不足所导致的精度-效率权衡难题。传统方法依赖预适应网格且难以跨不同PDE类型和网格几何实现零样本泛化,而现有监督学习方法也面临类似局限。其解决方案的关键在于提出一种无监督且可泛化的网格移动网络(Unsupervised and Generalizable Mesh Movement Network, UGM2N),通过局部几何特征学习实现无需预适应网格的无监督网格自适应,并设计物理约束损失函数——M-Uniform损失,强制节点处网格均匀分布,从而在保持几何独立性的同时,实现对各类PDE和多尺度网格结构的一致优越性能与误差可控降低。

链接: https://arxiv.org/abs/2508.08615
作者: Zhichao Wang,Xinhai Chen,Qinglin Wang,Xiang Gao,Qingyang Zhang,Menghan Jia,Xiang Zhang,Jie Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
备注:

点击查看摘要

Abstract:Partial differential equations (PDEs) form the mathematical foundation for modeling physical systems in science and engineering, where numerical solutions demand rigorous accuracy-efficiency tradeoffs. Mesh movement techniques address this challenge by dynamically relocating mesh nodes to rapidly-varying regions, enhancing both simulation accuracy and computational efficiency. However, traditional approaches suffer from high computational complexity and geometric inflexibility, limiting their applicability, and existing supervised learning-based approaches face challenges in zero-shot generalization across diverse PDEs and mesh geometries. In this paper, we present an Unsupervised and Generalizable Mesh Movement Network (UGM2N). We first introduce unsupervised mesh adaptation through localized geometric feature learning, eliminating the dependency on pre-adapted meshes. We then develop a physics-constrained loss function, M-Uniform loss, that enforces mesh equidistribution at the nodal level. Experimental results demonstrate that the proposed network exhibits equation-agnostic generalization and geometric independence in efficient mesh adaptation. It demonstrates consistent superiority over existing methods, including robust performance across diverse PDEs and mesh geometries, scalability to multi-scale resolutions and guaranteed error reduction without mesh tangling.
zh
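
等分布(equidistribution)的直观含义是:在监控函数 m(x) 的度量下,每个网格单元携带的“质量”相等。下面用一维情形示意一种惩罚各单元质量方差的损失(仅为类比;论文的 M-Uniform 损失是基于节点与图结构的形式化,细节不同):

```python
# 示意代码:一维网格的等分布惩罚——各单元监控质量的方差(类比,非论文公式)。

def equidistribution_loss(nodes, monitor):
    """nodes: 升序节点坐标;monitor: 监控函数 m(x)。"""
    masses = []
    for a, b in zip(nodes, nodes[1:]):
        mid = 0.5 * (a + b)
        masses.append(monitor(mid) * (b - a))   # 单元质量 ≈ m(中点) * 单元宽度
    mean = sum(masses) / len(masses)
    return sum((m - mean) ** 2 for m in masses) / len(masses)
```

当监控函数为常数且网格均匀时损失为 0;网格越偏离等分布,损失越大。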

[AI-48] Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection

【速读】:该论文旨在解决基于IEC61850标准的数字变电站中,因集成通用面向对象变电站事件(GOOSE)等先进通信协议而引入的网络安全威胁问题,尤其是传统异常检测系统(ADS)在应对零日攻击和数据稀缺场景下检测能力不足的局限性。其解决方案的关键在于提出一种基于生成式AI(GenAI)的新型ADS架构:首先通过先进的对抗性流量变异(AATM)技术生成符合协议规范且平衡的GOOSE消息合成数据集,以模拟真实世界中的零日攻击模式并缓解数据稀缺问题;其次引入面向任务的对话(ToD)机制优化GenAI模型的检测流程,提升对复杂攻击模式的识别精度;最终通过与机器学习(ML)方法对比验证,证明GenAI框架在多种性能指标上具有显著优势。

链接: https://arxiv.org/abs/2508.08593
作者: Aydin Zaboli,Junho Hong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 28 pages, 12 figures

点击查看摘要

Abstract:In digital substations, security events pose significant challenges to the sustained operation of power systems. To mitigate these challenges, the implementation of robust defense strategies is critically important. A thorough process of anomaly identification and detection in information and communication technology (ICT) frameworks is crucial to ensure secure and reliable communication and coordination between interconnected devices within digital substations. Hence, this paper addresses the critical cybersecurity challenges confronting IEC61850-based digital substations within modern smart grids, where the integration of advanced communication protocols, e.g., generic object-oriented substation event (GOOSE), has enhanced energy management and introduced significant vulnerabilities to cyberattacks. Focusing on the limitations of traditional anomaly detection systems (ADSs) in detecting threats, this research proposes a transformative approach by leveraging generative AI (GenAI) to develop robust ADSs. The primary contributions include the suggested advanced adversarial traffic mutation (AATM) technique to generate synthesized and balanced datasets for GOOSE messages, ensuring protocol compliance and enabling realistic zero-day attack pattern creation to address data scarcity. Then, the implementation of GenAI-based ADSs incorporating the task-oriented dialogue (ToD) processes has been explored for improved detection of attack patterns. Finally, a comparison of the GenAI-based ADS with machine learning (ML)-based ADSs has been implemented to showcase the outperformance of the GenAI-based frameworks considering the AATM-generated GOOSE datasets and standard/advanced performance evaluation metrics.
zh

[AI-49] AI Security Map: Holistic Organization of AI Security Technologies and Impacts on Stakeholders

【速读】:该论文旨在解决现有AI安全研究中知识碎片化的问题,即当前研究多局限于特定领域或AI组件的技术、攻击、防御与风险的孤立分析,难以系统理解各要素间的关联及其对信息系统和利益相关者造成的负面影响。解决方案的关键在于构建一个AI安全地图(AI Security Map),该地图从信息系统的角度(Information System Aspect, ISA)和外部影响的角度(External Influence Aspect, EIA)两个维度,整体化地组织AI安全相关的元素及其相互关系,并明确每类元素可能引发的负面后果。通过该地图,可清晰识别潜在威胁的根源、传播路径及缓解措施,从而增强对AI系统安全风险的全局认知与治理能力。

链接: https://arxiv.org/abs/2508.08583
作者: Hiroya Kato,Kentaro Kita,Kento Hasegawa,Seira Hidano
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the social implementation of AI has been steadily progressing, research and development related to AI security has also been increasing. However, existing studies have been limited to organizing related techniques, attacks, defenses, and risks in terms of specific domains or AI elements. Thus, it is extremely difficult to understand the relationships among them and how negative impacts on stakeholders are brought about. In this paper, we argue that the knowledge, technologies, and social impacts related to AI security should be holistically organized to help understand relationships among them. To this end, we first develop an AI security map that holistically organizes interrelationships among elements related to AI security as well as negative impacts on information systems and stakeholders. This map consists of the two aspects, namely the information system aspect (ISA) and the external influence aspect (EIA). The elements that AI should fulfill within information systems are classified under the ISA. The EIA includes elements that affect stakeholders as a result of AI being attacked or misused. For each element, corresponding negative impacts are identified. By referring to the AI security map, one can understand the potential negative impacts, along with their causes and countermeasures. Additionally, our map helps clarify how the negative impacts on AI-based systems relate to those on stakeholders. We show some findings newly obtained by referring to our map. We also provide several recommendations and open problems to guide future AI security communities.
zh

[AI-50] Who pays the RENT? Implications of Spatial Inequality for Prediction-Based Allocation Policies AAAI

【速读】:该论文试图解决的问题是:在资源稀缺的社会治理场景中(如防止租户被驱逐),基于人工智能(AI)的个体层面精准干预策略的有效性为何在不同研究中存在矛盾结果——即某些研究发现高不平等环境下个体靶向策略无益,而另一些研究则显示其具有潜在优势。解决方案的关键在于构建一个基于Mallows模型的简化理论框架,引入RENT(Relative Efficiency of Non-Targeting,非靶向相对效率)指标,用以量化评估个体靶向策略相较于基于社区的策略在不同风险空间分布模式下的有效性;并通过美国某中等城市法院 eviction 数据校准模型参数,实证表明即使在高度种族隔离、风险高度集聚的区域,个体靶向政策仍能显著提升高风险家庭的覆盖数量。这一方法澄清了先前文献中的分歧,指出其根源在于部署成本来源和实际风险分布与模型假设之间的差异。

链接: https://arxiv.org/abs/2508.08573
作者: Tasfia Mashiat,Patrick J. Fowler,Sanmay Das
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: This work has been accepted for publication as a full paper at the AAAI/ACM Conference on AI, Ethics, and Society (AIES 2025)

点击查看摘要

Abstract:AI-powered scarce resource allocation policies rely on predictions to target either specific individuals (e.g., high-risk) or settings (e.g., neighborhoods). Recent research on individual-level targeting demonstrates conflicting results; some models show that targeting is not useful when inequality is high, while other work demonstrates potential benefits. To study and reconcile this apparent discrepancy, we develop a stylized framework based on the Mallows model to understand how the spatial distribution of inequality affects the effectiveness of door-to-door outreach policies. We introduce the RENT (Relative Efficiency of Non-Targeting) metric, which we use to assess the effectiveness of targeting approaches compared with neighborhood-based approaches in preventing tenant eviction when high-risk households are more versus less spatially concentrated. We then calibrate the model parameters to eviction court records collected in a medium-sized city in the USA. Results demonstrate considerable gains in the number of high-risk households canvassed through individually targeted policies, even in a highly segregated metro area with concentrated risks of eviction. We conclude that apparent discrepancies in the prior literature can be reconciled by considering 1) the source of deployment costs and 2) the observed versus modeled concentrations of risk. Our results inform the deployment of AI-based solutions in social service provision that account for particular applications and geographies.
zh

[AI-51] UQGNN: Uncertainty Quantification of Graph Neural Networks for Multivariate Spatiotemporal Prediction

【速读】:该论文旨在解决现有时空预测模型在多变量场景下难以同时实现高精度预测与不确定性量化的问题。当前大多数模型为确定性模型,仅输出期望均值而忽略不确定性,导致结果可靠性不足;且已有概率模型通常仅针对单一城市现象(如交通流量或犯罪事件)建模,忽视了异构城市现象之间的内在关联。为此,作者提出了一种带不确定性量化能力的图神经网络(UQGNN),其关键创新在于:(i) 提出一种交互感知的时空嵌入模块,融合多变量扩散图卷积网络与交互感知时间卷积网络,以捕捉复杂的空间和时间交互模式;(ii) 设计一个多变量概率预测模块,能够同时估计各变量的期望均值及其对应的不确定性,从而提升预测鲁棒性与可信度。

链接: https://arxiv.org/abs/2508.08551
作者: Dahai Yu,Dingyi Zhuang,Lin Jiang,Rongchao Xu,Xinyue Ye,Yuheng Bu,Shenhao Wang,Guang Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures, SIGSPATIAL 2025

点击查看摘要

Abstract:Spatiotemporal prediction plays a critical role in numerous real-world applications such as urban planning, transportation optimization, disaster response, and pandemic control. In recent years, researchers have made significant progress by developing advanced deep learning models for spatiotemporal prediction. However, most existing models are deterministic, i.e., predicting only the expected mean values without quantifying uncertainty, leading to potentially unreliable and inaccurate outcomes. While recent studies have introduced probabilistic models to quantify uncertainty, they typically focus on a single phenomenon (e.g., taxi, bike, crime, or traffic crashes), thereby neglecting the inherent correlations among heterogeneous urban phenomena. To address the research gap, we propose a novel Graph Neural Network with Uncertainty Quantification, termed UQGNN for multivariate spatiotemporal prediction. UQGNN introduces two key innovations: (i) an Interaction-aware Spatiotemporal Embedding Module that integrates a multivariate diffusion graph convolutional network and an interaction-aware temporal convolutional network to effectively capture complex spatial and temporal interaction patterns, and (ii) a multivariate probabilistic prediction module designed to estimate both expected mean values and associated uncertainties. Extensive experiments on four real-world multivariate spatiotemporal datasets from Shenzhen, New York City, and Chicago demonstrate that UQGNN consistently outperforms state-of-the-art baselines in both prediction accuracy and uncertainty quantification. For example, on the Shenzhen dataset, UQGNN achieves a 5% improvement in both prediction accuracy and uncertainty quantification.
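摘要中提到的“同时估计期望均值与不确定性”,通常通过让预测头输出均值与对数方差、并以高斯负对数似然作为训练损失来实现。下面给出一个基于 NumPy 的一元简化示意(并非论文多变量模块的实现):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Negative log-likelihood of y under N(mu, exp(log_var)). Training a
    prediction head with this loss yields both an expected mean and an
    uncertainty estimate -- a simplified, univariate stand-in for UQGNN's
    multivariate probabilistic prediction module."""
    var = np.exp(log_var)
    return float(np.mean(0.5 * (np.log(2 * np.pi) + log_var + (y - mu) ** 2 / var)))

# An overconfident wrong prediction (tiny variance) is penalised more than
# a wrong prediction that honestly reports high uncertainty:
overconfident = gaussian_nll(y=1.0, mu=0.0, log_var=-2.0)
calibrated = gaussian_nll(y=1.0, mu=0.0, log_var=0.0)
print(overconfident > calibrated)  # True
```

这种损失使模型在“错得很自信”时受到更大惩罚,从而学会输出合理的方差。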
zh

[AI-52] OmniLLP: Enhancing LLM-based Log Level Prediction with Context-Aware Retrieval

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的日志级别预测(Log Level Prediction, LLP)方法在实际应用中准确性不足的问题,其核心挑战在于现有方法依赖随机选取的上下文示例进行推理,忽略了现代软件项目中代码语义结构和开发者归属一致性等关键工程特征。解决方案的关键在于提出 OmniLLP 框架,通过结合两个维度的聚类策略——代码语义相似性(反映功能目的)与开发者所有权凝聚力(developer ownership cohesion),从结构化、语义一致且开发者归属明确的源文件簇中检索上下文示例,从而构建更连贯、更具针对性的提示(prompt),显著提升 LLM-LLP 的预测性能,实验表明该方法可使 AUC 提升最高达 8%,并在多个项目中达到 0.88 至 0.96 的优异表现。

链接: https://arxiv.org/abs/2508.08545
作者: Youssef Esseddiq Ouatiti,Mohammed Sayagh,Bram Adams,Ahmed E. Hassan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Developers insert logging statements in source code to capture relevant runtime information essential for maintenance and debugging activities. Log level choice is an integral, yet tricky part of the logging activity as it controls log verbosity and therefore influences systems’ observability and performance. Recent advances in ML-based log level prediction have leveraged large language models (LLMs) to propose log level predictors (LLPs) that demonstrated promising performance improvements (AUC between 0.64 and 0.8). Nevertheless, current LLM-based LLPs rely on randomly selected in-context examples, overlooking the structure and the diverse logging practices within modern software projects. In this paper, we propose OmniLLP, a novel LLP enhancement framework that clusters source files based on (1) semantic similarity reflecting the code’s functional purpose, and (2) developer ownership cohesion. By retrieving in-context learning examples exclusively from these semantic and ownership aware clusters, we aim to provide more coherent prompts to LLPs leveraging LLMs, thereby improving their predictive accuracy. Our results show that both semantic and ownership-aware clusterings statistically significantly improve the accuracy (by up to 8% AUC) of the evaluated LLM-based LLPs compared to random predictors (i.e., leveraging randomly selected in-context examples from the whole project). Additionally, our approach that combines the semantic and ownership signal for in-context prediction achieves an impressive 0.88 to 0.96 AUC across our evaluated projects. Our findings highlight the value of integrating software engineering-specific context, such as code semantic and developer ownership signals into LLM-LLPs, offering developers a more accurate, contextually-aware approach to logging and therefore, enhancing system maintainability and observability.
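论文的核心检索思路可以用一个玩具示例来说明:仅从与目标文件同属一个语义簇和开发者归属簇的文件中选取上下文示例。以下代码中的整数簇编号与字段名均为示意性假设:

```python
def retrieve_examples(target, pool, k=3):
    """Draw in-context examples only from source files that share the
    target's semantic cluster and ownership cluster -- the paper's retrieval
    idea, with toy integer cluster ids standing in for the real clustering."""
    same = [e for e in pool
            if e["sem_cluster"] == target["sem_cluster"]
            and e["owner_cluster"] == target["owner_cluster"]]
    return same[:k]

pool = [{"file": "auth.py",  "sem_cluster": 1, "owner_cluster": 0, "level": "warn"},
        {"file": "db.py",    "sem_cluster": 2, "owner_cluster": 0, "level": "error"},
        {"file": "login.py", "sem_cluster": 1, "owner_cluster": 0, "level": "info"}]
target = {"file": "token.py", "sem_cluster": 1, "owner_cluster": 0}
print([e["file"] for e in retrieve_examples(target, pool)])  # ['auth.py', 'login.py']
```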
zh

[AI-53] AI Agents and the Law AAAI

【速读】:该论文旨在解决当前生成式 AI(Generative AI)向“代理型”(agentic)演进过程中所面临的技术与法律认知错位问题,特别是当AI系统能够直接代表用户执行任务时,其技术设计未能充分映射法律中关于代理权(agency)的核心要素,如忠诚义务(loyalty)、披露义务(disclosure)及对第三方的责任边界。解决方案的关键在于:将法律中成熟的代理法概念(如价值对齐、忠诚义务和披露要求)系统性地整合进AI代理的设计框架中,从而弥补计算机科学在代理建模中对社会-法律维度的理论缺失,确保AI代理不仅高效执行任务,还能在复杂交互场景下实现合规、可预测且负责任的行为表现。

链接: https://arxiv.org/abs/2508.08544
作者: Mark O. Riedl,Deven R. Desai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 2025 AAAI Conference on AI, Ethics, and Society

点击查看摘要

Abstract:As AI becomes more “agentic,” it faces technical and socio-legal issues it must address if it is to fulfill its promise of increased economic productivity and efficiency. This paper uses technical and legal perspectives to explain how things change when AI systems start being able to directly execute tasks on behalf of a user. We show how technical conceptions of agents track some, but not all, socio-legal conceptions of agency. That is, both computer science and the law recognize the problems of under-specification for an agent, and both disciplines have robust conceptions of how to address ensuring an agent does what the programmer, or in the law, the principal desires and no more. However, to date, computer science has under-theorized issues related to questions of loyalty and to third parties that interact with an agent, both of which are central parts of the law of agency. First, we examine the correlations between implied authority in agency law and the principle of value-alignment in AI, wherein AI systems must operate under imperfect objective specification. Second, we reveal gaps in the current computer science view of agents pertaining to the legal concepts of disclosure and loyalty, and how failure to account for them can result in unintended effects in AI ecommerce agents. In surfacing these gaps, we show a path forward for responsible AI agent development and deployment.
zh

[AI-54] M3-Net: A Cost-Effective Graph-Free MLP-Based Model for Traffic Prediction

【速读】:该论文旨在解决当前交通预测中深度学习方法面临的两大挑战:一是模型依赖完整的交通网络结构(如路网拓扑),二是为捕捉复杂的时空依赖关系而设计过于复杂的模型架构,导致在大规模数据集上的部署和运行效率低下。解决方案的关键在于提出一种无需图结构的轻量级多层感知机(Multilayer Perceptron, MLP)模型——M3-Net,其创新性地引入了基于专家混合(Mixture of Experts, MoE)机制的MLP-Mixer架构,并结合时间序列与时空嵌入进行高效特征处理,从而在保持高预测精度的同时显著降低模型复杂度,实现更高效的部署与应用。

链接: https://arxiv.org/abs/2508.08543
作者: Guangyin Jin,Sicong Lai,Xiaoshuai Hao,Mingtao Zhang,Jinlei Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Achieving accurate traffic prediction is a fundamental but crucial task in the development of current intelligent transportation systems. Most of the mainstream methods that have made breakthroughs in traffic prediction rely on spatio-temporal graph neural networks, spatio-temporal attention mechanisms, etc. The main challenges of the existing deep learning approaches are that they either depend on a complete traffic network structure or require intricate model designs to capture complex spatio-temporal dependencies. These limitations pose significant challenges for the efficient deployment and operation of deep learning models on large-scale datasets. To address these challenges, we propose a cost-effective graph-free Multilayer Perceptron (MLP) based model M3-Net for traffic prediction. Our proposed model not only employs time series and spatio-temporal embeddings for efficient feature processing but also first introduces a novel MLP-Mixer architecture with a mixture of experts (MoE) mechanism. Extensive experiments conducted on multiple real datasets demonstrate the superiority of the proposed model in terms of prediction performance and lightweight deployment.
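摘要中提到的专家混合(MoE)机制,其一般形式是用一个门控网络对若干“专家”的输出做软加权求和。下面给出一个与论文具体架构无关的通用 NumPy 示意(线性专家与随机权重均为假设):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, experts, gate_w):
    """Generic mixture-of-experts layer: a learned gate softly weights the
    outputs of simple linear 'experts'. This sketches the MoE idea only;
    the paper's MLP-Mixer block and its shapes are not reproduced here."""
    gates = softmax(x @ gate_w)                        # (batch, n_experts)
    outs = np.stack([x @ w for w in experts], axis=1)  # (batch, n_experts, d_out)
    return (gates[:, :, None] * outs).sum(axis=1)      # (batch, d_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))
experts = [rng.standard_normal((6, 3)) for _ in range(2)]
gate_w = rng.standard_normal((6, 2))
y = moe_layer(x, experts, gate_w)
print(y.shape)  # (4, 3)
```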
zh

[AI-55] LLM-Driven Adaptive 6G-Ready Wireless Body Area Networks: Survey and Framework

【速读】:该论文旨在解决当前无线体域网(Wireless Body Area Networks, WBANs)在适应性、能效和抗量子安全方面的关键瓶颈问题,尤其是在集成6G通信、后量子密码学与能量采集等新兴技术时面临的系统协同挑战。其解决方案的核心在于提出一种由大语言模型(Large Language Model, LLM)驱动的自适应WBAN框架,将LLM作为认知控制平面,实时协调路由策略、物理层参数选择、微能量采集以及后量子安全机制,从而实现超可靠、安全且自我优化的下一代移动健康系统。

链接: https://arxiv.org/abs/2508.08535
作者: Azin Sabzian,Mohammad Jalili Torkamani,Negin Mahmoudi,Kiana Kiashemshaki
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 7 pages

点击查看摘要

Abstract:Wireless Body Area Networks (WBANs) enable continuous monitoring of physiological signals for applications ranging from chronic disease management to emergency response. Recent advances in 6G communications, post-quantum cryptography, and energy harvesting have the potential to enhance WBAN performance. However, integrating these technologies into a unified, adaptive system remains a challenge. This paper surveys some of the most well-known Wireless Body Area Network (WBAN) architectures, routing strategies, and security mechanisms, identifying key gaps in adaptability, energy efficiency, and quantum-resistant security. We propose a novel Large Language Model-driven adaptive WBAN framework in which a Large Language Model acts as a cognitive control plane, coordinating routing, physical layer selection, micro-energy harvesting, and post-quantum security in real time. Our review highlights the limitations of current heuristic-based designs and outlines a research agenda for resource-constrained, 6G-ready medical systems. This approach aims to enable ultra-reliable, secure, and self-optimizing WBANs for next-generation mobile health applications.
zh

[AI-56] SynLLM : A Comparative Analysis of Large Language Models for Medical Tabular Synthetic Data Generation via Prompt Engineering

【速读】:该论文旨在解决医疗数据隐私保护与科研可用性之间的矛盾问题,即如何在遵守隐私法规的前提下生成高质量、临床合理且无隐私泄露风险的合成医学表格数据。其解决方案的关键在于提出一个名为SynLLM的模块化框架,该框架利用20个先进的开源大语言模型(Large Language Models, LLMs)生成数据,并通过四类结构化提示(prompt)——从示例驱动到规则约束——嵌入数据模式、元数据和领域知识,从而在不进行模型微调的情况下精准控制生成内容;同时构建多维评估体系,系统性地衡量生成数据的统计保真度、临床一致性与隐私保护水平,实验证明规则驱动型提示在隐私与质量之间取得最优平衡,验证了LLMs在医疗合成数据生成中的潜力与可行性。

链接: https://arxiv.org/abs/2508.08529
作者: Arshia Ilaty,Hossein Shirazi,Hajar Homayouni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 Pages, 2 Supplementary Pages, 6 Tables

点击查看摘要

Abstract:Access to real-world medical data is often restricted due to privacy regulations, posing a significant barrier to the advancement of healthcare research. Synthetic data offers a promising alternative; however, generating realistic, clinically valid, and privacy-conscious records remains a major challenge. Recent advancements in Large Language Models (LLMs) offer new opportunities for structured data generation; however, existing approaches frequently lack systematic prompting strategies and comprehensive, multi-dimensional evaluation frameworks. In this paper, we present SynLLM, a modular framework for generating high-quality synthetic medical tabular data using 20 state-of-the-art open-source LLMs, including LLaMA, Mistral, and GPT variants, guided by structured prompts. We propose four distinct prompt types, ranging from example-driven to rule-based constraints, that encode schema, metadata, and domain knowledge to control generation without model fine-tuning. Our framework features a comprehensive evaluation pipeline that rigorously assesses generated data across statistical fidelity, clinical consistency, and privacy preservation. We evaluate SynLLM across three public medical datasets, including Diabetes, Cirrhosis, and Stroke, using 20 open-source LLMs. Our results show that prompt engineering significantly impacts data quality and privacy risk, with rule-based prompts achieving the best privacy-quality balance. SynLLM establishes that, when guided by well-designed prompts and evaluated with robust, multi-metric criteria, LLMs can generate synthetic medical data that is both clinically plausible and privacy-aware, paving the way for safer and more effective data sharing in healthcare research. 
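摘要中“规则驱动型提示”的大致形态可以如下示意;其中的字段名、取值范围与措辞均为假设,并非论文发布的提示模板:

```python
# A hypothetical rule-based prompt in the spirit of SynLLM's rule-constrained
# prompt type; the schema, field ranges, and wording are illustrative only.
RULE_PROMPT = """Generate {n} synthetic patient records as CSV rows with columns:
age,gender,bmi,glucose,stroke
Rules:
- age: integer in [18, 90]
- gender: one of M, F
- bmi: float in [15.0, 50.0]
- glucose: float in [60.0, 300.0]
- stroke: 0 or 1, and must be plausible given age and glucose
Do not reproduce any real patient record. Output only CSV rows."""

print(RULE_PROMPT.format(n=5).splitlines()[0])
```

这类提示把模式(schema)与领域约束直接写入指令,从而在不微调模型的情况下控制生成质量。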
zh

[AI-57] Playing Atari Space Invaders with Sparse Cosine Optimized Policy Evolution AAAI

【速读】:该论文旨在解决进化算法在游戏博弈领域应用时面临的高维状态空间导致的搜索空间爆炸问题,这会显著降低收敛速度并限制学习效率。其核心解决方案是提出Sparse Cosine Optimized Policy Evolution (SCOPE),关键在于利用离散余弦变换(Discrete Cosine Transform, DCT)作为伪注意力机制,将原始高维输入状态映射为系数矩阵,并通过截断与稀疏化处理压缩输入维度,同时保留原始状态中能量最高的特征;在此基础上,采用双线性仿射映射将稀疏DCT系数映射至策略动作,结合CMA-ES优化算法实现高效学习。实验表明,该方法在Atari游戏Space Invaders中将输入维度从33,600降至15,625(减少53%),显著优于未使用特征压缩的进化方法(如OpenAI-ES、HyperNEAT)及传统强化学习方法(如DQN、A3C)。

链接: https://arxiv.org/abs/2508.08526
作者: Jim O’Connor,Jay B. Nash,Derin Gezgin,Gary B. Parker
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: The 21st AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment

点击查看摘要

Abstract:Evolutionary approaches have previously been shown to be effective learning methods for a diverse set of domains. However, the domain of game-playing poses a particular challenge for evolutionary methods due to the inherently large state space of video games. As the size of the input state expands, the size of the policy must also increase in order to effectively learn the temporal patterns in the game space. Consequently, a larger policy must contain more trainable parameters, exponentially increasing the size of the search space. Any increase in search space is highly problematic for evolutionary methods, as increasing the number of trainable parameters is inversely correlated with convergence speed. To reduce the size of the input space while maintaining a meaningful representation of the original space, we introduce Sparse Cosine Optimized Policy Evolution (SCOPE). SCOPE utilizes the Discrete Cosine Transform (DCT) as a pseudo attention mechanism, transforming an input state into a coefficient matrix. By truncating and applying sparsification to this matrix, we reduce the dimensionality of the input space while retaining the highest energy features of the original input. We demonstrate the effectiveness of SCOPE as the policy for the Atari game Space Invaders. In this task, SCOPE with CMA-ES outperforms evolutionary methods that consider an unmodified input state, such as OpenAI-ES and HyperNEAT. SCOPE also outperforms simple reinforcement learning methods, such as DQN and A3C. SCOPE achieves this result through reducing the input size by 53% from 33,600 to 15,625 then using a bilinear affine mapping of sparse DCT coefficients to policy actions learned by the CMA-ES algorithm.
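SCOPE 的输入压缩流程(二维 DCT、截断低频块、稀疏化)可以用 NumPy 粗略示意如下;`keep` 与 `sparsity` 等参数为假设值,并非论文设置:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix, so that D @ D.T == I."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    D = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    D[0, :] *= np.sqrt(1.0 / n)
    D[1:, :] *= np.sqrt(2.0 / n)
    return D

def scope_compress(state, keep=25, sparsity=0.5):
    """Sketch of SCOPE-style input reduction: take the 2-D DCT, keep the
    low-frequency keep x keep block, then zero the smallest-magnitude
    coefficients. Sizes and sparsity level here are illustrative."""
    n, m = state.shape
    coeffs = dct_matrix(n) @ state @ dct_matrix(m).T  # 2-D DCT
    block = coeffs[:keep, :keep]                      # truncation
    cut = int(block.size * sparsity)
    thresh = np.sort(np.abs(block).ravel())[cut]      # sparsification threshold
    return np.where(np.abs(block) >= thresh, block, 0.0)

frame = np.random.default_rng(0).random((84, 84))     # e.g. a downsampled game frame
z = scope_compress(frame)
print(z.shape)  # (25, 25)
```

压缩后的系数矩阵即可作为策略(如 CMA-ES 学到的双线性映射)的低维输入。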
zh

[AI-58] StreetViewAI: Making Street View Accessible Using Context-Aware Multimodal AI

【速读】:该论文试图解决交互式街景地图工具(如Google Street View,GSV)对视障用户不具可访问性的问题。解决方案的关键在于提出并实现StreetViewAI,这是一个首个面向视障用户的可访问街景工具,其核心创新包括:上下文感知的多模态人工智能(multimodal AI)、无障碍导航控制以及对话式语音交互机制。通过这些技术整合,视障用户能够虚拟探索目的地、进行开放世界漫游或浏览全球超过2200亿张图像和100多个国家的街景数据,从而支持兴趣点(POI)调查与远程路线规划。

链接: https://arxiv.org/abs/2508.08524
作者: Jon E. Froehlich,Alexander Fiannaca,Nimer Jaber,Victor Tsara,Shaun Kane
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted to UIST’25

点击查看摘要

Abstract:Interactive streetscape mapping tools such as Google Street View (GSV) and Meta Mapillary enable users to virtually navigate and experience real-world environments via immersive 360° imagery but remain fundamentally inaccessible to blind users. We introduce StreetViewAI, the first-ever accessible street view tool, which combines context-aware, multimodal AI, accessible navigation controls, and conversational speech. With StreetViewAI, blind users can virtually examine destinations, engage in open-world exploration, or virtually tour any of the over 220 billion images and 100+ countries where GSV is deployed. We iteratively designed StreetViewAI with a mixed-visual ability team and performed an evaluation with eleven blind users. Our findings demonstrate the value of an accessible street view in supporting POI investigations and remote route planning. We close by enumerating key guidelines for future work.
zh

[AI-59] Using LLMs to Capture Users' Temporal Context for Recommendation

【速读】:该论文旨在解决传统用户画像在复杂、动态环境中难以捕捉用户偏好中短期兴趣与长期倾向之间差异的问题,从而导致推荐系统对用户上下文理解不足。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)生成语义丰富且具备时间感知能力的用户画像,并通过将短期和长期偏好解耦为不同的动态用户上下文,自适应融合形成综合用户嵌入(user embeddings),以提升推荐系统的上下文敏感性与适应性。

链接: https://arxiv.org/abs/2508.08512
作者: Milad Sabouri,Masoud Mansoury,Kun Lin,Bamshad Mobasher
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Effective recommender systems demand dynamic user understanding, especially in complex, evolving environments. Traditional user profiling often fails to capture the nuanced, temporal contextual factors of user preferences, such as transient short-term interests and enduring long-term tastes. This paper presents an assessment of Large Language Models (LLMs) for generating semantically rich, time-aware user profiles. We do not propose a novel end-to-end recommendation architecture; instead, the core contribution is a systematic investigation into the degree of LLM effectiveness in capturing the dynamics of user context by disentangling short-term and long-term preferences. This approach, framing temporal preferences as dynamic user contexts for recommendations, adaptively fuses these distinct contextual components into comprehensive user embeddings. The evaluation across Movies & TV and Video Games domains suggests that while LLM-generated profiles offer semantic depth and temporal structure, their effectiveness for context-aware recommendations is notably contingent on the richness of user interaction histories. Significant gains are observed in dense domains (e.g., Movies & TV), whereas improvements are less pronounced in sparse environments (e.g., Video Games). This work highlights LLMs’ nuanced potential in enhancing user profiling for adaptive, context-aware recommendations, emphasizing the critical role of dataset characteristics for practical applicability.
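摘要中“自适应融合短期与长期偏好”的最简形式,是对两个嵌入做加权混合后归一化。以下为示意代码,其中 `alpha` 是假设的近期性权重,并非论文中的命名参数:

```python
import numpy as np

def fuse_profiles(short_emb, long_emb, alpha=0.6):
    """Blend a short-term interest embedding with a long-term taste embedding
    into a single unit-length user vector; alpha is a hypothetical recency
    weight, not a parameter named in the paper."""
    v = alpha * np.asarray(short_emb) + (1 - alpha) * np.asarray(long_emb)
    return v / np.linalg.norm(v)

# Short-term interest points at dim 0, long-term taste at dim 1:
user = fuse_profiles([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(np.round(user, 3))
```

实际系统中 `alpha` 可由交互密度或时间衰减自适应决定,这正是“动态用户上下文”的含义。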
zh

[AI-60] When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital

【速读】:该论文旨在解决在资源匮乏社区中开发生成式 AI(Generative AI)应用时,因缺乏本地领域专家参与而导致的“有意义共设计”难题。具体而言,当 LLM 开发者与社会工作者等域专家之间的沟通受限时,如何确保所构建的应用能够准确、全面且可验证地呈现患者的社会需求信息成为核心挑战。论文提出了一种新颖的协同设计框架,其关键在于将摘要生成任务分解为可独立优化的属性,并通过多级递进式方法高效地对每个属性进行精炼与验证,从而在域专家资源有限的情况下实现高质量的应用开发。

链接: https://arxiv.org/abs/2508.08504
作者: Avni Kothari,Patrick Vossler,Jean Digitale,Mohammad Forouzannia,Elise Rosenberg,Michele Lee,Jennee Bryant,Melanie Molina,James Marks,Lucas Zier,Jean Feng
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have the potential to address social and behavioral determinants of health by transforming labor intensive workflows in resource-constrained settings. Creating LLM-based applications that serve the needs of underserved communities requires a deep understanding of their local context, but it is often the case that neither LLMs nor their developers possess this local expertise, and the experts in these communities often face severe time/resource constraints. This creates a disconnect: how can one engage in meaningful co-design of an LLM-based application for an under-resourced community when the communication channel between the LLM developer and domain expert is constrained? We explored this question through a real-world case study, in which our data science team sought to partner with social workers at a safety net hospital to build an LLM application that summarizes patients’ social needs. Whereas prior works focus on the challenge of prompt tuning, we found that the most critical challenge in this setting is the careful and precise specification of what information to surface to providers so that the LLM application is accurate, comprehensive, and verifiable. Here we present a novel co-design framework for settings with limited access to domain experts, in which the summary generation task is first decomposed into individually-optimizable attributes and then each attribute is efficiently refined and validated through a multi-tier cascading approach.
zh

[AI-61] GVGAI-LLM : Evaluating Large Language Model Agents with Infinite Games

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在空间推理和基础规划能力方面的局限性问题,这些问题在现有基准测试中往往被忽视。为实现这一目标,作者提出了GVGAI-LLM,一个基于通用视频游戏AI(General Video Game AI, GVGAI)框架的视频游戏基准测试集,其关键在于通过一套由ASCII字符表示的游戏场景、可快速生成的新游戏与关卡机制,以及可解释的评估指标(如有意义步长比、步长效率和总得分),系统性地评测LLMs在多样化、高挑战性的任务中的推理与问题解决能力。此设计不仅有助于防止模型过拟合,还揭示了当前LLMs在空间逻辑错误上的持续缺陷,从而推动结构化提示(structured prompting)和空间定位(spatial grounding)等技术的发展,为提升语言模型的代理行为(agentic behavior)和情境推理能力提供可复现的研究平台。

链接: https://arxiv.org/abs/2508.08501
作者: Yuchen Li,Cong Lin,Muhammad Umair Nasir,Philip Bontrager,Jialin Liu,Julian Togelius
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model’s ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a game description language that enables rapid creation of new games and levels, helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including the meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across a broad set of games and levels with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. While these interventions lead to partial improvements, the benchmark remains very far from solved. GVGAI-LLM provides a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and contextual reasoning.
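基准中的“有意义步长比”(meaningful step ratio)可以按“实际改变了游戏状态的步数占比”来直观理解;下面的实现只是对该指标的一种合理猜测,基准的精确定义可能不同:

```python
def meaningful_step_ratio(state_changed):
    """Fraction of an agent's steps that actually altered the game state --
    one plausible reading of the benchmark's 'meaningful step ratio'
    (the benchmark's exact definition may differ)."""
    return sum(state_changed) / len(state_changed) if state_changed else 0.0

# An agent that keeps walking into a wall scores low:
print(meaningful_step_ratio([True, False, False, True]))  # 0.5
```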
zh

[AI-62] Large Language Models as Oracles for Ontology Alignment

【速读】:该论文旨在解决ontology对齐(ontology alignment)过程中,如何在保证高精度映射的前提下降低对领域专家人力依赖的问题。当前多数对齐系统虽能生成大量候选对应关系,但在处理复杂或模糊匹配时仍存在不确定性,而传统依赖人工验证的方式成本高昂。论文提出的关键解决方案是:利用大语言模型(Large Language Models, LLMs)仅对对齐系统置信度较低的子集进行验证,从而实现“人机协同”的高效验证机制。通过在Ontology Alignment Evaluation Initiative (OAEI) 多项任务上的实验表明,LLM 在特定提示模板下可有效替代专家判断,且其性能与不同错误率的模拟Oracle相当,验证了该策略在提升对齐质量同时显著减少人工干预的可行性。

链接: https://arxiv.org/abs/2508.08500
作者: Sviatoslav Lushnei,Dmytro Shumskyi,Severyn Shykula,Ernesto Jimenez-Ruiz,Artur d’Avila Garcez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to a conference. 17 pages

点击查看摘要

Abstract:Ontology alignment plays a crucial role in integrating diverse data sources across domains. There is a large plethora of systems that tackle the ontology alignment problem, yet challenges persist in producing highly quality correspondences among a set of input ontologies. Human-in-the-loop during the alignment process is essential in applications requiring very accurate mappings. User involvement is, however, expensive when dealing with large ontologies. In this paper, we explore the feasibility of using Large Language Models (LLM) as an alternative to the domain expert. The use of the LLM focuses only on the validation of the subset of correspondences where an ontology alignment system is very uncertain. We have conducted an extensive evaluation over several matching tasks of the Ontology Alignment Evaluation Initiative (OAEI), analysing the performance of several state-of-the-art LLMs using different ontology-driven prompt templates. The LLM results are also compared against simulated Oracles with variable error rates.
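论文的做法是只把对齐系统“非常不确定”的那部分候选映射交给 LLM 验证。下面用一个简单的置信度三段划分来示意这一分流逻辑,阈值为假设值:

```python
def triage(correspondences, low=0.4, high=0.8):
    """Route only the uncertain band of an alignment system's candidate
    mappings to the LLM oracle; high-confidence mappings are auto-accepted
    and low-confidence ones auto-rejected. Thresholds are hypothetical."""
    accept = [c for c in correspondences if c["conf"] >= high]
    reject = [c for c in correspondences if c["conf"] < low]
    ask_llm = [c for c in correspondences if low <= c["conf"] < high]
    return accept, ask_llm, reject

cands = [{"pair": ("Person", "Human"),  "conf": 0.95},
         {"pair": ("Paper", "Article"), "conf": 0.55},
         {"pair": ("Venue", "Award"),   "conf": 0.10}]
accept, ask_llm, reject = triage(cands)
print(len(accept), len(ask_llm), len(reject))  # 1 1 1
```

只有中间那条不确定的映射会被构造成提示发给 LLM,从而把昂贵的“专家调用”压缩到最小子集。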
zh

[AI-63] POMO: Leveraging starting nodes in POMO for solving Capacitated Vehicle Routing Problem

【速读】:该论文旨在解决基于强化学习(Reinforcement Learning, RL)的组合优化问题求解效率与精度不足的问题,尤其针对车辆路径问题(Vehicle Routing Problem, VRP)及其变体。现有方法如POMO虽表现良好,但在初始节点信息利用上存在局限,导致收敛速度慢且解的质量有待提升。解决方案的关键在于提出改进模型POMO+,通过有效利用初始节点信息来引导搜索过程,从而实现更高效、更精准的路径规划。实验结果表明,POMO+在CVRPLIB数据集上的多个实例中实现了更快的收敛速度和更优的解质量,特别是在含100个客户点的问题中表现显著提升。

链接: https://arxiv.org/abs/2508.08493
作者: Szymon Jakubicz,Karol Kuźniak,Jan Wawszczak,Paweł Gora
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, reinforcement learning (RL) methods have emerged as a promising approach for solving combinatorial problems. Among RL-based models, POMO has demonstrated strong performance on a variety of tasks, including variants of the Vehicle Routing Problem (VRP). However, there is room for improvement for these tasks. In this work, we improved POMO, creating a method (POMO+) that leverages the initial nodes to find a solution in a more informed way. We ran experiments on our new model and observed that our solution converges faster and achieves better results. We validated our models on the CVRPLIB dataset and noticed improvements in problem instances with up to 100 customers. We hope that our research in this project can lead to further advancements in the field.
zh

[AI-64] Beyond Ordinal Preferences: Why Alignment Needs Cardinal Human Feedback

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)对齐方法中依赖序数偏好数据(ordinal preference data)所带来的根本性局限问题。作者证明了一个不可能性结果:仅依靠序数比较(即二元选择)无法系统性地恢复出最优模型,因为此类数据缺乏分辨不同改进方向之间权衡的能力(如修复一个提示中的事实错误 vs. 提升另一个提示的风格)。解决方案的关键在于转向收集基数反馈(cardinal feedback),即通过意愿支付(willingness-to-pay)机制获取关于响应质量的量化判断,并据此构建模型偏好关系(preferences over models),而非仅关注响应层面的排序。研究团队因此构建并公开发布了一个包含25,000条基数判断的数据集,实验证明将此类信息融入偏好微调可使模型更精准识别高影响力改进,在Arena-Hard等下游任务上显著优于纯序数方法。

链接: https://arxiv.org/abs/2508.08486
作者: Parker Whitfill,Stewy Slocum
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Alignment techniques for LLMs rely on optimizing preference-based objectives – where these preferences are typically elicited as ordinal, binary choices between responses. Recent work has focused on improving label quality or mitigating particular biases, but we identify a more fundamental limitation: these methods collect the wrong kind of data. We prove an impossibility result: no algorithm relying solely on ordinal comparisons can systematically recover the most preferred model. Intuitively, ordinal data lacks the information needed to resolve tradeoffs – e.g., fixing a factual error on one prompt versus improving style on another. We show that selecting the optimal model requires recovering preferences over models (rather than just responses), which can only be identified given cardinal feedback about response quality. To address this, we collect and publicly release a dataset of 25,000 cardinal judgments using willingness-to-pay elicitations, a well-established tool from experimental economics. Empirically, we find that incorporating cardinal feedback into preference fine-tuning allows models to prioritize high-impact improvements and outperform ordinal-only methods on downstream benchmarks, such as Arena-Hard.
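不可能性结果背后的直觉可以用一个极小的数值例子说明:两种改进方向在序数比较下各胜一局、无法区分,而基数反馈(如意愿支付)能分辨其影响大小。数值纯属示意:

```python
# Hypothetical willingness-to-pay values (e.g. in cents) for two candidate
# models' responses on two prompts, relative to a shared baseline. The
# numbers are illustrative, not from the paper's released dataset.
wtp = {"A": [10.0, 0.0],   # A fixes a factual error on prompt 1
       "B": [0.0, 1.0]}    # B only polishes style on prompt 2

# Ordinal view: each model "wins" one binary comparison -- a tie.
ordinal_wins = {m: sum(v > 0 for v in vals) for m, vals in wtp.items()}
# Cardinal view: summed value resolves the tradeoff in favour of A.
cardinal_score = {m: sum(vals) for m, vals in wtp.items()}
print(ordinal_wins, cardinal_score)
```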
zh

[AI-65] A Fast GRASP Metaheuristic for the Trigger Arc TSP with MIP-Based Construction and Multi-Neighborhood Local Search

【速读】:该论文致力于解决触发弧旅行商问题(Trigger Arc Traveling Salesman Problem, TA-TSP),即在经典旅行商问题(TSP)基础上引入当特定触发弧(trigger arcs)被经过时,边成本动态变化的场景,以建模如可压缩存储系统的仓库调度等实际应用。解决方案的关键在于提出一种基于GRASP(Greedy Randomized Adaptive Search Procedure)的元启发式算法,其核心创新包括:构造阶段利用混合整数规划(MIP)技术将TA-TSP转化为一系列定制化的TSP实例;改进阶段结合2-Opt、Swap和Relocate三种邻域操作进行多邻域局部搜索,从而有效处理状态依赖的路径优化问题。实验表明,该方法在MESS 2024竞赛实例上平均最优性间隙分别仅为0.77%与0.40%,且在小规模合成数据集上相较Gurobi求解器提升11.3%的解质量,展现出对实时路径规划中动态成本场景的强大适应能力。

链接: https://arxiv.org/abs/2508.08477
作者: Joan Salvà Soler,Grégoire de Lambertye
机构: 未知
类目: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
备注: 9 pages, 2 figures, 2-column format

点击查看摘要

Abstract:The Trigger Arc Traveling Salesman Problem (TA-TSP) extends the classical TSP by introducing dynamic arc costs that change when specific trigger arcs are traversed, modeling scenarios such as warehouse operations with compactable storage systems. This paper introduces a GRASP-based metaheuristic that combines multiple construction heuristics with a multi-neighborhood local search. The construction phase uses mixed-integer programming (MIP) techniques to transform the TA-TSP into a sequence of tailored TSP instances, while the improvement phase applies 2-Opt, Swap, and Relocate operators. Computational experiments on MESS 2024 competition instances achieved average optimality gaps of 0.77% and 0.40% relative to the best-known solutions within a 60-second limit. On smaller, synthetically generated datasets, the method produced solutions 11.3% better than the Gurobi solver under the same time constraints. The algorithm finished in the top three at MESS 2024, demonstrating its suitability for real-time routing applications with state-dependent travel costs.
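下面给出一个玩具级 Python 示意,演示 TA-TSP 的两个要素:带触发弧的路径成本评估,以及摘要中提到的 2-Opt 局部搜索。距离矩阵与触发规则均为假设,且省略了论文中基于 MIP 的构造阶段与 Swap/Relocate 算子:

```python
import itertools

# 玩具距离矩阵(4 个节点,0 为出发点)
D = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 8],
     [10, 4, 8, 0]]

# 触发规则(假设的简化形式):若先经过弧 (0,1),之后弧 (2,3) 的成本翻倍
TRIGGERS = {(0, 1): ((2, 3), 2.0)}

def tour_cost(tour):
    """按行走顺序累加弧成本;触发弧会修改其后续弧的成本倍率。"""
    active = {}                                  # arc -> 倍率
    cost = 0.0
    for a in zip(tour, tour[1:] + tour[:1]):     # 最后一条弧回到起点
        cost += D[a[0]][a[1]] * active.get(a, 1.0)
        if a in TRIGGERS:
            target, mult = TRIGGERS[a]
            active[target] = mult
    return cost

def two_opt(tour):
    """标准 2-Opt:反转一段路径,若成本下降则接受,直到无改进。"""
    best = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_cost(cand) < tour_cost(best):
                    best, improved = cand, True
    return best

# 穷举最优作为对照(仅在玩具规模下可行)
opt = min((list(p) for p in itertools.permutations([1, 2, 3])),
          key=lambda p: tour_cost([0] + p))
best = two_opt([0, 1, 2, 3])
```

在这个小实例上,2-Opt 从成本 34 的初始回路改进到与穷举最优相同的成本 23;真实实例还需要论文中更强的构造与多邻域搜索。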
zh

[AI-66] Empowering Children to Create AI-Enabled Augmented Reality Experiences

【速读】:该论文试图解决当前AI增强现实(AR)技术在儿童教育应用中普遍将儿童定位为被动消费者而非主动创作者的问题。解决方案的关键在于提出Capybara,一个基于AR和人工智能(AI)的可视化编程环境,使儿童能够利用文本到3D生成式AI模型创建、定制并编程叠加于物理世界之上的3D角色,并通过自动骨骼绑定(auto-rigging)与身体追踪实现动画效果;同时,系统借助视觉AI模型识别物理物体,支持儿童编程虚拟角色与真实环境之间的交互行为,从而实现虚拟与物理世界的无缝融合,提升儿童在创作过程中的参与度与个性化表达能力。

链接: https://arxiv.org/abs/2508.08467
作者: Lei Zhang,Shuyao Zhou,Amna Liaqat,Tinney Mak,Brian Berengard,Emily Qian,Andrés Monroy-Hernández
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Graphics (cs.GR); Programming Languages (cs.PL)
备注: Accepted to ACM UIST 2025

点击查看摘要

Abstract:Despite their potential to enhance children’s learning experiences, AI-enabled AR technologies are predominantly used in ways that position children as consumers rather than creators. We introduce Capybara, an AR-based and AI-powered visual programming environment that empowers children to create, customize, and program 3D characters overlaid onto the physical world. Capybara enables children to create virtual characters and accessories using text-to-3D generative AI models, and to animate these characters through auto-rigging and body tracking. In addition, our system employs vision-based AI models to recognize physical objects, allowing children to program interactive behaviors between virtual characters and their physical surroundings. We demonstrate the expressiveness of Capybara through a set of novel AR experiences. We conducted user studies with 20 children in the United States and Argentina. Our findings suggest that Capybara can empower children to harness AI in authoring personalized and engaging AR experiences that seamlessly bridge the virtual and physical worlds.
zh

[AI-67] Temporal User Profiling with LLMs: Balancing Short-Term and Long-Term Preferences for Recommendations

【速读】:该论文旨在解决现有基于内容的推荐系统中用户偏好建模过于简化的问题,尤其是无法有效捕捉长期与短期偏好之间的动态交互关系。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的时间感知用户画像方法(LLM-driven Temporal User Profiling, LLM-TUP),通过利用交互时间戳生成自然语言形式的用户历史表示,并借助预训练BERT模型将其编码为高维嵌入;随后采用注意力机制动态融合短时与长时嵌入,构建更全面的用户表征,从而显著提升推荐性能。

链接: https://arxiv.org/abs/2508.08454
作者: Milad Sabouri,Masoud Mansoury,Kun Lin,Bamshad Mobasher
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurately modeling user preferences is crucial for improving the performance of content-based recommender systems. Existing approaches often rely on simplistic user profiling methods, such as averaging or concatenating item embeddings, which fail to capture the nuanced nature of user preference dynamics, particularly the interactions between long-term and short-term preferences. In this work, we propose LLM-driven Temporal User Profiling (LLM-TUP), a novel method for user profiling that explicitly models short-term and long-term preferences by leveraging interaction timestamps and generating natural language representations of user histories using a large language model (LLM). These representations are encoded into high-dimensional embeddings using a pre-trained BERT model, and an attention mechanism is applied to dynamically fuse the short-term and long-term embeddings into a comprehensive user profile. Experimental results on real-world datasets demonstrate that LLM-TUP achieves substantial improvements over several baselines, underscoring the effectiveness of our temporally aware user-profiling approach and the use of semantically rich user profiles, generated by LLMs, for personalized content-based recommendation.
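摘要中"用注意力机制动态融合短期与长期嵌入"这一步可以用如下最小示意重现。嵌入用随机向量代替 LLM+BERT 的输出,权重计算采用常见的缩放点积 softmax 注意力,具体形式系本文假设,并非论文实现:

```python
import math, random

random.seed(0)
d = 8
e_short = [random.gauss(0, 1) for _ in range(d)]   # 短期偏好嵌入(随机向量示意)
e_long  = [random.gauss(0, 1) for _ in range(d)]   # 长期偏好嵌入
q       = [random.gauss(0, 1) for _ in range(d)]   # 查询/上下文向量(假设)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(e_short, e_long, q):
    """softmax 注意力:按与 q 的缩放点积相似度加权融合两个嵌入。"""
    scores = [dot(e_short, q) / math.sqrt(d), dot(e_long, q) / math.sqrt(d)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]                    # 两个注意力权重,和为 1
    profile = [w[0] * a + w[1] * b for a, b in zip(e_short, e_long)]
    return w, profile

w, profile = fuse(e_short, e_long, q)
```

权重随查询动态变化,这正是"动态融合"相对于简单平均或拼接的差别所在。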
zh

[AI-68] OverFill: Two-Stage Models for Efficient Language Model Decoding

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在部署过程中因推理成本高而导致的效率瓶颈问题,尤其针对解码阶段(decode stage)在长序列场景下占据主导延迟的问题。现有decoder-only模型对预填充(prefill)和解码阶段采用统一处理方式,忽视了二者不同的计算特性(prefill为计算密集型,decode为内存密集型)。解决方案的关键在于提出OverFill机制,通过解耦这两个阶段:在预填充阶段使用完整模型并行处理系统与用户输入以提升生成质量,随后切换至一个稀疏剪枝后的稠密模型进行逐token生成;这种策略在不显著增加延迟的前提下,利用更多计算资源优化预填充阶段,从而实现准确率与效率之间的更好权衡。

链接: https://arxiv.org/abs/2508.08446
作者: Woojeong Kim,Junxiong Wang,Jing Nathan Yan,Mohamed Abdelfattah,Alexander M. Rush
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to COLM 2025

点击查看摘要

Abstract:Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at this https URL.
zh

[AI-69] Solver-Aided Expansion of Loops to Avoid Generate-and-Test

【速读】:该论文旨在解决约束建模语言(如MiniZinc和Essence)在编译过程中因展开循环(loop unrolling)而导致的效率问题,尤其是在诱导变量(induction variables)取值范围较大且存在选择性前提条件时,传统方法会生成大量无关组合,造成计算资源浪费。解决方案的关键在于引入一种基于求解器(solver)的新方法,通过动态计算仅需生成最终约束集所必需的组合,从而避免全量枚举,实现与传统展平(flattening)等价但显著更高效的模型构建过程。

链接: https://arxiv.org/abs/2508.08442
作者: Niklas Dewally,Özgür Akgün
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, published in ModRef 2025 workshop

点击查看摘要

Abstract:Constraint modelling languages like MiniZinc and Essence rely on unrolling loops (in the form of quantified expressions and comprehensions) during compilation. Standard approaches generate all combinations of induction variables and use partial evaluation to discard those that simplify to identity elements of associative-commutative operators (e.g. true for conjunction, 0 for summation). This can be inefficient for problems where most combinations are ultimately irrelevant. We present a method that avoids full enumeration by using a solver to compute only the combinations required to generate the final set of constraints. The resulting model is identical to that produced by conventional flattening, but compilation can be significantly faster. This improves the efficiency of translating high-level user models into solver-ready form, particularly when induction variables range over large domains with selective preconditions.
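下面用一个小例子对比两种展开方式:generate-and-test 枚举诱导变量的全部组合再丢弃不满足前提的,而论文思路是只生成满足前提的组合。此处用因数分解直接构造来代替真实的求解器调用,仅作示意:

```python
from itertools import product

N, K = 1000, 720   # 诱导变量范围与前提中的常数(假设的例子)

# 传统展开(generate-and-test):枚举全部 N*N 组合,再丢弃不满足前提 i*j == K 的
def unroll_naive():
    return [(i, j) for i, j in product(range(1, N + 1), repeat=2) if i * j == K]

# "求解器"式展开:只生成满足前提的组合(以直接构造代替真实求解器)
def unroll_solver():
    return [(i, K // i) for i in range(1, N + 1) if K % i == 0 and K // i <= N]

naive = sorted(unroll_naive())
smart = sorted(unroll_solver())   # 约束组合完全相同,但无需一百万次枚举
```

两种方式产出的约束集合逐项一致,差别只在编译耗时,这正是摘要所说"模型等同于传统 flattening,但编译可以显著更快"。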
zh

[AI-70] Fast weight programming and linear transformers: from machine learning to neurobiology

【速读】:该论文旨在解决传统循环神经网络(Recurrent Neural Networks, RNNs)在建模长期依赖关系时效率低、记忆能力有限的问题。其解决方案的关键在于引入二维状态的RNN架构——快速权重程序员(Fast Weight Programmers, FWPs),该模型通过动态调整“快速权重”(fast weights)来实现短期记忆存储,这些权重随输入观测值实时变化,并由一个可训练的“编程网络”(programmer)控制更新过程。FWPs不仅提升了模型对时序信息的捕捉能力,还揭示了与Transformer和状态空间模型(State Space Models)之间的理论联系,同时为理解大脑中突触可塑性机制提供了新的计算视角,体现了人工神经网络与生物智能的潜在融合趋势。

链接: https://arxiv.org/abs/2508.08435
作者: Kazuki Irie,Samuel J. Gershman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Recent advances in artificial neural networks for machine learning, and language modeling in particular, have established a family of recurrent neural network (RNN) architectures that, unlike conventional RNNs with vector-form hidden states, use two-dimensional (2D) matrix-form hidden states. Such 2D-state RNNs, known as Fast Weight Programmers (FWPs), can be interpreted as a neural network whose synaptic weights (called fast weights) dynamically change over time as a function of input observations, and serve as short-term memory storage; corresponding synaptic weight modifications are controlled or programmed by another network (the programmer) whose parameters are trained (e.g., by gradient descent). In this Primer, we review the technical foundations of FWPs, their computational characteristics, and their connections to transformers and state space models. We also discuss connections between FWPs and models of synaptic plasticity in the brain, suggesting a convergence of natural and artificial intelligence.
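摘要提到的 FWP 与 transformer(线性注意力)的联系,可以用几行纯 Python 数值验证:对快速权重做加性外积更新后再作用于查询,与(未归一化的)线性注意力逐项相等。这是文献中最简形式的示意,省略了门控、归一化等常见变体:

```python
import random

random.seed(1)
d, T = 3, 5

def outer(v, k):
    return [[vi * kj for kj in k] for vi in v]

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

keys    = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
values  = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
queries = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]

# FWP:二维矩阵状态 W(快速权重)以外积规则随每步输入累加更新
W = [[0.0] * d for _ in range(d)]
fwp_out = []
for k, v, q in zip(keys, values, queries):
    upd = outer(v, k)                              # v_t k_t^T
    W = [[W[i][j] + upd[i][j] for j in range(d)] for i in range(d)]
    fwp_out.append(matvec(W, q))                   # y_t = W_t q_t

# 等价视角:未归一化线性注意力 y_t = sum_{s<=t} v_s * (k_s . q_t)
def linear_attn(t):
    q = queries[t]
    y = [0.0] * d
    for s in range(t + 1):
        score = sum(ks * qi for ks, qi in zip(keys[s], q))
        y = [yi + score * vi for yi, vi in zip(y, values[s])]
    return y

lin_out = [linear_attn(t) for t in range(T)]
```

两条计算路径只是求和顺序不同,数值上逐项吻合:这就是"线性 transformer 即快速权重程序员"的最小化表述。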
zh

[AI-71] Generating Query-Relevant Document Summaries via Reinforcement Learning

【速读】:该论文旨在解决电商搜索中因仅依赖商品标题进行排序而导致的相关性预测不佳的问题,因为商品标题通常信息不足,难以准确捕捉用户查询意图;而商品描述虽信息丰富,但因其冗长和复杂性,难以用于实时排序,尤其对计算成本高的交叉编码器(cross-encoder)模型不友好。解决方案的关键在于提出一种名为ReLSum的新型强化学习框架,通过将相关性评分作为奖励信号,使摘要生成目标与排序目标对齐,从而训练一个可训练的大语言模型(LLM)生成简洁且与查询相关的商品描述摘要,并将其作为输入用于交叉编码器排序模型,实现高效且高相关的在线搜索效果。

链接: https://arxiv.org/abs/2508.08404
作者: Nitin Yadav,Changsung Kang,Hongwei Shang,Ming Sun
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:E-commerce search engines often rely solely on product titles as input for ranking models with latency constraints. However, this approach can result in suboptimal relevance predictions, as product titles often lack sufficient detail to capture query intent. While product descriptions provide richer information, their verbosity and length make them unsuitable for real-time ranking, particularly for computationally expensive architectures like cross-encoder ranking models. To address this challenge, we propose ReLSum, a novel reinforcement learning framework designed to generate concise, query-relevant summaries of product descriptions optimized for search relevance. ReLSum leverages relevance scores as rewards to align the objectives of summarization and ranking, effectively overcoming limitations of prior methods, such as misaligned learning targets. The framework employs a trainable large language model (LLM) to produce summaries, which are then used as input for a cross-encoder ranking model. Experimental results demonstrate significant improvements in offline metrics, including recall and NDCG, as well as online user engagement metrics. ReLSum provides a scalable and efficient solution for enhancing search relevance in large-scale e-commerce systems.
zh

[AI-72] UrzaGPT : LoRA-Tuned Large Language Models for Card Selection in Collectible Card Games

【速读】:该论文旨在解决收集类卡牌游戏(Collectible Card Games, CCGs)中AI在局部可观测性、长期决策制定以及不断更新的卡牌扩展内容下表现远低于人类玩家的问题,尤其是在套牌构建(deckbuilding)与实时选牌(drafting)任务中的能力不足。解决方案的关键在于提出一个领域适配的大语言模型(domain-adapted large language model)UrzaGPT,其通过在标注的选牌日志数据集上使用低秩适应(Low-Rank Adaptation, LoRA)微调技术,从开源大语言模型(LLM)出发,快速适配《万智牌》(Magic: The Gathering)不同扩展包的选牌策略。该方法利用了LLM的语言建模能力,并实现了仅用10,000步微调即可达到66.2%的选牌准确率,显著优于零样本LLM(如GPT-4o的43%),验证了基于LLM的通用且可更新的选牌AI系统的可行性。

链接: https://arxiv.org/abs/2508.08382
作者: Timo Bertram
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Collectible card games (CCGs) are a difficult genre for AI due to their partial observability, long-term decision-making, and evolving card sets. Due to this, current AI models perform vastly worse than human players at CCG tasks such as deckbuilding and gameplay. In this work, we introduce UrzaGPT, a domain-adapted large language model that recommends real-time drafting decisions in Magic: The Gathering. Starting from an open-weight LLM, we use Low-Rank Adaptation fine-tuning on a dataset of annotated draft logs. With this, we leverage the language modeling capabilities of LLMs, and can quickly adapt to different expansions of the game. We benchmark UrzaGPT in comparison to zero-shot LLMs and the state-of-the-art domain-specific model. Untuned, small LLMs like Llama-3-8B are completely unable to draft, but the larger GPT-4o achieves a zero-shot performance of 43%. Using UrzaGPT to fine-tune smaller models, we achieve an accuracy of 66.2% using only 10,000 steps. Despite this not reaching the capability of domain-specific models, we show that solely using LLMs to draft is possible and conclude that using LLMs can enable performant, general, and update-friendly drafting AIs in the future.
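论文使用的 Low-Rank Adaptation(LoRA)的核心计算可以用如下示意说明:冻结原权重 W,只训练两个低秩矩阵 A、B,前向输出为 y = Wx + (alpha/r)·B(Ax)。维度与初始化均为演示用假设,并非 UrzaGPT 的实际配置:

```python
import random

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4.0

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # 冻结的预训练权重(示意)
A = [[random.gauss(0, 0.1) for _ in range(d_in)] for _ in range(r)]     # 可训练低秩因子 A
B = [[0.0] * r for _ in range(d_out)]                                   # B 初始化为 0:微调起点等于原模型

def lora_forward(x):
    """y = W x + (alpha/r) * B (A x);只更新 A、B 两个小矩阵。"""
    base = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    delta = [(alpha / r) * sum(b * axi for b, axi in zip(row, Ax)) for row in B]
    return [p + q for p, q in zip(base, delta)]

x = [1.0] * d_in
y0 = lora_forward(x)                     # B=0 时输出与原模型完全一致
base_out = [sum(w * xi for w, xi in zip(row, x)) for row in W]

B = [[1.0] * r for _ in range(d_out)]    # 模拟训练后 B 非零:输出发生偏移
y1 = lora_forward(x)
lora_params, full_params = d_out * r + r * d_in, d_out * d_in
```

可训练参数量只有 d_out·r + r·d_in,远小于整块权重,这是 LoRA 能在 10,000 步内适配新扩展包的原因之一。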
zh

[AI-73] Processing of synthetic data in AI development for healthcare and the definition of personal data in EU law

【速读】:该论文试图解决的问题是:在欧盟《通用数据保护条例》(GDPR)框架下,合成数据(synthetic data)是否应被认定为个人数据(personal data),从而影响其在医疗人工智能(AI)领域的合法使用与共享。当前GDPR对合成数据的法律地位存在不确定性,导致监管负担加重,阻碍了其在促进AI创新中的应用。论文的关键解决方案在于通过系统性法律分析和实证研究,评估生成合成数据后的残留识别风险(residual identification risk),并模拟推理攻击(inference attacks)以检验技术层面的匿名化程度;研究发现,合成数据在特定条件下可能属于匿名数据(anonymous data),但“合理可能的风险”(reasonably likely risk)的界定仍不明确,因此呼吁制定更清晰的法规标准,在保障隐私与推动医疗AI发展之间实现平衡。

链接: https://arxiv.org/abs/2508.08353
作者: Vibeke Binz Vallevik,Anne Kjersti C. Befring,Severin Elvatun,Jan Franz Nygaard
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 55 pages

点击查看摘要

Abstract:Artificial intelligence (AI) has the potential to transform healthcare, but it requires access to health data. Synthetic data that is generated through machine learning models trained on real data, offers a way to share data while preserving privacy. However, uncertainties in the practical application of the General Data Protection Regulation (GDPR) create an administrative burden, limiting the benefits of synthetic data. Through a systematic analysis of relevant legal sources and an empirical study, this article explores whether synthetic data should be classified as personal data under the GDPR. The study investigates the residual identification risk through generating synthetic data and simulating inference attacks, challenging common perceptions of technical identification risk. The findings suggest synthetic data is likely anonymous, depending on certain factors, but highlights uncertainties about what constitutes reasonably likely risk. To promote innovation, the study calls for clearer regulations to balance privacy protection with the advancement of AI in healthcare.
zh

[AI-74] Fuzzy-Pattern Tsetlin Machine

【速读】:该论文旨在解决传统Tsetlin Machine(TM)中“全或无”(all-or-nothing)条款评估策略导致的效率低下问题,即每个条款必须所有二进制文字(binary literals)均满足才能参与投票,这迫使模型使用数千个条款以达到竞争性准确率,从而造成高内存占用、长训练时间及计算资源浪费。解决方案的关键在于提出模糊模式Tsetlin机(Fuzzy-Pattern Tsetlin Machine, FPTM),其核心创新是将条款评估机制从严格逻辑判断转变为模糊评分机制:当部分文字不匹配时,剩余有效文字仍可按比例贡献投票权重,使每个条款内部形成可自适应输入的子模式(sub-patterns)。这一机制显著降低了对条款数量的需求(如IMDb数据集仅需每类1个条款),同时提升了准确性、训练速度和推理吞吐量,并实现了在微控制器上的在线学习能力。

链接: https://arxiv.org/abs/2508.08350
作者: Artem Hnilov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures, 6 tables

点击查看摘要

Abstract:The “all-or-nothing” clause evaluation strategy is a core mechanism in the Tsetlin Machine (TM) family of algorithms. In this approach, each clause - a logical pattern composed of binary literals mapped to input data - is disqualified from voting if even a single literal fails. Due to this strict requirement, standard TMs must employ thousands of clauses to achieve competitive accuracy. This paper introduces the Fuzzy-Pattern Tsetlin Machine (FPTM), a novel variant where clause evaluation is fuzzy rather than strict. If some literals in a clause fail, the remaining ones can still contribute to the overall vote with a proportionally reduced score. As a result, each clause effectively consists of sub-patterns that adapt individually to the input, enabling more flexible, efficient, and robust pattern matching. The proposed fuzzy mechanism significantly reduces the required number of clauses, memory footprint, and training time, while simultaneously improving accuracy. On the IMDb dataset, FPTM achieves 90.15% accuracy with only one clause per class, a 50x reduction in clauses and memory over the Coalesced Tsetlin Machine. FPTM trains up to 316x faster (45 seconds vs. 4 hours) and fits within 50 KB, enabling online learning on microcontrollers. Inference throughput reaches 34.5 million predictions/second (51.4 GB/s). On Fashion-MNIST, accuracy reaches 92.18% (2 clauses), 93.19% (20 clauses) and 94.68% (8000 clauses), a ~400x clause reduction compared to the Composite TM’s 93.00% (8000 clauses). On the Amazon Sales dataset with 20% noise, FPTM achieves 85.22% accuracy, significantly outperforming the Graph Tsetlin Machine (78.17%) and a Graph Convolutional Neural Network (66.23%).
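"全有或全无"与模糊评估的区别可以用几行代码直观对比。评分方式按摘要描述的"按比例降低得分"示意,未必与 FPTM 的实现逐一对应:

```python
clause = [(0, 1), (1, 0), (2, 1), (3, 1)]   # 条款:4 个文字的合取,(特征索引, 期望取值)

def strict_eval(x, clause):
    """标准 TM:任一文字不匹配,条款即不参与投票(得分 0)。"""
    return 1.0 if all(x[i] == v for i, v in clause) else 0.0

def fuzzy_eval(x, clause):
    """FPTM 示意:按匹配文字的比例给出部分得分。"""
    matched = sum(1 for i, v in clause if x[i] == v)
    return matched / len(clause)

x_good  = [1, 0, 1, 1]    # 全部文字匹配
x_noisy = [1, 0, 1, 0]    # 最后一个文字被噪声翻转

strict_scores = (strict_eval(x_good, clause), strict_eval(x_noisy, clause))
fuzzy_scores  = (fuzzy_eval(x_good, clause),  fuzzy_eval(x_noisy, clause))
```

在噪声样本上,严格评估直接把条款清零,而模糊评估仍保留 0.75 的投票强度,这就是单个条款能覆盖多个子模式、从而大幅减少条款数量的直觉来源。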
zh

[AI-75] Do AI Companies Make Good on Voluntary Commitments to the White House?

【速读】:该论文旨在解决当前国际人工智能(AI)治理中企业自愿承诺落实不力的问题,特别是大型AI公司在公开承诺与实际行为之间存在显著差距的现象。其关键解决方案在于构建一套基于2023年各公司向白宫所作八项自愿承诺的详细评分体系,并通过量化评估发现:尽管最高得分公司(OpenAI)达到83%,但平均得分仅为52%,尤其在模型权重安全(model weight security)方面表现极差,平均仅17%,11家公司的得分为0%。研究指出,结构性短板在于缺乏可验证的主动披露机制,因此提出三项针对性建议:明确模糊承诺、规范复杂AI供应链责任、强化公众透明度,以提升企业自律与政策制定的有效性。

链接: https://arxiv.org/abs/2508.08345
作者: Jennifer Wang,Kayla Huang,Kevin Klyman,Rishi Bommasani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Voluntary commitments are central to international AI governance, as demonstrated by recent voluntary guidelines from the White House to the G7, from Bletchley Park to Seoul. How do major AI companies make good on their commitments? We score companies based on their publicly disclosed behavior by developing a detailed rubric based on their eight voluntary commitments to the White House in 2023. We find significant heterogeneity: while the highest-scoring company (OpenAI) scores a 83% overall on our rubric, the average score across all companies is just 52%. The companies demonstrate systemically poor performance for their commitment to model weight security with an average score of 17%: 11 of the 16 companies receive 0% for this commitment. Our analysis highlights a clear structural shortcoming that future AI governance initiatives should correct: when companies make public commitments, they should proactively disclose how they meet their commitments to provide accountability, and these disclosures should be verifiable. To advance policymaking on corporate AI governance, we provide three directed recommendations that address underspecified commitments, the role of complex AI supply chains, and public transparency that could be applied towards AI governance initiatives worldwide.
zh

[AI-76] What Breaks Knowledge Graph based RAG ? Empirical Insights into Reasoning under Incomplete Knowledge

【速读】:该论文旨在解决当前知识图谱增强生成(Knowledge Graph-based Retrieval-Augmented Generation, KG-RAG)方法在评估中存在的核心问题:现有基准测试往往包含可直接通过知识图谱中已有三元组回答的问题,导致难以区分模型是真正进行了推理还是仅依赖于检索;同时,评价指标不统一且答案匹配标准宽松,进一步阻碍了对模型性能的客观比较。其解决方案的关键在于提出一种通用的基准构建方法与系统化的评估协议,专门用于在知识不完整性条件下对KG-RAG方法进行严格评估,从而更准确地衡量模型的推理能力、记忆依赖性和泛化性能。

链接: https://arxiv.org/abs/2508.08344
作者: Dongzhuoran Zhou,Yuqicheng Zhu,Xiaxia Wang,Hongkuan Zhou,Yuan He,Jiaoyan Chen,Evgeny Kharlamov,Steffen Staab
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks, together with an evaluation protocol, to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.
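构造此类基准时,一个基本判别是:问题能否被图中已有三元组直接回答(若能,则考察的是检索而非推理)。下面是一个最小示意,知识图谱与判别函数均为演示用假设:

```python
# 玩具知识图谱:两条三元组 (head, relation, tail)
kg = {("Paris", "capital_of", "France"),
      ("France", "located_in", "Europe")}

def directly_answerable(head, relation, kg):
    """若 (head, relation, ?) 已是图中三元组,则问题可直接查表回答,无法考察推理。"""
    return any(h == head and r == relation for h, r, _ in kg)

# "巴黎是哪个国家的首都?" -> 直接命中三元组,应从推理基准中剔除
q1 = directly_answerable("Paris", "capital_of", kg)
# "巴黎位于哪个大洲?" -> 需要两跳组合 (Paris->France->Europe),适合保留
q2 = directly_answerable("Paris", "located_in", kg)
```

论文的基准构造方法正是围绕这类"知识不完整"场景:把可直接命中的问题与需要组合推理的问题区分开,再在后者上评估 KG-RAG 方法。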
zh

[AI-77] Algorithmic Fairness amid Social Determinants: Reflection Characterization and Approach

【速读】:该论文试图解决现有算法公平性研究中忽视社会决定因素(Social Determinants of Health, SDH)的问题,即当前方法主要聚焦于敏感属性(如种族、性别等),而未能充分考虑环境与背景因素对个体结果的因果影响。解决方案的关键在于引入形式化和定量化的分析框架,以地区作为社会决定因素的代理变量,并采用伽马分布(Gamma distribution)参数化建模其对个体结果的影响。该方法在大学录取场景中验证了有效性,能够定量捕捉到以往定性讨论中的深层洞见,且揭示了仅基于敏感属性的缓解策略可能加剧结构性不公的风险,从而强调了同时考量敏感属性与社会决定因素对于实现更全面、透明和有效的公平性的必要性。

链接: https://arxiv.org/abs/2508.08337
作者: Zeyu Tang,Alex John London,Atoosa Kasirzadeh,Sanmi Koyejo,Peter Spirtes,Kun Zhang
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Social determinants are variables that, while not directly pertaining to any specific individual, capture key aspects of contexts and environments that have direct causal influences on certain attributes of an individual. Previous algorithmic fairness literature has primarily focused on sensitive attributes, often overlooking the role of social determinants. Our paper addresses this gap by introducing formal and quantitative rigor into a space that has been shaped largely by qualitative proposals regarding the use of social determinants. To demonstrate theoretical perspectives and practical applicability, we examine a concrete setting of college admissions, using region as a proxy for social determinants. Our approach leverages a region-based analysis with Gamma distribution parameterization to model how social determinants impact individual outcomes. Despite its simplicity, our method quantitatively recovers findings that resonate with nuanced insights in previous qualitative debates, that are often missed by existing algorithmic fairness approaches. Our findings suggest that mitigation strategies centering solely around sensitive attributes may introduce new structural injustice when addressing existing discrimination. Considering both sensitive attributes and social determinants facilitates a more comprehensive explication of benefits and burdens experienced by individuals from diverse demographic backgrounds as well as contextual environments, which is essential for understanding and achieving fairness effectively and transparently.
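论文用 Gamma 分布参数化地区(社会决定因素的代理变量)对个体结果的影响。下面的示意复现了这一建模思路:不同地区对应不同的 (shape, scale) 参数,个体分数从相应分布中抽样。参数为假设值,并非论文的估计结果:

```python
import random

random.seed(42)

# 各地区的 Gamma 参数 (shape k, scale theta);均值 = k * theta(假设值)
regions = {
    "region_A": (9.0, 1.0),   # 环境资源较多的地区,均值 9
    "region_B": (4.0, 1.0),   # 环境资源较少的地区,均值 4
}

def sample_scores(region, n):
    k, theta = regions[region]
    return [random.gammavariate(k, theta) for _ in range(n)]

n = 20000
mean_A = sum(sample_scores("region_A", n)) / n
mean_B = sum(sample_scores("region_B", n)) / n
gap = mean_A - mean_B    # 只按原始分数录取时,region_B 的个体被系统性压低
```

这个差距并非个体能力差异,而是地区环境经由 Gamma 参数注入的结构性偏移,正是论文主张在公平性分析中显式建模社会决定因素的原因。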
zh

[AI-78] HSA-Net: Hierarchical and Structure-Aware Framework for Efficient and Scalable Molecular Language Modeling

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNN)在深度网络中因过平滑(over-smoothing)问题导致节点特征退化、进而影响分子表示学习性能的问题。现有基于交叉注意力(cross-attention)的特征投影方法虽能缓解此问题,但在深层特征表现不佳;而Mamba虽然擅长保留深层全局拓扑信息,却忽视了浅层细粒度结构细节,体现出全局与局部信息之间的权衡。解决方案的关键在于提出一种分层且结构感知的网络架构(Hierarchical and Structure-Aware Network, HSA-Net),其核心创新为两个模块:一是分层自适应投影器(Hierarchical Adaptive Projector, HAP),动态选择浅层使用交叉注意力投影、深层使用结构感知的Graph-Mamba投影,以获取多层级高质量特征;二是源感知融合模块(Source-Aware Fusion, SAF),根据聚合特征特性自适应选择融合专家,实现精准有效的多层级特征融合,从而突破全局与局部信息的权衡瓶颈。

链接: https://arxiv.org/abs/2508.08334
作者: Zihang Shao,Wentao Lei,Lei Wang,Wencai Ye,Li Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Molecular representation learning, a cornerstone for downstream tasks like molecular captioning and molecular property prediction, heavily relies on Graph Neural Networks (GNN). However, GNN suffers from the over-smoothing problem, where node-level features collapse in deep GNN layers. While existing feature projection methods with cross-attention have been introduced to mitigate this issue, they still perform poorly in deep features. This motivated our exploration of using Mamba as an alternative projector for its ability to handle complex sequences. However, we observe that while Mamba excels at preserving global topological information from deep layers, it neglects fine-grained details in shallow layers. The capabilities of Mamba and cross-attention exhibit a global-local trade-off. To resolve this critical global-local trade-off, we propose Hierarchical and Structure-Aware Network (HSA-Net), a novel framework with two modules that enables a hierarchical feature projection and fusion. Firstly, a Hierarchical Adaptive Projector (HAP) module is introduced to process features from different graph layers. It learns to dynamically switch between a cross-attention projector for shallow layers and a structure-aware Graph-Mamba projector for deep layers, producing high-quality, multi-level features. Secondly, to adaptively merge these multi-level features, we design a Source-Aware Fusion (SAF) module, which flexibly selects fusion experts based on the characteristics of the aggregation features, ensuring a precise and effective final representation fusion. Extensive experiments demonstrate that our HSA-Net framework quantitatively and qualitatively outperforms current state-of-the-art (SOTA) methods.
zh

[AI-79] Normative Moral Pluralism for AI: A Framework for Deliberation in Complex Moral Contexts

【速读】:该论文旨在解决人工智能系统在高风险、复杂道德情境下实现价值对齐(Value Alignment)的问题,即如何使AI在面对多因素冲突、多方利益相关者以及非道德考量时,做出既符合伦理规范又具现实可行性的决策。其解决方案的关键在于提出一种双层混合式道德推理架构:上层为通用层,通过自上而下与自下而上的学习机制定义道德阈值;下层为局部层,在具体情境中权衡竞争性道德考量并整合文化特定的规范内容,同时保持在通用阈值之内。该设计不仅支持深度反思式的道德 deliberation(审议),还能驱动快速响应的直觉式行动,从而兼顾道德严谨性与实时性需求。

链接: https://arxiv.org/abs/2508.08333
作者: David-Doron Yaacov
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Conference version: AIES 2025 (non-archival track), 12 pages

点击查看摘要

Abstract:The conceptual framework proposed in this paper centers on the development of a deliberative moral reasoning system - one designed to process complex moral situations by generating, filtering, and weighing normative arguments drawn from diverse ethical perspectives. While the framework is rooted in Machine Ethics, it also makes a substantive contribution to Value Alignment by outlining a system architecture that links structured moral reasoning to action under time constraints. Grounded in normative moral pluralism, this system is not constructed to imitate behavior but is built on reason-sensitive deliberation over structured moral content in a transparent and principled manner. Beyond its role as a deliberative system, it also serves as the conceptual foundation for a novel two-level architecture: functioning as a moral reasoning teacher envisioned to train faster models that support real-time responsiveness without reproducing the full structure of deliberative reasoning. Together, the deliberative and intuitive components are designed to enable both deep reflection and responsive action. A key design feature is the dual-hybrid structure: a universal layer that defines a moral threshold through top-down and bottom-up learning, and a local layer that learns to weigh competing considerations in context while integrating culturally specific normative content, so long as it remains within the universal threshold. By extending the notion of moral complexity to include not only conflicting beliefs but also multifactorial dilemmas, multiple stakeholders, and the integration of non-moral considerations, the framework aims to support morally grounded decision-making in realistic, high-stakes contexts.
zh

[AI-80] Energy-Aware Code Generation with LLM s: Benchmarking Small vs. Large Language Models for Sustainable AI Programming

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在代码生成任务中因计算资源消耗大而导致高能耗与碳排放的问题,探索小型开源语言模型(Small Language Models, SLMs)是否能在保持较高代码正确性的同时实现更优的能源效率。其解决方案的关键在于系统性地评估三类开源SLMs(StableCode-3B、StarCoderBase-3B、Qwen2.5-Coder-3B-Instruct)与两个商业LLM(GPT-4.0、DeepSeek-Reasoner)在LeetCode 150道编程题上的表现,通过运行时间、内存占用、能耗和正确性四个维度进行量化对比,并以人工编写的Python代码作为基准。结果表明,尽管LLMs在正确性上领先,但SLMs在多数情况下(>52%)能以相同或更低的能耗生成正确代码,验证了SLMs在代码生成任务中兼具性能与能效优势的可能性。

链接: https://arxiv.org/abs/2508.08332
作者: Humza Ashraf,Syed Muhammad Danish,Aris Leivadeas,Yazan Otoum,Zeeshan Sattar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are widely used for code generation. However, commercial models like ChatGPT require significant computing power, which leads to high energy use and carbon emissions. This has raised concerns about their environmental impact. In this study, we evaluate open-source Small Language Models (SLMs) trained explicitly for code generation and compare their performance and energy efficiency against large LLMs and efficient human-written Python code. The goal is to investigate whether SLMs can match the performance of LLMs on certain types of programming problems while producing more energy-efficient code. We evaluate 150 coding problems from LeetCode, evenly distributed across three difficulty levels: easy, medium, and hard. Our comparison includes three small open-source models, StableCode-3B, StarCoderBase-3B, and Qwen2.5-Coder-3B-Instruct, and two large commercial models, GPT-4.0 and DeepSeek-Reasoner. The generated code is evaluated using four key metrics: run-time, memory usage, energy consumption, and correctness. We use human-written solutions as a baseline to assess the quality and efficiency of the model-generated code. Results indicate that LLMs achieve the highest correctness across all difficulty levels, but SLMs are often more energy-efficient when their outputs are correct. In over 52% of the evaluated problems, SLMs consumed the same or less energy than LLMs.
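论文的四个指标中,运行时间与内存峰值可以直接用 Python 标准库测量(能耗需 RAPL 等外部计量手段,此处从略)。待测函数为一个 LeetCode 风格的示例解法,并非论文使用的原题:

```python
import time
import tracemalloc

def two_sum(nums, target):
    """示例待测解法:哈希表一次遍历(LeetCode 风格,非论文用题)。"""
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

tracemalloc.start()
t0 = time.perf_counter()
result = two_sum(list(range(10000)), 9)
elapsed = time.perf_counter() - t0             # 运行时间(秒)
_, peak = tracemalloc.get_traced_memory()      # 内存峰值(字节)
tracemalloc.stop()
```

把同一道题的人工解、SLM 解与 LLM 解都套进这样的测量框架,再叠加正确性判定与外部能耗计量,即可得到论文所用的对比维度。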
zh

[AI-81] Context Engineering for Multi-Agent LLM Code Assistants Using Elicit NotebookLM ChatGPT and Claude Code

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂多文件软件项目时因上下文限制和知识缺口而导致的代码生成准确性与可靠性不足的问题。其解决方案的关键在于提出了一种新颖的上下文工程工作流,通过整合多个AI组件实现系统性优化:利用GPT-5作为意图翻译器明确用户需求,借助Elicit驱动的语义文献检索注入领域知识,基于NotebookLM进行文档合成以增强上下文理解,并采用Claude Code多智能体系统执行代码生成与验证。该方法通过意图澄清、检索增强生成(Retrieval-Augmented Generation, RAG)以及基于Claude代理框架的角色分解与协同调度,显著提升了代码助手在真实代码库中的单次成功率和对项目上下文的遵循度,相较单一代理基线方法表现更优。

链接: https://arxiv.org/abs/2508.08322
作者: Muhammad Haseeb
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures, research paper on multi-agent LLM systems for code generation

点击查看摘要

Abstract:Large Language Models (LLMs) have shown promise in automating code generation and software engineering tasks, yet they often struggle with complex, multi-file projects due to context limitations and knowledge gaps. We propose a novel context engineering workflow that combines multiple AI components: an Intent Translator (GPT-5) for clarifying user requirements, an Elicit-powered semantic literature retrieval for injecting domain knowledge, NotebookLM-based document synthesis for contextual understanding, and a Claude Code multi-agent system for code generation and validation. Our integrated approach leverages intent clarification, retrieval-augmented generation, and specialized sub-agents orchestrated via Claude’s agent framework. We demonstrate that this method significantly improves the accuracy and reliability of code assistants in real-world repositories, yielding higher single-shot success rates and better adherence to project context than baseline single-agent approaches. Qualitative results on a large this http URL codebase show the multi-agent system effectively plans, edits, and tests complex features with minimal human intervention. We compare our system with recent frameworks like CodePlan, MASAI, and HyperAgent, highlighting how targeted context injection and agent role decomposition lead to state-of-the-art performance. Finally, we discuss the implications for deploying LLM-based coding assistants in production, along with lessons learned on context management and future research directions.
zh

[AI-82] Between Fear and Desire the Monster Artificial Intelligence (AI): Analysis through the Lenses of Monster Theory

【速读】:该论文试图解决的问题是:如何在人工智能(Artificial Intelligence, AI)日益普及的背景下,深入理解公众对AI的复杂认知及其社会文化影响。解决方案的关键在于引入“怪物理论”(Monster theory),通过其七个命题揭示AI与人类社会之间的深层互动关系——即AI不应被孤立地视为技术实体,而应置于特定社会或文化语境中进行分析;同时,人们对AI的理解和解读具有多样性,正如对怪物的诠释存在差异。该方法为理解AI的“怪异效应”提供了新的理论框架。

链接: https://arxiv.org/abs/2508.08318
作者: Ahmed Tlili
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the increasing adoption of Artificial Intelligence (AI) in all fields and daily activities, a heated debate is found about the advantages and challenges of AI and the need for navigating the concerns associated with AI to make the best of it. To contribute to this literature and the ongoing debate related to it, this study draws on the Monster theory to explain the conflicting representation of AI. It suggests that studying monsters in popular culture can provide an in-depth understanding of AI and its monstrous effects. Specifically, this study aims to discuss AI perception and development through the seven theses of Monster theory. The obtained results revealed that, just like monsters, AI is complex in nature, and it should not be studied as a separate entity but rather within a given society or culture. Similarly, readers may perceive and interpret AI differently, just as readers may interpret monsters differently. The relationship between AI and monsters, as depicted in this study, does not seem to be as odd as it might be at first.
zh

[AI-83] EU Digital Regulation and Guatemala: AI 5G and Cybersecurity

【速读】:该论文试图解决欧盟在人工智能(Artificial Intelligence, AI)、5G和网络安全领域的规则如何作为跨国治理机制影响 Guatemala 的政策制定问题。其核心关切在于这些规则通过布鲁塞尔效应(Brussels effect)、私营标准、供应链条款及数据传输管控等渠道,对发展中国家产生深远的合规压力与结构性挑战。解决方案的关键在于提出五项“护栏”(guardrails):数字宪法主义(digital constitutionalism)、绿色信息技术义务(green IT duties)、第三方国家影响评估(third country impact assessment)、标准共同设计(standards co-design)以及对监管多样性的承认(recognition of regulatory diversity),以实现更具包容性与公平性的全球数字治理框架。

链接: https://arxiv.org/abs/2508.08315
作者: Victor Lopez Juarez
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:

点击查看摘要

Abstract:The paper examines how EU rules in AI, 5G, and cybersecurity operate as transnational governance and shape policy in Guatemala. It outlines the AI Act’s risk approach, the 5G Action Plan and Security Toolbox, and the cybersecurity regime built on ENISA, NIS2, the Cybersecurity Act, and the Cyber Resilience Act. It traces extraterritorial channels such as the Brussels effect, private standards, supply chain clauses, and data transfer controls. Guatemala specific impacts include SME compliance costs, procurement limits, environmental trade-offs in rollout, rights risks, and capacity gaps. The paper maps current national measures and proposes five guardrails: digital constitutionalism, green IT duties, third country impact assessment, standards co-design, and recognition of regulatory diversity.
zh

[AI-84] Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study

【速读】:该论文旨在解决生成式 AI 在实际教育场景中生成试题的测量学质量(psychometric quality)缺乏系统评估的问题,从而明确其在高效、高质量测评设计中的应用潜力。解决方案的关键在于提出并验证了一种迭代优化策略,通过大语言模型(LLM)反复生成、评估与修订题目,并结合人工反馈或自动评分机制进行多轮改进,最终在涵盖多个学科的91个班级(近1700名学生)的大规模实地研究中,利用项目反应理论(Item Response Theory, IRT)证明AI生成题目的测量性能可媲美专家编制的标准化考试题目。

链接: https://arxiv.org/abs/2508.08314
作者: Calvin Isley,Joshua Gilbert,Evangelos Kassos,Michaela Kocher,Allen Nie,Emma Brunskill,Ben Domingue,Jake Hofman,Joscha Legewie,Teddy Svoronos,Charlotte Tuminelli,Sharad Goel
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement on automatically generating questions using artificial intelligence, but also comparatively little work evaluating the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI’s role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes – covering computer science, mathematics, chemistry, and more – in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students.
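上述摘要以项目反应理论(IRT)评估 AI 生成题目的测量学质量。作为示意,下面给出两参数逻辑斯蒂(2PL)IRT 模型的最小 Python 草图(函数名与参数数值均为笔者假设,仅用于说明该评估所依赖的基本计算,并非论文实现):

```python
import math

def irt_2pl(theta: float, a: float, b: float) -> float:
    """2PL IRT 模型:能力为 theta 的学生,
    答对区分度 a、难度 b 的题目的概率。"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# 当学生能力恰好等于题目难度时,答对概率为 0.5
p_mid = irt_2pl(theta=0.0, a=1.2, b=0.0)

# 区分度越高,概率曲线在难度点附近越陡峭
p_low_disc = irt_2pl(theta=1.0, a=0.5, b=0.0)
p_high_disc = irt_2pl(theta=1.0, a=2.0, b=0.0)
```

通过比较 AI 生成题目与专家题目在学生作答数据上拟合出的 a、b 参数分布,即可判断二者测量学质量是否相当。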
zh

[AI-85] First Ask Then Answer: A Framework Design for AI Dialogue Based on Supplementary Questioning with Large Language Models

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在用户输入信息不完整或表述不清时,难以生成准确且可操作回答的问题。其解决方案的关键在于提出一种新的交互范式——“先问后答”(First Ask Then Answer, FATA),即通过提示词引导LLM在生成答案前主动生成多维度的补充问题,以获取用户更完整的上下文信息;随后利用精心设计的提示技术将用户补充信息与原始查询融合,从而显著提升响应的质量和相关性。FATA强调查询的完整性而非仅模糊澄清,并鼓励用户参与而非依赖模型内部推理,同时采用单轮策略一次性生成所有澄清问题,提升了效率与稳定性。

链接: https://arxiv.org/abs/2508.08308
作者: Chuanruo Fu,Yuncheng Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often struggle to deliver accurate and actionable answers when user-provided information is incomplete or ill-specified. We propose a new interaction paradigm, First Ask Then Answer (FATA), in which, through prompt words, LLMs are guided to proactively generate multidimensional supplementary questions for users prior to response generation. Subsequently, by integrating user-provided supplementary information with the original query through sophisticated prompting techniques, we achieve substantially improved response quality and relevance. In contrast to existing clarification approaches – such as the CLAM framework oriented to ambiguity and the self-interrogation Self-Ask method – FATA emphasizes completeness (beyond mere disambiguation) and user participation (inviting human input instead of relying solely on model-internal reasoning). It also adopts a single-turn strategy: all clarifying questions are produced at once, thereby reducing dialogue length and improving efficiency. Conceptually, FATA uses the reasoning power of LLMs to scaffold user expression, enabling non-expert users to formulate more comprehensive and contextually relevant queries. To evaluate FATA, we constructed a multi-domain benchmark and compared it with two controls: a baseline prompt (B-Prompt) and a context-enhanced expert prompt (C-Prompt). Experimental results show that FATA outperforms B-Prompt by approximately 40% in aggregate metrics and exhibits a coefficient of variation 8% lower than C-Prompt, indicating superior stability.
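FATA 的核心是“单轮补问”:一次性生成全部澄清问题,再将用户补充信息与原始查询融合后作答。下面用 Python 给出一个最小示意(提示词模板与函数名均为笔者假设,并非论文原文实现):

```python
def build_clarification_prompt(query: str) -> str:
    """第一阶段:引导 LLM 针对原始查询一次性生成多维度补充问题。"""
    return (
        "用户问题:" + query + "\n"
        "请先不要回答。请从背景、约束、期望输出等维度,"
        "一次性列出回答该问题前需要用户补充的全部信息。"
    )

def build_final_prompt(query: str, answers: dict) -> str:
    """第二阶段:将用户的补充信息与原始查询融合后再求解。"""
    context = "\n".join(f"- {k}: {v}" for k, v in answers.items())
    return (f"原始问题:{query}\n用户补充信息:\n{context}\n"
            "请基于以上完整信息作答。")

prompt1 = build_clarification_prompt("帮我优化这段代码")
prompt2 = build_final_prompt("帮我优化这段代码",
                             {"语言": "Python", "目标": "降低内存占用"})
```

单轮生成全部问题正是 FATA 区别于多轮澄清(如 CLAM)的效率来源:对话长度固定为两次交互。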
zh

[AI-86] LLM-BI: Towards Fully Automated Bayesian Inference with Large Language Models

【速读】:该论文旨在解决贝叶斯推断(Bayesian Inference)在实际应用中因先验分布(prior distributions)和似然函数(likelihoods)的设定需要专业统计知识而难以普及的问题。其解决方案的关键在于提出并验证了基于大语言模型(Large Language Model, LLM)的自动化贝叶斯建模框架——LLM-BI(Large Language Model-driven Bayesian Inference),通过自然语言输入即可自动提取先验信息或完整指定模型结构,从而实现从问题描述到贝叶斯推断流程的端到端自动化。

链接: https://arxiv.org/abs/2508.08300
作者: Yongchao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 6 pages

点击查看摘要

Abstract:A significant barrier to the widespread adoption of Bayesian inference is the specification of prior distributions and likelihoods, which often requires specialized statistical expertise. This paper investigates the feasibility of using a Large Language Model (LLM) to automate this process. We introduce LLM-BI (Large Language Model-driven Bayesian Inference), a conceptual pipeline for automating Bayesian workflows. As a proof-of-concept, we present two experiments focused on Bayesian linear regression. In Experiment I, we demonstrate that an LLM can successfully elicit prior distributions from natural language. In Experiment II, we show that an LLM can specify the entire model structure, including both priors and the likelihood, from a single high-level problem description. Our results validate the potential of LLMs to automate key steps in Bayesian modeling, enabling the possibility of an automated inference pipeline for probabilistic programming.
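以贝叶斯线性回归中单个系数的共轭更新为例,下面的 Python 草图演示了“LLM 先验引出 + 自动推断”流水线中推断一步的最简形式(假设噪声方差已知;先验数值模拟 LLM 从自然语言中引出的结果,均为笔者假设,并非论文实现):

```python
def normal_posterior(mu0, tau0, sigma, data):
    """正态-正态共轭更新:先验 N(mu0, tau0^2),
    似然噪声标准差 sigma 已知,返回后验均值与后验标准差。"""
    n = len(data)
    xbar = sum(data) / n
    prec = 1 / tau0**2 + n / sigma**2          # 后验精度
    mu_post = (mu0 / tau0**2 + n * xbar / sigma**2) / prec
    return mu_post, prec ** -0.5

# 假设 LLM 从“系数大约在 1 附近,不确定性较大”引出先验 N(1, 2^2)
mu_post, sd_post = normal_posterior(mu0=1.0, tau0=2.0, sigma=1.0,
                                    data=[2.1, 1.9, 2.0, 2.2])
```

论文中的完整流水线会将这类由自然语言引出的先验与似然交给概率编程框架(如 PyMC、Stan 一类工具)求解;此处的解析更新只是该流程最简单的特例。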
zh

[AI-87] Channel-Wise MLPs Improve the Generalization of Recurrent Convolutional Networks

【速读】:该论文旨在解决递归卷积网络(Recurrent Convolutional Networks)在分布内(in-distribution)和分布外(out-of-distribution)场景下泛化能力不足的问题。其核心挑战在于如何提升模型对复杂计算模式的鲁棒性与适应性,尤其是在神经程序合成(Neural Program Synthesis)任务中。解决方案的关键在于引入门控多层感知机(gated MLP)进行通道混合(channel-wise mixing),从而增强模型对特征通道间交互关系的建模能力。实验表明,相较于仅使用简单递归卷积结构的DARC架构,引入通道混合机制的DAMP架构在Re-ARC基准测试中显著提升了泛化性能,证明了显式通道混合是提升递归卷积网络泛化能力的有效策略。

链接: https://arxiv.org/abs/2508.08298
作者: Nathan Breslow
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We investigate the impact of channel-wise mixing via multi-layer perceptrons (MLPs) on the generalization capabilities of recurrent convolutional networks. Specifically, we compare two architectures: DARC (Depth Aware Recurrent Convolution), which employs a simple recurrent convolutional structure, and DAMP (Depth Aware Multi-layer Perceptron), which extends DARC with a gated MLP for channel mixing. Using the Re-ARC benchmark, we find that DAMP significantly outperforms DARC in both in-distribution and out-of-distribution generalization under exact-match grading criteria. These results suggest that explicit channel mixing through MLPs enables recurrent convolutional networks to learn more robust and generalizable computational patterns. Our findings have implications for neural program synthesis and highlight the potential of DAMP as a target architecture for hypernetwork approaches.
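DAMP 中通道混合的核心是一个门控 MLP:先对通道向量做线性变换,再用 sigmoid 门控逐元素调制。下面是对应该思路的最小纯 Python 草图(矩阵规模、初始化方式与具体门控形式均为笔者假设,仅示意“显式通道混合”这一操作):

```python
import math
import random

def gated_mlp_channel_mix(x, w_in, w_gate, w_out):
    """对单个空间位置的通道向量 x 做门控 MLP 混合:
    y = W_out @ ((W_in @ x) * sigmoid(W_gate @ x))"""
    def matvec(w, v):
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    u = matvec(w_in, x)                                   # 升维
    g = [1 / (1 + math.exp(-z)) for z in matvec(w_gate, x)]  # 门控
    h = [ui * gi for ui, gi in zip(u, g)]                 # 逐元素调制
    return matvec(w_out, h)                               # 投影回通道数

random.seed(0)
C, H = 4, 8  # 通道数与隐藏维度(假设值)
rand_mat = lambda r, c: [[random.uniform(-0.5, 0.5) for _ in range(c)]
                         for _ in range(r)]
y = gated_mlp_channel_mix([1.0, -0.5, 0.2, 0.8],
                          rand_mat(H, C), rand_mat(H, C), rand_mat(C, H))
```

在递归卷积骨干中,该操作在每个空间位置独立作用于通道维,与卷积的空间混合互补。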
zh

[AI-88] An Efficient Application of Goal Programming to Tackle Multiobjective Problems with Recurring Fitness Landscapes

【速读】:该论文旨在解决在高度约束的多目标优化问题中,如何高效获取高质量近似解集(approximation set)的问题,尤其针对那些具有相似适应度景观(fitness landscape)特征的多个问题实例。解决方案的关键在于:首先使用计算代价较高的多目标算法求解其中一个代表性实例,以获得高质量的近似解集;随后,利用目标规划(Goal Programming)结合高效的单目标算法,快速求解其他相似实例。这种方法有效融合了先进多目标算法的求解精度与目标规划的计算效率,从而在保持解质量的同时显著缩短计算时间。

链接: https://arxiv.org/abs/2508.08297
作者: Rodrigo Lankaites Pinheiro,Dario Landa-Silva,Wasakorn Laesanklang,Ademir Aparecido Constantino
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many real-world applications require decision-makers to assess the quality of solutions while considering multiple conflicting objectives. Obtaining good approximation sets for highly constrained many-objective problems is often a difficult task even for modern multiobjective algorithms. In some cases, multiple instances of the problem scenario present similarities in their fitness landscapes. That is, there are recurring features in the fitness landscapes when searching for solutions to different problem instances. We propose a methodology to exploit this characteristic by solving one instance of a given problem scenario using computationally expensive multiobjective algorithms to obtain a good approximation set and then using Goal Programming with efficient single-objective algorithms to solve other instances of the same problem scenario. We use three goal-based objective functions and show that on benchmark instances of the multiobjective vehicle routing problem with time windows, the methodology is able to produce good results in short computation time. The methodology allows to combine the effectiveness of state-of-the-art multiobjective algorithms with the efficiency of goal programming to find good compromise solutions in problem scenarios where instances have similar fitness landscapes.
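目标规划将多目标问题转化为单目标:以已求解实例的近似解给出目标值 g,再最小化各目标对 g 的加权偏差。最小 Python 草图(候选解与权重为笔者假设的数据,仅示意标量化思路):

```python
def goal_deviation(objectives, goals, weights):
    """加权目标规划标量化:只惩罚劣于目标值的偏差
    (假设所有目标均为最小化方向)。"""
    return sum(w * max(0.0, f - g)
               for f, g, w in zip(objectives, goals, weights))

# 参考实例的近似解给出目标值 goals;在相似实例上比较两个候选解
goals, weights = [100.0, 8.0], [1.0, 10.0]   # 如:总距离、车辆数
cand_a = [105.0, 8.0]   # 距离稍差,车辆数达标
cand_b = [98.0, 9.0]    # 距离更优,但多用一辆车
best = min([cand_a, cand_b],
           key=lambda c: goal_deviation(c, goals, weights))
```

这样,同一场景中适应度景观相似的其余实例,即可交给高效的单目标算法去最小化该偏差函数,而无需每次都运行昂贵的多目标算法。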
zh

[AI-89] Topos Causal Models

【速读】:该论文旨在解决传统因果模型在处理复杂因果结构时的局限性,特别是如何形式化地描述和计算任意复杂因果图的“解”以及因果干预与等价类推理的问题。其核心解决方案是提出拓扑因果模型(topos causal models, TCMs),利用拓扑范畴(topos category)的三大关键性质:完备性与余完备性((co)completeness)、子对象分类器(subobject classifier)及指数对象(exponential objects)。其中,子对象分类器用于刻画因果干预生成子模型;极限与余极限提供了一种新颖的因果近似解释,使得任意因果图均可通过全局函数逼近;指数对象则支持对因果操作等价类(如边反转和因果同伦)进行统一推理。这些特性共同构成了一个具有内部逻辑(Mitchell-Benabou语言)的结构化框架,使TCMs能够在更广泛的数学语境下精确建模和分析因果关系。

链接: https://arxiv.org/abs/2508.08295
作者: Sridhar Mahadevan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages

点击查看摘要

Abstract:We propose topos causal models (TCMs), a novel class of causal models that exploit the key properties of a topos category: they are (co)complete, meaning all (co)limits exist, they admit a subobject classifier, and allow exponential objects. The main goal of this paper is to show that these properties are central to many applications in causal inference. For example, subobject classifiers allow a categorical formulation of causal intervention, which creates sub-models. Limits and colimits allow causal diagrams of arbitrary complexity to be "solved", using a novel interpretation of causal approximation. Exponential objects enable reasoning about equivalence classes of operations on causal models, such as covered edge reversal and causal homotopy. Analogous to structural causal models (SCMs), TCMs are defined by a collection of functions, each defining a local "autonomous" causal mechanism that assemble to induce a unique global function from exogenous to endogenous variables. Since the category of TCMs is (co)complete, which we prove in this paper, every causal diagram has a "solution" in the form of a (co)limit: this implies that any arbitrary causal model can be "approximated" by some global function with respect to the morphisms going into or out of the diagram. Natural transformations are crucial in measuring the quality of approximation. In addition, we show that causal interventions are modeled by subobject classifiers: any sub-model is defined by a monic arrow into its parent model. Exponential objects permit reasoning about entire classes of causal equivalences and interventions. Finally, as TCMs form a topos, they admit an internal logic defined as a Mitchell-Benabou language with an associated Kripke-Joyal semantics. We show how to reason about causal models in TCMs using this internal logic.
zh

[AI-90] Topos Theory for Generative AI and LLMs

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)架构设计中缺乏理论基础与结构多样性的问题,尤其是传统线性堆叠或专家混合(mixture-of-experts)架构的局限性。其解决方案的关键在于引入拓扑理论(topos theory),将LLM视为函数范畴中的对象,并证明该范畴满足拓扑结构所需的全部条件:包括所有(共)极限的存在性(即(co) completeness)、指数对象(exponential objects)以及子对象分类器(subobject classifier)。这一理论框架使得可以基于范畴论中的泛性质(universal properties)构造全新的、具有数学严谨性的LLM组合结构,如拉回(pullback)、推出(pushout)、(共)等化子((co) equalizers)等,从而为生成式AI(Generative AI)提供更丰富且可形式化的架构设计范式。

链接: https://arxiv.org/abs/2508.08293
作者: Sridhar Mahadevan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 30 pages

点击查看摘要

Abstract:We propose the design of novel categorical generative AI architectures (GAIAs) using topos theory, a type of category that is "set-like": a topos has all (co)limits, is Cartesian closed, and has a subobject classifier. Previous theoretical results on the Transformer model have shown that it is a universal sequence-to-sequence function approximator, and dense in the space of all continuous functions with compact support on the Euclidean space of embeddings of tokens. Building on this theoretical result, we explore novel architectures for LLMs that exploit the property that the category of LLMs, viewed as functions, forms a topos. Previous studies of large language models (LLMs) have focused on daisy-chained linear architectures or mixture-of-experts. In this paper, we use universal constructions in category theory to construct novel LLM architectures based on new types of compositional structures. In particular, these new compositional structures are derived from universal properties of LLM categories, and include pullback, pushout, (co)equalizers, exponential objects, and subobject classifiers. We theoretically validate these new compositional structures by showing that the category of LLMs is (co)complete, meaning that all diagrams have solutions in the form of (co)limits. Building on this completeness result, we then show that the category of LLMs forms a topos, a "set-like" category, which requires showing the existence of exponential objects as well as subobject classifiers. We use a functorial characterization of backpropagation to define a potential implementation of an LLM topos architecture.
zh

[AI-91] Understanding Transformers through the Lens of Pavlovian Conditioning

【速读】:该论文试图解决的问题是:Transformer架构中注意力机制的成功背后的计算原理尚不明确,尽管其在人工智能领域取得了显著成果。解决方案的关键在于提出一个全新的理论框架,将注意力的核心计算重新解释为巴甫洛夫条件反射(Pavlovian conditioning),并通过线性注意力(linear attention)建立数学类比,从而揭示注意力机制本质上是一种基于Hebbian规则的瞬态关联记忆构建过程。在此框架下,查询(query)、键(key)和值(value)分别对应条件刺激(CS)、测试刺激(test stimulus)和无条件刺激(US),使得每个注意力操作都可视为通过动态关联CS-US对来形成可被后续测试刺激检索的记忆单元。这一视角不仅提供了关于注意力头存储容量、误差传播特性及生物可实现学习规则的理论洞见,还暗示现代AI的成功可能源于对进化优化的计算原则的实现,而不仅仅是架构上的创新。

链接: https://arxiv.org/abs/2508.08289
作者: Mu Qiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Transformer architectures have revolutionized artificial intelligence (AI) through their attention mechanisms, yet the computational principles underlying their success remain opaque. We present a novel theoretical framework that reinterprets the core computation of attention as Pavlovian conditioning. Our model finds a direct mathematical analogue in linear attention, which simplifies the analysis of the underlying associative process. We demonstrate that attention’s queries, keys, and values can be mapped to the three elements of classical conditioning: test stimuli that probe associations, conditional stimuli (CS) that serve as retrieval cues, and unconditional stimuli (US) that contain response information. Through this lens, we suggest that each attention operation constructs a transient associative memory via a Hebbian rule, where CS-US pairs form dynamic associations that test stimuli can later retrieve. Our framework yields several theoretical insights grounded in this linearized model: (1) a capacity theorem showing that attention heads can store O(\sqrt{d_k}) associations before interference degrades retrieval; (2) an error propagation analysis revealing fundamental architectural trade-offs of balancing model depth, width, and head redundancy to maintain reliability; and (3) an understanding of how biologically plausible learning rules could enhance transformer architectures. By establishing this deep connection, we suggest that the success of modern AI may stem not from architectural novelty alone, but from implementing computational principles that biology optimized over millions of years of evolution.
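在该框架下,线性注意力等价于一次 Hebbian 外积累加:W = Σ v·kᵀ,检索即 y = W·q;当键相互正交时检索无干扰。最小纯 Python 草图(维度与示例向量为笔者假设):

```python
def hebbian_store(pairs, d):
    """以外积规则累加 CS-US 关联:W += v * k^T(线性注意力的核心)。"""
    W = [[0.0] * d for _ in range(d)]
    for k, v in pairs:
        for i in range(d):
            for j in range(d):
                W[i][j] += v[i] * k[j]
    return W

def retrieve(W, q):
    """用测试刺激 q 检索:y = W @ q。"""
    return [sum(W[i][j] * q[j] for j in range(len(q)))
            for i in range(len(W))]

# 两对正交的键-值关联(类比两组 CS-US 对)
k1, v1 = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
k2, v2 = [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
W = hebbian_store([(k1, v1), (k2, v2)], d=4)
recalled = retrieve(W, k1)   # 正交键下应无干扰地检索出 v1
```

当键不再正交时,检索结果混入其他关联的串扰,这正是摘要中 O(√d_k) 容量定理所刻画的干扰来源。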
zh

[AI-92] Multi-grained spatial-temporal feature complementarity for accurate online cellular traffic prediction KDD

【速读】:该论文旨在解决电信流量预测中因数据的突发性和非平稳性(bursty nature)导致的传统方法精度不足,以及概念漂移(concept drift)对持续预测任务准确率造成的影响。其核心解决方案是提出一种基于多粒度时空特征互补(Multi-Grained Spatial-Temporal feature Complementarity, MGSTC)的在线预测方法:通过粗粒度时间注意力机制提供预测时段的趋势参考,再利用细粒度空间注意力捕捉网络单元间的局部关联以精细化调整趋势,从而实现多粒度时空特征的有效协同;同时引入在线学习策略实时检测概念漂移并动态切换参数更新阶段,保障模型在连续预测场景下的稳定性和高精度表现。

链接: https://arxiv.org/abs/2508.08281
作者: Ningning Fu,Shengheng Liu,Weiliang Xie,Yongming Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: To appear in ACM TKDD. 26 pages, 12 figures,

点击查看摘要

Abstract:Knowledge discovered from telecom data can facilitate proactive understanding of network dynamics and user behaviors, which in turn empowers service providers to optimize cellular traffic scheduling and resource allocation. Nevertheless, the telecom industry still heavily relies on manual expert intervention. Existing studies have focused on exhaustively exploring the spatial-temporal correlations. However, they often overlook the underlying characteristics of cellular traffic, which are shaped by the sporadic and bursty nature of telecom services. Additionally, concept drift creates substantial obstacles to maintaining satisfactory accuracy in continuous cellular forecasting tasks. To resolve these problems, we put forward an online cellular traffic prediction method grounded in Multi-Grained Spatial-Temporal feature Complementarity (MGSTC). The proposed method is devised to achieve high-precision predictions in practical continuous forecasting scenarios. Concretely, MGSTC segments historical data into chunks and employs the coarse-grained temporal attention to offer a trend reference for the prediction horizon. Subsequently, fine-grained spatial attention is utilized to capture detailed correlations among network elements, which enables localized refinement of the established trend. The complementarity of these multi-grained spatial-temporal features facilitates the efficient transmission of valuable information. To accommodate continuous forecasting needs, we implement an online learning strategy that can detect concept drift in real-time and promptly switch to the appropriate parameter update stage. Experiments carried out on four real-world datasets demonstrate that MGSTC outperforms eleven state-of-the-art baselines consistently.
zh

[AI-93] XFMNet: Decoding Cross-Site and Nonstationary Water Patterns via Stepwise Multimodal Fusion for Long-Term Water Quality Forecasting

【速读】:该论文旨在解决多站点水体质量长期时间序列预测中的挑战,这些问题主要源于复杂的周期性、非平稳性以及由生态因素引起的突发波动,尤其在需要同时建模时间和空间动态的场景下更为突出。解决方案的关键在于提出XFMNet——一种分步的多模态融合网络,其核心创新包括:通过自适应下采样对齐水质序列与遥感降水图像的时间分辨率;利用局部自适应分解分离趋势与周期成分;设计交叉注意力门控融合模块,动态整合时间模式与空间及生态信息,从而增强对非平稳性和站点特异性异常的鲁棒性;并通过渐进式和递归融合机制捕捉长期趋势与短期波动。

链接: https://arxiv.org/abs/2508.08279
作者: Ziqi Wang,Hailiang Zhao,Cheng Bao,Wenzhuo Qian,Yuhao Yang,Xueqiang Sun,Shuiguang Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-term time-series forecasting is critical for environmental monitoring, yet water quality prediction remains challenging due to complex periodicity, nonstationarity, and abrupt fluctuations induced by ecological factors. These challenges are further amplified in multi-site scenarios that require simultaneous modeling of temporal and spatial dynamics. To tackle this, we introduce XFMNet, a stepwise multimodal fusion network that integrates remote sensing precipitation imagery to provide spatial and environmental context in river networks. XFMNet first aligns temporal resolutions between water quality series and remote sensing inputs via adaptive downsampling, followed by locally adaptive decomposition to disentangle trend and cycle components. A cross-attention gated fusion module dynamically integrates temporal patterns with spatial and ecological cues, enhancing robustness to nonstationarity and site-specific anomalies. Through progressive and recursive fusion, XFMNet captures both long-term trends and short-term fluctuations. Extensive experiments on real-world datasets demonstrate substantial improvements over state-of-the-art baselines, highlighting the effectiveness of XFMNet for spatially distributed time series prediction.
zh

[AI-94] Towards Heterogeneity-Aware and Energy-Efficient Topology Optimization for Decentralized Federated Learning in Edge Environment

【速读】:该论文旨在解决边缘计算(Edge Computing, EC)系统中去中心化联邦学习(Decentralized Federated Learning, DFL)面临的多重挑战,包括由拓扑动态变化导致的通信开销、资源异构性引发的能量效率低下以及数据异构性造成的模型性能下降。其核心解决方案是提出Hat-DFed框架,关键在于将拓扑构建建模为一个双目标优化问题(最大化模型性能与最小化累计能耗),并证明该问题为NP-hard;为此设计了一种两阶段算法,在动态构造最优通信拓扑的同时无偏估计其对模型性能和能耗的影响,并引入重要性感知的模型聚合机制以缓解数据异构性带来的性能退化。

链接: https://arxiv.org/abs/2508.08278
作者: Yuze Liu,Tiehua Zhang,Zhishu Shen,Libing Wu,Shiping Chen,Jiong Jin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) has emerged as a promising paradigm within edge computing (EC) systems, enabling numerous edge devices to collaboratively train artificial intelligence (AI) models while maintaining data privacy. To overcome the communication bottlenecks associated with centralized parameter servers, decentralized federated learning (DFL), which leverages peer-to-peer (P2P) communication, has been extensively explored in the research community. Although researchers design a variety of DFL approaches to ensure model convergence, the iterative learning process inevitably incurs considerable cost along with the growth of model complexity and the number of participants. These costs are largely influenced by the dynamic changes of topology in each training round, particularly its sparsity and connectivity conditions. Furthermore, the inherent resource heterogeneity in edge environments affects the energy efficiency of the learning process, while data heterogeneity degrades model performance. These factors pose significant challenges to the design of an effective DFL framework for EC systems. To this end, we propose Hat-DFed, a heterogeneity-aware and cost-effective decentralized federated learning (DFL) framework. In Hat-DFed, the topology construction is formulated as a dual optimization problem, which is then proven to be NP-hard, with the goal of maximizing model performance while minimizing cumulative energy consumption in complex edge environments. To solve this problem, we design a two-phase algorithm that dynamically constructs optimal communication topologies while unbiasedly estimating their impact on both model performance and energy cost. Additionally, the algorithm incorporates an importance-aware model aggregation mechanism to mitigate performance degradation caused by data heterogeneity.
zh

[AI-95] emg2tendon: From sEMG Signals to Tendon Control in Musculoskeletal Hands

【速读】:该论文旨在解决腱驱动(tendon-driven)机器人手在学习控制策略时面临的两大挑战:一是缺乏运动捕捉(mocap)数据与腱控制之间的直接一一映射关系,导致学习过程复杂且成本高;二是现实场景中视觉跟踪易受遮挡和精度不足影响,难以准确追踪关节运动。针对这些问题,作者提出构建首个大规模的肌电到腱控制(emg2tendon)数据集,扩展了现有的emg2pose数据集,包含193名受试者、370小时、29种手势的记录,并通过MyoSuite MyoHand模型生成可靠的腱控制信号,克服了先前方法中存在的无效姿态问题。解决方案的关键在于提供高质量、大规模的sEMG到腱控制的映射数据集,并引入一种基于扩散模型(diffusion-based regression model)的新方法,显著提升了从表面肌电信号(sEMG)预测腱控制的准确性与鲁棒性,为腱驱动灵巧机器人操作提供了可扩展、精准的控制基础。

链接: https://arxiv.org/abs/2508.08269
作者: Sagar Verma
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted in Robotics: Science and Systems (RSS 2025)

点击查看摘要

Abstract:Tendon-driven robotic hands offer unparalleled dexterity for manipulation tasks, but learning control policies for such systems presents unique challenges. Unlike joint-actuated robotic hands, tendon-driven systems lack a direct one-to-one mapping between motion capture (mocap) data and tendon controls, making the learning process complex and expensive. Additionally, visual tracking methods for real-world applications are prone to occlusions and inaccuracies, further complicating joint tracking. Wrist-wearable surface electromyography (sEMG) sensors present an inexpensive, robust alternative to capture hand motion. However, mapping sEMG signals to tendon control remains a significant challenge despite the availability of EMG-to-pose data sets and regression-based models in the existing literature. We introduce the first large-scale EMG-to-Tendon Control dataset for robotic hands, extending the emg2pose dataset, which includes recordings from 193 subjects, spanning 370 hours and 29 stages with diverse gestures. This dataset incorporates tendon control signals derived using the MyoSuite MyoHand model, addressing limitations such as invalid poses in prior methods. We provide three baseline regression models to demonstrate emg2tendon utility and propose a novel diffusion-based regression model for predicting tendon control from sEMG recordings. This dataset and modeling framework marks a significant step forward for tendon-driven dexterous robotic manipulation, laying the groundwork for scalable and accurate tendon control in robotic hands. this https URL
zh

[AI-96] EGGCodec: A Robust Neural Encodec Framework for EGG Reconstruction and F0 Extraction

【速读】:该论文旨在解决电声门图(Electroglottography, EGG)信号重建与基频(Fundamental Frequency, F0)提取中的精度与泛化能力不足问题。其核心解决方案是提出一种名为EGGCodec的神经编码器框架,通过设计多尺度频域损失函数以捕捉原始与重构EGG信号间的细微关系,并引入时域相关性损失以提升模型的泛化性能和准确性;同时,EGGCodec创新性地绕过传统Encodec模型中直接从特征中提取F0的步骤,转而利用重构后的EGG信号进行F0估计,从而更准确地映射声门开合动态与F0之间的物理关联。此外,移除GAN判别器简化了训练流程且未显著影响性能,使系统更具实用性与稳定性。

链接: https://arxiv.org/abs/2508.08924
作者: Rui Feng,Yuang Chen,Yu Hu,Jun Du,Jiahong Yuan
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures, to be appeared in IEEE Signal Processing Letters

点击查看摘要

Abstract:This letter introduces EGGCodec, a robust neural Encodec framework engineered for electroglottography (EGG) signal reconstruction and F0 extraction. We propose a multi-scale frequency-domain loss function to capture the nuanced relationship between original and reconstructed EGG signals, complemented by a time-domain correlation loss to improve generalization and accuracy. Unlike conventional Encodec models that extract F0 directly from features, EGGCodec leverages reconstructed EGG signals, which more closely correspond to F0. By removing the conventional GAN discriminator, we streamline EGGCodec’s training process without compromising efficiency, incurring only negligible performance degradation. Trained on a widely used EGG-inclusive dataset, extensive evaluations demonstrate that EGGCodec outperforms state-of-the-art F0 extraction schemes, reducing mean absolute error (MAE) from 14.14 Hz to 13.69 Hz, and improving voicing decision error (VDE) by 38.2%. Moreover, extensive ablation experiments validate the contribution of each component of EGGCodec.
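多尺度频域损失的思想是:在多个窗口长度下比较原始与重构信号的频谱幅度差。下面用标准库 cmath 写一个朴素 DFT 版本的最小示意(窗口尺度、L1 距离以及朴素 DFT 都是笔者为演示所作的假设;实际系统会用 FFT/STFT):

```python
import cmath

def dft_mag(x):
    """朴素 DFT 幅度谱(O(N^2),仅作示意)。"""
    N = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) for k in range(N)]

def multiscale_freq_loss(x, y, scales=(4, 8)):
    """在多个窗口尺度下累加原始/重构信号幅度谱的 L1 距离。"""
    loss = 0.0
    for s in scales:
        for start in range(0, len(x) - s + 1, s):
            mx = dft_mag(x[start:start + s])
            my = dft_mag(y[start:start + s])
            loss += sum(abs(a - b) for a, b in zip(mx, my))
    return loss

sig = [float(i % 3) for i in range(16)]
zero_loss = multiscale_freq_loss(sig, sig)   # 完美重构时损失为 0
```

EGGCodec 在此类频域项之外还叠加了时域相关性损失,以兼顾波形细节与泛化能力。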
zh

[AI-97] ReQuestNet: A Foundational Learning model for Channel Estimation

【速读】:该论文旨在解决5G及未来通信系统中信道估计(Channel Estimation, CE)的复杂性与性能瓶颈问题,尤其针对资源块(Resource Block, RB)数量可变、传输层数动态变化、物理资源块组(Physical Resource Block Group, PRG)捆绑大小(Bundling Size, BS)不固定、解调参考信号(Demodulation Reference Signal, DMRS)模式多样等实际场景下,传统线性最小均方误差(MMSE)方法难以适应且性能受限的问题。解决方案的关键在于提出一种新型神经网络架构——ReQuestNet,其由CoarseNet和RefinementNet两部分组成:CoarseNet实现每个PRG内各发射-接收(Tx-Rx)流的初步信道估计,RefinementNet则通过融合不同预编码PRG间的相关性以及多输入多输出(MIMO)信道空间维度间的跨MIMO相关性,对粗估计结果进行精细化修正,从而在无需已知预编码信息的情况下联合处理MIMO层与不同预编码信道,显著提升估计精度并具备良好的泛化能力。

链接: https://arxiv.org/abs/2508.08790
作者: Kumar Pratik,Pouriya Sadeghi,Gabriele Cesa,Sanaz Barghi,Joseph B. Soriaga,Yuanning Yu,Supratik Bhattacharjee,Arash Behboodi
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted at IEEE Globecom 2025. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:In this paper, we present a novel neural architecture for channel estimation (CE) in 5G and beyond, the Recurrent Equivariant UERS Estimation Network (ReQuestNet). It incorporates several practical considerations in wireless communication systems, such as ability to handle variable number of resource block (RB), dynamic number of transmit layers, physical resource block groups (PRGs) bundling size (BS), demodulation reference signal (DMRS) patterns with a single unified model, thereby, drastically simplifying the CE pipeline. Besides it addresses several limitations of the legacy linear MMSE solutions, for example, by being independent of other reference signals and particularly by jointly processing MIMO layers and differently precoded channels with unknown precoding at the receiver. ReQuestNet comprises of two sub-units, CoarseNet followed by RefinementNet. CoarseNet performs per PRG, per transmit-receive (Tx-Rx) stream channel estimation, while RefinementNet refines the CoarseNet channel estimate by incorporating correlations across differently precoded PRGs, and correlation across multiple input multiple output (MIMO) channel spatial dimensions (cross-MIMO). Simulation results demonstrate that ReQuestNet significantly outperforms genie minimum mean squared error (MMSE) CE across a wide range of channel conditions, delay-Doppler profiles, achieving up to 10dB gain at high SNRs. Notably, ReQuestNet generalizes effectively to unseen channel profiles, efficiently exploiting inter-PRG and cross-MIMO correlations under dynamic PRG BS and varying transmit layer allocations.
zh

[AI-98] The DNA of nuclear models: How AI predicts nuclear masses

【速读】:该论文旨在解决核质量(或等价的核结合能 E_b)高精度预测问题,同时克服当前基于人工智能(AI)模型在缺乏实验数据时难以评估外推可靠性的问题。其解决方案的关键在于提出了一种可解释的AI模型,该模型不仅实现了前沿精度,还通过内部表示结构揭示了物理意义:例如,模型的核心维度呈现出双螺旋结构,其中类似DNA氢键的关联连接了同位素链中最稳定核素的质子数与中子数;此外,模型对 E_b 的预测可被层级分解为一系列符号化项,这些项对应于经典核物理模型(如液滴模型),而性能提升几乎完全归因于Jaffe在1969年提出的观测结果。最终形成一个完全可解释的数据驱动型核质量模型。

链接: https://arxiv.org/abs/2508.08370
作者: Kate A. Richardson,Sokratis Trifinopoulos,Mike Williams
机构: 未知
类目: Nuclear Theory (nucl-th); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex)
备注: 19 pages, 12 figures

点击查看摘要

Abstract:Obtaining high-precision predictions of nuclear masses, or equivalently nuclear binding energies, E_b , remains an important goal in nuclear-physics research. Recently, many AI-based tools have shown promising results on this task, some achieving precision that surpasses the best physics models. However, the utility of these AI models remains in question given that predictions are only useful where measurements do not exist, which inherently requires extrapolation away from the training (and testing) samples. Since AI models are largely black boxes, the reliability of such an extrapolation is difficult to assess. We present an AI model that not only achieves cutting-edge precision for E_b , but does so in an interpretable manner. For example, we find (and explain why) that the most important dimensions of its internal representation form a double helix, where the analog of the hydrogen bonds in DNA here link the number of protons and neutrons found in the most stable nucleus of each isotopic chain. Furthermore, we show that the AI prediction of E_b can be factorized and ordered hierarchically, with the most important terms corresponding to well-known symbolic models (such as the famous liquid drop). Remarkably, the improvement of the AI model over symbolic ones can almost entirely be attributed to an observation made by Jaffe in 1969. The end result is a fully interpretable data-driven model of nuclear masses.
zh
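上文摘要指出,该 AI 模型对 E_b 的预测可分解为与经典符号模型(如液滴模型)对应的层级项。作为背景,下面给出液滴模型(半经验质量公式)的一个最小 Python 示意;系数取常见教材近似值,并非论文实际使用的参数或实现:

```python
# 半经验质量公式(液滴模型)示意实现;系数为常见教材近似值,非论文所用参数
def liquid_drop_binding_energy(Z, N):
    """返回核 (Z, N) 的近似结合能 E_b,单位 MeV。"""
    A = Z + N
    a_V, a_S, a_C, a_A, a_P = 15.75, 17.8, 0.711, 23.7, 11.18  # MeV
    E = (a_V * A                              # 体积项
         - a_S * A ** (2 / 3)                 # 表面项
         - a_C * Z * (Z - 1) / A ** (1 / 3)   # 库仑项
         - a_A * (A - 2 * Z) ** 2 / A)        # 对称项
    if Z % 2 == 0 and N % 2 == 0:             # 偶偶核:配对项为正
        E += a_P / A ** 0.5
    elif Z % 2 == 1 and N % 2 == 1:           # 奇奇核:配对项为负
        E -= a_P / A ** 0.5
    return E

# 以 Fe-56 (Z=26, N=30) 为例:每核子结合能约 8.8~8.9 MeV,与实验值量级一致
print(round(liquid_drop_binding_energy(26, 30) / 56, 2))
```

论文的结论之一正是:AI 模型的最重要分解项与这类符号模型高度对应。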

[AI-99] Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms

【速读】:该论文旨在解决在线卖家在电商平台中采用学习算法进行定价与广告决策时可能引发的隐性共谋(tacit collusion)问题,即算法是否会引导市场走向高于竞争水平的价格。其解决方案的关键在于通过多智能体强化学习(multi-agent reinforcement learning)对大规模高频关键词-产品数据集(超过200万种商品)进行实证分析,发现当消费者搜索成本较高时,算法能实现三方共赢——降低价格、减少广告支出并提升平台效率,其机制是算法学会协调较低的广告出价以降低整体成本,从而推动价格低于竞争均衡水平;此外,研究还揭示了平台通过调整佣金而非保留价格可提升利润,为政策制定者和平台提供基于实证的监管与优化依据。

链接: https://arxiv.org/abs/2508.08325
作者: Hangcheng Zhao,Ron Berman
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Online sellers have been adopting AI learning algorithms to automatically make product pricing and advertising decisions on e-commerce platforms. When sellers compete using such algorithms, one concern is that of tacit collusion - the algorithms learn to coordinate on prices higher than competitive levels. We empirically investigate whether these concerns are valid when sellers make pricing and advertising decisions together, i.e., two-dimensional decisions. Our empirical strategy is to analyze competition with multi-agent reinforcement learning, which we calibrate to a large-scale dataset collected from this http URL products. Our first contribution is to find conditions under which learning algorithms can facilitate win-win-win outcomes that are beneficial for consumers, sellers, and even the platform, when consumers have high search costs. In these cases the algorithms learn to coordinate on prices that are lower than competitive prices. The intuition is that the algorithms learn to coordinate on lower advertising bids, which lower advertising costs, leading to lower prices. Our second contribution is an analysis of a large-scale, high-frequency keyword-product dataset for more than 2 million products on this http URL. Our estimates of consumer search costs show a wide range of costs for different product keywords. We generate an algorithm usage index and find a negative interaction between the estimated consumer search costs and the algorithm usage index, providing empirical evidence of beneficial collusion. Finally, we analyze the platform’s strategic response. We find that reserve price adjustments will not increase profits for the platform, but commission adjustments will. Our analyses help alleviate some worries about the potentially harmful effects of competing learning algorithms, and can help sellers, platforms and policymakers to decide on whether to adopt or regulate such algorithms.
zh

[AI-100] Constrained PSLQ Search for Machin-like Identities Achieving Record-Low Lehmer Measures

【速读】:该论文旨在解决如何高效发现低 Lehmer 测度(Lehmer measure, λ)的 Machin-like arctangent 恒等式以计算圆周率 π 的问题。其核心挑战在于传统搜索方法难以在大规模空间中找到具有极低 λ 值的公式。解决方案的关键在于将整数关系算法 PSLQ 与基于高斯整数代数结构的数论过滤器相结合,从而显著缩小搜索空间并提升计算效率;在此基础上,作者成功发现了具有记录性低 Lehmer 测度的新公式(λ=1.4572 和 λ=1.3291),并进一步展示了通过算法扩展从这些基础关系生成更长公式的方法,为未来探索提供了可靠路径。

链接: https://arxiv.org/abs/2508.08307
作者: Nick Craig-Wood
机构: 未知
类目: Number Theory (math.NT); Artificial Intelligence (cs.AI)
备注: 22 pages, 2 tables, 4035 words

点击查看摘要

Abstract:Machin-like arctangent relations are classical tools for computing \pi , with efficiency quantified by the Lehmer measure ( \lambda ). We present a framework for discovering low-measure relations by coupling the PSLQ integer-relation algorithm with number-theoretic filters derived from the algebraic structure of Gaussian integers, making large scale search tractable. Our search yields new 5 and 6 term relations with record-low Lehmer measures ( \lambda=1.4572, \lambda=1.3291 ). We also demonstrate how discovered relations can serve as a basis for generating new, longer formulae through algorithmic extensions. This combined approach of a constrained PSLQ search and algorithmic extension provides a robust method for future explorations.
zh
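摘要中用 Lehmer 测度衡量 Machin 型公式计算 π 的效率,其定义为 λ = Σ 1/log₁₀(x_i)(针对形如 π/4 = Σ c_i·arctan(1/x_i) 的公式)。下面的 Python 片段按此定义计算经典 Machin 公式的测度,并数值验证该恒等式,仅作原理示意:

```python
import math

# Lehmer 测度:lambda = sum(1 / log10(x_i)),x_i 为各 arctan(1/x_i) 项的分母
def lehmer_measure(xs):
    return sum(1 / math.log10(x) for x in xs)

# 经典 Machin 公式:pi/4 = 4*arctan(1/5) - arctan(1/239)
coeffs_xs = [(4, 5), (-1, 239)]
value = sum(c * math.atan(1 / x) for c, x in coeffs_xs)
assert abs(value - math.pi / 4) < 1e-12   # 数值验证恒等式成立

print(round(lehmer_measure([x for _, x in coeffs_xs]), 4))  # ≈ 1.8511
```

经典 Machin 公式的 λ ≈ 1.8511,可与论文报告的记录值 1.4572 和 1.3291 对比,λ 越小计算效率越高。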

[AI-101] On the Effects of Smoothing Rugged Landscape by Different Toy Problems: A Case Study on UBQP

【速读】:该论文旨在解决无约束二元二次规划(Unconstrained Binary Quadratic Program, UBQP)问题因复杂崎岖的解空间景观(rugged landscape)而导致的优化困难。其解决方案的关键在于采用景观平滑策略,通过构建原始UBQP与一个“玩具”UBQP(toy UBQP)的凸组合来降低问题的局部极小值密度,从而改善搜索效率。具体而言,研究对比了三种不同构造方式的toy UBQP:基于±1矩阵(^Q1)、基于±i矩阵(^Q2)以及随机生成矩阵(^Q3),发现使用^Q2构造的toy UBQP能最有效地提升LSILS算法性能,表明景观平滑效果与toy UBQP的结构特性密切相关。

链接: https://arxiv.org/abs/2407.19676
作者: Wei Wang,Jialong Shi,Jianyong Sun,Arnaud Liefooghe,Qingfu Zhang,Ye Fan
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The hardness of the Unconstrained Binary Quadratic Program (UBQP) problem is due to its rugged landscape. Various algorithms have been proposed for UBQP, including the Landscape Smoothing Iterated Local Search (LSILS). Different from other UBQP algorithms, LSILS tries to smooth the rugged landscape by building a convex combination of the original UBQP and a toy UBQP. In this paper, our study further investigates the impact of smoothing rugged landscapes using different toy UBQP problems, including a toy UBQP with matrix ^Q1 (constructed by “+/-1”), a toy UBQP with matrix ^Q2 (constructed by “+/-i”) and a toy UBQP with matrix ^Q3 (constructed randomly). We first assess the landscape flatness of the three toy UBQPs. Subsequently, we test the efficiency of LSILS with different toy UBQPs. Results reveal that the toy UBQP with ^Q1 (constructed by “+/-1”) exhibits the flattest landscape among the three, while the toy UBQP with ^Q3 (constructed randomly) presents the most non-flat landscape. Notably, LSILS using the toy UBQP with ^Q2 (constructed by “+/-i”) emerges as the most effective, while ^Q3 (constructed randomly) has the poorest result. These findings contribute to a detailed understanding of landscape smoothing techniques in optimizing UBQP.
zh
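摘要中的景观平滑通过原 UBQP 与玩具 UBQP 的凸组合实现。下面给出该凸组合目标函数的最小 Python 示意;玩具矩阵按摘要中 “+/-1” 的方式随机构造,矩阵规模与随机种子均为演示用假设:

```python
import random

# UBQP 目标:f(x) = x^T Q x,x ∈ {0,1}^n;景观平滑取原问题与玩具问题的凸组合
def ubqp_value(Q, x):
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def smoothed_value(Q, Q_toy, x, lam):
    """凸组合 f_lam(x) = (1-lam)*x^T Q x + lam*x^T Q_toy x,lam ∈ [0,1]。"""
    return (1 - lam) * ubqp_value(Q, x) + lam * ubqp_value(Q_toy, x)

random.seed(0)
n = 6
Q = [[random.randint(-5, 5) for _ in range(n)] for _ in range(n)]        # 原问题
Q_toy = [[random.choice([-1, 1]) for _ in range(n)] for _ in range(n)]   # "+/-1" 玩具矩阵
x = [random.randint(0, 1) for _ in range(n)]

# lam=0 还原原问题,lam=1 完全为玩具问题,中间值对应不同程度的平滑
assert smoothed_value(Q, Q_toy, x, 0.0) == ubqp_value(Q, x)
assert smoothed_value(Q, Q_toy, x, 1.0) == ubqp_value(Q_toy, x)
```

LSILS 正是在这一系列平滑后的目标函数上做迭代局部搜索,逐步将 lam 退回 0。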

[AI-102] A New Parallel Cooperative Landscape Smoothing Algorithm and Its Applications on TSP and UBQP

【速读】:该论文旨在解决组合优化问题(Combinatorial Optimization Problem, COP)因解空间中存在大量局部最优解而导致求解困难的问题。其核心解决方案是通过引入同伦凸(Homotopic Convex, HC)变换框架,构造一个具有单一全局最优解的“玩具”无约束二次规划问题(Unconstrained Binary Quadratic Programming, UBQP),并理论证明该玩具问题的单峰性(unimodality)。关键在于利用此单峰结构对原始UBQP进行平滑处理,从而降低搜索难度;进一步提出基于HC变换的迭代局部搜索算法(Landscape Smoothing Iterated Local Search, LSILS),并通过实验验证其有效性,同时发展出并行协作版本(PC-LSILS),在UBQP和旅行商问题(TSP)上均展现出更优的平滑效果与整体性能。

链接: https://arxiv.org/abs/2401.03237
作者: Wei Wang,Jialong Shi,Jianyong Sun,Arnaud Liefooghe,Qingfu Zhang
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Combinatorial optimization problems (COPs) are difficult to solve because of the massive number of local optimal solutions in their solution space. Various methods have been put forward to smooth the solution space of COPs, including homotopic convex (HC) transformation for the traveling salesman problem (TSP). This paper extends the HC transformation approach to unconstrained binary quadratic programming (UBQP) by proposing a method to construct a unimodal toy UBQP of any size. We theoretically prove the unimodality of the constructed toy UBQP. After that, we apply this unimodal toy UBQP to smooth the original UBQP by using the HC transformation framework and empirically verify the smoothing effects. Subsequently, we introduce an iterative algorithmic framework incorporating HC transformation, referred to as landscape smoothing iterated local search (LSILS). Our experimental analyses, conducted on various UBQP instances, show the effectiveness of LSILS. Furthermore, this paper proposes a parallel cooperative variant of LSILS, denoted as PC-LSILS, and applies it to both the UBQP and the TSP. Our experimental findings highlight that PC-LSILS improves the smoothing performance of the HC transformation, and further improves the overall performance of the algorithm.
zh

机器学习

[LG-0] Deep Neural Network Calibration by Reducing Classifier Shift with Stochastic Masking

链接: https://arxiv.org/abs/2508.09116
作者: Jiani Ni,He Zhao,Yibo Yang,Dandan Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, deep neural networks (DNNs) have shown competitive results in many fields. Despite this success, they often suffer from poor calibration, especially in safety-critical scenarios such as autonomous driving and healthcare, where unreliable confidence estimates can lead to serious consequences. Recent studies have focused on improving calibration by modifying the classifier, yet such efforts remain limited. Moreover, most existing approaches overlook calibration errors caused by underconfidence, which can be equally detrimental. To address these challenges, we propose MaC-Cal, a novel mask-based classifier calibration method that leverages stochastic sparsity to enhance the alignment between confidence and accuracy. MaC-Cal adopts a two-stage training scheme with adaptive sparsity, dynamically adjusting mask retention rates based on the deviation between confidence and accuracy. Extensive experiments show that MaC-Cal achieves superior calibration performance and robustness under data corruption, offering a practical and effective solution for reliable confidence estimation in DNNs.
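摘要关注置信度与准确率的对齐。校准质量常用期望校准误差(ECE, Expected Calibration Error)度量;下面给出分箱 ECE 的最小 Python 示意(与论文的 MaC-Cal 方法无关,仅说明该度量本身):

```python
# 分箱 ECE:按置信度分箱,对每箱的 |平均置信度 - 准确率| 按样本数加权求和
def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - acc)
    return ece

# 完美校准的例子:置信度 0.8 的样本中恰有 80% 正确 -> ECE 接近 0
confs = [0.8] * 10
corr = [1] * 8 + [0] * 2
print(round(expected_calibration_error(confs, corr), 6))  # 0.0
```

过度自信(置信度高于准确率)与欠自信(置信度低于准确率)都会增大 ECE,这正是摘要强调欠自信同样有害的原因。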

[LG-1] Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving

链接: https://arxiv.org/abs/2508.09099
作者: Tianyun Yang,Yunwen Li,Ziniu Li,Zhihang Lin,Ruoyu Sun,Tian Ding
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large vision language models exhibit notable limitations on Geometry Problem Solving (GPS) because of their unreliable diagram interpretation and pure natural-language reasoning. A recent line of work mitigates this by using symbolic solvers: the model directly generates a formal program that a geometry solver can execute. However, this direct program generation lacks intermediate reasoning, making the decision process opaque and prone to errors. In this work, we explore a new approach that integrates Chain-of-Thought (CoT) with formal language. The model interleaves natural language reasoning with incremental emission of solver-executable code, producing a hybrid reasoning trace in which critical derivations are expressed in formal language. To teach this behavior at scale, we combine (1) supervised fine-tuning on an 11K newly developed synthetic dataset with interleaved natural language reasoning and automatic formalization, and (2) solver-in-the-loop reinforcement learning that jointly optimizes both the CoT narrative and the resulting program through outcome-based rewards. Built on Qwen2.5-VL-7B, our new model, named GF-Reasoner, achieves up to 15% accuracy improvements on standard GPS benchmarks, surpassing both 7B-scale peers and the much larger model Qwen2.5-VL-72B. By exploiting high-order geometric knowledge and offloading symbolic computation to the solver, the generated reasoning traces are noticeably shorter and cleaner. Furthermore, we present a comprehensive analysis of method design choices (e.g., reasoning paradigms, data synthesis, training epochs, etc.), providing actionable insights for future research.

[LG-2] Chi-Geometry: A Library for Benchmarking Chirality Prediction of GNNs

链接: https://arxiv.org/abs/2508.09097
作者: Rylie Weaver,Massamiliano Lupo Pasini
类目: Machine Learning (cs.LG)
*备注: 21 pages total: 9 pages main text, 4 pages references, 8 pages appendices. 4 figures and 7 tables

点击查看摘要

Abstract:We introduce Chi-Geometry - a library that generates graph data for testing and benchmarking GNNs’ ability to predict chirality. Chi-Geometry generates synthetic graph samples with (i) user-specified geometric and topological traits to isolate certain types of samples and (ii) randomized node positions and species to minimize extraneous correlations. Each generated graph contains exactly one chiral center labeled either R or S, while all other nodes are labeled N/A (non-chiral). The generated samples are then combined into a cohesive dataset that can be used to assess a GNN’s ability to predict chirality as a node classification task. Chi-Geometry allows more interpretable and less confounding benchmarking of GNNs for prediction of chirality in the graph samples which can guide the design of new GNN architectures with improved predictive performance. We illustrate Chi-Geometry’s efficacy by using it to generate synthetic datasets for benchmarking various state-of-the-art (SOTA) GNN architectures. The conclusions of these benchmarking results guided our design of two new GNN architectures. The first GNN architecture established all-to-all connections in the graph to accurately predict chirality across all challenging configurations where previously tested SOTA models failed, but at a computational cost (both for training and inference) that grows quadratically with the number of graph nodes. The second GNN architecture avoids all-to-all connections by introducing a virtual node in the original graph structure of the data, which restores the linear scaling of training and inference computational cost with respect to the number of nodes in the graph, while still ensuring competitive accuracy in detecting chirality with respect to SOTA GNN architectures.

[LG-3] Scaling Up Active Testing to Large Language Models

链接: https://arxiv.org/abs/2508.09093
作者: Gabrielle Berrada,Jannik Kossen,Muhammed Razzak,Freddie Bickford Smith,Yarin Gal,Tom Rainforth
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Active testing enables label-efficient evaluation of models through careful data acquisition. However, its significant computational costs have previously undermined its use for large models. We show how it can be successfully scaled up to the evaluation of large language models (LLMs). In particular we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without computing predictions with the target model and further introduce a single-run error estimator to assess how well active testing is working on the fly. We find that our approach is able to more effectively evaluate LLM performance with less data than current standard practices.

[LG-4] Meta-learning optimizes predictions of missing links in real-world networks

链接: https://arxiv.org/abs/2508.09069
作者: Bisman Singh,Lucy Van Kleunen,Aaron Clauset
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 10 pages, 5 figures, 5 tables, 7 appendices

点击查看摘要

Abstract:Relational data are ubiquitous in real-world data applications, e.g., in social network analysis or biological modeling, but networks are nearly always incompletely observed. The state-of-the-art for predicting missing links in the hard case of a network without node attributes uses model stacking or neural network techniques. It remains unknown which approach is best, and whether or how the best choice of algorithm depends on the input network’s characteristics. We answer these questions systematically using a large, structurally diverse benchmark of 550 real-world networks under two standard accuracy measures (AUC and Top-k), comparing four stacking algorithms with 42 topological link predictors, two of which we introduce here, and two graph neural network algorithms. We show that no algorithm is best across all input networks, all algorithms perform well on most social networks, and few perform well on economic and biological networks. Overall, model stacking with a random forest is both highly scalable and surpasses graph neural networks on AUC or is competitive with them on Top-k accuracy. But, algorithm performance depends strongly on network characteristics like the degree distribution, triangle density, and degree assortativity. We introduce a meta-learning algorithm that exploits this variability to optimize link predictions for individual networks by selecting the best algorithm to apply, which we show outperforms all state-of-the-art algorithms and scales to large networks.
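摘要对比了 42 个拓扑链路预测器与模型堆叠。下面用纯 Python 演示其中两类最经典的拓扑预测器(共同邻居数与 Jaccard 系数)的计算方式;示例图为演示用假设,并非论文基准数据:

```python
# 两个经典拓扑链路预测器:共同邻居数与 Jaccard 系数(邻接集合表示)
def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

# 小示例图:边为 0-1, 0-2, 1-2, 1-3, 2-3,候选缺失边为 (0, 3)
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}

print(common_neighbors(adj, 0, 3))  # 2:节点 0 与 3 共享邻居 {1, 2}
print(jaccard(adj, 0, 3))           # 1.0:邻居集合完全重合,强烈提示存在边
```

模型堆叠的思路即是把多个此类预测器的得分作为特征,交由随机森林等元分类器组合。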

[LG-5] Developing a Transferable Federated Network Intrusion Detection System

链接: https://arxiv.org/abs/2508.09060
作者: Abu Shafin Mohammad Mahdee Jameel,Shreya Ghosh,Aly El Gamal
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注: Currently under review

点击查看摘要

Abstract:Intrusion Detection Systems (IDS) are a vital part of a network-connected device. In this paper, we develop a deep learning based intrusion detection system that is deployed in a distributed setup across devices connected to a network. Our aim is to better equip deep learning models against unknown attacks using knowledge from known attacks. To this end, we develop algorithms to maximize the number of transferability relationships. We propose a Convolutional Neural Network (CNN) model, along with two algorithms that maximize the number of relationships observed. One is a two step data pre-processing stage, and the other is a Block-Based Smart Aggregation (BBSA) algorithm. The proposed system succeeds in achieving superior transferability performance while maintaining impressive local detection rates. We also show that our method is generalizable, exhibiting transferability potential across datasets and even with different backbones. The code for this work can be found at this https URL.

[LG-6] Causal Machine Learning for Patient-Level Intraoperative Opioid Dose Prediction from Electronic Health Records

链接: https://arxiv.org/abs/2508.09059
作者: Jonas Valbjørn Andersena,Anders Peder Højer Karlsen,Markus Harboe Olsen,Nikolaj Krebs Pedersen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper introduces the OPIAID algorithm, a novel approach for predicting and recommending personalized opioid dosages for individual patients. The algorithm optimizes pain management while minimizing opioid related adverse events (ORADE) by employing machine learning models trained on observational electronic health records (EHR) data. It leverages a causal machine learning approach to understand the relationship between opioid dose, case specific patient and intraoperative characteristics, and pain versus ORADE outcomes. The OPIAID algorithm considers patient-specific characteristics and the influence of different opiates, enabling personalized dose recommendations. This paper outlines the algorithm’s methodology and architecture, and discusses key assumptions, and approaches to evaluating its performance.

[LG-7] FetFIDS: A Feature Embedding Attention based Federated Network Intrusion Detection Algorithm

链接: https://arxiv.org/abs/2508.09056
作者: Shreya Ghosh,Abu Shafin Mohammad Mahdee Jameel,Aly El Gamal
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Intrusion Detection Systems (IDS) have an increasingly important role in preventing exploitation of network vulnerabilities by malicious actors. Recent deep learning based developments have resulted in significant improvements in the performance of IDS systems. In this paper, we present FetFIDS, where we explore the employment of feature embedding instead of positional embedding to improve intrusion detection performance of a transformer based deep learning system. Our model is developed with the aim of deployments in edge learning scenarios, where federated learning over multiple communication rounds can ensure both privacy and localized performance improvements. FetFIDS outperforms multiple state-of-the-art intrusion detection systems in a federated environment and demonstrates a high degree of suitability to federated learning. The code for this work can be found at this https URL.

[LG-8] MechaFormer: Sequence Learning for Kinematic Mechanism Design Automation

链接: https://arxiv.org/abs/2508.09005
作者: Diana Bolanos,Mohammadmehdi Ataei,Pradeep Kumar Jayaraman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Designing mechanical mechanisms to trace specific paths is a classic yet notoriously difficult engineering problem, characterized by a vast and complex search space of discrete topologies and continuous parameters. We introduce MechaFormer, a Transformer-based model that tackles this challenge by treating mechanism design as a conditional sequence generation task. Our model learns to translate a target curve into a domain-specific language (DSL) string, simultaneously determining the mechanism’s topology and geometric parameters in a single, unified process. MechaFormer significantly outperforms existing baselines, achieving state-of-the-art path-matching accuracy and generating a wide diversity of novel and valid designs. We demonstrate a suite of sampling strategies that can dramatically improve solution quality and offer designers valuable flexibility. Furthermore, we show that the high-quality outputs from MechaFormer serve as excellent starting points for traditional optimizers, creating a hybrid approach that finds superior solutions with remarkable efficiency.

[LG-9] Low-Regret and Low-Complexity Learning for Hierarchical Inference

链接: https://arxiv.org/abs/2508.08985
作者: Sameep Chattopadhyay,Vinay Sutar,Jaya Prakash Champati,Sharayu Moharir
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This work focuses on Hierarchical Inference (HI) in edge intelligence systems, where a compact Local-ML model on an end-device works in conjunction with a high-accuracy Remote-ML model on an edge-server. HI aims to reduce latency, improve accuracy, and lower bandwidth usage by first using the Local-ML model for inference and offloading to the Remote-ML only when the local inference is likely incorrect. A critical challenge in HI is estimating the likelihood of the local inference being incorrect, especially when data distributions and offloading costs change over time – a problem we term Hierarchical Inference Learning (HIL). We introduce a novel approach to HIL by modeling the probability of correct inference by the Local-ML as an increasing function of the model’s confidence measure, a structure motivated by empirical observations but previously unexploited. We propose two policies, HI-LCB and HI-LCB-lite, based on the Upper Confidence Bound (UCB) framework. We demonstrate that both policies achieve order-optimal regret of O(\log T) , a significant improvement over existing HIL policies with O(T^2/3) regret guarantees. Notably, HI-LCB-lite has an O(1) per-sample computational complexity, making it well-suited for deployment on devices with severe resource limitations. Simulations using real-world datasets confirm that our policies outperform existing state-of-the-art HIL methods.
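摘要的关键建模假设是:本地模型推断正确的概率是其置信度的递增函数。据此可以写出一个最简的卸载规则:当本地推断的期望错误代价超过卸载代价时,转交远端模型。以下 Python 片段仅为该直觉的示意,其中的概率函数与代价数值均为假设,并非论文的 HI-LCB 策略:

```python
# 分层推理(HI)卸载规则示意:p_correct(conf) 为假设的单调递增函数
def p_correct(conf):
    """假设:本地推断正确概率随置信度线性增长(仅为示意)。"""
    return 0.5 + 0.5 * conf          # conf ∈ [0, 1] -> p ∈ [0.5, 1.0]

def should_offload(conf, cost_wrong=1.0, cost_offload=0.3):
    """当本地推断的期望错误代价超过卸载代价时,卸载到远端模型。"""
    expected_local_cost = (1 - p_correct(conf)) * cost_wrong
    return expected_local_cost > cost_offload

print(should_offload(0.9))  # False:高置信度,本地处理即可
print(should_offload(0.2))  # True:低置信度,卸载到远端
```

论文的 HI-LCB 策略在此基础上用 UCB 框架在线学习 p_correct 与变化的代价,从而获得 O(log T) 的遗憾上界。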

[LG-10] Integrating attention into explanation frameworks for language and vision transformers

链接: https://arxiv.org/abs/2508.08966
作者: Marte Eggen,Jacob Lysnæs-Larsen,Inga Strümke
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The attention mechanism lies at the core of the transformer architecture, providing an interpretable model-internal signal that has motivated a growing interest in attention-based model explanations. Although attention weights do not directly determine model outputs, they reflect patterns of token influence that can inform and complement established explainability techniques. This work studies the potential of utilising the information encoded in attention weights to provide meaningful model explanations by integrating them into explainable AI (XAI) frameworks that target fundamentally different aspects of model behaviour. To this end, we develop two novel explanation methods applicable to both natural language processing and computer vision tasks. The first integrates attention weights into the Shapley value decomposition by redefining the characteristic function in terms of pairwise token interactions via attention weights, thus adapting this widely used game-theoretic solution concept to provide attention-driven attributions for local explanations. The second incorporates attention weights into token-level directional derivatives defined through concept activation vectors to measure concept sensitivity for global explanations. Our empirical evaluations on standard benchmarks and in a comparison study with widely used explanation methods show that attention weights can be meaningfully incorporated into the studied XAI frameworks, highlighting their value in enriching transformer explainability.

[LG-11] Fre-CW: Targeted Attack on Time Series Forecasting using Frequency Domain Loss

链接: https://arxiv.org/abs/2508.08955
作者: Naifu Feng,Lixing Chen,Junhua Tang,Hua Ding,Jianhua Li,Yang Bai
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer-based models have made significant progress in time series forecasting. However, a key limitation of deep learning models is their susceptibility to adversarial attacks, which has not been studied enough in the context of time series prediction. In contrast to areas such as computer vision, where adversarial robustness has been extensively studied, frequency domain features of time series data play an important role in the prediction task but have not been sufficiently explored in terms of adversarial attacks. This paper proposes a time series prediction attack algorithm based on frequency domain loss. Specifically, we adapt an attack method originally designed for classification tasks to the prediction field and optimize the adversarial samples using both time-domain and frequency-domain losses. To the best of our knowledge, there is no relevant research on using frequency information for time-series adversarial attacks. Our experimental results show that these current time series prediction models are vulnerable to adversarial attacks, and our approach achieves excellent performance on major time series forecasting datasets.
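摘要提出在时域损失之外加入频域损失来优化对抗样本。下面用 NumPy 给出“时域 MSE + 频域幅度谱 MSE”组合损失的最小示意;权重 β 与具体形式均为假设,并非论文原始定义:

```python
import numpy as np

# 组合损失示意:L = L_time + beta * L_freq,频域部分取 rFFT 幅度谱的 MSE(形式为假设)
def combined_loss(pred, target, beta=0.5):
    l_time = np.mean((pred - target) ** 2)
    l_freq = np.mean((np.abs(np.fft.rfft(pred)) - np.abs(np.fft.rfft(target))) ** 2)
    return l_time + beta * l_freq

t = np.linspace(0, 1, 64, endpoint=False)
target = np.sin(2 * np.pi * 4 * t)            # 4 个完整周期的正弦序列
same = np.sin(2 * np.pi * 4 * t)              # 与目标完全相同
shifted = np.sin(2 * np.pi * 4 * t + np.pi)   # 反相:时域差异大,幅度谱相同

print(combined_loss(target, same))            # 0.0
print(round(combined_loss(shifted, target), 3))  # ≈ 2.0,全部来自时域项
```

反相信号的幅度谱与目标完全一致,组合损失全部来自时域项;这说明单独的频域幅度损失无法区分相位差异,两类损失互补。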

[LG-12] GRAVITY: A Controversial Graph Representation Learning for Vertex Classification

链接: https://arxiv.org/abs/2508.08954
作者: Etienne Gael Tajeuna,Jean Marie Tshimula
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In the quest of accurate vertex classification, we introduce GRAVITY (Graph-based Representation leArning via Vertices Interaction TopologY), a framework inspired by physical systems where objects self-organize under attractive forces. GRAVITY models each vertex as exerting influence through learned interactions shaped by structural proximity and attribute similarity. These interactions induce a latent potential field in which vertices move toward energy efficient positions, coalescing around class-consistent attractors and distancing themselves from unrelated groups. Unlike traditional message-passing schemes with static neighborhoods, GRAVITY adaptively modulates the receptive field of each vertex based on a learned force function, enabling dynamic aggregation driven by context. This field-driven organization sharpens class boundaries and promotes semantic coherence within latent clusters. Experiments on real-world benchmarks show that GRAVITY yields competitive embeddings, excelling in both transductive and inductive vertex classification tasks.

[LG-13] LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks

链接: https://arxiv.org/abs/2508.08935
作者: Ze Tao,Hanxuan Wang,Fujun Liu
类目: Machine Learning (cs.LG)
*备注: 21 pages, 10 figures

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation priors into deep learning frameworks; however, they often exhibit limited predictive accuracy when applied to complex problems. To address this issue, we propose LNN-PINN, a physics-informed neural network framework that incorporates a liquid residual gating architecture while preserving the original physics modeling and optimization pipeline to improve predictive accuracy. The method introduces a lightweight gating mechanism solely within the hidden-layer mapping, keeping the sampling strategy, loss composition, and hyperparameter settings unchanged to ensure that improvements arise purely from architectural refinement. Across four benchmark problems, LNN-PINN consistently reduced RMSE and MAE under identical training conditions, with absolute error plots further confirming its accuracy gains. Moreover, the framework demonstrates strong adaptability and stability across varying dimensions, boundary conditions, and operator characteristics. In summary, LNN-PINN offers a concise and effective architectural enhancement for improving the predictive accuracy of physics-informed neural networks in complex scientific and engineering problems.

[LG-14] Exploring Cross-Stage Adversarial Transferability in Class-Incremental Continual Learning

链接: https://arxiv.org/abs/2508.08920
作者: Jungwoo Kim,Jong-Seok Lee
类目: Machine Learning (cs.LG)
*备注: Accepted at MMSP 2025

点击查看摘要

Abstract:Class-incremental continual learning addresses catastrophic forgetting by enabling classification models to preserve knowledge of previously learned classes while acquiring new ones. However, the vulnerability of the models against adversarial attacks during this process has not been investigated sufficiently. In this paper, we present the first exploration of vulnerability to stage-transferred attacks, i.e., an adversarial example generated using the model in an earlier stage is used to attack the model in a later stage. Our findings reveal that continual learning methods are highly susceptible to these attacks, raising a serious security issue. We explain this phenomenon through model similarity between stages and gradual robustness degradation. Additionally, we find that existing adversarial training-based defense methods are not sufficiently effective to stage-transferred attacks. Codes are available at this https URL.

[LG-15] Stationarity Exploration for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2508.08919
作者: Hao Liu,Chun Yang,Zhang xiaoxing,Rui Ma,Xiaobin Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning-based time series forecasting has found widespread applications. Recently, converting time series data into the frequency domain for forecasting has become popular for accurately exploring periodic patterns. However, existing methods often cannot effectively explore stationary information from complex intertwined frequency components. In this paper, we propose a simple yet effective Amplitude-Phase Reconstruct Network (APRNet) that models the inter-relationships of amplitude and phase, which prevents the amplitude and phase from being constrained by different physical quantities, thereby decoupling the distinct characteristics of signals for capturing stationary information. Specifically, we represent the multivariate time series input across sequence and channel dimensions, highlighting the correlation between amplitude and phase at multiple interaction frequencies. We propose a novel Kolmogorov-Arnold-Network-based Local Correlation (KLC) module to adaptively fit local functions using univariate functions, enabling more flexible characterization of stationary features across different amplitudes and phases. This significantly enhances the model’s capability to capture time-varying patterns. Extensive experiments demonstrate the superiority of our APRNet over state-of-the-art (SOTA) methods.
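The amplitude/phase view the abstract builds on can be sketched with a plain DFT. The stdlib-only toy below (not APRNet's actual KLC module) just shows the two quantities being decomposed from a series:

```python
import math, cmath

def dft(signal):
    """Naive discrete Fourier transform using only the standard library."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def amplitude_phase(signal):
    """Split a series into per-frequency amplitude and phase --
    the two quantities APRNet models jointly rather than separately."""
    spectrum = dft(signal)
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]

# Toy input: a pure cosine at bin 1 of an 8-point window.
sig = [math.cos(2 * math.pi * t / 8) for t in range(8)]
amps, phases = amplitude_phase(sig)
print(round(amps[1], 6))  # energy concentrates at bin 1 (and its mirror, bin 7)
```

A neural model then gets the amplitude list and the phase list as two coupled input streams instead of the raw samples.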

[LG-16] Sound Signal Synthesis with Auxiliary Classifier GAN: COVID-19 cough as an example

链接: https://arxiv.org/abs/2508.08892
作者: Yahya Sherif Solayman Mohamed Saleh,Ahmed Mohammed Dabbous,Lama Alkhaled,Hum Yan Chai,Muhammad Ehsan Rana,Hamam Mokayed
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:One of the fastest-growing domains in AI is healthcare. Given its importance, it has been the interest of many researchers to deploy ML models into the ever-demanding healthcare domain to aid doctors and increase accessibility. Delivering reliable models, however, demands a sizable amount of data, and the recent COVID-19 pandemic served as a reminder of the rampant and scary nature of healthcare that makes training models difficult. To alleviate such scarcity, many published works attempted to synthesize radiological cough data to train better COVID-19 detection models on the respective radiological data. To accommodate the time sensitivity expected during a pandemic, this work focuses on detecting COVID-19 through coughs using synthetic data to improve the accuracy of the classifier. The work begins by training a CNN on a balanced subset of the Coughvid dataset, establishing a baseline classification test accuracy of 72%. The paper demonstrates how an Auxiliary Classification GAN (ACGAN) may be trained to conditionally generate novel synthetic Mel Spectrograms of both healthy and COVID-19 coughs. These coughs are used to augment the training dataset of the CNN classifier, allowing it to reach a new test accuracy of 75%. The work highlights the expected messiness and inconsistency in training and offers insights into detecting and handling such shortcomings.

[LG-17] Hi-fi functional priors by learning activations NEURIPS2024

链接: https://arxiv.org/abs/2508.08880
作者: Marcin Sendera,Amin Sorkhei,Tomasz Kuśmierczyk
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published in Workshop on Bayesian Decision-making and Uncertainty, 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

点击查看摘要

Abstract:Function-space priors in Bayesian Neural Networks (BNNs) provide a more intuitive approach to embedding beliefs directly into the model’s output, thereby enhancing regularization, uncertainty quantification, and risk-aware decision-making. However, imposing function-space priors on BNNs is challenging. We address this task through optimization techniques that explore how trainable activations can accommodate higher-complexity priors and match intricate target function distributions. We investigate flexible activation models, including Pade functions and piecewise linear functions, and discuss the learning challenges related to identifiability, loss construction, and symmetries. Our empirical findings indicate that even BNNs with a single wide hidden layer, when equipped with flexible trainable activations, can effectively achieve desired function-space priors.
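As a rough sketch of the kind of trainable rational (Pade-style) activation the abstract mentions, with illustrative, unfitted coefficients (the real method optimizes these to match a target functional prior):

```python
def pade_activation(x, num, den):
    """Rational (Pade [m/n]) activation P(x) / Q(x). `num` and `den` hold
    polynomial coefficients in increasing degree; Q is kept >= 1 via
    absolute values, a common safeguard for trainable rationals."""
    p = sum(c * x ** i for i, c in enumerate(num))
    q = 1.0 + sum(abs(c) * abs(x) ** i for i, c in enumerate(den, start=1))
    return p / q

# Illustrative [2/2] coefficients (assumed here, not fitted to any prior).
y = pade_activation(1.0, num=[0.0, 0.5, 0.5], den=[0.0, 0.5])
print(y)  # P(1)/Q(1) = 1.0 / 1.5
```

In a BNN, `num` and `den` would be trained so that sampled network outputs match the desired function-space distribution.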

[LG-18] Towards Scalable Lottery Ticket Networks using Genetic Algorithms

链接: https://arxiv.org/abs/2508.08877
作者: Julian Schönberger,Maximilian Zorn,Jonas Nüßlein,Thomas Gabor,Philipp Altmann
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: 27 pages, 11 figures, 7 tables, Extended version of a paper submitted to IJCCI 2024 (DOI: https://doi.org/10.5220/0013010300003837 ), the extended version will appear in the journal Studies in Computational Intelligence

点击查看摘要

Abstract:Building modern deep learning systems that are not just effective but also efficient requires rethinking established paradigms for model training and neural architecture design. Instead of adapting highly overparameterized networks and subsequently applying model compression techniques to reduce resource consumption, a new class of high-performing networks skips the need for expensive parameter updates, while requiring only a fraction of parameters, making them highly scalable. The Strong Lottery Ticket Hypothesis posits that within randomly initialized, sufficiently overparameterized neural networks, there exist subnetworks that can match the accuracy of the trained original model-without any training. This work explores the usage of genetic algorithms for identifying these strong lottery ticket subnetworks. We find that for instances of binary and multi-class classification tasks, our approach achieves better accuracies and sparsity levels than the current state-of-the-art without requiring any gradient information. In addition, we provide justification for the need for appropriate evaluation metrics when scaling to more complex network architectures and learning tasks.
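A minimal, gradient-free sketch of the idea: a genetic algorithm searching binary masks over frozen random weights. The fitness function below is a made-up stand-in for validation accuracy, not the paper's objective:

```python
import random

random.seed(0)
WEIGHTS = [random.uniform(-1, 1) for _ in range(16)]  # frozen random init

def fitness(mask):
    # Toy stand-in for validation accuracy: reward keeping positive weights,
    # lightly penalize the number of retained parameters (sparsity pressure).
    return sum(w for w, m in zip(WEIGHTS, mask) if m) - 0.01 * sum(mask)

def mutate(mask, rate=0.1):
    return [1 - b if random.random() < rate else b for b in mask]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

pop = [[random.randint(0, 1) for _ in WEIGHTS] for _ in range(20)]
for _ in range(50):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:5]  # elitism: the best masks survive unchanged
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(15)]
best = max(pop, key=fitness)
print(round(fitness(best), 3))
```

No weight is ever updated: the search operates purely on which parameters to keep, which is what makes the approach gradient-free.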

[LG-19] Flow Battery Manifold Design with Heterogeneous Inputs Through Generative Adversarial Neural Networks

链接: https://arxiv.org/abs/2508.08863
作者: Eric Seng,Hugh O’Connor,Adam Boyce,Josh J. Bailey,Anton van Beek (School of Mechanical and Materials Engineering, University College Dublin, Dublin, Ireland)
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 30 pages, 7 figures, conference (IDETC-CIE)

点击查看摘要

Abstract:Generative machine learning has emerged as a powerful tool for design representation and exploration. However, its application is often constrained by the need for large datasets of existing designs and the lack of interpretability about what features drive optimality. To address these challenges, we introduce a systematic framework for constructing training datasets tailored to generative models and demonstrate how these models can be leveraged for interpretable design. The novelty of this work is twofold: (i) we present a systematic framework for generating archetypes with internally homogeneous but mutually heterogeneous inputs that can be used to generate a training dataset, and (ii) we show how integrating generative models with Bayesian optimization can enhance the interpretability of the latent space of admissible designs. These findings are validated by using the framework to design a flow battery manifold, demonstrating that it effectively captures the space of feasible designs, including novel configurations while enabling efficient exploration. This work broadens the applicability of generative machine-learning models in system designs by enhancing quality and reliability.

[LG-20] Image selective encryption analysis using mutual information in CNN based embedding space ALT

链接: https://arxiv.org/abs/2508.08832
作者: Ikram Messadi,Giulia Cervia,Vincent Itier
类目: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: Accepted for presentation at the 13th European Workshop on Visual Information Processing (EUVIP), Oct 2025, Valetta, Malta

点击查看摘要

Abstract:As digital data transmission continues to scale, concerns about privacy grow increasingly urgent - yet privacy remains a socially constructed and ambiguously defined concept, lacking a universally accepted quantitative measure. This work examines information leakage in image data, a domain where information-theoretic guarantees are still underexplored. At the intersection of deep learning, information theory, and cryptography, we investigate the use of mutual information (MI) estimators - in particular, the empirical estimator and the MINE framework - to detect leakage from selectively encrypted images. Motivated by the intuition that a robust estimator requires a probabilistic framework that can capture spatial dependencies and residual structures even within encrypted representations, our work represents a promising direction for estimating image information leakage.
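For reference, the empirical (plug-in) MI estimator mentioned in the abstract fits in a few lines for discrete samples; MINE itself is a neural estimator and is not reproduced here:

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Plug-in mutual information estimate (in nats) from paired
    discrete samples: sum over the joint histogram of p(x,y) * log(p(x,y)/(p(x)p(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(c / n * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

xs = [0, 1, 0, 1, 0, 1, 0, 1]
print(empirical_mi(xs, xs))        # fully dependent: equals H(x) = log 2
print(empirical_mi(xs, [0] * 8))   # constant y: exactly 0.0
```

Applied to selective encryption, `xs` would be features of the plaintext image and `ys` features of its encrypted version; nonzero MI indicates residual leakage.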

[LG-21] Differentiated Information Mining: A Semi-supervised Learning Framework for GNNs

链接: https://arxiv.org/abs/2508.08769
作者: Long Wang,Kai Liu
类目: Machine Learning (cs.LG)
*备注: 13 pages, 5 figures, 8 tables

点击查看摘要

Abstract:In semi-supervised learning (SSL) for enhancing the performance of graph neural networks (GNNs) with unlabeled data, introducing mutually independent decision factors for cross-validation is regarded as an effective strategy to alleviate pseudo-label confirmation bias and training collapse. However, obtaining such factors is challenging in practice: additional and valid information sources are inherently scarce, and even when such sources are available, their independence from the original source cannot be guaranteed. To address this challenge, we propose in this paper a Differentiated Factor Consistency Semi-supervised Framework (DiFac), which derives differentiated factors from a single information source and enforces their consistency. During pre-training, the model learns to extract these factors; in training, it iteratively removes samples with conflicting factors and ranks pseudo-labels based on the shortest stave principle, selecting the top candidate samples to reduce overconfidence commonly observed in confidence-based or ensemble-based methods. Our framework can also incorporate additional information sources. In this work, we leverage a large multimodal language model to introduce latent textual knowledge as auxiliary decision factors, and we design an accountability scoring mechanism to mitigate additional erroneous judgments introduced by these auxiliary factors. Experiments on multiple benchmark datasets demonstrate that DiFac consistently improves robustness and generalization in low-label regimes, outperforming other baseline methods.

[LG-22] Interpretable Reward Model via Sparse Autoencoder

链接: https://arxiv.org/abs/2508.08746
作者: Shuyi Zhang,Wei Shi,Sihang Li,Jiayi Liao,Tao Liang,Hengxing Cai,Xiang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at this https URL.
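The shape of the SARM pipeline can be sketched in miniature (all numbers below are illustrative, not the paper's trained SAE): ReLU-sparse features from a hidden state, then a linear scalar head whose per-feature contributions are directly readable:

```python
def sae_encode(hidden, enc_weights, bias):
    """Sparse feature activations from a hidden state; ReLU zeroes most
    entries, and each surviving feature is ideally monosemantic."""
    feats = []
    for row, b in zip(enc_weights, bias):
        z = sum(h * w for h, w in zip(hidden, row)) + b
        feats.append(max(0.0, z))  # ReLU induces sparsity
    return feats

def reward(feats, head):
    """Scalar head: the reward is a weighted sum of named features,
    so each feature's contribution to the score is directly attributable."""
    return sum(f * w for f, w in zip(feats, head))

hidden = [0.2, -0.1, 0.4]
enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy encoder
bias = [-0.1, -0.1, -0.1]
feats = sae_encode(hidden, enc, bias)          # ~[0.1, 0.0, 0.3]
r = reward(feats, head=[2.0, -1.0, 1.0])
print(r)                                       # ~0.5
```

Because the head is linear over sparse, interpretable features, adjusting to a preference shift can amount to re-weighting `head` rather than retraining the whole RM.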

[LG-23] Elucidating Rectified Flow with Deterministic Sampler: Polynomial Discretization Complexity for Multi and One-step Models

链接: https://arxiv.org/abs/2508.08735
作者: Ruofeng Yang,Zhaoyu Zhu,Bo Jiang,Cheng Chen,Shuai Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, rectified flow (RF)-based models have achieved state-of-the-art performance in many areas for both the multi-step and one-step generation. However, only a few theoretical works analyze the discretization complexity of RF-based models. Existing works either focus on flow-based models with stochastic samplers or establish complexity results that exhibit exponential dependence on problem parameters. In this work, under the realistic bounded support assumption, we prove the first polynomial discretization complexity for multi-step and one-step RF-based models with a deterministic sampler simultaneously. For the multi-step setting, inspired by the predictor-corrector framework of diffusion models, we introduce a Langevin process as a corrector and show that RF-based models can achieve better polynomial discretization complexity than diffusion models. To achieve this result, we conduct a detailed analysis of the RF-based model and explain why it is better than previous popular models, such as variance preserving (VP) and variance exploding (VE)-based models. Based on the observation of multi-step RF-based models, we further provide the first polynomial discretization complexity result for one-step RF-based models, improving upon prior results for one-step diffusion-based models. These findings mark the first step toward theoretically understanding the impressive empirical performance of RF-based models in both multi-step and one-step generation.
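The deterministic sampler under analysis is essentially an Euler discretization of the flow ODE dx/dt = v(x, t). A one-dimensional toy (with a single-data-point conditional velocity field, purely for illustration) shows the mechanics:

```python
def euler_sample(x0, velocity, steps=100):
    """Deterministic Euler sampler for the rectified-flow ODE dx/dt = v(x, t).
    The paper's discretization-complexity results bound how many such
    steps are needed for a target accuracy."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        t = k * dt
        x = x + velocity(x, t) * dt
    return x

# Toy velocity field: with a single data point x1, the conditional
# straight-line field is v(x, t) = (x1 - x) / (1 - t), and the Euler
# iterates walk the linear interpolation path exactly.
x1 = 3.0
v = lambda x, t: (x1 - x) / (1.0 - t)
print(euler_sample(x0=-1.0, velocity=v, steps=100))  # -> 3.0 (up to float error)
```

Real rectified-flow models replace `v` with a learned network over high-dimensional states; the one-step variants in the abstract amount to collapsing this loop into a single learned map.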

[LG-24] CRADLE: Conversational RTL Design Space Exploration with LLM-based Multi-Agent Systems ISOCC2025

链接: https://arxiv.org/abs/2508.08709
作者: Lukas Krupp,Maximilian Schöffel,Elias Biehl,Norbert Wehn
类目: Robotics (cs.RO); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: Accepted for presentation at the 22nd International SoC Conference (ISOCC 2025). Proceedings to be included in IEEE Xplore

点击查看摘要

Abstract:This paper presents CRADLE, a conversational framework for design space exploration of RTL designs using LLM-based multi-agent systems. Unlike existing rigid approaches, CRADLE enables user-guided flows with internal self-verification, correction, and optimization. We demonstrate the framework with a generator-critic agent system targeting FPGA resource minimization using state-of-the-art LLMs. Experimental results on the RTLLM benchmark show that CRADLE achieves significant reductions in resource usage with averages of 48% and 40% in LUTs and FFs across all benchmark designs.

[LG-25] Expert-Guided Diffusion Planner for Auto-bidding CIKM2025

链接: https://arxiv.org/abs/2508.08687
作者: Yunshan Peng,Wenzheng Shu,Jiahao Sun,Yanxiang Zeng,Jinan Pang,Wentao Bai,Yunke Bai,Xialong Liu,Peng Jiang
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注: accepted for presentation at the CIKM 2025 Applied Research Track, eight (8) pages, three (3) figures

点击查看摘要

Abstract:Auto-bidding is extensively applied in advertising systems, serving a multitude of advertisers. Generative bidding is gradually gaining traction due to its robust planning capabilities and generalizability. In contrast to traditional reinforcement learning-based bidding, generative bidding does not rely on the Markov Decision Process (MDP), exhibiting superior planning capabilities in long-horizon scenarios. Conditional diffusion modeling approaches have demonstrated significant potential in the realm of auto-bidding. However, relying solely on return as the optimality condition is too weak to guarantee the generation of genuinely optimal decision sequences, lacking personalized structural information. Moreover, diffusion models’ t-step autoregressive generation mechanism inherently carries timeliness risks. To address these issues, we propose a novel conditional diffusion modeling method based on expert trajectory guidance combined with a skip-step sampling strategy to enhance generation efficiency. We have validated the effectiveness of this approach through extensive offline experiments and achieved statistically significant results in online A/B testing, with an increase of 11.29% in conversion and 12.35% in revenue compared with the baseline.

[LG-26] Classifier Language Models: Unifying Sparse Finetuning and Adaptive Tokenization for Specialized Classification Tasks

链接: https://arxiv.org/abs/2508.08635
作者: Adit Krishnan,Chu Wang,Chris Kong
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures, currently under review

点击查看摘要

Abstract:Semantic text classification requires the understanding of the contextual significance of specific tokens rather than surface-level patterns or keywords (as in rule-based or statistical text classification), making large language models (LLMs) well-suited for this task. However, semantic classification applications in industry, like customer intent detection or semantic role labeling, tend to be highly specialized. They require annotation by domain experts in contrast to general-purpose corpora for pretraining. Further, they typically require high inference throughputs which limits the model size from latency and cost perspectives. Thus, for a range of specialized classification tasks, the preferred solution is to develop customized classifiers by finetuning smaller language models (e.g., mini-encoders, small language models). In this work, we develop a token-driven sparse finetuning strategy to adapt small language models to specialized classification tasks. We identify and finetune a small sensitive subset of model parameters by leveraging task-specific token constructs in the finetuning dataset, while leaving most of the pretrained weights unchanged. Unlike adapter approaches such as low rank adaptation (LoRA), we do not introduce additional parameters to the model. Our approach identifies highly relevant semantic tokens (case study in the Appendix) and outperforms end-to-end finetuning, LoRA, layer selection, and prefix tuning on five diverse semantic classification tasks. We achieve greater stability and half the training costs vs. end-to-end finetuning.

[LG-27] Dynamic Rank Adjustment for Accurate and Efficient Neural Network Training

链接: https://arxiv.org/abs/2508.08625
作者: Hyuntak Shin,Aecheon Jung,Sunwoo Lee,Sungeun Hong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Low-rank training methods reduce the number of trainable parameters by re-parameterizing the weights with matrix decompositions (e.g., singular value decomposition). However, enforcing a fixed low-rank structure caps the rank of the weight matrices and can hinder the model’s ability to learn complex patterns. Furthermore, the effective rank of the model’s weights tends to decline during training, and this drop is accelerated when the model is reparameterized into a low-rank structure. In this study, we argue that strategically interleaving full-rank training epochs within low-rank training epochs can effectively restore the rank of the model’s weights. Based on our findings, we propose a general dynamic-rank training framework that is readily applicable to a wide range of neural-network tasks. We first describe how to adjust the rank of the weight matrices to alleviate the inevitable rank collapse that arises during training, and then present extensive empirical results that validate our claims and demonstrate the efficacy of the proposed framework. Our empirical study shows that the proposed method achieves almost the same computational cost as SVD-based low-rank training while achieving a comparable accuracy to full-rank training across various benchmarks.

[LG-28] Distributed optimization: designed for federated learning

链接: https://arxiv.org/abs/2508.08606
作者: Wenyou Guo,Ting Qu,Chunrong Pan,George Q. Huang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:Federated Learning (FL), as a distributed collaborative Machine Learning (ML) framework under privacy-preserving constraints, has garnered increasing research attention in cross-organizational data collaboration scenarios. This paper proposes a class of distributed optimization algorithms based on the augmented Lagrangian technique, designed to accommodate diverse communication topologies in both centralized and decentralized FL settings. Furthermore, we develop multiple termination criteria and parameter update mechanisms to enhance computational efficiency, accompanied by rigorous theoretical guarantees of convergence. By generalizing the augmented Lagrangian relaxation through the incorporation of proximal relaxation and quadratic approximation, our framework systematically recovers a broad class of classical unconstrained optimization methods, including the proximal algorithm, classic gradient descent, and stochastic gradient descent, among others. Notably, the convergence properties of these methods can be naturally derived within the proposed theoretical framework. Numerical experiments demonstrate that the proposed algorithm exhibits strong performance in large-scale settings with significant statistical heterogeneity across clients.
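A minimal sketch of the augmented-Lagrangian idea in a consensus setting, using closed-form updates for quadratic client losses f_i(x) = (x - a_i)^2 / 2 (a toy stand-in, not the paper's algorithm):

```python
def consensus_admm(local_targets, rho=1.0, iters=200):
    """ADMM-style augmented-Lagrangian consensus: each client i holds a
    local variable x_i, a shared variable z enforces x_i = z, and dual
    variables lam_i price the violation of that constraint."""
    n = len(local_targets)
    z = 0.0
    lam = [0.0] * n
    for _ in range(iters):
        # Local (primal) step: argmin_x f_i(x) + lam_i*(x - z) + rho/2*(x - z)^2,
        # closed form for quadratic f_i.
        x = [(a - l + rho * z) / (1.0 + rho)
             for a, l in zip(local_targets, lam)]
        # Global (consensus) step: minimize the penalty terms over z.
        z = sum(xi + li / rho for xi, li in zip(x, lam)) / n
        # Dual ascent on the consensus constraint x_i = z.
        lam = [l + rho * (xi - z) for l, xi in zip(lam, x)]
    return z

print(consensus_admm([1.0, 2.0, 6.0]))  # converges to the mean, 3.0
```

In FL terms, the local step is each client's private computation, while the global and dual steps are what the server aggregates, which is why the scheme adapts to both centralized and decentralized topologies.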

[LG-29] Multi-Target Backdoor Attacks Against Speaker Recognition

链接: https://arxiv.org/abs/2508.08559
作者: Alexandrine Fortier,Sonal Joshi,Thomas Thebaud,Jesus Villalba Lopez,Najim Dehak,Patrick Cardinal
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted to IEEE Automatic Speech Recognition and Understanding Workshop 2025

点击查看摘要

Abstract:In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously, achieving success rates of up to 95.04%. To simulate more realistic attack conditions, we vary the signal-to-noise ratio between speech and trigger, demonstrating a trade-off between stealth and effectiveness. We further extend the attack to the speaker verification task by selecting the most similar training speaker - based on cosine similarity - as the target. The attack is most effective when target and enrolled speaker pairs are highly similar, reaching success rates of up to 90% in such cases.
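The cosine-similarity target-selection rule is straightforward to sketch; the embeddings and speaker names below are made up for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def most_similar_speaker(enrolled, candidates):
    """Pick the training speaker whose embedding is closest (by cosine)
    to the enrolled speaker -- the target-selection rule in the abstract."""
    return max(candidates, key=lambda name: cosine(enrolled, candidates[name]))

candidates = {"spk_a": [1.0, 0.0, 0.0],
              "spk_b": [0.6, 0.8, 0.0],
              "spk_c": [0.0, 0.0, 1.0]}
print(most_similar_speaker([0.5, 0.9, 0.1], candidates))  # -> 'spk_b'
```

The abstract's finding is that the attack succeeds most often precisely when this maximal similarity is high, i.e., when a close training-speaker proxy for the enrolled speaker exists.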

[LG-30] SHEFL: Resource-Aware Aggregation and Sparsification in Heterogeneous Ensemble Federated Learning AAAI2026

链接: https://arxiv.org/abs/2508.08552
作者: Keumseo Ryum,Jinu Gong,Joonhyuk Kang
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 9 pages, 7 figures, submitted to AAAI 2026

点击查看摘要

Abstract:Federated learning enables distributed training with private data of clients, but its convergence is hindered by data and system heterogeneity in realistic communication scenarios. Most existing system heterogeneous FL schemes utilize global pruning or ensemble distillation, yet they often overlook typical constraints required for communication efficiency. Meanwhile, deep ensembles can aggregate predictions from individually trained models to improve performance, but current ensemble-based FL methods fall short in fully capturing the diversity of model predictions. In this work, we propose SHEFL, a global ensemble-based federated learning framework suited for clients with diverse computational capacities. We allocate different numbers of global models to clients based on their available resources. We further introduce a novel aggregation scheme that accounts for bias between clients with different computational capabilities. To reduce the computational burden of training deep ensembles and mitigate data bias, we dynamically adjust the resource ratio across clients - aggressively reducing the influence of underpowered clients in constrained scenarios, while increasing their weight in the opposite case. Extensive experiments demonstrate that our method effectively addresses computational heterogeneity, significantly improving both fairness and overall performance compared to existing approaches.

[LG-31] Biased Local SGD for Efficient Deep Learning on Heterogeneous Systems

链接: https://arxiv.org/abs/2508.08540
作者: Jihyun Lim,Junhyuk Jo,Chanhyeok Ko,Young Min Go,Jimin Hwa,Sunwoo Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Most large-scale neural network training methods assume homogeneous parallel computing resources. For example, synchronous SGD with data parallelism, the most widely used parallel training strategy, incurs significant synchronization overhead when workers process their assigned data at different speeds. Consequently, in systems with heterogeneous compute resources, users often rely solely on the fastest components, such as GPUs, for training. In this work, we explore how to effectively use heterogeneous resources for neural network training. We propose a system-aware local stochastic gradient descent (local SGD) method that allocates workloads to each compute resource in proportion to its compute capacity. To make better use of slower resources such as CPUs, we intentionally introduce bias into data sampling and model aggregation. Our study shows that well-controlled bias can significantly accelerate local SGD in heterogeneous environments, achieving comparable or even higher accuracy than synchronous SGD with data-parallelism within the same time budget. This fundamental parallelization strategy can be readily extended to diverse heterogeneous environments, including cloud platforms and multi-node high-performance computing clusters.
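The capacity-proportional workload split can be sketched as follows; the rounding rule for leftover batches is our assumption, not necessarily the paper's:

```python
def allocate_batches(total_batches, capacities):
    """Split a global workload across heterogeneous workers in proportion
    to measured compute capacity (e.g. samples/sec), the allocation idea
    behind system-aware local SGD. Leftover batches from rounding go to
    the workers with the largest fractional shares."""
    total_cap = sum(capacities)
    shares = [total_batches * c / total_cap for c in capacities]
    alloc = [int(s) for s in shares]
    leftovers = sorted(range(len(shares)),
                       key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in leftovers[: total_batches - sum(alloc)]:
        alloc[i] += 1
    return alloc

# One GPU (100 samples/s) and two CPUs (15 and 10 samples/s):
print(allocate_batches(100, [100, 15, 10]))  # -> [80, 12, 8]
```

With workloads balanced this way, all workers finish a local-SGD round at roughly the same time, so slow CPUs contribute instead of stalling synchronization.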

[LG-32] Benchmarking Federated Learning for Throughput Prediction in 5G Live Streaming Applications

链接: https://arxiv.org/abs/2508.08479
作者: Yuvraj Dutta,Soumyajit Chatterjee,Sandip Chakraborty,Basabdatta Palit
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 14 pages, 24 figures, submitted to IEEE TNET

点击查看摘要

Abstract:Accurate and adaptive network throughput prediction is essential for latency-sensitive and bandwidth-intensive applications in 5G and emerging 6G networks. However, most existing methods rely on centralized training with uniformly collected data, limiting their applicability in heterogeneous mobile environments with non-IID data distributions. This paper presents the first comprehensive benchmarking of federated learning (FL) strategies for throughput prediction in realistic 5G edge scenarios. We evaluate three aggregation algorithms - FedAvg, FedProx, and FedBN - across four time-series architectures: LSTM, CNN, CNN+LSTM, and Transformer, using five diverse real-world datasets. We systematically analyze the effects of client heterogeneity, cohort size, and history window length on prediction performance. Our results reveal key trade-offs among model complexities, convergence rates, and generalization. It is found that FedBN consistently delivers robust performance under non-IID conditions. On the other hand, LSTM and Transformer models outperform CNN-based baselines by up to 80% in R2 scores. Moreover, although Transformers converge in half the rounds of LSTM, they require longer history windows to achieve a high R2, indicating higher context dependence. LSTM is, therefore, found to achieve a favorable balance between accuracy, rounds, and temporal footprint. To validate the end-to-end applicability of the framework, we have integrated our FL-based predictors into a live adaptive streaming pipeline. It is seen that FedBN-based LSTM and Transformer models improve mean QoE scores by 11.7% and 11.4%, respectively, over FedAvg, while also reducing the variance. These findings offer actionable insights for building scalable, privacy-preserving, and edge-aware throughput prediction systems in next-generation wireless networks.

[LG-33] Sparse Partial Optimal Transport via Quadratic Regularization

链接: https://arxiv.org/abs/2508.08476
作者: Khang Tran,Khoa Nguyen,Anh Nguyen,Thong Huynh,Son Pham,Sy-Hoang Nguyen-Dang,Manh Pham,Bang Vo,Mai Ngoc Tran,Dung Luong
类目: Machine Learning (cs.LG)
*备注: 12 pages, 8 figures

点击查看摘要

Abstract:Partial Optimal Transport (POT) has recently emerged as a central tool in various Machine Learning (ML) applications. It lifts the stringent assumption of the conventional Optimal Transport (OT) that input measures are of equal masses, which is often not guaranteed in real-world datasets, and thus offers greater flexibility by permitting transport between unbalanced input measures. Nevertheless, existing major solvers for POT commonly rely on entropic regularization for acceleration and thus return dense transport plans, hindering the adoption of POT in various applications that favor sparsity. In this paper, as an alternative approach to the entropic POT formulation in the literature, we propose a novel formulation of POT with quadratic regularization, hence termed quadratic regularized POT (QPOT), which induces sparsity to the transport plan and consequently facilitates the adoption of POT in many applications with sparsity requirements. Extensive experiments on synthetic and CIFAR-10 datasets, as well as real-world applications such as color transfer and domain adaptations, consistently demonstrate the improved sparsity and favorable performance of our proposed QPOT formulation.

[LG-34] Discrete Diffusion-Based Model-Level Explanation of Heterogeneous GNNs with Node Features

链接: https://arxiv.org/abs/2508.08458
作者: Pallabee Das,Stefan Heindorf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many real-world datasets, such as citation networks, social networks, and molecular structures, are naturally represented as heterogeneous graphs, where nodes belong to different types and have additional features. For example, in a citation network, nodes representing “Paper” or “Author” may include attributes like keywords or affiliations. A critical machine learning task on these graphs is node classification, which is useful for applications such as fake news detection, corporate risk assessment, and molecular property prediction. Although Heterogeneous Graph Neural Networks (HGNNs) perform well in these contexts, their predictions remain opaque. Existing post-hoc explanation methods lack support for actual node features beyond one-hot encoding of node type and often fail to generate realistic, faithful explanations. To address these gaps, we propose DiGNNExplainer, a model-level explanation approach that synthesizes heterogeneous graphs with realistic node features via discrete denoising diffusion. In particular, we generate realistic discrete features (e.g., bag-of-words features) using diffusion models within a discrete space, whereas previous approaches are limited to continuous spaces. We evaluate our approach on multiple datasets and show that DiGNNExplainer produces explanations that are realistic and faithful to the model’s decision-making, outperforming state-of-the-art methods.

[LG-35] Differentiable Cyclic Causal Discovery Under Unmeasured Confounders

链接: https://arxiv.org/abs/2508.08450
作者: Muralikrishnna G. Sethuraman,Faramarz Fekri
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Understanding causal relationships between variables is fundamental across scientific disciplines. Most causal discovery algorithms rely on two key assumptions: (i) all variables are observed, and (ii) the underlying causal graph is acyclic. While these assumptions simplify theoretical analysis, they are often violated in real-world systems, such as biological networks. Existing methods that account for confounders either assume linearity or struggle with scalability. To address these limitations, we propose DCCD-CONF, a novel framework for differentiable learning of nonlinear cyclic causal graphs in the presence of unmeasured confounders using interventional data. Our approach alternates between optimizing the graph structure and estimating the confounder distribution by maximizing the log-likelihood of the data. Through experiments on synthetic data and real-world gene perturbation datasets, we show that DCCD-CONF outperforms state-of-the-art methods in both causal graph recovery and confounder identification. Additionally, we also provide consistency guarantees for our framework, reinforcing its theoretical soundness.

[LG-36] Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference

链接: https://arxiv.org/abs/2508.08438
作者: Kexin Chu,Zecheng Lin,Dawei Xiang,Zixu Shen,Jianchang Su,Cheng Chu,Yiwei Yang,Wenhui Zhang,Wenfei Wu,Wei Zhang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Operating Systems (cs.OS)
*备注: 17 pages,17 figures

点击查看摘要

Abstract:Global KV-cache sharing has emerged as a key optimization for accelerating large language model (LLM) inference. However, it exposes a new class of timing side-channel attacks, enabling adversaries to infer sensitive user inputs via shared cache entries. Existing defenses, such as per-user isolation, eliminate leakage but degrade performance by up to 38.9% in time-to-first-token (TTFT), making them impractical for high-throughput deployment. To address this gap, we introduce SafeKV (Secure and Flexible KV Cache Sharing), a privacy-aware KV-cache management framework that selectively shares non-sensitive entries while confining sensitive content to private caches. SafeKV comprises three components: (i) a hybrid, multi-tier detection pipeline that integrates rule-based pattern matching, a general-purpose privacy detector, and context-aware validation; (ii) a unified radix-tree index that manages public and private entries across heterogeneous memory tiers (HBM, DRAM, SSD); and (iii) entropy-based access monitoring to detect and mitigate residual information leakage. Our evaluation shows that SafeKV mitigates 94% - 97% of timing-based side-channel attacks. Compared to per-user isolation method, SafeKV improves TTFT by up to 40.58% and throughput by up to 2.66X across diverse LLMs and workloads. SafeKV reduces cache-induced TTFT overhead from 50.41% to 11.74% on Qwen3-235B. By combining fine-grained privacy control with high cache reuse efficiency, SafeKV reclaims the performance advantages of global sharing while providing robust runtime privacy guarantees for LLM inference.

[LG-37] Regret minimization in Linear Bandits with offline data via extended D-optimal exploration

链接: https://arxiv.org/abs/2508.08420
作者: Sushant Vijayan,Arun Suggala,Karthikeyan VS,Soumyabrata Pal
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We consider the problem of online regret minimization in linear bandits with access to prior observations (offline data) from the underlying bandit model. There are numerous applications where extensive offline data is often available, such as recommendation systems and online advertising. Consequently, this problem has been studied intensively in recent literature. Our algorithm, Offline-Online Phased Elimination (OOPE), effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. OOPE achieves an online regret of \tilde{O}(\sqrt{d_{\text{eff}} T \log(|\mathcal{A}|T)} + d^2), where d_{\text{eff}} (\leq d) is the effective problem dimension, which measures the number of poorly explored directions in the offline data and depends on the eigen-spectrum (\lambda_k)_{k \in [d]} of the Gram matrix of the offline data. The eigen-spectrum is a quantitative measure of the quality of the offline data. If the offline data is poorly explored (d_{\text{eff}} \approx d), we recover the established regret bounds for the purely online setting, while, when offline data is abundant (T_{\text{off}} \gg T) and well-explored (d_{\text{eff}} = o(1)), the online regret reduces substantially. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data. These lower bounds establish the optimality of our algorithm in regimes where offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design we further improve the O(d^2) term to O\left(\frac{d^2}{d_{\text{eff}}} \min\{d_{\text{eff}}, 1\}\right), which can be substantial in high dimensions with moderate quality of offline data (d_{\text{eff}} = \Omega(1)).

[LG-38] Scaled-Dot-Product Attention as One-Sided Entropic Optimal Transport

链接: https://arxiv.org/abs/2508.08369
作者: Elon Litman
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The scaled-dot-product attention (SDPA) mechanism is a core component of modern deep learning, but its mathematical form is often motivated by heuristics. This work provides a first-principles justification for SDPA. We first show that the attention forward pass is the exact solution to a degenerate, one-sided Entropic Optimal Transport (EOT) problem, which seeks a distribution that maximizes similarity while being maximally entropic. This optimization perspective has a direct consequence for the backward pass. We prove that the standard gradient computed via backpropagation is mathematically identical to an advantage-based policy gradient, a variance-reduced update rule from reinforcement learning. Crucially, we demonstrate that the EOT formulation of the forward pass induces a specific information geometry on the space of attention distributions. It is this geometry, characterized by the Fisher Information Matrix, that dictates the precise form of the learning gradient, revealing the advantage-based update as a natural consequence of the optimization problem being solved. This unified view reveals SDPA as a principled mechanism where the forward pass performs optimal inference and the backward pass implements a rational, manifold-aware learning update.
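As a quick numerical sanity check of the forward-pass claim (this is not the paper's code): for a single query with similarity scores s, softmax(s) is the maximizer of the one-sided entropic objective ⟨p, s⟩ + τH(p) over the probability simplex (here τ = 1). The sketch below compares the softmax distribution against random candidate distributions on that objective:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def objective(p, s, tau=1.0):
    # similarity term plus tau * Shannon entropy (with 0 log 0 := 0)
    ent = -np.sum(np.where(p > 0, p * np.log(p), 0.0))
    return float(p @ s + tau * ent)

rng = np.random.default_rng(0)
s = rng.normal(size=5)               # attention logits for one query
p_star = softmax(s)                  # claimed EOT maximizer for tau = 1

# no random point on the simplex should beat the softmax distribution
best_other = max(objective(rng.dirichlet(np.ones(5)), s) for _ in range(2000))
print(objective(p_star, s) >= best_other)   # True
```

The objective is strictly concave on the simplex, so the softmax solution is its unique maximum; the random search merely illustrates this numerically.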

[LG-39] SHeRL-FL: When Representation Learning Meets Split Learning in Hierarchical Federated Learning

链接: https://arxiv.org/abs/2508.08339
作者: Dung T. Tran,Nguyen B. Ha,Van-Dinh Nguyen,Kok-Seng Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning (FL) is a promising approach for addressing scalability and latency issues in large-scale networks by enabling collaborative model training without requiring the sharing of raw data. However, existing FL frameworks often overlook the computational heterogeneity of edge clients and the growing training burden on resource-limited devices. In addition, FL suffers from high communication costs and complex model aggregation, especially with large models. Previous works combine split learning (SL) and hierarchical FL (HierFL) to reduce device-side computation and improve scalability, but this introduces training complexity due to coordination across tiers. To address these issues, we propose SHeRL-FL, which integrates SL and hierarchical model aggregation and incorporates representation learning at intermediate layers. By allowing clients and edge servers to compute training objectives independently of the cloud, SHeRL-FL significantly reduces both coordination complexity and communication overhead. To evaluate the effectiveness and efficiency of SHeRL-FL, we performed experiments on image classification tasks using CIFAR-10, CIFAR-100, and HAM10000 with AlexNet, ResNet-18, and ResNet-50 in both IID and non-IID settings. In addition, we evaluate performance on image segmentation tasks using the ISIC-2018 dataset with a ResNet-50-based U-Net. Experimental results demonstrate that SHeRL-FL reduces data transmission by over 90% compared to centralized FL and HierFL, and by 50% compared to SplitFed, which is a hybrid of FL and SL, and further improves on hierarchical split learning methods.

[LG-40] Synthesize Retrieve and Propagate: A Unified Predictive Modeling Framework for Relational Databases

链接: https://arxiv.org/abs/2508.08327
作者: Ning Li,Kounianhua Du,Han Zhang,Quan Gan,Minjie Wang,David Wipf,Weinan Zhang
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Relational databases (RDBs) have become the industry standard for storing massive and heterogeneous data. However, despite the widespread use of RDBs across various fields, the inherent structure of relational databases hinders their ability to benefit from flourishing deep learning methods. Previous research has primarily focused on exploiting the unary dependency among multiple tables in a relational database using the primary key - foreign key relationships, either joining multiple tables into a single table or constructing a graph among them, which leaves the implicit composite relations among different tables and a substantial potential of improvement for predictive modeling unexplored. In this paper, we propose SRP, a unified predictive modeling framework that synthesizes features using the unary dependency, retrieves related information to capture the composite dependency, and propagates messages across a constructed graph to learn adjacent patterns for prediction on relation databases. By introducing a new retrieval mechanism into RDB, SRP is designed to fully capture both the unary and the composite dependencies within a relational database, thereby enhancing the receptive field of tabular data prediction. In addition, we conduct a comprehensive analysis on the components of SRP, offering a nuanced understanding of model behaviors and practical guidelines for future applications. Extensive experiments on five real-world datasets demonstrate the effectiveness of SRP and its potential applicability in industrial scenarios. The code is released at this https URL.

[LG-41] Weather-Driven Agricultural Decision-Making Using Digital Twins Under Imperfect Conditions

链接: https://arxiv.org/abs/2508.08326
作者: Tamim Ahmed,Monowar Hasan
类目: Machine Learning (cs.LG)
*备注: ACM SIGSPATIAL 2025

点击查看摘要

Abstract:By offering a dynamic, real-time virtual representation of physical systems, digital twin technology can enhance data-driven decision-making in digital agriculture. Our research shows how digital twins are useful for detecting inconsistencies in agricultural weather data measurements, which are key attributes for various agricultural decision-making and automation tasks. We develop a modular framework named Cerealia that allows end-users to check for data inconsistencies when perfect weather feeds are unavailable. Cerealia uses neural network models to check anomalies and aids end-users in informed decision-making. We develop a prototype of Cerealia using the NVIDIA Jetson Orin platform and test it with an operational weather network established in a commercial orchard as well as publicly available weather datasets.

[LG-42] Comparative study of machine learning and statistical methods for automatic identification and quantification in γ-ray spectrometry

链接: https://arxiv.org/abs/2508.08306
作者: Dinh Triem Phan,Jérôme Bobin,Cheick Thiam,Christophe Bobin
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:During the last decade, a large number of different numerical methods have been proposed to tackle the automatic identification and quantification in \gamma-ray spectrometry. However, the lack of common benchmarks, including datasets, code and comparison metrics, makes their evaluation and comparison hard. In that context, we propose an open-source benchmark that comprises simulated datasets of various \gamma-spectrometry settings, codes of different analysis approaches and evaluation metrics. This allows us to compare the state-of-the-art end-to-end machine learning with a statistical unmixing approach using the full spectrum. Three scenarios have been investigated: (1) spectral signatures are assumed to be known; (2) spectral signatures are deformed due to physical phenomena such as Compton scattering and attenuation; and (3) spectral signatures are shifted (e.g., due to temperature variation). A large dataset of 200000 simulated spectra containing nine radionuclides with an experimental natural background is used for each scenario with multiple radionuclides present in the spectrum. Regarding identification performance, the statistical approach consistently outperforms the machine learning approaches across all three scenarios for all comparison metrics. However, the performance of the statistical approach can be significantly impacted when spectral signatures are not modeled correctly. Consequently, the full-spectrum statistical approach is most effective with known or well-modeled spectral signatures, while end-to-end machine learning is a good alternative when measurement conditions are uncertain for radionuclide identification. Concerning the quantification task, the statistical approach provides accurate estimates of radionuclide counting, while the machine learning methods deliver less satisfactory results.

[LG-43] Probabilistic Emissivity Retrieval from Hyperspectral Data via Physics-Guided Variational Inference

链接: https://arxiv.org/abs/2508.08291
作者: Joshua R. Tempelman,Kevin Mitchell,Adam J. Wachtor,Eric B. Flynn
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: 14 figures

点击查看摘要

Abstract:Recent research has proven neural networks to be a powerful tool for performing hyperspectral imaging (HSI) target identification. However, many deep learning frameworks deliver a single material class prediction and operate on a per-pixel basis; such approaches are limited in their interpretability and restricted to predicting materials that are accessible in available training libraries. In this work, we present an inverse modeling approach in the form of a physics-conditioned generative model.A probabilistic latent-variable model learns the underlying distribution of HSI radiance measurements and produces the conditional distribution of the emissivity spectrum. Moreover, estimates of the HSI scene’s atmosphere and background are used as a physically relevant conditioning mechanism to contextualize a given radiance measurement during the encoding and decoding processes. Furthermore, we employ an in-the-loop augmentation scheme and physics-based loss criteria to avoid bias towards a predefined training material set and to encourage the model to learn physically consistent inverse mappings. Monte-Carlo sampling of the model’s conditioned posterior delivers a sought emissivity distribution and allows for interpretable uncertainty quantification. Moreover, a distribution-based material matching scheme is presented to return a set of likely material matches for an inferred emissivity distribution. Hence, we present a strategy to incorporate contextual information about a given HSI scene, capture the possible variation of underlying material spectra, and provide interpretable probability measures of a candidate material accounting for given remotely-sensed radiance measurement.

[LG-44] Constrained free energy minimization for the design of thermal states and stabilizer thermodynamic systems

链接: https://arxiv.org/abs/2508.09103
作者: Michele Minervini,Madison Chin,Jacob Kupperman,Nana Liu,Ivy Luo,Meghan Ly,Soorya Rethinasamy,Kathie Wang,Mark M. Wilde
类目: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 32 pages, 8 figures

点击查看摘要

Abstract:A quantum thermodynamic system is described by a Hamiltonian and a list of conserved, non-commuting charges, and a fundamental goal is to determine the minimum energy of the system subject to constraints on the charges. Recently, [Liu et al., arXiv:2505.04514] proposed first- and second-order classical and hybrid quantum-classical algorithms for solving a dual chemical potential maximization problem, and they proved that these algorithms converge to global optima by means of gradient-ascent approaches. In this paper, we benchmark these algorithms on several problems of interest in thermodynamics, including one- and two-dimensional quantum Heisenberg models with nearest and next-to-nearest neighbor interactions and with the charges set to the total x, y, and z magnetizations. We also offer an alternative compelling interpretation of these algorithms as methods for designing ground and thermal states of controllable Hamiltonians, with potential applications in molecular and material design. Furthermore, we introduce stabilizer thermodynamic systems as thermodynamic systems based on stabilizer codes, with the Hamiltonian constructed from a given code’s stabilizer operators and the charges constructed from the code’s logical operators. We benchmark the aforementioned algorithms on several examples of stabilizer thermodynamic systems, including those constructed from the one-to-three-qubit repetition code, the perfect one-to-five-qubit code, and the two-to-four-qubit error-detecting code. Finally, we observe that the aforementioned hybrid quantum-classical algorithms, when applied to stabilizer thermodynamic systems, can serve as alternative methods for encoding qubits into stabilizer codes at a fixed temperature, and we provide an effective method for warm-starting these encoding algorithms whenever a single qubit is encoded into multiple physical qubits.

[LG-45] Chartwin: a Case Study on Channel Charting-aided Localization in Dynamic Digital Network Twins

链接: https://arxiv.org/abs/2508.09055
作者: Lorenzo Cazzella,Francesco Linsalata,Mahdi Maleki,Damiano Badini,Matteo Matteucci,Umberto Spagnolini
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Wireless communication systems can significantly benefit from the availability of spatially consistent representations of the wireless channel to efficiently perform a wide range of communication tasks. Towards this purpose, channel charting has been introduced as an effective unsupervised learning technique to achieve both locally and globally consistent radio maps. In this letter, we propose Chartwin, a case study on the integration of localization-oriented channel charting with dynamic Digital Network Twins (DNTs). Numerical results showcase the significant performance of semi-supervised channel charting in constructing a spatially consistent chart of the considered extended urban environment. The considered method results in \approx 4.5 m localization error for the static DNT and \approx 6 m in the dynamic DNT, fostering DNT-aided channel charting and localization.

[LG-46] Subsampling Factorization Machine Annealing

链接: https://arxiv.org/abs/2508.08778
作者: Yusuke Hama,Tadashi Kadowaki
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 34 pages and 17 figures

点击查看摘要

Abstract:Quantum computing and machine learning are state-of-the-art technologies which have been investigated intensively in both academia and industry. The hybrid technology of these two ingredients is expected to be a powerful tool to solve complex problems in many branches of science and engineering, such as combinatorial optimization problems, and to accelerate the creation of next-generation technologies. In this work, we develop an algorithm to solve a black-box optimization problem by improving Factorization Machine Annealing (FMA) such that the training of a machine learning model called Factorization Machine is performed not on a full dataset but on a subdataset sampled from the full dataset: Subsampling Factorization Machine Annealing (SFMA). Through such a probabilistic training process, the ability of FMA to explore the solution space is enhanced. As a result, SFMA exhibits balanced performance in exploration and exploitation, which we call the exploration-exploitation functionality. We conduct numerical benchmarking tests to compare the performance of SFMA with that of FMA. Consequently, SFMA certainly exhibits the exploration-exploitation functionality and outperforms FMA in speed and accuracy. In addition, the performance of SFMA can be further improved by sequentially using two subsampling datasets with different sizes such that the size of the latter dataset is substantially smaller than the former. Such a substantial reduction not only enhances the exploration performance of SFMA but also enables us to run it with correspondingly low computational cost even for a large-scale problem. These results indicate the effectiveness of SFMA in a certain class of black-box optimization problems of significant size: the potential scalability of SFMA in solving large-scale problems with correspondingly low computational cost.

[LG-47] Bio-Inspired Artificial Neural Networks based on Predictive Coding

链接: https://arxiv.org/abs/2508.08762
作者: Davide Casnici,Charlotte Frenkel,Justin Dauwels
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Backpropagation (BP) of errors is the backbone training algorithm for artificial neural networks (ANNs). It updates network weights through gradient descent to minimize a loss function representing the mismatch between predictions and desired outputs. BP uses the chain rule to propagate the loss gradient backward through the network hierarchy, allowing efficient weight updates. However, this process requires weight updates at every layer to rely on a global error signal generated at the network’s output. In contrast, the Hebbian model of synaptic plasticity states that weight updates are local, depending only on the activity of pre- and post-synaptic neurons. This suggests biological brains likely do not implement BP directly. Recently, Predictive Coding (PC) has gained interest as a biologically plausible alternative that updates weights using only local information. Originating from 1950s work on signal compression, PC was later proposed as a model of the visual cortex and formalized under the free energy principle, linking it to Bayesian inference and dynamical systems. PC weight updates rely solely on local information and provide theoretical advantages such as automatic scaling of gradients based on uncertainty. This lecture notes column offers a novel, tutorial-style introduction to PC, focusing on its formulation, derivation, and connections to well-known optimization and signal processing algorithms such as BP and the Kalman Filter (KF). It aims to support existing literature by guiding readers from the mathematical foundations of PC to practical implementation, including Python examples using PyTorch.
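A minimal NumPy sketch of the local-update idea described above (this is an illustrative toy, not the column's own PyTorch code): a single-layer linear generative model x ≈ Wμ, where inference settles the latent μ by descending the prediction error, and the weight update then uses only the locally available quantities e (error) and μ (presynaptic activity), with no backpropagated global signal. The data and step sizes are hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # observed data (hypothetical example)
W = np.array([[0.5], [0.5], [0.5]])  # generative weights, prediction x_hat = W @ mu

errors = []
for epoch in range(50):
    # inference: settle the latent mu by gradient descent on the prediction error
    mu = np.zeros(1)
    for _ in range(100):
        e = x - (W @ mu)             # prediction error (a purely local signal)
        mu += 0.1 * (W.T @ e)
    # learning: local, Hebbian-style update -- error times presynaptic activity
    e = x - (W @ mu)
    W += 0.05 * np.outer(e, mu)
    errors.append(float(e @ e))

print(errors[0] > errors[-1])        # True: prediction error shrinks over epochs
```

Each epoch alternates inference (settling μ) and learning (a small step on W), both driven only by the layer-local error, which is the core contrast with BP drawn in the abstract.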

[LG-48] Sensitivity Analysis to Unobserved Confounding with Copula-based Normalizing Flows

链接: https://arxiv.org/abs/2508.08752
作者: Sourabh Balgi,Marc Braun,Jose M. Peña,Adel Daoud
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We propose a novel method for sensitivity analysis to unobserved confounding in causal inference. The method builds on a copula-based causal graphical normalizing flow that we term \rho-GNF, where \rho \in [-1,+1] is the sensitivity parameter. The parameter represents the non-causal association between exposure and outcome due to unobserved confounding, which is modeled as a Gaussian copula. In other words, the \rho-GNF enables scholars to estimate the average causal effect (ACE) as a function of \rho, accounting for various confounding strengths. The output of the \rho-GNF is what we term the \rho-curve, which provides the bounds for the ACE given an interval of assumed \rho values. The \rho-curve also enables scholars to identify the confounding strength required to nullify the ACE. We also propose a Bayesian version of our sensitivity analysis method. Assuming a prior over the sensitivity parameter \rho enables us to derive the posterior distribution over the ACE, from which credible intervals follow. Finally, leveraging experiments on simulated and real-world data, we show the benefits of our sensitivity analysis method.

[LG-49] Hierarchical Variable Importance with Statistical Control for Medical Data-Based Prediction

链接: https://arxiv.org/abs/2508.08724
作者: Joseph Paillard,Antoine Collas,Denis A. Engemann,Bertrand Thirion
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in machine learning have greatly expanded the repertoire of predictive methods for medical imaging. However, the interpretability of complex models remains a challenge, which limits their utility in medical applications. Recently, model-agnostic methods have been proposed to measure conditional variable importance and accommodate complex non-linear models. However, they often lack power when dealing with highly correlated data, a common problem in medical imaging. We introduce Hierarchical-CPI, a model-agnostic variable importance measure that frames the inference problem as the discovery of groups of variables that are jointly predictive of the outcome. By exploring subgroups along a hierarchical tree, it remains computationally tractable, yet also enjoys explicit family-wise error rate control. Moreover, we address the issue of vanishing conditional importance under high correlation with a tree-based importance allocation mechanism. We benchmarked Hierarchical-CPI against state-of-the-art variable importance methods. Its effectiveness is demonstrated in two neuroimaging datasets: classifying dementia diagnoses from MRI data (ADNI dataset) and analyzing the Berger effect on EEG data (TDBRAIN dataset), identifying biologically plausible variables.

[LG-50] DiffVolume: Diffusion Models for Volume Generation in Limit Order Books

链接: https://arxiv.org/abs/2508.08698
作者: Zhuohan Wang,Carmine Ventre
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Modeling limit order book (LOB) dynamics is a fundamental problem in market microstructure research. In particular, generating high-dimensional volume snapshots with strong temporal and liquidity-dependent patterns remains a challenging task, despite recent work exploring the application of Generative Adversarial Networks to LOBs. In this work, we propose a conditional Diffusion model for the generation of future LOB Volume snapshots (DiffVolume). We evaluate our model across three axes: (1) Realism, where we show that DiffVolume, conditioned on past volume history and time of day, better reproduces statistical properties such as marginal distribution, spatial correlation, and autocorrelation decay; (2) Counterfactual generation, allowing for controllable generation under hypothetical liquidity scenarios by additionally conditioning on a target future liquidity profile; and (3) Downstream prediction, where we show that the synthetic counterfactual data from our model improves the performance of future liquidity forecasting models. Together, these results suggest that DiffVolume provides a powerful and flexible framework for realistic and controllable LOB volume generation.

[LG-51] In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality

链接: https://arxiv.org/abs/2508.08673
作者: Chenrui Liu,Falong Tan,Chuanlong Xie,Yicheng Zeng,Lixing Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper investigates the expected excess risk of In-Context Learning (ICL) for multiclass classification. We model each task as a sequence of labeled prompt samples and a query input, where a pre-trained model estimates the conditional class probabilities of the query. The expected excess risk is defined as the average truncated Kullback-Leibler (KL) divergence between the predicted and ground-truth conditional class distributions, averaged over a specified family of tasks. We establish a new oracle inequality for the expected excess risk based on KL divergence in multiclass classification. This allows us to derive tight upper and lower bounds for the expected excess risk in transformer-based models, demonstrating that the ICL estimator achieves the minimax optimal rate - up to a logarithmic factor - for conditional probability estimation. From a technical standpoint, our results introduce a novel method for controlling generalization error using the uniform empirical covering entropy of the log-likelihood function class. Furthermore, we show that multilayer perceptrons (MLPs) can also perform ICL and achieve this optimal rate under specific assumptions, suggesting that transformers may not be the exclusive architecture capable of effective ICL.

[LG-52] Projection-based multifidelity linear regression for data-scarce applications

链接: https://arxiv.org/abs/2508.08517
作者: Vignesh Sella,Julie Pham,Karen Willcox,Anirban Chaudhuri
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 23 pages, 7 figures, submitted to Machine Learning for Computational Science and Engineering special issue Accelerating Numerical Methods With Scientific Machine Learning

点击查看摘要

Abstract:Surrogate modeling for systems with high-dimensional quantities of interest remains challenging, particularly when training data are costly to acquire. This work develops multifidelity methods for multiple-input multiple-output linear regression targeting data-limited applications with high-dimensional outputs. Multifidelity methods integrate many inexpensive low-fidelity model evaluations with limited, costly high-fidelity evaluations. We introduce two projection-based multifidelity linear regression approaches that leverage principal component basis vectors for dimensionality reduction and combine multifidelity data through: (i) a direct data augmentation using low-fidelity data, and (ii) a data augmentation incorporating explicit linear corrections between low-fidelity and high-fidelity data. The data augmentation approaches combine high-fidelity and low-fidelity data into a unified training set and train the linear regression model through weighted least squares with fidelity-specific weights. Various weighting schemes and their impact on regression accuracy are explored. The proposed multifidelity linear regression methods are demonstrated on approximating the surface pressure field of a hypersonic vehicle in flight. In a low-data regime of no more than ten high-fidelity samples, multifidelity linear regression achieves approximately 3% - 12% improvement in median accuracy compared to single-fidelity methods with comparable computational cost.
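The paper's projection onto principal component bases for high-dimensional outputs is omitted here; this sketch (hypothetical 1-D inputs, scalar output) illustrates only the first data-augmentation idea from the abstract: pool scarce high-fidelity and abundant low-fidelity samples into one training set and solve a weighted least-squares fit with fidelity-specific weights. The functions, weights, and sample sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
f_hi = lambda x: 1.0 + 2.0 * x            # costly high-fidelity truth
f_lo = lambda x: 1.0 + 2.0 * x + 0.3      # cheap low-fidelity model with bias

X_hi = rng.uniform(0, 3, size=8)          # scarce high-fidelity samples
X_lo = rng.uniform(0, 3, size=200)        # abundant low-fidelity samples
y_hi = f_hi(X_hi) + 0.1 * rng.normal(size=8)   # noisy expensive measurements
y_lo = f_lo(X_lo)

def features(x):                          # simple linear basis [1, x]
    return np.stack([np.ones_like(x), x], axis=1)

# data augmentation: pool fidelities, weight high-fidelity rows more heavily
A = np.vstack([features(X_hi), features(X_lo)])
y = np.concatenate([y_hi, y_lo])
w = np.concatenate([np.full(8, 50.0), np.full(200, 1.0)])
sw = np.sqrt(w)[:, None]
coef, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)

X_t = np.linspace(0, 3, 50)
err_mf = np.mean((features(X_t) @ coef - f_hi(X_t)) ** 2)

# a low-fidelity-only fit inherits the full 0.3 bias
coef_lo, *_ = np.linalg.lstsq(features(X_lo), y_lo, rcond=None)
err_lo = np.mean((features(X_t) @ coef_lo - f_hi(X_t)) ** 2)
print(err_mf < err_lo)                    # True
```

The weighting pulls the pooled fit toward the few trusted high-fidelity points while the cheap samples stabilize the slope; the paper's second variant would additionally learn an explicit linear correction between the fidelities.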

[LG-53] Language Models Can Understand Spectra: A Multimodal Model for Molecular Structure Elucidation

Link: https://arxiv.org/abs/2508.08441
Authors: Yunyue Su, Jiahui Chen, Zao Jiang, Zhenyi Zhong, Liang Wang, Qiang Liu
Subjects: Quantitative Methods (q-bio.QM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*Comments: 22 pages, 3 figures, 11 tables

Click to view abstract

Abstract:Structure elucidation is a fundamental technique for understanding the microscopic composition of matter and is widely applied across various disciplines in the natural sciences and engineering. However, existing methods often rely heavily on prior databases or known structural information, making it difficult to resolve unknown structures. In addition, complex structures typically require the joint analysis of multiple spectroscopic modalities. This process heavily depends on expert domain knowledge and is often accompanied by high costs in terms of both time and instrumentation. To address these challenges, we propose SpectraLLM, the first large language model designed to support multi-modal spectroscopic joint reasoning. SpectraLLM is capable of processing either single or multiple spectroscopic inputs and performing end-to-end structure elucidation. By integrating continuous and discrete spectroscopic modalities into a shared semantic space, SpectraLLM learns to uncover substructural patterns that are consistent and complementary across spectra, enabling precise molecular structure elucidation. We pretrain and fine-tune SpectraLLM in the domain of small molecules, and evaluate it on six standardized, publicly available chemical datasets. The model achieves state-of-the-art performance, significantly outperforming existing approaches trained on single modalities. Notably, SpectraLLM demonstrates strong robustness and generalization even for single-spectrum inference, while its multi-modal reasoning capability further improves the accuracy of structural prediction.

[LG-54] CFM-GP: Unified Conditional Flow Matching to Learn Gene Perturbation Across Cell Types

Link: https://arxiv.org/abs/2508.08312
Authors: Abrar Rahman Abir, Sajib Acharjee Dip, Liqing Zhang
Subjects: Genomics (q-bio.GN); Machine Learning (cs.LG)
*Comments: 28 pages, 19 tables, 8 figures

Click to view abstract

Abstract:Understanding gene perturbation effects across diverse cellular contexts is a central challenge in functional genomics, with important implications for therapeutic discovery and precision medicine. Single-cell technologies enable high-resolution measurement of transcriptional responses, but collecting such data is costly and time-consuming, especially when repeated for each cell type. Existing computational methods often require separate models per cell type, limiting scalability and generalization. We present CFM-GP, a method for cell type-agnostic gene perturbation prediction. CFM-GP learns a continuous, time-dependent transformation between unperturbed and perturbed gene expression distributions, conditioned on cell type, allowing a single model to predict across all cell types. Unlike prior approaches that use discrete modeling, CFM-GP employs a flow matching objective to capture perturbation dynamics in a scalable manner. We evaluate on five datasets: SARS-CoV-2 infection, IFN-beta stimulated PBMCs, glioblastoma treated with Panobinostat, lupus under IFN-beta stimulation, and Statefate progenitor fate mapping. CFM-GP consistently outperforms state-of-the-art baselines in R-squared and Spearman correlation, and pathway enrichment analysis confirms recovery of key biological pathways. These results demonstrate the robustness and biological fidelity of CFM-GP as a scalable solution for cross-cell type gene perturbation prediction.
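The flow matching training signal at the heart of CFM-GP can be sketched as follows. The linear interpolation path and all array names are illustrative assumptions; the paper's exact parameterization and conditioning may differ:

```python
import numpy as np

# Sketch of a conditional flow matching (CFM) training target:
# x0 = unperturbed expression, x1 = perturbed expression. A cell-type
# condition would additionally be fed to the learned velocity field.
rng = np.random.default_rng(1)

n_cells, n_genes = 4, 5
x0 = rng.normal(size=(n_cells, n_genes))   # unperturbed expression
x1 = rng.normal(size=(n_cells, n_genes))   # perturbed expression
t = rng.uniform(size=(n_cells, 1))         # random interpolation times in [0, 1]

# Linear interpolation path and its constant target velocity.
x_t = (1.0 - t) * x0 + t * x1
u_target = x1 - x0

# A network v_theta(x_t, t, cell_type) would be regressed onto u_target with
# an MSE loss; integrating the learned field from t=0 to t=1 then maps
# unperturbed cells to predicted perturbed profiles.
loss_example = np.mean((u_target - 0.0) ** 2)  # placeholder "network outputs 0"
```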

[LG-55] On Experiments

Link: https://arxiv.org/abs/2508.08288
Authors: Brendan van Rooyen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*Comments:

Click to view abstract

Abstract:The scientific process is a means for turning the results of experiments into knowledge about the world in which we live. Much research effort has been directed toward automating this process. To do this, one needs to formulate the scientific process in a precise mathematical language. This paper outlines one such language. What is presented here is hardly new. The material leans much on great thinkers of times past as well as more modern contributions. The novel contributions of this paper are: a new, general data processing inequality; a bias-variance decomposition for canonical losses; streamlined proofs of the Blackwell-Sherman-Stein and Randomization Theorems; and means to calculate deficiency via linear programming.
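As an illustration of the "bias-variance decomposition for canonical losses" mentioned above, the classical squared-loss case (the standard textbook form, not necessarily the paper's more general statement) reads, for data y = f(x) + ε with E[ε] = 0 and Var(ε) = σ²:

```latex
\mathbb{E}\big[(\hat{f}(x) - y)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The paper generalizes this kind of decomposition beyond squared loss to other canonical losses.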

[LG-56] Evaluating Imputation Techniques for Short-Term Gaps in Heart Rate Data

Link: https://arxiv.org/abs/2508.08268
Authors: Vaibhav Gupta, Maria Maleshkova
Subjects: Applications (stat.AP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Recent advances in wearable technology have enabled the continuous monitoring of vital physiological signals, essential for predictive modeling and early detection of extreme physiological events. Among these physiological signals, heart rate (HR) plays a central role, as it is widely used in monitoring and managing cardiovascular conditions and detecting extreme physiological events such as hypoglycemia. However, data from wearable devices often suffer from missing values. To address this issue, recent studies have employed various imputation techniques. Traditionally, the effectiveness of these methods has been evaluated using predictive accuracy metrics such as RMSE, MAPE, and MAE, which assess numerical proximity to the original data. While informative, these metrics fail to capture the complex statistical structure inherent in physiological signals. This study bridges this gap by presenting a comprehensive evaluation of four statistical imputation methods (linear interpolation, K-Nearest Neighbors (KNN), Piecewise Cubic Hermite Interpolating Polynomial (PCHIP), and B-splines) for short-term HR data gaps. We assess their performance using both predictive accuracy metrics and statistical distance measures, including the Cohen Distance Test (CDT) and Jensen-Shannon Distance (JS Distance), applied to HR data from the D1NAMO dataset and the BIG IDEAs Lab Glycemic Variability and Wearable Device dataset. The analysis reveals limitations in existing imputation approaches and the absence of a robust framework for evaluating imputation quality in physiological signals. Finally, this study proposes a foundational framework to develop a composite evaluation metric to assess imputation performance.
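The two evaluation views contrasted above can be demonstrated on a toy HR series: impute a short gap with linear interpolation, then score it with RMSE (numerical proximity) and a Jensen-Shannon distance between value distributions (statistical shape). The series, the 5-sample gap, and the histogram binning are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic heart-rate series with a short missing gap.
t = np.arange(60)
hr_true = 70 + 5 * np.sin(t / 6) + rng.normal(scale=0.5, size=t.size)
gap = slice(25, 30)                 # 5 consecutive missing samples
observed = hr_true.copy()
observed[gap] = np.nan

# Linear interpolation over the gap.
mask = ~np.isnan(observed)
imputed = observed.copy()
imputed[~mask] = np.interp(t[~mask], t[mask], observed[mask])

# Predictive accuracy metric on the gap only.
rmse = np.sqrt(np.mean((imputed[gap] - hr_true[gap]) ** 2))

def js_distance(p_samples, q_samples, bins=10):
    """Jensen-Shannon distance (base 2) between two empirical distributions."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0
        return np.sum(a[nz] * np.log2(a[nz] / b[nz]))
    d2 = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return np.sqrt(max(d2, 0.0))    # clamp tiny negative float error

jsd = js_distance(hr_true, imputed)
```

A low RMSE with a noticeably nonzero JS distance is exactly the kind of disagreement the study argues the usual metrics miss.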

Information Retrieval

[IR-0] Mitigating Popularity Bias in Counterfactual Explanations using Large Language Models

Link: https://arxiv.org/abs/2508.08946
Authors: Arjan Hasami, Masoud Mansoury
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Counterfactual explanations (CFEs) offer a tangible and actionable way to explain recommendations by showing users a “what-if” scenario that demonstrates how small changes in their history would alter the system’s output. However, existing CFE methods are susceptible to bias, generating explanations that might misalign with the user’s actual preferences. In this paper, we propose a pre-processing step that leverages large language models to filter out-of-character history items before generating an explanation. In experiments on two public datasets, we focus on popularity bias and apply our approach to ACCENT, a neural CFE framework. We find that it creates counterfactuals that are more closely aligned with each user’s popularity preferences than ACCENT alone.
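The pre-processing step described above can be illustrated with a toy sketch. The paper uses a large language model to judge out-of-character items; the snippet below substitutes a crude popularity heuristic purely to show where the filter sits before counterfactual generation (all item names and the 0.3 threshold are invented):

```python
import numpy as np

# Assumed item popularity scores in [0, 1] and a user's interaction history.
item_popularity = {"i1": 0.95, "i2": 0.10, "i3": 0.12, "i4": 0.90, "i5": 0.08}
user_history = ["i1", "i2", "i3", "i5"]

# Characterize the user's typical taste by the median popularity of their
# history, then drop items far from it (here the blockbuster "i1").
pops = np.array([item_popularity[i] for i in user_history])
median_pop = np.median(pops)
filtered = [i for i, p in zip(user_history, pops) if abs(p - median_pop) <= 0.3]
print(filtered)  # → ['i2', 'i3', 'i5']
```

The counterfactual explainer (ACCENT in the paper) would then search over `filtered` instead of the raw history.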

[IR-1] Recent Advances and Trends in Research Paper Recommender Systems: A Comprehensive Survey

Link: https://arxiv.org/abs/2508.08828
Authors: Iratxe Pinedo, Mikel Larrañaga, Ana Arruarte
Subjects: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
*Comments:

Click to view abstract

Abstract:As the volume of scientific publications grows exponentially, researchers increasingly face difficulties in locating relevant literature. Research Paper Recommender Systems have become vital tools to mitigate this information overload by delivering personalized suggestions. This survey provides a comprehensive analysis of Research Paper Recommender Systems developed between November 2021 and December 2024, building upon prior reviews in the field. It presents an extensive overview of the techniques and approaches employed, the datasets utilized, the evaluation metrics and procedures applied, and the status of both enduring and emerging challenges observed during the research. Unlike prior surveys, this survey goes beyond merely cataloguing techniques and models, providing a thorough examination of how these methods are implemented across different stages of the recommendation process. By furnishing a detailed and structured reference, this work aims to serve as a consultative resource for the research community, supporting informed decision-making and guiding future investigations into effective Research Paper Recommender Systems.

[IR-2] Comprehensive Comparison Network: a framework for locality-aware routes-comparable and interpretable route recommendation

Link: https://arxiv.org/abs/2508.08745
Authors: Chao Chen, Longfei Xu, Hanyu Guo, Chengzhang Wang, Ying Wang, Kaikui Liu, Xiangxiang Chu
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Route recommendation (RR) is a core task of route planning in the Amap app, with the goal of recommending the optimal route among candidate routes to users. Unlike traditional recommendation methods, insights into the local quality of routes and comparisons between candidate routes are crucial for enhancing recommendation performance but often overlooked in previous studies. To achieve these, we propose a novel model called Comprehensive Comparison Network (CCN). CCN not only uses query-level features (e.g. user features) and item-level features (e.g. route features, item embedding) that are common in traditional recommendations, but also introduces comparison-level features which describe the non-overlapping segments between different routes to capture the local quality of routes. The key component Comprehensive Comparison Block (CCB) in CCN is designed to enable comparisons between routes. CCB includes a Comprehensive Comparison Operator (CCO) and a multi-scenario MLP, which can update the representations of candidate routes based on a comprehensive comparison. By stacking multiple CCBs, CCN can determine the final scores of candidate routes and recommend the optimal one to the user. Additionally, since routes directly affect the costs and risks experienced by users, the RR model must be interpretable for online deployment. Therefore, we designed an interpretable pair scoring network to achieve interpretability. Both offline and online experiments demonstrate that CCN significantly improves RR performance and exhibits strong interpretability. CCN has been fully deployed in the Amap app for over a year, providing stable and optimal benefits for route recommendations.
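The comparison-level features above can be illustrated with a toy example. Segment names, costs, and the "lower cost on differing segments wins" rule are all invented for illustration; in CCN this comparison is learned by the Comprehensive Comparison Operator and a multi-scenario MLP rather than fixed by hand:

```python
# Represent each candidate route as its segments with travel costs.
route_a = {"s1": 3.0, "s2": 5.0, "s3": 2.0}   # segment id -> cost
route_b = {"s1": 3.0, "s4": 4.0, "s5": 2.5}

# Comparison-level features describe only the non-overlapping segments.
shared = route_a.keys() & route_b.keys()
only_a = {s: c for s, c in route_a.items() if s not in shared}
only_b = {s: c for s, c in route_b.items() if s not in shared}

# A toy pairwise comparison: cost difference on the differing segments.
delta = sum(only_a.values()) - sum(only_b.values())
better = "A" if delta < 0 else "B"
```

The point of the construction is that the shared segment `s1` contributes nothing to the comparison, so the score reflects local route quality where the candidates actually differ.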

[IR-3] Eat your own KR: a KR-based approach to index Semantic Web Endpoints and Knowledge Graphs

Link: https://arxiv.org/abs/2508.08713
Authors: Pierre Maillot (WIMMICS), Catherine Faron (UniCA, I3S, WIMMICS), Fabien Gandon (WIMMICS), Franck Michel (Laboratoire I3S - SPARKS, WIMMICS), Pierre Monnin (WIMMICS)
Subjects: Information Retrieval (cs.IR)
*Comments:

Click to view abstract

Abstract:Over the last decade, knowledge graphs have multiplied, grown, and evolved on the World Wide Web, and the advent of new standards, vocabularies, and application domains has accelerated this trend. IndeGx is a framework leveraging an extensible base of rules to index the content of KGs and the capacities of their SPARQL endpoints. In this article, we show how knowledge representation (KR) and reasoning methods and techniques can be used in a reflexive manner to index and characterize existing knowledge graphs (KG) with respect to their usage of KR methods and techniques. We extended IndeGx with a fully ontology-oriented modeling and processing approach to do so. Using SPARQL rules and an OWL RL ontology of the indexing domain, IndeGx can now build and reason over an index of the contents and characteristics of an open collection of public knowledge graphs. Our extension of the framework relies on a declarative representation of procedural knowledge and collaborative environments (e.g., GitHub) to provide an agile, customizable, and expressive KR approach for building and maintaining such an index of knowledge graphs in the wild. In doing so, we help anyone answer the question of what knowledge is out there in the world wild Semantic Web in general, and we also help our community monitor which KR research results are used in practice. In particular, this article provides a snapshot of the state of the Semantic Web regarding supported standard languages, ontology usage, and diverse quality evaluations by applying this method to a collection of over 300 open knowledge graph endpoints.

Attachments

Click to download today's full paper list