This blog post presents the latest paper list retrieved from arXiv.org on 2025-03-28. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email on a schedule, please leave your email address in the comments.
Friendly reminder: if you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-03-28)
A total of 482 papers were updated today, including:
- Natural Language Processing: 78 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 110 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 131 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 117 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion
【Quick Read】: This paper tackles the problem of generating realistic, personalized motion sequences conditioned on multi-modal inputs covering both content and style. Existing methods focus either on generating diverse motion content or on transferring style from motion sequences, and fail to combine broad content coverage with consistent style. The key innovation is a style-content cross fusion mechanism, together with aligning a style encoder to a pre-trained multi-modal model, which ensures that the generated motion accurately captures the reference style while remaining highly realistic. The approach synthesizes motion across a wide range of content types, supports multi-modal stylization, and exhibits more nuanced motion generation.
Link: https://arxiv.org/abs/2503.21775
Authors: Ziyu Guo,Young Yoon Lee,Joseph Liu,Yizhak Ben-Shabat,Victor Zordan,Mubbasir Kapadia
Affiliations: CUHK; Roblox; Stylemotif
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project Page: this https URL
Abstract:We present StyleMotif, a novel Stylized Motion Latent Diffusion model, generating motion conditioned on both content and style from multiple modalities. Unlike existing approaches that either focus on generating diverse motion content or transferring style from sequences, StyleMotif seamlessly synthesizes motion across a wide range of content while incorporating stylistic cues from multi-modal inputs, including motion, text, image, video, and audio. To achieve this, we introduce a style-content cross fusion mechanism and align a style encoder with a pre-trained multi-modal model, ensuring that the generated motion accurately captures the reference style while preserving realism. Extensive experiments demonstrate that our framework surpasses existing methods in stylized motion generation and exhibits emergent capabilities for multi-modal motion stylization, enabling more nuanced motion synthesis. Source code and pre-trained models will be released upon acceptance. Project Page: this https URL
zh
[NLP-1] MemInsight: Autonomous Memory Augmentation for LLM Agents
【Quick Read】: This paper addresses the challenges that large language model (LLM) agents face with long-term memory, namely growing memory size and the need for semantic structuring. It proposes MemInsight, an autonomous memory augmentation approach whose key idea is to autonomously augment historical interactions, improving semantic data representation and retrieval so that LLM agents can deliver more accurate and contextually relevant responses. Experiments validate MemInsight on three task scenarios: conversational recommendation, question answering, and event summarization, boosting recommendation persuasiveness by up to 14% and improving recall on LoCoMo retrieval by 34% over a RAG baseline.
Link: https://arxiv.org/abs/2503.21760
Authors: Rana Salama,Jason Cai,Michelle Yuan,Anna Currey,Monica Sunkara,Yi Zhang,Yassine Benajiba
Affiliations: AWS AI
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language model (LLM) agents have evolved to intelligently process information, make decisions, and interact with users or tools. A key capability is the integration of long-term memory capabilities, enabling these agents to draw upon historical interactions and knowledge. However, the growing memory size and need for semantic structuring pose significant challenges. In this work, we propose an autonomous memory augmentation approach, MemInsight, to enhance semantic data representation and retrieval mechanisms. By leveraging autonomous augmentation to historical interactions, LLM agents are shown to deliver more accurate and contextualized responses. We empirically validate the efficacy of our proposed approach in three task scenarios; conversational recommendation, question answering and event summarization. On the LLM-REDIAL dataset, MemInsight boosts persuasiveness of recommendations by up to 14%. Moreover, it outperforms a RAG baseline by 34% in recall for LoCoMo retrieval. Our empirical results show the potential of MemInsight to enhance the contextual performance of LLM agents across multiple tasks.
zh
[NLP-2] GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics
【Quick Read】: This paper addresses the core challenge of ensuring reliable and effective software release decisions, in particular parsing release validation data in safety-critical domains such as automotive. Traditional approaches rely on manual analysis of large test datasets and validation metrics, which is slow and costly. Large language models (LLMs) offer a potential alternative, but their limitations in analytical reasoning, contextual understanding, handling out-of-scope queries, and consistently processing structured test data hinder direct use in safety-critical settings.
The key contribution is GateLens, an LLM-based tool for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code; the RA module enables efficient query parsing and execution with high accuracy and robustness (an illustrative sketch of this RA-to-code idea follows this entry). Evaluations show that GateLens outperforms the baseline system on benchmark datasets with higher F1 scores and handles complex and ambiguous queries more reliably. Industrial evaluation further shows that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability, achieves strong generalization without relying on few-shot examples, and offers practical guidance for integrating AI into critical workflows such as release validation.
Link: https://arxiv.org/abs/2503.21735
Authors: Arsham Gholamzadeh Khoee,Shuai Wang,Yinan Yu,Robert Feldt,Dhasarathy Parthasarathy
Affiliations: Chalmers University of Technology; Volvo Group
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:
Abstract:Ensuring the reliability and effectiveness of software release decisions is critical, particularly in safety-critical domains like automotive systems. Precise analysis of release validation data, often presented in tabular form, plays a pivotal role in this process. However, traditional methods that rely on manual analysis of extensive test datasets and validation metrics are prone to delays and high costs. Large Language Models (LLMs) offer a promising alternative but face challenges in analytical reasoning, contextual understanding, handling out-of-scope queries, and processing structured test data consistently; limitations that hinder their direct application in safety-critical scenarios. This paper introduces GateLens, an LLM-based tool for analyzing tabular data in the automotive domain. GateLens translates natural language queries into Relational Algebra (RA) expressions and then generates optimized Python code. It outperforms the baseline system on benchmarking datasets, achieving higher F1 scores and handling complex and ambiguous queries with greater robustness. Ablation studies confirm the critical role of the RA module, with performance dropping sharply when omitted. Industrial evaluations reveal that GateLens reduces analysis time by over 80% while maintaining high accuracy and reliability. As demonstrated by presented results, GateLens achieved high performance without relying on few-shot examples, showcasing strong generalization across various query types from diverse company roles. Insights from deploying GateLens with a partner automotive company offer practical guidance for integrating AI into critical workflows such as release validation. Results show that by automating test result analysis, GateLens enables faster, more informed, and dependable release decisions, and can thus advance software scalability and reliability in automotive systems.
zh
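To make the natural-language-to-Relational-Algebra idea above concrete, here is a minimal, hypothetical Python sketch: a toy test-result table and a single selection-plus-projection step executed with pandas. The table, column names, and query are invented for illustration and are not GateLens's actual pipeline or data.

```python
import pandas as pd

# Toy release-validation table (invented columns and values).
tests = pd.DataFrame({
    "test_id": ["T1", "T2", "T3", "T4"],
    "release": ["1.2", "1.2", "1.1", "1.2"],
    "verdict": ["fail", "pass", "fail", "fail"],
})

def select_project(df, predicate, columns):
    """Execute a tiny relational-algebra plan: sigma_{predicate} followed by pi_{columns}."""
    return df.loc[predicate(df), columns]

# "Which test cases failed in release 1.2?"
# -> pi_test_id( sigma_{release='1.2' AND verdict='fail'}( tests ) )
failed = select_project(
    tests,
    lambda d: (d["release"] == "1.2") & (d["verdict"] == "fail"),
    ["test_id"],
)
print(failed)  # T1 and T4
```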
[NLP-3] Effective Skill Unlearning through Intervention and Abstention NAACL2025
【Quick Read】: This paper addresses skill unlearning in large language models (LLMs): selectively removing a specific skill while retaining the model's overall capabilities, using lightweight, training-free methods.
The key idea is to use the model's internal mechanisms to distinguish and intervene on what triggers different skills. First, the authors observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) shifts depending on which skill the model is exercising; second, queries that trigger the same skill cluster in the FFL key space and can be separated from other queries by a hypercube. Based on these observations, the paper proposes two lightweight, training-free unlearning methods: Neuron Adjust (via intervention) and Key Space Detection (via abstention). Experiments show that Key Space Detection achieves over 80% relative performance drop on the target skill while keeping the drop on other skills and general knowledge (the MMLU benchmark) under 10%, demonstrating both effectiveness and precision (a minimal sketch of the key-space idea follows this entry).
Link: https://arxiv.org/abs/2503.21730
Authors: Yongce Li,Chung-En Sun,Tsui-Wei Weng
Affiliations: UCSD HDSI; UCSD CSE; UCSD HDSI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to NAACL 2025 main conference
Abstract:Large Language Models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention respectively: Neuron Adjust and Key Space Detection. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, Key Space Detection achieves over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model’s general knowledge (MMLU) for most unlearning tasks. Our code is available at this https URL
zh
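The Key Space Detection idea described above (queries of a skill clustering inside a hypercube in the feed-forward key space) can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: how key vectors are extracted from the model and how the margin is chosen are left open here.

```python
import numpy as np

def fit_hypercube(skill_keys: np.ndarray, margin: float = 0.05):
    """Fit an axis-aligned hypercube around key-space vectors of the skill to unlearn.

    skill_keys: (n, d) array of FFL key-space vectors collected from queries
    that exercise the target skill.
    """
    lo, hi = skill_keys.min(axis=0), skill_keys.max(axis=0)
    pad = margin * (hi - lo)
    return lo - pad, hi + pad

def should_abstain(query_key: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> bool:
    """Abstain (refuse to answer) if the query's key vector lies inside the hypercube."""
    return bool(np.all(query_key >= lo) and np.all(query_key <= hi))

# Toy usage with random vectors standing in for real key-space features.
rng = np.random.default_rng(0)
skill_keys = rng.normal(loc=2.0, scale=0.1, size=(100, 8))   # clustered "math" queries
lo, hi = fit_hypercube(skill_keys)
print(should_abstain(rng.normal(2.0, 0.1, 8), lo, hi))  # likely True  -> abstain
print(should_abstain(rng.normal(0.0, 0.1, 8), lo, hi))  # likely False -> answer normally
```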
[NLP-4] ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation
【Quick Read】: This paper targets the limited factual accuracy of large reasoning models (LRMs) in question answering, which stems from their reliance on parametric knowledge. Although recent work adds retrieval to RL-based LRMs, those methods suffer from overthinking and lack robustness in reasoning, limiting their practical effectiveness. The proposed factuality-enhanced reasoning model, ReaRAG, explores diverse queries without excessive iterations. Its key component is a data construction framework with an upper bound on reasoning-chain length and a predefined action space (Search and Finish) that guides the reasoning process: for a Search action, a query is issued to the RAG engine and the returned result serves as an observation for subsequent reasoning steps, and the loop ends when a Finish action is chosen (a minimal sketch of this loop follows this entry). The approach improves factuality, outperforms baselines on multi-hop QA, and shows strong ability to recognize errors and refine its reasoning trajectory.
Link: https://arxiv.org/abs/2503.21729
Authors: Zhicheng Lee,Shulin Cao,Jinxin Liu,Jiajie Zhang,Weichuan Liu,Xiaoyin Che,Lei Hou,Juanzi Li
Affiliations: Tsinghua University; Siemens AG
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy. While recent works equip reinforcement learning (RL)-based LRMs with retrieval capabilities, they suffer from overthinking and lack robustness in reasoning, reducing their effectiveness in question answering (QA) tasks. To address this, we propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations. Our solution includes a novel data construction framework with an upper bound on the reasoning chain length. Specifically, we first leverage an LRM to generate deliberate thinking, then select an action from a predefined action space (Search and Finish). For Search action, a query is executed against the RAG engine, where the result is returned as observation to guide reasoning steps later. This process iterates until a Finish action is chosen. Benefiting from ReaRAG’s strong reasoning capabilities, our approach outperforms existing baselines on multi-hop QA. Further analysis highlights its strong reflective ability to recognize errors and refine its reasoning trajectory. Our study enhances LRMs’ factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG).
zh
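A minimal sketch of the Search/Finish control loop described above. The `lrm_generate` and `rag_search` callables, the prompt format, and the step budget are assumptions for illustration; the paper's actual data construction and action parsing differ.

```python
def rearag_loop(lrm_generate, rag_search, question, max_steps=8):
    """Iterate thought -> action until a Finish action or the chain-length bound is hit.

    lrm_generate(context) is assumed to return (thought, action, argument),
    where action is "Search" (argument = query) or "Finish" (argument = answer).
    rag_search(query) is assumed to return a text observation from the RAG engine.
    """
    context = [f"Question: {question}"]
    for _ in range(max_steps):                 # upper bound on reasoning chain length
        thought, action, arg = lrm_generate("\n".join(context))
        context.append(f"Thought: {thought}")
        if action == "Finish":
            return arg                          # final answer
        observation = rag_search(arg)           # action == "Search"
        context.append(f"Search: {arg}\nObservation: {observation}")
    return None                                 # budget exhausted without a Finish action
```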
[NLP-5] Collab: Controlled Decoding using Mixture of Agents for LLM Alignment ICLR2025
【Quick Read】: This paper addresses the high computational cost of aligning large language models (LLMs) via reinforcement learning from human feedback (RLHF), which requires updating billions of parameters, and the difficulty single-agent decoding approaches have in adapting to diverse tasks. The key innovation, Collab, is a mixture-of-agents decoding strategy that dynamically selects the most suitable model at inference time for efficient alignment and collaboration. Concretely, existing off-the-shelf aligned LLM policies are treated as agents, and a token-level selection mechanism switches among them to maximize a long-term utility (a minimal sketch of this token-level switching follows this entry). This policy-switching mechanism picks the best model at each step and substantially improves target-task performance, surpassing the state-of-the-art single-agent decoding baselines with up to 1.56x higher average reward and a 71.89% improvement in GPT-4-based win-tie rate.
Link: https://arxiv.org/abs/2503.21720
Authors: Souradip Chakraborty,Sujay Bhatt,Udari Madhushani Sehwag,Soumya Suvra Ghosal,Jiahao Qiu,Mengdi Wang,Dinesh Manocha,Furong Huang,Alec Koppel,Sumitra Ganesh
Affiliations: JPMorgan AI Research; University of Maryland, College Park; Princeton University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICLR 2025
Abstract:Alignment of Large Language models (LLMs) is crucial for safe and trustworthy deployment in applications. Reinforcement learning from human feedback (RLHF) has emerged as an effective technique to align LLMs to human preferences and broader utilities, but it requires updating billions of model parameters, which is computationally expensive. Controlled Decoding, by contrast, provides a mechanism for aligning a model at inference time without retraining. However, single-agent decoding approaches often struggle to adapt to diverse tasks due to the complexity and variability inherent in these tasks. To strengthen the test-time performance w.r.t the target task, we propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies. Treating each prior policy as an agent in the spirit of mixture of agent collaboration, we develop a decoding method that allows for inference-time alignment through a token-level selection strategy among multiple agents. For each token, the most suitable LLM is dynamically chosen from a pool of models based on a long-term utility metric. This policy-switching mechanism ensures optimal model selection at each step, enabling efficient collaboration and alignment among LLMs during decoding. Theoretical analysis of our proposed algorithm establishes optimal performance with respect to the target task represented via a target reward for the given off-the-shelf models. We conduct comprehensive empirical evaluations with open-source aligned models on diverse tasks and preferences, which demonstrates the merits of this approach over single-agent decoding baselines. Notably, Collab surpasses the current SoTA decoding strategy, achieving an improvement of up to 1.56x in average reward and 71.89% in GPT-4 based win-tie rate.
zh
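A minimal sketch of token-level agent switching in the spirit of Collab. Agents are modeled as callables returning a next-token distribution, and the long-term utility is a black-box scorer; both are stand-ins, not the paper's actual policies or reward.

```python
import math

def collab_decode(agents, utility, prompt, max_new_tokens=20):
    """Greedy token-level mixture-of-agents decoding.

    agents:  list of callables, agent(text) -> {token: probability} for the next token
    utility: callable, utility(text) -> float, a stand-in for the long-term utility metric
    """
    text = prompt
    for _ in range(max_new_tokens):
        best_token, best_score = None, -math.inf
        for agent in agents:
            probs = agent(text)
            token = max(probs, key=probs.get)        # each agent's top proposal
            score = utility(text + token)            # score the continuation
            if score > best_score:
                best_token, best_score = token, score
        text += best_token                           # switch to whichever agent won this step
        if best_token == "<eos>":
            break
    return text

# Toy usage with stub agents over a tiny vocabulary and a toy utility.
agent_a = lambda text: {" hello": 0.7, " world": 0.3}
agent_b = lambda text: {" world": 0.9, "<eos>": 0.1}
prefers_world = lambda text: text.count("world")
print(collab_decode([agent_a, agent_b], prefers_world, "Say:", max_new_tokens=3))
```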
[NLP-6] Outlier dimensions favor frequent tokens in language model
【Quick Read】: This paper studies last-layer outlier dimensions in language models, i.e., feature dimensions that show extreme activations for the majority of inputs. It shows that such outlier dimensions arise in many different modern language models and traces their function to a heuristic of constantly predicting frequent tokens. A key finding is how models can block this heuristic when it is not contextually appropriate, by assigning counterbalancing weight mass to the remaining dimensions; the paper further analyzes which model parameters boost outlier dimensions and when they emerge during training (a minimal detection sketch follows this entry). It concludes that outlier dimensions are a specialized mechanism, discovered by many distinct models, for implementing a useful token-prediction heuristic.
Link: https://arxiv.org/abs/2503.21718
Authors: Iuri Macocco,Nora Graichen,Gemma Boleda,Marco Baroni
Affiliations: Universitat Pompeu Fabra; ICREA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 4 figures
Abstract:We study last-layer outlier dimensions, i.e., dimensions that display extreme activations for the majority of inputs. We show that outlier dimensions arise in many different modern language models, and trace their function back to the heuristic of constantly predicting frequent words. We further show how a model can block this heuristic when it is not contextually appropriate, by assigning a counterbalancing weight mass to the remaining dimensions, and we investigate which model parameters boost outlier dimensions and when they arise during training. We conclude that outlier dimensions are a specialized mechanism discovered by many distinct models to implement a useful token prediction heuristic.
zh
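A small sketch of how one might flag candidate outlier dimensions from collected last-layer activations. The threshold rule (RMS activation several times the median across dimensions) is a common heuristic and an assumption here, not the paper's exact criterion.

```python
import numpy as np

def find_outlier_dimensions(hidden_states: np.ndarray, k: float = 6.0) -> np.ndarray:
    """Return indices of dimensions whose typical activation magnitude is extreme.

    hidden_states: (num_tokens, d) last-layer activations gathered over a corpus.
    """
    rms = np.sqrt((hidden_states ** 2).mean(axis=0))   # per-dimension RMS activation
    return np.where(rms > k * np.median(rms))[0]

# Toy example: dimension 3 is given an artificially large scale.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 16))
acts[:, 3] *= 25.0
print(find_outlier_dimensions(acts))   # -> [3]
```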
[NLP-7] CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?
【Quick Read】: This paper targets the reliability and accuracy of automatically generated reviews in scientific peer review, in particular ensuring that generated critiques are grounded in the core scientific claims a paper makes. The key contribution is CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview, in which ML experts annotate weakness statements in the reviews, the paper claims they dispute, and fine-grained labels for the validity, objectivity, and type of each weakness. Using this dataset, the paper benchmarks several state-of-the-art LLMs on three claim-centric tasks: associating weaknesses with the claims they dispute, predicting fine-grained weakness labels and rewriting weaknesses to make them more specific, and verifying a paper's claims with grounded reasoning. The experiments show that while cutting-edge LLMs can predict weakness labels reasonably well, they still lag far behind human experts on the other tasks.
Link: https://arxiv.org/abs/2503.21717
Authors: Jiefu Ou,William Gantt Walden,Kate Sanders,Zhengping Jiang,Kaiser Sun,Jeffrey Cheng,William Jurayj,Miriam Wanner,Shaobo Liang,Candice Morgan,Seunghoon Han,Weiqi Wang,Chandler May,Hannah Recknor,Daniel Khashabi,Benjamin Van Durme
Affiliations: Johns Hopkins University
Categories: Computation and Language (cs.CL)
Comments:
Abstract:A core part of scientific peer review involves providing expert critiques that directly assess the scientific claims a paper makes. While it is now possible to automatically generate plausible (if generic) reviews, ensuring that these reviews are sound and grounded in the papers’ claims remains challenging. To facilitate LLM benchmarking on these challenges, we introduce CLAIMCHECK, an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. CLAIMCHECK is richly annotated by ML experts for weakness statements in the reviews and the paper claims that they dispute, as well as fine-grained labels of the validity, objectivity, and type of the identified weaknesses. We benchmark several LLMs on three claim-centric tasks supported by CLAIMCHECK, requiring models to (1) associate weaknesses with the claims they dispute, (2) predict fine-grained labels for weaknesses and rewrite the weaknesses to enhance their specificity, and (3) verify a paper’s claims with grounded reasoning. Our experiments reveal that cutting-edge LLMs, while capable of predicting weakness labels in (2), continue to underperform relative to human experts on all other tasks.
zh
[NLP-8] As easy as PIE: understanding when pruning causes language models to disagree NAACL2025
【Quick Read】: This paper examines a blind spot in language model (LM) pruning: the data points that are disproportionately hurt by it. Pruning is usually judged by overall efficiency gains at some cost in mean effectiveness, which hides the fact that a particular subset of data points (PIEs) bears most of the accuracy loss. The key contribution is identifying and analyzing these PIEs across datasets, pruning methods, and compression levels, and showing that they matter for generalization: PIEs contain a high share of the data points that most influence how well the model generalizes to unseen data. The study finds that PIEs tend to be longer and more semantically complex texts, and that BERT is more prone to this effect than BiLSTM (a minimal PIE-identification sketch follows this entry). Code is publicly available.
Link: https://arxiv.org/abs/2503.21714
Authors: Pietro Tropeano,Maria Maistro,Tuukka Ruotsalo,Christina Lioma
Affiliations: University of Copenhagen; LUT University
Categories: Computation and Language (cs.CL)
Comments: Accepted to NAACL 2025 (Findings)
Abstract:Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy) when pruning, but this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: this https URL
zh
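A minimal sketch of identifying pruning-affected examples in the spirit of PIEs: compare the predictions of the full and pruned model on the same evaluation set and keep the points where they disagree. The PIE definition in the literature uses populations of models; this single-pair comparison is a simplification.

```python
def find_disagreement_points(full_preds, pruned_preds):
    """Indices of examples where the full and pruned model disagree."""
    return [i for i, (f, p) in enumerate(zip(full_preds, pruned_preds)) if f != p]

# Toy usage: labels predicted by a full model and its pruned counterpart.
full_model_preds   = ["pos", "neg", "neg", "pos", "pos"]
pruned_model_preds = ["pos", "pos", "neg", "neg", "pos"]
print(find_disagreement_points(full_model_preds, pruned_model_preds))  # -> [1, 3]
```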
[NLP-9] Elementwise Layer Normalization
【Quick Read】: This paper asks how to put Dynamic Tanh (DyT), a proposed drop-in replacement for Layer Normalization, on a solid theoretical footing, which it currently lacks. The key step is to derive DyT mathematically and show that a well-defined approximation is required to do so. Dropping that approximation yields a new elementwise transformation, Elementwise Layer Normalization (ELN), and the paper demonstrates that ELN resembles Layer Normalization more accurately than DyT does (a sketch contrasting LayerNorm and DyT follows this entry).
Link: https://arxiv.org/abs/2503.21708
Authors: Felix Stollenwerk
Affiliations: AI Sweden
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 3 figures
Abstract:A recent paper proposed Dynamic Tanh (DyT) as a drop-in replacement for Layer Normalization. Although the method is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we derive DyT mathematically and show that a well-defined approximation is needed to do so. By dropping said approximation, an alternative element-wise transformation is obtained, which we call Elementwise Layer Normalization (ELN). We demonstrate that ELN resembles Layer Normalization more accurately than DyT does.
zh
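For reference, a small PyTorch sketch of standard LayerNorm next to Dynamic Tanh, commonly written as y = gamma * tanh(alpha * x) + beta in the original DyT proposal. The exact form of this paper's ELN transform is not reproduced here; the sketch only illustrates the two operations ELN sits between.

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last dimension."""
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

def dyt(x, alpha, gamma, beta):
    """Dynamic Tanh: a purely elementwise stand-in for LayerNorm."""
    return gamma * torch.tanh(alpha * x) + beta

x = torch.randn(2, 8)
gamma, beta = torch.ones(8), torch.zeros(8)
print(layer_norm(x, gamma, beta).shape, dyt(x, torch.tensor(0.5), gamma, beta).shape)
```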
[NLP-10] Learning to Represent Individual Differences for Choice Decision Making IJCAI
【Quick Read】: This paper addresses how to measure individual differences effectively so as to improve predictions of human decisions. The key idea is to use representation learning to create individual embeddings from behavioral experiment data, handling both structured data (e.g., demographics) and unstructured data (e.g., free-text responses). Representation learning thus offers a more flexible way to measure individual differences for personalized prediction. On an economic decision-making task, models that use representation learning to capture individual differences consistently outperform models without it, and even surpass well-known theory-based behavioral models.
Link: https://arxiv.org/abs/2503.21704
Authors: Yan-Ying Chen,Yue Weng,Alexandre Filipowicz,Rumen Iliev,Francine Chen,Shabnam Hakimi,Yanxia Zhang,Matthew Lee,Kent Lyons,Charlene Wu
Affiliations: Toyota Research Institute
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Published in IJCAI MRC 2022
Abstract:Human decision making can be challenging to predict because decisions are affected by a number of complex factors. Adding to this complexity, decision-making processes can differ considerably between individuals, and methods aimed at predicting human decisions need to take individual differences into account. Behavioral science offers methods by which to measure individual differences (e.g., questionnaires, behavioral models), but these are often narrowed down to low dimensions and not tailored to specific prediction tasks. This paper investigates the use of representation learning to measure individual differences from behavioral experiment data. Representation learning offers a flexible approach to create individual embeddings from data that are both structured (e.g., demographic information) and unstructured (e.g., free text), where the flexibility provides more options for individual difference measures for personalization, e.g., free text responses may allow for open-ended questions that are less privacy-sensitive. In the current paper we use representation learning to characterize individual differences in human performance on an economic decision-making task. We demonstrate that models using representation learning to capture individual differences consistently improve decision predictions over models without representation learning, and even outperform well-known theory-based behavioral models used in these environments. Our results propose that representation learning offers a useful and flexible tool to capture individual differences.
zh
[NLP-11] Embodied-Reasoner: Synergizing Visual Search Reasoning and Action for Embodied Interactive Tasks
【Quick Read】: This paper addresses the gap between deep thinking models' reasoning abilities and embodied domains, which require continuous interaction with environments through interleaved image-action trajectories. Unlike mathematical reasoning, which relies mainly on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection grounded in interaction history. The authors present Embodied-Reasoner and synthesize a dataset of 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes. The key is a three-stage training pipeline: imitation learning to build basic capabilities, self-exploration via rejection sampling, and self-correction via reflection tuning, progressively strengthening the model's reasoning. Experiments show clear advantages on complex long-horizon tasks, with fewer repeated searches and logical inconsistencies.
Link: https://arxiv.org/abs/2503.21696
Authors: Wenqi Zhang,Mengna Wang,Gangao Liu,Xu Huixin,Yiwei Jiang,Yongliang Shen,Guiyang Hou,Zhe Zheng,Hang Zhang,Xin Li,Weiming Lu,Peng Li,Yueting Zhuang
Affiliations: College of Computer Science and Technology, Zhejiang University; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Alibaba Group; DAMO Academy, Alibaba Group; Nanjing Institute of Software Technology; Nanjing University of Posts and Telecommunications; Hohai University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Code: this https URL Dataset: this https URL
Abstract:Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains which require continuous interaction with environments through image action interleaved trajectories remains largely unexplored. We present Embodied Reasoner, a model that extends o1 style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model’s capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms those advanced visual reasoning models, e.g., it exceeds OpenAI o1, o3-mini, and Claude-3.7 by +9%, 24%, and +13%. Analysis reveals our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks. Real-world environments also show our superiority while exhibiting fewer repeated searches and logical inconsistency cases.
zh
[NLP-12] LLM-Gomoku: A Large Language Model-Based System for Strategic Gomoku with Self-Play and Reinforcement Learning
【Quick Read】: This paper addresses how to use large language models (LLMs) effectively for strategic planning and decision-making in Gomoku, where the main challenge is getting LLMs to understand and apply the game's strategies and logic to make rational decisions. The key is a Gomoku AI system that mimics how humans learn the game: the model is taught to "read the board," "understand the rules," "select strategies," and "evaluate positions," and its abilities are further improved through self-play and reinforcement learning. This approach markedly improves move selection, resolves the problem of generating illegal positions, and reduces processing time via parallel position evaluation (a minimal board-reading and legality-check sketch follows this entry).
Link: https://arxiv.org/abs/2503.21683
Authors: Hui Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:In recent years, large language models (LLMs) have shown significant advancements in natural language processing (NLP), with strong capabilities in generation, comprehension, and reasoning. These models have found applications in education, intelligent decision-making, and gaming. However, effectively utilizing LLMs for strategic planning and decision-making in the game of Gomoku remains a challenge. This study aims to develop a Gomoku AI system based on LLMs, simulating the human learning process of playing chess. The system is designed to understand and apply Gomoku strategies and logic to make rational decisions. The research methods include enabling the model to “read the board,” “understand the rules,” “select strategies,” and “evaluate positions,” while enhancing its abilities through self-play and reinforcement learning. The results demonstrate that this approach significantly improves the selection of move positions, resolves the issue of generating illegal positions, and reduces process time through parallel position evaluation. After extensive self-play training, the model’s Gomoku-playing capabilities have been notably enhanced.
zh
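A minimal sketch of the "read the board" and legal-move-check ingredients mentioned above: a plain-text board serialization an LLM could consume, plus a legality test that would filter out illegal positions. Board size and symbols are assumptions, not the paper's actual interface.

```python
def render_board(board):
    """Serialize a Gomoku board ('.' empty, 'X'/'O' stones) as text an LLM can read."""
    n = len(board)
    header = "   " + " ".join(f"{c:2d}" for c in range(n))
    rows = [f"{r:2d} " + "  ".join(board[r]) for r in range(n)]
    return "\n".join([header] + rows)

def is_legal(board, row, col):
    """A move is legal only if it lands on an empty cell inside the board."""
    n = len(board)
    return 0 <= row < n and 0 <= col < n and board[row][col] == "."

board = [["."] * 15 for _ in range(15)]
board[7][7] = "X"
print(is_legal(board, 7, 7), is_legal(board, 7, 8))   # False True
print(render_board(board).splitlines()[8])            # the row containing the stone
```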
[NLP-13] JiraiBench: A Bilingual Benchmark for Evaluating Large Language Models Detection of Human Self-Destructive Behavior Content in Jirai Community
【Quick Read】: This paper addresses the effectiveness of detecting self-destructive content across Chinese and Japanese social media platforms. The key contribution is JiraiBench, a bilingual benchmark covering multiple forms of self-destructive behavior within the "Jirai" (landmine) subculture, together with an evaluation framework that incorporates both linguistic and cultural dimensions. Analyzing four state-of-the-art large language models, the study finds that the instruction language significantly affects performance and, unexpectedly, that Japanese prompts outperform Chinese prompts when processing Chinese content, suggesting that cultural proximity can sometimes outweigh linguistic similarity. Cross-lingual fine-tuning experiments further demonstrate the potential for knowledge transfer between the two language systems. The paper therefore argues for culturally informed approaches to multilingual content moderation and provides empirical support for building more effective detection systems for vulnerable online communities.
Link: https://arxiv.org/abs/2503.21679
Authors: Yunze Xiao,Tingyu He,Lionel Z. Wang,Yiming Ma,Xingyu Song,Xiaohang Xu,Irene Li,Ka Chung Ng
Affiliations: Carnegie Mellon University; University of Washington; The Hong Kong Polytechnic University; The University of Tokyo
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 20 pages, 1 figure
Abstract:This paper introduces JiraiBench, the first bilingual benchmark for evaluating large language models’ effectiveness in detecting self-destructive content across Chinese and Japanese social media communities. Focusing on the transnational “Jirai” (landmine) online subculture that encompasses multiple forms of self-destructive behaviors including drug overdose, eating disorders, and self-harm, we present a comprehensive evaluation framework incorporating both linguistic and cultural dimensions. Our dataset comprises 10,419 Chinese posts and 5,000 Japanese posts with multidimensional annotation along three behavioral categories, achieving substantial inter-annotator agreement. Experimental evaluations across four state-of-the-art models reveal significant performance variations based on instructional language, with Japanese prompts unexpectedly outperforming Chinese prompts when processing Chinese content. This emergent cross-cultural transfer suggests that cultural proximity can sometimes outweigh linguistic similarity in detection tasks. Cross-lingual transfer experiments with fine-tuned models further demonstrate the potential for knowledge transfer between these language systems without explicit target language training. These findings highlight the need for culturally-informed approaches to multilingual content moderation and provide empirical evidence for the importance of cultural context in developing more effective detection systems for vulnerable online communities.
zh
[NLP-14] How do language models learn facts? Dynamics, curricula and hallucinations
【Quick Read】: This paper investigates the poorly understood dynamics of knowledge acquisition during language model pre-training. Using a synthetic factual recall task, it reports three key findings: language models learn in three phases and exhibit a performance plateau before acquiring precise factual knowledge, with the plateau coinciding with the formation of attention-based recall circuits; the training data distribution significantly shapes learning dynamics, with imbalanced distributions leading to shorter plateaus; and hallucinations emerge simultaneously with knowledge, while integrating new knowledge via fine-tuning is difficult because it quickly corrupts existing parametric memories. The key takeaway is the importance of data distribution for knowledge acquisition, along with novel data scheduling strategies to accelerate neural network training.
Link: https://arxiv.org/abs/2503.21676
Authors: Nicolas Zucchet,Jörg Bornschein,Stephanie Chan,Andrew Lampinen,Razvan Pascanu,Soham De
Affiliations: Google DeepMind
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language models accumulate vast knowledge during pre-training, yet the dynamics governing this acquisition remain poorly understood. This work investigates the learning dynamics of language models on a synthetic factual recall task, uncovering three key findings: First, language models learn in three phases, exhibiting a performance plateau before acquiring precise factual knowledge. Mechanistically, this plateau coincides with the formation of attention-based circuits that support recall. Second, the training data distribution significantly impacts learning dynamics, as imbalanced distributions lead to shorter plateaus. Finally, hallucinations emerge simultaneously with knowledge, and integrating new knowledge into the model through fine-tuning is challenging, as it quickly corrupts its existing parametric memories. Our results emphasize the importance of data distribution in knowledge acquisition and suggest novel data scheduling strategies to accelerate neural network training.
zh
[NLP-15] COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
【Quick Read】: This paper addresses the challenges that code-mixed text poses for NLP, and the limitations of existing datasets, which tend to cover only romanized text, have limited scope, or rely on synthetic data that misses real-world linguistic nuance. The key contribution is COMI-LINGUA, a large-scale, manually annotated dataset of 100,970 instances in both Devanagari and Roman scripts, each evaluated by three expert annotators. It supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. Evaluating LLMs on COMI-LINGUA reveals limitations in current multilingual modeling strategies and underscores the need for better code-mixed text processing.
Link: https://arxiv.org/abs/2503.21670
Authors: Rajvee Sheth,Himanshu Beniwal,Mayank Singh
Affiliations: LINGO, Indian Institute of Technology Gandhinagar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid growth of digital communication has driven the widespread use of code-mixing, particularly Hindi-English, in multilingual communities. Existing datasets often focus on romanized text, have limited scope, or rely on synthetic data, which fails to capture real-world language nuances. Human annotations are crucial for assessing the naturalness and acceptability of code-mixed text. To address these challenges, we introduce COMI-LINGUA, the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts. The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation. We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities. COMI-LINGUA is publicly available at: this https URL.
zh
[NLP-16] Model Assembly Learning with Heterogeneous Layer Weight Merging ICLR2025
【Quick Read】: This paper tackles the restrictions that architectural heterogeneity and layer-width mismatches place on model merging. The key contribution is Model Assembly Learning (MAL), a new paradigm that iteratively integrates parameters from diverse models in an open-ended model zoo to enhance a base model's capabilities. Unlike prior work that requires identical architectures, MAL allows merging across heterogeneous architectures and selective parameter integration across layers, so the base model can absorb parameters from different layers of multiple pre-trained models. The paper systematically studies the conditions and basic settings of heterogeneous parameter merging, addressing all possible layer-width mismatches between base and target models, and establishes key principles and practical guidelines for implementing MAL (a naive merging sketch follows this entry).
Link: https://arxiv.org/abs/2503.21657
Authors: Yi-Kai Zhang,Jin Wang,Xu-Xiang Zhong,De-Chuan Zhan,Han-Jia Ye
Affiliations: School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University; Yingcai Honors College, University of Electronic Science and Technology of China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICLR 2025 Workshop on Neural Network Weights as a New Data Modality
Abstract:Model merging acquires general capabilities without extra data or training by combining multiple models’ parameters. Previous approaches achieve linear mode connectivity by aligning parameters into the same loss basin using permutation invariance. In this paper, we introduce Model Assembly Learning (MAL), a novel paradigm for model merging that iteratively integrates parameters from diverse models in an open-ended model zoo to enhance the base model’s capabilities. Unlike previous works that require identical architectures, MAL allows the merging of heterogeneous architectures and selective parameters across layers. Specifically, the base model can incorporate parameters from different layers of multiple pre-trained models. We systematically investigate the conditions and fundamental settings of heterogeneous parameter merging, addressing all possible mismatches in layer widths between the base and target models. Furthermore, we establish key laws and provide practical guidelines for effectively implementing MAL.
zh
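To illustrate one naive way of merging a donor layer into a base layer when their widths differ, here is a toy PyTorch sketch that averages only the overlapping parameter block and leaves the rest of the base weights untouched. This is an invented simplification for intuition; MAL's actual rules for handling width mismatches are those established in the paper.

```python
import torch

def merge_layer(base_w: torch.Tensor, donor_w: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend the overlapping block of two weight matrices; keep the remainder from the base."""
    rows = min(base_w.shape[0], donor_w.shape[0])
    cols = min(base_w.shape[1], donor_w.shape[1])
    merged = base_w.clone()
    merged[:rows, :cols] = (1 - alpha) * base_w[:rows, :cols] + alpha * donor_w[:rows, :cols]
    return merged

base  = torch.randn(768, 768)    # base-model layer
donor = torch.randn(1024, 1024)  # wider donor layer from the model zoo
print(merge_layer(base, donor, alpha=0.3).shape)   # torch.Size([768, 768])
```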
[NLP-17] A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
【Quick Read】: This survey addresses the long, inefficient reasoning traces that Large Reasoning Models (LRMs) produce during inference, including redundant content (e.g., repeated definitions), over-analysis of simple problems, and shallow exploration of multiple reasoning paths on harder tasks. Such inefficiency raises serious challenges for training, inference, and real-world deployment (especially in agent-based systems) where token economy matters. The survey provides a comprehensive overview of efforts to improve reasoning efficiency in LRMs, focusing on the challenges unique to this new paradigm: it identifies common patterns of inefficiency, reviews methods proposed across the LRM lifecycle from pre-training to inference, discusses promising directions for future research, and maintains a continuously updated GitHub repository tracking progress in the field.
Link: https://arxiv.org/abs/2503.21614
Authors: Xiaoye Qu,Yafu Li,Zhaochen Su,Weigao Sun,Jianhao Yan,Dongrui Liu,Ganqu Cui,Daizong Liu,Shuxian Liang,Junxian He,Peng Li,Wei Wei,Jing Shao,Chaochao Lu,Yue Zhang,Xian-Sheng Hua,Bowen Zhou,Yu Cheng
Affiliations: Shanghai AI Laboratory; Soochow University; Westlake University; Peking University; Tongji University; The Hong Kong University of Science and Technology; Tsinghua University; Huazhong University of Science and Technology; The Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: Survey, 32 pages, Large Reasoning Models, Efficient Reasoning for Language, Multimodality, and Beyond
Abstract:Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.
zh
[NLP-18] Evaluating book summaries from internal knowledge in Large Language Models : a cross-model and semantic consistency approach
【Quick Read】: This paper asks whether large language models (LLMs) can generate comprehensive and accurate book summaries purely from their internal knowledge, without access to the original text. It examines whether these models can synthesize meaningful narratives that align with established human interpretations, and probes their potential biases and stylistic preferences.
The key element of the methodology is an LLM-as-a-judge evaluation paradigm: each AI-generated summary is compared against a high-quality, human-written summary in a cross-model assessment in which every participating LLM evaluates not only its own outputs but also those of the other models. In addition, ROUGE and BERTScore metrics quantify the alignment between human-crafted and LLM-generated summaries, assessing the depth of grammatical and semantic correspondence (a minimal metric sketch follows this entry). This design exposes nuanced differences in content representation and stylistic preferences across models, such as a tendency to favor one's own summarization style, and sheds light on how LLMs internally encode factual information and on the dynamics of cross-model evaluation.
Link: https://arxiv.org/abs/2503.21613
Authors: Javier Coronado-Blázquez
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 6 figures
Abstract:We study the ability of large language models (LLMs) to generate comprehensive and accurate book summaries solely from their internal knowledge, without recourse to the original text. Employing a diverse set of books and multiple LLM architectures, we examine whether these models can synthesize meaningful narratives that align with established human interpretations. Evaluation is performed with a LLM-as-a-judge paradigm: each AI-generated summary is compared against a high-quality, human-written summary via a cross-model assessment, where all participating LLMs evaluate not only their own outputs but also those produced by others. This methodology enables the identification of potential biases, such as the proclivity for models to favor their own summarization style over others. In addition, alignment between the human-crafted and LLM-generated summaries is quantified using ROUGE and BERTScore metrics, assessing the depth of grammatical and semantic correspondence. The results reveal nuanced variations in content representation and stylistic preferences among the models, highlighting both strengths and limitations inherent in relying on internal knowledge for summarization tasks. These findings contribute to a deeper understanding of LLM internal encodings of factual information and the dynamics of cross-model evaluation, with implications for the development of more robust natural language generative systems.
zh
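A minimal sketch of the two overlap metrics mentioned above, assuming the third-party rouge-score and bert-score packages are installed; the strings are placeholders, not summaries from the study.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The novel follows a sailor's obsessive hunt for a white whale."
candidate = "A sailor obsessively hunts a white whale across the oceans."

# Lexical overlap (ROUGE-1 / ROUGE-L).
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

# Semantic similarity (BERTScore); returns precision, recall, and F1 tensors.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(float(F1[0]))
```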
[NLP-19] debug-gym: A Text-Based Environment for Interactive Debugging
【Quick Read】: This paper questions the assumption, common in coding tasks for large language models (LLMs), that all relevant information is available in context or matches the training data. It argues that LLMs can benefit from the ability to interactively explore a codebase to gather task-relevant information. The key contribution is debug-gym, a textual environment for developing LLM-based agents in interactive coding settings. debug-gym is lightweight and ships with a preset of useful tools, such as the Python debugger (pdb), designed to support interactive debugging by an LLM-based agent; beyond coding and debugging, the approach generalizes to other tasks that benefit from information-seeking behavior.
Link: https://arxiv.org/abs/2503.21557
Authors: Xingdi Yuan,Morgane M Moss,Charbel El Feghali,Chinmay Singh,Darya Moldavskaya,Drew MacPhee,Lucas Caccia,Matheus Pereira,Minseon Kim,Alessandro Sordoni,Marc-Alexandre Côté
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Programming Languages (cs.PL); Software Engineering (cs.SE)
Comments:
Abstract:Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet in most scenarios it is assumed that all relevant information can be either accessed in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent’s interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.
zh
[NLP-20] SWI: Speaking with Intent in Large Language Models
【Quick Read】: This paper addresses the lack of explicit intent to guide large language models (LLMs) in reasoning and generation. The proposed solution, Speaking with Intent (SWI), has the model explicitly generate an intent that encapsulates its underlying intention and provides high-level planning to steer subsequent analysis and communication, emulating deliberate, purposeful human thought (a minimal prompt-template sketch follows this entry). Across mathematical reasoning, question answering, and text summarization benchmarks, SWI outperforms baseline generation without explicit intent, surpasses the answer-trigger prompting methods Chain-of-Thought and Plan-and-Solve, and stays competitive with the strong ARR method; its summaries are more accurate, concise, and factually correct, with fewer hallucinations. The method generalizes well across tasks, and human evaluations confirm the coherence, effectiveness, and interpretability of the generated intent.
Link: https://arxiv.org/abs/2503.21544
Authors: Yuwei Yin,EunJeong Hwang,Giuseppe Carenini
Affiliations: University of British Columbia; Vector Institute for AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages. Code: this https URL
Abstract:Intent, typically clearly formulated and planned, functions as a cognitive framework for reasoning and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model’s underlying intention and provides high-level planning to guide subsequent analysis and communication. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on mathematical reasoning benchmarks consistently demonstrate the superiority of Speaking with Intent over Baseline (i.e., generation without explicit intent). Moreover, SWI outperforms answer-trigger prompting methods Chain-of-Thought and Plan-and-Solve and maintains competitive performance with the strong method ARR (Analyzing, Retrieving, and Reasoning). Additionally, the effectiveness and generalizability of SWI are solidified on reasoning-intensive question answering (QA) and text summarization benchmarks, where SWI brings consistent improvement to the Baseline generation. In text summarization, SWI-generated summaries exhibit greater accuracy, conciseness, and factual correctness, with fewer hallucinations. Furthermore, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. This proof-of-concept study creates a novel avenue for enhancing LLMs’ reasoning abilities with cognitive notions.
zh
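A minimal, hypothetical prompt template illustrating the intent-first idea behind SWI; the wording is invented and not the paper's actual template.

```python
SWI_TEMPLATE = """Before answering, state your intent: a short, high-level plan of what you
aim to accomplish and how you will analyze the question. Then follow your plan and give
the final answer.

Question: {question}

Intent:"""

def build_swi_prompt(question: str) -> str:
    """Wrap a question in an intent-first prompt."""
    return SWI_TEMPLATE.format(question=question)

print(build_swi_prompt("If a train travels 120 km in 1.5 hours, what is its average speed?"))
```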
[NLP-21] Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models
【Quick Read】: This paper targets the inclusion of low-resource languages in information retrieval (IR), specifically transliteration between Urdu and its romanized form, Roman Urdu. Although both scripts are widely used in South Asia, the task remains under-explored; prior RNN-based work showed promise but suffered from poor domain adaptability and limited evaluation.
The key solution is a Transformer-based approach built on the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To fix flaws in earlier evaluation, the authors introduce rigorous dataset splits and report BLEU, character-level BLEU (Char-BLEU), and CHRF. The model reaches Char-BLEU scores of 96.37 for Urdu-to-Roman-Urdu and 97.44 for Roman-Urdu-to-Urdu, outperforming both RNN baselines and GPT-4o Mini and demonstrating the effectiveness of multilingual transfer learning for low-resource transliteration.
Link: https://arxiv.org/abs/2503.21530
Authors: Umer Butt,Stalin Veranasi,Günter Neumann
Affiliations: University of Saarland; DFKI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain-diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and CHRF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu-Roman-Urdu and 97.44 for Roman-Urdu-Urdu. These results outperform both RNN baselines and GPT-4o Mini and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.
zh
[NLP-22] Datasets for Depression Modeling in Social Media: An Overview NAACL2025
【Quick Read】: This paper responds to the growing interdisciplinary interest in depression research, in particular the trend of augmenting traditional depression screening with social network data. Its goal is to give early-career researchers a comprehensive and up-to-date list of datasets for analyzing and predicting depression from social media. The key contribution is an overview of datasets published between 2019 and 2024, also released as a continuously updated online resource, to facilitate further interdisciplinary research on the linguistic expression of depression on social media.
Link: https://arxiv.org/abs/2503.21513
Authors: Ana-Maria Bucur,Andreea-Codrina Moldovan,Krutika Parvatikar,Marcos Zampieri,Ashiqur R. KhudaBukhsh,Liviu P. Dinu
Affiliations: Interdisciplinary School of Doctoral Studies, University of Bucharest; PRHLT Research Center, Universitat Politècnica de València; Rochester Institute of Technology; George Mason University; Faculty of Mathematics and Computer Science, HLT Research Center, University of Bucharest
Categories: Computation and Language (cs.CL)
Comments: Accepted to CLPsych Workshop, NAACL 2025
Abstract:Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. As one of the most extensively researched psychological conditions, recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression, and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the comprehensive list of datasets available online as a continuously updated resource, with the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.
zh
[NLP-23] Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
【Quick Read】: This paper addresses the shortcomings of existing benchmarks for Vision-Language Models (VLMs) in autonomous driving (AD), which mostly assess interpretability through open-form visual question answering (QA) on coarse-grained tasks and cannot adequately evaluate models in complex driving scenarios. The key contribution is VLADBench, a challenging, fine-grained dataset of close-form QAs that progresses from static foundational knowledge to advanced reasoning about dynamic on-road situations. It spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning, further broken into 11 secondary aspects and 29 tertiary tasks for granular evaluation. The paper also trains domain-specific (DS) models on individual domain datasets (drawn from 1.4M DS QA pairs from public sources), starting from a small-scale VLM, to probe the cognitive and reasoning interactions among the 5 domains, providing an important step toward more comprehensive assessment of VLMs in AD and toward more cognitively sophisticated, reasoning-capable AD systems.
Link: https://arxiv.org/abs/2503.21505
Authors: Yue Li,Meng Tian,Zhenyu Lin,Jiangtong Zhu,Dechang Zhu,Haiqiang Liu,Zining Wang,Yueyi Zhang,Zhiwei Xiong,Xinhai Zhao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Existing benchmarks for Vision-Language Model (VLM) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce VLADBench, a challenging and fine-grained dataset featuring close-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate VLADBench spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.
zh
[NLP-24] Keyword-Oriented Multimodal Modeling for Euphemism Identification
【Quick Read】: This paper addresses the lack of multimodal datasets for euphemism identification, which limits work beyond text-only analysis in social media settings. The key contribution is KOM-Euph, a keyword-oriented multimodal euphemism corpus spanning text, image, and speech across three datasets (Drug, Weapon, and Sexuality), where euphemisms and their target keywords are treated as keywords, together with KOM-EI, a keyword-oriented multimodal euphemism identification method. KOM-EI uses cross-modal feature alignment and dynamic fusion modules to explicitly exploit the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments show that KOM-EI outperforms state-of-the-art models and large language models, underscoring the value of the multimodal datasets.
Link: https://arxiv.org/abs/2503.21504
Authors: Yuxue Hu,Junsong Li,Meixuan Chen,Dongyu Su,Tongguan Wang,Ying Sha
Affiliations: College of Informatics, Huazhong Agricultural University; Key Laboratory of Smart Farming for Agricultural Animals; Hubei Engineering Technology Research Center of Agricultural Big Data; Engineering Research Center of Intelligent Technology for Agriculture, Ministry of Education
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Euphemism identification deciphers the true meaning of euphemisms, such as linking “weed” (euphemism) to “marijuana” (target keyword) in illicit texts, aiding content moderation and combating underground markets. While existing methods are primarily text-based, the rise of social media highlights the need for multimodal analysis, incorporating text, images, and audio. However, the lack of multimodal datasets for euphemisms limits further research. To address this, we regard euphemisms and their corresponding target keywords as keywords and first introduce a keyword-oriented multimodal corpus of euphemisms (KOM-Euph), involving three datasets (Drug, Weapon, and Sexuality), including text, images, and speech. We further propose a keyword-oriented multimodal euphemism identification method (KOM-EI), which uses cross-modal feature alignment and dynamic fusion modules to explicitly utilize the visual and audio features of the keywords for efficient euphemism identification. Extensive experiments demonstrate that KOM-EI outperforms state-of-the-art models and large language models, and show the importance of our multimodal datasets.
zh
[NLP-25] OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
【Quick Read】: This paper addresses the insufficient evaluation of mainstream large language models (LLMs) on Hungarian and Hungarian-specific tasks. The key contribution is OpenHuEval, the first benchmark focused on the Hungarian language and its specifics, built with recent LLM-evaluation design principles: real user queries collected from the internet, an emphasis on assessing generative capabilities, and LLM-as-judge to make evaluation more multidimensional and accurate. OpenHuEval covers eight Hungarian-specific dimensions, five tasks, and 3953 questions, enabling a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the Hungarian context. The evaluation of current mainstream LLMs, including Large Reasoning Models (LRMs), shows a clear need for evaluation and model optimization tailored to Hungarian; the paper also establishes a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing the intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian as a representative example.
Link: https://arxiv.org/abs/2503.21500
Authors: Haote Yang,Xingjian Wei,Jiang Wu,Noémi Ligeti-Nagy,Jiaxing Sun,Yinfan Wang,Zijian Győző Yang,Junyuan Gao,Jingchao Wang,Bowen Jiang,Shasha Wang,Nanjun Yu,Zihao Zhang,Shixin Hong,Hongwei Liu,Wei Li,Songyang Zhang,Dahua Lin,Lijun Wu,Gábor Prószéky,Conghui He
Affiliations: Shanghai Artificial Intelligence Laboratory; HUN-REN Hungarian Research Centre for Linguistics; Wuhan University; Shanghai University; University of Chinese Academy of Sciences; East China Normal University; Peking University; Harbin Institute of Technology; Tsinghua University; Chinese University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs’ generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides the comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models. The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established the framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at this https URL .
zh
[NLP-26] OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs
【Quick Read】: This paper addresses the understudied zero-shot emotion recognition abilities of large language models on the speech modality within multimodal cognitive-state tasks. The key contribution is OmniVox, the first systematic evaluation of four omni-LLMs (LLMs that accept any modality as input) on zero-shot emotion recognition, together with acoustic prompting, an audio-specific prompting strategy that focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning and notably improves performance. The paper also analyzes the effect of context window size and finds that using context helps, especially on IEMOCAP.
Link: https://arxiv.org/abs/2503.21480
Authors: John Murzaku,Owen Rambow
Affiliations: Stony Brook University
Categories: Computation and Language (cs.CL)
Comments: Submitted to COLM 2025. Preprint
Abstract:The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks: IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text only and text and audio. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.
zh
[NLP-27] Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection
【Quick Read】: This paper asks how to gauge task difficulty before prompting, so as to run large language models (LLMs) efficiently in production, and how to detect the adversarial prompts used in prompt injection attacks. The key solution is a new metric, Number of Thoughts (NofT), which quantifies how many thoughts a prompt requires and uses thresholds on that count to drive more effective prompt routing (a minimal routing sketch follows this entry). Routing prompts from the MathInstruct dataset through quantized, distilled Deepseek variants with 1.7 billion, 7 billion, and 14 billion parameters yields a 2% latency reduction, and the metric also enables a classifier that reaches 95% accuracy on adversarial prompt detection.
Link: https://arxiv.org/abs/2503.21464
Authors: Ryan Marinelli,Josef Pichlmeier,Tamas Bisztray
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments:
Abstract:In this work, we propose a metric called Number of Thoughts (NofT) to determine the difficulty of tasks pre-prompting and support Large Language Models (LLMs) in production contexts. By setting thresholds based on the number of thoughts, this metric can discern the difficulty of prompts and support more effective prompt routing. A 2% decrease in latency is achieved when routing prompts from the MathInstruct dataset through quantized, distilled versions of Deepseek with 1.7 billion, 7 billion, and 14 billion parameters. Moreover, this metric can be used to detect adversarial prompts used in prompt injection attacks with high efficacy. The Number of Thoughts can inform a classifier that achieves 95% accuracy in adversarial prompt detection. Our experiments and datasets used are available on our GitHub page: this https URL.
zh
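A minimal sketch of how a thought-count threshold could drive routing, following the idea above. How "thoughts" are segmented, the threshold value, and the probe/model callables are all assumptions for illustration, not the paper's implementation.

```python
import re

def number_of_thoughts(cot_text: str) -> int:
    """Count discrete reasoning steps in a drafted chain of thought (naive segmentation)."""
    steps = [s for s in re.split(r"\n+|(?<=[.!?])\s+", cot_text.strip()) if s.strip()]
    return len(steps)

def route(prompt, probe, small_model, large_model, threshold=6):
    """Send easy prompts to the small model, hard ones to the large model."""
    draft = probe(prompt)                       # probe: cheap model that drafts a CoT
    return small_model if number_of_thoughts(draft) < threshold else large_model

# Toy usage with stub callables.
probe = lambda p: "Identify the unknowns. Set up the equation. Solve for x. Check the result."
small_model = "small-distilled-model"
large_model = "large-distilled-model"
print(route("Solve 3x + 5 = 20", probe, small_model, large_model))   # -> small model
```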
[NLP-28] Large Language Model Agent : A Survey on Methodology Applications and Challenges
【速读】: 本文旨在解决大型语言模型(Large Language Model, LLM)代理系统在构建、协作与演化方面的碎片化研究问题,通过以方法为中心的分类学系统性地拆解LLM代理系统,揭示代理设计原则与其在复杂环境中涌现行为之间的根本联系。论文的关键在于提出一个统一的架构视角,涵盖代理的构建方式、协作机制以及随时间演化的路径,并探讨评估方法、工具应用、实际挑战及多样化应用场景。最终,本文为研究人员提供了一个结构化的分类框架以理解LLM代理,并指出了未来研究的有前景方向。
链接: https://arxiv.org/abs/2503.21460
作者: Junyu Luo,Weizhi Zhang,Ye Yuan,Yusheng Zhao,Junwei Yang,Yiyang Gu,Bohan Wu,Binqi Chen,Ziyue Qiao,Qingqing Long,Rongcheng Tu,Xiao Luo,Wei Ju,Zhiping Xiao,Yifan Wang,Meng Xiao,Chenwu Liu,Jingyang Yuan,Shichang Zhang,Yiqiao Jin,Fan Zhang,Xian Wu,Hanqing Zhao,Dacheng Tao,Philip S. Yu,Ming Zhang
机构: School of Computer Science and PKU-Anker LLM Lab, Peking University (北京大学), Beijing, China; Department of Computer Science, University of Illinois at Chicago (芝加哥伊利诺伊大学), Chicago, USA; School of Computing and Information Technology, Great Bay University (大湾区大学), Guangdong, China; Computer Network Information Center, Chinese Academy of Sciences (中国科学院计算机网络信息中心), Beijing, China; Nanyang Technological University (南洋理工大学), Singapore; Department of Computer Science, University of California, Los Angeles (加州大学洛杉矶分校), USA; Paul G. Allen School of Computer Science and Engineering, University of Washington (华盛顿大学保罗·G·艾伦计算机科学与工程学院), Seattle, USA; School of Information Technology & Management, University of International Business and Economics (对外经济贸易大学信息技术与管理学院), Beijing, China; Harvard University (哈佛大学), Cambridge, USA; Georgia Institute of Technology (佐治亚理工学院), Atlanta, USA; Jarvis Research Center, Tencent YouTu Lab (腾讯优图实验室 Jarvis 研究中心), Shenzhen, China
类目: Computation and Language (cs.CL)
备注: 329 papers surveyed, resources are at this https URL
点击查看摘要
Abstract:The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at this https URL.
zh
[NLP-29] Composable Prompting Workspaces for Creative Writing: Exploration and Iteration Using Dynamic Widgets
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)图形用户界面(GUI)缺乏对迭代探索支持的问题,具体表现为现有界面无法将提示(prompts)表示为可操作的界面对象。论文的关键解决方案是提出一种可组合的提示画布(composable prompting canvas)的概念,结合动态小组件(dynamic widgets),使用户能够通过系统建议、主动提示或手动方式生成小组件,以捕捉影响生成文本的相关任务特征。这种方法允许用户自定义和重构提示环境,从而提高生成过程的灵活性与效率。在与基准(基于会话式 UI)的对比研究中,该设计显著提升了创造力支持指数(Creativity Support Index),并获得了用户的积极反馈,表明其提供的控制感优于基准系统且结果值得投入的努力。
链接: https://arxiv.org/abs/2503.21394
作者: Rifat Mehreen Amin,Oliver Hans Kühle,Daniel Buschek,Andreas Butz
机构: LMU Munich(慕尼黑路德维希-马克西米利安大学); University of Bayreuth(拜罗伊特大学)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 11 pages, 9 figures, 2 tables, ACM CHI 2025 LBW
点击查看摘要
Abstract:Generative AI models offer many possibilities for text creation and transformation. Current graphical user interfaces (GUIs) for prompting them lack support for iterative exploration, as they do not represent prompts as actionable interface objects. We propose the concept of a composable prompting canvas for text exploration and iteration using dynamic widgets. Users generate widgets through system suggestions, prompting, or manually to capture task-relevant facets that affect the generated text. In a comparative study with a baseline (conversational UI), 18 participants worked on two writing tasks, creating diverse prompting environments with custom widgets and spatial layouts. They reported having more control over the generated text and preferred our system over the baseline. Our design significantly outperformed the baseline on the Creativity Support Index, and participants felt the results were worth the effort. This work highlights the need for GUIs that support user-driven customization and (re-)structuring to increase both the flexibility and efficiency of prompting.
zh
[NLP-30] An evaluation of LLMs and Google Translate for translation of selected Indian languages via sentiment and semantic analyses
【速读】: 该论文试图解决大型语言模型(LLMs)在低资源语言翻译质量评估方面的不足。解决方案的关键在于通过语义和情感分析方法,比较LLMs(如Gemini、GPT系列和Google Translate)与专家翻译的人工翻译在印度语言(包括梵文、泰卢固语和印地语)中的表现。研究通过选取已被专家良好翻译的权威文本,利用LLMs生成对应的英译版本,并进行对比分析,从而揭示LLMs在翻译准确性、情感保持及语义完整性方面取得的进步与面临的挑战,特别是对修辞和哲学语境下的翻译效果。
链接: https://arxiv.org/abs/2503.21393
作者: Rohitash Chandra,Aryan Chaudhary,Yeshwanth Rayavarapu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Large Language models (LLMs) have been prominent for language translation, including low-resource languages. There has been limited study about the assessment of the quality of translations generated by LLMs, including Gemini, GPT and Google Translate. In this study, we address this limitation by using semantic and sentiment analysis of selected LLMs for Indian languages, including Sanskrit, Telugu and Hindi. We select prominent texts that have been well translated by experts and use LLMs to generate their translations to English, and then we provide a comparison with selected expert (human) translations. Our findings suggest that while LLMs have made significant progress in translation accuracy, challenges remain in preserving sentiment and semantic integrity, especially in figurative and philosophical contexts. The sentiment analysis revealed that GPT-4o and GPT-3.5 are better at preserving the sentiments for the Bhagavad Gita (Sanskrit-English) translations when compared to Google Translate. We observed a similar trend for the case of Tamas (Hindi-English) and Maha P (Telugu-English) translations. GPT-4o performs similarly to GPT-3.5 in the translation in terms of sentiments for the three languages. We found that LLMs are generally better at translation for capturing sentiments when compared to Google Translate.
zh
[NLP-31] Controlling Large Language Model with Latent Actions
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)适应下游任务时缺乏明确动作空间定义的问题,特别是传统基于词元(token-level)动作的方法限制了语义多样性和探索能力。论文的关键解决方案是提出了一种名为“通过潜在动作控制大型语言模型”(Controlling Large Language Models with Latent Actions, CoLA)的框架,该框架将一个紧凑的潜在动作空间整合到预训练的LLMs中,从而提升强化学习的可控性和探索效率。这一创新性设计显著增强了文本生成的语义多样性,并在多个任务中展示了优于基线方法的性能提升。
链接: https://arxiv.org/abs/2503.21383
作者: Chengxing Jia,Ziniu Li,Pengyuan Wang,Yi-Chen Li,Zhenyu Hou,Yuxiao Dong,Yang Yu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of defining the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. We apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA’s latent action enables greater semantic diversity in text generation. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM’s capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs by RL. These results highlight CoLA’s potential to advance RL-based adaptation of LLMs for downstream applications.
zh
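关于上文 [NLP-31] 中“潜在动作空间”的思路,这里给出一个结构层面的最小示意:用一个小码本把隐藏状态量化为离散动作,再把动作嵌入残差式地注入回隐藏表示。模块结构、维度与码本大小均为假设,仅用于说明概念,并非 CoLA 的官方实现。

```python
import torch
import torch.nn as nn

class LatentActionLayer(nn.Module):
    """把 LLM 的隐藏状态映射为离散潜在动作(最近邻码本查找),
    并返回可注入回模型的动作嵌入。结构为示意性假设。"""
    def __init__(self, hidden_dim: int = 4096, num_actions: int = 64, action_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, action_dim)          # 压缩到动作空间
        self.codebook = nn.Embedding(num_actions, action_dim)  # 离散动作码本
        self.out = nn.Linear(action_dim, hidden_dim)           # 映射回隐藏维度

    def forward(self, hidden_state: torch.Tensor):
        z = self.proj(hidden_state)                             # (batch, action_dim)
        # 最近邻量化:选与 z 距离最近的码本向量作为离散动作
        dists = torch.cdist(z, self.codebook.weight)            # (batch, num_actions)
        action_ids = dists.argmin(dim=-1)                       # 离散动作,可作为 RL 的动作空间
        action_emb = self.codebook(action_ids)
        return action_ids, hidden_state + self.out(action_emb)  # 残差式注入

if __name__ == "__main__":
    layer = LatentActionLayer(hidden_dim=32, num_actions=8, action_dim=16)
    h = torch.randn(4, 32)
    ids, new_h = layer(h)
    print(ids.shape, new_h.shape)  # torch.Size([4]) torch.Size([4, 32])
```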
[NLP-32] Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
【速读】: 该论文旨在解决现有数学推理评估基准因大型推理模型的快速发展而趋于饱和的问题,强调了开发更具挑战性和严格性的评估框架的迫切需求。为填补这一空白,论文引入了OlymMATH,这是一个针对奥赛级别的新型数学基准,专门用于全面测试大型语言模型(LLMs)的复杂推理能力。其关键解决方案在于精心设计了包含200个问题的数据集,分为AIME级别(易)和更具有挑战性的难题(难)两个难度层次,并涵盖四大核心数学领域,每个问题均提供可验证的数值解以支持客观评价。此外,OlymMATH还支持双语评估,弥补了主流数学推理基准在此方面的不足。该基准已通过STILL项目公开发布。
链接: https://arxiv.org/abs/2503.21380
作者: Haoxiang Sun,Yingqian Min,Zhipeng Chen,Wayne Xin Zhao,Zheng Liu,Zhongyuan Wang,Lei Fang,Ji-Rong Wen
机构: School of Information, Renmin University of China (中国人民大学信息学院); Gaoling School of Artificial Intelligence, Renmin University of China (中国人民大学高瓴人工智能学院); BAAI (北京智源人工智能研究院); DataCanvas Alaya NeW (数皆 Analytics Alaya New)
类目: Computation and Language (cs.CL)
备注: Technical Report on Slow Thinking with LLMs: Evaluation Benchmark
点击查看摘要
Abstract:In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark, designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1 and OpenAI’s o3-mini demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities-a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the OlymMATH benchmark at the STILL project: this https URL.
zh
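上文 [NLP-32] 强调每道题都附带可验证的数值解,以支持基于规则的客观评测。下面是按这一思路写的一个简化判分函数:从模型输出中抽取最后出现的数值并与参考答案在容差内比较;抽取规则与容差均为示意性假设,并非基准的官方评测脚本。

```python
import re

def extract_last_number(text: str):
    """从模型输出中提取最后一个数值(支持整数、小数与分数 a/b),规则为示意性简化。"""
    matches = re.findall(r"-?\d+\s*/\s*\d+|-?\d+(?:\.\d+)?", text)
    if not matches:
        return None
    last = matches[-1].replace(" ", "")
    if "/" in last:
        num, den = last.split("/")
        return float(num) / float(den)
    return float(last)

def is_correct(model_output: str, reference: float, tol: float = 1e-6) -> bool:
    """基于规则的数值判分:与参考答案的误差在容差内即记为正确。"""
    pred = extract_last_number(model_output)
    if pred is None:
        return False
    return abs(pred - reference) <= tol * max(1.0, abs(reference))

if __name__ == "__main__":
    print(is_correct("... so the answer is 7/2", 3.5))   # True
    print(is_correct("The result equals 42.", 41.0))      # False
```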
[NLP-33] Retrieving Time-Series Differences Using Natural Language Queries
【速读】: 该论文旨在解决传统时间序列数据搜索方法依赖领域专业知识定义搜索标准以及现有自然语言搜索方法难以有效处理时间序列数据差异性的问题。为克服这些局限性,论文提出了一种基于自然语言查询的时间序列数据对检索方法,关键在于定义了六种时间序列差异的关键特征,构建相应数据集,并开发了一种基于对比学习的模型,将查询文本与时间序列数据之间的差异对齐。实验结果显示,该模型在检索时间序列对任务上的总体mAP得分为0.994。
链接: https://arxiv.org/abs/2503.21378
作者: Kota Dohi,Tomoya Nishida,Harsh Purohit,Takashi Endo,Yohei Kawaguchi
机构: Research and Development Group, Hitachi, Ltd. (日立有限公司研发部)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Effectively searching time-series data is essential for system analysis; however, traditional methods often require domain expertise to define search criteria. Recent advancements have enabled natural language-based search, but these methods struggle to handle differences between time-series data. To address this limitation, we propose a natural language query-based approach for retrieving pairs of time-series data based on differences specified in the query. Specifically, we define six key characteristics of differences, construct a corresponding dataset, and develop a contrastive learning-based model to align differences between time-series data with query texts. Experimental results demonstrate that our model achieves an overall mAP score of 0.994 in retrieving time-series pairs.
zh
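上文 [NLP-33] 的核心是用对比学习把“时间序列对之间的差异”与查询文本对齐。下面给出标准 InfoNCE 形式的损失示意,差异表示与文本表示用随机向量占位,温度等超参为假设,仅供理解训练目标的形式。

```python
import torch
import torch.nn.functional as F

def info_nce(diff_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """对比损失:第 i 个时间序列差异向量与第 i 条查询文本互为正样本,
    批内其余样本作为负样本(标准 InfoNCE 形式)。"""
    diff_emb = F.normalize(diff_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = diff_emb @ text_emb.t() / temperature   # (B, B) 相似度矩阵
    labels = torch.arange(diff_emb.size(0))
    # 对称损失:差异->文本 与 文本->差异 两个方向
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    torch.manual_seed(0)
    B, D = 8, 64
    # 占位:假设已分别得到“时间序列对差异”与“查询文本”的表示
    diff_emb = torch.randn(B, D, requires_grad=True)
    text_emb = torch.randn(B, D, requires_grad=True)
    loss = info_nce(diff_emb, text_emb)
    loss.backward()
    print(float(loss))
    # 检索时:对查询文本编码后,与全部候选差异向量做余弦相似度排序,即可计算 mAP 等指标
```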
[NLP-34] From User Preferences to Optimization Constraints Using Large Language Models
【速读】: 该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)将用户的偏好转化为家庭电器的能源优化约束的问题。在可再生能源社区(Renewable Energy Community, REC)的背景下,特别是在意大利场景中,研究了一项任务,即将自然语言的用户表述转换为智能家电的正式约束条件。论文的关键在于评估当前可用的多种针对意大利语的LLMs在零样本、单样本和少样本学习设置下的有效性,并通过一个包含意大利用户请求及其对应正式约束表示的试点数据集进行验证。论文的主要贡献包括为该任务建立基准性能、公开发布数据集和代码以供进一步研究,以及提供关于LLMs在此特定领域内最佳实践和局限性的见解。
链接: https://arxiv.org/abs/2503.21360
作者: Manuela Sanguinetti,Alessandra Perniciano,Luca Zedda,Andrea Loddo,Cecilia Di Ruberto,Maurizio Atzori
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This work explores using Large Language Models (LLMs) to translate user preferences into energy optimization constraints for home appliances. We describe a task where natural language user utterances are converted into formal constraints for smart appliances, within the broader context of a renewable energy community (REC) and in the Italian scenario. We evaluate the effectiveness of various LLMs currently available for Italian in translating these preferences resorting to classical zero-shot, one-shot, and few-shot learning settings, using a pilot dataset of Italian user requests paired with corresponding formal constraint representation. Our contributions include establishing a baseline performance for this task, publicly releasing the dataset and code for further research, and providing insights on observed best practices and limitations of LLMs in this particular domain
zh
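上文 [NLP-34] 把自然语言偏好转成家电的形式化约束。下面给出一个 few-shot 提示拼装与 JSON 解析的最小示意;其中的意大利语例句与约束字段均为虚构的假设示例,并非论文数据集内容,具体调用哪个 LLM 接口由使用者自行替换。

```python
import json

FEW_SHOT = [
    {
        "utterance": "Vorrei che la lavatrice finisca entro le 18",   # 假设的示例语句
        "constraint": {"appliance": "washing_machine", "type": "end_before", "time": "18:00"},
    },
    {
        "utterance": "Accendi la lavastoviglie solo quando c'e' energia solare",
        "constraint": {"appliance": "dishwasher", "type": "run_when", "condition": "solar_surplus"},
    },
]

def build_prompt(user_utterance: str) -> str:
    """拼装 few-shot 提示:先给出若干“语句 -> JSON 约束”示例,再附上待转换语句。"""
    lines = ["Convert the user's request into a formal appliance constraint in JSON.\n"]
    for ex in FEW_SHOT:
        lines.append(f"Utterance: {ex['utterance']}")
        lines.append(f"Constraint: {json.dumps(ex['constraint'])}\n")
    lines.append(f"Utterance: {user_utterance}")
    lines.append("Constraint:")
    return "\n".join(lines)

def parse_constraint(llm_output: str) -> dict:
    """把 LLM 返回的 JSON 文本解析为约束对象;解析失败时返回空约束以便人工兜底。"""
    try:
        return json.loads(llm_output.strip())
    except json.JSONDecodeError:
        return {}

if __name__ == "__main__":
    print(build_prompt("Fai partire la pompa di calore dopo le 22"))
    # 假设 LLM 返回如下字符串:
    fake_output = '{"appliance": "heat_pump", "type": "start_after", "time": "22:00"}'
    print(parse_constraint(fake_output))
```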
[NLP-35] Fine-Tuning LLMs on Small Medical Datasets: Text Classification and Normalization Effectiveness on Cardiology reports and Discharge records
【速读】: 该论文旨在解决在有限医疗数据集上,如何有效利用大规模语言模型(Large Language Models, LLMs)进行文本分类和命名实体识别任务的问题。论文的关键在于通过针对具体任务对小规模LLMs进行微调(fine-tuning),以充分利用少量训练数据(如200-300个样本)来提升性能,并证明其效果可与更大模型相媲美。这一方法展示了通过任务特定微调LLMs实现临床工作流自动化及从非结构化医学文本中高效提取结构化数据的潜力。
链接: https://arxiv.org/abs/2503.21349
作者: Noah Losch,Lucas Plagwitz,Antonius Büscher,Julian Varghese
机构: Institute of Medical Informatics (医学信息学研究所), University of Münster (明斯特大学), Münster, Germany
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 4 pages, 2 tables,
点击查看摘要
Abstract:We investigate the effectiveness of fine-tuning large language models (LLMs) on small medical datasets for text classification and named entity recognition tasks. Using a German cardiology report dataset and the i2b2 Smoking Challenge dataset, we demonstrate that fine-tuning small LLMs locally on limited training data can improve performance achieving comparable results to larger models. Our experiments show that fine-tuning improves performance on both tasks, with notable gains observed with as few as 200-300 training examples. Overall, the study highlights the potential of task-specific fine-tuning of LLMs for automating clinical workflows and efficiently extracting structured data from unstructured medical text.
zh
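上文 [NLP-35] 的结论是:在 200-300 条标注样本上对小模型做任务特定微调即可取得可观效果。下面是一个用 transformers 做文本分类微调的最小训练循环示意;基座模型名、样本与标签都是占位假设,应替换为论文中实际使用的小型 LLM 与真实医疗数据。

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 占位模型与数据:真实场景应替换为论文中使用的小型 LLM 与少量标注样本
MODEL_NAME = "distilbert-base-multilingual-cased"   # 假设的基座,仅作演示
texts = ["Patient is a current smoker.", "Der Patient hat keine Angina pectoris."] * 100
labels = [1, 0] * 100                                 # 二分类示例标签

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                                # 小数据量下少量轮次即可见效
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"epoch {epoch}: loss={out.loss.item():.4f}")
```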
[NLP-36] ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback
【速读】: 该论文旨在解决多维摘要精炼(multi-dimensional summarization refinement)面临的挑战,提出了一种名为ReFeed的高效摘要精炼流水线,通过反馈的反思推理(reflective reasoning on feedback)来增强多个维度。解决方案的关键在于引入SumFeed-CoT,这是一个大规模基于长链路推理(Long-CoT-based)的数据集,用于训练具有反思推理能力的轻量级模型。研究发现,维度数量、反馈暴露程度以及推理策略对精炼性能有重要影响,强调了在多维场景下同时处理多条反馈并通过反思推理缓解维度间权衡的重要性。此外,ReFeed对噪声反馈和反馈顺序具有鲁棒性。最后,研究指出,创建带有适当目标和指南的数据是有效推理的基础。
链接: https://arxiv.org/abs/2503.21332
作者: Taewon Yun,Jihwan Oh,Hyangsuk Min,Yuho Lee,Jihwan Bang,Jason Cai,Hwanjun Song
机构: Korea Advanced Institute of Science and Technology (KAIST); Amazon Web Services, AI Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Summarization refinement faces challenges when extending to multi-dimension. In this paper, we introduce ReFeed, a powerful summarization refinement pipeline that enhances multiple dimensions through reflective reasoning on feedback. To achieve this, we release SumFeed-CoT, a large-scale Long-CoT-based dataset optimized for training a lightweight model with reflective reasoning. Our experiments reveal how the number of dimensions, feedback exposure, and reasoning policy influence refinement performance, highlighting reflective reasoning and simultaneously addressing multiple feedback is crucial to mitigate trade-off between dimensions. Furthermore, ReFeed is robust to noisy feedback and feedback order. Lastly, our finding emphasizes that creating data with a proper goal and guideline constitutes a fundamental pillar of effective reasoning. The dataset and model will be released.
zh
[NLP-37] R-PRM: Reasoning-Driven Process Reward Modeling
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在逐步数学推理过程中不可避免出现错误的问题,以及现有过程奖励模型(Process Reward Models, PRMs)输出评价分数直接导致的学习效率低和评价准确性受限的问题。尤其,标注数据的稀缺性进一步加剧了这些限制。为了解决这些问题,论文提出了推理驱动的过程奖励建模方法(Reasoning-Driven Process Reward Modeling, R-PRM)。其关键在于:首先利用更强的LLMs从有限的标注数据中生成种子数据,以有效增强模型的推理能力并实现全面的逐步评估;其次通过偏好优化提升性能,而无需额外的标注数据;最后引入推理时扩展来充分释放模型的推理潜力。实验结果表明,R-PRM在ProcessBench和PRMBench上的F1得分分别比强基准高出11.9和8.5点,并在六个具有挑战性的数据集上实现了超过8.5点的精度一致提升。
链接: https://arxiv.org/abs/2503.21295
作者: Shuaijie She,Junxiao Liu,Yifeng Liu,Jiajun Chen,Xin Huang,Shujian Huang
机构: National Key Laboratory for Novel Software Technology, Nanjing University (南京大学国家重点软件实验室); China Mobile Communications Company Limited Research Institute (中国移动通信公司研究院)
类目: Computation and Language (cs.CL)
备注: The project is available at this https URL
点击查看摘要
Abstract:Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model’s reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model’s reasoning potential. Extensive experiments demonstrate R-PRM’s effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.
zh
[NLP-38] Cultivating Game Sense for Yourself: Making VLMs Gaming Experts
【速读】: 该论文试图解决在第一人称/第三人称游戏中开发具备流畅游戏玩法的智能体(Agent)而无需API访问这一关键挑战。传统方法依赖视觉语言模型(Vision Language Models, VLMs)直接控制游戏,通过频繁暂停游戏以分析屏幕并基于语言推理规划动作,但这种低效范式限制了智能体只能进行基本且不流畅的交互,难以应对需要高反应性(如FPS射击)或动态适应性的任务。
解决方案的关键在于提出一种新的范式:不是让VLM直接控制游戏,而是利用VLM开发针对特定任务(如射击和战斗)的专业化执行模块。这些模块负责实时处理游戏交互,将VLM提升为高层次的开发者角色。在此基础上,论文引入GameSense框架,通过观察任务执行过程并结合视觉工具与神经网络训练管道,使VLM能够生成针对具体任务的游戏感知模块。这些模块封装了从直接规则到基于神经网络决策的动作反馈逻辑。实验表明,该框架首次实现了在多种类型游戏(包括ACT、FPS和Flappy Bird)中的流畅游戏玩法,为游戏AI代理设定了新的基准。
链接: https://arxiv.org/abs/2503.21263
作者: Wenxuan Lu,Jiangyang He,Zhanqiu Zhang,Yiwen Guo,Tianning Zang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan action through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To handle this, we propose a paradigm shift in gameplay agent design: instead of directly controlling gameplay, VLM develops specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating VLM to a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.
zh
[NLP-39] ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
【速读】: 该论文试图解决大型语言模型(LLMs)在科学发现领域中评估基准不足的问题,特别是其能否有效发现高质量研究假设的能力尚未被充分考察。为填补这一空白,论文提出了首个大规模基准测试框架,涵盖科学发现的三个关键子任务:灵感检索、假设生成与假设排序。解决方案的关键在于开发了一个自动化的框架,能够从12个学科的学术论文中提取核心成分(如研究问题、背景调查、灵感来源和假设),并通过专家验证确保其准确性。此外,为了防止数据污染,仅使用2024年发表的论文作为数据源,这些论文与LLM预训练数据的重叠极小。通过这一方法,论文揭示了LLMs在灵感检索任务上的良好表现,表明其具备挖掘新颖知识关联的能力,从而将LLMs定位为“研究假设矿场”,能够在最少人工干预的情况下规模化生成创新性假设。
链接: https://arxiv.org/abs/2503.21248
作者: Yujie Liu,Zonglin Yang,Tong Xie,Jinjie Ni,Ben Gao,Yuqiang Li,Shixiang Tang,Wanli Ouyang,Erik Cambria,Dongzhan Zhou
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Nanyang Technological University (南洋理工大学); University of New South Wales (新南威尔士大学); National University of Singapore (新加坡国立大学); Wuhan University (武汉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as “research hypothesis mines”, capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
zh
[NLP-40] Bias-Aware Agent: Enhancing Fairness in AI-Driven Knowledge Retrieval
【速读】: 该论文试图解决信息检索中的偏见(bias)和公平性(fairness)问题,这些问题源于大型语言模型(LLMs)的知识基础和训练过程。论文的关键解决方案在于提出了一种基于能动框架(agentic framework)的新型偏见感知知识检索方法,并创新性地利用偏见检测工具来识别和突出检索内容中的内在偏见。通过向用户提供透明性和意识,该方法旨在构建更加公平的信息系统,并推动负责任的人工智能(Responsible AI)的发展。
链接: https://arxiv.org/abs/2503.21237
作者: Karanbir Singh,William Ngu
机构: Salesforce(Salesforce)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Advancements in retrieving accessible information have evolved faster in the last few years compared to the decades since the internet’s creation. Search engines, like Google, have been the number one way to find relevant data. They have always relied on the user’s abilities to find the best information in its billions of links and sources at everybody’s fingertips. The advent of large language models (LLMs) has completely transformed the field of information retrieval. The LLMs excel not only at retrieving relevant knowledge but also at summarizing it effectively, making information more accessible and consumable for users. On top of it, the rise of AI Agents has introduced another aspect to information retrieval i.e. dynamic information retrieval which enables the integration of real-time data such as weather forecasts, and financial data with the knowledge base to curate context-aware knowledge. However, despite these advancements the agents remain susceptible to issues of bias and fairness, challenges deeply rooted within the knowledge base and training of LLMs. This study introduces a novel approach to bias-aware knowledge retrieval by leveraging agentic framework and the innovative use of bias detectors as tools to identify and highlight inherent biases in the retrieved content. By empowering users with transparency and awareness, this approach aims to foster more equitable information systems and promote the development of responsible AI.
zh
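上文 [NLP-40] 的关键是把偏见检测器作为工具挂载进检索流程,并向用户透明地标注结果。下面用一个玩具管线示意这一挂载点:检索到的文段先过一个(占位的)偏见打分函数,超过阈值的内容附加提示而非直接过滤。打分方式与阈值均为示意性假设,实际系统应替换为训练好的偏见检测模型。

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Passage:
    text: str
    source: str

def naive_bias_score(text: str) -> float:
    """占位的偏见打分:按少量带倾向性的词汇计数估计 0~1 分数,仅演示工具挂载点。"""
    loaded_terms = ["obviously", "everyone knows", "always", "never"]
    hits = sum(term in text.lower() for term in loaded_terms)
    return min(1.0, hits / 3)

def bias_aware_retrieve(query: str, retriever: Callable[[str], List[Passage]],
                        bias_tool: Callable[[str], float] = naive_bias_score,
                        threshold: float = 0.34) -> List[dict]:
    """检索 -> 偏见检测 -> 透明标注:不过滤内容,而是向用户显式标出潜在偏见。"""
    results = []
    for p in retriever(query):
        score = bias_tool(p.text)
        results.append({
            "text": p.text,
            "source": p.source,
            "bias_score": round(score, 2),
            "flag": "potentially biased" if score >= threshold else "ok",
        })
    return results

if __name__ == "__main__":
    def toy_retriever(q: str) -> List[Passage]:
        return [Passage("Everyone knows this policy always fails.", "blog"),
                Passage("The report compares outcomes across three regions.", "gov")]
    for r in bias_aware_retrieve("policy outcomes", toy_retriever):
        print(r)
```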
[NLP-41] LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
【速读】: 该论文旨在解决在将混合专家模型(Mixture of Experts, MoE)应用于大规模语言模型以实现连续学习时面临的两个主要挑战:(1) 随着任务数量的增长,简单的参数扩展策略可能导致模型规模过大;(2) 修改现有路由参数会导致之前学到的知识退化。为了解决这些问题,论文提出了一种创新的框架LLaVA-CMoE,其关键在于引入了探针引导的知识扩展方法(Probe-Guided Knowledge Extension, PGKE),通过探针专家评估特定层是否需要额外知识,从而实现基于任务分布的自适应网络参数扩展,显著提高了参数扩展效率。同时,还提出了一个分层路由算法——概率任务定位器(Probabilistic Task Locator, PTL),其中高层路由捕获跨任务信息,低层路由关注任务细节,确保新任务专家不会干扰已有专家。实验表明,该高效架构在Coin基准测试中显著提升了模型性能,同时保持了合理的参数量。
链接: https://arxiv.org/abs/2503.21227
作者: Hengyuan Zhao,Ziqin Wang,Qixin Sun,Kaiyou Song,Yilin Li,Xiaolin Hu,Qingpei Guo,Si Liu
机构: School of Artificial Intelligence, Beihang University (北京航空航天大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: Preprint
点击查看摘要
Abstract:Although applying Mixture of Experts to large language models for learning new tasks is widely regarded as an effective strategy for continuous learning, there still remain two major challenges: (1) As the number of tasks grows, simple parameter expansion strategies can lead to excessively large models. (2) Modifying the parameters of the existing router results in the erosion of previously acquired knowledge. In this paper, we present an innovative framework named LLaVA-CMoE, which is a continuous Mixture of Experts (MoE) architecture without any replay data. Specifically, we have developed a method called Probe-Guided Knowledge Extension (PGKE), which employs probe experts to assess whether additional knowledge is required for a specific layer. This approach enables the model to adaptively expand its network parameters based on task distribution, thereby significantly improving the efficiency of parameter expansion. Additionally, we introduce a hierarchical routing algorithm called Probabilistic Task Locator (PTL), where high-level routing captures inter-task information and low-level routing focuses on intra-task details, ensuring that new task experts do not interfere with existing ones. Our experiments show that our efficient architecture has substantially improved model performance on the Coin benchmark while maintaining a reasonable parameter count.
zh
[NLP-42] VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation
【速读】: 本文旨在解决从体素数据中提取高层次语义信息(如物体身份、颜色和位置)的挑战性问题。传统方法依赖复杂的三维网络,而本文提出的关键解决方案是采用一种基于切片的方法:将体素空间沿主轴(例如Z轴)系统性地切分成二维切片,并将其依次输入到标准视觉-语言模型(Vision-Language Model, VLM)的图像编码器中。这种方法通过利用预训练的二维VLM的强大能力,实现了从体素表示直接进行高效三维语义理解的目标。
链接: https://arxiv.org/abs/2503.21214
作者: Alan Dao(Gia Tuan Dao),Norapat Buppodom
机构: Menlo Research (Menlo 研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Comprehending 3D environments is vital for intelligent systems in domains like robotics and autonomous navigation. Voxel grids offer a structured representation of 3D space, but extracting high-level semantic meaning remains challenging. This paper proposes a novel approach utilizing a Vision-Language Model (VLM) to extract “voxel semantics”-object identity, color, and location-from voxel data. Critically, instead of employing complex 3D networks, our method processes the voxel space by systematically slicing it along a primary axis (e.g., the Z-axis, analogous to CT scan slices). These 2D slices are then formatted and sequentially fed into the image encoder of a standard VLM. The model learns to aggregate information across slices and correlate spatial patterns with semantic concepts provided by the language component. This slice-based strategy aims to leverage the power of pre-trained 2D VLMs for efficient 3D semantic understanding directly from voxel representations.
zh
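上文 [NLP-42] 的做法是把体素栅格沿主轴(如 Z 轴)切成二维切片,再逐片送入 VLM 的图像编码器。下面用 numpy 给出切片与归一化拼帧的最小示意;体素取值含义与归一化方式为假设,VLM 编码部分以占位函数表示。

```python
import numpy as np

def slice_voxels(voxels: np.ndarray, axis: int = 2) -> list:
    """把 (X, Y, Z) 的体素栅格沿指定主轴(默认 Z 轴,类似 CT 切片)切成二维切片列表。"""
    return [np.take(voxels, i, axis=axis) for i in range(voxels.shape[axis])]

def to_image(slice_2d: np.ndarray) -> np.ndarray:
    """把单个切片线性归一化到 0~255 并扩成 3 通道,便于交给图像编码器(归一化方式为示意)。"""
    lo, hi = slice_2d.min(), slice_2d.max()
    gray = np.zeros_like(slice_2d, dtype=np.uint8) if hi == lo else \
        ((slice_2d - lo) / (hi - lo) * 255).astype(np.uint8)
    return np.stack([gray] * 3, axis=-1)

def encode_with_vlm(images: list) -> None:
    """占位:真实流程中应把切片序列依次送入 VLM 的图像编码器并聚合跨切片信息。"""
    print(f"would feed {len(images)} slices of shape {images[0].shape} to the VLM image encoder")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    voxels = rng.random((32, 32, 16))          # 假设的 32x32x16 体素栅格
    slices = slice_voxels(voxels, axis=2)      # 沿 Z 轴得到 16 张切片
    images = [to_image(s) for s in slices]
    encode_with_vlm(images)
```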
[NLP-43] UGen: Unified Autoregressive Multimodal Model with Progressive Vocabulary Learning
【速读】: 该论文试图解决跨模态统一学习(unified multimodal learning)中的挑战,特别是在文本处理、图像理解和图像生成等多任务场景下的性能优化问题。论文提出的关键解决方案是引入了一种名为渐进词汇学习(progressive vocabulary learning)的新机制。通过该机制,视觉token ID逐步激活并融入训练过程,从而有效提升多模态统一学习的效果。实验结果表明,相比传统的单一模态方法,UGen在综合任务上的整体性能提升了13.3%,并在所有任务中表现出了与专用模型竞争的能力。
链接: https://arxiv.org/abs/2503.21193
作者: Hongxuan Tang,Hao Liu,Xinyan Xiao
机构: Baidu Inc. (百度公司), Beijing, China
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We introduce UGen, a unified autoregressive multimodal model that demonstrates strong performance across text processing, image understanding, and image generation tasks simultaneously. UGen converts both texts and images into discrete token sequences and utilizes a single transformer to generate them uniformly in an autoregressive manner. To address the challenges associated with unified multimodal learning, UGen is trained using a novel mechanism, namely progressive vocabulary learning. In this process, visual token IDs are incrementally activated and integrated into the training phase, ultimately enhancing the effectiveness of unified multimodal learning. Experiments on comprehensive text and image tasks show that UGen achieves a significant overall performance improvement of 13.3% compared to the vanilla unified autoregressive method, and it also delivers competitive results across all tasks against several task-specific models.
zh
[NLP-44] Collaborative Evolution: Multi-Round Learning Between Large and Small Language Models for Emergent Fake News Detection
【速读】: 该论文旨在解决社交平台上假新闻传播带来的显著社会影响及现有模型在检测新兴假新闻方面的局限性。传统基于小语言模型(Small Language Models, SLMs)的深度学习方法因需要大量标注数据且难以快速适应环境变化而受限;尽管大语言模型(Large Language Models, LLMs)具备强大的零样本能力,但其在缺乏相关示例和动态知识的情况下无法有效识别假新闻。为应对这些挑战,论文提出了一种名为多轮协作检测(Multi-Round Collaboration Detection, MRCD)的新框架。该框架的关键在于结合LLMs和SLMs的优势,通过设计两阶段检索模块选择相关且最新的示例与知识,增强情境学习能力以更好地检测新出现的新闻事件,并进一步构建多轮学习机制确保检测结果的可靠性。实验结果显示,MRCD框架在Pheme和Twitter16两个真实数据集上的表现优于仅使用SLMs的方法,分别提高了7.4%和12.8%的准确性,有效克服了当前模型的局限性,提升了新兴假新闻的检测效果。
链接: https://arxiv.org/abs/2503.21127
作者: Ziyi Zhou,Xiaoming Zhang,Shenghan Tan,Litian Zhang,Chaozhuo Li
机构: AAAI Press (AAAI出版社)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
点击查看摘要
Abstract:The proliferation of fake news on social media platforms has exerted a substantial influence on society, leading to discernible impacts and deleterious consequences. Conventional deep learning methodologies employing small language models (SLMs) suffer from the necessity for extensive supervised training and the challenge of adapting to rapidly evolving circumstances. Large language models (LLMs), despite their robust zero-shot capabilities, have fallen short in effectively identifying fake news due to a lack of pertinent demonstrations and the dynamic nature of knowledge. In this paper, a novel framework Multi-Round Collaboration Detection (MRCD) is proposed to address these aforementioned limitations. The MRCD framework is capable of enjoying the merits from both LLMs and SLMs by integrating their generalization abilities and specialized functionalities, respectively. Our approach features a two-stage retrieval module that selects relevant and up-to-date demonstrations and knowledge, enhancing in-context learning for better detection of emerging news events. We further design a multi-round learning framework to ensure more reliable detection results. Our framework MRCD achieves SOTA results on two real-world datasets Pheme and Twitter16, with accuracy improvements of 7.4% and 12.8% compared to using only SLMs, which effectively addresses the limitations of current models and improves the detection of emergent fake news.
zh
[NLP-45] Leveraging Large Language Models for Risk Assessment in Hyperconnected Logistic Hub Network Deployment
【速读】: 该论文旨在解决在高度互联的物流枢纽网络部署中因能源效率和环境可持续性日益受到重视而引入的新挑战,特别是在当前充满波动性、不确定性、复杂性和模糊性(VUCA)的环境中,传统方法难以有效捕捉和分析非结构化信息导致的动态风险评估难题。论文的关键解决方案是设计了一个基于大型语言模型(LLM)驱动的风险评估管道,并集成了多种分析工具,通过系统性分析非结构化数据(如地缘政治不稳定、金融趋势、历史风暴事件、交通状况及新闻来源中的新兴风险)来识别潜在风险。此外,通过精心设计的提示词指导LLMs调用这些工具进行多维度风险类型与级别评估,实现基于风险相似性分析的物流枢纽聚类,从而支持结构化且数据驱动的决策过程,最终提升全面风险评估的能力。
链接: https://arxiv.org/abs/2503.21115
作者: Yinzhu Quan,Yujia Xu,Guanlin Chen,Frederick Benaben,Benoit Montreuil
机构: Physical Internet Center (物理互联网中心), H. Milton Stewart School of Industrial and Systems Engineering (工业与系统工程学院), Georgia Institute of Technology (乔治亚理工学院); Industrial Engineering Centre (工业工程中心), IMT Mines Albi (IMT Mines Albi学院), Albi, France (法国)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:The growing emphasis on energy efficiency and environmental sustainability in global supply chains introduces new challenges in the deployment of hyperconnected logistic hub networks. In current volatile, uncertain, complex, and ambiguous (VUCA) environments, dynamic risk assessment becomes essential to ensure successful hub deployment. However, traditional methods often struggle to effectively capture and analyze unstructured information. In this paper, we design an Large Language Model (LLM)-driven risk assessment pipeline integrated with multiple analytical tools to evaluate logistic hub deployment. This framework enables LLMs to systematically identify potential risks by analyzing unstructured data, such as geopolitical instability, financial trends, historical storm events, traffic conditions, and emerging risks from news sources. These data are processed through a suite of analytical tools, which are automatically called by LLMs to support a structured and data-driven decision-making process for logistic hub selection. In addition, we design prompts that instruct LLMs to leverage these tools for assessing the feasibility of hub selection by evaluating various risk types and levels. Through risk-based similarity analysis, LLMs cluster logistic hubs with comparable risk profiles, enabling a structured approach to risk assessment. In conclusion, the framework incorporates scalability with long-term memory and enhances decision-making through explanation and interpretation, enabling comprehensive risk assessments for logistic hub deployment in hyperconnected supply chain networks.
zh
[NLP-46] Measuring and Analyzing Subjective Uncertainty in Scientific Communications
【速读】: 该论文试图解决科学发现中主观不确定性在不同学科、年份和地区间的分布及其影响机制的问题。论文的关键在于通过分析科学文献的语言特征,量化作者表达中的主观不确定性,并研究其与引用次数、作者数量及性别、学科中心性等计量学指标之间的相关性。通过揭示这些模式,论文为不同学术社区和文化背景下科学交流的语言规范识别与记录提供了依据。
链接: https://arxiv.org/abs/2503.21114
作者: Jamshid Sourati,Grace Shao
机构: 未知
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注: Coming with Appendix and supplementary material
点击查看摘要
Abstract:Uncertainty of scientific findings are typically reported through statistical metrics such as p -values, confidence intervals, etc. The magnitude of this objective uncertainty is reflected in the language used by the authors to report their findings primarily through expressions carrying uncertainty-inducing terms or phrases. This language uncertainty is a subjective concept and is highly dependent on the writing style of the authors. There is evidence that such subjective uncertainty influences the impact of science on public audience. In this work, we turned our focus to scientists themselves, and measured/analyzed the subjective uncertainty and its impact within scientific communities across different disciplines. We showed that the level of this type of uncertainty varies significantly across different fields, years of publication and geographical locations. We also studied the correlation between subjective uncertainty and several bibliographical metrics, such as number/gender of authors, centrality of the field’s community, citation count, etc. The underlying patterns identified in this work are useful in identification and documentation of linguistic norms in scientific communication in different communities/societies.
zh
[NLP-47] Function Alignment: A New Theory for Mind and Intelligence Part I: Foundations
【速读】: 该论文试图解决如何从结构层面构建一个统一的理论框架,以解释心智与智能的本质,并将其应用于建模及实现心智。论文的关键解决方案是提出“功能对齐(Function Alignment)”理论,该理论通过显式建模分层表征之间的交互,揭示意义、解释及类比的产生机制,从而形成一个既能描述心智又能指导构建心智的连贯框架。其中,关键的理论洞见之一是“有界可解释性(Bounded Interpretability)”,它统一整合了认知科学中的碎片化概念,如有限理性、符号接地及类比生成等。这一方案不仅跨越了学科界限,还连接了计算架构、心理理论乃至禅宗等不同领域。
链接: https://arxiv.org/abs/2503.21106
作者: Gus G. Xia
机构: Music X Lab, Machine Learning Department (音乐X实验室, 机器学习系); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, 2 figures. Part I of a multi-part position paper on a new theory of mind
点击查看摘要
Abstract:This paper introduces function alignment, a novel theory of mind and intelligence that is both intuitively compelling and structurally grounded. It explicitly models how meaning, interpretation, and analogy emerge from interactions among layered representations, forming a coherent framework capable not only of modeling minds but also of serving as a blueprint for building them. One of the key theoretical insights derived from function alignment is bounded interpretability, which provides a unified explanation for previously fragmented ideas in cognitive science, such as bounded rationality, symbol grounding, and analogy-making. Beyond modeling, the function alignment framework bridges disciplines often kept apart, linking computational architecture, psychological theory, and even contemplative traditions such as Zen. Rather than building on any philosophical systems, it offers a structural foundation upon which multiple ways of understanding the mind may be reconstructed.
zh
[NLP-48] ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
【速读】: 该论文致力于解决从大型语言模型中选择性擦除敏感知识的问题,旨在避免过度遗忘(over-forgetting)和欠遗忘(under-forgetting)的问题。论文的关键解决方案是提出了一种基于模型融合(Model Merging)的无学习系统,具体采用TIES-Merging技术,将两个专业化模型合并为一个更平衡的无学习模型。通过这种方法,该系统在SemEval-2025 Task 4任务集合(Task Aggregate)中的在线得分为0.944,在总体集合(overall Aggregate)中的得分为0.487,排名第二。此外,论文还进行了本地实验与综合分析,探讨了性能轨迹、损失动态及权重视角,并通过补充实验验证方法的有效性,同时指出当前评估指标(如MIA分数和ROUGE基评估指标)的局限性,强调需要更全面的评估方法和重新思考无学习的目标。代码可在提供的链接获取。
链接: https://arxiv.org/abs/2503.21088
作者: Haoming Xu,Shuxun Wang,Yanqiu Zhao,Yi Zhong,Ziyan Jiang,Ningyuan Zhao,Shumin Deng,Huajun Chen,Ningyu Zhang
机构: Zhejiang University (浙江大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Work in progress
点击查看摘要
Abstract:This paper presents the ZJUKLAB team’s submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at this https URL.
zh
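上文 [NLP-48] 的核心是用 TIES-Merging 把两个专门化模型合并为更均衡的去学习模型。下面按 TIES 的“裁剪小幅值增量、按符号投票、同号平均”三步写一个逐参数的简化示意;保留比例等超参为假设,且省略了真实实现中的许多细节。

```python
import torch

def ties_merge(base: dict, finetuned: list, keep_ratio: float = 0.2, lam: float = 1.0) -> dict:
    """对若干个微调模型与同一基座做简化版 TIES-Merging:
    1) trim: 每个任务向量(微调权重 - 基座权重)只保留幅值最大的 keep_ratio 部分;
    2) elect: 逐元素按保留值的符号求和,确定主导符号;
    3) merge: 只平均与主导符号一致的增量,再加回基座。"""
    merged = {}
    for name, w0 in base.items():
        deltas = []
        for ft in finetuned:
            d = ft[name] - w0
            k = max(1, int(d.numel() * keep_ratio))
            thresh = d.abs().flatten().kthvalue(d.numel() - k + 1).values  # 第 k 大的幅值
            deltas.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
        stacked = torch.stack(deltas)                      # (num_models, *param_shape)
        sign = torch.sign(stacked.sum(dim=0))              # 逐元素主导符号
        agree = (torch.sign(stacked) == sign) & (stacked != 0)
        denom = agree.sum(dim=0).clamp(min=1)
        merged_delta = (stacked * agree).sum(dim=0) / denom
        merged[name] = w0 + lam * merged_delta
    return merged

if __name__ == "__main__":
    torch.manual_seed(0)
    base = {"linear.weight": torch.randn(4, 4)}
    ft_a = {"linear.weight": base["linear.weight"] + 0.1 * torch.randn(4, 4)}
    ft_b = {"linear.weight": base["linear.weight"] + 0.1 * torch.randn(4, 4)}
    out = ties_merge(base, [ft_a, ft_b])
    print(out["linear.weight"].shape)
```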
[NLP-49] EQ-Negotiator: An Emotion-Reasoning LLM Agent in Credit Dialogues
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)驱动的聊天机器人在信用对话中动态情感表达能力有限的问题。当前的对话代理主要依赖被动共情而非情感推理,无法有效应对复杂的情感变化。论文的关键解决方案是提出了一种名为EQ-negotiator的方法,它结合了预训练语言模型(Pre-trained Language Models, PLMs)的情感感知能力与基于博弈论(Game Theory)和隐马尔可夫模型(Hidden Markov Models)的情感推理机制,通过综合考虑客户当前及历史情绪状态,实现上下文感知的情感调节。这种方法通过对公共情感数据集微调预训练语言模型,并在信用对话数据集上验证,使基于LLM的代理能够捕捉客户情绪的变化并动态调整回应语气,从而提升实际金融谈判中的表现,同时帮助信用机构建立积极的客户关系,提高信用服务满意度。
链接: https://arxiv.org/abs/2503.21080
作者: Yuhan Liu,Yunbo Long
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:While large language model (LLM)-based chatbots have been applied for effective engagement in credit dialogues, their capacity for dynamic emotional expression remains limited. Current agents primarily rely on passive empathy rather than affective reasoning. For instance, when faced with persistent client negativity, the agent should employ strategic emotional adaptation by expressing measured anger to discourage counterproductive behavior and guide the conversation toward resolution. This context-aware emotional modulation is essential for imitating the nuanced decision-making of human negotiators. This paper introduces an EQ-negotiator that combines emotion sensing from pre-trained language models (PLMs) with emotional reasoning based on Game Theory and Hidden Markov Models. It takes into account both the current and historical emotions of the client to better manage and address negative emotions during interactions. By fine-tuning pre-trained language models (PLMs) on public emotion datasets and validating them on the credit dialogue datasets, our approach enables LLM-based agents to effectively capture shifts in client emotions and dynamically adjust their response tone based on our emotion decision policies in real-world financial negotiations. This EQ-negotiator can also help credit agencies foster positive client relationships, enhancing satisfaction in credit services.
zh
[NLP-50] Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems
【速读】: 该论文旨在探索印度河谷文字与藏彝走廊象形系统之间潜在的历史联系。为解决这一问题,论文采用了结合卷积神经网络(CNN)与Transformer架构的混合模型,并辅以详细的人类学框架。解决方案的关键在于通过集成三种目标文字的15个独立训练模型的方法,证明藏彝走廊文字在视觉相似性上比青铜时代西亚的楔形文字或埃兰前体文字更接近印度河谷文字(分别为61.7%-63.5% vs. 10.2%-10.9% 和 7.6%-8.7%)。此外,尽管地理邻近性和贸易关系密切,印度河谷文字与上述西亚符号系统的平均余弦相似度仅为0.104和0.080,而与藏彝走廊文字的相似度高达0.629,进一步支持了复杂古代文化传播网络的存在可能性。这一发现挑战了传统的孤立文字发展叙事,提出了南亚与东亚之间更复杂的文化交流路径。
链接: https://arxiv.org/abs/2503.21074
作者: Ooha Lakkadi Reddy
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 106 pages total (main text: 42, 48 w/refs, 100 w/appendices). 21 figures, 4 tables in main; 106 figs, 8 tables total. Code and data at this URL: this https URL . Submitted as undergrad thesis at Duke Kunshan University; accepted for presentation at the 2025 Computer Applications and Quantitative Methods in Archaeology Conference, Athens
点击查看摘要
Abstract:This thesis employs a hybrid CNN-Transformer architecture, in conjunction with a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (61.7%-63.5%) than to the Bronze Age Proto-Cuneiform (10.2%-10.9%) or Proto-Elamite (7.6%-8.7%) systems. Additionally and contrarily to our current understanding of the networks of the Indus Valley Civilization, the Indus script unexpectedly maps closer to Tibetan-Yi Corridor scripts, with a mean cosine similarity of 0.629, than to the aforementioned contemporaneous West Asian signaries, both of which recorded mean cosine similarities of 0.104 and 0.080 despite their close geographic proximity and evident trade relations. Across various dimensionality reduction practices and clustering methodologies, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. Our computational results align with qualitative observations of specific pictorial parallels in numeral systems, gender markers, and key iconographic elements; this is further supported by archaeological evidence of sustained contact networks along the ancient Shu-Shendu road in tandem with the Indus Valley Civilization’s decline, providing a plausible transmission pathway. While alternative explanations cannot be ruled out, the specificity and consistency of observed similarities challenge conventional narratives of isolated script development and suggest more complex ancient cultural transmission networks between South and East Asia than previously recognized.
zh
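上文 [NLP-50] 的结论建立在不同文字系统字符嵌入之间的平均余弦相似度之上(如印度河谷文字与藏彝走廊文字为 0.629)。下面给出“由编码器得到字符嵌入后,计算两套文字平均余弦相似度”的最小示意;嵌入用随机向量占位,真实流程中应来自论文的 CNN-Transformer 编码器。

```python
import numpy as np

def mean_cosine_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """计算两组字符嵌入之间所有跨组配对的平均余弦相似度。
    emb_a: (Na, D), emb_b: (Nb, D)。"""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((a @ b.T).mean())      # (Na, Nb) 相似度矩阵取均值

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 占位:真实流程中这些向量来自对字符图像的编码
    indus = rng.normal(size=(50, 128))
    tibetan_yi = rng.normal(size=(60, 128))
    proto_cuneiform = rng.normal(size=(40, 128))
    print("Indus vs Tibetan-Yi:", round(mean_cosine_similarity(indus, tibetan_yi), 3))
    print("Indus vs Proto-Cuneiform:", round(mean_cosine_similarity(indus, proto_cuneiform), 3))
```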
[NLP-51] Shared Global and Local Geometry of Language Model Embeddings
【速读】: 该论文试图探索语言模型中词嵌入(Token Embeddings)的共同几何结构及其潜在应用。具体而言,研究关注词嵌入在全局和局部几何上的相似性,并揭示其低维流形特性及语义一致性。关键在于发现词嵌入不仅在初始层表现出一致的相对方向,而且这种对齐特性贯穿于语言模型的所有隐藏状态(Hidden States)。基于此,作者提出了一种新的解释性应用:通过跨模型迁移引导向量(Steering Vectors),即使不同模型具有不同的维度。这一解决方案的核心在于利用词嵌入的几何特性来实现模型间的知识转移与功能复用。
链接: https://arxiv.org/abs/2503.21073
作者: Andrew Lee,Melanie Weber,Fernanda Viégas,Martin Wattenberg
机构: Harvard University (哈佛大学); Google DeepMind (谷歌深度思维)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Researchers have recently suggested that models share common representations. In this work, we find that the token embeddings of language models exhibit common geometric structure. First, we find "global" similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each token embedding. Our intrinsic dimension measure demonstrates that token embeddings lie on a lower dimensional manifold. We qualitatively show that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Both characterizations allow us to find similarities in the local geometry of token embeddings. Perhaps most surprisingly, we find that alignment in token embeddings persists through the hidden states of language models, allowing us to develop an application for interpretability. Namely, we empirically demonstrate that steering vectors from one language model can be transferred to another, despite the two models having different dimensions.
zh
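上文 [NLP-51] 提到引导向量可以在不同维度的模型之间迁移。一个与“相对几何一致”相容的常见做法是:用共享词表的两套词嵌入最小二乘拟合一个线性映射,再用它搬运引导向量。下面的 numpy 示意即按这一假设写成,并非论文的具体迁移步骤。

```python
import numpy as np

def fit_linear_map(emb_src: np.ndarray, emb_tgt: np.ndarray) -> np.ndarray:
    """用共享词表的两套词嵌入,最小二乘拟合线性映射 W,使 emb_src @ W ≈ emb_tgt。
    这是利用嵌入几何一致性的一种常见做法,属于示意。"""
    W, *_ = np.linalg.lstsq(emb_src, emb_tgt, rcond=None)   # (d_src, d_tgt)
    return W

def transfer_steering_vector(v_src: np.ndarray, W: np.ndarray) -> np.ndarray:
    """把在源模型中得到的引导向量映射到目标模型的表示空间(维度可以不同)。"""
    return v_src @ W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_src, d_tgt, vocab = 64, 96, 500
    # 占位:构造两套“几何上相关”的词嵌入(目标嵌入 = 源嵌入经某个线性变换加噪声)
    src = rng.normal(size=(vocab, d_src))
    true_map = rng.normal(size=(d_src, d_tgt)) / np.sqrt(d_src)
    tgt = src @ true_map + 0.01 * rng.normal(size=(vocab, d_tgt))
    W = fit_linear_map(src, tgt)
    v = rng.normal(size=(d_src,))            # 假设这是源模型上的某个引导向量
    v_mapped = transfer_steering_vector(v, W)
    print(v_mapped.shape)                     # (96,)
```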
[NLP-52] AskSport: Web Application for Sports Question-Answering
【速读】: 该论文旨在解决体育领域中通过自然语言提问并获取相关答案的问题。解决方案的关键在于开发了一个名为AskSport的问答web应用程序,它能够接收用户的自然语言问题,并返回与问题最相关的三个答案,同时附带相关信息和文档。AskSport的核心功能不仅限于返回名称等非数值信息,还能够处理包含数值的相关查询,体现了其在处理复杂问题上的能力。这一方案的关键在于其高效的实现方式以及对多模态数据的整合能力,使其能够在HuggingFace平台上提供公开访问。
链接: https://arxiv.org/abs/2503.21067
作者: Enzo B Onofre(1),Leonardo M P Moraes(2),Cristina D Aguiar(2) ((1) Faculty of Computing, Federal University of Uberlandia, Brazil, (2) Institute of Mathematics and Computer Sciences, University of Sao Paulo, Brazil)
机构: Federal University of Uberlandia (巴西联邦大学乌贝兰迪亚); Institute of Mathematics and Computer Sciences, University of Sao Paulo (圣保罗大学数学与计算机科学研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: for accessing the application, see this https URL
点击查看摘要
Abstract:This paper introduces AskSport, a question-answering web application about sports. It allows users to ask questions using natural language and retrieve the three most relevant answers, including related information and documents. The paper describes the characteristics and functionalities of the application, including use cases demonstrating its ability to return names and numerical values. AskSport and its implementation are available for public access on HuggingFace.
zh
[NLP-53] Enhancing Korean Dependency Parsing with Morphosyntactic Features
【速读】: 该论文旨在解决现有框架在处理韩语(Korean)丰富的屈折形态(rich inflectional morphology)和灵活词序(flexible word order)时存在的挑战,这些框架通常将形态学(morphology)和句法(syntax)分开处理,导致语言分析中存在不一致性。论文的关键解决方案是提出UniDive框架,通过整合Universal Dependencies (UD) 和 Universal Morphology (UniMorph),统一句法和形态学标注(syntactic and morphological annotations),同时保留句法依存关系(syntactic dependencies)并引入由UniMorph衍生的特征,从而提高标注的一致性。实验结果表明,这种增强的形态句法特征(morphosyntactic features)特别是有助于区分受形态影响的语法关系(grammatical relations),并在基于编码器-解码器(encoder-only 和 decoder-only)模型的依存解析(dependency parsing)任务中提升了准确性。
链接: https://arxiv.org/abs/2503.21029
作者: Jungyeul Park,Yige Chen,Kyuwon Kim,KyungTae Lim,Chulwoo Park
机构: The University of British Columbia (英属哥伦比亚大学), Canada; The Chinese University of Hong Kong (香港中文大学), Hong Kong; Seoul National University (首尔国立大学), South Korea; KAIST (韩国科学技术院), South Korea; Anyang University (安阳大学), South Korea
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:This paper introduces UniDive for Korean, an integrated framework that bridges Universal Dependencies (UD) and Universal Morphology (UniMorph) to enhance the representation and processing of Korean morphosyntax. Korean’s rich inflectional morphology and flexible word order pose challenges for existing frameworks, which often treat morphology and syntax separately, leading to inconsistencies in linguistic analysis. UniDive unifies syntactic and morphological annotations by preserving syntactic dependencies while incorporating UniMorph-derived features, improving consistency in annotation. We construct an integrated dataset and apply it to dependency parsing, demonstrating that enriched morphosyntactic features enhance parsing accuracy, particularly in distinguishing grammatical relations influenced by morphology. Our experiments, conducted with both encoder-only and decoder-only models, confirm that explicit morphological information contributes to more accurate syntactic analysis.
zh
[NLP-54] Can Large Language Models Predict Associations Among Human Attitudes?
【速读】: 该论文试图解决的问题是如何利用大型语言模型(Large Language Models, LLMs)在缺乏表面相似性(surface-similarity)的情况下预测人类态度之间的关联,并揭示其对人类信念系统深层结构的理解能力。此前的研究主要集中在高度相似且相互关联的态度预测上,而本文通过构建一个包含多样化态度陈述的人类响应数据集,测试了前沿模型GPT-4o在跨主题和不相似态度之间进行预测的能力。解决方案的关键在于评估GPT-4o是否能够捕捉到人类信念系统的深层潜在结构,即使在缺乏表面相似性的情况下,仍能生成有意义的社会推断,从而验证LLMs对人类态度系统的全面表征能力。
链接: https://arxiv.org/abs/2503.21011
作者: Ana Ma,Derek Powell
机构: School of Social and Behavioral Sciences (社会与行为科学学院), Arizona State University (亚利桑那州立大学); School of Social and Behavioral Sciences (社会与行为科学学院), Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Prior work has shown that large language models (LLMs) can predict human attitudes based on other attitudes, but this work has largely focused on predictions from highly similar and interrelated attitudes. In contrast, human attitudes are often strongly associated even across disparate and dissimilar topics. Using a novel dataset of human responses toward diverse attitude statements, we found that a frontier language model (GPT-4o) was able to recreate the pairwise correlations among individual attitudes and to predict individuals’ attitudes from one another. Crucially, in an advance over prior work, we tested GPT-4o’s ability to predict in the absence of surface-similarity between attitudes, finding that while surface similarity improves prediction accuracy, the model was still highly-capable of generating meaningful social inferences between dissimilar attitudes. Altogether, our findings indicate that LLMs capture crucial aspects of the deeper, latent structure of human belief systems.
zh
[NLP-55] Evaluating Large Language Models for Automated Clinical Abstraction in Pulmonary Embolism Registries: Performance Across Model Sizes, Versions, and Parameters
【速读】: 该论文旨在解决肺栓塞(Pulmonary Embolism, PE)管理数据自动化提取的问题,以替代资源密集型的人工抽象方法。目前,PERT Consortium registry虽标准化了PE管理数据,但依赖于耗时的人工处理,而大型语言模型(Large Language Models, LLMs)提供了一种可扩展的替代方案,用于从CT肺动脉造影(CT Pulmonary Embolism, CTPE)报告中自动提取相关概念。论文的关键解决方案在于利用不同规模的LLaMA模型,并发现更大规模的模型(如70B参数量级)在PE检测、PE位置、右心室负荷及图像伪影等概念提取任务中表现更优,同时通过适度的温度调参(0.2-0.5)和双模型审查框架进一步优化性能,最终实现高精度(80%-90%)的自动化提取,从而减轻人工负担并保持准确性。
链接: https://arxiv.org/abs/2503.21004
作者: Mahmoud Alwakeel,Emory Buck,Jonathan G. Martin,Imran Aslam,Sudarshan Rajagopal,Jian Pei,Mihai V. Podgoreanu,Christopher J. Lindsell,An-Kwok Ian Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Pulmonary embolism (PE) is a leading cause of cardiovascular mortality, yet our understanding of optimal management remains limited due to heterogeneous and inaccessible radiology documentation. The PERT Consortium registry standardizes PE management data but depends on resource-intensive manual abstraction. Large language models (LLMs) offer a scalable alternative for automating concept extraction from computed tomography PE (CTPE) reports. This study evaluated the accuracy of LLMs in extracting PE-related concepts compared to a human-curated criterion standard. We retrospectively analyzed MIMIC-IV and Duke Health CTPE reports using multiple LLaMA models. Larger models (70B) outperformed smaller ones (8B), achieving kappa values of 0.98 (PE detection), 0.65-0.75 (PE location), 0.48-0.51 (right heart strain), and 0.65-0.70 (image artifacts). Moderate temperature tuning (0.2-0.5) improved accuracy, while excessive in-context examples reduced performance. A dual-model review framework achieved 80-90% precision. LLMs demonstrate strong potential for automating PE registry abstraction, minimizing manual workload while preserving accuracy.
zh
[NLP-56] Multi-head Reward Aggregation Guided by Entropy
【速读】: 该论文旨在解决在基于多属性安全准则评估大型语言模型(Large Language Models, LLMs)时,由于高质量评分一致性难以保证而导致的可靠性挑战。传统方法依赖于人类生成的偏好标注,而论文观察到具有高评分熵的安全规则通常无法可靠地识别人类偏好的响应。为此,论文提出了一种名为ENCORE的解决方案,其关键是通过熵引导的方法对评分规则进行加权调整,降低具有高评分熵规则的权重。理论分析表明,在Bradley-Terry优化框架下,高熵规则自然会获得较低权重,从而支持这种基于熵的惩罚策略。实验结果表明,该方法在RewardBench安全任务上显著优于多种竞争性基线方法,同时具备无需训练、适用于多种数据集且具有可解释性的优点。
链接: https://arxiv.org/abs/2503.20995
作者: Xiaomin Li,Xupeng Chen,Jingxuan Fan,Eric Hanchen Jiang,Mingye Gao
机构: Harvard University (哈佛大学); New York University (纽约大学); University of California, Los Angeles (加州大学洛杉矶分校); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Aligning large language models (LLMs) with safety guidelines typically involves reinforcement learning from human feedback (RLHF), relying on human-generated preference annotations. However, assigning consistent overall quality ratings is challenging, prompting recent research to shift towards detailed evaluations based on multiple specific safety criteria. This paper uncovers a consistent observation: safety rules characterized by high rating entropy are generally less reliable in identifying responses preferred by humans. Leveraging this finding, we introduce ENCORE, a straightforward entropy-guided approach that composes multi-head rewards by downweighting rules exhibiting high rating entropy. Theoretically, we demonstrate that rules with elevated entropy naturally receive minimal weighting in the Bradley-Terry optimization framework, justifying our entropy-based penalization. Through extensive experiments on RewardBench safety tasks, our method significantly surpasses several competitive baselines, including random weighting, uniform weighting, single-head Bradley-Terry models, and LLM-based judging methods. Our proposed approach is training-free, broadly applicable to various datasets, and maintains interpretability, offering a practical and effective solution for multi-attribute reward modeling.
zh
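上文 [NLP-56] 的做法是按“评分熵越高的规则越不可靠”来给多头奖励降权。下面给出评分熵计算与加权聚合的最小示意;其中 softmax(-entropy/T) 的加权形式与温度参数为示意性假设,仅用于说明降权思路。

```python
import numpy as np

def rule_entropy(ratings: np.ndarray, num_levels: int = 5) -> float:
    """某条安全规则在一批样本上的评分熵:熵越高说明该规则的打分越不稳定。"""
    counts = np.bincount(ratings, minlength=num_levels).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_guided_weights(ratings_per_rule: list, temperature: float = 1.0) -> np.ndarray:
    """按 softmax(-entropy / T) 给各规则分配权重,高熵规则被降权(加权形式为假设)。"""
    ent = np.array([rule_entropy(r) for r in ratings_per_rule])
    logits = -ent / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()

def aggregate_reward(rule_scores: np.ndarray, weights: np.ndarray) -> float:
    """把单个回复在各条规则上的得分按权重聚合为一个标量奖励。"""
    return float(rule_scores @ weights)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 三条规则在 1000 个样本上的历史评分(0~4):第 0 条几乎恒定,第 2 条接近均匀(高熵)
    ratings = [rng.choice(5, size=1000, p=[0.9, 0.05, 0.03, 0.01, 0.01]),
               rng.choice(5, size=1000, p=[0.4, 0.3, 0.2, 0.05, 0.05]),
               rng.choice(5, size=1000)]
    w = entropy_guided_weights(ratings)
    print("weights:", np.round(w, 3))                     # 高熵规则权重最小
    print("reward:", aggregate_reward(np.array([4.0, 3.0, 1.0]), w))
```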
[NLP-57] ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer
【速读】: 该论文旨在解决文本驱动的语音风格迁移问题,即通过文本描述中的风格线索调整语音的语调、语速和音色等特性。现有方法通常依赖于大规模神经架构或预训练语言模型,但计算成本较高。为应对这一挑战,论文提出了一种名为 ReverBERT 的高效框架,其灵感来源于状态空间模型(State Space Model, SSM)范式,并受到基于图像的方法启发。该方案的关键在于,它在语音空间中操作,通过整合潜在语音特征的离散傅里叶变换实现平滑且连续的风格调节,同时引入一种新颖的基于 Transformer 的SSM层,以桥接文本风格描述符与声学属性,大幅降低了推理时间,同时保持了高质量的语音特性。
链接: https://arxiv.org/abs/2503.20992
作者: Michael Brown,Sofia Martinez,Priya Singh
机构: University of Oregon (俄勒冈大学); Rochester Institute of Technology (罗切斯特理工学院); IIT Kharagpur (印度理工学院克勒格布尔)
类目: Graphics (cs.GR); Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Text-driven speech style transfer aims to mold the intonation, pace, and timbre of a spoken utterance to match stylistic cues from text descriptions. While existing methods leverage large-scale neural architectures or pre-trained language models, the computational costs often remain high. In this paper, we present ReverBERT, an efficient framework for text-driven speech style transfer that draws inspiration from a state space model (SSM) paradigm, loosely motivated by the image-based method of Wang and Liu (2024). Unlike image domain techniques, our method operates in the speech space and integrates a discrete Fourier transform of latent speech features to enable smooth and continuous style modulation. We also propose a novel Transformer-based SSM layer for bridging textual style descriptors with acoustic attributes, dramatically reducing inference time while preserving high-quality speech characteristics. Extensive experiments on benchmark speech corpora demonstrate that ReverBERT significantly outperforms baselines in terms of naturalness, expressiveness, and computational efficiency. We release our model and code publicly to foster further research in text-driven speech style transfer.
zh
[NLP-58] Cross-Modal State-Space Graph Reasoning for Structured Summarization
【速读】: 该论文旨在解决跨模态摘要生成中的高计算开销和有限可解释性问题。为应对这些挑战,论文提出了一种名为\textit{Cross-Modal State-Space Graph Reasoning (CSS-GR)}的框架,其关键在于结合状态空间模型与基于图的消息传递机制,通过构建一个捕获文本与视觉模态间以及模态内部关系的图结构,实现对多模态数据更全面的推理。这种方法不仅显著提升了摘要的质量和可解释性,同时保持了计算效率,并在标准多模态摘要数据集上得到验证。
链接: https://arxiv.org/abs/2503.20988
作者: Hannah Kim,Sofia Martinez,Jason Lee
机构: Department of Computer Science, University of Calgary (卡尔加里大学); Institute of AI Research, Rochester Institute of Technology (罗切斯特理工学院)
类目: Computation and Language (cs.CL); Graphics (cs.GR)
备注:
点击查看摘要
Abstract:The ability to extract compact, meaningful summaries from large-scale and multimodal data is critical for numerous applications, ranging from video analytics to medical reports. Prior methods in cross-modal summarization have often suffered from high computational overheads and limited interpretability. In this paper, we propose a Cross-Modal State-Space Graph Reasoning (CSS-GR) framework that incorporates a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches relying on purely sequential models, our method constructs a graph that captures inter- and intra-modal relationships, allowing more holistic reasoning over both textual and visual streams. We demonstrate that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency, as validated on standard multimodal summarization benchmarks. We also provide a thorough ablation study to highlight the contributions of each component.
zh
[NLP-59] Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction
【速读】: 该论文旨在解决通过传统调查方法难以全面捕捉公众对紧急护理设施体验的问题,特别是在时间和空间覆盖上的局限性。为了解决这一挑战,论文提出利用基于大型语言模型(Large Language Models, LLMs)的提示工程方法,从在线评论中提取细粒度的公众感知,以分析影响患者满意度的关键因素。解决方案的关键在于结合地理空间分析与多变量统计模型,通过收集Google Maps评论数据,并使用GPT模型进行提示工程,实现面向方面的公众情感分析。研究发现,人际关系因素和运营效率是患者满意度的最强决定因素,而技术质量、财务状况和设施条件在调整其他变量后未表现出显著独立影响。此外,仅人口密度与患者评分呈现微弱但显著的关联,其他社会经济和人口统计因素无显著相关性。本研究展示了众包方法在揭示居民关注的核心因素以及为利益相关者提供改善公众紧急护理满意度的洞见方面的潜力。
链接: https://arxiv.org/abs/2503.20981
作者: Xiaoran Xu,Zhaoqian Xue,Chi Zhang,Jhonatan Medri,Junjie Xiong,Jiayan Zhou,Jin Jin,Yongfeng Zhang,Siyuan Ma,Lingyao Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group(CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.
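下面用一个最小 Python 示意说明基于提示工程的面向方面情感分析流程(非论文原始 prompt 或代码):构造包含五个方面的提示词、调用任意 LLM 接口并解析 JSON 输出;其中 ASPECTS 列表的措辞、提示词内容与 llm_call 占位参数均为示例假设。

```python
import json

ASPECTS = ["interpersonal factors", "operational efficiency",
           "technical quality", "finances", "facilities"]

def build_prompt(review_text):
    """构造面向方面的情感分析提示词(措辞为示意,非论文原始 prompt)。"""
    return (
        "You are analyzing a Google Maps review of an urgent care clinic.\n"
        f'Review: """{review_text}"""\n'
        "For each aspect in [" + ", ".join(ASPECTS) + "], output a sentiment score "
        "in {-1, 0, 1} (negative / not mentioned / positive). "
        "Answer with a single JSON object keyed by aspect name."
    )

def parse_response(raw):
    """解析 LLM 返回的 JSON;解析失败时按未提及处理。"""
    try:
        scores = json.loads(raw)
        return {a: int(scores.get(a, 0)) for a in ASPECTS}
    except (json.JSONDecodeError, TypeError, ValueError):
        return {a: 0 for a in ASPECTS}

def analyze_review(review_text, llm_call):
    """llm_call 为任意「提示词 -> 回复文本」的调用接口(假设的占位参数)。"""
    return parse_response(llm_call(build_prompt(review_text)))

# 用法示意:用固定字符串代替真实的 GPT 调用
demo = analyze_review(
    "Friendly staff but a two-hour wait.",
    lambda p: '{"interpersonal factors": 1, "operational efficiency": -1}',
)
print(demo)
```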
zh
[NLP-60] ScreenLLM: Stateful Screen Schema for Efficient Action Understanding and Prediction
【速读】: 该论文旨在解决图形用户界面(GUI)代理训练中的三大挑战:监督信号稀疏性、大规模数据集的可扩展性以及对用户意图的细微理解需求。为应对这些挑战,论文提出了一种状态感知屏幕模式(stateful screen schema),这是一种高效表示GUI交互的方式,能够捕捉随时间推移的关键用户操作与意图。基于此基础,论文引入了ScreenLLM,这是一组针对高级UI理解和动作预测定制的多模态大型语言模型(Multimodal Large Language Models, MLLMs)。实验结果表明,ScreenLLM能够准确建模用户行为并预测动作,从而为构建可扩展、鲁棒且智能的GUI代理奠定了基础,这些代理能够在不同软件环境中增强用户体验。
链接: https://arxiv.org/abs/2503.20978
作者: Yiqiao Jin,Stefano Petrangeli,Yu Shen,Gang Wu
机构: Georgia Institute of Technology (乔治亚理工学院); Adobe Research (Adobe 研究院); Adobe Research (Adobe 研究院); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL)
备注: Accepted to MM4SG Workshop at The Web Conference 2025
点击查看摘要
Abstract:Graphical User Interface (GUI) agents are autonomous systems that interpret and generate actions, enabling intelligent user assistance and automation. Effective training of these agents presents unique challenges, such as sparsity in supervision signals, scalability for large datasets, and the need for nuanced user understanding. We propose stateful screen schema, an efficient representation of GUI interactions that captures key user actions and intentions over time. Building on this foundation, we introduce ScreenLLM, a set of multimodal large language models (MLLMs) tailored for advanced UI understanding and action prediction. Extensive experiments on both open-source and proprietary models show that ScreenLLM accurately models user behavior and predicts actions. Our work lays the foundation for scalable, robust, and intelligent GUI agents that enhance user interaction in diverse software environments.
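下面给出「状态感知屏幕模式」作为数据结构的一个假设性示意(并非论文定义的实际 schema):用 dataclass 记录屏幕元素、按时间累积的关键操作及推断意图,并可序列化为多模态大模型的文本输入;所有字段划分与序列化格式均为示例假设。

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UIAction:
    """单步用户操作:时间戳、动作类型、目标控件与可选的输入文本。"""
    timestamp: float
    action_type: str              # 例如 "click" / "type" / "scroll"
    target_element: str           # 控件的可访问性标签或 ID
    text_input: Optional[str] = None

@dataclass
class StatefulScreenSchema:
    """随时间累积关键操作与推断意图的屏幕状态(字段划分为示例假设)。"""
    screen_id: str
    visible_elements: List[str] = field(default_factory=list)
    action_history: List[UIAction] = field(default_factory=list)
    inferred_intent: Optional[str] = None

    def record(self, action: UIAction) -> None:
        self.action_history.append(action)

    def to_prompt(self) -> str:
        """序列化为可拼入多模态大模型输入的文本片段。"""
        steps = "; ".join(f"{a.action_type}({a.target_element})"
                          for a in self.action_history)
        return (f"[screen {self.screen_id}] elements={self.visible_elements} "
                f"actions=[{steps}]")

schema = StatefulScreenSchema("settings", ["Search box", "Wi-Fi toggle"])
schema.record(UIAction(0.0, "click", "Wi-Fi toggle"))
print(schema.to_prompt())
```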
zh
[NLP-61] Multi-Modal Framing Analysis of News
【速读】: 该论文试图解决政治传播自动化框架分析中存在的局限性问题,即现有研究仅基于固定的预定义框架,局限于文本分析而忽略视觉语境。这种局限性尤其在新闻 framing 中排除了关于编辑选择的有价值信息,包括文章配图等多模态元素。为克服这些限制,论文提出了一种利用大规模(视觉-语言)模型进行多模态、多标签框架分析的方法。其关键在于结合图像和文本的对比分析,通过提取图像中隐含的意义并与相应文本框架进行比较,同时识别特定议题下高度党派化的 framing 模式。此方法能够实现新闻中文本与图像的可扩展集成框架分析,从而更全面地揭示媒体偏见。
链接: https://arxiv.org/abs/2503.20960
作者: Arnav Arora,Srishti Yadav,Maria Antoniak,Serge Belongie,Isabelle Augenstein
机构: University of Copenhagen (哥本哈根大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Automated frame analysis of political communication is a popular task in computational social science that is used to study how authors select aspects of a topic to frame its reception. So far, such studies have been narrow, in that they use a fixed set of pre-defined frames and focus only on the text, ignoring the visual contexts in which those texts appear. Especially for framing in the news, this leaves out valuable information about editorial choices, which include not just the written article but also accompanying photographs. To overcome such limitations, we present a method for conducting multi-modal, multi-label framing analysis at scale using large (vision-)language models. Grounding our work in framing theory, we extract latent meaning embedded in images used to convey a certain point and contrast that to the text by comparing the respective frames used. We also identify highly partisan framing of topics with issue-specific frame analysis found in prior qualitative work. We demonstrate a method for doing scalable integrative framing analysis of both text and image in news, providing a more complete picture for understanding media bias.
zh
[NLP-62] Sociotechnical Effects of Machine Translation
【速读】: 该论文试图探讨机器翻译(Machine Translation, MT)在实际应用中带来的副作用与风险,并提出相应的缓解措施。论文重点关注神经网络机器翻译(Neural Machine Translation, NMT)及大规模语言模型(Large Language Models, LLMs)对环境的影响,包括其高昂的训练成本、巨大的电力消耗以及显著的碳排放量。同时,论文也讨论了机器翻译对译者及其他用户的潜在负面影响,涉及版权与数据所有权的问题,并提出了伦理方面的考量。此外,论文还展示了在危机场景中合理使用机器翻译可能挽救生命的方法及其实施路径。
解决方案的关键在于:通过构建碳足迹更低的小型模型以及对预训练模型进行微调,减少训练需求,从而降低能源消耗和环境负担;同时,在确保合法性和伦理性的前提下,探索机器翻译在危机场景中的有效应用方式。
链接: https://arxiv.org/abs/2503.20959
作者: Joss Moorkens,Andy Way,Séamus Lankford
机构: ADAPT Centre (ADAPT 中心); Dublin City University (都柏林城市大学); Munster Technological University (芒斯特技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:While the previous chapters have shown how machine translation (MT) can be useful, in this chapter we discuss some of the side-effects and risks that are associated, and how they might be mitigated. With the move to neural MT and approaches using Large Language Models (LLMs), there is an associated impact on climate change, as the models built by multinational corporations are massive. They are hugely expensive to train, consume large amounts of electricity, and output huge volumes of kgCO2 to boot. However, smaller models which still perform to a high level of quality can be built with much lower carbon footprints, and tuning pre-trained models saves on the requirement to train from scratch. We also discuss the possible detrimental effects of MT on translators and other users. The topics of copyright and ownership of data are discussed, as well as ethical considerations on data and MT use. Finally, we show how if done properly, using MT in crisis scenarios can save lives, and we provide a method of how this might be done.
zh
[NLP-63] Clean Clear: Feasibility of Safe LLM Clinical Guidance
【速读】: 该论文旨在开发并初步评估一款基于大型语言模型(LLM)的聊天机器人软件,使其能够可靠地回答基于伦敦大学学院医院(UCLH)临床指南的问题。论文的核心目标是利用开放权重的Llama-3.1-8B LLM从UCLH指南中提取相关信息以回答临床指南相关问题。解决方案的关键在于强调引用信息的安全性和可靠性,而非侧重于对其的解读与响应生成。通过七位医生对聊天机器人的表现进行评估,并将其答案与金标准进行比较,结果显示该聊天机器人在相关性、召回率及效率方面表现出色,具有显著潜力加速并改善医疗专业人员获取本地相关临床信息的过程。
链接: https://arxiv.org/abs/2503.20953
作者: Julia Ive,Felix Jozsa,Nick Jackson,Paulina Bondaronek,Ciaran Scott Hill,Richard Dobson
机构: University College London (伦敦大学学院); Wolfson Institute of Biomedical Research, University College London (沃尔夫森生物医学研究所,伦敦大学学院); King’s College Hospital (国王学院医院); National Hospital for Neurology and Neurosurgery (国家神经学与神经外科医院); King’s College London (国王学院伦敦)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Background: Clinical guidelines are central to safe evidence-based medicine in modern healthcare, providing diagnostic criteria, treatment options and monitoring advice for a wide range of illnesses. LLM-empowered chatbots have shown great promise in Healthcare QA tasks, offering the potential to provide quick and accurate responses to medical inquiries. Our main objective was the development and preliminary assessment of an LLM-empowered chatbot software capable of reliably answering clinical guideline questions using University College London Hospital (UCLH) clinical guidelines. Methods: We used the open-weight Llama-3.1-8B LLM to extract relevant information from the UCLH guidelines to answer questions. Our approach highlights the safety and reliability of referencing information over its interpretation and response generation. Seven doctors from the ward assessed the chatbot’s performance by comparing its answers to the gold standard. Results: Our chatbot demonstrates promising performance in terms of relevance, with ~73% of its responses rated as very relevant, showcasing a strong understanding of the clinical context. Importantly, our chatbot achieves a recall of 0.98 for extracted guideline lines, substantially minimising the risk of missing critical information. Approximately 78% of responses were rated satisfactory in terms of completeness. A small portion (~14.5%) contained minor unnecessary information, indicating occasional lapses in precision. The chatbot showed high efficiency, with an average completion time of 10 seconds, compared to 30 seconds for human respondents. Evaluation of clinical reasoning showed that 72% of the chatbot’s responses were without flaws. Our chatbot demonstrates significant potential to speed up and improve the process of accessing locally relevant clinical information for healthcare professionals.
zh
[NLP-64] Hacia la interpretabilidad de la detección anticipada de riesgos de depresión utilizando grandes modelos de lenguaje
【速读】: 该论文旨在解决抑郁症相关的早期风险检测(EDR)问题,特别是在西班牙语文本中识别处于抑郁风险中的用户。论文的关键在于提出了一种利用大型语言模型(LLMs)结合推理分析的方法,通过定义特定的推理标准,采用 Gemini 模型进行上下文学习,并结合人类可解释的响应进行评估。这种方法不仅实现了准确的预测,还提供了基于解释性推理的结果,从而为利用 LLMs 解决 EDR 问题提供了新的视角。
链接: https://arxiv.org/abs/2503.20939
作者: Horacio Thompson,Maximiliano Sapino,Edgardo Ferretti,Marcelo Errecalde
机构: 未知
类目: Computation and Language (cs.CL)
备注: In Spanish language, In 30° Congreso Argentino de Ciencias de la Computación (CACIC 2024), La Plata, Argentina
点击查看摘要
Abstract:Early Detection of Risks (EDR) on the Web involves identifying at-risk users as early as possible. Although Large Language Models (LLMs) have proven to solve various linguistic tasks efficiently, assessing their reasoning ability in specific domains is crucial. In this work, we propose a method for solving depression-related EDR using LLMs on Spanish texts, with responses that can be interpreted by humans. We define a reasoning criterion to analyze users through a specialist, apply in-context learning to the Gemini model, and evaluate its performance both quantitatively and qualitatively. The results show that accurate predictions can be obtained, supported by explanatory reasoning, providing a deeper understanding of the solution. Our approach offers new perspectives for addressing EDR problems by leveraging the power of LLMs.
zh
[NLP-65] GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations
【速读】: 本文旨在解决传统情感识别方法在会话情境下未能充分捕捉人类情绪动态性的问题。现有的单模态方法往往局限于孤立地分析个体表达,而多模态情感识别(MER)虽然结合多种信号,但传统上仍依赖于话语层面的分析,忽略了对话中情绪的动态变化。此外,尽管会话中情感识别(ERC)解决了部分问题,但现有方法在对齐多模态特征以及解释情绪如何随对话演变方面存在困难。
为了解决上述问题,论文提出了一种名为GatedxLSTM的新颖语音-文本多模态ERC模型。该模型的关键在于显式地考虑说话者及其对话伙伴的声音与转录文本,以确定引发情绪转变的最具影响力的话语。通过集成对比语言-音频预训练(CLAP)来提升跨模态对齐能力,并采用门控机制强调具有情感冲击力的话语,GatedxLSTM不仅提高了性能,还增强了可解释性。进一步地,对话情感解码器(DED)通过建模上下文依赖关系优化了情感预测。实验结果表明,GatedxLSTM在IEMOCAP数据集上的四类情感分类任务中达到了开源方法中的最新技术水平,验证了其在ERC应用中的有效性,并从心理学角度提供了可解释性分析。
链接: https://arxiv.org/abs/2503.20919
作者: Yupei Li,Qiyang Sun,Sunil Munthumoduku Krishna Murthy,Emran Alturki,Björn W. Schuller
机构: GLAM, Department of Computing, Imperial College London (帝国理工学院), UK; CHI – Chair of Health Informatics, MRI, Technical University of Munich (慕尼黑工业大学), Germany; relAI – the Konrad Zuse School of Excellence in Reliable AI, Munich, Germany; MDSI – Munich Data Science Institute, Munich, Germany; MCML – Munich Center for Machine Learning, Munich, Germany
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:
点击查看摘要
Abstract:Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual’s expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.
zh
[NLP-66] D4R – Exploring and Querying Relational Graphs Using Natural Language and Large Language Models – the Case of Historical Documents
【速读】: 该论文旨在解决非技术用户(尤其是历史学家)在利用文本数据进行研究时面临的复杂性和技术门槛问题。传统的历史研究依赖于手动分析大量未结构化的文本数据,缺乏高效的知识提取工具。为了解决这一问题,论文提出的关键方案是开发一个名为D4R的数字平台,它通过结合大型语言模型(Large Language Model)将自然语言查询转换为Cypher查询,从而实现从Neo4J图数据库中检索数据的功能。这一方法的核心在于利用先进的图形化工具简化文本分析和知识提取的过程,同时提供直观的用户界面(Graphical Interface),使用户能够轻松导航和分析从非结构化文本中提取的复杂关系数据。这种设计不仅弥合了人工智能技术与历史研究之间的差距,还展示了其在其他领域的潜在应用价值。
链接: https://arxiv.org/abs/2503.20914
作者: Michel Boeglin,David Kahn,Josiane Mothe,Diego Ortiz,David Panzoli
机构: IRIEC – Univ. Paul-Valéry (保罗瓦莱大学); INU Jean-François Champollion (让-弗朗索瓦·尚波利翁大学), FRAMESPA, UMR5136 CNRS; INSPE, UT2J, IRIT, UMR5505 CNRS, Univ. de Toulouse (图卢兹大学); IRIT, UMR5505 CNRS, Univ. de Toulouse (图卢兹大学); IRIT, UMR5505 CNRS, INU Jean-François Champollion (让-弗朗索瓦·尚波利翁大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 7 figures
点击查看摘要
Abstract:D4R is a digital platform designed to assist non-technical users, particularly historians, in exploring textual documents through advanced graphical tools for text analysis and knowledge extraction. By leveraging a large language model, D4R translates natural language questions into Cypher queries, enabling the retrieval of data from a Neo4J database. A user-friendly graphical interface allows for intuitive interaction, enabling users to navigate and analyse complex relational data extracted from unstructured textual documents. Originally designed to bridge the gap between AI technologies and historical research, D4R’s capabilities extend to various other domains. A demonstration video and a live software demo are available.
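下面是「自然语言 → Cypher → Neo4j 查询」这一流程的最小 Python 示意(非 D4R 官方实现):用任意 LLM 接口把问题翻译成 Cypher,再通过官方 neo4j Python 驱动执行;示意中的图模式(:Person、:Document 等)、提示词措辞与 llm_call 占位参数均为假设。

```python
from neo4j import GraphDatabase

CYPHER_PROMPT = (
    "Translate the question into a single read-only Cypher query for a Neo4j "
    "graph of historical persons (:Person)-[:MENTIONED_IN]->(:Document). "
    "Return only the query.\nQuestion: {question}"
)

def question_to_cypher(question, llm_call):
    """用任意 LLM 接口把自然语言问题翻译为 Cypher(llm_call 为假设的占位参数)。"""
    return llm_call(CYPHER_PROMPT.format(question=question)).strip().strip("`")

def run_query(uri, user, password, cypher):
    """执行 Cypher 并返回记录列表(使用官方 neo4j Python 驱动)。"""
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher)]
    finally:
        driver.close()

# 用法示意(图模式与问题均为假设):
# cypher = question_to_cypher("Which documents mention Jean Calvin?", my_llm)
# rows = run_query("bolt://localhost:7687", "neo4j", "password", cypher)
```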
zh
[NLP-67] VinaBench: Benchmark for Faithful and Consistent Visual Narratives CVPR2025
【速读】: 该论文旨在解决视觉叙事生成中,确保生成的图像序列忠实于输入文本且在生成的图像间保持一致性的挑战。这一挑战源于缺乏用于规划故事的知识约束。为解决此问题,论文提出了一个新的基准数据集VinaBench,通过标注视觉叙事样本中的常识性和语篇性约束,为学习隐含的视觉叙事策略提供系统框架。解决方案的关键在于基于这些引入的叙事约束,进一步提出新的评估指标,以更紧密地衡量生成图像的一致性以及生成结果与输入文本叙述的对齐程度。实验结果表明,在VinaBench知识约束下进行训练可有效提升生成视觉叙事的忠实性和连贯性。
链接: https://arxiv.org/abs/2503.20871
作者: Silin Gao,Sheryl Mathew,Li Mi,Sepideh Mamooler,Mengjie Zhao,Hiromi Wakaki,Yuki Mitsufuji,Syrielle Montariol,Antoine Bosselut
机构: EPFL(瑞士联邦理工学院); Sony Group Corporation(索尼集团); Carnegie Mellon University(卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)
点击查看摘要
Abstract:Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench’s knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.
zh
[NLP-68] Both Direct and Indirect Evidence Contribute to Dative Alternation Preferences in Language Models
【速读】: 该论文试图解决的问题是语言模型(Language Models, LMs)在语法现象上表现出的人类偏好是否主要归因于直接接触这些现象,还是更广泛的语言特性。论文通过探索英语双宾语交替现象(DO:“gave Y the X” vs. PO:“gave the X to Y”),研究影响交替选择的因素(长度和指代性)如何影响模型的行为。解决方案的关键在于采用受控训练方法,通过迭代训练小型语言模型(LMs)处理经过系统操纵的输入数据,同时结合直接操控双宾语结构中的长度和指代性偏见以及全局长度效应的实验设计,揭示语言模型的句法偏好既来源于直接证据,也来源于间接证据。
链接: https://arxiv.org/abs/2503.20850
作者: Qing Yao,Kanishka Misra,Leonie Weissweiler,Kyle Mahowald
机构: Department of Linguistics, The University of Texas at Austin (德克萨斯大学奥斯汀分校语言学系); Toyota Technological Institute at Chicago (丰田技术研究院芝加哥分院)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Language models (LMs) tend to show human-like preferences on a number of syntactic phenomena, but the extent to which these are attributable to direct exposure to the phenomena or more general properties of language is unclear. We explore this with the English dative alternation (DO: “gave Y the X” vs. PO: “gave the X to Y”), using a controlled rearing paradigm wherein we iteratively train small LMs on systematically manipulated input. We focus on properties that affect the choice of alternant: length and animacy. Both properties are directly present in datives but also reflect more global tendencies for shorter elements to precede longer ones and animates to precede inanimates. First, by manipulating and ablating datives for these biases in the input, we show that direct evidence of length and animacy matters, but easy-first preferences persist even without such evidence. Then, using LMs trained on systematically perturbed datasets to manipulate global length effects (re-linearizing sentences globally while preserving dependency structure), we find that dative preferences can emerge from indirect evidence. We conclude that LMs’ emergent syntactic preferences come from a mix of direct and indirect sources.
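下面给出用因果语言模型比较 DO/PO 两种与格结构偏好的最小示意(这只演示偏好打分,并非论文的受控训练流程):对两种句式分别计算整句对数概率并比较;示例中使用 GPT-2 与自拟例句,均为假设。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_logprob(model, tokenizer, sentence):
    """返回整句的对数概率:loss 是按预测位置平均的交叉熵,乘回位置数再取负。"""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    n_tokens = inputs["input_ids"].size(1)
    return -out.loss.item() * (n_tokens - 1)

def dative_preference(model, tokenizer, do_sent, po_sent):
    """比较 DO/PO 两种结构的得分,返回偏好方向与分差。"""
    do = sentence_logprob(model, tokenizer, do_sent)
    po = sentence_logprob(model, tokenizer, po_sent)
    return ("DO" if do > po else "PO"), do - po

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
print(dative_preference(lm, tok,
                        "The teacher gave the student the book.",
                        "The teacher gave the book to the student."))
```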
zh
[NLP-69] Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead
【速读】: 该论文旨在解决在高风险领域中由于监管、隐私或机构原因导致数据被分割且难以有效利用的问题。论文的核心问题是探索如何通过生成式隐私保护合成数据(Privacy-preserving Synthetic Data)来克服这一挑战。论文的关键在于结合生成模型(Generative Models)与差分隐私(Differential Privacy)的理论基础,提出了一套综合评估方法,揭示了下游任务效用与隐私保障之间的权衡关系,并指出当前研究中的两大关键不足:缺乏代表专业化领域的现实基准数据集以及对形式化隐私保证的实证验证不足。通过针对四个领先方法在五个专业化领域真实数据集上的实证分析,论文发现,在实际隐私约束(ε ≤ 4)下,性能显著下降,表明通用领域基准结果与特定领域数据表现之间存在巨大差距。因此,论文强调需要更稳健的评估框架、标准化的专业领域基准,以及改进的技术手段以满足隐私敏感领域的需求,从而充分发挥该技术的巨大潜力。
链接: https://arxiv.org/abs/2503.20846
作者: Viktor Schlegel,Anil A Bharath,Zilong Zhao,Kevin Yee
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages + references + Appendix. Preprint
点击查看摘要
Abstract:Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains where information is compartmentalized for regulatory, privacy, or institutional reasons. This survey provides a comprehensive framework for understanding the landscape of privacy-preserving synthetic data, presenting the theoretical foundations of generative models and differential privacy followed by a review of state-of-the-art methods across tabular data, images, and text. Our synthesis of evaluation approaches highlights the fundamental trade-off between utility for down-stream tasks and privacy guarantees, while identifying critical research gaps: the lack of realistic benchmarks representing specialized domains and insufficient empirical evaluations required to contextualise formal guarantees. Through empirical analysis of four leading methods on five real-world datasets from specialized domains, we demonstrate significant performance degradation under realistic privacy constraints (ε ≤ 4), revealing a substantial gap between results reported on general domain benchmarks and performance on domain-specific data. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks. These challenges underscore the need for robust evaluation frameworks, standardized benchmarks for specialized domains, and improved techniques to address the unique requirements of privacy-sensitive fields such that this technology can deliver on its considerable potential.
zh
[NLP-70] Named Entity Recognition in Context
【速读】: 该论文旨在解决古汉语命名实体识别(Classical Chinese Named Entity Recognition, CC-NER)的问题。解决方案的关键在于集成三个核心组件:(1) 基于现代Transformer架构的双向编码器Pindola,其在大量古汉语语料上进行了预训练;(2) 一个检索模块,用于为目标序列获取相关的外部上下文;(3) 一个生成式推理步骤,以古汉语总结检索到的上下文,从而实现更鲁棒的实体消歧。通过这一方法,团队实现了平均F1分数为85.58,较竞赛基线提升了近5个百分点。
链接: https://arxiv.org/abs/2503.20836
作者: Colin Brisson(CRCAO),Ayoub Kahfy,Marc Bui(AOROC),Frédéric Constant(ERMES)
机构: 未知
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:We present the Named Entity Recognition system developed by the Edit Dunhuang team for the EvaHan2025 competition. Our approach integrates three core components: (1) Pindola, a modern transformer-based bidirectional encoder pretrained on a large corpus of Classical Chinese texts; (2) a retrieval module that fetches relevant external context for each target sequence; and (3) a generative reasoning step that summarizes retrieved context in Classical Chinese for more robust entity disambiguation. Using this approach, we achieve an average F1 score of 85.58, improving upon the competition baseline by nearly 5 points.
zh
[NLP-71] Comprehensive Manuscript Assessment with Text Summarization Using 69707 articles
【速读】: 该论文旨在解决学术论文早期阶段对未来影响力评估的问题,特别是针对尚未发表的研究手稿,提出了一种基于影响力的分类方法。传统影响力衡量方式依赖于引用次数,但其预测通常受限于特定学科领域或需要早期引用数据,缺乏普适性。本文的关键解决方案在于构建了一个包含69707篇跨多学科文章的大规模数据集,并采用深度学习方法,利用Transformer语言模型提取手稿及元数据中的语义特征,通过设计文本融合层整合标题与摘要信息,从而实现对期刊影响力及手稿未来影响力的有效预测。此外,该模型还展示了生成反馈和改进建议的潜力。
链接: https://arxiv.org/abs/2503.20835
作者: Qichen Sun,Yuxing Lu,Kun Xia,Li Chen,He Sun,Jinzhuo Wang
机构: Peking University (北京大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL)
备注:
点击查看摘要
Abstract:Rapid and efficient assessment of the future impact of research articles is a significant concern for both authors and reviewers. The most common standard for measuring the impact of academic papers is the number of citations. In recent years, numerous efforts have been undertaken to predict citation counts within various citation windows. However, most of these studies focus solely on a specific academic field or require early citation counts for prediction, rendering them impractical for the early-stage evaluation of papers. In this work, we harness Scopus to curate a significantly comprehensive and large-scale dataset of information from 69707 scientific articles sourced from 99 journals spanning multiple disciplines. We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata. To summarize the semantic features, such as titles and abstracts, we employ a Transformer-based language model to encode semantic features and design a text fusion layer to capture shared information between titles and abstracts. We specifically focus on the following impact-based prediction tasks using information of scientific manuscripts in pre-publication stage: (1) The impact of journals in which the manuscripts will be published. (2) The future impact of manuscripts themselves. Extensive experiments on our datasets demonstrate the superiority of our proposed model for impact-based prediction tasks. We also demonstrate potentials in generating manuscript’s feedback and improvement suggestions.
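下面给出「文本融合层」的一个假设性 PyTorch 示意(非论文原始结构):以标题句向量为 query 对摘要 token 向量做交叉注意力,再与两者的池化表示拼接后送入分类头;维度、注意力头数与分类头结构均为示例假设。

```python
import torch
import torch.nn as nn

class TextFusionLayer(nn.Module):
    """标题-摘要融合层的最小示意:交叉注意力捕获共享信息后做影响力分类。"""
    def __init__(self, dim=768, num_heads=8, num_classes=2):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(dim * 3, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, title_emb, abstract_emb):
        # title_emb: (B, 1, D) 标题句向量;abstract_emb: (B, T, D) 摘要 token 向量
        shared, _ = self.cross_attn(title_emb, abstract_emb, abstract_emb)
        feats = torch.cat(
            [title_emb.squeeze(1), shared.squeeze(1), abstract_emb.mean(dim=1)],
            dim=-1,
        )
        return self.classifier(feats)

layer = TextFusionLayer()
logits = layer(torch.randn(4, 1, 768), torch.randn(4, 120, 768))
print(logits.shape)  # torch.Size([4, 2])
```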
zh
[NLP-72] SE-GNN: Seed Expanded-Aware Graph Neural Network with Iterative Optimization for Semi-supervised Entity Alignment
【速读】: 本文旨在解决知识图谱(Knowledge Graphs, KGs)规模增大导致人工标注预对齐种子对(pre-aligned seed pairs)困难的问题,以及现有方法在利用单一结构信息获取潜在种子对时因知识图谱结构异质性而导致的质量不佳的问题。此外,还关注噪声种子对引入的嵌入失真对对齐效果的影响。为了解决这些问题,论文提出了一种基于迭代优化的种子扩展感知图神经网络(Seed Expanded-aware Graph Neural Network with Iterative Optimization for Semi-Supervised Entity Alignment),简称SE-GNN。其关键是首先通过结合语义属性和结构特征,并采用条件过滤机制获得高质量初始潜在种子对;其次设计了局部和全局感知机制,结合初始潜在种子对与局部和全局信息以获取更全面的实体嵌入表示,缓解结构异质性的影响并为初始潜在种子对的优化奠定基础;最后提出阈值最近邻嵌入校正策略,结合相似度阈值和双向最近邻方法作为过滤机制选择迭代潜在种子对,并使用嵌入校正策略消除嵌入失真。
链接: https://arxiv.org/abs/2503.20801
作者: Tao Meng,Shuo Shan,Hongen Shao,Yuntao Shou,Wei Ai,Keqin Li
机构: School of Computer and Information Engineering, Central South University of Forestry and Technology (中南林业科技大学计算机与信息工程学院); Department of Computer Science, State University of New York (纽约州立大学计算机科学系)
类目: Computation and Language (cs.CL)
备注: 15 pages
点击查看摘要
Abstract:Entity alignment aims to use pre-aligned seed pairs to find other equivalent entities from different knowledge graphs (KGs) and is widely used in graph fusion-related fields. However, as the scale of KGs increases, manually annotating pre-aligned seed pairs becomes difficult. Existing research utilizes entity embeddings obtained by aggregating single structural information to identify potential seed pairs, thus reducing the reliance on pre-aligned seed pairs. However, due to the structural heterogeneity of KGs, the quality of potential seed pairs obtained using only a single structural information is not ideal. In addition, although existing research improves the quality of potential seed pairs through semi-supervised iteration, they underestimate the impact of embedding distortion produced by noisy seed pairs on the alignment effect. In order to solve the above problems, we propose a seed expanded-aware graph neural network with iterative optimization for semi-supervised entity alignment, named SE-GNN. First, we utilize the semantic attributes and structural features of entities, combined with a conditional filtering mechanism, to obtain high-quality initial potential seed pairs. Next, we designed a local and global awareness mechanism. It introduces initial potential seed pairs and combines local and global information to obtain a more comprehensive entity embedding representation, which alleviates the impact of KGs structural heterogeneity and lays the foundation for the optimization of initial potential seed pairs. Then, we designed the threshold nearest neighbor embedding correction strategy. It combines the similarity threshold and the bidirectional nearest neighbor method as a filtering mechanism to select iterative potential seed pairs and also uses an embedding correction strategy to eliminate the embedding distortion.
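下面用 NumPy 给出「相似度阈值 + 双向互近邻」筛选潜在种子对的最小示意(非 SE-GNN 官方实现):仅当两实体互为最近邻且余弦相似度超过阈值时才入选;阈值取值与相似度度量均为示例假设。

```python
import numpy as np

def select_seed_pairs(emb_src, emb_tgt, sim_threshold=0.8):
    """双向互为最近邻且余弦相似度超过阈值的实体对才被选为迭代潜在种子对。"""
    src = emb_src / np.linalg.norm(emb_src, axis=1, keepdims=True)
    tgt = emb_tgt / np.linalg.norm(emb_tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                 # (n_src, n_tgt) 余弦相似度矩阵
    nn_src = sim.argmax(axis=1)       # 源实体 -> 目标侧最近邻
    nn_tgt = sim.argmax(axis=0)       # 目标实体 -> 源侧最近邻
    pairs = []
    for i, j in enumerate(nn_src):
        if nn_tgt[j] == i and sim[i, j] >= sim_threshold:   # 双向一致 + 阈值过滤
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs

rng = np.random.default_rng(0)
e1 = rng.normal(size=(5, 16))
e2 = e1 + 0.05 * rng.normal(size=(5, 16))
print(select_seed_pairs(e1, e2, sim_threshold=0.9))
```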
zh
[NLP-73] “Whose Side Are You On?” Estimating Ideology of Political and News Content Using Large Language Models and Few-shot Demonstration Selection
【速读】: 该论文试图解决在社交平台上因内容偏见、极化现象及过滤气泡等问题引发的对在线内容意识形态分类的挑战。现有方法受限于大量人工标注数据的需求以及无法适应不断演化的意识形态语境。为应对这些问题,论文探索了大型语言模型(Large Language Models, LLMs)通过情境学习(in-context learning, ICL)进行在线内容政治意识形态分类的潜力。解决方案的关键在于采用基于标签平衡的选择策略进行示范样本筛选,并通过实验验证了该方法在包含新闻文章和YouTube视频的三个数据集上的有效性,结果显示其显著优于零样本和传统监督方法。此外,论文还评估了元数据(如内容来源和描述)对意识形态分类的影响,并探讨了提供内容来源对LLM分类结果的作用。
链接: https://arxiv.org/abs/2503.20797
作者: Muhammad Haroon,Magdalena Wojcieszak,Anshuman Chhabra
机构: University of California, Davis (加州大学戴维斯分校); University of South Florida (南佛罗里达大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:
点击查看摘要
Abstract:The rapid growth of social media platforms has led to concerns about radicalization, filter bubbles, and content bias. Existing approaches to classifying ideology are limited in that they require extensive human effort, the labeling of large datasets, and are not able to adapt to evolving ideological contexts. This paper explores the potential of Large Language Models (LLMs) for classifying the political ideology of online content in the context of the two-party US political spectrum through in-context learning (ICL). Our extensive experiments involving demonstration selection in label-balanced fashion, conducted on three datasets comprising news articles and YouTube videos, reveal that our approach significantly outperforms zero-shot and traditional supervised methods. Additionally, we evaluate the influence of metadata (e.g., content source and descriptions) on ideological classification and discuss its implications. Finally, we show how providing the source for political and non-political content influences the LLM’s classification.
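下面给出「标签均衡的示例选择 + ICL 提示拼接」的最小 Python 示意(非论文原始代码):按标签分桶、每个标签抽取等量示例后打乱,再拼接成上下文学习提示;示例文本、标签集合与随机抽样方式均为假设。

```python
import random
from collections import defaultdict

def select_balanced_demos(pool, k_per_label, seed=0):
    """从标注池中按每个标签各取 k 个示例,构成标签均衡的演示集。"""
    random.seed(seed)
    by_label = defaultdict(list)
    for text, label in pool:
        by_label[label].append((text, label))
    demos = []
    for label, items in by_label.items():
        demos.extend(random.sample(items, min(k_per_label, len(items))))
    random.shuffle(demos)
    return demos

def build_icl_prompt(demos, query):
    """把演示集与待分类文本拼接为 in-context learning 提示。"""
    lines = [f"Text: {t}\nIdeology: {y}\n" for t, y in demos]
    return "".join(lines) + f"Text: {query}\nIdeology:"

pool = [("Article praising tax cuts ...", "right"),
        ("Op-ed supporting universal healthcare ...", "left"),
        ("Report on border security bill ...", "right"),
        ("Column on expanding voting rights ...", "left")]
print(build_icl_prompt(select_balanced_demos(pool, k_per_label=1), "New story ..."))
```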
zh
[NLP-74] Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text DeIdentification? ECIR2025
【速读】: 该论文旨在评估基于API的三种领先去标识化系统(Azure Health Data Services、AWS Comprehend Medical和OpenAI GPT-4o)与自研去标识化系统Healthcare NLP在真实临床文档数据集上的性能差异。论文通过实体级和标记级分析发现,Healthcare NLP在受保护健康信息(PHI)检测方面实现了最高的准确率(F1分数为96%),显著优于其他系统(Azure: 91%,AWS: 83%,GPT-4o: 79%)。此外,Healthcare NLP还通过固定成本本地部署模型降低了超过80%的处理成本,解决了云端按请求计费模式带来的成本上升问题。论文的关键在于提出了一种兼具高精度、强适应性和经济性的去标识化解决方案,强调零样本商业API无法满足监管级别的临床去标识化需求,而Healthcare NLP凭借其卓越性能、定制能力及成本优势成为医疗组织在临床自然语言处理(NLP)工作流中实现合规性和可扩展性的更优选择。
链接: https://arxiv.org/abs/2503.20794
作者: Veysel Kocaman,Muhammed Santas,Yigit Gul,Mehmet Butgul,David Talby
机构: John Snow Labs inc. (约翰·斯诺实验室公司)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 14 pages, accepted at Text2Story Workshop at ECIR 2025
点击查看摘要
Abstract:We systematically assess the performance of three leading API-based de-identification systems - Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o - against our de-identification systems on a ground truth dataset of 48 clinical documents annotated by medical experts. Our analysis, conducted at both entity-level and token-level, demonstrates that our solution, Healthcare NLP, achieves the highest accuracy, with a 96% F1-score in protected health information (PHI) detection, significantly outperforming Azure (91%), AWS (83%), and GPT-4o (79%). Beyond accuracy, Healthcare NLP is also the most cost-effective solution, reducing processing costs by over 80% compared to Azure and GPT-4o. Its fixed-cost local deployment model avoids the escalating per-request fees of cloud-based services, making it a scalable and economical choice. Our results underscore a critical limitation: zero-shot commercial APIs fail to meet the accuracy, adaptability, and cost-efficiency required for regulatory-grade clinical de-identification. Healthcare NLP’s superior performance, customization capabilities, and economic advantages position it as the more viable solution for healthcare organizations seeking compliance and scalability in clinical NLP workflows.
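下面给出实体级 PHI 检测评估中精确率/召回率/F1 的最小计算示意(通用写法,并非任何一家系统的官方评测脚本):以(文档ID, 起止位置, 实体类型)的精确匹配作为命中标准,该匹配标准为示例假设。

```python
def entity_prf(gold, pred):
    """实体级精确匹配的 P/R/F1:每个实体用 (文档ID, 起始, 结束, 类型) 表示。"""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("doc1", 0, 8, "NAME"), ("doc1", 20, 30, "DATE"), ("doc2", 5, 15, "ID")}
pred = {("doc1", 0, 8, "NAME"), ("doc2", 5, 15, "ID"), ("doc2", 40, 44, "AGE")}
print(entity_prf(gold, pred))  # 约为 (0.667, 0.667, 0.667)
```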
zh
[NLP-75] ECLAIR: Enhanced Clarification for Interactive Responses in an Enterprise AI Assistant
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理企业级交互中的歧义问题时表现不佳的问题,尤其是在需要依赖上下文和领域特定知识的场景下。论文的关键解决方案是提出了一种名为ECLAIR(Enhanced CLArification for Interactive Responses)的多智能体框架,用于交互式消歧。ECLAIR的核心在于通过定义定制化智能体、执行歧义推理、生成澄清问题,并利用用户反馈优化最终响应,从而显著提升澄清问题生成的效果,超越标准的Few-Shot方法。
链接: https://arxiv.org/abs/2503.20791
作者: John Murzaku,Zifan Liu,Vaishnavi Muppala,Md Mehrab Tanjim,Xiang Chen,Yunyao Li
机构: Adobe (Adobe)
类目: Computation and Language (cs.CL)
备注: 3 pages, 1 figure
点击查看摘要
Abstract:Large language models (LLMs) have shown remarkable progress in understanding and generating natural language across various applications. However, they often struggle with resolving ambiguities in real-world, enterprise-level interactions, where context and domain-specific knowledge play a crucial role. In this demonstration, we introduce ECLAIR (Enhanced CLArification for Interactive Responses), a multi-agent framework for interactive disambiguation. ECLAIR enhances ambiguous user query clarification through an interactive process where custom agents are defined, ambiguity reasoning is conducted by the agents, clarification questions are generated, and user feedback is leveraged to refine the final response. When tested on real-world customer data, ECLAIR demonstrates significant improvements in clarification question generation compared to standard few-shot methods.
zh
[NLP-76] Jaco: An Offline Running Privacy-aware Voice Assistant
【速读】: 该论文试图解决智能语音助手在隐私保护与功能扩展性之间的平衡问题,同时确保其在低资源设备上的可用性。解决方案的关键在于提出了一种名为Jaco的新架构,该架构支持完全离线运行(包括低资源设备如Raspberry Pi)、通过技能概念轻松扩展功能、专注于隐私保护而不牺牲开发者的功能实现能力,并支持多语言且具备与其他语音助手方案竞争的能力。
链接: https://arxiv.org/abs/2209.07775
作者: Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
机构: 未知
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
点击查看摘要
Abstract:With the recent advance in speech technology, smart voice assistants have been improved and are now used by many people. But often these assistants are running online as a cloud service and are not always known for a good protection of users’ privacy. This paper presents the architecture of a novel voice assistant, called Jaco, with the following features: (a) It can run completely offline, even on low resource devices like a RaspberryPi. (b) Through a skill concept it can be easily extended. (c) The architectural focus is on protecting users’ privacy, but without restricting capabilities for developers. (d) It supports multiple languages. (e) It is competitive with other voice assistant solutions. In this respect the assistant combines and extends the advantages of other approaches.
zh
[NLP-77] Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在任务特定数据集上的微调过程中安全性和能力之间的权衡问题,即所谓的安全-能力权衡(safety-capability trade-off)。论文的关键在于提出了一种理论框架,用于理解两种主要的安全感知微调策略中安全性和能力之间的相互作用,并深入分析数据相似性、上下文重叠以及对齐损失景观的影响。通过这一框架,论文揭示了LLM微调中安全-能力权衡的基本限制,并通过数值实验验证了理论结果。
链接: https://arxiv.org/abs/2503.20807
作者: Pin-Yu Chen,Han Shen,Payel Das,Tianyi Chen
机构: IBM Research (IBM 研究院); Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: The first two authors contribute equally to this work and are listed in alphabetical order
点击查看摘要
Abstract:Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.
zh
计算机视觉
[CV-0] Mobile-VideoGPT : Fast and Accurate Video Understanding Language Model
【速读】:该论文旨在解决视频理解模型在实际应用中面临的高计算需求、庞大参数量以及缓慢推理速度的问题。为应对这些挑战,论文提出了一种名为Mobile-VideoGPT的高效多模态框架,其设计目标是在低于10亿参数的情况下运行。解决方案的关键在于引入轻量级的双视觉编码器、高效的投影模块以及小型语言模型(SLM),以实现实时处理能力。此外,通过注意力机制选择关键帧的Attention-Based Frame Scoring机制以及修剪冗余视觉标记并保留关键上下文线索的高效令牌投影器进一步提升了模型效率。实验结果表明,Mobile-VideoGPT-0.5B版本在保持更高吞吐量的同时,比现有同等规模的最佳模型平均高出6个百分点,且参数减少了40%。
链接: https://arxiv.org/abs/2503.21782
作者: Abdelrahman Shaker,Muhammad Maaz,Chenhui Gou,Hamid Rezatofighi,Salman Khan,Fahad Shahbaz Khan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report. Project Page: this https URL
点击查看摘要
Abstract:Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across well-established six video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: this https URL.
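下面给出「基于注意力的关键帧打分」的一个最小 PyTorch 示意(非 Mobile-VideoGPT 官方实现):用查询向量与各帧特征的缩放点积作为分数,取 top-k 帧并保持时间顺序;打分方式与 top_k 取值均为示例假设。

```python
import torch

def select_key_frames(frame_feats, query, top_k=8):
    """对每帧打分并选出得分最高的 top_k 帧,返回按时间排序的帧索引与分数。"""
    # frame_feats: (T, D) 每帧视觉特征;query: (D,) 例如文本指令的池化向量
    scores = torch.softmax(frame_feats @ query / frame_feats.size(-1) ** 0.5, dim=0)
    top = torch.topk(scores, k=min(top_k, frame_feats.size(0)))
    keep = torch.sort(top.indices).values      # 保持时间顺序
    return keep, scores

feats = torch.randn(64, 256)
q = torch.randn(256)
idx, s = select_key_frames(feats, q, top_k=8)
print(idx.tolist())
```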
zh
[CV-1] VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models CVPR2025
【速读】:该论文致力于解决多主体及其交互运动的定制化文本到视频生成问题,现有方法主要局限于单一概念(主体身份或运动模式)的个性化,难以有效处理多个主体与其期望运动模式的结合。为应对这一挑战,论文提出了一种统一框架VideoMage,其关键在于采用主体和运动LoRAs捕捉用户提供的图像和视频中的个性化内容,并通过与视觉外观无关的运动学习方法解耦运动模式与视觉外观。此外,还开发了一种时空组成方案以指导所需运动模式内主体之间的交互。实验结果表明,VideoMage在生成一致且用户可控的视频方面优于现有方法。
链接: https://arxiv.org/abs/2503.21781
作者: Chi-Pin Huang,Yen-Siang Wu,Hung-Kai Chung,Kai-Po Chang,Fu-En Yang,Yu-Chiang Frank Wang
机构: Graduate Institute of Communication Engineering, National Taiwan University (台湾大学通信工程研究所); National Taiwan University (台湾大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project Page: this https URL
点击查看摘要
Abstract:Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
zh
[CV-2] Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation CVPR2025
【速读】:该论文旨在解决开放词汇语义分割模型在训练集与测试集域间存在较大分布偏移时性能下降的问题,特别是在无需额外训练的情况下实现测试时领域自适应(Test-Time Domain Adaptation, TTDA)。论文的关键创新在于提出了一种名为Semantic Library Adaptation (SemLA) 的框架,其核心思想是构建一个基于LoRA(Low-Rank Adaptation)的适配器库,并通过CLIP嵌入索引。在推理阶段,SemLA能够根据目标域在嵌入空间中的接近程度动态合并最相关的适配器,从而为每个特定输入构建定制化的模型。这种方法不仅避免了额外的训练需求,还通过追踪适配器贡献提高了可解释性,并保护了数据隐私,使其适用于敏感应用场景。实验结果表明,SemLA在跨20个领域的基准测试中表现出色,显著提升了开放词汇语义分割任务的领域适应能力和性能表现。
链接: https://arxiv.org/abs/2503.21780
作者: Reza Qorbani,Gianluca Villani,Theodoros Panagiotakopoulos,Marc Botet Colomer,Linus Härenstam-Nielsen,Mattia Segu,Pier Luigi Dovesi,Jussi Karlgren,Daniel Cremers,Federico Tombari,Matteo Poggi
机构: The Good AI Lab (The Good AI 实验室); University of Toronto (多伦多大学); KTH; King; Technical University of Munich (慕尼黑工业大学); Munich Center for Machine Learning (慕尼黑机器学习中心); Google; ETH Zurich (瑞士联邦理工学院); University of Bologna (博洛尼亚大学); AMD Silo AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project page: this https URL Code: this https URL
点击查看摘要
Abstract:Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries, providing versatile performance on novel datasets. However, large shifts between training and test domains degrade their performance, requiring fine-tuning for effective real-world applications. We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation. SemLA leverages a library of LoRA-based adapters indexed with CLIP embeddings, dynamically merging the most relevant adapters based on proximity to the target domain in the embedding space. This approach constructs an ad-hoc model tailored to each specific input without additional training. Our method scales efficiently, enhances explainability by tracking adapter contributions, and inherently protects data privacy, making it ideal for sensitive applications. Comprehensive experiments on a 20-domain benchmark built over 10 standard datasets demonstrate SemLA’s superior adaptability and performance across diverse settings, establishing a new standard in domain adaptation for open-vocabulary semantic segmentation.
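下面给出「按 CLIP 嵌入相似度检索并加权融合 LoRA 适配器」的最小 PyTorch 示意(非 SemLA 官方实现):对目标域嵌入与各适配器索引嵌入取余弦相似度 top-k,再用 softmax 权重对 LoRA 参数逐键加权平均;相似度度量、温度参数与融合公式均为示例假设。

```python
import torch
import torch.nn.functional as F

def merge_lora_adapters(target_emb, adapter_embs, adapter_weights, top_k=3, temp=0.1):
    """检索最相关的 top-k 适配器并按相似度 softmax 权重融合其参数。"""
    sims = F.cosine_similarity(target_emb.unsqueeze(0), adapter_embs, dim=-1)  # (N,)
    top = torch.topk(sims, k=min(top_k, sims.numel()))
    w = torch.softmax(top.values / temp, dim=0)                                # (k,)
    merged = {}
    for name in adapter_weights[0]:                # 假设各适配器的参数键一致
        stacked = torch.stack([adapter_weights[int(i)][name] for i in top.indices])
        merged[name] = (w.view(-1, *[1] * (stacked.dim() - 1)) * stacked).sum(0)
    return merged, top.indices, w

# 玩具示例:4 个适配器,每个只含一个 LoRA 矩阵
adapters = [{"lora_A": torch.randn(8, 512)} for _ in range(4)]
index_embs = torch.randn(4, 768)
target = index_embs[2] + 0.01 * torch.randn(768)
merged, idx, w = merge_lora_adapters(target, index_embs, adapters)
print(idx.tolist(), [round(x, 3) for x in w.tolist()], merged["lora_A"].shape)
```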
zh
[CV-3] X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction
【速读】:该论文旨在解决传统四维 computed tomography (4D CT) 重建方法因相位分箱工作流程固有限制而导致的运动对齐误差和临床实用性受限的问题。解决方案的关键在于提出了一种名为 X²-Gaussian 的新框架,通过结合动态辐射高斯点阵投射与自监督呼吸运动学习,实现了连续时间的 4D CT 重建。该方法采用时空编码器-解码器架构预测随时间变化的高斯变形,消除了相位离散化的需求,并通过生理驱动的周期一致性损失函数,在无需外部门控设备的情况下直接从投影数据中学习患者特定的呼吸周期,从而实现硬件无关的周期学习。
链接: https://arxiv.org/abs/2503.21779
作者: Weihao Yu,Yuanhao Cai,Ruyi Zha,Zhiwen Fan,Chenxin Li,Yixuan Yuan
机构: The Chinese University of Hong Kong (香港中文大学); Johns Hopkins University (约翰斯·霍普金斯大学); The Australian National University (澳大利亚国立大学); University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X^2-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X^2-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging. Project website at: this https URL.
zh
[CV-4] HS-SLAM: Hybrid Representation with Structural Supervision for Improved Dense SLAM ICRA2025
【速读】:该论文致力于解决基于NeRF的SLAM在处理动态场景或被遗忘场景时,存在的场景表示不足、结构信息捕捉不充分以及全局一致性难以维持的问题。论文的关键解决方案包括:提出一种混合编码网络(Hybrid Encoding Network),结合哈希网格(hash-grid)、三平面(tri-planes)和单体块(one-blob)的优势,以提升场景表示能力并增强重建的完整性和平滑性;引入结构监督机制,通过采样非局部像素块而非单条光线来更好地捕获场景结构;并通过主动全局束调整(Bundle Adjustment, BA)消除相机漂移并减轻累积误差,确保全局一致性。这些方法显著提升了跟踪与重建的精度,同时保持了机器人应用所需的效率。
链接: https://arxiv.org/abs/2503.21778
作者: Ziren Gong,Fabio Tosi,Youmin Zhang,Stefano Mattoccia,Matteo Poggi
机构: University of Bologna (博洛尼亚大学); Rock Universe (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2025. Project Page: this https URL
点击查看摘要
Abstract:NeRF-based SLAM has recently achieved promising results in tracking and reconstruction. However, existing methods face challenges in providing sufficient scene representation, capturing structural information, and maintaining global consistency in scenes emerging significant movement or being forgotten. To this end, we present HS-SLAM to tackle these problems. To enhance scene representation capacity, we propose a hybrid encoding network that combines the complementary strengths of hash-grid, tri-planes, and one-blob, improving the completeness and smoothness of reconstruction. Additionally, we introduce structural supervision by sampling patches of non-local pixels rather than individual rays to better capture the scene structure. To ensure global consistency, we implement an active global bundle adjustment (BA) to eliminate camera drifts and mitigate accumulative errors. Experimental results demonstrate that HS-SLAM outperforms the baselines in tracking and reconstruction accuracy while maintaining the efficiency required for robotics.
zh
[CV-5] Test-Time Visual In-Context Tuning CVPR2025
【速读】:该论文旨在解决现有视觉上下文学习(Visual In-Context Learning, VICL)范式在分布偏移(distribution shifts)下泛化能力较差的问题。论文提出了一种名为测试时视觉上下文微调(Test-Time Visual In-Context Tuning, VICT)的方法,能够在单个测试样本的情况下动态调整VICL模型以适应新场景。关键在于通过翻转任务提示与测试样本的角色,并利用循环一致性损失(cycle consistency loss)重建原始任务提示输出,从而确保模型能够感知新的测试分布。论文表明,这种方法显著提升了VICL模型在未见过的新领域中的泛化能力,并展示了其在未知任务上的潜在应用价值。
链接: https://arxiv.org/abs/2503.21777
作者: Jiahao Xie,Alessio Tonioni,Nathalie Rauschmayr,Federico Tombari,Bernt Schiele
机构: Max Planck Institute for Informatics (马克斯·普朗克计算机科学研究所); VIA Research Center (翻译未知); Google (谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2025. Code: this https URL
点击查看摘要
Abstract:Visual in-context learning (VICL), as a new paradigm in computer vision, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. While effective, the existing VICL paradigm exhibits poor generalizability under distribution shifts. In this work, we propose test-time Visual In-Context Tuning (VICT), a method that can adapt VICL models on the fly with a single test sample. Specifically, we flip the role between the task prompts and the test sample and use a cycle consistency loss to reconstruct the original task prompt output. Our key insight is that a model should be aware of a new test distribution if it can successfully recover the original task prompts. Extensive experiments on six representative vision tasks ranging from high-level visual understanding to low-level image processing, with 15 common corruptions, demonstrate that our VICT can improve the generalizability of VICL to unseen new domains. In addition, we show the potential of applying VICT for unseen tasks at test time. Code: this https URL.
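下面给出测试时循环一致性微调「一步更新」的最小 PyTorch 示意(非论文官方实现):先用任务提示预测测试样本,再交换提示与测试样本的角色去重建原提示输出,并以 MSE 作为循环一致性损失;模型接口、损失形式与 ToyVICL 占位模型均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVICL(nn.Module):
    """占位模型:把(提示输入, 提示输出, 查询图)三张图拼接后做一次卷积,仅用于演示流程。"""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(9, 3, kernel_size=3, padding=1)

    def forward(self, prompt_in, prompt_out, query):
        return self.net(torch.cat([prompt_in, prompt_out, query], dim=1))

def vict_step(model, prompt_in, prompt_out, test_img, lr=1e-4):
    """一步测试时更新:正向预测测试样本 -> 角色交换重建提示输出 -> 循环一致性损失。"""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    pred_test = model(prompt_in, prompt_out, test_img)      # 提示 -> 测试预测
    recon_prompt = model(test_img, pred_test, prompt_in)    # 角色交换后的反向重建
    loss = F.mse_loss(recon_prompt, prompt_out)             # 循环一致性损失
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

imgs = [torch.rand(1, 3, 32, 32) for _ in range(3)]
print(vict_step(ToyVICL(), *imgs))
```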
zh
[CV-6] Video-R1: Reinforcing Video Reasoning in MLLM s
【速读】:该论文旨在解决通过强化学习(Reinforcement Learning, RL)范式在多模态大语言模型(Multimodal Large Language Models, MLLMs)中系统性地激发视频推理能力的问题。论文面临的两个主要挑战是:(i) 视频推理中缺乏时间建模方法,以及(ii) 高质量视频推理数据的稀缺性。为了解决这些问题,论文的关键创新包括提出T-GRPO算法,该算法鼓励模型利用视频中的时间信息进行推理;同时,通过将高质量的图像推理数据引入训练过程来补充视频数据的不足。实验结果表明,Video-R1在多个视频推理基准测试(如VideoMMMU、VSI-Bench)及通用视频基准测试(如MVBench、TempCompass)中取得了显著性能提升,特别是Video-R1-7B在VSI-Bench上的空间推理准确率达到35.8%,超过了商用专有模型GPT-4o。
链接: https://arxiv.org/abs/2503.21776
作者: Kaituo Feng,Kaixiong Gong,Bohao Li,Zonghao Guo,Yibing Wang,Tianshuo Peng,Benyou Wang,Xiangyu Yue
机构: CUHK MMLab (香港中文大学多媒体实验室); CUHK (SZ) (香港中文大学(深圳)); Tsinghua University (清华大学); UCAS (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Inspired by DeepSeek-R1’s success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for eliciting video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 35.8% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All codes, models, data are released.
zh
[CV-7] Optimal Stepsize for Diffusion Sampling
【速读】:该论文试图解决扩散模型在采样过程中因次优步长离散化导致计算开销大的问题。论文的关键在于提出了一种名为“最优步长蒸馏(Optimal Stepsize Distillation)”的方法,这是一种基于动态规划的框架,通过从参考轨迹中蒸馏知识来提取理论上最优的步长调度方案。该方法将步长优化重新表述为递归误差最小化问题,并通过利用最优子结构保证全局离散化界限。关键之处在于,蒸馏得到的步长调度方案在不同架构、常微分方程(ODE)求解器以及噪声调度下表现出较强的鲁棒性。实验结果表明,该方法可将文本到图像的生成加速10倍,同时在GenEval基准测试中保持99.4%的性能。
链接: https://arxiv.org/abs/2503.21774
作者: Jianning Pei,Han Hu,Shuyang Gu
机构: University Chinese Academy of Science (中国科学院大学); Tencent Hunyuan Research (腾讯混元研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Diffusion models achieve remarkable generation quality but suffer from computational intensive sampling due to suboptimal step discretization. While existing works focus on optimizing denoising directions, we address the principled design of stepsize schedules. This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories. By reformulating stepsize optimization as recursive error minimization, our method guarantees global discretization bounds through optimal substructure exploitation. Crucially, the distilled schedules demonstrate strong robustness across architectures, ODE solvers, and noise schedules. Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval. Our code is available at this https URL.
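下面给出「以动态规划在固定步数下最小化累计离散化误差」这一思路的最小 NumPy 示意(非论文从参考轨迹蒸馏误差的流程):给定任意一步跳跃误差矩阵,DP 求出总误差最小的时间点序列;示例中的误差定义为简单的平方惩罚,仅作演示假设。

```python
import numpy as np

def optimal_schedule(cost, num_steps):
    """cost[i, j] 为从参考时间点 i 一步跳到 j (j > i) 的误差;
    返回恰好 num_steps 步、总误差最小的时间点索引序列及其总误差。"""
    n = cost.shape[0]                          # 参考网格大小,索引 0 为起点,n-1 为终点
    INF = float("inf")
    dp = np.full((num_steps + 1, n), INF)      # dp[k, j]: 用 k 步到达 j 的最小累计误差
    parent = np.full((num_steps + 1, n), -1, dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, n):
            prev = dp[k - 1, :j] + cost[:j, j]
            best = int(np.argmin(prev))
            if prev[best] < dp[k, j]:
                dp[k, j], parent[k, j] = prev[best], best
    path, j = [n - 1], n - 1                   # 回溯得到时间点序列
    for k in range(num_steps, 0, -1):
        j = parent[k, j]
        path.append(j)
    return path[::-1], dp[num_steps, n - 1]

# 玩具代价:跳得越远误差越大(平方惩罚)
n = 50
cost = np.array([[((j - i) / n) ** 2 if j > i else np.inf for j in range(n)]
                 for i in range(n)])
print(optimal_schedule(cost, num_steps=8))
```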
zh
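【示例】摘要把步长优化表述为“递归误差最小化”的动态规划问题。下面给出一个示意性的 DP 片段:在细粒度参考时间步上选出 K 段跳跃使累计代价最小。其中 step_cost 用步长平方充当代价函数,仅作演示,真实方法的代价来自对参考轨迹的蒸馏:

```python
import numpy as np

def distill_stepsize_schedule(ref_ts, K, step_cost):
    """在参考轨迹 ref_ts(细粒度时间步,含首尾)上用动态规划
    选出 K 段跳跃,使累计离散化代价最小(假设代价可按段相加)。"""
    N = len(ref_ts)
    INF = float("inf")
    dp = np.full((K + 1, N), INF)
    parent = np.full((K + 1, N), -1, dtype=int)
    dp[0][0] = 0.0
    for k in range(1, K + 1):
        for j in range(1, N):
            for i in range(j):
                if dp[k - 1][i] == INF:
                    continue
                c = dp[k - 1][i] + step_cost(ref_ts[i], ref_ts[j])
                if c < dp[k][j]:
                    dp[k][j], parent[k][j] = c, i
    # 回溯得到被选中的时间步
    sched, j = [N - 1], N - 1
    for k in range(K, 0, -1):
        j = parent[k][j]
        sched.append(j)
    return [ref_ts[i] for i in reversed(sched)]

# 假设性代价:用步长平方近似每段的离散化误差(真实方法从参考轨迹蒸馏)
cost = lambda a, b: (b - a) ** 2
print(distill_stepsize_schedule(list(range(0, 101, 5)), K=4, step_cost=cost))
```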
[CV-8] LOCORE: Image Re-ranking with Long-Context Sequence Modeling CVPR2025
【速读】:该论文试图解决图像检索中长上下文依赖建模的问题。现有方法通常采用基于局部描述符的成对相似度估计或基于全局描述符的列表级重排序,但这些方法在处理长上下文信息时存在局限性。论文的关键创新在于提出LOCORE(Long-Context Re-ranker),这是一种首次利用局部描述符进行列表级重排序的方法。通过结合高效的长上下文序列模型,LOCORE能够有效地捕获查询图像与候选图像集合之间的局部描述符级别的依赖关系。此外,为了应对序列模型在处理长列表时的上下文长度限制,文中采用了定制的滑动窗口策略。实验结果表明,LOCORE在多个标准图像检索基准数据集上实现了优于其他重排序器的性能,同时保持了与成对局部描述符重排序器相当的延迟。
链接: https://arxiv.org/abs/2503.21772
作者: Zilin Xiao,Pavel Suma,Ayush Sachdeva,Hao-Jen Wang,Giorgos Kordopatis-Zilos,Giorgos Tolias,Vicente Ordonez
机构: Rice University (莱斯大学); VRG, FEE, Czech Technical University in Prague (捷克布拉格工业大学视觉识别与图形学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025
点击查看摘要
Abstract:We introduce LOCORE, Long-Context Re-ranker, a model that takes as input local descriptors corresponding to an image query and a list of gallery images and outputs similarity scores between the query and each gallery image. This model is used for image retrieval, where typically a first ranking is performed with an efficient similarity measure, and then a shortlist of top-ranked images is re-ranked based on a more fine-grained similarity measure. Compared to existing methods that perform pair-wise similarity estimation with local descriptors or list-wise re-ranking with global descriptors, LOCORE is the first method to perform list-wise re-ranking with local descriptors. To achieve this, we leverage efficient long-context sequence models to effectively capture the dependencies between query and gallery images at the local-descriptor level. During testing, we process long shortlists with a sliding window strategy that is tailored to overcome the context size limitations of sequence models. Our approach achieves superior performance compared with other re-rankers on established image retrieval benchmarks of landmarks (ROxf and RPar), products (SOP), fashion items (In-Shop), and bird species (CUB-200) while having comparable latency to the pair-wise local descriptor re-rankers.
zh
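【示例】针对摘要中“用滑动窗口克服序列模型上下文长度限制”的做法,下面给出一个示意性实现:把长候选列表切成带重叠的窗口分别打分,再对每张图在所有窗口中的得分取平均。score_window 这里用随机打分器代替真实的 LOCORE 模型,窗口大小与聚合方式均为演示假设:

```python
import numpy as np

def sliding_window_rerank(query, shortlist, score_window, win=32, stride=16):
    """示意性的滑动窗口重排:将过长的候选列表切成带重叠的窗口,
    分别送入长上下文打分器,再对每张图在所有窗口中的得分取平均。"""
    n = len(shortlist)
    scores = np.zeros(n)
    counts = np.zeros(n)
    start = 0
    while True:
        end = min(start + win, n)
        idx = list(range(start, end))
        s = score_window(query, [shortlist[i] for i in idx])  # 返回与窗口等长的得分
        scores[idx] += s
        counts[idx] += 1
        if end == n:
            break
        start += stride
    scores /= np.maximum(counts, 1)
    order = np.argsort(-scores)
    return [shortlist[i] for i in order], scores[order]

# 假设性打分器:用随机打分代替真实模型,仅演示窗口机制
rng = np.random.default_rng(0)
dummy_scorer = lambda q, imgs: rng.random(len(imgs))
reranked, s = sliding_window_rerank("query.jpg", [f"g{i}.jpg" for i in range(100)], dummy_scorer)
print(reranked[:5])
```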
[CV-9] A Unified Image-Dense Annotation Generation Model for Underwater Scenes CVPR2025
【速读】:本文旨在解决水下密集预测任务(尤其是深度估计和语义分割)中高质量大规模数据集稀缺的问题,由于水下环境的复杂性和高昂的数据采集成本,带有密集标注的水下数据集仍然匮乏。为应对这一挑战,论文提出了一种统一的文本到图像及密集标注生成方法(TIDE)。其关键在于仅利用文本作为输入即可同时生成逼真的水下图像及其多类高度一致的密集标注。具体而言,通过引入隐式布局共享机制(ILS)和时间自适应归一化(TAN)的跨模态交互方法,在单一模型内统一实现文本到图像以及文本到密集标注的生成,并优化图像与标注之间的一致性。实验基于合成的大规模水下数据集验证了该方法在提升现有水下密集预测模型性能方面的能力,有效缓解了标注数据不足的问题。我们期望此方法能够为其他领域缓解数据稀缺问题提供新思路。代码已开源,详见 https://github.com/HongkLin/TIDE。
链接: https://arxiv.org/abs/2503.21771
作者: Hongkai Lin,Dingkang Liang,Zhenghao Qi,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025. The code is available at https://github.com/HongkLin/TIDE
点击查看摘要
Abstract:Underwater dense prediction, especially depth estimation and semantic segmentation, is crucial for gaining a comprehensive understanding of underwater scenes. Nevertheless, high-quality and large-scale underwater datasets with dense annotations remain scarce because of the complex environment and the exorbitant data collection costs. This paper proposes a unified Text-to-Image and DEnse annotation generation method (TIDE) for underwater scenes. It relies solely on text as input to simultaneously generate realistic underwater images and multiple highly consistent dense annotations. Specifically, we unify the generation of text-to-image and text-to-dense annotations within a single model. The Implicit Layout Sharing mechanism (ILS) and cross-modal interaction method called Time Adaptive Normalization (TAN) are introduced to jointly optimize the consistency between image and dense annotations. We synthesize a large-scale underwater dataset using TIDE to validate the effectiveness of our method in underwater dense prediction tasks. The results demonstrate that our method effectively improves the performance of existing underwater dense prediction models and mitigates the scarcity of underwater data with dense annotations. We hope our method can offer new perspectives on alleviating data scarcity issues in other fields. The code is available at https://github.com/HongkLin/TIDE.
zh
[CV-10] Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
【速读】:该论文试图解决场景理解中的新任务——Visual Jenga,旨在通过系统性地移除单个图像中的对象,直到仅剩背景,揭示场景元素之间的内在物理和几何依赖关系。解决方案的关键在于利用场景内对象之间的不对称双向关系,并结合大型图像修复(inpainting)模型生成反事实集(counterfactuals)以量化这种不对称性,从而实现无需训练的数据驱动方法。
链接: https://arxiv.org/abs/2503.21770
作者: Anand Bhattad,Konpat Preechakul,Alexei A. Efros
机构: Toyota Technological Institute at Chicago (丰田工业大学芝加哥分校); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL
点击查看摘要
Abstract:This paper proposes a novel scene understanding task called Visual Jenga. Drawing inspiration from the game Jenga, the proposed task involves progressively removing objects from a single image until only the background remains. Just as Jenga players must understand structural dependencies to maintain tower stability, our task reveals the intrinsic relationships between scene elements by systematically exploring which objects can be removed while preserving scene coherence in both physical and geometric sense. As a starting point for tackling the Visual Jenga task, we propose a simple, data-driven, training-free approach that is surprisingly effective on a range of real-world images. The principle behind our approach is to utilize the asymmetry in the pairwise relationships between objects within a scene and employ a large inpainting model to generate a set of counterfactuals to quantify the asymmetry.
zh
[CV-11] Semantic Consistent Language Gaussian Splatting for Point-Level Open-vocabulary Querying
【速读】:本文旨在解决基于文本查询在3D高斯点云(3D Gaussian Splatting)中的开放词汇查询问题,即从3D高斯表示中识别与给定文本查询语义相关的区域。传统方法如LangSplat通过检索2D渲染图上的分割掩膜来完成此任务,而更近期的工作如OpenGaussian则引入了点级查询,直接选择3D高斯子集。本文提出的点级查询方法基于LangSplat框架进行了改进,其关键是:(a) 利用Segment Anything Model 2 (SAM2) 的masklets建立语义一致的地面真值以蒸馏语言相关的高斯分布;(b) 提出一种新颖的两步查询方法,首先检索蒸馏后的地面真值,然后利用该真值查询单个高斯点。实验结果表明,该方法在三个基准数据集上的性能优于现有技术,例如在3D-OVS数据集上mIoU提升了+20.42。
链接: https://arxiv.org/abs/2503.21767
作者: Hairong Yin,Huangying Zhan,Yi Xu,Raymond A. Yeh
机构: Department Computer Science, Purdue University (普渡大学); Goertek Alpha Labs (歌尔 alpha 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Open-vocabulary querying in 3D Gaussian Splatting aims to identify semantically relevant regions within a 3D Gaussian representation based on a given text query. Prior work, such as LangSplat, addressed this task by retrieving these regions in the form of segmentation masks on 2D renderings. More recently, OpenGaussian introduced point-level querying, which directly selects a subset of 3D Gaussians. In this work, we propose a point-level querying method that builds upon LangSplat’s framework. Our approach improves the framework in two key ways: (a) we leverage masklets from the Segment Anything Model 2 (SAM2) to establish semantic consistent ground-truth for distilling the language Gaussians; (b) we introduce a novel two-step querying approach that first retrieves the distilled ground-truth and subsequently uses the ground-truth to query the individual Gaussians. Experimental evaluations on three benchmark datasets demonstrate that the proposed method achieves better performance compared to state-of-the-art approaches. For instance, our method achieves an mIoU improvement of +20.42 on the 3D-OVS dataset.
zh
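【示例】下面用一个假设性的小片段示意“两步查询”:先用文本特征在蒸馏出的地面真值原型中检索最匹配的一项,再用它对每个 3D 高斯的语言特征逐一打分并按阈值筛选。特征维度、阈值与打分方式均为演示假设,非论文原始设定:

```python
import numpy as np

def two_step_query(text_feat, prototype_feats, gaussian_feats, thresh=0.8):
    """示意性的两步查询:第一步用文本特征检索最相似的“地面真值”原型,
    第二步用该原型对每个 3D 高斯的语言特征打分并筛选。"""
    norm = lambda a: a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-12)
    t, P, G = norm(text_feat), norm(prototype_feats), norm(gaussian_feats)
    proto = P[np.argmax(P @ t)]          # 第一步:检索最匹配的原型
    scores = G @ proto                   # 第二步:逐高斯与原型比相似度
    return np.nonzero(scores > thresh)[0], scores

rng = np.random.default_rng(0)
text = rng.normal(size=16)               # 演示用随机特征,真实特征来自蒸馏
prototypes = rng.normal(size=(5, 16))
gaussians = rng.normal(size=(1000, 16))
idx, s = two_step_query(text, prototypes, gaussians, thresh=0.3)
print(len(idx))
```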
[CV-12] Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence CVPR2025
【速读】:本文旨在解决三维形状对应(3D Shape Correspondence)在复杂真实场景中的挑战,特别是针对非等距形状差异(non-isometric shape discrepancies)。现有的功能映射(Functional Map)方法虽在受控条件下有效,但在实际复杂场景中表现不稳定。为应对这一问题,论文重新审视基于配准(registration-for-correspondence)的方法,并挖掘其潜力以实现更稳定的形状对应估计。关键在于引入Stable-SCore框架,它首先利用一个经过改造的二维人物对应基础模型确保可靠的二维映射,然后通过提出语义流引导配准(Semantic Flow Guided Registration)方法,利用二维对应指导网格变形。这种方法克服了传统基于配准方法中常见的不稳定形变及对精确预对齐或高质量初始三维对应的依赖,从而显著提升了在具有挑战性场景下的性能。
链接: https://arxiv.org/abs/2503.21766
作者: Haolin Liu,Xiaohang Zhan,Zizheng Yan,Zhongjin Luo,Yuxin Wen,Xiaoguang Han
机构: FNii (未来智联网络研究院), CUHKSZ (香港中文大学深圳分校); Tencent (腾讯); Tencent-Hunyuan3D (腾讯混元3D); SSE (理工学院), CUHKSZ (香港中文大学深圳分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2025. Homepage: this https URL
点击查看摘要
Abstract:Establishing character shape correspondence is a critical and fundamental task in computer vision and graphics, with diverse applications including re-topology, attribute transfer, and shape interpolation. Current dominant functional map methods, while effective in controlled scenarios, struggle in real situations with more complex challenges such as non-isometric shape discrepancies. In response, we revisit registration-for-correspondence methods and tap their potential for more stable shape correspondence estimation. To overcome their common issues including unstable deformations and the necessity for careful pre-alignment or high-quality initial 3D correspondences, we introduce Stable-SCore: A Stable Registration-based Framework for 3D Shape Correspondence. We first re-purpose a foundation model for 2D character correspondence that ensures reliable and stable 2D mappings. Crucially, we propose a novel Semantic Flow Guided Registration approach that leverages 2D correspondence to guide mesh deformations. Our framework significantly surpasses existing methods in challenging scenarios, and brings possibilities for a wide array of real applications, as demonstrated in our results.
zh
[CV-13] Exploring the Evolution of Physics Cognition in Video Generation: A Survey
【速读】:该论文旨在解决视频生成领域中物理认知不足的问题,即当前生成内容在视觉上逼真但违背物理定律的现象。论文的关键在于提出一种从认知科学视角出发的三层次分类法:1)生成的基本图式感知;2)被动的物理知识认知;3)主动的世界模拟认知,涵盖最新方法、经典范式及基准。通过系统性总结架构设计与应用,论文强调了该领域的核心挑战,并指出了未来研究的潜在方向,以推动生成模型从“视觉模仿”向“类人物理理解”的转变。
链接: https://arxiv.org/abs/2503.21765
作者: Minghui Lin,Xiang Wang,Yishan Wang,Shu Wang,Fengqi Dai,Pengxiang Ding,Cunxiang Wang,Zhengrong Zuo,Nong Sang,Siteng Huang,Donglin Wang
机构: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (华中科技大学); School of Engineering, Westlake University (西湖大学), Hangzhou, China; School of Control Science and Engineering, Shandong University (山东大学), Jinan, China; Tsinghua University (清华大学), Beijing, China; Zhejiang University (浙江大学), Hangzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: A comprehensive list of papers studied in this survey is available at this https URL
点击查看摘要
Abstract:Recent advancements in video generation have witnessed significant progress, especially with the rapid advancement of diffusion models. Despite this, their deficiencies in physical cognition have gradually received widespread attention - generated content often violates the fundamental laws of physics, falling into the dilemma of ''visual realism but physical absurdity". Researchers began to increasingly recognize the importance of physical fidelity in video generation and attempted to integrate heuristic physical cognition such as motion representations and physical knowledge into generative systems to simulate real-world dynamic scenarios. Considering the lack of a systematic overview in this field, this survey aims to provide a comprehensive summary of architecture designs and their applications to fill this gap. Specifically, we discuss and organize the evolutionary process of physical cognition in video generation from a cognitive science perspective, while proposing a three-tier taxonomy: 1) basic schema perception for generation, 2) passive cognition of physical knowledge for generation, and 3) active cognition for world simulation, encompassing state-of-the-art methods, classical paradigms, and benchmarks. Subsequently, we emphasize the inherent key challenges in this domain and delineate potential pathways for future research, contributing to advancing the frontiers of discussion in both academia and industry. Through structured review and interdisciplinary analysis, this survey aims to provide directional guidance for developing interpretable, controllable, and physically consistent video generation paradigms, thereby propelling generative models from the stage of ‘‘visual mimicry’’ towards a new phase of ‘‘human-like physical comprehension’’.
zh
[CV-14] Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video CVPR2025
【速读】:该论文试图解决从随意视频中理解动态场景的问题。现有大规模预训练视觉基础模型(如视觉-语言、视频深度预测、运动跟踪和分割模型)虽具有潜力,但训练单一模型以实现全面的四维(4D)理解仍具挑战性。论文的关键解决方案是提出Uni4D,这是一种多阶段优化框架,通过利用多个预训练模型推进动态三维建模,包括静态/动态重建、相机姿态估计以及密集三维运动跟踪。Uni4D无需重新训练或微调即可达到最先进的动态四维建模性能,突显了重用视觉基础模型进行4D理解的有效性。
链接: https://arxiv.org/abs/2503.21761
作者: David Yifan Yao,Albert J. Zhai,Shenlong Wang
机构: University of Illinois at Urbana-Champaign (伊利诺伊大学香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: CVPR 2025. Project page (with code): this https URL
点击查看摘要
Abstract:This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
zh
[CV-15] Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
【速读】:该论文试图解决文本到图像(Text-to-Image, T2I)生成任务中的高效性和跨模态交互问题。解决方案的关键在于两个方面:(1) 统一架构(Unified Next-DiT),它将文本和图像标记视为联合序列,实现自然的跨模态交互,并支持任务扩展;同时引入了一个专门设计用于T2I任务的统一描述系统(Unified Captioner, UniCap),以生成高质量且语义对齐的文本-图像配对,加速收敛并增强提示一致性。(2) 效率提升,通过多阶段渐进训练策略以及推理加速技术,在不牺牲图像质量的前提下提高模型效率。这些方法使得Lumina-Image 2.0在参数量仅为2.6B的情况下表现出色,验证了其可扩展性和设计合理性。
链接: https://arxiv.org/abs/2503.21758
作者: Qi Qin,Le Zhuo,Yi Xin,Ruoyi Du,Zhen Li,Bin Fu,Yiting Lu,Jiakang Yuan,Xinyue Li,Dongyang Liu,Xiangyang Zhu,Manyuan Zhang,Will Beddow,Erwann Millon,Victor Perez,Wenhai Wang,Conghui He,Bo Zhang,Xiaohong Liu,Hongsheng Li,Yu Qiao,Chang Xu,Peng Gao
机构: Shanghai AI Laboratory (上海人工智能实验室); The University of Sydney (悉尼大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Innovation Institute (上海创新研究院); Shanghai Jiao Tong University (上海交通大学); Krea AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech Report, 21 pages, 12 figures
点击查看摘要
Abstract:We introduce Lumina-Image 2.0, an advanced text-to-image generation framework that achieves significant progress compared to previous work, Lumina-Next. Lumina-Image 2.0 is built upon two key principles: (1) Unification - it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), specifically designed for T2I generation tasks. UniCap excels at generating comprehensive and accurate captions, accelerating convergence and enhancing prompt adherence. (2) Efficiency - to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies and introduce inference acceleration techniques without compromising image quality. Extensive evaluations on academic benchmarks and public text-to-image arenas show that Lumina-Image 2.0 delivers strong performances even with only 2.6B parameters, highlighting its scalability and design efficiency. We have released our training details, code, and models at this https URL.
zh
[CV-16] Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
【速读】:该论文旨在解决大型视觉语言模型(Large Vision Language Model, LVLM)中视觉信息压缩的问题,目标是生成一种同时适用于生成式任务和判别式任务的表示,且保持接近无损的同时具备存储高效性。论文提出了一种名为Fwd2Bot的新颖压缩方法,其核心在于利用LVLM自身以任务无关的方式压缩视觉信息。解决方案的关键在于“双前向传递”训练策略:第一阶段通过将视觉信息凝练为少量摘要标记来创建瓶颈;第二阶段则使用相同的语言模型处理语言指令与这些摘要标记,替代原始图像标记。此外,训练过程中引入了两种损失函数——第二阶段的自回归损失用于直接优化压缩目标,第一阶段的对比损失进一步增强表征能力,尤其针对判别任务。训练还通过特定阶段适配器得到了增强。总体而言,Fwd2Bot实现了高度信息丰富的压缩表示,不仅在生成任务中提供了两倍于现有技术的压缩率,同时保持生成能力,还在图像检索和组成性判别任务中达到了新的性能基准。
链接: https://arxiv.org/abs/2503.21757
作者: Adrian Bulat,Yassine Ouali,Georgios Tzimiropoulos
机构: Samsung AI Cambridge (三星人工智能剑桥); Technical University of Iasi (雅西技术大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:In this work, we aim to compress the vision tokens of a Large Vision Language Model (LVLM) into a representation that is simultaneously suitable for (a) generative and (b) discriminative tasks, (c) is nearly lossless, and (d) is storage-efficient. We propose a novel compression approach, called Fwd2Bot, that uses the LVLM itself to compress the visual information in a task-agnostic manner. At the core of Fwd2bot there exists a “double-forward pass” training strategy, whereby, during the first forward pass, the LLM (of the LVLM) creates a bottleneck by condensing the visual information into a small number of summary tokens. Then, using the same LLM, the second forward pass processes the language instruction(s) alongside the summary tokens, used as a direct replacement for the image ones. The training signal is provided by two losses: an autoregressive one applied after the second pass that provides a direct optimization objective for compression, and a contrastive loss, applied after the first pass, that further boosts the representation strength, especially for discriminative tasks. The training is further enhanced by stage-specific adapters. We accompany the proposed method by an in-depth ablation study. Overall, Fwd2Bot results in highly-informative compressed representations suitable for both generative and discriminative tasks. For generative tasks, we offer a 2x higher compression rate without compromising the generative capabilities, setting a new state-of-the-art result. For discriminative tasks, we set a new state-of-the-art on image retrieval and compositionality.
zh
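【示例】为说明“双前向传递”的思路,下面用一个玩具级 PyTorch 模块示意:第一次前向把视觉 token 压缩为少量 summary token,第二次前向用 summary token 替代图像 token 与文本一起计算自回归损失。这里用小型 Transformer 编码器代替真实的 LVLM,维度与结构均为演示假设:

```python
import torch, torch.nn as nn, torch.nn.functional as F

class ToyDoubleForward(nn.Module):
    """极简示意:用同一个骨干网络做两次前向——第一次把视觉 token
    压缩成少量 summary token,第二次用 summary token 替代图像 token。"""
    def __init__(self, d=64, n_summary=4, vocab=1000):
        super().__init__()
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
        self.summary_queries = nn.Parameter(torch.randn(1, n_summary, d))
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, vis_tokens, text_embeds, text_targets):
        B = vis_tokens.size(0)
        # 第一次前向:summary query 与视觉 token 共同编码,取出压缩表示
        q = self.summary_queries.expand(B, -1, -1)
        h1 = self.backbone(torch.cat([q, vis_tokens], dim=1))
        summary = h1[:, : q.size(1)]                      # 压缩后的视觉表示
        # 第二次前向:summary token 直接替换图像 token,与文本一起处理
        h2 = self.backbone(torch.cat([summary, text_embeds], dim=1))
        logits = self.lm_head(h2[:, summary.size(1):])
        ar_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_targets.reshape(-1))
        return summary, ar_loss

model = ToyDoubleForward()
vis = torch.randn(2, 16, 64); txt = torch.randn(2, 8, 64)
tgt = torch.randint(0, 1000, (2, 8))
summary, loss = model(vis, txt, tgt)
print(summary.shape, float(loss))
```

真实方法中第一阶段还叠加对比损失并配合阶段专用适配器,此处仅保留自回归部分作骨架。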
[CV-17] VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
【速读】:该论文试图解决视频生成模型在追求视觉逼真度的同时,难以确保生成内容在物理规律、常识推理、解剖学正确性和构成完整性等内在真实性(intrinsic faithfulness)方面表现不足的问题。论文的关键解决方案是引入VBench-2.0,这是一个新一代的基准评估工具,专门用于自动评估视频生成模型的内在真实性。VBench-2.0通过五个关键维度——人类保真度(Human Fidelity)、可控性(Controllability)、创造力(Creativity)、物理规律(Physics)和常识(Commonsense),以及每个维度下的细粒度能力,综合评估模型的表现。该框架结合了通用评估方法(如最先进的视觉语言模型VLMs和大型语言模型LLMs)与专门技术(如针对视频生成设计的异常检测方法),并通过广泛的标注工作确保评估结果与人类判断的一致性。通过超越表层的真实性,VBench-2.0旨在为下一代视频生成模型设定新的标准,推动其向更高水平的内在真实性迈进。
链接: https://arxiv.org/abs/2503.21755
作者: Dian Zheng,Ziqi Huang,Hongbo Liu,Kai Zou,Yinan He,Fan Zhang,Yuanhan Zhang,Jingwen He,Wei-Shi Zheng,Yu Qiao,Ziwei Liu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); S-Lab, Nanyang Technological University (南洋理工大学S-Lab); Sun Yat-Sen University (中山大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Equal contributions from first two authors. Project page: this https URL Code: this https URL
点击查看摘要
Abstract:Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focus on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real “world models” through video generation, the next frontier lies in intrinsic faithfulness to ensure that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored for individual dimensions, our evaluation framework integrates generalists such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models in pursuit of intrinsic faithfulness.
zh
[CV-18] Reconstructing Humans with a Biomechanically Accurate Skeleton CVPR2025
【速读】:该论文旨在解决从单目图像重建 biomechanically accurate(生物力学精确)3D人体的问题。解决方案的关键在于提出了一种基于 Transformer 的方法,该方法以图像为输入并估计模型参数。由于缺乏针对此任务的训练数据,研究者构建了一个管道来为单张图像生成伪真实标签的模型参数,并设计了一种迭代优化这些伪标签的训练流程。此外,与现有最先进的 3D 人体网格恢复方法相比,该方法在极端 3D 姿态和视角设置下表现出显著优越性,同时避免了以往方法中常见的关节角度限制违反问题,通过利用生物力学合理的自由度实现更自然的关节旋转估计。
链接: https://arxiv.org/abs/2503.21751
作者: Yan Xia,Xiaowei Zhou,Etienne Vouga,Qixing Huang,Georgios Pavlakos
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2025. Project Webpage: this https URL
点击查看摘要
Abstract:In this paper, we introduce a method for reconstructing 3D humans from a single image using a biomechanically accurate skeleton model. To achieve this, we train a transformer that takes an image as input and estimates the parameters of the model. Due to the lack of training data for this task, we build a pipeline to produce pseudo ground truth model parameters for single images and implement a training procedure that iteratively refines these pseudo labels. Compared to state-of-the-art methods for 3D human mesh recovery, our model achieves competitive performance on standard benchmarks, while it significantly outperforms them in settings with extreme 3D poses and viewpoints. Additionally, we show that previous reconstruction methods frequently violate joint angle limits, leading to unnatural rotations. In contrast, our approach leverages the biomechanically plausible degrees of freedom making more realistic joint rotation estimates. We validate our approach across multiple human pose estimation benchmarks. We make the code, models and data available at: this https URL
zh
[CV-19] LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
【速读】:该论文旨在解决文本到图像合成领域中提示表达力与生成图像保真度之间的差距问题。为实现这一目标,论文的关键在于采用以数据为中心的方法论,构建了一个高质量的数据合成管道,基于Deepseek-R1生成了LeX-10K数据集,包含10,000张高分辨率(1024 × 1024像素)且美学优化的图像。此外,开发了LeX-Enhancer作为强大的提示增强模型,并训练了两个顶级性能的文本到图像生成模型LeX-FLUX和LeX-Lumina。同时,引入了LeX-Bench评估基准及创新的Pairwise Normalized Edit Distance (PNED)指标,用于全面评估视觉文本生成的质量。实验结果表明,所提出的方法在多个方面显著提升了生成效果。
链接: https://arxiv.org/abs/2503.21749
作者: Shitian Zhao,Qilong Wu,Xinyue Li,Bo Zhang,Ming Li,Qi Qin,Dongyang Liu,Kaipeng Zhang,Hongsheng Li,Yu Qiao,Peng Gao,Bin Fu,Zhen Li
机构: Shanghai AI Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024 \times 1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our codes, models, datasets, and demo are publicly available.
zh
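【示例】论文提出的 PNED(Pairwise Normalized Edit Distance)用于评估视觉文本的准确性,其精确定义以论文为准。下面给出一个基于 Levenshtein 距离的归一化编辑距离小示例,并用“每个真值词取最接近预测词”的朴素配对方式做聚合,仅示意这一类指标的计算思路:

```python
def edit_distance(a: str, b: str) -> int:
    """标准 Levenshtein 距离(单行滚动的动态规划)。"""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return dp[n]

def normalized_edit_distance(a: str, b: str) -> float:
    return edit_distance(a, b) / max(len(a), len(b), 1)

def pned_like(gt_words, pred_words):
    """示意性的成对归一化编辑距离:为每个真值词取与其最接近的预测词的
    归一化编辑距离再取平均(论文中 PNED 的配对与聚合方式可能不同)。"""
    if not gt_words:
        return 0.0
    return sum(min(normalized_edit_distance(g, p) for p in pred_words) if pred_words else 1.0
               for g in gt_words) / len(gt_words)

print(pned_like(["HELLO", "WORLD"], ["HELL0", "WORLD", "EXTRA"]))  # 约 0.1
```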
[CV-20] CTRL-O: Language-Controllable Object-Centric Visual Representation Learning CVPR2025
【速读】:该论文旨在解决现有对象中心表征学习模型缺乏可控性的问题。当前最先进的对象中心模型在复杂真实场景中的物体发现任务上表现出色,但它们无法根据用户需求选择性地表征特定对象,而是基于预设的对象理解进行表征学习。论文的关键解决方案是提出了一种新的方法,通过语言描述来控制槽位(slot)表示,即ConTRoLlable Object-centric 表征学习(CTRL-O)。这种方法实现了复杂真实场景中目标与语言的精确绑定,无需掩码监督即可实现用户引导的对象选择性表征。此外,该方法在文本到图像生成和视觉问答两个下游任务中展示了实例特定的生成能力和优异性能。
链接: https://arxiv.org/abs/2503.21747
作者: Aniket Didolkar,Andrii Zadaianchuk,Rabiul Awal,Maximilian Seitzer,Efstratios Gavves,Aishwarya Agrawal
机构: Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); University of Amsterdam (阿姆斯特丹大学), The Netherlands; University of Tübingen (蒂宾根大学); Archimedes/Athena RC (Archimedes/Athena 研究中心), Greece
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CVPR 2025
点击查看摘要
Abstract:Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called “slots” or “object files”, where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. Next, we apply these controllable slot representations on two downstream vision language tasks: text-to-image generation and visual question answering. The proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.
zh
[CV-21] 3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models
【速读】:该论文旨在解决3D生成领域自动评价方法与人类感知偏好不均衡对齐的问题。当前3D生成技术快速发展,但其评价体系的发展滞后,缺乏能够全面反映人类偏好的大规模多维度数据集。论文的关键解决方案是开发了3DGen-Arena这一集成平台,并通过精心设计的文本和图像提示,从公众用户和专家标注者中收集了大规模的人类偏好数据,构建了3DGen-Bench数据集。基于此数据集,论文进一步训练了一个CLIP-based评分模型(3DGen-Score)和一个MLLM-based自动评估器(3DGen-Eval),创新性地统一了文本到3D和图像到3D生成的质量评估,并结合两者优势形成了完整的自动化评价系统。实验结果表明,所提出的评分模型在预测人类偏好方面表现出色,与现有指标相比具有更高的相关性。
链接: https://arxiv.org/abs/2503.21745
作者: Yuhan Zhang,Mengchen Zhang,Tong Wu,Tengfei Wang,Gordon Wetzstein,Dahua Lin,Ziwei Liu
机构: Fudan University (复旦大学); Zhejiang University (浙江大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Stanford University (斯坦福大学); The Chinese University of Hong Kong (香港中文大学); S-Lab, Nanyang Technological University (南洋理工大学 S-Lab)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. How to keep automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in the field of language and image generation have explored human preferences and showcased respectable fitting ability. However, the 3D domain still lacks such a comprehensive preference dataset over generative models. To mitigate this absence, we develop 3DGen-Arena, an integrated platform in a battle manner. Then, we carefully design diverse text and image prompts and leverage the arena platform to gather human preferences from both public users and expert annotators, resulting in a large-scale multi-dimension human preference dataset 3DGen-Bench. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and a MLLM-based automatic evaluator, 3DGen-Eval. These two models innovatively unify the quality evaluation of text-to-3D and image-to-3D generation, and jointly form our automated evaluation system with their respective strengths. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, exhibiting a superior correlation with human ranks compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster a more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.
zh
[CV-22] SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
【速读】:该论文旨在解决高保真 3D 网格(High-fidelity 3D Meshes)重建中存在的一些挑战,特别是针对具有任意拓扑结构(包括开放表面和复杂内部结构)的模型。传统隐式场方法通常需要代价高昂且会导致细节丢失的封闭转换过程,而其他方法则难以处理高分辨率场景。论文的关键创新在于提出了 SparseFlex,这是一种新颖的稀疏结构等值面表示法,能够直接从渲染损失中以高达 (1024^3) 的分辨率进行可微分网格重建。SparseFlex 结合了 Flexicubes 的准确性与稀疏体素结构的优势,将计算集中在与表面相邻的区域,并有效处理开放表面。其核心突破在于引入了一种基于视锥体感知的分区体素训练策略,该策略在渲染过程中仅激活相关的体素,从而大幅减少了内存消耗并支持高分辨率训练。这一进展首次实现了仅通过渲染监督即可重建网格内部结构的功能。此外,作者构建了一个完整的形状建模管道,通过训练变分自编码器(Variational Autoencoder, VAE)和修正流变换器(Rectified Flow Transformer),实现了高质量 3D 形状的生成。实验结果表明,SparseFlex 在重建精度方面达到了最先进的水平,相较于先前的方法,Chamfer 距离降低了约 82%,F 分数提高了约 88%,同时展示了生成高分辨率、细节丰富的任意拓扑 3D 形状的能力。因此,SparseFlex 通过启用基于渲染损失的高分辨率可微分网格重建与生成,显著推动了 3D 形状表示与建模领域的前沿发展。
链接: https://arxiv.org/abs/2503.21732
作者: Xianglong He,Zi-Xin Zou,Chia-Hao Chen,Yuan-Chen Guo,Ding Liang,Chun Yuan,Wanli Ouyang,Yan-Pei Cao,Yangguang Li
机构: Tsinghua University (清华大学); VAST; The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
点击查看摘要
Abstract:Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to 1024^3 directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.
zh
[CV-23] OccRobNet : Occlusion Robust Network for Accurate 3D Interacting Hand-Object Pose Estimation
【速读】:该论文旨在解决3D手部姿态估计中的遮挡问题,特别是在手与物体交互或双手参与的情况下,这类问题尤为突出。过去的研究较少关注被遮挡区域的信息,而这些区域实际上包含对3D手部姿态估计至关重要的信息。论文的关键在于提出了一种针对输入RGB图像的3D手-物姿态估计的鲁棒且精确的方法。该方法首先利用基于CNN的模型定位手部关节,然后通过提取上下文信息对这些关节进行精化。自注意力Transformer进一步识别特定关节及其所属的手部身份,这有助于模型确定特定关节的手部归属,从而在遮挡区域也能检测到关节。随后,结合手部身份信息的关节通过交叉注意力机制来估算姿态。通过识别遮挡区域中的关节,所得到的网络对遮挡具有鲁棒性,因此在InterHand2.6M、HO3D和H_2O3D数据集上的评估中实现了最先进的性能。
链接: https://arxiv.org/abs/2503.21723
作者: Mallika Garg,Debashis Ghosh,Pyari Mohan Pradhan
机构: IIT Roorkee (印度理工学院鲁尔基)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted in NATIONAL CONFERENCE ON COMMUNICATIONS (NCC) 2025
点击查看摘要
Abstract:Occlusion is one of the challenging issues when estimating 3D hand pose. This problem becomes more prominent when hand interacts with an object or two hands are involved. In the past works, much attention has not been given to these occluded regions. But these regions contain important and beneficial information that is vital for 3D hand pose estimation. Thus, in this paper, we propose an occlusion robust and accurate method for the estimation of 3D hand-object pose from the input RGB image. Our method includes first localising the hand joints using a CNN based model and then refining them by extracting contextual information. The self attention transformer then identifies the specific joints along with the hand identity. This helps the model to identify the hand belongingness of a particular joint which helps to detect the joint even in the occluded region. Further, these joints with hand identity are then used to estimate the pose using cross attention mechanism. Thus, by identifying the joints in the occluded region, the obtained network becomes robust to occlusion. Hence, this network achieves state-of-the-art results when evaluated on the InterHand2.6M, HO3D and H2O3D datasets.
zh
[CV-24] Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance
【速读】:该论文试图解决文本到图像合成评估中因现有度量与人类偏好不匹配而导致的挑战。为了解决这一问题,论文提出了cFreD(Conditional Fréchet Distance)这一新度量方法,其关键是同时显式考虑视觉保真度和文本提示对齐,弥补了如Inception Score (IS)、Fréchet Inception Distance (FID) 和CLIPScore等现有度量在单一评估方面的不足,这些度量要么侧重图像质量,要么关注图像-文本对齐,但无法兼顾两者。此外,论文通过大量实验验证了cFreD相较于统计学度量及基于人类偏好的训练模型,在与人类判断的相关性上表现更优,证明了其作为稳健且未来适用的评估工具的有效性。
链接: https://arxiv.org/abs/2503.21721
作者: Jaywon Koo,Jefferson Hernandez,Moayed Haji-Ali,Ziyan Yang,Vicente Ordonez
机构: Rice University (莱斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Evaluating text-to-image synthesis is challenging due to misalignment between established metrics and human preferences. We propose cFreD, a metric based on the notion of Conditional Fréchet Distance that explicitly accounts for both visual fidelity and text-prompt alignment. Existing metrics such as Inception Score (IS), Fréchet Inception Distance (FID) and CLIPScore assess either image quality or image-text alignment but not both which limits their correlation with human preferences. Scoring models explicitly trained to replicate human preferences require constant updates and may not generalize to novel generation techniques or out-of-domain inputs. Through extensive experiments across multiple recently proposed text-to-image models and diverse prompt datasets, we demonstrate that cFreD exhibits a higher correlation with human judgments compared to statistical metrics, including metrics trained with human preferences. Our findings validate cFreD as a robust, future-proof metric for the systematic evaluation of text-to-image models, standardizing benchmarking in this rapidly evolving field. We release our evaluation toolkit and benchmark in the appendix.
zh
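【示例】cFreD 建立在 Fréchet 距离的思想之上。下面给出经典 Fréchet 距离(即 FID 所用公式)的 numpy/scipy 计算示例;cFreD 如何显式引入文本条件属于论文细节,此处的特征与数据均为随机生成的演示假设:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_x: np.ndarray, feats_y: np.ndarray) -> float:
    """两组特征各拟合一个高斯后计算 Fréchet 距离:
    ||mu_x - mu_y||^2 + Tr(Sx + Sy - 2 (Sx Sy)^{1/2})。"""
    mu_x, mu_y = feats_x.mean(0), feats_y.mean(0)
    sx = np.cov(feats_x, rowvar=False)
    sy = np.cov(feats_y, rowvar=False)
    covmean = linalg.sqrtm(sx @ sy)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(sx + sy - 2.0 * covmean))

# 假设性用法:real/fake 为真实图像与生成图像在某个(与文本条件相关的)
# 特征空间中的嵌入;cFreD 的具体条件化方式请以论文为准
rng = np.random.default_rng(0)
real = rng.normal(size=(512, 64))
fake = rng.normal(loc=0.1, size=(512, 64))
print(frechet_distance(real, fake))
```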
[CV-25] MAVERIX: Multimodal Audio-Visual Evaluation Reasoning IndeX
【速读】:该论文试图解决多模态模型在跨模态感知能力评估方面缺乏标准化框架的问题。解决方案的关键在于引入了一个名为MAVERIX(Multimodal Audio-Visual Evaluation Reasoning IndeX)的新基准,包含700个视频和2,556个问题,专门设计用于通过需要紧密整合视频和音频信息的任务来评估多模态模型。MAVERIX的独特之处在于提供了模仿人类在推理和决策过程中可用的多模态感知体验的音频视觉任务,从而实现对音频视觉集成的全面评估。这一标准化评估框架为提升多模态智能提供了具有挑战性的测试平台。
链接: https://arxiv.org/abs/2503.21699
作者: Liuyue Xie,George Z. Wei,Avik Kuthiala,Ce Zheng,Ananya Bal,Mosam Dabhi,Liting Wen,Taru Rustagi,Ethan Lai,Sushil Khyalia,Rohan Choudhury,Morteza Ziyadi,Xu Zhang,Hao Yang,László A. Jeni
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Frontier models have either been language-only or have primarily focused on vision and language modalities. Although recent advancements in models with vision and audio understanding capabilities have shown substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX (Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that necessitate close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experiences available to humans during inference and decision-making processes. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorously annotated pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.
zh
[CV-26] AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei Segmentation
【速读】:该论文旨在解决细胞核分割在多数据集学习中的域偏移问题及高分辨率细节捕捉的挑战。传统方法仅关注单一数据集(主域),而忽略利用多样化辅助数据来减少过拟合并提升性能,但直接结合多数据集往往会导致域偏移引起的性能下降。为克服这些障碍,论文提出了Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM),其关键创新包括:首先引入Conditional Gradient Reversal Layer (CGRL),通过多域对齐模块协调来自不同域的特征,促进域不变表征学习同时保留主域的关键判别性特征;其次设计High-Resolution Decoder (HR-Decoder),以直接生成精细的分割图,从而捕获高分辨率组织学图像中复杂的细胞核边界。据作者所知,这是首次将Segment Anything Model (SAM) 应用于多数据集学习以实现组织学细胞核分割任务。实验结果验证了该方法在多个公开数据集上的有效性和优越性。
链接: https://arxiv.org/abs/2503.21695
作者: Jiahe Qian,Yaoyu Fang,Jinkui Hao,Bo Zhou
机构: Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 tables, 2 figures
点击查看摘要
Abstract:Accurate segmentation of cell nuclei in histopathology images is essential for numerous biomedical research and clinical applications. However, existing cell nucleus segmentation methods only consider a single dataset (i.e., primary domain), while neglecting to leverage supplementary data from diverse sources (i.e., auxiliary domains) to reduce overfitting and enhance the performance. Although incorporating multiple datasets could alleviate overfitting, it often exacerbates performance drops caused by domain shifts. In this work, we introduce Adversarial Multi-domain Alignment of Segment Anything Model (AMA-SAM) that extends the Segment Anything Model (SAM) to overcome these obstacles through two key innovations. First, we propose a Conditional Gradient Reversal Layer (CGRL), a multi-domain alignment module that harmonizes features from diverse domains to promote domain-invariant representation learning while preserving crucial discriminative features for the primary dataset. Second, we address SAM’s inherent low-resolution output by designing a High-Resolution Decoder (HR-Decoder), which directly produces fine-grained segmentation maps in order to capture intricate nuclei boundaries in high-resolution histology images. To the best of our knowledge, this is the first attempt to adapt SAM for multi-dataset learning with application to histology nuclei segmentation. We validate our method on several publicly available datasets, demonstrating consistent and significant improvements over state-of-the-art approaches.
zh
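【示例】CGRL 是在经典梯度反转层(Gradient Reversal Layer)基础上加入条件化的多域对齐模块。下面给出标准 GRL 的 PyTorch 实现作为参考;如何依据主域/辅域条件调整反转行为属于论文的设计,此处不做还原:

```python
import torch

class GradReverse(torch.autograd.Function):
    """标准梯度反转层:前向恒等,反向把梯度乘以 -lambda。"""
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# 示意:域判别器接在反转后的特征上,促使特征提取器学到域不变表征;
# CGRL 如何按主/辅域条件化反转行为属于论文细节,此处仅作占位演示
feat = torch.randn(8, 256, requires_grad=True)
domain_logits = torch.nn.Linear(256, 2)(grad_reverse(feat, lambd=0.5))
loss = torch.nn.functional.cross_entropy(domain_logits, torch.randint(0, 2, (8,)))
loss.backward()
print(feat.grad.shape)  # 特征端收到的梯度已被反向(乘以 -0.5)
```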
[CV-27] Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data CVPR2025
【速读】:该论文旨在解决快速生成高质量 3D 网格(3D Meshes)的问题,尤其是在缺乏足够的高质量 3D 训练数据的情况下。传统方法通过将预训练的文本到图像扩散模型(如 Stable Diffusion, SD)迁移到 3D 表示生成任务时,往往因数据不足而导致生成质量不佳。为克服这一数据短缺问题,论文提出了一种名为 Progressive Rendering Distillation (PRD) 的新型训练方案。
PRD 的关键在于无需依赖 3D 地面真实数据(3D Ground-Truths),而是通过蒸馏多视角扩散模型(如 MVDream 和 RichDreamer)以及将 SD 适配为原生 3D 生成器来实现。在每个训练迭代中,PRD 使用 U-Net 对潜在表示进行逐步去噪,并在每一步将其解码为 3D 输出。同时,通过分数蒸馏(Score Distillation)机制,PRD 将文本一致的纹理和几何信息融入到 3D 输出中。这种方法不仅支持无 3D 地面真实数据的训练,还能够扩展训练数据规模以提升对复杂文本概念生成的质量,并显著加速推理速度。最终,基于 PRD 训练得到的 TriplaneTurbo 模型,在效率和质量上均超越了现有的文本到 3D 生成器。
链接: https://arxiv.org/abs/2503.21694
作者: Zhiyuan Ma,Xinyue Liang,Rongyuan Wu,Xiangyu Zhu,Zhen Lei,Lei Zhang
机构: The Hong Kong Polytechnic University (香港理工大学); Center for Artificial Intelligence and Robotics, HKISI CAS (香港人工智能与机器人研究所); State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA (中科院多模态人工智能系统重点实验室); School of Artificial Intelligence, University of Chinese Academy of Sciences, UCAS (中国科学院大学人工智能学院)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025. Code: this https URL . Demo: this https URL
点击查看摘要
Abstract:It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed as Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used in joint with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD can accelerate the inference speed of the generation model in just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well for challenging text input. The code is available at this https URL.
zh
[CV-28] RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond
【速读】:该论文致力于解决多视角多人姿态估计中的快速三角化速度与良好泛化能力的问题。其关键在于提出了一种新的算法,能够实现从面部表情到手指运动等全身姿态细节的精确捕捉,并展现出在不同场景和未见数据集上的强大适应性与性能表现。这一方案通过提升多视角多目标姿态估计的效率与准确性,推动了计算机视觉领域对人类行为理解和交互研究的进步。所有相关工作均公开可用以支持进一步的研究发展。
链接: https://arxiv.org/abs/2503.21692
作者: Daniel Bermuth,Alexander Poeppel,Wolfgang Reif
机构: University of Augsburg (奥格斯堡大学); ISSE (未知)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The integration of multi-view imaging and pose estimation represents a significant advance in computer vision applications, offering new possibilities for understanding human movement and interactions. This work presents a new algorithm that improves multi-view multi-person pose estimation, focusing on fast triangulation speeds and good generalization capabilities. The approach extends to whole-body pose estimation, capturing details from facial expressions to finger movements across multiple individuals and viewpoints. Adaptability to different settings is demonstrated through strong performance across unseen datasets and configurations. To support further progress in this field, all of this work is publicly accessible.
zh
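【示例】多视角姿态估计中的核心一步是由多台相机的 2D 关键点三角化出 3D 关节。下面给出教科书式的线性三角化(DLT)示例,并用两台合成相机验证;论文的快速三角化实现细节可能与此不同:

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """线性三角化(DLT):由多视角投影矩阵与对应 2D 观测恢复 3D 点。
    每个视角贡献两行 u*P3 - P1 与 v*P3 - P2,最小奇异向量即齐次解。"""
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# 构造两台简单相机验证:真值点为 (1, 2, 5)
K = np.array([[500., 0, 320], [0, 500., 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.], [0.]])])  # 第二台相机沿 x 平移
X_gt = np.array([1.0, 2.0, 5.0, 1.0])
proj = lambda P, X: (P @ X)[:2] / (P @ X)[2]
print(triangulate_point([P1, P2], [proj(P1, X_gt), proj(P2, X_gt)]))  # 约为 [1, 2, 5]
```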
[CV-29] CMED: A Child Micro-Expression Dataset
【速读】:该论文旨在解决儿童微表情检测的研究空白问题。现有微表情检测研究主要集中于成年人,而儿童与成人在表情特征上存在显著差异,这使得基于成人数据的研究难以直接适用于儿童。由于缺乏专门针对儿童的微表情数据集,捕捉儿童面部表情更具挑战性,因为其表现缺乏可预测性和可控性。为解决这一问题,论文编译了一个名为“儿童自发微表情视频”的数据集,这是迄今为止已知首个此类数据集。该数据集通过视频会议软件在自然环境中采集。论文的关键解决方案在于利用这一新数据集,探索成人与儿童微表情之间的关键特征和差异,并为儿童微表情的自动化检测和识别建立基准,采用的方法包括手工设计方法和学习型方法。
链接: https://arxiv.org/abs/2503.21690
作者: Nikin Matharaarachchi,Muhammad Fermi Pasha,Sonya Coleman,Kah Peng Wong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Micro-expressions are short bursts of emotion that are difficult to hide. Their detection in children is an important cue to assist psychotherapists in conducting better therapy. However, existing research on the detection of micro-expressions has focused on adults, whose expressions differ in their characteristics from those of children. The lack of research is a direct consequence of the lack of a child-based micro-expressions dataset as it is much more challenging to capture children’s facial expressions due to the lack of predictability and controllability. This study compiles a dataset of spontaneous child micro-expression videos, the first of its kind, to the best of the authors’ knowledge. The dataset is captured in the wild using video conferencing software. This dataset enables us to then explore key features and differences between adult and child micro-expressions. This study also establishes a baseline for the automated spotting and recognition of micro-expressions in children using three approaches comprising hand-created and learning-based approaches.
zh
[CV-30] Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI
【速读】:该论文试图解决的核心问题是:如何实现具备真实世界语境中通用物体理解能力的AI系统。论文指出,尽管现有AI范式能够模拟某些孤立的物体属性(如形状、运动等),但它们在功能整合方面存在不足,无法全面解决“物体理解”这一挑战。论文的关键在于提出一种新的评估方法,这些方法旨在通过整合生物体中物体理解的核心能力(如格式塔心理学、具身认知和发育心理学所描述的功能角色)来推动AI从孤立的物体属性建模向具备真实世界通用物体理解能力的方向发展。
链接: https://arxiv.org/abs/2503.21668
作者: Danaja Rutar,Alva Markelius,Konstantinos Voudouris,José Hernández-Orallo,Lucy Cheke
机构: Cambridge University (剑桥大学); Universitat Politècnica de València (瓦伦西亚理工大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:One of the core components of our world models is ‘intuitive physics’ - an understanding of objects, space, and causality. This capability enables us to predict events, plan action and navigate environments, all of which rely on a composite sense of objecthood. Despite its importance, there is no single, unified account of objecthood, though multiple theoretical frameworks provide insights. In the first part of this paper, we present a comprehensive overview of the main theoretical frameworks in objecthood research - Gestalt psychology, enactive cognition, and developmental psychology - and identify the core capabilities each framework attributes to object understanding, as well as what functional roles they play in shaping world models in biological agents. Given the foundational role of objecthood in world modelling, understanding objecthood is also essential in AI. In the second part of the paper, we evaluate how current AI paradigms approach and test objecthood capabilities compared to those in cognitive science. We define an AI paradigm as a combination of how objecthood is conceptualised, the methods used for studying objecthood, the data utilised, and the evaluation techniques. We find that, whilst benchmarks can detect that AI systems model isolated aspects of objecthood, the benchmarks cannot detect when AI systems lack functional integration across these capabilities, not solving the objecthood challenge fully. Finally, we explore novel evaluation approaches that align with the integrated vision of objecthood outlined in this paper. These methods are promising candidates for advancing from isolated object capabilities toward general-purpose AI with genuine object understanding in real-world contexts.
zh
[CV-31] InteractionMap: Improving Online Vectorized HDMap Construction with Interaction
【速读】:本文旨在解决高精度(HD)地图矢量化中的关键挑战,特别是在基于DETR-like框架的现有方法中未能充分挖掘时空局部到全局信息交互的问题。论文提出InteractionMap,其核心解决方案包括:首先,通过从点级到实例级的显式位置关系先验增强DETR-like检测器,充分利用地图元素的强形状先验;其次,设计了一种基于关键帧的分层时间融合模块,实现从局部到全局的时间信息交互;最后,在优化过程中引入几何感知分类损失,并在标签分配中加入几何感知匹配成本,以解决分类分支与回归分支输出分布错位的问题。这些创新显著提升了地图矢量化的性能,在nuScenes和Argoverse2数据集上达到了当前最优水平。
链接: https://arxiv.org/abs/2503.21659
作者: Kuang Wu,Chuan Yang,Zhanbin Li
机构: Langge Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Vectorized high-definition (HD) maps are essential for an autonomous driving system. Recently, state-of-the-art map vectorization methods are mainly based on DETR-like framework to generate HD maps in an end-to-end manner. In this paper, we propose InteractionMap, which improves previous map vectorization methods by fully leveraging local-to-global information interaction in both time and space. Firstly, we explore enhancing DETR-like detectors by explicit position relation prior from point-level to instance-level, since map elements contain strong shape priors. Secondly, we propose a key-frame-based hierarchical temporal fusion module, which interacts temporal information from local to global. Lastly, the separate classification branch and regression branch lead to the problem of misalignment in the output distribution. We interact semantic information with geometric information by introducing a novel geometric-aware classification loss in optimization and a geometric-aware matching cost in label assignment. InteractionMap achieves state-of-the-art performance on both nuScenes and Argoverse2 benchmarks.
zh
[CV-32] When Astronomy Meets AI: Manazel For Crescent Visibility Prediction in Morocco
【速读】:该论文旨在解决精确确定伊斯兰历(Hijri calendar)每月初开始时间的问题,这对宗教、文化和行政事务具有重要意义。论文的关键解决方案是通过整合13年的月牙可见性数据改进ODEH标准,并引入两个关键特征——视弧(Arc of Vision, ARCV)和月牙总宽度(total width, W),以提升月相可见性评估的准确性。此外,采用Logistic回归算法的机器学习方法对月牙可见性条件进行分类,实现了98.83%的预测精度,从而提供了一个稳健且可靠的方法来确定伊斯兰历每月的起始时间,同时优化了摩洛哥地区月历计算的一致性。
链接: https://arxiv.org/abs/2503.21634
作者: Yassir Lairgi
机构: INSA Lyon (里昂国立应用科学学院); LIRIS (里尔图像与信号实验室); Villeurbanne, France (法国维勒班)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The accurate determination of the beginning of each Hijri month is essential for religious, cultural, and administrative purposes. Manazel (The code and datasets are available at this https URL) addresses this challenge in Morocco by leveraging 13 years of crescent visibility data to refine the ODEH criterion, a widely used standard for lunar crescent visibility prediction. The study integrates two key features, the Arc of Vision (ARCV) and the total width of the crescent (W), to enhance the accuracy of lunar visibility assessments. A machine learning approach utilizing the Logistic Regression algorithm is employed to classify crescent visibility conditions, achieving a predictive accuracy of 98.83%. This data-driven methodology offers a robust and reliable framework for determining the start of the Hijri month, comparing different data classification tools, and improving the consistency of lunar calendar calculations in Morocco. The findings demonstrate the effectiveness of machine learning in astronomical applications and highlight the potential for further enhancements in the modeling of crescent visibility.
zh
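【示例】论文采用 Logistic 回归,基于视弧(ARCV)与月牙总宽度(W)两个特征对月牙可见性分类。下面是一个 scikit-learn 的接口示意,其中的数值为虚构的演示数据,并非摩洛哥 13 年观测数据:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 假设性示例数据:每行为 (ARCV 视弧, W 月牙总宽度),标签 1=可见,0=不可见。
# 真实模型基于 13 年观测数据训练,此处数值仅用于演示接口
X = np.array([[10.5, 0.9], [12.0, 1.1], [6.0, 0.3], [5.5, 0.2],
              [9.0, 0.7], [4.0, 0.15], [11.0, 1.0], [7.0, 0.4]])
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
new_obs = np.array([[8.5, 0.6]])
print(clf.predict(new_obs), clf.predict_proba(new_obs))  # 预测是否满足可见性条件
```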
[CV-33] he MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection CVPR2025
【速读】:该论文试图解决现有异常检测基准(如MVTec AD和VisA)在分割平均精确率(AU-PRO)方面的性能饱和问题,即最先进的模型之间在指标上仅相差不到一个百分点,导致无法有效区分模型优劣,从而阻碍了领域进展。这种性能饱和现象尤其受到机器学习结果固有的随机性影响。为了解决这一问题,论文提出了MVTec AD 2数据集,这是一个包含八个异常检测场景、超过8000张高分辨率图像的新数据集。其关键是引入了一系列具有挑战性和高度相关的工业检测应用场景,这些场景在之前的公开数据集中未被充分考虑,例如透明与重叠物体、暗场与背光照明、正常数据中具有高变异性的对象以及极小缺陷等。此外,该数据集还设计了光照条件变化的测试场景,用于评估方法在实际分布偏移下的鲁棒性。通过提供全面的最新方法评估,并公开像素级精确的测试集标注,论文旨在推动异常检测领域的进一步发展。
链接: https://arxiv.org/abs/2503.21622
作者: Lars Heckler-Kram,Jan-Hendrik Neudeck,Ulla Scheler,Rebecca König,Carsten Steger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: paper under review; dataset first released for the VAND3.0 challenge @ CVPR 2025 this https URL
点击查看摘要
Abstract:In recent years, performance on existing anomaly detection benchmarks like MVTec AD and VisA has started to saturate in terms of segmentation AU-PRO, with state-of-the-art models often competing in the range of less than one percentage point. This lack of discriminatory power prevents a meaningful comparison of models and thus hinders progress of the field, especially when considering the inherent stochastic nature of machine learning results. We present MVTec AD 2, a collection of eight anomaly detection scenarios with more than 8000 high-resolution images. It comprises challenging and highly relevant industrial inspection use cases that have not been considered in previous datasets, including transparent and overlapping objects, dark-field and back light illumination, objects with high variance in the normal data, and extremely small defects. We provide comprehensive evaluations of state-of-the-art methods and show that their performance remains below 60% average AU-PRO. Additionally, our dataset provides test scenarios with lighting condition changes to assess the robustness of methods under real-world distribution shifts. We host a publicly accessible evaluation server that holds the pixel-precise ground truth of the test set (this https URL). All image data is available at this https URL.
zh
[CV-34] Audio-driven Gesture Generation via Deviation Feature in the Latent Space
【速读】:该论文致力于解决基于弱监督学习的共言手势(co-speech gesture)视频生成问题,关注点在于利用像素级运动偏差而非完全监督的数据驱动方法。论文的关键解决方案在于提出了一种弱监督框架,通过学习潜在表征的偏差来生成共言手势视频,并采用扩散模型整合潜在运动特征,以实现更精确且细腻的手势表达。此外,通过利用潜在空间中的弱监督偏差,有效生成手部动作与口型变化,从而提升视频的真实感与质量,显著超越当前最先进的技术。
链接: https://arxiv.org/abs/2503.21616
作者: Jiahui Chen,Yang Huan,Runhua Shi,Chanfan Ding,Xiaoqi Mo,Siyu Xiong,Yinong He
机构: Ai Lab, Gaint Network (Gaint Network 实验室); Department of Digital Media Technology, Xiamen University (厦门大学数字媒体技术系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures
点击查看摘要
Abstract:Gestures are essential for enhancing co-speech communication, offering visual emphasis and complementing verbal interactions. While prior work has concentrated on point-level motion or fully supervised data-driven methods, we focus on co-speech gestures, advocating for weakly supervised learning and pixel-level motion deviations. We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. By leveraging weakly supervised deviations in latent space, we effectively generate hand gestures and mouth movements, crucial for realistic video production. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
zh
[CV-35] FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation
【速读】:本文旨在解决传统单模态行人再识别(Person Re-identification, ReID)方法在处理复杂场景(如遮挡、光照变化和姿态变化)时面临的性能瓶颈。为克服这些挑战,论文提出了一种名为FusionSegReID的多模态模型,其关键在于整合图像和文本两种模态的优势。通过结合这两种模态的信息,该模型不仅提升了匹配精度,还增强了鲁棒性,特别是在单一模态表现不佳的情况下。实验结果表明,与传统单模态方法相比,FusionSegReID显著提高了Top-1准确率和平均精度均值(mean Average Precision, mAP),并在遮挡和低质量图像等困难场景中实现了更优的分割效果。消融研究进一步验证了多模态融合及分割模块对提升再识别性能和掩码准确性的重要性。总体而言,该方案通过多模态信息互补有效地应对了现实世界中的复杂场景需求。
链接: https://arxiv.org/abs/2503.21595
作者: Jincheng Yan,Yun Wang,Xiaoyan Luo,Yu-Wing Tai
机构: Beihang University (北航); Dartmouth College (达特茅斯学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Person re-identification (ReID) plays a critical role in applications like security surveillance and criminal investigations by matching individuals across large image galleries captured by non-overlapping cameras. Traditional ReID methods rely on unimodal inputs, typically images, but face limitations due to challenges like occlusions, lighting changes, and pose variations. While advancements in image-based and text-based ReID systems have been made, the integration of both modalities has remained under-explored. This paper presents FusionSegReID, a multimodal model that combines both image and text inputs for enhanced ReID performance. By leveraging the complementary strengths of these modalities, our model improves matching accuracy and robustness, particularly in complex, real-world scenarios where one modality may struggle. Our experiments show significant improvements in Top-1 accuracy and mean Average Precision (mAP) for ReID, as well as better segmentation results in challenging scenarios like occlusion and low-quality images. Ablation studies further confirm that multimodal fusion and segmentation modules contribute to enhanced re-identification and mask accuracy. The results show that FusionSegReID outperforms traditional unimodal models, offering a more robust and flexible solution for real-world person ReID tasks.
zh
[CV-36] AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion
【速读】:该论文旨在解决在真实世界复杂环境中相机标定准确性不足的问题,特别是现有方法依赖于预畸变图像或特定标定图案,限制了其适用性和灵活性。论文的关键创新在于提出了一种基于通用光线相机模型的新型框架AlignDiff,通过联合建模相机内参和外参,从几何特征而非语义特征出发,实现了局部畸变的更精确建模。此外,AlignDiff是一种以几何先验条件的扩散模型,并结合边缘感知注意力机制增强对图像边缘附近几何特征的关注,同时利用包含三千多个样本的光线追踪透镜数据库提升对多种镜头形式固有畸变的表征能力。实验表明,该方法显著降低了估计光线束的角度误差(约8.2度)并提升了整体标定精度,在具有挑战性的现实世界数据集上优于现有方法。
链接: https://arxiv.org/abs/2503.21581
作者: Liuyue Xie,Jiancong Guo,Ozan Cakmakci,Andre Araujo,Laszlo A. Jeni,Zhiheng Jia
机构: Carnegie Mellon University (卡内基梅隆大学); Google; Google DeepMind
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. We propose AlignDiff, a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges, rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by ~8.2 degrees and overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
zh
[CV-37] Bearing fault diagnosis based on multi-scale spectral images and convolutional neural network
【速读】:该论文旨在解决传统轴承故障诊断方法中诊断准确性较低的问题。解决方案的关键在于提出了一种基于多尺度谱特征图像和深度学习的新方法:首先通过均值去除对振动信号进行预处理,并利用快速傅里叶变换(FFT)将其转换为多长度频谱;其次,通过多长度频谱铺设方案构建了一种新的特征——多尺度谱特征图像(MSSI);最后,采用卷积神经网络(CNN)这一深度学习框架实现轴承故障的诊断。实验结果验证了所提方法在提高故障诊断准确性方面的有效性。
链接: https://arxiv.org/abs/2503.21566
作者: Tongchao Luo,Mingquan Qiu,Zhenyu Wu,Zebo Zhao,Dingyou Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12pages, 10 figures and 8 tables
点击查看摘要
Abstract:To address the challenges of low diagnostic accuracy in traditional bearing fault diagnosis methods, this paper proposes a novel fault diagnosis approach based on multi-scale spectrum feature images and deep learning. Firstly, the vibration signal are preprocessed through mean removal and then converted to multi-length spectrum with fast Fourier transforms (FFT). Secondly, a novel feature called multi-scale spectral image (MSSI) is constructed by multi-length spectrum paving scheme. Finally, a deep learning framework, convolutional neural network (CNN), is formulated to diagnose the bearing faults. Two experimental cases are utilized to verify the effectiveness of the proposed method. Experimental results demonstrate that the proposed method significantly improves the accuracy of fault diagnosis.
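下面给出一段按上述流程思路编写的最小示意代码(假设性实现,非论文官方代码):先做均值去除,再对不同长度的信号段做 FFT 得到多长度频谱,最后把各尺度频谱铺设成一张二维特征图,即可作为 CNN 的输入。其中各尺度长度 `lengths`、每尺度行数 `rows_per_scale` 以及重采样方式均为演示用的假设值。

```python
import numpy as np

def multi_scale_spectral_image(signal, lengths=(256, 512, 1024), rows_per_scale=8):
    """多尺度谱特征图像(MSSI)构造的简化示意:均值去除 -> 多长度FFT -> 铺设成二维图像。"""
    signal = np.asarray(signal, dtype=np.float64)
    signal = signal - signal.mean()                      # 均值去除
    width = min(lengths) // 2                            # 统一每个尺度的行宽,便于铺设
    rows = []
    for n in lengths:
        spec = np.abs(np.fft.rfft(signal[:n]))[:n // 2]  # 单边幅度谱
        spec = spec / (spec.max() + 1e-8)                # 归一化到 [0, 1]
        # 将不同长度的频谱重采样到同一宽度,再纵向重复若干行,铺设为一个图像块
        resampled = np.interp(np.linspace(0, len(spec) - 1, width),
                              np.arange(len(spec)), spec)
        rows.append(np.tile(resampled, (rows_per_scale, 1)))
    return np.vstack(rows)                               # 形状: (len(lengths)*rows_per_scale, width)

# 用法示例:对一段模拟振动信号生成 24x128 的特征图,可直接送入任意小型 CNN 分类器
sig = np.sin(2 * np.pi * 37 * np.linspace(0, 1, 4096)) + 0.1 * np.random.randn(4096)
print(multi_scale_spectral_image(sig).shape)             # (24, 128)
```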
zh
[CV-38] uLayout: Unified Room Layout Estimation for Perspective and Panoramic Images WACV-2025
【速读】:该论文旨在解决从透视图像和全景图像中统一估计房间布局几何结构的问题,传统方法需要针对每种图像类型设计不同的模型。论文的关键解决方案在于将两种域统一到等角投影(equirectangular projection)中,并通过为透视图像分配最合适的纬度坐标,实现两种域的无缝融合。为了解决输入域之间视场(Field-of-View, FoV)的差异,论文设计了一个共享特征提取器,并额外加入一个1D卷积层来差异化处理每个域的输入。这种条件化方法使得无论输入的FoV如何,都可以高效地将特征回归问题转化为列向量形式,从而实现端到端的统一建模。该方法不仅性能媲美当前最先进的解决方案,而且首次展示了单一模型在两个域上的有效性。实验证明了该方法在LSUN、Matterport3D、PanoContext和Stanford 2D-3D等真实数据集上的贡献。
链接: https://arxiv.org/abs/2503.21562
作者: Jonathan Lee,Bolivar Solarte,Chin-Hsuan Wu,Jin-Cheng Jhang,Fu-En Wang,Yi-Hsuan Tsai,Min Sun
机构: National Tsing Hua University (国立清华大学), Taiwan; Atmanity Inc. (Atmanity 公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to WACV-2025
点击查看摘要
Abstract:We present uLayout, a unified model for estimating room layout geometries from both perspective and panoramic images, whereas traditional solutions require different model designs for each image type. The key idea of our solution is to unify both domains into the equirectangular projection, particularly, allocating perspective images into the most suitable latitude coordinate to effectively exploit both domains seamlessly. To address the Field-of-View (FoV) difference between the input domains, we design uLayout with a shared feature extractor with an extra 1D-Convolution layer to condition each domain input differently. This conditioning allows us to efficiently formulate a column-wise feature regression problem regardless of the FoV input. This simple yet effective approach achieves competitive performance with current state-of-the-art solutions and shows for the first time a single end-to-end model for both domains. Extensive experiments in the real-world datasets, LSUN, Matterport3D, PanoContext, and Stanford 2D-3D evidence the contribution of our approach. Code is available at this https URL.
zh
[CV-39] SyncSDE: A Probabilistic Framework for Diffusion Synchronization CVPR2025
【速读】:该论文试图解决多扩散模型协作生成中的任务特定性问题,现有方法通过简单的启发式策略(如平均操作)同步多个扩散轨迹,但缺乏理论解释且在跨任务应用时常失效。论文的关键在于提出一个概率框架来分析扩散同步机制的工作原理,并揭示启发式策略应聚焦于建模多扩散轨迹间的相关性并针对具体任务进行适配。进一步地,论文通过为每项任务识别最优的相关性模型,超越了以往对所有任务采用单一启发式策略且无明确依据的方法,从而实现更优的结果。
链接: https://arxiv.org/abs/2503.21555
作者: Hyunjun Lee,Hyunsoo Lee,Sookwan Han
机构: ECE, Seoul National University (首尔国立大学); Republic of Korea Air Force (大韩民国空军)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted to CVPR2025
点击查看摘要
Abstract:There have been many attempts to leverage multiple diffusion models for collaborative generation, extending beyond the original domain. A prominent approach involves synchronizing multiple diffusion trajectories by mixing the estimated scores to artificially correlate the generation processes. However, existing methods rely on naive heuristics, such as averaging, without considering task specificity. These approaches do not clarify why such methods work and often fail when a heuristic suitable for one task is blindly applied to others. In this paper, we present a probabilistic framework for analyzing why diffusion synchronization works and reveal where heuristics should be focused - modeling correlations between multiple trajectories and adapting them to each specific task. We further identify optimal correlation models per task, achieving better results than previous approaches that apply a single heuristic across all tasks without justification.
zh
[CV-40] LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
【速读】:该论文旨在解决现有文本引导图像编辑方法在保持图像整体结构和背景保真度的同时,难以维持空间一致性的问题。这些方法依赖于扩散模型生成的交叉注意力图(cross-attention maps)来确定目标区域,但由于交叉注意力机制主要关注语义相关性,往往无法有效维护图像完整性,导致编辑结果存在伪影和失真现象。论文的关键解决方案是提出LOCATEdit,通过基于图的方法增强交叉注意力图,利用自注意力(self-attention)推导出的patch关系,在图像各区域间建立平滑且一致的注意力分布,从而确保修改仅限于指定的目标对象,同时保留周围结构的连贯性。这一改进显著提升了在PIE-Bench基准上的性能,展示了其在多种编辑任务中的先进性和有效性。
链接: https://arxiv.org/abs/2503.21541
作者: Achint Soni,Meet Soni,Sirisha Rambhatla
机构: University of Waterloo (滑铁卢大学); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Text-guided image editing aims to modify specific regions of an image according to natural language instructions while maintaining the general structure and the background fidelity. Existing methods utilize masks derived from cross-attention maps generated from diffusion models to identify the target regions for modification. However, since cross-attention mechanisms focus on semantic relevance, they struggle to maintain the image integrity. As a result, these methods often lack spatial consistency, leading to editing artifacts and distortions. In this work, we address these limitations and introduce LOCATEdit, which enhances cross-attention maps through a graph-based approach utilizing self-attention-derived patch relationships to maintain smooth, coherent attention across image regions, ensuring that alterations are limited to the designated items while retaining the surrounding structure. \method consistently and substantially outperforms existing baselines on PIE-Bench, demonstrating its state-of-the-art performance and effectiveness on various editing tasks. Code can be found on this https URL
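LOCATEdit 的核心是利用自注意力得到的 patch 间关系对交叉注意力图施加空间一致性约束。下面是一个与该思路精神相近的图拉普拉斯平滑最小示意(假设性实现,并非论文的官方算法,目标函数形式与归一化方式均为假设):把对称化后的自注意力矩阵当作图的边权,对交叉注意力向量 c 求解 (I + λL)x = c,使注意力在相似 patch 之间变得平滑。

```python
import numpy as np

def laplacian_smooth_attention(cross_attn, self_attn, lam=10.0):
    """用自注意力亲和度构图,对交叉注意力图做图拉普拉斯平滑。
    cross_attn: (N,) 各 patch 对编辑词的注意力;self_attn: (N, N) patch 间亲和度。"""
    W = (self_attn + self_attn.T) / 2.0          # 对称化,作为无向图权重
    L = np.diag(W.sum(axis=1)) - W               # 图拉普拉斯 L = D - W
    # 最小化 ||x - c||^2 + lam * x^T L x,其闭式解满足 (I + lam*L) x = c
    return np.linalg.solve(np.eye(len(cross_attn)) + lam * L, cross_attn)

# 用法示例:16 个 patch 的玩具数据(真实场景中两者来自扩散模型的注意力层)
N = 16
A = np.random.rand(N, N)
smoothed = laplacian_smooth_attention(np.random.rand(N), A / A.sum(axis=1, keepdims=True))
print(smoothed.shape)                            # (16,)
```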
zh
[CV-41] ICG-MVSNet: Learning Intra-view and Cross-view Relationships for Guidance in Multi-View Stereo
【速读】:该论文旨在解决现有基于学习的多视图立体视觉(MVS)框架忽视特征中嵌入的几何信息和相关性的问题,这导致了成本匹配能力较弱。论文提出了一种名为ICG-MVSNet的方法,通过显式整合图像内的关系和跨视图关系来增强深度估计。关键解决方案包括开发一个利用单个图像内特征坐标相关性的图像内特征融合模块,以增强鲁棒的成本匹配,以及引入一个轻量级的跨视图聚合模块,高效利用体素相关性中的上下文信息进行指导正则化。这种方法在DTU数据集和Tanks and Temples基准上进行了评估,与最先进的方法相比表现出竞争性的性能,同时所需的计算资源更少。
链接: https://arxiv.org/abs/2503.21525
作者: Yuxi Hu,Jun Zhang,Zhe Zhang,Rafael Weilharter,Yuchen Rao,Kuangyi Chen,Runze Yuan,Friedrich Fraundorfer
机构: Graz University of Technology (格拉茨工业大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Multi-view Stereo (MVS) aims to estimate depth and reconstruct 3D point clouds from a series of overlapping images. Recent learning-based MVS frameworks overlook the geometric information embedded in features and correlations, leading to weak cost matching. In this paper, we propose ICG-MVSNet, which explicitly integrates intra-view and cross-view relationships for depth estimation. Specifically, we develop an intra-view feature fusion module that leverages the feature coordinate correlations within a single image to enhance robust cost matching. Additionally, we introduce a lightweight cross-view aggregation module that efficiently utilizes the contextual information from volume correlations to guide regularization. Our method is evaluated on the DTU dataset and Tanks and Temples benchmark, consistently achieving competitive performance against state-of-the-art works, while requiring lower computational resources.
zh
[CV-42] Uncertainty-aware Bayesian machine learning modelling of land cover classification
【速读】:该论文试图解决土地覆盖分类中现有机器学习模型未充分考虑输入测量不确定性的问题,这对于计量学中的可追溯性至关重要。论文的关键解决方案是提出了一种基于生成模型的贝叶斯分类框架,通过具体实现贝叶斯二次判别分析(Bayesian Quadratic Discriminant Analysis),结合Copernicus Sentinel-2在2020年和2021年的土地覆盖数据进行验证。该方法不仅提高了模型的可解释性,还显式地建模了输入测量不确定性,并在不同年份和规模的数据集上保持了类概率输出预测性能,同时具备计算效率优势。
链接: https://arxiv.org/abs/2503.21510
作者: Samuel Bilson,Anna Pustogvar
机构: National Physical Laboratory (国家物理实验室), Teddington, UK; National Physical Laboratory (国家物理实验室), Teddington, UK
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 10 figures
点击查看摘要
Abstract:Land cover classification involves the production of land cover maps, which determine the type of land through remote sensing imagery. Over recent years, such classification is being performed by machine learning classification models, which can give highly accurate predictions on land cover per pixel using large quantities of input training data. However, such models do not currently take account of input measurement uncertainty, which is vital for traceability in metrology. In this work we propose a Bayesian classification framework using generative modelling to take account of input measurement uncertainty. We take the specific case of Bayesian quadratic discriminant analysis, and apply it to land cover datasets from Copernicus Sentinel-2 in 2020 and 2021. We benchmark the performance of the model against more popular classification models used in land cover maps such as random forests and neural networks. We find that such Bayesian models are more trustworthy, in the sense that they are more interpretable, explicitly model the input measurement uncertainty, and maintain predictive performance of class probability outputs across datasets of different years and sizes, whilst also being computationally efficient.
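作为参照,下面给出经典生成式二次判别分析(QDA)的极简实现:逐类拟合高斯分布并输出类别后验概率。需要强调的是,论文采用的是对参数引入先验的贝叶斯 QDA 并显式建模输入测量不确定性,此处只是展示其底层生成式分类框架的骨架;先验取类别频率、协方差正则项等均为假设。

```python
import numpy as np
from scipy.stats import multivariate_normal

class SimpleQDA:
    """经典生成式二次判别分析的极简实现(非论文的完整贝叶斯版本)。"""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.params_ = {}
        for c in self.classes_:
            Xc = X[y == c]
            self.params_[c] = (Xc.mean(axis=0),
                               np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]),
                               len(Xc) / len(X))          # 先验取类别频率
        return self

    def predict_proba(self, X):
        likes = np.column_stack([
            prior * multivariate_normal.pdf(X, mean=mu, cov=cov)
            for mu, cov, prior in (self.params_[c] for c in self.classes_)
        ])
        return likes / likes.sum(axis=1, keepdims=True)   # 归一化得到类别后验概率

# 用法示例:对二维光谱特征做两类土地覆盖分类
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
print(SimpleQDA().fit(X, y).predict_proba(X[:3]))
```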
zh
[CV-43] Shape Modeling of Longitudinal Medical Images: From Diffeomorphic Metric Mapping to Deep Learning
【速读】:该论文旨在解决生物组织在纵向时间尺度上形状变化的建模问题,特别是针对自然及病理状态下解剖结构形状的复杂非线性变化。论文强调,由于这类变化的固有非线性特性,对其进行精确建模并非易事。为应对这一挑战,论文综述了多种现有方法与工具,包括基于微分同胚度量映射的技术以及深度学习驱动的方法(如自动编码器、生成网络、循环神经网络等)。这些方法的关键在于通过结合空间与时间信息实现对形状变化的有效表征,并揭示现有研究领域的不足之处,同时探讨未来研究的潜在方向。
链接: https://arxiv.org/abs/2503.21489
作者: Edwin Tay,Nazli Tümer,Amir A. Zadpoor
机构: Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Living biological tissue is a complex system, constantly growing and changing in response to external and internal stimuli. These processes lead to remarkable and intricate changes in shape. Modeling and understanding both natural and pathological (or abnormal) changes in the shape of anatomical structures is highly relevant, with applications in diagnostic, prognostic, and therapeutic healthcare. Nevertheless, modeling the longitudinal shape change of biological tissue is a non-trivial task due to its inherent nonlinear nature. In this review, we highlight several existing methodologies and tools for modeling longitudinal shape change (i.e., spatiotemporal shape modeling). These methods range from diffeomorphic metric mapping to deep-learning based approaches (e.g., autoencoders, generative networks, recurrent neural networks, etc.). We discuss the synergistic combinations of existing technologies and potential directions for future research, underscoring key deficiencies in the current research landscape.
zh
[CV-44] Invert2Restore: Zero-Shot Degradation-Blind Image Restoration
【速读】:该论文旨在解决图像恢复领域中两个主要挑战:一是精确表征图像先验(image prior),二是精准建模图像退化算子(degradation operator)。当前基于预训练扩散模型的零样本图像先验方法在图像先验方面表现优异,但在处理退化算子时仍存在局限性,尤其是当退化模型依赖特定参数假设时,在实际场景中的适用性受限。为此,论文提出了一种名为Invert2Restore的零样本、无需训练的方法,适用于完全盲解退化(fully blind)和部分盲解退化(partially blind)场景,无需或只需部分了解退化模型的参数形式即可实现图像恢复。尽管如此,该方法仍能够以高保真度恢复图像,并在多种类型的图像退化场景中表现出良好的泛化能力。
该解决方案的关键在于利用预训练扩散模型作为正常样本与未失真图像样本之间的确定性映射。论文的核心洞见是,通过扩散模型映射到退化图像的输入噪声位于标准正态分布的低概率密度区域。因此,可以通过精心引导输入噪声向更高概率密度区域移动来恢复退化图像。实验验证表明,Invert2Restore在退化算子未知或部分已知的场景下达到了最先进的性能。
链接: https://arxiv.org/abs/2503.21486
作者: Hamadi Chihaoui,Paolo Favaro
机构: Computer Vision Group, University of Bern (伯尔尼大学计算机视觉小组); University of Bern (伯尔尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Two of the main challenges of image restoration in real-world scenarios are the accurate characterization of an image prior and the precise modeling of the image degradation operator. Pre-trained diffusion models have been very successfully used as image priors in zero-shot image restoration methods. However, how to best handle the degradation operator is still an open problem. In real-world data, methods that rely on specific parametric assumptions about the degradation model often face limitations in their applicability. To address this, we introduce Invert2Restore, a zero-shot, training-free method that operates in both fully blind and partially blind settings – requiring no prior knowledge of the degradation model or only partial knowledge of its parametric form without known parameters. Despite this, Invert2Restore achieves high-fidelity results and generalizes well across various types of image degradation. It leverages a pre-trained diffusion model as a deterministic mapping between normal samples and undistorted image samples. The key insight is that the input noise mapped by a diffusion model to a degraded image lies in a low-probability density region of the standard normal distribution. Thus, we can restore the degraded image by carefully guiding its input noise toward a higher-density region. We experimentally validate Invert2Restore across several image restoration tasks, demonstrating that it achieves state-of-the-art performance in scenarios where the degradation operator is either unknown or partially known.
zh
[CV-45] BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding CVPR2025
【速读】:该论文旨在解决大型视频语言模型(VLMs)在长视频分析中的有效性受限问题,特别是由于有限的上下文窗口导致的资源浪费。传统均匀帧采样方法在真实场景中往往分配资源给无关内容,影响性能。为应对这一挑战,论文提出BOLT方法,通过全面研究帧选择策略来提升大型VLMs的效果而无需额外训练。方案的关键在于引入逆变换采样(inverse transform sampling)策略,显著提高了模型在Video-MME和MLVU基准测试上的准确性,分别从53.8%提升至56.1%,以及从58.9%提升至63.4%。此外,论文还提出了一个多源检索评估设置,以更现实地评估VLMs在长视频理解中的表现。
链接: https://arxiv.org/abs/2503.21483
作者: Shuming Liu,Chen Zhao,Tianqi Xu,Bernard Ghanem
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025
点击查看摘要
Abstract:Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, often inevitably allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies. First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and MLVU benchmark from 58.9% to 63.4%. Our code is available at this https URL.
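论文发现逆变换采样(inverse transform sampling)是最有效的帧选择策略。下面给出该策略的一个通用示意实现(温度系数、分层方式与去重处理均为假设,具体打分与采样细节请以论文为准):先把查询-帧相似度转成采样分布,再用累积分布函数的逆映射选帧。

```python
import numpy as np

def inverse_transform_frame_selection(scores, num_frames, temperature=0.1, seed=0):
    """根据查询-帧相似度分数,用逆变换采样选出约 num_frames 个帧索引。"""
    p = np.exp(np.asarray(scores, dtype=np.float64) / temperature)
    p /= p.sum()                                   # 相似度 -> 采样分布
    cdf = np.cumsum(p)
    # 分层抽取 num_frames 个均匀随机数 u ~ U(0,1),再经 CDF 的逆映射得到帧索引
    u = (np.arange(num_frames) + np.random.default_rng(seed).random(num_frames)) / num_frames
    idx = np.searchsorted(cdf, u)
    return np.unique(np.clip(idx, 0, len(p) - 1))  # 去重后数量可能略少于 num_frames

# 用法示例:从 128 帧视频中按与文本查询的相似度挑选约 16 帧
print(inverse_transform_frame_selection(np.random.rand(128), 16))
```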
zh
[CV-46] Fine-Grained Behavior and Lane Constraints Guided Trajectory Prediction Method
【速读】:该论文试图解决现有轨迹预测算法在描述目标车辆未来行为及车道约束时,无法提供细粒度且连续刻画的问题,这限制了预测准确性。为应对这一挑战,论文提出BLNet,其关键在于通过并行注意力机制协同整合行为意图识别与车道约束建模,构建双流架构。该框架生成捕捉时空运动模式的行为状态查询以及编码车道拓扑约束的车道查询,并由两个辅助损失分别监督。此外,两阶段解码器先生成轨迹提案,再结合已通行车道的连续性和未来运动特征进行点级细化,从而实现更精确的轨迹预测。
链接: https://arxiv.org/abs/2503.21477
作者: Wenyi Xiong,Jian Chen,Ziheng Qi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE TIM for possible publication
点击查看摘要
Abstract:Trajectory prediction, as a critical component of autonomous driving systems, has attracted the attention of many researchers. Existing prediction algorithms focus on extracting more detailed scene features or selecting more reasonable trajectory destinations. However, in the face of dynamic and evolving future movements of the target vehicle, these algorithms cannot provide a fine-grained and continuous description of future behaviors and lane constraints, which degrades the prediction accuracy. To address this challenge, we present BLNet, a novel dualstream architecture that synergistically integrates behavioral intention recognition and lane constraint modeling through parallel attention mechanisms. The framework generates fine-grained behavior state queries (capturing spatial-temporal movement patterns) and lane queries (encoding lane topology constraints), supervised by two auxiliary losses, respectively. Subsequently, a two-stage decoder first produces trajectory proposals, then performs point-level refinement by jointly incorporating both the continuity of passed lanes and future motion features. Extensive experiments on two large datasets, nuScenes and Argoverse, show that our network exhibits significant performance gains over existing direct regression and goal-based algorithms.
zh
[CV-47] Retinal Fundus Multi-Disease Image Classification using Hybrid CNN-Transformer-Ensemble Architectures ALT
【速读】:该论文旨在解决全球范围内大量人群受视网膜疾病影响的问题,特别是在非城市地区缺乏专业医疗资源的背景下,致力于通过开发一种全面的诊断系统来准确预测视网膜疾病,仅基于眼底图像即可实现分类。为应对有限且多样化的数据集以及类别分布不平衡的挑战,研究引入了创新策略,包括结合更深的卷积神经网络(CNNs)、Transformer编码器以及级联与并行集成架构的混合模型,用于将眼底图像分类为20种疾病标签。关键在于采用先进的模型组合方法,如C-Tran集成模型实现了0.9166的卓越模型评分,并通过IEViT模型验证了动态补丁提取及领域知识融入计算机视觉任务的有效性,从而显著提升了诊断准确性,尤其针对广泛条件下的视网膜疾病检测。
链接: https://arxiv.org/abs/2503.21465
作者: Deependra Singh,Saksham Agarwal,Subhankar Mishra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 3 figures, 7 tables. Conference paper presented at the International Health Informatics Conference (IHIC 2023)
点击查看摘要
Abstract:Our research is motivated by the urgent global issue of a large population affected by retinal diseases, which are evenly distributed but underserved by specialized medical expertise, particularly in non-urban areas. Our primary objective is to bridge this healthcare gap by developing a comprehensive diagnostic system capable of accurately predicting retinal diseases solely from fundus images. However, we faced significant challenges due to limited, diverse datasets and imbalanced class distributions. To overcome these issues, we have devised innovative strategies. Our research introduces novel approaches, utilizing hybrid models combining deeper Convolutional Neural Networks (CNNs), Transformer encoders, and ensemble architectures sequentially and in parallel to classify retinal fundus images into 20 disease labels. Our overarching goal is to assess these advanced models’ potential in practical applications, with a strong focus on enhancing retinal disease diagnosis accuracy across a broader spectrum of conditions. Importantly, our efforts have surpassed baseline model results, with the C-Tran ensemble model emerging as the leader, achieving a remarkable model score of 0.9166, surpassing the baseline score of 0.9. Additionally, experiments with the IEViT model showcased equally promising outcomes with improved computational efficiency. We’ve also demonstrated the effectiveness of dynamic patch extraction and the integration of domain knowledge in computer vision tasks. In summary, our research strives to contribute significantly to retinal disease diagnosis, addressing the critical need for accessible healthcare solutions in underserved regions while aiming for comprehensive and accurate disease prediction.
zh
[CV-48] RoadSocial: A Diverse VideoQA Dataset and Benchmark for Road Event Understanding from Social Video Narratives CVPR2025
【速读】:该论文试图解决在交通事件理解中因数据集区域偏差、视角偏差以及专家驱动标注导致的局限性问题。为应对这一挑战,论文提出RoadSocial,一个大规模、多样化的视频问答(VideoQA)数据集,专门用于从社交媒体叙事中理解通用的道路事件。解决方案的关键在于其可扩展的半自动标注框架,该框架利用文本大语言模型(Text LLMs)和视频大语言模型(Video LLMs)生成涵盖12个挑战性问答任务的全面问答对,从而突破现有道路事件理解的边界。
链接: https://arxiv.org/abs/2503.21459
作者: Chirag Parikh,Deepti Rawat,Rakshitha R. T.,Tathagata Ghosh,Ravi Kiran Sarvadevabhatla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2025; Project Page: this https URL
点击查看摘要
Abstract:We introduce RoadSocial, a large-scale, diverse VideoQA dataset tailored for generic road event understanding from social media narratives. Unlike existing datasets limited by regional bias, viewpoint bias and expert-driven annotations, RoadSocial captures the global complexity of road events with varied geographies, camera viewpoints (CCTV, handheld, drones) and rich social discourse. Our scalable semi-automatic annotation framework leverages Text LLMs and Video LLMs to generate comprehensive question-answer pairs across 12 challenging QA tasks, pushing the boundaries of road event understanding. RoadSocial is derived from social media videos spanning 14M frames and 414K social comments, resulting in a dataset with 13.2K videos, 674 tags and 260K high-quality QA pairs. We evaluate 18 Video LLMs (open-source and proprietary, driving-specific and general-purpose) on our road event understanding benchmark. We also demonstrate RoadSocial’s utility in improving road event understanding capabilities of general-purpose Video LLMs.
zh
[CV-49] FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs CVPR2025
【速读】:该论文试图解决多模态大型语言模型(Multimodal Large Language Models, MLLMs)在人脸感知任务评估方面的不足,具体表现为现有方法难以全面评估MLLMs对细粒度人脸属性的理解能力。为了解决这一问题,论文的关键在于提出了FaceBench数据集和基于此训练的Face-LLaVA模型。FaceBench数据集包含一个分层的多视角多级别属性结构,涵盖超过210个属性和700个属性值,并提供了49,919个视觉问答(Visual Question-Answering, VQA)样本用于评估以及23,841个样本用于微调。Face-LLaVA则通过利用FaceBench的数据进行训练,实现了在少量训练数据下显著优于现有开源模型的表现,并与商业模型如GPT-4o和Gemini相当。这一解决方案的核心在于构建了一个系统化且具有挑战性的数据集,以推动MLLMs在人脸感知领域的能力提升。
链接: https://arxiv.org/abs/2503.21457
作者: Xiaoqin Wang,Xusen Ma,Xianxu Hou,Meidan Ding,Yudong Li,Junliang Chen,Wenting Chen,Xiaoyang Peng,Linlin Shen
机构: Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院计算机视觉研究所); National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University(深圳大学大数据系统计算技术国家工程实验室); Guangdong Provincial Key Laboratory of Intelligent Information Processing(广东省智能信息处理重点实验室); AIAC, Xi’an Jiaotong-Liverpool University(西安交通大学利物浦大学人工智能研究院); Tsinghua University(清华大学); The Hong Kong Polytechnic University(香港理工大学); City University of Hong Kong(香港城市大学); Sun Yat-sen University(中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR2025
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. Initially, we construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on the structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we further develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that, the existing MLLMs are far from satisfactory in understanding the fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at this https URL.
zh
[CV-50] owards Generating Realistic 3D Semantic Training Data for Autonomous Driving
【速读】:该论文试图解决3D语义场景理解中由于真实数据标注复杂性导致的数据瓶颈问题,并缩小真实数据与合成数据之间的域差距。论文的关键在于提出了一种全新的方法,能够直接生成大规模3D语义场景点云数据,无需依赖图像投影或分层多分辨率解耦模型。这种方法通过避免中间表示引入的误差,实现了比现有最先进的方法更高质量和更真实的语义场景数据生成。此外,论文还验证了利用所生成的合成标注数据训练语义分割网络的有效性,证明了该方法在扩展现有数据集、减少人工标注工作方面的潜力。
链接: https://arxiv.org/abs/2503.21449
作者: Lucas Nunes,Rodrigo Marcuzzi,Jens Behley,Cyrill Stachniss
机构: Center for Robotics, University of Bonn, Germany (波恩大学机器人中心,德国); Lamarr Institute for Machine Learning and Artificial Intelligence, Germany (Lamarr 机器学习与人工智能研究所,德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role for enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in this developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still however a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels, leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort. Our code is available at this https URL.
zh
[CV-51] RainyGS: Efficient Rain Synthesis with Physically-Based Gaussian Splatting
【速读】:该论文致力于以物理正确的方式为野外场景添加动态降雨效果。传统基于物理的模拟能够生成逼真的雨滴和飞溅效果,但需要熟练艺术家精细设置高保真场景,缺乏灵活性与可扩展性,难以应用于更广泛的开放世界环境。而NeRF和3DGS等新兴技术虽在新颖视图合成方面表现优异,但在基于物理的场景编辑任务(如降雨模拟)上面临挑战。
论文的关键解决方案在于提出RainyGS方法,它结合基于物理建模与3DGS的优势,在快速3DGS渲染框架内集成基于物理的雨滴和浅水模拟技术,实现了逼真且高效的雨滴行为、飞溅及反射模拟。这种方法支持以超过30 fps的速度合成从毛毛雨到暴雨的各种强度雨效,并展现出对真实户外场景及大规模驾驶场景的卓越性能,相比现有技术提供了更逼真且物理准确的降雨效果。
链接: https://arxiv.org/abs/2503.21442
作者: Qiyu Dai,Xingyu Ni,Qianfan Shen,Wenzheng Chen,Baoquan Chen,Mengyu Chu
机构: School of Intelligence Science and Technology, Peking University (北京大学智能科学与技术学院); School of Computer Science, Peking University (北京大学计算机科学学院); School of EECS, Peking University (北京大学信息科学技术学院); Wangxuan Institute of Computer Technology, Peking University (王选计算机技术研究所); State Key Laboratory of Multimedia Information Processing, Peking University (北京大学多媒体信息处理国家重点实验室); State Key Laboratory of General Artificial Intelligence, Peking University (北京大学通用人工智能国家重点实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We consider the problem of adding dynamic rain effects to in-the-wild scenes in a physically-correct manner. Recent advances in scene modeling have made significant progress, with NeRF and 3DGS techniques emerging as powerful tools for reconstructing complex scenes. However, while effective for novel view synthesis, these methods typically struggle with challenging scene editing tasks, such as physics-based rain simulation. In contrast, traditional physics-based simulations can generate realistic rain effects, such as raindrops and splashes, but they often rely on skilled artists to carefully set up high-fidelity scenes. This process lacks flexibility and scalability, limiting its applicability to broader, open-world environments. In this work, we introduce RainyGS, a novel approach that leverages the strengths of both physics-based modeling and 3DGS to generate photorealistic, dynamic rain effects in open-world scenes with physical accuracy. At the core of our method is the integration of physically-based raindrop and shallow water simulation techniques within the fast 3DGS rendering framework, enabling realistic and efficient simulations of raindrop behavior, splashes, and reflections. Our method supports synthesizing rain effects at over 30 fps, offering users flexible control over rain intensity – from light drizzles to heavy downpours. We demonstrate that RainyGS performs effectively for both real-world outdoor scenes and large-scale driving scenarios, delivering more photorealistic and physically-accurate rain effects compared to state-of-the-art methods. Project page can be found at this https URL
zh
[CV-52] Dual-Task Learning for Dead Tree Detection and Segmentation with Hybrid Self-Attention U-Nets in Aerial Imagery
【速读】:本文旨在解决现有方法在复杂森林环境中精确定位单株枯死树时面临的挑战,包括密集冠层结构导致的遮挡、存活植被与枯死植被之间的光谱重叠以及过分割错误。为了解决这些问题,论文提出了一种混合后处理框架,其关键是将基于深度学习的树木分割结果与分水岭算法相结合,并引入自适应滤波技术以增强边界描绘能力,同时减少假阳性检测。通过这种方式,该框架不仅提升了实例级分割精度(提高了41.5%),还显著降低了位置误差(减少了57%),从而在高分辨率航拍影像上实现了对密集植被区域中枯死树的有效识别,为生态监测提供了重要支持。此外,该方法在保持较高检测准确性的同时有效控制了过分割现象,保证了计算效率,使其适用于大范围地理区域内的全区域树木死亡状况制图等实际应用需求。
链接: https://arxiv.org/abs/2503.21438
作者: Anis Ur Rahman,Einari Heinaro,Mete Ahishali,Samuli Junttila
机构: School of Forest Sciences, University of Eastern Finland (东芬兰大学森林科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, 4 tables
点击查看摘要
Abstract:Mapping standing dead trees is critical for assessing forest health, monitoring biodiversity, and mitigating wildfire risks, for which aerial imagery has proven useful. However, dense canopy structures, spectral overlaps between living and dead vegetation, and over-segmentation errors limit the reliability of existing methods. This study introduces a hybrid postprocessing framework that refines deep learning-based tree segmentation by integrating watershed algorithms with adaptive filtering, enhancing boundary delineation, and reducing false positives in complex forest environments. Tested on high-resolution aerial imagery from boreal forests, the framework improved instance-level segmentation accuracy by 41.5% and reduced positional errors by 57%, demonstrating robust performance in densely vegetated regions. By balancing detection accuracy and over-segmentation artifacts, the method enabled the precise identification of individual dead trees, which is critical for ecological monitoring. The framework’s computational efficiency supports scalable applications, such as wall-to-wall tree mortality mapping over large geographic regions using aerial or satellite imagery. These capabilities directly benefit wildfire risk assessment (identifying fuel accumulations), carbon stock estimation (tracking emissions from decaying biomass), and precision forestry (targeting salvage loggings). By bridging advanced remote sensing techniques with practical forest management needs, this work advances tools for large-scale ecological conservation and climate resilience planning.
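下面是这类“深度分割 + 分水岭 + 自适应过滤”后处理的一个通用最小示意(基于 skimage 的标记分水岭;概率阈值、最小距离、最小面积等参数均为演示用的假设值,并非论文的具体设定):先由分割网络输出的概率图得到二值掩码,再以距离变换的局部极大值作为标记做分水岭分裂粘连树冠,最后过滤掉过小的疑似假阳性实例。

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def refine_instances(prob_map, prob_thr=0.5, min_distance=5, min_area=50):
    """用分水岭与简单的面积过滤细化分割概率图,得到实例标签图(示意实现)。"""
    mask = prob_map > prob_thr
    distance = ndi.distance_transform_edt(mask)              # 距离变换作为分水岭地形
    peaks = peak_local_max(distance, min_distance=min_distance, labels=mask)
    markers = np.zeros_like(distance, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)   # 每个局部极大值一个标记
    labels = watershed(-distance, markers, mask=mask)        # 基于标记的分水岭
    for lab in np.unique(labels):                            # 过滤过小的(疑似假阳性)实例
        if lab != 0 and (labels == lab).sum() < min_area:
            labels[labels == lab] = 0
    return labels

# 用法示例:对随机概率图运行(真实场景中 prob_map 来自树木分割网络)
labels = refine_instances(np.random.rand(128, 128))
print(labels.max(), "instances")
```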
zh
[CV-53] STAMICS: Splat Track And Map with Integrated Consistency and Semantics for Dense RGB-D SLAM
【速读】:该论文旨在解决现有同时定位与建图(SLAM)方法在动态或密集场景中因主要依赖几何线索而导致语义一致性不足的问题。为应对这一局限性,论文提出了一种名为STAMICS的新方法,其关键是将语义信息与基于三维高斯表示相结合,通过三个关键组件实现:基于三维高斯的高保真场景表示、基于图的聚类技术以强制执行时间上的语义一致性,以及支持开放词汇表的系统以实现未见物体的分类。实验结果表明,STAMICS显著提升了相机位姿估计精度和地图质量,并在降低重建误差方面优于现有最先进的方法。
链接: https://arxiv.org/abs/2503.21425
作者: Yongxu Wang,Xu Cao,Weiyun Yi,Zhaoxin Fan
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Simultaneous Localization and Mapping (SLAM) is a critical task in robotics, enabling systems to autonomously navigate and understand complex environments. Current SLAM approaches predominantly rely on geometric cues for mapping and localization, but they often fail to ensure semantic consistency, particularly in dynamic or densely populated scenes. To address this limitation, we introduce STAMICS, a novel method that integrates semantic information with 3D Gaussian representations to enhance both localization and mapping accuracy. STAMICS consists of three key components: a 3D Gaussian-based scene representation for high-fidelity reconstruction, a graph-based clustering technique that enforces temporal semantic consistency, and an open-vocabulary system that allows for the classification of unseen objects. Extensive experiments show that STAMICS significantly improves camera pose estimation and map quality, outperforming state-of-the-art methods while reducing reconstruction errors. Code will be public available.
zh
[CV-54] Diffusion Image Prior
【速读】:该论文试图解决零样本图像恢复(Zero-shot Image Restoration, IR)在复杂未知退化场景下的挑战。传统方法通常依赖于已知的参数化退化模型,但在实际应用中,退化过程可能过于复杂而难以显式定义。为了解决这一问题,论文引入了扩散图像先验(Diffusion Image Prior, DIIP)。DIIP 的关键创新在于利用预训练扩散模型的强大先验能力,而非依赖显式的退化模型。与 Deep Image Prior (DIP) 不同,DIIP 在优化过程中首先重建出干净图像版本,并且能够处理更广泛的退化类型。基于此特性,作者提出了一种通过早停(early stopping)实现盲图像恢复的方法,无需提前知道具体的退化模型。实验验证表明,DIIP 在多种无监督退化任务中取得了最先进的性能。
链接: https://arxiv.org/abs/2503.21410
作者: Hamadi Chihaoui,Paolo Favaro
机构: Computer Vision Group, University of Bern (伯尔尼大学计算机视觉小组), Switzerland
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP)[16], since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge from corrupted data. We show that, the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, waterdrop removal, denoising and super-resolution with state-of-the-art results.
zh
[CV-55] VALLR: Visual ASR Language Model for Lip Reading
【速读】:该论文致力于解决视觉自动语音识别(Visual Automatic Speech Recognition, V-ASR)中因缺乏听觉信息,以及不同音素(phoneme)共享重叠视位(viseme)、在唇部呈现几乎相同外观所带来的歧义而导致的高错误率问题。传统方法直接从视觉线索预测单词或字符,但容易受到连读效应(coarticulation effects)和视位歧义的严重影响。论文的关键创新在于提出了一种基于音素的两阶段框架:首先,通过带有CTC头的视频Transformer模型从视觉输入中预测紧凑的音素序列,实现任务复杂度的降低并获得鲁棒的说话人不变性;其次,将音素输出作为输入传递给经过微调的大语言模型(Large Language Model, LLM),利用更广泛的语言学上下文重构出连贯的单词和句子。这种方法通过显式编码中间语言结构,在数据效率方面表现出显著优势,并在LRS2和LRS3两个数据集上实现了当前最佳性能,尤其在LRS3上以比次优方法少99.4%的标注数据取得了当前最优(state-of-the-art)的词错误率(Word Error Rate, WER)18.7。
链接: https://arxiv.org/abs/2503.21408
作者: Marshall Thomas,Edward Fish,Richard Bowden
机构: University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly-often faltering on visually similar phonemes-or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method achieves significant reductions in Word Error Rate (WER) achieving a SOTA WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach.
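第一阶段的 CTC 头输出逐帧音素 logits,推理时通常先做贪心 CTC 解码(折叠相邻重复并去除 blank),再把音素序列交给微调后的 LLM 重建句子。下面是一个只演示该解码与提示词拼接步骤的最小示意,其中音素表、提示词模板均为假设,视频 Transformer 的输出用随机张量代替。

```python
import torch

PHONEMES = ["<blank>", "AH", "B", "D", "IY", "K", "L", "N", "S", "T"]   # 假设的迷你音素表

def ctc_greedy_decode(logits):
    """对 (T, V) 的逐帧音素 logits 做贪心 CTC 解码:折叠相邻重复并移除 blank。"""
    ids = logits.argmax(dim=-1).tolist()
    out, prev = [], None
    for i in ids:
        if i != prev and i != 0:        # 索引 0 约定为 CTC blank
            out.append(PHONEMES[i])
        prev = i
    return out

# 用法示例:随机 logits 模拟视频 Transformer + CTC 头的输出;
# 解码出的音素序列随后拼入提示词,交给微调后的 LLM 重建单词与句子
phonemes = ctc_greedy_decode(torch.randn(40, len(PHONEMES)))
print("Reconstruct the sentence from phonemes: " + " ".join(phonemes))
```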
zh
[CV-56] ProHOC: Probabilistic Hierarchical Out-of-Distribution Classification via Multi-Depth Networks CVPR2025
【速读】:该论文试图解决深度学习中分布外(Out-of-distribution, OOD)检测的问题,传统方法通常将其视为一个二分类任务,即样本要么被归类为已知类别,要么被标记为 OOD,而忽略了 OOD 样本与已知类别(In-distribution, ID)之间的语义关系。论文提出了一种基于给定类别层次结构的框架,旨在将 OOD 样本预测到类别层次结构中的正确内部节点,同时将已知 ID 类别预测为其对应的叶节点。解决方案的关键在于利用类别层次结构构建概率模型,并通过在多个层次深度训练的网络实现该模型。
链接: https://arxiv.org/abs/2503.21397
作者: Erik Wallin,Fredrik Kahl,Lars Hammarstrand
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: CVPR2025
点击查看摘要
Abstract:Out-of-distribution (OOD) detection in deep learning has traditionally been framed as a binary task, where samples are either classified as belonging to the known classes or marked as OOD, with little attention given to the semantic relationships between OOD samples and the in-distribution (ID) classes. We propose a framework for detecting and classifying OOD samples in a given class hierarchy. Specifically, we aim to predict OOD data to their correct internal nodes of the class hierarchy, whereas the known ID classes should be predicted as their corresponding leaf nodes. Our approach leverages the class hierarchy to create a probabilistic model and we implement this model by using networks trained for ID classification at multiple hierarchy depths. We conduct experiments on three datasets with predefined class hierarchies and show the effectiveness of our method. Our code is available at this https URL.
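为说明“把 OOD 样本预测到层次结构内部节点、ID 样本预测到叶节点”这种输出形式,下面给出一个高度简化的自顶向下示意(假设性实现:层次结构、阈值以及“内部节点概率 = 后代叶子概率之和”的聚合方式均为演示设定,论文实际采用的是由多深度网络构建的概率模型)。

```python
import numpy as np

# 假设的小型类别层次:内部节点 -> 子节点;叶子即 ID 类别
HIERARCHY = {"root": ["animal", "vehicle"], "animal": ["dog", "cat"], "vehicle": ["car", "truck"]}
LEAVES = ["dog", "cat", "car", "truck"]

def node_probability(node, leaf_probs):
    """节点概率 = 其全部后代叶子概率之和(叶子节点即自身概率)。"""
    if node in LEAVES:
        return leaf_probs[LEAVES.index(node)]
    return sum(node_probability(c, leaf_probs) for c in HIERARCHY[node])

def hierarchical_predict(leaf_probs, tau=0.7):
    """自顶向下:子节点概率超过阈值才继续下探,否则停在当前内部节点。
    停在内部节点即把样本判为该语义粒度下的 OOD;到达叶子则为普通 ID 预测。"""
    node = "root"
    while node in HIERARCHY:
        best = max(HIERARCHY[node], key=lambda c: node_probability(c, leaf_probs))
        if node_probability(best, leaf_probs) < tau:
            return node
        node = best
    return node

print(hierarchical_predict(np.array([0.40, 0.35, 0.15, 0.10])))   # -> 'animal'
```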
zh
[CV-57] Unsupervised Real-World Denoising: Sparsity is All You Need
【速读】:该论文旨在解决现实世界中去噪任务面临的挑战,即由于难以收集大规模配对的噪声图像与干净图像数据集,传统的监督训练方法难以直接应用。为应对这一难题,现有方法尝试利用未配对的清洁与噪声图像数据集,通过生成合成的清洁-噪声配对来实现监督训练。然而,这些方法常因合成与真实噪声图像之间的分布差异而效果不佳。本文的关键创新在于提出了一种基于输入稀疏化的方法,具体通过随机输入掩蔽(random input masking)来缩小这种分布差距。所提出的Mask、Inpaint和Denoise (MID) 方法不仅训练去噪器同时进行去噪,还结合修复任务(inpainting),以更好地适应真实噪声图像的特性。此外,该方法通过迭代优化噪声采样器,利用去噪器预测的伪清洁图像与真实噪声图像的差值构建更精确的噪声数据集,从而进一步提升去噪性能。实验结果表明,该方法在实际噪声图像数据集上的表现具有竞争力。
链接: https://arxiv.org/abs/2503.21377
作者: Hamadi Chihaoui,Paolo Favaro
机构: Computer Vision Group, University of Bern (伯尔尼大学计算机视觉小组); University of Bern (伯尔尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Supervised training for real-world denoising presents challenges due to the difficulty of collecting large datasets of paired noisy and clean images. Recent methods have attempted to address this by utilizing unpaired datasets of clean and noisy images. Some approaches leverage such unpaired data to train denoisers in a supervised manner by generating synthetic clean-noisy pairs. However, these methods often fall short due to the distribution gap between synthetic and real noisy images. To mitigate this issue, we propose a solution based on input sparsification, specifically using random input masking. Our method, which we refer to as Mask, Inpaint and Denoise (MID), trains a denoiser to simultaneously denoise and inpaint synthetic clean-noisy pairs. On one hand, input sparsification reduces the gap between synthetic and real noisy images. On the other hand, an inpainter trained in a supervised manner can still accurately reconstruct sparse inputs by predicting missing clean pixels using the remaining unmasked pixels. Our approach begins with a synthetic Gaussian noise sampler and iteratively refines it using a noise dataset derived from the denoiser’s predictions. The noise dataset is created by subtracting predicted pseudo-clean images from real noisy images at each iteration. The core intuition is that improving the denoiser results in a more accurate noise dataset and, consequently, a better noise sampler. We validate our method through extensive experiments on real-world noisy image datasets, demonstrating competitive performance compared to existing unsupervised denoising methods.
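下面用 PyTorch 给出 MID 思路的单步训练示意(假设性实现:掩蔽比例、L1 损失以及网络结构均为演示选择):对合成的“干净-噪声”对做随机像素掩蔽后送入网络,并对全部像素监督,使网络同时学习去噪与修补;迭代更新噪声采样器的部分以注释形式给出。

```python
import torch
import torch.nn.functional as F

def mid_training_step(denoiser, clean, noise_sampler, mask_ratio=0.5):
    """单步训练示意:随机掩蔽合成噪声图的输入像素,网络同时学习去噪与修补。"""
    noisy = clean + noise_sampler(clean.shape)                    # 用当前噪声采样器合成噪声图
    mask = (torch.rand_like(clean[:, :1]) > mask_ratio).float()   # 1=保留像素, 0=掩蔽像素
    pred = denoiser(noisy * mask)                                 # 输入被稀疏化的噪声图
    return F.l1_loss(pred, clean)                                 # 对全部像素监督:既去噪也修补被掩蔽区域

# 用法示例:用一个极小的卷积网络与高斯噪声采样器跑通一步
net = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.ReLU(),
                          torch.nn.Conv2d(16, 3, 3, padding=1))
loss = mid_training_step(net, torch.rand(2, 3, 64, 64), lambda shape: 0.1 * torch.randn(shape))
loss.backward()
print(loss.item())

# 论文中迭代细化噪声数据集的思想(伪代码级示意):
#   residual = real_noisy - denoiser(real_noisy).detach()   # 真实噪声图与伪干净图之差
#   用收集到的 residual 更新噪声采样器,再继续上述训练
```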
zh
[CV-58] Multimodal surface defect detection from wooden logs for sawing optimization
【速读】:该论文试图解决木材表面节疤(knots)检测精度低的问题,特别是在基于单一模态数据(如RGB图像或点云数据)时,由于节疤尺寸小及外部噪声干扰(如树皮和其他自然变化),导致检测效果不佳。论文的关键在于提出了一种多模态数据融合(multimodal data fusion)的方法,通过设计一个包含独立RGB和点云数据处理流,并结合晚期融合模块(late fusion module)的数据融合管道,实现了比单独使用任一模态更高的节疤检测准确性。此外,论文还进一步提出了一种简单的锯切角度优化方法,利用表面节疤检测结果与互相关分析来减少不必要的边角节疤(arris knots),展示了其相对于随机锯切角度的优势。
链接: https://arxiv.org/abs/2503.21367
作者: Bořek Reich,Matej Kunda,Fedor Zolotarev,Tuomas Eerola,Pavel Zemčík,Tomi Kauppi
机构: Lappeenranta-Lahti University of Technology (拉彭兰塔-拉赫蒂工业大学); Finnos.fi (Finnos.fi)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose a novel, good-quality, and less demanding method for detecting knots on the surface of wooden logs using multimodal data fusion. Knots are a primary factor affecting the quality of sawn timber, making their detection fundamental to any timber grading or cutting optimization system. While X-ray computed tomography provides accurate knot locations and internal structures, it is often too slow or expensive for practical use. An attractive alternative is to use fast and cost-effective log surface measurements, such as laser scanners or RGB cameras, to detect surface knots and estimate the internal structure of wood. However, due to the small size of knots and noise caused by factors, such as bark and other natural variations, detection accuracy often remains low when only one measurement modality is used. In this paper, we demonstrate that by using a data fusion pipeline consisting of separate streams for RGB and point cloud data, combined by a late fusion module, higher knot detection accuracy can be achieved compared to using either modality alone. We further propose a simple yet efficient sawing angle optimization method that utilizes surface knot detections and cross-correlation to minimize the amount of unwanted arris knots, demonstrating its benefits over randomized sawing angles.
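锯切角度优化部分可以用一个非常简化的一维循环互相关来示意(假设性实现:仅考虑节疤的圆周角度、忽略其轴向位置,棱边模板宽度与角度离散化均为演示设定):把检测到的表面节疤角度做直方图,与四条棱边的角度模板做循环互相关,选取落在棱边上的节疤最少的锯切角度。

```python
import numpy as np

def best_sawing_angle(knot_angles_deg, bins=360, arris_width=6):
    """在 0-89 度范围内寻找使"棱边落在节疤上"最少的锯切角度(示意实现)。
    knot_angles_deg: 检测到的表面节疤绕木材圆周的角度位置(度)。"""
    hist, _ = np.histogram(np.mod(knot_angles_deg, 360), bins=bins, range=(0, 360))
    template = np.zeros(bins)                        # 四条棱边,每条占 arris_width 度
    for k in range(4):
        template[k * 90: k * 90 + arris_width] = 1
    # 循环互相关:每个候选角度下,棱边区域覆盖到的节疤数量
    score = np.array([np.dot(hist, np.roll(template, a)) for a in range(90)])
    return int(score.argmin())                       # 由于 90 度对称性,只需搜索 0-89 度

print(best_sawing_angle(np.array([10, 12, 100, 185, 270, 273])))
```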
zh
[CV-59] LandMarkSystem Technical Report
【速读】:本文旨在解决传统深度学习框架在满足三维场景高质量和大规模重建需求方面的局限性问题。近年来,如Neural Radiance Fields (NeRF) 和 3D Gaussian Splatting (3DGS) 等技术推动了三维重建领域的发展,但现有框架难以高效处理复杂的大规模场景。为应对这些挑战,论文提出了一种名为LandMarkSystem的新计算框架,其关键在于通过组件化模型适配层支持多种NeRF和3DGS结构,并结合分布式并行计算与模型参数卸载优化计算效率。此外,系统引入专用算子以高效处理复杂的三维稀疏计算,实现大场景下的快速训练与推理。核心贡献包括模块化架构设计、面向资源限制的动态加载策略以及多场景验证能力,同时提供开源代码与文档以促进研究与协作。
链接: https://arxiv.org/abs/2503.21364
作者: Zhenxiang Ma,Zhenyu Yang,Miao Tao,Yuanzhen Zhou,Zeyu He,Yuchang Zhang,Rong Fu,Hengjie Li
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室), Shanghai, China; Shanghai Jiao Tong University (上海交通大学), Shanghai, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D reconstruction is vital for applications in autonomous driving, virtual reality, augmented reality, and the metaverse. Recent advancements such as Neural Radiance Fields(NeRF) and 3D Gaussian Splatting (3DGS) have transformed the field, yet traditional deep learning frameworks struggle to meet the increasing demands for scene quality and scale. This paper introduces LandMarkSystem, a novel computing framework designed to enhance multi-scale scene reconstruction and rendering. By leveraging a componentized model adaptation layer, LandMarkSystem supports various NeRF and 3DGS structures while optimizing computational efficiency through distributed parallel computing and model parameter offloading. Our system addresses the limitations of existing frameworks, providing dedicated operators for complex 3D sparse computations, thus facilitating efficient training and rapid inference over extensive scenes. Key contributions include a modular architecture, a dynamic loading strategy for limited resources, and proven capabilities across multiple representative scenarios. This comprehensive solution aims to advance the efficiency and effectiveness of 3D reconstruction. To facilitate further research and collaboration, the source code and documentation for the LandMarkSystem project are publicly available in an open-source repository, accessible at: this https URL.
zh
[CV-60] UGNA-VPR: A Novel Training Paradigm for Visual Place Recognition Based on Uncertainty-Guided NeRF Augmentation
【速读】:本文旨在解决视觉位置识别(Visual Place Recognition, VPR)在多视角场景中的性能下降问题,尤其是在特征稀疏或多方向行驶环境下,现有单视点数据集的局限性导致识别准确性降低。此外,通过采集更多数据来缓解这些问题通常成本高昂。为应对这些挑战,论文提出了一种新的训练范式,其关键是通过不确定性估计与基于NeRF的数据增强技术,在现有数据集中提升多视角多样性。具体而言,首先利用现有VPR数据集训练NeRF模型,然后通过设计的自监督不确定性估计网络识别出高不确定性地点,并将这些地点的位姿输入NeRF生成新的合成观测,用于进一步训练VPR网络。同时,还提出了改进的数据存储方法以高效组织增强数据与原始数据。实验结果表明,该训练范式能够显著提高VPR性能,优于其他训练方法,并在自录制的室内和室外数据集上验证了方法的有效性。
链接: https://arxiv.org/abs/2503.21338
作者: Yehui Shen,Lei Zhang,Qingqiu Li,Xiongwei Zhao,Yue Wang,Huimin Lu,Xieyuanli Chen
机构: College of Intelligence Science and Technology, National University of Defense Technology (国防科技大学智能科学与技术学院), China; Faculty of Robot Science and Engineering, Northeastern University (东北大学机器人科学与工程学院), China; School of Electronic and Information Engineering, Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)电子与信息工程学院), China; State Key Laboratory of Industrial Control Technology and Institute of Cyber-Systems and Control, Zhejiang University (浙江大学工业控制技术国家重点实验室和网络系统与控制研究所), China; SIASUN Robot& Automation Co., Ltd (沈阳新松机器人自动化股份有限公司), China; School of Computer Science, Fudan University (复旦大学计算机科学学院), China
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to IEEE Robotics and Automation Letters (RA-L)
点击查看摘要
Abstract:Visual place recognition (VPR) is crucial for robots to identify previously visited locations, playing an important role in autonomous navigation in both indoor and outdoor environments. However, most existing VPR datasets are limited to single-viewpoint scenarios, leading to reduced recognition accuracy, particularly in multi-directional driving or feature-sparse scenes. Moreover, obtaining additional data to mitigate these limitations is often expensive. This paper introduces a novel training paradigm to improve the performance of existing VPR networks by enhancing multi-view diversity within current datasets through uncertainty estimation and NeRF-based data augmentation. Specifically, we initially train NeRF using the existing VPR dataset. Then, our devised self-supervised uncertainty estimation network identifies places with high uncertainty. The poses of these uncertain places are input into NeRF to generate new synthetic observations for further training of VPR networks. Additionally, we propose an improved storage method for efficient organization of augmented and original training data. We conducted extensive experiments on three datasets and tested three different VPR backbone networks. The results demonstrate that our proposed training paradigm significantly improves VPR performance by fully utilizing existing data, outperforming other training approaches. We further validated the effectiveness of our approach on self-recorded indoor and outdoor datasets, consistently demonstrating superior results. Our dataset and code have been released at this https URL.
zh
[CV-61] DuckSegmentation: A segmentation model based on the AnYue Hemp Duck Dataset
【速读】:该论文旨在解决智能农业领域中目标检测与分割模型在实际应用中的可解释性差及计算量大的问题。针对麻鸭养殖行业的需求,构建了包含1951个样本的AnYue Shelduck数据集,并通过专业标注完成了目标检测与分割任务。论文的关键解决方案在于提出了DuckProcessing模块,结合YOLOv8实现高精度的目标检测(测试集Precision达98.10%,Recall达96.53%,F1分数达0.95),并通过DuckSegmentation模型实现了96.43%的mIoU分割性能。进一步地,利用知识蒸馏技术,以DuckSegmentation为教师模型训练学生模型Deeplabv3 r50,在测试集上达到94.49%的mIoU,从而提供了一种适用于麻鸭智慧养殖的新方法。
链接: https://arxiv.org/abs/2503.21323
作者: Ling Feng,Tianyu Xie,Wei Ma,Ruijie Fu,Yingxiao Zhang,Jun Li,Bei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:The modernization of smart farming is a way to improve agricultural production efficiency, and improve the agricultural production environment. Although many large models have achieved high accuracy in the task of object recognition and segmentation, they cannot really be put into use in the farming industry due to their own poor interpretability and limitations in computational volume. In this paper, we built AnYue Shelduck Dateset, which contains a total of 1951 Shelduck datasets, and performed target detection and segmentation annotation with the help of professional annotators. Based on AnYue ShelduckDateset, this paper describes DuckProcessing, an efficient and powerful module for duck identification based on real shelduckfarms. First of all, using the YOLOv8 module designed to divide the mahjong between them, Precision reached 98.10%, Recall reached 96.53% and F1 score reached 0.95 on the test set. Again using the DuckSegmentation segmentation model, DuckSegmentation reached 96.43% mIoU. Finally, the excellent DuckSegmentation was used as the teacher model, and through knowledge distillation, Deeplabv3 r50 was used as the student model, and the final student model achieved 94.49% mIoU on the test set. The method provides a new way of thinking in practical sisal duck smart farming.
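知识蒸馏环节可以用经典的“软标签 KL + 硬标签交叉熵”损失来示意(通用 Hinton 式蒸馏的最小骨架,温度 T 与权重 alpha 为假设值,未必与论文中用 DuckSegmentation 蒸馏 Deeplabv3 r50 的具体配置一致)。

```python
import torch
import torch.nn.functional as F

def segmentation_kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """语义分割知识蒸馏损失的通用示意:逐像素 KL(软标签) + 交叉熵(硬标签)。
    logits 形状为 (N, C, H, W),labels 为 (N, H, W)。"""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)        # 温度平方补偿梯度尺度
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# 用法示例:2 个类别(背景/鸭子)的假数据
s = torch.randn(2, 2, 32, 32, requires_grad=True)        # 学生网络 logits
t = torch.randn(2, 2, 32, 32)                             # 教师网络 logits
y = torch.randint(0, 2, (2, 32, 32))                      # 真实分割标签
segmentation_kd_loss(s, t, y).backward()
```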
zh
[CV-62] HORT: Monocular Hand-held Objects Reconstruction with Transformers
【速读】:该论文旨在解决从单目图像重建手持物体三维点云这一计算机视觉领域的重大挑战。现有方法多依赖隐式三维表示,导致重建结果过于平滑且生成显式三维形状耗时较长;近期基于扩散模型直接重建点云的方法虽有所改进,但多步去噪过程仍使高分辨率重建效率低下。为此,论文提出了一种基于Transformer的模型,通过粗到细策略,首先从图像生成稀疏点云,并利用像素对齐的图像特征逐步优化为稠密表示,同时结合图像特征与三维手部几何结构共同预测物体点云及其相对于手部的姿态,以提高重建精度。该方法端到端训练,实现最优性能。实验表明,此方法在合成与真实数据集上达到了最先进的准确性,推理速度显著提升,且对野外图像具有良好的泛化能力。
链接: https://arxiv.org/abs/2503.21313
作者: Zerui Chen,Rolandos Alexandros Potamias,Shizhe Chen,Cordelia Schmid
机构: Inria(法国国家信息与自动化研究所); École normale supérieure(巴黎高等师范学校); CNRS(法国国家科学研究中心); PSL Research University(巴黎文理研究大学); Imperial College London(伦敦帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming to generate explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
zh
[CV-63] FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval
【速读】:该论文旨在解决现有组合图像检索(Composed Image Retrieval, CIR)数据集中粗粒度修改文本(CoarseMT)导致的不精确正样本以及视觉相似图像检索的歧义性问题。这些局限性降低了检索准确性,需要人工过滤结果或重复查询。为克服这些限制,论文提出的关键解决方案是开发了一种鲁棒的细粒度CIR数据标注流水线,以减少不精确的正样本并提升系统解析修改意图的能力。基于此流水线,论文优化了FashionIQ和CIRR数据集,创建了Fine-FashionIQ和Fine-CIRR两个细粒度CIR数据集,并引入了首个明确解析修改文本的CIR框架FineCIR。FineCIR通过有效捕获细粒度修改语义并与模糊视觉实体对齐,显著提高了检索精度。实验表明,FineCIR在细粒度和传统CIR基准数据集上均优于当前最先进的CIR基线。
链接: https://arxiv.org/abs/2503.21309
作者: Zixu Li,Zhiheng Fu,Yupeng Hu,Zhiwei Chen,Haokun Wen,Liqiang Nie
机构: School of Software, Shandong University (山东大学软件学院); School of Data Science, City University of Hong Kong (香港城市大学数据科学学院); School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen) (哈尔滨工业大学(深圳)计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems’ ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at this https URL.
zh
[CV-64] InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理大量视觉标记(visual tokens)时面临的计算资源需求高和推理时间长的问题。论文的关键解决方案在于提出InternVL-X模型,通过整合三种视觉标记压缩方法显著提升性能与效率。其中,核心创新包括:(1) 提出一种新颖的视觉-语言投影器PVTC,通过局部查询和全局查询实现点到区域的交叉注意力机制,更有效地转换视觉特征;(2) 设计一个分层视觉标记压缩模块LVTC,在浅层压缩标记并在深层通过上采样和残差连接恢复,大幅提高模型计算效率;(3) 引入高效的高分辨率切片方法RVTC,动态调整视觉标记数量以优化训练效率,仅轻微牺牲性能。这些方法使得InternVL-X在7个公共MLLM基准测试中达到最先进的性能,并在12项任务中平均提升了2.34%的指标。
链接: https://arxiv.org/abs/2503.21307
作者: Dongchen Lu,Yuyao Sun,Zilu Zhang,Leping Huang,Jianliang Zeng,Mao Shu,Huo Cao
机构: Baidu Inc. (百度); Xidian University (西安电子科技大学); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Most multimodal large language models (MLLMs) treat visual tokens as “a sequence of text”, integrating them with text tokens into a large language model (LLM). However, a great quantity of visual tokens significantly increases the demand for computational resources and time. In this paper, we propose InternVL-X, which outperforms the InternVL model in both performance and efficiency by incorporating three visual token compression methods. First, we propose a novel vision-language projector, PVTC. This component integrates adjacent visual embeddings to form a local query and utilizes the transformed CLS token as a global query, then performs point-to-region cross-attention through these local and global queries to more effectively convert visual features. Second, we present a layer-wise visual token compression module, LVTC, which compresses tokens in the LLM shallow layers and then expands them through upsampling and residual connections in the deeper layers. This significantly enhances the model's computational efficiency. Furthermore, we propose an efficient high-resolution slicing method, RVTC, which dynamically adjusts the number of visual tokens based on image area or length filtering. RVTC greatly enhances training efficiency with only a slight reduction in performance. By utilizing 20% or fewer visual tokens, InternVL-X achieves state-of-the-art performance on 7 public MLLM benchmarks, and improves the average metric by 2.34% across 12 tasks.
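下面是按摘要文字写的“局部查询 + 全局(CLS)查询做点到区域交叉注意力”的 PyTorch 示意(模块名 PVTCSketch、相邻 token 合并数、维度均为假设,并非 InternVL-X 的官方实现):

```python
import torch
import torch.nn as nn

class PVTCSketch(nn.Module):
    """示意性的视觉 token 压缩:局部查询 + 全局(CLS)查询做交叉注意力。
    仅为理解摘要思路的草图,模块划分与维度均为假设。"""
    def __init__(self, dim=768, num_heads=8, merge=4):
        super().__init__()
        self.merge = merge                      # 相邻 merge 个视觉 token 合并为一个局部查询
        self.local_proj = nn.Linear(dim * merge, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens, cls_token):
        # vis_tokens: (B, N, C),cls_token: (B, 1, C)
        B, N, C = vis_tokens.shape
        local_q = self.local_proj(vis_tokens.reshape(B, N // self.merge, C * self.merge))
        q = torch.cat([cls_token, local_q], dim=1)      # 全局查询 + 局部查询
        out, _ = self.attn(q, vis_tokens, vis_tokens)   # 点到区域交叉注意力
        return out                                      # 压缩后的视觉 token 序列

x = torch.randn(2, 256, 768)
cls = torch.randn(2, 1, 768)
print(PVTCSketch()(x, cls).shape)   # torch.Size([2, 65, 768]):256 个 token 被压缩到 65 个
```

可以看到,送入 LLM 的视觉 token 数量从 N 降到约 N/merge + 1,这正是摘要中“降低计算资源需求”的直观来源。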
zh
[CV-65] Multi-Scale Invertible Neural Network for Wide-Range Variable-Rate Learned Image Compression
【速读】:该论文旨在解决基于自动编码器(Autoencoder)的图像压缩方法在高比特率下的性能限制及其在比特率适应性方面的局限性。具体而言,这些问题包括信息损失导致的率失真(Rate-Distortion)性能瓶颈以及难以灵活调整比特率。为了解决这些问题,论文提出了一种基于可逆变换(Invertible Transform)的变比特率图像压缩模型。其关键在于设计了一个轻量级的多尺度可逆神经网络(Multi-Scale Invertible Neural Network),能够双向映射输入图像至多尺度潜在表示(Latent Representations)。此外,通过引入扩展增益单元的多尺度空间-通道上下文模型(Multi-Scale Spatial-Channel Context Model),从高到低层次估计潜在表示的熵,进一步提升了压缩效率。实验结果表明,该方法在多种指标上达到了当前最优性能,并在广泛的比特率范围内超越了传统的视频编码标准VVC,同时保持了与其他多模型方法的竞争优势。
链接: https://arxiv.org/abs/2503.21284
作者: Hanyue Tu,Siqi Wu,Li Li,Wengang Zhou,Houqiang Li
机构: National Engineering Laboratory for Brain-Inspired Intelligence Technology and Application, University of Science and Technology of China (中国科学技术大学脑启发智能技术与应用国家工程实验室); Department of Electrical Engineering, University of Missouri (密苏里大学电气工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE Transactions on Multimedia 2025
点击查看摘要
Abstract:Autoencoder-based structures have dominated recent learned image compression methods. However, the inherent information loss associated with autoencoders limits their rate-distortion performance at high bit rates and restricts their flexibility of rate adaptation. In this paper, we present a variable-rate image compression model based on invertible transform to overcome these limitations. Specifically, we design a lightweight multi-scale invertible neural network, which bijectively maps the input image into multi-scale latent representations. To improve the compression efficiency, a multi-scale spatial-channel context model with extended gain units is devised to estimate the entropy of the latent representation from high to low levels. Experimental results demonstrate that the proposed method achieves state-of-the-art performance compared to existing variable-rate methods, and remains competitive with recent multi-model approaches. Notably, our method is the first learned image compression solution that outperforms VVC across a very wide range of bit rates using a single model, especially at high bit rates. The source code is available at this https URL.
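“可逆变换没有信息损失”是这类方法区别于自动编码器的核心。下面用一个最小的加性耦合层(additive coupling)演示“前向与逆向互为精确反函数”的性质;它只是通用的可逆层写法,并非论文中的多尺度网络本身,通道数等均为假设:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """可逆耦合层的最小示意:前向与逆向互为精确反函数,因而没有信息损失。"""
    def __init__(self, channels=4):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv2d(half, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, half, 3, padding=1))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        y2 = x2 + self.net(x1)          # 只对一半通道做加性变换
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self.net(y1)          # 精确还原
        return torch.cat([y1, x2], dim=1)

layer = AdditiveCoupling()
x = torch.randn(1, 4, 16, 16)
print(torch.allclose(layer.inverse(layer(x)), x, atol=1e-6))   # True
```

真实模型会堆叠多个这样的可逆层并配合熵模型与增益单元,但“前向可逆、无信息损失”这一点与摘要的出发点一致。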
zh
[CV-66] Zero-Shot Visual Concept Blending Without Text Guidance
【速读】:该论文旨在解决单个参考图像难以明确区分并选择性转移特定视觉特征的问题。解决方案的关键在于提出了一种名为“Visual Concept Blending”的零样本图像生成技术,通过利用多个参考图像,结合部分解耦的对比语言-图像预训练(Contrastive Language-Image Pre-training, CLIP)嵌入空间(基于IP-Adapter),能够区分共同特征与独特特征,并灵活控制纹理、形状、运动、风格以及更抽象的概念性变换的转移,而无需额外训练或文本提示。这种方法在风格迁移、形态变化及概念转换等多种任务中展示了其有效性。
链接: https://arxiv.org/abs/2503.21277
作者: Hiroya Makino,Takahiro Yamaguchi,Hiroyuki Sakai
机构: Toyota Central R&D Labs., Inc. (丰田中央研发实验室股份有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:We propose a novel, zero-shot image generation technique called “Visual Concept Blending” that provides fine-grained control over which features from multiple reference images are transferred to a source image. If only a single reference image is available, it is difficult to isolate which specific elements should be transferred. However, using multiple reference images, the proposed approach distinguishes between common and unique features by selectively incorporating them into a generated output. By operating within a partially disentangled Contrastive Language-Image Pre-training (CLIP) embedding space (from IP-Adapter), our method enables the flexible transfer of texture, shape, motion, style, and more abstract conceptual transformations without requiring additional training or text prompts. We demonstrate its effectiveness across a diverse range of tasks, including style transfer, form metamorphosis, and conceptual transformations, showing how subtle or abstract attributes (e.g., brushstroke style, aerodynamic lines, and dynamism) can be seamlessly combined into a new image. In a user study, participants accurately recognized which features were intended to be transferred. Its simplicity, flexibility, and high-level control make Visual Concept Blending valuable for creative fields such as art, design, and content creation, where combining specific visual qualities from multiple inspirations is crucial.
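摘要的核心是在多张参考图之间区分“共同特征”与“独有特征”。下面给出一个极简近似:对归一化后的 CLIP 嵌入取均值作为共同成分、去掉该方向后的残差作为各图独有成分。这只是帮助理解思路的草图,并非论文在 IP-Adapter 解耦空间中的具体做法,维度为假设:

```python
import torch
import torch.nn.functional as F

def split_common_unique(ref_embeds: torch.Tensor):
    """给定多张参考图的 CLIP 图像嵌入 (K, D),粗略区分"共同特征"与各图"独有特征"。"""
    ref_embeds = F.normalize(ref_embeds, dim=-1)
    common = F.normalize(ref_embeds.mean(dim=0, keepdim=True), dim=-1)   # 共同成分
    unique = ref_embeds - (ref_embeds @ common.T) * common               # 去掉共同方向后的残差
    return common, unique

refs = torch.randn(3, 512)            # 假设 3 张参考图、512 维 CLIP 嵌入
common, unique = split_common_unique(refs)
print(common.shape, unique.shape)     # torch.Size([1, 512]) torch.Size([3, 512])
```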
zh
[CV-67] Delving Deep into Semantic Relation Distillation
【速读】:该论文旨在解决传统知识蒸馏方法仅关注实例级知识转移,而未能有效捕捉数据中细微语义关系的问题。为应对这一挑战,论文提出了一种名为语义关系知识蒸馏(Semantics-based Relation Knowledge Distillation, SeRKD)的新方法。SeRKD的关键在于通过语义-关系视角重新定义知识蒸馏过程,利用超像素等语义组件实现更全面且上下文感知的知识传递。具体而言,该方法巧妙结合基于超像素的语义提取与基于关系的知识蒸馏,特别适用于视觉Transformer(Vision Transformers, ViTs)领域,在该领域中视觉tokens作为基本表征单元。实验评估表明,SeRKD在基准数据集上的表现优于现有方法,显著提升了模型性能和泛化能力。
链接: https://arxiv.org/abs/2503.21269
作者: Zhaoyi Yan,Kangjun Liu,Qixiang Ye
机构: School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子电气与通信工程学院); Peng Cheng Laboratory (鹏城实验室); School of Computer Science and Technology, Harbin Institute of Technology (哈尔滨工业大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Knowledge distillation has become a cornerstone technique in deep learning, facilitating the transfer of knowledge from complex models to lightweight counterparts. Traditional distillation approaches focus on transferring knowledge at the instance level, but fail to capture nuanced semantic relationships within the data. In response, this paper introduces a novel methodology, Semantics-based Relation Knowledge Distillation (SeRKD), which reimagines knowledge distillation through a semantics-relation lens among samples. By leveraging semantic components, i.e., superpixels, SeRKD enables a more comprehensive and context-aware transfer of knowledge, which skillfully integrates superpixel-based semantic extraction with relation-based knowledge distillation for a sophisticated model compression and distillation. Particularly, the proposed method is naturally relevant in the domain of Vision Transformers (ViTs), where visual tokens serve as fundamental units of representation. Experimental evaluations on benchmark datasets demonstrate the superiority of SeRKD over existing methods, underscoring its efficacy in enhancing model performance and generalization capabilities.
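“关系蒸馏”的常见写法是让学生网络的两两相似度矩阵逼近教师网络。下面的草图用超像素(或样本)特征演示这一思路,只是帮助理解 SeRKD 的出发点,并非其完整实现,维度为假设:

```python
import torch
import torch.nn.functional as F

def relation_kd_loss(student_feats, teacher_feats):
    """关系蒸馏的最小示意:让学生的"两两相似度(关系)矩阵"逼近教师。
    student_feats / teacher_feats: (N, D),N 可理解为一个 batch 内的超像素特征数。"""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    rel_s = s @ s.T                     # 学生的关系矩阵 (N, N)
    rel_t = t @ t.T                     # 教师的关系矩阵 (N, N)
    return F.mse_loss(rel_s, rel_t)

student = torch.randn(16, 128, requires_grad=True)
teacher = torch.randn(16, 256)          # 师生维度可以不同,关系矩阵同为 N×N
loss = relation_kd_loss(student, teacher)
loss.backward()
print(float(loss))
```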
zh
[CV-68] ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate WWW CVPR2025
【速读】:该论文旨在解决攀爬运动(off-ground motion)在人体运动恢复(Human Motion Recovery, HMR)领域研究不足的问题,特别是缺乏大规模且具有挑战性的三维标注攀爬运动数据集。论文的关键在于通过收集AscendMotion这一大规模高质量标注的攀爬运动数据集,并提出ClimbingCap方法来克服现有HMR方法在捕捉攀爬动作时的局限性。ClimbingCap的核心解决方案是利用RGB和LiDAR模态分别在相机坐标系和全局坐标系中重建运动,并联合优化以实现连续的三维人体攀爬运动全局重建。
链接: https://arxiv.org/abs/2503.21268
作者: Ming Yan,Xincheng Lin,Yuhua Luo,Shuqi Fan,Yudi Dai,Qixin Zhong,Lincai Zhong,Yuexin Ma,Lan Xu,Chenglu Wen,Siqi Shen,Cheng Wang
机构: Fujian Key Laboratory of Sensing and Computing for Smart Cities, Xiamen University (厦门大学); National Institute for Data Science in Health and Medicine, Xiamen University (厦门大学); Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University (厦门大学); China National Climbing Team (中国国家攀岩队); Ningbo Sports Work Training Team (宁波体育工作训练队); ShanghaiTech University (上海科技大学); ETH AI Center, ETH Zürich (苏黎世联邦理工学院 AI 中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR2025, project page: this http URL
点击查看摘要
Abstract:Human Motion Recovery (HMR) research mainly focuses on ground-based motions such as running. The study on capturing climbing motion, an off-ground motion, is sparse. This is partly due to the limited availability of climbing motion datasets, especially large-scale and challenging 3D labeled datasets. To address the insufficiency of climbing motion datasets, we collect AscendMotion, a large-scale, well-annotated, and challenging climbing motion dataset. It consists of 412k RGB, LiDAR frames, and IMU measurements, including the challenging climbing motions of 22 skilled climbing coaches across 12 different rock walls. Capturing the climbing motions is challenging as it requires precise recovery of not only the complex pose but also the global position of climbers. Although multiple global HMR methods have been proposed, they cannot faithfully capture climbing motions. To address the limitations of HMR methods for climbing, we propose ClimbingCap, a motion recovery method that reconstructs continuous 3D human climbing motion in a global coordinate system. One key insight is to use the RGB and LiDAR modalities to separately reconstruct motions in camera coordinates and global coordinates and to optimize them jointly. We demonstrate the quality of the AscendMotion dataset and present promising results from ClimbingCap. The AscendMotion dataset and source code are released publicly at this http URL
zh
[CV-69] vGamba: Attentive State Space Bottleneck for efficient Long-range Dependencies in Visual Recognition
【速读】:该论文旨在解决视觉识别任务中高效捕捉长程依赖关系的问题,现有方法如卷积神经网络(CNNs)受限于感受野限制,而Vision Transformers(ViTs)虽能建模全局上下文但计算成本高昂。论文提出的关键解决方案是vGamba,一种将状态空间模型(SSMs)与注意力机制结合的混合视觉主干网络。其核心组件Gamba瓶颈块包含Gamba单元(Mamba的二维空间结构改编版)、多头自注意力(MHSA)机制以及门控融合模块,通过这些组件的协同作用,vGamba在保持注意力机制建模能力的同时,利用了SSMs的低计算开销,实现了精度与计算效率之间的优越权衡。
链接: https://arxiv.org/abs/2503.21262
作者: Yunusa Haruna,Adamu Lawan
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Capturing long-range dependencies efficiently is essential for visual recognition tasks, yet existing methods face limitations. Convolutional neural networks (CNNs) struggle with restricted receptive fields, while Vision Transformers (ViTs) achieve global context and long-range modeling at a high computational cost. State-space models (SSMs) offer an alternative, but their application in vision remains underexplored. This work introduces vGamba, a hybrid vision backbone that integrates SSMs with attention mechanisms to enhance efficiency and expressiveness. At its core is the Gamba bottleneck block, which includes the Gamba Cell, an adaptation of Mamba for 2D spatial structures, alongside a Multi-Head Self-Attention (MHSA) mechanism and a Gated Fusion Module for effective feature representation. The interplay of these components ensures that vGamba leverages the low computational demands of SSMs while maintaining the accuracy of attention mechanisms for modeling long-range dependencies in vision tasks. Additionally, the Fusion module enables seamless interaction between these components. Extensive experiments on classification, detection, and segmentation tasks demonstrate that vGamba achieves a superior trade-off between accuracy and computational efficiency, outperforming several existing models.
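下面用一个门控融合模块示意“SSM 分支 + MHSA 分支 + 门控加权”的结构思想。真实的 Mamba/Gamba Cell 不在此展示(这里只用一个占位的线性分支代替),维度、模块名均为假设,并非官方实现:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """门控融合模块的示意:对"SSM 分支"与"MHSA 分支"的输出做逐通道加权求和。"""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.ssm_branch = nn.Sequential(nn.Linear(dim, dim), nn.SiLU())   # 占位:替代 Gamba Cell
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, x):               # x: (B, N, C) 的 token 序列
        a = self.ssm_branch(x)
        b, _ = self.attn(x, x, x)
        g = self.gate(torch.cat([a, b], dim=-1))
        return g * a + (1 - g) * b      # 门控加权融合

x = torch.randn(2, 196, 256)
print(GatedFusion()(x).shape)           # torch.Size([2, 196, 256])
```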
zh
[CV-70] Reducing CT Metal Artifacts by Learning Latent Space Alignment with Gemstone Spectral Imaging Data
【速读】:该论文旨在解决金属植入物在CT图像中引起的金属伪影问题,这些伪影会降低图像质量,妨碍对金属附近组织的准确可视化与诊断。论文的关键解决方案是提出了一种名为Latent Gemstone Spectral Imaging (GSI) Alignment Framework的方法。这种方法通过调整普通CT图像的表示形式以匹配GSI CT序列,有效减少了金属伪影,同时避免引入噪声信息。其关键是发现即使受到伪影影响的普通CT序列仍包含足够的信息来识别详细结构,而挑战在于如何清晰地表达这些信息。通过将普通CT图像的表示与GSI数据对齐,不仅能够有效抑制金属伪影,还能清晰揭示细节结构。此外,为了促进方法的应用,论文还构建了一个基于真实患者金属植入数据的新数据集Artifacts-GSI,并建立了相应的基准。实验结果表明,该方法显著减少了金属伪影并大幅提高了CT切片的可读性。
链接: https://arxiv.org/abs/2503.21259
作者: Wencheng Han,Dongqian Guo,Xiao Chen,Pang Lyu,Yi Jin,Jianbing Shen
机构: SKL-IOTSC, CIS, University of Macau (国家重点实验室物联网技术控制科学中心, 澳门大学); Department of Orthopedics, People’s Hospital of Zhengzhou University, Henan Provincial People’s Hospital (郑州大学人民医院骨科, 河南省人民医院); Zhongshan Hospital, Fudan University (复旦大学附属中山医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Metal artifacts in CT slices have long posed challenges in medical diagnostics. These artifacts degrade image quality, resulting in suboptimal visualization and complicating the accurate interpretation of tissues adjacent to metal implants. To address these issues, we introduce the Latent Gemstone Spectral Imaging (GSI) Alignment Framework, which effectively reduces metal artifacts while avoiding the introduction of noise information. Our work is based on a key finding that even artifact-affected ordinary CT sequences contain sufficient information to discern detailed structures. The challenge lies in the inability to clearly represent this information. To address this issue, we developed an Alignment Framework that adjusts the representation of ordinary CT images to match GSI CT sequences. GSI is an advanced imaging technique using multiple energy levels to mitigate artifacts caused by metal implants. By aligning the representation to GSI data, we can effectively suppress metal artifacts while clearly revealing detailed structure, without introducing extraneous information into CT sequences. To facilitate the application, we propose a new dataset, Artifacts-GSI, captured from real patients with metal implants, and establish a new benchmark based on this dataset. Experimental results show that our method significantly reduces metal artifacts and greatly enhances the readability of CT slices. All our code and data are available at: this https URL
zh
[CV-71] Learn by Reasoning : Analogical Weight Generation for Few-Shot Class-Incremental Learning
【速读】:本文旨在解决Few-shot Class-Incremental Learning (FSCIL) 中模型在学习新类别时需要参数微调且面临新知识学习与旧知识利用分离的问题。为应对这一挑战,论文提出了一种基于类比学习机制的创新生成方法——Brain-Inspired Analogical Generator (BiAG),其核心在于无需参数微调即可从已有类别推导出新类别权重。关键在于BiAG包含三个模块:Weight Self-Attention Module (WSA) 补充新类别权重,Weight Prototype Analogical Attention Module (WPAA) 计算类比关系以生成新类别权重,以及通过Neural Collapse理论实现语义转换的Semantic Conversion Module (SCM)。实验结果表明,该方法在miniImageNet、CUB-200和CIFAR-100数据集上取得了优于当前最先进方法的最终及平均准确率。
链接: https://arxiv.org/abs/2503.21258
作者: Jizhou Han,Chenhao Ding,Yuhang He,Songlin Dong,Qiang Wang,Xinyuan Gao,Yihong Gong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.
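“不微调参数、直接从旧类权重类比推导新类权重”可以用一次注意力来直观表示。下面是对 BiAG 思想的极简草图(把新类原型作为 query,对旧类分类器权重做注意力),维度与模块均为假设,并非论文实现:

```python
import torch
import torch.nn as nn

class AnalogicalWeightGen(nn.Module):
    """类比式权重生成的示意:新类原型作为 query,对旧类分类器权重做注意力,
    直接"推导"出新类权重而不进行参数微调。"""
    def __init__(self, dim=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, novel_proto, base_weights):
        # novel_proto: (B, M, D) 新类原型;base_weights: (B, K, D) 旧类权重
        new_w, _ = self.attn(novel_proto, base_weights, base_weights)
        return new_w + novel_proto       # 残差:在原型基础上做类比修正

proto = torch.randn(1, 5, 512)           # 假设 5 个新类
base = torch.randn(1, 60, 512)           # 假设 60 个旧类的分类器权重
print(AnalogicalWeightGen()(proto, base).shape)   # torch.Size([1, 5, 512])
```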
zh
[CV-72] Vision-to-Music Generation: A Survey
【速读】:本文旨在系统性地综述视觉到音乐生成(Vision-to-Music Generation)领域的研究进展,重点关注视频到音乐以及图像到音乐任务。论文试图解决现有研究在处理复杂视觉输入与动态关系建模时面临的挑战,并填补针对此特定领域综合性调研的空白。关键在于从输入类型(通用视频、人体运动视频及静态图像)和技术架构角度全面分析技术特点与难点,同时涵盖符号化音乐与音频音乐两种输出形式。通过总结现有方法、详细审查常用数据集及评估指标,论文还探讨了当前挑战与未来研究方向,以期推动学术界和工业界在多模态生成领域的进一步创新。
链接: https://arxiv.org/abs/2503.21254
作者: Zhaokai Wang,Chenxi Bao,Le Zhuo,Jingrui Han,Yang Yue,Yihong Tang,Victor Shea-Jay Huang,Yue Liao
机构: Shanghai Jiao Tong University (上海交通大学); Music Tech Lab, DynamiX (动态音乐实验室, DynamiX); Shanghai AI Laboratory (上海人工智能实验室); Beijing Film Academy (北京电影学院); Tsinghua University (清华大学); McGill University (麦吉尔大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
点击查看摘要
Abstract:Vision-to-music Generation, including video-to-music and image-to-music tasks, is a significant branch of multimodal artificial intelligence demonstrating vast application prospects in fields such as film scoring, short video creation, and dance music synthesis. However, compared to the rapid development of modalities like text and images, research in vision-to-music is still in its preliminary stage due to its complex internal structure and the difficulty of modeling dynamic relationships with video. Existing surveys focus on general music generation without comprehensive discussion on vision-to-music. In this paper, we systematically review the research progress in the field of vision-to-music generation. We first analyze the technical characteristics and core challenges for three input types: general videos, human movement videos, and images, as well as two output types of symbolic music and audio music. We then summarize the existing methodologies on vision-to-music generation from the architecture perspective. A detailed review of common datasets and evaluation metrics is provided. Finally, we discuss current challenges and promising directions for future research. We hope our survey can inspire further innovation in vision-to-music generation and the broader field of multimodal generation in academic research and industrial applications. To follow latest works and foster further innovation in this field, we are continuously maintaining a GitHub repository at this https URL.
zh
[CV-73] Orange Quality Grading with Deep Learning
【速读】:该论文旨在解决橙子分级过程中因人工操作效率低、主观性强而导致的问题,提出了一种基于深度学习的多视角机器视觉橙子分级解决方案。方案的关键在于通过捕获单个橙子的多视角图像,并将其组合成一张拼贴图,以实现对整个橙子表皮的全面分析。随后,利用卷积神经网络(CNN)对拼贴图像进行训练,将橙子分为“优质”、“劣质”和“未定义”三类。实验表明,多视角分级方法优于单一视角方法。
链接: https://arxiv.org/abs/2503.21250
作者: Mohamed Lamine Mekhalfi,Paul Chippendale,Francisco Fraile,Marcos Rico
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Orange grading is a crucial step in the fruit industry, as it helps to sort oranges according to different criteria such as size, quality, ripeness, and health condition, ensuring safety for human consumption and better price allocation and client satisfaction. Automated grading enables faster processing, precision, and reduced human labor. In this paper, we implement a deep learning-based solution for orange grading via machine vision. Unlike typical grading systems that analyze fruits from a single view, we capture multi-view images of each orange to enable a richer representation. Afterwards, we compose the acquired images into one collage. This enables the analysis of the whole orange skin. We train a convolutional neural network (CNN) on the composed images to grade the oranges into three classes, namely good, bad, and undefined. We also evaluate the performance with two different CNNs (ResNet-18 and SqueezeNet). We show experimentally that multi-view grading is superior to single-view grading.
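“多视角拼贴 + CNN 三分类”的流程可以用几行 PyTorch 表达。下面的草图假设每只橙子有 4 个视角并拼成 2×2 拼贴图,再送入随机初始化的 ResNet-18 做三分类;具体视角数、拼贴方式与训练细节以论文为准:

```python
import torch
from torchvision.models import resnet18

def make_collage(views):
    """把同一只橙子的 4 个视角 (4, 3, H, W) 拼成 2×2 的拼贴图 (3, 2H, 2W)。"""
    top = torch.cat([views[0], views[1]], dim=-1)
    bottom = torch.cat([views[2], views[3]], dim=-1)
    return torch.cat([top, bottom], dim=-2)

# 三分类(good / bad / undefined)的 ResNet-18;权重随机初始化,仅作流程示意
model = resnet18(num_classes=3)

views = torch.rand(4, 3, 112, 112)          # 假设 4 个视角
collage = make_collage(views).unsqueeze(0)  # (1, 3, 224, 224)
logits = model(collage)
print(logits.shape)                          # torch.Size([1, 3])
```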
zh
[CV-74] DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
【速读】:本文旨在解决人体图像动画领域中两个主要挑战:(1) 当前架构的局限性,大多数模型依赖U-Net,其性能相较于MM-DiT存在不足;(2) 对文本信息的忽视,而文本信息可以增强可控性。为应对这些问题,论文提出了DynamiCtrl框架。其关键是引入共享变分自编码器(Shared VAE Encoder)来同时处理参考图像和驱动姿态视频,无需额外的姿态编码器以简化整体结构;提出Pose-adaptive Layer Norm (PadaLN),通过自适应层归一化编码稀疏的姿态特征,并将其直接添加到视觉输入中,从而在保持主干网络时空一致性的同时有效引入姿态控制;此外,在完整注意力机制中对文本和视觉特征进行对齐,利用文本实现对生成内容的精细控制以及首次实现对背景和运动的同时控制。实验结果验证了该方法在基准数据集上的优越性。
链接: https://arxiv.org/abs/2503.21246
作者: Haoyu Zhao,Zhongang Qi,Cong Wang,Qingping Zheng,Guansong Lu,Fei Chen,Hang Xu,Zuxuan Wu
机构: Fudan University (复旦大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); Sun Yat-sen University (中山大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 10 figures
点击查看摘要
Abstract:Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations, most models rely on U-Net, which underperforms compared to the MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The project page is available at this https URL.
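摘要中 PadaLN 的做法可以概括为“用姿态特征回归 scale/shift,对视觉 token 做自适应层归一化后加回输入”。下面是按这段文字写的草图,并非 DynamiCtrl 的官方实现,维度与细节均为假设:

```python
import torch
import torch.nn as nn

class PadaLNSketch(nn.Module):
    """Pose-adaptive Layer Norm 的示意:姿态特征 -> 逐通道 scale/shift -> 调制视觉 token。"""
    def __init__(self, dim=1024, pose_dim=256):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(pose_dim, 2 * dim)

    def forward(self, vis_tokens, pose_feat):
        # vis_tokens: (B, N, C);pose_feat: (B, pose_dim)
        scale, shift = self.to_scale_shift(pose_feat).chunk(2, dim=-1)
        modulated = self.norm(vis_tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return vis_tokens + modulated     # 编码后的姿态特征"直接加到视觉输入上"

x = torch.randn(2, 77, 1024)
pose = torch.randn(2, 256)
print(PadaLNSketch()(x, pose).shape)      # torch.Size([2, 77, 1024])
```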
zh
[CV-75] Clean Image May be Dangerous: Data Poisoning Attacks Against Deep Hashing
【速读】:该论文旨在解决深度哈希(Deep Hashing)在大规模图像检索中的数据投毒攻击问题(PADHASH 攻击),即通过精心设计的干净查询图像诱导恶意的目标检索结果,如非法或不希望出现的图像。论文的关键在于首先训练一个代理模型来模拟目标深度哈希模型的行为,然后提出一种严格的梯度匹配策略生成投毒图像(Poisoned Images)。实验结果验证了所提方法的有效性和通用性。
链接: https://arxiv.org/abs/2503.21236
作者: Shuai Li,Jie Zhang,Yuang Qi,Kejiang Chen,Tianwei Zhang,Weiming Zhang,Nenghai Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted by TMM
点击查看摘要
Abstract:Large-scale image retrieval using deep hashing has become increasingly popular due to the exponential growth of image data and the remarkable feature extraction capabilities of deep neural networks (DNNs). However, deep hashing methods are vulnerable to malicious attacks, including adversarial and backdoor attacks. It is worth noting that these attacks typically involve altering the query images, which is not a practical concern in real-world scenarios. In this paper, we point out that even clean query images can be dangerous, inducing malicious target retrieval results, like undesired or illegal images. To the best of our knowledge, we are the first to study data poisoning attacks against deep hashing (PADHASH). Specifically, we first train a surrogate model to simulate the behavior of the target deep hashing model. Then, a strict gradient matching strategy is proposed to generate the poisoned images. Extensive experiments on different models, datasets, hash methods, and hash code lengths demonstrate the effectiveness and generality of our attack method.
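“梯度匹配”类投毒的一般思路是:让带毒样本在代理模型上产生的梯度与某个期望的目标梯度方向一致。下面的草图用负余弦相似度表达这一思想,仅作接口演示(示例里用线性分类器和交叉熵代替哈希模型),并非论文的具体目标函数:

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, poison_x, poison_y, target_grads):
    """梯度匹配的最小示意:最大化带毒样本梯度与目标梯度的余弦相似度。"""
    loss = F.cross_entropy(model(poison_x), poison_y)
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad],
                                create_graph=True)
    sim = sum(F.cosine_similarity(g.flatten(), t.flatten(), dim=0)
              for g, t in zip(grads, target_grads))
    return -sim / len(grads)

# 用一个小线性分类器演示接口;poison_x 通常是对干净图像加可学习扰动得到的
model = torch.nn.Linear(32, 10)
poison_x = torch.randn(8, 32, requires_grad=True)
poison_y = torch.randint(0, 10, (8,))
target_grads = [torch.randn_like(p) for p in model.parameters()]
print(float(gradient_matching_loss(model, poison_x, poison_y, target_grads)))
```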
zh
[CV-76] Frequency-Aware Gaussian Splatting Decomposition
【速读】:该论文旨在解决3D Gaussian Splatting (3D-GS) 在新型视图合成中的高效性和显式表示优势的同时,缺乏频率可解释性的问题,即难以将低频结构与精细细节有效分离。为了解决这一问题,论文的关键创新在于提出了一种频率分解的3D-GS框架,通过将输入图像的Laplacian金字塔子带对应的3D高斯分布分组,并针对每种子带(即一组3D高斯分布)引入专用正则化来确保频率成分的清晰分离。此外,通过扩展颜色值到正负范围以及采用渐进训练策略以粗到细的方式优化细节,进一步增强了方法的稳定性。这种频率感知的设计不仅提高了模型的可解释性,还实现了高级别的3D编辑、样式迁移以及动态细节层次控制等实际应用中的灵活性和精确性提升。
链接: https://arxiv.org/abs/2503.21226
作者: Yishai Lavi,Leo Segre,Shai Avidan
机构: Tel Aviv University (特拉维夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:3D Gaussian Splatting (3D-GS) has revolutionized novel view synthesis with its efficient, explicit representation. However, it lacks frequency interpretability, making it difficult to separate low-frequency structures from fine details. We introduce a frequency-decomposed 3D-GS framework that groups 3D Gaussians that correspond to subbands in the Laplacian Pyramids of the input images. Our approach enforces coherence within each subband (i.e., group of 3D Gaussians) through dedicated regularization, ensuring well-separated frequency components. We extend color values to both positive and negative ranges, allowing higher-frequency layers to add or subtract residual details. To stabilize optimization, we employ a progressive training scheme that refines details in a coarse-to-fine manner. Beyond interpretability, this frequency-aware design unlocks a range of practical benefits. Explicit frequency separation enables advanced 3D editing and stylization, allowing precise manipulation of specific frequency bands. It also supports dynamic level-of-detail control for progressive rendering, streaming, foveated rendering and fast geometry interaction. Through extensive experiments, we demonstrate that our method provides improved control and flexibility for emerging applications in scene editing and interactive rendering. Our code will be made publicly available.
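拉普拉斯金字塔子带是该框架分组 3D 高斯的依据。下面演示子带本身的计算(每层 = 当前图 − 上采样(下采样(当前图)));如何把高斯分配到子带并做正则化不在此展示:

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=3):
    """构建拉普拉斯金字塔子带:高频残差若干层 + 最低频的粗尺度图。"""
    bands, cur = [], img
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode='bilinear', align_corners=False)
        bands.append(cur - up)          # 高频残差子带
        cur = down
    bands.append(cur)                    # 最低频的粗尺度图
    return bands

img = torch.rand(1, 3, 64, 64)
for i, b in enumerate(laplacian_pyramid(img)):
    print(i, tuple(b.shape))
# 依次输出 (1,3,64,64) (1,3,32,32) (1,3,16,16) (1,3,8,8)
```

注意残差子带天然取值有正有负,这也对应摘要里“把颜色值扩展到正负范围”的设计。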
zh
[CV-77] GenFusion: Closing the Loop between Reconstruction and Generation via Videos
【速读】:该论文旨在解决三维重建与生成领域之间存在的显著条件差距问题。具体而言,传统三维场景重建通常需要密集视角捕捉,而三维生成往往依赖单一或无输入视角,这种局限性极大地限制了两者的应用。研究发现,这一现象的根本原因在于三维约束与生成先验之间的不匹配。为了解决这个问题,论文提出了一种以重建驱动的视频扩散模型,通过学习将视频帧条件化于易产生伪影的RGB-D渲染结果上,从而弥合上述差距。此外,还设计了一个循环融合管道,迭代地将生成模型产生的修复帧添加到训练集中,实现逐步扩展并克服了先前重建和生成管道中存在的视点饱和问题。实验评估表明,所提方法在稀疏视角和掩码输入下的视图合成任务中表现出色。因此,解决方案的关键在于引入重建驱动的视频扩散模型以及循环融合机制来促进跨领域的协同优化。
链接: https://arxiv.org/abs/2503.21219
作者: Sibo Wu,Congrong Xu,Binbin Huang,Andreas Geiger,Anpei Chen
机构: Westlake University (西湖大学); Technical University of Munich (慕尼黑工业大学); ShanghaiTech University (上海科技大学); The University of Hong Kong (香港大学); University of Tübingen, Tübingen AI Center (图宾根大学, 图宾根人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
点击查看摘要
Abstract:Recently, 3D reconstruction and generation have demonstrated impressive novel view synthesis results, achieving high fidelity and efficiency. However, a notable conditioning gap can be observed between these two fields, e.g., scalable 3D scene reconstruction often requires densely captured views, whereas 3D generation typically relies on a single or no input view, which significantly limits their applications. We found that the source of this phenomenon lies in the misalignment between 3D constraints and generative priors. To address this problem, we propose a reconstruction-driven video diffusion model that learns to condition video frames on artifact-prone RGB-D renderings. Moreover, we propose a cyclical fusion pipeline that iteratively adds restoration frames from the generative model to the training set, enabling progressive expansion and addressing the viewpoint saturation limitations seen in previous reconstruction and generation pipelines. Our evaluation, including view synthesis from sparse view and masked input, validates the effectiveness of our approach.
zh
[CV-78] FakeReasoning : Towards Generalizable Forgery Detection and Reasoning
【速读】:本文旨在解决AI生成图像检测与解释的两大挑战:一是跨生成模型的领域差距导致通用伪造检测模型难以开发;二是传统基于显著性的伪造解释方法不适用于像素级合成的AI生成图像。为应对这些挑战,论文提出将伪造检测与推理任务(Forgery Detection and Reasoning, FDR-Task)作为核心研究对象,并利用视觉语言模型(Vision-Language Models, VLMs)通过结构化且可靠的推理来实现精准检测。解决方案的关键在于FakeReasoning框架,其包含两个核心组件:首先,伪造对齐对比学习(Forgery-Aligned Contrastive Learning)通过图像与伪造属性推理之间的跨模态及模态内对比学习增强VLMs对伪造相关语义的理解;其次,分类概率映射器(Classification Probability Mapper)通过将VLMs的输出 logits 映射到校准后的二元分类概率,弥补伪造检测与语言建模之间的优化差异。实验结果表明,该方法不仅具备鲁棒的泛化能力,还在检测与推理任务中超越了现有最先进方法。
链接: https://arxiv.org/abs/2503.21210
作者: Yueying Gao,Dongliang Chang,Bingyao Yu,Haotian Qin,Lei Chen,Kongming Liang,Zhanyu Ma
机构: PRIS, Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we propose modeling AI-generated image detection and explanation as a Forgery Detection and Reasoning task (FDR-Task), leveraging vision-language models (VLMs) to provide accurate detection through structured and reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 100K images across 10 generative models, with 10 types of forgery reasoning annotations, enabling comprehensive evaluation of FDR-Task. Additionally, we propose FakeReasoning, a forgery detection and reasoning framework with two key components. First, Forgery-Aligned Contrastive Learning enhances VLMs’ understanding of forgery-related semantics through both cross-modal and intra-modal contrastive learning between images and forgery attribute reasoning. Second, a Classification Probability Mapper bridges the optimization gap between forgery detection and language modeling by mapping the output logits of VLMs to calibrated binary classification probabilities. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks.
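摘要中的“分类概率映射器”负责把 VLM 的语言建模输出对接到检测概率。下面是一个最简的示意:从词表 logits 中取出表示 real / fake 的两个词元做 softmax。词元 id、映射方式均为假设,并非论文的校准实现:

```python
import torch

def binary_prob_from_vlm_logits(logits, real_token_id, fake_token_id):
    """从 VLM 输出 logits 中取 real / fake 两个词元,softmax 得到二分类伪造概率。"""
    pair = logits[..., [real_token_id, fake_token_id]]       # (B, 2)
    prob = torch.softmax(pair, dim=-1)
    return prob[..., 1]                                       # 伪造(fake)概率

vocab_logits = torch.randn(4, 32000)      # 假设 4 个样本、词表大小 32000
p_fake = binary_prob_from_vlm_logits(vocab_logits, real_token_id=318, fake_token_id=527)
print(p_fake.shape, bool(p_fake.min() >= 0))   # torch.Size([4]) True
```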
zh
[CV-79] An improved EfficientNetV2 for garbage classification
【速读】:该论文旨在解决废弃物分类中的数据获取成本高、泛化能力不足以及实时性能受限等挑战。解决方案的关键在于提出了一种名为Channel-Efficient Attention (CE-Attention)的模块,该模块通过缓解全局池化过程中的特征损失来增强关键特征的提取,且不引入维度扩展。此外,论文开发了一个轻量级多尺度空间特征提取模块(SAFM),结合深度可分离卷积显著降低了模型复杂度。同时,还采用了全面的数据增强策略以进一步提升模型的泛化能力。实验结果表明,所提方法在华为云废弃物分类数据集上的分类准确率达到95.4%,较基线提升了3.2%,并优于主流模型,验证了其在实际应用场景中平衡准确性和效率的有效性。
链接: https://arxiv.org/abs/2503.21208
作者: Wenxuan Qiu,Chengxin Xie,Jingui Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:This paper presents an enhanced waste classification framework based on EfficientNetV2 to address challenges in data acquisition cost, generalization, and real-time performance. We propose a Channel-Efficient Attention (CE-Attention) module that mitigates feature loss during global pooling without introducing dimensional scaling, effectively enhancing critical feature extraction. Additionally, a lightweight multi-scale spatial feature extraction module (SAFM) is developed by integrating depthwise separable convolutions, significantly reducing model complexity. Comprehensive data augmentation strategies are further employed to improve generalization. Experiments on the Huawei Cloud waste classification dataset demonstrate that our method achieves a classification accuracy of 95.4%, surpassing the baseline by 3.2% and outperforming mainstream models. The results validate the effectiveness of our approach in balancing accuracy and efficiency for practical waste classification scenarios.
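“在通道注意力中避免降维(dimensional scaling)”的一种常见做法是跨通道 1D 卷积(类似 ECA)。下面的草图按这一思路演示 CE-Attention 的设计动机,只是近似示意,并非论文实现:

```python
import torch
import torch.nn as nn

class CEAttentionSketch(nn.Module):
    """示意:全局平均池化后,用不降维的跨通道 1D 卷积直接产生逐通道权重。"""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                 # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))            # 全局平均池化 -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # 跨通道 1D 卷积,不做降维
        return x * self.sigmoid(w)[:, :, None, None]

x = torch.randn(2, 64, 32, 32)
print(CEAttentionSketch()(x).shape)       # torch.Size([2, 64, 32, 32])
```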
zh
[CV-80] WVSC: Wireless Video Semantic Communication with Multi-frame Compensation
【速读】:该论文旨在解决现有无线视频传输方案仅在像素级进行视频编码,而忽视视频内含语义信息的问题。为应对这一挑战,论文提出了一种无线视频语义通信框架(WVSC),其关键是将语义通信的思想融入无线视频传输场景中。具体而言,WVSC首先将原始视频帧编码为语义帧,并基于这些紧凑表示进行视频编码,从而实现语义级而非像素级的视频编码。此外,通过引入参考语义帧替代传统视频编码方法中的运动矢量,进一步降低了通信开销。在接收端,提出了多帧补偿(MFC)机制,利用多帧融合注意力模块生成补偿后的当前语义帧。结合参考帧传输与MFC,该方案提升了带宽效率并保持了良好的视频传输性能。实验结果表明,WVSC相较于其他基于深度学习的方法(如DVSC)在PSNR指标上提高了约1 dB,比传统方案提高了约2 dB。
链接: https://arxiv.org/abs/2503.21197
作者: Bingyan Xie,Yongpeng Wu,Yuxuan Shi,Biqian Feng,Wenjun Zhang,Jihong Park,Tony Q.S. Quek
机构: Department of Electronic Engineering, Shanghai Jiao Tong University (上海交通大学电子工程系); School of Cyber and Engineering, Shanghai Jiao Tong University (上海交通大学网络与工程学院); ISTD Pillar, Singapore University of Technology and Design (新加坡科技设计大学ISTD支柱)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Existing wireless video transmission schemes directly conduct video coding at the pixel level, while neglecting the inner semantics contained in videos. In this paper, we propose a wireless video semantic communication framework, abbreviated as WVSC, which integrates the idea of semantic communication into wireless video transmission scenarios. WVSC first encodes original video frames as semantic frames and then conducts video coding based on such compact representations, enabling video coding at the semantic level rather than the pixel level. Moreover, to further reduce the communication overhead, a reference semantic frame is introduced to substitute motion vectors of each frame in common video coding methods. At the receiver, multi-frame compensation (MFC) is proposed to produce the compensated current semantic frame with a multi-frame fusion attention module. With both the reference frame transmission and MFC, bandwidth efficiency improves while maintaining satisfactory video transmission performance. Experimental results verify the performance gain of WVSC over other DL-based methods (e.g., DVSC) by about 1 dB and over traditional schemes by about 2 dB in terms of PSNR.
zh
[CV-81] Leverag ing LLM s with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering
【速读】:该论文旨在解决现有基于视频的社会智能方法依赖于通用视频识别或情感识别技术,而忽视了人类交互中独特元素的问题。为应对这一挑战,论文提出了一种名为Looped Video Debating (LVD)的框架,其关键是将Large Language Models (LLMs)与视觉信息(如面部表情和身体动作)相结合,以增强涉及人类交互视频的问题回答任务的透明性和可靠性。实验结果表明,LVD在Social-IQ 2.0基准测试中达到了最先进的性能,并且无需微调即可实现。此外,对现有数据集的补充人工标注提供了模型准确性方面的见解,为未来AI驱动的社会智能改进提供了指导。
链接: https://arxiv.org/abs/2503.21190
作者: Erika Mori,Yue Qiu,Hirokatsu Kataoka,Yoshimitsu Aoki
机构: National Institute of Advanced Industrial Science and Technology (AIST) (产业技术综合研究所); Keio University (庆应义塾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Social intelligence, the ability to interpret emotions, intentions, and behaviors, is essential for effective communication and adaptive responses. As robots and AI systems become more prevalent in caregiving, healthcare, and education, the demand for AI that can interact naturally with humans grows. However, creating AI that seamlessly integrates multiple modalities, such as vision and speech, remains a challenge. Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques, often overlooking the unique elements inherent in human interactions. To address this, we propose the Looped Video Debating (LVD) framework, which integrates Large Language Models (LLMs) with visual information, such as facial expressions and body movements, to enhance the transparency and reliability of question-answering tasks involving human interaction videos. Our results on the Social-IQ 2.0 benchmark show that LVD achieves state-of-the-art performance without fine-tuning. Furthermore, supplementary human annotations on existing datasets provide insights into the model's accuracy, guiding future improvements in AI-driven social intelligence.
zh
[CV-82] DGSUnet: An Improved Unet Model with DINO-Guided SAM2 for Multi-Scale Feature Collaboration
【速读】:本文针对大规模预训练基础模型(如Meta的Segment Anything Model (SAM) 系列和DINOv2)在专业化领域中的性能局限性展开研究,主要聚焦于两个关键问题:一是由于大模型参数导致的高昂训练成本;二是对特定领域特征表示能力的不足。为了解决这些问题,论文提出了一种以DINOv2引导的多尺度特征协作框架(SAM2),其核心创新点包括三个方面:(1) 建立DINOv2与SAM2主干网络之间的特征协作机制,利用自监督模型提取的高维语义特征指导多尺度特征融合;(2) 设计轻量级适配模块和跨模态、跨层特征融合单元,在冻结基础模型参数的同时注入跨领域知识;(3) 构建基于U-Net的U形网络结构,通过注意力机制实现多粒度特征的自适应聚合解码。该框架在伪装目标检测和显著物体检测等下游任务中超越现有最先进方法,且无需昂贵的训练过程,为视觉图像分割的高效部署提供了技术路径,并展现出广泛的应用价值。
链接: https://arxiv.org/abs/2503.21187
作者: Yimin Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Despite the significant advancements in general image segmentation achieved by large-scale pre-trained foundation models (such as Meta's Segment Anything Model (SAM) series and DINOv2), their performance in specialized fields remains limited by two critical issues: the excessive training costs due to large model parameters, and the insufficient ability to represent specific domain characteristics. This paper proposes a multi-scale feature collaboration framework guided by DINOv2 for SAM2, with core innovations in three aspects: (1) Establishing a feature collaboration mechanism between DINOv2 and SAM2 backbones, where high-dimensional semantic features extracted by the self-supervised model guide multi-scale feature fusion; (2) Designing lightweight adapter modules and cross-modal, cross-layer feature fusion units to inject cross-domain knowledge while freezing the base model parameters; (3) Constructing a U-shaped network structure based on U-net, which utilizes attention mechanisms to achieve adaptive aggregation decoding of multi-granularity features. This framework surpasses existing state-of-the-art methods in downstream tasks such as camouflage target detection and salient object detection, without requiring costly training processes. It provides a technical pathway for efficient deployment of visual image segmentation, demonstrating significant application value in a wide range of downstream tasks and specialized fields within image segmentation. Project page: this https URL
zh
[CV-83] Model as a Game: On Numerical and Spatial Consistency for Generative Games
【速读】:本文旨在解决现有生成式模型在游戏生成中难以维持数值一致性和空间一致性的问题。尽管这些模型能够生成高质量图像并合理处理玩家输入,但在确保游戏机制正确反映得分变化等量化元素(数值一致性)以及避免场景过渡突兀以提供流畅玩家体验(空间一致性)方面表现不足。为了解决这一挑战,论文提出了一个结合逻辑网络(LogicNet)的数值模块来判定事件触发条件,并通过外部计算作为图像生成的前提条件;同时设计了一个空间模块用于维护已探索区域的地图,在生成过程中检索特定位置信息并与新观察结果建立联系以保证连续性。关键在于这两个专门设计的模块——数值模块与空间模块的集成应用,它们显著提升了所提出方法在一致性指标上的表现,且仅带来可忽略不计的时间开销。
链接: https://arxiv.org/abs/2503.21172
作者: Jingye Chen,Yuzhong Zhao,Yupan Huang,Lei Cui,Li Dong,Tengchao Lv,Qifeng Chen,Furu Wei
机构: HKUST (香港科技大学); UCAS (中国科学院大学); Microsoft Research (微软研究)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
点击查看摘要
Abstract:Recent advances in generative models have significantly impacted game generation. However, despite producing high-quality graphics and adequately receiving player input, existing models often fail to maintain fundamental game properties such as numerical and spatial consistency. Numerical consistency ensures gameplay mechanics correctly reflect score changes and other quantitative elements, while spatial consistency prevents jarring scene transitions, providing seamless player experiences. In this paper, we revisit the paradigm of generative games to explore what truly constitutes a Model as a Game (MaaG) with a well-developed mechanism. We begin with an empirical study on “Traveler”, a 2D game created by an LLM featuring minimalist rules yet challenging generative models in maintaining consistency. Based on the DiT architecture, we design two specialized modules: (1) a numerical module that integrates a LogicNet to determine event triggers, with calculations processed externally as conditions for image generation; and (2) a spatial module that maintains a map of explored areas, retrieving location-specific information during generation and linking new observations to ensure continuity. Experiments across three games demonstrate that our integrated modules significantly enhance performance on consistency metrics compared to baselines, while incurring minimal time overhead during inference.
zh
[CV-84] VADMamba: Exploring State Space Models for Fast Video Anomaly Detection ICME2025
【速读】:本文旨在解决视频异常检测(Video Anomaly Detection, VAD)任务中检测精度与推理速度难以兼顾的问题。现有方法多基于卷积神经网络(CNN)或Transformer,虽然在检测精度方面表现优异,但通常以牺牲推理速度为代价。为应对这一挑战,论文提出将状态空间模型(State Space Models)引入VAD领域,并通过Mamba模型的实例展示了其在计算效率上的提升潜力。论文的关键创新在于提出了VADMamba框架,其核心是VQ-Mamba Unet (VQ-MaU) 结构,该结构结合向量量化(Vector Quantization, VQ)层与基于Mamba的非负视觉状态空间(Non-negative Visual State Space, NVSS)模块。此外,通过两个独立的VQ-MaU网络分别预测帧和重建光流,并采用片段级融合评估策略进一步提升检测精度。实验结果验证了VADMamba在三个基准数据集上的有效性,特别是在推理速度方面显著优于现有方法。
链接: https://arxiv.org/abs/2503.21169
作者: Jiahao Lyu,Minghua Zhao,Jing Hu,Xuewen Huang,Yifei Chen,Shuangli Du
机构: School of Computer Science and Engineering, Xi’an University of Technology (西安理工大学), Xi’an, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2025
点击查看摘要
Abstract:Video anomaly detection (VAD) methods are mostly CNN-based or Transformer-based, achieving impressive results, but the focus on detection accuracy often comes at the expense of inference speed. The emergence of state space models in computer vision, exemplified by the Mamba model, demonstrates improved computational efficiency through selective scans and showcases the great potential for long-range modeling. Our study pioneers the application of Mamba to VAD, dubbed VADMamba, which is based on multi-task learning for frame prediction and optical flow reconstruction. Specifically, we propose the VQ-Mamba Unet (VQ-MaU) framework, which incorporates a Vector Quantization (VQ) layer and Mamba-based Non-negative Visual State Space (NVSS) block. Furthermore, two individual VQ-MaU networks separately predict frames and reconstruct corresponding optical flows, further boosting accuracy through a clip-level fusion evaluation strategy. Experimental results validate the efficacy of the proposed VADMamba across three benchmark datasets, demonstrating superior performance in inference speed compared to previous work. Code is available at this https URL.
zh
[CV-85] Adversarial Wear and Tear: Exploiting Natural Damage for Generating Physical-World Adversarial Examples
【速读】:该论文旨在解决物理世界中对抗样本对深度神经网络(Deep Neural Networks)在自动驾驶等安全关键应用部署带来的挑战。现有方法大多依赖于临时性的人工修改(如阴影、激光束或贴纸),这些方法针对性强且场景特定,缺乏通用性。论文提出了一种新的物理世界对抗样本类别——AdvWT,其灵感来源于“磨损与老化”这一自然现象,这是一种物理对象的固有属性。与人工设计的扰动不同,AdvWT通过模拟因环境退化而自然发生的损害来实现对抗效果,从而更贴近现实世界中的复杂情况。
解决方案的关键在于AdvWT采用两步策略:首先,利用基于生成对抗网络(GAN)的无监督图像到图像翻译网络,将户外标识牌的自然损坏特征编码为潜在的“损坏风格代码”;其次,在该风格代码中引入对抗扰动,并优化其转换过程,使生成的对抗样本在视觉上保持真实可信,同时有效误导神经网络。实验表明,AdvWT在数字域和物理域均能显著提高攻击成功率和鲁棒性,并生成更加自然的对抗样本。此外,将其集成到训练过程中还能提升模型对实际损坏标识的泛化能力。
链接: https://arxiv.org/abs/2503.21164
作者: Samra Irshad,Seungkyu Lee,Nassir Navab,Hong Joo Lee,Seong Tae Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages, 9 figures
点击查看摘要
Abstract:The presence of adversarial examples in the physical world poses significant challenges to the deployment of Deep Neural Networks in safety-critical applications such as autonomous driving. Most existing methods for crafting physical-world adversarial examples are ad-hoc, relying on temporary modifications like shadows, laser beams, or stickers that are tailored to specific scenarios. In this paper, we introduce a new class of physical-world adversarial examples, AdvWT, which draws inspiration from the naturally occurring phenomenon of ‘wear and tear’, an inherent property of physical objects. Unlike manually crafted perturbations, ‘wear and tear’ emerges organically over time due to environmental degradation, as seen in the gradual deterioration of outdoor signboards. To achieve this, AdvWT follows a two-step approach. First, a GAN-based, unsupervised image-to-image translation network is employed to model these naturally occurring damages, particularly in the context of outdoor signboards. The translation network encodes the characteristics of damaged signs into a latent ‘damage style code’. In the second step, we introduce adversarial perturbations into the style code, strategically optimizing its transformation process. This manipulation subtly alters the damage style representation, guiding the network to generate adversarial images where the appearance of damages remains perceptually realistic, while simultaneously ensuring their effectiveness in misleading neural networks. Through comprehensive experiments on two traffic sign datasets, we show that AdvWT effectively misleads DNNs in both digital and physical domains. AdvWT achieves an effective attack success rate, greater robustness, and a more natural appearance compared to existing physical-world adversarial examples. Additionally, integrating AdvWT into training enhances a model's generalizability to real-world damaged signs.
zh
[CV-86] Integrating Travel Behavior Forecasting and Generative Modeling for Predicting Future Urban Mobility and Spatial Transformations
【速读】:该论文旨在解决传统交通规划方法在预测长期城市增长和交通需求方面准确性不足的问题,这些问题可能导致基础设施拆除以适应当前的交通规划需求。论文的关键解决方案是提出一个集成框架,结合Temporal Fusion Transformer(用于从人口统计数据预测出行模式)和Generative Adversarial Network(用于通过卫星图像预测未来城市环境)。该框架通过数据驱动的方法显著提高了出行行为预测的R-square评分至0.76,并生成了结构相似性指数(Structural Similarity Index)为0.81的高保真卫星图像,从而证明了将预测分析与空间可视化相结合能够显著改进决策过程,推动更可持续和高效的城市发展。
链接: https://arxiv.org/abs/2503.21158
作者: Eugene Denteh,Andrews Danyo,Joshua Kofi Asamoah,Blessing Agyei Kyem,Twitchell Addai,Armstrong Aboah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Transportation planning plays a critical role in shaping urban development, economic mobility, and infrastructure sustainability. However, traditional planning methods often struggle to accurately predict long-term urban growth and transportation demands. This may sometimes result in infrastructure demolition to make room for current transportation planning demands. This study integrates a Temporal Fusion Transformer to predict travel patterns from demographic data with a Generative Adversarial Network to predict future urban settings through satellite imagery. The framework achieved a 0.76 R-square score in travel behavior prediction and generated high-fidelity satellite images with a Structural Similarity Index of 0.81. The results demonstrate that integrating predictive analytics and spatial visualization can significantly improve the decision-making process, fostering more sustainable and efficient urban development. This research highlights the importance of data-driven methodologies in modern transportation planning and presents a step toward optimizing infrastructure placement, capacity, and long-term viability.
zh
[CV-87] he Devil is in Low-Level Features for Cross-Domain Few-Shot Segmentation CVPR2025
【速读】:该论文致力于解决跨域小样本分割(Cross-Domain Few-Shot Segmentation, CDFSS)中一个显著但未被充分研究的现象:对于目标域,尤其是与源域距离较远的目标域,分割性能在源域训练的早期阶段达到峰值,随后随着训练的进行急剧下降。论文深入分析了这一现象的原因,指出低级特征对域偏移较为敏感,导致源域训练过程中损失函数景观变得陡峭,这是CDFSS中的关键挑战。为了解决这一问题,论文提出了一种包含两个即插即用模块的方法:其一是在源域训练期间通过一种新颖的锐度感知最小化方法平滑低级特征的损失景观;其二是在目标域测试期间通过基于低级特征的校准直接补充目标域信息。实验结果表明,该方法在四个目标数据集上的表现优于现有最先进的方法,在1-shot和5-shot场景下分别提升了3.71%和5.34%的平均MIoU。因此,解决方案的关键在于通过平滑损失景观和引入目标域信息来缓解低级特征对域偏移的敏感性。
链接: https://arxiv.org/abs/2503.21150
作者: Yuhan Liu,Yixiong Zou,Yuhua Li,Ruixuan Li
机构: School of Computer Science and Technology, Huazhong University of Science and Technology (华中科技大学计算机科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:Cross-Domain Few-Shot Segmentation (CDFSS) is proposed to transfer the pixel-level segmentation capabilities learned from large-scale source-domain datasets to downstream target-domain datasets, with only a few annotated images per class. In this paper, we focus on a well-observed but unresolved phenomenon in CDFSS: for target domains, particularly those distant from the source domain, segmentation performance peaks at the very early epochs, and declines sharply as the source-domain training proceeds. We delve into this phenomenon for an interpretation: low-level features are vulnerable to domain shifts, leading to sharper loss landscapes during the source-domain training, which is the devil of CDFSS. Based on this phenomenon and interpretation, we further propose a method that includes two plug-and-play modules: one to flatten the loss landscapes for low-level features during source-domain training as a novel sharpness-aware minimization method, and the other to directly supplement target-domain information to the model during target-domain testing by low-level-based calibration. Extensive experiments on four target datasets validate our rationale and demonstrate that our method surpasses the state-of-the-art method in CDFSS significantly, by 3.71% and 5.34% average MIoU in 1-shot and 5-shot scenarios, respectively.
zh
[CV-88] ChatAnyone: Stylized Real-time Portrait Video Generation with Hierarchical Motion Diffusion Model
【速读】:该论文旨在解决现有实时交互式视频聊天头像生成方法在同步头部动作与身体运动以及精细控制说话风格和面部表情表达方面的局限性。解决方案的关键在于提出了一种新颖的风格化实时肖像视频生成框架,包含两个阶段:第一阶段利用高效的分层运动扩散模型,结合显式和隐式运动表示,基于音频输入生成多样化且风格可控的面部表情,并实现头部与身体运动的同步;第二阶段通过注入显式的肢体控制信号以生成更详细的上半身动作(包括手势),并通过人脸细化进一步提升整体真实感与表现力。此外,该方法支持在4090 GPU上以最高512*768分辨率、30帧/秒的效率连续生成高质量的上半身肖像视频,适用于实时互动视频聊天。实验结果验证了所提方法在生成具有丰富表现力和自然上半身运动的肖像视频方面的能力。
链接: https://arxiv.org/abs/2503.21144
作者: Jinwei Qi,Chaonan Ji,Sheng Xu,Peng Zhang,Bang Zhang,Liefeng Bo
机构: Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
点击查看摘要
Abstract:Real-time interactive video-chat portraits have been increasingly recognized as the future trend, particularly due to the remarkable progress made in text and voice chat technologies. However, existing methods primarily focus on real-time generation of head movements, but struggle to produce synchronized body motions that match these head actions. Additionally, achieving fine-grained control over the speaking style and nuances of facial expressions remains a challenge. To address these limitations, we introduce a novel framework for stylized real-time portrait video generation, enabling expressive and flexible video chat that extends from talking head to upper-body interaction. Our approach consists of the following two stages. The first stage involves efficient hierarchical motion diffusion models, that take both explicit and implicit motion representations into account based on audio inputs, which can generate a diverse range of facial expressions with stylistic control and synchronization between head and body movements. The second stage aims to generate portrait video featuring upper-body movements, including hand gestures. We inject explicit hand control signals into the generator to produce more detailed hand movements, and further perform face refinement to enhance the overall realism and expressiveness of the portrait video. Additionally, our approach supports efficient and continuous generation of upper-body portrait video in maximum 512 * 768 resolution at up to 30fps on 4090 GPU, supporting interactive video-chat in real-time. Experimental results demonstrate the capability of our approach to produce portrait videos with rich expressiveness and natural upper-body movements.
zh
[CV-89] Recurrent Feature Mining and Keypoint Mixup Padding for Category-Agnostic Pose Estimation
【速读】:该论文致力于解决类别无关的姿态估计问题,即利用少量标注的支持图像来定位查询图像中的关键点,适用于任意新的类别。现有方法通常通过热图池化提取支持特征,并借助交叉注意力从支持图像和查询图像中获取交互特征,但忽略了从支持图像和查询图像中挖掘细粒度且结构感知(FGSA)的特征,而这对于像素级的关键点定位至关重要。为此,论文提出了一种新颖且简洁的框架,该框架通过变形注意力机制设计了一个FGSA挖掘模块,能够反复从支持图像和查询图像中挖掘细粒度和结构感知的特征。一方面,通过在多尺度特征图上应用可变形注意力头来挖掘细粒度特征;另一方面,通过偏移关键点的参考点到其关联的关键点来挖掘结构感知特征。此外,还提出了使用混合关键点来填充不同类别的关键点数量,以提供比现有工作使用的零填充更丰富的监督信息。实验结果表明,该方法在MP-100数据集上的性能显著优于最先进的方法(+3.2% PCK@0.05)。
链接: https://arxiv.org/abs/2503.21140
作者: Junjie Chen,Weilong Chen,Yifan Zuo,Yuming Fang
机构: Jiangxi University of Finance and Economics (江西财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Category-agnostic pose estimation aims to locate keypoints on query images according to a few annotated support images for arbitrary novel classes. Existing methods generally extract support features via heatmap pooling, and obtain interacted features from support and query via cross-attention. Hence, these works neglect to mine fine-grained and structure-aware (FGSA) features from both support and query images, which are crucial for pixel-level keypoint localization. To this end, we propose a novel yet concise framework, which recurrently mines FGSA features from both support and query images. Specifically, we design a FGSA mining module based on the deformable attention mechanism. On the one hand, we mine fine-grained features by applying a deformable attention head over multi-scale feature maps. On the other hand, we mine structure-aware features by offsetting the reference points of keypoints to their linked keypoints. By means of the above module, we recurrently mine FGSA features from support and query images, and thus obtain better support features and query estimations. In addition, we propose to use mixup keypoints to pad various classes to a unified keypoint number, which could provide richer supervision than the zero padding used in existing works. We conduct extensive experiments and in-depth studies on the large-scale MP-100 dataset, and outperform the SOTA method dramatically (+3.2% PCK@0.05). Code is available at this https URL.
zh
[CV-90] VideoMix: Aggregating How-To Videos for Task-Oriented Learning
【速读】:该论文旨在解决通过多段教程视频学习任务时,用户因视频分散且难以快速浏览而导致的时间消耗和认知负担问题。论文的关键解决方案是提出VideoMix系统,该系统利用视觉-语言模型(Vision-Language Model)管道从多段视频中提取并组织信息,生成简洁的文本摘要与相关视频剪辑,使用户能够高效地掌握任务的整体理解。此外,通过形式化研究和对比用户研究验证了VideoMix在提升任务理解效率方面的有效性,强调了围绕共同目标组织多视频的以任务为中心的方法的潜力。
链接: https://arxiv.org/abs/2503.21130
作者: Saelyne Yang,Anh Truong,Juho Kim,Dingzeyu Li
机构: School of Computing, KAIST (韩国科学技术院计算机学院, 大田市, 韩国); Adobe Research, USA (纽约州, 美国); Adobe Research, USA (华盛顿州, 美国)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: In Proceedings of the 30th International Conference on Intelligent User Interfaces (IUI '25) 2025
点击查看摘要
Abstract:Tutorial videos are a valuable resource for people looking to learn new tasks. People often learn these skills by viewing multiple tutorial videos to get an overall understanding of a task by looking at different approaches to achieve the task. However, navigating through multiple videos can be time-consuming and mentally demanding as these videos are scattered and not easy to skim. We propose VideoMix, a system that helps users gain a holistic understanding of a how-to task by aggregating information from multiple videos on the task. Insights from our formative study (N=12) reveal that learners value understanding potential outcomes, required materials, alternative methods, and important details shared by different videos. Powered by a Vision-Language Model pipeline, VideoMix extracts and organizes this information, presenting concise textual summaries alongside relevant video clips, enabling users to quickly digest and navigate the content. A comparative user study (N=12) demonstrated that VideoMix enabled participants to gain a more comprehensive understanding of tasks with greater efficiency than a baseline video interface, where videos are viewed independently. Our findings highlight the potential of a task-oriented, multi-video approach where videos are organized around a shared goal, offering an enhanced alternative to conventional video-based learning.
zh
[CV-91] Omni-AD: Learning to Reconstruct Global and Local Features for Multi-class Anomaly Detection
【速读】:该论文针对多类无监督异常检测(Multi-class Unsupervised Anomaly Detection, MUAD)中基于重构的方法容易陷入“学习捷径”问题展开研究,即当解码器未能有效捕捉正常模式时,可能会将正常样本与异常样本一视同仁地进行重构,从而导致无法准确识别异常像素。为了解决这一问题,论文提出了一种全局与局部特征学习相结合的方法,迫使网络更全面地记忆正常模式。其关键在于设计了一个名为Omni-block的双分支解码模块:全局分支通过使用可学习的查询和键值对替换自注意力机制中的传统元素,以紧凑且全面的方式捕获正常模式的整体特性;局部分支采用深度可分离卷积,利用其局部性高效学习正常模式的局部特征。通过堆叠多个Omni-block构建Omni-AD框架,逐步学习并重构不同粒度的正常模式。实验结果表明,该方法在公共异常检测基准上优于现有最先进的MUAD技术。
链接: https://arxiv.org/abs/2503.21125
作者: Jiajie Quan,Ao Tong,Yuxuan Cai,Xinwei He,Yulong Wang,Yang Zhou
机构: Huazhong Agricultural University (华中农业大学); Huazhong University of Science and Technology (华中科技大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:In multi-class unsupervised anomaly detection(MUAD), reconstruction-based methods learn to map input images to normal patterns to identify anomalous pixels. However, this strategy easily falls into the well-known “learning shortcut” issue when decoders fail to capture normal patterns and reconstruct both normal and abnormal samples naively. To address that, we propose to learn the input features in global and local manners, forcing the network to memorize the normal patterns more comprehensively. Specifically, we design a two-branch decoder block, named Omni-block. One branch corresponds to global feature learning, where we serialize two self-attention blocks but replace the query and (key, value) with learnable tokens, respectively, thus capturing global features of normal patterns concisely and thoroughly. The local branch comprises depth-separable convolutions, whose locality enables effective and efficient learning of local features for normal patterns. By stacking Omni-blocks, we build a framework, Omni-AD, to learn normal patterns of different granularity and reconstruct them progressively. Comprehensive experiments on public anomaly detection benchmarks show that our method outperforms state-of-the-art approaches in MUAD. Code is available at this https URL.
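以下是按摘要描述拼出的 Omni-block 双分支结构草图(PyTorch):可学习 token 如何接入两次注意力、两路特征如何融合等细节均为假设,仅用于帮助理解“全局可学习 token 注意力 + 局部深度可分离卷积”的设计,并非官方代码。

```python
import torch
import torch.nn as nn

class OmniBlockSketch(nn.Module):
    """依据摘要描述的"全局 + 局部"双分支解码块示意(非官方实现)。"""
    def __init__(self, dim: int = 256, num_tokens: int = 64, heads: int = 8):
        super().__init__()
        # 全局分支:两次注意力,分别用可学习 token 替换 query 和 (key, value)
        self.learnable_q = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.learnable_kv = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        # 局部分支:深度可分离卷积
        self.local = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise
            nn.Conv2d(dim, dim, 1),                         # pointwise
            nn.GELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) 特征图
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, HW, C)
        q = self.learnable_q.expand(b, -1, -1)
        mid, _ = self.attn1(q, tokens, tokens)         # 可学习 query 汇聚全局信息
        kv = self.learnable_kv.expand(b, -1, -1) + mid
        glob, _ = self.attn2(tokens, kv, kv)           # 可学习 (key, value) 回写到每个位置
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return glob + self.local(x)                    # 全局特征 + 局部特征

x = torch.randn(2, 256, 16, 16)
print(OmniBlockSketch()(x).shape)  # torch.Size([2, 256, 16, 16])
```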
zh
[CV-92] AdaMHF: Adaptive Multimodal Hierarchical Fusion for Survival Prediction ICME2025
【速读】:该论文旨在解决现有多模态生存分析方法忽视病理图像与基因组数据在模态内及模态间异质性和稀疏性等生物特性的问题,这些问题限制了其在临床实践中的适用性。为应对这些挑战,论文提出了一种名为AdaMHF(Adaptive Multimodal Hierarchical Fusion)的新框架,用于高效、全面且定制化地提取和融合特征。AdaMHF的关键创新在于通过专家扩展与残差结构激活特定领域的专家以处理异质和稀疏特征,并利用选择与聚合机制优化特征表示,同时实现多粒度跨模态交互的分层融合。此外,该框架还设计了一个生存预测基准来处理缺失模态的情景,反映了真实的临床条件。实验结果表明,AdaMHF在完整和不完整模态设置下均优于当前最先进的方法。
链接: https://arxiv.org/abs/2503.21124
作者: Shuaiyu Zhang,Xun Lin,Rongxiang Zhang,Yu Bai,Yong Xu,Tao Tan,Xunbin Zheng,Zitong Yu
机构: School of Computing and Information Technology, Great Bay University (大湾区大学计算与信息技术学院); Harbin Institute of Technology (哈尔滨工业大学); Beihang University (北京航空航天大学); Macao Polytechnic University (澳门理工大学); Dongguan Key Laboratory for Intelligence and Information Technology (东莞智能信息技术重点实验室); Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology Shenzhen (深圳视觉目标检测与识别重点实验室,哈尔滨工业大学深圳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICME 2025
点击查看摘要
Abstract:The integration of pathologic images and genomic data for survival analysis has gained increasing attention with advances in multimodal learning. However, current methods often ignore biological characteristics, such as heterogeneity and sparsity, both within and across modalities, ultimately limiting their adaptability to clinical practice. To address these challenges, we propose AdaMHF: Adaptive Multimodal Hierarchical Fusion, a framework designed for efficient, comprehensive, and tailored feature extraction and fusion. AdaMHF is specifically adapted to the uniqueness of medical data, enabling accurate predictions with minimal resource consumption, even under challenging scenarios with missing modalities. Initially, AdaMHF employs an experts expansion and residual structure to activate specialized experts for extracting heterogeneous and sparse features. Extracted tokens undergo refinement via selection and aggregation, reducing the weight of non-dominant features while preserving comprehensive information. Subsequently, the encoded features are hierarchically fused, allowing multi-grained interactions across modalities to be captured. Furthermore, we introduce a survival prediction benchmark designed to resolve scenarios with missing modalities, mirroring real-world clinical conditions. Extensive experiments on TCGA datasets demonstrate that AdaMHF surpasses current state-of-the-art (SOTA) methods, showcasing exceptional performance in both complete and incomplete modality settings.
zh
[CV-93] One Snapshot is All You Need: A Generalized Method for mmWave Signal Generation
【速读】:该论文旨在解决毫米波(mmWave)技术在广泛应用中的数据稀缺性问题,特别是缺乏多样化场景下的高质量原始信号数据集。现有数据集通常受限于预处理后的特征表示(如点云或范围角度热图)以及不一致的标注格式,这限制了模型的泛化能力和应用场景。为了解决这些问题,论文提出了一种名为mmGen的新框架,其关键在于通过构建物理信号传输模型,从生成的3D网格中合成包含人体反射和环境反射的完整场景毫米波信号。此外,mmGen还结合了材料属性、天线增益及多路径反射等因素,以提高合成信号的真实性。实验结果表明,在三种不同环境中,合成信号与实际捕获信号之间的范围角度和微多普勒特征平均相似度分别超过0.91和0.89,验证了mmGen的有效性和实用性。
链接: https://arxiv.org/abs/2503.21122
作者: Teng Huang,Han Ding,Wenxin Sun,Cui Zhao,Ge Wang,Fei Wang,Kun Zhao,Zhi Wang,Wei Xi
机构: Xi’an Jiaotong University (西安交通大学), China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE INFOCOM 2025
点击查看摘要
Abstract:Wireless sensing systems, particularly those using mmWave technology, offer distinct advantages over traditional vision-based approaches, such as enhanced privacy and effectiveness in poor lighting conditions. These systems, leveraging FMCW signals, have shown success in human-centric applications like localization, gesture recognition, and so on. However, comprehensive mmWave datasets for diverse applications are scarce, often constrained by pre-processed signatures (e.g., point clouds or RA heatmaps) and inconsistent annotation formats. To overcome these limitations, we propose mmGen, a novel and generalized framework tailored for full-scene mmWave signal generation. By constructing physical signal transmission models, mmGen synthesizes human-reflected and environment-reflected mmWave signals from the constructed 3D meshes. Additionally, we incorporate methods to account for material properties, antenna gains, and multipath reflections, enhancing the realism of the synthesized signals. We conduct extensive experiments using a prototype system with commercial mmWave devices and Kinect sensors. The results show that the average similarity of Range-Angle and micro-Doppler signatures between the synthesized and real-captured signals across three different environments exceeds 0.91 and 0.89, respectively, demonstrating the effectiveness and practical applicability of mmGen.
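下面用一个极简的 FMCW 中频信号合成示例说明“由反射点的距离叠加生成毫米波信号”的基本物理模型;雷达参数(载频、调频斜率、采样率)与幅度衰减形式均为示意性假设,与 mmGen 的实际实现无关。

```python
# FMCW 中频(拍频)信号合成示意:给定若干反射点的距离与反射强度,叠加各自的拍频分量。
import numpy as np

C = 3e8           # 光速 (m/s)
FC = 77e9         # 载频 (Hz),毫米波
SLOPE = 30e12     # 调频斜率 (Hz/s)
FS = 10e6         # ADC 采样率 (Hz)
N_SAMPLES = 256   # 每个 chirp 的采样点数

def synthesize_if_signal(ranges_m, reflectivity):
    """ranges_m, reflectivity: 反射点的距离(m)与幅度;返回单个 chirp 的复中频信号。"""
    t = np.arange(N_SAMPLES) / FS
    sig = np.zeros(N_SAMPLES, dtype=complex)
    for r, a in zip(ranges_m, reflectivity):
        tau = 2.0 * r / C                    # 往返时延
        f_beat = SLOPE * tau                 # 拍频与距离成正比
        phase = 2 * np.pi * (f_beat * t + FC * tau)
        sig += (a / max(r, 1e-3) ** 2) * np.exp(1j * phase)   # 简化的距离衰减
    return sig

# 用法示例:两个目标(人体 2.0 m、墙面 4.5 m),做 FFT 得到距离谱
sig = synthesize_if_signal([2.0, 4.5], [1.0, 0.6])
spectrum = np.abs(np.fft.fft(sig))
peak_bin = int(np.argmax(spectrum[: N_SAMPLES // 2]))
range_res = C * FS / (2 * SLOPE * N_SAMPLES)   # 每个 FFT bin 对应的距离
print(f"最强目标距离约 {peak_bin * range_res:.2f} m")
```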
zh
[CV-94] StyledStreets: Multi-style Street Simulator with Spatial and Temporal Consistency
【速读】:该论文致力于解决城市场景重建中同时建模静态基础设施与动态元素的需求,同时支持多样化的环境条件。论文提出了一种名为\textbfStyledStreets的多风格街道模拟器,通过保证空间和时间一致性实现基于指令的场景编辑。解决方案的关键在于三项创新:首先,一种混合嵌入方案分离了持久的场景几何结构与瞬态的风格属性,允许在保持结构完整性的同时进行逼真的环境编辑;其次,不确定性感知渲染减轻了扩散先验引起的监督噪声,实现了在极端风格变化下的鲁棒训练;最后,统一的参数化模型通过正则化更新防止了几何漂移,维持了来自七个车载摄像机的多视角一致性。这些方法确保了原始场景运动模式和几何关系的保留,并在定量评估中展示了在风格转换下的最先进的几何精度。
链接: https://arxiv.org/abs/2503.21104
作者: Yuyin Chen,Yida Wang,Xueyang Zhang,Kun Zhan,Peng Jia,Yifei Zhan,Xianpeng Lang
机构: Li Auto Inc.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages
点击查看摘要
Abstract:Urban scene reconstruction requires modeling both static infrastructure and dynamic elements while supporting diverse environmental conditions. We present StyledStreets, a multi-style street simulator that achieves instruction-driven scene editing with guaranteed spatial and temporal consistency. Building on a state-of-the-art Gaussian Splatting framework for street scenarios enhanced by our proposed pose optimization and multi-view training, our method enables photorealistic style transfers across seasons, weather conditions, and camera setups through three key innovations: First, a hybrid embedding scheme disentangles persistent scene geometry from transient style attributes, allowing realistic environmental edits while preserving structural integrity. Second, uncertainty-aware rendering mitigates supervision noise from diffusion priors, enabling robust training across extreme style variations. Third, a unified parametric model prevents geometric drift through regularized updates, maintaining multi-view consistency across seven vehicle-mounted cameras. Our framework preserves the original scene’s motion patterns and geometric relationships. Qualitative results demonstrate plausible transitions between diverse conditions (snow, sandstorm, night), while quantitative evaluations show state-of-the-art geometric accuracy under style transfers. The approach establishes new capabilities for urban simulation, with applications in autonomous vehicle testing and augmented reality systems requiring reliable environmental consistency. Codes will be publicly available upon publication.
zh
[CV-95] Learning Class Prototypes for Unified Sparse Supervised 3D Object Detection CVPR2025
【速读】:该论文旨在解决当前稀疏监督3D目标检测方法仅关注室外场景而忽视室内场景的问题,提出了一种针对室内和室外场景的统一稀疏监督3D目标检测方法。解决方案的关键在于通过学习类别原型(class prototypes)有效利用未标注物体。具体而言,首先提出了基于原型的目标挖掘模块,将未标注物体的挖掘转化为类别原型与未标注特征之间的匹配问题,并利用最优传输匹配结果为高置信度特征分配原型标签以实现未标注物体的挖掘;其次设计了多标签协同精化模块,通过伪标签质量控制和原型标签协作有效恢复遗漏检测。实验表明,在每个场景仅有一个标注物体的情况下,该方法在ScanNet V2、SUN RGB-D和KITTI数据集上的性能分别达到全监督检测器约78%、90%和96%,凸显了方法的可扩展性。
链接: https://arxiv.org/abs/2503.21099
作者: Yun Zhu,Le Hui,Hang Yang,Jianjun Qian,Jin Xie,Jian Yang
机构: PCA Lab, Nanjing University of Science and Technology (南理工), China; School of Electronics and Information, Northwestern Polytechnical University (西工大), Xi’an, China; State Key Laboratory for Novel Software Technology, Nanjing University (南大), China; School of Intelligence Science and Technology, Nanjing University (南大), Suzhou, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR 2025
点击查看摘要
Abstract:Both indoor and outdoor scene perceptions are essential for embodied intelligence. However, current sparse supervised 3D object detection methods focus solely on outdoor scenes without considering indoor settings. To this end, we propose a unified sparse supervised 3D object detection method for both indoor and outdoor scenes through learning class prototypes to effectively utilize unlabeled objects. Specifically, we first propose a prototype-based object mining module that converts the unlabeled object mining into a matching problem between class prototypes and unlabeled features. By using optimal transport matching results, we assign prototype labels to high-confidence features, thereby achieving the mining of unlabeled objects. We then present a multi-label cooperative refinement module to effectively recover missed detections through pseudo label quality control and prototype label cooperation. Experiments show that our method achieves state-of-the-art performance under the one object per scene sparse supervised setting across indoor and outdoor datasets. With only one labeled object per scene, our method achieves about 78%, 90%, and 96% performance compared to the fully supervised detector on ScanNet V2, SUN RGB-D, and KITTI, respectively, highlighting the scalability of our method. Code is available at this https URL.
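下面给出“类别原型与未标注特征做最优传输匹配”这一步骤的简化草图:用 Sinkhorn 迭代得到软分配,再按置信度阈值筛选伪标签。代价函数、迭代次数与阈值均为假设,并非论文官方实现。

```python
# 原型-未标注特征的最优传输匹配示意(非官方实现)。
import torch

def sinkhorn(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """cost: (N, C) 代价矩阵;返回行和约为 1/N、列和约为 1/C 的传输计划。"""
    n, c = cost.shape
    log_p = -cost / eps
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True) - torch.log(torch.tensor(float(n)))
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True) - torch.log(torch.tensor(float(c)))
    return log_p.exp()

def mine_unlabeled(features: torch.Tensor, prototypes: torch.Tensor, conf_thr: float = 0.5):
    f = torch.nn.functional.normalize(features, dim=1)
    p = torch.nn.functional.normalize(prototypes, dim=1)
    cost = 1.0 - f @ p.t()                        # 余弦距离作为传输代价
    plan = sinkhorn(cost)                         # (N, C) 软分配
    conf, labels = (plan / plan.sum(dim=1, keepdim=True)).max(dim=1)
    keep = conf > conf_thr                        # 仅保留高置信度匹配作为伪标签
    return labels[keep], keep

features = torch.randn(100, 128)
prototypes = torch.randn(10, 128)
labels, keep = mine_unlabeled(features, prototypes)
print(f"挖掘到 {keep.sum().item()} 个高置信度伪标签")
```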
zh
[CV-96] Can Video Diffusion Model Reconstruct 4D Geometry?
【速读】:本文旨在解决从单目视频重建动态3D场景(即4D几何)这一重要且具有挑战性的问题。传统基于多视图几何的方法在处理动态运动时常遇到困难,而近期基于学习的方法要么需要专门的4D表示,要么依赖复杂的优化过程。论文的关键在于提出了一种名为Sora3R的新框架,它利用大规模视频扩散模型丰富的时空先验知识,直接从随意的视频中推断出4D点云地图。该框架采用两阶段流程:首先通过适配自预训练视频VAE的点云VAE,确保了几何与视频潜在空间之间的兼容性;其次,在结合视频和点云潜在空间中微调扩散主干网络,以生成每帧连贯的4D点云地图。Sora3R以完全前馈的方式运行,无需外部模块(如深度信息、光流或分割)或迭代全局对齐。大量实验表明,Sora3R能够可靠地恢复相机姿态和详细的场景几何结构,在多种场景下实现了与当前最先进的动态4D重建方法相当的性能。
链接: https://arxiv.org/abs/2503.21082
作者: Jinjie Mai,Wenxuan Zhu,Haozhe Liu,Bing Li,Cheng Zheng,Jürgen Schmidhuber,Bernard Ghanem
机构: King Abdullah University of Science and Technology (KAUST)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Reconstructing dynamic 3D scenes (i.e., 4D geometry) from monocular video is an important yet challenging problem. Conventional multiview geometry-based approaches often struggle with dynamic motion, whereas recent learning-based methods either require specialized 4D representation or sophisticated optimization. In this paper, we present Sora3R, a novel framework that taps into the rich spatiotemporal priors of large-scale video diffusion models to directly infer 4D pointmaps from casual videos. Sora3R follows a two-stage pipeline: (1) we adapt a pointmap VAE from a pretrained video VAE, ensuring compatibility between the geometry and video latent spaces; (2) we finetune a diffusion backbone in combined video and pointmap latent space to generate coherent 4D pointmaps for every frame. Sora3R operates in a fully feedforward manner, requiring no external modules (e.g., depth, optical flow, or segmentation) or iterative global alignment. Extensive experiments demonstrate that Sora3R reliably recovers both camera poses and detailed scene geometry, achieving performance on par with state-of-the-art methods for dynamic 4D reconstruction across diverse scenarios.
zh
[CV-97] KAC: Kolmogorov-Arnold Classifier for Continual Learning CVPR2025
【速读】:该论文旨在解决连续学习(Continual Learning)中模型在不断学习新任务时容易遗忘旧任务的问题。现有方法大多依赖线性分类器(Linear Classifiers),但这些分类器难以在保持稳定分类空间的同时适应新任务。为应对这一挑战,论文提出了一种基于科洛莫戈洛夫-阿诺德网络(Kolmogorov-Arnold Network, KAN)结构的新型分类器——科洛莫戈洛夫-阿诺德分类器(Kolmogorov-Arnold Classifier, KAC)。其关键在于引入了KAN的样条函数(spline functions)特性,并结合径向基函数(Radial Basis Function, RBF)以增强与连续学习场景的兼容性,从而显著提升了模型的学习稳定性与性能表现。实验结果表明,用KAC替代现有方法中的线性分类器后,在多个连续学习基准数据集上均实现了性能提升。
链接: https://arxiv.org/abs/2503.21076
作者: Yusong Hu,Zichen Liang,Fei Yang,Qibin Hou,Xialei Liu,Ming-Ming Cheng
机构: VCIP, CS, Nankai University (南开大学); NKIARI, Shenzhen Futian (深圳福田)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: CVPR 2025
点击查看摘要
Abstract:Continual learning requires models to train continuously across consecutive tasks without forgetting. Most existing methods utilize linear classifiers, which struggle to maintain a stable classification space while learning new tasks. Inspired by the success of Kolmogorov-Arnold Networks (KAN) in preserving learning stability during simple continual regression tasks, we set out to explore their potential in more complex continual learning scenarios. In this paper, we introduce the Kolmogorov-Arnold Classifier (KAC), a novel classifier developed for continual learning based on the KAN structure. We delve into the impact of KAN’s spline functions and introduce Radial Basis Functions (RBF) for improved compatibility with continual learning. We replace linear classifiers with KAC in several recent approaches and conduct experiments across various continual learning benchmarks, all of which demonstrate performance improvements, highlighting the effectiveness and robustness of KAC in continual learning. The code is available at this https URL.
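以下是一个用 RBF 基函数近似 KAN 单变量函数的分类器草图,用来说明“以 KAC 替换线性分类器”的大致形态;基函数个数、中心位置与带宽设置均为本文整理时的假设,并非官方 KAC 实现。

```python
import torch
import torch.nn as nn

class RBFKANClassifier(nn.Module):
    """以 RBF 基函数近似 KAN 单变量函数的分类器示意(非官方 KAC 实现)。
    每个(输入维, 输出类)对应一条由 RBF 基展开的可学习一维函数。"""
    def __init__(self, in_dim: int, n_classes: int, n_centers: int = 8,
                 x_min: float = -2.0, x_max: float = 2.0):
        super().__init__()
        centers = torch.linspace(x_min, x_max, n_centers)
        self.register_buffer("centers", centers)            # RBF 中心固定在网格上
        self.gamma = (n_centers / (x_max - x_min)) ** 2      # 带宽
        # 每个 (输出类, 输入维, 基函数) 一个系数
        self.coef = nn.Parameter(torch.zeros(n_classes, in_dim, n_centers))
        nn.init.normal_(self.coef, std=0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_dim) -> RBF 展开: (B, in_dim, n_centers)
        basis = torch.exp(-self.gamma * (x.unsqueeze(-1) - self.centers) ** 2)
        # 对每个类:先在基函数维加权求和得到 phi_{c,i}(x_i),再对输入维求和
        logits = torch.einsum("bik,cik->bc", basis, self.coef)
        return logits

clf = RBFKANClassifier(in_dim=512, n_classes=100)
feats = torch.randn(4, 512)
print(clf(feats).shape)  # torch.Size([4, 100])
```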
zh
[CV-98] HSLiNets: Evaluating Band Ordering Strategies in Hyperspectral and LiDAR Fusion
【速读】:该论文旨在解决光谱序列或波段顺序对高光谱成像(HSI)与激光雷达(LiDAR)数据融合分类结果的影响这一问题。此前的研究主要关注波段选择与分组在HSI分类中的作用,而忽略了波段顺序在HSI-LiDAR融合中的影响。论文的关键在于提出了一种新的融合架构,不仅整合了HSI和LiDAR数据,还通过学习多种波段顺序配置来优化特征表示。这种方法通过自适应融合不同的光谱序列显著提升了分类精度,超越了现有最先进的融合模型。
链接: https://arxiv.org/abs/2503.21072
作者: Judy X Yang,Jing Wang,Zhuanfeng Li,Chenhong Sui,Zekun Long,Jun Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2 figures, 5 pages
点击查看摘要
Abstract:The integration of hyperspectral imaging (HSI) and Light Detection and Ranging (LiDAR) data provides complementary spectral and spatial information for remote sensing applications. While previous studies have explored the role of band selection and grouping in HSI classification, little attention has been given to how the spectral sequence or band order affects classification outcomes when fused with LiDAR. In this work, we systematically investigate the influence of band order on HSI-LiDAR fusion performance. Through extensive experiments, we demonstrate that band order significantly impacts classification accuracy, revealing a previously overlooked factor in fusion-based models. Motivated by this observation, we propose a novel fusion architecture that not only integrates HSI and LiDAR data but also learns from multiple band order configurations. The proposed method enhances feature representation by adaptively fusing different spectral sequences, leading to improved classification accuracy. Experimental results on the Houston 2013 and Trento datasets show that our approach outperforms state-of-the-art fusion models. Data and code are available at this https URL.
zh
[CV-99] Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing
【速读】:该论文旨在解决文本引导扩散模型在合成包含多个物体的复杂场景时面临的不精确空间定位(imprecise spatial grounding)和有限可扩展性(limited scalability)的问题。为了解决这些问题,论文提出了两个关键模块:1)Janus-Pro驱动的提示解析模块(Janus-Pro-driven Prompt Parsing),通过一个紧凑型1B参数架构连接文本理解与布局生成;2)MIGLoRA,这是一种高效的插件,将低秩适应(Low-Rank Adaptation, LoRA)集成到UNet(SD1.5)和DiT(SD3)骨干网络中。MIGLoRA能够在保持基础模型参数不变的同时实现即插即用的适配能力,减少架构侵入,并支持高效的微调。此外,为了全面评估方法性能,论文构建了DescripBox和DescripBox-1024基准数据集。实验结果表明,所提出的方法在COCO和LVIS基准测试中达到了最先进的性能,同时展示了卓越的布局保真度和开放世界合成的可扩展性。
链接: https://arxiv.org/abs/2503.21069
作者: Fan Qi,Yu Duan,Changsheng Xu
机构: Tianjin University of Technology (天津理工大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Recent advances in text-guided diffusion models have revolutionized conditional image generation, yet they struggle to synthesize complex scenes with multiple objects due to imprecise spatial grounding and limited scalability. We address these challenges through two key modules: 1) Janus-Pro-driven Prompt Parsing, a prompt-layout parsing module that bridges text understanding and layout generation via a compact 1B-parameter architecture, and 2) MIGLoRA, a parameter-efficient plug-in integrating Low-Rank Adaptation (LoRA) into UNet (SD1.5) and DiT (SD3) backbones. MIGLoRA is capable of preserving the base model’s parameters and ensuring plug-and-play adaptability, minimizing architectural intrusion while enabling efficient fine-tuning. To support a comprehensive evaluation, we create DescripBox and DescripBox-1024, benchmarks that span diverse scenes and resolutions. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency, demonstrating superior layout fidelity and scalability for open-world synthesis.
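下面给出一个标准 LoRA 线性层包装器的最小示例,用来说明 MIGLoRA“冻结基础模型、仅训练低秩旁路”的即插即用思路;它只是通用的 LoRA 写法,不代表 MIGLoRA 在 UNet/DiT 中的具体接入位置或超参数。

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """把低秩适配(LoRA)以即插即用方式挂到已冻结线性层上的示意(通用写法,非官方实现)。"""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # 冻结基础模型参数
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)     # 初始时 LoRA 分支输出为 0,不改变原模型行为
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# 用法示例:把注意力中的一个投影层替换为带 LoRA 的版本
proj = nn.Linear(768, 768)
proj_with_lora = LoRALinear(proj, rank=8)
x = torch.randn(2, 77, 768)
print(proj_with_lora(x).shape)  # torch.Size([2, 77, 768])
```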
zh
[CV-100] Neural Architecture Search by Learning a Hierarchical Search Space
【速读】:该论文旨在解决神经架构搜索(Neural Architecture Search, NAS)中的效率问题,特别是在图像分类任务中。NAS 的目标是自动寻找最优的深度学习模型架构,而传统方法通常面临计算资源消耗巨大的挑战。论文指出,蒙特卡洛树搜索(Monte-Carlo Tree Search, MCTS)作为一种强大的搜索工具,在 NAS 中的应用受限于树节点的遍历顺序,即分支选择策略。如果初始分支无法有效区分有潜力与误导性的架构配置,搜索效率将显著降低。
解决方案的关键在于优化 MCTS 的分支选择策略。论文提出通过基于架构相似性进行分层聚类(hierarchical clustering)来学习分支顺序。具体而言,架构之间的相似性通过其输出向量的成对距离(pairwise distance)衡量。实验结果表明,在 CIFAR10 和 ImageNet 上的两个具有挑战性的基准测试中,若 MCTS 使用由分层聚类生成的良好分支层次结构,则其搜索效率显著优于其他 NAS 方法,并能够更高效地找到有潜力的解决方案。
链接: https://arxiv.org/abs/2503.21061
作者: Mehraveh Javan Roshtkhari,Matthew Toews,Marco Pedersoli
机构: École de technologie supérieure (ÉTS)(魁北克省高等技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Monte-Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search related problems such as adversarial games. However, the performance of such approach highly depends on the order of the nodes that are considered at each branching of the tree. If the first branches cannot distinguish between promising and deceiving configurations for the final task, the efficiency of the search is exponentially reduced. In Neural Architecture Search (NAS), as only the final architecture matters, the visiting order of the branching can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching by hierarchical clustering of architectures based on their similarity. The similarity is measured by the pairwise distance of output vectors of architectures. Extensive experiments on two challenging benchmarks on CIFAR10 and ImageNet show that MCTS, if provided with a good branching hierarchy, can yield promising solutions more efficiently than other approaches for NAS problems.
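以下代码示意“按架构输出向量的两两距离做层次聚类、进而确定搜索树分支”的流程;距离度量、链接方式与分支数均为假设(并假设可用 scipy),仅用于说明思路,并非论文实现。

```python
# 用架构输出向量的两两距离做层次聚类来构造 MCTS 分支层级的示意(非官方实现)。
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def build_branching_hierarchy(arch_outputs: np.ndarray, n_branches: int = 4):
    """arch_outputs: (num_archs, dim) 每个候选架构在同一批探针数据上的输出向量(展平)。
    返回层次聚类树与每个架构所属的顶层分支编号,作为搜索树第一层的划分。"""
    dists = pdist(arch_outputs, metric="euclidean")   # 两两距离
    tree = linkage(dists, method="average")           # 层次聚类(平均链接)
    branch_ids = fcluster(tree, t=n_branches, criterion="maxclust")
    return tree, branch_ids

# 用法示例:16 个候选架构、输出向量维度 10(如 CIFAR10 的 logits 均值)
rng = np.random.default_rng(0)
outputs = rng.normal(size=(16, 10))
tree, branches = build_branching_hierarchy(outputs, n_branches=4)
print("各架构的顶层分支:", branches)
```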
zh
[CV-101] Online Reasoning Video Segmentation with Just-in-Time Digital Twins
【速读】:该论文旨在解决现有推理分割(Reasoning Segmentation, RS)方法在处理多步骤推理、复杂时空关系以及在线视频数据扩展性方面的局限性。具体而言,当前方法过度依赖多模态大型语言模型(Multimodal Large Language Models, LLMs)的视觉感知能力,导致需要频繁微调以适配最新模型,并可能引发灾难性遗忘风险,同时难以有效扩展至在线视频场景。
论文的关键创新在于提出了一种无需LLM微调的在线视频RS代理框架,通过解耦感知与推理实现更高效的处理。其核心解决方案是引入“即时数字孪生”(Just-In-Time Digital Twin)的概念:在接收到隐式查询后,LLM利用专门的视觉模型从高层视频数据构建低层次场景表示,仅在必要时请求特定信息而非始终评估所有专家模型。随后,LLM基于此数字孪生表示进行推理以定位目标对象。为验证方案的有效性,作者构建了一个包含200个视频和895个隐式文本查询的新综合视频推理分割基准,涵盖语义、空间和时间三种推理类别及不同复杂度的推理链。
链接: https://arxiv.org/abs/2503.21056
作者: Yiqing Shen,Bohan Liu,Chenjia Li,Lalithkumar Seenivasan,Mathias Unberath
机构: Johns Hopkins University (约翰斯·霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
点击查看摘要
Abstract:Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where – given an implicit query – a LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as “just-in-time” because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different reasoning chain complexity.
zh
[CV-102] What Changed and What Could Have Changed? State-Change Counterfactuals for Procedure-Aware Video Representation Learning
【速读】:该论文旨在解决理解程序性活动(procedural activities)中视频表征学习的问题,特别关注如何同时建模动作步骤对场景的变换以及这些场景变换如何影响动作步骤的顺序,包括意外或错误的动作。现有方法主要通过建模动作的时间顺序来研究具有程序意识的视频表示,但未明确学习状态变化(scene transformations)。论文的关键解决方案在于引入由大规模语言模型(Large Language Models, LLMs)生成的状态变化描述作为视频编码器的监督信号,并进一步生成状态变化反事实(state-change counterfactuals),以模拟假设的失败结果,从而让模型通过想象“如果……会怎样”(“What if”)的情景进行学习。这种反事实推理增强了模型对活动中每个步骤因果关系的理解能力。论文通过在具有程序意识的任务(如时间动作分割和错误检测)上的广泛实验验证了所提出的状态变化描述及其反事实的有效性,并取得了显著性能提升。
链接: https://arxiv.org/abs/2503.21055
作者: Chi-Hsi Kung,Frangil Ramirez,Juhyung Ha,Yi-Ting Chen,David Crandall,Yi-Hsuan Tsai
机构: Indiana University (印第安纳大学); National Yang-Ming Chiao-Tung University (国立阳明交通大学); Atmanity Inc. (Atmanity Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 4 figures
点击查看摘要
Abstract:Understanding a procedural activity requires modeling both how action steps transform the scene, and how evolving scene transformations can influence the sequence of action steps, even those that are accidental or erroneous. Existing work has studied procedure-aware video representations by proposing novel approaches such as modeling the temporal order of actions and has not explicitly learned the state changes (scene transformations). In this work, we study procedure-aware video representation learning by incorporating state-change descriptions generated by Large Language Models (LLMs) as supervision signals for video encoders. Moreover, we generate state-change counterfactuals that simulate hypothesized failure outcomes, allowing models to learn by imagining the unseen ``What if’’ scenarios. This counterfactual reasoning facilitates the model’s ability to understand the cause and effect of each step in an activity. To verify the procedure awareness of our model, we conduct extensive experiments on procedure-aware tasks, including temporal action segmentation and error detection. Our results demonstrate the effectiveness of the proposed state-change descriptions and their counterfactuals and achieve significant improvements on multiple tasks. We will make our source code and data publicly available soon.
zh
[CV-103] Reconstructing Gridded Data from Higher Autocorrelations
【速读】:该论文试图解决从数据集的高阶自相关重建整数或有理值网格数据集的问题。解决方案的关键在于提出了一种显式的重建算法,并证明了直到阶数 (3r+3) 的自相关总是足以确定数据集(至平移等价),其中 (r) 是网格的维度。此外,论文还提供了反例,表明对于阶数 (3r+2) 的自相关,某些有理值网格数据集无法被唯一确定。
链接: https://arxiv.org/abs/2503.21022
作者: W. Riley Casper,Bobby Orozco
机构: California State University Fullerton (加州州立大学富勒顿分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Group Theory (math.GR); Data Analysis, Statistics and Probability (physics.data-an)
备注: 13 pages, 1 figure
点击查看摘要
Abstract:The higher-order autocorrelations of integer-valued or rational-valued gridded data sets appear naturally in X-ray crystallography, and have applications in computer vision systems, correlation tomography, correlation spectroscopy, and pattern recognition. In this paper, we consider the problem of reconstructing a gridded data set from its higher-order autocorrelations. We describe an explicit reconstruction algorithm, and prove that the autocorrelations up to order 3r + 3 are always sufficient to determine the data up to translation, where r is the dimension of the grid. We also provide examples of rational-valued gridded data sets which are not determined by their autocorrelations up to order 3r + 2.
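下面按高阶自相关的常见定义给出一个朴素的计算示例(并非论文的重建算法),便于理解“k 阶自相关由 k-1 个平移向量索引”这一对象。

```python
import numpy as np
from itertools import product

def autocorrelation(f: np.ndarray, shifts) -> float:
    """k 阶自相关 A(v_1,...,v_{k-1}) = sum_x f(x) * f(x+v_1) * ... * f(x+v_{k-1});
    shifts 为 (dy, dx) 列表,越界处按 0 处理(数据紧支撑)。"""
    acc = 0.0
    h, w = f.shape
    for y, x in product(range(h), range(w)):
        term = f[y, x]
        for dy, dx in shifts:
            yy, xx = y + dy, x + dx
            term *= f[yy, xx] if 0 <= yy < h and 0 <= xx < w else 0.0
        acc += term
    return acc

f = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print("2 阶 A((1,0)) =", autocorrelation(f, [(1, 0)]))                # 1*3 + 2*4 = 11
print("3 阶 A((1,0),(0,1)) =", autocorrelation(f, [(1, 0), (0, 1)]))  # 1*3*2 = 6
```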
zh
[CV-104] Forensic Self-Descriptions Are All You Need for Zero-Shot Detection Open-Set Source Attribution and Clustering of AI-generated Images CVPR
【速读】:该论文旨在解决由先进基于AI的工具生成逼真图像所引发的取证检测与来源归因方面的重大挑战,尤其是面对快速涌现的新生成技术时传统方法难以泛化的问题。这些传统方法在训练过程中依赖于已知来源的特定特征,在遇到未知生成器时往往失效。为了解决这一问题,论文提出了一种创新方法,该方法明确建模了取证微结构——即图像创建过程特有的细微像素级模式。通过仅使用真实图像以自监督的方式学习一组多样化的预测滤波器来提取残差,这些残差能够捕捉这些微结构的不同方面。通过对多尺度残差进行联合建模,得到了一个紧凑模型,其参数构成了每幅图像独有的取证自描述。这种自描述使得零样本合成图像检测、开放集图像来源归因以及基于来源的聚类成为可能,而无需先验知识。大量实验表明,该方法在准确性与适应性上优于竞争技术,推动了合成媒体取证领域的进步。解决方案的关键在于通过自监督学习提取并联合建模图像的微结构残差,从而实现通用且高效的取证分析能力。
链接: https://arxiv.org/abs/2503.21003
作者: Tai D. Nguyen,Aref Azizpour,Matthew C. Stamm
机构: Drexel University (德雷塞尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025
点击查看摘要
Abstract:The emergence of advanced AI-based tools to generate realistic images poses significant challenges for forensic detection and source attribution, especially as new generative techniques appear rapidly. Traditional methods often fail to generalize to unseen generators due to reliance on features specific to known sources during training. To address this problem, we propose a novel approach that explicitly models forensic microstructures - subtle, pixel-level patterns unique to the image creation process. Using only real images in a self-supervised manner, we learn a set of diverse predictive filters to extract residuals that capture different aspects of these microstructures. By jointly modeling these residuals across multiple scales, we obtain a compact model whose parameters constitute a unique forensic self-description for each image. This self-description enables us to perform zero-shot detection of synthetic images, open-set source attribution of images, and clustering based on source without prior knowledge. Extensive experiments demonstrate that our method achieves superior accuracy and adaptability compared to competing techniques, advancing the state of the art in synthetic media forensics.
zh
[CV-105] CoMapGS: Covisibility Map-based Gaussian Splatting for Sparse Novel View Synthesis CVPR2025
【速读】:该论文旨在解决稀疏新颖视图合成中underrepresented sparse regions(未充分表示的稀疏区域)的恢复问题。解决方案的关键在于提出了一种名为Covisibility Map-based Gaussian Splatting (CoMapGS) 的方法,通过构建共视图映射(covisibility maps),增强初始点云,并采用基于邻近分类器的不确定性感知加权监督,有效应对高不确定性和低不确定性区域的挑战。其核心创新点包括利用共视图映射重构新颖视图合成、提升COLMAP衍生点云的质量以改善重建效果,以及通过基于共视图分数的自适应监督实现场景间性能的一致性提升。实验结果表明,CoMapGS在Mip-NeRF 360和LLFF等数据集上超越了现有最先进方法。
链接: https://arxiv.org/abs/2503.20998
作者: Youngkyoon Jang,Eduardo Pérez-Pellitero
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2025, Mistakenly submitted as a replacement for arXiv:2402.11057
点击查看摘要
Abstract:We propose Covisibility Map-based Gaussian Splatting (CoMapGS), designed to recover underrepresented sparse regions in sparse novel view synthesis. CoMapGS addresses both high- and low-uncertainty regions by constructing covisibility maps, enhancing initial point clouds, and applying uncertainty-aware weighted supervision using a proximity classifier. Our contributions are threefold: (1) CoMapGS reframes novel view synthesis by leveraging covisibility maps as a core component to address region-specific uncertainty; (2) Enhanced initial point clouds for both low- and high-uncertainty regions compensate for sparse COLMAP-derived point clouds, improving reconstruction quality and benefiting few-shot 3DGS methods; (3) Adaptive supervision with covisibility-score-based weighting and proximity classification achieves consistent performance gains across scenes with varying sparsity scores derived from covisibility maps. Experimental results demonstrate that CoMapGS outperforms state-of-the-art methods on datasets including Mip-NeRF 360 and LLFF.
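以下是“共视图(covisibility map)”构建的一个简化草图:统计每个 SfM 点被多少训练视角观测到,再把该计数投影回目标视角得到逐像素共视得分;投影与归一化细节均为假设,并非 CoMapGS 官方实现。

```python
import numpy as np

def covisibility_map(points_xyz, point_track_counts, K, R, t, hw):
    """points_xyz: (N,3);point_track_counts: (N,) 每个点的观测次数;
    K/R/t: 目标视角的内参与外参;hw: 输出图大小 (H, W)。"""
    h, w = hw
    cam = (R @ points_xyz.T + t[:, None]).T            # 世界系 -> 相机系
    valid = cam[:, 2] > 1e-6
    uv = (K @ cam[valid].T).T
    uv = uv[:, :2] / uv[:, 2:3]
    cov = np.zeros((h, w), dtype=np.float32)
    for (u, v), c in zip(uv, point_track_counts[valid]):
        x, y = int(round(u)), int(round(v))
        if 0 <= x < w and 0 <= y < h:
            cov[y, x] = max(cov[y, x], float(c))        # 稀疏散点,可再做膨胀/平滑
    return cov / max(cov.max(), 1.0)                    # 归一化到 [0, 1]

# 用法示例(随机数据仅作形状演示)
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(500, 3)) + np.array([0, 0, 4.0])
counts = rng.integers(1, 10, size=500)
K = np.array([[300.0, 0, 128], [0, 300.0, 96], [0, 0, 1]])
cov = covisibility_map(pts, counts, K, np.eye(3), np.zeros(3), (192, 256))
print(cov.shape, cov.max())
```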
zh
[CV-106] MVFNet: Multipurpose Video Forensics Network using Multiple Forms of Forensic Evidence WACV
【速读】:该论文旨在解决现有视频取证网络大多只能检测单一篡改类型(如深度伪造、修补等)的问题,而无法应对未知类型的复杂视频伪造。为了解决这一问题,论文提出了一种名为MVFNet的多功能视频取证网络,能够同时检测包括修补、深度伪造、拼接和编辑在内的多种篡改类型。其关键在于通过提取并联合分析广泛的取证特征模态,捕捉伪造视频中的空间和时间异常,并采用创新的多尺度分层Transformer模块,在多个空间尺度上识别取证不一致性,从而可靠地检测和定位各种大小及形状的虚假内容。
链接: https://arxiv.org/abs/2503.20991
作者: Tai D. Nguyen,Matthew C. Stamm
机构: Drexel University (德雷塞尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Proceedings of the Winter Conference on Applications of Computer Vision (WACV) 2025
点击查看摘要
Abstract:While videos can be falsified in many different ways, most existing forensic networks are specialized to detect only a single manipulation type (e.g. deepfake, inpainting). This poses a significant issue as the manipulation used to falsify a video is not known a priori. To address this problem, we propose MVFNet - a multipurpose video forensics network capable of detecting multiple types of manipulations including inpainting, deepfakes, splicing, and editing. Our network does this by extracting and jointly analyzing a broad set of forensic feature modalities that capture both spatial and temporal anomalies in falsified videos. To reliably detect and localize fake content of all shapes and sizes, our network employs a novel Multi-Scale Hierarchical Transformer module to identify forensic inconsistencies across multiple spatial scales. Experimental results show that our network obtains state-of-the-art performance in general scenarios where multiple different manipulations are possible, and rivals specialized detectors in targeted scenarios.
zh
[CV-107] Eyes Tell the Truth: GazeVal Highlights Shortcomings of Generative AI in Medical Imaging
【速读】:该论文试图解决医学影像领域中合成数据质量评估与临床实际需求不匹配的问题。当前评估方法主要依赖计算指标,这些指标虽在数值上可能表明合成图像逼真,但未能反映临床真实性,导致基于人工智能的医疗工具可靠性与有效性面临挑战。为填补这一差距,论文提出了一种名为GazeVal的实用框架,其关键是结合放射科医生的眼动追踪数据与直接影像学评估,以更深入地理解专家在不同任务(如诊断或图灵测试)中对合成数据的感知与交互方式,从而更准确地评估合成医学图像的质量。实验结果表明,大多数由最先进的生成式AI算法生成的图像被识别为伪造品,凸显了现有生成式AI技术在生成临床准确图像方面的局限性。
链接: https://arxiv.org/abs/2503.20967
作者: David Wong,Bin Wang,Gorkem Durak,Marouane Tliba,Akshay Chaudhari,Aladine Chetouani,Ahmet Enis Cetin,Cagdas Topel,Nicolo Gennaro,Camila Lopes Vendrami,Tugce Agirlar Trabzonlu,Amir Ali Rahsepar,Laetitia Perronne,Matthew Antalek,Onural Ozturk,Gokcan Okur,Andrew C. Gordon,Ayis Pyrros,Frank H. Miller,Amir Borhani,Hatice Savas,Eric Hart,Drew Torigian,Jayaram K. Udupa,Elizabeth Krupinski,Ulas Bagci
机构: Northwestern University (西北大学); University of Illinois at Chicago (芝加哥大学伊利诺伊分校); Stanford University (斯坦福大学); Université Sorbonne Paris Nord (巴黎北部索邦大学); Loyola University Chicago (芝加哥洛约拉大学); DuPage Medical Group (杜佩奇医疗集团); University of Pennsylvania (宾夕法尼亚大学); Emory University (埃默里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:The demand for high-quality synthetic data for model training and augmentation has never been greater in medical imaging. However, current evaluations predominantly rely on computational metrics that fail to align with human expert recognition. This leads to synthetic images that may appear realistic numerically but lack clinical authenticity, posing significant challenges in ensuring the reliability and effectiveness of AI-driven medical tools. To address this gap, we introduce GazeVal, a practical framework that synergizes expert eye-tracking data with direct radiological evaluations to assess the quality of synthetic medical images. GazeVal leverages gaze patterns of radiologists as they provide a deeper understanding of how experts perceive and interact with synthetic data in different tasks (i.e., diagnostic or Turing tests). Experiments with sixteen radiologists revealed that 96.6% of the generated images (by the most recent state-of-the-art AI algorithm) were identified as fake, demonstrating the limitations of generative AI in producing clinically accurate images.
zh
[CV-108] LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos CVPR2025
【速读】:该论文旨在解决在竞争性乒乓球比赛中,如何通过动作预测提升对抗系统的反应能力这一问题。以往研究虽已实现乒乓球实时对战系统,但大多未充分利用动作预测的优势,且现有预测方法受限于数据集规模与多样性。论文的关键在于提出了一种可扩展的系统,用于将乒乓球比赛的单目视频重建为三维表示,并设计了一种考虑不确定性(uncertainty-aware)的动作预测控制器;在仿真中,该预测性策略将对高速击球的回球率从基线非预测策略的49.9%提高到59.0%。
链接: https://arxiv.org/abs/2503.20936
作者: Daniel Etaat,Dvij Kalaria,Nima Rahmanian,Shankar Sastry
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2025
点击查看摘要
Abstract:Physical agility is a necessary skill in competitive table tennis, but by no means sufficient. Champions excel in this fast-paced and highly dynamic environment by anticipating their opponent’s intent - buying themselves the necessary time to react. In this work, we take one step towards designing such an anticipatory agent. Previous works have developed systems capable of real-time table tennis gameplay, though they often do not leverage anticipation. Among the works that forecast opponent actions, their approaches are limited by dataset size and variety. Our paper contributes (1) a scalable system for reconstructing monocular video of table tennis matches in 3D and (2) an uncertainty-aware controller that anticipates opponent actions. We demonstrate in simulation that our policy improves the ball return rate against high-speed hits from 49.9% to 59.0% as compared to a baseline non-anticipatory policy.
zh
[CV-109] Prototype Guided Backdoor Defense
【速读】:该论文旨在解决深度学习模型易受后门攻击的问题,特别是由恶意攻击者通过在少量训练数据中注入特定触发器(trigger)导致模型产生误分类的攻击。论文关注的重点是如何防御包括语义触发器在内的多种触发器类型的后门攻击,并提出了一种通用且有效的后处理防御方法。解决方案的关键在于提出了一种名为原型引导后门防御(Prototype Guided Backdoor Defense, PGBD)的方法,它利用激活值在几何空间中的位移来惩罚靠近触发器方向的移动,通过后微调步骤中的新型净化损失函数实现这一目标。这种方法能够扩展到不同类型的触发器,包括之前难以解决的语义触发器,并在所有测试设置中表现出更好的性能。此外,论文还展示了PGBD在针对名人面部图像的新语义攻击上的首次成功防御实例。
链接: https://arxiv.org/abs/2503.20925
作者: Venkat Adithya Amula,Sunayana Samavedam,Saurabh Saini,Avani Gupta,Narayanan P J
机构: IIIT Hyderabad(印度国际信息技术研究院), India; Amazon Research(亚马逊研究), India
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
点击查看摘要
Abstract:Deep learning models are susceptible to backdoor attacks involving malicious attackers perturbing a small subset of training data with a trigger to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements toward the trigger. This is done using a novel sanitization loss of a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images. Project page: this https URL.
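下面给出一个“惩罚激活位移朝向触发器方向”的示意损失,帮助理解净化损失的几何直觉;其中触发器方向如何估计、参考激活如何选取均为假设,公式并非论文原文。

```python
import torch
import torch.nn.functional as F

def sanitization_loss(act: torch.Tensor, act_ref: torch.Tensor,
                      trigger_dir: torch.Tensor) -> torch.Tensor:
    """惩罚激活位移朝向触发器方向的示意损失(非官方 PGBD 公式)。
    act: 微调中样本的激活 (B, D);act_ref: 冻结参考模型/原型的激活 (B, D);
    trigger_dir: 估计得到的触发器方向 (D,),此处假设已由其他步骤给出。"""
    disp = F.normalize(act - act_ref, dim=1)   # 激活位移方向
    tdir = F.normalize(trigger_dir, dim=0)
    cos = disp @ tdir                          # 每个样本位移与触发方向的余弦
    return F.relu(cos).mean()                  # 仅惩罚"朝向触发器"的分量

# 用法示例:总损失 = 任务交叉熵 + lambda * 净化损失
act = torch.randn(16, 512, requires_grad=True)
act_ref = torch.randn(16, 512)
trigger_dir = torch.randn(512)
loss = sanitization_loss(act, act_ref, trigger_dir)
loss.backward()
print(float(loss))
```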
zh
[CV-110] Feature Modulation for Semi-Supervised Domain Generalization without Domain Labels
【速读】:该论文致力于解决无领域标签的半监督领域泛化(SSDG)问题,即在训练过程中无法获得未标注数据的领域标签情况下,如何有效利用未标注数据以提升模型的领域泛化能力。论文的关键创新在于两个方面:首先,提出了一种特征调节策略,通过增强类别区分性特征的同时抑制领域特定信息,促使特征向跨领域鲁棒的相似平均表示(SAR)靠近,从而引导分类器更好地辨别相关类别,并使特征提取器形成紧密聚类且领域不变的表示;其次,引入了一种动态调整置信度阈值的损失缩放函数,以减轻领域噪声的影响并提高伪标签准确性,优化未标注数据的利用效率。这些方法使得所提方案在无需领域标签的情况下显著提升了四个主要领域泛化基准上的性能表现。
链接: https://arxiv.org/abs/2503.20897
作者: Venuri Amarasinghe(1),Asini Jayakody(1),Isun Randila(1),Kalinga Bandara(1),Chamuditha Jayanga Galappaththige(2),Ranga Rodrigo(1) ((1) University of Moratuwa, (2) Queensland University of Technology)
机构: University of Moratuwa (莫拉图瓦大学); Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Semi-supervised domain generalization (SSDG) leverages a small fraction of labeled data alongside unlabeled data to enhance model generalization. Most of the existing SSDG methods rely on pseudo-labeling (PL) for unlabeled data, often assuming access to domain labels-a privilege not always available. However, domain shifts introduce domain noise, leading to inconsistent PLs that degrade model performance. Methods derived from FixMatch suffer particularly from lower PL accuracy, reducing the effectiveness of unlabeled data. To address this, we tackle the more challenging domain-label agnostic SSDG, where domain labels for unlabeled data are not available during training. First, we propose a feature modulation strategy that enhances class-discriminative features while suppressing domain-specific information. This modulation shifts features toward Similar Average Representations-a modified version of class prototypes-that are robust across domains, encouraging the classifier to distinguish between closely related classes and feature extractor to form tightly clustered, domain-invariant representations. Second, to mitigate domain noise and improve pseudo-label accuracy, we introduce a loss-scaling function that dynamically lowers the fixed confidence threshold for pseudo-labels, optimizing the use of unlabeled data. With these key innovations, our approach achieves significant improvements on four major domain generalization benchmarks-even without domain labels. We will make the code available.
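以下用一个 FixMatch 风格的无标注损失片段示意“用随置信度平滑变化的权重代替固定阈值”的损失缩放思路;具体的缩放函数形式(此处用 sigmoid)为本文整理时的假设,并非论文原公式。

```python
import torch
import torch.nn.functional as F

def scaled_pseudo_label_loss(logits_weak: torch.Tensor, logits_strong: torch.Tensor,
                             base_thr: float = 0.95, sharpness: float = 10.0) -> torch.Tensor:
    """FixMatch 风格的无标注损失 + 置信度相关的损失缩放示意(非官方公式):
    不再用固定阈值做硬筛选,而是用随置信度平滑变化的权重,等价于动态放低阈值。"""
    with torch.no_grad():
        probs = F.softmax(logits_weak, dim=1)
        conf, pseudo = probs.max(dim=1)
        # sigmoid 形式的软权重:置信度越接近/超过 base_thr,权重越接近 1
        weight = torch.sigmoid(sharpness * (conf - base_thr))
    ce = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (weight * ce).mean()

logits_w = torch.randn(32, 7)   # 弱增广视图的输出
logits_s = torch.randn(32, 7)   # 强增广视图的输出
print(float(scaled_pseudo_label_loss(logits_w, logits_s)))
```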
zh
[CV-111] BioX-CPath: Biologically-driven Explainable Diagnostics for Multistain IHC Computational Pathology CVPR2025
【速读】:该论文旨在解决计算病理学中生物可解释性模型开发的关键挑战,特别是在多染色免疫组化(IHC)分析中的问题。论文提出了一种名为BioX-CPath的可解释图神经网络架构,用于全切片图像(WSI)分类,其创新点在于同时利用多种染色的空间和语义特征。解决方案的关键是引入了Stain-Aware Attention Pooling (SAAP)模块,该模块能够生成具有生物学意义且与染色相关的患者嵌入向量。这一设计不仅实现了在类风湿性关节炎和干燥综合征多染色数据集上的最先进的分类性能,还通过染色注意力得分、熵值和染色交互得分提供了可解释的洞察,从而衡量模型与已知病理机制的一致性。这种结合了高性能与生物可解释性的方法,使BioX-CPath特别适用于强调可解释性的临床应用。
链接: https://arxiv.org/abs/2503.20880
作者: Amaya Gallagher-Syed,Henry Senior,Omnia Alwazzan,Elena Pontarini,Michele Bombardieri,Costantino Pitzalis,Myles J. Lewis,Michael R. Barnes,Luca Rossi,Gregory Slabaugh
机构: Queen Mary University of London (伦敦玛丽女王大学); The Hong Kong Polytechnic University (香港理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM); Tissues and Organs (q-bio.TO)
备注: Accepted for publication at CVPR 2025
点击查看摘要
Abstract:The development of biologically interpretable and explainable models remains a key challenge in computational pathology, particularly for multistain immunohistochemistry (IHC) analysis. We present BioX-CPath, an explainable graph neural network architecture for whole slide image (WSI) classification that leverages both spatial and semantic features across multiple stains. At its core, BioX-CPath introduces a novel Stain-Aware Attention Pooling (SAAP) module that generates biologically meaningful, stain-aware patient embeddings. Our approach achieves state-of-the-art performance on both Rheumatoid Arthritis and Sjogren’s Disease multistain datasets. Beyond performance metrics, BioX-CPath provides interpretable insights through stain attention scores, entropy measures, and stain interaction scores, that permit measuring model alignment with known pathological mechanisms. This biological grounding, combined with strong classification performance, makes BioX-CPath particularly suitable for clinical applications where interpretability is key. Source code and documentation can be found at: this https URL.
zh
[CV-112] Unified Multimodal Discrete Diffusion
【速读】:该论文旨在探索离散扩散模型(Discrete Diffusion Models)作为跨模态联合生成的一种统一框架,以解决多模态生成模型在文本与图像联合理解与生成任务中的挑战。当前,自回归(Autoregressive, AR)方法在多模态生成领域占据主导地位,但其固有的顺序处理方式限制了生成的灵活性和可控性。论文的关键在于提出了一种名为Unified Multimodal Discrete Diffusion (UniDisc) 的新模型,它通过利用离散扩散模型在质量与多样性平衡、多模态修复(Inpainting)、以及生成过程中可控性方面的优势,实现了文本和图像的联合理解和生成。这一方案的核心创新在于将离散扩散机制扩展到联合文本与图像领域,并通过设计提升模型在性能、计算效率、可控编辑能力及推理速度与生成质量权衡上的表现。
链接: https://arxiv.org/abs/2503.20853
作者: Alexander Swerdlow,Mihir Prabhudesai,Siddharth Gandhi,Deepak Pathak,Katerina Fragkiadaki
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project Website: this https URL
点击查看摘要
Abstract:Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right, or top to bottom. These models jointly handle images, text, video, and audio for various tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building upon their recent success in text generation. Discrete diffusion models offer several advantages over AR models, including improved control over quality versus diversity of generated samples, the ability to perform joint multimodal inpainting (across both text and image domains), and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models, performing a scaling analysis and demonstrating that UniDisc outperforms them in terms of both performance and inference-time compute, enhanced controllability, editability, inpainting, and flexible trade-off between inference time and generation quality. Code and additional visualizations are available at this https URL.
zh
[CV-113] MedSegNet10: A Publicly Accessible Network Repository for Split Federated Medical Image Segmentation
【速读】:该论文旨在解决医疗图像分割领域中因数据隐私、标注样本不足及训练数据有限等挑战所导致的问题。解决方案的关键在于利用分散式学习方法,特别是分割联邦学习(Split Federated Learning, SplitFed),通过在私有存储的水平划分数据上实现协作训练,既保障了患者数据隐私,又有效提升了模型性能。论文提出的“MedSegNet10”是一个公开可用的资源库,提供了针对多种医学图像类型优化的预训练神经网络架构,并支持基于SplitFed的协作学习,从而推动医学图像分割技术的发展同时保护数据隐私。
链接: https://arxiv.org/abs/2503.20830
作者: Chamani Shiranthika,Zahra Hafezi Kafshgari,Hadi Hadizadeh,Parvaneh Saeedi
机构: Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 14 figures
点击查看摘要
Abstract:Machine Learning (ML) and Deep Learning (DL) have shown significant promise in healthcare, particularly in medical image segmentation, which is crucial for accurate disease diagnosis and treatment planning. Despite their potential, challenges such as data privacy concerns, limited annotated data, and inadequate training data persist. Decentralized learning approaches such as federated learning (FL), split learning (SL), and split federated learning (SplitFed/SFL) address these issues effectively. This paper introduces “MedSegNet10,” a publicly accessible repository designed for medical image segmentation using split-federated learning. MedSegNet10 provides a collection of pre-trained neural network architectures optimized for various medical image types, including microscopic images of human blastocysts, dermatoscopic images of skin lesions, and endoscopic images of lesions, polyps, and ulcers, with applications extending beyond these examples. By leveraging SplitFed’s benefits, MedSegNet10 allows collaborative training on privately stored, horizontally split data, ensuring privacy and integrity. This repository supports researchers, practitioners, trainees, and data scientists, aiming to advance medical image segmentation while maintaining patient data privacy. The repository is available at: this https URL (password upon request to the authors).
zh
[CV-114] Multimodal Image Matching based on Frequency-domain Information of Local Energy Response
【速读】:该论文旨在解决多模态图像匹配中的主要挑战,包括复杂的非线性强度差异、局部几何形变、噪声以及旋转变换。论文提出了一种基于频域局部能量响应的方法(Frequency-domain Information of Local Energy Response,FILER),其关键是构建了一个以频域信息为基础的局部能量响应模型,能够克服非线性强度差异的影响。此外,通过设计增强边缘结构的特征检测器和卷积特征加权描述符进一步提升对局部非线性几何形变和噪声的鲁棒性,并实现旋转不变性。实验结果表明,FILER在多种多模态图像对匹配任务中优于现有最先进的算法,具有良好的鲁棒性和通用性。
链接: https://arxiv.org/abs/2503.20827
作者: Meng Yang,Jun Chen,Wenping Gong,Longsheng Wei,Xin Tian
机构: China University of Geosciences (中国地质大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 11 figures
点击查看摘要
Abstract:Complicated nonlinear intensity differences, nonlinear local geometric distortions, noises and rotation transformation are the main challenges in multimodal image matching. In order to solve these problems, we propose a method based on Frequency-domain Information of Local Energy Response called FILER. The core of FILER is the local energy response model based on frequency-domain information, which can overcome the effect of nonlinear intensity differences. To improve the robustness to local nonlinear geometric distortions and noises, we design a new edge structure enhanced feature detector and a convolutional feature weighted descriptor, respectively. In addition, FILER overcomes the sensitivity of the frequency-domain information to the rotation angle and achieves rotation invariance. Extensive experiments on multimodal image pairs show that FILER outperforms other state-of-the-art algorithms and has good robustness and universality.
zh
[CV-115] Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning
【速读】:该论文旨在解决联邦持续学习(Federated Continual Learning, FCL)在医疗场景中面临的两个主要挑战:(1) 对先前任务的灾难性遗忘导致服务器模型的错误累积,难以维持全面的知识;(2) 不同客户端异步处理任务引起的优化偏差,在同一时间步长产生目标冲突。为应对这些问题,论文提出了一个新颖的服务器端 FCL 模式——动态分配超网络结合自适应模型重新校准(FedDAH)。其关键在于:(1) 提出动态分配超网络(Dynamic Allocation Hypernetwork, DAHyper),通过不断更新的超网络管理任务标识与相关模型参数之间的映射,实现跨客户端的动态模型分配以缓解灾难性遗忘;(2) 引入自适应模型重新校准(Adaptive Model Recalibration, AMR),将历史模型的变化纳入当前服务器更新,并基于任务相似性为不同时间步的相同任务分配权重,从而实现连续优化以解决优化偏差问题。
链接: https://arxiv.org/abs/2503.20808
作者: Xiaoming Qi,Jingyang Zhang,Huazhu Fu,Guanyu Yang,Shuo Li,Yueming Jin
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Federated continual learning (FCL) offers an emerging pattern to facilitate the applicability of federated learning (FL) in real-world scenarios, where tasks evolve dynamically and asynchronously across clients, especially in medical scenario. Existing server-side FCL methods in nature domain construct a continually learnable server model by client aggregation on all-involved tasks. However, they are challenged by: (1) Catastrophic forgetting for previously learned tasks, leading to error accumulation in server model, making it difficult to sustain comprehensive knowledge across all tasks. (2) Biased optimization due to asynchronous tasks handled across different clients, leading to the collision of optimization targets of different clients at the same time steps. In this work, we take the first step to propose a novel server-side FCL pattern in medical domain, Dynamic Allocation Hypernetwork with adaptive model recalibration (FedDAH). It is to facilitate collaborative learning under the distinct and dynamic task streams across clients. To alleviate the catastrophic forgetting, we propose a dynamic allocation hypernetwork (DAHyper) where a continually updated hypernetwork is designed to manage the mapping between task identities and their associated model parameters, enabling the dynamic allocation of the model across clients. For the biased optimization, we introduce a novel adaptive model recalibration (AMR) to incorporate the candidate changes of historical models into current server updates, and assign weights to identical tasks across different time steps based on the similarity for continual optimization. Extensive experiments on the AMOS dataset demonstrate the superiority of our FedDAH to other FCL methods on sites with different task streams. The code is available:this https URL.
zh
[CV-116] Double Blind Imaging with Generative Modeling
【速读】:该论文致力于解决成像领域中的盲逆问题(Blind Inverse Problems),即在成像系统存在不确定性且仅能获取含噪测量数据的情况下,如何从这些测量值中恢复清晰图像的问题。为了解决这一问题,通常需要显式或隐式地识别成像系统。论文的关键创新在于提出了一种基于AmbientGAN的生成式方法,通过无配对的干净图像与对应的含噪测量数据,学习未知成像系统参数的分布。这种方法无需直接访问成像系统的样本,而是利用生成模型作为图像及其系统参数的先验(如点扩散函数PSF的类)。通过这种方式,所学得的先验可以被用于模型驱动的恢复算法中,以解决诸如盲去卷积(Blind Deconvolution)等盲逆问题,并成功展示了该方法在学习高斯模糊与运动模糊先验以及在后验扩散采样中解决盲去卷积的应用价值。
链接: https://arxiv.org/abs/2503.21501
作者: Brett Levac,Ajil Jalal,Kannan Ramchandran,Jonathan I. Tamir
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); University of California Berkeley (加州大学伯克利分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Blind inverse problems in imaging arise from uncertainties in the system used to collect (noisy) measurements of images. Recovering clean images from these measurements typically requires identifying the imaging system, either implicitly or explicitly. A common solution leverages generative models as priors for both the images and the imaging system parameters (e.g., a class of point spread functions). To learn these priors in a straightforward manner requires access to a dataset of clean images as well as samples of the imaging system. We propose an AmbientGAN-based generative technique to identify the distribution of parameters in unknown imaging systems, using only unpaired clean images and corrupted measurements. This learned distribution can then be used in model-based recovery algorithms to solve blind inverse problems such as blind deconvolution. We successfully demonstrate our technique for learning Gaussian blur and motion blur priors from noisy measurements and show their utility in solving blind deconvolution with diffusion posterior sampling.
zh
[CV-117] Embedding Compression Distortion in Video Coding for Machines
【速读】:该论文试图解决视频压缩码流在服务于机器视觉任务时,因现有编解码器(Codecs)主要优化于像素域和人类视觉系统(Human Visual System, HVS)感知指标而忽略机器视觉需求所导致的信息丢失问题。为应对这一挑战,论文提出了一种压缩失真表示嵌入框架(Compression Distortion Representation Embedding, CDRE),其关键是设计了一个对压缩敏感的特征提取器,用于识别特征域中的压缩退化,并将压缩失真的相关信息以紧凑的形式编码后逐步嵌入到下游模型中,使模型能够更好地适应压缩带来的退化,从而提升任务性能。此外,通过引入轻量级失真编解码器实现高效传输,同时确保较低的比特率、执行时间和参数开销。实验结果验证了该框架在多种编解码器和下游任务中均能有效提升率-任务性能。
链接: https://arxiv.org/abs/2503.21469
作者: Yuxiao Sun,Yao Zhao,Meiqin Liu,Chao Yao,Weisi Lin
机构: Beijing Jiaotong University (北京交通大学); University of Scienece and Technology Beijing (北京科技大学); Nanyang Technological University (南洋理工大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our codes and supplementary materials are released in this https URL.
zh
[CV-118] Sparse Bayesian Learning for Label Efficiency in Cardiac Real-Time MRI
【速读】:该论文旨在解决心脏实时磁共振成像(Cardiac Real-Time MRI)中由于高帧率成像导致需分割图像数量激增的问题,特别是神经网络在边缘切片(outer slices)预测的可靠性不足。为应对这一挑战,论文提出了一种基于稀疏贝叶斯学习(Sparse Bayesian Learning, SBL)的方法。其关键是利用内切片(inner slices)中已良好分割的数据,通过优化超参数(type-II likelihood)识别与心率和呼吸频率相关的稀疏频域成分,并自动剔除非相关成分。这些稀疏频域信息指导外切片图像的选择以进行标注,从而最小化后验方差,同时确保仅需少量标注样本即可实现精确的心室容积预测。此外,贝叶斯方法提供了不确定性估计,有助于识别不可靠预测。
链接: https://arxiv.org/abs/2503.21443
作者: Felix Terhag,Philipp Knechtges,Achim Basermann,Anja Bach,Darius Gerlach,Jens Tank,Raúl Tempone
机构: 未知
类目: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Probability (math.PR); Statistics Theory (math.ST); Applications (stat.AP)
备注:
点击查看摘要
Abstract:Cardiac real-time magnetic resonance imaging (MRI) is an emerging technology that images the heart at up to 50 frames per second, offering insight into the respiratory effects on the heartbeat. However, this method significantly increases the number of images that must be segmented to derive critical health indicators. Although neural networks perform well on inner slices, predictions on outer slices are often unreliable. This work proposes sparse Bayesian learning (SBL) to predict the ventricular volume on outer slices with minimal manual labeling to address this challenge. The ventricular volume over time is assumed to be dominated by sparse frequencies corresponding to the heart and respiratory rates. Moreover, SBL identifies these sparse frequencies on well-segmented inner slices by optimizing hyperparameters via type-II likelihood, automatically pruning irrelevant components. The identified sparse frequencies guide the selection of outer slice images for labeling, minimizing posterior variance. This work provides performance guarantees for the greedy algorithm. Testing on patient data demonstrates that only a few labeled images are necessary for accurate volume prediction. The labeling procedure effectively avoids selecting inefficient images. Furthermore, the Bayesian approach provides uncertainty estimates, highlighting unreliable predictions (e.g., when choosing suboptimal labels).
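下面给出一个在傅里叶字典上做 ARD/RVM 式稀疏贝叶斯回归的极简示例,用于说明“由 type-II 似然自动剪枝、识别心跳与呼吸两个主导频率”的思想;字典构造与迭代细节均为示意性假设,并非论文实现。

```python
import numpy as np

def sbl_fourier(y: np.ndarray, fs: float, n_freqs: int = 64, n_iters: int = 100):
    """在傅里叶字典上做 ARD 式稀疏贝叶斯回归,返回各候选频率及其能量。"""
    n = len(y)
    t = np.arange(n) / fs
    freqs = np.linspace(0.05, fs / 2 * 0.9, n_freqs)
    # 字典:常数项 + 每个候选频率的 cos/sin 两列
    Phi = np.hstack([np.ones((n, 1))] +
                    [np.stack([np.cos(2 * np.pi * f * t),
                               np.sin(2 * np.pi * f * t)], axis=1) for f in freqs])
    m = Phi.shape[1]
    alpha = np.ones(m)          # 每个权重的精度超参数(type-II 似然的优化对象)
    beta = 1.0 / np.var(y)      # 噪声精度
    for _ in range(n_iters):
        Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))
        mu = beta * Sigma @ Phi.T @ y
        gamma = np.clip(1.0 - alpha * np.diag(Sigma), 1e-10, 1.0)
        alpha = np.minimum(gamma / (mu ** 2 + 1e-12), 1e12)   # 无关分量的 alpha 发散 -> 权重被剪枝
        beta = (n - gamma.sum()) / (np.sum((y - Phi @ mu) ** 2) + 1e-12)
    energy = (mu[1:].reshape(-1, 2) ** 2).sum(axis=1)          # 合并每个频率的 cos/sin 能量
    return freqs, energy

# 用法示例:心率 1.2 Hz + 呼吸 0.25 Hz 的合成"心室容积"曲线
fs, n = 10.0, 300
t = np.arange(n) / fs
y = 80 + 5 * np.sin(2 * np.pi * 1.2 * t) + 8 * np.sin(2 * np.pi * 0.25 * t) + np.random.randn(n)
freqs, energy = sbl_fourier(y, fs)
print("能量最高的两个候选频率(Hz):", np.round(freqs[np.argsort(energy)[-2:]], 2))
```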
zh
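针对上文 [CV-118] 中"在稀疏频率基础上、通过最小化后验方差挑选待标注外切片"的思路,下面给出一个极简的 Python 示意草图:在(假设已识别出的)心率与呼吸频率构成的傅里叶基上做贝叶斯线性回归,并贪心选择能最大程度降低整体预测方差的帧用于人工标注。其中的频率、噪声方差等数值均为示意性假设,并非论文的原始实现。

```python
import numpy as np

# 示意:在稀疏频率基(心率/呼吸频率)上做贝叶斯线性回归,
# 并贪心选择能最大程度降低整体预测方差的帧用于人工标注。
# 频率与噪声参数均为假设值,仅用于说明思路。

t = np.linspace(0, 10, 200)                      # 采集时刻(秒),示意
freqs = np.array([1.1, 0.25])                    # 假设已识别的心率 / 呼吸频率 (Hz)

def design(t, freqs):
    """仅包含已识别稀疏频率的傅里叶设计矩阵。"""
    cols = [np.ones_like(t)]
    for f in freqs:
        cols += [np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)]
    return np.stack(cols, axis=1)

Phi = design(t, freqs)
sigma2, alpha = 0.5, 1e-2                        # 噪声方差、先验精度(假设)

def posterior_cov(idx):
    """给定已标注帧的索引集合,返回回归权重的后验协方差。"""
    P = Phi[idx]
    return np.linalg.inv(alpha * np.eye(Phi.shape[1]) + P.T @ P / sigma2)

labeled = [0]                                    # 从一帧已标注数据出发
for _ in range(5):                               # 再贪心挑选 5 帧
    candidates = [i for i in range(len(t)) if i not in labeled]
    scores = []
    for i in candidates:
        S = posterior_cov(labeled + [i])
        scores.append(np.trace(Phi @ S @ Phi.T))  # 所有帧上预测方差之和
    labeled.append(candidates[int(np.argmin(scores))])

print("建议人工标注的帧索引:", sorted(labeled))
```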
[CV-119] PLAIN: Scalable Estimation Architecture for Integrated Sensing and Communication
【速读】:本文旨在解决集成感知与通信(Integrated Sensing and Communication, ISAC)系统中参数估计面临的高维复杂性、有限测量时间以及超分辨率需求等挑战。传统的联合多维度(如空间、频率和时间)感知方法在处理高维数据时计算复杂度显著增加,而由于感知与数据传输的同时进行,可用的感知时间窗口通常较短,导致仅能获取单一快拍数据的问题尤为突出。为此,论文提出了一种基于张量的估计架构——PLAIN,其关键在于通过三个阶段灵活应对上述挑战:首先,在压缩阶段将高维输入信号转换为低维表示,同时保持分辨率;其次,在解耦估计阶段以低复杂度并行估计不同维度上的参数;最后,在基于输入的融合阶段整合解耦后的参数形成多维配对估计结果。该方案利用张量代数、子空间处理及压缩感知等工具,在维持较低计算复杂度的同时实现了对维度扩展性和超分辨率性能的良好支持。
链接: https://arxiv.org/abs/2503.21242
作者: Bashar Tahir,Philipp Svoboda,Markus Rupp
机构: Christian Doppler Laboratory for Digital Twin assisted AI for sustainable Radio Access Networks, Institute of Telecommunications, TU Wien (奥地利维也纳技术大学), Austria; Institute of Telecommunications, TU Wien (奥地利维也纳技术大学), Austria
类目: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the IEEE Transactions on Wireless Communications. Code available at GitHub: this https URL
点击查看摘要
Abstract:Integrated sensing and communication (ISAC) is envisioned to be one of the paradigms upon which next-generation mobile networks will be built, extending localization and tracking capabilities, as well as giving birth to environment-aware wireless access. A key aspect of sensing integration is parameter estimation, which involves extracting information about the surrounding environment, such as the direction, distance, and velocity of various objects within. This is typically of a high-dimensional nature, which leads to significant computational complexity if performed jointly across multiple sensing dimensions, such as space, frequency, and time. Additionally, due to the incorporation of sensing on top of the data transmission, the time window available for sensing is likely to be short, resulting in an estimation problem where only a single snapshot is accessible. In this work, we propose PLAIN, a tensor-based estimation architecture that flexibly scales with multiple sensing dimensions and can handle high dimensionality, limited measurement time, and super-resolution requirements. It consists of three stages: a compression stage, where the high-dimensional input is converted into lower dimensionality, without sacrificing resolution; a decoupled estimation stage, where the parameters across the different dimensions are estimated in parallel with low complexity; and an input-based fusion stage, where the decoupled parameters are fused together to form a paired multidimensional estimate. We investigate the performance of the architecture for different configurations and compare it against practical sequential and joint estimation baselines, as well as theoretical bounds. Our results show that PLAIN, using tools from tensor algebra, subspace-based processing, and compressed sensing, can scale flexibly with dimensionality, while operating with low complexity and maintaining super-resolution.
zh
[CV-120] Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins
【速读】:该论文旨在解决现有基于端到端深度神经网络的手术室(OR)工作流程分析方法缺乏灵活性的问题,这些方法受限于开发时设定的条件,难以适应不同场景(如大型学术医疗中心与农村医疗机构)的需求,而无需重新进行数据收集、标注和训练。为了解决这些问题,论文提出的关键解决方案包括:首先,设计了一种新型数字孪生(Digital Twin, DT)表示法,以保留手术室内各种组件之间的语义和空间关系;其次,基于此基础,提出了ORDiRS(用于推理分割的手术室数字孪生表示)框架,这是一种无需微调大型语言模型(LLM)的推理分割方法,将推理分割重构为“推理-检索-合成”范式;最后,引入了ORDiRS-Agent,这是一种基于LLM的代理系统,能够将复杂的手术室工作流程分析查询分解为可管理的推理分割子查询,并通过结合详细的文本解释和支持性的视觉证据生成响应。实验结果表明,与现有的最先进方法相比,ORDiRS在cIoU指标上提升了6.12%-9.74%。
链接: https://arxiv.org/abs/2503.21054
作者: Yiqing Shen,Chenjia Li,Bohan Liu,Cheng-Yi Li,Tito Porras,Mathias Unberath
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
点击查看摘要
Abstract:Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a “reason-retrieval-synthesize” paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to the existing state-of-the-arts.
zh
人工智能
[AI-0] Intelligent IoT Attack Detection Design via ODLLM with Feature Ranking-based Knowledge Base
链接: https://arxiv.org/abs/2503.21674
作者: Satvik Verma,Qun Wang,E. Wes Bethel
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
*备注:
点击查看摘要
Abstract:The widespread adoption of Internet of Things (IoT) devices has introduced significant cybersecurity challenges, particularly with the increasing frequency and sophistication of Distributed Denial of Service (DDoS) attacks. Traditional machine learning (ML) techniques often fall short in detecting such attacks due to the complexity of blended and evolving patterns. To address this, we propose a novel framework leveraging On-Device Large Language Models (ODLLMs) augmented with fine-tuning and knowledge base (KB) integration for intelligent IoT network attack detection. By implementing feature ranking techniques and constructing both long and short KBs tailored to model capacities, the proposed framework ensures efficient and accurate detection of DDoS attacks while overcoming computational and privacy limitations. Simulation results demonstrate that the optimized framework achieves superior accuracy across diverse attack types, especially when using compact models in edge computing environments. This work provides a scalable and secure solution for real-time IoT security, advancing the applicability of edge intelligence in cybersecurity.
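上文 [AI-0] 的核心步骤之一是基于特征排序构建知识库(KB)。下面是一个示意性的 Python 草图,用互信息(mutual information)对流量特征打分并区分"短 KB"与"长 KB";其中特征名与数据均为虚构占位,论文并未指明具体使用的排序方法,此处仅作演示。

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# 示意:用互信息对流量特征打分并排序,分别取前若干特征构成"短 KB"与"长 KB"。
# 特征名与数据均为虚构占位;论文未指明具体排序方法,此处仅作演示。

feature_names = ["pkt_rate", "byte_rate", "syn_ratio", "flow_duration", "dst_entropy"]
X = np.random.rand(1000, len(feature_names))   # 合成的流级特征
y = np.random.randint(0, 2, size=1000)         # 1 = DDoS 流量, 0 = 正常流量(示意)

scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(feature_names, scores), key=lambda kv: kv[1], reverse=True)

short_kb = [name for name, _ in ranking[:3]]   # 供小模型使用的短知识库
long_kb = [name for name, _ in ranking]        # 供较大模型使用的长知识库
print("短 KB 特征:", short_kb)
```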
[AI-1] Unlocking the Potential of Past Research: Using Generative AI to Reconstruct Healthcare Simulation Models
链接: https://arxiv.org/abs/2503.21646
作者: Thomas Monks,Alison Harper,Amy Heather
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Discrete-event simulation (DES) is widely used in healthcare Operations Research, but the models themselves are rarely shared. This limits their potential for reuse and long-term impact in the modelling and healthcare communities. This study explores the feasibility of using generative artificial intelligence (AI) to recreate published models using Free and Open Source Software (FOSS), based on the descriptions provided in an academic journal. Using a structured methodology, we successfully generated, tested and internally reproduced two DES models, including user interfaces. The reported results were replicated for one model, but not the other, likely due to missing information on distributions. These models are substantially more complex than AI-generated DES models published to date. Given the challenges we faced in prompt engineering, code generation, and model testing, we conclude that our iterative approach to model development, systematic comparison and testing, and the expertise of our team were necessary to the success of our recreated simulation models.
[AI-2] owards Fully Automated Decision-Making Systems for Greenhouse Control: Challenges and Opportunities
链接: https://arxiv.org/abs/2503.21640
作者: Yongshuai Liu,Taeyeong Choi,Xin Liu
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Machine learning has been successful in building control policies that drive a complex system to desired states in various applications (e.g., games, robotics). Specifically, the parameters of a policy can be automatically optimized from observations of the environment so as to generate a sequence of decisions leading to the best performance. In this survey paper, we explore such policy-learning techniques for another unique, practical use-case scenario: farming, in which critical decisions (e.g., water supply, heating) must be made in a timely manner to minimize risks (e.g., damage to plants) while ultimately maximizing revenue (e.g., healthy crops). We first provide a broad overview of the latest studies in this area to identify not only domain-specific challenges but also opportunities with potential solutions, some of which are suggested as promising directions for future research. We then introduce our approach, which ranked second among 46 teams at the "3rd Autonomous Greenhouse Challenge", and use this specific example to discuss the lessons learned about important design considerations for creating autonomous farm-management systems.
[AI-3] UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
链接: https://arxiv.org/abs/2503.21620
作者: Zhengxi Lu,Yuxiang Chai,Yaxuan Guo,Xi Yin,Liang Liu,Hao Wang,Guanjing Xiong,Hongsheng Li
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The recent DeepSeek-R1 has showcased the emergence of reasoning capabilities in LLMs through reinforcement learning (RL) with rule-based rewards. Building on this idea, we are the first to explore how rule-based RL can enhance the reasoning capabilities of multimodal large language models (MLLMs) for graphic user interface (GUI) action prediction tasks. To this end, we curate a small yet high-quality dataset of 136 challenging tasks, encompassing five common action types on mobile devices. We also introduce a unified rule-based action reward, enabling model optimization via policy-based algorithms such as Group Relative Policy Optimization (GRPO). Experimental results demonstrate that our proposed data-efficient model, UI-R1-3B, achieves substantial improvements on both in-domain (ID) and out-of-domain (OOD) tasks. Specifically, on the ID benchmark AndroidControl, the action type accuracy improves by 15%, while grounding accuracy increases by 10.3%, compared with the base model (i.e. Qwen2.5-VL-3B). On the OOD GUI grounding benchmark ScreenSpot-Pro, our model surpasses the base model by 6.0% and achieves competitive performance with larger models (e.g., OS-Atlas-7B), which are trained via supervised fine-tuning (SFT) on 76K data. These results underscore the potential of rule-based reinforcement learning to advance GUI understanding and control, paving the way for future research in this domain.
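上文 [AI-3] 提到"统一的基于规则的动作奖励(unified rule-based action reward)"。下面给出一个示意性的 Python 草图:奖励由输出格式可解析、动作类型匹配、点击坐标落入目标框三部分组成;权重与字段名均为假设,并非论文的确切定义。

```python
from typing import Optional, Tuple

# 示意:统一的基于规则的动作奖励由三部分组成:
# 输出格式可解析、动作类型匹配、点击坐标落入目标元素的边界框。
# 权重与字段名均为假设,并非论文的确切定义。

def action_reward(pred_type: str,
                  gt_type: str,
                  pred_click: Optional[Tuple[float, float]] = None,
                  gt_box: Optional[Tuple[float, float, float, float]] = None,
                  parseable: bool = True) -> float:
    reward = 0.0
    if parseable:
        reward += 0.5                               # 格式奖励:输出符合约定模板
    if pred_type == gt_type:
        reward += 1.0                               # 动作类型奖励
        if pred_type == "click" and pred_click and gt_box:
            x, y = pred_click
            x1, y1, x2, y2 = gt_box
            if x1 <= x <= x2 and y1 <= y <= y2:
                reward += 1.0                       # 定位奖励:点击落在目标框内
    return reward

# 用例:动作类型正确且点击落入真实元素框内
print(action_reward("click", "click", pred_click=(0.42, 0.81),
                    gt_box=(0.40, 0.78, 0.60, 0.85)))  # 输出 2.5
```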
[AI-4] A Measure Based Generalizable Approach to Understandability
链接: https://arxiv.org/abs/2503.21615
作者: Vikas Kushwaha,Sruti Srinivasa Ragavan,Subhajit Roy
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
*备注: 6 pages
点击查看摘要
Abstract:Successful agent-human partnerships require that any agent-generated information is understandable to the human, and that the human can easily steer the agent towards a goal. Such effective communication requires the agent to develop a finer-level notion of what is understandable to the human. State-of-the-art agents, including LLMs, lack this detailed notion of understandability because they only capture average human sensibilities from the training data, and therefore afford limited steerability (e.g., requiring non-trivial prompt engineering). In this paper, instead of only relying on data, we argue for developing generalizable, domain-agnostic measures of understandability that can be used as directives for these agents. Existing research on understandability measures is fragmented; we survey various such efforts across domains, and lay a cognitive-science-rooted groundwork for more coherent and domain-agnostic research investigations in the future.
[AI-5] GenEdit: Compounding Operators and Continuous Improvement to Tackle Text-to-SQL in the Enterprise
链接: https://arxiv.org/abs/2503.21602
作者: Karime Maamari,Connor Landy,Amine Mhedhbi
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Recent advancements in Text-to-SQL, driven by large language models, are democratizing data access. Despite these advancements, enterprise deployments remain challenging due to the need to capture business-specific knowledge, handle complex queries, and meet expectations of continuous improvements. To address these issues, we designed and implemented GenEdit: our Text-to-SQL generation system that improves with user feedback. GenEdit builds and maintains a company-specific knowledge set, employs a pipeline of operators decomposing SQL generation, and uses feedback to update its knowledge set to improve future SQL generations. We describe GenEdit’s architecture, made of two core modules: (i) decomposed SQL generation; and (ii) knowledge set edits based on user feedback. For generation, GenEdit leverages compounding operators to improve knowledge retrieval and to create a plan as chain-of-thought steps that guides generation. GenEdit first retrieves relevant examples in an initial retrieval stage where original SQL queries are decomposed into sub-statements, clauses or sub-queries. It then also retrieves instructions and schema elements. Using the retrieved contextual information, GenEdit then generates a step-by-step plan in natural language on how to produce the query. Finally, GenEdit uses the plan to generate SQL, minimizing the need for model reasoning, which enhances complex SQL generation. If necessary, GenEdit regenerates the query based on syntactic and semantic errors. The knowledge set edits are recommended through an interactive copilot, allowing users to iterate on their feedback and to regenerate SQL queries as needed. Each generation uses staged edits which update the generation prompt. Once the feedback is submitted, it gets merged after passing regression testing and obtaining an approval, improving future generations.
[AI-6] Prompt Divide and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing
链接: https://arxiv.org/abs/2503.21598
作者: Johan Wahréus,Ahmed Hussain,Panos Papadimitratos
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 22 pages; 26 figures
点击查看摘要
Abstract:Large Language Models (LLMs) have transformed task automation and content generation across various domains while incorporating safety filters to prevent misuse. We introduce a novel jailbreaking framework that employs distributed prompt processing combined with iterative refinements to bypass these safety measures, particularly in generating malicious code. Our architecture consists of four key modules: prompt segmentation, parallel processing, response aggregation, and LLM-based jury evaluation. Tested on 500 malicious prompts across 10 cybersecurity categories, the framework achieves a 73.2% Success Rate (SR) in generating malicious code. Notably, our comparative analysis reveals that traditional single-LLM judge evaluation overestimates SRs (93.8%) compared to our LLM jury system (73.2%), with manual verification confirming that single-judge assessments often accept incomplete implementations. Moreover, we demonstrate that our distributed architecture improves SRs by 12% over the non-distributed approach in an ablation study, highlighting both the effectiveness of distributed prompt processing and the importance of robust evaluation methodologies in assessing jailbreak attempts.
[AI-7] Critical Iterative Denoising: A Discrete Generative Model Applied to Graphs
链接: https://arxiv.org/abs/2503.21592
作者: Yoann Boget,Alexandros Kalousis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, the time dependencies in the noising process of these models lead to error accumulation and propagation during the backward process. This issue, particularly pronounced in mask diffusion, is a known limitation in sequence modeling and, as we demonstrate, also impacts discrete diffusion models for graphs. To address this problem, we propose a novel framework called Iterative Denoising, which simplifies discrete diffusion and circumvents the issue by assuming conditional independence across time. Additionally, we enhance our model by incorporating a Critic, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.
[AI-8] Magnitude-Phase Dual-Path Speech Enhancement Network based on Self-Supervised Embedding and Perceptual Contrast Stretch Boosting ICME2025
链接: https://arxiv.org/abs/2503.21571
作者: Alimjan Mattursun,Liejun Wang,Yinfeng Yu,Chunyang Ma
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注: Main paper (6 pages). Accepted for publication by ICME 2025
点击查看摘要
Abstract:Speech self-supervised learning (SSL) has made great progress in various speech processing tasks, but there is still room for improvement in speech enhancement (SE). This paper presents BSP-MPNet, a dual-path framework that combines self-supervised features with magnitude-phase information for SE. The approach starts by applying the perceptual contrast stretching (PCS) algorithm to enhance the magnitude-phase spectrum. A magnitude-phase 2D coarse (MP-2DC) encoder then extracts coarse features from the enhanced spectrum. Next, a feature-separating self-supervised learning (FS-SSL) model generates self-supervised embeddings for the magnitude and phase components separately. These embeddings are fused to create cross-domain feature representations. Finally, two parallel RNN-enhanced multi-attention (REMA) mask decoders refine the features, apply them to the mask, and reconstruct the speech signal. We evaluate BSP-MPNet on the VoiceBank+DEMAND and WHAMR! datasets. Experimental results show that BSP-MPNet outperforms existing methods under various noise conditions, providing new directions for self-supervised speech enhancement research. The implementation of the BSP-MPNet code is available online at this https URL.
[AI-9] A Local Perspective-based Model for Overlapping Community Detection
链接: https://arxiv.org/abs/2503.21558
作者: Gaofeng Zhou,Rui-Feng Wang,Kangning Cui
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
*备注: 10 pages, 3 figures, 3 tables
点击查看摘要
Abstract:Community detection, which identifies densely connected node clusters with sparse between-group links, is vital for analyzing network structure and function in real-world systems. Most existing community detection methods based on GCNs primarily focus on node-level information while overlooking community-level features, leading to performance limitations on large-scale networks. To address this issue, we propose LQ-GCN, an overlapping community detection model from a local community perspective. LQ-GCN employs a Bernoulli-Poisson model to construct a community affiliation matrix and form an end-to-end detection framework. By adopting local modularity as the objective function, the model incorporates local community information to enhance the quality and accuracy of clustering results. Additionally, the conventional GCNs architecture is optimized to improve the model capability in identifying overlapping communities in large-scale networks. Experimental results demonstrate that LQ-GCN achieves up to a 33% improvement in Normalized Mutual Information (NMI) and a 26.3% improvement in Recall compared to baseline models across multiple real-world benchmark datasets.
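上文 [AI-9] 使用 Bernoulli-Poisson 模型构建社区隶属矩阵。下面是一个示意性的 Python 草图,给出标准 BP 模型的负对数似然(P(A_uv=1)=1-exp(-F_u·F_v)),用于说明隶属矩阵如何与邻接矩阵关联;它不是 LQ-GCN 的完整目标函数(例如不含局部模块度项)。

```python
import numpy as np

# 示意:Bernoulli-Poisson 模型假设 P(A_uv = 1) = 1 - exp(-F_u · F_v),
# 其中 F 的第 u 行是节点 u 对各社区的非负隶属强度。
# 下面是标准 BP 负对数似然,并非 LQ-GCN 的完整目标(不含局部模块度项)。

def bp_neg_log_likelihood(F: np.ndarray, A: np.ndarray, eps: float = 1e-9) -> float:
    S = F @ F.T                                    # 两两节点的隶属内积
    np.fill_diagonal(S, 0.0)                       # 忽略自环
    p_edge = 1.0 - np.exp(-S)                      # BP 边概率
    ll_edges = np.log(p_edge + eps)[A > 0].sum()   # 观测到的边
    ll_nonedges = -S[A == 0].sum()                 # log P(无边) = -F_u · F_v
    return -(ll_edges + ll_nonedges)

# 玩具示例:两个各含 3 个节点的明显社区
A = np.zeros((6, 6))
A[:3, :3] = 1
A[3:, 3:] = 1
np.fill_diagonal(A, 0)
F_good = np.repeat(np.eye(2), 3, axis=0) * 2.0     # 每个节点强隶属于其中一个社区
F_flat = np.full((6, 2), 0.5)                      # 无信息的均匀隶属
print(bp_neg_log_likelihood(F_good, A) < bp_neg_log_likelihood(F_flat, A))  # True
```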
[AI-10] MONO2REST: Identifying and Exposing Microservices: a Reusable RESTification Approach
链接: https://arxiv.org/abs/2503.21522
作者: Matthéo Lecrivain,Hanifa Barry,Dalila Tamzalit,Houari Sahraoui
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The microservices architectural style has become the de facto standard for large-scale cloud applications, offering numerous benefits in scalability, maintainability, and deployment flexibility. Many organizations are pursuing the migration of legacy monolithic systems to a microservices architecture. However, this process is challenging, risky, time-intensive, and prone to failure, while many organizations lack the necessary financial resources, time, or expertise to set up this migration process. So, rather than trying to migrate a legacy system where migration is risky or not feasible, we suggest exposing it as a microservice application without having to migrate it. In this paper, we present a reusable, automated, two-phase approach that combines evolutionary algorithms with machine learning techniques. In the first phase, we identify microservices at the method level using a multi-objective genetic algorithm that considers both structural and semantic dependencies between methods. In the second phase, we generate REST APIs for each identified microservice using a classification algorithm to assign HTTP methods and endpoints. We evaluated our approach with a case study on the Spring PetClinic application, which has both monolithic and microservices implementations that serve as ground truth for comparison. Results demonstrate that our approach successfully aligns identified microservices with those in the reference microservices implementation, highlighting its effectiveness in service identification and API generation.
[AI-11] Adaptive Resampling with Bootstrap for Noisy Multi-Objective Optimization Problems
链接: https://arxiv.org/abs/2503.21495
作者: Timo Budszuhn,Mark Joachim Krallmann,Daniel Horn
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注: 14 pages. 5 figures
点击查看摘要
Abstract:The challenge of noisy multi-objective optimization lies in the constant trade-off between exploring new decision points and improving the precision of known points through resampling. This decision should take into account both the variability of the objective functions and the current estimate of a point in relation to the Pareto front. Since the amount and distribution of noise are generally unknown, it is desirable for a decision function to be highly adaptive to the properties of the optimization problem. This paper presents a resampling decision function that incorporates the stochastic nature of the optimization problem by using bootstrapping and the probability of dominance. The distribution-free estimation of the probability of dominance is achieved using bootstrap estimates of the means. To make the procedure applicable even with very few observations, we transfer the distribution observed at other decision points. The efficiency of this resampling approach is demonstrated by applying it in the NSGA-II algorithm with a sequential resampling procedure under multiple noise variations.
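上文 [AI-11] 的关键量是"支配概率(probability of dominance)"的无分布估计。下面是一个示意性的 Python 草图:对两个决策点的带噪目标观测分别做 bootstrap 均值重采样,统计一方 Pareto 支配另一方的频率(此处假设两个目标均为最小化);数据为合成示例。

```python
import numpy as np

# 示意:用 bootstrap 均值重采样做"支配概率"的无分布估计,
# 假设两个目标均为最小化;数据为合成示例。

rng = np.random.default_rng(42)

def dominates(a: np.ndarray, b: np.ndarray) -> bool:
    """最小化意义下的 Pareto 支配:处处不劣且至少一处更优。"""
    return bool(np.all(a <= b) and np.any(a < b))

def prob_dominance(samples_a: np.ndarray, samples_b: np.ndarray, n_boot: int = 2000) -> float:
    """samples_*: 形状为 (观测数, 目标数) 的带噪目标评估值。"""
    count = 0
    for _ in range(n_boot):
        mean_a = samples_a[rng.integers(0, len(samples_a), len(samples_a))].mean(axis=0)
        mean_b = samples_b[rng.integers(0, len(samples_b), len(samples_b))].mean(axis=0)
        count += dominates(mean_a, mean_b)
    return count / n_boot

# 玩具示例:A 在两个目标上都明显更优,但观测带噪声
samples_a = rng.normal(loc=[1.0, 2.0], scale=0.5, size=(10, 2))
samples_b = rng.normal(loc=[1.5, 2.5], scale=0.5, size=(10, 2))
print(prob_dominance(samples_a, samples_b))   # 随证据累积应接近 1
```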
[AI-12] he Procedural Content Generation Benchmark: An Open-source Testbed for Generative Challenges in Games
链接: https://arxiv.org/abs/2503.21474
作者: Ahmed Khalifa,Roberto Gallotta,Matthew Barthet,Antonios Liapis,Julian Togelius,Georgios N. Yannakakis
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 12 pages, 4 figures, 2 tables, published at FDG2025
点击查看摘要
Abstract:This paper introduces the Procedural Content Generation Benchmark for evaluating generative algorithms on different game content creation tasks. The benchmark comes with 12 game-related problems with multiple variants on each problem. Problems vary from creating levels of different kinds to creating rule sets for simple arcade games. Each problem has its own content representation, control parameters, and evaluation metrics for quality, diversity, and controllability. This benchmark is intended as a first step towards a standardized way of comparing generative algorithms. We use the benchmark to score three baseline algorithms: a random generator, an evolution strategy, and a genetic algorithm. Results show that some problems are easier to solve than others, as well as the impact the chosen objective has on quality, diversity, and controllability of the generated artifacts.
[AI-13] Unveiling Latent Information in Transaction Hashes: Hypergraph Learning for Ethereum Ponzi Scheme Detection
链接: https://arxiv.org/abs/2503.21463
作者: Junhao Wu,Yixin Yang,Chengxiang Jin,Silu Mu,Xiaolei Qian,Jiajun Zhou,Shanqing Yu,Qi Xuan
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the widespread adoption of Ethereum, financial frauds such as Ponzi schemes have become increasingly rampant in the blockchain ecosystem, posing significant threats to the security of account assets. Existing Ethereum fraud detection methods typically model account transactions as graphs, but this approach primarily focuses on binary transactional relationships between accounts, failing to adequately capture the complex multi-party interaction patterns inherent in Ethereum. To address this, we propose a hypergraph modeling method for Ponzi scheme detection in Ethereum, called HyperDet. Specifically, we treat transaction hashes as hyperedges that connect all the relevant accounts involved in a transaction. Additionally, we design a two-step hypergraph sampling strategy to significantly reduce computational complexity. Furthermore, we introduce a dual-channel detection module, including the hypergraph detection channel and the hyper-homo graph detection channel, to be compatible with existing detection methods. Experimental results show that, compared to traditional homogeneous graph-based methods, the hyper-homo graph detection channel achieves significant performance improvements, demonstrating the superiority of hypergraphs in Ponzi scheme detection. This research offers innovations for modeling complex relationships in blockchain data.
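上文 [AI-13] 把交易哈希视作连接交易中所有相关账户的超边。下面是一个示意性的 Python 草图,演示如何从交易记录构建超边与账户-超边关联(incidence)结构;交易数据为虚构示例,不包含论文中的两步采样与双通道检测模块。

```python
from collections import defaultdict

# 示意:把每个交易哈希当作一条超边,连接该笔交易涉及的全部账户。
# 交易记录为虚构示例,不包含论文中的两步采样与双通道检测模块。

transactions = [
    {"hash": "0xaaa", "accounts": ["0xalice", "0xponzi", "0xbob"]},
    {"hash": "0xbbb", "accounts": ["0xcarol", "0xponzi"]},
    {"hash": "0xccc", "accounts": ["0xalice", "0xcarol", "0xdave", "0xponzi"]},
]

hyperedges = {}                       # 交易哈希 -> 账户集合(即一条超边)
incidence = defaultdict(set)          # 账户 -> 其参与的超边集合

for tx in transactions:
    members = set(tx["accounts"])
    hyperedges[tx["hash"]] = members
    for acct in members:
        incidence[acct].add(tx["hash"])

# 超图中的账户度数(参与的超边数量)
degrees = {acct: len(edges) for acct, edges in incidence.items()}
print(max(degrees.items(), key=lambda kv: kv[1]))   # ('0xponzi', 3)
```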
[AI-14] Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models
链接: https://arxiv.org/abs/2503.21435
作者: Ruizhou Li,Haiyun Jiang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs), as the dominant paradigm for graph-structured learning, have long faced dual challenges of exponentially escalating computational complexity and inadequate cross-scenario generalization capability. With the rapid advancement of multimodal learning, Vision-Language Models (VLMs) have demonstrated exceptional cross-modal relational reasoning capabilities and generalization capacities, thereby opening up novel pathways for overcoming the inherent limitations of conventional graph learning paradigms. However, current research predominantly concentrates on investigating the single-graph reasoning capabilities of VLMs, which fundamentally fails to address the critical requirement for coordinated reasoning across multiple heterogeneous graph data in real-world application scenarios. To address these limitations, we propose the first multi-graph joint reasoning benchmark for VLMs. Our benchmark encompasses four graph categories: knowledge graphs, flowcharts, mind maps, and route maps, with each graph group accompanied by three progressively challenging instruction-response pairs. Leveraging this benchmark, we conducted comprehensive capability assessments of state-of-the-art VLMs and performed fine-tuning on open-source models. This study not only addresses the underexplored evaluation gap in multi-graph reasoning for VLMs but also empirically validates their generalization superiority in graph-structured learning.
[AI-15] Neuroplasticity in Artificial Intelligence – An Overview and Inspirations on Drop In Out Learning
链接: https://arxiv.org/abs/2503.21419
作者: Yupei Li,Manuel Milling,Björn W. Schuller
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models for many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity, in addition to the more widespread neuroapoptosis, have largely been ignored in DNN architecture design. Instead, contemporary AI development predominantly focuses on constructing advanced frameworks, such as large language models, which retain a static structure of neural connections during training and inference. In this light, we explore how neurogenesis, neuroapoptosis, and neuroplasticity can inspire future AI advances. Specifically, we examine analogous activities in artificial NNs, introducing the concepts of "dropin" for neurogenesis and revisiting "dropout" and structural pruning for neuroapoptosis. We additionally suggest neuroplasticity combining the two for future large NNs in "life-long learning" settings following the biological inspiration. We conclude by advocating for greater research efforts in this interdisciplinary domain and identifying promising directions for future exploration.
[AI-16] Federated Intelligence: When Large AI Models Meet Federated Fine-Tuning and Collaborative Reasoning at the Network Edge
链接: https://arxiv.org/abs/2503.21412
作者: Wanli Ni,Haofeng Sun,Huiqing Ao,Hui Tian
类目: Artificial Intelligence (cs.AI)
*备注: 8 pages, 6 figures
点击查看摘要
Abstract:Large artificial intelligence (AI) models exhibit remarkable capabilities in various application scenarios, but deploying them at the network edge poses significant challenges due to issues such as data privacy, computational resources, and latency. In this paper, we explore federated fine-tuning and collaborative reasoning techniques to facilitate the implementation of large AI models in resource-constrained wireless networks. Firstly, promising applications of large AI models within specific domains are discussed. Subsequently, federated fine-tuning methods are proposed to adapt large AI models to specific tasks or environments at the network edge, effectively addressing the challenges associated with communication overhead and enhancing communication efficiency. These methodologies follow clustered, hierarchical, and asynchronous paradigms to effectively tackle privacy issues and eliminate data silos. Furthermore, to enhance operational efficiency and reduce latency, efficient frameworks for model collaborative reasoning are developed, which include decentralized horizontal collaboration, cloud-edge-end vertical collaboration, and multi-access collaboration. Next, simulation results demonstrate the effectiveness of our proposed methods in reducing the fine-tuning loss of large AI models across various downstream tasks. Finally, several open challenges and research opportunities are outlined.
[AI-17] Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey Framework and Roadmap
链接: https://arxiv.org/abs/2503.21411
作者: Tong Nie,Jian Sun,Wei Ma
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Modern transportation systems face pressing challenges due to increasing demand, dynamic environments, and heterogeneous information integration. The rapid evolution of Large Language Models (LLMs) offers transformative potential to address these challenges. Extensive knowledge and high-level capabilities derived from pretraining evolve the default role of LLMs as text generators to become versatile, knowledge-driven task solvers for intelligent transportation systems. This survey first presents LLM4TR, a novel conceptual framework that systematically categorizes the roles of LLMs in transportation into four synergetic dimensions: information processors, knowledge encoders, component generators, and decision facilitators. Through a unified taxonomy, we systematically elucidate how LLMs bridge fragmented data pipelines, enhance predictive analytics, simulate human-like reasoning, and enable closed-loop interactions across sensing, learning, modeling, and managing tasks in transportation systems. For each role, our review spans diverse applications, from traffic prediction and autonomous driving to safety analytics and urban mobility optimization, highlighting how emergent capabilities of LLMs such as in-context learning and step-by-step reasoning can enhance the operation and management of transportation systems. We further curate practical guidance, including available resources and computational guidelines, to support real-world deployment. By identifying challenges in existing LLM-based solutions, this survey charts a roadmap for advancing LLM-driven transportation research, positioning LLMs as central actors in the next generation of cyber-physical-social mobility ecosystems. Online resources can be found in the project page: this https URL.
[AI-18] Neuro-Symbolic Imitation Learning: Discovering Symbolic Abstractions for Skill Learning ICRA
链接: https://arxiv.org/abs/2503.21406
作者: Leon Keller,Daniel Tanneberg,Jan Peters
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
*备注: IEEE International Conference on Robotics and Automation (ICRA) 2025
点击查看摘要
Abstract:Imitation learning is a popular method for teaching robots new behaviors. However, most existing methods focus on teaching short, isolated skills rather than long, multi-step tasks. To bridge this gap, imitation learning algorithms must not only learn individual skills but also an abstract understanding of how to sequence these skills to perform extended tasks effectively. This paper addresses this challenge by proposing a neuro-symbolic imitation learning framework. Using task demonstrations, the system first learns a symbolic representation that abstracts the low-level state-action space. The learned representation decomposes a task into easier subtasks and allows the system to leverage symbolic planning to generate abstract plans. Subsequently, the system utilizes this task decomposition to learn a set of neural skills capable of refining abstract plans into actionable robot commands. Experimental results in three simulated robotic environments demonstrate that, compared to baselines, our neuro-symbolic approach increases data efficiency, improves generalization capabilities, and facilitates interpretability.
[AI-19] HybridoNet-Adapt: A Domain-Adapted Framework for Accurate Lithium-Ion Battery RUL Prediction
链接: https://arxiv.org/abs/2503.21392
作者: Khoa Tran,Bao Huynh,Tri Le,Lam Pham,Vy-Rin Nguyen
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Accurate prediction of the remaining useful life (RUL) in Lithium-ion battery (LIB) health management systems is crucial for ensuring reliability and safety. Current methods typically assume that training and testing data share the same distribution, overlooking the benefits of incorporating diverse data sources to enhance model performance. To address this limitation, we introduce a data-independent RUL prediction framework along with its domain adaptation (DA) approach, which leverages heterogeneous data sources for improved target predictions. Our approach integrates comprehensive data preprocessing, including feature extraction, denoising, and normalization, with a data-independent prediction model that combines Long Short-Term Memory (LSTM), Multihead Attention, and a Neural Ordinary Differential Equation (NODE) block, termed HybridoNet. The domain-adapted version, HybridoNet Adapt, is trained using a novel technique inspired by the Domain-Adversarial Neural Network (DANN) framework, a regression ensemble method, and Maximum Mean Discrepancy (MMD) to learn domain-invariant features from labeled cycling data in the source and target domains. Experimental results demonstrate that our approach outperforms state-of-the-art techniques, providing reliable RUL predictions for real-world applications.
[AI-20] Investigating the Duality of Interpretability and Explainability in Machine Learning
链接: https://arxiv.org/abs/2503.21356
作者: Moncef Garouani,Josiane Mothe,Ayah Barhrhouj,Julien Aligon
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The rapid evolution of machine learning (ML) has led to the widespread adoption of complex “black box” models, such as deep neural networks and ensemble methods. These models exhibit exceptional predictive performance, making them invaluable for critical decision-making across diverse domains within society. However, their inherently opaque nature raises concerns about transparency and interpretability, making them untrustworthy decision support systems. To alleviate such a barrier to high-stakes adoption, research community focus has been on developing methods to explain black box models as a means to address the challenges they pose. Efforts are focused on explaining these models instead of developing ones that are inherently interpretable. Designing inherently interpretable models from the outset, however, can pave the path towards responsible and beneficial applications in the field of ML. In this position paper, we clarify the chasm between explaining black boxes and adopting inherently interpretable models. We emphasize the imperative need for model interpretability and, following the purpose of attaining better (i.e., more effective or efficient w.r.t. predictive performance) and trustworthy predictors, provide an experimental evaluation of latest hybrid learning methods that integrates symbolic knowledge into neural network predictors. We demonstrate how interpretable hybrid models could potentially supplant black box ones in different domains.
[AI-21] Using large language models to produce literature reviews: Usages and systematic biases of microphysics parametrizations in 2699 publications
链接: https://arxiv.org/abs/2503.21352
作者: Tianhang Zhang,Shengnan Fu,David M. Schultz,Zhonghua Zheng
类目: Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Large language models afford opportunities for using computers for intensive tasks, realizing research opportunities that have not been considered before. One such opportunity could be a systematic interrogation of the scientific literature. Here, we show how a large language model can be used to construct a literature review of 2699 publications associated with microphysics parametrizations in the Weather and Research Forecasting (WRF) model, with the goal of learning how they were used and their systematic biases, when simulating precipitation. The database was constructed of publications identified from Web of Science and Scopus searches. The large language model GPT-4 Turbo was used to extract information about model configurations and performance from the text of 2699 publications. Our results reveal the landscape of how nine of the most popular microphysics parameterizations have been used around the world: Lin, Ferrier, WRF Single-Moment, Goddard Cumulus Ensemble, Morrison, Thompson, and WRF Double-Moment. More studies used one-moment parameterizations before 2020 and two-moment parameterizations after 2020. Seven out of nine parameterizations tended to overestimate precipitation. However, systematic biases of parameterizations differed in various regions. Except simulations using the Lin, Ferrier, and Goddard parameterizations that tended to underestimate precipitation over almost all locations, the remaining six parameterizations tended to overestimate, particularly over China, southeast Asia, western United States, and central Africa. This method could be used by other researchers to help understand how the increasingly massive body of scientific literature can be harnessed through the power of artificial intelligence to solve their research problems.
[AI-22] Residual Learning Inspired Crossover Operator and Strategy Enhancements for Evolutionary Multitasking
链接: https://arxiv.org/abs/2503.21347
作者: Ruilin Wang,Xiang Feng,Huiqun Yu,Edmund M-K Lai
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
*备注: 9 pages, 4 figures
点击查看摘要
Abstract:In evolutionary multitasking, strategies such as crossover operators and skill factor assignment are critical for effective knowledge transfer. Existing improvements to crossover operators primarily focus on low-dimensional variable combinations, such as arithmetic crossover or partially mapped crossover, which are insufficient for modeling complex high-dimensional interactions. Moreover, static or semi-dynamic crossover strategies fail to adapt to the dynamic dependencies among tasks. In addition, current Multifactorial Evolutionary Algorithm frameworks often rely on fixed skill factor assignment strategies, lacking flexibility. To address these limitations, this paper proposes the Multifactorial Evolutionary Algorithm-Residual Learning (MFEA-RL) method based on residual learning. The method employs a Very Deep Super-Resolution (VDSR) model to generate high-dimensional residual representations of individuals, enhancing the modeling of complex relationships within dimensions. A ResNet-based mechanism dynamically assigns skill factors to improve task adaptability, while a random mapping mechanism efficiently performs crossover operations and mitigates the risk of negative transfer. Theoretical analysis and experimental results show that MFEA-RL outperforms state-of-the-art multitasking algorithms. It excels in both convergence and adaptability on standard evolutionary multitasking benchmarks, including CEC2017-MTSO and WCCI2020-MTSO. Additionally, its effectiveness is validated through a real-world application scenario.
[AI-23] A 71.2-μW Speech Recognition Accelerator with Recurrent Spiking Neural Network
链接: https://arxiv.org/abs/2503.21337
作者: Chih-Chyau Yang,Tian-Sheuan Chang
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:This paper introduces a 71.2-μW speech recognition accelerator designed for edge devices’ real-time applications, emphasizing an ultra low power design. Achieved through algorithm and hardware co-optimizations, we propose a compact recurrent spiking neural network with two recurrent layers, one fully connected layer, and a low time step (1 or 2). The 2.79-MB model undergoes pruning and 4-bit fixed-point quantization, shrinking it by 96.42% to 0.1 MB. On the hardware front, we take advantage of mixed-level pruning, zero-skipping and merged spike techniques, reducing complexity by 90.49% to 13.86 MMAC/S. The parallel time-step execution addresses inter-time-step data dependencies and enables weight buffer power savings through weight sharing. Capitalizing on the sparse spike activity, an input broadcasting scheme eliminates zero computations, further saving power. Implemented on the TSMC 28-nm process, the design operates in real time at 100 kHz, consuming 71.2 μW, surpassing state-of-the-art designs. At 500 MHz, it has 28.41 TOPS/W and 1903.11 GOPS/mm² in energy and area efficiency, respectively.
[AI-24] A Low-Power Streaming Speech Enhancement Accelerator For Edge Devices
链接: https://arxiv.org/abs/2503.21335
作者: Ci-Hao Wu,Tian-Sheuan Chang
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
*备注:
点击查看摘要
Abstract:Transformer-based speech enhancement models yield impressive results. However, their heterogeneous and complex structure restricts model compression potential, resulting in greater complexity and reduced hardware efficiency. Additionally, these models are not tailored for streaming and low-power applications. Addressing these challenges, this paper proposes a low-power streaming speech enhancement accelerator through model and hardware optimization. The proposed high-performance model is optimized for hardware execution with the co-design of model compression and target application, reducing the model size by 93.9% through the proposed domain-aware and streaming-aware pruning techniques. The required latency is further reduced with batch normalization-based transformers. Additionally, we employed softmax-free attention, complemented by an extra batch normalization, facilitating simpler hardware design. The tailored hardware accommodates these diverse computing patterns by breaking them down into element-wise multiplication and accumulation (MAC). This is achieved through a 1-D processing array, utilizing configurable SRAM addressing, thereby minimizing hardware complexities and simplifying zero skipping. Using the TSMC 40-nm CMOS process, the final implementation requires merely 207.8K gates and 53.75 KB SRAM. It consumes only 8.08 mW for real-time inference at a 62.5 MHz frequency.
[AI-25] HyperGraphRAG : Retrieval-Augmented Generation with Hypergraph-Structured Knowledge Representation
链接: https://arxiv.org/abs/2503.21322
作者: Haoran Luo,Haihong E,Guanting Chen,Yandan Zheng,Xiaobao Wu,Yikai Guo,Qika Lin,Yu Feng,Zemin Kuang,Meina Song,Yifan Zhu,Luu Anh Tuan
类目: Artificial Intelligence (cs.AI)
*备注: Preprint
点击查看摘要
Abstract:While standard Retrieval-Augmented Generation (RAG) is based on chunks, GraphRAG structures knowledge as graphs to leverage the relations among entities. However, previous GraphRAG methods are limited by binary relations: one edge in the graph only connects two entities, which cannot well model the n-ary relations among more than two entities that widely exist in reality. To address this limitation, we propose HyperGraphRAG, a novel hypergraph-based RAG method that represents n-ary relational facts via hyperedges, modeling the complicated n-ary relations in the real world. To retrieve and generate over hypergraphs, we introduce a complete pipeline with a hypergraph construction method, a hypergraph retrieval strategy, and a hypergraph-guided generation mechanism. Experiments across medicine, agriculture, computer science, and law demonstrate that HyperGraphRAG outperforms standard RAG and GraphRAG in accuracy and generation quality.
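上文 [AI-25] 用超边表示 n 元关系事实。下面是一个示意性的 Python 草图:把一条 n 元事实存成"实体集合 + 文本"的超边,并按查询实体与超边实体的交集大小做简单检索;事实内容与打分方式均为示意,并非论文的构建与检索流程。

```python
# 示意:把一条 n 元事实存成"实体集合 + 文本"的超边,
# 检索时按查询实体与超边实体的交集大小排序。
# 事实内容与打分方式均为示意,并非论文的构建与检索流程。

facts = [
    {"text": "Drug D treats disease X in patients over 65 at dose 5 mg",
     "entities": {"Drug D", "disease X", "patients over 65", "5 mg"}},
    {"text": "Drug D interacts with drug E causing hypotension",
     "entities": {"Drug D", "drug E", "hypotension"}},
]

def retrieve(query_entities, facts, top_k=1):
    """按超边(事实)覆盖的查询实体数量排序并返回文本。"""
    scored = [(len(query_entities & f["entities"]), f["text"]) for f in facts]
    scored.sort(key=lambda s: -s[0])
    return [text for score, text in scored[:top_k] if score > 0]

print(retrieve({"Drug D", "disease X"}, facts))
# ['Drug D treats disease X in patients over 65 at dose 5 mg']
```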
[AI-26] DeBackdoor: A Deductive Framework for Detecting Backdoor Attacks on Deep Models with Limited Data
链接: https://arxiv.org/abs/2503.21305
作者: Dorde Popovic,Amin Sadeghi,Ting Yu,Sanjay Chawla,Issa Khalil
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Backdoor attacks are among the most effective, practical, and stealthy attacks in deep learning. In this paper, we consider a practical scenario where a developer obtains a deep model from a third party and uses it as part of a safety-critical system. The developer wants to inspect the model for potential backdoors prior to system deployment. We find that most existing detection techniques make assumptions that are not applicable to this scenario. In this paper, we present a novel framework for detecting backdoors under realistic restrictions. We generate candidate triggers by deductively searching over the space of possible triggers. We construct and optimize a smoothed version of Attack Success Rate as our search objective. Starting from a broad class of template attacks and just using the forward pass of a deep model, we reverse engineer the backdoor attack. We conduct extensive evaluation on a wide range of attacks, models, and datasets, with our technique performing almost perfectly across these settings.
[AI-27] Reinforced Model Merging
链接: https://arxiv.org/abs/2503.21272
作者: Jiaqi Han,Jingwen Ye,Shunyu Liu,Haofei Zhang,Jie Song,Zunlei Feng,Mingli Song
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The success of large language models has garnered widespread attention for model merging techniques, especially training-free methods which combine model capabilities within the parameter space. However, two challenges remain: (1) uniform treatment of all parameters leads to performance degradation; (2) search-based algorithms are often inefficient. In this paper, we present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. These components interact to execute layer-wise merging actions, aiming to search the optimal merging architecture. Notably, RMM operates without any gradient computations on the original models, rendering it feasible for edge devices. Furthermore, by utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times. Extensive experiments demonstrate that RMM achieves state-of-the-art performance across various vision and NLP datasets and effectively overcomes the limitations of the existing baseline methods. Our code is available at this https URL.
[AI-28] OminiAdapt: Learning Cross-Task Invariance for Robust and Environment-Aware Robotic Manipulation
链接: https://arxiv.org/abs/2503.21257
作者: Yongxu Wang,Weiyun Yi,Xinhao Kong,Wanting Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:With the rapid development of embodied intelligence, leveraging large-scale human data for high-level imitation learning on humanoid robots has become a focal point of interest in both academia and industry. However, applying humanoid robots to precision operation domains remains challenging due to the complexities they face in perception and control processes, the long-standing physical differences in morphology and actuation mechanisms between humanoid robots and humans, and the lack of task-relevant features obtained from egocentric vision. To address the issue of covariate shift in imitation learning, this paper proposes an imitation learning algorithm tailored for humanoid robots. By focusing on the primary task objectives, filtering out background information, and incorporating channel feature fusion with spatial attention mechanisms, the proposed algorithm suppresses environmental disturbances and utilizes a dynamic weight update strategy to significantly improve the success rate of humanoid robots in accomplishing target tasks. Experimental results demonstrate that the proposed method exhibits robustness and scalability across various typical task scenarios, providing new ideas and approaches for autonomous learning and control in humanoid robots. The project will be open-sourced on GitHub.
[AI-29] Dual-Splitting Conformal Prediction for Multi-Step Time Series Forecasting
链接: https://arxiv.org/abs/2503.21251
作者: Qingdi Yu,Zhiwei Cao,Ruihang Wang,Zhen Yang,Lijun Deng,Min Hu,Yong Luo,Xin Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: 28 pages, 13 figures, 3 tables. Submitted to Applied Soft Computing. With Editor This is the first public release of the work
点击查看摘要
Abstract:Time series forecasting is crucial for applications like resource scheduling and risk management, where multi-step predictions provide a comprehensive view of future trends. Uncertainty Quantification (UQ) is a mainstream approach for addressing forecasting uncertainties, with Conformal Prediction (CP) gaining attention due to its model-agnostic nature and statistical guarantees. However, most variants of CP are designed for single-step predictions and face challenges in multi-step scenarios, such as reliance on real-time data and limited scalability. This highlights the need for CP methods specifically tailored to multi-step forecasting. We propose the Dual-Splitting Conformal Prediction (DSCP) method, a novel CP approach designed to capture inherent dependencies within time-series data for multi-step forecasting. Experimental results on real-world datasets from four different domains demonstrate that the proposed DSCP significantly outperforms existing CP variants in terms of the Winkler Score, achieving a performance improvement of up to 23.59% compared to state-of-the-art methods. Furthermore, we deployed the DSCP approach for renewable energy generation and IT load forecasting in power management of a real-world trajectory-based application, achieving an 11.25% reduction in carbon emissions through predictive optimization of data center operations and controls.
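上文 [AI-29] 的 DSCP 面向多步预测场景。下面给出其所基于的朴素 split conformal 基线的 Python 示意草图:在校准集上按预测步长分别计算残差分位数,得到逐步长的预测区间;这只是通用做法,并非论文提出的双切分方案本身。

```python
import numpy as np

# 示意:多步预测的朴素 split conformal 基线,在校准集上
# 按预测步长分别计算残差分位数;并非论文的双切分(DSCP)方案。

def conformal_intervals(y_cal, yhat_cal, yhat_test, alpha=0.1):
    """
    y_cal, yhat_cal: 校准集上的真实值与点预测,形状 (n_cal, H)
    yhat_test:       待包裹区间的点预测,形状 (n_test, H)
    返回形状 (n_test, H) 的上下界。
    """
    n_cal, H = y_cal.shape
    residuals = np.abs(y_cal - yhat_cal)                      # 逐步长的不一致性分数
    q_level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)  # 有限样本修正
    q = np.quantile(residuals, q_level, axis=0)               # 每个步长一个半径
    return yhat_test - q, yhat_test + q

# 合成的 3 步预测示例
rng = np.random.default_rng(1)
y_cal = rng.normal(size=(200, 3))
yhat_cal = y_cal + rng.normal(scale=0.3, size=(200, 3))
lower, upper = conformal_intervals(y_cal, yhat_cal, yhat_test=np.zeros((1, 3)))
print(upper - lower)   # 各步长的区间宽度
```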
[AI-30] Improving (α f)-Byzantine Resilience in Federated Learning via layerwise aggregation and cosine distance
链接: https://arxiv.org/abs/2503.21244
作者: Mario García-Márquez,Nuria Rodríguez-Barroso,M.Victoria Luzón,Francisco Herrera
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Submitted to Knowledge-Based Systems
点击查看摘要
Abstract:The rapid development of artificial intelligence systems has amplified societal concerns regarding their usage, necessitating regulatory frameworks that encompass data privacy. Federated Learning (FL) is posed as a potential solution to data privacy challenges in distributed machine learning by enabling collaborative model training without data sharing. However, FL systems remain vulnerable to Byzantine attacks, where malicious nodes contribute corrupted model updates. While Byzantine-resilient operators have emerged as widely adopted robust aggregation algorithms to mitigate these attacks, their efficacy diminishes significantly in high-dimensional parameter spaces, sometimes leading to poorly performing models. This paper introduces Layerwise Cosine Aggregation, a novel aggregation scheme designed to enhance the robustness of these rules in such high-dimensional settings while preserving computational efficiency. A theoretical analysis is presented, demonstrating the superior robustness of the proposed Layerwise Cosine Aggregation compared to original robust aggregation operators. Empirical evaluation across diverse image classification datasets, under varying data distributions and Byzantine attack scenarios, consistently demonstrates the improved performance of Layerwise Cosine Aggregation, achieving up to a 16% increase in model accuracy.
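上文 [AI-30] 提出逐层(layerwise)结合余弦距离的聚合规则。下面是一个示意性的 Python 草图:对每一层,以坐标中位数作为参考方向,剔除余弦相似度最低的客户端更新后再求平均;具体组合规则与论文可能不同,仅用于说明"逐层 + 余弦"的思路。

```python
import numpy as np

# 示意:逐层聚合 + 余弦距离过滤。对每一层,以坐标中位数为参考方向,
# 丢弃与其余弦相似度最低的若干客户端更新后再取平均。
# 具体组合规则与论文可能不同,仅用于说明"逐层 + 余弦"的思路。

def cosine(a, b, eps=1e-12):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def layerwise_cosine_aggregate(client_updates, n_discard=1):
    """client_updates: 列表,元素为 {层名: 展平的更新向量} 字典。"""
    aggregated = {}
    for name in client_updates[0]:
        vecs = np.stack([u[name] for u in client_updates])    # (客户端数, 维度)
        reference = np.median(vecs, axis=0)                   # 稳健的参考方向
        sims = np.array([cosine(v, reference) for v in vecs])
        keep = np.argsort(sims)[n_discard:]                   # 丢弃最不一致的客户端
        aggregated[name] = vecs[keep].mean(axis=0)
    return aggregated

# 玩具示例:4 个诚实客户端 + 1 个翻转符号的拜占庭客户端
honest = [{"fc": np.ones(8) + 0.1 * np.random.randn(8)} for _ in range(4)]
byzantine = [{"fc": -5.0 * np.ones(8)}]
agg = layerwise_cosine_aggregate(honest + byzantine, n_discard=1)
print(np.round(agg["fc"], 2))   # 仍接近诚实客户端的全 1 方向
```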
[AI-31] Feature-Enhanced Machine Learning for All-Cause Mortality Prediction in Healthcare Data
链接: https://arxiv.org/abs/2503.21241
作者: HyeYoung Lee,Pavel Tsoi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Accurate patient mortality prediction enables effective risk stratification, leading to personalized treatment plans and improved patient outcomes. However, predicting mortality in healthcare remains a significant challenge, with existing studies often focusing on specific diseases or limited predictor sets. This study evaluates machine learning models for all-cause in-hospital mortality prediction using the MIMIC-III database, employing a comprehensive feature engineering approach. Guided by clinical expertise and literature, we extracted key features such as vital signs (e.g., heart rate, blood pressure), laboratory results (e.g., creatinine, glucose), and demographic information. The Random Forest model achieved the highest performance with an AUC of 0.94, significantly outperforming other machine learning and deep learning approaches. This demonstrates Random Forest’s robustness in handling high-dimensional, noisy clinical data and its potential for developing effective clinical decision support tools. Our findings highlight the importance of careful feature engineering for accurate mortality prediction. We conclude by discussing implications for clinical adoption and propose future directions, including enhancing model robustness and tailoring prediction models for specific diseases.
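上文 [AI-31] 的建模核心是对工程化特征训练随机森林并用 ROC AUC 评估。下面是一个可运行的 Python 示意草图,特征矩阵为合成数据,列名(心率、收缩压、肌酐、年龄)仅作为 MIMIC-III 派生特征的占位,并非论文数据。

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 示意:在工程化的表格特征上训练随机森林并用 ROC AUC 评估。
# 特征矩阵为合成数据,列名(心率、收缩压、肌酐、年龄)仅为占位,并非 MIMIC-III 数据。

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.normal(80, 15, n),     # 心率
    rng.normal(120, 20, n),    # 收缩压
    rng.normal(1.1, 0.4, n),   # 肌酐
    rng.normal(65, 16, n),     # 年龄
])
# 由特征粗略驱动的合成标签,仅为让示例可运行
logit = (0.03 * (X[:, 0] - 80) - 0.02 * (X[:, 1] - 120)
         + 1.5 * (X[:, 2] - 1.1) + 0.04 * (X[:, 3] - 65))
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0, stratify=y)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```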
[AI-32] Knowledge Graphs as World Models for Semantic Material-Aware Obstacle Handling in Autonomous Vehicles
链接: https://arxiv.org/abs/2503.21232
作者: Ayush Bheemaiah,Seungyong Yang
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:The inability of autonomous vehicles (AVs) to infer the material properties of obstacles limits their decision-making capacity. While AVs rely on sensor systems such as cameras, LiDAR, and radar to detect obstacles, this study suggests combining sensors with a knowledge graph (KG)-based world model to improve AVs’ comprehension of physical material qualities. Beyond sensor data, AVs can infer qualities such as malleability, density, and elasticity using a semantic KG that depicts the relationships between obstacles and their attributes. Using the CARLA autonomous driving simulator, we evaluated AV performance with and without KG integration. The findings demonstrate that the KG-based method improves obstacle management, which allows AVs to use material qualities to make better decisions about when to change lanes or apply emergency braking. For example, the KG-integrated AV changed lanes for hard impediments like traffic cones and successfully avoided collisions with flexible items such as plastic bags by passing over them. Compared to the control system, the KG framework demonstrated improved responsiveness to obstacles by resolving conflicting sensor data, causing emergency stops for 13.3% more cases. In addition, our method exhibits a 6.6% higher success rate in lane-changing maneuvers in experimental scenarios, particularly for larger, high-impact obstacles. While we focus particularly on autonomous driving, our work demonstrates the potential of KG-based world models to improve decision-making in embodied AI systems and scale to other domains, including robotics, healthcare, and environmental simulation.
[AI-33] Integrating Large Language Models For Monte Carlo Simulation of Chemical Reaction Networks
链接: https://arxiv.org/abs/2503.21178
作者: Sadikshya Gyawali,Ashwini Mandal,Manish Dahal,Manish Awale,Sanjay Rijal,Shital Adhikari,Vaghawan Ojha
类目: Artificial Intelligence (cs.AI)
*备注: Accepted on MadeAI 2025 Conference
点击查看摘要
Abstract:Chemical reaction networks are an important method for modeling and exploring complex biological processes, bio-chemical interactions, and the behavior of different dynamics in systems biology. However, formulating such reaction kinetics takes considerable time. In this paper, we leverage the efficiency of modern large language models to automate the stochastic Monte Carlo simulation of chemical reaction networks, enabling simulation from reaction descriptions provided in natural language. We also integrate this process into the widely used simulation tool Copasi to further ease the workflow for modelers and researchers. In this work, we show the efficacy and limitations of modern large language models in parsing and creating reaction kinetics for modelling complex chemical reaction processes.
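To make the simulation target concrete, here is a generic Gillespie-style stochastic simulation of a toy one-reaction network. This is not the authors' LLM-to-Copasi pipeline; the network, rate constant, and function names are all illustrative:

```python
# Toy Gillespie stochastic simulation of A + B -> C, illustrating the kind of
# Monte Carlo chemical-reaction-network simulation the paper automates.
import random

def gillespie(x, rate, t_max):
    """x: dict of molecule counts; single reaction A + B -> C with rate constant `rate`."""
    t, trajectory = 0.0, [(0.0, dict(x))]
    while t < t_max:
        propensity = rate * x["A"] * x["B"]
        if propensity == 0:
            break
        t += random.expovariate(propensity)      # waiting time to the next reaction
        x["A"] -= 1; x["B"] -= 1; x["C"] += 1     # fire the reaction
        trajectory.append((t, dict(x)))
    return trajectory

traj = gillespie({"A": 100, "B": 80, "C": 0}, rate=0.01, t_max=5.0)
print(traj[-1])
```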
[AI-34] Multi-Objective Optimization for Privacy-Utility Balance in Differentially Private Federated Learning
链接: https://arxiv.org/abs/2503.21159
作者: Kanishka Ranaweera,David Smith,Pubudu N. Pathirana,Ming Ding,Thierry Rakotoarivelo,Aruna Seneviratne
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated learning (FL) enables collaborative model training across distributed clients without sharing raw data, making it a promising approach for privacy-preserving machine learning. However, ensuring differential privacy (DP) in FL presents challenges due to the trade-off between model utility and privacy protection. Clipping gradients before aggregation is a common strategy to limit privacy loss, but selecting an optimal clipping norm is non-trivial, as excessively high values compromise privacy, while overly restrictive clipping degrades model performance. In this work, we propose an adaptive clipping mechanism that dynamically adjusts the clipping norm using a multi-objective optimization framework. By integrating privacy and utility considerations into the optimization objective, our approach balances privacy preservation with model accuracy. We theoretically analyze the convergence properties of our method and demonstrate its effectiveness through extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10 datasets. Our results show that adaptive clipping consistently outperforms fixed-clipping baselines, achieving improved accuracy under the same privacy constraints. This work highlights the potential of dynamic clipping strategies to enhance privacy-utility trade-offs in differentially private federated learning.
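A minimal sketch of differentially private aggregation with a dynamically adjusted clipping norm. The paper derives the norm from a multi-objective optimization; here a simple quantile-tracking heuristic stands in for that step, and all parameter names are assumptions:

```python
# Sketch: DP aggregation with per-client clipping plus an adaptive update of
# the clipping norm. The quantile-tracking rule is a stand-in for the paper's
# multi-objective optimization of the clipping norm.
import numpy as np

def dp_aggregate(client_grads, clip_norm, noise_multiplier, rng):
    clipped = []
    for g in client_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))   # per-client clipping
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(client_grads), size=mean.shape)
    return mean + noise

def update_clip_norm(clip_norm, client_grads, target_quantile=0.5, lr=0.2):
    # Move the clipping norm toward the target quantile of observed gradient norms.
    norms = np.array([np.linalg.norm(g) for g in client_grads])
    frac_below = np.mean(norms <= clip_norm)
    return clip_norm * np.exp(-lr * (frac_below - target_quantile))
```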
[AI-35] Federated Learning with Differential Privacy: An Utility-Enhanced Approach
链接: https://arxiv.org/abs/2503.21154
作者: Kanishka Ranaweera,Dinh C. Nguyen,Pubudu N. Pathirana,David Smith,Ming Ding,Thierry Rakotoarivelo,Aruna Seneviratne
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated learning has emerged as an attractive approach to protect data privacy by eliminating the need for sharing clients' data while reducing communication costs compared with centralized machine learning algorithms. However, recent studies have shown that federated learning alone does not guarantee privacy, as private data may still be inferred from the parameters uploaded to the central server. In order to successfully avoid data leakage, adopting differential privacy (DP) in either the local optimization process or the local update aggregation process has emerged as a feasible way of achieving sample-level or user-level privacy guarantees, respectively, in federated learning models. However, compared to their non-private equivalents, these approaches suffer from poor utility. To improve the privacy-utility trade-off, we present a modification to these vanilla differentially private algorithms based on a Haar wavelet transformation step and a novel noise injection scheme that significantly lowers the asymptotic bound of the noise variance. We also present a holistic convergence analysis of our proposed algorithm, showing that our method yields better convergence performance than the vanilla DP algorithms. Numerical experiments on real-world datasets demonstrate that our method outperforms existing approaches in model utility while maintaining the same privacy guarantees.
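A sketch of the transform-then-noise idea: inject Gaussian noise in a Haar wavelet basis instead of directly on the flattened update. The one-level Haar transform below is generic; the paper's exact noise allocation scheme is not reproduced:

```python
# Sketch of injecting DP noise in a Haar wavelet basis, then mapping back to
# parameter space. One-level orthonormal Haar transform; input length must be even.
import numpy as np

def haar_forward(x):
    even, odd = x[0::2], x[1::2]
    return np.concatenate([(even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)])

def haar_inverse(c):
    half = len(c) // 2
    approx, detail = c[:half], c[half:]
    x = np.empty(2 * half)
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def noisy_update(update, sigma, rng):
    coeffs = haar_forward(update)                        # move to the wavelet domain
    coeffs += rng.normal(0.0, sigma, size=coeffs.shape)  # inject calibrated noise
    return haar_inverse(coeffs)                          # back to parameter space
```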
[AI-36] A computational theory of evaluation for parameterisable subject
链接: https://arxiv.org/abs/2503.21138
作者: Hedong Yan
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Evaluation is critical to advancing decision making across domains, yet existing methodologies often struggle to balance theoretical rigor and practical scalability. In order to reduce the cost of experimental evaluation, we introduce a computational theory of evaluation for parameterisable subjects. We prove upper bounds on the generalized evaluation error and the generalized causal effect error of an evaluation metric on a subject. We also prove the efficiency and consistency of the prediction-based estimate of the causal effect of the subject on the metric. To optimize evaluation models, we propose a meta-learner to handle heterogeneous evaluation subject spaces. Compared with other computational approaches, our (conditional) evaluation model reduces evaluation errors by 24.1%-99.0% across 12 scenes, including individual medicine, scientific simulation, business activities, and quantum trade. The evaluation time is reduced by 3-7 orders of magnitude compared with experiments or simulations.
[AI-37] Optimizing Multi-DNN Inference on Mobile Devices through Heterogeneous Processor Co-Execution
链接: https://arxiv.org/abs/2503.21109
作者: Yunquan Gao,Zhiguo Zhang,Praveen Kumar Donta,Chinmaya Kumar Dehury,Xiujun Wang,Dusit Niyato,Qiyang Zhang
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
*备注: 14 pages, 12 figures, 5 tables
点击查看摘要
Abstract:Deep Neural Networks (DNNs) are increasingly deployed across diverse industries, driving demand for mobile device support. However, existing mobile inference frameworks often rely on a single processor per model, limiting hardware utilization and causing suboptimal performance and energy efficiency. Expanding DNN accessibility on mobile platforms requires adaptive, resource-efficient solutions to meet rising computational needs without compromising functionality. Parallel inference of multiple DNNs on heterogeneous processors remains challenging. Some works partition DNN operations into subgraphs for parallel execution across processors, but these often create excessive subgraphs based only on hardware compatibility, increasing scheduling complexity and memory overhead. To address this, we propose an Advanced Multi-DNN Model Scheduling (ADMS) strategy for optimizing multi-DNN inference on mobile heterogeneous processors. ADMS constructs an optimal subgraph partitioning strategy offline, balancing hardware operation support and scheduling granularity, and uses a processor-state-aware algorithm to dynamically adjust workloads based on real-time conditions. This ensures efficient workload distribution and maximizes processor utilization. Experiments show ADMS reduces multi-DNN inference latency by 4.04 times compared to vanilla frameworks.
[AI-38] Alleviating LLM-based Generative Retrieval Hallucination in Alipay Search
链接: https://arxiv.org/abs/2503.21098
作者: Yedan Shen,Kaixin Wu,Yuechen Ding,Jingyuan Wen,Hong Liu,Mingjie Zhong,Zhouhan Lin,Jia Xu,Linjian Mo
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
*备注: 4 pages
点击查看摘要
Abstract:Generative retrieval (GR) has revolutionized document retrieval with the advent of large language models (LLMs), and LLM-based GR is gradually being adopted by the industry. Despite its remarkable advantages and potential, LLM-based GR suffers from hallucination and in some instances generates documents that are irrelevant to the query, severely challenging its credibility in practical applications. We thereby propose an optimized GR framework designed to alleviate retrieval hallucination, which integrates knowledge distillation reasoning into model training and incorporates a decision agent to further improve retrieval precision. Specifically, we employ LLMs to assess and reason about GR-retrieved query-document (q-d) pairs, and then distill the reasoning data as transferred knowledge to the GR model. Moreover, we utilize a decision agent as post-processing to extend the GR-retrieved documents through a retrieval model and select the most relevant ones from multiple perspectives as the final generative retrieval result. Extensive offline experiments on real-world datasets and online A/B tests on Fund Search and Insurance Search in Alipay demonstrate our framework's superiority and effectiveness in improving search quality and conversion gains.
[AI-39] Confidence Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART): A Data-driven Active Learning Framework for Accelerating Material Discovery under Resource Constraints
链接: https://arxiv.org/abs/2503.21095
作者: Ahmed Shoyeb Raihan,Zhichao Liu,Tanveer Hossain Bhuiyan,Imtiaz Ahmed
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP)
*备注:
点击查看摘要
Abstract:Accelerating the discovery and manufacturing of advanced materials with specific properties is a critical yet formidable challenge due to vast search space, high costs of experiments, and time-intensive nature of material characterization. In recent years, active learning, where a surrogate machine learning (ML) model mimics the scientific discovery process of a human scientist, has emerged as a promising approach to address these challenges by guiding experimentation toward high-value outcomes with a limited budget. Among the diverse active learning philosophies, the concept of surprise (capturing the divergence between expected and observed outcomes) has demonstrated significant potential to drive experimental trials and refine predictive models. Scientific discovery often stems from surprise thereby making it a natural driver to guide the search process. Despite its promise, prior studies leveraging surprise metrics such as Shannon and Bayesian surprise lack mechanisms to account for prior confidence, leading to excessive exploration of uncertain regions that may not yield useful information. To address this, we propose the Confidence-Adjusted Surprise Measure for Active Resourceful Trials (CA-SMART), a novel Bayesian active learning framework tailored for optimizing data-driven experimentation. On a high level, CA-SMART incorporates Confidence-Adjusted Surprise (CAS) to dynamically balance exploration and exploitation by amplifying surprises in regions where the model is more certain while discounting them in highly uncertain areas. We evaluated CA-SMART on two benchmark functions (Six-Hump Camelback and Griewank) and in predicting the fatigue strength of steel. The results demonstrate superior accuracy and efficiency compared to traditional surprise metrics, standard Bayesian Optimization (BO) acquisition functions and conventional ML methods.
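A sketch of a confidence-adjusted surprise score: the surprise of a new observation under the surrogate's Gaussian prediction is amplified where predictive uncertainty is low and discounted where it is high. The specific weighting form below is an illustrative assumption, not the paper's exact CAS definition:

```python
# Sketch: surprise = squared z-score of an observation under the surrogate's
# Gaussian prediction; the confidence factor down-weights surprises coming
# from regions where the surrogate was already very uncertain.
def confidence_adjusted_surprise(y_obs, mu_pred, sigma_pred, sigma_ref=1.0):
    surprise = ((y_obs - mu_pred) / (sigma_pred + 1e-12)) ** 2     # Shannon-style surprise
    confidence = sigma_ref / (sigma_ref + sigma_pred)              # high when sigma_pred is small
    return confidence * surprise

# The same 2-sigma deviation scores higher where the model was confident.
print(confidence_adjusted_surprise(1.0, 0.0, 0.5))   # confident region
print(confidence_adjusted_surprise(4.0, 0.0, 2.0))   # uncertain region
```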
[AI-40] The Art of Tool Interface Design
链接: https://arxiv.org/abs/2503.21036
作者: Yunnan Wu,Paul Chen,Deshank Baranwal,Jinlong Zhou,Jian Yuan
类目: Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:We present an agentic framework, Thinker, which achieves state-of-the-art performance in challenging reasoning tasks for realistic customer service scenarios that involve complex business logic and human interactions over long horizons. On the $\tau$-bench retail dataset, Thinker achieves an 82.6% success rate with GPT-4o (version 2024-06-01) (baseline: 68.3%), and an 81.9% success rate with Llama-3.1 405B (baseline: 49.6%), without any fine-tuning. Thinker effectively closes the gap in reasoning capabilities between the base models by introducing proper structure. The key features of the Thinker framework are: (1) State-Machine Augmented Generation (SMAG), which represents business logic as state machines that the LLM uses as tools. (2) Delegation of tasks from the main reasoning loop to LLM-powered tools. (3) Adaptive context management. Our prompting-only solution achieves significant gains, while still maintaining a standard agentic architecture with a ReAct-style reasoning loop. The key is to innovate on the tool interface design, as exemplified by SMAG and the LLM-powered tools.
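To make the SMAG idea concrete, here is a minimal sketch of business logic expressed as a state machine that an agent can call as a tool. The states, transitions, and method names are hypothetical, not taken from the paper:

```python
# Minimal sketch of State-Machine Augmented Generation: business logic lives in
# an explicit state machine, and the agent queries it as a tool to learn which
# actions are currently valid and to advance the workflow.
RETURN_FLOW = {
    "start":       {"verify_identity": "identified"},
    "identified":  {"lookup_order": "order_found"},
    "order_found": {"check_return_window": "eligible", "deny": "closed"},
    "eligible":    {"issue_refund": "refunded"},
}

class StateMachineTool:
    def __init__(self, flow, state="start"):
        self.flow, self.state = flow, state

    def allowed_actions(self):
        """Exposed to the LLM as a tool call: what can be done right now?"""
        return sorted(self.flow.get(self.state, {}))

    def apply(self, action):
        """Exposed to the LLM as a tool call: take an action, move the state."""
        if action not in self.flow.get(self.state, {}):
            return f"invalid action '{action}' in state '{self.state}'"
        self.state = self.flow[self.state][action]
        return f"ok, now in state '{self.state}'"

sm = StateMachineTool(RETURN_FLOW)
print(sm.allowed_actions())          # ['verify_identity']
print(sm.apply("verify_identity"))   # ok, now in state 'identified'
```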
[AI-41] Improving User Behavior Prediction: Leveraging Annotator Metadata in Supervised Machine Learning Models
链接: https://arxiv.org/abs/2503.21000
作者: Lynnette Hui Xian Ng,Kokil Jaidka,Kaiyuan Tay,Hansin Ahuja,Niyati Chhaya
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注: Accepted at CSCW 2025
点击查看摘要
Abstract:Supervised machine-learning models often underperform in predicting user behaviors from conversational text, hindered by poor crowdsourced label quality and low NLP task accuracy. We introduce the Metadata-Sensitive Weighted-Encoding Ensemble Model (MSWEEM), which integrates annotator meta-features like fatigue and speeding. First, our results show MSWEEM outperforms standard ensembles by 14% on held-out data and 12% on an alternative dataset. Second, we find that incorporating signals of annotator behavior, such as speed and fatigue, significantly boosts model performance. Third, we find that annotators with higher qualifications, such as Master’s, deliver more consistent and faster annotations. Given the increasing uncertainty over annotation quality, our experiments show that understanding annotator patterns is crucial for enhancing model accuracy in user behavior prediction.
[AI-42] FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
链接: https://arxiv.org/abs/2503.20990
作者: Yupeng Cao,Haohang Li,Yangyang Yu,Shashidhar Reddy Javaji,Yueru He,Jimin Huang,Zining Zhu,Qianqian Xie,Xiao-yang Liu,Koduvayur Subbalakshmi,Meikang Qiu,Sophia Ananiadou,Jian-Yun Nie
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
*备注:
点击查看摘要
Abstract:Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advancements, there is an absence of a benchmark for assessing AudioLLMs in financial scenarios, where audio data, such as earnings conference calls and CEO speeches, are crucial resources for financial analysis and investment decisions. In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. Then, we curate two short and two long audio datasets, respectively, and develop a novel dataset for financial audio summarization, comprising the FinAudio benchmark. Then, we evaluate seven prevalent AudioLLMs on FinAudio. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving AudioLLMs. All datasets and codes will be released.
[AI-43] Competitive Multi-armed Bandit Games for Resource Sharing
链接: https://arxiv.org/abs/2503.20975
作者: Hongbo Li,Lingjie Duan
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
*备注: This paper has been accepted by IEEE TMC
点击查看摘要
Abstract:In modern resource-sharing systems, multiple agents access limited resources with unknown stochastic conditions to perform tasks. When multiple agents access the same resource (arm) simultaneously, they compete for successful usage, leading to contention and reduced rewards. This motivates our study of competitive multi-armed bandit (CMAB) games. In this paper, we study a new N-player K-arm competitive MAB game, where non-myopic players (agents) compete with each other to form diverse private estimations of unknown arms over time. Their possible collisions on the same arms and the time-varying nature of arm rewards make the policy analysis more involved than in existing studies for myopic players. We explicitly analyze the threshold-based structures of the social optimum and the existing selfish policy, showing that the latter causes a prolonged convergence time $\Omega(\frac{K}{\eta^2}\ln(\frac{KN}{\delta}))$, while the socially optimal policy with coordinated communication reduces it to $\mathcal{O}(\frac{K}{N\eta^2}\ln(\frac{K}{\delta}))$. Based on the comparison, we prove that the competition among selfish players for the best arm can result in an infinite price of anarchy (PoA), indicating an arbitrarily large efficiency loss compared to the social optimum. We further prove that no informational (non-monetary) mechanism (including Bayesian persuasion) can reduce the infinite PoA, as strategic misreporting by non-myopic players undermines such approaches. To address this, we propose a Combined Informational and Side-Payment (CISP) mechanism, which provides socially optimal arm recommendations with proper informational and monetary incentives to players according to their time-varying private beliefs. Our CISP mechanism keeps the ex-post budget balanced for the social planner and ensures truthful reporting from players, achieving the minimum PoA = 1 and the same convergence time as the social optimum.
[AI-44] TS-Inverse: A Gradient Inversion Attack Tailored for Federated Time Series Forecasting Models
链接: https://arxiv.org/abs/2503.20952
作者: Caspar Meijer,Jiyue Huang,Shreshtha Sharma,Elena Lazovik,Lydia Y. Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Federated learning (FL) for time series forecasting (TSF) enables clients with privacy-sensitive time series (TS) data to collaboratively learn accurate forecasting models, for example, in energy load prediction. Unfortunately, privacy risks in FL persist, as servers can potentially reconstruct clients’ training data through gradient inversion attacks (GIA). Although GIA is demonstrated for image classification tasks, little is known about time series regression tasks. In this paper, we first conduct an extensive empirical study on inverting TS data across 4 TSF models and 4 datasets, identifying the unique challenges of reconstructing both observations and targets of TS data. We then propose TS-Inverse, a novel GIA that improves the inversion of TS data by (i) learning a gradient inversion model that outputs quantile predictions, (ii) a unique loss function that incorporates periodicity and trend regularization, and (iii) regularization according to the quantile predictions. Our evaluations demonstrate a remarkable performance of TS-Inverse, achieving at least a 2x-10x improvement in terms of the sMAPE metric over existing GIA methods on TS data. Code repository: this https URL
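A sketch of the loss ingredients described for TS-Inverse: a pinball (quantile) loss for the quantile predictions plus simple periodicity and trend regularizers on the reconstructed series. The exact weighting and formulation are assumptions, not the paper's code:

```python
# Sketch of a TS-Inverse-style objective: quantile (pinball) loss plus
# periodicity and trend regularization on the reconstructed time series.
import torch

def pinball_loss(pred, target, quantile):
    diff = target - pred
    return torch.mean(torch.maximum(quantile * diff, (quantile - 1) * diff))

def periodicity_penalty(x, period):
    # Encourage the reconstruction to repeat with the given period.
    return torch.mean((x[period:] - x[:-period]) ** 2)

def trend_penalty(x):
    # Penalize large second differences, i.e. favor smooth trends.
    return torch.mean((x[2:] - 2 * x[1:-1] + x[:-2]) ** 2)

def ts_inverse_loss(pred_quantiles, target, recon, period=24,
                    quantiles=(0.1, 0.5, 0.9), lam_p=0.1, lam_t=0.1):
    loss = sum(pinball_loss(pred_quantiles[i], target, q) for i, q in enumerate(quantiles))
    return loss + lam_p * periodicity_penalty(recon, period) + lam_t * trend_penalty(recon)
```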
[AI-45] DEMENTIA-PLAN: An Agent-Based Framework for Multi-Knowledge Graph Retrieval-Augmented Generation in Dementia Care AAAI2025
链接: https://arxiv.org/abs/2503.20950
作者: Yutong Song,Chenhan Lyu,Pengfei Zhang,Sabine Brunswicker,Nikil Dutt,Amir Rahmani
类目: Artificial Intelligence (cs.AI)
*备注: Accepted by AAAI 2025 Workshop on Knowledge Graphs for Personalized Public Health
点击查看摘要
Abstract:Mild-stage dementia patients primarily experience two critical symptoms: severe memory loss and emotional instability. To address these challenges, we propose DEMENTIA-PLAN, an innovative retrieval-augmented generation framework that leverages large language models to enhance conversational support. Our model employs a multiple knowledge graph architecture, integrating various dimensional knowledge representations including daily routine graphs and life memory graphs. Through this multi-graph architecture, DEMENTIA-PLAN both comprehensively addresses immediate care needs and facilitates deeper emotional resonance through personal memories, helping stabilize patient mood while providing reliable memory support. Our notable innovation is the self-reflection planning agent, which systematically coordinates knowledge retrieval and semantic integration across multiple knowledge graphs, while scoring retrieved content from daily routine and life memory graphs to dynamically adjust their retrieval weights for optimized response generation. DEMENTIA-PLAN represents a significant advancement in the clinical application of large language models for dementia care, bridging the gap between AI tools and caregivers' interventions.
[AI-46] Assessing Generative Models for Structured Data
链接: https://arxiv.org/abs/2503.20903
作者: Reilly Cannon,Nicolette M. Laird,Caesar Vazquez,Andy Lin,Amy Wagler,Tony Chiang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Synthetic tabular data generation has emerged as a promising method to address limited data availability and privacy concerns. With the sharp increase in the performance of large language models in recent years, researchers have been interested in applying these models to the generation of tabular data. However, little is known about the quality of the generated tabular data from large language models. The predominant method for assessing the quality of synthetic tabular data is the train-synthetic-test-real approach, where the artificial examples are compared to the original by how well machine learning models, trained separately on the real and synthetic sets, perform in some downstream tasks. This method does not directly measure how closely the distribution of generated data approximates that of the original. This paper introduces rigorous methods for directly assessing synthetic tabular data against real data by looking at inter-column dependencies within the data. We find that large language models (GPT-2), both when queried via few-shot prompting and when fine-tuned, and GAN (CTGAN) models do not produce data with dependencies that mirror the original real data. Results from this study can inform future practice in synthetic data generation to improve data quality.
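A quick diagnostic in the spirit of the paper: compare inter-column correlation matrices of the real and synthetic tables. The paper's assessment methods are more rigorous; this is only an illustrative check, and the toy data below is fabricated for the example:

```python
# Simple dependency check: mean absolute gap between pairwise correlation
# matrices of a real table and a synthetic one. Large gap -> inter-column
# dependencies were not preserved by the generator.
import numpy as np
import pandas as pd

def dependency_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    cols = [c for c in real.columns if c in synthetic.columns]
    corr_real = real[cols].corr().to_numpy()
    corr_synth = synthetic[cols].corr().to_numpy()
    return float(np.nanmean(np.abs(corr_real - corr_synth)))

rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.normal(40, 10, 500)})
real["income"] = 1000 * real["age"] + rng.normal(0, 5000, 500)   # dependent column
synth = real.apply(lambda c: rng.permutation(c.values))          # breaks dependencies
print(dependency_gap(real, synth))   # clearly nonzero for the shuffled table
```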
[AI-47] Robust Federated Learning Against Poisoning Attacks: A GAN-Based Defense Framework
链接: https://arxiv.org/abs/2503.20884
作者: Usama Zafar,André Teixeira,Salman Toor
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
点击查看摘要
Abstract:Federated Learning (FL) enables collaborative model training across decentralized devices without sharing raw data, but it remains vulnerable to poisoning attacks that compromise model integrity. Existing defenses often rely on external datasets or predefined heuristics (e.g. number of malicious clients), limiting their effectiveness and scalability. To address these limitations, we propose a privacy-preserving defense framework that leverages a Conditional Generative Adversarial Network (cGAN) to generate synthetic data at the server for authenticating client updates, eliminating the need for external datasets. Our framework is scalable, adaptive, and seamlessly integrates into FL workflows. Extensive experiments on benchmark datasets demonstrate its robust performance against a variety of poisoning attacks, achieving high True Positive Rate (TPR) and True Negative Rate (TNR) of malicious and benign clients, respectively, while maintaining model accuracy. The proposed framework offers a practical and effective solution for securing federated learning systems.
[AI-48] The Backfiring Effect of Weak AI Safety Regulation
链接: https://arxiv.org/abs/2503.20848
作者: Benjamin Laufer,Jon Kleinberg,Hoda Heidari
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Theoretical Economics (econ.TH)
*备注: 28 pages, 8 figures
点击查看摘要
Abstract:Recent policy proposals aim to improve the safety of general-purpose AI, but there is little understanding of the efficacy of different regulatory approaches to AI safety. We present a strategic model that explores the interactions between the regulator, the general-purpose AI technology creators, and domain specialists–those who adapt the AI for specific applications. Our analysis examines how different regulatory measures, targeting different parts of the development chain, affect the outcome of the development process. In particular, we assume AI technology is described by two key attributes: safety and performance. The regulator first sets a minimum safety standard that applies to one or both players, with strict penalties for non-compliance. The general-purpose creator then develops the technology, establishing its initial safety and performance levels. Next, domain specialists refine the AI for their specific use cases, and the resulting revenue is distributed between the specialist and generalist through an ex-ante bargaining process. Our analysis of this game reveals two key insights: First, weak safety regulation imposed only on the domain specialists can backfire. While it might seem logical to regulate use cases (as opposed to the general-purpose technology), our analysis shows that weak regulations targeting domain specialists alone can unintentionally reduce safety. This effect persists across a wide range of settings. Second, in sharp contrast to the previous finding, we observe that stronger, well-placed regulation can in fact benefit all players subjected to it. When regulators impose appropriate safety standards on both AI creators and domain specialists, the regulation functions as a commitment mechanism, leading to safety and performance gains, surpassing what is achieved under no regulation or regulating one player only.
[AI-49] Robust Deep Reinforcement Learning in Robotics via Adaptive Gradient-Masked Adversarial Attacks
链接: https://arxiv.org/abs/2503.20844
作者: Zongyuan Zhang,Tianyang Duan,Zheng Lin,Dong Huang,Zihan Fang,Zekai Sun,Ling Xiong,Hongbin Liang,Heming Cui,Yong Cui,Yue Gao
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Robotics (cs.RO)
*备注: 9 pages, 6 figures
点击查看摘要
Abstract:Deep reinforcement learning (DRL) has emerged as a promising approach for robotic control, but its real-world deployment remains challenging due to its vulnerability to environmental perturbations. Existing white-box adversarial attack methods, adapted from supervised learning, fail to effectively target DRL agents as they overlook temporal dynamics and indiscriminately perturb all state dimensions, limiting their impact on long-term rewards. To address these challenges, we propose the Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a white-box attack method that combines DRL with a gradient-based soft masking mechanism to dynamically identify critical state dimensions and optimize adversarial policies. AGMR selectively allocates perturbations to the most impactful state features and incorporates a dynamic adjustment mechanism to balance exploration and exploitation during training. Extensive experiments demonstrate that AGMR outperforms state-of-the-art adversarial attack methods in degrading the performance of the victim agent and enhances the victim agent's robustness through adversarial defense mechanisms.
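A sketch of only the gradient-based soft masking step: weight an adversarial state perturbation toward the dimensions with the largest policy-gradient magnitude. This is not the full AGMR training procedure, and the surrogate objective, temperature, and budget below are assumptions:

```python
# Sketch: compute the gradient of a scalar policy output w.r.t. the state,
# turn its magnitudes into a soft mask, and concentrate the perturbation
# budget on the most influential state dimensions.
import torch

def masked_state_perturbation(policy, state, epsilon=0.05, temperature=0.1):
    state = state.clone().requires_grad_(True)
    action_value = policy(state).sum()                        # scalar surrogate objective
    grad = torch.autograd.grad(action_value, state)[0]
    mask = torch.softmax(grad.abs() / temperature, dim=-1)    # soft importance mask
    perturbation = -epsilon * mask * grad.sign()              # push the objective down on key dims
    return (state + perturbation).detach()

# Example with a toy policy network.
policy = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
adv_state = masked_state_perturbation(policy, torch.randn(8))
```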
[AI-50] Advancing Vulnerability Classification with BERT: A Multi-Objective Learning Model
链接: https://arxiv.org/abs/2503.20831
作者: Himanshu Tiwari
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注: 9 Pages
点击查看摘要
Abstract:The rapid increase in cybersecurity vulnerabilities necessitates automated tools for analyzing and classifying vulnerability reports. This paper presents a novel Vulnerability Report Classifier that leverages the BERT (Bidirectional Encoder Representations from Transformers) model to perform multi-label classification of Common Vulnerabilities and Exposures (CVE) reports from the National Vulnerability Database (NVD). The classifier predicts both the severity (Low, Medium, High, Critical) and vulnerability types (e.g., Buffer Overflow, XSS) from textual descriptions. We introduce a custom training pipeline using a combined loss function (Cross-Entropy for severity and Binary Cross-Entropy with Logits for types) integrated into a Hugging Face Trainer subclass. Experiments on recent NVD data demonstrate promising results, with decreasing evaluation loss across epochs. The system is deployed via a REST API and a Streamlit UI, enabling real-time vulnerability analysis. This work contributes a scalable, open-source solution for cybersecurity practitioners to automate vulnerability triage.
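A minimal sketch of the combined objective wired into a Hugging Face Trainer subclass. The output keys ("severity_logits", "type_logits") and input field names are assumptions about how the model's heads are organized, not the paper's exact code:

```python
# Sketch: Cross-Entropy over severity classes plus BCE-with-logits over
# multi-label vulnerability types, summed into one training loss.
import torch
from transformers import Trainer

class MultiObjectiveTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        severity_labels = inputs.pop("severity_labels")     # shape (batch,)
        type_labels = inputs.pop("type_labels").float()     # shape (batch, n_types)
        outputs = model(**inputs)                           # model exposes both heads
        ce = torch.nn.functional.cross_entropy(outputs["severity_logits"], severity_labels)
        bce = torch.nn.functional.binary_cross_entropy_with_logits(outputs["type_logits"], type_labels)
        loss = ce + bce
        return (loss, outputs) if return_outputs else loss
```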
[AI-51] AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models
链接: https://arxiv.org/abs/2503.20804
作者: Le Qiu,Zelai Xu,Qixin Tan,Wenhao Tang,Chao Yu,Yu Wang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Assessing the safety of autonomous driving policy is of great importance, and reinforcement learning (RL) has emerged as a powerful method for discovering critical vulnerabilities in driving policies. However, existing RL-based approaches often struggle to identify vulnerabilities that are both effective (meaning the autonomous vehicle is genuinely responsible for the accidents) and diverse (meaning they span various failure types). To address these challenges, we propose AED, a framework that uses large language models (LLMs) to automatically discover effective and diverse vulnerabilities in autonomous driving policies. We first utilize an LLM to automatically design reward functions for RL training. Then we let the LLM consider a diverse set of accident types and train adversarial policies for different accident types in parallel. Finally, we use preference-based learning to filter ineffective accidents and enhance the effectiveness of each vulnerability. Experiments across multiple simulated traffic scenarios and tested policies show that AED uncovers a broader range of vulnerabilities and achieves higher attack success rates compared with expert-designed rewards, thereby reducing the need for manual reward engineering and improving the diversity and effectiveness of vulnerability discovery.
[AI-52] CEFW: A Comprehensive Evaluation Framework for Watermark in Large Language Models
链接: https://arxiv.org/abs/2503.20802
作者: Shuhao Zhang,Bo Cheng,Jiale Han,Yuli Chen,Zhixuan Wu,Changbao Li,Pingli Gu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Text watermarking provides an effective solution for identifying synthetic text generated by large language models. However, existing techniques often focus on satisfying specific criteria while ignoring other key aspects, lacking a unified evaluation. To fill this gap, we propose the Comprehensive Evaluation Framework for Watermark (CEFW), a unified framework that comprehensively evaluates watermarking methods across five key dimensions: ease of detection, fidelity of text quality, minimal embedding cost, robustness to adversarial attacks, and imperceptibility to prevent imitation or forgery. By assessing watermarks according to all these key criteria, CEFW offers a thorough evaluation of their practicality and effectiveness. Moreover, we introduce a simple and effective watermarking method called Balanced Watermark (BW), which guarantees robustness and imperceptibility through balancing the way watermark information is added. Extensive experiments show that BW outperforms existing methods in overall performance across all evaluation dimensions. We release our code to the community for future research. this https URL.
[AI-53] Evidencing Unauthorized Training Data from AI Generated Content using Information Isotopes
链接: https://arxiv.org/abs/2503.20800
作者: Qi Tao,Yin Jinhua,Cai Dongqi,Xie Yueqi,Wang Huili,Hu Zhiyang,Yang Peiru,Nan Guoshun,Zhou Zhili,Wang Shangguang,Lyu Lingjuan,Huang Yongfeng,Lane Nicholas
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In light of scaling laws, many AI institutions are intensifying efforts to construct advanced AIs on extensive collections of high-quality human data. However, in a rush to stay competitive, some institutions may inadvertently or even deliberately include unauthorized data (like privacy- or intellectual property-sensitive content) for AI training, which infringes on the rights of data owners. Compounding this issue, these advanced AI services are typically built on opaque cloud platforms, which restricts access to internal information during AI training and inference, leaving only the generated outputs available for forensics. Thus, despite the introduction of legal frameworks by various countries to safeguard data rights, uncovering evidence of data misuse in modern opaque AI applications remains a significant challenge. In this paper, inspired by the ability of isotopes to trace elements within chemical reactions, we introduce the concept of information isotopes and elucidate their properties in tracing training data within opaque AI systems. Furthermore, we propose an information isotope tracing method designed to identify and provide evidence of unauthorized data usage by detecting the presence of target information isotopes in AI generations. We conduct experiments on ten AI models (including GPT-4o, Claude-3.5, and DeepSeek) and four benchmark datasets in critical domains (medical data, copyrighted books, and news). Results show that our method can distinguish training datasets from non-training datasets with 99% accuracy and significant evidence (p-value < 0.001) by examining a data entry equivalent in length to a research paper. The findings show the potential of our work as an inclusive tool for empowering individuals, including those without expertise in AI, to safeguard their data rights in the rapidly evolving era of AI advancements and applications.
[AI-54] Payload-Aware Intrusion Detection with CMAE and Large Language Models
链接: https://arxiv.org/abs/2503.20798
作者: Yongcheol Kim,Chanjae Lee,Young Yoon
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Intrusion Detection Systems (IDS) are crucial for identifying malicious traffic, yet traditional signature-based methods struggle with zero-day attacks and high false positive rates. AI-driven packet-capture analysis offers a promising alternative. However, existing approaches rely heavily on flow-based or statistical features, limiting their ability to detect fine-grained attack patterns. This study proposes Xavier-CMAE, an enhanced Convolutional Multi-Head Attention Ensemble (CMAE) model that improves detection accuracy while reducing computational overhead. By replacing Word2Vec embeddings with a Hex2Int tokenizer and Xavier initialization, Xavier-CMAE eliminates pre-training, accelerates training, and achieves 99.971% accuracy with a 0.018% false positive rate, outperforming Word2Vec-based methods. Additionally, we introduce LLM-CMAE, which integrates pre-trained Large Language Model (LLM) tokenizers into CMAE. While LLMs enhance feature extraction, their computational cost hinders real-time detection. LLM-CMAE balances efficiency and performance, reaching 99.969% accuracy with a 0.019% false positive rate. This work advances AI-powered IDS by (1) introducing a payload-based detection framework, (2) enhancing efficiency with Xavier-CMAE, and (3) integrating LLM tokenizers for improved real-time detection.
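A sketch of a Hex2Int-style tokenizer: raw payload bytes map directly to integer token ids, with a couple of reserved special tokens. The padding length, special-token ids, and offset scheme are illustrative assumptions:

```python
# Sketch: convert a hex-encoded packet payload into integer tokens (one token
# per byte), reserving ids for padding and a CLS marker.
PAD_ID, CLS_ID, OFFSET = 0, 1, 2   # reserve 0/1, shift byte values by 2

def hex2int_tokenize(payload_hex: str, max_len: int = 64):
    byte_values = bytes.fromhex(payload_hex)
    tokens = [CLS_ID] + [b + OFFSET for b in byte_values]
    tokens = tokens[:max_len]
    return tokens + [PAD_ID] * (max_len - len(tokens))

print(hex2int_tokenize("474554202f20485454502f312e31", max_len=20))  # "GET / HTTP/1.1"
```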
[AI-55] EXPLICATE: Enhancing Phishing Detection through Explainable AI and LLM-Powered Interpretability
链接: https://arxiv.org/abs/2503.20796
作者: Bryan Lim,Roman Huerta,Alejandro Sotelo,Anthonie Quintela,Priyanka Kumar
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
*备注:
点击查看摘要
Abstract:Sophisticated phishing attacks have emerged as a major cybersecurity threat, becoming more common and difficult to prevent. Though machine learning techniques have shown promise in detecting phishing attacks, they function mainly as “black boxes” without revealing their decision-making rationale. This lack of transparency erodes the trust of users and diminishes their effective threat response. We present EXPLICATE: a framework that enhances phishing detection through a three-component architecture: an ML-based classifier using domain-specific features, a dual-explanation layer combining LIME and SHAP for complementary feature-level insights, and an LLM enhancement using DeepSeek v3 to translate technical explanations into accessible natural language. Our experiments show that EXPLICATE attains 98.4% accuracy on all metrics, which is on par with existing deep learning techniques but with better explainability. High-quality explanations are generated by the framework with an accuracy of 94.2% as well as a consistency of 96.8% between the LLM output and model prediction. We create EXPLICATE as a fully usable GUI application and a light Chrome extension, showing its applicability in many deployment situations. The research shows that high detection performance can go hand-in-hand with meaningful explainability in security applications. Most importantly, it addresses the critical divide between automated AI and user trust in phishing detection systems.
[AI-56] Toward a Human-Centered AI-assisted Colonoscopy System in Australia
链接: https://arxiv.org/abs/2503.20790
作者: Hsiang-Ting Chen,Yuan Zhang,Gustavo Carneiro,Rajvinder Singh
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
*备注: 4 pages, accepted by CHI '25 workshop Envisioning the Future of Interactive Health
点击查看摘要
Abstract:While AI-assisted colonoscopy promises improved colorectal cancer screening, its success relies on effective integration into clinical practice, not just algorithmic accuracy. This paper, based on an Australian field study (observations and gastroenterologist interviews), highlights a critical disconnect: current development prioritizes machine learning model performance, overlooking essential aspects of user interface design, workflow integration, and overall user experience. Industry interactions reveal a similar emphasis on data and algorithms. To realize AI’s full potential, the HCI community must champion user-centered design, ensuring these systems are usable, support endoscopist expertise, and enhance patient outcomes.
[AI-57] Quantitative Evaluation of Quantum/Classical Neural Network Using a Game Solver Metric
链接: https://arxiv.org/abs/2503.21514
作者: Suzukaze Kamei,Hideaki Kawaguchi,Shin Nishio,Tatakahiko Satoh
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: 11 pages, 16 figures
点击查看摘要
Abstract:To evaluate the performance of quantum computing systems relative to classical counterparts and explore the potential for quantum advantage, we propose a game-solving benchmark based on Elo ratings in the game of tic-tac-toe. We compare classical convolutional neural networks (CNNs), quantum convolutional neural networks (QCNNs), and hybrid classical-quantum models by assessing their performance against a random-move agent in automated matches. Additionally, we implement a QCNN integrated with quantum communication and evaluate its performance to quantify the overhead introduced by noisy quantum channels. Our results show that the classical-quantum hybrid model achieves Elo ratings comparable to those of classical CNNs, while the standalone QCNN underperforms under current hardware constraints. The communication overhead was found to be modest. These findings demonstrate the viability of using game-based benchmarks for evaluating quantum computing systems and suggest that quantum communication can be incorporated with limited impact on performance, providing a foundation for future hybrid quantum applications.
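The scoring rule behind the proposed game-solver metric is the standard Elo update; each automated match against the random-move agent adjusts a model's rating. The K-factor and initial ratings below are conventional choices, not taken from the paper:

```python
# Standard Elo update applied to automated tic-tac-toe matches.
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a: 1.0 win, 0.5 draw, 0.0 loss for player A."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

cnn, random_agent = 1200.0, 1200.0
for result in [1.0, 1.0, 0.5, 1.0]:          # outcomes of automated games vs. the random agent
    cnn, random_agent = elo_update(cnn, random_agent, result)
print(round(cnn), round(random_agent))
```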
[AI-58] Exploiting Temporal State Space Sharing for Video Semantic Segmentation
链接: https://arxiv.org/abs/2503.20824
作者: Syed Ariff Syed Hesham,Yun Liu,Guolei Sun,Henghui Ding,Jing Yang,Ender Konukoglu,Xue Geng,Xudong Jiang
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
*备注: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025
点击查看摘要
Abstract:Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. To this end, we introduce a Temporal Video State Space Sharing (TV3S) architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool. By processing spatial patches independently and incorporating shifted operation, TV3S supports highly parallel computation in both training and inference stages, which reduces the delay in sequential state space processing and improves the scalability for long video sequences. Moreover, TV3S incorporates information from prior frames during inference, achieving long-range temporal coherence and superior adaptability to extended sequences. Evaluations on the VSPW and Cityscapes datasets reveal that our approach outperforms current state-of-the-art methods, establishing a new standard for VSS with consistent results across long video sequences. By achieving a good balance between accuracy and efficiency, TV3S shows a significant advancement in spatiotemporal modeling, paving the way for efficient video analysis. The code is publicly available at this https URL.
[AI-59] Synthetic Video Enhances Physical Fidelity in Video Synthesis
链接: https://arxiv.org/abs/2503.20822
作者: Qi Zhao,Xingyu Ni,Ziyu Wang,Feng Cheng,Ziyan Yang,Lu Jiang,Bohan Wang
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
*备注:
点击查看摘要
Abstract:We investigate how to enhance the physical fidelity of video generation models by leveraging synthetic videos derived from computer graphics pipelines. These rendered videos respect real-world physics, such as maintaining 3D consistency, and serve as a valuable resource that can potentially improve video generation models. To harness this potential, we propose a solution that curates and integrates synthetic data while introducing a method to transfer its physical realism to the model, significantly reducing unwanted artifacts. Through experiments on three representative tasks emphasizing physical consistency, we demonstrate its efficacy in enhancing physical fidelity. While our model still lacks a deep understanding of physics, our work offers one of the first empirical demonstrations that synthetic video enhances physical fidelity in video synthesis. Website: this https URL
机器学习
[LG-0] A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One
链接: https://arxiv.org/abs/2503.21756
作者: Minyoung Kim
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The bridge problem is to find an SDE (or sometimes an ODE) that bridges two given distributions. The application areas of the bridge problem are enormous, among which the recent generative modeling (e.g., conditional or unconditional image generation) is the most popular. Also, the famous Schrödinger bridge problem, a widely known problem for a century, is a special instance of the bridge problem. The two most popular algorithms for tackling bridge problems in the deep learning era are (conditional) flow matching and iterative fitting algorithms, where the former is confined to ODE solutions and the latter is specific to the Schrödinger bridge problem. The main contribution of this article is twofold: i) we provide concise reviews of these algorithms with some technical detail; ii) we propose a novel unified perspective and framework that subsumes these seemingly unrelated algorithms (and their variants) into one. In particular, we show that our unified framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch) optimal transport FM algorithm, the (mini-batch) Schrödinger bridge FM algorithm, and the deep Schrödinger bridge matching (DSBM) algorithm as its special cases. We believe that this unified framework will be useful for viewing the bridge problems in a more general and flexible perspective, and in turn can help researchers and practitioners to develop new bridge algorithms in their fields.
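For reference, here is the plain conditional flow matching objective with linear interpolation paths, the basic special case that the unified framework subsumes. The toy network and data below are illustrative:

```python
# Minimal conditional flow matching loss: regress a velocity network onto
# (x1 - x0) at points x_t = (1 - t) x0 + t x1 sampled along the bridge.
import torch

def flow_matching_loss(velocity_net, x0, x1):
    t = torch.rand(x0.shape[0], 1)                 # one time per sample
    x_t = (1 - t) * x0 + t * x1                    # point on the linear bridge
    target_velocity = x1 - x0                      # conditional target for linear paths
    pred = velocity_net(torch.cat([x_t, t], dim=1))
    return torch.mean((pred - target_velocity) ** 2)

dim = 2
velocity_net = torch.nn.Sequential(torch.nn.Linear(dim + 1, 64), torch.nn.SiLU(),
                                   torch.nn.Linear(64, dim))
x0, x1 = torch.randn(128, dim), torch.randn(128, dim) + 3.0   # source and target samples
loss = flow_matching_loss(velocity_net, x0, x1)
loss.backward()
```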
[LG-1] Energy Minimization for Participatory Federated Learning in IoT Analyzed via Game Theory
链接: https://arxiv.org/abs/2503.21722
作者: Alessandro Buratto,Elia Guerra,Marco Miozzo,Paolo Dini,Leonardo Badia
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 6 pages, 6 figures, 2 tables, conference
点击查看摘要
Abstract:The Internet of Things requires intelligent decision making in many scenarios. To this end, resources available at the individual nodes for sensing or computing, or both, can be leveraged. This results in approaches known as participatory sensing and federated learning, respectively. We investigate the simultaneous implementation of both, through a distributed approach based on empowering local nodes with game theoretic decision making. A global objective of energy minimization is combined with the individual node’s optimization of local expenditure for sensing and transmitting data over multiple learning rounds. We present extensive evaluations of this technique, based on both a theoretical framework and experiments in a simulated network scenario with real data. Such a distributed approach can reach a desired level of accuracy for federated learning without a centralized supervision of the data collector. However, depending on the weight attributed to the local costs of the single node, it may also result in a significantly high Price of Anarchy (from 1.28 onwards). Thus, we argue for the need of incentive mechanisms, possibly based on Age of Information of the single nodes.
[LG-2] A tale of two goals: leveraging sequentiality in multi-goal scenarios
链接: https://arxiv.org/abs/2503.21677
作者: Olivier Serris,Stéphane Doncieux,Olivier Sigaud
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures
点击查看摘要
Abstract:Several hierarchical reinforcement learning methods leverage planning to create a graph or sequences of intermediate goals, guiding a lower-level goal-conditioned (GC) policy to reach some final goals. The low-level policy is typically conditioned on the current goal, with the aim of reaching it as quickly as possible. However, this approach can fail when an intermediate goal can be reached in multiple ways, some of which may make it impossible to continue toward subsequent goals. To address this issue, we introduce two instances of Markov Decision Process (MDP) where the optimization objective favors policies that not only reach the current goal but also subsequent ones. In the first, the agent is conditioned on both the current and final goals, while in the second, it is conditioned on the next two goals in the sequence. We conduct a series of experiments on navigation and pole-balancing tasks in which sequences of intermediate goals are given. By evaluating policies trained with TD3+HER on both the standard GC-MDP and our proposed MDPs, we show that, in most cases, conditioning on the next two goals improves stability and sample efficiency over other approaches.
[LG-3] Data-Driven Extreme Response Estimation
链接: https://arxiv.org/abs/2503.21638
作者: Samuel J. Edwards,Michael D. Levine
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注: From the 35th Symposium on Naval Hydrodynamics
点击查看摘要
Abstract:A method to rapidly estimate extreme ship response events is developed in this paper. The method involves training a Long Short-Term Memory (LSTM) neural network to correct a lower-fidelity hydrodynamic model to the level of a higher-fidelity simulation. More focus is placed on larger responses by isolating the time-series near peak events identified in the lower-fidelity simulations and training on only the shorter time-series around the large event. The method is tested on the estimation of pitch time-series maxima in Sea State 5 (significant wave height of 4.0 meters and modal period of 15.0 seconds), generated by a lower-fidelity hydrodynamic solver known as SimpleCode and a higher-fidelity tool known as the Large Amplitude Motion Program (LAMP). The results are also compared with an LSTM trained without special considerations for large events.
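A sketch of the peak-focused training setup: cut windows around peaks detected in the low-fidelity series and train an LSTM to map them to the high-fidelity response over the same window. Window size, peak threshold, and model size are assumptions, not the paper's settings:

```python
# Sketch: extract peak-centered windows from a low-fidelity series and define
# an LSTM that maps them to the corresponding high-fidelity response.
import torch

def peak_windows(low_fid, high_fid, window=50, threshold=None):
    x = torch.as_tensor(low_fid, dtype=torch.float32)
    y = torch.as_tensor(high_fid, dtype=torch.float32)
    threshold = x.mean() + 2 * x.std() if threshold is None else threshold
    xs, ys = [], []
    for i in range(window, len(x) - window):
        if x[i] > threshold and x[i] == x[i - window:i + window].max():   # local peak
            xs.append(x[i - window:i + window])
            ys.append(y[i - window:i + window])
    return torch.stack(xs).unsqueeze(-1), torch.stack(ys)   # (n, 2*window, 1), (n, 2*window)

class PeakCorrector(torch.nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = torch.nn.LSTM(1, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, time, 1) low-fidelity window
        out, _ = self.lstm(x)
        return self.head(out).squeeze(-1)      # (batch, time) high-fidelity estimate
```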
[LG-4] ClusterSC: Advancing Synthetic Control with Donor Selection AISTATS
链接: https://arxiv.org/abs/2503.21629
作者: Saeyoung Rho,Andrew Tang,Noah Bergam,Rachel Cummings,Vishal Misra
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 35 pages, 11 figures, to be published in Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (AIStats) 2025
点击查看摘要
Abstract:In causal inference with observational studies, synthetic control (SC) has emerged as a prominent tool. SC has traditionally been applied to aggregate-level datasets, but more recent work has extended its use to individual-level data. As they contain a greater number of observed units, this shift introduces the curse of dimensionality to SC. To address this, we propose Cluster Synthetic Control (ClusterSC), based on the idea that groups of individuals may exist where behavior aligns internally but diverges between groups. ClusterSC incorporates a clustering step to select only the relevant donors for the target. We provide theoretical guarantees on the improvements induced by ClusterSC, supported by empirical demonstrations on synthetic and real-world datasets. The results indicate that ClusterSC consistently outperforms classical SC approaches.
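A simplified sketch of the ClusterSC idea: cluster donor units on their pre-treatment trajectories, keep only donors in the target's cluster, then fit synthetic-control weights. Non-negative least squares (without the sum-to-one constraint) stands in for the exact SC solver here, which is a simplification:

```python
# Sketch: KMeans clustering to select relevant donors, then NNLS weights that
# reconstruct the target's pre-treatment trajectory from those donors.
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import nnls

def cluster_synthetic_control(donors_pre, target_pre, n_clusters=3, seed=0):
    """donors_pre: (n_donors, T_pre); target_pre: (T_pre,)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(donors_pre)
    target_cluster = km.predict(target_pre.reshape(1, -1))[0]
    idx = np.where(km.labels_ == target_cluster)[0]          # relevant donors only
    weights, _ = nnls(donors_pre[idx].T, target_pre)         # fit target as weighted donors
    return idx, weights

rng = np.random.default_rng(1)
donors = rng.normal(size=(30, 20)) + np.repeat([0.0, 3.0, -3.0], 10)[:, None]
target = donors[2] + rng.normal(scale=0.1, size=20)
idx, w = cluster_synthetic_control(donors, target)
print(idx, np.round(w, 2))
```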
[LG-5] Provable Reduction in Communication Rounds for Non-Smooth Convex Federated Learning
链接: https://arxiv.org/abs/2503.21627
作者: Karlo Palenzuela,Ali Dadras,Alp Yurtsever,Tommy Löfstedt
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
点击查看摘要
Abstract:Multiple local steps are key to communication-efficient federated learning. However, theoretical guarantees for such algorithms, without data heterogeneity-bounding assumptions, have been lacking in general non-smooth convex problems. Leveraging projection-efficient optimization methods, we propose FedMLS, a federated learning algorithm with provable improvements from multiple local steps. FedMLS attains an $\epsilon$-suboptimal solution in $\mathcal{O}(1/\epsilon)$ communication rounds, requiring a total of $\mathcal{O}(1/\epsilon^2)$ stochastic subgradient oracle calls.
[LG-6] Leveraging Language Models for Analyzing Longitudinal Experiential Data in Education
链接: https://arxiv.org/abs/2503.21617
作者: Ahatsham Hayat,Bilal Khan,Mohammad Rashedul Hasan
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:We propose a novel approach to leveraging pre-trained language models (LMs) for early forecasting of academic trajectories in STEM students using high-dimensional longitudinal experiential data. This data, which captures students’ study-related activities, behaviors, and psychological states, offers valuable insights for forecasting-based interventions. Key challenges in handling such data include high rates of missing values, limited dataset size due to costly data collection, and complex temporal variability across modalities. Our approach addresses these issues through a comprehensive data enrichment process, integrating strategies for managing missing values, augmenting data, and embedding task-specific instructions and contextual cues to enhance the models’ capacity for learning temporal patterns. Through extensive experiments on a curated student learning dataset, we evaluate both encoder-decoder and decoder-only LMs. While our findings show that LMs effectively integrate data across modalities and exhibit resilience to missing data, they primarily rely on high-level statistical patterns rather than demonstrating a deeper understanding of temporal dynamics. Furthermore, their ability to interpret explicit temporal information remains limited. This work advances educational data science by highlighting both the potential and limitations of LMs in modeling student trajectories for early intervention based on longitudinal experiential data.
[LG-7] Generalizable Implicit Neural Representations via Parameterized Latent Dynamics for Baroclinic Ocean Forecasting
链接: https://arxiv.org/abs/2503.21588
作者: Guang Zhao,Xihaier Luo,Seungjun Lee,Yihui Ren,Shinjae Yoo,Luke Van Roekel,Balu Nadiga,Sri Hari Krishna Narayanan,Yixuan Sun,Wei Xu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Mesoscale ocean dynamics play a critical role in climate systems, governing heat transport, hurricane genesis, and drought patterns. However, simulating these processes at high resolution remains computationally prohibitive due to their nonlinear, multiscale nature and vast spatiotemporal domains. Implicit neural representations (INRs) reduce the computational costs as resolution-independent surrogates but fail in many-query scenarios (inverse modeling) requiring rapid evaluations across diverse parameters. We present PINROD, a novel framework combining dynamics-aware implicit neural representations with parameterized neural ordinary differential equations to address these limitations. By integrating parametric dependencies into latent dynamics, our method efficiently captures nonlinear oceanic behavior across varying boundary conditions and physical parameters. Experiments on ocean mesoscale activity data show superior accuracy over existing baselines and improved computational efficiency compared to standard numerical simulations.
[LG-8] Fusion of Graph Neural Networks via Optimal Transport
链接: https://arxiv.org/abs/2503.21579
作者: Weronika Ormaniec,Michael Vollenweider,Elisa Hoskovec
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In this paper, we explore the idea of combining GCNs into one model. To that end, we align the weights of different models layer-wise using optimal transport (OT). We present and evaluate three types of transportation costs and show that the studied fusion method consistently outperforms the performance of vanilla averaging. Finally, we present results suggesting that model fusion using OT is harder in the case of GCNs than MLPs and that incorporating the graph structure into the process does not improve the performance of the method.
[LG-9] Consistent Multigroup Low-Rank Approximation
链接: https://arxiv.org/abs/2503.21563
作者: Antonis Matakos,Martino Ciaperoni,Heikki Mannila
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We consider the problem of consistent low-rank approximation for multigroup data: we ask for a sequence of k basis vectors such that projecting the data onto their spanned subspace treats all groups as equally as possible, by minimizing the maximum error among the groups. Additionally, we require that the sequence of basis vectors satisfies the natural consistency property: when looking for the best k vectors, the first dk vectors are the best possible solution to the problem of finding d basis vectors. Thus, this multigroup low-rank approximation method naturally generalizes \svd and reduces to \svd for data with a single group. We give an iterative algorithm for this task that sequentially adds to the basis the vector that gives the best rank -1 projection according to the min-max criterion, and then projects the data onto the orthogonal complement of that vector. For finding the best rank -1 projection, we use primal-dual approaches or semidefinite programming. We analyze the theoretical properties of the algorithms and demonstrate empirically that the proposed methods compare favorably to existing methods for multigroup (or fair) PCA.
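A quick illustration of the greedy min-max idea described above: each step looks for a unit vector whose worst-case (over groups) rank-1 reconstruction error is smallest, then deflates the data onto the orthogonal complement. The paper solves the inner rank-1 problem with primal-dual methods or semidefinite programming; the sketch below simply hands the objective to a generic optimizer with random restarts, and every function and variable name is illustrative rather than taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize

def max_group_error(v, groups):
    """Worst-case (over groups) reconstruction error when projecting onto span(v)."""
    v = v / np.linalg.norm(v)
    errs = [np.linalg.norm(X - np.outer(X @ v, v)) ** 2 / len(X) for X in groups]
    return max(errs)

def greedy_minmax_basis(groups, k, dim, restarts=5, seed=0):
    """Greedily add the best min-max rank-1 direction, then deflate each group."""
    rng = np.random.default_rng(seed)
    groups = [X.copy() for X in groups]
    basis = []
    for _ in range(k):
        best = min(
            (minimize(max_group_error, rng.normal(size=dim), args=(groups,))
             for _ in range(restarts)),
            key=lambda r: r.fun,
        )
        v = best.x / np.linalg.norm(best.x)
        basis.append(v)
        # project every group onto the orthogonal complement of v
        groups = [X - np.outer(X @ v, v) for X in groups]
    return np.stack(basis)

# toy usage: two groups whose dominant directions differ
rng = np.random.default_rng(1)
g1 = rng.normal(size=(100, 5)) * np.array([3.0, 1, 1, 1, 1])
g2 = rng.normal(size=(100, 5)) * np.array([1.0, 3, 1, 1, 1])
B = greedy_minmax_basis([g1, g2], k=2, dim=5)
```

The consistency property stated in the abstract comes for free from this greedy construction: the first d vectors of a rank-k basis are exactly the rank-d solution.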
[LG-10] Exploring the Energy Landscape of RBMs: Reciprocal Space Insights into Bosons Hierarchical Learning and Symmetry Breaking
链接: https://arxiv.org/abs/2503.21536
作者: J. Quetzalcóatl Toledo-Marin,Anindita Maiti,Geoffrey C. Fox,Roger G. Melko
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn)
*备注: 19pp, 8figs, research article
点击查看摘要
Abstract:Deep generative models have become ubiquitous due to their ability to learn and sample from complex distributions. Despite the proliferation of various frameworks, the relationships among these models remain largely unexplored, a gap that hinders the development of a unified theory of AI learning. We address two central challenges: clarifying the connections between different deep generative models and deepening our understanding of their learning mechanisms. We focus on Restricted Boltzmann Machines (RBMs), known for their universal approximation capabilities for discrete distributions. By introducing a reciprocal space formulation, we reveal a connection between RBMs, diffusion processes, and coupled Bosons. We show that at initialization, the RBM operates at a saddle point, where the local curvature is determined by the singular values, whose distribution follows the Marcenko-Pastur law and exhibits rotational symmetry. During training, this rotational symmetry is broken due to hierarchical learning, where different degrees of freedom progressively capture features at multiple levels of abstraction. This leads to a symmetry breaking in the energy landscape, reminiscent of Landau theory. This symmetry breaking in the energy landscape is characterized by the singular values and the weight matrix eigenvector matrix. We derive the corresponding free energy in a mean-field approximation. We show that in the limit of infinite size RBM, the reciprocal variables are Gaussian distributed. Our findings indicate that in this regime, there will be some modes for which the diffusion process will not converge to the Boltzmann distribution. To illustrate our results, we trained replicas of RBMs with different hidden layer sizes using the MNIST dataset. Our findings bridge the gap between disparate generative frameworks and also shed light on the processes underpinning learning in generative models.
[LG-11] F-INR: Functional Tensor Decomposition for Implicit Neural Representations
链接: https://arxiv.org/abs/2503.21507
作者: Sai Karthikeya Vemuri,Tim Büchner,Joachim Denzler
类目: Machine Learning (cs.LG)
*备注: 26 pages, 33 figures, 12 tables
点击查看摘要
Abstract:Implicit Neural Representation (INR) has emerged as a powerful tool for encoding discrete signals into continuous, differentiable functions using neural networks. However, these models often rely on monolithic architectures to represent high-dimensional data, leading to prohibitive computational costs as dimensionality grows. We propose F-INR, a framework that reformulates INR learning through functional tensor decomposition, breaking down high-dimensional tasks into lightweight, axis-specific sub-networks. Each sub-network learns a low-dimensional data component (e.g., spatial or temporal). Then, we combine these components via tensor operations, reducing forward pass complexity while improving accuracy through specialized learning. F-INR is modular and, therefore, architecture-agnostic, compatible with MLPs, SIREN, WIRE, or other state-of-the-art INR architectures. It is also decomposition-agnostic, supporting CP, TT, and Tucker modes with user-defined rank for speed-accuracy control. In our experiments, F-INR trains $100\times$ faster than existing approaches on video tasks while achieving higher fidelity (+3.4 dB PSNR). Similar gains hold for image compression, physics simulations, and 3D geometry reconstruction. Through this, F-INR offers a new scalable, flexible solution for high-dimensional signal modeling.
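As a rough illustration of the axis-specific sub-network idea, the minimal PyTorch sketch below models a 2-D signal as a sum over ranks of products of per-axis networks, f(x, y) approximated by sum_r g_r(x) h_r(y), assembled with a single tensor contraction. This is only a toy version of the concept, not the released F-INR code; the class names, network sizes, and test signal are all made up.

```python
import torch
import torch.nn as nn

class AxisNet(nn.Module):
    """Small MLP mapping a 1-D coordinate to R rank components."""
    def __init__(self, rank=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, rank),
        )

    def forward(self, t):            # t: (N, 1)
        return self.net(t)           # (N, rank)

class CPFunctionalINR(nn.Module):
    """f(x, y) ~= sum_r g_r(x) * h_r(y), one lightweight sub-network per axis."""
    def __init__(self, rank=8):
        super().__init__()
        self.gx, self.hy = AxisNet(rank), AxisNet(rank)

    def forward(self, x, y):         # x: (Nx, 1), y: (Ny, 1)
        return self.gx(x) @ self.hy(y).T   # full (Nx, Ny) grid in one contraction

# toy usage: fit a separable 2-D signal on a 64 x 64 grid
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = torch.linspace(-1, 1, 64).unsqueeze(1)
target = torch.sin(3 * x) * torch.cos(5 * y.T)
model = CPFunctionalINR(rank=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = ((model(x, y) - target) ** 2).mean()
    loss.backward()
    opt.step()
```

The point of the decomposition is visible in the forward pass: each sub-network only ever sees a 1-D coordinate, and the high-dimensional grid is reconstructed by a cheap matrix product.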
[LG-12] Robust DNN Partitioning and Resource Allocation Under Uncertain Inference Time
链接: https://arxiv.org/abs/2503.21476
作者: Zhaojun Nan,Yunchu Han,Sheng Zhou,Zhisheng Niu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:In edge intelligence systems, deep neural network (DNN) partitioning and data offloading can provide real-time task inference for resource-constrained mobile devices. However, the inference time of DNNs is typically uncertain and cannot be precisely determined in advance, presenting significant challenges in ensuring timely task processing within deadlines. To address the uncertain inference time, we propose a robust optimization scheme to minimize the total energy consumption of mobile devices while meeting task probabilistic deadlines. The scheme only requires the mean and variance information of the inference time, without any prediction methods or distribution functions. The problem is formulated as a mixed-integer nonlinear programming (MINLP) that involves jointly optimizing the DNN model partitioning and the allocation of local CPU/GPU frequencies and uplink bandwidth. To tackle the problem, we first decompose the original problem into two subproblems: resource allocation and DNN model partitioning. Subsequently, the two subproblems with probability constraints are equivalently transformed into deterministic optimization problems using the chance-constrained programming (CCP) method. Finally, the convex optimization technique and the penalty convex-concave procedure (PCCP) technique are employed to obtain the optimal solution of the resource allocation subproblem and a stationary point of the DNN model partitioning subproblem, respectively. The proposed algorithm leverages real-world data from popular hardware platforms and is evaluated on widely used DNN models. Extensive simulations show that our proposed algorithm effectively addresses the inference time uncertainty with probabilistic deadline guarantees while minimizing the energy consumption of mobile devices.
[LG-13] DATA-WA: Demand-based Adaptive Task Assignment with Dynamic Worker Availability Windows
链接: https://arxiv.org/abs/2503.21458
作者: Jinwen Chen,Jiannan Guo,Dazhuo Qiu,Yawen Li,Guanhua Ye,Yan Zhao,Kai Zheng
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注:
点击查看摘要
Abstract:With the rapid advancement of mobile networks and the widespread use of mobile devices, spatial crowdsourcing, which involves assigning location-based tasks to mobile workers, has gained significant attention. However, most existing research focuses on task assignment at the current moment, overlooking the fluctuating demand and supply between tasks and workers over time. To address this issue, we introduce an adaptive task assignment problem, which aims to maximize the number of assigned tasks by dynamically adjusting task assignments in response to changing demand and supply. We develop a spatial crowdsourcing framework, namely demand-based adaptive task assignment with dynamic worker availability windows, which consists of two components including task demand prediction and task assignment. In the first component, we construct a graph adjacency matrix representing the demand dependency relationships in different regions and employ a multivariate time series learning approach to predict future task demands. In the task assignment component, we adjust tasks to workers based on these predictions, worker availability windows, and the current task assignments, where each worker has an availability window that indicates the time periods they are available for task assignments. To reduce the search space of task assignments and be efficient, we propose a worker dependency separation approach based on graph partition and a task value function with reinforcement learning. Experiments on real data demonstrate that our proposals are both effective and efficient.
[LG-14] Stochastic Engrams for Efficient Continual Learning with Binarized Neural Networks
链接: https://arxiv.org/abs/2503.21436
作者: Isabelle Aguilar,Luis Fernando Herbozo Contreras,Omid Kavehei
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The ability to learn continuously in artificial neural networks (ANNs) is often limited by catastrophic forgetting, a phenomenon in which new knowledge becomes dominant. By taking mechanisms of memory encoding in neuroscience (aka. engrams) as inspiration, we propose a novel approach that integrates stochastically-activated engrams as a gating mechanism for metaplastic binarized neural networks (mBNNs). This method leverages the computational efficiency of mBNNs combined with the robustness of probabilistic memory traces to mitigate forgetting and maintain the model’s reliability. Previously validated metaplastic optimization techniques have been incorporated to enhance synaptic stability further. Compared to baseline binarized models and benchmark fully connected continual learning approaches, our method is the only strategy capable of reaching average accuracies over 20% in class-incremental scenarios and achieving comparable domain-incremental results to full precision state-of-the-art methods. Furthermore, we achieve a significant reduction in peak GPU and RAM usage, under 5% and 20%, respectively. Our findings demonstrate (A) an improved stability vs. plasticity trade-off, (B) a reduced memory intensiveness, and (C) an enhanced performance in binarized architectures. By uniting principles of neuroscience and efficient computing, we offer new insights into the design of scalable and robust deep learning systems.
[LG-15] Nearest Neighbour Equilibrium Clustering
链接: https://arxiv.org/abs/2503.21431
作者: David P. Hofmeyr
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Currently being considered for publication by IEEE
点击查看摘要
Abstract:A novel and intuitive nearest neighbours based clustering algorithm is introduced, in which a cluster is defined in terms of an equilibrium condition which balances its size and cohesiveness. The formulation of the equilibrium condition allows for a quantification of the strength of alignment of each point to a cluster, with these cluster alignment strengths leading naturally to a model selection criterion which renders the proposed approach fully automatable. The algorithm is simple to implement and computationally efficient, and produces clustering solutions of extremely high quality in comparison with relevant benchmarks from the literature. R code to implement the approach is available from this https URL.
[LG-16] AdvSGM: Differentially Private Graph Learning via Adversarial Skip-gram Model ICDE2025
链接: https://arxiv.org/abs/2503.21426
作者: Sen Zhang,Qingqing Ye,Haibo Hu,Jianliang Xu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted by ICDE 2025
点击查看摘要
Abstract:The skip-gram model (SGM), which employs a neural network to generate node vectors, serves as the basis for numerous popular graph embedding techniques. However, since the training datasets contain sensitive linkage information, the parameters of a released SGM may encode private information and pose significant privacy risks. Differential privacy (DP) is a rigorous standard for protecting individual privacy in data analysis. Nevertheless, when applying differential privacy to skip-gram in graphs, it becomes highly challenging due to the complex link relationships, which potentially result in high sensitivity and necessitate substantial noise injection. To tackle this challenge, we present AdvSGM, a differentially private skip-gram for graphs via adversarial training. Our core idea is to leverage adversarial training to privatize skip-gram while improving its utility. Towards this end, we develop a novel adversarial training module by devising two optimizable noise terms that correspond to the parameters of a skip-gram. By fine-tuning the weights between modules within AdvSGM, we can achieve differentially private gradient updates without additional noise injection. Extensive experimental results on six real-world graph datasets show that AdvSGM preserves high data utility across different downstream tasks.
[LG-17] Workshop Scientific HPC in the pre-Exascale era (part of ITADATA 2024) Proceedings
链接: https://arxiv.org/abs/2503.21415
作者: Nicola Bena,Claudia Diamantini,Michela Natilli,Luigi Romano,Giovanni Stilo,Valentina Pansanella,Claudio A. Ardagna,Anna Monreale,Roberto Trasarti,Valentina Cesare,Gianluca Mittone,Emanuele De Rubeis,Alberto Vecchiato
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:The proceedings of Workshop Scientific HPC in the pre-Exascale era (SHPC), held in Pisa, Italy, September 18, 2024, are part of 3rd Italian Conference on Big Data and Data Science (ITADATA2024) proceedings (arXiv: 2503.14937). The main objective of SHPC workshop was to discuss how the current most critical questions in HPC emerge in astrophysics, cosmology, and other scientific contexts and experiments. In particular, SHPC workshop focused on: scientific (mainly astrophysical and medical) applications toward (pre-)Exascale computing; performance portability; green computing; machine learning; Big Data management; programming on heterogeneous architectures; programming on accelerators; and I/O techniques.
[LG-18] AcL: Action Learner for Fault-Tolerant Quadruped Locomotion Control
链接: https://arxiv.org/abs/2503.21401
作者: Tianyu Xu (1),Yaoyu Cheng (2),Pinxi Shen (2),Lin Zhao (1) ((1) Electrical and Computer Engineering, National University of Singapore, Singapore; (2) Mechanical Engineering, National University of Singapore, Singapore)
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
点击查看摘要
Abstract:Quadrupedal robots can learn versatile locomotion skills but remain vulnerable when one or more joints lose power. In contrast, dogs and cats can adopt limping gaits when injured, demonstrating their remarkable ability to adapt to physical conditions. Inspired by such adaptability, this paper presents Action Learner (AcL), a novel teacher-student reinforcement learning framework that enables quadrupeds to autonomously adapt their gait for stable walking under multiple joint faults. Unlike conventional teacher-student approaches that enforce strict imitation, AcL leverages teacher policies to generate style rewards, guiding the student policy without requiring precise replication. We train multiple teacher policies, each corresponding to a different fault condition, and subsequently distill them into a single student policy with an encoder-decoder architecture. While prior works primarily address single-joint faults, AcL enables quadrupeds to walk with up to four faulty joints across one or two legs, autonomously switching between different limping gaits when faults occur. We validate AcL on a real Go2 quadruped robot under single- and double-joint faults, demonstrating fault-tolerant, stable walking, smooth gait transitions between normal and limping gaits, and robustness against external disturbances.
[LG-19] Scalable Expectation Estimation with Subtractive Mixture Models
链接: https://arxiv.org/abs/2503.21346
作者: Lena Zellinger,Nicola Branchini,Víctor Elvira,Antonio Vergari
类目: Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Many Monte Carlo (MC) and importance sampling (IS) methods use mixture models (MMs) for their simplicity and ability to capture multimodal distributions. Recently, subtractive mixture models (SMMs), i.e. MMs with negative coefficients, have shown greater expressiveness and success in generative modeling. However, their negative parameters complicate sampling, requiring costly auto-regressive techniques or accept-reject algorithms that do not scale in high dimensions. In this work, we use the difference representation of SMMs to construct an unbiased IS estimator ($\Delta\text{Ex}$) that removes the need to sample from the SMM, enabling high-dimensional expectation estimation with SMMs. In our experiments, we show that $\Delta\text{Ex}$ can achieve comparable estimation quality to auto-regressive sampling while being considerably faster in MC estimation. Moreover, we conduct initial experiments with $\Delta\text{Ex}$ using hand-crafted proposals, gaining first insights into how to construct safe proposals for $\Delta\text{Ex}$.
[LG-20] Tricking Retrievers with Influential Tokens: An Efficient Black-Box Corpus Poisoning Attack NAACL2025
链接: https://arxiv.org/abs/2503.21315
作者: Cheng Wang,Yiwei Wang,Yujun Cai,Bryan Hooi
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted to NAACL 2025 Main Track
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) systems enhance large language models by incorporating external knowledge, addressing issues like outdated internal knowledge and hallucination. However, their reliance on external knowledge bases makes them vulnerable to corpus poisoning attacks, where adversarial passages can be injected to manipulate retrieval results. Existing methods for crafting such passages, such as random token replacement or training inversion models, are often slow and computationally expensive, requiring either access to retriever’s gradients or large computational resources. To address these limitations, we propose Dynamic Importance-Guided Genetic Algorithm (DIGA), an efficient black-box method that leverages two key properties of retrievers: insensitivity to token order and bias towards influential tokens. By focusing on these characteristics, DIGA dynamically adjusts its genetic operations to generate effective adversarial passages with significantly reduced time and memory usage. Our experimental evaluation shows that DIGA achieves superior efficiency and scalability compared to existing methods, while maintaining comparable or better attack success rates across multiple datasets.
[LG-21] HOT: Hadamard-based Optimized Training CVPR2025
链接: https://arxiv.org/abs/2503.21261
作者: Seonggon Kim,Juncheol Shin,Seung-taek Woo,Eunhyeok Park
类目: Machine Learning (cs.LG)
*备注: Accepted in CVPR 2025
点击查看摘要
Abstract:It has become increasingly important to optimize backpropagation to reduce memory usage and computational overhead. Achieving this goal is highly challenging, as multiple objectives must be considered jointly while maintaining training quality. In this paper, we focus on matrix multiplication, which accounts for the largest portion of training costs, and analyze its backpropagation in detail to identify lightweight techniques that offer the best benefits. Based on this analysis, we introduce a novel method, Hadamard-based Optimized Training (HOT). In this approach, we apply Hadamard-based optimizations, such as Hadamard quantization and Hadamard low-rank approximation, selectively and with awareness of the suitability of each optimization for different backward paths. Additionally, we introduce two enhancements: activation buffer compression and layer-wise quantizer selection. Our extensive analysis shows that HOT achieves up to 75% memory savings and a 2.6 times acceleration on real GPUs, with negligible accuracy loss compared to FP32 precision.
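For readers unfamiliar with the term, "Hadamard-based quantization" generally means rotating a tensor with an orthonormal Hadamard matrix before low-bit quantization so that outliers are spread evenly across coordinates, then undoing the rotation after dequantization. The sketch below shows only this generic idea on int8 data; it is not HOT's selective, backward-path-aware scheme, and all names and sizes are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_int8_quant(x):
    """Rotate with an orthonormal Hadamard matrix, then per-tensor int8 quantize."""
    n = x.shape[-1]                      # scipy's hadamard() requires a power of two
    H = hadamard(n).astype(np.float32) / np.sqrt(n)
    xr = x @ H                           # rotation spreads outliers across coordinates
    scale = np.abs(xr).max() / 127.0
    q = np.clip(np.round(xr / scale), -127, 127).astype(np.int8)
    return q, scale, H

def hadamard_dequant(q, scale, H):
    """Undo the quantization and the rotation."""
    return (q.astype(np.float32) * scale) @ H.T

x = np.random.randn(64, 128).astype(np.float32)
x[0, 0] = 20.0                           # inject an outlier that would hurt plain int8
q, scale, H = hadamard_int8_quant(x)
print(np.abs(hadamard_dequant(q, scale, H) - x).mean())
```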
[LG-22] Efficient Learning for Entropy-regularized Markov Decision Processes via Multilevel Monte Carlo
链接: https://arxiv.org/abs/2503.21224
作者: Matthieu Meunier,Christoph Reisinger,Yufei Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Machine Learning (stat.ML)
*备注: 46 pages, 6 figures
点击查看摘要
Abstract:Designing efficient learning algorithms with complexity guarantees for Markov decision processes (MDPs) with large or continuous state and action spaces remains a fundamental challenge. We address this challenge for entropy-regularized MDPs with Polish state and action spaces, assuming access to a generative model of the environment. We propose a novel family of multilevel Monte Carlo (MLMC) algorithms that integrate fixed-point iteration with MLMC techniques and a generic stochastic approximation of the Bellman operator. We quantify the precise impact of the chosen approximate Bellman operator on the accuracy of the resulting MLMC estimator. Leveraging this error analysis, we show that using a biased plain MC estimate for the Bellman operator results in quasi-polynomial sample complexity, whereas an unbiased randomized multilevel approximation of the Bellman operator achieves polynomial sample complexity in expectation. Notably, these complexity bounds are independent of the dimensions or cardinalities of the state and action spaces, distinguishing our approach from existing algorithms whose complexities scale with the sizes of these spaces. We validate these theoretical performance guarantees through numerical experiments.
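For context, the generic multilevel Monte Carlo identity that such estimators build on (a standard MLMC fact, not a formula quoted from this paper) telescopes the expectation over a hierarchy of increasingly accurate approximations $P_0, \dots, P_L$ of the quantity of interest:

```latex
\mathbb{E}[P_L] \;=\; \mathbb{E}[P_0] + \sum_{\ell=1}^{L} \mathbb{E}\!\left[P_\ell - P_{\ell-1}\right],
\qquad
\widehat{Y} \;=\; \sum_{\ell=0}^{L} \frac{1}{N_\ell} \sum_{i=1}^{N_\ell} \left(P_\ell^{(i)} - P_{\ell-1}^{(i)}\right),
\quad P_{-1} := 0 .
```

Most samples are drawn at the cheap coarse levels and only a few correction terms are estimated at the expensive fine levels; in this paper the role of $P_\ell$ is played by increasingly accurate stochastic approximations of the Bellman operator inside a fixed-point iteration.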
[LG-23] Rethinking Graph Structure Learning in the Era of LLMs
链接: https://arxiv.org/abs/2503.21223
作者: Zhihan Zhang,Xunkai Li,Guang Zeng,Hongchao Qin,Ronghua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注: 17 pages, 8 figures
点击查看摘要
Abstract:Recently, the emergence of large language models (LLMs) has prompted researchers to explore the integration of language descriptions into graphs, aiming to enhance model encoding capabilities from a data-centric perspective. This graph representation is called text-attributed graphs (TAGs). A review of prior advancements highlights that graph structure learning (GSL) is a pivotal technique for improving data utility, making it highly relevant to efficient TAG learning. However, most GSL methods are tailored for traditional graphs without textual information, underscoring the necessity of developing a new GSL paradigm. Despite clear motivations, it remains challenging: (1) How can we define a reasonable optimization objective for GSL in the era of LLMs, considering the massive parameters in LLM? (2) How can we design an efficient model architecture that enables seamless integration of LLM for this optimization objective? For Question 1, we reformulate existing GSL optimization objectives as a tree optimization framework, shifting the focus from obtaining a well-trained edge predictor to a language-aware tree sampler. For Question 2, we propose decoupled and training-free model design principles for LLM integration, shifting the focus from computation-intensive fine-tuning to more efficient inference. Based on this, we propose Large Language and Tree Assistant (LLaTA), which leverages tree-based LLM in-context learning to enhance the understanding of topology and text, enabling reliable inference and generating improved graph structure. Extensive experiments on 10 TAG datasets demonstrate that LLaTA enjoys flexibility - incorporated with any backbone; scalability - outperforms other LLM-based GSL methods in terms of running efficiency; effectiveness - achieves SOTA performance.
[LG-24] Resource-Efficient Federated Fine-Tuning Large Language Models for Heterogeneous Data
链接: https://arxiv.org/abs/2503.21213
作者: Jun Liu,Yunming Liao,Hongli Xu,Yang Xu
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Fine-tuning large language models (LLMs) via federated learning, i.e., FedLLM, has been proposed to adapt LLMs for various downstream applications in a privacy-preserving way. To reduce the fine-tuning costs on resource-constrained devices, FedLoRA is proposed to fine-tune only a small subset of model parameters by integrating low-rank adaptation (LoRA) into FedLLM. However, apart from resource constraints, there is still another critical challenge, i.e., data heterogeneity, severely hindering the implementation of FedLoRA in practical applications. Herein, inspired by the previous group-based federated learning paradigm, we propose a hierarchical FedLoRA framework, termed HierFedLoRA, to address these challenges. Specifically, HierFedLoRA partitions all devices into multiple near-IID groups and adjusts the intra-group aggregation frequency for each group to eliminate the negative effects of non-IID data. Meanwhile, to reduce the computation and communication cost, HierFedLoRA dynamically assigns diverse and suitable fine-tuning depth (i.e., the number of continuous fine-tuning layers from the output) for each group. HierFedLoRA explores jointly optimizing aggregation frequency and depth upon their coupled relationship to better enhance the performance of FedLoRA. Extensive experiments are conducted on a physical platform with 80 commercial devices. The results show that HierFedLoRA improves the final model accuracy by 1.6% to 4.2%, speeding up the fine-tuning process by at least $2.1\times$, compared to the strong baselines.
[LG-25] Learning Generalizable Skills from Offline Multi-Task Data for Multi-Agent Cooperation
链接: https://arxiv.org/abs/2503.21200
作者: Sicong Liu,Yang Shu,Chenjuan Guo,Bin Yang
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
点击查看摘要
Abstract:Learning cooperative multi-agent policy from offline multi-task data that can generalize to unseen tasks with varying numbers of agents and targets is an attractive problem in many scenarios. Although aggregating general behavior patterns among multiple tasks as skills to improve policy transfer is a promising approach, two primary challenges hinder the further advancement of skill learning in offline multi-task MARL. Firstly, extracting general cooperative behaviors from various action sequences as common skills lacks bringing cooperative temporal knowledge into them. Secondly, existing works only involve common skills and can not adaptively choose independent knowledge as task-specific skills in each task for fine-grained action execution. To tackle these challenges, we propose Hierarchical and Separate Skill Discovery (HiSSD), a novel approach for generalizable offline multi-task MARL through skill learning. HiSSD leverages a hierarchical framework that jointly learns common and task-specific skills. The common skills learn cooperative temporal knowledge and enable in-sample exploitation for offline multi-task MARL. The task-specific skills represent the priors of each task and achieve a task-guided fine-grained action execution. To verify the advancement of our method, we conduct experiments on multi-agent MuJoCo and SMAC benchmarks. After training the policy using HiSSD on offline multi-task data, the empirical results show that HiSSD assigns effective cooperative behaviors and obtains superior performance in unseen tasks.
[LG-26] Unveiling the Potential of Superexpressive Networks in Implicit Neural Representations ICLR2025
链接: https://arxiv.org/abs/2503.21166
作者: Uvini Balasuriya Mudiyanselage,Woojin Cho,Minju Jo,Noseong Park,Kookjin Lee
类目: Machine Learning (cs.LG)
*备注: Accepted at ICLR 2025 Workshop on Neural Network Weights as a New Data Modality
点击查看摘要
Abstract:In this study, we examine the potential of one of the "superexpressive" networks in the context of learning neural functions for representing complex signals and performing machine learning downstream tasks. Our focus is on evaluating their performance on computer vision and scientific machine learning tasks including signal representation/inverse problems and solutions of partial differential equations. Through an empirical investigation in various benchmark tasks, we demonstrate that superexpressive networks, as proposed by [Zhang et al. NeurIPS, 2022], which employ a specialized network structure characterized by having an additional dimension, namely width, depth, and "height", can surpass recent implicit neural representations that use highly-specialized nonlinear activation functions.
[LG-27] A Data Balancing and Ensemble Learning Approach for Credit Card Fraud Detection
链接: https://arxiv.org/abs/2503.21160
作者: Yuhan Wang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This research introduces an innovative method for identifying credit card fraud by combining the SMOTE-KMEANS technique with an ensemble machine learning model. The proposed model was benchmarked against traditional models such as logistic regression, decision trees, random forests, and support vector machines. Performance was evaluated using metrics, including accuracy, recall, and area under the curve (AUC). The results demonstrated that the proposed model achieved superior performance, with an AUC of 0.96 when combined with the SMOTE-KMEANS algorithm. This indicates a significant improvement in detecting fraudulent transactions while maintaining high precision and recall. The study also explores the application of different oversampling techniques to enhance the performance of various classifiers. The findings suggest that the proposed method is robust and effective for classification tasks on balanced datasets. Future research directions include further optimization of the SMOTE-KMEANS approach and its integration into existing fraud detection systems to enhance financial security and consumer protection.
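A minimal sketch of the overall recipe (cluster-aware SMOTE oversampling followed by standard classifiers, compared by AUC) using off-the-shelf imbalanced-learn and scikit-learn components. The synthetic data, classifier choices, and hyperparameters below are illustrative stand-ins, not the paper's setup, and KMeansSMOTE's clustering parameters typically need tuning on real, heavily skewed fraud data.

```python
from imblearn.over_sampling import KMeansSMOTE            # requires imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for an imbalanced transaction dataset (5% "fraud")
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# oversample the minority class inside k-means clusters (the SMOTE-KMEANS step);
# a low cluster_balance_threshold keeps sparse minority clusters usable
X_bal, y_bal = KMeansSMOTE(cluster_balance_threshold=0.01,
                           random_state=0).fit_resample(X_tr, y_tr)

for name, clf in [("logistic_regression", LogisticRegression(max_iter=1000)),
                  ("random_forest", RandomForestClassifier(n_estimators=200,
                                                           random_state=0))]:
    clf.fit(X_bal, y_bal)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```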
[LG-28] Real-Time Evaluation Models for RAG : Who Detects Hallucinations Best?
链接: https://arxiv.org/abs/2503.21157
作者: Ashish Sardana
类目: Machine Learning (cs.LG)
*备注: 11 pages, 8 figures
点击查看摘要
Abstract:This article surveys Evaluation models to automatically detect hallucinations in Retrieval-Augmented Generation (RAG), and presents a comprehensive benchmark of their performance across six RAG applications. Methods included in our study include: LLM-as-a-Judge, Prometheus, Lynx, the Hughes Hallucination Evaluation Model (HHEM), and the Trustworthy Language Model (TLM). These approaches are all reference-free, requiring no ground-truth answers/labels to catch incorrect LLM responses. Our study reveals that, across diverse RAG applications, some of these approaches consistently detect incorrect RAG responses with high precision/recall.
[LG-29] Embedding Domain-Specific Knowledge from LLMs into the Feature Engineering Pipeline
链接: https://arxiv.org/abs/2503.21155
作者: João Eduardo Batista
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, 5 tables
点击查看摘要
Abstract:Feature engineering is mandatory in the machine learning pipeline to obtain robust models. While evolutionary computation is well-known for its great results both in feature selection and feature construction, its methods are computationally expensive due to the large number of evaluations required to induce the final model. Part of the reason why these algorithms require a large number of evaluations is their lack of domain-specific knowledge, resulting in a lot of random guessing during evolution. In this work, we propose using Large Language Models (LLMs) as an initial feature construction step to add knowledge to the dataset. By doing so, our results show that the evolution can converge faster, saving us computational resources. The proposed approach only provides the names of the features in the dataset and the target objective to the LLM, making it usable even when working with datasets containing private data. While consistent improvements to test performance were only observed for one-third of the datasets (CSS, PM, and IM10), possibly due to problems being easily explored by LLMs, this approach only decreased the model performance in 1/77 test cases. Additionally, this work introduces the M6GP feature engineering algorithm to symbolic regression, showing it can improve the results of the random forest regressor and produce competitive results with its predecessor, M3GP.
[LG-30] MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
链接: https://arxiv.org/abs/2503.21135
作者: Zihao Zheng,Xiuping Cui,Size Zheng,Maoliang Li,Jiayu Chen,Yun (Eric) Liang,Xiang Chen
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures and 3 tables
点击查看摘要
Abstract:With the advances in artificial intelligence, Mix-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses the models but also significantly accelerates their performance. Existing quantization methods have gradually shifted the focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and relies on the simple one-model-all-data mapping, which is unsuitable for MoEs. This paper proposes a new quantization framework called MoQa. MoQa decouples the data-model distribution complexity of MoEs in multiple analysis stages, quantitively revealing the dynamics during sparse data activation, data-parameter mapping, and inter-expert correlations. Based on these, MoQa identifies particular experts’ and parameters’ significance with optimal data-model distribution awareness and proposes a series of fine-grained mix-quantization strategies adaptive to various data activation and expert combination scenarios. Moreover, MoQa discusses the limitations of existing quantization and analyzes the impact of each stage analysis, showing novel insights for MoE quantization. Experiments show that MoQa achieves a 1.69~2.18 perplexity decrease in language modeling tasks and a 1.58%~8.91% accuracy improvement in zero-shot inference tasks. We believe MoQa will play a role in future MoE construction, optimization, and compression.
[LG-31] AugWard: Augmentation-Aware Representation Learning for Accurate Graph Classification PAKDD2025
链接: https://arxiv.org/abs/2503.21105
作者: Minjun Kim,Jaehyeon Choi,SeungJoo Lee,Jinhong Jung,U Kang
类目: Machine Learning (cs.LG)
*备注: Accepted to PAKDD 2025 (Oral Presentation)
点击查看摘要
Abstract:How can we accurately classify graphs? Graph classification is a pivotal task in data mining with applications in social network analysis, web analysis, drug discovery, molecular property prediction, etc. Graph neural networks have achieved the state-of-the-art performance in graph classification, but they consistently struggle with overfitting. To mitigate overfitting, researchers have introduced various representation learning methods utilizing graph augmentation. However, existing methods rely on simplistic use of graph augmentation, which loses augmentation-induced differences and limits the expressiveness of representations. In this paper, we propose AugWard (Augmentation-Aware Training with Graph Distance and Consistency Regularization), a novel graph representation learning framework that carefully considers the diversity introduced by graph augmentation. AugWard applies augmentation-aware training to predict the graph distance between the augmented graph and its original one, aligning the representation difference directly with graph distance at both feature and structure levels. Furthermore, AugWard employs consistency regularization to encourage the classifier to handle richer representations. Experimental results show that AugWard gives the state-of-the-art performance in supervised, semi-supervised graph classification, and transfer learning.
[LG-32] Low Stein Discrepancy via Message-Passing Monte Carlo ICLR2025
链接: https://arxiv.org/abs/2503.21103
作者: Nathan Kirk,T. Konstantin Rusch,Jakob Zech,Daniela Rus
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 8 pages, 2 figures, Accepted at the ICLR 2025 Workshop on Frontiers in Probabilistic Inference
点击查看摘要
Abstract:Message-Passing Monte Carlo (MPMC) was recently introduced as a novel low-discrepancy sampling approach leveraging tools from geometric deep learning. While originally designed for generating uniform point sets, we extend this framework to sample from general multivariate probability distributions with known probability density function. Our proposed method, Stein-Message-Passing Monte Carlo (Stein-MPMC), minimizes a kernelized Stein discrepancy, ensuring improved sample quality. Finally, we show that Stein-MPMC outperforms competing methods, such as Stein Variational Gradient Descent and (greedy) Stein Points, by achieving a lower Stein discrepancy.
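For reference, the kernelized Stein discrepancy minimized by such methods has the standard closed form below (a textbook definition, not quoted from the paper), where $s_p(x) = \nabla_x \log p(x)$ is the score of the target density and $k$ is a positive-definite kernel; only the score is needed, so $p$ may be unnormalized:

```latex
\mathrm{KSD}^2\big(\{x_i\}_{i=1}^{n}\big)
  = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} u_p(x_i, x_j),
\qquad
u_p(x, y) = s_p(x)^\top k(x, y)\, s_p(y)
          + s_p(x)^\top \nabla_y k(x, y)
          + \nabla_x k(x, y)^\top s_p(y)
          + \operatorname{tr}\!\big(\nabla_x \nabla_y k(x, y)\big).
```

This is the quantity the abstract reports being lower for Stein-MPMC than for the competing Stein Points and Stein Variational Gradient Descent baselines.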
[LG-33] Geographical hotspot prediction based on point cloud-voxel-community partition clustering
链接: https://arxiv.org/abs/2503.21084
作者: Yan Tang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Existing solutions to the hotspot prediction problem in the field of geographic information remain at a relatively preliminary stage. This study presents a novel approach for detecting and predicting geographical hotspots, utilizing point cloud-voxel-community partition clustering. By analyzing high-dimensional data, we represent spatial information through point clouds, which are then subdivided into multiple voxels to enhance analytical efficiency. Our method identifies spatial voxels with similar characteristics through community partitioning, thereby revealing underlying patterns in hotspot distributions. Experimental results indicate that when applied to a dataset of archaeological sites in Turkey, our approach achieves a 19.31% increase in processing speed, with an accuracy loss of merely 6%, outperforming traditional clustering methods. This method not only provides a fresh perspective for hotspot prediction but also serves as an effective tool for high-dimensional data analysis.
[LG-34] Purifying Approximate Differential Privacy with Randomized Post-processing
链接: https://arxiv.org/abs/2503.21071
作者: Yingyu Lin,Erchi Wang,Yi-An Ma,Yu-Xiang Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We propose a framework to convert $(\varepsilon, \delta)$-approximate Differential Privacy (DP) mechanisms into $(\varepsilon, 0)$-pure DP mechanisms, a process we call "purification". This algorithmic technique leverages randomized post-processing with calibrated noise to eliminate the $\delta$ parameter while preserving utility. By combining the tighter utility bounds and computational efficiency of approximate DP mechanisms with the stronger guarantees of pure DP, our approach achieves the best of both worlds. We illustrate the applicability of this framework in various settings, including Differentially Private Empirical Risk Minimization (DP-ERM), data-dependent DP mechanisms such as Propose-Test-Release (PTR), and query release tasks. To the best of our knowledge, this is the first work to provide a systematic method for transforming approximate DP into pure DP while maintaining competitive accuracy and computational efficiency.
[LG-35] Uncertainty propagation in feed-forward neural network models
链接: https://arxiv.org/abs/2503.21059
作者: Jeremy Diamzon,Daniele Venturi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 21 pages, 13 figures
点击查看摘要
Abstract:We develop new uncertainty propagation methods for feed-forward neural network architectures with leaky ReLU activation functions subject to random perturbations in the input vectors. In particular, we derive analytical expressions for the probability density function (PDF) of the neural network output and its statistical moments as a function of the input uncertainty and the parameters of the network, i.e., weights and biases. A key finding is that an appropriate linearization of the leaky ReLU activation function yields accurate statistical results even for large perturbations in the input vectors. This can be attributed to the way information propagates through the network. We also propose new analytically tractable Gaussian copula surrogate models to approximate the full joint PDF of the neural network output. To validate our theoretical results, we conduct Monte Carlo simulations and a thorough error analysis on a multi-layer neural network representing a nonlinear integro-differential operator between two polynomial function spaces. Our findings demonstrate excellent agreement between the theoretical predictions and Monte Carlo simulations.
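The linearization idea can be checked numerically with a short first-order sketch: propagate samples through one leaky-ReLU layer by Monte Carlo and compare against moments obtained by freezing the activation at its local slope. This mirrors only the general principle; the paper derives the output PDF and moments analytically, and the layer sizes, slope, and noise level below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                        # leaky ReLU slope
W, b = rng.normal(size=(3, 5)), rng.normal(size=3) # one layer mapping R^5 -> R^3
x0, sigma = rng.normal(size=5), 0.05               # nominal input, perturbation std

def leaky_relu(z):
    return np.where(z > 0, z, alpha * z)

# Monte Carlo propagation of the input uncertainty
xs = x0 + sigma * rng.normal(size=(100000, 5))
ys = leaky_relu(xs @ W.T + b)
mc_mean, mc_var = ys.mean(axis=0), ys.var(axis=0)

# first-order propagation: freeze the activation at its local slope around z0
z0 = W @ x0 + b
gain = np.where(z0 > 0, 1.0, alpha)
lin_mean = leaky_relu(z0)
lin_var = gain ** 2 * np.diag(W @ (sigma ** 2 * np.eye(5)) @ W.T)

print(np.abs(mc_mean - lin_mean).max(), np.abs(mc_var - lin_var).max())
```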
[LG-36] Integrated utilization of equations and small dataset in the Koopman operator: applications to forward and inverse Problems
链接: https://arxiv.org/abs/2503.21048
作者: Ichiro Ohta,Shota Koyanagi,Kayo Kinjo,Jun Ohkubo
类目: Machine Learning (cs.LG)
*备注: 10 pages, 8 figures
点击查看摘要
Abstract:In recent years, there has been a growing interest in data-driven approaches in physics, such as extended dynamic mode decomposition (EDMD). The EDMD algorithm focuses on nonlinear time-evolution systems, and the constructed Koopman matrix yields the next-time prediction with only linear matrix-product operations. Note that data-driven approaches generally require a large dataset. However, assume that one has some prior knowledge, even if it may be ambiguous. Then, one could achieve sufficient learning from only a small dataset by taking advantage of the prior knowledge. This paper yields methods for incorporating ambiguous prior knowledge into the EDMD algorithm. The ambiguous prior knowledge in this paper corresponds to the underlying time-evolution equations with unknown parameters. First, we apply the proposed method to forward problems, i.e., prediction tasks. Second, we propose a scheme to apply the proposed method to inverse problems, i.e., parameter estimation tasks. We demonstrate the learning with only a small dataset using guiding examples, i.e., the Duffing and the van der Pol systems.
[LG-37] World Model Agents with Change-Based Intrinsic Motivation
链接: https://arxiv.org/abs/2503.21047
作者: Jeremias Ferrao,Rafael Cunha
类目: Machine Learning (cs.LG)
*备注: Submitted to Northern Lights Deep Learning Conference 2025
点击查看摘要
Abstract:Sparse reward environments pose a significant challenge for reinforcement learning due to the scarcity of feedback. Intrinsic motivation and transfer learning have emerged as promising strategies to address this issue. Change Based Exploration Transfer (CBET), a technique that combines these two approaches for model-free algorithms, has shown potential in addressing sparse feedback but its effectiveness with modern algorithms remains understudied. This paper provides an adaptation of CBET for world model algorithms like DreamerV3 and compares the performance of DreamerV3 and IMPALA agents, both with and without CBET, in the sparse reward environments of Crafter and Minigrid. Our tabula rasa results highlight the possibility of CBET improving DreamerV3’s returns in Crafter but the algorithm attains a suboptimal policy in Minigrid with CBET further reducing returns. In the same vein, our transfer learning experiments show that pre-training DreamerV3 with intrinsic rewards does not immediately lead to a policy that maximizes extrinsic rewards in Minigrid. Overall, our results suggest that CBET provides a positive impact on DreamerV3 in more complex environments like Crafter but may be detrimental in environments like Minigrid. In the latter case, the behaviours promoted by CBET in DreamerV3 may not align with the task objectives of the environment, leading to reduced returns and suboptimal policies.
[LG-38] Improving Speech Recognition Accuracy Using Custom Language Models with the Vosk Toolkit
链接: https://arxiv.org/abs/2503.21025
作者: Aniket Abhishek Soni
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, includes workflow diagram, accuracy and WER comparisons, spectrograms, and model evaluation
点击查看摘要
Abstract:Although speech recognition algorithms have developed quickly in recent years, achieving high transcription accuracy across diverse audio formats and acoustic environments remains a major challenge. This work explores how incorporating custom language models with the open-source Vosk Toolkit can improve speech-to-text accuracy in varied settings. Unlike many conventional systems limited to specific audio types, this approach supports multiple audio formats such as WAV, MP3, FLAC, and OGG by using Python modules for preprocessing and format conversion. A Python-based transcription pipeline was developed to process input audio, perform speech recognition using Vosk’s KaldiRecognizer, and export the output to a DOCX file. Results showed that custom models reduced word error rates, especially in domain-specific scenarios involving technical terminology, varied accents, or background noise. This work presents a cost-effective, offline solution for high-accuracy transcription and opens up future opportunities for automation and real-time applications.
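A minimal sketch of such a pipeline using the public Vosk and python-docx APIs is shown below. It assumes a 16 kHz mono PCM WAV input and an already-downloaded (optionally custom) model directory; conversion of MP3/FLAC/OGG inputs (e.g. with ffmpeg or pydub) and the custom language model itself are omitted, and the file and model names are placeholders.

```python
import json
import wave

from docx import Document                 # python-docx
from vosk import Model, KaldiRecognizer   # vosk

def transcribe_to_docx(wav_path, model_dir, out_path):
    """Decode a 16 kHz mono PCM WAV file with Vosk and save the text to a DOCX file."""
    wf = wave.open(wav_path, "rb")
    rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
    pieces = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):       # a complete utterance was decoded
            pieces.append(json.loads(rec.Result()).get("text", ""))
    pieces.append(json.loads(rec.FinalResult()).get("text", ""))

    doc = Document()
    doc.add_paragraph(" ".join(p for p in pieces if p))
    doc.save(out_path)

# placeholder paths: point these at a real WAV file and Vosk model directory
transcribe_to_docx("speech.wav", "vosk-model-small-en-us-0.15", "transcript.docx")
```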
[LG-39] Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework
链接: https://arxiv.org/abs/2503.21023
作者: Thomson Yen,Andrew Wei Tung Siah,Haozhe Chen,Tianyi Peng,Daniel Guetta,Hongseok Namkoong
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data domains and downstream tasks. Although scaling laws can provide a principled and general approach for data curation, standard deterministic extrapolation from small-scale experiments to larger scales requires strong assumptions on the reliability of such extrapolation, whose brittleness has been highlighted in prior works. In this paper, we introduce a probabilistic extrapolation framework for data mixture optimization that avoids rigid assumptions and explicitly models the uncertainty in performance across decision variables. We formulate data curation as a sequential decision-making problem (multi-fidelity, multi-scale Bayesian optimization) where {data mixtures, model scale, training steps} are adaptively selected to balance training cost and potential information gain. Our framework naturally gives rise to algorithm prototypes that leverage noisy information from inexpensive experiments to systematically inform costly training decisions. To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve 2.6x and 3.3x speedups compared to multi-fidelity BO and random search baselines. Taken together, our framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods.
[LG-40] Offline Action-Free Learning of Ex-BMDPs by Comparing Diverse Datasets
链接: https://arxiv.org/abs/2503.21018
作者: Alexander Levine,Peter Stone,Amy Zhang
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:While sequential decision-making environments often involve high-dimensional observations, not all features of these observations are relevant for control. In particular, the observation space may capture factors of the environment which are not controllable by the agent, but which add complexity to the observation space. The need to ignore these “noise” features in order to operate in a tractably-small state space poses a challenge for efficient policy learning. Due to the abundance of video data available in many such environments, task-independent representation learning from action-free offline data offers an attractive solution. However, recent work has highlighted theoretical limitations in action-free learning under the Exogenous Block MDP (Ex-BMDP) model, where temporally-correlated noise features are present in the observations. To address these limitations, we identify a realistic setting where representation learning in Ex-BMDPs becomes tractable: when action-free video data from multiple agents with differing policies are available. Concretely, this paper introduces CRAFT (Comparison-based Representations from Action-Free Trajectories), a sample-efficient algorithm leveraging differences in controllable feature dynamics across agents to learn representations. We provide theoretical guarantees for CRAFT’s performance and demonstrate its feasibility on a toy example, offering a foundation for practical methods in similar settings.
[LG-41] Deep Learning for Forensic Identification of Source
链接: https://arxiv.org/abs/2503.20994
作者: Cole Patten,Christopher Saunders,Michael Puthawala
类目: Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:We used contrastive neural networks to learn useful similarity scores between the 144 cartridge casings in the NBIDE dataset, under the common-but-unknown source paradigm. The common-but-unknown source problem is a problem archetype in forensics where the question is whether two objects share a common source (e.g. were two cartridge casings fired from the same firearm). Similarity scores are often used to interpret evidence under this paradigm. We directly compared our results to a state-of-the-art algorithm, Congruent Matching Cells (CMC). When trained on the E3 dataset of 2967 cartridge casings, contrastive learning achieved an ROC AUC of 0.892. The CMC algorithm achieved 0.867. We also conducted an ablation study where we varied the neural network architecture; specifically, the network’s width or depth. The ablation study showed that contrastive network performance results are somewhat robust to the network architecture. This work was in part motivated by the use of similarity scores attained via contrastive learning for standard evidence interpretation methods such as score-based likelihood ratios.
[LG-42] Reinforcement Learning for Efficient Toxicity Detection in Competitive Online Video Games
链接: https://arxiv.org/abs/2503.20968
作者: Jacob Morrier,Rafal Kocielnik,R. Michael Alvarez
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Online platforms take proactive measures to detect and address undesirable behavior, aiming to focus these resource-intensive efforts where such behavior is most prevalent. This article considers the problem of efficient sampling for toxicity detection in competitive online video games. To make optimal monitoring decisions, video game service operators need estimates of the likelihood of toxic behavior. If no model is available for these predictions, one must be estimated in real time. To close this gap, we propose a contextual bandit algorithm that makes monitoring decisions based on a small set of variables that, according to domain expertise, are associated with toxic behavior. This algorithm balances exploration and exploitation to optimize long-term outcomes and is deliberately designed for easy deployment in production. Using data from the popular first-person action game Call of Duty: Modern Warfare III, we show that our algorithm consistently outperforms baseline algorithms that rely solely on players’ past behavior. This finding has substantive implications for the nature of toxicity. It also illustrates how domain expertise can be harnessed to help video game service operators identify and mitigate toxicity, ultimately fostering a safer and more enjoyable gaming experience.
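The abstract does not spell out the exact bandit formulation, so the sketch below uses plain LinUCB as a stand-in contextual bandit: the context is a small vector of behavioral features, arm 1 means "monitor this match", and the reward is whether toxicity was actually caught. The simulated features and reward model are illustrative only and are not from the paper.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB; arm 0 = do not monitor, arm 1 = monitor."""
    def __init__(self, dim, alpha=0.5, n_arms=2):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # sum of x x^T plus identity
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # sum of reward * x

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# toy simulation: monitoring pays off only when the latent toxicity risk is high
rng = np.random.default_rng(0)
true_w = np.array([1.5, -0.5, 0.8])                      # hypothetical risk weights
bandit = LinUCB(dim=3)
for _ in range(5000):
    x = rng.normal(size=3)                               # behavioral features
    p_toxic = 1.0 / (1.0 + np.exp(-(x @ true_w)))
    arm = bandit.choose(x)
    reward = float(rng.random() < p_toxic) if arm == 1 else 0.0
    bandit.update(arm, x, reward)
```

In production one would additionally cap the monitoring budget and log the chosen contexts, but the core exploration-exploitation update is the few lines above.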
[LG-43] Global and Local Structure Learning for Sparse Tensor Completion
链接: https://arxiv.org/abs/2503.20929
作者: Dawon Ahn,Evangelos E. Papalexakis
类目: Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:How can we accurately complete tensors by learning relationships of dimensions along each mode? Tensor completion, a widely studied problem, is to predict missing entries in incomplete tensors. Tensor decomposition methods, fundamental tensor analysis tools, have been actively developed to solve tensor completion tasks. However, standard tensor decomposition models have not been designed to learn relationships of dimensions along each mode, which limits accurate tensor completion. Also, previously developed tensor decomposition models require prior knowledge of the relations within dimensions to model them, which is expensive to obtain. This paper proposes TGL (Tensor Decomposition Learning Global and Local Structures) to accurately predict missing entries in tensors. TGL reconstructs a tensor with factor matrices that learn local structures with a GNN, without prior knowledge. Extensive experiments are conducted to evaluate TGL with baselines and datasets.
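As background on what predicting missing entries with factor matrices involves, here is a plain masked CP-completion sketch that fits the factors by gradient descent on the observed entries only. It deliberately omits TGL's GNN-learned local structure; the sizes, rank, and learning rate are arbitrary.

```python
import numpy as np

def cp_complete(T_obs, mask, rank=3, lr=0.01, iters=3000, seed=0):
    """Fit CP factors U, V, W to the observed entries of a 3-way tensor and return the reconstruction."""
    rng = np.random.default_rng(seed)
    I, J, K = T_obs.shape
    U = rng.normal(scale=0.1, size=(I, rank))
    V = rng.normal(scale=0.1, size=(J, rank))
    W = rng.normal(scale=0.1, size=(K, rank))
    for _ in range(iters):
        R = np.einsum("ir,jr,kr->ijk", U, V, W)   # current reconstruction
        E = (R - T_obs) * mask                    # error on observed entries only
        U -= lr * np.einsum("ijk,jr,kr->ir", E, V, W)
        V -= lr * np.einsum("ijk,ir,kr->jr", E, U, W)
        W -= lr * np.einsum("ijk,ir,jr->kr", E, U, V)
    return np.einsum("ir,jr,kr->ijk", U, V, W)

# toy usage: recover a rank-3 tensor from 20% of its entries
rng = np.random.default_rng(1)
A, B, C = (rng.normal(size=(20, 3)) for _ in range(3))
T = np.einsum("ir,jr,kr->ijk", A, B, C)
mask = (rng.random(T.shape) < 0.2).astype(float)
T_hat = cp_complete(T * mask, mask, rank=3)
print(np.abs((T_hat - T) * (1 - mask)).mean())    # error on the unobserved entries
```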
[LG-44] TransDiffSBDD: Causality-Aware Multi-Modal Structure-Based Drug Design
链接: https://arxiv.org/abs/2503.20913
作者: Xiuyuan Hu,Guoqing Liu,Can Chen,Yang Zhao,Hao Zhang,Xue Liu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Structure-based drug design (SBDD) is a critical task in drug discovery, requiring the generation of molecular information across two distinct modalities: discrete molecular graphs and continuous 3D coordinates. However, existing SBDD methods often overlook two key challenges: (1) the multi-modal nature of this task and (2) the causal relationship between these modalities, limiting their plausibility and performance. To address both challenges, we propose TransDiffSBDD, an integrated framework combining autoregressive transformers and diffusion models for SBDD. Specifically, the autoregressive transformer models discrete molecular information, while the diffusion model samples continuous distributions, effectively resolving the first challenge. To address the second challenge, we design a hybrid-modal sequence for protein-ligand complexes that explicitly respects the causality between modalities. Experiments on the CrossDocked2020 benchmark demonstrate that TransDiffSBDD outperforms existing baselines.
[LG-45] TAR: Teacher-Aligned Representations via Contrastive Learning for Quadrupedal Locomotion IROS
链接: https://arxiv.org/abs/2503.20839
作者: Amr Mousa,Neil Karavis,Michele Caprio,Wei Pan,Richard Allmendinger
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This work has been submitted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2025 for review
点击查看摘要
Abstract:Quadrupedal locomotion via Reinforcement Learning (RL) is commonly addressed using the teacher-student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between the privileged teacher and the proprioceptive-only student, covariate shift due to behavioral cloning, and lack of deployable adaptation lead to poor generalization in real-world scenarios. We propose Teacher-Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self-supervised contrastive learning to bridge this gap. By aligning representations to a privileged teacher in simulation via contrastive objectives, our student policy learns structured latent spaces and exhibits robust generalization to Out-of-Distribution (OOD) scenarios, surpassing the fully privileged “Teacher”. Results show training accelerated by 2x compared to state-of-the-art baselines in reaching peak performance, and OOD generalization improved by 40 percent on average over existing methods. Additionally, TAR transitions seamlessly into learning during deployment without requiring privileged states, setting a new benchmark in sample-efficient, adaptive locomotion and enabling continual fine-tuning in real-world scenarios. Open-source code and videos are available at this https URL.
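A minimal sketch of aligning a student encoder to a teacher encoder with an InfoNCE-style contrastive objective is shown below; the encoders, dimensions, and random data are illustrative assumptions (the teacher here is an untrained stand-in, whereas in practice it would be pre-trained on privileged observations), not the TAR implementation:

```python
# Sketch: align student (proprioception-only) embeddings to teacher (privileged)
# embeddings with an InfoNCE contrastive loss; matching time steps are positives.
import torch
import torch.nn.functional as F

def info_nce(student_z, teacher_z, temperature=0.1):
    s = F.normalize(student_z, dim=-1)
    t = F.normalize(teacher_z, dim=-1)
    logits = s @ t.T / temperature          # similarity of every student/teacher pair
    labels = torch.arange(s.size(0))        # diagonal pairs are the positives
    return F.cross_entropy(logits, labels)

student = torch.nn.Linear(48, 32)   # proprioception only (hypothetical sizes)
teacher = torch.nn.Linear(96, 32)   # proprioception + privileged state
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    proprio = torch.randn(64, 48)
    privileged = torch.cat([proprio, torch.randn(64, 48)], dim=-1)
    loss = info_nce(student(proprio), teacher(privileged).detach())
    opt.zero_grad(); loss.backward(); opt.step()
print("final alignment loss:", float(loss))
```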
[LG-46] Leveraging VAE-Derived Latent Spaces for Enhanced Malware Detection with Machine Learning Classifiers
链接: https://arxiv.org/abs/2503.20803
作者: Bamidele Ajayi,Basel Barakat,Ken McGarry
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:This paper assesses the performance of five machine learning classifiers (Decision Tree, Naive Bayes, LightGBM, Logistic Regression, and Random Forest) using latent representations learned by a Variational Autoencoder from malware datasets. Experiments conducted on different training-test splits with different random seeds reveal that all models detect malware well, with ensemble methods (LightGBM and Random Forest) performing slightly better than the rest. In addition, the use of latent features reduces the computational cost of the model and the need for extensive hyperparameter tuning, improving the model's efficiency for deployment. Statistical tests show that these improvements are significant, establishing the practical relevance of integrating latent space representations with traditional classifiers for effective malware detection in cybersecurity.
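The downstream comparison step can be sketched with scikit-learn as below, using random synthetic features as stand-ins for VAE latent encodings (LightGBM is omitted to keep dependencies minimal); this is a generic sketch, not the authors' pipeline:

```python
# Compare standard classifiers on low-dimensional "latent" features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Pretend these 16 columns are VAE latent means for each (synthetic) sample.
X, y = make_classification(n_samples=2000, n_features=16, n_informative=10, random_state=0)
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```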
[LG-47] Molecular Quantum Transformer
链接: https://arxiv.org/abs/2503.21686
作者: Yuichi Kamata,Quoc Hoan Tran,Yasuhiro Endo,Hirotaka Oshima
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 13 pages, 8 figures
点击查看摘要
Abstract:The Transformer model, renowned for its powerful attention mechanism, has achieved state-of-the-art performance in various artificial intelligence tasks but faces challenges such as high computational cost and memory usage. Researchers are exploring quantum computing to enhance the Transformer’s design, though it still shows limited success with classical data. With a growing focus on leveraging quantum machine learning for quantum data, particularly in quantum chemistry, we propose the Molecular Quantum Transformer (MQT) for modeling interactions in molecular quantum systems. By utilizing quantum circuits to implement the attention mechanism on the molecular configurations, MQT can efficiently calculate ground-state energies for all configurations. Numerical demonstrations show that in calculating ground-state energies for H_2, LiH, BeH_2, and H_4, MQT outperforms the classical Transformer, highlighting the promise of quantum effects in Transformer structures. Furthermore, its pretraining capability on diverse molecular data facilitates the efficient learning of new molecules, extending its applicability to complex molecular systems with minimal additional effort. Our method offers an alternative to existing quantum algorithms for estimating ground-state energies, opening new avenues in quantum chemistry and materials science.
[LG-48] A Comprehensive Benchmark for RNA 3D Structure-Function Modeling
链接: https://arxiv.org/abs/2503.21681
作者: Luis Wyss,Vincent Mallet,Wissam Karroucha,Karsten Borgwardt,Carlos Oliver
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:The RNA structure-function relationship has recently garnered significant attention within the deep learning community, promising to grow in importance as nucleic acid structure models advance. However, the absence of standardized and accessible benchmarks for deep learning on RNA 3D structures has impeded the development of models for RNA functional characteristics. In this work, we introduce a set of seven benchmarking datasets for RNA structure-function prediction, designed to address this gap. Our library builds on the established Python library rnaglib, and offers easy data distribution and encoding, splitters and evaluation methods, providing a convenient all-in-one framework for comparing models. Datasets are implemented in a fully modular and reproducible manner, facilitating community contributions and customization. Finally, we provide initial baseline results for all tasks using a graph neural network. Source code: this https URL Documentation: this https URL
[LG-49] Nonlinear Multiple Response Regression and Learning of Latent Spaces
链接: https://arxiv.org/abs/2503.21608
作者: Ye Tian,Sanyou Wu,Long Feng
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Identifying low-dimensional latent structures within high-dimensional data has long been a central topic in the machine learning community, driven by the need for data compression, storage, transmission, and deeper data understanding. Traditional methods, such as principal component analysis (PCA) and autoencoders (AE), operate in an unsupervised manner, ignoring label information even when it is available. In this work, we introduce a unified method capable of learning latent spaces in both unsupervised and supervised settings. We formulate the problem as a nonlinear multiple-response regression within an index model context. By applying the generalized Stein’s lemma, the latent space can be estimated without knowing the nonlinear link functions. Our method can be viewed as a nonlinear generalization of PCA. Moreover, unlike AE and other neural network methods that operate as “black boxes”, our approach not only offers better interpretability but also reduces computational complexity while providing strong theoretical guarantees. Comprehensive numerical experiments and real data analyses demonstrate the superior performance of our method.
[LG-50] Probabilistic Functional Neural Networks
链接: https://arxiv.org/abs/2503.21585
作者: Haixu Wang,Jiguo Cao
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:High-dimensional functional time series (HDFTS) are often characterized by nonlinear trends and high spatial dimensions. Such data poses unique challenges for modeling and forecasting due to the nonlinearity, nonstationarity, and high dimensionality. We propose a novel probabilistic functional neural network (ProFnet) to address these challenges. ProFnet integrates the strengths of feedforward and deep neural networks with probabilistic modeling. The model generates probabilistic forecasts using Monte Carlo sampling and also enables the quantification of uncertainty in predictions. While capturing both temporal and spatial dependencies across multiple regions, ProFnet offers a scalable and unified solution for large datasets. Applications to Japan’s mortality rates demonstrate superior performance. This approach enhances predictive accuracy and provides interpretable uncertainty estimates, making it a valuable tool for forecasting complex high-dimensional functional data and HDFTS.
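As a generic illustration of probabilistic forecasting with Monte Carlo sampling (here via MC dropout, which is an assumption of this sketch; ProFnet's actual architecture also models spatial dependence across regions), a short example:

```python
# Keep dropout active at prediction time and draw many stochastic forecasts to
# obtain a predictive mean and an approximate uncertainty band.
import torch
import torch.nn as nn

class MCDropoutForecaster(nn.Module):
    def __init__(self, in_dim=24, hidden=64, out_dim=12, p=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, out_dim))
    def forward(self, x):
        return self.net(x)

model = MCDropoutForecaster()
x = torch.randn(1, 24)              # last 24 observed values of one (fake) series
model.train()                       # train mode keeps dropout stochastic
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(200)])   # 200 Monte Carlo draws
mean, std = samples.mean(0), samples.std(0)
lower, upper = mean - 1.96 * std, mean + 1.96 * std          # approximate 95% band
print(mean.shape, lower.shape, upper.shape)
```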
[LG-51] Formation Shape Control using the Gromov-Wasserstein Metric
链接: https://arxiv.org/abs/2503.21538
作者: Haruto Nakashima,Siddhartha Ganguly,Kohei Morimoto,Kenji Kashima
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
*备注: To appear in the proceedings of Learning for Dynamics and Control (L4DC) conference, PMLR, 2025
点击查看摘要
Abstract:This article introduces a formation shape control algorithm, in the optimal control framework, for steering an initial population of agents to a desired configuration via employing the Gromov-Wasserstein distance. The underlying dynamical system is assumed to be a constrained linear system and the objective function is a sum of quadratic control-dependent stage cost and a Gromov-Wasserstein terminal cost. The inclusion of the Gromov-Wasserstein cost transforms the resulting optimal control problem into a well-known NP-hard problem, making it both numerically demanding and difficult to solve with high accuracy. Towards that end, we employ a recent semi-definite relaxation-driven technique to tackle the Gromov-Wasserstein distance. A numerical example is provided to illustrate our results.
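Evaluating a Gromov-Wasserstein cost between the current agent configuration and a desired formation shape can be sketched with the POT library as below; the circular target shape and uniform weights are assumptions, and the paper's optimal-control layer and semi-definite relaxation are not reproduced:

```python
# GW compares intra-configuration distance structures, so the cost is invariant
# to rigid motions of the target shape; here we only evaluate the terminal cost.
import numpy as np
import ot                        # POT: Python Optimal Transport (pip install pot)

rng = np.random.default_rng(0)
current = rng.normal(size=(8, 2))                               # 8 agents, 2-D positions
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
desired = np.stack([np.cos(theta), np.sin(theta)], axis=1)      # circular formation

C1 = ot.dist(current, current)          # pairwise squared distances within each config
C2 = ot.dist(desired, desired)
p = q = np.full(8, 1 / 8)               # uniform weights over agents
gw_cost = ot.gromov.gromov_wasserstein2(C1, C2, p, q, 'square_loss')
print("Gromov-Wasserstein terminal cost:", gw_cost)
```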
[LG-52] Bayesian Pseudo Posterior Mechanism for Differentially Private Machine Learning
链接: https://arxiv.org/abs/2503.21528
作者: Robert Chew,Matthew R. Williams,Elan A. Segarra,Alexander J. Preiss,Amanda Konet,Terrance D. Savitsky
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Differential privacy (DP) is becoming increasingly important for deployed machine learning applications because it provides strong guarantees for protecting the privacy of individuals whose data is used to train models. However, DP mechanisms commonly used in machine learning tend to struggle on many real world distributions, including highly imbalanced or small labeled training sets. In this work, we propose a new scalable DP mechanism for deep learning models, SWAG-PPM, by using a pseudo posterior distribution that downweights by-record likelihood contributions proportionally to their disclosure risks as the randomized mechanism. As a motivating example from official statistics, we demonstrate SWAG-PPM on a workplace injury text classification task using a highly imbalanced public dataset published by the U.S. Occupational Safety and Health Administration (OSHA). We find that SWAG-PPM exhibits only modest utility degradation against a non-private comparator while greatly outperforming the industry standard DP-SGD for a similar privacy budget.
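A highly simplified sketch of the pseudo-posterior idea of downweighting per-record likelihood contributions by disclosure risk is given below; the risk scores, weighting rule, and model are placeholders, and no formal differential-privacy accounting is performed here:

```python
# Weight each record's loss inversely to a (made-up) disclosure-risk score, a
# rough analogue of the pseudo-likelihood downweighting described above.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(500, 20)
y = torch.randint(0, 2, (500,))
risk = torch.rand(500)                 # hypothetical disclosure-risk scores in [0, 1]
weights = 1.0 - 0.9 * risk             # riskier records contribute less

model = torch.nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(200):
    loss_per_record = F.cross_entropy(model(X), y, reduction='none')
    loss = (weights * loss_per_record).mean()   # downweighted pseudo-likelihood
    opt.zero_grad(); loss.backward(); opt.step()
print("weighted training loss:", float(loss))
```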
[LG-53] Constraint-based causal discovery with tiered background knowledge and latent variables in single or overlapping datasets
链接: https://arxiv.org/abs/2503.21526
作者: Christine W. Bang,Vanessa Didelez
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Accepted for the 4th Conference on Causal Learning and Reasoning (CLeaR 2025)
点击查看摘要
Abstract:In this paper we consider the use of tiered background knowledge within constraint based causal discovery. Our focus is on settings relaxing causal sufficiency, i.e. allowing for latent variables which may arise because relevant information could not be measured at all, or not jointly, as in the case of multiple overlapping datasets. We first present novel insights into the properties of the ‘tiered FCI’ (tFCI) algorithm. Building on this, we introduce a new extension of the IOD (integrating overlapping datasets) algorithm incorporating tiered background knowledge, the ‘tiered IOD’ (tIOD) algorithm. We show that under full usage of the tiered background knowledge tFCI and tIOD are sound, while simple versions of the tIOD and tFCI are sound and complete. We further show that the tIOD algorithm can often be expected to be considerably more efficient and informative than the IOD algorithm even beyond the obvious restriction of the Markov equivalence classes. We provide a formal result on the conditions for this gain in efficiency and informativeness. Our results are accompanied by a series of examples illustrating the exact role and usefulness of tiered background knowledge.
[LG-54] DeepRV: pre-trained spatial priors for accelerated disease mapping
链接: https://arxiv.org/abs/2503.21473
作者: Jhonathan Navott,Daniel Jenson,Seth Flaxman,Elizaveta Semenova
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:Recently introduced prior-encoding deep generative models (e.g., PriorVAE, πVAE, and PriorCVAE) have emerged as powerful tools for scalable Bayesian inference by emulating complex stochastic processes like Gaussian processes (GPs). However, these methods remain largely a proof-of-concept and inaccessible to practitioners. We propose DeepRV, a lightweight, decoder-only approach that accelerates training, and enhances real-world applicability in comparison to current VAE-based prior encoding approaches. Leveraging probabilistic programming frameworks (e.g., NumPyro) for inference, DeepRV achieves significant speedups while also improving the quality of parameter inference, closely matching full MCMC sampling. We showcase its effectiveness in process emulation and spatial analysis of the UK using simulated data, gender-wise cancer mortality rates for individuals under 50, and HIV prevalence in Zimbabwe. To bridge the gap between theory and practice, we provide a user-friendly API, enabling scalable and efficient Bayesian inference.
[LG-55] Exploring the flavor structure of leptons via diffusion models
链接: https://arxiv.org/abs/2503.21432
作者: Satsuki Nishimura,Hajime Otsuka,Haruki Uchiyama
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
*备注: 23 pages, 5 figures
点击查看摘要
Abstract:We propose a method to explore the flavor structure of leptons using diffusion models, which are a well-known class of generative artificial intelligence (generative AI). We consider a simple extension of the Standard Model with the type I seesaw mechanism and train a neural network to generate the neutrino mass matrix. By utilizing transfer learning, the diffusion model generates 104 solutions that are consistent with the neutrino mass squared differences and the leptonic mixing angles. The distributions of the CP phases and the sums of neutrino masses, which are not included in the conditional labels but are calculated from the solutions, exhibit non-trivial tendencies. In addition, the effective mass in neutrinoless double beta decay is concentrated near the boundaries of the existing confidence intervals, allowing us to verify the obtained solutions through future experiments. An inverse approach using the diffusion model is expected to facilitate the experimental verification of flavor models from a perspective distinct from conventional analytical methods.
[LG-56] From Deep Learning to LLMs: A Survey of AI in Quantitative Investment
链接: https://arxiv.org/abs/2503.21422
作者: Bokai Cao,Saizhuo Wang,Xinyi Lin,Xiaojun Wu,Haohan Zhang,Lionel M. Ni,Jian Guo
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR)
*备注:
点击查看摘要
Abstract:Quantitative investment (quant) is an emerging, technology-driven approach in asset management, increasingly shaped by advancements in artificial intelligence. Recent advances in deep learning and large language models (LLMs) for quant finance have improved predictive modeling and enabled agent-based automation, suggesting a potential paradigm shift in this field. In this survey, taking alpha strategy as a representative example, we explore how AI contributes to the quantitative investment pipeline. We first examine the early stage of quant research, centered on human-crafted features and traditional statistical models with an established alpha pipeline. We then discuss the rise of deep learning, which enabled scalable modeling across the entire pipeline from data processing to order execution. Building on this, we highlight the emerging role of LLMs in extending AI beyond prediction, empowering autonomous agents to process unstructured data, generate alphas, and support self-iterative workflows.
[LG-57] Explainable Boosting Machine for Predicting Claim Severity and Frequency in Car Insurance
链接: https://arxiv.org/abs/2503.21321
作者: Markéta Krùpovà,Nabil Rachdi,Quentin Guibert(CEREMADE)
类目: Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:In a context of constant increase in competition and heightened regulatory pressure, accuracy, actuarial precision, as well as transparency and understanding of the tariff, are key issues in non-life insurance. Traditionally used generalized linear models (GLM) result in a multiplicative tariff that favors interpretability. With the rapid development of machine learning and deep learning techniques, actuaries and the rest of the insurance industry have adopted these techniques widely. However, there is a need to associate them with interpretability techniques. In this paper, our study focuses on introducing an Explainable Boosting Machine (EBM) model that combines intrinsically interpretable characteristics and high prediction performance. This approach is described as a glass-box model and relies on the use of a Generalized Additive Model (GAM) and a cyclic gradient boosting algorithm. It accounts for univariate and pairwise interaction effects between features and provides naturally explanations on them. We implement this approach on car insurance frequency and severity data and extensively compare the performance of this approach with classical competitors: a GLM, a GAM, a CART model and an Extreme Gradient Boosting (XGB) algorithm. Finally, we examine the interpretability of these models to capture the main determinants of claim costs.
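A minimal glass-box EBM fit with the interpret library might look as follows; the feature names, the synthetic severity target, and the default objective are assumptions, and the paper's actual frequency/severity modelling (e.g. Poisson or Gamma settings, real tariff variables) is not reproduced:

```python
# Fit an Explainable Boosting Machine (GAM + cyclic boosting with pairwise
# interactions) on a synthetic severity-like target and inspect its shape functions.
import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingRegressor

rng = np.random.default_rng(0)
n = 5000
X = pd.DataFrame({
    "driver_age": rng.integers(18, 80, n),
    "vehicle_power": rng.integers(4, 15, n),
    "bonus_malus": rng.integers(50, 150, n),
})
# Synthetic claim severity with a nonlinear age effect and one interaction.
severity = (2000 + 30 * (X["driver_age"] - 45) ** 2 / 45
            + 50 * X["vehicle_power"] * (X["bonus_malus"] > 100)
            + rng.gamma(2.0, 200.0, n))

ebm = ExplainableBoostingRegressor(interactions=5, random_state=0)
ebm.fit(X, severity)
explanation = ebm.explain_global()   # per-feature shape functions and interaction terms
print(type(explanation).__name__)
```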
[LG-58] Simulation-informed deep learning for enhanced SWOT observations of fine-scale ocean dynamics
链接: https://arxiv.org/abs/2503.21303
作者: Eugenio Cutolo(IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY),Carlos Granero-Belinchon(ODYSSEY, IMT Atlantique - MEE, Lab-STICC_OSE),Ptashanna Thiraux(IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY),Jinbo Wang(JPL),Ronan Fablet(IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY)
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
点击查看摘要
Abstract:Oceanic processes at fine scales are crucial yet difficult to observe accurately due to limitations in satellite and in-situ measurements. The Surface Water and Ocean Topography (SWOT) mission provides high-resolution Sea Surface Height (SSH) data, though noise patterns often obscure fine scale structures. Current methods struggle with noisy data or require extensive supervised training, limiting their effectiveness on real-world observations. We introduce SIMPGEN (Simulation-Informed Metric and Prior for Generative Ensemble Networks), an unsupervised adversarial learning framework combining real SWOT observations with simulated reference data. SIMPGEN leverages wavelet-informed neural metrics to distinguish noisy from clean fields, guiding realistic SSH reconstructions. Applied to SWOT data, SIMPGEN effectively removes noise, preserving fine-scale features better than existing neural methods. This robust, unsupervised approach not only improves SWOT SSH data interpretation but also demonstrates strong potential for broader oceanographic applications, including data assimilation and super-resolution.
[LG-59] Interpretable Cross-Sphere Multiscale Deep Learning Predicts ENSO Skilfully Beyond 2 Years
链接: https://arxiv.org/abs/2503.21211
作者: Rixu Hao,Yuxin Zhao,Shaoqing Zhang,Guihua Wang,Xiong Deng
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注: 13 pages, 4 figures
点击查看摘要
Abstract:El Niño-Southern Oscillation (ENSO) exerts global climate and societal impacts, but real-time prediction with lead times beyond one year remains challenging. Dynamical models suffer from large biases and uncertainties, while deep learning struggles with interpretability and multi-scale dynamics. Here, we introduce PTSTnet, an interpretable model that unifies dynamical processes and cross-scale spatiotemporal learning in an innovative neural-network framework with physics-encoding learning. PTSTnet produces interpretable predictions significantly outperforming state-of-the-art benchmarks with lead times beyond 24 months, providing physical insights into error propagation in ocean-atmosphere interactions. PTSTnet learns feature representations with physical consistency from sparse data to tackle inherent multi-scale and multi-physics challenges underlying ocean-atmosphere processes, thereby inherently enhancing long-term prediction skill. Our successful realizations mark substantial steps forward in interpretable insights into innovative neural ocean modelling.
[LG-60] Squared families: Searching beyond regular probability models
链接: https://arxiv.org/abs/2503.21128
作者: Russell Tsuchida,Jiawei Liu,Cheng Soon Ong,Dino Sejdinovic
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 43 pages. Preprint
点击查看摘要
Abstract:We introduce squared families, which are families of probability densities obtained by squaring a linear transformation of a statistic. Squared families are singular, however their singularity can easily be handled so that they form regular models. After handling the singularity, squared families possess many convenient properties. Their Fisher information is a conformal transformation of the Hessian metric induced from a Bregman generator. The Bregman generator is the normalising constant, and yields a statistical divergence on the family. The normalising constant admits a helpful parameter-integral factorisation, meaning that only one parameter-independent integral needs to be computed for all normalising constants in the family, unlike in exponential families. Finally, the squared family kernel is the only integral that needs to be computed for the Fisher information, statistical divergence and normalising constant. We then describe how squared families are special in the broader class of g-families, which are obtained by applying a sufficiently regular function g to a linear transformation of a statistic. After removing special singularities, positively homogeneous families and exponential families are the only g-families for which the Fisher information is a conformal transformation of the Hessian metric, where the generator depends on the parameter only through the normalising constant. Even-order monomial families also admit parameter-integral factorisations, unlike exponential families. We study parameter estimation and density estimation in squared families, in the well-specified and misspecified settings. We use a universal approximation property to show that squared families can learn sufficiently well-behaved target densities at a rate of \mathcal{O}(N^{-1/2}) + C n^{-1/4}, where N is the number of datapoints, n is the number of parameters, and C is some constant.
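One plausible way to write down the construction described above (notation assumed here, not quoted from the paper) is a density proportional to the square of a linear transformation of a statistic, whose normalising constant factorises into a single parameter-independent matrix integral:

```latex
% Hedged notation: t(x) is the statistic, \mu a base measure, \theta the parameter.
p_\theta(x) = \frac{\big(\theta^\top t(x)\big)^2 \, \mu(x)}{Z(\theta)},
\qquad
Z(\theta) = \theta^\top \left( \int t(x)\, t(x)^\top \, \mu(x)\, \mathrm{d}x \right) \theta .
```

Under this reading, the matrix integral (the "squared family kernel") does not depend on the parameter, so it only needs to be computed once for the whole family, which is the factorisation property the abstract emphasises.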
[LG-61] Quantum advantage for learning shallow neural networks with natural data distributions
链接: https://arxiv.org/abs/2503.20879
作者: Laura Lewis,Dar Gilboa,Jarrod R. McClean
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 8 pages, 1 figure + 80-page appendix
点击查看摘要
Abstract:The application of quantum computers to machine learning tasks is an exciting potential direction to explore in search of quantum advantage. In the absence of large quantum computers to empirically evaluate performance, theoretical frameworks such as the quantum probably approximately correct (PAC) and quantum statistical query (QSQ) models have been proposed to study quantum algorithms for learning classical functions. Despite numerous works investigating quantum advantage in these models, we nevertheless only understand it at two extremes: either exponential quantum advantages for uniform input distributions or no advantage for potentially adversarial distributions. In this work, we study the gap between these two regimes by designing an efficient quantum algorithm for learning periodic neurons in the QSQ model over a broad range of non-uniform distributions, which includes Gaussian, generalized Gaussian, and logistic distributions. To our knowledge, our work is also the first result in quantum learning theory for classical functions that explicitly considers real-valued functions. Recent advances in classical learning theory prove that learning periodic neurons is hard for any classical gradient-based algorithm, giving us an exponential quantum advantage over such algorithms, which are the standard workhorses of machine learning. Moreover, in some parameter regimes, the problem remains hard for classical statistical query algorithms and even general classical algorithms learning under small amounts of noise.
[LG-62] Neuro-Informed Adaptive Learning (NIAL) Algorithm: A Hybrid Deep Learning Approach for ECG Signal Classification
链接: https://arxiv.org/abs/2503.20789
作者: Sowad Rahman
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 1 figure, 2 pages
点击查看摘要
Abstract:The detection of cardiac abnormalities using electrocardiogram (ECG) signals is crucial for early diagnosis and intervention in cardiovascular diseases. Traditional deep learning models often lack adaptability to varying signal patterns. This study introduces the Neuro-Informed Adaptive Learning (NIAL) algorithm, a hybrid approach integrating convolutional neural networks (CNNs) and transformer-based attention mechanisms to enhance ECG signal classification. The algorithm dynamically adjusts learning rates based on real-time validation performance, ensuring efficient convergence. Using the MIT-BIH Arrhythmia and PTB Diagnostic ECG datasets, our model achieves high classification accuracy, outperforming conventional approaches. These findings highlight the potential of NIAL in real-time cardiovascular monitoring applications.
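The two generic ingredients named in the abstract, a CNN-plus-attention encoder and a learning rate adapted from validation performance, can be sketched in PyTorch as follows; the layer sizes, the fake heartbeat data, and the use of ReduceLROnPlateau are assumptions of this sketch, not the NIAL algorithm itself:

```python
# CNN front-end + multi-head self-attention, with the learning rate reduced
# whenever the (here simulated) validation loss stops improving.
import torch
import torch.nn as nn

class CNNAttention(nn.Module):
    def __init__(self, n_classes=5):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
                                  nn.Conv1d(16, 32, 7, padding=3), nn.ReLU())
        self.attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
        self.head = nn.Linear(32, n_classes)
    def forward(self, x):                      # x: (batch, 1, length)
        h = self.conv(x).transpose(1, 2)       # (batch, length, 32)
        h, _ = self.attn(h, h, h)
        return self.head(h.mean(dim=1))

model = CNNAttention()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.5, patience=2)

for epoch in range(10):
    x, y = torch.randn(32, 1, 360), torch.randint(0, 5, (32,))   # fake heartbeat windows
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    val_loss = float(loss)                     # stand-in for a real validation pass
    sched.step(val_loss)                       # learning rate reacts to validation
print("current lr:", opt.param_groups[0]['lr'])
```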
[LG-63] Advanced Digital Simulation for Financial Market Dynamics: A Case of Commodity Futures
链接: https://arxiv.org/abs/2503.20787
作者: Cheng Wang,Chuwen Wang,Shirong Zeng,Changjun Jiang
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注:
点击查看摘要
Abstract:After decades of evolution, the financial system has increasingly deviated from an idealized framework based on theorems. It necessitates accurate projections of complex market dynamics and human behavioral patterns. With the development of data science and machine intelligence, researchers are trying to digitalize and automate market prediction. However, existing methodologies struggle to represent the diversity of individuals and ignore the domino effects of interactions on market dynamics, leading to poor performance under abnormal market conditions where non-quantitative information dominates the market. Alleviating these disadvantages requires introducing knowledge about how non-quantitative information, like news and policy, affects market dynamics. This study investigates overcoming these challenges by rehearsing potential market trends with financial large language model agents whose behaviors are aligned with their cognition and analyses of markets. We propose a hierarchical knowledge architecture for financial large language model agents, integrating fine-tuned language models and specialized generators optimized for trading scenarios. For financial markets, we develop an advanced interactive behavioral simulation system that enables users to configure agents and automate market simulations. In this work, we take commodity futures as an example to study the effectiveness of our methodologies. Our real-world case simulation succeeds in rehearsing abnormal market dynamics under geopolitical events and reaches an average accuracy of 3.4% in predicting futures prices at various points in time after the event. Experimental results demonstrate that our method effectively leverages diverse information to simulate behaviors and their impact on market dynamics through systematic interaction.
信息检索
[IR-0] CombiGCN: An effective GCN model for Recommender System
链接: https://arxiv.org/abs/2503.21471
作者: Loc Tan Nguyen,Tin T. Tran
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Graph Neural Networks (GNNs) have opened up a potential line of research for collaborative filtering (CF). The key power of GNNs lies in injecting collaborative signal into user and item embeddings, which will then contain information about user-item interactions. However, there are still some unsatisfactory points in a CF model that GNNs could have done better. The collaborative signal is extracted through an implicit feedback matrix that is essentially built on top of the message-passing architecture of GNNs, and it only updates an embedding based on the values of the neighboring item (or user) embeddings. By identifying the similarity weight of users through their interaction history, a key concept of CF, we endeavor to build a user-user weighted connection graph based on their similarity weight. In this study, we propose a recommendation framework, CombiGCN, in which item embeddings are only linearly propagated on the user-item interaction graph, while user embeddings are propagated simultaneously on both the user-user weighted connection graph and the user-item interaction graph with Light Graph Convolution (LGC), and then combined by a simple weighted sum of the embeddings from each layer. We also conducted experiments comparing CombiGCN with several state-of-the-art models on three real-world datasets.
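A condensed sketch of LightGCN-style propagation with a weighted sum over layers, with a second user-user adjacency folded into the user update, is given below; all matrices are tiny random stand-ins, and the normalization and layer weights are assumptions rather than the CombiGCN implementation:

```python
# Light Graph Convolution: no feature transform, no nonlinearity; final embeddings
# are a weighted sum of the embeddings from every propagation layer.
import torch

def normalize(adj):
    # Symmetric normalization with row/column degrees (handles rectangular matrices).
    d_row = adj.sum(dim=1, keepdim=True).clamp(min=1).sqrt()
    d_col = adj.sum(dim=0, keepdim=True).clamp(min=1).sqrt()
    return adj / d_row / d_col

torch.manual_seed(0)
n_users, n_items, dim, n_layers = 6, 8, 16, 3
R = (torch.rand(n_users, n_items) < 0.3).float()     # user-item interactions
S = (torch.rand(n_users, n_users) < 0.3).float()
S = (S + S.T).clamp(max=1)                            # symmetric user-user similarity

user_emb, item_emb = torch.randn(n_users, dim), torch.randn(n_items, dim)
A_ui, A_uu = normalize(R), normalize(S)

users, items = [user_emb], [item_emb]
alpha = 1.0 / (n_layers + 1)                          # uniform layer weights (assumption)
for _ in range(n_layers):
    new_user = A_ui @ items[-1] + A_uu @ users[-1]    # user update sees both graphs
    new_item = A_ui.T @ users[-1]
    users.append(new_user); items.append(new_item)

final_user = sum(alpha * u for u in users)            # weighted sum across layers
final_item = sum(alpha * i for i in items)
scores = final_user @ final_item.T                    # user-item recommendation scores
print(scores.shape)
```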
[IR-1] Improvement Graph Convolution Collaborative Filtering with Weighted addition input
链接: https://arxiv.org/abs/2503.21468
作者: Tin T. Tran,V. Snasel
类目: Information Retrieval (cs.IR)
*备注:
点击查看摘要
Abstract:Graph Neural Networks have been extensively applied in the field of machine learning to find features of graphs, and recommendation systems are no exception. The ratings of users on considered items can be represented by graphs, which are the input for many efficient models that uncover the characteristics of the users and the items. From these insights, relevant items are recommended to users. However, users' decisions on items affect different users to varying degrees, and this information should be learned so that it is not lost in the process of information mining. In this publication, we propose to build an additional graph showing the recommended weight of an item to a target user to improve the accuracy of GNN models. Although the users' friendships were not recorded, their correlation was still evident through the commonalities in consumption behavior. We build a model WiGCN (Weighted input GCN) to describe this idea and experiment on well-known datasets. Conclusions are drawn after comparing our results with state-of-the-art models such as GCMC, NGCF and LightGCN. The source code is also included at this https URL.
[IR-2] Are We Solving a Well-Defined Problem? A Task-Centric Perspective on Recommendation Tasks
链接: https://arxiv.org/abs/2503.21188
作者: Aixin Sun
类目: Information Retrieval (cs.IR)
*备注: Work in progress
点击查看摘要
Abstract:Recommender systems (RecSys) leverage user interaction history to predict and suggest relevant items, shaping user experiences across various domains. While many studies adopt a general problem definition, i.e., to recommend preferred items to users based on past interactions, such abstraction often lacks the domain-specific nuances necessary for practical deployment. However, models are frequently evaluated using datasets from online recommender platforms, which inherently reflect these specificities. In this paper, we analyze RecSys task formulations, emphasizing key components such as input-output structures, temporal dynamics, and candidate item selection. All these factors directly impact offline evaluation. We further examine the complexities of user-item interactions, including decision-making costs, multi-step engagements, and unobservable interactions, which may influence model design and loss functions. Additionally, we explore the balance between task specificity and model generalizability, highlighting how well-defined task formulations serve as the foundation for robust evaluation and effective solution development. By clarifying task definitions and their implications, this work provides a structured perspective on RecSys research. The goal is to help researchers better navigate the field, particularly in understanding specificities of the RecSys tasks and ensuring fair and meaningful evaluations.
[IR-3] Network Density Analysis of Health Seeking Behavior in Metro Manila: A Retrospective Analysis on COVID-19 Google Trends Data
链接: https://arxiv.org/abs/2503.21162
作者: Michael T. Lopez II,Cheska Elise Hung,Maria Regina Justina E. Estuar
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: Pre-print of a conference submission accepted at ICMHI 2025; 12 pages, 2 figures
点击查看摘要
Abstract:This study examined the temporal aspect of COVID-19-related health-seeking behavior in Metro Manila, National Capital Region, Philippines through a network density analysis of Google Trends data. A total of 15 keywords across five categories (English symptoms, Filipino symptoms, face wearing, quarantine, and new normal) were examined using both 15-day and 30-day rolling windows from March 2020 to March 2021. The methodology involved constructing network graphs using distance correlation coefficients at varying thresholds (0.4, 0.5, 0.6, and 0.8) and analyzing the time-series data of network density and clustering coefficients. Results revealed three key findings: (1) an inverse relationship between the threshold values and network metrics, indicating that higher thresholds provide more meaningful keyword relationships; (2) exceptionally high network connectivity during the initial pandemic months followed by gradual decline; and (3) distinct patterns in keyword relationships, transitioning from policy-focused searches to more symptom-specific queries as the pandemic temporally progressed. The 30-day window analysis showed more stable but lower search activity compared to the 15-day windows, suggesting stronger correlations in immediate search behaviors. These insights are helpful for health communication because they emphasize the need for strategic and conscientious information dissemination by the government or the private sector based on networked search behavior (e.g. prioritizing information about specific symptoms rather than an overview of what the coronavirus is).
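The graph-construction step can be sketched as follows: compute pairwise distance correlations between keyword search-volume series, keep an edge when the correlation clears a threshold, then read off density and clustering with networkx. The keyword list and random series below are placeholders, not the study's Google Trends data:

```python
# Build a keyword network from thresholded distance correlations and report
# network density and average clustering for one rolling window.
import numpy as np
import networkx as nx
import dcor                        # pip install dcor

rng = np.random.default_rng(0)
keywords = ["ubo", "lagnat", "face mask", "quarantine", "new normal"]   # placeholders
series = {k: rng.normal(size=30).cumsum() for k in keywords}            # one 30-day window

threshold = 0.6
G = nx.Graph()
G.add_nodes_from(keywords)
for i, a in enumerate(keywords):
    for b in keywords[i + 1:]:
        if dcor.distance_correlation(series[a], series[b]) >= threshold:
            G.add_edge(a, b)

print("network density:", nx.density(G))
print("average clustering:", nx.average_clustering(G))
```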
[IR-4] FAIR-QR: Enhancing Fairness-aware Information Retrieval through Query Refinement ECIR2025
链接: https://arxiv.org/abs/2503.21092
作者: Fumian Chen,Hui Fang
类目: Information Retrieval (cs.IR)
*备注: This is a preprint of our paper accepted at ECIR 2025
点击查看摘要
Abstract:Information retrieval systems such as open web search and recommendation systems are ubiquitous and significantly impact how people receive and consume online information. Previous research has shown the importance of fairness in information retrieval systems to combat the issue of echo chambers and mitigate the rich-get-richer effect. Therefore, various fairness-aware information retrieval methods have been proposed. Score-based fairness-aware information retrieval algorithms, focusing on statistical parity, are interpretable but could be mathematically infeasible and lack generalizability. In contrast, learning-to-rank-based fairness-aware information retrieval algorithms using fairness-aware loss functions demonstrate strong performance but lack interpretability. In this study, we proposed a novel and interpretable framework that recursively refines query keywords to retrieve documents from underrepresented groups and achieve group fairness. Retrieved documents using refined queries will be re-ranked to ensure relevance. Our method not only shows promising retrieval results regarding relevance and fairness but also preserves interpretability by showing refined keywords used at each iteration.
附件下载
点击下载今日全部论文列表