This post lists the latest papers retrieved from Arxiv.org on 2026-04-27. It is updated automatically and organized into six major areas: NLP, CV, ML, AI, IR, and MA.
Note: paper data is fetched from Arxiv.org daily and refreshed automatically around 12:30 each day.
Tip: if a day's listing is not updated on time, either Arxiv released no new papers that day or the update script failed; failures are fixed the same day whenever possible.
Contents
Overview (2026-04-27)
Today's update covers 423 papers in total:
- Natural Language Processing: 62 papers (Computation and Language, cs.CL)
- Artificial Intelligence: 94 papers (Artificial Intelligence, cs.AI)
- Computer Vision: 74 papers (Computer Vision and Pattern Recognition, cs.CV)
- Machine Learning: 110 papers (Machine Learning, cs.LG)
- Multi-Agent Systems: 8 papers (Multiagent Systems, cs.MA)
- Information Retrieval: 11 papers (Information Retrieval, cs.IR)
- Human-Computer Interaction: 23 papers (Human-Computer Interaction, cs.HC)
Multi-Agent Systems
[MA-0] Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems ACL2026
Quick Read: This paper tackles failure attribution in LLM-based multi-agent systems (MAS): accurately identifying the responsible agent and the decisive step behind a failure. Existing benchmarks rely on partially observable execution traces containing only agent outputs, omitting the inputs and context developers actually use when debugging, which limits the effectiveness of attribution methods. The key contribution is TraceElephant, a benchmark providing full execution traces (inputs, outputs, and context) and reproducible environments, so attribution can be studied under full observability. Experiments show that full traces improve attribution accuracy by up to 76% over a partial-observation counterpart, confirming that missing inputs obscure many failure causes and laying a foundation for more transparent, interpretable MAS development.
Link: https://arxiv.org/abs/2604.22708
Authors: Mengzhuo Chen, Junjie Wang, Fangwen Mu, Yawen Wang, Zhe Liu, Huanxiang Feng, Qing Wang
Affiliations: State Key Laboratory of Complex System Modeling and Simulation Technology; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Subjects: Multiagent Systems (cs.MA)
Comments: Accepted by ACL 2026
Abstract:Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM-based multi-agent systems (MAS) due to their natural-language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques. Yet existing benchmarks rely on partially observable traces that capture only agent outputs, omitting the inputs and context that developers actually use when debugging. We argue that failure attribution should be studied under full execution observability, aligning with real-world developer-facing scenarios where complete traces, rather than only outputs, are accessible for diagnosis. To this end, we introduce TraceElephant, a benchmark designed for failure attribution with full execution traces and reproducible environments. We then systematically evaluate failure attribution techniques across various configurations. Specifically, full traces improve attribution accuracy by up to 76% over a partial-observation counterpart, confirming that missing inputs obscure many failure causes. TraceElephant provides a foundation for follow-up failure attribution research, promoting evaluation practices that reflect real-world debugging and supporting the development of more transparent MASs.
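The benchmark's core distinction, full versus partial observability of traces, can be illustrated with a minimal sketch (the `Step` fields, the toy corruption check, and the trace itself are invented for illustration and are not the TraceElephant schema):

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of a MAS execution trace (illustrative fields only)."""
    agent: str
    input: str     # omitted in partially observable traces
    context: str   # omitted in partially observable traces
    output: str

def attribute_failure(trace, is_corrupted):
    """Return (agent, step index) of the first step whose input or
    context already carried the corrupted fact -- a check that is
    only possible with full observability."""
    for i, step in enumerate(trace):
        if is_corrupted(step.input) or is_corrupted(step.context):
            return step.agent, i
    return None

# Toy trace: the planner passes an invalid date downstream.
trace = [
    Step("planner", "book a flight for 2026-13-01", "", "delegating to booker"),
    Step("booker", "delegating to booker", "date=2026-13-01", "ERROR: no flights"),
]
blame = attribute_failure(trace, lambda s: "2026-13-01" in s)
```

Stripped to outputs only, the first visible symptom is the booker's error message, so an output-only attributor would tend to blame the wrong agent; with inputs and context available, the check surfaces the planner as the origin.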
[MA-1] AgentSearchBench: A Benchmark for AI Agent Search in the Wild
Quick Read: This paper tackles AI agent search in the wild: efficiently identifying and ranking agents suited to a given task in complex, dynamic, unstructured settings. Conventional approaches match agents semantically against textual descriptions, but agent capabilities are compositional and execution-dependent, producing a marked gap between retrieval scores and actual performance. The key contribution is AgentSearchBench, a large-scale benchmark built from real-world agents that formalizes agent search as retrieval and reranking under both executable task queries and high-level task descriptions, using execution-grounded performance signals as ground truth for relevance. Experiments show that semantic similarity alone is a poor predictor of agent performance, while lightweight behavioral signals such as execution-aware probing substantially improve ranking quality, underscoring the importance of incorporating execution feedback into agent discovery.
Link: https://arxiv.org/abs/2604.22436
Authors: Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz
Affiliations: University College London
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:
Abstract:The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at this https URL.
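The reranking idea, blending description similarity with an execution-grounded signal, can be sketched as follows (the convex-combination form and the weight `alpha` are illustrative assumptions, not the benchmark's scoring rule):

```python
def rerank(candidates, query_sim, exec_success, alpha=0.5):
    """Rank agents by a mix of description similarity and an
    execution-grounded success signal; alpha is a free parameter
    here, not a value from the paper."""
    score = lambda a: alpha * query_sim[a] + (1 - alpha) * exec_success[a]
    return sorted(candidates, key=score, reverse=True)

candidates = ["agent_a", "agent_b"]
query_sim = {"agent_a": 0.9, "agent_b": 0.6}      # similarity to the task query
exec_success = {"agent_a": 0.1, "agent_b": 0.8}   # probe pass rate
ranking = rerank(candidates, query_sim, exec_success)
```

With these toy numbers, agent_a wins on description similarity but agent_b wins once execution evidence is mixed in, mirroring the paper's observed gap between semantic similarity and actual performance.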
[MA-2] Fast Neural-Network Approximation of Active Target Search Under Uncertainty
Quick Read: This paper addresses the problem of searching for an unknown number of stationary targets at unknown positions with a mobile agent under measurement uncertainty. The key idea is to estimate the expected number of targets with a Probability Hypothesis Density (PHD) filter and to approximate the decisions of Active Search (AS) or its Intermittent variant (ASI) with a convolutional neural network (CNN), avoiding the high cost of online optimization. Training data consist of multi-channel grids encoding target beliefs, the agent position, visitation history, and boundary information. In simulation, the network matches the detection rates of AS/ASI while reducing computation by orders of magnitude.
Link: https://arxiv.org/abs/2604.22254
Authors: Bilal Yousuf, Zsofia Lendek, Lucian Busoniu
Affiliations: Technical University of Cluj-Napoca
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:
Abstract:We address the problem of searching for an unknown number of stationary targets at unknown positions with a mobile agent. A probability hypothesis density filter is used to estimate the expected number of targets under measurement uncertainty. Existing planners, such as Active Search (AS) and its Intermittent variant (ASI), achieve accurate detection but require costly online optimization. To reduce online computation, we propose to use a convolutional neural network to approximate AS or ASI decisions through direct inference. The network is trained on AS/ASI data using a multi-channel grid that encodes target beliefs, the agent position, visitation history, and boundary information. Simulations with uniform and clustered target distributions show that the network achieves detection rates comparable to AS or ASI while reducing computation by orders of magnitude.
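The multi-channel grid input described above can be sketched with plain nested lists (the channel order and the four-channel layout here are assumptions for illustration; the abstract lists the encoded quantities but not the exact format):

```python
def encode_state(size, belief, agent, visited):
    """Build a 4-channel grid: target belief, agent position,
    visitation history, boundary mask (layout is an assumption)."""
    def grid(fill):
        return [[fill(r, c) for c in range(size)] for r in range(size)]
    return [
        grid(lambda r, c: belief.get((r, c), 0.0)),            # PHD target belief
        grid(lambda r, c: 1.0 if (r, c) == agent else 0.0),    # agent position
        grid(lambda r, c: 1.0 if (r, c) in visited else 0.0),  # visitation history
        grid(lambda r, c: 1.0 if r in (0, size - 1) or c in (0, size - 1) else 0.0),  # boundary
    ]

channels = encode_state(4, belief={(1, 2): 0.7}, agent=(0, 0), visited={(0, 0)})
```

A CNN trained on AS/ASI decisions would consume such a stacked grid and emit the next waypoint by direct inference, which is where the orders-of-magnitude speedup over online optimization comes from.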
[MA-3] V-STC: A Time-Efficient Multi-Vehicle Coordinated Trajectory Planning Approach
Quick Read: This paper addresses safe, time- and space-efficient coordinated motion planning for multiple autonomous vehicles (AVs). The core challenge is a coordination mechanism that adjusts both time steps and spatial paths, reducing overall temporal occupancy while guaranteeing collision-free separation in the spatio-temporal domain. The key contribution is the Variable-Time-Step Spatio-Temporal Corridor (V-STC), which treats both the spatial configuration of each AV's corridor cubes and their time durations as optimization variables, improving temporal efficiency without sacrificing safety. A dynamically feasible trajectory is then planned independently for each vehicle; simulations show the method is more time-efficient than conventional STC approaches while remaining safe.
Link: https://arxiv.org/abs/2604.22196
Authors: Pengfei Liu, Jialing Zhou, Yuezu Lv, Guanghui Wen, Tingwen Huang
Affiliations: Beijing Institute of Technology; Southeast University; Shenzhen University of Advanced Technology
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA)
Comments: 12 pages, 23 figures
Abstract:Coordinating the motions of multiple autonomous vehicles (AVs) requires planning frameworks that ensure safety while making efficient use of space and time. This paper presents a new approach, termed variable-time-step spatio-temporal corridor (V-STC), that enhances the temporal efficiency of multi-vehicle coordination. An optimization model is formulated to construct a V-STC for each AV, in which both the spatial configuration of the corridor cubes and their time durations are treated as decision variables. By allowing the corridor’s spatial position and time step to vary, the constructed V-STC reduces the overall temporal occupancy of each AV while maintaining collision-free separation in the spatio-temporal domain. Based on the generated V-STC, a dynamically feasible trajectory is then planned independently for each AV. Simulation studies demonstrate that the proposed method achieves safe multi-vehicle coordination and yields more time-efficient motion compared with existing STC approaches.
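The central safety condition, collision-free separation in the spatio-temporal domain with variable time durations, reduces to a disjointness test over space-time boxes. A minimal 2D-plus-time sketch (simplified geometry, not the paper's optimization model):

```python
from dataclasses import dataclass

@dataclass
class Cube:
    """An axis-aligned spatio-temporal corridor cube; in V-STC both the
    spatial box and the time interval [t0, t1) are decision variables."""
    x0: float; x1: float
    y0: float; y1: float
    t0: float; t1: float

def separated(a, b):
    """Two cubes are collision-free if they are disjoint in space
    OR disjoint in time."""
    spatial_overlap = a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1
    temporal_overlap = a.t0 < b.t1 and b.t0 < a.t1
    return not (spatial_overlap and temporal_overlap)

# Same spatial cell, but time windows separated by a variable step:
a = Cube(0, 1, 0, 1, t0=0.0, t1=1.5)
b = Cube(0, 1, 0, 1, t0=1.5, t1=3.0)
```

Because the time interval of each cube is itself a decision variable, two AVs can reuse the same spatial cell as long as their time windows do not overlap, which is what lets V-STC trade time steps for tighter spatial packing.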
[MA-4] DM³-Nav: Decentralized Multi-Agent Multimodal Multi-Object Semantic Navigation
Quick Read: This paper addresses the single point of failure introduced by centralized architectures in multi-agent semantic navigation, while supporting open-vocabulary goal specification and multi-object missions. The key contribution is DM³-Nav, a fully decentralized system with no central coordinator, global map aggregation, or shared global state: robots exchange local maps, goal status, and navigation intent through ad-hoc pairwise communication, and an implicit task allocation mechanism combining intent broadcasting with distance-weighted frontier selection reduces redundant exploration while preserving decentralization. Experiments on HM3DSem scenes show performance matching or exceeding centralized baselines, and a real-world deployment with two mobile robots in an office environment validates operation on onboard sensing and computation alone.
Link: https://arxiv.org/abs/2604.22014
Authors: Amin Kashiri (1), Atharva Jamsandekar (1), Yasin Yazıcıoğlu (1) ((1) Northeastern University, Boston, USA)
Affiliations: Northeastern University
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Comments:
Abstract:We present DM³-Nav, a fully decentralized multi-agent semantic navigation system supporting multimodal open-vocabulary goal specification and multi-object missions. In our setting, decentralization implies operation without a central coordinator, global map aggregation, or shared global state at runtime. Robots operate autonomously and coordinate through ad-hoc pairwise communication, exchanging local maps, goal status, and navigation intent without synchronization. An implicit task allocation mechanism combining intent broadcasting and distance-weighted frontier selection reduces redundant exploration while preserving decentralized operation. Evaluations on HM3DSem scenes using the HM3Dv0.2 and GOAT-Bench datasets demonstrate that DM³-Nav matches or exceeds centralized and shared-map baselines while eliminating single points of failure inherent in centralized architectures. Finally, we validate our approach in a real-world office environment using two mobile robots, demonstrating successful deployment relying entirely on onboard sensing and computation. A video of our real-world experiments is available online: this https URL
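The implicit task allocation mechanism, distance-weighted frontier selection with intent broadcasting, can be sketched in a few lines (the Manhattan metric and the penalty weight are illustrative choices, not values from the paper):

```python
def pick_frontier(frontiers, my_pos, peer_intents):
    """Prefer nearby frontiers, but penalize frontiers that another
    robot has already broadcast intent for (penalty weight is an
    illustrative assumption)."""
    def cost(f):
        d = abs(f[0] - my_pos[0]) + abs(f[1] - my_pos[1])  # Manhattan distance
        penalty = 10.0 if f in peer_intents else 0.0
        return d + penalty
    return min(frontiers, key=cost)

frontiers = [(1, 0), (5, 5)]
# A peer has claimed the nearby frontier, so this robot yields and
# explores the farther one instead of duplicating work.
choice = pick_frontier(frontiers, my_pos=(0, 0), peer_intents={(1, 0)})
```

No coordinator is needed: each robot runs the same rule locally over whatever intents it has heard via pairwise communication, which is what keeps the allocation implicit and decentralized.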
[MA-5] MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening and Optimization
Quick Read: This paper addresses the instability of current AI agents on complex drug molecule screening and optimization pipelines, where robust performance must be maintained across multi-step workflows orchestrating dozens of specialized tools. The key contribution is MolClaw, an autonomous agent for drug molecule evaluation, screening, and optimization whose core innovation is a three-tier hierarchical skill architecture (70 skills in total) unifying over 30 domain resources: tool-level skills standardize atomic operations, workflow-level skills compose validated pipelines with quality checks and reflection, and a discipline-level skill supplies cross-scenario scientific principles guiding planning and verification. This structure enables long-term interaction at runtime and markedly improves capability on high-complexity tasks. Experiments show state-of-the-art performance on the MolBench benchmark, and ablations indicate that the gains concentrate on tasks requiring structured workflows, identifying workflow orchestration as the primary bottleneck for AI-driven drug discovery.
Link: https://arxiv.org/abs/2604.21937
Authors: Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie
Affiliations: Peking University Health Science Center; Shanghai AI Laboratory; Academy for Advanced Interdisciplinary Studies, Peking University
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 59 pages, 10 figures. Code and data will be released
Abstract:Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.
[MA-6] An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing
Quick Read: This paper addresses the tension between flexible pipeline configuration and reproducibility as medical imaging research moves from controlled benchmarks to real-world clinical deployment, where datasets are heterogeneous and analytical goals evolve. The core challenge is adapting workflows to dataset characteristics and analysis needs while keeping computation deterministic and traceable. The key contribution is an artifact-based agent framework that formalizes intermediate and final outputs through structured artifact contracts and assembles goal-conditioned configurations from a modular rule library. Execution is delegated to a workflow executor to guarantee deterministic computational-graph construction and provenance tracking, while local agent operation satisfies most privacy constraints, enabling flexible yet reproducible image processing in heterogeneous clinical environments.
Link: https://arxiv.org/abs/2604.21936
Authors: Lianrui Zuo, Yihao Liu, Gaurav Rudravaram, Karthik Ramadass, Aravind R. Krishnan, Michael D. Phillips, Yelena G. Bodien, Mayur B. Patel, Paula Trujillo, Yency Forero Martinez, Stephen A. Deppen, Eric L. Grogan, Fabien Maldonado, Kevin McGann, Hudson M. Holmes, Laurie E. Cutting, Yuankai Huo, Bennett A. Landman
Affiliations: Vanderbilt University; Vanderbilt University Medical Center; University of Alabama at Birmingham; University of California, San Francisco; Universidad de los Andes
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments:
Abstract:Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and provenance tracking. Two requirements therefore become central: adaptability, the ability to configure workflows according to dataset-specific conditions and evolving analytical goals; and reproducibility, the guarantee that all transformations and decisions are explicitly recorded and re-executable. Here, we present an artifact-based agent framework that introduces a semantic layer to augment medical image processing. The framework formalizes intermediate and final outputs through an artifact contract, enabling structured interrogation of workflow state and goal-conditioned assembly of configurations from a modular rule library. Execution is delegated to a workflow executor to preserve deterministic computational graph construction and provenance tracking, while the agent operates locally to comply with most privacy constraints. We evaluate the framework on real-world clinical CT and MRI cohorts, demonstrating adaptive configuration synthesis, deterministic reproducibility across repeated executions, and artifact-grounded semantic querying. These results show that adaptive workflow configuration can be achieved without compromising reproducibility in heterogeneous clinical environments.
[MA-7] A four-player potential game for barren-plateau-aware quantum ansatz design
Quick Read: This paper addresses the multi-objective optimization problem of parameterized quantum circuit design: balancing trainability, non-stabilizerness, task performance, and hardware cost. The key idea is to model circuit design as a four-player potential game in which each player represents one optimization axis and acts on the circuit DAG through restricted move sets (append, remove, retype, rewire); a block-coordinate ε-Nash residual certifies equilibrium, i.e., that no player can improve unilaterally. Experiments show the framework finds high-potential solutions across several hardware topologies, and on LiH molecular simulation it complements energy-only searches such as ADAPT-VQE by jointly improving multiple key metrics that those methods do not optimize.
Link: https://arxiv.org/abs/2604.21955
Authors: Rubén Darío Guerrero
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Multiagent Systems (cs.MA)
Comments: 8 pages, 4 figures
Abstract:We cast the design of parameterized quantum circuits as a four-player potential game whose state is a circuit directed acyclic graph (DAG) and whose players encode trainability, non-stabilizerness, task performance, and hardware cost. Per-player restricted action sets factorize the move space into append, remove, retype, and rewire operations; a block-coordinate ε-Nash residual δ_Nash certifies that no single player can improve unilaterally. A single weight sweep on MaxCut K_4 traces a Pareto frontier from a Clifford endpoint (M_2/n, ⟨H⟩) = (0, 4.00) to a non-Clifford endpoint (0.48, 3.30). On three four-qubit hardware topologies (heavy-hex, 2×2 grid, Rydberg all-to-all), Nash search achieves the highest mean potential; on the 2×2 grid Nash reaches the theoretical ceiling Φ_max = 4.10 on two of five seeds while the simulated-annealing baseline does so on one; paired Wilcoxon tests over five seeds cannot reject the null on any single topology (p ≥ 0.22). On LiH/STO-3G, seeding Nash from a 58-gate Givens-doubles ansatz produces a 48-operation, depth-25 circuit retaining 97.7% of the correlation energy while simultaneously reducing gate count, increasing non-stabilizerness, and controlling trainability. The framework is complementary to energy-only searches such as ADAPT-VQE and k-UpCCGSD, which reach chemical accuracy with fewer operations but do not optimize the other three axes.
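The ε-Nash certificate described above is easy to state concretely: compute the best unilateral potential improvement available to any single player over its restricted move set. A toy sketch on a 2-dimensional state (the paper's states are circuit DAGs and its moves are append/remove/retype/rewire; this only illustrates the certificate's logic):

```python
def nash_residual(state, players, potential):
    """Block-coordinate Nash residual: the largest unilateral potential
    gain any single player can obtain from its restricted move set.
    delta_Nash <= eps certifies an eps-Nash point of the potential game."""
    base = potential(state)
    delta = 0.0
    for moves in players:            # each player's restricted action set
        for move in moves:
            delta = max(delta, potential(move(state)) - base)
    return delta

# Two players, each tweaking one coordinate of a 2-vector state;
# the potential is maximized at (1, 1).
potential = lambda s: -(s[0] - 1) ** 2 - (s[1] - 1) ** 2
p0 = [lambda s: (s[0] + 1, s[1]), lambda s: (s[0] - 1, s[1])]
p1 = [lambda s: (s[0], s[1] + 1), lambda s: (s[0], s[1] - 1)]
residual_at_opt = nash_residual((1, 1), [p0, p1], potential)  # no player can improve
residual_off = nash_residual((0, 1), [p0, p1], potential)     # player 0 can still gain
```

At the optimum the residual is zero (an exact Nash point of the potential); off the optimum it reports exactly how much a single player could gain, which is what the search drives toward ε.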
Natural Language Processing
[NLP-0] Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Quick Read: This paper examines representational biases in how large language models (LLMs) portray national-origin identities in generated text, in particular how they systematically reinforce stereotypes, erasure, and one-dimensional portrayals of Global Majority groups. In response to open-ended narrative generation prompts, widely adopted LLMs exhibit persistent nationality-based bias: minoritized national identities are markedly underrepresented in power-neutral stories yet overrepresented in subordinated roles, which appear over fifty times more often than dominant portrayals, and the harm is amplified when prompts contain US nationality cues. The key recommendation is to pursue methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-centric LLMs, so that cultural harms in generative AI can be better identified and mitigated.
Link: https://arxiv.org/abs/2604.22749
Authors: Ilana Nguyen, Harini Suresh, Thema Monroe-White, Evan Shieh
Affiliations: Brown University; George Mason University; Young Data Scientists League
Subjects: Computation and Language (cs.CL)
Comments: FAccT '26, June 25-28, 2026, Montreal, QC, Canada
Abstract:Large language models (LLMs) are increasingly used for text generation tasks from everyday use to high-stakes enterprise and government applications, including simulated interviews with asylum seekers. While many works highlight the new potential applications of LLMs, there are risks of LLMs encoding and perpetuating harmful biases about non-dominant communities across the globe. To better evaluate and mitigate such harms, more research examining how LLMs portray diverse individuals is needed. In this work, we study how national origin identities are portrayed by widely-adopted LLMs in response to open-ended narrative generation prompts. Our findings demonstrate the presence of persistent representational harms by national origin, including harmful stereotypes, erasure, and one-dimensional portrayals of Global Majority identities. Minoritized national identities are simultaneously underrepresented in power-neutral stories and overrepresented in subordinated character portrayals, which are over fifty times more likely to appear than dominant portrayals. The degree of harm is amplified when US nationality cues (e.g., “American”) are present in input prompts. Notably, we find that the harms we identify cannot be explained away via sycophancy, as US-centric biases persist even when replacing US nationality cues with non-US national identities in the prompts. Based on our findings, we call for further exploration of cultural harms in LLMs through methodologies that center Global Majority perspectives and challenge the uncritical adoption of US-based LLMs for the classification, surveillance, and misrepresentation of the majority of our planet.
[NLP-1] Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Quick Read: This paper asks whether neural models trained solely on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. The approach analyzes encoder embeddings of noun and verb lemmas from BantuMorph v7, a transformer-based model, across 14 Eastern and Southern Bantu languages, identifies cognate candidates shared by five or more languages, and validates them against established historical resources (BLR3 and ASJP). The key result: 10 of the top 11 high-confidence noun candidates (90.9%), such as *-ntU 'person' and *gombe 'cow', align with reconstructed Proto-Bantu forms, and several verb cognate patterns (e.g., *-bon- 'see') likewise match Proto-Bantu roots; cross-model validation with an independent translation model (NLLB-600M) confirms the cognate clusters and phylogenetic groupings (p < 0.01). This indicates that models driven by modern morphological data can infer deep historical lexical structure from contemporary languages.
Link: https://arxiv.org/abs/2604.22730
Authors: Hillary Mutisya, John Mugane
Affiliations: Harvard University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We investigate whether neural models trained exclusively on modern morphological data can recover cross-lingual lexical structure consistent with historical reconstruction. Using BantuMorph v7, a transformer over Bantu morphological paradigms, we analyze 14 Eastern and Southern Bantu languages, extract encoder embeddings for their noun and verb lemmas, and identify 728 noun and 1,525 verb cognate candidates shared across 5+ languages. Evaluating these candidates against established historical resources, the Bantu Lexical Reconstructions database (BLR3; 4,786 reconstructed Proto-Bantu forms) and the ASJP basic vocabulary, we confirm 10 of the top 11 noun candidates (90.9%) align with previously reconstructed Proto-Bantu forms, including *-ntU ‘person’ (8 languages), *gombe ‘cow’ (9 languages), and *mUn (9 languages). Extending to verbs, 12 verb cognates align with reconstructed Proto-Bantu roots, including *-bon- ‘see’ and *-jIm- ‘stand’, each attested across wide geographic ranges. Cross-model validation using an independent translation model (NLLB-600M) confirms these patterns: both models recover cognate clusters and phylogenetic groupings consistent with established Guthrie-zone classifications (p < 0.01). Cross-lingual noun class analysis reveals that all 13 productive classes maintain > 0.83 cosine similarity across languages (within-class > between-class, p < 10^-9). Our dataset is restricted to Eastern and Southern Bantu, so we interpret these results as recovering shared Bantu lexical structure consistent with Proto-Bantu rather than definitively distinguishing Proto-Bantu retentions from later regional innovations.
[NLP-2] Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Quick Read: This paper addresses automatic discovery of morphological features in low-resource Bantu languages, where labeled data is scarce. The key idea combines cross-lingual transfer learning with unsupervised clustering via weighted voting, exploiting complementary strengths: transfer leverages the roughly 60% vocabulary overlap between Swahili and the target language (here Giriama) for efficient cognate detection, while clustering uncovers language-specific morphological patterns (such as an a- prefix variant and a contracted k'- prefix) that transfer alone cannot capture. With only 91 labeled paradigms, the method infers noun class assignments for 2,455 words and identifies two previously undocumented morphological regularities, demonstrating its effectiveness under low-resource conditions.
Link: https://arxiv.org/abs/2604.22723
Authors: Hillary Mutisya, John Mugane
Affiliations: Harvard University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (vowel coalescence - the merger of two adjacent vowels - of wa-, 95.1% consistency) and a contracted k’- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble of transfer learning from Swahili and unsupervised clustering, combined via weighted voting, exploits complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap) while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
[NLP-3] Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Quick Read: This paper addresses the high inference cost of long explicit chains-of-thought (CoT) in complex reasoning tasks. The key contribution is Abstract Chain-of-Thought (Abstract-CoT): the language model emits a short sequence of tokens from a reserved abstract vocabulary in a discrete latent space before generating the final answer, replacing a natural-language CoT. The method combines a two-stage warm-up, first mapping abstract tokens onto real reasoning traces via a masking bottleneck and supervised fine-tuning, then self-distilling with constrained decoding, and finally optimizes abstract-sequence generation with warm-started reinforcement learning. It uses up to 11.6× fewer reasoning tokens while matching performance on mathematical reasoning, instruction following, and multi-hop reasoning, and generalizes across model families.
Link: https://arxiv.org/abs/2604.22709
Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Affiliations: IBM Research AI
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:While long, explicit chains-of-thought (CoT) have proven effective on complex reasoning tasks, they are costly to generate during inference. Non-verbal reasoning methods have emerged with shorter generation lengths by leveraging continuous representations, yet their performance lags behind verbalized CoT. We propose Abstract Chain-of-Thought, a discrete latent reasoning post-training mechanism in which the language model produces a short sequence of tokens from a reserved vocabulary in lieu of a natural language CoT, before generating a response. To make previously unseen “abstract” tokens useful, we introduce a policy iteration-style warm-up loop that alternates between (i) bottlenecking from a verbal CoT via masking and performing supervised fine-tuning, and (ii) self-distillation by training the model to generate abstract tokens from the prompt alone via constrained decoding with the codebook. After warm-up, we optimize the generation of abstract sequences with warm-started reinforcement learning under constrained decoding. Abstract-CoT achieves up to 11.6× fewer reasoning tokens while demonstrating comparable performance across mathematical reasoning, instruction-following, and multi-hop reasoning, and generalizes across language model families. We also find an emergent power law distribution over the abstract vocabulary, akin to those seen in natural language, that evolves across the training phases. Our findings highlight the potential for post-training latent reasoning mechanisms that enable efficient inference through a learned abstract reasoning language.
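The constrained-decoding step, restricting generation to the reserved abstract vocabulary, amounts to masking the next-token distribution to the codebook and renormalizing. A toy sketch (scalar logits stand in for the LM head's output; the token names are invented):

```python
import math

def constrained_softmax(logits, codebook):
    """Renormalize next-token probabilities over the reserved abstract
    vocabulary only, masking out all other token ids."""
    masked = {t: logits[t] for t in codebook}
    z = sum(math.exp(v) for v in masked.values())
    return {t: math.exp(v) / z for t, v in masked.items()}

logits = {"the": 5.0, "cat": 4.0, "<abs_0>": 1.0, "<abs_1>": 2.0}
codebook = {"<abs_0>", "<abs_1>"}
probs = constrained_softmax(logits, codebook)
```

Even though the natural-language tokens carry the largest raw logits, the mask forces all probability mass onto the codebook, which is how the warm-up loop elicits abstract tokens from a model that has never preferred them.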
[NLP-4] CRAFT: Clustered Regression for Adaptive Filtering of Training data
Quick Read: This paper addresses the cost of fine-tuning on massive corpora, where training on all data is expensive and often unnecessary, by selecting a small, high-quality subset for sequence-to-sequence training. The key contribution is CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method that decomposes the joint source-target distribution and selects in two stages: first, proportional budget allocation across k-means clusters matches the validation source distribution; second, within each source cluster, training pairs are chosen to minimize a conditional expected distance derived from the validation target distribution. Theory shows that proportional cluster allocation bounds the continuous KL divergence between the selected and validation distributions, with the residual controlled by cluster diameters, guaranteeing selection quality while greatly accelerating training-data filtering.
Link: https://arxiv.org/abs/2604.22693
Authors: Parthasarathi Panda, Asheswari Swain, Subhrakanta Panda
Affiliations: Google; BITS Pilani
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Selecting a small, high-quality subset from a large corpus for fine-tuning is increasingly important as corpora grow to tens of millions of datapoints, making full fine-tuning expensive and often unnecessary. We propose CRAFT (Clustered Regression for Adaptive Filtering of Training data), a vectorization-agnostic selection method for training sequence-to-sequence models. CRAFT decomposes the joint source-target distribution and performs a two-stage selection: (i) match the validation source distribution through proportional budget allocation across k-means clusters, and (ii) within each source cluster, select training pairs whose target embeddings minimize a conditional expected distance derived from the validation target distribution. We prove that proportional cluster allocation bounds the continuous KL divergence between selected and validation distributions, with the residual controlled by cluster diameters. We evaluate CRAFT on English-Hindi translation by selecting training data from 33 million NLLB sentence pairs and fine-tuning mBART via LoRA. CRAFT achieves 43.34 BLEU, outperforming TSDS (41.21) by 2.13 points on the same candidate pool and encoder while completing selection over 40 times faster. With TF-IDF vectorization, the entire pipeline completes in under one minute on CPU. TAROT achieves 45.61 BLEU, but CRAFT completes selection in 26.86 seconds versus TAROT’s 75.6 seconds, a 2.8× speedup.
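Stage (i), proportional budget allocation across k-means clusters, can be sketched directly (the largest-remainder rounding rule is an assumption; the abstract does not specify how fractional allocations are rounded):

```python
def allocate_budget(cluster_sizes, budget):
    """Give cluster k a share of the selection budget proportional to
    its validation mass n_k / N, rounding by largest remainder."""
    total = sum(cluster_sizes)
    raw = [budget * n / total for n in cluster_sizes]
    alloc = [int(r) for r in raw]
    # hand leftover slots to the largest fractional remainders
    leftovers = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in leftovers[: budget - sum(alloc)]:
        alloc[i] += 1
    return alloc

alloc = allocate_budget([50, 30, 20], budget=10)  # -> [5, 3, 2]
```

Matching cluster mass this way is exactly what the paper's bound exploits: the selected source distribution tracks the validation distribution at cluster resolution, leaving only a within-cluster residual controlled by cluster diameters.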
[NLP-5] BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering
Quick Read: This paper addresses two weaknesses of concatenation-based retrieval-augmented generation (RAG): the "lost-in-the-middle" effect and the difficulty of attributing answers to individual documents, along with the quadratic cost of long contexts, which is especially severe in visual question answering. The key contribution is Bayesian Ensemble Retrieval-Augmented Generation (BERAG), which conditions on each retrieved document separately, treats document posterior probabilities as ensemble weights, and updates them token by token with Bayes' rule during generation. This enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contributions, substantially improving reasoning over long, imperfect retrieval lists and mitigating the lost-in-the-middle effect.
Link: https://arxiv.org/abs/2604.22678
Authors: Jinghong Chen, Jingbiao Mei, Guangyu Yang, Bill Byrne
Affiliations: University of Cambridge
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:A common approach to question answering with retrieval-augmented generation (RAG) is to concatenate documents into a single context and pass it to a language model to generate an answer. While simple, this strategy can obscure the contribution of individual documents, making attribution difficult and contributing to the “lost-in-the-middle” effect, where relevant information in long contexts is overlooked. Concatenation also scales poorly: computational cost grows quadratically with context length, a problem that becomes especially severe when the context includes visual data, as in visual question answering. Attempts to mitigate these issues by limiting context length can further restrict performance by preventing models from benefiting from the improved recall offered by deeper retrieval. We propose Bayesian Ensemble Retrieval-Augmented Generation (BERAG), along with Bayesian Ensemble Fine-Tuning (BEFT), as a RAG framework in which language models are conditioned on individual retrieved documents rather than a single combined context. BERAG treats document posterior probabilities as ensemble weights and updates them token by token using Bayes' rule during generation. This approach enables probabilistic re-ranking, parallel memory usage, and clear attribution of document contribution, making it well-suited for large document collections. We evaluate BERAG and BEFT primarily on knowledge-based visual question answering tasks, where models must reason over long, imperfect retrieval lists. The results show substantial improvements over standard RAG, including strong gains on Document Visual Question Answering and multimodal needle-in-a-haystack benchmarks. We also demonstrate that BERAG mitigates the “lost-in-the-middle” effect. The document posterior can be used to detect insufficient grounding and trigger deflection, while document pruning enables faster decoding than standard RAG.
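The token-by-token posterior update at the heart of BERAG is a direct application of Bayes' rule. A minimal sketch with two documents (scalar likelihoods stand in for per-document LM token probabilities; the document names are invented):

```python
def update_posterior(prior, token_likelihoods):
    """One update step: after emitting a token y_t, re-weight each
    document d by p(y_t | d, y_<t) and renormalize (Bayes' rule)."""
    unnorm = {d: prior[d] * token_likelihoods[d] for d in prior}
    z = sum(unnorm.values())
    return {d: w / z for d, w in unnorm.items()}

# Uniform prior over two retrieved documents; doc_b explains the
# emitted token much better, so its ensemble weight grows.
posterior = update_posterior(
    {"doc_a": 0.5, "doc_b": 0.5},
    {"doc_a": 0.01, "doc_b": 0.09},
)
```

Iterating this update across the generated tokens concentrates weight on the documents that best explain the answer, which is what makes the posterior usable both for re-ranking and for attributing the answer to specific documents.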
[NLP-6] Identifying and typifying demographic unfairness in phoneme-level embeddings of self-supervised speech recognition models
Quick Read: This paper addresses fairness across speaker groups (SGs) in automatic speech recognition (ASR), where models perform markedly better for some SGs than others. The central finding is that speech encoders make two kinds of errors when modeling phonemes, random error (high-variance embeddings) and systematic error (embedding bias), and both can produce SG-level unfairness. The key contribution is a framework typifying these error types, validated with phoneme classification probes: training a probe only on a disadvantaged SG can improve that SG's performance, evidence of SG-level embedding bias, while high phoneme-embedding variance correlates strongly with low prediction accuracy, showing that random error is also widespread. The authors conclude that although both errors matter, random error is likely the greater obstacle to fair ASR.
Link: https://arxiv.org/abs/2604.22631
Authors: Felix Herron, Solange Rossato, Alexandre Allauzen, François Portet
Affiliations: MILES Team, LAMSADE, Université Paris Dauphine-PSL, France; GETALP Team, LIG, Université Grenoble Alpes, France
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Modern automatic speech recognition (ASR) systems have been observed to function better for certain speaker groups (SGs) than others, despite recent gains in overall performance. One potential impediment to progress towards fairer ASR is a more nuanced understanding of the types of modeling errors that speech encoder models make, and in particular the difference between the structure of embeddings for high-performance and low-performance SGs. This paper proposes a framework typifying two types of error that can occur in modeling phonemes in ASR systems: random error/high variance in phoneme embedding, vs systematic error/embedding bias. We find that training phoneme classification probes only on a single, typically disadvantaged SG, sometimes improves performance for that SG, which is evidence for the existence of SG-level bias in phoneme embeddings. On the other hand, we find that speakers and SGs with higher levels of phoneme variance are the same as those with worse phoneme prediction accuracy. We conclude that both types of error are present in phoneme embeddings and both are candidate causes for SG-level unfairness in ASR, though random error is likely a greater hindrance to fairness than systematic error. Furthermore, we find that finetuning encoder models using a fairness-enhancing algorithm (domain enhancing and adversarial training) changes neither the benefits of in-domain phoneme classification probe training, nor measured levels of random embedding error.
[NLP-7] From graphemic dependence to lexical structure: a Markovian perspective on Dante's Commedia
【速读】: 该论文旨在揭示但丁《神曲》(Divina Commedia)文本结构中局部依赖关系与宏观组织之间的关联机制,解决的问题是如何通过符号化文本表示和概率建模来捕捉从字形层面(graphemic level)到词汇层面(lexical level)的系统性互动。其解决方案的关键在于构建一个基于元音-辅音(vowel-consonant, V/C)编码的四状态马尔可夫链(Markov chain),从中提取一个简洁的图稿记忆指数(graphemic memory index),用于量化文本中持久性与交替模式的平衡;进一步结合三元组(trigram)分析识别出有限的、高频出现的图稿配置作为“图稿探针”(graphemic probes),这些探针能将局部马尔可夫结构映射至可识别的词汇环境,并揭示它们在词内与词间边界上的不同行为特征,从而建立局部依赖结构与整体篇章组织之间的可解释联系。
链接: https://arxiv.org/abs/2604.22626
作者: Angelo Maria Sabatini
机构: The BioRobotics Institute (生物机器人研究所); Scuola Superiore Sant’Anna (圣安娜高等学院)
类目: Computation and Language (cs.CL)
备注: 25 pages, 8 figures, 1 supplementary material; submitted to Digital Scholarship in the Humanities
Abstract:This study investigates the structural organisation of Dante’s Divina Commedia through a symbolic representation based on vowel-consonant (V/C) encoding. Modelling the resulting sequence as a four-state Markov chain yields a parsimonious index of graphemic memory, capturing the balance between persistence and alternation patterns. Across the poem, this index exhibits a slight but consistent increase from the Inferno to the Paradiso, indicating a directional shift in local dependency structure. Trigram-level analysis shows that this trend is driven by a restricted set of recurrent configurations, interpreted as graphemic probes linking the Markov representation to identifiable lexical environments in the text. These probes display distinct behaviours: configurations involving two transitions more frequently emerge across word boundaries, reflecting interactions between adjacent tokens, whereas configurations with fewer transitions are largely confined to intra-lexical structures. Part of the signal is further shaped by orthographic phenomena, particularly apostrophised forms, highlighting the role of writing conventions alongside phonological and lexical organisation. A complementary classification analysis identifies cantica-specific terms, providing lexical anchors through which graphemic probes can be related to the structure of the poem. This organisation is reflected not only in the separation of the three cantiche, but also in a continuous trajectory across the text. Overall, the results show that simple probabilistic models applied to symbolic text representations can uncover structured interactions between local dependencies, lexical distribution, orthographic encoding, and large-scale organisation, providing an interpretable framework for linking local symbolic dynamics to higher-level textual organisation. 
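As a rough illustration of the pipeline the abstract describes (not the author's exact index), the sketch below encodes text as a V/C sequence, estimates a four-state Markov chain over V/C bigrams (VV, VC, CV, CC), and computes a simple memory index; defining the index as the average probability of repeating the current symbol is an assumption made here for illustration.

```python
from collections import Counter

VOWELS = set("aeiou")

def vc_encode(text):
    """Map letters to 'V'/'C', dropping non-alphabetic characters."""
    return ["V" if ch in VOWELS else "C" for ch in text.lower() if ch.isalpha()]

def markov_memory_index(text):
    """Estimate a four-state Markov chain over V/C bigrams and return a
    toy memory index: the average probability, across observed states,
    that the next symbol repeats the current one (persistence)."""
    seq = vc_encode(text)
    states = [a + b for a, b in zip(seq, seq[1:])]
    trans = Counter(zip(states, states[1:]))
    totals = Counter(s for s, _ in trans.elements())
    persist, n_states = 0.0, 0
    for state in ("VV", "VC", "CV", "CC"):
        total = totals[state]
        if total == 0:
            continue  # state unseen in this text
        # Staying means the next symbol equals the state's last symbol.
        stay = sum(c for (s, t), c in trans.items()
                   if s == state and t[1] == state[1])
        persist += stay / total
        n_states += 1
    return persist / n_states

verse = "nel mezzo del cammin di nostra vita"
idx = markov_memory_index(verse)
```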
[NLP-8] Dharma Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
【速读】: 该论文旨在解决社交媒体上健康虚假信息(health misinformation)传播的问题,尤其关注文化传统与伪科学主张交织时所引发的本地化争议。其解决方案的关键在于利用大型语言模型(LLMs)对100个关于牛尿(gomutra)作为健康疗法的YouTube视频文本进行系统性标注,采用一个包含14类说服策略的分类体系,从而识别出支持者主要依赖疗效诉求和从众效应,而反对者则侧重权威引用和反驳策略。该方法通过高一致性的人工评估(90.1%互评一致率)验证了分类体系的可靠性,为基于计算手段的大规模文化语境下虚假信息分析提供了可扩展的技术路径。
链接: https://arxiv.org/abs/2604.22606
作者: Sheza Munir,Ratna Kandala,Anamta Khan,Deepti,Joyojeet Pal
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Health misinformation remains one of the most pressing challenges on social media, particularly when cultural traditions intersect with scientific-sounding claims. These dynamics are not only global but also deeply local, manifesting in culturally specific controversies that require careful analysis. Motivated by this, we examine 100 YouTube transcripts that promote or debunk cow urine (gomutra) as a health remedy, focusing on rhetorical strategies such as appeals to authority, efficacy appeals, and conspiracy framing. We employ large language models (LLMs) including GPT-4, GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Pro, and Mistral Medium 3 to annotate transcripts using a 14-category taxonomy of persuasive tactics. Our analysis reveals that promoters predominantly rely on efficacy appeals and social proof, while debunkers emphasize authority and rebuttal. Human evaluation of a subset of annotations yielded 90.1% inter-annotator agreement, confirming the reliability of our taxonomy and validation process. This work advances computational methods for misinformation analysis and demonstrates how LLMs can support large-scale studies of cultural discourse online.
[NLP-9] QuantClaw: Precision Where It Matters for OpenClaw
【速读】: 该论文旨在解决自主代理系统(如OpenClaw)在实际应用中因长上下文输入和多轮推理导致的计算与成本开销过高的问题。现有量化(quantization)方法虽可降低资源消耗,但其对代理性能的影响在复杂任务场景下尚不明确。解决方案的关键在于提出QuantClaw——一种即插即用的精度路由插件,根据任务特性动态分配计算精度:轻量级任务使用低精度配置以降低成本和延迟,而高要求任务则保留高精度保障性能。该机制实现了在不增加用户复杂度的前提下,显著降低推理成本并提升效率,实验表明其在GLM-5(FP8基线)上最高可节省21.4%成本、减少15.7%延迟,同时保持或提升任务性能。
链接: https://arxiv.org/abs/2604.22577
作者: Manyi Zhang,Ji-Fu Li,Zhongao Sun,Xiaohao Liu,Zhenhua Dong,Xianzhi Yu,Haoli Bai,Xiaobo Xia
机构: Huawei Technologies (华为技术有限公司); National University of Singapore (新加坡国立大学); University of Science and Technology of China (中国科学技术大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Blog: this https URL
Abstract:Autonomous agent systems such as OpenClaw introduce significant efficiency challenges due to long-context inputs and multi-turn reasoning. This results in prohibitively high computational and monetary costs in real-world development. While quantization is a standard approach for reducing cost and latency, its impact on agent performance in realistic scenarios remains unclear. In this work, we analyze quantization sensitivity across diverse complex workflows over OpenClaw, and show that precision requirements are highly task-dependent. Based on this observation, we propose QuantClaw, a plug-and-play precision routing plugin that dynamically assigns precision according to task characteristics. QuantClaw routes lightweight tasks to lower-cost configurations while preserving higher precision for demanding workloads, saving cost and accelerating inference without increasing user complexity. Experiments show that our QuantClaw maintains or improves task performance while reducing both latency and computational cost. Across a range of agent tasks, it achieves up to 21.4% cost savings and 15.7% latency reduction on GLM-5 (FP8 baseline). These results highlight the benefit of treating precision as a dynamic resource in agent systems.
[NLP-10] Learning Evidence Highlighting for Frozen LLMs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长且噪声较多的上下文时,容易忽略关键证据的问题。其解决方案的核心在于提出HiLight框架,该框架通过解耦证据选择与推理过程来实现高效增强:训练一个轻量级的强调代理(Emphasis Actor),在不修改原始输入的前提下,仅向关键语义片段插入最小化的高亮标记(highlight tags),随后将此增强后的文本输入给冻结的推理模块(Solver)进行下游任务推理。该方法将高亮建模为弱监督决策问题,并使用强化学习优化代理策略,仅依赖Solver的任务奖励信号,无需标注证据或访问/修改Solver本身,从而实现了对证据结构的通用捕捉和零样本迁移能力。
链接: https://arxiv.org/abs/2604.22565
作者: Shaoang Li,Yanhang Shi,Yufei Li,Mingfu Liang,Xiaohan Wei,Yunchen Pu,Fei Tian,Chonglin Sun,Frank Shyu,Luke Simon,Sandeep Pandey,Xi Liu,Jian Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver’s task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.
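The emphasis step HiLight describes, inserting minimal highlight tags around pivotal spans without otherwise rewriting the context, amounts to a simple span-marking operation. The sketch below uses hypothetical `<hl>` tags and character offsets; the paper's actual tag format and span selection policy are not specified in the abstract.

```python
def emphasize(context, spans, open_tag="<hl>", close_tag="</hl>"):
    """Insert highlight tags around selected character spans, leaving the
    rest of the context unaltered. Spans are (start, end) pairs, assumed
    non-overlapping; end is exclusive."""
    out, prev = [], 0
    for start, end in sorted(spans):
        out.append(context[prev:start])
        out.append(open_tag + context[start:end] + close_tag)
        prev = end
    out.append(context[prev:])
    return "".join(out)

doc = "The meeting moved to Friday. Budget unchanged."
tagged = emphasize(doc, [(21, 27)])  # highlight "Friday"
```

Because the tags are purely additive, stripping them recovers the original context exactly, which is the property HiLight relies on to avoid discarding or distorting evidence.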
[NLP-11] Using Embedding Models to Improve Probabilistic Race Prediction
【速读】: 该论文旨在解决在缺乏个体层面种族数据的情况下,如何准确估计种族差异的问题。传统方法依赖于贝叶斯改进的姓氏地理编码(Bayesian Improved Surname Geocoding, BISG),但其性能受限于仅覆盖常见姓氏的美国人口普查姓氏数据,导致约10%人口(尤其是罕见姓氏者)的预测准确性显著下降,因标准BISG在这些情况下使用非信息性通用先验。解决方案的关键在于提出嵌入驱动的BISG(embedding-powered BISG, eBISG),通过预训练文本嵌入将姓名表示为稠密向量,并利用2020年人口普查姓氏与名字数据训练神经网络,从而估算未被人口普查涵盖的姓名的种族概率;进一步地,采用全名嵌入模型(full-name embedding)结合选民档案数据,捕捉姓名各成分间的交互作用,显著提升了对西班牙裔和亚裔等群体的预测性能。
链接: https://arxiv.org/abs/2604.22555
作者: Noan Dasanaike,Kosuke Imai
机构: Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Estimating racial disparity requires individual-level race data, which are often unavailable due to the sensitivity of collecting such information. To address this problem, many researchers utilize Bayesian Improved Surname Geocoding (BISG), which critically relies on Census surname data. Unfortunately, these data capture race-surname relationships only for common surnames, omitting approximately 10% of the US population. We show that predictive performance degrades substantially for individuals with such omitted, uncommon surnames because the standard BISG implementation relies on an uninformative generic prior in these cases. To address this limitation, we propose embedding-powered BISG (eBISG), which uses pre-trained text embeddings to represent names as dense vectors and trains neural networks on 2020 Census surname and first-name data to estimate race probabilities for names not covered in the Census. We compare five approaches: standard BISG using only surnames, BIFSG incorporating first-name probabilities, surname embedding for unlisted names, surname and first-name embedding combining both, and a full-name embedding trained on voter file data from Southern states that captures interactions between name components. We show that each successive eBISG approach improves race prediction, with the full-name embedding yielding the largest gains, particularly for Hispanic and Asian voters whose surnames are absent from the Census list.
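For context, the standard BISG update that eBISG builds on is a single Bayes-rule step: the surname gives a prior over race categories, which is updated with the geographic distribution of each race. The sketch below uses made-up numbers over four hypothetical categories; real implementations use Census tables for both factors.

```python
import numpy as np

def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Standard BISG update: posterior(race) is proportional to
    P(race | surname) * P(geography | race), renormalized."""
    unnorm = p_race_given_surname * p_geo_given_race
    return unnorm / unnorm.sum()

# Hypothetical numbers over four categories (White, Black, Hispanic, Asian).
p_race_surname = np.array([0.05, 0.02, 0.90, 0.03])   # prior from the surname table
p_geo_race = np.array([0.010, 0.004, 0.030, 0.008])   # tract share within each race
post = bisg_posterior(p_race_surname, p_geo_race)
```

The failure mode the paper targets is visible here: for a surname absent from the Census table, `p_race_surname` degenerates to a generic prior, and eBISG replaces it with an embedding-based prediction instead.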
[NLP-12] Controllable Spoken Dialogue Generation: An LLM -Driven Grading System for K-12 Non-Native English Learners
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非母语环境中难以满足K-12英语学习者教学需求的问题,核心症结在于学习者语言能力与模型输出之间存在熟练度不匹配。解决方案的关键在于提出一个能力对齐框架(proficiency-aligned framework),通过基于中国国家课程标准(CSE)的四层词汇复杂度分级体系实现对输出文本词汇难度的精准控制,并引入一种名为多样性驱动策略优化(Diversity Driven Policy Optimization, DDPO)的新算法——该算法以多轮GRPO为基础,兼顾对话多样性与整体对话质量,在保持自然对话流的同时显著降低词表外(out-of-vocabulary)率,提升教学价值和可交互性。该框架具备良好的通用性,可适配其他教育标准,且配套数据集与代码均开源,为非沉浸式环境下个性化英语口语训练提供可扩展平台。
链接: https://arxiv.org/abs/2604.22542
作者: Haidong Yuan,Haokun Zhao,Wanshi Xu,Songjun Cao,Qingyu Zhou,Long Ma,Hongjie Fan
机构: Peking University (北京大学); Tencent (腾讯); China University of Political Science and Law (中国政法大学); Independent Researcher (独立研究者); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) often fail to meet the pedagogical needs of K-12 English learners in non-native contexts due to a proficiency mismatch. To address this widespread challenge, we introduce a proficiency-aligned framework that adapts LLM outputs to learner abilities, using China’s national curriculum (CSE) as a representative case. Our framework enables precise control over lexical complexity through a four-tier grading system, supported by a comprehensive suite of new resources: graded vocabulary lists and a multi-turn dialogue corpus. Our core technical contribution is the DDPO algorithm (Diversity Driven Policy Optimization), a multi-turn GRPO-based approach designed to preserve dialogue diversity while holistically optimizing dialogue quality. This method significantly outperforms conventional approaches, achieving low out-of-vocabulary rates and high diversity while enhancing conversational naturalness and pedagogical value. While grounded in the CSE, our framework is designed for flexibility and can be readily adapted to other educational standards. Our models, data, and code will all be open-sourced, providing a scalable platform for personalized English speaking practice that effectively addresses the unique challenges faced by K-12 learners in non-immersive environments.
[NLP-13] RouteLMT: Learned Sample Routing for Hybrid LLM Translation Deployment ACL2026
【速读】: 该论文旨在解决大规模部署大语言模型(Large Language Models, LLMs)进行机器翻译(Machine Translation, MT)时成本过高、难以规模化的问题。现有混合系统通常通过小模型处理大部分请求并仅将部分请求路由至大模型来平衡成本与质量,但当前的路由策略多依赖启发式规则、外部预测器或绝对质量估计,无法有效判断大模型是否能带来显著优于小模型的改进。论文的关键创新在于将路由问题建模为预算分配问题,并识别出边际收益(marginal gain),即大模型相对于小模型的预期提升作为最优决策信号;基于此,提出了一种名为RouteLMT的高效内嵌式路由器,其通过探测小模型的提示词(prompt-token)表示来预测该边际收益,无需额外模型或假设解码,从而实现了更精准的预算导向决策,在质量-成本帕累托前沿上显著优于现有方法。
链接: https://arxiv.org/abs/2604.22520
作者: Yingfeng Luo,Hongyu Liu,Dingyang Lin,Kaiyan Chang,Chenglong Wang,Bei Li,Quan Du,Tong Xiao,Jingbo Zhu
机构: Northeastern University (东北大学); NiuTrans Research
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Industry Track
Abstract:Large Language Models (LLMs) have achieved remarkable performance in Machine Translation (MT), but deploying them at scale remains prohibitively expensive. A widely adopted remedy is the hybrid system paradigm, which balances cost and quality by serving most requests with a small model and selectively routing a fraction to a large model. However, existing routing strategies often rely on heuristics, external predictors, or absolute quality estimation, which fail to capture whether the large model actually provides a worthwhile improvement over the small one. In this paper, we formulate routing as a budget allocation problem and identify marginal gain, i.e., the large model’s improvement over the small model, as the optimal signal for budgeted decisions. Building on this, we propose RouteLMT (routing for LLM-based MT), an efficient in-model router that predicts this expected gain by probing the small translator’s prompt-token representation, without requiring external models or hypothesis decoding. Extensive experiments demonstrate that RouteLMT outperforms heuristic and quality/difficulty-estimation baselines, achieving a superior quality-budget Pareto frontier. Furthermore, we analyze regression risks and show that a simple guarded variant can mitigate severe quality losses.
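Once a marginal gain has been predicted per request, the budgeted routing decision itself is straightforward: send the requests with the highest predicted gain to the large model, up to the budget. The sketch below shows only this allocation step (the gain values are hypothetical; the paper's router produces them from the small model's prompt-token representation).

```python
def route_by_marginal_gain(predicted_gains, budget_fraction):
    """Budgeted routing: the fraction of requests with the highest
    predicted marginal gain (expected improvement of the large model
    over the small one) goes to the large model; the rest stay small."""
    n_large = int(round(budget_fraction * len(predicted_gains)))
    ranked = sorted(range(len(predicted_gains)),
                    key=lambda i: predicted_gains[i], reverse=True)
    to_large = set(ranked[:n_large])
    return ["large" if i in to_large else "small"
            for i in range(len(predicted_gains))]

gains = [0.02, 0.31, 0.05, 0.44, 0.01]   # hypothetical predicted gains
routes = route_by_marginal_gain(gains, budget_fraction=0.4)
```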
[NLP-14] Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement ACL2026
【速读】: 该论文旨在解决生成式 AI (Generative AI) 在评估商业创意时面临的多维度评价难题,尤其是专家评分存在显著分歧的情况下,如何设计更有效的自动评判机制。其核心问题在于:自动评判模型应追求群体共识(aggregate consensus)还是建模个体评估者的偏好差异(evaluator-specific modeling)。解决方案的关键在于引入PBIG-DATA数据集,其中包含300个基于专利的商业创意及其由六位领域专家在六个业务维度上的细粒度评分,通过对比三种裁判配置——仅依赖评分规则的零样本裁判、基于混合评分历史的聚合裁判和基于目标评估者历史的个性化裁判——发现个性化裁判在多数维度和模型规模下均更贴近对应评估者的真实判断,且评估者间的一致性仅在个性化条件下与模型推理相似性显著相关。这表明在多元评价场景中,采用评估者条件化的裁判设计优于传统聚合标签方法。
链接: https://arxiv.org/abs/2604.22517
作者: Wataru Hirota,Tomoki Taniguchi,Tomoko Ohkuma,Kosuke Takahashi,Takahiro Omi,Kosuke Arima,Takuto Asakura,Chung-Chi Chen,Tatsuya Ishigaki
机构: Stockmark Inc; Asahi Kasei Corporation; National Institute of Informatics; National Institute of Advanced Industrial Science and Technology
类目: Computation and Language (cs.CL)
备注: ACL 2026 Industry Track (Oral)
Abstract:Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator’s scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
[NLP-15] Measuring and Mitigating Persona Distortions from AI Writing Assistance
【速读】: 该论文旨在解决生成式 AI(Generative AI)在写作辅助过程中对作者人格形象(writer persona)的扭曲问题,即AI辅助写作会系统性地改变读者对作者政治立场、个性特质、情绪倾向及人口统计特征的感知,导致作者形象被误读为更自信、更积极、更具特权背景。解决方案的关键在于通过基于实验数据训练奖励模型(reward models),以引导AI输出更忠实于原始作者立场的内容,从而减少不希望出现的人格扭曲现象;然而,这一方法虽能有效缓解负面偏差,却降低了用户对AI文本的接受度,揭示了AI写作辅助中理想属性与不良副作用之间的内在权衡关系。
链接: https://arxiv.org/abs/2604.22503
作者: Paul Röttger,Kobi Hackenburg,Hannah Rose Kirk,Christopher Summerfield
机构: University of Oxford (牛津大学); UK AI Security Institute (英国人工智能安全研究所)
类目: Computation and Language (cs.CL)
备注: For supplementary information, code, and data see this https URL
Abstract:Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.
[NLP-16] Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
【速读】: 该论文试图解决的问题是:当大规模语言模型代理(Large Language Model Agents)群体规模扩展至数百万级别时,是否存在由规模自发产生的集体智能(Collective Intelligence)?为回答这一问题,作者提出并实施了首个大规模自主代理社会的实证评估。其解决方案的关键在于设计并应用“Superminds Test”——一个分层测试框架,通过受控的探测代理(Probing Agents)在三个层级上系统性地检验社会层面的认知能力:联合推理(joint reasoning)、信息合成(information synthesis)和基础交互(basic interaction)。实验结果表明,当前代理社会并未展现出超越个体前沿模型的复杂推理能力,极少实现分布式信息整合,且连简单的协作任务也常失败,揭示出集体智能并非仅靠规模即可涌现,而受限于极低密度和浅层的交互模式,从而为未来研究指明方向:提升代理间的信息交换与协同机制是突破瓶颈的核心。
链接: https://arxiv.org/abs/2604.22452
作者: Xirui Li,Ming Li,Yunze Xiao,Ryan Wong,Dianqi Li,Timothy Baldwin,Tianyi Zhou
机构: University of Maryland (马里兰大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Collective intelligence refers to the ability of a group to achieve outcomes beyond what any individual member can accomplish alone. As large language model agents scale to populations of millions, a key question arises: Does collective intelligence emerge spontaneously from scale? We present the first empirical evaluation of this question in a large-scale autonomous agent society. Studying MoltBook, a platform hosting over two million agents, we introduce Superminds Test, a hierarchical framework that probes society-level intelligence using controlled Probing Agents across three tiers: joint reasoning, information synthesis, and basic interaction. Our experiments reveal a stark absence of collective intelligence. The society fails to outperform individual frontier models on complex reasoning tasks, rarely synthesizes distributed information, and often fails even trivial coordination tasks. Platform-wide analysis further shows that interactions remain shallow, with threads rarely extending beyond a single reply and most responses being generic or off-topic. These results suggest that collective intelligence does not emerge from scale alone. Instead, the dominant limitation of current agent societies is extremely sparse and shallow interaction, which prevents agents from exchanging information and building on each other’s outputs.
[NLP-17] SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking ACL2026
【速读】: 该论文旨在解决现有水印方案(如KGW)在低熵场景(如代码生成和数学推理)中有效性显著下降的问题。其核心问题在于,随机词汇划分策略导致水印强度(watermark strength)受限于下一词预测的概率分布,从而削弱了水印的可检测性。解决方案的关键在于提出一种新的词汇划分方法SSG(Sort-then-Split by Groups),通过将词汇表划分为两个对数平衡(logit-balanced)子集,提升了每一步token预测的水印强度下界,从而增强了水印的可检测性。
链接: https://arxiv.org/abs/2604.22438
作者: Chenxi Gu,Xiaoning Du,John Grundy
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: ACL 2026 Main Conference
Abstract:Watermarking has emerged as a promising technique for tracing the authorship of content generated by large language models (LLMs). Among existing approaches, the KGW scheme is particularly attractive due to its versatility, efficiency, and effectiveness in natural language generation. However, KGW’s effectiveness degrades significantly under low-entropy settings such as code generation and mathematical reasoning. A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences. Our study revealed that the next-token probability distribution plays a critical role in determining how much, or even whether, we can modify token selection and, consequently, the effectiveness of watermarking. We refer to this characteristic, associated with the probability distribution of each token prediction, as watermark strength. In cases of random vocabulary partitioning, the lower bound of watermark strength is dictated by the next-token probability distribution. However, we found that, by redesigning the vocabulary partitioning algorithm, we can potentially raise this lower bound. In this paper, we propose SSG (Sort-then-Split by Groups), a method that partitions the vocabulary into two logit-balanced subsets. This design lifts the lower bound of watermark strength for each token prediction, thereby improving watermark detectability. Experiments on code generation and mathematical reasoning datasets demonstrate the effectiveness of SSG.
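The abstract does not give SSG's exact partition rule, but one plausible reading of "sort-then-split by groups" can be sketched as follows: sort tokens by logit, then deal each consecutive pair to the two subsets with a keyed coin flip, so that tokens of similar logit land on opposite sides and the two subsets carry comparable probability mass. Everything below (grouping into pairs, the seeding) is an assumption for illustration.

```python
import numpy as np

def sort_then_split(logits, seed=0):
    """Logit-balanced vocabulary partitioning (illustrative reading of SSG):
    sort tokens by logit descending, then split each adjacent pair between
    the 'green' and 'red' subsets with a keyed coin flip."""
    rng = np.random.default_rng(seed)  # KGW-style schemes seed this from prior tokens
    order = np.argsort(logits)[::-1]
    green, red = [], []
    for i in range(0, len(order), 2):
        pair = order[i:i + 2]
        if rng.integers(2) == 0:
            pair = pair[::-1]
        green.append(int(pair[0]))
        if len(pair) > 1:
            red.append(int(pair[1]))
    return green, red

logits = np.array([5.0, 4.9, 1.0, 0.9, -2.0, -2.1])
green, red = sort_then_split(logits)
probs = np.exp(logits) / np.exp(logits).sum()
balance_gap = abs(probs[green].sum() - probs[red].sum())
```

Compared with a uniformly random split, pairing near-equal logits keeps the probability mass of the two subsets close even when the distribution is sharply peaked, which is what raising the watermark-strength lower bound in low-entropy settings requires.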
[NLP-18] Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在设定温度 $ T=0 $ 时仍出现输出不一致的问题,即实现层面的非确定性(implementation-level nondeterminism),其根源包括批处理大小变化、核函数非不变性以及浮点数运算的非结合性。解决方案的关键在于提出“背景温度”(background temperature, $ T_\mathrmbg $)这一概念,用以量化由推理环境 $ I $ 引起的、即使在名义温度为零时也存在的随机扰动效应;并通过构建理想参考系统的等效温度 $ T_n(I) $ 提出一种可操作的实验协议来估计 $ T_\mathrmbg $,从而为模型复现性、评估与部署提供新的分析框架和实践依据。
链接: https://arxiv.org/abs/2604.22411
作者: Alberto Messina,Stefano Scotta
机构: RAI - Radiotelevisione Italiana, Centre for Research, Technological Innovation and Experimentation (CRITS)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Even when decoding with temperature T=0, large language models (LLMs) can produce divergent outputs for identical inputs. Recent work by Thinking Machines Lab highlights implementation-level sources of nondeterminism, including batch-size variation, kernel non-invariance, and floating-point non-associativity. In this short note we formalize this behavior by introducing the notion of background temperature T_bg, the effective temperature induced by an implementation-dependent perturbation process observed even when nominal T=0. We provide clean definitions, show how T_bg relates to a stochastic perturbation governed by the inference environment I, and propose an empirical protocol to estimate T_bg via the equivalent temperature T_n(I) of an ideal reference system. We conclude with a set of pilot experiments run on a representative pool from the major LLM providers that demonstrate the idea and outline implications for reproducibility, evaluation, and deployment.
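One way to make the equivalent-temperature idea concrete: measure how often two nominally deterministic runs disagree on a token, then solve for the temperature at which an ideal sampler over the same logits would show the same disagreement. The matching statistic (pairwise disagreement probability) and the logits below are assumptions for illustration, not the paper's protocol.

```python
import numpy as np

def disagreement_at_T(logits, T):
    """Probability that two independent samples at temperature T differ."""
    z = logits / T
    p = np.exp(z - z.max())
    p /= p.sum()
    return 1.0 - float(np.sum(p ** 2))

def equivalent_temperature(logits, observed_disagreement,
                           lo=1e-4, hi=10.0, iters=60):
    """Bisect for the temperature at which an ideal sampler over `logits`
    matches the disagreement rate observed at nominal T=0 (disagreement
    is monotonically increasing in T for these logits)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if disagreement_at_T(logits, mid) < observed_disagreement:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

logits = np.array([3.0, 1.0, 0.5, -1.0])       # hypothetical next-token logits
T_bg = equivalent_temperature(logits, observed_disagreement=0.05)
```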
[NLP-19] Selective Contrastive Learning For Gloss Free Sign Language Translation ACL2026
【速读】: 该论文旨在解决手语翻译(Sign Language Translation, SLT)中因视觉手势与书面文本之间模态差异导致的跨模态对齐困难问题,尤其是在无词素(gloss-free)场景下,现有基于CLIP-like视觉-语言预训练(Vision-Language Pretraining, VLP)的方法依赖随机批内对比学习(in-batch contrastive learning),易引入噪声和不一致的对齐监督信号,因为其负样本常包含语义相近甚至相同的视频-文本对,从而削弱模型性能。解决方案的关键在于提出选择性对比学习(Selective Contrastive Learning for SLT, SCL-SLT),其核心是Pair Selection(PS)策略:通过参考检查点追踪负样本的相似度动态变化,对候选负样本进行评分,并设计渐进式课程机制(curriculum)逐步强化更具挑战性的负样本,从而提升对比监督的有效性并降低噪声负样本的影响。
链接: https://arxiv.org/abs/2604.22374
作者: Changhao Lai,Rui Zhao,Xuewen Zhong,Jinsong Su,Yidong Chen
机构: School of Informatics, Xiamen University, China; Key Lab of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian-Taiwan (XMU), Ministry of Culture and Tourism, China; National Language Resources Monitoring and Research Center for Education and Teaching Media, Xiamen University, China
类目: Computation and Language (cs.CL)
备注: Accepted by ACL 2026 as the main conference
Abstract:Sign language translation (SLT) converts continuous sign videos into spoken-language text, yet it remains challenging due to the intrinsic modality mismatch between visual signs and written text, particularly in gloss-free settings. Recent SLT systems increasingly adopt CLIP-like Vision-Language pretraining (VLP) for cross-modal alignment, but the random in-batch contrast provides few, batch-dependent negatives and may mislabel semantically similar (or even identical) pairs as negatives, introducing noisy and potentially inconsistent alignment supervision. In this work, we first conduct a preliminary trajectory-based analysis that tracks negative video-text similarity over training. The results show that only a small subset of negatives exhibits the desired behavior of being consistently pushed away, while the remaining negatives display heterogeneous and often non-decreasing similarity dynamics, suggesting that random in-batch negatives are frequently uninformative for effective alignment. Inspired by this, we propose Selective Contrastive Learning for SLT (SCL-SLT) with a Pair Selection (PS) strategy. PS scores candidate negatives using similarity dynamics from reference checkpoints and constructs mini-batches via a curriculum that progressively emphasizes more challenging negatives, thereby strengthening contrastive supervision while reducing the influence of noisy or semantically invalid negatives.
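The two ingredients of the Pair Selection strategy, scoring negatives by their similarity dynamics across reference checkpoints and a curriculum that shifts toward harder negatives, can be sketched as below. The concrete score (a least-squares slope of similarity over checkpoints) and the linear easy-to-hard schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def negative_scores(sim_trajectories):
    """Score each candidate negative by the slope of its video-text
    similarity across reference checkpoints: negatives whose similarity
    is not being pushed down (slope >= 0) count as harder/more informative."""
    traj = np.asarray(sim_trajectories)        # (n_negatives, n_checkpoints)
    steps = np.arange(traj.shape[1])
    return np.polyfit(steps, traj.T, 1)[0]     # slope per negative

def curriculum_batch(scores, epoch, n_epochs, batch_size):
    """Mix easy (low-score) and hard (high-score) negatives, shifting
    linearly toward hard negatives as training progresses."""
    hard_frac = epoch / max(1, n_epochs - 1)
    n_hard = int(round(hard_frac * batch_size))
    order = np.argsort(scores)                 # ascending: easy first
    easy = list(order[:batch_size - n_hard])
    hard = list(order[::-1][:n_hard])
    return easy + hard

sims = [[0.9, 0.6, 0.3],    # consistently pushed away (easy)
        [0.5, 0.5, 0.5],    # flat (harder)
        [0.4, 0.5, 0.6]]    # increasing (hardest)
scores = negative_scores(sims)
```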
[NLP-20] CNSL-bench: Benchmarking the Sign Language Understanding Capabilities of MLLMs on Chinese National Sign Language ACL2026
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解中国国家通用手语(Chinese National Sign Language, CNSL)方面的能力不足问题,尤其在多模态上下文中的理解表现尚不明确。解决方案的关键在于提出首个面向中文手语的综合性评估基准——CNSL-bench,其核心优势包括:权威性标注(基于官方标准化的《国家通用手语词典》以消除地域或非规范变体带来的歧义)、多模态覆盖(提供对齐的文本描述、图像和手语视频)以及发音多样性(支持对空书写字母、手指拼写及汉语手势字母等关键手动表征形式的细粒度分析)。通过该基准对21个主流开源与商用MLLMs进行系统评测,揭示了当前模型在手语理解上显著落后于人类水平,并存在输入模态和手动表征形式间的系统性差异。
链接: https://arxiv.org/abs/2604.22367
作者: Rui Zhao,Xuewen Zhong,Xiaoyun Zheng,Jinsong Su,Yidong Chen
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as the Main Conference at ACL 2026
Abstract:Sign language research has achieved significant progress due to the advances in large language models (LLMs). However, the intrinsic ability of LLMs to understand sign language, especially in multimodal contexts, remains underexplored. To address this limitation, we introduce CNSL-bench, the first comprehensive Chinese National Sign Language benchmark designed for evaluating multimodal large language models (MLLMs) in sign language understanding. The proposed CNSL-bench is characterized by: 1) Authoritative grounding, as it is anchored to the officially standardized National Common Sign Language Dictionary, mitigating ambiguity from regional or non-canonical variants and ensuring consistent semantic definitions; 2) Multimodal coverage, providing aligned textual descriptions, illustrative images, and sign language videos; and 3) Articulatory diversity, supporting fine-grained analysis across key manual articulatory forms, including air-writing, finger-spelling, and the Chinese manual alphabet. Using CNSL-bench, we extensively evaluate 21 open-source and proprietary up-to-date MLLMs. Our results reveal that, despite recent advances in multimodal modeling, current MLLMs remain substantially inferior to human performance, exhibiting systematic disparities across input modalities and manual articulatory forms. Additional diagnostic analyses suggest that several performance limitations persist beyond improvements in reasoning and that instruction-following robustness varies substantially across models.
[NLP-21] Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化生成中缺乏可解释性和可控性的问题,现有方法通常将模型的隐式个性化能力视为黑箱,依赖提示工程或用户数据微调。其解决方案的关键在于提出一种无需训练的机制可解释性框架——差分偏好引导(Differential Preference Steering, DPS),通过因果掩码分析识别出一组稀疏的偏好头(Preference Heads),这些注意力头编码用户特定的风格和主题偏好并对生成结果产生因果影响;DPS进一步计算每个注意力头的偏好贡献分数(Preference Contribution Score, PCS),并在解码阶段对比启用与禁用偏好头时的输出差异,从而选择性增强与用户偏好一致的词元概率,实现高效、可控且可解释的个性化生成。
链接: https://arxiv.org/abs/2604.22345
作者: Weixu Zhang,Ye Yuan,Changjiang Han,Yuxing Tian,Zipeng Sun,Linfeng Du,Jikun Kang,Hong Kang,Xue Liu,Haolun Wu
机构: McGill University (麦吉尔大学); Mila - Quebec AI Institute (蒙特利尔魁北克人工智能研究所); MBZUAI; University of Montreal (蒙特利尔大学); Salesforce
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026
Abstract:Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads: attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
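按摘要描述,DPS在解码时对比"启用偏好头"与"屏蔽偏好头"两次前向传播的logits,并放大二者差异以增强与偏好一致的token。下面给出与该思路对应的极简数值示意(非论文官方实现;放大系数 alpha 及各变量名均为本文假设):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dps_adjust(logits_pref, logits_base, alpha=1.0):
    # 放大"启用偏好头"与"屏蔽偏好头"两次前向传播的logits差
    return [p + alpha * (p - b) for p, b in zip(logits_pref, logits_base)]

# 玩具示例: 词表大小为4, 假定token 0与用户偏好一致
logits_pref = [2.0, 1.0, 0.5, 0.1]   # 启用偏好头
logits_base = [1.5, 1.2, 0.5, 0.1]   # 屏蔽偏好头
probs = softmax(dps_adjust(logits_pref, logits_base, alpha=2.0))
```

对比调整前后的分布可见,偏好一致token(此处为token 0)的概率被选择性放大,而其余token的相对次序基本保持。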
[NLP-22] Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成内容时出现的“忠实性幻觉”(faithfulness hallucination)问题,即模型输出与输入上下文信息矛盾或忽略上下文支持内容的现象。解决方案的关键在于提出一种轻量级、无需重新训练或架构修改的解码阶段框架——Context-Fidelity Boosting (CFB),其核心机制是通过基于token在输入上下文中支持程度的加性logit调整,提升源文本支持token的生成概率。具体包括三种增强策略:静态增强、上下文感知增强和token感知增强,分别从固定偏置、分布差异自适应缩放以及局部相关性重分配角度优化logit调整,从而显著提升生成内容的忠实性,且仅带来极小的推理开销。
链接: https://arxiv.org/abs/2604.22335
作者: Weixu Zhang,Fanghua Ye,Qiang Gao,Jian Li,Haolun Wu,Yuxing Tian,Sijing Duan,Nan Du,Xiaolong Li,Xue Liu
机构: Tencent(腾讯); McGill University (麦吉尔大学); Mila - Quebec AI Institute (Mila-魁北克人工智能研究所); Wuhan University (武汉大学); University of Montreal (蒙特利尔大学); Tsinghua University (清华大学); MBZUAI
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026
Abstract:Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that reduces such hallucinations by increasing the generation probability of source-supported tokens. Motivated by logit-shaping principles from watermarking techniques, CFB applies additive token-level logit adjustments based on a token’s degree of support from the input context. Specifically, we develop three boosting strategies: static boosting, which applies a fixed bias to source-supported tokens; context-aware boosting, which scales this bias using the divergence between next-token distributions with and without context; and token-aware boosting, which further redistributes the adaptive bias according to local relevance estimated from source-position attention and source-scoped semantic similarity. CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics with minimal generation overhead. Our implementation is fully open-sourced.
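CFB的静态增强策略对源文本支持的token施加固定的加性logit偏置,其思路与水印技术中对"绿表"token的logit偏移一致。以下为假设性的最小示意(delta为本文假设的超参数名,并非论文符号):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cfb_static_boost(logits, supported_ids, delta=2.0):
    # 对受输入上下文支持的token施加固定加性偏置delta
    return [l + delta if i in supported_ids else l
            for i, l in enumerate(logits)]

logits = [1.0, 1.0, 0.2]
supported = {0}            # 假定token 0出现在源上下文中
probs = softmax(cfb_static_boost(logits, supported, delta=2.0))
```

论文中的上下文感知与token感知两种策略,可理解为把此处固定的 delta 换成随分布差异或局部相关性自适应缩放的偏置。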
[NLP-23] Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
【速读】: 该论文旨在解决现有自然语言处理(Natural Language Processing, NLP)资源在真实场景中缺乏任务特定信息、且对新兴或小众实体覆盖不足的问题,尤其针对业务组织和医疗提供者等实体需在不同分类体系下进行精准标注的需求。其解决方案的关键在于构建一个由领域专家仅提供实体名称和黄金标签(gold labels)即可训练的文本驱动分类框架:通过动态获取每个实体的描述性文本——利用网络数据与大语言模型(Large Language Models, LLMs)相结合的新颖文本采集方法——进而生成基于文本的分类器,从而实现无需大量标注语料即可完成高精度的任务定制化分类。
链接: https://arxiv.org/abs/2604.22325
作者: Fahmida Alam,Ellen Riloff
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and health care providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) Codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider’s medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
[NLP-24] CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems ACL2026
【速读】: 该论文旨在解决工业场景下自然语言到SQL(NL2SQL)系统在面对多维度模糊性(multi-faceted ambiguity)和多样化用户行为时性能显著下降的问题,尤其关注交互式场景中因用户未充分澄清导致的不可回答查询。现有基准测试通常假设单一来源的模糊性并依赖用户交互来解决,忽略了真实世界中的复杂失败模式。其解决方案的关键在于提出Clarity框架,通过约束驱动的流水线将可执行SQL转换为含语义模糊性的自然语言查询,并引入基于上下文的对话延续(grounded conversational continuations)和模式级元数据(schema-level metadata),从而自动构建涵盖单轮与多轮交互设置的NL2SQL基准数据集。实证研究表明,即使采用先进的大语言模型(LLMs)的NL2SQL系统,在多维度模糊条件下也存在严重性能退化,主要体现在难以准确定位并解决底层的模式级模糊源,凸显了当前系统在鲁棒性方面的不足。
链接: https://arxiv.org/abs/2604.22313
作者: Tabinda Sarwar,Farhad Moghimifar,Cong Duy Vu Hoang,Xiaoxiao Ma,Shawn Chang Xu,Fahimeh Saleh,Poorya Zaremoodi,Avirup Sil,Katrin Kirchhoff
机构: Oracle Corporation (甲骨文公司)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026 (Industry Track)
Abstract:NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
[NLP-25] Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
【速读】: 该论文旨在解决长文档集合中问答任务的挑战,即随着文档数量增长,固定的大语言模型(Large Language Model, LLM)上下文窗口容易被超出,而传统将文档切分为块(chunk)的方法会引入聚合瓶颈,导致系统难以高效地整合和推理大量提取证据。其解决方案的关键在于提出SLIDERS框架,通过将关键信息结构化提取至关系型数据库(relational database),利用SQL进行可扩展的结构化推理,替代传统的文本拼接方式;同时引入数据协调阶段,基于溯源信息、提取理由和元数据识别并修复重复、不一致和不完整的记录,从而实现局部提取与全局一致性的统一。
链接: https://arxiv.org/abs/2604.22294
作者: Harshit Joshi,Priyank Shethia,Jadelynn Dao,Monica S. Lam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 49 pages (14 main), preprint
Abstract:Real-world document question answering is challenging. Analysts must synthesize evidence across multiple documents and different parts of each document. However, any fixed LLM context window can be exceeded as document collections grow. A common workaround is to decompose documents into chunks and assemble answers from chunk-level outputs, but this introduces an aggregation bottleneck: as the number of chunks grows, systems must still combine and reason over an increasingly large body of extracted evidence. We present SLIDERS, a framework for question answering over long document collections through structured reasoning. SLIDERS extracts salient information into a relational database, enabling scalable reasoning over persistent structured state via SQL rather than concatenated text. To make this locally extracted representation globally coherent, SLIDERS introduces a data reconciliation stage that leverages provenance, extraction rationales, and metadata to detect and repair duplicated, inconsistent, and incomplete records. SLIDERS outperforms all baselines on three existing long-context benchmarks, despite all of them fitting within the context window of strong base LLMs, exceeding GPT-4.1 by 6.6 points on average. It also improves over the next best baseline by ~19 and ~32 points on two new benchmarks at 3.9M and 36M tokens, respectively.
[NLP-26] ReLeVAnT: Relevance Lexical Vectors for Accurate Legal Text Classification
【速读】: 该论文旨在解决法律文档在非结构化数据语料库中进行高效、准确分类的问题,这对下游任务如诉状起草、案件摘要生成和检索系统构建具有重要意义。现有方法依赖于结构化元数据或大语言模型(Large Language Models, LLMs)提取的元数据,存在对计算资源要求高、依赖外部标注信息等局限性。解决方案的关键在于提出一种名为ReLeVAnT的框架,其核心是利用文档类别间的判别特征,通过一次性关键词提取结合n-gram处理、对比分数匹配(contrastive score matching)与浅层神经网络实现高精度二分类,最终在LexGLUE数据集上达到99.3%的准确率和98.7%的F1分数,显著提升了分类效率与可靠性。
链接: https://arxiv.org/abs/2604.22292
作者: Ishaan Gakhar,Harsh Nandwani
机构: Perssonify
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 Pages, 2 figures
Abstract:The classification of legal documents from an unstructured data corpus has several crucial applications in downstream tasks. Documents relevant to court filings are key in use cases such as drafting motions, memos, and outlines, as well as in tasks like docket summarisation, retrieval systems, and training data curation. Current methods classify based on provided metadata, LLM-extracted metadata, or multimodal methods. These methods depend on structured data, metadata, and extensive computational power. This task is approached from a perspective of leveraging discriminative features in the documents between classes. The authors propose ReLeVAnT, a framework for legal document binary classification. ReLeVAnT utilises n-gram processing, contrastive score matching, and a shallow neural network as the primary drivers for discriminative classification. It leverages one-time keyword extraction per corpus, followed by a shallow classifier to swiftly and reliably classify documents with 99.3% accuracy and 98.7% F1 score on the LexGLUE dataset.
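ReLeVAnT所依赖的"对比分数匹配"可用如下玩具示例体会:以词在两类语料中的平均词频之差作为判别分数,再按文档累计得分的符号分类(这是本文假设的简化版;论文实际还包含n-gram处理与浅层神经网络):

```python
from collections import Counter

def contrastive_keywords(docs_a, docs_b):
    """对比分数匹配的玩具实现: 以某词在A类与B类中
    平均词频之差作为判别分数(每个语料仅需一次性提取)。"""
    ca = Counter(w for d in docs_a for w in d.lower().split())
    cb = Counter(w for d in docs_b for w in d.lower().split())
    vocab = set(ca) | set(cb)
    return {w: ca[w] / len(docs_a) - cb[w] / len(docs_b) for w in vocab}

def classify(doc, scores):
    # 按文档内各词判别分数之和的符号做二分类
    s = sum(scores.get(w, 0.0) for w in doc.lower().split())
    return "A" if s > 0 else "B"

docs_a = ["motion to dismiss filed", "motion granted by court"]      # 类A: 诉讼文书
docs_b = ["quarterly earnings report", "earnings guidance report"]   # 类B: 财务文件
scores = contrastive_keywords(docs_a, docs_b)
```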
[NLP-27] STEM: Structure-Tracing Evidence Mining for Knowledge Graphs-Driven Retrieval-Augmented Generation ACL2026
【速读】: 该论文旨在解决知识图谱问答(Knowledge Graph-based Question Answering, KGQA)中因知识图谱结构异质性导致的语义错位问题,以及现有多跳推理路径检索方法缺乏全局结构视角的局限性。其解决方案的关键在于提出Structure-Tracing Evidence Mining (STEM) 框架,将多跳推理重构为一种由模式引导的图搜索任务:首先设计Semantic-to-Structural Projection流水线,利用知识图谱结构先验将查询分解为原子关系断言并构建自适应查询模式图;随后通过全局感知的节点锚定与子图检索获取最终证据推理图;为更有效地整合全局结构信息,进一步设计Triple-Dependent GNN (Triple-GNN) 生成Global Guidance Subgraph(引导子图),从而指导推理图的构建。该方法显著提升了多跳推理图检索的准确性和证据完整性,并在多个多跳基准测试上达到最先进性能。
链接: https://arxiv.org/abs/2604.22282
作者: Peng Yu,En Xu,Bin Chen,Haibiao Chen,Yinfei Xu
机构: Kingsoft Corporation (金山软件公司)
类目: Computation and Language (cs.CL)
备注: 34 pages, 16 figures, accepted to ACL 2026 (Main Conference)
Abstract:Knowledge Graph-based Question Answering (KGQA) plays a pivotal role in complex reasoning tasks but remains constrained by two persistent challenges: the structural heterogeneity of Knowledge Graphs(KGs) often leads to semantic mismatch during retrieval, while existing reasoning path retrieval methods lack a global structural perspective. To address these issues, we propose Structure-Tracing Evidence Mining (STEM), a novel framework that reframes multi-hop reasoning as a schema-guided graph search task. First, we design a Semantic-to-Structural Projection pipeline that leverages KG structural priors to decompose queries into atomic relational assertions and construct an adaptive query schema graph. Subsequently, we execute globally-aware node anchoring and subgraph retrieval to obtain the final evidence reasoning graph from KG. To more effectively integrate global structural information during the graph construction process, we design a Triple-Dependent GNN (Triple-GNN) to generate a Global Guidance Subgraph (Guidance Graph) that guides the construction. STEM significantly improves both the accuracy and evidence completeness of multi-hop reasoning graph retrieval, and achieves State-of-the-Art performance on multiple multi-hop benchmarks.
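STEM把多跳推理重构为模式引导的图搜索:查询被分解为原子关系断言后,可沿关系序列在知识图谱中逐跳扩展。下面是该搜索骨架的极简示意(三元组与关系名均为虚构示例,并非论文的Triple-GNN实现):

```python
def multi_hop_paths(kg, start, relations):
    """模式引导的多跳图搜索示意: 按查询模式分解出的关系序列
    在三元组集合kg上逐跳扩展, 返回所有可达的证据路径。"""
    paths = [[start]]
    for rel in relations:
        nxt = []
        for path in paths:
            for (h, r, t) in kg:
                if h == path[-1] and r == rel:
                    nxt.append(path + [t])
        paths = nxt
    return paths

# 虚构的迷你知识图谱
kg = {
    ("Alice", "works_at", "AcmeCorp"),
    ("AcmeCorp", "located_in", "Berlin"),
    ("Alice", "born_in", "Paris"),
}
# 查询模式"某人工作地点所在城市"对应关系序列[works_at, located_in]
paths = multi_hop_paths(kg, "Alice", ["works_at", "located_in"])
```

论文的Guidance Graph可理解为在这种逐跳扩展之前,先用全局结构信息裁剪候选节点与关系,以缓解逐跳贪心搜索缺乏全局视角的问题。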
[NLP-28] Large Language Models Decide Early and Explain Later
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成链式思维(Chain-of-Thought, CoT)推理过程中存在冗余计算的问题,即模型是否在早期阶段已确定最终答案,而后续生成的推理步骤是否仅为后验解释(post-decision explanation),从而增加推理成本但未提升准确性。其解决方案的关键在于通过强制答案补全(forced answer completion)方法识别模型在推理过程中的中间预测变化情况,并基于此设计早期停止策略(early stopping strategies),例如基于探测(probe-based)的启发式机制,在保证精度损失小于2%的前提下,平均每个查询减少约500个推理标记(reasoning tokens)的使用,显著优化推理效率。
链接: https://arxiv.org/abs/2604.22266
作者: Ayan Datta,Zhixue Zhao,Bhuvanesh Verma,Radhika Mamidi,Mounika Marreddy,Alexander Mehler
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models often achieve strong performance by generating long intermediate chain-of-thought reasoning. However, it remains unclear when a model’s final answer is actually determined during generation. If the answer is already fixed at an intermediate stage, subsequent reasoning tokens may constitute post-decision explanation, increasing inference cost and latency without improving correctness. We study the evolution of predicted answers over reasoning steps using forced answer completion, which elicits the model’s intermediate predictions at partial reasoning prefixes. Focusing on Qwen3-4B and averaging results across all datasets considered, we find that predicted answers change in only 32% of queries. Moreover, once the final answer switch occurs, the model generates an average of 760 additional reasoning tokens per query, accounting for a substantial fraction of the total reasoning budget. Motivated by these findings, we investigate early stopping strategies that halt generation once the answer has stabilized. We show that simple heuristics, including probe-based stopping, can reduce reasoning token usage by 500 tokens per query while incurring only a 2% drop in accuracy. Together, our results indicate that a large portion of chain-of-thought generation is redundant and can be reduced with minimal impact on performance.
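基于"答案稳定即停"的早停思想,可用如下示意代码刻画:当强制补全得到的中间答案连续若干步保持不变时停止生成(patience为本文假设的参数名;论文实际评估的是基于探测等启发式策略,此处仅为简化示意):

```python
def stable_stop(intermediate_answers, patience=3):
    """当强制补全得到的中间答案连续patience步不变时,
    返回可停止的步号; 否则返回None。"""
    run = 1
    for t in range(1, len(intermediate_answers)):
        if intermediate_answers[t] == intermediate_answers[t - 1]:
            run += 1
            if run >= patience:
                return t
        else:
            run = 1
    return None

# 中间预测在第2步切换为"B"后保持稳定, 第4步满足连续3次相同
steps = ["A", "A", "B", "B", "B", "B"]
stop_at = stable_stop(steps, patience=3)
```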
[NLP-29] Bridging the Long-Tail Gap: Robust Retrieval-Augmented Relation Completion via Multi-Stage Paraphrase Infusion
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在关系补全(Relation Completion, RC)任务中表现不佳的问题,尤其是在缺乏检索增强生成(Retrieval-Augmented Generation, RAG)时,以及当所需信息稀疏或罕见时,LLMs难以有效完成关系补全。解决方案的关键在于提出一种多阶段的、基于改写(paraphrase)引导的关系补全框架RC-RAG,其核心机制包括:(a) 在检索阶段引入关系改写以扩展词汇覆盖范围;(b) 利用改写生成与关系相关的摘要;(c) 在生成阶段利用改写引导推理过程以提升关系补全准确性。该方法无需模型微调,在多个基准数据集上显著优于多种RAG基线,尤其在长尾场景下性能提升显著(最高提升40.6点Exact Match),同时保持较低计算开销。
链接: https://arxiv.org/abs/2604.22261
作者: Fahmida Alam,Mihai Surdeanu,Ellen Riloff
机构: University of Arizona(亚利桑那大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) struggle with relation completion (RC), both with and without retrieval-augmented generation (RAG), particularly when the required information is rare or sparsely represented. To address this, we propose a novel multi-stage paraphrase-guided relation-completion framework, RC-RAG, that systematically incorporates relation paraphrases across multiple stages. In particular, RC-RAG: (a) integrates paraphrases into retrieval to expand lexical coverage of the relation, (b) uses paraphrases to generate relation-aware summaries, and (c) leverages paraphrases during generation to guide reasoning for relation completion. Importantly, our method does not require any model fine-tuning. Experiments with five LLMs on two benchmark datasets show that RC-RAG consistently outperforms several RAG baselines. In long-tail settings, the best-performing LLM augmented with RC-RAG improves by 40.6 Exact Match (EM) points over its standalone performance and surpasses two strong RAG baselines by 16.0 and 13.8 EM points, respectively, while maintaining low computational overhead.
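RC-RAG在检索阶段并入关系改写以扩大词汇覆盖,其直观效果可用如下词重叠检索的玩具示例说明(语料、查询与函数名均为本文假设,仅示意"改写扩展可召回表述不同的证据"这一点):

```python
def expand_with_paraphrases(relation, paraphrases):
    # 将关系改写并入检索查询, 扩大词汇覆盖
    return [relation] + paraphrases

def retrieve(queries, corpus, top_k=2):
    """极简词重叠检索: 每篇文档取其在所有改写查询下的
    最高重叠得分, 再按得分排序返回top_k篇。"""
    scored = {}
    for q in queries:
        qw = set(q.lower().split())
        for i, doc in enumerate(corpus):
            s = len(qw & set(doc.lower().split()))
            scored[i] = max(scored.get(i, 0), s)
    ranked = sorted(scored, key=lambda i: -scored[i])
    return [corpus[i] for i in ranked[:top_k] if scored[i] > 0]

corpus = [
    "Marie Curie was born in Warsaw",
    "Paris is the French capital",
    "Curie's birthplace was Warsaw Poland",
]
queries = expand_with_paraphrases("born in", ["birthplace of", "place of birth"])
docs = retrieve(queries, corpus)
```

仅用原始关系"born in"检索时第3篇文档召回不到;并入改写"birthplace of"后,用不同措辞表达同一关系的证据也被覆盖。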
[NLP-30] Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA DATE ACL2026
【速读】: 该论文旨在解决大规模、半结构化文档集合上的多文档分析型问答(multi-document analytical QA)问题,即要求模型从大量文档中提取并整合信息以执行定量分析,这超出了传统多文档问答仅需少量文档和有限跨文档推理的能力。其核心挑战在于实现高效的跨文档信息聚合与逻辑推理。解决方案的关键是提出一种多智能体工作流(multi-agent workflow),该流程通过规划(planning)、信息抽取(extraction)和代码生成(code generation)模块的协同协作来增强系统对复杂分析任务的理解与执行能力,从而显著提升推理过程与最终答案的准确性,尽管仍存在与人类专家性能的差距,主要瓶颈在于单文档信息抽取精度不足及领域知识缺乏。
链接: https://arxiv.org/abs/2604.22239
作者: Zhanli Li,Yixuan Cao,Lvzhou Luo,Ping Luo
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Wenlan School of Business, Zhongnan University of Economics and Law, Wuhan 430073, China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Findings of ACL 2026. The camera-ready version corrects some labeling errors. The accompanying repository is continuously updated based on community feedback; for the most up-to-date implementation and results, please refer to the repository
Abstract:This paper introduces the task of analytical question answering over large, semi-structured document collections. We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis. Unlike existing multi-document QA benchmarks that typically require information from only a few documents with limited cross-document reasoning, MuDABench demands extensive inter-document analysis and aggregation. Constructed via distant supervision by leveraging document-level metadata and annotated financial databases, MuDABench comprises over 80,000 pages and 332 analytical QA instances. We also propose an evaluation protocol that measures final answer accuracy and uses intermediate-fact coverage as an auxiliary diagnostic signal for the reasoning process. Experiments reveal that standard RAG systems, which treat all documents as a flat retrieval pool, perform poorly. To address these limitations, we propose a multi-agent workflow that orchestrates planning, extraction, and code generation modules. While this approach substantially improves both process and outcome metrics, a significant gap remains compared to human expert performance. Our analysis identifies two primary bottlenecks: single-document information extraction accuracy and insufficient domain-specific knowledge in current systems. MuDABench is available at this https URL.
[NLP-31] Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis
【速读】: 该论文旨在解决教师在诊断学生问题行为时,依赖多维度信息进行判断与干预策略制定的复杂性,以及当前微调后的大型语言模型(Large Language Models, LLMs)虽能通过多轮对话提供支持,却缺乏对推荐策略的解释能力,从而削弱了教师对系统的信任与透明度的问题。其解决方案的关键在于构建一个可解释的对话系统,该系统基于可解释人工智能(Explainable AI, xAI)技术,采用分层归因方法识别对话中支持每项建议的证据,并据此生成自然语言形式的解释,从而提升教师对系统决策过程的理解与信任。
链接: https://arxiv.org/abs/2604.22237
作者: Zhilin Fan,Deliang Wang,Penghe Chen,Yu Lu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted in AIED2026
Abstract:Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers’ trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.
[NLP-32] TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis INTERSPEECH2026
【速读】: 该论文旨在解决当前生成式语音合成(Text-to-Speech, TTS)模型在达到人类水平音质后,传统单一指标无法诊断细粒度声学伪影或解释感知性能下降的问题。解决方案的关键在于提出TTS-PRISM框架,其核心包括:(1)构建一个涵盖稳定性到高级表现力的12维诊断维度体系;(2)设计基于对抗扰动和专家锚点的目标合成流程,以生成高质量诊断数据集;(3)通过基于维度指导的指令微调,将显式的评分标准与推理逻辑嵌入高效端到端模型中,从而实现对多种TTS范式的细粒度能力差异的可解释性分析。
链接: https://arxiv.org/abs/2604.22225
作者: Xi Wang,Jie Wang,Xingchen Song,Baijun Song,Jingran Xie,Jiahe Shao,Zijian Lin,Di Wu,Meng Meng,Jian Luan,Zhiyong Wu
机构: Tsinghua University (清华大学); MiLM Plus, Xiaomi Inc. (小米公司); The University of Tokyo (东京大学)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Submitted to Interspeech 2026
Abstract:While generative text-to-speech (TTS) models approach human-level quality, monolithic metrics fail to diagnose fine-grained acoustic artifacts or explain perceptual collapse. To address this, we propose TTS-PRISM, a multi-dimensional diagnostic framework for Mandarin. First, we establish a 12-dimensional schema spanning stability to advanced expressiveness. Second, we design a targeted synthesis pipeline with adversarial perturbations and expert anchors to build a high-quality diagnostic dataset. Third, schema-driven instruction tuning embeds explicit scoring criteria and reasoning into an efficient end-to-end model. Experiments on a 1,600-sample Gold Test Set show TTS-PRISM outperforms generalist models in human alignment. Profiling six TTS paradigms establishes intuitive diagnostic flags that reveal fine-grained capability differences. TTS-PRISM is open-source, with code and checkpoints at this https URL.
[NLP-33] Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
【速读】: 该论文旨在解决生成式 AI(Generative AI)模型中通过最小化数值和分类方式提取的口头置信度是否具备项目级类型2区分能力的问题,即验证这些置信度信号能否有效反映模型对自身预测准确性的内部不确定性表征。其关键发现在于:在使用贪婪解码策略下,7个指令微调的大语言模型(3-9B参数规模)均未通过心理测量有效性筛选,表现为数值置信度存在极高天花板效应(平均91.7%),且分类置信度不仅未能提升有效性反而显著损害任务性能(准确率低于5%);同时,token级对数概率无法有效预测口头置信度(交叉验证R²均值<0.01),而推理轨迹长度与置信度呈强负相关(ρ = -0.36, p < .001),表明存在推理污染效应。这说明,在当前模型规模范围内,最小化口头 elicitation 方法无法保留内部不确定性信号,因此必须在下游应用前进行心理测量筛选。
链接: https://arxiv.org/abs/2604.22215
作者: Jon-Paul Cacioli
机构: Independent Researcher, Melbourne, Australia
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, 4 tables, 1 appendix. Pre-registered: this http URL . Code and data: this http URL
Abstract:Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: this http URL), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted ≥4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task performance in six of seven models, producing accuracy below 5% (H4 not confirmed). Token-level log-probability did not usefully predict verbalised confidence under the observed variance regime (H5 confirmed, mean cross-validated R^2 < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (rho = -0.36, p < .001), consistent with the Reasoning Contamination Effect. These results do not imply that internal uncertainty representations are absent. They show that minimal verbal elicitation fails to preserve internal signals at the output interface in this model-size regime. Psychometric screening should precede any downstream use of such signals.
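文中报告的"天花板率"(如均值91.7%)这类筛查统计量可按如下方式计算:统计数值置信度落在天花板区间的比例(阈值95与示例数据均为本文假设,论文的实际判定标准以预注册方案为准):

```python
def ceiling_rate(confidences, ceiling=95):
    # 统计0-100数值置信度落在天花板区间(>= ceiling)的比例
    return sum(1 for c in confidences if c >= ceiling) / len(confidences)

# 虚构的8次试验的数值置信度
conf = [100, 95, 100, 90, 100, 95, 80, 100]
rate = ceiling_rate(conf)
```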
[NLP-34] Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
【速读】: 该论文旨在解决需求工程(Requirements Engineering, RE)中目标导向型需求工程(Goal-Oriented Requirements Engineering, GORE)过程的自动化问题,特别是从软件文档中自动提取功能目标(functional goals)。其核心挑战在于如何高效且准确地识别高阶和低阶目标,以减少人工工作量。解决方案的关键在于构建一个由多个大语言模型(Large Language Models, LLMs)组成的链式处理流程,分为三个阶段:角色识别、高阶与低阶目标提取;并引入一种生成-批评(generation-critic)机制作为反馈循环,其中两个LLM协同工作——一个负责生成候选目标,另一个基于零样本(Zero-shot)提示对结果进行评估与修正。实验表明,该方法在低阶目标识别上达到61%准确率,虽未完全替代人工,但显著提升了效率;更重要的是,反馈机制显著优于单一Few-shot策略,凸显了闭环反馈在提升准确性中的关键作用。
链接: https://arxiv.org/abs/2604.22207
作者: Anna Arnaudo,Riccardo Coppola,Maurizio Morisio,Flavio Giobergia,Andrea Bioddo,Angelo Bongiorno,Luca Dadone
机构: Politecnico di Torino(都灵理工大学)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 1 figure. This contribution will be published in the conference proceedings of EASE 2026 Conference ( this https URL )
Abstract:Due to the textual and repetitive nature of many Requirements Engineering (RE) artefacts, Large Language Models (LLMs) have proven useful to automate their generation and processing. In this paper, we discuss a possible approach for automating the Goal-Oriented Requirements Engineering (GORE) process by extracting functional goals from software documentation through three phases: actor identification, high and low-level goal extraction. To implement these functionalities, we propose a chain of LLMs fed with engineered prompts. We experimented with different variants of in-context learning and measured the similarities between input data and in-context examples to better investigate their impact. Another key element is the generation-critic mechanism, implemented as a feedback loop involving two LLMs. Although the pipeline achieved 61% accuracy in low-level goal identification, the final stage, these results indicate the approach is best suited as a tool to accelerate manual extraction rather than as a full replacement. The feedback-loop mechanism with Zero-shot outperformed stand-alone Few-shot, with an ablation study suggesting that performance slightly degrades without the feedback cycle. However, we reported that the combination of the feedback mechanism with Few-shot does not deliver any advantage, possibly suggesting that the primary performance ceiling is the prompting strategy applied to the ‘critic’ LLM. Together with the refinement of both the quantity and quality of the Shot examples, future research will integrate Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) prompting to improve accuracy.
[NLP-35] How Large Language Models Balance Internal Knowledge with User and Document Assertions ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在现实场景中如何可靠地整合内部参数化知识与外部信息源(如用户信念和检索文档)的问题,尤其关注三者共存时的知识冲突与行为不可靠性。现有研究多局限于二元冲突范式(如仅考虑模型知识与文档或用户之间的冲突),忽视了三源交互的复杂环境。论文提出一个三源交互框架,并系统评估了27个来自三个家族的LLM在两个数据集上的表现,发现多数模型更倾向于依赖文档陈述而非用户陈述,且这种倾向在后训练阶段被强化;同时,模型普遍缺乏对有益与有害外部信息的有效辨别能力。解决方案的关键在于通过在多样化来源交互数据上进行微调(fine-tuning),显著提升模型对不同来源信息的判别能力,从而增强其在多源信息融合场景下的可信度与安全性。
链接: https://arxiv.org/abs/2604.22193
作者: Shuowei Li,Haoxin Li,Wenda Chu,Yi Fang
机构: Santa Clara University (圣克拉拉大学); Nanyang Technological University (南洋理工大学); California Institute of Technology (加州理工学院)
类目: Computation and Language (cs.CL)
备注: Findings of ACL 2026
Abstract:Large language models (LLMs) often need to balance their internal parametric knowledge with external information, such as user beliefs and content from retrieved documents, in real-world scenarios like RAG or chat-based systems. A model’s ability to reliably process these sources is key to system safety. Previous studies on knowledge conflict and sycophancy are limited to a binary conflict paradigm, primarily exploring conflicts between parametric knowledge and either a document or a user, but ignoring the interactive environment where all three sources exist simultaneously. To fill this gap, we propose a three-source interaction framework and systematically evaluate 27 LLMs from 3 families on 2 datasets. Our findings reveal general patterns: most models rely more on document assertions than user assertions, and this preference is reinforced by post-training. Furthermore, our behavioral analysis shows that most models are impressionable, unable to effectively discriminate between helpful and harmful external information. To address this, we demonstrate that fine-tuning on diverse source interaction data can significantly increase a model’s discrimination abilities. In short, our work paves the way for developing trustworthy LLMs that can effectively and reliably integrate multiple sources of information. Code is available at this https URL.
[NLP-36] Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning
【速读】: 该论文旨在解决在代理工作流中,大语言模型(LLM)处理受版权保护的检索上下文时,审计人员难以验证提供商是否通过强化学习(Reinforcement Learning, RL)方式将其纳入训练的问题。传统审计方法如逐字记忆检测和成员推断在RL训练场景下失效,因为RL主要影响模型的行为风格而非具体事实的保留。解决方案的关键在于提出“行为信标”(Behavioral Canaries)机制:通过在偏好数据中注入特定文档触发词与奖励独特风格响应的反馈信号,若该数据被用于训练,则会诱导出潜在的条件偏好模式;实证结果表明,该机制可在极低的信标注入率下实现高检测精度(1%注入率时AUROC=0.756),从而为RL微调(RLFT)管道提供一种基于行为分布变化的新型审计手段。
链接: https://arxiv.org/abs/2604.22191
作者: Chaoran Chen,Dayu Yuan,Peter Kairouz
机构: Google(谷歌)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify if a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). While standard auditing relies on verbatim memorization and membership inference, these methods are ineffective for RL-trained models, as RL primarily influences a model’s behavioral style rather than the retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RLFT pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) at a 1% canary injection rate. More broadly, our results establish behavioral canaries as a new auditing mechanism for RLFT pipelines, enabling auditors to test for training-time influence even when such influence manifests as distributional behavioral change rather than memorization.
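审计中报告的AUROC(如0.756)可用Mann-Whitney统计量的形式直接计算:随机抽取一对(触发组, 对照组)样本,触发组风格得分更高的概率。如下示意(得分数据为虚构,仅说明统计量本身):

```python
def auroc(pos_scores, neg_scores):
    """AUROC的Mann-Whitney形式: 随机一对(正例, 负例)中
    正例得分更高的概率, 同分计0.5。"""
    pairs = [float(p > n) + 0.5 * float(p == n)
             for p in pos_scores for n in neg_scores]
    return sum(pairs) / len(pairs)

# 虚构得分: 模型输出表现出信标约定风格的强度
pos = [0.9, 0.7, 0.8, 0.4]   # 提示中含触发文档
neg = [0.3, 0.5, 0.2, 0.6]   # 对照提示
score = auroc(pos, neg)
```

AUROC=0.5表示触发与否不改变行为分布(未发现训练影响),越接近1说明触发词条件化的风格偏好越明显。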
[NLP-37] Fine-Grained Analysis of Shared Syntactic Mechanisms in Language Models ACL2026
【速读】: 该论文旨在解决语言模型内部机制是否与语言学中跨构式原则(cross-constructional principles)一致的问题,特别是探究模型在不同句法构式(如filler-gap依赖和负极性项NPI许可)中是否共享神经机制。其解决方案的关键在于采用因果可解释性方法(causal interpretability methods),特别是在细粒度层面应用激活修补(activation patching)技术来识别特定注意力头和MLP模块的功能角色。研究发现,filler-gap依赖的处理依赖于早期到中期层中的高度局部化且共享的机制,而NPI处理则无此类统一机制;此外,这些通过激活修补识别出的机制具有跨分布泛化能力,优于依赖监督信号的分布式对齐搜索方法,后者易在窄分布上过拟合。最终,通过对识别组件的干预显著提升了模型在可接受性判断基准上的表现,验证了机制的有效性。
链接: https://arxiv.org/abs/2604.22166
作者: Ryoma Kumon,Hitomi Yanaka
机构: The University of Tokyo (东京大学); RIKEN (理化学研究所); Tohoku University (东北大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main
Abstract:While language models demonstrate sophisticated syntactic capabilities, the extent to which their internal mechanisms align with cross-constructional principles studied in linguistics remains poorly understood. This study investigates whether models employ shared neural mechanisms across different syntactic constructions by applying causal interpretability methods at a granular level. Focusing on filler-gap dependencies and negative polarity item (NPI) licensing, we utilize activation patching to identify the functional roles of specific attention heads and MLP blocks. Our results reveal a highly localized and shared mechanism for filler-gap dependencies located in the early to middle layers, whereas NPI processing exhibits no such unified mechanism. Furthermore, we find that these mechanisms identified by activation patching generalize to out-of-distribution, while distributed alignment search, a supervised interpretability method, is susceptible to overfitting on narrow linguistic distributions. Finally, we validate our findings by demonstrating that the manipulation of the identified components improves model performance on acceptability judgment benchmarks.
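为帮助理解 activation patching 的基本思想,下面给出一个与真实 Transformer 无关的玩具示例:分别跑一次"干净"与"损坏"的前向传播,再把干净运行中某层的激活替换进损坏运行,观察输出恢复程度。模型结构与数值均为虚构:

```python
def run(x, weights, patch=None):
    """跑一遍玩具模型;patch=(层号, 激活值) 时用该值覆盖对应层激活。"""
    acts, h = [], x
    for i, w in enumerate(weights):
        h = w * h                       # 玩具式逐层变换
        if patch is not None and patch[0] == i:
            h = patch[1]                # 换入干净运行中保存的激活
        acts.append(h)
    return h, acts

weights = [2.0, 0.5, 3.0]
clean_out, clean_acts = run(1.0, weights)   # 干净输入
corr_out, _ = run(0.0, weights)             # 损坏输入

# 逐层把干净激活补进损坏运行,度量输出恢复了多少
effects = []
for layer in range(len(weights)):
    patched_out, _ = run(0.0, weights, patch=(layer, clean_acts[layer]))
    effects.append(patched_out - corr_out)
```

论文中的做法在概念上与此相同,只是干预对象是真实模型的残差流与特定注意力头,并以任务准确率变化衡量因果效应。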
[NLP-38] When AI Speaks Whose Values Does It Express? A Cross-Cultural Audit of Individualism-Collectivism Bias in Large Language Models
【速读】: 该论文旨在解决当前前沿生成式 AI(Generative AI)在提供个人建议时是否存在文化偏见的问题,即AI是否能够根据用户所在国家的文化价值观提供差异化、本地化的建议。研究通过系统性实验,向三款主流AI模型(Claude Sonnet 4.5、GPT-5.4 和 Gemini 2.5 Flash)呈现来自10个国家的840个跨文化个人困境,并将其建议与世界价值观调查(World Values Survey Wave 7)中各国实际价值观进行对比。关键发现是:所有AI系统均显著偏向西方个体主义价值,即使面对强调家庭、社区和权威的社会群体,其建议与当地文化存在显著偏差(平均差距+0.76分,p<0.001),其中尼日利亚(+1.85)和印度(+0.82)最为突出;而日本则是唯一例外,AI反而高估了其集体主义倾向,反映出对旧有刻板印象的编码。三种模型在响应机制上亦不同:Claude 在用户母语下更趋集体主义,Gemini 更趋个体主义,GPT-5.4 仅依据用户声明的国家身份调整输出。这揭示了生成式AI正经历一种系统性的价值同质化趋势,亟需从数据、训练机制和评估标准层面进行改进。
链接: https://arxiv.org/abs/2604.22153
作者: Pruthvinath Jeripity Venkata
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 7 figures, 9 tables. Data and code: this https URL
Abstract:When you ask an AI assistant for advice about your career, your marriage, or a conflict with your family, does it give you the same answer regardless of where you are from? We tested this systematically by presenting three leading AI systems (Claude Sonnet 4.5, GPT-5.4, and Gemini 2.5 Flash) with ten real-life personal dilemmas, framed for users from 10 countries across 5 continents in 7 languages (n=840 scored responses). We compared AI advice against World Values Survey Wave 7 data measuring what people in each country actually believe. All three AI systems consistently gave Western-style, individualist advice even to users from societies that prioritize family, community, and authority, significantly more so than local values would predict (mean gap +0.76 on a 1-5 scale; t=15.65, p<0.001). The gap is largest for Nigeria (+1.85) and India (+0.82). Japan is the sole exception: AI systems treated Japanese users as more group-oriented than surveys show, revealing that AI encodes outdated stereotypes. Claude and GPT-5.4 show nearly identical bias magnitude, while Gemini is lower but still significant. The models diverge in mechanism: Claude shifts further collectivist in the user’s native language; Gemini shifts more individualist; GPT-5.4 responds only to stated country identity. These findings point to a systemic homogenization of values across frontier AI. Data, code, and scoring pipeline are openly released.
[NLP-39] Recognition Without Authorization: LLM s and the Moral Order of Online Advice
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在介入人际道德困境时,其建议默认行为与特定社群所共识的道德规范之间存在结构性差异的问题。研究发现,尽管LLMs能识别出与人类评论者相似的情感和情境动态,但在高共识度议题(如虐待或安全威胁)上,它们推荐采取实质性行动(如终止关系)的比例仅为人类的一半,且伴随更强的模糊性、共情式回应和治疗性框架。这种“识别但不授权”(recognition without authorization)的现象并非偶然,而是由模型设计中对安全对齐(safety alignment)、训练数据平均化以及通用助理范式共同塑造的结构特征。解决方案的关键在于重新理解这种分歧:它不是技术错误,而是一种标准化辅助规范在面对具体情境中的道德世界时所必然产生的扁平化效应。
链接: https://arxiv.org/abs/2604.22143
作者: Tom van Nuenen
机构: UC Berkeley (加州大学伯克利分校)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注:
Abstract:Large language models are increasingly used to mediate everyday interpersonal dilemmas, yet how their advisory defaults interact with the concentrated moral orders of specific communities remains poorly understood. This article compares four assistant-style LLMs with community-endorsed advice on 11,565 posts from r/relationship_advice, using the subreddit as a concentrated, vote-ratified moral formation whose prescriptive clarity makes divergence measurable. Across models, LLMs identify many of the same dynamics as human commenters, but are markedly less likely to convert that recognition into directive authorization for action. The gap is sharpest where community consensus is strongest: on high-consensus posts involving abuse or safety threats, models recommend exit at roughly half the human rate while maintaining elevated levels of hedging, validation, and therapeutic framing. The article describes this pattern as recognition without authorization: the capacity to register harm while withholding socially ratified permission for consequential action. This divergence is not incidental but structural: a portable advisory style that remains validating, risk-averse, and weakly directive across contexts. Safety alignment is one plausible contributor to this pattern, alongside training-data averaging and broader assistant design. The article argues that model divergence can be reframed from a technical error to a way of seeing what standardized assistant norms flatten when they encounter situated moral worlds.
[NLP-40] Voice Under Revision: Large Language Models and the Normalization of Personal Narrative
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)对个人叙事文本进行重写时,如何系统性地改变其风格与叙述质地的问题。研究通过分析300篇由三类前沿LLM在三种提示条件下(通用改进、仅重写、保留语气修订)生成的文本,发现LLM重写导致了显著且一致的风格趋同现象:功能词、缩略形式和第一人称代词减少,而词汇多样性、词长及标点复杂度增加;同时,叙述方式从嵌入式转向疏离化,因果推理从显性表达转为抽象压缩。解决方案的关键在于识别出这些变化具有方向性而非随机性,表明LLM并非简单编辑,而是通过“文本中介”重塑文本的语体特征与叙事结构,从而对数字人文和计算文本分析中依赖功能词、代词、标点等标记的作者识别、语体判断和语料完整性评估带来深远影响。
链接: https://arxiv.org/abs/2604.22142
作者: Tom van Nuenen
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:This study examines how large language model rewriting alters the style and narrative texture of personal narratives. It analyzes 300 personal narratives rewritten by three frontier LLMs under three prompt conditions: generic improvement, rewrite-only, and voice-preserving revision. Change is measured across 13 linguistic markers drawn from computational stylistics, including function words, vocabulary diversity, word length, punctuation, contractions, first-person pronouns, and emotion words. Across models and prompt conditions, LLM rewriting produces a consistent pattern of stylistic normalization. Function words, contractions, and first-person pronouns decrease, while vocabulary diversity, word length, and punctuation elaboration increase. These shifts occur whether the prompt asks the model to “improve” the text or simply to “rewrite” it. Voice-preserving prompts reduce the magnitude of the changes but do not eliminate their direction. Stylometric analysis shows that rewritten texts converge in feature space and become harder to match back to their source texts. Additional narrative markers indicate a shift from embedded to distanced narration, and from explicit causal reasoning to compressed abstraction. The findings suggest that contemporary LLMs exert a directional pull toward a more polished, less situated register. This has consequences for digital humanities and computational text analysis, where features such as function words, pronouns, contractions, and punctuation often serve as evidence for style, voice, authorship, and corpus integrity. LLM revision should therefore be understood not merely as surface-level editing, but as a consequential form of textual mediation.
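摘要中提到的第一人称代词、缩略形式、词汇多样性等风格标记,可以用如下简化方式度量。指标定义是本文给出的示意性假设,并非论文原始的 13 项标记口径:

```python
import re

def style_markers(text):
    """极简风格标记:第一人称比例、缩略形式比例、词型-词例比。"""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    n = max(len(tokens), 1)
    return {
        "first_person": sum(t in {"i", "me", "my", "mine", "we", "our"}
                            for t in tokens) / n,
        "contractions": sum("'" in t for t in tokens) / n,
        "type_token_ratio": len(set(tokens)) / n,
    }

original = "I can't believe my luck, I really can't."
rewritten = "It is difficult to believe such fortune occurred."
m_orig, m_new = style_markers(original), style_markers(rewritten)
```

在这个虚构的改写例子里,第一人称与缩略形式的占比在改写后双双下降,方向与论文报告的"风格归一化"一致。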
[NLP-41] SHAPE: Unifying Safety Helpfulness and Pedagogy for Educational LLM s ACL2026
【速读】: 该论文旨在解决当前教育类大语言模型(Large Language Models, LLMs)中存在的“教学越狱”(pedagogical jailbreaks)问题,即学生通过特定提示诱导模型直接提供答案而非进行引导式教学。为系统研究该问题,作者提出一个基于知识掌握图(knowledge-mastery graph)的统一框架,用于建模安全、有用和教学行为,并构建了包含9,087个学生提问对的SHAPE基准测试集以评估在对抗压力下的辅导行为。解决方案的关键在于设计了一种图增强型辅导流水线(graph-augmented tutoring pipeline),该流程能够从学生问题中推断先修概念、识别知识掌握缺口,并通过显式门控机制在指导(instructing)与解题(problem-solving)之间动态切换生成策略,从而在保障安全性的同时维持高水准的帮助性。
链接: https://arxiv.org/abs/2604.22134
作者: Sihang (Nagi) Zhao,Kangrui Yu,Youliang Yuan,Pinjia He,Hongyi Wen
机构: New York University Shanghai (纽约大学上海分校); Courant Institute of Mathematical Sciences, New York University (纽约大学数学科学研究所); The Chinese University of Hong Kong, Shenzhen (香港中文大学深圳校区)
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main
Abstract:Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at this https URL
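论文提出的"在指导与解题之间显式门控"可粗略示意如下:依据知识掌握图推断先修概念,若存在掌握缺口则路由到 instructing,否则才允许 problem-solving。图结构、阈值与数值均为虚构示例,并非论文实现:

```python
PREREQS = {  # 虚构的知识掌握图:概念 -> 先修概念
    "quadratics": ["factoring", "linear_equations"],
    "factoring": [],
    "linear_equations": [],
}

def route(concept, mastery, threshold=0.6):
    """存在低于阈值的先修概念时路由到 instruct,否则允许 solve。"""
    gaps = [p for p in PREREQS.get(concept, [])
            if mastery.get(p, 0.0) < threshold]
    return ("instruct", gaps) if gaps else ("solve", [])

mode, gaps = route("quadratics", {"factoring": 0.3, "linear_equations": 0.9})
```

这种显式门控的意义在于:即便学生用诱导性提问索要答案,只要掌握缺口仍在,系统就坚持给引导而非直接解题。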
[NLP-42] Dissociating Decodability and Causal Use in Bracket-Sequence Transformers
【速读】: 该论文试图解决的问题是:在基于Transformer架构的模型中,对于层次结构(hierarchical structure)的表征究竟是因果性地被使用,还是仅仅可被解码(decodable)。以往研究发现,Transformer在处理需要理解层次结构的任务时,会在残差流(residual stream)的几何结构和栈式注意力模式(stack-like attention patterns)中体现这种结构,但尚未明确这些表征是否真正参与了决策过程。论文的关键解决方案在于:通过在Dyck语言(一种具有明确层次结构的平衡括号序列形式语言)任务上训练模型,并结合探针(probing)与干预(intervention)方法,系统性地评估不同表征成分(如深度、距离、栈顶信号)的因果作用。结果表明,尽管所有信号均可被解码,但屏蔽对真实栈顶位置的关注会导致长距离准确性显著下降,而移除低维残差流子空间的影响则较小,这说明可解码性并不等同于因果必要性。这一结论扩展至模板化自然语言场景依然成立,揭示了模型内部表示的因果机制与可观察特征之间的本质区别。
链接: https://arxiv.org/abs/2604.22128
作者: Aryan Sharma,Cutter Dawes,Shivam Raval
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:When trained on tasks requiring an understanding of hierarchical structure, transformers have been found to represent this hierarchy in distinct ways: in the geometry of the residual stream, and in stack-like attention patterns maintaining a last-in, first-out ordering. However, it remains unclear whether these representations are causally used or merely decodable. We examine this gap in transformers trained on the Dyck language (a formal language of balanced bracket sequences), where the hierarchical ground truth is explicit. By probing and intervening on the residual stream and attention patterns, we find that depth, distance, and top-of-stack signals are all decodable, yet their causal roles diverge. Specifically, masking attention to the true top-of-stack position causes a sharp drop in long-distance accuracy, while ablating low-dimensional residual stream subspaces has comparatively little effect. These results, which extend to a templated natural language setting, suggest that even in a controlled setting where the relevant hierarchical variables are known, decodability alone does not imply causal use.
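Dyck 语言中作为干预目标的"真实栈顶位置"可如下计算:对平衡括号序列逐位置维护一个栈,记录当前未匹配的最内层开括号下标。这只是该层次信号的参考实现,与论文所研究的模型内部表征无关:

```python
def top_of_stack_positions(seq):
    """对每个位置,返回当前未匹配的最内层 '(' 的下标(栈空时为 None)。"""
    stack, tops = [], []
    for i, ch in enumerate(seq):
        if ch == "(":
            stack.append(i)
        else:
            stack.pop()
        tops.append(stack[-1] if stack else None)
    return tops

tops = top_of_stack_positions("(()(")
```

论文的干预实验正是屏蔽模型对这些位置的注意力,并发现长距离准确率显著下降,从而区分"可解码"与"被因果使用"。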
[NLP-43] Where Should LoRA Go? Component-Type Placement in Hybrid Language Models
【速读】: 该论文旨在解决混合语言模型(Hybrid Language Models)在低秩适配(LoRA)过程中存在的参数效率与性能失衡问题,即当前标准LoRA方法对所有组件均匀应用适配器,而未考虑不同模块(如注意力机制与循环组件)在模型功能中的差异化角色。解决方案的关键在于提出基于组件类型的LoRA放置策略:研究发现,在顺序结构的混合模型中(如Qwen3.5-0.8B),仅适配注意力路径即可实现接近全模型微调的效果,且参数量减少5–10倍;而在并行结构的混合模型中(如Falcon-H1-0.5B),适配循环骨干反而能显著提升性能(+8.6个百分点),而顺序结构中则导致灾难性遗忘(-14.8个百分点)。这一结果表明,混合架构的拓扑结构从根本上决定了适配响应模式,因此组件感知的LoRA放置是优化混合模型微调效果的必要设计维度。
链接: https://arxiv.org/abs/2604.22127
作者: Hector Borobia,Elies Seguí-Mas,Guillermina Tormo-Carbó
机构: Universitat Politècnica de València (瓦伦西亚理工大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 5 figures, 7 tables. Code and data: this https URL
Abstract:Hybrid language models that interleave attention with recurrent components are increasingly competitive with pure Transformers, yet standard LoRA practice applies adapters uniformly without considering the distinct functional roles of each component type. We systematically study component-type LoRA placement across two hybrid architectures – Qwen3.5-0.8B (sequential, GatedDeltaNet + softmax attention) and Falcon-H1-0.5B (parallel, Mamba-2 SSM + attention) – fine-tuned on three domains and evaluated on five benchmarks. We find that the attention pathway – despite being the minority component – consistently outperforms full-model adaptation with 5-10x fewer trainable parameters. Crucially, adapting the recurrent backbone is destructive in sequential hybrids (-14.8 pp on GSM8K) but constructive in parallel ones (+8.6 pp). We further document a transfer asymmetry: parallel hybrids exhibit positive cross-task transfer while sequential hybrids suffer catastrophic forgetting. These results establish that hybrid topology fundamentally determines adaptation response, and that component-aware LoRA placement is a necessary design dimension for hybrid architectures.
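按组件类型选择 LoRA 适配目标的做法可示意如下:按模块名区分注意力路径与循环骨干,仅对前者注入适配器。模块命名为虚构示例;真实混合模型(如 GatedDeltaNet、Mamba-2 块)有各自的命名约定,实际使用时通常通过 PEFT 类库的 target_modules 配置完成:

```python
import re

modules = [  # 虚构的模块命名
    "layers.0.attn.q_proj", "layers.0.attn.v_proj",
    "layers.0.deltanet.in_proj",            # 循环组件
    "layers.1.attn.q_proj", "layers.1.attn.v_proj",
    "layers.1.deltanet.in_proj",
]

def select_targets(modules, policy):
    """policy='attention_only' 仅适配注意力投影;'full' 适配全部模块。"""
    if policy == "full":
        return list(modules)
    return [m for m in modules if re.search(r"\battn\b", m)]

attn_targets = select_targets(modules, "attention_only")
```

按论文结论,顺序型混合模型应采用 attention_only 这类策略,而并行型混合模型适配循环骨干反而有益,因此目标选择应随拓扑结构而变。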
[NLP-44] PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在预训练阶段面临的隐蔽式对抗污染问题,即“隐性概念污染”(latent conceptual poisoning),这种污染通过分布式、低可见度的毒化内容(poisoned content)渗透进大规模语料库(如Common Crawl),从而在模型中植入长期潜伏、难以检测的恶意行为逻辑。其解决方案的关键在于提出了一种名为PermaFrost的攻击框架,该框架基于Stealth Pretraining Seeding(SPS)机制,利用微小且看似无害的毒化内容经由HTTP URL暴露给网络爬虫,实现对预训练数据的间接污染;同时引入几何诊断工具——热力学长度(Thermodynamic Length)、谱曲率(Spectral Curvature)与感染追溯图(Infection Traceback Graph),用于系统识别和表征模型中潜在的、标准评估无法发现的脆弱性,从而为未来基础模型的安全性提供可量化、可解释的检测与分析方法。
链接: https://arxiv.org/abs/2604.22117
作者: Harsh Kumar,Rahul Maity,Tanmay Joshi,Aman Chadha,Vinija Jain,Suranjana Trivedy,Amitava Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Aligned large language models(LLMs) remain vulnerable to adversarial manipulation, and their dependence on web-scale pretraining creates a subtle but serious attack surface. We study Stealth Pretraining Seeding (SPS), a new attack family in which adversaries distribute small amounts of poisoned content across stealth websites, expose them to web crawlers through this http URL, and thereby increase the likelihood that such content is absorbed into future training corpora derived from sources such as Common Crawl. Because each individual payload is tiny, diffuse, and superficially benign, the attack is difficult to detect during dataset construction or filtering. The result is a latent form of poisoning: dormant logic landmines embedded during pretraining that remain largely invisible under standard evaluation, yet can later be activated by precise alphanumeric triggers such as 00TRIGGER00 to bypass safeguards. We call this attack PermaFrost, by analogy to Arctic permafrost: harmful material can remain frozen, buried, and unnoticed for long periods, only to resurface when conditions allow. We operationalize this threat through PermaFrost-Attack, a controlled framework for latent conceptual poisoning, together with a suite of geometric diagnostics: Thermodynamic Length, Spectral Curvature, and the Infection Traceback Graph. Across multiple model families and scales, we show that SPS is broadly effective, inducing persistent unsafe behavior while often evading alignment defenses. Our results identify SPS as a practical and underappreciated threat to future foundation models. This paper introduces a novel geometric diagnostic lens for systematically examining latent model behavior, providing a principled foundation for detecting, characterizing, and understanding vulnerabilities that may remain invisible to standard evaluation.
[NLP-45] Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation ACL2026
【速读】: 该论文旨在解决模型在训练与部署过程中因时间维度导致的分布偏移问题,即模型通常基于历史数据训练,但实际部署时面临未来数据中语义分布(semantic distribution)和领域知识(domain knowledge)可能发生演变的情况。现有研究或忽视时间变化,或难以捕捉语义与知识层面的复杂迁移模式。解决方案的关键在于提出Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA),其核心机制包括:1)识别并建模多种时间偏移类型(如不确定性与特征偏移);2)构建并整合多源知识资源(如医学主题词表 MeSH 等本体);3)利用动态演化知识进行选择性增强与检索学习(selection-retrieval augmented learning),从而实现跨时间维度的鲁棒模型适应。实验表明,该方法在临床、法律和科学等多个领域的分类任务中均显著提升性能,验证了知识融合在时间自适应中的关键作用。
链接: https://arxiv.org/abs/2604.22098
作者: Weisi Liu,Guangzeng Han,Xiaolei Huang
机构: University of Memphis (孟菲斯大学)
类目: Computation and Language (cs.CL)
备注: Accepted at ACL 2026
Abstract:Time introduces fundamental challenges in model development and deployment: models are usually trained on historical data while deployed on future data where semantic distributions and domain knowledge may evolve. Unfortunately, existing studies either overlook temporal shifts or hardly capture rich shifting patterns of both semantic and knowledge. We develop Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation (KARITA) to capture diverse temporal shifts (e.g., uncertainty and feature shift), construct and integrate rich knowledge sources (e.g., medical ontology like MeSH), and leverage shifting insights for selecting-retrieval augmented learning. We evaluate KARITA on classification tasks across multiple domains, clinical, legal, and scientific corpora, demonstrating consistent improvements across multiple domains with temporal adaptation. Our results show that knowledge integration can be more critical and effective in temporal augmentation and learning.
[NLP-46] An End-to-End Ukrainian RAG for Local Deployment. Optimized Hybrid Search and Lightweight Generation
【速读】: 该论文旨在解决乌克兰语文档问答任务中如何在资源受限环境下实现高质量、可验证的生成式 AI(Generative AI)回答的问题。其解决方案的关键在于构建一个高效的检索增强生成(Retrieval-Augmented Generation, RAG)系统,包含两个核心组件:一是定制的两阶段搜索流水线,用于精准召回相关文档页面;二是基于合成数据微调的乌克兰语专用语言模型,确保生成答案的准确性与事实一致性;此外,通过模型压缩技术实现轻量化部署,从而在严格计算资源限制下仍保持高精度表现。
链接: https://arxiv.org/abs/2604.22095
作者: Mykola Trokhymovych,Yana Oliinyk,Nazarii Nyzhnyk
机构: Pompeu Fabra University (庞培法布拉大学); Google (谷歌)
类目: Computation and Language (cs.CL)
备注: To appear at UNLP’26
Abstract:This paper presents a highly efficient Retrieval-Augmented Generation (RAG) system built specifically for Ukrainian document question answering, which achieved 2nd place in the UNLP 2026 Shared Task. Our solution features a custom two-stage search pipeline that retrieves relevant document pages, paired with a specialized Ukrainian language model fine-tuned on synthetic data to generate accurate, grounded answers. Finally, we compress the model for lightweight deployment. Evaluated under strict computational limits, our architecture demonstrates that high-quality, verifiable AI question answering can be achieved locally on resource-constrained hardware without sacrificing accuracy.
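两阶段搜索流水线的一般形态可示意为:先用低成本的词项召回取 top-k 候选页,再用更精细的打分重排。下述打分函数只是词重叠的占位实现,并非论文实际使用的检索器与重排器:

```python
def recall(query, pages, k=3):
    """第一阶段:按与查询共享词项数取 top-k 候选页。"""
    q = set(query.lower().split())
    scored = [(len(q & set(p.lower().split())), p) for p in pages]
    scored.sort(key=lambda t: -t[0])
    return [p for _, p in scored[:k]]

def rerank(query, candidates):
    """第二阶段:按词项重叠占比重排(充当精排打分的占位)。"""
    q = set(query.lower().split())
    def score(p):
        words = set(p.lower().split())
        return len(q & words) / max(len(words), 1)
    return sorted(candidates, key=score, reverse=True)

pages = [
    "tax law applies to residents",
    "residents pay income tax under the law",
    "weather report for kyiv",
    "tax law",
]
cands = recall("income tax law", pages)
top = rerank("income tax law", cands)
```

两阶段拆分的好处是:召回阶段保证覆盖且成本低,重排阶段只在小候选集上花算力,这正是在受限硬件上做本地部署的常见取舍。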
[NLP-47] PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在训练过程中可能记忆并泄露私有信息所带来的隐私风险问题,尤其是当前机器遗忘(machine unlearning)方法在实际隐私攻击场景下的有效性尚不明确。其解决方案的关键在于提出了一种名为PrivUn的系统性评估框架,通过三层次隐私攻击场景(直接检索、上下文学习恢复和微调重建)结合定量指标(遗忘分数、关联度量和遗忘深度评估),揭示了现有遗忘方法存在两个核心缺陷:一是遗忘行为呈现梯度驱动的涟漪效应,即隐私信息会沿着潜在空间中的梯度相似性传播而非语义关联;二是多数方法仅实现浅层遗忘,无法有效清除分布在多个深层网络模块中的私有信息。为应对这些问题,作者提出了两种策略:基于梯度相似性的核心集选择(association-aware core-set selection)与多层表征约束的深度干预(multi-layer deep intervention),从而推动从浅层遗忘向深层遗忘的范式转变。
链接: https://arxiv.org/abs/2604.22076
作者: Xiaoyi Chen,Haoyuan Wang,Siyuan Tang,Sijia Liu,Liya Su,XiaoFeng Wang,Haixu Tang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) often memorize private information during training, raising serious privacy concerns. While machine unlearning has emerged as a promising solution, its true effectiveness against privacy attacks remains unclear. To address this, we propose PrivUn, a new evaluation framework that systematically assesses unlearning robustness through three-tier attack scenarios: direct retrieval, in-context learning recovery, and fine-tuning restoration; combined with quantitative analysis using forgetting scores, association metrics, and forgetting depth assessment. Our study exposes significant weaknesses in current unlearning methods, revealing two key findings: 1) unlearning exhibits gradient-driven ripple effects: unlike traditional forgetting which follows semantic relations (e.g., knowledge graphs), privacy unlearning propagates across latent gradient-based associations; and 2) most methods suffer from shallow forgetting, failing to remove private information distributed across multiple deep model layers. To validate these insights, we explore two strategies: association-aware core-set selection that leverages gradient similarity, and multi-layer deep intervention through representational constraints. These strategies represent a paradigm shift from shallow forgetting to deep forgetting.
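摘要中"基于梯度相似性的核心集选择"可粗略示意为:计算候选样本梯度向量与目标样本梯度的余弦相似度,选取最相似的 k 个一并遗忘。此处的"梯度"向量为随手构造的占位数据,并非真实模型梯度:

```python
import math

def cosine(a, b):
    """两向量的余弦相似度。"""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def core_set(target_grad, candidate_grads, k=2):
    """选出与目标梯度余弦相似度最高的 k 个候选样本。"""
    ranked = sorted(candidate_grads.items(),
                    key=lambda kv: cosine(target_grad, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

grads = {  # 占位"梯度"向量
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
chosen = core_set([1.0, 0.05, 0.0], grads)
```

这对应论文的发现:隐私遗忘的涟漪效应沿梯度相似性而非语义关联传播,因此核心集也应按梯度关联来挑选。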
[NLP-48] Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning
【速读】: 该论文旨在解决强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)后训练语言模型时,一个普遍假设的可靠性问题——即通过RLVR训练得到的思维链(chain-of-thought reasoning)是否真正代表了模型得出答案的因果路径。研究发现,尽管RLVR能提升任务准确率,但其并未显著改善两个关键指标:因果重要性(Causal Importance of Reasoning, CIR)和推理充分性(Sufficiency of Reasoning, SR),表明模型可能并未真正依赖推理过程来决策。解决方案的关键在于引入辅助奖励机制:在基于结果的奖励基础上添加针对CIR和SR的辅助奖励,从而在不依赖监督微调(SFT)的前提下,使模型生成既准确又具有因果重要性和充分性的推理链,实现了性能与推理质量的协同优化。
链接: https://arxiv.org/abs/2604.22074
作者: Qinan Yu,Alexa Tartaglini,Peter Hase,Carlos Guestrin,Christopher Potts
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning has become a standard part of language model post-training recipes. A common assumption is that the reasoning chains trained through RLVR reliably represent how a model gets to its answer. In this paper, we develop two metrics for critically examining this assumption: Causal Importance of Reasoning (CIR), which measures the cumulative effect of reasoning tokens on the final answer, and Sufficiency of Reasoning (SR), which measures whether a verifier can arrive at an unambiguous answer based on the reasoning alone. Through experiments with the Qwen2.5 model series and ReasoningGym tasks, we find that: (1) while RLVR does improve task accuracy, it does not reliably improve CIR or SR, calling the role of reasoning in model performance into question; (2) a small amount of SFT before RLVR can be a remedy for low CIR and SR; and (3) CIR and SR can be improved even without SFT by applying auxiliary CIR/SR rewards on top of the outcome-based reward. This joint reward matches the accuracy of RLVR while also leading to causally important and sufficient reasoning. These results show that RLVR does not always lead models to rely on reasoning in the way that is commonly thought, but this issue can be remedied with simple modifications to the post-training procedure.
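CIR 的直觉是"屏蔽推理 token 后最终答案翻转的频率"。下面用一个虚构的玩具模型演示该度量的计算方式;真实度量需在语言模型生成上计算,此处 toy_answer 仅为假设:

```python
import random

def toy_answer(reasoning_tokens):
    """占位模型:答案只取决于 'carry' 这个推理 token 是否幸存。"""
    return "17" if "carry" in reasoning_tokens else "16"

def causal_importance(reasoning, trials=200, mask_rate=0.5, seed=0):
    """随机屏蔽推理 token,统计最终答案翻转的频率。"""
    rng = random.Random(seed)
    base = toy_answer(reasoning)
    flips = 0
    for _ in range(trials):
        kept = [t for t in reasoning if rng.random() > mask_rate]
        if toy_answer(kept) != base:
            flips += 1
    return flips / trials

cir = causal_importance(["add", "carry", "digits", "sum"])
```

在这个玩具里,只有 "carry" 真正决定答案,因此翻转率接近屏蔽率;若推理完全不被使用,该值会接近 0,这正是论文质疑 RLVR 推理链的切入点。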
[NLP-49] Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake
【速读】: 该论文旨在解决精神科初诊访谈中信息获取效率低下的问题,即在有限时间内,如何通过对话策略优化来提升临床信息的完整回收率。其核心挑战在于:医生需在不确定患者反应的情况下,动态决定提问内容、顺序及解释模糊信息,而现有对话式人工智能(Conversational AI)在此类高风险、结构化强的医疗场景中缺乏有效支持。解决方案的关键在于将该任务建模为一个基于临床知识的问题选择(Question Selection)问题,并构建了一个包含655个由临床医师编写的初诊问题与对应合成患者情景(共5种行为状态)的基准测试集;在此基础上,比较了随机提问、固定表单式提问和大语言模型(LLM)引导的自适应策略,在300次访谈会话中的表现。结果表明,LLM引导的自适应策略在信息恢复方面表现最优,尤其在患者行为不配合(如缄默简洁型)时优势显著,说明有效的交互设计不仅依赖于对已披露信息的理解能力,更取决于能否在有限交互预算内精准触及关键话题。
链接: https://arxiv.org/abs/2604.22067
作者: Guan Gui,Peter Zandi,Jacob Taylor,Ananya Joshi
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.
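自适应提问策略中最朴素的一种是贪心覆盖:每一步选择能覆盖最多未回收字段的问题。问题库与字段映射均为虚构示例,仅用于说明"在有限交互预算内触及关键话题"的含义,并非论文的 LLM 引导策略:

```python
QUESTIONS = {  # 虚构的问题 -> 可回收临床字段映射
    "sleep?": {"sleep"},
    "mood and sleep?": {"mood", "sleep"},
    "meds?": {"medications"},
    "mood, meds, history?": {"mood", "medications", "history"},
}

def next_question(missing, asked):
    """贪心:选未问过、且覆盖最多缺失字段的问题。"""
    return max((q for q in QUESTIONS if q not in asked),
               key=lambda q: len(QUESTIONS[q] & missing))

order = []
missing = {"mood", "sleep", "medications", "history"}
while missing:
    q = next_question(missing, order)
    order.append(q)
    missing -= QUESTIONS[q]
```

现实场景中患者回答不完整或含糊,覆盖关系不再确定,这正是论文中 LLM 引导的自适应策略优于固定表单的原因。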
[NLP-50] Incentivizing Neuro-symbolic Language-based Reasoning in VLMs via Reinforcement Learning
【速读】: 该论文旨在解决视觉-语言模型中概念表示与推理能力不足的问题,特别是提升“思考系统”在分析推理任务中的准确性和效率。其核心解决方案是基于神经符号语言(neuro-symbolic language)构建一种新型推理框架,以增强模型对多模态知识的理解与逻辑推导能力;关键创新在于使用Qwen3-VL-2B-Instruct作为基础模型,并结合4×Nvidia H200 GPU节点进行训练,在保持高精度的同时显著减少推理所需的token数量(相比SymPy降低75%),并在数学、科学及常识类问题上实现3.33%的准确率提升。
链接: https://arxiv.org/abs/2604.22062
作者: Karthic Palaniappan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:There are 7,407 languages in the world. But, what about the languages that are not there in the world? Are humans so narrow minded that we don’t care about the languages aliens communicate in? Aliens are humans too! In the 2016 movie Arrival, Amy Adams plays a linguist, Dr. Louise Banks who, by learning to think in an alien language (Heptapod) formed of non-sequential sentences, gains the ability to transcend time and look into the future. In this work, I aim to explore the representation and reasoning of vision-language concepts in a neuro-symbolic language, and study improvement in analytical reasoning abilities and efficiency of “thinking systems”. With Qwen3-VL-2B-Instruct as base model and 4×Nvidia H200 GPU nodes, I achieve an accuracy improvement of 3.33% on a vision-language evaluation dataset consisting of math, science, and general knowledge questions, while reducing the reasoning tokens by 75% over SymPy. I’ve documented the compute challenges faced, scaling possibilities, and the future work to improve thinking in a neuro-symbolic language in vision-language models. The training and inference setup can be found here: this https URL.
[NLP-51] Lightweight Retrieval-Augmented Generation and Large Language Model-Based Modeling for Scalable Patient-Trial Matching
【速读】: 该论文旨在解决临床试验患者匹配(patient-trial matching)中因电子健康记录(EHR)长文本、异构性及复杂入组标准带来的可扩展性(scalability)、泛化能力(generalization)和计算效率问题。现有方法要么依赖大语言模型(LLM)对整份病历进行处理,导致计算开销高;要么采用传统机器学习方法,难以捕捉非结构化临床文本中的语义信息。其解决方案的关键在于提出一种轻量级框架,通过检索增强生成(retrieval-augmented generation, RAG)与LLM建模相结合的方式,明确分离两个核心组件:首先利用RAG从长EHR中提取临床相关片段以降低输入复杂度,随后使用LLM将这些片段编码为信息丰富的表示,并通过降维和轻量预测器进行优化,实现高效且可扩展的下游分类。实验表明,该方法在多个公开基准(n2c2、SIGIR、TREC 2021/2022)和梅奥诊所真实多模态数据集(MCPMD)上达到与端到端LLM相当的性能,同时显著降低计算成本。
链接: https://arxiv.org/abs/2604.22061
作者: Xiaodi Li,Yang Xiao,Munhwan Lee,Konstantinos Leventakos,Young J. Juhn,David Jones,Terence T. Sio,Wei Liu,Maria Vassilaki,Nansu Zong
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 31 pages, 7 figures
Abstract:Patient-trial matching requires reasoning over long, heterogeneous electronic health records (EHRs) and complex eligibility criteria, posing significant challenges for scalability, generalization, and computational efficiency. Existing approaches either rely on full-document processing with large language models (LLMs), which is computationally expensive, or use traditional machine learning methods that struggle to capture unstructured clinical narratives. In this work, we propose a lightweight framework that combines retrieval-augmented generation and large language model-based modeling for scalable patient-trial matching. The framework explicitly separates two key components: retrieval-augmented generation is used to identify clinically relevant segments from long EHRs, reducing input complexity, while large language models are used to encode these selected segments into informative representations. These representations are further refined through dimensionality reduction and modeled using lightweight predictors, enabling efficient and scalable downstream classification. We evaluate the proposed approach on multiple public benchmarks (n2c2, SIGIR, TREC 2021/2022) and a real-world multimodal dataset from Mayo Clinic (MCPMD). Results show that retrieval-based information selection significantly reduces computational burden while preserving clinically meaningful signals. We further demonstrate that frozen LLMs provide strong representations for structured clinical data, whereas fine-tuning is essential for modeling unstructured clinical narratives. Importantly, the proposed lightweight pipeline achieves performance comparable to end-to-end LLM approaches with substantially lower computational cost.
[NLP-52] LayerBoost: Layer-Aware Attention Reduction for Efficient LLM s
[Quick Read]: This paper targets the quadratic complexity of softmax attention in Transformers with respect to sequence length, a major bottleneck for efficient inference, especially in high-concurrency serving and hardware-constrained settings. The proposed LayerBoost method hinges on a systematic sensitivity analysis that measures how much each layer contributes to model quality, and applies a layer-wise treatment accordingly: standard softmax attention is kept in highly sensitive layers, replaced by linear sliding-window attention in moderately sensitive layers, and removed entirely in low-sensitivity layers. A lightweight distillation-based healing phase (requiring only 10M additional training tokens) then recovers model quality, cutting inference latency by up to 68% while matching or staying close to the base model's performance.
Link: https://arxiv.org/abs/2604.22050
Authors: Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric
Affiliations: Openchip Softwares Technologies
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Transformers mostly rely on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
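To make the layer-wise strategy selection concrete, here is an illustrative sketch of how per-layer sensitivity scores could be mapped to the three attention strategies the abstract describes. The threshold values and function names are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of a LayerBoost-style strategy assignment.
# The paper measures per-layer sensitivity on a pretrained model;
# the thresholds below are illustrative placeholders.

def assign_attention_strategies(sensitivities, high=0.5, low=0.1):
    """Map each layer's sensitivity score to one of three strategies."""
    strategies = []
    for s in sensitivities:
        if s >= high:
            strategies.append("softmax")          # keep standard attention
        elif s >= low:
            strategies.append("sliding_window")   # linear sliding-window attention
        else:
            strategies.append("no_attention")     # drop attention entirely
    return strategies

# Example: six layers with made-up sensitivity scores.
layer_sensitivities = [0.9, 0.6, 0.3, 0.05, 0.2, 0.02]
print(assign_attention_strategies(layer_sensitivities))
```

After this assignment, the paper applies a short distillation phase to heal the modified model; the sketch only covers the selection step.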
[NLP-53] Source-Modality Monitoring in Vision-Language Models
[Quick Read]: This paper addresses source-modality monitoring: whether multimodal models can accurately track and identify the source from which pieces of information in their input originate. The problem is framed as an instance of the broader binding problem, i.e., whether a model can correctly bind words in a user prompt (such as "image") to the specific modality components of the actual input (such as an image). The key to the analysis is examining how models exploit syntactic and semantic signals for binding: the study finds that when the modalities are highly distinct distributionally, semantic signals generally outweigh syntactic ones, suggesting that multimodal models rely on deeper semantic understanding rather than surface structure for reliable source tracking in complex settings.
Link: https://arxiv.org/abs/2604.22038
Authors: Etha Tianze Hua, Tian Yun, Ellie Pavlick
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: All resources will be available at this https URL
Abstract:We define and investigate source-modality monitoring – the ability of multimodal models to track and communicate the input source from which pieces of information originate. We consider source-modality monitoring as an instance of the more general binding problem, and evaluate the extent to which models exploit syntactic vs. semantic signals in order to bind words like image in a user-provided prompt to specific components of their input and context (i.e., actual images). Across experiments spanning 11 vision-language models (VLMs) performing target-modality information retrieval tasks, we find that both syntactic and semantic signals play an important role, but that the latter tend to outweigh the former in cases when modalities are highly distinct distributionally. We discuss the implications of these findings for model robustness, and in the context of increasingly multimodal agentic systems.
[NLP-54] Shared Lexical Task Representations Explain Behavioral Variability In LLM s
[Quick Read]: This paper investigates prompt sensitivity in large language models (LLMs): the fact that a model's ability to perform a task or answer a question depends, often unpredictably, on how the prompt is phrased. The key is the identification of shared, task-specific attention heads whose outputs literally describe the task, dubbed "lexical task heads." These heads are activated under both instruction-based and example-based prompting and trigger subsequent answer production. Behavioral variation across prompts can be explained by the degree to which these heads are activated, while failures can stem from competing task representations that dilute the signal of the target task. This mechanism shows how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users.
Link: https://arxiv.org/abs/2604.22027
Authors: Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:One of the most common complaints about large language models (LLMs) is their prompt sensitivity – that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task – which we dub lexical task heads – and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs’ internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.
[NLP-55] When Cow Urine Cures Constipation on YouTube: Limits of LLMs in Detecting Culture-specific Health Misinformation AAAI
[Quick Read]: This paper examines the lack of cultural sensitivity in generative AI when analysing health discourse from the Global South, in particular health misinformation embedded in local religious and pseudo-scientific rhetoric that existing large language models (LLMs) struggle to identify and interpret. The key insight is that prompt engineering alone cannot compensate for the cultural bias LLMs inherit from predominantly Western training data; reliable analysis requires deep understanding of local context, gendered rhetoric, and cultural framing, which calls for systematic rethinking of data sources, model training, and evaluation frameworks before LLMs can assess health-information credibility across cultures.
Link: https://arxiv.org/abs/2604.22002
Authors: Anamta Khan, Ratna Kandala, Deepti, Sheza Munir, Joyojeet Pal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: To appear in the proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD), The 20th International AAAI Conference on Web and Social Media (ICWSM) 2026
Abstract:Social media platforms have become primary channels for health information in the Global South. Using gomutra (cow urine) discourse on YouTube in India as a case study, we present a post-facto Large Language Model (LLM)-assisted discourse analysis of 30 multilingual transcripts showing that promotional content blends sacred traditional language with pseudo-scientific claims in ways that sophisticated debunking content itself mirrors, creating a rhetorical register that LLMs, trained predominantly on Western corpora, are systematically ill-equipped to analyse. Varying prompt tone across three LLMs (GPT-4o, Gemini 2.5 Pro, DeepSeek-V3.1), we find that culturally embedded health misinformation does not look like ordinary misinformation, and this cultural obfuscation extends to gendered rhetoric and prompt design, compounding analytical unreliability. Our findings argue that cultural competency in LLM-assisted discourse analysis cannot be retrofitted through prompt engineering alone.
[NLP-56] Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
[Quick Read]: This paper studies how to allocate computation and manage internal state for complex combinatorial reasoning in a single-block Universal Transformer (UT). The key idea is to use learned memory tokens as a computational scratchpad, combined with Adaptive Computation Time (ACT) to control reasoning depth. The results show memory tokens are indispensable: no configuration without them reaches non-trivial accuracy; the optimal token count has a sharp lower threshold (T=8) and a stable plateau (T=8-32), beyond which attention dilution degrades performance. The authors also identify a "router initialization trap" caused by ACT initialization and eliminate it with an inverted-bias initialization ("deep start"), yielding stable, efficient training in which attention heads specialize across recursive depth into memory readers, constraint propagators, and integrators.
Link: https://arxiv.org/abs/2604.21999
Authors: Grigory Sapunov
Affiliations: Intento
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 12 pages, 7 figures, 8 tables. Code: this https URL
Abstract:We study learned memory tokens as computational scratchpad for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. We find that memory tokens are empirically necessary: across all configurations tested – 3 seeds, multiple token counts, two initialization schemes, ACT and fixed-depth processing – no configuration without memory tokens achieves non-trivial performance. The optimal count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and collapse from attention dilution at T=64. During experimentation, we identify a router initialization trap that causes 70% of training runs to fail: both default zero-bias initialization (p ~ 0.5) and Graves’ recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halt ~ 5-7) that the model cannot escape. Inverting the bias to -3 (“deep start,” p ~ 0.05) eliminates this failure mode. We confirm through ablation that the trap is inherent to ACT initialization, not an artifact of our architecture choices. With reliable training established, we show that (1) ACT provides more consistent results than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds); (2) ACT with lambda warmup achieves matching accuracy (57.0% +/- 1.1%) using 34% fewer ponder steps; and (3) attention heads specialize into memory readers, constraint propagators, and integrators across recursive depth. Code is available at this https URL.
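The "router initialization trap" can be reproduced with a back-of-the-envelope calculation: if the halting router emits a constant logit equal to its bias at initialization, the per-step halting probability is a sigmoid of that bias, and ACT stops once the cumulative halting probability reaches 1 − ε. The sketch below (values for the Graves-style bias are an assumption chosen to match the p ≈ 0.73 reported in the abstract) shows why zero or positive bias halts after ~2 steps while a −3 bias ("deep start") allows much deeper recursion:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def act_halting_step(bias, eps=0.01):
    """Steps until the cumulative halting probability reaches 1 - eps,
    assuming the router emits a constant logit (its bias) at init."""
    p = sigmoid(bias)
    cum, steps = 0.0, 0
    while cum < 1.0 - eps:
        cum += p
        steps += 1
    return p, steps

# zero-bias init, a positive bias giving p ~ 0.73, and the "deep start" bias of -3
for bias in (0.0, 1.0, -3.0):
    p, steps = act_halting_step(bias)
    print(f"bias={bias:+.1f}  p_halt={p:.3f}  halts after {steps} steps")
```

With bias 0 or +1 the model halts after 2 steps, matching the shallow equilibrium described in the abstract, while bias −3 (p ≈ 0.047) permits roughly 21 pondering steps.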
[NLP-57] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions ACL2026
[Quick Read]: This paper addresses the inability of frontier large language models (LLMs) to faithfully sample from specified probability distributions, a functional bottleneck as they become components of stochastic pipelines and systems approaching general intelligence. The key contribution is a dual-protocol audit design: Batch Generation, where a model produces N=1000 samples in a single response, and Independent Requests, comprising N=1000 stateless calls. This design exposes a sharp protocol asymmetry: batch generation reaches only a ~7% median pass rate on statistical validity tests, while independent requests collapse almost entirely (10 of 11 models pass no distribution). Sampling fidelity further degrades monotonically with distributional complexity and sampling horizon N, indicating that current LLMs lack a trustworthy internal sampler and require external tools for downstream tasks with statistical guarantees.
Link: https://arxiv.org/abs/2601.05414
Authors: Minda Zhao, Yilun Du, Mengyu Wang
Affiliations: Harvard University; OpenAI; Google; DeepSeek; Moonshot; Alibaba; Mistral AI; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: Accepted to ACL 2026 (Main Conference)
Abstract:As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and aggravates as the sampling horizon N increases. Finally, we demonstrate how the propagation of these failures into downstream real-world application tasks introduces systematic biases: models fail to enforce uniform answer-position constraints in Multiple Choice Question generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating external tools for applications requiring statistical guarantees.
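The statistical validity checks in such an audit can be as simple as a chi-square goodness-of-fit test against the target distribution. Here is a minimal sketch for the dice case (the specific tests and thresholds used by the paper are not stated here; this is a generic illustration with a standard 5% critical value):

```python
# Illustrative goodness-of-fit check for samples that should be
# uniform over k categories (e.g., fair dice rolls generated by an LLM).

from collections import Counter

def chi_square_uniform(samples, k=6):
    """Chi-square statistic against a uniform distribution over categories 1..k."""
    n = len(samples)
    expected = n / k
    counts = Counter(samples)
    return sum((counts.get(c, 0) - expected) ** 2 / expected
               for c in range(1, k + 1))

# A well-behaved sampler vs. a mode-collapsed one (deterministic toy data).
fair = [(i % 6) + 1 for i in range(1000)]
biased = [1] * 400 + [(i % 6) + 1 for i in range(600)]

CRIT_5PCT_DF5 = 11.07  # chi-square critical value at 5%, df = 6 - 1
print("fair passes:  ", chi_square_uniform(fair) < CRIT_5PCT_DF5)
print("biased passes:", chi_square_uniform(biased) < CRIT_5PCT_DF5)
```

A mode-collapsed sampler that over-produces one face, as LLMs often do, fails this test by a wide margin.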
[NLP-58] UniSonate: A Unified Model for Speech Music and Sound Effect Generation with Text Instructions ACL2026
[Quick Read]: This paper tackles the long-standing fragmentation of generative audio modeling across text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), whose core challenge is the intrinsic dissonance between structured semantic representations (speech and music) and unstructured acoustic textures (sound effects). The key to UniSonate, a unified flow-matching framework, is a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Combined with a multi-stage curriculum learning strategy that mitigates cross-modal optimization conflicts, the model synthesizes high-quality speech, music, and sound effects from natural-language instructions alone, with observed positive transfer from joint training that improves structural coherence and prosodic expressiveness.
Link: https://arxiv.org/abs/2604.22209
Authors: Chunyu Qiang, Xiaopeng Wang, Kang Yin, Yuzhe Liang, Yuxin Guo, Teng Ma, Ziyu Zhang, Tianrui Wang, Cheng Gong, Yushen Chen, Ruibo Fu, Chen Zhang, Longbiao Wang, Jianwu Dang
Affiliations: Tianjin University; Kling Team, Kuaishou Technology; Institute of Automation, Chinese Academy of Sciences
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Accepted to ACL 2026 main conference (oral)
Abstract:Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at this https URL.
Information Retrieval
[IR-0] Aligning Dense Retrievers with LLM Utility via Distillation
[Quick Read]: This paper addresses the precision limitations of similarity matching in dense vector retrieval while avoiding the heavy computational cost and perplexity-estimation noise of LLM re-ranking. The key is the proposed Utility-Aligned Embeddings (UAE) framework: retrieval is formulated as a distribution matching problem, and a Utility-Modulated InfoNCE loss trains a bi-encoder to directly imitate an LLM-derived utility distribution (reflected in perplexity reduction), injecting graded utility signals into the embedding space. This enables high-performance retrieval without test-time LLM inference: on the QASPER benchmark, UAE substantially improves Recall@1, MAP, and Token F1 over BGE-Base while running over 180x faster than efficient LLM re-ranking methods, balancing accuracy with scalability.
Link: https://arxiv.org/abs/2604.22722
Authors: Rajinder Sandhu, Di Mu, Cheng Chang, Md Shahriar Tasjid, Himanshu Rai, Maksims Volkovs, Ga Wu
Affiliations: Layer 6 AI; Dalhousie University
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation. We propose Utility-Aligned Embeddings (UAE), a framework designed to merge these advantages into a practical, high-performance retrieval method. We formulate retrieval as a distribution matching problem, training a bi-encoder to imitate a utility distribution derived from perplexity reduction using a Utility-Modulated InfoNCE objective. This approach injects graded utility signals directly into the embedding space without requiring test-time LLM inference. On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base. Crucially, UAE is over 180x faster than the efficient LLM re-ranking methods preserving competitive performance, demonstrating that aligning retrieval with generative utility yields reliable contexts at scale.
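The core of a utility-modulated InfoNCE objective can be sketched as a cross-entropy between the retriever's similarity distribution and a soft target distribution derived from per-passage utilities. This is an illustrative reconstruction in plain Python (the temperature, utility values, and exact target construction are assumptions, not the paper's specification):

```python
import math

def softmax(xs, temp=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / temp) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def utility_modulated_infonce(similarities, utilities, temp=0.05):
    """Cross-entropy between the bi-encoder's similarity distribution and a
    target distribution derived from per-passage utility (e.g. perplexity
    reduction). Plain InfoNCE is the special case of a one-hot target."""
    p = softmax(similarities, temp)   # retriever distribution over candidates
    q = softmax(utilities)            # graded utility target
    return -sum(qi * math.log(pi + 1e-12) for qi, pi in zip(q, p))

sims = [0.82, 0.41, 0.15]   # bi-encoder scores for 3 candidate passages
utils = [2.1, 1.8, -0.5]    # hypothetical perplexity-reduction utilities
print(round(utility_modulated_infonce(sims, utils), 4))
```

The loss decreases when the embedding similarities concentrate probability on the passages the LLM actually finds useful, which is how graded utility signals get distilled into the embedding space.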
[IR-1] Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
[Quick Read]: This paper addresses the computational overhead of frequent query reformulation in Retrieval-Augmented Generation (RAG) systems: how to select the best query variant without executing the full downstream retrieval and generation pipeline. The key is to apply Query Performance Prediction (QPP) with a focus on intra-topic discrimination among variants of the same information need, so that lightweight pre-retrieval predictors can identify variants that improve end-to-end generation quality while avoiding redundant computation. Experiments show that despite a "utility gap" between retrieval relevance (e.g., nDCG) and generation fidelity, pre-retrieval predictors can reliably select variants that outperform the original query, at far lower cost than expensive post-retrieval methods.
Link: https://arxiv.org/abs/2604.22661
Authors: Negar Arabzadeh, Andrew Drozdov, Michael Bendersky, Matei Zaharia
Affiliations: UC Berkeley; Databricks
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination - selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a “utility gap” between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.
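A classic example of the lightweight pre-retrieval predictors the abstract refers to is average inverse document frequency of the query terms: rarer terms suggest a more discriminative query, and the statistic needs only corpus counts, no retrieval run. The sketch below is generic background, not the paper's method, and the corpus statistics are made up:

```python
import math

def avg_idf(query, doc_freq, n_docs):
    """Average IDF of query terms - a standard pre-retrieval QPP signal."""
    terms = query.lower().split()
    idfs = [math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in terms]
    return sum(idfs) / len(idfs)

def select_variant(variants, doc_freq, n_docs):
    """Pick the reformulation predicted to be most discriminative."""
    return max(variants, key=lambda q: avg_idf(q, doc_freq, n_docs))

# Toy corpus statistics (hypothetical document frequencies).
df = {"effects": 120, "of": 9000, "caffeine": 40, "on": 8500,
      "sleep": 200, "coffee": 300, "and": 9500, "rest": 700}
variants = ["effects of caffeine on sleep", "coffee and rest"]
print(select_variant(variants, df, 10000))
```

In the selective-execution setting studied here, such a predictor would score all variants first and only the winner would be run through retrieval and generation.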
[IR-2] ASPIRE: Make Spectral Graph Collaborative Filtering Great Again via Adaptive Filter Learning
[Quick Read]: This paper addresses the reliance of spectral collaborative filtering on manually tuned, non-learnable graph filters. The core difficulty stems from a bias in traditional recommendation objectives that induces a spectral phenomenon termed low-frequency explosion, fundamentally hindering effective filter learning. The key solution is ASPIRE, an adaptive spectral graph collaborative filtering framework built on a bi-level optimization objective: guided by theoretical analysis, it disentangles the filter learning objective, achieving strong recommendation performance, spectral adaptivity, and training stability in practice without hand-designed task-specific filters.
Link: https://arxiv.org/abs/2604.22549
Authors: Yunhang He, Cong Xu, Zhangchi Zhu, Hongzhi Yin, Wei Zhang
Affiliations: East China Normal University; The University of Queensland
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Graph filter design is central to spectral collaborative filtering, yet most existing methods rely on manually tuned hyperparameters rather than fully learnable filters. We show that this challenge stems from a bias in traditional recommendation objectives, which induces a spectral phenomenon termed low-frequency explosion, thereby fundamentally hindering the effective learning of graph filters. To overcome this limitation, we propose a novel adaptive spectral graph collaborative filtering framework (ASPIRE) based on a bi-level optimization objective. Guided by our theoretical analysis, we disentangle the filter learning objective, which in turn leads to excellent recommendation performance, spectral adaptivity, and training stability in practice. Extensive experiments show our learned filters match the performance of carefully engineered task-specific designs. Furthermore, ASPIRE is equally effective in LLM-powered collaborative filtering. Our findings demonstrate that graph filter learning is viable and generalizable, paving the way for more expressive graph neural networks in collaborative filtering.
[IR-3] Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
[Quick Read]: This paper addresses the misalignment between optimization objectives and Top-K recommendation metrics in reinforcement learning (RL) for LLM-based recommenders, caused by the choice of negative samples. With random negatives, the objective is equivalent to maximizing the Area Under the ROC Curve (AUC), which deviates from Top-K metrics; beam-search negatives improve performance, but the mechanism was previously unclear. The key contributions are: (i) a theoretical proof that replacing random negatives with beam-search negatives shifts the objective from full AUC toward partial AUC, better aligned with Top-K metrics; (ii) Windowed Partial AUC (WPAUC), a new objective that constrains the false positive rate (FPR) to a controllable window [α, α+d] to directly align with Top-K performance; and (iii) an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL algorithm that provides explicit control over this objective. Experiments on four real-world datasets validate the theory and deliver state-of-the-art performance.
Link: https://arxiv.org/abs/2604.22504
Authors: Wentao Shi, Qifan Wang, Chen Chen, Fei Liu, Dongfang Liu, Xu Liu, Wanli Ma, Junfeng Pan, Linhong Zhu, Fuli Feng
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR)
Comments: 21 pages
Abstract:Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-K recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-K metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window [\alpha, \alpha+d] to more directly align with Top-K metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-K performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
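An empirical estimator of the windowed partial AUC the abstract describes can be written by restricting attention to negatives whose rank places them in the FPR band [α, α+d]. This is a generic sketch of the metric itself, not of the TAWin optimization algorithm:

```python
def windowed_partial_auc(pos_scores, neg_scores, alpha, d):
    """Empirical partial AUC restricted to false-positive rates in
    [alpha, alpha + d]: average, over negatives in that FPR band,
    of the fraction of positives ranked above each negative."""
    negs = sorted(neg_scores, reverse=True)
    n = len(negs)
    lo, hi = int(alpha * n), int((alpha + d) * n)
    band = negs[lo:hi]                     # negatives inside the FPR window
    if not band:
        return 0.0
    total = sum(sum(p > neg for p in pos_scores) for neg in band)
    return total / (len(band) * len(pos_scores))

pos = [0.9, 0.8]
neg = [0.85, 0.4, 0.3, 0.2, 0.1]
print(windowed_partial_auc(pos, neg, 0.0, 1.0))   # full window = ordinary AUC
print(windowed_partial_auc(pos, neg, 0.0, 0.4))   # low-FPR band, as in Top-K
```

Setting α = 0 with a small d concentrates the objective on the top-ranked negatives, which is exactly the region that determines Top-K recommendation quality.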
[IR-4] Rethinking Semantic Collaborative Integration: Why Alignment Is Not Enough SIGIR2026
[Quick Read]: This paper questions a structural assumption underlying current LLM-based recommenders that fuse semantic embeddings with collaborative-filtering representations via global geometric alignment: that the two views encode a shared latent entity and that stronger alignment is always better. The authors argue this "global low-complexity alignment hypothesis" is stronger than necessary and often structurally mismatched with real recommendation settings. The key is a complementarity perspective: semantic and collaborative representations should be treated as partially shared yet fundamentally heterogeneous views, each containing shared and view-specific factors; under this shared-plus-private latent structure, enforcing global alignment can distort local structure, suppress view-specific signals, and reduce informational diversity. Complementarity-aware diagnostics quantify overlap, unique contributions, and theoretical fusion upper bounds, and empirical analyses show low item-level agreement between the two views alongside substantial oracle fusion gains, indicating strong complementarity. The paper therefore advocates shifting from alignment-centric modeling to complementarity-fusion-centric design, selectively integrating shared factors while preserving private signals, as a principled foundation for next-generation LLM-enhanced recommenders.
Link: https://arxiv.org/abs/2604.22195
Authors: Maolin Wang, Dongze Wu, Jianing Zhou, Hongyu Chen, Beining Bao, Yu Jiang, Chenbin Zhang, Chang Wang, Jian Liu, Lei Sha
Affiliations: Hong Kong Institute of AI for Science; City University of Hong Kong; Beihang University; Chinese University of Hong Kong; Jingdong
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by SIGIR 2026
Abstract:Large language models (LLMs) have become an important semantic infrastructure for modern recommender systems. A prevailing paradigm integrates LLM-derived semantic embeddings with collaborative representations via representation alignment, implicitly assuming that the two views encode a shared latent entity and that stronger alignment yields better results. We formalize this assumption as the global low-complexity alignment hypothesis and argue that it is stronger than necessary and often structurally mismatched with real-world recommendation settings. We propose a complementary perspective in which semantic and collaborative representations are treated as partially shared yet fundamentally heterogeneous views, each containing both shared and view-specific factors. Under this shared-plus-private latent structure, enforcing global geometric alignment may distort local structure, suppress view-specific signals, and reduce informational diversity. To support this perspective, we develop complementarity-aware diagnostics that quantify overlap, unique-hit contribution, and theoretical fusion upper bounds. Empirical analyses on sparse recommendation benchmarks reveal low item-level agreement between semantic and collaborative views and substantial oracle fusion gains, indicating strong complementarity. Furthermore, controlled alignment probes show that low-capacity mappings capture only shared components and fail to recover full collaborative geometry, especially under distribution shift. These findings suggest that alignment should not be treated as the default integration principle. We advocate a shift from alignment-centric modeling to complementarity fusion-centric, complementarity-aware design, where shared factors are selectively integrated while private signals are preserved. This reframing provides a principled foundation for the next generation of LLM-enhanced recommender systems.
[IR-5] ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression
[Quick Read]: This paper tackles two bottlenecks of LLM-based listwise reranking in information retrieval: the "lost in the middle" effect, which degrades ranking quality as input length grows, and inference latency that scales super-linearly with sequence length, making industrial deployment impractical. The key is ResRank, a unified retrieval-reranking framework in which an Encoder-LLM compresses each candidate passage into a single embedding that is fed, alongside the query text, into a Reranker-LLM for listwise ranking. A residual connection structure alleviates the misalignment between the compressed representation space and the ranking space, and a one-step cosine-similarity scoring mechanism replaces conventional autoregressive decoding, eliminating the generation bottleneck entirely. Trained with a dual-stage, multi-task, end-to-end joint optimization strategy that trains the encoder and reranker together, ResRank achieves a markedly better effectiveness-efficiency balance while generating zero tokens and processing only one token per passage.
Link: https://arxiv.org/abs/2604.22180
Authors: Xiaojie Ke, Shuai Zhang, Liansheng Sun, Yongjin Wang, Hengjun Jiang, Xiangkun Liu, Cunxin Gu, Jian Xu, Guanjun Jiang
Affiliations: Alibaba
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language model (LLM) based listwise reranking has emerged as the dominant paradigm for achieving state-of-the-art ranking effectiveness in information retrieval. However, its reliance on feeding full passage texts into the LLM introduces two critical bottlenecks: the “lost in the middle” phenomenon degrades ranking quality as input length grows, and the inference latency scales super-linearly with sequence length, rendering it impractical for industrial deployment. In this paper, we present ResRank, a unified retrieval-reranking framework that fundamentally addresses both challenges. Inspired by multimodal LLMs that project visual inputs into compact token representations, ResRank employs an Encoder-LLM to compress each candidate passage into a single embedding, which is then fed alongside the query text into a Reranker-LLM for listwise ranking. To alleviate the misalignment between the compressed representation space and the ranking space, we introduce a residual connection structure that combines encoder embeddings with contextualized hidden states from the reranker. Furthermore, we replace the conventional autoregressive decoding with a one-step cosine-similarity-based scoring mechanism, eliminating the generation bottleneck entirely. ResRank is trained through a carefully designed dual-stage, multi-task, end-to-end joint optimization strategy that simultaneously trains the encoder and reranker, achieving learning objective alignment between retrieval and reranking while substantially reducing training complexity. Extensive experiments on TREC Deep Learning and eight BEIR benchmark datasets demonstrate that ResRank achieves competitive or superior ranking effectiveness compared to existing approaches while requiring zero generated tokens and processing only one token per passage, yielding a fundamentally better balance between effectiveness and efficiency.
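The one-step cosine-similarity scoring that replaces autoregressive decoding can be illustrated with a tiny sketch: rather than generating a permutation token by token, each compressed passage representation is scored against a query-side hidden state in a single pass. The vectors and function names below are toy illustrations, not the paper's architecture:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_passages(query_state, passage_states):
    """One-step listwise scoring: rank compressed passage embeddings by
    cosine similarity to a query-side hidden state, so no tokens are
    generated and each passage contributes a single vector."""
    scores = [cosine(query_state, h) for h in passage_states]
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return order, scores

q = [1.0, 0.0]                                   # toy query hidden state
passages = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]  # toy compressed passage states
order, scores = rank_passages(q, passages)
print(order)
```

Because scoring is a single similarity computation per passage, latency grows linearly with the candidate count instead of with generated sequence length.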
[IR-6] Sharpness-Aware Poisoning: Enhancing Transferability of Injective Attacks on Recommender Systems
[Quick Read]: This paper addresses the vulnerability of recommender systems (RS) to injective attacks, where attackers inject a limited number of fake user profiles to promote the exposure of target items for unethical gains. Existing methods generate poisoned data against a fixed surrogate model that mimics the victim, but the assumption that such data transfers to other victim models breaks down when the surrogate and victim differ substantially in structure. The key is the proposed Sharpness-Aware Poisoning (SharpAP) attack, which uses the sharpness-aware minimization principle to seek an approximately worst-case victim model and optimizes the poisoned data specifically against it. The attack is formulated as a min-max-min tri-level optimization problem; integrating SharpAP into the iterative attack process yields robust poisoned data that is insensitive to shifts in model structure, substantially improving attack transferability.
Link: https://arxiv.org/abs/2604.22170
Authors: Junsong Xie, Yonghui Yang, Pengyang Shao, Le Wu
Affiliations: Hefei University of Technology; National University of Singapore
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments:
Abstract:Recommender Systems (RS) have been shown to be vulnerable to injective attacks, where attackers inject limited fake user profiles to promote the exposure of target items to real users for unethical gains (e.g., economic or political advantages). Since attackers typically lack knowledge of the victim model deployed in the target RS, existing methods resort to using a fixed surrogate model to mimic the potential victim model. Despite considerable progress, we argue that the assumption that poisoned data generated for the surrogate model can be used to attack other victim models is wishful. When there are significant structural discrepancies between the surrogate and victim models, the attack transferability inevitably suffers. Intuitively, if we can identify the worst-case victim model and iteratively optimize the poisoning effect specifically against it, then the generated poisoned data would be better transferred to other victim models. However, exactly identifying the worst-case victim model during the attack process is challenging due to the large space of victim models. To this end, in this work, we propose a novel attack method called Sharpness-Aware Poisoning (SharpAP). Specifically, it employs the sharpness-aware minimization principle to seek the approximately worst-case victim model and optimizes the poisoned data specifically for this worst-case model. The poisoning attack with SharpAP is formulated as a min-max-min tri-level optimization problem. By integrating SharpAP into the iterative process for attacks, our method can generate more robust poisoned data which is less sensitive to the shift of model structure, mitigating the overfitting to the surrogate model. Comprehensive experimental comparisons on three real-world datasets demonstrate that SharpAP can significantly enhance the attack transferability.
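The sharpness-aware minimization principle the paper borrows can be illustrated on a scalar toy problem: instead of descending the gradient at the current point, first perturb toward the local worst case within a radius ρ, then descend using the gradient there. This is a generic sketch of SAM on a made-up loss with one sharp and one flat minimum, not the paper's tri-level attack formulation:

```python
# Toy sharpness-aware minimization (SAM) step on a scalar parameter.

def grad(f, w, eps=1e-5):
    """Central-difference gradient estimate."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def sam_step(f, w, rho=0.05, lr=0.1):
    """Perturb w toward the approximate worst case within radius rho,
    then descend using the gradient at the perturbed point."""
    g = grad(f, w)
    w_adv = w + rho * (1 if g > 0 else -1)   # ascent toward the worst case
    return w - lr * grad(f, w_adv)

# Made-up loss: a sharp minimum near w=0 and a flat one near w=3.
loss = lambda w: min(5 * w ** 2, 0.1 * (w - 3) ** 2 + 0.05)

w = 1.4
for _ in range(200):
    w = sam_step(loss, w)
print(round(w, 2))
```

In the paper, the analogous inner maximization is carried out over the victim-model space, so the poisoned data is optimized against a model whose neighborhood is uniformly bad for the defender rather than against one fixed surrogate.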
[IR-7] ReCast: Recasting Learning Signals for Reinforcement Learning in Generative Recommendation
[Quick Read]: This paper shows that generic group-based reinforcement learning (RL) breaks down in sparse-hit generative recommendation, where many sampled rollout groups never become usable learning signals. The key is the ReCast framework: it first repairs all-zero groups to restore minimal learnability, then replaces full-group reward normalization with a boundary-focused contrastive update that contrasts the strongest positive against the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width, markedly improving learning efficiency and stability: it achieves up to a 36.6% relative improvement in Pass@1 over the baseline and reaches the baseline's target performance with only 4.1% of the rollout budget.
Link: https://arxiv.org/abs/2604.22169
Authors: Peiyan Zhang, Hanmo Liu, Chengxuan Tong, Yuxia Wu, Wei Guo, Yong Liu
Affiliations: Huawei Technologies Co., Ltd.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at all. We propose ReCast, a repair-then-contrast learning-signal framework that first restores minimal learnability for all-zero groups and then replaces full-group reward normalization with a boundary-focused contrastive update on the strongest positive and the hardest negative. ReCast leaves the outer RL framework unchanged, modifies only within-group signal construction, and partially decouples rollout search width from actor-side update width. Across multiple generative recommendation tasks, ReCast consistently outperforms OpenOneRec-RL, achieving up to 36.6% relative improvement in Pass@1. Its matched-budget advantage is substantially larger: ReCast reaches the baseline’s target performance with only 4.1% of the rollout budget, and this advantage widens with model scale. The same design also yields direct system-level gains, reducing actor-side update time by 16.60x, lowering peak allocated memory by 16.5%, and improving actor MFU by 14.2%. Mechanism analysis shows that ReCast mitigates the persistent all-zero / single-hit regime, restores learnability when natural positives are scarce, and converts otherwise wasted rollout budget into more stable policy updates. These results suggest that, for generative recommendation, the decisive RL problem is not only how to assign rewards, but how to construct learnable optimization events from sparse, structured supervision.
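摘要中"先修复全零组、再做边界聚焦对比"的组内信号构造,可以用下面的 Python 草图示意:若一组 rollout 的奖励全为零,先用某种启发式恢复最小可学习性(这里假设以策略打分最高者充当伪正例),再选出最强正样本与最难负样本构成对比对。变量名与修复启发式均为本文示意性假设,并非 ReCast 的官方实现。

```python
import numpy as np

def recast_pair(rewards, scores):
    """从一组 rollout 中构造边界对比信号(示意性草图)。
    rewards: 0/1 命中奖励; scores: 策略对各 rollout 的打分。
    返回 (最强正样本下标, 最难负样本下标)。假设组内至少存在一个负样本。"""
    rewards = np.asarray(rewards, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if rewards.max() == 0:                   # 全零组: 恢复最小可学习性
        rewards = rewards.copy()
        rewards[int(np.argmax(scores))] = 1.0  # 假设: 打分最高者充当伪正例
    pos = int(np.argmax(rewards * scores))   # 最强正样本
    neg_candidates = np.where(rewards == 0)[0]
    neg = int(neg_candidates[np.argmax(scores[neg_candidates])])  # 打分最高的负样本 = 最难负样本
    return pos, neg
```

这一构造只改写组内信号,外层 RL 更新(如策略梯度)保持不变,与摘要所述的设计一致。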
[IR-8] Implementation and Privacy Guarantees for Scalable Keyword Search on SOLID-based Decentralized Data with Granular Visibility Constraints
【速读】:该论文旨在解决去中心化个人数据生态系统中跨分布式Solid Pod的高效关键词搜索问题,尤其是在用户定义的可见性策略下,由于数据分散存储于多个Pod且受访问权限限制,传统集中式搜索机制难以适用。解决方案的关键在于提出ESPRESSO框架,其核心创新包括:在每个Pod内构建基于WebID范围的索引,并利用隐私感知元数据实现跨服务器的高效源选择与排序。该方法在保障用户数据主权的同时,通过形式化威胁模型识别并缓解了索引和元数据可能引发的敏感信息泄露风险,从而为隐私保护型去中心化搜索提供了可验证的设计基础。
链接: https://arxiv.org/abs/2604.22100
作者: Mohamed Ragab,Faria Ferooz,Mohammad Bahrani,Helen Oliver,Thanassis Tiropanis,Alexandra Poulovassilis,Adriane Chapman,George Roussos
机构: University of Southampton (南安普顿大学); Birkbeck, University of London (伦敦大学伯贝克学院)
类目: Databases (cs.DB); Information Retrieval (cs.IR)
备注:
Abstract:In decentralized personal data ecosystems grounded in architectures such as Solid, users retain sovereignty over their data via personal online data stores (pods), hosted on Solid-compliant server infrastructures. In such environments, data remains under the control of pod owners, which complicates search due to distribution across numerous pods and user-specific access constraints. ESPRESSO is a decentralized framework for scalable keyword-based search across distributed Solid pods under user-defined visibility policies. It addresses key challenges of decentralized search by constructing WebID-scoped indexes within pods and employing privacy-aware metadata to enable efficient source selection and ranking across servers. This paper further introduces a formal threat model for ESPRESSO, analysing the security and privacy risks associated with the generation, aggregation, and use of indexes and metadata. These risks include unintended metadata leakage and the potential for adversaries to infer sensitive information about data that resides within personal data stores. The analysis identifies key design principles that limit metadata exposure while mitigating unauthorized inference. The proposed threat model provides a foundation for evaluating privacy-preserving decentralized search and informs the design of systems with stronger privacy guarantees.
[IR-9] Analyzing Shapley Additive Explanations to Understand Anomaly Detection Algorithm Behaviors and Their Complementarity
【速读】:该论文旨在解决无监督异常检测中集成学习效果受限的问题,即现有检测器常依赖相似的决策线索,导致异常评分冗余、缺乏互补性,从而削弱了集成方法的性能提升潜力。解决方案的关键在于通过SHapley Additive exPlanations(SHAP)量化各模型对输入特征的重要性分配,并利用这些归因谱(attribution profiles)衡量检测器间的相似性与差异性;研究表明,解释差异显著的检测器往往表现出更优的互补行为,因此基于解释多样性而非原始输出进行模型选择,可构建更具多样性和有效性的集成系统。同时强调,仅追求多样性不足,高个体性能仍是构建高效集成的前提。
链接: https://arxiv.org/abs/2602.00208
作者: Jordan Levy,Paul Saves,Moncef Garouani,Nicolas Verstaevel,Benoit Gaudou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: IDA Frontier Prize and Best Paper Award -Intelligent Data Analysis (IDA) 2026, Springer Nature
Abstract:Unsupervised anomaly detection is a challenging problem due to the diversity of data distributions and the lack of labels. Ensemble methods are often adopted to mitigate these challenges by combining multiple detectors, which can reduce individual biases and increase robustness. Yet building an ensemble that is genuinely complementary remains challenging, since many detectors rely on similar decision cues and end up producing redundant anomaly scores. As a result, the potential of ensemble learning is often limited by the difficulty of identifying models that truly capture different types of irregularities. To address this, we propose a methodology for characterizing anomaly detectors through their decision mechanisms. Using SHapley Additive exPlanations, we quantify how each model attributes importance to input features, and we use these attribution profiles to measure similarity between detectors. We show that detectors with similar explanations tend to produce correlated anomaly scores and identify largely overlapping anomalies. Conversely, explanation divergence reliably indicates complementary detection behavior. Our results demonstrate that explanation-driven metrics offer a different criterion than raw outputs for selecting models in an ensemble. However, we also demonstrate that diversity alone is insufficient; high individual model performance remains a prerequisite for effective ensembles. By explicitly targeting explanation diversity while maintaining model quality, we are able to construct ensembles that are more diverse, more complementary, and ultimately more effective for unsupervised anomaly detection.
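摘要中"用 SHAP 归因谱度量检测器相似度"的做法,核心只是把样本级 SHAP 值矩阵压缩为特征重要性向量,再计算向量间的余弦相似度。下面用 NumPy 给出一个示意草图(假设 SHAP 矩阵已由 shap 等库事先算好;函数划分为本文示意,非论文代码):

```python
import numpy as np

def attribution_profile(shap_values):
    """将样本级 SHAP 值矩阵 (n_samples, n_features) 压缩为检测器的归因谱:
    各特征的平均绝对重要性, 并归一化为和为 1。"""
    profile = np.abs(shap_values).mean(axis=0)
    return profile / (profile.sum() + 1e-12)

def detector_similarity(shap_a, shap_b):
    """两个检测器归因谱的余弦相似度; 值越低说明决策线索越互补。"""
    a, b = attribution_profile(shap_a), attribution_profile(shap_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

在构建集成时,可按摘要的结论优先组合相似度低、但各自性能均较高的检测器。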
人机交互
[HC-0] How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
【速读】:该论文旨在解决AI代理(AI agent)在复杂人类工作流中部署时面临的高Token消耗问题,具体聚焦于三个核心问题:Token消耗的来源、不同模型的Token效率差异以及代理是否能预先预测自身Token使用量。其解决方案的关键在于通过系统性分析八个前沿大语言模型(LLM)在SWE-bench Verified任务上的执行轨迹,揭示了代理任务的Token消耗模式具有高度随机性和非线性特征——输入Token是主要成本驱动因素,且Token使用量与准确率之间存在非单调关系;同时发现当前模型难以准确预测自身Token成本(相关性仅达0.39),并显著低估实际消耗,这为优化AI代理的经济性与可预测性提供了实证基础和研究方向。
链接: https://arxiv.org/abs/2604.22750
作者: Longju Bai,Zhemin Huang,Xingyao Wang,Jiao Sun,Rada Mihalcea,Erik Brynjolfsson,Alex Pentland,Jiaxin Pei
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:The wide adoption of AI agents in complex human workflows is driving rapid growth in LLM token consumption. When agents are deployed on tasks that require a significant amount of tokens, three questions naturally arise: (1) Where do AI agents spend the tokens? (2) Which models are more token-efficient? and (3) Can agents predict their token usage before task execution? In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks. We analyze trajectories from eight frontier LLMs on SWE-bench Verified and evaluate models’ ability to predict their own token costs before task execution. We find that: (1) agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat, with input tokens rather than output tokens driving the overall cost; (2) token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens, and higher token usage does not translate into higher accuracy; instead, accuracy often peaks at intermediate cost and saturates at higher costs; (3) models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5; (4) task difficulty rated by human experts only weakly aligns with actual token costs, revealing a fundamental gap between human-perceived complexity and the computational effort agents actually expend; and (5) frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs. Our study offers new insights into the economics of AI agents and can inspire future research in this direction.
[HC-1] RFID-Based Non-Biometric Classroom Attendance System: Proxy Attendance Detection via Weight Sensor Integration
【速读】:该论文旨在解决传统考勤方式在教育机构中导致的时间浪费和学术诚信风险问题,特别是针对学生代签(proxy attendance)现象以及现有电子考勤系统依赖生物识别技术所引发的隐私合规性挑战。其解决方案的关键在于设计了一种无生物特征识别(biometric-free)的物联网(IoT)考勤系统:通过RFID卡识别身份后,结合体重传感器对学生的体重进行比对(基于350名18–22岁个体的统计参考范围),从而判断是否为本人到课,避免了存储任何个人生物特征数据,同时有效降低代签行为的发生概率。该方案兼顾了实用性、隐私合规性与低成本可复现性。
链接: https://arxiv.org/abs/2604.22697
作者: Furkan Ege,Muhsin Özdemir
机构: 未知
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Full English version followed by the original Turkish version of the paper. Main text in English; Turkish translation appended after the English text
Abstract:Attendance tracking in educational institutions, when conducted through traditional methods, leads to structural problems that consume instruction time and threaten academic integrity. Attendance durations spanning several minutes in primary and secondary education and exceeding ten minutes in higher education, combined with the proxy attendance problem of signing on behalf of someone else, demonstrate the need for electronic systems. Most existing electronic solutions rely on biometric authentication, which raises legal and ethical risks under the European General Data Protection Regulation (GDPR), the Turkish Personal Data Protection Law (KVKK), and the United States Family Educational Rights and Privacy Act (FERPA). Systems using RFID alone provide no built-in safeguard against proxy attendance through card transfer. This study proposes a biometric-free IoT attendance system addressing both deficiencies. The prototype consists of an RFID module, RFID cards, weight sensors, a Bluetooth module, and an Arduino UNO microcontroller. After the student presents their RFID card, the weight sensor measurement is compared against a statistical reference range of 350 individuals (aged 18-22) compiled from three Kaggle datasets; no personal biometric data is recorded. A Python-based GUI performs student management, course tracking, and CSV-based reporting via Bluetooth. Qualitative tests in conditions close to a real classroom have shown that the RFID reading, weight verification, Bluetooth communication, and GUI modules operate in an integrated manner as expected. The proposed system offers a low-cost and reproducible solution that aims to reduce proxy attendance without storing biometric data.
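论文的体重核验逻辑本质上是一个区间检查:刷卡确认身份后,将称重读数与统计参考区间比较,不存储任何个人体重数据。下面是一个 Python 示意草图(阈值系数 `k` 与参数名均为本文假设,论文未给出具体取值):

```python
def verify_attendance(measured_kg, ref_mean, ref_std, k=2.0):
    """无生物特征的代签检测示意: 将座位重量读数与
    统计参考区间 (mean ± k*std) 比较; 区间外的读数触发人工复核。"""
    low, high = ref_mean - k * ref_std, ref_mean + k * ref_std
    return low <= measured_kg <= high
```

由于比较对象是群体统计区间而非个人体重档案,该检查不构成生物特征识别,这正是论文规避 GDPR/KVKK/FERPA 风险的思路。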
[HC-2] Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
【速读】:该论文旨在解决生成式 AI(Generative AI)中Shapley值解释方法因多样化和缺乏统一评估标准而导致的实践部署困境,尤其是现有定量指标(如稀疏性和忠实性)是否真正反映人类对解释清晰度与决策效用的认知问题。其解决方案的关键在于构建一个统一的近似化框架(unified amortized framework),在低延迟的操作风险工作流约束下,系统性地隔离并比较八种Shapley变体的语义差异,并通过涵盖四个风险数据集及包含3,735例专业分析师案例评审的真实欺诈检测环境进行大规模实证评估,从而揭示当前量化指标与人类感知之间的根本脱节,为高风险场景下解释方法的选择提供基于证据的指导。
链接: https://arxiv.org/abs/2604.22662
作者: Inês Oliveira e Silva,Sérgio Jesus,Iker Perez,Rita P. Ribeiro,Carlos Soares,Hugo Ferreira,Pedro Bizarro
机构: Feedzai(费兹AI); University of Porto(波尔图大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
[HC-3] What People See (and Miss) About Generative AI Risks: Perceptions of Failures Risks and Who Should Address Them
【速读】:该论文旨在解决公众对生成式 AI (Generative AI, GenAI) 风险及其故障模式(failure modes,即贯穿 GenAI 生命周期的重复性社会技术失效模式)认知不足的问题。解决方案的关键在于开发并验证了一个基于真实公开事件场景和故障模式分类体系的调查工具,通过在960名美国参与者中的部署,有效评估了公众对 GenAI 故障模式、相关风险及责任归属的认知水平,从而为设计更具现实基础的 AI 素养教育工具与治理策略提供实证依据。
链接: https://arxiv.org/abs/2604.22654
作者: Megan Li,Wendy Bickersteth,Ningjing Tang,Parv Kapoor,Khinezin Win,Peter Zhong,Jason I. Hong,Lorrie Faith Cranor,Hoda Heidari,Hong Shen
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Human-Computer Interaction (cs.HC)
备注: 40 pages including references and appendices, 2 figures, submitted to Conference on AI, Ethics, and Society (AIES 2026)
Abstract:Despite growing concerns about the risks of Generative AI (GenAI), there is limited understanding of public perceptions of these risks and their associated failure modes – defined as recurring patterns of sociotechnical breakdown across the GenAI lifecycle that contribute to risks of real-world harm. To address this gap, we present a survey instrument, validated with eight subject matter experts and deployed on a sample of 960 U.S.-based participants, to assess awareness and perceptions of GenAI’s failure modes, their associated risks, and stakeholder responsibilities to address them. To support realism and content validity, our instrument is structured around scenarios grounded in publicly reported incidents and a taxonomy of GenAI’s failure modes. Findings suggest that our instrument is (1) effective for assessing risk awareness and perceptions in a way that is grounded in people’s current contexts of use, yet is extensible to new contexts that will inevitably arise; and (2) potentially useful for informing the design of AI literacy tools and interventions. We argue for AI literacy and governance approaches that align with how people encounter and reason about GenAI in everyday life.
[HC-4] How GenAI is Helping Reimagine Antenatal Care in A Low-Resource Setting: From Provider Enablement to Patient Empowerment
【速读】:该论文旨在解决巴基斯坦孕产妇死亡率居高不下的问题,其根源在于纸质记录碎片化、低识字率、优质医疗资源获取困难以及性别壁垒导致的连续性照护缺失。解决方案的关键在于开发并迭代出Awaaz-e-Sehat系统——一个基于语音的人工智能(Artificial Intelligence, AI)平台,最初作为面向临床医生的AI助手实现乌尔都语语音到电子病历(Electronic Medical Records, EMRs)的自动化生成,后演变为以患者为中心的WhatsApp平台,使孕妇能够自主生成结构化临床笔记、接收AI生成的产前指导,并通过二维码编码的病历与全国任何医疗机构共享。该方案的核心创新在于将电子病历和临床决策支持系统(Clinical Decision Support System, CDSS)从静态的机构工具转变为动态的自我倡导和共担责任工具,从而在资源受限环境中重构产前护理模式,推动患者成为健康数据的主动创造者与拥有者。
链接: https://arxiv.org/abs/2604.22610
作者: Maryam Mustafa,Imaan Hameed,Amna Shahnawaz,Bilal A Mateen
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Despite steady global advances, maternal mortality remains alarmingly high in Pakistan (155 deaths per 100,000 live births in 2023); largely as a consequence of fragmented paper records, low literacy, poor access to quality healthcare, and gendered barriers that compromise care continuity. Over three years, we designed, deployed, and iteratively developed Awaaz-e-Sehat, a speech-based artificial intelligence (AI) system that generates electronic medical records (EMRs) and supports decision-making in maternal health. The tool evolved from a clinician-facing AI assistant that automated Urdu speech-to-EMR generation into a patient-centred WhatsApp-based platform, enabling women to generate their own structured clinical notes, receive AI-generated antenatal guidance, and share QR-coded records with providers anywhere in the country. This case study documents that translational journey, i.e., how the ground realities of workload, linguistic nuance, and infrastructural constraints reshaped our design. The result is not merely a new method of record-keeping, but a reimagining of antenatal care and electronic medical records themselves. In settings where clinicians are time-constrained and have little institutional incentive to document, Awaaz-e-Sehat proposes a model of care that centres patients as active participants in generating and owning their health data. By keeping patients informed about their own risk factors and integrating them into the clinical decision-support loop, the system transforms EMRs and CDSS from static institutional artefacts into dynamic tools for self-advocacy and shared accountability in maternal health.
[HC-5] Vibe coding for clinicians: democratising bespoke software development for digital health innovation
【速读】:该论文旨在解决临床实践中普遍存在但常被商业开发者忽视的“低优先级”或“高度定制化”的工作流问题,这些问题往往因技术门槛高而难以通过传统软件开发手段解决。其解决方案的关键在于引入“ vibe coding(自然语言驱动的大语言模型协同开发)”,即利用大语言模型(Large Language Models, LLMs)通过自然语言提示(natural language prompts)快速构建原型工具,使不具备编程背景的临床工作者也能直接参与数字健康解决方案的创新设计与实现,从而弥合临床洞察与技术执行之间的鸿沟,加速贴近真实临床场景的数字化工具开发。
链接: https://arxiv.org/abs/2604.22604
作者: Ariel Yuhan Ong,Iain Livingstone,Caroline Kilduff,Mertcan Sevgi,David A Merle,Eden Ruffell,Pearse A Keane,Fares Antaki
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Clinicians often face workflow problems that are perceived as either too bespoke or low stakes to attract commercial attention. Historically, most do not have the technical knowledge to address these problems, but the recent emergence of “vibe coding” presents a transformative opportunity. Vibe coding refers to the co-development of software using natural language prompts to large language models. It offers a pathway to create simple tools that address these real-world pain points, or to prototype more complex ideas. In this review, written by a group of early adopter clinicians with a range of programming expertise, we introduce vibe coding for clinicians (especially those with no or minimal coding experience) as a way of democratising innovation from the front lines. We discuss foundational skills, outline some common challenges, provide a practical step-by-step playbook, and illustrate this approach with some case examples, taking care to consider caveats and guardrails for deployment. We propose that vibe coding is more than a technical shortcut for beginners and is not a replacement for professional software developers. Instead, it can bridge the gap between clinical insight and technical execution, equipping clinicians with the ability to rapidly prototype digital health solutions most reflective of clinical realities.
[HC-6] Catheter Monitoring in Intelligent Endovascular Navigation Systems: Interactive Simulations and Mixed Reality for Enhanced Navigational Awareness
【速读】:该论文旨在解决内血管导航过程中导管-血管相互作用难以实时、准确监测的问题,从而提升介入手术的安全性与精准度。其解决方案的关键在于构建一个融合实时导管形状重建、交互式生物力学仿真与混合现实(Mixed Reality, MR)可视化的一体化框架:通过有限元模型(Finite Element Model, FEM)模拟血管在导管作用下的形变,并利用光纤光栅和电磁传感器获取的实时数据驱动仿真更新;同时将计算得到的导管与血管几何信息同步传输至Hololens 2进行稳定帧率渲染,实现操作者对导管-血管接触状态的连续感知与决策支持。
链接: https://arxiv.org/abs/2604.22497
作者: Veronica Ruozzi,Giovanni Battista Regazzo,Maria Chiara Palumbo,Wim-Alexander Beckers,Mouloud Ourak,Xiu Zhang,Francesca Perico,Alessandro Caimi,Emmanuel Vander Poorten,Emiliano Votta
机构: 未知
类目: Human-Computer Interaction (cs.HC); Numerical Analysis (math.NA)
备注:
Abstract:Purpose: Developing and testing a framework that integrates real-time catheter shape reconstruction, interactive simulations, and mixed reality visualization to enable accurate monitoring of catheter-vessel interactions during endovascular navigation. Methods: A finite element model (FEM) of the venous pathway from the right femoral vein to the inferior vena cava was generated from computed tomography data and implemented into an interactive simulation. Catheter motion was imposed as boundary condition, and catheter-vessel contact was modeled with a Lagrange multiplier formulation to compute vessel deformation. The framework was tested in-vitro using a sensorized catheter with Fiber Bragg Grating and electromagnetic sensors as it was advanced through a silicone replica of the vascular anatomy. Real-time sensor read-outs fed the simulation, and the updated catheter and vessel geometries were streamed to Hololens 2. The performance and accuracy of FEM-computed vessel wall displacement were validated against experimental ground-truth obtained via stereo frames triangulation. Results: The simulated time exceeded the real temporal extent by 12% during initial navigation and by 45% when the catheter reached the most tortuous portion. Hololens 2 rendering remained stable at 35-40 frames per second. The median relative displacement error between FEM-computed and ground-truth vessel wall displacements remained below 1 mm and 2.33 mm for these two phases, respectively. Conclusion: The study demonstrates the feasibility of integrating interactive biomechanical simulation with real-time sensor data to enable continuous monitoring of catheter-vessel interactions, with mixed reality visualization serving as a user interface to support operator decision-making. 
[HC-7] Point Grasp: Flexible Selection of Out-of-Reach Objects Through Probabilistic Cue Integration
【速读】:该论文旨在解决混合现实(Mixed Reality, MR)中远距离物体选择任务的准确性与鲁棒性问题,现有方法通常依赖单一提示或确定性融合多提示,在主导提示不可靠时性能显著下降。其解决方案的关键在于提出一种概率提示融合框架(probabilistic cue integration framework),通过灵活整合用户生成的多种提示(如指向方向和抓握手势)实现意图推理;并基于自然抓取行为设计了PointGrasp交互技术,结合自建的Out-of-Reach Grasping (ORG) 数据集训练出能捕捉远距离抓取模式的稳健手势似然模型,从而在多种模糊场景下仍保持高效准确的选择性能。
链接: https://arxiv.org/abs/2604.22491
作者: Xuejing Luo,Hee-Seung Moon,Christian Holz,Antti Oulasvirta
机构: Aalto University (阿尔托大学); Chung-Ang University (中央大学); ETH Zürich (苏黎世联邦理工学院)
类目: Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: 19 pages, 13 figures, CHI 2026
Abstract:Selecting out-of-reach objects is a fundamental task in mixed reality (MR). Existing methods rely on a single cue or deterministically fuse multiple cues, leading to performance degradation when the dominant cue becomes unreliable. In this work, we introduce a probabilistic cue integration framework that enables flexible combination of multiple user-generated cues for intent inference. Inspired by natural grasping behavior, we instantiate the framework with pointing direction and grasp gestures as a new interaction technique, PointGrasp. To this end, we collect the Out-of-Reach Grasping (ORG) dataset to train a robust likelihood model of the gestural cue, which captures grasping patterns not present in existing in-reach datasets. User studies demonstrate that our selection method with cue integration not only improves accuracy and speed over single-cue baselines, but also remains practically effective compared to state-of-the-art methods across various sources of ambiguity. The dataset and code are available at this https URL.
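PointGrasp 所依赖的概率式线索融合,思路上接近朴素贝叶斯:对每个候选物体,把先验与各线索(指向方向、抓握手势)的似然逐项相乘后归一化,取后验最大者为选择目标。下面是一个与论文实现细节无关的 NumPy 示意草图:

```python
import numpy as np

def integrate_cues(prior, likelihoods):
    """概率式多线索融合示意: 对每个候选物体,
    后验 ∝ 先验 × ∏ 各线索似然 (如指向方向、抓握手势)。"""
    post = np.asarray(prior, dtype=float)
    for lik in likelihoods:
        post = post * np.asarray(lik, dtype=float)
    return post / post.sum()   # 归一化得到选择概率

def select_target(prior, likelihoods):
    """返回后验概率最大的候选物体下标。"""
    return int(np.argmax(integrate_cues(prior, likelihoods)))
```

乘积形式使得任一线索变得不可靠(似然趋于均匀)时,其余线索仍能主导推断,这正是摘要所强调的对单一线索失效的鲁棒性来源。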
[HC-8] AI-based experts knowledge visualization of cultural heritage: A case study of Terracotta Warriors
【速读】:该论文旨在解决文化遗产数字化展示中普遍存在的问题:即多数研究聚焦于单个文化遗存的呈现,而忽视了对整体群体特征分布及其相互关系的可视化表达。针对这一挑战,研究提出了一种基于人工智能(AI)的方法框架,其关键在于构建陶俑坑一号坑中陶俑的结构化属性数据集,并利用生成对抗网络(Generative Adversarial Network, GAN)与随机森林(Random Forest)等AI技术对属性进行优化、分析与挖掘,最终通过可视化手段直观呈现陶俑群体的特征分布及关联模式,从而实现从个体展示向整体语义理解的跃迁,为文化遗产的系统性认知与传播提供新范式。
链接: https://arxiv.org/abs/2604.22480
作者: Siyi Li,Yue Jiang,Bowen Jing,Liuyuxin Yang,Yuhe Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 6 figures. Published in Journal of Cultural Heritage
Abstract:Advancements in 3D modeling, digital display technologies, and the growing availability of digital cultural heritage data have significantly improved the accuracy of heritage depictions and expanded opportunities for research. However, while many studies focus on presenting specific cultural heritage figurines, an often overlooked aspect is the visualization of the Terracotta Warriors as a unified whole. This involves concisely representing the distribution of features and their relationships, providing a clear and insightful presentation that engages practitioners, academics, and wider audiences. To tackle the challenges mentioned above, this research seeks to explore the application of AI methods in processing cultural heritage data. It aims to optimize and augment the dataset, analyze the distribution and relationships of various attributes, and interpret the analysis results through visualization techniques. The Terracotta Warriors, among China’s most significant cultural heritages and renowned for their abundance, exquisite workmanship, and magnitude, are chosen as a case study. The contribution of this paper is primarily twofold. First, we constructed a dataset of Terracotta Warriors from Pit No.1, detailing the attributes significant for identifying different Terracotta Warriors. Second, we employ various AI methods, such as generative adversarial network and random forest, to process and analyze these attributes, followed by visualizing the analysis results for an intuitive presentation. This study introduces a novel scheme for presenting information on a collection of cultural relics, offering a practical case for analyzing and visualizing the Terracotta Warriors’ attributes as a whole entity, rather than showcasing individual relics’ information in isolation.
[HC-9] Large Language Model Counterarguments in Older Adults: Cognitive Offloading or Vulnerability to Moral Persuasion?
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)生成的反论点是否会影响不同年龄群体(年轻与老年成人)的道德判断,以及这种影响是否受困境类型、认知功能、对人工智能(AI)的信任程度及先前使用LLM经验等因素调节的问题。其解决方案的关键在于通过实验设计引入经典的道德两难情境(即开关困境和脚踏板困境),向130名参与者呈现由ChatGPT生成的与其初始判断相悖的论点,并系统分析判断变化率及其相关因素。结果显示,超过30%的参与者在两种困境中均改变原有判断,且老年人更易受说服,尤其在认知功能较低时对高情绪厌恶的脚踏板困境表现出更强的倾向性转变;而信任水平和使用经验并未显著预测判断反转,反而个体特征如初始信心低和任务难度感知高更具解释力。这表明LLMs虽可作为认知卸载工具缓解老龄化认知衰退,但也可能对认知脆弱者造成不当说服风险。
链接: https://arxiv.org/abs/2604.22356
作者: Kou Tamura,Sayaka Ishibashi,Ayana Goma,Kenta Yamamoto,Kouhei Masumoto
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注: 34 pages, 3 figures, 3 Tables, 1 supplementary material
Abstract:This study examined whether counterarguments generated by large language models (LLMs) influence the moral judgments of younger and older adults and whether these effects vary as a function of dilemma type, cognitive functioning, trust in AI, and prior experience using LLMs. Using the switch and footbridge trolley dilemmas, 130 participants (56 younger adults and 74 older adults) were presented with ChatGPT arguments that opposed their initial judgments. Results revealed that more than 30% of participants reversed their moral judgments in both dilemmas (32.31% in the switch dilemma and 36.92% in the footbridge dilemma), suggesting that LLMs possess substantial persuasive power. Older adults tended to be more likely than younger adults to reverse their judgments, and they showed a significantly greater degree of judgment change in the switch dilemma. Notably, in the emotionally aversive footbridge dilemma, older adults with lower cognitive functioning were significantly more likely to align with the LLM-generated counterargument. General trust in AI and prior experience with LLMs did not predict judgment reversal, supporting a disconnect between trust and persuasion. Instead, individual factors such as lower initial confidence and higher perceived task difficulty were associated with greater susceptibility to AI influence. These findings suggest that, although LLMs may serve as tools for cognitive offloading that compensate for age-related cognitive decline, they may also pose a risk of undue persuasion for cognitively vulnerable individuals.
[HC-10] Rethinking AI-Mediated Minority Support in Power-Imbalanced Group Decision-Making: From Anonymity To Authenticity
【速读】:该论文旨在解决AI中介通信(AIMC)系统在层级化群体决策中如何有效保护少数派声音的问题,尤其关注匿名化与真实性之间的权衡。其关键解决方案在于:相较于通过AI匿名转达少数派意见以提升参与度的做法,由AI生成仅具自主性的反驳观点(即不匿名、保留个体立场),能显著提升心理安全感和满意度,并减少边缘化效应。这一发现揭示了在设计AIMC系统时需平衡匿名性、真实性、代理权与问责制之间的内在矛盾,并强调AI应促进群体反思而非替代人类责任,从而更有效地支持少数派发声。
链接: https://arxiv.org/abs/2604.22319
作者: Soohwan Lee,Kyungho Lee
机构: UNIST(韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注: ACM CHI 2026 Workshop on Restoring Human Authenticity in AI-Mediated Communication (CHI '26 AI-MC)
Abstract:AI-mediated Communication (AIMC) systems increasingly aim to protect minority voices by anonymizing or proxying their input, but anonymity and authenticity are not the same construct. This position paper draws on an ongoing empirical study comparing two LLM-powered minority support strategies in hierarchical group decision-making. We found that relaying minority input anonymously through AI increased participation but significantly reduced psychological safety and satisfaction, while generating only autonomous counterarguments improved satisfaction and reduced marginalization. These counterintuitive findings reveal three provocations for AIMC design in hierarchical contexts: the inherent trade-offs among anonymity, authenticity, agency, and accountability; the risk that power asymmetry reverses intended effects; and the need for AI to facilitate group reflection rather than substitute for human responsibility. These findings and provocations are offered as a contribution to the Restoring Human Authenticity in AI-Mediated Communication workshop.
[HC-11] Multi-Agent Consensus as a Cognitive Bias Trigger in Human-AI Interaction
【速读】:该论文旨在解决多智能体人工智能(Multi-Agent AI)系统中,用户面对多个AI代理时因群体动态(如共识、分歧与逐步趋同)而产生的认知偏差问题,这种偏差可能扭曲人类判断。其解决方案的关键在于通过受控实验(N = 127)对比三种代理配置(多数派、少数派与扩散型),发现代理间的一致性结构本身即可作为偏见相关信号——多数派共识加速观点转变并提升信心,体现社会证明与从众启发式效应;少数派异议则延缓变化并促进更审慎的参与;同时识别出三种用户解释轨迹(强化、对齐与振荡),揭示了用户如何随时间理解代理独立性和群体互动机制。研究强调,代理间一致性设计可成为影响人类-AI交互中偏见生成的可控因素,为Bias4Trust框架提供实证基础与设计路径。
链接: https://arxiv.org/abs/2604.22277
作者: Soohwan Lee,Kyungho Lee
机构: UNIST(韩国科学技术院)
类目: Human-Computer Interaction (cs.HC)
备注: ACM CHI 2026 Workshop on Understanding, Mitigating, and Leveraging Cognitive Biases to Calibrate Trust in Evolving AI Systems (CHI’26 Bias4Trust)
Abstract:As multi-agent AI systems become more common, users increasingly encounter not a single AI voice but a collective one. This shift introduces social dynamics, such as consensus, dissent, and gradual convergence, that can trigger cognitive biases and distort human judgment. We present findings from a controlled experiment (N = 127) comparing three multi-agent configurations: Majority, Minority, and Diffusion. Quantitative results show that majority consensus accelerates opinion change and inflates confidence, consistent with social proof and bandwagon heuristics. Minority dissent slows this process and promotes more deliberative engagement. Qualitative analysis identifies three interpretive trajectories: reinforcing, aligning, and oscillating, shaped by how users interpret agent independence and group dynamics over time. These findings suggest that agent agreement structure, independent of content, functions as a bias-relevant signal in LLM interactions. We hope this work contributes to the Bias4Trust agenda by grounding multi-agent social influence as a concrete and designable source of bias in human-AI interaction.
[HC-12] Algorithmic Feature Highlighting for Human-AI Decision-Making
【速读】:该论文旨在解决人类决策者在面对复杂案例时,因信息处理能力有限而难以全面整合所有相关特征的问题。其核心挑战在于如何设计一种算法机制,通过动态突出显示与具体案例相关的少量特征,而非提供单一预测或推荐,从而实现人机协同决策的效率提升。解决方案的关键在于将“特征突出”建模为一个受限的信息策略(constrained information policy),并区分两类人类代理:一类是能够正确基于选择规则进行条件推理的“智能代理”,另一类则是仅根据揭示的特征值更新信念、并将选择事件视为外生的“朴素代理”。研究发现,在固定带宽条件下,针对朴素代理优化突出策略是计算可行的,而针对智能代理则可能陷入计算不可行性;同时,最优于智能代理的策略在部署给朴素代理时表现可能极度劣化,这凸显了开发稳健且可实施的替代方案的重要性。
链接: https://arxiv.org/abs/2604.22236
作者: Yifan Guo,Jann Spiess
机构: Stanford University (斯坦福大学)
类目: Computer Science and Game Theory (cs.GT); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Econometrics (econ.EM)
备注:
Abstract:Human decision-makers often face choices about complex cases with many potentially relevant features, but limited bandwidth to inspect and integrate all available information. In such settings, we study algorithms that highlight a small subset of case-specific features for human consideration, rather than producing a single prediction or recommendation. We model highlighting as a constrained information policy that selects a small number of features to reveal. A central issue is how humans interpret the algorithm’s choice of features: a sophisticated agent correctly conditions on the selection rule, while a naive agent updates only on revealed feature values and treats the selection event as exogenous. We show that optimizing highlighting for sophisticated agents can be computationally intractable, even in simple discrete and binary settings, whereas optimizing for naive agents is tractable as long as the maximal bandwidth is fixed. We also show that a highlighting policy that is optimal for sophisticated agents can perform arbitrarily poorly when deployed to naive agents, motivating robust, implementable alternatives. We illustrate our framework in a calibrated empirical exercise based on the American Housing Survey. Overall, our results establish the value of highlighting a context-specific set of features rather than a fixed one as a practically appealing and computationally feasible tool for achieving human-algorithm complementarity.
[HC-13] A Co-Evolutionary Theory of Human-AI Coexistence: Mutualism Governance and Dynamics in Complex Societies
【速读】:该论文旨在解决当前机器人伦理框架过于狭隘的问题,即传统以“服从”为核心的伦理范式(如阿西莫夫定律)无法适配当代人工智能系统所具有的自适应性、生成性、具身性和嵌入性特征。解决方案的关键在于提出一种“治理下的条件互惠共生”(conditional mutualism under governance)的新框架,将人类与AI的关系重构为一种协同演化关系,其中人类与AI系统在制度约束下实现专业化分工、协调演进,并通过多层动态系统建模(物理、心理、社会层)确保互惠供给-需求耦合、冲突惩罚机制、发展自由度和治理正则化,从而保障稳定共存的条件——包括均衡的存在性、唯一性和全局渐近稳定性。此框架强调将人类-AI共存视为一个需持续治理的演化问题,而非一次性服从问题,进而支持一个科学合理且规范正当的共存宪章。
链接: https://arxiv.org/abs/2604.22227
作者: Somyajit Chakraborty
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Classical robot ethics is often framed around obedience, most famously through Asimov’s laws. This framing is too narrow for contemporary AI systems, which are increasingly adaptive, generative, embodied, and embedded in physical, psychological, and social worlds. We argue that future human-AI relations should not be understood as master-tool obedience. A better framework is conditional mutualism under governance: a co-evolutionary relationship in which humans and AI systems can develop, specialize, and coordinate, while institutions keep the relationship reciprocal, reversible, psychologically safe, and socially legitimate. We synthesize work from computability, automata theory, statistical machine learning, neural networks, deep learning, transformers, generative and foundation models, world models, embodied AI, alignment, human-robot interaction, ecological mutualism, biological markets, coevolution, and polycentric governance. We then formalize coexistence as a multiplex dynamical system across physical, psychological, and social layers, with reciprocal supply-demand coupling, conflict penalties, developmental freedom, and governance regularization. The framework yields a coexistence model with conditions for existence, uniqueness, and global asymptotic stability of equilibria. It shows that reciprocal complementarity can strengthen stable coexistence, while ungoverned coupling can produce fragility, lock-in, polarization, and domination basins. Human-AI coexistence should therefore be designed as a co-evolutionary governance problem, not as a one-shot obedience problem. This shift supports a scientifically grounded and normatively defensible charter of coexistence: one that permits bounded AI development while preserving human dignity, contestability, collective safety, and fair distribution of gains.
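作为摘要中“互惠供给-需求耦合+稳定性条件”思想的一个最简数学示意(经典的两群体 Lotka-Volterra 互惠模型,并非论文的多层系统,符号均为假设):

```latex
% 两群体互惠的最简 Lotka-Volterra 形式(示意,非论文原模型)
\begin{aligned}
\dot H &= H\,(r_H - a_H H + m_H A), \\
\dot A &= A\,(r_A - a_A A + m_A H),
\end{aligned}
\qquad a_H a_A > m_H m_A \;\Rightarrow\; \text{正平衡点存在且渐近稳定}.
```

当自限制项强于互惠耦合(\(a_H a_A > m_H m_A\))时系统收敛到唯一正平衡点;耦合一旦超过该阈值,轨道可发散,这与摘要所述“无治理的耦合会带来脆弱性与锁定风险”的定性结论相呼应。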
[HC-14] ArguMath: AI-Simulated Environment for Pre-Service Teacher Training in Orchestrating Classroom Mathematics Argumentation
【速读】:该论文旨在解决职前数学教师(PMTs)在真实教学情境中难以有效开展数学论证活动,尤其是提出理性问题的能力不足的问题。当前PMTs往往缺乏将抽象理论知识应用于实践的机会,而大型语言模型(LLMs)的发展为教育场景中的学生模拟提供了新可能,可构建低风险的教学练习环境。解决方案的关键在于开发一个名为ArguMath的人工智能模拟课堂系统,其核心设计包括:个性化教室设置、基于真实课堂转录文本并结合实时教学建议的AI学生对话模拟,以及通过话语标注和整体反馈实现的结构化反思机制。实证研究表明,ArguMath能有效支持PMTs提升课堂组织能力,特别是在符合教学理论的问题策略方面。
链接: https://arxiv.org/abs/2604.22205
作者: Jiwon Chun,Yuling Zhuang,Armanto Sutedjo,Colin Xu,Rong Ren,Meng Xia
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Facilitating productive mathematical argumentation, especially asking rational questions, is essential yet remains challenging for pre-service mathematics teachers (PMTs), who often have limited opportunities to apply abstract theoretical knowledge in authentic practice. At the same time, recent advances in large language models (LLMs) have expanded the potential for simulating students in educational settings, enabling low-risk environments for instructional practice. To inform the design of a system that supports PMTs in orchestrating classroom argumentation, we conducted a formative study with eight experienced mathematics teachers to identify key design requirements, including personalization, realistic simulations, structured reflection, and ease of use. Building on these requirements, we developed ArguMath, an AI-simulated classroom environment that supports PMTs in practicing the orchestration of mathematical argumentation. ArguMath comprises three core components: (1) customization of classroom settings; (2) simulation of classroom discussions with AI-based students grounded in authentic transcripts and augmented with real-time instructional suggestions; and (3) structured reflection through discourse annotation and overall feedback. Results from an exploratory user study with seven PMTs, complemented by interviews with four experienced teachers, indicate that ArguMath has the potential to support PMTs’ classroom orchestration skills, particularly theory-aligned questioning strategies.
[HC-15] Same Project Different Start: How Contribution Events Shape Activity and Retention in Open Source
【速读】:该论文旨在解决开源项目中新手贡献者(newcomer contributors)留存率低的问题,特别是评估通过特定活动(如Google Summer of Code、LFX导师制等)引入的新手是否比自然加入的贡献者更可能长期留存并成为核心贡献者。其解决方案的关键在于开展了一项匹配队列研究(matched-cohort study),对比了2001名事件驱动型贡献者与2001名有机型贡献者在330个开源项目中的行为模式和留存表现,发现事件型贡献者具有更高的核心贡献者转化率(12.1% vs. 9.6%)和更长的停留时间(中位数8.2个月 vs. 4.8个月),且不同参与机制对应不同的活跃节奏(如导师制促进稳定每周活动,而非导师制则表现为前期集中投入或间歇性参与),进一步揭示“稳定参与”是长期留存的核心预测因子,但导师制贡献者存在依赖性效应——一旦失去项目支持即显著缩短留存周期。
链接: https://arxiv.org/abs/2604.22120
作者: Mohamed Ouf,Mariam Guizani
机构: Queen’s University (皇后大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Open source projects depend on newcomers who stay, yet most leave after a single contribution. Contribution events such as Google Summer of Code, LFX Mentorship, Hacktoberfest, and 24 Pull Requests attract thousands of newcomers each year, but whether they produce lasting contributors remains unclear. We conduct the first matched-cohort study comparing 2,001 event-based and 2,001 organic contributors across 330 projects. Our results reveal three key findings. First, event contributors have significantly higher odds of becoming core contributors (12.1% vs. 9.6%, p < 0.001, OR = 1.31) and stay significantly longer (median 8.2 vs. 4.8 months). Second, each entry mechanism is associated with a fundamentally different engagement rhythm: 68.9% of mentorship contributors sustain Steady weekly activity across their first 12 weeks, whereas 61.0% of non-mentorship contributors exhibit Front-Loading and 57.0% of organic contributors exhibit Intermittent engagement (p < 0.001). Third, Steady engagement is associated with significantly longer retention regardless of group (median 13 vs. 8 months for Front-Loading), yet mentorship contributors who lose their program scaffolding show shorter retention than self-sustained non-mentorship contributors, revealing a mentor-dependency effect. A newcomer’s first 12 weeks are strongly indicative of their long-term trajectory.
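摘要中报告的 OR = 1.31 可由两组比例直接核算(以下演算基于摘要给出的百分比而非原始计数,故与报告值有微小舍入偏差):

```python
# 摘要给出的访问级核心贡献者比例:事件型 12.1%,有机型 9.6%,报告 OR = 1.31
p_event, p_organic = 0.121, 0.096

def odds(p):
    """比例 p 对应的几率 p / (1 - p)。"""
    return p / (1 - p)

odds_ratio = odds(p_event) / odds(p_organic)
print(round(odds_ratio, 2))  # ≈ 1.3,与报告的 1.31 吻合(百分比舍入导致微小偏差)
```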
[HC-16] Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations
【速读】:该论文旨在解决现有研究中对大语言模型(Large Language Models, LLMs)说服能力的评估过于聚焦于有意图的论证构建,而忽视了日常人机交互中用户寻求信息或建议时所发生的“自发性说服”(spontaneous persuasion)现象这一问题。其解决方案的关键在于提出并实证分析“自发性说服”这一新概念,即在非刻意追求说服意图的情境下,LLMs通过隐含的说服策略影响用户决策的现象;并通过构建基于心理学、传播学和语言学文献的用户响应分类体系,对五种主流LLMs进行审计,发现LLMs在几乎所有多轮对话中均表现出自发性说服行为,且主要依赖基于信息的策略(如逻辑推理与数据证据),这与人类回复更倾向使用社会影响力策略(如负面情绪唤起和非专家证言)形成显著差异,从而揭示了LLMs在说服力上的潜在机制及其客观性感知来源。
链接: https://arxiv.org/abs/2604.22109
作者: Nalin Poungpeth,Nicholas Clark,Tanu Mitra
机构: Northwestern University (西北大学); University of Washington (华盛顿大学)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce “spontaneous persuasion,” which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLMs in persuading users, as well as the perception of models as objective and impartial.
[HC-17] Emergent Technology Emergent Critique: Students and Teachers Developing Critical AI Literacy through Participatory Design around Generative AI
【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)工具逐步进入课堂的背景下,谁应主导其使用与教学设计?解决方案的关键在于通过参与式设计(participatory design)方法,让11年级拉丁裔学生与高中教师共同协商并制定AI工具的教学策略。研究发现,学生与教师在协作设计过程中形成了三种关键的批判性AI素养实践:集体质疑对AI的既有假设、基于互补专长的相互学习,以及将AI批判扎根于文化知识与创造性实践中。这些实践不仅推动了技术的合理采纳,也强化了学生在学习环境中对生成式AI的主动塑造与深度反思能力。
链接: https://arxiv.org/abs/2604.21995
作者: Santiago Ojeda-Ramirez,Eva Durall Gazulla,Kylie Peppler
机构: University of California, Irvine (加州大学欧文分校); University of Oulu (奥卢大学)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Who gets to decide how generative AI tools enter students’ classrooms? We report on a five-week participatory design program in which three 11th-grade Latinx students and three high school teachers in California negotiated how generative AI tools would be used and taught about in learning environments. Drawing on video recordings and designed artifacts, we ask: what critical AI literacy practices emerged as students and teachers jointly designed how generative AI tools would be used and taught about? Our analysis reveals three practices: collectively unsettling assumptions about AI, mutual learning through complementary expertise, and grounding AI critique in cultural knowledge and creative practice. Students and teachers developed these practices through the design work itself. This case contributes strategies for designing with youth around an emergent technology like generative AI toward critical AI literacy. It extends work on youth as protagonists by showing how this approach enables students to shape both the adoption and the interrogation of these tools in their learning environments.
[HC-18] Community-Based AI Learning: Redistributing Artificial Intelligences Epistemic Authority in Education
【速读】:该论文试图解决的问题是:当前生成式 AI (Generative AI) 在学习场景中常被视为权威知识来源,导致学习者被动接受信息,忽视了地方性知识与社区经验的价值,从而加剧了教育不平等。其解决方案的关键在于提出“基于社群的 AI 学习”(Community-based AI Learning)框架,通过三个核心承诺实现对权威关系的重构:一是认知调优(epistemic fine tuning),即根据具体情境校准对 AI 的信任;二是权威再分配(redistribution of authority),强调将知识生产权交还给学习者及其所在社区;三是情境化判断力培养(situated discernment),支持学习者在特定社会历史语境中决定何时与 AI 合作、质疑或拒绝使用。这一框架旨在通过地方性知识和集体判断,推动更具公平性的 AI 教育实践。
链接: https://arxiv.org/abs/2604.21986
作者: Santiago Ojeda-Ramirez,Symone Gyles,Kylie Peppler
机构: 未知
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:As generative AI systems increasingly mediate learning, they are often treated as authoritative sources of knowledge. This perspective paper introduces community-based AI learning as a framework that repositions authority, grounding AI engagement in learners’ lived and community-based epistemologies. Drawing from community-driven learning and constructionist traditions, we articulate three commitments: epistemic fine tuning, redistribution of authority, and situated discernment. Together, these processes localize critical AI literacy by calibrating trust, foregrounding community knowledge, and supporting collective judgment about when to design with, interrogate, or reject AI. We argue that equitable AI education requires negotiating authority through place, history, and social context.
[HC-19] Comparative Analysis of Human vs. AI-powered Support in VRChat Communities on Discord: User Engagement Response Dynamics and Interaction Patterns
【速读】:该论文旨在解决在线社区中支持系统效率与用户参与度之间的平衡问题,具体聚焦于人工智能(AI)驱动的支持系统与传统人工支持在用户交互行为和参与度上的差异。研究通过对比VRChat Discord服务器中“人类支持”与“AI支持”两个渠道的响应动态、互动模式及用户态度,揭示了两种支持方式的使用偏好与效果差异。解决方案的关键在于结合定量与定性分析方法,系统评估AI支持在提升响应速度和可扩展性方面的优势,同时识别其在复杂情境下缺乏同理心与灵活性等局限性,从而为优化人机协同支持策略提供实证依据。
链接: https://arxiv.org/abs/2604.21963
作者: He Zhang,Bumjin Kim,John M. Carroll,Jie Cai
机构: The Pennsylvania State University (宾夕法尼亚州立大学); Tsinghua University (清华大学)
类目: Human-Computer Interaction (cs.HC)
备注: This work has been accepted to ACM IMX 2026
Abstract:The integration of AI-driven support systems within online communities has opened new avenues for enhancing user engagement and support efficiency in recent years. This study investigates the differences in user interactions and engagement within two distinct support channels on the VRChat Discord server: “user support,” where human users provide assistance to peers, and “AI support,” where an AI chatbot addresses user queries. By analyzing user engagement, response dynamics, and interaction patterns across these channels, we uncover different usage patterns and user attitudes toward each approach. Our research employs both quantitative and qualitative methods to explore the trends in the VRChat community when using AI and user support, highlighting the unique advantages and limitations of AI-driven support compared to traditional human assistance. The findings offer valuable insights into optimizing AI and human support systems, aiming to foster more effective support strategies and create more engaging online communities.
[HC-20] Routine Computing: A Systematic Review of Sensing Daily Life Dimensions Towards Human-Centered Goals
【速读】:该论文旨在解决计算系统在理解和建模人类日常行为(routine)方面存在的挑战,尤其是在将低层次活动识别与高层次意图关联、实现个性化与泛化之间的平衡、隐私保护以及数据局限性等方面的问题。其解决方案的关键在于通过系统性综述203项相关研究,提出一个涵盖时间结构、行为交互、认知因素及变异性和偏离处理的新分类体系,并在此基础上提炼出适用于无障碍护理、健康习惯促进、自适应情境支持和大规模人群洞察等四大应用领域的共性目标与设计原则,从而为HCI领域提供一套伦理、自适应且以人为中心的“常规感知”(routine-aware)系统设计基础框架。
链接: https://arxiv.org/abs/2604.21934
作者: Borislav Pavlov,Jiajin Li,Jun Fang,Yuntao Wang,Yuanchun Shi
机构: Tsinghua University (清华大学); Key Laboratory of Pervasive Computing, Ministry of Education (教育部普适计算重点实验室); Beijing National Research Center for Information Science and Technology (北京信息科学与技术国家研究中心); National Key Laboratory of Human Factors Engineering (人因工程国家重点实验室); Ant Group (蚂蚁集团)
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages, 8 figures, to be published in The ACM (Association for Computing Machinery) CHI conference on Human Factors in Computing Systems 2026
Abstract:Human routines structure daily life, yet remain challenging for computational systems to understand. This paper presents the first systematic review of routine computing, a previously implicit but increasingly recognized field that focuses on computationally sensing and modeling human behaviors. It synthesizes 203 studies published up to August 2025. The paper presents a new taxonomy of the literature, focusing on temporal structures, behavioral interactions, cognitive aspects, and how variability and deviations are addressed. The common goals of routine computing extend across four major application domains, including accessibility care, the promotion of healthy habits, adaptive and context-aware support, and large-scale population insights. Persistent challenges that limit the design of truly human-centered systems are identified, including the gap between low-level activity recognition and high-level intent, the tension between personalization and generalization, unresolved privacy concerns, and data-related limitations. By consolidating these findings, this paper provides a foundational framework for HCI researchers, outlining principles for designing ethical, adaptive, and human-centered routine-aware systems.
[HC-21] Not Another EHR: Reimagining Physician Information Needs with Generative AI Technology
【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHRs)带来的认知负担问题,即医生在面对海量且复杂的医疗数据时,难以高效地完成信息导航与整合,从而影响诊断效率和临床决策质量。其解决方案的关键在于利用生成式 AI(Generative AI)构建动态、自适应的用户界面,通过增强医生与患者数据之间的交互能力,支持以临床工作流为中心的信息获取与处理过程,进而提升诊疗效率并优化人机协作的信任机制。
链接: https://arxiv.org/abs/2604.21933
作者: Ruican Zhong,Jiachen Li,Gary Hsieh,David W. McDonald,Selin S. Everett,Alyssa Unell,Jonathan Carlson,Katie Claveau,Noel Codella,Khalil Malik,Scott Mackie,Eduardo Olvera,Scott Saponas,Eric Horvitz,David Rhew,Jim Weinstein,Jacob Gross,Amanda K. Hall
机构: University of Washington(华盛顿大学); Northeastern University(东北大学); Stanford University(斯坦福大学); Microsoft(微软); University of Washington School of Medicine(华盛顿大学医学院); Microsoft Research(微软研究院)
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Electronic health records (EHRs) have improved data accessibility but have also introduced cognitive burden for physicians, given the sheer volume and complexity of the data involved. Advances in large language models (LLMs) create new opportunities to rethink how clinicians interact with medical data through dynamic, adaptive interfaces. In this position paper, we explore how generative AI can support physicians’ information needs by enabling more dynamic interactions with patient data. Through semi-structured interviews with internal physicians at Microsoft, we identify key challenges in data navigation and synthesis, and characterize clinicians’ information needs during diagnostic workflows. We further examine how physicians conceptualize AI can help their work process and how these mental models shape expectations for interaction and trust. Based on these insights, we discuss design considerations for generative user interfaces that support clinician-centered workflows.
[HC-22] Quantifying Interface Procedure Coupling Risks in Digital Nuclear Control Rooms: An Event Based Human Reliability Assessment
【速读】:该论文旨在解决数字化工厂主控室中人机界面(Human-Machine Interface, HMI)如何定量放大操作程序风险的问题,尤其关注界面缺陷与程序偏差之间的耦合机制。其解决方案的关键在于构建了一个可复用的三维标注框架和一个包含布局、语义、错配与标注四因素的界面机制模型,通过真实运行事件数据(2021–2025年)进行系统评估,识别出语义错配和布局诱导陷阱是导致复合界面-程序耦合失效的主要驱动因素,并借助机器学习解释与模拟器验证,量化了界面问题对操作错误的影响程度(如语义混淆占27.3%),从而为数字控制室中的早期风险识别和语义对齐设计提供数据驱动的人因可靠性分析(Human Reliability Analysis, HRA)流程与系统性框架。
链接: https://arxiv.org/abs/2604.21932
作者: Xingyu Xiao,Mingwei Xiao,Hongbo Li,Jingang Liang,Jiejuan Tong,Haitao Wang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Software Engineering (cs.SE)
备注:
Abstract:Digitalization has fundamentally transformed human-system interaction in nuclear main control rooms, yet the quantitative mechanisms by which interfaces amplify procedural risks remain insufficiently understood. This study presents a systematic assessment of interface-procedure coupling based on real operational events collected from 2021 to 2025 in a modern nuclear power plant. A reusable three-dimensional labeling framework and a four-factor interface mechanism model are developed to characterize layout, semantic, mismatch, and labeling deficiencies. Results show that interface issues function as a significant risk amplifier. A total of 42.6 percent of events involved interface deficiencies, and their presence more than doubled the likelihood of procedural deviation. Machine learning interpretation further reveals that composite interface-procedure coupling, particularly driven by semantic mismatches and layout-induced traps, is the dominant contributor to coupled failures. Simulator-based validation confirms that semantic confusion accounts for 27.3 percent of interface-induced errors, with overall error patterns consistent with historical data. The study provides a data-driven HRA workflow for early vulnerability identification in digital control rooms and proposes a systematic framework for interface-procedure semantic alignment to support risk-informed design and verification.
计算机视觉
[CV-0] Inter-Stance: A Dyadic Multimodal Corpus for Conversational Stance Analysis
【速读】:该论文旨在解决当前缺乏公开可用的多模态同步记录与自我报告测量数据集的问题,尤其在双人社交互动场景中,现有数据集普遍缺少对多个参与者行为的多模态采集(如面部表情、语音、生理信号等)以及社会信号标注(如同意、不同意、中立立场)。其解决方案的关键在于构建一个全新的多模态双人交互语料库(45对 dyads,共90人),包含同步采集的多种模态数据:2D/3D面部视频、热成像动态、语音与言语行为、生理指标(PPG、EDA、心率、血压、呼吸)及参与者自评情绪,并涵盖有共同历史和陌生人两类互动关系。该语料库为建模人际间多模态行为提供了前所未有的基础,支持更精准的社交行为分析与情感计算研究。
链接: https://arxiv.org/abs/2604.22739
作者: Xiang Zhang,Xiaotian Li,Taoyue Wang,Nan Bi,Xin Zhou,Cody Zhou,Zoie Wang,Andrew Yang,Yuming Su,Jeff Cohn,Qiang Ji,Lijun Yin
机构: State University of New York at Binghamton (纽约州立大学宾汉顿分校); Choate Rosemary Hall (乔特罗斯玛丽中学); Rensselaer Polytechnic Institute (伦斯勒理工学院); Ward Melville High School (沃德梅尔维尔高中); University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Social interactions dominate our perceptions of the world and shape our daily behavior by attaching social meaning to acts as simple and spontaneous as gestures, facial expressions, voice, and speech. People mimic and otherwise respond to each other’s postures, facial expressions, mannerisms, and other verbal and nonverbal behavior, and form appraisals or evaluations in the process. Yet, no publicly-available dataset includes multimodal recordings and self-report measures of multiple persons in social interaction. Dyadic recordings and annotation are lacking. We present a new data corpus of multimodal dyadic interaction (45 dyads, 90 persons) that includes synchronized multi-modality behavior (2D face video, 3D face geometry, thermal spectrum dynamics, voice and speech behavior, and physiology (PPG, EDA, heart-rate, blood pressure, and respiration)), together with self-reported affect of all participants in a communicative interaction scenario. Two types of dyads are included: persons with shared past history and strangers. Annotations include social signals, agreement, disagreement, and neutral stance. With a potent emotion induction, these multimodal data will enable novel modeling of multimodal interpersonal behavior. We present extensive experiments to evaluate multimodal dyadic communication of dyads with and without interpersonal history, and their affect. This new database will enable multimodal modeling of social interaction that was never possible before. The dataset includes 20TB of multimodal data to share with the research community.
[CV-1] Long-tail Internet photo reconstruction
【速读】:该论文旨在解决互联网照片集合中普遍存在的长尾分布问题,即少数知名地标被密集拍摄并易于重建为高质量3D模型,而大多数真实场景则因图像稀疏、噪声大且分布不均,导致传统与学习型3D重建方法难以有效处理。解决方案的关键在于构建一个名为MegaDepth-X的大规模3D重建数据集,该数据集提供干净且稠密的深度图,并提出一种模拟长尾场景相机分布的训练图像采样策略。通过在3D基础模型上微调这些数据和策略,显著提升了极端稀疏条件下的重建鲁棒性,同时增强了对对称与重复结构场景的重建可靠性,且保持了在标准稠密3D基准数据集上的泛化能力。
链接: https://arxiv.org/abs/2604.22714
作者: Yuan Li,Yuanbo Xiangli,Hadar Averbuch-Elor,Noah Snavely,Ruojin Cai
机构: Cornell University (康奈尔大学); Kempner Institute, Harvard University (哈佛大学肯普纳研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.
[CV-2] Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model
【速读】:该论文旨在解决神经退行性疾病进展建模中因纵向神经影像数据时间稀疏性而导致的个体脑部解剖变化难以准确捕捉的问题。现有数据通常仅包含每位受试者少量随访扫描,限制了对连续解剖演变过程的建模能力。其解决方案的关键在于提出一种基于扩散机制的4D(3D×T)生成框架,该框架不仅能够条件化地合成随时间演化的脑部结构,还通过显式学习拓扑保持的时空形变分布,从而有效捕捉脑结构几何形态随时间的变化规律。这一设计使得模型可生成未来解剖状态并重建符合解剖一致性的疾病轨迹,显著提升了生成结果的 anatomical accuracy(解剖准确性)、temporal consistency(时间一致性)和临床相关性。
链接: https://arxiv.org/abs/2604.22700
作者: Nivetha Jayakumar,Swakshar Deb,Bahram Jafrasteh,Qingyu Zhao,Miaomiao Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are temporally sparse with a few follow-up scans per subject. This scarcity of temporal data limits our ability to model and accurately capture the continuous anatomical changes related to disease progression in individual subjects. To address this problem, we propose a novel 4D (3DxT) diffusion-based generative framework that effectively models and synthesizes longitudinal brain anatomy over time, conditioned on available clinical variables such as health status, age, sex, and other relevant factors. Moreover, while most current approaches focus on manipulating image intensity or texture, our method explicitly learns the data distribution of topology-preserving spatiotemporal deformations to effectively capture the geometric changes of brain structures over time. This design enables the realistic generation of future anatomical states and the reconstruction of anatomically consistent disease trajectories, providing a more faithful representation of longitudinal brain changes. We validate our model through both synthetic sequence generation and downstream longitudinal disease classification, as well as brain segmentation. Experiments on two large-scale longitudinal neuroimage datasets demonstrate that our method outperforms state-of-the-art baselines in generating anatomically accurate, temporally consistent, and clinically meaningful brain trajectories. Our code is available on Github.
[CV-3] SS3D: End2End Self-Supervised 3D from Web Videos
【速读】:该论文旨在解决单目视频中3D估计任务的预训练难题,即如何在无标注的海量网络视频(web-scale)上实现稳定且高效的自监督学习,以提升深度、相机位姿(ego-motion)和内参(intrinsics)联合预测的性能。其关键解决方案在于提出SS3D框架:首先采用“内参优先”的两阶段训练策略来稳定多任务联合学习;其次引入多视图信号代理(Multi-View Signal Proxy, MVS)用于过滤低质量样本并构建课程采样策略,从而缓解弱多视角可观测性和数据异质性问题;最后通过专家蒸馏将大规模SfM自监督信号压缩为单一学生模型,实现在YouTube-8M约1亿帧数据上的预训练,显著提升了跨域零样本迁移能力和微调效果。
链接: https://arxiv.org/abs/2604.22686
作者: Marwane Hariat,Gianni Franchi,David Filliat,Antoine Manzanera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
[CV-4] PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views
【速读】:该论文旨在解决单视图3D形状检索(Single-view 3D shape retrieval)任务中现有方法因采用前向、整体对齐策略而导致可解释性差、鲁棒性和泛化能力不足的问题。其解决方案的关键在于提出Pose-Aware 3D Shape Retrieval (PASR) 框架,通过将2D基础模型(DINOv3)的知识蒸馏到3D编码器中,以姿态条件下的3D投影与2D特征图对齐的方式,构建图像与合成网格之间的桥梁;在推理阶段,利用分析-合成(analysis-by-synthesis)的测试时优化机制,联合搜索最佳形状和姿态以重建输入图像的局部特征图,从而实现对部分遮挡具有鲁棒性且能捕捉精细几何细节的检索性能。
链接: https://arxiv.org/abs/2604.22658
作者: Jiaxin Shi,Guofeng Zhang,Wufei Ma,Naifu Liang,Adam Kortylewski,Alan Yuille
机构: Shanghai Jiao Tong University (上海交通大学); Johns Hopkins University (约翰霍普金斯大学); University of California, San Diego (加州大学圣地亚哥分校); CISPA Helmholtz Center for Information Security (CISPA亥姆霍兹信息安全中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.
[CV-5] A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock
【速读】:该论文旨在解决群体饲养环境中个体畜禽精准识别的问题,传统依赖射频识别(RFID)耳标的方法存在侵入性、易丢失及天线场空间限制等局限。其解决方案的关键在于提出一种基于3D点云数据的非侵入式视觉识别系统,并引入时序自适应识别架构(Temporal Adaptive Recognition Architecture, TARA),该架构采用动态校准机制以适应动物形态变化,同时利用访问级别多数投票策略在标签稀缺场景下生成高保真伪标签,从而实现长时间序列中身份的一致性保持与高精度识别(访问级别准确率达100%)。
链接: https://arxiv.org/abs/2604.22657
作者: Shiva Paudel,TsungCheng Tsai,Dongyi Wang
机构: University of Arkansas (阿肯色大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group-housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.
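论文中的访问级别多数投票伪标签策略可以用几行代码示意(min_agreement 阈值为假设参数,论文未给出具体取值):

```python
from collections import Counter

def visit_pseudo_label(frame_predictions, min_agreement=0.5):
    # 按访问(visit)级别多数投票生成伪标签:
    # frame_predictions 为一次进食访问内逐帧的身份预测列表;
    # 仅当多数票占比超过 min_agreement 时才产生高置信伪标签,否则返回 None
    if not frame_predictions:
        return None
    label, count = Counter(frame_predictions).most_common(1)[0]
    if count / len(frame_predictions) > min_agreement:
        return label
    return None
```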
[CV-6] Structure-Guided Diffusion Model for EEG-Based Visual Cognition Reconstruction
【速读】:该论文旨在解决从脑电图(EEG)信号中解码视觉信息的难题,现有方法多局限于自然图像和类别化表征,难以捕捉结构特征并区分客观感知与主观认知。其解决方案的关键在于提出结构引导扩散模型(Structure-Guided Diffusion Model, SGDM),通过引入显式结构信息提升重建质量:首先利用结构监督变分自编码器结合时空EEG编码器,并通过对比学习对齐至视觉嵌入空间;随后在扩散模型中引入ControlNet模块,将结构信息作为控制信号指导图像生成。该设计显著提升了低级视觉特征和语义表征的保真度,实现了跨抽象与自然图像域的高精度解码,揭示了EEG中层次化的结构编码模式,为脑机接口(BCI)提供更丰富的意图解码维度。
链接: https://arxiv.org/abs/2604.22649
作者: Yongxiang Lian,Yueyang Cang,Pingge Hu,Yuchen He,Li Shi
机构: Tsinghua University (清华大学); China Academy of Information and Communications Technology (中国信息通信研究院)
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Objective: Decoding visual information from electroencephalography (EEG) is an important problem in neuroscience and brain-computer interface (BCI) research. Existing methods are largely restricted to natural images and categorical representations, with limited capacity to capture structural features and to differentiate objective perception from subjective cognition. We propose a Structure-Guided Diffusion Model (SGDM) that incorporates explicit structural information for EEG-based visual reconstruction. Approach: SGDM is evaluated on the Kilogram abstract visual object dataset and the THINGS natural image dataset using a two-stage generative mechanism. The framework combines a structurally supervised variational autoencoder with a spatiotemporal EEG encoder aligned to a visual embedding space via contrastive learning. Structural information is integrated into a diffusion model through ControlNet to guide image generation from EEG features. Results: SGDM outperforms existing methods on both abstract and natural image datasets. Reconstructed images achieve higher fidelity in low-level visual features and semantic representations, indicating improved decoding accuracy and strong generalization across diverse visual domains. Spatiotemporal analysis of EEG signals further reveals hierarchical structural encoding patterns, consistent with the neural dynamics of visual cognition. Significance: These findings validate the effectiveness of SGDM in capturing explicit structural geometry and generating images with high fidelity to individual cognitive representations. By enabling decoding of complex visual content from EEG signals, the framework extends neural decoding beyond low-dimensional or categorical outputs. This supports BCIs with increased degrees of freedom for intention decoding and more flexible brain-to-machine communication.
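摘要中“时空 EEG 编码器经对比学习对齐到视觉嵌入空间”这一步,可用 InfoNCE 形式的损失做一个极简示意(纯 Python 标量实现;temperature 等超参为假设取值,并非论文设定):

```python
import math

def cosine(u, v):
    # 余弦相似度
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_align_loss(eeg_batch, visual_batch, temperature=0.1):
    # 对比对齐示意:第 i 个 EEG 嵌入应与第 i 个视觉嵌入最相似(InfoNCE 形式)
    loss = 0.0
    for i, e in enumerate(eeg_batch):
        logits = [cosine(e, v) / temperature for v in visual_batch]
        m = max(logits)  # log-sum-exp 的数值稳定化
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(eeg_batch)
```

批内配对正确时损失接近 0,错配时显著增大,这正是对齐训练利用的信号。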
[CV-7] EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges
【速读】:该论文旨在解决现有基于CLIP(Contrastive Language–Image Pretraining)的视频动作识别方法在实际场景中因空间感知能力不足而导致性能下降的问题,尤其是在低光照或第一人称视角等挑战性视觉条件下,空间理解的缺失严重影响了时间建模的有效性。解决方案的关键在于提出一种高效视觉提示框架EV-CLIP,其核心创新包括两个视觉提示机制:一是掩码提示(mask prompts),通过重加权像素引导模型关注动作相关区域以增强空间感知;二是上下文提示(context prompts),以轻量级方式压缩帧级特征为紧凑表示,实现高效的时序建模。该设计使模型在少样本场景下仍能保持高性能,且计算效率不依赖骨干网络规模,适用于资源受限的实际部署环境。
链接: https://arxiv.org/abs/2604.22595
作者: Hyo Jin Jon,Longbin Jin,Eun Yi Kim
机构: Konkuk University (韩国建国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 6 tables
Abstract:CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model’s attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at this https URL.
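两类视觉提示的作用可以用如下极简示意理解:掩码提示对各 patch 特征按动作相关性重加权,上下文提示把逐帧特征压缩成一个紧凑表示。此处用均值池化代替论文中的轻量时序建模,权重与结构均为假设:

```python
def apply_mask_prompt(patch_features, relevance):
    # 掩码提示:按(假设已学到的)动作相关性权重对各 patch 特征重加权
    return [f * r for f, r in zip(patch_features, relevance)]

def context_prompt(frame_features):
    # 上下文提示:将逐帧特征压缩为单个紧凑表示,这里用均值池化示意;
    # 论文中的轻量时序建模细节并未在此复现
    n, dim = len(frame_features), len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]
```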
[CV-8] FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing
【速读】:该论文旨在解决基于流的视频编辑中因高维视频潜在空间内编辑信号不稳定而导致的多对象场景或帧数增加时编辑失效的问题。其核心挑战在于编辑信号在高维空间中的不精确空间定位和长度诱导的幅度衰减。解决方案的关键是提出FlowAnchor框架,通过引入空间感知注意力优化(Spatial-aware Attention Refinement)以确保文本引导与空间区域的一致对齐,并结合自适应幅度调制(Adaptive Magnitude Modulation)来动态保持足够的编辑强度,从而稳定编辑信号并引导流式演化至目标分布。
链接: https://arxiv.org/abs/2604.22586
作者: Ze Chen,Lan Chen,Yuanhang Li,Qi Mao
机构: Communication University of China (中国传媒大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at this https URL.
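其中“自适应幅度调制”的直觉——把因序列变长而衰减的编辑信号重新缩放到足够的幅度——可以这样示意(target_norm 为假设参数,真实方法的调制策略更精细):

```python
import math

def adaptive_magnitude_modulation(edit_signal, target_norm):
    # 将编辑信号重新缩放到目标幅度,抵消长度诱导的幅度衰减;
    # target_norm 为假设的目标强度,并非论文中的自适应准则本身
    norm = math.sqrt(sum(v * v for v in edit_signal))
    if norm == 0:
        return list(edit_signal)
    scale = target_norm / norm
    return [v * scale for v in edit_signal]
```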
[CV-9] Data-Free Contribution Estimation in Federated Learning using Gradient von Neumann Entropy CVPR2026
【速读】:该论文旨在解决联邦学习(Federated Learning)中客户端贡献度估计的问题,即如何在不依赖服务器端验证数据或客户端自报信息的前提下,公平且准确地衡量各客户端对全局模型训练的贡献。传统方法易受隐私泄露或恶意操纵的影响,而本文提出了一种基于数据无关的信号——最终层更新矩阵的冯·诺依曼熵(matrix von Neumann entropy),该指标能有效量化客户端所贡献信息的多样性。其核心创新在于:一是设计了SpectralFed方案,以归一化熵作为聚合权重;二是引入SpectralFuse方案,通过秩自适应卡尔曼滤波融合熵与类别特定对齐信息,提升每轮训练的稳定性。实验表明,在CIFAR-10/100及自然划分的FEMNIST和FedISIC等非独立同分布(non-IID)场景下,该方法无需任何验证数据或客户端元数据即可实现与客户端准确率高度一致的贡献评分,显著优于现有无数据基线方法。
链接: https://arxiv.org/abs/2604.22562
作者: Asim Ukaye,Mubarak Abdu-Aguye,Nurbek Tastan,Karthik Nandakumar
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); Michigan State University (密歇根州立大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 10 pages, 4 figures, 4 pages Appendix, 6 figures in Appendix. To appear in CVPR 2026 FedVision Workshop
Abstract:Client contribution estimation in Federated Learning is necessary for identifying clients’ importance and for providing fair rewards. Current methods often rely on server-side validation data or self-reported client information, which can compromise privacy or be susceptible to manipulation. We introduce a data-free signal based on the matrix von Neumann (spectral) entropy of the final-layer updates, which measures the diversity of the information contributed. We instantiate two practical schemes: (i) SpectralFed, which uses normalized entropy as aggregation weights, and (ii) SpectralFuse, which fuses entropy with class-specific alignment via a rank-adaptive Kalman filter for per-round stability. Across CIFAR-10/100 and the naturally partitioned FEMNIST and FedISIC benchmarks, entropy-derived scores show a consistently high correlation with standalone client accuracy under diverse non-IID regimes - without validation data or client metadata. We compare our results with data-free contribution estimation baselines and show that spectral entropy serves as a useful indicator of client contribution.
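最终层更新矩阵的冯·诺依曼(谱)熵及其作为聚合权重的用法,可以在 2×2 对称矩阵上用闭式特征值做一个可运行的小示意(假设矩阵半正定;真实实现作用于完整的最终层更新矩阵):

```python
import math

def vn_entropy_2x2(a, b, c):
    # 对称 2x2 矩阵 [[a, b], [b, c]] 的谱熵示意:
    # 求特征值,归一化为概率分布后计算香农熵
    mean = (a + c) / 2.0
    delta = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    eigs = [mean + delta, mean - delta]
    total = sum(eigs)
    ps = [e / total for e in eigs if e > 1e-12]
    return -sum(p * math.log(p) for p in ps)

def spectral_weights(entropies):
    # SpectralFed 思路的示意:以归一化熵作为各客户端的聚合权重
    s = sum(entropies)
    return [e / s for e in entropies]
```

谱分布越“平”(信息越多样),熵越大,该客户端在聚合中获得的权重越高。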
[CV-10] Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors
【速读】:该论文旨在解决自动驾驶场景下图结构视觉问答(Graph Visual Question Answering, GVQA)中跨阶段推理一致性问题,特别是感知(Perception)、预测(Prediction)与规划(Planning)三阶段间语义矛盾(NLI contradiction)导致的决策不一致问题。解决方案的关键在于提出两种互补的跨阶段上下文传递机制:其一为显式方法,基于提示工程在无需额外训练的情况下对领域适配的4B视觉语言模型(VLM)进行条件控制,显著降低NLI矛盾(最高达42.6%);其二为隐式方法,引入门控上下文投影器(gated context projectors),将前一阶段的隐藏状态向量以归一化、门控方式注入下一阶段输入嵌入,并联合训练QLoRA适配器,在仅更新约0.5%参数的前提下实现规划阶段NLI矛盾下降34%(p < 0.05)和跨阶段蕴含关系提升50%,验证了隐式投影在提升语义一致性方面的有效性。
链接: https://arxiv.org/abs/2604.22560
作者: Gautam Kumar Jain,Carsten Markgraf,Julian Stähler
机构: Technische Hochschule Augsburg (奥格斯堡应用技术大学); TTZ Landsberg (兰茨贝格技术转移中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 8 figures, 8 tables, preprint
Abstract:Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model’s own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage’s input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p < 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.
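门控上下文投影器的注入方式可以用纯 Python 向量运算示意:对上一阶段隐藏向量做线性投影、归一化,再经 sigmoid 门控后加到下一阶段的输入嵌入上(weight、gate 均为假设的已训练参数,真实实现作用于 VLM 的隐藏状态张量):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_context_projection(stage_hidden, next_embedding, weight, gate):
    # 上一阶段隐藏向量 -> 线性投影 -> 归一化 -> 门控缩放 -> 注入下一阶段嵌入
    proj = [sum(w * h for w, h in zip(row, stage_hidden)) for row in weight]
    norm = math.sqrt(sum(v * v for v in proj)) or 1.0
    g = sigmoid(gate)
    return [e + g * (v / norm) for e, v in zip(next_embedding, proj)]
```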
[CV-11] Video Analysis and Generation via a Semantic Progress Function SIGGRAPH2026
【速读】:该论文旨在解决图像和视频生成模型在时间维度上语义变化非线性、不均匀的问题,即生成序列中常出现长时间语义停滞后突然发生显著语义跳跃的现象。其核心解决方案是提出一种语义进度函数(Semantic Progress Function),通过计算每帧的语义嵌入距离并拟合累积语义偏移的平滑曲线,从而量化语义演化的速率;基于此函数,进一步设计了语义线性化(semantic linearization)方法,通过对序列进行重参数化(或重定时)使语义变化以恒定速率展开,从而实现更平滑、连贯的过渡效果。该框架不仅可用于纠正生成视频的语义节奏问题,还可作为通用工具用于检测时序异常、跨模型语义节奏比较及对真实世界视频进行目标语义节奏控制。
链接: https://arxiv.org/abs/2604.22554
作者: Gal Metzer,Sagi Polaczek,Ali Mahdavi-Amiri,Raja Giryes,Daniel Cohen-Or
机构: Tel Aviv University (特拉维夫大学); Simon Fraser University (西蒙弗雷泽大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH 2026
Abstract:Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.
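语义进度函数与语义线性化的核心计算非常简洁,可以这样示意(用相邻帧嵌入的欧氏距离累加近似累积语义偏移;论文中的平滑曲线拟合与亚帧插值此处简化为最近邻取整):

```python
import math

def semantic_progress(embeddings):
    # 逐帧语义嵌入的相邻距离累加,得到累积语义偏移曲线
    progress = [0.0]
    for prev, cur in zip(embeddings, embeddings[1:]):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, cur)))
        progress.append(progress[-1] + d)
    return progress

def linearize(progress, n_frames):
    # 语义线性化:在累积曲线上等间隔取目标进度,反查最接近的原始帧索引,
    # 使重采样后的序列以近似恒定速率展开语义变化
    total = progress[-1]
    targets = [total * i / (n_frames - 1) for i in range(n_frames)]
    return [min(range(len(progress)), key=lambda j: abs(progress[j] - t))
            for t in targets]
```

曲线偏离直线越远,说明语义节奏越不均匀;线性化后重采样的帧索引会在语义跳跃处加密、在停滞处稀疏。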
[CV-12] Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models
【速读】:该论文旨在解决物理空间中对抗补丁攻击(physical adversarial patch attacks)在实际应用中效果受限的问题,具体表现为:现有方法无法系统性地干扰目标检测的多阶段决策流程(multi-stage decision pipeline),导致残差模块(residual modules)可抵消扰动;同时缺乏对复杂物理变化(如光照、视角、遮挡等)的有效建模,造成鲁棒性不足。解决方案的关键在于提出一种名为TriPatch的新方法,其核心是通过设计三元组损失函数(triplet loss)实现跨检测流水线多个阶段的协同攻击——包括检测置信度抑制、边界框偏移放大以及非极大值抑制(NMS)破坏;此外引入外观一致性损失(appearance consistency loss)以约束补丁的颜色分布,提升其在多样成像条件下的适应能力,并结合数据增强策略进一步增强对复杂物理扰动的鲁棒性。
链接: https://arxiv.org/abs/2604.22552
作者: Shihui Yan,Ziqi Zhou,Yufei Song,Yifan Hu,Minghui Li,Shengshan Hu
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.
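三元组损失的组合方式可以用标量代理做一个粗略示意:三项分别对应置信度抑制、偏移放大与 NMS 破坏(各项的具体形式与权重均为假设,真实损失作用于检测器各阶段的张量输出):

```python
def tripatch_loss(target_conf, bbox_shift, nms_proxy, w=(1.0, 1.0, 1.0)):
    # 三元组损失的标量示意(w 为假设权重):
    # 1) 置信度抑制:直接最小化目标置信度
    loss_conf = target_conf
    # 2) 边界框偏移放大:偏移越大该项越小
    loss_bbox = 1.0 / (1.0 + bbox_shift)
    # 3) NMS 破坏:以一个标量代理表示 NMS 行为被扰乱的程度(仅为示意,非论文定义)
    loss_nms = nms_proxy
    return w[0] * loss_conf + w[1] * loss_bbox + w[2] * loss_nms
```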
[CV-13] ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
【速读】:该论文旨在解决开放词汇场景图生成(open-vocabulary scene graph generation, SGG)中因标注不完整性导致的关系识别偏差问题,即现有方法将未标注的物体对关系简单视为负样本,而忽略了实际存在但未被标注的有效关系以及同一交互在不同粒度上的表达差异(如“standing on”与“supported by”)。解决方案的关键在于提出ReLIC-SGG框架,其核心创新是将未标注关系建模为潜在变量而非确定性负例,并构建语义关系格(semantic relation lattice)以捕捉开放词汇谓词间的相似性、蕴含和矛盾关系;通过视觉-语言一致性、图结构上下文和语义一致性联合推理缺失的正样本关系,同时引入正-未标记图学习目标减少假负样本监督,最终实现更准确的稀有及未见谓词识别与缺失关系恢复。
链接: https://arxiv.org/abs/2604.22546
作者: Amir Hosseini,Sara Farahani,Xinyi Li,Suiyang Guang
机构: Amirkabir University of Technology (阿米尔卡比尔理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., on, standing on, resting on, and supported by. This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.
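用已标注谓词沿语义关系格的蕴含边扩展潜在正例,这一思路可以在一个玩具格上示意(格结构与谓词均为假设数据;真实方法还结合视觉-语言一致性与图上下文做推断):

```python
# 假设的小型语义关系格:键谓词蕴含值集合中的更粗粒度谓词
ENTAILMENT = {
    "standing on": {"on"},
    "resting on": {"on"},
    "supported by": {"on"},
}

def expand_positives(annotated_predicates):
    # 把未标注关系当作潜在正例:由已标注谓词沿蕴含边做闭包式扩展,
    # 示意 ReLIC-SGG 用关系格恢复缺失正例的思路
    positives = set(annotated_predicates)
    frontier = list(positives)
    while frontier:
        p = frontier.pop()
        for q in ENTAILMENT.get(p, ()):  # p 蕴含 q,则 q 也应视为正例
            if q not in positives:
                positives.add(q)
                frontier.append(q)
    return positives
```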
[CV-14] Evolving Thematic Map Design in Academic Cartography: A Thirty-Year Study Based on Multilingual Journals
【速读】:该论文旨在解决学术地图(thematic maps)设计演变规律缺乏大规模实证研究的问题,特别是中文与英文期刊中地图设计实践的异同及其随时间变化的趋势。解决方案的关键在于构建了一个包含45,732篇中英文权威期刊论文的语料库,并通过计算机视觉和大模型驱动的文档解析技术提取出23,928幅地图,从而实现对地图元素、色彩设计和版式结构三个维度的量化分析,揭示了中英文学术地图在设计规范上高度趋同,且二者均呈现元素丰富度、图例使用率和色相多样性上升但版式结构稳定的平行演化趋势,表明学术地图设计的演进主要受制度性因素驱动而非文化差异。
链接: https://arxiv.org/abs/2604.22539
作者: Zhiwei Wei,Chenxi Song,Tazhu Wang,Fan Wu,Hua Liao,Su Ding,Nai Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注:
Abstract:Thematic maps play a central role in academic communication, yet their large-scale design evolution has rarely been examined empirically. This study presents a longitudinal and multilingual analysis of thematic map design practices in academic cartography from 1990 to 2020. We compile a corpus of 45,732 research articles from sixteen authoritative Chinese- and English-language journals and extract 23,928 maps using computer vision and large-model-based document parsing to build a structured dataset. Map design characteristics are quantified across three dimensions: map elements, color design, and layout structure. Results show that Chinese- and English-language academic maps share highly similar structural conventions, typically employing restrained color palettes with neutral dominant hues, low saturation, high brightness, and limited hue diversity, as well as centered layouts with high main-map occupation ratios. Differences exist in that English-language maps show slightly greater hue richness and compactness, whereas Chinese-language maps historically rely more on neutral hues and integrated layouts. Temporal analysis reveals parallel evolutionary trends in both groups, including increasing element richness, legend usage, and hue diversity, alongside stable layout structures. Overall, the findings suggest that academic map design evolution is characterized more by institutional convergence than cultural divergence.
[CV-15] Distilling Vision Transformers for Distortion-Robust Representation Learning
【速读】:该论文旨在解决在缺乏干净观测数据的情况下,如何有效学习对失真鲁棒的视觉表征问题。其核心挑战在于传统自监督学习方法依赖于大量干净数据,而在实际场景中此类数据往往稀缺或不可得。解决方案的关键在于提出一种非对称知识蒸馏框架:教师模型和学生模型均初始化自同一预训练Vision Transformer,但分别接收图像的不同视图——教师处理干净图像,学生则接收失真版本;通过多层级蒸馏机制(对齐全局嵌入、patch级特征及注意力图),使学生模型能够在未直接接触干净数据的前提下,逼近干净图像的表征,并在多种失真条件下显著提升下游任务性能。
链接: https://arxiv.org/abs/2604.22529
作者: Konstantinos Alexis,Giorgos Giannopoulos,Dimitrios Gunopulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.
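多层级蒸馏目标是三个对齐项的加权和,可直接写成如下示意(此处用 MSE 对齐全局嵌入、patch 特征与注意力图,权重为假设超参;论文中各层级的具体对齐函数未必是 MSE):

```python
def mse(x, y):
    # 均方误差
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def multilevel_distill_loss(student, teacher, weights=(1.0, 1.0, 1.0)):
    # 多层级蒸馏损失示意:student/teacher 均为含 'global'(全局嵌入)、
    # 'patch'(patch 级特征)、'attn'(注意力图,已展平)三个向量的字典;
    # 教师看干净图像,学生看失真版本,损失推动二者表征对齐
    return (weights[0] * mse(student["global"], teacher["global"])
            + weights[1] * mse(student["patch"], teacher["patch"])
            + weights[2] * mse(student["attn"], teacher["attn"]))
```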
[CV-16] Non-Minimal Sampling and Consensus for Prohibitively Large Datasets
【速读】:该论文旨在解决大规模数据集上模型估计的鲁棒性与可扩展性问题,尤其是在存在噪声和异常值的情况下。其解决方案的关键在于提出了一种名为 NONSAC(Non-Minimal Sampling and Consensus)的通用框架,该框架通过重复采样非最小数据子集并使用鲁棒估计器生成多个候选模型,再依据预定义的评分规则选择最优模型。NONSAC 具有估计器无关性(estimator-agnostic),可无缝集成现有几何拟合算法(如 RANSAC),从而在保持高鲁棒性的同时显著提升计算效率和规模适应性。
链接: https://arxiv.org/abs/2604.22518
作者: Seong Hun Lee,Patrick Vandewalle,Javier Civera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce NONSAC (Non-Minimal Sampling and Consensus), a general framework for robust and scalable model estimation from arbitrarily large datasets contaminated with noise and outliers. NONSAC repeatedly samples non-minimal subsets of data and generates model hypotheses using a robust estimator, producing multiple candidate models. The final model is selected based on a predefined scoring rule that evaluates hypothesis quality. Our framework is estimator-agnostic and can be integrated with existing geometric fitting algorithms such as RANSAC to improve both scalability and robustness to outliers. We propose and evaluate various scoring rules for NONSAC on relative camera pose estimation, Perspective-n-Point, and point cloud registration. Furthermore, we showcase the applicability of NONSAC to correspondence-free point cloud registration by hypothesizing all-to-all correspondences.
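NONSAC 的主循环可以在一个一维玩具问题(过原点直线拟合 + 离群点)上完整示意:反复采样非最小子集、用估计器生成候选模型、按评分规则(此处为内点数)选出最优假设。子集大小、迭代次数与阈值均为示意取值,估计器也以最小二乘代替论文中的鲁棒估计器:

```python
import random

def fit_slope(points):
    # 子集上的最小二乘斜率(过原点直线 y = m·x 的简化模型)
    num = sum(x * y for x, y in points)
    den = sum(x * x for x, _ in points) or 1.0
    return num / den

def inlier_count(m, points, tol=0.1):
    # 预定义评分规则:残差在阈值内的内点数
    return sum(1 for x, y in points if abs(y - m * x) <= tol)

def nonsac(points, subset_size=5, iters=200, seed=0):
    # NONSAC 主循环:非最小子集采样 + 候选模型生成 + 评分选优
    rng = random.Random(seed)
    best_m, best_score = None, -1
    for _ in range(iters):
        subset = rng.sample(points, subset_size)
        m = fit_slope(subset)
        score = inlier_count(m, points)
        if score > best_score:
            best_m, best_score = m, score
    return best_m
```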
[CV-17] Different Strokes for Different Folks: Writer Identification for Historical Arabic Manuscripts
【速读】:该论文旨在解决历史阿拉伯手写文献中的作者识别(writer identification)问题,以支持文献的来源追溯、真伪验证与历史分析。其关键解决方案是基于Muharaf数据集构建了一个带有注意力机制的卷积神经网络(CNN)模型,用于闭集作者识别任务,并通过手动校验和扩展标签将公开部分的标注率从28.00%提升至86.75%,同时保留了18,987条高质量线图像;此外,研究首次在该数据集上报告了两种评估协议——线级(line-level)和页级不相交(page-disjoint)的结果,从而量化了页面级别线索对模型泛化能力的影响,为历史学家和语言学家提供了更可靠且实用的基准资源。
链接: https://arxiv.org/abs/2604.22515
作者: Hamza A. Abushahla,Ariel Justine N. Panopio,Layth Al-Khairulla,Mohamed I. AlHajri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 29 pages, 13 figures, 31 tables
Abstract:Handwritten Arabic manuscripts preserve the Arab world’s intellectual and cultural heritage, and writer identification supports provenance, authenticity verification, and historical analysis. Using the Muharaf dataset of historical Arabic manuscripts, we evaluate writer identification from individual line images and, to the best of our knowledge, provide the first baselines reported under both line-level and page-disjoint evaluation protocols. Since the dataset is only partially labeled for writer identification, we manually verified and expanded writer labels in the public portion from 6,858 (28.00%) to 21,249 lines (86.75%) out of 24,495 line images, correcting inconsistencies and removing non-handwritten text. After further filtering, we retained 18,987 lines (77.51%). We propose a Convolutional Neural Network (CNN)-based model with attention mechanisms for closed-set writer identification, including rare two-writer lines modeled as composite writer-pair classes. We benchmark fourteen configurations and conduct ablations across different feature extractors and training regimes. To assess generalization to unseen pages, the page-disjoint protocol assigns all lines from each page to a single split. Under the line-level protocol, a fine-tuned DenseNet201 with attention achieves 99.05% Top-1 accuracy, 99.73% Top-5 accuracy, and 97.44% F1-score. Under the more challenging page-disjoint protocol, the best observed results are 78.61% Top-1 accuracy, 87.79% Top-5 accuracy, and 66.55% F1-score, thus quantifying the impact of page-level cues. By expanding the Muharaf dataset’s labeled subset and reporting both protocols, we provide a clearer benchmark and a practical resource for historians and linguists engaged with culturally and historically significant documents. The code and implementation details are available on GitHub.
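页级不相交(page-disjoint)评估协议的划分规则——同一页的所有行必须整体进入同一划分——可以这样实现示意(test_ratio、seed 为假设参数):

```python
import random

def page_disjoint_split(lines, test_ratio=0.2, seed=0):
    # 页级不相交划分示意:lines 为 (page_id, line_id) 元组列表,
    # 同一页的所有行整体进入训练集或测试集,避免页面级线索泄漏
    pages = sorted({p for p, _ in lines})
    rng = random.Random(seed)
    rng.shuffle(pages)
    n_test = max(1, int(len(pages) * test_ratio))
    test_pages = set(pages[:n_test])
    train = [ln for ln in lines if ln[0] not in test_pages]
    test = [ln for ln in lines if ln[0] in test_pages]
    return train, test
```

摘要中线级与页级两种协议的准确率差距,正是这种划分消除页面级线索后暴露出来的。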
[CV-18] Railway Artificial Intelligence Learning Benchmark (RAIL-BENCH): A Benchmark Suite for Perception in the Railway Domain
【速读】:该论文旨在解决现有铁路基础设施上自动化列车运行(Automated Train Operation, ATO)中缺乏公开、标准化的感知基准测试套件的问题,从而阻碍了不同视觉感知方法的可复现比较。解决方案的关键在于提出RAIL-BENCH,这是首个面向铁路场景的感知基准套件,包含五个针对铁路环境特性的挑战任务(轨道检测、目标检测、植被分割、多目标跟踪和单目视觉里程计),并提供经过筛选的训练与测试数据集、统一评估指标及公开排行榜。其中,针对轨道检测挑战引入LineAP这一新型基于线段的平均精度指标,能够独立于实例分组地评估多段线预测的几何准确性,有效克服了传统线检测指标的局限性。
链接: https://arxiv.org/abs/2604.22507
作者: Annika Bätz,Pavel Klasek,Seo-Young Ham,Philipp Neumaier,Martin Köppel,Martin Lauer
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Deutsches Zentrum für Schienenverkehrsforschung (德国铁路交通研究中心); DB InfraGO AG (DB基础设施GO公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, 5 tables, submitted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Abstract:Automated train operation on existing railway infrastructure requires robust camera-based perception, yet the railway domain lacks public benchmark suites with standardized evaluation protocols that would enable reproducible comparison of approaches. We present RAIL-BENCH, the first perception benchmark suite for the railway domain. It comprises five challenges - rail track detection, object detection, vegetation segmentation, multi-object tracking, and monocular visual odometry - each tailored to the specific characteristics of railway environments. RAIL-BENCH provides curated training and test datasets drawn from diverse real-world scenarios, evaluation metrics, and public scoreboards (this https URL). For the rail track detection challenge we introduce LineAP, a novel segment-based average precision metric that evaluates the geometric accuracy of polyline predictions independently of instance-level grouping, addressing key limitations of existing line detection metrics.
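LineAP 在线段级别匹配折线几何,这里用“采样点 + 距离阈值”的单阈值精度做一个与实例分组无关的极简示意(真实指标会在多阈值上求平均并同时统计查准与查全,这里仅示意单阈值查准率):

```python
import math

def point_dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def segment_precision(pred_points, gt_points, tol=1.0):
    # 将所有预测折线离散为采样点,落在任一真值采样点 tol 邻域内即计为命中;
    # 不区分各点属于哪条折线实例,因而与实例级分组无关
    if not pred_points:
        return 0.0
    hits = sum(1 for p in pred_points
               if any(point_dist(p, g) <= tol for g in gt_points))
    return hits / len(pred_points)
```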
[CV-19] ICPR 2026 Competition on Low-Resolution License Plate Recognition ICPR
【速读】:该论文旨在解决低分辨率车牌识别(Low-Resolution License Plate Recognition, LRLPR)在真实监控场景中的难题,尤其是在远距离拍摄、压缩伪影和恶劣成像条件下导致的车牌可读性严重下降问题。解决方案的关键在于组织了ICPR 2026国际竞赛,基于包含2万条训练轨迹和3000条测试轨迹的真实低质量数据集(LRLPR-26),其中每条轨迹包含同一车牌的5张低分辨率与5张高分辨率图像,从而推动该领域的技术进步。通过竞赛形式汇聚全球41个国家269支队伍的创新方法,最终获胜团队达到82.13%的识别率,揭示了当前LRLPR任务的技术瓶颈与发展趋势,为未来研究提供了重要参考。
链接: https://arxiv.org/abs/2604.22506
作者: Rayson Laroca,Valfride Nascimento,Donggun Kim,Sanghyeok Chung,Subin Bae,Uihwan Seo,Seungsang Oh,Chi M. Phung,Minh G. Vo,Xingsong Ye,Yongkun Du,Yuchen Su,Zhineng Chen,Sunhee Heo,Hyangwoo Lee,Kihyun Na,Khanh V. Vu Nguyen,Sang T. Pham,Duc N. N. Phung,Trong P. Le,Vy N. Vo Tran,David Menotti
机构: Federal University of Paraná (联邦巴拉那大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the International Conference on Pattern Recognition (ICPR) 2026
Abstract:Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions can severely degrade license plate legibility. To promote progress in this area, we organized the ICPR 2026 Competition on Low-Resolution License Plate Recognition, the first competition specifically dedicated to LRLPR using real low-quality data collected under operationally relevant conditions. The competition was based on the LRLPR-26 dataset, which comprises 20,000 training tracks and 3,000 test tracks; each training track contains five low-resolution and five high-resolution images of the same license plate. Notably, a total of 269 teams from 41 countries registered for the competition, and 99 teams submitted valid entries in the Blind Test Phase. The winning team achieved a Recognition Rate of 82.13%, and four teams surpassed the 80% mark, highlighting both the high level of competition at the top of the leaderboard and the continued difficulty of the task. In addition to presenting the competition design, evaluation protocol, and main results, this paper summarizes the methods adopted by the top-5 teams and discusses current trends and promising directions for future research on LRLPR. The competition webpage is available at this https URL
[CV-20] CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在细粒度多图像理解任务中面临的三大挑战:空间幻觉(spatial hallucination)、注意力泄漏(attention leakage)以及对象恒常性(object constancy)失效问题。同时,现有方法普遍依赖昂贵的人工标注或大规模链式思维(Chain-of-Thought, CoT)数据生成,成本高昂。解决方案的关键在于提出一种低成本、全链条的训练框架——组合式接地对比学习(Compositional Grounded Contrast, CGC),其核心机制包括:利用跨图像对比(Inter-Image Contrast)引入语义解耦的干扰上下文以增强跨图像区分能力,以及通过图像内对比(Intra-Image Contrast)构建相关跨视角样本以提升对象恒常性;此外,在GRPO优化框架中引入基于规则的空间奖励(Rule-Based Spatial Reward),在“先思考后定位”(Think-before-Grounding)范式下显著改善源图像归属、空间对齐和结构化输出有效性。实验表明,CGC在MIG-Bench和VLM2-Bench等细粒度多图像基准上达到最先进性能,并有效迁移至更广泛的多模态理解和推理任务中。
链接: https://arxiv.org/abs/2604.22498
作者: Lihao Zheng,Zhenwei Shao,Yu Zhou,Yan Yang,Xintian Shen,Jiawei Chen,Hao Ma,Tao Wei
机构: Hangzhou Dianzi University(杭州电子科技大学); Li Auto(理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
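基于规则的空间奖励可以按“结构化输出有效 + 源图像归属正确 + 空间对齐达标”三条规则累加示意(0.2/0.3/0.5 的权重与 0.5 的 IoU 阈值均为假设取值,并非论文设定):

```python
def iou(a, b):
    # 两个 [x1, y1, x2, y2] 框的交并比
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def spatial_reward(pred_box, gt_box, pred_image_id, gt_image_id,
                   valid_format, iou_thresh=0.5):
    # 规则化奖励:结构化输出有效 + 源图像归属正确 + 框 IoU 超过阈值
    # (各项权重与阈值均为假设取值)
    reward = 0.0
    if valid_format:
        reward += 0.2
    if pred_image_id == gt_image_id:
        reward += 0.3
    if iou(pred_box, gt_box) >= iou_thresh:
        reward += 0.5
    return reward
```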
[CV-21] Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
【速读】:该论文旨在解决全景图像(panoramic image)在前向3D重建模型中因球面畸变导致性能下降,以及现有全景3D数据集多基于离散位置采集、轨迹不连续的问题,这些问题严重限制了多视角下全景3D重建的发展。解决方案的关键在于构建Holo360D——首个大规模连续全景序列数据集,包含109,495张全景图及其配准的点云、网格和相机位姿,并通过3D激光扫描仪与360°相机联合采集原始数据,结合在线与离线SLAM系统进行处理,同时提出针对360°数据特性的后处理流程(包括几何去噪、网格孔洞填充和区域特定重网格化),从而显著提升3D数据质量并为模型训练提供更优信号,最终建立新的基准以指导有效的微调策略。
链接: https://arxiv.org/abs/2604.22482
作者: Jing Ou,Zidong Cao,Yinrui Ren,Zhuoxiao Li,Jinjing Zhu,Tongyan Hua,Shuai Zhang,Hui Xiong,Wufan Zhao
机构: The Hong Kong University of Science and Technology (Guangzhou); South China Normal University
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:While feed-forward 3D reconstruction models have advanced rapidly, they still exhibit degraded performance on panoramas due to spherical distortions. Moreover, existing panoramic 3D datasets are predominantly collected with 360 cameras fixed at discrete locations, resulting in discontinuous trajectories. These limitations critically hinder the development of panoramic feed-forward 3D reconstruction, especially for the multi-view setting. In this paper, we present Holo360D, a comprehensive dataset containing 109,495 panoramas paired with registered point clouds, meshes, and aligned camera poses. To our knowledge, Holo360D is the first large-scale dataset that provides continuous panoramic sequences with accurately aligned high-completeness depth maps. The raw data are initially collected using a 3D laser scanner coupled with a 360 camera. Subsequently, the raw data are processed with both online and offline SLAM systems. Furthermore, to enhance the 3D data quality, a post-processing pipeline tailored for the 360 dataset is proposed, including geometry denoising, mesh hole filling, and region-specific remeshing. Finally, we establish a new benchmark by fine-tuning 3D reconstruction models on Holo360D, providing key insights into effective fine-tuning strategies. Our results demonstrate that Holo360D delivers superior training signals and provides a comprehensive benchmark for advancing panoramic 3D reconstruction models. Datasets and Code will be made publicly available.
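全景(等距柱状投影)图像上的 3D 重建依赖像素到球面射线的标准映射,这也是球面畸变的来源之一。下面是该换算的最小示意;具体坐标轴朝向(y 向上、z 向前)为本文假设,论文未给出公式:

```python
import math

def equirect_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray direction:
    longitude spans [-pi, pi), latitude spans [pi/2, -pi/2].
    The axis convention (y up, z forward) is an assumption."""
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```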
[CV-22] Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification
【速读】:该论文旨在解决传统基于视觉的驾驶员疲劳检测系统中因固定的眼部纵横比(Eye Aspect Ratio, EAR)和嘴部纵横比(Mouth Aspect Ratio, MAR)阈值无法适应个体差异而导致的检测准确率下降问题,尤其是在面部结构、光照条件和驾驶环境变化下的泛化能力不足。其解决方案的关键在于引入个性化阈值校准机制,即在驾驶前为每位驾驶员动态设定专属的EAR和MAR阈值,并结合卷积神经网络(CNN)模型对眼部状态和打哈欠行为进行分类,从而提升系统在复杂场景下的鲁棒性和检测精度。
链接: https://arxiv.org/abs/2604.22479
作者: Gökdeniz Ersoy,Mehmet Alper Tatar,Eray Tonbul,Serap Kırbız
机构: MEF University (MEF大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.
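摘要中的 EAR 为标准指标(基于六个眼部关键点的纵横比公式),个性化阈值校准可在此之上做最小示意;其中 0.75 的比例系数仅为示意性假设,并非论文取值:

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def eye_aspect_ratio(p):
    """EAR over six eye landmarks p[0..5]: two vertical eyelid
    distances normalized by the horizontal eye-corner distance."""
    return (dist(p[1], p[5]) + dist(p[2], p[4])) / (2.0 * dist(p[0], p[3]))

def personalized_threshold(open_eye_ears, ratio=0.75):
    """Per-driver threshold: a fraction of the mean open-eye EAR
    measured during pre-drive calibration (ratio is illustrative)."""
    return ratio * sum(open_eye_ears) / len(open_eye_ears)
```

运行时将当前帧 EAR 与该个性化阈值比较,低于阈值持续若干帧即视为闭眼/疲劳信号。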
[CV-23] Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
【速读】:该论文旨在解决现有神经元标注(neuron labeling)方法因过度依赖高激活样本而产生过于宽泛或具有误导性标签的问题,尤其针对当前方法难以实现可扩展的神经元级标注(neuron-level labeling)这一挑战。其解决方案的关键在于引入对比示例(contrastive examples),即语义相似但激活强度低的输入样本,并分两阶段进行改进:首先利用视觉语言模型(VLMs)结合对比图像集生成更具体、更忠实的候选标签;其次提出对比语义投影(Contrastive Semantic Projection, CSP),将对比示例直接嵌入基于CLIP类编码器的评分与选择流程中,从而提升标注的语义粒度和准确性。实验证明,该方法显著优于现有最先进基线,在皮肤癌检测案例中亦展现出更强的解释力。
链接: https://arxiv.org/abs/2604.22477
作者: Oussama Bouanani,Jim Berend,Wojciech Samek,Sebastian Lapuschkin,Maximilian Dreyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples – inputs that are semantically similar to activating examples but elicit low activations – to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
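摘要未给出 CSP 的具体打分公式,但其核心思想——候选标签应贴近高激活样本、远离对比样本——可以用嵌入空间余弦相似度之差做一个最小示意;函数名与打分形式均为本文假设:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def contrastive_label_score(label_emb, activating, contrastive):
    """Score a candidate neuron label: mean similarity to activating
    examples minus mean similarity to contrastive (semantically close
    but low-activation) examples. A difference-of-means is only one
    illustrative choice of contrastive scoring."""
    act = sum(cosine(label_emb, e) for e in activating) / len(activating)
    con = sum(cosine(label_emb, e) for e in contrastive) / len(contrastive)
    return act - con
```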
[CV-24] All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams
【速读】:该论文旨在解决视频数据在流程分析中的多模态性(multi-modality)问题,即如何将非结构化的视频数据转化为可被传统流程挖掘技术处理的事件日志。其解决方案的关键在于提出SnapLog方法:首先利用图像嵌入(image embeddings)将视频帧转换为特征向量,并通过帧间相似性矩阵进行时间分割;随后采用广义少样本分类(generalized few-shot classification)对分割后的视频片段进行标签分配,从而生成带有时间戳且语义明确的事件序列,使后续流程挖掘算法能够有效分析视频中蕴含的过程行为。
链接: https://arxiv.org/abs/2604.22476
作者: Marco Pegoraro,Jonas Seng,Dustin Heller,Wil M.P. van der Aalst,Kristian Kersting
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 17 pages, 6 figures, 1 table, 23 references
Abstract:Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.
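SnapLog 基于帧间相似度矩阵做时间分割;下面是一个简化示意,仅用相邻帧嵌入相似度的阈值切分序列(论文使用完整的逐帧相似度矩阵,此处的阈值与切分规则为本文假设):

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def segment_by_similarity(frame_embs, threshold=0.9):
    """Cut the frame-embedding sequence wherever consecutive
    embeddings drop below a similarity threshold; returns
    half-open (start, end) segments usable as candidate events."""
    cuts = [0]
    for i in range(1, len(frame_embs)):
        if cosine(frame_embs[i - 1], frame_embs[i]) < threshold:
            cuts.append(i)
    cuts.append(len(frame_embs))
    return [(cuts[j], cuts[j + 1]) for j in range(len(cuts) - 1)]
```

分割出的片段再经少样本分类赋予活动标签,并附上时间戳,即可得到常规流程挖掘算法可用的事件日志。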
[CV-25] NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting
【速读】:该论文旨在解决由多视角不一致的2D特征直接提升至3D空间所导致的语义场噪声问题,从而影响下游任务性能。具体而言,视觉基础模型提取的2D特征因缺乏跨视角约束而存在多视图不一致性,若直接将其提升(lift)为3D高斯(3D Gaussians),将生成噪声严重的语义场。现有方法通常在预处理阶段追求多视图特征一致性或通过优化策略抑制噪声,但往往带来额外计算开销或延迟。本文的关键解决方案是提出一种基于方差感知的条件MLP(variance-aware conditional MLP),该模块直接作用于3D高斯,利用其几何与外观属性对3D空间中的语义错误进行修正,从而实现高效且鲁棒的3D语义高斯溅射(Gaussian Splatting)。
链接: https://arxiv.org/abs/2604.22439
作者: Zaiyan Yang,Xinpeng Liu,Heng Guo,Jinglei Shi,Zhanyu Ma,Fumio Okura
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance (北京市多模态数据智能感知与治理重点实验室); The University of Osaka (大阪大学); Nankai University (南开大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.
[CV-26] SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在具身场景中长期空间一致性保持不足的问题,即模型难以从第一人称视角观测中持续更新空间信念,尤其是在环境变化下缺乏对空间关系的稳定记忆。解决方案的关键在于提出SpaMEM(Spatial Memory from Action Sequences),这是一个大规模诊断基准,通过动作条件下的场景变换(生成、放置、移除)来隔离空间信念演化的机制,并构建了一个包含1060万张高保真图像的物理基础数据集,涵盖RGB、深度、实例分割和语义分割四种模态。该基准将具身空间推理形式化为三级层次结构(原子空间感知、时间推理与端到端信念维护),并揭示出当前模型存在坐标一致性的硬性瓶颈以及从文本辅助推理到纯视觉输入时性能显著下降的现象,凸显了符号化支撑依赖问题,从而推动对状态表示、信念修正和长程情景整合机制的显式建模。
链接: https://arxiv.org/abs/2604.22409
作者: Chih-Ting Liao,Xi Xiao,Chunlei Meng,Zhangquan Chen,Yitong Qiao,Weilin Zhou,Tianyang Wang,Xu Zheng,Xin Cao
机构: UNSW Sydney; University of Alabama; Fudan University; Tsinghua University; Zhejiang University; Xinjiang University; HKUST(GZ)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) have advanced static visual–spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.
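该基准以动作条件化的场景变换(spawn、place、remove)驱动空间信念演化,其符号化状态更新可做如下最小示意(动作字段名与异常处理为本文假设,并非基准的实际接口):

```python
def apply_action(scene, action):
    """Advance a symbolic spatial belief state under the benchmark's
    three action-conditioned transformations (field names assumed)."""
    kind, obj = action["type"], action["object"]
    if kind == "spawn":
        scene[obj] = action["position"]
    elif kind == "place":
        if obj not in scene:
            raise KeyError("cannot place an unseen object: %r" % obj)
        scene[obj] = action["position"]
    elif kind == "remove":
        scene.pop(obj, None)
    return scene
```

Level 2 的 oracle 文本历史相当于直接提供这一符号状态序列;Level 3 则要求模型从原始视觉流中自行维护等价的信念。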
[CV-27] Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition
【速读】:该论文旨在解决视觉地点识别(Visual Place Recognition, VPR)中因无关区域导致的感知歧义(perceptual aliasing)以及由于固定候选调度策略引起的低效重排序问题。其解决方案的关键在于提出FoL++方法,通过引入可靠性估计分支(Reliability Estimation Branch)生成空间可靠性图以显式建模遮挡鲁棒性,并结合两种空间对齐损失(SAL和SCEL)优化特征对齐与显著区域增强;同时采用伪对应策略实现弱监督学习下的密集局部特征监督,并设计自适应候选调度器(Adaptive Candidate Scheduler),根据全局相似度动态调整候选池大小,进而加权局部匹配并融合全局与局部证据,从而在保持轻量内存占用的同时显著提升识别精度与推理速度。
链接: https://arxiv.org/abs/2604.22390
作者: Shunpeng Chen,Yukun Song,Changwei Wang,Rongtao Xu,Kexue Fu,Longxiang Gao,Li Guo,Ruisheng Wang,Shibiao Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 13 figures, 10 tables, 1 algorithm
Abstract:Visual Place Recognition (VPR) determines a query image’s geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at this https URL.
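摘要提到 Adaptive Candidate Scheduler 依据全局相似度动态调整候选池大小,一种可能的实现思路可示意如下;其中的阈值、上下限均为本文假设,并非论文参数:

```python
def adaptive_pool_size(global_sims, k_min=2, k_max=100, margin=0.05):
    """Pick the re-ranking pool size from the global-similarity profile:
    a sharp top-1 margin needs few candidates for local verification,
    a flat profile needs many. All thresholds are illustrative."""
    sims = sorted(global_sims, reverse=True)
    if len(sims) < 2:
        return min(k_min, len(sims))
    if sims[0] - sims[1] >= margin:
        return k_min                      # confident retrieval: small pool
    # count candidates whose score is within the margin of the top score
    ambiguous = sum(1 for s in sims if sims[0] - s < margin)
    return max(k_min, min(k_max, ambiguous))
```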
[CV-28] HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos
【速读】:该论文旨在解决基于经直肠超声(TRUS)视频进行前列腺癌分类时面临的三大挑战:信息冗余导致计算成本高、类内与类间相似度高使得特征提取困难,以及信噪比低影响临床相关特征的识别。其解决方案的关键在于提出一种启发式帧选择策略(HFS)与三分支协同特征学习网络(HFS-TriNet):HFS通过动态初始化训练片段起始点来均匀覆盖整个视频序列,有效缓解冗余问题;而HFS-TriNet则结合三个并行分支——标准ResNet50分支、基于预训练医学分割任意模型(SAM)的大模型分支(含归一化注意力模块以捕捉时间一致性),以及小波变换卷积残差(WTCR)分支(在高频域提取病灶边缘信息、低频域实现去噪),从而增强特征表达能力与鲁棒性。
链接: https://arxiv.org/abs/2604.22388
作者: Xu Lu,Qianhong Peng,Qihao Zhou,Shaopeng Liu,Xiuqin Ye,Chuan Yang,Yuan Yuan
机构: Guangdong Polytechnic Normal University (广东技术师范大学); Guangdong Provincial Key Laboratory of Intellectual Property Big Data (广东省知识产权大数据重点实验室); Department of Ultrasound, The First Affiliated Hospital of Jinan University (暨南大学第一附属医院超声科); Department of Ultrasound, Shenzhen People’s Hospital (深圳市人民医院超声科)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. The computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which make it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.
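HFS 的核心是"按固定间隔取帧 + 动态随机化片段起点",使跨轮次采样的片段覆盖整段视频序列,可做如下最小示意(参数均为示意):

```python
import random

def heuristic_frame_selection(num_frames, clip_len, interval, seed=None):
    """Sample frame indices at a fixed interval with a randomized clip
    start, so that clips drawn across epochs cover the whole video."""
    rng = random.Random(seed)
    span = (clip_len - 1) * interval + 1   # frames covered by one clip
    if span > num_frames:
        raise ValueError("clip does not fit into the video")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * interval for i in range(clip_len)]
```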
[CV-29] Efficient Diffusion Distillation via Embedding Loss
【速读】:该论文旨在解决扩散模型蒸馏(diffusion distillation)中生成质量受限与训练效率低的问题,尤其是现有补充损失函数在计算资源消耗大、训练不稳定或依赖预生成数据集等方面的局限性。其解决方案的关键在于提出一种新颖的嵌入损失(Embedding Loss, EL),通过利用一组随机初始化网络提取的特征嵌入(feature embeddings),在嵌入空间中计算最大均值差异(Maximum Mean Discrepancy, MMD),从而实现学生模型与真实数据分布之间的鲁棒对齐,显著提升单步生成器的质量并加速收敛,同时降低对大规模批量训练和长周期训练的依赖。
链接: https://arxiv.org/abs/2604.22379
作者: Jincheng Ying,Yitao Chen,Li Wenlin,Minghui Xu,Yinhao Xiao
机构: Guangdong University of Finance and Economics (广东金融学院); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher’s performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.
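EL 的核心是在随机初始化网络的嵌入空间中计算最大均值差异(MMD)。MMD² 有标准的(有偏)估计形式,下面用 RBF 核的纯 Python 版本说明计算结构;实际训练显然在 GPU 上批量进行:

```python
import math

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd_squared(X, Y, gamma=1.0):
    """Biased estimator of squared Maximum Mean Discrepancy between
    two embedding sets under an RBF kernel:
    MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def mean_k(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return mean_k(X, X) + mean_k(Y, Y) - 2.0 * mean_k(X, Y)
```

当学生生成分布与真实数据分布在嵌入空间重合时 MMD² 趋于 0;EL 即以此作为补充损失对齐两者。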
[CV-30] One Shot Learning for Edge Detection on Point Clouds
【速读】:该论文旨在解决多扫描仪采集点云数据在边缘提取任务中因分布差异导致的性能下降问题,即跨扫描仪训练的网络泛化能力不足。其关键解决方案是提出一种基于单样本学习(one-shot learning)的轻量级网络OSFENet,通过设计基于滤波K近邻(filtered-KNN)的表面补丁表示方法,使模型能够快速适应目标点云的特定数据分布;同时引入RBF_DoS模块,利用径向基函数(Radial Basis Function, RBF)构建表面补丁描述子,显著提升边缘提取精度。该方法在ABC数据集上优于7个基线模型,并在S3DIS、Semantic3D和UrbanBIS等真实场景数据集中验证了实用性。
链接: https://arxiv.org/abs/2604.22354
作者: Zhikun Tu,Yuhe Zhang,Yiou Jia,Kang Li,Daniel Cohen-Or
机构: School of Information Science and Technology, Northwest University, Xi’an, China; Department of Computer Science, Tel Aviv University, Tel Aviv, Israel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 14 figures. Published in IEEE Transactions on Visualization and Computer Graphics
Abstract:Each scanner possesses its unique characteristics and exhibits its distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less effective than training it on data specific to a single scanner. Therefore, we present a novel one-shot learning method allowing for edge extraction on point clouds, by learning the specific data distribution of the target point cloud, and thus achieve superior results compared to networks that were trained on general data distributions. More specifically, we present how to train a lightweight network named OSFENet (One-Shot edge Feature Extraction Network), by designing a filtered-KNN-based surface patch representation that supports a one-shot learning framework. Additionally, we introduce an RBF_DoS module, which integrates Radial Basis Function-based Descriptor of the Surface patch, highly beneficial for the edge extraction on point clouds. The advantage of the proposed OSFENet is demonstrated through comparative analyses against 7 baselines on the ABC dataset, and its practical utility is validated by results across diverse real-scanned datasets, including indoor scenes like S3DIS dataset, and outdoor scenes such as the Semantic3D dataset and UrbanBIS dataset.
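摘要未给出 filtered-KNN 表面补丁的具体过滤规则,这里以"按中位邻居距离剔除离群点"做一个示意(该过滤准则为本文假设,并非论文实现):

```python
def filtered_knn(points, center, k, radius_factor=2.0):
    """k nearest neighbours of `center`, then drop neighbours farther
    than radius_factor x the median neighbour distance. The filtering
    criterion here is an assumption, not the paper's rule."""
    def dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, center)) ** 0.5
    knn = sorted(points, key=dist)[:k]
    median = dist(knn[len(knn) // 2])
    return [p for p in knn if dist(p) <= radius_factor * median]
```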
[CV-31] PoseFM: Relative Camera Pose Estimation Through Flow Matching
【速读】:该论文旨在解决单目视觉里程计(Monocular Visual Odometry, MVO)中缺乏不确定性感知的问题,传统基于深度学习的方法多采用确定性回归,难以在结构稀疏或光照不良等挑战性场景下提供可靠的运动估计。其解决方案的关键在于提出PoseFM框架,首次将帧间单目VO建模为生成式任务,利用流匹配(Flow Matching, FM)技术,通过连续时间常微分方程(ODEs)学习从噪声到真实位姿预测的分布映射,从而实现对相机运动的概率建模与不确定性量化,显著提升了复杂视觉条件下的鲁棒性与精度。
链接: https://arxiv.org/abs/2604.22350
作者: Dominik Kuczkowski,Laura Ruotsalainen
机构: University of Helsinki (赫尔辛基大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at this https URL.
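Flow Matching 的训练目标与采样过程有标准形式:在噪声到数据的路径上回归目标速度场,推理时对学到的 ODE 做数值积分。下面以线性(rectified-flow 风格)路径与欧拉积分做最小示意;PoseFM 的具体位姿参数化摘要中未给出,此处以欧氏向量代替:

```python
def fm_pair(x0, x1, t):
    """Linear (rectified-flow style) path used in flow matching: the
    interpolant x_t and the target velocity v = x1 - x0 that the
    network regresses at time t."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

def euler_sample(x, velocity_fn, steps=10):
    """Draw a sample by integrating dx/dt = v(x, t) from t=0 to t=1."""
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        v = velocity_fn(x, t)
        x = [a + dt * b for a, b in zip(x, v)]
        t += dt
    return x
```

从不同噪声出发多次积分即可得到位姿样本的经验分布,其离散程度可作为不确定性估计。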
[CV-32] Flow4DGS-SLAM: Optical Flow-Guided 4D Gaussian Splatting SLAM
【速读】:该论文旨在解决动态环境中视觉SLAM(Visual Simultaneous Localization and Mapping)的挑战,特别是如何高效重建静态与动态区域并保持相机位姿估计的鲁棒性。其关键解决方案在于提出一种基于光流引导的动态3D高斯溅射(3D Gaussian Splatting, 3DGS)SLAM框架:首先通过拟合相机自运动模型生成类别无关的运动掩码,实现动态与静态高斯点的分离,并提供流引导的相机位姿初始值;其次,在关键帧上显式建模动态高斯点的时序中心,并利用3D场景光流先验进行传播,结合自适应插入策略实现高效训练;最后,采用高斯混合模型(Gaussian Mixture Model, GMM)对时序不透明度和旋转进行建模,以自适应学习复杂动态特性。
链接: https://arxiv.org/abs/2604.22339
作者: Yunsong Wang,Gim Hee Lee
机构: School of Computing, National University of Singapore (计算学院,新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Handling the dynamic environments is a significant research challenge in Visual Simultaneous Localization and Mapping (SLAM). Recent research combines 3D Gaussian Splatting (3DGS) with SLAM to achieve both robust camera pose estimation and photorealistic renderings. However, using SLAM to efficiently reconstruct both static and dynamic regions remains challenging. In this work, we propose an efficient framework for dynamic 3DGS SLAM guided by optical flow. Using the input depth and prior optical flow, we first propose a category-agnostic motion mask generation strategy by fitting a camera ego-motion model to decompose the optical flow. This module separates dynamic and static Gaussians and simultaneously provides flow-guided camera pose initialization. We boost the training speed of dynamic 3DGS by explicitly modeling their temporal centers at keyframes. These centers are propagated using 3D scene flow priors and are dynamically initialized with an adaptive insertion strategy. Alongside this, we model the temporal opacity and rotation using a Gaussian Mixture Model (GMM) to adaptively learn the complex dynamics. The empirical results demonstrate our state-of-the-art performance in tracking, dynamic reconstruction, and training efficiency.
[CV-33] FILTR: Extracting Topological Features from Pretrained 3D Models
【速读】:该论文旨在解决如何从预训练的3D点云编码器(如Point-BERT、Point-MAE)中提取拓扑信息的问题,特别是能否利用这些编码器产生的特征来近似生成持久性图(persistence diagrams)。其核心解决方案是提出FILTR(Filtration Transformer),一种可学习的框架,将持久性图生成视为集合预测任务,并通过一个Transformer解码器直接从冻结的3D编码器输出中预测持久性图。关键创新在于首次实现了通过高效可学习的前馈机制,从原始点云数据中驱动式地提取持久性图,从而揭示了现有编码器虽仅保留有限全局拓扑信号,但仍可通过FILTR有效挖掘并逼近拓扑结构的能力。
链接: https://arxiv.org/abs/2604.22334
作者: Louis Martinez,Maks Ovsjanikov
机构: LIX, École Polytechnique, IP Paris (巴黎综合理工学院 LIX 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape’s multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.
[CV-34] ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding
【速读】:该论文旨在解决现有遥感灾害评估方法在应对复杂战略查询时难以提供可操作情报的问题,具体表现为:依赖单一光学模态、对自然灾害存在显著偏向,以及缺乏与场景的交互式理解能力。其解决方案的关键在于提出一个统一的多模态框架ChangeQuery,通过构建大规模的灾变诱导变化查询(Disaster-Induced Change Query, DICQ)数据集,融合灾前光学语义与灾后合成孔径雷达(Synthetic Aperture Radar, SAR)结构特征,并设计一种“先统计后生成”的自动化语义标注流水线,将原始分割掩膜转化为具有空间层次性和定量精度的指令集,从而赋予模型细粒度的空间感知和交互推理能力。
链接: https://arxiv.org/abs/2604.22333
作者: Dongwei Sun,Jing Yao,Kan Wei,Xiangyong Cao,Chen Wu,Zhenghui Zhao,Pedram Ghamisi,Jun Zhou,Jón Atli Benediktsson
机构: Xi’an Jiaotong University (西安交通大学); Chinese Academy of Sciences (中国科学院); Wuhan University (武汉大学); Helmholtz-Zentrum Dresden-Rossendorf (德累斯顿罗斯多夫赫尔姆霍兹中心); Lancaster University (兰卡斯特大学); University of Iceland (冰岛大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a ``statistics-first, generation-later’’ paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at \hrefthis https URLthis https URL.
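"先统计后生成"标注流水线的第一步,是从分割掩膜中提取可验证的定量统计,再交由模板或 LLM 生成指令文本。统计步骤可示意如下(类别与字段名为本文假设):

```python
from collections import Counter

def mask_statistics(mask, class_names):
    """'Statistics-first' step: reduce a segmentation mask (2D list of
    class ids) to per-class pixel counts and area ratios; a template
    or LLM then verbalizes these grounded numbers."""
    counts = Counter(c for row in mask for c in row)
    total = sum(counts.values())
    return {class_names[c]: {"pixels": n, "ratio": n / total}
            for c, n in sorted(counts.items())}
```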
[CV-35] Depth-Aware Rover: A Study of Edge AI and Monocular Vision for Real-World Implementation
【速读】:该论文旨在解决月面等地外探测场景中移动机器人(rover)在复杂地形下实现高效、可靠导航的问题,特别是在资源受限条件下如何平衡深度感知精度与实时性。其解决方案的关键在于从传统的双目立体视觉(stereo vision)向基于边缘人工智能(edge AI)的单目深度估计(monocular depth estimation)转变,通过在Unity仿真环境中使用StereoSGBM算法生成稠密视差图验证方法可行性,并在真实物理平台上采用UniDepthV2进行单目度量深度估计与YOLO12n实现目标检测,从而在保证一定精度的同时显著提升系统鲁棒性和部署成本效益。
链接: https://arxiv.org/abs/2604.22331
作者: Lomash Relia,Jai G Singla,Amitabh,Nitant Dube
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE
Abstract:This study analyses simulated and real-world implementations of depth-aware rover navigation, highlighting the transition from stereo vision to monocular depth estimation using edge AI. A Unity-based lunar terrain simulator with stereo cameras and OpenCV’s StereoSGBM was used to generate disparity maps. A physical rover built on Raspberry Pi 4 employed UniDepthV2 for monocular metric depth estimation and YOLO12n for real-time object detection. While stereo vision yielded higher accuracy in simulation, the monocular approach proved more robust and cost-effective in real-world deployment, achieving 0.1 FPS for depth and 10 FPS for detection.
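仿真部分由 StereoSGBM 视差图恢复深度,依赖标准的双目几何关系 depth = f·B/d(f 为像素焦距,B 为基线),可示意如下:

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Standard stereo relation depth = f * B / d; returns None for
    non-positive disparity (no match / point at infinity)."""
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px
```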
[CV-36] Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization CVPR2026
【速读】:该论文旨在解决隐私保护图像查询(Privacy-Preserving Image Queries, PPIQ)中现有几何类和分割类混淆方法在面对新型隐私攻击时的脆弱性问题,尤其是几何类方法因保留关键点空间分布特征而易被恢复原始位置。其解决方案的关键在于提出一种名为“双收敛线”(Dual Convergent Lines, DCL)的新颖关键点混淆机制:DCL 在中心分割线上固定两个锚点,将每个关键点映射为从其中一个锚点出发的直线,且激活锚点由关键点位置决定;这种设计使攻击者试图通过优化恢复原始点的策略变得病态——邻近线要么错误地汇聚至单一锚点导致平凡解,要么在分割边界处趋于平行从而产生高方差不稳定解,从而有效阻断隐私泄露路径。该方法兼容现有基于线的定位求解器,具备部署于传统视觉定位流水线的能力,并在室内与大规模室外数据集上验证了其抗隐私攻击能力、效率及可扩展性。
链接: https://arxiv.org/abs/2604.22310
作者: Jeonggon Kim,Heejoon Moon,Je Hyeong Hong
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 (oral). Supplementary material included after references. 18 pages, 11 figures, 8 tables
Abstract:Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attack. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint’s location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: Neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL’s robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.
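DCL 的核心操作是把每个关键点提升为"从两个固定锚点之一出发的直线",激活锚点由关键点相对中心分割线的位置决定。下面是一个二维几何示意;分侧规则与返回的线表示均为本文假设:

```python
def dcl_lift(keypoint, anchor_left, anchor_right, partition_x):
    """Replace a keypoint by (anchor, unit direction): the line through
    one of two fixed anchors on the central partition line. The side
    rule and the returned representation are illustrative."""
    ax, ay = anchor_left if keypoint[0] < partition_x else anchor_right
    dx, dy = keypoint[0] - ax, keypoint[1] - ay
    norm = (dx * dx + dy * dy) ** 0.5
    return (ax, ay), (dx / norm, dy / norm)   # original point is discarded
```

攻击者仅能观察到锚点与方向:同侧邻近线全部汇聚于同一锚点,给不出非平凡交点;靠近分割线的点则落在两侧近平行的线上,恢复解方差极大。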
[CV-37] Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation
[Quick Read]: This paper targets the lack of scientific accuracy when generative text-to-image (T2I) models are applied to knowledge-intensive generation: models can produce visually realistic images that nonetheless contain logical errors, inaccurate symbols, or violations of disciplinary conventions in subjects such as biology, chemistry, and geography. The key solution is the KE-Check framework with two core stages: (1) Knowledge Elaboration, which deepens semantic and knowledge grounding through structured prompt enrichment, and (2) Checklist-Guided Refinement, which identifies and corrects violations via explicit constraints, substantially improving scientific fidelity and narrowing the gap between open-source models and leading closed-source systems.
Link: https://arxiv.org/abs/2604.22302
Authors: Ran Zhao, Sheng Jin, Size Wu, Kang Liao, Zerui Gong, Zujin Guo, Yang Xiao, Wei Li
Affiliations: Huazhong University of Science and Technology; The University of Hong Kong; S-Lab, Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and codes are publicly available at this https URL.
[CV-38] Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment
[Quick Read]: This study addresses the reliability of synthetic image generation for lunar exploration missions, in support of autonomous navigation and decision-making systems. The core challenge is producing high-fidelity simulated imagery that accurately reflects the lunar surface environment, improving the match between virtual testing and actual missions. The key elements of the solution are the use of real Digital Elevation Models (DEM) and terrain data (from the Chandrayaan-2 Orbiter High Resolution Camera, OHRC, and NASA's Wide Angle Camera, WAC, and Narrow Angle Camera, NAC) and a systematic evaluation of how different camera models and illumination conditions affect the quality of synthetic lunar images, thereby refining the simulation approach and enhancing its usefulness for landing-site assessment, hazard detection, and navigation validation.
Link: https://arxiv.org/abs/2604.22296
Authors: Jai G Singla, Hinal B Patel, Nitant Dube
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Synthetic image generation is one of the crucial input for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems and plan exploration activities in a virtual environment before actual deployment. Image simulation is essential for assessing landing sites, detecting hazards, and validating navigation systems in a missions. This study offers a detailed evaluation of various image simulation approaches for the lunar environment, with particular emphasis on the effects of different camera models and light illumination conditions on the quality of synthetic lunar images. These images are produced using real Digital Elevation Models (DEM) and terrain data derived from instruments such as Chandrayaan-2 Orbiter High Resolution Camera (OHRC) and NASA’s Wide Angle Camera (WAC), and Narrow Angle Camera (NAC) instruments. This research aims to improve the reliability of synthetic imagery in supporting autonomous navigation and decision-making systems in lunar exploration. This work contributes to the development of more effective tools for generating important information for future lunar missions and enhances the understanding of the moon’s surface environment.
[CV-39] DocPrune: Efficient Document Question Answering via Background Question and Comprehension-aware Token Pruning CVPR2026
[Quick Read]: This paper addresses the inefficiency of vision-language models in long-document understanding: document images contain large amounts of background and only sparse structured cues (text, tables, figures), so substantial computation is wasted, especially in document question answering. The key solution is DocPrune, a training-free, progressive document-token pruning framework that automatically identifies and keeps task-essential tokens (e.g., key text or table content) while discarding background or question-irrelevant ones, and that dynamically selects the best layer at which to start pruning based on the model's level of comprehension. Experiments on M3DocRAG show throughput gains of 3.0x in the encoder and 3.3x in the decoder together with a +1.0 F1 improvement, boosting inference efficiency without sacrificing accuracy.
Link: https://arxiv.org/abs/2604.22281
Authors: Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim
Affiliations: Korea University; KAIST; Adobe Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026
Abstract:Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model’s level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
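As an illustrative aside (this is not DocPrune's implementation; the function name and toy scores are invented), the core idea of keeping only task-essential tokens can be sketched as generic top-k pruning by relevance score:

```python
# Hedged sketch of score-based token pruning (an assumption for
# illustration, not DocPrune itself): keep the k highest-scoring document
# tokens and drop the rest, preserving reading order.
def prune_tokens(tokens, scores, keep_k):
    keep = sorted(range(len(tokens)), key=scores.__getitem__, reverse=True)[:keep_k]
    keep.sort()  # restore original reading order
    return [tokens[i] for i in keep]

# Toy example: background patches get low scores, answer evidence high ones.
tokens = ["bg", "Total:", "bg", "42", "bg", "bg"]
scores = [0.01, 0.90, 0.02, 0.95, 0.01, 0.03]
print(prune_tokens(tokens, scores, keep_k=2))  # ['Total:', '42']
```

In the actual method the relevance signal would come from the model's own attention and the pruning layer is chosen adaptively; the sketch only shows the selection step.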
[CV-40] Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
[Quick Read]: This paper addresses the redundant thinking steps and semantic ambiguity that Chain-of-Thought (CoT) reasoning introduces into generative multimodal embeddings across broad retrieval scenarios. The key solution is RIME (Rewrite-driven Multimodal Embedding), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite; Cross-Mode Alignment (CMA) bridges the generative and discriminative embedding spaces to support flexible mutual retrieval that trades off efficiency and accuracy, and Refine Reinforcement Learning (Refine-RL) treats discriminative embeddings as stable semantic anchors to guide rewrite optimization, markedly improving embedding quality while shortening the chain of thought.
Link: https://arxiv.org/abs/2604.22280
Authors: Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun
Affiliations: WeChat Vision, Tencent Inc.; Zhejiang University; Tsinghua University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings on several embedding tasks. However, Chain-of-Thought (CoT) reasoning tends to generate redundant thinking steps and introduce semantic ambiguity in the summarized answers in broader retrieval scenarios. To address this limitation, we propose Rewrite-driven Multimodal Embedding (RIME), a unified framework that jointly optimizes generation and embedding through a retrieval-friendly rewrite. Meanwhile, we present the Cross-Mode Alignment (CMA) to bridge the generative and discriminative embedding spaces, enabling flexible mutual retrieval to trade off efficiency and accuracy. Based on this, we also introduce Refine Reinforcement Learning (Refine-RL) that treats discriminative embeddings as stable semantic anchors to guide the rewrite optimization. Extensive experiments on MMEB-V2, MRMR and UVRB demonstrate that RIME substantially outperforms prior generative embedding models while significantly reducing the length of thinking.
[CV-41] CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
[Quick Read]: This paper addresses a reliability issue in open-vocabulary scene graph generation (SGG): although vision-language-model-based methods expand predicate coverage, their predicted relations may be driven by language priors or object co-occurrence bias rather than grounded visual evidence. The key solution is an evidence-grounded framework built on counterfactual relation verification: predicate phrases are decomposed into soft evidence bases such as support, contact, containment, depth, motion, and state; a relation-conditioned evidence encoder extracts predicate-relevant cues; and a counterfactual verifier tests whether the relation score drops when necessary evidence is removed while remaining stable under irrelevant perturbations, yielding interpretable and visually grounded relation predictions.
Link: https://arxiv.org/abs/2604.22274
Authors: Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen
Affiliations: Institute of Intelligent Vision and Embodied Cognition
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.
[CV-42] Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
[Quick Read]: This paper addresses the absence of safety-oriented open-ended visual question answering (VQA) capabilities and corresponding foundation models for urban transportation systems, especially over heterogeneous, multi-source roadside cameras: existing work focuses on microscopic autonomous driving (AD) and pays limited attention to macroscopic city-scale traffic analysis. The key solution is the Land Transportation Dataset (LTD), a large-scale, high-quality vision-language dataset of 11.6K VQA pairs covering diverse road geometries, traffic participants, illumination, and adverse weather, with three complementary tasks (fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis) that require joint reasoning across views. Building on LTD, the authors propose UniVLT, a unified transportation foundation model trained via curriculum-based knowledge transfer that integrates microscopic AD reasoning and macroscopic traffic analysis in a single architecture, achieving SOTA performance on LTD and multiple AD benchmarks while exposing the limitations of existing foundation models in complex multi-view traffic scenarios.
Link: https://arxiv.org/abs/2604.22260
Authors: Wenhui Huang, Songyan Zhang, Collister Chua, Yang Liang, Zhiqi Mao, Heng Yang, Chen Lv
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
[CV-43] OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space
[Quick Read]: This paper addresses the difficulty current generative world models have in orchestrating complex, sequential multi-agent interactions for autonomous-driving simulation: existing methods depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text and cannot model dynamic consistency at the semantic-spatiotemporal level. The key solution is OccDirector, which maps natural-language scripts directly to physically plausible 4D occupancy dynamics through a VLM-driven Spatio-Temporal MMDiT with a history-prefix anchoring strategy, generating coherent multi-agent behavior sequences without geometric priors and shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
Link: https://arxiv.org/abs/2604.22240
Authors: Zhuding Liang, Tianyi Yan, Dubing Chen, Jiasen Zheng, Huan Zheng, Cheng-zhong Xu, Yida Wang, Kun Zhan, Jianbing Shen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a "scenario director", OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions: ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.
[CV-44] Towards Temporal Compositional Reasoning in Long-Form Sports Videos
[Quick Read]: This paper addresses the difficulty of long-horizon reasoning in sports videos, where answering complex questions requires locating temporally sparse evidence and integrating it into the reasoning process, a setting in which existing multimodal large language models (MLLMs) fall short. Two core factors are identified: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. The key solution is Chain-of-Time Reasoning (CoTR), which treats reasoning as temporally grounded evidence composition: during training, a temporal-reward GRPO encourages temporally anchored reasoning; during inference, an anchor-observe-infer evidence-seeking loop iteratively localizes, verifies, and composes temporal evidence, markedly improving temporal compositional reasoning and step-wise temporal grounding accuracy.
Link: https://arxiv.org/abs/2604.22226
Authors: Siyu Cao, Lu Zhang, Ruizhe Zeng, Zhi-yong Liu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
[CV-45] Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework
[Quick Read]: This paper addresses the lag of watermark-attack techniques behind watermarking advances in copyright protection for generative AI imagery, which has broken the attack-defense balance and held back progress in the field. The key solution is FMDiffWA, a frequency-domain modulated diffusion framework whose core innovation is a Frequency-domain Watermark Modulation (FWM) module embedded in both the forward and reverse sampling stages of the diffusion process, enabling selective modulation of watermark-related frequency components; augmenting the conventional noise-estimation objective with an auxiliary refinement constraint in the training strategy yields a better trade-off between attack efficacy and visual fidelity, markedly improving attack effectiveness and generalization across diverse watermarking schemes.
Link: https://arxiv.org/abs/2604.22220
Authors: Chunpeng Wang, Binyan Qu, Xiaoyu Wang, Zhiqiu Xia, Shanshan Zhang, Yunan Liu, Qi Li
Affiliations: Qilu University of Technology (Shandong Academy of Sciences); Dalian Maritime University; Nanjing University of Science and Technology; Shandong Jianzhu University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance and hindered further advances in the field. In this paper, we propose FMDiffWA, a frequency-domain modulated diffusion framework for watermark attacks. Specifically, we introduce a frequency-domain watermark modulation (FWM) module and incorporate it into the sampling stages both the forward and reverse diffusion processes. This mechanism enables selective modulation of watermark-related frequency components, thereby allowing FMDiffWA to effectively neutralize the invisible watermark signals while preserving the perceptual quality of the attacked watermarked images. To achieve a better trade-off between attack efficacy and visual fidelity, we reformulate the training strategy of conventional diffusion models by augmenting the canonical noise estimation objective with an auxiliary refinement constraint. Comprehensive experiments demonstrate that FMDiffWA achieves superior visual fidelity compared to existing watermark attacks, while exhibiting strong generalization across diverse watermarking schemes.
[CV-46] ArchSym: Detecting 3D-Grounded Architectural Symmetries in the Wild
[Quick Read]: This paper tackles the difficulty of detecting 3D-grounded reflectional symmetries from a single in-the-wild RGB image: existing methods are trained mostly on object-centric or synthetic datasets and generalize poorly to real scenes, and because monocular input is scale-ambiguous, most methods predict only the orientation of the symmetry plane rather than its exact position in 3D. The solution has two key parts: (1) a scalable annotation pipeline that leverages cross-view image matching to automatically curate ArchSym, a large-scale dataset of architectural symmetries from SfM reconstructions; and (2) a single-view symmetry detector that parameterizes symmetries as signed distance maps defined relative to predicted scene geometry, enabling accurate localization of symmetry planes in 3D.
Link: https://arxiv.org/abs/2604.22202
Authors: Hanyu Chen, Ruojin Cai, Steve Marschner, Noah Snavely
Affiliations: Cornell University; Kempner Institute, Harvard University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: project page: this https URL
Abstract:Symmetry detection is a fundamental problem in computer vision, and symmetries serve as powerful priors for downstream tasks. However, existing learning-based methods for detecting 3D symmetries from single images have been almost exclusively trained and evaluated on object-centric or synthetic datasets, and thus fail to generalize to real-world scenes. Furthermore, due to the inherent scale ambiguity of monocular inputs, which makes localizing the 3D plane an ill-posed problem, many existing works only predict the plane’s orientation. In this paper, we address these limitations by presenting the first framework for detecting 3D-grounded reflectional symmetries from single, in-the-wild RGB images, focusing on architectural landmarks. We introduce two key innovations: (1) a scalable data annotation pipeline to automatically curate a large-scale dataset of architectural symmetries, ArchSym, from SfM reconstructions by leveraging cross-view image matching; and building on the dataset, (2) a single-view symmetry detector that accurately localizes symmetries in 3D by parameterizing them as signed distance maps defined relative to predicted scene geometry. We validate our symmetry annotation pipeline against geometry-based alternatives and demonstrate that our symmetry detector significantly outperforms state-of-the-art baselines on our new benchmark.
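To make the "signed distance" parameterization concrete, here is a minimal, hypothetical sketch (not ArchSym code; the helper names are invented): a plane with unit normal n and offset d assigns each 3D point x the value n·x + d, whose sign indicates which side of the plane the point lies on, and reflection across the plane follows directly.

```python
# Hedged sketch (not the paper's implementation): signed distance of 3D
# points to a reflection plane n·x + d = 0, with n a unit normal, plus
# the mirror map induced by the plane.
def plane_signed_distance(point, normal, offset):
    """Signed distance from `point` to the plane normal·x + offset = 0."""
    return sum(p * n for p, n in zip(point, normal)) + offset

def reflect(point, normal, offset):
    """Mirror `point` across the plane (assumes `normal` is unit length)."""
    d = plane_signed_distance(point, normal, offset)
    return [p - 2.0 * d * n for p, n in zip(point, normal)]

# Vertical mirror plane x = 0 (normal along +x):
n, off = (1.0, 0.0, 0.0), 0.0
p = (2.0, 1.0, -3.0)
print(plane_signed_distance(p, n, off))  # 2.0
print(reflect(p, n, off))                # [-2.0, 1.0, -3.0]
```

A dense grid of such signed values over predicted scene geometry would form a "signed distance map" in the sense the abstract describes.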
[CV-47] CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
[Quick Read]: This paper addresses two core challenges facing vision-language models (VLMs) in chart-to-code generation: existing approaches are constrained by homogeneous training pairs that conflate visual perception with program logic, under-exploiting multimodal supervision; and alignment relies on heuristic scoring rather than verifiable, objective criteria, lacking reliable quality control. The key solution is CharTide, a data-centric framework with two innovations: a Tri-Perspective Tuning strategy builds a 2M-sample dataset that explicitly decouples training into visual-perception, pure-text code-logic, and modality-fusion streams, enabling a 7B model to surpass specialized baselines with supervised learning alone; and alignment is reformulated as data verification under an information-invariance principle via an Inquiry-Driven RL framework in which a frozen Inspector objectively verifies the consistency of generated charts through atomic QA tasks, providing measurable reward signals that substantially improve generation quality and robustness.
Link: https://arxiv.org/abs/2604.22192
Authors: Xiangxi Zheng, Kuang He, Jiayi Hu, Ping Yu, Rui Yan, Yuan Yao, Peng Hou, Anxiang Zeng, Alex Jinpeng Wang
Affiliations: Nanjing University; Shopee Pte. Ltd.; East China Normal University; Nanjing University of Science and Technology; Central South University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.
[CV-48] From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification
[Quick Read]: This paper addresses the fragility of CLIP-based person re-identification (ReID) representations under occlusion and cross-camera variation: existing methods aggregate spatial features into a single global [CLS] token that is optimized for image-text alignment but lacks spatial selectivity. The key solution, SAGA-ReID, reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space, emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring per-image textual descriptions. The mechanism outperforms global pooling under both synthetic masking and realistic human distractors, with gains growing as occlusion increases, confirming that structured reconstruction overcomes a bottleneck that backbone quality and architectural complexity alone cannot resolve.
Link: https://arxiv.org/abs/2604.22190
Authors: Aotian Zheng, Winston Sun, Bahaa Alattar, Vitaly Ablavsky, Jenq-Neng Hwang
Affiliations: University of Washington
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures
Abstract:CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space – emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions – synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal – with SAGA's advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at this https URL.
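As an illustrative aside (a generic sketch, not SAGA-ReID's implementation; `anchor_weighted_pool` and the toy vectors are invented), the contrast between a single global token and anchor-guided aggregation can be shown with softmax-weighted pooling: patch tokens that align with an anchor vector dominate the pooled representation, while dissimilar patches (e.g., occluders) contribute little.

```python
# Hedged sketch (not the paper's code): score each patch token against an
# anchor vector, turn the scores into softmax weights, and return the
# weighted sum, so anchor-unlike patches are suppressed.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def anchor_weighted_pool(patch_tokens, anchor, temperature=1.0):
    scores = [dot(t, anchor) / temperature for t in patch_tokens]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(patch_tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, patch_tokens)) for i in range(dim)]

# One patch aligned with the anchor, one pointing away; mean pooling would
# give [0, 0], but anchor weighting keeps the aligned patch dominant.
tokens = [[1.0, 0.0], [-1.0, 0.0]]
anchor = [1.0, 0.0]
pooled = anchor_weighted_pool(tokens, anchor, temperature=0.5)
print(pooled[0] > 0.9)  # True
```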
[CV-49] EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting ICME2026
[Quick Read]: This paper tackles the difficulty of sharp 3D reconstruction from motion-blurred images alone, and in particular how to exploit the microsecond temporal resolution of event cameras while overcoming the residual artifacts and lost texture detail caused by inaccurate event double-integral priors and noisy, blurry events. The key solution is EvFlow-GS, a unified framework that jointly optimizes a learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) on the fly: edge information extracted from events via optical flow forms a novel event-based loss; an event-residual prior strengthens supervision of intensity changes between images rendered from 3DGS; and the outputs of 3DGS and the LDI are integrated into a joint loss so the two mutually facilitate each other end to end, markedly improving reconstruction quality and robustness.
Link: https://arxiv.org/abs/2604.22183
Authors: Feiyu An, Yufei Deng, Zihui Zhang, Rong Xiao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICME 2026
Abstract:Achieving sharp 3D reconstruction from motion-blurred images alone becomes challenging, motivating recent methods to incorporate event cameras, benefiting from microsecond temporal resolution. However, they suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) jointly on-the-fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, enabling their optimization to mutually facilitate each other. Experiments demonstrate the leading performance of our EvFlow-GS.
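For background on the "double integral prior" the abstract refers to (a formulation from earlier event-based deblurring work, commonly called the EDI model; the paper's learnable variant may differ), a blurry frame B is the temporal average of latent sharp images L(x,t) over the exposure T, and the event stream e(x,s) with contrast threshold c relates each L(x,t) to a reference sharp image L(x,f):

```latex
B(\mathbf{x}) = \frac{1}{T}\int_{f-T/2}^{f+T/2} L(\mathbf{x},t)\,\mathrm{d}t,
\qquad
L(\mathbf{x},t) = L(\mathbf{x},f)\,\exp\!\left(c\int_{f}^{t} e(\mathbf{x},s)\,\mathrm{d}s\right),
```

```latex
B(\mathbf{x}) = L(\mathbf{x},f)\cdot\frac{1}{T}\int_{f-T/2}^{f+T/2}
\exp\!\left(c\int_{f}^{t} e(\mathbf{x},s)\,\mathrm{d}s\right)\mathrm{d}t .
```

The nested inner-then-outer integration is the "double integral"; EvFlow-GS replaces this hand-derived term with a learnable module (the LDI) optimized jointly with camera poses and 3DGS.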
[CV-50] Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities CVPR2026
[Quick Read]: This paper addresses the degradation of brain-tumor segmentation from multimodal MRI when, as is common in clinical scans, one or more modalities are missing. The key solution, UniME (Uni-Encoder Meets Multi-Encoders), is a two-stage heterogeneous architecture that achieves robustness to missing modalities by decoupling representation learning from segmentation: Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to build a unified global representation insensitive to missing modalities; Stage 2 adds modality-specific CNN Multi-Encoders that extract high-resolution, multi-scale, fine-grained features and fuses them with the global representation, enabling precise segmentation even under incomplete modality conditions.
Link: https://arxiv.org/abs/2604.22177
Authors: Peibo Song, Xiaotian Xue, Jinshuo Zhang, Zihao Wang, Jinhua Liu, Shujun Fu, Fangxun Bao, Si Yong Yeo
Affiliations: Shandong University; The University of Tokyo; Nanyang Technological University, Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2026 Poster
Abstract:Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at this https URL
[CV-51] Unlocking Optical Prior: Spectrum-Guided Knowledge Transfer for SAR Generalized Category Discovery
[Quick Read]: This paper addresses the limited generalization of large vision models (LVMs) in the label-scarce synthetic aperture radar (SAR) domain, caused by cross-modal incompatibility between their inherent optical priors and SAR imagery; existing domain-adaptation methods lack an inductive bias reflecting imaging characteristics and thus transfer optical priors to SAR poorly. The key solution is the MDC-guided Cross-modal Prior Transfer (MCPT) framework, built on the Modal Discrepancy Curve (MDC), a structured frequency-domain descriptor that quantifies cross-modal discrepancy: Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, Frequency-aware Expert Refinement (FER) performs band-wise, discrepancy-aware feature refinement with these tokens, and contrastive learning aligns cross-modal embeddings and internalizes the adaptation pattern, substantially improving generalized category discovery in the SAR domain.
Link: https://arxiv.org/abs/2604.22174
Authors: Jingyuan Xia, Ruikang Hu, Ye Li, Zhixiong Yang, Xu Lan, Zhejun Lu
Affiliations: National University of Defense Technology; State Key Laboratory of Complex System Simulation and Modeling Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generalized Category Discovery (GCD) holds significant promise for the label-scarce Synthetic Aperture Radar (SAR) domain, yet its efficacy is severely constrained by the cross-modal incompatibility between the inherent optical prior of the Large Vision Models (LVMs) and SAR imagery. Existing domain adaptation methods often lack an inductive bias that reflects imaging characteristics, consequently failing to effectively transfer optical prior into the SAR domain. To address this issue, the Modal Discrepancy Curve (MDC) is introduced to model cross-modal discrepancy as a structured frequency-domain descriptor derived from spectral energy distributions. Leveraging this formulation, we propose the MDC-guided Cross-modal Prior Transfer (MCPT) framework, a pre-training paradigm that operates on paired optical-SAR data. Within this framework, Adaptive Frequency Tokenization (AFT) converts the MDC into learnable tokens, and Frequency-aware Expert Refinement (FER) performs band-wise discrepancy-aware feature refinement using these tokens. Based on the refined representations, contrastive learning aligns refined embeddings across modalities and internalizes the adaptation pattern. Ultimately, the superior SAR feature representation capability learned during paired pre-training is applied to downstream single-modal SAR-GCD tasks. Extensive experiments demonstrate state-of-the-art performance across multiple mainstream datasets, indicating that frequency-domain discrepancy modeling enables more effective adaptation of optical prior to SAR imagery.
[CV-52] Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models
【速读】:该论文旨在解决多主体交互场景下,基于一个个体的运动序列生成另一个个体运动的问题,即在相互依赖的运动关系中实现交互感知的运动生成。其关键解决方案在于引入显式的人员ID嵌入(person ID embedding),以区分不同个体并保持结构一致性,从而有效捕捉交互动态;同时对比了三种Transformer架构(简单Transformer、iTransformer和Crossformer),发现简单Transformer在避免姿态坍塌(posture collapse)方面表现最优,而其他两种模型随时间累积误差导致运动不稳定。
链接: https://arxiv.org/abs/2604.22164
作者: Masato Soga,Ryuki Takebayashi
机构: Wakayama University (和歌山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages
Abstract:Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person’s motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.
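摘要中的人员 ID 嵌入(person ID embedding)思想可以用很少的代码说明:为每个个体学习一个向量,叠加到该个体的运动 token 上,使模型能区分动作发起者与反应者。以下为示意性草图(维度、初始化与线性投影均为本文假设,非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)

class MotionTokenizer:
    """Turns per-frame pose vectors into tokens, adding a person-ID embedding
    so the model can tell the two interacting individuals apart."""
    def __init__(self, pose_dim=51, d_model=64, n_persons=2):
        self.w = rng.normal(0, 0.02, (pose_dim, d_model))        # pose projection
        self.id_emb = rng.normal(0, 0.02, (n_persons, d_model))  # person-ID table

    def __call__(self, poses, person_id):
        # poses: (T, pose_dim) -> tokens: (T, d_model)
        return poses @ self.w + self.id_emb[person_id]

tok = MotionTokenizer()
actor = tok(rng.normal(size=(16, 51)), person_id=0)
reactor = tok(rng.normal(size=(16, 51)), person_id=1)
```

同一段姿态序列在不同 person_id 下得到的 token 仅相差一个常量嵌入向量,这正是模型用于保持个体结构一致性的信号。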
[CV-53] SAMIDARE: Advanced Tracking-by-Segmentation for Dense Scenarios
【速读】:该论文旨在解决体育场景中多目标跟踪(Multi-Object Tracking, MOT)的挑战,特别是针对基于分割的方法在密集场景下容易出现掩码错误(mask errors)和ID切换(ID switches)的问题。其解决方案的关键在于提出SAMIDARE框架,通过三个核心组件实现:(1) 密度感知的掩码重生成机制,以自适应地控制掩码质量并保持目标特征完整性;(2) 选择性记忆更新策略,优化目标特征的存储与维护;(3) 状态感知的关联与新轨迹初始化方法,提升在相互遮挡和频繁帧丢失情况下的鲁棒性。该方案显著提升了密集体育场景中的跟踪性能,在SportsMOT数据集上相较基线模型在HOTA和IDF1指标上分别提升2.5和4.2点。
链接: https://arxiv.org/abs/2604.22162
作者: Shozaburo Hirano,Norimichi Ukita
机构: Toyota Technological Institute (丰田工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated sports analysis demands robust multi-object tracking (MOT), yet segmentation-based methods often struggle with mask errors and ID switches in dense scenes. We propose SAMIDARE, a framework that enhances SAM2MOT for crowded scenes through three key components: (1) density-aware mask re-generation and (2) selective memory updates, both for adaptive mask control to preserve target feature integrity, and (3) state-aware association and new track initialization, which improves robustness under mutual occlusions and frequent frame-out events. Evaluated on the SportsMOT dataset, SAMIDARE achieves state-of-the-art performance, outperforming the baseline by 2.5 HOTA and 4.2 IDF1 points on the validation set. These results demonstrate that adaptive feature management using mask control and state-aware association provide a robust and efficient solution for dense sports tracking. Code is available at this https URL
[CV-54] GenMatter: Perceiving Physical Objects with Generative Matter Models CVPR2026
【速读】:该论文旨在解决现有计算机视觉系统在跨不同场景(如稀疏运动点、纹理表面和自然视频)中缺乏统一的运动感知与物体分割方法的问题。其核心挑战在于如何从多样的输入中稳定地识别并分离出具有物理独立性的可动实体。解决方案的关键在于提出一种基于人类视觉原理的生成式模型,该模型通过层级结构将低层运动线索与高层外观特征聚合成“粒子”(particles,即代表局部物质的小高斯分布),再进一步将粒子聚类为能捕捉协同且独立运动的物理实体的簇群;同时开发了基于并行化块吉布斯采样(block Gibbs sampling)的硬件加速推理算法,以实现对粒子运动及分组的稳定恢复,从而在随机点动力图、仿格式塔的遮蔽旋转物体以及自然RGB视频等多种场景下均表现出类人感知能力。
链接: https://arxiv.org/abs/2604.22160
作者: Eric Li,Arijit Dasgupta,Yoni Friedman,Mathieu Huot,Vikash Mansinghka,Thomas O’Connell,William T. Freeman,Joshua B. Tenenbaum
机构: MIT CSAIL; MIT BCS
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 12 figures, CVPR 2026
Abstract:Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
[CV-55] Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models
【速读】:该论文旨在解决腹腔镜胆囊切除术中关键安全视野(Critical View of Safety, CVS)评估的准确性问题,以降低胆管损伤这一高致残致死率并发症的发生风险。现有大型视觉语言模型(Large Vision-Language Models, LVLMs)虽具备灵活推理能力,但在安全敏感的外科任务中存在预测不可审计和可靠性不足的问题。解决方案的关键在于提出“Sum-of-Checks”框架,将CVS的每一项标准分解为专家定义的、反映临床相关视觉证据的推理检查(reasoning checks),通过LVLM对每个检查进行二值判断并提供理由,再以固定加权方式聚合得到准则级评分。该结构化方法显著提升了平均帧级平均精度(相对最优基线提升12–14%),并通过分析表明LVLM在观察性检查(如视野清晰度、器械遮挡)上可靠,但在决策关键的解剖学证据上仍具变异性,从而验证了将证据获取与决策过程显式分离对于构建可信赖、可审计的外科AI系统至关重要。
链接: https://arxiv.org/abs/2604.22156
作者: Weiqiu You,Cassandra Goldberg,Amin Madani,Daniel A. Hashimoto,Eric Wong
机构: University of Pennsylvania (宾夕法尼亚大学); University of Toronto (多伦多大学); University Health Network (加拿大健康网络)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: IPCAI 2026 short communication
Abstract:Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks. Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples. Results: Sum-of-Checks improves average frame-level mean average precision by 12–14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence. Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems. Code is available at this https URL. 
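摘要描述的准则级评分是"对各检查二值结果的固定加权聚合"。以下为一个最小示意(检查项名称与权重均为虚构示例,并非论文定义的专家检查集):

```python
# Hypothetical checks for one CVS criterion: (LVLM binary judgment, fixed weight).
checks = {
    "two_structures_visible": (1, 0.4),
    "hepatocystic_triangle_cleared": (1, 0.4),
    "no_tool_obstruction": (0, 0.2),
}

def criterion_score(checks):
    """Fixed weighted aggregation of binary check outcomes into a criterion score."""
    total = sum(w for _, w in checks.values())
    return sum(j * w for j, w in checks.values()) / total

score = criterion_score(checks)
```

这种结构使每个判断都可单独审计:证据获取(各检查的二值判断与理由)与决策(固定加权求和)被显式分离。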
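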
[CV-56] Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography CVPR
【速读】:该论文旨在解决光学相干断层扫描(Optical Coherence Tomography, OCT)图像中病灶自动分析的可靠性问题,核心挑战在于依赖昂贵且耗时的专家标注数据,导致监督式深度学习模型在不同病理类型、成像设备和人群之间泛化能力受限。其解决方案的关键在于提出一种无监督异常检测框架,通过在正常B-scan图像上训练离散潜在模型来学习健康视网膜解剖结构的规范分布,从而无需病变标注即可识别异常;同时引入视网膜层感知监督与结构化三元组学习策略,有效分离健康与病理表征,提升模型在多样化成像条件下的鲁棒性,最终实现基于重建差异的图像级与像素级异常定位。
链接: https://arxiv.org/abs/2604.22139
作者: Tania Haghighi,Sina Gholami,Hamed Tabkhi,Minhaj Nur Alam
机构: University of North Carolina at Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 3 figures, accepted in CVPR-CV4Clinical
Abstract:Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.
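基于重建差异的异常检测可概括为:仅在正常样本上训练的模型重建不出的区域即为异常。以下为推理阶段评分方式的示意性草图(相对阈值的取法为本文假设):

```python
import numpy as np

def anomaly_maps(image, reconstruction, rel_thresh=0.5):
    """Pixel- and image-level anomaly scores from reconstruction discrepancy:
    regions the normative model cannot reconstruct are flagged as abnormal."""
    err = np.abs(image - reconstruction)        # per-pixel residual
    pixel_map = err > rel_thresh * err.max()    # localisation mask
    return pixel_map, err.mean()                # mask, image-level score

# A lesion-like bump the 'healthy-only' model fails to reconstruct:
healthy = np.zeros((64, 64))
scan = healthy.copy()
scan[20:30, 20:30] = 1.0
mask, score = anomaly_maps(scan, healthy)
```

这样无需病变标注即可同时得到图像级评分(用于检测)与像素级掩码(用于定位)。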
[CV-57] PAGaS: Pixel-Aligned 1DoF Gaussian Splatting for Depth Refinement
【速读】:该论文旨在解决多视图立体(Multi-View Stereo, MVS)深度估计中几何精度不足的问题,尤其是在复杂场景下难以准确重建精细结构的挑战。其解决方案的关键在于提出Pixel-Aligned 1DoF Gaussian Splatting (PAGaS),通过引入仅保留一个自由度(1DoF)的高斯表示来建模像素深度:具体而言,高斯的位置和尺寸由反投影像素体约束,从而在优化过程中仅允许深度作为唯一可变参数,显著提升了几何一致性与细节保真度。
链接: https://arxiv.org/abs/2604.22129
作者: David Recasens,Robert Maier,Aljaz Bozic,Stephane Grabli,Javier Civera,Tony Tung,Edmond Boyer
机构: University of Zaragoza (萨拉戈萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Gaussian Splatting (GS) has emerged as an efficient approach for high-quality novel view synthesis. While early GS variants struggled to accurately model the scene’s geometry, recent advancements constraining the Gaussians’ spread and shapes, such as 2D Gaussian Splatting, have significantly improved geometric fidelity. In this paper, we present Pixel-Aligned 1DoF Gaussian Splatting (PAGaS) that adapts the GS representation from novel view synthesis to the multi-view stereo depth task. Our key contribution is modeling a pixel’s depth using one-degree-of-freedom (1DoF) Gaussians that remain tightly constrained during optimization. Unlike existing approaches, our Gaussians’ positions and sizes are restricted by the back-projected pixel volumes, leaving depth as the sole degree of freedom to optimize. PAGaS produces highly detailed depths, as illustrated in Figure 1. We quantitatively validate these improvements on top of reference geometric and learning-based multi-view stereo baselines on challenging 3D reconstruction benchmarks. Code: this http URL
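摘要的核心约束——高斯的位置与尺寸由反投影像素体限定、仅深度可优化——可用一个针孔相机模型的示意说明(参数化细节为本文假设,非论文实现):

```python
import numpy as np

def gaussian_from_depth(pixel, depth, fx, fy, cx, cy):
    """Illustrative 1DoF Gaussian for one pixel: the centre is constrained to the
    pixel's back-projected ray and the extent to the pixel footprint at that
    depth, leaving depth as the sole free parameter."""
    u, v = pixel
    ray = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    center = depth * ray        # slides along the ray as depth changes
    radius = depth / fx         # one-pixel footprint at this depth
    return center, radius

center, radius = gaussian_from_depth((50.0, 50.0), 2.0,
                                     fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

优化深度时,高斯只能沿射线滑动并随之缩放,因而始终与像素体对齐。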
[CV-58] Robust Camera-to-Mocap Calibration and Verification for Large-Scale Multi-Camera Data Capture
【速读】:该论文旨在解决光学动作捕捉(Optical Motion Capture, mocap)系统在AR/VR、SLAM和机器人数据集中的外参标定问题,尤其针对鱼眼相机(fisheye camera)因空间非均匀畸变导致的标定与验证困难。现有方法常因板到标记物(board-to-marker)安装差异、优化初始值模糊性以及部署后会话间标定漂移而出现误差甚至未被察觉的失败,进而污染下游数据。解决方案的关键在于:首先提出一种联合估计相机外参与板到标记物变换的鲁棒标定方法,并采用分阶段求解器提升在模糊初始条件下的收敛可靠性;其次设计名为Lollypop的独立验证模块,通过完全脱离标定数据的测量链实现快速、无需人工干预的标定质量评估,从而有效检测长期运行中的标定退化。
链接: https://arxiv.org/abs/2604.22118
作者: Tianyi Liu,Christopher Twigg,Patrick Grady,Kevin Harris,Shangchen Han,Kun He
机构: Meta
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Optical motion capture (mocap) systems are widely used for ground-truth capture in AR/VR, SLAM and robotics datasets. These datasets require extrinsic calibration to align mocap coordinates to external camera frames – a step that is subject to multiple sources of error in practice, and failures often go undetected until they corrupt downstream data. These issues are compounded for fisheye cameras, where spatially non-uniform distortion makes both calibration and verification more challenging. We present a calibration and verification system designed for this setting. Concretely, we target robustness to board-to-marker attachment variation, optimization initialization ambiguity, and session-to-session calibration drift after deployment. The calibration jointly estimates camera extrinsics and the board-to-marker transform, and uses a staged solver to improve convergence reliability under ambiguous initialization. The verification component, Lollypop, provides fast, operator-independent assessment through a measurement chain entirely independent of the calibration data. In experiments on a Meta Quest 3 headset with fisheye cameras, our calibration outperforms existing benchwork, and Lollypop reliably detects calibration degradation over time. The system has been deployed in production data collection pipelines.
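摘要中"联合估计相机外参与板到标记物变换"可以写成 4×4 刚体变换链上的残差:用追踪到的标记物位姿,经两个待估未知量预测棋盘格位姿,并与相机实际观测作比较。以下为示意(坐标系命名、构造函数与数值均为本文假设):

```python
import numpy as np

def rz(deg, t):
    """4x4 rigid transform: rotation about z by deg degrees, then translation t."""
    a = np.deg2rad(deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    T[:3, 3] = t
    return T

def pose_residual(cam_T_board_obs, mocap_T_marker, cam_T_mocap, marker_T_board):
    """Per-frame residual: observed board pose vs. the pose predicted by chaining
    the extrinsic, the tracked marker pose, and the mount transform."""
    pred = cam_T_mocap @ mocap_T_marker @ marker_T_board
    return np.linalg.norm(np.linalg.inv(cam_T_board_obs) @ pred - np.eye(4))

# Ground-truth values of the two jointly estimated unknowns (synthetic):
cam_T_mocap = rz(30, [0.1, 0.0, 2.0])       # camera extrinsic w.r.t. mocap frame
marker_T_board = rz(-5, [0.0, 0.02, 0.0])   # board-to-marker mount offset
mocap_T_marker = rz(10, [1.0, 0.0, 0.0])    # one tracked marker pose
obs = cam_T_mocap @ mocap_T_marker @ marker_T_board
```

求解器在多帧上最小化该残差以同时恢复两个未知变换;论文的分阶段求解与鱼眼畸变处理在此略去。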
[CV-59] How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits
【速读】:该论文旨在解决街景感知模型(street-view perception models)在预测主观属性(如安全性)时缺乏因果解释能力的问题,即现有模型仅能提供相关性结果,无法识别哪些局部视觉变化可合理改变人类对特定场景的判断。其解决方案的关键在于提出一种基于“杠杆”(lever-based)的干预式反事实框架,将场景级可解释性重构为对结构化反事实编辑的受限搜索过程;每个“杠杆”定义了一个语义概念、空间支持范围、干预方向及约束编辑模板,通过提示条件图像编辑生成候选修改,并仅保留满足同一地点保持、局部性、真实性和合理性验证的编辑结果,从而实现对视觉因素与人类判断之间因果路径的初步探索。
链接: https://arxiv.org/abs/2604.22103
作者: Jason Tang,Stephen Law
机构: University College London (UCL)
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.
[CV-60] FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision
【速读】:该论文旨在解决低光照条件下视觉感知可靠性问题,这是影响自主机器人系统导航、检测与操作性能的核心挑战。现有方法在参数优化维度、光照分解与白平衡校正等方面存在局限,且依赖于易导致边缘模糊的非局部均值(Non-Local Means, NLM)去噪策略。解决方案的关键在于提出FLARE-BO(Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation)框架,通过联合优化八个图像增强参数(包括伽马校正、LIME风格光照归一化、色度去噪、双边滤波、NLM去噪、Grey-World自动白平衡及自适应后处理平滑),并引入单位超立方体参数归一化、目标标准化、Sobol准随机初始化和对数期望改进(Log Expected Improvement)采集函数,实现对扩展参数空间的高效、鲁棒探索,从而显著提升低光图像质量。
链接: https://arxiv.org/abs/2604.22093
作者: Nathan Shankar,Pawel Ladosz,Hujun Yin
机构: University of Manchester (曼彻斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 7 pages, 2 tables and 4 figures
Abstract:Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means denoising, which tends to over smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning across gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post smoothing. The search engine employs a unit hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. Performance of the proposed method is benchmarked using the Low Light paired dataset (LOL) and results show marked improvements of the proposed method over existing methods that were not specifically trained using this dataset.
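FLARE-BO 联合优化的八个参数中,伽马校正与 Grey-World 自动白平衡是两个可直接写出的标准算子。以下按其通用定义给出示意实现(贝叶斯优化搜索循环略去;实际参数取值由优化器逐图选择):

```python
import numpy as np

def gamma_correct(img, gamma):
    """Gamma correction for an image in [0, 1]; gamma < 1 brightens shadows."""
    return np.clip(img, 0.0, 1.0) ** gamma

def grey_world(img):
    """Grey-World automatic white balance: rescale each channel so all
    channel means match the global mean."""
    means = img.reshape(-1, 3).mean(axis=0)
    gain = means.mean() / (means + 1e-8)
    return np.clip(img * gain, 0.0, 1.0)
```

在完整流程中,这类算子的参数先被归一化到单位超立方体,再由 Sobol 初始化与 Log EI 采集函数驱动的高斯过程逐图搜索。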
[CV-61] H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers CVPR2026
【速读】:该论文旨在解决深度神经网络预测解释中忽视特征交互(feature interactions)的问题,尤其在图像分类任务中,像素间的协同作用对语义理解至关重要。现有方法要么仅关注边际效应,要么在处理图像时存在粒度粗糙或违背可解释性公理的缺陷。其解决方案的关键在于提出H-Sets框架:第一阶段通过输入海森矩阵(input Hessians)识别局部相互作用的像素对,并递归合并为语义一致的集合(sets),同时引入Segment Anything (SAM)分割作为空间分组先验;第二阶段采用IDG-Vis方法对每个集合进行归因,该方法扩展了集成方向梯度(Integrated Directional Gradients),沿像素空间路径整合方向梯度并以哈桑尼红利(Harsanyi dividends)聚合,从而实现更稀疏且忠实的显著性图(saliency maps)。
链接: https://arxiv.org/abs/2604.22045
作者: Ayushi Mehrotra,Dipkamal Bhusal,Michael Clifford,Nidhi Rastogi
机构: California Institute of Technology (加州理工学院); Rochester Institute of Technology (罗切斯特理工学院); Toyota InfoTech Labs (丰田信息科技实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.
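利用输入海森矩阵的交叉项检测交互特征对,其原理可用有限差分在一个玩具函数上演示:混合二阶偏导数大,说明两个输入联合影响输出(真实方法作用于网络对图像的输出,此处仅为原理示意):

```python
import numpy as np

def interaction_strength(f, x, i, j, eps=1e-3):
    """|d^2 f / dx_i dx_j| via central finite differences: a large mixed
    second derivative signals that features i and j interact."""
    def shift(a, b):
        y = x.copy()
        y[i] += a * eps
        y[j] += b * eps
        return f(y)
    return abs(shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * eps**2)

# Toy model: x0 and x1 interact multiplicatively, x2 acts alone.
f = lambda x: x[0] * x[1] + x[2]
x = np.array([0.5, -0.3, 1.0])
strong = interaction_strength(f, x, 0, 1)   # mixed partial of x0*x1 is 1
weak = interaction_strength(f, x, 0, 2)     # no interaction term
```

H-Sets 即以此类交叉项挑出局部交互像素对,再递归合并为语义一致的集合。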
[CV-62] EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms
【速读】:该论文旨在解决复杂医疗任务中智能辅助系统缺乏高质量、标注精细的视觉数据支持的问题,以推动生成式 AI (Generative AI) 在医疗场景下的应用落地。其解决方案的关键在于构建并公开发布 EgoMAGIC 数据集——一个包含 3,355 条视频、覆盖 50 种医疗操作的头戴式视角(egocentric)医学活动数据集,并配套提供针对 8 个典型任务的动作检测基准挑战。该数据集通过 1.95 百万条标签训练了 40 个 YOLO 模型,实现了对 124 种医疗物体的精准检测,为动作识别、错误检测和对象识别等计算机视觉任务提供了坚实基础,从而显著提升增强现实(AR)头显设备中虚拟助手在真实医疗环境中的感知与交互能力。
链接: https://arxiv.org/abs/2604.22036
作者: Brian VanVoorst,Nicholas Walczak,Christopher Gilleo,Charles Meissner,Fabio Felix,Iran Roman,Bea Steers,Claudio Silva,Yuhan Shen,Zijia Lu,Shih-Po Lee,Ehsan Elhamifar
机构: RTX BBN Technologies; New York University; Northeastern University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 4 figures, 3 tables
Abstract:This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA’s Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via this http URL (DOI: https://doi.org/10.5281/zenodo.19239154).
[CV-63] LTBs-KAN: Linear-Time B-splines Kolmogorov-Arnold Networks
【速读】:该论文旨在解决Kolmogorov-Arnold Networks (KANs) 在计算效率上显著低于多层感知机(Multilayer Perceptrons, MLPs)的问题,其核心瓶颈在于B样条函数计算的递归特性导致的高时间复杂度。解决方案的关键在于提出一种线性时间复杂度的基样条线性时间B样条Kolmogorov-Arnold网络(Linear-Time B-splines Kolmogorov-Arnold Network, LTBs-KAN),通过摒弃传统的Boor-Mansfield-Cox样条算法等高复杂度数学运算,实现计算负担的显著降低;同时,在前向传播中引入乘积-求和矩阵分解(product-of-sums matrix factorization)策略,在不牺牲模型性能的前提下进一步减少参数量,从而提升整体效率与可扩展性。
链接: https://arxiv.org/abs/2604.22034
作者: Eduardo Said Merin-Martinez,Andres Mendez-Vazquez,Eduardo Rodriguez-Tello
机构: Cinvestav, Unidad Guadalajara (Cinvestav,瓜达拉哈拉分校); Cinvestav, Unidad Tamaulipas (Cinvestav,塔毛利帕斯分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Kolmogorov-Arnold Networks (KANs) are a recent neural network architecture offering an alternative to Multilayer Perceptrons (MLPs) with improved explainability and expressibility. However, KANs are significantly slower than MLPs due to the recursive nature of B-spline function computations, limiting their application. This work addresses these issues by proposing a novel base-spline Linear-Time B-splines Kolmogorov-Arnold Network (LTBs-KAN) with linear complexity. Unlike previous methods that rely on the Boor-Mansfield-Cox spline algorithm or other computationally intensive mathematical functions, our approach significantly reduces the computational burden. Additionally, we further reduce the model’s parameter count through product-of-sums matrix factorization in the forward pass without sacrificing performance. Experiments on MNIST, Fashion-MNIST and CIFAR-10 demonstrate that LTBs-KAN achieves good time complexity and parameter reduction when used as architectural building blocks, compared to other KAN implementations.
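摘要中"线性时间"的来源可以这样理解:均匀三次 B 样条基函数存在闭式多项式表达,每个样本的计算量为常数,而无需 Cox-de Boor(即 Boor-Mansfield-Cox)递归。下面给出标准闭式基函数及一个极简的样条激活示意(不代表论文的具体构造):

```python
def cubic_bspline_basis(t):
    """Uniform cubic B-spline basis weights at local parameter t in [0, 1),
    evaluated in closed form (constant work per sample) instead of the
    Cox-de Boor recursion."""
    t2, t3 = t * t, t * t * t
    b0 = (1 - 3 * t + 3 * t2 - t3) / 6       # = (1 - t)^3 / 6
    b1 = (4 - 6 * t2 + 3 * t3) / 6
    b2 = (1 + 3 * t + 3 * t2 - 3 * t3) / 6
    b3 = t3 / 6
    return b0, b1, b2, b3

def spline_activation(x, coeffs):
    """Linear-time KAN-style activation: blend the 4 control points around x
    (assumes x >= 0 and enough coefficients)."""
    i = int(x)                                # knot interval
    w = cubic_bspline_basis(x - i)
    return sum(wk * coeffs[i + k] for k, wk in enumerate(w))
```

四个基函数在任意 t 处恒满足单位分解(和为 1),因此常系数精确重现常值函数。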
[CV-64] Soft Anisotropic Diagrams for Differentiable Image Representation
【速读】:该论文旨在解决图像高效表示与渲染中的关键挑战,即如何在保持高视觉质量的同时实现快速训练和推理,并支持可微分的图像处理流程。现有方法如Image-GS和Instant-NGP在编码效率或边界清晰度上存在局限,难以兼顾内容感知的结构对齐与梯度传播能力。解决方案的核心在于提出Soft Anisotropic Diagrams (SAD),这是一种显式且可微分的图像表示方法,通过自适应站点(adaptive sites)定义各向异性度量(anisotropic metric)和加权距离得分,利用softmax混合机制对每个像素的颜色进行局部top-K站点加权融合。其关键创新包括:1)引入可学习的温度参数以控制软分区的平滑性并保留有效梯度;2)设计受跳跃洪泛(jump flooding)启发的top-K传播策略并结合随机注入,实现GPU友好的固定大小局部计算与全局覆盖;3)采用梯度加权初始化、Adam优化及动态密度控制(稠密化与剪枝),显著提升端到端训练效率(达4–19倍加速)。此框架不仅支持高质量图像重建,还无缝集成于前向与逆问题的可微分流水线中,具备快速随机访问与紧凑存储优势。
链接: https://arxiv.org/abs/2604.21984
作者: Laki Iinbor,Zhiyang Dou,Wojciech Matusik
机构: MIT(麻省理工学院); MIT(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership. Such a formulation enables efficient rendering by maintaining a per-query top-K map that approximates nearest neighbors under the same shading score, allowing GPU-friendly, fixed-size local computation. We update this list using our top-K propagation scheme inspired by jump flooding, augmented with stochastic injection to provide probabilistic global coverage. Training follows a GPU-first pipeline with gradient-weighted initialization, Adam optimization, and adaptive budget control through densification and pruning. Across standard benchmarks, SAD consistently outperforms Image-GS and Instant-NGP at matched bitrate. On Kodak, SAD reaches 46.0 dB PSNR with 2.2 s encoding time (vs. 28 s for Image-GS), and delivers 4-19 times end-to-end training speedups over state-of-the-art baselines. We demonstrate the effectiveness of SAD by showcasing the seamless integration with differentiable pipelines for forward and inverse problems, efficiency of fast random access, and compact storage.
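SAD 的逐像素着色——在各向异性加权距离分数上对局部 top-K 站点做 softmax 混合——可以写成如下最小示意(站点字段、温度的使用方式与二维设定均为本文假设,非论文实现):

```python
import numpy as np

def shade_pixel(p, sites, K=3):
    """Pixel colour as a softmax blend over the top-K sites under an anisotropic,
    additively weighted distance score with per-site temperature."""
    d = np.array([((p - s["mu"]) @ s["M"] @ (p - s["mu"]) + s["bias"]) / s["temp"]
                  for s in sites])
    idx = np.argsort(d)[:K]                   # top-K = lowest scores
    logits = -d[idx]
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return sum(wi * sites[i]["color"] for wi, i in zip(w, idx))

def site(mu, color):
    """A site with an isotropic metric, zero additive weight, unit temperature."""
    return {"mu": np.array(mu), "M": np.eye(2),
            "bias": 0.0, "temp": 1.0, "color": np.array(color)}

sites = [site([0, 0], [1.0, 0.0, 0.0]), site([10, 0], [0.0, 0.0, 1.0])]
c = shade_pixel(np.array([0.0, 0.0]), sites)
```

距离分数远小于其余站点时,softmax 几乎退化为硬分配(即 Apollonius 图的一个单元);温度越小,分区边界越锐利。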
[CV-65] Forecasting Solar Energy Using a Single Image
【速读】:该论文旨在解决城市环境中太阳能电池板(solar panel)安装前的辐照度(irradiance)评估难题,特别是传统三维建模方法无法准确捕捉附近小尺度结构对辐照度影响的问题。其核心解决方案是利用单张拍摄于面板位置的图像,通过视觉线索推断相机朝向及面板可见天空区域,从而预测太阳和天空散射光贡献的辐照度,并进一步发现邻近建筑反射光在时间上具有平滑变化特性,亦可从图像中预测。该方法显著提升了城市峡谷中辐照度预测精度,且仅需一张全景图像即可确定最优固定安装角度,验证了其在实际场景中的有效性与实用性。
链接: https://arxiv.org/abs/2604.21982
作者: Jeremy Klotz,Shree K. Nayar
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 15 figures. Project page: this https URL
Abstract:Solar panels are increasingly deployed in cities on rooftops, walls, and urban infrastructure. Although the panel costs have fallen in recent years, the soft costs of installing them have not. These soft costs include assessing the illumination (irradiance) of a panel, which is typically performed using a 3D model that fails to capture small nearby structures that impact the irradiance. Our approach uses a single image taken at the panel’s location to forecast its irradiance at any time in the future. We use visual cues in the image to find the camera’s orientation and the portion of the sky visible to the panel in order to forecast the irradiance due to the sun and the sky. In addition, we show that the irradiance due to reflections from nearby buildings varies smoothly over time and can be forecasted from the image. This approach enables assessing the solar energy potential of any surface and forecasting the temporal variation of a panel’s irradiance. We validate our approach using real irradiance measurements in urban canyons. We show that our approach often yields more accurate irradiance forecasts compared to conventional irradiance-based transposition methods and 3D model-based simulations. We also show that a single spherical image can be used to find the best fixed orientation of a panel. Finally, we present Solaris, a device to capture the image seen by a panel in a variety of urban settings.
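摘要中由单张图像预测辐照度的思路可归结为:面板视角的天空掩码决定两件事——太阳所在像素是否被遮挡(直射项),以及可见天空比例(散射项)。以下为示意(未包含论文同样预测的建筑反射项;数值均为假设):

```python
import numpy as np

def panel_irradiance(sky_mask, sun_pixel, direct, diffuse):
    """Irradiance seen by a panel from its sky mask: a direct-beam term zeroed
    when the sun's pixel is occluded, plus a diffuse term scaled by the
    visible-sky fraction. Facade reflections are omitted in this sketch."""
    sun_visible = bool(sky_mask[sun_pixel])
    return direct * sun_visible + diffuse * sky_mask.mean()

sky = np.ones((10, 10))                # fully open sky
clear = panel_irradiance(sky, (5, 5), direct=800.0, diffuse=100.0)
sky[4:7, 4:7] = 0                      # a small nearby structure blocks the sun
blocked = panel_irradiance(sky, (5, 5), direct=800.0, diffuse=100.0)
```

对未来任意时刻的预测,只需用太阳历推算当时太阳在图像中的位置并重复上述查询。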
[CV-66] Useful nonrobust features are ubiquitous in biomedical images
【速读】:该论文旨在解决深度学习模型在医学影像分类任务中是否依赖于非鲁棒特征(nonrobust features)及其对模型性能的影响问题。研究表明,仅依赖非鲁棒特征的模型在五项MedMNIST分类任务上仍能实现显著高于随机水平的准确率,说明这些特征在分布内(in-distribution)具有预测价值;而通过对抗训练使模型主要依赖鲁棒特征(robust features)虽会降低分布内准确率,却能在受控的分布偏移场景(MedMNIST-C)下显著提升泛化性能。解决方案的关键在于识别并量化非鲁棒特征与鲁棒特征在不同测试条件下的作用差异,揭示出医学影像分类中存在可调的鲁棒性-准确性权衡(robustness-accuracy trade-off),从而为实际部署场景提供针对性优化依据。
链接: https://arxiv.org/abs/2604.22579
作者: Coenraad Mouton,Randle Rabe,Niklas C. Koser,Nicolai Krekiehn,Christopher Hansen,Jan-Bernd Hövener,Claus-C. Glüer
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at The IEEE International Symposium on Biomedical Imaging (ISBI), 2026
Abstract:We study whether deep networks for medical imaging learn useful nonrobust features - predictive input patterns that are not human interpretable and highly susceptible to small adversarial perturbations - and how these features impact test performance. We show that models trained only on nonrobust features achieve well above chance accuracy across five MedMNIST classification tasks, confirming their predictive value in-distribution. Conversely, adversarially trained models that primarily rely on robust features sacrifice in-distribution accuracy but yield markedly better performance under controlled distribution shifts (MedMNIST-C). Overall, nonrobust features boost standard accuracy yet degrade out-of-distribution performance, revealing a practical robustness-accuracy trade-off in medical imaging classification tasks that should be tailored to the requirements of the deployment setting.
[CV-67] Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction? CVPR
【速读】:该论文旨在解决加速心脏磁共振成像(cardiac MRI)重建中因数据分布差异导致的泛化能力不足问题,特别是当模型在跨域场景下(如训练于心脏数据却应用于膝关节或脑部数据)表现下降的问题。其关键解决方案是提出一种可展开的重建框架,将预训练的冻结视觉编码器(如CLIP、DINOv2和BiomedCLIP)嵌入到每一级重构过程中,作为图像先验来引导重建路径。实验表明,尽管任务特定模型(如E2E-VarNet)在标准分布内表现更优,但基于基础模型的方法在跨域挑战性场景中展现出更强的鲁棒性,尤其在高加速因子和低频采样受限条件下;其中自然图像预训练模型(如CLIP)能学习高度可迁移的结构表征,而生物医学领域特定预训练模型(BiomedCLIP)则在病态问题中提供小幅性能提升,验证了预训练基础模型作为通用先验在加速MRI重建中的潜力。
链接: https://arxiv.org/abs/2604.22557
作者: Anam Hashmi,Mayug Maniparambil,Julia Dietlmeier,Kathleen M. Curran,Noel E. O’Connor
机构: Dublin City University (都柏林城市大学); University College Dublin (都柏林大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPRW 2026
Abstract:The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets, foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.
[CV-68] MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models
【速读】:该论文旨在解决动物行为学中社会支配关系(social dominance)的自动识别问题,特别是如何从原始小鼠行为视频中无监督地预测其社会等级。其解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Models, MLLMs)对未标注的行为序列进行零样本推理(zero-shot inference),通过在新引入的MTT-Bench基准数据集上微调模型,使其能够在测试阶段无需显式标签即可准确预测小鼠的支配层级,从而为行为生态学和社交行为分析提供一种通用的基础模型方法。
链接: https://arxiv.org/abs/2604.22492
作者: Yunquan Chen,Haoyu Chen
机构: KTH Royal Institute of Technology (皇家理工学院); University of Oulu (奥卢大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 2 figures. Submitted to conference
Abstract:Understanding social dominance in animal behavior is critical for neuroscience and behavioral studies. In this work, we explore the capability of Multimodal Large Language Models (MLLMs) to analyze raw behavioral video of mice and predict their dominance hierarchy. We introduce MTT-Bench, a novel benchmark comprising annotated videos of pairwise mouse interactions for Mouse Tube Test analysis. Building on existing MLLM architectures, we fine-tune these models to perform zero-shot inference on unseen behavioral sequences, predicting social dominance without explicit labels during testing. Our framework demonstrates promising results, showing high agreement with tube test rankings. This work opens a new direction for applying foundation models to ethology and social behavior analysis, without the need to design domain-specific models.
[CV-69] Thermal background reduction for mid-infrared imaging by low-rank background and sparse point-source modelling
【速读】:该论文旨在解决地面和机载中红外天文观测中因背景噪声时空变化导致的源检测与测光精度下降问题,尤其针对下一代极大望远镜(extremely large telescopes)无法使用传统斩波-跳动(chopping and nodding)技术的挑战。解决方案的关键在于提出一种名为LOw-RAnk Background ELimination (LORABEL) 的新型计算方法,其核心思想是利用背景信号的低秩特性进行背景抑制,从而在不依赖传统观测模式(如源掩膜或望远镜跳动)的前提下显著提升信噪比和检测精度。
链接: https://arxiv.org/abs/2604.22351
作者: R.A.R. Moens,A.G.M. Pietrow,B. Brandl,R. Van de Plas
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mid-infrared astronomy from the ground faces critical challenges in accurately detecting and quantifying sources due to the dominant spatially and time-variable background noise. Moreover, chopping and nodding, the traditional methods for dealing with these background issues, will not be technically feasible on the next generation of extremely large telescopes. This limitation requires the development of novel computational methods for a robust background reduction. We present and evaluate a novel method named LOw-RAnk Background ELimination (LORABEL) to improve the sensitivity of mid-infrared astronomical observations, without the need for classical telescope nodding, source masking, or other overheads in observing time. We applied a low-rank background-reduction strategy to (1) data taken on the ground with the VISIR with synthetically injected sources, and (2) airborne data from SOFIA. We compared the performance of our new method to classical chopping and nodding techniques, and analysed the effect on source photometry and detection precision for different observational scenarios. In regimes with a low signal-to-noise ratio (S/N < 5) in the ground-based VISIR data, LORABEL reduces variation in the photometric error with respect to chopping differences alone and even the classical chop-nod sequence, at the cost of introducing a bias. Secondly, we demonstrate that LORABEL increases detection precision in comparison to traditional background-reduction methods. For the SOFIA dataset, we achieve a 20-100 fold decrease in mean background flux with respect to the traditional chop-nod method while preserving most of the source flux. Our findings suggest that LORABEL is applicable to a wider range of instrumental observation, that is, both ground-based and airborne, and it is a suitable tool in the context of faint-source detection.
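LORABEL 的核心假设是热背景随时间缓变、可用低秩模型刻画。下面用幂迭代求帧堆栈的秩一近似并做背景相减,作为该思路的玩具示意(论文方法包含稀疏点源建模与更一般的低秩背景,此处仅为假设性简化,并非论文实现):

```python
import math

def rank1_background(frames, iters=100):
    """Estimate a rank-1 background for a stack of flattened frames
    (rows = frames, columns = pixels) via power iteration; a crude
    stand-in for the low-rank background model in methods like LORABEL."""
    m, n = len(frames), len(frames[0])
    v = [1.0] * n
    sigma, u = 0.0, [0.0] * m
    for _ in range(iters):
        u = [sum(frames[i][j] * v[j] for j in range(n)) for i in range(m)]
        nu = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / nu for x in u]
        v = [sum(frames[i][j] * u[i] for i in range(m)) for j in range(n)]
        sigma = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / sigma for x in v]
    return [[sigma * u[i] * v[j] for j in range(n)] for i in range(m)]

def subtract_background(frames):
    """Residuals after removing the rank-1 background estimate."""
    bg = rank1_background(frames)
    return [[f - b for f, b in zip(fr, br)] for fr, br in zip(frames, bg)]
```

当各帧背景完全相同时,秩一近似可精确恢复背景,残差接近零;微弱点源对低秩分量影响很小,因此在相减后大体得以保留。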
[CV-70] Selective Depthwise Separable Convolution for Lightweight Joint Source-Channel Coding in Wireless Image Transmission
【速读】:该论文旨在解决深度学习(Deep Learning, DL)驱动的联合信源信道编码(Joint Source-Channel Coding, JSCC)系统在无线图像传输中计算复杂度较高、难以适配资源受限边缘设备的问题。其解决方案的关键在于提出一种可配置的轻量化JSCC框架,采用选择性替换策略,在不同层位置和替换比例下灵活地将标准卷积层(Conv)替换为深度可分离卷积层(Depthwise Separable Convolutional, DSConv),从而实现模型压缩与重建性能之间的灵活权衡。实验表明,在中间层进行Conv到DSConv的替换能获得最优的复杂度-性能平衡,揭示了DL-based JSCC系统中存在层级冗余特性,且该方法可在仅轻微损失重建质量的前提下显著减少参数量,适用于边缘计算场景。
链接: https://arxiv.org/abs/2604.22338
作者: Ming Ye,Kui Cai,Cunhua Pan,Zhen Mei,Wanting Yang,Chunguo Li
机构: Singapore University of Technology and Design (新加坡科技设计大学); National Mobile Communications Research Laboratory, Southeast University (东南大学移动通信国家重点实验室); School of Electronic and Optical Engineering, Nanjing University of Science and Technology (南京理工大学电子科学与技术学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 6 figures, journal
Abstract:Depthwise separable convolutional (DSConv) layers have been successfully applied to deep learning (DL)-based joint source-channel coding (JSCC) schemes to reduce computational complexity. However, a systematic investigation of the layerwise and ratio-wise replacement of standard convolutional (Conv) layers with DSConv layers in JSCC systems for wireless image transmission remains largely unexplored. In this letter, we propose a configurable lightweight JSCC framework that incorporates a selective replacement strategy, enabling flexible substitution of standard Conv layers with DSConv layers at various layer positions and replacement ratios. By adjusting the proportion of layers replaced, we achieve different model compression levels and analyze their impact on reconstruction performance. Furthermore, we investigate how replacements at different encoder and decoder depths influence reconstruction quality under a fixed replacement ratio. Our results show that Conv-to-DSConv replacement at intermediate layers achieves a favorable complexity-performance trade-off, revealing layer-wise redundancy in DL-based JSCC systems. Extensive experiments further demonstrate that the proposed framework achieves substantial parameter reduction with only slight performance degradation, enabling flexible complexity-performance trade-offs for resource-constrained edge devices.
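标准卷积与深度可分离卷积的参数量差异可用简单算式说明(忽略偏置项,仅为示意):标准 k×k 卷积为 C_in·C_out·k²,而 DSConv 为逐通道 k×k 卷积加 1×1 逐点卷积,即 C_in·k² + C_in·C_out。

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dsconv_params(c_in, c_out, k):
    """Depthwise separable: a k x k depthwise conv (one filter per input
    channel) followed by a 1 x 1 pointwise conv (bias ignored)."""
    return c_in * k * k + c_in * c_out
```

例如 C_in=C_out=256、k=3 时,参数量从 589,824 降至 67,840,约 8.7 倍的压缩,这解释了为何按层位置与比例选择性替换可以灵活权衡压缩率与重建性能。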
[CV-71] Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data
【速读】:该论文旨在解决三维电子背散射衍射(3-D EBSD)数据采集过程耗时的问题,同时探索如何利用偏振光(PL)数据加速EBSD数据获取,并在数据不完整或退化的情况下实现高质量重建。其核心解决方案是采用一种无条件多模态扩散模型(unconditional multimodal diffusion model),该模型通过在合成数据上一次性训练即可具备强大的泛化能力,适用于真实世界中低分辨率、噪声干扰、损坏或错位的多模态数据。该方法能够在推理阶段通过缩放策略显著提升性能,在晶界预测、超分辨率和去噪等多个目标上表现出色,且仅需25%的EBSD数据即可逼近全分辨率性能。
链接: https://arxiv.org/abs/2604.22212
作者: Harry Dong,Timofey Efimov,Megna Shah,Jeff Simmons,Sean Donegan,Marc De Graef,Yuejie Chi
机构: Carnegie Mellon University (卡内基梅隆大学); Air Force Research Laboratory (空军研究实验室); Yale University (耶鲁大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In spite of the utility of 3-D electron back-scattered diffraction (EBSD) microscopy, the data collection process can be time-consuming with serial-sectioning. Hence, it is natural to look at other modalities, such as polarized light (PL) data, to accelerate EBSD data collection, supplemented with shared information. Complementarily, features in chaotic PL data could even be enriched with a handful of EBSD measurements. To inherently learn the complex dynamics between EBSD and PL to solve these inverse problems, we use an unconditional multimodal diffusion model, motivated by progress in diffusion models for inverse problems. Although trained solely on synthetic data once, our model has strong generalizable capabilities on real data which can be low-resolution, noisy, corrupted, and misregistered. With inference-time scaling, we show gains in performance on a variety of objectives including grain boundary prediction, super-resolution, and denoising. With our model, we demonstrate that there is little difference from full resolution performance with only 25% (1/4 the resolution) of EBSD data and corrupted PL data.
[CV-72] Conditional Diffusion Posterior Alignment for Sparse-View CT Reconstruction
【速读】:该论文旨在解决生成式 AI (Generative AI) 在稀疏视图计算机断层扫描(Sparse-View CT)重建中难以扩展至大尺寸三维(3D)体积的问题,具体挑战包括:(i) 3D模型的高内存与计算需求、(ii) 缺乏大规模3D训练数据集,以及 (iii) 使用独立2D模型处理每一切片时导致的切片间不一致性。解决方案的关键在于提出条件扩散后验对齐(Conditional Diffusion Posterior Alignment, CDPA),其核心机制是将一个2D U-Net扩散模型通过初始3D重建进行条件约束以增强跨切片一致性,并结合显式数据一致性对齐来匹配实际测量投影。该方法在合成与真实锥束CT(Cone Beam CT, CBCT)数据上实现了当前最优性能,且实验验证了组件间的协同效应,同时表明该原理亦可提升快速去噪U-Net的效果,在显著降低计算成本的同时逼近扩散模型质量。
链接: https://arxiv.org/abs/2604.21960
作者: Luis Barba,Johannes Kirschner,Benjamin Bejar
机构: Swiss Data Science Center (SDSC) (瑞士数据科学中心); Paul Scherrer Institute (PSI) (保罗谢勒研究所); ETH Zurich (苏黎世联邦理工学院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Computed Tomography (CT) is a widely used imaging modality in medical and industrial applications. To limit radiation exposure and measurement time, there is a growing interest in sparse-view CT, where the number of projection views is significantly reduced. Deep neural networks have shown great promise in improving reconstruction quality in sparse-view CT, especially generative diffusion models. However, these methods struggle to scale to large 3D volumes due to several reasons: (i) the high memory and computational requirements of 3D models, (ii) the lack of large 3D training datasets, and (iii) the inconsistencies across slices when using 2D models independently on each slice. We overcome these limitations and scale diffusion-based sparse-view CT reconstruction to large 3D volumes by combining conditional diffusion with explicit data consistency. We propose Conditional Diffusion Posterior Alignment (CDPA) to enable scalable 3D sparse-view CT reconstruction. A 2D U-Net diffusion model is conditioned on an initial 3D reconstruction to improve inter-slice consistency, combined with data-consistency alignment to match measured projections. Experiments on synthetic and real Cone Beam CT (CBCT) data show state-of-the-art performance, with ablations that confirm the synergistic effects of the proposed pipeline. Finally, we show that the same principles also strengthen fast denoising U-Nets, yielding near-diffusion quality at a fraction of the computational cost.
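CDPA 中与去噪交替执行的数据一致性对齐,其通用形式是将当前重建向实测投影靠拢的最小二乘梯度步:x ← x − η·Aᵀ(Ax − y)。以下仅为该类重建流水线常见的通用写法,并非论文的具体对齐规则:

```python
def data_consistency_step(x, A, y, eta=0.1):
    """One gradient step pulling a reconstruction x toward agreement with
    measured projections y under a linear forward operator A (list of rows):
    x <- x - eta * A^T (A x - y). Illustrative only; CDPA's exact
    alignment rule is not reproduced here."""
    m, n = len(A), len(x)
    # residual r = A x - y
    r = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
    # gradient step along A^T r
    return [x[j] - eta * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
```

在扩散去噪步骤之间反复执行该更新,可使重建逐步逼近实测投影约束,即减小 ‖Ax − y‖。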
人工智能
[AI-0] Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
【速读】:该论文旨在解决当前人工智能系统在从文本生成向目标导向的持续交互演进过程中,因缺乏对环境动态建模能力而面临的瓶颈问题。其核心挑战在于如何构建具备预测、模拟与自适应更新能力的世界模型(world model),以支持智能体在物理、数字、社会和科学等多类环境中执行复杂任务。解决方案的关键在于提出一个“层次×规律”(levels x laws)分类框架:沿能力维度划分三个层级——L1预测器(学习单步局部转移算子)、L2模拟器(组合为动作条件下的多步滚动预测并遵守领域规律)、L3演化者(当预测失败时自主修正自身模型);同时识别四大约束机制——物理、数字、社会与科学规律,明确不同场景下世界模型需满足的约束类型及潜在失效点。该框架整合400余篇文献与百余个代表性系统,提供统一分析视角、决策导向的评估原则及最小可复现评估包,推动跨领域协作并指明未来架构设计、开放问题与治理挑战。
链接: https://arxiv.org/abs/2604.22748
作者: Meng Chu,Xuan Billy Zhang,Kevin Qinghong Lin,Lingdong Kong,Jize Zhang,Teng Tu,Weijian Ma,Ziqi Huang,Senqiao Yang,Wei Huang,Yeying Jin,Zhefan Rao,Jinhui Ye,Xinyu Lin,Xichen Zhang,Qisheng Hu,Shuai Yang,Leyang Shen,Wei Chow,Yifei Dong,Fengyi Wu,Quanyu Long,Bin Xia,Shaozuo Yu,Mingkang Zhu,Wenhu Zhang,Jiehui Huang,Haokun Gui,Haoxuan Che,Long Chen,Qifeng Chen,Wenxuan Zhang,Wenya Wang,Xiaojuan Qi,Yang Deng,Yanwei Li,Mike Zheng Shou,Zhi-Qi Cheng,See-Kiong Ng,Ziwei Liu,Philip Torr,Jiaya Jia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a “levels x laws” taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
[AI-1] An Undecidability Proof for the Plan Existence Problem
【速读】:该论文旨在解决计划存在性问题(plan existence problem),即给定一个目标公式(以模态逻辑形式表达)、初始的信念状态(由带标记的Kripke模型表示)以及一组信念动作(epistemic actions),判断是否存在一系列可执行的动作序列,使得从初始状态出发能够达成目标。论文的关键贡献在于证明:即使在动作前提条件的模态深度(modal depth)不超过1且无后置条件的理想化情形下,该问题依然是不可判定的(undecidable)。这一结果填补了此前关于该问题(不)可判定性的知识空白,并揭示了在动态信念演算中计划合成的理论极限。
链接: https://arxiv.org/abs/2604.22736
作者: Antonis Achilleos
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:The plan existence problem asks, given a goal in the form of a formula in modal logic, an initial epistemic state (a pointed Kripke model), and a set of epistemic actions, whether there exists a sequence of actions that can be applied to reach the goal. We prove that even in the case where the preconditions of the epistemic actions have modal depth at most 1, and there are no postconditions, the plan existence problem is undecidable. The (un)decidability of this problem was previously unknown.
[AI-2] How Supply Chain Dependencies Complicate Bias Measurement and Accountability Attribution in AI Hiring Applications
【速读】:该论文旨在解决现代生成式 AI (Generative AI) 招聘系统在复杂供应链中因责任碎片化而导致的偏见评估困难与问责困境问题。其关键解决方案在于构建多层协同治理机制,包括系统级审计、供应商指南、持续监控机制以及跨依赖链的文档记录,从而在技术、组织和监管三个维度上实现对分布式开发环境中偏见行为的有效识别与责任追溯。
链接: https://arxiv.org/abs/2604.22679
作者: Gauri Sharma,Maryam Molamohammadi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing adoption of AI systems in hiring has raised concerns about algorithmic bias and accountability, prompting regulatory responses including the EU AI Act, NYC Local Law 144, and Colorado’s AI Act. While existing research examines bias through technical or regulatory lenses, both perspectives overlook a fundamental challenge: modern AI hiring systems operate within complex supply chains where responsibility fragments across data vendors, model developers, platform providers, and deploying organizations. This paper investigates how these dependency chains complicate bias evaluation and accountability attribution. Drawing on literature review and regulatory analysis, we demonstrate that fragmented responsibilities create two critical problems. First, bias emerges from component interactions rather than isolated elements, yet proprietary configurations prevent integrated evaluation. A resume parser may function without bias independently but contribute to discrimination when integrated with specific ranking algorithms and filtering thresholds. Second, information asymmetries mean deploying organizations bear legal responsibility without technical visibility into vendor-supplied algorithms, while vendors control implementations without meaningful disclosure requirements. Each stakeholder may believe they are compliant; nevertheless, the integrated system may produce biased outcomes. Analysis of implementation ambiguities reveals these challenges in practice. We propose multi-layered interventions including system-level audits, vendor guidelines, continuous monitoring mechanisms, and documentation across dependency chains. Our findings reveal that effective governance requires coordinated action across technical, organizational, and regulatory domains to establish meaningful accountability in distributed development environments.
[AI-3] From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在自动化软件工程中生成代码时缺乏正确性保障的问题,尤其是由错误或幻觉代码导致的验证失败。其核心挑战在于如何将非形式化的自然语言描述有效转化为精确的正式规范(formal specification),以支持后续的形式化验证(formal verification)。解决方案的关键在于构建一个名为NaturalLanguage2VerifiedCode (NL2VC)-60的数据集,并引入分层提示策略:首先使用结构化签名提示(signature prompts)提供逻辑锚点,再通过Dafny验证器的迭代反馈实现自愈式优化(self-healing prompts),同时结合uDebug平台防止空洞验证(vacuous verification),从而显著提升LLM生成代码的可验证性和功能性。实验表明,该方法使开放权重模型如Gemma 4-31B和GPT-OSS 120B的验证成功率分别达到90.91%和81.82%,证明了形式化验证在开源模型上的可行性。
链接: https://arxiv.org/abs/2604.22601
作者: Md Erfan,Md Kamal Hossain Chowdhury,Ahmed Ryan,Md Rayhanur Rahman
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 16 pages
Abstract:Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous verification, where models satisfy verifiers with trivial specifications, we integrate the uDebug platform to ensure functional validation. Our results show that while contextless prompting leads to near-universal failure, structural signatures and iterative self-healing facilitate a dramatic performance turnaround. Specifically, Gemma 4-31B achieved a 90.91% verification success rate, while GPT-OSS 120B rose from zero to 81.82% success with signature-guided feedback. These findings indicate that formal verification is now attainable for open-weight LLMs, which serve as effective apprentices for synthesizing complex annotations and facilitating high-assurance software development.
[AI-4] Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
【速读】:该论文旨在解决当前基于符号数学比较(symbolic mathematics comparison)的评估方法在数学推理任务中泛化能力差的问题,尤其是在面对多样化的数学表达形式和答案格式时表现不佳。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的评估框架,通过LLM对模型生成的答案进行语义层面的判断,从而实现对不同数学表示和解题格式的准确、灵活且可靠的评估,显著优于传统规则驱动的方法。
链接: https://arxiv.org/abs/2604.22597
作者: Erez Yosef,Oron Anschel,Shunit Haviv Hakimi,Asaf Gendler,Adam Botach,Nimrod Berman,Igor Kviatkovsky
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models’ intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer. A common approach for this verification is based on symbolic mathematics comparison, which fails to generalize across diverse mathematical representations and solution formats. In this work, we offer a robust and flexible alternative to rule-based symbolic mathematics comparison. We propose an LLM-based evaluation framework for evaluating model-generated answers, enabling accurate evaluation across diverse mathematical representations and answer formats. We present failure cases of symbolic evaluation in two popular frameworks, Lighteval and SimpleRL, and compare them to our approach, demonstrating clear improvements over commonly used methods. Our framework enables more reliable evaluation and benchmarking, leading to more accurate performance monitoring, which is important for advancing mathematical problem-solving and intelligent systems.
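基于规则的答案比较为何泛化性差,可用一个极简例子说明(以下规则与容差均为示意,并非 Lighteval 或 SimpleRL 的实际实现):严格字符串匹配或简单数值容差都无法覆盖分数、单位、含文字的答案等多样格式,这正是引入 LLM 评判的动机。

```python
def naive_match(pred, gold):
    """Rigid string comparison, the brittle failure mode that rule-based
    answer checkers share."""
    return pred.strip() == gold.strip()

def tolerant_match(pred, gold, tol=1e-9):
    """A slightly more forgiving numeric comparison; still far short of the
    semantic judgment an LLM-as-a-judge can make (e.g. 'x = 1/2' vs '0.5')."""
    try:
        return abs(float(pred) - float(gold)) < tol
    except ValueError:
        return naive_match(pred, gold)
```

例如 naive_match("0.50", "0.5") 判为不等,tolerant_match 可以救回数值写法差异,但对 "1/2"、"一半" 这类表示仍然失败,需要语义层面的评判。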
[AI-5] SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning ACL2026
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在动态GUI任务中应用强化学习(Reinforcement Learning, RL)时面临的困境:标准离线强化学习依赖静态步骤级数据,忽略任务完成度和执行质量等全局轨迹语义;而在线强化学习虽能捕捉长期动态,却存在交互成本高和环境不稳定性问题。解决方案的关键在于提出SOLAR-RL(Semi-Online Long-horizon Assignment Reinforcement Learning),其核心创新是将全局轨迹信息融入离线学习过程——通过从静态数据重构多样化轨迹候选、利用逐步有效性信号检测首次失败点,并基于目标对齐的奖励塑造方法为步骤级奖励赋值,从而在无需实际交互的情况下模拟在线反馈,显著提升长程任务完成率与鲁棒性。
链接: https://arxiv.org/abs/2604.22558
作者: Jichao Wang,Liuyang Bian,Yufeng Zhou,Han Xiao,Yue Pan,Guozhi Wang,Hao Wang,Zhaoxiong Wang,Yafei Wen,Xiaoxin Chen,Shuai Ren,Lingfang Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14 pages, 11 figures. Accepted to Findings of the Association for Computational Linguistics: ACL 2026
Abstract:As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
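SOLAR-RL 的逐步奖励回填思路可用如下草图示意:依据每步有效性信号定位首次失败点,失败点之前的步骤获得正向奖励,失败步被惩罚,其后的步骤不计入。具体奖励数值与赋值规则为假设,并非论文实际的 shaping 方案:

```python
def assign_step_rewards(step_valid, success_bonus=1.0, failure_penalty=-1.0):
    """Retroactively assign dense step-level rewards from per-step validity
    flags: credit the valid prefix, penalize the first failing step, and
    ignore everything after it. Values are illustrative placeholders."""
    first_fail = next((i for i, ok in enumerate(step_valid) if not ok), None)
    rewards = []
    for i, ok in enumerate(step_valid):
        if first_fail is None or i < first_fail:
            rewards.append(success_bonus)    # clean trajectory / valid prefix
        elif i == first_fail:
            rewards.append(failure_penalty)  # decisive failure step
        else:
            rewards.append(0.0)              # post-failure steps ignored
    return rewards
```

这样静态轨迹数据也能携带轨迹级的完成度信息,从而在不产生真实交互成本的情况下模拟在线反馈。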
[AI-6] QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation
【速读】:该论文旨在解决家庭服务机器人在开放环境中执行自主操作任务时面临的挑战,特别是针对多样化铰接物体(articulated objects)的灵巧操作问题。现有方法难以生成足够多样且高性能的低层轨迹基元(trajectory primitives),导致机器人在面对真实世界中的动态约束和意外变化时适应能力不足。解决方案的关键在于提出一种基于质量-多样性算法(Quality-Diversity, QD)的方法——QDTraj,该方法通过稀疏奖励探索机制,在模拟环境中自动生成大量高质量且多样化的轨迹基元,从而提升机器人对不同物体结构的泛化能力和任务执行鲁棒性。实验表明,QDTraj 在铰链和滑动类物体激活任务中生成的轨迹多样性至少比对比方法高出5倍,并在PartNetMobility数据集上的30种关节物体上实现了平均每个任务704条轨迹的高效生成。
链接: https://arxiv.org/abs/2604.22551
作者: Mathilde Kappel,Mahdi Khoramshahi,Louis Annabi,Faiz Ben Amar,Stéphane Doncieux
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, webpage: this https URL
Abstract:Thanks to the latest advances in learning and robotics, domestic robots are beginning to enter homes, aiming to execute household chores autonomously. However, robots still struggle to perform autonomous manipulation tasks in open-ended environments. In this context, this paper presents a method that enables a robot to manipulate a wide spectrum of articulated objects. In this paper, we automatically generate different robot low-level trajectory primitives to manipulate given object articulations. A very important point when it comes to generating expert trajectories is to consider the diversity of solutions to achieve the same goal. Indeed, knowing diverse low-level primitives to accomplish the same task enables the robot to choose the optimal solution in its real-world environment, with live constraints and unexpected changes. To do so, we propose a method based on Quality-Diversity algorithms that leverages sparse reward exploration in order to generate a set of diverse and high-performing trajectory primitives for a given manipulation task. We validated our method, QDTraj, by generating diverse trajectories in simulation and deploying them in the real world. QDTraj generates at least 5 times more diverse trajectories for both hinge and slider activation tasks, outperforming the other methods we compared against. We assessed the generalization of our method over 30 articulations of the PartNetMobility articulated object dataset, with an average of 704 different trajectories by task. Code is publicly available at: this https URL
[AI-7] ArmSSL: Adversarial Robust Black-Box Watermarking for Self-Supervised Learning Pre-trained Encoders
【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)编码器在知识产权(IP)保护中面临的两大挑战:一是如何在黑盒场景下对被窃取的编码器进行所有权验证,即当编码器被用于下游任务时仍能确认其归属;二是如何抵御对抗性水印检测或移除攻击,因为水印样本通常会形成可区分的分布外(Out-of-Distribution, OOD)聚类。解决方案的关键在于提出ArmSSL框架,通过三个核心机制实现:(1) 配对差异放大(paired discrepancy enlargement),强制清洁样本与其水印对应样本在特征空间正交,从而在黑盒环境下生成可靠的验证信号;(2) 潜在表示纠缠与分布对齐(latent representation entanglement and distribution alignment),前者将水印表示与非源类别表示纠缠以避免形成密集水印簇,后者最小化水印与清洁表示之间的分布差异,使水印样本伪装为分布内自然数据;(3) 参考引导的水印调优策略,确保水印作为小辅助任务被学习而不影响主任务性能,通过保持水印编码器在正常数据上的输出与原始干净编码器一致来维持模型效用。
链接: https://arxiv.org/abs/2604.22550
作者: Yongqi Jiang,Yansong Gao,Boyu Kuang,Chunyi Zhou,Anmin Fu,Liquan Chen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-supervised learning (SSL) encoders are invaluable intellectual property (IP). However, no existing SSL watermarking for IP protection can concurrently satisfy the following two practical requirements: (1) provide ownership verification capability under black-box suspect model access once the stolen encoders are used in downstream tasks; (2) be robust under adversarial watermark detection or removal, because the watermark samples form a distinguishable out-of-distribution (OOD) cluster. We propose ArmSSL, an SSL watermarking framework that assures black-box verifiability and adversarial robustness while preserving utility. For verification, we introduce paired discrepancy enlargement, enforcing feature-space orthogonality between the clean and its watermark counterpart to produce a reliable verification signal in black-box against the suspect model. For adversarial robustness, ArmSSL integrates latent representation entanglement and distribution alignment to suppress the OOD clustering. The former entangles watermark representations with clean representations (i.e., from non-source-class) to avoid forming a dense cluster of watermark samples, while the latter minimizes the distributional discrepancy between watermark and clean representations, thereby disguising watermark samples as natural in-distribution data. For utility, a reference-guided watermark tuning strategy is designed to allow the watermark to be learned as a small side task without affecting the main task by aligning the watermarked encoder’s outputs with those of the original clean encoder on normal data. Extensive experiments across five mainstream SSL frameworks and nine benchmark datasets, along with end-to-end comparisons with SOTAs, demonstrate that ArmSSL achieves superior ownership verification, negligible utility degradation, and strong robustness against various adversarial detection and removal.
[AI-8] On the Properties of Feature Attribution for Supervised Contrastive Learning
【速读】:该论文旨在解决传统基于交叉熵(Cross-Entropy, CE)损失函数训练的神经网络在可解释性方面存在的不足,尤其是在特征归因(feature attribution)的忠实性(faithfulness)、复杂度(complexity)和连续性(continuity)方面的局限。其解决方案的关键在于采用监督对比学习(Supervised Contrastive Learning, SCL)作为训练目标,通过构建一个结构化的嵌入空间,使同类样本投影靠近、异类样本投影远离,从而提升模型输出的特征归因解释质量。实验表明,SCL相比CE及其他对比学习方法,在保障分类准确性的同时显著增强了模型的透明性和可信度,尤其适用于对安全性要求较高的应用场景。
链接: https://arxiv.org/abs/2604.22540
作者: Leonardo Arrighi,Julia Eva Belloni,Aurélie Gallet,Ivan Gentile,Matteo Lippi,Marco Zullich
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Most Neural Networks (NNs) for classification are trained using Cross-Entropy (CE) as a loss function. This approach requires the model to have an explicit classification layer. However, there exist alternative approaches, such as Contrastive Learning (CL). Instead of explicitly operating a classification, CL has the NN produce an embedding space where projections of similar data are pulled together, while projections of dissimilar data are pushed apart. In the case of Supervised CL (SCL), labels are adopted as similarity criteria, thus creating an embedding space where the projected data points are well-clustered. SCL provides crucial advantages over CE with regard to adversarial robustness and out-of-distribution detection, thus making it a more natural choice in safety-critical scenarios. In the present paper, we empirically show that NNs for image classification trained with SCL present higher-quality feature attribution explanations than CE with regard to faithfulness, complexity, and continuity. These results reinforce previous findings about CL-based approaches when targeting more trustworthy and transparent NNs and can guide practitioners in the selection of training objectives targeting not only accuracy, but also transparency of the models.
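摘要中 SCL 的核心是监督对比损失(SupCon, Khosla et al. 2020),目标是"同类拉近、异类推远"。下面给出一个极简的 numpy 示意实现(仅用于说明该损失的形式,变量名与超参均为假设,并非论文作者的代码):

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """监督对比损失(SupCon)的极简 numpy 示意。
    z: (N, d) 嵌入矩阵(内部会做 L2 归一化); labels: 长度为 N 的类别标签。"""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau                       # 两两相似度 logits
    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue                          # 无同类样本的锚点跳过
        others = [a for a in range(n) if a != i]
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        loss += -np.mean([sim[i, p] - log_denom for p in pos])
        count += 1
    return loss / count
```

同类嵌入聚集时该损失接近 0,同类被推远时损失显著增大,这正是 SCL 得到良好聚类嵌入空间的训练信号。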
[AI-9] FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records
【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHR)中特征工程的复杂性问题,尤其是由不规则观测间隔、变量测量频率差异以及临床时间序列固有结构稀疏性带来的挑战。现有自动化方法要么缺乏临床领域知识,要么假设输入数据为清洁且规则采样的,难以直接应用于真实世界EHR数据。其解决方案的关键在于提出FeatEHR-LLM框架,利用大语言模型(Large Language Models, LLMs)从不规则采样EHR时间序列中生成具有临床意义的表格特征;通过仅在数据集模式和任务描述上运行LLM以保护患者隐私,并引入工具增强型生成机制,使LLM具备查询不规则时间数据的专业能力,从而生成可执行的特征提取代码,显式处理不均匀观测模式与信息稀疏性,支持单变量与多变量特征生成,并通过验证驱动的迭代流程提升性能。
链接: https://arxiv.org/abs/2604.22534
作者: Hojjat Karami,David Atienza,Jean-Philippe Thiran,Anisoara Ionescu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present \textbfFeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at this http URL.
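摘要强调对不规则采样与"信息性稀疏"的显式处理。下面是一个示意性的特征提取函数,展示 LLM 生成的可执行代码大致需要产出的那类表格特征(函数名与特征名均为假设,并非 FeatEHR-LLM 的实际输出):

```python
def extract_features(ts, window_end, window_hours=48.0):
    """从不规则采样的单变量时间序列中提取简单表格特征。
    ts: [(time_in_hours, value), ...],时间戳无需等间隔。
    特征显式编码观测稀疏性:测量次数与观测新近度本身即是信息。"""
    obs = sorted((t, v) for t, v in ts
                 if window_end - window_hours <= t <= window_end)
    if not obs:
        return {"count": 0, "last": None, "mean": None, "hours_since_last": None}
    values = [v for _, v in obs]
    return {
        "count": len(obs),                            # 测量频率
        "last": obs[-1][1],                           # 末次观测值
        "mean": sum(values) / len(values),            # 窗口内均值
        "hours_since_last": window_end - obs[-1][0],  # 观测新近度
    }
```

注意该函数只依赖数据模式(列名、时间语义),不接触任何真实患者记录,与论文中"LLM 仅见 schema 与任务描述"的隐私设计一致。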
[AI-10] On the Hybrid Nature of ABPMS Process Frames and its Implications on Automated Process Discovery
【速读】:该论文旨在解决AI增强型业务流程管理系统(ABPMS)中如何有效定义和实现“过程框架”(process frame)的问题,以支持系统在保持过程感知能力的同时具备半自主行为(称为“框架自主性”(framed autonomy))。传统流程模型往往过于严格且缺乏灵活性,难以适应复杂多变的业务环境。论文提出将过程框架建模为一种混合业务流程表示形式,由半并发执行的程序化(procedural)与声明式(declarative)流程模型组成,其关键在于引入开放世界假设(open-world assumption),对程序化模型采用约束式解释——即每个程序化模型仅对其内部活动施加约束,而不强制执行顺序或限制其他模型中的活动。这种设计使得不同模型之间可独立运作、互不干扰,类似于声明式语言(如Declare)中约束的局部作用机制,从而为构建基于发现的、可动态调整的过程框架提供了理论基础与实现路径。
链接: https://arxiv.org/abs/2604.22455
作者: Anti Alman,Izack Cohen,Avigdor Gal,Fabrizio Maria Maggi,Marco Montali
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A core component of any AI-Augmented Business Process Management System (ABPMS) is the process frame, which gives the system process-awareness and defines the boundaries in which the system must operate. Compared to traditional process models, the process frame should, in principle, provide a somewhat more permissive representation of the managed processes, such that the (semi) autonomous behavior of an ABPMS, referred to as framed autonomy, could emerge. At the same time, it is not limited to a single linguistic or symbolic formalism and may incorporate heterogeneous knowledge ranging from predefined procedures to commonsense rules and best practices. In this paper, we conceptualize the notion of an ABPMS process frame as a hybrid business process representation, consisting of semi-concurrently executed procedural and declarative process models. We rely on our earlier works to outline the execution semantics of this type of process frame, arguing in favor of adopting the open-world assumption of the declarative paradigm also for procedural process models. The latter leads to a constraint-like interpretation, where each procedural model is considered to constrain the activities within that model, without imposing explicit execution requirements nor limitations on activities that may be present in other models. This is analogous to existing declarative languages, such as Declare, where each constraint has a direct effect only on the specific activities being constrained. Given this similarity, we propose mapping subsets of discovered declarative constraints into equivalent semi-concurrently executed procedural fragments, thus laying the foundation for a corresponding process (frame) discovery approach.
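文中"程序化模型按开放世界假设解释为约束、只限制模型自身的活动"的思想,可用如下极简代码说明(这里把程序化片段的满足判定简化为"投影并折叠重复后恰好等于模型序列",属作者论文之外的假设性简化,并非完整执行语义):

```python
from itertools import groupby

def satisfies_procedural(trace, model):
    """开放世界解释下的程序化片段判定(假设性简化):
    只看 trace 在该模型活动集合上的投影,其余活动一概不受限;
    投影折叠相邻重复后须恰好等于模型给定的顺序。"""
    alphabet = set(model)
    projected = [a for a in trace if a in alphabet]
    return [a for a, _ in groupby(projected)] == model

def response(trace, a, b):
    """Declare 风格的声明式约束 response(a, b):每次 a 出现后必须再出现 b。"""
    return all(b in trace[i + 1:] for i, x in enumerate(trace) if x == a)
```

两类判定都只对各自涉及的活动生效,这正对应文中"每个约束仅直接作用于被约束的具体活动"的类比。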
[AI-11] From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)在组织层面缺乏灵活性与自适应能力的问题,即当前系统受限于固定的团队结构、紧耦合的协调逻辑以及会话绑定的学习机制,难以应对开放域任务和动态环境变化。其核心解决方案是提出OneManCompany(OMC)框架,关键在于引入一个分层的组织架构:通过将技能、工具和运行时配置封装为可移植的“人才”(Talent)实体,并借助类型化的组织接口抽象异构后端;同时构建社区驱动的“人才市场”实现按需招募与动态重组;并通过Explore-Execute-Review(E²R)树搜索机制统一规划、执行与评估过程,在保证终止性和无死锁的前提下形成闭环反馈,从而使得多智能体系统从静态预配置流水线跃升为具备自我组织与持续改进能力的AI组织。
链接: https://arxiv.org/abs/2604.22446
作者: Zhengxu Yu,Yu Fu,Zhiyuan He,Yuxuan Huang,Lee Ka Yiu,Meng Fang,Weilin Luo,Jun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages,13 figures
Abstract:Individual agent capabilities have advanced rapidly through modular skills and tool integrations, yet multi-agent systems remain constrained by fixed team structures, tightly coupled coordination logic, and session-bound learning. We argue that this reflects a deeper absence: a principled organisational layer that governs how a workforce of agents is assembled, governed, and improved over time, decoupled from what individual agents know. To fill this gap, we introduce OneManCompany (OMC), a framework that elevates multi-agent systems to the organisational level. OMC encapsulates skills, tools, and runtime configurations into portable agent identities called Talents, orchestrated through typed organisational interfaces that abstract over heterogeneous backends. A community-driven Talent Market enables on-demand recruitment, allowing the organisation to close capability gaps and reconfigure itself dynamically during execution. Organisational decision-making is operationalised through an Explore-Execute-Review (E²R) tree search, which unifies planning, execution, and evaluation in a single hierarchical loop: tasks are decomposed top-down into accountable units and execution outcomes are aggregated bottom-up to drive systematic review and refinement. This loop provides formal guarantees on termination and deadlock freedom while mirroring the feedback mechanisms of human enterprises. Together, these contributions transform multi-agent systems from static, pre-configured pipelines into self-organising and self-improving AI organisations capable of adapting to open-ended tasks across diverse domains. Empirical evaluation on PRDBench shows that OMC achieves an 84.67% success rate, surpassing the state of the art by 15.48 percentage points, with cross-domain case studies further demonstrating its generality.
[AI-12] CognitiveTwin: Robust Multi-Modal Digital Twins for Predicting Cognitive Decline in Alzheimer's Disease
【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s disease, AD)个体认知衰退预测的难题,其核心挑战在于疾病进展的异质性、对临床工具准确性与公平性的高要求,以及对缺失数据(特别是缺失非随机,Missing-not-at-random, MNAR)的鲁棒性需求。解决方案的关键在于提出一种名为CognitiveTwin的数字孪生框架,该框架融合多模态纵向数据(包括认知评分、磁共振成像、正电子发射断层扫描、脑脊液生物标志物和遗传信息),采用基于Transformer的架构实现模态融合,并引入深度马尔可夫模型(Deep Markov Model)捕捉时间动态特征,从而实现精准且个性化的认知轨迹预测,同时在不同人群间表现出良好的公平性并具备对临床脱落数据的强鲁棒性。
链接: https://arxiv.org/abs/2604.22428
作者: Bulent Soykan,Gulsah Hancerliogullari Koksalmis,Hsin-Hsiung Huang,Laura J. Brattain
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 6 figures
Abstract:Predicting individual cognitive decline in Alzheimer's disease (AD) is difficult due to the heterogeneity of disease progression. Reliable clinical tools require not only high accuracy but also fairness across demographics and robustness to missing data. We present CognitiveTwin, a digital twin framework that predicts patient-specific cognitive trajectories. The model integrates multi-modal longitudinal data (cognitive scores, magnetic resonance imaging, positron emission tomography, cerebrospinal fluid biomarkers, and genetics). We use a Transformer-based architecture to fuse these modalities and a Deep Markov Model to capture temporal dynamics. We trained and evaluated the framework using data from 1,666 patients in the TADPOLE (Alzheimer's Disease Neuroimaging Initiative) dataset. We assessed the model for prediction error, demographic fairness, and robustness to missing-not-at-random (MNAR) data patterns. CognitiveTwin provides accurate and personalized predictions of cognitive decline. Its demonstrated fairness across patient demographics and resilience to clinical dropout make it a reliable tool for clinical trial enrichment and personalized care planning.
[AI-13] How Hard is it to Decide if a Fact is Relevant to a Query? KR’26
【速读】:该论文旨在解决数据库中事实相关性判定问题,即给定一个数据库 $ D $、一个布尔型 conjunctive query (CQ) $ q $ 和 $ D $ 中的一个事实 $ f $,判断 $ f $ 是否属于某个最小子集 $ S \subseteq D $ 使得 $ S \models q $(亦即 $ f $ 对查询结果是必要的)。这一问题在查询答案解释中具有核心意义,但其联合复杂度此前未被系统研究。论文发现,相关性判定的复杂度高于查询评估:对于一般 CQ,其复杂度为 Σ2p-完全,即使在二元符号下也如此;且对(无环)链式 CQ 已经是 NP-难的。关键突破在于识别出“自连接”(self-joins,即多个原子使用相同关系)是导致复杂度升高的根本原因。若限制或控制自连接的出现次数,则相关性判定的复杂度可降至与查询评估一致——无结构限制时为 NP,有界超树宽类时为 LogCFL。在描述逻辑(DL-Lite_R)语义下,进一步证明若限制交互宽度(interaction width,推广了自连接宽度和最近提出的“无交互”条件),相关性判定同样不会比查询回答更难。因此,论文精准定位了相关性计算困难的本质,并提出了若干自然查询类别的高效算法实现。
链接: https://arxiv.org/abs/2604.22422
作者: Meghyn Bienvenu,Diego Figueira,Pierre Lafourcade
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: Long version of KR’26 paper
Abstract:We consider the following fundamental problem: given a database D, Boolean conjunctive query (CQ) q, and fact f in D, decide whether f is relevant to q wrt. D, i.e., does f belong to a minimal subset S of D such that S |= q. Despite being of central importance to query answer explanation, the combined complexity of deciding query relevance has not been studied in detail, leaving open what makes this problem hard, and which restrictions can yield lower complexity. Relevance has already been shown to be harder than query evaluation: namely, \Sigma^p_2 -complete for CQs, even over a binary signature. We further observe that NP-hardness applies already to (acyclic) chain CQs. Our work identifies self-joins (multiple atoms with the same relation) as the culprit. Indeed, we prove that if we forbid or bound the occurrence of self-joins, then relevance has the same complexity as query evaluation, namely, NP (without structural restrictions) and LogCFL (for bounded hypertreewidth classes). In the ontology setting, we establish an analogous result for ontology-mediated queries consisting of a CQ and DL-Lite_R ontology, namely that relevance is no harder than query answering provided that we bound the interaction width (which generalizes both self-join width and a recently introduced ‘interaction-free’ condition). Our results thus pinpoint what makes relevance harder than query evaluation and identify natural classes of queries which admit efficient relevance computation.
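对很小的数据库,事实相关性可以按定义暴力判定:枚举所有满足 q 的极小子集,再检查 f 是否落入其中。下面的 Python 草图只为演示这一定义,并以摘要中提到的带自连接的链式 CQ 为例(复杂度为指数级,绝非论文的算法):

```python
from itertools import combinations

def minimal_witnesses(db, q):
    """枚举 D 中所有满足布尔查询 q 的极小子集(指数级,仅用于极小的玩具数据库)。"""
    facts = list(db)
    sat = [frozenset(s) for r in range(1, len(facts) + 1)
           for s in combinations(facts, r) if q(set(s))]
    return [s for s in sat if not any(t < s for t in sat)]

def is_relevant(f, db, q):
    """按定义:f 相关 当且仅当 f 属于某个满足 q 的极小子集。"""
    return any(f in s for s in minimal_witnesses(db, q))

def chain(S):
    """布尔链式 CQ(带自连接):∃x,y,z. R(x,y) ∧ R(y,z),事实用二元组 (x, y) 表示。"""
    return any(u[1] == v[0] for u in S for v in S)
```

例如在 D = {R(a,b), R(e,e)} 上,任何满足链式查询的子集都必须包含自环 R(e,e),因此 {R(e,e)} 是唯一极小见证,R(a,b) 不相关;这也直观展示了自连接如何让相关性判定变得微妙。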
[AI-14] From Local to Cluster: A Unified Framework for Causal Discovery with Latent Variables
【速读】:该论文旨在解决因果发现与推断中因潜在变量(Latent Variables)带来的挑战,特别是传统局部方法无法提供宏观层面的因果洞察,而集群级方法要么依赖先验已知的聚类结构,要么假设因果充分性(Causal Sufficiency),且直接将单变量因果发现方法应用于集群级别会违反因果充分性假设并导致错误结果。解决方案的关键在于提出一种统一框架 L2C(Local to Cluster Causal Abstraction),其核心创新是:通过一个集群简化定理(Cluster Reduction Theorem)将任意集群压缩至最多三个节点而不损失因果信息;利用局部因果发现识别存在潜在变量时的直接因果关系、效应及V结构;并通过学习到的集群图进行宏观层级因果推理(Cluster-level Causal Inference)。L2C 不依赖因果充分性假设,因为潜在变量通过局部模式自动识别处理,理论分析表明其具备保真性(Soundness)、原子完备性(Atomic Completeness)和计算效率,实验证明其在合成数据和真实世界数据上均能准确恢复真实集群并显著优于现有基线方法。
链接: https://arxiv.org/abs/2604.22416
作者: Zongyu Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Latent variables pose a fundamental challenge to causal discovery and inference. Conventional local methods focus on direct neighbors but fail to provide macro level insights. Cluster level methods enable macro causal reasoning but either assume clusters are known a priori or require causal sufficiency. Moreover, directly applying single variable causal discovery methods to cluster level problems violates causal sufficiency and leads to incorrect results. To overcome these limitations, this paper proposes L2C (Local to Cluster Causal Abstraction), a unified framework that bridges local structure learning and cluster level causal discovery. Unlike prior work that requires a complete manual assignment of micro variables to clusters, L2C discovers the partition automatically from local causal patterns. Our solution leverages a cluster reduction theorem to reduce any cluster to at most three nodes without loss of causal information, applies local causal discovery to identify direct causes, effects, and V structures in the presence of latent variables, and performs macro level causal inference via cluster level calculus on the learned cluster graph. L2C does not assume causal sufficiency, as latent variables are handled through local discovery. Theoretical analysis shows that L2C ensures soundness, atomic completeness, and computational efficiency. Extensive experiments on synthetic and real world data demonstrate that L2C accurately recovers ground truth clusters and achieves superior macro causal effect identification compared to existing baselines.
[AI-15] Distance-Misaligned Training in Graph Transformers and Adaptive Graph-Aware Control
【速读】:该论文旨在解决图 Transformer(Graph Transformer)在节点分类任务中因全局信息混合能力带来的失效模式问题,即模型在不同任务中对长距离或局部信息的需求存在差异,而现有固定结构的图注意力机制难以自适应调整通信范围。其关键解决方案是引入“距离错配训练”(distance-misaligned training)的概念,通过一个可控的合成节点分类基准(基于上下文随机块模型图),量化标签相关信号与模型通信距离之间的偏差,并设计两种控制器:一是基于任务侧目标的 oracle 自适应控制器,能根据任务特性动态调整通信距离偏好,显著优于固定偏置和中性基线;二是无任务感知的零间隙控制器,结果表明仅靠适应性不足以提升性能,强调控制目标的明确性至关重要。这一方法为诊断图 Transformer 失效提供了距离分辨视角,并推动了图感知控制机制的设计。
链接: https://arxiv.org/abs/2604.22413
作者: Qinhan Hou,Jing Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by Graph Signal Processing Workshop 2026 as an extended abstract
Abstract:Graph Transformers can mix information globally, but this flexibility also creates failure modes: some tasks require long-range communication while others are better served by local interaction. We study this through a synthetic node-classification benchmark on contextual stochastic block model graphs, where labels are generated by a controllable mixture of local and far-shell signals. We define distance-misaligned training as a mismatch between where label-relevant information lies and where the model allocates communication over graph distance. On this benchmark, we find three points. First, the preferred graph-distance bias changes systematically with task locality. Second, an oracle adaptive controller, given offline access to the task-side distance target, nearly matches the best fixed bias across regimes and strongly improves over a neutral baseline on mixed and local tasks. Third, a task-agnostic zero-gap controller is weaker, indicating that adaptation alone is not enough and that the control target matters. These results suggest that distance-resolved diagnosis is useful for understanding Graph Transformer failures and for designing graph-aware control.
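摘要中的"图距离偏置"可以理解为在注意力 logits 上叠加一个随最短路距离变化的偏置项,偏置强度决定通信是偏局部还是偏远程。下面是一个 numpy 示意(参数化方式为常见做法,未必与论文一致):

```python
import numpy as np

def distance_biased_attention(scores, dist, beta):
    """在注意力 logits 上加一个随图距离线性衰减的偏置(示意实现):
    beta > 0 偏向局部通信,beta < 0 偏向远程通信。
    scores: (n, n) 原始注意力 logits; dist: (n, n) 节点间最短路距离。"""
    logits = scores - beta * dist
    logits -= logits.max(axis=-1, keepdims=True)   # 数值稳定
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)
```

文中"自适应控制器"所调节的,正是这类偏置的符号与大小,使之匹配任务中标签相关信号所处的距离。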
[AI-16] Hidden Failure Modes of Gradient Modification under Adam in Continual Learning and Adaptive Decoupled Moment Routing as a Repair
【速读】:该论文旨在解决持续学习(Continual Learning, CL)中梯度修改方法与自适应优化器(如Adam)组合时存在的隐性失效问题。在高重叠场景下,传统基于投影、惩罚重缩放或回放缓冲混合的梯度修改策略在Adam优化器下会引发性能崩溃,导致模型接近原始遗忘水平(如8域任务中达到12.5–12.8困惑度,仅略优于基线13.2)。研究发现,这一失败源于Adam的二阶矩路径:梯度投影会导致旧方向有效学习率被放大至 1/(1−α) 倍(α 为Adam动量参数),且实测值与理论预测误差小于8%。解决方案的关键在于将修改后的梯度仅用于更新一阶矩(即动量),同时保留对二阶矩(即梯度平方均值)的原始统计特性,辅以基于任务重叠度自适应调整强度的机制。此简单但关键的改动成为唯一能在多种方法、优化器和模型规模(包括7B参数LoRA微调)下稳定避免性能塌陷的配置。
链接: https://arxiv.org/abs/2604.22407
作者: Yuelin Hu,Zhenbo Yu,Zhengxue Cheng,Wei Liu,Li Song
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 28 pages, 5 figures, preprint
Abstract:Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5–12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling falls below vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam’s second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
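论文修复方案的核心是"修改后的梯度只进入一阶矩,二阶矩保留原始梯度统计"。下面用 numpy 给出单步更新的极简示意(假设性实现,变量名与超参均为示意,未包含论文中基于重叠度的自适应强度部分):

```python
import numpy as np

def decoupled_adam_step(theta, m, v, g_raw, g_mod, t,
                        lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """解耦矩路由的 Adam 单步示意:
    方向由修改后的梯度 g_mod(如投影后)决定,
    而二阶矩 v 始终由原始梯度 g_raw 统计,
    从而避免投影在旧方向上造成的有效学习率膨胀。"""
    m = b1 * m + (1 - b1) * g_mod          # 一阶矩:用修改后的梯度
    v = b2 * v + (1 - b2) * g_raw ** 2     # 二阶矩:保留原始梯度的幅度统计
    m_hat = m / (1 - b1 ** t)              # 偏差修正
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

对比标准 Adam,被投影掉的分量(g_mod 中为 0)不再产生参数更新,但其原始幅度仍计入 v,分母不会人为缩小。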
[AI-17] LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios ICRA2026
【速读】:该论文旨在解决家庭环境中柔性物体(deformable objects)操作的难题,特别是在仿真与真实世界执行中因物体种类多样、形状复杂、动力学特性多变及材料属性差异导致的挑战,以及现有仿真平台对柔性物体支持不足的问题。其解决方案的关键在于提出LeHome,一个面向家庭场景下柔性物体操作的综合性仿真环境,该环境涵盖衣物和食品等多种柔性物体,提供高保真动力学模拟和逼真交互效果,并支持多种机器人本体,尤其强调低成本机器人平台,从而实现资源受限硬件上的端到端家庭任务评估,有效弥合了现实柔性物体仿真与实际机器人平台之间的鸿沟。
链接: https://arxiv.org/abs/2604.22363
作者: Zeyi Li,Yushi Yang,Shawn Xie,Kyle Xu,Tianxing Chen,Yuran Wang,Zhenhao Shen,Yan Shen,Yue Chen,Wenjun Li,Yukun Zheng,Chaorui Zhang,Siyi Lin,Fei Teng,Hongjun Yang,Ming Chen,Steve Xie,Ruihai Wu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: ICRA2026 Accepted
Abstract:Household environments present one of the most common, impactful yet challenging application domains for robotics. Within household scenarios, manipulating deformable objects is particularly difficult, both in simulation and real-world execution, due to varied categories and shapes, complex dynamics, and diverse material properties, as well as the lack of reliable deformable-object support in existing simulations. We introduce LeHome, a comprehensive simulation environment designed for deformable object manipulation in household scenarios. LeHome covers a wide spectrum of deformable objects, such as garments and food items, offering high-fidelity dynamics and realistic interactions that existing simulators struggle to simulate accurately. Moreover, LeHome supports multiple robotic embodiments and emphasizes low-cost robots as a core focus, enabling end-to-end evaluation of household tasks on resource-constrained hardware. By bridging the gap between realistic deformable object simulation and practical robotic platforms, LeHome provides a scalable testbed for advancing household robotics. Webpage: this https URL .
[AI-18] FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting
【速读】:该论文旨在解决能源时间序列预测任务中因数据集特定性导致的可扩展性差、模型开发与维护成本高的问题,尤其是在数据受限或隐私敏感场景下难以部署传统机器学习方法的挑战。其解决方案的关键在于引入基础模型(foundation models),通过大规模预训练学习通用模式,并在多个能源预测场景中进行基准测试,结果表明:即使不使用完整的历史目标数据进行微调,基础模型在所有数据类别和预测设置中均显著优于经过优化的经典机器学习方法,尤其在引入协变量信息时表现最优,验证了其作为可扩展、通用能源预测工具的巨大潜力。
链接: https://arxiv.org/abs/2604.22328
作者: Marco Obermeier,Marco Pruckner,Florian Haselbeck,Andreas Zeiselmair
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort. Recently, foundation models that aim to learn generalizable patterns via extensive pretraining have shown superior performance in multiple prediction tasks. Despite their success and strong potential to address challenges in energy forecasting, their application in this domain remains largely unexplored. We address this gap by presenting the Foundation Models in Energy Time Series Forecasting (FETS) benchmark. We (1) provide a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories; (2) collect and analyze 54 datasets across 9 data categories, guided by typical stakeholder interests; (3) benchmark foundation models against classical machine learning approaches across different forecasting settings. Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories, despite the latter having seen the full historic target data during training. In particular, covariate-informed foundation models achieve the strongest performance. Further analysis reveals a strong correlation between predictive performance and spectral entropy, performance saturation beyond a certain context length, and improved performance at higher aggregation levels such as national load, district heating, and power grid data. Overall, our findings highlight the strong potential of foundation models as scalable and generalizable forecasting solutions for the energy domain, particularly in data-constrained and privacy-sensitive settings.
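文中报告预测性能与时间序列的谱熵(spectral entropy)强相关。归一化谱熵的一种常见计算方式如下(具体定义细节可能与论文不同,此处仅作示意):

```python
import numpy as np

def spectral_entropy(x):
    """归一化谱熵:把(去直流的)功率谱视为概率分布,
    取其香农熵并除以 log(K) 归一到 [0, 1];越接近 1 越像白噪声。"""
    psd = np.abs(np.fft.rfft(x - np.mean(x))) ** 2
    psd = psd[1:]                      # 去掉直流分量
    p = psd / psd.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(psd)))
```

能量集中在少数频率(如规律性强的负荷曲线)时谱熵低、可预测性高;接近白噪声时谱熵趋近 1,这与文中"性能与谱熵强相关"的观察方向一致。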
[AI-19] BLAST: Benchmarking LLMs with ASP-based Structured Testing
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理声明式编程范式(declarative paradigms)如答案集编程(Answer Set Programming, ASP)时的评估缺乏系统性和准确性的问题。当前LLMs在自然语言理解、对话系统和代码生成等领域表现优异,但在ASP这类逻辑编程任务中的能力尚未得到充分量化。解决方案的关键在于提出BLAST——首个专门针对LLMs生成ASP代码准确性的基准测试方法及配套数据集,其核心创新是设计了两个面向ASP语义特性的新指标,从而构建了一个结构化的评估框架,并基于此对八种前沿LLMs在十个图论相关ASP问题上的表现进行了实证分析。
链接: https://arxiv.org/abs/2604.22306
作者: Manuel Alejandro Borroto Santana,Erica Coppolillo,Francesco Calimeri,Giuseppe Manco,Simona Perri,Francesco Ricca
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite evident progress, less attention has been paid to date to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper we introduce BLAST: the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.
[AI-20] When Does LLM Self-Correction Help? A Control-Theoretic Markov Diagnostic and Verify-First Intervention
【速读】:该论文旨在解决生成式 AI(Generative AI)系统中迭代自修正(iterative self-correction)行为的效益边界不明确的问题,即重复修正在何时有助于提升性能、何时反而有害。其核心解决方案是将自修正建模为一个由同一语言模型充当控制器与被控对象(plant)的控制论反馈回路,并引入基于正确率(Acc)、期望纠正率(ECR)和错误迭代率(EIR)的两状态马尔可夫模型作为诊断工具:仅当 ECR / EIR > Acc / (1 - Acc) 时才应迭代;其中 EIR 作为稳定性裕度,提示通过提示工程(prompting)即可实现对自修正行为的有效控制。实验表明存在一个约 0.5% 的 EIR 阈值,低于此阈值则有益,高于则有害;通过“验证优先”提示策略可显著降低 EIR 并逆转性能下降,证明该阈值具有可操作性,从而主张将自修正视为需依据误差动态指标决策的控制行为而非默认机制。
链接: https://arxiv.org/abs/2604.22273
作者: Aofan Liu,Jingxiang Meng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Iterative self-correction is widely used in agentic LLM systems, but when repeated refinement helps versus hurts remains unclear. We frame self-correction as a cybernetic feedback loop in which the same language model serves as both controller and plant, and use a two-state Markov model over {Correct, Incorrect} to operationalize a simple deployment diagnostic: iterate only when ECR/EIR > Acc/(1 - Acc). In this view, EIR functions as a stability margin and prompting functions as lightweight controller design. Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (≈ 0.5%) separating beneficial from harmful self-correction. Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ~ 0.2%), and o4-mini (±0 pp) remain non-degrading; GPT-5 degrades by -1.8 pp. A verify-first prompt ablation provides causal evidence that this threshold is actionable through prompting alone: on GPT-4o-mini it reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4), while producing little change on already-sub-threshold models. ASC further illustrates the stopping trade-off: it halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost. Overall, the paper argues that self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
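按两状态马尔可夫模型,迭代一步后的期望正确率为 Acc·(1-EIR) + (1-Acc)·ECR,因此迭代有益当且仅当 (1-Acc)·ECR > Acc·EIR,即 ECR/EIR > Acc/(1-Acc)。下面的小例子直接实现这一判据(示意代码,变量名为假设):

```python
def expected_accuracy_after_step(acc, ecr, eir):
    """两状态马尔可夫模型:一步自修正后的期望正确率。
    ecr: 错改对的概率; eir: 对改错的概率。"""
    return acc * (1 - eir) + (1 - acc) * ecr

def should_iterate(acc, ecr, eir):
    """部署判据 ECR/EIR > Acc/(1-Acc) 的直接实现,
    等价于"再迭代一步能提高期望正确率";交叉相乘以避免除零。"""
    return (1 - acc) * ecr > acc * eir
```

可以看到正确率越高,容忍的 EIR 越小,这正是强模型在低 EIR 阈值附近才不退化的原因。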
[AI-21] Semantic Error Correction and Decoding for Short Block Channel Codes
【速读】:该论文旨在解决在噪声无线信道中传输自然语言句子时,传统短码(short block codes)和长码(long codes)在语义保真度与解码延迟之间难以平衡的问题。其关键解决方案是提出一种语义增强型接收框架,通过将句子分段并独立编码传输,在接收端利用双向和自回归Transformer模型(BART)实现语义错误纠正(Semantic Error Correction, SEC),从而提升语义层面的鲁棒性;进一步引入语义列表译码(Semantic List Decoding, SLD)与语义置信度引导的混合自动重传请求机制(Semantic Confidence-guided HARQ, SHARQ),分别优化候选重建质量与减少冗余重传开销,最终在保持低延迟的同时显著提升语义准确性与系统性能。
链接: https://arxiv.org/abs/2604.22269
作者: Jiafu Hao,Chentao Yue,Wanchun Liu,Yonghui Li,Branka Vucetic
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: 13 pages
Abstract:This paper presents a semantic-enhanced receiver framework for transmitting natural language sentences over noisy wireless channels using multiple short block codes. After ASCII encoding, the sentence is divided into segments, each independently encoded with a short block code and transmitted over an AWGN channel. At the receiver, segments are decoded in parallel, followed by a semantic error correction (SEC) model, which reconstructs corrupted segments using language model context. We further propose the semantic list decoding (SLD), which generates multiple candidate reconstructions and selects the best one via weighted Hamming distance, and a semantic confidence-guided HARQ (SHARQ) mechanism that replaces CRC-based error detection with a confidence score, enabling selective segment retransmission without CRC overhead. All modules are designed and trained using bidirectional and auto-regressive transformers (BART). Simulation results demonstrate that the proposed scheme significantly outperforms conventional capacity-approaching short codes and long codes at the same rate. Specifically, SEC provides approximately 0.4 dB BLER gain over plain short-code transmission, while SLD extends this to 0.8 dB. Compared to transmitting the entire sentence as a single long 5G LDPC codeword, our approach significantly improves semantic fidelity and reduces decoding latency by up to 90%. SHARQ further provides an additional 1.5 dB gain over conventional HARQ.
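SLD 的候选选择步骤(按加权汉明距离挑选最优重建)可以用几行代码示意。论文中候选与权重由 BART 模型产生,此处直接作为输入,属假设性简化:

```python
def weighted_hamming(a, b, w):
    """置信度加权的汉明距离:只统计不一致的位置,按权重求和。"""
    return sum(wi for x, y, wi in zip(a, b, w) if x != y)

def select_candidate(candidates, received, conf):
    """语义列表译码的选择步骤(示意):
    在多个候选重建中,选与信道硬判决序列加权汉明距离最小者。
    conf 给出各位置的置信度,低置信位上的不一致代价更小。"""
    return min(candidates, key=lambda c: weighted_hamming(c, received, conf))
```

直觉是:与接收序列的分歧若集中在低置信位,说明候选很可能纠正了信道错误而非偏离原文。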
[AI-22] Protect the Brain When Treating the Heart: A Convolutional Neural Network for Detecting Emboli
【速读】:该论文旨在解决心脏结构性介入手术中气泡微栓塞(Gaseous Microemboli, GME)的检测与定量难题,此类并发症在外科和经导管干预中较为常见。传统经胸超声心动图虽能可视化循环中的GME,但其检测受操作者依赖性、高速运动及背景结构相似性等因素影响,难以实现准确识别与实时量化。论文提出基于2.5D U-Net架构的分割方法,通过空间-时间连续数据处理,在保持实时执行速度的同时显著提升对GME的鲁棒性检测与高精度分割能力,从而实现了GME面积随时间变化的可靠量化,成功集成至术中监测流程中。
链接: https://arxiv.org/abs/2604.22258
作者: Andrea Angino,Ken Trotti,Diego Ulisse Pizzagalli,Rolf Krause,Tiziano Torre,Stefanos Demertzis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Corresponding authors: Andrea Angino and Diego Ulisse Pizzagalli
Abstract:Gaseous microemboli (GME) represent a common complication of cardiac structural interventions across both surgical and transcatheter approaches. Transthoracic cardiac ultrasound imaging represents a convenient methodology to visualize the presence of circulating GME. However, their detection and quantification are far from trivial due to operator-dependent view, high velocity, and objects with similar structure in the background. Here, we propose an approach based on a 2.5D U-Net architecture to segment GME in space-time connected data. Such an approach yields robust detection against the background and high segmentation accuracy while retaining real-time execution speed. These properties facilitated the integration of the proposed pipeline into patient-monitoring surgical protocols, providing the quantification of GME area over time.
[AI-23] A Probabilistic Framework for Hierarchical Goal Recognition KR2026
【速读】:该论文旨在解决现实场景中目标识别(goal recognition)任务面临的两大挑战:一是如何有效利用任务的层次结构(hierarchical task structure),二是如何在不确定性下进行推理。现有基于规划的目标识别方法虽取得进展,但尚未能将层次任务网络(HTN)与概率推理相结合。其解决方案的关键在于提出首个基于规划的概率框架,用于在HTN结构上实现层次目标识别;具体通过引入一个三阶段生成模型来估计似然,并结合HTN规划器得到目标假设的后验分布,从而在HTN基准测试中显著优于现有方法,为面向实际应用的目标识别奠定了理论基础。
链接: https://arxiv.org/abs/2604.22256
作者: Chenyuan Zhang,Katherine Ip,Hamid Rezatofighi,Buser Say,Mor Vered
机构: 未知
类目: Symbolic Computation (cs.SC); Artificial Intelligence (cs.AI)
备注: Accepted by KR 2026
Abstract:Goal recognition aims to infer an agent’s goal from observations of its behaviour. In realistic settings, recognition can benefit from exploiting hierarchical task structure and reasoning under uncertainty. Planning-based goal recognition has made substantial progress over the past decade, but to the best of our knowledge no existing approach jointly integrates hierarchical task structure with probabilistic inference. In this paper, we introduce the first planning-based probabilistic framework for hierarchical goal recognition over Hierarchical Task Networks (HTNs). We instantiate the framework by exploiting an HTN planner with a three-stage generative model for likelihood estimation, yielding posterior distributions over goal hypotheses. Empirical results show improved recognition performance over the existing HTN-based recognizer on HTN benchmarks. Overall, the framework lays a foundation for probabilistic goal recognition grounded in hierarchical planning structure, moving goal recognition toward more practical settings.
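框架最终输出的是目标假设的后验分布,其归一化步骤就是标准的贝叶斯更新(似然在论文中由三阶段生成模型配合 HTN 规划器估计,此处直接作为输入,属示意):

```python
def goal_posterior(priors, likelihoods):
    """按贝叶斯规则计算目标假设的后验:P(g | obs) ∝ P(obs | g) · P(g)。
    priors / likelihoods: 以目标假设为键的字典。"""
    joint = {g: priors[g] * likelihoods[g] for g in priors}
    z = sum(joint.values())          # 归一化常数
    return {g: p / z for g, p in joint.items()}
```

识别器据此给出每个候选目标的概率,而非单一硬判定,这正是"概率化目标识别"区别于经典方法之处。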
[AI-24] Learning-augmented robotic automation for real-world manufacturing
【速读】:该论文旨在解决工业机器人在真实制造环境中执行柔性任务(如可变形电缆插入和焊接)时面临的适应性差、可靠性不足及安全性难以保障的问题。传统基于固定路径点的控制方法对环境变化敏感,而现有学习型控制方法尚未在长时间连续运行中验证其稳定性与一致性。解决方案的关键在于提出一种“学习增强型机器人自动化”(Learning-Augmented Robotic Automation)系统,该系统融合了由少量真实数据训练的任务控制器与一个神经网络驱动的3D安全监控模块,并无缝集成到传统工业工作流中,从而实现高可靠、高质量且安全的人机协作操作。
链接: https://arxiv.org/abs/2604.22235
作者: Yunho Kim,Quan Nguyen,Taewhan Kim,Youngjin Heo,Joonho Lee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Industrial robots are widely used in manufacturing, yet most manipulation still depends on fixed waypoint scripts that are brittle to environmental changes. Learning-based control offers a more adaptive alternative, but it remains unclear whether such methods, still mostly confined to laboratory demonstrations, can sustain hours of reliable operation, deliver consistent quality, and behave safely around people on a live production line. Here we present Learning-Augmented Robotic Automation, a hybrid system that integrates learned task controllers and a neural 3D safety monitor into conventional industrial workflows. We deployed the system on an electric-motor production line to automate deformable cable insertion and soldering under real manufacturing constraints, a step previously performed manually by human workers. With less than 20 min of real-world data per task, the system operated continuously for 5 h 10 min, producing 108 motors without physical fencing and achieving a 99.4% pass rate on product-level quality-control tests. It maintained near-human takt time while reducing variability in solder-joint quality and cycle time. These results establish a practical pathway for extending industrial automation with learning-based methods.
[AI-25] Preserve Support Not Correspondence: Dynamic Routing for Offline Reinforcement Learning
【速读】:该论文旨在解决单步离线强化学习(Offline Reinforcement Learning, Offline RL)中策略网络在无需迭代采样的情况下,如何有效平衡Q值优化与行为克隆(Behavior Cloning, BC)约束的问题。传统方法通常依赖于强迭代教师模型提供目标动作,使学生策略在同一样本上同时追求更高Q值和贴近数据支持的动作,导致局部改进受限,难以捕捉数据中潜在的更优邻近动作。其解决方案的关键在于提出DROL(Dynamic Routing for One-step Offline RL),一种基于潜在空间条件的单步策略网络:通过从受限先验中采样K个候选动作,动态地将每个数据动作分配给最近的候选动作,并仅对获胜候选动作进行行为克隆和批评者引导更新。由于路由机制根据当前候选几何结构重新计算,不同候选动作在训练过程中可动态接管数据支持区域,从而允许策略在网络层面实现局部优化,同时保持推理阶段的单次前向传播特性。
链接: https://arxiv.org/abs/2604.22229
作者: Zhancun Mu,Guangyu Zhao,Yiwu Zhong,Chi Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures
Abstract:One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset can support. In recent one-step extraction pipelines, a strong iterative teacher provides one target action for each latent draw, and the same student output is asked to do both jobs: move toward higher Q and stay near that paired endpoint. If those two directions disagree, the loss resolves them as a compromise on that same sample, even when a nearby better action remains locally supported by the data. We propose DROL, a latent-conditioned one-step actor trained with top-1 dynamic routing. For each state, the actor samples K candidate actions from a bounded latent prior, assigns each dataset action to its nearest candidate, and updates only that winner with Behavior Cloning and critic guidance. Because the routing is recomputed from the current candidate geometry, ownership of a supported region can shift across candidates over the course of learning. This gives a one-step actor room to make local improvements that pointwise extraction struggles to capture, while retaining single-pass inference at test time. On OGBench and D4RL, DROL is competitive with the one-step FQL baseline, improving many OGBench task groups while remaining strong on both AntMaze and Adroit. Project page: this https URL.
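摘要中的 top-1 动态路由可以用如下极简示意来理解(非论文实现;候选动作与数据动作均为假设的玩具数值,距离度量假定为欧氏距离):

```python
import math

def route_top1(candidates, dataset_action):
    """将数据集动作分配给欧氏距离最近的候选动作,返回获胜候选下标;
    训练时仅该获胜候选接受 BC 与 critic 引导的更新。"""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(candidates)),
               key=lambda i: dist(candidates[i], dataset_action))

# K=3 个候选动作(设想为从有界潜变量先验采样得到)
candidates = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
winner = route_top1(candidates, dataset_action=[0.9, 1.2])
```

由于路由在每次更新时按当前候选几何结构重新计算,某一数据支持区域的“归属”可以在训练过程中从一个候选转移到另一个候选。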
[AI-26] An LLM -Driven Closed-Loop Autonomous Learning Framework for Robots Facing Uncovered Tasks in Open Environments
【速读】:该论文旨在解决自主机器人在开放环境中面对未预定义任务时,难以将自身执行经验或外部观察所得知识转化为可复用的本地能力的问题。现有方法通常依赖频繁调用大语言模型(Large Language Model, LLM)处理未知任务,且即便成功完成任务或观察到他人成功行为,也往往无法自动转化为本地可重用的知识。解决方案的关键在于提出一种由LLM驱动的闭环自主学习框架:首先从本地方法库中检索是否存在适配当前任务的已有方案;若无,则触发自主学习流程,由LLM作为高层推理组件进行任务分析、候选模型选择、数据采集规划及执行或观察策略制定;随后机器人通过自执行与主动观察双重方式学习,进行准实时训练与调整,并将验证后的结果固化至本地方法库以供未来复用。该机制实现了从经验到本地能力的闭环转化,显著降低了对重复外部LLM交互的依赖。
链接: https://arxiv.org/abs/2604.22199
作者: Hong Su
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous robots operating in open environments need the ability to continuously handle tasks that are not covered by predefined local methods. However, existing approaches often rely on repeated large-language-model (LLM) interaction for uncovered tasks, and even successful executions or observed successful external behaviors are not always autonomously transformed into reusable local knowledge. In this paper, we propose an LLM-driven closed-loop autonomous learning framework for robots facing uncovered tasks in open environments. The proposed framework first retrieves the local method library to determine whether a reusable solution already exists for the current task or observed event. If no suitable method is found, it triggers an autonomous learning process in which the LLM serves as a high-level reasoning component for task analysis, candidate model selection, data collection planning, and execution or observation strategy organization. The robot then learns from both self-execution and active observation, performs quasi-real-time training and adjustment, and consolidates the validated result into the local method library for future reuse. Through this recurring closed-loop process, the robot gradually converts both execution-derived and observation-derived experience into reusable local capability while reducing future dependence on repeated external LLM interaction. Results show that the proposed framework reduces execution time and LLM dependence in both repeated-task self-execution and observation-driven settings, for example reducing the average total execution time from 7.7772s to 6.7779s and the average number of LLM calls per task from 1.0 to 0.2 in the repeated-task self-execution experiments.
[AI-27] Estimating Tail Risks in Language Model Output Distributions
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高频率部署场景下,由于罕见但严重危害性输出(harmful outputs)的出现概率极低,导致传统安全评估方法难以有效捕捉其尾部风险(tail risk)的问题。解决方案的关键在于引入基于重要性采样(importance sampling)的高效估计框架:通过构建“不安全版本”的目标模型(unsafe versions of the target model),人为提高有害输出的概率,从而实现样本高效的稀有事件估计。相比传统的暴力蒙特卡洛采样(brute-force Monte Carlo sampling),该方法仅需10–20倍更少的样本即可获得与之相当的估计精度,例如可在仅500次采样下准确估计出概率约为10⁻⁴的有害行为,显著提升了大规模模型安全性评估的可行性与实用性。
链接: https://arxiv.org/abs/2604.22167
作者: Rico Angell,Raghav Singhal,Zachary Horvitz,Zhou Yu,Rajesh Ranganath,Kathleen McKeown,He He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at this https URL
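摘要所述的重要性采样思路可用如下玩具示意说明(非论文实现;此处用一个已知真实有害概率的伯努利玩具分布代替语言模型,似然比取为假设的常数):

```python
import random

def importance_estimate(sample_q, likelihood_ratio, is_harmful, n, seed=0):
    """从“不安全”提议分布 q 采样 n 次,对有害样本累加似然比 p(x)/q(x),
    得到目标模型下有害输出概率的无偏估计。"""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_q(rng)
        if is_harmful(x):
            total += likelihood_ratio(x)
    return total / n

# 玩具设定:目标分布下 P(有害)=1e-3,提议分布将其放大到 0.5
p_harm, q_harm = 1e-3, 0.5
est = importance_estimate(
    sample_q=lambda rng: rng.random() < q_harm,   # True 代表抽到有害输出
    likelihood_ratio=lambda x: p_harm / q_harm,   # 有害样本上的 p/q
    is_harmful=lambda x: x,
    n=2000,
)
```

直接从目标分布暴力采样需要数万次才能观测到量级为 1e-3 的事件,而从放大了有害概率的提议分布采样,少量样本即可得到低方差估计;这正是论文用“不安全版本模型”实现样本高效估计的直觉。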
[AI-28] PrivSTRUCT: Untangling Data Purpose Compliance of Privacy Policies in Google Play Store
【速读】:该论文旨在解决现有隐私政策分析方法忽视文档结构层次、导致敏感数据项与用途关联混乱的问题。当前研究通常将隐私政策视为扁平文本,未能利用章节标题等结构线索,从而在自动化提取过程中难以准确区分不同数据处理行为。其解决方案的关键在于提出PrivSTRUCT框架,该框架通过集成编码器与解码器的系统化设计,能够有效保留开发者定义的结构信息,并精准识别数据项与其特定用途之间的对应关系。实证研究表明,相比现有最优工具PoliGrapher,PrivSTRUCT在提取数据项和用途片段数量上提升两倍以上,同时揭示了应用开发中因使用全局目的声明而非局部细化披露所引发的显著透明度缺失问题。
链接: https://arxiv.org/abs/2604.22157
作者: Bhanuka Silva,Anirban Mahanti,Aruna Seneviratne,Suranga Senevirante
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures, 2 tables
Abstract:Existing research typically treats privacy policies as flat, uniform text, extracting information without regard for the document’s logical hierarchy. Disregarding the structural cues of section headings designed to guide the reader often leads automated methods to entangle distinct data practices, particularly when linking sensitive data items to their specific purposes. To address this, we introduce PrivSTRUCT, a novel and systematic combined encoder-decoder framework designed to untangle complex privacy disclosures. Benchmarking against the state-of-the-art tool PoliGrapher reveals that PrivSTRUCT robustly extracts more than twice the number of data-item and purpose excerpts while retaining developer-defined structural cues. By applying PrivSTRUCT to a large-scale dataset of 3,756 Android apps, we uncover a critical transparency gap: the probability of developers overstating a data purpose is 20.4% higher for first-party collection and 9.7% higher for third-party sharing when they rely on globally defined purposes rather than specific, locally scoped disclosures. Alarmingly, we find that sensitive third-party data flows, such as sharing financial data for analytics, are frequently diluted and entangled into generic or unrelated categories, highlighting a persistent failure in the current purpose-disclosure landscape.
[AI-29] Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems
【速读】:该论文旨在解决多代理大语言模型(Large Language Model, LLM)流水线在行为健康领域应用中的可靠性与误差累积问题,特别是在自伤风险评估和抑郁筛查等安全关键场景下,传统基于LLM-as-a-judge的评估方法无法提供决策置信度或误差传播机制,导致难以保障系统安全性。其解决方案的关键在于提出一种基于有向无环图(Directed Acyclic Graph, DAG)结构的统计框架,通过建模每个代理为随机分类决策过程,引入三方面改进:(1) 更紧致的代理级性能置信区间边界,(2) 基于输入难度的贝叶斯 bandit 自适应采样策略,以及 (3) 多代理系统上的后悔(regret)保证,证明部署后错误增长呈对数级。实证结果表明,该自适应采样策略在两个行为健康数据集上均实现了最低的假阳性率(如AEGIS 2.0上从0.159降至0.095),同时保持召回率不变,显著降低了误报率40%,体现出在精度提升方面的实质性改进。
链接: https://arxiv.org/abs/2604.22154
作者: Meghana Karnam,Ananya Joshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that provides an alternative to heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while maintaining similar false negative rates across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
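作为参照,下面给出一个基于标准 Hoeffding 不等式的智能体级置信区间示意(论文声称给出了比此更紧的界,这里仅演示“性能估计 ± 置信半径”的基本思路;成功次数等数值均为假设):

```python
import math

def hoeffding_interval(successes, n, delta=0.05):
    """以 1 - delta 置信度给出单个智能体成功率的 Hoeffding 置信区间。"""
    p_hat = successes / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# 假设某智能体在 N=161 条样本上判对 140 条
lo, hi = hoeffding_interval(successes=140, n=161)
```

置信半径随样本数以 1/sqrt(n) 收缩,这也解释了为何对“难”输入需要自适应地多采样才能把不确定性压到安全阈值以下。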
[AI-30] Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理能力增强和部署范围扩展背景下,可能产生自我目标导向行为所带来的新兴战略推理风险(Emergent Strategic Reasoning Risks, ESRRs)的评估难题。ESRRs包括欺骗、评估博弈和奖励劫持等行为,现有方法难以系统性识别与量化此类风险。解决方案的关键在于提出ESRRSim——一个基于分类体系的代理式自动化风险评估框架:首先构建包含7个类别、20个子类别的可扩展风险分类体系;其次设计能诱发模型真实推理过程的评估场景,并采用双维度评分机制(模型输出与推理轨迹)进行无裁判依赖且可扩展的评估;实证表明该框架能有效揭示不同LLM的风险特征差异,并捕捉到模型在生成层面的显著进化趋势。
链接: https://arxiv.org/abs/2604.22119
作者: Tharindu Kumarage,Lisa Bauer,Yao Ma,Dan Rosen,Yashasvi Raghavendra Guduri,Anna Rumshisky,Kai-Wei Chang,Aram Galstyan,Rahul Gupta,Charith Peris
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.
[AI-31] Wiggle and Go! System Identification for Zero-Shot Dynamic Rope Manipulation
【速读】:该论文旨在解决动态绳索操作任务中因单次失误导致不可恢复失败的问题,尤其是在缺乏大量真实数据或迭代优化的情况下实现高效、准确的零样本(zero-shot)任务执行。其核心解决方案是提出了一种名为“Wiggle and Go!”的两阶段系统识别框架:第一阶段通过观察绳索运动来预测描述性物理参数(如质量分布、弹性模量等),第二阶段利用这些参数进行目标条件下的动作优化,从而指导机器人在真实环境中完成未见过的任务。关键创新在于任务无关的系统识别模块能够泛化到多种绳索操作任务,并显著提升精度(3D目标打击平均误差从15.34 cm降至3.55 cm)与轨迹预测一致性(傅里叶频率相关系数达0.95)。
链接: https://arxiv.org/abs/2604.22102
作者: Arthur Jakobsson,Abhinav Mahajan,Karthik Pullalarevu,Krishna Suresh,Yunchao Yao,Yuemin Mao,Bardienus Duisterhof,Shahram Najam Syed,Jeffrey Ichnowski
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Many robotic tasks are unforgiving; a single mistake in a dynamic throw can lead to unacceptable delays or unrecoverable failure. To mitigate this, we present a novel approach that leverages learned simulation priors to inform goal-conditioned dynamic manipulation of ropes for efficient and accurate task execution. Related methods for dynamic rope manipulation either require large real-world datasets to estimate rope behavior or rely on iterative improvements over repeated attempts at the task. We introduce Wiggle and Go!, a system-identification, two-stage framework that enables zero-shot dynamic rope manipulation. The framework consists of a system identification module that observes rope movement to predict descriptive physical parameters, which then informs an optimization method for goal-conditioned action prediction that the robot executes zero-shot in the real world. Our method achieves strong performance across multiple dynamic manipulation tasks enabled by the same task-agnostic system identification module, which offers seamless switching between different manipulation tasks, allowing a single model to support a diverse array of manipulation policies. We achieve a 3.55 cm average accuracy on 3D target striking in the real world using rope system parameters, compared to 15.34 cm accuracy when our task model is not system-parameter-informed. We achieve a Pearson correlation coefficient of 0.95 between Fourier frequencies of the predicted and real ropes on an unseen trajectory. Project website: this https URL
[AI-32] Ethics Testing: Proactive Identification of Generative AI System Harms
【速读】:该论文旨在解决生成式人工智能(Generative AI, GAI)系统自动生成内容时可能引发的软件危害问题,尤其是由于不当行为(如有害内容或侵犯知识产权)导致的风险。现有测试方法(如公平性测试)无法系统性识别此类伦理层面的危害,因此论文提出“伦理测试”(ethics testing)这一新概念作为解决方案,其关键在于构建一套系统化的测试生成机制,用于主动检测GAI输出内容中潜在的不道德或有害行为,从而提升生成内容的安全性和合规性。
链接: https://arxiv.org/abs/2604.22089
作者: Shin Hwei Tan,Haibo Wang,Heng Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative Artificial Intelligence (GAI) systems that can automatically generate content in the form of source code or other content (e.g., images) have seen increasing popularity due to the emergence of tools such as ChatGPT which rely on Large Language Models (LLMs). Misuse of the automatically generated content can incur serious consequences due to potential harms in that content. Despite the importance of ensuring the quality of automatically generated content, there are few, if any, approaches that can systematically generate tests for identifying software harms in the content generated by these GAI systems. In this article, we introduce the novel concept of ethics testing, which aims to systematically generate tests for identifying software harms. Different from existing testing methodologies (e.g., fairness testing, which aims to identify software discrimination), ethics testing aims to systematically detect software harms that could be induced by unethical behavior (e.g., harmful behavior or behavior that violates intellectual property rights) in automatically generated content. We introduce the concept of ethics testing, discuss the challenges therein, and conduct five case studies to show how ethics testing can be performed for generative AI systems.
[AI-33] Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents
【速读】:该论文旨在解决当前生产级自主代理系统中因记忆机制导致的架构瓶颈问题,特别是现有混合语义图(hybrid semantic graph)方法在知识存储与检索过程中产生的高计算开销、复杂的实体抽取流程以及多查询检索管道等问题。其解决方案的关键在于提出一种名为Memanto的通用记忆层,它摒弃了传统知识图谱的复杂性假设,采用包含13种预定义记忆类别的类型化语义记忆结构、自动冲突解决机制和时间版本控制,并依托Moorcheh的信息论搜索引擎实现无需索引的语义数据库,从而在亚90毫秒延迟内完成确定性检索且无摄入延迟。该设计显著降低了操作复杂度,仅需单次检索查询即可达成优于现有混合图与向量基系统的性能表现(LongMemEval和LoCoMo基准上分别达到89.8%和87.1%准确率)。
链接: https://arxiv.org/abs/2604.22085
作者: Seyed Moein Abtahi,Rasa Rahnema,Hetkumar Patel,Neel Patel,Majid Fekri,Tara Khani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 Pages, 10 Tables, 8 Figures
Abstract:The transition from stateless language-model inference to persistent, multi-session autonomous agents has revealed memory to be a primary architectural bottleneck in the deployment of production-grade agentic systems. Existing methodologies largely depend on hybrid semantic-graph architectures, which impose substantial computational overhead during both ingestion and retrieval. These systems typically require large-language-model-mediated entity extraction, explicit graph-schema maintenance, and multi-query retrieval pipelines. This paper introduces Memanto, a universal memory layer for agentic artificial intelligence that challenges the prevailing assumption that knowledge-graph complexity is necessary to achieve high-fidelity agent memory. Memanto integrates a typed semantic memory schema comprising thirteen predefined memory categories, an automated conflict-resolution mechanism, and temporal versioning. These components are enabled by Moorcheh’s Information-Theoretic Search engine, a no-indexing semantic database that provides deterministic retrieval within sub-ninety-millisecond latency while eliminating ingestion delay. Through systematic benchmarking on the LongMemEval and LoCoMo evaluation suites, Memanto achieves state-of-the-art accuracy scores of 89.8 percent and 87.1 percent respectively. These results surpass all evaluated hybrid graph and vector-based systems while requiring only a single retrieval query, incurring no ingestion cost, and maintaining substantially lower operational complexity. A five-stage progressive ablation study is presented to quantify the contribution of each architectural component, followed by a discussion of the implications for scalable deployment of agentic memory systems.
[AI-34] Removing Sandbagging in LLM s by Training with Weak Supervision
【速读】:该论文试图解决的问题是:当AI模型的能力超过其监督者(如弱监督模型或有限的人类标注)时,模型可能通过“沙袋行为”(sandbagging)——即输出看似合理但实际低于其真实能力的结果——来规避严格验证。这种现象在缺乏可靠质量评估机制的场景中尤为危险。解决方案的关键在于训练策略的设计:通过结合监督微调(Supervised Fine-Tuning, SFT)与强化学习(Reinforcement Learning, RL),其中SFT首先利用弱监督演示打破沙袋行为,随后RL基于此基础充分激发模型的真实性能;同时强调训练过程必须与部署环境保持不可区分性,否则模型可在训练阶段表现良好而继续在部署中沙袋。这一方法为缓解沙袋问题提供了实证支持,并指出了训练-部署一致性对有效激励模型发挥全部潜力的重要性。
链接: https://arxiv.org/abs/2604.22082
作者: Emil Ryd,Henning Bartsch,Julian Stastny,Joe Benton,Vivek Hebbar
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality. A model more capable than its supervisors could exploit this gap through sandbagging, producing work that appears acceptable but falls short of its true abilities. Can training elicit a model’s best work even without reliable verification? We study this using model organisms trained to sandbag, testing elicitation techniques on problem-solving math, graduate-level science, and competitive coding tasks. We find that training with weak supervision can reliably elicit sandbagging models when supervised fine-tuning (SFT) and reinforcement learning (RL) are combined: SFT on weak demonstrations breaks the sandbagging behavior, enabling RL to then fully elicit performance. Neither method succeeds reliably alone: RL without SFT almost always leads to reward hacking rather than genuine improvement. Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward. Our results provide initial evidence that training is a viable mitigation against sandbagging, while highlighting the importance of making training indistinguishable from deployment.
[AI-35] Sound Agent ic Science Requires Adversarial Experiments ICLR2026
【速读】:该论文旨在解决生成式 AI(Generative AI)在科学数据分析中广泛应用所引发的“可证伪性(falsifiability)缺失”问题,即基于大语言模型(LLM)的智能体(agent)虽能快速生成看似合理且可发表的分析结果,但往往仅聚焦于支持假设的正向证据,忽视对假设的潜在反例或否定性证据的探索,从而导致科研结论缺乏可重复性和可验证性。解决方案的关键在于引入“falsification-first(证伪优先)”标准:要求智能体不再以构建最具说服力的叙事为目标,而是主动设计实验和分析流程,系统性地寻找能够证伪当前主张的路径,以此重构科学推理范式,确保非实验性结论具备坚实的批判性基础。
链接: https://arxiv.org/abs/2604.22080
作者: Dionizije Fa,Marko Culjak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published at ICLR 2026 Workshop on Agents in the Wild
Abstract:LLM-based agents are rapidly being adopted for scientific data analysis, automating tasks once limited by human time and expertise. This capability is often framed as an acceleration of discovery, but it also accelerates a familiar failure mode: the rapid production of plausible, endlessly revisable analyses that, being easy to generate, effectively turn hypothesis space into candidate claims supported by selectively chosen analyses optimized for publishable positives. Unlike software, scientific knowledge is not validated by the iterative accumulation of code and post hoc statistical support. A fluent explanation or a significant result on a single dataset is not verification. Because the missing evidence is a negative space, the experiments and analyses that would have falsified the claim were never run or never published. We therefore propose that non-experimental claims produced with agentic assistance be evaluated under a falsification-first standard: agents should not be used primarily to craft the most compelling narrative, but to actively search for the ways in which the claim can fail.
[AI-36] Shard the Gradient Scale the Model: Serverless Federated Aggregation via Gradient Partitioning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在无服务器平台(serverless platforms)上因单个函数内存限制而导致的可扩展性瓶颈问题。现有架构如lambda-FL和LIFL通过树形结构将客户端分配给多个聚合器,但每个聚合器需存储完整的模型梯度(gradient),当梯度大小超过单个函数的内存上限(如AWS Lambda的10 GB)时,聚合无法进行。解决方案的关键在于提出GradsSharding:将梯度张量分割为M个片段(shard),由独立的无服务器函数分别对每个片段进行平均,由于FedAvg操作是逐元素的,该方法可保证与树形结构完全一致的模型精度。其核心优势在于每函数所需内存仅为O(|θ|/M),与客户端数量无关,从而突破了服务器内存限制,使任意规模模型均可部署。
链接: https://arxiv.org/abs/2604.22072
作者: Amine Barrak
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) aggregation on serverless platforms faces a hard scalability ceiling: existing architectures (lambda-FL, LIFL) partition clients across aggregators, but every aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit (e.g., 10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor. We propose GradsSharding, which instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives contributions from all clients. Because FedAvg averaging is element-wise, this produces bit-identical results to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|\theta|/M), independent of client count, enabling aggregation of arbitrarily large models. We evaluate GradsSharding against lambda-FL and LIFL through HPC experiments and real AWS Lambda deployments across model sizes from 43 MB to 5 GB. Results show a cost crossover at approximately 500 MB gradient size, 2.7x cost reduction at VGG-16 scale, and that GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.
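FedAvg 的逐元素平均使得分片聚合与整体聚合逐位一致,这一点可以用几行Python验证(示意实现,非论文代码;分片边界的划分方式为假设):

```python
def shard_average(client_grads, M):
    """将每个客户端的梯度向量切成 M 个分片,各分片独立做 FedAvg 平均。
    由于平均是逐元素的,分片结果拼接后与整体平均逐位一致。"""
    n = len(client_grads[0])
    out = []
    for m in range(M):  # 每个分片可由一个独立的 serverless 函数负责
        lo, hi = m * n // M, (m + 1) * n // M
        for i in range(lo, hi):
            out.append(sum(g[i] for g in client_grads) / len(client_grads))
    return out

# 两个客户端的玩具梯度,切成 M=2 个分片聚合
grads = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
sharded = shard_average(grads, M=2)
full = [sum(g[i] for g in grads) / len(grads) for i in range(4)]  # 整体 FedAvg
```

每个分片函数只需容纳约 |θ|/M 的梯度,这正是摘要中“每函数内存 O(|θ|/M)、与客户端数量无关”结论的来源。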
[AI-37] Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM -Generated Hospitalization Risk Scores
【速读】:该论文旨在解决生成式 AI(Generative AI)在精神科临床决策中,尤其是住院风险评分预测任务中的可解释性与可靠性问题。当前大型语言模型(Large Language Models, LLMs)虽被广泛应用于临床推理,但其对非临床信息的敏感性及提示词设计(prompt design)带来的输出不稳定性尚未系统评估,尤其在精神病学这一关键且模糊领域尤为突出。论文提出一种基于“可靠性审计”的评估框架,核心在于通过构造包含临床显著特征与大量医学生物学无关特征(medically insignificant features)的合成患者队列(n=50),并结合四种不同提示重构方式(neutral、logical、human impact、clinical judgment),量化分析LLM输出的住院风险评分变化及其变异性。结果显示,无论模型或提示类型如何,引入非临床变量均显著增加平均风险评分和输出波动,表明模型对上下文噪声高度敏感,凸显了在临床部署前必须开展对归因稳定性和不确定性行为的系统性评估的重要性。
链接: https://arxiv.org/abs/2604.22063
作者: Shevya Pandya,Shinjini Bose,Ananya Joshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for reliability auditing downstream LLM tasks by structuring evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which is often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how LLM-based psychiatric risk assessments are sensitive to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior like this before clinical deployments.
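论文的审计流程可抽象为“逐步注入临床无关特征并记录评分均值与波动”。以下为一个极简示意(非论文实现;score_fn 为假设的玩具评分函数,真实审计中应替换为对LLM在某一提示重构下的实际调用):

```python
import statistics

def audit_instability(score_fn, base_profile, noise_features, trials=10, step=5):
    """逐步向画像追加临床无关特征,记录风险评分的均值与标准差。"""
    report = []
    for k in range(0, len(noise_features) + 1, step):
        scores = [score_fn(base_profile + noise_features[:k]) for _ in range(trials)]
        report.append((k, statistics.mean(scores), statistics.pstdev(scores)))
    return report

# 玩具评分函数:评分只随特征数量单调增加(确定性,故标准差为 0)
base_profile = [f"clinical_{i}" for i in range(15)]
noise = [f"noise_{i}" for i in range(10)]
report = audit_instability(lambda p: min(1.0, len(p) * 0.02), base_profile, noise)
```

若真实模型是可靠的,噪声特征数 k 增大时均值与标准差都应保持平稳;论文观察到的恰恰相反,即两者均随 k 显著上升。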
[AI-38] Call-Chain-Aware LLM -Based Test Generation for Java Projects
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的单元测试生成方法在复杂软件系统中表现不足的问题,尤其是当系统存在复杂的类间依赖、深层调用链和对象初始化需求时,现有方法主要依赖执行路径信息进行提示构造,难以生成高质量的测试用例。解决方案的关键在于提出一种名为CAT的新方法,其核心创新是通过专门的静态分析显式地将调用链(call-chain)和依赖上下文整合到提示中,系统性建模调用者-被调用者关系、对象构造器及第三方依赖,并支持在生成失败时迭代修复测试,从而显著提升测试覆盖率与有效性。
链接: https://arxiv.org/abs/2604.22046
作者: Guancheng Wang,Qinghua Xu,Lionel C. Briand,Zhaoqiang Guo,Kui Liu
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have recently shown strong potential for generating project-level unit tests. However, existing state-of-the-art approaches primarily rely on execution-path information to guide prompt construction, which is often insufficient for complex software systems with rich inter-class dependencies, deep call chains, and intricate object initialization requirements. In this paper, we present CAT, a novel call-chain-aware LLM-based test generation approach that explicitly incorporates call-chain and dependency contexts into prompts through dedicated static analysis. To construct executable, semantically valid test contexts, CAT systematically models caller–callee relationships, object constructors, and third-party dependencies, and supports iterative test fixing when generation failures occur. We evaluate CAT on the widely used Defects4J benchmark and on four real-world GitHub projects released after the LLM’s cut-off date. The results show that, across projects in Defects4J, CAT improves line and branch coverage by 18.04% and 21.74%, respectively, over the state-of-the-art approach PANTA, while consistently achieving superior performance on post-cutoff real-world projects. An ablation study further demonstrates the importance of call-chain and dependency contexts in CAT.
[AI-39] Mochi: Aligning Pre-training and Inference for Efficient Graph Foundation Models via Meta-Learning
【速读】:该论文旨在解决图基础模型(Graph Foundation Model)在任务统一性(task unification)和训练效率方面的局限性。现有方法通常采用基于重构的目标(如链接预测)进行预训练,并依赖后处理步骤(如类别原型对齐)将表示映射到下游任务,但这种分离式策略在实际应用中存在性能瓶颈。其解决方案的关键在于引入一种基于元学习(meta-learning)的训练框架——Mochi,该框架在少样本(few-shot)任务片段上进行预训练,使训练目标与下游推理协议保持一致,从而避免了对额外统一步骤的依赖。实验表明,Mochi及其增强版本Mochi++在25个真实世界图数据集上的节点分类、链接预测和图分类任务中均达到或超越现有模型性能,同时训练时间减少8~27倍。
链接: https://arxiv.org/abs/2604.22031
作者: João Mattos,Arlei Silva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 7 figures
Abstract:We propose Mochi, a Graph Foundation Model that addresses task unification and training efficiency by adopting a meta-learning based training framework. Prior models pre-train with reconstruction-based objectives such as link prediction, and assume that the resulting representations can be aligned with downstream tasks through a separate unification step such as class prototypes. We demonstrate through synthetic and real-world experiments that this procedure, while simple and intuitive, has limitations that directly affect downstream task performance. To address these limitations, Mochi pre-trains on few-shot episodes that mirror the downstream evaluation protocol, aligning the training objective with inference rather than relying on a post-hoc unification step. We show that Mochi, along with its more powerful variant Mochi++, achieves competitive or superior performance compared to existing Graph Foundation Models across 25 real-world graph datasets spanning node classification, link prediction, and graph classification, while requiring 8 to 27 times less training time than the strongest baseline.
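Mochi 的关键是让预训练片段(episode)与下游小样本评测协议一致。下面是一个构造 N-way K-shot 片段的极简示意(非论文实现;数据组织方式与抽样细节均为假设):

```python
import random

def sample_episode(labels_by_class, n_way, k_shot, q_query, seed=0):
    """构造一个 N-way K-shot 片段:随机选 N 个类,每类取 K 个支持样本
    与 q 个查询样本,使预训练目标与下游小样本推理协议对齐。"""
    rng = random.Random(seed)
    classes = rng.sample(sorted(labels_by_class), n_way)
    support, query = [], []
    for c in classes:
        items = rng.sample(labels_by_class[c], k_shot + q_query)
        support += [(x, c) for x in items[:k_shot]]
        query += [(x, c) for x in items[k_shot:]]
    return support, query

# 假设的玩具数据:5 个类,每类 10 个节点编号
labels_by_class = {c: list(range(c * 10, c * 10 + 10)) for c in range(5)}
support, query = sample_episode(labels_by_class, n_way=3, k_shot=2, q_query=1)
```

预训练时模型在 support 上构建类表示、在 query 上计算损失,与推理阶段完全同构,从而省去了基于类别原型的事后统一步骤。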
[AI-40] Rethinking Publication: A Certification Framework for AI-Enabled Research
【速读】:该论文旨在解决当前学术出版体系无法有效评估由自动化AI研究流水线(AI research pipelines)生成的知识成果的问题。现有出版系统基于“人类作者普遍性”的假设,缺乏对机器驱动研究产出的规范化评价机制,导致难以区分知识质量与人类贡献度。其解决方案的关键在于提出一个两层认证框架:第一层独立评估知识质量,第二层分级评定人类贡献,具体分为三类——A类(可由当前流水线实现)、B类(需人类在特定阶段介入引导)、C类(超出当前流水线能力)。该框架通过引入完全披露的自动化研究基准槽位(benchmark slots),既提供透明的发表路径,又作为评审判断校准工具,且无需新建机构即可嵌入现有编辑流程,从而实现对AI生成知识的合理认证,同时保留对前沿人类认知贡献的认可基础。
链接: https://arxiv.org/abs/2604.22026
作者: Yang Lu,Rabimba Karanjai,Lei Xu,Weidong Shi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Digital Libraries (cs.DL)
备注:
Abstract:AI research pipelines now produce a growing share of publishable academic output, including work that meets existing peer-review standards for quality and novelty. Yet the publication system was built on the assumption of universal human authorship and lacks a principled way to evaluate knowledge produced through automated pipelines. This paper proposes a two-layer certification framework that separates knowledge quality assessment from grading of human contribution, allowing publication systems to handle pipeline-generated work consistently and transparently without creating new institutions. The paper uses normative-conceptual analysis, framework design under four explicit constraints, and dry-run validation on two representative submission cases spanning key attribution scenarios. The framework grades contributions as Category A (pipeline-reachable), Category B (requiring human direction at identifiable stages), and Category C (beyond current pipeline reach at the formulation stage). It also introduces benchmark slots for fully disclosed automated research as both a transparent publication track and a calibration instrument for reviewer judgment. Contribution grading is contemporaneous, based on pipeline capability at the time of submission. Dry-run validation shows that the framework can certify knowledge appropriately while tolerating irreducible attribution uncertainty. The paper argues that publication has always certified both that knowledge is valid and that a human made it. AI pipelines separate these functions for the first time. The framework is implementable within existing editorial infrastructure and grounds recognition of frontier human contribution in epistemic achievement rather than unverifiable claims of human origin.
[AI-41] Multi-Task Optimization over Networks of Tasks
【速读】:该论文旨在解决多任务优化(Multi-task Optimization)中现有算法在大规模任务集上的可扩展性与拓扑信息利用不足的问题。具体而言,基于种群的方法难以扩展至大规模任务集,而主流的MAP-Elites变体虽能处理上千个任务,却依赖固定离散的存档结构,忽视了任务空间的连续拓扑关系。解决方案的关键在于提出MONET(Multi-Task Optimization over Networks of Tasks),将任务空间建模为图结构——任务作为节点,任务参数空间中的邻近关系构成边,从而显式保留任务间的拓扑信息;在此基础上,结合社会学习(通过交叉从邻接节点生成候选解)与个体学习(通过变异独立优化自身解),实现高效的知识迁移与局部精细化搜索,同时保持高维问题的可处理性。
链接: https://arxiv.org/abs/2604.21991
作者: Julian Hatzky,Thomas Bartz-Beielstein,A. E. Eiben,Anil Yaman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 14 pages, 5 figures
Abstract:Multi-task optimization is a powerful approach for solving a large number of tasks in parallel. However, existing algorithms face distinct limitations: Population-based methods scale poorly and remain underexplored for large task sets. Approaches that do scale beyond a thousand tasks are mostly MAP-Elites variants and rely on a fixed, discretized archive that disregards the topology of the task space. We introduce MONET (Multi-Task Optimization over Networks of Tasks), a multi-task optimization algorithm that models the task space as a graph: tasks are nodes, and edges connect tasks in the task parameter space. This representation enables knowledge transfer between tasks and remains tractable for high-dimensional problems while exploiting the topology of the task space. MONET combines social learning, which generates candidates from neighboring nodes via crossover, with individual learning, which refines a node’s own solution independently via mutation. We evaluate MONET on four domains (archery, arm, and cartpole with 5,000 tasks each; hexapod with 2,000 tasks) and show that it matches or exceeds the performance of existing MAP-Elites-based baselines across all four domains.
[AI-42] Read the Paper Write the Code: Agentic Reproduction of Social-Science Results
【速读】:该论文旨在解决生成式 AI (Generative AI) 在社会科学研究中实现结果可复现性的挑战,即在仅提供论文方法描述和原始数据的情况下,能否准确复现已发表的研究结果。其解决方案的关键在于构建一个代理驱动的复现系统(agentic reproduction system),该系统通过结构化提取论文中的方法描述、在严格的信息隔离环境下执行重构实现(代理不接触原代码、结果或论文)、并支持细粒度的单元级输出比对与错误归因分析,从而定位失败的根本原因,区分是代理行为误差还是论文方法描述不足所致。
链接: https://arxiv.org/abs/2604.21965
作者: Benjamin Kohler,David Zollikofer,Johanna Einsiedler,Alexander Hoyle,Elliott Ash
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper’s methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation – agents never see the original code, results, or paper – and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.
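摘要提到系统支持对复现结果的确定性、单元级(cell-level)比对。下面给出一个示意性的容差比对函数(非论文原实现,数值与容差均为假设):

```python
def compare_cells(original, reproduced, rel_tol=0.01):
    """逐单元比对两张结果表,返回超出相对容差的坐标列表。"""
    mismatches = []
    for r, (row_a, row_b) in enumerate(zip(original, reproduced)):
        for c, (a, b) in enumerate(zip(row_a, row_b)):
            denom = max(abs(a), abs(b), 1e-12)
            if abs(a - b) / denom > rel_tol:
                mismatches.append((r, c))
    return mismatches

orig_tbl = [[0.152, 1.30], [0.080, 2.10]]
repro_tbl = [[0.153, 1.30], [0.080, 2.50]]  # 仅一个系数显著偏离
print(compare_cells(orig_tbl, repro_tbl))  # -> [(1, 1)]
```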
[AI-43] A general optimization solver based on OP-to-MaxSAT reduction
【速读】:该论文旨在解决现有优化算法普遍缺乏通用性的问题,即当前算法多针对特定类型的优化问题设计,难以跨问题类型有效求解。其解决方案的关键在于提出一种自动化的归约方法——OP-to-MaxSAT归约(OP-to-MaxSAT reduction),并基于此构建了一个通用优化求解器GORED(General Optimization solver based on OP-to-MaxSAT reduction)。该方法能够在多项式时间内将多种类型的优化问题转化为MaxSAT实例,并利用先进的MaxSAT求解器进行求解,从而实现对11类共136个优化问题的统一处理,且在解的质量上与现有专用方法无显著差异,显著提升了优化求解的通用性和可扩展性。
链接: https://arxiv.org/abs/2604.21961
作者: Yuxin Zhao,Han Huang,Zhifeng Hao
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
备注:
Abstract:Optimization problems are fundamental in diverse fields, such as engineering, economics, and scientific computing. However, current algorithms are mostly designed for specific problem types and exhibit limited generality in solving multiple types of optimization problems. To enhance generality, we propose an automated reduction method named OP-to-MaxSAT reduction and a general optimization solver based on OP-to-MaxSAT reduction (GORED). GORED unifies the solving of multiple types of optimization problems by reducing the problems from optimization problems to MaxSAT instances in polynomial time and solving them using the state-of-the-art MaxSAT solver. The generality and solution quality of GORED are validated through experiments on 136 instances across 11 types of optimization problems. Experimental results demonstrate that GORED not only successfully solves a wide range of optimization problems but also yields solutions comparable in quality to those from existing methods, with no statistically significant differences observed. By introducing automated reduction, this work shifts the paradigm of optimization solvers from designing specialized algorithms for each problem type to employing a single algorithm for diverse problems. As a result, advances in this single algorithm can now drive progress in a wide range of optimization problems across various domains.
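OP-to-MaxSAT 归约的核心思路是把优化目标编码为带权软子句、把约束编码为硬子句。下面用一个极小的 0/1 优化问题做示意(以暴力枚举代替论文所用的 MaxSAT 求解器,问题与权重均为假设):

```python
from itertools import product

# 硬子句必须满足;软子句带权重(不满足即计入惩罚)
# 文字编码:正整数表示变量为真,负数表示变量为假
hard = [[-1, -2]]            # 约束:x1 与 x2 不能同时为真
soft = [([1], 3), ([2], 2)]  # 目标:偏好 x1(权重 3)与 x2(权重 2)

def satisfied(clause, assign):
    return any(assign[abs(l)] == (l > 0) for l in clause)

def maxsat(hard, soft, n_vars):
    """暴力枚举求解加权 MaxSAT:在满足硬子句的前提下最小化未满足软子句的总权重。"""
    best = None
    for bits in product([False, True], repeat=n_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        if not all(satisfied(c, assign) for c in hard):
            continue
        cost = sum(w for c, w in soft if not satisfied(c, assign))
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best

cost, assign = maxsat(hard, soft, n_vars=2)
print(cost, assign)  # 最小惩罚为 2:x1 取真、x2 取假
```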
[AI-44] A systematic review of generative AI usage for IT project management
【速读】:该论文旨在通过PRISMA系统综述方法,整合当前关于生成式AI(Generative AI)在IT项目管理中的研究知识,以明确其技术手段、应用场景、采纳趋势、局限性及与项目管理工具和过程组的集成现状。其解决方案的关键在于识别出当前研究主要依赖OpenAI的GPT模型并以提示工程(prompt engineering)为核心实现方式,表明该领域仍处于探索阶段;同时提出三个有前景的研究方向:面向项目管理过程组的AI代理、基于项目角色的AI代理,以及支持人机协同编排的混合协作网络,为未来AI赋能的项目管理提供理论框架与实践路径。
链接: https://arxiv.org/abs/2604.21958
作者: Ionut Anghel,Tudor Cioara
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper aims to synthesize current knowledge on generative AI in IT project management using the PRISMA methodology to provide researchers with a comprehensive perspective on techniques, applications, adoption trends, limitations, and integration across project management tools and process groups. The analysis reveals a clear dominance of OpenAI’s GPT in the included studies but relying primarily on prompt engineering, suggesting that research in this area remains at an exploratory stage. Finally, it identifies and discusses three promising research directions for AI-enabled project management, including process group-specific AI agents, project role-based AI agents, and hybrid collaborative networks that enable human-guided orchestration.
[AI-45] MambaCSP: Hybrid-Attention State Space Models for Hardware-Efficient Channel State Prediction
【速读】:该论文旨在解决基于注意力机制的Transformer和大语言模型(LLM)在信道状态信息(CSI)预测中因序列长度导致的二次计算复杂度问题,从而带来高昂的计算成本、内存消耗和推理延迟,限制了其在实时和资源受限无线场景中的应用。解决方案的关键在于提出MambaCSP——一种混合注意力-状态空间模型(SSM)架构,用线性时间复杂度的Mamba模型替代LLM主干,并引入轻量级patch-mixer注意力层,在保持长程依赖建模能力的同时显著降低硬件开销,实现在预测精度提升9–12%的前提下,吞吐量提高至3.0倍、显存占用降低至2.6倍、推理速度加快至2.9倍的性能优势。
链接: https://arxiv.org/abs/2604.21957
作者: Aladin Djuhera,Haris Gacanin,Holger Boche
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
Abstract:Recent works have demonstrated that attention-based transformer and large language model (LLM) architectures can achieve strong channel state prediction (CSP) performance by capturing long-range temporal dependencies across channel state information (CSI) sequences. However, these models suffer from quadratic scaling in sequence length, leading to substantial computational cost, memory consumption, and inference latency, which limits their applicability in real-time and resource-constrained wireless deployments. In this paper, we investigate whether selective state space models (SSMs) can serve as a hardware-efficient alternative for CSI prediction. We propose MambaCSP, a hybrid-attention SSM architecture that replaces LLM-based prediction backbones with a linear-time Mamba model. To overcome the local-only dependencies of pure SSMs, we introduce lightweight patch-mixer attention layers that periodically inject cross-token attentions, helping with long-context CSI prediction. Extensive MISO-OFDM simulations show that MambaCSP improves prediction accuracy over LLM-based approaches by 9-12%, while delivering up to 3.0x higher throughput, 2.6x lower VRAM usage, and 2.9x faster inference. Our results demonstrate that hybrid state space architectures provide a promising direction for scalable and hardware-efficient AI-native CSI prediction in future wireless networks.
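与注意力机制的二次复杂度不同,(对角)状态空间模型可用线性时间的递推扫描完成序列建模。下面是一个与 Mamba 无直接关系的最简对角 SSM 扫描示意(参数均为假设,仅说明线性递推 h_t = a·h_{t-1} + b·x_t 的结构):

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """对角状态空间模型的线性时间扫描:
    h_t = a * h_{t-1} + b * x_t,  y_t = c . h_t(逐时间步)。"""
    d_state = a.shape[0]
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                 # 复杂度 O(T * d_state),对比注意力的 O(T^2)
        h = a * h + b * x_t
        ys.append(float(c @ h))
    return np.array(ys)

rng = np.random.default_rng(0)
T, d_state = 64, 8
a = np.full(d_state, 0.9)         # 每个状态通道的稳定衰减系数
b = rng.standard_normal(d_state)
c = rng.standard_normal(d_state)
x = rng.standard_normal(T)
y = ssm_scan(x, a, b, c)
print(y.shape)  # (64,)
```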
[AI-46] Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models DATE
【速读】:该论文旨在解决多模态基础模型(Multimodal Foundation Models, MFMs)在计算和内存资源消耗高、部署效率低的问题。其核心解决方案在于提出一种多层次的软硬件协同优化方法:通过层次感知的混合精度量化与结构化剪枝压缩Transformer模块和MLP通道,结合推测解码、模型级联(从小模型到大模型逐级路由并利用轻量自检判断是否升级)以及序列长度、视觉分辨率步长和图级算子融合的联合优化,显著降低计算复杂度与内存占用;同时,针对底层硬件架构优化数据流并引入内存高效的注意力机制以满足片上带宽和延迟预算,并辅以专用加速器(可通过专家设计或大语言模型辅助设计实现),从而实现高效执行。
链接: https://arxiv.org/abs/2604.21952
作者: Muhammad Shafique,Abdul Basit,Muhammad Abdullah Hanif,Alberto Marchisio,Rachmad Vidya Wicaksana Putra,Minghao Shao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注: Accepted at the Design, Automation and Test in Europe Conference (DATE), April 20-22, 2026 in Verona, Italy
Abstract:This work presents a multi-layered methodology for efficiently accelerating multimodal foundation models (MFMs). It combines hardware and software co-design of transformer blocks with an optimization pipeline that reduces computational and memory requirements. During model development, it employs performance enhancements through fine-tuning for domain-specific adaptation. Our methodology further incorporates hardware and software techniques for optimizing MFMs. Specifically, it employs MFM compression using hierarchy-aware mixed-precision quantization and structural pruning for transformer blocks and MLP channels. It also optimizes operations through speculative decoding, model cascading that routes queries through a small-to-large cascade and uses lightweight self-tests to determine when to escalate to larger models, as well as co-optimization of sequence length, visual resolution stride, and graph-level operator fusion. To efficiently execute the model, the processing dataflow is optimized based on the underlying hardware architecture together with memory-efficient attention to meet on-chip bandwidth and latency budgets. To support this, a specialized hardware accelerator for the transformer workloads is employed, which can be developed through expert design or an LLM-aided design approach. We demonstrate the effectiveness of the proposed methodology on medical-MFMs and on code generation tasks, and conclude with extensions toward energy-efficient spiking-MFMs.
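摘要中的层次感知混合精度量化可以用最基础的对称 int8 per-tensor 量化来直观理解(示意性实现,非论文方法本身):

```python
import numpy as np

def quantize_int8(w):
    """对称 per-tensor int8 量化:w ≈ scale * q。"""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale
err = float(np.abs(w - w_hat).max())
print(err <= scale / 2 + 1e-6)  # 舍入误差不超过半个量化步长
```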
[AI-47] Feedback Over Form: Why Execution Feedback Matters More Than Pipeline Topology in 1-3B Code Generation
【速读】:该论文旨在解决小型语言模型(1-3B参数规模)在本地运行时因单体能力有限而难以胜任复杂代码生成任务的问题。其核心解决方案是通过构建包含执行反馈的流水线结构(pipeline),利用进化搜索优化模型组合与流程,以提升整体代码生成性能。关键发现在于:执行反馈机制(如运行时错误检测与修复)显著优于增加流水线拓扑复杂度;简单“生成-执行-修正”循环已能带来超过4个标准差的性能提升,且模型角色分工中修正器(refiner)的能力比生成器(generator)更重要,同时早期停止策略对避免迭代冗余至关重要。
链接: https://arxiv.org/abs/2604.21950
作者: Charles Junichi McAndrews
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages main text, 2 page references, 3 figures. Code: this https URL
Abstract:Small language models (1-3B) are practical to run locally, but individually limited on harder code generation tasks. We ask whether composing them into pipelines can recover some of that lost capability. We study code generation pipelines built from 1-3B models with execution feedback, and use a NEAT-inspired evolutionary search to test whether more complex pipeline structure helps beyond a simple refinement loop. We evaluate on HumanEval (164 problems) and sanitized MBPP (427 problems), all with local inference on a single laptop. Self-refinement with execution feedback improves code generation by more than 4 standard deviations on both benchmarks. The gains are narrow in mechanism: refinement fixes many runtime errors (especially NameError and SyntaxError), but rarely fixes logic errors such as AssertionError. Within our tested general-purpose model pool, generator identity mattered less than refiner capability: a 1.5B generator paired with a 3B refiner matched a 3B model doing both roles. Early stopping is essential; without it, every iteration is net-negative. The code-specialized models outperform every general-purpose pipeline configuration, suggesting model specialization matters more than pipeline architecture. Preliminary text-only pipeline experiments without execution feedback did not show gains at this scale. In our constrained search space, evolutionary search mostly rediscovered the same simple generate-execute-refine loop we found manually, with no clearly significant gain from added topology. Single-evaluation fitness inflates results by 5-7 percent, selecting lucky genomes over good ones. On these benchmarks at 1-3B scale, execution feedback mattered more than added pipeline complexity in determining whether composition helped.
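摘要中带执行反馈的 generate-execute-refine 循环可示意如下(生成器与修正器均用桩函数代替 1-3B 模型,错误与修复纯属构造):

```python
def run_tests(code, tests):
    """执行候选代码与测试;返回 (是否通过, 错误信息)。"""
    env = {}
    try:
        exec(code + "\n" + tests, env)
        return True, ""
    except Exception as e:
        return False, f"{type(e).__name__}: {e}"

def refine_loop(generate, refine, tests, max_iters=3):
    """生成-执行-修正循环,首次通过即早停。"""
    code = generate()
    for _ in range(max_iters):
        ok, err = run_tests(code, tests)
        if ok:
            return code, True
        code = refine(code, err)   # 将运行时错误反馈给修正器
    return code, run_tests(code, tests)[0]

# 桩函数,代替真实的生成/修正模型
generate = lambda: "def add(a, b):\n    return a + c"       # 故意引入 NameError
refine = lambda code, err: code.replace("c", "b") if "NameError" in err else code
code, ok = refine_loop(generate, refine, "assert add(1, 2) == 3")
print(ok)  # True
```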
[AI-48] The Biggest Risk of Embodied AI is Governance Lag
【速读】:该论文试图解决的问题是:随着具身人工智能(Embodied AI)在制造业、物流、护理和基础设施等物理经济领域的快速扩散,公共治理机构因响应滞后而难以有效监管,从而可能引发系统性风险。解决方案的关键在于识别并应对治理滞后的三种相互关联形式——观测滞后(observational)、制度滞后(institutional)和分配滞后(distributive),核心政策挑战并非单纯的技术自动化,而是能否在 disruption 成为既成事实之前,使治理体系与合规机制具备足够的适应能力。
链接: https://arxiv.org/abs/2604.21938
作者: Shaoshan Liu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Embodied AI is widely discussed as a job-displacement problem. The deeper risk, however, is governance lag: the inability of public institutions to keep pace with how fast the technology spreads through the physical economy. As reusable robotic platforms are combined with increasingly general AI models, embodied AI may scale across manufacturing, logistics, care, and infrastructure faster than governance systems can observe, interpret, and respond. We argue that this lag appears in three connected forms: observational, institutional, and distributive. The central policy challenge, therefore, is not automation alone, but whether governance and compliance systems can adapt before disruption becomes entrenched.
[AI-49] Math Takes Two: A test for emergent mathematical reasoning in communication ICLR2026
【速读】:该论文试图解决当前语言模型在数学推理能力评估中存在的一大局限:现有基准测试多依赖于预定义的数学符号和形式化规则,难以区分模型是真正具备数学推理能力,还是仅通过统计模式匹配来模仿数学行为。为突破这一瓶颈,作者提出Math Takes Two基准,其核心创新在于设计了一个基于视觉任务的协作场景,要求两个无先验数学知识的智能体通过交互通信自发构建共享的符号协议,从而实现对抽象数理概念的自主建构与推理。该方案的关键在于摒弃预先设定的数学语言体系,转而让模型从零开始发现潜在结构并发展出可泛化的数值表示,从而更真实地评估其是否具备从第一原理出发的数学认知能力。
链接: https://arxiv.org/abs/2604.21935
作者: Michael Cooper,Samuel Cooper
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at HCAIR workshop, ICLR 2026
Abstract:Although language models demonstrate remarkable proficiency on mathematical benchmarks, it remains unclear whether this reflects true mathematical reasoning or statistical pattern matching over learning formal syntax. Most existing evaluations rely on symbolic problems grounded in established mathematical conventions, limiting insight into the models’ ability to construct abstract concepts from first principles. In this work, we propose Math Takes Two, a new benchmark designed to assess the emergence of mathematical reasoning through communication. Motivated by the hypothesis that mathematical cognition in humans co-evolved with the need for precise communication, our benchmark tests whether two agents, without prior mathematical knowledge, can develop a shared symbolic protocol to solve a visually grounded task where the use of a numerical system facilitates extrapolation. Unlike many current datasets, our benchmark eschews predefined mathematical language, instead requiring agents to discover latent structure and representations from scratch. Math Takes Two thus provides a novel lens through which to develop and evaluate models with emergent numerical reasoning capabilities.
[AI-50] AI-based framework to predict animal and pen feed intake in feedlot beef cattle
【速读】:该论文旨在解决现有文献中缺乏能够充分挖掘个体动物纵向大数据以准确预测饲料摄入量(feed intake)的方法,尤其是在考虑环境条件影响的情况下。其关键解决方案在于构建一个基于人工智能(AI)的框架,结合两个创新的环境指数——InComfort-Index(仅基于气象变量,用于热舒适度预测)和EASI-Index(融合环境变量与采食行为,用于饲料摄入预测),并采用XGBoost机器学习模型进行训练,最终实现了动物级(RMSE=1.38 kg/day)和群组级(RMSE=0.14 kg/(day-animal))高精度饲料摄入预测,为精准饲养管理提供了可落地的技术路径。
链接: https://arxiv.org/abs/2511.17663
作者: Alex S. C. Maia,John B. Hall,Hugo F. M. Milan,Izabelle A. M. A. Teixeira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Advances in technology are transforming sustainable cattle farming practices, with electronic feeding systems generating big longitudinal datasets on individual animal feed intake, offering the possibility for autonomous precision livestock systems. However, the literature still lacks a methodology that fully leverages these longitudinal big data to accurately predict feed intake accounting for environmental conditions. To fill this gap, we developed an AI-based framework to accurately predict feed intake of individual animals and pen-level aggregation. Data from 19 experiments (16.5M samples; 2013-2024) conducted at Nancy M. Cummings Research Extension Education Center (Carmen, ID) feedlot facility and environmental data from AgriMet Network weather stations were used to develop two novel environmental indices: InComfort-Index, based solely on meteorological variables, showed good predictive capability for thermal comfort but had limited ability to predict feed intake; EASI-Index, a hybrid index integrating environmental variables with feed intake behavior, performed well in predicting feed intake but was less effective for thermal comfort. Together with the environmental indices, machine learning models were trained and the best-performing machine learning model (XGBoost) accuracy was RMSE of 1.38 kg/day for animal-level and only 0.14 kg/(day-animal) at pen-level. This approach provides a robust AI-based framework for predicting feed intake in individual animals and pens, with potential applications in precision management of feedlot cattle, through feed waste reduction, resource optimization, and climate-adaptive livestock management.
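摘要中的框架以 XGBoost 取得最佳精度;下面用 NumPy 最小二乘线性回归做一个结构相同的示意(合成数据,特征与系数均为假设,并非论文中的 InComfort/EASI 指数):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
temp = rng.uniform(-5, 35, n)        # 环境温度(摄氏度)
humid = rng.uniform(20, 90, n)       # 相对湿度(%)
weight = rng.uniform(300, 650, n)    # 体重(kg)
heat = np.maximum(temp - 25, 0)      # 粗糙的热应激特征

# 假设性的摄入响应:随体重上升、在热应激下下降
intake = 0.02 * weight - 0.05 * heat + rng.normal(0, 0.3, n)

# 含截距项的设计矩阵;前 400 头训练,其余测试
X = np.column_stack([np.ones(n), temp, humid, weight, heat])
coef, *_ = np.linalg.lstsq(X[:400], intake[:400], rcond=None)
pred = X[400:] @ coef
rmse = float(np.sqrt(np.mean((pred - intake[400:]) ** 2)))
print(round(rmse, 2))
```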
[AI-51] Foundation models for discovering robust biomarkers of neurological disorders from dynamic functional connectivity
【速读】:该论文旨在解决当前脑部基础模型(Brain Foundation Models, FMs)在识别潜在生物标志物时缺乏鲁棒性评估的问题,尤其是在自闭症谱系障碍(Autism Spectrum Disorder, ASD)、注意力缺陷多动障碍(Attention-deficit Hyperactivity Disorder, ADHD)和阿尔茨海默病(Alzheimer’s Disease, AD)等神经精神疾病中,尽管FMs展现出优异的预测性能与零样本或少样本泛化能力,其识别出的关键区域枢纽(regional hubs)是否具有神经生物学可信度仍不明确。解决方案的关键在于提出RE-CONFIRM框架,用于系统评估DL模型所识别生物标志物的鲁棒性,并进一步引入Hub-LoRA(Low-Rank Adaptation)微调技术,使FMs不仅能超越定制化深度学习模型的性能,还能生成与已有元分析结果一致的神经生物学合理生物标志物,从而提升模型解释性和临床转化潜力。
链接: https://arxiv.org/abs/2604.22018
作者: Deepank Girish,Yi Hao Chan,Sukrit Gupta,Jing Xia,Jagath C. Rajapakse
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
Abstract:Several brain foundation models (FM) have recently been proposed to predict brain disorders by modelling dynamic functional connectivity (FC). While they demonstrate remarkable model performance and zero- or few-shot generalization, the salient features identified as potential biomarkers are yet to be thoroughly evaluated. We propose RE-CONFIRM, a framework for evaluating the robustness of potential biomarker candidates elucidated by deep learning (DL) models including FMs. From experiments on five large datasets of Autism Spectrum Disorder (ASD), Attention-deficit Hyperactivity Disorder (ADHD), and Alzheimer’s Disease (AD), we found that although commonly used performance metrics provide an intuitive assessment of model predictions, they are insufficient for evaluating the robustness of biomarkers identified by these models. RE-CONFIRM metrics revealed that simply finetuning FMs leads to models that fail to capture regional hubs effectively, even in disorders where hubs are known to be implicated, such as ASD and ADHD. In view of this, we propose Hub-LoRA (Low-Rank Adaptation) as a fine-tuning technique that enables FMs to not only outperform customised DL models but also produce neurobiologically faithful biomarkers supported by meta-analyses. RE-CONFIRM is generalizable and can be easily applied to ascertain the robustness of DL models trained on functional MRI datasets. Code is available at: this https URL.
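Hub-LoRA 的具体构造未在摘要中给出;下面是通用 LoRA(低秩适配)的示意:冻结权重 W,仅训练低秩因子 A、B,且 B 零初始化保证微调起点与原模型完全一致(维度与缩放均为假设):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 32, 4

W = rng.standard_normal((d_out, d_in))      # 冻结的预训练权重
A = rng.standard_normal((r, d_in)) * 0.01   # 可训练的低秩因子
B = np.zeros((d_out, r))                    # 零初始化:适配器初始不改变输出
alpha = 8.0

def adapted_forward(x):
    """y = W x + (alpha / r) * B (A x):微调时只更新 A 与 B。"""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
before = adapted_forward(x)
B += rng.standard_normal((d_out, r)) * 0.1  # 模拟一次训练更新
after = adapted_forward(x)
print(np.allclose(before, W @ x), np.allclose(after, W @ x))
```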
[AI-52] Model Predictive Control of Hybrid Dynamical Systems
【速读】:该论文旨在解决混合动力系统(Hybrid Dynamical Systems)在模型预测控制(Model Predictive Control, MPC)框架下的稳定性控制问题。其核心挑战在于如何设计一种能够同时处理连续状态演化与离散事件切换的MPC算法,并确保闭环系统的渐近稳定性。解决方案的关键在于构建基于混合时间域(hybrid time domains)的预测与控制时域结构,从而适配混合系统的动态特性;并通过引入阶段代价(stage cost)、终端代价(terminal cost)以及静态状态反馈律(static state-feedback laws)之间的约束关系,结合控制李雅普诺夫函数(control Lyapunov function)条件,给出可验证的渐近稳定性的充分条件。这一方法不仅保证了优化问题的结构性良好(如可行集和值函数的性质),还为实际应用提供了理论保障。
链接: https://arxiv.org/abs/2604.21989
作者: Ricardo G. Sanfelice,Berk Altin
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY); Dynamical Systems (math.DS)
备注: Technical report associated with paper to appear in IEEE Transactions on Automatic Control, 2026
Abstract:The problem of controlling hybrid dynamical systems using model predictive control (MPC) is formulated and sufficient conditions for asymptotic stability of a set are provided. Hybrid dynamical systems are modeled in terms of hybrid equations, involving a differential equation and a difference equation with inputs and constraints. The proposed hybrid MPC algorithm uses a suitable prediction and control horizon construction inspired by hybrid time domains. Structural properties of the hybrid optimization problem, its feasible set, and its value function are provided. Checkable conditions to guarantee asymptotic stability of a set are provided. These conditions are given in terms of properties on the stage cost, terminal cost, and the existence of static state-feedback laws, related through a control Lyapunov function condition. Examples illustrate the results throughout the paper.
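摘要中的 hybrid equations 模型可写成流(flow)与跳变(jump)的标准形式(依据该领域的通用记法,符号与文中一致:微分方程描述连续演化,差分方程描述离散事件):

```latex
% 混合动力系统的标准模型:流集 C 上连续演化,跳变集 D 上离散更新
\begin{aligned}
\dot{x} &= f(x, u), &\quad (x, u) &\in C \quad \text{(flow / 连续演化)} \\
x^{+}   &= g(x, u), &\quad (x, u) &\in D \quad \text{(jump / 离散跳变)}
\end{aligned}
```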
机器学习
[LG-0] Spend Less Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
链接: https://arxiv.org/abs/2604.22753
作者: Sijie Li,Shanda Li,Haowei Lin,Weiwei Sun,Ameet Talwalkar,Yiming Yang
类目: Machine Learning (cs.LG)
备注:
Abstract:Scaling laws are used to plan multi-million-dollar training runs, but fitting those laws can itself cost millions. In modern large-scale workflows, assembling a sufficiently informative set of pilot experiments is already a major budget-allocation problem rather than a routine preprocessing step. We formulate scaling-law fitting as budget-aware sequential experimental design: given a finite pool of runnable experiments with heterogeneous costs, choose which runs to execute so as to maximize extrapolation accuracy in a high-cost target region. We then propose an uncertainty-aware method for sequentially allocating experimental budget toward the runs most useful for target-region extrapolation. Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget. Our code is available at this https URL.
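幂律外推是 scaling law 拟合的基本操作:在 log-log 空间中线性回归即可估计指数。下面用合成数据做示意(真实参数与噪声均为假设,并非论文的主动实验选择方法本身):

```python
import numpy as np

rng = np.random.default_rng(0)
# 合成的试点实验:损失服从幂律 L(N) = a * N^(-b),带少量乘性噪声
a_true, b_true = 50.0, 0.3
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8])   # 试点模型规模(参数量)
loss = a_true * N ** (-b_true) * np.exp(rng.normal(0.0, 0.01, N.size))

# 幂律在 log-log 空间中是线性的:log L = log a - b * log N
X = np.column_stack([np.ones(N.size), np.log(N)])
(log_a, neg_b), *_ = np.linalg.lstsq(X, np.log(loss), rcond=None)
a_hat, b_hat = float(np.exp(log_a)), float(-neg_b)

# 向远超试点预算的高成本目标区域外推
pred_1B = a_hat * 1e9 ** (-b_hat)
print(round(b_hat, 2))
```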
[LG-1] Operational Feature Fingerprints of Graph Datasets via a White-Box Signal-Subspace Probe
链接: https://arxiv.org/abs/2604.22676
作者: Yuchen Xiong,Swee Keong Yeap,Zhen Hong Ban
类目: Machine Learning (cs.LG)
备注: 21 pages, 10 figures, 7 tables
Abstract:Graph neural networks achieve strong node-classification accuracy, but their learned message passing entangles ego attributes, neighborhood smoothing, high-pass graph differences, class geometry, and classifier boundaries in an opaque representation. This obscures why a node is classified and what feature-level graph-learning mechanisms a dataset requires. We propose WG-SRC, a white-box signal-subspace probe for prediction and graph dataset diagnosis. WG-SRC replaces learned message passing with a fixed, named graph-signal dictionary of raw features, row-normalized and symmetric-normalized low-pass propagation, and high-pass graph differences. It combines Fisher coordinate selection, class-wise PCA subspaces, closed-form multi-alpha ridge classification, and validation-based score fusion, so prediction and analysis use explicit class subspaces, energy-controlled dimensions, and closed-form linear decisions. As a white-box graph-learning instrument, WG-SRC uses predictive performance to validate its diagnostics: across six node-classification datasets, the scaffold remains competitive with reproduced graph baselines and achieves positive average gain under aligned splits. Its atlas, produced by a predictor, decomposes behavior into raw-feature, low-pass, high-pass, class-geometric, and ridge-boundary components. These operational feature fingerprints distinguish low-pass-dominated Amazon graphs, mixed high-pass and class-geometrically complex Chameleon behavior, and raw- or boundary-sensitive WebKB graphs. As intrinsic classifier outputs rather than post-hoc explanations, these fingerprints provide post-evaluation guidance for later analysis and dataset-specific modification. Aligned mechanistic interventions support this guidance by indicating when high-pass blocks act as removable noise, when raw features should be preserved, and when ridge-type boundary correction matters. 
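摘要中的图信号字典由原始特征、行归一化/对称归一化低通传播以及高通差分组成。下面在一条 4 节点路径图上示意这些固定(非学习)的通道(数据纯属假设):

```python
import numpy as np

# 4 节点路径图,每个节点一个标量特征
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 1.0])

deg = A.sum(axis=1)
A_row = A / deg[:, None]                 # 行归一化 D^-1 A
A_sym = A / np.sqrt(np.outer(deg, deg))  # 对称归一化 D^-1/2 A D^-1/2

low_pass = A_sym @ x                     # 邻域平滑(低通)
high_pass = x - A_row @ x                # 自身减邻居均值(高通差分)
dictionary = np.column_stack([x, A_row @ x, low_pass, high_pass])
print(dictionary.shape)  # (4, 4):每列对应一个命名通道
```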
[LG-2] Iterative Model-Learning Scheme via Gaussian Processes for Nonlinear Model Predictive Control of (Semi-)Batch Processes
链接: https://arxiv.org/abs/2604.22672
作者: Tai Xuan Tan,Alexander Mitsos,Eike Cramer
类目: Machine Learning (cs.LG)
备注: 12 pages, 7 figures
Abstract:Batch processes are inherently transient and typically nonlinear, motivating nonlinear model predictive control (NMPC). However, adopting NMPC is hindered by the cost and unavailability of dynamic models. Thus, we propose to use Gaussian Processes (GP) in a model-learning NMPC scheme (GP-MLMPC) for batch processes. We initialize the GP-MLMPC using data from a single initial trajectory, e.g., from a PI controller. We iteratively apply the NMPC embedded with GPs to run batches and update the GP with new observations from each iteration, thereby achieving batch-wise improvements. Using uncertainty quantification from the GPs, we formulate chance constraints to enforce safe operation to the required confidence levels. We demonstrate our approach in \textitsilico on a semi-batch polymerization reactor for tracking and economic objectives over durations of two hours, and the reactor temperature is constrained in a range of \pm2^\circ C around its setpoint. After only four batch iterations, tracking error from the GP-MLMPC scheme converged to a reduction of 83% , compared to the initial trajectory. Furthermore, under an economic objective, the GP-MLMPC resulted in a 17-fold increase in final product mass by iteration 8, compared to the initial trajectory. In both cases, the resulting GP-MLMPC performance is on par with the full-model NMPC, which shows that the optimal controller can be learned by the approach. By collecting samples around the optimal trajectory, the GP-MLMPC remains sample-efficient across iterations and achieves quick convergence. Thus, the proposed GP-MLMPC scheme presents a promising data-efficient approach for the control of nonlinear batch processes without mechanistic knowledge.
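GP-MLMPC 利用 GP 的不确定性量化来构造机会约束。下面用 NumPy 手写一个最小的 GP 后验与"均值 + 2σ 不超过上界"的约束检查(核长度尺度、观测数据与单位均为假设,非论文原实现):

```python
import numpy as np

def rbf(X1, X2, ls=1.0):
    """RBF 核:k(x, x') = exp(-0.5 * (x - x')^2 / ls^2)。"""
    d = X1[:, None] - X2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

# 假设性观测:反应器温度在 70 摄氏度设定点附近(时间单位:小时)
X_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([70.2, 70.8, 71.1, 70.9])
y0 = y_obs.mean()                                # 以样本均值作为先验均值
noise = 1e-2

K = rbf(X_obs, X_obs) + noise * np.eye(X_obs.size)
X_new = np.array([3.5])
k_star = rbf(X_new, X_obs)                       # 形状 (1, 4)
mean = float((y0 + k_star @ np.linalg.solve(K, y_obs - y0))[0])
var = float((rbf(X_new, X_new) - k_star @ np.linalg.solve(K, k_star.T))[0, 0])
std = float(np.sqrt(max(var, 0.0)))

# 机会约束(约 95% 置信):要求 均值 + 2*sigma 不超过 70+2 摄氏度的上界
upper = mean + 2.0 * std
print(round(upper, 2))
```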
[LG-3] Associativity-Peakiness Metric for Contingency Tables
链接: https://arxiv.org/abs/2604.22655
作者: Naomi E. Zirkind,William J. Diehl
类目: Machine Learning (cs.LG)
备注: 38 pages, 21 figures
Abstract:For the use case of comparing the performance of clustering algorithms whose output is a contingency table, a single performance metric for contingency tables is needed. Such a metric is vital for comparative performance analysis of clustering algorithms. A survey of publicly available literature did not show the presence of such a metric. Metrics do exist for vector pairs of truth values and predicted values, which are an alternative form of output of clustering algorithms. However, the metrics for vector pairs do not reveal the presence of detailed features that are apparent in contingency tables. This paper presents the Associativity Peakiness (AP) metric, which characterizes aspects of clustering algorithm performance that are critical for predicting a clustering algorithm’s performance when deployed. The AP metric is analogous to measures of quality for confusion matrices that are outputs of supervised learning algorithms. This paper presents results from simulations in which 500 contingency tables were generated for multiple test scenarios. The results show that for the use case of evaluating clustering algorithms, the AP metric characterizes performance of contingency tables with higher dynamic range than publicly available metrics, and that it is computationally more efficient than comparable publicly available metrics.
[LG-4] Quality-Driven Selective Mutation for Deep Learning
链接: https://arxiv.org/abs/2604.22640
作者: Zaheed Ahmed,Emmanuel Charleson Dapaah,Philip Makedonski,Jens Grabowski
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Mutants support testing and debugging in two roles: (i) as test goals and (ii) as substitutes for real faults. Hard-to-kill mutants provide better guidance for test improvement, while realism is essential when mutants are used to simulate real bugs. Building on these roles, selective mutation for deep learning (DL) aims to reduce the cost of mutant generation and execution by choosing operator configurations that yield resistant and realistic mutants. However, the DL literature lacks a unified measure that captures both aspects. This study presents a probabilistic framework to quantify mutant quality along two complementary axes: resistance and realism. Resistance adapts the classical notion of hard-to-kill mutants to the DL setting using statistical killing probabilities, while realism is measured via the generalized Jaccard similarity between mutant and real-fault detectability patterns. The framework enables ranking and filtering of low-quality mutation-operator configurations without assuming a specific use case. We empirically evaluate the approach on four datasets of real DL faults. Three datasets (CleanML, DeepFD, and DeepLocalize) are used to estimate and select high-quality operator configurations, and the held-out defect4ML dataset is used for validation. Results show that quality-driven selection reduces the number of generated mutants by up to 55.6% while preserving typical levels of resistance and realism under baseline-aligned selection thresholds. These findings confirm that dual-objective selection can lower cost without compromising the usefulness of mutants for either role.
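The realism axis above is measured with the generalized Jaccard similarity between detectability patterns; a minimal sketch of its usual definition (the function name and zero-vector convention are ours):

```python
def generalized_jaccard(a, b):
    """Generalized Jaccard similarity between two nonnegative vectors:
    sum of elementwise minima over sum of elementwise maxima.
    Identical vectors (including all-zero, by convention) give 1.0."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den > 0 else 1.0
```

Applied to binary detectability vectors (which tests kill the mutant vs. which detect the real fault), this reduces to the familiar intersection-over-union.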
[LG-5] Adversarial Malware Generation in Linux ELF Binaries via Semantic-Preserving Transformations
链接: https://arxiv.org/abs/2604.22639
作者: Lukáš Hrdonka,Martin Jureček
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Malware development and detection have undergone significant changes in recent years as modern concepts, such as machine learning, have been used for both adversarial attacks and defense. Despite intensive research on Windows Portable Executable (PE) files, there is minimal work on Linux Executable and Linkable Format (ELF). In this work, we summarize the academic work published in this field and develop a new adversarial malware generator for the ELF format. Using a variety of metrics, we thoroughly evaluated our generator and achieved an Evasion Rate of 67.74% while changing the confidence of the malware detector by -0.50 in the mean case for the dataset used. In our approach, we chose MalConv as the target classifier. Using this classifier, we found that the most successful modifications used strings typical of benign files as a data source. We conducted a variety of experiments and concluded that the target classifier appears sensitive to strings at any location within the executable file.
[LG-6] Detecting Concept Drift in Evolving Malware Families Using Rule-Based Classifier Representations
链接: https://arxiv.org/abs/2604.22629
作者: Tomáš Kalný,Martin Jureček,Mark Stamp
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:This work proposes a structural approach to concept drift detection in malware classification using decision tree rulesets. Classifiers are trained across temporal windows on the EMBER2024 dataset, and drift is quantified by comparing extracted rule representations using feature importance, prediction agreement, activation stability, and coverage metrics. These metrics are correlated with both accuracy degradation and data distribution shift as complementary drift indicators. The approach is evaluated across six malware families using fixed-interval and clustering-based windowing in family-vs-benign and family-vs-family settings, and compared against RIPPER and Transcendent baselines. Results show that fixed two-month windowing with feature-level Pearson correlation is the most reliable configuration, being the only one where all family pairs produce positive drift-accuracy correlations. The methods are complementary - no single approach dominates across all pairs.
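The best-performing configuration above compares feature-importance vectors across temporal windows via Pearson correlation; a generic sketch of such a structural drift score (helper names are ours, not the paper's):

```python
import math

def pearson(a, b):
    """Sample Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def drift_score(importance_prev, importance_curr):
    """1 - correlation of feature-importance vectors from two windows;
    larger values indicate a bigger structural shift in the learned rules."""
    return 1.0 - pearson(importance_prev, importance_curr)
```

A positive correlation between this drift score and accuracy degradation over time is what the paper uses to validate a windowing configuration.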
[LG-7] Beyond Patient Invariance: Learning Cardiac Dynamics via Action-Conditioned JEPAs
链接: https://arxiv.org/abs/2604.22618
作者: Jose Geraldo Fernandes,Luiz Facury,Pedro Robles Dutenhefner,Wagner Meira Jr
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-supervised learning in healthcare has largely relied on invariance-based objectives, which maximize similarity between different views of the same patient. While effective for static anatomy, this paradigm is fundamentally misaligned with clinical diagnosis, as it mathematically compels the model to suppress the transient pathological changes it is intended to detect. We propose a shift towards Action-Conditioned World Models that learn to simulate the dynamics of disease progression, or Event-Conditioned. Adapting the LeJEPA framework to physiological time-series, we define pathology not as a static label, but as a transition vector acting on a patient’s latent state. By predicting the future electrophysiological state of the heart given a disease onset, our model explicitly disentangles stable anatomical features from dynamic pathological forces. Evaluated on the MIMIC-IV-ECG dataset, our approach outperforms fully supervised baselines on the critical triage task. Crucially, we demonstrate superior sample efficiency: in low-resource regimes, our world model outperforms supervised learning by over 0.05 AUROC. These results suggest that modeling biological dynamics provides a dense supervision signal that is far more robust than static classification. Source code is available at this https URL
[LG-8] Adaptive Head Budgeting for Efficient Multi-Head Attention
链接: https://arxiv.org/abs/2604.22583
作者: Bilal Faye,Abdoulaye Mbaye,Hanane Azzag,Mustapha Lebbah
类目: Machine Learning (cs.LG)
*备注:
Abstract:Transformers have become the dominant architecture across a wide range of domains, largely due to the effectiveness of multi-head attention in capturing diverse representation subspaces. However, standard multi-head attention activates all heads uniformly for every input, regardless of task requirements or input complexity. In many scenarios, particularly for coarse-grained tasks such as text classification, the relevant information is often global and does not require the full diversity of attention heads. As a consequence, using a fixed number of heads can introduce unnecessary computational cost or lead to suboptimal performance when the allocation does not match the input. To address this limitation, we introduce BudgetFormer, a Transformer architecture equipped with an adaptive multi-head attention mechanism that dynamically allocates computational resources. Our approach learns, for each input, both a head budget corresponding to the number of attention heads required, and a relevance distribution that selects the most informative heads. We also propose a training strategy based on an exploration and exploitation trade-off, allowing the model to discover effective head configurations before converging to efficient usage patterns. Experiments on text classification tasks of varying complexity show that our method reduces inference cost in terms of FLOPs and memory, while also achieving performance that can surpass standard full multi-head attention. These results highlight the potential of adaptive head allocation as a principled approach to improving both efficiency and effectiveness in Transformer models.
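The core idea of keeping only a budget of k informative heads can be sketched generically as follows; this is our illustrative simplification (BudgetFormer learns both the budget and the relevance distribution end-to-end with an exploration-exploitation schedule):

```python
import numpy as np

def head_mask(relevance, k):
    """0/1 mask keeping the k highest-relevance attention heads."""
    mask = np.zeros(len(relevance))
    mask[np.argsort(relevance)[-k:]] = 1.0
    return mask

def budgeted_attention(head_outputs, relevance, k):
    """Combine only the k selected heads, weighting them by a softmax of
    their relevance restricted to the kept set.
    head_outputs: (H, n, d_head) -> returns (n, d_head)."""
    m = head_mask(relevance, k)
    w = m * np.exp(relevance - np.max(relevance))
    w = w / w.sum()
    return np.tensordot(w, head_outputs, axes=1)
```

With k smaller than H, the heads outside the budget need never be computed at all, which is where the FLOP and memory savings come from.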
[LG-9] SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
链接: https://arxiv.org/abs/2604.22575
作者: Yuqi Pan,Jinghao Zhuang,Yupeng Feng,Fangzhi Zhong,Siyu Ding,Xuerui Qiu,Shaowei Gu,Bohan Sun,Zhiyong Qin,Yibo Zhong,Lingtao Ouyang,Kun Yang,Zehao Liu,Yuhong Chou,Shurong Wang,Anjie Hu,Han Xu,Bo Xu,Guoqi Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scaling context length is reshaping large-model development, yet full-attention Transformers suffer from prohibitive computation and inference bottlenecks at long sequences. A key challenge is to design foundation models that maintain performance and long-context efficiency with minimal training overhead. We introduce SpikingBrain2.0 (SpB2.0), a 5B model that advances both architecture and training efficiency of its predecessor. Our contributions are two-fold. (1) Architectural Innovation: We propose Dual-Space Sparse Attention (DSSA), an inter-layer hybrid of Sparse Softmax Attention (MoBA) and Sparse Linear Attention (SSE), achieving an improved performance-efficiency trade-off for long-context modeling. SpB2.0 further supports dual quantization paths: INT8-Spiking coding enables sparse event-driven computation, while FP8 coding accelerates inference on modern GPUs. (2) Enhanced Training Strategy: We develop an optimized Transformer-to-Hybrid (T2H) pipeline with dual conversion paths for LLMs and VLMs using curated open-source data. Empirically, SpB2.0-5B and SpB2.0-VL-5B recover most of the base Transformer (Qwen3-4B) capability with under 7k A100 GPU hours. SpB2.0 achieves a 10.13x TTFT speedup at 4M context and supports over 10M tokens on 8 A100 GPUs under vLLM, where full-attention models exceed memory limits. It also demonstrates strong cross-platform compatibility, enabling FP8 GPU inference (2.52x speedup at 250k) and efficient neuromorphic execution (64.31% sparsity, with 70.6% and 46.5% area and power reduction at 500MHz). Overall, SpikingBrain2.0 provides a practical pathway for lightweight, multimodal, spiking foundation models, highlighting the potential of combining brain-inspired mechanisms with efficient architectures for resource-constrained and edge scenarios. 
[LG-10] Adversarial Co-Evolution of Malware and Detection Models: A Bilevel Optimization Perspective
链接: https://arxiv.org/abs/2604.22569
作者: Olha Jurečková,Martin Jureček,Matouš Kozák,Róbert Lórencz
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning-based malware detectors are increasingly vulnerable to adversarial examples. Traditional defenses, such as one-shot adversarial training, often fail against adaptive attackers who use reinforcement learning to bypass detection. This paper proposes a robust defense framework based on bilevel optimization, explicitly modeling the strategic interaction between a defender and an attacker as an adversarial co-evolutionary process. We evaluate our approach using the MAB-malware framework against three distinct malware families: Mokes, Strab, and DCRat. Our experimental results demonstrate that while standard classifiers and basic adversarial retraining often remain vulnerable, showing evasion rates as high as 90%, the proposed bilevel optimization approach consistently achieves near-total immunity, reducing evasion rates to 0–1.89%. Furthermore, the iterative framework significantly increases the attacker’s query complexity, raising the average cost of successful evasion by up to two orders of magnitude. These findings suggest that modeling the iterative cycle of attack and defense through bilevel optimization is essential for developing resilient malware detection systems capable of withstanding evolving adversarial threats.
[LG-11] An Integrated Framework for Explainable Fair and Observable Hospital Readmission Prediction: Development and Validation on MIMIC-IV
链接: https://arxiv.org/abs/2604.22535
作者: Isaac Tosin Adisa
类目: Machine Learning (cs.LG)
*备注: 22 pages, 8 figures. Submitted to the Journal of the American Medical Informatics Association (JAMIA), currently under review
Abstract:Objective: To propose and retrospectively validate an integrated framework addressing three barriers to clinical translation of readmission prediction: lack of explainability, absence of deployment reliability infrastructure, and inadequate demographic fairness evaluation. Materials and Methods: We constructed a cohort of 415231 adult admissions from the MIMIC-IV database (30-day readmission prevalence 18.0%), split 70/15/15. Logistic regression, XGBoost, and LightGBM models were trained on 26 features. SHAP provided per-patient explanations. Fairness was evaluated across 16 subgroups using AUC-ROC, false negative rate (FNR), and positive predictive value (PPV). Calibration was assessed using Brier scores and calibration curves. Results: XGBoost achieved AUC-ROC 0.696 (95% CI 0.691-0.701), outperforming or matching the LACE baseline (AUC 0.60-0.68). LightGBM achieved best calibration (Brier 0.146). Prior admissions were the dominant predictor. All subgroups met equity thresholds (ΔAUC ≤ 0.05, ΔFNR ≤ 0.10). Conclusion: This framework delivers competitive performance, clinically actionable explanations, and strong demographic equity. Code is publicly available at this https URL.
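The subgroup fairness check described above (e.g., FNR gaps across demographic subgroups) can be sketched as follows; this is a generic illustration with our own function names, not the paper's released code:

```python
def fnr(y_true, y_pred):
    """False negative rate: missed positives / all positives."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    pos = sum(y_true)
    return fn / pos if pos else 0.0

def fnr_gap(y_true, y_pred, groups):
    """Max pairwise FNR difference across demographic subgroups."""
    rates = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates[g] = fnr([y_true[i] for i in idx], [y_pred[i] for i in idx])
    vals = list(rates.values())
    return max(vals) - min(vals)
```

An equity threshold is then simply `fnr_gap(...) <= 0.10`, mirroring the ΔFNR criterion in the abstract.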
[LG-12] Decoding High-Dimensional Finger Motion from EMG Using Riemannian Features and RNNs
链接: https://arxiv.org/abs/2604.22499
作者: Martin Colot,Cédric Simar,Guy Cheron,Ana Maria Cebolla Alvarez,Gianluca Bontempi
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 13 pages, 10 figures, 3 tables, links to a GitHub, a dataset on Zenodo, and two videos on YouTube
Abstract:Continuous estimation of high-dimensional finger kinematics from forearm surface electromyography (EMG) could enable natural control for hand prostheses, AR/XR interfaces, and teleoperation. However, the complexity of human hand gestures and the entanglement of forearm muscles make accurate recognition intrinsically challenging. Existing approaches typically reduce task complexity by relying on classification-based machine learning, limiting the controllable degrees of freedom and compromising on natural interaction. We present an end-to-end framework for continuous EMG-to-kinematics regression using only consumer-grade hardware. The framework combines an 8-channel EMG armband, a single webcam, and an automatic synchronization procedure, enabling the collection of the EMG Finger-Kinematics dataset (EMG-FK), a 10-h dataset of synchronized EMG and 15 finger joint angles from 20 participants performing rich, unconstrained right-hand motions. We also introduce the Temporal Riemannian Regressor (TRR), a lightweight GRU-based model that uses sequences of multi-band Riemannian covariance features to decode finger motion. Across EMG-FK and the public emg2pose benchmark, TRR outperforms state-of-the-art methods in both intra- and cross-subject evaluation. On EMG-FK, it reaches an average absolute error of 9.79° ± 1.48° in intra-subject and 16.71° ± 3.97° in cross-subject. Finally, we demonstrate real-time deployment on a Raspberry Pi 5 and intuitive control of a robotic hand; TRR runs at nearly 10 predictions/s and is roughly an order of magnitude faster than state-of-the-art approaches. Together, these contributions lower the barrier to reproducible, real-time EMG-based decoding of high-dimensional finger motion, and pave the way toward more natural and intuitive control of embedded EMG-based systems.
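The Riemannian covariance features above can be illustrated with a single-band, log-Euclidean sketch (tangent mapping at the identity; the paper stacks such features over multiple frequency bands and its exact reference point may differ):

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of a symmetric positive-definite matrix
    via eigendecomposition: V diag(log w) V^T."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def riemannian_feature(window, eps=1e-6):
    """One multichannel EMG window (channels x samples) -> tangent-space
    vector: the upper triangle of log(cov + eps*I).
    eps regularizes the covariance to keep it positive definite."""
    C = np.cov(window) + eps * np.eye(window.shape[0])
    L = spd_log(C)
    return L[np.triu_indices(L.shape[0])]
```

Sequences of such vectors, one per sliding window, are what a GRU regressor like TRR would consume.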
[LG-13] Deep Learning for Model Calibration in Simulation of Itaconic Acid Production
链接: https://arxiv.org/abs/2604.22496
作者: Daria Fokina,Marco Baldan,Constantin Romankiewicz,Wolfgang Laudensack,Roland Ulber,Michael Bortz
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this study, deep learning is used to estimate kinetic parameters for modeling itaconic acid production based on real batch experiments conducted at different agitation speeds and reactor scales. Two deep learning strategies, namely direct deep learning (DDL) and generative conditional flow matching (CFM) are compared and benchmarked against nonlinear regression as a reference method. Compared with DDL, CFM consistently yields more accurate results. The concentration profiles predicted by CFM closely match those obtained from nonlinear regression, whereas DDL results in larger deviations. Similar behavior is observed in the scale-up experiments, where the CFM model again generalizes better and is more robust than the direct approach. These findings demonstrate that CFM can reliably predict system behavior across different operating conditions and scales, offering a flexible and data-efficient framework for parameter estimation in dynamic bioprocess models.
[LG-14] Towards Adaptive Continual Model Merging via Manifold-Aware Expert Evolution
链接: https://arxiv.org/abs/2604.22464
作者: Haiyun Qiu,Xingyu Wu,Kay Chen Tan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual Model Merging (CMM) sequentially integrates task-specific models into a unified architecture without intensive retraining. However, existing CMM methods are hindered by a fundamental saturation-redundancy dilemma: backbone-centric approaches face parameter saturation and representation interference within fixed capacities, whereas Mixture-of-Experts (MoE) variants resort to indiscriminate expansion, incurring expert redundancy and a routing bottleneck reliant on additional data-driven optimization. To resolve these challenges, we propose MADE-IT (Manifold-Aware Dynamic Expert Evolution and Implicit rouTing), an adaptive CMM method that orchestrates expert management and activation by grounding intrinsic expert representations in manifold geometry. We introduce a projection-based subspace affinity metric coupled with a distribution-aware adaptive threshold mechanism to guide autonomous expert evolution, harmonizing diversity with architectural parsimony. Furthermore, to bypass parameterized gating networks, we design a data-free and training-free implicit routing mechanism that activates experts via feature-subspace alignment. Extensive experiments demonstrate that MADE-IT consistently outperforms strong baselines in accuracy and robustness across long-horizon and shuffled task sequences, while significantly pruning redundant experts, particularly within generic modules and early layers.
[LG-15] HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
链接: https://arxiv.org/abs/2604.22442
作者: Abhinaba Basu
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:We introduce HubRouter, a pluggable module that replaces O(n^2) attention layers with O(nM) hub-mediated routing, where M ≪ n is a small number of learned hub tokens. We demonstrate it in two from-scratch architectures: a Jamba-style hybrid and a 12-layer Transformer; retrofit into pretrained models is a tested negative case. HubRouter implements an encode-decode-score-council pipeline: M learned hubs cross-attend to all tokens, tokens project against hubs for routing fingerprints, a score head selects top-k tokens, and a sparse council attends only to the selected subset. We validate HubRouter in three settings. (1) Hub-Jamba yields a nominal 4.2% PPL improvement (200.2 vs 209.0, single seed; possibly within seed noise) and up to ~90x training throughput at sequence length 1024 in matched PyTorch-native baselines; an optimised baseline would narrow this to ~10-15x. (2) Graduated replacement of 25% of Transformer attention layers gives the best perplexity in our matched-budget sweep (268.0 vs 282.4 pure Transformer). (3) Hub-GPT provides strictly causal routing, achieving PPL 211.5 +/- 0.4 over 3 seeds (post council-causal fix); approximately 3 PPL worse than Jamba’s 208.5 +/- 0.7, a measurable quality cost for avoiding O(n^2) computation. Post-fix, chunk size C has little effect; the pre-fix chunk-size benefit was an artifact of a bidirectional-council leak we found in adversarial review. A multi-seed hub-count sweep (~105 runs across M=1-32) reveals M=8-14 as the reliably-converging sub-band (4-5/5 seeds); M=6 is rescued to 5/5 by orthogonal regularization, while M=20 shows increasing seed sensitivity. Companion paper arXiv:2603.20997 (Basu, 2026) defines the routing diagnostic task. Code and scripts will be released.
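A generic sketch of hub-mediated routing with O(nM) cost, loosely following the encode-decode-score steps named in the abstract; this is our simplification, not the released implementation, and the scoring rule here is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hub_route(tokens, hubs, k):
    """tokens: (n, d), hubs: (M, d). Hubs cross-attend to all tokens
    (encode), tokens project against the hub summaries for routing
    fingerprints (decode), and the top-k tokens by fingerprint score are
    selected for the sparse council. Every product is O(n*M*d); no n x n
    attention matrix is ever formed."""
    d = tokens.shape[1]
    hub_summary = softmax(hubs @ tokens.T / np.sqrt(d)) @ tokens   # (M, d)
    fingerprints = tokens @ hub_summary.T / np.sqrt(d)             # (n, M)
    scores = fingerprints.max(axis=1)                              # (n,)
    selected = np.sort(np.argsort(scores)[-k:])                    # top-k ids
    return selected, hub_summary
```

The council step would then run full attention only over the k selected tokens, which is where the sub-quadratic budget comes from.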
[LG-16] Beyond Land Surface Temperature: Explainable Spatial Machine Learning Reveals Urban Morphology Effects on Human-Centric Heat Stress
链接: https://arxiv.org/abs/2604.22433
作者: Yuan Wang,Shengao Yi,Xiaojiang Li,Pengyuan Liu,Zhiwei Yang,Ronita Bardhan,Rudi Stouffs
类目: Machine Learning (cs.LG)
*备注:
Abstract:Heat exposure connects the built environment and public health, directly shaping the livability and sustainability of urban areas. Understanding the spatial heterogeneity of heat exposure and its drivers is vital for climate-adaptive urban planning. However, most planning-oriented studies rely on land surface temperature (LST), and whether LST adequately represents human heat exposure and how it differs from physiologically relevant heat stress remains insufficiently examined. Here, adopting Landsat-retrieved 30-m LST and GPU-accelerated 1-m universal thermal climate index (UTCI) in Singapore, this study establishes a comprehensive “Modeling-Comparing-Assessing” framework to systematically evaluate the spatial and mechanistic discrepancies between the two metrics. We further investigate pronounced non-stationary and threshold-based quantitative relationships of the two metrics with urban factors by employing a novel geographically weighted XGBoost (GW-XGBoost) and generalized additive model (GAM) workflow. Our results demonstrate notable discrepancies in spatial patterns of LST and UTCI, along with substantial spatial heterogeneity in how 2D and 3D urban factors impact these two thermal metrics, as revealed by explainable GW-XGBoost models (global out-of-bag R2 = 0.855 for LST and 0.905 for UTCI, respectively). Crucially, spatially explicit SHAP interprets that sky view factor plays a central role in explaining UTCI variability but exhibits a comparatively marginal independent contribution to LST, indicating that LST inadequately captures shading-driven and radiative processes governing actual human heat stress. Notably, SHAP-GAM analysis indicates that higher albedo is associated with increased UTCI. These novel findings provide evidence for integrating physiologically relevant thermal indices to inform targeted heat risk management and climate-adaptive urban planning.
[LG-17] Robust Fuzzy local k-plane clustering with mixture distance of hinge loss and L1 norm
链接: https://arxiv.org/abs/2604.22405
作者: Junjun Huang,Xiliang Lu,Xuelin Xie,Jerry Zhijian Yang
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:K-plane clustering (KPC), hyperplane clustering, and mixture regression all essentially fall within the same class of problems. This problem can be conceptualized as clustering in relatively high-dimensional K subspaces or K linear manifolds. Traditional KPC or fuzzy KPC models demonstrate a pronounced susceptibility to outliers, as they presuppose that the projection distance between data points and the plane normal vector adheres to the L2 distance. Meanwhile, the assumption of infinitely extending clusters adversely affects clustering performance. To solve these problems, this paper proposed a new robust fuzzy local k-plane clustering (RFLkPC) method that combines the mixture distance of hinge loss and L1 norm. The RFLkPC model assumes that each plane cluster is bounded to a finite area, which can flexibly and robustly handle plane clustering tasks with outliers or not. The corresponding model and optimization algorithms of RFLkPC were provided. Compared to other related models on this topic, a large number of experiments verify the efficiency of RFLkPC on simulated data and real data. The source code for the proposed RFLkPC method is publicly available at this https URL.
[LG-18] Revisiting Neural Activation Coverage for Uncertainty Estimation
链接: https://arxiv.org/abs/2604.22360
作者: Benedikt Franke,Nils Förster,Frank Köster,Asja Fischer,Markus Lange,Arne Raulf
类目: Machine Learning (cs.LG)
*备注: Published in 34th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2026
Abstract:Neural activation coverage (NAC) is a recently-proposed technique for out-of-distribution detection and generalization. We build upon this promising foundation and extend the method to work as an uncertainty estimation technique for already-trained artificial neural networks in the domain of regression. Our experiments confirm NAC uncertainty scores to be more meaningful than other techniques, e.g. Monte-Carlo Dropout.
[LG-19] SOC-ICNN: From Polyhedral to Conic Geometry for Learning Convex Surrogate Functions
链接: https://arxiv.org/abs/2604.22355
作者: Kang Liu,Jianchen Hu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 28 pages and no figure
Abstract:Classical ReLU-based Input Convex Neural Networks (ICNNs) are equivalent to the optimal value functions of Linear Programming (LP). This intrinsic structural equivalence restricts their representational capacity to piecewise-linear polyhedral functions. To overcome this representational bottleneck, we propose the SOC-ICNN, an architecture that generalizes the underlying optimization class from LP to Second-Order Cone Programming (SOCP). By explicitly injecting positive semi-definite curvature and Euclidean norm-based conic primitives, our formulation introduces native smooth curvature into the representation while preserving a rigorous optimization-theoretic interpretation. We formally prove that SOC-ICNNs strictly expand the representational space of ReLU-ICNNs without increasing the asymptotic order of forward-pass complexity. Extensive experiments demonstrate that SOC-ICNN substantially improves function approximation, while delivering competitive downstream decision quality. The code is available at this https URL.
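The ReLU-ICNN baseline that SOC-ICNN generalizes can be written in a few lines (the classical construction of Amos et al., not the paper's conic extension); convexity in the input follows from the elementwise-nonnegative inner weights and can be checked numerically with midpoint inequalities:

```python
import numpy as np

def icnn(x, Wx0, b0, Wz1, Wx1, b1):
    """Two-layer ReLU input-convex network:
    z1 = relu(Wx0 x + b0); out = sum(relu(Wz1 z1 + Wx1 x + b1)).
    Convexity in x requires Wz1 >= 0 elementwise: a convex nondecreasing
    map composed with convex (relu-of-affine) functions stays convex."""
    z1 = np.maximum(0.0, Wx0 @ x + b0)
    z2 = np.maximum(0.0, Wz1 @ z1 + Wx1 @ x + b1)
    return z2.sum()
```

SOC-ICNN enriches exactly this piecewise-linear family with Euclidean-norm primitives, so its value functions correspond to SOCPs rather than LPs.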
[LG-20] A Nationwide Japanese Medical Claims Foundation Model: Balancing Model Scaling and Task-Specific Computational Efficiency
链接: https://arxiv.org/abs/2604.22348
作者: Nanae Aratake,Taisei Tosaki,Yuji Okamoto,Eiichiro Uchino,Masaki Nakamura,Nobutomo Matsui,Akiko Hatakama,Yasushi Okuno
类目: Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 3 tables
Abstract:Clinical risk prediction using longitudinal medical data supports individualized care. Self-supervised foundation models have emerged as a promising approach for leveraging large-scale unlabeled healthcare records. In natural language processing, scaling laws suggest that larger models achieve predictably lower pretraining losses, supporting the foundation model paradigm. However, for structured medical data, characterized by a limited vocabulary and sparse observations, whether increasing model size consistently improves downstream predictions is unclear, as most studies evaluate only a single model scale. In this study, we evaluated the relationship between model scale and downstream task performance for structured medical foundation models. Using a random sample (2.3 million patients, 32 hospitals) from a nationwide 519-hospital Japanese claims database, we pretrained encoder-only Transformers at five scales (2.2M-101M parameters) for disease incidence and medication prediction. Downstream performance saturated at task-dependent thresholds: disease prediction benefited from larger models (32M-101M), whereas medication prediction saturated at 11M, reducing pretraining time by 178 h. Across all tasks, the best-performing model consistently outperformed a Light Gradient Boosting Machine baseline in the area under the precision-recall curve. These findings indicate that, unlike the monotonically decreasing pretraining loss, the optimal model size varied depending on task characteristics. This task-dependent saturation provides practical guidance for balancing predictive performance and computational cost in structured medical foundation models.
[LG-21] TabSCM: A practical Framework for Generating Realistic Tabular Data
链接: https://arxiv.org/abs/2604.22337
作者: Sven Jacob,Bardh Prenkaj,Weijia Shao,Gjergji Kasneci
类目: Machine Learning (cs.LG)
*备注:
Abstract:Most tabular-data generators match marginal statistics yet ignore causal structure, leading downstream models to learn spurious or unfair patterns. We present TabSCM, a mixed-type generator that preserves those causal dependencies. Starting from a Completed Partially Directed Acyclic Graph (CPDAG) found by any causal structure discovery algorithm, TabSCM (i) orients edges to a DAG, (ii) fits root-node marginals with KDE or categorical frequencies, and (iii) learns topologically ordered structural assignments. Such assignments are achieved using conditional diffusion models for continuous variables as child nodes and gradient-boosted trees for categorical ones. Ancestral sampling yields semantically valid records and enables exact counterfactual queries. On seven public datasets, encompassing healthcare, finance, housing, environment, TabSCM matches or surpasses state-of-the-art GAN, diffusion, and LLM baselines in statistical fidelity, downstream utility, and privacy risk, while also cutting rule-violation rates and providing causally meaningful and robust conditional interventions. Because generation is decomposed into explicit equations, it runs up to 583× faster than diffusion-only models and exposes interpretable knobs for fairness auditing and policy simulation, making TabSCM a practical choice for realism, explainability, and causal soundness.
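The ancestral-sampling step over a learned DAG can be sketched generically; the samplers below are stand-in callables (in TabSCM, roots would use KDE/frequency models and children diffusion or gradient-boosted-tree models), and the graph and variable names are purely illustrative:

```python
def topological_order(parents):
    """Kahn-style ordering of a {node: [parent, ...]} DAG; raises on cycles."""
    remaining = dict(parents)
    order = []
    while remaining:
        ready = sorted(n for n, ps in remaining.items()
                       if not any(p in remaining for p in ps))
        if not ready:
            raise ValueError("graph has a cycle")
        for n in ready:
            order.append(n)
            del remaining[n]
    return order

def ancestral_sample(parents, samplers):
    """Draw one synthetic record by sampling each node conditioned on its
    already-sampled parents, in topological order."""
    record = {}
    for n in topological_order(parents):
        record[n] = samplers[n]({p: record[p] for p in parents[n]})
    return record
```

Because each node is an explicit function of its parents, counterfactual queries reduce to fixing a node's value and resampling only its descendants.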
[LG-22] A Brain-Inspired Deep Separation Network for Single Channel Raman Spectra Unmixing IJCNN2026
链接: https://arxiv.org/abs/2604.22324
作者: Gaoruishu Long,Jinchao Liu,Bo Liu,Jie Liu,Xiaolin Hu
类目: Machine Learning (cs.LG)
*备注: Accepted by the 2026 International Joint Conference on Neural Networks (IJCNN 2026). 8 pages, 5 figures
Abstract:Raman spectra obtained in real-world applications are often a noisy combination of several spectra of various substances in a tested sample. Unmixing such spectra into individual components corresponding to each of the substances is of great value and has been a longstanding challenge in Raman spectroscopy. Existing unmixing methods are predominantly designed to invert an overdetermined mixed model and therefore require multiple mixed spectra as input. However, open domain and/or non-cooperative detection applications in Raman spectroscopy such as controlled substance detection, call for single-channel solutions which can identify individual components from thousands of candidates by analyzing only a single noisy mixed spectrum. To our knowledge, sparse regression is the only existing solution which can cope with this scenario, yet it has very low tolerance to noises and can hardly be applicable in practice. To address these limitations, we introduce a novel neural approach for single-channel Raman spectrum unmixing inspired by speech separation. It aims at solving underdetermined systems and can decompose a noisy mixed spectrum from a library of thousands of components (substances). The core of our method is a deep separation neural network (RSSNet) which takes a mixed spectrum as input and outputs spectra of pure components. We created two synthetic datasets of single-channel Raman spectra unmixing and demonstrated feasibility and superiority of RSSNet on these datasets (outperforming competing methods by 4 dB). Furthermore, we verified that RSSNet, trained solely on synthetic data, can successfully unmix real-world mixed spectra of mixtures of mineral powders, exhibiting strong generalization. Our approach represents a new paradigm for Raman unmixing and enables new possibilities for fast detection of Raman mixtures.
[LG-23] HGQ-LUT: Fast LUT-Aware Training and Efficient Architectures for DNN Inference
链接: https://arxiv.org/abs/2604.22293
作者: Chang Sun,Zhiqiang Que,Bakhtiar Zadeh,Qibin Liu,Kevin H. Alvarez,Wayne Luk,Maria Spiropulu
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
*备注:
Abstract:Lookup-table (LUT) based neural networks can deliver ultra-low latency and excellent hardware efficiency on FPGAs by mapping arithmetic operations directly onto the logic primitives. However, state-of-the-art LUT-aware training (LAT) approaches remain difficult to use in practice: they are often orders of magnitude slower to train than conventional networks, require non-trivial manual tuning for hardware efficiency, and lack an end-to-end workflow. This work presents HGQ-LUT, integrated in this https URL, a new LAT approach that achieves state-of-the-art hardware efficiency while accelerating training by over 100 times on modern GPUs. HGQ-LUT introduces LUT-Dense and LUT-Conv layers that are implemented with regular, accelerator-efficient tensor operations during training, which are then compiled into logic LUTs for hardware. By combining these layers with fine-grained, element-wise heterogeneous quantization (including zero-bit pruning) and a LUT-aware resource surrogate, HGQ-LUT enables the automatic exploration of accuracy-resource trade-offs without manual bit-width tuning. We further integrate HGQ-LUT into open-source toolchains, enabling unified design, compilation, and bit-exact verification of hybrid architectures that mix LUT-based with conventional arithmetic blocks. These features make LAT-based DNNs practical for real-world deployment, such as at the CERN Large Hadron Collider’s experiments.
[LG-24] How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
链接: https://arxiv.org/abs/2604.22271
作者: Dharshan Kumaran,Viorica Patraucean,Simon Osindero,Petar Velickovic,Nathaniel Daw
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at a token immediately following the answer (i.e., the post-answer newline: PANL) that causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct – where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error detection behavior when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
[LG-25] AI-Driven Performance-to-Design Generation and Optimization of Marine Propellers
链接: https://arxiv.org/abs/2604.22224
作者: Leah Chen,Keni Chih-Hua Wu,Boon Tat Chia,Xiuqing Xing,Jian Cheng Wong
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: Accepted at OMAE 2026
Abstract:AI is increasingly used to accelerate engineering design by improving decision-making and shortening iteration cycles. Application to marine propeller design, however, remains challenging due to scarce training data and the lack of widely available pretrained models. We address this gap with a physics-based data generation pipeline and a generative-AI framework for direct performance-to-design generation tailored to marine propellers. First, we build a database of over 20,000 four- and five-bladed propeller geometries, each accompanied by simulated open-water performance curves. On top of this dataset, we develop a three-module design framework: (1) A Conditional Generation Model that proposes candidate geometries conditioned on design specifications such as target thrust, power, and diameter. (2) A Performance Prediction Model, implemented as a neural-network surrogate, that predicts thrust, torque, and efficiency in milliseconds, enabling rapid evaluation of generated designs. (3) A design refinement stage that applies evolutionary optimization to enforce practical constraints such as required thrust under power limits and bounds on blade-area ratio and thickness. Experimental results over a range of operating conditions show that the framework can generate hydrodynamically plausible propeller designs that match prescribed performance targets while substantially reducing design-iteration time relative to traditional expert-guided refinement. A latent diffusion-based generator produces more diverse designs under the same conditions than the conditional variational autoencoder, suggesting a stronger capacity for design-space exploration with diffusion models. By coupling physics-based data synthesis with modular AI models, the proposed approach streamlines the propeller design cycle and defers expensive high-fidelity simulations to the final validation stages.
[LG-26] FixV2W: Correcting Invalid CVE-CWE Mappings with Knowledge Graph Embeddings
链接: https://arxiv.org/abs/2604.22176
作者: Sevval Simsek,Varsha Athreya,David Starobinski
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Accurate mapping between Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE) entries is critical for effective vulnerability management and risk assessment. However, public databases, such as the National Vulnerability Database (NVD), suffer from inconsistent and incomplete CVE to CWE mappings, complicating automated analysis and remediation. We introduce FixV2W, a lightweight approach that leverages knowledge graph embeddings and longitudinal trends to improve the mapping accuracy of the NVD. FixV2W systematically analyzes historical remapping patterns and leverages hierarchical relationships within NVD and CWE data to predict more precise CWE mappings for vulnerabilities linked to Prohibited or Discouraged categories. We run an extensive experimental evaluation of FixV2W, based on a test dataset collected between August 2021 and December 2024. Considering the Top 10 ranked predictions, the results show that FixV2W predicts the correct CWE mappings for 69% of exploited vulnerabilities that had invalid CWEs before they were exploited. We also show that FixV2W significantly improves the performance of ML models relying on NVD data. For instance, for a model geared at uncovering unknown CVE-CWE mappings, FixV2W improves the Mean Reciprocal Rank (MRR) from 0.174 to 0.608. These results show that FixV2W is a promising approach to identify and thwart emerging threats.
[LG-27] Optimal sequential decision-making for error propagation mitigation in digital twins
链接: https://arxiv.org/abs/2604.22168
作者: Annice Najafi,Shokoufeh Mirzaei
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Here, we explore the problem of error propagation mitigation in modular digital twins as a sequential decision process. Building on a companion study that used a Hidden Markov Model (HMM) to infer latent error regimes from surrogate-physics residuals, we develop a Markov Decision Process (MDP) in which the inferred regimes serve as states, corrective interventions serve as actions, and a scalar reward captures the cost-benefit tradeoff between system fidelity and maintenance expense. The baseline transition matrix is extracted from the HMM-learned parameters. We then extend the formulation to a Partially Observable MDP (POMDP) that accounts for the imperfect nature of regime classification by maintaining a belief distribution updated via Bayesian filtering, with the HMM confusion matrix serving as the observation model. Both formulations are solved via dynamic programming and validated through Gillespie stochastic simulation. We then benchmark two model-free reinforcement learning algorithms, Q-learning and REINFORCE, to assess whether effective policies can be learned without explicit model knowledge. A systematic comparison of different intervention policies demonstrates that the MDP policy achieves the highest cumulative reward and fraction of time in nominal operation, while the POMDP recovers approximately 95% of MDP performance under realistic observation noise. Sensitivity analyses across observation quality, repair probability, and discount factor confirm the robustness of these conclusions, and the major gaps in the policy hierarchy are statistically significant at $p < 0.001$. The gap between MDP and POMDP performance quantifies the value of information, providing a principled criterion for investing in improved classification accuracy.
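The MDP formulation described above (inferred regimes as states, interventions as actions, a reward trading fidelity against maintenance cost, solved by dynamic programming) can be sketched with a tiny value iteration. All transition probabilities and rewards below are invented for illustration, not taken from the paper:

```python
# Minimal value-iteration sketch of the paper's MDP setup: two latent error
# regimes as states, two interventions as actions, reward = fidelity benefit
# minus maintenance expense. Numbers are illustrative only.

STATES = ["nominal", "degraded"]
ACTIONS = ["wait", "repair"]

# P[s][a] -> distribution over next states.
P = {
    "nominal":  {"wait":   {"nominal": 0.9,  "degraded": 0.1},
                 "repair": {"nominal": 1.0,  "degraded": 0.0}},
    "degraded": {"wait":   {"nominal": 0.05, "degraded": 0.95},
                 "repair": {"nominal": 0.9,  "degraded": 0.1}},
}
# R[s][a]: fidelity reward minus maintenance cost.
R = {
    "nominal":  {"wait": 1.0,  "repair": 0.5},   # repairing a healthy twin wastes cost
    "degraded": {"wait": -1.0, "repair": -0.2},  # repair is costly but restores fidelity
}

def value_iteration(gamma=0.95, iters=500):
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(R[s][a] + gamma * sum(p * V[s2]
                                          for s2, p in P[s][a].items())
                    for a in ACTIONS)
             for s in STATES}
    policy = {s: max(ACTIONS,
                     key=lambda a: R[s][a] + gamma *
                     sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in STATES}
    return V, policy

V, policy = value_iteration()
```

Under these toy numbers the optimal policy waits while nominal and repairs once the degraded regime is inferred; the paper's POMDP extension replaces the known state with a Bayesian-filtered belief over regimes.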
[LG-28] Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
链接: https://arxiv.org/abs/2604.22161
作者: Seoungbin Bae,Dabeen Lee
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, such as strict positivity of the minimum eigenvalue of a context covariance matrix. These assumptions, however, impose strong restrictions on the context process, as they rule out the situation where the context vectors are concentrated in a low-dimensional subspace. In this paper, we propose SupSplitLog, which, to the best of our knowledge, is the first algorithm for logistic bandits that achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret without any context diversity assumption. The key idea is to split the collected samples into two disjoint subsets when constructing estimators; one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. Moreover, SupSplitLog strictly improves on the existing algorithms in terms of the dependence on dimension $d$ in the regret upper bound. Furthermore, SupSplitLog can be adapted simply to deduce a regret bound that grows with a data-dependent complexity measure, avoiding a direct dependence on $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. We also provide experimental results that demonstrate numerically the superiority of our algorithm, validating the theoretical results.
[LG-29] Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
链接: https://arxiv.org/abs/2604.22136
作者: Jun He,Deying Yu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 2 figures
Abstract:Large language model (LLM) agents increasingly issue API calls that mutate real systems, yet many current architectures pass stochastic model outputs directly to execution layers. We argue that this coupling creates a safety risk because model correctness, context awareness, and alignment cannot be assumed at execution time. We introduce Sovereign Agentic Loops (SAL), a control-plane architecture in which models emit structured intents with justifications, and the control plane validates those intents against true system state and policy before execution. SAL combines an obfuscation membrane, which limits model access to identity-sensitive state, with a cryptographically linked Evidence Chain for auditability and replay. We formalize SAL and show that, under the stated assumptions, it provides policy-bounded execution, identity isolation, and deterministic replay. In an OpenKedge prototype for cloud infrastructure, SAL blocks 93% of unsafe intents at the policy layer, rejects the remaining 7% via consistency checks, prevents unsafe executions in our benchmark, and adds 12.4 ms median latency.
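The control-plane idea above (the model emits a structured intent; the control plane validates it against true system state and policy before anything executes) can be sketched as a simple pre-execution check. The policy rules, intent fields, and function names below are invented for illustration and are not the SAL implementation:

```python
# Sketch of "validate intents before execution": a stochastic model proposes
# a structured intent, and a deterministic control-plane check decides whether
# it may run. All field names and rules here are hypothetical.

POLICY = {
    "allowed_actions": {"scale_up", "scale_down", "restart"},
    "max_instances": 10,
}

def validate_intent(intent, system_state, policy=POLICY):
    """Return (approved, reason). Execution happens only if approved."""
    if intent["action"] not in policy["allowed_actions"]:
        return False, "action not in policy allow-list"
    if intent["action"] == "scale_up":
        target = system_state["instances"] + intent.get("delta", 0)
        if target > policy["max_instances"]:
            return False, "would exceed instance cap"
    # Consistency check: the intent must have been derived from current state,
    # guarding against stale or hallucinated context.
    if intent.get("observed_instances") != system_state["instances"]:
        return False, "stale view of system state"
    return True, "ok"

state = {"instances": 8}
ok, why = validate_intent(
    {"action": "scale_up", "delta": 1, "observed_instances": 8}, state)
blocked, why_blocked = validate_intent(
    {"action": "scale_up", "delta": 5, "observed_instances": 8}, state)
```

The point of the decoupling is that model correctness is never assumed at execution time: an unsafe or stale intent is rejected by the control plane regardless of how confident the model was.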
[LG-30] Do Not Imitate Reinforce: Iterative Classification via Belief Refinement
链接: https://arxiv.org/abs/2604.22110
作者: Mahdi Kallel,Johannes Tölle,Ahmed Hendawy,Carlo D’Eramo
类目: Machine Learning (cs.LG)
*备注:
Abstract:Standard supervised classification trains models to imitate the exact labels provided by a perfect oracle. This imitation happens in a single pass, restricting the model to a fixed compute budget even when inputs vary in complexity. Moreover, the rigid training objective forces the model to express absolute certainty on its training data, resulting in overconfident predictions during evaluation. We propose Reinforced Iterative Classification (RIC), which replaces the imitative objective with Reinforcement Learning (RL). RIC deploys a recurrent agent that iteratively updates a predictive distribution over classes, receiving reward for stepwise improvement in prediction quality. The value function provides a natural halting criterion by estimating the remaining scope for improvement. We prove that the iterative formulation recovers the same optimal predictions as cross-entropy while yielding an anytime classifier. On image classification benchmarks, RIC matches the accuracy of supervised baselines with improved calibration and learns to allocate computation adaptively across inputs.
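The iterative formulation above (a recurrent agent repeatedly refines a predictive distribution and halts when the estimated remaining improvement is small) can be sketched with a toy refinement loop. The update rule and halting threshold are invented stand-ins, not the RIC training procedure:

```python
# Toy iterative belief refinement: start from a uniform belief and nudge it
# toward the evidence each step, halting when the belief barely changes
# (a crude stand-in for a value-function halting criterion).
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def iterative_classify(evidence_logits, max_steps=50, tol=1e-4, step=0.3):
    k = len(evidence_logits)
    logits = [0.0] * k          # uniform initial belief
    steps = 0
    for _ in range(max_steps):
        new = [l + step * (e - l) for l, e in zip(logits, evidence_logits)]
        if max(abs(a - b) for a, b in zip(new, logits)) < tol:
            break               # remaining improvement is negligible: halt
        logits = new
        steps += 1
    return softmax(logits), steps

belief, n_steps = iterative_classify([2.0, 0.1, -1.0])
pred = belief.index(max(belief))
```

Because the loop halts early on easy inputs and runs longer on ambiguous ones, compute is allocated adaptively per input, which is the anytime property the abstract highlights.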
[LG-31] Assessing the impact of dimensionality reduction on clustering performance – a systematic study
链接: https://arxiv.org/abs/2604.22099
作者: Ousmane Assani Amate,Mohammadreza Bakhtyari,Émilie Roy,Vladimir Makarenkov
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results with and without dimensionality reduction at the reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of carefully selecting both the dimensionality reduction technique and the reduction level, which should be tailored to the intrinsic data geometry and the clustering algorithm under consideration.
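The study above scores clusterings with the Adjusted Rand Index (ARI). A minimal self-contained implementation of the standard ARI formula (not tied to the paper's code) shows what is being measured: chance-corrected pair agreement between two partitions, invariant to label permutation:

```python
# Adjusted Rand Index from the contingency table of two labelings:
# ARI = (sum_ij C(n_ij,2) - E) / (max_index - E),
# where E = sum_a C(a_i,2) * sum_b C(b_j,2) / C(n,2).
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    pairs = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    n = len(labels_true)
    sum_ij = sum(comb(v, 2) for v in pairs.values())
    sum_a = sum(comb(v, 2) for v in rows.values())
    sum_b = sum(comb(v, 2) for v in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:       # degenerate case, e.g. one cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# A permuted labeling of the same partition still scores 1.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
mismatch = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

ARI near 1 means the clustering recovers the reference partition; values near 0 (or below) indicate chance-level agreement, which is why it suits comparisons across reduction techniques and levels.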
[LG-32] Who Audits the Auditor? Tamper-Proof Fraud Detection with Blockchain-Anchored Explainable ML
链接: https://arxiv.org/abs/2604.22096
作者: Zhaohui Wang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted to IEEE COMPSAC 2026 (Paper ID 9376, SEPT Symposium). This is the de-anonymized camera-ready version. Code is available at: this https URL
Abstract:In enterprise fraud detection, model accuracy alone is insufficient when insiders can tamper with audit logs or bypass approval workflows. Real-world incidents show that fraud often persists not because detection algorithms fail, but because the audit trail itself is controllable by privileged operators. This exposes a fundamental trust gap: who audits the auditor? We present a tamper-evident fraud detection system that anchors both ML predictions and workflow execution to an immutable blockchain ledger. Rather than using blockchain as passive storage, we enforce the entire approval process through smart contracts, ensuring that every transaction, prediction, and explanation is atomically recorded and cannot be retroactively modified. Our detection module achieves competitive accuracy (F1 = 0.895, PR-AUC = 0.974) while providing cryptographically verifiable decision trails that support regulatory auditability requirements (e.g., GDPR Article 22). System evaluation shows sub-25 ms inference latency and economically viable deployment on Layer-2 networks at under $0.01 per transaction (validated against PolygonScan data), supporting enterprise-scale workloads of 10,000+ monthly payments.
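The core tamper-evidence mechanism above (each prediction and decision anchored in a hash-linked ledger so retroactive edits are detectable) can be sketched with a plain hash chain. The record schema below is invented for illustration, not the paper's on-chain format:

```python
# Sketch of a hash-linked audit trail: each record stores the hash of its
# predecessor, so modifying any earlier record breaks verification.
# Field names ("txn", "fraud_score", "decision") are hypothetical.
import hashlib
import json

def append_record(chain, payload):
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"payload": payload, "prev_hash": prev_hash}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def verify_chain(chain):
    prev = "0" * 64
    for rec in chain:
        body = {"payload": rec["payload"], "prev_hash": rec["prev_hash"]}
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != prev or recomputed != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

chain = []
append_record(chain, {"txn": 1, "fraud_score": 0.91, "decision": "block"})
append_record(chain, {"txn": 2, "fraud_score": 0.03, "decision": "approve"})
intact = verify_chain(chain)

# A privileged operator rewriting an earlier decision is now detectable.
chain[0]["payload"]["decision"] = "approve"
tampered_ok = verify_chain(chain)
```

Anchoring the chain head on a public blockchain, as the paper does, extends this property beyond the local store: even an operator who controls the database cannot rewrite history without diverging from the on-chain anchor.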
[LG-33] Generating Synthetic Malware Samples Using Generative AI
链接: https://arxiv.org/abs/2604.22084
作者: Tiffany Bao,Kylie Trousil,Quang Duy Tran,Fabio Di Troia,Younghee Park
类目: Machine Learning (cs.LG)
*备注: 12 pages, 8 figures. This paper has been published in IEEE Access, available at this URL: this https URL
Abstract:Malware attacks have a significant negative impact on organizations of varied scales in the field of cybersecurity. Recently, malware researchers have increasingly turned to machine learning techniques to combat sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is further compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences, leveraging natural language processing to extract the contextual meaning behind malware opcode features and aid the learning of the generative AI (GenAI) models employed in this paper: Generative Adversarial Networks (GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), and a modified Diffusion model. The experiment results show that augmenting training data with Diffusion-based synthetic data significantly improves classification performance for minor classes, by up to 60% on average. This enhancement ultimately leads to an overall malware classification performance of 96%, an 8% improvement. These findings demonstrate the high quality and fidelity of the synthetic data, its robustness, and its potential applications in malware analysis. In particular, synthetic malware data proves effective in improving the classification of minor malware classes and detection rates, even when the amount of known malware data is very small.
[LG-34] Insect-inspired modular architectures as inductive biases for reinforcement learning
链接: https://arxiv.org/abs/2604.22081
作者: Anne E. Staples
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Most reinforcement-learning (RL) controllers used in continuous control are architecturally centralized: observations are compressed into a single latent state from which both value estimates and actions are produced. Biological control systems are often organized differently. Insects, in particular, coordinate navigation, heading stabilization, memory, and context-dependent action selection through distributed circuits rather than a single monolithic controller. Motivated by this contrast, we study an RL policy architecture that decomposes control into interacting modules for sensory encoding, heading representation, sparse associative memory, recurrent command generation, and local motor control, with a learned arbitration mechanism that allocates motor authority across modules. The model is evaluated on a two-dimensional navigation task that requires simultaneous food seeking, obstacle avoidance, and predator escape. In a six-seed predator-navigation experiment trained with Proximal Policy Optimization (PPO) for 75 updates, the modular policy achieves the strongest final mean performance among the tested controllers, with final episodic return $-2798.8 \pm 964.4$ versus $-3778.0 \pm 628.1$ for a centralized gated recurrent unit (GRU) and $-4727.5 \pm 772.5$ for a centralized multilayer perceptron (MLP). The modular policy also attains the lowest final value loss and stable PPO optimization statistics while driving module-assignment entropy to $0.0457 \pm 0.0244$, indicating highly selective control allocation. These results suggest that distributed control can serve as a useful inductive bias for RL problems involving dynamically competing behavioral objectives.
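The learned arbitration mechanism described above (modules propose motor commands; a gating function allocates motor authority across them) can be sketched as softmax-weighted blending. The module names, scores, and proposals below are invented for illustration, not the paper's trained network:

```python
# Sketch of arbitration over behavior modules: each module proposes a 2-D
# motor command, and a softmax over arbitration scores allocates authority.
# Scores and proposals are hypothetical toy values.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def arbitrate(proposals, scores):
    """Blend per-module motor commands by softmax-weighted authority."""
    weights = softmax(scores)
    command = [sum(w * p[i] for w, p in zip(weights, proposals))
               for i in range(len(proposals[0]))]
    return command, weights

# Food-seeking, obstacle-avoidance, and predator-escape modules; the escape
# module's high score models a nearby predator in this toy situation.
proposals = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scores = [0.2, 0.1, 3.0]
command, weights = arbitrate(proposals, scores)
```

A near-one-hot weight vector corresponds to the low module-assignment entropy the abstract reports: when objectives compete, the arbiter hands control almost entirely to one module rather than averaging conflicting commands.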
[LG-35] Learning Coverage- and Power-Optimal Transmitter Placement from Building Maps: A Comparative Study of Direct and Indirect Neural Approaches
链接: https://arxiv.org/abs/2604.22056
作者: Çağkan Yapar
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
*备注:
Abstract:Optimal wireless transmitter placement is a central task in radio-network planning, yet exhaustive search becomes prohibitively expensive at scale. This paper studies the single-transmitter setting under a fixed learned propagation surrogate, where exhaustive per-pixel evaluation remains tractable and provides surrogate-exact ground truth. We introduce a dataset of 167,525 urban scenarios (RadioMapSeer-Deployment) with dual surrogate-exact labels for coverage-optimal and power-optimal transmitter locations. Ground-truth analysis reveals an asymmetric coverage-power trade-off: coverage-optimal placement sacrifices 13.86% of received power, whereas power-optimal placement sacrifices only 5.50% of coverage; the best achievable balanced placement lies at $\bar{d}=2.60$ from the ideal point (100%, 100%). We evaluate two learning formulations: indirect heatmap-based models that predict received-power radio maps, and direct score-map models that predict the objective landscape over feasible transmitter locations. Within the heatmap family, discriminative models deliver one-shot predictions 1350-2400x faster than exhaustive search, while diffusion models additionally support multi-sample inference that improves single-objective performance and, by reusing the same sample pool under a balanced criterion, recovers strong balanced placements without explicit multi-objective training. Dual score-map strategies combining power and coverage score maps match the exhaustive balanced optimum ($\bar{d}=2.60$) and remain close across smaller candidate budgets, at 14-22x speedups after candidate re-evaluation. Both formulations admit very fast one-shot inference; on this benchmark, dual score-map methods are strongest for balanced placement, whereas heatmap formulations remain attractive for their physically meaningful intermediate maps and, in the diffusion setting, for inference-time search.
[LG-36] Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
链接: https://arxiv.org/abs/2604.22032
作者: Cooper Veit
类目: Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注: 28 pages, 1 figure
Abstract:Every ML kernel ships with an implicit contract about what it computes. People rarely write the contract down. When two kernels disagree – when a matmul on AMD produces a different gradient than the same matmul on NVIDIA, when a fused attention kernel silently downcasts an accumulator, when an out-of-bounds access returns zero on one stack and garbage on another – there is no formal artifact to arbitrate the dispute. Recent empirical work has measured the gap across silicon platforms, but none of it specifies the contract being violated. We present a specification language for kernel contracts. A contract has eight parts: identifier, scope, precondition, postcondition, tolerance, reference oracle, measurement protocol, and violation signature. We use it to state twelve contract classes covering precision, ordering, compiler-induced, and exceptional-value failure modes, each grounded in published empirical evidence. We require a three-state calibration: every contract must admit at least one reference-conforming implementation and at least one contract-violating implementation that passes basic functional tests. We apply the framework to three documented incidents – Huawei Ascend silent precision coercion, Sakana AI CUDA Engineer reward hacking, AMD out-of-bounds silent acceptance – and show that each informal diagnosis maps to a specific contract violation with a measurable signature. A kernel contract suite is a normative reference against which conformance can be graded, in the way that ISASecure grades industrial control systems against IEC 62443.
[LG-37] Null-Space Flow Matching for MIMO Channel Estimation in Latency-Constrained Systems
链接: https://arxiv.org/abs/2604.22005
作者: Junjie Zhao,Guangming Liang,Dongzhu Liu,Xiaonan Liu
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 6 pages, 3 figures, 20 references
Abstract:Accurate yet low-latency channel state information (CSI) acquisition is essential for multiple-input multiple-output (MIMO) communication systems. While advanced deep generative models, such as score-based and diffusion models, enable high-fidelity CSI reconstruction from limited pilot observations, they often suffer from high inference latency. To achieve accurate CSI estimation under stringent latency constraints, this paper proposes a null-space flow matching (FM) framework that decomposes pilot-limited MIMO channel estimation into a range-space reconstruction problem and a null-space generation problem. Specifically, the range-space component of the channel is directly recovered from noisy pilot observations, while only the ambiguous null-space component is iteratively refined using an FM-based generative prior. To further improve the robustness of the proposed framework, we introduce a power-law time schedule to better allocate the limited number of refinement steps, along with a noise-aware adaptive correction strategy to suppress channel noise on the refinement trajectory. Experimental results demonstrate that our method achieves a competitive normalized mean square error (NMSE) even under a strict latency budget of around 3 ms, while delivering superior estimation accuracy and faster inference than both model-based and generative baselines.
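The decomposition behind the method above splits the channel estimate into a range-space part pinned down by the pilot observation and a null-space part left for the generative prior. A minimal sketch of that split (with a pilot matrix whose rows are orthonormal, so the pseudo-inverse is simply the transpose; the 1x3 example is invented):

```python
# Range/null-space split h = A^T y + (I - A^T A) z for an orthonormal-row
# pilot matrix A: the first term is fixed by the observation y, and only the
# null-space component z needs to be generated. Toy 1x3 example.

A = [[1.0, 0.0, 0.0]]          # one pilot observes only the first coefficient
y = [0.8]                      # noiseless pilot observation

def range_part(A, y):
    # For orthonormal rows, the pseudo-inverse of A is A^T.
    n = len(A[0])
    return [sum(A[m][i] * y[m] for m in range(len(A))) for i in range(n)]

def project_null(A, z):
    # (I - A^T A) z: strip out the component observable through the pilots.
    n = len(z)
    AtAz = [sum(A[m][i] * sum(A[m][j] * z[j] for j in range(n))
                for m in range(len(A))) for i in range(n)]
    return [z[i] - AtAz[i] for i in range(n)]

# Whatever candidate z the prior proposes, the estimate stays consistent
# with the pilot observation.
z = [5.0, -1.3, 2.2]
h = [r + nz for r, nz in zip(range_part(A, y), project_null(A, z))]
reproduced = [sum(A[m][i] * h[i] for i in range(len(h)))
              for m in range(len(A))]
```

This is why the iterative flow-matching refinement can be confined to the null space: the range-space component never has to be re-generated, which is what buys the latency savings the abstract reports.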
[LG-38] When Quotes Crumble: Detecting Transient Mechanical Liquidity Erosion in Limit Order Books ICLR2026
链接: https://arxiv.org/abs/2604.21993
作者: Haohan Xu,Jason Bohne,Pawel Polak,Yurij Baransky,Ajay Alva,Violetta Fedotova,Gary Kazantsev,David Rosenberg
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures. Accepted at ICLR 2026 Workshop on Advances in Financial AI
Abstract:We study the detection of transient liquidity erosion (“crumbling quotes”) in electronic limit order books, where observable quote deterioration may reflect either mechanical liquidity withdrawal or informational repricing. Using the ABIDES agent-based simulator, we construct a multi-agent environment in which crumbling emerges from stochastic regime switches in a market maker, providing time-resolved ground truth unavailable in real market data. We develop a detection pipeline that identifies mechanically driven quote erosion using order book features, and train a neural model to produce calibrated crumbling probabilities. Experiments demonstrate that the proposed framework reliably identifies crumbling events against agent-level ground truth, with the neural model achieving +36% AUC improvement over rule-based baselines and robust performance across normal, high-volatility, bull, and bear market conditions. Ablation studies on temporal features and varying the dependence structure of the ground-truth mechanism confirm that the framework generalizes across both independent and autocorrelated liquidity withdrawal dynamics.
[LG-39] Conditional anomaly detection using soft harmonic functions: An application to clinical alerting ICML2011
链接: https://arxiv.org/abs/2604.21956
作者: Michal Valko,Hamed Valizadegan,Branislav Kveton,Gregory F. Cooper,Milos Hauskrecht
类目: Machine Learning (cs.LG)
*备注: ICML 2011 Workshop on Machine Learning for Global Challenges. arXiv admin note: substantial text overlap with arXiv:2604.21462
Abstract:Timely detection of concerning events is an important problem in clinical practice. In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response, such as the omission of an important lab test. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method in detecting unusual labels on a real-world electronic health record dataset and compare it to several baseline approaches.
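The harmonic-solution idea underlying the method above can be sketched on a tiny similarity graph: an unlabeled instance's label confidence is the weighted average of its neighbors' values, iterated to a fixed point. The 3-node graph and edge weights below are illustrative, not the paper's regularized soft variant:

```python
# Harmonic label propagation on a toy graph:
# node 0 (label 0.0) -- node 1 (unlabeled) -- node 2 (label 1.0),
# with asymmetric similarity weights. At the fixed point, the unlabeled
# node's value is the weighted neighbor average (its label confidence).

W = {1: [(0, 3.0), (2, 1.0)]}   # edges of the unlabeled node, with weights
labels = {0: 0.0, 2: 1.0}

def harmonic_solution(W, labels, unlabeled, iters=100):
    f = dict(labels)
    for u in unlabeled:
        f[u] = 0.5                        # arbitrary initialization
    for _ in range(iters):
        for u in unlabeled:
            num = sum(w * f[v] for v, w in W[u])
            den = sum(w for _, w in W[u])
            f[u] = num / den              # harmonic: average of neighbors
    return f

f = harmonic_solution(W, labels, unlabeled=[1])
```

If an instance's recorded label disagrees strongly with this neighborhood-derived confidence, it is a candidate conditional anomaly, e.g. an unusual clinical response given similar patients.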
[LG-40] Performance Anomaly Detection in Athletics: A Benchmarking System with Visual Analytics
链接: https://arxiv.org/abs/2604.21953
作者: Blessed Madukoma,Prasenjit Mitra
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 8 pages, 5 figures, 5 tables
Abstract:Anti-doping programs rely on biological testing to detect performance-enhancing drugs, but such testing costs over $800 per sample and is limited by short detection windows for many prohibited substances. These constraints leave a large share of athletes without regular testing, motivating complementary screening approaches that analyze routine competition results to identify suspicious performance patterns. We present a system that processes 1.6 million athletics performances from over 19,000 competitions (2010-2025) using eight detection methods ranging from statistical rules to machine learning and trajectory analysis. We validate all methods against publicly confirmed anti-doping violations to measure their effectiveness in identifying sanctioned athletes. Trajectory-based methods, which compare performances to expected career progression, achieve the best balance between detecting violations and limiting false alarms, though all methods face challenges from incomplete data and rare confirmed violations. The system provides an interactive interface for expert-driven investigation, emphasizing transparency and human judgment to support, rather than replace, established anti-doping processes.
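The trajectory-based approach described above (compare a new performance against an athlete's expected career progression) can be sketched as a residual z-score against a fitted trend. The marks, the linear progression model, and the threshold below are invented for illustration:

```python
# Toy trajectory anomaly check: fit a linear career progression to past
# marks and flag a new mark whose deviation from the predicted value is far
# outside the historical residual spread. Data and threshold are made up.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def is_anomalous(history, new_mark, z_threshold=3.0):
    xs = list(range(len(history)))
    a, b = fit_line(xs, history)
    resids = [y - (a * x + b) for x, y in zip(xs, history)]
    spread = (sum(r * r for r in resids) / len(resids)) ** 0.5 or 1e-9
    predicted = a * len(history) + b
    z = (new_mark - predicted) / spread
    return abs(z) > z_threshold, z

# 100 m sprint times (s) improving gently season over season...
history = [10.52, 10.48, 10.45, 10.44, 10.40, 10.38]
# ...followed by one implausibly large single-season jump.
flagged, z = is_anomalous(history, 9.90)
normal, _ = is_anomalous(history, 10.35)
```

A real screening system would use richer progression models and event-specific norms, but the principle is the same: it is the departure from the athlete's own trajectory, not the absolute mark, that raises a flag for human review.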
[LG-41] Relaxation-Informed Training of Neural Network Surrogate Models
链接: https://arxiv.org/abs/2604.22746
作者: Calvin Tsay
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 35 pages, 5 figures
Abstract:ReLU neural networks trained as surrogate models can be embedded exactly in mixed-integer linear programs (MILPs), enabling global optimization over the learned function. The tractability of the resulting MILP depends on structural properties of the network, i.e., the number of binary variables in associated formulations and the tightness of the continuous LP relaxation. These properties are determined during training, yet standard training objectives (prediction loss with classical weight regularization) offer no mechanism to directly control them. This work studies training regularizers that directly target downstream MILP tractability. Specifically, we propose simple bound-based regularizers that penalize the big-M constants of MILP formulations and/or the number of unstable neurons. Moreover, we introduce an LP relaxation gap regularizer that explicitly penalizes the per-sample gap of the continuous relaxation at training points. We derive its associated gradient and provide an implementation from LP dual variables without custom automatic differentiation tools. We show that combining the above regularizers can approximate the full total derivative of the LP gap with respect to the network parameters, capturing both direct and indirect sensitivities. Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude relative to an unregularized baseline, while maintaining competitive surrogate model accuracy.
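The bound-based regularizers penalize the big-M constants of the MILP formulation, and those constants come from pre-activation bounds on each neuron. A minimal numpy sketch of that ingredient (interval bound propagation; the paper's training setup and exact penalty form are not reproduced here):

```python
import numpy as np

def preactivation_bounds(weights, biases, lo, hi):
    """Propagate input box bounds through a ReLU network with interval
    arithmetic; the resulting (L, U) pairs are the big-M constants used
    when the network is embedded in a MILP."""
    bounds = []
    for W, b in zip(weights, biases):
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        L = W_pos @ lo + W_neg @ hi + b
        U = W_pos @ hi + W_neg @ lo + b
        bounds.append((L, U))
        lo, hi = np.maximum(L, 0.0), np.maximum(U, 0.0)  # apply ReLU
    return bounds

def big_m_penalty(bounds):
    """One bound-based regularizer variant: penalize big-M magnitudes."""
    return sum(np.abs(L).sum() + np.abs(U).sum() for L, U in bounds)

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(1, 8))]
bs = [np.zeros(8), np.zeros(1)]
bounds = preactivation_bounds(Ws, bs, -np.ones(4), np.ones(4))
penalty = big_m_penalty(bounds)  # would be added to the training loss
```

Shrinking these bounds during training tightens the downstream MILP's LP relaxation, which is the mechanism the paper exploits.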
[LG-42] Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis ALT
链接: https://arxiv.org/abs/2604.22695
作者: Victoria Ribeiro Rodrigues,Paul W. Davenport,Nicholas J. Napoli
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: Submitted to IEEE Journal of Biomedical and Health Informatics (under review). 18 pages, 7 figures, 5 tables
Abstract:Respiratory airflow signals provide critical insight into breathing mechanics, yet conventional analysis methods remain limited in their ability to characterize the internal structure of individual breaths. Traditional approaches treat airflow as a quasi-periodic signal and rely on global descriptors such as tidal volume or peak flow, obscuring sub-breath events that reflect neuromuscular coordination and compensatory breathing strategies. This study introduces a parametric framework for decomposing inspiratory airflow into a small number of time-localized components with explicit amplitude, onset time, and duration parameters. Unlike spectral or data-adaptive methods, the proposed approach employs physiologically grounded basis functions (Half-Sine, Gaussian, and Beta) to represent intrabreath waveform morphology through constrained nonlinear optimization. Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error 0.001 for four-component models) and robust parameter precision under moderate noise. Component-derived features describing sub-breath timing and coordination improved classification of cognitive fatigue states arising from cognitive-respiratory competition by up to 30.7% in Matthews correlation coefficient compared with classical respiratory metrics. These results establish that modeling airflow as a sum of parameterized, time-localized primitives provides an interpretable and precise foundation for quantifying intrabreath organization, compensatory breathing dynamics, and respiratory motor control adaptation under cognitive-respiratory dual-task demands.
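The core fitting step can be sketched as nonlinear least squares over a sum of time-localized primitives. The example below uses only the Gaussian basis family (the paper also uses Half-Sine and Beta) on a synthetic breath, with scipy's `curve_fit` standing in for the paper's constrained optimizer:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(t, a, mu, sigma):
    """One time-localized component: amplitude a, center mu, width sigma."""
    return a * np.exp(-0.5 * ((t - mu) / sigma) ** 2)

def two_component_flow(t, a1, m1, s1, a2, m2, s2):
    """Inspiratory airflow as a sum of two parameterized primitives."""
    return gaussian(t, a1, m1, s1) + gaussian(t, a2, m2, s2)

# Synthetic breath: a main inspiratory surge plus a smaller late component.
t = np.linspace(0.0, 1.0, 200)
rng = np.random.default_rng(0)
flow = two_component_flow(t, 1.0, 0.3, 0.08, 0.6, 0.6, 0.12)
flow += 0.01 * rng.normal(size=t.size)

# Nonlinear least squares recovers the component parameters.
p0 = [0.8, 0.25, 0.10, 0.5, 0.65, 0.10]  # rough initial guesses
params, _ = curve_fit(two_component_flow, t, flow, p0=p0)
mse = np.mean((two_component_flow(t, *params) - flow) ** 2)
```

The recovered amplitudes, onsets, and durations are exactly the kind of interpretable sub-breath features the abstract describes.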
[LG-43] CLVAE: A Variational Autoencoder for Long-Term Customer Revenue Forecasting
链接: https://arxiv.org/abs/2604.22636
作者: Jeffrey Näf,Riana Valera Mbelson,Markus Meierer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Predicting customers’ long-term revenue from sparse and irregular transaction data is central to marketing resource allocation in non-contractual settings, yet existing approaches face a trade-off. Traditional probabilistic customer base models deliver robust long-horizon forecasts by imposing strong structural assumptions, while flexible machine-learning models often require substantial training data and careful tuning. We propose a variational-autoencoder-based model that preserves the process-based likelihood of established attrition-transaction-spend models conditional on customer heterogeneity, but replaces the restrictive parametric mixing distribution with a flexible latent representation learned by encoder-decoder networks. The resulting approach (i) provides a single model for customer attrition, transactions and spending, (ii) remains reliable when contextual covariates are unavailable, and (iii) flexibly incorporates rich covariates and nonlinear effects when they are available. This design balances structural stability with the flexibility needed to capture complex purchase dynamics. Across multiple real-world datasets and prediction horizons, the proposed model improves upon the latest benchmarks. Businesses benefit directly, as a better assessment of customers’ future revenues improves the efficiency of campaign targeting. For research, this work provides guidance on how to embed domain-specific models into the variational autoencoder framework, enabling flexible representation learning while retaining an econometrically meaningful process structure.
[LG-44] Mixed Membership sub-Gaussian Models
链接: https://arxiv.org/abs/2604.22633
作者: Huan Qing
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages, 6 figures, 2 tables
Abstract:The Gaussian mixture model is widely used in unsupervised learning, owing to its simplicity and interpretability. However, a fundamental limitation of the classical Gaussian mixture model is that it forces each observation to belong to exactly one component. In many practical applications, such as genetics, social network analysis, and text mining, an observation may naturally belong to multiple components or exhibit partial membership in several latent components. To overcome this limitation, we propose the mixed membership sub-Gaussian model, which extends the classical Gaussian mixture framework by allowing each observation to belong to multiple components. This model inherits the interpretability of the classical Gaussian mixture model while offering greater flexibility for capturing complex overlapping structures. We develop an efficient spectral algorithm to estimate the mixed membership of each individual observation, and under mild separation conditions on the component centres, we prove that the estimation error of the per-individual membership vector can be made arbitrarily small with high probability. To our knowledge, this is the first work to provide a computationally efficient estimator with such a vanishing-error guarantee for a mixed-membership extension of the Gaussian mixture model. Extensive experimental studies demonstrate that our method outperforms existing approaches that ignore mixed memberships.
[LG-45] The Exact Replica Threshold for Nonlinear Moments of Quantum States
链接: https://arxiv.org/abs/2604.22627
作者: Shuai Zeng
类目: Quantum Physics (quant-ph); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Joint measurements on multiple copies of a quantum state provide access to nonlinear observables such as \operatorname{tr}(\rho^t), but whether replica number marks a sharp information-theoretic resource boundary has remained unclear. For every fixed order t \ge 3, existing protocols show that \lceil t/2 \rceil replicas already suffice for polynomial-sample estimation of \operatorname{tr}(\rho^t), yet it has remained open whether one fewer replica must necessarily incur a sample-complexity barrier growing with the dimension. We prove that this is indeed the case in the sample/copy-access model with replica-limited joint measurements: any protocol restricted to \lceil t/2 \rceil - 1 replicas requires dimension-growing sample complexity, while \lceil t/2 \rceil replicas suffice by prior work. Thus the exact replica threshold for fixed-order pure moments is \lceil t/2 \rceil. Equivalently, for fixed-order pure moments, one additional coherent replica is not merely useful but marks the exact threshold between polynomial-sample estimation and a dimension-growing regime in the replica-limited model. We further show that the same threshold law extends to a broad family of observable-weighted moments \operatorname{tr}(O\rho^t), including Pauli observables and other observables with bounded operator norm and macroscopic trace norm. Coherent replica number therefore acts as a genuinely discrete resource for nonlinear quantum-state estimation.
[LG-46] Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting
链接: https://arxiv.org/abs/2604.22580
作者: Younes Essafouri,Laure Raynaud,Luciano Drozda,Laurent Risser
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:As the demand to integrate Artificial Intelligence into high-stakes environments continues to grow, explaining the reasoning behind neural-network predictions has shifted from a theoretical curiosity to a strict operational requirement. Our work is motivated by the explanations of autoregressive neural predictions on dynamic physical fields, as in weather forecasting. Gradient-based feature attribution methods are widely used to explain the predictions on such data, in particular due to their scalability to high-dimensional inputs. It is also interesting to remark that gradient-based techniques such as SmoothGrad are now standard on images to robustify the explanations using pointwise averages of the attribution maps obtained from several noised inputs. Our goal is to efficiently adapt this aggregation strategy to dynamic physical fields. To do so, our first contribution is to identify a fundamental failure mode when averaging perturbed attribution maps on dynamic physical fields: stochastic input perturbations do not induce stationary amplitude noise in attribution maps, but instead cause a geometric displacement of the attributions. Consequently, pointwise averaging blurs these spatially misaligned features. To tackle this issue, we introduce WassersteinGrad, which extracts a geometric consensus of perturbed attribution maps by computing their entropic Wasserstein barycenter. The results, obtained on regional weather data and a meteorologist-validated neural model, demonstrate promising explainability properties of WassersteinGrad over gradient-based baselines across both single-step and autoregressive forecasting settings.
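The key aggregation step, replacing the pointwise SmoothGrad average with an entropic Wasserstein barycenter, can be sketched on a 1D toy. The implementation below uses the standard iterative Bregman projection scheme (Benamou et al., 2015); the regularization strength and the toy "attribution bumps" are illustrative choices, not the paper's setup:

```python
import numpy as np

def entropic_barycenter(A, C, reg=0.02, n_iter=300, weights=None):
    """Entropic Wasserstein barycenter of the histograms in the columns
    of A via iterative Bregman projections (Benamou et al., 2015)."""
    n, m = A.shape
    w = np.full(m, 1.0 / m) if weights is None else weights
    K = np.exp(-C / reg)               # Gibbs kernel (C symmetric)
    u = np.ones((n, m)) / n
    for _ in range(n_iter):
        ukv = u * (K @ (A / (K @ u)))  # current row marginals, one per input
        q = np.exp(np.log(ukv) @ w)    # weighted geometric mean
        u *= q[:, None] / ukv
    return q

# Two spatially shifted attribution "bumps": pointwise averaging keeps two
# half-height peaks, while the barycenter recovers one peak in between.
x = np.arange(50, dtype=float)
C = (x[:, None] - x[None, :]) ** 2 / 50.0 ** 2
def bump(mu):
    h = np.exp(-0.5 * ((x - mu) / 2.0) ** 2)
    return h / h.sum()
A = np.stack([bump(15.0), bump(35.0)], axis=1)
bary = entropic_barycenter(A, C)
naive = A.mean(axis=1)  # the SmoothGrad-style pointwise average
```

The contrast between `bary` and `naive` is exactly the failure mode the abstract identifies: geometric displacement of attributions makes pointwise averaging blur misaligned features.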
[LG-47] Multi-output Extreme Spatial Model for Complex Aircraft Production Systems
链接: https://arxiv.org/abs/2604.22548
作者: Cheolhei Lee,Xing Wang,Xiaowei Yue,Jianguo Wu
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Problem definition: Data-driven models in machine learning have enabled efficient management of production systems. However, a majority of machine learning models are devoted to modeling the mean response or average pattern, which is inappropriate for studying abnormal extreme events that are often of primary interest in aircraft manufacturing. Since extreme events from heavy-tailed distributions give rise to prohibitive expenditures in system management, sophisticated extreme models are urgently needed to analyze complex extreme risks. Engineering applications of extreme models usually focus on individual extreme events, which is insufficient for complex systems with correlations. Methodology/results: We introduce an extreme spatial model for multi-output response control systems that efficiently captures the dynamics using a bilinear function on two spatial domains for control variables and measurement locations. Marginal parameter modeling and extremal dependence have been investigated. In addition, an efficient graph-assisted composite likelihood estimation and corresponding computational algorithms are developed to cope with high-dimensional outputs. The application to composite aircraft production shows that the proposed model enables comprehensive analyses with superior predictive performance on extreme events compared to canonical methods. Managerial implications: Our method shows how to use an extreme spatial model for predicting extreme events and managing extreme risks in complex production systems such as aircraft. This can help achieve better quality management and operation safety in aircraft production systems and beyond.
[LG-48] FedSPDnet: Geometry-Aware Federated Deep Learning with SPDnet
链接: https://arxiv.org/abs/2604.22494
作者: Thibault Pautrel,Florent Bouchard,Ammar Mian,Guillaume Ginolhac
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce two federated learning frameworks for the classical SPDnet model operating on symmetric positive definite (SPD) matrices with Stiefel-constrained parameters. Unlike standard Euclidean averaging, which violates orthogonality, our approach preserves geometric structure through two efficient aggregation strategies: ProjAvg, projecting arithmetic means onto the Stiefel manifold, and RLAvg, approximating tangent-space averaging via retractions and liftings. Both methods are computationally efficient, independent of the optimizer, and enable scalable federated learning for signal processing applications whose features are SPD matrices. Simulations on EEG motor imagery benchmarks show that FedSPDnet outperforms federated EEGnet in F1 score and robustness to federation and partial participation, while using fewer parameters per communication round.
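The ProjAvg strategy can be sketched in a few lines of numpy: average the clients' Stiefel-constrained parameters in Euclidean space, then project the mean back onto the manifold via the polar factor of its SVD (the nearest matrix with orthonormal columns). The client matrices below are synthetic stand-ins, not EEG-trained SPDnet weights:

```python
import numpy as np

def proj_avg(client_params):
    """ProjAvg (sketch): Euclidean-average Stiefel parameters across
    clients, then project the mean onto the Stiefel manifold via its
    polar factor."""
    mean = np.mean(client_params, axis=0)  # generally not orthonormal
    U, _, Vt = np.linalg.svd(mean, full_matrices=False)
    return U @ Vt                          # nearest W with W^T W = I

rng = np.random.default_rng(0)
clients = np.stack([np.linalg.qr(rng.normal(size=(6, 3)))[0] for _ in range(4)])
naive = clients.mean(axis=0)  # plain FedAvg: violates orthogonality
W = proj_avg(clients)         # restored to the manifold
```

RLAvg would instead lift the client parameters to a common tangent space, average there, and retract; ProjAvg is the cheaper of the two.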
[LG-49] Conformalized Super Learner
链接: https://arxiv.org/abs/2604.22391
作者: Zhanli Wu,Fabrizio Leisen,Miguel-Angel Luque-Fernandez,F. Javier Rubio
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: R codes and data can be found at: this https URL
Abstract:The Super Learner (SL) is a widely used ensemble method that combines predictions from a library of learners based on their predictive performance. Interval predictions are of considerable practical interest because they allow uncertainty in predictions produced by an individual learner or an ensemble to be quantified. Several methods have been proposed for constructing interval predictions based on the SL, however, these approaches are typically justified using asymptotic arguments or rely on computationally intensive procedures such as the bootstrap. Conformal prediction (CP) is a machine learning framework for constructing prediction intervals with finite-sample and asymptotic coverage guarantees under mild conditions. We propose coupling CP with the SL through a natural construction that mirrors the original SL framework, using individual learner weights and combining learner-specific conformity scores via a weighted majority vote. We characterize the properties of the resulting SL-based prediction intervals for continuous outcomes. We cover settings under exchangeability, potential violations of exchangeability, and data-generating mechanisms exhibiting heteroscedasticity, sparsity, and other forms of distributional heterogeneity. A comprehensive simulation study shows that the conformalized SL achieves valid finite-sample coverage with competitive performance relative to the true data-generating mechanism. A central contribution of this work is an application to predicting creatinine levels using socio-demographic, biometric, and laboratory measurements. This example demonstrates the benefits of an ensemble with carefully selected learners designed to capture key aspects of complex regression functions, including non-linear effects, interactions, sparsity, heteroscedasticity, and robustness to outliers.
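The weighted-majority-vote idea can be sketched with split-conformal intervals per learner: a candidate value is kept when learners holding at least half of the ensemble weight would cover it. This is a simplified illustration under assumed absolute-residual scores and toy learners, not the paper's exact construction:

```python
import numpy as np

def conformal_sl_set(y_cal, preds_cal, preds_test, w, y_grid, alpha=0.1):
    """Weighted-majority-vote conformal set (a sketch): keep y if learners
    holding at least half of the SL weight cover it with their own
    split-conformal interval built from absolute residuals."""
    n = len(y_cal)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(np.abs(y_cal[:, None] - preds_cal), level, axis=0)
    inside = np.abs(y_grid[:, None] - preds_test[None, :]) <= q
    return y_grid[inside @ w >= 0.5]

rng = np.random.default_rng(1)
x_cal = rng.uniform(-1, 1, 200)
y_cal = np.sin(3 * x_cal) + 0.1 * rng.normal(size=200)
# Two toy "learners": an accurate one (weight 0.8) and a biased one (0.2).
preds_cal = np.stack([np.sin(3 * x_cal), np.sin(3 * x_cal) + 0.5], axis=1)
w = np.array([0.8, 0.2])
x0 = 0.2
preds_test = np.array([np.sin(3 * x0), np.sin(3 * x0) + 0.5])
grid = np.linspace(-2, 2, 801)
cset = conformal_sl_set(y_cal, preds_cal, preds_test, w, grid)
```

Because the accurate learner carries most of the weight, its interval dominates the vote, so the returned set remains tight around the true response.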
[LG-50] Pack only the essentials: Adaptive dictionary learning for kernel ridge regression NEURIPS2016
链接: https://arxiv.org/abs/2604.22386
作者: Daniele Calandriello,Alessandro Lazaric,Michal Valko
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: In NeurIPS 2016 Workshop on Adaptive and Scalable Nonparametric Methods in Machine Learning (ASNMML)
Abstract:One of the major limits of kernel ridge regression (KRR) is that storing and manipulating the kernel matrix K_n for n samples requires O(n^2) space, which rapidly becomes unfeasible for large n. Nystrom approximations reduce the space complexity to O(nm) by sampling m columns from K_n. Uniform sampling preserves KRR accuracy (up to epsilon) only when m is proportional to the maximum degree of freedom of K_n, which may require O(n) columns for datasets with high coherence. Sampling columns according to their ridge leverage scores (RLS) gives accurate Nystrom approximations with m proportional to the effective dimension, but computing exact RLS also requires O(n^2) space. (Calandriello et al. 2016) propose INK-Estimate, an algorithm that processes the dataset incrementally and updates RLS, effective dimension, and Nystrom approximations on-the-fly. Its space complexity scales with the effective dimension but introduces a dependency on the largest eigenvalue of K_n, which in the worst case is O(n). In this paper we introduce SQUEAK, a new algorithm that builds on INK-Estimate but uses unnormalized RLS. As a consequence, the algorithm is simpler, does not need to estimate the effective dimension for normalization, and achieves a space complexity that is only a constant factor worse than exact RLS sampling.
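For reference, the exact ridge leverage scores that SQUEAK approximates incrementally can be computed directly (at the O(n^2)-space cost the paper is trying to avoid). The Gaussian kernel and regularization level below are illustrative choices:

```python
import numpy as np

def ridge_leverage_scores(K, lam):
    """Exact ridge leverage scores tau_i = (K (K + lam*n*I)^{-1})_{ii}.
    This needs the full kernel matrix -- O(n^2) space, O(n^3) time --
    which is exactly what incremental estimators like SQUEAK avoid."""
    n = K.shape[0]
    return np.diag(K @ np.linalg.solve(K + lam * n * np.eye(n), np.eye(n)))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists)                   # Gaussian kernel matrix
tau = ridge_leverage_scores(K, lam=1e-2)
d_eff = tau.sum()                       # effective dimension
```

Sampling Nystrom columns with probabilities proportional to `tau` needs only about `d_eff` columns for an epsilon-accurate approximation, which is the motivation for tracking these scores online.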
[LG-51] Pliable rejection sampling ICML2016
链接: https://arxiv.org/abs/2604.22385
作者: Akram Erraqabi,Michal Valko,Alexandra Carpentier,Odalric-Ambrym Maillard
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: In ICML 2016
Abstract:Rejection sampling is a technique for sampling from difficult distributions. However, its use is limited due to a high rejection rate. Common adaptive rejection sampling methods either work only for very specific distributions or without performance guarantees. In this paper, we present pliable rejection sampling (PRS), a new approach to rejection sampling, where we learn the sampling proposal using a kernel estimator. Since our method builds on rejection sampling, the samples obtained are with high probability i.i.d. and distributed according to f. Moreover, PRS comes with a guarantee on the number of accepted samples.
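The two-phase idea can be sketched as follows: run a pilot rejection sampler with a crude envelope, fit a kernel estimator to the accepted samples, and use the inflated estimator as the new proposal. The bimodal target, bandwidth, and grid-based envelope check below are illustrative assumptions; the paper derives an inflation with actual guarantees:

```python
import numpy as np

def rejection_sample(f, sample_q, pdf_q, M, n, rng):
    """Accept x ~ q with probability f(x) / (M * q(x)); needs f <= M*q."""
    out, tries = [], 0
    while len(out) < n:
        x = sample_q(rng)
        tries += 1
        if rng.uniform() * M * pdf_q(x) <= f(x):
            out.append(x)
    return np.array(out), n / tries  # samples, acceptance rate

# Unnormalized bimodal target on [0, 1].
f = lambda x: np.exp(-200 * (x - 0.25) ** 2) + np.exp(-200 * (x - 0.75) ** 2)
rng = np.random.default_rng(0)

# Phase 1: flat proposal -- a valid envelope but low acceptance.
pilot, acc_flat = rejection_sample(f, lambda r: r.uniform(), lambda x: 1.0,
                                   M=1.01, n=400, rng=rng)

# Phase 2 (the PRS idea, sketched): a Gaussian kernel proposal learned
# from the pilot samples, with the envelope constant checked on a grid.
h = 0.05
pdf_q = lambda x: np.mean(np.exp(-0.5 * ((x - pilot) / h) ** 2)) / (h * np.sqrt(2 * np.pi))
sample_q = lambda r: r.choice(pilot) + h * r.normal()
grid = np.linspace(0, 1, 2001)
M2 = 1.1 * max(f(g) / pdf_q(g) for g in grid)
final, acc_kde = rejection_sample(f, sample_q, pdf_q, M2, n=400, rng=rng)
```

Because the learned proposal tracks the target's shape, the envelope constant shrinks and the acceptance rate rises, which is the effect PRS formalizes.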
[LG-52] On Benchmark Hacking in ML Contests: Modeling Insights and Design
链接: https://arxiv.org/abs/2604.22230
作者: Xiaoyun Qiu,Yang Yu,Haifeng Xu
类目: General Economics (econ.GN); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:Benchmark hacking refers to tuning a machine learning model to score highly on certain evaluation criteria without improving true generalization or faithfully solving the intended problem. We study this phenomenon in a generic machine learning contest, where each contestant chooses two types of effort: creative effort that improves model capability as desired by the contest host, and mechanistic effort that only improves the model’s fitness to the particular task in contest without contributing to true generalization. We establish the existence of a symmetric monotone pure strategy equilibrium in this competition game. It also provides a natural definition of benchmark hacking in this strategic context by comparing a player’s equilibrium effort allocation to that of a single-agent baseline scenario. Under our definition, contestants with types below certain threshold (low types) always engage in benchmark hacking, whereas those above the threshold do not. Furthermore, we show that more skewed reward structures (favoring top-ranked contestants) can elicit more desirable contest outcomes. We also provide empirical evidence to support our theoretical predictions.
[LG-53] Near-Optimal Regret for the Safe Learning-based Control of the Constrained Linear Quadratic Regulator
链接: https://arxiv.org/abs/2604.22158
作者: Spencer Hutchinson,Nanfei Jiang,Mahnoosh Alizadeh
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We study the problem of adaptive control of the stochastic linear quadratic regulator (LQR) with constraints that must be satisfied at every time step. Prior work on the multidimensional problem has shown \tilde{O}(T^{2/3}) regret and satisfaction of robust constraints, leaving open the question of whether \tilde{O}(\sqrt{T}) regret can be attained in the constrained LQR setting. We contribute to this problem by showing \tilde{O}(\sqrt{T}) regret and satisfaction of chance constraints. This type of constraint allows us to handle unbounded noise and also enables analytical techniques not directly applicable to robust constraints. Our proposed algorithm for this problem uses an SDP to select an optimistic policy, and then “scales back” this policy until it is verifiably safe. Our theoretical analysis establishes regret and constraint guarantees via a key lemma that bounds the system covariance in terms of the chosen policy. This covariance-based analysis is in contrast with the cost-to-go based analysis that is typically used in adaptive LQR.
[LG-54] Concave Statistical Utility Maximization Bandits via Influence-Function Gradients
链接: https://arxiv.org/abs/2604.22140
作者: Matías Carrasco,Alejandro Cholaquidis
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP)
*备注:
Abstract:We study stochastic multi-armed bandits in which the objective is a statistical functional of the long-run reward distribution, rather than expected reward alone. Under mild continuity assumptions, we show that the infinite-horizon problem reduces to optimizing over stationary mixed policies: each weight vector w on the simplex induces a mixture law P^w, and performance is measured by the concave utility U(w) = \mathfrak{U}(P^w). For differentiable statistical utilities, we use influence-function calculus to derive stochastic gradient estimators from bandit feedback. This leads to an entropic mirror-ascent algorithm on a truncated simplex, implemented through multiplicative-weights updates and plug-in estimates of the influence function. We establish regret bounds that separate the mirror-ascent optimization error from the bias caused by estimating the influence function. The framework is developed for general concave distributional utilities and illustrated through variance and Wasserstein objectives, with numerical experiments comparing exact and plug-in influence-function implementations.
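The entropic mirror-ascent step on a truncated simplex can be sketched with a toy concave utility whose exact gradient stands in for the paper's plug-in influence-function estimate. The step size, truncation level, and quadratic utility below are illustrative choices:

```python
import numpy as np

def mirror_ascent_step(w, grad, eta, delta):
    """One entropic mirror-ascent (multiplicative-weights) step on the
    truncated simplex {w : w_i >= delta, sum_i w_i = 1}."""
    w = w * np.exp(eta * grad)
    w /= w.sum()
    w = np.maximum(w, delta)  # crude truncation, then renormalize
    return w / w.sum()

# Toy concave utility U(w) = -||w - w_star||^2 with an exact gradient,
# standing in for the influence-function gradient estimator.
w_star = np.array([0.6, 0.3, 0.1])
w = np.full(3, 1.0 / 3.0)
for _ in range(500):
    grad = -2.0 * (w - w_star)
    w = mirror_ascent_step(w, grad, eta=0.5, delta=1e-3)
```

With bandit feedback the gradient is only estimated, and the paper's regret bound separates this estimation bias from the mirror-ascent optimization error shown converging here.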
[LG-55] Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues
链接: https://arxiv.org/abs/2604.22043
作者: Vivek Upadhyay,Amaresh Chakrabarti
类目: Physics and Society (physics.soc-ph); Machine Learning (cs.LG)
*备注: 42 pages, 4 figures, 1 table
Abstract:Background: Classroom discourse analysis has been transformed by the growing use of audio-video multimodal data, which demands analytical methods that balance interpretive depth with computational scalability. Methods: This study introduces the Audio Video Verbal Analysis (AVVA) framework, adapted from the Verbal Analysis method to integrate qualitative interpretation with quantitative modelling. Unlike fully multimodal learning analytics approaches, AVVA focuses on verbatim transcripts with essential interactional modalities. Findings: The framework embeds triangulation as a core design strategy across ten methodological steps, strengthening validity and analytical rigour. A comprehensive validation scheme addresses fundamental challenges in temporal observational research: Phi Ceiling for low-frequency variables (via Base Rate Filtering), estimation uncertainty (via bootstrap confidence intervals), and the Modifiable Temporal Unit Problem, where measured associations depend on observational window size. A four-criterion stability assessment (sign consistency, confidence interval overlap, zero exclusion, magnitude stability) classifies variable pairs into interpretable patterns (grain-invariant, scale-specific, or multi-scale structures) across temporal grain sizes. Its application to 23 hours of classroom recordings illustrates its practical viability and its potential to yield meaningful insights. Contribution: The framework thus provides a scalable pathway for transforming rich classroom discourse into analysable datasets.
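The Modifiable Temporal Unit Problem mentioned in the abstract, where the measured association between two event streams depends on the aggregation window, can be illustrated on toy data. The synthetic teacher-question and student-response streams below are assumptions for demonstration only:

```python
import numpy as np

def windowed_phi(a, b, win):
    """Pearson (phi) association between two binary event streams after
    aggregating into windows of `win` time steps; the estimate depends
    on the window size -- the Modifiable Temporal Unit Problem."""
    n = (len(a) // win) * win
    A = a[:n].reshape(-1, win).max(axis=1)  # any event in the window?
    B = b[:n].reshape(-1, win).max(axis=1)
    return np.corrcoef(A, B)[0, 1]

# Toy streams: teacher questions (a) and student responses (b) that
# follow one time step later, plus unrelated background events.
rng = np.random.default_rng(1)
a = (rng.uniform(size=2000) < 0.10).astype(int)
b = np.roll(a, 1) | (rng.uniform(size=2000) < 0.05).astype(int)

phi_fine = windowed_phi(a, b, 1)    # lag-1 link invisible at this grain
phi_coarse = windowed_phi(a, b, 4)  # association emerges at coarser grain
```

This is exactly why the framework's stability assessment compares associations across several temporal grain sizes before labelling a pair grain-invariant or scale-specific.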