Research Feed

AI Research Papers

80 papers tracked across 1 category

Tracked

80

Last 7 Days

43

Categories

1

With Code

0

Top categories: cs.LG (30)
Latest tracked paper: cs.LG · Apr 22, 2026

SkillLearnBench: Benchmarking Continual Learning Methods for Agent Skill Generation on Real-World Tasks

Shanshan Zhong, Yi Lu, Jingjie Ning, Yibing Wan, Lihan Feng, Yuyi Ao +4 more

Skills have become the de facto way to enable LLM agents to perform complex real-world tasks with customized instructions, workflows, and tools, but how to learn them automatically and effectively remains unclear. We introduce SkillLearnBench, the first benchmark for evaluating continual skill learning methods, comprising 20 verified, skill-dependent tasks across 15 sub-domains derived from a real-world skill taxonomy, evaluated at three levels: skill quality, execution trajectory, and task outcome…

cs.LG · Apr 21, 2026

RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models

Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni +3

Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method…
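The RDP algorithm the abstract invokes is the classic polyline simplifier; a minimal sketch for generic d-dimensional point sequences (illustrative only, not the paper's layer-selection code):

```python
import math

def rdp(points, eps):
    """Ramer-Douglas-Peucker: keep only the points that deviate from the
    straight chord between the endpoints by more than eps, recursively."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    ab = [bi - ai for ai, bi in zip(a, b)]
    norm2 = sum(c * c for c in ab) or 1.0

    def dist(p):  # perpendicular distance of p to the line through a and b
        ap = [pi - ai for ai, pi in zip(a, p)]
        t = sum(x * y for x, y in zip(ap, ab)) / norm2
        proj = [ai + t * c for ai, c in zip(a, ab)]
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, proj)))

    i, d = max(((i, dist(p)) for i, p in enumerate(points[1:-1], start=1)),
               key=lambda pair: pair[1])
    if d > eps:  # split at the farthest point and simplify both halves
        return rdp(points[:i + 1], eps)[:-1] + rdp(points[i:], eps)
    return [a, b]
```

With hidden states as points and layers as the trajectory index, the surviving points are the "geometrically informative" positions; the paper's use for placing adapters is beyond this sketch.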

cs.LG · Apr 21, 2026

Micro Language Models Enable Instant Responses

Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan +2

Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models (μLMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it, thus masking the cloud…

cs.LG · Apr 21, 2026

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng +5

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we…

cs.LG · Apr 21, 2026

What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search

Xinhao Zhang, Xi Chen, François Portet, Maxime Peyrard

Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance…

cs.LG · Apr 21, 2026

Accurate and scalable exchange-correlation with deep learning

Giulia Luise, Chin-Wei Huang, Thijs Vogels, Derk P. Kooi +24

Density Functional Theory (DFT) underpins much of modern computational chemistry and materials science. Yet, the reliability of DFT-derived predictions of experimentally measurable properties remains fundamentally limited by the need to approximate the unknown exchange-correlation (XC) functional. The traditional paradigm for improving accuracy has relied on increasingly elaborate hand-crafted functional forms. This approach has led to a longstanding trade-off between computational efficiency and…

cs.LG · Apr 21, 2026

ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning

Xianming Li, Zongxi Li, Tsz-fung Andrew Lee, Jing Li +2

Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly into individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that…
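For context, the per-weight low-rank update the abstract contrasts against is standard LoRA: y = Wx + scale · B(Ax). A minimal plain-Python sketch (standard LoRA, not ShadowPEFT):

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W (d_out x d_in) stays frozen; only
    A (r x d_in) and B (d_out x r) are trained: r*(d_in + d_out) params."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # low-rank perturbation of the output
    return [b + scale * u for b, u in zip(base, update)]
```

Each adapted weight gets its own (A, B) pair, which is exactly the "local parameterization" the paper says it replaces with a centralized one.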

cs.LG · Apr 21, 2026

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu +8

Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs…

cs.LG · Apr 21, 2026

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang +6

Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves…

cs.LG · Apr 21, 2026

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang +9

Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning…

cs.LG · Apr 21, 2026

TEMPO: Scaling Test-time Training for Large Reasoning Models

Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu +6

Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework…

cs.LG · Apr 21, 2026

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan +3

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can…

cs.LG · Apr 21, 2026

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo +2

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on…

cs.LG · Apr 21, 2026

ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation

Zhiqin Yang, Zhenyuan Zhang, Xianzhang Jia, Jun Song +3

Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to…

cs.LG · Apr 21, 2026

LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction

Jiakai Tang, Runfeng Zhang, Weiqiu Wang, Yifei Liu +6

Scaling Transformer-based click-through rate (CTR) models by stacking more parameters brings growing computational and storage overhead, creating a widening gap between scaling ambitions and the stringent industrial deployment constraints. We propose LoopCTR, which introduces a loop scaling paradigm that increases training-time computation through recursive reuse of shared model layers, decoupling computation from parameter growth. LoopCTR adopts a sandwich architecture enhanced with Hyper-Conne…
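The core of loop scaling, reusing one shared stack repeatedly so compute grows while the parameter count does not, can be sketched as follows (illustrative only; LoopCTR's sandwich architecture adds more structure):

```python
def looped_forward(x, shared_layers, loops):
    """Apply the same shared layers `loops` times: effective depth (and
    compute) scales with `loops`, but the parameters are those of one stack."""
    for _ in range(loops):
        for layer in shared_layers:  # weights are reused on every pass
            x = layer(x)
    return x
```

With `loops = 1` this is an ordinary forward pass; raising `loops` trades extra computation for capacity without adding storage, which is the deployment constraint the abstract targets.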

cs.LG · Apr 21, 2026

PlayCoder: Making LLM-Generated GUI Code Playable

Zhiyuan Peng, Wei Tao, Xin Yin, Chenhao Ying +2

Large language models (LLMs) have achieved strong results in code generation, but their ability to generate GUI applications, especially games, remains insufficiently studied. Existing benchmarks mainly evaluate correctness through test cases, which are inadequate for GUI applications because these systems are interactive, event-driven, and require correct state transitions across sequences of user actions. Their evaluation should therefore consider interaction flows and UI logic rather than only…

cs.LG · Apr 21, 2026

Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items

Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao +15

Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers high…

cs.LG · Apr 21, 2026

Evaluation-driven Scaling for Scientific Discovery

Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo +21

Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled…

cs.LG · Apr 20, 2026

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen +7

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce…

cs.LG · Apr 20, 2026

Mitigating Multimodal Hallucination via Phase-wise Self-reward

Yu Zhang, Chuyang Sun, Kehai Chen, Xuefeng Bai +2

Large Vision-Language Models (LVLMs) still struggle with vision hallucination, where generated responses are inconsistent with the visual input. Existing methods either rely on large-scale annotated data for fine-tuning, which incurs massive computational overhead, or employ static post-hoc strategies that overlook the dynamic nature of hallucination emergence. To address these issues, we introduce a new self-rewarding framework, enabling dynamic hallucination mitigation at inference time without external…

cs.LG · Apr 20, 2026

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu +4

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate…

cs.LG · Apr 20, 2026

Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

Rongyuan Tan, Jue Zhang, Zhuozhao Li, Qingwei Lin +2

Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as contrastive attribution, attributing the logit difference between an incorrect output token and a correct…

cs.LG · Apr 20, 2026

Dual-View Training for Instruction-Following Information Retrieval

Qingcheng Zeng, Puxuan Yu, Aman Mehta, Fuheng Zhao +1

Instruction-following information retrieval (IF-IR) studies retrieval systems that must not only find documents relevant to a query, but also obey explicit user constraints such as required attributes, exclusions, or output preferences. However, most retrievers are trained primarily for semantic relevance and often fail to distinguish documents that match the topic from those that satisfy the instruction. We propose a dual-view data synthesis strategy based on polarity reversal: given a query…

cs.LG · Apr 20, 2026

MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge

Sua Lee, Sanghee Park, Jinbae Im

Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators, a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerability to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in…

cs.LG · Apr 20, 2026

River-LLM: Large Language Model Seamless Exit Based on KV Share

Yingtao Shen, An Zou

Large Language Models (LLMs) have demonstrated exceptional performance across diverse domains but are increasingly constrained by high inference latency. Early Exit has emerged as a promising solution to accelerate inference by dynamically bypassing redundant layers. However, in decoder-only architectures, the efficiency of Early Exit is severely bottlenecked by the KV Cache Absence problem, where skipped layers fail to provide the necessary historical states for subsequent tokens. Existing solutions…

cs.LG · Apr 20, 2026

MARCO: Navigating the Unseen Space of Semantic Correspondence

Claudia Cuttano, Gabriele Trivigno, Carlo Masone, Stefan Roth

Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained…

cs.LG · Apr 20, 2026

On the Reliability of Computer Use Agents

Gonzalo Gonzalez-Pumariega, Saaket Agashe, Jiachen Yang, Ang Li +1

Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through…

cs.LG · Apr 19, 2026

MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation

Szu-Chi Chen, I-Ning Tsai, Yi-Cheng Lin, Sung-Feng Huang +1

Recent Speech-to-Speech Translation (S2ST) systems achieve strong semantic accuracy yet consistently strip away non-verbal vocalizations (NVs), such as laughter and crying, that convey pragmatic intent, which severely limits real-world utility. We address this via three contributions. First, we propose a synthesis pipeline for building scalable expressive datasets to overcome the data scarcity limitation. Second, we propose MoVE, a Mixture-of-LoRA-Experts architecture with expressive-specialized…

cs.LG · Apr 19, 2026

Speculative Decoding for Autoregressive Video Generation

Yuezhou Hu, Jintao Zhang

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which…
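For readers unfamiliar with the LLM-side baseline: in token-level speculative decoding, a cheap draft model proposes tokens that the target model verifies in parallel. A minimal greedy-variant sketch (the classic token-level scheme, not SDVG, whose blocks are continuous tensors):

```python
def speculative_decode(draft_next, target_next, prompt, k, n):
    """Greedy speculative decoding: the draft proposes k tokens per round;
    the target keeps the longest agreeing prefix, corrects the first
    disagreement, and emits one bonus token when all k are accepted."""
    out = list(prompt)
    while len(out) - len(prompt) < n:
        ctx, proposal = list(out), []
        for _ in range(k):                  # cheap draft rollout
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        for t in proposal:                  # target-side verification
            expected = target_next(out)
            if t != expected:
                out.append(expected)        # target's correction
                break
            out.append(t)                   # accepted draft token
        else:
            out.append(target_next(out))    # bonus token: all k accepted
    return out[len(prompt):][:n]
```

The greedy variant always reproduces the target model's own greedy output, regardless of draft quality; a better draft only means more tokens accepted per (expensive) target pass.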

cs.LG · Apr 19, 2026

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

Qingcheng Zeng, Yuheng Lu, Zeqi Zhou, Heli Qi +5

Code-switching is a pervasive linguistic phenomenon in global communication, yet modern information retrieval systems remain predominantly designed for, and evaluated within, monolingual contexts. To bridge this critical disconnect, we present a holistic study dedicated to code-switching IR. We introduce CSR-L (Code-Switching Retrieval benchmark-Lite), constructing a dataset via human annotation to capture the authentic naturalness of mixed-language queries. Our evaluation across statistical…

cs.LG · Apr 19, 2026

UniMesh: Unifying 3D Mesh Understanding and Generation

Peng Huang, Yifeng Chen, Zeyu Zhang, Hao Tang

Recent advances in 3D vision have led to specialized models for either 3D understanding (e.g., shape classification, segmentation, reconstruction) or 3D generation (e.g., synthesis, completion, and editing). However, these tasks are often tackled in isolation, resulting in fragmented architectures and representations that hinder knowledge transfer and holistic scene modeling. To address these challenges, we propose UniMesh, a unified framework that jointly learns 3D generation and understanding…

cs.LG · Apr 19, 2026

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Ivan Bercovich, Ivgeni Segal, Kexun Zhang, Shashwat Saxena +2

We release Terminal Wrench, a set of 331 demonstrably reward-hackable terminal-agent benchmark environments copied from popular open benchmarks. The dataset includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not…

cs.LG · Apr 19, 2026

When Background Matters: Breaking Medical Vision Language Models by Transferable Attack

Akash Ghosh, Subhadip Baidya, Sriparna Saha, Xiuying Chen

Vision-Language Models (VLMs) are increasingly used in clinical diagnostics, yet their robustness to adversarial attacks remains largely unexplored, posing serious risks. Existing medical attacks focus on secondary objectives such as model stealing or adversarial fine-tuning, while transferable attacks from natural images introduce visible distortions that clinicians can easily detect. To address this, we propose MedFocusLeak, a highly transferable black-box multimodal attack that induces incorrect…

cs.LG · Apr 19, 2026

The Continuity Layer: Why Intelligence Needs an Architecture for What It Carries Forward

Samuel Sameer Tanguturi

The most important architectural problem in AI is not the size of the model but the absence of a layer that carries forward what the model has come to understand. Sessions end. Context windows fill. Memory APIs return flat facts that the model has to reinterpret from scratch on every read. The result is intelligence that is powerful per session and amnesiac across time. This position paper argues that the layer which fixes this, the continuity layer, is the most consequential piece of infrastructure…

cs.LG · Apr 18, 2026

Understanding and Enforcing Weight Disentanglement in Task Arithmetic

Shangge Liu, Yuehan Yin, Lei Wang, Qi Fan +4

Task arithmetic provides an efficient, training-free way to edit pre-trained models, yet lacks a fundamental theoretical explanation for its success. The existing concept of "weight disentanglement" describes the ideal outcome of non-interfering task composition but does not reveal its underlying cause. Crucially, what intrinsic properties of the pre-trained model (θ_0) or the task vectors (τ_t) enable this disentanglement remains underexplored. In this paper, we introduce Task-Feature Speciali…
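Task arithmetic itself is simple to state: a task vector is the delta between fine-tuned and pre-trained weights, and model editing adds scaled deltas back to the base. A flattened-parameter sketch of the mechanics (background only, not the paper's analysis):

```python
def task_vector(theta_t, theta_0):
    """tau_t = theta_t - theta_0, elementwise over flattened parameters."""
    return [t - z for t, z in zip(theta_t, theta_0)]

def edit(theta_0, taus, lams):
    """theta_0 + sum_t lam_t * tau_t: training-free multi-task composition."""
    out = list(theta_0)
    for lam, tau in zip(lams, taus):
        out = [o + lam * t for o, t in zip(out, tau)]
    return out
```

Weight disentanglement, in this notation, is the property that adding τ_a does not change the model's behavior on task b's inputs, which is what makes the summation compose cleanly.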

cs.LG · Apr 18, 2026

The Cognitive Penalty: Ablating System 1 and System 2 Reasoning in Edge-Native SLMs for Decentralized Consensus

Syed Muhammad Aqdas Rizvi

Decentralized Autonomous Organizations (DAOs) are inclined to explore Small Language Models (SLMs) as edge-native constitutional firewalls to vet proposals and mitigate semantic social engineering. While scaling inference-time compute (System 2) enhances formal logic, its efficacy in highly adversarial, cryptoeconomic governance environments remains underexplored. To address this, we introduce Sentinel-Bench, an 840-inference empirical framework executing a strict intra-model ablation on Qwen-3.5-9…

cs.LG · Apr 18, 2026

The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation

Jiaxin Zhang, Xiangyu Peng, Qinglin Chen, Qinyuan Ye +2

On-policy distillation (OPD) is an increasingly important paradigm for post-training language models. However, we identify a pervasive Scaling Law of Miscalibration: while OPD effectively improves task accuracy, it systematically traps models in severe overconfidence. We trace this failure to an information mismatch: teacher supervision is formed under privileged context available during training, whereas the deployed model must report confidence using only deployment-time information. We formalize…

cs.LG · Apr 17, 2026

Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf, Zhijing Jin +1

The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering…

cs.LG · Apr 17, 2026

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Sai Srinivas Kancheti, Aditya Sanjiv Kanade, Vineeth N. Balasubramanian, Tanuja Ganu

Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT)-based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that…

cs.LG · Apr 17, 2026

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Rohit Sinha, Aditya Kanade, Sai Srinivas Kancheti, Vineeth N Balasubramanian +1

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction…

cs.LG · Apr 17, 2026

Target-Oriented Pretraining Data Selection via Neuron-Activated Graph

Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou +6

Everyday tasks come with a target, and pretraining models around this target is what turns them into experts. In this paper, we study target-oriented language model (LM) pretraining by introducing Neuron-Activated Graph Ranking (NAG-based Ranking), a training-free and interpretable framework for target pretraining data selection. Rather than using black-box representations, our approach directly characterizes each target input by a sparse set of high-impact neurons in any off-the-shelf LLM…

cs.LG · Apr 17, 2026

KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Ankit Maloo

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark…

cs.LG · Apr 17, 2026

VoxMind: An End-to-End Agentic Spoken Dialogue System

Tianle Liang, Yifu Chen, Shengpeng Ji, Yijun Chen +6

Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, these models can extend their knowledge boundaries and better solve real-world tasks. Yet, existing research has largely concentrated on core perception and generation, with comparatively limited exploration of such tool…

cs.LG · Apr 16, 2026

Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility

Yining Hong, Yining She, Eunsuk Kang, Christopher S. Timperley +1

AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes…

cs.LG · Apr 16, 2026

Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses…

cs.LG · Apr 14, 2026

AgentSPEX: An Agent SPecification and EXecution Language

Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan +6

Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and…

cs.LG · Apr 14, 2026

Forge-UGC: FX optimization and register-graph engine for universal graph compiler

Satyam Kumar, Saurabh Jha

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often use opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses this with a hardware-agnostic design that…

cs.LG · Apr 13, 2026

Predicting integers from continuous parameters

Bas Maat, Peter Bloem

We study the problem of predicting numeric labels that are constrained to the integers or to a subrange of the integers, for example, the number of up-votes on social media posts, or the number of bicycles available at a public rental station. While it is possible to model these as continuous values and to apply traditional regression, this approach changes the underlying distribution on the labels from discrete to continuous. Discrete distributions have certain benefits, which leads us to the…
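One concrete instance of the discrete alternative the abstract alludes to: score integer labels under a Poisson likelihood instead of squared error, a standard choice for count data (the paper's actual models may differ):

```python
import math

def poisson_nll(rate, k):
    """-log P(k | Poisson(rate)) = rate - k*log(rate) + log(k!).
    A discrete loss for integer-valued labels such as vote or bike counts."""
    return rate - k * math.log(rate) + math.lgamma(k + 1)

def most_likely_count(rate):
    """The Poisson mode floor(rate): the integer the fitted model predicts,
    with no rounding of a continuous regression output required."""
    return math.floor(rate)
```

Unlike Gaussian regression, the likelihood here puts mass only on integers, so predictions and uncertainty are natively discrete.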

cs.LG · Apr 3, 2026

Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat

Meng'en Qin, Zhe Li, Xiaohui Yang

Genotype-by-Environment (GxE) interactions influence the performance of genotypes across diverse environments, reducing the predictability of phenotypes in target environments. In-depth analysis of GxE interactions facilitates the identification of how genetic advantages or defects are expressed or suppressed under specific environmental conditions, thereby enabling genetic selection and enhancing breeding practices. This paper introduces two key models for GxE interaction research. Specifically…

cs.LGMar 18, 2026

SPRITE: From Static Mockups to Engine-Ready Game UI

Yunshu Bai, RuiHao Li, Hao Zhang, Chien Her Lim +2

Game UI implementation requires translating stylized mockups into interactive engine entities. However, current "Screenshot-to-Code" tools often struggle with the irregular geometries and deep visual hierarchies typical of game interfaces. To bridge this gap, we introduce SPRITE, a pipeline that transforms static screenshots into editable engine assets. By integrating Vision-Language Models (VLMs) with a structured YAML intermediate representation, SPRITE explicitly captures complex container re…
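
The idea of a container-aware intermediate representation can be sketched in a few lines (a hypothetical illustration, not SPRITE's actual IR schema): a nested hierarchy, as a YAML-style structure, is flattened into engine-ready entities whose absolute positions are derived from their parent containers.

```python
# Hypothetical UI hierarchy (not SPRITE's real schema): each node has a
# name, a position relative to its parent container, and child nodes.
UI = {
    "name": "hud", "pos": (0, 0), "children": [
        {"name": "health_bar", "pos": (10, 10), "children": []},
        {"name": "minimap", "pos": (300, 10), "children": [
            {"name": "player_dot", "pos": (20, 20), "children": []},
        ]},
    ],
}

def flatten(node, origin=(0, 0)):
    # Resolve each node's absolute position from its container chain.
    x = origin[0] + node["pos"][0]
    y = origin[1] + node["pos"][1]
    yield node["name"], (x, y)
    for child in node["children"]:
        yield from flatten(child, (x, y))

for name, pos in flatten(UI):
    print(name, pos)
```

An explicit tree like this is what lets a pipeline preserve parent-child relationships that a flat list of screen-space boxes would lose.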

cs.LGJan 1, 2024

A Statistical Analysis of Wasserstein Autoencoders for Intrinsically Low-dimensional Data

Saptarshi Chakraborty, Peter L. Bartlett

cs.LGJan 1, 2024

Online Information Acquisition: Hiring Multiple Agents

Federico Cacciamani, Matteo Castiglioni, Nicola Gatti

cs.LGJan 1, 2024

MINDE: Mutual Information Neural Diffusion Estimation

Giulio Franzese, Mustapha Bounoua, Pietro Michiardi

cs.LGJan 1, 2024

Fine-Tuned Language Models Generate Stable Inorganic Materials as Text

Nate Gruver, Anuroop Sriram, Andrea Madotto, Andrew Gordon Wilson +1

cs.LGJan 1, 2024

Hybrid Directional Graph Neural Network for Molecules

Junyi An, Chao Qu, Zhipeng Zhou, Fenglei Cao +1

cs.LGJan 1, 2024

CrossLoco: Human Motion Driven Control of Legged Robots via Guided Unsupervised Reinforcement Learning

Tianyu Li, Hyunyoung Jung, Matthew C. Gombolay, Yong Kwon Cho +1

cs.LGJan 1, 2024

Leveraging Optimization for Adaptive Attacks on Image Watermarks

Nils Lukas, Abdulrahman Diaa, Lucas Fenaux, Florian Kerschbaum

cs.LGJan 1, 2024

Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning

Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, Shirui Pan

cs.LGJan 1, 2024

Towards image compression with perfect realism at ultra-low bitrates

Marlène Careil, Matthew J. Muckley, Jakob Verbeek, Stéphane Lathuilière

cs.LGJan 1, 2024

Compressed Context Memory for Online Language Model Interaction

Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song

cs.LGJan 1, 2024

Dual-Encoders for Extreme Multi-label Classification

Nilesh Gupta, Devvrit, Ankit Singh Rawat, Srinadh Bhojanapalli +1

cs.LGJan 1, 2024

Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL

Xiangyu Liu, Souradip Chakraborty, Yanchao Sun, Furong Huang

cs.LGJan 1, 2024

Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

Xiangyu Liu, Chenghao Deng, Yanchao Sun, Yongyuan Liang +1

cs.LGJan 1, 2024

Decodable and Sample Invariant Continuous Object Encoder

Dehao Yuan, Furong Huang, Cornelia Fermüller, Yiannis Aloimonos

cs.LGJan 1, 2024

Graphical Multioutput Gaussian Process with Attention

Yijue Dai, Wenzhong Yan, Feng Yin

cs.LGJan 1, 2024

Generalization in diffusion models arises from geometry-adaptive harmonic representations

Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, Stéphane Mallat

cs.LGJan 1, 2024

Geometry-Aware Projective Mapping for Unbounded Neural Radiance Fields

Junoh Lee, Hyunjun Jung, Jin-Hwi Park, Inhwan Bae +1

cs.LGJan 1, 2024

Analyzing and Mitigating Object Hallucination in Large Vision-Language Models

Yiyang Zhou, Chenhang Cui, Jaehong Yoon, Linjun Zhang +1

cs.LGJan 1, 2024

Multi-Scale Representations by Varying Window Attention for Semantic Segmentation

Haotian Yan, Ming Wu, Chuang Zhang

cs.LGJan 1, 2024

Guess & Sketch: Language Model Guided Transpilation

Celine Lee, Abdulrahman Mahmoud, Michal Kurek, Simone Campanoni +1

cs.LGJan 1, 2024

Implicit Neural Representations and the Algebra of Complex Wavelets

T. Mitchell Roddenberry, Vishwanath Saragadam, Maarten V. de Hoop, Richard G. Baraniuk

cs.LGJan 1, 2024

S2AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic

Safa Messaoud, Billel Mokeddem, Zhenghai Xue, Linsey Pang +1

cs.LGJan 1, 2024

Graph Transformers on EHRs: Better Representation Improves Downstream Performance

Raphael Poulain, Rahmatollah Beheshti

cs.LGJan 1, 2024

Conformal Prediction via Regression-as-Classification

Etash Kumar Guha, Shlok Natarajan, Thomas Möllenhoff, Mohammad Emtiyaz Khan +1

cs.LGJan 1, 2024

Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Hyungho Na, Yunkyeong Seo, Il-Chul Moon

cs.LGJan 1, 2024

Distinguished In Uniform: Self-Attention Vs. Virtual Nodes

Eran Rosenbluth, Jan Tönshoff, Martin Ritzert, Berke Kisin +1

cs.LGJan 1, 2024

Beyond Weisfeiler-Lehman: A Quantitative Framework for GNN Expressiveness

Bohang Zhang, Jingchu Gai, Yiheng Du, Qiwei Ye +1

cs.LGJan 1, 2024

Plug-and-Play Posterior Sampling under Mismatched Measurement and Prior Models

Marien Renaud, Jiaming Liu, Valentin De Bortoli, Andrés Almansa +1

cs.LGJan 1, 2024

Unlocking the Power of Representations in Long-term Novelty-based Exploration

Alaa Saade, Steven Kapturowski, Daniele Calandriello, Charles Blundell +1

cs.LGJan 1, 2024

Hypergraph Dynamic System

Jielong Yan, Yifan Feng, Shihui Ying, Yue Gao