New submissions

New submissions for Mon, 3 Jun 24

[1]  arXiv:2405.20352 [pdf, other]
Title: Adapting Quantile Mapping to Bias Correct Solar Radiation Data
Comments: 28 pages, 15 figures
Subjects: Applications (stat.AP)

Bias correction is a common pre-processing step applied to climate model data before it is used for further analysis. This article introduces an efficient adaptation of a well-established bias-correction method - quantile mapping - for global horizontal irradiance (GHI) that ensures corrected data is physically plausible by incorporating measurements of clear-sky GHI. The proposed quantile mapping method is fit on reanalysis data to first bias correct for regional climate models (RCMs) and is tested on RCMs forced by general circulation models (GCMs) to understand existing biases directly from GCMs. Additionally, we adapt a functional analysis of variance methodology that analyzes sources of remaining biases after implementing the proposed quantile mapping method and considers biases by climate region. This analysis is applied to four sets of climate model output from NA-CORDEX and compared against data from the National Solar Radiation Database produced by the National Renewable Energy Lab.
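
As a rough illustration of the general technique named above, the following is a minimal sketch of plain empirical quantile mapping, not the article's adapted method. The function and argument names (quantile_map, model_hist, obs_hist, model_new, clear_sky) are hypothetical, and the final clipping step is only an assumed stand-in for how clear-sky GHI might be used to keep corrected values physically plausible.

```python
import numpy as np

def quantile_map(model_hist, obs_hist, model_new, clear_sky=None):
    """Generic empirical quantile mapping (illustrative sketch only)."""
    model_sorted = np.sort(np.asarray(model_hist, dtype=float))
    obs_sorted = np.sort(np.asarray(obs_hist, dtype=float))
    # Mid-point plotting positions for the two empirical CDFs
    p_model = (np.arange(model_sorted.size) + 0.5) / model_sorted.size
    p_obs = (np.arange(obs_sorted.size) + 0.5) / obs_sorted.size
    # Step 1: probability of each new value under the model's historical CDF
    probs = np.interp(model_new, model_sorted, p_model)
    # Step 2: observed value sitting at the same probability
    corrected = np.interp(probs, p_obs, obs_sorted)
    if clear_sky is not None:
        # Assumption: cap corrected GHI at the clear-sky value and floor at zero
        corrected = np.clip(corrected, 0.0, clear_sky)
    return corrected

# Toy data: the "model" is biased low by 10 percent relative to "observations"
hist_model = np.array([100.0, 200.0, 300.0, 400.0])
hist_obs = np.array([110.0, 220.0, 330.0, 440.0])
print(quantile_map(hist_model, hist_obs, np.array([200.0, 400.0])))
```

Values outside the historical range are mapped to the nearest endpoint here, which is one simple choice among several for handling extrapolation.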

[2]  arXiv:2405.20400 [pdf, other]
Title: Fast leave-one-cluster-out cross-validation by clustered Network Information Criteria (NICc)
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)

This paper introduced a clustered estimator of the Network Information Criterion (NICc) to approximate leave-one-cluster-out cross-validated deviance, which can be used as an alternative to cluster-based cross-validation when modeling clustered data. Stone proved that the Akaike Information Criterion (AIC) is asymptotically equivalent to leave-one-observation-out cross-validation if the parametric model is true. Ripley pointed out that the Network Information Criterion (NIC) derived in Stone's proof is a better approximation to leave-one-observation-out cross-validation when the model is not true. For clustered data, we derived a clustered estimator of NIC, referred to as NICc, by substituting the Fisher information matrix in NIC with its estimator that adjusts for clustering. This adjustment imposes a larger penalty in NICc than the unclustered estimator of NIC when modeling clustered data, thereby preventing overfitting more effectively. In a simulation study and an empirical example, we used linear and logistic regression to model clustered data with Gaussian or binomial response, respectively. We showed that NICc is a better approximation to leave-one-cluster-out deviance and prevents overfitting more effectively than AIC and Bayesian Information Criterion (BIC). NICc leads to more accurate model selection, as determined by cluster-based cross-validation, compared to AIC and BIC.
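
For orientation, the two criteria are commonly written in the following textbook form, where $\ell(\hat\theta)$ is the maximized log-likelihood, $p$ the number of parameters, $s_i$ the score contribution of observation $i$, and $\hat J$ the observed information. The notation is an assumption for illustration, and the last expression is only a guess at what a cluster-adjusted estimator could look like (scores summed within each cluster $g$ before taking outer products), not the paper's definition.
\[\mathrm{AIC} = -2\,\ell(\hat\theta) + 2p, \qquad \mathrm{NIC} = -2\,\ell(\hat\theta) + 2\,\mathrm{tr}\left(\hat K \hat J^{-1}\right), \qquad \hat K = \sum_i s_i s_i^\top\]
\[\hat K_c = \sum_g \Big(\sum_{i \in g} s_i\Big)\Big(\sum_{i \in g} s_i\Big)^\top\]
When the model is correctly specified, $\hat K$ and $\hat J$ estimate the same matrix, the trace is approximately $p$, and NIC reduces to AIC.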

[3]  arXiv:2405.20415 [pdf, other]
Title: Differentially Private Boxplots
Subjects: Methodology (stat.ME); Applications (stat.AP); Other Statistics (stat.OT)

Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains relatively underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequently, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms a boxplot naively constructed from existing differentially private quantile algorithms. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization.

[4]  arXiv:2405.20418 [pdf, other]
Title: A Bayesian joint model of multiple nonlinear longitudinal and competing risks outcomes for dynamic prediction in multiple myeloma: joint estimation and corrected two-stage approaches
Comments: 38 pages, 13 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)

Predicting cancer-associated clinical events is challenging in oncology. In Multiple Myeloma (MM), a cancer of plasma cells, disease progression is determined by changes in biomarkers, such as serum concentration of the paraprotein secreted by plasma cells (M-protein). Therefore, the time-dependent behaviour of M-protein and the transition across lines of therapy (LoT) that may be a consequence of disease progression should be accounted for in statistical models to predict relevant clinical outcomes. Furthermore, it is important to understand the contribution of the patterns of longitudinal biomarkers, upon each LoT initiation, to time-to-death or time-to-next-LoT. Motivated by these challenges, we propose a Bayesian joint model for trajectories of multiple longitudinal biomarkers, such as M-protein, and the competing risks of death and transition to next LoT. Additionally, we explore two estimation approaches for our joint model: simultaneous estimation of all parameters (joint estimation) and sequential estimation of parameters using a corrected two-stage strategy aiming to reduce computational time. Our proposed model and estimation methods are applied to a retrospective cohort study from a real-world database of patients diagnosed with MM in the US from January 2015 to February 2022. We split the data into training and test sets in order to validate the joint model using both estimation approaches and make dynamic predictions of times until clinical events of interest, informed by longitudinally measured biomarkers and baseline variables available up to the time of prediction.

[5]  arXiv:2405.20447 [pdf, other]
Title: Algorithmic Fairness in Performative Policy Learning: Escaping the Impossibility of Group Fairness
Subjects: Machine Learning (stat.ML); Computers and Society (cs.CY); Machine Learning (cs.LG)

In many prediction problems, the predictive model affects the distribution of the prediction target. This phenomenon is known as performativity and is often caused by the behavior of individuals with vested interests in the outcome of the predictive model. Although performativity is generally problematic because it manifests as distribution shifts, we develop algorithmic fairness practices that leverage performativity to achieve stronger group fairness guarantees in social classification problems (compared to what is achievable in non-performative settings). In particular, we leverage the policymaker's ability to steer the population to remedy inequities in the long term. A crucial benefit of this approach is that it is possible to resolve the incompatibilities between conflicting group fairness definitions.

[6]  arXiv:2405.20451 [pdf, other]
Title: Statistical Properties of Robust Satisficing
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)

The Robust Satisficing (RS) model is an emerging approach to robust optimization, offering streamlined procedures and robust generalization across various applications. However, the statistical theory of RS remains unexplored in the literature. This paper fills in the gap by comprehensively analyzing the theoretical properties of the RS model. Notably, the RS structure offers a more straightforward path to deriving statistical guarantees compared to the seminal Distributionally Robust Optimization (DRO), resulting in a richer set of results. In particular, we establish two-sided confidence intervals for the optimal loss without the need to solve a minimax optimization problem explicitly. We further provide finite-sample generalization error bounds for the RS optimizer. Importantly, our results extend to scenarios involving distribution shifts, where discrepancies exist between the sampling and target distributions. Our numerical experiments show that the RS model consistently outperforms the baseline empirical risk minimization in small-sample regimes and under distribution shifts. Furthermore, compared to the DRO model, the RS model exhibits lower sensitivity to hyperparameter tuning, highlighting its practicability for robustness considerations.

[7]  arXiv:2405.20601 [pdf, other]
Title: Bayesian Nonparametric Quasi Likelihood
Subjects: Methodology (stat.ME); Other Statistics (stat.OT)

A recent trend in Bayesian research has been revisiting generalizations of the likelihood that enable Bayesian inference without requiring the specification of a model for the data generating mechanism. This paper focuses on a Bayesian nonparametric extension of Wedderburn's quasi-likelihood, using Bayesian additive regression trees to model the mean function. Here, the analyst posits only a structural relationship between the mean and variance of the outcome. We show that this approach provides a unified, computationally efficient, framework for extending Bayesian decision tree ensembles to many new settings, including simplex-valued and heavily heteroskedastic data. We also introduce Bayesian strategies for inferring the dispersion parameter of the quasi-likelihood, a task which is complicated by the fact that the quasi-likelihood itself does not contain information about this parameter; despite these challenges, we are able to inject updates for the dispersion parameter into a Markov chain Monte Carlo inference scheme in a way that, in the parametric setting, leads to a Bernstein-von Mises result for the stationary distribution of the resulting Markov chain. We illustrate the utility of our approach on a variety of both synthetic and non-synthetic datasets.

[8]  arXiv:2405.20644 [pdf, other]
Title: Fixed-budget optimal designs for multi-fidelity computer experiments
Authors: Gecheng Chen, Rui Tuo
Subjects: Methodology (stat.ME)

This work focuses on the design of multi-fidelity computer experiments. We consider the autoregressive Gaussian process model proposed by Kennedy and O'Hagan (2000) and the optimal nested design that maximizes the prediction accuracy subject to a budget constraint. An approximate solution is identified through the idea of multi-level approximation and recent error bounds of Gaussian process regression. The proposed (approximately) optimal designs admit a simple analytical form. We prove that, to achieve the same prediction accuracy, the proposed optimal multi-fidelity design requires much lower computational cost than any single-fidelity design in the asymptotic sense. Numerical studies confirm this theoretical assertion.
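
For readers unfamiliar with the autoregressive model mentioned above, it is usually stated as follows, with $f_t$ denoting the simulator output at fidelity level $t$, $\rho_{t-1}$ a scale parameter, and $\delta_t$ a Gaussian process independent of the lower-fidelity levels. The symbols are chosen here for illustration and may differ from the paper's notation.
\[f_t(x) = \rho_{t-1}\, f_{t-1}(x) + \delta_t(x), \qquad t = 2, \dots, T\]
A nested design in this setting means that the input points run at level $t$ form a subset of those run at level $t-1$.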

[9]  arXiv:2405.20655 [pdf, ps, other]
Title: Statistical inference for case-control logistic regression via integrating external summary data
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

Case-control sampling is a commonly used retrospective sampling design to alleviate the imbalanced structure of binary data. When fitting the logistic regression model with case-control data, although the slope parameter of the model can be consistently estimated, the intercept parameter is not identifiable, and the marginal case proportion is not estimable either. We consider the situations in which, besides the case-control data from the main study, called the internal study, there also exists summary-level information from related external studies. An empirical likelihood based approach is proposed to make inference for the logistic model by incorporating the internal case-control data and external information. We show that the intercept parameter is identifiable with the help of external information, and then all the regression parameters as well as the marginal case proportion can be estimated consistently. The proposed method also accounts for the possible variability in external studies. The resultant estimators are shown to be asymptotically normally distributed. The asymptotic variance-covariance matrix can be consistently estimated by the case-control data. The optimal way to utilize external information is discussed. Simulation studies are conducted to verify the theoretical findings. A real data set is analyzed for illustration.

[10]  arXiv:2405.20758 [pdf, other]
Title: Fast Bayesian Basis Selection for Functional Data Representation with Correlated Errors
Comments: 30 pages (25 in the main text and 5 in the supplemental material)
Subjects: Methodology (stat.ME)

Functional data analysis (FDA) finds widespread application across various fields, due to data being recorded continuously over a time interval or at several discrete points. Since the data is not observed at every point but rather across a dense grid, smoothing techniques are often employed to convert the observed data into functions. In this work, we propose a novel Bayesian approach for selecting basis functions for smoothing one or multiple curves simultaneously. Our method differs from other Bayesian approaches in two key ways: (i) by accounting for correlated errors and (ii) by developing a variational EM algorithm instead of a Gibbs sampler. Simulation studies demonstrate that our method effectively identifies the true underlying structure of the data across various scenarios and that it is applicable to different types of functional data. Our variational EM algorithm not only recovers the basis coefficients and the correct set of basis functions but also estimates the existing within-curve correlation. When applied to the motorcycle dataset, our method demonstrates comparable, and in some cases superior, performance in terms of adjusted $R^2$ compared to other techniques such as regression splines, Bayesian LASSO and LASSO. Additionally, when assuming independence among observations within a curve, our method, utilizing only a variational Bayes algorithm, is on the order of thousands of times faster than a Gibbs sampler on average. Our proposed method is implemented in R and the code is available at https://github.com/acarolcruz/VB-Bases-Selection.

[11]  arXiv:2405.20799 [pdf, other]
Title: Rough Transformers: Lightweight Continuous-Time Sequence Modelling with Path Signatures
Comments: Preprint. Under review. arXiv admin note: text overlap with arXiv:2403.10288
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Time-series data in real-world settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In these settings, traditional sequence-based recurrent models struggle. To overcome this, researchers often replace recurrent architectures with Neural ODE-based models to account for irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of even moderate length. To address this challenge, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences and incurs significantly lower computational costs. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global (multi-scale) dependencies in the input data, while remaining robust to changes in the sequence length and sampling frequency and yielding improved spatial processing. We find that, on a variety of time-series-related tasks, Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the representational benefits of Neural ODE-based models, all at a fraction of the computational time and memory resources.

[12]  arXiv:2405.20817 [pdf, other]
Title: Extremile scalar-on-function regression with application to climate scenarios
Subjects: Methodology (stat.ME)

Extremiles provide a generalization of quantiles which are not only robust, but also have an intrinsic link with extreme value theory. This paper introduces an extremile regression model tailored for functional covariate spaces. The estimation procedure turns out to be a weighted version of local linear scalar-on-function regression, where now a double kernel approach plays a crucial role. Asymptotic expressions for the bias and variance are established, applicable to both decreasing bandwidth sequences and automatically selected bandwidths. The methodology is then investigated in detail through a simulation study. Furthermore, we highlight the applicability of the model through the analysis of data sourced from the CH2018 Swiss climate scenarios project, offering insights into its ability to serve as a modern tool to quantify climate behaviour.

[13]  arXiv:2405.20856 [pdf, other]
Title: Parameter identification in linear non-Gaussian causal models under general confounding
Subjects: Methodology (stat.ME); Machine Learning (stat.ML)

Linear non-Gaussian causal models postulate that each random variable is a linear function of parent variables and non-Gaussian exogenous error terms. We study identification of the linear coefficients when such models contain latent variables. Our focus is on the commonly studied acyclic setting, where each model corresponds to a directed acyclic graph (DAG). For this case, prior literature has demonstrated that connections to overcomplete independent component analysis yield effective criteria to decide parameter identifiability in latent variable models. However, this connection is based on the assumption that the observed variables linearly depend on the latent variables. Departing from this assumption, we treat models that allow for arbitrary non-linear latent confounding. Our main result is a graphical criterion that is necessary and sufficient for deciding the generic identifiability of direct causal effects. Moreover, we provide an algorithmic implementation of the criterion with a run time that is polynomial in the number of observed variables. Finally, we report on estimation heuristics based on the identification result, explore a generalization to models with feedback loops, and provide new results on the identifiability of the causal graph.

[14]  arXiv:2405.20909 [pdf, other]
Title: Nonparametric regression on random geometric graphs sampled from submanifolds
Subjects: Statistics Theory (math.ST); Machine Learning (stat.ML)

We consider the nonparametric regression problem when the covariates are located on an unknown smooth compact submanifold of a Euclidean space. By defining a random geometric graph structure over the covariates, we analyze the asymptotic frequentist behaviour of the posterior distribution arising from Bayesian priors designed through random basis expansion in the graph Laplacian eigenbasis. Under Hölder smoothness assumptions on the regression function and the density of the covariates over the submanifold, we prove that the posterior contraction rates of such methods are minimax optimal (up to logarithmic factors) for any positive smoothness index.

[15]  arXiv:2405.20936 [pdf, other]
Title: Bayesian Deep Generative Models for Replicated Networks with Multiscale Overlapping Clusters
Subjects: Methodology (stat.ME)

Our interest is in replicated network data with multiple networks observed across the same set of nodes. Examples include brain connection networks, in which nodes correspond to brain regions and replicates to different individuals, and ecological networks, in which nodes correspond to species and replicates to samples collected at different locations and/or times. Our goal is to infer a hierarchical structure of the nodes at a population level, while performing multi-resolution clustering of the individual replicates. In brain connectomics, the focus is on inferring common relationships among the brain regions, while characterizing inter-individual variability in an easily interpretable manner. To accomplish this, we propose a Bayesian hierarchical model, while providing theoretical support in terms of identifiability and posterior consistency, and design efficient methods for posterior computation. We provide novel technical tools for proving model identifiability, which are of independent interest. Our simulations and application to brain connectome data provide support for the proposed methodology.

[16]  arXiv:2405.20957 [pdf, other]
Title: Data Fusion for Heterogeneous Treatment Effect Estimation with Multi-Task Gaussian Processes
Subjects: Methodology (stat.ME); Applications (stat.AP)

Bridging the gap between internal and external validity is crucial for heterogeneous treatment effect estimation. Randomised controlled trials (RCTs), favoured for their internal validity due to randomisation, often encounter challenges in generalising findings due to strict eligibility criteria. Observational studies on the other hand, provide external validity advantages through larger and more representative samples but suffer from compromised internal validity due to unmeasured confounding. Motivated by these complementary characteristics, we propose a novel Bayesian nonparametric approach leveraging multi-task Gaussian processes to integrate data from both RCTs and observational studies. In particular, we introduce a parameter which controls the degree of borrowing between the datasets and prevents the observational dataset from dominating the estimation. The value of the parameter can be either user-set or chosen through a data-adaptive procedure. Our approach outperforms other methods in point predictions across the covariate support of the observational study, and furthermore provides a calibrated measure of uncertainty for the estimated treatment effects, which is crucial when extrapolating. We demonstrate the robust performance of our approach in diverse scenarios through multiple simulation studies and a real-world education randomised trial.

[17]  arXiv:2405.20970 [pdf, other]
Title: PUAL: A Classifier on Trifurcate Positive-Unlabeled Data
Comments: 24 pages, 6 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Positive-unlabeled (PU) learning aims to train a classifier using data containing only labeled-positive instances and unlabeled instances. However, existing PU learning methods generally struggle to achieve satisfactory performance on trifurcate data, where the positive instances are distributed on both sides of the negative instances. To address this issue, we first propose a PU classifier with asymmetric loss (PUAL), by introducing a structure of asymmetric loss on positive instances into the objective function of the global and local learning classifier. Then we develop a kernel-based algorithm to enable PUAL to obtain a non-linear decision boundary. We show, through experiments on both simulated and real-world datasets, that PUAL can achieve satisfactory classification on trifurcate data.

[18]  arXiv:2405.20992 [pdf, other]
Title: A Novel Two-stage Deming Regression Framework with Applications to Association Analysis between Clinical Risks
Subjects: Applications (stat.AP)

In healthcare, clinical risks are crucial for treatment decisions, yet the analysis of their associations is often overlooked. This gap is particularly significant when balancing risks that are weighed against each other, as in the case of atrial fibrillation (AF) patients facing stroke and bleeding risks with anticoagulant medication. While traditional regression models are ill-suited for this task due to standard errors in risk estimation, a novel two-stage Deming regression framework is proposed to address this issue, offering a more accurate tool for analyzing associations between variables observed with errors of known or estimated variances. The first stage is to obtain the variable values with variances of errors either by estimation or observation, followed by the second stage that fits a Deming regression model potentially subject to a transformation. The second stage accounts for the uncertainties associated with both independent and response variables, including known or estimated variances and additional unknown variances from the model. The complexity arising from different scenarios of uncertainty is handled by existing and advanced variations of Deming regression models. An important practical application is to support personalized treatment recommendations based on clinical risk associations that were identified by the proposed framework. The model's effectiveness is demonstrated by applying it to a real-world dataset of AF-diagnosed patients to explore the relationship between stroke and bleeding risks, providing crucial guidance for making informed decisions regarding anticoagulant medication. Furthermore, the model's versatility in addressing data containing multiple sources of uncertainty such as privacy-protected data suggests promising avenues for future research in regression analysis.
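
To make the second-stage building block concrete, here is a minimal sketch of the classical closed-form Deming estimator, which fits a line when both variables carry measurement error. It is the standard single-stage estimator, not the two-stage framework proposed in the paper. The name deming_fit is hypothetical, and delta is assumed to be the known ratio of the error variance in y to the error variance in x.

```python
import numpy as np

def deming_fit(x, y, delta=1.0):
    """Classical Deming regression; delta = var(y errors) / var(x errors)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Second-order sample moments about the means
    s_xx = np.mean((x - x_bar) ** 2)
    s_yy = np.mean((y - y_bar) ** 2)
    s_xy = np.mean((x - x_bar) * (y - y_bar))
    if s_xy == 0.0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    diff = s_yy - delta * s_xx
    # Closed-form slope of the errors-in-variables line
    slope = (diff + np.sqrt(diff ** 2 + 4.0 * delta * s_xy ** 2)) / (2.0 * s_xy)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Noise-free toy data on the line y = 2x + 1
slope, intercept = deming_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

With delta = 1 this coincides with orthogonal regression; as delta grows, the fit approaches ordinary least squares of y on x.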

[19]  arXiv:2405.21020 [pdf, ps, other]
Title: Bayesian Estimation of Hierarchical Linear Models from Incomplete Data: Cluster-Level Interaction Effects and Small Sample Sizes
Subjects: Methodology (stat.ME)

We consider Bayesian estimation of a hierarchical linear model (HLM) from small sample sizes where 37 patient-physician encounters are repeatedly measured at four time points. The continuous response $Y$ and continuous covariates $C$ are partially observed and assumed missing at random. With $C$ having linear effects, the HLM may be efficiently estimated by available methods. When $C$ includes cluster-level covariates having interactive or other nonlinear effects given small sample sizes, however, maximum likelihood estimation is suboptimal, and existing Gibbs samplers are based on a Bayesian joint distribution compatible with the HLM, but impute missing values of $C$ by a Metropolis algorithm via a proposal density having a constant variance while the target conditional distribution has a nonconstant variance. Therefore, the samplers are not guaranteed to be compatible with the joint distribution and, thus, not guaranteed to always produce unbiased estimation of the HLM. We introduce a compatible Gibbs sampler that imputes parameters and missing values directly from the exact conditional distributions. We analyze repeated measurements from patient-physician encounters by our sampler, and compare our estimators with those of existing methods by simulation.

[20]  arXiv:2405.21037 [pdf, other]
Title: Introducing sgboost: A Practical Guide and Implementation of sparse-group boosting in R
Subjects: Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)

This paper introduces the sgboost package in R, which implements sparse-group boosting for modeling high-dimensional data with natural groupings in covariates. Sparse-group boosting offers a flexible approach for both group and individual variable selection, reducing overfitting and enhancing model interpretability. The package uses regularization techniques based on the degrees of freedom of individual and group base-learners, and is designed to be used in conjunction with the mboost package. Through comparisons with existing methods and demonstration of its unique functionalities, this paper provides a practical guide on utilizing sparse-group boosting in R, accompanied by code examples to facilitate its application in various research domains. Overall, this paper serves as a valuable resource for researchers and practitioners seeking to use sparse-group boosting for efficient and interpretable high-dimensional data analysis.

Cross-lists for Mon, 3 Jun 24

[21]  arXiv:2405.20390 (cross-list from cs.LG) [pdf, other]
Title: Quantitative Convergences of Lie Group Momentum Optimizers
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Machine Learning (stat.ML)

Explicit, momentum-based dynamics that optimize functions defined on Lie groups can be constructed via variational optimization and momentum trivialization. Structure-preserving time discretizations can then turn these dynamics into optimization algorithms. This article investigates two types of discretization, Lie Heavy-Ball, which is a known splitting scheme, and Lie NAG-SC, which is newly proposed. Their convergence rates are explicitly quantified under $L$-smoothness and local strong convexity assumptions. Lie NAG-SC provides acceleration over the momentumless case, i.e. Riemannian gradient descent, but Lie Heavy-Ball does not. When compared to existing accelerated optimizers for general manifolds, both Lie Heavy-Ball and Lie NAG-SC are computationally cheaper and easier to implement, thanks to their utilization of group structure. Only a gradient oracle and the exponential map are required, but not the logarithm map or parallel transport, which are computationally costly.

[22]  arXiv:2405.20405 (cross-list from cs.DS) [pdf, other]
Title: Private Mean Estimation with Person-Level Differential Privacy
Comments: 67 pages, 3 figures
Subjects: Data Structures and Algorithms (cs.DS); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Machine Learning (stat.ML)

We study differentially private (DP) mean estimation in the case where each person holds multiple samples. Commonly referred to as the "user-level" setting, DP here requires the usual notion of distributional stability when all of a person's datapoints can be modified. Informally, if $n$ people each have $m$ samples from an unknown $d$-dimensional distribution with bounded $k$-th moments, we show that
\[n = \tilde \Theta\left(\frac{d}{\alpha^2 m} + \frac{d }{ \alpha m^{1/2} \varepsilon} + \frac{d}{\alpha^{k/(k-1)} m \varepsilon} + \frac{d}{\varepsilon}\right)\]
people are necessary and sufficient to estimate the mean up to distance $\alpha$ in $\ell_2$-norm under $\varepsilon$-differential privacy (and its common relaxations). In the multivariate setting, we give computationally efficient algorithms under approximate DP (with slightly degraded sample complexity) and computationally inefficient algorithms under pure DP, and our nearly matching lower bounds hold for the most permissive case of approximate DP. Our computationally efficient estimators are based on the well known noisy-clipped-mean approach, but the analysis for our setting requires new bounds on the tails of sums of independent, vector-valued, bounded-moments random variables, and a new argument for bounding the bias introduced by clipping.
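
The noisy-clipped-mean idea referred to above can be sketched in one dimension as follows. This is a simplified illustration under pure $\varepsilon$-DP with Laplace noise, not the paper's estimator: the function name noisy_clipped_mean, the fixed clipping radius clip, and the choice to clip each person's average are all assumptions made for the example.

```python
import numpy as np

def noisy_clipped_mean(samples, clip, epsilon, rng=None):
    """Person-level DP mean of a (n_people, m_samples) array, 1-D values."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples, dtype=float)
    # Collapse each person's m samples into a single average
    person_means = samples.mean(axis=1)
    # Clip so that any one person can move the result only a bounded amount
    clipped = np.clip(person_means, -clip, clip)
    n = clipped.size
    # Replacing all of one person's data shifts the average by at most 2*clip/n
    sensitivity = 2.0 * clip / n
    # Laplace mechanism calibrated to that person-level sensitivity
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Three people with two samples each; the third person is an outlier
data = np.array([[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]])
print(noisy_clipped_mean(data, clip=5.0, epsilon=1.0, rng=np.random.default_rng(0)))
```

Clipping trades bias for lower sensitivity: the outlier's average of 100 is pulled down to 5, so the non-private clipped mean is 3 rather than the raw mean of about 34.7.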

[23]  arXiv:2405.20435 (cross-list from cs.LG) [pdf, other]
Title: Deep Learning for Computing Convergence Rates of Markov Chains
Subjects: Machine Learning (cs.LG); Probability (math.PR); Machine Learning (stat.ML)

Convergence rate analysis for general state-space Markov chains is fundamentally important in areas such as Markov chain Monte Carlo and algorithmic analysis (for computing explicit convergence bounds). This problem, however, is notoriously difficult because traditional analytical methods often do not generate practically useful convergence bounds for realistic Markov chains. We propose the Deep Contractive Drift Calculator (DCDC), the first general-purpose sample-based algorithm for bounding the convergence of Markov chains to stationarity in Wasserstein distance. The DCDC has two components. First, inspired by the new convergence analysis framework in (Qu et al., 2023), we introduce the Contractive Drift Equation (CDE), the solution of which leads to an explicit convergence bound. Second, we develop an efficient neural-network-based CDE solver. Equipped with these two components, DCDC solves the CDE and converts the solution into a convergence bound. We analyze the sample complexity of the algorithm and further demonstrate the effectiveness of the DCDC by generating convergence bounds for realistic Markov chains arising from stochastic processing networks as well as constant step-size stochastic optimization.

[24]  arXiv:2405.20452 (cross-list from cs.LG) [pdf, other]
Title: Understanding Encoder-Decoder Structures in Machine Learning Using Information Measures
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

We present new results to model and understand the role of encoder-decoder design in machine learning (ML) from an information-theoretic angle. We use two main information concepts, information sufficiency (IS) and mutual information loss (MIL), to represent predictive structures in machine learning. Our first main result provides a functional expression that characterizes the class of probabilistic models consistent with an IS encoder-decoder latent predictive structure. This result formally justifies the encoder-decoder forward stages many modern ML architectures adopt to learn latent (compressed) representations for classification. To illustrate IS as a realistic and relevant model assumption, we revisit some known ML concepts and present some interesting new examples: invariant, robust, sparse, and digital models. Furthermore, our IS characterization allows us to tackle the fundamental question of how much performance (predictive expressiveness) could be lost, using the cross entropy risk, when a given encoder-decoder architecture is adopted in a learning setting. Here, our second main result shows that a mutual information loss quantifies the lack of expressiveness attributed to the choice of a (biased) encoder-decoder ML design. Finally, we address the problem of universal cross-entropy learning with an encoder-decoder design where necessary and sufficient conditions are established to meet this requirement. In all these results, Shannon's information measures offer new interpretations and explanations for representation learning.

[25]  arXiv:2405.20482 (cross-list from cs.LG) [pdf, other]
Title: Leveraging Structure Between Environments: Phylogenetic Regularization Incentivizes Disentangled Representations
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Many causal systems such as biological processes in cells can only be observed indirectly via measurements, such as gene expression. Causal representation learning -- the task of correctly mapping low-level observations to latent causal variables -- could advance scientific understanding by enabling inference of latent variables such as pathway activation. In this paper, we develop methods for inferring latent variables from multiple related datasets (environments) and tasks. As a running example, we consider the task of predicting a phenotype from gene expression, where we often collect data from multiple cell types or organisms that are related in known ways. The key insight is that the mapping from latent variables driven by gene expression to the phenotype of interest changes sparsely across closely related environments. To model sparse changes, we introduce Tree-Based Regularization (TBR), an objective that both minimizes prediction error and regularizes closely related environments to learn similar predictors. We prove that under assumptions about the degree of sparse changes, TBR identifies the true latent variables up to some simple transformations. We evaluate the theory empirically with both simulations and ground-truth gene expression data. We find that TBR recovers the latent causal variables better than related methods across these settings, even under settings that violate some assumptions of the theory.

[26]  arXiv:2405.20528 (cross-list from math.OC) [pdf, ps, other]
Title: Convergence Analysis of the Sinkhorn Algorithm with Sparse Cost Matrices
Subjects: Optimization and Control (math.OC); Statistics Theory (math.ST)

This paper presents a theoretical analysis of the convergence rate of the Sinkhorn algorithm when the cost matrix is sparse. We derive bounds on the convergence rate that depend on the sparsity pattern and the degree of sparsity of the cost matrix. We also explore whether existing convergence results for dense cost matrices can be adapted or improved for the sparse case. Our analysis provides new insights into the behavior of the Sinkhorn algorithm in the presence of sparsity and highlights potential avenues for algorithmic improvements.
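
For readers unfamiliar with the algorithm being analyzed, the following is a minimal sketch of the standard Sinkhorn iteration for entropy-regularized optimal transport. The abstract does not say how sparsity is represented or exploited, so this example simply uses a small dense cost matrix with several zero entries; the function name, the regularization strength `reg`, and the iteration count are assumptions for illustration.

```python
import math


def sinkhorn(cost, a, b, reg=0.5, n_iters=500):
    """Approximate transport plan via alternating row/column scaling.

    cost: n x m list of lists; a, b: marginals summing to 1.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / reg); zero-cost entries map to 1.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical example: a 3 x 3 cost matrix in which most entries are zero.
cost = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.3, 0.2], b=[0.2, 0.3, 0.5])
```

The convergence rate in question is how quickly the row and column sums of `plan` approach the prescribed marginals as `n_iters` grows.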

[27]  arXiv:2405.20540 (cross-list from cs.LG) [pdf, ps, other]
Title: Fully Unconstrained Online Learning
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

We provide an online learning algorithm that obtains regret $G\|w_\star\|\sqrt{T\log(\|w_\star\|G\sqrt{T})} + \|w_\star\|^2 + G^2$ on $G$-Lipschitz convex losses for any comparison point $w_\star$ without knowing either $G$ or $\|w_\star\|$. Importantly, this matches the optimal bound $G\|w_\star\|\sqrt{T}$ available with such knowledge (up to logarithmic factors), unless either $\|w_\star\|$ or $G$ is so large that even $G\|w_\star\|\sqrt{T}$ is roughly linear in $T$. Thus, it matches the optimal bound in all cases in which one can achieve sublinear regret, which arguably covers most "interesting" scenarios.

[28]  arXiv:2405.20542 (cross-list from cs.LG) [pdf, ps, other]
Title: On the Connection Between Non-negative Matrix Factorization and Latent Dirichlet Allocation
Comments: 9 pages
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Non-negative matrix factorization with the generalized Kullback-Leibler divergence (NMF) and latent Dirichlet allocation (LDA) are two popular approaches for dimensionality reduction of non-negative data. Here, we show that NMF with $\ell_1$ normalization constraints on the columns of both matrices of the decomposition and a Dirichlet prior on the columns of one matrix is equivalent to LDA. To show this, we demonstrate that explicitly accounting for the scaling ambiguity of NMF by adding $\ell_1$ normalization constraints to the optimization problem allows a joint update of both matrices in the widely used multiplicative updates (MU) algorithm. When both of the matrices are normalized, the joint MU algorithm leads to probabilistic latent semantic analysis (PLSA), which is LDA without a Dirichlet prior. Our approach of deriving joint updates for NMF also reveals that a Lasso penalty on one matrix together with an $\ell_1$ normalization constraint on the other matrix is insufficient to induce any sparsity.

[29]  arXiv:2405.20550 (cross-list from cs.LG) [pdf, ps, other]
Title: Uncertainty Quantification for Deep Learning
Comments: 25 pages, 4 figures, submitted to Environmental Data Science
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

A complete and statistically consistent uncertainty quantification for deep learning is provided, including the sources of uncertainty arising from (1) the new input data, (2) the training and testing data, (3) the weight vectors of the neural network, and (4) the neural network because it is not a perfect predictor. Using Bayes' theorem and conditional probability densities, we demonstrate how each uncertainty source can be systematically quantified. We also introduce a fast and practical way to incorporate and combine all sources of errors for the first time. For illustration, the new method is applied to quantify errors in cloud autoconversion rates, predicted from an artificial neural network that was trained by aircraft cloud probe measurements in the Azores and the stochastic collection equation formulated as a two-moment bin model. For this specific example, the output uncertainty arising from uncertainty in the training and testing data is dominant, followed by uncertainty in the input data, in the trained neural network, and uncertainty in the weights. We discuss the usefulness of the methodology for machine learning practice, and how, through inclusion of uncertainty in the training data, the new methodology is less sensitive to input data that falls outside of the training data set.

[30]  arXiv:2405.20573 (cross-list from cs.LG) [pdf, other]
Title: Enhancing Generative Molecular Design via Uncertainty-guided Fine-tuning of Variational Autoencoders
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)

In recent years, deep generative models have been successfully adopted for various molecular design tasks, particularly in the life and material sciences. A critical challenge for pre-trained generative molecular design (GMD) models is to fine-tune them to be better suited for downstream design tasks aimed at optimizing specific molecular properties. However, redesigning and training an existing effective generative model from scratch for each new design task is impractical. Furthermore, the black-box nature of typical downstream tasks -- such as property prediction -- makes it nontrivial to optimize the generative model in a task-specific manner. In this work, we propose a novel approach for a model uncertainty-guided fine-tuning of a pre-trained variational autoencoder (VAE)-based GMD model through performance feedback in an active learning setting. The main idea is to quantify model uncertainty in the generative model, which is made efficient by working within a low-dimensional active subspace of the high-dimensional VAE parameters explaining most of the variability in the model's output. The inclusion of model uncertainty expands the space of viable molecules through decoder diversity. We then explore the resulting model uncertainty class via black-box optimization made tractable by low-dimensionality of the active subspace. This enables us to identify and leverage a diverse set of high-performing models to generate enhanced molecules. Empirical results across six target molecular properties, using multiple VAE-based generative models, demonstrate that our uncertainty-guided fine-tuning approach consistently outperforms the original pre-trained models.

[31]  arXiv:2405.20642 (cross-list from cs.LG) [pdf, other]
Title: Principal-Agent Multitasking: the Uniformity of Optimal Contracts and its Efficient Learning via Instrumental Regression
Authors: Shiliang Zuo
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

This work studies the multitasking principal-agent problem. I first show a ``uniformity'' result. Specifically, when the tasks are perfect substitutes, and the agent's cost function is homogeneous to a certain degree, then the optimal contract only depends on the marginal utility of each task and the degree of homogeneity. I then study a setting where the marginal utility of each task is unknown so that the optimal contract must be learned or estimated with observational data. I identify this problem as a regression problem with measurement errors and observe that this problem can be cast as an instrumental regression problem. The current work observes that both the contract and the repeated observations (when available) can act as valid instrumental variables, and proposes using the generalized method of moments estimator to compute an approximately optimal contract from offline data. I also study an online setting and show how the optimal contract can be efficiently learned in an online fashion using the two estimators. Here the principal faces an exploration-exploitation tradeoff: she must experiment with new contracts and observe their outcome whilst at the same time ensuring her experimentations are not deviating too much from the optimal contract. This work shows when repeated observations are available and agents are sufficiently ``diverse'', the principal can achieve a very low $\widetilde{O}(d)$ cumulative utility loss, even with a ``pure exploitation'' algorithm.

[32]  arXiv:2405.20677 (cross-list from cs.LG) [pdf, other]
Title: Provably Efficient Interactive-Grounded Learning with Personalized Reward
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Interactive-Grounded Learning (IGL) [Xie et al., 2021] is a powerful framework in which a learner aims at maximizing unobservable rewards through interacting with an environment and observing reward-dependent feedback on the taken actions. To deal with personalized rewards that are ubiquitous in applications such as recommendation systems, Maghakian et al. [2022] study a version of IGL with context-dependent feedback, but their algorithm does not come with theoretical guarantees. In this work, we consider the same problem and provide the first provably efficient algorithms with sublinear regret under realizability. Our analysis reveals that the step-function estimator of prior work can deviate uncontrollably due to finite-sample effects. Our solution is a novel Lipschitz reward estimator which underestimates the true reward and enjoys favorable generalization performances. Building on this estimator, we propose two algorithms, one based on explore-then-exploit and the other based on inverse-gap weighting. We apply IGL to learning from image feedback and learning from text feedback, which are reward-free settings that arise in practice. Experimental results showcase the importance of using our Lipschitz reward estimator and the overall effectiveness of our algorithms.

[33]  arXiv:2405.20678 (cross-list from cs.LG) [pdf, ps, other]
Title: No-Regret Learning for Fair Multi-Agent Social Welfare Optimization
Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Machine Learning (stat.ML)

We consider the problem of online multi-agent Nash social welfare (NSW) maximization. While previous works of Hossain et al. [2021], Jones et al. [2023] study similar problems in stochastic multi-agent multi-armed bandits and show that $\sqrt{T}$-regret is possible after $T$ rounds, their fairness measure is the product of all agents' rewards, instead of their NSW (that is, their geometric mean). Given the fundamental role of NSW in the fairness literature, it is more than natural to ask whether no-regret fair learning with NSW as the objective is possible. In this work, we provide a complete answer to this question in various settings. Specifically, in stochastic $N$-agent $K$-armed bandits, we develop an algorithm with $\widetilde{\mathcal{O}}\left(K^{\frac{2}{N}}T^{\frac{N-1}{N}}\right)$ regret and prove that the dependence on $T$ is tight, making it a sharp contrast to the $\sqrt{T}$-regret bounds of Hossain et al. [2021], Jones et al. [2023]. We then consider a more challenging version of the problem with adversarial rewards. Somewhat surprisingly, despite NSW being a concave function, we prove that no algorithm can achieve sublinear regret. To circumvent such negative results, we further consider a setting with full-information feedback and design two algorithms with $\sqrt{T}$-regret: the first one has no dependence on $N$ at all and is applicable to not just NSW but a broad class of welfare functions, while the second one has better dependence on $K$ and is preferable when $N$ is small. Finally, we also show that logarithmic regret is possible whenever there exists one agent who is indifferent about different arms.
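
The distinction the abstract draws between the product of rewards and the NSW (their geometric mean) can be seen in a small worked example. This is only a sketch; the function names and the reward values are hypothetical.

```python
import math


def product_welfare(rewards):
    """Product of all agents' rewards."""
    return math.prod(rewards)


def nash_social_welfare(rewards):
    """Geometric mean of non-negative per-agent rewards."""
    return math.prod(rewards) ** (1.0 / len(rewards))


rewards = [0.25, 1.0]  # hypothetical expected rewards for N = 2 agents
print(product_welfare(rewards))      # 0.25
print(nash_social_welfare(rewards))  # 0.5
```

Both objectives rank reward profiles identically, but the $N$-th root changes how a gap in rewards translates into a gap in welfare, so regret measured in the two objectives is not on the same scale.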

[34]  arXiv:2405.20724 (cross-list from cs.LG) [pdf, other]
Title: Learning on Large Graphs using Intersecting Communities
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)

Message Passing Neural Networks (MPNNs) are a staple of graph machine learning. MPNNs iteratively update each node's representation in an input graph by aggregating messages from the node's neighbors, which necessitates a memory complexity of the order of the number of graph edges. This complexity might quickly become prohibitive for large graphs provided they are not very sparse. In this paper, we propose a novel approach to alleviate this problem by approximating the input graph as an intersecting community graph (ICG) -- a combination of intersecting cliques. The key insight is that the number of communities required to approximate a graph does not depend on the graph size. We develop a new constructive version of the Weak Graph Regularity Lemma to efficiently construct an approximating ICG for any input graph. We then devise an efficient graph learning algorithm operating directly on ICG in linear memory and time with respect to the number of nodes (rather than edges). This offers a new and fundamentally different pipeline for learning on very large non-sparse graphs, whose applicability is demonstrated empirically on node classification tasks and spatio-temporal data processing.

[35]  arXiv:2405.20763 (cross-list from cs.LG) [pdf, other]
Title: Improving Generalization and Convergence by Enhancing Implicit Regularization
Comments: 35 pages
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

In this work, we propose an Implicit Regularization Enhancement (IRE) framework to accelerate the discovery of flat solutions in deep learning, thereby improving generalization and convergence. Specifically, IRE decouples the dynamics of flat and sharp directions, which boosts the sharpness reduction along flat directions while maintaining the training stability in sharp directions. We show that IRE can be practically incorporated with generic base optimizers without introducing significant computational overhead. Experiments show that IRE consistently improves the generalization performance for image classification tasks across a variety of benchmark datasets (CIFAR-10/100, ImageNet) and models (ResNets and ViTs). Surprisingly, IRE also achieves a $2\times$ speed-up compared to AdamW in the pre-training of Llama models (of sizes ranging from 60M to 229M) on datasets including Wikitext-103, Minipile, and Openwebtext. Moreover, we provide theoretical guarantees, showing that IRE can substantially accelerate the convergence towards flat minima in Sharpness-aware Minimization (SAM).

[36]  arXiv:2405.20769 (cross-list from cs.CR) [pdf, other]
Title: Avoiding Pitfalls for Privacy Accounting of Subsampled Mechanisms under Composition
Subjects: Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider the problem of computing tight privacy guarantees for the composition of subsampled differentially private mechanisms. Recent algorithms can numerically compute the privacy parameters to arbitrary precision but must be carefully applied.
Our main contribution is to address two common points of confusion. First, some privacy accountants assume that the privacy guarantees for the composition of a subsampled mechanism are determined by self-composing the worst-case datasets for the uncomposed mechanism. We show that this is not true in general. Second, Poisson subsampling is sometimes assumed to have similar privacy guarantees compared to sampling without replacement. We show that the privacy guarantees may in fact differ significantly between the two sampling schemes. In particular, we give an example of hyperparameters that result in $\varepsilon \approx 1$ for Poisson subsampling and $\varepsilon > 10$ for sampling without replacement. This occurs for some parameters that could realistically be chosen for DP-SGD.
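
To make the two sampling schemes concrete, here is a minimal sketch of how a batch might be drawn under each. It only illustrates the sampling step and does not compute any privacy parameter; the function names are hypothetical.

```python
import random


def poisson_subsample(n, q, rng):
    """Include each of n records independently with probability q.

    The batch size is random, with mean q * n.
    """
    return [i for i in range(n) if rng.random() < q]


def sample_without_replacement(n, batch_size, rng):
    """Draw a uniformly random subset of exactly batch_size records."""
    return rng.sample(range(n), batch_size)


rng = random.Random(0)
poisson_batch = poisson_subsample(1000, 0.1, rng)
fixed_batch = sample_without_replacement(1000, 100, rng)
```

The two schemes have the same expected batch size when `batch_size == q * n`, yet they induce different distributions over batches, which is why an accountant built for one cannot be assumed to cover the other.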

[37]  arXiv:2405.20779 (cross-list from cs.CR) [pdf, ps, other]
Title: Asymptotic utility of spectral anonymization
Comments: 16 pages, 6 figures
Subjects: Cryptography and Security (cs.CR); Methodology (stat.ME)

In the contemporary data landscape characterized by multi-source data collection and third-party sharing, ensuring individual privacy stands as a critical concern. While various anonymization methods exist, their utility preservation and privacy guarantees remain challenging to quantify. In this work, we address this gap by studying the utility and privacy of the spectral anonymization (SA) algorithm, particularly in an asymptotic framework. Unlike conventional anonymization methods that directly modify the original data, SA operates by perturbing the data in a spectral basis and subsequently reverting them to their original basis. Alongside the original version $\mathcal{P}$-SA, employing random permutation transformation, we introduce two novel SA variants: $\mathcal{J}$-spectral anonymization and $\mathcal{O}$-spectral anonymization, which employ sign-change and orthogonal matrix transformations, respectively. We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data. Our results reveal, in particular, that the asymptotic efficiency of all three SA algorithms in covariance estimation is exactly 50% when compared to the original data. To assess the applicability of these asymptotic results in practice, we conduct a simulation study with finite data and also evaluate the privacy protection offered by these algorithms using distance-based record linkage. Our research reveals that while no method exhibits clear superiority in finite-sample utility, $\mathcal{O}$-SA distinguishes itself for its exceptional privacy preservation, never producing identical records, albeit with increased computational complexity. Conversely, $\mathcal{P}$-SA emerges as a computationally efficient alternative, demonstrating unmatched efficiency in mean estimation.

[38]  arXiv:2405.20782 (cross-list from cs.CR) [pdf, other]
Title: Universal Exact Compression of Differentially Private Mechanisms
Comments: 30 pages, 3 figures
Subjects: Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (stat.ML)

To reduce the communication cost of differential privacy mechanisms, we introduce a novel construction, called Poisson private representation (PPR), designed to compress and simulate any local randomizer while ensuring local differential privacy. Unlike previous simulation-based local differential privacy mechanisms, PPR exactly preserves the joint distribution of the data and the output of the original local randomizer. Hence, the PPR-compressed privacy mechanism retains all desirable statistical properties of the original privacy mechanism such as unbiasedness and Gaussianity. Moreover, PPR achieves a compression size within a logarithmic gap from the theoretical lower bound. Using the PPR, we give a new order-wise trade-off between communication, accuracy, central and local differential privacy for distributed mean estimation. Experiment results on distributed mean estimation show that PPR consistently gives a better trade-off between communication, accuracy and central differential privacy compared to the coordinate subsampled Gaussian mechanism, while also providing local differential privacy.

[39]  arXiv:2405.20821 (cross-list from cs.LG) [pdf, other]
Title: Pursuing Overall Welfare in Federated Learning through Sequential Decision Making
Comments: Accepted at ICML 2024
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)

In traditional federated learning, a single global model cannot perform equally well for all clients. Therefore, the need to achieve client-level fairness in federated systems has been emphasized, which can be realized by modifying the static aggregation scheme for updating the global model to an adaptive one, in response to the local signals of the participating clients. Our work reveals that existing fairness-aware aggregation strategies can be unified into an online convex optimization framework, in other words, a central server's sequential decision making process. To enhance the decision making capability, we propose simple and intuitive improvements for suboptimal designs within existing methods, presenting AAggFF. Considering practical requirements, we further subdivide our method tailored for the cross-device and the cross-silo settings, respectively. Theoretical analyses guarantee sublinear regret upper bounds for both settings: $\mathcal{O}(\sqrt{T \log{K}})$ for the cross-device setting, and $\mathcal{O}(K \log{T})$ for the cross-silo setting, with $K$ clients and $T$ federation rounds. Extensive experiments demonstrate that the federated system equipped with AAggFF achieves a better degree of client-level fairness than existing methods in both practical settings. Code is available at https://github.com/vaseline555/AAggFF

[40]  arXiv:2405.20824 (cross-list from cs.LG) [pdf, ps, other]
Title: Online Convex Optimisation: The Optimal Switching Regret for all Segmentations Simultaneously
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider the classic problem of online convex optimisation. Whereas the notion of static regret is relevant for stationary problems, the notion of switching regret is more appropriate for non-stationary problems. A switching regret is defined relative to any segmentation of the trial sequence, and is equal to the sum of the static regrets of each segment. In this paper we show that, perhaps surprisingly, we can achieve the asymptotically optimal switching regret on every possible segmentation simultaneously. Our algorithm for doing so is very efficient: having a space and per-trial time complexity that is logarithmic in the time-horizon. Our algorithm also obtains novel bounds on its dynamic regret: being adaptive to variations in the rate of change of the comparator sequence.
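
The definition of switching regret given above (the sum of the static regrets of each segment) can be sketched for a toy case. The example assumes one-dimensional squared losses $f_t(x) = (x - y_t)^2$ on an unconstrained domain, so the best fixed point on a segment is the mean of its targets; these choices and the function names are assumptions for illustration, not the paper's setting.

```python
def static_regret(plays, targets):
    """Learner's loss minus the loss of the best fixed point in hindsight."""
    best = sum(targets) / len(targets)  # minimizer of the summed squared loss
    learner = sum((x - y) ** 2 for x, y in zip(plays, targets))
    comparator = sum((best - y) ** 2 for y in targets)
    return learner - comparator


def switching_regret(plays, targets, boundaries):
    """Sum of static regrets over segments.

    boundaries: sorted start indices of the segments, beginning with 0.
    """
    ends = list(boundaries[1:]) + [len(targets)]
    return sum(
        static_regret(plays[s:e], targets[s:e]) for s, e in zip(boundaries, ends)
    )


targets = [0.0, 0.0, 1.0, 1.0]
plays = [0.5, 0.5, 0.5, 0.5]
print(switching_regret(plays, targets, [0]))     # 0.0 (one segment)
print(switching_regret(plays, targets, [0, 2]))  # 1.0 (two segments)
```

The same sequence of plays has zero static regret but positive switching regret once the segmentation follows the change in targets, which is why the latter suits non-stationary problems.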

[41]  arXiv:2405.20838 (cross-list from cs.LG) [pdf, other]
Title: einspace: Searching for Neural Architectures from Fundamental Operations
Comments: Project page at this https URL
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Neural architecture search (NAS) finds high performing networks for a given task. Yet the results of NAS are fairly prosaic; they did not e.g. create a shift from convolutional structures to transformers. This is not least because the search spaces in NAS often aren't diverse enough to include such transformations a priori. Instead, for NAS to provide greater potential for fundamental design shifts, we need a novel expressive search space design which is built from more fundamental operations. To this end, we introduce einspace, a search space based on a parameterised probabilistic context-free grammar. Our space is versatile, supporting architectures of various sizes and complexities, while also containing diverse network operations which allow it to model convolutions, attention components and more. It contains many existing competitive architectures, and provides flexibility for discovering new ones. Using this search space, we perform experiments to find novel architectures as well as improvements on existing ones on the diverse Unseen NAS datasets. We show that competitive architectures can be obtained by searching from scratch, and we consistently find large improvements when initialising the search with strong baselines. We believe that this work is an important advancement towards a transformative NAS paradigm where search space expressivity and strategic search initialisation play key roles.

[42]  arXiv:2405.20877 (cross-list from cs.IT) [pdf, other]
Title: Waveform Design for Over-the-Air Computing
Comments: 14 pages
Subjects: Information Theory (cs.IT); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP); Statistics Theory (math.ST)

In response to the increasing number of devices anticipated in next-generation networks, a shift toward over-the-air (OTA) computing has been proposed. Leveraging the superposition of multiple access channels, OTA computing enables efficient resource management by supporting simultaneous uncoded transmission in the time and the frequency domain. Thus, to advance the integration of OTA computing, our study presents a theoretical analysis addressing practical issues encountered in current digital communication transceivers, such as time sampling error and intersymbol interference (ISI). To this end, we examine the theoretical mean squared error (MSE) for OTA transmission under time sampling error and ISI, while also exploring methods for minimizing the MSE in the OTA transmission. Utilizing alternating optimization, we also derive optimal power policies for both the devices and the base station. Additionally, we propose a novel deep neural network (DNN)-based approach to design waveforms enhancing OTA transmission performance under time sampling error and ISI. To ensure fair comparison with existing waveforms like the raised cosine (RC) and the better-than-raised-cosine (BTRC), we incorporate a custom loss function integrating energy and bandwidth constraints, along with practical design considerations such as waveform symmetry. Simulation results validate our theoretical analysis and demonstrate performance gains of the designed pulse over RC and BTRC waveforms. To facilitate testing of our results without necessitating the DNN structure recreation, we provide curve fitting parameters for select DNN-based waveforms as well.

[43]  arXiv:2405.20915 (cross-list from cs.LG) [pdf, other]
Title: Fast yet Safe: Early-Exiting with Risk Control
Comments: 25 pages, 11 figures, 4 tables (incl. appendix)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Scaling machine learning models significantly improves their performance. However, such gains come at the cost of inference being slow and resource-intensive. Early-exit neural networks (EENNs) offer a promising solution: they accelerate inference by allowing intermediate layers to exit and produce a prediction early. Yet a fundamental issue with EENNs is how to determine when to exit without severely degrading performance. In other words, when is it 'safe' for an EENN to go 'fast'? To address this issue, we investigate how to adapt frameworks of risk control to EENNs. Risk control offers a distribution-free, post-hoc solution that tunes the EENN's exiting mechanism so that exits only occur when the output is of sufficient quality. We empirically validate our insights on a range of vision and language tasks, demonstrating that risk control can produce substantial computational savings, all the while preserving user-specified performance goals.

[44]  arXiv:2405.20918 (cross-list from cs.SI) [pdf, other]
Title: Flexible inference in heterogeneous and attributed multilayer networks
Subjects: Social and Information Networks (cs.SI); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)

Networked datasets are often enriched by different types of information about individual nodes or edges. However, most existing methods for analyzing such datasets struggle to handle the complexity of heterogeneous data, often requiring substantial model-specific analysis. In this paper, we develop a probabilistic generative model to perform inference in multilayer networks with arbitrary types of information. Our approach employs a Bayesian framework combined with the Laplace matching technique to ease interpretation of inferred parameters. Furthermore, the algorithmic implementation relies on automatic differentiation, avoiding the need for explicit derivations. This makes our model scalable and flexible to adapt to any combination of input data. We demonstrate the effectiveness of our method in detecting overlapping community structures and performing various prediction tasks on heterogeneous multilayer data, where nodes and edges have different types of attributes. Additionally, we showcase its ability to unveil a variety of patterns in a social support network among villagers in rural India by effectively utilizing all input information in a meaningful way.

[45]  arXiv:2405.20933 (cross-list from cs.LG) [pdf, ps, other]
Title: Concentration Bounds for Optimized Certainty Equivalent Risk Estimation
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

We consider the problem of estimating the Optimized Certainty Equivalent (OCE) risk from independent and identically distributed (i.i.d.) samples. For the classic sample average approximation (SAA) of OCE, we derive mean-squared error as well as concentration bounds (assuming sub-Gaussianity). Further, we analyze an efficient stochastic approximation-based OCE estimator, and derive finite-sample bounds for it. To show the applicability of our bounds, we consider a risk-aware bandit problem, with OCE as the risk. For this problem, we derive a bound on the probability of mis-identification. Finally, we conduct numerical experiments to validate the theoretical findings.
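
As a rough illustration of the SAA estimator mentioned above, the sketch below assumes the loss-based form of the OCE, inf over lambda of lambda + E[phi(X - lambda)], with a convex disutility phi. The abstract does not fix this sign convention, so it is an assumption. The helper names `oce_saa` and `cvar_phi` are hypothetical. The example uses the CVaR disutility, for which the sample objective is piecewise linear and its minimum is attained at a sample point.

```python
def oce_saa(samples, phi, candidates=None):
    """Sample average approximation of inf_lambda { lambda + mean(phi(x - lambda)) }.

    The infimum is searched over a finite candidate set (the samples by default).
    That search is exact for piecewise-linear phi such as CVaR, and only an
    approximation for a general phi.
    """
    if candidates is None:
        candidates = samples
    n = len(samples)
    return min(lam + sum(phi(x - lam) for x in samples) / n for lam in candidates)

def cvar_phi(alpha):
    """Disutility phi(t) = max(t, 0) / (1 - alpha), which turns the OCE into CVaR at level alpha."""
    return lambda t: max(t, 0.0) / (1.0 - alpha)

# Four loss samples; CVaR at level 0.5 averages the worst half of them.
estimate = oce_saa([1.0, 2.0, 3.0, 4.0], cvar_phi(0.5))
```

For a smooth phi, a one-dimensional convex solver over lambda would replace the finite search used here.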

[46]  arXiv:2405.20954 (cross-list from cs.LG) [pdf, other]
Title: Aligning Multiclass Neural Network Classifier Criterion with Task Performance via $F_\beta$-Score
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Multiclass neural network classifiers are typically trained using cross-entropy loss. Following training, the performance of this same neural network is evaluated using an application-specific metric based on the multiclass confusion matrix, such as the Macro $F_\beta$-Score. It is questionable whether the use of cross-entropy will yield a classifier that aligns with the intended application-specific performance criteria, particularly in scenarios where there is a need to emphasize one aspect of classifier performance. For example, if greater precision is preferred over recall, the $\beta$ value in the $F_\beta$ evaluation metric can be adjusted accordingly, but the cross-entropy objective remains unaware of this preference during training. We propose a method that addresses this training-evaluation gap for multiclass neural network classifiers such that users can train these models informed by the desired final $F_\beta$-Score. Following prior work in binary classification, we utilize the concepts of the soft-set confusion matrices and a piecewise-linear approximation of the Heaviside step function. Our method extends the $2 \times 2$ binary soft-set confusion matrix to a multiclass $d \times d$ confusion matrix and proposes dynamic adaptation of the threshold value $\tau$, which parameterizes the piecewise-linear Heaviside approximation during run-time. We present a theoretical analysis that shows that our method can be used to optimize for a soft-set based approximation of Macro-$F_\beta$ that is a consistent estimator of Macro-$F_\beta$, and our extensive experiments show the practical effectiveness of our approach.
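
To make the ingredients named in this abstract concrete, here is a minimal sketch of a soft confusion matrix and the Macro-$F_\beta$ computed from it. It does not reproduce the paper's construction: the ramp `soft_step` is a simple stand-in for a piecewise-linear Heaviside approximation, its width `delta` is an assumed parameter, and all function names are hypothetical. Only the per-class $F_\beta$ formula, $(1+\beta^2)\,TP / ((1+\beta^2)\,TP + \beta^2\,FN + FP)$, is standard.

```python
def soft_step(p, tau, delta=0.1):
    """Ramp rising from 0 to 1 across [tau - delta, tau + delta]; stand-in for the Heaviside step."""
    return min(1.0, max(0.0, (p - tau) / (2.0 * delta) + 0.5))

def soft_confusion(probs, labels, num_classes, tau):
    """Entry [i][j] accumulates the soft 'predicted j' vote of samples whose true class is i."""
    matrix = [[0.0] * num_classes for _ in range(num_classes)]
    for row, y in zip(probs, labels):
        for j in range(num_classes):
            matrix[y][j] += soft_step(row[j], tau)
    return matrix

def macro_f_beta(matrix, beta):
    """Unweighted mean of per-class F_beta scores read off a d x d confusion matrix."""
    d = len(matrix)
    b2 = beta * beta
    scores = []
    for k in range(d):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                        # true class k, voted elsewhere
        fp = sum(matrix[i][k] for i in range(d)) - tp   # other classes, voted k
        denom = (1.0 + b2) * tp + b2 * fn + fp
        scores.append((1.0 + b2) * tp / denom if denom > 0 else 0.0)
    return sum(scores) / d
```

With one-hot probabilities and `tau = 0.5` the soft matrix coincides with the usual hard confusion matrix, so the sketch reduces to the ordinary Macro-$F_\beta$ in that case.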

[47]  arXiv:2405.20993 (cross-list from cs.IT) [pdf, other]
Title: Information limits and Thouless-Anderson-Palmer equations for spiked matrix models with structured noise
Subjects: Information Theory (cs.IT); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Statistics Theory (math.ST)

We consider a prototypical problem of Bayesian inference for a structured spiked model: a low-rank signal is corrupted by additive noise. While both information-theoretic and algorithmic limits are well understood when the noise is i.i.d. Gaussian, the more realistic case of structured noise still proves to be challenging. To capture the structure while maintaining mathematical tractability, a line of work has focused on rotationally invariant noise. However, existing studies either provide sub-optimal algorithms or they are limited to a special class of noise ensembles. In this paper, we establish the first characterization of the information-theoretic limits for a noise matrix drawn from a general trace ensemble. These limits are then achieved by an efficient algorithm inspired by the theory of adaptive Thouless-Anderson-Palmer (TAP) equations. Our approach leverages tools from statistical physics (replica method) and random matrix theory (generalized spherical integrals), and it unveils the equivalence between the rotationally invariant model and a surrogate Gaussian model.

[48]  arXiv:2405.21012 (cross-list from cs.LG) [pdf, other]
Title: G-Transformer for Conditional Average Potential Outcome Estimation over Time
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)

Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. Yet, existing neural methods for this task suffer from either (a) bias or (b) large variance. In order to address both limitations, we introduce the G-transformer (GT). Our GT is a novel, neural end-to-end model designed for unbiased, low-variance estimation of conditional average potential outcomes (CAPOs) over time. Specifically, our GT is the first neural model to perform regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our GT across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.

[49]  arXiv:2405.21046 (cross-list from cs.LG) [pdf, other]
Title: Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Reinforcement learning from human feedback (RLHF) has emerged as a central tool for language model alignment. We consider online exploration in RLHF, which exploits interactive access to human or AI feedback by deliberately encouraging the model to produce diverse, maximally informative responses. By allowing RLHF to confidently stray from the pre-trained model, online exploration offers the possibility of novel, potentially super-human capabilities, but its full potential as a paradigm for language model training has yet to be realized, owing to computational and statistical bottlenecks in directly adapting existing reinforcement learning techniques. We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO), which is simple and practical -- a one-line change to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023) -- yet enjoys the strongest known provable guarantees and promising empirical performance. XPO augments the DPO objective with a novel and principled exploration bonus, empowering the algorithm to explore outside the support of the initial model and human feedback data. In theory, we show that XPO is provably sample-efficient and converges to a near-optimal language model policy under natural exploration conditions, irrespective of whether the initial model has good coverage. Our analysis, which builds on the observation that DPO implicitly performs a form of $Q^{\star}$-approximation (or, Bellman error minimization), combines previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the perspective of KL-regularized Markov decision processes. Empirically, we find that XPO is more sample-efficient than non-exploratory DPO variants in a preliminary evaluation.

Replacements for Mon, 3 Jun 24

[50]  arXiv:2002.01605 (replaced) [pdf, ps, other]
Title: Exploratory Machine Learning with Unknown Unknowns
Comments: published at Artificial Intelligence, preliminary conference version published at AAAI'21
Journal-ref: Artificial Intelligence, Volume 327, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[51]  arXiv:2104.11702 (replaced) [pdf, other]
Title: Correlated Dynamics in Marketing Sensitivities
Authors: Ryan Dew, Yuhao Fan
Subjects: Applications (stat.AP); Econometrics (econ.EM); Machine Learning (stat.ML)
[52]  arXiv:2201.02532 (replaced) [pdf, other]
Title: Approximate Factor Models for Functional Time Series
Subjects: Econometrics (econ.EM); Methodology (stat.ME)
[53]  arXiv:2212.00394 (replaced) [pdf, other]
Title: From CNNs to Shift-Invariant Twin Models Based on Complex Wavelets
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Machine Learning (stat.ML)
[54]  arXiv:2301.06650 (replaced) [pdf, other]
Title: Enhancing Deep Traffic Forecasting Models with Dynamic Regression
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[55]  arXiv:2303.00728 (replaced) [pdf, other]
Title: On the universality of $S_n$-equivariant $k$-body gates
Comments: 7+15 pages, 3+5 figures, updated to published version
Journal-ref: New J. Phys. 26, 053030 (2024)
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
[56]  arXiv:2303.01887 (replaced) [pdf, other]
Title: Fast Forecasting of Unstable Data Streams for On-Demand Service Platforms
Subjects: Econometrics (econ.EM); Applications (stat.AP)
[57]  arXiv:2308.14143 (replaced) [pdf, other]
Title: Ensemble-localized Kernel Density Estimation with Applications to the Ensemble Gaussian Mixture Filter
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA); Applications (stat.AP)
[58]  arXiv:2308.14906 (replaced) [pdf, other]
Title: BayOTIDE: Bayesian Online Multivariate Time series Imputation with functional decomposition
Comments: Accepted by The 41st International Conference on Machine Learning (ICML 2024)
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[59]  arXiv:2309.02369 (replaced) [pdf, other]
Title: Adaptive Bayesian Predictive Inference in High-dimensional Regression
Authors: Veronika Rockova
Subjects: Statistics Theory (math.ST)
[60]  arXiv:2309.14512 (replaced) [pdf, ps, other]
Title: Byzantine-Resilient Federated PCA and Low Rank Column-wise Sensing
Comments: 36 pages
Subjects: Information Theory (cs.IT); Machine Learning (stat.ML)
[61]  arXiv:2309.16476 (replaced) [pdf, other]
Title: High-dimensional robust regression under heavy-tailed data: Asymptotics and Universality
Comments: 13 pages + Supplementary information
Subjects: Statistics Theory (math.ST); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
[62]  arXiv:2311.09388 (replaced) [pdf, other]
Title: Synthesis estimators for positivity violations with a continuous covariate
Subjects: Methodology (stat.ME)
[63]  arXiv:2311.14492 (replaced) [pdf, other]
Title: Numerical Generalized Randomized HMC processes for restricted domains
Subjects: Computation (stat.CO)
[64]  arXiv:2401.11130 (replaced) [pdf, ps, other]
Title: Identification and Estimation of Conditional Average Partial Causal Effects via Instrumental Variable
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[65]  arXiv:2402.01000 (replaced) [pdf, other]
Title: Multivariate Probabilistic Time Series Forecasting with Correlated Errors
Comments: This paper extends the work presented in arXiv:2305.17028 to a multivariate setting
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[66]  arXiv:2402.07131 (replaced) [pdf, other]
Title: Resampling methods for Private Statistical Inference
Comments: 45 pages
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
[67]  arXiv:2402.08097 (replaced) [pdf, ps, other]
Title: An Accelerated Gradient Method for Convex Smooth Simple Bilevel Optimization
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
[68]  arXiv:2402.09033 (replaced) [pdf, other]
Title: Cross-Temporal Forecast Reconciliation at Digital Platforms with Machine Learning
Subjects: Econometrics (econ.EM); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
[69]  arXiv:2402.09723 (replaced) [pdf, other]
Title: Efficient Prompt Optimization Through the Lens of Best Arm Identification
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
[70]  arXiv:2402.13901 (replaced) [pdf, other]
Title: Non-asymptotic Convergence of Discrete-time Diffusion Models: New Approach and Improved Rate
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
[71]  arXiv:2403.01371 (replaced) [pdf, other]
Title: eXponential FAmily Dynamical Systems (XFADS): Large-scale nonlinear Gaussian state-space modeling
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
[72]  arXiv:2403.12166 (replaced) [pdf, other]
Title: The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection
Comments: Accepted to ICASSP 2024
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[73]  arXiv:2404.04455 (replaced) [pdf, other]
Title: Tomographic reconstruction of a disease transmission landscape via GPS recorded random paths
Subjects: Applications (stat.AP)
[74]  arXiv:2404.09636 (replaced) [pdf, other]
Title: All-in-one simulation-based inference
Comments: To be published in the proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria. PMLR 235, 2024
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
[75]  arXiv:2405.15682 (replaced) [pdf, other]
Title: The Road Less Scheduled
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
[76]  arXiv:2405.16069 (replaced) [pdf, other]
Title: IncomeSCM: From tabular data set to time-series simulator and causal estimation benchmark
Subjects: Machine Learning (cs.LG); Methodology (stat.ME)
[77]  arXiv:2405.19059 (replaced) [pdf, other]
Title: Robust Entropy Search for Safe Efficient Bayesian Optimization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
[78]  arXiv:2405.19523 (replaced) [pdf, other]
Title: Comparison of Point Process Learning and its special case Takacs-Fiksel estimation
Comments: Main text: 30 pages, 11 figures. Appendix: 26 pages, 10 figures. Total: 56 pages, 21 figures
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)