
Data Augmentation with Confidence: Evaluating RL Policies with PERRY

By Aishwarya Mandyam

As generative models become stronger, other areas of machine learning are starting to use them to improve existing real datasets. In many subfields of machine learning, dataset size and quality are ongoing challenges. For example, in healthcare, we often have small datasets that do not include all possible patients. This presents a problem for many modern algorithms that assume access to large datasets that encompass a diverse set of samples (e.g., patients).

Generative models are now of sufficiently high quality to change how we think about these settings. Instead of treating limited data as a fixed constraint, we can use generative models to create synthetic datasets that approximate or even extend the distributions of real data. This marks a shift toward a new paradigm, one in which synthetic data can actively shape model development, benchmarking, and evaluation.

Now, suppose we are designing a new treatment policy for healthcare. For example, say we propose a new dosing strategy for administering chemotherapy to a particular sub-population of patients. We have data from a historical policy (the “behavior” policy, which describes how patients have received chemotherapy so far), but we want to know: if we deploy this new treatment policy, how well will it perform? The challenge of evaluating this policy arises from a fundamental issue: it can be unethical, and risky, to test an unproven treatment regime directly on this patient sub-population. We therefore turn to off-policy evaluation (OPE), which estimates the value of a new policy using only data that was already collected under the behavior policy.
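To make the OPE setting concrete, here is a minimal sketch of the simplest classical estimator, per-trajectory importance sampling. This illustrates the problem setup, not PERRY itself, and every name in it is hypothetical:

```python
import numpy as np

def importance_sampling_ope(trajectories, target_prob, behavior_prob, gamma=0.99):
    """Estimate a target policy's value from behavior-policy trajectories.

    trajectories: list of trajectories, each a list of (state, action, reward)
    target_prob(s, a): probability the target policy takes action a in state s
    behavior_prob(s, a): probability the behavior policy took a in s
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # Reweight by how much more (or less) likely the target policy
            # was to take the logged action than the behavior policy.
            weight *= target_prob(s, a) / behavior_prob(s, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

When the two policies coincide, every weight is 1 and the estimate reduces to the average discounted return; when they diverge, the weights can grow very large, which is one reason a larger, better-composed behavior dataset is so valuable.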

Similar to other ML algorithms, OPE can benefit from data augmentation. The primary limitation of prior work in this area is its reliance on access to a large and representative dataset from the behavior policy. Synthetic data can improve the size and composition of that dataset. However, if we mix real and synthetic data without correcting for possibly erroneous synthetic samples, the OPE estimates can become biased. Furthermore, without an uncertainty metric that reflects the quality of the synthetic data, such estimates can become unreliable for practitioners.

PERRY (Policy Evaluation with Confidence Intervals using Auxiliary Data) tackles this particular problem. Its methods combine synthetic or auxiliary data with real trajectories while still producing valid confidence intervals around policy value estimates. The paper introduces two complementary methods and evaluates them across simulated and real domains, including an electronic health record dataset.

The Challenge: Data Augmentation with Confidence Intervals

Most OPE methods provide a point estimate of how good a policy is. In real-world systems, that is not enough. You also need a confidence interval, which tells you a plausible range for the true performance.

Synthetic data can help by reducing variance and improving coverage, but it introduces its own biases. If the generated trajectories differ too much from the real environment, your estimates become unreliable. Without careful adjustment, the resulting intervals can be too narrow or fail to cover the true policy value.

PERRY’s key idea is to use auxiliary data while maintaining valid statistical guarantees. When the synthetic data is informative, PERRY produces tighter intervals. When the synthetic data is unreliable, PERRY produces intervals comparable in size to those we would have obtained from the real trajectories alone.

Proposed OPE Methods

The paper introduces two techniques, depending on whether you want to evaluate a single initial state or the average performance across many initial states.

  • CP-Gen (target: the value of the policy conditioned on an initial start state): uses conformal prediction to calibrate intervals from pairs of real and synthetic trajectories.
  • DR-PPI (target: the value of the policy averaged over all start states): combines doubly robust estimation and prediction-powered inference to adjust for bias from synthetic data.

CP-Gen

Suppose you care about a specific starting state, such as a patient with certain symptoms. You can use conformal prediction to compare returns from real and synthetic rollouts under the target policy. The method builds a distribution of discrepancies between real and generated outcomes, then uses quantiles of that distribution to construct an interval for the target policy’s value at that initial state. Under mild smoothness conditions, CP-Gen guarantees valid coverage.
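The conformal step can be sketched as follows. This is a generic split-conformal construction assuming paired real and synthetic rollouts from the same start states; it is not the paper’s exact CP-Gen procedure, and the function and variable names are made up for illustration:

```python
import numpy as np

def conformal_interval(real_returns, synth_returns, new_synth_return, alpha=0.1):
    """Split-conformal interval for a real return, centered on a synthetic rollout.

    real_returns, synth_returns: paired calibration returns from the same
        start states under the target policy (real environment vs. generator)
    new_synth_return: return of a fresh synthetic rollout for the query state
    alpha: miscoverage level (0.1 gives a ~90% interval)
    """
    # Nonconformity scores: how far the synthetic returns miss the real ones.
    scores = np.abs(np.asarray(real_returns) - np.asarray(synth_returns))
    n = len(scores)
    # Finite-sample-corrected quantile level used in split conformal prediction.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    return new_synth_return - q, new_synth_return + q
```

If the generator is accurate, the scores are small and the interval is tight; a poor generator inflates the quantile and widens the interval, which is exactly the adaptive behavior PERRY aims for.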

DR-PPI

When you care about the average policy value, DR-PPI offers a different approach. It draws on doubly robust estimation and prediction-powered inference to incorporate auxiliary trajectories, producing an asymptotically valid confidence interval for the policy value averaged over initial states.
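A minimal version of the prediction-powered piece looks like this. It shows prediction-powered inference for a simple mean, with the doubly robust machinery omitted; every name is an assumption for illustration, not the paper’s API:

```python
import numpy as np
from statistics import NormalDist

def ppi_mean_ci(real_vals, model_on_real, model_on_synth, alpha=0.05):
    """Prediction-powered confidence interval for a mean.

    real_vals: true returns on the small real dataset
    model_on_real: synthetic-model estimates for those same real samples
    model_on_synth: synthetic-model estimates on a large auxiliary dataset
    """
    # Rectifier: how much the synthetic model is biased on real data.
    rect = np.asarray(real_vals) - np.asarray(model_on_real)
    point = np.mean(model_on_synth) + np.mean(rect)
    # Variance combines the large synthetic sample with the bias correction.
    var = (np.var(model_on_synth, ddof=1) / len(model_on_synth)
           + np.var(rect, ddof=1) / len(rect))
    half = NormalDist().inv_cdf(1 - alpha / 2) * np.sqrt(var)
    return point - half, point + half
```

The rectifier term is what keeps the interval honest: a good synthetic model contributes little extra variance, while a biased one both shifts the point estimate back toward the truth and widens the interval.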

A notable advantage of both methods is that they adapt to the quality of the synthetic data. If the generated trajectories are poor, the confidence interval widens. If they are accurate, it narrows.

Experimental Results

PERRY was tested on a range of environments, including inventory control, robotic tasks, simulated sepsis treatment, and the MIMIC-IV electronic health record dataset.

Across these experiments:

  • PERRY’s confidence intervals consistently cover the true policy value, while many baselines under-cover or produce overly wide intervals.
  • When synthetic data is of reasonable quality, PERRY yields tighter intervals than baseline approaches, without sacrificing coverage.
  • When synthetic data is unreliable, the intervals naturally widen, avoiding false confidence.

Significance

Accurate and reliable policy evaluation is essential in high-stakes domains such as healthcare and autonomous control. PERRY provides a path toward more trustworthy reinforcement learning deployment by combining data augmentation with rigorous statistical inference.

The two proposed methods address two key problems in OPE: estimating initial-state specific policy values, and estimating average policy values across all initial states. Both methods allow a practitioner to reliably perform OPE with synthetic data and identify an informative confidence interval.

Looking Forward

PERRY opens the door for several future directions. The framework could be extended to policy selection and optimization, where the confidence intervals guide safer choices between candidate policies. Another exciting direction is deploying these methods in live settings, where models and data are imperfect and feedback loops can emerge.

Synthetic data is becoming increasingly common in RL, but without confidence intervals, we cannot know when to trust it. PERRY provides the statistical foundation to use augmented data responsibly.
