opda.nonparametric module
Nonparametric distributions and tools for optimal design analysis.
- class opda.nonparametric.EmpiricalDistribution(ys, ws=None, a=-inf, b=inf)
The empirical distribution.
- Parameters:
- ys : 1D array of floats, required
The sample for which to create an empirical distribution.
- ws : 1D array of non-negative floats or None, optional
The weights, or probability masses, to assign to each value in the sample, ys. Weights must be non-negative and sum to 1. ws should have the same shape as ys. If None, then each sample is assigned equal weight.
- a : float, optional
The minimum of the support of the underlying distribution.
- b : float, optional
The maximum of the support of the underlying distribution.
Notes
EmpiricalDistribution provides confidence bands for the CDF, which can then be translated into confidence bands for the tuning curve. See the examples section for how to accomplish this task, or [1] for more background.
References
[1] Lourie, Nicholas, Kyunghyun Cho, and He He. “Show Your Work with Confidence: Confidence Bands for Tuning Curves.” arXiv preprint arXiv:2311.09480 (2023).
Examples
To produce confidence bands for tuning curves, first create confidence bands for the CDF of the score distribution:
>>> ns = [1, 2, 3, 4, 5]
>>> lower_cdf, point_cdf, upper_cdf = \
...     EmpiricalDistribution.confidence_bands(
...         ys=[0.1, 0.8, 0.5, 0.4, 0.6],
...         confidence=0.80,
...     )
>>> tuning_curve_lower = upper_cdf.quantile_tuning_curve(ns)
>>> tuning_curve_point = point_cdf.quantile_tuning_curve(ns)
>>> tuning_curve_upper = lower_cdf.quantile_tuning_curve(ns)
Note that the upper CDF band gives the lower tuning curve band and vice versa.
- Attributes:
- mean : float
The distribution’s mean.
- variance : float
The distribution’s variance.
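For illustration, a minimal sketch of constructing a weighted distribution and reading these attributes (the sample and weights below are made up for the example):

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(
...     ys=[0.1, 0.8, 0.5, 0.4, 0.6],
...     ws=[0.1, 0.4, 0.2, 0.1, 0.2],  # non-negative weights summing to 1
... )
>>> mean, variance = dist.mean, dist.variance  # weighted summary statistics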
- sample(size=None, *, generator=None)
Return a sample from the empirical distribution.
- Parameters:
- size : None, int, or tuple of ints, optional
The desired shape of the returned sample. If None, then the sample is a scalar.
- generator : np.random.Generator or None, optional
The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.
- Returns:
- float or array of floats
The sample from the distribution.
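As a sketch of calling sample() (the data are made up; an explicit np.random.Generator is passed for reproducibility, otherwise the global default from opda.random is used):

>>> import numpy as np
>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> y = dist.sample()  # a scalar drawn from the sample
>>> ys = dist.sample(size=(2, 3), generator=np.random.default_rng(0))  # a 2 x 3 array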
- pmf(ys)
Return the probability mass at ys.
- Parameters:
- ys : float or array of floats, required
The points at which to evaluate the probability mass.
- Returns:
- float or array of floats from 0 to 1 inclusive
The probability mass at ys.
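A short sketch of pmf() with made-up data; the expected values in the comments follow from assigning equal weight to each sample point:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> p = dist.pmf(0.5)          # 0.5 occurs once in five samples, so mass 1/5
>>> ps = dist.pmf([0.5, 0.7])  # 0.7 is not in the sample, so its mass is 0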
- cdf(ys)
Return the cumulative probability at ys.
We define the cumulative distribution function, \(F\), using less than or equal to:
\[F(y) = \mathbb{P}(Y \leq y)\]
- Parameters:
- ys : float or array of floats, required
The points at which to evaluate the cumulative probability.
- Returns:
- float or array of floats from 0 to 1 inclusive
The cumulative probability at ys.
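A short sketch of cdf() with made-up data; the comments apply the definition of \(F\) above:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> q = dist.cdf(0.5)          # three of five samples are <= 0.5, so F(0.5) = 3/5
>>> qs = dist.cdf([0.0, 1.0])  # 0 below the sample's minimum, 1 at or above its maximum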
- ppf(qs)
Return the quantile at qs.
Since the empirical distribution is discrete, its exact quantiles are ambiguous. We use the following definition of the quantile function, \(Q\):
\[Q(p) = \inf \{y\in[a, b]\mid p\leq F(y)\}\]
where \(F\) is the cumulative distribution function and \(a\) and \(b\) are the optional bounds provided for the distribution’s support. Note that this definition differs from the most standard one, in which \(y\) is quantified over the whole real line; quantifying over the reals makes the quantile at zero always evaluate to negative infinity. With the definition above, the quantile at zero evaluates to the lower bound on the support instead.
- Parameters:
- qs : float or array of floats from 0 to 1 inclusive, required
The points at which to evaluate the quantiles.
- Returns:
- float or array of floats
The quantiles at qs.
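A short sketch of ppf() with made-up data; the comments apply the definition of \(Q\) above:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> median = dist.ppf(0.5)  # the smallest sample value y with F(y) >= 0.5, here 0.5
>>> low = dist.ppf(0.)      # the lower bound a, which is -inf by default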
- quantile_tuning_curve(ns, q=0.5, minimize=False)
Return the quantile tuning curve evaluated at ns.
Since the empirical distribution is discrete, its exact quantiles are ambiguous. See the ppf() method for the definition of the quantile function we use.
- Parameters:
- ns : positive float or array of floats, required
The points at which to evaluate the tuning curve.
- q : float from 0 to 1 inclusive, optional
The quantile at which to evaluate the tuning curve.
- minimize : bool, optional
Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it.
- Returns:
- float or array of floats
The quantile tuning curve evaluated at ns.
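A sketch of quantile_tuning_curve() on made-up data, evaluating the default median (q=0.5) tuning curve and a lower-quartile curve for a metric being minimized:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> curve = dist.quantile_tuning_curve([1, 2, 4, 8])
>>> curve_min = dist.quantile_tuning_curve([1, 2, 4, 8], q=0.25, minimize=True)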
- average_tuning_curve(ns, minimize=False)
Return the average tuning curve evaluated at ns.
- Parameters:
- ns : positive float or array of floats, required
The points at which to evaluate the tuning curve.
- minimize : bool, optional
Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it.
- Returns:
- float or array of floats
The average tuning curve evaluated at ns.
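A sketch of average_tuning_curve() on made-up data:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> curve = dist.average_tuning_curve([1, 2, 4, 8])
>>> curve_min = dist.average_tuning_curve([1, 2, 4, 8], minimize=True)  # for metrics where lower is better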
- naive_tuning_curve(ns, minimize=False)
Return the naive estimate for the tuning curve at ns.
The naive tuning curve estimate assigns to n the maximum value seen in the first n samples. The estimate assumes each sample has identical weight, so this method cannot be called when ws is not None.
- Parameters:
- ns : positive int or array of ints, required
The values at which to evaluate the naive tuning curve estimate.
- minimize : bool, optional
Whether or not to estimate the tuning curve for minimizing a metric as opposed to maximizing it.
- Returns:
- float or array of floats
The values of the naive tuning curve estimate.
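A sketch of naive_tuning_curve() on made-up data; the comment applies the description above:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> curve = dist.naive_tuning_curve([1, 3, 5])  # n=3 gives max(0.1, 0.8, 0.5) = 0.8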
- v_tuning_curve(ns, minimize=False)
Return the v estimate for the tuning curve at ns.
The v statistic tuning curve estimate assigns to n the average value of the maximum after n observations when resampling with replacement. The estimate is consistent but biased.
- Parameters:
- ns : positive int or array of ints, required
The values at which to evaluate the v statistic tuning curve estimate.
- minimize : bool, optional
Whether or not to estimate the tuning curve for minimizing a metric as opposed to maximizing it.
- Returns:
- float or array of floats
The values of the v statistic tuning curve estimate.
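A sketch of v_tuning_curve() on made-up data:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> curve = dist.v_tuning_curve([1, 3, 5])  # average max of n draws, resampling with replacement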
- u_tuning_curve(ns, minimize=False)
Return the u estimate for the tuning curve at ns.
The u statistic tuning curve estimate assigns to n the average value of the maximum after n observations when resampling without replacement. The estimate is unbiased for n less than or equal to the original sample size. For larger n, we return the maximum value from the original sample.
- Parameters:
- ns : positive int or array of ints, required
The values at which to evaluate the u statistic tuning curve estimate.
- minimize : bool, optional
Whether or not to estimate the tuning curve for minimizing a metric as opposed to maximizing it.
- Returns:
- float or array of floats
The values of the u statistic tuning curve estimate.
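A sketch of u_tuning_curve() on made-up data; the comments apply the description above:

>>> from opda.nonparametric import EmpiricalDistribution
>>> dist = EmpiricalDistribution(ys=[0.1, 0.8, 0.5, 0.4, 0.6])
>>> curve = dist.u_tuning_curve([1, 3, 5])    # average max of n draws, resampling without replacement
>>> curve_beyond = dist.u_tuning_curve([10])  # n larger than the sample size returns the sample maximum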
- classmethod confidence_bands(ys, confidence, a=-inf, b=inf, *, generator=None, method='ld_highest_density', n_jobs=None)
Return confidence bands for the CDF.
Return three instances of EmpiricalDistribution, offering a lower confidence band, point estimate, and upper confidence band for the CDF of the distribution that generated ys.
The properties of the CDF bands depend on the method used to construct them, as set by the method parameter.
- Parameters:
- ys : 1D array of floats, required
The sample from the distribution.
- confidence : float from 0 to 1 inclusive, required
The coverage or confidence level for the bands.
- a : float, optional
The minimum of the support of the underlying distribution.
- b : float, optional
The maximum of the support of the underlying distribution.
- generator : np.random.Generator or None, optional
The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.
- method : str, optional
One of the strings: “dkw”, “ks”, “ld_equal_tailed”, or “ld_highest_density”. The method parameter determines the kind of confidence band and thus its properties. See the notes section for details on the different methods.
- n_jobs : positive int or None, optional
The maximum number of parallel processes to use when constructing the confidence bands. If None, then n_jobs is set to the number of CPUs returned by os.cpu_count(). Only some methods (e.g., "ld_highest_density") can leverage parallel computation; methods that cannot use parallelism run in the current process instead.
- Returns:
- EmpiricalDistribution
The lower confidence band for the distribution’s CDF.
- EmpiricalDistribution
The point estimate for the distribution’s CDF.
- EmpiricalDistribution
The upper confidence band for the distribution’s CDF.
Notes
There are four built-in methods for generating confidence bands: dkw, ks, ld_equal_tailed, and ld_highest_density. All four methods provide simultaneous confidence bands.
The dkw method uses the Dvoretzky-Kiefer-Wolfowitz inequality which is fast to compute but fairly conservative for smaller samples.
The ks method inverts the Kolmogorov-Smirnov test to provide a confidence band with exact coverage and which is uniformly spaced above and below the empirical cumulative distribution. Because the band has uniform width, it is relatively looser at the ends than in the middle, and most violations of the confidence band tend to occur near the median. The Kolmogorov-Smirnov bands require that the underlying distribution is continuous to achieve exact coverage.
The ld (Learned-Miller-DeStefano) methods expand pointwise confidence bands for the order statistics, based on the beta distribution, until they hold simultaneously with exact coverage. These pointwise bands may either use the equal-tailed interval (ld_equal_tailed) or the highest density interval (ld_highest_density) from the beta distribution. The highest density interval yields the tightest bands; however, the equal-tailed intervals are almost the same size and significantly faster to compute. The ld bands do not have uniform width and are tighter near the end points. They’re violated equally often across the whole range. The Learned-Miller-DeStefano bands require that the underlying distribution is continuous to achieve exact coverage. See “A Probabilistic Upper Bound on Differential Entropy” [1] for details.
References
[1] Learned-Miller, E., and DeStefano, J. “A Probabilistic Upper Bound on Differential Entropy” (2008). IEEE Transactions on Information Theory. 732.
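As a further sketch of the method and n_jobs parameters (the data and confidence level are made up; see the class-level examples for turning the bands into tuning curve bands):

>>> from opda.nonparametric import EmpiricalDistribution
>>> ys = [0.1, 0.8, 0.5, 0.4, 0.6]
>>> lo_dkw, pt_dkw, hi_dkw = EmpiricalDistribution.confidence_bands(
...     ys=ys,
...     confidence=0.95,
...     method="dkw",  # fast to compute but conservative for small samples
... )
>>> lo_ld, pt_ld, hi_ld = EmpiricalDistribution.confidence_bands(
...     ys=ys,
...     confidence=0.95,
...     method="ld_highest_density",  # the tightest bands; can use parallelism
...     n_jobs=2,
... )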