opda.parametric module
Parametric distributions and tools for optimal design analysis.
- class opda.parametric.QuadraticDistribution(a, b, c, convex=False)
The Quadratic distribution.
When using random search to optimize a deterministic smooth function, the best score asymptotically approaches a quadratic distribution. In particular, if the search distribution is a continuous distribution and the function is well-approximated by a second-order Taylor expansion near the optimum, then the tail of the score distribution will approach a quadratic distribution.
- Parameters:
- a : finite float, required
The minimum value that the distribution can take.
- b : finite float, required
The maximum value that the distribution can take.
- c : positive int, required
The effective number of hyperparameters. Values of c greater than 10 are not supported.
- convex : bool, optional
Whether or not to use the convex form of the quadratic distribution, as opposed to the concave form. When optimizing via random search, the tail of the score distribution approaches the convex form when minimizing and the concave form when maximizing.
See also
NoisyQuadraticDistribution
The quadratic distribution with additive normal noise.
Notes
The quadratic distribution, \(\mathcal{Q}(\alpha, \beta, \gamma)\), has a dual relationship to itself:
\[Y \sim \mathcal{Q}_{\max}(\alpha, \beta, \gamma) \iff -Y \sim \mathcal{Q}_{\min}(-\beta, -\alpha, \gamma)\]
where \(\mathcal{Q}_{\max}\) and \(\mathcal{Q}_{\min}\) are the concave and convex quadratic distributions, respectively.
The \(\alpha\) and \(\beta\) parameters can also be seen as defining a location-scale family:
\[Y \sim \mathcal{Q}(0, 1, \gamma) \iff \alpha + (\beta - \alpha) Y \sim \mathcal{Q}( \alpha, \beta, \gamma )\]
In other words, \(\alpha\) and \(\beta\) act on the distribution by linearly changing its support.
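As a quick numerical check of the duality (a sketch; it relies only on the documented constructor and cdf, and agreement is up to floating point), the relation implies \(F_{\min}(-y) = 1 - F_{\max}(y)\):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> q_max = QuadraticDistribution(0., 1., 2, convex=False)  # concave form
>>> q_min = QuadraticDistribution(-1., 0., 2, convex=True)  # convex form
>>> ys = np.linspace(0., 1., num=101)
>>> # Y ~ Q_max(0, 1, 2) iff -Y ~ Q_min(-1, 0, 2), so the CDFs satisfy:
>>> assert np.allclose(q_min.cdf(-ys), 1. - q_max.cdf(ys))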
- Attributes:
- mean : float
The distribution’s mean.
- variance : float
The distribution’s variance.
- C_MIN : int
A class attribute giving the minimum supported value of c.
- C_MAX : int
A class attribute giving the maximum supported value of c.
- sample(size=None, *, generator=None)
Return a sample from the quadratic distribution.
- Parameters:
- size : None, int, or tuple of ints, optional
The desired shape of the returned sample. If None, then the sample is a scalar.
- generator : np.random.Generator or None, optional
The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.
- Returns:
- float or array of floats from a to b inclusive
The sample from the distribution.
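For example (a sketch using only the documented signature; the exact draws depend on the generator):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> ys = dist.sample(size=1000, generator=np.random.default_rng(0))
>>> ys.shape
(1000,)
>>> assert ((0. <= ys) & (ys <= 1.)).all()  # samples stay in [a, b]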
- pdf(ys)
Return the probability density at ys.
- Parameters:
- ys : float or array of floats, required
The points at which to evaluate the probability density.
- Returns:
- non-negative float or array of floats
The probability density at ys.
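For instance (a sketch; per the contract above, densities are non-negative, and points outside \([a, b]\) should get zero density):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> densities = dist.pdf(np.array([-0.5, 0.25, 0.75, 1.5]))
>>> assert (densities >= 0.).all()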
- cdf(ys)
Return the cumulative probability at ys.
We define the cumulative distribution function, \(F\), using less than or equal to:
\[F(y) = \mathbb{P}(Y \leq y)\]
- Parameters:
- ys : float or array of floats, required
The points at which to evaluate the cumulative probability.
- Returns:
- float or array of floats from 0 to 1 inclusive
The cumulative probability at ys.
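For example (a sketch; monotonicity and the [0, 1] range are basic properties any CDF must satisfy):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> qs = dist.cdf(np.linspace(-0.5, 1.5, num=101))
>>> assert np.all(np.diff(qs) >= 0.)        # non-decreasing
>>> assert ((0. <= qs) & (qs <= 1.)).all()  # in [0, 1]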
- ppf(qs)
Return the quantile at qs.
We define the quantile function, \(Q\), as:
\[Q(p) = \inf \{y \in \operatorname{supp} f(y) \mid p \leq F(y)\}\]
where \(f\) is the probability density, \(F\) is the cumulative distribution function, and \(\operatorname{supp} f(y)\) is the support.
- Parameters:
- qs : float or array of floats from 0 to 1 inclusive, required
The points at which to evaluate the quantiles.
- Returns:
- float or array of floats from a to b inclusive
The quantiles at qs.
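Since the CDF is continuous and strictly increasing on the support, ppf should invert cdf there (a sketch; agreement is up to floating point):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> ys = np.linspace(0.1, 0.9, num=9)
>>> assert np.allclose(dist.ppf(dist.cdf(ys)), ys)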
- quantile_tuning_curve(ns, q=0.5, minimize=None)
Return the quantile tuning curve evaluated at ns.
- Parameters:
- ns : positive float or array of floats, required
The points at which to evaluate the tuning curve.
- q : float from 0 to 1 inclusive, optional
The quantile at which to evaluate the tuning curve.
- minimize : bool or None, optional
Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex quadratic distributions will minimize and concave ones will maximize.
- Returns:
- float or array of floats from a to b inclusive
The quantile tuning curve evaluated at ns.
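For example (a sketch; with the default concave form the curve tracks the median best score when maximizing, so it should be non-decreasing in n):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)  # concave by default: maximize
>>> curve = dist.quantile_tuning_curve([1, 2, 4, 8, 16], q=0.5)
>>> assert np.all(np.diff(curve) >= 0.)  # more search never hurts the median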
- average_tuning_curve(ns, minimize=None)
Return the average tuning curve evaluated at ns.
- Parameters:
- ns : positive float or array of floats, required
The points at which to evaluate the tuning curve.
- minimize : bool or None, optional
Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex quadratic distributions will minimize and concave ones will maximize.
- Returns:
- float or array of floats from a to b inclusive
The average tuning curve evaluated at ns.
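Similarly for the expected best score (a sketch; with the convex form the default is to minimize, so the curve should be non-increasing in n):
>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2, convex=True)  # convex: minimize
>>> curve = dist.average_tuning_curve([1, 2, 4, 8])
>>> assert np.all(np.diff(curve) <= 0.)  # the expected best score falls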
- classmethod fit(ys, limits=(-inf, inf), constraints=None, *, generator=None, method='maximum_spacing')
Return a quadratic distribution fitted to ys.
- Parameters:
- ys : 1D array of finite floats, required
The sample from the distribution.
- limits : pair of floats, optional
The left-open interval over which to fit the distribution. Observations outside the interval are censored (the fit considers only that an observation occurred above or below the interval, not its exact value).
- constraints : mapping or None, optional
A mapping from strings (the class’s parameter names) to constraints. For a, b, and c, the constraint can be either a float (fixing the parameter’s value) or a pair of floats (restricting the parameter to that closed interval). For convex, the constraint can be either a boolean (fixing the parameter’s value) or a sequence of booleans (restricting the parameter to that set). The mapping can be empty and need not include all the parameters.
- generator : np.random.Generator or None, optional
The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.
- method : str, optional
One of the strings: “maximum_spacing”. The method parameter determines how the distribution will be estimated. See the notes section for details on different methods.
- Returns:
- QuadraticDistribution
The fitted quadratic distribution.
Notes
We fit the distribution via maximum product spacing estimation (MPS) [1] [2]. MPS maximizes the product of the spacings: \(F\left(Y_{(i)}; \theta\right) - F\left(Y_{(i-1)}; \theta\right)\), where \(F\) is the CDF at the parameters \(\theta\) and \(Y_{(i)}\) is the i’th order statistic.
To compute the maximum, we optimize the spacing function with a generic algorithm that might fail to find the maximum; however, such failures should be rare.
Fitting Part of the Data
The quadratic distribution approximates the tail of the score distribution from random search. Thus, you’ll typically fit just the tail. You can accomplish this using the limits parameter, which censors all observations outside a left-open interval. Thus, limits = (-np.inf, threshold) will censor everything above threshold, and limits = (threshold, np.inf) will censor everything less than or equal to threshold.
Why Use Maximum Spacing Estimation?
Often, maximum spacing still works where maximum likelihood breaks down. In our case, the quadratic distribution has unbounded probability density when \(\gamma = 1\). Since the parameters \(\alpha\) and \(\beta\) control the distribution’s support, this unbounded density leads to the unbounded likelihood problem [3] which makes the maximum likelihood estimator inconsistent. Unlike maximum likelihood, maximum spacing remains consistent even with an unbounded likelihood.
A Variant of Maximum Spacing Estimation
Standard MPS can’t handle tied data points. In order to handle ties and censoring, we use a variant of MPS. For a detailed description, see the notes section of NoisyQuadraticDistribution.
References
[1] Cheng, R. C. H., & Amin, N. A. K. (1983). Estimating Parameters in Continuous Univariate Distributions with a Shifted Origin. Journal of the Royal Statistical Society. Series B (Methodological), 45(3), 394-403.
[2] Ranneby, B. (1984). The Maximum Spacing Method. An Estimation Method Related to the Maximum Likelihood Method. Scandinavian Journal of Statistics, 11(2), 93-112.
[3] Cheng, R. C. H., & Traylor, L. (1995). Non-Regular Maximum Likelihood Problems. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 3-44.
Examples
Use fit() to estimate the distribution from data:
>>> QuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
... )
QuadraticDistribution(...)
When fitting a quadratic distribution to the results from random search, you’ll typically restrict it to be convex when minimizing and concave when maximizing. You can accomplish this with constraints:
>>> minimize = True  # whether random search minimized or maximized
>>> QuadraticDistribution.fit(
...     ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...     constraints={"convex": True if minimize else False},
... )
QuadraticDistribution(...)
The quadratic distribution approximates the score distribution’s left tail when minimizing and right tail when maximizing. You can fit to only the tail of your data using limits:
>>> threshold = 5.  # The maximum cross-entropy to consider
>>> QuadraticDistribution.fit(
...     ys=[5.0, 2.9, 5.1, 6.8, 3.2],  # 5.1 & 6.8 are censored
...     limits=(
...         (-np.inf, threshold)  # minimize: fit the left tail
...         if minimize else
...         (threshold, np.inf)   # maximize: fit the right tail
...     ),
...     constraints={"convex": True if minimize else False},
... )
QuadraticDistribution(...)
You could also censor both tails if necessary:
>>> QuadraticDistribution.fit(
...     ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...     limits=(1., 10.),
... )
QuadraticDistribution(...)
Finally, you can use constraints to bound any of the parameters in case you have some extra information. For example, if you knew a bound on the performance of the worst or best hyperparameters, you could constrain a and b:
>>> min_accuracy, max_accuracy = 0., 1.
>>> QuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...     constraints={
...         "a": (min_accuracy, max_accuracy),
...         "b": (min_accuracy, max_accuracy),
...     },
... )
QuadraticDistribution(...)
Or, you might know that the random search used 3 hyperparameters, so the effective number of hyperparameters (c) can be at most that:
>>> n_hyperparameters = 3
>>> QuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...     constraints={"c": (1, n_hyperparameters)},
... )
QuadraticDistribution(...)
You could also fix c (or a, b, or convex) to a particular value:
>>> QuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...     constraints={"c": 1},
... )
QuadraticDistribution(...)
Of course, you can mix and match all of these ideas together as desired.
- class opda.parametric.NoisyQuadraticDistribution(a, b, c, o, convex=False)
The Noisy Quadratic distribution.
When using random search to optimize a smooth function with additive normal noise, the best score asymptotically approaches a noisy quadratic distribution. In particular, if the search distribution is a continuous distribution, the function is well-approximated by a second-order Taylor expansion near the optimum, and the observed score is the result of the function plus additive normal noise, then the tail of the score distribution will approach a noisy quadratic distribution.
- Parameters:
- a : finite float, required
The minimum value that the distribution can take, without accounting for the additive noise.
- b : finite float, required
The maximum value that the distribution can take, without accounting for the additive noise.
- c : positive int, required
The effective number of hyperparameters. Values of c greater than 10 are not supported.
- o : finite non-negative float, required
The standard deviation of the additive noise.
- convex : bool, optional
Whether or not to use the convex form of the noisy quadratic distribution, as opposed to the concave form. When optimizing via random search, the tail of the score distribution approaches the convex form when minimizing and the concave form when maximizing.
See also
QuadraticDistribution
A noiseless version of the noisy quadratic distribution.
Notes
The noisy quadratic distribution, \(\mathcal{Q}(\alpha, \beta, \gamma, \sigma)\), has a dual relationship to itself:
\[Y \sim \mathcal{Q}_{\max}(\alpha, \beta, \gamma, \sigma) \iff -Y \sim \mathcal{Q}_{\min}(-\beta, -\alpha, \gamma, \sigma)\]
where \(\mathcal{Q}_{\max}\) and \(\mathcal{Q}_{\min}\) are the concave and convex noisy quadratic distributions, respectively.
The \(\alpha\) and \(\beta\) parameters can also be seen as defining a location-scale family:
\[Y \sim \mathcal{Q}(0, 1, \gamma, \sigma) \iff \alpha + (\beta - \alpha) Y \sim \mathcal{Q}( \alpha, \beta, \gamma, (\beta - \alpha)\sigma )\]
In other words, \(\alpha\) and \(\beta\) act on the distribution by linearly changing its support and residual standard deviation, \(\sigma\).
- Attributes:
- mean : float
The distribution’s mean.
- variance : float
The distribution’s variance.
- C_MIN : int
A class attribute giving the minimum supported value of c.
- C_MAX : int
A class attribute giving the maximum supported value of c.
- sample(size=None, *, generator=None)
Return a sample from the noisy quadratic distribution.
- Parameters:
- size : None, int, or tuple of ints, optional
The desired shape of the returned sample. If None, then the sample is a scalar.
- generator : np.random.Generator or None, optional
The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.
- Returns:
- float or array of floats
The sample from the distribution.
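For example (a sketch; note that, unlike QuadraticDistribution, draws are not confined to \([a, b]\) because the additive noise is unbounded):
>>> import numpy as np
>>> from opda.parametric import NoisyQuadraticDistribution
>>> dist = NoisyQuadraticDistribution(0., 1., 2, o=0.1)
>>> ys = dist.sample(size=1000, generator=np.random.default_rng(0))
>>> ys.shape
(1000,)
>>> # With o > 0, some draws may fall outside [0, 1].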
- pdf(ys)
Return the probability density at ys.
- Parameters:
- ys : float or array of floats, required
The points at which to evaluate the probability density.
- Returns:
- non-negative float or array of floats
The probability density at ys.
- cdf(ys)
Return the cumulative probability at ys.
We define the cumulative distribution function, \(F\), using less than or equal to:
\[F(y) = \mathbb{P}(Y \leq y)\]
- Parameters:
- ys : float or array of floats, required
The points at which to evaluate the cumulative probability.
- Returns:
- float or array of floats from 0 to 1 inclusive
The cumulative probability at ys.
- ppf(qs)
Return the quantile at qs.
We define the quantile function, \(Q\), as:
\[Q(p) = \inf \{y \in \operatorname{supp} f(y) \mid p \leq F(y)\}\]
where \(f\) is the probability density, \(F\) is the cumulative distribution function, and \(\operatorname{supp} f(y)\) is the support.
- Parameters:
- qs : float or array of floats from 0 to 1 inclusive, required
The points at which to evaluate the quantiles.
- Returns:
- float or array of floats
The quantiles at qs.
- quantile_tuning_curve(ns, q=0.5, minimize=None)
Return the quantile tuning curve evaluated at ns.
- Parameters:
- ns : positive float or array of floats, required
The points at which to evaluate the tuning curve.
- q : float from 0 to 1 inclusive, optional
The quantile at which to evaluate the tuning curve.
- minimize : bool or None, optional
Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex noisy quadratic distributions will minimize and concave ones will maximize.
- Returns:
- float or array of floats
The quantile tuning curve evaluated at ns.
- average_tuning_curve(ns, minimize=None, *, atol=None)
Return the average tuning curve evaluated at ns.
- Parameters:
- ns : positive float or array of floats, required
The points at which to evaluate the tuning curve.
- minimize : bool or None, optional
Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex noisy quadratic distributions will minimize and concave ones will maximize.
- atol : non-negative float or None, optional
The absolute tolerance to use for stopping the computation. The average tuning curve is computed via numerical integration. The computation stops when the error estimate is less than atol. If None, then atol is set automatically based on the distribution’s parameters.
- Returns:
- float or array of floats
The average tuning curve evaluated at ns.
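For example, you can loosen the tolerance to trade accuracy for speed (a sketch; a good atol depends on the scale of your scores):
>>> from opda.parametric import NoisyQuadraticDistribution
>>> dist = NoisyQuadraticDistribution(0., 1., 2, o=0.1)
>>> curve = dist.average_tuning_curve([1, 2, 4, 8], atol=1e-4)
>>> len(curve)
4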
- classmethod fit(ys, limits=(-inf, inf), constraints=None, *, generator=None, method='maximum_spacing')
Return a noisy quadratic distribution fitted to ys.
- Parameters:
- ys : 1D array of finite floats, required
The sample from the distribution.
- limits : pair of floats, optional
The left-open interval over which to fit the distribution. Observations outside the interval are censored (the fit considers only that an observation occurred above or below the interval, not its exact value).
- constraints : mapping or None, optional
A mapping from strings (the class’s parameter names) to constraints. For a, b, c, and o, the constraint can be either a float (fixing the parameter’s value) or a pair of floats (restricting the parameter to that closed interval). For convex, the constraint can be either a boolean (fixing the parameter’s value) or a sequence of booleans (restricting the parameter to that set). The mapping can be empty and need not include all the parameters.
- generator : np.random.Generator or None, optional
The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.
- method : str, optional
One of the strings: “maximum_spacing”. The method parameter determines how the distribution will be estimated. See the notes section for details on different methods.
- Returns:
- NoisyQuadraticDistribution
The fitted noisy quadratic distribution.
Notes
We fit the distribution via maximum product spacing estimation (MPS) [1] [2]. MPS maximizes the product of the spacings: \(F\left(Y_{(i)}; \theta\right) - F\left(Y_{(i-1)}; \theta\right)\), where \(F\) is the CDF at the parameters \(\theta\) and \(Y_{(i)}\) is the i’th order statistic.
To compute the maximum, we optimize the spacing function with a generic algorithm that might fail to find the maximum; however, such failures should be rare.
Fitting Part of the Data
The noisy quadratic distribution approximates the tail of the score distribution from random search. Thus, you’ll typically fit just the tail. You can accomplish this using the
limits
parameter which censors all observations outside a left-open interval. Thus,limits = (-np.inf, threshold)
will censor everything abovethreshold
andlimits = (threshold, np.inf)
will censor everything less than or equal tothreshold
.Why Use Maximum Spacing Estimation?
Often, maximum spacing still works where maximum likelihood breaks down. In our case, the noisy quadratic distribution has unbounded probability density when \(\sigma = 0\) and \(\gamma = 1\). Since the parameters \(\alpha\) and \(\beta\) control the distribution’s support, this unbounded density leads to the unbounded likelihood problem [3] which makes the maximum likelihood estimator inconsistent. Unlike maximum likelihood, maximum spacing remains consistent even with an unbounded likelihood.
A Variant of Maximum Spacing Estimation
Standard MPS can’t handle tied data points because then the spacing, \(F\left(Y_{(i)}; \theta\right) - F\left(Y_{(i-1)}; \theta\right)\), would be zero, making the whole product of spacings zero.
To handle ties and censoring, we view maximum spacing as maximum likelihood on grouped data [4]. Roughly, we imagine adding an extra data point at \(Y_{(n+1)} = \infty\), then grouping the data between the order statistics, and finally applying maximum likelihood.
More specifically, we group the data into buckets defined as left-open intervals: \(\left(Z_{(i-1)}, Z_{(i)}\right]\). To construct the buckets, we deduplicate the uncensored order statistics along with the lower and upper bounds on the support:
\[-\infty = Z_{(0)} < Z_{(1)} = Y_{(i_1)} < \ldots < Y_{(i_k)} = Z_{(k)} < Z_{(k+1)} = \infty\]
If there are censored observations in the left tail, then we further divide the first bucket at the lower limit for censoring; similarly, if the right tail contains censored observations, we divide the last bucket at the upper limit. If any of the (uncensored) order statistics equals one of the censoring limits, then we leave the bucket as is so as never to create a zero-length bucket. In this way, the deduplication ensures all buckets have positive length.
With the buckets defined, we bin the data into each of the buckets and put an extra data point in the rightmost one (as in typical maximum spacing estimation). Intuitively, this corresponds to estimating the CDF at each order statistic by i/(n+1), reducing the upward bias at the order statistics. For example, if you have one data point, you might guess that it’s near the median rather than the max of the distribution. If \(N_i\) is the number of data points in bucket \(\left(Z_{(i-1)}, Z_{(i)}\right]\), then the objective becomes:
\[\sum_{i=1}^{l} N_i \log\left( F\left(Z_{(i)}; \theta\right) - F\left(Z_{(i-1)}; \theta\right) \right)\]
where \(l\) is the highest index of all the \(Z_{(i)}\).
When fitting with constraints, a minor modification is necessary if \(\sigma\) is constrained to be zero. Instead of using \(\pm\infty\) as the bounds on the support, you must use \(\alpha_-\) (the lower bound on \(\alpha\)) and \(\beta_+\) (the upper bound on \(\beta\)); also, the leftmost bucket must be closed (i.e., include \(\alpha_-\)). The reason is that if you were to construct the buckets using \(-\infty\) and you had a data point equal to \(\alpha_-\), then you’d get a bucket that contains zero probability no matter the parameters. Even worse, that data point would get put in this bucket, sending the product of spacings to zero. Similarly, if you used \(\infty\) and you had a data point equal to \(\beta_+\), then you’d get another zero-probability bucket. The extra data point at the top of the support would get put in this bucket and again send the product of spacings to zero. These situations can be fairly common in practice due to rounding. Thus, in constrained problems where \(\sigma\) is zero, we construct the buckets using \(\alpha_-\) and \(\beta_+\).
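To make the construction concrete, the following sketch computes the bucketed objective for given parameters. It is a hypothetical helper for illustration, not opda’s internal implementation, and it assumes \(\sigma > 0\) so that the bucket edges can run from \(-\infty\) to \(\infty\) as described above:
import numpy as np

def grouped_log_spacings(cdf, ys, limits=(-np.inf, np.inf)):
    # A sketch of the grouped maximum spacing objective (hypothetical
    # helper, assuming sigma > 0). ``cdf`` maps scores to cumulative
    # probabilities at the current parameters; ``limits`` is the
    # left-open censoring interval.
    lo, hi = limits
    ys = np.asarray(ys, dtype=float)
    # Bucket edges: deduplicated uncensored order statistics plus the
    # bounds on the support.
    uncensored = np.sort(ys[(ys > lo) & (ys <= hi)])
    edges = np.unique(np.concatenate(([-np.inf], uncensored, [np.inf])))
    # Split a tail bucket at the censoring limit when that tail contains
    # censored observations. np.unique leaves the bucket as is when an
    # order statistic already equals the limit, avoiding zero-length
    # buckets.
    if np.any(ys <= lo):
        edges = np.unique(np.append(edges, lo))
    if np.any(ys > hi):
        edges = np.unique(np.append(edges, hi))
    # Bin the data into the left-open buckets (edges[i-1], edges[i]],
    # and put the extra data point in the rightmost bucket.
    counts = np.bincount(
        np.searchsorted(edges, ys, side="left"),
        minlength=len(edges),
    )[1:]
    counts[-1] += 1
    # Evaluate the CDF at the edges, with F(-inf) = 0 and F(inf) = 1.
    Fs = np.empty_like(edges)
    Fs[0], Fs[-1] = 0., 1.
    Fs[1:-1] = cdf(edges[1:-1])
    # The objective: sum_i N_i log(F(Z_i) - F(Z_{i-1})).
    return np.sum(counts * np.log(np.diff(Fs)))

Maximizing this objective over the distribution’s parameters with a generic optimizer then yields an MPS-style fit; fit() handles all of this for you, including the \(\sigma = 0\) corner case described above, so the sketch is meant only to unpack the notation.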
References
[1] Cheng, R. C. H., & Amin, N. A. K. (1983). Estimating Parameters in Continuous Univariate Distributions with a Shifted Origin. Journal of the Royal Statistical Society. Series B (Methodological), 45(3), 394-403.
[2] Ranneby, B. (1984). The Maximum Spacing Method. An Estimation Method Related to the Maximum Likelihood Method. Scandinavian Journal of Statistics, 11(2), 93-112.
[3] Cheng, R. C. H., & Traylor, L. (1995). Non-Regular Maximum Likelihood Problems. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 3-44.
[4] Titterington, D. M. (1985). Comment on “Estimating Parameters in Continuous Univariate Distributions.” Journal of the Royal Statistical Society. Series B (Methodological), 47(1), 115-116.
Examples
Use fit() to estimate the distribution from data:
>>> NoisyQuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
... )
NoisyQuadraticDistribution(...)
When fitting a noisy quadratic distribution to the results from random search, you’ll typically restrict it to be convex when minimizing and concave when maximizing. You can accomplish this with constraints:
>>> minimize = True  # whether random search minimized or maximized
>>> NoisyQuadraticDistribution.fit(
...     ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...     constraints={"convex": True if minimize else False},
... )
NoisyQuadraticDistribution(...)
The noisy quadratic distribution approximates the score distribution’s left tail when minimizing and right tail when maximizing. You can fit to only the tail of your data using limits:
>>> threshold = 5.  # The maximum cross-entropy to consider
>>> NoisyQuadraticDistribution.fit(
...     ys=[5.0, 2.9, 5.1, 6.8, 3.2],  # 5.1 & 6.8 are censored
...     limits=(
...         (-np.inf, threshold)  # minimize: fit the left tail
...         if minimize else
...         (threshold, np.inf)   # maximize: fit the right tail
...     ),
...     constraints={"convex": True if minimize else False},
... )
NoisyQuadraticDistribution(...)
You could also censor both tails if necessary:
>>> NoisyQuadraticDistribution.fit(
...     ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...     limits=(1., 10.),
... )
NoisyQuadraticDistribution(...)
Finally, you can use constraints to bound any of the parameters in case you have some extra information. For example, if you knew a bound on the average performance of the worst or best hyperparameters, you could constrain a and b:
>>> min_accuracy, max_accuracy = 0., 1.
>>> NoisyQuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...     constraints={
...         "a": (min_accuracy, max_accuracy),
...         "b": (min_accuracy, max_accuracy),
...     },
... )
NoisyQuadraticDistribution(...)
Or, you might know that the random search used 3 hyperparameters, so the effective number of hyperparameters (c) can be at most that:
>>> n_hyperparameters = 3
>>> NoisyQuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...     constraints={"c": (1, n_hyperparameters)},
... )
NoisyQuadraticDistribution(...)
You could also fix c (or a, b, o, or convex) to a particular value:
>>> NoisyQuadraticDistribution.fit(
...     ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...     constraints={"c": 1},
... )
NoisyQuadraticDistribution(...)
Of course, you can mix and match all of these ideas together as desired.