opda.parametric module#

Parametric distributions and tools for optimal design analysis.

class opda.parametric.QuadraticDistribution(a, b, c, convex=False)[source]#

The Quadratic distribution.

When using random search to optimize a deterministic smooth function, the best score asymptotically approaches a quadratic distribution. In particular, if the search distribution is a continuous distribution and the function is well-approximated by a second-order Taylor expansion near the optimum, then the tail of the score distribution will approach a quadratic distribution.

Parameters:
a : finite float, required

The minimum value that the distribution can take.

b : finite float, required

The maximum value that the distribution can take.

c : positive int, required

The effective number of hyperparameters. Values of c greater than 10 are not supported.

convex : bool, optional

Whether or not to use the convex form of the quadratic distribution, as opposed to the concave form. When optimizing via random search, the tail of the score distribution approaches the convex form when minimizing and the concave when maximizing.

See also

NoisyQuadraticDistribution

The quadratic distribution with additive normal noise.

Notes

The quadratic distribution, \(\mathcal{Q}(\alpha, \beta, \gamma)\), has a dual relationship to itself:

\[Y \sim \mathcal{Q}_{\max}(\alpha, \beta, \gamma) \iff -Y \sim \mathcal{Q}_{\min}(-\beta, -\alpha, \gamma)\]

where \(\mathcal{Q}_{\max}\) and \(\mathcal{Q}_{\min}\) are the concave and convex quadratic distributions, respectively.

The \(\alpha\) and \(\beta\) parameters can also be seen as defining a location-scale family:

\[Y \sim \mathcal{Q}(0, 1, \gamma) \iff \alpha + (\beta - \alpha) Y \sim \mathcal{Q}( \alpha, \beta, \gamma )\]

In other words, \(\alpha\) and \(\beta\) act on the distribution by linearly changing its support.
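
For instance, the location-scale relationship can be checked numerically through the quantile function. Below is a minimal sketch, assuming only the documented constructor and ppf() method:

>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> a, b, c = -1., 3., 2
>>> standard = QuadraticDistribution(0., 1., c)
>>> shifted = QuadraticDistribution(a, b, c)
>>> qs = np.linspace(0., 1., 5)
>>> # Transforming the standard distribution's quantiles by a + (b - a) * y
>>> # should recover the quantiles of the shifted and scaled distribution.
>>> assert np.allclose(a + (b - a) * standard.ppf(qs), shifted.ppf(qs))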

Attributes:
mean : float

The distribution’s mean.

variance : float

The distribution’s variance.

C_MIN : int

A class attribute giving the minimum supported value of c.

C_MAX : int

A class attribute giving the maximum supported value of c.
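
For example, a short sketch reading these attributes off an instance (using only the documented constructor):

>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> # The mean lies within the support, [a, b], and the variance is non-negative.
>>> assert 0. <= dist.mean <= 1.
>>> assert dist.variance >= 0.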

sample(size=None, *, generator=None)[source]#

Return a sample from the quadratic distribution.

Parameters:
size : None, int, or tuple of ints, optional

The desired shape of the returned sample. If None, then the sample is a scalar.

generator : np.random.Generator or None, optional

The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.

Returns:
float or array of floats from a to b inclusive

The sample from the distribution.
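
For example, a sketch of drawing a reproducible sample by passing an explicit generator:

>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> ys = dist.sample(size=1_000, generator=np.random.default_rng(0))
>>> assert ys.shape == (1_000,)
>>> # Every sample falls within the distribution's support, [a, b].
>>> assert np.all((0. <= ys) & (ys <= 1.))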

pdf(ys)[source]#

Return the probability density at ys.

Parameters:
ys : float or array of floats, required

The points at which to evaluate the probability density.

Returns:
non-negative float or array of floats

The probability density at ys.

cdf(ys)[source]#

Return the cumulative probability at ys.

We define the cumulative distribution function, \(F\), using less than or equal to:

\[F(y) = \mathbb{P}(Y \leq y)\]
Parameters:
ys : float or array of floats, required

The points at which to evaluate the cumulative probability.

Returns:
float or array of floats from 0 to 1 inclusive

The cumulative probability at ys.

ppf(qs)[source]#

Return the quantile at qs.

We define the quantile function, \(Q\), as:

\[Q(p) = \inf \{y\in\operatorname{supp} f(y)\mid p\leq F(y)\}\]

where \(f\) is the probability density, \(F\) is the cumulative distribution function, and \(\operatorname{supp} f(y)\) is the support.

Parameters:
qs : float or array of floats from 0 to 1 inclusive, required

The points at which to evaluate the quantiles.

Returns:
float or array of floats from a to b inclusive

The quantiles at qs.
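
As a quick sanity check, the quantile function inverts the cumulative distribution function on the support. A minimal sketch using the documented cdf() and ppf() methods:

>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)
>>> qs = np.linspace(0., 1., 11)
>>> # For this continuous distribution, F(Q(q)) == q on [0, 1].
>>> assert np.allclose(dist.cdf(dist.ppf(qs)), qs)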

quantile_tuning_curve(ns, q=0.5, minimize=None)[source]#

Return the quantile tuning curve evaluated at ns.

Parameters:
ns : positive float or array of floats, required

The points at which to evaluate the tuning curve.

q : float from 0 to 1 inclusive, optional

The quantile at which to evaluate the tuning curve.

minimize : bool or None, optional

Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex quadratic distributions will minimize and concave ones will maximize.

Returns:
float or array of floats from a to b inclusive

The quantile tuning curve evaluated at ns.
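
For example, a sketch checking that the median tuning curve of a concave (maximizing) distribution improves weakly with the number of search iterations:

>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2)  # concave, so it maximizes by default
>>> curve = dist.quantile_tuning_curve(np.array([1., 2., 4., 8., 16.]), q=0.5)
>>> # The median best score after n iterations is non-decreasing in n.
>>> assert np.all(np.diff(curve) >= 0)
>>> assert np.all((0. <= curve) & (curve <= 1.))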

average_tuning_curve(ns, minimize=None)[source]#

Return the average tuning curve evaluated at ns.

Parameters:
ns : positive float or array of floats, required

The points at which to evaluate the tuning curve.

minimize : bool or None, optional

Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex quadratic distributions will minimize and concave ones will maximize.

Returns:
float or array of floats from a to b inclusive

The average tuning curve evaluated at ns.
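
Analogously, for a convex (minimizing) distribution the expected best score weakly decreases as the search runs longer. A small sketch:

>>> import numpy as np
>>> from opda.parametric import QuadraticDistribution
>>> dist = QuadraticDistribution(0., 1., 2, convex=True)  # convex, so it minimizes by default
>>> curve = dist.average_tuning_curve(np.array([1., 2., 4., 8.]))
>>> # The expected best (lowest) score after n iterations is non-increasing in n.
>>> assert np.all(np.diff(curve) <= 0)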

classmethod fit(ys, limits=(-inf, inf), constraints=None, *, generator=None, method='maximum_spacing')[source]#

Return a quadratic distribution fitted to ys.

Parameters:
ys : 1D array of finite floats, required

The sample from the distribution.

limits : pair of floats, optional

The left-open interval over which to fit the distribution. Observations outside the interval are censored (the fit considers only that an observation occurred above or below the interval, not its exact value).

constraints : mapping or None, optional

A mapping from strings (the class’s parameter names) to constraints. For a, b, and c, the constraint can be either a float (fixing the parameter’s value) or a pair of floats (restricting the parameter to that closed interval). For convex, the constraint can be either a boolean (fixing the parameter’s value) or a sequence of booleans (restricting the parameter to that set). The mapping can be empty and need not include all the parameters.

generator : np.random.Generator or None, optional

The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.

method : str, optional

One of the strings: “maximum_spacing”. The method parameter determines how the distribution will be estimated. See the notes section for details on different methods.

Returns:
QuadraticDistribution

The fitted quadratic distribution.

Notes

We fit the distribution via maximum product spacing estimation (MPS) [1] [2]. MPS maximizes the product of the spacings: \(F\left(Y_{(i)}; \theta\right) - F\left(Y_{(i-1)}; \theta\right)\), where \(F\) is the CDF at the parameters \(\theta\) and \(Y_{(i)}\) is the i’th order statistic.

To compute the maximum, we optimize the spacing objective with a generic optimization algorithm that might fail to find the global maximum; however, such failures should be rare.

Fitting Part of the Data

The quadratic distribution approximates the tail of the score distribution from random search, so you'll typically fit just the tail. You can accomplish this using the limits parameter, which censors all observations outside a left-open interval. For example, limits = (-np.inf, threshold) will censor everything above threshold, and limits = (threshold, np.inf) will censor everything less than or equal to threshold.

Why Use Maximum Spacing Estimation?

Often, maximum spacing still works where maximum likelihood breaks down. In our case, the quadratic distribution has unbounded probability density when \(\gamma = 1\). Since the parameters \(\alpha\) and \(\beta\) control the distribution’s support, this unbounded density leads to the unbounded likelihood problem [3], which makes the maximum likelihood estimator inconsistent. Unlike maximum likelihood, maximum spacing remains consistent even with an unbounded likelihood.

A Variant of Maximum Spacing Estimation

Standard MPS can’t handle tied data points. In order to handle ties and censoring, we use a variant of MPS. For a detailed description, see the notes section of NoisyQuadraticDistribution.

References

[1]

Cheng, R. C. H., & Amin, N. A. K. (1983). Estimating Parameters in Continuous Univariate Distributions with a Shifted Origin. Journal of the Royal Statistical Society. Series B (Methodological), 45(3), 394-403.

[2]

Ranneby, B. (1984). The Maximum Spacing Method. An Estimation Method Related to the Maximum Likelihood Method. Scandinavian Journal of Statistics, 11(2), 93-112.

[3]

Cheng, R. C. H., & Traylor, L. (1995). Non-Regular Maximum Likelihood Problems. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 3-44.

Examples

Use fit() to estimate the distribution from data:

>>> QuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
... )
QuadraticDistribution(...)

When fitting a quadratic distribution to the results from random search, you’ll typically restrict it to be convex when minimizing and concave when maximizing. You can accomplish this with constraints:

>>> minimize = True  # Whether random search minimized (True) or maximized (False)
>>> QuadraticDistribution.fit(
...   ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...   constraints={"convex": True if minimize else False},
... )
QuadraticDistribution(...)

The quadratic distribution approximates the score distribution’s left tail when minimizing and right tail when maximizing. You can fit to only the tail of your data using limits:

>>> threshold = 5.  # The maximum cross-entropy to consider
>>> QuadraticDistribution.fit(
...   ys=[5.0, 2.9, 5.1, 6.8, 3.2],  # 5.1 & 6.8 are censored
...   limits=(
...     (-np.inf, threshold)  # minimize: fit the left tail
...     if minimize else
...     (threshold, np.inf)   # maximize: fit the right tail
...   ),
...   constraints={"convex": True if minimize else False},
... )
QuadraticDistribution(...)

You could also censor both tails if necessary:

>>> QuadraticDistribution.fit(
...   ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...   limits=(1., 10.),
... )
QuadraticDistribution(...)

Finally, you can use constraints to bound any of the parameters in case you have some extra information. For example, if you knew a bound on the performance of the worst or best hyperparameters, you could constrain a and b:

>>> min_accuracy, max_accuracy = 0., 1.
>>> QuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...   constraints={
...     "a": (min_accuracy, max_accuracy),
...     "b": (min_accuracy, max_accuracy),
...   },
... )
QuadraticDistribution(...)

Or, you might know that the random search used 3 hyperparameters, so the effective number of hyperparameters (c) can be at most that:

>>> n_hyperparameters = 3
>>> QuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...   constraints={"c": (1, n_hyperparameters)},
... )
QuadraticDistribution(...)

You could also fix c (or a, b, or convex) to a particular value:

>>> QuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...   constraints={"c": 1},
... )
QuadraticDistribution(...)

Of course, you can mix and match all of these ideas together as desired.

class opda.parametric.NoisyQuadraticDistribution(a, b, c, o, convex=False)[source]#

The Noisy Quadratic distribution.

When using random search to optimize a smooth function with additive normal noise, the best score asymptotically approaches a noisy quadratic distribution. In particular, if the search distribution is a continuous distribution, the function is well-approximated by a second-order Taylor expansion near the optimum, and the observed score is the result of the function plus additive normal noise, then the tail of the score distribution will approach a noisy quadratic distribution.

Parameters:
a : finite float, required

The minimum value that the distribution can take, without accounting for the additive noise.

b : finite float, required

The maximum value that the distribution can take, without accounting for the additive noise.

c : positive int, required

The effective number of hyperparameters. Values of c greater than 10 are not supported.

o : finite non-negative float, required

The standard deviation of the additive noise.

convex : bool, optional

Whether or not to use the convex form of the noisy quadratic distribution, as opposed to the concave form. When optimizing via random search, the tail of the score distribution approaches the convex form when minimizing and the concave when maximizing.

See also

QuadraticDistribution

A noiseless version of the noisy quadratic distribution.

Notes

The noisy quadratic distribution, \(\mathcal{Q}(\alpha, \beta, \gamma, \sigma)\), has a dual relationship to itself:

\[Y \sim \mathcal{Q}_{\max}(\alpha, \beta, \gamma, \sigma) \iff -Y \sim \mathcal{Q}_{\min}(-\beta, -\alpha, \gamma, \sigma)\]

where \(\mathcal{Q}_{\max}\) and \(\mathcal{Q}_{\min}\) are the concave and convex noisy quadratic distributions, respectively.

The \(\alpha\) and \(\beta\) parameters can also be seen as defining a location-scale family:

\[Y \sim \mathcal{Q}(0, 1, \gamma, \sigma) \iff \alpha + (\beta - \alpha) Y \sim \mathcal{Q}( \alpha, \beta, \gamma, (\beta - \alpha)\sigma )\]

In other words, \(\alpha\) and \(\beta\) act on the distribution by linearly changing its support and residual standard deviation, \(\sigma\).
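
The scaling of \(\sigma\) by \(\beta - \alpha\) can be checked numerically via the CDF. Below is a sketch using the documented cdf() method; the loose tolerance allows for numerical approximation of the noisy CDF:

>>> import numpy as np
>>> from opda.parametric import NoisyQuadraticDistribution
>>> a, b, c, o = -1., 3., 2, 0.1
>>> standard = NoisyQuadraticDistribution(0., 1., c, o)
>>> scaled = NoisyQuadraticDistribution(a, b, c, (b - a) * o)
>>> ys = np.linspace(a - 1., b + 1., 9)
>>> # The scaled distribution's CDF equals the standard CDF of (y - a) / (b - a).
>>> assert np.allclose(
...   scaled.cdf(ys),
...   standard.cdf((ys - a) / (b - a)),
...   atol=1e-4,
... )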

Attributes:
mean : float

The distribution’s mean.

variance : float

The distribution’s variance.

C_MIN : int

A class attribute giving the minimum supported value of c.

C_MAX : int

A class attribute giving the maximum supported value of c.

sample(size=None, *, generator=None)[source]#

Return a sample from the noisy quadratic distribution.

Parameters:
size : None, int, or tuple of ints, optional

The desired shape of the returned sample. If None, then the sample is a scalar.

generator : np.random.Generator or None, optional

The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.

Returns:
float or array of floats

The sample from the distribution.
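
For example, a Monte Carlo sketch comparing a large sample to the mean and variance attributes (the tolerances are loose to absorb sampling error):

>>> import numpy as np
>>> from opda.parametric import NoisyQuadraticDistribution
>>> dist = NoisyQuadraticDistribution(0., 1., 2, 0.1)
>>> ys = dist.sample(size=100_000, generator=np.random.default_rng(0))
>>> # Unlike the noiseless case, samples may fall slightly outside [a, b].
>>> assert abs(ys.mean() - dist.mean) < 0.02
>>> assert abs(ys.var() - dist.variance) < 0.02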

pdf(ys)[source]#

Return the probability density at ys.

Parameters:
ys : float or array of floats, required

The points at which to evaluate the probability density.

Returns:
non-negative float or array of floats

The probability density at ys.

cdf(ys)[source]#

Return the cumulative probability at ys.

We define the cumulative distribution function, \(F\), using less than or equal to:

\[F(y) = \mathbb{P}(Y \leq y)\]
Parameters:
ys : float or array of floats, required

The points at which to evaluate the cumulative probability.

Returns:
float or array of floats from 0 to 1 inclusive

The cumulative probability at ys.

ppf(qs)[source]#

Return the quantile at qs.

We define the quantile function, \(Q\), as:

\[Q(p) = \inf \{y\in\operatorname{supp} f(y)\mid p\leq F(y)\}\]

where \(f\) is the probability density, \(F\) is the cumulative distribution function, and \(\operatorname{supp} f(y)\) is the support.

Parameters:
qs : float or array of floats from 0 to 1 inclusive, required

The points at which to evaluate the quantiles.

Returns:
float or array of floats

The quantiles at qs.

quantile_tuning_curve(ns, q=0.5, minimize=None)[source]#

Return the quantile tuning curve evaluated at ns.

Parameters:
ns : positive float or array of floats, required

The points at which to evaluate the tuning curve.

q : float from 0 to 1 inclusive, optional

The quantile at which to evaluate the tuning curve.

minimize : bool or None, optional

Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex noisy quadratic distributions will minimize and concave ones will maximize.

Returns:
float or array of floats

The quantile tuning curve evaluated at ns.

average_tuning_curve(ns, minimize=None, *, atol=None)[source]#

Return the average tuning curve evaluated at ns.

Parameters:
ns : positive float or array of floats, required

The points at which to evaluate the tuning curve.

minimize : bool or None, optional

Whether or not to compute the tuning curve for minimizing a metric as opposed to maximizing it. Defaults to None, in which case it is taken to be the same as self.convex, so convex noisy quadratic distributions will minimize and concave ones will maximize.

atol : non-negative float or None, optional

The absolute tolerance to use for stopping the computation. The average tuning curve is computed via numerical integration. The computation stops when the error estimate is less than atol. If None, then atol is set automatically based on the distribution’s parameters.

Returns:
float or array of floats

The average tuning curve evaluated at ns.
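
For example, a sketch passing an explicit absolute tolerance for the numerical integration:

>>> import numpy as np
>>> from opda.parametric import NoisyQuadraticDistribution
>>> dist = NoisyQuadraticDistribution(0., 1., 2, 0.1)  # concave, so it maximizes by default
>>> curve = dist.average_tuning_curve(np.array([1., 2., 4., 8.]), atol=1e-6)
>>> # The expected best (highest) score after n iterations is non-decreasing in n.
>>> assert np.all(np.diff(curve) >= 0)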

classmethod fit(ys, limits=(-inf, inf), constraints=None, *, generator=None, method='maximum_spacing')[source]#

Return a noisy quadratic distribution fitted to ys.

Parameters:
ys : 1D array of finite floats, required

The sample from the distribution.

limits : pair of floats, optional

The left-open interval over which to fit the distribution. Observations outside the interval are censored (the fit considers only that an observation occurred above or below the interval, not its exact value).

constraints : mapping or None, optional

A mapping from strings (the class’s parameter names) to constraints. For a, b, c, and o the constraint can be either a float (fixing the parameter’s value) or a pair of floats (restricting the parameter to that closed interval). For convex, the constraint can be either a boolean (fixing the parameter’s value) or a sequence of booleans (restricting the parameter to that set). The mapping can be empty and need not include all the parameters.

generator : np.random.Generator or None, optional

The random number generator to use. If None, then the global default random number generator is used. See opda.random for more information.

method : str, optional

One of the strings: “maximum_spacing”. The method parameter determines how the distribution will be estimated. See the notes section for details on different methods.

Returns:
NoisyQuadraticDistribution

The fitted noisy quadratic distribution.

Notes

We fit the distribution via maximum product spacing estimation (MPS) [1] [2]. MPS maximizes the product of the spacings: \(F\left(Y_{(i)}; \theta\right) - F\left(Y_{(i-1)}; \theta\right)\), where \(F\) is the CDF at the parameters \(\theta\) and \(Y_{(i)}\) is the i’th order statistic.

To compute the maximum, we optimize the spacing objective with a generic optimization algorithm that might fail to find the global maximum; however, such failures should be rare.

Fitting Part of the Data

The noisy quadratic distribution approximates the tail of the score distribution from random search, so you'll typically fit just the tail. You can accomplish this using the limits parameter, which censors all observations outside a left-open interval. For example, limits = (-np.inf, threshold) will censor everything above threshold, and limits = (threshold, np.inf) will censor everything less than or equal to threshold.

Why Use Maximum Spacing Estimation?

Often, maximum spacing still works where maximum likelihood breaks down. In our case, the noisy quadratic distribution has unbounded probability density when \(\sigma = 0\) and \(\gamma = 1\). Since the parameters \(\alpha\) and \(\beta\) control the distribution’s support, this unbounded density leads to the unbounded likelihood problem [3], which makes the maximum likelihood estimator inconsistent. Unlike maximum likelihood, maximum spacing remains consistent even with an unbounded likelihood.

A Variant of Maximum Spacing Estimation

Standard MPS can’t handle tied data points because then the spacing, \(F\left(Y_{(i)}; \theta\right) - F\left(Y_{(i-1)}; \theta\right)\), would be zero, making the whole product of spacings zero.

To handle ties and censoring, we view maximum spacing as maximum likelihood on grouped data [4]. Roughly, we imagine adding an extra data point at \(Y_{(n+1)} = \infty\), then grouping the data between the order statistics, and finally applying maximum likelihood.

More specifically, we group the data into buckets defined as left-open intervals: \(\left(Z_{(i-1)}, Z_{(i)}\right]\). To construct the buckets, we deduplicate the uncensored order statistics along with the lower and upper bounds on the support:

\[-\infty = Z_{(0)} < Z_{(1)} = Y_{(i_1)} < \ldots < Y_{(i_k)} = Z_{(k)} < Z_{(k+1)} = \infty\]

If there are censored observations in the left tail then we further divide the first bucket at the lower limit for censoring; similarly, if the right tail contains censored observations, we divide the last bucket at the upper limit. If any of the (uncensored) order statistics equals one of the censoring limits, then we leave the bucket as is so as never to create a zero-length bucket. In this way, the deduplication ensures all buckets have positive length.

With the buckets defined, we bin the data into each of the buckets and put an extra data point in the rightmost one (as in typical maximum spacing estimation). Intuitively, this corresponds to estimating the CDF at each order statistic by i/(n+1), reducing the upward bias at the order statistics. For example, if you have one data point, you might guess that it’s near the median rather than the max of the distribution. If \(N_i\) is the number of data points in bucket \(\left(Z_{(i-1)}, Z_{(i)}\right]\), then the objective becomes:

\[\sum_{i=1}^{l} N_i \log\left( F\left(Z_{(i)}; \theta\right) - F\left(Z_{(i-1)}; \theta\right) \right)\]

where \(l\) is the highest index of all the \(Z_{(i)}\).
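
To make the objective concrete, here is a minimal sketch of how the grouped spacing objective could be computed for one candidate distribution. It is an illustration only, not opda's internal implementation; the helper function, bucket boundaries, and counts below are hypothetical:

>>> import numpy as np
>>> from opda.parametric import NoisyQuadraticDistribution
>>> def grouped_spacing_objective(cdf, zs, counts):
...     # zs holds the interior bucket boundaries Z_(1) < ... < Z_(k); the outer
...     # boundaries contribute F(-inf) = 0 and F(inf) = 1.
...     probs = np.concatenate([[0.], cdf(np.asarray(zs, dtype=float)), [1.]])
...     spacings = np.diff(probs)  # probability mass of each left-open bucket
...     return np.sum(np.asarray(counts) * np.log(spacings))
>>> ys = [0.2, 0.5, 0.5, 0.9]  # observed scores, with a tie at 0.5
>>> zs = sorted(set(ys))       # deduplicated order statistics
>>> counts = [1, 2, 1, 1]      # data per bucket; the final 1 is the extra data point
>>> candidate = NoisyQuadraticDistribution(0., 1., 2, 0.05)
>>> objective = grouped_spacing_objective(candidate.cdf, zs, counts)
>>> # Fitting maximizes this objective over the distribution's parameters.
>>> assert np.isfinite(objective)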

When fitting with constraints, a minor modification is necessary if \(\sigma\) is constrained to be zero. Instead of using \(\pm\infty\) as the bounds on the support, you must use \(\alpha_-\) (the lower bound on \(\alpha\)) and \(\beta_+\) (the upper bound on \(\beta\)); also, the leftmost bucket must be closed (i.e., include \(\alpha_-\)). The reason is that if you were to construct the buckets using \(-\infty\) and you had a data point equal to \(\alpha_-\), then you'd get a bucket that contains zero probability no matter the parameters. Even worse, that data point would get put in this bucket, sending the product of spacings to zero. Similarly, if you used \(\infty\) and you had a data point equal to \(\beta_+\), then you'd get another zero-probability bucket. The extra data point at the top of the support would get put in this bucket and again send the product of spacings to zero. These situations can be fairly common in practice due to rounding. Thus, in constrained problems where \(\sigma\) is zero we construct the buckets using \(\alpha_-\) and \(\beta_+\).

References

[1]

Cheng, R. C. H., & Amin, N. A. K. (1983). Estimating Parameters in Continuous Univariate Distributions with a Shifted Origin. Journal of the Royal Statistical Society. Series B (Methodological), 45(3), 394-403.

[2]

Ranneby, B. (1984). The Maximum Spacing Method. An Estimation Method Related to the Maximum Likelihood Method. Scandinavian Journal of Statistics, 11(2), 93-112.

[3]

Cheng, R. C. H., & Traylor, L. (1995). Non-Regular Maximum Likelihood Problems. Journal of the Royal Statistical Society. Series B (Methodological), 57(1), 3-44.

[4]

Titterington, D. M. (1985). Comment on “Estimating Parameters in Continuous Univariate Distributions.” Journal of the Royal Statistical Society. Series B (Methodological), 47(1), 115-116.

Examples

Use fit() to estimate the distribution from data:

>>> NoisyQuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
... )
NoisyQuadraticDistribution(...)

When fitting a noisy quadratic distribution to the results from random search, you’ll typically restrict it to be convex when minimizing and concave when maximizing. You can accomplish this with constraints:

>>> minimize = True  # Whether random search minimized (True) or maximized (False)
>>> NoisyQuadraticDistribution.fit(
...   ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...   constraints={"convex": True if minimize else False},
... )
NoisyQuadraticDistribution(...)

The noisy quadratic distribution approximates the score distribution’s left tail when minimizing and right tail when maximizing. You can fit to only the tail of your data using limits:

>>> threshold = 5.  # The maximum cross-entropy to consider
>>> NoisyQuadraticDistribution.fit(
...   ys=[5.0, 2.9, 5.1, 6.8, 3.2],  # 5.1 & 6.8 are censored
...   limits=(
...     (-np.inf, threshold)  # minimize: fit the left tail
...     if minimize else
...     (threshold, np.inf)   # maximize: fit the right tail
...   ),
...   constraints={"convex": True if minimize else False},
... )
NoisyQuadraticDistribution(...)

You could also censor both tails if necessary:

>>> NoisyQuadraticDistribution.fit(
...   ys=[5.0, 2.9, 5.1, 6.8, 3.2],
...   limits=(1., 10.),
... )
NoisyQuadraticDistribution(...)

Finally, you can use constraints to bound any of the parameters in case you have some extra information. For example, if you knew a bound on the average performance of the worst or best hyperparameters, you could constrain a and b:

>>> min_accuracy, max_accuracy = 0., 1.
>>> NoisyQuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...   constraints={
...     "a": (min_accuracy, max_accuracy),
...     "b": (min_accuracy, max_accuracy),
...   },
... )
NoisyQuadraticDistribution(...)

Or, you might know that the random search used 3 hyperparameters, so the effective number of hyperparameters (c) can be at most that:

>>> n_hyperparameters = 3
>>> NoisyQuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...   constraints={"c": (1, n_hyperparameters)},
... )
NoisyQuadraticDistribution(...)

You could also fix c (or a, b, o, or convex) to a particular value:

>>> NoisyQuadraticDistribution.fit(
...   ys=[0.59, 0.86, 0.94, 0.81, 0.68, 0.90, 0.93, 0.75],
...   constraints={"c": 1},
... )
NoisyQuadraticDistribution(...)

Of course, you can mix and match all of these ideas together as desired.