Positive unlabeled (PU) learning is a semi-supervised binary classification setting in which no labeled negative examples are available to learn a classifier. The dataset is composed of a set of labeled positive examples and a usually much larger set of unlabeled examples containing both positives and negatives. Despite the absence of labeled negatives, a special loss function makes it possible to learn from PU data.

In this post I will follow the derivation from du Plessis et al. (2014).1

The basics of statistical learning theory

This section introduces basic definitions needed below. If you know machine learning you also know all this. Statistical learning theory is concerned with finding under which conditions and how well it is possible to learn from data. I will denote the data with $\mathcal{D}$, using $\mathcal{X}$ and $\mathcal{Y}$ for the input and output spaces respectively; since we are in a binary classification setting, $\mathcal{Y}=\{-1,1\}$. A generic classifier $f:\mathcal{X}\rightarrow\mathcal{Y}$ gives a prediction for an input in $\mathcal{X}$.

We now consider a generic loss function $\ell:\mathcal{Y}\times\mathcal{Y} \rightarrow\mathbb{R}$ that indicates how good the predictions are, and define the risk of $f$ as the expected loss over the data:

\[R_\ell(f)=\mathbb{E}_{\mathcal{D}}[\ell(f(x), y)]\]

The 0-1 loss is defined as $\ell_{01}(y,t)=\mathbb{1}[y\neq t]$, and the risk under this loss is simply the probability of the classifier giving an incorrect prediction:

\[R_{01}(f)=\mathbb{E}_{x,y}[\ell_{01}(f(x),y)]=p(f(x)\neq y)\]

The best possible classifier for $\mathcal{D}$ under the 0-1 loss is the Bayes classifier $f^\star(x)=\text{sign}(\mathbb{E}[y\vert x])$. Besides the 0-1 loss, there are other losses, called surrogate losses, that can be used to compute the risk in such a way that a classifier minimizing $R_\ell$ also minimizes $R_{01}$.2
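To make the definitions concrete, here is a minimal sketch of how the empirical 0-1 risk would be computed from a labeled sample; the arrays and values are made up for the example.

```python
import numpy as np

def zero_one_risk(predictions, labels):
    """Empirical 0-1 risk: the fraction of incorrect predictions."""
    return np.mean(predictions != labels)

# Made-up predictions and labels in {-1, 1}.
y_true = np.array([1, -1, 1, 1, -1])
y_pred = np.array([1, 1, 1, -1, -1])
print(zero_one_risk(y_pred, y_true))  # 0.4: two mistakes out of five
```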

The 0-1 risk for positive unlabeled data

The 0-1 risk can be written as follows:

\[R_{01}(f)=p(f(x)=-1|y=1)p(y=1)+p(f(x)=1|y=-1)p(y=-1)\]

In a fully supervised setting, both terms can be approximated with the dataset. Without negative examples, however, it looks like we are stuck and cannot compute the second term.

As it turns out, there is a way out. Consider the following (from now on I will omit $x$ to make things easier to read):

\[p(f=1)=p(f=1|y=1)p(y=1)+p(f=1|y=-1)p(y=-1)\]

By re-arranging this we get:

\[p(f=1|y=-1)p(y=-1)=p(f=1)-p(f=1|y=1)p(y=1)\]

Thus, the 0-1 risk is equivalent to:

\[R_{01}(f)=p(f=-1|y=1)p(y=1)+p(f=1)-p(f=1|y=1)p(y=1)\]

Or, using $p(f=1|y=1)=1-p(f=-1|y=1)$:

\[\begin{align} R_{01}(f) &=p(f=-1|y=1)p(y=1)+p(f=1)-(1-p(f=-1|y=1))p(y=1) \\ &=2\cdot p(f=-1|y=1)p(y=1)+p(f=1)-p(y=1) \end{align}\]

This can be computed with positive-unlabeled data! The first term only involves $p(f=-1|y=1)$, the 0-1 risk on the positive samples, while $p(f=1)$ can be approximated on the unlabeled set if we assume that the unlabeled samples are a representative sample from the data distribution (i.e., no selection bias is involved). Dealing with $p(y=1)$ is tougher: you can either guesstimate it using domain knowledge, or try to estimate it from the data itself.3
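As a small sketch of the empirical version, assuming the class prior $p(y=1)$ is known and using made-up names for the classifier and the two sample sets:

```python
import numpy as np

def pu_zero_one_risk(f, x_pos, x_unl, prior):
    """Empirical PU estimate of the 0-1 risk:
    2 * prior * p(f=-1|y=1) + p(f=1) - prior,
    with probabilities replaced by sample frequencies."""
    false_neg_rate = np.mean(f(x_pos) == -1)  # estimates p(f=-1|y=1) on positives
    pred_pos_rate = np.mean(f(x_unl) == 1)    # estimates p(f=1) on unlabeled data
    return 2 * prior * false_neg_rate + pred_pos_rate - prior
```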

The risk for a general loss function

Consider the data distribution as a mixture of the two class-conditional distributions, writing $\pi=p(y=1)$ for the class prior, $p_+(x)=p(x|y=1)$ and $p_-(x)=p(x|y=-1)$:

\[\begin{align} p(x) &=p(x|y=1)p(y=1)+p(x|y=-1)p(y=-1) \\ &=\pi p_+(x)+(1-\pi)p_-(x) \end{align}\]
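To make the mixture view concrete, here is a small sketch of how a PU dataset arises from it, with a made-up prior and made-up Gaussian class conditionals: the labeled positives are drawn from $p_+$ alone, while the unlabeled points are drawn from the full mixture $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.4                 # made-up class prior p(y=1)
n_pos, n_unl = 100, 1000

# Made-up class conditionals: two 2D Gaussians.
def sample_pos(n):
    return rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(n, 2))

def sample_neg(n):
    return rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n, 2))

# Labeled positives come from p_+ only.
x_pos = sample_pos(n_pos)

# Unlabeled points come from the mixture pi * p_+ + (1 - pi) * p_-.
is_pos = rng.random(n_unl) < pi
x_unl = np.where(is_pos[:, None], sample_pos(n_unl), sample_neg(n_unl))
```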

This allows us to write the risk of an estimator as

\[R(f) =\mathbb{E}_{x,y}[\ell(f(x),y)] =\pi\mathbb{E}_{x|y=1}[\ell(f(x),1)]+(1-\pi)\mathbb{E}_{x|y=-1}[\ell(f(x),-1)]\]

As before, we cannot estimate the second term due to the lack of negative data. However, the distribution of negatives is given by

\[(1-\pi)p_-(x)=p(x)-\pi p_+(x)\]

This allows us to expand the risk on the negative data as

\[\begin{align} (1-\pi)\mathbb{E}_{x|y=-1}[\ell(f(x),-1)] &=\int_\mathcal{X} \ell(f(x),-1)(1-\pi)p_-(x)\text{d}x \\ &=\int_\mathcal{X} \ell(f(x),-1)\left[p(x)-\pi p_+(x)\right]\text{d}x \\ &=\int_\mathcal{X} \ell(f(x),-1)p(x)\text{d}x- \int_\mathcal{X} \ell(f(x),-1)\pi p_+(x)\text{d}x \\ &=\mathbb{E}_x[\ell(f(x),-1)]-\pi\mathbb{E}_{x|y=1}[\ell(f(x),-1)] \end{align}\]

Plugging back into the equation of the risk, we get a risk for positive unlabeled data:

\[R_{pu}(f) =\pi\mathbb{E}_{x|y=1}[\ell(f(x),1)] +\mathbb{E}_x[\ell(f(x),-1)]-\pi\mathbb{E}_{x|y=1}[\ell(f(x),-1)]\]
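A minimal sketch of the corresponding empirical estimate, again assuming the prior is known and using illustrative names:

```python
import numpy as np

def pu_risk(loss, scores_pos, scores_unl, prior):
    """Empirical PU risk for a generic loss ell(f(x), y):
    pi * E_+[ell(f, 1)] + E_x[ell(f, -1)] - pi * E_+[ell(f, -1)],
    with expectations replaced by sample means over the classifier
    scores on the positive and unlabeled sets."""
    risk_pos = np.mean(loss(scores_pos, 1))       # positives with their true label
    risk_unl_neg = np.mean(loss(scores_unl, -1))  # unlabeled treated as negatives
    risk_pos_neg = np.mean(loss(scores_pos, -1))  # positives treated as negatives
    return prior * risk_pos + risk_unl_neg - prior * risk_pos_neg
```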

Unbiased risk estimator

For $R_{pu}$ to be an unbiased estimator, it must match the risk we would obtain by plugging the surrogate loss into the positive-unlabeled expression of the 0-1 risk derived in the previous section. In other words, we must have:

\[R_{pu}(f)= 2\pi \mathbb{E}_{x|y=1}[\ell(f(x),1)]+\mathbb{E}_x[\ell(f(x),-1)]-\pi\]

where the right-hand side is exactly that expression with $\ell$ in place of $\ell_{01}$ and $\pi=p(y=1)$: recall that $p(f=-1|y=1)=\mathbb{E}_{x|y=1}[\ell_{01}(f(x),1)]$ and $p(f=1)=\mathbb{E}_x[\ell_{01}(f(x),-1)]$. By equating the two expressions we see that $\mathbb{E}_x[\ell(f(x),-1)]$ cancels and we get

\[\pi\mathbb{E}_{x|y=1}[\ell(f(x),1)] -\pi\mathbb{E}_{x|y=1}[\ell(f(x),-1)] =2\pi \mathbb{E}_{x|y=1}[\ell(f(x),1)]-\pi\]

Merging the two $\mathbb{E}_{x\vert y=1}[\ell(f(x),1)]$ gives:

\[-\pi\mathbb{E}_{x|y=1}[\ell(f(x),-1)] =\pi \mathbb{E}_{x|y=1}[\ell(f(x),1)]-\pi\]

Dividing by $-\pi$ and moving the one inside the expectation on the right gives:

\[\mathbb{E}_{x|y=1}[\ell(f(x),-1)] =\mathbb{E}_{x|y=1}[1-\ell(f(x),1)]\]

In other words, for $R_{pu}$ to be an unbiased estimator of the surrogate 0-1 risk we must have a symmetry condition on $\ell$:

\[\ell(f(x),-1)+\ell(f(x),1)=1\]

The 0-1 loss satisfies this, but it is not a good candidate for practical applications. We seek a loss function that is continuous, smooth and differentiable, so that we can use gradient descent to optimize the classifier. All these conditions are met by the sigmoid loss:

\[\ell_\sigma(f(x), y) = \frac{1}{1+\exp(y\cdot f(x))}\]

It is trivial to verify that the symmetry condition is satisfied:

\[\begin{align} \ell_\sigma(f(x), -1)+\ell_\sigma(f(x), 1) &=\frac{2+\exp(f(x))+\exp(-f(x))}{(1+\exp(f(x)))(1+\exp(-f(x)))} \\ &=\frac{2+\exp(f(x))+\exp(-f(x))}{2+\exp(f(x))+\exp(-f(x))} \\ &= 1 \end{align}\]
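Putting the pieces together, here is a minimal end-to-end sketch of minimizing the unbiased PU risk with the sigmoid loss by plain gradient descent on a made-up linear scorer, reusing the synthetic `x_pos`, `x_unl` and `pi` from the mixture sketch above. It is only an illustration of the idea, not the reference implementation from the paper.

```python
import numpy as np

def sigmoid_loss(z, y):
    """ell_sigma(f(x), y) = 1 / (1 + exp(y * f(x)))."""
    return 1.0 / (1.0 + np.exp(y * z))

def sigmoid_loss_grad(z, y):
    """Derivative of the sigmoid loss with respect to the score z."""
    l = sigmoid_loss(z, y)
    return -y * l * (1.0 - l)

def pu_risk_and_grad(w, b, x_pos, x_unl, prior):
    """Unbiased PU risk with the sigmoid loss, and its gradient,
    for a linear scorer f(x) = x @ w + b."""
    z_pos, z_unl = x_pos @ w + b, x_unl @ w + b
    risk = (prior * np.mean(sigmoid_loss(z_pos, 1))
            + np.mean(sigmoid_loss(z_unl, -1))
            - prior * np.mean(sigmoid_loss(z_pos, -1)))
    # Gradient of the risk with respect to the scores...
    g_pos = prior * (sigmoid_loss_grad(z_pos, 1)
                     - sigmoid_loss_grad(z_pos, -1)) / len(z_pos)
    g_unl = sigmoid_loss_grad(z_unl, -1) / len(z_unl)
    # ...chained through the linear model.
    grad_w = x_pos.T @ g_pos + x_unl.T @ g_unl
    grad_b = g_pos.sum() + g_unl.sum()
    return risk, grad_w, grad_b

# Plain gradient descent on the PU risk.
w, b, lr = np.zeros(2), 0.0, 1.0
for step in range(500):
    risk, grad_w, grad_b = pu_risk_and_grad(w, b, x_pos, x_unl, pi)
    w -= lr * grad_w
    b -= lr * grad_b
```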

Conclusion

We first derived an expression for the 0-1 risk that can be computed with a dataset that only has positive and unlabeled examples. We then generalized this expression to any surrogate loss, and derived an essential condition that such a loss must satisfy in order to give an unbiased estimate of the risk. We finally presented an example of such a loss, which makes learning from positive unlabeled data no harder than learning with negative examples.

Bibliography

  1. du Plessis, M. C., Niu, G. & Sugiyama, M. Analysis of Learning from Positive and Unlabeled Data. Advances in Neural Information Processing Systems 27, (2014). 

  2. Bartlett, P. L., Jordan, M. I. & Mcauliffe, J. D. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association 101, 138–156 (2006). 

  3. du Plessis, M. C., Niu, G. & Sugiyama, M. Class-prior estimation for learning from positive and unlabeled data. Mach Learn 106, 463–492 (2017).