A loss function for positive unlabeled learning
Positive unlabeled (PU) learning is a semisupervised binary classification setting when no labeled negative example is available to learn a classifier. This means that the dataset is composed of a set of labeled positive examples and an usually much larger set of unlabeled examples containing both positives and negatives. Despite the absence of labeled negatives, a special loss function exists to learn from PU data.
In this post I will follow the derivation from du Plessis et al. (2014).^{1}
The basics of statistical learning theory
This section introduces basic definitions needed below. If you know machine learning you also know all this. Statistical learning theory is concerned with finding under which conditions and how well it is possible to learn from data. I will denote the data with $\mathcal{D}$, using $\mathcal{X}$ and $\mathcal{Y}$ for the input and output spaces respectively, and, since we are in a binary classification setting, $\mathcal{Y}={ 1,1 }$. A generic classifier $f:\mathcal{X}\rightarrow\mathcal{Y}$ gives a prediction for an input in $\mathcal{X}$.
We now consider a generic loss function $\ell:\mathcal{Y}\times\mathcal{Y} \rightarrow\mathbb{R}$ that indicates how good the predictions are, and define the risk of $f$ as the expected loss over the data:
\[R_\ell(f)=\mathbb{E}_{\mathcal{D}}[\ell(f(x), y)]\]The 01 loss is defined as $\ell_{01}(y,t)=\mathbb{1}[y\neq t]$, and the risk under this loss is simply the probability of the classifier giving an incorrect prediction:
\[R_{01}(f)=\mathbb{E}_{x,y}[\ell_{01}(f(x),y)]=p(f(x)\neq y)\]The best possible classifier $f^\star=\mathbb{E}(y\vert x)$ for $\mathcal{D}$ minimizes the 01 loss. There are other losses called surrgate losses besides the 01 loss that can be used to compute the risk in such a way that a classifier minimizing $R_\ell$ also minimizes $R_{01}$.^{2}
The 01 risk for positive unlabeled data
The 01 risk can be written as follows:
\[R_{01}(f)=p(f=1y=1,x)p(y=1x)+p(f=1y=1,x)p(y=1x)\]In a fully supervised setting, both terms can be approximated with the dataset. Without negative examples, however, it looks like we are stuck and cannot compute the second term.
As it turns out, there is a way out. Consider the following (from now on I will omit $x$ to make things easier to read):
\[p(f=1)=p(f=1y=1)p(y=1)+p(f=1y=1)p(y=1)\]By rearranging this we get:
\[p(f=1y=1)p(y=1)=p(f=1y=1)p(y=1)p(f=1)\]Thus, the 01 risk is equivalent to:
\[R_{01}(f)=p(f=1y=1,x)p(y=1x)+p(f=1y=1)p(y=1)p(f=1)\]Or, in canonical terms:
\[\begin{align} R_{01}(f) &=p(f=1y=1,x)p(y=1x)+(1p(f=1y=1))p(y=1)p(f=1) \\ &=2\cdot p(f=1y=1)p(y=1)+p(f=1)p(y=1) \end{align}\]This can be computed with positiveunlabeled data! The first term is the 01 risk on the positive samples, while $p(f=1)$ can be approximated on the unlabeled set if we assume that the unlabeled samples are an uniform representative sample from the data distribution (i.e., no bias is involved). Dealing with $p(y=1)$ is tougher: either you can guesstimate it using domain knowledge, or you can try to estimate it from the data itself.^{3}
The risk for a general loss function
Consider the data distribution as a mixture of two distributions for the positive and negative classes:
\[\begin{align} p(x,y) &=p(xy=1)p(y=1)+p(xy=1)p(y=1) \\ &=\pi p_+(x)+(1\pi)p_(x) \end{align}\]This allows us to write the risk of an estimator as
\[R(f) =\mathbb{E}_{x,y}[\ell(f(x),y)] =\pi\mathbb{E}_{xy=1}[\ell(f(x),1)]+(1\pi)\mathbb{E}_{xy=1}[\ell(f(x),1)]\]As before, we cannot estimate the second term due to the lack of negative data. However, the distribution of negatives is given by
\[(1\pi)p_(x)=p(x)\pi p_+(x)\]This allows us to expand the risk on the negative data as
\[\begin{align} (1\pi)\mathbb{E}_{xy=1}[\ell(f(x),1)] &=\int_\mathcal{X} \ell(f(x),1)(1\pi)p_(x)\text{d}x \\ &=\int_\mathcal{X} \ell(f(x),1)\left[p(x)\pi p_+(x)\right]\text{d}x \\ &=\int_\mathcal{X} \ell(f(x),1)p(x)\text{d}x \int_\mathcal{X} \ell(f(x),1)\pi p_+(x)\text{d}x \\ &=\mathbb{E}_x[\ell(f(x),1)]\pi\mathbb{E}_{xy=1}[\ell(f(x),1)] \end{align}\]Plugging back into the equation of the risk, we get a risk for positive unlabeled data:
\[R_{pu}(f) =\pi\mathbb{E}_{xy=1}[\ell(f(x),1)] +\mathbb{E}_x[\ell(f(x),1)]\pi\mathbb{E}_{xy=1}[\ell(f(x),1)]\]Unbiased risk estimator
To have an unbiased estimator, we must obtain the same risk that we would obtain with a surrogate loss approximating the 01 risk. In other words, we must have:
\[R_{pu}(f)= 2\pi \mathbb{E}_{xy=1}[\ell(f(x),1)]+\mathbb{E}_x[\ell(f(x),1)]\pi\]where the right side is a surrogate for the 01 loss (note that I am not entirely sure why this is be the case). By equating the two expressions we see that $\mathbb{E}_x[\ell(f(x),1)]$ goes away and we get
\[\pi\mathbb{E}_{xy=1}[\ell(f(x),1)] \pi\mathbb{E}_{xy=1}[\ell(f(x),1)] =2\pi \mathbb{E}_{xy=1}[\ell(f(x),1)]\pi\]Merging the two $\mathbb{E}_{x\vert y=1}[\ell(f(x),1)]$ gives:
\[\pi\mathbb{E}_{xy=1}[\ell(f(x),1)] =\pi \mathbb{E}_{xy=1}[\ell(f(x),1)]\pi\]Dividing by $\pi$ and moving the one inside the expectation on the right gives:
\[\mathbb{E}_{xy=1}[\ell(f(x),1)] =\mathbb{E}_{xy=1}[1\ell(f(x),1)]\]In other words, for $R_{pu}$ to be an unbiased estimator of the surrogate 01 risk we must have a symmetry condition on $\ell$:
\[\ell(f(x),1)+\ell(f(x),1)=1\]The 01 loss satisfies this, but it is not a good candidate for practical applications. We seek a loss function that is continuous, smooth and differentiable, so that we can use gradient descent to optimize the classifier. All these conditions are met by the sigmoid loss:
\[\ell_\sigma(f(x), y) = \frac{1}{1+\exp(y\cdot f(x))}\]It is trivial to verify that the symmetric condition is satisfied:
\[\begin{align} \ell_\sigma(f(x), 1)+\ell_\sigma(f(x), 1) &=\frac{2+\exp(f(x))+\exp(f(x))}{ (1+\exp(f(x)))(1+\exp( f(x)))} \\ &=\frac{2+\exp(f(x))+\exp(f(x))}{ 1+\exp(f(x))+\exp(f(x)))+\exp(0)} \\ &= 1 \end{align}\]Conclusion
We first derived an expression for the 01 risk that can be computed with a dataset that only has positive and unlabeled examples. We then generalized this expression to any surrogate loss, and derived an essential condition that such a loss must satisfy in order to give an unbiased estimate of the risk. We finally presented an example of such loss that makes learning from positive unlabeled data no harder than learning with negative examples.
Bibliography

du Plessis, M. C., Niu, G. & Sugiyama, M. Analysis of Learning from Positive and Unlabeled Data. Advances in Neural Information Processing Systems 27, (2014). ↩

Bartlett, P. L., Jordan, M. I. & Mcauliffe, J. D. Convexity, Classification, and Risk Bounds. Journal of the American Statistical Association 101, 138–156 (2006). ↩

du Plessis, M. C., Niu, G. & Sugiyama, M. Classprior estimation for learning from positive and unlabeled data. Mach Learn 106, 463–492 (2017). ↩