ReLU is one of the most commonly used activations for artificial neural networks, and softplus can be viewed as its smooth version:

$$\mathrm{softplus}_\beta(x) = \frac{1}{\beta}\, \log\bigl(1 + e^{\beta x}\bigr),$$

where $\beta$ is a parameter that one can specify; it is typically set to one. The figure illustrates how softplus becomes closer to ReLU for different values of $\beta$: the larger $\beta$, the closer the two functions.
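As a quick illustration (a minimal NumPy sketch of my own, not code from the paper), one can measure the gap between the two functions for a few values of $\beta$:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x, beta=1.0):
    # (1/beta) * log(1 + exp(beta * x)), computed stably via logaddexp
    return np.logaddexp(0.0, beta * x) / beta

# The gap to ReLU is largest at x = 0, where it equals log(2) / beta,
# so it shrinks as beta grows.
x = np.linspace(-5.0, 5.0, 1001)
for beta in (1.0, 5.0, 100.0):
    gap = np.max(np.abs(softplus(x, beta) - relu(x)))
    print(f"beta = {beta:5g}: max |softplus - relu| = {gap:.4f}")
```

For $\beta = 1$ the maximum gap is $\log 2 \approx 0.693$, and it decays like $1/\beta$.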
Dombrowski et al. (2019)'s Theorem 2 says that for a one-layer neural network with ReLU activations, $g(x) = \mathrm{relu}(w^\top x)$, and its softplus version, $g_\beta(x) = \mathrm{softplus}_\beta(w^\top x)$, the following equality holds:

$$\nabla_x g_\beta(x) = \mathbb{E}_{\epsilon}\bigl[\nabla_x g(x - \epsilon)\bigr].$$

The gradient w.r.t. the input of the softplus network is the expectation of the gradient of the ReLU network when the input is perturbed by the noise $\epsilon$, whose distribution depends on $\beta$ and is constructed in the proof.
In the following, I state the proof that is provided in the supplement of the paper.
Let us assume for a moment that the input $x$ is a scalar. We first start by showing that softplus can be written as an expectation of ReLU over a suitable noise density $p_\beta$. To do this, we write

$$\mathrm{softplus}_\beta(x) = \int p_\beta(\epsilon)\, \mathrm{relu}(x - \epsilon)\, d\epsilon, \tag{1}$$

where $p_\beta$ is implicitly defined. Differentiating the equation on both sides, we get

$$\sigma(\beta x) = \int p_\beta(\epsilon)\, \theta(x - \epsilon)\, d\epsilon = \int_{-\infty}^{x} p_\beta(\epsilon)\, d\epsilon, \tag{2}$$

where $\theta$ denotes the Heaviside step function. Applying another differentiation yields

$$p_\beta(x) = \frac{d}{dx}\, \sigma(\beta x) = \beta\, \sigma(\beta x)\, \bigl(1 - \sigma(\beta x)\bigr) = \frac{\beta\, e^{\beta x}}{\bigl(1 + e^{\beta x}\bigr)^{2}}.$$

Substituting in (1), we get

$$\mathrm{softplus}_\beta(x) = \int \frac{\beta\, e^{\beta \epsilon}}{\bigl(1 + e^{\beta \epsilon}\bigr)^{2}}\, \mathrm{relu}(x - \epsilon)\, d\epsilon.$$
Consider now the one-layer network $g(x) = \mathrm{relu}(w^\top x)$ with input $x \in \mathbb{R}^d$ and its softplus version $g_\beta(x) = \mathrm{softplus}_\beta(w^\top x)$. We assume that $p(\epsilon) = \prod_{i=1}^{d} p_i(\epsilon_i)$; that is, the noise vector $\epsilon \in \mathbb{R}^d$ consists of independent noise components. Furthermore, we choose the component densities $p_i$ s.t. $\mathbb{E}[\epsilon] = 0$ and the projected noise $\tilde{\epsilon} := w^\top \epsilon$ is distributed according to $p_\beta$. Then, from (2), it follows that

$$\mathbb{E}_{\epsilon \sim p}\Bigl[\theta\bigl(w^\top (x - \epsilon)\bigr)\Bigr] = \int_{-\infty}^{w^\top x} p_\beta(\tilde{\epsilon})\, d\tilde{\epsilon} = \sigma\bigl(\beta\, w^\top x\bigr).$$

Consider the expectation $\mathbb{E}_{\epsilon \sim p}\bigl[g(x - \epsilon)\bigr]$. The expectation can be rewritten as

$$\mathbb{E}_{\epsilon \sim p}\bigl[g(x - \epsilon)\bigr] = \mathbb{E}_{\tilde{\epsilon} \sim p_\beta}\Bigl[\mathrm{relu}\bigl(w^\top x - \tilde{\epsilon}\bigr)\Bigr] = \mathrm{softplus}_\beta\bigl(w^\top x\bigr) = g_\beta(x),$$

where the second equality is exactly (1) evaluated at $w^\top x$. Differentiating both sides w.r.t. $x$ yields the desired result,

$$\nabla_x g_\beta(x) = \mathbb{E}_{\epsilon \sim p}\bigl[\nabla_x g(x - \epsilon)\bigr].$$ ◼️
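As a numerical sanity check of the scalar identity (2) (a sketch of my own, not from the paper's supplement): $p_\beta$ is the logistic density with scale $1/\beta$, whose CDF is $\sigma(\beta x)$, so averaging the ReLU derivative $\theta(x - \epsilon)$ over logistic noise should reproduce the softplus derivative $\sigma(\beta x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, x = 2.0, 0.7

# p_beta is the logistic density with scale 1/beta: its CDF is sigma(beta * x).
eps = rng.logistic(loc=0.0, scale=1.0 / beta, size=1_000_000)

# Average ReLU derivative theta(x - eps) under the noise ...
mc_grad = np.mean(np.heaviside(x - eps, 0.5))
# ... versus the softplus derivative sigma(beta * x).
sp_grad = 1.0 / (1.0 + np.exp(-beta * x))

print(f"Monte Carlo: {mc_grad:.4f}  sigmoid: {sp_grad:.4f}")
```

With a million samples the two values agree to roughly three decimal places.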
(Visualization added on 02/06/2020)
Why does this result matter?
In the paper, Dombrowski et al. (2019) show that attribution (explanation) maps can be arbitrarily manipulated. They argue that this is because the output manifold of the ReLU neural network has large curvature, which causes gradients w.r.t. the input to be highly unstable when the input is slightly perturbed.
They show that one can prevent such manipulations by replacing ReLU with the softplus activation. Based on Theorem 2, they argue and empirically show that doing so has the same effect on attribution maps as SmoothGrad, in which the attribution map is an average of several maps computed from copies of the input perturbed by noise.
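For intuition, the SmoothGrad averaging can be sketched on a toy one-layer ReLU network (a minimal illustration of my own: Gaussian noise as in the original SmoothGrad, and the weights, noise scale, and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy one-layer ReLU network g(x) = relu(w . x); its input gradient
# (attribution map) is theta(w . x) * w.
w = np.array([1.0, -2.0, 0.5])

def relu_grad(x):
    return (w @ x > 0) * w

x = np.array([0.3, 0.2, -0.1])

# SmoothGrad: average the gradient maps of many noisy copies of the input.
sigma, n = 0.5, 10_000
noisy_grads = [relu_grad(x + sigma * rng.standard_normal(3)) for _ in range(n)]
smoothgrad_map = np.mean(noisy_grads, axis=0)
print(smoothgrad_map)
```

For this one-layer network, the averaged map is a softened version of the raw gradient: it points along $w$, scaled by the fraction of noisy inputs whose pre-activation is positive.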
Dombrowski et al. (2019) "Explanations can be manipulated and geometry is to blame"
From Softplus to Sigmoid
Consider $\mathrm{softplus}_\beta(x) = \frac{1}{\beta} \log\bigl(1 + e^{\beta x}\bigr)$. The derivative is

$$\frac{d}{dx}\, \mathrm{softplus}_\beta(x) = \frac{e^{\beta x}}{1 + e^{\beta x}} = \frac{1}{1 + e^{-\beta x}} = \sigma(\beta x),$$

i.e., the derivative of softplus is the sigmoid evaluated at $\beta x$.
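The identity is easy to verify with a central difference (a small NumPy sketch, assuming the softplus definition above):

```python
import numpy as np

def softplus(x, beta=1.0):
    # (1/beta) * log(1 + exp(beta * x)), computed stably via logaddexp
    return np.logaddexp(0.0, beta * x) / beta

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

beta = 3.0
x = np.linspace(-4.0, 4.0, 9)
h = 1e-6

# Central-difference approximation of d/dx softplus_beta(x) ...
numeric = (softplus(x + h, beta) - softplus(x - h, beta)) / (2.0 * h)
# ... should match sigma(beta * x) up to discretization error.
print(np.max(np.abs(numeric - sigmoid(beta * x))))
```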