<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.7.4">Jekyll</generator><link href="http://kkoehncke.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="http://kkoehncke.github.io/" rel="alternate" type="text/html" /><updated>2018-12-11T14:21:49+00:00</updated><id>http://kkoehncke.github.io/feed.xml</id><title type="html">Kevin Koehncke</title><subtitle>Master's Student at Georgia Tech</subtitle><entry><title type="html">NeurIPS 2018 Batch Normalization</title><link href="http://kkoehncke.github.io/NeurIPS-2018-Debrief-Batch-Normalization-Uncovered/" rel="alternate" type="text/html" title="NeurIPS 2018 Batch Normalization" /><published>2018-12-10T00:00:00+00:00</published><updated>2018-12-10T00:00:00+00:00</updated><id>http://kkoehncke.github.io/NeurIPS-2018-Debrief-Batch-Normalization-Uncovered</id><content type="html" xml:base="http://kkoehncke.github.io/NeurIPS-2018-Debrief-Batch-Normalization-Uncovered/">&lt;p&gt;I was one of the lucky few that managed to get a NeurIPS ticket last minute off the waitlist and was excited to hear about the latest findings in ML research. Amidst the frigid Montreal weather, I saw some groundbreaking research regarding batch normalization that made a lot of researchers (and myself) re-think the reason for using batch normalization within their network architectures.&lt;/p&gt;

&lt;h2 id=&quot;what-is-batch-normalization&quot;&gt;What is Batch Normalization?&lt;/h2&gt;

&lt;p&gt;Batch normalization (BN) is a technique used with mini-batch training to normalize the activations of a neural network layer: it takes the output of the previous activation layer, subtracts the mini-batch mean, and divides by the mini-batch standard deviation, so that each activation has zero mean and unit variance over the batch &lt;em&gt;[&lt;a href=&quot;https://arxiv.org/pdf/1502.03167v3.pdf&quot; title=&quot;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift &quot;&gt;1&lt;/a&gt;]&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;https://cdn-images-1.medium.com/max/1600/1*Hiq-rLFGDpESpr8QNsJ1jg.png&quot; alt=&quot;img&quot; /&gt;&lt;/p&gt;

&lt;p&gt;where two new trainable parameters &lt;script type=&quot;math/tex&quot;&gt;\gamma&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; are introduced that scale and shift the normalized output via a linear transformation. The whole BN transform is continuous &amp;amp; differentiable, so for an arbitrary loss &lt;script type=&quot;math/tex&quot;&gt;\mathcal{L}&lt;/script&gt; the gradients can be backpropagated through the normalization to the inputs, the batch statistics, and the two new parameters, thus allowing &lt;script type=&quot;math/tex&quot;&gt;\gamma&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; to be learned via an optimization method such as SGD.&lt;/p&gt;
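
&lt;p&gt;The transform described above can be sketched in a few lines of NumPy. This is a minimal illustration of the training-time forward pass only, not the implementation of any particular framework; the function name &lt;code&gt;batch_norm_forward&lt;/code&gt;, the &lt;code&gt;eps&lt;/code&gt; default, and the toy input are assumptions made for the example.&lt;/p&gt;

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x has shape (batch_size, num_features); statistics are taken per feature.
    mu = x.mean(axis=0)                     # mini-batch mean
    var = x.var(axis=0)                     # mini-batch (biased) variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learned scale and shift

# Toy activations that are neither centered nor unit scale (assumed data).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0))  # per-feature batch mean after BN
print(y.std(axis=0))   # per-feature batch standard deviation after BN
```

&lt;p&gt;With &lt;script type=&quot;math/tex&quot;&gt;\gamma = 1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;\beta = 0&lt;/script&gt; the output is simply the normalized activation; during training both parameters would be updated by the optimizer along with the network weights.&lt;/p&gt;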

&lt;p&gt;The purpose of BN, as proposed in the original paper by Sergey Ioffe &amp;amp; Christian Szegedy &lt;em&gt;[&lt;a href=&quot;https://arxiv.org/pdf/1502.03167v3.pdf&quot; title=&quot;Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift &quot;&gt;1&lt;/a&gt;]&lt;/em&gt;, is the following:&lt;/p&gt;

&lt;p&gt;When the outputs of one activation layer are fed to a subsequent layer, their distribution changes during training as the parameters of the preceding layers are updated. Gradient descent is not scale invariant, so when the inputs to each layer do not share a uniform scale, it has a hard time finding a minimum of our optimization problem. Each layer must also keep adapting to the shifting distribution of its inputs, which calls for lower learning rates &amp;amp; careful weight initialization in order to create a well-conditioned environment for our model to be trained; we denote this change in the distribution of a layer’s inputs as &lt;em&gt;internal covariate shift&lt;/em&gt; (ICS). The claim is that BN reduces ICS by giving the inputs of every layer a uniform scale.&lt;/p&gt;

&lt;p&gt;Ioffe &amp;amp; Szegedy also state that higher learning rates can be used in conjunction with BN: because distributions are normalized across the network, vanishing and exploding gradients become less likely, as does getting stuck in poor local minima during training. Backpropagation gains resilience as well, since the layer Jacobian and the propagated gradients are closer to being invariant to the scale of the weights than before.&lt;/p&gt;

&lt;h2 id=&quot;neurips-findings&quot;&gt;NeurIPS Findings&lt;/h2&gt;

&lt;p&gt;Even with Ioffe &amp;amp; Szegedy’s explanation, much remains unknown about what governs the behavior of BN during training. Johan Bjorck, Carla Gomes, Bart Selman, and Kilian Q. Weinberger sought to explain BN’s effect on training experimentally &lt;em&gt;[&lt;a href=&quot;http://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf&quot; title=&quot;Understanding Batch Normalization&quot;&gt;3&lt;/a&gt;]&lt;/em&gt;. In their first experiment, they trained a 110-layer ResNet on CIFAR-10 with three different learning rates, &lt;script type=&quot;math/tex&quot;&gt;0.0001, 0.003, 0.1&lt;/script&gt;, with and without BN:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/NeurIPS_2018/image-1.png&quot; alt=&quot;image-1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;They observed that with the smallest learning rate, BN provided a small boost in training speed but both models converged to the same test accuracy, whilst the higher learning rates benefited greatly from BN, which allowed faster training without compromising test accuracy and added regularization. Bjorck et al. attribute this to larger learning rates generating more SGD “noise”, which in turn creates a regularization effect and prevents getting stuck in sharp minima, a view supported by the findings of Keskar et al. (2017) &lt;em&gt;[&lt;a href=&quot;https://arxiv.org/pdf/1609.04836.pdf&quot; title=&quot;On Large Batch Training For Deep Learning: Generalization Gap and Sharp Minima&quot;&gt;4&lt;/a&gt;]&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But why does using BN allow for higher learning rates? Bjorck et al. observe the relative loss during the first few mini-batches as a function of the step size:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/NeurIPS_2018/image-2.png&quot; alt=&quot;image-2&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We observe that, as the step size grows, networks utilizing BN do not diverge as rapidly as networks without BN. Is this because BN reduces ICS, or is some other phenomenon responsible?&lt;/p&gt;

&lt;p&gt;Santurkar et al. argue that there is a greater effect at play when using BN: it smooths our optimization landscape, creating a better-conditioned optimization problem that aids SGD in finding a solution. By making the network approximately scale invariant from activation layer to activation layer, BN smooths out the spikes and bumps in our non-convex loss function, thus allowing for a larger learning rate and more predictive gradients to be computed &lt;em&gt;[&lt;a href=&quot;http://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf&quot; title=&quot;Understanding Batch Normalization&quot;&gt;3&lt;/a&gt;]&lt;/em&gt;. In order to test whether a reduction in ICS is really responsible, Santurkar et al. first propose the following definition:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/NeurIPS_2018/image-3.png&quot; alt=&quot;image-3&quot; /&gt;&lt;/p&gt;

&lt;p&gt;To my knowledge, this is the first proposed mathematical definition of ICS, namely the &lt;script type=&quot;math/tex&quot;&gt;l_2&lt;/script&gt; distance between two gradients of &lt;script type=&quot;math/tex&quot;&gt;\mathcal{L}&lt;/script&gt; with respect to the parameters of a single layer, taken from the network weights &lt;script type=&quot;math/tex&quot;&gt;W_{k}^t&lt;/script&gt;, where &lt;script type=&quot;math/tex&quot;&gt;G_{t,i}&lt;/script&gt; corresponds to the gradient before the weights of the preceding layers are updated and &lt;script type=&quot;math/tex&quot;&gt;G_{t, i}^{'}&lt;/script&gt; corresponds to the gradient after the weights of the preceding layers are updated. In their paper, they go on to prove theoretically that BN provides a more well-behaved optimization problem by inducing favorable properties such as Lipschitz continuity and more predictive gradients &lt;em&gt;[&lt;a href=&quot;http://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf&quot; title=&quot;Understanding Batch Normalization&quot;&gt;3&lt;/a&gt;]&lt;/em&gt;.&lt;/p&gt;
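
&lt;p&gt;To make the definition concrete, here is a rough sketch of the same quantity on a toy two-layer linear network with a squared-error loss. Everything in it is an assumption made for illustration: the network, the data, the learning rate, and the helper names &lt;code&gt;grads&lt;/code&gt; and &lt;code&gt;ics_second_layer&lt;/code&gt; are hypothetical and are not taken from the paper’s experiments.&lt;/p&gt;

```python
import numpy as np

def grads(W1, W2, x, y):
    # Loss: 1/(2n) * sum of squared errors of the linear network x @ W1 @ W2.
    n = x.shape[0]
    h = x @ W1                    # first-layer output
    r = h @ W2 - y                # residual
    g1 = x.T @ (r @ W2.T) / n     # gradient with respect to W1
    g2 = h.T @ r / n              # gradient with respect to W2
    return g1, g2

def ics_second_layer(W1, W2, x, y, lr):
    # G: gradient of layer 2 before the preceding layer is updated.
    g1, G = grads(W1, W2, x, y)
    # G': gradient of layer 2 after the preceding layer took its SGD step,
    # with the weights of layer 2 itself held fixed.
    W1_new = W1 - lr * g1
    _, G_prime = grads(W1_new, W2, x, y)
    return np.linalg.norm(G - G_prime)   # l2 distance between G and G'

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))
y = rng.normal(size=(32, 2))
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(3, 2))

print(ics_second_layer(W1, W2, x, y, lr=0.1))
```

&lt;p&gt;If the preceding layer does not move (a learning rate of zero), the two gradients coincide and the measured ICS is exactly zero; larger updates to earlier layers generally change the gradient seen by later layers more.&lt;/p&gt;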

&lt;p&gt;Recall that for an arbitrary function &lt;script type=&quot;math/tex&quot;&gt;f&lt;/script&gt;, we say &lt;script type=&quot;math/tex&quot;&gt;f&lt;/script&gt; is L-Lipschitz if &lt;script type=&quot;math/tex&quot;&gt;\vert f(x_1) - f(x_2) \vert \leq L \vert \vert x_1 - x_2 \vert \vert&lt;/script&gt; for all &lt;script type=&quot;math/tex&quot;&gt;x_1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;x_2&lt;/script&gt; and for some constant &lt;script type=&quot;math/tex&quot;&gt;L&lt;/script&gt;. Intuitively, Lipschitz continuity ensures that your function cannot change arbitrarily fast: its rate of change is bounded by &lt;script type=&quot;math/tex&quot;&gt;L&lt;/script&gt; everywhere. We can extend this notion of bounded change to the gradients of &lt;script type=&quot;math/tex&quot;&gt;f&lt;/script&gt; via &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;-smoothness, where we say &lt;script type=&quot;math/tex&quot;&gt;f&lt;/script&gt; is &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;-smooth if its gradients are &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;-Lipschitz, i.e. if &lt;script type=&quot;math/tex&quot;&gt;\|\nabla f(x_1)-\nabla f(x_2) \| \leq \beta \|x_1 - x_2 \|&lt;/script&gt; for all &lt;script type=&quot;math/tex&quot;&gt;x_1&lt;/script&gt; and &lt;script type=&quot;math/tex&quot;&gt;x_2&lt;/script&gt; and for some constant &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;.&lt;/p&gt;
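
&lt;p&gt;As a small numerical illustration of &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;-smoothness, consider the quadratic &lt;script type=&quot;math/tex&quot;&gt;f(x) = \frac{1}{2} x^T A x&lt;/script&gt;, whose gradient is &lt;script type=&quot;math/tex&quot;&gt;Ax&lt;/script&gt; and whose smoothness constant is the largest eigenvalue of &lt;script type=&quot;math/tex&quot;&gt;A&lt;/script&gt;. The sketch below samples random pairs of points and checks the ratio in the definition against that bound; the matrix &lt;code&gt;A&lt;/code&gt; and the sampling scheme are arbitrary choices for the example.&lt;/p&gt;

```python
import numpy as np

# f(x) = 0.5 * x^T A x with a symmetric positive definite A (assumed example).
A = np.diag([1.0, 4.0, 9.0])

def grad_f(x):
    # Gradient A x, applied row-wise to a batch of points of shape (n, 3).
    return x @ A.T

rng = np.random.default_rng(0)
x1 = rng.normal(size=(1000, 3))
x2 = rng.normal(size=(1000, 3))

# Ratio from the definition: gradient change over distance between points.
grad_change = np.linalg.norm(grad_f(x1) - grad_f(x2), axis=1)
distance = np.linalg.norm(x1 - x2, axis=1)
ratios = grad_change / distance

beta = np.linalg.eigvalsh(A).max()   # smoothness constant of this quadratic
print(ratios.max(), beta)            # largest sampled ratio next to the bound
```

&lt;p&gt;No sampled ratio can exceed &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt; here. For a neural network loss no such closed form is available, which is why an empirical, “effective” version of this ratio is measured along the gradient direction instead.&lt;/p&gt;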

&lt;p&gt;Experimentally, Santurkar et al. trained a VGG network on CIFAR-10 with &amp;amp; without BN, calculated the &lt;script type=&quot;math/tex&quot;&gt;l_2&lt;/script&gt; distance between the gradients of the loss with respect to the weights, &lt;script type=&quot;math/tex&quot;&gt;\vert\vert G_{t,i} - G_{t,i}^{'}\vert\vert_2&lt;/script&gt;, and found the following during training:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;../images/NeurIPS_2018/image-4.png&quot; alt=&quot;image-4&quot; /&gt;&lt;/p&gt;

&lt;p&gt;where (a) corresponds to the variation in the loss function’s value, (b) is the &lt;script type=&quot;math/tex&quot;&gt;l_2&lt;/script&gt; distance of &lt;script type=&quot;math/tex&quot;&gt;G&lt;/script&gt;, and (c) is the maximum &lt;script type=&quot;math/tex&quot;&gt;l_2&lt;/script&gt; difference in gradient over the distance moved in that direction, which the authors define as “effective” &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;-smoothness &lt;em&gt;[&lt;a href=&quot;http://papers.nips.cc/paper/7996-understanding-batch-normalization.pdf&quot; title=&quot;Understanding Batch Normalization&quot;&gt;3&lt;/a&gt;]&lt;/em&gt;. We immediately see that the addition of BN generates a smoother loss landscape, drastically reducing the fluctuations in gradient predictiveness via the created &lt;script type=&quot;math/tex&quot;&gt;\beta&lt;/script&gt;-smoothing effect on &lt;script type=&quot;math/tex&quot;&gt;\mathcal{L}&lt;/script&gt;.&lt;/p&gt;

&lt;p&gt;Furthermore, Santurkar et al. devised a clever experiment to examine whether ICS had anything to do with increased training performance. They trained three VGG networks on CIFAR-10: one without BN, one with BN, and one with BN where the activation, after passing the BN layer, was perturbed via i.i.d. noise sampled from a time-step dependent, non-zero mean and non-unit variance distribution &lt;script type=&quot;math/tex&quot;&gt;D_j^{t}&lt;/script&gt; for each activation &lt;script type=&quot;math/tex&quot;&gt;j&lt;/script&gt; for each sample in each batch. This perturbation produces a severe covariate shift that is non-uniform across all activations, which should degrade training performance if reducing ICS were the reason BN works. However, they observe that even though the noisy perturbation produces less stable distributions, training performance is not impacted:&lt;/p&gt;
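
&lt;p&gt;A rough sketch of such a “noisy” BN layer is given below. It only illustrates the idea of re-drawing a non-zero mean, non-unit variance noise distribution per activation at every time step; the function name &lt;code&gt;noisy_batch_norm&lt;/code&gt; and the ranges used for the noise mean and standard deviation are assumptions for the example, not the settings used in the paper.&lt;/p&gt;

```python
import numpy as np

def noisy_batch_norm(x, rng, eps=1e-5):
    # Standard BN normalization over the batch axis (gamma=1, beta=0).
    x_hat = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    # Hypothetical noise distribution for this time step: one mean and one
    # standard deviation per activation j, re-drawn on every call.
    noise_mean = rng.uniform(-1.0, 1.0, size=x.shape[1])
    noise_std = rng.uniform(0.5, 2.0, size=x.shape[1])
    # i.i.d. noise for each sample in the batch, added after normalization.
    noise = rng.normal(noise_mean, noise_std, size=x.shape)
    return x_hat + noise

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4))

step_1 = noisy_batch_norm(x, rng)
step_2 = noisy_batch_norm(x, rng)
print(step_1.mean(axis=0))  # batch means are no longer zero
print(step_2.mean(axis=0))  # and they shift from one step to the next
```

&lt;p&gt;Because the noise is injected after the normalization, the next layer sees inputs whose mean and variance jump around from step to step, which is exactly the kind of instability BN was originally thought to remove.&lt;/p&gt;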

&lt;p&gt;&lt;img src=&quot;../images/NeurIPS_2018/image-5.png&quot; alt=&quot;image-5&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;We see that batch normalization’s connection to training performance and internal covariate shift is weak at best. Rather, we see that batch normalization provides another method for smoothing our optimization landscape to be more stable, thus allowing for higher learning rates to be used which in turn improves training performance. This explains the known benefits of batch normalization such as prevention of exploding / vanishing gradients and robustness to hyperparameter selection.&lt;/p&gt;</content><author><name>Kevin Koehncke</name></author><summary type="html">I was one of the lucky few that managed to get a NeurIPS ticket last minute off the waitlist and was excited to hear about the latest findings in ML research. Amidst the frigid Montreal weather, I saw some groundbreaking research regarding batch normalization that made a lot of researchers (and myself) re-think the reason for using batch normalization within their network architectures.</summary></entry></feed>