• MacKay, Matthew
• Vicol, Paul
• Ba, Jimmy
• Grosse, Roger


Recurrent neural networks (RNNs) provide state-of-the-art performance in processing sequential data but are memory-intensive to train, which limits the flexibility of the RNN models that can be trained. Reversible RNNs (RNNs for which the hidden-to-hidden transition can be reversed) offer a path to reducing the memory requirements of training, as hidden states need not be stored and can instead be recomputed during backpropagation. We first show that perfectly reversible RNNs, which require no storage of the hidden activations, are fundamentally limited because they cannot forget information from their hidden state. We then provide a scheme for storing a small number of bits in order to allow perfect reversal with forgetting. Our method achieves comparable performance to traditional models while reducing the activation memory cost by a factor of 10–15. We extend our technique to attention-based sequence-to-sequence models, where it maintains performance while reducing activation memory cost by a factor of 5–10 in the encoder, and a factor of 10–15 in the decoder.
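To make the idea of reversal concrete, the following is a minimal sketch and not the authors' architecture. It assumes an additive-coupling update on a hidden state split into two halves, held as fixed-point integers so that reversal is exact. The names `update`, `forward_step`, `reverse_step`, `forget`, and `unforget`, the scale `SCALE`, and the toy `tanh` update are all hypothetical choices made for illustration. The `forget` helper shows, in a simplified form, why scaling the state by a fraction loses information unless a few extra bits (here, a division remainder) are kept.

```python
import math

SCALE = 1 << 16  # assumed fixed-point scale: stored integer = value * 2**16


def update(h_other, x):
    """Hypothetical update term computed from the other half of the state."""
    return round(math.tanh(0.5 * h_other / SCALE + x) * SCALE)


def forward_step(h1, h2, x):
    # Additive coupling: each half is updated using only the other half.
    h1 = h1 + update(h2, x)
    h2 = h2 + update(h1, x)
    return h1, h2


def reverse_step(h1, h2, x):
    # Undo the two additions in the opposite order, recomputing each update.
    h2 = h2 - update(h1, x)
    h1 = h1 - update(h2, x)
    return h1, h2


def forget(h, n, d):
    """Scale h by n/d; return the scaled value and the remainder lost to rounding."""
    return divmod(h * n, d)


def unforget(q, r, n, d):
    """Recover h exactly from the scaled value and the stored remainder."""
    return (q * d + r) // n


# Run forward over a short input sequence, then reconstruct the initial state
# from the final state alone, without having stored the intermediate states.
xs = [0.3, -1.2, 0.7, 0.05]
start = (1000, -2000)
h1, h2 = start
for x in xs:
    h1, h2 = forward_step(h1, h2, x)
for x in reversed(xs):
    h1, h2 = reverse_step(h1, h2, x)
assert (h1, h2) == start
```

In this sketch the purely additive steps need no extra storage, but they also never shrink the state, which mirrors the limitation on forgetting described above. The remainder returned by `forget` needs only about log2(d) bits per value, which is one simple way to picture storing a small number of bits to permit exact reversal.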


