Distributed TensorFlow Guide


Project repository: https://github.com/tmulc18/Distributed-TensorFlow-Guide

This guide is a collection of distributed training examples (that can act as boilerplate code) and a tutorial of basic distributed TensorFlow. Many of the examples focus on implementing well-known distributed training schemes, such as those available in Distributed Keras, which were discussed in the author's blog post.

Almost all the examples can be run on a single machine with a CPU, and all the examples only use data-parallelism (i.e. between-graph replication).
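The data-parallel idea can be sketched in plain Python (a conceptual toy, not the TensorFlow API; the function names `shard`, `grad`, and `data_parallel_step` are illustrative, not from the repository): each worker computes a gradient on its own shard of the minibatch, and the averaged gradient updates one shared set of parameters.

```python
# Conceptual sketch of data-parallel training: split a minibatch across
# workers, compute per-worker gradients, average them, apply one update.

def shard(batch, num_workers):
    """Split a minibatch across workers (last shard takes any remainder)."""
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] if i < num_workers - 1
            else batch[i * size:]
            for i in range(num_workers)]

def grad(w, examples):
    # Gradient of mean squared error for the model y = w * x.
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)

def data_parallel_step(w, batch, num_workers, lr=0.01):
    shards = shard(batch, num_workers)
    grads = [grad(w, s) for s in shards]   # each worker's local gradient
    avg = sum(grads) / len(grads)          # synchronous gradient averaging
    return w - lr * avg

batch = [(x, 2.0 * x) for x in range(1, 9)]  # ground truth: w = 2
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, batch, num_workers=4)
```

In between-graph replication, each worker builds its own copy of this computation and only the shared parameters live on parameter-server tasks; the averaging step here stands in for that coordination.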

The motivation for this guide stems from the current state of distributed deep learning. Deep learning papers typically demonstrate successful new architectures on some benchmark, but rarely show how these models can be trained with 1000x the data, which is usually the requirement in industry. Furthermore, most successful distributed cases use state-of-the-art hardware to brute-force massive effective minibatches in a synchronous fashion across high-bandwidth networks; there has been little research showing the potential of asynchronous training (which is why there are a lot of those examples in this guide). Finally, the lack of documentation for distributed TF was the real reason this project was started. TF is a great tool that prides itself on its scalability, but unfortunately there are few examples that show how to make your model scale with dataset size.
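The synchronous/asynchronous distinction above can be illustrated with a toy sketch in plain Python (conceptual only, not the TensorFlow API; `grad`, `synchronous_steps`, and `asynchronous_steps` are made-up names): synchronous training averages all workers' gradients before one update, while asynchronous training lets each worker apply its gradient immediately, possibly computed from a stale copy of the parameters.

```python
# Toy contrast of synchronous vs. asynchronous data-parallel updates.

def grad(w):
    # Gradient of f(w) = (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)

def synchronous_steps(w, num_workers, steps, lr=0.1):
    # All workers read the same w; gradients are averaged and applied
    # as one update per step (one big effective minibatch).
    for _ in range(steps):
        avg = sum(grad(w) for _ in range(num_workers)) / num_workers
        w -= lr * avg
    return w

def asynchronous_steps(w, num_workers, steps, lr=0.1):
    # Each worker reads w and applies its gradient without waiting;
    # here every worker reads before any writes, so all use stale w.
    for _ in range(steps):
        stale = [w for _ in range(num_workers)]
        for s in stale:
            w -= lr * grad(s)   # updates land one by one
    return w
```

On this convex toy problem both variants converge; the point is only to show where the staleness enters the asynchronous update, not to claim anything about their relative quality on real models.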

The aim of this guide is to aid all interested in distributed deep learning, from beginners to researchers.
