Priyanka Sharma

Cloud Tensor Processing Units (TPUs)

Tensor Processing Units are Google's custom-developed application-specific integrated circuits (ASICs), used to accelerate machine learning workloads. They are designed with the benefit of Google's deep experience and leadership in machine learning.

Cloud TPU allows you to run your machine learning workloads on Google's TPU accelerator hardware using TensorFlow. Cloud TPU is designed for maximum performance and flexibility, helping researchers, developers, and businesses build TensorFlow compute clusters that can leverage CPUs, GPUs, and TPUs. High-level TensorFlow APIs help you get models running on the Cloud TPU hardware.
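
As a minimal illustration of those high-level APIs, the sketch below connects to a Cloud TPU and places a small Keras model on it. It assumes a recent TensorFlow 2.x release with tf.distribute.TPUStrategy and a hypothetical TPU named 'my-tpu'; the TPUEstimator-based workflow this article describes is sketched further down.

    import tensorflow as tf

    # 'my-tpu' is a hypothetical TPU name; on a Cloud TPU VM the resolver
    # can usually locate the TPU without an explicit name.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)

    # Variables and computation created under this scope are replicated
    # across the TPU cores.
    strategy = tf.distribute.TPUStrategy(resolver)
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer='adam',
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'],
        )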

Cloud TPU Programming Model

Cloud TPUs are very fast at performing dense vector and matrix computations. Transferring data between a Cloud TPU and host memory, however, is slow compared to the speed of computation: the PCIe bus is much slower than both the Cloud TPU interconnect and the on-chip high-bandwidth memory (HBM). A partial compilation of a model, where execution 'ping-pongs' between host and device, therefore uses the device very inefficiently, since it sits idle most of the time waiting for data to arrive over the PCIe bus. To avoid this situation, the programming model for Cloud TPU is designed to execute as much of the training as possible on the TPU - ideally the entire training loop.

Salient features of the programming model implemented by TPUEstimator (a sketch using this API follows the list):

  • The TensorFlow server running on the host machine fetches data and pre-processes it before "infeeding" it to the Cloud TPU hardware.
  • Data Parallelism: Cores on a Cloud TPU execute an identical program residing in their own respective HBM in a synchronous manner. A reduction operation is performed at the end of each neural network step across all the cores.
  • All model parameters are kept in on-chip high bandwidth memory.
  • Input training data is streamed to an "infeed" queue on the Cloud TPU.
  • A program running on Cloud TPU retrieves batches from these queues during each training step.
  • The cost of launching computations on Cloud TPU is amortized by executing many training steps in a loop.
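
A minimal sketch of this programming model with the TF 1.x TPUEstimator API is shown below. The TPU name, Cloud Storage path, toy model, and random input data are illustrative assumptions, not values from this article; note how iterations_per_loop keeps many training steps on the device per launch.

    import tensorflow as tf  # TensorFlow 1.x, where tf.contrib.tpu is available


    def model_fn(features, labels, mode, params):
        # A toy dense model; the loss and gradient computation below is what
        # gets compiled for and executed on the TPU cores.
        logits = tf.layers.dense(features, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        optimizer = tf.train.AdamOptimizer()
        # CrossShardOptimizer performs the cross-core gradient reduction at the
        # end of each step (the data-parallel reduction described above).
        optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)


    def train_input_fn(params):
        # TPUEstimator passes the per-core batch size in params['batch_size'].
        batch_size = params['batch_size']
        features = tf.random.uniform([1024, 784])
        labels = tf.random.uniform([1024], maxval=10, dtype=tf.int32)
        dataset = tf.data.Dataset.from_tensor_slices((features, labels))
        # Fixed shapes are required on TPU, so drop the final partial batch.
        return dataset.repeat().batch(batch_size, drop_remainder=True)


    resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')  # hypothetical TPU name
    run_config = tf.contrib.tpu.RunConfig(
        cluster=resolver,
        model_dir='gs://my-bucket/model',  # hypothetical checkpoint location
        # iterations_per_loop controls how many training steps run on the TPU
        # per launch, amortizing the cost of starting the computation.
        tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100),
    )

    estimator = tf.contrib.tpu.TPUEstimator(
        model_fn=model_fn,
        config=run_config,
        use_tpu=True,
        train_batch_size=1024,  # global batch size, split across the TPU cores
    )
    estimator.train(input_fn=train_input_fn, max_steps=1000)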

When to use TPUs

Cloud TPUs are optimized for specific workloads. In some situations, you might want to use GPUs or CPUs on Compute Engine instances to run your machine learning workloads instead. In general, you can decide what hardware is best for your workload based on the following guidelines:

GPUs

  • Models with a significant number of custom TensorFlow operations that must run at least partially on CPUs.
  • Models with TensorFlow ops that are not available on Cloud TPU.
  • Medium-to-large models with larger effective batch sizes.
  • Models that are not written in TensorFlow or cannot be written in TensorFlow.
  • Models for which source does not exist or is too onerous to change.

CPUs

  • Small models with small effective batch sizes.
  • Models that are dominated by custom TensorFlow operations written in C++.
  • Models that are limited by available I/O or the networking bandwidth of the host system.
  • Quick prototyping that requires maximum flexibility.
  • Simple models that do not take long to train.

TPUs

  • Models that train for weeks or months.
  • Larger and very large models with very large effective batch sizes.
  • Models dominated by matrix computations.
  • Models with no custom TensorFlow operations inside the main training loop.

Cloud TPUs are not suited to the following workloads:

  • Workloads that require high-precision arithmetic. For example, double-precision arithmetic is not suitable for TPUs.
  • Workloads that access memory in a sparse manner might not be available on TPUs.
  • Linear algebra programs that require frequent branching or are dominated by element-wise algebra. TPUs are optimized to perform fast, bulky matrix multiplication, so a workload that is not dominated by matrix multiplication is unlikely to perform well on TPUs compared to other platforms.
  • Neural network workloads that contain custom TensorFlow operations written in C++. Specifically, custom operations in the body of the main training loop are not suitable for TPUs.

Neural network workloads must be able to run multiple iterations of the entire training loop on the TPU. Although this is not a fundamental requirement of TPUs themselves, it is one of the current constraints of the TPU software ecosystem and is required for efficiency.

Advantages of Cloud TPUs

  • Cloud TPUs minimize the time-to-accuracy when you train large, complex neural network models.
  • Models that previously took weeks to train on other hardware platforms can converge within hours on TPUs.
  • Cloud TPUs accelerate the performance of linear algebra computation, which is used heavily in machine learning applications.

Differences from conventional training

A typical TensorFlow training graph consists of multiple overlapping subgraphs which provide a variety of functionality including:

  • Loss functions.
  • Gradient code (usually automatically generated).
  • Summary ops for monitoring training.
  • Save/Restore ops for checkpointing.
  • I/O operations to read training data.
  • The model itself.
  • Input preprocessing stages, often connected via queues.
  • Model variables.
  • Initialization code for those variables.

On Cloud TPU, TensorFlow programs are compiled by the XLA just-in-time compiler. When training on Cloud TPU, the only code that can be compiled and executed on the hardware is the code corresponding to the dense parts of the model and the loss and gradient subgraphs. All other parts of the TensorFlow program run on the host machines (the Cloud TPU server) as part of a regular distributed TensorFlow session. This typically includes the I/O operations that read training data, any pre-processing code (for example, decoding compressed images, randomly sampling/cropping, and assembling training minibatches), and the housekeeping parts of the graph such as checkpoint save/restore.
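
To illustrate this split, the sketch below shows a host-side input function of the kind TPUEstimator would call: file reading, JPEG decoding, random cropping, and batching all run on the Cloud TPU server's CPUs, and only the resulting dense batches are infed to the TPU. The TFRecord path and feature names are assumptions made for the example; it uses the tf.io names available in TF 1.13+ and 2.x.

    import tensorflow as tf


    def train_input_fn(params):
        # Everything in this function runs on the Cloud TPU server's CPUs as
        # part of the regular TensorFlow session, not on the TPU itself.
        batch_size = params['batch_size']

        # Hypothetical TFRecord files on Cloud Storage.
        files = tf.data.Dataset.list_files('gs://my-bucket/train-*.tfrecord')
        dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=8)

        def parse_and_preprocess(serialized):
            parsed = tf.io.parse_single_example(
                serialized,
                {'image': tf.io.FixedLenFeature([], tf.string),
                 'label': tf.io.FixedLenFeature([], tf.int64)})
            image = tf.io.decode_jpeg(parsed['image'], channels=3)   # decode compressed image
            image = tf.image.random_crop(image, [224, 224, 3])       # random cropping
            image = tf.image.convert_image_dtype(image, tf.float32)
            label = tf.cast(parsed['label'], tf.int32)
            return image, label

        dataset = dataset.map(parse_and_preprocess, num_parallel_calls=8)
        dataset = dataset.shuffle(2048).repeat()
        # TPUs require static shapes, so the final partial batch is dropped.
        dataset = dataset.batch(batch_size, drop_remainder=True)
        return dataset.prefetch(2)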