The rapid deployment of Machine Learning (ML) applications such as recommendation engines, chatbots, and image synthesis systems has made them a dominant workload. These applications are powered by ML models of increasingly expansive scale, with models comprising trillions of parameters becoming quite common. Due to their massive compute requirements, ML models are trained almost exclusively on specialized accelerators, often in a distributed setting. However, a closer analysis of compute utilization shows that ML models do not fully utilize the compute available on these accelerators. The primary reason for this poor compute utilization is data movement bottlenecks.

In this dissertation we primarily focus on data movement bottlenecks associated with intermediate activations. First, we study gradient compression, an approach that reduces the amount of data synchronized during distributed training. In Accordion, we retrofit existing compression algorithms to automatically vary the degree of compression, reducing communication during training. Next, in On the Utility of Gradient Compression, we study why gradient compression algorithms often fail to deliver wall-clock speedups and propose several guidelines for designing new gradient compression algorithms. The second part of this dissertation studies distributed training of recommendation models, where we introduce Bagpipe, a system that minimizes embedding access overhead in distributed training. Finally, we introduce Clustered Head Attention, which reduces the memory bandwidth bottlenecks of multi-head attention by identifying attention heads with similar outputs at inference time.