As programmable GPUs have become increasingly general-purpose, a wide variety of applications now leverage them for accelerated computing. These newer workloads have different characteristics than the workloads traditionally run on these systems. For example, traditional workloads generally assume that threads access independent data and synchronize infrequently. Accordingly, accelerators such as GPUs use simple, software-driven coherence protocols that trade off heavyweight synchronization operations for improved efficiency when synchronization is not required. To further reduce synchronization overhead, accelerators also use scoped memory consistency models. While this approach worked well for traditional applications, many modern applications that run on GPUs frequently share data across threads and use fine-grained synchronization, so inefficient synchronization support is a significant bottleneck for running them on these accelerators. A holistic approach is therefore required to tackle the inefficiencies in how synchronization is used in both single- and multi-GPU systems.

We propose hardware-software frameworks that use knowledge of the GPU memory hierarchy and the algorithmic properties of applications to improve the efficiency of GPU global synchronization. First, we target the bottleneck in explicit global synchronization that results from using atomics for global memory updates. To resolve this bottleneck, we propose caching commutative atomic updates locally using a novel buffering mechanism that exploits locality in atomics and reduces their serialization penalty, which in turn reduces network traffic to the last-level cache (LLC) and improves performance. Programmers also rely on software synchronization primitives to enforce global synchronization and ensure correctness. However, existing GPU synchronization primitives either scale poorly or suffer from livelock or deadlock because of heavy contention among threads accessing shared synchronization objects. We overcome these inefficiencies by designing more efficient, scalable GPU synchronization primitives. Finally, we address the performance degradation that results from the implicit global synchronization performed at kernel boundaries. GPU vendors are pivoting to chiplet-based designs in which the global memory ordering point moves from the L2 cache to the L3 cache or global memory. This necessitates bulk invalidations and writebacks of each chiplet's L2 cache, sacrificing potential inter-kernel data reuse. We propose an intelligent producer-consumer dependency tracking mechanism called CPCoh that reduces the number of bulk coherence maintenance operations required, thereby increasing inter-kernel reuse and improving performance.

Overall, we advance the state of the art for global synchronization in GPUs, delivering performance and energy improvements across a wide range of modern GPU applications.
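To illustrate the kind of locality the proposed atomic buffering mechanism exploits, the sketch below shows a conventional CUDA histogram kernel in which each thread block first merges its commutative atomicAdd updates in shared memory before issuing one global atomic per bin. This is only a software analogue written with assumed names (histogram, NUM_BINS); the proposed mechanism performs a similar aggregation transparently in hardware, without programmer involvement.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256

// Hypothetical histogram kernel: commutative updates can be merged in any
// order, so each block coalesces them locally before touching the LLC.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int s_bins[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        s_bins[i] = 0;
    __syncthreads();

    // Per-element commutative atomics land in shared memory, not global memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&s_bins[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block instead of one per element.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&bins[i], s_bins[i]);
}
```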
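The scalability problems of existing primitives can be seen in a conventional centralized spin lock, sketched below. All waiters spin on a single global word, so contention grows with thread count, and under SIMT execution letting every thread of a warp contend independently risks livelock or deadlock. The helper names (lock_acquire, lock_release, critical_update) are illustrative only, not part of the proposed design.

```cuda
#include <cuda_runtime.h>

// Naive centralized spin lock: every contender hammers one cache line in the LLC.
__device__ void lock_acquire(unsigned int *lock) {
    while (atomicCAS(lock, 0u, 1u) != 0u) { /* spin */ }
    __threadfence();  // order the critical section's accesses after acquisition
}

__device__ void lock_release(unsigned int *lock) {
    __threadfence();  // make the critical section's writes visible first
    atomicExch(lock, 0u);
}

__global__ void critical_update(unsigned int *lock, int *shared_counter) {
    // One thread per block enters the critical section; letting every thread
    // of a warp contend independently risks SIMT-induced livelock.
    if (threadIdx.x == 0) {
        lock_acquire(lock);
        *shared_counter += 1;
        lock_release(lock);
    }
}
```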
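Finally, the sketch below shows the inter-kernel producer-consumer pattern that CPCoh targets; the kernel names and sizes are hypothetical. On a chiplet-based GPU whose ordering point lies beyond the per-chiplet L2, the kernel boundary between the two launches conventionally triggers a bulk L2 writeback and invalidation, so the consumer re-fetches data the producer just wrote; tracking this dependence would allow that data to be reused from the L2 instead.

```cuda
#include <cuda_runtime.h>

__global__ void producer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = i * 0.5f;          // fills the producing chiplet's L2
}

__global__ void consumer(const float *buf, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;     // could hit in L2 if it is not flushed
}

int main() {
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    producer<<<(n + 255) / 256, 256>>>(buf, n);
    // Implicit global synchronization at this kernel boundary.
    consumer<<<(n + 255) / 256, 256>>>(buf, out, n);
    cudaDeviceSynchronize();
    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```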