Dataflow Shuffle is the base operation behind Dataflow transforms such as GroupByKey, CoGroupByKey, and Combine. The Dataflow Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant manner. The Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend.
Batch jobs use Dataflow Shuffle by default.
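As an illustration, the following minimal Apache Beam (Python SDK) pipeline contains a GroupByKey step; when a pipeline like this runs on Dataflow as a batch job, the grouping is performed by Dataflow Shuffle in the service backend rather than on the worker VMs. This is a sketch, not a complete production pipeline; running it on Dataflow also requires runner and project options.

```python
# Minimal sketch: a batch pipeline whose GroupByKey step is backed by
# Dataflow Shuffle when the job runs on the Dataflow service.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([("a", 1), ("b", 2), ("a", 3)])
        # GroupByKey partitions and groups elements by key; on Dataflow,
        # this shuffle runs in the service backend, not on worker VMs.
        | "GroupByKey" >> beam.GroupByKey()
        | "Print" >> beam.Map(print)
    )
```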
Benefits of Dataflow Shuffle
The service-based Dataflow Shuffle has the following benefits:
- Faster execution time of batch pipelines for the majority of pipeline job types.
- A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
- Better Horizontal Autoscaling because VMs do not hold any shuffle data and can therefore be scaled down earlier.
- Better fault tolerance; an unhealthy VM holding Dataflow Shuffle data does not cause the entire job to fail, as it would without the feature.
Most of the reduction in worker resources comes from offloading the shuffle work to the Dataflow service. For that reason, there is a charge associated with the use of Dataflow Shuffle. The execution times might vary from run to run. If you are running a pipeline that has important deadlines, we recommend allocating sufficient buffer time before the deadline.
Use Dataflow Shuffle
This feature is available in all regions where Dataflow is supported. To see available locations, read Dataflow locations. If you use Dataflow Shuffle, the workers must be deployed in the same region as the Dataflow job.
If you use Dataflow Shuffle for your pipeline, don't specify the zone pipeline option. Instead, specify the region pipeline option and set the value to one of the available regions. Dataflow autoselects the zone in the region you specified. If you do specify the zone pipeline option and set it to a zone outside of the available regions, Dataflow reports an error. If you set an incompatible combination of region and zone, your job cannot use Dataflow Shuffle.
There might be performance differences between regions.
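For example, a pipeline can set the region option, and leave the zone option unset, through the Python SDK's PipelineOptions. This is a sketch in which the project ID, region, and bucket name are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: set region only and let Dataflow autoselect the zone.
# "my-project", "us-central1", and "my-bucket" are placeholders.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    # No zone option: Dataflow autoselects a zone in us-central1.
)
```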
The default boot disk size for each batch job is 25 GB. For some batch jobs, you might be required to modify the size of the disk. Consider the following:
- A worker VM uses part of the 25 GB of disk space for the operating system, binaries, logs, and containers. Jobs that use a significant amount of disk and exceed the remaining disk capacity might fail when you use Dataflow Shuffle.
- Jobs that perform a lot of disk I/O might be slow because of the limited performance of a small disk. For more information about performance differences between disk sizes, see Compute Engine Persistent Disk performance.
To specify a larger disk size for a Dataflow Shuffle job, you can use the --disk_size_gb parameter.
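For example, with the Python SDK the flag can be passed exactly as it would appear on the command line; the 50 GB value below is illustrative:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch: raise the boot disk from the 25 GB default to 50 GB
# (an illustrative value) for a disk-heavy Dataflow Shuffle job.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--region=us-central1",  # placeholder region
    "--disk_size_gb=50",
])
```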