Storage Volume Management¶
Hadoop uses instance storage for important features such as caching, fault tolerance and storage. Hadoop Distributed File System (HDFS) relies on the local storage on instances within cluster to store replicas of data blocks. MapReduce (MR) framework uses local file system for storing intermediate map outputs, spill files, distributed cache etc. which can result in high volumes of disk usage while working with reasonably sized datasets.
Hadoop uses volumes attached to an instance in round-robin fashion to distribute files across different disks. A task attempt writing to a volume with no available storage space can fail. The task attempt will normally be retried but such failures can lead to job failure if maximum retry attempts are exceeded.
Volumes and Task Configuration Parameters¶
Some of the important Hadoop configuration properties which can be used to avoid out of disk space errors are:
| Parameter | Default Value | Description |
|---|---|---|
dfs.datanode.du.reserved |
1073741824 | Reserved space in bytes per volume. On QDS, both MR and HDFS will not select a volume for write if it has available space less than the configured value. Value in bytes. |
mapred.local.dir.minspacestart |
2147483648 | If no volume on the instance has available space more than the configured value, new tasks are not scheduled on the instance. Value in bytes. |
mapred.local.dir.minspacekill |
0 | If no volume on the instance has available space more than the configured value, new tasks are not scheduled on the instance. Also, to save the rest of the running tasks, kill one of them, to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes. |
EBS Volumes for Additional Storage¶
Some instances on cloud, like the compute intensive AWS instances, have good CPU and memory configuration but come with low instance storage which limits their usage. To increase storage on such instance types, it is possible to attach EBS volumes using cluster configuration in the Control Panel. Qubole Data Service (QDS) allows both EBS Magnetic and EBS General Purpose (SSD) volumes to be attached.
There are two modes in which EBS volumes can be used:
EBS Volumes as Reserved Disks
In this mode, the attached EBS volumes act as reserved disks for HDFS and MR intermediate data. An additional configuration property
local.reserved.diris automatically set to contain list of directories residing on mounted EBS volumes. These directories are accessed only in case the request to read, write or existence check could not be served from instance storage.The above arrangement ensures that if there was sufficient space on fast SSD volumes, then EBS volumes are not used. This prevents the use of EBS volumes which might have higher latency or have I/O operation cost associated with them. Additional information related to this mode can be found in the blog post. This is the default mode in which the attached EBS volumes are used.
EBS Volumes as Local Disks
In this mode, the attached EBS volumes act as regular instance storage. The mode can be enabled by setting Hadoop configuration property
local.reserved.regular.disktotrue. This mode is useful for cases when SSD volumes are attached to the instance which have desired latency for the application and do not have additional I/O cost associated with them.