Protecting Clusters Against Bad Jobs

A single Hadoop cluster is usually shared by many users. In such a multi-user environment, a user may submit a bad job that degrades the performance of the entire cluster. Common ways in which a bad job affects the whole cluster are:

  • A single mapper emits so much output that it fills up the disk
  • A MapReduce job runs many mappers that together emit so much map output that disk usage rises across the entire cluster
  • Reduce tasks copy a large amount of map output during the shuffle phase and use up disk space

Qubole’s Hadoop distribution protects clusters against such jobs. Each cluster periodically monitors its jobs (via the JobUpdater thread) and kills any job that may be affecting the entire cluster.

Cluster Level Configuration

The cluster-level parameters and their default values are listed below. To change them, add them to the Hadoop overrides section and restart the cluster.

Control the frequency of the JobUpdater thread

mapred.cluster.kill.jobs.map.output.interval.seconds (default: 600 seconds)

Kill the job producing the most map output when DFS is running low on disk space

# This check is triggered when the average free space per node is less than the
# configured value
mapred.hustler.downscale.freespacepernode.gb.avg (default: 5)

# And when the map output is more than the following fraction of the used DFS.
mapred.map.output.cluster.capacity.ratio.max (default: 0.5)
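
The two conditions above can be read as a single check. The following Python sketch is only an illustration and not Qubole's implementation: the function name, its arguments, and the choice to compare the summed map output of all jobs against the used DFS space are assumptions made for the example.

```python
from typing import Dict, Optional

# Defaults quoted from the parameters above.
FREE_SPACE_PER_NODE_GB_AVG = 5.0     # mapred.hustler.downscale.freespacepernode.gb.avg
MAP_OUTPUT_CAPACITY_RATIO_MAX = 0.5  # mapred.map.output.cluster.capacity.ratio.max


def pick_job_to_kill(
    avg_free_gb_per_node: float,
    dfs_used_gb: float,
    map_output_gb_by_job: Dict[str, float],
    free_threshold_gb: float = FREE_SPACE_PER_NODE_GB_AVG,
    ratio_max: float = MAP_OUTPUT_CAPACITY_RATIO_MAX,
) -> Optional[str]:
    """Return the id of the job with the largest map output, or None.

    Hypothetical helper: a job is selected only when BOTH conditions hold.
    """
    if not map_output_gb_by_job:
        return None
    # Condition 1: the cluster is running low on disk space.
    if avg_free_gb_per_node >= free_threshold_gb:
        return None
    # Condition 2 (assumed interpretation): total map output exceeds the
    # configured fraction of the used DFS space.
    total_map_output_gb = sum(map_output_gb_by_job.values())
    if total_map_output_gb <= ratio_max * dfs_used_gb:
        return None
    # Pick the job that produced the most map output.
    return max(map_output_gb_by_job, key=map_output_gb_by_job.get)
```

For example, with 3 GB of average free space per node, 100 GB of used DFS space, and two jobs holding 40 GB and 20 GB of map output, both conditions hold and the 40 GB job is selected. If the average free space were 8 GB, no job would be selected.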

Job Level Configuration

The job-level configuration properties and their default values are listed below. You can change them either in the Hadoop overrides (a cluster restart is not required) or on a per-job basis.

Kill a job that produces too much map output

# Kill the job when its total map output exceeds the configured value
mapred.map.output.job.max.gb (default: 20% of the total configured dfs capacity)

# Kill the job when any of its tasks produces more map output than the configured value
mapred.map.output.task.max.gb (default: 80% of the disk space per node)

Kill a job that produces a lot of logs

# Kill the job when any of its tasks produces more log data than the
# configured value
mapred.task.log.size.max (default: 40% of the disk space per node)

Kill a job whose reducers read a lot of map data

# Kill the job when any of its reduce tasks reads more shuffle data than the
# configured value
mapred.reduce.shuffle.bytes.per.task.max.gb (default: 80% of the disk space per node)
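
To make the percentage-based defaults concrete, the following Python sketch derives the four job-level limits from a cluster's size and reports which limits a job exceeds. It is a rough illustration under stated assumptions, not the actual enforcement code: the class, function, and argument names are hypothetical, and all sizes are expressed in GB for simplicity even though the source does not state the unit of mapred.task.log.size.max.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobLimits:
    """Hypothetical container for the four job-level limits, in GB."""
    job_map_output_max_gb: float
    task_map_output_max_gb: float
    task_log_size_max_gb: float
    reduce_shuffle_per_task_max_gb: float


def default_limits(dfs_capacity_gb: float, disk_per_node_gb: float) -> JobLimits:
    """Compute the documented defaults from cluster capacity figures."""
    return JobLimits(
        job_map_output_max_gb=0.20 * dfs_capacity_gb,            # 20% of DFS capacity
        task_map_output_max_gb=0.80 * disk_per_node_gb,          # 80% of disk per node
        task_log_size_max_gb=0.40 * disk_per_node_gb,            # 40% of disk per node
        reduce_shuffle_per_task_max_gb=0.80 * disk_per_node_gb,  # 80% of disk per node
    )


def violations(
    limits: JobLimits,
    total_map_output_gb: float,
    max_task_map_output_gb: float,
    max_task_log_gb: float,
    max_task_shuffle_gb: float,
) -> List[str]:
    """Return the names of the properties whose limits the job exceeds."""
    checks = [
        ("mapred.map.output.job.max.gb",
         total_map_output_gb, limits.job_map_output_max_gb),
        ("mapred.map.output.task.max.gb",
         max_task_map_output_gb, limits.task_map_output_max_gb),
        ("mapred.task.log.size.max",
         max_task_log_gb, limits.task_log_size_max_gb),
        ("mapred.reduce.shuffle.bytes.per.task.max.gb",
         max_task_shuffle_gb, limits.reduce_shuffle_per_task_max_gb),
    ]
    return [name for name, observed, limit in checks if observed > limit]
```

On a cluster with 10,000 GB of configured DFS capacity and 1,000 GB of disk per node, the defaults work out to roughly 2,000 GB of total map output per job, 800 GB of map output per task, 400 GB of logs per task, and 800 GB of shuffle data per reduce task. A job exceeding any one of these limits would be a candidate for being killed.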