Cluster Metrics¶

GET /api/v1.3/clusters/(string: id_or_label)/metrics¶

Note

The metrics are available for clusters running with Ganglia monitoring enabled.

Required Role¶

The following roles can make this API call:

A user who is part of the system-user/system-admin group.
A user who is part of a group associated with a role that allows viewing a cluster’s metrics. See Managing Groups and Managing Roles for more information.

Parameters¶

Parameter	Description
metric	The metric to monitor. Metric values can be retrieved for a particular node or aggregated across the cluster.
interval	The interval for which the metric values are required. Valid values are `hour`, `2hr`, `4hr`, `day`, `week`, `month`, and `year`. The default is `hour`.
hostname	The hostname for which the metric values are required. The valid value is the private DNS name of the host. See Per-host Metrics below. If it is not specified, the API returns, for certain metrics, the metric value aggregated across the cluster. See Aggregate Cluster Metrics below.

Note

Parameters marked in bold are mandatory. Others are optional and have default values.
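As a rough illustration of how the parameters above combine into a request, the following Python sketch builds the GET URL. The function name `build_metrics_url` and its `endpoint` argument are hypothetical helpers introduced here for illustration; only the path, the parameter names, and the interval values come from this page.

```python
from urllib.parse import quote, urlencode

# Interval values listed in the parameter table above.
VALID_INTERVALS = ("hour", "2hr", "4hr", "day", "week", "month", "year")


def build_metrics_url(cluster, metric, interval="hour", hostname=None,
                      endpoint="https://api.qubole.com"):
    """Return the GET URL for the cluster metrics call (hypothetical helper)."""
    if interval not in VALID_INTERVALS:
        raise ValueError(f"unsupported interval: {interval!r}")
    params = {"metric": metric, "interval": interval}
    # Omitting hostname requests the aggregate value where the metric supports it.
    if hostname is not None:
        params["hostname"] = hostname
    cluster_part = quote(str(cluster), safe="")
    return f"{endpoint}/api/v1.3/clusters/{cluster_part}/metrics?{urlencode(params)}"


print(build_metrics_url(21144, "cpu_report"))
```

The request itself still needs the X-AUTH-TOKEN header shown in the curl examples on this page.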

Per-host Metrics¶

Metrics for a single host can be retrieved by setting the hostname parameter to the internal DNS name of the instance (in the format ip-A-B-C-D.ec2.internal). Some of the useful metrics are:

System Metrics

cpu_user : Percentage of CPU utilization while executing at the user level
cpu_system : Percentage of CPU utilization while executing at the system level
cpu_idle : Percentage of time the CPUs were idle
disk_free : Total free disk space
mem_free : Amount of available memory
bytes_in : Number of bytes in per second
bytes_out : Number of bytes out per second
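
The hostname format described above for per-host metrics (ip-A-B-C-D.ec2.internal) can be derived from an instance's private IPv4 address. The sketch below is only an illustration: `private_dns_name` is a hypothetical helper, and the `suffix` default is taken from the format shown on this page. It is an assumption that the same pattern applies to your cluster; in AWS regions other than us-east-1 the private DNS suffix is typically `<region>.compute.internal`, which is why the suffix is a parameter.

```python
import ipaddress


def private_dns_name(private_ip, suffix="ec2.internal"):
    """Format a private IPv4 address as ip-A-B-C-D.<suffix> (hypothetical helper)."""
    # ipaddress validates the input and raises ValueError on malformed addresses.
    address = ipaddress.IPv4Address(private_ip)
    return "ip-" + str(address).replace(".", "-") + "." + suffix


print(private_dns_name("10.0.1.25"))  # ip-10-0-1-25.ec2.internal
```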

Hadoop 1 JobTracker Metrics

Various JobTracker metrics can be queried with the hostname parameter set to master.

Metrics for Hadoop jobs
- mapred.jobtracker.jobs_submitted : Number of Hadoop jobs submitted
- mapred.jobtracker.jobs_running : Number of Hadoop jobs running
- mapred.jobtracker.jobs_completed : Number of Hadoop jobs completed
- mapred.jobtracker.jobs_failed : Number of Hadoop jobs failed
Metrics for Hadoop map tasks
- mapred.jobtracker.map_slots : Number of map slots
- mapred.jobtracker.occupied_map_slots : Number of map slots occupied
- mapred.jobtracker.maps_launched : Number of map tasks launched
- mapred.jobtracker.running_maps : Number of running map tasks
- mapred.jobtracker.waiting_maps : Number of waiting map tasks
- mapred.jobtracker.maps_completed : Number of map tasks completed
- mapred.jobtracker.maps_failed : Number of map tasks failed
Metrics for Hadoop reduce tasks
- mapred.jobtracker.reduce_slots : Number of reduce slots
- mapred.jobtracker.occupied_reduce_slots : Number of reduce slots occupied
- mapred.jobtracker.reduces_launched : Number of reduce tasks launched
- mapred.jobtracker.running_reduces : Number of running reduce tasks
- mapred.jobtracker.waiting_reduces : Number of waiting reduce tasks
- mapred.jobtracker.reduces_completed : Number of reduce tasks completed
- mapred.jobtracker.reduces_failed : Number of reduce tasks failed
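
The slot metrics listed above can be combined on the client side. As one possible sketch, the Python below divides an occupied-slots series by a total-slots series to estimate slot utilization. It assumes each series is a list of [value, epoch seconds] pairs as in the responses shown on this page; `slot_utilization` is a hypothetical helper, and the sample numbers are made up for illustration.

```python
import math


def slot_utilization(occupied, total):
    """Return [(timestamp, occupied / total)] for timestamps usable in both series."""
    totals = {ts: float(value) for value, ts in total}
    result = []
    for value, ts in occupied:
        used = float(value)  # float("NaN") yields nan for unavailable values
        capacity = totals.get(ts, math.nan)
        # Skip gaps and avoid dividing by zero when no slots are configured.
        if math.isnan(used) or math.isnan(capacity) or capacity == 0:
            continue
        result.append((ts, used / capacity))
    return result


# Made-up datapoints for occupied_map_slots and map_slots.
occupied = [[6, 1427752170], [8, 1427752185], ["NaN", 1427752200]]
total = [[8, 1427752170], [8, 1427752185], [8, 1427752200]]
print(slot_utilization(occupied, total))  # [(1427752170, 0.75), (1427752185, 1.0)]
```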

Examples¶

The following curl command reports the number of maps launched over the past hour in a Hadoop 1 cluster.

curl -i -H "X-AUTH-TOKEN: ${X_AUTH_TOKEN}" -H "Content-Type: application/json" -H "Accept: application/json" \
     -G \
     -d metric=mapred.jobtracker.maps_launched \
     -d hostname=master \
     -d interval=hour \
     https://api.qubole.com/api/v1.3/clusters/${CLUSTER_ID}/metrics

Note

The above syntax uses https://api.qubole.com as the endpoint. Qubole provides other endpoints to access QDS that are described in Supported Qubole Endpoints on Different Cloud Providers.

The JSON response to the API call contains datapoints corresponding to the metric values. Each pair in datapoints has the format [metric value, time in epoch seconds]. A metric value of “NaN” means that no value was available at that point in time.

Response:

[
   {
      "datapoints":[
         [14313937,1427752170],
         [14313937,1427752185],
         [14319826.6,1427752200],
         [14328661,1427752215],
         ...
         [14940674,1427755710],
         [14943716,1427755725],
         ["NaN",1427755740],
         ["NaN",1427755755]
      ],
      "hostname":"master",
      "metric":"master last hour   ",
      "interval":"hour"
   }
]
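
A response like the one above can be post-processed in a few lines. The following Python sketch, offered as one possible approach, skips the "NaN" placeholders and converts the epoch seconds to UTC datetimes. `valid_points` is a hypothetical helper, and `body` is a shortened copy of the sample response rather than live output.

```python
import json
import math
from datetime import datetime, timezone


def valid_points(series):
    """Yield (UTC datetime, value) pairs, skipping "NaN" placeholders."""
    for value, epoch in series["datapoints"]:
        number = float(value)  # float("NaN") yields nan
        if math.isnan(number):
            continue
        yield datetime.fromtimestamp(epoch, tz=timezone.utc), number


# Shortened copy of the sample response above.
body = """[{"datapoints": [[14313937, 1427752170], [14319826.6, 1427752200],
                          ["NaN", 1427755740]],
           "hostname": "master", "metric": "master last hour   ",
           "interval": "hour"}]"""

for series in json.loads(body):
    for when, value in valid_points(series):
        print(when.isoformat(), value)
```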

Example to get the cluster metrics of a Hadoop 2 cluster with 21144 as its cluster ID

curl -i -H "X-AUTH-TOKEN: $AUTH_TOKEN" -H "Content-Type: application/json" -H "Accept: application/json" \
     -G \
     -d metric=yarn.NodeManagerMetrics.ContainersRunning \
     -d metric=yarn.NodeManagerMetrics.ContainersCompleted \
     -d metric=yarn.NodeManagerMetrics.ContainersKilled \
     -d hostname=<hostname> \
     -d interval=hour \
     https://api.qubole.com/api/v1.3/clusters/21144/metrics

In the above example, replace <hostname> with the private DNS name of the host.

Aggregate Cluster Metrics¶

Some of the system metrics can be aggregated across the cluster to get a broader view of resource usage across all instances in the cluster. Do not specify the hostname parameter when requesting aggregate cluster metrics.

Some of the useful aggregate cluster metrics are:

cpu_report : Aggregate report of CPU utilization percentage
mem_report : Aggregate report of memory usage in bytes
load_report : Aggregate report with current load, number of running processes, nodes, and CPU count
network_report : Aggregate report with network traffic in and out of the cluster nodes

Example¶

curl -i -H "X-AUTH-TOKEN: ${X_AUTH_TOKEN}" -H "Content-Type: application/json" -H "Accept: application/json" \
     -G \
     -d metric=cpu_report \
     -d interval=hour \
     https://api.qubole.com/api/v1.3/clusters/${CLUSTER_ID}/metrics

Response:

[
 {"metric":"User\\g","interval":"hour","datapoints":[[58.689508632,1427752170],[57.445152722,1427752185],[56.650996016,1427752200],[53.899468792,1427752215], ..., [43.448339973,1427755710],[44.044090305,1427755725],[42.478220452,1427755740],["NaN",1427755755]],"hostname":"null"},
 {"metric":"Nice\\g","interval":"hour","datapoints":[[0.010491367862,1427752170],[0.0088977423639,1427752185],[0.0024701195219,1427752200],[0.0030544488712,1427752215], ..., [0,1427755710],[0,1427755725],[0,1427755740],["NaN",1427755755]],"hostname":"null"},
 {"metric":"System\\g","interval":"hour","datapoints":[[6.4996015936,1427752170],[6.3784860558,1427752185],[6.2476494024,1427752200],[5.985126162,1427752215], ..., [5.5504648074,1427755710],[5.5448871182,1427755725],[5.3686586985,1427755740],["NaN",1427755755]],"hostname":"null"},
 {"metric":"Wait\\g","interval":"hour","datapoints":[[0.44156706507,1427752170],[0.45962815405,1427752185],[0.41856573705,1427752200],[0.40849933599,1427752215], ..., [0.88273572377,1427755710],[0.78273572377,1427755725],[0.66613545817,1427755740],["NaN",1427755755]],"hostname":"null"},
 {"metric":"Steal\\g","interval":"hour","datapoints":[[0.096812749004,1427752170],[0.096679946879,1427752185],[0.096414342629,1427752200],[0.096812749004,1427752215], ..., [0.099601593625,1427755710],[0.09973439575,1427755725],[0.1,1427755740],["NaN",1427755755]],"hostname":"null"},
 {"metric":"Idle\\g","interval":"hour","datapoints":[[34.283532537,1427752170],[35.633333333,1427752185],[36.605179283,1427752200],[39.631341301,1427752215], ..., [50.014741036,1427755710],[49.515139442,1427755725],[51.369189907,1427755740],["NaN",1427755755]],"hostname":"null"}
]
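
An aggregate report such as the one above contains one series per component (User, Nice, System, Wait, Steal, Idle). As a minimal sketch, the Python below averages the available values of each series. `summarize` is a hypothetical helper, the removal of the trailing `\g` from series names is an assumption based only on the sample response, and the datapoints are rounded, illustrative values rather than real output.

```python
import math


def summarize(report):
    """Map each series name to the mean of its available (non-NaN) values."""
    summary = {}
    for series in report:
        name = series["metric"]
        # Names in the sample response end with a literal backslash-g ("User\g").
        if name.endswith("\\g"):
            name = name[:-2]
        values = [float(value) for value, _ in series["datapoints"]]
        values = [value for value in values if not math.isnan(value)]
        summary[name] = sum(values) / len(values) if values else math.nan
    return summary


# Illustrative data shaped like the cpu_report response.
report = [
    {"metric": "User\\g", "interval": "hour", "hostname": "null",
     "datapoints": [[58.0, 1427752170], [56.0, 1427752185], ["NaN", 1427755755]]},
    {"metric": "Idle\\g", "interval": "hour", "hostname": "null",
     "datapoints": [[34.0, 1427752170], [36.0, 1427752185], ["NaN", 1427755755]]},
]
print(summarize(report))  # {'User': 57.0, 'Idle': 35.0}
```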