Understanding the System Metrics for Monitoring (AWS)
Qubole clusters support Datadog monitoring when it is enabled at the QDS account level. For more information on enabling Datadog in Control Panel > Account Settings, see Configuring your Access Settings using IAM Keys or Managing Roles.
The following table lists the different system metrics that are published to the Datadog account.
| System Metrics | Metrics Definition |
|---|---|
| disk_free | Total free disk space |
| disk_total | Total disk space |
| part_max_used | Maximum percent used on any single disk partition. |
| load_one | Load Average over 1 minute |
| load_five | Load Average over 5 minutes |
| load_fifteen | Load Average over 15 minutes |
| cpu_user | Percentage of CPU utilization while executing at the user level. |
| cpu_system | Percentage of CPU utilization while executing at the system level. |
| cpu_wio | Percentage of CPU time spent waiting for I/O. |
| cpu_nice | Percentage of CPU cycles spent on nice processes. |
| cpu_steal | Stolen time, which is the time spent in other operating systems when running in a virtualized environment. |
| cpu_aidle | Percentage of CPU cycles spent idle since last boot. |
| cpu_idle | Percentage of CPU idle time. |
| cpu_report | Aggregate report of CPU utilization percentage. |
| mem_report | Aggregate report of memory usage in bytes. |
| load_report | Aggregate report with current load, number of running processes, nodes, and CPU count. |
| network_report | Aggregate report with network traffic in and out of the cluster nodes. |
| cluster-addnodefailure | The node addition metric to monitor the autoscaling feature. |
| cluster-removenodefailure | The node removal metric to monitor the downscaling/autoscaling events in a cluster. |
| system-rootdiskfullmaster | The metric displays the disk space in the master node’s root partition. |
| system-ephemeral0fullmaster | The metric displays the disk space in the master node’s ephemeral0 partition. |
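
Some of the metrics above are easier to act on once combined. The following is a minimal sketch, not part of QDS or Datadog, of how disk utilization might be derived from `disk_free` and `disk_total`, and non-idle CPU time from `cpu_idle`. The function names and sample values are hypothetical. The sketch assumes that `disk_free` and `disk_total` are reported in the same unit and that `cpu_idle` is a percentage between 0 and 100.

```python
def disk_used_percent(disk_free: float, disk_total: float) -> float:
    """Return the percentage of disk space in use.

    Assumes disk_free and disk_total share the same unit; the unit
    itself does not matter because it cancels out in the ratio.
    """
    if disk_total <= 0:
        raise ValueError("disk_total must be positive")
    if not 0 <= disk_free <= disk_total:
        raise ValueError("disk_free must be between 0 and disk_total")
    return 100.0 * (disk_total - disk_free) / disk_total


def cpu_non_idle_percent(cpu_idle: float) -> float:
    """Return the percentage of CPU time that was not idle.

    Assumes cpu_idle is a percentage in the range 0 to 100. The result
    covers everything other than idle time, for example user, system,
    nice, I/O wait, and steal time.
    """
    if not 0 <= cpu_idle <= 100:
        raise ValueError("cpu_idle must be between 0 and 100")
    return 100.0 - cpu_idle


if __name__ == "__main__":
    # Hypothetical sample readings, for illustration only.
    print(disk_used_percent(disk_free=25.0, disk_total=100.0))
    print(cpu_non_idle_percent(cpu_idle=80.0))
```

A derived value such as disk utilization is often a more convenient alert threshold than raw free space, because it stays comparable across nodes with different disk sizes.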