Understanding the Presto Metrics for Monitoring
Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)
The following table lists the Presto metrics that are displayed in the Datadog account, the abnormalities that each metric can indicate, and the actions that you can take to address the cause.
| Presto Metric | Metric Definition | Abnormalities indicated in the Metrics | Actions |
|---|---|---|---|
| presto.Workers | Number of workers that are part of the Presto cluster | presto.Workers is less than the configured minimum number of nodes | Perform these actions: |
| presto.MaxYoungGenGC-Time | Maximum time spent in YoungGen Garbage Collection (GC) across all nodes of the cluster | A sudden spike in the value indicates a problem. | GC problems typically occur when the cluster is under heavy load. Better workload management using Queues or ResourceGroups can reduce them. |
| presto.MaxYoungGenGC-Count | Maximum number of YoungGen GC events across all nodes of the cluster | A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a better idea. | GC problems typically occur when the cluster is under heavy load. Better workload management using Queues or ResourceGroups can reduce them. |
| presto.MaxOldGenGC-Time | Maximum time spent in OldGen GC across all nodes of the cluster | A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a better idea. | GC problems typically occur when the cluster is under heavy load. Better workload management using Queues or ResourceGroups can reduce them. |
| presto.MaxOldGenGC-Count | Maximum number of OldGen GC events across all nodes of the cluster | A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a better idea. | GC problems typically occur when the cluster is under heavy load. Better workload management using Queues or ResourceGroups can reduce them. |
| presto.AveragePlanningTime | Average planning time (in milliseconds) in the planning phase of queries | These are the possible abnormalities: | Perform these actions: |
| presto.requestFailures | Number of requests that failed at the master while contacting worker nodes during task execution | A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem. | Perform these actions: |
| presto.RUNNING-Queries | Number of Running Queries in the cluster | Not Applicable | Not Applicable |
| presto.FINISHED-Queries | Number of Finished Queries | Not Applicable | Not Applicable |
| presto.FAILED-Queries | Number of Failed Queries | Not Applicable | Not Applicable |
| presto.bytesReadPerSecondPerQuery | Bytes read per second per query. (This metric considers only running queries in its calculation; if there are none, no data is reported.) | A value trending toward 0 indicates an issue. The absence of a value does not indicate an issue. | This is most likely caused by read operators getting stuck while reading from the cloud object store. The most probable reason is a network issue, which you can check manually from the nodes. If you can manually reach the cloud object store, create a ticket with Qubole Support. |
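
Some of the abnormalities in the table can be expressed as simple checks over metric samples. The following is a minimal sketch in Python, not part of Qubole or Datadog: the function names, the spike factor of 3, the baseline floor, and the near-zero threshold are all hypothetical values chosen for illustration. It assumes that you have already retrieved the samples as plain numbers, with `None` standing for an interval in which no data was reported.

```python
from statistics import median
from typing import Optional, Sequence


def workers_below_minimum(worker_count: int, configured_min_nodes: int) -> bool:
    """presto.Workers is abnormal when it is less than the configured minimum."""
    return worker_count < configured_min_nodes


def is_sudden_spike(
    samples: Sequence[float], factor: float = 3.0, min_baseline: float = 1.0
) -> bool:
    """Flag the latest sample if it exceeds `factor` times the earlier median.

    `factor` and `min_baseline` are illustrative assumptions, not recommended
    thresholds. `min_baseline` keeps an all-zero history from flagging tiny values.
    """
    if len(samples) < 2:
        return False  # not enough history to compare against
    baseline = max(median(samples[:-1]), min_baseline)
    return samples[-1] > factor * baseline


def read_throughput_stalled(
    samples: Sequence[Optional[float]], floor_bytes_per_sec: float = 1.0
) -> bool:
    """Flag presto.bytesReadPerSecondPerQuery when the latest reported value is near 0.

    Missing points (None) mean that no queries were running, which is not an
    issue, so they are ignored rather than treated as zero.
    """
    reported = [s for s in samples if s is not None]
    if not reported:
        return False  # absence of a value does not indicate an issue
    return reported[-1] < floor_bytes_per_sec
```

The same `is_sudden_spike` check can be applied to both a GC count series and the matching GC time series, so that a spike in the count is acted on only when the time shows one too.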