Understanding the Presto Metrics for Monitoring

Presto clusters support the Datadog monitoring service. You can configure the Datadog monitoring service at the cluster level as described in Advanced configuration: Modifying Cluster Monitoring Settings, or at the account level. (For more information on configuring the Datadog monitoring service at the account level on AWS, see Configuring your Access Settings using IAM Keys or Managing Roles.)

The following table lists the Presto metrics that are displayed in the Datadog account, the abnormalities that each metric can indicate, and the actions that you can take to address the cause.

Presto Metric	Metric Definition	Abnormalities indicated in the Metrics	Actions
presto.Workers	Number of workers that are part of the Presto cluster	`presto.Workers` is lower than the configured minimum number of nodes	Perform these actions: Check whether there is a spot node loss; use the `presto.SpotLossNotification` metric to confirm. If there is a spot node loss, the drop is expected and the cluster scales up after some time. If the `presto.requestFailures` metric is increasing, Presto may be failing on a worker node. The most common reason for this is GC pauses on a worker node. To prevent it, manage the workload better through Queues or ResourceGroups.
presto.MaxYoungGenGC-Time	Maximum time spent in YoungGen Garbage Collection (GC) across all nodes of the cluster	A sudden spike in the value indicates a problem.	GC problems typically happen when the cluster is under heavy load. Better workload management through Queues or ResourceGroups can reduce them.
presto.MaxYoungGenGC-Count	Maximum number of YoungGen GC events across all nodes of the cluster	A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a better idea.	GC problems typically happen when the cluster is under heavy load. Better workload management through Queues or ResourceGroups can reduce them.
presto.MaxOldGenGC-Time	Maximum time spent in OldGen GC across all nodes of the cluster	A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a better idea.	GC problems typically happen when the cluster is under heavy load. Better workload management through Queues or ResourceGroups can reduce them.
presto.MaxOldGenGC-Count	Maximum number of OldGen GC events across all nodes of the cluster	A sudden increase in the value can point to a problem, but correlating it with GC time usually gives a better idea.	GC problems typically happen when the cluster is under heavy load. Better workload management through Queues or ResourceGroups can reduce them.
presto.AveragePlanningTime	Average time (in milliseconds) that queries spend in the planning phase	These are the possible abnormalities: Sudden spikes in the value can be expected when a query runs for the first time on a table, because this warms up the metastore cache. With metastore caching disabled, a planning time that is consistently high (in the tens of seconds) indicates a problem.	Perform these actions: If the value of this metric does not decrease even after several queries have run on the cluster, check the metastore cache settings and ensure that caching is enabled and that the TTL values are high enough to allow cached values to be used. A consistently high planning time could mean a problem in the metastore or the running Hive Metastore server. Resolution: First, verify that the metastore is not under heavy load. If it is, upgrade the metastore or take other measures to bring down the load. If the metastore is not heavily loaded, the issue might be with the Hive Metastore server. To resolve this, create a ticket with Qubole Support.
presto.requestFailures	Number of requests that failed at the master while contacting worker nodes during task execution	A few of these errors can occur due to network congestion, but a consistent increase in the value indicates a problem.	Perform these actions: This can happen if a node is lost. Check the `presto.Workers` metric to confirm. This can also happen if a node is stuck in GC. Use the GC-related metrics to confirm.
presto.RUNNING-Queries	Number of Running Queries in the cluster	Not Applicable	Not Applicable
presto.FINISHED-Queries	Number of Finished Queries	Not Applicable	Not Applicable
presto.FAILED-Queries	Number of Failed Queries	Not Applicable	Not Applicable
presto.bytesReadPerSecondPerQuery	Bytes read per second per query (This metric considers only running queries in its calculations; if there are none, no data is reported.)	A value going towards 0 indicates that there is an issue. Absence of a value does not indicate an issue.	This is most probably due to read operators getting stuck while reading from the cloud object store. The most probable reason is a network issue, which can be checked manually from the nodes. If you can manually reach the cloud object store, create a ticket with Qubole Support.
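
The triage rules in the table can be sketched as a small Python function. This is only an illustration under stated assumptions, not part of Qubole or Datadog: it takes recent samples of each metric as plain lists (oldest first) and makes no Datadog API calls. The function names, the thresholds, and the reading of a nonzero `presto.SpotLossNotification` sample as "spot loss observed" are all hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence

# Hypothetical thresholds, chosen for illustration only.
HIGH_PLANNING_MS = 10_000     # "tens of seconds" read as >= 10 s
LOW_READ_BYTES_PER_SEC = 1.0  # "going towards 0"


def consistently_increasing(samples: Sequence[float]) -> bool:
    """True when there are at least 3 samples and each exceeds the previous one."""
    return len(samples) >= 3 and all(b > a for a, b in zip(samples, samples[1:]))


def triage(series: Dict[str, Sequence[float]], min_workers: int) -> List[str]:
    """Map recent metric samples (oldest first) to follow-up hints from the table."""
    hints: List[str] = []

    workers = series.get("presto.Workers", [])
    spot_loss = series.get("presto.SpotLossNotification", [])
    failures = series.get("presto.requestFailures", [])
    planning = series.get("presto.AveragePlanningTime", [])
    read_rate = series.get("presto.bytesReadPerSecondPerQuery", [])

    # Worker count below the configured minimum.
    if workers and workers[-1] < min_workers:
        # Assumption: any nonzero sample means a spot loss notice was seen.
        if any(v > 0 for v in spot_loss):
            hints.append("workers below minimum: spot node loss, cluster should scale up")
        else:
            hints.append("workers below minimum, no spot loss notice: check presto.requestFailures")

    # A few failures are normal; a consistent increase is not.
    if consistently_increasing(failures):
        hints.append("request failures rising: check presto.Workers and GC metrics")

    # Planning time that stays high across the whole window.
    if len(planning) >= 3 and all(v >= HIGH_PLANNING_MS for v in planning):
        hints.append("planning time consistently high: check metastore cache and load")

    # An empty series means no running queries, which is not an issue.
    if read_rate and read_rate[-1] <= LOW_READ_BYTES_PER_SEC:
        hints.append("read rate near 0: check network access to the cloud object store")

    return hints


if __name__ == "__main__":
    sample = {
        "presto.Workers": [5, 5, 3],
        "presto.SpotLossNotification": [0, 1],
        "presto.requestFailures": [0, 0, 0],
    }
    for hint in triage(sample, min_workers=5):
        print(hint)
```

Keeping the rules in a pure function over plain lists makes them easy to test separately from whichever client is used to fetch the samples.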