Understanding Automatic Statistics Collection in Qubole¶

Presto, Apache Spark, and Apache Hive can generate more efficient query plans with table statistics. For example, Spark, as of version 2.1.1, performs broadcast joins only if the table size is available in the table statistics stored in the Hive Metastore (see spark.sql.autoBroadcastJoinThreshold). Broadcast joins can have a dramatic impact on the run time of everyday SQL queries where small dimension tables are joined frequently. The Big Bench tuning exercise from Intel reported a 4.22x speedup by turning on broadcast joins for specific queries.

In Qubole Data Service (QDS), all query engines use Hive Metastore as the catalog. If the Hive Metastore contains statistics, then all query engines can use them for query planning as explained above. But table statistics collection is not automatic. One of the goals of the Qubole platform is to apply automation to help users achieve the maximum performance and efficiency. This topic describes how Qubole has automated Table Statistics collection in QDS.

Understanding the Pros and Cons of Apache Hive Table Statistics¶

The Apache Hive Statistics wiki page describes the list of statistics that can be computed and stored in the Hive metastore. Statistics collection is not automated and the pros and cons are described in this table.

Solution

Pros

Cons

User sets hive.stats.autogather=true to gather statistics automatically during INSERT OVERWRITE queries.

Supported in Hive

Cons include:

Not supported for SparkSQL
Increased query run time
It does not catch these DDL/DML Statements:
- CREATE EXTERNAL TABLE
- ALTER TABLE/PARTITION
- INSERT INTO

It does not catch direct writes to S3 locations for external tables.

User schedules ANALYZE TABLE COMPUTE STATISTICS

User picks up a few tables as wanted to keep stats updated and then uses Qubole Scheduler (or an external cron job) to run the COMPUTE STATISTICS statement on these tables periodically.

Engine Agnostic

It:

Has manual effort on the user’s side
Does not cover all tables automatically
Unnecessary runs might occur even if the tables are not updated/modified

In a bid to overcome the disadvantages of table statistics of open source described above, Qubole has decided to provide automated statistics collection into the QDS platform as a service.

Understanding Qubole Statistics Collection Service¶

These are the requirements for using the collection service:

Use the ANALYZE COMPUTE STATISTICS statement in Apache Hive to collect statistics.
Trigger ANALYZE statements for DML and DDL statements that create tables or insert data on any query engine.
ANALYZE statements must be transparent and not affect the performance of DML statements.
ANALYZE COMPUTE STATISTICS comes in three flavors in Apache Hive.

The three flavors of ANALYZE COMPUTE STATISTICS are described in this table.

SQL Statement Example Metadata Collected Impact on Cluster

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]

COMPUTE STATISTICS NO SCAN

Number of files and physical size in bytes

It is very fast and nearly zero impact on the cluster’s running workloads.

ANALYZE TABLE [db_name.]tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]

COMPUTE STATISTICS

Number of files, physical size in bytes, and number of rows

It scans the entire table or partition specified and has moderate impact.

ANALYZE TABLE [db_name.]tablename COMPUTE STATISTICS FOR COLUMNS Number of distinct values, NULL values, and TRUE and FALSE (BOOLEANS case). Minimum/maximum values and average/maximum length of the column. It has a significant impact.

Each flavor requires different compute resources. Qubole allows you to decide the appropriate statistics for each table. currently, you can choose the flavor at the account level. In the future, Qubole plans to allow you to configure the flavor at a table or schema level. Create a ticket with Qubole Support to enable this feature on the QDS account.

This architectural diagram shows how ANALYZE COMPUTE STATISTICS statements are triggered in QDS.

Here is the command’s computation process:

A user issues a Hive or Spark command.
If this command is a DML or DDL statement, the metastore is updated.
A custom MetastoreEventListener is triggered.
The command is run just like any other Hive command.

Qubole has seen a significant performance boost for analytic queries in Apache Hive and SparkSQL using table statistics and it is working with the Presto community to incorporate statistics in the Presto query planning as well.

Qubole has recently announced the launch of AIR (Alerts, Insights, Recommendations). Automatic Statistics Collection plays a key role in the Qubole’s autonomous data platform. Some examples of use cases within AIR are:

Viewing table statistics in the preview.
The QDS platform can give recommendations and insights on better partitioning or data organization strategies to data engineers, architects, or admins based on statistics.