Using the Catalog Configuration

A Presto catalog consists of schemas and refers to a data source through a connector. Qubole allows you to add the catalog through a simplified way by just defining its properties through the Presto overrides on the Presto cluster. You can add the catalog using the syntax below through the Presto override.

catalog/<catalog-name>.properties:
<catalog property 1>
<catalog property 2>
.
.
.
<catalog property n>

catalog/hive.properties

Qubole provides table-level security for Hive tables accessed through Presto. See Understanding Qubole Hive Authorization for more information. The following table describes the common hive.properties.

Parameter Examples Default Description
hive.metastore-timeout 3m, 1h 3m Timeout for Hive metastore calls that is it denotes how long a request waits to fetch data from the metastore before getting timed out.
hive.metastore-cache-ttl 5m, 20m 20m It denotes a data entry’s life duration in the metastore cache before it is evicted. Metastore caches tables, partitions, databases, and so on that are fetched from the Hive metastore. Configuring Thrift Metastore Server Interface for the Custom Metastore (AWS) describes how to configure Hive Thrift Metastore Interface.
hive.metastore-cache-ttl-bulk 20m, 1d NA When you have a query that you need to run on hive.information_schema.columns, set this option as a Presto override. For example, hive.metastore-cache-ttl-bulk=24h. Enabling this option caches table entries for the configured duration, when the table info is fetched (in bulk) from the metastore. This makes fetching tables/columns through JBDC drivers faster.
hive.metastore-refresh-interval 10m, 20m 100m

It denotes the time after which a background refresh for an entry in the metastore cache is triggered. If you still see stale results, then you can see fresh results when you run the query for the second time. Suppose, if you do not set this parameter or when its value is greater than hive.metastore-cache-ttl and you run the query after the entry is evicted from the metastore cache, then the query brings back the evicted entry from the Hive metastore into the cache and pays this warmup time. Retrieving the info from the metastore takes more time than reading from the cache.

To avoid such a scenario, If you set this parameter and if the query is run after the refresh interval has expired, then the query returns the cache entry quickly and starts a background cache refresh. So, it is useful to set the value of hive.metastore-cache-ttl higher than hive.metastore-refresh-interval to get the cached entries with higher TTL and faster refreshes.

hive.security allow-all, sql-standard allow-all sql-standard enables Hive authorization. See Understanding Qubole Hive Authorization for more information.
hive.s3.multipart.min-file-size 18MB, 20MB 16MB Minimum file size for an S3 multipart upload
hive.s3.multipart.min-part-size 8MB, 9MB 6MB Minimum part size for an S3 multipart upload
hive.s3.sse.enabled true, false false It is used to configure server-side encryption for data at rest on S3, by setting it to true. For more information, see Enabling SSE-KMS in the Presto Cluster.
hive.s3.sse.type KMS, S3 NA It is used to specify the type of server-side encryption when hive.s3.sse.enabled is set to true. This property is only supported on Presto 0.180 and 0.193 versions.
hive.s3.ssl.enabled true, false false It is used to secure the communication between Amazon S3 and the Presto cluster using SSL. Set the property to true to enable it.
hive.skip-corrupt-records true, false false

It is used to skip corrupt records in input formats other than orc, parquet and rcfile. You can also set it as a session property, as hive.skip_corrupt_records=true in a session when the active cluster does not have this configuration globally enabled. This configuration is supported only in Presto 0.180 and later versions.

Note

The behavior for the corrupted file is non-deterministic that is Presto might read some part of the file before hitting corrupt data and in such a case, the QDS record reader returns whatever it read until this point and skips the rest of the file.

hive.bucket-owner-full-control true, false false When it is enabled, the S3 bucket owner gets complete permissions over the files written into it by other users.
hive.information-schema-presto-view-only true, false true It is enabled by default and hence, the information schema only includes the Presto views and not the Hive views. When it is set to false, the information schema includes both the Presto and Hive views.

Using RubiX on a Presto Cluster

The latest supported version of RubiX on QDS is 0.2.11. For more information on RubiX, see Qubole’s blog on RubiX.

To use RubiX, select Enable Rubix in the Presto Cluster UI configuration page. (See the figure earlier on this page). Selecting Enable Rubix automatically sets the Rubix configuration in the Presto cluster to cache data. Here is a sample RubiX configuration.

catalog/hive.properties:
hadoop.cache.data.enabled=true
hadoop.cache.data.table.whitelist=.*
hadoop.cache.data.table.columns.min=0

Note

Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop Overrides would automatically set them in the Presto RubiX client.

See Understanding File Cache and RubiX Configuration for more information on the configuration.

Understanding File Cache and RubiX Configuration

Note

Qubole supports RubiX on Hadoop 2 clusters. See Configuring RubiX on Hadoop 2 Clusters for the configuration options.

The following table describes the configurations for File Cache and RubiX.

Parameter Examples Default Description
hadoop.cache.data.enabled true, false false Enable SSD cache
hadoop.cache.data.table.whitelist .*.store.* NA Regex whitelisting tables to cache
hadoop.cache.data.table.columns.min 0, 1, 2 3 Minimum query columns to cache file
hadoop.cache.data.use-block-cache true, false true This configuration is effective only when hadoop.cache.data.enabled is set to true. This configuration uses Rubix for caching when it is set to true and uses File Cache when it is set to false.

The following table describes the configurations for File Cache.

Note

Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop Overrides would automatically set them in the Presto RubiX client.

Parameter Examples Default Description
hadoop.cache.data.maxage 1h, 1d 3650d Cache files younger than interval
hadoop.cache.data.expiration 1h, 1d 3650d Expire cached file after interval
hadoop.cache.data.fullness.percentage 80, 90 95 Percentage % of SSD disk to fill up
hadoop.cache.fileinfo.enabled true, false true Cache file size, timestamp
hadoop.cache.fileinfo.expiration 5m, 1h 5m Expire fileinfo cache entry
hadoop.cache.dirinfo.enabled true, false true Cache directory listing
hadoop.cache.dirinfo.expiration 5m, 1h 5m Expiration directory listing from cache
hadoop.cache.data.dirprefix.list /media/ephemeral, /media/ebs /media/ephemeral Prefixes for paths of directories used to store cached data. Final paths created by appending suffix in range [0, 25].

The following table describes the configurations for a RubiX Cache Manager.

Note

The configuration for a RubiX Cache Manager can be added as Hadoop cluster overrides on the Clusters page of the QDS UI. The syntax is:

<hadoop.cache.*>=<value>

For more information about the Override Hadoop Configuration Variables field, see Advanced configuration: Modifying Hadoop Cluster Settings or the screenshot earlier on this page.

Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop overrides automatically sets them in the Presto RubiX client.

Parameter Examples Default Description
hadoop.cache.data.expiration.after-write 400, 350 300 The duration in seconds after which metadata entry of data written to a file is removed. This does not remove backup data.
hadoop.cache.data.fullness.percentage 85, 90 80 Percentage of total disk space to use for caching and backing files are deleted in an LRU way.
hadoop.cache.data.dirprefix.list /media/ephemeral, /media/ebs /media/ephemeral Prefix location of directories to use as a destination for cache. The final destination created by appending numbers from 1 to 5 to these prefixes.
hadoop.cache.data.block-size 1100000 1048576 Block size in which each S3 file is logically divided. Larger the block size, more the additional data is cached.
hadoop.cache.data.bookkeeper.port 9500 8899 The port on which the Cache BookKeeper Server is started.
hadoop.cache.data.bookkeeper.max-threads 600 500 Maximum number of threads in a Cache BookKeeper Server