Using the Catalog Configuration¶

A Presto catalog consists of schemas and refers to a data source through a connector. Qubole allows you to add the catalog through a simplified way by just defining its properties through the Presto overrides on the Presto cluster. You can add the catalog using the syntax below through the Presto override.

catalog/<catalog-name>.properties:
<catalog property 1>
<catalog property 2>
.
.
.
<catalog property n>

catalog/hive.properties¶

Qubole provides table-level security for Hive tables accessed through Presto. See Understanding Qubole Hive Authorization for more information. The following table describes the common hive.properties.

Parameter	Examples	Default	Description
hive.metastore-timeout	3m, 1h	3m	Timeout for Hive metastore calls that is it denotes how long a request waits to fetch data from the metastore before getting timed out.
hive.metastore-cache-ttl	5m, 20m	20m	It denotes a data entry’s life duration in the metastore cache before it is evicted. Metastore caches tables, partitions, databases, and so on that are fetched from the Hive metastore. Configuring Thrift Metastore Server Interface for the Custom Metastore (AWS) describes how to configure Hive Thrift Metastore Interface.
hive.metastore-cache-ttl-bulk	20m, 1d	NA	When you have a query that you need to run on `hive.information_schema.columns`, set this option as a Presto override. For example, `hive.metastore-cache-ttl-bulk=24h`. Enabling this option caches table entries for the configured duration, when the table info is fetched (in bulk) from the metastore. This makes fetching tables/columns through JBDC drivers faster.
hive.metastore-refresh-interval	10m, 20m	100m	It denotes the time after which a background refresh for an entry in the metastore cache is triggered. If you still see stale results, then you can see fresh results when you run the query for the second time. Suppose, if you do not set this parameter or when its value is greater than `hive.metastore-cache-ttl` and you run the query after the entry is evicted from the metastore cache, then the query brings back the evicted entry from the Hive metastore into the cache and pays this warmup time. Retrieving the info from the metastore takes more time than reading from the cache. To avoid such a scenario, If you set this parameter and if the query is run after the refresh interval has expired, then the query returns the cache entry quickly and starts a background cache refresh. So, it is useful to set the value of `hive.metastore-cache-ttl` higher than `hive.metastore-refresh-interval` to get the cached entries with higher TTL and faster refreshes.
hive.security	`allow-all`, `sql-standard`	`allow-all`	`sql-standard` enables Hive authorization. See Understanding Qubole Hive Authorization for more information.
hive.s3.multipart.min-file-size	18MB, 20MB	16MB	Minimum file size for an S3 multipart upload
hive.s3.multipart.min-part-size	8MB, 9MB	6MB	Minimum part size for an S3 multipart upload
hive.s3.sse.enabled	`true`, `false`	`false`	It is used to configure server-side encryption for data at rest on S3, by setting it to `true`. For more information, see Enabling SSE-KMS in the Presto Cluster.
hive.s3.sse.type	`KMS`, `S3`	NA	It is used to specify the type of server-side encryption when `hive.s3.sse.enabled` is set to `true`. This property is only supported on Presto 0.180 and 0.193 versions.
hive.s3.ssl.enabled	`true`, `false`	`false`	It is used to secure the communication between Amazon S3 and the Presto cluster using SSL. Set the property to `true` to enable it.
hive.skip-corrupt-records	`true`, `false`	`false`	It is used to skip corrupt records in input formats other than `orc`, `parquet` and `rcfile`. You can also set it as a session property, as `hive.skip_corrupt_records=true` in a session when the active cluster does not have this configuration globally enabled. This configuration is supported only in Presto 0.180 and later versions. Note The behavior for the corrupted file is non-deterministic that is Presto might read some part of the file before hitting corrupt data and in such a case, the QDS record reader returns whatever it read until this point and skips the rest of the file.
hive.bucket-owner-full-control	`true`, `false`	`false`	When it is enabled, the S3 bucket owner gets complete permissions over the files written into it by other users.
hive.information-schema-presto-view-only	`true`, `false`	`true`	It is enabled by default and hence, the information schema only includes the Presto views and not the Hive views. When it is set to `false`, the information schema includes both the Presto and Hive views.

Using RubiX on a Presto Cluster¶

The latest supported version of RubiX on QDS is 0.2.11. For more information on RubiX, see Qubole’s blog on RubiX.

To use RubiX, select Enable Rubix in the Presto Cluster UI configuration page. (See the figure earlier on this page). Selecting Enable Rubix automatically sets the Rubix configuration in the Presto cluster to cache data. Here is a sample RubiX configuration.

catalog/hive.properties:
hadoop.cache.data.enabled=true
hadoop.cache.data.table.whitelist=.*
hadoop.cache.data.table.columns.min=0

Note

Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop Overrides would automatically set them in the Presto RubiX client.

See Understanding File Cache and RubiX Configuration for more information on the configuration.

Understanding File Cache and RubiX Configuration¶

Note

Qubole supports RubiX on Hadoop 2 clusters. See Configuring RubiX on Hadoop 2 Clusters for the configuration options.

The following table describes the configurations for File Cache and RubiX.

Parameter	Examples	Default	Description
hadoop.cache.data.enabled	true, false	false	Enable SSD cache
hadoop.cache.data.table.whitelist	..store.	NA	Regex whitelisting tables to cache
hadoop.cache.data.table.columns.min	0, 1, 2	3	Minimum query columns to cache file
hadoop.cache.data.use-block-cache	true, false	true	This configuration is effective only when hadoop.cache.data.enabled is set to `true`. This configuration uses Rubix for caching when it is set to `true` and uses File Cache when it is set to `false`.

The following table describes the configurations for File Cache.

Note

Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop Overrides would automatically set them in the Presto RubiX client.

Parameter	Examples	Default	Description
hadoop.cache.data.maxage	1h, 1d	3650d	Cache files younger than interval
hadoop.cache.data.expiration	1h, 1d	3650d	Expire cached file after interval
hadoop.cache.data.fullness.percentage	80, 90	95	Percentage % of SSD disk to fill up
hadoop.cache.fileinfo.enabled	true, false	true	Cache file size, timestamp
hadoop.cache.fileinfo.expiration	5m, 1h	5m	Expire fileinfo cache entry
hadoop.cache.dirinfo.enabled	true, false	true	Cache directory listing
hadoop.cache.dirinfo.expiration	5m, 1h	5m	Expiration directory listing from cache
hadoop.cache.data.dirprefix.list	/media/ephemeral, /media/ebs	/media/ephemeral	Prefixes for paths of directories used to store cached data. Final paths created by appending suffix in range [0, 25].

The following table describes the configurations for a RubiX Cache Manager.

Note

The configuration for a RubiX Cache Manager can be added as Hadoop cluster overrides on the Clusters page of the QDS UI. The syntax is:

<hadoop.cache.*>=<value>

For more information about the Override Hadoop Configuration Variables field, see Advanced configuration: Modifying Hadoop Cluster Settings or the screenshot earlier on this page.

Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop overrides automatically sets them in the Presto RubiX client.

Parameter	Examples	Default	Description
hadoop.cache.data.expiration.after-write	400, 350	300	The duration in seconds after which metadata entry of data written to a file is removed. This does not remove backup data.
hadoop.cache.data.fullness.percentage	85, 90	80	Percentage of total disk space to use for caching and backing files are deleted in an LRU way.
hadoop.cache.data.dirprefix.list	/media/ephemeral, /media/ebs	/media/ephemeral	Prefix location of directories to use as a destination for cache. The final destination created by appending numbers from 1 to 5 to these prefixes.
hadoop.cache.data.block-size	1100000	1048576	Block size in which each S3 file is logically divided. Larger the block size, more the additional data is cached.
hadoop.cache.data.bookkeeper.port	9500	8899	The port on which the Cache BookKeeper Server is started.
hadoop.cache.data.bookkeeper.max-threads	600	500	Maximum number of threads in a Cache BookKeeper Server