Using the Catalog Configuration¶
A Presto catalog consists of schemas and refers to a data source through a connector. Qubole allows you to add the catalog through a simplified way by just defining its properties through the Presto overrides on the Presto cluster. You can add the catalog using the syntax below through the Presto override.
catalog/<catalog-name>.properties:
<catalog property 1>
<catalog property 2>
.
.
.
<catalog property n>
catalog/hive.properties¶
Qubole provides table-level security for Hive tables accessed through Presto. See Understanding Qubole Hive Authorization for more information. The following table describes the common hive.properties.
| Parameter | Examples | Default | Description |
|---|---|---|---|
| hive.metastore-timeout | 3m, 1h | 3m | Timeout for Hive metastore calls that is it denotes how long a request waits to fetch data from the metastore before getting timed out. |
| hive.metastore-cache-ttl | 5m, 20m | 20m | It denotes a data entry’s life duration in the metastore cache before it is evicted. Metastore caches tables, partitions, databases, and so on that are fetched from the Hive metastore. Configuring Thrift Metastore Server Interface for the Custom Metastore (AWS) describes how to configure Hive Thrift Metastore Interface. |
| hive.metastore-cache-ttl-bulk | 20m, 1d | NA | When you have a query that you need to run on
hive.information_schema.columns, set this option
as a Presto override. For example,
hive.metastore-cache-ttl-bulk=24h.
Enabling this option caches table entries for the
configured duration, when the table info is fetched
(in bulk) from the metastore. This makes fetching
tables/columns through JBDC drivers faster. |
| hive.metastore-refresh-interval | 10m, 20m | 100m | It denotes the time after which a background refresh
for an entry in the metastore cache is triggered. If
you still see stale results, then you can see fresh
results when you run the query for the
second time. Suppose, if you do not set this parameter
or when its value is greater than
To avoid such a scenario, If you set this parameter
and if the query is run after the refresh interval
has expired, then the query returns the cache
entry quickly and starts a background cache refresh.
So, it is useful to set the value of
|
| hive.security | allow-all, sql-standard |
allow-all |
sql-standard enables Hive
authorization. See Understanding Qubole Hive Authorization for more
information. |
| hive.s3.multipart.min-file-size | 18MB, 20MB | 16MB | Minimum file size for an S3 multipart upload |
| hive.s3.multipart.min-part-size | 8MB, 9MB | 6MB | Minimum part size for an S3 multipart upload |
| hive.s3.sse.enabled | true, false |
false |
It is used to configure server-side encryption for
data at rest on S3, by setting it to true. For
more information, see
Enabling SSE-KMS in the Presto Cluster. |
| hive.s3.sse.type | KMS, S3 |
NA | It is used to specify the type of server-side
encryption when hive.s3.sse.enabled
is set to true. This property is only supported
on Presto 0.180 and 0.193 versions. |
| hive.s3.ssl.enabled | true, false |
false |
It is used to secure the
communication between Amazon S3 and
the Presto cluster using SSL. Set the
property to true to enable it. |
| hive.skip-corrupt-records | true, false |
false |
It is used to skip corrupt records in input formats
other than Note The behavior for the corrupted file is non-deterministic that is Presto might read some part of the file before hitting corrupt data and in such a case, the QDS record reader returns whatever it read until this point and skips the rest of the file. |
| hive.bucket-owner-full-control | true, false |
false |
When it is enabled, the S3 bucket owner gets complete permissions over the files written into it by other users. |
| hive.information-schema-presto-view-only | true, false |
true |
It is enabled by default and hence, the information
schema only includes the Presto views and not the Hive
views. When it is set to false, the information
schema includes both the Presto and Hive views. |
Using RubiX on a Presto Cluster¶
The latest supported version of RubiX on QDS is 0.2.11. For more information on RubiX, see Qubole’s blog on RubiX.
To use RubiX, select Enable Rubix in the Presto Cluster UI configuration page. (See the figure earlier on this page). Selecting Enable Rubix automatically sets the Rubix configuration in the Presto cluster to cache data. Here is a sample RubiX configuration.
catalog/hive.properties: hadoop.cache.data.enabled=true hadoop.cache.data.table.whitelist=.* hadoop.cache.data.table.columns.min=0
Note
Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop Overrides
would automatically set them in the Presto RubiX client.
See Understanding File Cache and RubiX Configuration for more information on the configuration.
Understanding File Cache and RubiX Configuration¶
Note
Qubole supports RubiX on Hadoop 2 clusters. See Configuring RubiX on Hadoop 2 Clusters for the configuration options.
The following table describes the configurations for File Cache and RubiX.
| Parameter | Examples | Default | Description |
|---|---|---|---|
| hadoop.cache.data.enabled | true, false | false | Enable SSD cache |
| hadoop.cache.data.table.whitelist | .*.store.* | NA | Regex whitelisting tables to cache |
| hadoop.cache.data.table.columns.min | 0, 1, 2 | 3 | Minimum query columns to cache file |
| hadoop.cache.data.use-block-cache | true, false | true | This configuration is effective only
when hadoop.cache.data.enabled is set
to true. This configuration uses
Rubix for caching when it is set to
true and uses File Cache when it
is set to false. |
The following table describes the configurations for File Cache.
Note
Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop Overrides
would automatically set them in the Presto RubiX client.
| Parameter | Examples | Default | Description |
|---|---|---|---|
| hadoop.cache.data.maxage | 1h, 1d | 3650d | Cache files younger than interval |
| hadoop.cache.data.expiration | 1h, 1d | 3650d | Expire cached file after interval |
| hadoop.cache.data.fullness.percentage | 80, 90 | 95 | Percentage % of SSD disk to fill up |
| hadoop.cache.fileinfo.enabled | true, false | true | Cache file size, timestamp |
| hadoop.cache.fileinfo.expiration | 5m, 1h | 5m | Expire fileinfo cache entry |
| hadoop.cache.dirinfo.enabled | true, false | true | Cache directory listing |
| hadoop.cache.dirinfo.expiration | 5m, 1h | 5m | Expiration directory listing from cache |
| hadoop.cache.data.dirprefix.list | /media/ephemeral, /media/ebs | /media/ephemeral | Prefixes for paths of directories used to store cached data. Final paths created by appending suffix in range [0, 25]. |
The following table describes the configurations for a RubiX Cache Manager.
Note
The configuration for a RubiX Cache Manager can be added as Hadoop cluster overrides on the Clusters page of the QDS UI. The syntax is:
<hadoop.cache.*>=<value>
For more information about the Override Hadoop Configuration Variables field, see Advanced configuration: Modifying Hadoop Cluster Settings or the screenshot earlier on this page.
Setting hadoop.cache.data.dirprefix.list and hadoop.cache.data.block-size as Hadoop overrides
automatically sets them in the Presto RubiX client.
| Parameter | Examples | Default | Description |
|---|---|---|---|
| hadoop.cache.data.expiration.after-write | 400, 350 | 300 | The duration in seconds after which metadata entry of data written to a file is removed. This does not remove backup data. |
| hadoop.cache.data.fullness.percentage | 85, 90 | 80 | Percentage of total disk space to use for caching and backing files are deleted in an LRU way. |
| hadoop.cache.data.dirprefix.list | /media/ephemeral, /media/ebs | /media/ephemeral | Prefix location of directories to use as a destination for cache. The final destination created by appending numbers from 1 to 5 to these prefixes. |
| hadoop.cache.data.block-size | 1100000 | 1048576 | Block size in which each S3 file is logically divided. Larger the block size, more the additional data is cached. |
| hadoop.cache.data.bookkeeper.port | 9500 | 8899 | The port on which the Cache BookKeeper Server is started. |
| hadoop.cache.data.bookkeeper.max-threads | 600 | 500 | Maximum number of threads in a Cache BookKeeper Server |