Set Data Compression on Hadoop-1 Clusters
Hadoop jobs are data-intensive, so compressing data can speed up input/output operations. Compression saves storage space and makes data transfer over a network faster. The trade-off is higher CPU utilization and extra processing time, because data must be compressed when it is written and decompressed when it is read. Both the use of compression and the choice of compression format have a considerable impact on the performance of MapReduce jobs.
The compression formats and the settings used to configure each one are described below:
gzip compression format - The file extension of this compression format is .gz. This format is not splittable. The following configuration is used to set this format:
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
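As a rough sketch of how the gzip settings are applied end to end, the statements below follow the same pattern as the Snappy example later in this section. They assume the `manager` table from that example already exists; `manager_gzip` is a hypothetical table name chosen only for illustration.

```sql
-- Assumes the source table `manager` from the Snappy example already exists.
-- `manager_gzip` is a hypothetical name used only for this sketch.
DROP TABLE IF EXISTS manager_gzip;

-- Enable compressed output and select the gzip codec for this session.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Copy the table definition, then rewrite the data with compression on.
CREATE TABLE manager_gzip LIKE manager;
INSERT OVERWRITE TABLE manager_gzip SELECT * FROM manager;

-- Hive decompresses the files transparently when the table is read.
SELECT * FROM manager_gzip LIMIT 3;
```

Because gzip is not splittable, each resulting .gz file is typically read by a single map task, so this format tends to suit output made up of many moderately sized files rather than a few very large ones.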
bzip2 compression format - The file extension of this compression format is .bz2. This format is splittable. The following configuration is used to set this format:
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
lzo compression format - The file extension of this compression format is .lzo. This format is splittable if the compression is indexed. The following configuration is used to set this format:
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
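One way to read LZO files so that indexed files can be split is to declare the table with the LZO-aware input format from the hadoop-lzo library. This is a sketch under assumptions, not a confirmed setup for any particular cluster: it presumes hadoop-lzo is installed, that index files have already been created by a separate indexing step outside Hive (not shown here), and that the table name and `LOCATION` path, both hypothetical, are replaced with real values.

```sql
-- Sketch only: assumes the hadoop-lzo library is available on the cluster.
-- `manager_lzo_in` and the LOCATION path are hypothetical placeholders.
CREATE EXTERNAL TABLE manager_lzo_in(
  manageid string,
  yearid string,
  teamid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS
  -- LZO-aware text input format provided by hadoop-lzo.
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3n://example-bucket/csv_lzo';
```

Without index files, the same table should still be readable, but each .lzo file would be processed as a single split.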
snappy compression format - The file extension of this compression format is .snappy. This format is not splittable on its own; Snappy-compressed data can be split only when it is stored inside a container format such as a SequenceFile. The following configuration is used to set this format:
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
Example (AWS)
    DROP TABLE IF EXISTS manager;
    CREATE EXTERNAL TABLE manager(
      manageid string,
      yearid string,
      teamid string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3n://qubole-abc/csv';

    DROP TABLE IF EXISTS manager_snappy;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compression.type=BLOCK;
    CREATE TABLE manager_snappy LIKE manager;
    INSERT OVERWRITE TABLE manager_snappy SELECT * FROM manager;
    SELECT * FROM manager_snappy LIMIT 3;
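The `mapred.output.compression.type=BLOCK` setting governs how SequenceFile output is compressed, so its effect is easiest to see with a table stored as a SequenceFile. The following is a minimal sketch under that assumption; `manager_snappy_seq` is a hypothetical table name, and its column list is copied from the `manager` table in the example above.

```sql
-- Hypothetical table; columns copied from the `manager` example table.
DROP TABLE IF EXISTS manager_snappy_seq;

-- Same Snappy settings as in the example above.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

-- Store the data in a SequenceFile container instead of plain text.
CREATE TABLE manager_snappy_seq(
  manageid string,
  yearid string,
  teamid string)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE manager_snappy_seq SELECT * FROM manager;
SELECT * FROM manager_snappy_seq LIMIT 3;
```

With block compression, records are compressed in groups inside the SequenceFile, and the container's sync markers let readers split the file even though the Snappy codec itself is not splittable.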
zlib/deflate compression format - It is the default data compression format. The file extension of this compression format is .deflate. This format is not splittable. The following configuration is used to set this format:
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
Example (AWS)
    DROP TABLE IF EXISTS manager;
    CREATE EXTERNAL TABLE manager(
      manageid string,
      yearid string,
      teamid string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3n://qubole-abc/csv';

    DROP TABLE IF EXISTS manager_zlib_is_default;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
    CREATE TABLE manager_zlib_is_default LIKE manager;
    INSERT OVERWRITE TABLE manager_zlib_is_default SELECT * FROM manager;
    SELECT * FROM manager_zlib_is_default LIMIT 3;