Set Data Compression on Hadoop-1 Clusters

Hadoop jobs are data-intensive, so compressing data can speed up input/output operations. Compression saves storage space and makes data transfer over the network faster. The trade-off is higher CPU utilization and extra processing time, because data must be compressed and decompressed. Whether data is compressed, and which format is used, has a considerable impact on the performance of MapReduce jobs.

Configure the various data compression formats as described below:

gzip compression format - The file extension of this compression format is .gz. This format is not splittable. The following configuration is used to set this format:
```
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```
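As an illustration, the gzip settings can be applied when writing a table, following the same pattern as the Snappy example on this page. This is a minimal sketch: it assumes a text-format table named `manager`, such as the one created in the examples later on this page, and `manager_gzip` is a hypothetical table name chosen for illustration. Because gzip is not splittable, each resulting file is read by a single mapper in later jobs.

```sql
-- Sketch only: manager_gzip is a hypothetical table name.
-- Assumes the text-format table `manager` already exists.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

DROP TABLE IF EXISTS manager_gzip;
CREATE TABLE manager_gzip LIKE manager;

-- The files written for this table are expected to end in .gz.
INSERT OVERWRITE TABLE manager_gzip
SELECT * FROM manager;

SELECT * FROM manager_gzip LIMIT 3;
```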
bzip2 compression format - The file extension of this compression format is .bz2. This format is splittable. The following configuration is used to set this format:
```
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
```
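Compressed text files can also be read directly, because Hadoop's text input format selects the codec from the file extension; no codec setting is needed on the read path. The following is a minimal sketch, assuming a location that already contains comma-delimited files compressed with bzip2. The table name `manager_bz2_input` and the S3 path are hypothetical and used only for illustration.

```sql
-- Sketch only: the table name and LOCATION are hypothetical.
-- Assumes the location holds comma-delimited text files compressed
-- with bzip2 (file names ending in .bz2).
DROP TABLE IF EXISTS manager_bz2_input;
CREATE EXTERNAL TABLE manager_bz2_input (
  manageid string,
  yearid   string,
  teamid   string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3n://example-bucket/csv_bz2';

-- The codec is inferred from the .bz2 extension at read time.
SELECT * FROM manager_bz2_input LIMIT 3;
```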
lzo compression format - The file extension of this compression format is .lzo. This format is splittable if the compression is indexed. The following configuration is used to set this format:
```
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```

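snappy compression format - The file extension of this compression format is .snappy. A raw Snappy-compressed text file is not splittable; Snappy-compressed data is splittable only when it is stored in a container format such as a block-compressed SequenceFile. The compression type setting below applies to SequenceFile output. The following configuration is used to set this format: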

```
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
```

Example (AWS)

```
DROP TABLE IF EXISTS manager;
CREATE EXTERNAL TABLE manager (manageid string, yearid string, teamid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3n://qubole-abc/csv';
DROP TABLE IF EXISTS manager_snappy;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
CREATE TABLE manager_snappy LIKE manager;
INSERT OVERWRITE TABLE manager_snappy
SELECT * FROM manager;
SELECT * FROM manager_snappy LIMIT 3;
```
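
The BLOCK compression type applies to SequenceFile output, so one way to see it take effect is to store the target table as a SequenceFile. The following is a sketch under that assumption and is not part of the example above; `manager_snappy_seq` is a hypothetical table name, and the `manager` table is assumed to exist as created above.

```sql
-- Sketch only: manager_snappy_seq is a hypothetical table name.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;

DROP TABLE IF EXISTS manager_snappy_seq;
CREATE TABLE manager_snappy_seq (
  manageid string,
  yearid   string,
  teamid   string
)
STORED AS SEQUENCEFILE;

-- Rows are written into Snappy-compressed blocks inside the SequenceFile.
INSERT OVERWRITE TABLE manager_snappy_seq
SELECT * FROM manager;

SELECT * FROM manager_snappy_seq LIMIT 3;
```

A block-compressed SequenceFile contains sync markers between blocks, so it can be split across mappers even though the Snappy codec itself is not splittable.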

zlib/deflate compression format - It is the default data compression format. The file extension of this compression format is .deflate. This format is not splittable. The following configuration is used to set this format:

```
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
```

Example (AWS)

```
DROP TABLE IF EXISTS manager;
CREATE EXTERNAL TABLE manager (manageid string, yearid string, teamid string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3n://qubole-abc/csv';
DROP TABLE IF EXISTS manager_zlib_is_default;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
CREATE TABLE manager_zlib_is_default LIKE manager;
INSERT OVERWRITE TABLE manager_zlib_is_default
SELECT * FROM manager;
SELECT * FROM manager_zlib_is_default LIMIT 3;
```
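
To check which settings are active and where the compressed files were written, the session values and the table metadata can be inspected. This is a minimal sketch that uses the table from the example above; the exact output layout depends on the Hive version.

```sql
-- SET with a property name and no value prints the current setting.
SET hive.exec.compress.output;
SET mapred.output.compression.codec;

-- Show table metadata, including the Location of the data files.
-- Files written with DefaultCodec are expected to end in .deflate.
DESCRIBE FORMATTED manager_zlib_is_default;
```

Listing the reported location should then show files with the extension that matches the configured codec.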