Troubleshooting Errors and Exceptions in Hive Jobs¶

This topic provides information about the errors and exceptions that you might encounter when running Hive jobs or applications. You can resolve these errors and exceptions by following the respective workarounds.

Container memory requirement exceeds physical memory limits
GC overhead limit exceeded, causing out of memory error
Mapper or reducer job fails because no valid local directory is found
Out of Memory error when using ORC file format
Hive job fails when “lock wait timeout” is exceeded
S3 “Access denied” error while creating a Hive table

Container memory requirement exceeds physical memory limits¶

Problem Description¶

A Hive job fails, and the error message below appears in the Qubole Analyze UI logs, Mapper logs, Reducer logs, or ApplicationMaster logs:

Container [pid=18196,containerID=container_1526931816701_34273_02_000003] is running beyond physical memory limits.
Current usage: 2.2 GB of 2.2 GB physical memory used; 3.2 GB of 4.6 GB virtual memory used. Killing container.

Diagnosis¶

Three different kinds of failure can result in this error message:

Mapper failure
- The error is seen in the Analyze UI logs or the Mapper log. It occurs when the mapper uses more memory than its configured limit. The parameter mapreduce.map.memory.mb sets the mapper memory.
Reducer failure
- The error is seen in the Analyze UI logs or the Reducer log. It occurs when the reducer uses more memory than its configured limit. The parameter mapreduce.reduce.memory.mb sets the reducer memory.
ApplicationMaster failure
- The error is seen in the Analyze UI logs or the ApplicationMaster log. It occurs when the container hosting the ApplicationMaster uses more memory than it was assigned. The parameter yarn.app.mapreduce.am.resource.mb sets the memory allocated to the ApplicationMaster.

Solution¶

Mapper failure: Modify the two parameters below to increase the memory for mapper task if the mapper fails with the above error.

mapreduce.map.memory.mb: The upper memory limit that Hadoop allows to be allocated to the mapper, in megabytes.
mapreduce.map.java.opts: Sets the heap size for the mapper.

Reducer failure: Modify the two parameters below to increase the memory for the reducer task if the reducer fails with the above error.

mapreduce.reduce.memory.mb: The upper memory limit that Hadoop allows to be allocated to the reducer, in megabytes.
mapreduce.reduce.java.opts: Sets the heap size for the reducer.

ApplicationMaster failure: Modify the two parameters below to increase the memory for the ApplicationMaster if the ApplicationMaster fails with the above error.

yarn.app.mapreduce.am.resource.mb: The amount of memory the ApplicationMaster needs, in megabytes.
yarn.app.mapreduce.am.command-opts: Sets the heap size for the ApplicationMaster.

Make sure that the heap size set in yarn.app.mapreduce.am.command-opts is less than the value of yarn.app.mapreduce.am.resource.mb. Qubole recommends a heap size of around 80% of yarn.app.mapreduce.am.resource.mb.

Example: Use the set command to update the configuration property at the query level.

set yarn.app.mapreduce.am.resource.mb=3500;

set yarn.app.mapreduce.am.command-opts=-Xmx2000m;
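As a rough sketch of the 80% guideline, the shell function below prints the set statements for a given container size. The function name memory_settings and its arguments are hypothetical. Applying the same 80% ratio to mappers and reducers is an assumption of this sketch; the guideline above is stated only for the ApplicationMaster.

```shell
# Print Hive "set" statements for one role, deriving the JVM heap (-Xmx)
# as a percentage of the container memory.
# Usage: memory_settings ROLE CONTAINER_MB [HEAP_PERCENT]
#   ROLE is one of: map, reduce, am
#   HEAP_PERCENT defaults to 80 (assumed for map and reduce as well)
memory_settings() {
  local role="$1"
  local container_mb="$2"
  local percent="${3:-80}"
  # Integer arithmetic: the result is rounded down to a whole megabyte.
  local heap_mb=$(( container_mb * percent / 100 ))

  case "$role" in
    map)
      echo "set mapreduce.map.memory.mb=${container_mb};"
      echo "set mapreduce.map.java.opts=-Xmx${heap_mb}m;"
      ;;
    reduce)
      echo "set mapreduce.reduce.memory.mb=${container_mb};"
      echo "set mapreduce.reduce.java.opts=-Xmx${heap_mb}m;"
      ;;
    am)
      echo "set yarn.app.mapreduce.am.resource.mb=${container_mb};"
      echo "set yarn.app.mapreduce.am.command-opts=-Xmx${heap_mb}m;"
      ;;
    *)
      echo "unknown role: ${role}" >&2
      return 1
      ;;
  esac
}
```

For example, memory_settings am 3500 prints the two ApplicationMaster statements with a heap of -Xmx2800m. The printed statements can then be pasted at the top of the query.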

To update configs at the cluster level:

Add or update the parameters under Override Hadoop Configuration Variables in the Advanced Configuration tab in Cluster Settings and restart the cluster.
See also: MapReduce Configuration in Hadoop 2

GC overhead limit exceeded, causing out of memory error¶

Problem Description¶

A Hive job fails with an out-of-memory error “GC overhead limit exceeded,” as shown below.

java.io.IOException: org.apache.hadoop.ipc.RemoteException(java.lang.OutOfMemoryError): GC overhead limit exceeded
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:337)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:422)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:579)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:348)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:345)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)

Diagnosis¶

This out-of-memory error comes from the getJobStatus method call, which suggests that the JobHistory server is running out of memory. To confirm this, check the JobHistory server log on the master node in /media/ephemeral0/logs/mapred. If the JobHistory server is the cause, its log shows the same out-of-memory exception stack trace as above.
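One way to run this check is to search the JobHistory server log directory for the exception. The function below is a minimal sketch that assumes the logs are plain-text files. The function name find_oom_logs is hypothetical, and the log directory is passed as an argument rather than hard-coded.

```shell
# List every file under a log directory that contains an OutOfMemoryError.
# Usage: find_oom_logs LOG_DIR
# Prints one matching file path per line; returns non-zero if nothing matches.
find_oom_logs() {
  local log_dir="$1"
  # -r searches the directory recursively, -l prints only the file names.
  grep -rl "java.lang.OutOfMemoryError" "$log_dir" 2>/dev/null
}
```

Run it on the master node with the JobHistory log directory named above as the argument. An empty result suggests that the error originates somewhere other than the JobHistory server.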

The out of memory error for the JobHistory server usually happens in the following cases:

The cluster master node is too small and the JobHistory server is set to, for example, a heap size of 1 GB.
The jobs are very large, with thousands of mapper tasks running.

Solution¶

Qubole recommends that you use a larger cluster master node, with at least 60 GB RAM and a heap size of 4 GB for the JobHistory server process.
Depending on the nature of the job, even 4 GB for the JobHistory server heap size might not be sufficient. In this case, set the JobHistory server memory to a higher value, such as 8 GB, using the following bootstrap commands:

echo 'export HADOOP_JOB_HISTORYSERVER_HEAPSIZE="8192"' | sudo tee -a /etc/hadoop/mapred-env.sh > /dev/null
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver
sudo -u mapred /usr/lib/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver

Mapper or reducer job fails because no valid local directory is found¶

Problem Description¶

Mapper or reducer job fails with the following error:

Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid directory for <file_path>
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext$DirSelector.getPathForWrite(LocalDirAllocator.java:541)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:627)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:173)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:154)
at org.apache.tez.runtime.library.common.task.local.output.TezTaskOutputFiles.getInputFileForWrite(TezTaskOutputFiles.java:250)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MapOutput.createDiskMapOutput(MapOutput.java:100)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.MergeManager.reserve(MergeManager.java:404)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyMapOutput(FetcherOrderedGrouped.java:476)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:278)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:178)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:191)
at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:54)

Diagnosis¶

This error is seen either in the Analyze UI or in mapper or reducer logs.

During execution, a MapReduce task stores intermediate data in local directories. These directories are specified by the parameter mapreduce.cluster.local.dir in the file mapred-site.xml. While processing the job, the MapReduce framework goes through the directories listed in mapreduce.cluster.local.dir and checks whether any of them has enough space to create the intermediate file. If no directory has the required space, the job fails with the error shown above.
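The free-space check described above can be approximated by hand on a cluster node. The sketch below is not the framework's own logic. It assumes that the value of mapreduce.cluster.local.dir is a comma-separated list of directories, and the function name report_local_dir_space is hypothetical.

```shell
# Report the available space for each directory in a comma-separated list.
# Usage: report_local_dir_space "DIR1,DIR2,..."
report_local_dir_space() {
  # Split the list on commas and handle one directory per line.
  printf '%s\n' "$1" | tr ',' '\n' | while read -r dir; do
    [ -n "$dir" ] || continue
    if [ -d "$dir" ]; then
      # df -Pk prints POSIX-format output in 1 KB blocks;
      # column 4 of the second line is the available space.
      free_mb=$(df -Pk "$dir" | awk 'NR==2 {print int($4 / 1024)}')
      echo "${dir}: ${free_mb} MB free"
    else
      echo "${dir}: not found"
    fi
  done
}
```

Pass the value of mapreduce.cluster.local.dir from mapred-site.xml as the argument. A directory with little or no free space, or one reported as not found, is a likely cause of the failure.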

Solution¶

Ensure that there is enough space in the local directories, based on the requirements of the data to be processed.
You can compress the intermediate output files to minimize space consumption.

Parameters to be set for compression:

set mapreduce.map.output.compress = true;
set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec; -- Snappy will be used for compression

Additional Information¶

Example Zendesk ticket: https://qubole.zendesk.com/agent/tickets/22248
Internal Knowledge Base article: https://qubole.atlassian.net/browse/KB-249

Out of Memory error when using ORC file format¶

Problem Description¶

An Out of Memory error occurs while generating splits information when the ORC file format is used.

Diagnosis¶

The following log entries appear in the Analyze UI logs:

Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1098)
... 15 more
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.protobuf.ByteString.copyFrom(ByteString.java:192)
at com.google.protobuf.CodedInputStream.readBytes(CodedInputStream.java:324)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1331)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics.<init>(OrcProto.java:1281)
at org.apache.hadoop.hive.ql.io.orc.OrcProto$StringStatistics$1.parsePartialFrom(OrcProto.java:1374)

Solution¶

The Out of Memory error can occur because the default ORC split strategy (HYBRID) requires more memory. Qubole recommends setting the ORC split strategy to BI with the parameter below:

hive.exec.orc.split.strategy=BI

Hive job fails when “lock wait timeout” is exceeded¶

Problem Description¶

A Hive job fails with the following error message:

Lock wait timeout exceeded; try restarting transaction.

The timeout occurs during partitioned Insert operations.

Diagnosis¶

The following content will appear in the hive.log file:

ERROR metastore.RetryingHMSHandler (RetryingHMSHandler.java:invoke(173)) - Retrying HMSHandler after 2000 ms (attempt 9 of 10) with error:
javax.jdo.JDODataStoreException: Insert of object "org.apache.hadoop.hive.metastore.model.MPartition@74adce4e" using statement
"INSERT INTO `PARTITIONS` (`PART_ID`,`TBL_ID`,`LAST_ACCESS_TIME`,`CREATE_TIME`,`PART_NAME`,`SD_ID`) VALUES (?,?,?,?,?,?)" failed :
Lock wait timeout exceeded; try restarting transaction

This MySQL transaction timeout can happen during heavy traffic on the Hive Metastore when the RDS server is too busy.

Solution¶

Try setting a higher value for innodb_lock_wait_timeout on the MySQL side. innodb_lock_wait_timeout defines the length of time in seconds an InnoDB transaction waits for a row lock before giving up. The default value is 50 seconds.

S3 “Access denied” error while creating a Hive table¶

Problem Description¶

An S3 “Access denied” error appears when creating a Hive table.

Diagnosis¶

When server-side encryption is used for S3 buckets, the “Access denied” error message below appears when creating a Hive table:

ERROR: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:com.qubole.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied

Solution¶

Set the required server-side encryption algorithm using fs.s3a.server-side-encryption-algorithm for s3a or fs.s3n.sse for s3n. For more information, see Enabling SSE-KMS.
If the S3 “Access denied” error still appears, check the S3 bucket policy and ensure that the required permissions are defined there. For more information, see What are some examples of policies I should use to delegate access to Qubole for my Cloud accounts?.