Map / Reduce Tasks are failing extensively. Task Id : attempt_*_*_000001_0, Status : FAILED - mapreduce

I am new to Hadoop. My laptop has 32 GB of RAM and a Core i5 with 4 cores. On it I have created a multi-node (3 data node) Apache Hadoop 2.7.4 cluster using virtual machines, assigning 8 GB of RAM and 2 CPU cores to each data node and to the resource manager VM. When I run the MapReduce example jobs on the name node, the job fails almost every time because map tasks or reduce tasks fail.
I didn't see any specific error in the logs, but I noticed that all map and reduce task containers are requested on the same data node; if that fails a couple of times, the application master selects another node for the available containers.
Is there any way to assign containers to data nodes in a round-robin fashion?
Any help would be appreciated.
Output:
hduser#NameNode:/opt/hadoop/etc/hadoop$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar pi 2 4
Number of Maps = 2
Samples per Map = 4
Wrote input for Map #0
Wrote input for Map #1
Starting Job
.....
17/11/02 12:53:33 INFO mapreduce.Job: Running job: job_1509607315241_0001
17/11/02 12:53:40 INFO mapreduce.Job: Job job_1509607315241_0001 running in uber mode : false
17/11/02 12:53:40 INFO mapreduce.Job: map 0% reduce 0%
17/11/02 12:53:55 INFO mapreduce.Job: Task Id : attempt_1509607315241_0001_m_000001_0, Status : FAILED
17/11/02 12:53:55 INFO mapreduce.Job: Task Id : attempt_1509607315241_0001_m_000000_0, Status : FAILED
17/11/02 12:54:01 INFO mapreduce.Job: map 50% reduce 0%
17/11/02 12:54:09 INFO mapreduce.Job: Task Id : attempt_1509607315241_0001_m_000001_1, Status : FAILED
17/11/02 12:54:14 INFO mapreduce.Job: Task Id : attempt_1509607315241_0001_r_000000_0, Status : FAILED
17/11/02 12:54:24 INFO mapreduce.Job: Task Id : attempt_1509607315241_0001_m_000001_2, Status : FAILED
17/11/02 12:54:30 INFO mapreduce.Job: Task Id : attempt_1509607315241_0001_r_000000_1, Status : FAILED
17/11/02 12:54:40 INFO mapreduce.Job: map 100% reduce 100%
17/11/02 12:54:44 INFO mapreduce.Job: Job job_1509607315241_0001 failed with state FAILED due to: Task failed task_1509607315241_0001_m_000001
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.10.109</value>
<description> The hostname of the machine the resource manager runs on. </description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>A list of auxiliary services run by the node manager. A service is implemented by the class defined by the property yarn.nodemanager.auxservices.servicename.class. By default, no auxiliary services are specified. </description>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
<description> </description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>7096</value>
<description>The amount of physical memory (in MB) that may be allocated to containers being run by the node manager </description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>6196</value>
<description>RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb" and not exceed "yarn.scheduler.maximum-allocation-mb" and It should not be more then total allocated memory of the Node. </description>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>6000</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>RM can only allocate memory to containers in increments of "yarn.scheduler.minimum-allocation-mb" and not exceed "yarn.scheduler.maximum-allocation-mb" and It should not be more then total allocated memory of the Node. </description>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx2048m</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
<description>The number of CPU cores that may be allocated to containers being run by the node manager.</description>
</property>
<property>
<name>yarn.resourcemanager.bind-host</name>
<value>192.168.10.109</value>
<description> The address the resource manager’s RPC and HTTP servers will bind to.</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.10.109:8032</value>
<description>The hostname and port that the resource manager’s RPCserver runs on. </description>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>192.168.10.109:8033</value>
<description>The resource manager’s admin RPC server address and port. This is used by the admin client (invoked with yarn rmadmin, typically run outside the cluster) to communicate with the resource manager. </description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>192.168.10.109:8030</value>
<description>The resource manager scheduler’s RPC server address and port. This is used by (in-cluster) application masters to communicate with the resource manager.</description>
</property>
<property>
<name>yarn.resourcemanager.resourcetracker.address</name>
<value>192.168.10.109:8031</value>
<description>The resource manager resource tracker’s RPC server address and port. This is used by (incluster) node managers to communicate with the resource manager. </description>
</property>
<property>
<name>yarn.nodemanager.hostname</name>
<value>0.0.0.0</value>
<description>The hostname of the machine the node manager runs on. </description>
</property>
<property>
<name>yarn.nodemanager.bind-host</name>
<value>0.0.0.0</value>
<description>The address the node manager’s RPC and HTTP servers will bind to. </description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/opt/hadoop/hdfs/yarn</value>
<description>A list of directories where nodemanagers allow containers to store intermediate data. The data is cleared out when the application ends.</description>
</property>
<property>
<name>yarn.nodemanager.address</name>
<value>0.0.0.0:8050</value>
<description>The node manager’s RPC server address and port. This is used by (in-cluster) application masters to communicate with node managers.</description>
</property>
<property>
<name>yarn.nodemanager.localizer.address</name>
<value>0.0.0.0:8040</value>
<description>The node manager localizer’s RPC server address and port. </description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>192.168.10.109:8088</value>
<description> The resource manager’s HTTP server address and port.</description>
</property>
<property>
<name>yarn.nodemanager.webapp.address</name>
<value>0.0.0.0:8042</value>
<description>The node manager’s HTTP server address and port. </description>
</property>
<property>
<name>yarn.web-proxy.address</name>
<value>192.168.10.109:9046</value>
<description>The web app proxy server’s HTTP server address and port. If not set (the default), then the web app proxy server will run in the resource manager process. MapReduce ApplicationMaster REST APIs are accessed using a proxy server, that is, Web Application Proxy server. Proxy server is an optional service in YARN. An administrator can configure the
service to run on a particular host or on the ResourceManager itself (stand-alone mode). If the proxy server is not configured, then it runs as a part of the ResourceManager service. By default, REST calls could be made to the web address port of
ResourceManager 8088. </description>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Default framework to run.</description>
</property>
<!-- <property>
<name>mapreduce.jobtracker.address</name>
<value>localhost:54311</value>
<description>MapReduce job tracker runs at this host and port.</description>
</property> -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>192.168.10.109:19888</value>
<description>The MapReduce job history server’s addressand port.</description>
</property>
<property>
<name>mapreduce.shuffle.port</name>
<value>13562</value>
<description>The shuffle handler’s HTTP port number.This is used for serving map outputs, and is not a user-accessible web UI.</description>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>192.168.10.109:10020</value>
<description>The job history server’s RPC server address and port. This is used by the client (typically outside the cluster) to query job history.</description>
</property>
<property>
<name>mapreduce.jobhistory.bind-host</name>
<value>192.168.10.109</value>
<description>Setting all of these values to 0.0.0.0 as in the example above will cause the MapReduce daemons to listen on all addresses and interfaces of the hosts in the cluster.</description>
</property>
<property>
<name>mapreduce.job.userhistorylocation</name>
<value>/opt/hadoop/hdfs/mrjobhistory</value>
<description>User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in output directory. The files are stored in "_logs/history/" in the directory. User can stop logging by giving the value "none".</description>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/opt/hadoop/hdfs/mrjobhistory/tmp</value>
<description>Directory where history files are written by MapReduce jobs.</description>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/opt/hadoop/hdfs/mrjobhistory/done</value>
<description>Directory where history files are managed by the MR JobHistory Server.</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>1</value>
<description> The number of virtual cores to request from the scheduler for each map task.</description>
</property>
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>1</value>
<description> The number of virtual cores to request from the scheduler for each reduce task.</description>
</property>
<property>
<name>mapreduce.task.timeout</name>
<value>1800000</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1555m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2048m</value>
</property>
<property>
<name>mapreduce.job.running.map.limit</name>
<value>2</value>
<description> The maximum number of simultaneous map tasks per job. There is no limit if this value is 0 or negative.</description>
</property>
<property>
<name>mapreduce.job.running.reduce.limit</name>
<value>1</value>
<description> The maximum number of simultaneous reduce tasks per job. There is no limit if this value is 0 or negative.</description>
</property>
<property>
<name>mapreduce.reduce.shuffle.connect.timeout</name>
<value>1800000</value>
<description>Expert: The maximum amount of time (in milli seconds) reduce task spends in trying to connect to a tasktracker for getting map output.</description>
</property>
<property>
<name>mapreduce.reduce.shuffle.read.timeout</name>
<value>1800000</value>
<description>Expert: The maximum amount of time (in milli seconds) reduce task waits for map output data to be available for reading after obtaining connection.</description>
</property>
<!--
<property>
<name>mapreduce.job.reducer.preempt.delay.sec</name>
<value>300</value>
<description> The threshold (in seconds) after which an unsatisfied mapper request triggers reducer preemption when there is no anticipated headroom. If set to 0 or a negative value, the reducer is preempted as soon as lack of headroom is detected. Default is 0.</description>
</property>
<property>
<name>mapreduce.job.reducer.unconditional-preempt.delay.sec</name>
<value>400</value>
<description> The threshold (in seconds) after which an unsatisfied mapper request triggers a forced reducer preemption irrespective of the anticipated headroom. By default, it is set to 5 mins. Setting it to 0 leads to immediate reducer preemption. Setting to -1 disables this preemption altogether.</description>
</property>
-->
</configuration>

The problem was with the /etc/hosts file on the data nodes. You have to comment out the line that maps the hostname to its loopback address. I traced this error to the following line in the logs:
INFO [main] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Instantiated MRClientService at DN1/127.0.1.1:54483
Before:
127.0.1.1 DN1
192.168.10.104 dn1
After:
# 127.0.1.1 DN1
192.168.10.104 DN1
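To double-check the fix on each data node, a tiny Java sketch like the following should now print the node's LAN address (e.g. 192.168.10.104) rather than the loopback address:
import java.net.InetAddress;

public class CheckHostResolution {
    public static void main(String[] args) throws Exception {
        // After the /etc/hosts fix this should print e.g. "DN1 -> 192.168.10.104", not 127.0.1.1
        InetAddress local = InetAddress.getLocalHost();
        System.out.println(local.getHostName() + " -> " + local.getHostAddress());
    }
}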

I suggest adding the following properties to mapred-site.xml:
<property>
<name>mapreduce.map.maxattempts</name>
<value>20</value>
</property>
<property>
<name>mapreduce.reduce.maxattempts</name>
<value>20</value>
</property>

Related

How to impersonate using HBase API and hbase-site.xml for BigTable Connection

We are connecting to BigTable using the HBase API, and we are using hbase-site.xml.
Is there any way we can use impersonation with the HBase API to connect to BigTable?
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
<property>
<name>hbase.client.connection.impl</name>
<value>com.google.cloud.bigtable.hbase1_x.BigtableConnection</value>
</property>
<property>
<name>google.bigtable.project.id</name>
<value></value>
</property>
<property>
<name>google.bigtable.instance.id</name>
<value></value>
</property>
<property>
<name>google.bigtable.auth.json.keyfile</name>
<value></value>
</property>
</configuration>
The source code (the BigTable implementation of the HBase API, i.e. com.google.cloud.bigtable.hbase1_x.BigtableConnection) doesn't have any functionality related to impersonation: https://github.com/googleapis/java-bigtable-hbase
Regarding user impersonation in HBase, it appears that it is supported through the use of an Apache Thrift server, which I think acts a bit like an upstream proxy. Per the comments in the post here, it is stated that CBT does support Thrift with this provided example (note that this should be set up on a GCE instance). This additional guide shows the process of setting up this gateway and using it for requests coming from App Engine. If I misunderstood your intention, you can come back with additional details on your use case, so that I can work on your question.
We didn't find any way to configure the impersonated user in hbase-site.xml, as we couldn't find any parameter for this in the source code: https://github.com/googleapis/java-bigtable-hbase
The best way to impersonate with the HBase API when connecting to BigTable is to create the BigTable connection using impersonation and use that connection object in the existing HBase API implementation. Here is the code snippet for getting the connection:
import java.io.FileInputStream;
import java.util.Arrays;

import com.google.auth.Credentials;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.auth.oauth2.ImpersonatedCredentials;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Connection;

public Connection getConnection() throws Exception {
    // Base credentials from a service account key file
    Credentials credentials = GoogleCredentials.fromStream(new FileInputStream("credentials_key.json"));
    // Impersonate the target service account for BigTable data access (token lifetime: 3600 s)
    ImpersonatedCredentials targetCredentials = ImpersonatedCredentials.create((GoogleCredentials) credentials,
            "your-service-account@gcp-test-project.iam.gserviceaccount.com", null,
            Arrays.asList("https://www.googleapis.com/auth/bigtable.data"), 3600);
    // Use your GCP project name and BigTable instance name
    Configuration config = BigtableConfiguration.configure("gcp-test-project", "big-table-instance");
    BigtableConfiguration.withCredentials(config, (Credentials) targetCredentials);
    return BigtableConfiguration.connect(config);
}
Using this approach, with minimal changes, one can keep using existing HBase API implementations to connect to BigTable while impersonating. Please note that if impersonation is not required (the service account behind your JSON key already has permission to read/write), then no changes are required to your existing code base. Reference: https://cloud.google.com/bigtable/docs/hbase-bigtable
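For illustration, the returned connection can then be used like any other HBase connection (a sketch; the table name and row key are placeholders, and the usual HBase client imports such as Table, TableName, Get, Result and Bytes are assumed):
// Usage sketch; "my-table" and "row-key" are placeholders
try (Connection connection = getConnection();
     Table table = connection.getTable(TableName.valueOf("my-table"))) {
    Result row = table.get(new Get(Bytes.toBytes("row-key")));
    System.out.println(row);
}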

External Shuffle service connection idle for more than 120 seconds while there are outstanding requests

I am running a Spark job on YARN. The job runs properly on Amazon EMR (1 master and 2 slaves with m4.xlarge).
I have set up similar infrastructure with the HDP 2.6 distribution on AWS EC2 machines, but the Spark job gets stuck at one particular stage, and after some time I get the following error in the container logs. The main error seems to be the shuffle service being idle.
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker#10.210.150.150:44343)
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 9, fetching them
18/06/25 07:15:31 INFO spark.MapOutputTrackerWorker: Got the output locations
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 0 ms
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Getting 5 non-empty blocks out of 1000 blocks
18/06/25 07:15:31 INFO storage.ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 4.822611 ms
18/06/25 07:15:31 INFO codegen.CodeGenerator: Code generated in 8.430244 ms
18/06/25 07:17:31 ERROR server.TransportChannelHandler: Connection to ip-10-210-150-180.********/10.210.150.180:7447 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
18/06/25 07:17:31 ERROR client.TransportResponseHandler: Still have 307 requests outstanding when connection from ip-10-210-150-180.********/10.210.150.180:7447 is closed
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 197 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
18/06/25 07:17:31 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 166 outstanding blocks after 5000 ms
18/06/25 07:17:31 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-10-210-150-180.********/10.210.150.180:7447 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:146)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:108)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:278)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:220)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:241)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:227)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
I am currently running Spark on a YARN cluster with the following spark-defaults configuration:
spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=ppv-qa12-tenant8-spark-cluster-master.periscope-solutions.local:18080
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.driver.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64
spark.driver.maxResultSize=0
spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.executor.memory=5g
spark.driver.memory=1g
spark.executor.cores=4
And I have the following set in yarn-site.xml on the node managers of the slave machines:
<configuration>
<property>
<name>yarn.application.classpath</name>
<value>/usr/hdp/current/spark2-client/aux/*,/etc/hadoop/conf,/usr/hdp/current/hadoop-client/*,/usr/hdp/current/hadoop-client/lib/*,/usr/hdp/current/hadoop-hdfs-client/*,/usr/hdp/current/hadoop-hdfs-client/lib/*,/usr/hdp/current/hadoop-yarn-client/*,/usr/hdp/current/hadoop-yarn-client/lib/*</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>spark2_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark2_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
<name>yarn.nodemanager.container-manager.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.nodemanager.localizer.client.thread-count</name>
<value>20</value>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>5</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>************</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.client.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.client.thread-count</name>
<value>64</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>32</value>
</property>
<property>
<name>yarn.scheduler.increment-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>128</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>32</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>8</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>11520</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>11520</value>
</property>
<property>
<name>yarn.nodemanager.hostname</name>
<value>*************</value>
</property>
</configuration>
Edit: Through some network debugging I found that the ephemeral port created by the container to connect to the shuffle service was actively refusing connections (telnet immediately throws an error).
On looking into kernel and system activity logs, we found the following issue in /var/log/messages:
xen_netfront: xennet: skb rides the rocket: 19 slots
This means that our AWS EC2 machines were suffering network packet loss.
The data transfer between container and shuffle service happens through RPC calls (ChunkFetchRequest, ChunkFetchSuccess and ChunkFetchFailure), and these RPC calls were being dropped by the network.
More info on this log message can be found in the following thread:
http://www.brendangregg.com/blog/2014-09-11/perf-kernel-line-tracing.html
The log message means that we were exceeding the maximum size of a packet that can be put in the driver ring buffer queue (16 slots), and those SKBs were lost.
Scatter-gather collects multiple responses and sends them as a single response, which in turn is responsible for the increase in SKB size.
So we turned off scatter-gather using the following command:
sudo ethtool -K eth0 sg off
After this there was no more packet loss.
Performance is also similar to what we used to have on EMR.

Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden) (Hadoop + S3)

I am trying to access the S3 files via Hadoop shell commands, and when I execute the command below I get this error.
What I did so far
I have installed a single-node Hadoop (hadoop-2.6.1) and added the hadoop-aws JAR and the AWS SDK JAR to the classpath as well.
Command I executed
hdfs dfs -ls s3a://s3-us-west-2.amazonaws.com/azpoc1/
Error
ubuntu#ip-172-31-2-211:~/hadoop-2.6.1$ hdfs dfs -ls s3a://s3-us-west-2.amazonaws.com/azpoc1/
-ls: Fatal internal error
com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: FC80B14D00C2FBE0; S3 Extended Request ID: TAHwxzqjMF8CD3bTnyaRGwpAgQnu0DsUFWL/E1llrXDfS+CqEMq6K735Koh7QkpSwEe8jzIOIX0=), S3 Extended Request ID: TAHwxzqjMF8CD3bTnyaRGwpAgQnu0DsUFWL/E1llrXDfS+CqEMq6K735Koh7QkpSwEe8jzIOIX0=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1632)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4365)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4312)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1270)
at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1245)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:688)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:71)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1625)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:224)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:207)
at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
My core-site.xml file
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:50000</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>*****</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>*****</value>
</property>
<property>
<name>fs.s3a.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
</configuration>
First, don't post your secret keys and access keys. This is a significant security risk.
What are the permissions associated with your IAM user? My guess is that it does not have appropriate permissions to access the bucket. I would temporarily give it too many permissions (like s3:*) and see if it works. If it does, then it's a permissions problem.
There's a whole troubleshooting S3A document to go through: start there.
There's also a diagnostics module I've put up which tries to debug connectivity problems without printing secrets: storediag. Grab the latest release and see what it says.
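If permissions are the suspect, another quick check is to list the bucket through the Hadoop FileSystem API directly (a minimal sketch; s3a://your-bucket/ is a placeholder and the credentials are assumed to come from core-site.xml as above):
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3AAccessCheck {
    public static void main(String[] args) throws Exception {
        // Picks up fs.s3a.access.key / fs.s3a.secret.key from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("s3a://your-bucket/"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}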

JEE7/JAX-RS: How to programmatically create a JDBC connection pool

I'm currently developing a REST service to replace an existing solution. I'm using plain Payara/JEE7/JAX-RS. I am not using Spring and I do not intend to.
The problem I'm facing is that we want to reuse as much of the original configuration as possible (deployment on multiple nodes in a cluster with puppet controlling the configuration files).
Usually in Glassfish/Payara, you'd have a domain.xml file that has some content like this:
<jdbc-connection-pool driver-classname="" pool-resize-quantity="10" datasource-classname="org.postgresql.ds.PGSimpleDataSource" max-pool-size="20" res-type="javax.sql.DataSource" steady-pool-size="10" description="" name="pgsqlPool">
<property name="User" value="some_user"/>
<property name="DatabaseName" value="myDatabase"/>
<property name="LogLevel" value="0"/>
<property name="Password" value="some_password"/>
<!-- bla -->
</jdbc-connection-pool>
<jdbc-resource pool-name="pgsqlPool" description="" jndi-name="jdbc/pgsql"/>
Additionally you'd have a persistence.xml file in your archive like this:
<persistence-unit name="myDatabase">
<provider>org.hibernate.ejb.HibernatePersistence</provider>
<jta-data-source>jdbc/pgsql</jta-data-source>
<properties>
<property name="hibernate.dialect" value="org.hibernate.dialect.PostgreSQLDialect"/>
<!-- bla -->
</properties>
</persistence-unit>
I need to replace both of these configuration files by a programmatic solution so I can read from the existing legacy configuration files and (if needed) create the connection pools and persistence units on the server's startup.
Do you have any idea how to accomplish that?
Actually you do not need to edit each domain.xml by hand. Just create a glassfish-resources.xml file like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE resources PUBLIC "-//GlassFish.org//DTD GlassFish Application Server 3.1 Resource Definitions//EN" "http://glassfish.org/dtds/glassfish-resources_1_5.dtd">
<resources>
<jdbc-connection-pool driver-classname="" pool-resize-quantity="10" datasource-classname="org.postgresql.ds.PGSimpleDataSource" max-pool-size="20" res-type="javax.sql.DataSource" steady-pool-size="10" description="" name="pgsqlPool">
<property name="User" value="some_user"/>
<property name="DatabaseName" value="myDatabase"/>
<property name="LogLevel" value="0"/>
<property name="Password" value="some_password"/>
<!-- bla -->
</jdbc-connection-pool>
<jdbc-resource pool-name="pgsqlPool" description="" jndi-name="jdbc/pgsql"/>
</resources>
Then either use
$PAYARA_HOME/bin/asadmin add-resources glassfish-resources.xml
on each node once, or put it under WEB-INF/ of your WAR (note: in this case the jndi-name SHOULD be java:app/jdbc/pgsql, because you do not have access to the global JNDI scope in this context).
Note that your persistence.xml should be under META-INF/ of any jar in your classpath.
If you do not like this, you may inject
@PersistenceUnit(unitName = "myDatabase")
EntityManagerFactory emf;
and create an EntityManager on the fly with
createEntityManager(java.util.Map properties).
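A minimal sketch of what that could look like (the javax.persistence.jdbc.* keys are the standard JPA override properties; whether your provider honors them for a JTA data source depends on your setup, and the values here are placeholders):
import java.util.HashMap;
import java.util.Map;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.PersistenceUnit;

public class LegacyConfigRepository {

    @PersistenceUnit(unitName = "myDatabase")
    private EntityManagerFactory emf;

    public EntityManager openEntityManager() {
        // In the scenario above these values would be read from the legacy configuration files
        Map<String, Object> props = new HashMap<>();
        props.put("javax.persistence.jdbc.user", "some_user");
        props.put("javax.persistence.jdbc.password", "some_password");
        return emf.createEntityManager(props);
    }
}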
By the way, using Payara you can share configuration with JCache across your cluster.
Since the goal is to have a dockerized server that runs a single application, I can very well use an embedded server.
Using an embedded server, the solution to my problem looks roughly like this:
For the server project, create a Maven dependency:
<dependencies>
<dependency>
<groupId>fish.payara.extras</groupId>
<artifactId>payara-embedded-all</artifactId>
<version>4.1.1.163.0.1</version>
</dependency>
</dependencies>
Start your server like this:
final BootstrapProperties bootstrapProperties = new BootstrapProperties();
final GlassFishRuntime runtime = GlassFishRuntime.bootstrap(bootstrapProperties);
final GlassFishProperties glassfishProperties = new GlassFishProperties();
final GlassFish glassfish = runtime.newGlassFish(glassfishProperties);
glassfish.start();
// Obtain a CommandRunner to issue asadmin-style commands programmatically
final CommandRunner commandRunner = glassfish.getCommandRunner();
Add your connection pools to the started instance:
final CommandResult createPoolCommandResult = commandRunner.run("create-jdbc-connection-pool",
"--datasourceclassname=org.postgresql.ds.PGConnectionPoolDataSource", "--restype=javax.sql.ConnectionPoolDataSource", //
"--property=DatabaseName=mydb"//
+ ":ServerName=127.0.0.1"//
+ ":PortNumber=5432"//
+ ":User=myUser"//
+ ":Password=myPassword"//
//other properties
, "Mydb"); //the pool name
Add a corresponding jdbc resource:
final CommandResult createResourceCommandResult = commandRunner.run("create-jdbc-resource", "--connectionpoolid=Mydb", "jdbc__Mydb");
(In the real world you would get the data from some external configuration file)
Now deploy your application:
glassfish.getDeployer().deploy(new File(pathToWarFile));
(Usually you would read your applications from some deployment directory)
In the application itself you can just refer to the configured pools like this:
@PersistenceContext(unitName = "mydb")
EntityManager mydbEm;
Done.
A glassfish-resources.xml would have been possible too, but with a catch: My configuration file is external, shared by some applications (so the file format is not mine) and created by external tools on deployment. I would need to XSLT the file to a glassfish-resources.xml file and run a script that does the "asadmin" calls.
Running an embedded server is an all-java solution that I can easily build on a CI server and my application's test suite could spin up the same embedded server build to run some integration tests.

CXF java.net.ConnectException: Connection timed out

I am getting a connection timed out when I try to invoke from a WS client a method from a CXF Web service I have deployed. Both are using custom interceptors, and the service is overloaded due to multiple invocations.
Caused by: java.net.ConnectException: ConnectException invoking http://xxx.xx.xx.xx:12005/myservice/repository?wsdl: Connection timed out
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.mapException(HTTPConduit.java:1338)
at org.apache.cxf.transport.http.HTTPConduit$WrappedOutputStream.close(HTTPConduit.java:1322)
at org.apache.cxf.transport.AbstractConduit.close(AbstractConduit.java:56)
at org.apache.cxf.transport.http.HTTPConduit.close(HTTPConduit.java:622)
at org.apache.cxf.interceptor.MessageSenderInterceptor$MessageSenderEndingInterceptor.handleMessage(MessageSenderInterceptor.java:62)
... 36 more
I tried multiple solutions to disable the timeout or to increase it, but all failed.
First, I tried to create a CXF configuration file like the following:
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:http-conf="http://cxf.apache.org/transports/http/configuration"
xsi:schemaLocation="http://cxf.apache.org/transports/http/configuration
http://cxf.apache.org/schemas/configuration/http-conf.xsd
http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans.xsd">
<http-conf:conduit name="*.http-conduit">
<http-conf:client CacheControl="no-cache"
ConnectionTimeout="0" ReceiveTimeout="0" AllowChunking="false" />
</http-conf:conduit>
</beans>
Then, I forced my application to load it by using the Java system property -Dcxf.config.file=/home/test/resources/cxf.xml
In the logs I can see that the configuration is read and thus probably applied
INFO: Loaded configuration file /home/test/resources/cxf.xml.
Unfortunately, the connection timeout still occurs.
The second solution I tried consists of setting the policy programmatically on all the clients by using the following piece of code:
public static void setHTTPPolicy(Client client) {
HTTPConduit http = (HTTPConduit) client.getConduit();
HTTPClientPolicy httpClientPolicy = new HTTPClientPolicy();
httpClientPolicy.setConnectionTimeout(0);
httpClientPolicy.setReceiveTimeout(0);
httpClientPolicy.setAsyncExecuteTimeout(0);
http.setClient(httpClientPolicy);
}
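(For reference, I apply this helper to each generated port roughly like this; the service and port names are placeholders for the wsdl2java-generated JAX-WS artifacts.)
import org.apache.cxf.endpoint.Client;
import org.apache.cxf.frontend.ClientProxy;

// "RepositoryService" / "RepositoryPort" stand in for the generated service and port types
RepositoryService service = new RepositoryService();
RepositoryPort port = service.getRepositoryPort();
// Unwrap the CXF client behind the JAX-WS proxy and apply the relaxed timeouts
Client client = ClientProxy.getClient(port);
setHTTPPolicy(client);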
but again the connection timeout occurs.
Am I missing something? Are there other timeouts to configure? Any help is welcome.
CXF allows you to configure threadpooling for your webservice endpoint. This way, you can cater for timeouts occurring as a result of scarce request processing resources. Below is a sample config using the <jaxws:endpoint/> option in cxf:
<jaxws:endpoint id="serviceBean" implementor="#referenceToServiceBeanDefinition" address="/MyEndpointAddress">
<jaxws:executor>
<bean id="threadPool" class="java.util.concurrent.ThreadPoolExecutor">
<!-- Minimum number of waiting threads in the pool -->
<constructor-arg index="0" value="2"/>
<!-- Maximum number of working threads in the pool -->
<constructor-arg index="1" value="5"/>
<!-- Maximum wait time for a thread to complete execution -->
<constructor-arg index="2" value="400000"/>
<!-- Unit of wait time -->
<constructor-arg index="3" value="#{T(java.util.concurrent.TimeUnit).MILLISECONDS}"/>
<!-- Storage data structure for waiting thread tasks -->
<constructor-arg index="4" ref="taskQueue"/>
</bean>
</jaxws:executor>
</jaxws:endpoint>
<!-- Basic data structure to temporarily hold waiting tasks-->
<bean id="taskQueue" class="java.util.concurrent.LinkedBlockingQueue"/>