I am launching an EMR cluster at runtime based on a user event, and once the job is done the cluster is terminated.
However, when the cluster is launched and the tasks are being executed, I am getting the error shown below.
I read some posts suggesting that yarn-site.xml needs to be updated on the namenode and datanodes and the YARN services restarted.
I am not sure how to configure this during the launch of the cluster itself.
org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
Container launch failed for container_1523533251407_0001_01_000002 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:390)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Thanks
Answer:
Here is what I added in my code to resolve the issue:
// Configuration here is com.amazonaws.services.elasticmapreduce.model.Configuration
Map<String, String> yarnProperties = new HashMap<String, String>();
yarnProperties.put("yarn.nodemanager.aux-services", "mapreduce_shuffle");
yarnProperties.put("yarn.nodemanager.aux-services.mapreduce_shuffle.class", "org.apache.hadoop.mapred.ShuffleHandler");
// These are yarn-site.xml properties, so the classification is "yarn-site"
Configuration yarnConfig = new Configuration()
    .withClassification("yarn-site")
    .withProperties(yarnProperties);
RunJobFlowRequest request = new RunJobFlowRequest()
    .withConfigurations(yarnConfig);
We were also setting some other properties in yarn-site.xml.
In case you are trying to create the cluster using the AWS CLI, you can use
--configurations 'json file with the config'
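For reference, the JSON file follows the standard EMR configurations format; a minimal sketch covering the two YARN properties above would be:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.aux-services": "mapreduce_shuffle",
      "yarn.nodemanager.aux-services.mapreduce_shuffle.class": "org.apache.hadoop.mapred.ShuffleHandler"
    }
  }
]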
Else if you are trying to create it through Java, for example:
Application hive = new Application().withName("Hive");
Map<String,String> hiveProperties = new HashMap<String,String>();
hiveProperties.put("hive.join.emit.interval","1000");
hiveProperties.put("hive.merge.mapfiles","true");
Configuration myHiveConfig = new Configuration()
.withClassification("hive-site")
.withProperties(hiveProperties);
Then you can reference it as:
RunJobFlowRequest request = new RunJobFlowRequest()
.withName("Create cluster with ReleaseLabel")
.withReleaseLabel("emr-5.13.0")
.withApplications(hive)
    .withConfigurations(myHiveConfig);
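Note that withConfigurations takes a varargs list of Configuration objects, so the yarn-site settings from the first snippet can be passed alongside the Hive settings; a sketch reusing the yarnConfig and myHiveConfig objects defined above:
RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("Create cluster with ReleaseLabel")
    .withReleaseLabel("emr-5.13.0")
    .withApplications(hive)
    .withConfigurations(myHiveConfig, yarnConfig);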
For the other problem:
You need to add these two properties in the above way and then create the cluster. In yarn-site.xml terms they are:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Related
I'm currently developing ETL scripts locally using the AWS Glue ETL library.
I'm facing an issue when extracting data from an S3 bucket as a DynamicFrame.
When I try to convert it to a DataFrame using toDF(), it always triggers this exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o52.toDF
...
ERROR Executor: Exception in task 5.0 in stage 3.0 (TID 29)
java.lang.IllegalStateException: Connection pool shut down
at org.apache.http.util.Asserts.check(Asserts.java:34)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:191)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:267)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
at com.amazonaws.http.conn.$Proxy15.requestConnection(Unknown Source)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1330)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5062)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5008)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1490)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:148)
at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:281)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:364)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:179)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:163)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at com.amazonaws.services.glue.readers.BufferedStream.read(DynamicRecordReader.scala:91)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.ensureLoaded(ByteSourceJsonBootstrapper.java:489)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.detectEncoding(ByteSourceJsonBootstrapper.java:126)
at com.fasterxml.jackson.core.json.ByteSourceJsonBootstrapper.constructParser(ByteSourceJsonBootstrapper.java:215)
I tried the same code on AWS Glue DevEndpoint and it works fine. Any idea how to resolve this?
Please go with Java 8; your issue will be resolved.
Check with java -version.
I had the same issue when running the dev environment with the following specs:
Scala version 2.11.12
Spark version 2.4.3
Glue 1.0.0
To fix it, add the following line to the spark configuration in $SPARK_HOME/conf/spark-defaults.conf
spark.master local
Alternatively, depending on how you are running your job, you can configure this dynamically if you are in control of the Spark context, e.g.:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
sc = SparkContext(conf=conf)
I have found this happens when running in local mode with multiple threads. Increasing fs.s3.connection.maximum or fs.s3a.connection.maximum does not fix the issue for me, although this post indicates that it should: https://kb.databricks.com/jobs/job-fails-connection-pool.html
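For anyone who wants to reproduce that attempt, the limit can be passed to the Hadoop configuration through Spark's spark.hadoop.* prefix, e.g. one more line in spark-defaults.conf (a sketch; the value 100 is arbitrary):
spark.hadoop.fs.s3a.connection.maximum 100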
I am trying to run an application from my workstation (inside Intellij) and connect to a remote Spark cluster (2.3.1) running on ec2. I know this isn't a best practice, but if I can get this to work for development it will make my life a lot easier.
I've managed to get fairly far, and I am able to run operations on RDDs and return results, until I get to a step which uses .zipWithIndex() and I get the following exception:
ERROR 2018-07-19 11:16:21,137 o.a.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /172.x.x.x:33898
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:113) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:123) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:98) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:691) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) [spark-combined-shaded-2.3.1-evg1.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
where 172.x.x.x is the (censored) local IP inside the AWS VPC of the Spark instance containing both the master and worker. I have configured the EC2 Spark instance so that it should be using its public DNS with SPARK_PUBLIC_DNS, and I use the following configuration to build my SparkContext:
SparkConf sparkConf = new SparkConf()
.setAppName("myapp")
.setMaster(System.getProperty("spark.master", "spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077"))
.set("spark.cores.max", String.valueOf(4))
.set("spark.scheduler.mode", "FAIR")
.set("spark.driver.maxResultSize", String.valueOf(maxResultSize))
.set("spark.executor.memory", "2G")
.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
.set("spark.ui.retainedStages", String.valueOf(250))
.set("spark.ui.retainedJobs", String.valueOf(250))
.set("spark.network.timeout", String.valueOf(800))
.set("spark.driver.host", "localhost")
.set("spark.driver.port", String.valueOf(23584))
.set("spark.driver.blockManager.port", String.valueOf(6578))
.set("spark.files.overwrite", "true")
;
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
jsc.addJar("my_application.jar");
And then I make an SSH tunnel with
ssh -R 23584:localhost:23584 -L 44895:localhost:44895 -R 27017:localhost:27017 -R 6578:localhost:6578 ubuntu@ec2-x-x-x-x.compute-1.amazonaws.com
so that the workers can reach back to my machine. What am I missing? Why is there still an attempt to connect to something by its internal AWS IP, which can't be reached from my machine?
Edit: When I look at the web UI I can see that the port referenced in java.io.IOException: Failed to connect to /172.x.x.x:33898 does indeed belong to an executor. How can I tell my driver to connect through the public IP rather than the private IP?
I was eventually able to solve this by setting the undocumented variable SPARK_LOCAL_HOSTNAME to the public DNS in my spark-env.sh file.
My environment configuration now looks like this:
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export SPARK_PUBLIC_DNS="ec2-xx-xxx-xxx-x.compute-1.amazonaws.com"
export SPARK_MASTER_HOST=""
export SPARK_LOCAL_HOSTNAME="ec2-xx-xxx-xxx-x.compute-1.amazonaws.com"
2018-03-08 16:36:16,775 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Downloading public rsrc:{ hdfs://mycluster/user/abc_user/udf/pig_udf-1.5.7_handle_input_error.jar, 1516336589685, FILE, null }
2018-03-08 16:36:16,775 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download resource { { hdfs://mycluster/user/oozie/share/lib/lib_20171215093741/pig/libgplcompression.so.0.0.0, 1513307849411, FILE, null },pending,[(container_1519371600813_0002_02_000001)],8140205165392614,DOWNLOADING}
java.lang.IllegalArgumentException: java.net.UnknownHostException: mycluster
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:406)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:728)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:671)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:155)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2815)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2852)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2834)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:249)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: mycluster
The YARN NodeManager service and the DataNode service are on the same machine.
The YARN ResourceManager service and the NameNode are on the same machine.
When I run a simple Pig script that loads data and prints it, I get the above error.
Before adding the standby NameNode, everything worked well.
How can I configure YARN to understand my NameNode cluster?
Thank you
After checking hdfs-site.xml again on the two DataNodes where the YARN NodeManagers run, I saw that the file was missing this property when compared with the hdfs-site.xml on the NameNode:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
It is working now.
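For completeness, resolving the logical nameservice also depends on the other standard HA entries being present in hdfs-site.xml on every node; a sketch, assuming the nameservice is mycluster, the NameNodes are named nn1 and nn2, and the hostnames are placeholders:
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>namenode1-host:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>namenode2-host:8020</value>
</property>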
We have an installation of AWS EMR in a client environment. Encryption in transit and encryption at rest have been enabled using a security configuration. We continue to get the MapReduce errors below when we execute a simple Hive query.
Diagnostic Messages for this Task:
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError:
error in shuffle in fetcher#1
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:377)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.checkReducerHealth(ShuffleSchedulerImpl.java:366)
at org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl.copyFailed(ShuffleSchedulerImpl.java:288)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.openShuffleUrl(Fetcher.java:282)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:323)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
Please let me know if anyone has faced this error before.
I set up a Cassandra cluster using the DataStax AMI in AWS and started the Cassandra service. I am trying to connect to this Cassandra service from another EC2 instance where Titan is installed. The Titan server version is 0.4.4. I also tried 0.5.3 but got the same error.
Cassandra is the backend storage for Titan.
The error is:
20366 [main] WARN com.tinkerpop.rexster.config.GraphConfigurationContainer - Could not load graph graph. Please check the XML configuration.
20367 [main] WARN com.tinkerpop.rexster.config.GraphConfigurationContainer - GraphConfiguration could not be found or otherwise instantiated: [com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration]. Ensure that it is in Rexster's path.
com.tinkerpop.rexster.config.GraphConfigurationException: GraphConfiguration could not be found or otherwise instantiated: [com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration]. Ensure that it is in Rexster's path.
at com.tinkerpop.rexster.config.GraphConfigurationContainer.getGraphFromConfiguration(GraphConfigurationContainer.java:137)
at com.tinkerpop.rexster.config.GraphConfigurationContainer.<init>(GraphConfigurationContainer.java:54)
at com.tinkerpop.rexster.server.XmlRexsterApplication.reconfigure(XmlRexsterApplication.java:99)
at com.tinkerpop.rexster.server.XmlRexsterApplication.<init>(XmlRexsterApplication.java:47)
at com.tinkerpop.rexster.Application.<init>(Application.java:96)
at com.tinkerpop.rexster.Application.main(Application.java:188)
Caused by: java.lang.IllegalArgumentException: Could not instantiate implementation: com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxStoreManager
at com.thinkaurelius.titan.diskstorage.Backend.instantiate(Backend.java:355)
at com.thinkaurelius.titan.diskstorage.Backend.getImplementationClass(Backend.java:367)
at com.thinkaurelius.titan.diskstorage.Backend.getStorageManager(Backend.java:311)
at com.thinkaurelius.titan.diskstorage.Backend.<init>(Backend.java:121)
at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.getBackend(GraphDatabaseConfiguration.java:1173)
at com.thinkaurelius.titan.graphdb.database.StandardTitanGraph.<init>(StandardTitanGraph.java:75)
at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:40)
at com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration.configureGraphInstance(TitanGraphConfiguration.java:25)
at com.tinkerpop.rexster.config.GraphConfigurationContainer.getGraphFromConfiguration(GraphConfigurationContainer.java:119)
Configuration file:
<rexster>
<http>
<server-port>7182</server-port>
<server-host>0.0.0.0</server-host>
<base-uri>http://localhost</base-uri>
<web-root>public</web-root>
<character-set>UTF-8</character-set>
<enable-jmx>false</enable-jmx>
<enable-doghouse>true</enable-doghouse>
<max-post-size>2097152</max-post-size>
<max-header-size>8192</max-header-size>
<upload-timeout-millis>30000</upload-timeout-millis>
<thread-pool>
<worker>
<core-size>8</core-size>
<max-size>8</max-size>
</worker>
<kernal>
<core-size>4</core-size>
<max-size>4</max-size>
</kernal>
</thread-pool>
<io-strategy>leader-follower</io-strategy>
</http>
<rexpro>
<server-port>7180</server-port>
<server-host>0.0.0.0</server-host>
<session-max-idle>1790000</session-max-idle>
<session-check-interval>3000000</session-check-interval>
<connection-max-idle>180000</connection-max-idle>
<connection-check-interval>3000000</connection-check-interval>
<enable-jmx>false</enable-jmx>
<thread-pool>
<worker>
<core-size>8</core-size>
<max-size>8</max-size>
</worker>
<kernal>
<core-size>4</core-size>
<max-size>4</max-size>
</kernal>
</thread-pool>
<io-strategy>leader-follower</io-strategy>
</rexpro>
<shutdown-port>7183</shutdown-port>
<shutdown-host>127.0.0.1</shutdown-host>
<graphs>
<graph>
<graph-name>graph</graph-name>
<graph-type>com.thinkaurelius.titan.tinkerpop.rexster.TitanGraphConfiguration</graph-type>
<graph-location>/tmp/titan</graph-location>
<graph-read-only>false</graph-read-only>
<properties>
<storage.hostname>ec2-52-22-199-210.amazonaws.com</storage.hostname>
<storage.backend>cassandra</storage.backend>
</properties>
<extensions>
<allows>
<allow>tp:gremlin</allow>
</allows>
</extensions>
</graph>
</graphs>
</rexster>