Hadoop 2.9.2 AWS - amazon-web-services

I'm managed to setup Hadoop with 3 datanodes as a small cluster and everything work ok.
When trying to access AWS bucket on S3A protocol I get this error:
hadoop fs -ls s3a://my-bucket/
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2299)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
... 16 more
What I did wrong ? How do fix that ?
P.S. Bucket on Amazon if fully public. Anyone can download from it.
Amazon credentials was configured in hadoop/core-site.xml as described here: Hadoop-AWS module: Integration with Amazon Web Services

As per the link you shared the issue seems to be related JAR file missing from CLASSPATH. Can you check if it is accessible. If it is not can you copy required JARS as shown below matching your Hadoop version and retry.
sudo cp hadoop/share/hadoop/tools/lib/$AWS_JAVA_SDK_VERSION.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/$AWS_HADOOP_VERSION.jar hadoop/share/hadoop/common/lib/

Related

Unable to connect to Huggingface from EC2 instance

I am running a python code in EC2 instance where I am loading a Huggingface model using the from_pretrained() method. I get the error
OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json' to download pretrained model configuration file.
while trying to initialize the reader. To get over this, I downloaded the file manually and provided the local JSON path. That worked fine but then I see issues in loading the tokenizer too.
OSError: Couldn't reach server at '{}' to download vocabulary files.
I think my network settings of EC2 are not correct due to which I am unable to connect to external Huggingface repository.
I tried relaxing the inbound rules for EC2 to IP version|Type|Protocol|Port range|Destination=>IPv4|All|traffic|All|All|0.0.0.0/0 but even that doesn't help. The outbound rules are already IPv4|All|traffic|All|All|0.0.0.0/0.
I also tried creating an IAM role with policy AmazonS3ReadOnlyAccess and attached it to the EC2 instance but still getting the same error.
Could someone point what needs to be done to solve this. Thanks.
Here is how i fixed this issue.
i installed pyopenssl like this :
!pip install pyopenssl
then i restarted terminal and re-ran the code and it fixed the issue for me,thanks
might be your network is using proxy
this might help
$ proxies = {"http": 'foo.bar:3128', addyourproxy:'foo.bar:4012'}
$ from transformers import pipeline
$ qt_ans = pipeline('question-answering')

Filebeat and AWS Elasticsearch - Not Working

I have good experience in working with Elasticsearch, I have worked with version 2.4 and now trying to learn new Elasticsearch.
I am trying to implement Filebeat to send my apache and system logs to my Elasticsearch endpoint. To save my time I preferred to launch a t2.medium single node instance over AWS Elasticsearch Service under the public domain and I have attached the access policy to allow everyone to access the cluster.
The AWS Elasticsearch instance is up and running healthy.
I launched a Ubuntu(18.04) server, downloaded the filebeat tar and made the following configuration in filebeat.yml:
#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
# Array of hosts to connect to.
hosts: ["https://my-public-test-domain.ap-southeast-1.es.amazonaws.com:443"]
18.04- # Optional protocol and basic auth credentials.
#protocol: "https"
#username: "elastic"
#password: "changeme"
I enabled the required modules :
filebeat modules enable system apache
Then as per the filebeat documentation I changed the ownership of the filebeat file and started the filebeat with the following commands :
sudo chown root filebeat.yml
sudo ./filebeat -e
When I started the filebeat I faced the following permission and ownership issues :
Error loading config from file '/home/ubuntu/beats/filebeat-7.2.0-linux-x86_64/modules.d/system.yml', error invalid config: config file ("/home/ubuntu/beats/filebeat-7.2.0-linux-x86_64/modules.d/system.yml") must be owned by the user identifier (uid=0) or root
To resolve this I changed the ownership for the files which were throwing errors.
When I restarted the filebeat service , I started facing the following issue :
Connection marked as failed because the onConnect callback failed: cannot retrieve the elasticsearch license: unauthorized access, could not connect to the xpack endpoint, verify your credentials
Going through this link , I found that to work with AWS Elasticsearch I will need Beats OSS versions.
So I again downloaded the OSS version for beat from this link and followed the same procedure as above, but still no luck. Now I am facing the following errors :
Error 1:
Attempting to reconnect to backoff(elasticsearch(https://my-public-test-domain.ap-southeast-1.es.amazonaws.com:443)) with 12 reconnect attempt(s)
Error 2:
Failed to connect to backoff(elasticsearch(https://my-public-test-domain.ap-southeast-1.es.amazonaws.com:443)): Connection marked as failed because the onConnect callback failed: 1 error: Error loading pipeline for fileset system/auth: This module requires an Elasticsearch plugin that provides the geoip processor. Please visit the Elasticsearch documentation for instructions on how to install this plugin. Response body: {"error":{"root_cause":[{"type":"parse_exception","reason":"No processor type exists with name [geoip]","header":{"processor_type":"geoip"}}],"type":"parse_exception","reason":"No processor type exists with name [geoip]","header":{"processor_type":"geoip"}},"status":400}
From the second error I can understand that the geoip plugin is not available because of which I facing this error.
What else needs to be done to get this working?
Has anyone been to successfully connect Beats to AWS Elasticsearch?
What other steps I could to take to mitigate the above issue?
Envrionment Details:
AWS Elasticsearch Version : 6.7
File Beat : 7.2.0
First, you need to use OSS version of filebeat with AWS ES https://www.elastic.co/downloads/beats/filebeat-oss
Second, AWS ElasticSearch does not provide GeoIP module, so you will need to edit pipelines for any of the default modules you want to use, and make sure GeoIP is removed/commented out.
For example in /usr/share/filebeat/module/system/auth/ingest/pipeline.json (that's the path when installed from deb package - your path will be different of course) comment out:
{
"geoip": {
"field": "source.ip",
"target_field": "source.geo",
"ignore_failure": true
}
},
Repeat the same for apache module.
I've spent hours trying to make filebeat iis module works with AWS elasticsearch. I kept getting ingest-geoip error, Below fixed the issue.
For windows iis logs, AWS elasticsearch remove geoip from filebeat module configuration:
C:\Program Files (x86)\filebeat\module\iis\access\ingest\default.json
C:\Program Files (x86)\filebeat\module\iis\access\manifest.yml
C:\Program Files (x86)\filebeat\module\iis\error\ingest\default.json
C:\Program Files (x86)\filebeat\module\iis\error\manifest.yml

Unable to load S3 parquet with postgresql driver in spark-shell [duplicate]

Trying to read a file located in S3 using spark-shell:
scala> val myRdd = sc.textFile("s3n://myBucket/myFile1.log")
lyrics: org.apache.spark.rdd.RDD[String] = s3n://myBucket/myFile1.log MappedRDD[55] at textFile at <console>:12
scala> myRdd.count
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2614)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
... etc ...
The IOException: No FileSystem for scheme: s3n error occurred with:
Spark 1.31 or 1.40 on dev machine (no Hadoop libs)
Running from the Hortonworks Sandbox HDP v2.2.4 (Hadoop 2.60) which integrates Spark 1.2.1 out of the box
Using s3:// or s3n:// scheme
What is the cause of this error? Missing dependency, Missing configuration, or mis-use of sc.textFile()?
Or may be this is due to a bug that affects Spark build specific to Hadoop 2.60 as this post seems to suggest. I am going to try Spark for Hadoop 2.40 to see if this solves the issue.
Confirmed that this is related to the Spark build against Hadoop 2.60. Just installed Spark 1.4.0 "Pre built for Hadoop 2.4 and later" (instead of Hadoop 2.6). And the code now works OK.
sc.textFile("s3n://bucketname/Filename") now raises another error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
The code below uses the S3 URL format to show that Spark can read S3 file. Using dev machine (no Hadoop libs).
scala> val lyrics = sc.textFile("s3n://MyAccessKeyID:MySecretKey#zpub01/SafeAndSound_Lyrics.txt")
lyrics: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at textFile at <console>:21
scala> lyrics.count
res1: Long = 9
Even Better: the code above with AWS credentials inline in the S3N URI will break if the AWS Secret Key has a forward "/". Configuring AWS Credentials in SparkContext will fix it. Code works whether the S3 file is public or private.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "BLABLA")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "....") // can contain "/"
val myRDD = sc.textFile("s3n://myBucket/MyFilePattern")
myRDD.count
Despite that this question has already an accepted answer, I think that the exact details of why this is happening are still missing. So I think there might be a place for one more answer.
If you add the required hadoop-aws dependency, your code should work.
Starting Hadoop 2.6.0, s3 FS connector has been moved to a separate library called hadoop-aws.
There is also a Jira for that:
Move s3-related FS connector code to hadoop-aws.
This means that any version of spark, that has been built against Hadoop 2.6.0 or newer will have to use another external dependency to be able to connect to the S3 File System.
Here is an sbt example that I have tried and is working as expected using Apache Spark 1.6.2 built against Hadoop 2.6.0:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"
In my case, I encountered some dependencies issues, so I resolved by adding exclusion:
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0" exclude("tomcat", "jasper-compiler") excludeAll ExclusionRule(organization = "javax.servlet")
On other related note, I have yet to try it, but that it is recommended to use "s3a" and not "s3n" filesystem starting Hadoop 2.6.0.
The third generation, s3a: filesystem. Designed to be a switch in replacement for s3n:, this filesystem binding supports larger files and promises higher performance.
You can add the --packages parameter with the appropriate jar:
to your submission:
bin/spark-submit --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 code.py
I had to copy the jar files from a hadoop download into the $SPARK_HOME/jars directory. Using the --jars flag or the --packages flag for spark-submit didn't work.
Details:
Spark 2.3.0
Hadoop downloaded was 2.7.6
Two jar files copied were from (hadoop dir)/share/hadoop/tools/lib/
aws-java-sdk-1.7.4.jar
hadoop-aws-2.7.6.jar
This is a sample spark code which can read the files present on s3
val hadoopConf = sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", s3Key)
hadoopConf.set("fs.s3.awsSecretAccessKey", s3Secret)
var jobInput = sparkContext.textFile("s3://" + s3_location)
Ran into the same problem in Spark 2.0.2. Resolved it by feeding it the jars. Here's what I ran:
$ spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar
scala> val hadoopConf = sc.hadoopConfiguration
scala> hadoopConf.set("fs.s3.impl","org.apache.hadoop.fs.s3native.NativeS3FileSystem")
scala> hadoopConf.set("fs.s3.awsAccessKeyId",awsAccessKeyId)
scala> hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext.read.parquet("s3://your-s3-bucket/")
obviously, you need to have the jars in the path where you're running spark-shell from
There is a Spark JIRA, SPARK-7481, open as of today, oct 20, 2016, to add a spark-cloud module which includes transitive dependencies on everything s3a and azure wasb: need, along with tests.
And a Spark PR to match. This is how I get s3a support into my spark builds
If you do it by hand, you must get hadoop-aws JAR of the exact version the rest of your hadoop JARS have, and a version of the AWS JARs 100% in sync with what Hadoop aws was compiled against. For Hadoop 2.7.{1, 2, 3, ...}
hadoop-aws-2.7.x.jar
aws-java-sdk-1.7.4.jar
joda-time-2.9.3.jar
+ jackson-*-2.6.5.jar
Stick all of these into SPARK_HOME/jars. Run spark with your credentials set up in Env vars or in spark-default.conf
the simplest test is can you do a line count of a CSV File
val landsatCSV = "s3a://landsat-pds/scene_list.gz"
val lines = sc.textFile(landsatCSV)
val lineCount = lines.count()
Get a number: all is well. Get a stack trace. Bad news.
For Spark 1.4.x "Pre built for Hadoop 2.6 and later":
I just copied needed S3, S3native packages from hadoop-aws-2.6.0.jar to
spark-assembly-1.4.1-hadoop2.6.0.jar.
After that I restarted spark cluster and it works.
Do not forget to check owner and mode of the assembly jar.
I was facing the same issue. It worked fine after setting the value for fs.s3n.impl and adding hadoop-aws dependency.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
S3N is not a default file format. You need to build your version of Spark with a version of Hadoop that has the additional libraries used for AWS compatibility. Additional info I found here, https://www.hakkalabs.co/articles/making-your-local-hadoop-more-like-aws-elastic-mapreduce
You probably have to use s3a:/ scheme instead of s3:/ or s3n:/
However, it is not working out of the box (for me) for the spark shell. I see the following stacktrace:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1781)
at org.apache.spark.rdd.RDD.count(RDD.scala:1099)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
at $iwC$$iwC$$iwC.<init>(<console>:37)
at $iwC$$iwC.<init>(<console>:39)
at $iwC.<init>(<console>:41)
at <init>(<console>:43)
at .<init>(<console>:47)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 68 more
What I think - you have to manually add the hadoop-aws dependency manually http://search.maven.org/#artifactdetails|org.apache.hadoop|hadoop-aws|2.7.1|jar But I have no idea how to add it to spark-shell properly.
Download the hadoop-aws jar from maven repository matching your hadoop version.
Copy the jar to $SPARK_HOME/jars location.
Now in your Pyspark script, setup AWS Access Key & Secret Access Key.
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
// where spark is SparkSession instance
For Spark scala:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "ACCESS_KEY")
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESSS_KEY")
I was able to to read my S3 parquet files (Spark 3.3.1, Hadoop 3) using the configuration proposed here:
spark = SparkSession.builder.appName("Test_Parquet").master("local[*]")\
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")\
.config("spark.jars.packages",
"org.apache.hadoop:hadoop-aws:3.2.2,"
"com.amazonaws:aws-java-sdk-bundle:1.12.180").getOrCreate()
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl",\
"org.apache.hadoop.fs.s3a.S3A")
df = spark.read.parquet(f"s3a://{bucket_name}/{file_name}")
USe s3a instead of s3n. I had similar issue on a Hadoop job. After switching from s3n to s3a it worked.
e.g.
s3a://myBucket/myFile1.log

Spring Cloud Data Flow : Cannot run program "docker"

I want to deploy Spring Boot applications using Kinesis streams on Kubernetes cluster on AWS.
I used kops in an AWS EC2 (Amazon Linux) instance to create my cluster and deploy it using terraform.
I installed Spring Cloud Data Flow for Kubernetes using Helm chart. All my pods are up and running and I can access to the Spring Cloud Data Flow interface in order to register my dockerized apps. I am using ECR repositories to upload my Docker images.
When I want to deploy the stream (composed of a time-source and a log-sink), a big nice red error message pops up. I checked the log of the Skipper pod and I have the following error message starting with :
org.springframework.cloud.skipper.SkipperException: Could not install AppDeployRequest
and finishing with :
Caused by: java.io.IOException: Cannot run program "docker" (in directory "/tmp/spring-cloud-deployer-5769885450333766520/time-log-kinesis-stream-1539963209716/time-log-kinesis-stream.log-sink-kinesis-app-v1"): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) ~[na:1.8.0_111-internal]
at org.springframework.cloud.deployer.spi.local.LocalAppDeployer$AppInstance.start(LocalAppDeployer.java:386) ~[spring-cloud-deployer-local-1.3.7.RELEASE.jar!/:1.3.7.RELEASE]
at org.springframework.cloud.deployer.spi.local.LocalAppDeployer$AppInstance.start(LocalAppDeployer.java:414) ~[spring-cloud-deployer-local-1.3.7.RELEASE.jar!/:1.3.7.RELEASE]
at org.springframework.cloud.deployer.spi.local.LocalAppDeployer$AppInstance.access$200(LocalAppDeployer.java:296) ~[spring-cloud-deployer-local-1.3.7.RELEASE.jar!/:1.3.7.RELEASE]
at org.springframework.cloud.deployer.spi.local.LocalAppDeployer.deploy(LocalAppDeployer.java:199) ~[spring-cloud-deployer-local-1.3.7.RELEASE.jar!/:1.3.7.RELEASE]
... 54 common frames omitted
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method) ~[na:1.8.0_111-internal]
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247) ~[na:1.8.0_111-internal]
at java.lang.ProcessImpl.start(ProcessImpl.java:134) ~[na:1.8.0_111-internal]
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) ~[na:1.8.0_111-internal]
... 58 common frames omitted
I already had this error when I tried to deploy on a local k8s cluster on Windows 10 and I thought it was linked to Win10 platform.
I am using spring-cloud-dataflow-server-kubernetes at version 1.6.2.RELEASE.
I really do not have any clues why this error is appearing. Thanks !
It looks like the docker command is not found by the SCDF local deployer's ProcessBuilder when it tries to run the docker exec from this path:
/tmp/spring-cloud-deployer-5769885450333766520/time-log-kinesis-stream-1539963209716/time-log-kinesis-stream.log-sink-kinesis-app-v1
The SCDF sets the above path as its working directory before running the docker command and hence docker is expected to run from this location.
I have found where was the issue. My bad, the problem is always between the keyboard and the chair !
I wanted to remove all the metrics process in the skipper-config.yaml file and I inserted a typo in the configuration file. The JSON env variable data.spring.application.json for the Skipper launch was not valid hence the DeployerInitializationService never saw the properties it needed to add Kubernetes into the repository !
Now in the logs and in the dataflow shell I have the default and the minikube accounts. Thanks for your help anyway :)

Problems using distcp and s3distcp with my EMR job that outputs to HDFS

I've run a job on AWS's EMR, and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly to S3 is due to the (currently unresolved) problem I describe in Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?
For distcp, I run (following this post's recommendation):
elastic-mapreduce --jobflow <MY-JOB-ID> --jar \
s3://elasticmapreduce/samples/distcp/distcp.jar \
--args -overwrite \
--args hdfs:///output/myJobOutput,s3n://output/myJobOutput \
--step-name "Distcp output to s3"
In error log (/mnt/var/log/hadoop/steps/8), I get:
With failures, global counters are inaccurate; consider running with -i
Copy failed: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: <SOME-REQUEST-ID>, AWS Error Code: null, AWS Error Message: Forbidden, S3 Extended Request ID: <SOME-EXT-REQUEST-ID>
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:548)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:288)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
...
For s3distcp, I run (following the s3distcp documentation):
elastic-mapreduce --jobflow <MY-JOB-ID> --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.0.4/s3distcp.jar \
--args '--src,/output/myJobOutput,--dest,s3n://output/myJobOutput'
In the error log (/mnt/var/log/hadoop/steps/9), I get:
java.lang.RuntimeException: Reducer task failed to copy 1 files: hdfs://10.116.203.7:9000/output/myJobOutput/part-00000 etc
at com.amazon.elasticmapreduce.s3distcp.CopyFilesReducer.close(Unknown Source)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:537)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:428)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Any ideas what I'm doing wrong?
Update: Someone responding on the AWS Forums to a post about a similar distcp error mentions the IAM user user permissions, but I don't know what this means (edit: I haven't created any IAM users, so it is using the defaults); hopefully it helps pinpoint my problem.
Update 2: I noticed this error in namenode log file (when re-running s3distcp).. I'm going to look into default EMR permissions to see if it is my problem:
2012-06-24 21:57:21,326 WARN org.apache.hadoop.security.ShellBasedUnixGroupsMapping (IPC Server handler 40 on 9000): got exception trying to get groups for user job_201206242009_0005
org.apache.hadoop.util.Shell$ExitCodeException: id: job_201206242009_0005: No such user
at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getUnixGroups(ShellBasedUnixGroupsMapping.java:68)
at org.apache.hadoop.security.ShellBasedUnixGroupsMapping.getGroups(ShellBasedUnixGroupsMapping.java:45)
at org.apache.hadoop.security.Groups.getGroups(Groups.java:79)
at org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:966)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.<init>(FSPermissionChecker.java:50)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5160)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkTraverse(FSNamesystem.java:5143)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:1992)
at org.apache.hadoop.hdfs.server.namenode.NameNode.getFileInfo(NameNode.java:837)
...
Update 3: I contact AWS Support, and they didn't see a problem, so am now waiting to hear back from their engineering team. Will post back as I hear more
Try this solution. At least it worked for me. (I've successfully copied dir with 30Gb file).
I'm not 100% positive, but after reviewing my commands above, I noticed that my destination on S3 does NOT specify a bucket name. This appears to simply be a case of rookie-ism.