I know this question may have been asked multiple times, but I tried those solutions and they didn't work out, so I'm asking it in a new thread in the hope of a definite solution.
I have created an IAM user with S3 read-only permission (Get and List on all S3 resources), but when I try to access S3 from an EMR cluster using an HDFS command, it throws an "Error Code 403 Forbidden" exception for certain folders. People in other posts have answered that it is a permission issue, but I didn't find that a satisfying answer, since the error is "Forbidden" rather than "Access Denied". The error appears only for certain folders (containing objects) inside a bucket and for certain empty folders. I also observed that the native API calls work normally, as follows:
Exception "Forbidden" when using s3a calls:
hdfs dfs -ls s3a://<bucketname>/<folder>
No error when using the s3 native calls (s3n and s3):
hdfs dfs -ls s3://<bucketname>/<folder>
hdfs dfs -ls s3n://<bucketname>/<folder>
Similar behavior has also been observed for empty folders, and I understand that on S3 only objects are physical files, whereas "buckets and folders" are just placeholders. However, if I create a new empty folder, the s3a calls don't throw this exception.
P.S. - The root account's access keys bypass this exception.
I'd recommend you file a JIRA on issues.apache.org, HADOOP project, component fs/s3 with the exact hadoop version you are using. Add the stack trace as the first comment, as that's the only way we could begin to work out what is happening.
FWIW, we haven't tested restricted permissions other than simple read-only and R/W; mixing permissions down the path is inevitably going to break things, as the client code expects to be able to HEAD, GET & LIST anything in the bucket.
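For what it's worth, a whole-bucket read-only policy of the kind the client expects would look roughly like the sketch below (boto3; the bucket and policy names are placeholders, and the exact action list is an assumption on my part rather than something documented for every Hadoop version):

import json
import boto3

# Placeholder bucket; the point is that List/Get apply to the whole bucket,
# not just some prefixes, since the client will HEAD/GET/LIST anywhere in it.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": "arn:aws:s3:::mybucket",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::mybucket/*",
        },
    ],
}

iam = boto3.client("iam")
iam.create_policy(
    PolicyName="s3-read-only-whole-bucket",  # placeholder name
    PolicyDocument=json.dumps(policy_document),
)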
BTW, the Hadoop S3 clients all mock empty directories by creating 0-byte objects with a "/" suffix, e.g. "folder/", then use a HEAD on that to probe for an empty directory. When data is added under an empty dir, the mock parent dir is DELETE-d.
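To see that convention from outside Hadoop, here is a rough boto3 sketch (bucket name is a placeholder): the empty directory is just a zero-byte object whose key ends in "/", and probing for it is a HEAD on that key.

import boto3

s3 = boto3.client("s3")
BUCKET = "mybucket"  # placeholder

# The marker the Hadoop clients create for an empty directory: a 0-byte
# object whose key ends with "/".
s3.put_object(Bucket=BUCKET, Key="folder/", Body=b"")

# Probing for the empty directory is a HEAD on that key.
s3.head_object(Bucket=BUCKET, Key="folder/")

# Once real data is added under folder/, the client DELETEs the marker.
s3.delete_object(Bucket=BUCKET, Key="folder/")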
Related
I have a bucket with 3 million objects. I don't even know how many folders are in my S3 bucket, or what their names are. I want to show only the list of folders in AWS S3. Is there any way to get a list of all folders?
I would use the AWS CLI for this. To get started, have a look here.
Then it is a matter of an almost standard Linux command (ls):
aws s3 ls s3://<bucket_name>/path/to/search/folder/ --recursive | grep '/$' > folders.txt
where:
the grep command just reads what the aws s3 ls command returned and keeps only the entries ending with /.
the trailing > folders.txt saves the output to a file.
Note: grep (if I'm not wrong) is a Unix-only utility, but I believe you can achieve the same on Windows as well.
Note 2: depending on the number of files there, this operation might (will) take a while.
Note 3: in systems like AWS S3 the term "folder" usually exists only to give the user visual similarity with standard file systems; internally it is just treated as part of the key. You can see this in your (web) console when you filter by "prefix".
Amazon S3 buckets with large quantities of objects are very difficult to use. The API calls that list bucket contents are limited to returning 1000 objects per API call. While it is possible to request 'folders' (by using Delimiter='/' and looking at CommonPrefixes), this would take repeated calls to obtain the hierarchy.
Instead, I would recommend using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then play with that CSV file from code (or possibly Excel? Might be too big?) to obtain your desired listings.
Just be aware that doing anything on that bucket will not be fast.
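If you would rather pull the "folder" names programmatically than from the Inventory CSV, the repeated Delimiter/CommonPrefixes calls described above look roughly like this in boto3 (bucket name is a placeholder; with millions of objects this still means a lot of API calls):

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Delimiter="/" makes S3 group keys by their first path segment and return
# those segments as CommonPrefixes, i.e. the top-level "folders".
folders = []
for page in paginator.paginate(Bucket="my-bucket", Delimiter="/"):
    for prefix in page.get("CommonPrefixes", []):
        folders.append(prefix["Prefix"])

print("\n".join(folders))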
The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log that contains the traceback with that AssertionError will be placed (if logs are enabled) in an S3 bucket that I specified on setup, and then I can read the log containing the AssertionError once those logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) lines that you can see in the stderr log file accessible on EMR.
So my question is: Where is the log (stdout) to see the actual AssertionError that was raised? It gets placed in my S3 bucket indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist in EC2 before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've tried checking the stdout on the EC2 machine in the paths in the code sample above, but the stdout file is always empty.
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, has a FailureDetails json dict in the response. Unfortunately, this only includes a LogFile key which links to the stderr.gz file on S3 (even if that file doesn't exist yet) and a Message key which contains a generic Exception in thread.. message, not the stdout. Am I misunderstanding something about the existence of those logs?
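For reference, the call I'm making looks roughly like this (the cluster ID is a placeholder; the step ID is the same s-EXAMPLE as above):

import boto3

emr = boto3.client("emr")

# j-XXXXXXXX is a placeholder for my actual cluster ID.
step = emr.describe_step(ClusterId="j-XXXXXXXX", StepId="s-EXAMPLE")

# For a failed step this holds the FailureDetails dict, but it only points at
# the stderr.gz on S3 and gives a generic message, not the stdout I'm after.
print(step["Step"]["Status"].get("FailureDetails", {}))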
Please feel free to let me know if you need any more information!
It is quite normal with log-collecting agents that the actual log files don't grow; the agent just intercepts stdout and does whatever it needs with it.
Most probably, when you configure S3 for the logs, the agent is set up to either read and delete your actual log file, or to symlink the log file somewhere else, so that file is never actually written to when a process opens it for writing.
Maybe try checking if there is any symlink there:
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
But something other than a symlink may be used to achieve the same logic, and I didn't find anything in the AWS docs, so most probably it is not intended that you have both the S3 copies and the local files at the same time, and you may not find anything.
If you want to be able to check your logs more frequently, you may want to think about installing a third-party log collector (Logstash, Beats, rsyslog, Fluentd) and shipping logs to SolarWinds Loggly or logz.io, or setting up an ELK stack (Elasticsearch, Logstash, Kibana).
You can check this article from Loggly, or create a free account on logz.io and look at the many free shippers they support.
I have a Hive script I'm running in EMR that is creating a partitioned Parquet table in S3 from a ~40GB gzipped CSV file also stored in S3.
The script runs fine for about 4 hours but reaches a point (pretty sure when it is just about done creating the Parquet table) where it errors out. The logs show that the error is:
HiveException: Hive Runtime Error while processing row
caused by:
AmazonS3Exception: Bad Request
There really isn't any more useful information in the logs that I can see. It is reading the CSV file from S3 fine and it creates a couple of metadata files in S3 fine as well, so I've confirmed the instance has read/write permissions to the bucket.
I really can't think of anything else that's going on, and I wish there was more info in the logs about what "Bad Request" Hive is making to S3. Anyone have any ideas?
Bad Request is a fairly meaningless response from AWS, which it sends whenever there is some reason it doesn't like the caller. Nobody really knows what's happening.
The troubleshooting docs for the ASF S3A connector list some causes, but they aren't complete and are based on guesswork about what made the message go away.
If you have the request ID which failed, you can submit a support request for Amazon to see what they saw on their side.
If it makes you feel any better, I'm seeing it when I try to list exactly one directory in an object store, and I'm co-author of the s3a connector. Like I said "guesswork". Once you find out, add a comment here or, if it's not in the troubleshooting doc, submit a patch to hadoop on the topic.
I fail to understand how to simply list the contents of an S3 bucket on EMR during a spark job.
I wanted to do the following
Configuration conf = spark.sparkContext().hadoopConfiguration();
FileSystem s3 = S3FileSystem.get(conf);
List<LocatedFileStatus> list = toList(s3.listFiles(new Path("s3://mybucket"), false))
This always fails with the following error
java.lang.IllegalArgumentException: Wrong FS: s3://*********/, expected: hdfs://**********.eu-central-1.compute.internal:8020
in the hadoopConfiguration fs.defaultFS -> hdfs://**********.eu-central-1.compute.internal:8020
The way I understand it, if I don't use a protocol (just /myfolder/myfile instead of e.g. hdfs://myfolder/myfile), it will default to fs.defaultFS.
But I would expect that if I specify s3://mybucket/, the fs.defaultFS should not matter.
How does one access the directory information? spark.read.parquet("s3://mybucket/*.parquet") works just fine but for this task I need to check the existence of some files and would also like to delete some. I assumed org.apache.hadoop.fs.FileSystem would be the correct tool.
PS: I also don't understand how logging works. If I use deploy-mode cluster (I want to deploy jars from S3, which does not work in client mode), then I can only find my logs in s3://logbucket/j-.../containers/application.../container...0001. There is quite a long delay before those show up in S3. How do I find them via ssh on the master? Or is there some faster/better way to check Spark application logs?
UPDATE: Just found them under /mnt/var/log/hadoop-yarn/containers, however it is owned by yarn:yarn and as the hadoop user I cannot read it. :( Ideas?
In my case I needed to read a Parquet file that was generated by prior EMR jobs. I was looking for the list of files for a given S3 prefix, but the nice thing is we don't need to do all that; we can simply do this:
spark.read.parquet(bucket+prefix_directory)
URI.create() should be used to point it to the correct FileSystem:
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val dirPaths = FileSystem.get(URI.create("<s3-path>"), fs.getConf).listStatus(new Path("<s3-path>"))
I don't think you are picking up the FS right; just use the static FileSystem.get() method, or Path.getFileSystem().
Try something like:
Path p = new Path("s3://bucket/subdir");
FileSystem fs = p.getFileSystem(conf);  // resolves the FS for the s3:// URI, not fs.defaultFS
FileStatus[] status = fs.listStatus(p);
Regarding logs, the YARN UI should let you get at them via the node managers.
I am searching for a specific file in an S3 bucket that has a lot of files. In my application I get a 403 Access Denied error, and with s3cmd I get a 403 (Forbidden) error if I try to get a file from the bucket. My problem is that I am not sure whether the permissions are the problem (because I can get other files) or the file isn't present in the bucket. I have started to search in the Amazon console interface, but I have been scrolling for hours and have not yet arrived at "4...." (I am still at "39..."), and the file I am looking for is in a folder "C03215".
So, is there a faster way to verify that the file exists in the bucket? Or is there a way to auto-scroll while doing something else (because if I do not scroll, nothing new loads)?
P.S.: I have no permission to list with s3cmd
Regarding accelerating the scrolling in the console
Like you, I have many thousands of objects that take an eternity to scroll through in the console.
I recently discovered, though, how to jump straight to a specific path/folder in the console, which is going to save my mouse finger and my sanity!
This will only work for folders, though, not the actual leaf objects themselves.
In the URL bar of your browser when viewing a bucket you will see something like:
console.aws.amazon.com/s3/home?region=eu-west-1#&bucket=your-bucket-name&prefix=
If you append your object's path after the prefix and hit enter, you would assume that it should jump to that object, but it does nothing (in Chrome at least).
However, if you append your object's path after the prefix, hit enter and then hit refresh (F5), the console will reload at your specified location.
e.g.
console.aws.amazon.com/s3/home?region=eu-west-1#&bucket=your-bucket-name&prefix=development/2015-04/TestEvent/93edfcbg-5e27-42d3-a2f9-3d86a63d27f9/
There was much joy in our office when this was figured out!
The only "faster way" is to have the s3:ListBucket permission on the bucket, because, as you have noticed, S3's response to a GET request is intentionally ambiguous if you don't.
If the object you request does not exist, the error Amazon S3 returns depends on whether you also have the s3:ListBucket permission.
If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
If you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.
http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html
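If you can get credentials that do include s3:ListBucket, a HEAD request is a quick way to test a single key without scrolling the console; here is a rough boto3 sketch (bucket and key are placeholders) showing how the 403/404 distinction above surfaces:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def object_exists(bucket, key):
    # HEAD the key; 404 means "not there", 403 means "not allowed to know".
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["ResponseMetadata"]["HTTPStatusCode"] == 404:
            return False  # the key definitely does not exist
        raise  # 403: without s3:ListBucket, S3 won't tell you

print(object_exists("my-bucket", "C03215/myfile"))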
Also, there's not a way to accelerate scrolling in the console.