Where are the EMR logs that are placed in S3 located on the EC2 instance running the script? - amazon-web-services

The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log that contains the traceback with that AssertionError will be placed (if logs are enabled) in an S3 bucket I specified on setup, and I can read it once those logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) lines that you can see in the stderr log file accessible through the EMR console.
So my question is: where is the log (stdout) with the actual AssertionError that was raised? It gets placed in the S3 bucket I indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist on the EC2 instance before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've checked the stdout file on the EC2 machine at the paths in the code sample above, but it is always empty.
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, has a FailureDetails JSON dict in its response. Unfortunately, this only includes a LogFile key that links to the stderr.gz file on S3 (even if that file doesn't exist yet) and a Message key that contains a generic Exception in thread.. message, not the stdout. Am I misunderstanding something about the existence of those logs?
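For reference, this is roughly the call I'm making (a minimal sketch; the cluster and step IDs are placeholders):

import boto3

# Sketch of the describe_step call (placeholder cluster/step IDs).
emr = boto3.client("emr")
response = emr.describe_step(ClusterId="j-EXAMPLE", StepId="s-EXAMPLE")

status = response["Step"]["Status"]
if status["State"] == "FAILED":
    details = status.get("FailureDetails", {})
    print(details.get("LogFile"))  # points at the stderr.gz path in S3
    print(details.get("Message"))  # the generic "Exception in thread.." text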
Please feel free to let me know if you need any more information!

It is quite normal with log-collecting agents that the actual log files don't grow; the agent just intercepts stdout and does whatever it needs with it.
Most probably, when you configure S3 for the logs, the agent is set up to either read and delete your actual log file, or to symlink the log file somewhere else, so that file is never actually written to when a process opens it for writing.
Maybe try checking if there is a symlink there:
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
But it could be something other than a symlink that achieves the same logic, and I didn't find anything in the AWS docs, so most probably it is not intended that you have both the S3 copy and the local files at the same time, and you may not find it.
If you want to be able to check your logs more frequently, you may want to think about installing a third-party log collector (Logstash, Beats, rsyslog, Fluentd) and shipping logs to SolarWinds Loggly or logz.io, or setting up an ELK stack (Elasticsearch, Logstash, Kibana).
You can check this article from Loggly, or create a free account on logz.io and look at the many free shippers they support.

Related

I am wondering if it is normal for AWS EMR to send a lot of list and head requests for S3 model files

I am using the AWS EMR Cluster service.
While using the EMR cluster, machine learning tasks such as spark-build are performed, referring to model files stored in an S3 bucket.
This generates a lot of HEAD and LIST requests to S3, and I am wondering whether it is normal for AWS EMR to send that many LIST and HEAD requests for the S3 model files.
Symptom: AWS EMR issues about 2.7 million HEAD and LIST requests per day to S3.
A lot of list/head requests get sent.
This is related to how directories are emulated in the Hadoop/Spark/Hive S3 clients; every time a process looks to see if there's a directory at a path, it will issue a LIST request, maybe a HEAD request first (to see if it's a file).
Then there's the listing of the contents, more LIST requests, and finally reading the files. There'll be one HEAD request on every open() call to verify the file exists and to determine how long it is.
Files are read with GET requests. Every time there's a seek()/buffered read on the input stream and the data isn't in a buffer, the client has to do one of:
read to the end of the current ranged GET (assuming it is a ranged GET), discarding the data, and issue a new ranged GET
abort() the HTTPS connection, negotiate a new one. Slow.
Overall, then, a lot of IO, especially if the application is inefficient about caching the output of directory listings or whether files exist, or does needless checks before operations (if fs.exists(path) fs.delete(path, false)) and the like.
If this is your code, try not to do that.
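As a rough illustration of the request pattern described above, here is approximately what a single "is this a directory, then read its files" sequence maps to in raw S3 calls (a sketch using boto3 with made-up bucket/prefix names, not the connector's actual code):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket, prefix = "my-model-bucket", "models/latest"  # made-up names

# "Is this path a plain file?" -> one HEAD request
try:
    s3.head_object(Bucket=bucket, Key=prefix)
except ClientError:
    pass  # not a plain object, so the client probes it as a directory

# "Is it a directory?" -> a LIST request on the prefix
s3.list_objects_v2(Bucket=bucket, Prefix=prefix + "/", MaxKeys=1)

# Listing the contents -> more (paginated) LIST requests
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix + "/"):
    for obj in page.get("Contents", []):
        # open() -> another HEAD to check existence/length, then GETs to read
        s3.head_object(Bucket=bucket, Key=obj["Key"])
        s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()

Multiply that by every task, partition and model file probed and the daily request count adds up quickly.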
(Disclaimer: this is all guesswork based on experience tuning the open-source Hive/Spark apps to work through the S3A connector. I'm assuming the same holds for EMR.)

Errors when using DialogFlow "restore agent" API

We have suddenly started experiencing an error when using the DialogFlow "restore agent" API. The call is failing with the error:
400 com.google.apps.framework.request.BadRequestException: Invalid
agent zip. Missing required json file agent.json
Oddly, it only seems to happen for newly created DialogFlow agents, but not for older/existing ones. We are using this API so that we can programmatically create a custom agent using our own intents/entities. This code has been working for about the past two years, with no changes on our side. We are using the official DialogFlow client library for Python. We have been on version 0.2.0, and I tried updating to the latest (0.8.0) but there was no change.
I tried changing our code to include the agent.json file (by using the "export agent" API and getting the agent.json file from there). In that case, I no longer get the above error and the restore appears to succeed. However, the agent then seems to be corrupt in some way. When trying to click on any intent -- or perform various other operations in the DialogFlow console -- I get the error:
Failed to get Training Phrases Errorid=xxx
(where xxx seems to be a UUID that changes each time)
Trying to export the agent in that state also displays an error:
Error downloading agent
Occasionally, even when including agent.json as above, the restore will still fail, but with the error:
500 Internal error encountered.
I appreciate any ideas on how we can get this working again. Thanks!
After a lot of trial and error I found the solution. Here it is in case anyone else runs into this. Something must have changed recently in how DialogFlow processes the zip upload during the "restore agent" operation --
1) The agent.json file is now required in the zip file, where before it was optional
2) We found some of the "id" elements in our _usersays files for various intents were not valid UUIDs. Previously this did not cause any error, but now the agent winds up in an invalid state ("Failed to get Training Phrases" error, etc as mentioned above).
An easy way to fix this is to export one of the existing agents and copy its agent.json and package.json into your current directory before uploading.
agent.json is now required by DialogFlow.
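In case it's useful, here is a minimal sketch of that rebuild step using only Python's standard library (the directory layout follows a typical exported agent; the file and directory names are placeholders, with agent.json/package.json assumed to be copied in from an export of an existing agent):

import json
import os
import uuid
import zipfile

SRC_DIR = "my_agent"          # hypothetical unpacked agent export
OUT_ZIP = "my_agent_fixed.zip"

# Regenerate any "id" in the intents/*_usersays_*.json files that isn't a valid UUID
for name in os.listdir(os.path.join(SRC_DIR, "intents")):
    if "_usersays_" not in name:
        continue
    path = os.path.join(SRC_DIR, "intents", name)
    with open(path) as f:
        phrases = json.load(f)
    for phrase in phrases:
        try:
            uuid.UUID(phrase.get("id", ""))
        except ValueError:
            phrase["id"] = str(uuid.uuid4())
    with open(path, "w") as f:
        json.dump(phrases, f, indent=2)

# Zip everything back up, including the copied-in agent.json and package.json
with zipfile.ZipFile(OUT_ZIP, "w", zipfile.ZIP_DEFLATED) as zf:
    for root, _, files in os.walk(SRC_DIR):
        for fname in files:
            full = os.path.join(root, fname)
            zf.write(full, os.path.relpath(full, SRC_DIR))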

Is it possible to write watchtower logs to a single file in AWS?

I have developed a small application. For logging, I'm using watchtower with AWS. The logs are working fine: they are inserted into CloudWatch file-wise (one log per file), but I want all logs to be registered under a single file only (for example api.views). Is this possible? If yes, how?
Solved this problem: I wrote the logging function in one file and call that function wherever I want to write logs, so all logs are saved under the same file name.
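In case it helps, a rough sketch of that approach (the log_group/stream_name keyword arguments below are from older watchtower releases and may be named differently in newer versions; the group and stream names are made up):

# logger_util.py - one shared handler so every module logs to the same stream
import logging
import watchtower

_handler = watchtower.CloudWatchLogHandler(
    log_group="my-app",        # hypothetical log group
    stream_name="api.views",   # single stream/"file" for everything
)

def get_logger(name="api.views"):
    logger = logging.getLogger(name)
    if _handler not in logger.handlers:
        logger.addHandler(_handler)
        logger.setLevel(logging.INFO)
    return logger

Every module then does from logger_util import get_logger and calls get_logger().info(...), so everything ends up under one name.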

Using S3 as target for AWS DMS: Uploaded File name doesn't change

We are using DMS to get data from SQL Server and load it into an S3 bucket, after which the data is finally loaded into Snowflake using Snowpipe for the full load.
Now, in order for Snowpipe to know there is new data in the S3 bucket, the file name needs to be different from the last one. I have tried all the available task setting options (DROP_AND_CREATE, DO_NOTHING, TRUNCATE) to get a different file name, but it's still not working. The file is always written as LOAD00000001.csv.
The documentation says the file name will be incremental (e.g. LOAD00000001.csv, LOAD00000002.csv, and so on), but that's not happening, which is why Snowpipe is not able to register the changes.
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html
Can someone please help?
For DMS, the incremental counter starts over from 1 each time the task is run. It does not have a "don't overwrite existing objects" feature.
Your best bet may be to handle the load yourself by looking for updated object timestamps in your folder or setting up S3 event notifications.
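For example, a minimal sketch of the timestamp approach with boto3 (the bucket, prefix and cutoff window here are placeholders):

import boto3
from datetime import datetime, timedelta, timezone

# Sketch: find objects DMS has (re)written recently, even though the
# LOAD00000001.csv name is reused on every run.
s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(minutes=15)

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-dms-bucket", Prefix="dms-output/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            print("new/updated object:", obj["Key"], obj["LastModified"])

An S3 event notification (s3:ObjectCreated:*) on the prefix would achieve the same thing without polling.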

AWS EMR - Hive creating new table in S3 results in AmazonS3Exception: Bad Request

I have a Hive script I'm running in EMR that is creating a partitioned Parquet table in S3 from a ~40GB gzipped CSV file also stored in S3.
The script runs fine for about 4 hours but reaches a point (pretty sure when it is just about done creating the Parquet table) where it errors out. The logs show that the error is:
HiveException: Hive Runtime Error while processing row
caused by:
AmazonS3Exception: Bad Request
There really isn't any more useful information in the logs that I can see. It is reading the CSV file from S3 fine, and it creates a couple of metadata files in S3 fine as well, so I've confirmed the instance has read/write permissions to the bucket.
I really can't think of anything else that could be going on, and I wish there were more info in the logs about what "Bad Request" Hive is making to S3. Does anyone have any ideas?
BadRequest is a fairly meaningless response from AWS, which it sends whenever there is some reason it doesn't like the caller. Nobody really knows what's happening.
The troubleshooting docs for the ASF S3A connector list some causes, but they aren't complete, and they are based on guesswork about what made the message go away.
If you have the request ID of the failed request, you can submit a support case for Amazon to see what they saw on their side.
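If you can reproduce the failing request outside Hive, for example with boto3 against the same bucket, the error response carries the request IDs you would attach to that support case (a sketch with placeholder bucket/key names):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
try:
    s3.head_object(Bucket="my-output-bucket", Key="warehouse/table/part-00000")
except ClientError as e:
    meta = e.response["ResponseMetadata"]
    print("HTTP status:", meta.get("HTTPStatusCode"))
    print("Request ID:", meta.get("RequestId"))
    print("Extended request ID:", meta.get("HostId"))  # S3's extended id, also useful for support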
If it makes you feel any better, I'm seeing it when I try to list exactly one directory in an object store, and I'm a co-author of the S3A connector. Like I said: "guesswork". Once you find out, add a comment here or, if it's not in the troubleshooting doc, submit a patch to Hadoop on the topic.