Where to find node logs in AWS EMR cluster? - amazon-web-services

I have a PySpark program running on an AWS EMR cluster.
The cluster configuration is: emr-5.31.0, Hadoop 2.10.0, Hive 2.3.7, Hue 4.7.1, Pig 0.17.0.
The program processes some files on the HDFS file system, but at some point it starts getting errors.
In the Amazon console - YARN applications - application_XXX (Spark) - executors - driver - stderr, I see:
'could not obtain block ... file='
A little before this message there is 'Task 0 in stage 35 failed 4 times. Aborting job'.
If I go to the Amazon console - YARN applications - application_XXX (Spark) - stages - 35 - tasks - 0 - stdout, I don't see anything bad at first glance except a lot of 'GC (Allocation Failure)' messages.
In its stderr there is a WARN: 'Could not obtain block XXX, file= No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException.'
If I go to the monitoring tab - node status, I see that one node became unhealthy at that time, and that's it. The number of nodes also changed on the 'live data nodes', 'MR total nodes', 'MR active nodes' and 'MR lost nodes' charts.
As I understand it, the task cannot find the file on HDFS because the node it was hosted on became unhealthy.
My question is: where can I find the reason the node became unhealthy? I wasn't able to find any other logs in the Amazon console. Maybe there are some node-local places where this reason is stored?

Hi, I launched an EMR cluster myself some time ago, but I don't remember the details about the logs. Consulting the docs here:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-web-log-files.html
It states that the logs are stored on the machines (to which I assume you have the keys), and they are also stored on S3 by default. I'm not sure in which bucket they will be created.
Best Regards :)
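If you are not sure where the logs land in S3, the cluster's configured log destination can also be read programmatically. A minimal sketch with boto3 (the region and cluster id are placeholders, and this assumes logging was enabled when the cluster was created):

import boto3

# Ask EMR for the cluster's configured S3 log destination (LogUri)
emr = boto3.client("emr", region_name="us-east-1")  # placeholder region
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXX")["Cluster"]  # placeholder cluster id
print(cluster.get("LogUri"))  # e.g. s3n://my-bucket/elasticmapreduce/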

On the Summary page for your EMR cluster there is a section named "Configuration details".
Below that, there is a label named "Log URI". It points to an S3 URI, but there is also a small folder icon.
Click on that icon and you can browse the logs for the nodes of your EMR cluster.

Actually, for Amazon there are more logs accessible via the S3 location - there are logs for the node boot and configuration phase, and logs from the services running on the node - HDFS and YARN, which is what I was looking for. The path looks like this: s3 location/cluster id/node/node id/applications - here I was able to find the HDFS and YARN logs.
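For example, those node-level service logs can be listed straight from the S3 log location. A minimal sketch with boto3 (bucket name, log prefix and cluster id are placeholders):

import boto3

# List node-level service logs (HDFS, YARN) under the cluster's S3 log prefix
s3 = boto3.client("s3")
prefix = "elasticmapreduce/j-XXXXXXXX/node/"  # <log prefix>/<cluster id>/node/
resp = s3.list_objects_v2(Bucket="my-emr-log-bucket", Prefix=prefix)
for obj in resp.get("Contents", []):
    if "/applications/" in obj["Key"]:  # hadoop-hdfs, hadoop-yarn, etc.
        print(obj["Key"])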

Related

Elasticsearch 6.3 (AWS) snapshot restore progress ERROR: "/_recovery is not allowed"

I take manual snapshots of an Elasticsearch index
These are stored in a snapshot repo on S3
I have created a new ES cluster, also version 6.3
I have connected the new cluster to the S3 snapshot repo via python script method mentioned in this blog post: https://medium.com/docsapp-product-and-technology/aws-elasticsearch-manual-snapshot-and-restore-on-aws-s3-7e9783cdaecb
I have confirmed that the new cluster has access to the snapshot repo via the GET /_snapshot/manual-snapshot-repo/_all?pretty command
I have initiated a snapshot restore to this new cluster via:
POST /_snapshot/manual-snapshot-repo/snapshot_name/_restore
{
"indices": "reports",
"ignore_unavailable": false,
"include_global_state": false
}
It is clear that this operation has at least partially succeeded, as the cluster status has gone from "green" to "yellow", and a GET request to /_cluster/health yields information suggesting actions are occurring on an otherwise empty cluster... not to mention that storage is starting to be utilized (when viewing cluster health on AWS).
I would very much like to monitor the progress of the restore operation.
Elasticsearch docs suggest to use the Recovery API. Docs Link: https://www.elastic.co/guide/en/elasticsearch/reference/6.3/indices-recovery.html
It is clear from the docs that GET /_recovery?human or GET /my_index/_recovery?human should yield restore progress.
However, I encounter the following error:
"Message": "Your request: '/_recovery' is not allowed."
I get the same message when attempting the GET command in the following ways:
Via Kibana dev tools
Via chrome address bar (It's just a GET operation after all)
Via Advanced REST Client (a Chrome app)
I have not been able to locate any other mention of this particular error message.
How can I utilize the GET /_recovery?human command on my ElasticSearch 6.3 clusters?
Thank you!
The Amazon managed Elasticsearch service does not have all the endpoints available.
For version 6.3 you can check this link for the available endpoints; _recovery is not on the list, which is why you get that message.
Without the _recovery endpoint you will need to rely on _cluster/health.
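As a workaround, you can poll _cluster/health while the restore runs and watch the shard counts go down. A minimal sketch in Python with the requests library (the domain endpoint is a placeholder, and this assumes the cluster is reachable without request signing):

import time
import requests

ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint

# Poll cluster health until all shards are assigned again
while True:
    health = requests.get(f"{ENDPOINT}/_cluster/health").json()
    print(health["status"],
          "initializing:", health["initializing_shards"],
          "unassigned:", health["unassigned_shards"])
    if health["status"] == "green":
        break
    time.sleep(10)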

AWS Elasticsearch snapshot stuck in state IN_PROGRESS

I am using the Elasticsearch Service from AWS. I am receiving a 'Snapshot failure' status in Overall health. I can see there is one snapshot that has been stuck for almost 2 days.
id status start_epoch start_time end_epoch end_time duration indices successful_shards failed_shards total_shards
2020-07-13t13-30-56.2a009367-21fd-48ab-accc-36a3f61db683 IN_PROGRESS 1594647056 13:30:56 0 00:00:00 1.8d 342 0 0 0
I am not allowed to DELETE this snapshot:
{"Message":"Your request: '/_snapshot/cs-automated-enc/2020-07-13t13-30-56.2a009367-21fd-48ab-accc-36a3f61db683' is not allowed."}
I do not know what to do next. I am not able to fix it, and a lot of API calls do not work as they normally should. I can see it suddenly resolved by itself after 2 days, but basically I do not know how to fix it or where the problem is/was.
Questions:
Can I configure where, and how often, Elasticsearch should create a snapshot of the whole cluster? Or maybe just choose which index should be snapshotted?
Can I see the files in cs-automated-enc in S3, or is that not available to the user because it is included in the AWS Elasticsearch service?
Are snapshots stored in cs-automated-enc included in the Elasticsearch price?

Where are the EMR logs that are placed in S3 located on the EC2 instance running the script?

The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log that contains the traceback with that AssertionError will be placed (if logs are enabled) in an S3 bucket that I specified on setup, and then I can read the log containing the AssertionError when those logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) lines that you can see in the stderr log file accessible on EMR.
So my question is: Where is the log (stdout) to see the actual AssertionError that was raised? It gets placed in my S3 bucket indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist in EC2 before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've tried checking the stdout on the EC2 machine in the paths in the code sample above, but the stdout file is always empty.
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, has a FailureDetails JSON dict in its response. Unfortunately, this only includes a LogFile key which links to the stderr.gz file on S3 (even if that file doesn't exist yet) and a Message key which contains a generic Exception in thread.. message, not the stdout. Am I misunderstanding something about the existence of those logs?
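For reference, the describe_step call mentioned above looks roughly like this (a sketch; cluster and step ids are placeholders):

import boto3

# Look up the failure details for a step (ids are placeholders)
emr = boto3.client("emr", region_name="us-east-1")
step = emr.describe_step(ClusterId="j-XXXXXXXX", StepId="s-EXAMPLE")["Step"]
print(step["Status"]["State"])
print(step["Status"].get("FailureDetails"))  # only present for failed steps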
Please feel free to let me know if you need any more information!
It is quite normal with log-collecting agents that the actual log files don't grow; the agents just intercept stdout to do what they need.
Most probably, when you configure S3 for the logs, the agent is set up to either read and delete your actual log file, or maybe create a symlink of the log file to somewhere else, so the file is never actually written when any process opens it for writing.
Maybe try checking if there is any symlink there:
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
But it could be something other than a symlink achieving the same logic, and I didn't find anything in the AWS docs, so most probably it is not intended that you have both S3 and local files at the same time, and maybe you won't find it.
If you want to be able to check your logs more frequently, you may want to think about installing a third-party log collector (Logstash, Beats, rsyslog, Fluentd) and shipping the logs to SolarWinds Loggly or logz.io, or setting up an ELK stack (Elasticsearch, Logstash, Kibana).
You can check this article from Loggly, or create a free account at logz.io and check the many free shippers that they support.

Amazon redshift query aborts automatically after 1 hour

I have around 500 GB of compressed data in Amazon S3. I want to load this data into Amazon Redshift. For that, I have created a table in AWS Athena and I am trying to load the data into an internal table in Amazon Redshift.
Loading this much data into Amazon Redshift takes more than an hour. The problem is that when I fire the query to load the data, it gets aborted after 1 hour. I tried 2-3 times, but it keeps getting aborted after 1 hour. I am using the Aginity tool to fire the query; the Aginity tool also shows that the query is currently running and the loader is spinning.
More Details:
The Redshift cluster has 12 nodes with 2 TB of space per node, and I have used 1.7 TB of space.
The S3 files are not all the same size. One of them is 250 GB; some of them are in the MB range.
I am using the command
create table table_name as select * from athena_schema.table_name
It stops exactly after 1 hour.
Note: I have set the current query timeout in Aginity to 90000 sec.
I know this is an old thread, but for anyone coming here because of the same issue: I've realised that, at least in my case, the problem was the Aginity client. It's not related to Redshift or its Workload Manager, but only to the third-party client, Aginity. In summary, use a different client like SQL Workbench and run the COPY command from there.
Hope this helps!
Carlos C.
More information, about my environment:
Redshift:
Cluster Type: Multi Node
Cluster: ds2.xlarge
Nodes: 4
Cluster Version: 1.0.4852
Client Environment:
Aginity Workbench for Redshift
Version 4.9.1.2686 (build 05/11/17)
Microsoft Windows NT 6.2.9200.0 (64-bit)
Network:
Connected to OpenVPN, via SSH Port tunneling.
The connection is not being dropped. This issue is only affecting the COPY command. The connection remains active.
Command:
copy tbl_XXXXXXX
from 's3://***************'
iam_role 'arn:aws:iam::***************:role/***************';
S3 Structure:
120 files of 6.2 GB each. 20 files of 874MB.
Output:
ERROR: 57014: Query (22381) cancelled on user's request
Statistics:
Start: ***************
End: ***************
Duration: 3,600.2420863
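For anyone who prefers to script the workaround instead of installing another GUI client, a minimal sketch that runs the same COPY from Python with psycopg2 (all connection details, table names and paths are placeholders):

import psycopg2

# Run the COPY from a script instead of Aginity (all values are placeholders)
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="mydb", user="myuser", password="mypassword",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        copy tbl_xxxxxxx
        from 's3://my-bucket/my-prefix/'
        iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole';
    """)
conn.close()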
I'm not sure if the following answer will solve your exact problem of a timeout at exactly 1 hour.
But, based on my experience, loading data into Redshift via the COPY command is the best and fastest way, so I feel that the timeout issue shouldn't happen at all in your case.
The COPY command in Redshift can load data from S3 or via SSH.
e.g.
Simple copy
copy sales from 'emr://j-SAMPLE2B500FC/myoutput/part-*' iam_role
'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '\t' lzop;
e.g. Using a manifest
copy customer
from 's3://mybucket/cust.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
PS: If you use a manifest and divide your data into multiple files, it will be even faster, as Redshift loads data in parallel.
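For reference, the manifest is just a small JSON file stored in S3 that lists the objects to load; a minimal example (with placeholder paths) might look like this:

{
  "entries": [
    {"url": "s3://mybucket/customer/part-0000", "mandatory": true},
    {"url": "s3://mybucket/customer/part-0001", "mandatory": true}
  ]
}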

How to verify that Redshift are really DISK FULL?

A question from a Redshift newbie: I copy data using AWS Data Pipeline, but it FAILED and the log said:
"ERROR: Disk Full Detail:
----------------------------------------------- error: Disk Full code: 1016 context: node: 0 query: 2070045 location: fdisk_api.cpp:343
process: query0_49 [pid=15048] "
I'd like to know how we can check whether Redshift really is out of disk space, via the CLI or web console. Any comments or hints would be appreciated.
If you're using a single node and have SQL access to the cluster (e.g. via psql), you can run:
select
sum(capacity)/1024 as capacity_gbytes,
sum(used)/1024 as used_gbytes,
(sum(capacity) - sum(used))/1024 as free_gbytes
from
stv_partitions where part_begin=0;
This article has more: https://www.flydata.com/blog/querying-free-disk-space-on-redshift/
You can check that in the CloudWatch console. In the left bar, you'll see a bunch of AWS services under the 'Metrics' heading. Click on Redshift and look for the 'PercentageDiskSpaceUsed' metric for the cluster in question.
Also, do remember that this metric is separately available for each compute node.
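The same metric can also be pulled programmatically; a minimal sketch with boto3 (the region and cluster identifier are placeholders):

import boto3
from datetime import datetime, timedelta

# Fetch the cluster-level PercentageDiskSpaceUsed metric from CloudWatch
cw = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region
resp = cw.get_metric_statistics(
    Namespace="AWS/Redshift",
    MetricName="PercentageDiskSpaceUsed",
    Dimensions=[{"Name": "ClusterIdentifier", "Value": "my-cluster"}],  # placeholder cluster
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "%")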