I am running a Pig script with HCatalog and it is failing during the MapReduce job execution. It says one of the MapReduce jobs is failing. What would be the best way to troubleshoot this? I am trying to find the log file but could not locate it.
Is there any specific place I can find the logs?
The Pig script log is created in the working directory from which you ran the Pig script or started the Pig console.
For the MapReduce logs, check the [HADOOP_HOME]/logs/userlogs directory. You will find the ERROR message in either the stdout or the stderr file.
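If you want to scan those task logs quickly, here is a minimal Python sketch that walks the userlogs directory and prints any line containing ERROR (the HADOOP_HOME path below is an assumption; adjust it to your installation):

import os

# Assumed location of the Hadoop task logs; replace with your [HADOOP_HOME]
userlogs_dir = "/usr/local/hadoop/logs/userlogs"

for root, _dirs, files in os.walk(userlogs_dir):
    for name in files:
        # Task attempt directories typically contain stdout, stderr and syslog files
        if name in ("stdout", "stderr", "syslog"):
            path = os.path.join(root, name)
            with open(path, errors="replace") as f:
                for line in f:
                    if "ERROR" in line:
                        print(path, line.rstrip())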
I couldn't find relevant information in the documentation. I have tried all the options and links on the batch transform pages.
They can be found, but unfortunately not via any links in the Vertex AI console.
Soon after the batch prediction job fails, go to Logging -> Logs Explorer and create a query like this, replacing YOUR_PROJECT with the name of your GCP project:
logName:"projects/YOUR_PROJECT/logs/ml.googleapis.com"
First look for the same error reported by the Batch Prediction page in the Vertex AI console: "Job failed. See logs for full details."
The log line above the "Job Failed" error will likely report the real reason your batch prediction job failed.
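If you prefer to query programmatically rather than through the Logs Explorer UI, a minimal sketch with the google-cloud-logging Python client could look like this (YOUR_PROJECT is a placeholder; the filter is the same one shown above):

from google.cloud import logging as cloud_logging

project = "YOUR_PROJECT"  # placeholder: your GCP project ID
client = cloud_logging.Client(project=project)

# Same filter as in the Logs Explorer query above
log_filter = 'logName:"projects/{}/logs/ml.googleapis.com"'.format(project)

# Newest entries first, so the line above the "Job failed" error is easy to spot
for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    print(entry.timestamp, entry.severity, entry.payload)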
I have found that just going to Cloud Logging after the batch prediction job fails and clicking Run query shows the error details.
I'm trying to load around 1000 files from Google Cloud Storage into BigQuery using the BigQuery transfer service, but it appears I have an error in one of my files:
Job bqts_601e696e-0000-2ef0-812d-f403043921ec (table streams) failed with error INVALID_ARGUMENT: Error while reading data, error message: CSV table references column position 19, but line starting at position:206 contains only 19 columns.; JobID: 931777629779:bqts_601e696e-0000-2ef0-812d-f403043921ec
How can I find which file is causing this error?
I feel like this is in the docs somewhere, but I can't seem to find it.
Thanks!
You can use bq show --format=prettyjson -j job_id_here, which will show a verbose error about the failed job. You can see more info about the usage of the command in the BigQuery managing jobs docs.
I tried this with a failed job of mine in which I'm loading CSV files from a Google Cloud Storage bucket in my project.
Command used:
bq show --format=prettyjson -j bqts_xxxx-xxxx-xxxx-xxxx
Here is a snippet of the output. Output is in JSON format:
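Alternatively, if you'd rather inspect the failed job from Python than from the bq CLI, here is a minimal sketch with the google-cloud-bigquery client (the project, job ID and location below are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="your-project")  # placeholder project
job = client.get_job("bqts_xxxx-xxxx-xxxx-xxxx", location="US")  # placeholder job ID and location

# error_result holds the top-level failure; errors lists every error,
# which for load jobs often includes the offending gs:// URI
print(job.error_result)
for err in job.errors or []:
    print(err)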
The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log that contains the traceback with that AssertionError will be placed (if logs are enabled) in an S3 bucket that I specified on setup, and then I can read the log containing the AssertionError when those logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get an output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) that you can see in the stderr log file you can access on EMR:
So my question is: Where is the log (stdout) to see the actual AssertionError that was raised? It gets placed in my S3 bucket indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist in EC2 before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've tried checking the stdout on the EC2 machine in the paths in the code sample above, but the stdout file is always empty:
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, returns a FailureDetails JSON dict. Unfortunately, this only includes a LogFile key that links to the stderr.gz file on S3 (even if that file doesn't exist yet) and a Message key that contains a generic Exception in thread... message, not the stdout. Am I misunderstanding something about the existence of those logs?
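For reference, this is roughly the describe_step call I'm making (the region, cluster ID and step ID below are placeholders):

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

resp = emr.describe_step(
    ClusterId="j-EXAMPLE",  # placeholder cluster ID
    StepId="s-EXAMPLE",
)

# FailureDetails only exposes Reason, Message and LogFile -- LogFile points at
# the stderr.gz on S3, and Message is the generic "Exception in thread..." text
details = resp["Step"]["Status"].get("FailureDetails", {})
print(details.get("Reason"))
print(details.get("Message"))
print(details.get("LogFile"))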
Please feel free to let me know if you need any more information!
It is quite normal with log-collecting agents that the actual log files don't grow; the agent just intercepts stdout to do what it needs.
Most probably, when you configure S3 for the logs, the agent is set up either to read and delete your actual log file, or to create a symlink of the log file to somewhere else, so that the file is never actually written to when a process opens it for writing.
Maybe try checking whether there is a symlink there:
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
But it could be something other than a symlink that achieves the same logic, and I didn't find anything in the AWS docs, so most probably it is not intended for you to have both S3 and local files at the same time, and you may not find it.
If you want to be able to check your logs more frequently, you may want to consider installing a third-party log collector (Logstash, Beats, rsyslog, Fluentd) and shipping the logs to SolarWinds Loggly or logz.io, or setting up an ELK stack (Elasticsearch, Logstash, Kibana).
You can check this article from Loggly, or create a free account on logz.io and look at the many free shippers they support.
I am trying to implement this flow in an Airflow DAG.
Task 1: check if file exists in s3 (s3 sensor). If no new file is found, skip to task 4.
Task 2: if task 1 meets the criteria, delete the existing file in the local folder
Task 3: if task 2 is finished, download the s3 file into the local folder
Task 4: in either case, update table (using the only file in the folder)
I am not sure what trigger rule to add to task 4. If I add one_failed, obviously the task won't be executed if the file exists.
If I add all_done, it won't be executed either, because on either path the DAG will be skipping tasks (that's the whole purpose).
How should I go about it? I think I am missing something here...
Thanks everyone.
UPDATE
It also seems that my S3KeySensor is not being set to a failed status when it times out. It appears in yellow even though the log shows "Snap, time is out".
It should be failing; this is from the documentation:
"Sensor operators keep executing at a time interval and succeed when a criteria is met and fail if and when they time out."
The message "These tasks are deadlocked: {...}" appears in the console and the DAG does not keep running. I can't get task 4 to run! I am also trying it with a backfill for the same start and end date; is this correct?
Okay. It seems that Airflow can't have "empty paths", so you just have to add a dummy task on the false branch and then use the one_success trigger rule on task 4.
Simple as that.
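A minimal sketch of how that could look (bucket, key, paths and connection ID are placeholders, and I've replaced the poking sensor with a one-shot S3 check via BranchPythonOperator and S3Hook, which is one way to get the explicit false branch; imports follow Airflow 1.x module paths):

from datetime import datetime

from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator


def choose_branch():
    # One-shot S3 check instead of a poking sensor -- an assumption on my part.
    # Bucket and key are placeholders.
    hook = S3Hook(aws_conn_id="aws_default")
    if hook.check_for_key("incoming/data.csv", bucket_name="my-bucket"):
        return "delete_local_file"
    return "no_new_file"


with DAG("s3_to_local_update",
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    check_s3 = BranchPythonOperator(task_id="check_s3",
                                    python_callable=choose_branch)

    delete_local_file = BashOperator(task_id="delete_local_file",
                                     bash_command="rm -f /data/local/data.csv")

    download_file = BashOperator(task_id="download_file",
                                 bash_command="aws s3 cp s3://my-bucket/incoming/data.csv /data/local/data.csv")

    no_new_file = DummyOperator(task_id="no_new_file")  # the dummy "false" branch

    # one_success: task 4 runs as soon as either branch reaches it
    update_table = BashOperator(task_id="update_table",
                                bash_command="python /scripts/update_table.py /data/local/data.csv",
                                trigger_rule="one_success")

    check_s3 >> delete_local_file >> download_file >> update_table
    check_s3 >> no_new_file >> update_table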
How can I process only new files using AWS Data Pipeline and EMR? I may get a different number of files in my source directory each run. I want to process them using AWS Data Pipeline and EMR, one file after another. I'm not sure how the "exists" precondition or a Shell Command activity can solve this issue. Please suggest a way to process a delta list of files by adding EMR steps or creating EMR clusters for each file.
The way this is usually done in Data Pipeline is to use schedule expressions when referring to the source directory. For example,
if your pipeline is scheduled to run hourly and you specify "s3://bucket/#{format(minusMinutes(@scheduledStartTime,60),'YYYY-MM-dd-HH')}"
as the input directory, Data Pipeline will resolve that to "s3://bucket/2016-10-23-16" when it runs at hour 17. So the job will only read the data corresponding to hour 16. If you can structure your input to produce data in this manner, this approach can be used. See http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-pipeline-expressions.html for more examples of expressions.
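To make the resolution concrete, here is the Python equivalent of what that expression computes (the scheduled start time below is just an illustrative value):

from datetime import datetime, timedelta

# Hypothetical #scheduledStartTime for the run at hour 17
scheduled_start_time = datetime(2016, 10, 23, 17, 0)

# minusMinutes(scheduledStartTime, 60), then format as year-month-day-hour
hour_bucket = (scheduled_start_time - timedelta(minutes=60)).strftime("%Y-%m-%d-%H")

print("s3://bucket/" + hour_bucket)  # -> s3://bucket/2016-10-23-16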
Unfortunately, there is no built-in support for "get data since last processed".