Get details of an S3 sync failure - amazon-web-services

We run S3 sync commands in a SQL job that syncs a local directory to an S3 bucket. At times we'll get a sync "error" with an error code of 1, or sometimes 2. The documentation lists what each code means; error code 1 provides few details and leaves questions about the cause. It simply states "One or more Amazon S3 transfer operations failed. Limited to S3 commands."
When I run a sync command in a PowerShell script and encounter an error (e.g. a synced document being open), the window displays an error message indicating which specific file is causing the problem.
How can I capture those details in my SQL job?

I have solved this problem...
Using a PowerShell script, in the AWS s3 sync command we output the results to a text file:
aws s3 sync c:\source\dir s3://target/dir/ > F:\Source\s3Inventory\SyncOutput.txt
Then read the text file contents into a string variable:
$S3Output = Get-Content -Path F:\Source\s3Inventory\SyncOutput.txt -Raw
If the $LASTEXITCODE from the sync command does not equal 0 (indicating an error), then we send an email with the results:
if ($LASTEXITCODE -ne 0)
{
#Send email containing the value of string variable $S3Output
}
This has been placed into production and we were finally able to determine which file/object was failing.
One could certainly attach the text file to the email, rather than reading the contents into a string.
Hope this helps others!
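For completeness, a variant sketch of the same approach that captures both stdout and stderr from the sync in one string and mails it on a non-zero exit code (the SMTP host and addresses are placeholders, and Send-MailMessage is assumed to be available):
# Variant sketch: capture stdout and stderr from the sync in one string,
# then mail the captured text when the exit code is non-zero.
# The SMTP host and the From/To addresses below are placeholders.
$S3Output = aws s3 sync C:\source\dir s3://target/dir/ 2>&1 | Out-String

if ($LASTEXITCODE -ne 0)
{
    Send-MailMessage -SmtpServer "smtp.example.local" `
        -From "sqljobs@example.local" -To "dba-team@example.local" `
        -Subject "S3 sync failed with exit code $LASTEXITCODE" `
        -Body $S3Output
}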

Related

PowerShell Copy Files to AWS Bucket with Object Lock

I want to copy CSV files generated by an SSIS package from an AWS EC2 server to an S3 bucket. Each time I try, I get an error about the Content-MD5 HTTP header, because we have Object Lock enabled on the bucket.
Write-S3Object : Content-MD5 HTTP header is required for Put Object requests with Object Lock parameters
I would assume there is a PowerShell command I can add or I am missing something but after furious googling I cannot find a resolution. Any help or an alternative option would be appreciated.
I am now testing using the AWS CLI process instead of PowerShell.
If you do want to continue to use the Write-S3Object PowerShell command the missing magic flag is:
-CalculateContentMD5Header 1
So the final command will be
Write-S3Object -Region $region -BucketName $bucketName -File $fileToBackup -Key $destinationFileName -CalculateContentMD5Header 1
https://docs.aws.amazon.com/powershell/latest/reference/items/Write-S3Object.html
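For illustration, a short loop that pushes every CSV from a folder with that flag (the region, bucket name, and folder path are placeholders):
# Upload each CSV produced by the SSIS package; -CalculateContentMD5Header 1
# is what satisfies the Content-MD5 requirement on Object Lock buckets.
$region     = "eu-west-1"               # placeholder
$bucketName = "my-object-lock-bucket"   # placeholder
Get-ChildItem "C:\ssis\output" -Filter *.csv | ForEach-Object {
    Write-S3Object -Region $region -BucketName $bucketName `
        -File $_.FullName -Key $_.Name -CalculateContentMD5Header 1
}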
After a lot of testing, reading and frustration I found the AWS CLI was able to do exactly what I needed. I am unsure if this is an issue with my PowerShell knowledge or a missing feature (I lean toward my knowledge).
I created a bat file that uses the CLI to move the files into the S3 bucket, then called that bat file from an SSIS Execute Process Task.
I've dropped the one-line command below in case it helps others.
aws s3 mv C:\path\to\files\ s3://your.s3.bucket.name/ --recursive

aws S3 how to show contents of file based on a pattern

I would like to perform a simple
cat dir/file.OK.*
How can this be achieved in AWS?
I came up with
aws s3 cp s3://bucket-name/path/to/folder/ - --exclude="*" --include="R0.OK.*"
but this returns:
download failed: *(the path)* to - An error occurred (404) when calling the HeadObject operation: Not Found
Thank you for your help.
Additional detail: there is supposed to be only one file matching the pattern, so we could use that information. Anything else is allowed to (horribly) fail.
Edit: currently I am just running aws s3 ls into a file, and then cp-ing every file that grep matches from that listing. It works, but it's a nuisance.
I would consider an alternative to streaming the S3 content directly to stdout: simply copy the files from S3 to local disk (which lets you use --include/--exclude/--recursive), cat the files locally, and then delete them afterwards.
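A minimal PowerShell sketch of that copy, cat and delete approach, reusing the bucket, prefix and pattern from the question (the temp-directory handling is purely illustrative):
# Copy the matching object(s) locally, print their contents, then clean up.
$tmp = Join-Path $env:TEMP "s3cat"
New-Item -ItemType Directory -Force -Path $tmp | Out-Null
aws s3 cp s3://bucket-name/path/to/folder/ $tmp --recursive --exclude "*" --include "R0.OK.*"
Get-Content (Join-Path $tmp "R0.OK.*")   # there should be exactly one match
Remove-Item $tmp -Recurse -Force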

Where are the EMR logs that are placed in S3 located on the EC2 instance running the script?

The question: Imagine I run a very simple Python script on EMR - assert 1 == 2. This script will fail with an AssertionError. The log that contains the traceback with that AssertionError will be placed (if logs are enabled) in an S3 bucket that I specified on setup, and then I can read it once the logs get dropped into S3. However, where do those logs exist before they get dropped into S3?
I presume they would exist on the EC2 instance that the particular script ran on. Let's say I'm already connected to that EC2 instance and the EMR step that the script ran on had the ID s-EXAMPLE. If I do:
[n1c9#mycomputer cwd]# gzip -d /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr.gz
[n1c9#mycomputer cwd]# cat /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
Then I'll get output with the typical 20/01/22 17:32:50 INFO Client: Application report for application_1 (state: ACCEPTED) lines that you can see in the stderr log file accessible through the EMR console.
So my question is: Where is the log (stdout) to see the actual AssertionError that was raised? It gets placed in my S3 bucket indicated for logging about 5-7 minutes after the script fails/completes, so where does it exist in EC2 before that? I ask because getting to these error logs before they are placed on S3 would save me a lot of time - basically 5 minutes each time I write a script that fails, which is more often than I'd like to admit!
What I've tried so far: I've tried checking the stdout file on the EC2 machine at the paths in the code sample above, but it is always empty.
What I'm struggling to understand is how that stdout file can be empty if there's an AssertionError traceback available on S3 minutes later (am I misunderstanding how this process works?). I also tried looking in some of the temp folders that PySpark builds, but had no luck with those either. Additionally, I've printed the outputs of the consoles for the EC2 instances running on EMR, both core and master, but none of them seem to have the relevant information I'm after.
I also looked through some of the EMR methods for boto3 and tried the describe_step method documented here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/emr.html#EMR.Client.describe_step - which, for failed steps, returns a FailureDetails JSON dict. Unfortunately, this only includes a LogFile key that links to the stderr.gz file on S3 (even if that file doesn't exist yet) and a Message key that contains a generic "Exception in thread.." message, not the stdout. Am I misunderstanding something about the existence of those logs?
Please feel free to let me know if you need any more information!
It is quite normal with log-collecting agents that the actual log files don't grow; the agent just intercepts stdout to do what it needs.
Most probably, when you configure S3 for the logs, the agent is set up either to read and delete your actual log file, or perhaps to symlink the log file somewhere else, so the file is never actually written when a process opens it for writing.
Maybe try checking whether there is a symlink there:
find -L / -samefile /mnt/var/log/hadoop/steps/s-EXAMPLE/stderr
But something other than a symlink could be used to achieve the same logic, and I didn't find anything in the AWS docs, so most probably it isn't intended that you have both the S3 copy and the local files at the same time, and maybe you won't find it.
If you want to be able to check your logs more frequently, you may want to think about installing a third-party log collector (Logstash, Beats, rsyslog, Fluentd) and shipping the logs to SolarWinds Loggly or logz.io, or setting up an ELK stack (Elasticsearch, Logstash, Kibana).
You can check this article from Loggly, or create a free account on logz.io and look at the many free shippers they support.
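As a lighter-weight stop-gap, you could also just poll the S3 log prefix and pull the step log the moment it lands. A rough PowerShell sketch, assuming the standard <log-uri>/<cluster-id>/steps/<step-id>/ layout (the bucket, cluster id and step id are placeholders):
# Poll until the step's stderr.gz appears in the S3 log location, then download it.
$logKey = "s3://my-emr-logs/j-CLUSTERID/steps/s-EXAMPLE/stderr.gz"   # placeholder path
while (-not (aws s3 ls $logKey)) { Start-Sleep -Seconds 30 }
aws s3 cp $logKey "$env:TEMP\stderr.gz"
# The file is gzip-compressed; expand it with your tool of choice before reading.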

Automating file archival from EC2 to S3 based on last modified date

I want to write an automated job that goes through my files stored on the EC2 instance's storage and checks the last-modified date. If a file is older than (x) days, it should automatically be archived to my S3 bucket.
Also, I don't want to convert the files to a zip archive for now.
What I don't understand is how to give the path to the EC2 instance storage, and how to put the condition on the last-modified date.
aws s3 sync your-new-dir-name s3://your-s3-bucket-name/folder-name
Please correct me if I have understood this wrong.
Your requirement is to archive the older files.
So you need a script that checks the modified time, and if a file has not been modified for X days, you make space by archiving it to S3 storage; you don't wish to keep the file locally.
Is that correct?
Here is some advice:
1. Please provide OS information; this would help us suggest a shell script or a PowerShell script.
Here is a PowerShell script:
$fileList = Get-ChildItem "C:\pathtofolder"
foreach ($file in $fileList) {
    $file | Select-Object -Property FullName, LastWriteTime |
        Export-Csv 'C:\fileAndDate.csv' -NoTypeInformation -Append
}
Then use aws s3 cp to copy the files to the S3 bucket.
You could do the same with a shell script.
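For illustration, a compact sketch that combines the age check and the upload (the folder, bucket and X-day threshold are placeholders; the local file is removed only after a successful copy, since the goal is to free space):
# Archive files not modified in the last $days days to S3, then remove them locally.
$days   = 30                                    # placeholder threshold
$source = "C:\data\to\archive"                  # placeholder folder
$bucket = "s3://my-archive-bucket/ec2-archive"  # placeholder bucket/prefix
$cutoff = (Get-Date).AddDays(-$days)

Get-ChildItem $source -File -Recurse |
    Where-Object { $_.LastWriteTime -lt $cutoff } |
    ForEach-Object {
        aws s3 cp $_.FullName "$bucket/$($_.Name)"        # note: flattens sub-folders
        if ($LASTEXITCODE -eq 0) { Remove-Item $_.FullName }
    }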
Using aws s3 sync is a great way to backup files to S3. You could use a command like:
aws s3 sync /home/ec2-user/ s3://my-bucket/ec2-backup/
The first parameter (/home/ec2-user/) is where you specify the source of the files. I recommend only backing up user-created files, not the whole operating system.
There is no capability for specifying a number of days. I suggest you just copy all files.
You might choose to activate Versioning to keep copies of all versions of files in S3. This way, if a file gets overwritten you can still go back to a prior version. (Storage charges will apply for all versions kept in S3.)
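For reference, versioning is a one-off bucket setting and can be enabled with a single CLI call (the bucket name is a placeholder):
aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled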

Remove processed source files after AWS Datapipeline completes

A third party sends me a daily upload of log files into an S3 bucket. I'm attempting to use DataPipeline to transform them into a slightly different format with awk, place the new files back on S3, then move the original files aside so that I don't end up processing the same ones again tomorrow.
Is there a clean way of doing this? Currently my shell command looks something like:
#!/usr/bin/env bash
set -eu -o pipefail
aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk
for f in "${INPUT1_STAGING_DIR}"/*; do
basename=$(basename "$f")
basename=${basename%%.*}
unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "${OUTPUT1_STAGING_DIR}/${basename}.tsv.gz"
done
I could use the aws cli tool to move the source file aside on each iteration of the loop, but that seems flakey - if my loop dies halfway through processing, those earlier files are going to get lost.
A few possible solutions:
Create a trigger on your S3 bucket: whenever an object is added, invoke a Lambda function (for example a Python script) that performs the transformation and copies the result to another bucket. A second Lambda, triggered on that output bucket, can then delete the original file from the first bucket.
Personally, I feel that what you have is good enough. All you need is some error handling in the shell script, and to delete (or move aside) the source file ONLY when the output file has been successfully created (you can probably also check the size of the output file), so you never lose data.