aws S3 how to show contents of file based on a pattern - amazon-web-services

I would like to perform a simple
cat dir/file.OK.*
how can this be achieved in aws?
I came up with
aws s3 cp s3://bucket-name/path/to/folder/ - --exclude="*" --include="R0.OK.*"
but this returns:
download failed: *(the path)* to - An error occurred (404) when calling the HeadObject operation: Not Found
Thank you for your help.
Additional detail: there is supposed to be only one file matching the pattern, so we could use that information. Anything else is allowed to (horribly) fail.
Edit: currently I am just running an aws s3 ls into a file, grepping it for the pattern, and then cp-ing each matching file. It works, but it's a nuisance.
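A rough one-liner version of that workaround (bucket, prefix and pattern are the ones from the command above; it assumes exactly one matching key with no spaces in its name):
key=$(aws s3 ls s3://bucket-name/path/to/folder/ | awk '{print $4}' | grep '^R0\.OK\.')
aws s3 cp "s3://bucket-name/path/to/folder/${key}" -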

I would consider an alternative to streaming the S3 content directly to stdout: simply copy the files from S3 to local disk (which lets you use --include/--exclude/--recursive), cat the files locally, and then delete them afterwards.
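A minimal sketch of that approach, using the bucket, prefix and pattern from the question (the temporary directory is arbitrary):
tmpdir=$(mktemp -d)
aws s3 cp s3://bucket-name/path/to/folder/ "$tmpdir" --recursive --exclude "*" --include "R0.OK.*"
cat "$tmpdir"/R0.OK.*
rm -rf "$tmpdir"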

Related

how to access files in s3 bucket for other commands using cli

I want to use some commands with the AWS CLI on large files that are stored in an S3 bucket, without copying the files to a local directory
(I'm familiar with the aws cp command, it's not what I want).
For example, let's say I want to use simple bash commands like "head" or "more".
If I try to use it like this:
head s3://bucketname/file.txt
but then I get:
head: cannot open ‘s3://bucketname/file.txt’ for reading: No such file or directory
How else can I do it?
Thanks in advance.
Whether a command can access a file in an S3 bucket depends entirely on the command. Under the hood, every command is just a program. When you run something like head filename, the argument filename is passed to the head program's main() function, which only knows how to open local paths. You can check out the source code here: https://github.com/coreutils/coreutils/blob/master/src/head.c
Essentially, since the head command does not support S3 URIs, you cannot do this. You can either:
Stream the S3 object to stdout and pipe it to head: aws s3 cp s3://bucketname/file.txt - | head. This may not be ideal for very large files.
Use s3curl to copy just a range of bytes (a CLI-only sketch of both options follows below).
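A sketch of both options using only the AWS CLI (bucket and key are the ones from the question; the byte range and temp path are arbitrary):
# Option 1: stream the object to stdout; head closes the pipe after the first lines
aws s3 cp s3://bucketname/file.txt - | head -n 20
# Option 2: fetch only the first kilobyte with a ranged GET, then read it locally
aws s3api get-object --bucket bucketname --key file.txt --range bytes=0-1023 /tmp/first-kb.txt
head /tmp/first-kb.txt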

Get details of an S3 sync failure

We run S3 sync commands in a SQL Job which syncs a local directory to an S3 bucket. At times, we'll get a sync "error" with an error code of 1, or sometimes 2. The documentation lists what each code means; error code 1 provides fewer details and leaves more questions about the problem. It simply states "One or more Amazon S3 transfer operations failed. Limited to S3 commands."
When I run a sync command in a PowerShell script and encounter an error (e.g. a synced document being open), the window displays an error message showing which specific file is causing the problem.
How can I capture those details in my SQL job?
I have solved this problem...
Using a PowerShell script, in the AWS s3 sync command we output the results to a text file:
aws s3 sync c:\source\dir s3://target/dir/ > C:\Source\s3Inventory\SyncOutput.txt
Then read the text file contents into a string variable:
$S3Output = Get-Content -Path C:\Source\s3Inventory\SyncOutput.txt -Raw
If the $LASTEXITCODE from the sync command does not equal 0 (indicating an error), then we send an email with the results:
if ($LASTEXITCODE -ne 0)
{
#Send email containing the value of string variable $S3Output
}
This has been placed into production and we were finally able to determine which file/object was failing.
One could certainly attach the text file to the email, rather than reading the contents into a string.
Hope this helps others!
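If the sync runs from a plain shell rather than PowerShell, a rough equivalent might look like this (the log path and the mail command/recipient are placeholders; stderr is redirected as well in case the per-file error details are written there):
aws s3 sync /source/dir s3://target/dir/ > /tmp/SyncOutput.txt 2>&1
if [ $? -ne 0 ]; then
    # non-zero exit code: send the captured output somewhere useful
    mail -s "S3 sync failed" ops@example.com < /tmp/SyncOutput.txt
fi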

PowerShell Copy Files to AWS Bucket with Object Lock

I want to copy CSV files generated by an SSIS package from an AWS EC2 server to an S3 bucket. Each time I try, I get a Content-MD5 HTTP header error because we have Object Lock enabled on the bucket.
Write-S3Object : Content-MD5 HTTP header is required for Put Object requests with Object Lock parameters
I would assume there is a PowerShell parameter I can add, or that I am missing something, but after furious googling I cannot find a resolution. Any help or an alternative option would be appreciated.
I am now testing using the AWS CLI process instead of PowerShell.
If you do want to continue to use the Write-S3Object PowerShell command the missing magic flag is:
-CalculateContentMD5Header 1
So the final command will be
Write-S3Object -Region $region -BucketName $bucketName -File $fileToBackup -Key $destinationFileName -CalculateContentMD5Header 1
https://docs.aws.amazon.com/powershell/latest/reference/items/Write-S3Object.html
After a lot of testing, reading and frustration I found the AWS CLI was able to do exactly what I needed. I am unsure if this is an issue with my PowerShell knowledge or a missing feature (I lean toward my knowledge).
I created a bat file that uses the CLI to move the files into the S3 bucket, and then called that bat file from an SSIS Execute Process Task.
Dropped the one-line command below in case it helps others.
aws s3 mv C:\path\to\files\ s3://your.s3.bucket.name/ --recursive

AWS S3 Deleting file while someone is downloading that object

I can't seem to find how AWS S3 behaves if someone deletes a file while another person is downloading it.
Does it behave like a Unix filesystem, where the descriptor stays open and the file downloads without problems, or does it behave some other way?
Thanks for help!
S3 offers eventual consistency for DELETES.
From the S3 Data Consistency Model
A process deletes an existing object and immediately tries to read it.
Until the deletion is fully propagated, Amazon S3 might return the
deleted data.
Here, where the deletion and the download of the same object happen concurrently, even if the deletion succeeds before the download completes, the process will still be able to download the data.
You will face a race condition, of sorts, so the outcome is unpredictable. When you download a file from S3, you might be connected to multiple S3 servers. If at any time, you request part of an S3 object and the server you are connected to thinks the object has been deleted then your download will fail.
Here's a simple test: store a 2GB file in S3, then download it. While it is downloading, go into the S3 console and delete the object. You will find that your download fails (with NoSuchKey) because the specified key no longer exists.
Create a temporary 2GB file and upload it to S3:
$ mkfile -n 2g 2gb.dat
$ aws s3 cp 2gb.dat s3://mybucket
upload: ./2gb.dat to s3://mybucket/2gb.dat
Once complete, start downloading the file:
$ aws s3 cp s3://mybucket/2gb.dat fred.dat
Completed 162.2 MiB/2.0 GiB (46.9 MiB/s) with 1 file(s) remaining
Then jump over to the S3 console and delete 2gb.dat, and this will happen:
$ aws s3 cp s3://mybucket/2gb.dat fred.dat
download failed: s3://mybucket/2gb.dat to ./fred.dat An error occurred
(NoSuchKey) when calling the GetObject operation: The specified key does not exist.

Remove processed source files after AWS Datapipeline completes

A third party sends me a daily upload of log files into an S3 bucket. I'm attempting to use DataPipeline to transform them into a slightly different format with awk, place the new files back on S3, then move the original files aside so that I don't end up processing the same ones again tomorrow.
Is there a clean way of doing this? Currently my shell command looks something like:
#!/usr/bin/env bash
set -eu -o pipefail
aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk
for f in "${INPUT1_STAGING_DIR}"/*; do
  filename="${f##*/}"          # strip the directory
  basename="${filename%%.*}"   # strip the extension(s)
  unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "${OUTPUT1_STAGING_DIR}/${basename}.tsv.gz"
done
I could use the aws cli tool to move the source file aside on each iteration of the loop, but that seems flaky: if my loop dies halfway through processing, those earlier files are going to get lost.
A few possible solutions:
Create a trigger on your S3 bucket: whenever an object is added to the bucket, invoke a Lambda function (e.g. a Python script) that performs the transformation and copies the result to another bucket. On that second bucket, another Lambda function is then invoked which deletes the file from the first bucket.
Personally, I feel that what you have already achieved is good enough. All you need is error handling in the shell script: delete (or move aside) the source file ONLY when the output file has been successfully created, so you never lose data (you can probably also check the size of the output file); see the sketch below.
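A sketch of that second suggestion, extending the original script so each source object is moved aside only after its transformed output exists and is non-empty (the source and archive prefixes s3://example/logs/ and s3://example/processed/ are placeholders):
#!/usr/bin/env bash
set -eu -o pipefail
aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk
for f in "${INPUT1_STAGING_DIR}"/*; do
  filename="${f##*/}"
  basename="${filename%%.*}"
  out="${OUTPUT1_STAGING_DIR}/${basename}.tsv.gz"
  unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "$out"
  # only move the original aside once its transformed output exists and is non-empty
  if [ -s "$out" ]; then
    aws s3 mv "s3://example/logs/${filename}" "s3://example/processed/${filename}"
  fi
done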