AWS S3 Glacier - Programmatically Initiate Restore

I have been writing a web app using S3 for storage and Glacier for backup, so I set up a lifecycle policy to archive the objects. Now I want the web app to list the archived files; the user should be able to initiate a restore from it and then get an email once their restore is complete.
The trouble I am running into is that I can't find a PHP SDK command I can call to initiate a restore. It would also be nice if it notified SNS when the restore was complete; SNS would push the JSON onto SQS, I would poll SQS, and finally email the user when polling detected a completed restore.
Any help or suggestions would be nice.
Thanks.

You could also use the AWS CLI tool like so (here I'm assuming you want to restore all files in one directory):
aws s3 ls s3://myBucket/myDir/ | awk '{if ($4) print $4}' > myFiles.txt
for x in $(cat myFiles.txt)
do
    echo "restoring $x"
    aws s3api restore-object \
        --bucket myBucket \
        --key "myDir/$x" \
        --restore-request '{"Days":30}'
done
Regarding your desire for notification, the CLI tool will report "A client error (RestoreAlreadyInProgress) occurred: Object restore is already in progress" if the request has already been initiated, and probably a different message once the restore completes. You could run this restore command several times, looking for a "restore done" error/message. Pretty hacky of course; there's probably a better way with the AWS CLI tool.
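A less hacky route, if it fits your setup, is to have S3 notify SNS itself: bucket notifications support an s3:ObjectRestore:Completed event. A minimal sketch, assuming a placeholder bucket name and topic ARN (and that the topic's access policy allows S3 to publish to it):
# Send restore-completed events for this bucket to an SNS topic
cat > notification.json <<'EOF'
{
  "TopicConfigurations": [
    {
      "TopicArn": "arn:aws:sns:us-east-1:123456789012:restore-complete-topic",
      "Events": ["s3:ObjectRestore:Completed"]
    }
  ]
}
EOF

aws s3api put-bucket-notification-configuration \
    --bucket myBucket \
    --notification-configuration file://notification.json
From there the topic can fan out to your SQS queue and the polling/email step you described.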
Caveat: be careful with Glacier restores that exceed the allotted free-restore amount/period. If you restore too much data too quickly, charges can pile up fast.

I wrote something fairly similar. I can't speak to any PHP API; however, there's a simple HTTP POST that kicks off Glacier restoration.
Since that happens asynchronously (and takes up to 5 hours), you have to set up a process to poll the files being restored by making HEAD requests for each object, which will carry restoration status info in an x-amz-restore header.
If it helps, my ruby code for parsing this header looks like this:
require 'date'

# The x-amz-restore header looks like
#   ongoing-request="false", expiry-date="Fri, 21 Dec 2012 00:00:00 GMT"
# while a restore is still running it only carries ongoing-request="true".
if restore = headers['x-amz-restore']
  if restore.first =~ /ongoing-request="(.+?)", expiry-date="(.+?)"/
    restoring = $1 == "true"
    restore_date = DateTime.parse($2)
  elsif restore.first =~ /ongoing-request="(.+?)"/
    restoring = $1 == "true"
  end
end
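If you'd rather poll from the command line, the same information comes back on a HEAD request via the CLI; a small sketch (bucket and key are placeholders):
# The Restore field mirrors the x-amz-restore header:
#   ongoing-request="true"  -> restore still in progress
#   ongoing-request="false" -> restored copy available until expiry-date
aws s3api head-object \
    --bucket myBucket \
    --key "myDir/myFile.gz" \
    --query Restore \
    --output text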

Related

PowerShell Copy Files to AWS Bucket with Object Lock

I want to copy CSV files generated by an SSIS package from an AWS EC2 server to an S3 bucket. Each time I try, I get an error about the Content-MD5 HTTP header because we have Object Lock enabled on the bucket.
Write-S3Object : Content-MD5 HTTP header is required for Put Object requests with Object Lock parameters
I would assume there is a PowerShell parameter I can add, or that I am missing something, but after furious googling I cannot find a resolution. Any help or an alternative option would be appreciated.
I am now testing using the AWS CLI process instead of PowerShell.
If you do want to continue to use the Write-S3Object PowerShell command, the missing magic flag is:
-CalculateContentMD5Header 1
So the final command will be:
Write-S3Object -Region $region -BucketName $bucketName -File $fileToBackup -Key $destinationFileName -CalculateContentMD5Header 1
https://docs.aws.amazon.com/powershell/latest/reference/items/Write-S3Object.html
After a lot of testing, reading and frustration I found the AWS CLI was able to do exactly what I needed. I am unsure if this is an issue with my PowerShell knowledge or a missing feature (I lean toward my knowledge).
I created a bat file that used the CLI to move the files into the S3 bucket, then called this bat file from an SSIS Execute Process task.
Dropped the one-line command below just in case it may help others.
aws s3 mv C:\path\to\files\ s3://your.s3.bucket.name/ --recursive

Automating the maintenance of Athena views

I am currently working on creating a data lake where we can compile, combine and analyze multiple data sets in S3.
I am using Athena and Quicksight as a central part of this to be able to quickly query and explore the data. To make things easier in Quicksight for end-users, I am creating many Athena views that do some basic transformation and aggregations.
I would like to be able to source control my views and create some automation around them so that we can have a code-driven approach and not rely on users manually updating views and running DDL to update the definitions.
There does not seem to be any support in Cloudformation for Athena views.
My current approach would be to just save the create or replace view as ... DDL in an .sql file in source control and then create some sort of script that runs the DDL so it could be made part of a continuous integration solution.
Anyone have any other experience with automation and CI for Athena views?
Long time since OP posted, but here is a bash script to do just that. You can use this script on the CI of your choice.
This script assumes that you have a directory with all your .sql view definition files. Within that directory there is an env file (views.env below) used to make some deploy-time replacements of shell environment variables.
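For illustration only, a hypothetical example of that layout (database, view and column names are made up) might look like this; the script then substitutes DEPLOY_ENVIRONMENT and SOURCE_DATABASE via envsubst:
mkdir -p views

# Hypothetical deploy-time variables (views/views.env)
cat > views/views.env <<'EOF'
SOURCE_DATABASE=analytics_${DEPLOY_ENVIRONMENT}
EOF

# Hypothetical view definition referencing a variable from views.env
cat > views/orders_summary.sql <<'EOF'
CREATE OR REPLACE VIEW orders_summary AS
SELECT order_date, count(*) AS order_count
FROM ${SOURCE_DATABASE}.orders
GROUP BY order_date
EOF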
#!/bin/bash
export VIEWS_DIRECTORY="views"
export DEPLOY_ENVIRONMENT="dev"
export ENV_FILENAME="views.env"
export OUTPUT_BUCKET="your-bucket-name-$DEPLOY_ENVIRONMENT"
export OUTPUT_PREFIX="your-prefix"
export AWS_PROFILE="your-profile"

cd $VIEWS_DIRECTORY

# Create final .env file with any shell env replaced
env_file=".env"
envsubst < $ENV_FILENAME > $env_file

# Export variables in .env as shell environment variables
export $(grep -v '^#' ./$env_file | xargs)

# Loop through all SQL files replacing env variables and pushing to AWS
FILES="*.sql"
for view_file in $FILES
do
    echo "Processing $view_file file..."

    # Replacing env variables in query file
    envsubst < $view_file > query.sql

    # Running query via AWS CLI
    query=$(<query.sql) \
        && query_execution_id=$(aws athena start-query-execution \
            --query-string "$query" \
            --result-configuration "OutputLocation=s3://${OUTPUT_BUCKET}/${OUTPUT_PREFIX}" \
            --profile $AWS_PROFILE \
            | jq -r '.QueryExecutionId')

    # Checking for successful query completion
    echo "Query executionID: $query_execution_id"
    while :
    do
        query_state=$(aws athena get-query-execution \
            --query-execution-id $query_execution_id \
            --profile $AWS_PROFILE \
            | jq '.QueryExecution.Status.State')

        echo "Query state: $query_state"

        if [[ "$query_state" == '"SUCCEEDED"' ]]; then
            echo "Query ran successfully"
            break
        elif [[ "$query_state" == '"FAILED"' ]]; then
            echo "Query failed with ExecutionID: $query_execution_id"
            exit 1
        elif [ -z "$query_state" ]; then
            echo "Unexpected error. Terminating routine."
            exit 1
        else
            echo "Waiting for query to finish running..."
            sleep 1
        fi
    done
done
I think you could use AWS Glue
When Should I Use AWS Glue?
You can use AWS Glue to build a data warehouse to organize, cleanse, validate, and format data. You can transform and move AWS Cloud data into your data store. You can also load data from disparate sources into your data warehouse for regular reporting and analysis. By storing it in a data warehouse, you integrate information from different parts of your business and provide a common source of data for decision making.
AWS Glue simplifies many tasks when you are building a data warehouse:
Discovers and catalogs metadata about your data stores into a central catalog. You can process semi-structured data, such as clickstream or process logs.
Populates the AWS Glue Data Catalog with table definitions from scheduled crawler programs. Crawlers call classifier logic to infer the schema, format, and data types of your data. This metadata is stored as tables in the AWS Glue Data Catalog and used in the authoring process of your ETL jobs.
Generates ETL scripts to transform, flatten, and enrich your data from source to target.
Detects schema changes and adapts based on your preferences.
Triggers your ETL jobs based on a schedule or event. You can initiate jobs automatically to move your data into your data warehouse. Triggers can be used to create a dependency flow between jobs.
Gathers runtime metrics to monitor the activities of your data warehouse.
Handles errors and retries automatically.
Scales resources, as needed, to run your jobs.
https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Automating file archival from EC2 to S3 based on last modified date

I want to write an automated job that goes through the files stored on the EC2 instance's storage and checks their last modified date. If a file has not been modified for more than (x) days, it should automatically get archived to my S3 bucket.
Also, I don't want to convert the files to zip files for now.
What I don't understand is how to give the path to the EC2 instance storage and how to express the condition on the last modified date.
aws s3 sync your-new-dir-name s3://your-s3-bucket-name/folder-name
Please correct me if I understand this wrong: your requirement is to archive the older files. So you need a script that checks the modified time and, if a file has not been modified for X days, makes space by archiving it to S3 storage; you don't wish to keep the file locally. Is that correct?
Here is some advice:
1. Please provide OS information; this would help us suggest either a shell script or a PowerShell script.
Here is a PowerShell script:
# Collect each file's full path and last-modified time into a CSV
$fileList = Get-ChildItem "C:\pathtofolder" -File
$fileList |
    Select-Object -Property FullName, LastWriteTime |
    Export-Csv 'C:\fileAndDate.csv' -NoTypeInformation
Then use aws s3 cp to copy the selected files to the S3 bucket.
You would do the same with a shell script; a sketch follows.
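For the shell-script version, a minimal sketch, assuming a made-up source directory, bucket name and a 30-day threshold, could use find -mtime to select the old files:
#!/usr/bin/env bash
# Archive files not modified in the last 30 days to S3, then remove them locally.
SRC_DIR="/home/ec2-user/data"
BUCKET="s3://my-archive-bucket/ec2-archive"

find "$SRC_DIR" -type f -mtime +30 -print0 |
while IFS= read -r -d '' f; do
    # Keep the directory structure under the bucket prefix
    rel="${f#$SRC_DIR/}"
    if aws s3 cp "$f" "$BUCKET/$rel"; then
        rm -- "$f"          # only delete after a successful upload
    fi
done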
Using aws s3 sync is a great way to back up files to S3. You could use a command like:
aws s3 sync /home/ec2-user/ s3://my-bucket/ec2-backup/
The first parameter (/home/ec2-user/) is where you specify the source of the files. I recommend only backing up user-created files, not the whole operating system.
There is no capability for specifying a number of days. I suggest you just copy all files.
You might choose to activate Versioning to keep copies of all versions of files in S3. This way, if a file gets overwritten you can still go back to a prior version. (Storage charges will apply for all versions kept in S3.)
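Turning versioning on is a single API call; for example (bucket name is a placeholder):
aws s3api put-bucket-versioning \
    --bucket my-bucket \
    --versioning-configuration Status=Enabled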

Remove processed source files after AWS Datapipeline completes

A third party sends me a daily upload of log files into an S3 bucket. I'm attempting to use DataPipeline to transform them into a slightly different format with awk, place the new files back on S3, then move the original files aside so that I don't end up processing the same ones again tomorrow.
Is there a clean way of doing this? Currently my shell command looks something like:
#!/usr/bin/env bash
set -eu -o pipefail

aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk

for f in "${INPUT1_STAGING_DIR}"/*; do
    # Strip the directory and extension(s) to get the bare file name
    basename=$(basename "$f")
    basename=${basename%%.*}
    unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "${OUTPUT1_STAGING_DIR}/$basename.tsv.gz"
done
I could use the AWS CLI tool to move the source file aside on each iteration of the loop, but that seems flaky - if my loop dies halfway through processing, those earlier files are going to get lost.
A few possible solutions:
1. Create a trigger on your S3 bucket: whenever an object is added to the bucket, invoke a Lambda function (which can be a Python script) that performs the transformation and copies the result to another bucket. On that other bucket, another Lambda function is invoked that deletes the file from the first bucket.
2. I personally feel what you have achieved is good enough. All you need is exception handling in the shell script, and to delete the source file (never lose data) ONLY when the output file has been successfully created (you could probably also check the size of the output file); see the sketch below.
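Following that second suggestion, a minimal sketch of the "move the source aside only on success" idea, assuming made-up s3://example/incoming/ and s3://example/processed/ prefixes and the staging directories Data Pipeline already provides:
#!/usr/bin/env bash
set -u -o pipefail

aws s3 cp s3://example/processor/transform.awk /tmp/transform.awk

for f in "${INPUT1_STAGING_DIR}"/*; do
    name=$(basename "$f")
    base=${name%%.*}
    if unzip -p "$f" | awk -f /tmp/transform.awk | gzip > "${OUTPUT1_STAGING_DIR}/$base.tsv.gz"; then
        # Only move the original aside once its transformed output exists
        aws s3 mv "s3://example/incoming/$name" "s3://example/processed/$name"
    else
        echo "Failed to process $name; leaving the source in place" >&2
    fi
done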

Cannot delete Amazon S3 key that contains bad character

I just began to use S3 recently. I accidentally made a key that contains a bad character, and now I can't list the contents of that folder, nor delete that bad key. (I've since added checks to make sure I don't do this again).
I was using an old "S3" python module from 2008 originally. Now I've switched to boto 2.0, and I still cannot delete it. I did quite a bit of research online, and it seems the problem is that I have an invalid XML character, so it seems to be a problem at the lowest level, and no API has helped so far.
I finally contacted Amazon, and they said to use "s3-curl.pl" from http://aws.amazon.com/code/128. I downloaded it, and here's my key:
<Key>info/[01</Key>
I think I was doing a quick bash for loop over some files at the time, and I have "lscolors" set up, and so this happened.
I tried
./s3curl.pl --id <myID> --key <myKEY> -- -X DELETE https://mybucket.s3.amazonaws.com/info/[01
(and also tried putting the URL in single/double quotes, and also tried to escape the '[').
Without quotes on the URL, it hangs. With quotes, I get "curl: (3) [globbing] error: bad range specification after pos 50". I edited the s3-curl.pl to do curl --globoff and still get this error.
I would appreciate any help.
This solved the issue: just delete the main folder:
aws s3 rm "s3://BUCKET_NAME/folder/folder" --recursive
You can use the s3cmd tool from here. You first need to run
s3cmd fixbucket <bucket name that contains bad file>.
You can then delete the file using
s3cmd del <bucket>/<file>
In my case there were newlines in the key (however that happened...). I was able to fix it with the AWS CLI like this:
aws s3 rm "s3://my_bucket/Icon"$'\r'
I also had versioning enabled, so I needed to do this for all the versions as well (version IDs are visible in the UI when you enable the version view):
aws s3api delete-object --bucket my_bucket --key "Icon"$'\r' --version-id <version_id>
I was in this situation recently. To list the items you can use:
aws s3api list-objects-v2 --bucket my_bucket --encoding-type url
the bad keys will come back url encoded like:
"Key": "%01%C3%B4%C2%B3%C3%8Bu%C2%A5%27%40yr%3E%60%0EQ%14%C3%A5.gif"
Spaces became + and I had to change those to %20, and * wasn't encoded, so I had to replace those with %2A, before I was able to delete them.
To actually delete them, I wasn't able to use the AWS CLI because it would URL-encode the already URL-encoded key, resulting in a 404, so to get around that I manually hit the REST API with the DELETE verb.
I recently encountered this case. I had a newline at the end of my bucket path. The following command solved the matter.
aws s3 rm "s3://bucket_name"$'\r' --recursive