How to paginate over an AWS CLI response? - amazon-web-services

I'm trying to paginate over EC2 Reserved Instance offerings, but can't seem to paginate via the CLI (see below).
% aws ec2 describe-reserved-instances-offerings --max-results 20
{
"NextToken": "someToken",
"ReservedInstancesOfferings": [
{
...
}
]
}
% aws ec2 describe-reserved-instances-offerings --max-results 20 --starting-token someToken
Parameter validation failed:
Unknown parameter in input: "PaginationConfig", must be one of: DryRun, ReservedInstancesOfferingIds, InstanceType, AvailabilityZone, ProductDescription, Filters, InstanceTenancy, OfferingType, NextToken, MaxResults, IncludeMarketplace, MinDuration, MaxDuration, MaxInstanceCount
The documentation found in [1] says to use --starting-token. How am I supposed to do this?
[1] http://docs.aws.amazon.com/cli/latest/reference/ec2/describe-reserved-instances-offerings.html

With deference to marjamis's 2017 solution, which must have worked on an earlier CLI version, here is a working approach for paginating AWS CLI results in bash, tested on a Mac laptop with aws-cli/2.1.2:
# The scope of this example requires that credentials are already available or
# are passed in with the AWS CLI command.
# The parsing example uses jq, available from https://stedolan.github.io/jq/
# The below command is the one being executed and should be adapted appropriately.
# Note that the max items may need adjusting depending on how many results are returned.
aws_command="aws emr list-instances --max-items 333 --cluster-id $active_cluster"
unset NEXT_TOKEN

function parse_output() {
  if [ ! -z "$cli_output" ]; then
    # The output parsing below also needs to be adapted as needed.
    echo $cli_output | jq -r '.Instances[] | "\(.Ec2InstanceId)"' >> listOfinstances.txt
    NEXT_TOKEN=$(echo $cli_output | jq -r ".NextToken")
  fi
}

# The command is run and output parsed in the below statements.
cli_output=$($aws_command)
parse_output

# The below while loop runs until either the command errors due to throttling or
# comes back with a pagination token. In the case of being throttled / throwing
# an error, it sleeps for three seconds and then tries again.
while [ "$NEXT_TOKEN" != "null" ]; do
  if [ "$NEXT_TOKEN" == "null" ] || [ -z "$NEXT_TOKEN" ]; then
    echo "now running: $aws_command "
    sleep 3
    cli_output=$($aws_command)
    parse_output
  else
    echo "now paginating: $aws_command --starting-token $NEXT_TOKEN"
    sleep 3
    cli_output=$($aws_command --starting-token $NEXT_TOKEN)
    parse_output
  fi
done #pagination loop

Looks like some busted documentation.
If you run the following, this works:
aws ec2 describe-reserved-instances-offerings --max-results 20 --next-token someToken
Translating the error message: it expects NextToken, which is represented as --next-token on the CLI.
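If you want to walk every page rather than just the next one, you can loop on --next-token yourself. A rough sketch (not tested; it uses jq, and the field names come from the output shown in the question):
token=""
while :; do
  if [ -z "$token" ]; then
    page=$(aws ec2 describe-reserved-instances-offerings --max-results 20 --output json)
  else
    page=$(aws ec2 describe-reserved-instances-offerings --max-results 20 --next-token "$token" --output json)
  fi
  # Collect whatever you need from this page (adjust the jq filter to taste).
  echo "$page" | jq -r '.ReservedInstancesOfferings[].ReservedInstancesOfferingId' >> offerings.txt
  # Stop when the response no longer carries a NextToken.
  token=$(echo "$page" | jq -r '.NextToken // empty')
  [ -z "$token" ] && break
done
Depending on your CLI version you may also want to add --no-paginate so the CLI does not try to auto-paginate on top of this loop.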

If you continue to read the reference documentation that you provided, you will learn that:
--starting-token (string)
A token to specify where to start paginating. This is the NextToken from a previously truncated response.
Moreover:
--max-items (integer)
The total number of items to return. If the total number of items available is more than the value specified in max-items then a NextToken will be provided in the output that you can use to resume pagination.
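Putting those two options together, an illustrative session looks like this; the token value is whatever NextToken the previous call printed:
# First page: let the CLI cap the output at 20 items.
aws ec2 describe-reserved-instances-offerings --max-items 20
# The response ends with a NextToken field; feed it back to fetch the next page.
aws ec2 describe-reserved-instances-offerings --max-items 20 --starting-token <NextToken-from-previous-output>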

Environment Variables in newest AWS EC2 instance

I am trying to get environment variables into an EC2 instance (trying to run a Django app on Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type, ami-0ff8a91507f77f867). How do you get them in on the newest version of Amazon Linux, or get logging so the problem can be traced?
user-data text (modified from here):
#!/bin/bash
#trying to get a file made
touch /tmp/testfile.txt
cat 'This and that' > /tmp/testfile.txt
#trying to log
echo 'Woot!' > /home/ec2-user/user-script-output.txt
#Trying to get the output logged to see what is going wrong
exec > >(tee /var/log/user-data.log|logger -t user-data ) 2>&1
#trying to log
echo "XXXXXXXXXX STARTING USER DATA SCRIPT XXXXXXXXXXXXXX"
#trying to store the ENVIRONMENT VARIABLES
PARAMETER_PATH='/'
REGION='us-east-1'
# Functions
AWS="/usr/local/bin/aws"
get_parameter_store_tags() {
  echo $($AWS ssm get-parameters-by-path --with-decryption --path ${PARAMETER_PATH} --region ${REGION})
}
params_to_env () {
  params=$1
  # If .Tags does not exist we assume an ssm Parameters object.
  SELECTOR="Name"
  for key in $(echo $params | /usr/bin/jq -r ".[][].${SELECTOR}"); do
    value=$(echo $params | /usr/bin/jq -r ".[][] | select(.${SELECTOR}==\"$key\") | .Value")
    key=$(echo "${key##*/}" | /usr/bin/tr ':' '_' | /usr/bin/tr '-' '_' | /usr/bin/tr '[:lower:]' '[:upper:]')
    export $key="$value"
    echo "$key=$value"
  done
}
# Get TAGS
if [ -z "$PARAMETER_PATH" ]
then
echo "Please provide a parameter store path. -p option"
exit 1
fi
TAGS=$(get_parameter_store_tags ${PARAMETER_PATH} ${REGION})
echo "Tags fetched via ssm from ${PARAMETER_PATH} ${REGION}"
echo "Adding new variables..."
params_to_env "$TAGS"
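For reference, params_to_env expects the JSON shape that aws ssm get-parameters-by-path returns, i.e. a top-level Parameters array of objects with Name and Value (other fields are ignored). A purely illustrative local test with made-up parameters:
sample='{
  "Parameters": [
    { "Name": "/myapp/DB-HOST", "Value": "db.example.internal" },
    { "Name": "/myapp/django:secret", "Value": "not-a-real-secret" }
  ]
}'
params_to_env "$sample"
# Expected output (path stripped, ':' and '-' become '_', upper-cased):
# DB_HOST=db.example.internal
# DJANGO_SECRET=not-a-real-secret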
Notes -
What I think I know but am unsure of
the user-data script is only run when the instance is created, not when I stop and then start it, as mentioned here (although that page also says [I think this is outdated] that the output is logged to /var/log/cloud-init-output.log)
I may not be starting the instance correctly
I don't know where to store the bash script so that it can be executed
What I have verified
the user-data text is on the instance by ssh-ing in and curl http://169.254.169.254/latest/user-data shows the current text (#!/bin/bash …)
What I've tried
editing rc.local directly to export AWS_ACCESS_KEY_ID='JEFEJEFEJEFEJEFE' … and the like
putting them in the AWS Parameter Store (and can see them via the correct call, I just can't trace getting them into the EC2 instance without logs or confirming if the user-data is getting run)
putting ENV variables in Tags and importing them as mentioned here:
tried outputting the logs to other files as suggested here (Not seeing any log files in the ssh instance or on the system log)
viewing the System Log on the aws webpage to see any errors/logs via selecting the instance -> 'Actions' -> 'Instance Settings' -> 'Get System Log' (not seeing any commands run or log statements [only 1 unrelated word of user])

aws cli returns an extra 'None' when fetching the first element using --query parameter and with --output text

I am getting an extra None in aws-cli (version 1.11.160) with --query parameter and --output text when fetching the first element of the query output.
See the examples below.
$ aws kms list-aliases --query "Aliases[?contains(AliasName,'alias/foo')].TargetKeyId|[0]" --output text
a3a1f9d8-a4de-4d0e-803e-137d633df24a
None
$ aws kms list-aliases --query "Aliases[?contains(AliasName,'alias/foo-bar')].TargetKeyId|[0]" --output text
None
None
As far as I know this was working until yesterday, but from today onwards this extra None comes in and is killing our Ansible tasks.
Anyone experienced anything similar?
Thanks
I started having this issue in the past few days too. In my case I was querying exports from a cfn stack.
My solution was (since I'll only ever get one result from the query) to change | [0].Value to .Value, which works with --output text.
Some examples:
$ aws cloudformation list-exports --query 'Exports[?Name==`kms-key-arn`] | []'
[
{
"ExportingStackId": "arn:aws:cloudformation:ap-southeast-2:111122223333:stack/stack-name/83ea7f30-ba0b-11e8-8b7d-50fae957fc4a",
"Name": "kms-key-arn",
"Value": "arn:aws:kms:ap-southeast-2:111122223333:key/a13a4bad-672e-45a3-99c2-c646a9470ffa"
}
]
$ aws cloudformation list-exports --query 'Exports[?Name==`kms-key-arn`] | [].Value'
[
"arn:aws:kms:ap-southeast-2:111122223333:key/a13a4bad-672e-45a3-99c2-c646a9470ffa"
]
$ aws cloudformation list-exports --query 'Exports[?Name==`kms-key-arn`] | [].Value' --output text
arn:aws:kms:ap-southeast-2:111122223333:key/a13a4bad-672e-45a3-99c2-c646a9470ffa
aws cloudformation list-exports --query 'Exports[?Name==`kms-key-arn`] | [0].Value' --output text
arn:aws:kms:ap-southeast-2:111122223333:key/a13a4bad-672e-45a3-99c2-c646a9470ffa
None
I'm no closer to finding out why it's happening, but it disproves @LHWizard's theory, or at least indicates there are conditions where that explanation isn't sufficient.
The best explanation is that not every match for your query statement has a TargetKeyId. On my account, there are several Aliases that only have AliasArn and AliasName key/value pairs. The None comes from a null value for TargetKeyId, in other words.
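If that's the cause, one way to avoid the trailing None (a sketch, untested against your data) is to add a truthiness check to the filter so aliases without a TargetKeyId are excluded:
aws kms list-aliases \
  --query "Aliases[?contains(AliasName,'alias/foo') && TargetKeyId].TargetKeyId | [0]" \
  --output text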
I came across the same issue when listing step functions. I consider it to be a bug. I don't like solutions that ignore the first or last element, expecting it will always be None at that position - at some stage the issue will get fixed and your workaround has introduced a nasty bug.
So, in my case, I did this as a safe workaround (adapt to your needs):
#!/usr/bin/env bash
arn="<step function arn goes here>"
arns=()
for arn in $(aws stepfunctions list-executions --state-machine-arn "$arn" --max-items 50 --query 'executions[].executionArn' --output text); do
[[ $arn == 'None' ]] || arns+=("$arn")
done
# process execution arns
for arn in "${arns[@]}"; do
echo "$arn" # or whatever
done
Supposing you need only the first value:
Replace --output text with --output json and you can parse it with jq.
Therefore, you'll have something like:
aws kms list-aliases --query "Aliases[?contains(AliasName,'alias/foo')].TargetKeyId|[0]" --output json | jq -r '.'
P.S. the -r option with jq is there to remove the quotes around the response.

Delete older than month AWS EC2 snapshots

Will the command given below work to delete AWS EC2 snapshots older than a month?
aws describe-snapshots | grep -v (date +%Y-%m-) | grep snap- | awk '{print $2}' | xargs -n 1 -t aws delete-snapshot
Your command won't work mostly because of a typo: aws describe-snapshots should be aws ec2 describe-snapshots.
Anyway, you can do this without any other tools than aws:
snapshots_to_delete=$(aws ec2 describe-snapshots --owner-ids xxxxxxxxxxxx --query 'Snapshots[?StartTime<=`2017-02-15`].SnapshotId' --output text)
echo "List of snapshots to delete: $snapshots_to_delete"
# actual deletion
for snap in $snapshots_to_delete; do
aws ec2 delete-snapshot --snapshot-id $snap
done
Make sure you always know what you are deleting, by echoing $snap, for example.
Also, adding --dry-run to aws ec2 delete-snapshot can show you that there are no errors in request.
Edit:
There are two things to pay attention at in the first command:
--owner-ids - your account's unique ID. It can easily be found manually in the top right corner of the AWS Console: Support -> Support Center -> Account Number xxxxxxxxxxxx
--query - a JMESPath query which gets only snapshots created on or before the specified date (e.g. 2017-02-15): Snapshots[?StartTime<=`2017-02-15`].SnapshotId
+1 to @roman-zhuzha for getting me close. I did have trouble when $snapshots_to_delete didn't parse into a long string of snapshot IDs separated by whitespace.
This script, below, does parse them into a long string of snapshot IDs separated by whitespace on my Ubuntu 14.04 (trusty) in bash with awscli 1.16:
#!/usr/bin/env bash
dry_run=1
echo_progress=1

d=$(date +'%Y-%m-%d' -d '1 month ago')

if [ $echo_progress -eq 1 ]
then
  echo "Date of snapshots to delete (if older than): $d"
fi

snapshots_to_delete=$(aws ec2 describe-snapshots \
  --owner-ids xxxxxxxxxxxxx \
  --output text \
  --query "Snapshots[?StartTime<'$d'].SnapshotId" \
)

if [ $echo_progress -eq 1 ]
then
  echo "List of snapshots to delete: $snapshots_to_delete"
fi

for oldsnap in $snapshots_to_delete; do

  # some $oldsnaps will be in use, so you can't delete them
  # for "snap-a1234xyz" currently in use by "ami-zyx4321ab"
  # (and others it can't delete) add conditionals like this
  if [ "$oldsnap" = "snap-a1234xyz" ] ||
     [ "$oldsnap" = "snap-c1234abc" ]
  then
    if [ $echo_progress -eq 1 ]
    then
      echo "skipping $oldsnap known to be in use by an ami"
    fi
    continue
  fi

  if [ $echo_progress -eq 1 ]
  then
    echo "deleting $oldsnap"
  fi

  if [ $dry_run -eq 1 ]
  then
    # dryrun will not actually delete the snapshots
    aws ec2 delete-snapshot --snapshot-id $oldsnap --dry-run
  else
    aws ec2 delete-snapshot --snapshot-id $oldsnap
  fi

done
Switch these variables as necessary:
dry_run=1 # set this to 0 to actually delete
echo_progress=1 # set this to 0 to not echo stmnts
Change the date -d string to a human readable version of the number of days, months, or years back you want to delete "older than":
d=$(date +'%Y-%m-%d' -d '15 days ago') # half a month
Find your account id and update these XXXX's to that number:
--owner-ids xxxxxxxxxxxxx \
That number can be found in the AWS Console under Support -> Support Center (Account Number), as noted above.
If running this in a cron, you only want to see errors and warnings. A frequent warning will be that there are snapshots in use. The two example snapshot id's (snap-a1234xyz, snap-c1234abc) are ignored since they would otherwise print something like:
An error occurred (InvalidSnapshot.InUse) when calling the DeleteSnapshot operation: The snapshot snap-a1234xyz is currently in use by ami-zyx4321ab
See the comments near the "snap-a1234xyz" example snapshot id for how to handle this output.
And don't forget to check on the handy examples and references in the 1.16 aws cli describe-snapshots manual.
you can use 'self' in '--owner-ids' and delete the snapshots created before a specific date (e.g. 2018-01-01) with this one-liner command:
for i in $(aws ec2 describe-snapshots --owner-ids self --query 'Snapshots[?StartTime<=`2018-01-01`].SnapshotId' --output text); do echo Deleting $i; aws ec2 delete-snapshot --snapshot-id $i; sleep 1; done;
The date condition must be within parentheses ():
aws ec2 describe-snapshots \
--owner-ids 012345678910 \
--query "Snapshots[?(StartTime<='2020-03-31')].[SnapshotId]"

AWS Cloudwatch Log - Is it possible to export existing log data from it?

I have managed to push my application logs to AWS Cloudwatch by using the AWS CloudWatch log agent. But the CloudWatch web console does not seem to provide a button to allow you to download/export the log data from it.
Any idea how I can achieve this goal?
The latest AWS CLI has a CloudWatch Logs CLI that allows you to download the logs as JSON, a text file, or any other output supported by the AWS CLI.
For example, to get the first 1MB (up to 10,000 log entries) from the stream a in group A into a text file, run:
aws logs get-log-events \
--log-group-name A --log-stream-name a \
--output text > a.log
The command is currently limited to a maximum response size of 1MB (up to 10,000 records per request); if you have more, you need to implement your own page-stepping mechanism using the --next-token parameter. I expect that in the future the CLI will also allow a full dump in a single command.
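For reference, a minimal page-stepping sketch (untested; uses jq, and the same group A / stream a as above). get-log-events returns a nextForwardToken, and you stop when the token it hands back stops changing:
GROUP="A"
STREAM="a"
TOKEN=""
while :; do
  if [ -z "$TOKEN" ]; then
    page=$(aws logs get-log-events --log-group-name "$GROUP" --log-stream-name "$STREAM" \
      --start-from-head --output json)
  else
    page=$(aws logs get-log-events --log-group-name "$GROUP" --log-stream-name "$STREAM" \
      --start-from-head --next-token "$TOKEN" --output json)
  fi
  # Append this page's messages to the dump file.
  echo "$page" | jq -r '.events[].message' >> a.log
  new_token=$(echo "$page" | jq -r '.nextForwardToken')
  [ "$new_token" = "$TOKEN" ] && break
  TOKEN="$new_token"
done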
Update
Here's a small Bash script to list events from all streams in a specific group, since a specified time:
#!/bin/bash
function dumpstreams() {
  aws $AWSARGS logs describe-log-streams \
    --order-by LastEventTime --log-group-name $LOGGROUP \
    --output text | while read -a st; do
      [ "${st[4]}" -lt "$starttime" ] && continue
      stname="${st[1]}"
      echo ${stname##*:}
    done | while read stream; do
      aws $AWSARGS logs get-log-events \
        --start-from-head --start-time $starttime \
        --log-group-name $LOGGROUP --log-stream-name $stream --output text
    done
}
AWSARGS="--profile myprofile --region us-east-1"
LOGGROUP="some-log-group"
TAIL=
starttime=$(date --date "-1 week" +%s)000
nexttime=$(date +%s)000
dumpstreams
if [ -n "$TAIL" ]; then
  while true; do
    starttime=$nexttime
    nexttime=$(date +%s)000
    sleep 1
    dumpstreams
  done
fi
That last part, if you set TAIL, will continue to fetch log events and will report newer events as they come in (with some expected delay).
There is also a Python project called awslogs that allows you to fetch the logs: https://github.com/jorgebastida/awslogs
There are things like:
list log groups:
$ awslogs groups
list streams for given log group:
$ awslogs streams /var/log/syslog
get the log records from all streams:
$ awslogs get /var/log/syslog
get the log records from a specific stream:
$ awslogs get /var/log/syslog stream_A
and much more (filtering by time period, watching log streams, ...).
I think this tool might help you do what you want.
It seems AWS has added the ability to export an entire log group to S3.
You'll need to set up permissions on the S3 bucket to allow CloudWatch to write to it by adding the following statements to your bucket policy, replacing the region with your region and the bucket name with your bucket name.
{
"Effect": "Allow",
"Principal": {
"Service": "logs.us-east-1.amazonaws.com"
},
"Action": "s3:GetBucketAcl",
"Resource": "arn:aws:s3:::tsf-log-data"
},
{
"Effect": "Allow",
"Principal": {
"Service": "logs.us-east-1.amazonaws.com"
},
"Action": "s3:PutObject",
"Resource": "arn:aws:s3:::tsf-log-data/*",
"Condition": {
"StringEquals": {
"s3:x-amz-acl": "bucket-owner-full-control"
}
}
}
Details can be found in Step 2 of this AWS doc
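Once the bucket policy is in place, the export itself can also be started from the CLI instead of the console. A rough sketch (the log group and prefix are placeholders, the bucket is the one from the policy above, and --from/--to are epoch milliseconds):
# Export the last 24 hours of a log group to the bucket.
aws logs create-export-task \
  --task-name "my-export" \
  --log-group-name "my-log-group" \
  --from $(( ( $(date +%s) - 86400 ) * 1000 )) \
  --to $(( $(date +%s) * 1000 )) \
  --destination "tsf-log-data" \
  --destination-prefix "cloudwatch-export"
# The call returns a taskId; poll it until the export finishes.
aws logs describe-export-tasks --task-id <taskId-from-previous-output>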
The other answers were not useful with AWS Lambda logs since they create many log streams and I just wanted to dump everything in the last week. I finally found the following command to be what I needed:
aws logs tail --since 1w LOG_GROUP_NAME > output.log
Note that LOG_GROUP_NAME is the lambda function path (e.g. /aws/lambda/FUNCTION_NAME) and you can replace the since argument with a variety of times (1w = 1 week, 5m = 5 minutes, etc)
I would add that one liner to get all logs for a stream :
aws logs get-log-events --log-group-name my-log-group --log-stream-name my-log-stream | grep '"message":' | awk -F '"' '{ print $(NF-1) }' > my-log-group_my-log-stream.txt
Or in a slightly more readable format :
aws logs get-log-events \
--log-group-name my-log-group\
--log-stream-name my-log-stream \
| grep '"message":' \
| awk -F '"' '{ print $(NF-1) }' \
> my-log-group_my-log-stream.txt
And you can make a handy script out of it that is admittedly less powerful than @Guss's but simple enough. I saved it as getLogs.sh and invoke it with ./getLogs.sh log-group log-stream
#!/bin/bash
if [[ "${#}" != 2 ]]
then
echo "This script requires two arguments!"
echo
echo "Usage :"
echo "${0} <log-group-name> <log-stream-name>"
echo
echo "Example :"
echo "${0} my-log-group my-log-stream"
exit 1
fi
OUTPUT_FILE="${1}_${2}.log"
aws logs get-log-events \
--log-group-name "${1}"\
--log-stream-name "${2}" \
| grep '"message":' \
| awk -F '"' '{ print $(NF-1) }' \
> "${OUTPUT_FILE}"
echo "Logs stored in ${OUTPUT_FILE}"
Apparently there isn't an out-of-the-box way in the AWS Console to download CloudWatch Logs. Perhaps you can write a script to perform the CloudWatch Logs fetch using the SDK / API.
The good thing about CloudWatch Logs is that you can retain the logs indefinitely (Never Expire), unlike CloudWatch metrics, which are only kept for 14 days. This means you can run the script on a monthly / quarterly frequency rather than on-demand.
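Retention can also be set from the CLI if you ever want a group trimmed automatically instead of kept forever; for example (group name is a placeholder):
# Keep 30 days of logs ...
aws logs put-retention-policy --log-group-name my-log-group --retention-in-days 30
# ... or remove the policy to go back to "Never Expire".
aws logs delete-retention-policy --log-group-name my-log-group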
More information about the CloudWatchLogs API,
http://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/Welcome.html
http://awsdocs.s3.amazonaws.com/cloudwatchlogs/latest/cwl-api.pdf
You can now perform exports via the Cloudwatch Management Console with the new Cloudwatch Logs Insights page. Full documentation here https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_ExportQueryResults.html. I had already started ingesting my Apache logs into Cloudwatch with JSON, so YMMV if you haven't set it up in advance.
Add Query to Dashboard or Export Query Results
After you run a query, you can add the query to a CloudWatch
dashboard, or copy the results to the clipboard.
Queries added to dashboards automatically re-run every time you load
the dashboard and every time that the dashboard refreshes. These
queries count toward your limit of four concurrent CloudWatch Logs
Insights queries.
To add query results to a dashboard
Open the CloudWatch console at
https://console.aws.amazon.com/cloudwatch/.
In the navigation pane, choose Insights.
Choose one or more log groups and run a query.
Choose Add to dashboard.
Select the dashboard, or choose Create new to create a new dashboard
for the query results.
Choose Add to dashboard.
To copy query results to the clipboard
Open the CloudWatch console at
https://console.aws.amazon.com/cloudwatch/.
In the navigation pane, choose Insights.
Choose one or more log groups and run a query.
Choose Actions, Copy query results.
Inspired by saputkin I have created a Python script that downloads all the logs for a log group in a given time period.
The script itself: https://github.com/slavogri/aws-logs-downloader.git
In case there are multiple log streams for that period, multiple files will be created. Downloaded files will be stored in the current directory and will be named after the log streams that have log events in the given time period. (If the group name contains forward slashes, they will be replaced by underscores. Each file will be overwritten if it already exists.)
Prerequisite: You need to be logged in to your AWS profile. The script itself will use, on your behalf, the AWS command line APIs "aws logs describe-log-streams" and "aws logs get-log-events".
Usage example: python aws-logs-downloader -g /ecs/my-cluster-test-my-app -t "2021-09-04 05:59:50 +00:00" -i 60
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-g , --log-group (required) Log group name for which the log stream events needs to be downloaded
-t , --end-time (default: now) End date and time of the downloaded logs in format: %Y-%m-%d %H:%M:%S %z (example: 2021-09-04 05:59:50 +00:00)
-i , --interval (default: 30) Time period in minutes before the end-time. This will be used to calculate the time since which the logs will be downloaded.
-p , --profile (default: dev) The aws profile that is logged in, and on behalf of which the logs will be downloaded.
-r , --region (default: eu-central-1) The aws region from which the logs will be downloaded.
Please let me know if it was useful to you. :)
After I did it I learned that there is another option using Boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.get_log_events
Still the command line API seems to me like a good option.
export LOGGROUPNAME=[SOME_LOG_GROUP_NAME]; for LOGSTREAM in `aws --output text logs describe-log-streams --log-group-name ${LOGGROUPNAME} |awk '{print $7}'`; do aws --output text logs get-log-events --log-group-name ${LOGGROUPNAME} --log-stream-name ${LOGSTREAM} >> ${LOGGROUPNAME}_output.txt; done
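A variant of the same idea that picks the stream names with --query instead of relying on awk column positions (a sketch, same placeholder group name; note that group names containing slashes will still produce awkward output file names):
export LOGGROUPNAME=[SOME_LOG_GROUP_NAME]
for LOGSTREAM in $(aws logs describe-log-streams --log-group-name ${LOGGROUPNAME} --query 'logStreams[].logStreamName' --output text); do
  aws --output text logs get-log-events --log-group-name ${LOGGROUPNAME} --log-stream-name ${LOGSTREAM} >> ${LOGGROUPNAME}_output.txt
done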
Adapted @Guss's answer for macOS. As I am not really a bash guy, I had to use Python to convert dates to a human-readable form.
runawslog -1w gets the last week, and so on:
runawslog() { sh awslogs.sh $1 | grep "EVENTS" | python parselogline.py; }
awslogs.sh:
#!/bin/bash
#set -x
function dumpstreams() {
  aws $AWSARGS logs describe-log-streams \
    --order-by LastEventTime --log-group-name $LOGGROUP \
    --output text | while read -a st; do
      [ "${st[4]}" -lt "$starttime" ] && continue
      stname="${st[1]}"
      echo ${stname##*:}
    done | while read stream; do
      aws $AWSARGS logs get-log-events \
        --start-from-head --start-time $starttime \
        --log-group-name $LOGGROUP --log-stream-name $stream --output text
    done
}
AWSARGS=""
#AWSARGS="--profile myprofile --region us-east-1"
LOGGROUP="/aws/lambda/StockTrackFunc"
TAIL=
FROMDAT=$1
starttime=$(date -v ${FROMDAT} +%s)000
nexttime=$(date +%s)000
dumpstreams
if [ -n "$TAIL" ]; then
  while true; do
    starttime=$nexttime
    nexttime=$(date +%s)000
    sleep 1
    dumpstreams
  done
fi
parselogline.py:
import sys
import datetime

dat = sys.stdin.read()
for k in dat.split('\n'):
    d = k.split('\t')
    if len(d) < 3:
        continue
    d[2] = '\t'.join(d[2:])
    print(str(datetime.datetime.fromtimestamp(int(d[1]) / 1000)) + '\t' + d[2])
I had a similar use case where I had to download all the streams for a given log group. See if this script helps.
#!/bin/bash
if [[ "${#}" != 1 ]]
then
  echo "This script requires one argument!"
  echo
  echo "Usage :"
  echo "${0} <log-group-name>"
  exit 1
fi

streams=`aws logs describe-log-streams --log-group-name "${1}"`

for stream in $(jq '.logStreams | keys | .[]' <<< "$streams"); do
  record=$(jq -r ".logStreams[$stream]" <<< "$streams")
  streamName=$(jq -r ".logStreamName" <<< "$record")
  echo "Downloading ${streamName}";
  echo `aws logs get-log-events --log-group-name "${1}" --log-stream-name "$streamName" --output json > "${stream}.log" `
  echo "Completed download:: ${streamName}";
done;
You have to pass the log group name as an argument.
Eg: bash <name_of_the_bash_file>.sh <group_name>
I found AWS Documentation to be complete and accurate. https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/S3ExportTasks.html
This lays out the steps for exporting logs from CloudWatch to S3.

How to restore folders (or entire buckets) to Amazon S3 from Glacier?

I changed the lifecycle for a bunch of my buckets on Amazon S3 so their storage class was set to Glacier. I did this using the online AWS Console. I now need those files again.
I know how to restore them back to S3 per file. But my buckets have thousands of files. I wanted to see if there was a way to restore the entire bucket back to S3, just like there was a way to send the entire bucket to Glacier?
I'm guessing there's a way to program a solution. But I wanted to see if there was a way to do it in the Console. Or with another program? Or something else I might be missing?
If you use s3cmd you can use it to restore recursively pretty easily:
s3cmd restore --recursive s3://mybucketname/
I've also used it to restore just folders as well:
s3cmd restore --recursive s3://mybucketname/folder/
If you're using the AWS CLI tool (it's nice, you should), you can do it like this:
aws s3 ls s3://<BUCKET_NAME> --recursive | awk '{print $4}' | xargs -L 1 aws s3api restore-object --restore-request '{"Days":<DAYS>,"GlacierJobParameters":{"Tier":"<TIER>"}}' --bucket <BUCKET_NAME> --key
Replace <BUCKET_NAME> with the bucket name you want and provide restore parameters <DAYS> and <TIER>.
<DAYS> is the number of days you want to restore the object for, and <TIER> controls the speed of the restore process; it has three levels: Bulk, Standard, or Expedited.
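For instance, a filled-in version (bucket name is a placeholder) requesting a 7-day restore at the Bulk tier looks like:
aws s3 ls s3://my-bucket --recursive | awk '{print $4}' | xargs -L 1 \
  aws s3api restore-object \
  --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}' \
  --bucket my-bucket --key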
The above answers didn't work well for me because my bucket was a mix of objects in Glacier and some that were not. The easiest thing for me was to create a list of all GLACIER objects in the bucket, then attempt to restore each one individually, ignoring any errors (like already in progress, not an object, etc).
Get a listing of all GLACIER files (keys) in the bucket
aws s3api list-objects-v2 --bucket <bucketName> --query "Contents[?StorageClass=='GLACIER']" --output text | awk '{print $2}' > glacier-restore.txt
Create a shell script and run it, replacing your "bucketName".
#!/bin/sh
for x in `cat glacier-restore.txt`
do
echo "Begin restoring $x"
aws s3api restore-object --restore-request Days=7 --bucket <bucketName> --key "$x"
echo "Done restoring $x"
done
Credit goes to Josh at http://capnjosh.com/blog/a-client-error-invalidobjectstate-occurred-when-calling-the-copyobject-operation-operation-is-not-valid-for-the-source-objects-storage-class/, a resource I found after trying some of the above solutions.
There isn't a built-in tool for this. "Folders" in S3 are an illusion for human convenience, based on forward-slashes in the object key (path/filename) and every object that migrates to glacier has to be restored individually, although...
Of course you could write a script to iterate through the hierarchy and send those restore requests using the SDKs or the REST API in your programming language of choice.
Be sure you understand how restoring from glacier into S3 works, before you proceed. It is always only a temporary restoration, and you choose the number of days that each object will persist in S3 before reverting back to being only stored in glacier.
Also, you want to be certain that you understand the penalty charges for restoring too much glacier data in a short period of time, or you could be in for some unexpected expense. Depending on the urgency, you may want to spread the restore operation out over days or weeks.
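To see where an individual object stands (whether the restore is still in progress and when the temporary copy will expire), you can inspect its Restore field with head-object; names below are placeholders:
# The Restore field looks like: ongoing-request="false", expiry-date="..."
aws s3api head-object --bucket my-bucket --key path/to/my-object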
I recently needed to restore a whole bucket and all its files and folders. You will need s3cmd and aws cli tools configured with your credentials to run this.
I've found this pretty robust to handle errors with specific objects in the bucket that might have already had a restore request.
#!/bin/sh
# This will give you a nice list of all objects in the bucket with the bucket name stripped out
s3cmd ls -r s3://<your-bucket-name> | awk '{print $4}' | sed 's#s3://<your-bucket-name>/##' > glacier-restore.txt
for x in `cat glacier-restore.txt`
do
echo "restoring $x"
aws s3api restore-object --restore-request Days=7 --bucket <your-bucket-name> --profile <your-aws-credentials-profile> --key "$x"
done
Here is my version using the AWS CLI to restore data from Glacier. I modified some of the above examples to work when the keys of the files to be restored contain spaces.
# Parameters
BUCKET="my-bucket" # the bucket you want to restore, no s3:// no slashes
BPATH="path/in/bucket/" # the objects prefix you wish to restore (mind the `/`)
DAYS=1 # For how many days you wish to restore the data.
# Restore the objects
aws s3 ls s3://${BUCKET}/${BPATH} --recursive | \
awk '{out=""; for(i=4;i<=NF;i++){out=out" "$i}; print out}'| \
xargs -I {} aws s3api restore-object --restore-request Days=${DAYS} \
--bucket ${BUCKET} --key "{}"
It looks like S3 Browser can "restore from Glacier" at the folder level, but not bucket level. The only thing is you have to buy the Pro version. So not the best solution.
A variation on Dustin's answer to use AWS CLI, but to use recursion and pipe to sh to skip errors (like if some objects have already requested restore...)
BUCKET=my-bucket
BPATH=/path/in/bucket
DAYS=1
aws s3 ls s3://$BUCKET$BPATH --recursive | awk '{print $4}' | xargs -L 1 \
echo aws s3api restore-object --restore-request Days=$DAYS \
--bucket $BUCKET --key | sh
The xargs echo bit generates a list of "aws s3api restore-object" commands and by piping that to sh, you can continue on error.
NOTE: Ubuntu 14.04 aws-cli package is old. In order to use --recursive you'll need to install via github.
POSTSCRIPT: Glacier restores can get unexpectedly pricey really quickly. Depending on your use case, you may find the Infrequent Access tier to be more appropriate. AWS have a nice explanation of the different tiers.
This command worked for me:
aws s3api list-objects-v2 \
--bucket BUCKET_NAME \
--query "Contents[?StorageClass=='GLACIER']" \
--output text | \
awk -F $'\t' '{print $2}' | \
tr '\n' '\0' | \
xargs -L 1 -0 \
aws s3api restore-object \
--restore-request Days=7 \
--bucket BUCKET_NAME \
--key
ProTip
This command can take quite a while if you have lots of objects.
Don't CTRL-C / break the command, otherwise you'll have to wait for
the processed objects to move out of the RestoreAlreadyInProgress state before you can re-run it. It can take a few hours for the state to transition. You'll see this error message if you need to wait: An error occurred (RestoreAlreadyInProgress) when calling the RestoreObject operation
I've been through this mill today and came up with the following, based on the answers above and having also tried s3cmd. s3cmd doesn't work for mixed buckets (Glacier and Standard). This will do what you need in two steps: first create a list of the Glacier files, then fire off the s3 CLI requests (even if they have already occurred). It will also keep track of which have been requested already so you can restart the script as necessary. Watch out for the TAB (\t) in the cut command quoted below:
#!/bin/sh

bucket="$1"

glacier_file_list="glacier-restore-me-please.txt"
glacier_file_done="glacier-requested-restore-already.txt"

if [ "X${bucket}" = "X" ]
then
  echo "Please supply bucket name as first argument"
  exit 1
fi

aws s3api list-objects-v2 --bucket ${bucket} --query "Contents[?StorageClass=='GLACIER']" --output text | cut -d '\t' -f 2 > ${glacier_file_list}
if [ $? -ne 0 ]
then
  echo "Failed to fetch list of objects from bucket ${bucket}"
  exit 1
fi

echo "Got list of glacier files from bucket ${bucket}"

while read x
do
  echo "Begin restoring $x"
  aws s3api restore-object --restore-request Days=7 --bucket ${bucket} --key "$x"
  if [ $? -ne 0 ]
  then
    echo "Failed to restore \"$x\""
  else
    echo "Done requested restore of \"$x\""
  fi

  # Log those done
  #
  echo "$x" >> ${glacier_file_done}
done < ${glacier_file_list}
I wrote a program in Python to recursively restore folders. The s3cmd command above didn't work for me and neither did the awk command.
You can run this like python3 /home/ec2-user/recursive_restore.py -- restore, and to monitor the restore status use python3 /home/ec2-user/recursive_restore.py -- status
import argparse
import base64
import json
import os
import sys
from datetime import datetime
from pathlib import Path
import boto3
import pymysql.cursors
import yaml
from botocore.exceptions import ClientError
__author__ = "kyle.bridenstine"
def reportStatuses(
operation,
type,
successOperation,
folders,
restoreFinished,
restoreInProgress,
restoreNotRequestedYet,
restoreStatusUnknown,
skippedFolders,
):
"""
reportStatuses gives a generic, aggregated report for all operations (Restore, Status, Download)
"""
report = 'Status Report For "{}" Operation. Of the {} total {}, {} are finished being {}, {} have a restore in progress, {} have not been requested to be restored yet, {} reported an unknown restore status, and {} were asked to be skipped.'.format(
operation,
str(len(folders)),
type,
str(len(restoreFinished)),
successOperation,
str(len(restoreInProgress)),
str(len(restoreNotRequestedYet)),
str(len(restoreStatusUnknown)),
str(len(skippedFolders)),
)
if (len(folders) - len(skippedFolders)) == len(restoreFinished):
print(report)
print("Success: All {} operations are complete".format(operation))
else:
if (len(folders) - len(skippedFolders)) == len(restoreNotRequestedYet):
print(report)
print("Attention: No {} operations have been requested".format(operation))
else:
print(report)
print("Attention: Not all {} operations are complete yet".format(operation))
def status(foldersToRestore, restoreTTL):
s3 = boto3.resource("s3")
folders = []
skippedFolders = []
# Read the list of folders to process
with open(foldersToRestore, "r") as f:
for rawS3Path in f.read().splitlines():
folders.append(rawS3Path)
s3Bucket = "put-your-bucket-name-here"
maxKeys = 1000
# Remove the S3 Bucket Prefix to get just the S3 Path i.e., the S3 Objects prefix and key name
s3Path = removeS3BucketPrefixFromPath(rawS3Path, s3Bucket)
# Construct an S3 Paginator that returns pages of S3 Object Keys with the defined prefix
client = boto3.client("s3")
paginator = client.get_paginator("list_objects")
operation_parameters = {"Bucket": s3Bucket, "Prefix": s3Path, "MaxKeys": maxKeys}
page_iterator = paginator.paginate(**operation_parameters)
pageCount = 0
totalS3ObjectKeys = []
totalS3ObjKeysRestoreFinished = []
totalS3ObjKeysRestoreInProgress = []
totalS3ObjKeysRestoreNotRequestedYet = []
totalS3ObjKeysRestoreStatusUnknown = []
# Iterate through the pages of S3 Object Keys
for page in page_iterator:
for s3Content in page["Contents"]:
s3ObjectKey = s3Content["Key"]
# Folders show up as Keys but they cannot be restored or downloaded so we just ignore them
if s3ObjectKey.endswith("/"):
continue
totalS3ObjectKeys.append(s3ObjectKey)
s3Object = s3.Object(s3Bucket, s3ObjectKey)
if s3Object.restore is None:
totalS3ObjKeysRestoreNotRequestedYet.append(s3ObjectKey)
elif "true" in s3Object.restore:
totalS3ObjKeysRestoreInProgress.append(s3ObjectKey)
elif "false" in s3Object.restore:
totalS3ObjKeysRestoreFinished.append(s3ObjectKey)
else:
totalS3ObjKeysRestoreStatusUnknown.append(s3ObjectKey)
pageCount = pageCount + 1
# Report the total statuses for the folders
reportStatuses(
"restore folder " + rawS3Path,
"files",
"restored",
totalS3ObjectKeys,
totalS3ObjKeysRestoreFinished,
totalS3ObjKeysRestoreInProgress,
totalS3ObjKeysRestoreNotRequestedYet,
totalS3ObjKeysRestoreStatusUnknown,
[],
)
def removeS3BucketPrefixFromPath(path, bucket):
"""
removeS3BucketPrefixFromPath removes "s3a://<bucket name>" or "s3://<bucket name>" from the Path
"""
s3BucketPrefix1 = "s3a://" + bucket + "/"
s3BucketPrefix2 = "s3://" + bucket + "/"
if path.startswith(s3BucketPrefix1):
# remove one instance of prefix
return path.replace(s3BucketPrefix1, "", 1)
elif path.startswith(s3BucketPrefix2):
# remove one instance of prefix
return path.replace(s3BucketPrefix2, "", 1)
else:
return path
def restore(foldersToRestore, restoreTTL):
"""
restore initiates a restore request on one or more folders
"""
print("Restore Operation")
s3 = boto3.resource("s3")
bucket = s3.Bucket("put-your-bucket-name-here")
folders = []
skippedFolders = []
# Read the list of folders to process
with open(foldersToRestore, "r") as f:
for rawS3Path in f.read().splitlines():
folders.append(rawS3Path)
# Skip folders that are commented out of the file
if "#" in rawS3Path:
print("Skipping this folder {} since it's commented out with #".format(rawS3Path))
folders.append(rawS3Path)
continue
else:
print("Restoring folder {}".format(rawS3Path))
s3Bucket = "put-your-bucket-name-here"
maxKeys = 1000
# Remove the S3 Bucket Prefix to get just the S3 Path i.e., the S3 Objects prefix and key name
s3Path = removeS3BucketPrefixFromPath(rawS3Path, s3Bucket)
print("s3Bucket={}, s3Path={}, maxKeys={}".format(s3Bucket, s3Path, maxKeys))
# Construct an S3 Paginator that returns pages of S3 Object Keys with the defined prefix
client = boto3.client("s3")
paginator = client.get_paginator("list_objects")
operation_parameters = {"Bucket": s3Bucket, "Prefix": s3Path, "MaxKeys": maxKeys}
page_iterator = paginator.paginate(**operation_parameters)
pageCount = 0
totalS3ObjectKeys = []
totalS3ObjKeysRestoreFinished = []
totalS3ObjKeysRestoreInProgress = []
totalS3ObjKeysRestoreNotRequestedYet = []
totalS3ObjKeysRestoreStatusUnknown = []
# Iterate through the pages of S3 Object Keys
for page in page_iterator:
print("Processing S3 Key Page {}".format(str(pageCount)))
s3ObjectKeys = []
s3ObjKeysRestoreFinished = []
s3ObjKeysRestoreInProgress = []
s3ObjKeysRestoreNotRequestedYet = []
s3ObjKeysRestoreStatusUnknown = []
for s3Content in page["Contents"]:
print("Processing S3 Object Key {}".format(s3Content["Key"]))
s3ObjectKey = s3Content["Key"]
# Folders show up as Keys but they cannot be restored or downloaded so we just ignore them
if s3ObjectKey.endswith("/"):
print("Skipping this S3 Object Key because it's a folder {}".format(s3ObjectKey))
continue
s3ObjectKeys.append(s3ObjectKey)
totalS3ObjectKeys.append(s3ObjectKey)
s3Object = s3.Object(s3Bucket, s3ObjectKey)
print("{} - {} - {}".format(s3Object.key, s3Object.storage_class, s3Object.restore))
# Ensure this folder was not already processed for a restore
if s3Object.restore is None:
restore_response = bucket.meta.client.restore_object(
Bucket=s3Object.bucket_name, Key=s3Object.key, RestoreRequest={"Days": restoreTTL}
)
print("Restore Response: {}".format(str(restore_response)))
# Refresh object and check that the restore request was successfully processed
s3Object = s3.Object(s3Bucket, s3ObjectKey)
print("{} - {} - {}".format(s3Object.key, s3Object.storage_class, s3Object.restore))
if s3Object.restore is None:
s3ObjKeysRestoreNotRequestedYet.append(s3ObjectKey)
totalS3ObjKeysRestoreNotRequestedYet.append(s3ObjectKey)
print("%s restore request failed" % s3Object.key)
# Instead of failing the entire job continue restoring the rest of the log tree(s)
# raise Exception("%s restore request failed" % s3Object.key)
elif "true" in s3Object.restore:
print(
"The request to restore this file has been successfully received and is being processed: {}".format(
s3Object.key
)
)
s3ObjKeysRestoreInProgress.append(s3ObjectKey)
totalS3ObjKeysRestoreInProgress.append(s3ObjectKey)
elif "false" in s3Object.restore:
print("This file has successfully been restored: {}".format(s3Object.key))
s3ObjKeysRestoreFinished.append(s3ObjectKey)
totalS3ObjKeysRestoreFinished.append(s3ObjectKey)
else:
print(
"Unknown restore status ({}) for file: {}".format(s3Object.restore, s3Object.key)
)
s3ObjKeysRestoreStatusUnknown.append(s3ObjectKey)
totalS3ObjKeysRestoreStatusUnknown.append(s3ObjectKey)
elif "true" in s3Object.restore:
print("Restore request already received for {}".format(s3Object.key))
s3ObjKeysRestoreInProgress.append(s3ObjectKey)
totalS3ObjKeysRestoreInProgress.append(s3ObjectKey)
elif "false" in s3Object.restore:
print("This file has successfully been restored: {}".format(s3Object.key))
s3ObjKeysRestoreFinished.append(s3ObjectKey)
totalS3ObjKeysRestoreFinished.append(s3ObjectKey)
else:
print(
"Unknown restore status ({}) for file: {}".format(s3Object.restore, s3Object.key)
)
s3ObjKeysRestoreStatusUnknown.append(s3ObjectKey)
totalS3ObjKeysRestoreStatusUnknown.append(s3ObjectKey)
# Report the statuses per S3 Key Page
reportStatuses(
"folder-" + rawS3Path + "-page-" + str(pageCount),
"files in this page",
"restored",
s3ObjectKeys,
s3ObjKeysRestoreFinished,
s3ObjKeysRestoreInProgress,
s3ObjKeysRestoreNotRequestedYet,
s3ObjKeysRestoreStatusUnknown,
[],
)
pageCount = pageCount + 1
if pageCount > 1:
# Report the total statuses for the files
reportStatuses(
"restore-folder-" + rawS3Path,
"files",
"restored",
totalS3ObjectKeys,
totalS3ObjKeysRestoreFinished,
totalS3ObjKeysRestoreInProgress,
totalS3ObjKeysRestoreNotRequestedYet,
totalS3ObjKeysRestoreStatusUnknown,
[],
)
def displayError(operation, exc):
"""
displayError displays a generic error message for all failed operation's returned exceptions
"""
print(
'Error! Restore{} failed. Please ensure that you ran the following command "./tools/infra auth refresh" before executing this program. Error: {}'.format(
operation, exc
)
)
def main(operation, foldersToRestore, restoreTTL):
"""
main The starting point of the code that directs the operation to it's appropriate workflow
"""
print(
"{} Starting log_migration_restore.py with operation={} foldersToRestore={} restoreTTL={} Day(s)".format(
str(datetime.now().strftime("%d/%m/%Y %H:%M:%S")), operation, foldersToRestore, str(restoreTTL)
)
)
if operation == "restore":
try:
restore(foldersToRestore, restoreTTL)
except Exception as exc:
displayError("", exc)
elif operation == "status":
try:
status(foldersToRestore, restoreTTL)
except Exception as exc:
displayError("-Status-Check", exc)
else:
raise Exception("%s is an invalid operation. Please choose either 'restore' or 'status'" % operation)
def check_operation(operation):
"""
check_operation validates the runtime input arguments
"""
if operation is None or (
str(operation) != "restore" and str(operation) != "status" and str(operation) != "download"
):
raise argparse.ArgumentTypeError(
"%s is an invalid operation. Please choose either 'restore' or 'status' or 'download'" % operation
)
return str(operation)
# To run use sudo python3 /home/ec2-user/recursive_restore.py -- restore
# -l /home/ec2-user/folders_to_restore.csv
if __name__ == "__main__":
# Form the argument parser.
parser = argparse.ArgumentParser(
description="Restore s3 folders from archival using 'restore' or check on the restore status using 'status'"
)
parser.add_argument(
"operation",
type=check_operation,
help="Please choose either 'restore' to restore the list of s3 folders or 'status' to see the status of a restore on the list of s3 folders",
)
parser.add_argument(
"-l",
"--foldersToRestore",
type=str,
default="/home/ec2-user/folders_to_restore.csv",
required=False,
help="The location of the file containing the list of folders to restore. Put one folder on each line.",
)
parser.add_argument(
"-t",
"--restoreTTL",
type=int,
default=30,
required=False,
help="The number of days you want the filess to remain restored/unarchived. After this period the logs will automatically be rearchived.",
)
args = parser.parse_args()
sys.exit(main(args.operation, args.foldersToRestore, args.restoreTTL))
Maybe I am only a decade late to post an answer, but now we have S3 Batch Operations to restore deep-archived objects in bulk. See this.
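As a heavily hedged sketch of what that looks like from the CLI (the manifest CSV of bucket/key pairs, the report bucket, and the IAM role must already exist, and the field names below follow my reading of the s3control create-job reference, so double-check them there):
aws s3control create-job \
  --account-id 111122223333 \
  --operation '{"S3InitiateRestoreObject":{"ExpirationInDays":7,"GlacierJobTier":"BULK"}}' \
  --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::my-manifest-bucket/manifest.csv","ETag":"<etag-of-manifest>"}}' \
  --report '{"Bucket":"arn:aws:s3:::my-report-bucket","Format":"Report_CSV_20180820","Enabled":true,"Prefix":"batch-restore-report","ReportScope":"AllTasks"}' \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/my-batch-operations-role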