How to use boto3 waiters to take a snapshot of a big RDS instance - amazon-web-services

I started migrating my code to boto3, and one nice addition I noticed is the waiters.
I want to create a snapshot from a DB instance, and I want to check for its availability before I resume with my code.
My approach is the following:
# Notice: Step : Check snapshot availability [1st account - Oregon]
print "--- Check snapshot availability [1st account - Oregon] ---"
new_snap = client1.describe_db_snapshots(DBSnapshotIdentifier=new_snapshot_name)['DBSnapshots'][0]
# print pprint.pprint(new_snap) #debug
waiter = client1.get_waiter('db_snapshot_completed')
print "Manual snapshot is -pending-"
sleep(60)
waiter.wait(
    DBSnapshotIdentifier=new_snapshot_name,
    IncludeShared=True,
    IncludePublic=False
)
print "OK. Manual snapshot is -available-"
But the documentation says that it polls the status every 15 seconds, 40 times. That is 10 minutes, yet a rather big DB will need more than that.
How could I configure the waiter to allow for that?

Waiters have the configuration parameters 'delay' and 'max_attempts', like this:
waiter = rds_client.get_waiter('db_instance_available')
print( "waiter delay: " + str(waiter.config.delay) )
See waiter.py on GitHub.
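Putting that together, a minimal sketch (using the client1 and new_snapshot_name names from the question, and assuming roughly a two-hour ceiling is enough for your DB size):
waiter = client1.get_waiter('db_snapshot_completed')
waiter.config.delay = 60          # poll every 60 seconds instead of the default 15
waiter.config.max_attempts = 120  # allow up to 120 polls (~2 hours) before giving up
waiter.wait(DBSnapshotIdentifier=new_snapshot_name)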

You could do it without the waiter if you like.
From the documentation for that waiter:
Polls RDS.Client.describe_db_snapshots() every 15 seconds until a successful state is reached. An error is returned after 40 failed checks.
Basically that means it does the following:
RDS = boto3.client('rds')
RDS.describe_db_snapshots()
You can just run that but filter to your snapshot ID. Here is the syntax: http://boto3.readthedocs.io/en/latest/reference/services/rds.html#RDS.Client.describe_db_snapshots
response = client.describe_db_snapshots(
    DBInstanceIdentifier='string',
    DBSnapshotIdentifier='string',
    SnapshotType='string',
    Filters=[
        {
            'Name': 'string',
            'Values': [
                'string',
            ]
        },
    ],
    MaxRecords=123,
    Marker='string',
    IncludeShared=True|False,
    IncludePublic=True|False
)
This will end up looking something like this:
snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
Then you can just loop until that returns a snapshot which is available. So here is a very rough idea.
import boto3
import time

RDS = boto3.client('rds')

snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
while snapshot_description['DBSnapshots'][0]['Status'] != 'available':
    print("still waiting")
    time.sleep(15)
    snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
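Note that this rough loop will spin forever if the snapshot never reaches 'available' (for example, if it errors out or is deleted). A slightly hardened sketch, still using the hypothetical 'YOURIDHERE' identifier, caps the number of polls:
import boto3
import time

RDS = boto3.client('rds')
MAX_POLLS = 240  # ~1 hour at 15-second intervals; adjust to your DB size

for attempt in range(MAX_POLLS):
    snap = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')['DBSnapshots'][0]
    if snap['Status'] == 'available':
        print("snapshot is available")
        break
    print("still waiting, status: " + snap['Status'])
    time.sleep(15)
else:
    raise RuntimeError("snapshot did not become available in time")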

I think the other answer alluded to this solution, but here it is spelled out explicitly.
[snip]
...
# Create your waiter
waiter_db_snapshot = client1.get_waiter('db_snapshot_completed')
# Increase the max number of tries as appropriate
waiter_db_snapshot.config.max_attempts = 120
# Add a 60 second delay between attempts
waiter_db_snapshot.config.delay = 60
print "Manual snapshot is -pending-"
....
[snip]
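Alternatively, boto3 waiters also accept a per-call WaiterConfig argument, so you don't have to mutate the waiter's config object; a sketch using the same names:
waiter_db_snapshot = client1.get_waiter('db_snapshot_completed')
waiter_db_snapshot.wait(
    DBSnapshotIdentifier=new_snapshot_name,
    WaiterConfig={
        'Delay': 60,         # seconds between polls
        'MaxAttempts': 120   # total polls before giving up
    }
)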

Related

AWS SSM error while targets.1.member.values failed to satisfy constraint: Member must have length less than or equal to 50

I am trying to run an SSM command on more than 50 EC2 instances in my fleet. Using AWS boto3's SSM client, I am running a specific command on my nodes. My code is given below. After running it, an unexpected error shows up.
# running ec2 instances
instances = client.describe_instances()
instance_ids = [inst["InstanceId"] for inst in instances]  # might contain more than 50 instances

# run command
run_cmd_resp = ssm_client.send_command(
    Targets=[
        {"Key": "InstanceIds", "Values": instance_ids},
    ],
    DocumentName="AWS-RunShellScript",
    DocumentVersion="1",
    Parameters={
        "commands": ["#!/bin/bash", "ls -ltrh", "# some commands"]
    }
)
On executing this, I get the error below:
An error occurred (ValidationException) when calling the SendCommand operation: 1 validation error detected: Value '[...91 instance IDs...]' at 'targets.1.member.values' failed to satisfy constraint: Member must have length less than or equal to 50.
How do I run the SSM command on my whole fleet?
As shown in the error message and the boto3 documentation (link), the number of instances in one send_command call is limited to 50. To run the SSM command on all instances, splitting the original list into chunks of 50 could be a solution.
FYI: If your account has a fair number of instances, describe_instances() can't retrieve all instance info in one API call, so it would be better to check whether NextToken is in the response.
ref: How do you use "NextToken" in AWS API calls
# running ec2 instances
# (instance IDs are nested under Reservations[].Instances[] in the response)
instances = client.describe_instances()
instance_ids = [
    inst["InstanceId"]
    for r in instances["Reservations"]
    for inst in r["Instances"]
]
while "NextToken" in instances:
    instances = client.describe_instances(NextToken=instances["NextToken"])
    instance_ids += [
        inst["InstanceId"]
        for r in instances["Reservations"]
        for inst in r["Instances"]
    ]

# run command in batches of 50 instance IDs
for i in range(0, len(instance_ids), 50):
    target_instances = instance_ids[i : i + 50]
    run_cmd_resp = ssm_client.send_command(
        Targets=[
            {"Key": "InstanceIds", "Values": target_instances},
        ],
        DocumentName="AWS-RunShellScript",
        DocumentVersion="1",
        Parameters={
            "commands": ["#!/bin/bash", "ls -ltrh", "# some commands"]
        }
    )
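As a side note, a paginator can replace the manual NextToken handling; a minimal sketch with the same client object:
# collect all instance IDs via the built-in paginator
instance_ids = []
paginator = client.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        for inst in reservation["Instances"]:
            instance_ids.append(inst["InstanceId"])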
Finally, following @Rohan Kishibe's answer, I implemented the batched execution below for the SSM AWS-RunShellScript document.
import math

ec2_ids_all = [...]  # all instance IDs fetched by pagination.

PG_START, PG_STOP = 0, 50
PG_SIZE = 50
PG_COUNT = math.ceil(len(ec2_ids_all) / PG_SIZE)

for page in range(PG_COUNT):
    cmd = ssm.send_command(
        Targets=[{"Key": "InstanceIds", "Values": ec2_ids_all[PG_START:PG_STOP]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["ls -ltrh", "# other commands"]}
    )
    PG_START += PG_SIZE
    PG_STOP += PG_SIZE
In this way, the total number of instance IDs is distributed into batches and executed accordingly. One can also save the Command IDs and batch instance IDs in a mapping for future use.
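For example, a rough sketch of that mapping, a standalone variant of the same loop (ssm, ec2_ids_all, and the paging variables are assumed to be as above):
# remember which command ran against which batch of instances
command_batches = {}

for page in range(PG_COUNT):
    cmd = ssm.send_command(
        Targets=[{"Key": "InstanceIds", "Values": ec2_ids_all[PG_START:PG_STOP]}],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["ls -ltrh", "# other commands"]}
    )
    command_batches[cmd["Command"]["CommandId"]] = ec2_ids_all[PG_START:PG_STOP]
    PG_START += PG_SIZE
    PG_STOP += PG_SIZE

# later, the status of each batch can be checked, e.g. with ssm.list_commands(CommandId=...)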

Boto3 and AWS RDS: properly wait for database creation from snapshot

I have the following code in my Lambda (Python and boto3):
rds.restore_db_instance_from_db_snapshot(
    DBSnapshotIdentifier=snapshot_name,
    DBInstanceIdentifier=db_id,
    DBInstanceClass=rds_instance_class,
    VpcSecurityGroupIds=secgroup,
    DBSubnetGroupName=rds_subnet_groupname,
    MultiAZ=False,
    PubliclyAccessible=False,
    CopyTagsToSnapshot=True
)
waiter = rds.get_waiter('db_instance_available')
waiter.wait(DBInstanceIdentifier=db_id)
# some other operation that expects that DB is up and running.
The waiter was added as an attempt to properly wait for the DB. However, it looks like the waiter times out.
What would be the correct waiter to use in this case?
Try setting waiter.config.delay and/or waiter.config.max_attempts:
waiter = rds.get_waiter('db_instance_available')
waiter.config.delay = 123 # this is in seconds
waiter.config.max_attempts = 123
waiter.wait(DBInstanceIdentifier=db_id)
OR
waiter = rds.get_waiter('db_instance_available')
waiter.wait(
    DBInstanceIdentifier=db_id,
    WaiterConfig={
        'Delay': 123,
        'MaxAttempts': 123
    }
)
WaiterConfig (dict) -- A dictionary that provides parameters to control waiting behavior.
    Delay (integer) -- The amount of time in seconds to wait between attempts. Default: 30
    MaxAttempts (integer) -- The maximum number of attempts to be made. Default: 60
Could it be that your waiter is actually checking the existing db and seeing that it's available before the status can update on the previous command to restore the snapshot?
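If that is the suspicion, a quick (hypothetical) pre-check before the restore can rule it out, assuming the same rds client and db_id variable:
# make sure db_id does not already refer to an existing instance the waiter would match
try:
    existing = rds.describe_db_instances(DBInstanceIdentifier=db_id)
    print("Instance %s already exists (status: %s); the waiter would see it immediately."
          % (db_id, existing['DBInstances'][0]['DBInstanceStatus']))
except rds.exceptions.DBInstanceNotFoundFault:
    pass  # identifier is free; restore_db_instance_from_db_snapshot can proceed, then wait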

AWS Lambda function fails while query Athena

I am attempting to write a simple Lambda function to query a table in Athena. But after a few seconds I see "Status: FAILED" in the CloudWatch logs.
There is no descriptive error message on the cause of failure.
My test code is below:
import json
import time
import boto3

# athena constant
DATABASE = 'default'
TABLE = 'test'

# S3 constant
S3_OUTPUT = 's3://test-output/'

# number of retries
RETRY_COUNT = 1000


def lambda_handler(event, context):
    # created query
    query = "SELECT * FROM default.test limit 2"
    # % (DATABASE, TABLE)

    # athena client
    client = boto3.client('athena')

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': S3_OUTPUT,
        }
    )

    # get query execution id
    query_execution_id = response['QueryExecutionId']
    print(query_execution_id)

    # get execution status
    for i in range(1, 1 + RETRY_COUNT):
        # get query execution
        query_status = client.get_query_execution(QueryExecutionId=query_execution_id)
        query_execution_status = query_status['QueryExecution']['Status']['State']
        if query_execution_status == 'SUCCEEDED':
            print("STATUS:" + query_execution_status)
            break
        if query_execution_status == 'FAILED':
            # raise Exception("STATUS:" + query_execution_status)
            print("STATUS:" + query_execution_status)
        else:
            print("STATUS:" + query_execution_status)
            time.sleep(i)
    else:
        # Did not encounter a break event. Need to kill the query
        client.stop_query_execution(QueryExecutionId=query_execution_id)
        raise Exception('TIME OVER')

    # get query results
    result = client.get_query_results(QueryExecutionId=query_execution_id)
    print(result)
    return
The logs show the following:
2020-08-31T10:52:12.443-04:00
START RequestId: e5434651-d36e-48f0-8f27-0290 Version: $LATEST
2020-08-31T10:52:13.481-04:00
88162f38-bfcb-40ae-b4a3-0b5a21846e28
2020-08-31T10:52:13.500-04:00
STATUS:QUEUED
2020-08-31T10:52:14.519-04:00
STATUS:RUNNING
2020-08-31T10:52:16.540-04:00
STATUS:RUNNING
2020-08-31T10:52:19.556-04:00
STATUS:RUNNING
2020-08-31T10:52:23.574-04:00
STATUS:RUNNING
2020-08-31T10:52:28.594-04:00
STATUS:FAILED
2020-08-31T10:52:28.640-04:00
....more status: FAILED
....
END RequestId: e5434651-d36e-48f0-8f27-0290
REPORT RequestId: e5434651-d36e-48f0-8f27-0290 Duration: 30030.22 ms Billed Duration: 30000 ms Memory Size: 128 MB Max Memory Used: 72 MB Init Duration: 307.49 ms
2020-08-31T14:52:42.473Z e5434651-d36e-48f0-8f27-0290 Task timed out after 30.03 seconds
I think I have the right permissions for S3 bucket access given to the role (if not, I would have seen the error message). There are no files created in the bucket either. I am not sure what is going wrong here. What am I missing?
Thanks
The last line in your log shows
2020-08-31T14:52:42.473Z e5434651-d36e-48f0-8f27-0290 Task timed out after 30.03 seconds
To me this looks like the timeout of the Lambda Function is set to 30 seconds. Try increasing it to more than the time the Athena query needs (the maximum is 15 minutes).
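If you prefer to change the timeout programmatically rather than in the console, a minimal sketch using boto3's Lambda client (the function name is a placeholder):
import boto3

lambda_client = boto3.client('lambda')

# raise the function timeout to 5 minutes (value is in seconds, maximum 900)
lambda_client.update_function_configuration(
    FunctionName='my-athena-query-function',  # hypothetical name
    Timeout=300
)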

Delete AWS Log Groups that haven't been touched in X days

Is there a way to delete all AWS Log Groups that haven't had any writes to them in the past 30 days?
Or conversely, get the list of log groups that haven't had anything written to them in the past 30 days?
Here is a quick script I wrote:
#!/usr/bin/python3

# describe log groups
# describe log streams
# get log groups with the lastEventTimestamp after some time
# delete those log groups
# have a dry run option
# support profile
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.describe_log_streams
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.describe_log_groups
# https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.delete_log_group

import boto3
import time

millis = int(round(time.time() * 1000))
delete = False
debug = False
log_group_prefix = '/'  # NEED TO CHANGE THESE
days = 30

# Create CloudWatchLogs client
cloudwatch_logs = boto3.client('logs')

log_groups = []
# List log groups through the pagination interface
paginator = cloudwatch_logs.get_paginator('describe_log_groups')
for response in paginator.paginate(logGroupNamePrefix=log_group_prefix):
    for log_group in response['logGroups']:
        log_groups.append(log_group['logGroupName'])

if debug:
    print(log_groups)

old_log_groups = []
empty_log_groups = []
for log_group in log_groups:
    response = cloudwatch_logs.describe_log_streams(
        logGroupName=log_group,  # logStreamNamePrefix='',
        orderBy='LastEventTime',
        descending=True,
        limit=1
    )
    # The time of the most recent log event in the log stream in CloudWatch Logs.
    # This number is expressed as the number of milliseconds after Jan 1, 1970 00:00:00 UTC.
    if len(response['logStreams']) > 0:
        if debug:
            print("full response is:")
            print(response)
            print("Last event is:")
            print(response['logStreams'][0]['lastEventTimestamp'])
            print("current millis is:")
            print(millis)
        if response['logStreams'][0]['lastEventTimestamp'] < millis - (days * 24 * 60 * 60 * 1000):
            old_log_groups.append(log_group)
    else:
        empty_log_groups.append(log_group)

# delete log group
if delete:
    for log_group in old_log_groups:
        response = cloudwatch_logs.delete_log_group(logGroupName=log_group)
    # for log_group in empty_log_groups:
    #     response = cloudwatch_logs.delete_log_group(logGroupName=log_group)
else:
    print("old log groups are:")
    print(old_log_groups)
    print("Number of log groups:")
    print(len(old_log_groups))
    print("empty log groups are:")
    print(empty_log_groups)
I have been using aws-cloudwatch-log-clean and I can say it works quite well.
You need boto3 installed, and then you run:
./sweep_log_streams.py [log_group_name]
It has a --dry-run option so you can check that it does what you expect first.
A note of caution: if you have a long-running process in ECS which is quiet on the logs, and the log has been truncated to empty in CloudWatch due to the log retention period, deleting its empty log stream can break and hang the service, as it has nowhere to post its logs.
I'm not aware of a simple way to do this, but you could use the awscli (or preferably python/boto3) to describe-log-groups, then for each log group invoke describe-log-streams, then for each log group/stream pair, invoke get-log-events with a --start-time of 30 days ago. If the union of all the events arrays for the log group is empty then you know you can delete the log group.
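A rough boto3 sketch of that idea (the 30-day window is illustrative, and it only prints candidates instead of deleting anything):
import time
import boto3

logs = boto3.client('logs')
cutoff_ms = int(time.time() * 1000) - 30 * 24 * 60 * 60 * 1000  # 30 days ago

candidates = []
for group_page in logs.get_paginator('describe_log_groups').paginate():
    for group in group_page['logGroups']:
        name = group['logGroupName']
        has_recent_events = False
        for stream_page in logs.get_paginator('describe_log_streams').paginate(logGroupName=name):
            for stream in stream_page['logStreams']:
                events = logs.get_log_events(
                    logGroupName=name,
                    logStreamName=stream['logStreamName'],
                    startTime=cutoff_ms,
                    limit=1
                )
                if events['events']:
                    has_recent_events = True
                    break
            if has_recent_events:
                break
        if not has_recent_events:
            candidates.append(name)

print("log groups with no events in the last 30 days:", candidates)
# to actually delete a group: logs.delete_log_group(logGroupName=name)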
I did a similar setup; in my case I wanted to delete log streams older than X days from a CloudWatch log group.
remove.py
import optparse
import sys
import os
import json
import datetime


def deletefunc(loggroupname, days):
    os.system("aws logs describe-log-streams --log-group-name {} --order-by LastEventTime > test.json".format(loggroupname))
    oldstream = []
    milli = days * 24 * 60 * 60 * 1000
    today = datetime.datetime.now().strftime('%Y-%m-%d')
    with open('test.json') as json_file:
        data = json.load(json_file)
        for p in data['logStreams']:
            subtract = p['creationTime'] + milli
            sub1 = subtract / 1000
            # date `days` days after the stream was created
            sub2 = datetime.datetime.fromtimestamp(sub1).strftime('%Y-%m-%d')
            name = p['logStreamName']
            # the stream is older than `days` days if that date is already in the past
            if sub2 < today:
                oldstream.append(name)
    for i in oldstream:
        os.system("aws logs delete-log-stream --log-group-name {} --log-stream-name {}".format(loggroupname, i))


parser = optparse.OptionParser()
parser.add_option('--log-group-name', action="store", dest="loggroupname",
                  help="LogGroupName For Eg: testing-vpc-flow-logs", type="string")
parser.add_option('-d', '--days', action="store", dest="days",
                  help="Type No.of Days in Integer to Delete Logs other than days provided", type="int")
options, args = parser.parse_args()

if options.loggroupname is None:
    print("Provide log group name to continue.\n")
    cmd = 'python3 ' + sys.argv[0] + ' -h'
    os.system(cmd)
    sys.exit(0)
if options.days is None:
    print("Provide days to continue.\n")
    cmd = 'python3 ' + sys.argv[0] + ' -h'
    os.system(cmd)
    sys.exit(0)
elif options.days and options.loggroupname:
    loggroupname = options.loggroupname
    days = options.days
    deletefunc(loggroupname, days)
You can run this file using the command:
python3 remove.py --log-group-name=testing-vpc-flow-logs --days=7

Server Errors While Writing With Python Cassandra Driver

code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE" info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}
code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I am inserting into Cassandra 2.0.13 (a single node, for testing) with the Python cassandra-driver, version 2.6.
The following are my keyspace and table definitions:
CREATE KEYSPACE test_keyspace WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' };

CREATE TABLE test_table (
    key text PRIMARY KEY,
    column1 text,
    ...,
    column17 text
) WITH COMPACT STORAGE AND
    bloom_filter_fp_chance=0.010000 AND
    caching='KEYS_ONLY' AND
    comment='' AND
    dclocal_read_repair_chance=0.000000 AND
    gc_grace_seconds=864000 AND
    read_repair_chance=0.100000 AND
    replicate_on_write='true' AND
    populate_io_cache_on_flush='false' AND
    compaction={'class': 'SizeTieredCompactionStrategy'} AND
    compression={'sstable_compression': 'SnappyCompressor'};
What I tried:
1) Multiprocessing (protocol version set to 1); each process has its own cluster and session (default_timeout set to 30.0):
def get_cassandra_session():
    """creates cluster and gets the session based on key space"""
    # be aware that session cannot be shared between threads/processes
    # or it will raise OperationTimedOut Exception
    if CLUSTER_HOST2:
        cluster = cassandra.cluster.Cluster([CLUSTER_HOST1, CLUSTER_HOST2])
    else:
        # if only one address is available, we have to use older protocol version
        cluster = cassandra.cluster.Cluster([CLUSTER_HOST1], protocol_version=1)
    session = cluster.connect(KEY_SPACE)
    session.default_timeout = 30.0
    return session
2) Batch insert (protocol version set to 2, because BatchStatement is enabled on Cassandra 2.x):
def batch_insert(session, batch_queue, batch):
    try:
        insert_user = session.prepare(
            "INSERT INTO " + db.TABLE + " (" + db.COLUMN1 + "," + db.COLUMN2 + "," +
            db.COLUMN3 + "," + db.COLUMN4 + ") VALUES (?,?,?,?)")
        while batch_queue.qsize() > 0:
            '''batch queue size is 1000'''
            row_tuple = batch_queue.get()
            batch.add(insert_user, row_tuple)
        session.execute(batch)
    except Exception as e:
        logger.error("batch insert fail.... %s", e)
The above function is invoked by:
batch = BatchStatement(consistency_level=ConsistencyLevel.ONE)
batch_insert(session, batch_queue, batch)
Tuples are stored in batch_queue.
3) Synchronous execution
Several days ago I posted another question, Cassandra update fails; Cassandra was complaining about a timeout issue. I was using synchronous execution for updating.
Can anyone help? Is this an issue with my code, with the Python cassandra-driver, or with Cassandra itself?
Thanks a million!
If your question is about those errors at the top, those are server-side error responses.
The first says that the coordinator you contacted cannot satisfy the request at CL.ONE, with the nodes it believes are alive. This can happen if all replicas are down (more likely with a low replication factor).
The other two errors are timeouts, where the coordinator didn't get responses from 'live' nodes within the time configured in cassandra.yaml.
All of these indicate that the cluster you're connected to is not healthy. This could be because it is overwhelmed (high GC pauses), or experiencing network issues. Check the server logs for clues.
I got the following error, which looks very similar:
cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level LOCAL_ONE" info={'consistency': 'LOCAL_ONE', 'alive_replicas': 0, 'required_replicas': 1}
When I added a sleep(0.5) in the code, it worked fine. I was trying to write too much too fast...
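In the same spirit, instead of a fixed sleep you can retry the write with a short backoff when the cluster pushes back; a rough sketch (statement preparation omitted), assuming the driver raises cassandra.Unavailable / cassandra.WriteTimeout / cassandra.OperationTimedOut as in the errors above:
import time
from cassandra import Unavailable, WriteTimeout, OperationTimedOut

def execute_with_backoff(session, statement, retries=5, base_delay=0.5):
    """Retry a write a few times, backing off when the cluster reports pressure."""
    for attempt in range(retries):
        try:
            return session.execute(statement)
        except (Unavailable, WriteTimeout, OperationTimedOut) as exc:
            wait = base_delay * (2 ** attempt)
            print("write failed (%s); retrying in %.1fs" % (exc, wait))
            time.sleep(wait)
    raise RuntimeError("write did not succeed after %d attempts" % retries)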