Similar to the question Bulk delete sqs queues using boto3, I now want to delete queues based on 'CreatedTimestamp'. If the created time (in epoch seconds) is before a specific epoch timestamp, the queue should be deleted.
I tried to write something similar to the answer given in the earlier post, but I am not sure if I need to loop again to check the created time.
import boto3

client = boto3.client('sqs')
timestamp = '1645747200'

def delete_sqs_queues(event, context):
    response = client.list_queues()
    for sqs_url in response['QueueUrls']:
        get_att = client.get_queue_attributes(
            QueueUrl=sqs_url,
            AttributeNames=['CreatedTimestamp']
        )
Can I directly compare the CreatedTimestamp with a variable or do I need another for loop to iterate over the CreatedTimestamps?
There is only one CreatedTimestamp for each queue, thus you don't have to iterate again:
for sqs_url in response['QueueUrls']:
    get_att = client.get_queue_attributes(
        QueueUrl=sqs_url,
        AttributeNames=['CreatedTimestamp']
    )
    queue_timestamp = get_att['Attributes']['CreatedTimestamp']
    print(queue_timestamp)
    # compare as integers, since the attribute is returned as a string
    if int(queue_timestamp) < int(timestamp):
        print(f'Delete {sqs_url} because of {queue_timestamp}')
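If you want the function to go beyond printing, the deletion itself is a single delete_queue call per matching queue; a minimal sketch, inside the same loop:

    if int(queue_timestamp) < int(timestamp):
        client.delete_queue(QueueUrl=sqs_url)  # permanently removes the queue
        print(f'Deleted {sqs_url} (created at {queue_timestamp})')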
Related
I understand that boto3 Object.copy_from(...) uses threads but is not asynchronous. Is it possible to make this call asynchronous? If not, is there another way to accomplish this using boto3? I'm finding that moving hundreds/thousands of files is fine, but when I'm processing hundreds of thousands of files it gets extremely slow.
You can have a look at aioboto3. It is a third party library, not created by AWS, but it provides asyncio support for selected (not all) AWS API calls.
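A rough sketch of what that can look like (assuming a recent aioboto3 release with the Session API; the bucket and key names are placeholders):

import asyncio
import aioboto3

async def copy_objects(keys):
    session = aioboto3.Session()
    async with session.client("s3") as s3:
        # issue all copies concurrently instead of one blocking call at a time
        tasks = [
            s3.copy_object(
                Bucket="dest-bucket",
                Key=key,
                CopySource={"Bucket": "source-bucket", "Key": key},
            )
            for key in keys
        ]
        await asyncio.gather(*tasks)

asyncio.run(copy_objects(["prefix/file1", "prefix/file2"]))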
I use the following. You can copy it into a Python file and run it from the command line. I have a PC with 8 cores, so it's faster than my little EC2 instance with 1 vCPU.
It uses the multiprocessing library, so you'd want to read up on that if you aren't familiar with it. It's relatively straightforward. There's a batch delete that I've commented out, because you really don't want to accidentally delete the wrong directory. You can use whatever methods you want to list the keys or iterate through the objects, but this works for me.
from multiprocessing import Pool
from itertools import repeat
import boto3
import os
import math

s3sc = boto3.client('s3')
s3sr = boto3.resource('s3')
num_proc = os.cpu_count()


def get_list_of_keys_from_prefix(bucket, prefix):
    """gets list of keys for given bucket and prefix"""
    keys_list = []
    paginator = s3sr.meta.client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        keys = [content['Key'] for content in page.get('Contents')]
        keys_list.extend(keys)
    if prefix in keys_list:
        keys_list.remove(prefix)
    return keys_list


def batch_delete_s3(keys_list, bucket):
    total_keys = len(keys_list)
    chunk_size = 1000
    num_batches = math.ceil(total_keys / chunk_size)
    for b in range(0, num_batches):
        batch_to_delete = []
        for k in keys_list[chunk_size*b:chunk_size*b+chunk_size]:
            batch_to_delete.append({'Key': k})
        s3sc.delete_objects(Bucket=bucket, Delete={'Objects': batch_to_delete, 'Quiet': True})


def copy_s3_to_s3(from_bucket, from_key, to_bucket, to_key):
    copy_source = {'Bucket': from_bucket, 'Key': from_key}
    s3sr.meta.client.copy(copy_source, to_bucket, to_key)


def upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=4):
    with Pool(num_proc) as pool:
        r = pool.starmap(copy_s3_to_s3, zip(repeat(from_bucket), keys_list_from, repeat(to_bucket), keys_list_to), 15)
        pool.close()
        pool.join()
    return r


if __name__ == '__main__':
    __spec__ = None
    from_bucket = 'from-bucket'
    from_prefix = 'from/prefix/'
    to_bucket = 'to-bucket'
    to_prefix = 'to/prefix/'
    keys_list_from = get_list_of_keys_from_prefix(from_bucket, from_prefix)
    keys_list_to = [to_prefix + k.rsplit('/')[-1] for k in keys_list_from]
    rs = upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=num_proc)
    # batch_delete_s3(keys_list_from, from_bucket)
I think you can use boto3 along with Python threads to handle such cases. In the AWS S3 docs they mention:
Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.
So you can make up to 3,500 such requests per second per prefix; nothing can override this limit set by AWS.
By using threads, you need only about 300 rounds of calls for your hundreds of thousands of files.
In the worst case it takes about 5 hours, i.e. assuming your files are large and each one takes roughly a minute to upload on average.
Note: Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
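A minimal sketch of that threaded approach with concurrent.futures (the bucket names, key list, and worker count below are placeholder assumptions; tune max_workers to what your machine and the per-prefix request limit allow):

import boto3
from concurrent.futures import ThreadPoolExecutor, as_completed

s3 = boto3.resource('s3')

def copy_one(key):
    # server-side copy; the object data does not pass through your machine
    s3.meta.client.copy({'Bucket': 'source-bucket', 'Key': key}, 'dest-bucket', key)
    return key

keys = ['prefix/file1', 'prefix/file2']  # placeholder list of keys to move

with ThreadPoolExecutor(max_workers=32) as executor:
    futures = [executor.submit(copy_one, key) for key in keys]
    for future in as_completed(futures):
        future.result()  # re-raises any copy error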
How can I publish multiple messages to Pub/Sub fast, without multiprocessing and multithreading, because the code is already running in a thread?
The code below is publishing 40 messages per second:
publisher = pubsub.PublisherClient(
    credentials=credentials,
    batch_settings=types.BatchSettings(
        max_messages=1000,          # default is 100
        max_bytes=1 * 1000 * 1000,  # 1 MB
        max_latency=0.1,            # default is 10 ms
    )
)

topic_name = 'projects/{project_id}/topics/{topic}'.format(
    project_id=PROJECT_ID,
    topic=TOPIC_PUBSUB,
)

for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
    future.result()
Instead of publishing the messages one at a time and waiting on each future, you should publish them all at once and then wait on the published futures at the end. It will look something like:
from concurrent import futures
...
publish_futures = []
for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
    publish_futures.append(future)
...
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
There's a detailed example in the docs with sample code.
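If you also want to surface individual publish errors without blocking the loop, one variation (a sketch in the spirit of the documented pattern, not the exact sample from the docs) is to attach a done-callback to each future before collecting it:

def on_publish_done(future):
    try:
        future.result()  # raises if this publish failed
    except Exception as exc:
        print(f"Publish failed: {exc}")

for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
    future.add_done_callback(on_publish_done)
    publish_futures.append(future)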
Take out:
future.result()
Leave it just:
for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
And it should take less than a second to publish 10k messages.
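One caveat with dropping future.result() entirely: nothing in the loop waits for or checks the publish results, so failures go unnoticed and the final batch may still be in flight when the surrounding code moves on. A hedged option (assuming a reasonably recent google-cloud-pubsub client; verify against your version) is to flush once at the end:

publisher.stop()  # commits any pending batches; the client cannot publish afterwards

Alternatively, keep the futures and wait on them once at the end, as in the previous answer.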
I'm using boto3 with Python, but I believe the problem and logic should be universal across all languages.
I know that table.scan() should in theory return all the records, but in fact the scan() results are limited to 1 MB in size. It's recommended to create a while loop based on LastEvaluatedKey, but that also doesn't give me all the results (15200 instead of 16000); code here:
dynamodb = boto3.resource('dynamodb', region_name='eu-west-2')
table = dynamodb.Table(dBTable)

response = table.scan()
print("item_count:", table.item_count)
print("response1:", response["Count"])

items = []
while 'LastEvaluatedKey' in response and response['LastEvaluatedKey'] != "":
    print("response:", response["Count"])
    items += response["Items"]
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
How can I fetch all the records reliably?
It's recommended to create a while loop based on LastEvaluatedKey, but that also doesn't give me all the results (15200 instead of 16000).
Are you sure about that? My guess would be you have something else going on. I use boto3 and LastEvaluatedKey in a loop in production code that runs every day, and I have never encountered a case where not all the rows were returned. I'm not saying it's impossible, but I would first make sure your code is right.
Edit, this code works:
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('DeadLetterQueue')

response = table.scan()
print("item_count:", table.item_count)

items = response["Items"]
while 'LastEvaluatedKey' in response and response['LastEvaluatedKey'] != "":
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response["Items"])

print(len(items))
The problem you are facing is not related to the DynamoDB scan operation; it's related to your code. The items from the last scan response are never appended to the items array.
Following is your code with slight modifications:
dynamodb = boto3.resource('dynamodb', region_name='eu-west-2')
table = dynamodb.Table(dBTable)

response = table.scan()
print("item_count:", table.item_count)
print("response1:", response["Count"])

items = response["Items"]  # Changed HERE
while 'LastEvaluatedKey' in response and response['LastEvaluatedKey'] != "":
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    print("response:", response["Count"])
    items += response["Items"]  # Shifted this line HERE
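As a side note, if you would rather not hand-roll the pagination loop, the low-level DynamoDB client exposes a scan paginator. A rough sketch (the table name is a placeholder, and note that the client returns items in the low-level attribute-value format, unlike the resource's Table.scan):

import boto3

client = boto3.client('dynamodb', region_name='eu-west-2')
paginator = client.get_paginator('scan')

items = []
for page in paginator.paginate(TableName='YourTable'):  # placeholder table name
    items.extend(page['Items'])
print(len(items))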
I'm building a Django app that will periodically take information from an external source, and use it to update model objects.
What I want to be able to do is create a QuerySet which has all the objects which might match the final list, then check which model objects need to be created, updated, and deleted, and then (ideally) perform the update in the fewest number of transactions, without performing any unnecessary DB operations.
Using update_or_create gets me most of the way to what I want to do.
jobs = get_current_jobs(host, user)
for host, user, name, defaults in jobs:
    obj, _ = Job.objects.update_or_create(host=host, user=user, name=name, defaults=defaults)
The problem with this approach is that it doesn't delete anything that no longer exists.
I could just delete everything up front, or do something dumb like
to_delete = set(Job.objects.filter(host=host, user=user)) - set(current)
(Which is an option) but I feel like there must already be an elegant solution that doesn't require either deleting everything, or reading everything into memory.
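For what it's worth, the set-difference idea can also be pushed into the database so nothing has to be read into memory. A hypothetical sketch, assuming each job is uniquely identified by (host, user, name) and that jobs is the list fetched above:

current_names = [name for _, _, name, _ in jobs]
# delete rows for this host/user that no longer appear in the external source
Job.objects.filter(host=host, user=user).exclude(name__in=current_names).delete()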
You should use Redis for storage and the redis Python package in your code. For example:
import json

import redis
import requests

pool = redis.StrictRedis('localhost')
time_in_seconds = 3600  # the time period you want to keep your data

response = requests.get("url_to_ext_source")
# Redis values must be str/bytes, so serialize the JSON payload before storing it
pool.set("api_response", json.dumps(response.json()), ex=time_in_seconds)
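Reading the cached value back is the reverse; a small sketch using the same hypothetical key:

cached = pool.get("api_response")
data = json.loads(cached) if cached is not None else None  # None means the cache expired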
I've got hundreds of thousands of objects saved in S3. My requirement entails me needing to load a subset of these objects (anywhere between 5 to ~3000) and read the binary content of every object. From reading through the boto3/AWS CLI docs it looks like it's not possible to get multiple objects in one request, so currently I have implemented this as a loop that constructs the key of every object, requests the object, then reads the body of the object:
for column_key in outstanding_column_keys:
    try:
        s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
        data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
        metadata_dict = data_object["Metadata"]
        metadata_dict["key"] = column_key
        metadata_dict["version"] = float(metadata_dict["version"])
        metadata_dict["data"] = data_object["Body"].read()
        records.append(Record(metadata_dict))
    except Exception as exc:
        logger.info(exc)

if len(records) < len(column_keys):
    raise Exception("Some objects are missing!")
My issue is that when I attempt to get multiple objects (e.g. 5 objects), I get back 3, and some aren't processed by the time I check whether all objects have been loaded. I'm handling that with a custom exception. I'd come up with a solution to wrap the above code snippet in a while loop, because I know the outstanding keys that I need:
while (len(outstanding_column_keys) > 0) and (load_attempts < 10):
    for column_key in outstanding_column_keys:
        try:
            s3_object_key = "%s%s-%s" % (path_prefix, key, column_key)
            data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
            metadata_dict = data_object["Metadata"]
            metadata_dict["key"] = column_key
            metadata_dict["version"] = float(metadata_dict["version"])
            metadata_dict["data"] = data_object["Body"].read()
            records.append(Record(metadata_dict))
        except Exception as exc:
            logger.info(exc)

if len(records) < len(column_keys):
    raise Exception("Some objects are missing!")
But I took this out suspecting that S3 is actually still processing the outstanding responses and the while loop would unnecessarily make additional requests for objects that S3 is already in the process of returning.
I did a separate investigation to verify that get_object requests are synchronous and it seems they are:
import boto3
import time
import os

s3_client = boto3.client('s3', aws_access_key_id=os.environ["S3_AWS_ACCESS_KEY_ID"], aws_secret_access_key=os.environ["S3_AWS_SECRET_ACCESS_KEY"])

print("Saving 3000 objects to S3...")
start = time.time()
for x in range(3000):
    key = "greeting_{}".format(x)
    s3_client.put_object(Body="HelloWorld!", Bucket='bucket_name', Key=key)
end = time.time()
print("Done saving 3000 objects to S3 in %s" % (end - start))

print("Sleeping for 20 seconds before trying to load the saved objects...")
time.sleep(20)

print("Loading the saved objects...")
arr = []
start_load = time.time()
for x in range(3000):
    key = "greeting_{}".format(x)
    try:
        obj = s3_client.get_object(Bucket='bucket_name', Key=key)
        arr.append(obj)
    except Exception as exc:
        print(exc)
end_load = time.time()
print("Done loading the saved objects. Found %s objects. Time taken - %s" % (len(arr), end_load - start_load))
My questions, and what I need confirmation on, are:
1. Are the get_object requests indeed synchronous? If they are, then I would expect that when I check for loaded objects in the first code snippet, all of them should be returned.
2. If the get_object requests are asynchronous, how do I handle the responses in a way that avoids making extra requests to S3 for objects that are still in the process of being returned?
Further clarity/refuting of any of my assumptions about S3 would also be appreciated.
Thank you!
Unlike JavaScript, Python processes requests synchronously unless you do some sort of multithreading (which you aren't doing in your snippet above). In your for loop, you issue a request to s3_client.get_object, and that call blocks until the data is returned. Since the records array is smaller than it should be, that must mean that some exception is being thrown, and it should be caught in the except block:
except Exception as exc:
    logger.info(exc)
If that isn't printing anything, it might be because logging is configured to ignore INFO level messages. If you aren't seeing any errors, you might try printing with logger.error.
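A small sketch of what that diagnosis could look like (the basicConfig call and the ERROR-level log are assumptions about your logging setup, not part of the original code):

import logging

logging.basicConfig(level=logging.INFO)  # make INFO-level messages visible
logger = logging.getLogger(__name__)

try:
    data_object = self.s3_client.get_object(Bucket=bucket_key, Key=s3_object_key)
except Exception as exc:
    # log at ERROR so the failure is visible even under the default WARNING level
    logger.error("Failed to fetch %s: %s", s3_object_key, exc)
    raise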