How to publish multiple messages in google Pub/Sub fast? - google-cloud-platform

How do I publish multiple messages to Pub/Sub fast, without multiprocessing or multithreading? The code is already running inside a thread.
The code below publishes about 40 messages per second:
publisher = pubsub.PublisherClient(
    credentials=credentials,
    batch_settings=types.BatchSettings(
        max_messages=1000,  # default is 100
        max_bytes=1 * 1000 * 1000,  # 1 MB
        max_latency=0.1,  # default is 10 ms
    ),
)

topic_name = 'projects/{project_id}/topics/{topic}'.format(
    project_id=PROJECT_ID,
    topic=TOPIC_PUBSUB,
)

for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
    future.result()

Instead of publishing the messages one at a time and waiting on each future, publish them all first and wait on the futures at the end. It will look something like this:
from concurrent import futures
...
publish_futures = []

for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
    publish_futures.append(future)
...
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
There's a detailed example in the docs with sample code.
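One variant, in the spirit of the documented examples, attaches a done-callback to each future for error handling instead of calling result() inline. A minimal sketch (the callback name and the print-based error handling are illustrative, not from the original answer):
from concurrent import futures

def handle_publish(publish_future):
    try:
        publish_future.result()  # raises if this particular publish failed
    except Exception as exc:
        print(f"Publish failed: {exc}")

publish_futures = []
for data in results:
    bytes_json_data = json.dumps(data).encode("utf-8")
    future = publisher.publish(topic_name, bytes_json_data)
    future.add_done_callback(handle_publish)
    publish_futures.append(future)

# Block once, at the very end, until every outstanding publish has settled.
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)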

Take out:
future.result()
Leave just:
for data in results:
    bytes_json_data = str.encode(json.dumps(data))
    future = publisher.publish(topic_name, bytes_json_data)
It should then take less than a second to publish 10k messages.

Related

How do I move/copy files in s3 using boto3 asynchronously?

I understand that boto3's Object.copy_from(...) uses threads but is not asynchronous. Is it possible to make this call asynchronous? If not, is there another way to accomplish this using boto3? I'm finding that moving hundreds or thousands of files is fine, but when I'm processing hundreds of thousands of files it gets extremely slow.
You can have a look at aioboto3. It is a third party library, not created by AWS, but it provides asyncio support for selected (not all) AWS API calls.
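As a rough illustration only (the bucket and key names are made up, and the exact aioboto3 session API can differ between versions), copying many objects concurrently could look something like this:
import asyncio
import aioboto3

async def copy_all(pairs, from_bucket, to_bucket):
    session = aioboto3.Session()
    async with session.client("s3") as s3:
        tasks = [
            # copy_object handles objects up to 5 GB; larger ones need multipart copy
            s3.copy_object(
                Bucket=to_bucket,
                Key=to_key,
                CopySource={"Bucket": from_bucket, "Key": from_key},
            )
            for from_key, to_key in pairs
        ]
        # Run all the copy calls concurrently on one event loop.
        await asyncio.gather(*tasks)

# pairs = [("from/prefix/a.txt", "to/prefix/a.txt")]
# asyncio.run(copy_all(pairs, "from-bucket", "to-bucket"))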
I use the following. You can copy it into a Python file and run it from the command line. I have a PC with 8 cores, so it's faster than my little EC2 instance with 1 vCPU.
It uses the multiprocessing library, so you'll want to read up on that if you aren't familiar with it. It's relatively straightforward. There's a batch delete that I've commented out, because you really don't want to accidentally delete the wrong directory. You can use whatever methods you want to list the keys or iterate through the objects, but this works for me.
from multiprocessing import Pool
from itertools import repeat
import boto3
import os
import math

s3sc = boto3.client('s3')
s3sr = boto3.resource('s3')
num_proc = os.cpu_count()

def get_list_of_keys_from_prefix(bucket, prefix):
    """gets list of keys for given bucket and prefix"""
    keys_list = []
    paginator = s3sr.meta.client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        keys = [content['Key'] for content in page.get('Contents', [])]
        keys_list.extend(keys)
    if prefix in keys_list:
        keys_list.remove(prefix)
    return keys_list

def batch_delete_s3(keys_list, bucket):
    total_keys = len(keys_list)
    chunk_size = 1000  # delete_objects accepts at most 1000 keys per call
    num_batches = math.ceil(total_keys / chunk_size)
    for b in range(0, num_batches):
        batch_to_delete = []
        for k in keys_list[chunk_size * b:chunk_size * b + chunk_size]:
            batch_to_delete.append({'Key': k})
        s3sc.delete_objects(Bucket=bucket, Delete={'Objects': batch_to_delete, 'Quiet': True})

def copy_s3_to_s3(from_bucket, from_key, to_bucket, to_key):
    copy_source = {'Bucket': from_bucket, 'Key': from_key}
    s3sr.meta.client.copy(copy_source, to_bucket, to_key)

def upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=4):
    with Pool(num_proc) as pool:
        r = pool.starmap(copy_s3_to_s3, zip(repeat(from_bucket), keys_list_from, repeat(to_bucket), keys_list_to), 15)
        pool.close()
        pool.join()
    return r

if __name__ == '__main__':
    __spec__ = None
    from_bucket = 'from-bucket'
    from_prefix = 'from/prefix/'
    to_bucket = 'to-bucket'
    to_prefix = 'to/prefix/'
    keys_list_from = get_list_of_keys_from_prefix(from_bucket, from_prefix)
    keys_list_to = [to_prefix + k.rsplit('/')[-1] for k in keys_list_from]
    rs = upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=num_proc)
    # batch_delete_s3(keys_list_from, from_bucket)
I think you can use boto3 along with Python threads to handle such cases. The AWS S3 docs mention:
Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.
So you can make up to 3,500 copy requests in one concurrent round; nothing can override this 3,500 limit set by AWS.
Using threads, you would need only about 300 such rounds of calls.
In the worst case it takes about 5 hours, assuming your files are large and take about 1 minute each to upload on average.
Note: Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
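As a rough sketch of that threaded approach (function names and the worker count are illustrative; boto3 clients are generally thread-safe, so a single client can be shared across the pool):
from concurrent.futures import ThreadPoolExecutor, as_completed
import boto3

s3 = boto3.client('s3')

def copy_one(from_bucket, from_key, to_bucket, to_key):
    # Managed copy; handles multipart copies for large objects.
    s3.copy({'Bucket': from_bucket, 'Key': from_key}, to_bucket, to_key)

def copy_many(pairs, from_bucket, to_bucket, max_workers=50):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(copy_one, from_bucket, from_key, to_bucket, to_key)
            for from_key, to_key in pairs
        ]
        for f in as_completed(futures):
            f.result()  # re-raise any copy error here

# copy_many([('from/prefix/a.txt', 'to/prefix/a.txt')], 'from-bucket', 'to-bucket')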

Delete sqs queue based on 'CreatedTimestamp'

Similar to question Bulk delete sqs queues using boto3, I now want to delete queues based on 'CreatedTimestamp'. If the created time (in epoch time) is before a specific epoch timestamp, it should be deleted.
I tried to write something similar to the answer given in the earlier post, but I am not sure if I need to loop again to check the created time.
client = boto3.client('sqs')
timestamp = '1645747200'

def delete_sqs_queues(event, context):
    response = client.list_queues()
    for sqs_url in response['QueueUrls']:
        get_att = client.get_queue_attributes(
            QueueUrl=sqs_url,
            AttributeNames=['CreatedTimestamp']
        )
Can I directly compare the CreatedTimestamp with a variable or do I need another for loop to iterate over the CreatedTimestamps?
There is only one CreatedTimestamp for each queue, thus you don't have to iterate again:
for sqs_url in response['QueueUrls']:
    get_att = client.get_queue_attributes(
        QueueUrl=sqs_url,
        AttributeNames=['CreatedTimestamp']
    )
    queue_timestamp = get_att['Attributes']['CreatedTimestamp']
    print(queue_timestamp)
    # for example, compare as integers since the attribute is returned as a string
    if int(queue_timestamp) < int(timestamp):
        print(f'Delete {sqs_url} because of {queue_timestamp}')
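If you then want to actually delete the matching queues, a minimal sketch (reusing the client and timestamp variables from above; delete_queue is the standard boto3 SQS call) could look like this:
def delete_sqs_queues(event, context):
    response = client.list_queues()
    for sqs_url in response.get('QueueUrls', []):
        get_att = client.get_queue_attributes(
            QueueUrl=sqs_url,
            AttributeNames=['CreatedTimestamp']
        )
        queue_timestamp = int(get_att['Attributes']['CreatedTimestamp'])
        if queue_timestamp < int(timestamp):
            # Permanently removes the queue and any messages still in it.
            client.delete_queue(QueueUrl=sqs_url)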

SSE DJANGO REQUEST COUNT

I'm still new to SSE and I have a question about SSE in Django 3.2.5. I am using StreamingHttpResponse to send an SSE response to an EventSource client, and it works fine.
My questions are:
why does it take so long to open the connection between the backend and the EventSource?
why does it send only 167 responses per 32 seconds?
I tried to look into the code of StreamingHttpResponse but I didn't find anything related to the number of responses.
Here is the code:
from datetime import datetime
from time import sleep

from django.http import StreamingHttpResponse

def sse_movies(request):
    def event_stream():
        while True:
            sleep(.2)
            yield f"data: {datetime.time(datetime.now())}\n\n"
    return StreamingHttpResponse(event_stream(), content_type='text/event-stream')
I am using sleep() to wait only 200 milliseconds per iteration.
But whenever the EventSource connects, it waits almost 32 seconds to initiate the connection with the backend; after that it receives 167 messages, waits 2 seconds, then receives another 167 messages, and after the second 167 it waits another 32 seconds.
here is the code of EventSource client
let url = '/test/' +'sse/movies/'
let sse_client = new EventSource(url)
let movies = document.querySelector('#data-movies')
let movies_list = document.querySelector('#messages')
sse_client.onopen = function(message_event) {
console.log('opened')
}
console.log(sse_client)
sse_client.onmessage = (message_event) => {
console.log(message_event.data)
console.log(sse_client.readyState)
}
NOTE: when I remove the while True loop, the EventSource doesn't wait and sends requests as fast as possible.
Maybe I misunderstand something here, but I hope somebody can help me.
I figured out the issue.
It wasn't in the code itself; it was related to the buffer size of the web server.
When I edited my code as below, it worked fine:
def sse_movies(request):
    def event_stream():
        body = 'A' * 6000  # pad the payload to at least 6000 characters
        body_len = len(body)
        print(body_len)
        while True:
            sleep(2)
            yield f"data: {body}\n\n"
    return StreamingHttpResponse(event_stream(), content_type='text/event-stream')
As you can see above, the minimum buffer size is 6000 characters (I don't know how much that is in bytes), but it worked, Alhamdulillah.
I really don't know much about buffers or buffer sizes, but I thought that could be the issue.
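For what it's worth, when the buffering comes from a reverse proxy such as nginx rather than from Django itself, another common workaround (not part of the original answer) is to ask the proxy not to buffer the stream via the X-Accel-Buffering response header; a minimal sketch:
def sse_movies(request):
    def event_stream():
        while True:
            sleep(.2)
            yield f"data: {datetime.time(datetime.now())}\n\n"
    response = StreamingHttpResponse(event_stream(), content_type='text/event-stream')
    # Hint to nginx (and some other proxies) not to buffer this response.
    response['X-Accel-Buffering'] = 'no'
    return response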

Multiple Greenlets in a loop and ZMQ. Greenlet blocks in a first _run

I wrote two types of greenlets. MyGreenletPUB publishes messages via ZMQ with message type 1 and message type 2.
MyGreenletSUB instances subscribe to the ZMQ PUB socket, based on a parameter ("1" or "2").
The problem is that when I start my greenlets, the run method in MyGreenletSUB blocks on message = sock.recv() and never returns execution to the other greenlets.
My question is how I can avoid this and run my greenlets asynchronously with a while True loop, without using gevent.sleep() inside the loops to switch execution between greenlets.
from gevent.monkey import patch_all
patch_all()

import zmq
import time
import gevent
from gevent import Greenlet

class MyGreenletPUB(Greenlet):
    def _run(self):
        # ZeroMQ Context
        context = zmq.Context()
        # Define the socket using the "Context"
        sock = context.socket(zmq.PUB)
        sock.bind("tcp://127.0.0.1:5680")
        id = 0
        while True:
            gevent.sleep(1)
            id, now = id + 1, time.ctime()
            # Message [prefix][message]
            message = "1#".format(id=id, time=now)
            sock.send(message)
            # Message [prefix][message]
            message = "2#".format(id=id, time=now)
            sock.send(message)
            id += 1

class MyGreenletSUB(Greenlet):
    def __init__(self, b):
        Greenlet.__init__(self)
        self.b = b

    def _run(self):
        context = zmq.Context()
        # Define the socket using the "Context"
        sock = context.socket(zmq.SUB)
        # Define subscription and messages with prefix to accept.
        sock.setsockopt(zmq.SUBSCRIBE, self.b)
        sock.connect("tcp://127.0.0.1:5680")
        while True:
            message = sock.recv()
            print message

g = MyGreenletPUB.spawn()
g2 = MyGreenletSUB.spawn("1")
g3 = MyGreenletSUB.spawn("2")

try:
    gevent.joinall([g, g2, g3])
except KeyboardInterrupt:
    print "Exiting"
By default, ZeroMQ's .recv() method blocks until something arrives that can be handed to the .recv() caller.
For smart, non-blocking agents, prefer the .poll() instance method together with .recv(zmq.NOBLOCK).
Beware that ZeroMQ subscriptions match topic filters from the left, and you may run into issues if unicode and non-unicode strings are mixed when messages are distributed and collected.
Also, mixing several event loops can get a bit tricky, depending on your control needs. I personally always prefer non-blocking designs, even at the cost of more complex design effort.
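A rough sketch of that poll-and-NOBLOCK pattern inside the subscriber loop might look like this (illustrative only; strictly speaking the loop still needs a yield point such as gevent.sleep(0) so the other greenlets can run, or you can switch to pyzmq's gevent-aware zmq.green bindings instead):
import zmq
import gevent

def drain(sock):
    while True:
        # poll with a zero timeout returns immediately instead of blocking in recv()
        if sock.poll(timeout=0, flags=zmq.POLLIN):
            try:
                message = sock.recv(zmq.NOBLOCK)
                print(message)
            except zmq.Again:
                # nothing arrived between poll() and recv(); just carry on
                pass
        gevent.sleep(0)  # hand control back to the other greenlets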

Django Celery reduce time, 5 hours to complete 1000 tasks

I'm running in a development environment, so this may be different in production, but when I run a task from Django Celery, it seems to only fetch tasks from the broker every 10-20 seconds. I'm only testing at this point, but if I'm sending around 1000 tasks this means it will take over 5 hours to complete.
Is this normal? Should it be quicker? Or am I doing something wrong?
This is my task
class SendMessage(Task):
    name = "Sending SMS"
    max_retries = 10
    default_retry_delay = 3

    def run(self, message_id, gateway_id=None, **kwargs):
        logging.debug("About to send a message.")
        # Because we don't always have control over transactions
        # in our calling code, we will retry up to 10 times, every 3
        # seconds, in order to try to allow for the commit to the database
        # to finish. That gives the server 30 seconds to write all of
        # the data to the database, and finish the view.
        try:
            message = Message.objects.get(pk=message_id)
        except Exception as exc:
            raise SendMessage.retry(exc=exc)
        if not gateway_id:
            if hasattr(message.billee, 'sms_gateway'):
                gateway = message.billee.sms_gateway
            else:
                gateway = Gateway.objects.all()[0]
        else:
            gateway = Gateway.objects.get(pk=gateway_id)
        #response = gateway._send(message)
        print(message_id)
        logging.debug("Done sending message.")
which gets run from my view:
for e in Contact.objects.filter(contact_owner=request.user etc etc):
    SendMessage.delay(e.id, message)
Yes, this is normal. It comes from the default worker configuration; the defaults are set so that task processing does not hurt the performance of the app.
There is another way to change it: the task decorator can take a number of options that change the way the task behaves, and any keyword argument passed to the task decorator is set as an attribute of the resulting task class.
You can set a rate limit, which limits the number of tasks that can run in a given time frame:
CELERY_DEFAULT_RATE_LIMIT = "100/m"  # set in settings; "100/m" means one hundred tasks a minute, and /s (second) and /h (hour) also work
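For reference, here is a minimal sketch of the per-task form of that option, using the decorator-based task style of newer Celery versions (the app and task names are illustrative, not from the question):
from celery import Celery

app = Celery('sms')  # illustrative app name

@app.task(rate_limit='100/m')  # at most 100 runs of this task per minute, per worker
def send_message(message_id, gateway_id=None):
    # the actual sending logic from the question's SendMessage.run would go here
    ...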