I understand using boto3 Object.copy_from(...) uses threads but is not asynchronous. Is it possible to make this call asynchronous? If not, is there another way to accomplish this using boto3? I'm finding that moving hundreds/thousands of files is fine, but when i'm processing 100's of thousands of files it gets extremely slow.
You can have a look at aioboto3. It is a third party library, not created by AWS, but it provides asyncio support for selected (not all) AWS API calls.
I use the following. You can copy into a python file and run it from the command line. I have a PC with 8 cores, so it's faster than my little EC2 instance with 1 VPC.
It uses the multiprocessing library, so you'd want to read up on that if you aren't familiar. It's relatively straightforward. There's a batch delete that I've commented out because you really don't want to accidentally delete the wrong directory. You can use whatever methods you want to list the keys or iterate through the objects, but this works for me.
from multiprocessing import Pool
from itertools import repeat
import boto3
import os
import math
s3sc = boto3.client('s3')
s3sr = boto3.resource('s3')
num_proc = os.cpu_count()
def get_list_of_keys_from_prefix(bucket, prefix):
"""gets list of keys for given bucket and prefix"""
keys_list = []
paginator = s3sr.meta.client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
keys = [content['Key'] for content in page.get('Contents')]
keys_list.extend(keys)
if prefix in keys_list:
keys_list.remove(prefix)
return keys_list
def batch_delete_s3(keys_list, bucket):
total_keys = len(keys_list)
chunk_size = 1000
num_batches = math.ceil(total_keys / chunk_size)
for b in range(0, num_batches):
batch_to_delete = []
for k in keys_list[chunk_size*b:chunk_size*b+chunk_size]:
batch_to_delete.append({'Key': k})
s3sc.delete_objects(Bucket=bucket, Delete={'Objects': batch_to_delete,'Quiet': True})
def copy_s3_to_s3(from_bucket, from_key, to_bucket, to_key):
copy_source = {'Bucket': from_bucket, 'Key': from_key}
s3sr.meta.client.copy(copy_source, to_bucket, to_key)
def upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=4):
with Pool(num_proc) as pool:
r = pool.starmap(copy_s3_to_s3, zip(repeat(from_bucket), keys_list_from, repeat(to_bucket), keys_list_to), 15)
pool.close()
pool.join()
return r
if __name__ == '__main__':
__spec__= None
from_bucket = 'from-bucket'
from_prefix = 'from/prefix/'
to_bucket = 'to-bucket'
to_prefix = 'to/prefix/'
keys_list_from = get_list_of_keys_from_prefix(from_bucket, from_prefix)
keys_list_to = [to_prefix + k.rsplit('/')[-1] for k in keys_list_from]
rs = upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=num_proc)
# batch_delete_s3(keys_list_from, from_bucket)
I think you can use boto3 along with python threads to handle such cases, In AWS S3 Docs they mentioned
Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.
So you can make 3500 uploads by one call, nothing can override this 3500 limit set by AWS.
By using threads you need just 300 (approx) calls.
It Takes 5 Hrs in Worst Case i.e considering that your files are large, will take 1 min for a file to upload on average.
Note: Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
Related
I have a websites which provides a selling platform for individuals. Each individual registers with his bitcoin address and has to input his transaction ID after each transaction.
My code -
import urllib
import re
urlr = "https://blockchain.info/q/txresult/"+hash+"/"+receiver.bitcoin_account
urls = "https://blockchain.info/q/txresult/"+hash+"/"+sender.bitcoin_account
try:
res = urllib.urlopen(urls)
resread = res.read()
sen = urllib.urlopen(urlr)
senread = sen.read()
except IOError:
resread = ""
senread = ""
try:
resread = int(resread)
senread = int(senread)
if resread >= 5000000 and senread != 0:
...
Please i need a better solution, if i can get
You may get a better result if you run bitcoind yourself, and do not rely on blockchain.info's API. Simply start bitcoind with the following options:
bitcoind -txindex -server
If you already synced with the network before you might need to include -reindex on the first time.
You will then be able to use the JSON-RPC interface to query for transactions:
bitcoin-cli getrawtransaction 4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b
Better yet, you can use the python-bitcoinlib library to query and parse the transaction without shelling out to bitcoin-cli.
from binascii import unhexlify
from bitcoin.rpc Proxy
p = Proxy("http://rpcuser:rpcpass#127.0.0.1:8332")
h = unhexlify("4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b")
print(p.gettransaction(h))
That should give you direct access to a local copy of the Bitcoin blockchain, without having to trust blockchain.info, and be faster and more scalable.
The following code streams a postgres BYTEA column to a browser
from flask import Response, stream_with_context
#app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
file = ZFile.query.filter_by(id=file_id).first()
return Response(stream_with_context(file.data), mimetype=file.mime_type)
it is extreemely slow (aprox 6 minutes for 5 mb).
I am downloading with curl from the same host, so network is not the issue,
also I can extract the file from the psql console in less than a second,
so it seems the database side is also not to blame :
COPY (select f.data from z_file f where f.id = '4ec3rf') TO 'zazX.pdf' (FORMAT binary)
Update:
I have further evidence that the "fetch from the DB" step is not slow, If I write file.data to a file using
with open("/vagrant/zoz.pdf", 'wb') as output:
output.write(file.data)
it also takes a fraction of a second. So the slowness is caused by the way Flask does the streaming.
I had this issue while using Flask to proxy streaming from another url using python-requests.
In this use case, the trick is setting the chunk_size parameter in iter_content:
def flask_view():
...
req = requests.get(url, stream=True, params=args)
return Response(
stream_with_context(req.iter_content(chunk_size=1024)),
content_type=req.headers['content-type']
otherwise it will use chunk_size=1, which can slow things down quite a bit. In my case the streaming went from a couple of kb/s to several mb/s after the increase in chunk_size.
Flask can be given a generator that returns the whole array in a single yield and will "know" how to deal with it, this returns in milliseconds :
from flask import Response, stream_with_context
#app.route('/api/1/zfile/<file_id>', methods=['GET'])
def download_file(file_id):
file = ZFile.query.filter_by(id=file_id).first()
def single_chunk_generator():
yield file.data
return Response(stream_with_context(single_chunk_generator()), mimetype=file.mime_type)
stream_with_context, when given an array will create a generator that iterates through it and do various checks on every element, which causes a huge performance hit.
Currently I have a method that retrieves all ~119,000 gmail accounts and writes them to a csv file using python code below and the enabled admin.sdk + auth 2.0:
def get_accounts(self):
students = []
page_token = None
params = {'customer': 'my_customer'}
while True:
try:
if page_token:
params['pageToken'] = page_token
current_page = self.dir_api.users().list(**params).execute()
students.extend(current_page['users'])
# write each page of data to a file
csv_file = CSVWriter(students, self.output_file)
csv_file.write_file()
# clear the list for the next page of data
del students[:]
page_token = current_page.get('nextPageToken')
if not page_token:
break
except errors.HttpError as error:
break
I would like to retrieve all 119,000 as a lump sum, that is, without having to loop or as a batch call. Is this possible and if so, can you provide example python code? I have run into communication issues and have to rerun the process multiple times to obtain the ~119,000 accts successfully (takes about 10 minutes to download). Would like to minimize communication errors. Please advise if better method exists or non-looping method also is possible.
There's no way to do this as a batch because you need to know each pageToken and those are only given as the page is retrieved. However, you can increase your performance somewhat by getting larger pages:
params = {'customer': 'my_customer', 'maxResults': 500}
since the default page size when maxResults is not set is 100, adding maxResults: 500 will reduce the number of API calls by an order of 5. While each call may take slightly longer, you should notice performance increases because you're making far fewer API calls and HTTP round trips.
You should also look at using the fields parameter to only specify user attributes you need to read in the list. That way you're not wasting time and bandwidth retrieving details about your users that your app never uses. Try something like:
my_fields = 'nextPageToken,users(primaryEmail,name,suspended)'
params = {
'customer': 'my_customer',
maxResults': 500,
fields: my_fields
}
Last of all, if your app retrieves the list of users fairly frequently, turning on caching may help.
Hi all we are using google api for e.g. this one 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query via python script but very fast it gets blocked. Any work around for this? Thank you.
Below is my current codes.
#!/usr/bin/env python
import math,sys
import json
import urllib
def gsearch(searchfor):
query = urllib.urlencode({'q': searchfor})
url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
search_response = urllib.urlopen(url)
search_results = search_response.read()
results = json.loads(search_results)
data = results['responseData']
return data
args = sys.argv[1:]
m = 45000000000
if len(args) != 2:
print "need two words as arguments"
exit
n0 = int(gsearch(args[0])['cursor']['estimatedResultCount'])
n1 = int(gsearch(args[1])['cursor']['estimatedResultCount'])
n2 = int(gsearch(args[0]+" "+args[1])['cursor']['estimatedResultCount'])
The link doesn't work, and there is no code here, so all I can suggest is finding out from the API what the limits are, and delaying your requests appropriately. Alternatively, you can probably pay for less restricted API usage.
Link is bad.
Usually you can overcome this by paying for use.
it seems that Boto is the official Amazon API module for Python, and this one is for Tornado, so here is my questions:
does it offer pagination (requests only 10 products, since amazon offers 10 products per page, then i want only to get the first page...), then how (sample code?)
how then to parse the product parse, i've used python-amazon-simple-product-api but sadly it dont offer pagination, so all the offers keep iterating.
generally, pagination is performed by the client requesting the api. To do this in boto, you'll need to cut your systems up. So for instance, say you make a call to AWS via boto, using the get_all_instances def; you'll need to store those somehow and then keep track of which servers have been displayed, and which not. To my knowledge, boto does not have the LIMIT functionality most dev's are used to from MySQL. Personally, I scan all my instances and stash them in mongo like so:
for r in conn.get_all_instances(): # loop through all reservations
groups = [g.name for g in r.groups] # get a list of groups for this reservation
for x in r.instances: # loop through all instances with-in reservation
groups = ','.join(groups) # join the groups into a comma separated list
name = x.tags.get('Name',''); # get instance name from the 'Name' tag
new_record = { "tagname":name, "ip_address":x.private_ip_address,
"external_ip_nat":x.ip_address, "type":x.instance_type,
"state":x.state, "base_image":x.image_id, "placement":x.placement,
"public_ec2_dns":x.public_dns_name,
"launch_time":x.launch_time, "parent": ObjectId(account['_id'])}
new_record['groups'] = groups
systems_coll.update({'_id':x.id},{"$set":new_record},upsert=True)
error = db.error()
if error != None:
print "err:%s:" % str(error)
You could also wrap these in try/catch blocks. Up to you. Once you get them out of boto, should be trivial to do the cut up work.
-- Jess