Hi all, we are using the Google API, e.g. 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query, from a Python script, but it gets blocked very quickly. Is there any workaround for this? Thank you.
Below is my current code.
#!/usr/bin/env python
import math, sys
import json
import urllib

def gsearch(searchfor):
    query = urllib.urlencode({'q': searchfor})
    url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
    search_response = urllib.urlopen(url)
    search_results = search_response.read()
    results = json.loads(search_results)
    data = results['responseData']
    return data

args = sys.argv[1:]
m = 45000000000
if len(args) != 2:
    print "need two words as arguments"
    sys.exit(1)
n0 = int(gsearch(args[0])['cursor']['estimatedResultCount'])
n1 = int(gsearch(args[1])['cursor']['estimatedResultCount'])
n2 = int(gsearch(args[0]+" "+args[1])['cursor']['estimatedResultCount'])
The link doesn't work, so all I can suggest is finding out from the API what the limits are and delaying your requests appropriately. Alternatively, you can probably pay for less restricted API usage.
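For example, a minimal throttling wrapper around the gsearch() function from the question might look like this (the delay value is only a placeholder; check the API's documented quota and adjust it accordingly):

import time

def gsearch_throttled(terms, delay_seconds=2.0):
    """Call gsearch() for each term, pausing between requests."""
    results = {}
    for term in terms:
        results[term] = gsearch(term)
        # Spread the requests out so we stay under the rate limit.
        time.sleep(delay_seconds)
    return results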
Link is bad.
Usually you can overcome this by paying for use.
Related
I understand that boto3's Object.copy_from(...) uses threads but is not asynchronous. Is it possible to make this call asynchronous? If not, is there another way to accomplish this using boto3? I'm finding that moving hundreds or thousands of files is fine, but when I'm processing hundreds of thousands of files it gets extremely slow.
You can have a look at aioboto3. It is a third party library, not created by AWS, but it provides asyncio support for selected (not all) AWS API calls.
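For illustration, a rough sketch of an asynchronous server-side copy with aioboto3 might look like the following. This assumes a recent aioboto3 release where aioboto3.Session() mirrors the boto3 interface and clients are async context managers; the bucket and key names are placeholders, and copy_object is the single-request copy (very large objects would need a multipart copy instead):

import asyncio
import aioboto3

async def copy_keys(keys, from_bucket, to_bucket):
    session = aioboto3.Session()
    async with session.client("s3") as s3:
        # Fire all server-side copies concurrently and wait for them to finish.
        await asyncio.gather(*[
            s3.copy_object(
                Bucket=to_bucket,
                Key=key,
                CopySource={"Bucket": from_bucket, "Key": key},
            )
            for key in keys
        ])

# Example usage (placeholder names):
# asyncio.run(copy_keys(["a.csv", "b.csv"], "source-bucket", "dest-bucket"))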
I use the following. You can copy it into a Python file and run it from the command line. I have a PC with 8 cores, so it's faster than my little EC2 instance with 1 vCPU.
It uses the multiprocessing library, so you'll want to read up on that if you aren't familiar with it. It's relatively straightforward. There's a batch delete that I've commented out, because you really don't want to accidentally delete the wrong directory. You can use whatever methods you want to list the keys or iterate through the objects, but this works for me.
from multiprocessing import Pool
from itertools import repeat
import boto3
import os
import math

s3sc = boto3.client('s3')
s3sr = boto3.resource('s3')
num_proc = os.cpu_count()

def get_list_of_keys_from_prefix(bucket, prefix):
    """gets list of keys for given bucket and prefix"""
    keys_list = []
    paginator = s3sr.meta.client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        keys = [content['Key'] for content in page.get('Contents')]
        keys_list.extend(keys)
    if prefix in keys_list:
        keys_list.remove(prefix)
    return keys_list

def batch_delete_s3(keys_list, bucket):
    total_keys = len(keys_list)
    chunk_size = 1000
    num_batches = math.ceil(total_keys / chunk_size)
    for b in range(0, num_batches):
        batch_to_delete = []
        for k in keys_list[chunk_size*b:chunk_size*b+chunk_size]:
            batch_to_delete.append({'Key': k})
        s3sc.delete_objects(Bucket=bucket, Delete={'Objects': batch_to_delete, 'Quiet': True})

def copy_s3_to_s3(from_bucket, from_key, to_bucket, to_key):
    copy_source = {'Bucket': from_bucket, 'Key': from_key}
    s3sr.meta.client.copy(copy_source, to_bucket, to_key)

def upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=4):
    with Pool(num_proc) as pool:
        r = pool.starmap(copy_s3_to_s3, zip(repeat(from_bucket), keys_list_from, repeat(to_bucket), keys_list_to), 15)
        pool.close()
        pool.join()
    return r

if __name__ == '__main__':
    __spec__ = None

    from_bucket = 'from-bucket'
    from_prefix = 'from/prefix/'
    to_bucket = 'to-bucket'
    to_prefix = 'to/prefix/'

    keys_list_from = get_list_of_keys_from_prefix(from_bucket, from_prefix)
    keys_list_to = [to_prefix + k.rsplit('/')[-1] for k in keys_list_from]

    rs = upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=num_proc)
    # batch_delete_s3(keys_list_from, from_bucket)
I think you can use boto3 along with Python threads to handle such cases. The AWS S3 docs mention:
Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket.
So you can make up to 3,500 such requests per second per prefix; this limit is set by AWS and cannot be raised from the client side.
By using threads you would need only around 300 batches of requests.
In the worst case it takes about 5 hours, assuming your files are large and each one takes roughly a minute to upload on average.
Note: Running more threads consumes more resources on your machine. You must be sure that your machine has enough resources to support the maximum number of concurrent requests that you want.
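As a rough sketch of that thread-based approach (not the poster's exact code; the bucket names, key list, and worker count are placeholders you would tune to your own machine and data):

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client('s3')

def copy_one(key, from_bucket='source-bucket', to_bucket='dest-bucket'):
    # Server-side copy; the object data never passes through this machine.
    s3.copy({'Bucket': from_bucket, 'Key': key}, to_bucket, key)

def copy_all(keys, max_workers=32):
    # Each worker issues its own requests; keep max_workers within what your
    # machine and the per-prefix request limits can comfortably handle.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(copy_one, keys))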
I have a Dash app that makes some graphs based on data drawn from an API, and I'd like to give the user an option to change a parameter, pull new data based on this, and then redraw the graphs. It could be through a form, but I figured the simplest method would be to use the <pathname> route system from Flask. Dash allows me to do this:
import dash
import dash_core_components as dcc
import dash_html_components as html
import plotly.express as px

app = dash.Dash(__name__)

app.layout = html.Div(children=[
    dcc.Location(id='url', refresh=False),
    html.Div(id='page-content'),
])

@app.callback(dash.dependencies.Output('page-content', 'children'),
              [dash.dependencies.Input('url', 'pathname')])
def display_page(pathname):
    if pathname == '/':
        return html.Div('Please append a pathname to the route')
    else:
        data = get_data_from_api(int(pathname))
        fig_1 = px.line(data, x="time", y="price")
        fig_2 = px.line(data, x="time", y="popularity")
        return html.Div(children=[
            dcc.Graph(id='fig_1', figure=fig_1),
            dcc.Graph(id='fig_2', figure=fig_2),
        ])

if __name__ == '__main__':
    app.run_server(debug=True)
But the problem is that the API call takes a minute or two, and the app seems to be constantly polling it, so the request times out and the graphs never redraw. What I need is something that doesn't auto-refresh, which can run the API call, update the underlying data, and then tell the app to refresh its state.
I did consider a Dash-within-Flask hybrid like this, but it seems excessively complicated for my use-case. Is there a simpler way to do this?
I think you can add an html.Button to your layout.
html.Button('Update', id='update-button')
To your Callback you can add:
@app.callback(dash.dependencies.Output('page-content', 'children'),
              [dash.dependencies.Input('url', 'pathname'),
               dash.dependencies.Input('update-button', 'n_clicks')])
def display_page(pathname, n_clicks):
    ....
There's no need to process the variable n_clicks in any way; the callback is always triggered.
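Putting the pieces together with the layout from the question (and assuming the same imports and app object), the wiring would look roughly like this, a sketch only, reusing the question's own display_page body where the dots are:

app.layout = html.Div(children=[
    dcc.Location(id='url', refresh=False),
    html.Button('Update', id='update-button'),
    html.Div(id='page-content'),
])

@app.callback(dash.dependencies.Output('page-content', 'children'),
              [dash.dependencies.Input('url', 'pathname'),
               dash.dependencies.Input('update-button', 'n_clicks')])
def display_page(pathname, n_clicks):
    # Clicking the button simply re-fires this callback, which re-pulls the
    # data and rebuilds the figures; n_clicks itself is never inspected.
    ...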
Cheers
I have a website which provides a selling platform for individuals. Each individual registers with his bitcoin address and has to input his transaction ID after each transaction.
My code -
import urllib
import re

urlr = "https://blockchain.info/q/txresult/"+hash+"/"+receiver.bitcoin_account
urls = "https://blockchain.info/q/txresult/"+hash+"/"+sender.bitcoin_account

try:
    res = urllib.urlopen(urls)
    resread = res.read()
    sen = urllib.urlopen(urlr)
    senread = sen.read()
except IOError:
    resread = ""
    senread = ""

try:
    resread = int(resread)
    senread = int(senread)
    if resread >= 5000000 and senread != 0:
        ...
Please, I need a better solution if I can get one.
You may get a better result if you run bitcoind yourself, and do not rely on blockchain.info's API. Simply start bitcoind with the following options:
bitcoind -txindex -server
If you already synced with the network before, you might need to include -reindex the first time.
You will then be able to use the JSON-RPC interface to query for transactions:
bitcoin-cli getrawtransaction 4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b
Better yet, you can use the python-bitcoinlib library to query and parse the transaction without shelling out to bitcoin-cli.
from binascii import unhexlify
from bitcoin.rpc import Proxy

p = Proxy("http://rpcuser:rpcpass@127.0.0.1:8332")
h = unhexlify("4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b")
print(p.gettransaction(h))
That should give you direct access to a local copy of the Bitcoin blockchain, without having to trust blockchain.info, and be faster and more scalable.
I am using Tornado's CurlAsyncHTTPClient. My process memory keeps growing, for both blocking and non-blocking requests, when I instantiate the corresponding HTTP clients for each request. This memory growth does not happen if I have just one instance of the HTTP clients (tornado.httpclient.HTTPClient / tornado.httpclient.AsyncHTTPClient) and reuse them.
Also, if I use SimpleAsyncHTTPClient instead of CurlAsyncHTTPClient, this memory growth does not happen, irrespective of how I instantiate.
Here is sample code that reproduces this:
import tornado.httpclient
import tornado.ioloop
import json
import functools

instantiate_once = False
count = 0  # running total of requests fired (referenced via `global count` below)

tornado.httpclient.AsyncHTTPClient.configure('tornado.curl_httpclient.CurlAsyncHTTPClient')

hc, io_loop, async_hc = None, None, None
if instantiate_once:
    hc = tornado.httpclient.HTTPClient()
    io_loop = tornado.ioloop.IOLoop()
    async_hc = tornado.httpclient.AsyncHTTPClient(io_loop=io_loop)

def fire_sync_request():
    global count
    if instantiate_once:
        global hc
    if not instantiate_once:
        hc = tornado.httpclient.HTTPClient()
    url = '<Please try with a url>'
    try:
        resp = hc.fetch(url)
    except (Exception, tornado.httpclient.HTTPError) as e:
        print str(e)
    if not instantiate_once:
        hc.close()

def fire_async_requests():
    # generic response callback fn
    def response_callback(response):
        response_callback_info['response_count'] += 1
        if response_callback_info['response_count'] >= request_count:
            io_loop.stop()

    if instantiate_once:
        global io_loop, async_hc
    if not instantiate_once:
        io_loop = tornado.ioloop.IOLoop()

    requests = ['<Please add ur url to try>'] * 5
    response_callback_info = {'response_count': 0}
    request_count = len(requests)

    global count
    count += request_count

    hcs = []
    for url in requests:
        kwargs = {}
        kwargs['method'] = 'GET'
        if not instantiate_once:
            async_hc = tornado.httpclient.AsyncHTTPClient(io_loop=io_loop)
        async_hc.fetch(url, callback=functools.partial(response_callback), **kwargs)
        if not instantiate_once:
            hcs.append(async_hc)

    io_loop.start()
    for hc in hcs:
        hc.close()
    if not instantiate_once:
        io_loop.close()

if __name__ == '__main__':
    import sys
    if sys.argv[1] == 'sync':
        while True:
            output = fire_sync_request()
    elif sys.argv[1] == 'async':
        while True:
            output = fire_async_requests()
Here, set the instantiate_once variable to False and execute
python check.py sync or python check.py async. The process memory increases continuously.
With instantiate_once=True, this does not happen.
Also, if I use SimpleAsyncHTTPClient instead of CurlAsyncHTTPClient, this memory growth does not happen.
I have Python 2.7 / Tornado 2.3.2 / pycurl (libcurl/7.26.0 GnuTLS/2.12.20 zlib/1.2.7 libidn/1.25 libssh2/1.4.2 librtmp/2.3).
I could reproduce the same issue with the latest Tornado 3.2.
Please help me understand this behaviour and figure out the right way of using Tornado as an HTTP library.
HTTPClient and AsyncHTTPClient are designed to be reused, so it will always be more efficient not to recreate them all the time. In fact, AsyncHTTPClient will try to magically detect if there is an existing AsyncHTTPClient on the same IOLoop and use that instead of creating a new one.
But even though it's better to reuse one http client object, it shouldn't leak to create many of them as you're doing here (as long as you're closing them). This looks like a bug in pycurl: https://github.com/pycurl/pycurl/issues/182
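For reference, the reuse pattern described above is just a matter of creating the client once per process, e.g. (a minimal sketch):

import tornado.httpclient

# Created once and shared by every request in this process.
http_client = tornado.httpclient.HTTPClient()

def fetch(url):
    # Reuses the shared client instead of constructing (and leaking) a new one per call.
    return http_client.fetch(url)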
Use pycurl 7.19.5 and this hack to avoid memory leaks:
Your Tornado main file:
tornado.httpclient.AsyncHTTPClient.configure("curl_httpclient_leaks_patched.CurlAsyncHTTPClientEx")
curl_httpclient_leaks_patched.py
from tornado import curl_httpclient

class CurlAsyncHTTPClientEx(curl_httpclient.CurlAsyncHTTPClient):

    def close(self):
        super(CurlAsyncHTTPClientEx, self).close()
        del self._multi
I want to create a function which grabs every user's latest tweet from a specific group. So, if a user is in the 'authors' group, I want to grab their latest tweet and then finally cache the result for the day, so we only do the crazy legwork once.
def latest_tweets(self):
    g = Group.objects.get(name='author')
    users = []
    for u in g.user_set.all():
        acc = u.get_profile().twitter_account
        users.append('http://twitter.com/statuses/user_timeline/'+acc+'.rss')
    return users
is where I am at so far, but I'm at a complete loss as to how to parse the RSS to get their latest tweet. Can anyone help me out here? If there is a better way to do this, any suggestions are welcome! I'm sure someone will suggest using django-twitter or other such libraries, but I'd like to do this manually if possible.
Cheers
Why reinvent the wheel? You can download/install/import python-twitter and do something like:
tweet = twitter.Api().GetUserTimeline( u.get_profile().twitter_account )[0]
http://code.google.com/p/python-twitter/
an example: http://www.omh.cc/blog/2008/aug/4/adding-your-twitter-status-django-site/
RSS can be parsed by any XML parser. I've used the built-in module htmllib before for a different task and found it easy to deal with. If all you're doing is RSS parsing, though, I'd recommend feedparser. I haven't used it before, but it seems pretty straightforward.
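For instance, a minimal feedparser sketch against one of the timeline URLs built in the question (the entry attributes are feedparser's standard ones; as far as I recall, Twitter's old RSS feeds put the status text in the entry title):

import feedparser

def latest_tweet_from_rss(rss_url):
    feed = feedparser.parse(rss_url)
    if not feed.entries:
        return None
    # Entries are typically newest-first, so the first one is the latest tweet.
    return feed.entries[0].title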
If you go with python-twitter it is pretty simple. This is from memory so forgive me if I make a mistake here.
from datetime import datetime

from django.core.cache import cache
import twitter

TWITTER_USER = 'username'
TWITTER_TIMEOUT = 3600

def latest_tweet(request):
    tweet = cache.get('tweet')
    if tweet:
        return {"tweet": tweet}
    api = twitter.Api()
    tweets = api.GetUserTimeline(TWITTER_USER)
    tweet = tweets[0]
    tweet.date = datetime.strptime(
        tweet.created_at, "%a %b %d %H:%M:%S +0000 %Y"
    )
    cache.set('tweet', tweet, TWITTER_TIMEOUT)
    return {"tweet": tweet}