C++ or Python: Bruteforcing to find text in many webpages

Suppose I want to brute-force webpages:
for example, http://www.example.com/index.php?id=<1 - 99999>
and search each page to see whether it contains a certain text.
If a page contains the text, store it in a string.
I sort of got it working in Python, but it is quite slow (around 1-2 seconds per page, which would take roughly 24 hours in total). Is there a better solution? I am thinking of using C/C++, because I heard Python is not very efficient. On second thought, though, it might not be Python's efficiency but rather the efficiency of how I access the HTML (I convert the whole HTML into text and then search it, and the content is quite long).
So how can I improve the speed of this brute-forcing?

Most likely your problem has nothing to do with your ability to quickly parse HTML and everything to do with the latency of page retrieval and blocking on sequential tasks.
1-2 seconds is a reasonable amount of time to retrieve a page. You should be able to find text on the page orders of magnitude faster. However, if you are processing pages one at a time you are blocked waiting for a response from a web server while you could be finding your results. You could instead retrieve multiple pages at once via worker processes and wait only for their output.
The following code has been modified from Python's multiprocessing docs to fit your problem a little more closely.
import urllib
from multiprocessing import Process, Queue

def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = func(*args)
        output.put(result)

def find_on_page(num):
    uri = 'http://www.example.com/index.php?id=%d' % num
    f = urllib.urlopen(uri)
    data = f.read()
    f.close()
    index = data.find('datahere:')  # obviously use your own methods
    if index < 0:
        return None
    else:
        return data[index:index+20]

def main():
    NUM_PROCESSES = 4
    tasks = [(find_on_page, (i,)) for i in range(99999)]
    task_queue = Queue()
    done_queue = Queue()
    for task in tasks:
        task_queue.put(task)
    for i in range(NUM_PROCESSES):
        Process(target=worker, args=(task_queue, done_queue)).start()
    for i in range(99999):
        print done_queue.get()
    for i in range(NUM_PROCESSES):
        task_queue.put('STOP')

if __name__ == "__main__":
    main()
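If you are on Python 3, a similar effect can be had with a thread pool, since the work is I/O-bound. The following is only a sketch, not part of the original answer; the URL pattern and search string are taken from the code above.

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def find_on_page(num):
    # fetch one page and return a snippet around the marker, or None
    uri = 'http://www.example.com/index.php?id=%d' % num
    data = urlopen(uri).read().decode('utf-8', errors='replace')
    index = data.find('datahere:')  # substitute your own search
    return data[index:index + 20] if index >= 0 else None

def main():
    # 20 concurrent fetches; tune this to what the server tolerates
    with ThreadPoolExecutor(max_workers=20) as pool:
        for result in pool.map(find_on_page, range(1, 100000)):
            if result is not None:
                print(result)

if __name__ == '__main__':
    main()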

Have you verified that the parsing is actually the bottleneck of your algorithm, and not the HTTP request/response round trip?
I don't think the kind of efficiency that C/C++ gives you is what you are looking for here.

Related

Cloud Dataflow performance issue while writing to BigQuery

I am trying to read a CSV file and write it to BigQuery with Cloud Dataflow (Beam Python SDK).
It is taking almost 30 minutes to read and write 20 million records (~80 MB).
Looking at the Dataflow DAG I can see that most of the time is spent converting each CSV line to a BQ Row.
Below is the code snippet doing that:
beam.Map(lambda s: data_ingestion.parse_record_string(s, data_ingestion.stg_schema_dict))

def parse_record_string(self, string_input, schema_dict):
    imm_input = string_input.split(',')   # assumed: the CSV line split into fields
    row_dict = {}                         # assumed: result dict initialised here
    for idx, (x, key) in enumerate(zip(imm_input, schema_dict)):
        key = key.strip()
        datatype = schema_dict[key].strip()
        if key == 'HASH_ID' and datatype != 'STRING':
            hash_id = hash(''.join(imm_input[1:idx]))
            row_dict[key] = hash_id
        else:
            if x:
                x = x.decode('utf-8').strip()
                row_dict[key] = x
            else:
                row_dict[key] = None
                #row_dict[key] = ''
    return row_dict
Apart from the Map transform, I have also tried ParDo and FlatMap. All of them produce the same result.
Please suggest any possible tuning to reduce the time.
Thanks in advance.
Your code is compute intensive when you look at it closely. For each of your 20M lines, you perform:
- a for loop (how many elements are there on each line?)
- a zip and an enumerate
- and for each element in that loop:
  - two strips (each of which loops over the string to remove whitespace)
  - a join on a slice (two more loops) -> how often is that condition true?
  - or another strip in the other case
Python is wonderful and very convenient, with many helpers. However, beware of the trap of that convenience and evaluate the complexity of your algorithm carefully (see the sketch below).
If you know Java, try it out. It could be far more efficient.
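To make that concrete, here is a hedged sketch (not from the original answer) of how the repeated strip calls can be hoisted out of the per-record path by cleaning the schema once before the Map; the names mirror the question's code, and the split on ',' is an assumption.

def build_clean_schema(schema_dict):
    # strip keys and datatypes once, outside the per-record loop
    return [(key.strip(), datatype.strip()) for key, datatype in schema_dict.items()]

def parse_record_string(string_input, clean_schema):
    imm_input = string_input.split(',')   # assuming comma-separated lines
    row_dict = {}
    for idx, (x, (key, datatype)) in enumerate(zip(imm_input, clean_schema)):
        if key == 'HASH_ID' and datatype != 'STRING':
            row_dict[key] = hash(''.join(imm_input[1:idx]))
        elif x:
            row_dict[key] = x.strip()
        else:
            row_dict[key] = None
    return row_dict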

Python 2.7 Addition to dict is too slow

I'm working on a Python script. It takes information about subscribers' traffic from files and puts it into special structures. It works, but it works too slowly. I wrote the same algorithm in PHP and it runs much faster. I noticed that Python spends a lot of time putting the data into a dict. The PHP script takes 6 seconds to process my test file, but the Python script takes 12 seconds (about 7 seconds to get the data from the file and 5 seconds to fill the structures).
My structures look like this: struct[subscriberId][protocolId] = octents
And I use the following function to fill them:
def addBytesToStatStruct(struct, subscriberId, protocolId, octents):
    if subscriberId in struct:
        if protocolId in struct[subscriberId]:
            struct[subscriberId][protocolId] += octents
            return
        else:
            struct[subscriberId][protocolId] = octents
            return
    else:
        struct[subscriberId] = {protocolId : octents}
Maybe I am doing something wrong? I suppose the problem comes from collisions happening during insertion. As far as I know, PHP uses chaining while Python uses open addressing. Could you give me a hint on how to make the Python dict faster?
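For reference, here is a minimal sketch (not an answer from the original thread) of the same accumulation using collections.defaultdict, which removes the explicit membership checks:

from collections import defaultdict

# struct[subscriberId][protocolId] accumulates octets; missing keys start at 0
struct = defaultdict(lambda: defaultdict(int))

def add_bytes(struct, subscriber_id, protocol_id, octets):
    struct[subscriber_id][protocol_id] += octets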

Django query using large amount of memory

I have a query that is causing memory spikes in my application. The code below is designed to show a single record, but occasionally shows 5 to 10 records. The problem is that there are edge cases where the query matches 100,000 records, so MultipleObjectsReturned is raised and the fallback query returns all of them. I believe this causes the high memory usage. The code is:
try:
    record = record_class.objects.get(**filter_params)
    context["record"] = record
except record_class.MultipleObjectsReturned:
    records = record_class.objects.filter(**filter_params)
    template_path = "record/%(type)s/%(type)s_multiple.html" % {"type": record_type}
    return render(request, template_path, {"records": records}, current_app=record_type)
I thought about adding a slice at the end of the filter query, so it looks like this:
records = record_class.objects.filter(**filter_params)[:20]
But the code still seems slow. Is there a way to limit the results to 20 in a way that does not load the entire query or cause high memory usage?
As the Django documentation says:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
For example, this returns the first 5 objects (LIMIT 5):
Entry.objects.all()[:5]
So it seems that "limiting the results to 20 in a way that does not load the entire query" is already fulfilled.
So your code is slow for some other reason, or maybe you are measuring the time in the wrong way.
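One quick way to check that the slice really becomes a LIMIT is to print the generated SQL (a sketch using the queryset from the question; the exact SQL depends on your database backend):

records = record_class.objects.filter(**filter_params)[:20]
print(records.query)   # the generated SQL should end with LIMIT 20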

Writing to Google Spreadsheet API Extremely Slow

I am trying to write data from here (http://acleddata.com/api/acled/read) to Google Sheets via its API. I'm using the gspread package to help.
Here is the code:
import requests
import gspread
from oauth2client.service_account import ServiceAccountCredentials

r = requests.get("http://acleddata.com/api/acled/read")
data = r.json()
data = data['data']

scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
gc = gspread.authorize(credentials)

# 'sheet' is the worksheet opened elsewhere in the original script
for row in data:
    sheet.append_row(row.values())
The data is a list of dictionaries, each dictionary representing a row in a spreadsheet. This is writing to my Google Sheet but it is unusably slow. It took easily 40 minutes to write a hundred rows, and then I interrupted the script.
Is there anything I can do to speed up this process?
Thanks!
Based on your code, you're using the older V3 Google Data API. For better performance, switch to the V4 API. A migration guide is available here.
Here is the faster solution:
# numberToLetters is a helper from the linked article that converts a column number to its letter
cell_list = sheet.range('A2:' + numberToLetters(num_columns) + str(num_lines + 1))

for cell in cell_list:
    val = df.iloc[cell.row - 2, cell.col - 1]
    if type(val) is str:
        val = val.decode('utf-8')
    elif isinstance(val, (int, long, float, complex)):
        val = int(round(val))
    cell.value = val

sheet.update_cells(cell_list)
This is derived from here https://www.dataiku.com/learn/guide/code/python/export-a-dataset-to-google-spreadsheets.html
I believe the key change here is that this solution builds a cell_list covering the whole range and writes it with a single update_cells call, i.e. one API request instead of one per row.
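Applied to the question's data, the same batching idea can be written as a single call. This is only a sketch and assumes a recent gspread version that provides worksheet.append_rows:

# build all rows locally, then write them in one API request instead of one per row
rows = [list(row.values()) for row in data]          # data is the list of dicts from the API
sheet.append_rows(rows, value_input_option='RAW')    # available in newer gspread releases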
Based on this thread, the Google Spreadsheets API can be pretty slow depending on many factors, including your connection speed to Google's servers, proxy usage, etc. Avoid calling gspread.login inside a loop, because this method is slow.
...get_all_records came to my rescue; it is much faster than range for reading an entire sheet.
I have also read in this forum that it depends on the size of the worksheet, so as the number of rows in the worksheet increases, the program runs even slower.
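For the reading side mentioned above, a one-call sketch (assuming the same gspread worksheet object):

records = sheet.get_all_records()   # fetches the whole sheet in a single API call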

Processing web feed multiple times a day

OK, here is the deal in brief: I spider the web (all kinds of data: blogs/news/forums) as it appears on the internet. Then I process this feed and run analysis on the processed data. Spidering is not a big deal; I can get the data pretty much in real time as the internet produces it. Processing is the bottleneck; it involves some computationally heavy algorithms.
I am trying to build a strategy for scheduling my spiders. The big goal is to make sure that the analysis produced at the end reflects the effect of as much recent input as possible. Thinking about it, the obvious objective is to make sure data does not pile up: I get the data through spiders, pass it on to the processing code, wait until processing is over, and then spider more, this time bringing in all the data that appeared while I was waiting for processing to finish. OK, this is a very broad thought.
Can some of you share your thoughts, maybe think out loud? If you were me, what would go through your mind? I hope my question makes sense. This is not search-engine indexing, by the way.
It appears that you want to keep the processors from falling too far behind the spiders. I would imagine that you want to be able to scale this out as well.
My recommendation is that you implement a queue using a client/server SQL database. MySQL would work nicely for this purpose.
Design Objectives
Keep the spiders from getting too far ahead of the processors
Allow for a balance of power between spiders and processors (keeping each busy)
Keep data as fresh as possible
Scale out and up as needed
Queue:
Create a queue to store the data from the spiders before it is processed. This could be done in several ways, but it does not sound like IO is your bottleneck.
A simple approach would be to have an SQL table with this layout:
TABLE Queue
    Queue_ID    int unsigned not null auto_increment primary key
    CreateDate  datetime not null
    Status      enum('New', 'Processing')
    Data        blob not null
# pseudo code
function get_from_queue()
    # in SQL
    START TRANSACTION;
    SELECT Queue_ID, Data FROM Queue WHERE Status = 'New' LIMIT 1 FOR UPDATE;
    UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = (from above);
    COMMIT;
    # end sql
    return Data  # or false in the case of no records found

# pseudo code
function count_from_queue()
    # in SQL
    SELECT COUNT(*) FROM Queue WHERE Status = 'New';
    # end sql
    return (the count)
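For concreteness, here is a minimal Python sketch of get_from_queue() (not from the original answer), assuming the Queue table above and the mysql-connector-python driver; the connection handling is a placeholder.

import mysql.connector

def get_from_queue(conn):
    # atomically claim one 'New' item and return its Data, or None if the queue is empty
    cursor = conn.cursor()
    try:
        conn.start_transaction()
        cursor.execute(
            "SELECT Queue_ID, Data FROM Queue WHERE Status = 'New' LIMIT 1 FOR UPDATE")
        row = cursor.fetchone()
        if row is None:
            conn.rollback()
            return None
        queue_id, data = row
        cursor.execute(
            "UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = %s", (queue_id,))
        conn.commit()
        return data
    finally:
        cursor.close()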
Spider:
So you have multiple spider processes. Each of them does:

if count_from_queue() < 10:
    # do the spider thing
    # save the result in the queue
else:
    # sleep a while

repeat

In this way, each spider is either resting or spidering. The decision (in this case) is based on whether there are fewer than 10 pending items left to process. You would tune this to your purposes.
Processor:
So you have multiple processor processes. Each of them does:

Data = get_from_queue()
if Data:
    # process it
    # remove it from the queue
else:
    # sleep a while

repeat

In this way, each processor is either resting or processing.
In summary:
Whether you have this running on one computer or 20, a queue will provide the control you need to ensure that all the parts stay in sync and don't get too far ahead of each other.