Python 2.7: Addition to dict is too slow

I'm working on a Python script. It takes information about subscribers' traffic from files and puts it into special structures. It works, but it works too slowly. I've written the same algorithm in PHP and it runs much faster. I noticed that Python spends a lot of time putting the data into dicts. The PHP script takes 6 seconds to process my test file, but the Python script takes 12 seconds (about 7 seconds to get the data from the file and 5 seconds to fill the structures).
My structures look like this: struct[subscriberId][protocolId] = octents
And I use the following function to fill them:
def addBytesToStatStruct(struct, subscriberId, protocolId, octents):
    if subscriberId in struct:
        if protocolId in struct[subscriberId]:
            struct[subscriberId][protocolId] += octents
            return
        else:
            struct[subscriberId][protocolId] = octents
            return
    else:
        struct[subscriberId] = {protocolId : octents}
Maybe I'm doing something wrong? I suppose my problem comes from collisions that happen during insertion. As far as I know, PHP uses chaining while Python uses open addressing. Could you give me a hint on how I can make the Python dict faster?
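One thing worth trying, independent of how the dicts hash internally: collections.defaultdict removes the nested membership tests, so each addition becomes a single lookup plus an in-place add. A minimal sketch, assuming the same struct layout and argument names as above:

from collections import defaultdict

# struct maps subscriberId -> (protocolId -> octents)
struct = defaultdict(lambda: defaultdict(int))

def addBytesToStatStruct(struct, subscriberId, protocolId, octents):
    # missing keys are created on the fly with a default of 0
    struct[subscriberId][protocolId] += octents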

Related

Cloud Dataflow performance issue while writing to BigQuery

I am trying to read CSV data and write it to BigQuery with Cloud Dataflow (Beam Python SDK).
It is taking almost 30 minutes to read and write 20 million records (~80 MB).
Looking at the Dataflow DAG I can see that most of the time is spent converting each CSV line to a BQ row.
Below is the code snippet doing that:
beam.Map(lambda s: data_ingestion.parse_record_string(s, data_ingestion.stg_schema_dict))

def parse_record_string(self, string_input, schema_dict):
    imm_input = string_input.split(',')  # not shown in the original snippet; assumed CSV split
    row_dict = {}                        # not shown in the original snippet; assumed
    for idx, (x, key) in enumerate(zip(imm_input, schema_dict)):
        key = key.strip()
        datatype = schema_dict[key].strip()
        if key == 'HASH_ID' and datatype != 'STRING':
            hash_id = hash(''.join(imm_input[1:idx]))
            row_dict[key] = hash_id
        else:
            if x:
                x = x.decode('utf-8').strip()
                row_dict[key] = x
            else:
                row_dict[key] = None
                #row_dict[key] = ''
    return row_dict
Apart from the Map transform, I have also tried ParDo and FlatMap. All of them produce the same result.
Please suggest any possible tuning to reduce the time.
Thanks in advance
Your code is compute intensive when you look at it. For each of your 20M lines, you perform:
A for loop (how many elements are on each line?)
A zip and an enumerate
And on each element of the loop:
2 strips (each of which loops over the string to remove whitespace)
A join on a slice (which is 2 more loops) -> how often is that condition true?
Another strip in the other case
Python is wonderful and very convenient, with many helpers. However, take care not to fall into the trap of that easiness, and evaluate the complexity of your algorithm correctly.
If you know Java, try it out. It could be far more efficient.
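To make that concrete, here is a minimal sketch of how the per-row work could be reduced: strip the schema keys and datatypes once, outside the per-record function, instead of on every field of every record, and drop the extra strip/decode where it is not needed. The names mirror the question's snippet; everything not shown there (build_clean_schema, the split on ',') is illustrative.

# Illustrative sketch: do the per-key cleanup once, not once per field of every record.
def build_clean_schema(schema_dict):
    # list of (key, datatype) pairs, both already stripped
    return [(key.strip(), datatype.strip()) for key, datatype in schema_dict.items()]

def parse_record_string(string_input, clean_schema):
    fields = string_input.split(',')
    row_dict = {}
    for idx, ((key, datatype), x) in enumerate(zip(clean_schema, fields)):
        if key == 'HASH_ID' and datatype != 'STRING':
            row_dict[key] = hash(''.join(fields[1:idx]))
        elif x:
            # .decode('utf-8') omitted for brevity; add it back if the input is byte strings
            row_dict[key] = x.strip()
        else:
            row_dict[key] = None
    return row_dict

# clean_schema would be computed once and passed in, e.g.:
# clean_schema = build_clean_schema(data_ingestion.stg_schema_dict)
# beam.Map(lambda s: parse_record_string(s, clean_schema))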

Consolidate file write and read together

I am writing a Python script to write data to the Vertica DB. I use the official library vertica_db_client. If I use the built-in cur.executemany method, for some reason it takes a long time to complete (40+ seconds per 1k entries). The recommendation I got was to first save the data to a file, then use the COPY method. Here is the save-to-a-CSV-file part:
with open('/data/dscp.csv', 'w') as out:
    csv_out = csv.writer(out)
    # adds the title line
    csv_out.writerow(("time_stamp", "subscriber", "ip_address", "remote_address", "signature_service_name", "dscp_out", "bytes_in", "bytes_out"))
    for row in data:
        csv_out.writerow(row)
My data is a list of tuples, for example:
[
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212),
    ('2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291)
]
Then, in order to use the COPY method, I have to read the file back first and then do COPY FROM STDIN (at least based on their instructions: https://www.vertica.com/docs/9.1.x/HTML/python_client/loadingdata_copystdin.html). Here is my code:
f = open("/data/dscp.csv", "r")
cur.stdin = f
cur.execute("""COPY pason.dscp FROM STDIN DELIMITER ','""")
Here is the code for connecting to the DB, in case it is relevant to the problem:
import vertica_db_client
user = 'dbadmin'
pwd = 'xxx'
database = 'xxx'
host = 'xxx'
db = vertica_db_client.connect(database=database, user=user, password=pwd, host=host)
cur = db.cursor()
So it is clearly a waste of effort to first write the file and then read it back... What is the best way to consolidate the writing and reading parts?
If anyone can tell me why my executemany was slow, that would be equally helpful!
Thanks!
First of all, yes, writing the data to a file first is both the recommended way and the most efficient way. It may seem wasteful at first, but writing the data to a file on disk takes next to no time at all, whereas Vertica is not optimized for many individual INSERT statements. Bulk loading is the fastest way to get large amounts of data into Vertica. Not only that, but with many individual INSERT statements you could potentially run into ROS pushback issues, and even if you don't, there will be extra load on the database when the ROS containers are merged after the load.
You could convert your array of tuples to one large string variable and then print the string to the console.
The string would look something like:
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.11', 'SIP', 26, 2911, 4452
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.132', '10.135.3.21', 'SIP', 26, 4270, 5212
'2019-02-13 10:00:00', '09d5e206-daba-11e7-b122-00c03aaf89d2', '10.128.67.129', '18.215.140.51', 'HTTP2 over TLS', 0, 14378, 5291
But instead of actually printing it to the console, you could just pipe it into a VSQL command.
$ python my_script.py | vsql -U dbadmin -d xxx -h xxx -c "COPY pason.dscp FROM STDIN DELIMITER ','"
This may not be efficient, though. I don't have much experience with exceedingly long string variables in Python.
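For illustration, here is a minimal sketch of the Python side of that pipe. It just writes comma-delimited lines to stdout (quoting is omitted for brevity); data is assumed to be the list of tuples from the question.

# my_script.py -- hypothetical sketch: emit the rows as comma-delimited lines
# so they can be piped straight into vsql's COPY ... FROM STDIN.
for row in data:
    print(','.join(str(value) for value in row))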
Secondly, vertica_db_client is no longer being actively developed by Vertica. While it will still be supported at least until the Python 2 end of life, you should be using vertica_python.
You can install vertica_python with pip.
$ pip install vertica_python
or
$ pip3 install vertica_python
depending on which version of Python you want to use it with.
You can also build it from source; the code can be found on Vertica's GitHub page: https://github.com/vertica/vertica-python/
As for using the COPY command with vertica_python, see the answer in this question here: Import Data to SQL using Python
I have used several Python libraries to connect to Vertica, and vertica_python is by far my favorite; ever since Vertica took over the development from Uber, it has continued to improve on a very regular basis.
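To address the original question of consolidating the write and the read: with vertica_python you can skip the intermediate file entirely, because cursor.copy() accepts a file-like object (or a string) as the STDIN data. A minimal sketch, reusing the table name and connection details from the question; the in-memory buffer and the Python 2 csv handling are assumptions.

import csv
import io

import vertica_python

conn_info = {'host': 'xxx', 'user': 'dbadmin', 'password': 'xxx', 'database': 'xxx'}

# Build the CSV content in memory instead of on disk.
# Python 2's csv module writes byte strings, hence BytesIO; use io.StringIO on Python 3.
buf = io.BytesIO()
writer = csv.writer(buf)
for row in data:          # data is the list of tuples from the question
    writer.writerow(row)
buf.seek(0)

connection = vertica_python.connect(**conn_info)
try:
    cur = connection.cursor()
    cur.copy("COPY pason.dscp FROM STDIN DELIMITER ','", buf)
    connection.commit()
finally:
    connection.close()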

Abaqus Total of each Stress component

I have an Assembly which consists of only one Part. I'm trying to get the TOTAL of every stress component of THE WHOLE Assembly/Part within Python.
The problem with my current method is that it takes ages to sum up the stress of each element (see the code below). The report files give me the totals within a second, so there must be a better way to get these values from the odb file.
Thankful for any hint!
odb = session.openOdb(name='C:/temp/Job-1.odb')
step_1 = odb.steps['Step-1']
stress_1 = step_1.frames[-1].fieldOutputs['S']

# Step-1
sum_Sxx_1 = sum_Syy_1 = sum_Szz_1 = 0
for el in range(numElemente):
    Stress = stress_1.getSubset(region=Instance.elements[el], position=CENTROID, elementType='C3D8R').values
    sum_Sxx_1 = sum_Sxx_1 + Stress[0].data[0]
    sum_Syy_1 = sum_Syy_1 + Stress[0].data[1]
    sum_Szz_1 = sum_Szz_1 + Stress[0].data[2]
Direct access to the values from Python is indeed very slow (I've experienced the same problem). You can write a report file with every value and then work with the text file from Python again. Just read the file line by line, find the relevant lines, split them to get the stresses, sum them up and continue.
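A minimal sketch of that report-file approach is below. The file name and the column layout (element label followed by S11, S22, S33) are assumptions; adjust the indices to match the report you actually write out.

# Hypothetical report parsing: sum S11, S22, S33 from a field report (.rpt) file.
sum_Sxx = sum_Syy = sum_Szz = 0.0
with open('C:/temp/Job-1_S.rpt') as rpt:
    for line in rpt:
        parts = line.split()
        if len(parts) < 4:
            continue  # skip blank lines and short lines
        try:
            s11, s22, s33 = float(parts[1]), float(parts[2]), float(parts[3])
        except ValueError:
            continue  # skip header/footer lines that are not numeric
        sum_Sxx += s11
        sum_Syy += s22
        sum_Szz += s33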

Reading large result set from Mongo - performance issue

We are facing a performance issue when reading several thousand documents. We have a RoR application that reads data stored in Mongo; we use Mongoid. Each stored document contains 17 fields (15 floats and 2 integers). We execute a query which is supported by an index. The cursor returns very fast (<50ms), but reading all the documents takes more than 500ms.
To find the bottleneck, we ran the same query in the Mongo shell: the query took <50ms to complete and to iterate over all rows in the result set. We also tested the Mongo Ruby driver, where the query takes 250ms to complete; we get the same result with Moped. Finally, we wrote a C++ app which uses the Mongo C++ driver, and the time to iterate over the whole result set is <20ms. But if we decode the received BSON objects (to output them to the console), the time rises to 120ms.
Does extracting the data from BSON really take that much time?

C++ or Python: Bruteforcing to find text in many webpages

Suppose I want to brute-force webpages:
for example, http://www.example.com/index.php?id=<1 - 99999>
and search each page to find out whether it contains a certain text.
If a page contains the text, then store it in a string.
I kind of got it working in Python, but it is quite slow (around 1-2 seconds per page, which would take around 24 hours in total). Is there a better solution? I am thinking of using C/C++, because I heard Python is not very efficient. However, on second thought, I think it might not be the efficiency of Python, but rather the efficiency of accessing the HTML (I converted the whole HTML into text, then searched it... and the content is quite long).
So how can I improve the speed of the brute-forcing?
Most likely your problem has nothing to do with your ability to quickly parse HTML and everything to do with the latency of page retrieval and blocking on sequential tasks.
1-2 seconds is a reasonable amount of time to retrieve a page. You should be able to find text on the page orders of magnitude faster than that. However, if you process pages one at a time, you are blocked waiting for a response from a web server when you could already be finding your results. You could instead retrieve multiple pages at once via worker processes and wait only for their output.
The following code has been modified from Python's multiprocessing docs to fit your problem a little more closely.
import urllib
from multiprocessing import Process, Queue

def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = func(*args)
        output.put(result)

def find_on_page(num):
    uri = 'http://www.example.com/index.php?id=%d' % num
    f = urllib.urlopen(uri)
    data = f.read()
    f.close()
    index = data.find('datahere:') # obviously use your own methods
    if index < 0:
        return None
    else:
        return data[index:index+20]

def main():
    NUM_PROCESSES = 4
    tasks = [(find_on_page, (i,)) for i in range(99999)]
    task_queue = Queue()
    done_queue = Queue()
    for task in tasks:
        task_queue.put(task)
    for i in range(NUM_PROCESSES):
        Process(target=worker, args=(task_queue, done_queue)).start()
    for i in range(99999):
        print done_queue.get()
    for i in range(NUM_PROCESSES):
        task_queue.put('STOP')

if __name__ == "__main__":
    main()
Have you verified that the parsing part is the bottleneck of your algorithm, and not the HTTP request and response round-trip?
I don't think that the kind of efficiency C/C++ gives you is what you are looking for here.
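A quick way to answer that question is to time the two phases separately for a single page. A minimal sketch; the URL and the search string are placeholders, as in the code above.

import time
import urllib

uri = 'http://www.example.com/index.php?id=1'

t0 = time.time()
data = urllib.urlopen(uri).read()   # network: fetch the page
t1 = time.time()
found = 'datahere:' in data         # CPU: search the page text
t2 = time.time()

print('fetch: %.3fs  search: %.6fs  found: %s' % (t1 - t0, t2 - t1, found))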