Cloud Dataflow performance issue while writing to BigQuery - python-2.7

I am trying to read a CSV file and write it to BigQuery with Cloud Dataflow (Beam Python SDK).
It is taking almost 30 minutes to read and write 20 million records (~80 MB).
Looking at the Dataflow DAG, I can see that most of the time is spent converting each CSV line to a BQ row.
Below is the code snippet that does this:
beam.Map(lambda s: data_ingestion.parse_record_string(s, data_ingestion.stg_schema_dict))

def parse_record_string(self, string_input, schema_dict):
    # Note: the creation of imm_input (the CSV line split into its fields) and of
    # row_dict (the output dictionary) is not shown in the original snippet.
    for idx, (x, key) in enumerate(zip(imm_input, schema_dict)):
        key = key.strip()
        datatype = schema_dict[key].strip()
        if key == 'HASH_ID' and datatype != 'STRING':
            hash_id = hash(''.join(imm_input[1:idx]))
            row_dict[key] = hash_id
        else:
            if x:
                x = x.decode('utf-8').strip()
                row_dict[key] = x
            else:
                row_dict[key] = None
                #row_dict[key] = ''
    return row_dict
Apart from the Map transform, I have also tried ParDo and FlatMap; all of them produce the same result.
Please suggest any possible tuning to reduce the time.
Thanks in advance.

Your code is compute intensive when you look at it. For each of your 20M lines, you perform:
- a for loop (how many elements are on each line?)
- a zip and an enumerate
- and, on each element of the loop:
  - two strips (each strip loops over the string to remove whitespace)
  - a join on a slice (two more loops) when the HASH_ID condition holds -> how often is that condition true?
  - another strip in the other branch
Python is wonderful and comes with many convenient helpers. However, beware of the trap of that convenience and evaluate the complexity of your algorithm correctly.
If you know Java, try it out; it could be far more efficient for this kind of per-element work.
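As an illustration only (a sketch based on the snippet above, not the asker's actual pipeline), one way to cut the per-record work is to strip the schema keys and datatypes once, outside the per-element function, assuming the schema preserves column order and the input is a comma-delimited CSV line:

clean_schema = [(k.strip(), schema_dict[k].strip()) for k in schema_dict]  # done once, not per record

def parse_record_string(line, clean_schema):
    fields = line.split(',')  # assumption: comma-delimited CSV line
    row = {}
    for idx, (value, (key, datatype)) in enumerate(zip(fields, clean_schema)):
        if key == 'HASH_ID' and datatype != 'STRING':
            row[key] = hash(''.join(fields[1:idx]))
        else:
            row[key] = value.decode('utf-8').strip() if value else None
    return row

With this, the strips on the keys and datatypes happen once per pipeline instead of once per record; the remaining cost is dominated by the per-field decode/strip and the HASH_ID join.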

Related

Python Bigtable client spends ages on deepcopy

I have a Python Kafka consumer sometimes reading extremely slowly from Bigtable. It reads a row from Bigtable, performs some calculations and occasionally writes some information back, then moves on.
The issue is that from a 1 vCPU VM in GCE it reads/writes extremely fast, the consumer chewing through 100-150 messages/s. No problem there.
However, when deployed on the production Kubernetes cluster (GKE), which is multi-zonal (europe-west1-b/c/d), it goes through something like 0.5 messages/s. Yes - 2s per message.
Bigtable is in europe-west1-d, but pods scheduled on nodes in the same zone (d) have the same performance as pods on nodes in other zones, which is weird.
The pod is hitting the CPU limit (1 vCPU) constantly. Profiling the program shows that most of the time (95%) is spent inside the PartialRowData.cells() function, in copy.py:132 (deepcopy).
It uses the newest google-cloud-bigtable==0.29.0 package.
Now, I understand that the package is in alpha, but what is the factor that so dramatically reduces the performance by 300x?
The piece of code that reads the row data is this:
def _row_to_dict(cls, row):
    if row is None:
        return {}
    item_dict = {}
    if COLUMN_FAMILY in row.cells:
        structured_cells = {}
        for field_name in STRUCTURED_STATIC_FIELDS:
            if field_name.encode() in row.cells[COLUMN_FAMILY]:
                structured_cells[field_name] = row.cells[COLUMN_FAMILY][field_name.encode()][0].value.decode()
        item_dict[COLUMN_FAMILY] = structured_cells
    return item_dict
where the row passed in is from
row = self.bt_table.read_row(row_key, filter_=filter_)
and there might be about 50 STRUCTURED_STATIC_FIELDS.
Is the deepcopy really just taking ages to copy? Or is it waiting for data transfer from Bigtable? Am I misusing the library somehow?
Any pointers on how to improve the performance?
Thanks a lot in advance.
It turns out the library defines the getter for row.cells as:
@property
def cells(self):
    """Property returning all the cells accumulated on this partial row.

    :rtype: dict
    :returns: Dictionary of the :class:`Cell` objects accumulated. This
              dictionary has two-levels of keys (first for column families
              and second for column names/qualifiers within a family). For
              a given column, a list of :class:`Cell` objects is stored.
    """
    return copy.deepcopy(self._cells)
So each access to row.cells was performing a deepcopy of the entire cell dictionary, in addition to the lookup.
Adding a
row_cells = row.cells
and subsequently only referring to that fixed the issue.
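For illustration, here is the reader from the question with the cells dictionary cached once (a sketch reusing the same names as the snippet above):

def _row_to_dict(cls, row):
    if row is None:
        return {}
    item_dict = {}
    row_cells = row.cells  # the deepcopy happens once here, not on every access
    if COLUMN_FAMILY in row_cells:
        family_cells = row_cells[COLUMN_FAMILY]
        structured_cells = {}
        for field_name in STRUCTURED_STATIC_FIELDS:
            key = field_name.encode()
            if key in family_cells:
                structured_cells[field_name] = family_cells[key][0].value.decode()
        item_dict[COLUMN_FAMILY] = structured_cells
    return item_dict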
The dev/prod performance difference was also amplified by the fact that the prod table already had many more timestamps/versions of the cells, whereas the dev table had only a couple. This made the returned dictionaries that had to be deep-copied much larger.
Chaining the existing filters with CellsColumnLimitFilter helped even further:
filter_ = RowFilterChain(filters=[filter_, CellsColumnLimitFilter(num_cells=1)])
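For completeness, the filter classes used above come from the client's row_filters module (module path taken from the current google-cloud-bigtable client; the 0.29.0 layout may differ slightly):

from google.cloud.bigtable.row_filters import RowFilterChain, CellsColumnLimitFilter

# Keep only the most recent cell per column, so far less data has to be
# deep-copied on the client side.
filter_ = RowFilterChain(filters=[filter_, CellsColumnLimitFilter(num_cells=1)])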

Django query using large amount of memory

I have a query that is causing memory spikes in my application. The code below is designed to show a single record, but occasionally shows 5 to 10 records. The problem is that there are edge cases where the filter matches 100,000 records, so MultipleObjectsReturned is raised and the fallback queryset holds all of them. I believe this causes the high memory usage. The code is:
try:
    record = record_class.objects.get(**filter_params)
    context["record"] = record
except record_class.MultipleObjectsReturned:
    records = record_class.objects.filter(**filter_params)
    template_path = "record/%(type)s/%(type)s_multiple.html" % {"type": record_type}
    return render(request, template_path, {"records": records}, current_app=record_type)
I thought about adding a slice at the end of the filter query, so it looks like this:
records = record_class.objects.filter(**filter_params)[:20]
But the code still seems slow. Is there a way to limit the results to 20 in a way that does not load the entire query or cause high memory usage?
As the Django documentation says:
Use a subset of Python’s array-slicing syntax to limit your QuerySet to a certain number of results. This is the equivalent of SQL’s LIMIT and OFFSET clauses.
For example, this returns the first 5 objects (LIMIT 5):
Entry.objects.all()[:5]
So it seems that "limiting the results to 20 in a way that does not load the entire query" is already being fulfilled by the slice.
So your code is slow for some other reason, or maybe you are measuring the time in the wrong way.
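If you want to verify that the slice really becomes a SQL LIMIT rather than loading everything, one quick check (a sketch, not part of the original answer) is to look at the SQL the QuerySet will run:

records = record_class.objects.filter(**filter_params)[:20]

# QuerySets are lazy: no SQL runs until the queryset is iterated.
# The compiled query already contains the LIMIT that will be sent to the database.
print(records.query)   # ... FROM ... WHERE ... LIMIT 20

# Evaluation happens here, or when the template loops over "records".
records = list(records)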

Writing to Google Spreadsheet API Extremely Slow

I am trying to write data from here (http://acleddata.com/api/acled/read) to Google Sheets via its API. I'm using the gspread package to help.
Here is the code:
import requests
import gspread
from oauth2client.service_account import ServiceAccountCredentials

r = requests.get("http://acleddata.com/api/acled/read")
data = r.json()
data = data['data']

scope = ['https://spreadsheets.google.com/feeds']
credentials = ServiceAccountCredentials.from_json_keyfile_name('credentials.json', scope)
gc = gspread.authorize(credentials)

# (opening the worksheet into `sheet` is not shown in the question)
for row in data:
    sheet.append_row(row.values())
The data is a list of dictionaries, each dictionary representing a row in a spreadsheet. This is writing to my Google Sheet but it is unusably slow. It took easily 40 minutes to write a hundred rows, and then I interrupted the script.
Is there anything I can do to speed up this process?
Thanks!
Based on your code, you're using the older V3 Google Data API. For better performance, switch to the V4 API. A migration guide is available here.
Here is the faster solution:
cell_list = sheet.range('A2:'+numberToLetters(num_columns)+str(num_lines+1))
for cell in cell_list:
val = df.iloc[cell.row-2, cell.col-1]
if type(val) is str:
val = val.decode('utf-8')
elif isinstance(val,(int, long, float, complex)):
val= int(round(val))
cell.value = val
sheet.update_cells(cell_list)
This is derived from https://www.dataiku.com/learn/guide/code/python/export-a-dataset-to-google-spreadsheets.html
I believe the key change is that this solution builds a cell_list and writes it with a single update_cells call, so the whole block needs only one API request instead of one per row.
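A minimal sketch of applying the same batching idea directly to the asker's list of dicts (assuming `sheet` is an opened gspread worksheet, all dicts share the same keys, and the worksheet is large enough to hold the block):

from gspread.utils import rowcol_to_a1

headers = list(data[0].keys())
num_rows = len(data)
num_cols = len(headers)

# One range covering the whole block, starting at A2.
cell_list = sheet.range('A2:' + rowcol_to_a1(num_rows + 1, num_cols))

for cell in cell_list:
    cell.value = data[cell.row - 2][headers[cell.col - 1]]

sheet.update_cells(cell_list)  # a single API request for the whole block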
Based on this thread, the Google Spreadsheets API can be pretty slow depending on many factors, including your connection speed to the Google servers, use of a proxy, etc. Avoid calling gspread.login inside a loop, because this method is slow.
...get_all_records came to my rescue; it is much faster than range for reading an entire sheet.
I have also read in this forum that it depends on the size of the worksheet: as the number of rows in the worksheet grows, the program runs even more slowly.

Proper Python data structure for real-time analysis?

Community,
Objective: I'm running a Raspberry Pi project (in Python) that communicates with an Arduino to get data from a load cell once a second. What data structure should I use to log this data in Python and do real-time analysis on it?
I want to be able to do things like:
Slice the data to get the value of the last logged datapoint.
Slice the data to get the mean of the datapoints for the last n seconds.
Perform a regression on the last n data points to get g/s.
Remove from the log data points older than n seconds.
Current Attempts:
Dictionaries: I have appended a new key with a rounded time to a dictionary (see below), but this makes slicing and analysis hard.
log = {}

def log_data():
    log[round(time.time(), 4)] = read_data()
Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow their solution (i.e. storing in a dictionary and df.append()-ing in bulk every few seconds) because I want my rate calculations (regressions) to be in real time.
This question (ECG Data Analysis on a real-time signal in Python) seems to have the same problem as I did, but with no real solutions.
Goal:
So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something everyone would need to do, so I imagine there has to be pre-built functionality for this?
Thanks,
Michael
To start, I would question two assumptions:
You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?
Anyway, here's a very naive approach using a list. It satisfies your needs. Performance may become a problem depending on how many of the previous data points you need to store.
Also, you may not have thought of this, but do you need the full record of past data? Or can you just drop stuff?
data = []

# new data comes in
new_observation = (timestamp, value)
data.append(new_observation)

# Slice the data to get the value of the last logged datapoint.
data[-1]

# Slice the data to get the mean of the datapoints for the last n seconds.
# ('mean' is a placeholder here, e.g. numpy.mean)
mean(map(lambda x: x[1], filter(lambda o: current_time - o[0] < n, data)))

# Perform a regression on the last n data points to get g/s.
# ('regression_function' is a placeholder for whatever fit you use, e.g. numpy.polyfit)
regression_function(data[-n:])

# Remove from the log data points older than n seconds.
data = filter(lambda o: current_time - o[0] < n, data)
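If you can indeed just drop old data (the question raised above), a bounded collections.deque is a simple variant of the same idea -- a sketch, not something from the original answer:

from collections import deque

# Keep at most the last 600 observations (10 minutes at one sample per second).
data = deque(maxlen=600)

def log_data(timestamp, value):
    # Once maxlen is reached, the oldest observation is dropped automatically.
    data.append((timestamp, value))

def mean_last_n_seconds(current_time, n):
    # Mean of the values recorded in the last n seconds.
    values = [v for (t, v) in data if current_time - t < n]
    return sum(values) / float(len(values)) if values else None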

C++ or Python: Bruteforcing to find text in many webpages

Suppose I want to bruteforce webpages:
for example, http://www.example.com/index.php?id=<1 - 99999>
and search each page to find whether it contains a certain text.
If a page contains the text, then store it in a string.
I kind of got it working in Python, but it is quite slow (around 1-2 seconds per page, which would take around 24 hours for the whole range). Is there a better solution? I am thinking of using C/C++, because I heard Python is not very efficient. However, on second thought, I think it might not be the efficiency of Python, but rather the efficiency of accessing the HTML (I convert the whole HTML into text and then search it, and the content is quite long).
So how can I improve the speed of bruteforcing?
Most likely your problem has nothing to do with your ability to quickly parse HTML and everything to do with the latency of page retrieval and blocking on sequential tasks.
1-2 seconds is a reasonable amount of time to retrieve a page. You should be able to find text on the page orders of magnitude faster. However, if you are processing pages one at a time you are blocked waiting for a response from a web server while you could be finding your results. You could instead retrieve multiple pages at once via worker processes and wait only for their output.
The following code has been modified from Python's multiprocessing docs to fit your problem a little more closely.
import urllib
from multiprocessing import Process, Queue

def worker(input, output):
    for func, args in iter(input.get, 'STOP'):
        result = func(*args)
        output.put(result)

def find_on_page(num):
    uri = 'http://www.example.com/index.php?id=%d' % num
    f = urllib.urlopen(uri)
    data = f.read()
    f.close()
    index = data.find('datahere:')  # obviously use your own methods
    if index < 0:
        return None
    else:
        return data[index:index + 20]

def main():
    NUM_PROCESSES = 4
    tasks = [(find_on_page, (i,)) for i in range(99999)]
    task_queue = Queue()
    done_queue = Queue()
    for task in tasks:
        task_queue.put(task)
    for i in range(NUM_PROCESSES):
        Process(target=worker, args=(task_queue, done_queue)).start()
    for i in range(99999):
        print done_queue.get()
    for i in range(NUM_PROCESSES):
        task_queue.put('STOP')

if __name__ == "__main__":
    main()
Have you verified that the parsing part is actually the bottleneck of your algorithm, and not the HTTP request/response round trip?
I don't think the kind of efficiency that C/C++ gives you is what you are looking for here.
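A quick way to check that, as a sketch (not from the original answers), is to time the download and the text search separately for a handful of pages:

import time
import urllib

for page_id in range(1, 6):
    uri = 'http://www.example.com/index.php?id=%d' % page_id

    t0 = time.time()
    data = urllib.urlopen(uri).read()    # download
    t1 = time.time()
    found = data.find('datahere:') >= 0  # same placeholder marker as the example above
    t2 = time.time()

    print 'id=%d fetch=%.3fs search=%.3fs found=%s' % (page_id, t1 - t0, t2 - t1, found)

If the fetch time dominates, parallelizing the downloads (as in the multiprocessing example above) will help far more than rewriting the search in C/C++.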