What is the most efficient way to perform a large and slow batch job on GAE - python-2.7

Say I have a retrieved a list of objects from NDB. I have a method that I can call to update the state of these objects, which I have to do every 15 minutes. These updates take ~30 seconds due to API calls that it has to make.
How would I go ahead and process a list of >1,000 objects?
Example of an approach that would be very slow:
my_objects = [...] # list of objects to process
for object in my_objects:
object.process_me() # takes around 30 seconds
object.put()

Two options:
you can run a task with a query cursor, that processes only N entities each time. When these are processed, and there are more entities to go, you fire another task with the next query cursor.Resources: query cursor, tasks
you can run a mapreduce job that will go over all entities in your query in a parallel manner (might require more resources).Simple tutorial: MapReduce on App Engine made easy

You might consider using mapreduce for your purposes. When I wanted to update all my > 15000 entities I used mapreduce.
def process(entity):
# update...
yield op.db.Put(entity)

Related

Dividing tasks into aws step functions and then join them back when all completed

We have a AWS step function that processes csv files. These CSV files records can be anything from 1 to 4000.
Now, I want to create another inner AWS step function that will process these csv records. The problem is for each record I need to hit another API and for that I want all of the record to be executed asynchronously.
For example - CSV recieved having records of 2500
The step function called another step function 2500 times (The other step function will take a CSV record as input) process it and then store the result in Dynamo or in any other place.
I have learnt about the callback pattern in aws step function but in my case I will be passing 2500 tokens and I want the outer step function to process them when all the 2500 records are done processing.
So my question is this possible using the AWS step function.
If you know any article or guide for me to reference then that would be great.
Thanks in advance
It sounds like dynamic parallelism could work:
To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.
This keeps the flow all within a single Step Function and allows for easier traceability.
The limiting factor would be the amount of concurrency available (docs):
Concurrent iterations may be limited. When this occurs, some iterations will not begin until previous iterations have completed. The likelihood of this occurring increases when your input array has more than 40 items.
One additional thing to be aware of here is cost. You'll easily blow right through the free tier and start incurring actual cost (link).

when is it not performance practical to use persist() on a spark dataframe?

While working on improving code performance as I had many jobs fail (aborted), I thought about using persist() function on Spark Dataframe whenever I need to use that same dataframe on many other operations. When doing it and following the jobs, stages in the Spark application UI, I felt like it's not really always optimal to do so, it depends on the number of partitions and the data size. I wasn't sure until I got the job aborted because of a fail in the persist stage.
I'm questioning if the best practice of using persist() whenever many operations will be performed on the dataframe is always valid? If not, when it's not? how to judge?
To be more concrete I will present my code and the details of the aborted job:
#create a dataframe from another one df_transf_1 on which I made a lot of transformations but no actions
spark_df = df_transf_1.select('user_id', 'product_id').dropDuplicates()
#persist
spark_df.persist()
products_df = spark_df[['product_id']].distinct()
df_products_indexed = products_df.rdd.map(lambda r: r.product_id).zipWithIndex().toDF(['product_id', 'product_index'])
You may ask why I persisted spark_df?
It's because I'm going to use it multiple of times like with products_df and also in joins (e.g: spark_df = spark_df.join(df_products_indexed,"product_id")
Details of fail reason in Stage 3:
Job aborted due to stage failure: Task 40458 in stage 3.0 failed 4 times, most recent failure: Lost task 40458.3 in stage 3.0 (TID 60778, xx.xx.yyyy.com, executor 91): ExecutorLostFailure (executor 91 exited caused by one of the running tasks) Reason: Slave lost
Driver stacktrace:
The size of the input data (4 TB) is huge, before doing persist is there a way to check the size of the data? Is it a parameter in choosing to persist or not? Also the number of partitions (tasks) for persist > 100,000
Here are two cases for using persist():
After using repartition in order to avoid shuffling your data again and again as the dataframe is being used by the next steps. This will be useful only for the case that you call more than one action for the persisted dataframe/RDD since persist is an transformation and hence lazily evaluated. In general if you have multiple actions on the same dataframe/RDD.
Iterative computations, for instance when you want to query a dataframe inside a for loop. With persist Spark will save the intermediate results and omit reevaluating the same operations on every action call. Another example would be appending new columns with a join as discussed here.
What my experience taught me is that you should persist the dataframe when you perform several operations on them, so you create temporal tables (also you ensure that if something fails you have a recovery point). By doing this you prevent huge DAG'S that often do not end, if you have, for example, joins. So my advice would be to do something like this:
# operations
df.write.saveAsTable('database.tablename_temp')
df = spark.table('database.tablename_temp')
# more operations

Dataproc Pyspark job only running on one node

My problem is that my pyspark job is not running in parallel.
Code and data format:
My PySpark looks something like this (simplified, obviously):
class TheThing:
def __init__(self, dInputData, lDataInstance):
# ...
def does_the_thing(self):
"""About 0.01 seconds calculation time per row"""
# ...
return lProcessedData
#contains input data pre-processed from other RDDs
#done like this because one RDD cannot work with others inside its transformation
#is about 20-40MB in size
#everything in here loads and processes from BigQuery in about 7 minutes
dInputData = {'dPreloadedData': dPreloadedData}
#rddData contains about 3M rows
#is about 200MB large in csv format
#rddCalculated is about the same size as rddData
rddCalculated = (
rddData
.map(
lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
)
llCalculated = rddCalculated.collect()
#save as csv, export to storage
Running on Dataproc cluster:
Cluster is created via the Dataproc UI.
Job is executed like this:
gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster_name> <script.py>
I observed the job status via the UI, started like this. Browsing through it I noticed that only one (seemingly random) of my worker nodes was doing anything. All others were completely idle.
Whole point of PySpark is to run this thing in parallel, and is obviously not the case. I've run this data in all sorts of cluster configurations, the last one being massive, which is when I noticed it's singular-node use. And hence why my jobs take too very long to complete, and time seems independent of cluster size.
All tests with smaller datasets pass without problems on my local machine and on the cluster. I really just need to upscale.
EDIT
I changed
llCalculated = rddCalculated.collect()
#... save to csv and export
to
rddCalculated.saveAsTextFile("gs://storage-bucket/results")
and only one node is still doing the work.
Depending on whether you loaded rddData from GCS or HDFS, the default split size is likely either 64MB or 128MB, meaning your 200MB dataset only has 2-4 partitions. Spark does this because typical basic data-parallel tasks churn through data fast enough that 64MB-128MB means maybe tens of seconds of processing, so there's no benefit in splitting into smaller chunks of parallelism since startup overhead would then dominate.
In your case, it sounds like the per-MB processing time is much higher due to your joining against the other dataset and perhaps performing fairly heavyweight computation on each record. So you'll want a larger number of partitions, otherwise no matter how many nodes you have, Spark won't know to split into more than 2-4 units of work (which would also likely get packed onto a single machine if each machine has multiple cores).
So you simply need to call repartition:
rddCalculated = (
rddData
.repartition(200)
.map(
lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
)
Or add the repartition to an earlier line:
rddData = rddData.repartition(200)
Or you may have better efficiency if you repartition at read time:
rddData = sc.textFile("gs://storage-bucket/your-input-data", minPartitions=200)

Pig killing data nodes while loading a lot of files

I have a script that tries to get the times that users start/end their days based on log files. The job always fails before it completes and seems to knock 2 data nodes down every time.
The load portion of the script:
log = LOAD '$data' USING SieveLoader('#source_host', 'node', 'uid', 'long_timestamp', 'type');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL AND $0#'type'=='USER_AUTH';
There are about 6500 files that we are reading from, so it seems to spawn about that many map tasks. The SieveLoader is a custom UDF that loads a line, passes it to an existing method that parses fields from the line and returns them in a map. The parameters passed in are to limit the size of the map to only those fields with which we are concerned.
Our cluster has 5 data nodes. We have quad cores and each node allows 3 map/reduce slots for a total of 15. Any advice would be greatly appreciated!

Processing web feed multiple times a day

Ok, here is in brief the deal: I spider the web (all kind of data, blogs/news/forums) as it appears on internet. Then I process this feed and do analysis on processed data. Spidering is not a big deal. I can get it pretty much in real time as internet gets new data. Processing is a bottleneck, it involves some computationally heavy algorithms.
I am in pursuit of building a strategy to schedule my spiders. The big goal is to make sure that analysis that is produced as end result reflects effect of as much recent input as possible. Start to think of it, the obvious objective is to make sure data does not pile up. I get the data through spiders, pass on to processing code, wait till processing gets over and then spider more. This time bringing all the data which appeared while I was waiting for processing to get over. Okay this is a very broad thought.
Can some of you share your thoughts, may be think loud. If you were me what would go in your mind. I hope I am making sense with my question. This is not a search engine indexing by the way.
It appears that you want to keep the processors from falling too far behind the spiders. I would imagine that you want to be able to scale this out as well.
My recommendation is that you implement a queue using an client/server SQL databse. MySQL would work nicely for this purpose.
Design Objectives
Keep the spiders from getting too far ahead of the processors
Allow for a balance of power between spiders and processors (keeping each busy)
Keep data as fresh as possible
Scale out and up as needed
Queue:
Create a queue to store the data from the spiders before it is processed. This could be done in several ways, but it does not sound like IO is your bottleneck.
A simple approach would be to have an SQL table with this layout:
TABLE Queue
Queue_ID int unsigned not null auto_increment primary key
CreateDate datetime not null
Status enum ('New', 'Processing')
Data blob not null
# pseudo code
function get_from_queue()
# in SQL
START TRANSACTION;
SELECT Queue_ID, Data FROM Queue WHERE Status = 'New' LIMIT 1 FOR UPDATE;
UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = (from above)
COMMIT
# end sql
return Data# or false in the case of no records found
# pseudo code
function count_from_queue()
# in SQL
SELECT COUNT(*) FROM Queue WHERE Status = 'New'
# end sql
return (the count)
Spider:
So you have multiple spider processes.. They each say:
if count_from_queue() < 10:
# do the spider thing
# save it in the queue
else:
# sleep awhile
repeat
In this way, each spider will be either resting or spidering. The decision (in this case) is based on if there are less than 10 pending items to process. You would tune this to your purposes.
Processor
So you have multiple processor processes.. They each say:
Data = get_from_queue()
if Data:
# process it
# remove it from the queue
else:
# sleep awhile
repeat
In this way, each processor will be either resting or processing.
In summary:
Whether you have this running on one computer, or 20, a queue will provide the control you need to ensure that all parts are in sync, and not getting too far ahead of each other.