Pig killing data nodes while loading a lot of files - mapreduce

I have a script that tries to get the times that users start/end their days based on log files. The job always fails before it completes and seems to knock 2 data nodes down every time.
The load portion of the script:
log = LOAD '$data' USING SieveLoader('#source_host', 'node', 'uid', 'long_timestamp', 'type');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL AND $0#'type'=='USER_AUTH';
There are about 6500 files that we are reading from, so it seems to spawn about that many map tasks. The SieveLoader is a custom UDF that loads a line, passes it to an existing method that parses fields from the line and returns them in a map. The parameters passed in are to limit the size of the map to only those fields with which we are concerned.
Our cluster has 5 data nodes. We have quad cores and each node allows 3 map/reduce slots for a total of 15. Any advice would be greatly appreciated!


Dataproc Pyspark job only running on one node

My problem is that my pyspark job is not running in parallel.
Code and data format:
My PySpark looks something like this (simplified, obviously):
class TheThing:
    def __init__(self, dInputData, lDataInstance):
        # ...
    def does_the_thing(self):
        """About 0.01 seconds calculation time per row"""
        # ...
        return lProcessedData
#contains input data pre-processed from other RDDs
#done like this because one RDD cannot work with others inside its transformation
#is about 20-40MB in size
#everything in here loads and processes from BigQuery in about 7 minutes
dInputData = {'dPreloadedData': dPreloadedData}
#rddData contains about 3M rows
#is about 200MB large in csv format
#rddCalculated is about the same size as rddData
rddCalculated = rddData.map(
    lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
llCalculated = rddCalculated.collect()
#save as csv, export to storage
Running on Dataproc cluster:
Cluster is created via the Dataproc UI.
Job is executed like this:
gcloud --project <project> dataproc jobs submit pyspark --cluster <cluster_name> <script.py>
I observed the job status via the UI. Browsing through it, I noticed that only one (seemingly random) of my worker nodes was doing anything. All the others were completely idle.
The whole point of PySpark is to run this in parallel, and that is obviously not happening. I've run this data on all sorts of cluster configurations, the last one being massive, which is when I noticed its single-node use. Hence my jobs take far too long to complete, and the run time seems independent of cluster size.
All tests with smaller datasets pass without problems on my local machine and on the cluster. I really just need to upscale.
I changed
llCalculated = rddCalculated.collect()
#... save to csv and export
and only one node is still doing the work.
Depending on whether you loaded rddData from GCS or HDFS, the default split size is likely either 64MB or 128MB, meaning your 200MB dataset only has 2-4 partitions. Spark does this because typical basic data-parallel tasks churn through data fast enough that 64MB-128MB means maybe tens of seconds of processing, so there's no benefit in splitting into smaller chunks of parallelism since startup overhead would then dominate.
In your case, it sounds like the per-MB processing time is much higher due to your joining against the other dataset and perhaps performing fairly heavyweight computation on each record. So you'll want a larger number of partitions, otherwise no matter how many nodes you have, Spark won't know to split into more than 2-4 units of work (which would also likely get packed onto a single machine if each machine has multiple cores).
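To make the arithmetic above concrete, here is a rough sketch in plain Python (no Spark needed). The function names are made up for illustration; the figures (200MB input, 3M rows, 0.01 seconds per row, 64MB or 128MB splits) are the ones quoted in this thread, and the real split count depends on how the input is actually read.

```python
import math


def estimate_partitions(dataset_mb, split_mb):
    """Approximate number of input splits for a dataset of the given size."""
    return math.ceil(dataset_mb / split_mb)


def rows_per_partition(total_rows, partitions):
    """Rows that land in the largest partition if rows are spread evenly."""
    return math.ceil(total_rows / partitions)


def seconds_per_partition(total_rows, partitions, seconds_per_row):
    """Rough duration of one task, i.e. the work for a single partition."""
    return rows_per_partition(total_rows, partitions) * seconds_per_row


if __name__ == "__main__":
    total_rows = 3_000_000
    seconds_per_row = 0.01
    # Default split sizes versus an explicit repartition(200).
    for label, partitions in (
        ("64MB splits", estimate_partitions(200, 64)),
        ("128MB splits", estimate_partitions(200, 128)),
        ("repartition(200)", 200),
    ):
        task_seconds = seconds_per_partition(total_rows, partitions, seconds_per_row)
        print(f"{label}: {partitions} partitions, ~{task_seconds:.0f} s per task")
```

With only 2-4 partitions each task runs for hours no matter how many workers exist, whereas 200 partitions give the scheduler 200 short tasks to spread across the cluster.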
So you simply need to call repartition:
rddCalculated = rddData.repartition(200).map(
    lambda l, dInputData=dInputData: TheThing(dInputData, l).does_the_thing()
)
Or add the repartition to an earlier line:
rddData = rddData.repartition(200)
Or you may have better efficiency if you repartition at read time:
rddData = sc.textFile("gs://storage-bucket/your-input-data", minPartitions=200)

What is the most efficient way to perform a large and slow batch job on GAE

Say I have retrieved a list of objects from NDB. I have a method that I can call to update the state of these objects, which I have to do every 15 minutes. Each update takes ~30 seconds because of the API calls it has to make.
How would I go ahead and process a list of >1,000 objects?
Example of an approach that would be very slow:
my_objects = [...] # list of objects to process
for object in my_objects:
    object.process_me() # takes around 30 seconds
Two options:
You can run a task with a query cursor that processes only N entities at a time. When these are processed and there are more entities to go, you fire another task with the next query cursor. Resources: query cursor, tasks
You can run a mapreduce job that goes over all entities in your query in parallel (might require more resources). Simple tutorial: MapReduce on App Engine made easy
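The first option can be sketched roughly as follows. This is only an outline of the control flow, not App Engine code: fetch_page, process and enqueue_next are hypothetical callables. On App Engine they would presumably be backed by an NDB query with a cursor and by the task queue; here a plain list and a deque stand in so the sketch runs anywhere.

```python
from collections import deque


def process_batch(fetch_page, process, enqueue_next, cursor=None, batch_size=20):
    """Process one page of entities, then schedule a follow-up task if more remain.

    fetch_page(batch_size, cursor) must return (entities, next_cursor, more).
    All three callables are assumptions made for this sketch.
    """
    entities, next_cursor, more = fetch_page(batch_size, cursor)
    for entity in entities:
        process(entity)
    if more:
        enqueue_next(next_cursor)  # fire another task with the next cursor
    return len(entities)


def make_list_pager(items):
    """In-memory stand-in for a datastore query; the cursor is a list index."""
    def fetch_page(batch_size, cursor):
        start = cursor or 0
        page = items[start:start + batch_size]
        next_cursor = start + len(page)
        return page, next_cursor, next_cursor < len(items)
    return fetch_page


def run_all(items, process, batch_size=20):
    """Drain a fake task queue, running one task per pending cursor."""
    pending = deque([None])  # the first task starts without a cursor
    fetch_page = make_list_pager(items)
    tasks_run = 0
    while pending:
        cursor = pending.popleft()
        process_batch(fetch_page, process, pending.append, cursor, batch_size)
        tasks_run += 1
    return tasks_run


if __name__ == "__main__":
    seen = []
    print(run_all(list(range(1000)), seen.append), "tasks for", len(seen), "objects")
```

Note that a chain like this still handles the batches one after another. At ~30 seconds per object, 1,000 objects need about 30,000 seconds of work against a 900-second window, so the batches would have to run concurrently (for example by enqueuing the next task before processing the current page) to finish in time.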
You might consider using mapreduce for your purposes. When I wanted to update all my > 15000 entities I used mapreduce.
from mapreduce import operation as op

def process(entity):
    # update...
    yield op.db.Put(entity)

Reading many small files from S3 very slow

Loading many small files (>200000, 4kbyte) from a S3 Bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only one mapper is used to get the data, though I cannot exactly figure out where the bottleneck is.
Pig Code Sample
data = load 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);
Hive Code Sample
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effects:
Increase #Task Nodes
set hive.optimize.s3.query=true
manually set #mappers
Increase instance type from medium up to xlarge
I know that s3distcp would speed up the process, but I could only get better performance by doing a lot of tweaking including setting #workerThreads and would prefer changing parameters directly in my PIG/Hive scripts.
You can either:
use distcp to merge the files before your job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
have a Pig script that will do it for you, once.
If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:
-- false: small inputs are combined, so the mapper count follows the split size below.
-- Set to true for one mapper per file.
SET pig.noSplitCombination false;
-- combined split size in bytes: pick X so that SUM(size) / X = wanted number of mappers
SET pig.maxCombinedSplitSize 250000000;
Please provide metrics for those cases.
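As a rough illustration of the SUM(size) / X rule, the following Python sketch works through the numbers from the question (about 200,000 files of 4 KB each). The helper names are invented for this example, and the mapper count Pig actually produces may differ somewhat, since it also depends on how the files are grouped into splits.

```python
import math

# Figures from the question: roughly 200,000 files of about 4 KB each.
FILE_COUNT = 200_000
FILE_BYTES = 4 * 1024
TOTAL_BYTES = FILE_COUNT * FILE_BYTES


def estimated_mappers(total_bytes, max_combined_split_size):
    """Approximate mapper count when small files are combined up to the given size."""
    return math.ceil(total_bytes / max_combined_split_size)


def combined_split_size(total_bytes, wanted_mappers):
    """Split size (in bytes) that yields roughly the wanted number of mappers."""
    return math.ceil(total_bytes / wanted_mappers)


if __name__ == "__main__":
    print("total input bytes:", TOTAL_BYTES)
    print("mappers at 250000000 bytes per split:",
          estimated_mappers(TOTAL_BYTES, 250_000_000))
    print("split size for about 40 mappers:",
          combined_split_size(TOTAL_BYTES, 40))
```

With the value shown above, the whole input would fit into only a handful of combined splits, so a smaller pig.maxCombinedSplitSize is probably what you want if the goal is more parallel fetchers.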

neo4j-import with node_auto_indexing

For a project, I need to import 5 million nodes and 15 million relationships.
I tried importing in batches, but it was very slow, so I used the new 'neo4j-import' tool from Neo4j 2.2. We generate some specific .csv files and run 'neo4j-import' on them. It is very fast: the whole database is created in 1 min 30 s.
But the problem is that I need to run a regex query on one property (find a movie from only the beginning of its name). The average response time is between 2.5 and 4 seconds, which is huge.
I read that a Lucene query would be much more efficient. But with neo4j-import, nodes are created without node_auto_indexing.
Is there a way to use Neo4j-import and have node_auto_indexing in order to use the Lucene query?
neo4j-import does not populate auto indexes. To do so, you need to trigger a write operation on the nodes to be auto indexed. Assume you have nodes with a :Person label and a name property.
Configure node auto index for name in neo4j.properties and restart Neo4j.
To populate the autoindex run a cypher statement like:
MATCH (n:Person)
WHERE NOT HAS(n.migrated)
WITH n LIMIT 50000
SET n.name = n.name, n.migrated=true
RETURN count(n)
Rerun this statement until the reported count is 0. The rationale for the LIMIT is to have transactions of a reasonable size.
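If you would rather not rerun the statement by hand, the loop can be scripted. The sketch below shows only the rerun-until-zero logic: run_batch is a hypothetical callable that would execute the Cypher statement once (through whatever client you use) and return the reported count. A fake in-memory version stands in for the database so the example runs on its own.

```python
def populate_in_batches(run_batch, max_rounds=10_000):
    """Call run_batch() until it reports 0 touched nodes; return the total touched.

    run_batch is an assumption of this sketch: it should run the batch
    statement once and return the count it reports.
    """
    total = 0
    for _ in range(max_rounds):
        count = run_batch()
        if count == 0:
            return total
        total += count
    raise RuntimeError("still finding unmigrated nodes after max_rounds batches")


def make_fake_batch(remaining, limit=50_000):
    """Stand-in for the database: pretends `remaining` nodes still need migrating."""
    state = {"remaining": remaining}

    def run_batch():
        done = min(limit, state["remaining"])
        state["remaining"] -= done
        return done

    return run_batch


if __name__ == "__main__":
    # 5 million nodes, as in the question, in batches of 50000.
    print(populate_in_batches(make_fake_batch(5_000_000)), "nodes touched")
```

The max_rounds guard is only there so a statement that never reaches 0 (for instance because the migrated flag is not being set) cannot loop forever.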

Processing web feed multiple times a day

OK, here is the deal in brief: I spider the web (all kinds of data: blogs, news, forums) as it appears on the internet. Then I process this feed and run analysis on the processed data. Spidering is not a big deal; I can get data pretty much in real time as it appears. Processing is the bottleneck, since it involves some computationally heavy algorithms.
I am trying to build a strategy for scheduling my spiders. The main goal is to make sure that the analysis produced as the end result reflects as much recent input as possible. Thinking about it, the obvious objective is to make sure data does not pile up. I get the data through spiders, pass it on to the processing code, wait until processing finishes, and then spider more, this time bringing in all the data that appeared while I was waiting. Okay, this is a very broad thought.
Can some of you share your thoughts, maybe think out loud? If you were me, what would go through your mind? I hope my question makes sense. This is not search engine indexing, by the way.
It appears that you want to keep the processors from falling too far behind the spiders. I would imagine that you want to be able to scale this out as well.
My recommendation is that you implement a queue using a client/server SQL database. MySQL would work nicely for this purpose.
Design Objectives
Keep the spiders from getting too far ahead of the processors
Allow for a balance of power between spiders and processors (keeping each busy)
Keep data as fresh as possible
Scale out and up as needed
Create a queue to store the data from the spiders before it is processed. This could be done in several ways, but it does not sound like IO is your bottleneck.
A simple approach would be to have an SQL table with this layout:
Queue_ID int unsigned not null auto_increment primary key
CreateDate datetime not null
Status enum ('New', 'Processing')
Data blob not null
# pseudo code
function get_from_queue()
    # in SQL
    SELECT Queue_ID, Data FROM Queue WHERE Status = 'New' LIMIT 1 FOR UPDATE;
    UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = (from above)
    # end sql
    return Data  # or false in the case of no records found
# pseudo code
function count_from_queue()
    # in SQL
    SELECT COUNT(*) FROM Queue WHERE Status = 'New'
    # end sql
    return (the count)
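For anyone who wants to try the idea locally, here is a minimal runnable sketch of the same two functions. It deliberately swaps MySQL for Python's built-in sqlite3 module so it needs no server: SQLite has no SELECT ... FOR UPDATE, so BEGIN IMMEDIATE is used to take the write lock instead, and the enum column becomes a CHECK constraint. put_in_queue and remove_from_queue are extra helper names added for this example.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open the database and create the Queue table (SQLite stand-in for MySQL)."""
    # isolation_level=None means autocommit; transactions are started explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Queue ("
        " Queue_ID INTEGER PRIMARY KEY AUTOINCREMENT,"
        " CreateDate TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " Status TEXT NOT NULL DEFAULT 'New'"
        "   CHECK (Status IN ('New', 'Processing')),"
        " Data BLOB NOT NULL)"
    )
    return conn


def put_in_queue(conn, data):
    """Called by a spider to store one fetched item."""
    conn.execute("INSERT INTO Queue (Data) VALUES (?)", (data,))


def count_from_queue(conn):
    """Number of items still waiting to be processed."""
    return conn.execute(
        "SELECT COUNT(*) FROM Queue WHERE Status = 'New'"
    ).fetchone()[0]


def get_from_queue(conn):
    """Claim the oldest 'New' item; return (Queue_ID, Data), or None if empty."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT Queue_ID, Data FROM Queue"
            " WHERE Status = 'New' ORDER BY Queue_ID LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE Queue SET Status = 'Processing' WHERE Queue_ID = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise


def remove_from_queue(conn, queue_id):
    """Called by a processor once an item has been handled."""
    conn.execute("DELETE FROM Queue WHERE Queue_ID = ?", (queue_id,))


if __name__ == "__main__":
    conn = open_queue()
    put_in_queue(conn, b"first page")
    put_in_queue(conn, b"second page")
    print(count_from_queue(conn), "items waiting")
    print("claimed:", get_from_queue(conn))
```

Returning the Queue_ID together with the data (rather than the data alone, as in the pseudo code) is a small design choice so the processor can delete exactly the row it claimed.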
So you have multiple spider processes. They each say:
if count_from_queue() < 10:
    # do the spider thing
    # save it in the queue
# sleep awhile
In this way, each spider will be either resting or spidering. The decision (in this case) is based on whether there are fewer than 10 pending items to process. You would tune this to your purposes.
So you have multiple processor processes. They each say:
Data = get_from_queue()
if Data:
    # process it
    # remove it from the queue
# sleep awhile
In this way, each processor will be either resting or processing.
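The throttling behaviour of the two loops can be tried out with a toy simulation. In this sketch an in-memory deque stands in for the SQL table, and spider_step, processor_step and MAX_PENDING are names invented for the example. Each function is one pass of the corresponding loop and returns False when the worker should just sleep.

```python
from collections import deque

MAX_PENDING = 10  # threshold from the example above; tune to your workload


def spider_step(queue, fetch):
    """One spider iteration: fetch only while fewer than MAX_PENDING items wait."""
    if len(queue) < MAX_PENDING:
        queue.append(fetch())  # do the spider thing and save it in the queue
        return True
    return False  # queue is full enough: rest


def processor_step(queue, process):
    """One processor iteration: take one item if available and process it."""
    if queue:
        process(queue.popleft())  # process it and remove it from the queue
        return True
    return False  # nothing to do: rest


if __name__ == "__main__":
    queue = deque()
    # Spiders run 25 times with no processor keeping up:
    # the backlog stops growing once it reaches MAX_PENDING.
    busy = sum(spider_step(queue, lambda: "page") for _ in range(25))
    print("spider fetched", busy, "times; backlog is", len(queue))
    # One processor pass frees a slot, so the next spider pass fetches again.
    processor_step(queue, lambda item: None)
    print("spider resumes:", spider_step(queue, lambda: "page"))
```

The point of the simulation is only that the backlog is bounded: however fast the spiders are, they never get more than MAX_PENDING items ahead of the processors.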
In summary:
Whether you have this running on one computer, or 20, a queue will provide the control you need to ensure that all parts are in sync, and not getting too far ahead of each other.