Is reducer bottleneck in MR framework - mapreduce

I want to understand what to do in the following case.
For example, I have 1 TB of text data, and let's assume that 900 GB of it is the word "Hello".
After each map operation, I will have a collection of key-value pairs of <"Hello",1>.
But as I said, this is a huge collection, 900 GB, and as I understand it, the reducer gets all of it and will crash.
My reducer has only 80 GB of RAM.
Will the reducer really crash?
In other words, is the reducer the bottleneck of horizontal scaling?

Yes, all equal keys from all mappers get funneled into a single reducer.
It's not clear if you have 900GB of only one word, or a bunch of large text documents with a bunch of words.
In the latter case, the string "Hello" really doesn't take that much data. Neither does a single integer.
The reducer will also get a long list of ones, sure, but if you reuse the reducer code as a Combiner, you can mitigate the memory issues by pre-aggregating the values for each input split.
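As a rough illustration of the pre-aggregation idea, here is a plain-Java simulation (no Hadoop classes are used; the class and method names are hypothetical). Each split is collapsed to one partial count per distinct word before anything reaches the reduce step, so the reducer sees one number per split instead of one <"Hello",1> pair per occurrence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of word count with a combiner step.
// This is a sketch of the idea only, not the Hadoop API.
class CombinerSketch {

    // "Map + combine" for one input split: instead of emitting <word, 1>
    // per occurrence, keep one partial count per distinct word in the split.
    static Map<String, Long> mapAndCombine(String split) {
        Map<String, Long> partial = new HashMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                partial.merge(word, 1L, Long::sum);
            }
        }
        return partial;
    }

    // "Reduce": sum the partial counts coming from all splits.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> partials = new ArrayList<>();
        partials.add(mapAndCombine("Hello Hello Hello world"));
        partials.add(mapAndCombine("Hello Hello"));
        System.out.println(reduce(partials));
    }
}
```

This works for counting because addition is associative and commutative, which is what allows the same summing logic to run both per split and in the final reduce.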

Related

Is it okay to set reduce_limit = false config in couchdb configuration?

I am working on a map/reduce view and I always get reduce_overflow_error each time I run the view. If I set reduce_limit = false in the CouchDB configuration, it works. I want to know if there is any negative effect if I change this config setting. Thank you.
With reduce_limit=true, CouchDB checks the size of the reduced output at each step of the reduction. If the stringified JSON output of a reduction step is longer than 200 characters and at least twice as long as its input, CouchDB's query server throws an error. Both numbers, 2x and 200 chars, are hard-coded.
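The rule as described above can be written as a small predicate. This is only a sketch of the description in this answer, not CouchDB's actual query-server code; the class, method and constant names are hypothetical.

```java
// Sketch of the overflow rule as described in the answer:
// output longer than 200 characters AND at least twice as long as the input.
class ReduceLimitSketch {
    static final int MIN_LENGTH = 200;   // threshold quoted in the answer
    static final int GROWTH_FACTOR = 2;  // "twice or more" quoted in the answer

    // Both arguments are the stringified JSON for one reduction step.
    static boolean wouldOverflow(String inputJson, String outputJson) {
        return outputJson.length() > MIN_LENGTH
                && outputJson.length() >= GROWTH_FACTOR * inputJson.length();
    }
}
```

Under this reading, a short output never triggers the error no matter how small the input is, and a long output is fine as long as the input was more than half its length.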
Since a reduce function runs inside SpiderMonkey instance(s) with only 64 MB of RAM available, the default limitation looks somewhat reasonable. In theory, reduce must fold the data it is given, not blow it up.
However, in real life it's quite hard to stay under the limit in all cases. You cannot control the number of chunks for a (re)reduction step. This means you can run into a situation where the output for a particular chunk is more than twice as long in characters as its input, although the other reduced chunks are much shorter. In this case even one awkward chunk breaks the entire reduction if reduce_limit is set.
So unsetting reduce_limit might be helpful if your reducer can sometimes output more data than it received.
A common case is unrolling arrays into objects. Imagine you receive a list of arrays like [[1,2,3...70], [5,6,7...], ...] as input rows. You want to aggregate the list into something like {key0:(sum of 0th elts), key1:(sum of 1st elts)...}.
If CouchDB decides to send you a chunk with 1 or 2 rows, you get an error. The reason is simple: object keys are also counted when calculating the result length.
A possible (but very hard to trigger) negative effect is a SpiderMonkey instance constantly restarting or failing because it exceeds its RAM quota while trying to process a reduction step or the entire reduction. Restarting SM is CPU- and RAM-intensive and generally costs hundreds of milliseconds.

Mapper or Reducer, where to do more processing?

I have a 6 million line text file with lines up to 32,000 characters long, and I want to measure the word-length frequencies.
The simplest method is for the Mapper to create a (word-length, 1) key-value pair for every word and let an 'aggregate' Reducer do the rest of the work.
Would it be more efficient to do some of the aggregation in the mapper, so that the key-value pair output would be (word-length, frequency_per_line)?
The output from the mapper would be decreased by a factor of the average number of words per line.
I know there are many configuration factors involved. But is there a hard rule saying whether most of the work should be done by the Mapper or the Reducer?
The platform is AWS with a student account, limited to the following configuration.
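The per-line aggregation described in the question could look roughly like the following. It is a plain-Java sketch outside any MapReduce framework, the names are hypothetical, and it assumes words are separated by whitespace (punctuation is counted as part of a word).

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch: what a mapper could compute for one line before emitting anything.
class WordLengthPerLine {

    // Returns word-length -> frequency for a single line, i.e. the
    // (word-length, frequency_per_line) pairs instead of (word-length, 1).
    static Map<Integer, Integer> lengthFrequencies(String line) {
        Map<Integer, Integer> freq = new TreeMap<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                freq.merge(word.length(), 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(lengthFrequencies("the quick brown fox"));
    }
}
```

The mapper would then emit one pair per distinct word length in the line rather than one pair per word, and the reducer still just sums the values per key.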

Sharing counter values between MapReduce mappers

I have a mapper that reads input and writes to a database. I want to limit how many inputs are actually converted and written to that database, and all mappers must contribute to the limit and then stop once that limit is reached (approximately; one or two extra isn't a big deal.)
I implemented a limiter function on our mapper that asks the other tasks, "How many records have you imported?" Once a given limit is reached, it will stop importing those records (although it will continue processing them for other purposes.)
The map code in question looks something like this:
public void map(ImmutableBytesWritable key, Result row, Context context) {
    // prepare the input
    // ...
    // import only while this task's view of the counter is below the limit
    if (context.getCounter(Metrics.IMPORTED).getValue() < IMPORT_LIMIT) {
        importRecord();
        context.getCounter(Metrics.IMPORTED).increment(1L);
    }
    // do other things
    // ...
}
So each mapper checks to see if there is more room to import, and only if the limit hasn't been reached does it perform any importing. However, each mapper itself is importing up to the limit, so that for 16 mappers, we get 16*IMPORT_LIMIT records imported. It's definitely doing SOME limiting (the count is much much lower than the normal number of imported records.)
When are counter values pushed to other mappers, or are they even available to each mapper? Can I actually get somewhat real-time values from the counter, or do they only update when a mapper is finished? Is there a better way to share a value between mappers?
Okay: from what I've seen, MapReduce doesn't share counters between mappers until the job is finished (i.e. not at all). I'm not sure if mappers that commit partway through will allow later mappers to see their counters, but it's not reliable enough to be done in real time.
Instead, I will run a simple Java application that iterates over the rows on its own and writes to a column, which the existing MapReduce job will use to determine whether it should import the row or not.
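To see why the observed total ends up near 16*IMPORT_LIMIT, here is a small simulation of the behaviour reported above. It assumes each task only ever sees its own counter value; it models the symptom described in the question and is not Hadoop's counter implementation. All names are hypothetical.

```java
// Simulation: every mapper checks a counter that only reflects its own imports.
class LocalCounterSketch {

    // Returns how many records get imported across all mappers.
    static long simulate(int mappers, long recordsPerMapper, long importLimit) {
        long totalImported = 0;
        for (int m = 0; m < mappers; m++) {
            long localCounter = 0; // assumption: no visibility into other tasks
            for (long r = 0; r < recordsPerMapper; r++) {
                if (localCounter < importLimit) {
                    localCounter++;
                    totalImported++;
                }
            }
        }
        return totalImported;
    }

    public static void main(String[] args) {
        System.out.println(simulate(16, 1000, 100));
    }
}
```

Each mapper stops at the limit on its own, so the job as a whole imports up to the limit once per mapper, which matches the limiting-but-not-enough effect described in the question.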

JPA 2.0: Batching queries with IN clause

I am looking for a strategy to batch all my queries (with IN clause) to overcome the restrictions by databases on IN clause (See here).
I usually get list of size 100000 to 305000. So, this has become very important to tackle.
I have tried two strategies so far.
Strategy 1:
Create an entity and hence a table with one column to hold such values (can we create temp tables on the fly with JPA 2.0 vendor-independent?) and use the data from the temp table as a subquery to the original query before eventually cleaning up the temp table.
Advantage: Very performant queries. Really quick, I must admit for the numbers I have mentioned, it was mostly under a minute.
Possible drawback: Use of temp table which is actually a permanent one in my case thus far.
Strategy 2:
Calculate the batch size for the given input list and for each batch execute the query and accumulate the result.
Advantage: No temp tables. Easy for any threads within the same transaction.
Disadvantage: A big disadvantage is the amount of time it takes to execute all the batches. For the numbers mentioned, this is currently unacceptable: anywhere between 5 and 15 minutes!
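The batching step of Strategy 2 can be sketched as below. Only the list splitting is shown; the query, the batch size and the class name are hypothetical, and each resulting batch would be bound to the IN parameter of one query execution.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a large list of IN-clause values into fixed-size batches.
class InClauseBatcher {

    static <T> List<List<T>> partition(List<T> values, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < values.size(); start += batchSize) {
            int end = Math.min(start + batchSize, values.size());
            // copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(values.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ids.add(i);
        }
        System.out.println(partition(ids, 4));
    }
}
```

The results of the per-batch queries are then accumulated by the caller. The number of round trips grows linearly with the list size, which is consistent with the slow timings reported for this strategy.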
I would appreciate any feedback, suggestions or improvements from all you JPA gurus.
Thanks.
I only tested up to 50,000 integers but I have some pretty good performance data around splitting large lists using various methods, with CLR and a numbers table leading the pack at the higher end:
Splitting a list of integers : another roundup
Not sure if you are using integers or strings but the results should be roughly equivalent.
As an aside, I'll confess I have no idea what JPA 2.0 is, but I assume you can control the format of the lists that it sends to SQL Server.

What is cheaper: cast to int or Trim the strings in C++?

I am reading several files from the Linux /proc filesystem and I will have to insert those values into a database. It should be as efficient as possible. So what is cheaper:
i) to cast them to int while they are stored in memory, and later cast them to string again while I build my INSERT statement
ii) or keep them as strings, just sanitizing the values (removing ':', spaces, etc...)
iii) What should I take into account to learn how to make this decision?
I am already splitting the lines, because the order in which the values arrive is not good enough for me.
Thanks,
Pedro
Edit - Clarification
Sorry guys, my scenario is the following: I am measuring CPU, memory, network, disk, etc. every 10 seconds. We are developing our database system, so I cannot count on anything more than just INSERT statements.
I got interested in this optimization because of the frequency of parsing data. It's going to be write-once: there will be no updates to the data after it is written.
You seem to be performing some archiving activity [write-once, read-probably-at-most-once] (storing the data in the DB for later rare/infrequent use). If not, you should base the optimization emphasis on how the data will be read (not written).
If this is the archiving case, maybe inserting BLOBs (binary large objects, [or similar concepts]) into the DB will be more efficient.
Addition:
Apparently it will depend on how you will read the data. Are you just listing the data for browsing later on, or will there be more complex fetching queries based on the benchmark values?
For example, if you later run something like: SELECT * from db.Log WHERE log.time > time1 and Max (Memory) < 5000, then it is best to keep each value in its original format (int as integer, string as String, etc.) so that the main data processing is left to the DB server.