I have a requirement where I need to work on multiple rows of input data: first sort the data, then subtract the value in one row from the value in the next row, and so on. Can this operation be done in MapReduce somehow?
You can write a custom RecordReader that sends your desired number of records to each map task and perform the calculations there.
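As a rough sketch of that idea (not necessarily your exact setup): instead of a fully custom RecordReader, Hadoop's built-in NLineInputFormat already hands each map task a fixed number of input lines, and the mapper can then sort its batch and emit the differences between consecutive values. The class name, paths, and the 1000-line batch size below are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: every map task receives a fixed batch of rows, which it can
// buffer, sort, and then subtract consecutive values from.
public class RowDifferenceDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "row-differences");
    job.setJarByClass(RowDifferenceDriver.class);

    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);      // hypothetical batch size
    NLineInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // job.setMapperClass(...) would point at a mapper that sorts its batch in
    // cleanup() and emits value[i+1] - value[i] for consecutive rows.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}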
Given a large number of known row keys, how does Bigtable read (not a scan operation) those rows? Does it read the rows one after the other or all at once? If I have a large number of non-contiguous rows that I want to read, is it better to make separate concurrent or parallel requests for each row, or to give all the rows to Bigtable at once, i.e. a "batch read"?
There are three options for a non-contiguous batch read, and the right one depends on your latency and CPU requirements. You can do all the reads as get requests in parallel, you can issue a single read-rows request/scan with multiple ranges that each include only one row, or you can do a hybrid of the two.
Reading with multiple parallel get requests
This option can be great if you have a lot of processing power or don't need to read a huge number of rows. It issues multiple requests to Bigtable, so it will have an impact on your CPU utilization. One Bigtable node supports around 10K reads per second, but if you have 1,000 rows that you need to read individually, that can make a dent in your capacity.
Also, if you need all of the requests to resolve before you can process the data, you may run into performance issues: if one request is slow, it slows down the entire result.
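A minimal sketch of the parallel-gets option with the Java client, reusing the same dataClient, tableId and printRow helper as the scan snippet further down (the row keys here are just the same placeholders):

List<ApiFuture<Row>> futures = new ArrayList<>();
for (String rowKey : List.of("phone#4c410523#20190501", "phone#4c410523#20190502")) {
  // One asynchronous get request per row key; the client issues them in parallel.
  futures.add(dataClient.readRowAsync(tableId, rowKey));
}
// Block until every get has resolved, then process the rows.
for (Row row : ApiFutures.allAsList(futures).get()) {
  printRow(row);
}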
Scan with multiple rows
Bigtable supports scanning with multiple row ranges based on the row key. You can create a row range that includes exactly one row and do a single scan that contains one such range for each row.
The Bigtable client libraries support queries like this, so you can just pass the row keys and don't need to create all of those row range filters. However, it's important to know what is happening under the hood for performance. This one query will be performed sequentially on the Bigtable server, so it could take a lot more time than multiple gets.
In Java, to do this kind of query, you just pass multiple row keys to the Query builder like so:
Query query = Query.create(tableId)
    .rowKey("phone#4c410523#20190501")
    .rowKey("phone#4c410523#20190502");
ServerStream<Row> rows = dataClient.readRows(query);
for (Row row : rows) {
  printRow(row);
}
Hybrid approach
Depending on the scale of rows you're working with, it may make sense to take your set of row keys, divide them up and issue multiple scans in parallel. You can get the benefit of fewer requests while still potentially getting better latency since the requests are parallelized.
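As a rough sketch of the hybrid (chunksOf is a hypothetical helper that partitions the key list, and the batch size and thread count are arbitrary), again reusing dataClient, tableId and printRow from the snippet above:

ExecutorService pool = Executors.newFixedThreadPool(4);
List<Future<List<Row>>> batches = new ArrayList<>();
for (List<String> batch : chunksOf(allRowKeys, 100)) {
  // One scan per batch of row keys, submitted to the pool so batches run in parallel.
  batches.add(pool.submit(() -> {
    Query query = Query.create(tableId);
    for (String rowKey : batch) {
      query.rowKey(rowKey);
    }
    List<Row> rows = new ArrayList<>();
    for (Row row : dataClient.readRows(query)) {
      rows.add(row);
    }
    return rows;
  }));
}
// Collect the batches as they finish.
for (Future<List<Row>> finished : batches) {
  for (Row row : finished.get()) {
    printRow(row);
  }
}
pool.shutdown();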
I would recommend experimenting to see which scenario works best for your use case, or leave a comment with more information on your use case and I can see if there is more information I can offer you.
I would like to split a data set into multiple data sets of 1000 rows each. How is that possible?
The Row Splitter node has only two outputs. Let me know if there is any way to use a Java snippet for this requirement.
It is not entirely clear how you want to split the table, but there are two loop types that might do what you are looking for: Chunk Loop Start or Group Loop Start. Your workflow would probably look like this:
[(Chunk/Group) Loop Start] --> Your processing nodes of the selected rows --> [Loop End]
In the part Your processing nodes of the selected rows you will only see the split parts you need.
The difference between the two nodes is the following: the Chunk Loop Start node collects rows into a group by their position (consecutive rows belong to the same group until the requested number of rows has been consumed), while the Group Loop Start node collects rows with the same properties into the same group for processing. (The Loop End node might not be the best fit depending on your processing requirements; in that case look for other Loop End nodes.)
In case these are not sufficient, you might try the parallel chunk loop nodes, or, as I remember, there are also bagging, ensemble and cross-validation (X-Validation) nodes in some extensions. (For more complex workflows you can also use recursive loops.) There is also support for feature elimination.
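Since you asked about a Java snippet: here is a minimal plain-Java sketch of the same idea as Chunk Loop Start, splitting rows into consecutive groups of 1000 by position (the class name, method name and chunk size are made up, and you would still have to wire it into your workflow yourself):

import java.util.ArrayList;
import java.util.List;

// Splits a list of rows into consecutive chunks of at most chunkSize rows,
// mirroring what Chunk Loop Start does by row position.
public class ChunkSplitter {
  public static <T> List<List<T>> chunksOf(List<T> rows, int chunkSize) {
    List<List<T>> chunks = new ArrayList<>();
    for (int start = 0; start < rows.size(); start += chunkSize) {
      int end = Math.min(start + chunkSize, rows.size());
      chunks.add(new ArrayList<>(rows.subList(start, end)));
    }
    return chunks;
  }
}

For 1000-row groups you would call ChunkSplitter.chunksOf(rows, 1000).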
Assuming only one reducer.
My scenario is to get the list of the top N scorers in the university. The data is in format. The MapReduce framework, by default, sorts the data in ascending order. But I want the list in descending order, or at least if I could access the sorted list from the end, my work would become much easier. Instead of sending a lot of data to the reducer, I could restrict the data to a limit.
(I want to override the predefined Shuffle/Sort)
Thanks & Regards
Ashwanth
I guess a Combiner is what you want. It runs along with the mappers and typically does what a reducer does, but on a single mapper's output data. Generally the combiner class is set to be the same as the reducer. In your case you can sort and pick the top-K elements in each mapper and send only those out.
So instead of sending all your map output records, you will send at most K * (number of mappers) records to the reducer.
You can find example usage on http://wiki.apache.org/hadoop/WordCount.
Bonus - Check out http://blog.optimal.io/3-differences-between-a-mapreduce-combiner-and-reducer/ for major differences between a combiner and a reducer.
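A hedged sketch of the "pick top-K in each mapper" part (not taken from the linked pages): an in-mapper variant that keeps only the K highest scores in a TreeMap and emits them in cleanup(). The class name, K, and the assumed "name,score" line format are made up, and ties on the score are collapsed for simplicity.

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only the K highest scores seen by this mapper, so the single reducer
// receives at most K * (number of mappers) records.
public class TopKMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int K = 10;                       // hypothetical N
  private final TreeMap<Integer, String> topK = new TreeMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Assumed input line format: "studentName,score"; equal scores overwrite each other.
    String[] parts = value.toString().split(",");
    topK.put(Integer.parseInt(parts[1].trim()), parts[0].trim());
    if (topK.size() > K) {
      topK.remove(topK.firstKey());                      // drop the smallest score
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit in descending score order so the reducer sees the highest scores first.
    for (Integer score : topK.descendingKeySet()) {
      context.write(new IntWritable(score), new Text(topK.get(score)));
    }
  }
}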
I need to export data from 3 maps, preferably to a single CSV, and I would like to be able to do so without simply making a column for every possible key (there may be up to 65,024 of them).
The output would be a CSV containing the value at each of the keys at each timestep (there may be several hundred thousand).
Anyone got any ideas?
Reduce the granularity by categorizing your keys into groups and store them with one timestep per row. Then you can plot one datapoint per line.
Let me know if you need clarification; I'd need some more info.
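A hypothetical sketch of that grouping idea in Java (the grouping rule, the summing aggregation, and all names here are assumptions, since the real categories depend on what your keys mean):

import java.io.PrintWriter;
import java.util.Map;
import java.util.TreeMap;

// Collapses a per-timestep map of up to ~65k keys into a handful of key groups
// and writes one CSV row per timestep.
public class GroupedCsvExport {

  // Hypothetical grouping rule: bucket keys by their numeric range.
  static String groupOf(int key) {
    return "group" + (key / 1000);
  }

  public static void writeCsv(Map<Long, Map<Integer, Double>> valuesByTimestep, PrintWriter out) {
    for (Map.Entry<Long, Map<Integer, Double>> step : valuesByTimestep.entrySet()) {
      // Aggregate (here: sum) all values that fall into the same group.
      Map<String, Double> byGroup = new TreeMap<>();
      for (Map.Entry<Integer, Double> e : step.getValue().entrySet()) {
        byGroup.merge(groupOf(e.getKey()), e.getValue(), Double::sum);
      }
      // One row per timestep: the timestep followed by one group=value pair per group.
      StringBuilder row = new StringBuilder(step.getKey().toString());
      for (Map.Entry<String, Double> g : byGroup.entrySet()) {
        row.append(',').append(g.getKey()).append('=').append(g.getValue());
      }
      out.println(row);
    }
  }
}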
I have a set of rows, let's say "rowId", "type", "value". I need as output a set of 10 sample rows for each "type". How can I do it? "type" has approximately 100 different and changing values, so a switch is not a good option.
Well, I've figured out a workaround for this situation. I split the transformation into parts. The first part collects all the data into a temp table, finds the unique types, and copies them to the result.
The second part runs for every input row (where we have the types) and collects the data of the given type from the temp table. Then you need no grouping to do a stratified sample.
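For comparison, the same per-type sampling logic in plain Java (a sketch only; it keeps the first 10 rows of each type, so swap in random/reservoir sampling if you need a genuinely random sample, and all names here are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Keeps at most `perType` rows for each "type" value, without enumerating the
// ~100 possible types up front (so no switch over the type values is needed).
public class StratifiedSampler {
  public static <R> Map<String, List<R>> sample(List<R> rows, Function<R, String> typeOf, int perType) {
    Map<String, List<R>> samples = new HashMap<>();
    for (R row : rows) {
      List<R> bucket = samples.computeIfAbsent(typeOf.apply(row), t -> new ArrayList<>());
      if (bucket.size() < perType) {
        bucket.add(row);             // keep only the first `perType` rows of each type
      }
    }
    return samples;
  }
}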