How does reduce work on an RDD in Apache Spark?


I am currently working with Apache Spark, but I cannot understand how reduce works after map.
My example is pretty simple:
val map = readme.map(line => line.split(" ").size)
I know this returns the number of words per line, but where is the key/value pair to pass to a reduce function?
map.reduce((a,b) => {if(a>b) a else b})
How does the reduce phase work? Is (a,b) a Tuple2, or is it the key/value from the map function?

Once you have
val map = readme.map(line => line.split(" ").size)
each element of the RDD is a single number: the number of words in one line of the file.
You could count all the words in your dataset with map.sum() or map.reduce( (a,b) => a+b ), which are equivalent.
The code you have posted:
map.reduce((a,b) => {if(a>b) a else b})
would find the maximum number of words per line for your entire dataset.
The RDD.reduce method works by repeatedly combining two elements into one, in this case a number. At first the two arguments are pairs of RDD elements; later they may also be intermediate results of earlier combinations. So (a,b) is not a Tuple2 or a key/value pair from map: a and b are simply two values of the RDD's element type. The aggregation function should be written so that it can be nested and applied to the elements in any order, that is, it should be commutative and associative. For example, subtraction does not yield useful results as a reduce function because you cannot predict in what order results will be subtracted from one another. Addition and maximization, however, give the same result regardless of order.
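The order argument can be illustrated without Spark. The following is a minimal sketch in plain Python (not Spark code): functools.reduce stands in for RDD.reduce, the word counts are made-up values, and trying every permutation of the list is only a stand-in for the unpredictable order in which a cluster combines elements.

```python
from functools import reduce
from itertools import permutations

# Hypothetical per-line word counts, standing in for the RDD's elements.
word_counts = [5, 12, 7, 3, 9]

def results_over_all_orders(func, values):
    """Reduce the values in every possible order and collect the distinct results."""
    return {reduce(func, ordering) for ordering in permutations(values)}

# Maximum and addition are commutative and associative: one result for all orders.
maxima = results_over_all_orders(lambda a, b: a if a > b else b, word_counts)
totals = results_over_all_orders(lambda a, b: a + b, word_counts)

# Subtraction is neither, so the result depends on the order of combination.
differences = results_over_all_orders(lambda a, b: a - b, word_counts)

print(maxima)            # {12}
print(totals)            # {36}
print(len(differences))  # 5
```

Each call to the lambda receives two plain numbers, never a key/value pair, which mirrors what the reduce function sees in the Spark example.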

Related

Running a sequence in batches using asSequence()?

I have a function:
private fun importProductsInSequence(products: List<Product>): List<Result> =
products.asSequence()
.map { saveProduct(it)}
.takeWhile { products.isNotEmpty() }
.toList()
Is it possible to rewrite this sequence so that it works in batches? For example, when a list of 1000 products is passed to this method, the sequence would take 100 products, save them, then take the next 100, and so on until all products have been processed.

takeWhile is not needed here: the size of the products list does not change while you iterate over the sequence, so products.isNotEmpty() will always be true.
Kotlin has a chunked method that can give you the objects in batches, as required:
private fun importProductsInSequence(products: List<Product>): List<Result> =
products.asSequence()
.chunked(100)
.map { saveProducts(it)} // saveProducts method would take list of products and return list of Result
.flatten()
.toList()

You need to use chunked:
private fun importProductsInSequence(products: List<Product>): List<Product> =
products.asSequence().chunked(100).onEach { batch -> batch.forEach { saveProduct(it) } }.flatten().toList()
chunked partitions the list into chunks of the given size; the last chunk may have fewer elements than the others.
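The batching idea is not specific to Kotlin. Here is a minimal sketch of the same pattern in Python, assuming a hypothetical save_products function that accepts one batch and returns one result per product; the names and the "saved:" result format are invented for illustration.

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def save_products(batch):
    # Hypothetical stand-in for a real bulk save; returns one result per product.
    return [f"saved:{product}" for product in batch]

def import_products_in_batches(products, batch_size=100):
    """Save products batch by batch and return the flattened list of results."""
    results = []
    for batch in chunked(products, batch_size):
        results.extend(save_products(batch))
    return results

products = [f"p{i}" for i in range(250)]
batch_sizes = [len(batch) for batch in chunked(products, 100)]
results = import_products_in_batches(products)

print(batch_sizes)   # [100, 100, 50]
print(len(results))  # 250
```

As with chunked in Kotlin, the last batch is simply shorter when the input size is not a multiple of the batch size.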

How to make pairs from Range?

I have a range, MySQLTablesRange. It contains data like:
aa_1 aa_3 aa_2 bb_2 bb_1 bb_3
I need to create pairs like:
aa_1 bb_1
aa_2 bb_2
aa_3 bb_3
std.algorithm has a group function that does something similar, but I do not know how to use it in my code. I tried:
MySQLTablesRange.each!(a => a.split("_")[1].array.group.writeln);
But this is wrong, because group works with an array, not with a single element.
Any ideas?

Update: after testing this, I realised it's not group you want, but chunkBy. I have updated the answer to reflect that.
https://dlang.org/phobos/std_algorithm_iteration.html#chunkBy
You have to tell chunkBy how to chunk the data...
[1,2,3,4,5,6]
.sort!((a,b) => a%2 > b%2) // separate odds and evens
.chunkBy!((a,b) => a%2 == b%2); // chunk them so all odds are in one range, evens in another
That will create two groups. One with odd numbers, one with evens.
In your case it looks like you'd group them on the text that comes after '_' in each element.
"aa_1 aa_2 aa_3 bb_1 bb_2 bb_3 cc_1"
.split(" ")
.sort!((a,b) => a[$-1].to!int < b[$-1].to!int) // sort so the _1s are together, the _2s are together, etc.
.chunkBy!((a,b) => a[$-1] == b[$-1]) // chunk them so they're in their own range
.each!writeln; // print each range
$ rdmd test.d
["aa_1", "bb_1", "cc_1"]
["aa_2", "bb_2"]
["aa_3", "bb_3"]
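Ideally you'd find the index of '_' and compare everything after it, since a[$-1] only looks at the last character and so breaks for suffixes with more than one digit.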
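For comparison, here is a minimal sketch of the same sort-then-chunk approach in Python rather than D. It uses itertools.groupby, which, like chunkBy, only merges adjacent elements. The suffix helper is a name introduced here, and it compares everything after the last underscore, so multi-digit suffixes are handled.

```python
from itertools import groupby

def suffix(name):
    """Return the numeric part after the last underscore, e.g. 'aa_12' -> 12."""
    return int(name.rsplit("_", 1)[1])

tables = "aa_1 aa_3 aa_2 bb_2 bb_1 bb_3".split()

# groupby only merges adjacent items, so sort by the same key first.
ordered = sorted(tables, key=suffix)
groups = [list(members) for _, members in groupby(ordered, key=suffix)]

for group in groups:
    print(group)
# ['aa_1', 'bb_1']
# ['aa_2', 'bb_2']
# ['aa_3', 'bb_3']
```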

Select duplicated lists from a list of lists (Python 2.7.13)

I have two lists, one of which is a list of lists. Both have the same number of entries (so the second list holds half as many values as the first), like this:
list1=[['47', '43'], ['299', '295'], ['47', '43'], etc.]
list2=[9.649, 9.612, 9.42, etc.]
I want to detect repeated pairs of values in the first list (and delete the repeats), and sum the values at the corresponding indexes in the second list, creating an output like this:
list1=[['47', '43'], ['299', '295'], etc.]
list2=[19.069, 9.612, etc.]
The main problem is that the order of the values is important, and I'm really stuck.

You could create a collections.defaultdict to sum the values together, using the sublists as keys (converted to tuples so that they are hashable):
list1=[['47', '43'], ['299', '295'], ['47', '43']]
list2=[9.649, 9.612, 9.42]
import collections
c = collections.defaultdict(float)
for l,v in zip(list1,list2):
    c[tuple(l)] += v
print(c)
An alternative using collections.Counter, which accumulates the same sums because missing keys start at zero:
c = collections.Counter()
for k,v in zip(list1,list2): c[tuple(k)] += v
At this point, we have the related data:
defaultdict(<class 'float'>, {('299', '295'): 9.612, ('47', '43'): 19.069})
Now, if needed (not sure, since the dictionary holds the data very well), we can rebuild the lists. The two lists stay aligned with each other, but not in their original order; that shouldn't be a problem since they're still linked:
list1=[]
list2=[]
for k,v in c.items():
    list1.append(list(k))
    list2.append(v)
print(list1,list2)
result:
[['299', '295'], ['47', '43']]
[9.612, 19.069]
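Since the question says the order matters, here is a minimal sketch that keeps each pair at the position of its first occurrence. It assumes Python 3.7 or later, where plain dicts preserve insertion order; on Python 2.7, as in the question, collections.OrderedDict would be needed instead. The rounding to three decimals is an assumption made only to hide floating-point noise in the printed sums.

```python
list1 = [['47', '43'], ['299', '295'], ['47', '43']]
list2 = [9.649, 9.612, 9.42]

# Plain dicts keep insertion order on Python 3.7+, so the first occurrence
# of each pair fixes its position in the output.
totals = {}
for pair, value in zip(list1, list2):
    key = tuple(pair)  # lists are not hashable, tuples are
    totals[key] = totals.get(key, 0.0) + value

unique_pairs = [list(key) for key in totals]
summed_values = [round(total, 3) for total in totals.values()]

print(unique_pairs)   # [['47', '43'], ['299', '295']]
print(summed_values)  # [19.069, 9.612]
```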

Applying regexp and finding the highest number in a list

I have got a list of different names. I have a script that prints out the names from the list.
req=urllib2.Request('http://some.api.com/')
req.add_header('AUTHORIZATION', 'Token token=hash')
response = urllib2.urlopen(req).read()
json_content = json.loads(response)
for name in json_content:
    print name['name']
Output:
Thomas001
Thomas002
Alice001
Ben001
Thomas120
I need to find the max number that comes with the name Thomas. Is there a simple way to apply a regexp to all the elements that contain "Thomas" and then apply max(list) to them? The only way I have come up with is to go through each element in the list, match a regexp for Thomas, then strip the letters and put the remaining numbers into a new list, but this seems pretty bulky.

You don't need regular expressions, and you don't need sorting. As you said, max() is fine. To be safe in case the list contains names like "Thomasson123", you can use:
names = ((x['name'][:6], x['name'][6:]) for x in json_content)
max(int(b) for a, b in names if a == 'Thomas' and b.isdigit())
The first assignment creates a generator expression, so there will be only one pass over the sequence to find the maximum.

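You don't need to go for regex. Just store the names in a list and then apply the sorted function to it. This works here because the numbers are zero-padded to the same width, so string order matches numeric order.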
>>> l = ['Thomas001',
'Thomas002',
'Alice001',
'Ben001',
'Thomas120']
>>> [i for i in sorted(l) if i.startswith('Thomas')][-1]
'Thomas120'
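If a regular expression is still preferred, the following is a minimal sketch in Python 3, not taken from either answer. The names list is hard-coded in place of the API response, and "Thomasson123" is added as a hypothetical entry to show that look-alike names are skipped.

```python
import re

# Hard-coded stand-in for the names extracted from json_content.
names = ["Thomas001", "Thomas002", "Alice001", "Ben001", "Thomas120", "Thomasson123"]

# fullmatch requires the whole name to be "Thomas" followed only by digits,
# so "Thomasson123" does not match.
pattern = re.compile(r"Thomas(\d+)")
numbers = [int(m.group(1)) for m in map(pattern.fullmatch, names) if m]

highest = max(numbers)
print(highest)  # 120
```

Converting the captured digits to int means the comparison is numeric, so it does not depend on the numbers being zero-padded to the same width.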

How to read each element within a tuple from a list

I want to write a program that reads in a list of tuples, where each tuple contains two elements. The first element can be an Object, and the second element is the quantity of that Object, like this: Mylist([{Object1,Numbers},{Object2, Numbers}]).
Then I want to read the Numbers, repeat the related Object that many times, and store the results in a list.
So for Mylist([{lol, 3},{lmao, 2}]), I should get [lol, lol, lol, lmao, lmao] as the final result.
My thought is to first unzip those tuples (imagine there are more than 2) into two tuples, the first containing the Objects and the second containing the quantities.
After that, read the numbers in the second tuple and print the related Object from the first tuple that many times. But I don't know how to do this. Thanks for any help!

A list comprehension can do that:
lists:flatten([lists:duplicate(N,A) || {A, N} <- L]).
If you really want printing too, use recursion:
p([]) -> [];
p([{A,N}|T]) ->
FmtString = string:join(lists:duplicate(N,"~p"), " ")++"\n",
D = lists:duplicate(N,A),
io:format(FmtString, D),
D++p(T).
This code creates a format string for io:format/2 using lists:duplicate/2 to replicate the "~p" format specifier N times, joins them with a space with string:join/2, and adds a newline. It then uses lists:duplicate/2 again to get a list of N copies of A, prints those N items using the format string, and then combines the list with the result of a recursive call to create the function result.
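For readers more familiar with Python than Erlang, the list comprehension above corresponds roughly to the following sketch. The function name expand and the use of strings in place of atoms are assumptions made for illustration.

```python
def expand(pairs):
    """Repeat each object `count` times, preserving the order of the pairs."""
    return [obj for obj, count in pairs for _ in range(count)]

result = expand([("lol", 3), ("lmao", 2)])
print(result)  # ['lol', 'lol', 'lol', 'lmao', 'lmao']
```

The nested comprehension produces a flat list directly, so no separate flatten step is needed.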
