Mapper output.collect()? - mapreduce

The mapper invokes this twice:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Cr-v"), new Text("Honda"));
Does reduce() also get called twice?

I'll assume you're talking about OutputCollector.collect(K, V).
reduce() gets called once for each [key, (list of values)] pair. To explain, let's say you called:
out.collect(new Text("Car"), new Text("Subaru"));
out.collect(new Text("Car"), new Text("Honda"));
out.collect(new Text("Car"), new Text("Ford"));
out.collect(new Text("Truck"), new Text("Dodge"));
out.collect(new Text("Truck"), new Text("Chevy"));
Then reduce() would be called twice, with the pairs:
reduce(Car, <Subaru, Honda, Ford>)
reduce(Truck, <Dodge, Chevy>)
So in your example, yes, reduce() would be called twice: once for the key "Car" and once for "Cr-v". I hope that helps.
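To make that concrete, here is a minimal reducer sketch using the old org.apache.hadoop.mapred API (the one OutputCollector belongs to); the class name MakeReducer and the choice to concatenate the values are illustrative, not from the original question.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Invoked once per distinct key emitted by the mapper, e.g.
// reduce("Car", [Subaru, Honda, Ford]) and reduce("Truck", [Dodge, Chevy]).
public class MakeReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringBuilder makes = new StringBuilder();
        while (values.hasNext()) {
            if (makes.length() > 0) {
                makes.append(", ");
            }
            makes.append(values.next().toString());
        }
        // One output record per key, e.g. ("Car", "Subaru, Honda, Ford")
        output.collect(key, new Text(makes.toString()));
    }
}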

Complete a Groovy list of parameters

I have a function with an optional list of parameters:
def cleanFolders(String... folders) {
    other_function_called(param1, param2, param3)
}
This function calls another function which takes exactly three parameters.
So I want to use the parameter list folders to call this other function:
If there is only one element in the folders list, I call
other_function_called(folders[0], "none", "none")
If there are two elements:
other_function_called(folders[0], folders[1], "none")
And for three elements:
other_function_called(folders[0], folders[1], folders[2])
How can I do this properly (without a chain of ugly if/else statements)?
Thanks
As Jeff writes, you can use * to unpack the varargs array.
However, this will give you a MissingMethodException if the number of arguments does not match.
For that case you can create a new list that starts with the available values and is filled up with the remaining default values, so that when unpacked it supplies exactly the right number of arguments:
def spreadArgs = folders + ["none"] * (3 - folders.size())
other_function_called(*spreadArgs)
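For completeness, a small self-contained sketch of how this could look inside cleanFolders; other_function_called is stubbed here purely to show which arguments arrive, and the padding assumes no more than three folders are passed.
// Hypothetical stub, only used to show what the real function would receive
def other_function_called(a, b, c) {
    println "called with: $a, $b, $c"
}

def cleanFolders(String... folders) {
    // pad with "none" up to three values, then spread them into the call
    def spreadArgs = folders + ["none"] * (3 - folders.size())
    other_function_called(*spreadArgs)
}

cleanFolders("logs")                  // called with: logs, none, none
cleanFolders("logs", "tmp")           // called with: logs, tmp, none
cleanFolders("logs", "tmp", "cache")  // called with: logs, tmp, cache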

What does ":_*" mean in scala? (When using a List to filter a dataframe) [duplicate]

When reading some of my co-workers' Scala/Spark code, I sometimes see them use lists to filter DataFrames, as in this example:
val myList: List[String] = List("0661", "0239", "0949", "0380", "0279", "0311")
df.filter(col("col1").isin(myList: _*))
The code above works perfectly; this one, however, does not:
df.filter(col("col1").isin(myList))
What I don't understand is: what exactly is that "colon underscore star" :_* doing?
Thanks in advance!
It means "pass the list as separate parameters". It works for methods that have a vararg parameter, i.e. "any number of strings", but not for a List[String] parameter.
Spark's isin function has the signature isin(list: Any*): Column; Any* means "any number of arguments of type Any". Not very descriptive, but it lets you pass either any number of strings or any number of columns.
With the :_* syntax you're telling the compiler "expand my list into varargs"; it's equivalent to writing .isin("0661", "0239", ...).
Also, since Spark 2.4.0 there is the function isInCollection, which takes an Iterable, so you can pass a List there directly.
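To make the two options concrete, here is a short sketch; it assumes a DataFrame df with a column col1, as in the question, and Spark 2.4.0 or later for the isInCollection variant.
import org.apache.spark.sql.functions.col

val myList: List[String] = List("0661", "0239", "0949", "0380", "0279", "0311")

// Splat syntax: the list is expanded into individual varargs arguments,
// equivalent to .isin("0661", "0239", ...)
val viaVarargs = df.filter(col("col1").isin(myList: _*))

// Spark 2.4.0+: isInCollection accepts the collection directly, no splat needed
val viaCollection = df.filter(col("col1").isInCollection(myList))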
This is sometimes called the splat operator. It is used to adapt a sequence (Array, List, Seq, Vector, etc.) so it can be passed as the argument for a varargs method parameter:
def printAll(strings: String*): Unit = {
  strings.foreach(println)
}

val fruits = List("apple", "banana", "cherry")
printAll(fruits: _*)
If a method has a repeated parameter and you want to pass an Iterable as that repeated parameter, you use :_* to convert the Iterable into repeated arguments:
def x(y: Int*): Seq[Int] = { // y: Int* is a repeated parameter
  y
}

x(List(1, 2, 3, 4): _*) // the List is passed into the repeated parameter

Calling a function with arguments in lists

I have a function that takes 3 arguments:
(defn times-changed-answer [rid qid csv-file] ...some code)
It counts, for the user with record-id rid, how many times they changed their answer to question-code qid. The data is in csv-file.
It works and I have tested it for multiple users.
Now I want to call this function for all users and for all questions.
I have a list of rids and a list of qids:
(def rid-list '(1 2 4 5 10))
(def qid-list '(166 167 168 169 180 141))
How can I call this function for all users and all questions?
The lists are of different lengths and the third argument (the file) is always the same.
I'd use a for list comprehension; it depends on what result you expect here. In this example a [rid qid result] triple is returned for every combination:
(for [rid rid-list
      qid qid-list]
  [rid qid (times-changed-answer rid qid csv-file)])
If you want to have this in a map, you could e.g. reduce over that, as sketched below.
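For instance, a minimal sketch (using the names from the question) that reduces those triples into a nested map of the shape {rid {qid result}}:
;; Collect the results into a nested map: {rid {qid times-changed}}
(reduce (fn [acc [rid qid result]]
          (assoc-in acc [rid qid] result))
        {}
        (for [rid rid-list
              qid qid-list]
          [rid qid (times-changed-answer rid qid csv-file)]))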

In EDN, how can I pass multiple values, returned from other tagged elements, to a tagged element?

I have the following EDN:
{:test #xyz/getXyz #abc/getAbc #fgh/getFgh "sampleString"}
In Clojure, I have defined an implementation for each of the tagged elements, which internally calls Java functions. I have a requirement in which I need to pass the return values of both #abc/getAbc and #fgh/getFgh to #xyz/getXyz as separate parameters.
In my current implementation, #fgh/getFgh gets called with "sampleString"; then #abc/getAbc gets called with the output of #fgh/getFgh; and then #xyz/getXyz gets called with that output.
My requirement is that #xyz/getXyz should be called with the return values of both #abc/getAbc and #fgh/getFgh as individual parameters.
Clojure implementation:
(defn getXyz [sampleString]
  (.getXyz xyzBuilder sampleString))

(defn getAbc [sampleString]
  (.getAbc abcBuilder sampleString))

(defn getFgh [sampleString]
  (.getFgh fghBuilder sampleString))

(defn custom-readers []
  {'xyz/getXyz getXyz
   'abc/getAbc getAbc
   'fgh/getFgh getFgh})
I want to modify getXyz to:
(defn getXyz [abcReturnValue fghReturnValue]
  (.getXyz xyzBuilder abcReturnValue fghReturnValue))
You can't do exactly what you are asking: a tag can only process the single form that follows it. That said, you can alter the syntax of your Xyz EDN so that it takes a vector holding the [Abc Fgh] pair:
{:test #xyz/getXyz [#abc/getAbc "sampleString" #fgh/getFgh "sampleString"]}
I'm not sure if you meant that getAbc still needed to take getFgh as input or not. If so, it would be more like:
{:test #xyz/getXyz [#abc/getAbc #fgh/getFgh "sampleString" #fgh/getFgh "sampleString"]}
Now your getXyz tagged reader will receive a vector of Abc and Fgh, so you'll need to change your code to grab the elements from inside the vector, something like this:
(defn getXyz [[abcReturnValue fghReturnValue]]
  (.getXyz xyzBuilder abcReturnValue fghReturnValue))
This uses destructuring syntax (notice that the argument is wrapped in an extra pair of brackets), but you could use first and second instead if you wanted, or any other approach.
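For instance, an equivalent version without destructuring, using first and second (the parameter name abc-and-fgh is just illustrative):
(defn getXyz [abc-and-fgh]
  (.getXyz xyzBuilder (first abc-and-fgh) (second abc-and-fgh)))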

Compare two CGPDFDictionaries

Is there a way to compare two CGPDFDictionaries?
There is the memcmp function, but it doesn't work for me, because the dictionaries live at different memory locations.
Can you use the CGPDFDictionaryApplyFunction function?
It seems that if you supply a callback function, it will be called for every key-value pair:
void CGPDFDictionaryApplyFunction(
    CGPDFDictionaryRef dict,
    CGPDFDictionaryApplierFunction function,
    void *info
);
So you can pass your second dictionary (say dict2) as info. In your CGPDFDictionaryApplierFunction you can then check whether the key currently being enumerated is also present in dict2.
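A rough sketch of how that could look in C, assuming the goal is simply to check that every key of the first dictionary also exists in the second; a full comparison would also compare the values (e.g. by CGPDFObjectGetType) and run the check in both directions. The callback name compareEntry is made up for this example.
#include <stdio.h>
#include <CoreGraphics/CoreGraphics.h>

// Called once per key-value pair of the dictionary being enumerated;
// 'info' carries the second dictionary we are comparing against.
static void compareEntry(const char *key, CGPDFObjectRef value, void *info) {
    CGPDFDictionaryRef dict2 = (CGPDFDictionaryRef)info;
    CGPDFObjectRef other = NULL;
    if (!CGPDFDictionaryGetObject(dict2, key, &other)) {
        printf("Key '%s' is missing from the second dictionary\n", key);
        return;
    }
    // TODO: compare 'value' against 'other', e.g. by CGPDFObjectGetType
    // and the matching CGPDFObjectGetValue / CGPDFDictionaryGet* accessors.
}

// Usage (dict1 and dict2 are the two CGPDFDictionaryRef values to compare):
// CGPDFDictionaryApplyFunction(dict1, compareEntry, (void *)dict2);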