Spark FlatMap function for huge lists - mapreduce

I have a very basic question. Spark's flatMap function allows you to emit 0, 1 or more outputs per input. So the (lambda) function you feed to flatMap should return a list.
My question is: what happens if this list is too large for your memory to handle?
I haven't implemented this yet; the question should be resolved before I rewrite my MapReduce software, which could easily deal with this by calling context.write() anywhere in my algorithm I wanted to. (The output of a single mapper could easily be many gigabytes.)
In case you're interested: a mapper does some sort of word count, but in fact it generates all possible substrings, together with a wide range of regular expressions matching the text (a bioinformatics use case).

So the (lambda) function you feed to flatMap should return a list.
No, it doesn't have to return a list. In practice you can easily use a lazy sequence. This is probably easier to see when you take a look at the Scala RDD.flatMap signature:
flatMap[U](f: (T) ⇒ TraversableOnce[U])
Since subclasses of TraversableOnce include SeqView or Stream you can use a lazy sequence instead of a List. For example:
val rdd = sc.parallelize("foo" :: "bar" :: Nil)
rdd.flatMap { x =>
  (1 to 1000000000).view.map { _ => (x, scala.util.Random.nextLong) }
}
Since you've mentioned a lambda function, I assume you're using PySpark. The simplest thing you can do is to return a generator instead of a list:
import numpy as np
rdd = sc.parallelize(["foo", "bar"])
# xrange is Python 2; on Python 3, range is already lazy and can be used instead
rdd.flatMap(lambda x: ((x, np.random.randint(1000)) for _ in xrange(100000000)))
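For a mapper that used to call context.write() at many points in the algorithm, a generator function with yield in those same places is probably the closest equivalent. Here is a minimal sketch in plain Python with no Spark dependency; all_substrings is a hypothetical name, and the final comment assumes the rdd from the example above.

```python
from itertools import islice

def all_substrings(text):
    """Yield (substring, 1) pairs one at a time instead of building a list."""
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # Each yield plays the role of context.write() in a Hadoop mapper.
            yield (text[start:end], 1)

# Only the pairs that are actually consumed get materialized.
first_three = list(islice(all_substrings("ACGT"), 3))
print(first_three)  # [('A', 1), ('AC', 1), ('ACG', 1)]

# With Spark this would presumably be passed as: rdd.flatMap(all_substrings)
```

Because the function body is ordinary code, the yield statements can sit as deep inside loops or branches as the algorithm needs.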
Since RDDs are lazily evaluated it is even possible to return an infinite sequence from the flatMap. Using a little bit of toolz power:
from toolz.itertoolz import iterate
def inc(x):
    return x + 1
rdd.flatMap(lambda x: ((i, x) for i in iterate(inc, 0))).take(1)
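To see why taking one element from an infinite flatMap can terminate, it may help to model the behaviour with plain iterators. This is only a rough local model under the assumption that elements are pulled on demand; it is not how Spark is implemented, and flat_map and take are hypothetical helper names.

```python
from itertools import chain, count, islice

def flat_map(func, items):
    """Lazily concatenate the iterables that func returns for each item."""
    return chain.from_iterable(map(func, items))

def take(n, iterable):
    """Return the first n elements as a list, consuming no more than needed."""
    return list(islice(iterable, n))

# Each input expands to an infinite sequence of (index, input) pairs.
pairs = flat_map(lambda x: ((i, x) for i in count()), ["foo", "bar"])
print(take(3, pairs))  # [(0, 'foo'), (1, 'foo'), (2, 'foo')]
```

Nothing is computed until take asks for elements, so the infinite tail is never touched.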

Related

Iterate over an existing map against a list of tuples scala

I have a list of tuples that I must change the values for in a map that contains those tuples. So if I have a list such as List((0,2), (0,3)) with a map that looks like this: Map((0,2) => List(1,2,3), (0,3) => List(1,2)), I need to access the matching map tuples with the tuples listed in the list, then remove a number from the mapping.
So in the example above, if I wanted to remove 2 from the mapping, I would get Map((0,2) => List(1,3), (0,3) => List(1)).
Design wise, I was thinking of pattern matching the map, but I've read some answers that said that may not be the best way. The tough part for me is that it has to be immutable, so I was thinking of pattern matching the list, getting the map value, change the value, then recreate the map and recursively call the function again. What do you think of this implementation?
This could be a way to remove 2 from your Map:
val newMap = oldMap.mapValues(list => list.filter(_ != 2))
Or more generally:
def filterInMap(element: Int, oldMap: Map[(Int,Int),List[Int]]) =
  oldMap.mapValues(list => list.filter(_ != element))
This way there's no need to mutate anything at all. mapValues transforms just the values of your Map and leaves the original untouched (in Scala 2.12 and earlier it returns a lazy view over the original Map rather than a strict copy). filter then gets the job done by keeping only the elements that don't match the element we would like to remove.
Bonus: even more generally:
def filterInMap[A](element: A, oldMap: Map[(A,A),List[A]]) =
  oldMap.mapValues(list => list.filter(_ != element))
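Note that the one-liners above filter every entry, while the question asks to touch only the entries whose keys appear in a given list. Here is a hedged sketch of that variant, written in Python for comparison; remove_from_entries and the extra (1, 1) entry are made up for illustration.

```python
def remove_from_entries(element, keys, old_map):
    """Return a new dict with element removed from the lists stored under keys."""
    targets = set(keys)
    return {
        # Entries outside targets keep their original list object unchanged.
        key: [v for v in values if v != element] if key in targets else values
        for key, values in old_map.items()
    }

old_map = {(0, 2): [1, 2, 3], (0, 3): [1, 2], (1, 1): [2, 2]}
new_map = remove_from_entries(2, [(0, 2), (0, 3)], old_map)
print(new_map)  # {(0, 2): [1, 3], (0, 3): [1], (1, 1): [2, 2]}
```

The original dict is never modified, so no recursion or explicit rebuilding step is needed.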

Convert a Java List of Lists to Scala without O(n) iteration?

Answers to this question do a good job of explaining how to use Scala's Java Converters to change a Java List into a Scala List. Unfortunately, I need to convert a List of Lists from Java to Scala types, and that solution doesn't work:
// pseudocode
java.util.List[java.util.List[String]].asScala
-> scala.collection.immutable.List[java.util.List[String]]
Is there a way to do this conversion without an O(N) iteration over the Java object?
You need to convert the nested lists as well, but that would require the up front O(n):
import scala.collection.JavaConverters._
val javaListOfLists = List(List("a", "b", "c").asJava, List("d", "e", "f").asJava).asJava
val scalaListOfLists = javaListOfLists.asScala.toList.map(_.asScala.toList)
Alternatively, you could convert the outer list into a Stream[List[T]], which would only apply the conversion cost as you accessed each item:
val scalaStreamOfLists = javaListOfLists.asScala.toStream.map(_.asScala.toList)
If you don't want to pay the conversion cost at all, you could write a wrapper around java.util.List which would give you a Scala collection interface. A rough shot at that would be:
def wrap[T](javaIterator: java.util.Iterator[T]): Stream[T] = {
  if (javaIterator.hasNext)
    javaIterator.next #:: wrap(javaIterator)
  else
    Stream.empty
}
val outerWrap = wrap(javaListOfLists.iterator).map(inner => wrap(inner.iterator()))
Alternatively, you can use the scalaj-collection library, which I wrote specifically for this purpose:
import com.daodecode.scalaj.collection._
val listOfLists: java.util.List[java.util.List[String]] = ...
val s: mutable.Seq[mutable.Seq[String]] = listOfLists.deepAsScala
That's it. It will convert all nested Java collections and primitive types to their Scala versions. You can also convert directly to immutable data structures using deepAsScalaImmutable (with some copying overhead, of course).
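The "convert only on access" idea behind the wrapper approach is language-independent. As a rough illustration in Python (not the scalaj-collection API; MappedView and to_tuple are hypothetical names), a view can be built in constant time and convert each inner list only when it is read:

```python
from collections.abc import Sequence

class MappedView(Sequence):
    """Read-only view that applies convert to an element only when it is read."""

    def __init__(self, underlying, convert):
        self._underlying = underlying
        self._convert = convert

    def __len__(self):
        return len(self._underlying)

    def __getitem__(self, index):
        # Conversion happens here, per access; slices are not handled in this sketch.
        return self._convert(self._underlying[index])

calls = []

def to_tuple(inner):
    calls.append(inner)  # record which inner lists were actually converted
    return tuple(inner)

nested = [["a", "b", "c"], ["d", "e", "f"]]
view = MappedView(nested, to_tuple)  # O(1): nothing is converted yet
print(view[1])     # ('d', 'e', 'f')
print(len(calls))  # 1
```

The trade-off is that repeated reads of the same index repeat the conversion, so a view like this only pays off when most elements are read rarely or never.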

Asking about ML recursive function

I have been working with ML functions and ran into something annoying.
I will explain it with some simple code.
For example, suppose there is a list of type (int*int) list, and I want to find the tuples that have 3 as their first element.
L = [(1,2),(2,3),(3,5),(3,4)]
So from this list, I want to get 5 and 4.
However, in ML the function is recursive, so suppose I write code like this:
fun a(list) =
  if #1(hd(list)) = 3 then #2(hd(list))
  else a(tl(list))
This simple function gets 5 but not 4, because once it detects that (3,5) satisfies the condition, it returns 5 and the function finishes.
Is there any way to get the 4 as well?
I don't know ML, but basically, instead of stopping at the first match, you need to do this:
fun a (list : (int * int) list) =
  if list = nil then nil
  else if #1(hd(list)) = 3
  then #2(hd(list)) :: a(tl(list))
  else a(tl(list))
(I am gradually editing this response as I learn more about ML :)
You forgot to call the function recursively on the tail of the list where the condition held.
In ML, you almost never use hd and tl but use pattern matching instead. And you can pattern-match on tuples for more readability:
fun filter [] = []
  | filter ((x, y)::xys) = if x = 3
                           then y::(filter xys)
                           else filter xys
Higher-order functions from the List structure are another option, in case you would like to use them.
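For readers more at home in Python, here is a hedged sketch of the same two ideas: a recursive version that keeps going after a match, and the higher-order (comprehension) version. seconds_where_first_is is a hypothetical name.

```python
def seconds_where_first_is(target, pairs):
    """Recursive version mirroring the pattern-matching ML function."""
    if not pairs:
        return []
    (x, y), rest = pairs[0], pairs[1:]
    # Recurse on the tail in both branches, so later matches are not lost.
    tail = seconds_where_first_is(target, rest)
    return [y] + tail if x == target else tail

pairs = [(1, 2), (2, 3), (3, 5), (3, 4)]
print(seconds_where_first_is(3, pairs))   # [5, 4]
print([y for x, y in pairs if x == 3])    # [5, 4]
```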

Haskell - Convert x number of tuples into a list [duplicate]

I have a question about tuples and lists in Haskell. I know how to add input into a tuple a specific number of times. Now I want to add tuples into a list an unknown number of times; it's up to the user to decide how many tuples they want to add.
How do I add tuples into a list x number of times when I don't know X beforehand?
There's a lot of things you could possibly mean. For example, if you want a few copies of a single value, you can use replicate, defined in the Prelude:
replicate :: Int -> a -> [a]
replicate 0 x = []
replicate n x | n < 0     = undefined
              | otherwise = x : replicate (n-1) x
In ghci:
Prelude> replicate 4 ("Haskell", 2)
[("Haskell",2),("Haskell",2),("Haskell",2),("Haskell",2)]
Alternately, perhaps you actually want to do some IO to determine the list. Then a simple loop will do:
getListFromUser = do
    putStrLn "keep going?"
    s <- getLine
    case s of
        'y':_ -> do
            putStrLn "enter a value"
            v <- readLn
            vs <- getListFromUser
            return (v:vs)
        _ -> return []
In ghci:
*Main> getListFromUser :: IO [(String, Int)]
keep going?
y
enter a value
("Haskell",2)
keep going?
y
enter a value
("Prolog",4)
keep going?
n
[("Haskell",2),("Prolog",4)]
Of course, this is a particularly crappy user interface -- I'm sure you can come up with a dozen ways to improve it! But the pattern, at least, should shine through: you can use values like [] and functions like : to construct lists. There are many, many other higher-level functions for constructing and manipulating lists, as well.
P.S. There's nothing particularly special about lists of tuples (as compared to lists of other things); the above functions display that by never mentioning them. =)
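The same prompt-and-collect pattern can be sketched in Python for comparison. This version uses a loop rather than recursion, and takes the line reader as a parameter so it can run against scripted input; read_values and its read_line parameter are hypothetical names.

```python
import ast

def read_values(read_line=input):
    """Collect values until the answer to 'keep going?' does not start with 'y'."""
    values = []
    while True:
        print("keep going?")
        if not read_line().startswith("y"):
            return values
        print("enter a value")
        # literal_eval parses a Python literal such as ("Haskell",2) safely.
        values.append(ast.literal_eval(read_line()))

# Scripted input instead of a real terminal, so the sketch runs unattended.
scripted = iter(["y", '("Haskell",2)', "y", '("Prolog",4)', "n"])
result = read_values(lambda: next(scripted))
print(result)  # [('Haskell', 2), ('Prolog', 4)]
```

As in the Haskell version, nothing in the function mentions tuples; the list simply grows by one value per round.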
Sorry, you can't [1]. There are fundamental differences between tuples and lists:
A tuple always has a finite number of elements, which is known at compile time. Tuples with different numbers of elements are actually different types.
Lists can have as many elements as they want. The number of elements in a list doesn't need to be known at compile time.
A tuple can have elements of arbitrary types. Since the way you can use tuples always ensures that there is no type mismatch, this is safe.
On the other hand, all elements of a list have to have the same type. Haskell is a statically-typed language; that basically means that all types are known at compile time.
For these reasons, you can't. If it's not known how many elements will fit into the tuple, you can't give it a type.
I guess that the input you get from your user is actually a string like "(1,2,3)". Try to turn this directly into a list, without making it a tuple first. You can use pattern matching for this, but here is a slightly sneaky approach: I just remove the opening and closing parentheses from the string and replace them with brackets, and voilà, it becomes a list.
tuplishToList :: String -> [Int]
tuplishToList str = read ('[' : tail (init str) ++ "]")
Edit
Sorry, I did not see your latest comment. What you are trying to do is not that difficult. I use these simple functions for the task:
words str splits str into a list of words that were separated by whitespace. The output is a list of Strings. Caution: This only works if the string inside your tuple contains no whitespace. Implementing a better solution is left as an exercise to the reader.
map f lst applies f to each element of lst
read is a magic function that makes a data type from a String. It only works if you know beforehand what the output is supposed to be. If you really want to understand how that works, consider implementing read for your specific use case.
And here you go:
tuplish2List :: String -> [(String,Int)]
tuplish2List str = map read (words str)
[1] As some others may point out, it may be possible using templates and other hacks, but I don't consider that a real solution.
When doing functional programming, it is often better to think about composition of operations instead of individual steps. So instead of thinking about it like adding tuples one at a time to a list, we can approach it by first dividing the input into a list of strings, and then converting each string into a tuple.
Assuming the tuples are written each on one line, we can split the input using lines, and then use read to parse each tuple. To make it work on the entire list, we use map.
main = do
    input <- getContents
    let tuples = map read (lines input) :: [(String, Integer)]
    print tuples
Let's try it.
$ runghc Tuples.hs
("Hello", 2)
("Haskell", 4)
Here, I press Ctrl+D (or Ctrl+Z on Windows) to send EOF to the program, and it prints the result.
[("Hello",2),("Haskell",4)]
If you want something more interactive, you will probably have to do your own recursion. See Daniel Wagner's answer for an example of that.
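The split-then-convert composition carries over to other languages as well. A minimal Python sketch, assuming one tuple literal per line as in the run above (parse_tuples is a hypothetical name):

```python
import ast

def parse_tuples(text):
    """Split the input into lines, then parse each non-empty line as a tuple literal."""
    return [ast.literal_eval(line) for line in text.splitlines() if line.strip()]

parsed = parse_tuples('("Hello", 2)\n("Haskell", 4)\n')
print(parsed)  # [('Hello', 2), ('Haskell', 4)]
```

Reading the whole of standard input first (for example with sys.stdin.read()) and passing it to parse_tuples would play the role of getContents.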
One simple solution to this would be to use a list comprehension, like so (done in GHCi):
Prelude> let fstMap tuplist = [fst x | x <- tuplist]
Prelude> fstMap [("String1",1),("String2",2),("String3",3)]
["String1","String2","String3"]
Prelude> :t fstMap
fstMap :: [(t, b)] -> [t]
This will work for an arbitrary number of tuples - as many as the user wants to use.
To use this in your code, you would just write:
fstMap :: Eq a => [(a,b)] -> [a]
fstMap tuplist = [fst x | x <- tuplist]
The example I gave is just one possible solution. As the name implies, of course, you can just write:
fstMap' :: Eq a => [(a,b)] -> [a]
fstMap' = map fst
This is an even simpler solution.
I'm guessing that, since this is for a class, and you've been studying Haskell for < 1 week, you don't actually need to do any input/output. That's probably a bit more advanced than where you are right now. So:
As others have said, map fst will take a list of tuples, of arbitrary length, and return the first elements. You say you know how to do that. Fine.
But how do the tuples get into the list in the first place? Well, if you have a list of tuples and want to add another, (:) does the trick. Like so:
oldList = [("first", 1), ("second", 2)]
newList = ("third", 2) : oldList
You can do that as many times as you like. And if you don't have a list of tuples yet, your list is [].
Does that do everything that you need? If not, what specifically is it missing?
Edit: With the corrected type:
Eq a => [(a, b)]
That's not the type of a function. It's the type of a list of tuples. Just have the user type yourFunctionName followed by [ ("String1", val1), ("String2", val2), ... ("LastString", lastVal)] at the prompt.
