Scala: Remove duplicates in list of objects

I've got a list of objects, List[Object], which are all instantiated from the same class. This class has a field, Object.property, which must be unique. What is the cleanest way to iterate over the list and remove all objects (but the first) with the same property?

list.groupBy(_.property).map(_._2.head)
Explanation: The groupBy method accepts a function that converts an element to a key for grouping. _.property is just shorthand for elem: Object => elem.property (the compiler generates a unique name, something like x$1). So now we have a map Map[Property, List[Object]]. A Map[K,V] extends Traversable[(K,V)]. So it can be traversed like a list, but elements are a tuple. This is similar to Java's Map#entrySet(). The map method creates a new collection by iterating each element and applying a function to it. In this case the function is _._2.head which is shorthand for elem: (Property, List[Object]) => elem._2.head. _2 is just a method of Tuple that returns the second element. The second element is List[Object] and head returns the first element
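The two steps described above can be sketched separately. This is only an illustration: Obj and its fields are hypothetical stand-ins for the question's class, and the final toList is added just to get a concrete List.

```scala
object GroupByDemo {
  // Hypothetical stand-in for the question's class; `property` is the key.
  final case class Obj(property: Int, label: String)

  val list: List[Obj] = List(Obj(1, "a"), Obj(2, "b"), Obj(1, "c"), Obj(3, "d"))

  // Step 1: group by key. Within a group, elements keep their list order.
  val grouped: Map[Int, List[Obj]] = list.groupBy(_.property)

  // Step 2: keep the first element of every group.
  // The order in which the groups come out of the Map is not guaranteed.
  val firsts: List[Obj] = grouped.map(_._2.head).toList
}
```

Because the intermediate Map has no defined iteration order, only the set of surviving elements is predictable here, not their order.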
To get the result to be a type you want:
import collection.breakOut
val l2: List[Object] = list.groupBy(_.property).map(_._2.head)(breakOut)
To explain briefly, map actually expects two arguments: a function and an object that is used to construct the result. In the first code snippet you don't see the second argument because it is marked as implicit and so is provided by the compiler from a list of predefined values in scope. The result type is usually derived from the mapped container, which is usually a good thing: map on List will return List, map on Array will return Array, etc. In this case, however, we want to state the container we want as the result. This is where the breakOut method is used (it is available up to Scala 2.12 and was removed in 2.13). It constructs a builder (the thing that builds results) by looking only at the desired result type. It is a generic method, and the compiler infers its type parameters because we explicitly typed l2 as List[Object].
Or, to preserve order (assuming Object#property is of type Property):
list.foldRight((List[Object](), Set[Property]())) {
case (o, cum @ (objects, props)) =>
if (props(o.property)) cum else (o :: objects, props + o.property)
}._1
foldRight is a method that accepts an initial result and a function that takes an element and the current result and returns an updated result. It visits each element, applies the function, and returns the final result. We go from right to left (rather than left to right with foldLeft) because we are prepending to objects, which is O(1), whereas appending is O(N). Note that because traversal starts from the right, the element kept for each property is its last occurrence in the list; to keep the first occurrence, fold from the left and reverse the result at the end. Also observe the styling here: we use a pattern match to extract the elements.
In this case, the initial result is a pair (tuple) of an empty list and an empty set. The list is the result we're interested in, and the set keeps track of the properties we have already encountered. In each iteration we check whether the set props already contains the property. (In Scala, obj(x) is translated to obj.apply(x). In Set, the method apply is def apply(a: A): Boolean; that is, it accepts an element and returns true or false depending on whether it is in the set.) If the property is already there, the result is returned as-is. Otherwise the result is updated to contain the object (o :: objects) and the property is recorded (props + o.property).
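A small runnable sketch of the fold may help. Obj is a hypothetical class used only for illustration. It shows the right-to-left fold next to a left-to-right variant, since the two keep different duplicates.

```scala
object FoldDemo {
  // Hypothetical stand-in for the question's class.
  final case class Obj(property: Int, label: String)

  val list: List[Obj] = List(Obj(1, "a"), Obj(1, "b"), Obj(2, "c"))

  // Right-to-left fold with prepending: list order is preserved, and the
  // element kept for each property is the first one met from the right.
  val viaFoldRight: List[Obj] =
    list.foldRight((List.empty[Obj], Set.empty[Int])) {
      case (o, acc @ (objects, props)) =>
        if (props(o.property)) acc else (o :: objects, props + o.property)
    }._1

  // Left-to-right fold: prepend, then reverse once at the end, so the
  // first occurrence of each property is the one that survives.
  val viaFoldLeft: List[Obj] =
    list.foldLeft((List.empty[Obj], Set.empty[Int])) {
      case (acc @ (objects, props), o) =>
        if (props(o.property)) acc else (o :: objects, props + o.property)
    }._1.reverse
}
```

With this input, viaFoldRight keeps Obj(1, "b") while viaFoldLeft keeps Obj(1, "a"); both keep the relative order of the survivors.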
Update: @andreypopp wanted a generic method:
import scala.collection.IterableLike
import scala.collection.generic.CanBuildFrom
class RichCollection[A, Repr](xs: IterableLike[A, Repr]){
def distinctBy[B, That](f: A => B)(implicit cbf: CanBuildFrom[Repr, A, That]) = {
val builder = cbf(xs.repr)
val i = xs.iterator
var set = Set[B]()
while (i.hasNext) {
val o = i.next
val b = f(o)
if (!set(b)) {
set += b
builder += o
}
}
builder.result
}
}
implicit def toRich[A, Repr](xs: IterableLike[A, Repr]) = new RichCollection(xs)
to use:
scala> list.distinctBy(_.property)
res7: List[Obj] = List(Obj(1), Obj(2), Obj(3))
Also note that this is pretty efficient as we are using a builder. If you have really large lists, you may want to use a mutable HashSet instead of a regular set and benchmark the performance.

Starting with Scala 2.13, most collections provide a distinctBy method, which returns the elements of the sequence with duplicates removed, where two elements count as duplicates if the given transforming function maps them to the same value:
list.distinctBy(_.property)
For instance:
List(("a", 2), ("b", 2), ("a", 5)).distinctBy(_._1) // List((a,2), (b,2))
List(("a", 2.7), ("b", 2.1), ("a", 5.4)).distinctBy(_._2.floor) // List((a,2.7), (a,5.4))

Here is a little bit sneaky but fast solution that preserves order:
list.filterNot{ var set = Set[Property]()
obj => val b = set(obj.property); set += obj.property; b}
Although it uses a var internally, I think it is easier to read and understand than the foldLeft solution.

A lot of good answers above. However, distinctBy is already in Scala, but in a not-so-obvious place. Perhaps you can use it like
def distinctBy[A, B](xs: List[A])(f: A => B): List[A] =
scala.reflect.internal.util.Collections.distinctBy(xs)(f)

Preserving order:
def distinctBy[L, E](list: List[L])(f: L => E): List[L] =
list.foldLeft((Vector.empty[L], Set.empty[E])) {
case ((acc, set), item) =>
val key = f(item)
if (set.contains(key)) (acc, set)
else (acc :+ item, set + key)
}._1.toList
distinctBy(list)(_.property)

One more solution
import scala.annotation.tailrec
@tailrec
def collectUnique(l: List[Object], s: Set[Property], u: List[Object]): List[Object] = l match {
case Nil => u.reverse
case (h :: t) =>
if (s(h.property)) collectUnique(t, s, u) else collectUnique(t, s + h.property, h :: u)
}

I found a way to make it work with groupBy, with one intermediary step:
import scala.collection.{TraversableLike, breakOut}
def distinctBy[T, P, From[X] <: TraversableLike[X, From[X]]](collection: From[T])(property: T => P): From[T] = {
val uniqueValues: Set[T] = collection.groupBy(property).map(_._2.head)(breakOut)
collection.filter(uniqueValues)
}
Use it like this:
scala> distinctBy(List(redVolvo, bluePrius, redLeon))(_.color)
res0: List[Car] = List(redVolvo, bluePrius)
Similar to IttayD's first solution, but it filters the original collection based on the set of unique values. If my expectations are correct, this does three traversals: one for groupBy, one for map and one for filter. It maintains the ordering of the original collection, but does not necessarily take the first value for each property. For example, it could have returned List(bluePrius, redLeon) instead.
Of course, IttayD's solution is still faster since it does only one traversal.
My solution also has the disadvantage that, if the collection has Cars that are actually the same, both will be in the output list. This could be fixed by removing filter and returning uniqueValues directly, with type From[T]. However, it seems like CanBuildFrom[Map[P, From[T]], T, From[T]] does not exist... suggestions are welcome!

With a collection and a function from a record to a key this yields a list of records distinct by key. It's not clear whether groupBy will preserve the order in the original collection. It may even depend on the type of collection. I'm guessing either head or last will consistently yield the earliest element.
collection.groupBy(keyFunction).values.map(_.head)
When will Scala get a nubBy? It's been in Haskell for decades.

If you want to remove duplicates and preserve the order of the list, you can try this two-liner:
val tmpUniqueList = scala.collection.mutable.Set[String]()
val myUniqueObjects = for(o <- myObjects if tmpUniqueList.add(o.property)) yield o

This is entirely a rip-off of @IttayD's answer, but unfortunately I don't have enough reputation to comment.
Rather than creating an implicit function to convert your iterable, you can simply create an implicit class:
import scala.collection.IterableLike
import scala.collection.generic.CanBuildFrom
implicit class RichCollection[A, Repr](xs: IterableLike[A, Repr]){
def distinctBy[B, That](f: A => B)(implicit cbf: CanBuildFrom[Repr, A, That]) = {
val builder = cbf(xs.repr)
val i = xs.iterator
var set = Set[B]()
while (i.hasNext) {
val o = i.next
val b = f(o)
if (!set(b)) {
set += b
builder += o
}
}
builder.result
}
}

OCaml return value in if statement nested in loop

I am trying to return a value if something occurs when iterating through a list. Is it possible to return a string if X happens when iterating through the list, otherwise return another string if it never happens?
let f elem =
if not String.contains str elem then "false" in
List.iter f alphlist;
"true";
This is not working in my implemented method sadly.
OCaml is a functional language, so you pretty much need to concentrate on the values returned by functions. There are ways to return different values in exceptional cases, but (IMHO) the best way to learn is to start just with ordinary old nested function calls.
List.iter always returns the same value: (), which is known as unit.
For this reason, the expression List.iter f alphlist will also always return () no matter what f does.
There is another kind of list-handling function that works by maintaining a value across all the calls and returning that value at the end. It's called a fold.
So, if you want to compute some value that's a kind of summary of what it saw in all of the string lists in alphlist, you should probably be using a fold, say List.fold_left.
Here is a function any_has_7 that determines whether any one of the specified lists contains the integer 7:
let any_has_7 lists =
let has_7 sofar list =
sofar || List.mem 7 list
in
List.fold_left has_7 false lists
Here's how it looks when you run it:
# any_has_7 [[1;2]; [3;4]];;
- : bool = false
# any_has_7 [[1;2]; [5;7]; [8;9]];;
- : bool = true
In other words, this function does something a lot like what you're asking for. It returns true when one or more of the lists contains a certain value, and false when none of them contains the value.
I hope this helps.

Scala - using a list of ints as an index to populate a new data structure

I have a list of Objects (Items, in this case) which have category ids and properties (which itself is a list of custom types).
I am trying to def a function that takes a list of integers e.g. List(101, 102, 102, 103, 104) that correspond to the category ids for the Items and creates a list of tuples that include the category type (which is an Option) and each property type from a list of properties that go along with each category. So far I have the below, but I am getting an error that value _2 is not a member of Product with Serializable.
def idxToData(index: List[Int], items: Seq[Item]): List[(Option[Category], Property[_])] = {
def getId(ic: Option[Category]): Int => {
ic match {
case Some(e) => e._id
case None => 0
}
}
index.flatMap(t => items.map(i => if(t == getId(i.category)){
(i.category, i.properties.list.map(_.property).toList.sortWith(_._id < _._id))
} else {
None
}.filter(_ != None )
))
.map(x => x._2.map(d => (x._1, d)))
.toList
}
I am not sure how it is assigning that type (I am assuming at that point that I should have a list of tuples that I am trying to map).
Overall, is there a better way in scala to achieve the desired result from taking in a list of indices and using that to access the specific items in a list where a tuple of two parts of each corresponding item would "replace" the index to create the new list structure?
You should split your code, give names to things (add some vals and some defs), and when the compiler does not agree with you, write types, so that the compiler will tell you early where it disagrees (don't worry, we all did that when starting with FP)
Also, when posting such a question, you might want to give (relevant parts of) the interface of elements that are referenced but not defined. What are "is" (is that items?), Item, category, properties...., or simplify your code so that they do not appear.
Now, to the problem :
if(t == (i.category match { case Some(e) => e._id})){
(i.category, i.properties.list.map(_.property).toList.sortWith(_._id < _._id))
} else {
None
}
The first branch has a Tuple2 type, (Option[Category], whatever), while the second branch is of the completely unrelated type None. Clearly, there is no common supertype better than AnyRef, so that is the type of the if expression. Then the type of is.map (supposing is is some sort of Seq) will be Seq[AnyRef]. filter does not change the type, so it is still Seq[AnyRef], and in the map(x =>...), x is an AnyRef too, not a Tuple2, so it has no _2.
Of course, the list actually contains only tuples, because originally it had tuples and Nones and you have removed the Nones. But that was lost to the compiler when it typed that AnyRef.
(as the compiler error message tells and as noted by Imm, the compiler finds a slightly more precise type than AnyRef, Product with Serializable; however, that will not do you any good, all of the useful typing information is still lost there).
To preserve the type, in general you should do something such as
if(....) {
Some(stuff)
} else {
None
}
That would have been typed Option[type of stuff], where type of stuff is your Pair.
However, there is something simpler: the collect method.
It is a bit like match, except that it takes a partial function and discards the elements for which the partial function is not defined.
So that would be
is.collect { case i if categoryId(i) == Some(t) =>
(i.category, i.properties....)
}
supposing you have defined
def categoryId(item: Item): Option[Int] = item.category.map(_._id)
When you do this:
is.map(i => if(t == getId(i.category)){
(i.category, i.properties.list.map(_.property).toList.sortWith(_._id < _._id))
} else {
None
})
you get a List[Product with Serializable] (what you should probably get is a type error, but that could be a long digression), because that's the only supertype of None and (Category, List[Property[_]]) or whatever that tuple type is. The compiler isn't smart enough to carry the union type through and figure out that when you filter(_ != None) anything left in the list must be the tuple.
Try to rephrase this part. E.g. you could do is.filter(i => t == getId(i.category)) first, before the map, and then you wouldn't need to mess around with Nones in your list.
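As a rough, self-contained sketch of the collect approach: the question does not show the real Item, Category and Property definitions, so the case classes below are simplified, hypothetical stand-ins (Property is not generic here, and properties is a plain List).

```scala
object CollectDemo {
  // Hypothetical, simplified model; the question's real types are not shown.
  final case class Category(_id: Int)
  final case class Property(_id: Int)
  final case class Item(category: Option[Category], properties: List[Property])

  def categoryId(item: Item): Option[Int] = item.category.map(_._id)

  def idxToData(index: List[Int], items: Seq[Item]): List[(Option[Category], Property)] =
    index.flatMap { t =>
      items
        // collect keeps only the matching items, so no None ever enters the list
        .collect { case i if categoryId(i) == Some(t) =>
          (i.category, i.properties.sortBy(_._id))
        }
        // expand each (category, properties) pair into one tuple per property
        .flatMap { case (cat, props) => props.map(p => (cat, p)) }
    }
}
```

Every intermediate value keeps a precise tuple type, so _1 and _2 (or a pattern match) remain available. Unlike the original getId, this sketch never matches an item without a category against index 0.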

Scala Iterating Tuple Map

I have a map of key-value pairs and I am trying to fetch the value for a given key. It does return a value, but the value comes wrapped in Some. Any idea how I can get rid of it?
My code is:
val jsonmap = simple.split(",\"").toList.map(x => x.split("\":")).map(x => Tuple2(x(0), x(1))).toMap
val text = jsonmap.get("text")
Where "text" is the key and I want the value that it is mapped to, I am currently getting the following output:
Some("4everevercom")
I tried using flatMap instead of Map but it doesn't work either
You are looking for the apply method. Seq, Set and Map implement different apply methods, the difference being the main distinction between each one.
As a syntactic sugar, .apply can be omitted in o.apply(x), so you'd just have to do this:
val text = jsonmap("text") // pass "key" to apply and receive "value"
For Seq, apply takes an index and returns a value, as you used in your own example:
x(0), x(1)
For Set, apply takes a value, and returns a Boolean.
You can use getOrElse, as @Brian stated, or the apply method and deal with absent values on your own:
val text: String = jsonmap("text")
// throws java.util.NoSuchElementException if the key is not present
Moreover, you can set a default value for the Map to fall back to if a key is not found:
val funkyMap = jsonmap.withDefaultValue("")
funkyMap("foo")
// if foo or any other key is not present in the map, "" will be returned
Use it if you want a default value for all missing keys.
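For comparison, here are the lookup styles side by side. The one-entry map is a made-up stand-in for the jsonmap parsed in the question.

```scala
object LookupDemo {
  // Stand-in for the parsed map from the question.
  val jsonmap: Map[String, String] = Map("text" -> "4everevercom")

  val viaGet: Option[String]  = jsonmap.get("text")          // wrapped in Some
  val viaApply: String        = jsonmap("text")              // bare value; throws if the key is missing
  val missing: Option[String] = jsonmap.get("foo")           // None
  val fallback: String        = jsonmap.getOrElse("foo", "") // default for this one call

  // withDefaultValue returns a new Map; jsonmap itself is unchanged.
  val funkyMap: Map[String, String] = jsonmap.withDefaultValue("")
  val viaDefault: String = funkyMap("foo")
}
```

Note that withDefaultValue only affects apply: funkyMap.get("foo") still returns None.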
You can use getOrElse which gets the value of Some if there is one or the default value that you specify. If the key exists in the map, you will get the value from Some. If the key does not exist, you will get the value specified.
scala> val map = Map(0 -> "foo")
map: scala.collection.immutable.Map[Int,String] = Map(0 -> foo)
scala> map.get(0)
res3: Option[String] = Some(foo)
scala> map.getOrElse(1, "bar")
res4: String = bar
Also, see this for info on Scala's Some and None Options: http://www.codecommit.com/blog/scala/the-option-pattern
Just offering additional options that might help you work with Option and other similar structures you'll find in the standard library.
You can also iterate (map) over the Option. This works because Option behaves like a collection containing either zero or one element.
myMap.get(key) map { value =>
// use value...
}
Equivalently, you can use a for if you feel it makes for clearer code:
for (value <- myMap.get(key)) {
// use value...
}
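A short sketch of working with the Option directly, using a made-up one-entry map; the helper describe is hypothetical and exists only to make the side-effecting for loop observable.

```scala
object OptionDemo {
  val myMap: Map[String, String] = Map("text" -> "4everevercom")

  // map transforms the value if the key is present and stays None otherwise.
  val present: Option[Int] = myMap.get("text").map(_.length)
  val absent: Option[Int]  = myMap.get("foo").map(_.length)

  // A for comprehension with yield desugars to the same map call.
  val viaFor: Option[Int] = for (value <- myMap.get("text")) yield value.length

  // A for without yield (foreach) runs its body only when a value exists.
  def describe(key: String): String = {
    val sb = new StringBuilder("nothing")
    for (value <- myMap.get(key)) { sb.clear(); sb.append(value) }
    sb.toString
  }
}
```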

Accessing SML tuples by Index Variable

Question is simple.
How to access a tuple by using Index variable in SML?
val index = 5;
val tuple1 = (1,2,3,4,5,6,7,8,9,10);
val correctValue = #index tuple1 ??
I hope, somebody would be able to help out.
Thanks in advance!
There doesn't exist a function which takes an integer value and a tuple, and extracts that element from the tuple. There are of course the #1, #2, ... functions, but these do not take an integer argument. That is, the name of the "function" is #5, it is not the function # applied to the value 5. As such, you cannot substitute the name index instead of the 5.
If you don't know in advance at which place in the tuple the element you want will be at, you're probably using them in a way they're not intended to be used.
You might want a list of values, for which the 'a list type is more natural. You can then access the nth element using List.nth.
To clarify why you can't do that, you need to know a bit more about what a tuple is in SML.
Tuples are actually represented as records in SML. Remember that records have the form {id = expr, id = expr, ..., id = expr} where each identifier is a label.
The relationship between tuples and records is given away by the way you index elements in a tuple: #1, #2, ... The tuple (1, "foo", 42.0) is a derived form of (equivalent to) {1 = 1, 2 = "foo", 3 = 42.0}. This is perhaps better seen from the type that SML/NJ gives that record:
- {1 = 1, 2 = "foo", 3 = 42.0};
val it = (1,"foo",42.0) : int * string * real
Note the type is not shown as a record type such as {1: int, 2: string, 3: real}. The tuple type is again a derived form of the record type.
Actually #id is not a function, and thus it can't be called with a variable as "argument". It is actually a derived form of (note the wildcard pattern row, in the record pattern match)
fn {id=var, ...} => var
So in conclusion, you won't be able to do what you want, since these derived forms (or syntactic sugar, if you will) aren't dynamic in any way.
One way, as Sebastian Paaske said, is to use lists. The drawback is that accessing the nth element of a list takes O(n) time. If you need to access an element in O(1) time, you may use arrays, which are in the SML Basis Library.
You can find more about arrays at:
http://sml-family.org/Basis/array.html

What is an efficient way to concatenate lists?

When we have two lists a and b, how can one concatenate those two (order is not relevant) to a new list in an efficient way ?
I could not figure out from the Scala API, if a ::: b and a ++ b are efficient. Maybe I missed something.
In Scala 2.9, the code for ::: (prepend to list) is as follows:
def :::[B >: A](prefix: List[B]): List[B] =
if (isEmpty) prefix
else (new ListBuffer[B] ++= prefix).prependToList(this)
whereas ++ is more generic, since it takes a CanBuildFrom parameter, i.e. it can return a collection type different from List:
override def ++[B >: A, That](that: GenTraversableOnce[B])(implicit bf: CanBuildFrom[List[A], B, That]): That = {
val b = bf(this)
if (b.isInstanceOf[ListBuffer[_]]) (this ::: that.seq.toList).asInstanceOf[That]
else super.++(that)
}
So if your return type is List, the two perform identically.
The ListBuffer is a clever mechanism in that it can be used as a mutating builder, but eventually "consumed" by the toList method. So what (new ListBuffer[B] ++= prefix).prependToList(this) does, is first sequentially add all the elements in prefix (in the example a), taking O(|a|) time. It then calls prependToList, which is a constant time operation (the receiver, or b, does not need to be taken apart). Therefore, the overall time is O(|a|).
On the other hand, as pst pointed out, we have reverse_:::, defined as follows:
def reverse_:::[B >: A](prefix: List[B]): List[B] = {
var these: List[B] = this
var pres = prefix
while (!pres.isEmpty) {
these = pres.head :: these
pres = pres.tail
}
these
}
So a reverse_::: b again takes O(|a|), hence it is no more or less efficient than the other two methods (although for small list sizes, you save the overhead of creating an intermediate ListBuffer).
In other words, if you have knowledge about the relative sizes of a and b, you should make sure that the prefix is the smaller of the two lists. If you do not have that knowledge, there is nothing you can do, because the size operation on a List takes O(N) :)
On the other hand, in a future Scala version you may see an improved Vector concatenation algorithm, as demonstrated in this ScalaDays talk. It promises to solve the task in O(log N) time.
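To make the three operations concrete, here is a small sketch on made-up lists. The last value checks that the second list is reused as the tail of the result rather than copied, which is why only the prefix contributes to the cost.

```scala
object ConcatDemo {
  val a: List[Int] = List(1, 2)
  val b: List[Int] = List(3, 4, 5)

  val viaTripleColon: List[Int] = a ::: b // a's elements, then b's
  val viaPlusPlus: List[Int]    = a ++ b  // same elements, same order

  // `a reverse_::: b` is a call on b; it prepends a's elements in reverse order.
  val viaReverse: List[Int] = b.reverse_:::(a)

  // Dropping the copied prefix should yield the very same object as b.
  val sharesSuffix: Boolean = (a ::: b).drop(a.length) eq b
}
```

Since the question says order is not relevant, any of the three would do; reverse_::: simply skips the intermediate buffer.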