Extract number from string. Regular expression vs Try construct?

This question was inspired by Extract numbers from String Array question.
Consider we have a List of arbitrary alphabetic and numeric strings:
val ls = List("The", "first", "one", "is", "11", "the", "second", "is", "22")
The goal is to form a list of numbers extracted from the original list: val nums: List[Int] = List(11, 22)
There are two different approaches possible (AFAIK):
Using Try construct:
val nums = ls.flatMap(s => Try(s.toInt).toOption)
This solution looks concise, but it carries a huge overhead for handling exceptions.
Using matches method:
val nums = ls.filter(_.matches("\\d+")).map(_.toInt)
Here the most time-consuming part is regexp matching.
Which one is better by performance?
From my point of view, using the exception mechanism for such a simple operation is like "using a sledgehammer to crack a nut".

I highly recommend you test this stuff out yourself; you can learn a lot! Commence the Scala REPL:
scala> import scala.util.Try
import scala.util.Try
< import printTime function from our repo >
scala> val list = List("The", "first", "one", "is", "11", "the", "second", "is", "22")
list: List[String] = List(The, first, one, is, 11, the, second, is, 22)
scala> var x: List[Int] = Nil
x: List[Int] = List()
OK, the environment is set up. Here's your first function (Try):
scala> def f1(l: List[String], n: Int) = {
  var i = 0
  while (i < n) {
    x = l.flatMap(s => Try(s.toInt).toOption)
    i += 1
  }
}
f1: (l: List[String], n: Int)Unit
The second function (regex):
scala> def f2(l: List[String], n: Int) = {
  var i = 0
  while (i < n) {
    x = l.filter(_.matches("\\d+")).map(_.toInt)
    i += 1
  }
}
f2: (l: List[String], n: Int)Unit
Timings:
scala> printTime(f1(list, 100000)) // Try
time: 4.152s
scala> printTime(f2(list, 100000)) // regex
time: 565.107ms
Well, we've learned that handling exceptions inside a flatMap is a very inefficient way to do things. This is partly because exception handling produces bad assembly code, and partly because flatMaps with options do a lot of extra allocation and boxing. Regex is roughly 7x faster! But...is regex fast?
scala> def f3(l: List[String], n: Int) = {
  var i = 0
  while (i < n) {
    x = l.filter(_.forall(_.isDigit)).map(_.toInt)
    i += 1
  }
}
f3: (l: List[String], n: Int)Unit
scala> printTime(f3(list, 100000)) // isDigit
time: 70.960ms
Replacing the regex with per-character isDigit checks gave us roughly another 8x improvement. The lesson here is to avoid try/catch handling at all costs, avoid regex whenever a simpler check will do, and don't be afraid to write performance comparisons!
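As an aside, and not part of the measurements above: on Scala 2.13+ the standard library offers String#toIntOption, which avoids both exceptions and regex while keeping the concise flatMap style. A minimal sketch:

val nums = list.flatMap(_.toIntOption) // Scala 2.13+: toIntOption returns None for non-numeric strings without throwing; here nums == List(11, 22)

I would expect its performance to land much closer to the isDigit variant than to the Try one, though that is an expectation, not a measurement.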

Related

How to get this result list result = List(1,2,3,4,5,6,7) from val l1 = List(1,2,List(3,List(4,5,6),5,6,7)) in Scala? [duplicate]

This question already has answers here: Scala flatten a List
How to achieve the below result list from the list val l1 = List(1,2,List(3,List(4,5,6),5,6,7))?
result = List(1,2,3,4,5,6,7)
Strangely enough, this actually works in Scala 3, and is kinda-sorta "type-safe":
type NestedList[A] = A match {
  case Int => Int | List[NestedList[Int]]
  case _ => A | List[NestedList[A]]
}
val input: NestedList[Int] = List(1,2,List(3,List(4,5,6),5,6,7))
def flattenNested[A](nested: NestedList[A]): List[A] =
  nested match
    case xs: List[NestedList[A]] => xs.flatMap(flattenNested)
    case x: A => List(x)
val result: List[Int] = flattenNested(input)
println(result.distinct)
I'd consider this more of a curiosity, though; it feels quite buggy. See also the discussion here. As it is now, it would be much preferable to model the input data properly as an enum, so one doesn't end up with a mix of Ints and Lists in the first place.
The following would work,
def flattenOps(l1: List[Any]): List[Int] = l1 match {
  case head :: tail => head match {
    case ls: List[_] => flattenOps(ls) ::: flattenOps(tail)
    case i: Int => i :: flattenOps(tail)
  }
  case Nil => Nil
}
flattenOps(l1).distinct
Regardless of the strange input, I would like to model the items in the list with a common pattern:
import scala.reflect.ClassTag
sealed trait ListItem[A] {
  def flattenedItems: List[A]
}
case class SingleItem[A](value: A) extends ListItem[A] {
  def flattenedItems: List[A] = value :: Nil
}
case class NestedItems[A](values: List[ListItem[A]]) extends ListItem[A] {
  def flattenedItems: List[A] = values.flatMap(_.flattenedItems)
}
// The above would probably take far fewer lines of code in Scala 3
object ListItem {
  def parse[A : ClassTag](value: Any): ListItem[A] = {
    value match {
      case single: A => SingleItem(single)
      case nested: List[Any] => NestedItems(nested.map(parse[A]))
    }
  }
}
Then given the list as:
val l = List(1, 2, List(3, List(4, 5, 6), 5, 6, 7))
You can parse each value, and then get flattened values using the ListItem's method:
val itemizedList: List[ListItem[Int]] = l.map(ListItem.parse[Int])
val result = itemizedList.flatMap(_.flattenedItems).distinct
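For reference, and purely as a hypothetical sketch of mine rather than part of the original answer, the Scala 3 remark in the comment above might pan out roughly like this (the ClassTag-based type test in parse is the same trick as in the Scala 2 version):

import scala.reflect.ClassTag

// Hypothetical Scala 3 sketch of the same ADT
enum ListItem[A]:
  case SingleItem(value: A)
  case NestedItems(values: List[ListItem[A]])

  def flattenedItems: List[A] = this match
    case SingleItem(value)   => value :: Nil
    case NestedItems(values) => values.flatMap(_.flattenedItems)

object ListItem:
  def parse[A: ClassTag](value: Any): ListItem[A] = value match
    case nested: List[?] => NestedItems(nested.map(parse[A]))
    case single: A       => SingleItem(single)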

How to find numbers from a list, where the result is a sequence based on a given number?

I want a Scala method which can find numbers from a list, where the result is a consecutive sequence based on a given number:
def findSeq(baseNum:Int, numbers: List[Int]): List[Int]
If no numbers fit, just return Nil:
Given:
a number: 3
a list of numbers: 5, 9, 2, 4, 10, 6
It will return:
List(4,5,6)
Explanation:
Since the base number is 3, it will try to find 4 in the list. If found, it tries to find 5, then 6, until an expected number is not found. Then it just returns the found ones, sorted. No need to care about duplicate numbers.
More test cases:
findSeq(3, Nil) === Nil
findSeq(3, List(3)) === Nil
findSeq(3, List(5,6)) === Nil
findSeq(3, List(4,5,7)) === List(4,5)
findSeq(3, List(4,7,6)) === List(4)
Looking for an elegant solution in Scala.
I guess you can simply use recursion like this:
def findSeq(baseNum: Int, numbers: List[Int]): List[Int] = {
  if (numbers.contains(baseNum + 1))
    (baseNum + 1) :: findSeq(baseNum + 1, numbers)
  else
    Nil
}
Does your method need to be efficient (in some sense), or is it just for fun?
NOTE: btw, you could also use Stream, in case you don't care about efficiency. Something like this would do I guess:
def findSeq(baseNum: Int, numbers: List[Int]): List[Int] = {
  Stream.from(baseNum + 1).takeWhile(numbers.contains).toList
}
Doing a sort first and then a foldLeft seems to be more efficient:
def findSeq(baseNum: Int, numbers: List[Int]) = findSeqSorted(baseNum, numbers.sorted)

def findSeqSorted(baseNum: Int, numbers: List[Int]): List[Int] = {
  numbers.foldLeft(List(baseNum))((acc, x) => if (x == acc.head + 1) x :: acc else acc).reverse.tail
}
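A quick check of this version against the examples from the question:

findSeq(3, List(5, 9, 2, 4, 10, 6)) // List(4, 5, 6)
findSeq(3, List(4, 5, 7))           // List(4, 5)
findSeq(3, List(4, 7, 6))           // List(4)
findSeq(3, Nil)                     // Nil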

Scala partition into more than two lists

I have a list in Scala that I'm trying to partition into multiple lists based on a predicate that involves multiple elements of the list. For example, if I have
a: List[String] = List("a", "ab", "b", "abc", "c")
I want to get b: List[List[String]], a list of List[String] such that the sum of the lengths of the strings in each inner List[String] == 3, e.g. List(List("a", "b", "c"), List("abc"), List("ab", "a"), ...)
[Edit] Needs to take a reasonable time for lists of length 50 or less.
It is not possible to build an algorithm cheaper than O(2^n) evaluations of the predicate p for an arbitrary predicate, because every subset must be evaluated. You will never achieve something that works for n == 50.
Build all possible sublists and filter:
def filter[A](list: List[A])(predicate: List[A] => Boolean): List[List[A]] = {
  (for {
    i <- 1 to list.length
    subList <- list.combinations(i)
    if predicate(subList)
  } yield subList).toList
}
val a = List("a", "ab", "b", "abc", "c")
val result = filter(a)(_.foldLeft(0)(_ + _.length) == 3)
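With the list from the question, that call yields something along these lines (the exact order of the sublists depends on combinations' iteration order, so treat this as illustrative):

// result: List(List(abc), List(a, ab), List(ab, b), List(ab, c), List(a, b, c))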
I think Sergey is on a good track here, but we can optimize his code a little bit. First of all, we can notice that if the sum of string lengths is N, then for sure we don't need to check combinations composed of more than N strings, as the shortest string is at least one character long. Additionally, we can get away without the for syntactic sugar and use the sum method instead of the much more generic (and thus, probably, not so quick) foldLeft.
For clarity's sake, let's first define a small helper function which will compute the sum of strings lengths:
def sumOfStr(list: List[String]) = list.map(_.length).sum
And now the main method:
def split(list: List[String], sum: Int) =
(1 to sum).map(list.combinations(_).filter(sumOfStr(_) == sum)).flatten.toList
EDIT: With our powers combined, we give you a still very inefficient, but hey-that's-the-best-we-can-do-in-reasonable-time version:
def sumOfStr(lst: List[String]) = {
  var sum = 0
  lst.foreach { s => sum += s.length }
  sum
}
def split(lst: List[String], sum: Int) =
  (1 to sum).par
    .map(lst.combinations(_).filter(sumOfStr(_) == sum))
    .flatten.toList

Split list into multiple lists with fixed number of elements

How to split a List of elements into lists with at most N items?
E.g., given a list with 7 elements, create groups of 4, leaving the last group possibly with fewer elements.
split(List(1,2,3,4,5,6,"seven"),4)
=> List(List(1,2,3,4), List(5,6,"seven"))
I think you're looking for grouped. It returns an iterator, but you can convert the result to a list,
scala> List(1,2,3,4,5,6,"seven").grouped(4).toList
res0: List[List[Any]] = List(List(1, 2, 3, 4), List(5, 6, seven))
There is a much easier way to do the task using the sliding method.
It works this way:
val numbers = List(1, 2, 3, 4, 5, 6 ,7)
Let's say you want to break the list into smaller lists of size 3.
numbers.sliding(3, 3).toList
will give you
List(List(1, 2, 3), List(4, 5, 6), List(7))
Or if you want to make your own:
def split[A](xs: List[A], n: Int): List[List[A]] = {
  if (xs.size <= n) xs :: Nil
  else (xs take n) :: split(xs drop n, n)
}
Use:
scala> split(List(1,2,3,4,5,6,"seven"), 4)
res15: List[List[Any]] = List(List(1, 2, 3, 4), List(5, 6, seven))
edit: upon reviewing this 2 years later, I wouldn't recommend this implementation since size is O(n), and hence this method is O(n^2), which would explain why the built-in method becomes faster for large lists, as noted in comments below. You could implement efficiently as follows:
def split[A](xs: List[A], n: Int): List[List[A]] =
  if (xs.isEmpty) Nil
  else (xs take n) :: split(xs drop n, n)
or even (slightly) more efficiently using splitAt:
def split[A](xs: List[A], n: Int): List[List[A]] =
  if (xs.isEmpty) Nil
  else {
    val (ys, zs) = xs.splitAt(n)
    ys :: split(zs, n)
  }
I am adding a tail-recursive version of the split method since there was some discussion of tail recursion versus recursion. I have used the tailrec annotation to force the compiler to complain in case the implementation is not indeed tail-recursive. Tail recursion, I believe, turns into a loop under the hood and thus will not cause problems even for a large list, as the stack will not grow indefinitely.
import scala.annotation.tailrec
object ListSplitter {
  def split[A](xs: List[A], n: Int): List[List[A]] = {
    @tailrec
    def splitInner(res: List[List[A]], lst: List[A], n: Int): List[List[A]] = {
      if (lst.isEmpty) res
      else {
        val headList: List[A] = lst.take(n)
        val tailList: List[A] = lst.drop(n)
        splitInner(headList :: res, tailList, n)
      }
    }
    splitInner(Nil, xs, n).reverse
  }
}
object ListSplitterTest extends App {
  val res = ListSplitter.split(List(1,2,3,4,5,6,7), 2)
  println(res)
}
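For reference, running ListSplitterTest on the list above prints:

List(List(1, 2), List(3, 4), List(5, 6), List(7))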
I think this is the implementation using splitAt instead of take/drop:
def split[X](n: Int, xs: List[X]): List[List[X]] =
  if (xs.size <= n) xs :: Nil
  else xs.splitAt(n)._1 :: split(n, xs.splitAt(n)._2)

scala return on first Some in list

I have a list l: List[T1] and currently I'm doing the following:
myfun: T1 => Option[T2]
val x: Option[T2] = l.map(myfun).flatten.find(_ => true)
The myfun function returns None or Some; flatten throws away all the Nones, and find returns the first element of the list, if any.
This seems a bit hacky to me. I'm thinking there might be some for-comprehension or similar that does this a bit less wastefully or more cleverly.
For example: I don't need any subsequent answers once myfun returns a Some during the map over the list l.
How about:
l.toStream flatMap (myfun andThen (_.toList)) headOption
Stream is lazy, so it won't map everything in advance, but it won't remap things either. Instead of flattening things, convert Option to List so that flatMap can be used.
In addition to using toStream to make the search lazy, we can use Stream::collectFirst:
List(1, 2, 3, 4, 5, 6, 7, 8).toStream.map(myfun).collectFirst { case Some(d) => d }
// Option[String] = Some(hello)
// given def myfun(i: Int): Option[String] = if (i == 5) Some("hello") else None
This:
Transforms the List into a Stream in order to stop the search early.
Transforms elements using myfun into Option[T]s.
Collects the first mapped element which is not None and extract it.
Starting Scala 2.13, with the deprecation of Streams in favor of LazyLists, this would become:
List(1, 2, 3, 4, 5, 6, 7, 8).to(LazyList).map(myfun).collectFirst { case Some(d) => d }
Well, this is almost, but not quite
val x = (l flatMap myfun).headOption
But you are returning an Option rather than a List from myfun, so this may not work. If so (I've no REPL to hand), then try instead:
val x = (l flatMap(myfun(_).toList)).headOption
Well, the for-comprehension equivalent is pretty easy
(for (x <- l; y <- myfun(x)) yield y).headOption
which, if you actually do the translation, works out the same as what oxbow_lakes gave. Assuming reasonable laziness of List.flatMap, this is both a clean and efficient solution.
As of 2017, the previous answers seem to be outdated. I ran some benchmarks (list of 10 million Ints, first match roughly in the middle, Scala 2.12.3, Java 1.8.0, 1.8 GHz Intel Core i5). Unless otherwise noted, list and map have the following types:
list: scala.collection.immutable.List
map: A => Option[B]
Simply call map on the list: ~1000 ms
list.map(map).find(_.isDefined).flatten
First call toStream on the list: ~1200 ms
list.toStream.map(map).find(_.isDefined).flatten
Call toStream.flatMap on the list: ~450 ms
list.toStream.flatMap(map(_).toList).headOption
Call flatMap on the list: ~100 ms
list.flatMap(map(_).toList).headOption
First call iterator on the list: ~35 ms
list.iterator.map(map).find(_.isDefined).flatten
Recursive function find(): ~25 ms
def find[A, B](list: scala.collection.immutable.List[A], map: A => Option[B]): Option[B] = {
  list match {
    case Nil => None
    case head :: tail => map(head) match {
      case None => find(tail, map)
      case result @ Some(_) => result
    }
  }
}
Iterative function find(): ~25 ms
def find[A, B](list: scala.collection.immutable.List[A], map: A => Option[B]): Option[B] = {
  for (elem <- list) {
    val result = map(elem)
    if (result.isDefined) return result
  }
  return None
}
You can further speed up things by using Java instead of Scala collections and a less functional style.
Loop over indices in java.util.ArrayList: ~15 ms
def find[A, B](list: java.util.ArrayList[A], map: A => Option[B]): Option[B] = {
  var i = 0
  while (i < list.size()) {
    val result = map(list.get(i))
    if (result.isDefined) return result
    i += 1
  }
  return None
}
Loop over indices in java.util.ArrayList with function returning null instead of None: ~10 ms
def find[A, B](list: java.util.ArrayList[A], map: A => B): Option[B] = {
  var i = 0
  while (i < list.size()) {
    val result = map(list.get(i))
    if (result != null) return Some(result)
    i += 1
  }
  return None
}
(Of course, one would usually declare the parameter type as java.util.List, not java.util.ArrayList. I chose the latter here because it's the class I used for the benchmarks. Other implementations of java.util.List will show different performance - most will be worse.)
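For completeness, and as an addition of mine rather than something benchmarked above: on Scala 2.13+ one short way to get the same short-circuiting behaviour while staying with standard Scala collections is to go through an Iterator, which is lazy:

def find[A, B](list: List[A], map: A => Option[B]): Option[B] =
  list.iterator.flatMap(map(_).iterator).nextOption() // stops pulling elements after the first Some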