Running a sequence in batches using asSequence()?

I have a function:
private fun importProductsInSequence(products: List<Product>): List<Result> =
    products.asSequence()
        .map { saveProduct(it) }
        .takeWhile { products.isNotEmpty() }
        .toList()
Is there a possibility to rewrite this sequence so that it works in batches? For example, a list of 1000 products is passed to this method, and the sequence takes 100 products, saves them, then the next 100, until the products.isNotEmpty() condition is met?

takeWhile is not needed here, as the products list's size doesn't change even after you iterate over the sequence, so products.isNotEmpty() will always be true.
Kotlin has a chunked method that can give you the objects in batches as required:
private fun importProductsInSequence(products: List<Product>): List<Result> =
    products.asSequence()
        .chunked(100)
        .map { saveProducts(it) } // saveProducts would take a list of products and return a list of Results
        .flatten()
        .toList()

You need to use chunked:
private fun importProductsInSequence(products: List<Product>): List<Product> =
    products.asSequence()
        .chunked(100)
        .onEach { chunk -> chunk.forEach { saveProduct(it) } } // save each batch of up to 100
        .flatten()
        .toList()
chunked will partition the list into chunks of the given size; the last chunk may have fewer elements than the previous ones.


Scala List of tuple becomes empty after for loop

I have a Scala list of tuples, "params", which is of size 28. I want to loop through and print each element of the list; however, nothing is printed out. After finishing the for loop, I checked the size of the list, which has now become 0.
I am new to Scala and I could not figure this out after a long time of googling.
val primes = List(11, 13, 17, 19, 2, 3, 5, 7)
val params = primes.combinations(2)
println(params.size)
for (param <- params) {
  print(param(0), param(1))
}
println(params.size)
The combinations method on List creates an Iterator. Once the Iterator has been consumed by a method like size, it will be empty.
From the docs
one should never use an iterator after calling a method on it.
If you comment out the first println(params.size), you can see that the for loop prints out the elements, but the last println(params.size) will still print 0.
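If you need both the size and the elements, a minimal sketch of the fix is to materialize the Iterator once and reuse the resulting list:
val params = primes.combinations(2).toList // materialize the Iterator once
println(params.size)                       // 28
for (param <- params) {
  print((param(0), param(1)))
}
println(params.size)                       // still 28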
Complementing Johny's great answer:
Do you know how I can save the result from combination methods to use for later?
Well, as already suggested, you can just call toList.
However, note there is a reason why combinations returns an Iterator: the data can be too big to hold in memory at once. If you are okay with that, then go ahead; but you may still take advantage of the laziness.
For example, let's convert the inner lists into a tuples before collecting the results:
val params =
  primes
    .combinations(2)
    .collect {
      case a :: b :: Nil => (a, b)
    }.toList
In the same way, you may add extra steps to the chain, like another map or a filter, before doing the toList.
Even better, if your end action is something like foreach(foo), then you do not even need to collect everything into a List.
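For instance, a minimal sketch of such a fully lazy pipeline, with a made-up end action that just prints each pair (no intermediate List is ever built):
primes
  .combinations(2)
  .collect { case a :: b :: Nil => (a, b) }
  .foreach { case (a, b) => println(s"($a, $b)") }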
primes.combinations(2) returns an Iterator.
Iterators are data structures that allow to iterate over a sequence of
elements. They have a hasNext method for checking if there is a next
element available, and a next method which returns the next element
and discards it from the iterator.
So it is like a pointer into the collection: once you have finished iterating, you will not be able to iterate again.
When println(params.size) is executed, the iteration completes while computing the size, and params is left pointing at the end. Because of this, for (param <- params) is equivalent to looping over an empty collection.
There are 2 possible solutions:
Don't check the size before the for loop.
Convert the Iterator to an Iterable, e.g. a list:
val params = primes.combinations(2).toList
To learn more about Iterator and Iterable, refer to What is the relation between Iterable and Iterator?

Keeping an index with flatMap in Swift

This is a follow-up to this question:
flatMap and `Ambiguous reference to member` error
There I am using the following code to convert an array of Records to an array of Persons:
let records = // load file from bundle
let persons = records.flatMap(Person.init)
Since this conversion can take some time for big files, I would like to monitor an index to feed into a progress indicator.
Is this possible with this flatMap construction? One possibility I thought of would be to send a notification from the init function, but I am thinking counting the records should also be possible from within flatMap?
Yup! Use enumerated().
let records = // load file from bundle
let persons = records.enumerated().flatMap { index, record in
    print(index)
    return Person(record)
}

Scala: foreach on one-element Lists of Dates

I want to create a function that works on a single date or over a period of time. For that I make a list of strings like this:
import com.github.nscala_time.time.Imports._
val fromDate = "2015-10-15".toLocalDate
val toDate = "2015-10-15".toLocalDate
val days = Iterator.iterate(fromDate)(_ + 1.day).takeWhile(_ <= toDate).map(_.toString)
Now I want to access days content:
scala> days.foreach(println)
2015-10-15
scala> days.foreach(day => println(day))
With the first foreach I get the only element in the days list, but with the second I get nothing. This is just a sample, but I need to use the second form, because I need to use the day value inside the function.
If I use a range, both methods work as they are supposed to...
Does anyone know why this behaviour occurs?
Thanks in advance!
P.S. I don't want to use another method to create the list of strings (unless I can't find a solution for this).
The second call works in the same way as the first; the problem is that the first foreach has already consumed the Iterator.
You've created an Iterator object, which you can iterate over only once.
To be able to use it any number of times, just convert your iterator to a list:
val daysL = days.toList
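Now both forms behave identically; for example:
daysL.foreach(println)             // prints 2015-10-15
daysL.foreach(day => println(day)) // prints 2015-10-15 again: a List can be traversed any number of times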

How does RDD reduce work in Apache Spark?

I am currently working with Apache Spark, but I cannot understand how reduce works after map.
My example is pretty simple:
val map = readme.map(line => line.split(" ").size)
I know this will return the number of words per line, but where is the key/value here to pass to a reduce function?
map.reduce((a,b) => {if(a>b) a else b})
How does the reduce phase work? Is (a, b) a Tuple2, or is it the key/value from the map function?
Once you have
val map = readme.map(line => line.split(" ").size)
Each element of the RDD consists of a single number, the number of words in a line of the file.
You could count all the words in your dataset with map.sum() or map.reduce( (a,b) => a+b ), which are equivalent.
The code you have posted:
map.reduce((a,b) => {if(a>b) a else b})
would find the maximum number of words per line for your entire dataset.
The RDD.reduce method works by combining every two elements it encounters, which at first are taken from pairs of RDD rows, into another element, in this case a number. For the result to be well defined, the aggregation function should be associative and commutative, i.e. written so that it can be nested and applied to the rows in any order. For example, subtraction will not yield useful results as a reduce function, because you cannot predict ahead of time in what order results would be subtracted from one another. Addition or maximization, however, works correctly no matter the order.
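As a minimal sketch of the same reductions using plain Scala collections in place of an RDD (the example lines are made up; no SparkContext needed):
// stand-in for readme.map(line => line.split(" ").size)
val wordsPerLine = Seq("Apache Spark is fast", "and easy", "to use").map(_.split(" ").length)
// wordsPerLine == Seq(4, 2, 2)

val maxWords = wordsPerLine.reduce((a, b) => if (a > b) a else b) // 4
val total    = wordsPerLine.reduce(_ + _)                         // 8, same as wordsPerLine.sum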

Scala - mutable ListBuffer or immutable List, which to choose?

I am writing a simple Scala program that will calculate the moving average of a list of quotes over a defined window size, say 100.
Quotes will be coming in at a rate of approx. 5-6 quotes per second.
1) Is it good to keep the quotes in an immutable Scala list, in which case I guess a new list will be created each time a quote comes in? Is that going to take too much unnecessary memory?
OR
2) Is it good to keep the quotes in a mutable list like ListBuffer, removing the oldest quote and pushing in the new one each time a quote arrives?
Current code
package com.example.csv

import scala.io.Source
import scala.collection.mutable.ListBuffer

object CsvFileParser {
  val WINDOW_SIZE = 25
  var quotes = ListBuffer(0.0)

  def main(args: Array[String]) = {
    val src = Source.fromFile("GBP_USD_Week1.csv")
    // drop the header and split the comma-separated tokens
    val iter = src.getLines().drop(1).map(_.split(","))
    while (iter.hasNext) {
      processRecord(iter.next)
    }
    src.close()
  }

  def processRecord(record: Array[String]) = {
    if (quotes.length < WINDOW_SIZE) {
      quotes += record(4).toDouble
    } else {
      val movingAverage = quotes.sum / quotes.length
      quotes.map(_ + " ").foreach(print)
      println("\nMoving Average " + movingAverage)
      quotes = quotes.tail
      quotes += record(4).toDouble
    }
  }

  /*
  def simpleMovingAverage(values: ListBuffer[Double], period: Int): ListBuffer[Double] =
    ListBuffer.fill(period - 1)(0.0) ++ values.sliding(period).map(_.sum).map(_ / period)
  */
}
This depends on whether you keep the items in reverse order or not. List prepends elements in constant time (:: does not copy the existing list), while ListBuffer#+= creates a node and appends it at the end of the list.
There should be little or no performance difference, nor memory footprint difference, between using List and ListBuffer -- internally these are the same data structures. The only question is whether you will need to reverse the list at the end -- if so, this requires creating a second list, so it may be slower.
In your case the decision to use list buffers was correct -- you need to remove the first element, and append to the other side of the collection, which is something that ordinary functional lists would not allow you to do efficiently.
However, in your code you call tail on the ListBuffer. This actually copies the contents of the list buffer rather than giving you a cheap tail, making it an O(WINDOW_SIZE) operation. You should call quotes.remove(0, 1) to remove the first entry instead -- this just mutates the current buffer, an O(1) operation.
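Applied to processRecord above, the window update would become:
// mutate the buffer in place instead of copying it with tail
quotes.remove(0, 1)           // drop the oldest quote, O(1)
quotes += record(4).toDouble  // append the newest quote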
For very fast arrivals of quotes, you might consider using a customized data structure -- for list buffer you will pay the cost of boxing.
However, in the case that there are 5-6 quotes per second and the WINDOW_SIZE is around 100, you don't have to worry -- even calling tail on the list buffer should be more than acceptable.
Immutable structures in Scala use a technique called structural sharing.
For Lists it's also mentioned in the Scaladocs:
The functional list is characterized by persistence and structural sharing, thus offering considerable performance and space consumption benefits in some scenarios.
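A minimal illustration of that sharing:
val xs = List(2, 3)
val ys = 1 :: xs // List(1, 2, 3): ys reuses xs's nodes instead of copying them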
So:
the immutable List will take up more space, but not much more, and will blend in better with other Scala constructs
the mutable version is prone to concurrency issues, but will be a little faster and use a little less space
As for the code:
If you use a mutable structure you can get rid of the var; keep the var if you want to overwrite an immutable one
It would be considered good style to have a method return the List instead of manipulating a mutable structure or a global var (a sketch of this follows the snippet below)
If you process the file as in the code snippet below, you need not take care of closing the file, and the file will also be processed in parallel
def processLines(path: String): Unit = for {
  // skip the header line, then process the remaining lines in parallel
  line <- Source.fromFile(path).getLines().drop(1).toList.par
} processRecord(line.split(","))
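And, complementing the style point above, a hypothetical immutable variant of the window update (slide and its parameters are illustrative, not part of the original code):
// returns the new window instead of mutating shared state
def slide(window: List[Double], quote: Double, size: Int): List[Double] =
  if (window.length < size) window :+ quote
  else window.tail :+ quote

val w0 = List(1.0, 2.0, 3.0)
val w1 = slide(w0, 4.0, 3) // List(2.0, 3.0, 4.0); w0 is untouched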