Scala - Mutable ListBuffer or immutable List: which to choose?

I am writing a simple Scala program that will calculate the moving average over a list of quotes with a defined window size, say 100.
Quotes will arrive at a rate of roughly 5-6 per second.
1) Is it good to keep the quotes in an immutable Scala List, where I guess a new list will be created each time a quote comes in? Is that going to waste too much memory?
OR
2) Is it good to keep the quotes in a mutable list like ListBuffer, where I remove the oldest quote and push the new one each time a quote arrives?
Current code
package com.example.csv

import scala.io.Source
import scala.collection.mutable.ListBuffer

object CsvFileParser {

  val WINDOW_SIZE = 25;
  var quotes = ListBuffer(0.0);

  def main(args: Array[String]) = {
    val src = Source.fromFile("GBP_USD_Week1.csv");
    //drop header and split the comma separated tokens
    val iter = src.getLines().drop(1).map(_.split(","));

    // Sliding window reads ahead // remove it
    val index = 0;

    while(iter.hasNext) {
      processRecord(iter.next)
    }
    src.close()
  }

  def processRecord(record: Array[String]) = {
    if(quotes.length < WINDOW_SIZE){
      quotes += record(4).toDouble;
    }else {
      val movingAverage = quotes.sum / quotes.length
      quotes.map(_ + " ").foreach(print)
      println("\nMoving Average " + movingAverage)

      quotes = quotes.tail;
      quotes += record(4).toDouble;
    }
  }

  /*def simpleMovingAverage(values: ListBuffer[Double], period: Int): ListBuffer[Double] = {
    ListBuffer.fill(period - 1)(0.0) ++ (values.sliding(period).map(_.sum).map(_ / period))
  }*/
}

This depends on whether you keep the items in reverse order or not. A List prepends elements in constant time (:: does not create an entirely new list), while ListBuffer#+= creates a node and appends it at the end of the list.
There should be little or no difference in performance or memory footprint between List and ListBuffer -- internally they are the same data structure. The only question is whether you will need to reverse the list at the end -- if you do, that requires building a second list, so it may be slower.
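A tiny sketch of that difference (the incoming values are arbitrary):
val incoming = List(1.1, 1.2, 1.3)

// Prepending with :: is O(1) but builds the list newest-first...
val reversed = incoming.foldLeft(List.empty[Double])((acc, q) => q :: acc)
val inOrder  = reversed.reverse        // ...so one extra reversing pass is needed at the end.

// ListBuffer += appends in order, so no reverse is required.
val buf = scala.collection.mutable.ListBuffer.empty[Double]
incoming.foreach(buf += _)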
In your case the decision to use a list buffer was correct -- you need to remove the first element and append at the other end of the collection, which is something ordinary functional lists do not let you do efficiently.
However, in your code you call tail on the ListBuffer. This actually copies the contents of the list buffer (an O(WINDOW_SIZE) operation) rather than giving you a cheap tail. You should call quotes.remove(0, 1) to remove the first entry -- that just mutates the current buffer, an O(1) operation.
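A minimal sketch of that change, reusing the quotes buffer and WINDOW_SIZE from the question (the column index 4 is carried over as-is; remove(0) is the single-element form of remove(0, 1)):
def processRecord(record: Array[String]): Unit = {
  quotes += record(4).toDouble
  if (quotes.length > WINDOW_SIZE) quotes.remove(0)   // O(1) on ListBuffer, unlike tail
  if (quotes.length == WINDOW_SIZE)
    println("Moving Average " + quotes.sum / quotes.length)
}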
For very fast arrival rates you might consider a customized data structure -- with a list buffer you pay the cost of boxing each Double.
However, with 5-6 quotes per second and a WINDOW_SIZE of around 100, you don't have to worry -- even calling tail on the list buffer would be more than acceptable.
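If quote rates did become much higher, one hedged sketch of such a customized structure (not part of the answer above; the names are made up) is a fixed-size ring buffer over Array[Double] with a running sum, which avoids boxing and gives O(1) updates:
class MovingAverage(window: Int) {
  private val values = new Array[Double](window)
  private var count  = 0      // how many slots are filled
  private var next   = 0      // slot to overwrite next (the oldest slot once full)
  private var sum    = 0.0

  def add(quote: Double): Unit = {
    if (count == window) sum -= values(next)   // drop the oldest value from the running sum
    else count += 1
    values(next) = quote
    sum += quote
    next = (next + 1) % window
  }

  def average: Double = if (count == 0) 0.0 else sum / count
}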

Immutable structures in Scala use a technique called structural sharing.
For Lists it's also mentioned in the Scaladocs:
The functional list is characterized by persistence and structural sharing, thus offering considerable performance and space consumption benefits in some scenarios.
So:
the immutable List will take up slightly more space, but not much more, and will blend in better with other Scala constructs
the mutable version is prone to concurrency issues, will be a little faster and use a little less space.
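To make the structural-sharing point concrete, a tiny sketch (the values are arbitrary):
val base    = List(2.0, 3.0, 4.0)
val withNew = 1.0 :: base   // O(1): the new cell simply points at base, nothing is copied
// base is left untouched; both lists share the same three trailing cells.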
As for the code:
If you use a mutable structure you can get rid of the var; keep the var if you want to overwrite an immutable one
It would be considered good style to have a method that returns the result instead of manipulating a mutable structure or a global var (see the sketch after the snippet below)
If you wrap the processing in a small method like the one below, the lines are read lazily and the file is closed even if processing fails:
def processLines(path: String): Unit = {
  val src = Source.fromFile(path)
  try {
    for (line <- src.getLines().drop(1)) processRecord(line.split(","))
  } finally {
    src.close()
  }
}
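As a hedged sketch of that "return the result from a method" style (sliding is a standard iterator method; the column index 4 is carried over from the question's CSV layout):
def movingAverages(closes: Iterator[Double], window: Int): Iterator[Double] =
  closes.sliding(window).map(w => w.sum / w.size)

// possible usage:
// movingAverages(src.getLines().drop(1).map(_.split(",")(4).toDouble), WINDOW_SIZE).foreach(println)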

Related

Arrow Java ListVector writeBatch and read get empty list

I have a ListVector with the value [[1,2,3,4,5]] in a VectorSchemaRoot, and I can see its value in IDEA.
I use the following code to write the VectorSchemaRoot variable and get the byte array:
val out = new ByteArrayOutputStream()
val writer = new ArrowStreamWriter(vectorSchemaRoot, null, out)
writer.start()
writer.writeBatch()
writer.end()
out.close()
val byteArr = out.toByteArray
And read back
val allocator = new RootAllocator(Int.MaxValue)
val reader = new ArrowStreamReader(new ByteArrayInputStream(byteArr), allocator)
while (reader.loadNextBatch()) {
  val schemaRoot = reader.getVectorSchemaRoot
  schemaRoot
}
The schema is correct, but the list read back is empty: [].
However, if I use other types of values, like char or bit, the result read from byteArr is correct (non-empty).
How can I fix the empty ListVector issue?
In the end I just used the basic vector classes.
StructVector and ListVector are complex classes, and according to my tests they bring no speed or memory benefit over the basic classes. Documentation for the complex classes is also scarce.
So I recommend the basic classes. Building the schema from a plain List of Fields still gives you a structured result.
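For illustration only, a hedged sketch (Scala, using Arrow's Java pojo schema classes; the field names and types are invented) of building a schema from a plain list of basic Fields:
import java.util.Arrays.asList
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}

// Two flat "basic" columns instead of a ListVector: an int column and a string column.
val idField   = new Field("id",   FieldType.nullable(new ArrowType.Int(32, true)), null)
val nameField = new Field("name", FieldType.nullable(new ArrowType.Utf8()),        null)
val schema    = new Schema(asList(idField, nameField))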

What's slowing down this piece of python code?

I have been trying to implement the Stupid Backoff language model (the description is available here, though I believe the details are not relevant to the question).
The thing is, the code works and produces the expected result, but it runs slower than I expected. I figured out that the part slowing everything down is here (and NOT in the training part):
def compute_score(self, sentence):
    length = len(sentence)
    assert length <= self.n
    if length == 1:
        word = tuple(sentence)
        return float(self.ngrams[length][word]) / self.total_words
    else:
        words = tuple(sentence[::-1])
        count = self.ngrams[length][words]
        if count == 0:
            return self.alpha * self.compute_score(sentence[1:])
        else:
            return float(count) / self.ngrams[length - 1][words[:-1]]

def score(self, sentence):
    """ Takes a list of strings as argument and returns the log-probability of the
    sentence using your language model. Use whatever data you computed in train() here.
    """
    output = 0.0
    length = len(sentence)
    for idx in range(length):
        if idx < self.n - 1:
            current_score = self.compute_score(sentence[:idx+1])
        else:
            current_score = self.compute_score(sentence[idx-self.n+1:idx+1])
        output += math.log(current_score)
    return output
self.ngrams is a nested dictionary that has n entries. Each of these entries is a dictionary of form (word_i, word_i-1, word_i-2.... word_i-n) : the count of this combination.
self.alpha is a constant that defines the penalty for backing off to n-1.
self.n is the maximum length of the tuple that the program looks for in the dictionary self.ngrams. It is set to 3 (though setting it to 2 or even 1 doesn't change anything). It's weird because the unigram and bigram models work just fine, in fractions of a second.
The answer that I am looking for is not a refactored version of my own code, but rather a tip which part of it is the most computationally expensive (so that I could figure out myself how to rewrite it and get the most educational profit from solving this problem).
Please, be patient, I am but a beginner (two months into the world of programming). Thanks.
UPD:
I timed the running time with the same data using time.time():
Unigram = 1.9
Bigram = 3.2
Stupid Backoff (n=2) = 15.3
Stupid Backoff (n=3) = 21.6
(It's on some bigger data than originally because of time.time's bad precision.)
If the sentence is very long, most of the code that's actually running is here:
def score(self, sentence):
    for idx in range(len(sentence)): # should use xrange in Python 2!
        self.compute_score(sentence[idx-self.n+1:idx+1])

def compute_score(self, sentence):
    words = tuple(sentence[::-1])
    count = self.ngrams[len(sentence)][words]
    if count == 0:
        self.compute_score(sentence[1:])
    else:
        self.ngrams[len(sentence) - 1][words[:-1]]
That's not meant to be working code--it just removes the unimportant parts.
The flow in the critical path is therefore:
For each word in the sentence:
Call compute_score() on that word plus the following 2. This creates a new list of length 3. You could avoid that with itertools.islice().
Construct a 3-tuple with the words reversed. This creates a new tuple. You could avoid that by passing the -1 step argument when making the slice outside this function.
Look up in self.ngrams, a nested dict, with the first key being a number (might be faster if this level were a list; there are only three keys anyway?), and the second being the tuple just created.
Recurse with the first word removed, i.e. make a new tuple (sentence[2], sentence[1]), or
Do another lookup in self.ngrams, implicitly creating another new tuple (words[:-1]).
In summary, I think the biggest problem you have is the repeated and nested creation and destruction of lists and tuples.
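As a loose illustration of that advice in Scala (a hypothetical sketch, not the asker's code): read the reversed key directly by index, so each lookup builds one key instead of a slice plus a reversed tuple:
// ngrams maps n-gram length -> (reversed word key -> count); sentence is indexed.
def ngramCount(sentence: IndexedSeq[String], idx: Int, n: Int,
               ngrams: Map[Int, Map[Seq[String], Long]]): Long = {
  val lo  = math.max(0, idx - n + 1)           // start of the window for this position
  val key = (idx to lo by -1).map(sentence)    // reversed key built in a single pass
  ngrams.getOrElse(idx - lo + 1, Map.empty[Seq[String], Long]).getOrElse(key, 0L)
}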

Compare a portion of String value present in 2 Lists

The code below extracts a particular value from the list srchlist and checks for it in the list rplzlist. The contents of srchlist and rplzlist look like this:
srchlist = ["DD='A'\n", "SOUT='*'\n", 'PGM=FTP\n', 'PGM=EMAIL']
rplzlist = ['A=ZZ.VVMSSB\n', 'SOUT=*\n', 'SALEDB=TEST12']
I am extracting the characters after the '=' (equals) sign, and within the single quotes, using a combination of the split and translate functions.
Of the elements in srchlist, only 'SOUT' matches anything in rplzlist.
Please let me know why the code below does not work, and also suggest a better approach for comparing part of a string present in a list.
for ele in srchlist:
    sYmls = ele.split('=')
    vAlue = sYmls[1].translate(None,'\'')
    for elem in rplzlist:
        rPls = elem.split('=')
        if vAlue in rPls:
            print("vAlue")
Here is a more Pythonic approach to what you want to do:
>>> list(set([(i.split('='))[1].translate(None,'\'') for i in srchlist]) & set([j.split('=')[1] for j in rplzlist]))
['*\n']
I used set() intersection and then converted the result to a list; you may use .join() instead.
Inside set(), a list comprehension is used, which is faster than a plain for loop.
Another solution, using join() and replace() in place of translate():
>>> "".join(set([(i.split('='))[1].replace('\'','') for i in srchlist]) & set([j.split('=')[1] for j in rplzlist]))
'*\n'

Understanding; for i in range, x,y = [int(i) in i.... Python3

I am stuck trying to understand the mechanics behind this combined input(), loop & list-comprehension; from Codegaming's "MarsRover" puzzle. The sequence creates a 2D line, representing a cut-out of the topology in an area 6999 units wide (x-axis).
Understandably, my original question was put on hold for being too broad. I am trying to shorten and narrow the question: I understand list comprehensions at a basic level, and I'm reasonably experienced with for-loops.
Like list comp:
land_y = [int(j) for j in range(k)]
if k = 5; land_y = [0, 1, 2, 3, 4]
For-loops:
for i in the range(4)
a = 2*i = 6
ab.append(a) = 0,2,4,6
But here, it just doesn't add up (in my head):
6999 points are created along the x-axis, from 6 points(x,y).
surface_n = int(input())
for i in range(surface_n):
    land_x, land_y = [int(j) for j in input().split()]
I do not understand where "i" makes a difference.
I do not understand how the data is "packaged" inside the input. I have split strings of integers on another task with almost exactly the same code, and I could easily create new lists and work with them - as I understood the structure I was unpacking (pretty simple, being one datatype with one purpose).
The fact that this line follows within the "game"-while-loop confuses me more, as it updates dynamically as the state of the game changes.
x, y, h_speed, v_speed, fuel, rotate, power = [int(i) for i in input().split()]
Maybe someone could give an example of how this could be written in javascript, haskell or c#? No need to be syntax-correct, I'm just struggling with the concept here.
input() takes a line from the standard input. So it’s essentially reading some value into your program.
The way that code works, it makes very strong assumptions about the format of the input strings, to the point that it gets confusing (and difficult to verify).
Let’s take a look at this line first:
land_x, land_y = [int(j) for j in input().split()]
You said you already understand list comprehension, so this is essentially equal to this:
inputs = input().split()
results = []
for j in inputs:
    results.append(int(j))
land_x, land_y = results
This is a combination of multiple things that happen here. input() reads a line of text into the program, split() separates that string into multiple parts, splitting it whenever a white space character appears. So a string 'foo bar' is split into ['foo', 'bar'].
Then the list comprehension happens, which essentially just iterates over every item in that split input and converts each item into an integer using int(j). So an input of '2 3' is first converted into ['2', '3'] (a list of strings), and then into [2, 3] (a list of ints).
Finally, the line land_x, land_y = results is evaluated. This is called iterable unpacking and essentially assumes that the iterable on the right has exactly as many items as there are variables on the left. If that’s the case then it’s just a nice way to write the following:
land_x = results[0]
land_y = results[1]
So basically, the whole list comprehension assumes that there is an input of two numbers separated by whitespace, it then splits those into separate strings, converts those into numbers and then assigns each number to a separate variable land_x and land_y.
Exactly the same thing happens again later with the following line:
x, y, h_speed, v_speed, fuel, rotate, power = [int(i) for i in input().split()]
It’s just that this time, it expects the input to have seven numbers instead of just two. But then it’s exactly the same.
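Since the asker mentioned wanting to see the idea in another language, here is a rough Scala equivalent of the two-number line (Scala rather than the languages listed; StdIn.readLine stands in for input()):
import scala.io.StdIn

// read a line, split on whitespace, convert every piece to Int,
// then destructure the resulting array into two variables
val Array(landX, landY) = StdIn.readLine().split("\\s+").map(_.toInt)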

Regex with SQL Server 2008 CLR performance issues

I am trying to understand why it is taking so long to execute a simple query.
On my local machine it takes 10 seconds, but in production it takes 1 minute.
(I imported the database from production into my local database)
select *
from JobHistory
where dbo.LikeInList(InstanceID, 'E218553D-AAD1-47A8-931C-87B52E98A494') = 1
The table DataHistory is not indexed and it has 217,302 rows
public partial class UserDefinedFunctions
{
    [SqlFunction]
    public static bool LikeInList([SqlFacet(MaxSize = -1)] SqlString value, [SqlFacet(MaxSize = -1)] SqlString list)
    {
        foreach (string val in list.Value.Split(new char[] { ',' }, StringSplitOptions.None))
        {
            Regex re = new Regex("^.*" + val.Trim() + ".*$", RegexOptions.IgnoreCase);
            if (re.IsMatch(value.Value))
            {
                return (true);
            }
        }
        return (false);
    }
};
And the issue is that if the table has 217k rows, then I will be calling that function 217,000 times! I am not sure how I can rewrite this thing.
Thank you
There are several issues with this code:
Missing (IsDeterministic = true, IsPrecise = true) in the [SqlFunction] attribute. Doing this (mainly the IsDeterministic = true part) allows the SQLCLR UDF to participate in parallel execution plans. Without setting IsDeterministic = true, this function will prevent parallel plans, just like T-SQL UDFs do.
The return type is bool instead of SqlBoolean.
The RegEx usage is inefficient: constructing a new Regex instance for every value is expensive. Switch to the static Regex.IsMatch method instead.
The RegEx pattern is very inefficient: wrapping the search string in "^.*" and ".*$" forces the RegEx engine to parse, and retain in memory as the "match", the entire contents of the value input parameter on every single iteration of the foreach. Yet regular expressions behave such that simply using val.Trim() as the entire pattern would yield the exact same result (see the sketch after this list).
(optional) If neither input parameter will ever be over 4000 characters, then specify a MaxSize of 4000 instead of -1 since NVARCHAR(4000) is much faster than NVARCHAR(MAX) for passing data into, and out of, SQLCLR objects.
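To illustrate the pattern point only (a Scala sketch with java.util.regex, not SQLCLR; the sample value is made up): the anchored-wildcard pattern and the bare term accept exactly the same inputs, but the former forces the engine to treat the whole value as the match:
import java.util.regex.Pattern

val value   = "E218553D-AAD1-47A8-931C-87B52E98A494,OTHER-ID"
val wrapped = Pattern.compile("^.*AAD1.*$", Pattern.CASE_INSENSITIVE)
val bare    = Pattern.compile("AAD1", Pattern.CASE_INSENSITIVE)

wrapped.matcher(value).find()   // true, but the match spans the entire string
bare.matcher(value).find()      // true as well, with far less work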