Fast list-product sign for PackedArray?

As a continuation of my previous question: Simon's method for finding the list product of a PackedArray is fast, but it does not work with negative values.
This can be "fixed" with Abs at a minimal time penalty, but the sign is lost, so I need to find the sign of the product separately.
The fastest method I have tried is EvenQ@Total@UnitStep[-lst]:
lst = RandomReal[{-2, 2}, 5000000];
Do[
 EvenQ@Total@UnitStep[-lst],
 {30}
 ] // Timing
Out[]= {3.062, Null}
Is there a faster way?

This is a little over twice as fast as your solution and, apart from the nonsense of using Rule @@@ to extract the relevant term, I find it clearer: it simply counts the number of elements with each sign.
EvenQ[-1 /. Rule @@@ Tally@Sign[lst]]
To compare timings (and outputs)
In[1]:= lst=RandomReal[{-2,2},5000000];
s=t={};
Do[AppendTo[s,EvenQ@Total@UnitStep[-lst]],{10}];//Timing
Do[AppendTo[t,EvenQ[-1/.Rule@@@Tally@Sign[lst]]],{10}];//Timing
s==t
Out[3]= {2.11,Null}
Out[4]= {0.96,Null}
Out[5]= True
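
For readers outside Mathematica, here is a minimal Python sketch of the parity argument used above: a product of nonzero reals is positive exactly when the number of negative factors is even. The function names are hypothetical, and the log-based magnitude step is only an assumption about how a fast product of absolute values might be computed; it is not taken from the earlier question.

```python
import math

def product_is_positive(values):
    """True when the count of negative entries is even (assumes no zeros)."""
    negatives = sum(1 for v in values if v < 0)
    return negatives % 2 == 0

def signed_product(values):
    """Product via logs of absolute values, with the sign restored afterwards."""
    if any(v == 0 for v in values):
        return 0.0
    # Summing logs avoids overflow/underflow in long products (assumed approach).
    magnitude = math.exp(math.fsum(math.log(abs(v)) for v in values))
    return magnitude if product_is_positive(values) else -magnitude

print(signed_product([-2.0, 3.0, -4.0]))
print(signed_product([-2.0, 3.0, 4.0]))
```

The sign and the magnitude are computed independently, which mirrors the split between Abs and the parity check in the question.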

A bit of a late-to-the-party post: if you are ultimately interested in speed, Compile with the C compilation target seems to be about twice as fast as the fastest solution posted so far (the Tally-Sign based one):
fn = Compile[{{l, _Real, 1}},
Module[{sumneg = 0},
Do[If[i < 0, sumneg++], {i, l}];
EvenQ[sumneg]], CompilationTarget -> "C",
RuntimeOptions -> "Speed"];
Here are the timings on my machine:
In[85]:= lst = RandomReal[{-2, 2}, 5000000];
s = t = q = {};
Do[AppendTo[s, EvenQ@Total@UnitStep[-lst]], {10}]; // Timing
Do[AppendTo[t, EvenQ[-1 /. Rule @@@ Tally@Sign[lst]]], {10}]; // Timing
Do[AppendTo[q, fn [lst]], {10}]; // Timing
s == t == q
Out[87]= {0.813, Null}
Out[88]= {0.515, Null}
Out[89]= {0.266, Null}
Out[90]= True

Related

Python: referring to each duplicate item in a list by unique index

I am trying to extract particular lines from a text output file. The lines I am interested in are a few lines above and a few lines below the key string that I use to search through the results. The key string is the same for each result.
fi = open('Inputfile.txt')
fo = open('Outputfile.txt', 'a')
lines = fi.readlines()
filtered_list=[]
for item in lines:
    if item.startswith("key string"):
        filtered_list.append(lines[lines.index(item)-2])
        filtered_list.append(lines[lines.index(item)+6])
        filtered_list.append(lines[lines.index(item)+10])
        filtered_list.append(lines[lines.index(item)+11])
fo.writelines(filtered_list)
fi.close()
fo.close()
The output file contains the right lines for the first record, but they are repeated for every record in the file. How can I fix the indexing so that each record is read individually? I tried to find a solution, but as a novice programmer I struggled to use the enumerate() function or the collections package.

First of all, it would probably help if you said what exactly goes wrong with your code (a stack trace, it doesn't work at all, etc). Anyway, here are some thoughts. You can try to divide your problem into subproblems to make it easier to work with. In this case, let's separate finding the relevant lines from collecting them.
First, let's find the indexes of all the relevant lines.
key = "key string"
relevant = []
for i, item in enumerate(lines):
    if item.startswith(key):
        relevant.append(i)
enumerate is actually quite simple. It takes a list and yields (index, item) pairs, so list(enumerate(['a', 'b', 'c'])) gives [(0, 'a'), (1, 'b'), (2, 'c')].
The loop above can also be written as a list comprehension:
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
So, we have the indexes of the relevant lines. Now, let's collect them. You are interested in the line 2 lines before each match and the lines 6, 10 and 11 lines after it. If the key appears in one of the first two lines, you have a problem: the offset produces a negative index, and you don't really want lines[-1], which is the last item! You also need to handle the case in which an offset would take you past the end of the list; otherwise Python will raise an IndexError.
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        # index 0 is a valid line, so the lower bound is inclusive
        if 0 <= index < len(lines):
            out.append(lines[index])
You could also catch the IndexError, but that won't save us much typing, as we have to handle negative indexes anyway.
The whole program would look like this:
key = "key string"
with open('Inputfile.txt') as fi:
    lines = fi.readlines()
# indexes of every line that starts with the key
relevant = [i for (i, item) in enumerate(lines) if item.startswith(key)]
out = []
for r in relevant:
    for offset in -2, 6, 10, 11:
        index = r + offset
        # skip offsets that fall before the first or after the last line
        if 0 <= index < len(lines):
            out.append(lines[index])
with open('Outputfile.txt', 'a') as fo:
    fo.writelines(out)
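
As a side note on why the original loop repeated the first record: `list.index` returns the position of the first element equal to its argument, so identical key lines all resolve to the same index. A minimal sketch with made-up sample lines (the data is purely illustrative):

```python
# Hypothetical sample: two records that share the same key line.
lines = ["a\n", "key string\n", "b\n", "key string\n", "c\n"]
key = "key string"

# list.index always finds the FIRST equal element, so both hits map to 1.
via_index = [lines.index(item) for item in lines if item.startswith(key)]

# enumerate carries the real position of each element alongside it.
via_enumerate = [i for i, item in enumerate(lines) if item.startswith(key)]

print(via_index)      # [1, 1]
print(via_enumerate)  # [1, 3]
```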

To get rid of duplicates you can convert the list to a set; for example:
x=['a','b','a']
y=set(x)
print(y)
will print something like:
{'a', 'b'}
(a set is unordered, so the elements may appear in a different order).
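
If the original order matters, one possible alternative (a sketch, assuming Python 3.7 or later, where dictionaries preserve insertion order) is to go through `dict.fromkeys` instead of `set`:

```python
x = ['a', 'b', 'a']

# dict keys are unique and keep first-seen order, so this de-duplicates
# while preserving the order in which items first appeared.
unique_in_order = list(dict.fromkeys(x))

print(unique_in_order)  # ['a', 'b']
```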

Compact way to write a list of lists

I am writing a program that outputs a list of ordered lists of numbers. Say the output is as follows:
[1,1,1];
[1,1,2]
I would like to look at the output by eye and make some sense of it, but my output is hundreds to thousands of lines long. I would like to write the output in the following more compact format: [1,1,1/2], where the slash indicates that in the third slot I can have a 1 or a 2. So, for a longer example, [1/2, 1/3, 5, 8/9] would be the compact way of writing [1,1,5,8];[1,1,5,9];[1,3,5,8]; etc. Can anyone suggest a pseudocode algorithm for accomplishing this?
Edit: All of the lists are the same length. Also, I expect in general to have multiple lists at the end. For example {[1,1,2], [1,1,3], [1,2,4]} should become {[1,1,2/3], [1,2,4]}.

What I'd do is use a hash for each element in the first list. You'd then iterate through the remaining lists and, for each position in those lists, check against the hash for that index to see whether you'd seen the value before. So you'd end up with something like:
[1 : {1}, 1: {1, 3}, 5: {5}, 8: {8, 9}]
And then when printing / formatting the list, you'd just print each key in the hash, except you'd use slashes or whatever.
EDIT: Rough sketch (Python):
def shorten_list(list_of_lists):
    primary_list = list_of_lists[0]
    # one dict per position; dict keys keep first-seen order (Python 3.7+)
    hash_values = [{} for _ in primary_list]
    for current_list in list_of_lists:
        for j, num in enumerate(current_list):
            if num not in hash_values[j]:
                hash_values[j][num] = True
    # join the distinct values seen at each position with slashes
    slots = ['/'.join(str(key) for key in current_dict)
             for current_dict in hash_values]
    print('[' + ', '.join(slots) + ']')
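
The per-position hash above collapses everything into a single row, which does not cover the case from the question's edit where several compact rows should remain. A different technique, sketched below under my own assumptions (the names `compact` and `render` are hypothetical, and the greedy merge order is not guaranteed to give the smallest possible output), is to merge two rows only when they differ in exactly one slot. Such a merge loses no information, because every other slot is identical.

```python
def compact(rows):
    """Greedily merge rows that differ in at most one slot."""
    slots = [tuple(frozenset([v]) for v in row) for row in rows]
    changed = True
    while changed:
        changed = False
        for a in range(len(slots)):
            for b in range(a + 1, len(slots)):
                diff = [k for k in range(len(slots[a]))
                        if slots[a][k] != slots[b][k]]
                if len(diff) <= 1:
                    # All other slots are equal, so a slot-wise union only
                    # widens the single differing slot (or drops a duplicate).
                    slots[a] = tuple(x | y for x, y in zip(slots[a], slots[b]))
                    del slots[b]
                    changed = True
                    break
            if changed:
                break
    return slots

def render(row):
    """Format one compact row, e.g. [1, 1, 2/3]."""
    return "[" + ", ".join("/".join(str(v) for v in sorted(s)) for s in row) + "]"

for row in compact([[1, 1, 2], [1, 1, 3], [1, 2, 4]]):
    print(render(row))
```

With the example from the question's edit this prints `[1, 1, 2/3]` followed by `[1, 2, 4]`.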

Here's actual code to compact the lists the way you wanted. But maybe the most useful visualization would be a scatter plot. Import the data into your favorite spreadsheet, and plot away.
$(document).ready( function(){
    var numbers = [
        [1, 1, 5, 8],
        [1, 1, 5, 9],
        [1, 3, 5],
        [1, 1, 5, 10, 15]];
    $('#output').text(JSON.stringify(compactNumbers(numbers)));
});
function compactNumbers(numberlists){
    var output = [];
    for(var i = 0; i < numberlists.length; i++){
        for(var j = 0; j < numberlists[i].length; j++) {
            // collect each distinct value seen at position j
            if(!output[j]) output[j] = [];
            if($.inArray(numberlists[i][j], output[j]) == -1){
                output[j].push(numberlists[i][j]);
            }
        }
    }
    return(output);
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<div id="output"></div>

mathematica dynamic shifter list

I want a shifter list for an input buffer. My code is:
(* Simulate input with a Slider. This works perfectly, but only for changes made by the user. *)
list = Table[0, {10}];
Slider[Dynamic[b, (b = #; list = Take[Join[list, {b}], -10]) &], {0,
 10, 1}]
Dynamic@list
(* x simulates the data input *)
Dynamic[x = RandomInteger[10], UpdateInterval -> 1]
(* Shifter list. Because 'a' changes, the code is re-evaluated. *)
Dynamic[Take[AppendTo[a, x], -10],UpdateInterval -> 1]
I want the code to run only when 'x' changes, not when 'a' changes. Please help.

Not sure how your Slider is related to the question, but here's an answer:
Use TrackedSymbols to specify what can trigger the second Dynamic.
Dynamic[x = RandomInteger[10], UpdateInterval -> .2]
a = {};
Dynamic[a = PadLeft[Flatten@{a, x}, 10], TrackedSymbols :> {x}]
There is no need for UpdateInterval then.
Keep in mind that operations in a Dynamic are performed only while its cell is visible. A better approach may be to use ScheduledTasks or a regular Do + Pause.
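
Independently of the Dynamic machinery, the buffer behaviour itself (always keep the last 10 samples, padded with zeros at the start) can be sketched in Python with `collections.deque`. This is only an illustration of the data structure; the helper name `make_buffer` and the sample values are made up.

```python
from collections import deque

def make_buffer(size=10):
    """Fixed-length buffer pre-filled with zeros; old items fall off the left."""
    return deque([0] * size, maxlen=size)

buf = make_buffer()
for sample in [3, 7, 1]:
    # append pushes on the right; maxlen discards the oldest item on the left
    buf.append(sample)

print(list(buf))  # [0, 0, 0, 0, 0, 0, 0, 3, 7, 1]
```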

gsub speed vs pattern length

I've been using gsub extensively lately, and I noticed that short patterns run faster than long ones, which is not surprising. Here is a fully reproducible example:
library(microbenchmark)
set.seed(12345)
n = 0
rpt = seq(20, 1461, 20)
msecFF = numeric(length(rpt))
msecFT = numeric(length(rpt))
inp = rep("aaaaaaaaaa",15000)
for (i in rpt) {
  n = n + 1
  print(n)
  patt = paste(rep("a", rpt[n]), collapse = "")
  #time = microbenchmark(func(count[1:10000,12], patt, "b"), times = 10)
  timeFF = microbenchmark(gsub(patt, "b", inp, fixed=F), times = 10)
  msecFF[n] = mean(timeFF$time)/1000000.
  timeFT = microbenchmark(gsub(patt, "b", inp, fixed=T), times = 10)
  msecFT[n] = mean(timeFT$time)/1000000.
}
library(ggplot2)
library(grid)
library(gridExtra)
axis(1,at=seq(0,1000,200),labels=T)
p1 = qplot(rpt, msecFT, xlab="pattern length, characters", ylab="time, msec",main="fixed = TRUE" )
p2 = qplot(rpt, msecFF, xlab="pattern length, characters", ylab="time, msec",main="fixed = FALSE")
grid.arrange(p1, p2, nrow = 2)
As you can see, I'm looking for a pattern that consists of the letter a repeated rpt[n] times. The slope is positive, as expected. However, I noticed a kink at 300 characters with fixed=T and at 600 characters with fixed=F, after which the slope seems to be approximately the same as before (see plot below).
I suppose it is due to memory, object size, etc. I also noticed that the longest allowed pattern is 1463 symbols, with an object size of 1552 bytes.
Can someone explain the kink better, and why it occurs at 300 and 600 characters?
Added: it is worth mentioning, that most of my patterns are 5-10 characters long, which gives me on my real data (not the mock-up inp in the example above) the following timing.
gsub, fixed = TRUE: ~50 msec per one pattern
gsub, fixed = FALSE: ~190 msec per one pattern
stringi, fixed = FALSE: ~55 msec per one pattern
gsub, fixed = FALSE, perl = TRUE: ~95 msec per one pattern
(I have 4k patterns, so total timing of my module is roughly 200 sec, which is exactly 0.05 x 4000 with gsub and fixed = TRUE. It is the fastest method for my data and patterns)

The kinks might be related to the number of bits required to hold patterns of that length.
There is another solution that scales much better: use the repetition operator {} to specify how many repeats you want to find. To match more than 255 repeats (the 8-bit integer maximum) you'll have to specify perl = TRUE.
patt2 <- paste0('a{',rpt[n],'}')
timeRF <- microbenchmark(gsub(patt2, "b", inp, perl = T), times = 10)
I get speeds of around 2.1 ms per search with no penalty for pattern length. That's about 8x faster than fixed = FALSE for small pattern lengths and about 60x faster for large pattern lengths.
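
To make the equivalence concrete, here is a small Python sketch (using Python's `re` module rather than R's gsub, so it says nothing about R timings) showing that a counted repetition and the spelled-out literal pattern produce the same replacement. The variable names and the input string are illustrative only.

```python
import re

n = 10
subject = "a" * 25

literal_pattern = "a" * n          # "aaaaaaaaaa", grows with n
counted_pattern = "a{%d}" % n      # "a{10}", stays short for any n

# Both replace each non-overlapping run of n a's with a single "b".
via_literal = re.sub(literal_pattern, "b", subject)
via_counted = re.sub(counted_pattern, "b", subject)

print(via_literal)  # bbaaaaa
print(via_counted)  # bbaaaaa
```

The counted form keeps the pattern text the same size regardless of the repeat count, which is the property the answer relies on.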

Generalization for regular expression on any list

I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type?
I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant.
edit:
My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like "[abc]{3,}.([ab]?[^a]{4,7})" etc. where a,b,c are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like "[a-z]" to work.

Regular expressions match only strings, by definition.
Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like \e for even numbers, \o for odd numbers, \s for square numbers, \r for real numbers etc., so that
[1, 2, 3, 4, 5, 6]
would be matched by
^(\o\e)*$
or
[ln(3), math.pi, sqrt(-1)]
would be matched by
^\R*$
etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.
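
A very reduced version of the token idea can be sketched in Python by classifying each number into a one-character token and running an ordinary regular expression over the token string. This is an assumption-laden illustration, not the grammar described above: the tokens `o` and `e` stand in for the hypothetical `\o` and `\e`, and the helper names are made up.

```python
import re

def encode_parity(numbers):
    """Map each integer to 'o' (odd) or 'e' (even)."""
    return "".join("e" if n % 2 == 0 else "o" for n in numbers)

def alternates_odd_even(numbers):
    """Equivalent of ^(\\o\\e)*$ on the token string."""
    return re.fullmatch(r"(oe)*", encode_parity(numbers)) is not None

print(alternates_odd_even([1, 2, 3, 4, 5, 6]))  # True
print(alternates_odd_even([1, 3, 4]))           # False
```

Because each item becomes exactly one character, match positions in the token string map directly back to indexes in the original list.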

Some of the regex syntax generalizes to generic sequences. Also, to be able to specify any object, strings are not the best medium for the expressions themselves.
"Small" example in python:
# Pattern constructors: build a small tree of tagged tuples.
def choice(*items):
    return ('choice',[value(item) for item in items])

def seq(*items):
    return ('seq',[value(item) for item in items])

def repeat(min,max,lazy,item):
    return ('repeat',min,max,lazy,value(item))

def value(item):
    if not isinstance(item,tuple):
        return ('value',item)
    return item

# Compile a pattern tree into a flat list of instructions with relative
# offsets: ('literal', x), ('jump', offset), ('choice', offset).
def compile(pattern):
    ret = []
    key = pattern[0]
    if key == 'value':
        ret.append(('literal',pattern[1]))
    elif key == 'seq':
        for subpattern in pattern[1]:
            ret.extend(compile(subpattern))
    elif key == 'choice':
        jumps = []
        n = len(pattern[1])
        for i,subpattern in enumerate(pattern[1]):
            if i < n-1:
                pos = len(ret)
                ret.append('placeholder for choice')
                ret.extend(compile(subpattern))
                jumps.append(len(ret))
                ret.append('placeholder for jump')
                ret[pos] = ('choice',len(ret)-pos)
            else:
                ret.extend(compile(subpattern))
        for pos in jumps:
            ret[pos] = ('jump', len(ret)-pos)
    elif key == 'repeat':
        min,max,lazy,subpattern = pattern[1:]
        for _ in range(min):
            ret.extend(compile(subpattern))
        if max == -1:
            if lazy:
                pos = len(ret)
                ret.append('placeholder for jump')
                ret.extend(compile(subpattern))
                ret[pos] = ('jump',len(ret)-pos)
                ret.append(('choice',pos+1-len(ret)))
            else:
                pos = len(ret)
                ret.append('placeholder for choice')
                ret.extend(compile(subpattern))
                ret.append(('jump',pos-len(ret)))
                ret[pos] = ('choice',len(ret)-pos)
        elif max > min:
            if lazy:
                jumps = []
                for _ in range(min,max):
                    ret.append(('choice',2))
                    jumps.append(len(ret))
                    ret.append('placeholder for jump')
                    ret.extend(compile(subpattern))
                for pos in jumps:
                    ret[pos] = ('jump', len(ret)-pos)
            else:
                # greedy: try one more repetition first, else skip to the end
                choices = []
                for _ in range(min,max):
                    choices.append(len(ret))
                    ret.append('placeholder for choice')
                    ret.extend(compile(subpattern))
                for pos in choices:
                    ret[pos] = ('choice',len(ret)-pos)
    return ret

# Run the instructions with a backtracking stack; returns the end position
# of a match starting at `start`, or None.
def match(pattern,subject,start=0):
    stack = []
    pos = start
    i = 0
    while i < len(pattern):
        instruction = pattern[i]
        key = instruction[0]
        if key == 'literal':
            if pos < len(subject) and subject[pos] == instruction[1]:
                i += 1
                pos += 1
                continue
        elif key == 'jump':
            i += instruction[1]
            continue
        elif key == 'choice':
            stack.append((i+instruction[1],pos))
            i += 1
            continue
        # fail
        if not stack:
            return None
        i,pos = stack.pop()
    return pos

def find(pattern,subject,start=0):
    for pos1 in range(start,len(subject)+1):
        pos2 = match(pattern,subject,pos1)
        if pos2 is not None: return pos1,pos2
    return None,None

def find_all(pattern,subject,start=0):
    matches = []
    pos1,pos2 = find(pattern,subject,start)
    while pos1 is not None:
        matches.append((pos1,pos2))
        pos1,pos2 = find(pattern,subject,pos2)
    return matches

# Timestamps: ([01][0-9]|2[0-3])[0-5][0-9]
pattern = compile(
    seq(
        choice(
            seq(choice(0,1),choice(0,1,2,3,4,5,6,7,8,9)),
            seq(2,choice(0,1,2,3)),
        ),
        choice(0,1,2,3,4,5),
        choice(0,1,2,3,4,5,6,7,8,9),
    )
)
print(find_all(pattern,[1,3,2,5,6,3,4,2,4,3,2,2,3,6,6,5,3,5,3,3,2,5,4,5]))
# matches: (0,4): [1,3,2,5]; (10,14): [2,2,3,6]
A few points of improvement:
More constructs: classes with negation, ranges
Classes instead of tuples

If you really need a free-form grammar like the one regular expressions provide, then you have to take the approach described in Tim's answer.
If you only have a fixed number of patterns to search for, then the easiest and fastest way is to write your own search/filter functions.
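
For the fixed-pattern case, a hand-written search can be very short. The following Python sketch (the function name is hypothetical) finds every start index at which a given sub-list occurs in a list of integers:

```python
def find_subsequence(items, needle):
    """Return all start indexes where `needle` occurs contiguously in `items`."""
    n = len(needle)
    # Compare each window of length n against the needle.
    return [i for i in range(len(items) - n + 1) if items[i:i + n] == needle]

print(find_subsequence([1, 2, 3, 1, 2, 3, 4], [2, 3]))  # [1, 4]
```

This naive scan costs roughly len(items) × len(needle) comparisons, which is usually acceptable for a handful of short patterns.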

Interesting problem indeed.
Lateral thinking: download the .NET Framework source code, lift the Regex source code and adapt it to work with integers rather than characters.
Just an idea.

Well, Erlang has pattern matching (of your type) built right in. I did something similar once in Ruby - a bit of probably not too well performing hackery, see http://radiospiel.org/0x16-its-a-bird

Clojure since version 1.9 has clojure.spec in the standard library, which can do exactly that and more. For example to describe a sequence of odd numbers that may end with one even number you'd write:
(require '[clojure.spec.alpha :as s])
(s/def ::odds-then-maybe-even (s/cat :odds (s/+ odd?)
:even (s/? even?)))
Then to get matching subsequences you'd do this:
(s/conform ::odds-then-maybe-even [1 3 5 100])
;;=> {:odds [1 3 5], :even 100}
(s/conform ::odds-then-maybe-even [1])
;;=> {:odds [1]}
And to find out why a sequence doesn't match your definition:
(s/explain ::odds-then-maybe-even [100])
;; In: [0] val: 100 fails spec: ::odds-then-maybe-even at: [:odds] predicate: odd?
See full documentation with examples at https://clojure.org/guides/spec

You can try pamatcher, a JavaScript library that generalizes the notion of regular expressions to any sequence of items (of any type).
An example for "[abc]{3,}.([ab]?[^a]{4,7})" pattern matching, where a, b, c are integers:
var pamatcher = require('pamatcher');
var a = 10;
var b = 20;
var c = 30;
var matcher = pamatcher([
  { repeat:
    { or: [a, b, c] },
    min: 3
  },
  () => true,
  { optional:
    { or: [a, b] }
  },
  { repeat: (i) => i != a,
    min: 4,
    max: 7
  }
]);
var input = [1, 4, 8, 44, 55];
var result = matcher.test(input);
if(result) {
  console.log("Pattern matches!");
} else {
  console.log("Pattern doesn't match.");
}
Note: I am the creator of this library.

We Keep Coding
