Generalization for regular expression on any list - regex

I'm working with a large list containing integers and I would like to do some pattern matching on them (like finding certain sequences). Regular expressions would be the perfect fit, except that they always seem to only handle lists of characters, a.k.a. strings. Is there any library (in any language) that can handle lists of an arbitrary type?
I'm aware that I could convert my integer list into a string and then do a normal regular expression search but that seems a bit wasteful and inelegant.
edit:
My requirements are fairly simple. No need for nested lists, no need for fancy character classes. Basically I'm just interested in occurrences of sequences that can get pretty complicated. (e.g. something like "[abc]{3,}.([ab]?[^a]{4,7})" etc. where a,b,c are integers). This should be possible to generalize over any type which can be checked for equality. For an enumerable type you could also get things like "[a-z]" to work.

Regular expressions match only strings, by definition.
Of course, in theory you could construct an equivalent grammar, say for lists of numbers. With new tokens like \e for even numbers, \o for odd numbers, \s for square numbers, \r for real numbers etc., so that
[1, 2, 3, 4, 5, 6]
would be matched by
^(\o\e)*$
or
[ln(3), math.pi, sqrt(-1)]
would be matched by
^\R*$
etc. Sounds like a fun project, but also like a very difficult one. And how this could be expanded to handle arbitrary lists, nested and all, is beyond me.

Some of the regex syntax generalize to generic sequences. Also, to be able to specify any object, strings is not the best medium for the expression themselves.
"Small" example in python:
def choice(*items):
return ('choice',[value(item) for item in items])
def seq(*items):
return ('seq',[value(item) for item in items])
def repeat(min,max,lazy,item):
return ('repeat',min,max,lazy,value(item))
def value(item):
if not isinstance(item,tuple):
return ('value',item)
return item
def compile(pattern):
ret = []
key = pattern[0]
if key == 'value':
ret.append(('literal',pattern[1]))
elif key == 'seq':
for subpattern in pattern[1]:
ret.extend(compile(subpattern))
elif key == 'choice':
jumps = []
n = len(pattern[1])
for i,subpattern in enumerate(pattern[1]):
if i < n-1:
pos = len(ret)
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
jumps.append(len(ret))
ret.append('placeholder for jump')
ret[pos] = ('choice',len(ret)-pos)
else:
ret.extend(compile(subpattern))
for pos in jumps:
ret[pos] = ('jump', len(ret)-pos)
elif key == 'repeat':
min,max,lazy,subpattern = pattern[1:]
for _ in xrange(min):
ret.extend(compile(subpattern))
if max == -1:
if lazy:
pos = len(ret)
ret.append('placeholder for jump')
ret.extend(compile(subpattern))
ret[pos] = ('jump',len(ret)-pos)
ret.append(('choice',pos+1-len(ret)))
else:
pos = len(ret)
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
ret.append(('jump',pos-len(ret)))
ret[pos] = ('choice',len(ret)-pos)
elif max > min:
if lazy:
jumps = []
for _ in xrange(min,max):
ret.append(('choice',2))
jumps.append(len(ret))
ret.append('placeholder for jump')
ret.extend(compile(subpattern))
for pos in jumps:
ret[pos] = ('jump', len(ret)-pos)
else:
choices = []
for _ in xrange(min,max):
choices.append(len(ret))
ret.append('placeholder for choice')
ret.extend(compile(subpattern))
ret.append(('drop,'))
for pos in choices:
ret[pos] = ('choice',len(ret)-pos)
return ret
def match(pattern,subject,start=0):
stack = []
pos = start
i = 0
while i < len(pattern):
instruction = pattern[i]
key = instruction[0]
if key == 'literal':
if pos < len(subject) and subject[pos] == instruction[1]:
i += 1
pos += 1
continue
elif key == 'jump':
i += instruction[1]
continue
elif key == 'choice':
stack.append((i+instruction[1],pos))
i += 1
continue
# fail
if not stack:
return None
i,pos = stack.pop()
return pos
def find(pattern,subject,start=0):
for pos1 in xrange(start,len(subject)+1):
pos2 = match(pattern,subject,pos1)
if pos2 is not None: return pos1,pos2
return None,None
def find_all(pattern,subject,start=0):
matches = []
pos1,pos2 = find(pattern,subject,start)
while pos1 is not None:
matches.append((pos1,pos2))
pos1,pos2 = find(pattern,subject,pos2)
return matches
# Timestamps: ([01][0-9]|2[0-3])[0-5][0-9]
pattern = compile(
seq(
choice(
seq(choice(0,1),choice(0,1,2,3,4,5,6,7,8,9)),
seq(2,choice(0,1,2,3)),
),
choice(0,1,2,3,4,5),
choice(0,1,2,3,4,5,6,7,8,9),
)
)
print find_all(pattern,[1,3,2,5,6,3,4,2,4,3,2,2,3,6,6,5,3,5,3,3,2,5,4,5])
# matches: (0,4): [1,3,2,5]; (10,14): [2,2,3,6]
A few points of improvement:
More constructs: classes with negation, ranges
Classes instead of tuples

If you really need a free grammar like in regular expressions, then you have to go a way as described in Tim's answer.
If you only have a fixed number of patterns to search for, then the easiest and fastest way is to write your own search/filter functions.

Interesting problem indeed.
Lateral thinking: download the .Net Framework Source code, lift the Regex source code and adapt it to work with integers rather than characters.
Just an idea.

Well, Erlang has pattern matching (of your type) built right in. I did something similar once in Ruby - a bit of probably not too well performing hackery, see http://radiospiel.org/0x16-its-a-bird

Clojure since version 1.9 has clojure.spec in the standard library, which can do exactly that and more. For example to describe a sequence of odd numbers that may end with one even number you'd write:
(require '[clojure.spec.alpha :as s])
(s/def ::odds-then-maybe-even (s/cat :odds (s/+ odd?)
:even (s/? even?)))
Then to get matching subsequences you'd do this:
(s/conform ::odds-then-maybe-even [1 3 5 100])
;;=> {:odds [1 3 5], :even 100}
(s/conform ::odds-then-maybe-even [1])
;;=> {:odds [1]}
And to find out why a sequence doesn't match your definition:
(s/explain ::odds-then-maybe-even [100])
;; In: [0] val: 100 fails spec: ::odds-then-maybe-even at: [:odds] predicate: odd?
See full documentation with examples at https://clojure.org/guides/spec

You can try pamatcher, it's a JavaScript library that generalize the notion of regular expressions for any sequence of items (of any type).
An example for "[abc]{3,}.([ab]?[^a]{4,7})" pattern matching, where a, b, c are integers:
var pamatcher = require('pamatcher');
var a = 10;
var b = 20;
var c = 30;
var matcher = pamatcher([
{ repeat:
{ or: [a, b, c] },
min: 3
},
() => true,
{ optional:
{ or: [a, b] }
},
{ repeat: (i) => i != a,
min: 4,
max: 7
}
]);
var input = [1, 4, 8, 44, 55];
var result = matcher.test(input);
if(result) {
console.log("Pattern matches!");
} else {
console.log("Pattern doesn't match.");
}
Note: I am the creator of this library.

Related

Groovy get all elements in a list under certain index

I want to collect all the elements in a specific array list under a specific index. Let's say I have this list:
def names = ["David", "Arthur", "Tommy", "Jacob"]
I want to print all the names under the index of 2, which in this case, will print "David, Arthur"
Now I can use a for loop quite easily with that or even groovy's eachWithIndex(). The problem is I don't want to run all over the elements because that's not efficient. Rather than that, I want to run until a specific point.
Is there any method in groovy which does that , because I didn't find one.
Thanks in advance!
Since you're starting at index 0, the take method is probably the simplest approach.
def names = ["David", "Arthur", "Tommy", "Jacob"]
assert ['David', 'Arthur'] == names.take(2)
Given the list
def names = ["David", "Arthur", "Tommy", "Jacob"]
you can use any of the options below:
assert [ "Tommy", "Jacob" ] == names[ 2..<4 ]
assert [ "Tommy", "Jacob" ] == names[ 2..-1 ]
assert [ "Tommy", "Jacob" ] == names.subList( 2, 4 )
assert [ "Tommy", "Jacob" ] == names.drop( 2 )
Note, that each of these methods create a new List instance, so if the memory considerations are of importance, it makes sense to skip the elements using e.g. eachWithIndex() method or alike
Using Ranges:
​def names = ["David", "Arthur", "Tommy", "Jacob"]
def idx = 2
def sublist = names[0..idx-1]
sublist.each { n ->
println n
}
Using more syntactic sugar:
names[0..<idx]

Compare strings and just keep those who have on same positions different characters [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 7 years ago.
Improve this question
my question is the following: I have a file which contains around 70 strings, all of them have 6 characters (either a,c,g or t for every position -> these are short DNA-sequences).
For example:
accggt agctta gggatc gactta ccttgg
What I need are the strings which are completely unique. Which have on every position a different character (base) compared with the other strings.
In this case I would get two matches (I define them as lists but this is only an idea for the output format):
[accggt , gggatc]
[gggatc , ccttgg]
The elements of list one are on every position different and so are also the elements of list 2.
Is there a build-in function which can do it? I also thought of regular expression but I'm not that familar with this approach.
Thanks in advance!
Edit:
Ok, it seems it is not that easy to describe. So lets go into more detail:
Let's take the five strings mentioned above:
I would start to compare the first string with all the other strings and then continue with string 2 comparing with all other strings and so on.
The first character of the first string is an a.
The first character of the second string is also an a.
This means I would discard the second string.
The first character of the third string is an g.
Fine.
The second character of the first string is an c.
The second character of the third string is an g.
Fine.
The third character of the first string is an c.
The third character of the third string is an g.
Fine.
The fourth ... and so on.
And if all characters of a string are different from the characters of another string (on every position like described above) I would keep those two strings and would search for the next strings which are different on every position compared to the strings I already found. Because I only have four letters there should be only four possibilities fo different strings.
I should end up with, probably a list, which contains the groups of strings which are different in every position.
I hope this helps.
You can use the following algorithm: iterate through all possible word combinations in your string and check each pair for equality with if [x == y for (x, y) in zip(word, nextWord)].count(True) == 0:.
Here is a snippet:
s = "accggt agctta gggatc gactta ccttgg"
chks = s.split(" ");
for word in chks:
for nextWord in chks:
if word != nextWord:
if [x == y for (x, y) in zip(word, nextWord)].count(True) == 0:
print([word, nextWord])
Result of the IDEONE demo:
['accggt', 'gggatc']
['gggatc', 'accggt']
['gggatc', 'ccttgg']
['ccttgg', 'gggatc']
UPDATE
You can deduplicate the list with a custom function. Here is an updated snippet:
def dedup(lst):
seen = set()
result = []
for item in lst:
fs = frozenset(item)
if fs not in seen:
result.append(item)
seen.add(fs)
return result
res = []
s = "accggt agctta gggatc gactta ccttgg"
chks = s.split(" ");
for word in chks:
for nextWord in chks:
if word != nextWord:
if [x == y for (x, y) in zip(word, nextWord)].count(True) == 0:
res.append([word, nextWord])
print(dedup(res))
Result: [['accggt', 'gggatc'], ['gggatc', 'ccttgg']].
To check the words by 3, you need to create all possible permutations of the string into 3-word combinations and use something like:
from itertools import permutations
def dedup(lst):
seen = set()
result = []
for item in lst:
fs = frozenset(item)
if fs not in seen:
result.append(item)
seen.add(fs)
return result
res = []
s = "accggt agctta gggatc gactta ccttgg"
chks = s.split(" ");
perms = [p for p in permutations(chks, 3)]
for perm in perms:
if [(x == y or y == z or x == z) for (x, y, z) in zip(*perm)].count(True) == 0:
res.append(perm)
print(dedup(res))
To find the DNA strings which are completely different on every character you have to check every string against any other string if any character of the given string is the same character on the same position in the comparing string.
Here is an example code for that:
# read all dna strings into a list of strings
dna = ['accggt', 'agctta', 'gggatc', 'gactta', 'ccttgg', '123456']
def compare_two_dna(dna1, dna2):
i = 0
l = len(dna1)
while(i < l):
if dna1[i] == dna2[i]:
return True
i += 1
return False
def is_dna_unique(d, dna_strings):
return len(filter(lambda x: compare_two_dna(d, x), dna_strings)) == 1
# filter all items which only occure once in the list
unique_dna = filter(lambda d: is_dna_unique(d, dna), dna)
print(unique_dna)
The result here is: 123456
var dnaList = "accggt agctta gggatc gactta ccttgg".split( " " );
function getUniqueDnas( dna_list ){
var result = [];
for( var d1 in dna_list ){
var isRepeat = false;
var dna1 = dna_list[ d1 ];
for( var d2 in dna_list ){
var dna2 = dna_list[ d2 ];
if( dna1 == dna2 ){
isRepeat = true;
break;
}
}
if( !isRepeat )
result.push( dna1 );
}
return result;
}
var uniqueDnaList = getUniqueDnas( dnaList );

Pig Latin involving digits

I'm new to computer programming and I'm trying to write a program that converts English into Pig Latin (For every word, move the first letter to the end of the word and add 'ay').
If the there is a number (in digits), multiply it by 2 and add 4.
ex. John has 4 cats --> ndaay ashay 12 atscay)
I got the first pig latin part down but can't seem to figure out the number part. My code accesses a text file but here is the program that would perform the string pig-latin. Where would I fit the number function?
def pig_english():
letterlist = [i + i[0] for i in read_script()]
ayList = [i + 'ay' for i in letterlist]
delaylist = [i[1:] for i in ayList]
print (delaylist)
You can test if i.isdigit() and then cast to an int but it will be easier doing it all in one comprehension:
def pig_english(words):
ayList = [str(int(i)*2+4) if i.isdigit() else i[1:]+i[0]+"ay" for i in words]
print (ayList)
If you split the operations across multiple comprehensions then you will need to guard against ints:
def pig_english(words):
numberlist = [int(i)*2+4 if i.isdigit() else i for i in words]
letterlist = [i if isinstance(i, int) else i + i[0] for i in numberlist]
ayList = [i if isinstance(i, int) else i + 'ay' for i in letterlist]
delaylist = [str(i) if isinstance(i, int) else i[1:] for i in ayList]
print (delaylist)
>>> pig_english("John has 4 cats".split())
['ohnJay', 'ashay', '12', 'atscay']

Way of fast iterating through list

I have a question about iterating through lists.
Let's say i have list of maps with format
def listOfMaps = [ ["date":"2013/05/23", "id":"1"],
["date":"2013.05.23", "id":"2"],
["date":"2013-05-23", "id":"3"],
["date":"23/05/2013", "id":"4"] ]
Now i have a list of two patterns (in reality i have a lot more :D)
def patterns = [
/\d{4}\/\d{2}\/\d{2}/, //'yyyy/MM/dd'
/\d{4}\-\d{2}\-\d{2}/ //'yyyy-MM-dd'
]
I want to println dates only with the "yyyy/MM/dd" and "yyyy-MM-dd" format so i have to go through the lists
for (int i = 0; i < patterns.size(); i++) {
def findDates = listOfMaps.findAll{it.get("word") ==~ patterns[i] ?
dateList << it : "Nothing found"}
}
but i have a problem with this way. What if the list "listOfMaps" gonna be huge? It will take a lot of time to find all patters because this code will have to go through the whole list of patters and the same amount of time it will have to go through list of maps wich in case of huge lists might take a long while :). I tried with forEach inside the findAll clousure it does not work.
So my question is is there any way to go through the list of patterns inside the findAll clousure? For instance sth like this in pseudocode
def findDates = listOfMaps.findAll{it.get("word") ==~ for(){patterns[i]} ? : }
so in that case it goes only once through the listOfMaps list and it iterates through patterns(which always is way way way way smaller than listOfMaps).
I might have an idea to create a function that returns the instance of list, but i'm struggling to implement this :).
Thanks in advance for response.
You could do:
def listOfMaps = [ [date:"2013/05/23", id:"1"],
[date:"2013.05.23", id:"2"],
[date:"2013-05-23", id:"3"],
[date:"23/05/2013", id:"4"] ]
def patterns = [
/\d{4}\/\d{2}\/\d{2}/, //'yyyy/MM/dd'
/\d{4}\-\d{2}\-\d{2}/ //'yyyy-MM-dd'
]
def foundRecords = listOfMaps.findAll { m ->
patterns.find { p ->
m.date ==~ p
}
}

Groovy way to check if a list is sorted or not

Does Groovy have a smart way to check if a list is sorted? Precondition is that Groovy actually knows how to sort the objects, e.g. a list of strings.
The way I do right now (with just some test values for this example) is to copy the list to a new list, then sort it and check that they are equal. Something like:
def possiblySorted = ["1", "2", "3"]
def sortedCopy = new ArrayList<>(possiblySorted)
sortedCopy.sort()
I use this in unit tests in several places so it would be nice with something like:
def possiblySorted = ["1", "2", "3"]
possiblySorted.isSorted()
Is there a good way like this to check if a list is sorted in Groovy, or which is the preffered way? I would almost expect Groovy to have something like this, since it is so smart with collections and iteration.
If you want to avoid doing an O(n*log(n)) operation to check if a list is sorted, you can iterate it just once and check if every item is less or equals than the next one:
def isSorted(list) {
list.size() < 2 || (1..<list.size()).every { list[it - 1] <= list[it] }
}
assert isSorted([])
assert isSorted([1])
assert isSorted([1, 2, 2, 3])
assert !isSorted([1, 2, 3, 2])
Why not just compare it to a sorted instance of the same list?
def possiblySorted = [ 4, 2, 1 ]
// Should fail
assert possiblySorted == possiblySorted.sort( false )
We pass false to the sort method, so it returns a new list rather than modifying the existing one
You could add a method like so:
List.metaClass.isSorted = { -> delegate == delegate.sort( false ) }
Then, you can do:
assert [ 1, 2, 3 ].isSorted()
assert ![ 1, 3, 2 ].isSorted()