Find all possible permutations of a lowercase chemical formula - regex

I'm trying to resolve ambiguity in a lowercase chemical formula. Since some element names are substrings of other element names, and they're all run together, there can be multiple global matches for the same pattern.
Considering the regex /^((h)|(s)|(hg)|(ga)|(as))+$/ against the string hgas. There are two possible matches. hg, as and h, s, ga (out of order compared to input, but not an issue). Obviously a regex for all possible symbols would be longer, but this example was done for simplicity.
Regex's powerful lookahead and lookbehind allow it to conclusively determine whether even a very long string does match this pattern or there are no possible permutations of letters. It will diligently try all possible permutations of matches, and, for example, if it hits the end of the string with a leftover g, go back through and retry a different combination.
I'm looking for a regular expression, or a language with some kind of extension, that adds on the ability to keep looking for matches after one is found, in this case, finding h, s, ga as well as hg, as.
Rebuilding the complex lookahead and lookbehind functionality of regex for this problem does not seem like a reasonable solution, especially considering the final regex also includes a \d* after each symbol.
I thought about reversing the order of the regexp, /^((as)|(ga)|(hg)|(s)|(h))+$/, to find additional mappings, but at most this will only find one additional match, and I don't have the theoretical background in regex to know if it's even reasonable to try.
I've created a sample page using my existing regex which finds 1 or 0 matches for a given lowercase string and returns it properly capitalized (and out of order). It uses the first 100 chemical symbols in its matching.
http://www.ptable.com/Script/lowercase_formula.php?formula=hgas
tl;dr: I have a regex to match 0 or 1 possible chemical formula permutations in a string. How do I find more than 1 match?

I'm well-aware this answer might be off-topic (as in the approach), but I think it is quite interesting, and it solves the OP's problem.
If you don't mind learning a new language (Prolog), then it might help you generate all possible combinations:
name(X) :- member(X, ['h', 's', 'hg', 'ga', 'as']).
parse_([], []).
parse_(InList, [HeadAtom | OutTail]) :-
atom_chars(InAtom, InList),
name(HeadAtom),
atom_concat(HeadAtom, TailAtom, InAtom),
atom_chars(TailAtom, TailList),
parse_(TailList, OutTail).
parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgas', Out).
Out = [h, ga, s] ;
Out = [hg, as] ;
false.
The improved version, which also handles numbers, is a bit longer:
isName(X) :- member(X, ['h', 's', 'hg', 'ga', 'as', 'o', 'c']).
% Collect all numbers, since it will not be part of element name.
collect([],[],[]).
collect([H|T], [], [H|T]) :-
\+ char_type(H, digit), !.
collect([H|T], [H|OT], L) :-
char_type(H, digit), !, collect(T, OT, L).
parse_([], []).
parse_(InputChars, [Token | RestTokens]) :-
atom_chars(InputAtom, InputChars),
isName(Token),
atom_concat(Token, TailAtom, InputAtom),
atom_chars(TailAtom, TailChars),
parse_(TailChars, RestTokens).
parse_(InputChars, [Token | RestTokens]) :-
InputChars = [H|_], char_type(H, digit),
collect(InputChars, NumberChars, TailChars),
atom_chars(Token, NumberChars),
parse_(TailChars, RestTokens).
parse(In, Out) :- atom_chars(In, List), parse_(List, Out).
Sample run:
?- parse('hgassc20h245o', X).
X = [h, ga, s, s, c, '20', h, '245', o] ;
X = [hg, as, s, c, '20', h, '245', o] ;
false.
?- parse('h2so4', X).
X = [h, '2', s, o, '4'] ;
false.
?- parse('hgas', X).
X = [h, ga, s] ;
X = [hg, as] ;
false.
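For readers who would rather stay out of Prolog, the same backtracking idea can be sketched in Python with a recursive generator. This is only a rough equivalent of the predicates above, not a translation of them: the function name `parse` and the default symbol tuple are assumptions made for illustration, and a run of digits is always taken as a single token.

```python
def parse(formula, symbols=('h', 's', 'hg', 'ga', 'as', 'o', 'c')):
    """Yield every token list that consumes `formula` completely."""
    if not formula:
        # Nothing left to consume: exactly one (empty) continuation.
        yield []
        return
    if formula[0].isdigit():
        # Take the whole run of digits as one count token.
        end = 1
        while end < len(formula) and formula[end].isdigit():
            end += 1
        for rest in parse(formula[end:], symbols):
            yield [formula[:end]] + rest
        return
    for sym in symbols:
        # Try each symbol that fits here; dead ends simply yield nothing.
        if formula.startswith(sym):
            for rest in parse(formula[len(sym):], symbols):
                yield [sym] + rest


if __name__ == '__main__':
    for tokens in parse('hgassc20h245o'):
        print(tokens)
```

Because it is a generator, results are produced one at a time, much like asking Prolog for the next solution with `;`.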

The reason you haven't found a generalized regex library that does this is because it's not possible with all regular expressions to do this. There are regular expressions that will not terminate.
Imagine with your example that you just added empty string to the list of terms, then
'hgas' could be:
['hg', 'as']
['hg', '', 'as']
['hg', '', '', 'as']
You'll probably just have to roll your own.
In pseudocode:
def findall(term, possible):
    out = []
    # for all the terms
    for pos in possible:
        # if it is a candidate
        if term.startswith(pos):
            # combine it with all possible endings; if nothing
            # continues the match, keep the prefix on its own
            rest = term[len(pos):]
            for combo in findall(rest, possible) or [[]]:
                out.append([pos] + combo)
    return out
findall('hgas', ['h', 'as', ...])
The above runs in exponential time, so dynamic programming is the way to keep this from being an exponentially large problem. Memoization for the win.
The last thing worth noting is that the above code doesn't check that it fully matches,
i.e. 'hga' might return ['hg'] as a possibility.
I'll leave the actual coding, memoization, and this last hiccup as, as my profs lovingly say, 'an exercise for the reader'.
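One possible way to do that exercise, as a minimal sketch rather than the answerer's own code: memoize on the start index with `functools.lru_cache`, and only count a split when it reaches the end of the string. The names `find_all`, `splits`, and `SYMBOLS` are made up for this example, and the symbol list is the toy one from the question.

```python
from functools import lru_cache

SYMBOLS = ('h', 's', 'hg', 'ga', 'as')  # toy symbol list from the question


def find_all(term, symbols=SYMBOLS):
    """Return every way to split `term` completely into symbols."""

    @lru_cache(maxsize=None)
    def splits(start):
        # Reaching the end means everything was consumed: one empty split.
        if start == len(term):
            return ((),)
        found = []
        for sym in symbols:
            if term.startswith(sym, start):
                # Each suffix is solved once and then served from the cache.
                for rest in splits(start + len(sym)):
                    found.append((sym,) + rest)
        return tuple(found)

    return [list(split) for split in splits(0)]


if __name__ == '__main__':
    print(find_all('hgas'))
    print(find_all('hgg'))
```

Memoization saves the repeated work per suffix, but listing the splits still takes time proportional to how many there are.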

This is not a job for regexp, you need something more like a state machine.
You'd need to parse the string popping out all known symbols, stopping if there is none, and continuing. If the whole string gets consumed on one branch, you have found a possibility.
In PHP, something like:
<?php
$Elements = array('h','he','li','s','ga','hg','as',
// "...and there may be many others, but they haven't been discovered".
);
function check($string, $found = array(), $path = '')
{
GLOBAL $Elements, $calls;
$calls++;
if ('' == $string)
{
if (!empty($path))
$found[] = $path;
return $found;
}
$es = array_unique(array(
substr($string, 0, 1), // Element of 1 letter ("H")
substr($string, 0, 2), // Element of 2 letter ("He")
));
foreach($es as $e)
if (in_array($e, $Elements))
$found = check(substr($string, strlen($e)), $found, $path.$e.',');
return $found;
}
print_r(check('hgas'));
print "in $calls calls.\n";
?>

Don't use regex. A regex matches only one element, as you say; instead you need to find all the possible "meanings" of your string. Given that each element's length is 1-2 chars, I'd go with this algorithm (forgive the pseudocode):
string[][] results;
void formulas(string[] elements, string formula){
    // copy, so that the two branches do not share one list
    string[] elements2=copy(elements);
    if(checkSymbol(formula.substring(0,1))){
        elements.append(formula.substring(0,1));
        if(formula.length-1 ==0){
            results.append(elements);
        } else {
            formulas(elements,formula.substring(1,formula.length));
        }
    }
    if(formula.length>=2 && checkSymbol(formula.substring(0,2))){
        elements2.append(formula.substring(0,2));
        if(formula.length-2 ==0){
            results.append(elements2);
        } else {
            formulas(elements2,formula.substring(2,formula.length));
        }
    }
}
bool checkSymbol(string symbol){
    // verifies if symbol is the name of an element
}
input "hgas" (let's go depth first)
first step:
checkSymbol(formula.substring(0,1)) is true for "h"
elements2 = [h]
recursive call, if(checkSymbol(formula.substring(0,1))) false
then it tests ga => true
elements2 = [h, ga]
third recursive call
test s, checksymbol returns true, elements is then [h, ga, s]. Length of the substring is 0 so it appends to results the first array: [h, ga, s]
--
let's go back to the second "branch" of first step
The test checkSymbol(formula.substring(0,2) finds that "hg" is an element as well
elements2 = [hg]
Then we call formulas([hg],"as")
The test for "a" fails (it is not an element) and the test for "as" works, the length is totally consumed, the result [hg,as] is appended to results[]
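This algorithm is fast on real formulas, because few prefixes are valid symbols. The worst case is not O(n^2), though: if every one- and two-letter prefix were a symbol, each call would branch twice and the number of results would grow like the Fibonacci numbers in n, n being the length of the string.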
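To get a feel for how many results such a recursion can produce, the splits can be counted without listing them. The sketch below is an illustration only: `count_splits` is a hypothetical helper, and the symbol set {'a', 'aa'} is an artificial worst case, not real chemistry.

```python
def count_splits(formula, symbols):
    """Count complete splits of `formula` into symbols of length 1 or 2."""
    n = len(formula)
    ways = [0] * (n + 1)
    ways[n] = 1  # the empty suffix can be split in exactly one way
    # Work backwards: splits starting at i are built from later positions.
    for i in range(n - 1, -1, -1):
        for size in (1, 2):
            if i + size <= n and formula[i:i + size] in symbols:
                ways[i] += ways[i + size]
    return ways[0]


if __name__ == '__main__':
    print(count_splits('hgas', {'h', 's', 'hg', 'ga', 'as'}))  # 2
    print(count_splits('a' * 10, {'a', 'aa'}))                 # 89
```

Counting takes one pass over the string, while the count itself follows the Fibonacci sequence when every prefix is a symbol, so enumerating every split can take far longer than counting them.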

Related

Is there a pythonic way to count the number of leading matching characters in two strings?

For two given strings, is there a pythonic way to count how many consecutive characters of both strings (starting at position 0 of the strings) are identical?
For example, in aaa_Hello and aa_World the "leading matching characters" are aa, having a length of 2. In the strings another and example there are no leading matching characters, which would give a length of 0.
I have written a function to achieve this, which uses a for loop and thus seems very unpythonic to me:
def matchlen(string0, string1): # Note: does not work if a string is ''
    for counter in range(min(len(string0), len(string1))):
        # run until there is a mismatch between the characters in the strings
        if string0[counter] != string1[counter]:
            # in this case the function terminates
            return(counter)
    return(counter+1)
matchlen(string0='aaa_Hello', string1='aa_World') # returns 2
matchlen(string0='another', string1='example') # returns 0
You could use zip and enumerate:
def matchlen(str1, str2):
    i = -1 # needed if you don't enter the loop (an empty string)
    for i, (char1, char2) in enumerate(zip(str1, str2)):
        if char1 != char2:
            return i
    return i+1
An unexpected function in os.path, commonprefix, can help (because it is not limited to file paths, any strings work). It can also take in more than 2 input strings.
Return the longest path prefix (taken character-by-character) that is a prefix of all paths in list. If list is empty, return the empty string ('').
from os.path import commonprefix
print(len(commonprefix(["aaa_Hello","aa_World"])))
output:
2
from itertools import takewhile
common_prefix_length = sum(
1 for _ in takewhile(lambda x: x[0]==x[1], zip(string0, string1)))
zip will pair up letters from the two strings; takewhile will yield them as long as they're equal; and sum will see how many there are.
As bobble bubble says, this indeed does exactly the same thing as your loopy thing. Its sole pro (and also its sole con) is that it is a one-liner. Take it as you will.
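Another variation on the same zip idea, offered as a sketch rather than a recommendation: ask `next` for the first index where the characters differ, and fall back to the length of the shorter string when there is no mismatch. This also covers empty strings without a special case. The function name mirrors the question's `matchlen`.

```python
def matchlen(string0, string1):
    """Length of the common leading run of two strings."""
    pairs = enumerate(zip(string0, string1))
    shortest = min(len(string0), len(string1))
    # First index where the characters differ, else the shorter length.
    return next((i for i, (a, b) in pairs if a != b), shortest)


if __name__ == '__main__':
    print(matchlen('aaa_Hello', 'aa_World'))
    print(matchlen('another', 'example'))
```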

Value between opening and closing bracket [duplicate]

I'm trying to match a mathematical-expression-like string that has nested parentheses.
import re
p = re.compile(r'\(.+\)')
s = '(((1+0)+1)+1)'
print p.findall(s)
['(((1+0)+1)+1)']
I wanted it to match all the enclosed expressions, such as (1+0), ((1+0)+1)...
I don't even care if it matches unwanted ones like (((1+0), I can take care of those.
Why isn't it doing that already, and how can I do it?
As others have mentioned, regular expressions are not the way to go for nested constructs. I'll give a basic example using pyparsing:
import pyparsing # make sure you have this installed
thecontent = pyparsing.Word(pyparsing.alphanums) | '+' | '-'
parens = pyparsing.nestedExpr( '(', ')', content=thecontent)
Here's a usage example:
>>> parens.parseString("((a + b) + c)")
Output:
( # all of str
[
( # ((a + b) + c)
[
( # (a + b)
['a', '+', 'b'], {}
), # (a + b) [closed]
'+',
'c'
], {}
) # ((a + b) + c) [closed]
], {}
) # all of str [closed]
(With newlining/indenting/comments done manually)
Edit: Modified to eliminate unnecessary Forward, as per Paul McGuire's suggestions.
To get the output in nested list format:
res = parens.parseString("((12 + 2) + 3)")
res.asList()
Output:
[[['12', '+', '2'], '+', '3']]
There is a new regular expression engine module, regex, being prepared to replace the existing one in Python. It introduces a lot of new functionality, including recursive calls.
import regex
s = 'aaa(((1+0)+1)+1)bbb'
result = regex.search(r'''
(?<rec> #capturing group rec
 \( #open parenthesis
 (?: #non-capturing group
  [^()]++ #anything but parentheses one or more times without backtracking
  | #or
  (?&rec) #recursive substitute of group rec
 )*
 \) #close parenthesis
)
''',s,flags=regex.VERBOSE)
print(result.captures('rec'))
Output:
['(1+0)', '((1+0)+1)', '(((1+0)+1)+1)']
Related bug in regex: http://code.google.com/p/mrab-regex-hg/issues/detail?id=78
Regex languages aren't powerful enough to match arbitrarily nested constructs. For that you need a push-down automaton (i.e., a parser). There are several such tools available, such as PLY.
Python also provides a parser library for its own syntax, which might do what you need. The output is extremely detailed, however, and takes a while to wrap your head around. If you're interested in this angle, the following discussion tries to explain things as simply as possible.
>>> import parser, pprint
>>> pprint.pprint(parser.st2list(parser.expr('(((1+0)+1)+1)')))
[258,
[327,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[7, '('],
[320,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[7, '('],
[320,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[7,
'('],
[320,
[304,
[305,
[306,
[307,
[308,
[310,
[311,
[312,
[313,
[314,
[315,
[316,
[317,
[318,
[2,
'1']]]]],
[14,
'+'],
[315,
[316,
[317,
[318,
[2,
'0']]]]]]]]]]]]]]]],
[8,
')']]]]],
[14,
'+'],
[315,
[316,
[317,
[318,
[2,
'1']]]]]]]]]]]]]]]],
[8, ')']]]]],
[14, '+'],
[315,
[316,
[317,
[318, [2, '1']]]]]]]]]]]]]]]],
[8, ')']]]]]]]]]]]]]]]],
[4, ''],
[0, '']]
You can ease the pain with this short function:
def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [ast[0]] + [shallow(a) for a in ast[1:]]
>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
[258,
[318,
'(',
[314,
[318, '(', [314, [318, '(', [314, '1', '+', '0'], ')'], '+', '1'], ')'],
'+',
'1'],
')'],
'',
'']
The numbers come from the Python modules symbol and token, which you can use to build a lookup table from numbers to names:
import symbol, token
map = dict(list(token.tok_name.items()) + list(symbol.sym_name.items()))
You could even fold this mapping into the shallow() function so you can work with strings instead of numbers:
def shallow(ast):
    if not isinstance(ast, list): return ast
    if len(ast) == 2: return shallow(ast[1])
    return [map[ast[0]]] + [shallow(a) for a in ast[1:]]
>>> pprint.pprint(shallow(parser.st2list(parser.expr('(((1+0)+1)+1)'))))
['eval_input',
['atom',
'(',
['arith_expr',
['atom',
'(',
['arith_expr',
['atom', '(', ['arith_expr', '1', '+', '0'], ')'],
'+',
'1'],
')'],
'+',
'1'],
')'],
'',
'']
The regular expression tries to match as much of the text as possible, thereby consuming all of your string. It doesn't look for additional matches of the regular expression on parts of that string. That's why you only get one answer.
The solution is to not use regular expressions. If you are actually trying to parse math expressions, use a real parsing solution. If you really just want to capture the pieces within parentheses, just loop over the characters, incrementing a counter when you see ( and decrementing it when you see ).
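The counting approach described above might look roughly like this. It is a sketch under assumptions: `top_level_groups` is a made-up name, it only collects the outermost groups, and a stray closing parenthesis is silently ignored.

```python
def top_level_groups(text):
    """Collect each outermost parenthesized group using a depth counter."""
    groups = []
    depth = 0
    start = None
    for i, char in enumerate(text):
        if char == '(':
            if depth == 0:
                start = i          # a new outermost group opens here
            depth += 1
        elif char == ')':
            if depth == 0:
                continue           # stray ')': ignored in this sketch
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    return groups


if __name__ == '__main__':
    print(top_level_groups('a(b(c))d(e)'))
```

To reach the inner groups as well, the same function can be applied again to each result with its outer parentheses stripped.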
A stack is the best tool for the job:
import re
def matches(line, opendelim='(', closedelim=')'):
    stack = []
    for m in re.finditer(r'[{}{}]'.format(opendelim, closedelim), line):
        pos = m.start()
        if pos > 0 and line[pos-1] == '\\':
            # skip escape sequence
            continue
        c = line[pos]
        if c == opendelim:
            stack.append(pos+1)
        elif c == closedelim:
            if len(stack) > 0:
                prevpos = stack.pop()
                # print("matched", prevpos, pos, line[prevpos:pos])
                yield (prevpos, pos, len(stack))
            else:
                # error
                print("encountered extraneous closing quote at pos {}: '{}'".format(pos, line[pos:] ))
                pass
    if len(stack) > 0:
        for pos in stack:
            print("expecting closing quote to match open quote starting at: '{}'"
                  .format(line[pos-1:]))
In the client code, since the function is written as a generator, simply use a for loop to unroll the matches:
line = '(((1+0)+1)+1)'
for openpos, closepos, level in matches(line):
    print(line[openpos:closepos], level)
This test code produces the following on my screen; note that the second value in the printout indicates the depth of the parentheses.
1+0 2
(1+0)+1 1
((1+0)+1)+1 0
From a linked answer:
From the LilyPond convert-ly utility (and written/copyrighted by myself, so I can show it off here):
def paren_matcher (n):
    # poor man's matched paren scanning, gives up
    # after n+1 levels. Matches any string with balanced
    # parens inside; add the outer parens yourself if needed.
    # Nongreedy.
    return r"[^()]*?(?:\("*n+r"[^()]*?"+r"\)[^()]*?)*?"*n
convert-ly tends to use this as paren_matcher (25) in its regular expressions which is likely overkill for most applications. But then it uses it for matching Scheme expressions.
Yes, it breaks down after the given limit, but the ability to just plug it into regular expressions still beats the "correct" alternatives supporting unlimited depth hands-down in usability.
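To see the depth limit in action, here is a small check built around the same pattern builder. The wrapper `balanced_within` is a hypothetical helper added for illustration; it supplies the outer parentheses itself, as the comment in the original suggests, and uses `re.fullmatch` so that partial matches do not count.

```python
import re


def paren_matcher(n):
    # Same pattern builder as above: balanced parens up to n levels deep.
    return r"[^()]*?(?:\(" * n + r"[^()]*?" + r"\)[^()]*?)*?" * n


def balanced_within(text, depth):
    """True if `text` is one parenthesized group with at most `depth` nested levels inside."""
    pattern = r"\(" + paren_matcher(depth) + r"\)"
    return re.fullmatch(pattern, text) is not None


if __name__ == '__main__':
    print(balanced_within('(((1+0)+1)+1)', 2))  # deep enough limit
    print(balanced_within('(((1+0)+1)+1)', 1))  # limit too small
```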
Balanced pairs (of parentheses, for example) is an example of a language that cannot be recognized by regular expressions.
What follows is a brief explanation of the math for why that is.
Regular expressions are a way of defining finite state machines (abbreviated FSM). Such a device has a finite amount of possible state to store information. How that state can be used is not particularly restricted, but it does mean that there is an absolute maximum number of distinct positions it can recognize.
For example, the state can be used for counting, say, unmatched left parentheses. But because the amount of state for that kind of counting must be completely bounded, a given FSM can count to a maximum of n-1, where n is the number of states the FSM can be in. If n is, say, 10, then the maximum number of unmatched left parentheses the FSM can track is 9; one more and it breaks. Since it's perfectly possible to have one more left parenthesis, there is no possible FSM that can correctly recognize the complete language of matched parentheses.
So what? Suppose you just pick a really large n? The problem is that as a way of describing FSM, regular expressions basically describe all of the transitions from one state to another. Since for any n, an FSM would need 2 state transitions per level (one for matching a left parenthesis, and one for matching right), the regular expression itself must grow by at least a constant factor multiple of n.
By comparison, the next better class of languages, (context free grammars) can solve this problem in a totally compact way. Here's an example in BNF
expression ::= `(` expression `)` expression
| nothing
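The grammar above translates almost directly into a small recursive recognizer. This is a sketch, not part of the original answer: `is_balanced` is a made-up name, and like the grammar it accepts strings made of parentheses only.

```python
def is_balanced(text):
    """Recognize: expression ::= '(' expression ')' expression | nothing."""

    def expression(pos):
        # Returns the index just past the expression starting at pos,
        # or -1 when an opened parenthesis is never closed.
        while pos < len(text) and text[pos] == '(':
            inner_end = expression(pos + 1)      # the nested expression
            if inner_end < 0 or inner_end >= len(text) or text[inner_end] != ')':
                return -1
            pos = inner_end + 1                  # continue with the trailing expression
        return pos                               # "nothing" matches here

    return expression(0) == len(text)


if __name__ == '__main__':
    print(is_balanced('(()())'))
    print(is_balanced('(()'))
```

The recursion depth plays the role of the unbounded counter that a finite state machine lacks.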
I believe this function may suit your need. I threw this together fast, so feel free to clean it up a bit. When doing nests it's easy to think of it backwards and work from there =]
def fn(string,endparens=False):
    exp = []
    idx = -1
    for char in string:
        if char == "(":
            idx += 1
            exp.append("")
        elif char == ")":
            idx -= 1
            if idx != -1:
                exp[idx] = "(" + exp[idx+1] + ")"
        else:
            exp[idx] += char
    if endparens:
        exp = ["("+val+")" for val in exp]
    return exp
You can use regexps, but you need to do the recursion yourself. Something like the following does the trick (if you only need to find, as your question says, all the expressions enclosed into parentheses):
import re
def scan(p, string):
    found = p.findall(string)
    # iterate over a copy, since found grows inside the loop
    for substring in list(found):
        stripped = substring[1:-1]
        found.extend(scan(p, stripped))
    return found
p = re.compile(r'\(.+\)')
string = '(((1+0)+1)+1)'
all_found = scan(p, string)
print(all_found)
This code, however, does not match the 'correct' parentheses. If you need to do that you will be better off with a specialized parser.
Here is a demo for your question. It is clumsy, but it works:
import re
s = '(((1+0)+1)+1)'
def getContectWithinBraces( x , *args , **kwargs):
    ptn = r'[%(left)s]([^%(left)s%(right)s]*)[%(right)s]' %kwargs
    Res = []
    res = re.findall(ptn , x)
    while res != []:
        Res = Res + res
        xx = x.replace('(%s)' %Res[-1] , '%s')
        res = re.findall(ptn, xx)
        print(res)
        if res != []:
            res[0] = res[0] %('(%s)' %Res[-1])
    return Res
getContectWithinBraces(s , left=r'\(\[\{' , right = r'\)\]\}')
My solution: define a function to extract the content within the outermost parentheses, and then call that function repeatedly until you get the content within the innermost parentheses.
import re
def get_string_inside_outermost_parentheses(text):
    content_p = re.compile(r"(?<=\().*(?=\))")
    r = content_p.search(text)
    return r.group()
def get_string_inside_innermost_parentheses(text):
    while '(' in text:
        text = get_string_inside_outermost_parentheses(text)
    return text
You should write a proper parser for parsing such expression (e.g. using pyparsing).
Regular expressions are not an appropriate tool for writing decent parsers.
Many posts suggest that for nested braces,
REGEX IS NOT THE WAY TO DO IT. SIMPLY COUNT THE BRACES:
For example, see: Regular expression to detect semi-colon terminated C++ for & while loops
Here is a complete python sample to iterate through a string and count braces:
# decided for nested braces to not use regex but brace-counting
import re, string
texta = r'''
nonexistent.\note{Richard Dawkins, \textit{Unweaving the Rainbow: Science, Delusion
and the Appetite for Wonder} (Boston: Houghton Mifflin Co., 1998), pp. 302, 304,
306-309.} more text and more.
This is a statistical fact, not a
guess.\note{Zheng Wu, \textit{Cohabitation: An Alternative Form
of Family Living} (Ontario, Canada: Oxford University Press,
2000), p. 149; \hbox{Judith} Treas and Deirdre Giesen, ``Title
and another title,''
\textit{Journal of Marriage and the Family}, February 2000,
p.\,51}
more and more text.capitalize
'''
pos = 0
foundpos = 0
openBr = 0 # count open braces
while foundpos != -1:
    openBr = 0
    foundpos = texta.find(r'\note', pos)
    # print('foundpos', foundpos)
    pos = foundpos + 5
    # print(texta[pos])
    result = ""
    while foundpos > -1 and openBr >= 0:
        pos = pos + 1
        if texta[pos] == "{":
            openBr = openBr + 1
        if texta[pos] == "}":
            openBr = openBr - 1
        result = result + texta[pos]
    result = result[:-1] # drop the last } found.
    result = result.replace('\n', ' ') # replace new line with space
    print(result)

Finding permutations using regular expressions

I need to create a regular expression (for a program in Haskell) that will catch strings made of "X" and ".", with exactly 4 "X" and exactly one ".". It must not catch any string with a different number of X's or dots.
I have thought about something like
[X\.]{5}
But it catches also "XXXXX" or ".....", so it isn't what I need.
That's called permutation parsing, and while "pure" regular expressions can't parse permutations it's possible if your regex engine supports lookahead. (See this answer for an example.)
However I find the regex in the linked answer difficult to understand. It's cleaner in my opinion to use a library designed for permutation parsing, such as megaparsec.
You use the Text.Megaparsec.Perm module by building a PermParser in a quasi-Applicative style using the <||> operator, then converting it into a regular MonadParsec action using makePermParser.
So here's a parser which recognises any combination of four Xs and one .:
import Control.Applicative
import Data.Ord
import Data.List
import Text.Megaparsec
import Text.Megaparsec.Perm
fourXoneDot :: Parsec Dec String String
fourXoneDot = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
    where mkFive a b c d e = [a, b, c, d, e]
          x = char 'X'
          dot = char '.'
I'm applying the mkFive function, which just stuffs its arguments into a five-element list, to four instances of the x parser and one dot, combined with <||>.
ghci> parse fourXoneDot "" "XXXX."
Right "XXXX."
ghci> parse fourXoneDot "" "XX.XX"
Right "XXXX."
ghci> parse fourXoneDot "" "XX.X"
Left {- ... -}
This parser always returns "XXXX." because that's the order I combined the parsers in: I'm mapping mkFive over the five parsers and it doesn't reorder its arguments. If you want the permutation parser to return its input string exactly, the trick is to track the current position within the component parsers, and then sort the output.
fourXoneDotSorted :: Parsec Dec String String
fourXoneDotSorted = makePermParser $ mkFive <$$> x <||> x <||> x <||> x <||> dot
    where mkFive a b c d e = map snd $ sortBy (comparing fst) [a, b, c, d, e]
          x = withPos (char 'X')
          dot = withPos (char '.')
          withPos = liftA2 (,) getPosition
ghci> parse fourXoneDotSorted "" "XX.XX"
Right "XX.XX"
As the megaparsec docs note, the implementation of the Text.Megaparsec.Perm module is based on Parsing Permutation Phrases; the idea is described in detail in the paper and the accompanying slides.
The other answers look quite complicated to me, given that there are only five strings in this language. Here's a perfectly fine and very readable regex for this:
\.XXXX|X\.XXX|XX\.XX|XXX\.X|XXXX\.
Are you attached to regex, or did you just end up at regex because this was a question you didn't want to try answering with applicative parsers?
Here's the simplest possible attoparsec implementation I can think of:
parseDotXs :: Parser ()
parseDotXs = do
    dotXs <- count 5 (satisfy (inClass ".X"))
    let (dots,xS) = span (=='.') . sort $ dotXs
    if (length dots == 1) && (length xS == 4) then do
        return ()
      else do
        fail "Mismatch between dots and Xs"
You may need to adjust slightly depending on your input type.
There are tons of fancy ways to do stuff in applicative parsing land, but there is no rule saying you can't just do things the rock-stupid simple way.
Try the following regex :
(?<=^| )(?=[^. ]*\.)(?=(?:[^X ]*X){4}).{5}(?=$| )
Demo here
If you have one word per string, you can simplify the regex by this one :
^(?=[^. \n]*\.)(?=(?:[^X \n]*X){4}).{5}$
Demo here
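A quick way to convince yourself that the lookahead approach and the explicit five-way alternation accept the same strings is to try every candidate. The sketch below uses Python's `re` rather than Haskell, and a simplified form of the second pattern (without the space and newline exclusions); the names `LOOKAHEAD`, `EXPLICIT`, and `accepted` are made up for the example.

```python
import re
from itertools import product

# One dot somewhere, four X's somewhere, and exactly five characters.
LOOKAHEAD = re.compile(r'^(?=[^.]*\.)(?=(?:[^X]*X){4}).{5}$')
# The five strings of the language, spelled out.
EXPLICIT = re.compile(r'^(?:\.XXXX|X\.XXX|XX\.XX|XXX\.X|XXXX\.)$')


def accepted(pattern, alphabet='X.a', length=5):
    """All strings of the given length over `alphabet` that the pattern matches."""
    candidates = (''.join(chars) for chars in product(alphabet, repeat=length))
    return {word for word in candidates if pattern.match(word)}


if __name__ == '__main__':
    print(sorted(accepted(LOOKAHEAD)))
    print(accepted(LOOKAHEAD) == accepted(EXPLICIT))
```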

R code to check if word matches pattern

I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW i guess...is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The glaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX can be omitted, leaving {MIN,}. Likewise, if there is neither maximum nor minimum, everything within the {} including the braces themselves can be replaced with a *, which signifies that the preceding item will be matched zero or more times; this is equivalent to {0,}.
This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a period
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that the split approach always costs at least O(n), as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions before being rejected.
Hopefully this clarifies why you were having performance issues.

Using alternation or character class for single character matching?

(Note: Title doesn't seem too clear -- if someone can rephrase this I'm all for it!)
Given this regex: (.*_e\.txt), which matches some filenames, I need to add some other single character suffixes in addition to the e. Should I choose a character class or should I use an alternation for this? (Or does it really matter??)
That is, which of the following two seems "better", and why:
a) (.*(e|f|x)\.txt), or
b) (.*[efx]\.txt)
Use [efx] - that's exactly what character classes are designed for: to match one of the included characters. Therefore it's also the most readable and shortest solution.
I don't know if it's faster, but I would be very much surprised if it wasn't. It definitely won't be slower.
My reasoning (without ever having written a regex engine, so this is pure conjecture):
The regex token [abc] will be applied in a single step of the regex engine: "Is the next character one of a, b, or c?"
(a|b|c) however tells the regex engine to
remember the current position in the string for backtracking, if necessary
check if it's possible to match a. If so, success. If not:
check if it's possible to match b. If so, success. If not:
check if it's possible to match c. If so, success. If not:
give up.
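If you want to check this on your own machine, a small harness along these lines can help. It is a sketch in Python, so it measures Python's `re` engine rather than any other, and the file names in `NAMES` are invented for the example. It first confirms that the two patterns accept the same names, then times them without promising any particular result.

```python
import re
import timeit

ALTERNATION = re.compile(r'^(.*(e|f|x)\.txt)$')
CHAR_CLASS = re.compile(r'^(.*[efx]\.txt)$')
# Hypothetical sample file names, chosen only for illustration.
NAMES = ['report_e.txt', 'report_f.txt', 'report_x.txt', 'report_a.txt', 'e.txt.bak']


def matching(pattern, names=NAMES):
    """Return the names that the pattern accepts, in order."""
    return [name for name in names if pattern.match(name)]


if __name__ == '__main__':
    for label, pattern in (('alternation', ALTERNATION), ('class', CHAR_CLASS)):
        seconds = timeit.timeit(lambda: matching(pattern), number=20_000)
        print(f'{label}: {seconds:.3f}s')
```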
Here is a benchmark:
updated according to tchrist comment, the difference is more significant
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Benchmark qw(:all);
my @l;
foreach(qw/b c d f g h j k l m n ñ p q r s t v w x z B C D F G H J K L M N ñ P Q R S T V W X Z/) {
    push @l, "abc$_.txt";
}
my $re1 = qr/^(.*(b|c|d|f|g|h|j|k|l|m|n|ñ|p|q|r|s|t|v|w|x|z)\.txt)$/;
my $re2 = qr/^(.*[bcdfghjklmnñpqrstvwxz]\.txt)$/;
my $cpt;
my $count = -3;
my $r = cmpthese($count, {
    'alternation' => sub {
        for(@l) {
            $cpt++ if $_ =~ $re1;
        }
    },
    'class' => sub {
        for(@l) {
            $cpt++ if $_ =~ $re2;
        }
    }
});
result:
              Rate alternation       class
alternation 2855/s          --        -50%
class       5677/s         99%          --
With a single character, it's going to have such a minimal difference that it won't matter. (unless you're doing LOTS of operations)
However, for readability (and a slight performance increase) you should be using the character class method.
For a bit further information - opening a round bracket ( causes Perl to start backtracking for that current position, which, as you don't have further matches to go against, you really don't need for your regex. A character class will not do this.