I need to validate a string against a character vector pattern. My current code is:
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# valid pattern is lowercase alphabet, '.', '!', and '?' AND
# the string length should be >= than 2
my.pattern = c(letters, '!', '.', '?')
check.pattern = function(word, min.size = 2)
{
word = trim(word)
chars = strsplit(word, NULL)[[1]]
all(chars %in% my.pattern) && (length(chars) >= min.size)
}
Example:
w.valid = 'special!'
w.invalid = 'test-me'
check.pattern(w.valid) #TRUE
check.pattern(w.invalid) #FALSE
This is VERY SLOW i guess...is there a faster way to do this? Regex maybe?
Thanks!
PS: Thanks everyone for the great answers. My objective was to build a 29 x 29 matrix,
where the row names and column names are the allowed characters. Then i iterate over each word of a huge text file and build a 'letter precedence' matrix. For example, consider the word 'special', starting from the first char:
row s, col p -> increment 1
row p, col e -> increment 1
row e, col c -> increment 1
... and so on.
The bottleneck of my code was the vector allocation, i was 'appending' instead of pre-allocate the final vector, so the code was taking 30 minutes to execute, instead of 20 seconds!
There are some built-in functions that can clean up your code. And I think you're not leveraging the full power of regular expressions.
The blaring issue here is strsplit. Comparing the equality of things character-by-character is inefficient when you have regular expressions. The pattern here uses the square bracket notation to filter for the characters you want. * is for any number of repeats (including zero), while the ^ and $ symbols represent the beginning and end of the line so that there is nothing else there. nchar(word) is the same as length(chars). Changing && to & makes the function vectorized so you can input a vector of strings and get a logical vector as output.
check.pattern.2 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]*$"),word) & nchar(word) >= min.size
}
check.pattern.2(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Next, using curly braces for number of repetitions and some paste0, the pattern can use your min.size:
check.pattern.3 = function(word, min.size = 2)
{
word = trim(word)
grepl(paste0("^[a-z!.?]{",min.size,",}$"),word)
}
check.pattern.3(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
Finally, you can internalize the regex from trim:
check.pattern.4 = function(word, min.size = 2)
{
grepl(paste0("^\\s*[a-z!.?]{",min.size,",}\\s*$"),word)
}
check.pattern.4(c(" d ","!hello ","nA!"," asdf.!"," d d "))
#[1] FALSE TRUE FALSE TRUE FALSE
If I understand the pattern you are desiring correctly, you would want a regex of a similar format to:
^\\s*[a-z!\\.\\?]{MIN,MAX}\\s*$
Where MIN is replaced with the minimum length of the string, and MAX is replaced with the maximum length of the string. If there is no maximum length, then MAX and the comma can be omitted. Likewise, if there is neither maximum nor minimum everything within the {} including the braces themselves can be replaced with a * which signifies the preceding item will be matched zero or more times; this is equivalent to {0}.
This ensures that the regex only matches strings where every character after any leading and trailing whitespace is from the set of
* a lower case letter
* a bang (exclamation point)
* a question mark
Note that this has been written in Perl style regex as it is what I am more familiar with; most of my research was at this wiki for R text processing.
The reason for the slowness of your function is the extra overhead of splitting the string into a number of smaller strings. This is a lot of overhead in comparison to a regex (or even a manual iteration over the string, comparing each character until the end is reached or an invalid character is found). Also remember that this algorithm ENSURES a O(n) performance rate, as the split causes n strings to be generated. This means that even FAILING strings must do at least n actions to reject the string.
Hopefully this clarifies why you were having performance issues.
Related
I have a list of dictionary words, I would like to find any word that consists of (some or all) certain characters of a source word in any order :
For Example:
Characters (source word) to look for : stainless
Found Words : stainless, stain, net, ten, less, sail, sale, tale, tales, ants, etc.
Also if a letter is found once in the source word it can't be repeated in the found word
Unacceptable words to find : tent (t is repeated), tall (l is repeated) , etc.
Acceptable words to find : less (s is already repeated in the source word), etc.
You could take this approach:
Match any sequence of characters that are in the search word, requiring that the match is a word (word-boundaries)
Prohibit that a certain character occurs more often than it is present in the search word, using a negative look-ahead. Do this for every character that is in the search word.
For the given example the regular expression would be:
(?!(\S*s){4}|(\S*t){2}|(\S*a){2}|(\S*i){2}|(\S*n){2}|(\S*l){2}|(\S*e){2})\b[stainless]+\b
The biggest part of the pattern deals with the negative look-ahead. For example:
(\S*s){4} would match four times an 's' in a single word.
(?! | ) places these patterns as different options in a negative look-ahead so that none of them should match.
Automation
It is clear that making such a regular expression for a given word needs some work, so that is where you could use some automation. Notepad++ cannot help with that, but in a programming environment it is possible. Here is a little snippet in JavaScript that will give you the regular expression that corresponds to a given search word:
function regClassEscape(s) {
// Escape "[" and "^" and "-":
return s.replace(/[\]^-]/g, "\\$&");
}
function buildRegex(searchWord) {
// get frequency of each letter:
let freq = {};
for (let ch of searchWord) {
ch = regClassEscape(ch);
freq[ch] = (freq[ch] ?? 0) + 1;
}
// Produce negative options (too many occurrences)
const forbidden = Object.entries(freq).map(([ch, count]) =>
"(\\S*[" + ch + "]){" + (count + 1) + "}"
).join("|");
// Produce character set
const allowed = Object.keys(freq).join("");
return "(?!" + forbidden + ")\\b[" + allowed + "]+\\b";
}
// I/O management
const [input, output] = document.querySelectorAll("input,div");
input.addEventListener("input", refresh);
function refresh() {
if (/\s/.test(input.value)) {
output.textContent = "Input should have no white space!";
} else {
output.textContent = buildRegex(input.value);
}
}
refresh();
input { width: 100% }
Search word:<br>
<input value="stainless">
Regular expression:
<div></div>
I am trying to find a regex query, such that, for instance, the following strings match the same expression
"1116.67711..44."
"2224.43322..88."
"9993.35599..22."
"7779.91177..55."
I.e. formally "x1x1x1x2.x2x3x3x1x1..x4x4." where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
Or (another example), the following strings match the same expression, but not the same expression as before:
"94..44.773399.4"
"25..55.886622.5"
"73..33.992277.3"
I.e. formally "x1x2..x2x2.x3x3x4x4x1x1.x2" where xi ≠ xj if i ≠ j, and where xi is some number from 1 to 9 inclusive.
That is two strings should be equal if they have the same form, but with the numbers internally permuted so that they are pairwise distinct.
The dots should mean a space in the sequence, this could be any value that is not a single digit number, and two "equal" strings, should have spaces the same places. If it helps, the strings all have the same length of 81 (above they all have a length of 15, as to not write too long strings).
That is, if I have some string as above, e.g. "3566.235.225..45" i want to have some reqular expression that i can apply to some database to find out if such a string already exists
Is it possible to do this?
The answer is fairly straightforward:
import re
pattern = re.compile(r'^(\d)\1{3}$')
print(pattern.match('1234'))
print(pattern.match('333'))
print(pattern.match('3333'))
print(pattern.match('33333'))
You capture what you need once, then tell the regex engine how often you need to repeat it. You can refer back to it as often as you like, for example for a pattern that would match 11.222.1 you'd use ^(\d)\1{1}\.(\d)\2{2}\.(\1){1}$.
Note that the {1} in there is superfluous, but it shows that the pattern can be very regular. So much so, that it's actually easy to write a function that solves the problem for you:
def make_pattern(grouping, separators='.'):
regex_chars = '.\\*+[](){}^$?!:'
groups = {}
i = 0
j = 0
last_group = 0
result = '^'
while i < len(grouping):
if grouping[i] in separators:
if grouping[i] in regex_chars:
result += '\\'
result += grouping[i]
i += 1
else:
while i < len(grouping) and grouping[i] == grouping[j]:
i += 1
if grouping[j] in groups:
group = groups[grouping[j]]
else:
last_group += 1
groups[grouping[j]] = last_group
group = last_group
result += '(.)'
j += 1
result += f'\\{group}{{{i-j}}}'
j = i
return re.compile(result+'$')
print(make_pattern('111.222.11').match('aaa.bbb.aa'))
So, you can give make_pattern a good example of the pattern and it will return the compiled regex for you. If you'd like other separators than '.', you can just pass those in as well:
my_pattern = make_pattern('11,222,11', separators=',')
print(my_pattern.match('aa,bbb,aa'))
I have to process a comma separated string which contains triplets of values and translate them to runtime types,the input looks like:
"1x2y3z,80r160g255b,48h30m50s,1x3z,255b,1h,..."
So each substring should be transformed this way:
"1x2y3z" should become Vector3 with x = 1, y = 2, z = 3
"80r160g255b" should become Color with r = 80, g = 160, b = 255
"48h30m50s" should become Time with h = 48, m = 30, s = 50
The problem I'm facing is that all the components are optional (but they preserve order) so the following strings are also valid Vector3, Color and Time values:
"1x3z" Vector3 x = 1, y = 0, z = 3
"255b" Color r = 0, g = 0, b = 255
"1h" Time h = 1, m = 0, s = 0
What I have tried so far?
All components optional
((?:\d+A)?(?:\d+B)?(?:\d+C)?)
The A, B and C are replaced with the correct letter for each case, the expression works almost well but it gives twice the expected results (one match for the string and another match for an empty string just after the first match), for example:
"1h1m1s" two matches [1]: "1h1m1s" [2]: ""
"11x50z" two matches [1]: "11x50z" [2]: ""
"11111h" two matches [1]: "11111h" [2]: ""
This isn't unexpected... after all an empty string matches the expression when ALL of the components are empty; so in order to fix this issue I've tried the following:
1 to 3 quantifier
((?:\d+[ABC]){1,3})
But now, the expression matches strings with wrong ordering or even repeated components!:
"1s1m1h" one match, should not match at all! (wrong order)
"11z50z" one match, should not match at all! (repeated components)
"1r1r1b" one match, should not match at all! (repeated components)
As for my last attempt, I've tried this variant of my first expression:
Match from begin ^ to the end $
^((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
And it works better than the first version but it still matches the empty string plus I should first tokenize the input and then pass each token to the expression in order to assure that the test string could match the begin (^) and end ($) operators.
EDIT: Lookahead attempt (thanks to Casimir et Hippolyte)
After reading and (try to) understanding the regex lookahead concept and with the help of Casimir et Hippolyte answer I've tried the suggested expression:
\b(?=[^,])(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Against the following test string:
"48h30m50s,1h,1h1m1s,11111h,1s1m1h,1h1h1h,1s,1m,1443s,adfank,12322134445688,48h"
And the results were amazing! it is able to detect complete valid matches flawlessly (other expressions gave me 3 matches on "1s1m1h" or "1h1h1h" which weren't intended to be matched at all). Unfortunately it captures emtpy matches everytime a unvalid match is found so a "" is detected just before "1s1m1h", "1h1h1h", "adfank" and "12322134445688", so I modified the Lookahead condition to get the expression below:
\b(?=(?:\d+[ABC]){1,3})(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
It gets rid of the empty matches in any string which doesn't match (?:\d+[ABC]){1,3}) so the empty matches just before "adfank" and "12322134445688" are gone but the ones just before "1s1m1h", "1h1h1h" are stil detected.
So the question is: Is there any regular expression which matches three triplet values in a given order where all component is optional but should be composed of at least one component and doesn't match empty strings?
The regex tool I'm using is the C++11 one.
Yes, you can add a lookahead at the begining to ensure there is at least one character:
^(?=.)((?:\d+A)?(?:\d+B)?(?:\d+C)?)$
If you need to find this kind of substring in a larger string (so without to tokenize before), you can remove the anchors and use a more explicit subpattern in a lookahead:
(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)
In this case, to avoid false positive (since you are looking for very small strings that can be a part of something else), you can add word-boundaries to the pattern:
\b(?=\d+[ABC])((?:\d+A)?(?:\d+B)?(?:\d+C)?)\b
Note: in a comma delimited string: (?=\d+[ABC]) can be replaced by (?=[^,])
I think this might do the trick.
I am keying on either the beginning of the string to match ^ or the comma separator , for fix the start of each match: (?:^|,).
Example:
#include <regex>
#include <iostream>
const std::regex r(R"~((?:^|,)((?:\d+[xrh])?(?:\d+[ygm])?(?:\d+[zbs])?))~");
int main()
{
std::string test = "1x2y3z,80r160g255b,48h30m50s,1x3z,255b";
std::sregex_iterator iter(test.begin(), test.end(), r);
std::sregex_iterator end_iter;
for(; iter != end_iter; ++iter)
std::cout << iter->str(1) << '\n';
}
Output:
1x2y3z
80r160g255b
48h30m50s
1x3z
255b
Is that what you are after?
EDIT:
If you really want to go to town and make empty expressions unmatched then as far as I can tell you have to put in every permutation like this:
const std::string A = "(?:\\d+[xrh])";
const std::string B = "(?:\\d+[ygm])";
const std::string C = "(?:\\d+[zbs])";
const std::regex r("(?:^|,)(" + A + B + C + "|" + A + B + "|" + A + C + "|" + B + C + "|" + A + "|" + B + "|" + C + ")");
I'm going through my homework and can't seem to figure out how to do this one.
Say the alphabet is {a,b,c}, we want a expression that finds strings with an even number of cs.
Example strings that are included:
empty set,
ccab
abcc
cabc
ababababcc
and so on.. just an even amount of c's.
You can use this regex to allow only even # of c in input:
^(?=(([^c\n]*c){2})*[^\nc]*$)[abc]*$
RegEx Demo
The below regex would match the strings which has only even number of c's,
^(?:[^c]*c[^c]*c[^c\n]*)+?$
DEMO
OR
^(?:[ab]*c[ab]*c[ab]*)+?$
DEMO
Assuming that the total number of c's count, not consecutive cs - there is a nice theoretical approach, based on the fact that **a string with an even number ofc`s can be expressed as a finite state automaton with two states**.
The first state is the initial state, and it is also an accepting state. The second one is a rejecting state. Each c toggles us between the states. Other letters do nothing.
Now, you can convert this simple machine to regex using one of the methods described here.
Something like
^([^c]*(c[^c]*c)+)*[^c]*$
ought to do it. we can break it out, thus:
^ # - start-of-line, followed by
( # - a group, consisting of
[^c]* # - zero or more characters other than 'c', followed by
( # - a group, consisting of
c # - the literal character 'c', followed by
[^c]* # - zero or more characters other than 'c', followed by
c # - the literal character 'c'
)+ # repeated one or more times
)* # repeated zero or more times, followed by
[^c]* # - a final sequence of zero or more characters other than 'c', followed by
$ # - end-of-line
One might note that something like the following C# method will likely perform better and be easier to understand:
public bool ContainsEvenNumberOfCharacters( this string s , char x )
{
int cnt = 0 ;
foreach( char c in s )
{
cnt += ( c == x ? 1 : 0 ) ;
}
bool isEven = 0 == (cnt&1) ; // it's even if the low-order bit is off.
return isEven ;
}
Simply
/^(([^c]*c[^c]*){2})*$/
In English:
Zero or more strings, each of which contains exactly two instances of a c, preceded or followed by any number of non-c's.
This solution has the advantage that it is easily extendable to the case of a string with a number of c's which is multiple of 3, etc., and makes no assumptions about the alphabet.
In another question I learned how to calculate straight poker hand using regex (here).
Now, by curiosity, the question is: can I use regex to calculate the same thing, using ASCII CODE?
Something like:
regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
Matches: 45678, 23456
Doesn't matches: 45679 or 23459 (not in sequence)
Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both A2345 and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.
There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generate the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
seq.replaceAll(
"(?=(.{5}).).",
"$1|"
)
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier. .{5} matches exactly 5 ..
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
match one char
at a time
|
(?=(.{5}).).
\_________/
must be able to see 6 chars ahead
(capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT === === OUTPUT ===
A B C D E F G ABCDE|BCDEFG
↑
We can assert (?=(.{5}).), matching ABCDEF
in the lookahead. ABCDE is captured.
We now match A, and replace with ABCDE|
A B C D E F G ABCDE|BCDEF|CDEFG
↑
We can assert (?=(.{5}).), matching BCDEFG
in the lookahead. BCDEF is captured.
We now match B, and replace with BCDEF|
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), skip forward
:
:
A B C D E F G ABCDE|BCDEF|CDEFG
↑
Can't assert (?=(.{5}).), and we are at
the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround
Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You can not do anything remotely close to this in most regex flavors. This is simply not the kinds of patterns that regex is designed for.
There is no mainstream regex pattern that will succintly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
"ABCDE", // true
"JKLMN", // true
"AAAAA", // false
"ABCDEFGH", // false
"ABCD", // false
"ACEGI", // false
"FGHIJ", // true
};
for (String test : tests) {
System.out.printf("[%s] : %s%n",
test,
test.matches(p5)
);
}
This uses meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then meta-regexed to be matched repeatedly 5 times.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
SOLVED!
See in http://jsfiddle.net/g48K9/3
I solved using closure, in js.
String.prototype.isSequence = function () {
If (this == "A2345") return true; // an exception
return this.replace(/(\w)(\w)(\w)(\w)(\w)/, function (a, g1, g2, g3, g4, g5) {
return code(g1) == code(g2) -1 &&
code(g2) == code(g3) -1 &&
code(g3) == code(g4) -1 &&
code(g4) == code(g5) -1;
})
};
function code(card){
switch(card){
case "T": return 58;
case "J": return 59;
case "Q": return 60;
case "K": return 61;
case "A": return 62;
default: return card.charCodeAt();
}
}
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
alert("cards " + cards + ": " + cards.isSequence())
}
Just to clarify, ascii codes:
ASCII CODES:
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62