I hate doing this but I've banged my head for hours just trying to figure out regexes, so I am finally resorting to asking the experts.
-1,AAABO,ABOAO
-2,ABBBO,BABBO
-3,AAACO,ACAAO
-4,ABDDO,BADDO
-5,AAABF,ABFAA
-6,BBBGO,BGBBO
I am looking to match multiple substrings but only between the commas.
For example:
AA and B would return rows 1,5
BB and O would return 2 and 6
BBB and G would return row 6
AA C and O would return row 3
I would build this dynamically as needed.
The 2nd step would be filtering on the beginning or end of the string after the 2nd comma
For example (start):
AB would return row 1 and 5
For example (end):
BO would return row 2 and 6
and then I need to combine all 3 filters.
For example
AAA O (contains from 2nd column)
AB (begins with)
O (ends with)
returns row 1
I could do multiple passes if required.
I would be delighted with any guidance.
You want the regex
/^.*?,(?=[^,]*AAA)(?=[^,]*O).*?,AB.*O$/
with commentary
/
    ^.*?,            # consume the first field
    (?=[^,]*AAA)     # look ahead in the 2nd field for AAA
    (?=[^,]*O)       # look ahead in the 2nd field for O
    .*?,             # consume the 2nd field
    AB.*O$           # the 3rd field starts with AB and ends with O
/x
which you can generate like this
sub gen_regex {
    my ($begins, $ends, @contains) = @_;
    my $regex = "^.*?,"
              . join("", map {"(?=[^,]*$_)"} @contains)
              . ".*?,$begins.*$ends\$";
    return qr/$regex/;
}
my $re = gen_regex('AB', 'O', qw(AAA O));
and then use it like this:
while (<>) { say $. if /$re/ }
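For comparison, here is a minimal Python sketch of the same idea, run against the six sample rows from the question. The names `ROWS`, `gen_regex` and `matching_rows` are made up for illustration, and it swaps `.*?,` for `[^,]*,` so that a field can never run past its comma; it also escapes the search strings, which the Perl version does not.

```python
import re

# Sample rows copied from the question.
ROWS = [
    "-1,AAABO,ABOAO",
    "-2,ABBBO,BABBO",
    "-3,AAACO,ACAAO",
    "-4,ABDDO,BADDO",
    "-5,AAABF,ABFAA",
    "-6,BBBGO,BGBBO",
]

def gen_regex(begins, ends, contains):
    """Build one pattern: lookaheads for field 2, anchors for field 3."""
    lookaheads = "".join("(?=[^,]*{})".format(re.escape(c)) for c in contains)
    return re.compile(
        "^[^,]*," + lookaheads + "[^,]*,"
        + re.escape(begins) + ".*" + re.escape(ends) + "$"
    )

def matching_rows(pattern, rows):
    """Return the 1-based positions of the rows that match."""
    return [n for n, row in enumerate(rows, start=1) if pattern.search(row)]

# Combined filter from the question: contains AAA and O, begins AB, ends O.
print(matching_rows(gen_regex("AB", "O", ["AAA", "O"]), ROWS))
# "Contains" filters only: leave begins/ends empty.
print(matching_rows(gen_regex("", "", ["AA", "B"]), ROWS))
```

Passing empty strings for `begins` and `ends` turns the third-field check into a no-op, so the same function covers the "contains only" case.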
I have two lists and I need to perform a string match between them. I have solved it with three for loops and a regex pattern. The existing code (part1) gives the expected output, but I need to optimize it (part2) because it takes too long on large data.
part1
import re
import pandas as pd
texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45','fooz','bazzar','foo baz']
terms = ['foo','baz','apple']
output_list = []
for term in terms:
    pattern_term = r'\b(?:{})\b'.format(term)
    try:
        for i in range(len(texts)):
            line_text = texts[i]
            for match in re.finditer(pattern_term, line_text):
                start_index = match.start()
                output_list.append([i, start_index, line_text[start_index:], term])
    except:
        pass
output:
Explanation of column names:
Index = index of texts when pattern matches
Start_index = start index where pattern matches inside text
Match_text = complete text of that matching
Match_term = term with it matches
pd.DataFrame(output_list, columns = ['Index', 'Start_index', 'Match_text', 'Match_term'])
   Index  Start_index Match_text Match_term
0      0            0    foo abc        foo
1      6            0    foo baz        foo
2      3            0     baz 45        baz
3      6            4        baz        baz
I have tried the following code (part2), but its output is partial:
part 2
df = pd.DataFrame({'Match_text': texts})
pat = r'\b(?:{})\b'.format('|'.join(terms))
df[df['Match_text'].str.contains(pat)]
output
Match_text
0 foo abc
3 baz 45
6 foo baz
Your code is already good since you need to find occurrences of whole words inside longer strings, and you create the regex pattern before the loop where the texts are processed with the regex.
The regex is already good. The only redundant part is the non-capturing group, which you can drop because you check term by term and there is no alternation inside the group. You might also compile the regex:
pattern_term = re.compile(r'\b{}\b'.format(term))
Then, you may get rid of temporary variables in the for loop:
for i in range(len(texts)):
    for match in pattern_term.finditer(texts[i]):
        output_list.append([i, match.start(), texts[i][match.start():], term])
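If a single pass over each text is acceptable, one possible variation is to join all terms into one alternation (as the question's part 2 does) and let `finditer` report which term matched. This is only a sketch: the rows come out ordered by text rather than by term, and because matches within one text cannot overlap, terms that overlap each other could give different results from the per-term loop.

```python
import re

texts = ['foo abc', 'foobar xyz', 'xyz baz32', 'baz 45', 'fooz', 'bazzar', 'foo baz']
terms = ['foo', 'baz', 'apple']

# One compiled pattern for all terms; re.escape guards against
# terms that contain regex metacharacters.
pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, terms))))

output_list = [
    [i, m.start(), text[m.start():], m.group(0)]  # m.group(0) is the matched term
    for i, text in enumerate(texts)
    for m in pattern.finditer(text)
]
print(output_list)
```

The resulting list can be passed to `pd.DataFrame` with the same four column names as in the question.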
I'm trying to clean some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:
df = data.frame( "original" = c("ABCDE FG H",
"IJKL MN OPQRS",
"TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)
> df
original | filter1 | filter2 |
1 ABCDE FG H | ABCDE H | ABCDE |
2 IJKL MN OPQRS | IJKL OPQRS | IJKL OPQRS|
3 TUV WX YZ AAAA | TUV YZ AAAA| TUV AAAA |
I don't understand why the first filter (^|\\s)[A-Z]{1,2}($|\\s) doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as with \\b[A-Z]{1,2}\\b (the filter2 column). Please don't worry about the multiple spaces, they aren't important to me (unless they are the problem :)).
I thought the problem might be the "globality" of the operation, that is, that after finding the first match it does not replace the second one, but that isn't true, as the next replacement shows:
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
So why are the results different?
The point is that gsub can only match non-overlapping strings. FG being the first expected match, and H the second, you can see that these strings overlap, and thus, after "(^|\\s)[A-Z]{1,2}($|\\s)" consumes the trailing space after FG, H just does not match the pattern.
Look: ABCDE FG H is analyzed from left to right. The expression matches FG , and the regex index is right before H. There is only this letter to match, but (^|\s) requires a space or the start of string - there is none at this location.
To "fix" this and use the same logic, you can use a PCRE regex gsub with lookarounds:
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
or
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
and if you need to actually consume (to remove) spaces, just add \\s* before (or/and after).
The second expression "\\b[A-Z]{1,2}\\b" contains word boundaries, and they are zero-width assertions that do not consume text, thus, the regex engine can match both FG and H since the spaces are not consumed.
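The same effect can be reproduced outside R. The following Python sketch (Python's `re` module also replaces non-overlapping matches only) runs the consuming pattern and the lookaround pattern over the three strings from the question; the variable names are illustrative.

```python
import re

rows = ["ABCDE FG H", "IJKL MN OPQRS", "TUV WX YZ AAAA"]

# Consumes the whitespace on both sides, so two short words that share
# a single space cannot both match.
consuming = re.compile(r"(^|\s)[A-Z]{1,2}($|\s)")

# Lookarounds only assert "no non-space neighbour" and consume nothing,
# so neighbouring short words are all matched.
lookaround = re.compile(r"(?<!\S)[A-Z]{1,2}(?!\S)")

for row in rows:
    print(repr(consuming.sub(" ", row)), repr(lookaround.sub(" ", row)))
```

In the first row the consuming pattern leaves `H` in place, while the lookaround pattern removes both `FG` and `H`, leaving only extra spaces behind.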
I have something like this
AD ABCDEFG HIJKLMN
AB HIJKLMN
AC DJKEJKW SJKLAJL JSHELSJ
Rule: Always 2 Chars Code (AB|AC|AD) at line beginning then any number of 7 Chars codes following.
With this regex:
^(AB|AC|AD)|((\S{7})?)
in this groovy code sample:
def m= Pattern.compile(/^(AB|AC|AD)|((\S{7})?)/).matcher("AC DJKEJKW SJKLAJL JSHELSJ")
println m.getCount()
I always get a count of 8, which means it also counts the spaces.
How do I get 4 groups (as expected) without the spaces?
Thanks from a not-yet-regex-expert
Sven
Using this code:
def input = [ 'AD ABCDEFG HIJKLMN', 'AB HIJKLMN', 'AC DJKEJKW SJKLAJL JSHELSJ' ]
def regexp = /^(AB|AC|AD)|((\S{7})+)/
def result = input.collect {
    matcher = ( it =~ regexp )
    println "got $matcher.count for $it"
    matcher.collect { it[0] }
}
println result
I get the output
got 3 for AD ABCDEFG HIJKLMN
got 2 for AB HIJKLMN
got 4 for AC DJKEJKW SJKLAJL JSHELSJ
[[AD, ABCDEFG, HIJKLMN], [AB, HIJKLMN], [AC, DJKEJKW, SJKLAJL, JSHELSJ]]
Is this more what you wanted?
This pattern will match your requirements
^A[BCD](?:\s\S{7})+
See it here online on Regexr
Meaning start with A then either a B or a C or a D. This is followed by at least one group consisting of a whitespace followed by 7 non whitespaces.
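As a rough illustration in Python, the pattern can be used to validate a whole line first and the codes can then be pulled out with a plain whitespace split. The function name `codes` is hypothetical, and `fullmatch` is used here to anchor the end of the line as well, which the pattern above does not do by itself.

```python
import re

# Same pattern as above; fullmatch() supplies both the ^ and the $ anchor.
LINE = re.compile(r"A[BCD](?:\s\S{7})+")

def codes(line):
    """Return the list of codes for a valid line, or None if it is malformed."""
    if LINE.fullmatch(line) is None:
        return None
    return line.split()

for line in ["AD ABCDEFG HIJKLMN", "AB HIJKLMN", "AC DJKEJKW SJKLAJL JSHELSJ"]:
    found = codes(line)
    print(len(found), found)
```

Validating first and splitting afterwards avoids counting empty matches or spaces, which is what inflated the count in the question.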
In another question I learned how to calculate straight poker hand using regex (here).
Now, by curiosity, the question is: can I use regex to calculate the same thing, using ASCII CODE?
Something like:
regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
Matches: 45678, 23456
Doesn't match: 45679 or 23459 (not in sequence)
Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both A2345 and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.
There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generates the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
seq.replaceAll(
"(?=(.{5}).).",
"$1|"
)
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier. .{5} matches exactly 5 characters, i.e. five repetitions of the dot.
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
     match one char
        at a time
           |
(?=(.{5}).).
\_________/
 must be able to see 6 chars ahead
    (capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT ===                                  === OUTPUT ===
 A B C D E F G                                 ABCDE|BCDEFG
↑
  We can assert (?=(.{5}).), matching ABCDEF
  in the lookahead. ABCDE is captured.
  We now match A, and replace with ABCDE|
 A B C D E F G                                 ABCDE|BCDEF|CDEFG
  ↑
  We can assert (?=(.{5}).), matching BCDEFG
  in the lookahead. BCDEF is captured.
  We now match B, and replace with BCDEF|
 A B C D E F G                                 ABCDE|BCDEF|CDEFG
    ↑
  Can't assert (?=(.{5}).), skip forward
 A B C D E F G                                 ABCDE|BCDEF|CDEFG
      ↑
  Can't assert (?=(.{5}).), skip forward
 A B C D E F G                                 ABCDE|BCDEF|CDEFG
        ↑
  Can't assert (?=(.{5}).), skip forward
       :
       :
 A B C D E F G                                 ABCDE|BCDEF|CDEFG
              ↑
  Can't assert (?=(.{5}).), and we are at
  the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
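The same substitution carries over to Python almost unchanged. The sketch below uses a hypothetical helper `window_alternation` and then builds a straight matcher from the rank string `23456789TJQKA`, adding the low-ace hand `2345A` by hand since it is not a substring of that sequence.

```python
import re

def window_alternation(seq, size=5):
    """Join every substring of length `size` with '|', via the lookahead trick."""
    # Match one character at a time, but only where size + 1 characters are
    # visible ahead; replace it with the captured window followed by '|'.
    return re.sub(r"(?=(.{%d}).)." % size, r"\1|", seq)

print("^(%s)$" % window_alternation("ABCDEFG"))

# Straight matcher for hands written low to high; 2345A is added explicitly.
straight = re.compile("^(2345A|%s)$" % window_alternation("23456789TJQKA"))
for hand in ["45678", "23456", "45679", "23459"]:
    print(hand, bool(straight.match(hand)))
```

The helper assumes `seq` is at least `size` characters long; for shorter input the string is returned unchanged.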
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround
Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You cannot do anything remotely close to this in most regex flavors. This is simply not the kind of pattern that regex is designed for.
There is no mainstream regex pattern that will succinctly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
"ABCDE", // true
"JKLMN", // true
"AAAAA", // false
"ABCDEFGH", // false
"ABCD", // false
"ACEGI", // false
"FGHIJ", // true
};
for (String test : tests) {
System.out.printf("[%s] : %s%n",
test,
test.matches(p5)
);
}
This uses a meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then meta-regexed to be matched repeatedly 5 times.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
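For reference, the readable non-regex check mentioned above looks like this in Python. The function name `is_run` is made up for the example, and the test strings are the ones from the Java snippet.

```python
ALPHA = "ABCDEFGHIJKLMN"

def is_run(test, alpha=ALPHA, size=5):
    """True if `test` is exactly `size` consecutive characters of `alpha`."""
    return len(test) == size and test in alpha

for test in ["ABCDE", "JKLMN", "AAAAA", "ABCDEFGH", "ABCD", "ACEGI", "FGHIJ"]:
    print("[%s] : %s" % (test, is_run(test)))
```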
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?
SOLVED!
See in http://jsfiddle.net/g48K9/3
I solved using closure, in js.
String.prototype.isSequence = function () {
    if (this == "A2345") return true; // an exception
    // replace() converts the callback's boolean result to the string "true" or "false"
    return this.replace(/(\w)(\w)(\w)(\w)(\w)/, function (a, g1, g2, g3, g4, g5) {
        return code(g1) == code(g2) - 1 &&
               code(g2) == code(g3) - 1 &&
               code(g3) == code(g4) - 1 &&
               code(g4) == code(g5) - 1;
    })
};
function code(card) {
    switch (card) {
        case "T": return 58;
        case "J": return 59;
        case "Q": return 60;
        case "K": return 61;
        case "A": return 62;
        default: return card.charCodeAt();
    }
}
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
    alert("cards " + cards + ": " + cards.isSequence())
}
Just to clarify, ascii codes:
ASCII CODES:
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62
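The remapping in the table amounts to giving every rank a consecutive integer. A small Python sketch of the same check, using the position in a rank string instead of adjusted ASCII codes (an equivalent choice made for this example, with hypothetical names), could look like this:

```python
RANKS = "23456789TJQKA"  # position in this string plays the role of the remapped code

def is_sequence(cards):
    """True if the five cards are consecutive ranks, or the A2345 exception."""
    if cards == "A2345":  # same special case as in the JavaScript version
        return True
    if len(cards) != 5 or any(c not in RANKS for c in cards):
        return False
    codes = [RANKS.index(c) for c in cards]
    # Every card must be exactly one rank above the previous one.
    return all(b - a == 1 for a, b in zip(codes, codes[1:]))

for hand in ["23456", "23444", "789TJ", "TJQKA", "8JQKA"]:
    print("cards %s: %s" % (hand, is_sequence(hand)))
```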
I want to split a string like this:
abc//def//ghi
into a part before and after the first occurrence of //:
a: abc
b: //def//ghi
I'm currently using this regex:
(?<a>.*?)(?<b>//.*)
Which works fine so far.
However, sometimes the // is missing in the source string and obviously the regex fails to match. How is it possible to make the second group optional?
An input like abc should be matched to:
a: abc
b: (empty)
I tried (?<a>.*?)(?<b>//.*)? but that left me with lots of NULL results in Expresso so I guess it's the wrong idea.
Try a ^ at the beginning of your expression to match the beginning of the string and a $ at the end to match the end of the string (this will make the ungreedy match work).
^(?<a>.*?)(?<b>//.*)?$
A proof of Stevo3000's answer (Python):
import re
test_strings = ['abc//def//ghi', 'abc//def', 'abc']
regex = re.compile("(?P<a>.*?)(?P<b>//.*)?$")
for ts in test_strings:
    match = regex.match(ts)
    print('a:', match.group('a'), 'b:', match.group('b'))
a: abc b: //def//ghi
a: abc b: //def
a: abc b: None
Why use group matching at all? Why not just split by "//", either as a regex or a plain string?
use strict;
my $str = 'abc//def//ghi';
my $short = 'abc';
print "The first:\n";
my @groups = split(/\/\//, $str, 2);
foreach my $val (@groups) {
    print "$val\n";
}
print "The second:\n";
@groups = split(/\/\//, $short, 2);
foreach my $val (@groups) {
    print "$val\n";
}
gives
The first:
abc
def//ghi
The second:
abc
[EDIT: Fixed to return max 2 groups]
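In the same spirit, a regex-free sketch in Python can use `str.partition`, which splits on the first occurrence of the separator only. The helper name `split_first` is invented here; unlike the Perl `split` above, it keeps the leading `//` in the second part, as the question asked.

```python
def split_first(text, sep="//"):
    """Split at the first `sep`; the second part keeps the separator."""
    head, found, tail = text.partition(sep)
    # When sep is absent, partition returns (text, "", ""), so b is empty.
    return head, found + tail

for s in ["abc//def//ghi", "abc//def", "abc"]:
    a, b = split_first(s)
    print("a:", a, "b:", b)
```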