Regex - Match length based on value inside match (using variables ?)

I'd like to know if it's possible to use a value inside the expression as a variable for a second part of the expression.
The goal is to extract some specific strings from a memory dump. One part of the string is based on a (more or less) fixed structure that can be described well using regular expressions. The problem is the second part of the string, which has a variable length and no "footer" or anything else that can be matched as an "END".
Instead there is a length indicator at position 2 of the first part.
Here is a simplified example of a string that I'd like to find (along with all others like it) inside a large file:
00 24 AA BB AA DD EE FF GG HH II JJ ########### (# being unwanted data)
Let's assume that the main structure is always 00 XX AA BB AA, but the last part (starting from DD) is variable in length for each string, based on the value of XX.
I know that this can be done in code outside regex, but I am curious whether it's possible :)

Short answer: NO
Long answer:
You can achieve what you want in two steps:
1. Extract the length value from the string
2. Dynamically build a regexp for matching
PSEUDO CODE
s := '00 24 AA BB AA DD EE FF GG HH II JJ ###########'
re := /00 (\d{2}) AA BB AA/
if s::matches(re) then
    match := re::match(s)
    len := match(1)   // the captured length field, here '24'
    dynamicRE := new Regexp(re::toString() + ' (?:[A-Z]{2} ){' + len + '}')
    // dynamicRE == /00 (\d{2}) AA BB AA (?:[A-Z]{2} ){24}/
    if s::matches(dynamicRE) then
        // MATCH !!
    else
        // NO MATCH !!
    end if
end if
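For illustration, the same two-step idea can be sketched in Python. This is only a sketch under assumptions the question does not state: the length field is read as a decimal count of two-character tokens, each token is preceded by a single space, and the function name extract_records is made up for this example.

```python
import re

# Step 1 pattern: the fixed header, capturing the length field.
HEADER = re.compile(r"00 (\d{2}) AA BB AA")


def extract_records(dump: str) -> list[str]:
    """Return every header plus exactly as many tokens as its length field says."""
    records = []
    for header in HEADER.finditer(dump):
        count = int(header.group(1))  # assumption: decimal token count
        # Step 2 pattern: built dynamically from the captured length.
        body = re.compile(r"(?: [0-9A-Z]{2}){%d}" % count)
        tail = body.match(dump, header.end())
        if tail:
            records.append(dump[header.start():tail.end()])
    return records


print(extract_records("## 00 03 AA BB AA DD EE FF GG HH ##"))
```

A header whose payload is shorter than its length field is simply skipped, which mirrors the NO MATCH branch of the pseudo code.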

Related

Find string to regular expression programmatically?

Given a regular expression, is it possible to find a string that matches that expression programmatically? If so, please mention an algorithm for that, assuming that a string exists.
Bonus question: Give the performance/complexity of that algorithm, if able.
PS: Note I am not asking this: Programmatically derive a regular expression from a string. Rather, I am asking the reverse problem.

Generex is a Java library for generating strings from a regular expression.
Check it out: https://github.com/mifmif/Generex
Here is the sample Java code demonstrating library usage:
Generex generex = new Generex("[0-3]([a-c]|[e-g]{1,2})");
// Generate random String
String randomStr = generex.random();
System.out.println(randomStr); // a random string that matches the regex
// Generate the second String, in lexicographical order, that matches the given Regex.
String secondString = generex.getMatchedString(2);
System.out.println(secondString); // prints '0b'
// Generate all Strings that match the given Regex.
List<String> matchedStrs = generex.getAllMatchedStrings();
// Using Generex iterator
Iterator iterator = generex.iterator();
while (iterator.hasNext()) {
System.out.print(iterator.next() + " ");
}
// it prints:
// 0a 0b 0c 0e 0ee 0ef 0eg 0f 0fe 0ff 0fg 0g 0ge 0gf 0gg
// 1a 1b 1c 1e 1ee 1ef 1eg 1f 1fe 1ff 1fg 1g 1ge 1gf 1gg
// 2a 2b 2c 2e 2ee 2ef 2eg 2f 2fe 2ff 2fg 2g 2ge 2gf 2gg
// 3a 3b 3c 3e 3ee 3ef 3eg 3f 3fe 3ff 3fg 3g 3ge 3gf 3gg
Another one: https://code.google.com/archive/p/xeger/
Here is the sample Java code demonstrating library usage:
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);

Assume you define regular expressions like this:
R :=
<literal string>
(RR) -- concatenation
(R*) -- kleene star
(R|R) -- choice
Then you can define a recursive function S(r) which finds a matching string:
S(<literal string>) = <literal string>
S(rs) = S(r) + S(s)
S(r*) = ""
S(r|s) = S(r)
For example: S(a*(b|c)) = S(a*) + S(b|c) = "" + S(b) = "" + "b" = "b".
If you have a more complex notion of regular expression, you can rewrite it in terms of the basic primitives and then apply the above. For example, R+ = RR* and [abc] = (a|b|c).
Note that if you've got a parsed regular expression (so you know its syntax tree), then the above algorithm takes at most time linear in the size of the regular expression (assuming you're careful to perform the string concatenations efficiently).
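The recursive definition of S above can be sketched in Python as follows. The class names (Lit, Cat, Star, Alt) and the function name sample are invented for this illustration; the pieces are collected in a list and joined once at the end so that concatenation stays linear, as the note above requires.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Lit:
    text: str


@dataclass
class Cat:
    left: "Regex"
    right: "Regex"


@dataclass
class Star:
    inner: "Regex"


@dataclass
class Alt:
    left: "Regex"
    right: "Regex"


Regex = Union[Lit, Cat, Star, Alt]


def sample(r: Regex) -> str:
    """Return one string matched by r, following the rules for S."""
    parts: list[str] = []

    def walk(node: Regex) -> None:
        if isinstance(node, Lit):
            parts.append(node.text)   # S(<literal>) = <literal>
        elif isinstance(node, Cat):
            walk(node.left)           # S(rs) = S(r) + S(s)
            walk(node.right)
        elif isinstance(node, Star):
            pass                      # S(r*) = ""
        elif isinstance(node, Alt):
            walk(node.left)           # S(r|s) = S(r)
        else:
            raise TypeError(f"unknown node: {node!r}")

    walk(r)
    return "".join(parts)


# a*(b|c)
print(sample(Cat(Star(Lit("a")), Alt(Lit("b"), Lit("c")))))
```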

To find a string that satisfies the given expression, I tried the algorithm below.
i) Create an array of all the strings available in the given source.
ii) Create a function that takes the array, the expression and an initial index.
iii) Call the function recursively, incrementing the index on each call, until a matching string is found.
iv) Return from the function once a string matching the expression is found.
Below is the corresponding Java code:
public class ExpressionAlgo {
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        String data = "A quantifier defines how often an element can occur. The symbols ?, *, + and {} define the quantity of the regular expressions";
        regCheck(data.split(" "), "sym", 0);
    }

    public static void regCheck(String[] ar, String expresion, int i) {
        if (ar[i].contains(expresion)) {
            System.out.println(ar[i]);
            return;
        }
        if (i < ar.length - 1) {
            i = i + 1;
            regCheck(ar, expresion, i);
        }
    }
}
As far as I can tell, the cost is linear in the input rather than N^3: split scans the text once, and the recursion calls contains once per word, so these steps add up instead of multiplying.

How to get total number of non-repetitive substrings?

Suppose I have string str = "aabaa"
Its non repetitive substrings are
a
b
aa
ab
ba
aab
aba
baa
aaba
abaa
aabaa

Compute the suffix array and the longest common prefix array thereof.
suffix   LCP with next suffix
a        1
aa       2
aabaa    1
abaa     0
baa
Return (n+1)n/2, the number of substring bounds, minus the sum of the longest common prefix array.
(5+1)5/2 - (1+2+1+0) = 15 - 4 = 11.
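As a rough check of the formula, here is a minimal Python sketch. It builds the suffix array naively by sorting the suffixes, which is fine for short strings but is not the efficient construction a real implementation would use; the function name count_distinct_substrings is made up for this example.

```python
def count_distinct_substrings(s: str) -> int:
    """n(n+1)/2 minus the sum of LCPs between adjacent sorted suffixes."""
    n = len(s)
    suffixes = sorted(s[i:] for i in range(n))  # naive suffix array

    def lcp(a: str, b: str) -> int:
        # Length of the longest common prefix of a and b.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k

    lcp_sum = sum(lcp(a, b) for a, b in zip(suffixes, suffixes[1:]))
    return n * (n + 1) // 2 - lcp_sum


print(count_distinct_substrings("aabaa"))
```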

stringi's stri_replace_first_regex's replacement not seen as a regex

I have a string, in which I'm trying to replace the first matching pattern with a corresponding replacement. EG in my example below : if bb is found first, replace it by foo and don't replace anything else, but if cc is found first, replace it by bar and don't replace anything else.
This behaves almost as desired, except the replacement argument is not interpreted as a regex, but as a whole string. (But the pattern argument is seen as a regex, as required).
stri_replace_first_regex(
c(" bb cc bb cc "," cc bb cc bb ", " aa bb cc "),
pattern = " bb | cc ",
replacement = " foo | bar ")
Outputs: " foo | bar cc bb cc " " foo | bar bb cc bb " " aa foo | bar cc "
whereas I want it to output " foo cc bb cc " " bar bb cc bb " " aa foo cc "
Any idea how to solve that?
Thanks.
More context:
My inputs can have almost any formatting; they are postal addresses entered by customers, in which I need to replace the type of street with something standardized (for instance, turn street into st, road into rd and avenue into av). Any of those words can appear again (e.g. 20 bis road of sesame street), so I consider only the first appearance as valid, and subsequent appearances of a word from the pattern list must not be replaced.

You can use qdap library's mgsub for these replacements:
> input <- c("1 road of whatever road", "1 street of whatever street")
> pattern = c("^(.*?)\\bstreet\\b","^(.*?)\\broad\\b")
> replacement = c("\\1st","\\1rd")
> mgsub(pattern, replacement, input, fixed=FALSE, perl=TRUE)
[1] "1 rd of whatever road" "1 st of whatever street"
The patterns include ^ (start of string), (.*?) a capturing group matching any characters but a newline as few as possible up to the first occurrence of the whole words (due to the word boundaries \b) street and road.
The replacement patterns have backreferences (\\1) to the text captured with the capturing groups and the words to replace.

Read ?stringi::stri_replace_first_regex; str, pattern and replacement are all vectorized, so if you pass vectors, the i-th pattern and replacement are applied to the i-th string (this gives the desired result only when you already know which pattern comes first in each string):
stringi::stri_replace_first_regex(
c(" bb cc bb cc "," cc bb cc bb "),
pattern = c("bb", "cc"),
replacement = c("foo", "bar"))
# [1] " foo cc bb cc " " bar bb cc bb "
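The underlying idea (replace only the first match, and pick the replacement according to which alternative matched) is language-independent. Here is a minimal sketch in Python rather than R, offered only as an illustration; the names REPLACEMENTS and replace_first are made up for this example.

```python
import re

# Hypothetical lookup table: matched text -> replacement.
REPLACEMENTS = {"bb": "foo", "cc": "bar"}

# One alternation over all keys; re.escape keeps the keys literal.
PATTERN = re.compile("|".join(re.escape(key) for key in REPLACEMENTS))


def replace_first(text: str) -> str:
    """Replace only the first match, choosing the replacement by what matched."""
    return PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], text, count=1)


for s in [" bb cc bb cc ", " cc bb cc bb ", " aa bb cc "]:
    print(repr(replace_first(s)))
```

The callback sees the actual match, so no per-string knowledge of which word comes first is needed.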

Changing spaces with "prxchange", but not all spaces

I need to change the spaces in my text to underscores, but only the spaces that are between words, not the ones between digits. For example,
"The quick brown fox 99 07 3475"
would become
"The_quick_brown_fox 99 07 3475"
I tried using this in a data step:
mytext = prxchange('s/\w\s\w/_/',-1,mytext);
But the result was not what I wanted:
"Th_uic_row_ox 99 07 3475"
Any ideas on what I could do?
Thanks in advance.

Data One ;
X = "The quick brown fox 99 07 3475" ;
Y = PrxChange( 's/(?<=[a-z])\s+(?=[a-z])/_/i' , -1 , X ) ;
Put X= Y= ;
Run ;

You are changing
"W W"
to
"_"
when you want to change
"W W"
to
"W_W"
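so capture the characters on either side of the space and put them back:
prxchange('s/(\w)\s(\w)/$1_$2/',-1,mytext);
Since \w also matches digits, the full example uses [A-Za-z] so that spaces between numbers are left alone: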
data test;
mytext='The quick brown fox 99 07 3475';
newtext = prxchange('s/([A-Za-z])\s([A-Za-z])/$1_$2/',-1,mytext);
put _all_;
run;
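The difference between the capturing approach and the lookaround approach from the earlier answer can be seen in a short Python sketch (Python's re module is used here purely for illustration; the SAS patterns above are unchanged, and the function names are made up).

```python
import re

TEXT = "The quick brown fox 99 07 3475"


def with_captures(text: str) -> str:
    # Consumes the letters on both sides and writes them back.
    return re.sub(r"([A-Za-z])\s([A-Za-z])", r"\1_\2", text)


def with_lookarounds(text: str) -> str:
    # Only the whitespace is consumed; the letters are merely asserted.
    return re.sub(r"(?<=[A-Za-z])\s+(?=[A-Za-z])", "_", text)


print(with_captures(TEXT))
print(with_lookarounds(TEXT))
# With single-letter words the two differ, because a consumed letter
# cannot also start the next match:
print(with_captures("a b c"), "|", with_lookarounds("a b c"))
```

Both give the wanted result for the sample text; the lookaround form is the safer choice when one-letter words can occur.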

You can use the CALL PRXNEXT routine to find the position of each match, then use the SUBSTR function to replace the space with an underscore. I've changed your regular expression because \w matches any alphanumeric character, so it would also match the spaces between numbers. I'm not sure how you got your result using that expression.
Anyway, the code below should give you what you want.
data have;
mytext='The quick brown fox 99 07 3475';
_re=prxparse('/[a-z]\s[a-z]/i'); /* match a letter followed by a space followed by a letter, ignore case */
_start=1 /* starting position for search */;
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of 1st match */
do while(_position>0); /* loop through all matches */
substr(mytext,_position+1,1)='_'; /* replace ' ' with '_' for matches */
_start=_start-2; /* prevents the next start position jumping 3 ahead (the length of the regex search string) */
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of next match */
end;
drop _: ;
run;

Regex to calculate straight poker hand - Using ASCII CODE

In another question I learned how to detect a straight poker hand using regex (here).
Now, out of curiosity, the question is: can I use regex to detect the same thing using ASCII codes?
Something like:
regex: [C][C+1][C+2][C+3][C+4], where C is the ASCII code (or something like this)
Matches: 45678, 23456
Doesn't match: 45679 or 23459 (not in sequence)

Your main problem is really going to be that you're not using ASCII-consecutive encodings for your hands, you're using numerics for non-face cards, and non-consecutive, non-ordered characters for face cards.
You need to detect, at the start of the strings, 2345A, 23456, 34567, ..., 6789T, 789TJ, 89TJQ, 9TJQK and TJQKA.
These are not consecutive ASCII codes and, even if they were, you would run into problems since both 2345A and TJQKA are valid and you won't get A being both less than and greater than the other characters in the same character set.
If it has to be done by a regex, then the following regex segment:
(2345A|23456|34567|45678|56789|6789T|789TJ|89TJQ|9TJQK|TJQKA)
is probably the easiest and most readable one you'll get.

There is no regex that will do what you want as the other answers have pointed out, but you did say that you want to learn regex, so here's another meta-regex approach that may be instructional.
Here's a Java snippet that, given a string, programmatically generates the pattern that will match any substring of that string of length 5.
String seq = "ABCDEFGHIJKLMNOP";
System.out.printf("^(%s)$",
seq.replaceAll(
"(?=(.{5}).).",
"$1|"
)
);
The output is (as seen on ideone.com):
^(ABCDE|BCDEF|CDEFG|DEFGH|EFGHI|FGHIJ|GHIJK|HIJKL|IJKLM|JKLMN|KLMNO|LMNOP)$
You can use this to conveniently generate the regex pattern to match straight poker hands, by initializing seq as appropriate.
How it works
. metacharacter matches "any" character (line separators may be an exception depending on the mode we're in).
The {5} is an exact repetition specifier. .{5} matches exactly 5 ..
(?=…) is positive lookahead; it asserts that a given pattern can be matched, but since it's only an assertion, it doesn't actually make (i.e. consume) the match from the input string.
Simply (…) is a capturing group. It creates a backreference that you can use perhaps later in the pattern, or in substitutions, or however you see fit.
The pattern is repeated here for convenience:
  match one char
     at a time
           |
(?=(.{5}).).
\_________/
must be able to see 6 chars ahead
(capture the first 5)
The pattern works by matching one character . at a time. Before that character is matched, however, we assert (?=…) that we can see a total of 6 characters ahead (.{5})., capturing (…) into group 1 the first .{5}. For every such match, we replace with $1|, that is, whatever was captured by group 1, followed by the alternation metacharacter.
Let's consider what happens when we apply this to a shorter String seq = "ABCDEFG";. The ↑ denotes our current position.
=== INPUT ===       === OUTPUT ===

A B C D E F G       ABCDE|BCDEFG
↑
We can assert (?=(.{5}).), matching ABCDEF
in the lookahead. ABCDE is captured.
We now match A, and replace with ABCDE|

A B C D E F G       ABCDE|BCDEF|CDEFG
  ↑
We can assert (?=(.{5}).), matching BCDEFG
in the lookahead. BCDEF is captured.
We now match B, and replace with BCDEF|

A B C D E F G       ABCDE|BCDEF|CDEFG
    ↑
Can't assert (?=(.{5}).), skip forward

A B C D E F G       ABCDE|BCDEF|CDEFG
      ↑
Can't assert (?=(.{5}).), skip forward

A B C D E F G       ABCDE|BCDEF|CDEFG
        ↑
Can't assert (?=(.{5}).), skip forward
:
:
A B C D E F G       ABCDE|BCDEF|CDEFG
              ↑
Can't assert (?=(.{5}).), and we are at
the end of the string, so we're done.
So we get ABCDE|BCDEF|CDEFG, which are all the substrings of length 5 of seq.
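The same substitution trick carries over to other regex engines. As a sketch, here it is in Python, applied to a poker rank sequence; the sequence "A23456789TJQKA" (ace both low and high) and the function name windows_pattern are assumptions made for this illustration, and the sequence must not contain regex metacharacters.

```python
import re


def windows_pattern(seq: str, width: int = 5) -> str:
    """Build ^(w1|w2|...)$ from every window of `width` characters in seq."""
    # Replace each character that has width+1 characters ahead of it
    # with the captured window followed by '|'; the tail stays as-is.
    alternation = re.sub(r"(?=(.{%d}).)." % width, r"\1|", seq)
    return "^(%s)$" % alternation


print(windows_pattern("ABCDEFG"))

straight = re.compile(windows_pattern("A23456789TJQKA"))
print(bool(straight.match("789TJ")), bool(straight.match("45679")))
```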
References
regular-expressions.info/Dot, Repetition, Grouping, Lookaround

Something like regex: [C][C+1][C+2][C+3][C+4], being C the ASCII CODE (or like this)
You cannot do anything remotely close to this in most regex flavors. This is simply not the kind of pattern that regex is designed for.
There is no mainstream regex pattern that will succinctly match any two consecutive characters that differ by x in their ASCII encoding.
For instructional purposes...
Here you go (see also on ideone.com):
String alpha = "ABCDEFGHIJKLMN";
String p = alpha.replaceAll(".(?=(.))", "$0(?=$1|\\$)|") + "$";
System.out.println(p);
// A(?=B|$)|B(?=C|$)|C(?=D|$)|D(?=E|$)|E(?=F|$)|F(?=G|$)|G(?=H|$)|
// H(?=I|$)|I(?=J|$)|J(?=K|$)|K(?=L|$)|L(?=M|$)|M(?=N|$)|N$
String p5 = String.format("(?:%s){5}", p);
String[] tests = {
"ABCDE", // true
"JKLMN", // true
"AAAAA", // false
"ABCDEFGH", // false
"ABCD", // false
"ACEGI", // false
"FGHIJ", // true
};
for (String test : tests) {
System.out.printf("[%s] : %s%n",
test,
test.matches(p5)
);
}
This uses a meta-regexing technique to generate a pattern. That pattern ensures that each character is followed by the right character (or the end of the string), using lookahead. That pattern is then wrapped so that it must match exactly 5 times in a row.
You can substitute alpha with your poker sequence as necessary.
Note that this is an ABSOLUTELY IMPRACTICAL solution. It's much more readable to e.g. just check if alpha.contains(test) && (test.length() == 5).
Related questions
How does the regular expression (?<=#)[^#]+(?=#) work?

SOLVED!
See in http://jsfiddle.net/g48K9/3
I solved it using a helper function that maps each card to a numeric code, in JS.
String.prototype.isSequence = function () {
    if (this == "A2345") return true; // special case: ace-low straight
    var m = this.match(/^(\w)(\w)(\w)(\w)(\w)$/);
    if (!m) return false; // not exactly five cards
    // each card's code must be exactly one less than the next card's code
    return code(m[1]) == code(m[2]) - 1 &&
        code(m[2]) == code(m[3]) - 1 &&
        code(m[3]) == code(m[4]) - 1 &&
        code(m[4]) == code(m[5]) - 1;
};
function code(card) {
    // remap face cards so that all ranks are consecutive
    switch (card) {
        case "T": return 58;
        case "J": return 59;
        case "Q": return 60;
        case "K": return 61;
        case "A": return 62;
        default: return card.charCodeAt();
    }
}
test("23456");
test("23444");
test("789TJ");
test("TJQKA");
test("8JQKA");
function test(cards) {
    alert("cards " + cards + ": " + cards.isSequence());
}
Just to clarify, here are the ASCII codes; the face cards are remapped (->) so that all ranks are consecutive:
2 = 50
3 = 51
4 = 52
5 = 53
6 = 54
7 = 55
8 = 56
9 = 57
T = 84 -> 58
J = 74 -> 59
Q = 81 -> 60
K = 75 -> 61
A = 65 -> 62
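The remapping idea does not depend on ASCII at all: any table that gives consecutive numbers to consecutive ranks will do. Here is a minimal Python sketch of that variant; the names RANK and is_sequence are made up, and, as in the code above, the hand is assumed to be already sorted, with "A2345" treated as a special case.

```python
# Map each rank to its position in the ordering 2..9, T, J, Q, K, A.
RANK = {card: i for i, card in enumerate("23456789TJQKA")}


def is_sequence(cards: str) -> bool:
    """True if the five (pre-sorted) cards have consecutive ranks."""
    if cards == "A2345":  # special case: ace-low straight
        return True
    ranks = [RANK.get(card) for card in cards]
    if len(ranks) != 5 or None in ranks:
        return False
    return all(b - a == 1 for a, b in zip(ranks, ranks[1:]))


for hand in ["23456", "23444", "789TJ", "TJQKA", "8JQKA"]:
    print(hand, is_sequence(hand))
```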
