A regular expression that replaces a group with hard coded text - regex

First of all, I'm not sure whether this is even possible with regular expressions. If it is, I have no idea what to search for to find out how.
Let's say I have text:
Click <a href="...">this link</a> for more information.
And a regular expression:
<a[^>]*>([^<]*)</a>
The application of the regular expression would yield this for group 1:
this link
Let's say I wanted to write the regular expression to instead return hard coded text for group 1
<a[^>]*>(${{replacement text}}[^<]*)</a>
(this is made up syntax by the way)
So that the application of the regular expression to the text would yield this for group 1:
replacement text
Is this possible?
Here's another example just to solidify my objective:
Examples of text:
serverNode1/appPortal
serverNode1/appPortal2
serverNode1/appPortal3
My regular expression
appPortal((?:?{{"1"}}\b)|(?:\d))
(using the same made up syntax)
The expected output for the first character group should be
1
2
3
(The point of the expression is to match the word break and replace it with "1", or otherwise use the digit character class to match a digit. The sub-groups are made non-capturing with the ?: so the outside group is still group 1).
What is the point of this you may ask? I am using Splunk to do field extractions, and I'd like for the field to be extracted as 1, 2, or 3, like in my above example, and I can only rely on the regular expression groups to give me the fields (as in, I don't have anywhere to put code to say if group 1 == "" then change to "1").

Basically, with regular expressions as they are defined, this is not possible. By definition, regular expressions match patterns in the text. To be clear, a regex engine returns matches that are always part of the original string, nothing more. Some regex extensions allow you to give a capturing group a name, but that does not transform the match.
The behaviour you describe can easily be achieved by processing the regex match in any programming language, but it can also be achieved by combining regex substitution and parsing.
For example, s/appPortal(?!\d)/appPortal1/ will replace "appPortal" without the digit after it with "appPortal1" and then you can apply another regex to build the match you want.
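The two-step idea above can be sketched in Python. This is only an illustration of "substitute first, then extract"; the function name `portal_number` is made up for the example, and it says nothing about how Splunk itself would be configured.

```python
import re

def portal_number(text):
    """Return the appPortal digit, treating a bare "appPortal" as "1"."""
    # Step 1: rewrite "appPortal" that is NOT followed by a digit to "appPortal1".
    normalized = re.sub(r"appPortal(?!\d)", "appPortal1", text)
    # Step 2: the digit is now always present, so group 1 is an ordinary capture.
    match = re.search(r"appPortal(\d)", normalized)
    return match.group(1) if match else None

for line in ["serverNode1/appPortal",
             "serverNode1/appPortal2",
             "serverNode1/appPortal3"]:
    print(line, "->", portal_number(line))
```

The substitution changes the text before the capture runs, which is why group 1 can contain a "1" that was never in the original string.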

Related

Regular expression for duplicate string

Hello, I am trying to write a regular expression that finds a substring and replaces a portion of it. My input has the format
Some_text_beginning_AASHISH_XX_YY_COPY_COPY_COPY_COPY
Note that every string contains the word AASHISH, and at the end there can be an indeterminate number of COPY suffixes. I want to delete all of the COPY suffixes.
I wrote the regular expression as
(.*)_AASHISH_(.*)_COPY+
This finds all the valid strings. But when I try to replace the match with
$1_AASHISH_$2
it removes just the last _COPY. All the _COPY occurrences before the last one are taken to be part of group 2.
Note also that I am not using a programming language. I am using a third-party tool, and all it lets me do is search for a string and replace it. It does allow regular expressions.
Just to clarify why this question is not the same as ones posted before: the tool I am using does not support every regular expression feature. I don't know how the tool was built; I only have its UI.
Thanks in advance
Here's a regex that will capture the whole portion you want to maintain, resulting in a replacement that's just $1.
(.*_AASHISH_.*?)(?:_COPY)+
A few notes:
.*? - The ? on the end makes the repetition operator * non-greedy. It will match the minimum characters given its context.
(?:_COPY) - The ?: prefix makes this a non-capturing grouping.
+ - The repetition operator will make the entire last group (_COPY) repeat 1 or more times, not just the Y.
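As a quick check of the notes above, here is a small Python sketch that runs both the original pattern and the suggested one against the sample string. Python's `re` module is only a stand-in for the asker's third-party tool, whose regex flavor is unknown, and `strip_copies` is a hypothetical helper name.

```python
import re

SAMPLE = "Some_text_beginning_AASHISH_XX_YY_COPY_COPY_COPY_COPY"

# Original attempt: the greedy second (.*) swallows all but the last _COPY,
# and the + applies only to the final "Y".
original = re.sub(r"(.*)_AASHISH_(.*)_COPY+", r"\1_AASHISH_\2", SAMPLE)

# Suggested pattern: lazy .*? plus a repeated non-capturing (?:_COPY) group.
FIXED = re.compile(r"(.*_AASHISH_.*?)(?:_COPY)+")

def strip_copies(name):
    # Keep only group 1, i.e. everything before the run of _COPY suffixes.
    return FIXED.sub(r"\1", name)

print(original)              # still ends in _COPY_COPY_COPY
print(strip_copies(SAMPLE))  # Some_text_beginning_AASHISH_XX_YY
```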

Using Regular Expression, unable to get the complete value of second group [duplicate]

Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?
Consider the following PHP code (using PCRE regular expressions)
<?php
$test_string = 'I want to test sub patterns';
preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
print_r($matches);
?>
Array
(
[0] => I want to test sub patterns //entire pattern
[1] => I want to test //entire outer parenthesis
[2] => want //first inner
[3] => to //second inner
[4] => patterns //next parentheses set
)
The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.
So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?
From perlrequick:
"If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc."
Caveat: this excludes the opening parentheses of constructs that do not capture, such as non-capturing groups (?:...) and lookaheads (?=...).
Update
I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:
SUBPATTERNS
2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain number for the capturing subpatterns.
For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.
Yeah, this is all pretty much well defined for all the languages you're interested in:
Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
"Capturing groups are numbered by counting their opening parentheses from left to right. ... Group zero always stands for the entire expression."
.Net - http://msdn.microsoft.com/en-us/library/bs2twtah(VS.71).aspx
"Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern."
PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
"\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (It was also true of the deprecated POSIX functions)
PCRE - http://www.pcre.org/pcre.txt
To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:
The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
Perl's different - http://perldoc.perl.org/perlre.html#Capture-buffers
$1, $2 etc. match capturing groups as you'd expect (i.e. by occurrence of opening bracket), however $0 returns the program name, not the entire query string - to get that you use $& instead.
You'll more than likely find similar results for other languages (Python, Ruby, and others).
You say that it's equally logical to list the inner capture groups first and you're right - it would just be a matter of indexing on closing, rather than opening, parens (if I understand you correctly). Doing this is less natural though (for example, it doesn't follow reading direction convention) and so makes it more difficult (probably not significantly) to determine, by inspection, which capturing group will be at a given result index.
Putting the entire matched string in position 0 also makes sense, mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number of capturing groups from regex to regex, and regardless of the number of capturing groups that actually match anything (Java, for example, will collapse the length of the matched groups array for each capturing group that does not match any content; think of something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages like Perl, which dynamically create variables ($1, $2, etc.). (Perl's a bad example of course, since it uses $& for the matched expression, but you get the idea :)
Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.
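A short Python sketch of the same rule, using the PHP example from the question and the "red king" example from the PCRE documentation quoted above. Python is used here only because it follows the same left-to-right numbering; the variable names are illustrative.

```python
import re

# Groups are numbered by the position of their opening parenthesis,
# so an outer group always gets a lower number than the groups inside it.
php_example = re.search(r"(I (want) (to) test) sub (patterns)",
                        "I want to test sub patterns")
print(php_example.group(0))   # the whole match
print(php_example.groups())   # groups 1..4 in opening-parenthesis order

pcre_example = re.match(r"the ((red|white) (king|queen))", "the red king")
print(pcre_example.groups())
```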
Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:
Regex.Replace(@"one two three four",
@"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
@"$1 $2 $3 $4")
// result: "two four one three"
In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:
^[+-]?[0-9]{1,3}
(?:
(?:(?<thousand>\,)[0-9]{3})*
(?:(?<decimal>\.)[0-9]{2})?
|
(?:(?<thousand>\.)[0-9]{3})*
(?:(?<decimal>\,)[0-9]{2})?
|
[0-9]*
(?:(?<decimal>[\.\,])[0-9]{2})?
)$
If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.
And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D
The order of capturing in the order of the left paren is standard across all the platforms I've worked in. (perl, php, ruby, egrep)

Combining two Regular expressions into a single one using Vb.net

I have the two regular expressions shown below.
Regex 1.
^\d{9}_[a-zA-Z]{1}_(0[1-9]|1[0-2]).(0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4}_[0-9]{3}_[0-9a-zA-Z]{2}(?:_[0-9a-zA-Z]*)?
I use this to check strings such as
999999999_A_12.10.2015_010_2q_somedescription
If any part of this pattern fails to match, for example
999999999_12.10.2015_010_2q_somedescription
I need to report that the second part is missing. For this I am using regex 2.
Regex 2.
^\d{9}_^[a-zA-Z]$_(0[1-9]|1[0-2]).(0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4}_^[0-9]{3}$_[0-9a-zA-Z]{2}$_[0-9a-zA-Z]*
I tried splitting regex 1 and the string into groups and comparing them. I am using the Regex.Match method in VB.NET, but even if my string contains
999999999_AB_12.10.2015_010_2q_somedescription
it reports success, so I wrote regex 2 for an exact match. But I need to combine these two regular expressions into a single one. Splitting regex 2 and the string and comparing them with the Regex.Match method in VB.NET works, but I don't want to maintain two regular expressions.
Considered a match:
999999999_A_12.10.2015_010_2q_somedescription
If anything is missing from the above string, like
999999999_12.10.2015_010_2q_somedescription
or if anything differs from the above format, like
999999999_AB_12.10.2015_010_2q_somedescription
it is considered a mismatch. I need to find which part is missing and notify the user.
Mismatches:
999999999_12.10.2015_010_2q_somedescription
999999999_AB_12.10.2015_010_2q_somedescription
999999999_AB_12.10.20_010_2q_somedescription
999999999_AB_12.10.2015_01_2q_somedescription
999999999_AB_12.10.2015_010_2_somedescription
9999_AB_12.10.2015_010_2q_somedescription
You should use a named group to get the value of the part that can change. For example:
\d{9}_(?<X>[a-zA-Z]{1})_(0[1-9]|1[0-2]).(0[1-9]|[1-2][0-9]|3[0-1]).[0-9]{4}_[0-9]{3}_[0-9a-zA-Z]{2}(?:_[0-9a-zA-Z]*)?
Now in VB.NET you can check the value of the capture group X in your match. You can then use if or switch to do whatever you want.
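To show the named-group idea in runnable form, here is a rough Python sketch rather than VB.NET. Python spells named groups `(?P<X>...)` instead of `(?<X>...)`. Two things here are my own assumptions, not part of the answer: the date separators are escaped as `\.` so they match only a literal dot, and `fullmatch` is used so the whole string must fit the pattern. The helper name `letter_field` is hypothetical.

```python
import re

PATTERN = re.compile(
    r"\d{9}_(?P<X>[a-zA-Z])_"                                  # 9 digits, one letter
    r"(0[1-9]|1[0-2])\.(0[1-9]|[12][0-9]|3[01])\.[0-9]{4}_"    # MM.DD.YYYY
    r"[0-9]{3}_[0-9a-zA-Z]{2}(?:_[0-9a-zA-Z]*)?"               # 3 digits, 2 chars, description
)

def letter_field(value):
    """Return the text captured by group X, or None if the string does not match."""
    match = PATTERN.fullmatch(value)
    return match.group("X") if match else None

print(letter_field("999999999_A_12.10.2015_010_2q_somedescription"))
print(letter_field("999999999_AB_12.10.2015_010_2q_somedescription"))
print(letter_field("999999999_12.10.2015_010_2q_somedescription"))
```

Note that a single pattern like this only tells you whether the string matched and what each group captured; on a failed match it does not say which part was at fault.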

Regular expression - Strange behavior

I'm writing a compiler. I'm just starting, so I'm creating the Scanner (or Lexer). Currently, I'm writing some regular definitions which will be processed by my scanner. While trying to create one of them, I ran into the following problem:
I was testing, in RegExr, the following (incredibly simple) regular expression:
r = /(a|ab)/
Where "r" is a regular definition; I mean, the regular expression just is (a|ab).
I thought the language L(r) would be (according to the book Compilers: Principles, Techniques and Tools):
L(r) = {a, ab}
Surprisingly, the tool matches {a}!
So my question is, why this behavior?
The regex a|ab matches "a" or "ab" (obviously), but some tools/languages (eg Java) consider the input to match when the entire input matches the regex, while others (eg JavaScript) consider input to match when some of the input matches.
Your tool must be a "some" variety to match "{a}".
A regex engine scans the text from left to right, and in the case of an alternation (|) it will first try to match the first alternative.
If you use:
(ab|a)
It will match both ab and a's.
The point is that once a match is found, a global matcher will start the next match attempt after the end of the first match.
You can easily verify that the matched language is {a,ab}: use the regex ^c(a|ab)d and use cabd. In that case, the regex has no choice but selecting the second option.
So say the regex reads: (a|ab) and the text is ab. It will match with a, next it will start after a, so it will attempt to match with b, but fail.
Most lexer tools, however, determine the match differently. For lexer tools, the "longest match" counts, that is, the match with the largest number of characters.
Now if you enter (a|ba) as the regex, it will match ba first. Why? Because the engine also looks for the earliest starting position. In the text cbad, starting at index 1 (b) is seen as better than starting at index 2 (a).
As said by @bohemian, some regex tools evaluate just a part of the string. If you want to match the whole string, you can use a regex like this:
/^(a|ab)$/
Which only will accept a or ab
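The behaviours described in this thread can be reproduced with a few lines of Python. Python's `re` module is a backtracking engine that tries alternatives in order, so it stands in for the "first alternative wins" tools; it is not a longest-match lexer. The variable names are just labels for the example.

```python
import re

# Unanchored at the end: the first alternative that works is taken.
first = re.match(r"(a|ab)", "ab").group()

# Swapping the alternatives lets the longer one be tried first.
swapped = re.match(r"(ab|a)", "ab").group()

# Requiring the whole input to match forces backtracking into "ab".
whole = re.fullmatch(r"(a|ab)", "ab").group()

# Surrounding context has the same effect: only "ab" lets the "d" match.
forced = re.match(r"^c(a|ab)d", "cabd").group(1)

# The earliest starting position wins over the order of alternatives.
leftmost = re.search(r"(a|ba)", "cbad").group()

print(first, swapped, whole, forced, leftmost)
```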

Regex replacement character only if group matches

In a sed/egrep-style regular expression is it possible to print a character in the replacement string only if one of the groups matched?
For example, suppose I have an expression such as:
/^func([ \t]+\([^)]+\))?[ \t]+([a-zA-Z0-9_]+)/\1.\2/
Is it possible to print the period in the replacement only if the group \1 matched?
Specifically I'm trying to write an expression for the --regex-<LANG> option as described in http://ctags.sourceforge.net/ctags.html
The only thing I can think of is two replace commands:
/^func[ \t]+([a-zA-Z0-9_]+)/\1/
/^func([ \t]+\([^)]+\))?[ \t]+([a-zA-Z0-9_]+)/\1.\2/
The documentation of ctags suggests that this is supported by simply specifying two --regex-<LANG> options:
The regular expression defined by this option is added to the current list of regular expressions for the specified language unless the parameter is omitted, in which case the current list is cleared.
In Perl, you can call an arbitrary function on the group matches, but this doesn't help here.
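The two-expression approach can be emulated in Python to see what each rule would produce. This is only a sketch under assumptions: it tries the rules in order and stops at the first match, which is a simplification and not a claim about how ctags processes its list. It also makes the receiver group in the second rule mandatory, since the first rule already covers the case without it. `RULES` and `tag_name` are hypothetical names.

```python
import re

# (pattern, replacement template) pairs, mirroring the two expressions above.
RULES = [
    # Plain function: emit only the name.
    (re.compile(r"^func[ \t]+([a-zA-Z0-9_]+)"), r"\1"),
    # Function with a parenthesised receiver: emit "receiver.name".
    (re.compile(r"^func([ \t]+\([^)]+\))[ \t]+([a-zA-Z0-9_]+)"), r"\1.\2"),
]

def tag_name(line):
    """Return the expanded replacement for the first rule that matches."""
    for pattern, template in RULES:
        match = pattern.match(line)
        if match:
            return match.expand(template)
    return None

print(repr(tag_name("func Foo(x int)")))
print(repr(tag_name("func (r *T) Bar()")))
```

Because group 1 in the second rule includes the leading whitespace and the parentheses, the second result keeps them; capturing only the text inside the parentheses would need a slightly different pattern.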