Annotating mismatches in regular expression

I need to "annotate" each mismatch of a regular expression with an X character. For example, if I have a text file like:
Line1Name: this is a (string).
Line2Name: (a string)
Line3Name this is a line without parenthesis
Line4Name: (a string 2)
Now the following regular expression will match everything before a :
^[^:]+(?=:)
so the result will be
Line1Name:
Line2Name:
Line4Name:
However, I need to annotate the mismatch on the 3rd line, to get this output:
Line1Name:
Line2Name:
X
Line4Name:
Is this possible with regular expressions?

If you have a look at what a regular expression is, you will realize that it is not possible to do logical operations with a regex alone. Quoting Wikipedia:
In computing, a regular expression provides a concise and flexible means to “match” (specify and recognize) strings of text, such as particular characters, words, or patterns of characters.
emphasis mine – simply put, a regex is a fancy way to find a string; it either does (it matches), or not.
To achieve what you are after, you need some kind of logic switch that operates on the match / not-match result of your regex search and triggers an action. You haven’t specified in what environment you are using your regex, so providing a solution is a bit pointless, but as an example, this would do what you are trying to do in pure bash:
# assuming your string is in $str
result="$([[ $str =~ ^[^:]+: ]] && echo "${str%%:*}" || echo "X")"
and this does the same thing in a language supporting your regex pattern (Ruby):
# assuming your string is in str
result = str[/^[^:]+(?=:)/] || "X"
As a side note, your example code does not match the output: you are using a lookahead for the colon, which excludes it in the match, but your output includes it. I’ve opted for sticking with your regex over your output pattern in my examples, thus excluding the colon from the result.
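The same "match, or fall back to X" switch can be sketched in Python, assuming the input arrives as a list of lines. The function name annotate and the marker parameter are made up for illustration, and the pattern is the one from the question, so the colon is excluded from the result.

```python
import re

# Same pattern as in the question: everything before the first colon.
PATTERN = re.compile(r"^[^:]+(?=:)")

def annotate(lines, marker="X"):
    """Return the matched prefix of each line, or `marker` when the line does not match."""
    results = []
    for line in lines:
        match = PATTERN.search(line)
        results.append(match.group(0) if match else marker)
    return results

sample = [
    "Line1Name: this is a (string).",
    "Line2Name: (a string)",
    "Line3Name this is a line without parenthesis",
    "Line4Name: (a string 2)",
]
print("\n".join(annotate(sample)))
```

The regex only reports whether a line matched; the conditional expression around it is what produces the X.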

Related

How to use Regular expression in perl if both regular expression and strings are variables

I have two variables coming from some user inputs. One is a string that needs to be checked and other one is a regular expression as below.
Following code doesn't work.
my $pattern = "/^current.*$/";
my $name = "currentStateVector";
if($name =~ $pattern) {
print "matches \n";
} else {
print "doesn't match \n";
}
And following does.
if($name =~ /^current.*$/) {
print "matches \n";
} else {
print "doesn't match \n";
}
What's the reason for this? I have the regular expression stored in a variable. Is there another way to store this variable or modify it?
The double-quotes that you use interpolate -- they first evaluate what's inside them (variables, escapes, etc) and return a string built with evaluations' results and remaining literals. See Gory details of parsing quoting constructs for an illuminating discussion, with lots of detail.
And your example string happens to have a $/ there, which is one of Perl's global variables (see perlvar) so $pattern is different than expected; print it to see. (In this case the / is erroneous as discussed below but the point stands.)
Instead, either use single quotes to avoid interpretation of characters like $ and \ (etc) so that they are used in regex as such
my $pattern = q(^current.*$);
or, better, use the regex-specific qr operator
my $pattern = qr/^current.*$/;
which builds from its string a proper regex pattern (a special type of Perl value), and allows use of modifiers. In this case you need to escape characters that have a special meaning in regex if you want them to be treated as literals.
Note that there's no need for // for the regex, and they wouldn't be a part of the pattern anyway -- having them around the actual pattern is wrong.
Also, carefully consider all circumstances under which user input may end up being used.
It is brought up in a comment that users may submit a "pattern" with extra /'s. That'd be wrong, as mentioned above; only the pattern itself should be given (surrounded on the command-line by ', so that the shell doesn't interpret particular characters in it). More detail follows.
The /'s are clearly not meant as a part of the pattern, but are rather intended to come with the match operator, to delimit (quote) the regex pattern itself (in the larger expression) so that one can use string literals in the pattern. Or they are used for clarity, and/or to be able to specify global modifiers (even though those can be specified inside patterns as well).
But then if users still type them around the pattern the regex will use those characters as a part of the pattern and will try to match a leading /, etc; it will fail, quietly. Make sure that users know that they need to give a pattern alone, with no delimiters.
If this is likely to be a problem I'd check for delimiters and if found carry on with a "loud" (clear) warning. What makes this tricky is the fact that a pattern starting and ending with a slash is legitimate -- it is possible, if somewhat unlikely, that a user may want actual /'s in their pattern. So you can only ask, or raise a warning, not abort.
Note that with a pattern given in a variable, or with an expression yielding a pattern at runtime, the explicit match operator and delimiters aren't needed for matching; the variable or the expression's return is taken as a search pattern and used for matching. See The basics (perlre) and Binding Operators (perlop).
So you can do simply $name =~ $pattern. Of course $name =~ /$pattern/ is fine as well, where you can then give global modifiers after the closing /
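The "check for delimiters and warn loudly" idea can be sketched as follows. This is written in Python rather than Perl, purely as an illustration; compile_user_pattern is a hypothetical helper name, and the check only warns, since a pattern that really starts and ends with a slash is legitimate.

```python
import re
import warnings

def compile_user_pattern(raw):
    """Compile a user-supplied pattern, warning if it looks wrapped in /.../ delimiters."""
    if len(raw) >= 2 and raw.startswith("/") and raw.endswith("/"):
        # Only warn: the user may genuinely want literal slashes in the pattern.
        warnings.warn(
            f"pattern {raw!r} starts and ends with '/'; the slashes will be matched literally"
        )
    return re.compile(raw)

name = "currentStateVector"
print(bool(compile_user_pattern("^current.*$").search(name)))    # pattern alone
print(bool(compile_user_pattern("/^current.*$/").search(name)))  # with stray delimiters
```

With the stray slashes the pattern compiles without error but does not match the name, which is the quiet failure described above.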
The slashes are part of the matching operator m//, not part of the regex.
When I populate the regex from user input
my $pattern = shift;
and run the script as
58663971.pl '^current.*$'
it matches.

tcl regular expression, attempting to pull out a string between two patterns

Greetings!
I am trying to use tcl regular expressions to strip off unwanted characters and keep the desired string.
The 4 basic string types are
I34/pAVDD_3
I32/pDVDD_15_2
I999/pAGND
I3/pDOUT_LG0
What I want to capture is what's in-between the p and the end of the string or the last underscore & number if it exists. With the strings above I want to capture AVDD, DVDD_15, AGND, and DOUT_LG0.
I thought I had it with [p](\w*)?[_][\d*] but it doesn't work with I3/pDOUT_LG0, and after quite a while of trying different things, I can't find a pattern that will work.
Thanks!
How about
regexp {p(?:(\w+)_\d|(\w+))$} $str -> c1 c2
set result $c1$c2
One or the other will be empty, so the result is a simple concatenation of them.
Another possible solution is to strip off the unwanted parts:
regsub -all {.+p|_\d$} $str {}
Documentation:
regexp,
regsub,
Syntax of Tcl regular expressions
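For comparison, here is a rough sketch of the same capture in Python rather than Tcl. It uses a lazy group plus an optional trailing underscore-and-digit, which is a different technique from the alternation above; extract_name is a made-up helper name, and a single trailing digit is assumed, as in the answer's pattern.

```python
import re

# Lazy capture after the first "p", then an optional trailing "_<digit>" that is left out.
NAME = re.compile(r"p(\w+?)(?:_\d)?$")

def extract_name(s):
    """Return the text between 'p' and the optional trailing '_<digit>', or None."""
    m = NAME.search(s)
    return m.group(1) if m else None

for s in ["I34/pAVDD_3", "I32/pDVDD_15_2", "I999/pAGND", "I3/pDOUT_LG0"]:
    print(s, "->", extract_name(s))
```

Because the group is lazy, it stops as early as the rest of the pattern allows, so the trailing _3 or _2 is dropped while DOUT_LG0 is kept whole.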

Find and trim part of what is found using regular expression

I'm a newbie in writing regular expressions
I have a file name like this TST0101201304-123.txt and my target is to get the numbers between '-' and '.txt'
So I wrote this pattern: -([0-9]*)\.txt. It gets me the numbers that I want, but it also retrieves the hyphen '-' and the trailing '.txt', so the result in the example above is '-123.txt'
So my question is:
Is there a way in regular expressions to get only part of the matched string, like a submatch of the match without the need to trim it in my shell script code for unix?
I found this answer but it is getting the same result:
Regexp: Trim parts of a string and return what ever is left
Tip: To test my regular expressions I used this website
You can use lookbehind and lookahead
(?<=-)[0-9]*(?=[.]txt)
Don't know if it would work in unix
Different regex-engines are different. Since you're using expr match, you need to make two changes:
expr match expects a regex that matches the entire string; so, you need to add .* at the beginning of yours, to cover everything before the hyphen.
expr match uses POSIX Basic Regular Expressions (BREs), which use \( and \) for grouping (and capturing) rather than merely ( and ).
But, conveniently, when you give expr match a regex that contains a capture-group, its output is the content of that capture-group; you don't need to do anything else special. So:
$ expr match TST0101201304-123.txt '.*-\([0-9]*\)\.txt'
123
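The capture-group idea is not specific to expr. As a small sketch in Python (number_part is a hypothetical helper name, and at least one digit is assumed), the group is read back explicitly instead of being printed automatically:

```python
import re

def number_part(filename):
    """Return the digits between the last '-' and '.txt', or None if absent."""
    m = re.search(r"-([0-9]+)\.txt$", filename)
    return m.group(1) if m else None

print(number_part("TST0101201304-123.txt"))
```

Here the whole match is still '-123.txt'; only group 1 is returned, so no trimming is needed afterwards.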
sed is your friend.
echo "$filename" | sed -e 's/.*-\([0-9]*\)\.txt/\1/'
should get you what you want.

how to avoid to match the last letter in this regexp?

I have a question about regexp in tcl:
first output: TIP_12.3.4 %
second output: TIP_12.3.4 %
and sometimes the output maybe look like:
first output: TIP_12 %
second output: TIP_12 %
I want to get the number 12.3.4 or 12 using the following regexp:
output: TIP_(/[0-9].*/[0-9])
but why does it not match 12.3.4 or 12?
You need to escape the dot, else it stands for "match every character". Also, I'm not sure about the slashes in your regexp. Better solution:
/TIP_(\d+\.?)+/
Your problem is that / is not special in Tcl's regular expression language at all. It's just an ordinary printable non-letter character. (Other languages are a little different, as it is quite common to enclose regular expressions in / characters; this is not the case in Tcl.) Because it is a simple literal, using it in your RE makes it expect it in the input (despite it not being there); unsurprisingly, that makes the RE not match.
Fixing things: I'd use a regular expression like this: output: TIP_([\d.]+) under the assumption that the data is reasonably well formatted. That would lead to code like this:
regexp {output: TIP_([0-9.]+)} $input -> dottedDigits
Everything not in parentheses is a literal here, so that the code is able to find what to match. Inside the parentheses (the bit we're saving for later) we want one or more digits or periods; putting them inside a square-bracketed-set is perfect and simple. The net effect is to store the 12.3.4 in the variable dottedDigits (if found) and to yield a boolean result that says whether it matched (i.e., you can put it in an if condition usefully).
NB: the regular expression is enclosed in braces because square brackets are also Tcl language metacharacters; putting the RE in braces avoids trouble with misinterpretation of your script. (You could use backslashes instead, but they're ugly…)
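The same pattern carries over to other engines unchanged. A minimal sketch in Python, under the same assumption that the data is reasonably well formatted (tip_version is a made-up name):

```python
import re

# Literal prefix, then one or more digits or periods captured in group 1.
TIP = re.compile(r"output: TIP_([0-9.]+)")

def tip_version(line):
    """Return the dotted number after 'output: TIP_', or None if the line does not match."""
    m = TIP.search(line)
    return m.group(1) if m else None

print(tip_version("first output: TIP_12.3.4 %"))
print(tip_version("second output: TIP_12 %"))
```

As in the Tcl version, the result doubles as a found/not-found signal, here via None.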
Try this :
output: TIP_(/([0-9\.^%]*)/[0-9])
Capture group 1.
Demo here :
http://regexr.com?31f6g
The following expression works for me:
{TIP_((\d+\.?)+)}

Regular Expression, dynamic number

The regular expression which I have provided will select the string 72719.
Regular expression:
(?<=bdfg34f;\d{4};)\d{0,9}
Text sample:
vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;8467;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;
How can I rewrite the expression to select that string even when the number 8467; is changed to 84677; or 846777; ? Is it possible?
First, when asking a regex question, you should always specify which language you are using.
Assuming that the language you are using does not support variable length lookbehind (and most don't), here is a solution which will work. Your original expression uses a fixed-length lookbehind to match the pattern preceding the value you want. But now this preceding text may be of variable length so you can't use a look behind. This is no problem. Simply match the preceding text normally and capture the portion that you want to keep in a capture group. Here is a tested PHP code snippet which grabs all values from a string, capturing each value into capture group $1:
$re = '/^bdfg34f;\d{4,};(\d{0,9})/m';
if (preg_match_all($re, $text, $matches)) {
$values = $matches[1];
}
The changes are:
Removed the lookbehind group.
Added a start of line anchor and set multi-line mode.
Changed the \d{4} "exactly four" to \d{4,} "four or more".
Added a capture group for the desired value.
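For readers not using PHP, the four changes above can be sketched the same way in Python. This is an illustration rather than a port of the tested snippet; the sample text is the one from the question with 8467 changed to 84677.

```python
import re

# Start-of-line anchor in multi-line mode, "four or more" digits, value in group 1.
FIELD = re.compile(r"^bdfg34f;\d{4,};(\d{0,9})", re.MULTILINE)

text = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

values = FIELD.findall(text)
print(values)
```

findall returns the contents of the single capture group for each matching line, which plays the role of $matches[1] in the PHP version.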
Here's how I usually describe "fields" in a regex:
[^;]+;[^;]+;([^;]+);
This means "stuff that isn't semi-colon, followed by a semicolon", which describes each field. Do that twice. Then the third time, select it.
You may have to tweak the syntax for whatever language you are doing this regex in.
Also, if this is just a data file on disk and you are using GNU tools, there's a much easier way to do this:
cat file | cut -d";" -f 3
to match the first number with a minimum of 4 digits
(?<=bdfg34f;\d{4,};)\d{0,9}
and to match the first number with 1 or more length
(?<=bdfg34f;\d+;)\d{0,9}
or to match the first number only if the length is between 4 and 6
(?<=bdfg34f;\d{4,6};)\d{0,9}
This is a simple text parsing problem that probably doesn't mandate the use of regular expressions.
You could take the input line by line and split on ';', e.g. in PHP (I have no idea what language you're using):
foreach (explode("\n", $string) as $line) {
$bits = explode(";", $line);
echo $bits[2]; // third column (indexes start at 0)
}
If this is indeed in a file and you happen to be using PHP, using fgetcsv would be much better though.
Anyway, context is missing, but the bottom line is I don't think you should be using regular expressions for this.
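As a sketch of the regex-free route in another language, Python's csv module can split on the semicolon directly. third_column is a hypothetical helper, and treating the data as plain delimited text with no quoting is an assumption:

```python
import csv
import io

def third_column(text, delimiter=";"):
    """Return the third field (index 2) of every line that has at least three fields."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row[2] for row in reader if len(row) > 2]

data = """vfhnsirf;5234;72159;2;668912;28032009;4;
bdfg34f;84677;72719;7;6637912;05072009;7;
b5g342sirf;234;72119;4;774582;20102009;3;"""

print(third_column(data))
```

The width of the second field no longer matters, because the split happens on the separator rather than on the field's content.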