What does this regular expression try to match? - regex

These days I am learning regular expressions, but they seem a little hard to me. I am reading some code in TCL; what is this expression trying to match?
regexp ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]" $input

If you un-escape the characters, you get the following:
.* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]
The term [\d]{x} would match x number of consecutive digits. Therefore, the portion inside the parentheses would match something of the form ###:###:###?##### (where # can be any digit and ? can be any character). The parentheses themselves aren't matched, they're just used for specifying what part of the input to "capture" and return to the caller. Following this sequence is a single dot ., which matches a single character (which can be anything). The trailing [^\n] will match a single character that is anything except a newline (a ^ at the start of a bracketed expression inverts the match). The .* term at the very beginning matches a sequence of characters of any length (even zero), followed by a space.
With all of this taken into account, it appears that this regular expression extracts a series of digits from the middle of a line. Given the format of the numbers, it may be looking for a timestamp in the hours:minutes:seconds.milliseconds format (although if that is the case, {1,3} and {1,5} should be used instead). The trailing .[^\n] term looks like it could be trying to exclude timestamps that are at or near the end of a line. Timestamped logs often have a timestamp followed by some sort of delimiting character (:, >, a space, etc). A regular expression like this might be used to extract timestamps from the log while ignoring "blank" lines that have a timestamp but no message.
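As a quick cross-check of the reading above, here is a minimal sketch in Python rather than TCL. The name extract_stamp is made up for illustration, and it assumes Python's re module treats this simple pattern the same way TCL's engine does. It applies the un-escaped expression and returns the captured group.

```python
import re

# Un-escaped form of the TCL pattern from the question.
PATTERN = re.compile(r".* ([\d]{3}:[\d]{3}:[\d]{3}.[\d]{5}).[^\n]")


def extract_stamp(line):
    """Return the captured digit group, or None if the line does not match."""
    m = PATTERN.search(line)
    return m.group(1) if m else None


for sample in (
    "TEST: 123:456:789:12345> sample log line",
    " 111:222:333.44444 foo",
    "111:222:333.44444 foo",   # no space before the digits
    " 10:44:56.12344: ",       # too few digits in each group
):
    print(repr(sample), "->", extract_stamp(sample))
```

The last two samples come back as None, which lines up with the explanation that the leading space and the exact digit counts are both required.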
Update:
Here's an example using TCL 8.4:
% set re ".* (\[\\d]\{3\}:\[\\d]\{3\}:\[\\d]\{3\}.\[\\d]\{5\}).\[^\\n]"
% regexp $re "TEST: 123:456:789:12345> sample log line"
1
% regexp $re " 111:222:333.44444 foo"
1
% regexp $re "111:222:333.44444 foo"
0
% regexp $re " 111:222:333.44444 "
0
% regexp $re " 10:44:56.12344: "
0
%
% regexp $re "TEST: 123:456:789:12345> sample log line" match data
1
% puts $match
TEST: 123:456:789:12345>
% puts $data
123:456:789:12345
The first two examples match the expression. The third fails because it lacks the space character before the first number sequence. The fourth fails because it doesn't have a non-newline character at the end after the trailing space. The fifth fails because the numerical sequences don't have enough digits. By passing parameters after the input, you can store the part of the input that matched the expression as well as the data that was "captured" by using parentheses. See the TCL wiki for details on the regexp command.
The interesting part with TCL is that you have to escape the [ character but not the ], while both the { and } need escaping.

.* ==> match junk part of the input
( ==> start capture
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}: ==> match 3 digits followed by ':'
\[\\d]\{3\}. ==> match 3 digits followed by any character
\[\\d]\{5\} ==> match 5 digits
). ==> close capture and match any character
\[^\\n] ==> match a character that is not a newline

Related

Regex POSIX - How can i find if the start of a line contains a word from a word that appears later in line

I have a UNIX passwd file and i need to find using egrep if the first 7 characters from GECOS are inside the username. I want to check if the username (jkennedy) contains the word "kennedy" from the GECOS.
I was planning to use back-references but the username is before the gecos so i don't know how to implement it.
For example the passwd file contains this line:
jkennedy:x:2473:1067:kennedy john:/root:/bin/bash
As per my original comment, the regex below works for me.
See it in use here - note this regex differs slightly as it's more used for display purposes. The regex below is the POSIX version of this and removes non-capture groups and the unneeded capture group around the backreference.
^[^:]*([^:]{7})([^:]*:){4}\1.*$
^ assert position at the start of the line
[^:]* match any character except : any number of times
([^:]{7}) capture exactly seven of any character except :
([^:]*:){4} match the following exactly four times
[^:]*: match any character except : any number of times, followed by : literally
\1 match the backreference; matches what was previously matched by the first capture group
.* match any character (except newline characters) any number of times
$ assert position at the end of the line
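To see the breakdown above in action outside egrep, here is a small sketch in Python (an assumption on my part: the question is about egrep, but Python's re module accepts the same pattern, including the \1 backreference). The function name shared_prefix is hypothetical.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r"^[^:]*([^:]{7})([^:]*:){4}\1.*$")


def shared_prefix(passwd_line):
    """Return the 7 characters shared by username and GECOS, or None."""
    m = PATTERN.search(passwd_line)
    return m.group(1) if m else None


print(shared_prefix("jkennedy:x:2473:1067:kennedy john:/root:/bin/bash"))
print(shared_prefix("root:x:0:0:root:/root:/bin/bash"))
```

For the first line the engine backs off the leading [^:]* to "j" so that the group can capture "kennedy", then skips four colon-terminated fields and finds the same text at the start of the GECOS field. The second line has a username shorter than seven characters, so nothing matches.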
Assuming you do NOT want case sensitivity to foul your matching -
declare -l tmpUsr tmpName
while IFS=: read -r usr x x x name x
do tmpUsr="$usr"; tmpName="$name"
(( ${#name} )) && [[ "$tmpUsr" =~ ${tmpName:0:7} ]] &&
printf '%s (%s<%s>)\n' "$usr" "$name" "${tmpName:0:7}"
done</etc/passwd
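The same case-insensitive check can be sketched in Python. This is only an illustration under a few assumptions: the function name is made up, it expects the usual colon-separated passwd layout with GECOS in the fifth field, and it uses a plain substring test where the shell loop uses a regex match.

```python
def gecos_prefix_in_username(passwd_line, width=7):
    """True if the first `width` characters of GECOS occur in the username."""
    fields = passwd_line.rstrip("\n").split(":")
    # Skip malformed lines and entries with an empty GECOS field.
    if len(fields) < 5 or not fields[4]:
        return False
    username = fields[0].lower()
    prefix = fields[4][:width].lower()
    return prefix in username


print(gecos_prefix_in_username("jkennedy:x:2473:1067:Kennedy John:/root:/bin/bash"))
print(gecos_prefix_in_username("daemon:x:1:1::/:/sbin/nologin"))
```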

Need to understand Regex advanced substring, replace and split to make work with PowerShell

In forums, have seen the signature as:
[string](0..9 | %{[char][int](32+("......................."
).substring(($_*2), 2))}) -replace "\s{1}\b"
it gives the result as an email address.
Need to understand the substring and replace syntax here and to be precise the complete syntax how it is being evaluated.
I understand that [string] is a data type. Then, for each number in the range, the result is converted to an integer and then to a character (by its ASCII value, I assume). The substring starts at the index multiplied by 2 (I don't understand why) and takes only 2 characters (the result will already be 2 digits, so why?). Then comes the replace, which replaces whitespace, and the \b at the end signifies a word boundary.
Also, as mentioned in http://ss64.com/ps/syntax-regex.html, how
PS> 'ABCD' -replace "([AC])(.)",'$2-$1'
B-AD-C
results in B-AD-C.
What is the significance of the period here? I can't find anything that tells me how to use it. I have tried removing it, and it results in
PS> 'ABCD' -replace "([AC])",'$2-$1'
$2-AB$2-CD
Why is the period significant with capture groups in regex?
I have been at this for almost a week now but have not been able to find the exact meaning.
Any help would be appreciated on this.
Regards
Merry X'Mas
I think you are supposed to replace the dots with digits.
For each digit n from 0 to 9, pick the n'th pair of digits in the string, add 32, and convert to that Unicode character. Then remove every space that comes before a word character.
32 is the code for a space (" "), and is followed by 94 printable characters. See List of Unicode characters, Basic Latin (Wikipedia) for a list.
[string], [char], and [int] convert a value to the specified type. The % { } syntax is short for ForEach-Object { }.
The regex at the end matches one whitespace character followed by a word boundary. That is, it matches a single whitespace character, but only when it comes directly before a word character (A-Z, a-z, 0-9 and "_").
The full syntax is: string -replace regex , replacement. Since no replacement is specified, it defaults to the empty string. Effectively, this removes any whitespace before a letter or digit.
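The decoding steps described above can be sketched in Python. Everything specific here is an assumption for illustration: the address "me@site.io" is made up (the real digits are hidden behind the dots in the signature), and encode is a hypothetical helper that only exists to build sample input.

```python
import re


def encode(text):
    # Hypothetical inverse, used only to build sample input for decode().
    return "".join(f"{ord(ch) - 32:02d}" for ch in text)


def decode(digits, count=10):
    # For n = 0..count-1: take the n-th pair of digits, add 32, convert to a character.
    chars = [chr(32 + int(digits[n * 2:n * 2 + 2])) for n in range(count)]
    # PowerShell's [string] cast joins array elements with a space by default.
    joined = " ".join(chars)
    # -replace "\s{1}\b" with no replacement: drop whitespace before a word character.
    return re.sub(r"\s{1}\b", "", joined)


digits = encode("me@site.io")  # made-up address, not the one from the forum signature
print(digits)
print(decode(digits))
```

This also shows why the index is multiplied by 2: each character is stored as a two-digit number, so the n-th pair starts at offset n*2. Spaces in front of "@" and "." survive because those are not word characters.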
For the regex ([AC])(.), it will match any A or C followed by another character. It will capture the A or C into group one, and the following character into group 2.
Simple ( ) creates a capture group, and allows you to refer to the matched substring elsewhere. . is a wildcard, and matches any character except newline (U+000A).
ABCD -> {
Match 1 = {
Text = "AB"
Capture 1 = { Text = "A" }
Capture 2 = { Text = "B" }
}
Match 2 = {
Text = "CD"
Capture 1 = { Text = "C" }
Capture 2 = { Text = "D" }
}
}
The replacement string $2-$1 says to put the text captured into group 2 first, followed by a dash, followed by the text captured into group 1.
$1 refers to the text captured into group 1, and $2 refers to the text captured into group 2. Match 1 will be replaced by "B-A", and Match 2 by "D-C", resulting in "B-AD-C".
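The same substitution can be tried in Python as a rough equivalent; this assumes Python's \1 and \2 play the role of PowerShell's $1 and $2, and the function names are made up for the example.

```python
import re

PATTERN = re.compile(r"([AC])(.)")


def list_matches(text):
    # Each tuple is (whole match, group 1, group 2).
    return [m.group(0, 1, 2) for m in PATTERN.finditer(text)]


def swap_pairs(text):
    # Put group 2 first, then a dash, then group 1.
    return PATTERN.sub(r"\2-\1", text)


print(list_matches("ABCD"))
print(swap_pairs("ABCD"))
```

Without the (.) there is no second group at all, which is why the replacement in the question had nothing to substitute for $2.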
A good source for the syntax and mechanics of regex is http://www.regular-expressions.info/

changing RegEx from 3 digits to 4

I'm not that great at RegEx, and have the following piece of code on my hands:
value.replace(/\s*.*(\d+[,\.]\d+)[^\d]*/m, "$1");
Now it works great at reducing this "\r\n\t\t\t\t& #36;0.05 USD\t\t\t" (please note I've intentionally left a space between the & and # as removing it converts it to a dollar sign on the site) to this "0.05". The issue I have is that if the number is a double digit (10.05 rather than 0.05) the expression removes the digit from the front and still outputs 0.05 rather than 10.05.
From what I can see in the expression, it's hard coded to pick up just 3 digits, so I was wondering if there's a way to amend it to also work in cases where there are 4 digits.
The . after /\s* is matching the first digit if there are 2 or more digits. Remove that and see if it works...
value.replace(/\s*(\d+[,.]\d+)[^\d]/m, "$1");
Given your example of the regex:
/\s*.*(\d+[,.]\d+)[^\d]/m
And the data:
\r\n\t\t\t\t&#36;0.05 USD\t\t\t
\r\n\t\t\t\t&#36;10.05 USD\t\t\t
In the regex, the leading "/" (forward-slash) and the "/" before the "m" delimit the regex and are not part of the matching.
The "\s" in the regex is shorthand for [ \t\r\n\f] which matches whitespace (space, tab, Carriage-return, Line-feed, Form-feed). So, "\s*" will match "\r\n\t\t\t\t"
The "." (dot) in the regex matches any single character (generally any character except "\n").
The "*" following the "." says to match any 0 or more characters. So, together the ".*" matches the "&#36;" (and possibly, additionally, one or more digits... see below).
Next, the "(" in the regex starts the part of the regex that will "capture" part of your data.
The "\d" in the regex will match any 1 digit. In JavaScript "\d" matches only [0-9]; in some other engines (for example .NET) it also matches other Unicode digit characters, like the Eastern Arabic numerals "٠١٢٣٤٥٦٧٨٩".
The "+" following the "\d" says to match any 1 or more numbers (digits).
The "[,.]" in the regex will match one of either a literal "." (dot), or a "," (comma), to match the "decimal" separator.
Another "\d+" to match any 1 or more numbers (digits).
Next, the ")" in the regex closes the part of the regex that will "capture" part of your data.
The "[^\d]" will match any 1 character that is not a number (digit). So, in this case, it will match the " " (space).
The "m" at the end of the regex (following the second "/"): "m" changes the behavior of the "^" and "$" anchors, which are not used in your regex, so the "m" should have no effect. But, if you're using Ruby, "m" changes the behavior of the "." (dot).
Now, the "problem"... the ".*" (before the "("), is in regex terms, "greedy". This means it will match as "early" as possible, and for as "long" as possible. So, if there is more than 1 digit following the ";", then the ".*" will consume some digits.
Note: Using ".*" can cause all sorts of problems, especially with "/m" under Ruby. It's best to avoid using ".*" if possible.
There are 2 ways to fix this.
1) If the part before the number you want to capture is always "&#36;", then specify that in the regex instead of the ".*". So like this:
/\s*&#36;(\d+[,.]\d+)[^\d]/m
or, if it will always be "&#36;" or something very similar to that:
/\s*[^;]+;(\d+[,.]\d+)[^\d]/m
Here, "[^;]+;" means any string of 1 or more characters that does not contain a ";", followed by a ";".
2) If the part before the number you want to capture, which is shown as "&#36;", could be totally different in the data, then you just need to make sure that the part of the regex that is currently ".*" will not match a digit in the last position. So like this:
/\s[^.,]*[^\d](\d+[,.]\d+)[^\d]/m
Here, "[^.,]*[^\d]" means any string of 0 or more characters that does not contain a "." (dot) or a "," (comma) where the last character does not contain a digit.
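The difference between the greedy pattern and the two fixes can be checked with a short sketch. It is written in Python rather than JavaScript, on the assumption that both engines backtrack the same way for these patterns; the sample strings use the "&#36;" prefix from the question, and the helper name captured is made up.

```python
import re

SAMPLES = [
    "\r\n\t\t\t\t&#36;0.05 USD\t\t\t",
    "\r\n\t\t\t\t&#36;10.05 USD\t\t\t",
]

PATTERNS = {
    "greedy": r"\s*.*(\d+[,.]\d+)[^\d]",
    "fix 1": r"\s*[^;]+;(\d+[,.]\d+)[^\d]",
    "fix 2": r"\s[^.,]*[^\d](\d+[,.]\d+)[^\d]",
}


def captured(pattern, text):
    """Return what the capture group grabbed, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


for label, pattern in PATTERNS.items():
    print(label, [captured(pattern, s) for s in SAMPLES])
```

With the greedy pattern, ".*" swallows the leading "1" of "10.05" because the rest of the pattern can still succeed with a single digit before the dot. Both fixes prevent that by making sure the character just before the group cannot be a digit.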
Try this
value.replace( /\s*.(\d+[,.]\d+)[^\d]/m, "$1");
The .* matches greedily and therefore matches as many characters, including digits, as it can, as long as the rest of the pattern can still match.
The rest of the pattern can still match if just one digit is left for the \d+ to match, so you only end up with one digit there.
If the semicolon in your example is always in that position in the strings you wish to match, use it as a marker like this
value.replace(/.*;(\d+[,\.]\d+).*/m, "$1");

C# regular expression to match square brackets

I'm trying to use a regular expression in C# to match a software version number that can contain:
a 2 digit number
a 1 or 2 digit number (not starting in 0)
another 1 or 2 digit number (not starting in 0)
a 1, 2, 3, 4 or 5 digit number (not starting in 0)
an option letter at the end enclosed in square brackets.
Some examples:
10.1.23.26812
83.33.7.5
10.1.23.26812[d]
83.33.7.5[q]
Invalid examples:
10.1.23.26812[
83.33.7.5]
10.1.23.26812[d
83.33.7.5q
I have tried the following:
string rex = @"[0-9][0-9][.][1-9]([0-9])?[.][1-9]([0-9])?[.][1-9]([0-9])?([0-9])?([0-9])?([0-9])?([[][a-zA-Z][]])?";
(note: if I try without the "@" and just escape the square brackets by doing "\[" I get an error saying "Unrecognised escape sequence")
I can get to the point where the version number is validating correctly, but it accepts anything that comes after (for example: "10.1.23.26812thisShouldBeWrong" is being matched as correct).
So my question is: is there a way of using a regular expression to match / check for square brackets in a string or would I need to convert it to a different character (eg: change [a] to *a* and match for *s instead)?
This happens because the regex matches part of the string, and you haven't told it to force the entire string to match. Also, you can simplify your regex a lot (for example, you don't need all those capturing groups):
string rex = @"^[0-9]{2}\.[1-9][0-9]?\.[1-9][0-9]?\.[1-9][0-9]{0,4}(?:\[[a-zA-Z]\])?$";
The ^ and $ are anchors that match the start and end of the string.
The error message you mentioned has to do with the fact that you need to escape the backslash, too, if you don't use a verbatim string. So a literal opening bracket can be matched in a regex as "[[]" or "\\[" or @"\[". The last form is preferred.
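To check the anchored pattern against the examples from the question, here is a small sketch in Python instead of C# (an assumption: the pattern uses only syntax that both engines share, and is_valid_version is a made-up name).

```python
import re

# Anchored pattern from the answer above.
VERSION = re.compile(
    r"^[0-9]{2}\.[1-9][0-9]?\.[1-9][0-9]?\.[1-9][0-9]{0,4}(?:\[[a-zA-Z]\])?$"
)


def is_valid_version(text):
    return VERSION.match(text) is not None


valid = ["10.1.23.26812", "83.33.7.5", "10.1.23.26812[d]", "83.33.7.5[q]"]
invalid = [
    "10.1.23.26812[",
    "83.33.7.5]",
    "10.1.23.26812[d",
    "83.33.7.5q",
    "10.1.23.26812thisShouldBeWrong",
]

print([is_valid_version(v) for v in valid])
print([is_valid_version(v) for v in invalid])
```

Dropping the ^ and $ and using a search instead would accept the last invalid example again, because the version prefix on its own satisfies the unanchored pattern.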
You need to anchor the regex with ^ and $
string rex = @"^[0-9][0-9][.][1-9]([0-9])?[.][1-9]([0-9])?[.][1-9]([0-9])?([0-9])?([0-9])?([0-9])?([[][a-zA-Z][]])?$";
The reason 10.1.23.26812thisShouldBeWrong matches is that the unanchored regex matches the substring 10.1.23.26812.
The regex can be simplified slightly for readability:
string rex = @"^\d{2}\.([1-9]\d?\.){2}[1-9]\d{0,4}(\[[a-zA-Z]\])?$";
In response to TimCross warning - updated regex
string rex = @"^[0-9]{2}\.([1-9][0-9]?\.){2}[1-9][0-9]{0,4}(\[[a-zA-Z]\])?$";

Trying to understand this perl regex bracketed character class?

Below is a script that I was playing with. With the script below it will print a
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.])/ ) {
print $1."\n";
}
BUT if I change it to:
$tmp = "cd abc/test/.";
if ( $tmp =~ /cd ([\w\/\.]+)/ ) {
print $1."\n";
}
then it prints: abc/test/.
From my understanding the + matches one or more of the matching sequence, correct me if i am wrong please. But why in the first case it only matches a? I thought it should match nothing!!
Thank you.
You are correct. In the first case you match a single character from that character class, while in the second you match at least one, with as many as possible after the first one.
First one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
)
"
Second one :
"
cd\ # Match the characters “cd ” literally
( # Match the regular expression below and capture its match into backreference number 1
[\w\/\.] # Match a single character present in the list below
# A word character (letters, digits, etc.)
# A / character
# A . character
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"
In regexes, characters in brackets only count for a match of one character within the given bracket. In other words, [\w\/\.] matches exactly one of the following characters:
An alphanumeric character or "_" (the \w).
A forward slash (the \/--notice that the forward slash needs to be escaped, since it is used as the default marker for the beginning and end of a regex)
A period (the \.--again, escaped since . denotes any character except the newline character).
Because /cd ([\w\/\.])/ only captures one character into $1, it grabs the first character, which in this case is "a".
You are correct in that the + allows for a match of one or more such characters. Since regexes match greedily by default, you should get all of "abc/test/." for $1 in the second match.
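The two cases can be compared side by side with a short sketch, written here in Python on the assumption that the character class behaves the same as in Perl (inside the brackets the slash and dot need no escaping in Python).

```python
import re

TEXT = "cd abc/test/."

# Without a quantifier the class consumes exactly one character.
one = re.search(r"cd ([\w/.])", TEXT).group(1)

# With + it consumes as many class characters as it can (greedy).
many = re.search(r"cd ([\w/.]+)", TEXT).group(1)

print(one)
print(many)
```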
If you haven't already done so, you might want to peruse perldoc perlretut.