Match a 24-hour formatted time with regex

I am trying to match a 24 hour time with regular expressions using egrep.
Here is my test file, test.txt:
32:23:31
24:30:31
23:70:31
23:61:31
23:10:70
23:10:61
22:17:16
01:17:15
24:15:22
0:17:16
00:17:17
24:30:31
Here is my regular expression:
egrep '(2[0-3]|1[0-9]|0[0-9]|[^0-9][0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])' test.txt
Resulting matches:
23:10:70
23:10:61
22:17:16
01:17:15
00:17:17
Any idea why it is matching 23:10:70 and 23:10:61?

It's actually matching 23:10:7 and 23:10:6. Since the pattern is not anchored with the end-of-line metacharacter $, egrep reports any line that contains a match, no matter what follows it.
egrep '^(2[0-3]|1[0-9]|0[0-9]|[^0-9][0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])$' test.txt
In other words, you should only allow [0-9] at the end of the string, if the matched digit is the last one on the line, that is, if it is followed by $.
Another option is to force the seconds to be 0-padded when they are less than 10, i.e., use 0[0-9] instead of [0-9]. This will match 23:10:07, but not 23:10:7. It's the same thing you already do for the hours part.
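To see the difference between an unanchored search and a whole-line match, here is a minimal sketch in Python rather than egrep (Python is used purely for illustration, and the names ORIGINAL, STRICT, unanchored_hits and valid_times are made up for this example). It assumes the strict variant should require zero-padded fields, as in the second option above.

```python
import re

# The pattern from the question, unchanged and unanchored.
ORIGINAL = re.compile(
    r"(2[0-3]|1[0-9]|0[0-9]|[^0-9][0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])"
)

# Assumed stricter variant: every field zero-padded, hours 00-23.
STRICT = re.compile(r"(?:[01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]")

LINES = [
    "32:23:31", "24:30:31", "23:70:31", "23:61:31", "23:10:70", "23:10:61",
    "22:17:16", "01:17:15", "24:15:22", "0:17:16", "00:17:17",
]


def unanchored_hits(lines):
    """Return (line, matched text) pairs, the way an unanchored grep sees them."""
    hits = []
    for line in lines:
        m = ORIGINAL.search(line)
        if m:
            hits.append((line, m.group(0)))
    return hits


def valid_times(lines):
    """Keep only lines that are a valid time from start to end."""
    return [line for line in lines if STRICT.fullmatch(line)]


if __name__ == "__main__":
    for line, matched in unanchored_hits(LINES):
        print(f"{line} matched on {matched!r}")
    print(valid_times(LINES))
```

The first loop shows which part of each line actually matched, which is usually the quickest way to debug a pattern that seems to accept too much.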

Related

Extracting Number from Log File

I'm trying to extract a number from a log file that outputs lines of text like this:
1/11/2016 3:26:12 AM 1/11/2016 3:27:00 AM 45.6 A
The output from the line is 45.6 A
However, my Regex code is returning the 12 A from 3:26:12 AM. I need it to completely ignore the time number and just output the 45.6 A.
Here's my Regex code:
$regex = '\d+(?:\.\d+)?(?=\s+A)'
You just forgot to anchor the lookahead at the end of the string:
\d+(?:\.\d+)?(?=\s+A$)
                    ^
The \d+(?:\.\d+)? will match one or more digits optionally followed with a . followed with one or more digits (a float value), and the (?=\s+A$) lookahead will require one or more whitespace characters with A right at the end of the string to appear after the float value.
$s = '1/11/2016 3:26:12 AM 1/11/2016 3:27:00 AM 45.6 A'
$rx = '\d+(?:\.\d+)?(?=\s+A$)'
$result = [regex]::Match($s, $rx, 'RightToLeft')
if ($result.Success) { $result.Value }
You can use word boundary (\b) to match only A, not AM:
\d+(?:\.\d+)?(?=\s+A\b)
DEMO: https://regex101.com/r/pA7jK2/1
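The three lookahead variants discussed so far can be compared side by side. This is a small Python sketch (not the PowerShell from the question), and the constant names and the first_match helper are invented for the illustration.

```python
import re

LINE = "1/11/2016 3:26:12 AM 1/11/2016 3:27:00 AM 45.6 A"

# Pattern from the question: the lookahead is satisfied by the "A" of "AM".
UNANCHORED = r"\d+(?:\.\d+)?(?=\s+A)"
# Lookahead anchored at the end of the string.
ANCHORED = r"\d+(?:\.\d+)?(?=\s+A$)"
# Lookahead requiring a word boundary after "A", so "AM" is rejected.
BOUNDARY = r"\d+(?:\.\d+)?(?=\s+A\b)"


def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


if __name__ == "__main__":
    for name, pattern in [("unanchored", UNANCHORED),
                          ("anchored", ANCHORED),
                          ("boundary", BOUNDARY)]:
        print(name, first_match(pattern, LINE))
```

The anchored form only finds a value at the very end of the line, while the word-boundary form would also find a value followed by a standalone A elsewhere in the line.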
If you just need to find the last number followed by an A, try this:
(\d+\.\d\sA)

Regular Expressions with multiple dots in Linux bash shell give strange results

I tried to match a substring containing a lot of dots, and it failed in the Debian Linux shell. I wrote a simple script to see how dots are processed and found that the results break all the rules. I retried it in Bash, Perl, and the Ubuntu shell, and the results are all the same. The script and output are below.
#!/bin/sh
my_regex=u2734523abcABCB.C123.ABC.abc.1..2.34.2
Numbering=123456789_123456789_123456789_123456789
echo "$my_regex"
echo "$Numbering"
echo `expr index "$my_regex" '(ABC)'`
echo `expr index "$my_regex" '(ABC\.)'`
echo `expr index "$my_regex" '(\.\.)'`
echo `expr index "$my_regex" '(.)'`
echo `expr index "$my_regex" '(\.1)'`
Output:
u2734523abcABCB.C123.ABC.abc.1..2.34.2
123456789_123456789_123456789_123456789
12
12
16
16
16
The first regex should match ABC and return number-position of first character. It works.
The second one should find ABC followed by dot, it looks like it ignores dot.
The third one should find two dots but it finds first occurrence of one dot. Ignores again?
The fourth should find first any character, but it still finds the dot on position 16.
The fifth should find a dot followed by 1, it still finds the first occurrence of dot.
It seems like neither \ nor [ ] (I tried it too), nor the dot itself works as in common regular expression.
Why?
expr index has nothing to do with regular expressions.
expr index STRING CHARS outputs the index of the first occurrence of any of the CHARS in STRING. So your first search for '(ABC)' finds the first left parenthesis, A, B, C, or right parenthesis in your string. The first one is the A at position 12.
'(ABC\.)' does the same thing, except it's now also looking for a backslash or period. But the A is still the first match at position 12.
'(\.\.)' looks only for a parenthesis, backslash, or period. The first match is the period at position 16.
Likewise, all your other searches find the period at position 16, because none of the other characters you're listing come before that.
(On a side note, it's silly to capture the output with backticks only to immediately echo it. You'd get the same result by omitting the echo and backticks.)
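The behaviour described above is easy to reproduce. Here is a minimal Python emulation of expr index (the function name expr_index is made up for this sketch; it is not part of any library), applied to the same arguments as in the question.

```python
def expr_index(string, chars):
    """Emulate `expr index STRING CHARS`.

    Returns the 1-based position of the first character of `string`
    that appears anywhere in `chars`, or 0 if there is none.
    No regular expression is involved.
    """
    for position, ch in enumerate(string, start=1):
        if ch in chars:
            return position
    return 0


S = "u2734523abcABCB.C123.ABC.abc.1..2.34.2"

if __name__ == "__main__":
    # The same arguments as in the question's script.
    for chars in ["(ABC)", "(ABC\\.)", "(\\.\\.)", "(.)", "(\\.1)"]:
        print(chars, expr_index(S, chars))
```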
You are incorrectly using index function of expr. As per man expr:
index STRING CHARS - index in STRING where any CHARS is found, or 0
So there are two things to note here:
index doesn't do any regex matching
index finds the position of the first character in STRING that is any one of CHARS
If you want regex matching then use:
STRING : REGEXP
like this:
my_regex='u2734523abcABCB.C123.ABC.abc.1..2.34.2'
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*ABC'
24
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*ABC\.'
25
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*\.\.'
32
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*.'
38
expr u2734523abcABCB.C123.ABC.abc.1..2.34.2 : '.*\.1'
30
The number after each expr command is actually the length of the match.
There is no need to use echo here as expr anyway writes output on stdout.
You might want to take a look at BASH built-in =~ operator for regex matching.
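For comparison, here is a rough Python sketch of what expr STRING : REGEXP reports for patterns without capture groups: the length of a match anchored at the start of the string. The helper names expr_match_length and first_position are hypothetical. expr uses POSIX basic regular expressions while Python's re module is a different dialect, but the simple patterns used here behave the same in both.

```python
import re

S = "u2734523abcABCB.C123.ABC.abc.1..2.34.2"


def expr_match_length(string, pattern):
    """Length of the match of `pattern` anchored at the start, or 0."""
    m = re.match(pattern, string)
    return len(m.group(0)) if m else 0


def first_position(string, pattern):
    """1-based position where `pattern` first matches, or 0 if it never does.

    This is what the question was actually trying to compute.
    """
    m = re.search(pattern, string)
    return m.start() + 1 if m else 0


if __name__ == "__main__":
    for pattern in [r".*ABC", r".*ABC\.", r".*\.\.", r".*.", r".*\.1"]:
        print(pattern, expr_match_length(S, pattern))
    for pattern in [r"ABC", r"ABC\.", r"\.\.", r"\.1"]:
        print(pattern, first_position(S, pattern))
```

Because .* is greedy, the lengths correspond to the last occurrence of each pattern, not the first. Searching without the leading .* gives the position of the first occurrence instead.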

Trim end of string

I'm having trouble trimming off some characters at the end of a string. The string usually looks like:
C:\blah1\blah2
But sometimes it looks like:
C:\blah1\blah2.extra
I need to extract out the string 'blah2'. Most of the time, that's easy with a substring command. But on the rare occasions when the '.extra' portion is present, I need to first trim that part off.
The thing is, '.extra' always begins with a dot, but then is followed by various combinations of letters with various lengths. So wildcards will be necessary. Essentially, I need to script, "If the string contains a dot, trim off the dot and anything following it."
$string.replace(".*","") doesn't work. Nor does $string.replace(".\*",""). Nor does $string.replace(".[A-Z]","").
Also, I can't get at it from the beginning of the string either. 'blah1' is unknown and of various lengths. I have to get at 'blah2' from the end of the string.
Assuming that the string is always a path to a file with or without an extension (such as ".extra"), you can use Path.GetFileNameWithoutExtension():
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2")
blah2
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("C:\blah1\blah2.extra")
blah2
The path doesn't even have to be rooted:
PS C:\> [System.IO.Path]::GetFileNameWithoutExtension("blah1\blah2.extra")
blah2
If you want to implement similar functionality on your own, that should be fairly simple as well: use String.LastIndexOf() to find the last \ in the string and use that as your starting argument for Substring():
function Extract-Name {
    param($NameString)
    # Extract part after the last occurrence of \
    if($NameString -like '*\*') {
        $NameString = $NameString.Substring($NameString.LastIndexOf('\') + 1)
    }
    # Remove anything after a potential .
    # (assign the result; otherwise both the trimmed and the untrimmed name are output)
    if($NameString -like '*.*') {
        $NameString = $NameString.Remove($NameString.IndexOf("."))
    }
    $NameString
}
And you'll see similar results:
PS C:\> Extract-Name "C:\blah1\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2.extra"
blah2
PS C:\> Extract-Name "C:\blah124323\blah2"
blah2
PS C:\> Extract-Name "abc124323\blah2"
blah2
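The same two steps (keep what follows the last backslash, then cut at the first dot) can be sketched in Python for comparison. The function name extract_name is made up for this example. PureWindowsPath(...).stem from the standard library is shown next to it; like GetFileNameWithoutExtension, it strips only the last extension, so the two differ for names containing several dots.

```python
from pathlib import PureWindowsPath


def extract_name(name_string):
    """Keep the part after the last backslash, then cut at the first dot."""
    last_part = name_string.rsplit("\\", 1)[-1]
    return last_part.split(".", 1)[0]


if __name__ == "__main__":
    samples = [
        r"C:\blah1\blah2.extra",
        r"C:\blah124323\blah2",
        r"abc124323\blah2",
        r"C:\blah1\blah2.tar.gz",
    ]
    for path in samples:
        # The two columns differ only for the multi-dot name.
        print(path, extract_name(path), PureWindowsPath(path).stem)
```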
As the other posters have said, you can use special file name manipulators for this. If you'd like to do it with regular expressions, note that the .Replace() method does literal text replacement only; the -replace operator is the one that takes a regex:
$string -replace "\..*",""
The \..* regex matches a dot (\.) and then any string of characters (.*).
Here is why each of the attempted patterns would not work even when treated as a regex:
$string.replace(".*","")
The reason this doesn't work is that . and * are both special characters in regular expressions: . is a wildcard character that matches any character, and * means "match the previous character zero or more times." So .* means "any string of characters."
$string.replace(".\*","")
In this instance, you're escaping the * character, meaning that the regex treats it literally, so the regex matches any single character (.) followed by a star (\*).
$string.replace(".[A-Z]","")
In this case, the regex will match any character (.) followed by any single capital letter ([A-Z]).
If the strings are actual paths using Get-Item would be another option:
$path = 'C:\blah1\blah2.something'
(Get-Item $path).BaseName
The Replace() method can't be used here, because it performs literal substitution and doesn't support wildcards or regular expressions. The -replace operator does support regular expressions.
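The distinction between literal and regex replacement exists in most languages. As an illustration only, here is how it looks in Python, where str.replace is literal and re.sub takes a regex; the helper name trim_from_first_dot is an assumption of this sketch.

```python
import re

PATH = r"C:\blah1\blah2.extra"


def trim_from_first_dot(text):
    """Remove the first literal dot and everything after it (regex replace)."""
    return re.sub(r"\..*", "", text)


if __name__ == "__main__":
    # Literal replacement: the two-character text ".*" does not occur,
    # so the string comes back unchanged.
    print(PATH.replace(".*", ""))
    # Regex replacement with the escaped dot trims the extension.
    print(trim_from_first_dot(PATH))
    # The attempted patterns, read as regexes:
    print(repr(re.sub(r".*", "", PATH)))      # matches everything
    print(re.sub(r".\*", "", PATH))           # needs a literal star: no match
    print(re.sub(r".[A-Z]", "", PATH))        # needs a char before a capital: no match
```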

Perl regex weird behavior: works with smaller strings, fails with longer repetitions of the smaller ones

Here is a regex in Perl that I use to identify strings that match this pattern: any number of occurrences of any character except single quote ' or backslash \; only escaped occurrences of ' or \ are allowed (\' and \\ respectively); and finally the string has to end with a (non-escaped) single quote '.
foo.pl
#!/usr/bin/perl
my $line;
my $matchString;
Main();
sub Main() {
    foreach $line( <STDIN> ) {
        $line =~ m/(^(([^\\\']*?(\\\')*?(\\\\)*?)*?\'))/g;
        $matchString = $1;
        print "matchString:$matchString\n"
    }
}
It seems to work fine for strings like :
./foo.pl
asasas'
sdsdsdsdsdsd'
\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
matchString:asasas'
matchString:sdsdsdsdsdsd'
matchString:\\\'sdsdsdsdsd\\\'sdsdsdsd\\'
matchString:\'sddsd\\sdsdsds\\\\\\sdsdsdsd\\\\\\'
Then I create a file with the following recurring pattern :
AAAAAAAAAAAAAAAAAAAAAAAAAAAAA\\BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB\'CCCCCCCCCCCCCCCCCCCCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG
A string built by repeating this pattern one or more times and adding a single quote ' at the end should match the regex. I created a file called zz3 with 16 repetitions of the above pattern. I then created a file called ZZ6 with 18 repetitions of zz3, and another one called ZZ7 with the contents of ZZ6 plus one additional instance of zz3, hence 19 repetitions of zz3.
Adding a single quote at the end of zz3 results in a match. Adding a single quote at the end of ZZ6 also results in a match, as expected.
Now here is the tough part: adding a single quote at the end of ZZ7 does not result in a match!
Here is a link to the 3 files:
https://drive.google.com/file/d/0BzIKyGguqkWvOWdKaElGRjhGdjg/view?usp=sharing
The perl version I am using is v5.16.3 on FreeBSD, but I tried various versions on both FreeBSD and Linux with identical results. It seems to me that either perl has a problem when the size goes from 34274 bytes (ZZ6) to 36178 bytes (ZZ7), or I am missing something badly.
Your regular expression leads to catastrophic backtracking because you have nested quantifiers.
If you change it to
(^(([^\\\']*+(\\')*+(\\\\)*+)*?'))
(using possessive quantifiers to avoid backtracking), it should work.
I just would like to note that the whole problem appeared in an effort to re-engineer an old in-house program to parse escaped PostgreSQL bytea values.
Following this discussion, it is clear that perl cannot repeat a non-dot (.) subpattern more than 32766 (= 32K - 2) times.
The solution is to mask the \\ and \' sequences with characters that are certain not to appear in the input, such as Device Control 1 (\x11) and Device Control 2 (\x12) (shown as ^Q and ^R in vi, respectively):
$dataField =~ s/\\\\/\x11/g;
$dataField =~ s/\\\'/\x12/g;
then try to match non greedily any input till the first single quote.
$dataField =~ m/(^.*?\')/s;
$matchString = $1;
and finally substitute the above Ctrl chars back to their initial values
$matchString =~ s/\x11/\\\\/g;
$matchString =~ s/\x12/\\\'/g;
This is very fast. Another solution would be to parse till the first single quote and count the number of \'s. If it is even then we have found our last non escaped single quote in the text so we have found our desired match, otherwise the single quote is an escape one and thus considered part of the text, so we keep this value and iterate to the next single quote and repeat the same logic, by concatenating the value to the previous value. This tends to be very slow for big files with many intermediate escaped single quotes.
Perl regexes seem to be much faster than hand-written Perl code.
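Both approaches described above can be sketched in a few lines. The following is a Python illustration, not the original Perl; the function names and the sample data are invented for the example, and it assumes, as the answer does, that \x11 and \x12 never occur in the input.

```python
import re

MASK_BACKSLASH = "\x11"  # stands in for an escaped backslash (\\)
MASK_QUOTE = "\x12"      # stands in for an escaped quote (\')


def match_by_masking(text):
    """Mask escapes, match up to the first quote, then restore the escapes."""
    # Order matters: escaped backslashes must be masked first.
    masked = text.replace("\\\\", MASK_BACKSLASH).replace("\\'", MASK_QUOTE)
    m = re.match(r".*?'", masked, re.DOTALL)
    if m is None:
        return None
    return m.group(0).replace(MASK_BACKSLASH, "\\\\").replace(MASK_QUOTE, "\\'")


def match_by_scanning(text):
    """Walk the text once, skipping whatever character follows a backslash."""
    i = 0
    while i < len(text):
        if text[i] == "\\":
            i += 2          # skip the backslash and the character it escapes
        elif text[i] == "'":
            return text[: i + 1]
        else:
            i += 1
    return None


# A shortened version of the repeating unit from the question.
UNIT = r"AAAA\\BBBB\'CCCC\\sdsdsd\\\\\' ZZZZ\'GGGGGG"

if __name__ == "__main__":
    long_input = UNIT * 5000 + "'" + "trailing text"
    result = match_by_masking(long_input)
    print(len(result), result == match_by_scanning(long_input))
```

Neither function relies on a repeated group, so the input length does not run into a repetition limit.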

Regex to match numeric pattern

I am trying to match specific numeric pattern from the below list.
My requirement is to match only report.20150325 to report.20150331. Please help.
report.20150319
report.20150320
report.20150321
report.20150322
report.20150323
report.20150324
report.20150325
report.20150326
report.20150327
report.20150328
report.20150329
report.20150330
report.20150331
It's very simple: to match 25 to 31, use the regex 2[5-9]|3[01]
Here is complete regex
(report\.201503(2[5-9]|3[01]))
Explanation of 2[5-9]|3[01]
2 followed by a single character in the range between 5 and 9
OR
3 followed by 0 or 1
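As a quick check of the alternation, here is a small Python sketch that applies the pattern to the file names from the question. The names REPORT, NAMES and select_reports are made up for this example, and fullmatch is used so that the whole name has to match.

```python
import re

# Day part: 25-29 via 2[5-9], 30-31 via 3[01].
REPORT = re.compile(r"report\.201503(?:2[5-9]|3[01])")

# The list from the question: report.20150319 .. report.20150331
NAMES = [f"report.201503{day}" for day in range(19, 32)]


def select_reports(names):
    """Keep only the names whose date falls in 20150325..20150331."""
    return [name for name in names if REPORT.fullmatch(name)]


if __name__ == "__main__":
    for name in select_reports(NAMES):
        print(name)
```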
You could use something like this: /^report\.201503(2[5-9]|3[01])$/gm
It should match the reports you are after.
A regexp match isn't always the right approach. Here you are asking to match a string followed by a number so use a string and numeric comparisons:
$ awk -F'.' '$1=="report" && ($2>=20150325) && ($2<=20150331)' file
report.20150325
report.20150326
report.20150327
report.20150328
report.20150329
report.20150330
report.20150331
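The same string-plus-number comparison can be written in Python for illustration. The function name in_range and its default bounds are assumptions of this sketch. Unlike awk, it simply rejects a second field that is not made of ASCII digits instead of falling back to a string comparison.

```python
def in_range(name, low=20150325, high=20150331):
    """True if `name` looks like report.<number> with low <= number <= high."""
    fields = name.split(".")
    if len(fields) < 2 or fields[0] != "report":
        return False
    number = fields[1]
    # Only accept plain ASCII digits before converting.
    if not (number.isascii() and number.isdigit()):
        return False
    return low <= int(number) <= high


NAMES = [f"report.201503{day}" for day in range(19, 32)]

if __name__ == "__main__":
    for name in NAMES:
        if in_range(name):
            print(name)
```

With numeric bounds, changing the range later (for example to span a month boundary) only means changing two numbers, whereas the regex would have to be rewritten.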
It seems like you want to print the lines that fall between two lines matching separate patterns (including the matching lines themselves).
$ sed -n '/^report\.20150325$/,/^report\.20150331$/p' file
report.20150325
report.20150326
report.20150327
report.20150328
report.20150329
report.20150330
report.20150331