Regex to match numeric pattern - regex

I am trying to match a specific numeric pattern from the list below.
I need to match only report.20150325 through report.20150331. Please help.
report.20150319
report.20150320
report.20150321
report.20150322
report.20150323
report.20150324
report.20150325
report.20150326
report.20150327
report.20150328
report.20150329
report.20150330
report.20150331

Matching 25 to 31 is simple: use the regex 2[5-9]|3[01]
Here is the complete regex:
(report\.201503(2[5-9]|3[01]))
Explanation of 2[5-9]|3[01]
2 followed by a single character in the range between 5 and 9
OR
3 followed by 0 or 1
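
As a quick check of the alternation explained above, here is a minimal Python sketch. The names PATTERN, names and matched are made up for illustration, and the list of names is generated from the question's sample rather than read from a file.

```python
import re

# Day part: 2[5-9] covers 25-29, 3[01] covers 30-31.
PATTERN = re.compile(r"report\.201503(2[5-9]|3[01])")

# Sample names from the question: report.20150319 .. report.20150331.
names = [f"report.201503{day}" for day in range(19, 32)]

# fullmatch requires the whole string to match, so no ^/$ anchors are needed.
matched = [name for name in names if PATTERN.fullmatch(name)]
print(matched)
```

Only the seven names from report.20150325 to report.20150331 should be printed.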

You could use something like this: /^report\.201503(2[5-9]|3[01])$/gm
With the m flag, the ^ and $ anchors match at line boundaries, so it should match exactly the reports you are after.

A regexp match isn't always the right approach. Here you are asking to match a string followed by a number, so use string and numeric comparisons:
$ awk -F'.' '$1=="report" && ($2>=20150325) && ($2<=20150331)' file
report.20150325
report.20150326
report.20150327
report.20150328
report.20150329
report.20150330
report.20150331
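
The same idea (compare the prefix as a string and the suffix as a number) can be sketched in Python. The helper name in_range and its default bounds are assumptions for illustration; the isdigit() guard is an addition that the awk one-liner does not have.

```python
def in_range(name, low=20150325, high=20150331):
    """Hypothetical helper: True for report.<number> with low <= number <= high."""
    prefix, _, suffix = name.partition(".")
    # isdigit() is an added guard so int() cannot fail on a non-numeric suffix.
    return prefix == "report" and suffix.isdigit() and low <= int(suffix) <= high


# Sample names from the question: report.20150319 .. report.20150331.
names = [f"report.201503{day}" for day in range(19, 32)]
selected = [name for name in names if in_range(name)]
print(selected)
```

Unlike a regex, the bounds here can be changed to any two dates without rewriting a pattern.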

It seems you want to print the lines that fall between two lines matching separate patterns, including the matching lines themselves.
$ sed -n '/^report\.20150325$/,/^report\.20150331$/p' file
report.20150325
report.20150326
report.20150327
report.20150328
report.20150329
report.20150330
report.20150331
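
For comparison, a rough Python sketch of the range behaviour of the sed command above. The function name lines_between is hypothetical. Like the sed range, it depends on the lines appearing in order in the input.

```python
import re


def lines_between(lines, start_re, end_re):
    """Yield lines from the first start_re match through the next end_re match."""
    printing = False
    for line in lines:
        if printing:
            yield line
            # The end pattern is only tested on lines after the start line.
            if re.search(end_re, line):
                printing = False
        elif re.search(start_re, line):
            printing = True
            yield line


# Sample names from the question, in file order.
names = [f"report.201503{day}" for day in range(19, 32)]
ranged = list(lines_between(names, r"^report\.20150325$", r"^report\.20150331$"))
print(ranged)
```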

Related

How to remove first number from a number using regex as i want to implement in pentaho

I am getting a number as
1140302
I have to remove the first 1 from the above number, giving:
140302
Actually there is a wiki page about this:
regex = ^[0-9]([0-9]+)$
replace with = $1
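The regex above matches numeric strings of length 2 or more, and $1 will contain the digits without the first one.
Alternatively, use ^. and replace it with '' (note that this removes the first character whatever it is, not only a digit).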
demo here : http://regex101.com/r/sW8fJ7
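
A small Python sketch of the capture-group replacement described above. The function name drop_first_digit is made up for illustration, and Python's re.sub writes the backreference as \1 where the answer uses $1.

```python
import re


def drop_first_digit(text):
    """Hypothetical helper: remove the leading digit of an all-digit string."""
    # Group 1 captures everything after the first digit; strings that do not
    # match (too short, or containing non-digits) are returned unchanged.
    return re.sub(r"^[0-9]([0-9]+)$", r"\1", text)


print(drop_first_digit("1140302"))  # prints 140302
```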

How can I use regex to ignore strings if they contain a certain string

I am trying to use regex to scan through some log files. In particular, I am looking to pick out lines that meet this format:
IP address or random number "banned.", so for example, "111.111.111.111 banned." or "0320932 banned.", etc.
There should only be 2 groups of characters (the number/IP address and "banned."). There may be more than one space between the words or before them. The string should also not contain "client", "[private]", or "request". For the most part I am just confused about how to go about detecting the groups of characters and avoiding strings that contain those words.
Thanks for any help that you may have to offer
egrep -v '^ *[0-9]+((\.[0-9]+){3})? +banned\.$'
Allows optional leading spaces at the beginning of the line.
Must be followed by an all-digit sequence OR an IP-like address.
Must be followed by at least one space.
Line must end in 'banned.'
Finally, the -v option ensures that only lines NOT matching the regex are returned.
With these constraints you needn't worry about ruling out additional words such as 'client'.
I'm assuming in the following input data lines 1 and 3 should be dropped:
111.111.111.111 banned.
2.2.2.2 wibble
0320932 banned
1434324 wobble
You can drop them with this grep expression:
$ grep -E -v "[0-9.]+ +banned" logfile.log
2.2.2.2 wibble
1434324 wobble
$
This regular expression matches 1 or more digits and periods, followed by 1 or more spaces, followed by the word "banned". Passing -v to grep causes it to display all lines that do not match the regular expression. Add -i to the grep command to make it case-insensitive.
You want a negating match, which looks like:
/^((?!([\d.\s]+banned\.)).)*$/
See it in action: http://regex101.com/r/bY7pK4
Note your example shows a period after banned. If you don't want it, remove \. from the expression.
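Try this regex:
String regex = "(?:\\d+(?:\\.\\d+){3}|\\d+) +banned\\.";
It matches both kinds of string: an IP-like address or a plain number, followed by "banned.". The dots are escaped so that they match only a literal period, not any character.
Example:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BannedDemo {
    public static void main(String[] args) {
        System.out.println("start");
        String src = "657 hi tis is 111.111.111.111 banned. 57 happy i9";
        //String src = "87 working is 0320932 banned. Its ending str 08";
        // Either four dot-separated digit groups or a single run of digits.
        String regex = "(?:\\d+(?:\\.\\d+){3}|\\d+) +banned\\.";
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(src);
        // Print the start offset and text of every match found in src.
        while (matcher.find()) {
            System.out.println(matcher.start() + " : " + matcher.group());
        }
    }
}
Let me know if it does not work for you.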
trying to match IP address or random number "banned."
This egrep should work for you:
egrep '(([0-9]{1,3}\.){3}[0-9]{1,3}|[0-9]+) +banned' logfile
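The following will work for IP addresses whose octets all have exactly three digits (it will not match shorter octets or plain numbers):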
\s*\d\d\d\.\d\d\d\.\d\d\d\.\d\d\d\s*banned\s*
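
To tie the answers above together, here is a hedged Python sketch of the "whole line must be number/IP plus banned." requirement. The names BANNED and is_ban_line are hypothetical, and the pattern is one possible reading of the question, not taken from any single answer.

```python
import re

# Optional leading spaces, then an IP-like address or a plain number,
# one or more spaces, the literal "banned.", and optional trailing whitespace.
BANNED = re.compile(r"\s*(?:\d{1,3}(?:\.\d{1,3}){3}|\d+)\s+banned\.\s*")


def is_ban_line(line):
    """Hypothetical helper: True only if the entire line has the ban format."""
    return BANNED.fullmatch(line) is not None


for sample in ["111.111.111.111 banned.", "0320932 banned.",
               "client 111.111.111.111 banned.", "2.2.2.2 wibble"]:
    print(sample, "->", is_ban_line(sample))
```

Because the whole line has to match, words such as "client", "[private]" or "request" cannot appear anywhere on an accepted line, so they need no separate exclusion.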

How to match the whole expression only, even when there are sub parts that match?

Just trying to write input validation pattern that would allow entry of wild characters. Input field is 9 char max and should follow these rules:
* + 1-8 characters
1-8 chars + *
* + 1-7 chars + *
I've written this regex using the regex documentation and testing it on one of the regex testers.
\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9}
It matches all these correctly
123456789
*1*
*12*
*123*
*1234*
*12345*
*123456*
*1234567*
1234567*
123456*
12345*
1234*
123*
12*
1*
*1
*12
*123
*1234
*12345
*123456
*1234567
*12345678
But it also matches when I don't want it to. For example, it finds 2 matches in *123456789*: the first match is *12345678 and the second one is 9*
I don't want it to find any matches in this case. Either the whole string matches one of the patterns or it does not. How does one do that?
Use anchors that make sure the regex always matches the entire string:
^(\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9})$
Note the parentheses to make sure that the alternation is contained within the group:
^
(
\*[0-9]{1,7}\*
|
[0-9]{1,8}\*
|
\*[0-9]{1,8}
|
[0-9]{9}
)
$
Also, {1} is always superfluous - one match per token is the default.
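
A minimal Python sketch of the difference between an unanchored search and a whole-string match. The names WILDCARD and is_valid are made up for illustration. Python's fullmatch plays the role of the ^(...)$ wrapper.

```python
import re

# Same alternation as above, without the redundant {1} quantifiers.
WILDCARD = re.compile(r"\*[0-9]{1,7}\*|[0-9]{1,8}\*|\*[0-9]{1,8}|[0-9]{9}")


def is_valid(value):
    """Hypothetical helper: True only if the entire input matches one alternative."""
    return WILDCARD.fullmatch(value) is not None


# Unanchored, the pattern finds two partial matches inside the bad input.
partial = WILDCARD.findall("*123456789*")
print(partial)                  # prints ['*12345678', '9*']
print(is_valid("*123456789*"))  # prints False
print(is_valid("*1234567*"))    # prints True
```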
You could use start and end string anchors:
http://www.regular-expressions.info/anchors.html
So, your regex would be something like this (note first and last symbol):
^(\*{1}[0-9]{1,7}\*{1}|[0-9]{1,8}\*{1}|\*{1}[0-9]{1,8}|[0-9]{9})$

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match, because regex patterns match contiguous chunks only.
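
Both options from this answer (capture the two parts, or substitute the date/time part with nothing) can be sketched in Python. The variable names are illustrative only. The PowerShell -replace operator and Python's re.sub behave alike for this pattern.

```python
import re

name = "my_file_name_01012013_111546.xls"

# Option 1: capture the name (group 1) and the extension (group 2), then join them.
match = re.fullmatch(r"(.*)_[0-9]{8}_[0-9]{6}(\..{3})", name)
rebuilt = match.group(1) + match.group(2)

# Option 2: replace the _date_time part with an empty string.
stripped = re.sub(r"_[0-9]{8}_[0-9]{6}", "", name)

print(rebuilt)   # prints my_file_name.xls
print(stripped)  # prints my_file_name.xls
```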
Try this (not tested); it should work for any length of 'my_file_name', any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
Non-regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by the 16 characters described in the lookahead (an underscore, 8 characters, an underscore, and 6 characters).
Once you append .{3}, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better if you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages:
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
As for a regex that would fit your needs, others have posted viable alternatives.
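
The zero-width behaviour described above can be reproduced with a short Python sketch. The variable names are illustrative. Python's lookahead works the same way as the .NET one used by PowerShell for these two patterns.

```python
import re

name = "my_file_name_01012013_111546.xls"

# The lookahead only checks what follows; it consumes nothing.
head = re.match(r".*(?=_.{8}_.{6})", name).group()

# .{3} therefore starts consuming right where .* stopped, at "_01".
extended = re.match(r".*(?=_.{8}_.{6}).{3}", name).group()

print(head)      # prints my_file_name
print(extended)  # prints my_file_name_01
```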

Match a 24 hour formatted time with regex

I am trying to match a 24 hour time with regular expressions using egrep.
Here is my test file, test.txt:
32:23:31
24:30:31
23:70:31
23:61:31
23:10:70
23:10:61
22:17:16
01:17:15
24:15:22
0:17:16
00:17:17
24:30:31
Here is my regular expression:
egrep '(2[0-3]|1[0-9]|0[0-9]|[^0-9][0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])' test.txt
Resulting matches:
23:10:70
23:10:61
22:17:16
01:17:15
00:17:17
Any idea why it is matching 23:10:70 and 23:10:61?
It's actually matching 23:10:7 and 23:10:6. Since you are not using the end-of-line metacharacter $ at the end of the pattern, the match is allowed to stop there, and egrep prints the whole line regardless of what follows.
egrep '^(2[0-3]|1[0-9]|0[0-9]|[^0-9][0-9]):([0-5][0-9]|[0-9]):([0-5][0-9]|[0-9])$' test.txt
In other words, you should only allow a single [0-9] for the seconds if the matched digit is the last one on the line, that is, if it is followed by $.
Another option is to force the last field to be zero-padded when it is less than 10, i.e., instead of [0-9] use 0[0-9]. This will match 23:10:07, but not 23:10:7. It's the same as what you already have for the hours part.
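
A minimal Python sketch of the zero-padded variant, applied to the test file from the question. The name TIME_24H is hypothetical, the pattern assumes every field must have two digits, and fullmatch stands in for the ^ and $ anchors.

```python
import re

# Hours 00-23, minutes 00-59, seconds 00-59, each zero-padded to two digits.
TIME_24H = re.compile(r"(2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]")

# Lines of test.txt from the question.
lines = ["32:23:31", "24:30:31", "23:70:31", "23:61:31", "23:10:70", "23:10:61",
         "22:17:16", "01:17:15", "24:15:22", "0:17:16", "00:17:17", "24:30:31"]

valid = [line for line in lines if TIME_24H.fullmatch(line)]
print(valid)  # prints ['22:17:16', '01:17:15', '00:17:17']
```

With this stricter form, 0:17:16 is rejected because the hour is not zero-padded.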