Perl, Any match string after a pattern - regex

I'm trying to match if any string exists after a certain pattern.
The pattern is "pattern" 'anything in between' "[after]". case insensitive.
e.g
pattern 1 [after] ABC
pattern 2 [after] 123 abc DEX
pattern 3 [after]
pattern 12345123 [after]
pattern #ASd#98 #_90sd [after] xyz dec
[after] 4 pattern
So the result I would like to obtain is,
pattern 1 [after] ABC
pattern 2 [after] 123 abc DEX
pattern #ASd#98 #_90sd [after] xyz dec
It begins with "pattern" and ends with "[after]"; anything sandwiched between is also accepted.
I'm having difficulty combining the [ ] delimiters with the check that a string exists after them.
The closest I've gotten is
m/pattern/ ../ \[after]/
which ends up matching
pattern 1 [after] ABC
pattern 2 [after] 123 ABC DEX
pattern 3 [after]
pattern 12345123 [after]
pattern #ASd#98 #_90sd [after] xyz dec
But I don't need the 3rd or 4th line, as they don't hold any digits or characters after "[after]".
Thanks

Here is the code I used to test against your input (which I just cat'ed and piped to the script)
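#!/usr/bin/perl
use strict;
use warnings;

# Print every input line that starts with "pattern" and has text after "[after]".
while (<>) {
    print if /^pattern.*\[after\]\s*\S+/i;   # /i: the question asks for a case-insensitive match
}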
So to break it down for you:
/^pattern : match any string that begins with "pattern"
.*\[after\] : match any characters followed by "[after]"
\s*\S+ : match 0 or more whitespace characters followed by one or more non-whitespace characters
That should give you enough to work with to tweak it up as you see fit.
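For readers who want to try the same filter outside Perl, here is a minimal Python sketch of the pattern broken down above. The names PATTERN, SAMPLE_LINES and lines_with_trailing_text are made up for this illustration, and the re.IGNORECASE flag is an addition to cover the "case insensitive" requirement from the question.

```python
import re

# Same pattern as the Perl one-liner; IGNORECASE is an assumption added
# to satisfy the "case insensitive" requirement stated in the question.
PATTERN = re.compile(r"^pattern.*\[after\]\s*\S+", re.IGNORECASE)

# Sample input copied from the question.
SAMPLE_LINES = [
    "pattern 1 [after] ABC",
    "pattern 2 [after] 123 abc DEX",
    "pattern 3 [after]",
    "pattern 12345123 [after]",
    "pattern #ASd#98 #_90sd [after] xyz dec",
    "[after] 4 pattern",
]


def lines_with_trailing_text(lines):
    """Return the lines that have non-whitespace text after "[after]"."""
    return [line for line in lines if PATTERN.search(line)]


if __name__ == "__main__":
    for line in lines_with_trailing_text(SAMPLE_LINES):
        print(line)
```

The \S+ at the end is what rejects lines 3 and 4: once "[after]" has been consumed, at least one non-whitespace character must still follow.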

Code:
$str = 'pattern 2 [after] 123 abc DEX';
if ($str =~ m/^pattern\s+(\d+)\s+\[after\]\s+(.+)/) {
print "$1\t$2\n";
} else {
print "(no match)\n";
}
Output:
2 123 abc DEX
Test this code here.
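The same capture-group approach can be sketched in Python. This is only an illustration of the regex above, and split_record is a hypothetical helper name, not something from the answer.

```python
import re


def split_record(text):
    """Return (number, trailing_text), or None when the line does not match."""
    # Group 1 is the number after "pattern", group 2 is everything after "[after]".
    m = re.match(r"^pattern\s+(\d+)\s+\[after\]\s+(.+)", text)
    return (m.group(1), m.group(2)) if m else None


if __name__ == "__main__":
    print(split_record("pattern 2 [after] 123 abc DEX"))
    print(split_record("pattern 3 [after]"))
```

Note that this variant requires a number between "pattern" and "[after]", so a line such as "pattern #ASd#98 #_90sd [after] xyz dec" would not match it.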

Is this what you want:
/pattern [0-9] \[after\](?= .)/s
or
/pattern [0-9] \[after\] ./s

Related

Regex for number of digits 0 and at least one digit 1

I have a block of text in which I try to find lines that contain any (*) number of digits 0 and at least one (+) digit 1. Explanation:
1234 xxx 00000000000111000000 00000010000100000000 Some text <-- matches
2345 yyy 00000000000000000000 00000000000000000000 Some text <-- does not match
2345 yyy 00000001000000000000 00000000000000000000 Some text <-- matches
3456 zzz 11111111111111111111 11111111111111111111 Some text <-- matches
How to accomplish this? Thanks!
Tried with negative lookahead but failed:
\s+\d+ +[a-zA-Z]+ +(?![0]{20}) +(?![0]{20}) +([0-9a-zA-Z ]+)
You are not matching any digits 0 or 1 after the assertions.
If both columns with the digits 0 or 1 can not be only zeroes, you can use both columns in the assertion:
+\d+ +[a-zA-Z]+ +(?!0{20} +0{20}\b)[01]{20} +[01]{20} +([0-9a-zA-Z ]+)
See a regex101 demo.
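A rough Python sketch of the lookahead idea from this answer, applied to the sample lines. Two things are assumptions made for the sketch and are not in the original regex: the leading " +" is replaced by "^ *" so lines without indentation also match, and {20} is relaxed to + so the sketch does not depend on the column width. LINES, ROW and line_has_a_one are illustrative names.

```python
import re

# Sample rows copied from the question (without the "<-- matches" notes).
LINES = [
    "1234 xxx 00000000000111000000 00000010000100000000 Some text",
    "2345 yyy 00000000000000000000 00000000000000000000 Some text",
    "2345 yyy 00000001000000000000 00000000000000000000 Some text",
    "3456 zzz 11111111111111111111 11111111111111111111 Some text",
]

# The negative lookahead rejects rows where BOTH digit columns are all zeros;
# after it, the two columns are actually consumed with [01]+.
ROW = re.compile(r"^ *\d+ +[a-zA-Z]+ +(?!0+ +0+\b)[01]+ +[01]+ +([0-9a-zA-Z ]+)")


def line_has_a_one(line):
    """True when at least one of the two digit columns contains a 1."""
    return ROW.match(line) is not None


if __name__ == "__main__":
    for line in LINES:
        print(line_has_a_one(line), line)
```

The key point is that a lookahead consumes nothing: the original attempt asserted "not twenty zeros" but never matched the digit columns afterwards, which is why it failed.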
Here is my shorter version of the regex. It tests one line at a time, so apply it to each line of your file, for example with re.MULTILINE as in the code below:
import re
text = '''1234 xxx 00000000000111000000 00000010000100000000 Some text
2345 yyy 00000000000000000000 00000000000000000000 Some text
2345 yyy 00000001000000000000 00000000000000000000 Some text
3456 zzz 11111111111111111111 11111111111111111111 Some text'''
# Requires one run of 1s (optionally surrounded by 0s) in the first digit column.
regex = r'^\d+\s+\w+\s+0*1+0*\s+\d+\s+\w+'
matches = re.findall(regex, text, re.MULTILINE)
for match in matches:
    print(match)
For explanation and details, please check regex101 demo

Specify end of regex group

I am trying to create a regular expression which matches multiple groups, so the values between the groups can be extracted. Each group looks identical.
Let's consider the following example; note that the line breaks are intended:
dog 1
wuff
wuff
cat
123
XYZ
dog 1
wuff
wuff
cat
456
ABC
dog 1
wuff
wuff
cat
789
Thus, with the right regular expression I want to get the output:
123
XYZ
456
ABC
789
On regex101.com I tried:
(?s)(?:dog.*cat)
which matches everything between the first occurrence of dog and the last occurrence of cat.
In addition I tried:
(?s)(?:dog.*(cat){1})
which, with my limited knowledge, should match the first occurrence of cat and then end the group, but it does not.
I appreciate any help.
You may use this regex in MULTILINE mode to capture value after dog.*cat matches:
^dog\b(?:.*\n)+?cat\n(.*(?:\n.*)*?)(?=\ndog|\Z)
Your values are present in capture group #1
RegEx Demo
RegEx Details:
^: Match start line
dog\b: Match word dog with a word boundary
(?:.*\n)+?: Match anything followed by a line break. Repeat this 1+ times (lazy)
cat\n: Match cat followed by a newline
(.*(?:\n.*)*?): Match the multiline values you're interested in and capture them in the first capture group.
(?=\ndog|\Z): Lookahead to assert that we have a dog after line break or end of input ahead of the current position
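To see the regex above in action, here is a small Python sketch run against the sample from the question. TEXT, BLOCK and extract_values are names chosen for this example. One caveat: Python's \Z matches only at the very end of the string, so the sketch assumes the input has no trailing newline.

```python
import re

# Sample input from the question, without a trailing newline.
TEXT = "\n".join([
    "dog 1", "wuff", "wuff", "cat", "123", "XYZ",
    "dog 1", "wuff", "wuff", "cat", "456", "ABC",
    "dog 1", "wuff", "wuff", "cat", "789",
])

# MULTILINE lets ^ match at the start of every line.
BLOCK = re.compile(r"^dog\b(?:.*\n)+?cat\n(.*(?:\n.*)*?)(?=\ndog|\Z)", re.MULTILINE)


def extract_values(text):
    """Return the text between each "cat" line and the next "dog" line."""
    return BLOCK.findall(text)


if __name__ == "__main__":
    for value in extract_values(TEXT):
        print(value)
```

Both quantifiers are lazy on purpose: (?:.*\n)+? stops at the first cat line, and (?:\n.*)*? stops at the first following dog line, which is what the greedy dog.*cat attempt in the question was missing.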

I am looking for a solution to know if a specific word or digit is last in the line, followed by nothing , not even space [duplicate]

set vv "abc 123 456 "
regexp {abc[\s][\d]+[\s][\d]+} $vv
1
regexp {abc[\s][\d]+[\s][\d]+(?! )} $vv
1
The second call should return 0, because the line has an extra space (or extra characters) at the end.
From a list of lines, I am trying to find out which lines have a space at the end and which do not.
Lines can be of any format. For instance, I need to extract lines 1 and 3 but not 2 and 4.
"abc 123 456"
"abc 123 456 abc 999"
"xyz 123 999"
"xyz 123 999 zzz 222"
You could use a repeating pattern matching a space and digits to make sure that the line ends with digits only:
^abc(?: \d+)+$
Regex demo
Or a bit broader match using word characters \w if the lines can be of any format:
^\w+(?: \w+)+$
Regex demo
Not sure about TCL regex, but I think you have to add an anchor:
abc\s\d+\s\d+$
In short: the end of the line ($) must be preceded by a word character (\w).
puts [regexp {\w$} $vv]
If all you need is to find out if a line ends with a space or not, use this for a regex:
\s$
The regular expression would be {^abc.*\d$} -- a digit followed by the end of the string.
% regexp {^abc.*\d$} $vv
0
The glob pattern would be {abc*[0-9]}
% string match {abc*[0-9]} $vv
0
% string match {abc*[0-9] } $vv
1
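The anchoring idea in these answers is not specific to Tcl. As a hedged illustration, here is the same pair of checks in Python; ends_with_digit and has_trailing_space are hypothetical helper names. The sketch uses \Z rather than $ because Python's $ also matches just before a trailing newline.

```python
import re


def ends_with_digit(line):
    """True when the line starts with "abc" and its very last character is a digit."""
    return re.search(r"^abc.*\d\Z", line) is not None


def has_trailing_space(line):
    """True when the very last character of the line is whitespace."""
    return re.search(r"\s\Z", line) is not None


if __name__ == "__main__":
    for candidate in ["abc 123 456", "abc 123 456 "]:
        print(repr(candidate), ends_with_digit(candidate), has_trailing_space(candidate))
```

Without an end anchor the pattern is free to match a prefix of the line, which is why the original regexp call returned 1 for the value with the trailing space.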

How to replace pattern of repeating characters/words only at the beginning of the string?

Note that this question is in the context of Julia, and therefore (to my knowledge) PCRE.
Suppose that you had a string like this:
"sssppaaasspaapppssss"
and you wanted to match, individually, the repeating characters at the end of the string (in the case of our string, the four "s" characters - that is, so that matchall gives ["s","s","s","s"], not ["ssss"]). This is easy:
r"(.)(?=\1*$)"
It's practically trivial (and easily used - replace("hell",r"(.)(?=\1*$)","k") will give "hekk" while replace("hello",r"(.)(?=\1*$)","k") will give "hellk"). And it can be generalised for repeating patterns by switching out the dot for something more complex:
r"(\S+)(?=( \1)*$)"
which will, for instance, independently match the last three instances of "abc" in "abc abc defg abc h abc abc abc".
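For anyone who wants to check the two end-of-string patterns outside Julia, this is a small Python sketch using the standard re module, which accepts the same syntax for these particular regexes. The function names are invented for the example.

```python
import re


def replace_trailing_chars(text, replacement):
    """Replace each character of the repeated run at the end of the string."""
    # (.) matches only if nothing but copies of itself follow up to the end.
    return re.sub(r"(.)(?=\1*$)", replacement, text)


def replace_trailing_words(text, replacement):
    """Replace each word of the repeated, space-separated run at the end."""
    return re.sub(r"(\S+)(?=( \1)*$)", replacement, text)


if __name__ == "__main__":
    print(replace_trailing_chars("hell", "k"))
    print(replace_trailing_chars("hello", "k"))
    print(replace_trailing_words("abc abc defg abc h abc abc abc", "hello"))
```

Each match covers a single character or word; the lookahead only verifies that the rest of the string consists of further copies, so every copy is replaced individually.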
Which then leads to the question... how would you match the repeating character or pattern at the start of the string, instead? Specifically, using regex in the way it's used above.
The obvious approach would be to reverse the direction of the above regex as r"(?<=^\1*)(.)" - but PCRE/Julia doesn't allow lookbehinds to have variable length (except where it's fixed-variable, like (?<=ab|cde)), and thus throws an error. The next thought is to use "\K" as something along the lines of r"^\1*\K(.)", but this only manages to match the first character (presumably because it "advances" after matching it, and no longer matches the caret).
For clarity: I'm seeking a regex that will, for instance, result in
replace("abc abc defg abc h abc abc abc",<regex here>,"hello")
producing
"hello hello defg abc h abc abc abc"
As you can see, it's replacing each "abc" from the start with "hello", but only until the first non-match. The reverse one I provide above does this at the other end of the string:
replace("abc abc defg abc h abc abc abc",r"(\S+)(?=( \1)*$)","hello")
produces
"abc abc defg abc h hello hello hello"
You can use the \G anchor that matches the position after the previous match or at the start of the string. In this way you ensure the contiguity of results from the start of the string to the last occurrence:
\G(\S+)( (?=\1 ))?
demo
or to be able to match until the end of the string:
\G(\S+)( (?=\1(?: |\z)))?
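Python's standard re module has no \G anchor, so the patterns above cannot be run there as written. The sketch below therefore swaps in a different technique to get the same end result: it matches the whole leading run in one go and rebuilds it in a replacement function. LEADING_RUN and replace_leading_run are hypothetical names, and this is an illustration of the desired behaviour, not a port of the \G approach.

```python
import re

# First word, then zero or more " <same word>" repeats; the lookahead makes
# sure each repeat is a whole word and not just a prefix of a longer one.
LEADING_RUN = re.compile(r"^(\S+)(?: \1(?= |$))*")


def replace_leading_run(text, replacement):
    """Replace every word of the repeated run at the start of the string."""
    def swap(match):
        # Words in the run are separated by single spaces.
        count = match.group(0).count(" ") + 1
        return " ".join([replacement] * count)

    return LEADING_RUN.sub(swap, text, count=1)


if __name__ == "__main__":
    print(replace_leading_run("abc abc defg abc h abc abc abc", "hello"))
```

The trade-off is that the work moves out of the regex and into the callback, which avoids both \G and variable-length lookbehind.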
For PCRE-style engines, unfortunately, there is no way to do this without variable-length lookbehind. A pure regex solution is not possible, and no \G anchor trickery can accomplish it.
Here is why the \G anchor won't work. With the anchor, the only guarantee you have is that the previous match succeeded with its forward overlap checked to be equal to the current match.
As a result, you can only globally match up to N-1 of the duplicates from the beginning.
Here is a proof:
Regex:
# (?:\G([a-c]+)(?=\1))
(?:
     \G
     ( [a-c]+ )          # (1)
     (?=
          \1
     )
)
Input:
abcabcabcbca
Output:
** Grp 0 - ( pos 0 , len 3 )
abc
** Grp 1 - ( pos 0 , len 3 )
abc
------------
** Grp 0 - ( pos 3 , len 3 )
abc
** Grp 1 - ( pos 3 , len 3 )
abc
Conclusion:
Even though you know the Nth one is there from the previous lookahead,
the Nth one can't be matched without the condition of the current lookahead.
Sorry, and good luck!
Let me know if you find a pure regex solution.

REGEX Search and keep specific characters

I have hundreds of References in the following format
HCVSAM0123BK
c35UNI0321RS
scruni0321
XXXXXX        ZZZZ      WW
6 characters  4 digits  2 characters
I want to keep the 4 digits after the first 6 characters, but in some cases it doesn't have the last 2 characters
My goal is to get only ZZZZ (the 4 digits)
ex: from HCVSAM0123BK to 0123
Thank You
You can match the following:
^\w{6}(\d+)(\w{2})?$
and the first captured group \1 is what you want.
Demo: http://regex101.com/r/qT0lY8
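As a quick way to try this regex on the references from the question, here is a minimal Python sketch. REFERENCE and extract_digits are illustrative names only.

```python
import re

# Six word characters, the digits we want (group 1), then an optional
# two-character suffix.
REFERENCE = re.compile(r"^\w{6}(\d+)(\w{2})?$")


def extract_digits(ref):
    """Return the digit block after the first six characters, or None."""
    m = REFERENCE.match(ref)
    return m.group(1) if m else None


if __name__ == "__main__":
    for ref in ["HCVSAM0123BK", "c35UNI0321RS", "scruni0321"]:
        print(ref, "->", extract_digits(ref))
```

Because the suffix group is optional, the same pattern handles references with and without the last two characters. If the digit block is always exactly four digits, \d{4} would be a stricter choice than \d+.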
Answer to updated question:
^(?!\d+$)\w{6}(\d+)(\w{2})?$
(?!\d+$) is a negative look ahead, that will fail the match if the line is only digits, and \w stands for [0-9a-zA-Z_].
search : ^.{6}(.{4}).*
and replace with : \1
demo here : http://regex101.com/r/kZ7dS8
output :
0123
0321
0321
using branch reset :
search : (?|.*(\d{4}).*)
and replace with : \1