regex match after word - regex

I would like to know how to capture text only when the line begins with a certain string, without capturing that leading string itself.
For example, if I have the text:
BEGIN_TAG: Text To Capture
WRONG_TAG: Text Not to Capture
I want to capture:
Text To Capture
from the line that begins with BEGIN_TAG:, not the line that begins with WRONG_TAG:
I know how to select the line that begins with the desired text: ^BEGIN_TAG:\W?(.*)
but this also selects the text "BEGIN_TAG:". I don't want that; I only want the text after "BEGIN_TAG:".
I am using PCRE regex

Instead of a positive lookbehind that does not allow unknown width patterns, you may use a match reset operator \K:
^BEGIN_TAG:\W?\K.*
Details:
^ - in Sublime, start of a line
BEGIN_TAG: - a string of literal chars
\W? - 1 or 0 non-word chars
\K - the match reset operator that discards all text matched so far
.* - any 0+ chars other than line break chars (the rest of the line); these are the only chars kept in the matched text.
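Python's built-in re module does not support \K, so as a rough sketch the same text can be taken from a capture group instead. The sample text is the one from the question; the variable names are only illustrative.

```python
import re

# Sample text from the question
text = "BEGIN_TAG: Text To Capture\nWRONG_TAG: Text Not to Capture"

# re.MULTILINE makes ^ match at the start of every line.
# Group 1 plays the role of the text kept after \K in the PCRE pattern.
pattern = re.compile(r"^BEGIN_TAG:\W?(.*)", re.MULTILINE)

captured = [m.group(1) for m in pattern.finditer(text)]
print(captured)  # ['Text To Capture']
```

The whole match (m.group(0)) still includes the tag; only group 1 leaves it out.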

You can use lookbehind. Then, the text in the lookbehind group isn't part of the whole match. You can see it as an anchor like \b, ^, etc.
You then get:
(?<=^BEGIN_TAG:\W)(\w.*)$
Explained:
(?<= # Positive lookbehind group
^ # Start of line / string
BEGIN_TAG: # Literal
\W # A non-word character ([^a-zA-Z0-9_])
)
( # First and only matching group (probably not needed)
\w # A word character ([a-zA-Z0-9_])
.* # Any character, any number of times
)
$ # End of line / string
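For comparison, here is a small sketch of the same lookbehind pattern in Python. It assumes the re module's rule that a lookbehind must have a fixed width, which this one does (the tag plus one non-word character). The variable names are made up for the example.

```python
import re

# Sample text from the question
text = "BEGIN_TAG: Text To Capture\nWRONG_TAG: Text Not to Capture"

# The lookbehind is fixed-width, so Python's re module accepts it.
# re.MULTILINE lets ^ and $ work per line.
pattern = re.compile(r"(?<=^BEGIN_TAG:\W)(\w.*)$", re.MULTILINE)

match = pattern.search(text)
# The lookbehind text is not part of the match, so group(0) is already clean
after_tag = match.group(0) if match else None
print(after_tag)  # Text To Capture
```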

Related

Parenthesis content after a specific word

I'm trying to get UNIX group names using a regex (can't use groups because I can only get the process uid, so I'm using id <process_id> to get groups)
input looks like this
uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n
I'd like to capture kawsay, sudo, video and gpio
The only pieces I've got are:
a positive lookbehind to start capturing after groups: /(?<=groups)/
capture the parenthesis content: /\((\w+)\)/
Using PCRE's \G you may use this regex:
(?:\bgroups=|(?<!^)\G)[^(]*\(([^)]+)\)
Your intended matches are available in capture group #1
RegEx Details:
(?:: Start non-capture group
\bgroups=: Match word groups followed by a =
|: OR
(?<!^)\G: Start from end position of the previous match
): End non-capture group
[^(]*: Match 0 or more of any character that is not (
\(: Match opening (
([^)]+): Use capture group #1 to match 1+ of any non-) characters
\): Match closing )
You can use
(?:\G(?!\A)\),|\bgroups=)\d+\(\K\w+
Details:
(?:\G(?!\A)\),|\bgroups=) - either of
\G(?!\A)\), - end of the previous match (\G operator matches either start of string or end of the previous match, so the (?!\A) is necessary to exclude the start of string location) and then ), substring
| - or
\bgroups= - a whole word groups (\b is a word boundary) and then a = char
\d+\( - one or more digits and a (
\K - match reset operator that makes the regex engine "forget" the text matched so far
\w+ - one or more word chars.
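Python's re module has no \G anchor. A rough equivalent, sketched below, is to locate groups= once and then call a compiled pattern's match() at successive positions, which only succeeds when each item starts exactly where the previous one ended. The function name group_names is made up for this example.

```python
import re

def group_names(id_output):
    """Return the names listed after groups= in the output of `id`."""
    start = re.search(r"\bgroups=", id_output)
    if start is None:
        return []
    # One list item: digits, "(", name, ")", and an optional trailing comma
    item = re.compile(r"\d+\((\w+)\),?")
    names = []
    pos = start.end()
    while True:
        # Pattern.match(string, pos) has to match exactly at pos,
        # which mimics what \G does in PCRE
        m = item.match(id_output, pos)
        if m is None:
            break
        names.append(m.group(1))
        pos = m.end()
    return names

line = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
print(group_names(line))  # ['kawsay', 'sudo', 'video', 'gpio']
```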
Here are two more ways to extract the strings of interest. Both return matches and do not employ capture groups. My preference is for the second one.
str = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"
Match substrings between parentheses that are not followed later in the string by "groups="
Match the regular expression
rgx = /(?<=\()(?!.*\bgroups=).*?(?=\))/
str.scan(rgx)
#=> ["kawsay", "sudo", "video", "gpio"]
See String#scan.
This expression can be broken down as follows.
(?<=\() # positive lookbehind asserts previous character is '('
(?! # begin negative lookahead
.* # match zero or more characters
\bgroups= # match 'groups=' preceded by a word boundary
) # end negative lookahead
.*? # match zero or more characters lazily
(?=\)) # positive lookahead asserts next character is ')'
This may not be as efficient as expressions that employ \G (because of the need to determine if 'groups=' appears in the string after each left parenthesis), but that may not matter.
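The same expression also works unchanged with Python's re module. The snippet below is only a Python rendering of the Ruby call above, with illustrative variable names.

```python
import re

line = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"

# After each "(", the negative lookahead rejects the position
# if "groups=" still appears later on the same line
rgx = re.compile(r"(?<=\()(?!.*\bgroups=).*?(?=\))")

names = rgx.findall(line)
print(names)  # ['kawsay', 'sudo', 'video', 'gpio']
```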
Extract the portion of the string following "groups=" and then match substrings between parentheses
First, obtain the portion of the string that follows "groups=":
rgx1 = /(?<=\bgroups=).*/
s = str[rgx1]
#=> "1001(kawsay),27(sudo),44(video),997(gpio)\n"
See String#[].
Then match the regular expression
rgx2 = /(?<=\()[^\)\r\n]+/
against s:
s.scan(rgx2)
#=> ["kawsay", "sudo", "video", "gpio"]
The regular expression rgx1 can be broken down as follows:
(?<=\bgroups=) # Positive lookbehind asserts that the current
# position in the string is preceded by 'groups=',
# which is preceded by a word boundary
.* # match zero or more characters other than line
# terminators (to end of line)
rgx2 can be broken down as follows:
(?<=\() # Use a positive lookbehind to assert that the
# following character is preceded by '('
[^\)\r\n]+ # Match one or more characters other than
# ')', '\r' and '\n'
Note:
The operations can of course be chained: str[/(?<=\bgroups=).*/].scan(/(?<=\()[^\)\r\n]+/); and
rgx2 could alternatively be written /(?<=\().+?(?=\))/, where ? makes the match of one or more characters lazy and (?=\)) is a positive lookahead that asserts that the match is followed by a right parenthesis.
This would probably be the fastest solution of those offered and certainly the easiest to test.
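A Python sketch of the same two-step idea, in case it helps: first keep the part of the line after groups=, then collect what follows each left parenthesis. The variable names are illustrative.

```python
import re

line = "uid=1001(kawsay) gid=1001(kawsay) groups=1001(kawsay),27(sudo),44(video),997(gpio)\n"

# Step 1: keep only what follows "groups=" on that line
tail = re.search(r"(?<=\bgroups=).*", line)

# Step 2: after each "(", take characters up to ")" or a line break
names = re.findall(r"(?<=\()[^)\r\n]+", tail.group(0)) if tail else []
print(names)  # ['kawsay', 'sudo', 'video', 'gpio']
```

Each step can be checked on its own, which is what makes this variant easy to test.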

Capture function and its value using Regex

I have a text that contains the following function calls:
set_name(value:"this is a test");
set_attribute(name:"description", value:"Some
Multi
Line
Value");
And I am trying to capture its data so that I get back:
'name'
or
'attribute'
The value just after "set_"
As well as the inside content:
value:"this is a test"
And
name:"description", value:"Some
Multi
Line
Value"
Respectively
I tried using this regex:
set_([A-Za-z_]+)\s*\(([\S\s]*?)\)
but it will fail if this is the set_attribute value:
set_attribute(name:"description", value:"Some
Multi
(Line)
Value");
because the lazy match stops at the first ) it finds, which here is the one in (Line).
I am looking for a regex that would return "attribute" and the content via two group captures:
name:"description", value:"Some
Multi
(Line)
Value"
The desired strings could be extracted with the following regular expression, with the single-line or DOTALL flag set, causing dot to match line terminators.
(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)
The first match is the substring between set_ and (; the second match is the substring between ( and ).
In Ruby, for example, this regex could be used as follows.
str = 'set_name(value:"this is a test");'
r = /(?<=^set_)\w+(?=\()|(?<=\().*?(?=\);$)/m
after_set, inside_parens = str.scan(r)
after_set #=> "name"
inside_parens #=> "value:\"this is a test\""
Note that in Ruby single-line or DOTALL mode (dot matches line terminators) is denoted /m.
The regex engine performs the following operations.
/
(?<=^set_) : positive lookbehind asserts match is preceded by `set_` at
the beginning of the string
\w+ : match 1+ word characters
(?=\() : positive lookahead asserts following character is '('
| : or
(?<=\() : positive lookbehind asserts match is preceded by '('
.*? : match 0+ characters, as few as possible
(?=\);$) : positive lookahead asserts match is followed by ');' at
the end of the line
/m : flag to cause '.' to match line terminators
Each call ends with a semicolon, so you could add that character to the regex right after the closing ).
set_([A-Za-z_]+)\s*\(([\S\s]*?)\);
You may use
(?ms)^set_(\w+)\((.*?)\);$
Details
(?ms) - multiline (^ and $ match start/end of the line now) and dotall (. matches line break chars) modes are ON
^ - start of a line
set_ - a literal string
(\w+) - Group 1: one or more word chars
\( - a ( char
(.*?) - Group 2: any 0 or more chars, as few as possible
\); - ); substring...
$ - at the end of the line.
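Here is a short Python sketch of that pattern applied to the question's input, including the tricky (Line) case. The flags are passed as arguments instead of the inline (?ms), which should be equivalent; the variable names are illustrative.

```python
import re

# Input from the question, with the "(Line)" that broke the first attempt
text = (
    'set_name(value:"this is a test");\n'
    'set_attribute(name:"description", value:"Some\n'
    'Multi\n'
    '(Line)\n'
    'Value");'
)

# MULTILINE: ^ and $ match at line boundaries; DOTALL: . matches line breaks
pattern = re.compile(r"^set_(\w+)\((.*?)\);$", re.MULTILINE | re.DOTALL)

calls = [(m.group(1), m.group(2)) for m in pattern.finditer(text)]
for name, content in calls:
    print(name, repr(content))
```

The lazy (.*?) walks past the ) in (Line) because that ) is not followed by ; at the end of a line.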
Another way to get the values using the 2 capturing groups is to repeatedly match the key:values pairs between the opening and the closing parenthesis in group 2.
^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);
Explanation
^set_ Match set_ from the start of the string
( Capture group 1
[A-Za-z_]+ Match 1+ times any of the listed
) Close group 1
\s*\( Match 0+ whitespace chars and opening (
( Capture group 2
\w+:"[^"]+" Match 1+ word chars, then from opening " till closing "
(?:, ?\w+:"[^"]+")* Optionally repeat the previous pattern preceded by a comma and optional space
) Close group 2
\); Match the closing ) followed by ;
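Since group 2 is known to consist of key:"value" pairs, a second small regex can split it further. The sketch below is one possible way to do that in Python; turning the pairs into a dict is my own addition, not something the question asks for.

```python
import re

call = 'set_attribute(name:"description", value:"Some\nMulti\n(Line)\nValue");'

# Pattern from the answer above: group 1 is the name after set_,
# group 2 is the list of key:"value" pairs
outer = re.compile(
    r'^set_([A-Za-z_]+)\s*\((\w+:"[^"]+"(?:, ?\w+:"[^"]+")*)\);',
    re.MULTILINE,
)
# Hypothetical helper pattern that splits group 2 into (key, value) tuples
pair = re.compile(r'(\w+):"([^"]+)"')

func, args = None, {}
m = outer.search(call)
if m:
    func = m.group(1)
    args = dict(pair.findall(m.group(2)))

print(func)  # attribute
print(args)
```

Because [^"]+ also matches line breaks and parentheses, the multi-line value survives intact, but a value containing an escaped quote would not be handled.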

regex: match on negating a set of arbitrary whitespaces before and after a 'comment' character

I am looking for a regex to (non)match a comment character wrapped by arbitrary whitespace.
For example, with '#' as the comment character:
Lines that should match:
code line here
code line here
Lines that should not match:
#code line here
# code line here
# code line here
So, something like a negation of the set (zero or more whitespaces, #, zero or more whitespaces)
^(\s#\s)
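The following regex will match lines that do not start with one or more whitespace characters followed by "#". Note that \s+ requires at least one whitespace character, so a line that starts directly with "#" still matches.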
^((?!^\s+#).)*$
Maybe not optimal, but try this:
^[^#]*(?!\s*#).
This will get all symbols from the beginning of the line that are not followed by spaces + # combination.
For your example data, if lookahead is supported, you could use a negative lookahead to assert that what follows the start of the string is not 0+ whitespace chars followed by a #.
If that assertion holds, match the whole string.
^(?!\s*#).+$
That will match:
^ Start of the string
(?! Negative lookahead
\s*# Match 0+ times a whitespace char, then #
) Close lookahead
.+ Match any char except newline 1+ times
$ End of the string
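As a quick check, here is how that pattern could be used in Python to filter lines. The sample lines are my own guess at the question's examples with leading whitespace added, so treat them as assumptions.

```python
import re

# Hypothetical sample lines: two code lines, three comment lines
lines = [
    "code line here",
    "    code line here",
    "#code line here",
    "# code line here",
    "   # code line here",
]

# The negative lookahead rejects a line whose first
# non-whitespace character is "#"
pattern = re.compile(r"^(?!\s*#).+$")

kept = [ln for ln in lines if pattern.match(ln)]
print(kept)  # ['code line here', '    code line here']
```

A line with a trailing comment, such as "x = 1  # note", is still kept because the lookahead only looks at the start of the line.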

regexp print line by line and remove last word

I am trying to remove the last word from each line if the line contains more than one word.
If the line has only one word, then print it as is; there is no need to delete it.
Say these are the lines:
address 34 address
value 1 value
valuedescription
size 4 size
From the lines above I want to remove the last word of each line using a regexp, except on the 3rd line, since it has only one word.
I tried the regexp below, but it also removes the word from single-word lines:
$_ =~ s/\s*\S+\s*+$//;
Need your help for the same.
You can use:
$_ =~ s/(?<=\w)\h+\w+$//m;
Explanation:
(?<=\w): Lookbehind to assert that we have at least one word char before last word
\h+: Match 1+ horizontal whitespaces
\w+: match a word with 1+ word characters
$: End of line
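For readers working in Python: the re module has no \h, so the sketch below substitutes [ \t] for horizontal whitespace. That is an approximation, since \h in Perl and PCRE covers more characters than space and tab. The input is the question's sample.

```python
import re

text = "address 34 address\nvalue 1 value\nvaluedescription\nsize 4 size"

# (?<=\w) requires a word char before the gap, so a line holding a
# single word has nothing to remove; re.MULTILINE makes $ match per line
pattern = re.compile(r"(?<=\w)[ \t]+\w+$", re.MULTILINE)

result = pattern.sub("", text)
print(result)
```

With the sample input this prints the four lines as "address 34", "value 1", "valuedescription" and "size 4".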
Try this regex:
^(?=(?:\w+ \w+)).*\K\b\w+
Replace each match with a blank string
OR
^((?=(?:\w+ \w+)).*\b)\w+
and replace each match with \1
Explanation(1st Regex):
^ - asserts the start of the line
(?=(?:\w+ \w+)) - positive lookahead to check if the string has 2 words present in it
.* - If the above condition is satisfied, then match 0+ occurrences of any character (except newline) until the end of the line
\K - forget everything matched so far
\b - backtrack to find the last word boundary
\w+ - matches the last word
A single word with no whitespace matches your regex since you've used \s* both before and after the \S+, and \s* matches an empty string.
You could use $_ =~ s/^(.*\S)\s+(\S+)$/$1/;
[Explanation: Match the RegEx if the line contains some number of characters ending with a non-whitespace (stored in $1), followed by 1 or more white-space characters, followed by 1 or more non-white-space characters. If there is a match, replace it all with the first part ($1).]
Though you might want to trim leading/trailing whitespace if you think it might contain any - depends on what you want to happen in those cases.
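The same substitution can be sketched in Python with re.sub, applied line by line to the question's sample. The second capture group is dropped here because its content is not used.

```python
import re

lines = ["address 34 address", "value 1 value", "valuedescription", "size 4 size"]

# Group 1 is everything up to the last non-whitespace char before the
# final gap; the pattern needs whitespace between two non-whitespace
# runs, so a single-word line does not match and is returned unchanged
trimmed = [re.sub(r"^(.*\S)\s+\S+$", r"\1", ln) for ln in lines]
print(trimmed)  # ['address 34', 'value 1', 'valuedescription', 'size 4']
```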

How can I detect last digits in python string

I need to detect the last digits in a string, as they are indexes for my strings. They may be as large as 2^64, so it's not convenient to check only the last character of the string, then try the second to last, etc.
A string may look like asdgaf1_hsg534, i.e. it may contain other digits too, but those are somewhere in the middle and are not adjacent to the index I want to get.
Here is a method using re.sub:
import re
strings = ['asdgaf1_hsg534', 'asdfh23_hsjd12', 'dgshg_jhfsd86']
for s in strings:
    # Replace the whole string with the digits captured at its end
    print(re.sub(r'.*?([0-9]*)$', r'\1', s))
Output:
534
12
86
Explanation:
The function takes a regular expression, a replacement string, and the string you want to do the replacement on: re.sub(regex,replace,string)
The regex '.*?([0-9]*)$' matches the whole string and captures the number that precedes the end of the string. Parentheses are used to capture the parts of the match we are interested in; \1 refers to the first capture group, \2 to the second, etc.
.*? # Matches anything (non-greedy)
([0-9]*) # Zero or more digits (captured)
$ # Followed by the end-of-string identifier
So we are replacing the whole string with just the captured number we are interested in. In Python we need to use raw strings for this: r'\1'. If the string doesn't end with digits then a blank string will be returned.
twosixfour = "get_the_numb3r_2_^_64__18446744073709551615"
print(re.sub(r'.*?([0-9]*)$', r'\1', twosixfour))
# prints 18446744073709551615
A simple regex can detect digits at the end of the string:
'\d+$'
$ matches the end of the string. \d+ matches one or more digits. The + operator is greedy by default, meaning it matches as many digits as possible. So this will match all of the digits at the end of the string.
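A minimal sketch of using that pattern with re.search, assuming the index should come back as an int. The function name trailing_index is made up for this example.

```python
import re

def trailing_index(s):
    """Return the run of digits at the end of s as an int, or None."""
    m = re.search(r"\d+$", s)
    # Python ints have arbitrary precision, so values near 2**64 are fine
    return int(m.group()) if m else None

print(trailing_index("asdgaf1_hsg534"))  # 534
print(trailing_index("get_the_numb3r_2_^_64__18446744073709551615"))  # 18446744073709551615
print(trailing_index("test"))  # None
```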
If you want to use re.sub and require at least one digit at the end of the line, use the quantifier + to match 1 or more digits (\d+). The pattern then does not match, and re.sub leaves the line unchanged, when the line has no digits or does not end with digits.
^.*?(\d+)$
^ Start of line
.*? Match any char except a newline, as few as possible (non-greedy)
(\d+) Capture group 1, match 1+ digits
$ End of line
Or using a negative lookbehind
^.*(?<!\d)(\d+)$
^ Start of line
.* Match any char except a newline as much as possible
(?<!\d)(\d+) Assert no digits directly to the left, then capture 1+ digits in group 1
$ End of line
When using re.match, you can omit the ^ anchor and you might also use \A and \Z to assert the start and the end of the string.
import re
strings = ['asdgaf1_hsg534', 'asdfh23_hsjd12', 'dgshg_jhfsd86', 'test']
for s in strings:
    print(re.sub(r".*?(\d+)$", r'\1', s))
Output
534
12
86
test
If there should be a non-digit present before the digits, you could use a negated character class with a single capture group.
^.*[^\d\r\n](\d+)
^ Start of line
.* Match any char except a newline as much as possible
[^\d\r\n] Negated character class, match any char except a digit or a newline
(\d+) Capture group 1, match 1+ digits
To get the last digits in the string (not necessarily at the end of the string)
^.*?(\d+)[^\r\n\d]*$
^ Start of line
.*? Match any char except a newline, as few as possible (non-greedy)
(\d+) Capture group 1, match 1+ digits
[^\r\n\d]* Negated character class, match 0+ times any char except a newline or digit
$ End of line
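A small sketch of this last pattern in Python, to show that it returns the final run of digits even when other characters follow it. The helper name last_digits and the first sample string are made up for the example.

```python
import re

# Group 1 is the last run of digits; only non-digit chars may follow it
pattern = re.compile(r"^.*?(\d+)[^\r\n\d]*$")

def last_digits(s):
    """Return the last run of digits in s as a string, or None."""
    m = pattern.search(s)
    return m.group(1) if m else None

print(last_digits("abc12def345xyz"))  # 345
print(last_digits("asdgaf1_hsg534"))  # 534
print(last_digits("no digits"))       # None
```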