In Regex, how do I match until a char or another char indefinitely but don't group the last char?

This is my string, I want my regex to return "bash" at group 1 and "585602" at group 2 (the Pid value)
Name: bash
Umask: 0022
State: S (sleeping)
Tgid: 585602
Ngid: 0
Pid: 585602
PPid: 585598
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
FDSize: 256
Groups: 150 962 970 985 987 990 996 998 1000
NStgid: 585602
NSpid: 585602
NSpgid: 585602
NSsid: 585602
VmPeak: 8708 kB
VmSize: 8708 kB
...
what I have now is
Name:\t *(.*)\n(.|\n)*?Pid:\t *(.*)\n
Unfortunately, I'm seeing that the second matched group is the single newline before the P of "Pid", and the third one is the Pid value. I sense the problem is in the (.|\n) part of the regex, but if I remove the parentheses then it groups a lot of other stuff that I don't want. How would I go about having only bash and the pid value as groups?

You get a newline in the second group because (.|\n)* repeats a capture group, and a repeated capture group keeps only the value of its last iteration.
The character before Pid: is a newline, that is the value of the capture group that you see.
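To see the last-iteration behaviour directly, here is a minimal Python sketch that runs the pattern from the question against an abbreviated copy of the sample. The tab after each colon is assumed from the question's pattern, and the name `status` is only illustrative.

```python
import re

# Abbreviated sample; a tab after each colon is assumed from the question's pattern.
status = (
    "Name:\tbash\n"
    "Umask:\t0022\n"
    "Tgid:\t585602\n"
    "Ngid:\t0\n"
    "Pid:\t585602\n"
    "PPid:\t585598\n"
)

# Pattern from the question: (.|\n) is a capture group that gets repeated.
m = re.search(r"Name:\t *(.*)\n(.|\n)*?Pid:\t *(.*)\n", status)

print(repr(m.group(1)))  # 'bash'
print(repr(m.group(2)))  # '\n' (only the last iteration of the repeated group)
print(repr(m.group(3)))  # '585602'
```

Group 2 is overwritten on every iteration, so it ends up holding the single character consumed just before `Pid:`.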
Note that (.|\n)* is not advisable because of the alternation inside the repetition. Better options are (if supported) an inline flag (?s) so that the dot matches a newline, a character class such as [\s\S]*, or setting the equivalent flag in your programming language.
You only need 2 capture groups, not 3: match the Pid as digits, and require at least one non-whitespace character \S at the start of the first capture group.
If you want to take the start and the end of the line into account, you can start the pattern with ^ and end it with $.
\bName:\t *(\S.*)\n[\s\S]*?^Pid:\t *(\d+)\b
See a regex101 demo
Or, as @anubhava suggests, lazily repeat a whole line followed by a newline, using (?:.*\n)*? instead of [\s\S]*?:
\bName:\t *(.*)\n(?:.*\n)*?Pid:\t *(\d+)\b
See another regex101 demo.
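Both patterns can be checked with a short Python sketch. It assumes the same tab-separated layout as the question, and the variable names are illustrative. The first pattern uses ^, so it needs `re.MULTILINE` for ^ to match at the start of each line.

```python
import re

# Abbreviated sample; a tab after each colon is assumed from the question's pattern.
status = (
    "Name:\tbash\n"
    "Umask:\t0022\n"
    "Tgid:\t585602\n"
    "Ngid:\t0\n"
    "Pid:\t585602\n"
    "PPid:\t585598\n"
)

# Variant 1: [\s\S]*? spans lines; ^ needs MULTILINE to anchor at a line start.
with_class = re.compile(r"\bName:\t *(\S.*)\n[\s\S]*?^Pid:\t *(\d+)\b", re.MULTILINE)

# Variant 2: lazily skip whole lines with a non-capturing group.
by_line = re.compile(r"\bName:\t *(.*)\n(?:.*\n)*?Pid:\t *(\d+)\b")

groups_a = with_class.search(status).groups()
groups_b = by_line.search(status).groups()
print(groups_a)  # ('bash', '585602')
print(groups_b)  # ('bash', '585602')
```

Neither variant repeats a capturing group, so only the two wanted groups are returned.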

In perl, using slurp mode to load the whole string -
$: perl -ne 'BEGIN{$/=undef} /Name:\s+(\S+).*\nPid:\s+(\S+)/ms; print "$1 $2\n";'<<<"$str"
bash 585602

Related

Regex to remove strings in notepad++ that contains certain amount of numbers

If I have the following lines, how to
leave only 10-11 digits long strings that starts with "04" or "05". Don't remove empty lines.
0409999999 012345678
012345678 0409999999
023456789 034566 0455555555
012345678 012345678
0299999999
so the lines above should then look like:
0409999999
0409999999
0455555555
I would suggest pattern
\b\d{0,9}|0[45]\d{8,9}\b
Explanation:
\b - word boundary
\d{0,9} - match up to 9 digits
| - alternation
0 - match 0 literally
[45] - match 4 or 5
Regex demo
EDIT
After update you can use [ \t]*(?!0[45]\d{8,9})\b\d+[ \t]*
The difference here is that it uses a negative lookahead to ensure that what follows is not a 10-11 digit number starting with 04 or 05.
[ \t] are used to trim space and tabs.
Then you just need to replace it with empty string.
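As a rough check outside Notepad++, the same find-and-replace can be sketched in Python. The list `lines` just holds the sample from the question, and `re.sub` with an empty replacement stands in for Notepad++'s Replace All; Notepad++'s own engine may differ in edge cases.

```python
import re

# Sample lines from the question.
lines = [
    "0409999999 012345678",
    "012345678 0409999999",
    "023456789 034566 0455555555",
    "012345678 012345678",
    "0299999999",
]

# Remove every digit run (plus surrounding spaces/tabs) unless it starts
# a 10-11 digit number beginning with 04 or 05.
remove = re.compile(r"[ \t]*(?!0[45]\d{8,9})\b\d+[ \t]*")

cleaned = [remove.sub("", line) for line in lines]
for line in cleaned:
    print(repr(line))
```

The last two lines come out as empty strings, which matches the requirement not to remove empty lines.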
\b(?!04|05)\d{1,9}\b|\b\d{10,11}\b

regex to extract housenumber plus addition

I'm looking for a regex that matches housenumbers combined with additions for all addresses below:
Breestraat 4
Breestraat 45
Breestraat 456
Dubbele Straat 4a
Dubbele Straat 4-a
5 meistraat 1a
5meistraat 12
5meistraat 12a
Teststraat 22-III
The following regex works, except in the first case: the single-digit house number is missed because the first \d in the regex consumes it, which prevents a leading digit from being captured.
\d?.(\d+.+)$
I'm scratching my head over how to get the house number '4' for the first line. Basically, how do I change "skip the starting digit" to "skip the starting digit, but still let it end up in the capturing group"?
You can use
\d+\D*$
\d+\S*$
See the regex demo #1 and regex demo #2.
The pattern matches
\d+ - one or more digits
\D* - zero or more non-digit chars
\S* - zero or more non-whitespace chars
$ - end of string.
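A small Python sketch can confirm what each pattern returns for the addresses in the question. The names `non_digit_tail` and `non_space_tail` are illustrative; `re.search` returns the leftmost match that can still reach the end of the string.

```python
import re

# Addresses from the question.
addresses = [
    "Breestraat 4",
    "Breestraat 45",
    "Breestraat 456",
    "Dubbele Straat 4a",
    "Dubbele Straat 4-a",
    "5 meistraat 1a",
    "5meistraat 12",
    "5meistraat 12a",
    "Teststraat 22-III",
]

non_digit_tail = re.compile(r"\d+\D*$")  # digits, then non-digits up to the end
non_space_tail = re.compile(r"\d+\S*$")  # digits, then non-whitespace up to the end

by_non_digit = [non_digit_tail.search(a).group() for a in addresses]
by_non_space = [non_space_tail.search(a).group() for a in addresses]
print(by_non_digit)
print(by_non_space)
```

On this sample both patterns give the same results; they would differ for inputs where the addition contains a space or a further digit.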
It's not perfectly clear what you are requesting precisely..
Anyway this is the pattern matching the house number at the end of the string:
\d+[-\da-zI]*$
https://regexr.com/6l0g7
Anyway I'm aware this is not a valid answer

Match any character until you see this string of three dots

I have the below data set,
1.1.7 Ensure separate partition exists for /var/tmp (Scored) ...................................... 40
1.1.8 Ensure nodev option set on /var/tmp partition (Scored) ................................... 42
1.1.9 Ensure nosuid option set file.ext (Scored) .................................. 43
1.1.10 Ensure noexec option set on /var/tmp partition (Scored) ............................... 44
1.1.11 Ensure separate partition exists for /var/log (Scored) ..................................... 45
1.1.12 Ensure separate partition exists for /var/log/audit (Scored) ......................... 47
1.1.13 Ensure separate partition exists for /home (Scored) ......................................... 49
1.7.1.7 Ensure the MCS Translation Service (mcstrans) is not installed (Scored)\n
.............................................................................................................................................................. 105
I want to extract the number (x.x.x.x) followed by the text, i.e. for the line below,
1.1.13 Ensure separate partition exists for /home (Scored) ......................................... 49
I want group1=1.1.13 , group2=Ensure separate partition exists for /home (Scored)
I can pull out the first group without issues, but I am struggling with the second group as some text contains a . which I want to capture, in addition to this, some lines contain a new line character within the second group so a '.' will not work either, here is my regex,
^[ ]?(\d(?:[.]|\d|[ ])+)([^.]+)
The issue is within the second capture group ([^.]+), what I am trying to do is 'match everything until you see three dots ...' but it is not working. This is what I have tried without any luck,
([^.]{3}+)
(?!\.{3})
What can I do to capture any character until I see the sequence '...'?
EDIT:
So, I have found a way to do this, but it doesn't feel like it's the best way to do it, here is my regex,
^\s*((?:\d+\.?)+)\s+(.*?\n?.*?)[\.]{3}
So basically, I am saying match anything(apart from a new line) unless you see a new line in that case, match it once, then match everything again until you see '...' . Why does (.|\n)*? not work ?
You can match the start of the string, followed by optional whitespace chars without a newline, then digits followed by one or more repetitions of a dot and digits.
Then match as few chars as possible, including newlines, until you encounter 3 dots.
^[^\S\r\n]*(\d+(?:\.\d+)+)\b([\s\S]*?)\.\.\.
^ Start of string
[^\S\r\n]* Optionally match whitespace chars without a newline
( Capture group 1
\d+(?:\.\d+)+ Match 1+ digits and repeat a . and 1+ digits
)\b Close group 1 and a word boundary
( Capture group 2
[\s\S]*? Match any char including a newline, as few as possible
)\.\.\. Close group 2 and match ...
Regex demo
If you want to limit the number of lines that follow, you can either match as few chars as possible on the same line until you encounter ..., or repeat 1-2 times matching the following lines that do not start with digits, a dot and a digit, and then match ...
^[^\S\r\n]*(\d+(?:\.\d+)+)\b(.*?|.*(?:\r?\n(?![^\S\r\n]*\d+\.\d).*?){1,2})\.\.\.
Regex demo
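The first pattern can be tried in Python as a minimal sketch. The `toc` string is an abbreviated copy of the data set (with shortened dot leaders), `re.MULTILINE` is assumed so that ^ matches at each line start, and group 2 is stripped because the pattern leaves the surrounding whitespace in it.

```python
import re

# Abbreviated sample with shortened dot leaders; the last entry wraps onto a second line.
toc = (
    "1.1.9 Ensure nosuid option set file.ext (Scored) ............ 43\n"
    "1.1.13 Ensure separate partition exists for /home (Scored) ............ 49\n"
    "1.7.1.7 Ensure the MCS Translation Service (mcstrans) is not installed (Scored)\n"
    "............................ 105\n"
)

entry = re.compile(r"^[^\S\r\n]*(\d+(?:\.\d+)+)\b([\s\S]*?)\.\.\.", re.MULTILINE)

# Strip group 2: it keeps the leading space and any trailing space or newline.
entries = [(m.group(1), m.group(2).strip()) for m in entry.finditer(toc)]
for number, title in entries:
    print(number, "|", title)
```

The single dot in `file.ext` is kept because the lazy quantifier only stops at three consecutive dots.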
Please try this:
^[ ]?(\d(?:[.]|\d|[ ])+)(.*?)[\.]{3}
For spaces, a \s is better:
^\s*((?:\d+\.?)+)\s+(.*?)[\.]{3}
Edit: With newline after "(Scored)" or before "..."
^\s*((?:\d+\.?)+)\s+(.*?)\n?[\.]{3}
Try this:
^([0-9\.]+)\s(((?![.]{3}).)*)
This seems to do what you want: ^\s*((?:\d+\.?)+)\s+([^.]*)[\.]{3}
Regex demo
Instead of greedy consumption you can make use of an anchor:
^\s*((?:\d+\.?)+)\s+(.*?)$
In this case "$" matches the Newline (excluding it)
Sources: Regular-Expressions.info; interactive Regex-Check

last year occurrence from string

I have strings like this:
ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar
I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.
I'm trying with:
grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'
or
grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'
But it matches: 1910 and 1934
Here's the Regex101 example:
https://regex101.com/r/UetMl0/3
https://regex101.com/r/UetMl0/4
Plus: how can I extract the year without the surrounding spaces without doing an extra grep to filter them?
Have you ever heard this saying:
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
Keep it simple - you're interested in finding a number between 2 numbers so just use a numeric comparison, not a regexp:
$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934
You didn't say what to do if no date within your range is present so the above outputs a blank line if that happens but is easily tweaked to do anything else.
To change the above script to find the first instead of the last date is trivial (move the print inside the if), to use different start or end dates in your range is trivial (change the min and/or max values), etc., etc. which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
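For comparison, the same field-by-field numeric check can be sketched in Python. The function name `last_year` and its parameters are illustrative and not part of the awk answer; it returns None where the awk script prints a blank line.

```python
import re

def last_year(line, lo=1900, hi=2050):
    """Return the last whitespace-separated 4-digit field within [lo, hi], or None."""
    found = None
    for field in line.split():
        # Same test as the awk script: exactly four digits, then a numeric range check.
        if re.fullmatch(r"[0-9]{4}", field) and lo <= int(field) <= hi:
            found = field
    return found

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year(s))  # 1934
```

Changing the range only means passing different `lo` and `hi` values, which mirrors the point made above about the awk version.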
I don't see a way to do this with grep because it doesn't let you output just one of the capture groups, only the whole match.
With perl I'd do something like
perl -lne 'if (/^.*\b(19\d\d|20(?:[0-4]\d|50))\b/) { print $1 }'
Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).
I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but also match e.g. 2099).
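The greedy-prefix idea is not specific to perl. A minimal Python sketch, assuming the sample string from the question (the names `last` and `year` are illustrative):

```python
import re

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"

# ^.* is greedy, so the engine backtracks from the end of the line and the
# capture group ends up on the last position where a year can match.
last = re.compile(r"^.*\b(19\d\d|20(?:[0-4]\d|50))\b")

m = last.search(s)
year = m.group(1) if m else None
print(year)  # 1934
```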
The regex to do your task using grep can be as follows:
\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)
Details:
\b - Word boundary.
(?: - Start of a non-capturing group, needed as a container for
alternatives.
19\d{2}| - The first alternative (1900 - 1999).
20[0-4]\d| - The second alternative (2000 - 2049).
2050 - The third alternative, just 2050.
) - End of the non-capturing group.
\b - Word boundary.
(?! - Negative lookahead for:
.* - A sequence of any chars, meaning actually "what follows
can occur anywhere further".
\b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
) - End of the negative lookahead.
The word boundary anchors ensure that you do not match numbers that are
parts of longer words, e.g. X1911D.
The negative lookahead ensures that you match just the last
occurrence of the required year.
If you can use a tool other than grep that supports calling a previous
numbered group with (?n), where n is the number of another capturing
group, the regex can be a bit simpler:
(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))
Details:
(\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but
enclosed within a capturing group (it will be "called" later).
(?!.*(?1)) - Negative lookahead for capturing group No 1,
located anywhere further.
This way you avoid writing the same expression again.
For a working example in regex101 see https://regex101.com/r/fvVnZl/1
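The negative-lookahead version can also be sketched in Python. Python's built-in `re` module does not support the (?1) subroutine call, so this sketch avoids the repetition by building the pattern from one string instead; the names `YEAR` and `last_year` are illustrative.

```python
import re

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"

# One definition of a year in 1900-2050, reused inside the lookahead.
YEAR = r"\b(?:19\d{2}|20[0-4]\d|2050)\b"
last_year = re.compile(YEAR + r"(?!.*" + YEAR + r")")

all_years = re.findall(YEAR, s)   # every candidate the word boundaries allow
last = last_year.findall(s)       # only the one with no later candidate
print(all_years)  # ['1910', '1955', '2011', '1934']
print(last)       # ['1934']
```

Note that the word boundaries accept both halves of 1955-2011 as candidates, since a hyphen counts as a boundary.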
You may use a PCRE regex without any groups to only return the last occurrence of a pattern you need if you prepend the pattern with ^.*\K, or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file
See the regex demo.
Details
^ - start of line
(?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
\s - a whitespace char
\K - match reset operator discarding the text matched so far
(?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits or 20 followed with either a digit from 0 to 4 and then any digit (00 to 49) or 50.
(?!\S) - a whitespace or end of string.
See an online demo:
s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934
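Python's built-in `re` module has no \K, so as a rough equivalent the year can be put in a capture group instead. This is a sketch of the same whitespace-boundary idea, not the grep command itself; the function name `last_year_ws` is illustrative.

```python
import re

# Same structure as the grep pattern, with a capture group in place of \K.
pattern = re.compile(r"^(?:.*\s)?(19\d{2}|20(?:[0-4]\d|50))(?!\S)")

def last_year_ws(line):
    """Return the last year in 1900-2050 that is delimited by whitespace, or None."""
    m = pattern.search(line)
    return m.group(1) if m else None

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year_ws(s))                   # 1934
print(last_year_ws("1910 1955-2011 x"))  # 1910
```

The second call shows the effect of whitespace boundaries: neither half of 1955-2011 is accepted, so the earlier 1910 is returned.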

Notepad++ regex to remove spaces between specific characters but not ALL spaces

I thought my regex skills were strong, but I'm getting crushed.
I have several lines in this format
Time: 105 0 0
Time: 88 0 1
Time: 44 1 1
Time: 64 1 0
I want these to turn into this:
Time: 105 thread00
Time: 88 thread01
Time: 44 thread11
Time: 64 thread10
I can match the [0-9][ ][0-9] section... I match it with that regex right there!
But I don't know how to preserve the values AND remove the space. Replacing it wholesale with new stuff, sure... but how do I PRESERVE values?
Find what: (\d)\s(\d)$
Replace with: thread\1\2
\d matches any digit, \s matches any space character.
The parentheses will be captured for use as \1, \2, \3... and \0 will provide the entire match.*
$ matches the end of a line, so that you don't accidentally match the "5 0" in the first line.
*Note that some regex engines use the \1 pattern while some others will use $1. Notepad++ uses the former.
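The same capture-and-reuse replacement can be tried outside Notepad++. Here is a minimal Python sketch, assuming the sample lines from the question; Python's `re.sub` also uses \1 and \2 in the replacement, and `re.MULTILINE` is needed so that $ matches at the end of every line.

```python
import re

text = "Time: 105 0 0\nTime: 88 0 1\nTime: 44 1 1\nTime: 64 1 0"

# (\d)\s(\d)$ captures the two trailing digits; the replacement re-emits
# them without the space, prefixed by "thread".
result = re.sub(r"(\d)\s(\d)$", r"thread\1\2", text, flags=re.MULTILINE)
print(result)
```

Because of the $ anchor, the "5 0" in the first line is left alone and only the final pair of digits is rewritten.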
You can try this:
Pattern:
/^(.*)(\d+)\s(\d+)$/
Breakdown:
^ # start of line
(.*) # the first part of the line -- capture $1
(\d+) # the first number (1 or more) -- capture $2
\s # the space between the numbers
(\d+) # the second number (1 or more) -- capture $3
$ # end of line
Replace:
/$1thread$2$3/
Result:
Time: 105 thread00
Time: 88 thread01
Time: 44 thread11
Time: 64 thread10
Demo: http://regex101.com/r/gB8uS4