RegEx skip word - regex

I would like to use regular expressions to extract the first couple of words and the second to last letter of a string.
For example, in the string
"CSC 101 Intro to Computing A R"
I would like to capture
"CSC 101 A"
Maybe something similar to this
grep -o -P '\w{3}\s\d{3}*thenIdon'tKnow*\s\w\s'
Any help would be greatly appreciated.

You could go for:
^((?:\w+\W+){2}).*(\w+)\W+\w+$
And use group 1 + 2, see it working on regex101.com.
Broken down, this says:
^ # match the start of the line/string
( # capture group 1
(?:\w+\W+){2} # repeated non-capturing group with words/non words
)
.* # anything else afterwards
(\w+)\W+\w+ # backtracking to the second last word character
$
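For readers who want to try this outside grep, here is a minimal Python sketch of the same pattern. It assumes Python's re module is an acceptable stand-in, and the helper name extract_code is made up for illustration.

```python
import re

# Pattern from the answer above: two leading words, then the
# second-to-last word character before the final word.
PATTERN = re.compile(r'^((?:\w+\W+){2}).*(\w+)\W+\w+$')

def extract_code(line):
    """Return group 1 + group 2, or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # Group 1 keeps its trailing separator ("CSC 101 "), so plain
    # concatenation is enough to join the two pieces.
    return m.group(1) + m.group(2)

print(extract_code("CSC 101 Intro to Computing A R"))  # CSC 101 A
```

Because .* is greedy, group 2 only ever holds a single character. That is fine for a one-letter field such as "A", but a longer second-to-last word would be cut down to its last character.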

Do:
^(\S+)\s+(\S+).*(\S+)\s+\S+$
The 3 captured groups capture the 3 desired portions
\S indicates any non-whitespace character
\s indicates any whitespace character
As you have used grep with PCRE in your example, I am assuming you have access to the GNU toolset. Using GNU sed:
% sed -E 's/^(\S+)\s+(\S+).*(\S+)\s+\S+$/\1 \2 \3/' <<<"CSC 101 Intro to Computing A R"
CSC 101 A

A whole RegEx pattern can't match disjointed groups.
I suggest taking a look at capture groups: you capture the two disjoint parts separately, and the captured text can then be used by referring to these two groups.
grep can't print out multiple capture groups so an example with sed is
echo 'CSC 101 Intro to Computing A R' | sed -n 's/^\(\w\{3\}\s[[:digit:]]\{3\}\).*\?\(\w\)\s\+\w$/\1 \2/p' which prints out CSC 101 A
Note that the pattern used here is ^(\w{3}\s\d{3}).*?(\w)\s+\w$

Related

How to transpose pieces of data using Regular expression in Notepad++

I am very new to the world of regular expressions. I am trying to use Notepad++ using Regex for the following:
Input file is something like this and there are multiple such files:
Code:
abc
17
015
0 7
4.3
5/1
***END***
abc
6
71
8/3
9 0
***END***
abc
10.1
11
9
***END***
I need to be able to edit the text in all of these files so that all the files look like this:
Code:
abc
1,2,3,4,5
***END***
abc
6,7,8,9
***END***
abc
10,11,12
***END***
Also:
In some files the number of * around the word END varies, is there a way to generalize the number of * so I don't have to worry about it?
There is some additional data before abcs which does not need to be transposed, how do I keep that data as it is along with transposing the data between abc and ***END***.
Kindly help me. Your help is much appreciated!
Try the following find and replace, in regex mode:
Find: ^(\d+)\R(?!\*{1,}END\*{1,})
Replace: $1,
Here is an explanation of the regex pattern:
^ from the start of the line
(\d+) match AND capture a number
\R followed by a platform independent newline, which
(?!\*{1,}END\*{1,}) is NOT followed by ***END***
Note carefully the negative lookahead at the end of the pattern, which makes sure that we don't do the replacement on the final number in each section. Without this, the last number would bring the END marker onto the same line.
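The same find and replace can be sketched in Python to see the lookahead at work. This is only an illustration: Python's re module has no \R, so \r?\n is used instead, and the name join_numbers and the sample text are assumptions rather than anything from the question.

```python
import re

# ^(\d+) followed by a line break that is NOT followed by an END marker.
# \r?\n stands in for Notepad++'s \R, which Python's re does not support.
JOIN_NUMBERS = re.compile(r'^(\d+)\r?\n(?!\*+END\*+)', re.MULTILINE)

def join_numbers(text):
    """Replace the line break after each number with a comma, except before END."""
    return JOIN_NUMBERS.sub(r'\1,', text)

sample = "abc\n1\n2\n3\n***END***\nabc\n6\n7\n*END*\n"
print(join_numbers(sample))
```

The last number of each section keeps its line break because the lookahead sees the END marker on the next line, whatever the number of asterisks.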
This will replace only between "abc" and "***END***", with any number of asterisks.
Ctrl+H
Find what: (?:(?<=^abc)\R|\G(?!^)).+\K\R(?!\*+END\*+)
Replace with: ,
CHECK Match case
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline*
Replace all
Explanation:
(?: # non capture group
(?<=^abc) # positive look behind, make sure we have "abc" at the beginning of line before
\R # any kind of linebreak
| # OR
\G # restart from last match position
(?!^) # negative look ahead, make sure we are not at the beginning of line
) # end group
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\R # any kind of linebreak
(?!\*+END\*+) # negative lookahead, make sure we haven't ***END*** after
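Python's re module supports neither \G nor \K, so the pattern above cannot be ported directly. The sketch below swaps in a different technique with the same goal: it matches each whole block from "abc" to the END marker and joins the lines inside a callback. It assumes "\n" line endings and a non-empty block body, and the name join_blocks is hypothetical.

```python
import re

# Group 1: the "abc" line, group 2: the block body, group 3: the END line.
BLOCK = re.compile(r'^(abc\n)(.*?)(\n\*+END\*+)$', re.MULTILINE | re.DOTALL)

def join_blocks(text):
    """Join the lines between 'abc' and the END marker with commas."""
    def repl(m):
        body = m.group(2).replace('\n', ',')
        return m.group(1) + body + m.group(3)
    return BLOCK.sub(repl, text)

sample = "header\n99\n100\nabc\n1\n2\n3\n***END***\nabc\n6\n7\n*END**\n"
print(join_blocks(sample))
```

Lines before the first "abc" (here "header", "99" and "100") are left alone because they never fall inside a matched block.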

How do I replace blanks in Linux with underscores between letters only, ignoring numbers

Using Linux, I need a way to replace blanks in a string with underscores. The special point is to do this only between two letters (regardless of upper- or lowercase), not between two numbers or between a number and a letter.
Example:
"This is a test File of 100 MB Size - 45 of 50 files processed"
Output should be:
"This_is_a_test_File_of 100 MB_Size - 45 of 50 files_processed"
Thanks in advance for your help.
I tried a lot of sed regex combinations, but none of them did the job.
Seems a bit tricky.
sed 's/\([a-z]\)[[:space:]]\([A-Z]\)/_/g'
sed 's/\([a-z]\) \([A-Z]\)/_/g'
A way that puts hyphens around digits and that plays with word boundaries:
sed -E 's/([0-9_])/-\1-/g;s/\b \b/_/g;s/-([0-9_])-/\1/g' file
Or more direct with perl:
perl -pe's/\pL\K (?=\pL)/_/g' file
You may use
sed ':A;s/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/;tA' file
Or
sed ':A;s/\([[:alpha:]]\)[[:space:]]\([[:alpha:]]\)/\1_\2/;tA' file
The point is that you match and capture a letter into Group 1 with the first \([[:alpha:]]\), then match a space (or any whitespace with [[:space:]]), and then match and capture a second letter into Group 2 (with the second \([[:alpha:]]\)). The match is replaced with the contents of Group 1 (\1), _ and the contents of Group 2 (\2). Then tA jumps back to the :A label whenever a substitution was made, so the substitution is retried on the updated line until nothing more matches.
Note that your approach would partly work if you added the \1 and \2 placeholders to the RHS in the right places, but one-letter words would still break it: a global substitution never rescans text it has already consumed, so the space after a one-letter word is skipped. However, if you pipe the output through a second, identical sed command, you get the expected output:
sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g' file | sed 's/\([[:alpha:]]\) \([[:alpha:]]\)/\1_\2/g'
See this online demo.
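The difference between a single global pass, the sed loop and a lookaround-based substitution can be reproduced in Python. This is a sketch only: [A-Za-z] is assumed as a stand-in for [[:alpha:]] (ASCII letters only), and the function names are invented for the example.

```python
import re

LETTER_PAIR = re.compile(r'([A-Za-z]) ([A-Za-z])')

def underscore_single_pass(text):
    """One global pass: misses the space right after a one-letter word."""
    return LETTER_PAIR.sub(r'\1_\2', text)

def underscore_loop(text):
    """Mirror sed's ':A;s/.../.../;tA': substitute once, repeat until no change."""
    while True:
        text, count = LETTER_PAIR.subn(r'\1_\2', text, count=1)
        if count == 0:
            return text

def underscore_lookaround(text):
    """Lookarounds consume no letters, so one pass is enough (like the perl answer)."""
    return re.sub(r'(?<=[A-Za-z]) (?=[A-Za-z])', '_', text)

s = "This is a test File of 100 MB Size - 45 of 50 files processed"
print(underscore_single_pass(s))
print(underscore_loop(s))
print(underscore_lookaround(s))
```

In the single-pass version the letter "a" is consumed as the right-hand side of "s a", so it is no longer available as the left-hand side of "a t". The loop and the lookaround versions both avoid that.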

last year occurrence from string

I have strings like this:
ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar
I'm trying to get the last occurrence of a single year (from 1900 to 2050), so I need to extract only 1934 from that string.
I'm trying with:
grep -P -o '\s(19|20)[0-9]{2}\s(?!\s(19|20)[0-9]{2}\s)'
or
grep -P -o '((19|20)[0-9]{2})(?!\s\1\s)'
But it matches: 1910 and 1934
Here's the Regex101 example:
https://regex101.com/r/UetMl0/3
https://regex101.com/r/UetMl0/4
Plus: how can I extract the year without the surrounding spaces without doing an extra grep to filter them?
Have you ever heard this saying:
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.” Now they have two problems.
Keep it simple - you're interested in finding a number between 2 numbers so just use a numeric comparison, not a regexp:
$ awk -v min=1900 -v max=2050 '{yr=""; for (i=1;i<=NF;i++) if ( ($i ~ /^[0-9]{4}$/) && ($i >= min) && ($i <= max) ) yr=$i; print yr}' file
1934
You didn't say what to do if no date within your range is present so the above outputs a blank line if that happens but is easily tweaked to do anything else.
To change the above script to find the first instead of the last date is trivial (move the print inside the if), to use different start or end dates in your range is trivial (change the min and/or max values), etc., etc. which is a strong indication that this is the right approach. Try changing any of those requirements with a regexp-based solution.
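The numeric-comparison idea carries over to other languages. Here is a rough Python equivalent of the awk script, with a hypothetical function name last_year and the same default range as in the question.

```python
import re

def last_year(line, lo=1900, hi=2050):
    """Return the last whitespace-separated 4-digit field within [lo, hi], or ''."""
    found = ""
    for field in line.split():
        # The regex only checks the shape; the range check is plain arithmetic.
        if re.fullmatch(r'[0-9]{4}', field) and lo <= int(field) <= hi:
            found = field
    return found

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year(s))  # 1934
```

As with the awk version, changing the range means changing two numbers, and returning the first year instead of the last means returning inside the if.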
I don't see a way to do this with grep because it doesn't let you output just one of the capture groups, only the whole match.
With perl I'd do something like
perl -lne 'if (/^.*\b(19\d\d|20(?:[0-4]\d|50))\b/) { print $1 }'
Idea: Use ^.* (greedy) to consume as much of the string up front as possible, thus finding the last possible match. Use \b (word boundary) around the matched number to prevent matching 01900 or X1911D. Only print the first capture group ($1).
I tried to implement your requirement of 1900-2050; if that's too complicated, ((?:19|20)\d\d) will do (but also match e.g. 2099).
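The greedy-prefix idea is not specific to perl. A small Python sketch, assuming the range 1900-2050 from the question and an invented helper name, could look like this:

```python
import re

# Greedy ^.* pushes the capture group as far right as possible,
# so group 1 is the LAST year in the line.
YEAR_AT_END = re.compile(r'^.*\b(19\d\d|20(?:[0-4]\d|50))\b')

def last_year_greedy(line):
    """Return the last 1900-2050 year delimited by word boundaries, or None."""
    m = YEAR_AT_END.match(line)
    return m.group(1) if m else None

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year_greedy(s))  # 1934
```

Note that \b treats "-" as a boundary, so in a line ending with "1955-2011" this would return 2011. If years inside ranges must be ignored, whitespace-based boundaries are needed instead.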
The regex to do your task using grep can be as follows:
\b(?:19\d{2}|20[0-4]\d|2050)\b(?!.*\b(?:19\d{2}|20[0-4]\d|2050)\b)
Details:
\b - Word boundary.
(?: - Start of a non-capturing group, needed as a container for
alternatives.
19\d{2}| - The first alternative (1900 - 1999).
20[0-4]\d| - The second alternative (2000 - 2049).
2050 - The third alternative, just 2050.
) - End of the non-capturing group.
\b - Word boundary.
(?! - Negative lookahead for:
.* - A sequence of any chars, meaning actually "what follows
can occur anywhere further".
\b(?:19\d{2}|20[0-4]\d|2050)\b - The same expression as before.
) - End of the negative lookahead.
The word boundary anchors ensure that you will not match numbers that
are parts of longer words, e.g. X1911D.
The negative lookahead ensures that you match just the last
occurrence of the required year.
If you can use a tool other than grep that supports calling a previous
numbered group with (?n), where n is the number of a capturing
group, the regex can be a bit simpler:
(\b(?:19\d{2}|20[0-4]\d|2050)\b)(?!.*(?1))
Details:
(\b(?:19\d{2}|20[0-4]\d|2050)\b) - The regex like before, but
enclosed within a capturing group (it will be "called" later).
(?!.*(?1)) - Negative lookahead for capturing group No 1,
located anywhere further.
This way you avoid writing the same expression again.
For a working example in regex101 see https://regex101.com/r/fvVnZl/1
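The negative-lookahead version can also be tried from Python. Python's re module does not support the (?1) subroutine call, so this sketch builds the repeated part from a string constant instead. The names YEAR, LAST_YEAR and last_year_lookahead are made up for the example.

```python
import re

# The year expression is written once and reused inside the lookahead.
YEAR = r'\b(?:19\d{2}|20[0-4]\d|2050)\b'
LAST_YEAR = re.compile(YEAR + r'(?!.*' + YEAR + r')')

def last_year_lookahead(line):
    """Return the only year that has no other matching year after it, or None."""
    m = LAST_YEAR.search(line)
    return m.group(0) if m else None

s = "ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
print(last_year_lookahead(s))  # 1934
```

Since the whole match is the year itself, no capture group has to be extracted, which is what makes this form usable with grep -o.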
You may use a PCRE regex without any groups to only return the last occurrence of a pattern you need if you prepend the pattern with ^.*\K, or, in your case, since you expect a whitespace boundary, ^(?:.*\s)?\K:
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' file
See the regex demo.
Details
^ - start of line
(?:.*\s)? - an optional non-capturing group matching 1 or 0 occurrences of
.* - any 0+ chars other than line break chars, as many as possible
\s - a whitespace char
\K - match reset operator discarding the text matched so far
(?:19\d{2}|20(?:[0-4]\d|50)) - 19 and any two digits or 20 followed with either a digit from 0 to 4 and then any digit (00 to 49) or 50.
(?!\S) - a whitespace or end of string.
See an online demo:
s="ACB 01900 X1911D 1910 1955-2011 3424 2135 1934 foobar"
grep -Po '^(?:.*\s)?\K(?:19\d{2}|20(?:[0-4]\d|50))(?!\S)' <<< "$s"
# => 1934

regex matching two groups of repeating digits where both are not allowed to be the same digits

Folks,
I'm trying to use regular expressions to process a large set of number strings and match digit sequences for particular patterns where some digits are repeated in groups. Part of the requirement is to ensure uniqueness between sections of the given pattern.
An example of the kind of matching I'm trying to achieve
ABBBCCDD
Interpret this as a set of digits. But A,B,C,D cannot be the same. And the repetition of each is the pattern we're trying to match.
I've been using regular expressions with negative look-ahead as part of this matching and it works but not all the time and I'm confused as to why. I'm hoping someone can explain why it's glitching and suggest a solution.
So to address ABBBCCDD I came up with this RE using negative look-ahead using groups..
(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}
To break this down..
(.) single character wildcard group 1 (A)
(?!\1{1,7}) negative look-ahead for 1-7 occurrences of group 1 (A)
(.) single character wildcard group 2 (B)
\2{2} A further two occurrences of group 2 (B)
(?!\2{1,4}) Negative look-ahead of 1-4 occurrences of group 2 (B)
(.) single character wildcard group 3 (C)
\3{1} One more occurrence of group 3 (C)
(?!\3{1,2}) Negative look-ahead of 1-2 occurrences of group 3 (C)
(.) single character wildcard group 4 (D)
\4{1} one more occurrence of group 4 (D)
The thinking here is that the negative look-aheads act as a means of verifying that a given character is not found where it's unexpected. So A gets checked in the next 7 chars. Once B and its 2 repetitions are matched, we're negatively looking ahead for B in the next 4 chars. Finally once the pair of Cs is matched, we're looking in the final 2 for a C as a means of detecting a mismatch.
For test data, this string "01110033" matches the expression. But it shouldn't because the '0' for A is repeated in the C position.
I ran checks of this expression in Python and with grep in PCRE mode (-P). Both matched the wrong pattern.
I put the expression in https://regex101.com/ along with the same test string "01110033" and it also matched there. I don't have enough rating to post images of this or of variations I tried with the test data. So here are some text grabs from command-line runs with grep -P
So our invalid expression that repeats A in CC position gets through..
$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110033
$
Changing DD to 11, copying BBB, we also find that gets through despite B having a forward negative check..
$ echo "01110011" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
01110011
$
Now change DD to "00", copying the CC digits, and lo and behold it doesn't match..
$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
Delete the forward-negative check for CC "(?!\3{1,2})" from the expression and our repeat of the C digit in the D position makes it through.
$ echo "01110000" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(.)\4{1}'
01110000
$
Back to the original test number and switch CC digits to the same use of '1' from B. It doesn't get through.
$ echo "01111133" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
And to play this out for the BBB group, set the B digits to the same 0 as encountered for A. Also fails to match..
$ echo "00002233" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
Then take out the negative lookahead for A and we can get this to match..
$ echo "00002233" | grep -P '(.)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
00002233
$
So it seems to me that the forward negative check is working but that it only works with the next adjacent set or its intended lookahead range is cut short in some form presumably by the extra things we're trying to match.
If I add an additional lookahead on A right after B and its repetition have been processed, we get it to avoid matching on the CC part reusing the A digit..
$ echo "01110033" | grep -P '(.)(?!\1{1,7})(.)\2{2}(?!\1{1,4})(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}'
$
To take this further, then after matching the CC set, I would need to repeat the negative lookaheads for A and B again. This just seems wrong.
Hopefully an RE expert can clarify what I'm doing wrong here or confirm if negative-lookahead is indeed limited based on what I'm observing
(.)(?!.{0,6}\1)(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}
^^^^^^^^
Change your lookahead to disallow a match when \1 appears anywhere in the string. See the demo. You can similarly modify the other parts of your regex.
https://regex101.com/r/vV1wW6/31
NOTE: updated.
As vks already noted, your negative lookaheads weren't excluding what you thought -- \1{1,7} for example is only going to exclude A, AA, AAA, AAAA, AAAAA, AAAAAA, and AAAAAAA. I think you want the lookaheads to be .*\1, .*\2, .*\3, etc.
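To make the difference concrete, here is a small Python comparison of the original lookaheads with the .*\N form suggested above. It is a sketch under the assumption that each input is exactly one 8-character string; the constant names are invented.

```python
import re

# Original: (?!\1{1,7}) only rejects an IMMEDIATE repeat of group 1.
ORIGINAL = re.compile(
    r'(.)(?!\1{1,7})(.)\2{2}(?!\2{1,4})(.)\3{1}(?!\3{1,2})(.)\4{1}')

# Suggested: (?!.*\1) rejects group 1 anywhere later in the string.
FIXED = re.compile(
    r'(.)(?!.*\1)(.)\2{2}(?!.*\2)(.)\3{1}(?!.*\3)(.)\4{1}')

for digits in ("01112233", "01110033", "01110011"):
    print(digits, bool(ORIGINAL.search(digits)), bool(FIXED.search(digits)))
```

With .* the lookahead scans to the end of the line, so on longer inputs it can reject a match because of characters beyond the 8-character window. A bounded form such as .{0,6}\1 keeps the check inside the window.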
But here's another idea: It's easy to prefilter out ANY line that has non-adjacent repeated characters:
grep -P -v '(.)(?!\1).*\1'
And then your regexp on the result is MUCH simpler: .{1}.{3}.{2}.{2}
And in fact the whole thing can be combined using the first as a negative pre-lookahead constraint:
(?!.*(.)(?!\1).*\1).{1}.{3}.{2}.{2}
Or if you need to capture the digits as you did originally:
(?!.*(.)(?!\1).*\1)(.){1}(.){3}(.){2}(.){2}
But note that those digits will now be \2 \3 \4 \5, since \1 is in the lookahead.
Based on the feedback so far, I'm giving another answer that does not rely on doing arithmetic based on total length and that will self-containedly identify any sequence of 4 unique character/digit groups in the length sequence 1,3,2,2 anywhere in a string:
/(?<=^|(.)(?!\1))(.)\2{0}(?!\2)(.)\3{2}(?!\2|\3)(.)\4{1}(?!\2|\3|\4)(.)\5{1}(?!\5)/gm
^^^^^^^^^^^^^^^^ this is a look-behind that makes sure we're starting with a new character/digit
^^^^^^^^ this is the size-1 group; yes the \2{0} is superfluous
^^^^^^ this ensures the next group is unique
^^^^^^^^ this is the size-3 group
etc.
Let me know if this is closer to your solution. If so, and if all of your "patterns" consist of sequences of the group sizes you're looking for (like 1,3,2,2), I can come up with some code that will generate the corresponding regexp for any such input "pattern".
just some details here on what the eventual solution looked like for me..
So fundamentally (?!\1{1,7}) was not what I had thought it would be and was the entire cause of the issues I had encountered. Sincere appreciations to you guys for finding that issue for me.
The example I had shown was 1 from about 50 I had to formulate from a set of patterns.
It ended up as..
ABBBCCDD
09(.)(?!.{0,6}\1)(.)\2{2}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}
So once \1 (A) was captured, I tested negative lookahead of 0-6 wildchars preceding A. Then I capture \2 (B), its two repetitions and then give B negative lookahead of 0-3 wilds + B and so on.
It keeps the focus oriented around looking forward negatively to make sure the caught groups do not repeat where they are not supposed to. Then the subsequent captures and their recurrence patterns will do the rest in ensuring the match.
Other examples from the final set:
ABCCDDDD
(.)(?!.{0,6}\1)(.)(?!.{0,5}\2)(.)\3{1}(?!.{0,3}\3)(.)\4{3}
AABBCCDD
(.)\1{1}(?!.{0,5}\1)(.)\2{1}(?!.{0,3}\2)(.)\3{1}(?!.{0,1}\3)(.)\4{1}
ABCCDEDE
09(.)(?!.{0,6}\1)(.)(?!.{0,5}\2)(.)\3{1}(?!.{0,3}\3)(.)(?!\4{1})(.)\4{1}\5{1}
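Since about 50 such patterns were needed, the run-based ones could be generated rather than written by hand. The following Python sketch is one possible way to do that and is not the code the poster used. It assumes every letter forms a single contiguous run (so ABCCDEDE is rejected), it does not add the literal 09 prefix, and the name pattern_to_regex is hypothetical.

```python
import re
from itertools import groupby

def pattern_to_regex(pattern):
    """Build a regex for a letter pattern such as 'ABBBCCDD'.

    Each run becomes (.), an optional \\N{k-1} for the repeats, and a
    bounded negative lookahead forbidding that character in the rest.
    """
    runs = [(letter, len(list(group))) for letter, group in groupby(pattern)]
    letters = [letter for letter, _ in runs]
    if len(set(letters)) != len(letters):
        raise ValueError("each letter must form a single contiguous run")

    parts = []
    remaining = len(pattern)
    for index, (_, length) in enumerate(runs, start=1):
        remaining -= length          # characters left after this run
        backref = "\\" + str(index)  # e.g. \2
        parts.append("(.)")
        if length > 1:
            parts.append(backref + "{" + str(length - 1) + "}")
        if remaining > 0:
            parts.append("(?!.{0," + str(remaining - 1) + "}" + backref + ")")
    return "".join(parts)

print(pattern_to_regex("ABBBCCDD"))
print(bool(re.fullmatch(pattern_to_regex("ABBBCCDD"), "01112233")))
```

For ABBBCCDD, ABCCDDDD and AABBCCDD the generated text is the same as the hand-written expressions above, apart from the 09 prefix.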

Regular expression to find first match only

I have this text :-
SOME text, .....
Number of successes: 3556
Number of failures: 22
Some text, .....
Number of successes: 2623
Number of failure: 0
My requirement is to find the first occurrence of this pattern "Number of successes: (\d+)" which is Number of successes: 3556.
But the above expression returns subsequent matches as well.
I want the regular expression to do this for me, unlike in Java, where I can use a loop to iterate.
Can anyone help me with a regular expression that can find the first occurrence only.
One solution that should work in any language:
(?s)\A(?:(?!Number of successes:).)*Number of successes: (\d+)
Explanation:
(?s) # Turn on singleline mode
\A # Start of string
(?: # Non-capturing group:
(?!Number of successes:) # Unless this text intervenes:
. # Match any character.
)* # Repeat as needed.
Number of successes:[ ] # Then match this text
(\d+) # and capture the following number
See it live on regex101.com.
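As a quick check, the expression above can be run in Python, which supports (?s), \A and lookahead. The helper name first_success and the sample text layout are assumptions for the sketch.

```python
import re

# Tempered dot: consume characters only while "Number of successes:"
# does not start at the current position, then match the first occurrence.
FIRST_SUCCESS = re.compile(
    r'(?s)\A(?:(?!Number of successes:).)*Number of successes: (\d+)')

def first_success(text):
    """Return the first success count as an int, or None if there is none."""
    m = FIRST_SUCCESS.match(text)
    return int(m.group(1)) if m else None

text = (
    "SOME text, .....\n"
    "Number of successes: 3556\n"
    "Number of failures: 22\n"
    "Some text, .....\n"
    "Number of successes: 2623\n"
    "Number of failure: 0\n"
)
print(first_success(text))  # 3556
```

In Python specifically, re.search with the plain pattern Number of successes: (\d+) would also return only the first match. The anchored form is mainly useful in tools that always report every match.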
Just in case the requirements to do it via regexp is not really a requirement, here are alternatives to the (nice) approach by Tim (who uses only regexp)
awk ' $0~/Number of successes: [1-9][0-9]*/ { print $0 ; exit 0 ;}'
or the really simple
grep 'Number of successes: [1-9][0-9]*' | head -1
I much prefer the awk one, as it quits as soon as it sees the first match, whereas the 2nd one could process many lines after it (until it receives the SIGPIPE or end of file)
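The "stop at the first hit" behaviour of the awk command can be sketched in Python as well. This is an illustration only: it takes any iterable of lines, uses \d+ as in the question rather than [1-9][0-9]*, and the function name is invented.

```python
import re

SUCCESS_LINE = re.compile(r'Number of successes: (\d+)')

def first_success_line(lines):
    """Return the first matching line and stop reading, like awk's exit."""
    for line in lines:
        if SUCCESS_LINE.search(line):
            return line.rstrip('\n')
    return None

lines = [
    "SOME text, .....\n",
    "Number of successes: 3556\n",
    "Number of failures: 22\n",
    "Number of successes: 2623\n",
]
print(first_success_line(lines))  # Number of successes: 3556
```

When an open file object is passed instead of a list, lines after the first match are never read, which matters for large logs.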
Try using grep with -m option
grep -m 1 'Number of successes: [0-9]\+' file