Replace regex with captured group ONLY

I'm trying to understand why the following does not return what I expect (or want :)):
sed -r 's/^(.*?)(Some text)?(.*)$/\2/' list_of_values
or Perl:
perl -lpe 's/^(.*?)(Some text)?(.*)$/$2/' list_of_values
So I want my result to be just Some text; otherwise (meaning nothing was captured in $2) it should just be EMPTY.
I did notice that with perl it does work if Some text is at the start of the line/string (which baffles me...). (Also noticed that removing ^ and $ has no effect)
Basically, I'm trying to get what grep would return with the --only-matching option as discussed here. Only I want/need to use sub/replace in the regex.
EDITED (added sample data)
Sample input:
$ cat -n list_of_values
1 Black
2 Blue
3 Brown
4 Dial Color
5 Fabric
6 Leather and Some text after that ....
7 Pearl Color
8 Stainless Steel
9 White
10 White Mother-of-Pearl Some text stuff
Desired output:
$ perl -ple '$_ = /(Some text)/ ? $1 : ""' list_of_values | cat -n
1
2
3
4
5
6 Some text
7
8
9
10 Some text

First of all, this shows how to duplicate grep -o using Perl.
You're asking why
foo Some text bar
012345678901234567
results in just an empty string instead of
Some text
Well,
At position 0, ^ matches 0 characters.
At position 0, (.*?) matches 0 characters.
At position 0, (Some text)? matches 0 characters.
At position 0, (.*) matches 17 characters.
At position 17, $ matches 0 characters.
Match succeeds.
You could use
s{^ .*? (?: (Some[ ]text) .* | $ )}{ $1 // "" }exs;
or
s{^ .*? (?: (Some[ ]text) .* | $ )}{$1}xs; # Warns if warnings are on.
Far simpler:
$_ = /(Some text)/ ? $1 : "";
I question your use of -p. Are you sure you want a line of output for each line of input? It seems to me you'd rather have
perl -nle'print $1 if /(Some text)/'
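Since the question also asked about sed, here is a minimal sketch of the same "phrase or empty line" result using two sed expressions. It assumes a sed that accepts -E (GNU and BSD sed both do); the variable names input and result are only for illustration.

```shell
# Sample lines copied from the question's list_of_values.
input='Black
Leather and Some text after that ....
Pearl Color
White Mother-of-Pearl Some text stuff'

# 1st expression: empty every line that lacks the phrase.
# 2nd expression: on the remaining lines, keep only the captured phrase.
result=$(printf '%s\n' "$input" |
  sed -E -e '/Some text/!s/.*//' -e 's/.*(Some text).*/\1/')

printf '%s\n' "$result"
```

Splitting the job into "blank out non-matches" and "reduce matches" avoids relying on an optional group, which is what let the original pattern succeed with an empty $2.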

Related

Unable to match multiple digits in regex

I am simply trying to print 5 or 6 digit number present in each line.
cat file.txt
Random_something xyz ...64763
Random2 Some String abc-778986
Something something 676347
Random string without numbers
cat file.txt | sed 's/^.*\([0-9]\{5,6\}\+\).*$/\1/'
Current Output
64763
78986
76347
Random string without numbers
Expected Output
64763
778986
676347
The regex doesn't seem to work as intended with 6 digit numbers. It skips the first number of the 6 digit number for some reason and it prints the last line which I don't need as it doesn't contain any 5 or 6 digit number whatsoever
grep is a better fit for this, with its -o option that prints only the matched string:
grep -Eo '[0-9]{5,6}' file
64763
778986
676347
-E is for enabling extended regex mode.
If you really want a sed, this should work:
sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p' file
64763
778986
676347
Details:
-n: Suppress normal output
(^|.*[^0-9]): Match start or anything that is followed by a non-digit
([0-9]{5,6}): Match 5 or 6 digits in capture group #2
.* Match remaining text
\2: the replacement, which substitutes the matched digits (capture group #2) for the whole line
/p prints substituted text
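To see why the original pattern lost a digit, it may help to run both patterns on one of the question's lines. This is a small sketch, assuming a sed that accepts -E; the first command is a simplified ERE form of the question's pattern (without the trailing \+), and the variable names are illustrative.

```shell
line='Random2 Some String abc-778986'

# Greedy ".*" grabs as much as it can and leaves the group only its 5-digit minimum.
greedy=$(printf '%s\n' "$line" | sed -E 's/^.*([0-9]{5,6}).*$/\1/')

# Requiring a non-digit (or line start) before the group stops ".*" from eating digits.
fixed=$(printf '%s\n' "$line" | sed -En 's/(^|.*[^0-9])([0-9]{5,6}).*/\2/p')

printf 'greedy: %s\nfixed:  %s\n' "$greedy" "$fixed"
```

The first result reproduces the 78986 from the question; the second yields the full 778986.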
With awk, you could try the following. It uses awk's match function with a regex for 5 to 6 digits on each line; if a match is found, it prints the matched part.
awk 'match($0,/[0-9]{5,6}/){print substr($0,RSTART,RLENGTH)}' Input_file

Regex substitute multiple lines for a single line

I have a plain text file in which I need to substitute multiple consecutive lines of text with a single replacement line. For example, when I have a date and time, followed by a blank line, followed by a page number,
11/13/2018 08:33:00

Page 1 of 1
I'd like to replace it with a single line (e.g., PAGE BREAK).
I've tried
sed 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
and
perl -pe 's/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\n\nPage \d of \d/PAGE BREAK/g' file1.txt > file2.txt
but it leaves the text unchanged.
Both sed and Perl process the input line by line. You can tell Perl to load the whole file into memory by using -0777 (if it's not too large):
perl -0777 -pe 's=[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}\n\nPage [0-9]+ of [0-9]+=PAGE BREAK=g'
Note that I used [0-9], because \d can match ٤, ໖, ६, or 𝟡.
I also used s=== instead of s/// so I don't have to backslash the slashes in the date part.
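If loading the whole file is not desirable, a line-oriented alternative is to have sed pull the next two lines into its pattern space with N. This is only a sketch: it assumes the date line, a blank line, and the page line always appear together, and the sample input is made up from the question's description.

```shell
input='123 45 jh kljl
11/13/2018 08:33:00

Page 1 of 1
ghjgjh hkjhj'

# On a date/time line, append the next two lines (N, N) to the pattern space,
# then replace the three-line block if it has the expected shape.
result=$(printf '%s\n' "$input" | sed -E '
/^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}$/{
N
N
s/^.*\n\nPage [0-9]+ of [0-9]+$/PAGE BREAK/
}')

printf '%s\n' "$result"
```

Once N has joined the lines, \n in the pattern can match the embedded newlines, which a plain line-by-line substitution never sees.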
Another Perl variant
$ cat page_break.txt
123 45 jh kljl
11/13/2018 08:33:00

Page 1 of 1
ghjgjh hkjhj
fhfghfghfh
11/13/2018 08:33:00

Page 1 of 2
ghgigkjkj
$ perl -ne '{ if ( (/\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}/ and $x++)or ( /^\s*$/ and $x++) or (/Page \d of \d/ and $x++) ){} if($x==0) { print "$_" } if($x==3) { print "PAGE BREAK\n"; $x=0} }' page_break.txt
123 45 jh kljl
PAGE BREAK
ghjgjh hkjhj
fhfghfghfh
PAGE BREAK
ghgigkjkj
$

Delete all lines which don't match a pattern

I am looking for a way to delete all lines that do not follow a specific pattern (from a txt file).
Pattern which I need to keep the lines for:
x//x/x/x/5/x/
x could be any amount of characters, numbers or special characters.
5 is always a combination of alphanumeric - 5 characters - e.g Xf1Lh, always appears after the 5th forward slash.
/ are actual forward slashes.
Input:
abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x
Desired output:
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
grep selects lines according to a regular expression and your x//x/x/x/5/x/ just needs minor changes to make it into a regular expression:
$ grep -E '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' file
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Explanation:
"x could be any amount of characters, numbers or special characters". In a regular expression that is .* where . means any character and * means zero or more of the preceding character (which in this case is .).
"5 is always a combination of alphanumeric - 5 characters". In POSIX regular expressions, [[:alnum:]] means any alphanumeric character. {5} means five of the preceding. [[:alnum:]] is unicode-safe.
Possible improvements
One issue is how x should be interpreted. In the above, x was allowed to be any character. As triplee points out, however, another reasonable interpretation is that x should be any character except /. In that case:
grep -E '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
Also, we might want this regex to match only complete lines. In that case, we can either surround the regex with ^ and $ or we can use grep's -x option:
grep -xE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' file
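The difference between the two interpretations of x shows up on a line whose first field itself contains a slash. A small sketch, using a made-up test line (not from the question) and grep -c to count matching lines:

```shell
# Hypothetical line: the part before "//" contains a slash.
line='a/b//x/x/x/abcde/x/'

# x = any characters: ".*" may span slashes, so the whole line matches.
loose=$(printf '%s\n' "$line" | grep -cxE '.*//.*/.*/.*/[[:alnum:]]{5}/.*/' || true)

# x = any characters except "/": the first field cannot contain a slash.
strict=$(printf '%s\n' "$line" | grep -cxE '[^/]*//[^/]*/[^/]*/[^/]*/[[:alnum:]]{5}/[^/]*/' || true)

printf 'loose matches: %s\nstrict matches: %s\n' "$loose" "$strict"
```

The `|| true` is there only because grep exits non-zero when nothing matches, which would abort a script running under `set -e`.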
I was figuring out how to do it in awk at the same time as the other answer and came up with:
awk -F/ 'BEGIN{OFS=FS}$2==""&&$6~/[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]/&&NF=8'
The awk I worked it out on didn't support the {5} regexp frob.
You can use grep's -P option for Perl-compatible regular expressions, like
grep -P "^(?:[^/]*/){5}[A-Za-z0-9]{5}(?:/|$)" input
Output
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
Regex Breakdown
^ #Start of line
(?: #Non capturing group
[^/]* #Match anything except /
/ #Match / literally
){5} #Repeat this 5 times
[A-Za-z0-9]{5} #Match alphanumerics. You can use \w if you want to allow _ along with [A-Za-z0-9]
(?: #Non capturing group
/ #Next character should be /
| #OR
$ #End of line
)
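Not every grep build offers -P, so here is a sketch of the same "five slash-terminated fields, then five alphanumerics" idea in plain ERE. ERE has no (?:...), so ordinary groups are used; the input is the question's sample data.

```shell
input='abc//a/123/gds:/4AdFg/f3dsg34/
y35sdf//x/gd:df/j5je:/x/x/x
yh//x/x/x/5Fsaf/x/
45wuhrt//x/x/dsfhsdfs54uhb/
5ehys//srt/fd/ab/cde/fg/x/x'

# ([^/]*/){5} consumes exactly five fields, each ending in a slash;
# the sixth field must then be five alphanumerics followed by "/" or end of line.
result=$(printf '%s\n' "$input" | grep -E '^([^/]*/){5}[A-Za-z0-9]{5}(/|$)')

printf '%s\n' "$result"
```

Like the -P version, this counts slashes rather than requiring the literal // after the first field, so it is slightly looser than the pattern in the question.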
Using sed and in place edit to delete all lines that do not follow a specific pattern (from a txt file):
$ sed -i.bak -n "/.*\/\/.*\/.*\/.*\/[a-zA-Z0-9]\{5\}\/.*\//p" test.in
$ cat test.in
abc//a/123/gds:/4AdFg/f3dsg34/
yh//x/x/x/5Fsaf/x/
-i.bak edits the file in place and creates a test.in.bak backup file; -n is quiet mode, so non-matching lines are not printed;
and ".../p" prints the matching lines.

Need help with a regex that can handle optional substrings

I'm trying to use sed to parse version numbers from the output of git describe. The output is in the form:
vMAJOR.MINOR[-STRING]-REVISION-HASH
MAJOR, MINOR, and REVISION are integers. STRING and HASH are arbitrary strings, but I'm only interested in HASH.
Examples:
v0.1-alpha-3-g9c8c402 should return 0 1 3 g9c8c402
v0.4-beta-10-g3187e7f-dirty should return 0 4 10 g3187e7f-dirty
v1.0-0-fe35119e should return 1 0 0 fe35119e
I was originally using:
sed 's/v\([0-9]*\)\.\([0-9]*\)-.*-\([0-9]*\)-\(.*\)/\1 \2 \3 \4/g'
However, it works only when the optional substring is present.
It doesn't work now because it expects two dashes between version and revision, even if no string is present.
Edit: I'm not very familiar with sed's regex syntax; you will need \? instead of ?. I also read that \? is only a GNU extension, so I'm not sure it will help you.
v\([0-9]*\)\.\([0-9]*\)-.*-\?\([0-9]*\)-\(.*\)
If the \? doesn't work, you could try specifying it as 'zero or one times' like this:
v\([0-9]*\)\.\([0-9]*\)-.*-\{0,1\}\([0-9]*\)-\(.*\)
Try:
sed 's/v\([0-9]*\)\.\([0-9]*\)-\([^-]*-\)*\([0-9]*\)-\(.*\)/\1 \2 \4 \5/g'
with bash regular expressions:
while read version; do
if [[ $version =~ ^v([0-9]+)\.([0-9]+)(-[^-]+)?-([0-9]+)-(.+) ]]; then
echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[-2]} ${BASH_REMATCH[-1]} "
fi
done <<END
v0.1-alpha-3-g9c8c402
v0.4-beta-10-g3187e7f-dirty
v1.0-0-fe35119e
END
0 1 3 g9c8c402
0 4 10 g3187e7f-dirty
1 0 0 fe35119e
Perl's non-capturing parentheses (?:...)
are useful as well:
perl -pe 's/^v([0-9]+)\.([0-9]+)(?:-[^-]+)?-([0-9]+)-(.+)/$1 $2 $3 $4/' <<END
v0.1-alpha-3-g9c8c402
v0.4-beta-10-g3187e7f-dirty
v1.0-0-fe35119e
END
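The optional-group pattern from the bash and Perl versions can also be carried back to sed. A minimal sketch, assuming a sed that accepts -E for extended regular expressions and that STRING itself contains no dash:

```shell
versions='v0.1-alpha-3-g9c8c402
v0.4-beta-10-g3187e7f-dirty
v1.0-0-fe35119e'

# Group 3 is the optional "-STRING" part; it is matched but left out of the output.
result=$(printf '%s\n' "$versions" |
  sed -E 's/^v([0-9]+)\.([0-9]+)(-[^-]+)?-([0-9]+)-(.+)$/\1 \2 \4 \5/')

printf '%s\n' "$result"
```

Because ERE has no non-capturing groups, the optional part still takes a group number, which is why the replacement skips from \2 to \4.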

POSIX Regular Expressions Limit Repetitions

I am trying to grep for a maximum number of repetitions allowed on my input string and can't seem to get it working.
The input file has three lines with 1, 5, and 7 repetitions of "pq" respectively. The >=3 and >=5 expressions work fine, but the "between 3 and 5" expression {3,5} shows the line with seven repetitions as well.
DEV /> cat input.txt
pq -- One occurance of pq
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
DEV /> grep "\(pq\)\{3,5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq
Am I doing something wrong or is this the expected behavior?
If this is the expected behavior ( as the string with 7 PQs has between 3-5 PQs),
1) in what cases is the maximum repetitions applicable? What would be the difference between {3,5} and {3,} (greater than 3)?
2) I can anchor my regular expressions with "^", but what if my string does not end with "pq" and has more text?
If a line has seven repetitions of anything, it also therefore contains between 3 and 5 repetitions of that thing, and at several points, no less.
Use match anchors if you expect matches to be anchored. Otherwise, of course, they are not.
The practical difference between /X{3,}/ and /X{3,5}/ is how long a string each one matches: the extent (or span) of the match. If all you are looking for is a boolean yes/no response and there is nothing further in your pattern, it does not make much of a difference; in fact, a moderately clever regex engine will return early if it knows it is safe to do so.
One way to see the difference is with GNU grep’s ‑o or ‑‑only‐matching option. Watch:
$ echo 123456789 | egrep -o '[0-9]{3}'
123
456
789
$ echo 123456789 | egrep -o '[0-9]{3,}'
123456789
$ echo 123456789 | egrep -o '[0-9]{3,5}'
12345
6789
$ echo 123456789 | egrep -o '[0-9]{3,5}[2468]'
123456
$ echo 1234567890 | egrep -o '[0-9]{3,5}[13579]'
12345
6789
To understand how those last two work, it is useful to get a trace of the regex engine’s attempts, including backtracking steps. You can do this using Perl in this way:
$ perl -Mre=debug -le 'print $& while 1234567890 =~ /\d{3,5}[13579]/g'
Compiling REx "\d{3,5}[13579]"
Final program:
1: CURLY {3,5} (4)
3: DIGIT (0)
4: ANYOF[13579][] (15)
15: END (0)
stclass DIGIT minlen 4
Matching REx "\d{3,5}[13579]" against "1234567890"
Matching stclass DIGIT against "1234567" (7 chars)
0 <> <1234567890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
5 <12345> <67890> | 4: ANYOF[13579][](15)
failed...
4 <1234> <567890> | 4: ANYOF[13579][](15)
5 <12345> <67890> | 15: END(0)
Match successful!
12345
Matching REx "\d{3,5}[13579]" against "67890"
Matching stclass DIGIT against "67" (2 chars)
5 <12345> <67890> | 1:CURLY {3,5}(4)
DIGIT can match 5 times out of 5...
10 <1234567890> <> | 4: ANYOF[13579][](15)
failed...
9 <123456789> <0> | 4: ANYOF[13579][](15)
failed...
8 <12345678> <90> | 4: ANYOF[13579][](15)
9 <123456789> <0> | 15: END(0)
Match successful!
6789
Freeing REx: "\d{3,5}[13579]"
When you have additional constraints about what comes after the match, then which type of repetition you choose can make a big difference. Here I’ll impose a constraint on where each match is allowed to finish, by saying it needs to end before an odd digit:
$ perl -le 'print $& while 1234567890 =~ /\d{3}(?=[13579])/g'
234
678
$ perl -le 'print $& while 1234567890 =~ /\d{3,5}(?=[13579])/g'
1234
5678
$ perl -le 'print $& while 1234567890 =~ /\d{3,}(?=[13579])/g'
12345678
So when you have things that have to come afterwards, it can make a great deal of difference. When you are just deciding whether the entire line matches something, it may not be as important.
This is expected behavior. The string "pqpqpqpqpqpqpq" does in fact have between three and five repetitions of "pq", and then a few more for good measure. You may want to try anchoring your regular expression, something like ^\(pq\)\{3,5\}$.
Edit to match edited question:
The maximum is applicable in all situations. What is happening is that grep is matching 5 of the 7 repetitions of "pq" (most likely the first five), and since it found a match it prints out the line.
You'll have to figure out a way to change your regex to match what you want and not match what you don't. For example, to match a line starting with 3–5 repetitions of "pq", you might do something like this in extended syntax (grep -E): ^(pq){3,5}($|[^p]|p$|p[^q]). That matches 3–5 "pq"s followed immediately by end-of-line or any-character-other-than-"p" or "p"-followed-by-end-of-line or "p"-followed-by-any-character-other-than-"q".
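As a quick check of that idea, here is a sketch that runs an extended-syntax version of the pattern (grep -E, so no backslashes on the groups or braces) against the question's input.txt contents:

```shell
input='pq -- One occurance of pq
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq'

# After 3 to 5 leading "pq"s, the next characters must not start another "pq".
result=$(printf '%s\n' "$input" | grep -E '^(pq){3,5}($|[^p]|p$|p[^q])')

printf '%s\n' "$result"
```

Only the five-repetition line is printed: the seven-repetition line fails because whichever count from 3 to 5 the engine tries, the text that follows is another "pq".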