Get only one instance of a regex instead of multiple

I have a text file that contains:
libpackage-example1.so.3.2.1,
libpackage-example2.so.3.2.1,
libpackage-example3.so.3.2.1,
libpackage-example4.so.3.2.1
I only want to get one instance of "3.2.1", but when I run the command below:
grep -Po '(?<=.so.)\d.\d.\d'
The result is
3.2.1
3.2.1
3.2.1
3.2.1
instead of just one "3.2.1". I think making it a lazy regex would work, but I do not know how to do that.

The regex is applied to each line. No matter how you change the regex, if the whole file contains multiple matching lines then all of them will be printed.
However, you can limit the number of matched lines using the -m option. With -o -m 1, grep prints every match from the first matching line and then exits. If there can be multiple matches in one line, use grep ... | head -n1 instead.
Also, keep in mind that . means any character. To specify a literal dot use \. or [.].
Perl regexes also support \K which makes writing easier. Only the part after the last \K will be printed.
grep -Pom1 '\.so\.\K\d\.\d\.\d'
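A minimal sketch of the difference, assuming GNU grep built with PCRE support (-P). The temporary sample file is hypothetical and simply reproduces the four lines from the question:

```shell
# Hypothetical sample file reproducing the four lines from the question.
sample=$(mktemp)
printf '%s\n' \
  'libpackage-example1.so.3.2.1,' \
  'libpackage-example2.so.3.2.1,' \
  'libpackage-example3.so.3.2.1,' \
  'libpackage-example4.so.3.2.1' > "$sample"

# Without -m: one match per matching line, so four lines of output.
all_matches=$(grep -Po '\.so\.\K\d\.\d\.\d' "$sample")

# With -m1: grep stops reading after the first matching line.
first_match=$(grep -Pom1 '\.so\.\K\d\.\d\.\d' "$sample")

rm -f "$sample"
printf 'without -m:\n%s\nwith -m1: %s\n' "$all_matches" "$first_match"
```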

The grep command has the -m N option that will make grep stop after the first N matches.
In general, the way to only get the first line of output in unix is to send the output to the head command. To get just the first line of output, do:
grep -Po '(?<=.so.)\d.\d.\d' | head -n 1
That "1" can be any number.
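To illustrate why head can still be needed alongside -m, here is a small sketch. It assumes a grep that supports -o and -m (GNU grep does), and the input line with two versions on it is made up for the example:

```shell
# Hypothetical input: two version strings on a single line.
line='libfoo.so.3.2.1 libbar.so.4.5.6'

# -m1 limits the number of matching LINES, so both matches on that line are printed.
per_line=$(printf '%s\n' "$line" | grep -oE -m1 '[0-9]+\.[0-9]+\.[0-9]+')

# head -n 1 keeps only the first line of grep's output.
first_only=$(printf '%s\n' "$line" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)

printf 'with -m1:\n%s\nwith head -n 1: %s\n' "$per_line" "$first_only"
```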

Use
awk -F'[.]so[.]' '/^libpackage-/{sub(/,$/,"", $NF);print $NF; exit}'
Split with .so. separator, find the line beginning with libpackage-, remove a comma from the end of the last field, print it and stop processing.
Another way:
grep -m1 -Po '(?<=\.so\.)\d+\.\d+\.\d+'
-m1 gets the first instance. I updated the expression: literal periods should be escaped, and \d+ will match one or more digits.

Related

How to get the release value?

I've a file with the below name formats:
rzp-QAQ_SA2-5.12.0.38-quality.zip
rzp-TEST-5.12.0.38-quality.zip
rzp-ASQ_TFC-5.12.0.38-quality.zip
I want the value as: 5.12.0.38-quality.zip from the above file names.
I'm trying as below, but not getting the correct value though:
echo "$fl_name" | sed 's#^[-[:alpha:]_[:digit:]]*##'
fl_name is the variable containing the file name.
Thanks a lot in advance!

You are matching too much with all the alpha, digit - and _ in the same character class.
You can match alpha and - and optionally _ and alphanumerics
sed -E 's#^[-[:alpha:]]+(_[[:alnum:]]*-)?##' file
Or you can shorten the first character class, and match a - at the end:
sed -E 's#^[-[:alnum:]_]*-##' file
Output of both examples
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
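A short sketch of why the original attempt strips too much, using one of the sample names. The variable names are only for illustration:

```shell
name='rzp-QAQ_SA2-5.12.0.38-quality.zip'

# Original attempt: the class also contains digits, so it eats the leading "5"
# of the version and only stops at the first ".".
too_greedy=$(printf '%s\n' "$name" | sed 's#^[-[:alpha:]_[:digit:]]*##')

# Requiring a literal "-" after the class makes the match back off to the last
# hyphen that the class could reach.
anchored=$(printf '%s\n' "$name" | sed -E 's#^[-[:alnum:]_]*-##')

printf 'original: %s\nfixed:    %s\n' "$too_greedy" "$anchored"
```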

With GNU grep you could try following code. Written and tested with shown samples.
grep -oP '(.*?-){2}\K.*' Input_file
Or, as an alternative, use a non-capturing group (as per The fourth bird's nice suggestion):
grep -oP '(?:[^-]*-){2}\K.*' Input_file
Explanation: this uses GNU grep. The -o option prints only the matched part of each line, and -P enables the PCRE flavor. The regex (.*?-){2} lazily matches up to a - two times, which consumes everything through the second -. \K then discards what has been matched so far, so only the text matched by the following .* (the rest of the line) is printed.

It is much easier to use cut here:
cut -d- -f3- file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip
If you want sed then use:
sed -E 's/^([^-]*-){2}//' file
5.12.0.38-quality.zip
5.12.0.38-quality.zip
5.12.0.38-quality.zip

Assumptions:
all filenames contain 3 hyphens (-)
the desired result always consists of stripping off the 1st two hyphen-delimited strings
OP wants to perform this operation on a variable
We can eliminate the overhead of sub-process calls (eg, grep, cut and sed) by using parameter substitution:
$ f1_name='rzp-ASQ_TFC-5.12.0.38-quality.zip'
$ new_f1_name="${f1_name#*-}" # strip off first hyphen-delimited string
$ echo "${new_f1_name}"
ASQ_TFC-5.12.0.38-quality.zip
$ new_f1_name="${new_f1_name#*-}" # strip off next hyphen-delimited string
$ echo "${new_f1_name}"
5.12.0.38-quality.zip
On the other hand if OP is feeding a list of file names to a looping construct, and the original file names are not needed, it may be easier to perform a bulk operation on the list of file names before processing by the loop, eg:
while read -r new_f1_name
do
... process "${new_f1_name}"
done < <( command-that-generates-list-of-file-names | cut -d- -f3-)

In plain bash:
echo "${fl_name#*-*-}"
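The same expansion can be wrapped in a small function and applied to each name. This is only a sketch: strip_two_fields is a hypothetical helper name, and it assumes every name has at least two hyphens (a name with fewer is returned unchanged because the pattern does not match).

```shell
# Hypothetical helper: remove the shortest prefix matching "*-*-",
# i.e. the first two hyphen-delimited fields.
strip_two_fields() {
  printf '%s\n' "${1#*-*-}"
}

for fl_name in \
  'rzp-QAQ_SA2-5.12.0.38-quality.zip' \
  'rzp-TEST-5.12.0.38-quality.zip' \
  'rzp-ASQ_TFC-5.12.0.38-quality.zip'
do
  strip_two_fields "$fl_name"
done
```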

You can do a reverse of each line, and get the two last elements separated by "-" and then reverse again:
echo "$fl_name" | rev | cut -f1,2 -d'-' | rev

A Perl solution capturing digits and characters trailing a '-'
cat f_name | perl -lne 'chomp; /.*?-(\d+.*?)\z/g;print $1'

How can I get "grep -zoP" to display every match separately?

I have a file on this form:
X/this is the first match/blabla
X-this is
the second match-
and here we have some fluff.
And I want to extract everything that appears after "X" and between the same markers. So if I have "X+match+", I want to get "match", because it appears after "X" and between the marker "+".
So for the given sample file I would like to have this output:
this is the first match
and then
this is
the second match
I managed to get all the content between X followed by a marker by using:
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
That is:
grep -Po '(?<=X(.))(.|\n)+(?=\1)' to match X followed by (something) that gets captured and matched at the end with (?=\1) (I based the code on my answer here).
Note I use (.|\n) to match anything, including a new line, and that I also use -z in grep to match new lines as well.
So this works well, the only problem comes from the display of the output:
$ grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file
this is the first matchthis is
the second match
As you can see, all the matches appear together, with "this is the first match" being followed by "this is the second match" with no separator at all. I know this comes from the usage of "-z", that treats all the file as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline (quoting "man grep").
So: is there a way to get all these results separately?
I tried also in GNU Awk:
awk 'match($0, /X(.)(\n|.*)\1/, a) {print a[1]}' file
but not even the (\n|.*) worked.

awk doesn't support backreferences within regexp definition.
Workarounds:
$ grep -zPo '(?s)(?<=X(.)).+(?=\1)' ip.txt | tr '\0' '\n'
this is the first match
this is
the second match
# with ripgrep, which supports multiline matching
$ rg -NoUP '(?s)(?<=X(.)).+(?=\1)' ip.txt
this is the first match
this is
the second match
You can also use (?s)X(.)\K.+(?=\1) instead of (?s)(?<=X(.)).+(?=\1). Also, you might want to use a non-greedy quantifier (.+?) here to avoid matching match+xyz+foobaz for an input of X+match+xyz+foobaz+.
With perl
$ perl -0777 -nE 'say $& while(/X(.)\K.+(?=\1)/sg)' ip.txt
this is the first match
this is
the second match
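The greedy versus non-greedy point can be checked on a single line. This sketch assumes GNU grep with -P, and the input is the X+match+xyz+foobaz+ string mentioned in this answer:

```shell
input='X+match+xyz+foobaz+'

# Greedy .+ runs to the LAST occurrence of the marker.
greedy=$(printf '%s\n' "$input" | grep -oP 'X(.)\K.+(?=\1)')

# Lazy .+? stops at the FIRST occurrence of the marker.
lazy=$(printf '%s\n' "$input" | grep -oP 'X(.)\K.+?(?=\1)')

printf 'greedy: %s\nlazy:   %s\n' "$greedy" "$lazy"
```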

Here is another gnu-awk solution making use of RS and RT:
awk -v RS='X.' 'ch != "" && n=index($0, ch) {
print substr($0, 1, n-1)
}
RT {
ch = substr(RT, 2, 1)
}' file
this is the first match
this is
the second match

With GNU awk for multi-char RS, RT, and gensub() and without having to read the whole file into memory:
$ awk -v RS='X.' 'NR>1{print "<" gensub(end".*","",1) ">"} {end=substr(RT,2,1)}' file
<this is the first match>
<this is
the second match>
Obviously I added the "<" and ">" so you could see where each output record starts/ends.
The above assumes that the character after X isn't a non-repetition regexp metachar (e.g. ., ^, [, etc.) so YMMV

The use case is kind of problematic, because as soon as you print the matches, you lose the information about where exactly the separator was. But if that's acceptable, try piping to xargs -r0.
grep -zPo '(?<=X(.))(.|\n)+(?=\1)' file | xargs -r0
These options are GNU extensions, but then so is grep -z and (mostly) grep -P, so perhaps that's acceptable.

GNU grep -z terminates input/output records with null characters (useful in conjunction with other tools such as sort -z). pcregrep will not do that:
pcregrep -Mo2 '(?s)X(.)(.+?)\1' file
-onumber used instead of lookarounds. ? lazy quantifier added (in case \1 occurs later).

How to match the last occurrence of a pattern on a single line string

I am using this command line to get a particular line from an html file which contains various other tags, links etc.:
cat index.html | grep -m1 -oE '<a href="(.*?)" rel="sample"[\S\s]*.*</dd>'
It outputs the line which I want:
<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>
But I want to capture only something/two (the path of the last URL) considering that:
the URLs are not known beforehand (it's a script processing multiple html files)
the line can sometimes contain only 1 URL, e.g.
<a href="http://example.com/something/one/" rel="sample" >Foo</a></dd>
in which case I would want to get only something/one as in this case it is the last one.
How can I do that?

Just add
| grep -o 'href="[^"]*' | tail -n1
The first part only extracts the hrefs, the second part keeps only the last line.
If you want to extract only the path, you can use cut with delimiter set to / and extract everything starting from the fourth column:
| grep -o 'href="[^"]*' | tail -n1 | cut -f4- -d/
because
href="http://example.com/something/two/
1 23 4 5
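As a quick check of the pipeline on the sample line from the question (a sketch only): note that cut keeps the trailing / of the URL, so the last step below strips it with a parameter expansion, which is an addition for illustration and not part of the pipeline above.

```shell
line='<a href="http://example.com/something/one/" rel="sample" >Foo</a> <a href="http://example.com/something/two/" rel="sample" >Bar</a></dd>'

# Extract every href, keep the last one, then drop the scheme and host
# (fields 1-3 when splitting on "/").
last_path=$(printf '%s\n' "$line" | grep -o 'href="[^"]*' | tail -n1 | cut -f4- -d/)

# The URL ends in "/", so the result does too; remove one trailing slash if present.
trimmed=${last_path%/}

printf 'raw:     %s\ntrimmed: %s\n' "$last_path" "$trimmed"
```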

If you can use perl, then capturing within a regex makes this a lot easier.
perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'
The regex is basically the same as would also work with grep. I've used m() instead of // to avoid escaping the / inside the regex.
The initial .* will greedily capture everything at the beginning of the line. If you have multiple links on a line, it will capture all but the last. This works with grep too, but it causes grep -o to output the beginning of the line, since this now matches the regex.
This doesn't matter with the capturing parenthesis, as only the part inside the (.*?) is captured and printed.
It would be used the same way as grep.
cat index.html | perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";'
or
perl -ne 'm(.*<a href="[^:]+://[^/]*/(.*?)" rel="sample".*</dd>) and print "$1\n";' index.html

On Linux, GNU grep's -P option enables a concise solution:
$ grep -oP '.*<a href="http://.+?/\K[^"]+(?=/"\s*rel="sample".*</dd>$)' index.html
something/two
-o only outputs the matching part(s) of each line that matches.
-P activates support for PCREs (Perl-compatible Regular Expressions), which support advanced regex constructs such as non-greedy matching (*?), dropping everything matched so far (\K), and look-ahead assertions ((?=...)).
The combination of \K and (?=...) allows constraining the matching part of the regex to the subexpression of interest.
Note that no grep implementation supports capture groups, but the above, thanks to the features enabled by -P, is an emulation of extracting a single capture-group value.
As for what you tried:
-m1 limits the number of matching lines to 1, but with -o also present, multiple matches on that 1 line are still all printed.
Additionally, while you can use (...) for precedence, that doesn't constitute a capture group in grep, because there's no support for extracting capture-group values in grep.
Even with -E for extended regex support, advanced constructs such as non-greedy matching (.*?) are not supported.

Use grep to match a pattern in a line only once

I have this:
echo 12345 | grep -o '[[:digit:]]\{1,4\}'
Which gives this:
1234
5
I understand what's happening. How do I stop grep from trying to continue matching after 1 successful match?
How do I get only
1234

Do you want grep to stop matching, or do you only care about the first match? You could use head if the latter is true...
`grep stuff | head -n 1`
grep is a line-based utility, so the -m 1 flag tells it to stop after the first matching line; combined with head, that is pretty good in practice.

You need to do the grouping: \(...\) followed by the exact number of occurrence: \{<n>\} to do the job:
maci:~ san$ echo 12345 | grep -o '\([[:digit:]]\)\{4\}'
1234
Hope it helps. Cheers!!

Use sed instead of grep:
echo 12345 | sed -n '/^\([0-9]\{1,4\}\).*/s//\1/p'
This matches up to 4 digits at the beginning of the line, followed by anything, keeps just the digits, and prints them. The -n prevents lines from being printed otherwise. If the digit string might also appear mid-line, then you need a slightly more complex command.
In fact, ideally you'll use a sed with PCRE regular expressions since you really need a non-greedy match. However, up to a reasonable approximation, you can use: (A semi-solution to a considerably more complex problem...now removed!)
Since you want the first string of up to 4 digits on the line, simply use sed to remove any non-digits and then print the digit string:
echo abc12345 | sed -n '/^[^0-9]*\([0-9]\{1,4\}\).*/s//\1/p'
This matches a string of non-digits followed by 1-4 digits followed by anything, keeps just the digits, and prints them.
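Wrapped in a function, the sed command above can be tried on a few inputs. This is a sketch; first_digits is a hypothetical name and the test strings are made up:

```shell
# Hypothetical helper: print the first run of 1-4 digits on each input line.
# Lines without any digit produce no output because of -n.
first_digits() {
  sed -n '/^[^0-9]*\([0-9]\{1,4\}\).*/s//\1/p'
}

printf '%s\n' 'abc12345' '12345' 'x7y' 'no digits here' | first_digits
```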

If – as in your example – your numeric expression will appear at the beginning of the string you're starting with, you could just add a start-of-line anchor ^:
echo 12345 | grep -o '^\([[:digit:]]\)\{1,4\}'
Depending on which exact digits you want, an end-of-line anchor $ might help also.

grep manpage says on this topic (see chapter 'regular expressions'):
(…)
{n,}
The preceding item is matched n or more times.
{n,m}
The preceding item is matched at least n times, but not more than m times.
(…)
So the answer should be:
echo 12345 | grep -o '[[:digit:]]\{4\}'
I just tested it on cygwin terminal (2018) and it worked!
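One caveat worth checking: \{4\} means exactly four digits, which is not the same as "at most four, first match only". The sketch below compares the two approaches; both function names are hypothetical and the inputs are made up.

```shell
# Hypothetical helper: exact repetition count, as in this answer.
exactly_4() {
  printf '%s\n' "$1" | grep -o '[[:digit:]]\{4\}'
}

# Hypothetical helper: original \{1,4\} pattern, keeping only the first match.
first_up_to_4() {
  printf '%s\n' "$1" | grep -o '[[:digit:]]\{1,4\}' | head -n 1
}

exactly_4 12345                         # one match of four digits
exactly_4 123 || echo '(no match)'      # fewer than four digits: grep finds nothing
exactly_4 123456789                     # two separate four-digit matches
first_up_to_4 123                       # short input still yields a result
first_up_to_4 123456789                 # only the first match is kept
```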

Best way to complete this Perl regex one-liner

I'm trying to use a Perl one-liner to munge some output from grepping svn diff, so I can automatically test the files. We have a run_test.sh script that can take multiple PHP files prepended with 'Test' as its arguments.
So far I have the following which successfully prepends 'Test' to the file names
[gjempty#gjempty-rhel4 classes]$ svn diff | grep '(revision' | perl -wpl -e 's/(.*)\/(.*)$/$1\/Test$2/'
--- commerce/TestLCart.php (revision 104387)
--- commerce/manufacturing/TestLRoutingData.php (revision 104387)
Now I'd just like to grab the file/path to pass it to our run_test.sh. I can finish it off with awk as below, but am trying to improve my Perl/one-liner skills. So how do I revise the perl one-liner to additionally extract only the file path?
svn diff | grep '(revision' | perl -wpl -e 's/(.*)\/(.*)$/$1\/Test$2/' | awk '{print $2}' | xargs run_test.sh

You're just wanting the file names, so svn st is what you want. Instead of getting large quantities of noise which could potentially contain (revision in it, and the main lines you want, you'll get it like this: M commerce/LCart.php. Then you can just chop off \S* (any number of non-whitespace characters) followed by \s* (any number of whitespace characters), and take what's left. You could do the \S*\s* differently, but that's the simplest way to get all cases.
svn st | perl -wpl -e 's|\S*\s*(.*)/(.*)$|$1/Test$2|'
(Switched it after posting from using s/// to s||| so the / doesn't need to be escaped; good idea, Axeman.)

You can get rid of the grep and the awk fairly easily.
svn diff | perl -wnl -e '/\(revision/ or next; m|(\S+)/(\S+)|; print "$1/Test$2";'
I changed the -p to -n. -p means while (<>) { <your code>; print $_; }, and -n is the same but without the print, since the new version has an explicit print instead.
Rather than an s/// substitution, I used an m// pattern match. I changed the delimiter to | to avoid backslashing the slash (a cause of Leaning Toothpick Syndrome). You can use almost any punctuation character you want.
\S is similar to . but matches only non-whitespace characters. Your .*s in the pattern were actually matching the entire chunks of the line before and after the slash, but the new pattern only matches the pathname of the file. Since the + is "greedy", the first one ($1) will get more string when there are multiple slashes in the pathname, the same as with your substitution pattern.

Better version:
No default print ( -n)
Extract substring first
Subst on that
print value
perl -wnl -e '($_)=m{---\s+(\S+)} and s|/([^/]+)$|/Test$1| and print "$_\n";'
You don't need awk now. And adding '(revision to the expression,
perl -wnl -e '($_)=m{---\s+(\S+)\s+\(revision} and s|/([^/]+)$|/Test$1| and print "$_\n";'
you don't need grep either.
But I have several subversion tools I created, and if all you want are the changed files 'svn st' is better.
svn st | perl -wnle 'm/^[CM]\s+(\S+)/ and $r = rindex($1, "/") + 1 and print substr($1, 0, $r), "Test", substr($1, $r)'
This time I chose a rindex + substr method. Now, there's no regex backtracking.
