How to chain multiple grep patterns to find a value in a grepped string - regex

I am trying to chain multiple grep patterns to find a number within a grepped string.
I have a text file like this:
This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot
I am trying to find a way to get just the number of total lines. So the output here should be "3"
Here are the things I've tried:
grep "total lines: [0-9]" myfile.txt
grep "total lines" myfile.txt | grep "[0-9]"

You could use sed:
sed -En 's/^total lines: ([0-9]+).*/\1/p' myfile.txt
-E extended regular expressions
-n suppress automatic printing
Match ^total lines: ([0-9]+).* (capture the number)
\1 replace the whole line with the captured number
p print the result
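As a quick check, the sketch below recreates the sample file from the question in a temporary file and runs the same sed command against it. The temporary file and the variable names sample and total are assumptions for illustration only.

```shell
# Recreate the question's sample file in a temporary location (illustrative path).
sample=$(mktemp)
printf '%s\n' \
  'This is the first sample line 1' \
  'this is the second sample line' \
  'another line' \
  'total lines: 3 tot' > "$sample"

# Print only the captured number from the "total lines:" line.
total=$(sed -En 's/^total lines: ([0-9]+).*/\1/p' "$sample")
echo "$total"

rm -f "$sample"
```

Lines that do not start with total lines: produce no output because -n suppresses them and the p flag only fires after a successful substitution.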

1st solution: Using GNU grep, try the following. The -o option prints only the matched text, and -P enables PCRE. The regex matches ^total lines: at the start of a line, then \K discards what has been matched so far (so it is left out of the output), then [0-9]+ matches one or more digits, and the positive lookahead (?=\s+tot) ensures the digits are followed by whitespace and tot.
grep -oP '^total lines: \K[0-9]+(?=\s+tot)' Input_file
2nd solution: With your shown samples, this can be done in a single awk command: find the line containing total lines: and print its second-to-last field.
awk '/total lines: /{print $(NF-1)}' Input_file
3rd solution: Using awk's match function. Match total lines: [0-9]+ tot, then remove everything except the digits from the matched text.
awk 'match($0,/total lines: [0-9]+ tot/){val=substr($0,RSTART,RLENGTH);gsub(/[^0-9]+/,"",val);print val}' Input_file

Do you have to use grep?
$ wc -l < myfile.txt
If you mean that the file has a line in it formatted as
total lines: 3 tot
Then refer to https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match and use something like:
grep -Po 'total lines: \K\d+' myfile.txt
Notes:
Perl regex is not my forte, so the \K\d+ part might not work.
This may be doable without -P, but I cannot test from this Windows computer.
regex101.com helped me test the above line, so it may work.
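One way this might be done without -P is to chain two plain grep -o calls, which is close to what the question originally attempted. This is only a sketch: the temporary file and variable names are illustrative, and it assumes a grep that supports -o (GNU and BSD grep both do).

```shell
# Recreate the question's sample file in a temporary location (illustrative path).
sample=$(mktemp)
printf '%s\n' \
  'This is the first sample line 1' \
  'this is the second sample line' \
  'another line' \
  'total lines: 3 tot' > "$sample"

# The first grep keeps only "total lines: 3";
# the second keeps only the digits at the end of that fragment.
total=$(grep -o 'total lines: [0-9][0-9]*' "$sample" | grep -oE '[0-9]+$')
echo "$total"

rm -f "$sample"
```

The first stage narrows the text to the labelled fragment, so the trailing 1 on the first sample line is never seen by the second stage.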

The problem with relying on the pattern of the last line and using grep/sed to find it is that if any other line in the file contains the same pattern, you need additional logic to filter it out.
For example, consider the input file below.
line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line
total lines: 883 tot
Assuming your file format is constant (i.e. the second-to-last line is blank and the last line contains the total count), you can skip pattern matching and count the rows directly with the awk command below.
awk 'END { print NR - 2 }' myfile.txt
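If you would rather keep the pattern match, one possible form of that additional logic is to restrict the substitution to the last line with sed's $ address. The sketch below uses the example input above; the temporary file and the variable names are assumptions for illustration.

```shell
# Example input with a stray "total lines:" line near the top (illustrative temp file).
sample=$(mktemp)
printf '%s\n' \
  'line001' \
  'total lines: 883 tot' \
  'This is the first sample line 1' \
  'this is the second sample line' \
  'another line' \
  'total lines: 883 tot' > "$sample"

# A plain pattern match sees both lines that look like the summary.
matches=$(grep -c '^total lines: ' "$sample")

# The "$" address applies the substitution to the last line only.
total=$(sed -En '$s/^total lines: ([0-9]+).*/\1/p' "$sample")
echo "$matches $total"

rm -f "$sample"
```

This still assumes the summary is always the very last line of the file; a trailing blank line would make the command print nothing.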

You can use the following awk to get the third field on a line that starts with total lines: and stop processing the file further:
awk '/^total lines:/{print $3; exit}' file
You can use the following GNU grep:
# Extract a non-whitespace chunk after a certain pattern
grep -oP '^total lines:\s*\K\S+' file
# Extract a number after a pattern
grep -oP '^total lines:\s*\K\d+(?:\.\d+)?' file
Details:
^ - start of string
total lines: - a literal string
\s* - any zero or more whitespace chars
\K - match reset operator discarding all text matched so far
\S+ - one or more non-whitespace chars
\d+(?:\.\d+)? - one or more digits and then an optional sequence of . and one or more digits.

Related

Lazy Grep -P: How to show only to the 1st match from the lines

I need to print only the 1st match from each line.
My file contains text something like this:
cat t.txt
abcsuahrcb
abscuharcb
bsaucharcb
absuhcrcab
Here is the command I am trying:
cat t.txt | grep -oP 'a.*?c'
It gives:
abc
ahrc
absc
arc
auc
arc
absuhc
I need it to return:
abc
absc
auc
absuhc
These are the 1st possible matches from each line.
Any other alternatives like sed and awk will work, but not something which needs to be installed on Ubuntu.
Perl to the rescue:
perl -lne 'print $1 if /(a.*?c)/' t.txt
-n reads the input line by line, running the code for each;
-l removes newlines from input lines and adds them to output;
The code tries to match a.*?c, if matched, it stores the result in $1;
As there's no loop, only one match per line is attempted.
A sed variation on The fourth bird's answer:
$ sed -En 's/^[^a]*(a[^c]*c).*/\1/p' t.txt
abc
absc
auc
absuhc
Where:
-En - enable extended regex support, suppress automatic printing of pattern space
^[^a]* - from start of line match all follow-on characters that are not a
(a[^c]*c) - (1st capture group) match letter a plus all follow-on characters that are not c followed by a c
.* - match rest of line
\1/p - print contents of 1st capture group
One awk idea:
$ awk 'match($0,/a[^c]*c/) { print substr($0,RSTART,RLENGTH)}' t.txt
abc
absc
auc
absuhc
Where:
if we find a match then the match() call is non-zero (ie, 'true') so ...
print the substring defined by the RSTART/RLENGTH variables (which are auto-populated by a successful match() call)
Using grep you could write the pattern as matching from the first a to the first c using a negated character class.
Using -P for Perl-compatible regular expressions, you can make use of \K to forget what is matched so far.
Note that you don't have to use cat but you can add the filename at the end.
grep -oP '^[^a]*\Ka[^c]*c' t.txt
The pattern matches:
^ Start of string
[^a]* Optionally match any char except a
\K Forget what is matched so far
a Match literally
[^c]* Optionally match any char except c
c Match literally
Output
abc
absc
auc
absuhc
Another option with gnu-awk and the same pattern, only now using and printing the capture group 1 value:
awk 'match($0,/^[^a]*(a[^c]*c)/, a) { print a[1]}' t.txt

Print the line matching 'pattern' string, excluding the 'pattern'

I have the following lines in a text file 'file.txt'
String1 ABCDEFGHIJKL
String2 DCEGIJKLQMAB
I want to print the characters corresponding to 'String1' in another text file 'text.txt' like this
ABCDEFGHIJKL
Here, I don't want to use any line numbers. Any suggestions using the 'sed' command? I tried selecting the range between 'string1' and 'string2', but couldn't come up with a command that excludes 'string1'. The following code only excludes 'string2'.
sed -n '/^string1/,/^string2/{p;/^string2/q}' file.txt | sed '$d' > text.txt
awk '$1=="String1" { print $2 }' file.txt > text.txt
Where the first space delimited field equals "String1", print the second field. Redirect the output to text.txt.
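Since the question asks for sed, a possible equivalent is sketched below. It strips the String1 label and the whitespace after it, and prints only lines where that substitution succeeded. The temporary file and the variable names are illustrative. Note that the match is case-sensitive, so ^string1 as written in the attempted command would not match String1.

```shell
# Recreate file.txt from the question in a temporary location (illustrative path).
sample=$(mktemp)
printf '%s\n' \
  'String1 ABCDEFGHIJKL' \
  'String2 DCEGIJKLQMAB' > "$sample"

# Remove the leading "String1" plus at least one whitespace character,
# and print the line only if the substitution happened.
result=$(sed -n 's/^String1[[:space:]][[:space:]]*//p' "$sample")
echo "$result"

rm -f "$sample"
```

Requiring at least one whitespace character after the label keeps a hypothetical String10 line from matching.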
Use GNU grep:
grep -Po 'String1\s+\K.*' in_file
Here, grep uses the following options:
-P : Use Perl regexes.
-o : Print the matches only (1 match per line), not the entire lines.
\K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, ignore the preceding part of the regex when printing the match.
SEE ALSO:
grep manual
perlre - Perl regular expressions

Using grep -P and lookahead/lookbehind to get text between patterns

Assume the following is in file.txt:
---------
foo bar
more foo bar
---------
when I execute grep -P '(?<=-$)(?s:.)*(?=^-)' file.txt, I expect only the middle two lines to be matched, but this expression matches nothing. What's wrong?
I also tried grep -P '(?s)(?<=-$).*(?=^-)' file.txt but same result.
Your pattern does not work because
The -P option alone only makes grep match using the PCRE regex engine.
Since you have no other options, grep outputs whole matched lines. You need to add the -o option to output only the matched text(s) and -z to slurp the file into a single text.
Your regex has ^ and $ anchors that match the start/end of the string, not of lines, by default. You need the m flag together with the s flag (which makes . match any char including line break chars).
So, you may use your regex with m and -oz:
grep -Poz '(?ms)(?<=-$).*(?=^-)' file.txt
Or,
grep -Poz '(?s)-\R\K.*(?=\R-)' file.txt
where \R matches any line break sequence and \K omits the text matched so far from the overall match memory buffer.
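If PCRE support is not available, a different technique (not the lookaround approach above) is to toggle a flag in awk on each delimiter line. This is a minimal sketch that assumes the delimiters are lines made up entirely of dashes; the temporary file and the variable name between are illustrative.

```shell
# Recreate file.txt from the question in a temporary location (illustrative path).
sample=$(mktemp)
printf '%s\n' \
  '---------' \
  'foo bar' \
  'more foo bar' \
  '---------' > "$sample"

# Flip the flag on every all-dash line and skip that line;
# print any other line only while the flag is set.
between=$(awk '/^-+$/ { inside = !inside; next } inside' "$sample")
echo "$between"

rm -f "$sample"
```

Unlike grep -z, this prints ordinary newline-terminated lines, which is usually easier to pipe into other tools.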

Parsing only first regex match in a line with several matches

Is it possible to have a regex that parses only a1bcdea1 from this line a1bcdea1ABCa1DEFa1 ?
This grep command does not work:
$ cat txtfile
a1bcdea1ABCa1DEFa1
$ grep -oE "[A-Z,a-z]1.*?[A-Z,a-z]1" txtfile
a1bcdea1ABCa1DEFa1
I want the output of grep to be only a1bcdea1.
EDIT:
It is obvious that I can just use grep -o "a1bcdea1" for the above line, but consider if one has several thousands of lines and the goal is to match FIRST [A-Z,a-z]1.*?[A-Z,a-z]1 for each single line.
How about using a ^ start anchor and restricting character set used:
grep -o '^[A-Za-z]1[A-Za-z]*1'
If you expect more digits or other characters in between, go with this
grep -oP '^[A-Za-z]1.*?[A-Za-z]1'
Lazy matching requires Perl-compatible mode. If the match does not begin at the start of the line, use this instead:
grep -oP '^.*?\K[A-Za-z]1.*?[A-Za-z]1'
\K resets beginning of the reported match and is a PCRE feature as well.
Here is a gnu awk solution using split function:
awk '(n = split($0, a, /[a-zA-Z]1/, b)) > 1 {print b[1] a[2] b[2]}' file
a1bcdea1
This awk command splits each line on regex /[a-zA-Z]1/ and stores split tokens in array a and delimiters in array b.
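The four-argument split is a GNU awk extension. A rough portable alternative, assuming only POSIX awk's match(), RSTART and RLENGTH, is to find the first delimiter, then search the remainder of the line for the nearest following one. The variable names here are illustrative.

```shell
first=$(printf '%s\n' 'a1bcdea1ABCa1DEFa1' | awk '
  match($0, /[A-Za-z]1/) {
    start = RSTART                    # where the first delimiter begins
    len = RLENGTH                     # length of the first delimiter
    rest = substr($0, start + len)    # text after the first delimiter
    if (match(rest, /[A-Za-z]1/))     # nearest following delimiter
      print substr($0, start, len + RSTART - 1 + RLENGTH)
  }')
echo "$first"
```

Lines with fewer than two delimiters print nothing, which mirrors the > 1 guard in the gawk version.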

How to find only the lines that contain two consecutive vowels

How can I find lines that contain consecutive vowels? This is what I tried:
$ (filename) | sed '/[a*e*i*o*u]/!d'
To find lines that contain consecutive vowels you should consider using
sed -n '/[aeiou]\{2,\}/p' file
Here, [aeiou]\{2,\} pattern matches 2 or more occurrences (\{2,\} is an interval quantifier with the minimum occurrence number set to 2) and [aeiou] is a bracket expression matching any char defined in it.
The -n suppresses output, and the p command prints specific lines only (that is, -n with p only outputs the lines that match your pattern).
Or, you may get the same functionality with grep:
grep '[aeiou]\{2,\}' file
grep -E '[aeiou]{2,}' file
Here is an online demo:
s="My boomerang
Text here
Koala there"
sed -n '/[aeiou]\{2,\}/p' <<< "$s"
Output:
My boomerang
Koala there
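To see why the attempted pattern does not work, the sketch below runs it against the same three lines. Inside a bracket expression the * is a literal character, so [a*e*i*o*u] matches any single vowel or asterisk. The block also shows an equivalent spelling without an interval quantifier, simply repeating the bracket expression. The temporary file and the variable names are assumptions for illustration.

```shell
# Same three demo lines, written to a temporary file (illustrative path).
sample=$(mktemp)
printf '%s\n' 'My boomerang' 'Text here' 'Koala there' > "$sample"

# The attempted expression keeps every line that has at least one vowel (or "*").
kept=$(sed '/[a*e*i*o*u]/!d' "$sample" | wc -l | tr -d ' ')

# Two bracket expressions in a row require two adjacent vowels.
vowels=$(grep '[aeiou][aeiou]' "$sample")

echo "$kept"
echo "$vowels"

rm -f "$sample"
```

Both forms are case-sensitive; adding -i to grep would also catch uppercase vowels.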