I have a file.txt looking like this:
abe
abbe
cde
45a678
ae
cababb
12345
After running the command egrep [[:digit:]] file.txt
it shows two results: "45a678" and "12345". I don't understand why it shows the first result (I thought the regex would only show lines made up of numbers).
You are searching for any digit in the line. You should anchor the pattern to the beginning (^) and the end ($) of the line and require at least one digit in between (+).
egrep '^[[:digit:]]+$' file.txt
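In a regex, [[:digit:]] matches a single digit anywhere; it does not check the whole line.
To test the whole line, use ^ for the beginning of the line and $ for the end.
As a result,
egrep '^[0-9]+$' file.txt
will match only lines made up entirely of digits. (\d is a Perl-style shorthand; with GNU grep it needs grep -P rather than egrep.)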
Your regex [[:digit:]] matches any line that contains a digit, so 45a678 matches. Use ^[[:digit:]]+$ to match only all-digit lines (with * instead of +, empty lines would match too):
$ egrep '^[[:digit:]]+$' file.txt
12345
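To see the difference side by side, here is a minimal sketch. It recreates the sample file in a temporary location (the variable names are just for illustration) and assumes a grep that supports -E:

```shell
# Recreate the sample file from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' abe abbe cde 45a678 ae cababb 12345 > "$tmp"

# Unanchored: any line containing at least one digit.
any_digit=$(grep -E '[[:digit:]]' "$tmp")

# Anchored: the whole line must consist of one or more digits.
only_digits=$(grep -E '^[[:digit:]]+$' "$tmp")

printf 'unanchored:\n%s\n' "$any_digit"
printf 'anchored:\n%s\n' "$only_digits"
rm -f "$tmp"
```

The unanchored search should list both 45a678 and 12345, while the anchored one should list only 12345.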
I need to print only the 1st match from each line.
My file contains text something like this:
cat t.txt
abcsuahrcb
abscuharcb
bsaucharcb
absuhcrcab
Here is the command I am trying:
cat t.txt | grep -oP 'a.*?c'
It gives:
abc
ahrc
absc
arc
auc
arc
absuhc
I need it to return:
abc
absc
auc
absuhc
These are the 1st possible matches from each line.
Any other alternatives like sed and awk will work, but not something that needs to be installed on Ubuntu.
Perl to the rescue:
perl -lne 'print $1 if /(a.*?c)/' t.txt
-n reads the input line by line, running the code for each;
-l removes newlines from input lines and adds them to output;
The code tries to match a.*?c, if matched, it stores the result in $1;
As there's no loop, only one match per line is attempted.
A sed variation on The fourth bird's answer:
$ sed -En 's/^[^a]*(a[^c]*c).*/\1/p' t.txt
abc
absc
auc
absuhc
Where:
-En - enable extended regex support, suppress automatic printing of pattern space
^[^a]* - from start of line match all follow-on characters that are not a
(a[^c]*c) - (1st capture group) match letter a plus all follow-on characters that are not c followed by a c
.* - match rest of line
\1/p - print contents of 1st capture group
One awk idea:
$ awk 'match($0,/a[^c]*c/) { print substr($0,RSTART,RLENGTH)}' t.txt
abc
absc
auc
absuhc
Where:
if we find a match then the match() call is non-zero (ie, 'true') so ...
print the substring defined by the RSTART/RLENGTH variables (which are auto-populated by a successful match() call)
Using grep you could write the pattern as matching from the first a to the first c using a negated character class.
Using -P for Perl-compatible regular expressions, you can make use of \K to forget what is matched so far.
Note that you don't have to use cat but you can add the filename at the end.
grep -oP '^[^a]*\Ka[^c]*c' t.txt
The pattern matches:
^ Start of string
[^a]* Optionally match any char except a
\K Forget what is matched so far
a Match literally
[^c]* Optionally match any char except c
c Match literally
Output
abc
absc
auc
absuhc
Another option with gnu-awk and the same pattern, only now using and printing the capture group 1 value:
awk 'match($0,/^[^a]*(a[^c]*c)/, a) { print a[1]}' t.txt
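For comparison, a small sketch of why grep -o alone prints too much here. It rebuilds t.txt in a temporary file (an assumption for the sake of a self-contained example) and uses only POSIX awk features:

```shell
# Recreate t.txt from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' abcsuahrcb abscuharcb bsaucharcb absuhcrcab > "$tmp"

# grep -o prints every non-overlapping match found on a line ...
all_matches=$(grep -o 'a[^c]*c' "$tmp")

# ... while awk's match() reports only the leftmost match per line.
first_matches=$(awk 'match($0, /a[^c]*c/) { print substr($0, RSTART, RLENGTH) }' "$tmp")

printf 'grep -o:\n%s\n' "$all_matches"
printf 'awk match():\n%s\n' "$first_matches"
rm -f "$tmp"
```

The grep -o output should reproduce the seven lines from the question, and the awk output the four wanted ones.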
I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not a preview (doesn't have text after the version number), so the result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
What could I be missing here?
To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted then use grep | sort -rV | head as:
grep -E '^v[0-9]+(\.[0-9]+){2}$' file | sort -rV | head -1
When you use ^ or $ inside [...] they are not anchors: $ is a literal character, and ^ is literal unless it comes first, where it negates the set.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
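A quick sketch of the unsorted case, assuming GNU sort (the -V version sort is not in POSIX). The input list here is hypothetical and only there to show that 1.0.10 sorts above 1.0.3:

```shell
# Hypothetical unsorted input; only plain v<num>.<num>.<num> lines pass the filter.
latest=$(printf '%s\n' v1.0.2 v1.0.3-preview2 v1.0.10 v1.0.3 |
  grep -E '^v[0-9]+(\.[0-9]+){2}$' |
  sort -rV |
  head -n 1)

printf '%s\n' "$latest"
```

A plain sort -r would compare the strings character by character and put v1.0.3 ahead of v1.0.10, which is why the version sort is used.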
To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] in your pattern, and that you need to escape the dot to match it literally.
With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of the awk code: match() tests each line against the version regex; on the first line that matches, awk prints that line and exits.
Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the line.
Keep in mind that $ can have special meaning inside double-quoted strings in the shell, so use single quotes or escape it.
The anchors need to be outside of any bracket expression ([]).
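To make the bracket problem concrete, here is a sketch that runs the original pattern and an anchored one against the sample list (rebuilt in a temporary file; -m1 is assumed to be available, as in GNU and BSD grep):

```shell
# Recreate the version list from the question.
tmp=$(mktemp)
printf '%s\n' v1.0.3-preview2 v1.0.3-sometext v1.0.3 v1.0.2 v1.0.1 > "$tmp"

# Bracketed: the whole thing is one character set, so any line
# containing e.g. "v" or "." matches, and -m1 stops at the first line.
bracketed=$(grep -m1 '[v\d+\.\d+.\d+$]' "$tmp")

# Anchored ERE: the entire line must be v<num>.<num>.<num>.
anchored=$(grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' "$tmp")

printf 'bracketed: %s\nanchored:  %s\n' "$bracketed" "$anchored"
rm -f "$tmp"
```

The bracketed form should print v1.0.3-preview2 and the anchored form v1.0.3.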
I am trying to do multiple grep pattern to find a number within a grepped string.
I have a text file like this:
This is the first sample line 1
this is the second sample line
another line
total lines: 3 tot
I am trying to find a way to get just the number of total lines. So the output here should be "3"
Here are the things I've tried:
grep "total lines: [0-9]" myfile.txt
grep "total lines" myfile.txt | grep "[0-9]"
You could use sed:
sed -En 's/^total lines: ([0-9]+).*/\1/p' myfile.txt
-E extended regular expressions
-n suppress automatic printing
Match ^total lines: ([0-9]+).* (capture the number)
\1 replace the whole line with the captured number
p print the result
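Put together, a minimal sketch of the sed approach against the sample file (recreated in a temporary file; sed -E is assumed, which GNU and BSD sed both accept):

```shell
# Recreate myfile.txt from the question.
tmp=$(mktemp)
printf '%s\n' \
  'This is the first sample line 1' \
  'this is the second sample line' \
  'another line' \
  'total lines: 3 tot' > "$tmp"

# grep alone prints the whole matching line ...
whole_line=$(grep 'total lines: [0-9]' "$tmp")

# ... while sed replaces the line with the captured number before printing.
total=$(sed -En 's/^total lines: ([0-9]+).*/\1/p' "$tmp")

printf 'grep: %s\nsed:  %s\n' "$whole_line" "$total"
rm -f "$tmp"
```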
1st solution: Using GNU grep. The -o option prints only the matched text and -P enables PCRE. The regex matches ^total lines: at the start of the line, \K discards what has been matched so far (so it is left out of the output), [0-9]+ matches one or more digits, and the positive lookahead (?=\s+tot) makes sure they are followed by whitespace and tot.
grep -oP '^total lines: \K[0-9]+(?=\s+tot)' Input_file
2nd solution: With your shown samples, try the following awk. It finds lines containing total lines: and prints the second-to-last field of each.
awk '/total lines: /{print $(NF-1)}' Input_file
3rd solution: Using awk's match function here. Matching total lines: [0-9]+ tot and then substituting everything apart from digits with null in matched values.
awk 'match($0,/total lines: [0-9]+ tot/){val=substr($0,RSTART,RLENGTH);gsub(/[^0-9]+/,"",val);print val}' Input_file
Do you have to use grep?
$ wc -l < myfile.txt
If you mean that the file has a line in it formatted as
total lines: 3 tot
Then refer to https://unix.stackexchange.com/questions/13466/can-grep-output-only-specified-groupings-that-match and use something like:
grep -Po 'total lines: \K\d+' myfile.txt
Notes:
Perl regex is not my forte, so the \K\d+ part might not work.
This may be doable without -P, but I cannot test from this Windows computer.
regex101.com helped me test the above line, so it may work.
The problem with relying on the pattern of the last line and using grep/sed to find it is that if any other line in the file contains the same pattern, you need additional logic to filter it out.
For example, consider the input file below.
line001
total lines: 883 tot
This is the first sample line 1
this is the second sample line
another line
total lines: 883 tot
Assuming your file format is constant (i.e. the second-to-last line is blank and the last line contains the total count), instead of using any pattern-matching command you can count the rows directly with the awk command below.
awk 'END { print NR - 2 }' myfile.txt
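If you would rather keep the pattern match, one possible form of that additional logic is to remember only the last matching line. This is a sketch under the assumption that the real summary is always the final match; the file is rebuilt from the example above:

```shell
# Recreate the example input in which the pattern appears twice.
tmp=$(mktemp)
printf '%s\n' \
  'line001' \
  'total lines: 883 tot' \
  'This is the first sample line 1' \
  'this is the second sample line' \
  'another line' \
  'total lines: 883 tot' > "$tmp"

# A plain pattern search finds the line twice.
hits=$(grep -c '^total lines:' "$tmp")

# Overwrite n on every match; END prints the value from the last match.
last_total=$(awk '/^total lines:/ { n = $3 } END { print n }' "$tmp")

printf 'matching lines: %s, last total: %s\n' "$hits" "$last_total"
rm -f "$tmp"
```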
You can use the following awk to get the third field on a line that starts with total lines: and stop processing the file further:
awk '/^total lines:/{print $3; exit}' file
You can use the following GNU grep:
# Extract a non-whitespace chunk after a certain pattern
grep -oP '^total lines:\s*\K\S+' file
# Extract a number after a pattern
grep -oP '^total lines:\s*\K\d+(?:\.\d+)?' file
Details:
^ - start of string
total lines: - a literal string
\s* - any zero or more whitespace chars
\K - match reset operator discarding all text matched so far
\S+ - one or more non-whitespace chars
\d+(?:\.\d+)? - one or more digits and then an optional sequence of . and one or more digits.
How can I use regular expression to find a line that has at least two times the same word?
I tried:
egrep '\w{2,}\1' file
But the terminal gives me the error:
egrep: invalid backreference number
There are several issues with your current regex.
Use a capturing group for capturing words and backreference to it.
Add \b word boundaries for limiting words to left and right side.
Add .* for matching any amount of any characters in between.
echo "ABC foo ABC bar" | egrep '\b(\w{2,})\b.*\b\1\b'
ABC foo ABC bar
echo "ABC foo ABCD bar" | egrep '\b(\w{2,})\b.*\b\1\b'
(no output, because nothing matches)
If desired, use egrep -o (--only-matching) to extract the relevant part.
You can further use the lazy .*? with grep -P (--perl-regexp) to match as few characters as possible in between.
Try this instead:
egrep '(\w{2,}).*\1' file
If you don't have a capturing group ((...)), then there's nothing to backreference.
Here's an example:
$ cat file
this line has the same word twice word
this line does not
this is this and that is that
$ egrep '(\w{2,}).*\1' file
this line has the same word twice word
this is this and that is that
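The two patterns in this thread behave differently when one word is a prefix of another. A small sketch, assuming GNU grep (\w, \b and back-references in -E are extensions, not POSIX ERE):

```shell
# Whole-word version: the repeated text must be a complete word both times.
bounded='\b(\w{2,})\b.*\b\1\b'
# Loose version: any repeated run of 2+ word characters counts.
loose='(\w{2,}).*\1'

hit=$(printf '%s\n' 'ABC foo ABC bar' | grep -E "$bounded")
# grep exits non-zero when nothing matches, so guard the assignment.
miss=$(printf '%s\n' 'ABC foo ABCD bar' | grep -E "$bounded" || true)
loose_hit=$(printf '%s\n' 'ABC foo ABCD bar' | grep -E "$loose")

printf 'bounded, repeated word: [%s]\n' "$hit"
printf 'bounded, prefix only:   [%s]\n' "$miss"
printf 'loose, prefix only:     [%s]\n' "$loose_hit"
```

Without the word boundaries, ABC inside ABCD is enough to count as a repeat, so pick the version that fits what "same word" means for your data.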
How can I use grep to match 3 numbers in a file? My file looks like this:
123
122
222
333443
fdsfs5454353
dsfsfjsk4654641
Note that some of the lines contain trailing spaces. I want to only match three digit numbers. I tried:
grep -E [0-9]{3} test.txt
grep -E '\<[0-9]{3}\>' test.txt
grep '^[0-9][0-9]*' test.txt | awk '{if(length($0) == 3) print $0}'
or if you have whitespace:
sed 's/[[:space:]]*$//' test.txt | grep '^[0-9][0-9]*' | awk '{if(length($0) == 3) print $0}'
(thanks @shellter)
Use Extended Regular Expressions with Bounds
I asked if you meant numbers with exactly three digits, or each three-digit match in a string. You replied that you wanted only lines that contained exactly three digits.
Extended grep provides an easy solution for this. Consider the following:
$ egrep '^[0-9]{3}\b' /tmp/corpus
123
122
222
This uses a bound (also known as a range) to look for exactly three digits at the start of each line, followed by a word boundary. The word boundary will match trailing space or the end of line, ensuring that you get the proper match in either case.
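Here is a sketch of that check against the sample data. Which line carries the trailing space is an assumption made for the example, and \b assumes GNU grep:

```shell
# Recreate the sample; the second line has a trailing space (assumed).
tmp=$(mktemp)
printf '%s\n' '123' '122 ' '222' '333443' 'fdsfs5454353' 'dsfsfjsk4654641' > "$tmp"

# ^ pins the match to the start of the line, {3} is the bound, and \b
# rejects lines such as 333443 where a fourth digit follows.
# -o prints only the matched digits, which also drops the trailing space.
three_digit=$(grep -oE '^[0-9]{3}\b' "$tmp")

printf '%s\n' "$three_digit"
rm -f "$tmp"
```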