Matching strings with grep and \A regexp

Given the string in some file:
hel string1
hell string2
hello string3
I'd like to capture just hel using cat file | grep 'regexp here'
I tried doing a bunch of regexp but none seem to work. What makes the most sense is: grep -E '\Ahel' but that doesn't seem to work. It works on http://rubular.com/ however. Any ideas why that isn't working with grep?
Also, when pasting the above string with a tab space before each line, the \A does not seem to work on rubular. I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?

ERE (-E) does not support \A for indicating start of match. Try ^ instead.
Use -m 1 to stop grepping after the first match in each file.
If you want grep to print only the matched string (not the entire line), use -o.
Use -h if you want to suppress the printing of filenames in the grep output.
Example:
grep -Eohm 1 "^hel" *.log
If you need to enforce only outputting if the search string is on the first line of the file, you could use head:
head -qn 1 *.log | grep -Eoh "^hel"
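The flag combination above can be tried on throwaway data. This is only a minimal sketch, assuming GNU grep and head; the temporary directory and the a.log / b.log file names are made up for illustration.

```shell
# Hypothetical sample files (names and contents are assumptions for this demo).
tmpdir=$(mktemp -d)
printf 'hel string1\nhell string2\nhello string3\n' > "$tmpdir/a.log"
printf 'other line\nhel again\n' > "$tmpdir/b.log"

# -E: ERE, -o: print only the matched text, -h: no filename prefix,
# -m 1: stop after the first matching line in each file.
first_matches=$(grep -Eohm 1 '^hel' "$tmpdir"/*.log)

# Only look at the first line of each file (-q suppresses head's file headers).
first_line_matches=$(head -qn 1 "$tmpdir"/*.log | grep -Eo '^hel')

printf '%s\n' "$first_matches"
printf '%s\n' "$first_line_matches"
rm -rf "$tmpdir"
```

In this sample, b.log contributes a match to the first command but not to the second, because its first line does not start with hel.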

ERE doesn't support \A but PCRE does hence grep -P can be used with same regex (if available):
grep -P '\Ahel\b' file
hel string1
It is also important to use the word boundary \b so that hell and hello are not matched.
Alternatively, in ERE (with GNU grep, where \b is available as an extension) you can use:
grep -E '^hel\b' file
hel string1
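To see what the word boundary changes, here is a small sketch on the sample lines from the question. It assumes GNU grep for \b; the last pattern is one possible spelling that avoids \b and is offered as an illustration, not as the answer's method.

```shell
sample=$(printf 'hel string1\nhell string2\nhello string3\n')

# Without a boundary, ^hel matches all three lines.
no_boundary=$(printf '%s\n' "$sample" | grep -c '^hel')

# GNU extension: \b requires a word boundary right after "hel".
with_boundary=$(printf '%s\n' "$sample" | grep -E '^hel\b')

# Alternative without \b: "hel" followed by a non-word character or end of line.
portable=$(printf '%s\n' "$sample" | grep -E '^hel([^[:alnum:]_]|$)')

printf '%s\n' "$no_boundary" "$with_boundary" "$portable"
```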

I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?
\A matches only the very beginning of the whole text. It does not match the start of each line when the text contains several lines, so a tab or space in front of the first line means hel is no longer at the position \A refers to.
Anyway, grep's default BRE and ERE syntaxes don't support \A, so you need to use ^, which, contrary to \A, matches the start of each line in multi-line mode.
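The difference between the two anchors can be made visible by letting grep see the whole input as a single string. The sketch below assumes GNU grep built with PCRE support (-P) and uses -z so that the input is read as one record; the sample text is made up.

```shell
text=$(printf 'hel string1\nhel string2\n')

# -z: treat the whole input as one record; -o: print each match (NUL-terminated).
# \A can only match at the very start of that single record.
start_of_text=$(printf '%s\n' "$text" | grep -Pzo '\Ahel' | tr '\0' '\n' | grep -c 'hel')

# (?m)^ matches at the start of every line inside the record.
start_of_line=$(printf '%s\n' "$text" | grep -Pzo '(?m)^hel' | tr '\0' '\n' | grep -c 'hel')

echo "matches with \\A: $start_of_text, matches with ^: $start_of_line"
```

Without -z, grep hands each line to the regex engine separately, so \A and ^ behave the same there.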

Using awk
awk '$1=="hel"' file
PS you do not need to cat file to grep, use grep 'regexp here' file

Related

Back-reference when preprend using sed linux command and i sed command

I'm trying to prepend the first character of "monkey" using this command:
echo monkey | sed -E '/(.)onkey/i \1'
But when I use it like this, the output shows
1
monkey
I actually hope to see:
m
monkey
But back-reference doesn't work. Please someone tell me if it is possible to use Back-reference with \1. Thanks in advance.

You may use this sed:
echo 'monkey' | sed -E 's/(.)onkey/\1\n&/'
m
monkey
Here:
\1: is back-reference for group #1
\n: inserts a line break
&: is back-reference for full match
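As a quick check of the substitution above on more than one line, here is a sketch assuming GNU sed (the \n in the replacement is a GNU feature); the extra input lines are made up for illustration.

```shell
# Lines matching (.)onkey get their first character printed on a line above them;
# lines that do not match pass through unchanged.
result=$(printf 'monkey\ndonkey\nzebra\n' | sed -E 's/(.)onkey/\1\n&/')
printf '%s\n' "$result"
```

The back-reference works here because it is used inside the replacement part of s///; the text given to the i command is inserted literally.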

With any version of awk you can try the following solution, written and tested with the shown samples. It searches for the regex ^.onkey, then uses the sub function to replace the first character with itself, a newline, and itself again, and prints the result.
echo monkey | awk '/^.onkey/{sub(/^./,"&\n&")} 1'

This might work for you (GNU sed):
sed -E '/monkey/i m' file
Insert the line containing m only above a line containing monkey.
Perhaps a more generic solution would be to insert the first character of a word above that word:
sed -E 'h;s/\B.*//;G' file
Make copy of the word.
Remove all but the first character of the word.
Append the original word delimited by a newline.
Print the result.
N.B. \B starts a match between characters of a word. \b represents the start or end of a word (as does \< and \> separately).
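The hold-space steps listed above can be tried as follows. This is a sketch with made-up input; the first command assumes GNU sed for \B, and the second is an alternative spelling using a plain BRE capture group that avoids \B.

```shell
words=$(printf 'monkey\nzebra\n')

# h: copy the line to the hold space; s: keep only the first character;
# G: append a newline plus the held copy to the pattern space.
gnu_version=$(printf '%s\n' "$words" | sed -E 'h;s/\B.*//;G')

# Same idea without \B: capture the first character and drop the rest.
bre_version=$(printf '%s\n' "$words" | sed 'h;s/^\(.\).*/\1/;G')

printf '%s\n' "$gnu_version"
```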

Regex to match exact version phrase

I have versions like:
v1.0.3-preview2
v1.0.3-sometext
v1.0.3
v1.0.2
v1.0.1
I am trying to get the latest version that is not preview (doesn't have text after version number) , so result should be:
v1.0.3
I used this grep: grep -m1 "[v\d+\.\d+.\d+$]"
but it still outputs: v1.0.3-preview2
what I could be missing here?

To return first match for pattern v<num>.<num>.<num>, use:
grep -m1 -E '^v[0-9]+(\.[0-9]+){2}$' file
v1.0.3
If your input file is unsorted, then use grep | sort -V | head as:
grep -E '^v[0-9]+(\.[0-9]+){2}$' file | sort -rV | head -1
When you use ^ or $ inside [...], they are treated as literal characters, not as anchors.
RegEx Details:
^: Start
v: Match v
[0-9]+: Match 1+ digits
(\.[0-9]+){2}: Match a dot followed by 1+ digits. Repeat this group 2 times
$: End
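Putting the pattern and the version sort together, a minimal sketch might look like this. It assumes GNU sort for -V; the version list is made up (including v1.0.10, added to show why a version-aware sort matters).

```shell
versions=$(printf '%s\n' v1.0.2 v1.0.3-preview2 v1.0.10 v1.0.3 v1.0.3-sometext v1.0.1)

# Keep only plain vN.N.N lines, sort newest first, take the top one.
latest=$(printf '%s\n' "$versions" | grep -E '^v[0-9]+(\.[0-9]+){2}$' | sort -rV | head -n 1)

# Inside [...] the caret is an ordinary character: only the line containing
# a literal "^" (or a "v") is counted here.
caret_is_literal=$(printf '%s\n' 'a^b' 'ab' | grep -c '[v^]')

printf '%s\n' "$latest" "$caret_is_literal"
```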

To match the digits with grep, you can use
grep -m1 "v[[:digit:]]\+\.[[:digit:]]\+\.[[:digit:]]\+$" file
Note that you don't need the [ and ] around your pattern, and that the dot has to be escaped to match it literally.

With awk you could try following awk code.
awk 'match($0,/^v[0-9]+(\.[0-9]+){2}$/){print;exit}' Input_file
Explanation of the awk code: the match function tests each line against the version regex; on the first line that matches, the line is printed and the program exits.

Regular expressions match substrings, not whole strings. You need to explicitly match the start (^) and end ($) of the pattern.
Keep in mind that $ can trigger expansion inside double-quoted strings in shell scripts, so either escape it there or, more simply, put the pattern in single quotes.
The anchors need to be outside of any bracket expression ([...]).

How to use grep/sed/awk, to remove a pattern from beginning of a text file

I have a text file with the following pattern written to it:
TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"
I would like to discard the first part of each line containing
TIME[32.468ms] -(3)-.............
To test the regular expression I've tried the following:
cat myfile.txt | egrep "^TIME\[.*\]\s\s\-\(3\)\-\.+"
This identifies correctly the lines I want. Now, to delete the pattern I've tried:
cat myfile.txt | sed s/"^TIME\[.*\]\s\s\-\(3\)\-\.+"//
but it just seems to be doing the cat, since it shows the content of the complete file and no substitution happens.
What am I doing wrong?
OS: CentOS 7

With your shown samples, please try following grep command. Written and tested with GNU grep.
grep -oP '^TIME\[\d+\.\d+ms\]\s+-\(\d+\)-\.+\K.*' Input_file
Explanation: Adding detailed explanation for above code.
^TIME\[ ##Matching string TIME from starting of value here.
\d+\.\d+ms\] ##Matching digits(1 or more occurrences) followed by dot digits(1 or more occurrences) followed by ms ] here.
\s+-\(\d+\)-\.+ ##Matching spaces (1 or more occurrences) followed by - digits (1 or more occurrences) - and 1 or more dots.
\K ##Using the PCRE \K escape (available with grep -P) to require the preceding part to match but drop it from the printed match, so only the part that follows is printed.
.* ##to match till end of the value.
2nd solution: Adding awk program here.
awk 'match($0,/^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+/){print substr($0,RSTART+RLENGTH)}' Input_file
Explanation: using match function of awk, to match regex ^TIME\[[0-9]+\.[0-9]+ms\][[:space:]]+-\([0-9]+\)-\.+ which will catch text which we actually want to remove from lines. Then printing rest of the text apart from matched one which is actually required by OP.

This awk using its sub() function:
awk 'sub(/^TIME[[][^]]*].*\.+/,"")' file
"TEXT I WANT TO KEEP"
If there is replacement, sub() returns true.

$ cut -d'"' -f2 file
TEXT I WANT TO KEEP

You may use:
s='TIME[32.468ms] -(3)-............."TEXT I WANT TO KEEP"'
sed -E 's/^TIME\[[^]]*].*\.+//' <<< "$s"
"TEXT I WANT TO KEEP"

The \s regex extension may not be supported by your sed.
In BRE syntax (which is what sed speaks out of the box) you do not backslash round parentheses - doing that turns them into regex metacharacters which do not match themselves, somewhat unintuitively. Also, + is just a regular character in BRE, not a repetition operator (though you can turn it into one by similarly backslashing it: \+).
You can try adding an -E option to switch from BRE syntax to the perhaps more familiar ERE syntax, but that still won't enable Perl regex extensions, which are not part of ERE syntax, either.
sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//' myfile.txt
should work on any reasonably POSIX sed. (Notice also how the minus character does not need to be backslash-escaped, though doing so is harmless per se. Furthermore, I tightened up the regex for the square brackets, to prevent the "match anything" regex you had .* from "escaping" past the closing square bracket. In some more detail, [^][] is a negated character class which matches any character which isn't (a newline or) ] or [; they have to be specified exactly in this order to avoid ambiguity in the character class definition. Finally, notice also how the entire sed script should normally be quoted in single quotes, unless you have specific reasons to use different quoting.)
If you have sed -E or sed -r you can use + instead of * but then this complicates the overall regex, so I won't suggest that here.
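The BRE/ERE difference described above can be checked side by side. This is a sketch assuming GNU sed (needed for \s in the first command) and a sample line with two spaces after the closing bracket, as the question's regex implies.

```shell
line='TIME[32.468ms]  -(3)-............."TEXT I WANT TO KEEP"'

# The question's regex in default BRE mode: \(3\) is a group and + is literal,
# so nothing matches and the line comes back unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/^TIME\[.*\]\s\s\-\(3\)\-\.+//')

# BRE spelling from the answer: plain parentheses are literal.
bre=$(printf '%s\n' "$line" | sed 's/^TIME\[[^][]*\][[:space:]][[:space:]]-(3)-\.*//')

# ERE spelling: parentheses must be escaped to be literal, + is a repetition operator.
ere=$(printf '%s\n' "$line" | sed -E 's/^TIME\[[^][]*\][[:space:]]+-\(3\)-\.+//')

printf '%s\n' "$unchanged" "$bre" "$ere"
```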

A simpler one for sed:
sed 's/^[^"]*//' myfile.txt

If the "text you want to keep" is always surrounded by quotes like this, and the lines starting with "TIME..." contain no other quotes, then:
sed -n '/^TIME/p' file | awk -F'"' '{print $2}'
should select the lines starting with "TIME..." and print the text within the quotes.

Thanks all, for your help.
By the end, I've found a way to make it work:
echo 'TIME[32.468ms] -(3)-.............TEXT I WANT TO KEEP' | grep TIME | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
More generally,
grep TIME myfile.txt | sed -r 's/^TIME\[[0-9]+\.[0-9]+ms\]\s\s-\(3\)-\.+//'
Cheers,
Pedro

Using grep -P and lookahead/lookbehind to get text between patterns

Assume the following is in file.txt:
---------
foo bar
more foo bar
---------
when I execute grep -P '(?<=-$)(?s:.)*(?=^-)' file.txt, I expect only the middle two lines to be matched, but this expression matches nothing. What's wrong?
I also tried grep -P '(?s)(?<=-$).*(?=^-)' file.txt but same result.

Your pattern does not work because:
The P option alone only makes grep match using the PCRE regex engine.
Since you have no other options, grep outputs whole matched lines; you need to add the o option to output the matched text(s) and z to slurp the file into a single text.
Your regex has ^ and $ anchors that match start/end of the string, not lines, by default. You need an m flag together with the s flag (the s flag makes . match any char including line break chars).
So, you may use your regex with m and -oz:
grep -Poz '(?ms)(?<=-$).*(?=^-)' file.txt
Or,
grep -Poz '(?s)-\R\K.*(?=\R-)' file.txt
where \R matches any line break sequence and \K omits the text matched so far from the overall match.
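The \R / \K variant can be tried on the question's sample without creating a file. This is a sketch assuming GNU grep built with PCRE support; with -z the match is printed with a trailing NUL byte, which is stripped here with tr.

```shell
# Sample input from the question, fed through a pipe instead of file.txt.
between=$(printf -- '---------\nfoo bar\nmore foo bar\n---------\n' \
  | grep -Poz '(?s)-\R\K.*(?=\R-)' \
  | tr -d '\0')

printf '%s\n' "$between"
```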

grep regex with backtick matches all lines

$ cat file
anna
amma
kklks
ksklaii
$ grep '\`' file
anna
amma
kklks
ksklaii
Why? How is that match working ?

This appears to be a GNU extension for regular expressions. The backtick ('\`') anchor matches the very start of a subject string, which explains why it is matching all lines. OS X apparently doesn't implement the GNU extensions, which would explain why your example doesn't match any lines there. See http://www.regular-expressions.info/gnu.html
If you want to match an actual backtick when the GNU extensions are in effect, this works for me:
grep '[`]' file

twm's answer provides the crucial pointer, but note that it is the sequence \`, not ` by itself that acts as the start-of-input anchor in GNU regexes.
Thus, to match a literal backtick in a regex specified as a single-quoted shell string, you don't need any escaping at all, neither with GNU grep nor with BSD/macOS grep:
$ { echo 'ab'; echo 'c`d'; } | grep '`'
c`d
When using double-quoted shell strings - which you should avoid for regexes, for reasons that will become obvious - things get more complicated, because you then must escape the ` for the shell's sake in order to pass it through as a literal to grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\`"
c`d
Note that, after the shell has parsed the "..." string, grep still only sees `.
To recreate the original command with a double-quoted string with GNU grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\\\`" # !! BOTH \ and ` need \-escaping
ab
c`d
Again, after the shell's string parsing, grep sees just \`, which to GNU grep is the start-of-the-input anchor, so all input lines match.
Also note that since grep processes input line by line, \` has the same effect as ^ the start-of-a-line anchor; with multi-line input, however - such as if you used grep -z to read all lines at once - \` only matches the very start of the whole string.
To BSD/macOS grep, \` simply escapes a literal `, so it only matches input lines that contain that character.
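The three spellings discussed above can be compared in one go. A small sketch with made-up input; the last count depends on the grep implementation, as explained in the answers.

```shell
input=$(printf 'ab\nc`d\n')

# A bare backtick in a single-quoted pattern matches a literal backtick.
plain=$(printf '%s\n' "$input" | grep '`')

# A bracket expression is another way to force a literal backtick.
bracketed=$(printf '%s\n' "$input" | grep '[`]')

# Backslash-backtick: a start-of-input anchor for GNU grep (every line matches),
# but an escaped literal backtick for BSD/macOS grep.
anchor_count=$(printf '%s\n' "$input" | grep -c '\`')

printf '%s\n' "$plain" "$bracketed" "$anchor_count"
```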
