Why does grep match all lines for the pattern "\'" - regex

In this SO question there is something that I cannot explain:
grep "\'" input_file
matches all lines in the given file. Does \' have a special meaning for grep?

grep regex GNU extension: ‘\'’ matches the end of the whole input

I did not know this feature of regular expressions, but it is listed at regular-expressions.info as the end-of-string anchor.
It does not exist in all regex implementations, only in GNU Basic and Extended Regular Expressions; see this compatibility chart for more info.

That is really strange behaviour of grep; I don't know how to explain it, but I must note that \' doesn't match any character. It looks like it has the same meaning as $:
$ echo x | grep "x\'"
x
$ echo xy | grep "x\'"
$ echo x | grep "\'x"
Update 1
As stated at http://www.gnu.org/software/findutils/manual/html_node/find_html/grep-regular-expression-syntax.html (thanks to Richard Sitze for the link), it really does have the same meaning as $. But in the meantime I've noticed a difference between \' and $:
$ echo x | grep 'x$'
x
$ echo x | grep 'x$$'
$ echo x | grep "x\'"
x
$ echo x | grep "x\'\'"
x
$ echo x | grep "x\'\'\'"
x
You can specify \' as many times as you wish, but that is not so for $: there must be only one $. The likely reason is that in grep's default BRE syntax $ is an anchor only at the end of the pattern; elsewhere it is an ordinary character, so x$$ looks for a literal $ at the end of the line, whereas \' is a zero-width anchor wherever it appears, so repeating it changes nothing.
Another important remark. The manual says:
‘\'’ matches the end of the whole input
But strictly speaking that's not true, because \' matches not only the end of the whole input but also the end of every single line:
$ (echo x; echo y) | grep "\'"
x
y
Exactly as $ does.
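The manual's wording does hold if grep is made to treat the whole input as a single record, e.g. with GNU grep's -z option (a quick sketch; the trailing newline is deliberately omitted, because \' anchors at the absolute end of the buffer):
$ printf 'x\ny' | grep -z "x\'"
$ printf 'x\ny' | grep -zq "y\'" && echo matched
matched
The first command prints nothing because x is not at the very end of the input; the second matches because y is.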

\ is an escape character. This means the ' should be considered as text to search for, and not as a control character.

Related

Extract string between underscores and dot

I have strings like these:
/my/directory/file1_AAA_123_k.txt
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
So basically, the number of underscores is not fixed. I would like to extract the string between the first underscore and the dot. So the output should be something like this:
AAA_123_k
CCC
KK_45
I found this solution that works:
string='/my/directory/file1_AAA_123_k.txt'
tmp="${string%.*}"
echo $tmp | sed 's/^[^_:]*[_:]//'
But I am wondering if there is a more 'elegant' solution (e.g. 1 line code).
With bash version >= 3.0 and a regex:
[[ "$string" =~ _(.+)\. ]] && echo "${BASH_REMATCH[1]}"
You can use a single sed command like
sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
sed -nE 's~^.*/[^_/]*_([^/]*)\.[^./]*$~\1~p' <<< "$string"
See the online demo. Details:
^ - start of string
.* - any text
/ - a / char
[^_/]* - zero or more chars other than / and _
_ - a _ char
\([^/]*\) (POSIX BRE) / ([^/]*) (POSIX ERE, enabled with E option) - Group 1: any zero or more chars other than /
\. - a dot
[^./]* - zero or more chars other than . and /
$ - end of string.
With -n, default line output is suppressed and p only prints the result of successful substitution.
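For instance, with the first sample path:
$ string='/my/directory/file1_AAA_123_k.txt'
$ sed -n 's~^.*/[^_/]*_\([^/]*\)\.[^./]*$~\1~p' <<< "$string"
AAA_123_k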
With your shown samples, with GNU grep you could try the following code.
grep -oP '.*?_\K([^.]*)' Input_file
Explanation: The GNU grep options -o and -P are used here to print only the exact match and to enable PCRE regex, respectively. The regex .*?_\K([^.]*) gets the value between the 1st _ and the first occurrence of .. Explanation of the regex:
.*?_ ##Matches from the start of the line till the first occurrence of _, using the lazy match .*?
\K ##\K discards everything matched so far, so that only the needed part is printed.
([^.]*) ##Matches everything till the first occurrence of a dot, as needed.
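For example, piping one of the sample lines through the command:
$ echo '/my/directory/file1_AAA_123_k.txt' | grep -oP '.*?_\K([^.]*)'
AAA_123_k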
A simpler sed solution without any capturing group:
sed -E 's/^[^_]*_|\.[^.]*$//g' file
AAA_123_k
CCC
KK_45
If you need to process the file names one at a time (eg, within a while read loop) you can perform two parameter expansions, eg:
$ string='/my/directory/file1_AAA_123_k.txt.2'
$ tmp="${string#*_}"
$ tmp="${tmp%%.*}"
$ echo "${tmp}"
AAA_123_k
One idea to parse a list of file names at the same time:
$ cat file.list
/my/directory/file1_AAA_123_k.txt.2
/my/directory/file2_CCC.txt
/my/directory/file2_KK_45.txt
$ sed -En 's/[^_]*_([^.]+).*/\1/p' file.list
AAA_123_k
CCC
KK_45
Using sed
$ sed 's/[^_]*_//;s/\..*//' input_file
AAA_123_k
CCC
KK_45
This is easy, except that it includes the initial underscore:
ls | grep -o "_[^.]*"
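One way to drop that leading underscore is to strip the first character of each match afterwards (a sketch; cut -c2- simply keeps everything from the second character on):
$ ls | grep -o "_[^.]*" | cut -c2-
With the sample file names this should print AAA_123_k, CCC and KK_45.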

grep regex with backtick matches all lines

$ cat file
anna
amma
kklks
ksklaii
$ grep '\`' file
anna
amma
kklks
ksklaii
Why? How is that match working?
This appears to be a GNU extension for regular expressions. The backtick ('\`') anchor matches the very start of a subject string, which explains why it is matching all lines. OS X apparently doesn't implement the GNU extensions, which would explain why your example doesn't match any lines there. See http://www.regular-expressions.info/gnu.html
If you want to match an actual backtick when the GNU extensions are in effect, this works for me:
grep '[`]' file
twm's answer provides the crucial pointer, but note that it is the sequence \`, not ` by itself, that acts as the start-of-input anchor in GNU regexes.
Thus, to match a literal backtick in a regex specified as a single-quoted shell string, you don't need any escaping at all, neither with GNU grep nor with BSD/macOS grep:
$ { echo 'ab'; echo 'c`d'; } | grep '`'
c`d
When using double-quoted shell strings - which you should avoid for regexes, for reasons that will become obvious - things get more complicated, because you then must escape the ` for the shell's sake in order to pass it through as a literal to grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\`"
c`d
Note that, after the shell has parsed the "..." string, grep still only sees `.
To recreate the original command with a double-quoted string with GNU grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\\\`" # !! BOTH \ and ` need \-escaping
ab
c`d
Again, after the shell's string parsing, grep sees just \`, which to GNU grep is the start-of-the-input anchor, so all input lines match.
Also note that since grep processes input line by line, \` has the same effect as ^, the start-of-a-line anchor; with multi-line input, however - such as if you used grep -z to read all lines at once - \` only matches the very start of the whole string.
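For instance, a quick sketch with GNU grep (-z reads the whole input as a single NUL-separated record; -q is used in the last command just to test for a match):
$ printf 'a\nb\n' | grep '\`b'
b
$ printf 'a\nb\n' | grep -z '\`b'
$ printf 'a\nb\n' | grep -zq '\`a' && echo 'matches at the very start'
matches at the very start
Line by line, b starts a line, so \`b matches; with -z, b is no longer at the very start of the (single) record, so the second command prints nothing.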
To BSD/macOS grep, \` simply escapes a literal `, so it only matches input lines that contain that character.

Regex behaviour with angle brackets

Please explain to me why the following expression doesn't output anything:
echo "<firstname.lastname#domain.com>" | egrep "<lastname#domain.com>"
but the following does:
echo "<firstname.lastname#domain.com>" | egrep "\<lastname#domain.com>"
The behaviour of the first is as expected but the second should not output. Is the "\<" being ignored within the regex or causing some other special behaviour?
As #hwnd said, \< matches the beginning of a word, i.e. a word boundary \b must exist before the starting word character (the character after \< in the input must be a word character).
In your example,
echo "<firstname.lastname#domain.com>" | egrep "<lastname#domain.com>"
In the above example, egrep checks for a literal < character immediately before the lastname string. But there isn't one, so it prints nothing.
$ echo "<firstname.lastname#domain.com>" | egrep "\<lastname#domain.com>"
<firstname.lastname#domain.com>
But in this example a word boundary \b does exist before the lastname string, so egrep prints the matching line (the matched part is lastname#domain.com>).
Some more examples:
$ echo "namelastname#domain.com" | egrep "\<e#domain.com"
$ echo "namelastname#domain.com" | egrep "\<lastname#domain.com"
$ echo "namelastname#domain.com" | egrep "\<com"
namelastname#domain.com
$ echo "<firstname.lastname#domain.com>" | egrep "\<#domain.com>"
$ echo "n-ame-lastname#domain.com" | egrep "\<ame-lastname#domain.com"
n-ame-lastname#domain.com

Bash- How to convert non-alphanumerical character to "_"

I am trying to store user input in a variable and clean that variable in order to keep only alphanumeric characters plus a few others (I mean [a-zA-Z0-9-_]).
I tried this, but it isn't exhaustive:
SERVICE_NAME=$(echo $SERVICE_NAME | tr A-Z a-z | tr ' ' _ | tr \' _ | tr \" _)
Do you have some help for this?
Bash's string substitution is a fine thing: ${var//pat/rep}
val='Foo$%!*#BAR###baZ'
echo ${val//[^a-zA-Z_-]/_}
Foo_____BAR___baZ
A small explanation: The slash introduces a search/replace, a little like in sed (where it just delimits patterns). But you use a single slash for one replacement:
val='Foo$%!*#BAR###baZ'
echo ${val/[^a-zA-Z_-]/_}
Foo_%!*#BAR###baZ
Two slashes // mean replace all. Uncommon, but there is some logic to it: multiple slashes for multiple replacements.
Also note how the $ is separated from the variable name inside the braces. It is hard to apply this expansion to a literal string (which would be nice for testing), and modifying $1 isn't a no-brainer either, afaik.
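If you also want the lowercasing that the question's tr A-Z a-z did, bash can do that with another expansion (a sketch, assuming bash >= 4 for ${var,,}):
SERVICE_NAME='Some Name$#1'
SERVICE_NAME=${SERVICE_NAME,,}                # lowercase (bash 4+)
SERVICE_NAME=${SERVICE_NAME//[^a-z0-9_-]/_}   # everything else becomes _
echo "$SERVICE_NAME"
some_name__1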
$ echo 'asd!#QCW##D' | tr A-Z a-z | sed -e 's/[^a-zA-Z0-9\-]/_/g'
asd__qcw__d
I would use sed for this, with the ^ (negation) operator in the set of valid characters, replacing everything else with an underscore. The above shows the syntax together with the output.
And, as a bonus, if you want to replace a run of invalid characters with a single underscore, just add + to your regular expression (and use the -r switch so sed uses extended regular expressions):
$ echo 'asd!#QCW##D' | tr A-Z a-z | sed -r 's/[^a-zA-Z0-9\-]+/_/g'
asd_qcw_d
I believe it can all be done in 1 single sed command like this:
echo 'Foo$%!*#BAR###baZ' | sed -e 's/[A-Z]/\L&/g' -e 's/[^a-z0-9\-]/_/g'
OUTPUT
foo_____bar___baz
perl way:
perl -ple 's/[^\w\-]/_/g'
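For example, with the same test string as above:
$ echo 'Foo$%!*#BAR###baZ' | perl -ple 's/[^\w\-]/_/g'
Foo_____BAR___baZ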
pure bash way
a='foo-BAR_123,.:goo'
echo ${a//[^[:alnum:]-]/_}
produces:
foo-BAR_123___goo

having a regex replacing across lines, retain the newlines?

I'd like a substitute- or print-style command with a regex that works across lines, with the lines retained.
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | tr -d '\n' | grep -or 'b.*f'
bcdef
or
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | tr -d '\n' | sed -r 's|b(.*)f|y\1z|'
aycdezg
I'd like to use grep or sed because I'd like to know what people would have done before awk or perl. Would they not have done this? Was .* not available? Did they have no other equivalent?
The goal is to modify some input with a regex that spans lines, and print it to stdout or to a file, retaining the newlines.
This should do what you're looking for:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;$s/b\([^f]*\)f/y\1z/;N;ba'
a
y
c
d
e
z
g
It accumulates all the lines and then does the replacement. It looks for the first "f"; if you want it to look for the last "f", change [^f] to . (a dot).
Note that this may make use of features added to sed after AWK or Perl became available (AWK has been around a looong time).
Edit:
To do a multi-line grep requires only a little modification:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;$s/^[^b]*\(b[^f]*f\)[^f]*$/\1/;N;ba'
b
c
d
e
f
sed can match across newlines through the use of its N command. For example, the following sed command will replace bar followed by a newline followed by baz with ###:
$ echo -e "foo\nbar\nbaz\nqux" | sed 'N;s/bar\nbaz/###/;P;D'
foo
###
qux
The N command appends the next input line to the current pattern space, separated by an embedded newline (\n).
The P command prints the current pattern space up to and including the first embedded newline.
The D command deletes up to and including the first embedded newline in the pattern space. It also starts the next cycle, but skips reading from the input if there is still data in the pattern space.
Through the use of these three commands, you can do essentially any sort of s-command replacement across N lines.
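For example, the same idea can be extended to a three-line window (a sketch; the $! guards keep N from running past the last line):
$ echo -e "foo\nbar\nbaz\nqux" | sed '$!N;$!N;s/bar\nbaz\nqux/###/;P;D'
foo
###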
Edit
If your question is how to remove the need for tr in the two examples above and just use sed, then here you go:
$ echo -e 'a\nb\nc\nd\ne\nf\ng' | sed ':a;N;$!ba;s/\n//g;y/ag/yz/'
ybcdefz
Proven tools to the rescue.
echo -e "foo\nbar\nbaz\nqux" | perl -lpe 'BEGIN{$/=""}s/foo\nbar/###/'