regexp special character - regex

How can I use grep or sed to match a pattern that contains special regexp characters?
I want to search for the pattern "This is a (test)" in a text file that contains the following lines:
This is a test
This is a (test)!
This is a (test)
I only want line 3 to be returned. Is that possible?

grep option -F treats all characters literally:
grep -F 'This is a (test)'
from grep --help
-F, --fixed-strings PATTERN is a set of newline-separated strings
however
grep -F 'This is a (test)' <<END
This is a test
This is a (test)!
This is a (test)
END
The line "This is a (test)!" matches as well, because -F still matches substrings. When the pattern starts and ends with word characters, -w (match whole words only) can help:
grep -Fw 1 <<END
12
1
END
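Otherwise, anchor the pattern and escape any special characters. In grep's default BRE syntax, plain parentheses are already literal (it is \( and \) that form a group), so:
grep '^This is a (test)$'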
Alternatively, with the Perl-compatible regex option -P, \Q..\E escapes all special characters between \Q and \E:
grep -P '^\QThis is a (test)\E$'
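Another option, not mentioned above and offered only as a sketch: combining -F with -x (match whole lines only, a standard grep option) should return just the third line without any escaping. The sample text is the one from the question; the variable names are illustrative.

```shell
# Sample input from the question, one candidate per line.
input='This is a test
This is a (test)!
This is a (test)'

# -F: treat the pattern as a fixed string; -x: require a whole-line match.
fixed=$(printf '%s\n' "$input" | grep -Fx 'This is a (test)')

# BRE alternative: unescaped parentheses are literal, ^ and $ anchor the line.
anchored=$(printf '%s\n' "$input" | grep '^This is a (test)$')

printf '%s\n' "$fixed" "$anchored"
```

Both commands select only the line that is exactly "This is a (test)".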

Related

How can I get a list of the words that have six or more consonants in a row using the grep command?

I want to find a list of words that contain six or more consonants in a row from a number of text files.
I'm pretty new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]{6}"
I use the cat command here because it will otherwise include the file names in the next pipe. I use the second pipe to get a list of all the words in the text files.
The problem is the last pipe: I want it to grep for 6 consonants in a row, and they don't need to be the same one. I know one way of solving the problem, but that would create a command longer than this entire post.
For the last grep you also need the -E switch - or you need to escape the curly braces:
cat *.txt | grep -Eo "\w+" | grep -Ei "[^AEOUIaeoui]{6}"
cat *.txt | grep -Eo "\w+" | grep -i "[^AEOUIaeoui]\{6\}"
Regarding "I use the cat command here because it will otherwise include the file names in the next pipe":
You can disable this using the -h flag:
grep -hEo "\w+" *.txt | grep -Ei "[^AEOUIaeoui]{6}"
You can use
grep -hEio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*' *.txt
Regex details
[[:alpha:]]* - zero or more letters
[b-df-hj-np-tv-z]{6} - six English consonant letters in a row
[[:alpha:]]* - zero or more letters.
The grep options make the search case-insensitive (-i), print only the matched text (-o), and suppress the filenames (-h). The -E option enables POSIX ERE syntax; without it, you would need to escape {6} as \{6\}.
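A quick way to check that pattern is to feed it a sentence instead of files. This is a minimal sketch: the sample sentence is made up, and LC_ALL=C is an added assumption to keep the bracket ranges byte-based regardless of locale.

```shell
# Hypothetical sample text: "catchphrase" contains "tchphr" and
# "Knightsbridge" contains "ghtsbr" (six consonants in a row each).
text='The catchphrase about strengths echoed in Knightsbridge'

# Same ERE as above, reading stdin instead of *.txt (so -h is not needed).
found=$(printf '%s\n' "$text" |
  LC_ALL=C grep -Eio '[[:alpha:]]*[b-df-hj-np-tv-z]{6}[[:alpha:]]*')

printf '%s\n' "$found"
```

"strengths" is not printed because its longest consonant run ("ngths") has only five letters.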
Use this Perl one-liner:
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Example:
cat > in_file.txt <<EOF
the abcdfghi aBcdfghi.
ABCDFGHI234
abcdEfgh
EOF
perl -lne 'print for grep { /[^aeoui]{6}/i } /\b([a-z]+)\b/ig' in_file.txt
Output:
abcdfghi
aBcdfghi
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
The regex uses these modifiers:
/g : Multiple matches.
/i : Case-insensitive matches.
/\b([a-z]+)\b/ig : Match words that consist of 1 or more letters only ([a-z]+), with words boundary \b on both sides. This way, ABCDFGHI234 does not match, but all 3 words in line 1 (the, abcdfghi, aBcdfghi) match. This may be important for some applications. Note that not all answers in this thread use the word boundary around letters, and thus do not make the distinction shown in this example.
/[^aeoui]{6}/i : Match 6 or more consecutive non-vowels. Non-vowels here resolve exactly to consonants, because the previous regex selected for words made of letters only, that is, vowels and consonants.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start
Get all words containing 6 or more consonants in a row in a given directory
cat *.txt | grep -Eo "\w+" | grep -E "[^AEOUIaeoui]{6,}"
We can use grep -Eo (-E Extended regex, -o output ONLY matching)
cat *.txt will output all of the data from all txt files in the current directory
grep -Eo "\w+" will output all of the words from an input in the form of one word per line
We can use Regex to search for strings that contain a pattern:
[^LISTOFCHARACTERS] Any character but LISTOFCHARACTERS
{6,} 6 or more

Bash find hashtags from string

I'm new to shell scripting and I'm trying to find all hashtags in a string using grep. This is what I have, but it only works for alphanumeric characters:
echo '<span><span>#😀fooFOO0</span></span>' | grep -o '#[a-zA-Z0-9]'
If the hashtag finishes before </span>, you can do
echo '<span><span>#😀fooFOO0</span></span>' | grep -Po '#.*?(?=<)'
.*? means non-greedy search.
(?=<) is look-ahead.
The following command prints one line for each hashtag found (only the # itself, which is enough for counting them):
$ echo '<span><span>#😀fooFOO0</span>#foo #bar</span>' | grep --fixed-strings --only-matching '#'
#
#
#
Options
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
Warning: --count or -c won't give the number of hashtags (3) but the number of lines containing one (only 1 here).
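To print each hashtag's text rather than just the #, one possibility (a sketch, not taken from the answers above) is a negated bracket expression that stops at whitespace or <. It assumes hashtags never contain those characters, and it relies only on -o, so it does not depend on -P.

```shell
html='<span><span>#😀fooFOO0</span>#foo #bar</span>'

# '#' followed by any run of characters that are neither whitespace nor '<'.
tags=$(printf '%s\n' "$html" | grep -o '#[^[:space:]<]*')

printf '%s\n' "$tags"

# Count hashtags by counting output lines (grep -c would count input lines).
count=$(printf '%s\n' "$tags" | wc -l)
```

Counting the output lines with wc -l gives the number of hashtags, which is what --count cannot provide here.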

What is the meaning of the -F option in grep manual

-F is an option of grep, from the manual below:
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched.
My question is
How do I separate multiple fixed strings? What is the newline character: \n or \?
It seems grep -F a\nh file is not valid if I want to find lines that start with the character a or h.
Thanks in advance !
With -F, grep matches patterns literally, i.e. no regex interpretation is applied to the pattern(s).
Multiple patterns are separated by actual newline characters.
The shell does not turn \n inside ordinary quotes into a newline; in bash and similar shells you can use ANSI-C quoting, $'a\nh', to embed one.
Example:
$ echo $'foo\nf.o\nba.r\nbaar\n'
foo
f.o
ba.r
baar
$ grep -F $'f.o\nba.r' <<<$'foo\nf.o\nba.r\nbaar\n'
f.o
ba.r
By default the pattern is interpreted as a Basic Regular Expression (BRE); with -F it is interpreted as a literal string with no metacharacters.
You can also use -E which will enable Extended Regular Expressions (ERE).
% grep -F '..' <<< $'hello\nworld\n...'
...
% grep '..' <<< $'hello\nworld\n...'
hello
world
...
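For shells without $'...' quoting, the same thing can be done with a literal newline inside ordinary quotes, or with one -e per string (both are standard grep usage). Note also that -F patterns are never anchored, so "lines that start with a or h" still needs a regular expression. A sketch with made-up sample data:

```shell
input='apple
banana
hat
f.o
foo'

# A variable holding one literal newline, usable in any POSIX shell.
nl='
'

# Two fixed strings in one pattern argument, separated by a newline.
multi=$(printf '%s\n' "$input" | grep -F "f.o${nl}hat")

# Equivalent: one -e option per fixed string.
same=$(printf '%s\n' "$input" | grep -F -e 'f.o' -e 'hat')

# -F cannot anchor; "starts with a or h" needs a regex.
starts=$(printf '%s\n' "$input" | grep '^[ah]')

printf '%s\n' "$multi" "$starts"
```

With -F the dot in f.o is literal, so "foo" is not selected by the first two commands.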

grep regex with backtick matches all lines

$ cat file
anna
amma
kklks
ksklaii
$ grep '\`' file
anna
amma
kklks
ksklaii
Why? How does that match work?
This appears to be a GNU extension for regular expressions. The backtick ('\`') anchor matches the very start of a subject string, which explains why it is matching all lines. OS X apparently doesn't implement the GNU extensions, which would explain why your example doesn't match any lines there. See http://www.regular-expressions.info/gnu.html
If you want to match an actual backtick when the GNU extensions are in effect, this works for me:
grep '[`]' file
twm's answer provides the crucial pointer, but note that it is the sequence \`, not ` by itself that acts as the start-of-input anchor in GNU regexes.
Thus, to match a literal backtick in a regex specified as a single-quoted shell string, you don't need any escaping at all, neither with GNU grep nor with BSD/macOS grep:
$ { echo 'ab'; echo 'c`d'; } | grep '`'
c`d
When using double-quoted shell strings - which you should avoid for regexes, for reasons that will become obvious - things get more complicated, because you then must escape the ` for the shell's sake in order to pass it through as a literal to grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\`"
c`d
Note that, after the shell has parsed the "..." string, grep still only sees `.
To recreate the original command with a double-quoted string with GNU grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\\\`" # !! BOTH \ and ` need \-escaping
ab
c`d
Again, after the shell's string parsing, grep sees just \`, which to GNU grep is the start-of-the-input anchor, so all input lines match.
Also note that since grep processes input line by line, \` has the same effect as ^ the start-of-a-line anchor; with multi-line input, however - such as if you used grep -z to read all lines at once - \` only matches the very start of the whole string.
To BSD/macOS grep, \` simply escapes a literal `, so it only matches input lines that contain that character.
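The three spellings can be compared side by side. In this sketch only the first two results are portable; the third depends on whether the grep in use implements the GNU anchor, so no particular count is assumed for it.

```shell
input='ab
c`d'

# Literal backtick: no escaping needed inside single quotes.
plain=$(printf '%s\n' "$input" | grep '`')

# Bracket expression: also a literal backtick.
bracket=$(printf '%s\n' "$input" | grep '[`]')

# Backslash + backtick: start-of-input anchor in GNU grep (every line matches),
# an escaped literal backtick elsewhere. -c prints the number of matching lines.
anchor_count=$(printf '%s\n' "$input" | grep -c '\`')

printf '%s\n' "$plain" "$bracket" "$anchor_count"
```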

Is it possible to escape regex metacharacters reliably with sed

I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:
#!/bin/bash
# Trying to replace one regex by another in an input file with sed
search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"
# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")
# Use it in a sed command
sed "s/$search/$replace/" input
I know that there are better tools to work with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)
I have tried a lot of things but anytime I could find an input which broke my attempt. I thought keeping it abstract as script to escape would not lead anybody into the wrong direction.
Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.
Note:
If you're looking for prepackaged functionality based on the techniques discussed in this answer:
bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
@EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
Ed's answer now has an improved version of the sed command used below, corrected in calestyo's answer, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the
sed 's/[^^]/[&]/g; s/\^/\\^/g' command used below, you must use
sed 's/[^^\]/[&]/g; s/[\^]/\\&/g;'
All snippets below assume bash as the shell (POSIX-compliant reformulations are possible):
SINGLE-line Solutions
Escaping a string literal for use as a regex in sed:
To give credit where credit is due: I found the regex used below in this answer.
Assuming that the search string is a single-line string:
search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3' # sample input containing metachars.
searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.
sed -n "s/$searchEscaped/foo/p" <<<"$search" # Echoes 'foo'
Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
Then, ^ chars. are escaped as \^.
Note that you cannot just escape every char by putting a \ in front of it because that can turn a literal char into a metachar, e.g. \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of a RE interval like \{1,3\}, etc.
The approach is robust, but not efficient.
The robustness comes from not trying to anticipate all special regex characters - which will vary across regex dialects - but to focus on only 2 features shared by all regex dialects:
the ability to specify literal characters inside a character set.
the ability to escape a literal ^ as \^
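To see what the escaping produces, it can help to print the intermediate value. A minimal sketch with a made-up literal (printf is used instead of <<< only to keep it runnable in any POSIX shell):

```shell
# Made-up literal containing regex metacharacters.
search='a.b^c*'

# Wrap every character except ^ in [...], then turn each ^ into \^.
escaped=$(printf '%s\n' "$search" | sed 's/[^^]/[&]/g; s/\^/\\^/g')
printf '%s\n' "$escaped"   # [a][.][b]\^[c][*]

# Only the line that equals the literal is rewritten; 'aXb^c*' is left alone,
# which shows that the dot is no longer a wildcard.
hits=$(printf '%s\n' 'a.b^c*' 'aXb^c*' | sed -n "s/$escaped/MATCH/p")
printf '%s\n' "$hits"      # MATCH
```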
Escaping a string literal for use as the replacement string in sed's s/// command:
The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.
Assuming that the replacement string is a single-line string:
replace='Laurel & Hardy; PS\2' # sample input containing metachars.
replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it
sed -n "s/.*/$replaceEscaped/p" <<<"foo" # Echoes $replace as-is
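For contrast, here is what happens when the escaping step is skipped. A sketch with a shortened, illustrative replacement string:

```shell
replace='Laurel & Hardy'

# Unescaped: sed expands & to the text matched by the regex ("foo").
raw=$(printf 'foo\n' | sed "s/.*/$replace/")
printf '%s\n' "$raw"    # Laurel foo Hardy

# Escaped: &, / and \ are backslash-prefixed, so & stays literal.
esc=$(printf '%s\n' "$replace" | sed 's/[&/\]/\\&/g')
safe=$(printf 'foo\n' | sed "s/.*/$esc/")
printf '%s\n' "$safe"   # Laurel & Hardy
```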
MULTI-line Solutions
Escaping a MULTI-LINE string literal for use as a regex in sed:
Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'
# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n') #'
# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
$!a\'$'\n''\\n' appends string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<)
tr -d '\n' then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop, leaving subsequent commands to operate on all input lines at once.
If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once:
sed -z "s/$searchEscaped/foo/" <<<"$search"
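The same idea on a smaller input, as a sketch. Note one deliberate variation: instead of the a command used above, it appends the two-character string \n with '$!s/$/\\n/', which is an assumption of this example rather than the answer's own formulation. The sample literal is made up.

```shell
# Made-up two-line literal containing regex metacharacters.
search='a.b
c*d'

# Escape each line, append a literal \n to every line but the last,
# then delete the real newlines so the regex is a single line.
escaped=$(printf '%s\n' "$search" |
  sed -e 's/[^^]/[&]/g; s/\^/\\^/g' -e '$!s/$/\\n/' |
  tr -d '\n')
printf '%s\n' "$escaped"   # [a][.][b]\n[c][*][d]

# Read all input into the pattern space, then substitute.
result=$(printf '%s\n' "$search" |
  sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$escaped/foo/")
printf '%s\n' "$result"    # foo
```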
Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:
# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'
# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}
# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"
Newlines in the input string must be retained as actual newlines, but \-escaped.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines in a loop.
s/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
s/\n/\\&/g then \-prefixes all actual newlines.
IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
bash functions based on the above (for sed):
quoteRe() quotes (escapes) for use in a regex
quoteSubst() quotes for use in the substitution string of a s/// call.
both handle multi-line input correctly
Note that because sed reads a single line at a time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue < <(quoteSubst "$value")
# SYNOPSIS
# quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# SYNOPSIS
# quoteSubst <text>
quoteSubst() {
  IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
  printf %s "${REPLY%$'\n'}"
}
Example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.
# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"
Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
perl solution:
Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.
# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"
Note the use of -0777 to read all input at once, so that the multi-line substitution works.
The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.
Building upon @mklement0's answer in this thread, the following tool will replace any single-line string (as opposed to regexp) with any other single-line string using sed and bash:
$ cat sedstr
#!/bin/bash
old="$1"
new="$2"
file="${3:--}"
escOld=$(sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g' <<< "$old")
escNew=$(sed 's/[&/\]/\\&/g' <<< "$new")
sed "s/$escOld/$escNew/g" "$file"
To illustrate the need for this tool, consider trying to replace a.*/b{2,}\nc with d&e\1f by calling sed directly:
$ cat file
a.*/b{2,}\nc
axx/bb\nc
$ sed 's/a.*/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 16: unknown option to `s'
$ sed 's/a.*\/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 23: invalid reference \1 on `s' command's RHS
$ sed 's/a.*\/b{2,}\nc/d&e\\1f/' file
a.*/b{2,}\nc
axx/bb\nc
# .... and so on, peeling the onion ad nauseam until:
$ sed 's/a\.\*\/b{2,}\\nc/d\&e\\1f/' file
d&e\1f
axx/bb\nc
or use the above tool:
$ sedstr 'a.*/b{2,}\nc' 'd&e\1f' file
d&e\1f
axx/bb\nc
The reason this is useful is that it can be easily augmented to use word-delimiters to replace words if necessary, e.g. in GNU sed syntax:
sed "s/\<$escOld\>/$escNew/g" "$file"
whereas the tools that actually operate on strings (e.g. awk's index()) cannot use word-delimiters.
NOTE: the reason to not wrap \ in a bracket expression is that if you were using a tool that accepts [\]] as a literal ] inside a bracket expression (e.g. perl and most awk implementations) to do the actual final substitution (i.e. instead of sed "s/$escOld/$escNew/g") then you couldn't use the approach of:
sed 's/[^^]/[&]/g; s/\^/\\^/g'
to escape \ by enclosing it in [] because then \x would become [\][x] which means \ or ] or [ or x. Instead you'd need:
sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
So while [\] is probably OK for all current sed implementations, we know that \\ will work for all sed, awk, perl, etc. implementations and so use that form of escaping.
Note that the regular expression used in some of the answers above:
's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
seems to be wrong:
Doing first s/\^/\\^/g followed by s/\\/\\\\/g is an error, as any ^ escaped first to \^ will then have its \ escaped again.
A better way seems to be: 's/[^\^]/[&]/g; s/[\^]/\\&/g;'.
[^^\\] with sed (BRE/ERE) should be just [^\^] (or [^^\]). \ has no special meaning inside a bracket expression and does not need to be escaped.
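The difference between the two commands can be seen on a short literal that contains both \ and ^. A sketch with a made-up input string; grep is used here only as a convenient BRE consumer to test the results:

```shell
# Made-up literal containing a backslash and a caret.
literal='a\b^c'

# Three-step version: the \ added in front of ^ is doubled by the last step.
buggy=$(printf '%s\n' "$literal" | sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g')
printf '%s\n' "$buggy"   # [a]\\[b]\\^[c]

# Corrected version: \ and ^ are each backslash-prefixed exactly once.
fixed=$(printf '%s\n' "$literal" | sed 's/[^^\]/[&]/g; s/[\^]/\\&/g')
printf '%s\n' "$fixed"   # [a]\\[b]\^[c]

# Count the lines of the original literal that each regex matches.
fixed_matches=$(printf '%s\n' "$literal" | grep -c "$fixed")
buggy_matches=$(printf '%s\n' "$literal" | grep -c "$buggy")
printf '%s %s\n' "$fixed_matches" "$buggy_matches"   # 1 0
```

The three-step result demands a literal backslash right after b, so it no longer matches the string it was derived from.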
Bash parameter expansion can be used to escape a string for use as a Sed replacement string:
# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'
# Escape it for use as a Sed replacement string.
: "${replace//\\/\\\\}"
: "${_//&/\\\&}"
: "${_//\//\\\/}"
: "${_//$'\n'/\\$'\n'}"
replaceEscaped=$_
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
In bash 5.2+, it can be simplified further:
# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'
# Escape it for use as a Sed replacement string.
shopt -s extglob
shopt -s patsub_replacement # An & in the replacement will expand to what matched. bash 5.2+
: "${replace//@(&|\\|\/|$'\n')/\\&}"
replaceEscaped=$_
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
Encapsulate it in a bash function:
##
# escape_replacement -v var replacement
#
# Escape special characters in _replacement_ so that it can be
# used as the replacement part in a sed substitute command.
# Store the result in _var_.
escape_replacement() {
  if ! [[ $# = 3 && $1 = '-v' ]]; then
    echo "escape_replacement: invalid usage" >&2
    echo "escape_replacement: usage: escape_replacement -v var replacement" >&2
    return 1
  fi
  local -n var=$2 # nameref (requires Bash 4.3+)
  # We use the : command (true builtin) as a dummy command as we
  # trigger a sequence of parameter expansions
  # We exploit that the $_ variable (last argument to the previous command
  # after expansion) contains the result of the previous parameter expansion
  : "${3//\\/\\\\}" # Backslash-escape any existing backslashes
  : "${_//&/\\\&}" # Backslash-escape &
  : "${_//\//\\\/}" # Backslash-escape the delimiter (we assume /)
  : "${_//$'\n'/\\$'\n'}" # Backslash-escape newline
  var=$_ # Assign to the nameref
  # To support Bash older than 4.3, the following can be used instead of nameref
  #eval "$2=\$_" # Use eval instead of nameref https://mywiki.wooledge.org/BashFAQ/006
}
# Test the function
# =================
# Define a sample multi-line literal. Include a trailing newline to test corner case
replace='a&b;c\1
d/e
'
escape_replacement -v replaceEscaped "$replace"
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''