grep regex with backtick matches all lines

grep regex with backtick matches all lines - regex

$ cat file
anna
amma
kklks
ksklaii
$ grep '\`' file
anna
amma
kklks
ksklaii
Why? How is that match working ?

This appears to be a GNU extension for regular expressions. The backtick ('\`') anchor matches the very start of a subject string, which explains why it is matching all lines. OS X apparently doesn't implement the GNU extensions, which would explain why your example doesn't match any lines there. See http://www.regular-expressions.info/gnu.html
If you want to match an actual backtick when the GNU extensions are in effect, this works for me:
grep '[`]' file

twm's answer provides the crucial pointer, but note that it is the sequence \`, not ` by itself that acts as the start-of-input anchor in GNU regexes.
Thus, to match a literal backtick in a regex specified as a single-quoted shell string, you don't need any escaping at all, neither with GNU grep nor with BSD/macOS grep:
$ { echo 'ab'; echo 'c`d'; } | grep '`'
c`d
When using double-quoted shell strings - which you should avoid for regexes, for reasons that will become obvious - things get more complicated, because you then must escape the ` for the shell's sake in order to pass it through as a literal to grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\`"
c`d
Note that, after the shell has parsed the "..." string, grep still only sees `.
To recreate the original command with a double-quoted string with GNU grep:
$ { echo 'ab'; echo 'c`d'; } | grep "\\\`" # !! BOTH \ and ` need \-escaping
ab
c`d
Again, after the shell's string parsing, grep sees just \`, which to GNU grep is the start-of-the-input anchor, so all input lines match.
Also note that since grep processes input line by line, \` has the same effect as ^ the start-of-a-line anchor; with multi-line input, however - such as if you used grep -z to read all lines at once - \` only matches the very start of the whole string.
To BSD/macOS grep, \` simply escapes a literal `, so it only matches input lines that contain that character.

Related

How do I take only the first occurrence of a hyphen in sed?

I have a string, for example home/JOHNSMITH-4991-common-task-list, and I want to take out the uppercase part and the numbers with the hyphen between them. I echo the string and pipe it to sed like so, but I keep getting all the hyphens I don't want, e.g.:
echo home/JOHNSMITH-4991-common-task-list | sed 's/[^A-Z0-9-]//g'
gives me:
JOHNSMITH-4991---
I need:
JOHNSMITH-4991
How do I ignore all but the first hyphen?

You can use
sed 's,.*/\([^-]*-[^-]*\).*,\1,'
POSIX BRE regex details:
.* - any zero or more chars
/ - a / char
\([^-]*-[^-]*\) - Group 1: any zero or more chars other than -, a hyphen, and then again zero or more chars other than -
.* - any zero or more chars
The replacement is the Group 1 placeholder, \1, to restore just the text captured.
See the online demo:
#!/bin/bash
s="home/JOHNSMITH-4991-common-task-list"
sed 's,.*/\([^-]*-[^-]*\).*,\1,' <<< "$s"
# => JOHNSMITH-4991

1st solution: With awk it will be much easier and we could keep it simple, please try following, written and tested with your shown samples.
echo "echo home/JOHNSMITH-4991-common-task-list" | awk -F'/|-' '{print $2"-"$3}'
Explanation: Simple explanation would be, setting field separator as / OR - and printing 2nd field - and 3rd field of current line.
2nd solution: Using match function of awk program here.
echo "echo home/JOHNSMITH-4991-common-task-list" |
awk '
match($0,/\/[^-]*-[^-]*/){
print substr($0,RSTART+1,RLENGTH-1)
}'
3rd solution: Using GNU grep solution here. Using -oP option of grep here, to print matched values with o option and to enable ERE(extended regular expression) with P option. Then in main program of grep using .*/ followed by \K to ignore previous matched part and then mentioning [^-]*-[^-]* to make sure to get values just before 2nd occurrence of - in matched line.
echo "echo home/JOHNSMITH-4991-common-task-list" | grep -oP '.*/\K[^-]*-[^-]*'

Here is a simple alternative solution using cut with bash string substitution:
s='home/JOHNSMITH-4991-common-task-list'
cut -d- -f1-2 <<< "${s##*/}"
JOHNSMITH-4991

You could match until the first occurrence of the /, then clear the match buffer with \K and then repeat the character class 1+ times with a hyphen in between to select at least characters before and after the hyphen.
[^/]*/\K[A-Z0-9]+-[A-Z0-9]+
If supported, using gnu grep:
echo "echo home/JOHNSMITH-4991-common-task-list" | grep -oP '[^/]*/\K[A-Z0-9]+-[A-Z0-9]+'
Output
JOHNSMITH-4991
If gnu awk is an option, using the same pattern but with a capture group:
echo "home/JOHNSMITH-4991-common-task-list" | awk 'match($0, /[^\/]*\/([A-Z0-9]+-[A-Z0-9]+)/, a) {print a[1]}'
If the desired output is always the first match where the character class with a hyphen matches:
echo "home/JOHNSMITH-4991-common-task-list" | awk -v FPAT="[A-Z0-9]+-[A-Z0-9]+" '$0=$1'
Output
JOHNSMITH-4991

Assumptions:
could be more than one fwd slash in string
(after the last fwd slash) there are 2 or more hyphens in the string
desired output is between last fwd slash and 2nd hyphen
One idea using parameter substitutions:
$ string='home/dir/JOHNSMITH-4991-common-task-list'
$ string1="${string##*/}"
$ typeset -p string1
declare -- string1="JOHNSMITH-4991-common-task-list"
$ string1="${string1%%-*}"
$ typeset -p string1
declare -- string1="JOHNSMITH"
$ string2="${string#*-}"
$ typeset -p string2
declare -- string2="4991-common-task-list"
$ string2="${string2%%-*}"
$ typeset -p string2
declare -- string2="4991"
$ newstring="${string1}-${string2}"
$ echo "${newstring}"
JOHNSMITH-4991
NOTES:
typeset commands added solely to show progression of values
a bit of typing but if doing this a lot of times in bash the overall performance should be good compared to other solutions that require spawning a sub-process
if there's a need to parse a large number of strings best performance will come from streaming all strings at once (via a file?) to one of the other solutions (eg, a single awk call that processes all strings will be faster than running the set of strings through a bash loop and performing all of these parameter substitutions)

linux extract only a string starts with a special string and ends with the first occurrence of comma

I have a log file contains some information like below
"variable1=XXX, emotionType=sad, sentimentType=negative..."
What I want is to grep only the matched string, the string starts with emotionType and ends with the first occurrence of comma.
E.g.
emotionType=sad
emotionType=joy
...
What I have tried is
grep -e "/^emotionType.*,/" file.log -o
but I got nothing. Anyone can tell me what should I do?

You need to use
grep -o "emotionType[^,]*" file.log
Note:
Remove ^ or replace with \<, starting word boundary construct if your matches are not located at the beginning of each line
Remove the / chars on both ends of the regex since grep does not use regex delimiters (like sed)
[^,] is a negated bracket expression that matches any char other than a comma
* is a POSIX BRE quantifier that matches zero or more occurrences.
See an online demo:
#!/bin/bash
s="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
grep -o "emotionType=[^,]*" <<< "$s"
Output:
emotionType=sad
emotionType=happy

1st solution: With awk you could try following program. Simple explanation would be using awk's match function capability and using regex to match string emotionType till next occurrence of , and printing all the matches in awk program.
var="variable1=XXX, emotionType=sad, sentimentType=negative, emotionType=happy"
Where var is a shell variable.
echo "$var" |
awk '{while(match($0,/emotionType=[^,]*/)){print substr($0,RSTART,RLENGTH);$0=substr($0,RSTART+RLENGTH)}}'
2nd solution: Or in GNU awk using RS variable try following awk program.
echo "$var" | awk -v RS='emotionType=[^,]*' 'RT{sub(/\n+$/,"",RT);print RT}'

Is it possible to escape regex metacharacters reliably with sed

I'm wondering whether it is possible to write a 100% reliable sed command to escape any regex metacharacters in an input string so that it can be used in a subsequent sed command. Like this:
#!/bin/bash
# Trying to replace one regex by another in an input file with sed
search="/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3"
replace="/xyz\n\t[0-9]\+\([^ ]\)\{2,3\}\3"
# Sanitize input
search=$(sed 'script to escape' <<< "$search")
replace=$(sed 'script to escape' <<< "$replace")
# Use it in a sed command
sed "s/$search/$replace/" input
I know that there are better tools to work with fixed strings instead of patterns, for example awk, perl or python. I would just like to prove whether it is possible or not with sed. I would say let's concentrate on basic POSIX regexes to have even more fun! :)
I have tried a lot of things but anytime I could find an input which broke my attempt. I thought keeping it abstract as script to escape would not lead anybody into the wrong direction.
Btw, the discussion came up here. I thought this could be a good place to collect solutions and probably break and/or elaborate them.

Note:
If you're looking for prepackaged functionality based on the techniques discussed in this answer:
bash functions that enable robust escaping even in multi-line substitutions can be found at the bottom of this post (plus a perl solution that uses perl's built-in support for such escaping).
#EdMorton's answer contains a tool (bash script) that robustly performs single-line substitutions.
Ed's answer now has an improved version of the sed command used below, corrected in calestyo's answer, which is needed if you want to escape string literals for potential use with other regex-processing tools, such as awk and perl. In short: for cross-tool use, \ must be escaped as \\ rather than as [\], which means: instead of the
sed 's/[^^]/[&]/g; s/\^/\\^/g' command used below, you must use
sed 's/[^^\]/[&]/g; s/[\^]/\\&/g;'
All snippets below assume bash as the shell (POSIX-compliant reformulations are possible):
SINGLE-line Solutions
Escaping a string literal for use as a regex in sed:
To give credit where credit is due: I found the regex used below in this answer.
Assuming that the search string is a single-line string:
search='abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3' # sample input containing metachars.
searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$search") # escape it.
sed -n "s/$searchEscaped/foo/p" <<<"$search" # Echoes 'foo'
Every character except ^ is placed in its own character set [...] expression to treat it as a literal.
Note that ^ is the one char. you cannot represent as [^], because it has special meaning in that location (negation).
Then, ^ chars. are escaped as \^.
Note that you cannot just escape every char by putting a \ in front of it because that can turn a literal char into a metachar, e.g. \< and \b are word boundaries in some tools, \n is a newline, \{ is the start of a RE interval like \{1,3\}, etc.
The approach is robust, but not efficient.
The robustness comes from not trying to anticipate all special regex characters - which will vary across regex dialects - but to focus on only 2 features shared by all regex dialects:
the ability to specify literal characters inside a character set.
the ability to escape a literal ^ as \^
Escaping a string literal for use as the replacement string in sed's s/// command:
The replacement string in a sed s/// command is not a regex, but it recognizes placeholders that refer to either the entire string matched by the regex (&) or specific capture-group results by index (\1, \2, ...), so these must be escaped, along with the (customary) regex delimiter, /.
Assuming that the replacement string is a single-line string:
replace='Laurel & Hardy; PS\2' # sample input containing metachars.
replaceEscaped=$(sed 's/[&/\]/\\&/g' <<<"$replace") # escape it
sed -n "s/.*/$replaceEscaped/p" <<<"foo" # Echoes $replace as-is
MULTI-line Solutions
Escaping a MULTI-LINE string literal for use as a regex in sed:
Note: This only makes sense if multiple input lines (possibly ALL) have been read before attempting to match.
Since tools such as sed and awk operate on a single line at a time by default, extra steps are needed to make them read more than one line at a time.
# Define sample multi-line literal.
search='/abc\n\t[a-z]\+\([^ ]\)\{2,3\}\3
/def\n\t[A-Z]\+\([^ ]\)\{3,4\}\4'
# Escape it.
searchEscaped=$(sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$search" | tr -d '\n') #'
# Use in a Sed command that reads ALL input lines up front.
# If ok, echoes 'foo'
sed -n -e ':a' -e '$!{N;ba' -e '}' -e "s/$searchEscaped/foo/p" <<<"$search"
The newlines in multi-line input strings must be translated to '\n' strings, which is how newlines are encoded in a regex.
$!a\'$'\n''\\n' appends string '\n' to every output line but the last (the last newline is ignored, because it was added by <<<)
tr -d '\n then removes all actual newlines from the string (sed adds one whenever it prints its pattern space), effectively replacing all newlines in the input with '\n' strings.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines a loop, therefore leaving subsequent commands to operate on all input lines at once.
If you're using GNU sed (only), you can use its -z option to simplify reading all input lines at once:
sed -z "s/$searchEscaped/foo/" <<<"$search"
Escaping a MULTI-LINE string literal for use as the replacement string in sed's s/// command:
# Define sample multi-line literal.
replace='Laurel & Hardy; PS\2
Masters\1 & Johnson\2'
# Escape it for use as a Sed replacement string.
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$replace")
replaceEscaped=${REPLY%$'\n'}
# If ok, outputs $replace as is.
sed -n "s/\(.*\) \(.*\)/$replaceEscaped/p" <<<"foo bar"
Newlines in the input string must be retained as actual newlines, but \-escaped.
-e ':a' -e '$!{N;ba' -e '}' is the POSIX-compliant form of a sed idiom that reads all input lines a loop.
's/[&/\]/\\&/g escapes all &, \ and / instances, as in the single-line solution.
s/\n/\\&/g' then \-prefixes all actual newlines.
IFS= read -d '' -r is used to read the sed command's output as is (to avoid the automatic removal of trailing newlines that a command substitution ($(...)) would perform).
${REPLY%$'\n'} then removes a single trailing newline, which the <<< has implicitly appended to the input.
bash functions based on the above (for sed):
quoteRe() quotes (escapes) for use in a regex
quoteSubst() quotes for use in the substitution string of a s/// call.
both handle multi-line input correctly
Note that because sed reads a single line at at time by default, use of quoteRe() with multi-line strings only makes sense in sed commands that explicitly read multiple (or all) lines at once.
Also, using command substitutions ($(...)) to call the functions won't work for strings that have trailing newlines; in that event, use something like IFS= read -d '' -r escapedValue <(quoteSubst "$value")
# SYNOPSIS
# quoteRe <text>
quoteRe() { sed -e 's/[^^]/[&]/g; s/\^/\\^/g; $!a\'$'\n''\\n' <<<"$1" | tr -d '\n'; }
# SYNOPSIS
# quoteSubst <text>
quoteSubst() {
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' <<<"$1")
printf %s "${REPLY%$'\n'}"
}
Example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You & I'$'\n''eating A\1 sauce.' # sample replacement string with metachars.
# Should print the unmodified value of $to
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/$(quoteRe "$from")/$(quoteSubst "$to")/" <<<"$from"
Note the use of -e ':a' -e '$!{N;ba' -e '}' to read all input at once, so that the multi-line substitution works.
perl solution:
Perl has built-in support for escaping arbitrary strings for literal use in a regex: the quotemeta() function or its equivalent \Q...\E quoting.
The approach is the same for both single- and multi-line strings; for example:
from=$'Cost\(*):\n$3.' # sample input containing metachars.
to='You owe me $1/$& for'$'\n''eating A\1 sauce.' # sample replacement string w/ metachars.
# Should print the unmodified value of $to.
# Note that the replacement value needs NO escaping.
perl -s -0777 -pe 's/\Q$from\E/$to/' -- -from="$from" -to="$to" <<<"$from"
Note the use of -0777 to read all input at once, so that the multi-line substitution works.
The -s option allows placing -<var>=<val>-style Perl variable definitions following -- after the script, before any filename operands.

Building upon #mklement0's answer in this thread, the following tool will replace any single-line string (as opposed to regexp) with any other single-line string using sed and bash:
$ cat sedstr
#!/bin/bash
old="$1"
new="$2"
file="${3:--}"
escOld=$(sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g' <<< "$old")
escNew=$(sed 's/[&/\]/\\&/g' <<< "$new")
sed "s/$escOld/$escNew/g" "$file"
To illustrate the need for this tool, consider trying to replace a.*/b{2,}\nc with d&e\1f by calling sed directly:
$ cat file
a.*/b{2,}\nc
axx/bb\nc
$ sed 's/a.*/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 16: unknown option to `s'
$ sed 's/a.*\/b{2,}\nc/d&e\1f/' file
sed: -e expression #1, char 23: invalid reference \1 on `s' command's RHS
$ sed 's/a.*\/b{2,}\nc/d&e\\1f/' file
a.*/b{2,}\nc
axx/bb\nc
# .... and so on, peeling the onion ad nauseum until:
$ sed 's/a\.\*\/b{2,}\\nc/d\&e\\1f/' file
d&e\1f
axx/bb\nc
or use the above tool:
$ sedstr 'a.*/b{2,}\nc' 'd&e\1f' file
d&e\1f
axx/bb\nc
The reason this is useful is that it can be easily augmented to use word-delimiters to replace words if necessary, e.g. in GNU sed syntax:
sed "s/\<$escOld\>/$escNew/g" "$file"
whereas the tools that actually operate on strings (e.g. awk's index()) cannot use word-delimiters.
NOTE: the reason to not wrap \ in a bracket expression is that if you were using a tool that accepts [\]] as a literal ] inside a bracket expression (e.g. perl and most awk implementations) to do the actual final substitution (i.e. instead of sed "s/$escOld/$escNew/g") then you couldn't use the approach of:
sed 's/[^^]/[&]/g; s/\^/\\^/g'
to escape \ by enclosing it in [] because then \x would become [\][x] which means \ or ] or [ or x. Instead you'd need:
sed 's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
So while [\] is probably OK for all current sed implementations, we know that \\ will work for all sed, awk, perl, etc. implementations and so use that form of escaping.

It should be noted that the regular expression used in some answers above among this and that one:
's/[^^\\]/[&]/g; s/\^/\\^/g; s/\\/\\\\/g'
seems to be wrong:
Doing first s/\^/\\^/g followed by s/\\/\\\\/g is an error, as any ^ escaped first to \^ will then have its \ escaped again.
A better way seems to be: 's/[^\^]/[&]/g; s/[\^]/\\&/g;'.
[^^\\] with sed (BRE/ERE) should be just [^\^] (or [^^\]). \ has no special meaning inside a bracket expression and needs not to be quoted.

Bash parameter expansion can be used to escape a string for use as a Sed replacement string:
# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'
# Escape it for use as a Sed replacement string.
: "${replace//\\/\\\\}"
: "${_//&/\\\&}"
: "${_//\//\\\/}"
: "${_//$'\n'/\\$'\n'}"
replaceEscaped=$_
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
In bash 5.2+, it can be simplified further:
# Define a sample multi-line literal. Includes a trailing newline to test corner case
replace='a&b;c\1
d/e
'
# Escape it for use as a Sed replacement string.
shopt -s extglob
shopt -s patsub_replacement # An & in the replacement will expand to what matched. bash 5.2+
: "${replace//#(&|\\|\/|$'\n')/\\&}"
replaceEscaped=$_
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''
Encapsulate it in a bash function:
##
# escape_replacement -v var replacement
#
# Escape special characters in _replacement_ so that it can be
# used as the replacement part in a sed substitute command.
# Store the result in _var_.
escape_replacement() {
if ! [[ $# = 3 && $1 = '-v' ]]; then
echo "escape_replacement: invalid usage" >&2
echo "escape_replacement: usage: escape_replacement -v var replacement" >&2
return 1
fi
local -n var=$2 # nameref (requires Bash 4.3+)
# We use the : command (true builtin) as a dummy command as we
# trigger a sequence of parameter expansions
# We exploit that the $_ variable (last argument to the previous command
# after expansion) contains the result of the previous parameter expansion
: "${3//\\/\\\\}" # Backslash-escape any existing backslashes
: "${_//&/\\\&}" # Backslash-escape &
: "${_//\//\\\/}" # Backslash-escape the delimiter (we assume /)
: "${_//$'\n'/\\$'\n'}" # Backslash-escape newline
var=$_ # Assign to the nameref
# To support Bash older than 4.3, the following can be used instead of nameref
#eval "$2=\$_" # Use eval instead of nameref https://mywiki.wooledge.org/BashFAQ/006
}
# Test the function
# =================
# Define a sample multi-line literal. Include a trailing newline to test corner case
replace='a&b;c\1
d/e
'
escape_replacement -v replaceEscaped "$replace"
# Output should match "$replace"
sed -n "s/.*/$replaceEscaped/p" <<<''

Matching strings with grep and \A regexp

Given the string in some file:
hel string1
hell string2
hello string3
I'd like to capture just hel using cat file | grep 'regexp here'
I tried doing a bunch of regexp but none seem to work. What makes the most sense is: grep -E '\Ahel' but that doesn't seem to work. It works on http://rubular.com/ however. Any ideas why that isn't working with grep?
Also, when pasting the above string with a tab space before each line, the \A does not seem to work on rubular. I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?

ERE (-E) does not support \A for indicating start of match. Try ^ instead.
Use -m 1 to stop grepping after the first match in each file.
If you want grep to print only the matched string (not the entire line), use -o.
Use -h if you want to suppress the printing of filenames in the grep output.
Example:
grep -Eohm 1 "^hel" *.log
If you need to enforce only outputting if the search string is on the first line of the file, you could use head:
head -qn 1 *.log | grep -Eoh "^hel"

ERE doesn't support \A but PCRE does hence grep -P can be used with same regex (if available):
grep -P '\Ahel\b' file
hel string1
Also important is to use word boundary \b to restrict matching hello
Alternatively in ERE you can use:
egrep '^hel\b'
hel string1

I thought \A matches beginning of string, and that doesn't matter whatever characters was before that. Why did \A stop matching when there was a space before the string?
\A matches the very beginning of the text, it doesn't match the start-of-line when you have one or more lines in your text.
Anyway, grep doesn't support \A so you need to use ^ which by the way matches the start of each line in multi-line mode contrary to \A.

Using awk
awk '$1=="hel"' file
PS you do not need to cat file to grep, use grep 'regexp here' file

Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.

For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or
a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
Perl Compatible Regexes
For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ and the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.

The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.

All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt

option 1
echo $(cat testfile)
Option 2
tr ' ' '\n' < testfile

This should do the work:
sed -e 's/[ \t]+/\n/g'
[ \t] means a space OR an tab. If you want any kind of space, you could also use \s.
[ \t]+ means as many spaces OR tabs as you want (but at least one)
s/x/y/ means replace the pattern x by y (here \n is a new line)
The g at the end means that you have to repeat as many times it occurs in every line.

You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence

You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new

Using gawk:
gawk '{$1=$1}1' OFS="\n" file

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

grep regex with backtick matches all lines - regex

$ cat file anna amma kklks ksklaii $ grep '\`' file anna amma kklks ksklaii Why? How is that match working ?

Related

How do I take only the first occurrence of a hyphen in sed?

linux extract only a string starts with a special string and ends with the first occurrence of comma

Is it possible to escape regex metacharacters reliably with sed

Matching strings with grep and \A regexp

Replace all whitespace with a line break/paragraph mark to make a word list

Categories

Resources