Bash to find lines with exactly one word? - regex

I'm trying to write a bash script that takes a file name and returns the lines that have exactly one word. Here is sample text:
This has more than one word
There
is exactly one word in above line.
White-space
in the start of the above line doesn't matter.
Need-some-help.
Output:
There
White-space
Need-some-help.
I'm looking into using a combination of sed and regex.
Note: I cannot use anything else (it has to be a bash script, without custom modules), so suggesting that wouldn't help.

If words can contain any non-whitespace characters, then:
grep -E '^\s*\S+\s*$'
or
sed -E '/^\s*\S+\s*$/!d'
or
sed -n -E '/^\s*\S+\s*$/p'

If you have awk available:
awk 'NF==1'
With sed, delete any line that contains a "non-space, space, non-space" sequence (the + quantifier needs -E):
sed -E '/[^ ] +[^ ]/d'

Well, you could just delete the lines that contain a character, a space, and another character using sed.
#!/bin/bash
echo "This has more than one word
There
is exactly one word in above line.
White-space
in the start of the above line doesn't matter.
Need-some-help." | sed '/\S \S/d' -

^\s*\b[a-zA-Z.-]+\s*$
As for the regex part: assuming you are processing the file line by line, this regex matches only if the line contains exactly one word.

Assuming you can use grep (one of the most common tools used in shell scripts):
#!/bin/bash
grep '^ *[^ ]\+ *$' "$@"
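The answers above can be combined into a small script. The following is only a sketch: the function name one_word_lines is made up for illustration, and it assumes that a "word" is any run of non-whitespace characters and that tabs count as white-space. It uses POSIX character classes instead of \s and \S, which are extensions that not every grep understands.

```shell
#!/bin/bash
# one_word_lines (hypothetical name): print the lines that contain exactly
# one run of non-whitespace characters, ignoring surrounding white-space.
# Reads the files given as arguments, or standard input when there are none.
one_word_lines() {
  grep -E '^[[:space:]]*[^[:space:]]+[[:space:]]*$' "$@"
}
```

It could be called as one_word_lines sample.txt, where sample.txt stands for whatever file name the script receives. Matching lines are printed unchanged, so any leading white-space is kept.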

Related

Replace spaces with new lines if part of a specific pattern using sed and regex with extended syntax

so I have a text file with multiple instances looking like this:
word. word or words [something:'else]
I need to replace with a new line the double space after every period followed by a sequence of words and then a "[", like so:
word.\nword or words [something:'else]
I thought about using the sed command in bash with extended regex syntax, but nothing has worked so far... I've tried different variations of this:
sed -E 's/(\.)( )(.*)(.\[)/\1\n\3\4/g' old.txt > new.txt
I'm an absolute beginner at this, so I'm not sure at all about what I'm doing 😳
This might work for you (GNU sed):
sed -E 's/\.  ((\w+ )+\[)/\.\n\1/g' file
Globally replace a period followed by two spaces, one or more space-separated words, and an opening square bracket with a period, a newline, and the back reference to the text matched after the two spaces.
Your sed command is almost correct (though it contains some redundancies):
sed -E 's/(\.)( )(.*)(.\[)/\1\n\3\4/' old.txt > new.txt
#                                  ^
# You forgot to terminate the s command with this final slash.
But you don't need to capture everything. A simpler one could be
sed -E 's/\. (.*\[)/.\n\1/' old.txt > new.txt
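For reference, here is a hedged sketch of the same idea wrapped in a function. The name split_after_period is invented for this example, and it assumes the input really contains two spaces after the period and no square brackets between the period and the opening "[". It writes the newline as a backslash followed by a literal line break, which should also work with sed versions that do not understand \n in the replacement.

```shell
# split_after_period (hypothetical name): turn ".  words [" into
# ".<newline>words [" for every occurrence on a line.
# [^][]* matches any run of characters that are neither "]" nor "[".
split_after_period() {
  sed -E 's/\.  ([^][]*\[)/.\
\1/g' "$@"
}
```

Usage would look like split_after_period old.txt > new.txt.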

Extracting Substring from String with Multiple Special Characters Using Sed

I have a text file with a line that reads:
<div id="page_footer"><div><? print('Any phrase's characters can go here!'); ?></div></div>
And I'm wanting to use sed or awk to extract the substring above between the single quotes so it just prints ...
Any phrase's characters can go here!
I want the phrase to be delimited as I have above, starting after the single quote and ending at the single-quote immediately followed by a parenthesis and then semicolon. The following sed command with a capture group doesn't seem to be working for me. Suggestions?
sed '/^<div id="page_footer"><div><? print(\'\(.\+\)\');/ s//\1/p' /home/foobar/testfile.txt
Using cut like this would be incorrect:
grep "page_footer" /home/foobar/testfile.txt | cut -d "'" -f2
It will go wrong with single quotes inside the string. Counting the number of single quotes first will change this from a simple to an over-complicated solution.
A solution with sed is better: remove everything until the first single quote and everything after the last one. A single quote in the string becomes messy when you first close the sed parameter with a single quote, escape the single quote and open a sed string again:
grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*//' -e 's/[^'\'']*$//'
And this is not the full solution, you want to remove the first/last quotes as well:
grep page_footer /home/foobar/testfile.txt | sed -e 's/[^'\'']*'\''//' -e 's/'\''[^'\'']*$//'
Writing the sed parameters in double-quoted strings and using the . wildcard for matching the single quote will make the line shorter:
grep page_footer /home/foobar/testfile.txt | sed -e "s/^[^\']*.//" -e "s/.[^\']*$//"
Using GNU grep with Perl-compatible regex support (the -P option, common on Linux), this might be what you are looking for:
grep -Po "(?<=').*?(?='\);)"
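Another way to get the same result with plain sed is to rely on greedy matching: capture everything between print(' and the last '); on the line. This is a sketch only, and the function name extract_phrase is an assumption made for the example. It also assumes there is a single print('...'); call per line.

```shell
# extract_phrase (hypothetical name): print the text between "print('"
# and the last "');" on each matching line. The sed script is written in
# double quotes so that the single quotes need no special escaping.
extract_phrase() {
  sed -n "s/.*print('\(.*\)');.*/\1/p" "$@"
}
```

Because .* is greedy, single quotes inside the phrase do not end the capture early. It could be run as extract_phrase /home/foobar/testfile.txt.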

How can I use regex to exclude lines with extra characters?

I have a bunch of email addresses:
abc#google.com
bdc#yahoo.com
\\ske#google.com
I'd like to delete the highlighted line (the third one) because it contains extra characters other than #, ., and letters. How do I do this?
Through awk,
$ awk '/^\w+#\w+/{print}' file
abc#google.com
bdc#yahoo.com
Awk searches for lines that start with one or more word characters, followed by a # symbol, followed again by one or more word characters. If it finds one, it prints the whole line.
The line \\ske#google.com doesn't start with a word character, so it is not printed.
You can use this sed:
sed -i.bak -n '/^[[:alnum:]]*#/p' file
You can use vim to take care of it too:
vim -c 'v/^[[:alnum:]]*#/d' -c 'wq' file
You could also use a perl module:
perl -ne 'use Email::Valid; print if Email::Valid->address($_)'
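If the rule is taken literally (only letters, periods, and a single # separator are allowed), a stricter filter can anchor the pattern at both ends of the line. This is a sketch under that assumption; the function name filter_addresses is made up, and the sample's # separator is kept as-is.

```shell
# filter_addresses (hypothetical name): keep only lines made of letters
# and periods, one "#" separator, then letters and periods again.
# Anchoring with ^ and $ rejects stray characters anywhere on the line.
filter_addresses() {
  grep -E '^[[:alpha:].]+#[[:alpha:].]+$' "$@"
}
```

Unlike a pattern anchored only at the start, this also rejects lines with unwanted characters after the separator.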

Insert space after period using sed

I've got a bunch of files that have sentences ending like this: \#.Next sentence. I'd like to insert a space after the period.
Some occurrences of \#. are already followed by a space, however, so my regex checks whether the character after the period is a capital letter.
Because I'm matching one character past the period, I can't simply replace \#. with \#. plus a space, and because I don't know which character follows the period, I'm stuck.
My command currently:
sed -i .bak -E 's/\\#\.[A-Z]/<SOMETHING IN HERE>/g' *.tex
How can I grab the last letter of the matching string to use in the replacement regex?
EDIT: For the record, I'm using a BSD version of sed (I'm using OS X) - from my previous question regarding sed, apparently BSD sed (or at least, the Apple version) doesn't always play nice with GNU sed regular expressions.
The right command should be this:
sed -i.bak -E "s/\\\#.(\S)/\\\#. \1/g" *.tex
With it, you match any \#. followed by a non-whitespace character (\S) and insert a space, by replacing the whole match with '\#. ' plus the non-whitespace character just captured.
Use this sed command:
sed -i.bak -E 's/(\\#\.)([A-Z])/\1 \2/g' *.tex
OR better:
sed -i.bak -E 's/(\\#\.)([^ \t])/\1 \2/g' *.tex
which will insert space if \#. is not followed by any white-space character (not just capital letter).
This might work for you:
sed -i .bak -E 's/\\#\. ?/\\#. /g' *.tex
Explanation:
If there's a space there replace it with a space, otherwise insert a space.
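I think the following would be correct (with -E):
s/(\\#\.)([^[:space:]])/\1 \2/g
This replaces the expression only when it is not followed by white-space, and the second capture group puts back the character that was matched after the period.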
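To try the capture-group approach without touching any files, it can be wrapped in a function that reads standard input. This is a sketch; the name add_space_after_period is hypothetical. It sticks to -E and POSIX character classes, which both GNU and BSD sed are expected to accept.

```shell
# add_space_after_period (hypothetical name): insert one space between
# "\#." and the next character when that character is not white-space.
# Group 1 holds the literal \#. and group 2 the character that follows it.
add_space_after_period() {
  sed -E 's/(\\#\.)([^[:space:]])/\1 \2/g' "$@"
}
```

Once the output looks right on a sample, the same expression can be used with -i and a backup suffix as in the answers above.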

Replace all whitespace with a line break/paragraph mark to make a word list

I am trying to make a vocab list for a Greek text we are translating in class. I want to replace every space or tab character with a paragraph mark so that every word appears on its own line. Can anyone give me the sed command, and explain what it is that I'm doing? I’m still trying to figure sed out.
For reasonably modern versions of sed, edit the standard input to yield the standard output with
$ echo 'τέχνη βιβλίο γη κήπος' | sed -E -e 's/[[:blank:]]+/\n/g'
τέχνη
βιβλίο
γη
κήπος
If your vocabulary words are in files named lesson1 and lesson2, redirect sed’s standard output to the file all-vocab with
sed -E -e 's/[[:blank:]]+/\n/g' lesson1 lesson2 > all-vocab
What it means:
The character class [[:blank:]] matches either a single space character or a single tab character.
Use [[:space:]] instead to match any single whitespace character (commonly space, tab, newline, carriage return, form-feed, and vertical tab).
The + quantifier means match one or more of the previous pattern.
So [[:blank:]]+ is a sequence of one or more characters that are all space or tab.
The \n in the replacement is the newline that you want.
The /g modifier on the end means perform the substitution as many times as possible rather than just once.
The -E option tells sed to use POSIX extended regex syntax and in particular for this case the + quantifier. Without -E, your sed command becomes sed -e 's/[[:blank:]]\+/\n/g'. (Note the use of \+ rather than simple +.)
Perl Compatible Regexes
For those familiar with Perl-compatible regexes and a PCRE-capable sed, use \s+ to match runs of at least one whitespace character, as in
sed -E -e 's/\s+/\n/g' old > new
or
sed -e 's/\s\+/\n/g' old > new
These commands read input from the file old and write the result to a file named new in the current directory.
Maximum portability, maximum cruftiness
Going back to almost any version of sed since Version 7 Unix, the command invocation is a bit more baroque.
$ echo 'τέχνη βιβλίο γη κήπος' | sed -e 's/[ \t][ \t]*/\
/g'
τέχνη
βιβλίο
γη
κήπος
Notes:
Here we do not even assume the existence of the humble + quantifier and simulate it with a single space-or-tab ([ \t]) followed by zero or more of them ([ \t]*).
Similarly, assuming sed does not understand \n for newline, we have to include it on the command line verbatim.
The \ at the end of the first line of the command is a continuation marker that escapes the immediately following newline, and the remainder of the command is on the next line.
Note: There must be no whitespace preceding the escaped newline. That is, the end of the first line must be exactly backslash followed by end-of-line.
This error-prone process helps one appreciate why the world moved to visible characters, and you will want to exercise some care in trying out the command with copy-and-paste.
Note on backslashes and quoting
The commands above all used single quotes ('') rather than double quotes (""). Consider:
$ echo '\\\\' "\\\\"
\\\\ \\
That is, the shell applies different escaping rules to single-quoted strings as compared with double-quoted strings. You typically want to protect all the backslashes common in regexes with single quotes.
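The difference can also be checked by counting characters. In this small sketch the variable names single and double are chosen only for illustration.

```shell
# Single quotes pass every backslash through untouched; inside double
# quotes the shell turns each \\ pair into one backslash.
single='\\\\'
double="\\\\"
printf 'single: %s (%d chars)\n' "$single" "${#single}"
printf 'double: %s (%d chars)\n' "$double" "${#double}"
```

The first string keeps all four backslashes, while the second is left with two, which is why a regex written in double quotes may not reach sed in the form you typed.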
The portable way to do this is:
sed -e 's/[ \t][ \t]*/\
/g'
That's an actual newline between the backslash and the slash-g. Many sed implementations don't know about \n, so you need a literal newline. The backslash before the newline prevents sed from getting upset about the newline. (in sed scripts the commands are normally terminated by newlines)
With GNU sed you can use \n in the substitution, and \s in the regex:
sed -e 's/\s\s*/\n/g'
GNU sed also supports "extended" regular expressions (that's egrep style, not perl-style) if you give it the -r flag, so then you can use +:
sed -r -e 's/\s+/\n/g'
If this is for Linux only, you can probably go with the GNU command, but if you want this to work on systems with a non-GNU sed (eg: BSD, Mac OS-X), you might want to go with the more portable option.
All of the examples listed above for sed break on one platform or another. None of them work with the version of sed shipped on Macs.
However, Perl's regex works the same on any machine with Perl installed:
perl -pe 's/\s+/\n/g' file.txt
If you want to save the output:
perl -pe 's/\s+/\n/g' file.txt > newfile.txt
If you want only unique occurrences of words:
perl -pe 's/\s+/\n/g' file.txt | sort -u > newfile.txt
Option 1
echo $(cat testfile)
Option 2
tr ' ' '\n' < testfile
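The tr approach can be extended to handle tabs and runs of several blanks. The following is a sketch; the function name words_per_line is made up. It relies on the -s option, which squeezes repeated output characters into one, so consecutive blanks produce a single line break.

```shell
# words_per_line (hypothetical name): read standard input and put each
# word on its own line. Spaces and tabs become newlines, and -s squeezes
# runs of newlines so repeated blanks do not leave empty lines behind.
words_per_line() {
  tr -s '[:blank:]' '\n'
}
```

It would be used as words_per_line < testfile. Note that blank lines already present in the input are squeezed away as well, and a line that starts with blanks still yields one empty line at the very beginning of the output.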
This should do the work (the + quantifier needs extended regex syntax, hence -E; \t and \n as written here assume GNU sed):
sed -E -e 's/[ \t]+/\n/g'
[ \t] means a space OR a tab. If you want any kind of space, you could also use \s.
[ \t]+ means one or more spaces or tabs.
s/x/y/ means replace the pattern x with y (here \n is a newline).
The g at the end means the replacement is repeated for every occurrence on each line.
You could use POSIX [[:blank:]] to match a horizontal white-space character.
sed 's/[[:blank:]]\+/\n/g' file
or you may use [[:space:]] instead of [[:blank:]] also.
Example:
$ echo 'this is a sentence' | sed 's/[[:blank:]]\+/\n/g'
this
is
a
sentence
You can also do it with xargs:
cat old | xargs -n1 > new
or
xargs -n1 < old > new
Using gawk:
gawk '{$1=$1}1' OFS="\n" file