Use grep to find a specific pattern in a line - regex

I am trying to find a specific pattern in a text file using grep inside a Bourne shell script.
The style is: word1 word2 word3
I want to print everything that is not of that style. So far I used
grep -e '[[:space:]]\{2,\}' somefile
to find two or more consecutive whitespace characters between the words, but I cannot figure out how to make it enforce the three-words-per-line limit as well.
My other method would be to also count how many words there are per line and if it exceeds 3, to print the line. Or to check for a white space at the end of the 3rd word, but I am unsure how that would be formatted.

I'm not sure if that's what you wanted, but here:
]$ cat input
one
one two
one two three
one two three four
]$ grep -v -e "^[^[:space:]]\+ [^[:space:]]\+ [^[:space:]]\+$" input
one
one two
one two three four
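This matches:
one or more non-whitespace characters: [^[:space:]]\+
followed by a single space
three such words in total, with no space after the last one
anchored to the whole line: ^...$
and -v inverts the match, so only the lines that do not fit are printed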
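Here is a minimal sketch of the same idea wrapped in a function, assuming that a "word" is any run of non-whitespace characters and that words are separated by exactly one space. The name not_three_words is made up for illustration; -E is used so that + needs no backslash.

```shell
# not_three_words [FILE...]: print lines that are NOT exactly
# "word1 word2 word3" (three words, single spaces, nothing else).
# The function name is hypothetical, not part of any tool.
not_three_words() {
    grep -Ev '^[^[:space:]]+ [^[:space:]]+ [^[:space:]]+$' "$@"
}

printf '%s\n' 'one two three' 'one  two three' 'one two three four' | not_three_words
# prints:
# one  two three
# one two three four
```

Inside a script you would call it as not_three_words somefile.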

Related

Bash: How can I grep a line for multiple instances of the same string?

I've got several lines that look like this:
aaaaaaaaxzaaaaaaaaaaaaaa
bbbbbbbbbbbbxzbbbbbbxzbb
ccxzcccccccccccccccccxzc
dddddddxzddddddddddddddd
Inside two of those lines, there are two instances of xz characters. I want grep to look for xz twice in the same line and output the lines that it matches on. If xz appears once, I don't want to know.
Running the following:
cat lines | grep "xz"
Tells me every line with xz on, but I only want to see lines with xz appearing twice.
How can I make the pattern search repeat in the same line?
You can use
cat lines | grep 'xz.*xz'
Or just
grep 'xz.*xz' lines
The .* matches any run of characters (possibly empty, but never a newline) between the two xz, so the pattern finds lines with at least two occurrences.
If you need look-arounds, you will need the -P switch to enable Perl-compatible regexps.
awk is one way to go:
awk -F'xz' 'NF==3' file
or
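awk 'gsub(/xz/,"&")==2' file
(Replacing each match with & leaves the line unchanged, so the original line is printed rather than the line with every xz stripped out.) Another benefit of awk is that it makes it easy to check whether a pattern matched fewer than n times, exactly n times, or more than n times: just change == to <, <=, > or >=.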
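As a rough sketch of that flexibility, the count and the pattern can be passed in as parameters. The helper name lines_with_count is hypothetical, and the pattern is treated as an awk regular expression.

```shell
# lines_with_count N REGEX [FILE...]: print lines in which REGEX matches
# exactly N times (non-overlapping). Hypothetical helper name.
lines_with_count() {
    n=$1
    re=$2
    shift 2
    # gsub returns the number of replacements; replacing each match with
    # itself ("&") leaves the line unchanged, so the original text is printed.
    awk -v n="$n" -v re="$re" 'gsub(re, "&") == n' "$@"
}

printf '%s\n' 'aaxzaa' 'bbxzbbxzbb' 'xzxzxz' | lines_with_count 2 'xz'
# prints: bbxzbbxzbb
```

Changing == to >= inside the awk program would give "at least N times" instead.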
If you want to output the matching lines in full you don't need the options
grep 'xz.*xz' filename
will do

Print commands in history consisting in just one word

I want to print lines that contain a single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get the ones with single word only
one
more
line
EDIT: Guys, thank you for the answers. Almost all of them work for my test file. However, I wanted to list single-word lines in my bash history. When I try your answers like
history | your posted commands
all of them below fail. Some only print some numbers (maybe line numbers?).
You want to get all those commands in history that contain just one word. Since history prints the number of the command as the first column, you need to match the lines consisting of two fields.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
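A small sketch of this as a filter, assuming the default history format (padding, entry number, spaces, command) with no HISTTIMEFORMAT set. The name one_word_commands is made up, and the extra check on the first field is my own addition so that stray lines without a leading number are skipped.

```shell
# one_word_commands: read `history`-style lines on stdin and print the
# commands that consist of a single word. Hypothetical helper name.
# Assumes the default format: optional padding, entry number, spaces, command.
one_word_commands() {
    awk 'NF == 2 && $1 ~ /^[0-9]+$/ { print $2 }'
}

# In an interactive bash session: history | one_word_commands
printf '%s\n' '    1  ls' '    2  git status' '    3  pwd' | one_word_commands
# prints:
# ls
# pwd
```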
To rehash your problem, any line containing a space or nothing should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+  [A-Za-z]+$'
to match lines where the line number is followed by a single alphabetic token. Notice that there are two spaces between the line number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [a-zA-Z0-9_]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
Assuming your file is texts.txt, and if grep is not a hard requirement:
awk '{ if ( NF == 1 ) print }' texts.txt
If your single-word lines don't have a trailing space, you can also search for lines that contain no space at all:
grep -v " "
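Note that grep -v " " also lets empty lines and tab-separated words through. A sketch that avoids both, assuming a "word" is any run of non-whitespace characters and that leading or trailing blanks should be tolerated, is below. It uses POSIX character classes instead of \s and \S, and single_word_lines is a made-up name.

```shell
# single_word_lines [FILE...]: print lines holding exactly one word,
# optionally surrounded by whitespace. Hypothetical helper name.
# -x anchors the pattern to the whole line; -E enables the + operator.
single_word_lines() {
    grep -Ex '[[:space:]]*[^[:space:]]+[[:space:]]*' "$@"
}

printf '%s\n' 'this is a line' 'another line' 'one' 'more' 'line' 'last one' '' | single_word_lines
# prints:
# one
# more
# line
```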
I think that what you're looking for could be best described as a newline followed by a word with a negative lookahead for a space,
/\n\w+\b(?! )/g

Counting number of lines which contain a pattern

I have data in the following form:
<id_mytextadded1829>
<text1> <text2> <text3>.
<id_m_abcdef829>
<text4> <text5> <text6>.
<id_mytextadded1829>
<text7> <text2> <text8>.
<id_mytextadded1829>
<text2> <text1> <text9>.
<id_m_abcdef829>
<text11> <text12> <text2>.
Now I want to count the number of lines in which <text2> is present. I know I can do this with Python's regex, but a regex would only tell me whether a pattern is present in a line or not, whereas my requirement is to find a string which is present exactly in the middle of a line. I know sed is good for replacing contents of a line, but if I only want the number of lines instead of replacing anything, is it possible to do so using sed?
EDIT:
Sorry I forgot to mention. I want lines where <text2> occurs in the middle of the line. I dont want lines where <text2> occurs in the beginning or at the end of the line.
E.g. in the data shown above the number of lines which have <text2> in the middle are 2 (rather than 4).
Is there some way to get the number of lines that have <text2> in the middle, using Linux tools or Python?
I want lines where <text2> occurs in the middle of the line.
You could say:
grep -P '.+<text2>.+' filename
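to list the lines containing <text2> not at the beginning or the end of a line. Note that in the sample data every data line ends with a period, so the final .+ can match that period alone, and a line ending in <text2>. is still listed.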
In order to get only the count of matches, you could say:
grep -cP '.+<text2>.+' filename
You can use grep for this. For example, this will count number of lines in the file that match the ^123[a-z]+$ pattern:
egrep -c '^123[a-z]+$' file.txt
P.S. I'm not quite sure about the syntax and I don't have the possibility to test it at the moment. The regex is quoted so that the shell passes it to egrep unchanged.
Edit: the question is a bit tricky since we don't know for sure what your data is and what exactly you're trying to count in it, but it all comes down to correctly formulating a regular expression.
If we assume that <text2> is an exact sequence of characters that should be present in the middle of the line and should not be present at the beginning and in the end, then this should be the regex you're looking for: ^<text[^2]>.*text2.*<text[^2]>\.$
Using awk you can do this:
awk '$2~/text2/ {a++} END {print a}' file
2
It counts all lines that have text2 in the second field, i.e. in the middle of the line.
I want lines where <text2> occurs in the middle of the line. I dont
want lines where <text2> occurs in the beginning or at the end of the
line.
Try using grep with -c
grep -c '>.*<text2>.*<' file
Output:
2
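Where <text2> occurs anywhere in the line (this prints the line number of each matching line; pipe it through wc -l to get a count):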
sed -n "/<text2>/ =" filename
If you want it only in the middle (as clarified later in the comments):
sed -n "/[^ ] \{1,\}<text2> \{1,\}[^ ]/ =" filename
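Another way to read "in the middle" is by fields: the token counts only if it is neither the first nor the last whitespace-separated field. A minimal sketch under that assumption (the name count_middle is hypothetical) could look like this. The token is compared as a literal string, so the trailing period on the last field of each data line is handled naturally.

```shell
# count_middle TOKEN [FILE...]: count lines where TOKEN appears as a
# whitespace-separated field that is neither the first nor the last.
# Hypothetical helper name; TOKEN is compared literally, not as a regex.
count_middle() {
    token=$1
    shift
    awk -v t="$token" '
        { for (i = 2; i < NF; i++) if ($i == t) { n++; break } }
        END { print n + 0 }   # n + 0 prints 0 instead of an empty line
    ' "$@"
}
```

With the sample data from the question saved in a file, count_middle '<text2>' filename should print 2.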

Get string between two characters occurring many times in a line

I am trying to extract a single string out of a line having many segments in a key-value order, but I don't get it as it matches much more than I want to.
This is my example line:
|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|
This line is a kind of concatenation of many segments into one line. Now I want to extract the string at index 2 in the segment starting with SEGA.
So what I do is grep for this:
egrep -o 'SEGA(.*?)\~\|'
But it gives me the whole line, sometimes it gives me only the segment I am looking for. With the match I would split that segment by using the ~ character and take the third one.
Since I use .*? with the question mark I expected egrep to only match the content between SEGA and the very first occurrence of ~| which is right before SEGB and not the one at the end of SEGC or SEGB.
How can I tell grep to search for SEGA and give the whole content starting right after SEGA until THE VERY FIRST occurrence of ~|
You can use the -P(--perl-regexp) option in grep:
grep -oP '(?<=SEGA).*?(?=~\|)' file
If you want to include the trailing ~|, please remove the lookahead (?=...).
I think .*? (lazy) does not exist in egrep.
I'd suggest you break the line into lines on | and then grep from those:
$ echo "|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|" | sed -e 's/|/\n/g' | grep ^SEGA
SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~
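Building on the split-then-filter idea, here is a sketch that also pulls out one field of the segment. It assumes indices are counted from 0 with the segment name itself at index 0, so index 2 of SEGA is MAGIC. The name segment_field is made up for illustration.

```shell
# segment_field NAME INDEX: read a |-delimited line on stdin, find the
# segment whose first ~-separated field is NAME, and print field INDEX
# (0-based, NAME itself being index 0). Hypothetical helper name.
segment_field() {
    seg=$1
    idx=$2
    # Put each segment on its own line, then let awk split on "~".
    tr '|' '\n' | awk -F'~' -v seg="$seg" -v idx="$idx" '$1 == seg { print $(idx + 1) }'
}

echo '|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|' | segment_field SEGA 2
# prints: MAGIC
```

Because each segment ends up on its own line before matching, no lazy quantifier is needed.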

How to use sed to remove only double empty lines?

I found this question and answer on how to remove triple empty lines. However, I need the same only for double empty lines. Ie. all double blank lines should be deleted completely, but single blank lines should be kept.
I know a bit of sed, but the proposed command for removing triple blank lines is over my head:
sed '1N;N;/^\n\n$/d;P;D'
This would be easier with cat:
cat -s
I've commented the sed command you don't understand:
sed '
## In first line: append second line with a newline character between them.
1N;
## Do the same with third line.
N;
## When found three consecutive blank lines, delete them.
## Here there are two newlines but you have to count one more deleted with last "D" command.
/^\n\n$/d;
## The combo "P+D+N" simulates a FIFO, "P+D" prints and deletes from one side while "N" appends
## a line from the other side.
P;
D
'
Remove 1N because we need only two lines in the 'stack', and the second N is enough for that. Then change /^\n\n$/d; to /^\n$/d; to delete every pair of consecutive blank lines.
A test:
Content of infile:
1
2
3
4
5
6
7
Run the sed command:
sed '
N;
/^\n$/d;
P;
D
' infile
That yields:
1
2
3
4
5
6
7
sed '/^$/{N;/^\n$/d;}'
This deletes only pairs of consecutive blank lines. Try it on a file to see exactly how it behaves. When sed reaches a blank line, it enters the braces.
Normally sed reads one line at a time. N appends the next line to the pattern space, with the two lines separated by a newline. If that next line is also empty, the pattern space holds just that newline.
Only in that case does /^\n$/ match and d run; otherwise d is skipped. d deletes the whole pattern space and starts the next cycle.
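To see the behaviour described above on a small input, here is the same expression wrapped in a function. The name delete_blank_pairs is hypothetical. What happens when the very last line of the input is blank depends on how your sed handles N at end of input, so the example avoids that case.

```shell
# delete_blank_pairs [FILE...]: remove each pair of consecutive blank
# lines, keeping single blank lines. Hypothetical wrapper name around the
# sed expression from the answer above.
delete_blank_pairs() {
    sed '/^$/{N;/^\n$/d;}' "$@"
}

printf 'a\n\n\nb\n\nc\n' | delete_blank_pairs
# prints "a", "b", one blank line, then "c":
# the pair between a and b is removed, the single blank before c is kept
```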
This would be easier with awk:
awk -v RS='\n\n\n' 1
However, the solution above only deletes the first run of three consecutive blank lines.
To delete all runs of three consecutive blank lines, use the command below:
sed '1N;N;/^\n\n$/ { N;s/^\n\n//;N;D; };P;D' filename
As far as I can tell none of the solutions here work. cat -s as suggested by #DerMike isn't POSIX compliant (and it's less convenient if you're already using sed for another transformation), and sed 'N;/^\n$/d;P;D' as suggested by #Birei sometimes deletes more newlines than it should.
Instead, sed ':L;N;s/^\n$//;t L' works. For POSIX compliance use sed -e :L -e N -e 's/^\n$//' -e 't L', since POSIX doesn't specify using ; to separate commands.
Example:
$ S='foo\nbar\n\nbaz\n\n\nqux\n\n\n\nquxx\n';\
> paste <(printf "$S")\
> <(printf "$S" | sed -e 'N;/^\n$/d;P;D')\
> <(printf "$S" | sed -e ':L;N;s/^\n$//;t L')
foo     foo     foo
bar     bar     bar

baz     baz     baz
        qux
                qux
qux     quxx
                quxx


quxx
$
Here we can see the original file, #Birei's solution, and my solution side-by-side. #Birei's solution deletes all blank lines separating baz and qux, while my solution removes all but one as intended.
Explanation:
:L Create a new label called L.
N Read the next line into the current pattern space,
separated by an "embedded newline."
s/^\n$// Replace the pattern space with the empty pattern space,
corresponding to a single non-embedded newline in the output,
if the current pattern space only contains a single embedded newline,
indicating that a blank line was read into the pattern space by `N`
after a blank line had already been read from the input.
t L Branch to label L if the previous `s` command successfully
substituted text in the pattern space.
In effect, this deletes one recurrent blank line at a time, reading each into the pattern space as an embedded newline with N and deleting them with s.
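If the question is read literally (a run of exactly two blank lines disappears completely, while runs of any other length are left alone), a rough awk sketch could look like the following. That reading is my assumption, and drop_double_blank is a made-up name. It buffers the length of each blank run and decides what to emit once the run ends.

```shell
# drop_double_blank [FILE...]: delete runs of exactly two blank lines;
# runs of one, three or more blank lines are printed unchanged.
# Hypothetical helper name; the "exactly two" rule is an assumption.
drop_double_blank() {
    awk '
        # Emit the buffered blank lines unless the run was exactly two long.
        function flush(   i) {
            if (blank != 2)
                for (i = 0; i < blank; i++) print ""
            blank = 0
        }
        /^$/ { blank++; next }   # count blank lines instead of printing them
        { flush(); print }       # a non-blank line ends the current run
        END { flush() }          # handle a run at the end of the input
    ' "$@"
}
```

Usage would be drop_double_blank infile, or as the last stage of a pipeline.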
Just pipe it through the uniq command and all empty lines, regardless of how many there are, will be shrunk to just one. Simpler is better.
Clarification: As Marlar stated, this is not a solution if you have other consecutive duplicated non-blank lines that you do not want to lose. It is a solution in other cases, such as cleaning up configuration files, which is what I was after when I found this question. I did indeed solve my problem using just uniq.