Counting number of lines which contain a pattern - regex

I have data in the following form:
<id_mytextadded1829>
<text1> <text2> <text3>.
<id_m_abcdef829>
<text4> <text5> <text6>.
<id_mytextadded1829>
<text7> <text2> <text8>.
<id_mytextadded1829>
<text2> <text1> <text9>.
<id_m_abcdef829>
<text11> <text12> <text2>.
Now I want to count the number of lines in which <text2> is present. I know I can do this using Python's regex, but a regex would only tell me whether a pattern is present in a line or not. My requirement, on the other hand, is to find a string which is present exactly in the middle of a line. I know sed is good for replacing contents in a line, but if I only want the number of lines instead of a replacement, is it possible to do so using sed?
EDIT:
Sorry, I forgot to mention: I want lines where <text2> occurs in the middle of the line. I don't want lines where <text2> occurs at the beginning or at the end of the line.
E.g. in the data shown above, the number of lines which have <text2> in the middle is 2 (rather than 4).
Is there some way to get the count of lines which have <text2> in the middle, using Linux tools or Python?

I want lines where <text2> occurs in the middle of the line.
You could say:
grep -P '.+<text2>.+' filename
to list the lines containing <text2> not at the beginning or the end of a line.
In order to get only the count of matches, you could say:
grep -cP '.+<text2>.+' filename
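Note that in the sample data every text line ends with a period, so a pattern like .+<text2>.+ also accepts a line whose last token is <text2>. A minimal sketch that compares whole fields instead, assuming tokens are separated by whitespace (count_middle is a hypothetical helper name, not part of any tool):

```shell
# count_middle TOKEN FILE
# Prints the number of lines in which TOKEN is a whole field that is
# neither the first nor the last field of the line.
# Assumption: fields are separated by whitespace.
count_middle() {
    awk -v tok="$1" '
        {
            # Only look at fields 2 .. NF-1, i.e. the "middle" of the line.
            for (i = 2; i < NF; i++)
                if ($i == tok) { n++; break }
        }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    ' "$2"
}
```

Called as count_middle '<text2>' filename, this prints 2 for the data in the question.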

You can use grep for this. For example, this will count the number of lines in the file that match the ^123[a-z]+$ pattern:
egrep -c '^123[a-z]+$' file.txt
P.S. I don't have the possibility to test it at the moment, so I'm not quite sure about the syntax. The regex is quoted so that the shell does not interpret characters such as [ and $ before grep sees them.
Edit: the question is a bit tricky since we don't know for sure what your data is and what exactly you're trying to count in it, but it all comes down to correctly formulating a regular expression.
If we assume that <text2> is an exact sequence of characters that should be present in the middle of the line and should not be present at the beginning and in the end, then this should be the regex you're looking for: ^<text[^2]>.*text2.*<text[^2]>\.$

Using awk you can do this:
awk '$2~/text2/ {a++} END {print a}' file
2
It will count all lines with text2 in the second field, which is the middle of the line for the three-field lines shown here.

I want lines where <text2> occurs in the middle of the line. I dont
want lines where <text2> occurs in the beginning or at the end of the
line.
Try using grep with -c
grep -c '>.*<text2>.*<' file
Output:
2

To print the numbers of the lines where <text2> occurs (anywhere in the line):
sed -n "/<text2>/ =" filename
If you want it only in the middle (as clarified later in the comment):
sed -n "/[^ ] \{1,\}<text2> \{1,\}[^ ]/ =" filename
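The = command prints line numbers rather than a count. To get a single number out of sed, one option is to print the matching lines with p and count them with wc -l. A small sketch (count_middle_sed is a hypothetical name, and the pattern assumes tokens are separated by plain spaces):

```shell
# count_middle_sed FILE
# Prints how many lines contain <text2> with a space-separated token
# on both sides of it.
count_middle_sed() {
    # -n suppresses default output; p prints only the matching lines.
    # tr strips the padding that some wc implementations add.
    sed -n '/[^ ] \{1,\}<text2> \{1,\}[^ ]/p' "$1" | wc -l | tr -d ' '
}
```

For the data in the question, count_middle_sed filename prints 2.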

regex in sed removing only the first occurrence from every line

I have the following file I would like to clean up
cat file.txt
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
My desired output is:
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
I would like to remove everything between ":" and the first occurrence of "or".
I tried sed 's/MNS:d*?or /MNS:/g', though it removes the second "or" as well.
I tried every option in https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/
to no avail. Should I create alias sed='perl -pe'? It seems that sed does not properly support regex.
perl is more suitable here because we need lazy (non-greedy) matching.
perl -pe 's|(:.*?or +)(.*)|:$2|' Input_file
By using .*?or we match only up to the nearest (first) occurrence of the string or in the line.
This might work for you (GNU sed):
sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n//}' file
If a line contains : followed by the word or, then substitute the first occurrence of the word or with a unique delimiter (e.g.\n) and then remove everything between : and the unique delimiter.
Regarding "I would like to remove everything between ":" and the first occurrence of "or"": no, you wouldn't. The first occurrence of or in the 2nd line of the sample input is at the start of orweqqwe. The text immediately after : looks like it could be any set of characters, so couldn't it contain a standalone or, e.g. MNS:2 or eqqwe or M+ GYPA*02 or GYPA*N?
Given that and the fact it's apparently a fixed number of characters to be removed on every line, it seems like this is what you should really be using:
$ sed 's/:.\{14\}/:/' file
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
If it is certain that or always occurs twice per line, as in the provided example, please try:
sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/' file.txt
Result:
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
Otherwise, perl is a better solution, since it supports shortest (non-greedy) matching, as RavinderSingh13's answer shows.
ex supports lazy matching with \{-}:
ex -s '+%s/:\zs.\{-}or //g|wq' input_file
The pattern :\zs.\{-}or matches any character after the first : up to the first or.

Print commands in history consisting in just one word

I want to print lines that contain a single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get the ones with single word only
one
more
line
EDIT: Guys, thank you for the answers. Almost all of them work for my test file. However, I wanted to list single-word lines in my bash history. When I try your answers like
history | your posted commands
all of them fail. Some only print some numbers (maybe line numbers?).
You want to get all those commands in history that contain just one word. Considering that history prints the number of the command as a first column, you need to match the lines consisting of two words.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
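Because history prefixes every entry with its number, the filter can be tried on any text of the same shape. A sketch, assuming the default history format without HISTTIMEFORMAT timestamps (one_word_commands is a made-up name):

```shell
# Reads history-style lines ("  NUMBER  COMMAND ...") on stdin and prints
# only the commands that consist of a single word.
one_word_commands() {
    # Field 1 is the history number, so a one-word command has NF == 2.
    awk 'NF == 2 { print $2 }'
}
```

In an interactive shell it would be used as history | one_word_commands.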
To rehash your problem, any line containing a space or nothing should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+  [A-Za-z]+$'
to match lines where the line number is followed by a single alphabetic token. Notice that there will be two spaces between the line number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [a-zA-Z0-9_]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
Assuming your file is texts.txt, and if grep is not a requirement, then:
awk '{ if ( NF == 1 ) print }' texts.txt
If your single-word lines don't have a space at the end, you can also search for lines without a space:
grep -v " "
I think that what you're looking for could be best described as a newline followed by a word with a negative lookahead for a space,
/\n\w+\b(?! )/g

how to grep exact string match across 2 files

I've UTF-8 plain text lists of usernames, 1 per line, in list1.txt and list2.txt. Note, in case pertinent, that usernames may contain regex characters e.g. ! ^ . ( and such as well as spaces.
I want to get and save to matches.txt a list of all unique values occurring in both lists. I've little command line expertise but this almost gets me there:
grep -Ff list1.txt list2.txt > matches.txt
...but that is treating "jdoe" and "jdoe III" as a match, returning "jdoe III" as the matched value. This is incorrect for the task. I need the per-line pattern match to be the whole line, i.e. from ^ to $. I've tried adding the -x flag but that gets no matches at all (edit: see comment to accepted answer - I got the flag order wrong).
I'm on OS X 10.9.5 and I don't have to use grep - another command line (tool) solving the problem will do.
All you need to do is add the -x flag to your grep query:
grep -Fxf list1.txt list2.txt > matches.txt
The -x flag will restrict matches to full line matches (each PATTERN becomes ^PATTERN$). I'm not sure why your attempt at -x failed. Maybe you put it after the -f, which must be immediately followed by the first file?
This awk will be handier than grep here:
awk 'FNR==NR{a[$0]; next} $0 in a' list1.txt list2.txt > matches.txt
$0 is the line, FNR is the line number within the current file, and NR is the overall line number (they are only equal while reading the first file). a[$0] creates an entry in an associative array (hash) whose key is the line. next ensures that the remaining clause (the $0 in a) does not run for lines of the first file. $0 in a is true when the current line is a key in the array a, so only lines present in both files are displayed. Their order is the order of occurrence in the second file.
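Since the question asks for unique values, a line repeated in the second file would otherwise be printed more than once. A sketch of the same two-file idiom with a duplicate filter added (common_unique and the seen array are names chosen for illustration):

```shell
# common_unique FILE1 FILE2
# Prints each line that occurs in both files exactly once, in the order of
# its first occurrence in FILE2. Lines are compared as whole, literal
# strings, so regex characters and spaces in usernames need no escaping.
common_unique() {
    awk '
        FNR == NR { a[$0]; next }        # first file: remember every line
        ($0 in a) && !seen[$0]++         # second file: print a match once
    ' "$1" "$2"
}
```

Usage would be common_unique list1.txt list2.txt > matches.txt. With this, jdoe in one list and jdoe III in the other are not treated as a match.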
A very simple and straightforward way to do it that doesn't require one to do all sorts of crazy things with grep is as follows
cat list1.txt list2.txt|grep match > matches.txt
Not only that, but it's also easier to remember, (especially if you regularly use cat).
grep -Fwf file1 file2 would match whole words rather than whole lines.

Delete lines that contain text with spaces

So I have a very large file that I created by combining a number of word lists. The problem is that I made the mistake of not cleaning up the original word lists before combining and sorting them, so there are a number of lines peppered throughout the file that are sentences, ASCII art, or other information that I don't want in there.
For right now, I'd like to delete any line that contains one or more spaces. I don't want to remove the spaces, I want to remove the entire line if it has a space in it.
I'm terrible with regex, and was hoping someone could help me out.
Thanks.
There is a short command:
sed -e '/\s/d'
It runs sed with the script /\s/d, which means:
for each line matching /\s/ (containing at least one space or tab)
run the command d (delete the line)
So only lines without any whitespace will be kept.
This command will not delete empty lines.
Use it like:
sed -e '/\s/d' < input_file.txt > output_file.txt
I guess an inverted grep for spaces will do the job:
cat your_file.txt | grep -v ' ' > output.txt
It will filter the file, removing any lines with spaces.
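The two commands above differ slightly: \s in sed is a GNU extension that also matches tabs, while grep -v ' ' only looks for the space character. A portable sketch using the POSIX class [[:space:]], which covers both (drop_whitespace_lines is a hypothetical name):

```shell
# Reads lines on stdin and prints only those that contain no whitespace
# at all (no spaces, no tabs). Empty lines are kept.
drop_whitespace_lines() {
    # -v inverts the match: keep lines that do NOT contain whitespace.
    grep -v '[[:space:]]'
}
```

It could be used as drop_whitespace_lines < input_file.txt > output_file.txt.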

Get string between two characters occurring many times in a line

I am trying to extract a single string out of a line having many segments in a key-value order, but I don't get it as it matches much more than I want to.
This is my example line:
|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|
This line is a kind of concatenation of many segments into one line. Now I want to extract the string at index 2 in the segment starting with SEGA.
So what I do is grep for this:
egrep -o 'SEGA(.*?)\~\|'
But it gives me the whole line, sometimes it gives me only the segment I am looking for. With the match I would split that segment by using the ~ character and take the third one.
Since I use .*? with the question mark I expected egrep to only match the content between SEGA and the very first occurrence of ~| which is right before SEGB and not the one at the end of SEGC or SEGB.
How can I tell grep to search for SEGA and give the whole content starting right after SEGA until THE VERY FIRST occurrence of ~|
You can use the -P(--perl-regexp) option in grep:
grep -oP '(?<=SEGA).*?(?=~\|)' file
If you want to include the trailing ~|, please remove the lookahead (?=...).
I think .*? (lazy) does not exist in egrep.
I'd suggest you break the line into lines on | and then grep from those:
$ echo "|SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~|SEGB~34~12.11.2011~3~M~O~|SEGC~HELLO~WORLD~|" | sed -e 's/|/\n/g' | grep ^SEGA
SEGA~1~MAGIC~DESCRIPTION~~~M~TEST~
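Building on the split-into-lines idea, the field itself can be picked out in the same pipeline. A sketch, assuming segments are delimited by | and fields by ~ as in the example line (sega_field is a made-up name):

```shell
# Reads the concatenated record on stdin and prints the field at index 2
# (counting the segment name as index 0) of the segment named SEGA.
sega_field() {
    # Put every segment on its own line, then split each one on "~".
    # awk fields are 1-based, so index 2 is $3.
    tr '|' '\n' | awk -F'~' '$1 == "SEGA" { print $3 }'
}
```

For the example line in the question, piping it into sega_field prints MAGIC.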