So I have a very large file that I created by combining a number of word lists. The problem is that I made the mistake of not cleaning up the original word lists before combining and sorting them, so there are a number of lines peppered throughout the file that are sentences, ASCII art, or other information that I don't want in there.
For right now, I'd like to delete any line that contains one or more spaces. I don't want to remove the spaces, I want to remove the entire line if it has a space in it.
I'm terrible with regex, and was hoping someone could help me out.
Thanks.
There is a short command:
sed -e '/\s/d'
It runs sed with the script /\s/d, which means:
for each line matching /\s/ (i.e. containing at least one space or tab),
run the command d, which deletes the line.
So only lines without any whitespace will be kept.
This command will not delete empty lines.
Use it like:
sed -e '/\s/d' < input_file.txt > output_file.txt
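As a quick sanity check, here's a minimal sketch on a fabricated four-line sample (note that \s is a GNU sed extension; use [[:space:]] for portability):

```shell
# Sample: two clean words, a sentence, and a tab-indented line.
# The sed script deletes every line containing a space or a tab.
printf 'apple\na full sentence here\nbanana\n\tindented\n' | sed -e '/\s/d'
# prints:
# apple
# banana
```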
An inverted grep for spaces will do the job:
grep -v ' ' your_file.txt > output.txt
It will filter the file, removing any lines with spaces.
Related
I need some help with sed or awk.
How can I remove a line, but only if it is followed by a line that starts with the same character (in this case >)?
For example, I have this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
I want to get this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
Note that not all the lines have the same numbers but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce that would be really great.
Thank you so much!
If the whole file follows that pattern (some number of lines starting with >, of which you want only the last, followed by a single line that should always be printed), you could use something like this:
awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
If the line starts with >, then it is remembered (stored in the variable latest) but not printed. If the line doesn't start with >, then it is printed, but only after first printing whatever was most recently stored in latest.
The conditional means each printed > line will appear only once, even if there are multiple non-> lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:
awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'
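A quick sketch of the first version on a fabricated three-line excerpt (two consecutive headers followed by one sequence line):

```shell
# Only the last of the consecutive > headers should survive.
printf '>5_SRR1422298\n>5_SRR1422294\nCGTCAG\n' |
  awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
# prints:
# >5_SRR1422294
# CGTCAG
```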
The desired result can easily be achieved with the uniq command and its -w (--check-chars=N) option:
uniq -w 3 testfile
The output:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
-w, --check-chars=N
compare no more than N characters in lines
http://man7.org/linux/man-pages/man1/uniq.1.html
It compares the first N characters of each line to decide whether lines are repeated. Note that uniq keeps the first line of each run of duplicates, so where the question asks for the last of the duplicate headers, the output above differs (e.g. >5_SRR1422298 is kept rather than >5_SRR1422294).
Try this: if your data is the same as the given sample Input_file, the following may help:
awk '/^>/{A=$0;next} {print A ORS $0;A=""}' Input_file
This might work for you (GNU sed):
sed 'N;/^>.*\n>/!P;D' file
Read two lines into the pattern space and do not print the first of these lines if the first and second lines begin with >.
sed 'N;/^>.*\n\w/!D' file #(GNU sed)
N: read the next line into the pattern space. /^>.*\n\w/!D: unless the first line starts with ">" and the second begins with a word character, delete the first line and restart the cycle with what remains.
I want to print lines that contain a single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get only the ones with a single word:
one
more
line
EDIT: Thank you all for the answers. Almost all of them work for my test file. However, what I actually wanted was to list single-word lines in my bash history. When I try your answers like
history | your posted commands
all of them fail. Some only print numbers (line numbers, perhaps?).
You want to get all those commands in history that contain just one word. Since history prints the command number as a first column, you need to match lines consisting of two fields.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
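Since history output is hard to reproduce in a script, here's a sketch using fabricated numbered lines as stand-ins (note that if HISTTIMEFORMAT is set, history adds a timestamp column and the field count shifts):

```shell
# Field 1 is the history number, field 2 the command; NF==2 keeps one-word commands.
printf '  1  ls -l\n  2  pwd\n  3  history\n' | awk 'NF==2 {print $2}'
# prints:
# pwd
# history
```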
To rehash your problem, any line containing a space or nothing should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+ +[A-Za-z]+$'
to match lines where the line number is followed by a single alphabetic token. The + after the space allows for the two spaces history typically prints between the number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [A-Za-z0-9_]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
Assuming your file is texts.txt, and if grep is not the only acceptable tool:
awk '{ if ( NF == 1 ) print }' texts.txt
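A sketch on the sample input from the question:

```shell
# NF is the number of whitespace-separated fields; keep lines with exactly one.
printf 'this is a line\nanother line\none\nmore\nline\nlast one\n' |
  awk '{ if ( NF == 1 ) print }'
# prints:
# one
# more
# line
```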
If your single-word lines don't have a trailing space, you can also simply search for lines without a space:
grep -v " "
I think that what you're looking for could best be described as a newline followed by a word with a negative lookahead for a space:
/\n\w+\b(?! )/g
This has been driving me nuts for a while now. Actually, make that all damn night!!!
I have a file of contact information, now only slightly mangled, that I'm trying to fix in Vim. The file has 13k lines, so I really, REALLY don't want to have to fix it manually. I believe the final issue is that, for some dumb reason, many of the lines run together with the next line. It is a csv file with newline characters separating the lines. Unfortunately, there are maybe a couple of hundred places where a newline is missing in the middle of what should be two lines.
Here is an example of the file where this is occurring:
Freddy,Bauhof,fabaof#garbage.net,16126 Garbage Drive,Spring,TX,77 379,5555550440,M,1/1/14,14:23:57,256.241.24.29^#
Natasha,Moore,ndivy#garbage.com,3715 Garbage Rd,Louisville,KY,40218,5555553358,F,1/1/14,3:12:09,74.256.182.12^#MaryAnn,Haase,mahase#garbage.net,303 N Garbage Rd,Norfolk,NE,68701,5555559031,M,12/31/13,7:20:21,69.256.211.147^#
Jonathan,Golden,jongolden#garbage.com,11 Garbage Dr,GlenHead,NY,115 45,5555556712,M,1/1/14,17:28:09,256.195.83.118^#
What I am trying to do is simply insert a newline/carriage return/whatever that will actually break the middle line into two after "12^#" and immediately before "MaryAnn". Thanks in advance for any help with this.
Run this in vim: :%s/\^#\(\S\)/^#\r\1/g
It inserts a line break between ^# and any following non-whitespace character.
Foo^#
Bar^#Baz^#
Qux^#
Becomes
Foo^#
Bar^#
Baz^#
Qux^#
In vim, this line would work:
%s/\^#\zs\ze\S/\r/g
If you are open to the shell, grep can help you:
kent$ cat f
foo^#
foo2^#bar^#baz^#
blah^#
kent$ grep -oP '.*?\^#' f
foo^#
foo2^#
bar^#
baz^#
blah^#
sed too:
kent$ sed -r 's/\^#(.)/^#\n\1/g' f
foo^#
foo2^#
bar^#
baz^#
blah^#
I have data in the following form:
<id_mytextadded1829>
<text1> <text2> <text3>.
<id_m_abcdef829>
<text4> <text5> <text6>.
<id_mytextadded1829>
<text7> <text2> <text8>.
<id_mytextadded1829>
<text2> <text1> <text9>.
<id_m_abcdef829>
<text11> <text12> <text2>.
Now I want to count the number of lines in which <text2> is present. I know I can do this with python's regex, but regex only tells me whether a pattern is present in a line or not. My requirement is to find a string that is present exactly in the middle of a line. I know sed is good for replacing content in a line, but instead of replacing, can I get just the number of matching lines using sed?
EDIT:
Sorry, I forgot to mention: I want lines where <text2> occurs in the middle of the line. I don't want lines where <text2> occurs at the beginning or at the end of the line.
E.g. in the data shown above, the number of lines with <text2> in the middle is 2 (rather than 4).
Is there some way to get this count using linux tools or python?
I want lines where <text2> occurs in the middle of the line.
You could say:
grep -P '.+<text2>.+' filename
to list the lines containing <text2> not at the beginning or the end of a line.
In order to get only the count of matches, you could say:
grep -cP '.+<text2>.+' filename
You can use grep for this. For example, this will count the number of lines in the file that match the ^123[a-z]+$ pattern:
egrep -c '^123[a-z]+$' file.txt
P.S. The regex is quoted so that the shell doesn't interpret characters such as $.
Edit: the question is a bit tricky since we don't know for sure what your data is and what exactly you're trying to count in it, but it all comes down to correctly formulating a regular expression.
If we assume that <text2> is an exact sequence of characters that should be present in the middle of the line and should not be present at the beginning and in the end, then this should be the regex you're looking for: ^<text[^2]>.*text2.*<text[^2]>\.$
Using awk you can do this:
awk '$2~/text2/ {a++} END {print a}' file
2
It counts all lines with text2 as the middle (second) field.
I want lines where <text2> occurs in the middle of the line. I don't
want lines where <text2> occurs at the beginning or at the end of the
line.
Try using grep with -c
grep -c '>.*<text2>.*<' file
Output:
2
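To see why the surrounding > and < in the pattern matter, here's a reduced three-line sample (fabricated) in which only the first line has <text2> strictly in the middle:

```shell
# Line 1: <text2> in the middle; line 2: at the start; line 3: at the end.
printf '<text1> <text2> <text3>.\n<text2> <text1> <text9>.\n<text11> <text12> <text2>.\n' |
  grep -c '>.*<text2>.*<'
# prints: 1
```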
Where it occurs (anywhere in the line):
sed -n "/<text2>/ =" filename
If you want it only in the middle (as you wrote later in a comment):
sed -n "/[^ ] \{1,\}<text2> \{1,\}[^ ]/ =" filename
(The = command prints the matching line numbers; pipe the result to wc -l for a count.)
I found this question and answer on how to remove triple empty lines. However, I need the same for double empty lines only: all double blank lines should be deleted completely, but single blank lines should be kept.
I know a bit of sed, but the proposed command for removing triple blank lines is over my head:
sed '1N;N;/^\n\n$/d;P;D'
This would be easier with cat:
cat -s
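A quick sketch (note that -s is a GNU/BSD extension, not POSIX, and that it squeezes runs of blank lines down to a single blank line rather than removing them outright):

```shell
# The double blank line between 1 and 2 becomes one; the single blank before 3 is kept.
printf '1\n\n\n2\n\n3\n' | cat -s
# prints: 1, a blank line, 2, a blank line, 3
```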
I've commented the sed command you don't understand:
sed '
## In first line: append second line with a newline character between them.
1N;
## Do the same with third line.
N;
## When found three consecutive blank lines, delete them.
## Here there are two newlines but you have to count one more deleted with last "D" command.
/^\n\n$/d;
## The combo "P+D+N" simulates a FIFO, "P+D" prints and deletes from one side while "N" appends
## a line from the other side.
P;
D
'
Remove 1N because we only need two lines in the 'stack' (the second N is enough), and change /^\n\n$/d; to /^\n$/d; to delete two consecutive blank lines.
A test:
Content of infile:
1
2

3


4
5

6


7
Run the sed command:
sed '
N;
/^\n$/d;
P;
D
' infile
That yields:
1
2

3
4
5

6
7
sed '/^$/{N;/^\n$/d;}'
It deletes exactly two consecutive blank lines in a file. Try the expression on a test file to see it fully. When a blank line comes along, sed enters the braces.
Normally sed reads one line at a time. N appends the second line to the pattern space; the two lines are separated by a newline.
If that second line is also empty, the pattern space matches /^\n$/ and only then does d fire. d deletes the whole pattern space content and starts the next cycle.
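A sketch on a small fabricated input: a single blank line after 1 (kept) and a double blank line after 2 (deleted):

```shell
printf '1\n\n2\n\n\n3\n' | sed '/^$/{N;/^\n$/d;}'
# prints: 1, a blank line, 2, then 3 (the double blank is gone)
```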
This would be easier with awk:
awk -v RS='\n\n\n' 1
BUT the above solution only deletes the first occurrence of 3 consecutive blank lines.
To delete them all, use the command below:
sed '1N;N;/^\n\n$/ { N;s/^\n\n//;N;D; };P;D' filename
As far as I can tell none of the solutions here work. cat -s as suggested by @DerMike isn't POSIX compliant (and it's less convenient if you're already using sed for another transformation), and sed 'N;/^\n$/d;P;D' as suggested by @Birei sometimes deletes more newlines than it should.
Instead, sed ':L;N;s/^\n$//;t L' works. For POSIX compliance use sed -e :L -e N -e 's/^\n$//' -e 't L', since POSIX doesn't specify using ; to separate commands.
Example:
$ S='foo\nbar\n\nbaz\n\n\nqux\n\n\n\nquxx\n';\
> paste <(printf "$S")\
> <(printf "$S" | sed -e 'N;/^\n$/d;P;D')\
> <(printf "$S" | sed -e ':L;N;s/^\n$//;t L')
foo foo foo
bar bar bar
baz baz baz
qux
qux
qux quxx
quxx
quxx
$
Here we can see the original file, @Birei's solution, and my solution side by side. @Birei's solution deletes all blank lines separating baz and qux, while my solution removes all but one, as intended.
Explanation:
:L Create a new label called L.
N Read the next line into the current pattern space,
separated by an "embedded newline."
s/^\n$// Replace the pattern space with the empty pattern space,
corresponding to a single non-embedded newline in the output,
if the current pattern space only contains a single embedded newline,
indicating that a blank line was read into the pattern space by `N`
after a blank line had already been read from the input.
t L Branch to label L if the previous `s` command successfully
substituted text in the pattern space.
In effect, this deletes one recurrent blank line at a time, reading each into the pattern space as an embedded newline with N and deleting them with s.
Just pipe the file to the uniq command, and runs of empty lines, regardless of length, will be shrunk to just one. Simpler is better.
Clarification: as Marlar stated, this is not a solution if you have other non-blank consecutive duplicated lines that you do not want to get rid of. It is a solution in other cases, such as cleaning up configuration files, which is what I was after when I saw this question. I did indeed solve my problem using just 'uniq'.
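A sketch illustrating both the behaviour and Marlar's caveat: the run of blank lines is squeezed to one, but the duplicated non-blank line b is squeezed as well:

```shell
# uniq collapses every run of identical consecutive lines, blank or not.
printf 'a\n\n\n\nb\nb\n' | uniq
# prints: a, a blank line, then a single b
```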