Match the number of characters in two lines - regex

I have a file that I'm trying to prepare for some downstream analysis, but I need the number of characters in two lines to be identical. The file is formatted as below, where the 2nd (CTTATAATGCCGCTCCCTAAG) and 4th (bbbeeeeegggggiiiiiiiiigghiiiiiiiiiiiiiiiiiigeccccb) lines need to contain the same number of characters.
@HWI-ST:8:1101:3346:2198#GTCCGC/1
CTTATAATGCCGCTCCCTAAG
+HWI-ST:8:1101:3346:2198#GTCCGC/1
bbbeeeeegggggiiiiiiiiigghiiiiiiiiiiiiiiiiiigeccccb
@HWI-ST:8:1101:10491:2240#GTCCGC/1
GAGTAGGGAGTATACATCAG
+HWI-ST:8:1101:10491:2240#GTCCGC/1
abbceeeeggggfiiiiiigg`gfhfhhifhifdgg^ggdf_`_Y[aa_R
@HWI-ST:8:1101:19449:2134#GTCCGC/1
AAGAAGAGATCTGTGGACCA
So far I've pulled out the second line from each set of four and generated a file containing a record of the length of each line using:
grep -v '[^A-Z]' file.fastq |awk '{ print length($0); }' > newfile
Now I'm just looking for a way to point to this record to direct a sed command as to how many characters to trim off of the end of the line. Something similar to:
sed -r 's/.{n}$//' file
Replacing n with some regular expression to reference the text file. I wonder if I'm overcomplicating things, but I need the lines to match EXACTLY so I haven't been able to think of another way to go about it. Any help would be awesome, thanks!

This might be what you're looking for:
awk '
# If 2nd line of 4-line group, save length as len.
NR % 4 == 2 { len = length($0) }
# If 4th line of 4-line group, trim the line to len.
NR % 4 == 0 { $0 = substr($0, 1, len)}
# print every line
{ print }
' file
This assumes that the file consists of 4-line groups where the 2nd and 4th line of each group are the ones you're interested in. It also assumes that the 2nd line of each group will be no longer than its corresponding 4th line.
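For comparison, here is a minimal Python sketch of the same idea, under the same assumptions (strict 4-line groups, 2nd line no longer than the 4th). The function name trim_quality_to_sequence and the tiny record are made up for illustration; they are not from the question.

```python
def trim_quality_to_sequence(lines):
    """Trim the 4th line of each 4-line group to the length of the 2nd."""
    out = []
    seq_len = 0
    for number, line in enumerate(lines, start=1):
        if number % 4 == 2:
            seq_len = len(line)      # remember the sequence length
        elif number % 4 == 0:
            line = line[:seq_len]    # cut the quality string to match
        out.append(line)
    return out


# Hypothetical record: 8 bases but 12 quality characters.
record = ["@read1", "ACGTACGT", "+read1", "IIIIIIIIIIII"]
trimmed = trim_quality_to_sequence(record)
print(len(trimmed[1]), len(trimmed[3]))
```

Like the awk version, this only ever shortens the 4th line; it cannot pad a quality line that is shorter than its sequence.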

Related

How do I insert a new line for every two lines except when encountering two consecutive new lines?

I am trying to insert a new line for every two lines of text, except I want to restart this pattern whenever a new paragraph (two consecutive new lines) is encountered. (My desired output should not have three consecutive new lines.)
For example, here is my input text:
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
And here is my desired output:
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
I have tried using awk:
awk -v n=2 '1; NR % n == 0 {print ""}'
but this command does not restart the pattern after a new paragraph. Instead, I would get the following output from my example text above:
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
As this undesired output shows, without the restarting of the pattern, I would get instances of three consecutive new lines.
Paragraph mode in perl could help:
perl -00 -ple 's/.*\n.*\n/$&\n/g'
output
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
Based on @Borodin's comment:
perl -00 -ple 's/(?:.*\n){2}\K/\n/g'
Perl to the rescue!
perl -00 -ple '$i = 0; s/\n/($i++ % 2) ? "\n\n" : "\n"/eg'
-00 turns on the "paragraph mode", i.e. Perl reads the input in blocks separated by at least two newlines.
-l removes the two newlines from the end of each block after reading it, but returns them back before printing, avoiding three consecutive newlines.
/e evaluates the right hand side of a substitution as code.
$i++ % 2 is the increment plus the modulo. Because $i starts at 0, it returns 1 for every second newline in a block, i.e. for the newlines that end lines 2, 4, 6 etc.
condition ? then : else is the ternary operator. The newlines ending lines 2, 4, 6... are replaced by two newlines; the others stay as they are.
$i is reset for each block to start from 0 again.
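The paragraph-mode idea can also be sketched in Python, if that is easier to follow: split the text on runs of blank lines, group each paragraph's lines in twos, and join everything back with single blank lines. The name space_pairs and the sample string are assumptions made for this sketch only.

```python
import re


def space_pairs(text, n=2):
    """Insert a blank line after every n lines, restarting per paragraph."""
    # Paragraphs are separated by two or more newlines, as with perl -00.
    paragraphs = re.split(r"\n{2,}", text.strip("\n"))
    result = []
    for paragraph in paragraphs:
        lines = paragraph.split("\n")
        groups = ["\n".join(lines[i:i + n]) for i in range(0, len(lines), n)]
        result.append("\n\n".join(groups))
    return "\n\n".join(result) + "\n"


# Hypothetical input: a 3-line paragraph followed by a 4-line paragraph.
sample = "a\nb\nc\n\nd\ne\nf\ng\n"
print(space_pairs(sample), end="")
```

Because the groups and the paragraphs are both joined with exactly one blank line, three consecutive newlines cannot appear in the result.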
This will also restart the pattern for each paragraph:
use strict;
use warnings;
my $str = do { local $/; <DATA> };
my $i = 0;
$str =~ s/(\n+)/
    if (length $1 > 1) {
        $i = 0;
        "\n\n";
    }
    else {
        $i++ % 2 ? "\n\n" : "\n"
    }
/ge;
print $str;
__DATA__
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
Output:
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
This might work for you (GNU sed):
sed '/\S/!d;n;//!b;$!G' file
Delete any empty lines ahead of a non-empty line, then print that line and fetch the next one. If the next line is empty, branch to the end of the script (which prints it); otherwise append a newline (unless it is the last line) and repeat.
If you prefer an empty line to signify the last true couplet:
sed '/\S/!d;n;//G' file
As an afterthought, to group consecutive lines programmatically:
sed '/\S/!d;:a;N;/\n\s*$/b;s/[^\n]*/&/5;Ta;G' file
This will split texts into groups of no more than five lines.
If you wait till you know if the next line is empty to make a decision about inserting a new-line, this becomes relatively straightforward. Here expressed in awk:
parse.awk
# Remember line count in the paragraph with n
NF { n++ }
!NF { n=0 }
# Only emit new-line if n is non-zero and the previous line
# number is divisible by m
n>=m && (n-1)%m==0 { printf "\n" }
# Print $0
1
Run it like this:
awk -v m=2 -f parse.awk file
Or, for example, like this:
awk -f parse.awk m=2 file m=3 file
Below is the output of the second invocation with the following header added to the script (The header is GNU awk specific):
BEGINFILE {
    n = 0;
    if(FNR != NR)
        printf "\n\n"
    print "===>>> " FILENAME ", m=" m " <<<==="
}
Output:
===>>> file, m=2 <<<===
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
===>>> file, m=3 <<<===
This is my first
line to appear in
the text.
I need the second
line to appear in
the way that follows
the pattern specified.
I am not sure if
the third line will
appear as I want it
to appear because sometimes
the new line happens where
there are two consecutive
new lines.
Golfed version:
{n=NF?n+1:0}(n-1)%m==0&&n>=m{printf "\n"}1
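For readers less used to awk, here is a rough Python translation of parse.awk. It is a sketch only: the generator name insert_breaks is made up, and line.strip() stands in for awk's NF test.

```python
def insert_breaks(lines, m=2):
    """Yield lines, adding an empty line after every m lines of a paragraph."""
    n = 0  # number of non-empty lines seen so far in the current paragraph
    for line in lines:
        n = n + 1 if line.strip() else 0
        # Emit the separator only once we know another line follows.
        if n >= m and (n - 1) % m == 0:
            yield ""
        yield line


# Hypothetical input: a 3-line paragraph, an empty line, a 4-line paragraph.
demo = ["a", "b", "c", "", "d", "e", "f", "g"]
print("\n".join(insert_breaks(demo, m=2)))
```

As in the awk script, the separator is emitted just before the line that starts the next group, so a paragraph that ends exactly on a group boundary gets no extra empty line.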

Search text for multiple lines matching string 1 which are not separated by string 2

I've got a file looking like this:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|some|supplementary|information
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
abc|100|another|test|line|with|multiple|information|in||different|fields|
abc|110|different|looking|line|with|supplementary|information
I'm looking for a regexp to use with sed / awk / (e)grep (it actually doesn't matter to me which of these as all would be fine) to find the following in the above mentioned text:
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information
I want to get back a |100| line if it is followed by at least two |110| lines before another |100| line appears. The result should contain the initial |100| line together with all |110| lines that follow but not the following |100| line.
sed -ne '/|100|/,/|110|/p'
gives me a list of all |100| lines which are followed by at least one |110| line. But it doesn't check whether the |110| line is repeated more than once, so I get back results I'm not looking for.
sed -ne '/|100|/,/|100|/p'
returns every |100| line together with everything up to and including the next |100| line.
Finding lines between search patterns has always been a nightmare for me. I have spent hours of trial and error on similar problems until something finally worked, but I never really understood why. I hope someone might be so kind as to save me the headache this time and maybe explain how the pattern does its work. I'm quite sure I'll face this kind of problem again, and then I could finally help myself.
Thank you for any help on this one!
Regards
Manuel
I'd do this in awk.
awk -F'|' '$2==100&&c>2{print b} $2==100{c=1;b=$0;next} $2==110&&c{c++;b=b RS $0;next} {c=0}' file
Broken out for easier reading:
awk -F'|' '
# If we're starting a new section and conditions have been met, print buffer
$2==100 && c>2 {print b}
# Start a section with a new count and a new buffer...
$2==100 {c=1;b=$0;next}
# Add to buffer, then skip the reset rule below
$2==110 && c {c++;b=b RS $0;next}
# Finally, zero everything if we encounter lines that don't fit the pattern
{c=0;b=""}
' file
Rather than using a regex, this steps through the file using the field delimiters you've specified. Upon seeing the "start" condition, it begins keeping a buffer. As subsequent lines match your "continue" condition, the buffer grows. Once we see the start of a new section, we print the buffer if the counter is big enough.
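The same buffer-and-counter logic might look like this in Python. This is a sketch under assumptions: the function name, its parameters and the shortened sample lines are invented for illustration, and unlike the awk program above it also flushes a qualifying section at the end of the input.

```python
def sections_with_repeats(lines, start="100", cont="110", minimum=2):
    """Collect each start line followed by at least `minimum` continuation lines."""
    found = []
    buffer = []
    for line in lines:
        fields = line.split("|")
        code = fields[1] if len(fields) > 1 else ""
        if code == start:
            if len(buffer) > minimum:      # 1 start line + >= minimum others
                found.append(buffer)
            buffer = [line]                # begin a new section
        elif code == cont and buffer:
            buffer.append(line)            # grow the current section
        else:
            buffer = []                    # line does not fit the pattern
    if len(buffer) > minimum:              # assumption: flush at end of input
        found.append(buffer)
    return found


# Hypothetical, shortened version of the sample data.
data = ["abc|100|a", "abc|100|b", "abc|110|c", "abc|100|d",
        "abc|110|e", "abc|110|f", "abc|100|g", "abc|110|h"]
for section in sections_with_repeats(data):
    print("\n".join(section))
```

Keeping the buffer as a list means its length doubles as the counter, so there is no separate c to keep in sync.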
Works for me on your sample data.
Here's a GNU awk specific answer: use |100| as the record separator, |110| as the field separator, and look for records with at least 3 fields.
gawk '
BEGIN {
# a newline, the first pipe-delimited column, then the "100" value
RS="(\n[^|]+[|]100[|])"
FS="[|]110[|]"
}
NF >= 3 {print RT $0} # RT is the actual text matching the RS pattern
' file
In AWK, the field separator is set to a pipe character and the second field is compared to 100 and 110 per line. $0 represents a line from the input file.
BEGIN { FS = "|" }
{
    if($2 == 100) {
        one_hundred = 1;
        one_hundred_one = 0;
        var0 = $0
    }
    if($2 == 110) {
        one_hundred_one += 1;
        # compare with ==, not =, so one_hundred is tested rather than assigned
        if(one_hundred_one == 1 && one_hundred == 1) var1 = $0;
        if(one_hundred_one == 2 && one_hundred == 1) var2 = $0;
    }
    if(one_hundred == 1 && one_hundred_one == 2) {
        print var0
        print var1
        print var2
    }
}
awk -f foo.awk input.txt
abc|100|test|line|with|multiple|information|||in|different||fields
abc|110|different|looking|line|with|some|other|supplementary|information
abc|110|different|looking|line|with|additional||information

Regex to move second line to end of first line

I have several lines with certain values and I want to merge every second line, or every line beginning with <name>, onto the end of the preceding line ending with </id>:
<id>rd://data1/8b</id>
<name>DM_test1</name>
<id>rd://data2/76f</id>
<name>DM_test_P</name>
so end up with something like
<id>rd://data1/8b</id><name>DM_test1</name>
The reason it came out like this is that I used two piped xpath queries.
Regex
Simply remove the newline at the end of a line ending in </id>. On Windows, replace (<\/id>)\r\n with \1 or $1 (the latter is Perl syntax). On Linux, search for (<\/id>)\n and replace it with the same thing.
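As a concrete illustration of that replacement, here is a small Python sketch using re.sub. The function name join_after_id and the sample text are made up; the optional \r lets the one pattern cover both Windows and Linux line endings.

```python
import re


def join_after_id(text):
    """Drop the line break that directly follows a closing </id> tag."""
    return re.sub(r"(</id>)\r?\n", r"\1", text)


# Hypothetical input based on the question's sample.
xml_lines = "<id>rd://data1/8b</id>\n<name>DM_test1</name>\n"
print(join_after_id(xml_lines), end="")
```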
awk
The ideal solution uses awk. The idea is simple: when the line number is odd, we print the line without a newline; otherwise we print it with a newline. Note the explicit "%s" format, so that a % in the data is not interpreted by printf.
awk '{ if (NR % 2) { printf "%s", $0 } else { print $0 } }' file
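The odd/even line-number idea carries over directly to other languages. A minimal Python sketch follows, assuming the file strictly alternates between the two kinds of line; join_pairs is a name invented here.

```python
def join_pairs(lines):
    """Append every even-numbered line to the odd-numbered line before it."""
    joined = []
    for number, line in enumerate(lines, start=1):
        if number % 2:            # odd line: start a new output line
            joined.append(line)
        else:                     # even line: glue it onto the previous one
            joined[-1] += line
    return joined


pairs = join_pairs([
    "<id>rd://data1/8b</id>",
    "<name>DM_test1</name>",
    "<id>rd://data2/76f</id>",
    "<name>DM_test_P</name>",
])
print("\n".join(pairs))
```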
sed
Using sed we place a line in the hold space when it contains <id> and append the current line to it when it is a <name> line. We then exchange the hold space with the pattern space, remove the newline, and print.
sed -n '/<id>.*<\/id>/{h}; /<name>.*<\/name>/{H;x;s/\n//;p}' file
pr
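Using pr we can achieve a similar goal. Here -a fills the two columns across rather than down and -t omits the page headers; note that -s joins the two lines with a tab:
pr -a -t -s --columns 2 file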

Unify lines that contains same patterns

I have a database with this structure:
word1#element1.1#element1.2#element1.3#...
word2#element2.1#element2.2#element2.3#...
...
...
I would like to unify the elements of 2 or more lines every time the word at the beginning is the same.
Example:
...
word8#element8.1#element8.2#element8.3#...
word9#element9.1#element9.2#element9.3#...
...
Now, lets suppose word8=word9, this is the result:
...
word8#element8.1#element8.2#element8.3#...#element9.1#element9.2#element9.3#...
...
I tried with the command sed:
I match 2 lines at time with N
Memorize the first word of the first line: ^\([^#]*\) (all the characters except '#')
Memorize all the other elements of the first line: \([^\n]*\)
Check if in the second line (after \n) is present the same word: \1
If it's like that I just take out the newline char and the first word of the second line: \1#\2
This is the complete code:
sed 'N;s/^\([^#]*\)#\([^\n]*\)\n\1/\1#\2/' database
I would like to understand why it's not working and how I can solve that problem.
Thank you very much in advance.
This might work for you (GNU sed):
sed 'N;s/^\(\([^#]*#\).*\)\n\2/\1#/;P;D' file
Read 2 lines at all times and remove the line feed and the matching portion of the second line (reinstating the #) if the words at the beginning of those 2 lines match.
sed '#n
H
$ { x
:cycle
s/\(\n\)\([^#]*#\)\([^[:cntrl:]]*\)\1\2/\1\2\3#/g
t cycle
s/.//
p
}' YourFile
Assuming the words are sorted:
load the whole file into the hold buffer (the code could be adapted to keep only a few lines in the buffer if the file is too big)
at the end, swap the hold buffer content into the working buffer
remove the newline and the first word of any line whose previous line starts with the same word (and add a # as separator)
if a substitution occurred, try again
if not, remove the first character (a newline left over from the loading process)
print
You can try with perl. It reads the input file line by line, splits each line on the first # character, and uses a hash of arrays to save the first word as key and append the rest of the line as value. In the END block it sorts by the first word and joins the lines:
perl -lne '
    ($key, $line) = split /#/, $_, 2;
    push @{$hash{$key}}, $line;
    END {
        for $k ( sort keys %hash ) {
            printf qq|%s#%s\n|, $k, join q|#|, @{$hash{$k}};
        }
    }
' infile
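The hash-of-arrays approach maps naturally onto a Python dict of lists. The sketch below is only an illustration; merge_by_key and the three sample lines are assumptions, not part of the answer.

```python
from collections import defaultdict


def merge_by_key(lines, sep="#"):
    """Group lines by their first field and join the remainders per key."""
    groups = defaultdict(list)
    for line in lines:
        key, _, rest = line.partition(sep)   # split on the first separator only
        groups[key].append(rest)
    # Output is sorted by key, as in the perl END block.
    return [key + sep + sep.join(groups[key]) for key in sorted(groups)]


merged_by_key = merge_by_key(["w2#a", "w1#b#c", "w1#d"])
print("\n".join(merged_by_key))
```

Since everything is collected before anything is printed, equal words are merged even when they are not on adjacent lines, at the cost of holding the whole file in memory.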
$ cat file
word1#element1.1#element1.2#element1.3
word2#element2.1#element2.2#element2.3
word8#element8.1#element8.2#element8.3
word8#element9.1#element9.2#element9.3
word9#element9.1#element9.2#element9.3
$ awk 'BEGIN{FS=OFS="#"}
NR>1 && $1!=prev { print "" }
$1==prev { sub(/^[^#]+/,"") }
{ printf "%s",$0; prev=$1 }
END { print "" }
' file
word1#element1.1#element1.2#element1.3
word2#element2.1#element2.2#element2.3
word8#element8.1#element8.2#element8.3#element9.1#element9.2#element9.3
word9#element9.1#element9.2#element9.3
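If the file is already sorted, the "compare with the previous key" logic of this awk script can be sketched in Python as follows. The name merge_adjacent is made up; the data is the sample file shown above.

```python
def merge_adjacent(lines, sep="#"):
    """Merge each line into the previous one when both start with the same word."""
    merged = []
    prev = None
    for line in lines:
        key, _, rest = line.partition(sep)
        if merged and key == prev:
            merged[-1] += sep + rest     # same word: append only the elements
        else:
            merged.append(line)          # new word: start a new output line
        prev = key
    return merged


rows = merge_adjacent([
    "word1#element1.1#element1.2#element1.3",
    "word2#element2.1#element2.2#element2.3",
    "word8#element8.1#element8.2#element8.3",
    "word8#element9.1#element9.2#element9.3",
    "word9#element9.1#element9.2#element9.3",
])
print("\n".join(rows))
```

This keeps the original line order and needs only one line of state, but it merges adjacent duplicates only.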
Using text replacements:
perl -p0E 'while( s/(^|\n)(.+?#)(.*)\n\2(.*)/$1$2$3 $4/ ){}' yourfile
or indented:
perl -p0E 'while( # while we can
s/(^|\n) # substitute \n
(.+?\#) (.*) \n # id elems1
\2 (.*) # id elems2
/$1$2$3 $4/x # \n id elems1 elems2
){}'
thanks: @birei

SED: addressing two lines before match

Print the line that is situated 2 lines before the match (pattern).
I tried the following:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
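The three-line window described in these steps can be sketched in Python with a bounded deque. This is an illustration only: lines_two_before is an invented name, and it uses a plain substring test where the sed script uses a regular expression.

```python
from collections import deque


def lines_two_before(lines, needle):
    """Return every line that sits two lines above a line containing needle."""
    window = deque(maxlen=3)      # holds the current line and the two before it
    hits = []
    for line in lines:
        window.append(line)       # the oldest line drops out automatically
        if len(window) == 3 and needle in line:
            hits.append(window[0])
    return hits


numbers = ["one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"]
print(lines_two_before(numbers, "six"))
```

A match on the first or second line is skipped, because no line exists two lines above it.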
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
This one uses grep; it is a bit simpler and easy to read (however, it needs a pipe, it handles only the first match, and it assumes that match is not on the first two lines of the file):
grep -B2 'pattern' file_name | sed -n '1p'
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This prints the line two lines before each line that matches the pattern.
I've tested your sed command but the result is strange (and obviously wrong), and you didn't give any explanation. You will have to save three lines in a buffer (named hold space), do a pattern search with the newest line and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the old of the three lines saved and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It searches for six and yields as output:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end, it prints the line two before that line number, so only the last match counts.
PS: this may be slow and memory-hungry with large files.
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (file{,} is the same as file file).
In the first round it finds the pattern and stores the line number in the variable f.
In the second round it prints the line two before the value in f.