Remove lines containing same string - regex

If a line IN(..) and a line OUT(..) have the same string in their parentheses, then remove the line OUT(..).
My input file looks like this:
IN(ABC);
IN(DEF);
IN(FGH);
OUT(QWE);
OUT(ABC);
OUT(DEF);
My desired output is:
IN(ABC);
IN(DEF);
IN(FGH);
OUT(QWE);

On the assumption that all IN(...) lines are before the OUT(...) lines (i.e. sorted), the following should work:
my %in;
while (<DATA>) {
    if (/^IN\((.*?)\)/) {
        $in{$1} = 1;
    } elsif (/^OUT\((.*?)\)/) {
        if ($in{$1}) {
            next;
        }
    }
    print $_;
}
__DATA__
IN(ABC);
IN(DEF);
IN(FGH);
OUT(QWE);
OUT(ABC);
OUT(DEF);
The idea is to use a hash to track which IN values have been seen. Go through the data line by line: if it is an IN line, store the value and print the line. If it is an OUT line whose value is not among the recorded IN values, print it as well; otherwise, skip it.
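For comparison, here is a minimal Python sketch of the same idea, under the same assumption that the IN(...) lines come before the OUT(...) lines. The function name drop_matched_out_lines is made up for illustration; a set plays the role of the Perl hash.

```python
import re

def drop_matched_out_lines(lines):
    """Keep every line except OUT(x) lines whose x appeared in an earlier IN(x)."""
    seen_in = set()          # values found inside IN(...)
    kept = []
    for line in lines:
        m_in = re.match(r"IN\((.*?)\)", line)
        m_out = re.match(r"OUT\((.*?)\)", line)
        if m_in:
            seen_in.add(m_in.group(1))
        elif m_out and m_out.group(1) in seen_in:
            continue         # OUT value already seen as IN: skip the line
        kept.append(line)
    return kept

sample = ["IN(ABC);", "IN(DEF);", "IN(FGH);", "OUT(QWE);", "OUT(ABC);", "OUT(DEF);"]
result = drop_matched_out_lines(sample)
print("\n".join(result))
```

If the OUT lines can appear before their IN lines, a single pass like this is not enough; you would have to collect the IN values first and filter in a second pass.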

Related

How can I find a protein sequence from a FASTA file using perl?

So I have an exercise in which I have to print the first three lines of a FASTA file as well as the protein sequence. I have tried to run a script I wrote, but Cygwin doesn't seem to print the sequence. My code is as follows:
#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if($_=~ m/^ID/) {
        print $_ ;
    }
    if($_=~ m/^AC/) {
        print $_ ;
    }
    if ($_=~ m/^SQ/) {
        print $_;
    }
    if ($_=~ m/\^s+(\w+)/) { #this is the part I have trouble with
        $a.=$1;
        $a=~s/\s//g; #this is for removing the spaces inside the sequence
        print $a;
    }
}
The FASTA file looks like this:
SQ SEQUENCE 474 AA; 55345 MW; 0D9FA81230B282D9 CRC64;
     MRFTFTSRCL ALFLLLNHPT PILPAFSNQT YPTIEPKPFL YVVGRKKMMD AQYKCYDRMQ
     QLPAYQGEGP YCNRTWDGWL CWDDTPAGVL SYQFCPDYFP DFDPSEKVTK YCDEKGVWFK
     HPENNRTWSN YTMCNAFTPE KLKNAYVLYY LAIVGHSLSI FTLVISLGIF VFFRSLGCQR
     VTLHKNMFLT YILNSMIIII HLVEVVPNGE LVRRDPVSCK ILHFFHQYMM ACNYFWMLCE
     GIYLHTLIVV AVFTEKQRLR WYYLLGWGFP LVPTTIHAIT RAVYFNDNCW LSVETHLLYI
     IHGPVMAALV VNFFFLLNIV RVLVTKMRET HEAESHMYLK AVKATMILVP LLGIQFVVFP
     WRPSNKMLGK IYDYVMHSLI HFQGFFVATI YCFCNNEVQT TVKRQWAQFK IQWNQRWGRR
     PSNRSARAAA AAAEAGDIPI YICHQELRNE PANNQGEESA EIIPLNIIEQ ESSA
//
To match the sequence I used the fact that each line starts with several spaces followed only by letters. It doesn't seem to do the trick in Cygwin. Here is the link for the sequence: https://www.uniprot.org/uniprot/P30988.txt
The problem is with this line:
if ($_=~ m/\^s+(\w+)/) { #this is the part I have trouble with
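You have the backslash in the wrong place in \^s+. You are escaping the ^, so the regex looks for a literal ^ followed by one or more s characters instead of whitespace at the start of the line. The line in your code should be: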
if ($_=~ m/^\s+(\w+)/) { #this is the part I have trouble with
I'd write that block of code like this:
if ($_=~ m/^\s/) {
    s/\s+//g; #this is for removing the spaces inside the sequence
    print $_;
}
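To illustrate the corrected logic outside Perl, here is a small Python sketch. It assumes, as in the question, that sequence lines are exactly the lines that begin with whitespace. The function name extract_sequence is hypothetical, and the sample is a shortened excerpt of the record above, not the full entry.

```python
def extract_sequence(lines):
    """Join all lines that start with whitespace, dropping every space."""
    parts = []
    for line in lines:
        if line[:1].isspace():                 # same test as m/^\s/
            parts.append("".join(line.split()))  # same effect as s/\s+//g
    return "".join(parts)

# Shortened excerpt of the record shown in the question
sample = [
    "SQ   SEQUENCE   474 AA;  55345 MW;  0D9FA81230B282D9 CRC64;\n",
    "     MRFTFTSRCL ALFLLLNHPT PILPAFSNQT\n",
    "     PSNRSARAAA AAAEAGDIPI\n",
    "//\n",
]
sequence = extract_sequence(sample)
print(sequence)
```

Like the Perl version above, this also removes the line breaks, so the sequence comes out as one long string.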

AWK - Search for a pattern-add it as a variable-search for next line that isn't a variable & print it + variable

I have a given file:
application_1.pp
application_2.pp
#application_2_version => '1.0.0.1-r1',
application_2_version => '1.0.0.2-r3',
application_3.pp
#application_3_version => '2.0.0.1-r4',
application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp
#application_5_version => '3.0.0.1-r8',
application_5_version => '3.0.0.2-r9',
I would like to be able to read this file and search for the string
".pp"
When that string is found, the line should be stored in a variable.
The script then reads the next line of the file. If it encounters a line starting with a #, it ignores it and moves on to the next line.
If it comes across a line that does not contain ".pp" and doesn't start with #, it should print that line next to the last stored variable in a new file.
The output would look like this:
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
I would like to achieve this with awk. If somebody knows how to do this and it is a simple solution, I would be happy if they could share it with me. If it is more complex, it would be helpful to know what I need to understand in awk in order to do this (arrays, variables, etc.). Can it even be achieved with awk, or is another tool necessary?
Thanks,
I'd say
awk '/\.pp/ { if(NR != 1) print line; line = $0; next } NF != 0 && substr($1, 1, 1) != "#" { line = line $0 } END { print line }' filename
This works as follows:
/\.pp/ { # if a line contains ".pp"
    if(NR != 1) { # unless we just started
        print line # print the last assembled line
    }
    line = $0 # and remember this new one
    next # and we're done here.
}
NF != 0 && substr($1, 1, 1) != "#" { # otherwise, unless the line is empty
                                     # or a comment
    line = line $0 # append it to the line we're building
}
END { # in the end,
    print line # print the last line.
}
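The same state machine can be sketched in Python, which may help if awk's implicit loop is unfamiliar. This is only an illustration: join_versions is a made-up name, and unlike the awk script it strips leading whitespace and joins with a single space, which is an assumption about how the input is indented.

```python
def join_versions(lines):
    """Attach each uncommented version line to the preceding .pp line."""
    out = []
    current = None                      # the line being assembled
    for raw in lines:
        text = raw.strip()
        if ".pp" in text:
            if current is not None:     # flush the previously assembled line
                out.append(current)
            current = text
        elif text and not text.startswith("#") and current is not None:
            current = current + " " + text
    if current is not None:             # the END block: flush the last line
        out.append(current)
    return out

sample = [
    "application_1.pp",
    "application_2.pp",
    "#application_2_version => '1.0.0.1-r1',",
    "application_2_version => '1.0.0.2-r3',",
    "application_3.pp",
]
joined = join_versions(sample)
print("\n".join(joined))
```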
You can use sed:
#n
/\.pp/{
h
:loop
n
/[^#]application.*version/{
H
g
s/\n[[:space:]]*/\t/
p
b
}
/\.pp/{
x
p
}
b loop
}
If you save this as s.sed and run
sed -f s.sed file
You will get this output
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
Explanation
The #n suppresses normal output.
Once we match /\.pp/, we store that line in the hold space with h and start the loop.
We go to the next line with n.
If it matches /[^#]application.*version/, meaning the line is not commented out with a #, we append the line to the hold space with H, copy the hold space to the pattern space with g, and replace the newline and any following whitespace with a tab. Finally we print with p and skip to the end of the script with b.
If it matches /\.pp/, we swap the pattern and hold spaces with x and print with p.

printing all lines when multiple matching values in table perl

I have two tables: $conversion and $table. In my script I'm checking if there is a match between cols[5] from $conversion and cols[2] from $table, if this is the case I print out the value from another column in $conversion, namely the corresponding value in cols[1].
This is all working fine.
However, some values in cols[5] from $conversion are the same. In that case I of course want to print everything from $conversion that matches. At the moment the script prints only the corresponding value for the last match it finds while going through the file. So when cols[5] from $conversion contains the same value four times, only the corresponding value of the fourth match is printed. Any hint on how to solve this?
This is my script:
my %hash = ();
while (<$conversion>) {
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[5];
    my $keyfield2 = $cols[1];
    $hash{$keyfield} = $keyfield2;
}
seek $table,0,0; #cursor resetting
while (<$table>) {
    my @cols = split(/\t/);
    my $keyfield = $cols[2];
    if (exists($hash{$keyfield})) {
        print $output "$cols[0]", "\t", "$hash{$keyfield}", "\t", "$cols[1]\n";
    }
}
Don't store a single $cols[1]; store the whole array of them:
push @{ $hash{$keyfield} }, $keyfield2;
You'll need to dereference the array reference when printing:
print $output "$cols[0]","\t","@{ $hash{$keyfield} }","\t","$cols[1]\n";
If you want unique values, you can use a hash instead of an array.
my %hash = ();
while (<$conversion>) {
    chomp;
    my @cols = split(/\t/);
    my $keyfield = $cols[5];
    my $keyfield2 = $cols[1];
    push @{ $hash{$keyfield} }, $keyfield2;
    # $hash{$keyfield} = $keyfield2;
}
seek $table,0,0; #cursor resetting
while (<$table>) {
    my @cols = split(/\t/);
    my $keyfield = $cols[2];
    if (exists($hash{$keyfield})) {
        foreach (@{ $hash{$keyfield} }) {
            print $output "$cols[0]","\t","$_","\t","$cols[1]\n";
        }
    }
}
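The underlying data structure is a mapping from key to a list of values rather than to a single value. Here is a rough Python sketch of that structure; the rows are invented sample data, and build_lookup and match_rows are hypothetical names, with the column positions taken from the question.

```python
from collections import defaultdict

def build_lookup(conversion_rows):
    """Map column 5 to the list of every column-1 value that shares it."""
    lookup = defaultdict(list)
    for cols in conversion_rows:
        lookup[cols[5]].append(cols[1])   # append instead of overwrite
    return lookup

def match_rows(table_rows, lookup):
    """Emit one output row per stored value for each matching table row."""
    out = []
    for cols in table_rows:
        for value in lookup.get(cols[2], []):
            out.append((cols[0], value, cols[1]))
    return out

# Invented sample rows, already split on tabs
conversion = [
    ["a", "val1", "x", "x", "x", "K1"],
    ["b", "val2", "x", "x", "x", "K1"],
    ["c", "val3", "x", "x", "x", "K2"],
]
table = [["t0", "t1", "K1"], ["u0", "u1", "K3"]]
matches = match_rows(table, build_lookup(conversion))
for row in matches:
    print("\t".join(row))
```

With a plain assignment instead of append, only val2 would survive for K1, which is the behaviour described in the question.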

Insert code at the start and end of the outer nested code block only

I have some code like:
void main() {
//----------
var a;
var b;
var c =[];
var c = func(3);
if (a == b) {
print "nested";
}
//----------------
}
I want to select the inner portion between the braces. This is what I have tried:
sed -re ':l;N;$!tl;N;s!(void \w+\(\) \{)([^])*!\1 Prepend;\n\2\nappend!g' test.txt
Edit:
I am trying to insert code after the first occurrence of { and before the last occurrence of }.
Example:
void main() {
test1
//-----------------
var a;
var b;
var c =[];
var c = func(3);
if (a == b) {
print "nested";
}
test2
//-----------------
}
I think awk is a better solution for what you actually want to do:
$ awk '/{/{i++;if(i==1){print $0,"\ntest1";next}}{print}/}/{i--;if(i==1)print "test2"}' file
void main() {
test1
//-----------------
var a;
var b;
var c =[];
var c = func(3);
if (a == b) {
print "nested";
}
test2
//-----------------
}
Explanation:
Here is the script in multiline form with some explanatory comments. If you prefer this form, save it to a file, say nestedcode, and run it like awk -f nestedcode code.c:
BEGIN{
    #Track the nesting level
    nestlevel=0
}
/{/ {
    #The line contained a { so increase nestlevel
    nestlevel++
    #Only add code if the nestlevel is 1
    if(nestlevel==1){
        #Print the matching line and new code on the following line
        print $0,"\ntest1"
        #Skip to next line so the next block
        #doesn't print current line twice
        next
    }
}
{
    #Print all lines
    print
}
/}/ {
    # The line contained a } so decrease the nestlevel
    nestlevel--
    #Only print the code if the nestlevel is 1
    if(nestlevel==1)
        print "test2"
}
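If it helps to see the nesting counter outside awk, here is a direct Python transcription of the script above. wrap_outer_block is a made-up name, and like the awk version it only checks whether a line contains { or }, so braces inside strings or comments would confuse it.

```python
def wrap_outer_block(lines, first="test1", last="test2"):
    """Insert `first` after the outermost { and `last` once nesting returns to level 1."""
    out = []
    level = 0                         # current brace nesting level
    for line in lines:
        if "{" in line:
            level += 1
            if level == 1:            # outermost opening brace
                out.append(line)
                out.append(first)
                continue              # like awk's `next`
        out.append(line)
        if "}" in line:
            level -= 1
            if level == 1:            # just closed a block directly inside the outer one
                out.append(last)
    return out

source = [
    "void main() {",
    "//----------",
    "var a;",
    "if (a == b) {",
    'print "nested";',
    "}",
    "//----------------",
    "}",
]
wrapped = wrap_outer_block(source)
print("\n".join(wrapped))
```

Note that, as in the awk script, test2 is emitted every time the nesting level drops back to 1, so an outer block containing several nested blocks would get it more than once.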
This might work for you (GNU sed):
sed '/^void.*{$/!b;:a;/\n}$/bb;$!{N;ba};:b;s/\n/&test1&/;s/\(.*\n\)\(.*\n\)/\1test2\n\2/' file
/^void.*{$/!b if the line doesn't begin with void and end in { bail out (this may need to be tailored for your own needs).
:a;/\n}$/bb;$!{N;ba} if the line contains a newline followed by a } only, branch to label b otherwise read in the next line and loop back to label a.
:b begin substitutions here.
s/\n/&test1&/ after the first newline insert the first string.
s/\(.*\n\)\(.*\n\)/\1test2\n\2/ after the 2nd from last newline insert the second string.
sed, by default, operates on single lines. It can operate on multiple lines by using the N command to read more than one line into the pattern space.
For example, the following sed expression would join consecutive lines in a file with # symbols between them:
sed -e '{
N
s/\n/ # /
}'
(Example from http://www.thegeekstuff.com/2009/11/unix-sed-tutorial-multi-line-file-operation-with-6-practical-examples/)
Try this regex:
{[^]*} // [^] = any character, including newlines.
JavaScript example of the Regex working:
var s = "void main() {\n//----------\nvar a;\nvar b;\nvar c =[];\nvar c = func(3);\n//----------------\n}"
console.log(s.match(/{[^]*}/g));
//"{↵//----------↵var a;↵var b;↵var c =[];↵var c = func(3);↵//----------------↵}"
(I know this isn't a JavaScript question, but it illustrates that the regex returns the desired result.)

Regex for old type of comment

I have this kind of comments (a few examples):
//========================================================================
// some text some text some text some text some text some text some text
//========================================================================
// some text some text some text some text some text some text some text some text
// some text some text
// (..)
I want to replace it with a comment of this style:
/*****************************************************************************\
Description:
some text some text
some text some text some text
\*****************************************************************************/
So I need a regular expression for this. I managed to come up with this regex:
//=+\r//(.+)+
It matches the comment in a group, but only one line (example 1). How can I make it work with multi-line comments (like example 2)?
Thanks for help
Using sed:
sed -n '
\_^//==*_!p;
\_^//==*_{
s_//_/*_; s_=_\*_g; s_\*$_\*\\_;
h; p; i\
Description:
: l; n; \_//[^=]_{s_//_\t_;p;};t l;
x;s_^/_\\_;s_\\$_/_;p;x;p;
}
' input_file
Commented version:
sed -n '
# just print non comment lines
\_^//==*_!p;
# for old-style block comments:
\_^//==*_{
# generate header line
s_//_/*_; s_=_\*_g; s_\*$_\*\\_;
# remember header, add description
h; p; i\
Description:
# while comment continues, replace // with tab
: l; n; \_//[^=]_{s_//_\t_;p;};t l;
# modify the header as footer and print
x;s_^/_\\_;s_\\$_/_;p
# also print the non-comment line
x;p;
}
' input_file
This regex matches the whole comment
(\/\/=+)(\s*\/\/ .+?$)+
A short perl script that should do what you need, explained in comments:
#!/usr/bin/perl -p
$ast = '*' x 75; # Number of asterisks.
if (m{//=+}) { # Beginning of a comment.
$inside = 1;
s{.*}{/$ast\\\nDescription:};
next;
}
if ($inside) {
unless (m{^//}) { # End of a comment.
undef $inside;
print '\\', $ast, "/\n" ;
}
s{^//}{}; # Remove the comment sign for internal lines.
}
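For readers who prefer to see the line-by-line state machine spelled out, here is a rough Python sketch of the same approach. It is not a translation of the Perl script: it assumes a block starts at a //==== ruler line and runs until the first line that does not begin with //, and it drops any further ruler lines inside the block. The names convert_comments and RULER are invented for the example.

```python
import re

RULER = re.compile(r"//=+\s*$")       # a line such as //==========

def convert_comments(lines, width=75):
    """Rewrite //==== ... comment blocks as /****\\ ... \\****/ blocks."""
    stars = "*" * width
    out, inside = [], False
    for line in lines:
        is_comment = line.startswith("//")
        if inside and not is_comment:  # block ended: emit the footer
            out.append("\\" + stars + "/")
            inside = False
        if RULER.match(line):
            if not inside:             # first ruler opens the block
                out.append("/" + stars + "\\")
                out.append("Description:")
                inside = True
            continue                   # ruler lines never reach the output
        out.append(line[2:] if inside else line)
    if inside:                         # block ran to the end of the input
        out.append("\\" + stars + "/")
    return out

sample = ["//====", "// some text", "//====", "// more text", "int x;"]
converted = convert_comments(sample, width=5)
print("\n".join(converted))
```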
If a regex is still wanted, this is what I came up with (there may be a better solution):
(?<=\/{2}\s)[\w()\.\s]+
Should get all the text that is of interest.