Edit line names with a new name containing an incremented value - regex

This seems like a simple task, but getting it to work is turning out to be more difficult than I thought. I have a FASTA file containing several million lines of text (but only a few hundred individual sequence entries), and the sequence names are long. I want to replace all characters after the header character > with Contig $n, where $n is an integer starting at 1 and incremented for each replacement.
An example of the input:
>NODE:345643RD:Cov_456:GC47:34thgd
ATGTCGATGCGT
>NODE...
ATGCGCTTACAC
I then want the output to look like this:
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
So maybe a Perl script? I know some basics, but I'd like to read in a file and then output a new file with the changes, and I'm unsure of the best way to do this. I've seen some Perl one-liner examples, but none did what I wanted. Here is my (broken) attempt so far:
$n = 1
if {
s/>.*/(Contig)++$n/e
++$n
}

$ awk '/^>/{$0=">Contig "++n} 1' file
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC

On header lines (matching /^>/), the whole record is replaced by ">Contig " plus a counter that increments before use; the trailing 1 prints every line.

Try something like this:
#!/usr/bin/perl -w
use strict;

open(my $fh,  '<', 'example.txt')  or die "Cannot open example.txt: $!";
open(my $fh1, '>', 'example2.txt') or die "Cannot open example2.txt: $!";

my $n = 1;
# For each line of the input file
while (<$fh>) {
    # Try to rewrite the header; if that succeeds, increment $n
    if (s/^>.*/>Contig $n/) { $n++; }
    print $fh1 $_;
}
close $fh;
close $fh1;

When you use the /e modifier, Perl expects the replacement part of the substitution to be a valid Perl expression, which it then evaluates. Try something like
s/>.*/">Contig " . ++$n/e

perl -i -pe 's/>.*/">Contig " . ++$c/e;' file.txt
Output:
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC

I'm no awk expert (far from it), but I solved this out of curiosity, and because sed doesn't have variables its possibilities here are limited.
One possible gawk solution could be:
awk -v n=1 '/^>/{print ">Contig " n; n++; next}1' <file

Related

Output name of named pattern in sed or grep

I'm looking for a solution that outputs the name of the named pattern that matched in a regular expression.
The regex can contain n patterns, each named idn, with no duplicates:
(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)
input-file:
aba
cec
fav
gex
hur
output (any of the following):
id1
id4
id1;id4
1
4
1;4
Is there any way to do it with sed or grep on a Linux OS? The input file is a text file, 200-500 MB.
I know that PHP exposes pattern names in its match array, but I'd prefer not to use it.
Any other solution is also welcome, but it should use basic linux commands.
Here's a simple Perl script which does what you ask.
perl -nle 'if (m/(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)/) {
for my $pat (keys %+) { print $pat } }' filename
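For context, %+ is Perl's built-in hash of named capture groups from the most recent successful match, so the loop prints whichever name(s) actually matched on each line. A minimal illustration (the input string is invented for the example):

use strict;
use warnings;

# %+ holds only the named groups that captured something.
if ("fav" =~ /(?P<id1>aba)|(?P<id4>fav)/) {
    print "$_\n" for keys %+;   # prints: id4
}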

Extract Filename before date Bash shellscript

I am trying to extract part of the filename - everything before the date and suffix - and I am not sure of the best way to do it in a Bash script. Regex?
The names are part of the filename, and I am trying to store the result in a shell variable. The prefixes will not contain strange characters, and the suffix is always the same. The files are stored in a directory; I will use a loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual: % and %% remove the shortest and longest matching suffix, while # and ## remove the shortest and longest matching prefix (for example, ${file%%.*} strips everything from the first dot onward).
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster and more efficient than spawning sed, awk, etc. for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
Then test $cap against $fn: if they are equal, the parameter expansion did not trim the file name, because no _ was present.
The regex version additionally checks that a date-like string \d\d\d\d-\d\d-\d\d follows the _; which one you need is up to you.
Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
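As a quick, hedged sketch, here is the same pattern applied in Perl (the two sample filenames are inlined as data for illustration):

use strict;
use warnings;

# Print the portion matched by ^\w+(?=_) for each input line.
while (<DATA>) {
    print "$&\n" if /^\w+(?=_)/;
}
__DATA__
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out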
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out, a Bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2

Decrementing NF drops the last _-separated field, and the non-zero result makes awk print the rebuilt record, joined with OFS=_.
You could also try this awk solution, which takes care of all the .out files at once; note this has been written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
My awk version is old, so I am using --re-interval; with a recent version of gawk you may not need it.
Explanation and non-one-liner form of the solution, with comments:

awk --re-interval '  ##--re-interval enables interval expressions like {4} in older gawk; newer versions do not need it.
FNR==1{              ##When the first line of an input file is being read, do the following.
  if(val){           ##If variable val is non-empty...
    close(val)       ##...close the previously read file, so we never hit the "too many open files" limit.
  };
  split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");  ##Split the file name on the _YYYY-MM-DD date, leaving the name part in array[1].
  print array[1];    ##Print that first element.
  val=FILENAME;      ##Remember the current file name so it can be closed later.
  nextfile           ##Skip the rest of this file and jump to the next one, to save some CPU cycles.
}
' *.out              ##All the *.out input file(s).

Using Perl Regex Multiline to reformat file

I have a file with the following format:
(Type 1 data:1)
B
B
(Type 1 data:2)
B
B
B
(Type 1 data:3)
B
..
Now I want to reformat this file so that it looks like:
(Type 1 data:1) B B
(Type 1 data:2) B B B
(Type 1 data:3) B
...
My approach was to use perl regex in command line,
cat file | perl -pe 's/\n(B)/ $1/smg'
My reasoning was to replace the newline character with a space, but it doesn't seem to work. Can you please help me? Thanks.
The -p reads a line at a time, so there is nothing after the "\n" to match with.
perl -pe 'chomp; $_ = ($_ =~ /Type/) ? "\n".$_ : " ".$_'
This does almost what you want, but it puts one extra newline at the beginning and loses the final newline.
If the only place that ( shows up is at the beginning of where you want your lines to start, then you could use this command.
perl -l -0x28 -ne's/\n/ /g;print"($_"if$_' < file
-l causes print to add \n at the end of each line it prints.
-0x28 causes it to split the input on ( instead of on \n.
-n causes it to loop over the input; together with -l it effectively wraps the -e code in while(<>){ chomp; ... }.
s/\n/ /g replaces the newlines inside each chunk with spaces.
print "($_" if $_ puts back the ( that was consumed as the record separator; the if $_ part just stops it from printing an extra line at the beginning.
It's a little more involved, as -n and -p fit best for processing one line at a time, while your requirement is to combine several lines, which means you'd have to maintain state for a while.
So just read the entire file in memory and apply the regex like this:
perl -lwe ^
"local $/; local $_ = <>; print join q( ), split /\n/ for m/^\(Type [^(]*/gsm"
Feed your file to this program on STDIN using input redirection (<).
Note this syntax is for the Windows command line. For Bash, use single quotes to quote the script.
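For example, the same program quoted for Bash (still reading the file from STDIN):

perl -lwe 'local $/; local $_ = <>; print join q( ), split /\n/ for m/^\(Type [^(]*/gsm' < file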

vim & csv file: put header info into a new column

I have a large number of csv files that look like this below:
xxxxxxxx
xxxxx
Shipment,YD564n
xxxxxxxxx
xxxxx
1,RR1760
2,HI3503
3,HI4084
4,HI1824
I need to make them look like the following:
xxxxxxxx
xxxxx
Shipment,YD564n
xxxxxxxxx
xxxxx
YD564n,1,RR1760
YD564n,2,HI3503
YD564n,3,HI4084
YD564n,4,HI1824
YD564n is a shipment number and will be different for every csv file. But it always comes right after "Shipment,".
What vim command(s) can I use?
In one file type the following in normal mode:
qqgg/^Shipment,<CR>ww"ay$}j:.,$s/^/<C-R>a,<CR>q
Note that <CR> is the ENTER key, and <C-R> is CTRL-R.
This will update that file and record the commands in register q.
Then in each other file, type @q (also in normal mode); this will play back register q.
You can do this using a macro, and applying it over several files.
Here's one example. Type the following in as is:
3gg$"ayiw:6,$s/^/<C-R>a/<CR>:w<CR>:bn<CR>
Now that looks horrendous. Let me see if I can explain that a bit better.
3gg$ : Go to the end of the third line.
"ayiw : Copy the last word into the register a.
:6,$s/^/<C-R>a,/<CR> : In every line from the 6th onwards, insert at the beginning the contents of register a followed by a comma.
:w<CR>:bn<CR> : Save and go to the next buffer.
Now you can map this to a key, by
:nnoremap <C-A> 3gg$"ayiw:6,$s/^/<C-R>a,/<CR>:w<CR>:bn<CR>
Then if you have say 200 csv files, you open vim as
vim *.csv
and then
200<C-A>
Where you type Ctrl-A there, and it should be all done.
That said, I'd definitely be more comfortable doing this in a proper scripting language, it'd be much more straightforward.
This could be done as a Perl one-liner:
perl -i.bak -e' $c = do {local $/; <>};
($n) = ($c =~ /Shipment,(\w+)/);
$c =~ s/^(\d+,)/$n,$1/gm;
print $c' shipment.csv
This will read contents of shipment.csv into $c, extract the shipment ID into $n, and prepend every CSV line with the shipment number. The file will be modified in-place with a backup saved to shipment.csv.bak.
To do this from within Vim, adapt it as a filter:
:%!perl -e' $c = do {local $/; <>}; ($n) = ($c =~ /Shipment,(\w+)/); $c =~ s/^(\d+,)/$n,$1/gm; print $c'
Well, don't bash me, but... you could consider: don't do this in vim!
This is a classic usage example for scripting languages. Take a basic Python, Perl or Ruby tutorial; the solution for this would be in it. The regex for this might not be too difficult, and it is doable in vim, but there are much easier and much more flexible alternatives out there. Why vim?
Try this shell script:
#!/bin/sh
input=$1
shipment=`grep Shipment "$input" | awk -F, '{print $2}'`
mv "$input" "$input.orig"
sed -e "s/^\([0-9]\)/$shipment,\1/" "$input.orig" > "$input"
You could iterate through specific files:
for input in *.txt
do
    script.sh "$input"
done
I also think this isn't well suited for vim; how about in Bash instead?
FILENAME='filename.csv' && SHIPMENT=`grep Shipment "$FILENAME" | sed 's/^Shipment,//'` && sed "s/^[0-9]/$SHIPMENT,&/" "$FILENAME" > "$FILENAME.tmp" && mv "$FILENAME.tmp" "$FILENAME"
Writing to a temporary file first avoids truncating the input while it is still being read.

How can I extract lines of text from a file?

I have a directory full of files, and I need to pull the headers and footers off of them. They are all of variable length, so using head or tail isn't going to work. Each file does have a line I can search for, but I don't want to include that line in the results.
It's usually
*** Start (more text here)
And ends with
*** Finish (more text here)
I want the file names to stay the same, so I need to overwrite the originals, or write to a different directory and I'll overwrite them myself.
Oh yeah, it's on a linux server of course, so I have Perl, sed, awk, grep, etc.
Try the flip-flop ".." operator!
# flip-flop.pl
use strict;
use warnings;

my $start  = qr/^\*\*\* Start/;
my $finish = qr/^\*\*\* Finish/;

while ( <> ) {
    if ( /$start/ .. /$finish/ ) {
        next if /$start/ or /$finish/;
        print $_;
    }
}
You can then use the -i Perl switch to update your file(s), like so:
$ perl -i'copy_*' flip-flop.pl data.txt
...which changes data.txt but makes a copy beforehand as copy_data.txt.
GNU coreutils are your friend...
csplit inputfile '%^\*\*\* Start%1' '/^\*\*\* Finish/' '%%' '{*}'
This produces your desired file as xx00. You can change this behaviour through the options --prefix, --suffix, and --digits, but see the manual for yourself. Since csplit is designed to produce a number of files, it is not possible to produce a file without suffix, so you will have to do the overwriting manually or through a script:
csplit "$1" '%^\*\*\* Start%1' '/^\*\*\* Finish/' '%%' '{*}'
mv -f xx00 "$1"
Add loops as you desire.
To take the header off:
cat yourFileHere | awk '{if (d > 0) print $0} /.*Start.*/ {d = 1}'
To take the footer off:
cat yourFileHere | awk '/.*Finish.*/ {d = 1} {if (d < 1) print $0}'
To take both off, keeping only the lines between header and footer, as you want:
cat yourFileHere | awk '/.*Start.*/ {d = 1; next} /.*Finish.*/ {d = 0; next} {if (d > 0) print $0}'
There's one more way, with the csplit command; you could try something like:
csplit yourFileHere /Start/ /Finish/
and examine the files named 'xxNN', where NN is a running number; also take a look at the csplit manpage.
Maybe? Start to Finish with not-delete (in GNU sed the negation ! goes before the command):
$ sed -i '/^\*\*\* Start/,/^\*\*\* Finish/!d' *
or... I'm less sure of this one... but, if it works, it should remove the Start and Finish lines as well:
$ sed -i -e '/./,/^\*\*\* Start/d' -e '/^\*\*\* Finish/,/./d' *
Whether ! is supported on a range address may depend on the build of sed you have -- not sure.
And I wrote that entirely on (probably poor) memory.
A quick Perl hack, not tested. I am not fluent enough in sed or awk to get this effect with them, but I would be interested in how that would be done.
#!/usr/bin/perl -w
use strict;
use Tie::File;

my $Filename = shift;
tie my @File, 'Tie::File', $Filename or die "could not access $Filename.\n";
while (shift @File !~ /^\*\*\* Start/) {}
while (pop @File !~ /^\*\*\* Finish/) {}
untie @File;
Some of the examples in perlfaq5: How do I change, delete, or insert a line in a file, or append to the beginning of a file? may help. You'll have to adapt them to your situation. Also, Leon's flip-flop operator answer is the idiomatic way to do this in Perl, although you don't have to modify the file in place to use it.
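For instance, a minimal sketch of the flip-flop approach that writes the trimmed result to a separate file rather than editing in place (the input and output paths are illustrative):

use strict;
use warnings;

# Keep only the lines strictly between the Start and Finish markers.
open my $in,  '<', 'input.txt'   or die "Cannot read input.txt: $!";
open my $out, '>', 'trimmed.txt' or die "Cannot write trimmed.txt: $!";
while (<$in>) {
    if ( /^\*\*\* Start/ .. /^\*\*\* Finish/ ) {
        next if /^\*\*\* Start/ or /^\*\*\* Finish/;
        print $out $_;
    }
}
close $in;
close $out;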
A Perl solution that overwrites the original file.
#!/usr/bin/perl -ni
# -n loops over input lines; -i rewrites the file in place.
if (my $num = /^\*\*\* Start/ .. /^\*\*\* Finish/) {
    # The flip-flop returns the line's sequence number within the range,
    # with "E0" appended on the range's last line; skip the first line
    # ($num != 1) and the last ($num + 0 eq $num fails on the "E0" value).
    print if $num != 1 and $num + 0 eq $num;
}