How can I extract lines of text from a file? - regex

I have a directory full of files and I need to pull the headers and footers off of them. They are all variable length so using head or tail isn't going to work. Each file does have a line I can search for, but I don't want to include the line in the results.
It's usually
*** Start (more text here)
And ends with
*** Finish (more text here)
I want the file names to stay the same, so I need to overwrite the originals, or write to a different directory and I'll overwrite them myself.
Oh yeah, it's on a linux server of course, so I have Perl, sed, awk, grep, etc.

Try the flip flop! ".." operator.
# flip-flop.pl
use strict;
use warnings;
my $start = qr/^\*\*\* Start/;
my $finish = qr/^\*\*\* Finish/;
while ( <> ) {
if ( /$start/ .. /$finish/ ) {
next if /$start/ or /$finish/;
print $_;
}
}
You can then use the -i perl switch to update your file(s) like so:
$ perl -i'copy_*' flip-flop.pl data.txt
...which changes data.txt but makes a copy beforehand as "copy_data.txt".

GNU coreutils are your friend...
csplit inputfile %^\*\*\* Start%1 /^\*\*\* Finish/ %% {*}
This produces your desired file as xx00. You can change this behaviour through the options --prefix, --suffix, and --digits, but see the manual for yourself. Since csplit is designed to produce a number of files, it is not possible to produce a file without suffix, so you will have to do the overwriting manually or through a script:
csplit $1 %^\*\*\* Start%1 /^\*\*\* Finish/ %% {*}
mv -f xx00 $1
Add loops as you desire.

To drop the header (print everything after the Start line):
cat yourFileHere | awk '{if (d > 0) print $0} /.*Start.*/ {d = 1}'
To drop the footer (print everything before the Finish line):
cat yourFileHere | awk '/.*Finish.*/ {d = 1} {if (d < 1) print $0}'
To get just the part between the header and the footer, as you want:
cat yourFileHere | awk '/.*Start.*/ {d = 1; next} /.*Finish.*/ {d = 0; next} {if (d > 0) print $0}'
There's one more way, with csplit command, you should try something like:
csplit yourFileHere /Start/ /Finish/
And examine files named 'xxNN' where NN is running number, also take a look at csplit manpage.
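To overwrite the originals, the header-to-footer awk filter can be wrapped in a small shell loop. This is only a sketch: the function names between_markers and trim_in_place are made up for illustration, and it assumes the marker lines begin with *** Start and *** Finish as in the question.

```shell
# Print only the lines strictly between the two marker lines.
# With no file arguments, awk reads standard input.
between_markers() {
    awk '/^\*\*\* Start/  { inside = 1; next }
         /^\*\*\* Finish/ { inside = 0; next }
         inside' "$@"
}

# Overwrite each named file with its trimmed contents.
# The result goes to a temporary file first, so the original is
# only touched once awk has finished successfully.
trim_in_place() {
    local f tmp
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        if between_markers "$f" > "$tmp"; then
            cat "$tmp" > "$f"    # rewrite in place, keeping permissions
        fi
        rm -f "$tmp"
    done
}
```

Something like trim_in_place /path/to/dir/* would then process a whole directory; try it on copies first.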

Maybe? Keep Start through Finish by negating the delete.
$ sed -i '/^\*\*\* Start/,/^\*\*\* Finish/!d' *
or...less sure of it...but, if it works, should remove the Start and Finish lines as well:
$ sed -i -e '1,/^\*\*\* Start/d' -e '/^\*\*\* Finish/,$d' *
The ! negation goes before the command (!d), not after it.
And, I wrote that entirely on (probably poor) memory.

A quick Perl hack, not tested. I am not fluent enough in sed or awk to get this effect with them, but I would be interested in how that would be done.
#!/usr/bin/perl -w
use strict;
use Tie::File;
my $Filename=shift;
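tie my @File, 'Tie::File', $Filename or die "could not access $Filename.\n";
# Drop lines from the top, up to and including the Start line.
while (@File and shift(@File) !~ /^\*\*\* Start/) {};
# Drop lines from the bottom, up to and including the Finish line.
while (@File and pop(@File) !~ /^\*\*\* Finish/) {};
untie @File;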

Some of the examples in perlfaq5: How do I change, delete, or insert a line in a file, or append to the beginning of a file? may help. You'll have to adapt them to your situation. Also, Leon's flip-flop operator answer is the idiomatic way to do this in Perl, although you don't have to modify the file in place to use it.

A Perl solution that overwrites the original file.
#!/usr/bin/perl -ni
if(my $num = /^\*\*\* Start/ .. /^\*\*\* Finish/) {
print if $num != 1 and $num + 0 eq $num;
}

Related

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename - everything before the date and suffix. I am not sure the best way to do it in bashscript. Regex?
The names are part of the filename. I am trying to store it in a shellscript variable. The prefixes will not contain strange characters. The suffix will be the same. The files are stored in a directory - I will use loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
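As a quick illustration of those four operators (a sketch; the variable names are just for this example): % and %% strip the shortest and longest matching suffix, while # and ## strip the shortest and longest matching prefix.

```shell
file=EXAMPLE_FILE_2_2017-10-12.out

shortest_suffix=${file%_*}    # EXAMPLE_FILE_2         (drops "_2017-10-12.out")
longest_suffix=${file%%_*}    # EXAMPLE                (drops from the first "_")
shortest_prefix=${file#*_}    # FILE_2_2017-10-12.out  (drops "EXAMPLE_")
longest_prefix=${file##*_}    # 2017-10-12.out         (drops up to the last "_")

printf '%s\n' "$shortest_suffix" "$longest_suffix" "$shortest_prefix" "$longest_prefix"
```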
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster, more efficient than spawning sed, awk, etc for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
You could also try this awk solution, which handles all the .out files. Note that it was written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
My awk version is old, so I am using --re-interval; with a recent version of awk you may not need it.
Explanation and non-one-liner form of the solution:
awk --re-interval '##Using --re-interval for supporting ERE in my OLD awk version, if OP has new version of awk it could be removed.
FNR==1{ ##Checking here condition that when very first line of any Input_file is being read then do following actions.
if(val){ ##Checking here if variable named val value is NOT NULL then do following.
close(val) ##close the Input_file named which is stored in variable val, so that we will NOT face problem of TOO MANY FILES OPENED, so it will be like one file read close it in background then.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Splitting FILENAME(which will have Input_file name in it) into array named array only, whose separator is a 4 digits-2 digits- then 2 digits, actually this will take care of YYYY-MM-DD format in Input_file(s) and it will be easier for us to get the file name part.
print array[1]; ##Printing array 1st element here.
val=FILENAME; ##Storing FILENAME variable value which will have current Input_file name in it to variable named val, so that we could close it in background.
nextfile ##nextfile, as its name suggests, skips the remaining lines of the current file and jumps to the next file, saving some CPU cycles.
}
' *.out ##Mentioning all *.out Input_file(s) here.

Edit line names with a new name containing an incremented value

This seems like a simple task to me but getting it to work easily is ending up more difficult than I thought:
I have a fasta file containing several million lines of text (only a few hundred individual sequence entries) and these sequence names are long, I want to replace all characters after the header > with Contig $n, where $n is an integer starting at 1 and is incremented for each replacement.
an example input sequence name:
>NODE:345643RD:Cov_456:GC47:34thgd
ATGTCGATGCGT
>NODE...
ATGCGCTTACAC
Which I then want to output like this
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
so maybe a Perl script? I know some basics but I'd like to read in a file and then output the new file with the changes, and I'm unsure of the best way to do this? I've seen some Perl one liner examples but none did what I wanted.
$n = 1
if {
s/>.*/(Contig)++$n/e
++$n
}
$ awk '/^>/{$0=">Contig "++n} 1' file
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
Try something like this:
#!/usr/bin/perl -w
use strict;
open (my $fh, '<','example.txt') or die "Cannot read example.txt: $!";
open (my $fh1, '>','example2.txt') or die "Cannot write example2.txt: $!";
my $n = 1;
# For each line of the input file
while(<$fh>) {
# Try to update the name, if successful, increment $n
if ($_ =~ s/^>.*/>Contig $n/) { $n++; }
print $fh1 $_;
}
When you use the /e modifier, Perl expects the substitution pattern to be a valid Perl expression. Try something like
s/>.*/">Contig " . ++$n/e
perl -i -pe 's/>.*/">Contig " . ++$c/e;' file.txt
Output:
>Contig 1
ATGTCGATGCGT
>Contig 2
ATGCGCTTACAC
I'm no awk expert (far from it), but I solved this out of curiosity, and because sed has no variables (so its possibilities are limited).
One possible gawk solution could be
awk -v n=1 '/^>/{print ">Contig " n; n++; next}1' <file
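To read one file and write the renamed sequences to a new one, the awk approach can be wrapped in a small shell function. This is a sketch only: renumber_contigs is a made-up name, and it assumes every header line starts with > in the first column.

```shell
# Replace each FASTA header line with ">Contig N", counting from 1.
# Sequence lines pass through unchanged.
renumber_contigs() {
    awk '/^>/ { print ">Contig " (++n); next }
              { print }' "$@"
}
```

Usage would look like renumber_contigs input.fasta > output.fasta (both file names are placeholders), which leaves the original file untouched.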

Delete lines before and after a match within a specified tags in SED

I need to delete the lines before and after a matching pattern, within the enclosing tag:
< mds:insert>
< attributeValues>
< AttrNames>
< Item Value="MyContact_c"/>
< /AttrNames>
< /attributeValues>
< /mds:insert>
Using
sed -i -n '/MyContact_c/{s/.*//;x;d;};x;p;${x;p;}' $file
removes only line before and after the matching pattern, need to delete all the contents within the mds:insert tag... Any pointers will be helpful.
It isn't sed, but here is one for ex using the snippet you posted as the file content for $file:
kitsune:~$ printf '%s\n' 'set ic
1;/="MyContact_c"/<|?<mds:insert?+;/<\/mds:insert>/-d
%p' | ex -s $file
Output:
<mds:insert>
</mds:insert>
That will print the remains of the file after the first instance of the section is removed. If you want this done for all instances, the command line will look like this:
'set ic
g/="MyContact_c"/<|?<mds:insert?+;/<\/mds:insert>/-d
%p'
You can use this in the for loop of a shell script if you want this done to multiple files. Naturally you'll want a backup copy if you do such a thing, so make sure to copy the file before altering it if you intend to overwrite it.
By the way, if you've ever used Vim or even vi, these sorts of commands are used for saving, quitting, etc. It is worth adding ex to your toolbox of knowledge IMHO.
Edit
C Shell users cannot use these commands as-is because they contain quoted newlines, which isn't allowed in a C shell. Instead, you can modify the first command like so:
kitsune:~% printf '%s\n%s\n%s\n' 'set ic' '1;/="MyContact_c"/<|?<mds:insert?+;/<\/mds:insert>/-d' '%p' | ex -s $file
You can similarly do the same with the other string.
Disclaimer: I'm not a C shell user myself, so there might be a better way, but I don't know it.
Here is a way to do it with awk
awk '{a[NR]=$0} /MyContact_c/ {f=NR} END {for (i=1;i<=NR;i++) if (i+2<f || i-2>f || !f) print a[i]}' file
< mds:insert>
< /mds:insert>
It skips two lines before and two lines after the pattern, if it is found.
Here's one way to do it in sed:
sed -e ':a' -e '/ mds:insert/!{p;d;}' -e 'N;/\/mds:insert/{/MyContact_c/!p;d;};ba' filename
EDIT:
You and I may be using different versions of sed or something. Let's try an experiment:
sed -e '/ mds:insert/!{p;d;}' filename
It won't do anything very interesting, but I want to know whether it generates an error.
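If the number of lines inside the tag varies, another option is to buffer each block and decide what to do with it when the closing tag arrives. The following is a sketch, not a tested drop-in: empty_matching_inserts is a made-up name, it assumes the opening and closing mds:insert tags each sit on their own line, and it keeps the two tag lines while dropping everything between them when the search string occurs in the block.

```shell
# $1 is a fixed string to look for; remaining arguments are input files
# (standard input is read when none are given).
empty_matching_inserts() {
    needle=$1
    shift
    awk -v needle="$needle" '
        # Opening tag: print it and start buffering the block body.
        /<[ ]*mds:insert[ >]/ { inblock = 1; hit = 0; n = 0; print; next }
        # Closing tag: emit the buffered body only if the string was not seen.
        inblock && /<[ ]*\/mds:insert>/ {
            if (!hit) for (i = 1; i <= n; i++) print buf[i]
            print
            inblock = 0
            next
        }
        # Inside a block: remember the line and note whether it matches.
        inblock { buf[++n] = $0; if (index($0, needle)) hit = 1; next }
        # Outside any block: pass the line through.
        { print }
    ' "$@"
}
```

For example, empty_matching_inserts MyContact_c "$file" > "$file.new" writes the result to a new file so the original can be checked before it is replaced.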

How to use regex to match ASTERISK in awk

I'm still pretty new to regular expressions and just started learning to use awk. What I am trying to accomplish is writing a ksh script to read in lines from a text file, and for every line that matches the following:
*RECORD 0000001 [some_serial_#]
replace $2 (i.e. 0000001) with a different number. So essentially the script reads in a batch record dump, replaces the record number with date+record#, and writes to a separate file.
So this is what I'm thinking the format should be:
awk 'match($0,"/*RECORD")!=0{$2="$DATE-n++"; print $0} match($0,"/*RECORD")==0{print $0}' $BATCH > $OUTPUT
but obviously "/*RECORD" is not going to work, and I'm not sure if changing $2 and then write the whole line is the correct way to do this. So I am in need of some serious enlightenment.
So you want your example line to look like
*RECORD $DATE-n++ [some_serial_#]
after awk's done with it?
awk '{ if (match($0, /\*RECORD/) != 0) { $2="$DATE-n++"; }; print }' $BATCH > $OUTPUT
Based on your update, it looks like you instead expect $DATE to be an environment variable which is used in the awk expression and n is a variable in the awk script that keeps count of how many records matched the pattern. Given that, this may look more like what you want.
$ cat script.awk
BEGIN { n=0 }
{
if (match($0, /\*RECORD/) != 0) {
n++;
$2 = (ENVIRON["DATE"] "-" n);
}
print;
}
$ awk -f script.awk $BATCH > $OUTPUT
Use equality: compare the field to the literal string "*RECORD", so no regex escaping is needed.
D=$(date +%Y%m%d)
awk -vdate="$D" '
{
for(i=1;i<=NF;i++){
if ( $i == "*RECORD" ){
$(i+1) = date"00002"
break # break after searching for one record, otherwise, remove break
}
}
}1' file
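Putting the equality test together with a running counter might look like the sketch below. The function name renumber_records is invented for this example, and it assumes *RECORD is always the first field of the line; note that assigning to $2 makes awk rebuild the line with single spaces between fields.

```shell
# $1 is the date stamp; remaining arguments are input files
# (standard input is read when none are given).
renumber_records() {
    stamp=$1
    shift
    # On *RECORD lines, replace field 2 with "<stamp>-<count>".
    # Every line, changed or not, is printed.
    awk -v stamp="$stamp" '
        $1 == "*RECORD" { $2 = stamp "-" (++n) }
        { print }
    ' "$@"
}
```

It could be called as renumber_records "$(date +%Y%m%d)" "$BATCH" > "$OUTPUT".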

vim & csv file: put header info into a new column

I have a large number of csv files that look like this below:
xxxxxxxx
xxxxx
Shipment,YD564n
xxxxxxxxx
xxxxx
1,RR1760
2,HI3503
3,HI4084
4,HI1824
I need to make them look like the following:
xxxxxxxx
xxxxx
Shipment,YD564n
xxxxxxxxx
xxxxx
YD564n,1,RR1760
YD564n,2,HI3503
YD564n,3,HI4084
YD564n,4,HI1824
YD564n is a shipment number and will be different for every csv file. But it always comes right after "Shipment,".
What vim command(s) can I use?
In one file type the following in normal mode:
qqgg/^Shipment,<CR>ww"ay$}j:.,$s/^/<C-R>a,<CR>q
Note that <CR> is the ENTER key, and <C-R> is CTRL-R.
This will update that file and record the commands in register q.
Then in each other file type @q (also in normal mode). This plays back register q.
You can do this using a macro, and applying it over several files.
Here's one example. Type the following in as is:
3gg$"ayiw:6,$s/^/<C-R>a,/<CR>:w<CR>:bn<CR>
Now that looks horrendous. Let me see if I can explain that a bit better.
3gg$ : Go to the end of the third line.
"ayiw : Copy the last word into the register a.
:6,$s/^/<C-R>a,/<CR> : In every line from the 6th onwards, insert the contents of register a, followed by a comma, at the beginning.
:w<CR>:bn<CR> : Save and go to the next buffer.
Now you can map this to a key, by
:nnoremap <C-A> 3gg$"ayiw:6,$s/^/<C-R>a,/<CR>:w<CR>:bn<CR>
Then if you have say 200 csv files, you open vim as
vim *.csv
and then
200<C-A>
Where you type Ctrl-A there, and it should be all done.
That said, I'd definitely be more comfortable doing this in a proper scripting language, it'd be much more straightforward.
This could be done as a Perl one-liner:
perl -i.bak -e' $c = do {local $/; <>};
($n) = ($c =~ /Shipment,(\w+)/);
$c =~ s/^(\d+,)/$n,$1/gm;
print $c' shipment.csv
This will read contents of shipment.csv into $c, extract the shipment ID into $n, and prepend every CSV line with the shipment number. The file will be modified in-place with a backup saved to shipment.csv.bak.
To do this from within Vim, adapt it as a filter:
:%!perl -e' $c = do {local $/; <>}; ($n) = ($c =~ /Shipment,(\w+)/); $c =~ s/^(\d+,)/$n,$1/gm; print $c'
Well, don't bash me, but... you could consider: Don't do this in vim!!
This is a classic usage example for scripting languages.
Take a basic python, perl or ruby tutorial. The solution for this would
be in it.
The regex for this might not be too difficult and it is doable in vim.
But there are much easier alternatives out there.
And much more flexible ones.
Why vim?
Try this shell script:
#!/bin/sh
input=$1
shipment=`grep Shipment $input|awk -F, '{print $2}'`
mv $input $input.orig
sed -e "s/^\([0-9]\)/$shipment,\1/" $input.orig > $input
You could iterate through specific files:
for input in *.csv
do
script.sh "$input"
done
I also think this isn't well suited for vim, how about in Bash instead?
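FILENAME='filename.csv' && SHIPMENT=$(grep Shipment "$FILENAME" | sed 's/^Shipment,//') && sed "s/^[0-9]/$SHIPMENT,&/" "$FILENAME" > "$FILENAME.tmp" && mv "$FILENAME.tmp" "$FILENAME"
Writing to a temporary file matters here: redirecting sed's output straight back onto $FILENAME would truncate the file before it is read.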
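Since the Shipment line always comes before the data rows, a single awk pass can pick up the shipment number and prepend it as it goes. This is a sketch under that assumption; prefix_shipment is a made-up name, and it treats any line that starts with digits followed by a comma as a data row.

```shell
# Prepend the shipment number to every data row that follows the
# "Shipment,<id>" line. Reads standard input when no file is given.
prefix_shipment() {
    awk -F, '
        $1 == "Shipment"       { id = $2 }           # remember the shipment number
        id != "" && /^[0-9]+,/ { $0 = id "," $0 }    # data row: add the prefix
                               { print }
    ' "$@"
}

# Possible use over many files, one file per call so each gets its own id:
#   for f in *.csv; do prefix_shipment "$f" > "$f.new" && mv "$f.new" "$f"; done
```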