How to use regex to match ASTERISK in awk

I'm still pretty new to regular expressions and just started learning to use awk. What I am trying to accomplish is writing a ksh script to read in lines from a text file, and for every line that matches the following:
*RECORD 0000001 [some_serial_#]
to replace $2 (i.e. 0000001) with a different number. So essentially the script reads in a batch record dump, replaces the record number with date+record#, and writes the result to a separate file.
So this is what I'm thinking the format should be:
awk 'match($0,"/*RECORD")!=0{$2="$DATE-n++"; print $0} match($0,"/*RECORD")==0{print $0}' $BATCH > $OUTPUT
but obviously "/*RECORD" is not going to work, and I'm not sure if changing $2 and then writing the whole line is the correct way to do this. So I am in need of some serious enlightenment.

So you want your example line to look like
*RECORD $DATE-n++ [some_serial_#]
after awk's done with it?
awk '{ if (match($0, "*RECORD") != 0) { $2="$DATE-n++"; }; print }' $BATCH > $OUTPUT
Based on your update, it looks like you instead expect $DATE to be an environment variable which is used in the awk expression and n is a variable in the awk script that keeps count of how many records matched the pattern. Given that, this may look more like what you want.
$ cat script.awk
BEGIN { n=0 }
{
if (match($0, /\*RECORD/) != 0) {
n++;
$2 = (ENVIRON["DATE"] "-" n);
}
print;
}
$ awk -f script.awk $BATCH > $OUTPUT
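As a minimal sketch of the same idea, here is a version with the asterisk escaped inside a regex literal and the pattern anchored to the start of the line. The function name renumber and the sample records are made up for illustration, and the date is passed in with -v rather than read from the environment.

```shell
# Hypothetical helper: renumber "*RECORD" lines read from stdin.
# $1 is the date prefix; the sample records below are invented.
renumber() {
  awk -v date="$1" '
    /^\*RECORD/ { n++; $2 = date "-" n }   # \* in a regex literal is a literal asterisk
    { print }
  '
}

printf '%s\n' '*RECORD 0000001 [A1]' 'HEADER x y' '*RECORD 0000002 [B2]' |
  renumber 20240101
```

Lines that do not start with *RECORD pass through unchanged, and n only counts the lines that were rewritten.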

use equality.
D=$(date +%Y%m%d)
awk -vdate="$D" '
{
for(i=1;i<=NF;i++){
if ( $i == "*RECORD" ){
$(i+1) = date"00002"
break # break after searching for one record, otherwise, remove break
}
}
}1' file

Related

Numeric expression in if condition of awk

Pretty new to AWK programming. I have a file1 with entries as:
15>000000513609200>000000513609200>B>I>0011>>238/PLMN/000100>File Ef141109.txt>0100-75607-16156-14 09-11-2014
15>000000513609200>000000513609200>B>I>0011>Danske Politi>238/PLMN/000200>>0100-75607-16156-14 09-11-2014
15>000050354428060>000050354428060>B>I>0011>Danske Politi>238/PLMN/000200>>4100-75607-01302-14 31-10-2014
I want to write an awk script that prints field 2 if the 3rd field minus the 2nd field is 0. Otherwise, if the difference is greater than 0, it should print every value from the 2nd field up to the 3rd field, incrementing by 1 each time. The 3rd field is never less than the 2nd, so that case can be ignored.
I was doing something as:
awk 'NR > 2 { print p } { p = $0 }' file1 | awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
(( Someone told me awk is close to C in terms of syntax ))
But from the output it looks to me that the String to numeric or numeric to string conversions are not taking place at right place at right time. Shouldn't it be taken care by AWK automatically ?
The OUTPUT that I get:
513609200
513609201
513609200
Which is not quite as expected. One evident issue is that it's ignoring the leading 0s.
Kindly help me modify the AWK script to get the desired result.
NOTE:
awk 'NR > 2 { print p } { p = $0 }' file1 is just to remove the 1st and last entry in my original file1. So the part that needs to be fixed is:
awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
In awk, think of $ as an operator to retrieve the value of the named field number ($0 being a special case)
$1 is the value of field 1
$NF is the value of the field given in the NF variable
So, $($3 - $2) will try to get the value of the field number given by the expression ($3 - $2).
You need fewer $ signs
awk -F">" '{
if ($3 == $2)
print $2
else {
v=$2
while (v <= $3)
print v++
}
}'
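To make the role of $ concrete, here is a small sketch on an invented line (the helper name show_fields is hypothetical). It shows that $ takes a field number, so $(expression) selects a field, while $3 - $2 subtracts two field values.

```shell
# Hypothetical helper; fields are separated by ">" as in the question.
show_fields() {
  awk -F'>' '{
    n = 2
    print $n          # field number 2
    print $(n + 1)    # field number 3
    print $3 - $2     # arithmetic on two field values
  }'
}

# Invented sample line.
echo 'a>0005>0008>z' | show_fields
```

This prints 0005, 0008 and 3 on separate lines: the first two are fields looked up by number, the last is a plain numeric difference.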
Normally this will work, but numbers this large can exceed the integer range that some awk implementations handle or print exactly, so you may need another solution. I'm posting this to prompt other solutions and to better illustrate your specification.
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
Note that this skips the rows you say cannot occur (where field 3 is less than field 2).
A small scale example
$ cat file_0
x>1000>1000>etc
x>2000>2003>etc
x>3000>2999>etc
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file_0
1000
2000
2001
2002
2003
Apparently, newer versions of gawk have a --bignum (-M) option for arbitrary-precision integers. If you have a compatible version, that may solve your problem, but I don't have access to one to verify.
For anyone who does not have ready access to gawk with bigint support, it may be simpler to consider other options if some kind of "big integer" support is required. Since ruby has an awk-like mode of operation,
let's consider ruby here.
To get started, there are just four things to remember:
invoke ruby with the -n and -a options (-n for the awk-like loop; -a for automatic parsing of lines into fields ($F[i]));
awk's $n becomes $F[n-1];
explicit conversion of numeric strings to integers is required;
To specify the lines to be executed on the command line, use the '-e TEXT' option.
Thus a direct translation of:
awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
would be:
ruby -an -F'>' -e '($F[1].to_i .. $F[2].to_i).each {|i| puts i }' file
To guard against empty lines, the following script would be slightly better:
($F[1].to_i .. $F[2].to_i).each {|i| puts i } if $F.length > 2
This could be called as above, or if the script is in a file (say script.rb) using the incantation:
ruby -an -F'>' script.rb file
Given the OP input data, the output is:
513609200
513609200
50354428060
The left-padding can be accomplished in several ways -- see for example this SO page.
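In awk, one way to keep the leading zeros is to build a printf format from the width of field 2. This is only a sketch: the helper name expand_range and the sample lines are invented, and it assumes the values stay below 2^53, since awk stores numbers as double-precision floats.

```shell
# Hypothetical helper: print every value from field 2 to field 3 inclusive,
# zero-padded to the width of field 2.
expand_range() {
  awk -F'>' '{
    fmt = "%0" length($2) ".0f\n"    # e.g. "%04.0f\n" for a 4-character field
    for (v = $2 + 0; v <= $3 + 0; v++)
      printf fmt, v
  }'
}

# Invented sample lines.
printf '%s\n' 'x>0098>0101>y' 'x>0007>0007>y' | expand_range
```

Using %.0f rather than %d is a deliberate choice here, because it formats the value as a floating-point number and so avoids any narrower integer conversion an implementation might apply.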

How to replace a text sequence that includes "\n" in a text file

This may sound like a duplicate, but I can't make this work.
Consider:
_ = space
- = minus sign
particle_little.csv is a file of this form:
waste line to be deleted
__data__data__data
_-data__data_-data
__data_-data__data
I need to get a standard csv format in particle_std.csv, like this:
data,data,data
-data,data,-data
data,-data,data
I am trying to use tail and tr to do that conversion, here I split the command:
tail -n +2 particle_little.csv to delete the first line
| tr -s ' ' to remove duplicated spaces
| tr '/\b\n \b/' '\n' to delete the very beginning space
| tr ' ' ',' to change spaces for commas
> particle_std.csv to put it in a output file
But I get this (without the 4th step):
data
data
data
-data
...
Finally, the file is huge, so it is almost impossible to open in editors (I know there are super editors that maybe can)
I would suggest that you use awk:
$ cat file
waste line to be deleted
data data data
-data data -data
data -data data
$ awk -v OFS=, '{ $1 = $1 } NR > 1' file
data,data,data
-data,data,-data
data,-data,data
The script sets the output field separator OFS to , and reassigns the first field to itself $1 = $1, causing awk to touch each line (and replace the spaces with commas). Lines after the first, where NR > 1, are printed (the default action is to print the line).
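The effect of $1 = $1 is easier to see side by side. A small sketch on an invented line (the helper name show_rebuild is hypothetical): the first print shows the record as read, the second shows it after the assignment has rebuilt it with OFS.

```shell
# Hypothetical helper: print each record before and after it is rebuilt.
show_rebuild() {
  awk -v OFS=, '{
    print             # untouched record: original spacing kept
    $1 = $1           # any field assignment rebuilds $0 using OFS
    print
  }'
}

# Invented sample line with leading and repeated spaces.
printf '  data -data  data\n' | show_rebuild
```

The second line of output is data,-data,data: the leading spaces are gone and each run of spaces has become a single comma.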
So if I'm reading you right - ignore lines that don't start with whitespace. Comma separate everything else.
I'd suggest perl:
perl -lane 'next unless /^\s/; print join ",", @F'
This, when given:
waste line to be deleted
data data data
-data data -data
data -data data
On STDIN (Or specified in a filename) outputs:
data,data,data
-data,data,-data
data,-data,data
This is because:
-l strips linefeeds (and replaces them after each print);
-a autosplits on any whitespace
-n wraps it in a while ( <> ) { loop which iterates line by line - functionally it means it works just like sed/grep/tr and reads STDIN or files specified as args.
-e allows specifying a perl snippet.
In this case:
skip any lines that don't start with \s or any whitespace.
any other lines, join the fields (@F generated by -a) with , as delimiter. (This auto-inserts a linefeed because of -l)
Then you can either redirect the output to a file (>output.csv) or use -i.bak to edit inplace.
You should probably use sed or awk for this:
sed -e 1d -e 's/^ *//' -e 's/  */,/g'
One way to do it in Awk is:
awk 'NR == 1 { next }
{ pad=""; for (i = 1; i <= NF; i++) { printf "%s%s", pad, $i; pad="," } print "" }'
but there's a better way to do it in Awk:
awk 'BEGIN { OFS=","} NR == 1 { next } { $1 = $1; print }' data
The BEGIN block sets the output field separator; the assignment $1 = $1; forces Awk to rework the output line; the print prints it.
I've left the first Awk version around because it shows there's more than one way to do it, and in some circumstances, such methods can be useful. But for this task, the second Awk version is better — simpler, more compact (and isomorphic with Tom Fenech's answer).

Find and append to Text Between Two Strings or Words using sed or awk

I am looking for a sed in which I can recognize all of the text in between two indicators and then replace it with a place holder.
For instance, the 1st indicator is a list of words
(no|noone|haven't)
and the 2nd indicator is a list of punctuation
(.|,|!)
From an input text such as
"Noone understands the plot. There is no storyline. I haven't
recommended this movie to my friends! Did you understand it?"
The desired result would be.
"Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I
haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX
friends_AFFIX! Did you understand it?"
I know that there is the following sed:
sed -n '/WORD1/,/WORD2/p' /path/to/file
which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
I have also considered using awk, such as
awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile
yet still, it does not allow me to append the affix.
Does anyone have a suggestion to do so, either with awk or sed?
A slightly more compact awk (gensub requires GNU awk):
$ awk 'BEGIN{RS=ORS=" ";s="_AFFIX"}
/[.,!]$/{f=0; $0=gensub(/(.)$/,s"\\1","g")}
f{$0=$0s}
/Noone|no|haven'\''t/{f=1}1' story
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
Perl to the rescue!
perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
join " ", map "${_}_AFFIX", split " ", $1/egi
' infile > outfile
\K matches what's on its left, but excludes it from the replacement. In this case, it verifies the 1st indicator. (\K needs Perl 5.10+.)
/e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.
Here is one verbose awk command for the same:
s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
a=0
for (i=2; i<=NF; i++) {
if ($(i-1) ~ "\\y" kw "\\y")
a=1
if (a && $i ~ pct "$") {
p = substr($i, length($i), 1)
$i = substr($i, 1, length($i)-1)
}
if (a)
$i=$i "_AFFIX" p
if(p) {
p=""
a=0
}
}
} 1' <<< "$s"
Output:
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
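For anyone without GNU awk, here is a sketch of the same tagging that avoids IGNORECASE, \y and gensub. The function name add_affix is made up, the trigger list is passed in as the variable kw, and a trigger word is only recognized when it stands alone as a whole field. Unlike the version above, the flag is not reset on each line, so a tagged span may continue across a line break.

```shell
# Hypothetical helper: append _AFFIX to every word between a trigger word
# and the next word ending in . , or !
add_affix() {
  awk -v kw="no|noone|haven't" '{
    for (i = 1; i <= NF; i++) {
      w = $i
      if (on && w ~ /[.,!]$/) {
        # closing punctuation: tag the word, keep the mark, stop tagging
        $i = substr(w, 1, length(w) - 1) "_AFFIX" substr(w, length(w))
        on = 0
      } else if (on) {
        $i = w "_AFFIX"
      } else if (tolower(w) ~ ("^(" kw ")$")) {
        on = 1    # trigger word: start tagging from the next word
      }
    }
    print
  }'
}

printf '%s\n' "Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?" |
  add_affix
```

Because fields are reassigned, awk rebuilds each line with single spaces between words.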

Awk print if no match

I am using the following statement in awk with text piped to it from another command:
awk 'match($0,/(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)/) && NR<11 {print substr($0,RSTART,RLENGTH)}'
which is almost working for what I need (find one of the words in the regex within the first 10 lines of the input and print that word). The main thing I need to do is to output something if there is no match. For instance, if none of those words are found in the first ten lines it would output UNKNOWN.
I also need to limit the output to the first match, as I need to ensure a single line of output per input file. I can do this with head or ask another question if needs be, I only include it here in case it affects how to output the no-match text.
I am also not tied to awk as a tool - if there is a simpler way to do this with sed or something else I am open to it.
You just need to exit at the first match, or on line 11 if no match
awk '
match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
print substr($0,RSTART,RLENGTH)
exit
}
NR == 11 {print "UNKNOWN"; exit}
'
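One case worth guarding against is input shorter than 11 lines with no match, where the NR == 11 rule never fires. A sketch of a variant that reports the fallback from an END block instead (the function name classify and the sample input are invented; the word list is the one from the question):

```shell
# Hypothetical helper: print the first document type found in the first
# 10 lines of stdin, or UNKNOWN if none is found.
classify() {
  awk '
    NR > 10 { exit }     # stop reading after line 10; END still runs
    match($0, /QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT/) {
      print substr($0, RSTART, RLENGTH)
      found = 1
      exit
    }
    END { if (!found) print "UNKNOWN" }
  '
}

# Invented sample inputs.
printf '%s\n' 'ACME Pty Ltd' 'TAX INVOICE 123' 'STATEMENT' | classify
printf '%s\n' 'nothing' 'to see' | classify
```

The first call prints TAX INVOICE and the second prints UNKNOWN. In awk, exit inside a main rule still runs the END block, which is what lets one place handle both the short-file and the no-match case.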
I like glenn jackman's answer, however, if you wish to print matches for all 10 lines then you can try something like this:
awk '
match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
print NR " ---> " substr($0,RSTART,RLENGTH)
flag=1
}
NR==11 {
if (flag==0) print "UNKNOWN"
exit
}'
You can do this:
( head -n 10 | grep -Eo '(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)' || echo "UNKNOWN" ) | head -n 1

How can I extract lines of text from a file?

I have a directory full of files and I need to pull the headers and footers off of them. They are all variable length so using head or tail isn't going to work. Each file does have a line I can search for, but I don't want to include the line in the results.
It's usually
*** Start (more text here)
And ends with
*** Finish (more text here)
I want the file names to stay the same, so I need to overwrite the originals, or write to a different directory and I'll overwrite them myself.
Oh yeah, it's on a linux server of course, so I have Perl, sed, awk, grep, etc.
Try the flip flop! ".." operator.
# flip-flop.pl
use strict;
use warnings;
my $start = qr/^\*\*\* Start/;
my $finish = qr/^\*\*\* Finish/;
while ( <> ) {
if ( /$start/ .. /$finish/ ) {
next if /$start/ or /$finish/;
print $_;
}
}
You can then use the -i perl switch to update your file(s) like so:
$ perl -i'copy_*' flip-flop.pl data.txt
...which changes data.txt but makes a copy beforehand as "copy_data.txt".
GNU coreutils are your friend...
csplit inputfile '%^\*\*\* Start%1' '/^\*\*\* Finish/' '%%' '{*}'
This produces your desired file as xx00. You can change this behaviour through the options --prefix, --suffix, and --digits, but see the manual for yourself. Since csplit is designed to produce a number of files, it is not possible to produce a file without suffix, so you will have to do the overwriting manually or through a script:
csplit "$1" '%^\*\*\* Start%1' '/^\*\*\* Finish/' '%%' '{*}'
mv -f xx00 "$1"
Add loops as you desire.
To strip the header (print everything after the Start line):
cat yourFileHere | awk '{if (d > 0) print $0} /.*Start.*/ {d = 1}'
To strip the footer (print everything before the Finish line):
cat yourFileHere | awk '/.*Finish.*/ {d = 1} {if (d < 1) print $0}'
To get the file from header to footer as you want:
cat yourFileHere | awk '/.*Start.*/ {d = 1; next} /.*Finish.*/ {d = 0; next} {if (d > 0) print $0}'
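As a sketch of how that last filter might be applied to real files, the version below anchors the patterns to the literal *** markers and writes the result to a separate directory rather than overwriting in place. The function name strip_markers, the out/ directory and the sample file are all invented for illustration.

```shell
# Hypothetical helper: print only the lines strictly between the
# "*** Start" and "*** Finish" marker lines of the file named in $1.
strip_markers() {
  awk '
    /^\*\*\* Finish/ { inside = 0 }
    inside           { print }
    /^\*\*\* Start/  { inside = 1 }
  ' "$1"
}

# Demo on an invented file; results go to a separate directory (out/)
# so the original is left alone until the output has been checked.
dir=$(mktemp -d)
printf '%s\n' 'header' '*** Start run 1' 'body 1' 'body 2' '*** Finish run 1' 'footer' > "$dir/sample.txt"
mkdir "$dir/out"
strip_markers "$dir/sample.txt" > "$dir/out/sample.txt"
cat "$dir/out/sample.txt"
```

The rule order does the work: the Finish rule clears the flag before the print rule sees that line, and the Start rule sets it only after the print rule has run, so neither marker line is printed.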
There's one more way, with csplit command, you should try something like:
csplit yourFileHere /Start/ /Finish/
And examine files named 'xxNN' where NN is running number, also take a look at csplit manpage.
Maybe? Keep Start through Finish by negating the delete:
$ sed -i '/^\*\*\* Start/,/^\*\*\* Finish/!d' *
or, less sure of it, but if it works it should remove the Start and Finish lines as well:
$ sed -i -e '1,/^\*\*\* Start/d' -e '/^\*\*\* Finish/,$d' *
Note that the ! goes before the d, and -i may depend on the build of sed you have.
And I wrote that entirely from (probably poor) memory.
A quick Perl hack, not tested. I am not fluent enough in sed or awk to get this effect with them, but I would be interested in how that would be done.
#!/usr/bin/perl -w
use strict;
use Tie::File;
my $Filename=shift;
tie my @File, 'Tie::File', $Filename or die "could not access $Filename.\n";
while (shift @File !~ /^\*\*\* Start/) {};
while (pop @File !~ /^\*\*\* Finish/) {};
untie @File;
Some of the examples in perlfaq5: How do I change, delete, or insert a line in a file, or append to the beginning of a file? may help. You'll have to adapt them to your situation. Also, Leon's flip-flop operator answer is the idiomatic way to do this in Perl, although you don't have to modify the file in place to use it.
A Perl solution that overwrites the original file.
#!/usr/bin/perl -ni
if(my $num = /^\*\*\* Start/ .. /^\*\*\* Finish/) {
print if $num != 1 and $num + 0 eq $num;
}