Edit Specific Floating-Point Field, Preserving/Padding w/ Whitespace in AWK? - regex

I have a file with the line:
CH1 12.30 4.800 12 !
I want to replace a specific field ... say $2 ... with its value scaled by a chosen floating-point scalar on [0.0,1.0). However, I want to keep the same number of decimal digits, and further, to pad the front end with spaces to maintain the original field length.
I'm thinking some combination of length/gsub/printf in awk could accomplish this.
As an example of what I have tried currently:
scalar=0.00; echo 'CH1 12.30 4.800 12 !' | awk -v sc=$scalar '/CH1/{gsub(/[0-9]*\.[0-9]*/,$2*sc,$2);} {print;}'
Output:
CH1 0 4.800 12 !
Notes:
Correctly outputs the scaled number, but spaces are stripped not just from field $2 but from the entire line.
scalar=0.00; echo 'CH1 12.30 4.800 12 !' | awk -v sc=$scalar '/CH1/{gsub(/$2/,$2*sc,$0);} {print;}'
Output:
CH1 12.30 4.800 12 !
Notes:
Does nothing! Output is unchanged.
Assumptions:
Fields $2 and $3 may be the same, but I ONLY want to change field $2.
Field $1 contains only alphanumeric characters.
Fields $2 and $3 are floating-point numbers with an arbitrary # of decimal digits, typically with the # of digits being on the range [1,4]. The whole part has no more than 3 digits.
Field $4 is an integer on the range [8,99].
Anything after field $4 is a comment and may contain special characters.
Searching for similar questions I've come across some pertaining to whitespace preservation, and those have given me some ideas... but mine is a bit different because I actually want to add whitespace, to keep the decimal place effectively locked in the same spot on the line and keep the user formatting nice in the targeted file.

The gsub(/$2/,...) expressions fail because /$2/ is looking for a literal $2 string, as opposed to whatever is in field 2. (And gsub is overkill since we are only changing one instance, so plain sub suffices, but gsub is harmless here.)
We can use just $2 (without slashes, although it's going to be treated as a regular expression rather than a literal string):
$ scalar=0.00; echo 'CH1 12.30 4.800 12 !' |
awk -v sc=$scalar '/CH1/{gsub($2,$2*sc);} {print;}'
CH1 0 4.800 12 !
This loses the decimal place stuff too, so is still not quite what we want, but shows that your approach can work.
Given that sprintf() can produce a string according to a format directive like "%5.2f" (which is what we would want to get 12.30), all we need to do is figure out the total length of the field $2 and the length of the fractional part (after the .), which is easy using split and length. Constructing the replacement string is even easier than it might first look, because instead of a literal 5 and 2, we can use * to extract integer arguments. Hence:
$ cat foo.sh
#! /bin/sh
scalar=0.00
echo 'CH1 12.30 4.800 12 !'
echo 'CH1 12.30 4.800 12 !' |
awk -v sc=$scalar '
$2 ~ /[0-9]*\.[0-9]*/ {
split($2, parts, /\./)
ofraclen = length(parts[2])
repl = sprintf("%*.*f", length($2), ofraclen, $2 * sc)
sub(/[0-9]*\.[0-9]*/, repl)
}
{print}
'
$ sh foo.sh
CH1 12.30 4.800 12 !
CH1 0.00 4.800 12 !
I put in the extra echo so that we can see that the fields still line up. I changed the matching criterion to $2 ~ ... so that we are guaranteed that $2 will split properly. We split it into its integer and fractional parts, grab the length of the fractional part, produce the replacement string, and then use sub on the first occurrence of a floating-point number (this is safe only if field $1 can never match the pattern; there is no test for that here, and if $1 did match, we would sub the wrong occurrence).
(I actually like the semicolons after each statement, but I took them all out here since they're not strictly required. Also, most of the temporary variables can be eliminated, keeping just parts, but the result will be difficult to understand.)
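If that sub() ambiguity is a concern, here is a sketch of a variation on the same idea that sidesteps sub() entirely by splicing the replacement into $0 at field 2's byte offset (it assumes field $1 never contains a space and that fields are separated by single spaces, per the stated assumptions):

```shell
echo 'CH1 12.30 4.800 12 !' | awk -v sc=0.00 '/CH1/ {
  split($2, parts, /\./)
  repl = sprintf("%*.*f", length($2), length(parts[2]), $2 * sc)
  start = index($0, " ") + 1    # field 2 begins right after the first space
  $0 = substr($0, 1, start - 1) repl substr($0, start + length($2))
} 1'
```

This prints CH1  0.00 4.800 12 ! with the overall line length unchanged, and it never depends on what field $1 looks like.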

This is a general approach to reproducing the padding from the input in the output after operating on some field(s):
$ cat tst.awk
NR==1 {
# Find the width of each space-padded, right-aligned field:
rec = $0
for (i=1; i<=NF; i++) {
match(rec,/[^[:space:]]+/)
w[i] = RSTART - 1 + RLENGTH
rec = substr(rec,w[i]+1)
}
# Find the precision of the target field:
match($2,/\..*/)
p = RLENGTH - 1
}
{
# print the original just for comparison
print
# do the math:
$2 = sprintf("%.*f", p, $2 * scalar)
# print the updated record:
for (i=1;i<=NF;i++) {
printf "%*s", w[i], $i
}
print ""
}
$ awk -v scalar=0 -f tst.awk file
CH1 12.30 4.800 12 !
CH1 0.00 4.800 12 !
$ awk -v scalar=0.5 -f tst.awk file
CH1 12.30 4.800 12 !
CH1 6.15 4.800 12 !
$ awk -v scalar=9 -f tst.awk file
CH1 12.30 4.800 12 !
CH1 110.70 4.800 12 !
The above will work no matter what the value of scalar or which floating point field you want to change (easy tweak to work for decimal fields too if desired) and no matter what the value of $1.
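As a sketch of that "which field" tweak, the target field number can be passed in as a variable (call it f here; f=2 reproduces the output above, and the variable name is just an illustration):

```shell
echo 'CH1 12.30 4.800 12 !' |
awk -v scalar=0.5 -v f=2 '
NR==1 {
  rec = $0
  for (i=1; i<=NF; i++) {
    match(rec, /[^[:space:]]+/)
    w[i] = RSTART - 1 + RLENGTH
    rec = substr(rec, w[i]+1)
  }
  match($f, /\..*/)
  p = RLENGTH - 1
}
{
  $f = sprintf("%.*f", p, $f * scalar)
  for (i=1; i<=NF; i++) printf "%*s", w[i], $i
  print ""
}'
```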

Related

Bash / Regex: Replacing the second field in a CSV file when some of the first fields start with quotes and commas within those

This question is for code written in bash, but it's really more of a regex question. I have a file (ARyy.txt) with CSV values in it. I want to replace the second field with NaN. This is no problem at all for the simple cases (rows 1 and 2 in the example), but it's much more difficult for a few cases where the first field is quoted because it contains commas (quotes are present if and only if the field contains commas). When present, quotes are always the first and last characters of the first field.
Here is what I have thus far. NOTE: please try to answer using sed and the general format. There is a way to do this using awk's FPAT from what I know, but I ideally need one using sed (or a simple use of awk).
#!/bin/bash
LN=1 #Line Number
while read -r LIN #LIN is a variable containing the line
do
echo "$LN: $LIN"
((LN++))
if [ $LN -eq 1 ]; then
continue #header line
elif [[ ${LIN:0:1} == "\"" ]]; then #if the first character in the line is a quote
sed -i '${LN}s/\",/",NaN/' ARyy.txt #replace quote followed by comma with quote followed by comma followed by NaN
else #if first character doesn't start with a quote
sed -i '${LN}s/,[^,]*/,0/' ARyy.txt; fi
done < ARyy.txt
Other pertinent info:
There are never double or nested quotes or anything peculiar like this
There can be more than one comma inside the quotations
I am always replacing the second field
The second field is always just a number for the input (Never words or quotes)
Input Example:
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
Banana, 5, 10, 323
"Banana, green, 10 MG", 3, 14, 444 #Notice this line has commas in it but it has quotes to indicate this)
Desired Output:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444 #second field changed to NaN and first field remains in tact
Try this:
sed -E -i '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' ARyy.txt
Explanation: sed -E invokes "extended" regular expression syntax, so it's easier to use parenthesized groups.
2,$ = On lines 2 through the end of file...
s/ = Replace...
^ = the beginning of a line
("[^"]*"|[^",]*) = either a double-quoted string or a string that doesn't contain any double-quotes or commas
(, *) = a comma, maybe followed by some spaces
[0-9]* = a number
, = and finally a comma
/ = ...with...
\1 = the first () group (i.e. the original first field)
\2 = the second () group (i.e. comma and spaces)
NaN, = Not a number, and the following comma
/ = end of replacement
Note that if the first field could contain escaped double-quotes and/or escaped commas (not in double-quotes), the first pattern would have to be significantly more complex to deal with them.
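For example, running the substitution (without -i) over the sample rows from the question:

```shell
printf '%s\n' \
  'Fruit, Weight, Intensity, Key' \
  'Apple, 10, 12, 343' \
  '"Banana, green, 10 MG", 3, 14, 444' |
sed -E '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/'
```

The header passes through untouched, and both the plain and the quoted first fields come out with NaN in the second column.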
BTW, the original has an antipattern I see disturbingly often: reading through a file line-by-line to decide what to do with that line, then running something that processes the entire file in order to change that one line. So if you have a thousand-line file, it winds up processing the entire file a thousand times (for a total of a million lines processed). This is what's known as "quadratic scaling", because it takes time proportional to the square of the problem size. As Bruce Dawson put it,
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
Given your specific format, in particular that the first field won't ever have any escaped double quotes in it:
sed -E '2,$ s/^("[^"]*"|[^,]*),[^,]*/\1,NaN/' < input.csv > output.csv
This does require the common but non-standard -E option to use POSIX Extended Regular Expression syntax instead of the default Basic (which doesn't support alternation).
One (somewhat verbose) awk idea that replaces the entire set of code posted in the question:
awk -F'"' ' # input field separator = double quotes
function print_line() { # print array
pfx=""
for (i=1; i<=4; i++) {
printf "%s%s", pfx, arr[i]
pfx=OFS
}
printf "\n"
}
FNR==1 { print ; next } # header record
NF==1 { split($0,arr,",") # no double quotes => split line on comma
arr[2]=" NaN" # override arr[2] with " NaN"
}
NF>=2 { split($0,arr,",")... wait
}
{ print_line() } # print our array
' ARyy.txt
For the sample input file this generates:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444
while read -r LIN; do
if [ $LN -eq 1 ]; then
((LN++))
continue
elif [[ $LIN == $(echo "$LIN" | grep '"') ]]; then
word1=$(echo "$LIN" | awk -F ',' '{print $4}')
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
elif [[ $LIN == $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]') ]]; then
word2=$(echo "$LIN" | cut -f2 -d ',')
echo "$LIN" | sed -i "$LN"s/"$word2"/\ NaN/ ARyy2.txt
fi
echo "$LN: $LIN"
((LN++))
done <ARyy.txt
Make a copy of the input ARyy.txt to ARyy2.txt and use that text file as the output
(read from ARyy.txt and write to ARyy2.txt).
The first elif, $(echo "$LIN" | grep '"'), checks whether the line contains a quote ".
Once such a line is selected, we grab the number 3 with awk -F ',' '{print $4}' and save it to variable word1. -F tells awk to separate columns each time it encounters a ,, so there are 6 columns in total, and the number 3 is in column 4; that's why {print $4}.
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
Then sed selects the line number with $LN and replaces the number 3 (held in $word1) with NaN. BUT we want to keep a space before NaN, so the space is escaped: \ NaN.
We always use echo "$LIN" to grab the correct line.
The second elif, $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]'), checks whether the line has the pattern word + comma + space + digit.
$LIN only returns one line at a time.
Once selected, we grab the number 10 (the second column), this time with cut -f2 -d ',' and save it to variable word2. -f2 selects the second column, and -d tells cut to use , to separate each column.

CLI: Increment a number in a string while keeping padded zeroes

I currently use this perl command to increment the last number in a string:
perl -pe 's/(\d+)(?!.*\d+)/$1+1/e' <<< "abc123_00456.txt"
It outputs abc123_457.txt, while I want abc123_00457.txt.
I also want something like 99 to increment to 100, though if that's too hard, 00 is also acceptable.
Some more examples of what I want:
09 -> 10
004 -> 005
I also want to be able to increment by any number (not just 1), so no ++.
I do not want to use shell's builtins to accomplish this.
Try this:
perl -pe 's/(\d+)(?=\D*\z)/my $n = $1; ++$n/e' <<< "abc123_00456.txt"
The ++ operator preserves the number of digits when incrementing a string ("magic" string auto-increment).
Alternatively:
perl -pe 's/(\d+)(?=\D*\z)/sprintf "%0*d", length($1), $1 + 1/e' <<< "abc123_00456.txt"
This lets you increment by more than just 1 (or perform other arithmetic operations).
sprintf %d formats an integer in decimal format. 0 means to pad the result with zeroes; * means the (minimum) field width is taken from the next argument instead of the format string itself. (E.g. %05d means "format a number by padding it with zeroes until it is at least 5 characters wide".)
Here we simply take the length of the original string of digits (length($1)) and use it as our field width. The number to format is $1 + 1. If it is shorter than the original string, sprintf automatically adds zeroes.
See also perldoc -f sprintf.
You can use a formatted string with sprintf:
perl -pe 's/(\d+)(?!.*\d)/sprintf("%05d",$1+1)/e' <<< "abc123_00456.txt"
The 5 gives the width of your number, the 0 is the character used to pad the number.
For an unknown width, you can build the formatted string dynamically:
perl -pe 's/(\d+)(?!.*\d)/sprintf("%0".length($1)."d",$1+1)/e' <<< "abc123_00456.txt"
With GNU awk for the 3rd arg to match():
$ awk -v n=17 'match($0,/(.*[^0-9])([0-9]+)(.*)/,a){$0=a[1] sprintf("%0*d",length(a[2]),a[2]+n) a[3]} 1' <<< "abc123_00456.txt"
abc123_00473.txt
With any awk in any shell on every UNIX box:
$ awk -v n=17 'match($0,/[0-9]+\./){lgth=RLENGTH-1; tgt=substr($0,RSTART,lgth); $0=substr($0,1,RSTART-1) sprintf("%0*d",lgth,tgt+n) substr($0,RSTART+lgth)} 1' <<< "abc123_00456.txt"
abc123_00473.txt
This might work for you (GNU sed & Bash):
sed -E 's/^([^0-9]*([0-9]+[^0-9]+)*0*)([0-9]+)(.*)/echo "\1$((\3+1))\4"/e' file

Numeric expression in if condition of awk

Pretty new to AWK programming. I have a file1 with entries as:
15>000000513609200>000000513609200>B>I>0011>>238/PLMN/000100>File Ef141109.txt>0100-75607-16156-14 09-11-2014
15>000000513609200>000000513609200>B>I>0011>Danske Politi>238/PLMN/000200>>0100-75607-16156-14 09-11-2014
15>000050354428060>000050354428060>B>I>0011>Danske Politi>238/PLMN/000200>>4100-75607-01302-14 31-10-2014
I want to write a awk script, where if 2nd field subtracted from 3rd field is a 0, then it prints field 2. Else if the (difference > 0), then it prints all intermediate digits incremented by 1 starting from 2nd field ending at 3rd field. There will be no scenario where 3rd field is less than 2nd. So ignoring that condition.
I was doing something as:
awk 'NR > 2 { print p } { p = $0 }' file1 | awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
(( Someone told me awk is close to C in terms of syntax ))
But from the output it looks to me like the string-to-numeric or numeric-to-string conversions are not taking place at the right place at the right time. Shouldn't that be taken care of by AWK automatically?
The OUTPUT that I get:
513609200
513609201
513609200
Which is not quite as expected. One evident issue is that it's ignoring the preceding 0s.
Kindly help me modify the AWK script to get the desired result.
NOTE:
awk 'NR > 2 { print p } { p = $0 }' file1 is just to remove the 1st and last entry in my original file1. So the part that needs to be fixed is:
awk -F">" '{if ($($3 - $2) == 0) print $2; else l = $($3 - $2); for(i=0;i<l;i++) print $2++; }'
In awk, think of $ as an operator to retrieve the value of the named field number ($0 being a special case)
$1 is the value of field 1
$NF is the value of the field given in the NF variable
So, $($3 - $2) will try to get the value of the field number given by the expression ($3 - $2).
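A quick illustration of $ as an operator, on a throwaway line:

```shell
echo 'a b c d' | awk '{ n = 2; print $n, $(n + 1), $NF }'   # prints: b c d
```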
You need fewer $ signs
awk -F">" '{
if ($3 == $2)
print $2
else {
v=$2
while (v < $3)
print v++
}
}'
Normally this would work, but your numbers exceed awk's integer bounds, so you need another solution to handle them. I'm posting this to prompt other solutions and to better illustrate your specification.
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
note that this will skip the rows that you say impossible to happen
A small scale example
$ cat file_0
x>1000>1000>etc
x>2000>2003>etc
x>3000>2999>etc
$ awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file_0
1000
2000
2001
2002
2003
Apparently, newer versions of gawk have an -M/--bignum option for arbitrary-precision integers; if you have a compatible version, that may solve your problem, but I don't have access to a copy to verify.
For anyone who does not have ready access to gawk with bigint support, it may be simpler to consider other options if some kind of "big integer" support is required. Since ruby has an awk-like mode of operation,
let's consider ruby here.
To get started, there are just four things to remember:
invoke ruby with the -n and -a options (-n for the awk-like loop; -a for automatic parsing of lines into fields ($F[i]));
awk's $n becomes $F[n-1];
explicit conversion of numeric strings to integers is required;
To specify the lines to be executed on the command line, use the '-e TEXT' option.
Thus a direct translation of:
awk -F'>' '{for(i=$2;i<=$3;i++) print i}' file
would be:
ruby -an -F'>' -e '($F[1].to_i .. $F[2].to_i).each {|i| puts i }' file
To guard against empty lines, the following script would be slightly better:
($F[1].to_i .. $F[2].to_i).each {|i| puts i } if $F.length > 2
This could be called as above, or if the script is in a file (say script.rb) using the incantation:
ruby -an -F'>' script.rb file
Given the OP input data, the output is:
513609200
513609200
50354428060
The left-padding can be accomplished in several ways -- see for example this SO page.
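For instance, one sketch of the left-padding idea, staying in awk for brevity and using small numbers to stay inside ordinary awk integer range: build a "%0Nd" format from the width of the original field.

```shell
echo 'x>0005>0008>etc' |
awk -F'>' '{ w = length($2); for (i = $2; i <= $3; i++) printf "%0" w "d\n", i }'
```

This prints 0005 through 0008, one per line, each padded to the original 4-digit width.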

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use linux awk regex expressions to find the line that contains the longest repeated1 word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
1 Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number to avoid matches for lines numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified to skip the sort step and combine the two awk steps and the sort step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file then used with grep -f.
A way with perl:
perl -F, -ane 'if (@m = $F[1] =~ /(?=(.+)\1)/g) {
@m = sort { length $b <=> length $a } @m;
$cl = length $m[0];
if ($l < $cl) { @res = ($_); $l = $cl; } elsif ($l == $cl) { push @res, $_; }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line option:
-F, set the field separator to ,
-ane (e: execute the following code; n: read a line at a time and put its content in $_; a: autosplit, using the defined FS, putting the fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position"; it doesn't match any characters). To obtain the characters matched inside the lookahead, all you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).

unix regex for adding contents in a file

I have contents in a file
like
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
I want to write a unix command that should be able to add 1 + 2 + 3 and give the result as 6
From what I am aware grep and awk would be handy, any pointers would help.
I believe the following is what you're looking for. It will sum up the last field in each record for the data that is read from stdin.
awk '{ sum += $NF } END { print sum }' < file.txt
Some things to note:
With awk you don't need to declare variables, they are willed into existence by assigning values to them.
The variable NF is the number of fields in the current record. Prefixing it with $ retrieves the value of the field whose number it holds, so $NF is the last field.
The END { } block is run only once, after all records have been processed by the other blocks.
An awk script is all you need for that, since it has grep facilities built in as part of the language.
Let's say your actual file consists of:
asdfb zz 1
adfsdf yyy 2
sdfdf xx 3
and you want to sum the third column. You can use:
echo 'asdfb zz 1
adfsdf yyy 2
sdfdf xx 3' | awk '
BEGIN {s=0;}
{s = s + $3;}
END {print s;}'
The BEGIN clause is run before processing any lines, the END clause after processing all lines.
The other clause happens for every line but you can add more clauses to change the behavior based on all sorts of things (grep-py things).
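To see the "grep-py" side of clauses: a pattern in front of an action restricts it to matching lines, so summing only the lines that mention foo (made-up data) looks like:

```shell
printf 'foo 1\nbar 2\nfoo 3\n' | awk '/foo/ { s += $2 } END { print s }'   # prints 4
```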
This might not exactly be what you're looking for, but I wrote a quick Ruby script to accomplish your goal:
#!/usr/bin/env ruby
total = 0
while gets
total += $1.to_i if $_ =~ /([0-9]+)$/
end
puts total
Here's one in Perl.
$ cat foo.txt
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
$ perl -a -n -E '$total += $F[2]; END { say $total }' foo.txt
6
Golfed version:
perl -anE'END{say$n}$n+=$F[2]' foo.txt
6