How to reach a specific section of a text file and then search - regex

I have a text file like
Apples
Big 7
Small 6
Apples
Good 5
Bad 3
Oranges
Big 4
Small 2
Good 1
Bad 5
How do I get to a specific section of this file and then do a grep? For example, if I need to find how many Good Oranges there are, how do I do it from the command line with this file as input, using, say, awk?

You could use the range operator like this:
awk '/Apples/,/^$/ { if (/Good/) print $2}' file
would print how many good apples there are:
5
The range operator , evaluates to true on the line where the first pattern matches and stays true up to and including the line where the second pattern matches. The second pattern /^$/ matches a blank line. This means that only the relevant section will be tested for the property Good, Bad, etc.
I'm assuming that your original input file wasn't double-spaced? If it was, the method above can be patched to skip every other line:
awk '!(NR%2){next} /Oranges/,/^$/ { if (/Good/) print $2}' file
When the record number NR is even, NR%2 is 0 and !0 is true, so every other line will be skipped. (The parentheses matter: awk's unary ! binds tighter than %, so !NR%2 would parse as (!NR)%2, which is always 0.)
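If the sections really are separated by blank lines (which the /^$/ pattern above relies on), another option is awk's paragraph mode, where an empty RS makes each blank-line-separated section a single record. A sketch; the fruit and trait variables are my own parameterization, not part of the original answer:
awk -v RS= -v fruit=Oranges -v trait=Good '
    $1 == fruit {                      # first word of the section is its header
        for (i = 2; i < NF; i += 2)    # remaining fields are name/value pairs
            if ($i == trait) print $(i+1)
    }' file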

You could use Bash to read from the file line by line in a loop.
while read -ra fruit; do
    [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
    case $name in
        Oranges) [ "${fruit[0]}" = "Good" ] && echo "${fruit[1]}";;
    esac
done < file
You could also make this a function and pass it arguments to get trait information about any fruit.
read_fruit (){
    while read -ra fruit; do
        [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
        case $name in
            $1) [ "${fruit[0]}" = "$2" ] && echo "${fruit[1]}";;
        esac
    done < file
}
Use:
read_fruit Apples Small
result:
6
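A small tweak (mine, not part of the original answer) makes the filename an argument too instead of hard-coding file; fruits.txt below is just a stand-in name:
read_fruit (){
    while read -ra fruit; do
        [ ${#fruit[@]} -eq 1 ] && name=${fruit[0]}
        case $name in
            "$1") [ "${fruit[0]}" = "$2" ] && echo "${fruit[1]}";;
        esac
    done < "${3:-file}"    # third argument: the data file (defaults to "file")
}

read_fruit Oranges Big fruits.txt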

When you have name/value pairs, it's usually best to first build an array indexed by the name and containing the value; then you can print whatever you're interested in by using the appropriate name(s) to index the array:
$ awk 'NF==1{key=$1} {val[key,$1]=$2} END{print val["Oranges","Good"]}' file
1
$ awk 'NF==1{key=$1} {val[key,$1]=$2} END{print val["Apples","Bad"]}' file
3
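Since the fruit and trait are just array indices, they can be passed in from the command line as well (a variant of the above; the parameterization is my own):
$ awk -v fruit=Oranges -v trait=Good 'NF==1{key=$1} {val[key,$1]=$2} END{print val[fruit,trait]}' file
1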
or, if you're looking for a starting point to implement a more complete/complex set of requirements, here's one way:
$ awk '
NF {
    if (NF==1) {
        key=$1
        keys[key]        # referencing an index creates the (empty) element
    }
    else {
        val[key,$1]=$2
        names[$1]        # remember every trait name seen
    }
}
END {
    for (key in keys)
        for (name in names)
            print key, name, val[key,name]
}
' file
Apples Big 7
Apples Bad 3
Apples Good 5
Apples Small 6
Oranges Big 4
Oranges Bad 5
Oranges Good 1
Oranges Small 2
To test @JohnB's theory that a shell script would be faster than an awk script if there were thousands of files, I copied the OP's input file 5,000 times into a tmp directory, then ran these two equivalent scripts on them (a bash one based on John's answer in this thread, and an awk one that does the same thing):
$ cat tst.sh
for file in "$#"; do
while read -r field1 field2 ; do
[ -z "$field2" ] && name="$field1"
case $name in
Oranges) [ "$field1" = "Good" ] && echo "$field2";;
esac
done < "$file"
done
$ cat tst.awk
NF==1 { fruit=$1 }
fruit=="Oranges" && $1=="Good" { print $2 }
and here are the results of running both on those 5,000 files:
$ time ./tst.sh tmp/* > bash.out
real 0m6.490s
user 0m2.792s
sys 0m3.650s
$ time awk -f tst.awk tmp/* > awk.out
real 0m2.262s
user 0m0.311s
sys 0m1.934s
The 2 output files were identical.

Related

Regular expression to search for a number between two numbers

I am not very familiar with regular expressions.
I have a requirement to extract all lines containing an 8-digit number that falls between two given numbers (for example 20200628 and 20200630), using a regular expression. The boundary numbers are not fixed, but need to be parameterized.
In case you are wondering, this number is a timestamp, and I am trying to extract information between two dates.
HHHHH,E.164,20200626113247
HHHHH,E.164,20200627070835
HHHHH,E.164,20200628125855
HHHHH,E.164,20200629053139
HHHHH,E.164,20200630125855
HHHHH,E.164,20200630125856
HHHHH,E.164,20200626122856
HHHHH,E.164,20200627041046
HHHHH,E.164,20200628125856
HHHHH,E.164,20200630115849
HHHHH,E.164,20200629204531
HHHHH,E.164,20200630125857
HHHHH,E.164,20200630125857
HHHHH,E.164,20200626083628
HHHHH,E.164,20200627070439
HHHHH,E.164,20200627125857
HHHHH,E.164,20200628231003
HHHHH,E.164,20200629122857
HHHHH,E.164,20200630122237
HHHHH,E.164,20200630122351
HHHHH,E.164,20200630122858
HHHHH,E.164,20200630122857
HHHHH,E.164,20200630084722
Assuming the above data is stored in a file named data.txt, the idea is to sort it numerically on the third comma-delimited column (i.e. sort -t, -nk3), and then pass the sorted output through a perl filter, as demonstrated by this find_dates.sh script:
#!/bin/bash
[ $# -ne 3 ] && echo "Expects 3 args: YYYYmmdd start, YYYYmmdd end, and data filename" && exit
DATE1=$1
DATE2=$2
FILE=$3
echo "$DATE1" | perl -ne 'exit 1 unless /^\d{8}$/'
[ $? -ne 0 ] && echo "ERROR: First date is invalid - $DATE1" && exit
echo "$DATE2" | perl -ne 'exit 1 unless /^\d{8}$/'
[ $? -ne 0 ] && echo "ERROR: Second date is invalid - $DATE2" && exit
[ ! -r "$FILE" ] && echo "ERROR: File not found - $FILE" && exit
sort -t, -nk3 "$FILE" | perl -ne '
    BEGIN { $date1 = shift; $date2 = shift }
    print if /164,$date1/ .. /164,$date2/;
    print if /164,$date2/;
' "$DATE1" "$DATE2" | sort -u
Running the command find_dates.sh 20200627 20200629 data.txt will produce the result:
HHHHH,E.164,20200627041046
HHHHH,E.164,20200627070439
HHHHH,E.164,20200627070835
HHHHH,E.164,20200627125857
HHHHH,E.164,20200628125855
HHHHH,E.164,20200628125856
HHHHH,E.164,20200628231003
HHHHH,E.164,20200629053139
HHHHH,E.164,20200629122857
HHHHH,E.164,20200629204531
For the example you gave, between 20200628 and 20200630, you may try:
\b202006(?:2[89]|30)
I might be tempted to make the general comment that regex is not very suitable for finding numerical ranges (whereas application programming languages are). However, in the case of parsing a text log file, regex is what would be easily available.
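For instance, since the dates here are fixed-width YYYYmmdd strings, a plain numeric comparison in awk sidesteps the regex-range problem entirely. A sketch, not from the answers above, assuming the timestamp is always the third comma-separated field:
awk -F, -v lo=20200627 -v hi=20200629 '
    substr($3, 1, 8) + 0 >= lo && substr($3, 1, 8) + 0 <= hi
' data.txt
Unlike the flip-flop approach, this needs no prior sort and handles boundary dates that never actually appear in the data.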

Make reference to a file in a regular expression

I have two files. One is a SALESORDERLIST, which goes like this
ProductID;ProductDesc
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
4,bottles of beer 40 gal
(ProductID;ProductDesc) header is actually not in the file, so disregard it.
In another file, POSSIBLEUNITS, I have -you guessed- the possible units, and their equivalencies:
u;u.;un;un.;unit
k;k.;kg;kg.,kilograms
This is my first day with regular expressions, and I would like to know how I can get the entries in SALESORDERLIST whose units appear in POSSIBLEUNITS. In my example, I would like to exclude entry 4, since 'gal' is not listed in the POSSIBLEUNITS file.
I say regex, since I have a further criteria that needs to be matched:
egrep "^[0-9]+;{1}[^; ][a-zA-Z ]+" SALESORDERLIST
From those resultant entries, I want to get those ending in valid units.
Thanks!
One way of achieving what you want is:
egrep "\b(u|u\.|un|un\.|unit|k|k\.|kg|kg\.|kilograms)\b" SALESORDERLIST
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
The metacharacter \b is an anchor that lets you perform a "whole words only" search, using a regular expression of the form \bword\b.
http://www.regular-expressions.info/wordboundaries.html
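Rather than hard-coding the alternation, you could build it from POSSIBLEUNITS itself. A sketch (my own, assuming a GNU userland), escaping the dots so that u. doesn't match u plus an arbitrary character:
# split on ";" and "," and escape literal dots, then join with "|"
pattern=$(tr ';,' '\n\n' < POSSIBLEUNITS | sed 's/\./\\./g' | paste -sd'|' -)
# keep only lines whose last word is a known unit
grep -E " ($pattern)$" SALESORDERLIST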
One way would be to create a bash script, say called findunit.sh:
while read line
do
    match=$(grep -E "^[0-9]+,{1}[^, ][a-zA-Z ]+" <<< "$line")
    name=${match##* }
    # echo "$name..."
    found=$(grep "$name" /pathtofile/units.txt)
    # echo "xxx$found"
    [ -n "$found" ] && echo "$line"
done < "$1"
Then run with:
findunit.sh SALESORDERLIST
My output from this is:
1,potatoes 1 kg.
2,tomatoes 2 k
3,bottles of whiskey 2 un.
An example of doing it completely in bash:
declare -A units
while read line; do
while [ -n "$line" ]; do
i=`expr index $line ";"`
if [[ $i == 0 ]]; then
units[$line]=1
break
fi
units[${line:0:$((i-1))}]=1
line=${line#*;}
done
done < POSSIBLEUNITS
while read line; do
unit=${line##* }
if [[ ${units[$unit]} == 1 ]]; then
echo $line
fi
done < SALESORDERLIST

Replace number of specified characters

I have something like this:
aaaaaaaaaaaaaaaaaaaaaaaaa
I need something that will allow me to replace a with another character like c from left to right according to the specified number.
For example:
some_command 3 should replace the first 3 a's with c:
cccaaaaaaaaaaaaaaaaaaaaaa
some_command 15
cccccccccccccccaaaaaaaaaa
This can be done entirely in bash:
some_command() {
    a="aaaaaaaaaaaaaaaaaaaaaaaaa"
    c="ccccccccccccccccccccccccc"
    echo "${c:0:$1}${a:$1}"
}

$ some_command 3
cccaaaaaaaaaaaaaaaaaaaaaa
Using awk:
s='aaaaaaaaaaaaaaaaaaaaaaaaa'
awk -F "\0" -v n=3 -v r='c' '{for (i=1; i<=n; i++) $i=r}1' OFS= <<< "$s"
cccaaaaaaaaaaaaaaaaaaaaaa
This might work for you (GNU sed):
sed -r ':a;/a/{x;/^X{5}$/{x;b};s/$/X/;x;s/a/c/;ba}' file
This replaces the first 5 a's with c throughout the file; the hold space accumulates one X per substitution, so replacement stops once it holds XXXXX.
sed -r ':a;/a/{x;/^X{5}$/{z;x;b};s/$/X/;x;s/a/c/;ba}' file
This replaces the first 5 a's with c on each line of the file; the z command empties the counter, so counting starts afresh on following lines.
#!/bin/bash
char=c
word=aaaaaaaaaaaaaaaaaaaaaaaaa
# pass in the number of chars to replace
replaceChar () {
    num=$1
    newword=""
    # this for loop to concatenate the chars could probably be optimized
    for i in $(seq 1 $num); do newword="${newword}${char}"; done
    word="${newword}${word:$num}"
    echo $word
}
replaceChar 4
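As the comment above hints, the concatenation loop can be avoided with printf's format re-use: the format is applied once per argument, and %.0s consumes an argument while printing nothing. A sketch of that optimization (mine, for positive counts):
replaceChar () {
    num=$1
    # "c%.0s" is re-applied once per argument from seq, printing one c each time
    newword=$(printf "${char}%.0s" $(seq 1 "$num"))
    echo "${newword}${word:$num}"
}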
A more general solution than the OP asked for, building on @anubhava's excellent answer. It parameterizes the replacement count as well as the "before" and "after" chars.
The "before" char is matched anywhere - not just at the beginning of the input string, and whether adjacent to other instances or not.
Input is taken from stdin, so multiple lines can be piped in.
# Usage:
# ... | some_command_x replaceCount beforeChar afterChar
some_command_x() {
    awk -F '\0' -v n="$1" -v o="${2:0:1}" -v r="${3:0:1}" -v OFS='' '
    {
        while (++i <= NF)
            if ($i == o) { if (++n_matched > n) break; $i = r }
        i = n_matched = 0
        print
    }'
}
# Example:
some_command_x 2 a c <<<$'abc_abc_abc\naaa rating'
# Returns:
cbc_cbc_abc
cca rating
Perl has some interesting features that can be exploited. Define the following bash script some_command:
#! /bin/bash
str="aaaaaaaaaaaaaaaaaaaaaaaaa"
perl -s -nE'print s/(a{$x})/"c" x length $1/er' -- -x=$1 <<<"$str"
Testing:
$ some_command 5
cccccaaaaaaaaaaaaaaaaaaaa

Awk print if no match

I am using the following statement in awk with text piped to it from another command:
awk 'match($0,/(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)/) && NR<11 {print substr($0,RSTART,RLENGTH)}'
which is almost working for what I need (find one of the words in the regex within the first 10 lines of the input and print that word). The main thing I need to do is to output something if there is no match. For instance, if none of those words are found in the first ten lines it would output UNKNOWN.
I also need to limit the output to the first match, as I need to ensure a single line of output per input file. I can do this with head, or ask another question if need be; I only include it here in case it affects how to output the no-match text.
I am also not tied to awk as a tool - if there is a simpler way to do this with sed or something else I am open to it.
You just need to exit at the first match, or print UNKNOWN on line 11 if there was no match:
awk '
    match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
        print substr($0,RSTART,RLENGTH)
        exit
    }
    NR == 11 {print "UNKNOWN"; exit}
'
I like glenn jackman's answer; however, if you wish to print matches for all 10 lines, then you can try something like this:
awk '
    match($0,/(QUOTATION|TAX ... ORDER|STATEMENT)/) {
        print NR " ---> " substr($0,RSTART,RLENGTH)
        flag=1
    }
    flag==0 && NR==11 {
        print "UNKNOWN"
        exit
    }'
You can do this:
( head -10 | egrep -o '(QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT)' || echo "UNKNOWN" ) | head -1
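With GNU grep, the trailing head -1 can be folded into the match itself via -m 1, which stops after the first matching line (a sketch under the same assumptions as above; note a single line containing two keywords would still print both):
head -10 | grep -m1 -oE 'QUOTATION|TAX INVOICE|ADJUSTMENT NOTE|DELIVERY DOCKET|PICKING SLIP|REMITTANCE ADVICE|PURCHASE ORDER|STATEMENT' || echo "UNKNOWN"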

unix regex for adding contents in a file

I have contents in a file like:
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
I want to write a Unix command that adds 1 + 2 + 3 and gives the result 6.
From what I am aware, grep and awk would be handy; any pointers would help.
I believe the following is what you're looking for. It will sum up the last field in each record for the data that is read from stdin.
awk '{ sum += $NF } END { print sum }' < file.txt
Some things to note:
With awk you don't need to declare variables; they are willed into existence by assigning values to them.
The variable NF is the number of fields in the current record. Prefixing it with $ gives the value of that field, so $NF is always the last field of the record.
The END { } block is run only once, after all records have been processed by the other blocks.
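A quick illustration of NF versus $NF on the sample data (my own, just for clarity):
$ printf 'asdfb ... 1\nadfsdf ... 2\nsdfdf .. 3\n' | awk '{ print "NF=" NF, "$NF=" $NF }'
NF=3 $NF=1
NF=3 $NF=2
NF=3 $NF=3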
An awk script is all you need for that, since it has grep facilities built in as part of the language.
Let's say your actual file consists of:
asdfb zz 1
adfsdf yyy 2
sdfdf xx 3
and you want to sum the third column. You can use:
echo 'asdfb zz 1
adfsdf yyy 2
sdfdf xx 3' | awk '
BEGIN {s=0;}
{s = s + $3;}
END {print s;}'
The BEGIN clause is run before processing any lines, the END clause after processing all lines.
The other clause happens for every line but you can add more clauses to change the behavior based on all sorts of things (grep-py things).
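For example, putting a pattern in front of a clause restricts which lines it sees; this variant (mine, just to illustrate) sums only the lines beginning with a and prints 3:
echo 'asdfb zz 1
adfsdf yyy 2
sdfdf xx 3' | awk '
    /^a/ {s = s + $3;}    # only lines starting with "a" contribute
    END {print s;}'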
This might not exactly be what you're looking for, but I wrote a quick Ruby script to accomplish your goal:
#!/usr/bin/env ruby
total = 0
while gets
  total += $1.to_i if $_ =~ /([0-9]+)$/
end
puts total
Here's one in Perl.
$ cat foo.txt
asdfb ... 1
adfsdf ... 2
sdfdf .. 3
$ perl -a -n -E '$total += $F[2]; END { say $total }' foo.txt
6
Golfed version:
perl -anE'END{say$n}$n+=$F[2]' foo.txt
6