Search strings from bulk data - regex

I have a folder with many files containing text like the following:
blabla
chargeableDuration 00 01 03
...
timeForStartOfCharge 14 55 41
blabla
...
blabla
calledPartyNumber 123456789
blabla
...
blabla
callingPartyNumber 987654321
I require the output like:
987654321 123456789 145541 000103
I have been trying with following awk:
awk -F '[[:blank:]:=,]+' '/findstr chargeableDuration|dateForStartOfCharge|calledPartyNumber|callingPartyNumber/ && $4{
if (calledPartyNumber != "")
print dateForStartOfCharge, "NIL"
dateForStartOfCharge=$5
next
}
/calledPartyNumber/ {
for(i=1; i<=NF; i++)
if ($i ~ /calledPartyNumber/)
break
print chargeableDuration, $i
chargeableDuration=""
}' file
Cannot make it work. Please help.

Assuming your text is in a file named "test.txt", the Linux shell command below will do the work for you.
egrep -o "[0-9 ]{1,}" test.txt | tr -d ' \t\r\f' | sort -nr | tr "\n" "\t"
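For the sample shown, this should produce the following (tab-separated, with a trailing tab); note that any other digit runs in a real file would add extra columns, and the descending numeric sort only happens to match your desired order here:
987654321   123456789   145541   000103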

Pretty much like Manish's answer:
tac test_regex.txt | grep -oP '(?<=chargeableDuration|timeForStartOfCharge|calledPartyNumber|callingPartyNumber)\s+([^\n]+)' | tr -d " \t\r\f" | tr "\n" " "
The only difference is that this keeps the order of appearance (reversed by tac) instead of sorting the result numerically. For your example both solutions produce the same output, but on other inputs you could end up with different results.
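Run on the sample, this should print the four values space-separated, in reversed file order:
987654321 123456789 145541 000103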

awk '/[0-9 ]+$/{
    x = substr($0, index($0, " ") + 1)
    gsub(" ", "", x)
    a[$1] = x
}
END {
    split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration", b, " ")
    for (i=1; i<=4; i++)
        printf "%s ", a[b[i]]
}' file
/[0-9 ]+$/ : Match lines that end with digits, with or without spaces between them.
x=substr($0, index($0," ")+1) : Find the position of the first space in $0 and save the substring after it (i.e. the digits) in variable x
gsub(" ","",x) : Remove white spaces in x
a[$1]=x : Store x in array a, indexed by $1 (the field name)
END:
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ") : Create array b where index 1,2,3 and 4 has value of your required field in the order you need
for (i=1;i<=4;i++){ printf "%s ", a[b[i]] } : loop that prints the values of array a indexed by b[1], b[2], b[3] and b[4], in that order
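Run against the sample file, the script should print:
987654321 123456789 145541 000103
(with a trailing space and no final newline; add print "" in the END block if you want one).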

Related

awk sub with a capturing group into the replacement

I am writing an awk oneliner for this purpose:
file1:
1 apple
2 orange
4 pear
file2:
1/4/2/1
desired output: apple/pear/orange/apple
Addendum: missing numbers should ideally be kept unchanged (1/4/2/3 = apple/pear/orange/3) to prevent loss of info.
Methodology:
Build an associative array key[$1] = $2 for file1
capture all characters between the slashes and replace them by matching to the key of associative array eg key[4] = pear
Tried:
gawk 'NR==FNR { key[$1] = $2 }; NR>FNR { r = gensub(/(\w+)/, "key[\\1]" , "g"); print r}' file1.txt file2.txt
#gawk because need to use \w+ regex
#gensub used because need to use a capturing group
Unfortunately, results are
1/4/2/1
key[1]/key[4]/key[2]/key[1]
Any suggestions? Thank you.
You may use this awk:
awk -v OFS='/' 'NR==FNR {key[$1] = $2; next}
{for (i=1; i<=NF; ++i) if ($i in key) $i = key[$i]} 1' file1 FS='/' file2
apple/pear/orange/apple
Note the ($i in key) check: numbers from file2 that don't exist in the key array are left unchanged; without the check, those fields would become empty.
file1 FS='/' file2 will keep default field separators for file1 but will use / as field separator while reading file2.
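As a minimal illustration of such per-file variable assignments (file names here are hypothetical):
# default whitespace FS applies while reading a.txt; ',' applies while reading b.csv
awk '{ print $1 }' a.txt FS=',' b.csv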
EDIT: In case a number in file2 has no match in file1 and you want to keep the original value as it is, then try the following:
awk '
FNR==NR{
  arr[$1]=$2
  next
}
{
  val=""
  for(i=1;i<=NF;i++){
    val=(val=="" ? "" : val FS) (($i in arr)?arr[$i]:$i)
  }
  print val
}
' file1 FS="/" file2
With your shown samples, please try the following:
awk '
FNR==NR{
  arr[$1]=$2
  next
}
{
  val=""
  for(i=1;i<=NF;i++){
    val = (val=="" ? "" : val FS) arr[$i]
  }
  print val
}
' file1 FS="/" file2
Explanation: Read Input_file1 first, creating array arr indexed by the 1st field with the 2nd field as its value; then set the field separator to / and traverse each field of file2, appending the mapped value to val and printing it at the end of each line.
As Sundeep notes in the comments, you can't use a backreference as an array index. You could mix match and gensub (well, I'm using sub below). Not that this is a recommended approach, but just as an example:
$ awk '
NR==FNR {
    k[$1]=$2                 # hash them
    next
}
{
    while (match($0, /[0-9]+/))                      # keep doing it while it lasts
        sub(/[0-9]+/, k[substr($0,RSTART,RLENGTH)])  # replace here
}1' file1 file2
Output:
apple/pear/orange/apple
And of course, if you have k[1]="word1", you'll end up with a neverending loop.
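One way to avoid that: consume the line left to right and build the result in a separate variable, so replaced text is never rescanned. A minimal sketch using the same file names:
awk '
NR==FNR { k[$1]=$2; next }
{
    out = ""
    while (match($0, /[0-9]+/)) {
        # copy the text before the match, then the mapped value
        out = out substr($0, 1, RSTART-1) k[substr($0, RSTART, RLENGTH)]
        $0 = substr($0, RSTART+RLENGTH)   # continue after the match
    }
    print out $0
}' file1 file2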
With perl (assuming key is always found):
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|$h{$&}|g; print }' f1 f2
apple/pear/orange/apple
if(!$#ARGV) to determine first file (assuming exactly two files passed)
$h{$F[0]}=$F[1] create hash based on first field as key and second field as value
[^/]+ match non / characters
$h{$&} get the value based on matched portion from the hash
If some keys aren't found, leave it as is:
$ cat f2
1/4/2/1/5
$ perl -lane 'if(!$#ARGV){ $h{$F[0]}=$F[1] }
else{ s|[^/]+|exists $h{$&} ? $h{$&} : $&|ge; print }' f1 f2
apple/pear/orange/apple/5
exists $h{$&} checks if the matched portion exists as key.
Another approach using awk without loop:
awk 'FNR==NR{
    a[$1]=$2
    next
}
$1 in a{
    printf("%s%s", FNR>1 ? RS : "", a[$1])
}
END{
    print ""
}' f1 RS='/' f2
$ cat f1
1 apple
2 orange
4 pear
$ cat f2
1/4/2/1
$ awk 'FNR==NR{a[$1]=$2;next}$1 in a{printf("%s%s",FNR>1?RS:"",a[$1])}END{print ""}' f1 RS='/' f2
apple/pear/orange/apple

Using gawk to extract rows with string in a column

I was trying to extract rows from a tab separated file, if it contained a certain word in the 4th column. For example, if input file test.txt is:
chr 8 1234 abc ; xyz
chr 8 1255 abc
chr 8 987 xyz
chr 8 5467 jxyzm
The following code correctly outputs only the 1st and 3rd line:
gawk -F"\t" ' { if($4 ~ /\<xyz\>/) print $0 } ' test.txt >> test.out
However, when I try to run this in a loop in a bash script, my output file is blank. The code I am using is:
while read id
do
OFILE=${ODIR}/${id}.txt
gawk -v id="$id" -F"\t" ' { if($4 ~ /\<id\>/) print $0 } ' ${IFILE} >> ${OFILE}
done < ${GFILE}
The file ${GFILE} has one word per line, e.g.:
xyz
fg45
tre2y
What am I doing wrong?
thanks!
Edited to:
Add fourth row in input file
Added -v id="$id" to command...script still doesn't work!
You can very well use awk to read the search patterns from one file and find matches in the other, like this:
awk -F '\t' '
NR == FNR {
    words[$1]
    next
}
{
    for (w in words)
        if (index($4, w)) {
            print > (w ".txt")
            break
        }
}' "$GFILE" "$IFILE"
Then check output:
cat xyz.txt
chr 8 1234 abc ; xyz
chr 8 987 xyz
If you really-really want to fix your shell script then here it is:
while read id; do
awk -F '\t' -v id="$id" '$4 ~ id' "$IFILE" > "$id.txt"
done < "$GFILE"

How can I group unknown (but repeated) words to create an index?

I have to create a shell script that indexes a book (a text file) by taking any words enclosed in angle brackets (<>) and building an index file from them. I have two questions that hopefully you can help me with!
The first is how to identify the words in the text that are enclosed within angle brackets.
I found a similar question that dealt with words inside square brackets and tried to adapt its code, but I am getting an error.
grep -on \\<.*> index.txt
The original code was the same but with square brackets instead of angle brackets, and now I am receiving an error saying:
line 5: .*: ambiguous redirect
This has been answered
I also now need to take my index and reformat it like so, from:
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
Into:
big: 1 3 9
but: 2
sun: 4 6 7 8
I know that I can flip the columns with an awk command like:
awk -F':' 'BEGIN{OFS=":";} {print $2,$1;}' index.txt
But am not sure how to group the same words into a single line.
Thanks!
Could you please try the following (if you are not worried about sort order; in case you need it sorted, append sort to the code below).
awk '
BEGIN{
  FS=":"
}
{
  name[$2]=($2 in name?name[$2] OFS:"")$1
}
END{
  for(key in name){
    print key": "name[key]
  }
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=":" ##Setting field separator as : here.
}
{
name[$2]=($2 in name?name[$2] OFS:"")$1 ##Creating array name indexed by $2, appending each $1 to the value already stored for that index.
}
END{ ##Starting END block of this code here.
for(key in name){ ##Traversing through name array here.
print key": "name[key] ##Printing key colon and array name value with index key
}
}
' Input_file ##Mentioning Input_file name here.
If you want to extract multiple occurrences of substrings in between angle brackets with GNU grep, you may consider a PCRE regex based solution like
grep -oPn '<\K[^<>]+(?=>)' index.txt
The PCRE engine is enabled with the -P option and the pattern matches:
< - an open angle bracket
\K - a match reset operator that discards all text matched so far
[^<>]+ - 1 or more (due to the + quantifier) occurrences of any char but < and > (see the [^<>] bracket expression)
(?=>) - a positive lookahead that requires (but does not consume) a > char immediately to the right of the current location.
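For instance, on a small invented two-line sample, grep prints each match on its own line, prefixed by its line number:
$ printf 'a <big> day\nthe <sun> is <big>\n' | grep -oPn '<\K[^<>]+(?=>)'
1:big
2:sun
2:big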
Something like this might be what you need, it outputs the paragraph number, line number within the paragraph, and character position within the line for every occurrence of each target word:
$ cat book.txt
Wee, <sleeket>, cowran, tim’rous beastie,
O, what a panic’s in <thy> breastie!
Thou need na start <awa> sae hasty,
Wi’ bickerin brattle!
I wad be laith to rin an’ chase <thee>
Wi’ murd’ring pattle!
I’m <truly> sorry Man’s dominion
Has broken Nature’s social union,
An’ justifies that ill opinion,
Which makes <thee> startle,
At me, <thy> poor, earth-born companion,
An’ fellow-mortal!
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS="\t" }
{
    for (lineNr=1; lineNr<=NF; lineNr++) {
        line = $lineNr
        idx = 1
        while ( match( substr(line,idx), /<[^<>]+>/ ) ) {
            word = substr(line, idx+RSTART, RLENGTH-2)
            locs[word] = (word in locs ? locs[word] OFS : "") NR ":" lineNr ":" idx + RSTART
            idx += (RSTART + RLENGTH)
        }
    }
}
END {
    for (word in locs) {
        print word, locs[word]
    }
}
$ awk -f tst.awk book.txt | sort
awa 1:3:21
sleeket 1:1:7
thee 1:5:34 2:4:24
thy 1:2:23 2:5:9
truly 2:1:6
Sample input courtesy of Rabbie Burns
GNU datamash is a handy tool for working on groups of columnar data; here -s sorts the input, -t: sets the field delimiter, -g2 groups on the 2nd field, and collapse 1 joins each group's 1st-field values with commas (plus some sed to massage its output into the right format):
$ grep -oPn '<\K[^<>]+(?=>)' index.txt | datamash -st: -g2 collapse 1 | sed 's/:/: /; s/,/ /g'
big: 1 3 9
but: 2
sun: 4 6 7 8
To transform
index.txt
1:big
3:big
9:big
2:but
4:sun
6:sun
7:sun
8:sun
into:
big: 1 3 9
but: 2
sun: 4 6 7 8
you can try this AWK program:
awk -F: '{ if (entries[$2]) {entries[$2] = entries[$2] " " $1} else {entries[$2] = $2 ": " $1} }
END { for (entry in entries) print entries[entry] }' index.txt | sort
Shorter version of the same suggested by RavinderSingh13:
awk -F: '{ entries[$2] = ($2 in entries ? entries[$2] " " $1 : $2 ": " $1) }
END { for (entry in entries) print entries[entry] }' index.txt | sort

AWK - add value based on regex

I have to add up the numbers matched by a regex, using awk on Linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does someone have any idea?
UPDATE:
it also should work for
john123 -> 123
123john -> 123
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123
With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^\+/,"","g",a); a=gensub(/\+$/,"","g",a); print a}' inputFile | bc
The regex part replaces all sequences of letters with '+' (e.g., '12johnny34' becomes 12+34). Finally, this arithmetic expression is evaluated by bc.
(To be safe, I remove leading and trailing '+' signs with ^\+ and \+$; otherwise bc would see an incomplete expression.)
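A quick sanity check of the gensub step in isolation (input invented for illustration):
$ echo '12johnny34:x:1' | gawk -F ':' '{ print gensub(/[a-zA-Z]+/, "+", "g", $1) }'
12+34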
You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - records are split into fields with : char
n=split($1, a, /[^0-9]+/) - takes Field 1 and splits it into digit-only chunks, saving the numbers in the a array; the n variable holds the number of chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterates over the a array and sums the values
print b - prints the result.
I used awk's split() to separate the first field on any run of non-digit characters.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123
You can use [^0-9]+ as a field separator and :[^\n]*\n as a record separator instead (note that a regex RS requires gawk or another awk that supports it):
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123
You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$
Here is another awk variant that adds up all the numbers present in the first :-delimited field:
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

Parsing a .csv-like file in bash

I have a file formatted as follows:
string1,string2,string3,...
...
I have to analyze the second column, counting the occurrences of each string, and producing a file formatted as follows:
"number of occurrences of x",x
"number of occurrences of y",y
...
I managed to write the following script, that works fine:
#!/bin/bash
> output
regExp='^\s*([0-9]+) (.+)$'
while IFS= read -r line
do
if [[ "$line" =~ $regExp ]]
then
printf "${BASH_REMATCH[1]},${BASH_REMATCH[2]}\n" >> output
fi
done <<< "`gawk -F , '!/^$/ {print $2}' $1 | sort | uniq -c`"
My question is:
Is there a better and simpler way to do the job?
In particular I don't know how to fix that:
gawk -F , '!/^$/ {print $2}' miocsv.csv | sort | uniq -c | gawk '{print $1","$2}'
The problem is that string2 can contain whitespaces and, if so, the second call on gawk will truncate the string.
Nor do I know how to print all the fields from 2 to NF while maintaining the delimiter, which can occur several times in succession.
Thanks very much,
Goodbye
EDIT:
As asked, here there is some sample data:
(It is an exercise; sorry for the contrived data)
Input:
*,*,*
test, test ,test
prova, * , prova
test,test,test
prova, prova ,prova
leonardo,da vinci,leonardo
in,o u t ,pr
, spaces ,
, spaces ,
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
leonardo,da vinci,leonardo
in,o u t ,pr
test, test ,test
, tabs ,
, tabs ,
po,po,po
po,po,po
po,po,po
prova, * , prova
prova, * , prova
*,*,*
*,*,*
*,*,*
, spaces ,
, tabs ,
Output:
3, *
4,*
4,da vinci
2,o u t
3,po
1, prova
3, spaces
3, tabs
1,test
2, test
A one-liner in awk:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv
It stores the count for each 2nd column string in the associative array x, and in the end loops through the array and prints the results.
To get the exact output you showed for this example, you need to pipe it to sort(1), setting the field delimiter to , and the sort key to the 2nd field:
awk -F, 'x[$2]++ { } END { for (i in x) print x[i] "," i }' input.csv | sort -t, -k2,2
The only condition, of course, is that the 2nd column of each line doesn't contain a ,
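If the 2nd column may be quoted and contain commas, GNU awk's FPAT (which describes what a field looks like, rather than what separates fields) is one possible way around that condition. A sketch, using the FPAT pattern from the gawk manual:
gawk -v FPAT='([^,]*)|("[^"]*")' '{ x[$2]++ }
END { for (i in x) print x[i] "," i }' input.csv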
You can fix your final awk like this:
gawk '{ sub(" *","",$0); sub(" ",",",$0); print }'
or use sed for this sort of thing:
sed 's/ *\([0-9]*\) /\1,/'
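For example, on a typical uniq -c line, the sed keeps the value's own leading and trailing whitespace intact:
$ printf '      3  spaces \n' | sed 's/ *\([0-9]*\) /\1,/'
3, spaces 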
Here is a Perl one-liner, similar to Filipe's awk solution:
perl -F, -lane '$x{$F[1]}++; END{ for $i (sort keys %x) { print "$x{$i},$i" } }' input.csv
The output is sorted alphabetically according to the second column.
The @F autosplit array starts at index $F[0], while awk fields start with $1.