Using gawk to extract rows with string in a column - regex

I am trying to extract rows from a tab-separated file if they contain a certain word in the 4th column. For example, if input file test.txt is:
chr 8 1234 abc ; xyz
chr 8 1255 abc
chr 8 987 xyz
chr 8 5467 jxyzm
The following code correctly outputs only the 1st and 3rd line:
gawk -F"\t" ' { if($4 ~ /\<xyz\>/) print $0 } ' test.txt >> test.out
However, when I try to run this in a loop in a bash script, my output file is blank. The code I am using is:
while read id
do
OFILE=${ODIR}/${id}.txt
gawk -v id="$id" -F"\t" ' { if($4 ~ /\<id\>/) print $0 } ' ${IFILE} >> ${OFILE}
done < ${GFILE}
The file ${GFILE} has one word per line, e.g.:
xyz
fg45
tre2y
What am I doing wrong?
thanks!
Edited to:
Add fourth row in input file
Added -v id="$id" to command...script still doesn't work!

You can very well use awk to read search patterns from one file and find matches in another like this:
awk -F '\t' '
NR == FNR {
words[$1]
next
}
{
# whole-word match, so "xyz" does not match "jxyzm"
for (w in words)
if ($4 ~ ("(^|[^A-Za-z0-9_])" w "([^A-Za-z0-9_]|$)")) {
print > (w ".txt")
break
}
}' "$GFILE" "$IFILE"
Then check output:
cat xyz.txt
chr 8 1234 abc ; xyz
chr 8 987 xyz
If you really-really want to fix your shell script then here it is:
while read id; do
gawk -F '\t' -v id="$id" '$4 ~ ("\\<" id "\\>")' "$IFILE" > "$id.txt"
done < "$GFILE"
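As for why the original loop printed nothing: inside /.../ the text id is literal, so /\<id\>/ searches for the word "id" rather than the value passed with -v. A dynamic regex has to be built by string concatenation instead. Below is a minimal sketch of that idea; the function name match_word is hypothetical, and it uses a portable bracket expression in place of gawk's \< and \> so that it should also run on other awk implementations.

```shell
# match_word WORD [FILE...]: print tab-separated rows whose 4th column
# contains WORD as a whole word (hypothetical helper name).
match_word() {
    word=$1
    shift
    awk -F '\t' -v id="$word" '
        BEGIN {
            # Concatenate strings to build the regex from the variable.
            re = "(^|[^A-Za-z0-9_])" id "([^A-Za-z0-9_]|$)"
        }
        $4 ~ re
    ' "$@"
}

# Sample rows from the question, fed on standard input.
printf 'chr\t8\t1234\tabc ; xyz\nchr\t8\t1255\tabc\nchr\t8\t987\txyz\nchr\t8\t5467\tjxyzm\n' |
    match_word xyz
# prints the rows whose 4th column is "abc ; xyz" and "xyz", but not "jxyzm"
```

The word is still interpreted as a regular expression here, so entries in ${GFILE} that contain characters such as . or + would need escaping first.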

Related

File fields and columns adjustment with awk [LINUX]

I have an issue adjusting the column delimiters of a file on Linux before loading it into a database.
I need 14 columns and I use "|" as the delimiter, so I applied:
awk -F'|' '{missing=14-NF;if(missing==0){print $0}else{printf "%s",$0;for(i=1;i<=missing-1;i++){printf "|"};print "|"}}' myFile
Suppose I have a row like that:
a|b|c|d|e||f||g||||h|i|
after applying the awk command it will be:
a|b|c|d|e||f||g||||h|i||
This is not acceptable; I need the data to have exactly 14 columns.
Sample input (row with 14 fields):
a|b|c|d|e||f||g||||h|i
Do nothing
Sample input (row with extra fields):
a|b|c|d|e||f||g||||h|i|
Output:
a|b|c|d|e||f||g||||h|i
Sample input (row with fewer fields):
a|b|c|d||e||f||g|h
output:
a|b|c|d||e||f||g|h|||
You may use this gnu-awk solution:
awk -v n=14 '
BEGIN {FS=OFS="|"}
{
$0 = gensub(/^(([^|]*\|){13}[^|]*)\|.*/, "\\1", "1")
for (i=NF+1; i<=n; ++i)
$i = ""
} 1' file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i
a|b|c|d||e||f||g|h|||
Where original file is this:
cat file
a|b|c|d|e||f||g||||h|i
a|b|c|d|e||f||g||||h|i|
a|b|c|d||e||f||g|h
Here:
Using gensub, we remove all extra fields
Using the for loop, we create new fields to make NF = n
If you don't have gnu-awk, then the following should work on non-gnu awk (tested on BSD awk):
awk -v n=14 '
BEGIN {FS=OFS="|"}
{
for (i=NF+1; i<=n; ++i) $i=""
for (i=n+1; i<=NF; ++i) $i=""
NF = n
} 1' file
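To confirm that every row really ends up with 14 fields, a quick field count can be run before and after either command. This is only a sketch; check_fields is a hypothetical helper name and not part of the answer above.

```shell
# check_fields N [FILE...]: report rows whose "|"-separated field count is not N.
check_fields() {
    n=$1
    shift
    awk -F'|' -v n="$n" 'NF != n { printf "line %d: %d fields\n", FNR, NF }' "$@"
}

# The three sample rows from the question, before any fix is applied.
printf '%s\n' 'a|b|c|d|e||f||g||||h|i' 'a|b|c|d|e||f||g||||h|i|' 'a|b|c|d||e||f||g|h' |
    check_fields 14
# line 2: 15 fields
# line 3: 11 fields
```

Running the same check on the corrected output should print nothing.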

Add a condition for specific row length in a script

I want to modify the following script:
awk 'NR>242 && $1 =='$t' {print $4, "\t" '$t'}' test.txt > file
I want to add a condition for the first "1 to 121" data (corresponding to the first 121 points) and then for the "122 to 242" data (which corresponds to the other 121 points).
so it becomes:
when NR>242, take the corresponding values of rows from 1 to 121 and print them to file1
when NR>242, take the corresponding values of rows from 122 to 242 and print them to file2
Thanks!
Generic solution: here is a more generic solution, where you give all the line numbers in the lines variable of the awk program. Whenever the current line number matches one of those values, the output file counter is increased by 1, e.g. from file1 to file2, from file2 to file3, and so on.
awk -v val="$t" -v lines="121,242" -v count=1 '
BEGIN{
  num=split(lines,arr,",")
  for(i=1;i<=num;i++){
    line[arr[i]]                 # remember each switch-over line number
  }
  outputfile="file" count        # start with file1
}
FNR in line{                     # on a listed line, move to the next file
  close(outputfile)
  outputfile="file" (++count)
}
($1 == val){
  print $4 "\t" val > (outputfile)
}
' Input_file
With your shown samples, please try the following. Matching lines from the 1st to the 242nd line are printed to file1, and matching lines from line 243 onwards are printed to file2. The shell variable t is passed into the awk program as the variable val.
awk -v val="$t" '
FNR==1{
outputfile="file1"
}
FNR==243{
outputfile="file2"
}
($1 == val){
print $4 "\t" val > (outputfile)
}
' Input_file
$ awk -v val="$t" '{c=int((NR-1)%242/121)+1}
$1==val {print $4 "\t" $1 > ("output" c)}' file
This should send the first, third, etc. blocks of 121 records to output1 and the second, fourth, etc. blocks of 121 records to output2 if they satisfy the condition.
If you want to skip the first two blocks (the first 242 records), just add an && NR>242 condition to the existing one.
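The block arithmetic in that last command is easy to check in isolation. The sketch below (block_of is a hypothetical helper name) prints the output number that a given record number maps to.

```shell
# block_of NR: which output (1 or 2) a record number falls into,
# using blocks of 121 records that alternate every 242 records.
block_of() {
    awk -v nr="$1" 'BEGIN { print int((nr - 1) % 242 / 121) + 1 }'
}

block_of 1     # 1
block_of 121   # 1
block_of 122   # 2
block_of 242   # 2
block_of 243   # 1
```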

awk convert number to date format to select line bigger than specific mmyy

INPUT:
test,1120,1
test,1219,2
Expected Output
test,1120,1
Goal: print the lines where $2, which is in mmyy format, is bigger than a given value, 1020 in this example.
I've tried with the following:
awk -F, '{ if ( $2 > 1020 ) { print $0 }}' file
That does not give the expected output, because $2 is still compared as a plain number: 1219 is bigger than 1020.
Assuming the 2nd field always contains 4 digits, how about:
awk -F, 'substr($2, 3, 2) substr($2, 1, 2) > 2010' input
Please note that I have interpreted the word bigger as later, meaning 0921 is bigger than 1020. If my assumption is incorrect, please let me know.
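The idea behind the one-liner is to turn mmyy into yymm, so that a plain string comparison orders the dates chronologically. A minimal sketch of that conversion, assuming 4-digit fields and years within the same century; later_than and key are hypothetical names:

```shell
# later_than MMYY: keep comma-separated lines from standard input
# whose 2nd field is a later month than MMYY.
later_than() {
    awk -F, -v ref="$1" '
        # Swap the two halves: mmyy -> yymm.
        function key(mmyy) { return substr(mmyy, 3, 2) substr(mmyy, 1, 2) }
        key($2) > key(ref)
    '
}

printf 'test,1120,1\ntest,1219,2\ntest,0921,3\n' | later_than 1020
# test,1120,1
# test,0921,3
```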
EDIT: Since the OP mentioned that dates earlier than the provided value are now required, one could try the following.
awk -v val="1020" '
BEGIN{
FS=OFS=","
user_year=substr(val,3)
user_month=substr(val,1,2)
}
{
year=substr($2,3)
month=substr($2,1,2)
if(year==user_year){
if(month<user_month){
print
}
}
else if(year<user_year){
print
}
}
' Input_file
Could you please try the following. I have created a variable named val, which holds the value that the user wants to compare against every line of Input_file. In this case it is set to 1020.
awk -v val="1020" '
BEGIN{
FS=OFS=","
user_year=substr(val,3)
user_month=substr(val,1,2)
}
{
year=substr($2,3)
month=substr($2,1,2)
if(year==user_year){
if(month>user_month){
print
}
}
if(year>user_year){
print
}
}
' Input_file

AWK - add value based on regex

I have to add the numbers returned by REGEX using awk in linux.
Basically from this file:
123john456:x:98:98::/home/john123:/bin/bash
I have to add the numbers 123 and 456 using awk.
So the result would be 579
So far I have done the following:
awk -F ':' '$1 ~ VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'VAR+="/[0-9].*(?=:)/" ; {print VAR}' /etc/passwd
awk -F ':' 'match($1, VAR=/[0-9].*?:/) ; {print VAR}' /etc/passwd
And from what I've seen match doesn't support this at all.
Does anyone have any idea?
UPDATE:
it also should work for
john123 result - > 123
123john result - > 123
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
With your updated requirements:
$ cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ awk -F':' '{split($1,t,/[^0-9]+/); print t[1] + t[2]}' file
579
123
123
With gawk and for the given example
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); print a}' inputFile | bc
would do the job.
More general:
awk -F ':' '{a=gensub(/[a-zA-Z]+/,"+", "g", $1); a=gensub(/^\+/,"","g",a); a=gensub(/\+$/,"","g",a); print a}' inputFile | bc
The regex part replaces every sequence of letters with '+' (e.g., '12johnny34' becomes 12+34). Finally, this arithmetic expression is evaluated by bc.
(To be safe, I remove leading and trailing '+' signs with ^\+ and \+$.)
You may use
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' /etc/passwd
Demo:
s="123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash"
awk -F ':' '{n=split($1, a, /[^0-9]+/); b=0; for (i=1;i<=n;i++) { b += a[i]; }; print b; }' <<< "$s"
Output:
579
123
Details
-F ':' - records are split into fields with : char
n=split($1, a, /[^0-9]+/) - takes Field 1 and splits it into digit-only chunks, saving the numbers in the a array; the n variable holds the number of these chunks
b=0 - b will hold the sum
for (i=1;i<=n;i++) { b += a[i]; } - iterate over a array and sum the values
print b - prints the result.
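The question suspected that match() cannot do this. It can, if it is called in a loop. The sketch below (sum_digits is a hypothetical helper name) is an alternative to the split() approach: it repeatedly takes the next run of digits from the first field using RSTART and RLENGTH.

```shell
# sum_digits [FILE...]: for each line, add up every run of digits
# found in the first colon-separated field.
sum_digits() {
    awk -F: '{
        s = $1
        total = 0
        while (match(s, /[0-9]+/)) {
            total += substr(s, RSTART, RLENGTH)   # the digits just matched
            s = substr(s, RSTART + RLENGTH)       # continue after the match
        }
        print total
    }' "$@"
}

printf '%s\n' '123john456:x:98:98::/home/john123:/bin/bash' \
              'john123:x:98:98::/home/john123:/bin/bash' \
              '123john:x:98:98::/home/john123:/bin/bash' | sum_digits
# 579
# 123
# 123
```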
I used awk's split() to separate the first field on any string not containing numbers.
split(string, target_array, [regex], [separator_array]*)
*separator_array requires gawk
$ awk -F: '{split($1, A, /[^0-9]+/, S); print S[1], A[1]+A[2]}' <<EOF
123john456:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
EOF
john 579
john 123
You can use [^0-9]+ as a field separator, and :[^\n]*\n as a record separator instead:
awk -F '[^0-9]+' 'BEGIN{RS=":[^\n]*\n"}{print $1+$2}' /etc/passwd
so that given the content of /etc/passwd being:
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
This outputs:
579
123
123
You can try Perl also
$ cat johnny.txt
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
$ perl -F: -lane ' $_=$F[0]; $sum+= $1 while(/(\d+)/g); print $sum; $sum=0 ' johnny.txt
579
123
123
$
Here is another awk variant that adds all the numbers present in the first colon-separated field:
cat file
123john456:x:98:98::/home/john123:/bin/bash
john123:x:98:98::/home/john123:/bin/bash
123john:x:98:98::/home/john123:/bin/bash
1j2o3h4n5:x:98:98::/home/john123:/bin/bash
awk -F '[^0-9:]+' '{s=0; for (i=1; i<=NF; i++) {s+=$i; if ($i~/:$/) break} print s}' file
579
123
123
15

Search strings from bulk data

I have a folder with many files containing text like the following:
blabla
chargeableDuration 00 01 03
...
timeForStartOfCharge 14 55 41
blabla
...
blabla
calledPartyNumber 123456789
blabla
...
blabla
callingPartyNumber 987654321
I require the output like:
987654321 123456789 145541 000103
I have been trying with following awk:
awk -F '[[:blank:]:=,]+' '/findstr chargeableDuration|dateForStartOfCharge|calledPartyNumber|callingPartyNumber/ && $4{
if (calledPartyNumber != "")
print dateForStartOfCharge, "NIL"
dateForStartOfCharge=$5
next
}
/calledPartyNumber/ {
for(i=1; i<=NF; i++)
if ($i ~ /calledPartyNumber/)
break
print chargeableDuration, $i
chargeableDuration=""
}' file
Cannot make it work. Please help.
Assuming you have a file with text named "test.txt", below linux shell command will do the work for you.
egrep -o "[0-9 ]{1,}" test.txt | tr -d ' \t\r\f' | sort -nr | tr "\n" "\t"
Pretty much like Manish's answer:
tac test_regex.txt | grep -oP '(?<=chargeableDuration|timeForStartOfCharge|calledPartyNumber|callingPartyNumber)\s+([^\n]+)' | tr -d " \t\r\f" | tr "\n" " "
The only difference is that this keeps the order of the (reversed) input instead of sorting the result. For your example both solutions produce the same output, but with other data they could differ.
awk '/[0-9 ]+$/{
x=substr($0,( index($0," ") + 1 ) );
gsub(" ","",x);
a[$1]=x
}
END {
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ");
for (i=1;i<=4;i++){
printf a[(b[i])]" "
}
}'
/[0-9 ]+$/ : Match lines that end with digits, with or without spaces between them.
x=substr($0,( index($0," ") + 1 ) ) : Find the first space in $0 and save everything after it (i.e. the digits) in the variable x
gsub(" ","",x) : Remove the spaces in x
a[$1]=x : Store x in the array a, using $1 (the field name) as the index
END:
split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration",b," ") : Create array b where index 1,2,3 and 4 has value of your required field in the order you need
for (i=1;i<=4;i++){
printf a[(b[i])]" "
} : for loop to get the value in array a with index as value in array b[1],b[2],b[3] and b[4]
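The walkthrough above can be tried on the sample from the question. The version below is a sketch under a few assumptions: extract_call is a hypothetical wrapper name, the input is read from files or standard input, and printf is given an explicit "%s" format with a trailing newline.

```shell
# extract_call [FILE...]: print calling number, called number, start time
# and duration on one line, in that order.
extract_call() {
    awk '
        /[0-9 ]+$/ {
            x = substr($0, index($0, " ") + 1)   # text after the first space
            gsub(/ /, "", x)                     # drop spaces between digits
            val[$1] = x                          # keyed by the field name
        }
        END {
            n = split("callingPartyNumber calledPartyNumber timeForStartOfCharge chargeableDuration", want, " ")
            for (i = 1; i <= n; i++)
                printf "%s%s", val[want[i]], (i < n ? " " : "\n")
        }
    ' "$@"
}

printf '%s\n' 'blabla' 'chargeableDuration 00 01 03' '...' \
              'timeForStartOfCharge 14 55 41' 'calledPartyNumber 123456789' \
              'blabla' 'callingPartyNumber 987654321' | extract_call
# 987654321 123456789 145541 000103
```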