I have data like this (file is called list-in.dat)
a ; b ; c ; i
d
e ; f ; a ; b
g ; h ; i
and I want a list like this (output file list-out.dat) with all items, in alphabetical order (case insensitive), and each unique item only once.
a
b
c
d
e
f
g
h
i
My attempt is:
awk -F " ; " ' BEGIN { OFS="\n" ; } {for(i=0; i<=NF; i++) print $i} ' file-in.dat | uniq | sort -uf > file-out.dat
But I end up with all entries except those from lines which have only one item:
a
b
c
e
f
g
h
i
How can I get all (unique, sorted) items no matter how many items are in one line / if the field separator is missing?
Using gnu-awk:
awk -F '[[:blank:]]*;[[:blank:]]*' '{
for (i=1; i<=NF; i++) uniq[$i]
}
END {
PROCINFO["sorted_in"]="@ind_str_asc"
for (i in uniq)
print i
}' file
a
b
c
d
e
f
g
h
i
For non-gnu awk use:
awk -F '[[:blank:]]*;[[:blank:]]*' '{for (i=1; i<=NF; i++) uniq[$i]}
END{for (i in uniq) print i}' file | sort
awk -F' ; ' -v OFS='\n' '{$1=$1} 1' ip.txt | sort -fu
-F' ; ' sets space followed by ; followed by space as field separator
-v OFS='\n' sets newline as output field separator
{$1=$1} change $0 as per new OFS
1 print $0
sort -fu sort uniquely ignoring case in alphabetic order
Please try the following awk + sort solution, written and tested with the shown samples. Note that gawk's IGNORECASE=1 does not affect array subscripts, so for case-insensitive uniqueness use tolower($i) as the array key and sort -f for the ordering.
awk '
BEGIN{
FS=" ; "
}
{
for(i=1;i<=NF;i++){
if(!a[$i]++){ print $i }
}
}
' Input_file | sort
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=" ; " ##Setting field separator as space semi-colon space here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop till NF here for each line.
if(!a[$i]++){ print $i } ##Checking condition if current field is NOT present in array a then printing that field value here.
}
}
' Input_file | sort ##Mentioning Input_file name here and passing it to sort as Input to sort the data.
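The question also asks for case-insensitive uniqueness. A minimal sketch of one way to get that in any POSIX awk is below. It assumes the first spelling seen for an item is the one to keep, and the mixed-case sample rows are made up for illustration.

```shell
# Sample input, made up for illustration (mixed case added on purpose).
printf '%s\n' 'a ; b ; c ; i' 'd' 'e ; f ; A ; b' 'g ; h ; I' > list-in.dat

# Split on ';' with optional blanks around it, and keep the first spelling
# of each item, comparing items case-insensitively via tolower().
awk -F '[ \t]*;[ \t]*' '
{
    for (i = 1; i <= NF; i++) {
        key = tolower($i)
        if ($i != "" && !(key in seen)) {
            seen[key]
            print $i
        }
    }
}' list-in.dat | sort -f > list-out.dat

cat list-out.dat
```

A line without any separator (such as the lone d) is simply a record with one field, so it is handled by the same loop.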
I was trying to extract rows from a tab separated file, if it contained a certain word in the 4th column. For example, if input file test.txt is:
chr 8 1234 abc ; xyz
chr 8 1255 abc
chr 8 987 xyz
chr 8 5467 jxyzm
The following code correctly outputs only the 1st and 3rd line:
gawk -F"\t" ' { if($4 ~ /\<xyz\>/) print $0 } ' test.txt >> test.out
However, when I try to run this in a loop in a bash script, my output file is blank. The code I am using is:
while read id
do
OFILE=${ODIR}/${id}.txt
gawk -v id="$id" -F"\t" ' { if($4 ~ /\<id\>/) print $0 } ' ${IFILE} >> ${OFILE}
done < ${GFILE}
The file ${GFILE} has one word per line, e.g.:
xyz
fg45
tre2y
What am I doing wrong?
thanks!
Edited to:
Add fourth row in input file
Added -v id="$id" to command...script still doesn't work!
You can very well use awk to read search patterns from one file and find matches in other like this:
awk -F '\t' '
NR == FNR {
words[$1]
next
}
{
for (w in words)
if ($4 ~ ("(^|[^[:alnum:]_])" w "([^[:alnum:]_]|$)")) {  # whole-word match, so jxyzm is skipped
print > (w ".txt")
break
}
}' "$GFILE" "$IFILE"
Then check output:
cat xyz.txt
chr 8 1234 abc ; xyz
chr 8 987 xyz
If you really-really want to fix your shell script then here it is:
while read id; do
awk -F '\t' -v id="$id" '$4 ~ id' "$IFILE" > "$id.txt"
done < "$GFILE"
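The original loop fails because /\<id\>/ is a regex literal that looks for the literal word "id"; awk variables are not expanded inside /.../. Also note that $4 ~ id treats id as an unanchored regex, so it would match jxyzm as well. A hedged sketch of a whole-word variant follows. The file names test.txt and words.txt are placeholders for $IFILE and $GFILE, and the character class stands in for the gawk-only \< and \> operators so that it runs in any awk.

```shell
# Hypothetical sample files, named for this example only.
printf 'chr\t8\t1234\tabc ; xyz\nchr\t8\t1255\tabc\nchr\t8\t987\txyz\nchr\t8\t5467\tjxyzm\n' > test.txt
printf '%s\n' xyz fg45 tre2y > words.txt

while IFS= read -r id; do
    # Build the regex as a string from the awk variable: the word must be
    # preceded by start-of-field or a non-word character, and followed by
    # a non-word character or end-of-field.
    awk -F '\t' -v id="$id" '
        BEGIN { re = "(^|[^A-Za-z0-9_])" id "([^A-Za-z0-9_]|$)" }
        $4 ~ re
    ' test.txt > "$id.txt"
done < words.txt

cat xyz.txt
```

Each word is still interpreted as a regex here, so words containing characters such as . or * would need escaping first.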
I would like to reverse the complete text from the file.
Say if the file contains:
com.e.h/float
I want to get output as:
float/h.e.com
I have tried the command:
rev file.txt
but that reverses every character: taolf/h.e.moc
Is there a way I can get the desired output? Do let me know. Thank you.
Here is the link to the sample file: Sample Text
You can use sed and tac:
str=$(echo 'com.e.h/float' | sed -E 's/(\W+)/\n\1\n/g' | tac | tr -d '\n')
echo "$str"
float/h.e.com
Using sed we insert \n before and after all non-word characters.
Using tac we reverse the output lines.
Using tr we strip all new lines.
If you have gnu-awk then you can do all this in a single awk command using 4 argument split function call that populates split strings and delimiters separately:
awk '{
s = ""
split($0, arr, /\W+/, seps)
for (i=length(arr); i>=1; i--)
s = s seps[i] arr[i]
print s
}' file
For non-gnu awk, you can use:
awk '{
r = $0
i = 0
while (match(r, /[^a-zA-Z0-9_]+/)) {
a[++i] = substr(r, RSTART, RLENGTH) substr(r, 1, RSTART-1)
r = substr(r, RSTART+RLENGTH)
}
s = r
for (j=i; j>=1; j--)
s = s a[j]
print s
}' file
Is it possible to use Perl?
perl -nlE 'say reverse(split("([/.])",$_))' f
This one-liner reverses each line of f, according to the OP's criteria.
If you prefer a version with fewer parentheses:
perl -nlE 'say reverse split "([/.])"' f
For portability, this can be done using any awk (not just GNU) using substrings:
$ awk '{ s=""   # reset the output string for each input line
while (match($0,/[[:alnum:]]+/)) {
s=substr($0,RLENGTH+1,1) substr($0,1,RLENGTH) s;
$0=substr($0,RLENGTH+2)
} print s
}' <<<"com.e.h/float"
This steps through the string grabbing alphanumeric strings plus the following character, reversing the order of those two captured pieces, and prepending them to an output string.
Using GNU awk's split, splitting from separators . and /, define more if you wish.
$ cat program.awk
{
for(n=split($0,a,"[./]",s); n>=1; n--) # split to a and s, use n from split
printf "%s%s", a[n], (n==1?ORS:s[(n-1)]) # printf it pretty
}
Run it:
$ echo com.e.h/float | awk -f program.awk
float/h.e.com
EDIT:
If you want to run it as one-liner:
awk '{for(n=split($0,a,"[./]",s); n>=1; n--) printf "%s%s", a[n], (n==1?ORS:s[(n-1)])}' foo.txt
I am new to awk and need to find the statement to compare two fields in files below
The columns are comma-separated.
1.csv
_________
1space, aspace
2,b
space3space,c
2.csv
____________
1space,spacea
space2,bspace
3,spacecspace
The statement below works fine if there are no leading or trailing spaces in the fields of either 1.tsv or 2.tsv:
nawk -F, 'NR==FNR{a[$1,$2]++;next} !(a[$1,$2])' 2.tsv 1.tsv
Kindly let me know how we can modify the above statement to trim leading and trailing spaces and then compare. Thanks for the help.
awk -F, '
{ key=$1; gsub(/^[[:space:]]+|[[:space:]]+$/,"",key) }
NR==FNR { a[key]; next }
!(key in a)
' 2.tsv 1.tsv
Do this:
awk '
BEGIN {FS=OFS=","}
NR==FNR {
gsub(/^ *| *$/,"",$1)
a[$1]++
next
}
{
gsub(/^ *| *$/,"",$1);
if (!($1 in a)) {
print
}
}' 2.tsv 1.tsv
Code for GNU sed:
sed -r 's#\s*(\S+)\s*,\s*(\S+)\s*#/\\s*\1\\s*,\\s*\2\\s*/p#' file1|sed -f - file2
$cat file1
1 , a
2,b
3 ,c
$cat file2
1 ,a
2,b
3,c
$sed -r 's#\s*(\S+)\s*,\s*(\S+)\s*#/\\s*\1\\s*,\\s*\2\\s*/d#' file1|sed -nf - file2
You need to trim all the spaces from $1 before trying to locate it in array a:
awk -F"," 'NR==FNR{f1=$1; gsub(/ /, "", f1); a[f1]++; next} {f1=$1; gsub(/ /, "", f1);
if (!a[f1]) print}' 2.tsv 1.tsv
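The original command compared both $1 and $2, while the answers above key on $1 only. A minimal sketch that trims both fields and keeps the two-field key is below. The trim() helper and the extra x , y row are assumptions added for illustration, not part of the question's data.

```shell
# Hypothetical sample data; the "x , y" row exists only in 1.tsv.
printf '%s\n' '1 , a ' '2,b' ' 3 ,c' 'x , y' > 1.tsv
printf '%s\n' '1 ,a' ' 2,b ' '3, c ' > 2.tsv

out=$(awk -F, '
function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
{ key = trim($1) SUBSEP trim($2) }   # normalized two-field key
NR == FNR { seen[key]; next }        # first file (2.tsv): remember keys
!(key in seen)                       # second file (1.tsv): print unmatched rows
' 2.tsv 1.tsv)
echo "$out"
```

Because trim() works on a copy of each field, the unmatched rows are printed exactly as they appear in 1.tsv.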
I'm a beginner user of awk/gawk.
If I run below, the shell gives me nothing. Please help!
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk 'BEGIN{
n = split($0, arr, /,(?=\\w+=)/)
for (x=1; x<n; x++) printf "arr[%d]=%s\n", x, arr[x]
}'
.....................................................
I am trying to parse:
A=1,B=2,3,C=,D=5,6,E=7,8,9
Expected Output:
A=1
B=2,3
C=
D=5,6
E=7,8,9
I bet there's something wrong with my awk.
gawk doesn't support look-ahead.
If you want gawk to parse it as you expect, try this:
awk '{n=split(gensub(/,([A-Z])/, " \\1","g" ),arr," ");for(x=1;x<=n;x++)print arr[x]}'
test with your example:
kent$ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk '{n=split(gensub(/,([A-Z])/, " \\1","g" ),arr," ");for(x=1;x<=n;x++)print arr[x]}'
A=1
B=2,3
C=
D=5,6
E=7,8,9
This might be easier with sed:
$ echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" | sed 's/,\(\w\+=\)/\n\1/g'
A=1
B=2,3
C=
D=5,6
E=7,8,9
If you are using gnu awk, you could do:
awk '{printf "%s\n%s", $0, substr(RT, 2)}' RS=',[A-Z]'
As nhahtdh said, there is no lookahead in awk. But you can use a different separator for the assignments. Why not "A=1;B=2,3,4;C=5..."?
If your input must have that format, try flex.
You could also use comma as the record separator:
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" |
awk -v RS=, '{sep=","} /=/ {sep="\n"} NR==1 {sep=""} {printf "%s%s", sep, $0}'
outputs
A=1
B=2,3
C=
D=5,6
E=7,8,9
You have two problems. First, you don't want a BEGIN clause; you just want this to run on every input line. Second, you are trying to use regular expression features that AWK does not support.
Instead of trying to use a fancy pattern that splits the string, loop and call match() to parse out the features you want.
echo "A=1,B=2,3,C=,D=5,6,E=7,8,9"|awk '
{
line = $0
for (i = 0;;)
{
i = match(line, /([A-Z]+)=([0-9,]*)(,|$)/, arr)
if (0 == i)
break
key = arr[1]
value = arr[2]
l = length(key "=" value ",") + 1
line = substr(line, l)
printf "DEBUG: key '%s' value '%s'\n", key, value
}
}'
This prints:
DEBUG: key A value 1
DEBUG: key B value 2,3
DEBUG: key C value
DEBUG: key D value 5,6
DEBUG: key E value 7,8,9
Another way using awk (gensub requires GNU awk):
awk '{print gensub(/,([A-Z]+=)/, "\n\\1","g")}' temp.txt
Output
A=1
B=2,3
C=
D=5,6
E=7,8,9
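Since gensub, RT and the three-argument match() are GNU extensions, here is a hedged sketch of the same split that should run in any POSIX awk. It assumes keys look like identifiers (a letter or underscore followed by letters, digits or underscores).

```shell
# Insert a newline in front of every ",KEY=" and then drop the comma that
# now follows each newline; no look-ahead is needed.
out=$(echo "A=1,B=2,3,C=,D=5,6,E=7,8,9" | awk '{
    gsub(/,[A-Za-z_][A-Za-z0-9_]*=/, "\n&")   # "&" is the matched text
    gsub(/\n,/, "\n")
    print
}')
echo "$out"
```

The two-step gsub stands in for the back-reference that gensub would provide: the first call marks each boundary, the second removes the separator itself.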