I have a data set in this format:
ID date delimited-characters
Here is a sample file, data.txt:
004 06/23/1962 AAA-BBB-CCC-DDD
023 11/22/1963 AAA-BBB-CCC-DDD
070 06/23/1963 AAA-BBB-CCC-DDD
My gawk script works fine like this:
gawk 'BEGIN { BLANK = " " } { print $2 BLANK $3 }' data.txt
and I get just the date and the delimited characters, which is what I want:
06/23/1962 AAA-BBB-CCC-DDD
11/22/1963 AAA-BBB-CCC-DDD
06/23/1963 AAA-BBB-CCC-DDD
But my problem is that I don't know how to substitute the dashes with blank spaces. Here are my failed attempts:
gawk 'BEGIN { BLANK = " " } { print $3 BLANK $2 } data.txt
gawk 'BEGIN { BLANK = " " } { b=$3 gsub(/-/, " ") print} {print nb BLANK $2 }' data.txt
gawk { BLANK = " " } {print nb BLANK $2; gsub(/-/, " "); print }
gawk 'BEGIN { BLANK = " " RESULT=$3} {print gsub(/-/, " ", RESULT)} { print $3 BLANK $2 }' data.txt
try this:
awk '{gsub(/-/," ",$3);print $2,$3}' file
with your input example, the line above outputs:
06/23/1962 AAA BBB CCC DDD
11/22/1963 AAA BBB CCC DDD
06/23/1963 AAA BBB CCC DDD
P.S. I just found that we have same username! ^_^
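For reference, the key here is gsub's optional third argument, which limits the substitution to a single field; without it, gsub operates on the whole record ($0). A minimal sketch using the sample data:

```shell
# gsub(regex, replacement, target) with a third argument modifies only
# that field; here only $3 has its dashes replaced, $2 is untouched.
printf '004 06/23/1962 AAA-BBB-CCC-DDD\n' |
  awk '{ gsub(/-/, " ", $3); print $2, $3 }'
# prints: 06/23/1962 AAA BBB CCC DDD
```

Printing with a comma (print $2, $3) inserts the output field separator OFS, which defaults to a single space, so the explicit BLANK variable is unnecessary.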
Related
I have data like this (file is called list-in.dat)
a ; b ; c ; i
d
e ; f ; a ; b
g ; h ; i
and I want a list like this (output file list-out.dat) with all items in alphabetical order (case-insensitive) and each unique item only once.
a
b
c
d
e
f
g
h
i
My attempt is:
awk -F " ; " ' BEGIN { OFS="\n" ; } {for(i=0; i<=NF; i++) print $i} ' file-in.dat | uniq | sort -uf > file-out.dat
But I end up with all entries except those from lines that have only one item:
a
b
c
e
f
g
h
i
How can I get all (unique, sorted) items no matter how many items are in one line / if the field separator is missing?
Using gnu-awk:
awk -F '[[:blank:]]*;[[:blank:]]*' '{
for (i=1; i<=NF; i++) uniq[$i]
}
END {
PROCINFO["sorted_in"]="#ind_str_asc"
for (i in uniq)
print i
}' file
a
b
c
d
e
f
g
h
i
For non-gnu awk use:
awk -F '[[:blank:]]*;[[:blank:]]*' '{for (i=1; i<=NF; i++) uniq[$i]}
END{for (i in uniq) print i}' file | sort
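A note on the seemingly empty loop body: merely referencing uniq[$i] creates the key in the array, so the array behaves as a set of seen items. A portable sketch of the same idea:

```shell
# Referencing seen[$i] creates the key with an empty value, so the
# array acts as a set; duplicate items collapse into a single key.
printf 'x ; y\nx ; z\n' |
  awk -F'[[:blank:]]*;[[:blank:]]*' '{ for (i = 1; i <= NF; i++) seen[$i] }
       END { for (k in seen) print k }' | sort
# prints: x, y and z, one per line
```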
awk -F' ; ' -v OFS='\n' '{$1=$1} 1' ip.txt | sort -fu
-F' ; ' sets space followed by ; followed by space as field separator
-v OFS='\n' sets newline as output field separator
{$1=$1} change $0 as per new OFS
1 print $0
sort -fu sort uniquely ignoring case in alphabetic order
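The {$1=$1} step deserves a closer look: assigning to any field (even to itself) makes awk rebuild $0 using the current OFS, which is what turns the " ; " separators into newlines. A minimal sketch:

```shell
# Assigning to a field forces awk to reassemble $0 with OFS,
# so each " ; "-separated item lands on its own line.
printf 'a ; b ; c\n' | awk -F' ; ' -v OFS='\n' '{ $1 = $1 } 1'
# prints a, b and c on separate lines
```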
You could try the following awk + sort solution, written and tested with the shown samples. If you want to ignore case, add IGNORECASE=1 (a GNU awk feature) to the awk code.
awk '
BEGIN{
FS=" ; "
}
{
for(i=1;i<=NF;i++){
if(!a[$i]++){ print $i }
}
}
' Input_file | sort
Explanation: a detailed breakdown of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=" ; " ##Setting field separator as space semi-colon space here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop till NF here for each line.
if(!a[$i]++){ print $i } ##Checking condition if current field is NOT present in array a then printing that field value here.
}
}
' Input_file | sort ##Mentioning Input_file name here and passing it to sort as Input to sort the data.
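The !a[$i]++ test is a common awk idiom worth calling out: the post-increment returns the old count, so the expression is true only the first time a value is seen. A minimal sketch of the idiom on whole lines:

```shell
# !seen[$0]++ is true only on the first occurrence of a line:
# the post-increment yields 0 (false) first, then 1, 2, ... afterwards.
printf 'a\nb\na\n' | awk '!seen[$0]++'
# prints: a then b (the second "a" is suppressed)
```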
I have the following file, extract_info.txt:
ABC
PNG
CHNS
and to_extractfrom.txt from which I need to retrieve information:
ABC 123 234 TCHSL
NBV 234 23764 DHG
CHNS 123 347 CGJKS
CVS 233 4747 JSHGD
PNG 122 324 HGH
SJDH 373 3487 JHG
and I am running the following code:
while read line
do
gene=$(echo $line | awk -F' ' '{print $1}')
app1=$(awk -v comp1="$gene" '(comp1==$1) {print $1 }' to_extractfrom.txt)
done < extract_info.txt
However, my desired output is to extract information for each ID in extract_info.txt from to_extractfrom.txt, with the first column of the next line on the left and the first column of the previous line on the right of each matched line. For the IDs in the first file, the output would be:
NBV ABC -
SJDH PNG CVS
CVS CHNS NBV
awk '
BEGIN {prev = "-"}
NR == FNR {extract[$1] = 1; next}
is_match {print $1, m1, m2; is_match = 0}
$1 in extract {is_match = 1; m1 = $1; m2 = prev}
{prev = $1}
' extract_info.txt to_extractfrom.txt
NBV ABC -
CVS CHNS NBV
SJDH PNG CVS
If you must have the output in the same order as the extract_info file, and you use GNU awk, you can do
gawk '
BEGIN {prev = "-"}
NR == FNR {extract[$1] = FNR; next}
is_match {output[m1] = $1 FS m1 FS m2; is_match = 0}
$1 in extract {is_match = 1; m1 = $1; m2 = prev}
{prev = $1}
END {
PROCINFO["sorted_in"] = "#val_num_asc"
for (key in extract) print output[key]
}
' extract_info.txt to_extractfrom.txt
NBV ABC -
SJDH PNG CVS
CVS CHNS NBV
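Both versions rely on the NR == FNR idiom for processing two files: FNR resets at the start of each file while NR keeps counting, so the test is true only while reading the first file. A minimal sketch using hypothetical temp files:

```shell
# NR==FNR holds only for the first file (FNR resets per file), so keys
# read from the first file can filter the lines of the second.
printf 'ABC\nPNG\n' > keys.tmp
printf 'ABC 1\nXYZ 2\nPNG 3\n' > data.tmp
awk 'NR == FNR { keys[$1]; next } $1 in keys' keys.tmp data.tmp
rm -f keys.tmp data.tmp
# prints: "ABC 1" and "PNG 3"
```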
Background
Homopolymers are sub-sequences of DNA made of consecutive identical bases, like AAAAAAA. An example in Python to extract them:
import re
DNA = "ACCCGGGTTTAACCGGACCCAA"
homopolymers = re.findall('A+|T+|C+|G+', DNA)
print homopolymers
['A', 'CCC', 'GGG', 'TTT', 'AA', 'CC', 'GG', 'A', 'CCC', 'AA']
My effort
I made a gawk script that solves the problem, but without using regular expressions:
echo "ACCCGGGTTTAACCGGACCCAA" | gawk '
BEGIN{
FS=""
}
{
homopolymer = $1;
base = $1;
for(i=2; i<=NF; i++){
if($i == base){
homopolymer = homopolymer""base;
}else{
print homopolymer;
homopolymer = $i;
base = $i;
}
}
print homopolymer;
}'
output
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Question
How can I use regular expressions in awk or sed to get the same result?
grep -o will get you that in one line:
echo "ACCCGGGTTTAACCGGACCCAA"| grep -ioE '([A-Z])\1*'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
Explanation:
([A-Z]) # matches and captures a letter in matched group #1
\1* # matches 0 or more of captured group #1 using back-reference \1
sed is not the best tool for this but since OP has asked for it:
echo "ACCCGGGTTTAACCGGACCCAA" | sed -r 's/([A-Z])\1*/&\n/g'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
PS: This is gnu-sed.
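Both the grep and sed commands hinge on the back-reference \1, which re-matches exactly the text captured by group 1, so every match is a maximal run of a single letter. Note that back-references in extended regular expressions are a GNU extension, not POSIX ERE. A quick check on a different input:

```shell
# ([A-Z]) captures one letter; \1* then matches repeats of that same
# letter, so grep -o emits one maximal run per line (GNU grep).
printf 'AABBBC\n' | grep -oE '([A-Z])\1*'
# prints: AA, BBB, C on separate lines
```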
Try using split and just comparing.
echo "ACCCGGGTTTAACCGGACCCAA" | awk '{ split($0, chars, "")
for (i=1; i <= length($0); i++) {
if (chars[i]!=chars[i+1])
{
printf("%s\n", chars[i])
}
else
{
printf("%s", chars[i])
}
}
}'
A
CCC
GGG
TTT
AA
CC
GG
A
CCC
AA
EXPLANATION
The split call divides the one-line string sent to awk, placing each character into the array chars[]. We then walk the array and compare each char with the next one: if they differ (if (chars[i]!=chars[i+1])), we print the char followed by \n, a newline, ending the run; if they are equal, we print the char without a newline and continue the run.
I am using the sed command and I want to parse the following text:
Mr. XYZ Mr. ABC, PQR
Ward-2, abc vs. MG Road, Pune,
Pune Dist.,
(Appellant) (Respondent)
Now I want to parse the above text, separating the appellant part from the respondent part.
That is, I want the following output:
Mr. XYZ Ward-2, abc (Appellant) as one output, and Mr. ABC, PQR MG Road, Pune, Pune Dist., (Respondent) as the other, using the sed command.
I used the following regex but am not getting the proper output:
sed -n '/assessment year/I{ :loop; n; /Respondent/Iq; p; b loop}' abc.txt
sed is always the wrong tool for any job that involves looking at multiple lines. Just use awk, it's what it was invented for. Here's a solution using GNU awk for a couple of extensions:
$ cat tst.awk
BEGIN { FIELDWIDTHS="30 7 99" }
{
for (i=1;i<=NF;i++) {
gsub(/^\s*|\s*$/,"",$i)
if ($i != "") {
rec[i] = (rec[i]=="" ? "" : rec[i] " ") $i
}
}
}
/^\(/ {
print rec[1]
print rec[3]
delete rec
}
$
$ awk -f tst.awk file
Mr. XYZ Ward-2, abc (Appellant)
Mr. ABC, PQR MG Road, Pune, Pune Dist., (Respondent)
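The gsub(/^\s*|\s*$/,"",$i) call above trims whitespace from each field; \s is a GNU awk shorthand, and a portable equivalent uses the POSIX [[:space:]] class. A minimal sketch of the trim on its own:

```shell
# Trim leading and trailing whitespace; [[:space:]] is the portable
# POSIX class equivalent to GNU awk's \s shorthand.
printf '   hello world   \n' |
  awk '{ gsub(/^[[:space:]]+|[[:space:]]+$/, ""); print "[" $0 "]" }'
# prints: [hello world]
```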
I achieved this in the following way, using Ruby:
appellant_respondent = %x(sed -n '/assessment year/I{ :loop; n; /respondent/Iq; p; b loop}' #{@file_name}).split("\n")
appellant_name_array = []
respondent_name_array = []
appellant_respondent.delete("")
appellant_respondent.each do |names|
names_array = names.split(/\s+\s+/)
appellant_name_array << names_array.first if names_array.first != ""
respondent_name_array << names_array.last if names_array.last != ""
end
@item[:appellant] = appellant_name_array.join(' ').gsub(/\s+vs\.*\s+/i, ' ').strip
@item[:respondent] = respondent_name_array.join(' ').gsub(/\s+vs\.*\s+/i, ' ').strip
What is a fast and succinct way to match lines from a text file that share the same first field?
Sample input:
a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output:
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Desired output, alternative:
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
I can imagine many ways to write this, but I suspect there's a smart way to do it, e.g., with sed, awk, etc. My source file is approx 0.5 GB.
There are some related questions here, e.g., "awk | merge line on the basis of field matching", but that other question loads too much content into memory. I need a streaming method.
Here's a method where you only have to remember the previous line (it therefore requires the input file to be sorted):
awk -F \| '
$1 == prev_key {print prev_line; matches ++}
$1 != prev_key {
if (matches) print prev_line
matches = 0
prev_key = $1
}
{prev_line = $0}
END { if (matches) print $0 }
' filename
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
Alternate output
awk -F \| '
$1 == prev_key {
if (matches == 0) printf "%s", $1
printf "%s%s", FS, prev_value
matches ++
}
$1 != prev_key {
if (matches) printf "%s%s\n", FS, prev_value
matches = 0
prev_key = $1
}
{prev_value = $2}
END {if (matches) printf "%s%s\n", FS, $2}
' filename
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
For fixed-width fields you can use uniq:
$ uniq -Dw 1 file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
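For context on the flags (both GNU extensions): -D prints all members of every duplicated group, and -w N compares only the first N characters, which here makes the one-character key the comparison field. A minimal sketch:

```shell
# uniq -D prints every line of each duplicated group; -w 1 compares
# only the first character, so adjacent lines sharing a key group up.
printf 'a|x\nb|y\nb|z\n' | uniq -D -w 1
# prints: b|y and b|z
```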
If you don't have fixed-width fields, here are two awk solutions:
awk -F'|' '{a[$1]++;b[$1]=(b[$1])?b[$1]RS$0:$0}END{for(k in a)if(a[k]>1)print b[k]}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit
Using awk:
awk -F '|' '!($1 in a){a[$1]=$2; next} $1 in a{b[$1]=b[$1] FS a[$1] FS $2}
END{for(i in b) print i b[i]}' file
d|amet|consectetur
e|adipisicing|elit
b|ipsum|dolor
This might work for you (GNU sed):
sed -r ':a;$!N;s/^(([^|]*\|).*)\n\2/\1|/;ta;/^([^\n|]*\|){2,}/P;D' file
This reads two lines into the pattern space, then checks whether the keys in both lines are the same. If so, it removes the second key and repeats. If not, it checks whether more than two fields exist in the first line; if so, it prints that line and then deletes it, otherwise it just deletes the first line.
$ awk -F'|' '$1 == prev {rec = rec RS $0; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum
b|dolor
d|amet
d|consectetur
e|adipisicing
e|elit
$ awk -F'|' '$1 == prev {rec = rec FS $2; size++; next} {if (size>1) print rec; rec=$0; size=1} {prev = $1} END{if (size>1) print rec}' file
b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit