I have a file with a bunch of CSV lines with values with and without quotes, like so:
"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD",
The fifth column is a text column which has escaped or unescaped double quotes. I am trying to get rid of all the quotes in this column so it looks like this:
"123","456",,17,"hello, how are you this, fine, highly caffeinated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
Any ideas how to achieve this using sed or awk, or any other Unix tools? Much appreciated!
With awk, you can do something like this that avoids a very complex regex. The fact that only the fifth column is broken, that the previous columns do not contain commas, and that we know there is a fixed number of columns makes it easy to repair:
Edited to use gsub for portability, as suggested by Ed Morton.
awk '
BEGIN { FS = OFS = "," }
{
    # glue the pieces of the broken fifth column back together
    for (i = 6; i <= NF-3; i++) {
        $5 = $5 FS $i
    }
    # strip every double quote from the rebuilt field
    gsub(/"/, "", $5)
    # print the record, re-quoting the fifth column
    print $1, $2, $3, $4, "\"" $5 "\"", $(NF-2), $(NF-1), $NF
}
' <file>
Output:
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
If you want to escape the quotes instead, you can use this:
awk '
BEGIN { FS = OFS = "," }
{
    # glue the pieces of the broken fifth column back together
    for (i = 6; i <= NF-3; i++) {
        $5 = $5 FS $i
    }
    gsub(/^"|"$/, "", $5)    # strip the outer quotes
    gsub(/"/, "\\\"", $5)    # backslash-escape the inner quotes
    $5 = "\"" $5 "\""        # re-add the outer quotes
    print $1, $2, $3, $4, $5, $(NF-2), $(NF-1), $NF
}
' <file>
Output:
"123","456",,17,"hello,\" how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a \"meeting\", unprepared while trying to be \"awake","2018-05-29T18:58:10-05:00","ACD",
Your question is very difficult to answer in a generic way. To give an example:
"a","b","c","d"
How is this interpreted (if we remove the quotes from the fields of interest):
"a","b","c","d" (4 fields)
"a,b","c","d" (3 fields, $1 messed up)
"a","b,c","d" (3 fields, $2 messed up)
"a","b","c,d" (3 fields, $3 messed up)
"a,b,c","d" (2 fields, $1 messed up)
"a,b","c,d" (2 fields, $1 and $2 messed up)
"a","b,c,d" (2 fields, $2 messed up)
"a,b,c,d" (1 field , $1 messed up)
The only way this can be solved is by having the following knowledge:
How many fields does my CSV have
There is at most one field messed up
We know which field is messed up
The following awk program will help you fix it:
$ awk 'BEGIN{ere="[^,]*|\042[^\042]*\042"}
{ head=tail=""; mid=$0 }
# extract the head which is correct
(n>1) {
  ere_h="^"
  for(i=1;i<n;++i) ere_h = ere_h (ere_h=="^" ? "" : ",") "(" ere ")"
  match(mid,ere_h); head=substr(mid,RSTART,RLENGTH)
  mid = substr(mid,RLENGTH+1)
}
# extract the tail which is correct
(nf>n) {
  ere_t="$"
  for(i=n+1;i<=nf;++i) ere_t = "(" ere ")" (ere_t=="$" ? "" : ",") ere_t
  match(mid,ere_t); tail=substr(mid,RSTART,RLENGTH)
  mid = substr(mid,1,RSTART-1)
}
# correct the mid part
{ gsub(/\042/,"",mid)
  mid = (mid ~ /^,/) ? ( ",\042" substr(mid,2) ) : ( "\042" mid )
  mid = (mid ~ /,$/) ? ( substr(mid,1,length(mid)-1) "\042," ) : ( mid "\042" )
}
# print the stuff
{ print head mid tail }' n=5 nf=8 file
With GNU awk for the 3rd arg to match() and assuming you know how many fields there should be in each line:
$ cat tst.awk
BEGIN {
numFlds = 8
badFldNr = 5
}
match($0,"^(([^,]*,){"badFldNr-1"})(.*)((,[^,]*){"numFlds-badFldNr"})",a) {
gsub(/"/,"",a[3])
print a[1] "\"" a[3] "\"" a[4]
}
$ awk -f tst.awk file
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ",
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",
With other awks you can do the same with a couple of calls to match() and variables instead of the array.
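For example, a minimal POSIX-awk sketch of that two-match() approach (untested; same assumed field counts as above):
awk '
BEGIN { numFlds = 8; badFldNr = 5 }
{
    # match the good leading fields: (badFldNr-1) fields, each ending in a comma
    match($0, "^([^,]*,){" badFldNr-1 "}")
    head = substr($0, 1, RLENGTH)
    rest = substr($0, RLENGTH+1)
    # match the good trailing fields: (numFlds-badFldNr) commas plus fields
    match(rest, "(,[^,]*){" numFlds-badFldNr "}$")
    mid  = substr(rest, 1, RSTART-1)
    tail = substr(rest, RSTART)
    gsub(/"/, "", mid)               # strip all quotes from the broken field
    print head "\"" mid "\"" tail    # re-quote it and reassemble the line
}' file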
Try this regex:
,\d{2}\,(.*),\"\S{25}\",\"\w{3}"
It was made based on your examples. The goal is just to capture the fifth column. As @Jerry Jeremiah suggested, the point is to use the date, which will always be 25 characters long. To prevent mismatches I've also taken into account the 2 digits present before the fifth column and the 3 letters/digits after the date.
Regex101v1
We can also use a "stronger" regex by looking for an exact date match:
,\d{2}\,(.*),\"\d{4}-\d{2}-\d{2}\w\d{2}:\d{2}:\d{2}-\d{2}:\d{2}\",\"\w{3}"
Regex101v2
With these regexes you'll be able to extract the fifth column using a capture group. To go further with your question, you can do this in bash:
regex='^(.*,[0-9]{2}\,")(.*)(",\"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}-[0-9]{2}:[0-9]{2}\",\"[a-zA-Z]{3}".*$)'
while IFS= read -r line
do
    if [[ $line =~ $regex ]]
    then
        before=${BASH_REMATCH[1]}
        fifth=${BASH_REMATCH[2]}
        after=${BASH_REMATCH[3]}
        reworked_fifth="${fifth//\"}"
        echo "${before}${reworked_fifth}${after}"
    else
        echo "Line didn't match the regex"
    fi
done < /my/file/path
I had to change the regex since my bash doesn't accept \d and \w. No need to sed or awk anything with this; bash can handle it alone.
I have an input file with multiple paragraphs separated by at least two newlines (\n\n), and I want to extract fields from lines within certain paragraphs. I think the processing will be simplest if I can get gensub to work as I'm hoping. Consider the following input file:
[Record R1]
Var1=0
Var2=20
Var3=5
[Record R2]
Var1=10
Var3=9
Var4=/var/tmp/
Var2=12
[Record R3]
Var1=2
Var3=5
Var5=19
I want to print only the value of Var2 from records R1 and R3 (where, in R3, Var2 doesn't actually exist). I can easily group all of the variables into their corresponding record by setting RS="\n\n"; then they are all contained within $0. But since I don't know where Var2 will appear in the list ahead of time, I want to use something like gensub to extract it. This is what I have going:
awk '
BEGIN {
RS="\n\n"
}
/Record R1/ || /Record R3/ {
print gensub(/[\n.]*Var2=(.*)[\n.]*/, "\\1", "g", $0)
}
' /tmp/input.txt
But instead of only printing 20 (the value of Var2 from R1), it prints the following:
[Record R1]
Var1=0
20
Var3=5
[Record R3]
Var1=2
Var3=5
Var5=19
The intent is that the regex in the gensub command would capture all characters (newlines: \n; and non-newlines: .) before and after Var2=XX and replace everything with XX. But instead, it's only capturing the characters on the same line as Var2=XX. Can awk's gensub do this kind of multi-line substitution?
I know an alternative would be to loop over all the fields in the record, then split the field that matches Var2= on the = sign, but that feels less efficient as I scale this out to multiple variables.
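For reference, a minimal sketch of that loop-and-split alternative (my own illustration, not from the original post):
awk '
BEGIN { RS = "" }                  # paragraph mode: one record per block
/Record R[13]/ {
    for (i = 1; i <= NF; i++) {
        n = split($i, kv, "=")     # split each field on the = sign
        if (n == 2 && kv[1] == "Var2") print kv[2]
    }
}' /tmp/input.txt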
I don't understand what it is you're trying to do with gensub(), but to do what you seem to want in any awk is:
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[12]$/) print f["Var2"]; delete f}' file
20
12
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[13]$/) print f["Var2"]; delete f}' file
20
gensub() doesn't care if the string it's operating on is one line or many lines btw - \n is just one more character, no different from any other character.
Oh, hang on, now I see what you're thinking with that gensub() - your problems are:
[\n.]* means zero or more newlines or periods, but you don't have any periods in your input so it's the same as \n*, and you don't have any newlines immediately before a Var2.
Var2 doesn't exist in your 2nd record so the regexp can't match it.
The (.*) will match everything to the end of the record (leftmost longest matches).
The "g" is misleading since you only expect 1 match.
So using gensub() on multi-line text isn't an issue; your regexp is just wrong.
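For completeness, a gensub() call that does work on this input might look like this (a sketch; it relies on the fact that . matches newlines in awk regexps):
awk '
BEGIN { RS = "" }               # one record per blank-line-separated block
/Record R[13]/ && /Var2=/ {     # skip records with no Var2 at all
    print gensub(/.*Var2=([^\n]*).*/, "\\1", 1)
}' /tmp/input.txt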
Another awk:
$ awk -v RS= '/\[Record R[13]\]/{for(i=2;i<=NF;i++)
{v=sub(/ *Var2=/,"",$i);
if(v) print $i}}' file
20
Is it possible using sed/awk to match the last k occurrences of a pattern in a line?
For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):
10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9
I want to match only the commas starting from 299 until the end of the line in both cases.
Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.
There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.
Another sed alternative: replace the last 3 commas with tabs.
$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev
10, 5, "Sally went to the store, and then , 299 ABD F 10
with GNU sed, you can simply write
$ sed 's/,/\t/g5' file
10, 5, "Sally went to the store, and then , 299 ABD F 10
This replaces every comma starting from the 5th one.
You can use Perl to add the missing double quote into each line:
perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output
or, to turn the commas into tabs:
perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output
-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the items from the third to the fifth from right with those elements joined by ", ", and separates the resulting array with tabs.
To fix the CSV, I would do this:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -lne '
@F = split /, /;             # field separator is comma and space
@start = splice @F, 0, 2;    # first 2 fields
@end = splice @F, -4, 4;     # last 4 fields
$string = join ", ", @F;     # the stuff in the middle
$string =~ s/"/""/g;         # any double quotes get doubled
print join(",", @start, "\"$string\"", @end);
'
outputs
10,5,"""Sally went to the store, and then ",299,ABD,F,10
One regex that matches each of the last three commas separately would require a negative lookahead, which sed does not support.
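For comparison, in a tool that does support negative lookahead (perl, for instance), such a pattern could look like this (a sketch):
perl -pe 's/,(?!(?:[^,]*,){3})/\t/g' file
A comma is replaced only if it is not followed by three more commas, which singles out exactly the last three.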
You can use the following sed-regex to match the last three fields and the commas directly before them all at once:
,[^,]*,[^,]*,[^,]*$
$ matches the end of the line.
[^,] matches anything but ,.
Groups allow you to re-use the field values in sed:
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
For awk, have a look at How to print last two columns using awk.
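And an awk sketch in the same spirit, scanning backwards and turning the last three commas into tabs (my own illustration):
awk '{
    n = 0
    # walk from the end of the line, replacing commas until 3 are done
    for (i = length($0); i > 0 && n < 3; i--)
        if (substr($0, i, 1) == ",") {
            $0 = substr($0, 1, i-1) "\t" substr($0, i+1)
            n++
        }
    print
}' file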
There's probably a smarter way to do this
In case all your wanted commas are followed by a space and the unwanted commas are not, how about
sed 's/,\([^ ]\)/.\1/g'
This transforms a, b, 12,3, c into a, b, 12.3, c.
Hi, I guess this does the job:
echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'
Returns
d e f
But you need to ensure you have at least 3 fields.
This will do what you're asking for with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
gsub(/\t/," ")
match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
gsub(/,/,"\t",a[1])
gsub(/,/,"\t",a[4])
print a[1] a[3] a[4]
}
$ awk -f tst.awk file
10 5 "Sally went to the store, and then , 299 ABD F 10
10 6 If this is the case, and also this happened, then, 299 A F 9
but I'm not convinced what you're asking for is a good approach so YMMV.
Anyway, note the first gsub() making sure you have no tabs on the input line - that is crucial if you want to convert some commas to tabs and then use tabs as output field separators!
I have a large document that I needed to put anchors in, so I appended a number to the end of each line. The format was " Area 1". This list goes on for hundreds of entries.
I tried to awk out the slice I wanted with the anchor but this is what I get.
cat file | awk '/Area 5/{print $0}'
Area 5
Area 50
Area 51
Area 52
Area 53
Area 54
Area 55
Area 56
Area 57
Area 58
Area 59
As you can see, I wanted just "Area 5" but the regex engine matched both 5 and 5x. Yes, I know it is being greedy. I tried to limit that behavior with:
/Area 5{1}/
and I still had this problem. I also tried {0} and {0,1} to no effect.
Question 1: What can I do to force awk (and grep as well) to limit it to the requested number?
Question 2: I used awk '/pattern/ { $0=$0 "" ++i }1' to append the number. It leaves "Area 1"; I would like it to be "Area1". Any ideas?
Thanks for the help.
B
To avoid matching prefixes like '5x', you can use a word boundary.
(Explanation)
In GNU awk, word boundaries are matched using \y.
To eliminate the space, I simply match 'Area' and the number '5' as separate fields and then print them without a space.
In my tests, the following worked:
cat test.txt | awk '/Area 5\y/{print $1 $2}'
Output
Area5
/Area 5([^0-9]|$)/ would account for end of line, as well as anything but a digit.
But a more awk way of doing things, would be:
awk '/^Area/ && $2==5' file
If the '5' is the end of the line, you can use /Area 5$/. The $ matches end-of-line.
If it's followed by further text, /Area 5[^0-9]/ should work. The [^0-9] matches one character that is anything except a digit.
Good luck!
Some proposals.
awk '$2==5' file
Area 5
awk '$2 ~ /^[5]$/' file
Area 5
I have a huge file and, in the output, some columns don't have a value; I need to fill these columns with 0 for further analysis. The columns can be separated with a space or a tab; below, they are separated with tabs.
This is really a job for a CSV parser, but if it has to be a regex, and you never have tabs within quoted CSV entries, you could search for
(^|\t)(?=\t|$)
and replace with
${1}0 - that is, backreference 1 followed by a literal 0 (braces keep the 0 from being read as part of the group number).
So, in Perl:
($ResultString = $subject) =~
s/( # Match either...
^ # the start of the line (preferably)
| # or
\t # a tab character
) # remember the match in backreference no. 1
(?= # Then assert that the next character is either
\t # a(nother) tab character
| # or
$ # the end of the line
) # End of lookahead assertion
/${1}0/xg;
This will transform
1 2 4 7 8
2 3 5 6 7
into
1 2 0 4 0 0 7 8
0 2 3 0 5 6 7 0
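The same substitution as a command-line one-liner (a sketch):
perl -pe 's/(^|\t)(?=\t|$)/${1}0/g' file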
For a tab-separated file, this AWK snippet does the trick:
BEGIN { FS = "\t"; OFS = "\t" }    # read and write tab-separated fields
{
    for (i = 1; i <= NF; i++) {
        if (!$i) { $i = 0 }        # empty fields (and bare zeros) test false; set them to 0
    }
    print $0
}
Here's a sed solution. Note that some versions of sed don't like \t.
sed 's/^\t/0\t/;:a;s/\t\t/\t0\t/g;ta;s/\t$/\t0/' inputfile
or
sed -e 's/^\t/0\t/' -e ':a' -e 's/\t\t/\t0\t/g' -e 'ta' -e 's/\t$/\t0/' inputfile
Explanation:
s/^\t/0\t/ # insert a zero before a tab that begins a line
:a # top of the loop
s/\t\t/\t0\t/g # insert a zero between a pair of tabs
ta # if a substitution was made, branch to the top of the loop
s/\t$/\t0/ # insert a zero after a tab that ends a line
Deleting my answer after re-reading the original post. There are no tabs as data, just delimiters. If there is no data, a double delimiter will appear to align the columns.
It can't be any other way. So if a single delimiter is there, it will separate two empty fields. "" = 1 empty field, "\t" = 2 empty fields. I got it now.
Tim Pietzcker had the correct answer all along. +1 for him.
It could be written alternatively as s/ (?:^|(?<=\t)) (?=\t|$) /0/xg;, but it's the same thing.
If and only if your data contains only numbers and you have a clearly defined field separator FS, you can use the following trick:
awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=NF;++i) $i+=0}1' file
By adding zero, we convert strings to numbers. Empty strings will be converted to the number zero. You can define your field separator to anything you like.
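A quick demonstration of the trick:
$ printf '1\t\t3\n\t5\t\n' | awk 'BEGIN{FS=OFS="\t"}{for(i=1;i<=NF;++i) $i+=0}1'
1 0 3
0 5 0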
This, however, might be a bit slow since it will reparse $0 and split it into fields every time you reassign a field $i.
A faster way is Dennis Williamson's solution.