This question is about code written in bash, but it's really more of a regex question. I have a file (ARyy.txt) with CSV values in it. I want to replace the second field with NaN. This is no problem at all for the simple cases (rows 1 and 2 in the example), but it's much more difficult for the cases where the first field contains commas and is therefore quoted (quotes appear exactly when commas do, and vice versa). When the first field contains commas, the quotes are always its first and last characters.
Here is what I have thus far. NOTE: please try to answer using sed and the general format shown. I know there is a way to do this with awk's FPAT, but ideally I need one using sed (or a simple use of awk).
#!/bin/bash
LN=1 #Line Number
while read -r LIN #LIN is a variable containing the line
do
echo "$LN: $LIN"
((LN++))
if [ $LN -eq 1 ]; then
continue #header line
elif [[ {$LIN:0:1} == "\"" ]]; then #if the first character in the line is quote
sed -i '${LN}s/\",/",NaN/' ARyy.txt #replace quote followed by comma with quote followed by comma followed by NaN
else #if first character doesn't start with a quote
sed -i '${LN}s/,[^,]*/,0/' ARyy.txt; fi
done < ARyy.txt
Other pertinent info:
There are never double or nested quotes or anything peculiar like this
There can be more than one comma inside the quotations
I am always replacing the second field
The second field is always just a number for the input (Never words or quotes)
Input Example:
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
Banana, 5, 10, 323
"Banana, green, 10 MG", 3, 14, 444 #Notice this line has commas in it but it has quotes to indicate this)
Desired Output:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444 #second field changed to NaN and first field remains in tact
Try this:
sed -E -i '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' ARyy.txt
Explanation: sed -E invokes "extended" regular expression syntax, so it's easier to use parenthesized groups.
2,$ = On lines 2 through the end of file...
s/ = Replace...
^ = the beginning of a line
("[^"]*"|[^",]*) = either a double-quoted string or a string that doesn't contain any double-quotes or commas
(, *) = a comma, maybe followed by some spaces
[0-9]* = a number
, = and finally a comma
/ = ...with...
\1 = the first () group (i.e. the original first field)
\2 = the second () group (i.e. comma and spaces)
NaN, = Not a number, and the following comma
/ = end of replacement
Note that if the first field could contain escaped double-quotes and/or escaped commas (not in double-quotes), the first pattern would have to be significantly more complex to deal with them.
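To try the command out without touching a file, the same substitution (minus -i) can be run over a here-doc as a quick check, printing to stdout:
sed -E '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' <<'EOF'
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
Banana, 5, 10, 323
"Banana, green, 10 MG", 3, 14, 444
EOF
which prints the desired output shown in the question.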
BTW, the original has an antipattern I see disturbingly often: reading through a file line-by-line to decide what to do with that line, then running something that processes the entire file in order to change that one line. So if you have a thousand-line file, it winds up processing the entire file a thousand times (for a total of a million lines processed). This is what's known as "quadratic scaling", because it takes time proportional to the square of the problem size. As Bruce Dawson put it,
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
Given your specific format, in particular that the first field won't ever have any escaped double quotes in it:
sed -E '2,$ s/^("[^"]*"|[^,]*),[^,]*/\1,NaN/' < input.csv > output.csv
This does require the common but non-standard -E option to use POSIX Extended Regular Expression syntax instead of the default Basic (which doesn't support alternation).
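If your sed lacks -E, one portable BRE sketch (my own variant, not part of the original answer) is to split the alternation into two substitutions and let t skip the second whenever the first has already substituted:
sed '2,$ {
s/^\("[^"]*"\),[^,]*/\1,NaN/
t
s/^\([^,]*\),[^,]*/\1,NaN/
}' < input.csv > output.csv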
One (somewhat verbose) awk idea that replaces the entire set of code posted in the question:
awk -F'"' ' # input field separator = double quotes
function print_line() { # print array
pfx=""
for (i=1; i<=4; i++) {
printf "%s%s", pfx, arr[i]
pfx=OFS
}
printf "\n"
}
FNR==1 { print ; next } # header record
NF==1 { split($0,arr,",") # no double quotes => split line on comma
arr[2]=" NaN" # override arr[2] with " NaN"
}
NF>=2 { split($3,arr,",") # first column from the file contains double quotes,
# so split awk field #3 on comma; arr[1] will
# be empty
arr[1]="\"" $2 "\"" # override arr[1] with awk field #2 (the double-
# quoted first column from the file)
arr[2]=" NaN" # override arr[2] with " NaN"
}
{ print_line() } # print our array
' ARyy.txt
For the sample input file this generates:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444
LN=1 #line number counter
while read -r LIN; do
if [ $LN -eq 1 ]; then
((LN++))
continue
elif [[ $LIN == $(echo "$LIN" | grep '"') ]]; then
word1=$(echo "$LIN" | awk -F ',' '{print $4}')
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
elif [[ $LIN == $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]') ]]; then
word2=$(echo "$LIN" | cut -f2 -d ',')
echo "$LIN" | sed -i "$LN"s/"$word2"/\ NaN/ ARyy2.txt
fi
echo "$LN: $LIN"
((LN++))
done <ARyy.txt
Make a copy of the input ARyy.txt as ARyy2.txt and use that file as the output (read from ARyy.txt, write to ARyy2.txt).
The first elif, $(echo "$LIN" | grep '"'), checks whether the line contains double quotes.
Once selected, we want to grab the number 3 with awk -F ',' '{print $4}' and save it to the variable word1. -F tells awk to separate columns each time it encounters a ,, so there are six columns in total and the number 3 is in column 4; that's why we use {print $4}.
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
Then sed selects the line number with $LN and replaces the number 3 held in $word1 with NaN, BUT we want a space before NaN, so the space is escaped: /\ NaN/.
We always use echo "$LIN" to work on the current line.
The second elif, $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]'), matches the unquoted lines; $LIN only holds one line at a time. The important thing is to check whether the line has the pattern word + comma + space + one digit.
Once selected, we grab the number 10 (the second column), this time with cut -f2 -d ',', and save it to the variable word2. -f2 selects the second column, and -d tells cut to use , to separate the columns.
Is there a way to default the backreferenced variables $1, $2 and $3 here?
start="a" hi="1" bye="2"
start="b" bye="3"
start="c" hi="4"
I am using this command to filter the values out:
perl -ne 'print if s/.*start="([^"]+).*?hi="([^"]+).*?bye="([^"]+).*/$1 $2 $3/g'
a 1 2
Is there a way to generate the result below:
a 1 2
b null 3
c 4 null
I also searched for defaulting a backreferenced variable but found no working solution on that front. E.g., in bash we use ${var:-null} to default var to the string null.
The special number variables ($1 etc.) get introduced for capture groups even if their subpatterns fail to match, as long as the capture groups are optional (otherwise the whole match fails when any one subpattern fails). Those without a match stay undef.
For example, if a pattern has three optional capture groups, like (...)?, then after the regex (or after the matching part in a substitution operator) all of $1, $2, $3 exist, some possibly undef if their subpattern didn't match (the ? still makes those formally match, via zero occurrences of that pattern).
Then test each $N and if undef replace it with a desired phrase ('null' here)
perl -wnE'/
(?: start \s*=\s* "([^"]+)"\s* )?
(?: hi \s*=\s* "([^"]+)"\s* )?
(?: bye \s*=\s* "([^"]+)"\s* )? /x;
say join " ", map { $_//"null" } ($1,$2,$3)
' file
(broken over lines and spaced-out for readability) Since each term has the same structure the pattern can be prepared far more flexibly from a list of expected words.†
For the given sample file this prints
a 1 2
b null 3
c 4 null
† This is overkill for a specific case and in a one-liner, but is useful in a more rounded script which may be used with different keyword sets, since all hard-coded input is in the definition of the input array (@w)
perl -wnE'
BEGIN {
@w = qw(start hi bye); # keywords to form a pattern with
$re = join " ",
map { q{(?:} . $_ . q{\s*=\s*"([^"]+)"\s*)?} } @w;
};
@m = /$re/x;
say join " ", map { $_//"null" } @m
' file
This prints the same for the given input file. In bash shell it can simply be copy-pasted as it stands; in other shells you may need to make it back into one line, and remove comments. (Given as a command-line program, "one"-liner, for easy testing.)
Something like:
$ perl -nE 'my %vals=();
while (m/(\w+)="([^"]+)"/g) { $vals{$1} = $2 }
printf "%s %s %s\n", $vals{start}, $vals{hi}//"null", $vals{bye}//"null"
' input.txt
a 1 2
b null 3
c 4 null
Splits up the input into individual key/value pairs, saves them in a hash table, and then prints out the values using the // operator, which returns the left hand argument if it's defined, otherwise the right hand argument.
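A quick illustration of the defined-or operator (a minimal sketch): unlike ||, it only falls through on undef, so a defined-but-false value such as 0 survives:
$ perl -E 'my $x; say $x // "null"; $x = 0; say $x // "null"'
null
0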
Variation if start, hi and bye are the only keys you can have and they always appear in that order:
$ perl -ne 'm/start="([^"]+)"(?:\s+hi="([^"]+)")?(?:\s+bye="([^"]+)")?/;
printf "%s %s %s\n", $1, $2//"null", $3//"null"' input.txt
a 1 2
b null 3
c 4 null
Uglier regular expression that makes the hi and bye parts optional matches.
This might work for you (GNU sed):
sed -E 's/^/start=null hi=null bye=null\n/ # insert a template
:a # loop name
s/(\S+=)\S+(.*\n.*)\1"(\S+)"/\3\2/ # replace lookup with value
ta # repeat till failure
s/\S+=//g # remove any template
P # print
d' file # delete debris
Insert a template and loop replacing matches with original values.
When no more matches, remove any unmatched template keys and debris from the original line.
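To see the template trick in action without a file (a quick check; GNU sed assumed, as the answer says):
printf '%s\n' 'start="a" hi="1" bye="2"' 'start="b" bye="3"' 'start="c" hi="4"' |
sed -E 's/^/start=null hi=null bye=null\n/;:a;s/(\S+=)\S+(.*\n.*)\1"(\S+)"/\3\2/;ta;s/\S+=//g;P;d'
a 1 2
b null 3
c 4 null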
Is it possible using sed/awk to match the last k occurrences of a pattern in a line?
For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):
10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9
I want to match only the commas starting from 299 until the end of the line in both cases.
Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.
There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.
Another sed alternative: replace the last 3 commas with tabs.
$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev
10, 5, "Sally went to the store, and then , 299 ABD F 10
with GNU sed, you can simply write
$ sed 's/,/\t/g5' file
10, 5, "Sally went to the store, and then , 299 ABD F 10
This replaces every comma starting from the 5th.
You can use Perl to add the missing double quote into each line:
perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output
or, to turn the commas into tabs:
perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output
-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the items from the third to the fifth from right with those elements joined by ", ", and separates the resulting array with tabs.
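For instance, applying the first one-liner to a sample line from the question:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -aF, -ne '$F[-5] .= q("); print join ",", @F'
outputs
10, 5, "Sally went to the store, and then ", 299, ABD, F, 10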
To fix the CSV, I would do this:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -lne '
@F = split /, /; # field separator is comma and space
@start = splice @F, 0, 2; # first 2 fields
@end = splice @F, -4, 4; # last 4 fields
$string = join ", ", @F; # the stuff in the middle
$string =~ s/"/""/g; # any double quotes get doubled
print join(",", @start, "\"$string\"", @end);
'
outputs
10,5,"""Sally went to the store, and then ",299,ABD,F,10
One regex that matches each of the three last commas separately would require a negative lookahead, which sed does not support.
You can use the following sed-regex to match the last three fields and the commas directly before them all at once:
,[^,]*,[^,]*,[^,]*$
$ matches the end of the line.
[^,] matches anything but ,.
Groups allow you to re-use the field values in sed:
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
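Applied to the first sample line, this replaces exactly the last three commas (tabs shown as spaces here):
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
10, 5, "Sally went to the store, and then , 299 ABD F 10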
For awk, have a look at How to print last two columns using awk.
There's probably a smarter way to do this
In case all your wanted commas are followed by a space and the unwanted commas are not, how about
sed 's/,\([^ ]\)/.\1/g'
This transforms a, b, 12,3, c into a, b, 12.3, c.
Hi, I guess this is doing the job:
echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'
Returns
d e f
But you need to ensure you have more than 3 fields.
This will do what you're asking for with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
gsub(/\t/," ")
match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
gsub(/,/,"\t",a[1])
gsub(/,/,"\t",a[4])
print a[1] a[3] a[4]
}
$ awk -f tst.awk file
10 5 "Sally went to the store, and then , 299 ABD F 10
10 6 If this is the case, and also this happened, then, 299 A F 9
but I'm not convinced what you're asking for is a good approach so YMMV.
Anyway, note the first gsub() making sure you have no tabs on the input line - that is crucial if you're converting some commas to tabs so that tabs can serve as output field separators!
Sorry for a really basic question: how do I replace a particular column in a CSV file with some string?
e.g.
id, day_model,night_model
===========================
1 , ,
2 ,2_DAY ,
3 ,3_DAY ,3_NIGHT
4 , ,
(4 rows)
I want to replace any non-empty string in columns 2 and 3 with true
and the empty ones with false, but not in rows 1 and 2 or the final row.
Output:
id, day_model,night_model
===========================
1 ,false ,false
2 ,true ,false
3 ,true ,true
4 ,false ,false
(4 rows)
What I tried is the following sample code (only trying to replace the string with "true" in column 3):
#awk -F, '$3!=""{$3="true"}' OFS=, file.csv > out.csv
But the out.csv is empty. Please give me some direction.
Many thanks!!
Since your field separator is comma, the "empty" fields may contain spaces, particularly the 2nd field. Therefore they might not equal the empty string.
I would do this:
awk -F, -v OFS=, '
# exclude the header lines and the "(N rows)" footer
NR>2 && !/^\([0-9]+ rows\)/ {
for (i=2; i<=NF; i++)
$i = ($i ~ /[^[:blank:]]/) ? "true" : "false"
}
{ print }
' file
Well, since you added sed to the tags and you have only three columns, I have a solution for your problem in four steps, because the replacement was not possible for all cases in just one go.
Since your 2nd and 3rd columns can contain blank space, I wrote four sed commands to do the replacement, one for each kind of row.
sed -E 's/^([0-9]+\s+,)\S+\s*,\S+\s*$/\1true,true/' file.txt
This will replace rows like 3 ,3_DAY ,3_NIGHT
Regex101 Demo
sed -E 's/^([0-9]+\s+,)\S+\s*,\s*$/\1true,false/' file.txt
This will replace rows like 2 ,2_DAY ,
Regex101 Demo
sed -E 's/^([0-9]+\s+,)\s*,\S+\s*$/\1false,true/' file.txt
This will replace rows like 5 , ,2_Day
Regex101 Demo
sed -E 's/^([0-9]+\s+,)\s*,\s*$/\1false,false/' file.txt
This will replace rows like 1 , ,
Regex101 Demo
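If you prefer a single invocation, the four substitutions can be chained with -e (a sketch; \s and \S are GNU sed extensions). The patterns are mutually exclusive, so at most one of them fires per line:
sed -E -e 's/^([0-9]+\s+,)\S+\s*,\S+\s*$/\1true,true/' \
    -e 's/^([0-9]+\s+,)\S+\s*,\s*$/\1true,false/' \
    -e 's/^([0-9]+\s+,)\s*,\S+\s*$/\1false,true/' \
    -e 's/^([0-9]+\s+,)\s*,\s*$/\1false,false/' file.txt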
For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use linux awk regex expressions to find the line that contains the longest repeated¹ word, so in this case, it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
¹ Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
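A quick way to check that definition with a backreference (backreferences in grep -E work with GNU grep):
printf '%s\n' abcabc abcXabc | grep -E '(.+)\1'
abcabc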
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number, to avoid matches for line numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
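To see that limitation concretely:
echo 'welwelcomewelcome' | grep -Eo '(.*)\1'
welwel
grep takes the leftmost match first and consumes it, and nothing in the remaining comewelcome repeats.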
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified to skip the sort step and combine the two awk steps and the sort step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command for the first solution, or no keeping track of multiple longest matches with awk in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file then used with grep -f.
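For example, with the second solution (a sketch; the temp file name is arbitrary):
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' > longest.tmp
grep -f longest.tmp infile
rm longest.tmp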
A way with perl:
perl -F, -ane 'if (@m=$F[1]=~/(?=(.+)\1)/g) {
@m=sort { length $b <=> length $a} @m;
$cl=length $m[0];
if ($l<$cl) { @res=($_); $l=$cl; } elsif ($l==$cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all the longest overlapping repeated strings for each position in the second field; the match array is then sorted, so the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
details:
command line options:
-F, set the field separator to ,
-ane (e: execute the following code; n: read a line at a time and put its content in $_; a: autosplit, using the defined FS, and put the fields in the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care whether the characters have already been captured in group 1 before).
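A tiny demonstration of that every-position behavior (a sketch):
perl -le 'print for "aabaab" =~ /(?=(.+)\1)/g'
aab
a
The first capture is the aab repeated from position 0 (aabaab); the second is the aa inside the trailing aab, which a consuming match would have skipped.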