Matching the last K occurrences of a pattern in a line - regex

Is it possible using sed/awk to match the last k occurrences of a pattern in a line?
For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):
10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9
I want to match only the commas starting from 299 until the end of the line in both cases.
Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.
There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.

Another sed alternative: replace the last 3 commas with tabs.
$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev
10, 5, "Sally went to the store, and then , 299 ABD F 10
with GNU sed, you can simply write
$ sed 's/,/\t/g5' file
10, 5, "Sally went to the store, and then , 299 ABD F 10
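This replaces every comma from the 5th onward, so it relies on exactly four commas preceding the last three (true for the first sample line, not for the second).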
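A portable variant of the same idea, for when rev or the GNU-only g5 flag is not an option: count the commas first, then turn only the last K into tabs. This is a minimal sketch in POSIX awk; the function name last_k_commas_to_tabs is made up for illustration, and it assumes the input contains no tabs already.

```shell
# Hypothetical helper: replace the last K commas on each input line with tabs.
# Usage: last_k_commas_to_tabs K < file
last_k_commas_to_tabs() {
    awk -v k="$1" '{
        n = gsub(/,/, ",")            # gsub returns the comma count; the line is unchanged
        rest = $0; out = ""; seen = 0
        while ((p = index(rest, ",")) > 0) {
            seen++
            # commas numbered n-k+1 .. n are the last k ones
            out = out substr(rest, 1, p - 1) (seen > n - k ? "\t" : ",")
            rest = substr(rest, p + 1)
        }
        print out rest
    }'
}

printf '%s\n' '10, 6, If this is the case, and also this happened, then, 299, A, F, 9' |
    last_k_commas_to_tabs 3
```

Because the count is taken per line, this works no matter how many stray commas the free-text field contains.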

You can use Perl to add the missing double quote into each line:
perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output
or, to turn the commas into tabs:
perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output
-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the items from the third to the fifth from right with those elements joined by ", ", and separates the resulting array with tabs.

To fix the CSV, I would do this:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -lne '
  @F = split /, /;             # field separator is comma and space
  @start = splice @F, 0, 2;    # first 2 fields
  @end = splice @F, -4, 4;     # last 4 fields
  $string = join ", ", @F;     # the stuff in the middle
  $string =~ s/"/""/g;         # any double quotes get doubled
  print join(",", @start, "\"$string\"", @end);
'
outputs
10,5,"""Sally went to the store, and then ",299,ABD,F,10
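The same "keep the first fields, keep the last fields, quote whatever is left in the middle" idea can be sketched in POSIX awk. The function name fix_middle_field and its two parameters (number of leading and trailing good fields) are assumptions for illustration, and the sketch assumes ", " is the delimiter, as in the sample.

```shell
# Hypothetical helper: rebuild a CSV line whose middle field contains stray commas.
# Usage: fix_middle_field HEAD TAIL < file
#   HEAD = number of clean fields at the start, TAIL = number of clean fields at the end
fix_middle_field() {
    awk -v head="$1" -v tail="$2" '{
        n = split($0, f, /, /)                      # split on comma + space
        mid = ""
        for (i = head + 1; i <= n - tail; i++)      # re-join the pieces of the middle field
            mid = mid (i > head + 1 ? ", " : "") f[i]
        gsub(/"/, "\"\"", mid)                      # CSV-escape embedded double quotes
        out = ""
        for (i = 1; i <= head; i++) out = out f[i] ","
        out = out "\"" mid "\""
        for (i = n - tail + 1; i <= n; i++) out = out "," f[i]
        print out
    }'
}

printf '%s\n' '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
    fix_middle_field 2 4
```

For the sample line this produces the same output as the Perl version above.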

A single regex that matches each of the last three commas separately would require lookahead, which sed does not support.
You can use the following sed-regex to match the last three fields and the commas directly before them all at once:
,[^,]*,[^,]*,[^,]*$
$ matches the end of the line.
[^,] matches anything but ,.
Groups allow you to re-use the field values in sed:
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
For awk, have a look at How to print last two columns using awk.
"There's probably a smarter way to do this"
In case all your wanted commas are followed by a space and the unwanted commas are not, how about
sed 's/,\([^ ]\)/.\1/g'
This transforms a, b, 12,3, c into a, b, 12.3, c.

I guess this does the job:
echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'
Returns
d e f
But you need to ensure each line has at least 3 fields.

This will do what you're asking for with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
gsub(/\t/," ")
match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
gsub(/,/,"\t",a[1])
gsub(/,/,"\t",a[4])
print a[1] a[3] a[4]
}
$ awk -f tst.awk file
10 5 "Sally went to the store, and then , 299 ABD F 10
10 6 If this is the case, and also this happened, then, 299 A F 9
but I'm not convinced what you're asking for is a good approach so YMMV.
Anyway, note the first gsub() making sure you have no tabs on the input line - that is crucial if you want to convert some commas to tabs to use tabs as output field separators!

Related

I am in troubles with a regexp to remove some \n

I'm trying to define a regexp to remove some newlines in a file to be loaded into a DB.
Here is the fragment
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP
";"";Hamilton;"";;0;0;0;1;1;"";
This is the regexp I used in https://regex101.com/
(;"[[:alnum:] ]+)[\n]+([[:alnum:] ]*)"
Which should get two groups, one before and one after some newline.
regex101 reports that the groups are captured correctly.
But the result is wrong, because it still introduces an invisible newline.
I also tried sed, but the result is exactly the same.
So, the question is: Where am I wrong?
sed is line based. It's possible to achieve what you want, but I'd rather use a more suitable tool. For example, Perl:
perl -pe 's/\n/<>/e if tr/"// % 2 == 1' file.csv
-p reads the input line by line, running the code for each line before outputting it;
The /e option interprets the replacement in a substitution as code, in this case replacing the final newline with the following line (<> reads the input)
tr/"// in numeric context returns the number of matches, i.e. the number of double quotes;
If the number is odd, we remove the newline (% is the modulo operator).
The corresponding sed invocation would be
sed '/^\([^"]*"[^"]*"\)*[^"]*"[^"]*$/{N;s/\n//}' file.csv
On lines containing an unpaired double quote, it reads the next line into the pattern space (N) and removes the newline.
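The quote-counting idea also translates to awk. A minimal sketch, assuming quotes inside fields are never escaped; the function name join_open_quotes is hypothetical. Unlike the one-liners above, it keeps gluing lines together until the quotes balance, so it also copes with a field broken across more than two lines.

```shell
# Hypothetical helper: glue physical lines together until the double quotes balance.
# Usage: join_open_quotes < file.csv
join_open_quotes() {
    awk '{
        buf = buf $0                         # append without the newline that split the record
        tmp = buf
        if (gsub(/"/, "", tmp) % 2 == 0) {   # even number of quotes: the record is complete
            print buf
            buf = ""
        }
    }
    END { if (buf != "") print buf }'        # flush an unterminated record as-is
}

printf '%s\n' '1122;"BP JET WASH IP2 9RP' '";"";Hamilton;"";;0' | join_open_quotes
```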
Update:
perl -ne 'chomp $p if ! /^[0-9]+;/; print $p; $p = $_; END { print $p }' file.csv
This should remove the newlines if they're not followed by a number and a semicolon. It keeps the previous line in the variable $p; if the current line doesn't start with a number followed by a semicolon, the newline is chomped from the previous line. Then the previous line is printed and the current line is remembered. The last line needs to be printed separately, as there's no following line to trigger its printing.
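The same "a record starts with a number and a semicolon" rule can be sketched in awk by delaying each newline until the next record start is seen. The function name join_continuation_lines is hypothetical, and the sketch relies on the same assumption as the Perl version: no continuation line begins with digits followed by a semicolon.

```shell
# Hypothetical helper: a line that does not start with "<digits>;" continues the previous record.
# Usage: join_continuation_lines < file.csv
join_continuation_lines() {
    awk 'NR > 1 && /^[0-9]+;/ { printf "\n" }   # a new record starts: end the previous one
         { printf "%s", $0 }                    # print the line without its newline
         END { if (NR > 0) printf "\n" }'       # terminate the last record
}

printf '%s\n' '200;GBP;"BP JET WASH IP2 9RP' '";"";Hamilton' '201;EUR;"x"' |
    join_continuation_lines
```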
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}))' file.csv
will remove trailing newlines from every field in the CSV (with sep ;) and spit out correct CSV (with sep ,). If you want ; in the output too, use
perl -MText::CSV_XS=csv -wE'csv(in=>csv(in=>shift,sep=>";",on_in=>sub{s/\n+$// for@{$_[1]}}),sep=>";")' file.csv
It's usually best to use an existing parser rather than writing your own.
I'd use the following Perl program:
perl -MText::CSV_XS=csv -e'
   csv
      in             => *ARGV,
      sep            => ";",
      blank_is_undef => 1,
      quote_empty    => 1,
      on_in          => sub { s/\n//g for @{ $_[1] }; };
' old.csv >new.csv
Output:
200;GBP;;"";"";"";"";;;;"";"";"";"";;;"";1122;"BP JET WASH IP2 9RP";"";Hamilton;"";;0;0;0;1;1;"";
If for some reason you want to avoid XS, the slower Text::CSV is a drop-in replacement.

Bash / Regex: Replacing the second field in a CSV file when some of the first fields start with quotes and commas within those

This question is for code written in bash, but it is really more of a regex question. I have a file (ARyy.txt) with CSV values in it. I want to replace the second field with NaN. This is no problem at all for the simple cases (rows 1 and 2 in the example), but it's much more difficult for the few cases where the first field is quoted and has commas in it. These quotes are literally only there to indicate that there are commas within them (so quotes are there only if commas are there, and vice versa). Quotes are always the first and last characters of the first field if it contains commas.
Here is what I have thus far. NOTE: please try to answer using sed and the general format. There is a way to do this using awk for FPAT from what I know but I need one using sed ideally (or simple use case of awk).
#!/bin/bash
LN=1 #Line Number
while read -r LIN #LIN is a variable containing the line
do
echo "$LN: $LIN"
((LN++))
if [ $LN -eq 1 ]; then
continue #header line
elif [[ {$LIN:0:1} == "\"" ]]; then #if the first character in the line is quote
sed -i '${LN}s/\",/",NaN/' ARyy.txt #replace quote followed by comma with quote followed by comma followed by NaN
else #if first character doesn't start with a quote
sed -i '${LN}s/,[^,]*/,0/' ARyy.txt; fi
done < ARyy.txt
Other pertinent info:
There are never double or nested quotes or anything peculiar like this
There can be more than one comma inside the quotations
I am always replacing the second field
The second field is always just a number for the input (Never words or quotes)
Input Example:
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
Banana, 5, 10, 323
"Banana, green, 10 MG", 3, 14, 444 #Notice this line has commas in it but it has quotes to indicate this
Desired Output:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444 #second field changed to NaN and first field remains intact
Try this:
sed -E -i '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' ARyy.txt
Explanation: sed -E invokes "extended" regular expression syntax, so it's easier to use parenthesized groups.
2,$ = On lines 2 through the end of file...
s/ = Replace...
^ = the beginning of a line
("[^"]*"|[^",]*) = either a double-quoted string or a string that doesn't contain any double-quotes or commas
(, *) = a comma, maybe followed by some spaces
[0-9]* = a number
, = and finally a comma
/ = ...with...
\1 = the first () group (i.e. the original first field)
\2 = the second () group (i.e. comma and spaces)
NaN, = Not a number, and the following comma
/ = end of replacement
Note that if the first field could contain escaped double-quotes and/or escaped commas (not in double-quotes), the first pattern would have to be significantly more complex to deal with them.
BTW, the original has an antipattern I see disturbingly often: reading through a file line-by-line to decide what to do with that line, then running something that processes the entire file in order to change that one line. So if you have a thousand-line file, it winds up processing the entire file a thousand times (for a total of a million lines processed). This is what's known as "quadratic scaling", because it takes time proportional to the square of the problem size. As Bruce Dawson put it,
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
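To make the single-pass alternative concrete, here is a sketch that decides and rewrites line by line in one awk process, reading the file once and writing to standard output. The function name replace_second_field is hypothetical, and the sketch assumes the format described in the question (no escaped quotes in the first field).

```shell
# Hypothetical helper: set the second CSV field to NaN in one pass over the input.
# Usage: replace_second_field < ARyy.txt > ARyy.new
replace_second_field() {
    awk 'NR == 1 { print; next }                      # keep the header line untouched
         {
             # The first field is either a quoted string or a run without commas or quotes.
             if (match($0, /^("[^"]*"|[^",]*),/)) {
                 first = substr($0, 1, RLENGTH)       # first field including its trailing comma
                 rest  = substr($0, RLENGTH + 1)
                 sub(/^[^,]*/, " NaN", rest)          # overwrite the second field
                 print first rest
             } else {
                 print                                # no second field: pass the line through
             }
         }'
}

replace_second_field <<'EOF'
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
"Banana, green, 10 MG", 3, 14, 444
EOF
```

Each input line is touched exactly once, so the running time grows linearly with the file size.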
Given your specific format, in particular that the first field won't ever have any escaped double quotes in it:
sed -E '2,$ s/^("[^"]*"|[^,]*),[^,]*/\1,NaN/' < input.csv > output.csv
This does require the common but non-standard -E option to use POSIX Extended Regular Expression syntax instead of the default Basic (which doesn't support alternation).
One (somewhat verbose) awk idea that replaces the entire set of code posted in the question:
awk -F'"' -v OFS=',' '                # input field separator = double quote, output separator = comma
function print_line() {               # print array
    pfx=""
    for (i=1; i<=4; i++) {
        printf "%s%s", pfx, arr[i]
        pfx=OFS
    }
    printf "\n"
}
FNR==1 { print ; next }               # header record
NF==1  { split($0,arr,",")            # no double quotes => split line on comma
         arr[2]=" NaN"                # override arr[2] with " NaN"
       }
NF>=2  { split($3,arr,",")            # first column in the file contains double quotes
                                      # so split awk field #3 on comma; arr[1] will
                                      # be empty
         arr[1]="\"" $2 "\""          # override arr[1] with awk field #2 (the double
                                      # quoted first column from the file)
         arr[2]=" NaN"                # override arr[2] with " NaN"
       }
{ print_line() }                      # print our array
' ARyy.txt
For the sample input file this generates:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444
LN=1
while read -r LIN; do
    if [ "$LN" -eq 1 ]; then
        ((LN++))
        continue
    elif [[ $LIN == $(echo "$LIN" | grep '"') ]]; then
        word1=$(echo "$LIN" | awk -F ',' '{print $4}')
        echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
    elif [[ $LIN == $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]') ]]; then
        word2=$(echo "$LIN" | cut -f2 -d ',')
        echo "$LIN" | sed -i "$LN"s/"$word2"/\ NaN/ ARyy2.txt
    fi
    echo "$LN: $LIN"
    ((LN++))
done <ARyy.txt
First make a copy of the input ARyy.txt named ARyy2.txt and use that copy as the output (the loop reads from ARyy.txt and writes to ARyy2.txt).
The first elif, $(echo "$LIN" | grep '"'), checks whether the line contains a double quote.
Once such a line is selected, awk -F ',' '{print $4}' grabs the number 3 and saves it in the variable word1. -F tells awk to split columns at every , so the sample line has 6 columns in total and the number 3 is in column 4; that's why {print $4} is used.
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
sed then addresses the line by its number with $LN and replaces the number held in $word1 with NaN. NaN needs a leading space, so the space is escaped with a backslash: /\ NaN/.
echo "$LIN" is used each time to work on the current line.
The second elif, $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]'), matches one line at a time.
The important part is to check whether the line has the pattern word + comma + space + one digit.
Once such a line is selected, cut -f2 -d ',' grabs the number in the second column (10 for the Apple line) and saves it in the variable word2. -f2 selects the second column, and -d tells cut to use , as the column separator.

awk: Use gensub to substitute multiple lines from a paragraph record

I have an input file with multiple paragraphs separated by at least two newlines (\n\n), and I'm wanting to extract fields from lines within certain paragraphs. I think the processing will be simplest if I can get gensub to work as I'm hoping. Considering the following input file:
[Record R1]
    Var1=0
    Var2=20
    Var3=5

[Record R2]
    Var1=10
    Var3=9
    Var4=/var/tmp/
    Var2=12

[Record R3]
    Var1=2
    Var3=5
    Var5=19
I want to print only the value of Var2 from records R1 and R3 (in R3, Var2 doesn't actually exist). I can easily group all of the variables into their corresponding record by setting RS="\n\n"; then they are all contained within $0. But since I don't know where it will appear in the list ahead of time, I want to use something like gensub to extract it. This is what I have going:
awk '
BEGIN {
RS="\n\n"
}
/Record R1/ || /Record R3/ {
print gensub(/[\n.]*Var2=(.*)[\n.]*/, "\\1", "g", $0)
}
' /tmp/input.txt
But instead of only printing 20 (the value of Var2 from R1), it prints the following:
[Record R1]
    Var1=0
    20
    Var3=5
[Record R3]
    Var1=2
    Var3=5
    Var5=19
The intent is that the regex in the gensub command would capture all characters (newlines: \n; and non-newlines: .) before and after Var2=XX and replace everything with XX. But instead, it's only capturing the characters on the same line as Var2=XX. Can awk's gensub do this kind of multi-line substitution?
I know an alternative would be to loop over all the fields in the record, then split the field that matches Var2= on the = sign, but that feels less efficient as I scale this out to multiple variables.
I don't understand what it is you're trying to do with gensub() but to do what you seem to be trying to do in any awk is:
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[12]$/) print f["Var2"]; delete f}' file
20
12
awk -F'[][[:space:]=]+' '{f[$2]=$3} !NF{if (f["Record"]~/^R[13]$/) print f["Var2"]; delete f}' file
20
gensub() doesn't care if the string it's operating on is one line or many lines btw - \n is just one more character, no different from any other character.
Oh, hang on, now I see what you're thinking with that gensub() - your problems are:
[\n.]* means zero or more newlines or periods, but you don't have any periods in your input, so it's the same as \n*, and you don't have any newlines immediately before a Var2.
Var2 doesn't exist in your R3 record, so the regexp can't match there.
The (.*) will match everything to the end of the record (leftmost longest matches).
The "g" is misleading since you only expect 1 match.
So using gensub() on multi-line text isn't an issue; your regexp is just wrong.
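If the goal is just "value of one variable from selected records", a sketch that avoids the regexp pitfalls listed above is to read whole paragraphs (RS set to the empty string) and test each line for the exact NAME= prefix. The function name print_var and its two parameters are hypothetical; leading blanks on the variable lines are stripped before comparing.

```shell
# Hypothetical helper: print the value of VAR from every record whose text matches RECORD_REGEX.
# Usage: print_var VAR RECORD_REGEX < file
print_var() {
    awk -v RS= -v var="$1" -v rec="$2" '
        $0 ~ rec {
            n = split($0, line, "\n")                  # one array element per physical line
            for (i = 1; i <= n; i++) {
                sub(/^[ \t]+/, "", line[i])            # drop indentation
                if (index(line[i], var "=") == 1)      # line starts with "VAR="
                    print substr(line[i], length(var) + 2)
            }
        }'
}

print_var Var2 'Record R[13]' <<'EOF'
[Record R1]
    Var1=0
    Var2=20
    Var3=5

[Record R2]
    Var1=10
    Var2=12

[Record R3]
    Var1=2
    Var5=19
EOF
```

Records that lack the variable simply print nothing, and scaling to several variables only means calling the function again with another name.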
another awk
$ awk -v RS= '/\[Record R[13]\]/{for(i=2;i<=NF;i++)
{v=sub(/ *Var2=/,"",$i);
if(v) print $i}}' file
20

SED: addressing two lines before match

Print the line that is located 2 lines before the match (pattern).
I tried the following:
sed -n ': loop
/.*/h
:x
{n;n;/cen/p;}
s/./c/p
t x
s/n/c/p
t loop
{g;p;}
' datafile
The script:
sed -n "1N;2N;/XXX[^\n]*$/P;N;D"
works as follows:
Read the first three lines into the pattern space, 1N;2N
Search for the test string XXX anywhere in the last line, and if found print the first line of the pattern space, P
Append the next line input to pattern space, N
Delete first line from pattern space and restart cycle without any new read, D, noting that 1N;2N is no longer applicable
This might work for you (GNU sed):
sed -n ':a;$!{N;s/\n/&/2;Ta};/^PATTERN\'\''/MP;$!D' file
This will print the line 2 lines before the PATTERN throughout the file.
This one with grep, a bit simpler solution and easy to read [However need to use one pipe]:
grep -B2 'pattern' file_name | sed -n '1,2p'
If you can use awk try this:
awk '/pattern/ {print b} {b=a;a=$0}' file
This prints the line that is two lines before each line matching the pattern.
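The two-variable shuffle above generalizes to "N lines before the match" with a small ring buffer. This is a sketch, not a drop-in tool: the function name line_before_match and its argument order are assumptions, and the pattern is treated as an awk regular expression.

```shell
# Hypothetical helper: for every line matching PATTERN, print the line N lines earlier.
# Usage: line_before_match N PATTERN < file
line_before_match() {
    awk -v n="$1" -v pat="$2" '
        $0 ~ pat && NR > n { print buf[NR % n] }   # slot NR % n still holds line NR - n
        { buf[NR % n] = $0 }                       # overwrite the oldest slot with the current line
    '
}

printf '%s\n' one two three four five six seven | line_before_match 2 six   # prints: four
```

Only N lines are kept in memory at any time, so this stays cheap on large files.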
I've tested your sed command but the result is strange (and obviously wrong), and you didn't give any explanation. You will have to save three lines in a buffer (named hold space), do a pattern search with the newest line and print the oldest one if it matches:
sed -n '
## At the beginning read three lines.
1 { N; N }
## Append them to "hold space". In following iterations it will append
## only one line.
H
## Get content of "hold space" to "pattern space" and check if the
## pattern matches. If so, extract content of first line (until a
## newline) and exit.
g
/^.*\nsix$/ {
s/^\n//
P
q
}
## Remove the old of the three lines saved and append the new one.
s/^\n[^\n]*//
h
' infile
Assuming an input file (infile) with the following content:
one
two
three
four
five
six
seven
eight
nine
ten
It will search for six and yield as output:
four
Here are some other variants:
awk '{a[NR]=$0} /pattern/ {f=NR} END {print a[f-2]}' file
This stores all lines in an array a. When the pattern is found, it stores the line number in f.
At the end it prints the line two before that line number.
PS may be slow with large files
Here is another one:
awk 'FNR==NR && /pattern/ {f=NR;next} f-2==FNR' file{,}
This reads the file twice (file{,} is the same as file file)
In the first pass it finds the pattern and stores the line number in the variable f.
In the second pass it prints the line two before the line number stored in f.

Cut and copy-paste given positions of the text

My dummy text file (one continuous line) looks like this:
AAChvhkfiAFAjjfkqAPPMB
I want to:
Delete part of the text (specific range);
Copy-Paste (specific range of characters) within the file.
How I am doing this:
To cut part of the text at wanted positions (from 5 to 7 characters & from 10 to 14 characters) I use cut
echo 'AAChvhkfiAFAjjfkqAPPMB' | cut --complement -c 5-7,10-14
AAChfifkqAPPMB
But I really don't know how to copy-paste text. For example: to copy text from 15 to 18 characters and paste it after character 1 (also using previous cut command). To get the final result like this:
fkqAAAChfifkqAPPMB
So I have two questions:
How to read text (from .. to) given range using perl, awk or sed & paste this text at specific position.
How to combine this text pasting with the previous cut command as after cutting text will move to the left side, hence wrong text will be copied.
Maybe something like this:
$ echo AAChvhkfiAFAjjfkqAPPMB | awk '{ print(substr($1, 1, 14) substr($1, 18) substr($1, 15, 3)) }'
AAChvhkfiAFAjjAPPMBfkq
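For the exact ranges in the question, one way to sidestep the shifting-offset problem is to read every range from the original string before assembling the result. A minimal sketch, assuming single-byte characters and one string per line; the function name cut_then_paste is hypothetical and the positions are hard-coded to the example (delete 5-7 and 10-14, copy 15-18 to the front).

```shell
# Hypothetical helper for the example in the question.
# All offsets refer to the ORIGINAL string, so nothing has shifted yet when they are read.
cut_then_paste() {
    awk '{
        s = $0
        moved = substr(s, 15, 4)                                # characters 15-18
        kept  = substr(s, 1, 4) substr(s, 8, 2) substr(s, 15)   # everything except 5-7 and 10-14
        print moved kept
    }'
}

echo 'AAChvhkfiAFAjjfkqAPPMB' | cut_then_paste   # prints: fkqAAAChfifkqAPPMB
```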
In Perl I think substr would be a good candidate, try eg.
$a = '1234567890';
#from pos 2, replace 3 chars with nothing, return the 3 chars
$b=substr($a,2,3,'');
print "$a\t$b\n"; #1267890 345
#in position 0 (first), replace 0 characters (i.e. pure insert)
#with the content of $b
substr($a,0,0,$b);
print "$a\t$b\n"; #3451267890 345
See http://perldoc.perl.org/functions/substr.html for more details.
splice() may be a candidate as well.
In Perl, you can use an array slice, after splitting the string into an array:
my $string = "AAChvhkfiAFAjjfkqAPPMB1";
my @arr = split //, $string;
and slicing (print elements 5 to 7 and 10 to 14; indices are zero-based):
print @arr[5..7,10..14];
you can use splice() too to re-arrange the array.
perldoc says:
Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any.
See http://perldoc.perl.org/perldata.html#Slices
quite straightforward with awk:
kent$ echo "AAChvhkfiAFAjjfkqAPPMB"|awk '
{for(i=5;i<=7;i++)$i="";
for(i=10;i<=14;i++)$i="";
for(i=15;i<=18;i++)t=sprintf("%s%s",t,$i);
$0=t""$0}1' OFS="" FS=""
fkqAAAChfifkqAPPMB
edit
to reverse the part of text, you just need to swap t and $i:
kent$ echo "AAChvhkfiAFAjjfkqAPPMB"|awk '
{for(i=5;i<=7;i++)$i="";
for(i=10;i<=14;i++)$i="";
for(i=15;i<=18;i++)t=sprintf("%s%s",$i,t);
$0=t""$0}1' OFS="" FS=""
AqkfAAChfifkqAPPMB